November 20, 2024

From PyTorch to Production: Deploying ML Models on Robots

machine-learning, robotics, pytorch, tensorrt

Training a model that achieves great accuracy on your test set is only half the battle. The real challenge begins when you need that model running at 30+ FPS on a robot with limited compute, unreliable lighting, and zero tolerance for latency. This post covers the full journey of taking a YOLOv5 object detection model from a Jupyter notebook to a production robotics pipeline — the bottlenecks we hit, the optimizations that worked, and the ones that didn't.

Context: The Competition

Our team was building an autonomous robot for a university competition. The robot needed to navigate an arena, identify specific objects (colored blocks, markers, obstacles), pick them up, and place them in designated zones. Object detection was the eyes of the entire system — if it was slow or inaccurate, every downstream system (path planning, grasping, navigation) would suffer.

The constraints were brutal:

  • Hardware: NVIDIA Jetson Xavier NX — 6 CPU cores, 384 CUDA cores, 8GB shared memory
  • Latency budget: Under 50ms per frame for the full perception pipeline (capture → detect → publish)
  • Accuracy requirement: 90%+ mAP on our custom object classes
  • Runtime: 15-minute competition runs with no restarts — the system had to be rock solid

The Gap Between Research and Deployment

We started where most ML projects start: a Jupyter notebook. Data collection, labeling (800+ images using Roboflow), training with Ultralytics YOLOv5, and evaluation. The results were encouraging — 92% mAP on our test set, with clean bounding boxes even on partially occluded objects.

Then we deployed it on the Jetson.

8 FPS.

Not even close to real-time. The robot was essentially blind for 125ms between each frame — an eternity when you're trying to avoid obstacles or line up a grasp. We needed at least 30 FPS to give the planning system enough visual feedback to operate smoothly.

Diagnosing the Bottleneck

The first instinct was "the GPU is too slow." But profiling told a different story. We used nsys (NVIDIA Nsight Systems) to profile the entire pipeline:

nsys profile --stats=true python3 detect.py --source camera.mp4

The results were surprising:

| Stage | Time (ms) | % of Total |
|-------|-----------|------------|
| Frame capture | 5 | 4% |
| Preprocessing (Python/NumPy) | 28 | 22% |
| Model inference (GPU) | 65 | 52% |
| Postprocessing (NMS, Python) | 18 | 14% |
| Drawing + publishing | 9 | 7% |

The GPU inference was slow, yes — but preprocessing in Python was eating 22% of our budget. Pure Python loops over NumPy arrays for resizing, padding, and normalization were killing us. And the GIL meant we couldn't even parallelize preprocessing and postprocessing effectively.

The Optimization Pipeline

We attacked the problem from multiple angles, in order of impact.

1. Model Export: PyTorch → ONNX → TensorRT

The single biggest improvement came from converting the model from PyTorch to TensorRT. PyTorch is fantastic for research and training, but it carries a lot of overhead for inference — dynamic graph construction, Python interpreter overhead, and no hardware-specific kernel optimization.

The conversion pipeline was:

import torch
from ultralytics import YOLO

# Step 1: Load the trained model
model = YOLO('runs/train/best.pt')

# Step 2: Export to ONNX (intermediate representation)
model.export(format='onnx', imgsz=640, half=False, simplify=True)

Step 3 is a shell command, run on the target Jetson itself so TensorRT can optimize for that specific device:

trtexec --onnx=best.onnx \
        --saveEngine=best.engine \
        --fp16 \
        --workspace=2048 \
        --minShapes=images:1x3x640x640 \
        --optShapes=images:1x3x640x640 \
        --maxShapes=images:1x3x640x640

The key flags:

  • --fp16: Use half-precision floating point. This roughly halves memory bandwidth requirements and can double throughput on Tensor Cores, typically with less than 0.5% mAP loss.
  • --workspace=2048: Gives TensorRT 2GB to explore optimization strategies (layer fusion, kernel selection)
  • Fixed shapes: Since our input size was always 640x640, we could skip dynamic shape overhead

Result: Inference dropped from 65ms to 22ms — a 3x speedup with only 0.3% mAP loss (91.7% vs 92.0%).

2. Ditching Python: The C++ Inference Pipeline

With inference now at 22ms, the Python preprocessing (28ms) was our new bottleneck. The irony wasn't lost on us — the "supporting" code was slower than the actual neural network.

We rewrote the entire pipeline in C++ using the TensorRT C++ API. This eliminated:

  • Python interpreter overhead
  • NumPy array copying and memory allocation
  • GIL contention
  • The overhead of a Python → C++ bridge (pybind11/ctypes)

Here's the core inference loop:

#include <fstream>
#include <string>
#include <vector>

#include <NvInfer.h>
#include <cuda_runtime.h>
#include <opencv2/opencv.hpp>
#include <opencv2/cudawarping.hpp>

class Detector {
public:
    struct Detection {
        cv::Rect bbox;
        float confidence;
        int classId;
    };
 
    Detector(const std::string& enginePath) {
        // Deserialize the TensorRT engine
        std::ifstream file(enginePath, std::ios::binary);
        std::vector<char> data(std::istreambuf_iterator<char>(file), {});
 
        runtime_ = nvinfer1::createInferRuntime(logger_);
        engine_ = runtime_->deserializeCudaEngine(data.data(), data.size());
        context_ = engine_->createExecutionContext();
 
        // Pre-allocate the GPU buffers and CUDA stream once; reuse them every frame
        cudaMalloc(&inputBuffer_, INPUT_SIZE * sizeof(float));
        cudaMalloc(&outputBuffer_, OUTPUT_SIZE * sizeof(float));
        cudaStreamCreate(&stream_);
    }
 
    std::vector<Detection> detect(const cv::Mat& frame) {
        // Preprocess: resize + normalize on GPU using cv::cuda
        cv::cuda::GpuMat gpuFrame;
        gpuFrame.upload(frame);
 
        cv::cuda::GpuMat resized;
        cv::cuda::resize(gpuFrame, resized, cv::Size(640, 640));
 
        // Convert to float, normalize to [0, 1], CHW format
        preprocessGPU(resized, inputBuffer_);
 
        // Run inference
        void* bindings[] = {inputBuffer_, outputBuffer_};
        context_->enqueueV2(bindings, stream_, nullptr);
        cudaStreamSynchronize(stream_);
 
        // Postprocess: NMS
        std::vector<float> output(OUTPUT_SIZE);
        cudaMemcpy(output.data(), outputBuffer_,
                   OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost);
 
        return nonMaxSuppression(output, 0.45f, 0.5f);
    }
 
private:
    // INPUT_SIZE / OUTPUT_SIZE are the flattened element counts of the engine's
    // input and output tensors, defined to match the exported model
    nvinfer1::IRuntime* runtime_;
    nvinfer1::ICudaEngine* engine_;
    nvinfer1::IExecutionContext* context_;
    void* inputBuffer_;
    void* outputBuffer_;
    cudaStream_t stream_;
    Logger logger_;
};
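
Two helpers are elided above: preprocessGPU and nonMaxSuppression. The sketches below show one way they could look; the RGB plane order, class count, and output layout are assumptions based on a standard YOLOv5 export rather than our exact competition code.

#include <opencv2/cudaarithm.hpp>  // cv::cuda::split
#include <opencv2/dnn.hpp>         // cv::dnn::NMSBoxes

// Sketch of preprocessGPU: turn the resized 640x640 8-bit frame into a planar,
// normalized float tensor written directly into the TensorRT input buffer.
void preprocessGPU(const cv::cuda::GpuMat& resized, void* inputBuffer) {
    // 8-bit BGR -> 32-bit float in [0, 1]
    cv::cuda::GpuMat floatImg;
    resized.convertTo(floatImg, CV_32FC3, 1.0 / 255.0);

    // Wrap the three CHW planes of the input buffer as GpuMats and split into them.
    // The plane order assumes the engine expects RGB while OpenCV frames are BGR.
    const int planeSize = 640 * 640;
    float* input = static_cast<float*>(inputBuffer);
    std::vector<cv::cuda::GpuMat> planes{
        cv::cuda::GpuMat(640, 640, CV_32F, input + 2 * planeSize),  // B -> last plane
        cv::cuda::GpuMat(640, 640, CV_32F, input + planeSize),      // G
        cv::cuda::GpuMat(640, 640, CV_32F, input)                   // R -> first plane
    };
    cv::cuda::split(floatImg, planes);
}

// Sketch of nonMaxSuppression for the standard YOLOv5 output layout:
// one row per candidate box, [cx, cy, w, h, objectness, per-class scores...].
std::vector<Detector::Detection> nonMaxSuppression(const std::vector<float>& output,
                                                   float confThreshold,
                                                   float nmsThreshold) {
    const int numClasses = 4;           // placeholder: set to your model's class count
    const int stride = 5 + numClasses;  // floats per candidate row
    const int numCandidates = static_cast<int>(output.size()) / stride;

    std::vector<cv::Rect> boxes;
    std::vector<float> scores;
    std::vector<int> classIds;

    for (int i = 0; i < numCandidates; ++i) {
        const float* row = output.data() + i * stride;
        if (row[4] < confThreshold) continue;  // objectness gate

        // Pick the best-scoring class for this candidate
        int bestClass = 0;
        float bestScore = 0.0f;
        for (int c = 0; c < numClasses; ++c) {
            if (row[5 + c] > bestScore) { bestScore = row[5 + c]; bestClass = c; }
        }
        const float confidence = row[4] * bestScore;
        if (confidence < confThreshold) continue;

        // Convert center/width/height to a top-left rectangle
        boxes.emplace_back(static_cast<int>(row[0] - row[2] / 2),
                           static_cast<int>(row[1] - row[3] / 2),
                           static_cast<int>(row[2]), static_cast<int>(row[3]));
        scores.push_back(confidence);
        classIds.push_back(bestClass);
    }

    // OpenCV's built-in NMS handles the overlap suppression
    std::vector<int> keep;
    cv::dnn::NMSBoxes(boxes, scores, confThreshold, nmsThreshold, keep);

    std::vector<Detector::Detection> detections;
    for (int idx : keep) {
        detections.push_back({boxes[idx], scores[idx], classIds[idx]});
    }
    return detections;
}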

Key optimizations in the C++ version:

  • GPU-side preprocessing: Using cv::cuda to resize and normalize directly on the GPU, so the only host-to-device transfer per frame is the raw 8-bit image rather than the full preprocessed float tensor
  • Pre-allocated buffers: cudaMalloc once at startup, reuse for every frame. No allocation during the hot loop.
  • Zero-copy where possible: The camera driver wrote frames directly to GPU-accessible memory (a sketch of the idea follows below)

Result: Preprocessing dropped from 28ms to 3ms. Total pipeline: 28ms per frame (35+ FPS).
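
The zero-copy point relies on the Jetson's shared physical memory: pinned host memory can be mapped straight into the GPU's address space. A minimal, illustrative sketch, not our exact camera-driver integration:

#include <cuda_runtime.h>

// Illustrative zero-copy buffer on Jetson: CPU and GPU share DRAM, so a pinned,
// mapped host allocation is visible to both without an explicit cudaMemcpy.
const size_t frameBytes = 1280 * 720 * 3;  // example raw frame size
void* hostPtr = nullptr;
void* devicePtr = nullptr;
cudaHostAlloc(&hostPtr, frameBytes, cudaHostAllocMapped);
cudaHostGetDevicePointer(&devicePtr, hostPtr, 0);
// The camera driver writes into hostPtr; CUDA kernels read the same bytes via devicePtr.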

3. Multithreaded Pipeline Architecture

Even at 28ms, we were leaving performance on the table. Each stage of the pipeline was running sequentially — capture, preprocess, infer, postprocess — but these stages could be overlapped.

We implemented a three-stage pipeline using C++ threads and lock-free queues:

Thread 1 (Capture):     [Frame N] [Frame N+1] [Frame N+2] ...
Thread 2 (Inference):          [Frame N] [Frame N+1] [Frame N+2] ...
Thread 3 (Postprocess):               [Frame N] [Frame N+1] ...

While Thread 2 is running inference on Frame N, Thread 1 is already capturing Frame N+1. This pipeline parallelism effectively hid the capture latency entirely.

#include <atomic>
#include <thread>
#include <utility>

class PipelinedDetector {
public:
    void start() {
        captureThread_ = std::thread(&PipelinedDetector::captureLoop, this);
        inferThread_ = std::thread(&PipelinedDetector::inferLoop, this);
        postThread_ = std::thread(&PipelinedDetector::postprocessLoop, this);
    }
 
private:
    void captureLoop() {
        while (running_) {
            cv::Mat frame;
            camera_.read(frame);
            captureQueue_.push(std::move(frame));
        }
    }
 
    void inferLoop() {
        while (running_) {
            auto frame = captureQueue_.pop();  // Blocks until available
            auto rawOutput = detector_.infer(frame);
            inferQueue_.push({std::move(frame), std::move(rawOutput)});
        }
    }
 
    void postprocessLoop() {
        while (running_) {
            auto [frame, rawOutput] = inferQueue_.pop();
            auto detections = nonMaxSuppression(rawOutput);
            publishDetections(detections);  // Send to ROS
        }
    }
 
    ThreadSafeQueue<cv::Mat> captureQueue_{3};  // Buffer up to 3 frames
    ThreadSafeQueue<InferResult> inferQueue_{2};
    Detector detector_;
    cv::VideoCapture camera_;
    std::atomic<bool> running_{true};
    std::thread captureThread_, inferThread_, postThread_;
};

The bounded queue sizes (3 and 2) were important — we didn't want to buffer too many frames because that would increase latency. If the capture thread was producing frames faster than inference could consume them, old frames would be dropped. For robotics, a fresh-but-approximate result is always better than a perfect-but-stale one.
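
ThreadSafeQueue itself isn't shown above. Here is a minimal mutex-based stand-in with the bounded, drop-oldest behavior just described; a production version might use a genuinely lock-free queue, but the semantics are the same:

#include <condition_variable>
#include <deque>
#include <mutex>

// Bounded queue: push drops the oldest item when full, pop blocks until an item arrives.
template <typename T>
class ThreadSafeQueue {
public:
    explicit ThreadSafeQueue(size_t capacity) : capacity_(capacity) {}

    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (queue_.size() >= capacity_) {
                queue_.pop_front();  // Drop the stalest frame rather than stall the producer
            }
            queue_.push_back(std::move(item));
        }
        cv_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        T item = std::move(queue_.front());
        queue_.pop_front();
        return item;
    }

private:
    std::deque<T> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
    size_t capacity_;
};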

4. ROS Integration

The final piece was publishing detection results into the ROS ecosystem. We created a custom ROS node that exposed detections as vision_msgs::msg::Detection2DArray messages:

#include <memory>

#include <rclcpp/rclcpp.hpp>
#include <vision_msgs/msg/detection2_d_array.hpp>

class DetectionNode : public rclcpp::Node {
public:
    DetectionNode() : Node("yolo_detector") {
        publisher_ = this->create_publisher<vision_msgs::msg::Detection2DArray>(
            "/detections", 10);
 
        detector_ = std::make_unique<PipelinedDetector>();
        // setCallback and toRosMessage are thin glue helpers (not shown here)
        detector_->setCallback([this](const auto& detections) {
            auto msg = toRosMessage(detections);
            publisher_->publish(msg);
        });
        detector_->start();
    }

private:
    rclcpp::Publisher<vision_msgs::msg::Detection2DArray>::SharedPtr publisher_;
    std::unique_ptr<PipelinedDetector> detector_;
};
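
Spinning the node up is standard rclcpp boilerplate:

int main(int argc, char** argv) {
    rclcpp::init(argc, argv);
    rclcpp::spin(std::make_shared<DetectionNode>());
    rclcpp::shutdown();
    return 0;
}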

Other ROS nodes subscribed to /detections and used the bounding box data for:

  • Navigation: Avoiding detected obstacles
  • Planning: Identifying target objects and computing approach vectors
  • Grasping: Centering the gripper on the detected object's bounding box centroid

Because ROS uses a publish-subscribe model, adding new consumers was trivial — we could pipe detections into a visualization tool (RViz), a logging node, or a scoring system without modifying the detector.
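
As an example, a new consumer can be as small as the node below, which just logs how many detections arrive per frame (names are illustrative, not from the competition codebase):

#include <rclcpp/rclcpp.hpp>
#include <vision_msgs/msg/detection2_d_array.hpp>

// Illustrative consumer: subscribes to /detections and logs the count per frame.
class DetectionLogger : public rclcpp::Node {
public:
    DetectionLogger() : Node("detection_logger") {
        subscription_ = this->create_subscription<vision_msgs::msg::Detection2DArray>(
            "/detections", 10,
            [this](const vision_msgs::msg::Detection2DArray::SharedPtr msg) {
                RCLCPP_INFO(this->get_logger(), "Received %zu detections",
                            msg->detections.size());
            });
    }

private:
    rclcpp::Subscription<vision_msgs::msg::Detection2DArray>::SharedPtr subscription_;
};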

Results: The Full Picture

Here's the performance comparison across our optimization journey:

| Version | FPS | Latency | mAP |
|---------|-----|---------|-----|
| PyTorch (Python) | 8 | 125ms | 92.0% |
| TensorRT FP16 (Python) | 18 | 55ms | 91.7% |
| TensorRT FP16 (C++) | 35 | 28ms | 91.7% |
| Pipelined C++ | 38 | 26ms* | 91.7% |

*Pipeline latency measured as time from frame capture to detection publish.

The jump from 8 FPS to 38 FPS — a 4.75x improvement — directly translated to better competition performance. Our robot could react to objects and obstacles in real-time, leading to a 25% improvement in scored runs compared to our preliminary rounds (which used the Python pipeline).

Lessons Learned

Profile Before You Optimize

This is the most important lesson. Our initial assumption was "the GPU is the bottleneck," which would have led us to spend weeks trying to prune or distill the model. Profiling revealed that Python preprocessing was nearly as costly as inference. Measure first, optimize second, always.

FP16 Is Usually Good Enough

We tested FP16, INT8, and FP32 inference:

  • FP32 → FP16: 0.3% mAP loss, 2x speedup
  • FP16 → INT8: 1.8% mAP loss, 1.3x speedup

For our use case, FP16 was the clear sweet spot. INT8 quantization introduced noticeable accuracy degradation on small or partially occluded objects — exactly the hard cases that mattered most in competition.

End-to-End Latency > Throughput

In robotics, the question isn't "how many frames can you process per second?" — it's "how old is the data your planning system is acting on?" A system that processes 60 FPS but buffers 10 frames has 167ms of visual latency. Our pipelined system at 38 FPS with a 2-frame buffer had only 26ms of latency. The robot felt dramatically more responsive.
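
A cheap way to keep this honest is to timestamp every frame at capture and measure its age just before the detections are published. A minimal sketch, not tied to our exact pipeline types:

#include <chrono>
#include <cstdio>

#include <opencv2/core.hpp>

using Clock = std::chrono::steady_clock;

// Carry the capture timestamp alongside the frame through the pipeline...
struct StampedFrame {
    cv::Mat image;
    Clock::time_point capturedAt;
};

// ...and report the frame's age right before its detections are published.
void reportFrameAge(const StampedFrame& frame) {
    const auto age = std::chrono::duration_cast<std::chrono::milliseconds>(
        Clock::now() - frame.capturedAt);
    std::printf("frame age at publish: %lld ms\n",
                static_cast<long long>(age.count()));
}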

The "Last Mile" Is the Hardest

Getting from a working Jupyter notebook to a production C++ pipeline took 3x longer than the initial model training. This is a universal truth in ML engineering that doesn't get talked about enough. Budget accordingly.

Don't Over-Optimize in Isolation

We initially spent a week trying to squeeze more FPS out of the inference engine alone (model pruning, layer fusion hints, custom CUDA kernels). The returns were diminishing — maybe 2-3 FPS. Then we rewrote preprocessing in C++ and gained 12 FPS in two days. Optimize the whole pipeline, not just the part that feels most "ML."

What I'd Do Differently

If I were starting this project today, I'd consider a few alternatives:

  1. YOLOv8 or YOLOv9: Newer YOLO variants have better accuracy-speed tradeoffs out of the box, and Ultralytics has dramatically improved their TensorRT export pipeline.

  2. NVIDIA DeepStream: Instead of building a custom C++ pipeline, DeepStream provides a GPU-accelerated video analytics framework that handles capture, inference, and postprocessing with optimized plugins. We didn't use it because the learning curve was steep, but it would have saved us weeks.

  3. Training with TensorRT in mind: Some architectural choices (like certain activation functions) don't have optimized TensorRT kernels. Knowing your deployment target during model selection can avoid conversion headaches.

Conclusion

Taking ML models from research to production on resource-constrained hardware is a distinct engineering discipline. It requires profiling skills, systems programming knowledge, and a willingness to step outside the comfortable Python ecosystem. But the payoff is real — in our case, it was the difference between a robot that stumbled and one that competed.

The full pipeline is open-source — check out the project on GitHub.