From PyTorch to Production: Deploying ML Models on Robots
Training a model that achieves great accuracy on your test set is only half the battle. The real challenge begins when you need that model running at 30+ FPS on a robot with limited compute, unreliable lighting, and zero tolerance for latency. This post covers the full journey of taking a YOLOv5 object detection model from a Jupyter notebook to a production robotics pipeline — the bottlenecks we hit, the optimizations that worked, and the ones that didn't.
Context: The Competition
Our team was building an autonomous robot for a university competition. The robot needed to navigate an arena, identify specific objects (colored blocks, markers, obstacles), pick them up, and place them in designated zones. Object detection was the eyes of the entire system — if it was slow or inaccurate, every downstream system (path planning, grasping, navigation) would suffer.
The constraints were brutal:
- Hardware: NVIDIA Jetson Xavier NX — 6 CPU cores, 384 CUDA cores, 8GB shared memory
- Latency budget: Under 50ms per frame for the full perception pipeline (capture → detect → publish)
- Accuracy requirement: 90%+ mAP on our custom object classes
- Runtime: 15-minute competition runs with no restarts — the system had to be rock solid
The Gap Between Research and Deployment
We started where most ML projects start: a Jupyter notebook. Data collection, labeling (800+ images using Roboflow), training with Ultralytics YOLOv5, and evaluation. The results were encouraging — 92% mAP on our test set, with clean bounding boxes even on partially occluded objects.
Then we deployed it on the Jetson.
8 FPS.
Not even close to real-time. The robot was essentially blind for 125ms between each frame — an eternity when you're trying to avoid obstacles or line up a grasp. We needed at least 30 FPS to give the planning system enough visual feedback to operate smoothly.
Diagnosing the Bottleneck
The first instinct was "the GPU is too slow." But profiling told a different story. We used nsys (NVIDIA Nsight Systems) to profile the entire pipeline:
nsys profile --stats=true python3 detect.py --source camera.mp4
The results were surprising:
| Stage | Time (ms) | % of Total |
|-------|-----------|------------|
| Frame capture | 5 | 4% |
| Preprocessing (Python/NumPy) | 28 | 22% |
| Model inference (GPU) | 65 | 52% |
| Postprocessing (NMS, Python) | 18 | 14% |
| Drawing + publishing | 9 | 7% |
The GPU inference was slow, yes — but preprocessing in Python was eating 22% of our budget. Pure Python loops over NumPy arrays for resizing, padding, and normalization were killing us. And the GIL meant we couldn't even parallelize preprocessing and postprocessing effectively.
The Optimization Pipeline
We attacked the problem from multiple angles, in order of impact.
1. Model Export: PyTorch → ONNX → TensorRT
The single biggest improvement came from converting the model from PyTorch to TensorRT. PyTorch is fantastic for research and training, but it carries a lot of overhead for inference — dynamic graph construction, Python interpreter overhead, and no hardware-specific kernel optimization.
The conversion pipeline was:
import torch
from ultralytics import YOLO
# Step 1: Load the trained model
model = YOLO('runs/train/best.pt')
# Step 2: Export to ONNX (intermediate representation)
model.export(format='onnx', imgsz=640, half=False, simplify=True)
# Step 3: Convert ONNX to TensorRT engine
# This runs on the target Jetson hardware for device-specific optimization
trtexec --onnx=best.onnx \
--saveEngine=best.engine \
--fp16 \
--workspace=2048 \
--minShapes=images:1x3x640x640 \
--optShapes=images:1x3x640x640 \
--maxShapes=images:1x3x640x640
The key flags:
- --fp16: Use half-precision floating point. This halves memory bandwidth requirements and doubles throughput on Tensor Cores, with typically less than 0.5% mAP loss.
- --workspace=2048: Gives TensorRT 2GB to explore optimization strategies (layer fusion, kernel selection).
- Fixed shapes: Since our input size was always 640x640, we could skip dynamic shape overhead.
Result: Inference dropped from 65ms to 22ms — a 3x speedup with only 0.3% mAP loss (91.7% vs 92.0%).
2. Ditching Python: The C++ Inference Pipeline
With inference now at 22ms, the Python preprocessing (28ms) was our new bottleneck. The irony wasn't lost on us — the "supporting" code was slower than the actual neural network.
We rewrote the entire pipeline in C++ using the TensorRT C++ API. This eliminated:
- Python interpreter overhead
- NumPy array copying and memory allocation
- GIL contention
- The overhead of the Python → C++ bridge (pybind11/ctypes)
Here's the core inference loop:
class Detector {
public:
struct Detection {
cv::Rect bbox;
float confidence;
int classId;
};
Detector(const std::string& enginePath) {
// Deserialize the TensorRT engine
std::ifstream file(enginePath, std::ios::binary);
std::vector<char> data(std::istreambuf_iterator<char>(file), {});
runtime_ = nvinfer1::createInferRuntime(logger_);
engine_ = runtime_->deserializeCudaEngine(data.data(), data.size());
context_ = engine_->createExecutionContext();
// Pre-allocate GPU buffers
cudaMalloc(&inputBuffer_, INPUT_SIZE * sizeof(float));
cudaMalloc(&outputBuffer_, OUTPUT_SIZE * sizeof(float));
// Create the CUDA stream used by enqueueV2() in detect()
cudaStreamCreate(&stream_);
}
std::vector<Detection> detect(const cv::Mat& frame) {
// Preprocess: resize + normalize on GPU using cv::cuda
cv::cuda::GpuMat gpuFrame;
gpuFrame.upload(frame);
cv::cuda::GpuMat resized;
cv::cuda::resize(gpuFrame, resized, cv::Size(640, 640));
// Convert to float, normalize to [0, 1], CHW format
preprocessGPU(resized, inputBuffer_);
// Run inference
void* bindings[] = {inputBuffer_, outputBuffer_};
context_->enqueueV2(bindings, stream_, nullptr);
cudaStreamSynchronize(stream_);
// Postprocess: NMS
std::vector<float> output(OUTPUT_SIZE);
cudaMemcpy(output.data(), outputBuffer_,
OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost);
return nonMaxSuppression(output, 0.45f, 0.5f);
}
private:
nvinfer1::IRuntime* runtime_;
nvinfer1::ICudaEngine* engine_;
nvinfer1::IExecutionContext* context_;
void* inputBuffer_;
void* outputBuffer_;
cudaStream_t stream_;
Logger logger_;
};
Key optimizations in the C++ version:
- GPU-side preprocessing: Using cv::cuda to resize and normalize directly on the GPU, avoiding a CPU→GPU copy for every frame (a sketch of this step follows below)
- Pre-allocated buffers: cudaMalloc once at startup, reuse for every frame. No allocation during the hot loop.
- Zero-copy where possible: The camera driver wrote frames directly to GPU-accessible memory.
Result: Preprocessing dropped from 28ms to 3ms. Total pipeline: 28ms per frame (35+ FPS).
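The preprocessGPU helper called in the detect() method above is where most of that saving came from. Here is a minimal sketch of what that step can look like using OpenCV's CUDA modules; our actual kernel differed in detail, so treat the body as illustrative. It converts a resized 640x640 BGR8 frame to normalized float data in planar CHW layout, written directly into the pre-allocated TensorRT input buffer.

```cpp
// Illustrative sketch of GPU-side preprocessing: BGR8 HWC -> float32 CHW in [0, 1],
// written straight into the TensorRT input buffer with no extra device copy.
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>   // cv::cuda::split
#include <opencv2/cudaimgproc.hpp>  // cv::cuda::cvtColor
#include <opencv2/imgproc.hpp>      // color conversion codes
#include <vector>

void preprocessGPU(const cv::cuda::GpuMat& resized, void* inputBuffer) {
    // BGR uint8 -> RGB float32 scaled to [0, 1], still on the GPU
    cv::cuda::GpuMat rgb, scaled;
    cv::cuda::cvtColor(resized, rgb, cv::COLOR_BGR2RGB);
    rgb.convertTo(scaled, CV_32FC3, 1.0 / 255.0);

    // Split interleaved HWC channels into three planar slices that alias
    // consecutive regions of the pre-allocated input buffer, yielding the
    // CHW layout the engine expects.
    const int h = scaled.rows;
    const int w = scaled.cols;
    float* dst = static_cast<float*>(inputBuffer);
    std::vector<cv::cuda::GpuMat> planes = {
        cv::cuda::GpuMat(h, w, CV_32FC1, dst),
        cv::cuda::GpuMat(h, w, CV_32FC1, dst + h * w),
        cv::cuda::GpuMat(h, w, CV_32FC1, dst + 2 * h * w)};
    cv::cuda::split(scaled, planes);
}
```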
3. Multithreaded Pipeline Architecture
Even at 28ms, we were leaving performance on the table. Each stage of the pipeline was running sequentially — capture, preprocess, infer, postprocess — but these stages could be overlapped.
We implemented a three-stage pipeline using C++ threads and lock-free queues:
Thread 1 (Capture): [Frame N] [Frame N+1] [Frame N+2] ...
Thread 2 (Inference): [Frame N] [Frame N+1] [Frame N+2] ...
Thread 3 (Postprocess): [Frame N] [Frame N+1] ...
While Thread 2 is running inference on Frame N, Thread 1 is already capturing Frame N+1. This pipeline parallelism effectively hid the capture latency entirely.
class PipelinedDetector {
public:
void start() {
captureThread_ = std::thread(&PipelinedDetector::captureLoop, this);
inferThread_ = std::thread(&PipelinedDetector::inferLoop, this);
postThread_ = std::thread(&PipelinedDetector::postprocessLoop, this);
}
private:
void captureLoop() {
while (running_) {
cv::Mat frame;
camera_.read(frame);
captureQueue_.push(std::move(frame));
}
}
void inferLoop() {
while (running_) {
auto frame = captureQueue_.pop(); // Blocks until available
auto rawOutput = detector_.infer(frame);
inferQueue_.push({std::move(frame), std::move(rawOutput)});
}
}
void postprocessLoop() {
while (running_) {
auto [frame, rawOutput] = inferQueue_.pop();
auto detections = nonMaxSuppression(rawOutput);
publishDetections(detections); // Send to ROS
}
}
ThreadSafeQueue<cv::Mat> captureQueue_{3}; // Buffer up to 3 frames
ThreadSafeQueue<InferResult> inferQueue_{2};
Detector detector_;
cv::VideoCapture camera_;
std::atomic<bool> running_{true};
};
The bounded queue sizes (3 and 2) were important — we didn't want to buffer too many frames because that would increase latency. If the capture thread was producing frames faster than inference could consume them, old frames would be dropped. For robotics, a fresh-but-approximate result is always better than a perfect-but-stale one.
4. ROS Integration
The final piece was publishing detection results into the ROS ecosystem. We created a custom ROS 2 node that published detections as vision_msgs::msg::Detection2DArray messages:
class DetectionNode : public rclcpp::Node {
public:
DetectionNode() : Node("yolo_detector") {
publisher_ = this->create_publisher<vision_msgs::msg::Detection2DArray>(
"/detections", 10);
detector_ = std::make_unique<PipelinedDetector>();
detector_->setCallback([this](const auto& detections) {
auto msg = toRosMessage(detections);
publisher_->publish(msg);
});
detector_->start();
}
};
Other ROS nodes subscribed to /detections and used the bounding box data for:
- Navigation: Avoiding detected obstacles
- Planning: Identifying target objects and computing approach vectors
- Grasping: Centering the gripper on the detected object's bounding box centroid
Because ROS uses a publish-subscribe model, adding new consumers was trivial — we could pipe detections into a visualization tool (RViz), a logging node, or a scoring system without modifying the detector.
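To show how cheap adding a consumer is, here is a minimal ROS 2 node that subscribes to /detections and logs what it receives. The node name and logging are illustrative; a real consumer would replace the comment with its own logic.

```cpp
// Minimal sketch of a /detections consumer node (illustrative names).
#include <memory>
#include <rclcpp/rclcpp.hpp>
#include <vision_msgs/msg/detection2_d_array.hpp>

class DetectionConsumer : public rclcpp::Node {
public:
    DetectionConsumer() : Node("detection_consumer") {
        // Queue depth 10 matches the publisher side of the detector node.
        subscription_ = create_subscription<vision_msgs::msg::Detection2DArray>(
            "/detections", 10,
            [this](const vision_msgs::msg::Detection2DArray::SharedPtr msg) {
                RCLCPP_INFO(get_logger(), "Received %zu detections",
                            msg->detections.size());
                // A real consumer would read msg->detections[i].bbox here,
                // e.g. to center the gripper or flag an obstacle.
            });
    }

private:
    rclcpp::Subscription<vision_msgs::msg::Detection2DArray>::SharedPtr subscription_;
};

int main(int argc, char** argv) {
    rclcpp::init(argc, argv);
    rclcpp::spin(std::make_shared<DetectionConsumer>());
    rclcpp::shutdown();
    return 0;
}
```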
Results: The Full Picture
Here's the performance comparison across our optimization journey:
| Version | FPS | Latency | mAP |
|---------|-----|---------|-----|
| PyTorch (Python) | 8 | 125ms | 92.0% |
| TensorRT FP16 (Python) | 18 | 55ms | 91.7% |
| TensorRT FP16 (C++) | 35 | 28ms | 91.7% |
| Pipelined C++ | 38 | 26ms* | 91.7% |
*Pipeline latency measured as time from frame capture to detection publish.
The jump from 8 FPS to 38 FPS — a 4.75x improvement — directly translated to better competition performance. Our robot could react to objects and obstacles in real time, leading to a 25% improvement in run scores compared to our preliminary rounds (which used the Python pipeline).
Lessons Learned
Profile Before You Optimize
This is the most important lesson. Our initial assumption was "the GPU is the bottleneck," which would have led us to spend weeks trying to prune or distill the model. Profiling revealed that Python preprocessing was nearly as costly as inference. Measure first, optimize second, always.
FP16 Is Usually Good Enough
We tested FP16, INT8, and FP32 inference:
- FP32 → FP16: 0.3% mAP loss, 2x speedup
- FP16 → INT8: 1.8% mAP loss, 1.3x speedup
For our use case, FP16 was the clear sweet spot. INT8 quantization introduced noticeable accuracy degradation on small or partially occluded objects — exactly the hard cases that mattered most in competition.
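We selected precision through trtexec flags, but the same choice can be expressed with the TensorRT C++ builder API. The sketch below is an alternative under assumptions (TensorRT 8.x-era calls, some since deprecated), not the code we shipped: it parses the ONNX model, enables FP16, and marks where INT8 and its calibrator would plug in.

```cpp
// Hedged sketch: building the engine in C++ instead of with trtexec.
// setMaxWorkspaceSize and buildEngineWithConfig are TensorRT 8.x-era APIs.
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <string>

nvinfer1::ICudaEngine* buildEngine(const std::string& onnxPath,
                                   nvinfer1::ILogger& logger) {
    auto* builder = nvinfer1::createInferBuilder(logger);
    auto* network = builder->createNetworkV2(
        1U << static_cast<uint32_t>(
            nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
    auto* parser = nvonnxparser::createParser(*network, logger);
    parser->parseFromFile(onnxPath.c_str(),
        static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));

    auto* config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(2ULL << 30);            // ~2GB, like --workspace=2048
    if (builder->platformHasFastFp16()) {
        config->setFlag(nvinfer1::BuilderFlag::kFP16);  // equivalent of --fp16
    }
    // INT8 additionally needs a calibrator fed with representative frames:
    // config->setFlag(nvinfer1::BuilderFlag::kINT8);
    // config->setInt8Calibrator(&calibrator);

    return builder->buildEngineWithConfig(*network, *config);
}
```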
End-to-End Latency > Throughput
In robotics, the question isn't "how many frames can you process per second?" — it's "how old is the data your planning system is acting on?" A system that processes 60 FPS but buffers 10 frames has 167ms of visual latency. Our pipelined system at 38 FPS with a 2-frame buffer had only 26ms of latency. The robot felt dramatically more responsive.
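A cheap way to keep this honest is to measure data age rather than frame rate: stamp each frame when it is captured and log its age just before the detections are published. A minimal sketch (the StampedFrame type is illustrative, not our exact code):

```cpp
// Illustrative sketch: report end-to-end latency as the age of the frame at
// publish time, not as 1/FPS.
#include <chrono>
#include <opencv2/core.hpp>

using Clock = std::chrono::steady_clock;

struct StampedFrame {
    cv::Mat image;
    Clock::time_point captureTime;  // set by the capture thread
};

// Called right before the detections for this frame are published.
double frameAgeMs(const StampedFrame& frame) {
    const auto age = Clock::now() - frame.captureTime;
    return std::chrono::duration<double, std::milli>(age).count();
}

// The arithmetic from the text: a 10-frame buffer at 60 FPS means each frame
// waits about 10 / 60 s ≈ 167 ms before it is even processed, so a "faster"
// pipeline can still feed the planner older data.
```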
The "Last Mile" Is the Hardest
Getting from a working Jupyter notebook to a production C++ pipeline took 3x longer than the initial model training. This is a universal truth in ML engineering that doesn't get talked about enough. Budget accordingly.
Don't Over-Optimize in Isolation
We initially spent a week trying to squeeze more FPS out of the inference engine alone (model pruning, layer fusion hints, custom CUDA kernels). The returns were diminishing — maybe 2-3 FPS. Then we rewrote preprocessing in C++ and gained 12 FPS in two days. Optimize the whole pipeline, not just the part that feels most "ML."
What I'd Do Differently
If I were starting this project today, I'd consider a few alternatives:
- YOLOv8 or YOLOv9: Newer YOLO variants have better accuracy-speed tradeoffs out of the box, and Ultralytics has dramatically improved their TensorRT export pipeline.
- NVIDIA DeepStream: Instead of building a custom C++ pipeline, DeepStream provides a GPU-accelerated video analytics framework that handles capture, inference, and postprocessing with optimized plugins. We didn't use it because the learning curve was steep, but it would have saved us weeks.
- Training with TensorRT in mind: Some architectural choices (like certain activation functions) don't have optimized TensorRT kernels. Knowing your deployment target during model selection can avoid conversion headaches.
Conclusion
Taking ML models from research to production on resource-constrained hardware is a distinct engineering discipline. It requires profiling skills, systems programming knowledge, and a willingness to step outside the comfortable Python ecosystem. But the payoff is real — in our case, it was the difference between a robot that stumbled and one that competed.
The full pipeline is open-source — check out the project on GitHub.