YOLOs-CPP-TensorRT
Header-only C++ YOLO library for NVIDIA TensorRT — GPU preprocessing, CUDA Graph replay, sub-2ms latency, 530+ FPS
Problem
Most YOLO C++ wrappers treat preprocessing as an afterthought: resizing on the CPU, copying synchronously, rebuilding TensorRT launch parameters every frame. On a laptop-class machine this adds 1–3 ms per frame before inference even begins. When the model itself runs in under 2 ms, CPU preprocessing alone can double end-to-end latency.
The other gap: TensorRT engines are model-specific, and output tensor shapes differ between YOLO versions, so most wrappers need manual per-version configuration. Each new YOLO release breaks them.
Approach
YOLOs-TRT was built around one principle: the GPU should never wait for the CPU. Every stage of the pipeline that can move to the GPU does — preprocessing runs as a single CUDA kernel, host-to-device transfer uses pinned memory for async overlap, and the entire inference graph is captured once and replayed via cudaGraphLaunch.
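A minimal sketch of the pinned-upload pattern (buffer names and sizes here are illustrative, not the library's API). The key point is that cudaMemcpyAsync only overlaps with GPU work when the host buffer is page-locked:

```cpp
#include <cuda_runtime.h>

const size_t frameBytes = 1920 * 1080 * 3;  // example: one 1080p BGR frame
unsigned char* hostFrame = nullptr;
unsigned char* deviceFrame = nullptr;
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMallocHost(&hostFrame, frameBytes);     // pinned (page-locked) staging buffer
cudaMalloc(&deviceFrame, frameBytes);

// Per frame: the copy is queued on the stream and returns immediately,
// letting the CPU prepare the next frame while the GPU consumes this one.
cudaMemcpyAsync(deviceFrame, hostFrame, frameBytes,
                cudaMemcpyHostToDevice, stream);
```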
Model version auto-detection reads output tensor shapes at engine load time — no manual --model-version flag required. FP32, FP16, and INT8 engines load identically through the same API.
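As an illustration of shape-based detection, a sketch like the following can classify a detection head without any version flag. The dimension values are the common ONNX export layouts; the library's actual heuristic may differ:

```cpp
#include <NvInfer.h>

enum class HeadLayout { V5Style, V8Style, EndToEnd, Unknown };

HeadLayout detectHeadLayout(const nvinfer1::ICudaEngine& engine,
                            const char* outputName) {
    nvinfer1::Dims d = engine.getTensorShape(outputName);
    if (d.nbDims != 3) return HeadLayout::Unknown;
    // v5/v7 export:   [1, numAnchors, 4 + 1 + numClasses]  (objectness present)
    // v8+/v11 export: [1, 4 + numClasses, numAnchors]      (channels-first)
    // v10 end-to-end: [1, maxDet, 6]                       (x1,y1,x2,y2,score,class)
    if (d.d[2] == 6)     return HeadLayout::EndToEnd;
    if (d.d[1] > d.d[2]) return HeadLayout::V5Style;  // anchors dimension first
    return HeadLayout::V8Style;                       // channels dimension first
}
```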
Architecture
Async upload from pinned host memory (cudaMemcpyAsync) → GPU letterbox + normalize (single CUDA kernel) → CUDA Graph replay of enqueueV3 (cudaGraphLaunch) → NMS post-processing → structured output.
The CUDA kernel performs bilinear letterbox resize, BGR→RGB conversion, and /255.0 normalisation in one pass, writing directly into the TRT input buffer. One cudaStream_t drives the full preprocess → infer → postprocess pipeline with minimal sync points.
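A condensed sketch of such a fused kernel (illustrative, not the library's exact code): each thread computes one destination pixel, sampling the source bilinearly, filling letterbox borders with the conventional YOLO gray value 114, and writing normalized CHW RGB floats:

```cpp
// src: HWC uint8 BGR frame on device; dst: CHW float RGB (the TRT input buffer).
// scale, padX, padY come from the usual letterbox math.
__global__ void letterboxKernel(const unsigned char* __restrict__ src,
                                int srcW, int srcH,
                                float* __restrict__ dst,
                                int dstW, int dstH,
                                float scale, int padX, int padY) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    float r = 114.f / 255.f, g = r, b = r;        // letterbox pad color
    float sx = (x - padX + 0.5f) / scale - 0.5f;  // map back to source coords
    float sy = (y - padY + 0.5f) / scale - 0.5f;
    if (sx >= 0.f && sy >= 0.f && sx <= srcW - 1.f && sy <= srcH - 1.f) {
        int x0 = (int)sx, y0 = (int)sy;
        int x1 = min(x0 + 1, srcW - 1), y1 = min(y0 + 1, srcH - 1);
        float fx = sx - x0, fy = sy - y0;
        // Bilinear sample each BGR channel, then swap to RGB and normalize.
        for (int c = 0; c < 3; ++c) {
            float v = (1 - fx) * (1 - fy) * src[(y0 * srcW + x0) * 3 + c]
                    + fx * (1 - fy)       * src[(y0 * srcW + x1) * 3 + c]
                    + (1 - fx) * fy       * src[(y1 * srcW + x0) * 3 + c]
                    + fx * fy             * src[(y1 * srcW + x1) * 3 + c];
            v /= 255.f;
            if (c == 0) b = v; else if (c == 1) g = v; else r = v;
        }
    }
    // CHW write: R plane, then G, then B.
    int plane = dstW * dstH;
    dst[0 * plane + y * dstW + x] = r;
    dst[1 * plane + y * dstW + x] = g;
    dst[2 * plane + y * dstW + x] = b;
}
```

Launched over the destination image with a 2D block (e.g. `dim3 block(16, 16)`), so a 640×640 input costs a single kernel dispatch.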
CUDA Graph capture runs once at model load; fixed-shape engines replay with ~0.1–0.3 ms less dispatch overhead per frame than bare enqueueV3.
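The capture-once / replay-per-frame pattern looks roughly like this (a sketch assuming a fixed-shape engine, an `IExecutionContext* context` with tensor addresses already set, and an existing `stream`):

```cpp
// Warm-up enqueue outside the graph, so TensorRT's lazy allocations
// don't land inside the capture.
context->enqueueV3(stream);
cudaStreamSynchronize(stream);

// Capture once at load time: record the enqueueV3 work into a graph.
cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
context->enqueueV3(stream);                  // recorded, not executed
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, 0);  // CUDA 12 signature
cudaGraphDestroy(graph);

// Per frame: a single launch replays all captured kernels and copies.
cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);
```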
Results
Measured on NVIDIA RTX 2000 Ada (Laptop) — YOLOv11n · 640×640 · 1000 iterations · 10-iter warm-up:
| Precision | FPS | Avg latency | P50 | P99 | GPU memory |
|---|---|---|---|---|---|
| FP32 | 466 | 2.14 ms | 2.04 ms | 3.03 ms | 530 MB |
| FP16 | 479 | 2.09 ms | 1.98 ms | 2.91 ms | 536 MB |
| INT8 | 530 | 1.89 ms | 1.78 ms | 2.70 ms | 444 MB |
Numbers cover the full pipeline: GPU preprocessing, inference, and post-processing. Throughput scales roughly linearly with available GPU compute on higher-end cards.
Supported tasks (auto-detected from tensor shape):
| Task | YOLO versions |
|---|---|
| Detection | v5 · v7 · v8 · v9 · v10 · v11 · v12 · v26 · NAS |
| Segmentation | v8-seg · v11-seg · v26-seg |
| Pose | v8-pose · v11-pose · v26-pose |
| OBB | v8-obb · v11-obb · v26-obb |
| Classification | v8-cls · v11-cls · v12-cls · v26-cls |
- 54 stars on GitHub
- Jetson Xavier/Orin compatible (CC 7.2 / 8.7)
Lessons
CUDA Graph capture is a significant win but has a hard constraint: it only works with fixed input shapes. The first time a caller changes the input resolution, the graph must be re-captured. Building a small graph cache keyed on (height, width) avoids repeated capture costs for multi-resolution workloads.
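A sketch of such a cache (`GraphCache` is a hypothetical helper built on the capture pattern above; `"images"` is the usual Ultralytics ONNX input name, assumed here):

```cpp
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <map>
#include <utility>

struct GraphCache {
    std::map<std::pair<int, int>, cudaGraphExec_t> cache;

    cudaGraphExec_t get(int h, int w, nvinfer1::IExecutionContext* ctx,
                        cudaStream_t stream) {
        auto key = std::make_pair(h, w);
        auto it = cache.find(key);
        if (it != cache.end()) return it->second;  // fast path: replay only

        // Slow path, first frame at this size: set the shape, warm up, capture.
        ctx->setInputShape("images", nvinfer1::Dims4{1, 3, h, w});
        ctx->enqueueV3(stream);                    // lazy allocations happen here
        cudaStreamSynchronize(stream);

        cudaGraph_t graph;
        cudaGraphExec_t exec;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        ctx->enqueueV3(stream);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&exec, graph, 0);
        cudaGraphDestroy(graph);
        cache[key] = exec;
        return exec;
    }
};
```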
TensorRT INT8 calibration quality determines whether INT8 saves memory without accuracy loss. Using a representative calibration dataset from the actual deployment domain (not ImageNet defaults) is the difference between a useful INT8 engine and one that misses detections.
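For the implicit-quantization path this means implementing TensorRT's IInt8EntropyCalibrator2 over your own deployment images. A minimal sketch (`DomainCalibrator` is illustrative; the per-image preprocessing, which must match the inference path exactly, is elided):

```cpp
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <string>
#include <vector>

class DomainCalibrator : public nvinfer1::IInt8EntropyCalibrator2 {
public:
    DomainCalibrator(std::vector<std::string> imagePaths, std::size_t inputBytes)
        : paths_(std::move(imagePaths)) {
        cudaMalloc(&deviceInput_, inputBytes);
    }
    ~DomainCalibrator() override { cudaFree(deviceInput_); }

    int32_t getBatchSize() const noexcept override { return 1; }

    bool getBatch(void* bindings[], char const* /*names*/[],
                  int32_t /*nbBindings*/) noexcept override {
        if (next_ >= paths_.size()) return false;  // calibration done
        // Preprocess paths_[next_] exactly like inference (letterbox + RGB +
        // /255) and copy the result into deviceInput_ here.
        ++next_;
        bindings[0] = deviceInput_;
        return true;
    }

    // Returning nullptr forces a fresh calibration run instead of a cached one.
    void const* readCalibrationCache(std::size_t& length) noexcept override {
        length = 0;
        return nullptr;
    }
    void writeCalibrationCache(void const* /*data*/,
                               std::size_t /*length*/) noexcept override {}

private:
    std::vector<std::string> paths_;
    void* deviceInput_ = nullptr;
    std::size_t next_ = 0;
};
```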
Stack
C++ (header-only) · CUDA · NVIDIA TensorRT