YOLOs-CPP-TensorRT
Header-only C++ YOLO library for NVIDIA TensorRT — GPU preprocessing, CUDA Graph replay, sub-2ms latency, 530+ FPS
Problem
Most YOLO C++ wrappers treat preprocessing as an afterthought: resizing on the CPU, copying synchronously, rebuilding TensorRT launch parameters every frame. On a laptop-class machine this adds 1–3 ms per frame before inference even begins. When the model itself runs in under 2 ms, CPU preprocessing alone can double end-to-end latency.
The other gap: TensorRT engines are model-specific, and output tensor shapes differ between YOLO versions, so most wrappers need manual per-version configuration. Each new YOLO release breaks them.
Approach
YOLOs-TRT was built around one principle: the GPU should never wait for the CPU. Every stage of the pipeline that can move to the GPU does — preprocessing runs as a single CUDA kernel, host-to-device transfer uses pinned memory for async overlap, and the entire inference graph is captured once and replayed via cudaGraphLaunch.
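A minimal sketch of the pinned-upload pattern (buffer names and sizes here are illustrative, not the library's API). The key point is that cudaMemcpyAsync only overlaps with GPU work when the host buffer is page-locked:

```cpp
#include <cuda_runtime.h>

const size_t frameBytes = 1920 * 1080 * 3;  // example: one 1080p BGR frame
unsigned char* hostFrame = nullptr;
unsigned char* deviceFrame = nullptr;
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMallocHost(&hostFrame, frameBytes);     // pinned (page-locked) staging buffer
cudaMalloc(&deviceFrame, frameBytes);

// Per frame: the copy is queued on the stream and returns immediately,
// letting the CPU prepare the next frame while the GPU consumes this one.
cudaMemcpyAsync(deviceFrame, hostFrame, frameBytes,
                cudaMemcpyHostToDevice, stream);
```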
Model version auto-detection reads output tensor shapes at engine load time — no manual --model-version flag required. FP32, FP16, and INT8 engines load identically through the same API.
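As an illustration of shape-based detection, a sketch like the following can classify a detection head without any version flag. The dimension values are the common ONNX export layouts; the library's actual heuristic may differ:

```cpp
#include <NvInfer.h>

enum class HeadLayout { V5Style, V8Style, EndToEnd, Unknown };

HeadLayout detectHeadLayout(const nvinfer1::ICudaEngine& engine,
                            const char* outputName) {
    nvinfer1::Dims d = engine.getTensorShape(outputName);
    if (d.nbDims != 3) return HeadLayout::Unknown;
    // v5/v7 export:   [1, numAnchors, 4 + 1 + numClasses]  (objectness present)
    // v8+/v11 export: [1, 4 + numClasses, numAnchors]      (channels-first)
    // v10 end-to-end: [1, maxDet, 6]                       (x1,y1,x2,y2,score,class)
    if (d.d[2] == 6)     return HeadLayout::EndToEnd;
    if (d.d[1] > d.d[2]) return HeadLayout::V5Style;  // anchors dimension first
    return HeadLayout::V8Style;                       // channels dimension first
}
```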
Architecture
Async upload from pinned host memory (cudaMemcpyAsync) → GPU letterbox + normalize (single CUDA kernel) → CUDA Graph replay of enqueueV3 (cudaGraphLaunch) → NMS post-processing → structured output.
The CUDA kernel performs bilinear letterbox resize, BGR→RGB conversion, and /255.0 normalisation in one pass, writing directly into the TRT input buffer. One cudaStream_t drives the full preprocess → infer → postprocess pipeline with minimal sync points.
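A condensed sketch of such a fused kernel (illustrative, not the library's exact code): each thread computes one destination pixel, sampling the source bilinearly, filling letterbox borders with the conventional YOLO gray value 114, and writing normalized CHW RGB floats:

```cpp
// src: HWC uint8 BGR frame on device; dst: CHW float RGB (the TRT input buffer).
// scale, padX, padY come from the usual letterbox math.
__global__ void letterboxKernel(const unsigned char* __restrict__ src,
                                int srcW, int srcH,
                                float* __restrict__ dst,
                                int dstW, int dstH,
                                float scale, int padX, int padY) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    float r = 114.f / 255.f, g = r, b = r;        // letterbox pad color
    float sx = (x - padX + 0.5f) / scale - 0.5f;  // map back to source coords
    float sy = (y - padY + 0.5f) / scale - 0.5f;
    if (sx >= 0.f && sy >= 0.f && sx <= srcW - 1.f && sy <= srcH - 1.f) {
        int x0 = (int)sx, y0 = (int)sy;
        int x1 = min(x0 + 1, srcW - 1), y1 = min(y0 + 1, srcH - 1);
        float fx = sx - x0, fy = sy - y0;
        // Bilinear sample each BGR channel, then swap to RGB and normalize.
        for (int c = 0; c < 3; ++c) {
            float v = (1 - fx) * (1 - fy) * src[(y0 * srcW + x0) * 3 + c]
                    + fx * (1 - fy)       * src[(y0 * srcW + x1) * 3 + c]
                    + (1 - fx) * fy       * src[(y1 * srcW + x0) * 3 + c]
                    + fx * fy             * src[(y1 * srcW + x1) * 3 + c];
            v /= 255.f;
            if (c == 0) b = v; else if (c == 1) g = v; else r = v;
        }
    }
    // CHW write: R plane, then G, then B.
    int plane = dstW * dstH;
    dst[0 * plane + y * dstW + x] = r;
    dst[1 * plane + y * dstW + x] = g;
    dst[2 * plane + y * dstW + x] = b;
}
```

Launched over the destination image with a 2D block (e.g. `dim3 block(16, 16)`), so a 640×640 input costs a single kernel dispatch.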
CUDA Graph capture runs once at model load; fixed-shape engines replay with ~0.1–0.3 ms less dispatch overhead per frame than bare enqueueV3.
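The capture-once / replay-per-frame pattern looks roughly like this (a sketch assuming a fixed-shape engine, an `IExecutionContext* context` with tensor addresses already set, and an existing `stream`):

```cpp
// Warm-up enqueue outside the graph, so TensorRT's lazy allocations
// don't land inside the capture.
context->enqueueV3(stream);
cudaStreamSynchronize(stream);

// Capture once at load time: record the enqueueV3 work into a graph.
cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
context->enqueueV3(stream);                  // recorded, not executed
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, 0);  // CUDA 12 signature
cudaGraphDestroy(graph);

// Per frame: a single launch replays all captured kernels and copies.
cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);
```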
Results
Measured on NVIDIA RTX 2000 Ada (Laptop) — YOLOv11n · 640×640 · 1000 iterations · 10-iter warm-up:
| Precision | FPS | Avg latency | P50 | P99 | GPU memory |
|---|---|---|---|---|---|
| FP32 | 466 | 2.14 ms | 2.04 ms | 3.03 ms | 530 MB |
| FP16 | 479 | 2.09 ms | 1.98 ms | 2.91 ms | 536 MB |
| INT8 | 530 | 1.89 ms | 1.78 ms | 2.70 ms | 444 MB |
Numbers cover the full pipeline: GPU preprocessing, inference, and post-processing. Throughput scales roughly linearly with available GPU compute on higher-end cards.
Supported tasks (auto-detected from tensor shape):
| Task | YOLO versions |
|---|---|
| Detection | v5 · v7 · v8 · v9 · v10 · v11 · v12 · v26 · NAS |
| Segmentation | v8-seg · v11-seg · v26-seg |
| Pose | v8-pose · v11-pose · v26-pose |
| OBB | v8-obb · v11-obb · v26-obb |
| Classification | v8-cls · v11-cls · v12-cls · v26-cls |
- 54 stars on GitHub
- Jetson Xavier/Orin compatible (CC 7.2 / 8.7)
Lessons
CUDA Graph capture is a significant win but has a hard constraint: it only works with fixed input shapes. The first time a caller changes the input resolution, the graph must be re-captured. Building a small graph cache keyed on (height, width) avoids repeated capture costs for multi-resolution workloads.
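A sketch of such a cache (`GraphCache` is a hypothetical helper built on the capture pattern above; `"images"` is the usual Ultralytics ONNX input name, assumed here):

```cpp
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <map>
#include <utility>

struct GraphCache {
    std::map<std::pair<int, int>, cudaGraphExec_t> cache;

    cudaGraphExec_t get(int h, int w, nvinfer1::IExecutionContext* ctx,
                        cudaStream_t stream) {
        auto key = std::make_pair(h, w);
        auto it = cache.find(key);
        if (it != cache.end()) return it->second;  // fast path: replay only

        // Slow path, first frame at this size: set the shape, warm up, capture.
        ctx->setInputShape("images", nvinfer1::Dims4{1, 3, h, w});
        ctx->enqueueV3(stream);                    // lazy allocations happen here
        cudaStreamSynchronize(stream);

        cudaGraph_t graph;
        cudaGraphExec_t exec;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        ctx->enqueueV3(stream);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&exec, graph, 0);
        cudaGraphDestroy(graph);
        cache[key] = exec;
        return exec;
    }
};
```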
TensorRT INT8 calibration quality determines whether INT8 saves memory without accuracy loss. Using a representative calibration dataset from the actual deployment domain (not ImageNet defaults) is the difference between a useful INT8 engine and one that misses detections.
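For the implicit-quantization path this means implementing TensorRT's IInt8EntropyCalibrator2 over your own deployment images. A minimal sketch (`DomainCalibrator` is illustrative; the per-image preprocessing, which must match the inference path exactly, is elided):

```cpp
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <string>
#include <vector>

class DomainCalibrator : public nvinfer1::IInt8EntropyCalibrator2 {
public:
    DomainCalibrator(std::vector<std::string> imagePaths, std::size_t inputBytes)
        : paths_(std::move(imagePaths)) {
        cudaMalloc(&deviceInput_, inputBytes);
    }
    ~DomainCalibrator() override { cudaFree(deviceInput_); }

    int32_t getBatchSize() const noexcept override { return 1; }

    bool getBatch(void* bindings[], char const* /*names*/[],
                  int32_t /*nbBindings*/) noexcept override {
        if (next_ >= paths_.size()) return false;  // calibration done
        // Preprocess paths_[next_] exactly like inference (letterbox + RGB +
        // /255) and copy the result into deviceInput_ here.
        ++next_;
        bindings[0] = deviceInput_;
        return true;
    }

    // Returning nullptr forces a fresh calibration run instead of a cached one.
    void const* readCalibrationCache(std::size_t& length) noexcept override {
        length = 0;
        return nullptr;
    }
    void writeCalibrationCache(void const* /*data*/,
                               std::size_t /*length*/) noexcept override {}

private:
    std::vector<std::string> paths_;
    void* deviceInput_ = nullptr;
    std::size_t next_ = 0;
};
```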
Stack
C++ (header-only) · CUDA · NVIDIA TensorRT