
YOLOs-CPP

Header-only C++ library for real-time YOLO inference — detection, segmentation, pose, OBB — no Python, no runtime bloat

Creator & Maintainer · 2025 · active · GitHub · Demo

Problem

Running YOLO in robotics and embedded systems typically means a Python runtime, a subprocess boundary, and latency you can’t budget for. The ONNX ecosystem promised cross-platform inference, but the reference implementations were Python-first. Teams either lived with the Python overhead or rewrote from scratch every time a new YOLO version dropped.

The deeper problem: YOLO versions v5 through v12 have incompatible output formats. Each update broke existing C++ wrappers. Projects using detection today and adding segmentation tomorrow had to touch two separate codebases.

Approach

Single-header design, one file per task type. Drop yolov8_det.hpp into any CMake project, link ONNX Runtime and OpenCV, and you have a working detector in under fifty lines. No framework lock-in, no package manager step.
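
Wiring this up in CMake is short. The sketch below is illustrative — the target name, `ONNXRUNTIME_DIR` variable, and library path are assumptions, not the project's documented build recipe:

```cmake
# Minimal sketch — paths and target names are illustrative.
cmake_minimum_required(VERSION 3.16)
project(detector_demo CXX)
set(CMAKE_CXX_STANDARD 17)

find_package(OpenCV 4 REQUIRED)

# ONNXRUNTIME_DIR points at an extracted onnxruntime release (assumption).
add_executable(detector_demo main.cpp)
target_include_directories(detector_demo PRIVATE ${ONNXRUNTIME_DIR}/include)
target_link_libraries(detector_demo PRIVATE
    ${OpenCV_LIBS}
    ${ONNXRUNTIME_DIR}/lib/libonnxruntime.so)
```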

The API surface is deliberately narrow: construct with a model path and confidence threshold, call detect(frame), iterate results. The same pattern applies across detection, segmentation, oriented bounding boxes, and pose estimation — switching task types is a one-line change.
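
A sketch of that construct → detect → iterate pattern — the class and field names (`YOLODetector`, `Detection`, `box`) are illustrative stand-ins, not the library's exact identifiers:

```cpp
// Sketch of the narrow API surface; identifiers are illustrative.
#include <opencv2/opencv.hpp>
#include "yolov8_det.hpp"  // one header per task type

int main() {
    // Construct with a model path and confidence threshold.
    YOLODetector detector("yolov11n.onnx", /*confThreshold=*/0.25f);

    cv::Mat frame = cv::imread("input.jpg");

    // Call detect(frame), iterate results.
    for (const auto& det : detector.detect(frame)) {
        cv::rectangle(frame, det.box, {0, 255, 0}, 2);  // draw each box
    }
    cv::imwrite("output.jpg", frame);
}
```

Swapping in the segmentation or pose header changes the include and the result type; the call pattern stays the same.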

Model-agnostic output parsing handles differences between YOLO output formats internally. Adding v12 support required touching only the parser, not any caller code.
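
One way such a parser can absorb format differences — shown here as a standalone sketch, not the library's actual internals — is to detect the output layout from the tensor shape and normalize it before any downstream code runs. v5-style heads emit boxes as rows (`[numBoxes, 4 + 1 + numClasses]`, with an objectness score), while v8-style heads emit boxes as columns (`[4 + numClasses, numBoxes]`, no objectness):

```cpp
#include <cstdint>
#include <vector>

// Illustrative shape-based layout dispatch, not the library's actual
// parser. The box count (e.g. 8400 at 640x640) is always far larger
// than the per-box attribute count, so the smaller dimension tells us
// which axis holds the attributes.
enum class HeadLayout { BoxesAsRows, BoxesAsColumns };

HeadLayout detectLayout(int64_t dim0, int64_t dim1) {
    return dim0 < dim1 ? HeadLayout::BoxesAsColumns : HeadLayout::BoxesAsRows;
}

// Normalize to row-major [box][attribute] so downstream NMS code never
// sees the difference between versions.
std::vector<std::vector<float>> toRows(const std::vector<float>& raw,
                                       int64_t dim0, int64_t dim1) {
    const HeadLayout layout = detectLayout(dim0, dim1);
    const int64_t boxes = (layout == HeadLayout::BoxesAsRows) ? dim0 : dim1;
    const int64_t attrs = (layout == HeadLayout::BoxesAsRows) ? dim1 : dim0;
    std::vector<std::vector<float>> out(boxes, std::vector<float>(attrs));
    for (int64_t b = 0; b < boxes; ++b)
        for (int64_t a = 0; a < attrs; ++a)
            out[b][a] = (layout == HeadLayout::BoxesAsRows)
                            ? raw[b * attrs + a]   // already contiguous rows
                            : raw[a * boxes + b];  // transpose columns
    return out;
}
```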

Architecture

Each header encapsulates: ONNX session initialisation, pre-processing (resize, normalise, NCHW conversion), inference, and post-processing (NMS, coordinate rescaling). GPU execution paths use the ONNX Runtime CUDA execution provider when available; the same binary falls back to CPU without recompilation.
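
The GPU-with-CPU-fallback setup can be sketched with ONNX Runtime's C++ API roughly as below; error handling is trimmed, and this is a plausible shape rather than the library's exact code:

```cpp
// Sketch: register the CUDA execution provider when present, otherwise
// fall back to the default CPU provider — same binary either way.
#include <onnxruntime_cxx_api.h>

Ort::Session makeSession(Ort::Env& env, const char* modelPath) {
    Ort::SessionOptions opts;
    opts.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    try {
        OrtCUDAProviderOptions cuda{};            // device 0 by default
        opts.AppendExecutionProvider_CUDA(cuda);  // registers the CUDA EP
    } catch (const Ort::Exception&) {
        // CUDA EP unavailable (no GPU, or CPU-only runtime build):
        // the session simply runs on CPU without recompilation.
    }
    return Ort::Session(env, modelPath, opts);
}
```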

Quantized models (INT8/FP16) load identically to FP32 — no code changes needed. Sample pipelines cover image files, video streams, and live camera feeds via OpenCV VideoCapture. 36 automated tests gate each release.

Results

Measured on Intel i7-12700H / RTX 3060, 640×640 input, YOLOv11n model:

Backend     | FPS | Latency | Memory
CPU         | 15  | 67 ms   | 48 MB
CUDA (GPU)  | 97  | 10 ms   | 412 MB

Additional GPU benchmarks (RTX 3060, 640×640):

Model          | FPS
YOLOv8n        | 86
YOLO26n        | 78
YOLOv11n-seg   | 65
YOLOv11n-pose  | 80

  • Supports YOLO v5, v6, v7, v8, v9, v10, v11, v12 in detection, segmentation, OBB, pose, and classification modes
  • Zero Python in the inference path — deterministic latency on embedded hardware
  • 968 stars on GitHub; used in production robotics perception and industrial inspection

Lessons

Post-processing is where version differences live. YOLO v8 switched from anchor-based to anchor-free heads; v10 added NMS-free variants. Keeping the pre/post-processing logic inside each header rather than a shared base class made these changes easier to isolate and test without regressions.
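
The NMS step each header owns has roughly this shape — a minimal single-class greedy sketch, not the library's implementation, which also handles per-class offsets and the NMS-free v10 variants:

```cpp
#include <algorithm>
#include <vector>

// Minimal single-class greedy NMS: keep the highest-scoring box, drop
// any remaining box that overlaps a kept one beyond the IoU threshold.
struct Box { float x1, y1, x2, y2, score; };

float iou(const Box& a, const Box& b) {
    float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
    float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
    float inter = std::max(0.0f, ix2 - ix1) * std::max(0.0f, iy2 - iy1);
    float uni = (a.x2 - a.x1) * (a.y2 - a.y1) +
                (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

std::vector<Box> nms(std::vector<Box> boxes, float iouThresh) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& cand : boxes) {
        bool suppressed = false;
        for (const Box& k : kept)
            if (iou(cand, k) > iouThresh) { suppressed = true; break; }
        if (!suppressed) kept.push_back(cand);
    }
    return kept;
}
```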

Header-only simplicity has limits: compile times grow with template depth. Future work: a thin compiled core with the header as a lightweight adaptor.

Stack

  • C++17
  • ONNX Runtime 1.x
  • OpenCV 4.x
  • CUDA (optional)
  • CMake

Technologies

  • C++17
  • ONNX Runtime
  • OpenCV
  • CUDA
  • YOLOv5–v12
  • Quantization (INT8/FP16)