Depths-CPP
Header-only C++ monocular depth estimation — Depth Anything v2 via ONNX Runtime, real-time on CPU and GPU
Problem
Stereo cameras and LiDAR provide geometric depth but add cost, weight, and calibration complexity to robot platforms. Monocular depth estimation offers a software-only alternative — one RGB camera, no additional hardware — but state-of-the-art models such as Depth Anything v2 ship as Python packages. Deploying them on embedded robot computers without a Python runtime requires an ONNX export and a C++ inference wrapper that correctly replicates the model's normalisation pipeline.
Approach
Single-header design following the same pattern as YOLOs-CPP. One header file handles ONNX session setup, pre-processing, inference, and depth map output. Construct with an ONNX model path and a GPU flag, call predict(frame), receive a floating-point depth map as a cv::Mat. The output is ready to pass directly to obstacle avoidance or point-cloud generation code.
Both relative (normalised) and metric depth modes are supported using the appropriate model variant. Colour-mapped visualisation (COLORMAP_INFERNO) is a one-line call. Supports image, video, and live camera inference modes. Multi-threaded architecture with adaptive batch size for throughput-oriented workloads.
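A minimal usage sketch of the API described above. The header name `depth_anything.hpp` and the class name `DepthAnything` are assumptions (the write-up does not name them); the constructor arguments (model path plus GPU flag) and `predict(frame)` follow the description. The colormap step here uses raw OpenCV calls rather than the library's one-line helper, whose name is not given. Requires OpenCV and ONNX Runtime at build time.

```cpp
// Sketch only: "depth_anything.hpp" and DepthAnything are hypothetical
// names standing in for the single header and its class.
#include <opencv2/opencv.hpp>
#include "depth_anything.hpp"

int main() {
    // Construct with an ONNX model path and a GPU flag.
    DepthAnything model("vits.onnx", /*use_gpu=*/true);

    cv::Mat frame = cv::imread("scene.jpg");
    cv::Mat depth = model.predict(frame);  // float depth map (CV_32F)

    // Colour-mapped visualisation via OpenCV.
    cv::Mat depth8u, vis;
    cv::normalize(depth, depth8u, 0, 255, cv::NORM_MINMAX, CV_8U);
    cv::applyColorMap(depth8u, vis, cv::COLORMAP_INFERNO);
    cv::imwrite("depth.png", vis);
    return 0;
}
```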
Architecture
Input frame → resize to 384×384 → normalise to [0,1] and standardise (ImageNet mean/std) → ONNX inference (CPU, CUDA, or TensorRT execution provider) → H×W float32 depth map → optional COLORMAP_INFERNO visualisation.
Session is created once at construction; inference is stateless. The same binary selects the execution provider at runtime — no recompilation for CPU vs GPU. Dynamic input shape handling accommodates varying resolutions.
Results
Model zoo (all at 384×384 input):
| Model | Type | Notes |
|---|---|---|
| vits.onnx | FP32, relative depth | ViT-Small, general use |
| vits_quint8.onnx | UINT8 quantised | Edge-optimised, lower memory |
| vits_metric_indoor.onnx | FP32, metric depth | Calibrated for indoor scenes |
| vits_metric_outdoor.onnx | FP32, metric depth | Calibrated for outdoor scenes |
- TensorRT, CUDA, and CPU execution providers supported in the same binary
- Runs on Linux, macOS, and Windows; cross-platform CMake build
- 112 stars on GitHub
Lessons
Normalisation conventions differ between model families and are not documented consistently in the Depth Anything v2 release. The PyTorch export uses a specific mean/std pair that differs from the standard ImageNet values used by the ViT backbone. Testing the C++ pre-processing against the Python reference implementation frame-by-frame — comparing depth map values numerically, not just visually — was the only reliable verification method.
Metric depth models require the correct indoor/outdoor variant for the deployment environment. Using an outdoor model indoors (or vice versa) produces plausible-looking but numerically wrong depth values — a failure mode that visual inspection alone will not catch.
Stack
Technologies