Depths-CPP
Header-only C++ monocular depth estimation — Depth Anything v2 via ONNX Runtime, real-time on CPU and GPU
Problem
Stereo cameras and LiDAR provide geometric depth but add cost, weight, and calibration complexity to robot platforms. Monocular depth estimation offers a software-only alternative — one RGB camera, no additional hardware — but state-of-the-art models such as Depth Anything v2 ship as Python packages. Deploying them on embedded robot computers without a Python runtime requires an ONNX export and a C++ inference wrapper that correctly replicates the model's normalisation pipeline.
Approach
Single-header design following the same pattern as YOLOs-CPP. One header file handles ONNX session setup, pre-processing, inference, and depth map output. Construct with an ONNX model path and a GPU flag, call predict(frame), receive a floating-point depth map as a cv::Mat. The output is ready to pass directly to obstacle avoidance or point-cloud generation code.
Both relative (normalised) and metric depth modes are supported using the appropriate model variant. Colour-mapped visualisation (COLORMAP_INFERNO) is a one-line call. Supports image, video, and live camera inference modes. Multi-threaded architecture with adaptive batch size for throughput-oriented workloads.
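A minimal usage sketch of the API described above. The header name `depth_anything.hpp` and the class name `DepthAnything` are assumptions (the write-up does not name them); the constructor arguments (model path plus GPU flag) and `predict(frame)` follow the description. The colormap step here uses raw OpenCV calls rather than the library's one-line helper, whose name is not given. Requires OpenCV and ONNX Runtime at build time.

```cpp
// Sketch only: "depth_anything.hpp" and DepthAnything are hypothetical
// names standing in for the single header and its class.
#include <opencv2/opencv.hpp>
#include "depth_anything.hpp"

int main() {
    // Construct with an ONNX model path and a GPU flag.
    DepthAnything model("vits.onnx", /*use_gpu=*/true);

    cv::Mat frame = cv::imread("scene.jpg");
    cv::Mat depth = model.predict(frame);  // float depth map (CV_32F)

    // Colour-mapped visualisation via OpenCV.
    cv::Mat depth8u, vis;
    cv::normalize(depth, depth8u, 0, 255, cv::NORM_MINMAX, CV_8U);
    cv::applyColorMap(depth8u, vis, cv::COLORMAP_INFERNO);
    cv::imwrite("depth.png", vis);
    return 0;
}
```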
Architecture
Input frame → resize to 384×384 → normalise to [0,1] and standardise (ImageNet mean/std) → ONNX inference (CPU, CUDA, or TensorRT execution provider) → H×W float32 depth map → optional COLORMAP_INFERNO visualisation.
Session is created once at construction; inference is stateless. The same binary selects the execution provider at runtime — no recompilation for CPU vs GPU. Dynamic input shape handling accommodates varying resolutions.
Results
Model zoo (all at 384×384 input):
| Model | Type | Notes |
|---|---|---|
| vits.onnx | FP32, relative depth | ViT-Small, general use |
| vits_quint8.onnx | UINT8 quantised | Edge-optimised, lower memory |
| vits_metric_indoor.onnx | FP32, metric depth | Calibrated for indoor scenes |
| vits_metric_outdoor.onnx | FP32, metric depth | Calibrated for outdoor scenes |
- TensorRT, CUDA, and CPU execution providers supported in the same binary
- Runs on Linux, macOS, and Windows; cross-platform CMake build
- 112 stars on GitHub
Lessons
Normalisation conventions differ between model families and are not documented consistently in the Depth Anything v2 release. The PyTorch export uses a specific mean/std pair that differs from the standard ImageNet values used by the ViT backbone. Testing the C++ pre-processing against the Python reference implementation frame-by-frame — comparing depth map values numerically, not just visually — was the only reliable verification method.
Metric depth models require the correct indoor/outdoor variant for the deployment environment. Using an outdoor model indoors (or vice versa) produces plausible-looking but numerically wrong depth values — a failure mode that visual inspection alone will not catch.
Stack
Technologies