kokoro-onnx-cpp
On-device text-to-speech in C++ — Kokoro TTS running via ONNX Runtime for low-latency voice synthesis on embedded and robot hardware
Problem
Robots with voice interfaces — reception bots, collaborative manipulators, delivery AMRs — need text-to-speech that is low-latency, offline-capable, and integrates without a Python runtime. Cloud TTS services add network dependency and latency that breaks conversational interaction. Existing C++ TTS options are either ancient (Festival) or require complex framework setups.
Kokoro is a recent high-quality open-weight TTS model with a permissive license; the reference implementation is in Python. Bringing it to C++ via ONNX unlocks deployment on robot embedded computers with no Python dependency.
Approach
Export the Kokoro TTS model components (text encoder, duration predictor, HiFi-GAN vocoder) to ONNX format and build a C++ inference pipeline. Text input → G2P (grapheme-to-phoneme) conversion → acoustic model (ONNX) → HiFi-GAN vocoder (ONNX) → PCM audio output. Output can be streamed to a sound card or written as WAV.
Each ONNX session is created once at startup. G2P uses a lookup table for common words; an espeak backend is available for rare words and proper nouns.
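The lookup-plus-fallback G2P design can be sketched as below. This is a minimal illustration, not the repository's actual code: the class name, the tiny lexicon, and the letter-wise fallback are all placeholders (a real build would call `espeak_TextToPhonemes()` from libespeak-ng in the fallback path).

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the lookup-table G2P with a fallback path.
// Phonemes are represented as strings; the real lexicon is much larger.
class G2P {
public:
    explicit G2P(std::unordered_map<std::string, std::vector<std::string>> lexicon)
        : lexicon_(std::move(lexicon)) {}

    std::vector<std::string> phonemize(const std::string& word) const {
        auto it = lexicon_.find(word);
        if (it != lexicon_.end()) return it->second;  // fast path: table hit
        return fallback(word);                        // rare word / proper noun
    }

private:
    // Stand-in for the espeak backend: emits one symbol per letter.
    // A real implementation would invoke libespeak-ng here instead.
    std::vector<std::string> fallback(const std::string& word) const {
        std::vector<std::string> out;
        for (char c : word) out.push_back(std::string(1, c));
        return out;
    }

    std::unordered_map<std::string, std::vector<std::string>> lexicon_;
};
```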
Architecture
Input text → G2P → phoneme sequence → encoder + duration predictor (ONNX) → mel spectrogram → HiFi-GAN vocoder (ONNX) → PCM float32 → WAV write or audio device output.
The pipeline is single-threaded and stateless between synthesis calls. ONNX Runtime CPU execution provider is the default; CUDA is selectable at construction for GPU-accelerated synthesis on compatible hardware.
Results
- Fully offline — no network dependency; suitable for privacy-sensitive and air-gapped deployments
- No Python runtime required — pure C++ inference pipeline
- WAV output at 24 kHz, 16-bit PCM — compatible with standard audio playback and ROS2 audio nodes
- Voice quality matches the Kokoro reference Python implementation (the ONNX export preserves the model weights, so outputs agree up to floating-point rounding)
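Serializing the 24 kHz, 16-bit PCM WAV output amounts to writing a 44-byte RIFF header followed by clamped int16 samples. A minimal sketch (function names are illustrative, mono output assumed):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Little-endian field writers for the RIFF header.
static void put_u32(std::vector<uint8_t>& b, uint32_t v) {
    for (int i = 0; i < 4; ++i) b.push_back((v >> (8 * i)) & 0xFF);
}
static void put_u16(std::vector<uint8_t>& b, uint16_t v) {
    for (int i = 0; i < 2; ++i) b.push_back((v >> (8 * i)) & 0xFF);
}

// Serialize float32 PCM in [-1, 1] to a mono 16-bit WAV byte stream.
std::vector<uint8_t> to_wav(const std::vector<float>& pcm, uint32_t rate = 24000) {
    const uint32_t data_bytes = static_cast<uint32_t>(pcm.size()) * 2;
    std::vector<uint8_t> b;
    b.insert(b.end(), {'R','I','F','F'});
    put_u32(b, 36 + data_bytes);               // RIFF chunk size
    b.insert(b.end(), {'W','A','V','E','f','m','t',' '});
    put_u32(b, 16);                            // fmt chunk size
    put_u16(b, 1);                             // format: PCM
    put_u16(b, 1);                             // channels: mono
    put_u32(b, rate);                          // sample rate
    put_u32(b, rate * 2);                      // byte rate = rate * channels * 2
    put_u16(b, 2);                             // block align
    put_u16(b, 16);                            // bits per sample
    b.insert(b.end(), {'d','a','t','a'});
    put_u32(b, data_bytes);
    for (float s : pcm) {                      // clamp and quantize each sample
        float c = std::max(-1.0f, std::min(1.0f, s));
        put_u16(b, static_cast<uint16_t>(static_cast<int16_t>(c * 32767.0f)));
    }
    return b;
}
```

Writing the returned bytes to a `.wav` file yields audio playable by standard tools and ROS2 audio nodes.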
Stack