
kokoro-onnx-cpp

On-device text-to-speech in C++ — Kokoro TTS running via ONNX Runtime for low-latency voice synthesis on embedded and robot hardware

Creator & Maintainer · 2025 · active · GitHub ↗

Problem

Robots with voice interfaces (reception bots, collaborative manipulators, delivery AMRs) need text-to-speech that is low-latency, works offline, and integrates without a Python runtime. Cloud TTS services add a network dependency and round-trip latency that break conversational interaction. Existing C++ TTS options either are dated (Festival) or require complex framework setups.

Kokoro is a recent, high-quality, open-weight TTS model with a permissive license, but its reference implementation is written in Python. Running it from C++ via ONNX allows deployment on robot embedded computers with no Python dependency.

Approach

The Kokoro TTS model components (text encoder, duration predictor, HiFi-GAN vocoder) are exported to ONNX format and driven by a C++ inference pipeline: text input → G2P (grapheme-to-phoneme) conversion → acoustic model (text encoder and duration predictor, ONNX) → HiFi-GAN vocoder (ONNX) → PCM audio output. The output can be streamed to a sound card or written to a WAV file.

Each ONNX session is created once at startup and reused for every synthesis call. G2P uses a lookup table for common words; an espeak backend handles words missing from the table, such as rare words and proper nouns.
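The lookup-then-fallback G2P step can be sketched as follows. This is a minimal illustration, not the project's actual code: `Lexicon`, `FallbackG2P`, `normalize_word`, and `text_to_phonemes` are hypothetical names, the word normalization is ASCII-only, and the fallback is passed in as a callback so that an espeak-backed function (or any other) could be plugged in.

```cpp
#include <cctype>
#include <functional>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration; not the project's public API.
using Lexicon = std::unordered_map<std::string, std::string>;      // word -> phoneme string
using FallbackG2P = std::function<std::string(const std::string&)>;

// Lowercase a token and drop punctuation so "Hello," matches the key "hello".
// ASCII-only by assumption; apostrophes are kept for contractions.
inline std::string normalize_word(const std::string& token) {
    std::string out;
    for (unsigned char c : token) {
        if (std::isalnum(c) || c == '\'') {
            out.push_back(static_cast<char>(std::tolower(c)));
        }
    }
    return out;
}

// One phoneme string per word: the lexicon answers first,
// and the fallback is called only when the word is not in the table.
inline std::vector<std::string> text_to_phonemes(const std::string& text,
                                                 const Lexicon& lexicon,
                                                 const FallbackG2P& fallback) {
    std::vector<std::string> phonemes;
    std::istringstream words(text);
    std::string token;
    while (words >> token) {
        const std::string word = normalize_word(token);
        if (word.empty()) continue;  // token was punctuation only
        const auto hit = lexicon.find(word);
        phonemes.push_back(hit != lexicon.end() ? hit->second : fallback(word));
    }
    return phonemes;
}
```

Keeping the fallback behind a callback means the common path is a single hash lookup, and the slower backend is only paid for on a miss.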

Architecture

Input text → G2P → phoneme sequence → encoder + duration predictor (ONNX) → mel spectrogram → HiFi-GAN vocoder (ONNX) → PCM float32 → WAV write or audio device output.

The pipeline is single-threaded and keeps no state between synthesis calls. The ONNX Runtime CPU execution provider is the default; the CUDA execution provider can be selected at construction for GPU-accelerated synthesis on compatible hardware.
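To make the "build once, call statelessly" shape concrete, here is a small sketch of how the stages might compose. All names (`Stages`, `Synthesizer`, `Phonemes`, `Mel`, `Pcm`) are hypothetical. The stages are plain `std::function` objects standing in for the G2P step and the two ONNX sessions, so the sketch says nothing about the real ONNX Runtime calls.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical data types at each stage boundary.
using Phonemes = std::vector<std::string>;
using Mel = std::vector<std::vector<float>>;  // frames x mel bins
using Pcm = std::vector<float>;               // mono float32 samples

// Stand-ins for the real stages; in the actual pipeline the last two
// would wrap ONNX sessions created at startup.
struct Stages {
    std::function<Phonemes(const std::string&)> g2p;
    std::function<Mel(const Phonemes&)> acoustic;  // encoder + duration predictor
    std::function<Pcm(const Mel&)> vocoder;
};

class Synthesizer {
public:
    // Stages are constructed once and reused for every call.
    explicit Synthesizer(Stages stages) : stages_(std::move(stages)) {}

    // const member: a call reads the stages but stores nothing for the next call.
    Pcm synthesize(const std::string& text) const {
        const Phonemes phonemes = stages_.g2p(text);
        const Mel mel = stages_.acoustic(phonemes);
        return stages_.vocoder(mel);
    }

private:
    Stages stages_;
};
```

Because `synthesize` is `const` and holds no per-call state, repeated calls with the same text yield the same result as long as the stages themselves are deterministic.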

Results

  • Fully offline — no network dependency; suitable for privacy-sensitive and air-gapped deployments
  • No Python runtime required — pure C++ inference pipeline
  • WAV output at 24 kHz, 16-bit PCM — compatible with standard audio playback and ROS2 audio nodes
  • Voice quality matches the Kokoro reference Python implementation (ONNX export preserves model weights exactly)

Stack

  • C++17
  • ONNX Runtime
  • Kokoro TTS
  • HiFi-GAN
  • libsndfile
  • CMake

Technologies

  • C++17
  • ONNX Runtime
  • Kokoro TTS
  • HiFi-GAN
  • Audio Processing