Skip to main content

Text-to-speech C++ samples

This folder contains C++ examples for ov::genai::Text2SpeechPipeline.

Supported Models

  • SpeechT5
    • Requires exported SpeechT5 model and vocoder.
    • Usually uses a speaker embedding file.
  • Kokoro
    • Uses a Kokoro model directory.
    • Uses a speaker embedding file and language options.
    • End-to-end Kokoro language support includes:
      • en-us (English, United States)
      • en-gb (English, United Kingdom)
      • es (Spanish)
      • fr-fr (French, France)
      • hi (Hindi)
      • it (Italian)
      • pt-br (Portuguese, Brazil)
    • Not yet supported for end-to-end text generation in this flow: ja (Japanese), zh (Chinese/Mandarin).

SpeechT5 setup

Install export tools and export SpeechT5 with vocoder:

pip install --upgrade-strategy eager -r ../../export-requirements.txt
optimum-cli export openvino --model microsoft/speecht5_tts --model-kwargs "{\"vocoder\": \"microsoft/speecht5_hifigan\"}" speecht5_tts

Create a speaker embedding file (SpeechT5-specific):

python ../../python/speech_generation/create_speaker_embedding.py

Kokoro setup

pip install --upgrade-strategy eager -r ../../export-requirements.txt
pip install kokoro
optimum-cli export openvino -m hexgrad/Kokoro-82M ov_Kokoro-82M --trust-remote-code

Note: After export is complete, you will find the available speaker embedding .bin files in ov_Kokoro-82M/voices.

Use of espeak-ng within the Kokoro Pipeline

Within the Kokoro Text-to-Speech pipeline, espeak-ng is an external dependency used for the grapheme-to-phoneme (G2P) stage. Its role varies depending on the selected language:

  • English (en-us, en-gb): espeak-ng is used as a fallback for words that are not found in the built-in dictionary. See the kokoro_phonemize_fallback sample for an example of using an OpenVINO-based fallback model to avoid relying on espeak-ng for English.

  • Non-English (es, fr-fr, hi, it, pt-br): espeak-ng serves as the primary G2P (phonemization) engine. As such, it must be installed to enable end-to-end text-to-speech generation for these languages.

Note: espeak-ng is licensed under GPLv3 and must be installed separately. OpenVINO GenAI detects its presence automatically at runtime.

To install espeak-ng, follow the official guide: https://github.com/espeak-ng/espeak-ng/blob/master/docs/guide.md

Run samples

1) text2speech

SpeechT5:

text2speech speecht5_tts "Hello from OpenVINO GenAI" speaker_embedding.bin

Kokoro:

text2speech ov_Kokoro-82M "Hello, and welcome to speech generation using OpenVINO GenAI." ov_Kokoro-82M/voices/af_heart.bin --language en-us

Kokoro (non-English):

text2speech ov_Kokoro-82M "Hola y bienvenidos a la generación de voz utilizando OpenVINO GenAI." ov_Kokoro-82M/voices/ef_dora.bin --language es

Text2speech with speed control:

text2speech ov_Kokoro-82M "Hello from OpenVINO GenAI with a faster speaking rate." ov_Kokoro-82M/voices/af_heart.bin --language en-us --speed 1.15

2) kokoro_phonemize_fallback (Kokoro only)

This sample demonstrates how to use an OpenVINO-based fallback model for phonemization, allowing you to avoid relying on espeak-ng when working with English languages.

Why use a fallback model instead of espeak-ng?

While espeak-ng provides robust phonemization, using an OpenVINO-based fallback model avoids the need for external dependencies and consideration of their associated licensing requirements, enabling a more self-contained and uniformly licensed deployment.

Export OV fallback models:

US:

optimum-cli export openvino --model PeterReid/graphemes_to_phonemes_en_us --task text2text-generation graphemes_to_phonemes_en_us-ov

GB:

optimum-cli export openvino --model PeterReid/graphemes_to_phonemes_en_gb --task text2text-generation graphemes_to_phonemes_en_gb-ov

Run using fallback models:

US model + en-us:

kokoro_phonemize_fallback ov_Kokoro-82M "Vellorin traded copperchimes for rainmint at Candlehaven." --speaker_embedding_file_path ov_Kokoro-82M/voices/af_heart.bin --language en-us --phonemize_fallback_model_dir graphemes_to_phonemes_en_us-ov

GB model + en-gb:

kokoro_phonemize_fallback ov_Kokoro-82M "Vellorin traded copperchimes for rainmint at Candlehaven." --speaker_embedding_file_path ov_Kokoro-82M/voices/bf_emma.bin --language en-gb --phonemize_fallback_model_dir graphemes_to_phonemes_en_gb-ov

To run with default espeak-ng fallback:

kokoro_phonemize_fallback ov_Kokoro-82M "Vellorin traded copperchimes for rainmint at Candlehaven." --speaker_embedding_file_path ov_Kokoro-82M/voices/af_heart.bin --language en-us

Set --language to match the fallback model variant (en-us with ..._en_us-ov, en-gb with ..._en_gb-ov). OpenVINO fallback models above are an English-only feature (en-us / en-gb). For non-English Kokoro languages, phonemization is handled directly by espeak-ng as the primary G2P path (this fallback-model feature is not used).

All samples produce WAV output.

Refer to Supported Models for model details.

Text-to-speech API usage

#include "openvino/genai/speech_generation/text2speech_pipeline.hpp"

ov::genai::Text2SpeechPipeline pipe(models_path, device);

gen_speech = pipe.generate(prompt, speaker_embedding);

// Kokoro generation with an application-prepared embedding tensor
gen_speech = pipe.generate(prompt,
speaker_embedding,
ov::AnyMap{{"language", "en-us"}});

auto speech = gen_speech.speeches[0];
// speech tensor contains the waveform of the spoken phrase