Text-to-speech Python samples
This folder contains Python examples for openvino_genai.Text2SpeechPipeline:
- text2speech.py: basic text → audio generation (SpeechT5 and Kokoro)
- kokoro_phonemize_fallback.py: Kokoro unknown-word fallback behavior
Supported Models
- SpeechT5
  - Requires exported SpeechT5 model and vocoder.
  - Usually uses a speaker embedding file.
- Kokoro
  - Uses a Kokoro model directory.
  - Uses --speaker_embedding_file_path and --language options.
  - End-to-end Kokoro language support includes:
    - en-us (English, United States)
    - en-gb (English, United Kingdom)
    - es (Spanish)
    - fr-fr (French, France)
    - hi (Hindi)
    - it (Italian)
    - pt-br (Portuguese, Brazil)
  - Not yet supported for end-to-end text generation in this flow: ja (Japanese), zh (Chinese/Mandarin).
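As an illustration of the language lists above, an application could validate a language code before it reaches the pipeline. The sketch below is a hypothetical helper, not part of openvino_genai: the names KOKORO_SUPPORTED_LANGUAGES, KOKORO_UNSUPPORTED_LANGUAGES, and check_kokoro_language are assumptions made for this example, and the tables simply mirror this README.

```python
# Hypothetical helper: mirrors the language lists in this README.
# It is NOT part of the openvino_genai API.
KOKORO_SUPPORTED_LANGUAGES = {
    "en-us": "English, United States",
    "en-gb": "English, United Kingdom",
    "es": "Spanish",
    "fr-fr": "French, France",
    "hi": "Hindi",
    "it": "Italian",
    "pt-br": "Portuguese, Brazil",
}
KOKORO_UNSUPPORTED_LANGUAGES = {
    "ja": "Japanese",
    "zh": "Chinese/Mandarin",
}


def check_kokoro_language(code: str) -> str:
    """Return the normalized language code, or raise ValueError."""
    normalized = code.strip().lower()
    if normalized in KOKORO_SUPPORTED_LANGUAGES:
        return normalized
    if normalized in KOKORO_UNSUPPORTED_LANGUAGES:
        name = KOKORO_UNSUPPORTED_LANGUAGES[normalized]
        raise ValueError(
            f"'{normalized}' ({name}) is not yet supported end-to-end in this flow"
        )
    known = ", ".join(sorted(KOKORO_SUPPORTED_LANGUAGES))
    raise ValueError(f"unknown language '{code}'; expected one of: {known}")
```

Calling check_kokoro_language(" EN-US ") returns "en-us", while "ja" or an unlisted code raises ValueError with a message that distinguishes the two cases.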
Install dependencies
- Export-time deps:
  pip install --upgrade-strategy eager -r ../../export-requirements.txt
- Runtime deps:
  pip install -r ../../deployment-requirements.txt
SpeechT5 setup
Export SpeechT5 with vocoder:
optimum-cli export openvino --model microsoft/speecht5_tts --model-kwargs "{\"vocoder\": \"microsoft/speecht5_hifigan\"}" speecht5_tts
Create a speaker embedding file (SpeechT5-specific):
python create_speaker_embedding.py
Kokoro setup
pip install --upgrade-strategy eager -r ../../export-requirements.txt
pip install kokoro
optimum-cli export openvino -m hexgrad/Kokoro-82M ov_Kokoro-82M --trust-remote-code
Note: After export is complete, you will find the available speaker embedding .bin files in ov_Kokoro-82M/voices.
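To see which voices an export produced, the voices directory can be listed with the standard library. This is a minimal sketch: list_voices is a hypothetical helper name, and it assumes only what the note above states, namely that the embeddings are .bin files under the voices folder of the exported model directory.

```python
from pathlib import Path


def list_voices(model_dir: str) -> list[str]:
    """Return sorted voice names (file stems) found in <model_dir>/voices."""
    voices_dir = Path(model_dir) / "voices"
    if not voices_dir.is_dir():
        # Export not run yet, or a different directory layout.
        return []
    return sorted(p.stem for p in voices_dir.glob("*.bin"))


# Example usage (prints nothing if the directory does not exist):
# for name in list_voices("ov_Kokoro-82M"):
#     print(name)
```

Each returned name corresponds to a file that can be passed to --speaker_embedding_file_path, for example ov_Kokoro-82M/voices/af_heart.bin.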
Use of espeak-ng within the Kokoro Pipeline
Within the Kokoro Text-to-Speech pipeline, espeak-ng is an external dependency used for the grapheme-to-phoneme (G2P) stage. Its role varies depending on the selected language:
- English (en-us, en-gb): espeak-ng is used as a fallback for words that are not found in the built-in dictionary. See the kokoro_phonemize_fallback sample for an example of using an OpenVINO-based fallback model to avoid relying on espeak-ng for English.
- Non-English (es, fr-fr, hi, it, pt-br): espeak-ng serves as the primary G2P (phonemization) engine. As such, it must be installed to enable end-to-end text-to-speech generation for these languages.
Note: espeak-ng is licensed under GPLv3 and must be installed separately. OpenVINO GenAI detects its presence automatically at runtime.
To install espeak-ng, follow the official guide:
https://github.com/espeak-ng/espeak-ng/blob/master/docs/guide.md
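An application may want to fail early when espeak-ng is needed but missing. The sketch below encodes the per-language roles described above and performs an application-side pre-check. It is not how OpenVINO GenAI detects espeak-ng: looking for an espeak-ng executable on PATH is an assumption made for this example, and the helper names (espeak_role, espeak_on_path, precheck) are hypothetical.

```python
import shutil

# Roles as described in this README.
ESPEAK_PRIMARY_LANGUAGES = {"es", "fr-fr", "hi", "it", "pt-br"}
ESPEAK_FALLBACK_LANGUAGES = {"en-us", "en-gb"}


def espeak_role(language: str) -> str:
    """Return 'primary', 'fallback', or 'unknown' for a Kokoro language code."""
    code = language.strip().lower()
    if code in ESPEAK_PRIMARY_LANGUAGES:
        return "primary"
    if code in ESPEAK_FALLBACK_LANGUAGES:
        return "fallback"
    return "unknown"


def espeak_on_path() -> bool:
    """Assumption: an 'espeak-ng' executable on PATH indicates an installation."""
    return shutil.which("espeak-ng") is not None


def precheck(language, has_fallback_model=False, espeak_available=None):
    """Return a warning string, or None if nothing looks missing."""
    if espeak_available is None:
        espeak_available = espeak_on_path()
    if espeak_available:
        return None
    role = espeak_role(language)
    if role == "primary":
        return f"espeak-ng not found: it is the primary G2P engine for '{language}'."
    if role == "fallback" and not has_fallback_model:
        return (
            f"espeak-ng not found and no fallback model given: "
            f"out-of-dictionary words in '{language}' have no fallback configured."
        )
    return None
```

The espeak_available parameter exists so the decision logic can be exercised without touching the local system.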
Run samples
1) text2speech.py
SpeechT5:
python text2speech.py --speaker_embedding_file_path speaker_embedding.bin speecht5_tts "Hello from OpenVINO GenAI"
Kokoro:
python text2speech.py --speaker_embedding_file_path ov_Kokoro-82M/voices/af_heart.bin --language en-us ov_Kokoro-82M "Hello, and welcome to speech generation using OpenVINO GenAI."
Kokoro (non-English):
python text2speech.py --speaker_embedding_file_path ov_Kokoro-82M/voices/ef_dora.bin --language es ov_Kokoro-82M "Hola y bienvenidos a la generación de voz utilizando OpenVINO GenAI."
Kokoro (speed control):
python text2speech.py --speaker_embedding_file_path ov_Kokoro-82M/voices/af_heart.bin --language en-us --speed 1.15 ov_Kokoro-82M "Hello from OpenVINO GenAI with a faster speaking rate."
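When several of the commands above need to be scripted, building the argument list in Python avoids shell-quoting problems with the input text. This is a sketch only: build_text2speech_cmd is a hypothetical helper, and it uses just the options that appear in the commands above (--speaker_embedding_file_path, --language, --speed) followed by the model directory and the text.

```python
import sys


def build_text2speech_cmd(model_dir, text, speaker_embedding_file_path,
                          language=None, speed=None, script="text2speech.py"):
    """Build an argv list equivalent to the text2speech.py commands above."""
    cmd = [sys.executable, script,
           "--speaker_embedding_file_path", str(speaker_embedding_file_path)]
    if language is not None:
        cmd += ["--language", language]
    if speed is not None:
        cmd += ["--speed", str(speed)]
    # Positional arguments: model directory, then the text to synthesize.
    cmd += [str(model_dir), text]
    return cmd


# Example usage (requires the sample script and an exported model):
# import subprocess
# subprocess.run(build_text2speech_cmd(
#     "ov_Kokoro-82M", "Hello from OpenVINO GenAI",
#     "ov_Kokoro-82M/voices/af_heart.bin", language="en-us", speed=1.15),
#     check=True)
```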
2) kokoro_phonemize_fallback.py (Kokoro only)
This sample demonstrates how to use an OpenVINO-based fallback model for phonemization, so that English variants (en-us, en-gb) do not need to rely on espeak-ng.
Why use a fallback model instead of espeak-ng?
While espeak-ng provides robust phonemization, an OpenVINO-based fallback model removes the external dependency and the need to account for its licensing requirements (GPLv3). The result is a more self-contained and uniformly licensed deployment.
Export OV fallback models:
US:
optimum-cli export openvino --model PeterReid/graphemes_to_phonemes_en_us --task text2text-generation graphemes_to_phonemes_en_us-ov
GB:
optimum-cli export openvino --model PeterReid/graphemes_to_phonemes_en_gb --task text2text-generation graphemes_to_phonemes_en_gb-ov
Run using fallback models:
US model + en-us:
python kokoro_phonemize_fallback.py ov_Kokoro-82M "Vellorin traded copperchimes for rainmint at Candlehaven." --speaker_embedding_file_path ov_Kokoro-82M/voices/af_heart.bin --language en-us --phonemize_fallback_model_dir graphemes_to_phonemes_en_us-ov
GB model + en-gb:
python kokoro_phonemize_fallback.py ov_Kokoro-82M "Vellorin traded copperchimes for rainmint at Candlehaven." --speaker_embedding_file_path ov_Kokoro-82M/voices/bf_emma.bin --language en-gb --phonemize_fallback_model_dir graphemes_to_phonemes_en_gb-ov
Use default espeak-ng fallback (omit --phonemize_fallback_model_dir):
python kokoro_phonemize_fallback.py ov_Kokoro-82M "Vellorin traded copperchimes for rainmint at Candlehaven." --speaker_embedding_file_path ov_Kokoro-82M/voices/af_heart.bin --language en-us
Set --language to match the fallback model variant (en-us with ..._en_us-ov, en-gb with ..._en_gb-ov).
OpenVINO fallback models above are an English-only feature (en-us / en-gb). For non-English Kokoro languages, phonemization is handled directly by espeak-ng as the primary G2P path (this fallback-model feature is not used).
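The pairing rule above can be checked before running the sample. The sketch below is a hypothetical helper (fallback_matches_language is not part of the samples) and it relies on the output directory names used in the export commands in this README; a fallback model exported under a different name would need a different check.

```python
from pathlib import Path

# Export directory names used in this README, keyed by language code.
FALLBACK_DIR_BY_LANGUAGE = {
    "en-us": "graphemes_to_phonemes_en_us-ov",
    "en-gb": "graphemes_to_phonemes_en_gb-ov",
}


def fallback_matches_language(language: str, fallback_model_dir: str) -> bool:
    """Return True if the fallback model directory matches the language variant."""
    expected = FALLBACK_DIR_BY_LANGUAGE.get(language.strip().lower())
    if expected is None:
        # Fallback models are an English-only feature (en-us / en-gb).
        return False
    # Compare only the final path component, so parent folders do not matter.
    return Path(fallback_model_dir).name == expected
```

For example, en-us paired with models/graphemes_to_phonemes_en_us-ov passes, while en-us paired with the en_gb directory, or any non-English language, does not.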
All samples produce WAV output.
Refer to Supported Models for model details.
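Since the samples write WAV output, a quick sanity check on a generated file can be done with the standard library. This sketch assumes the file is PCM-encoded, which is what Python's wave module can read; wav_summary is a hypothetical helper, and the path passed to it is whatever output file the sample produced.

```python
import wave


def wav_summary(path: str) -> dict:
    """Return sample rate, channel count, and duration (seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }
```

A duration of zero or an unexpected channel count is a cheap signal that generation did not produce the intended audio.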
Text-to-speech API usage
import openvino_genai

# model_dir, device (for example "CPU"), and speaker_embedding are prepared by the application.
pipe = openvino_genai.Text2SpeechPipeline(model_dir, device)

# Basic generation
result = pipe.generate("Hello OpenVINO GenAI", speaker_embedding)

# Kokoro generation with an application-prepared embedding tensor
result = pipe.generate("Hello from Kokoro", speaker_embedding, language="en-us")

# Kokoro unknown-word fallback via config
cfg = pipe.get_generation_config()
cfg.phonemize_fallback_model_dir = "graphemes_to_phonemes_en_us-ov"  # set -> OV fallback
# cfg.phonemize_fallback_model_dir = None  # unset -> espeak-ng fallback
pipe.set_generation_config(cfg)
result = pipe.generate("Vellorin traded copperchimes for rainmint at Candlehaven.", speaker_embedding, language="en-us")