Text-to-speech Python samples

This folder contains Python examples for openvino_genai.Text2SpeechPipeline:

text2speech.py: basic text → audio generation (SpeechT5 and Kokoro)
kokoro_phonemize_fallback.py: Kokoro unknown-word fallback behavior

Supported Models

SpeechT5
- Requires exported SpeechT5 model and vocoder.
- Usually uses a speaker embedding file.

Kokoro
- Uses a Kokoro model directory.
- Uses --speaker_embedding_file_path and --language options.
- End-to-end Kokoro language support includes:
  - en-us (English, United States)
  - en-gb (English, United Kingdom)
  - es (Spanish)
  - fr-fr (French, France)
  - hi (Hindi)
  - it (Italian)
  - pt-br (Portuguese, Brazil)
- Not yet supported for end-to-end text-to-speech generation in this flow: ja (Japanese), zh (Chinese/Mandarin).
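
The language list above can be turned into an early check in application code, so that an unsupported code fails before the pipeline is created. This is a minimal sketch: the helper name `check_kokoro_language` and the two dictionaries are hypothetical and not part of openvino_genai, and the lowercasing is an application-side convenience rather than documented pipeline behavior.

```python
# Codes copied from the list above. The helper is hypothetical and is not
# part of openvino_genai.
SUPPORTED_KOKORO_LANGUAGES = {
    "en-us": "English, United States",
    "en-gb": "English, United Kingdom",
    "es": "Spanish",
    "fr-fr": "French, France",
    "hi": "Hindi",
    "it": "Italian",
    "pt-br": "Portuguese, Brazil",
}
NOT_YET_SUPPORTED = {"ja": "Japanese", "zh": "Chinese/Mandarin"}


def check_kokoro_language(code: str) -> str:
    """Return the normalized language code, or raise ValueError."""
    normalized = code.strip().lower()
    if normalized in NOT_YET_SUPPORTED:
        raise ValueError(
            f"{normalized} ({NOT_YET_SUPPORTED[normalized]}) is not yet "
            "supported end to end in this flow"
        )
    if normalized not in SUPPORTED_KOKORO_LANGUAGES:
        known = ", ".join(sorted(SUPPORTED_KOKORO_LANGUAGES))
        raise ValueError(
            f"unknown language {normalized!r}; expected one of: {known}"
        )
    return normalized
```

Calling `check_kokoro_language(args.language)` right after argument parsing would be one place to use it.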

Install dependencies

Export-time deps:

pip install --upgrade-strategy eager -r ../../export-requirements.txt

Runtime deps:

pip install -r ../../deployment-requirements.txt

SpeechT5 setup

Export SpeechT5 with vocoder:

optimum-cli export openvino --model microsoft/speecht5_tts --model-kwargs "{\"vocoder\": \"microsoft/speecht5_hifigan\"}" speecht5_tts

Create a speaker embedding file (SpeechT5-specific):

python create_speaker_embedding.py

Kokoro setup

pip install --upgrade-strategy eager -r ../../export-requirements.txt
pip install kokoro
optimum-cli export openvino -m hexgrad/Kokoro-82M ov_Kokoro-82M --trust-remote-code

Note: After the export completes, the available speaker embedding .bin files are in ov_Kokoro-82M/voices.
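
To see which voices an export produced, the voices folder can be listed from Python. This is a sketch under the assumption that every voice is a single .bin file directly inside `<model_dir>/voices`, as the note above describes. The helper names `list_voices` and `voice_path` are hypothetical.

```python
from pathlib import Path
from typing import List


def list_voices(model_dir: str) -> List[str]:
    """Return voice names (file stems) found in <model_dir>/voices."""
    voices_dir = Path(model_dir) / "voices"
    if not voices_dir.is_dir():
        raise FileNotFoundError(f"no voices directory at {voices_dir}")
    return sorted(p.stem for p in voices_dir.glob("*.bin"))


def voice_path(model_dir: str, voice: str) -> str:
    """Return the path to one voice file, checking that it exists."""
    path = Path(model_dir) / "voices" / f"{voice}.bin"
    if not path.is_file():
        raise FileNotFoundError(f"voice file not found: {path}")
    return str(path)
```

The string returned by `voice_path("ov_Kokoro-82M", "af_heart")` can be passed as the value of --speaker_embedding_file_path.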

Use of `espeak-ng` within the Kokoro Pipeline

Within the Kokoro Text-to-Speech pipeline, espeak-ng is an external dependency used for the grapheme-to-phoneme (G2P) stage. Its role varies depending on the selected language:

English (en-us, en-gb): espeak-ng is used as a fallback for words that are not found in the built-in dictionary. See the kokoro_phonemize_fallback sample for an example of using an OpenVINO-based fallback model to avoid relying on espeak-ng for English.
Non-English (es, fr-fr, hi, it, pt-br): espeak-ng serves as the primary G2P (phonemization) engine. As such, it must be installed to enable end-to-end text-to-speech generation for these languages.

Note: espeak-ng is licensed under GPLv3 and must be installed separately. OpenVINO GenAI detects its presence automatically at runtime.

To install espeak-ng, follow the official guide: https://github.com/espeak-ng/espeak-ng/blob/master/docs/guide.md
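
The per-language role of espeak-ng described above can be expressed as a small preflight check. This is an illustrative sketch, not the detection logic OpenVINO GenAI uses: the function names are hypothetical, and looking for an `espeak-ng` executable on PATH is an assumption that may not match how the library is found at runtime.

```python
import shutil

ENGLISH_LANGUAGES = {"en-us", "en-gb"}
ESPEAK_PRIMARY_LANGUAGES = {"es", "fr-fr", "hi", "it", "pt-br"}


def espeak_ng_role(language: str, ov_fallback_configured: bool = False) -> str:
    """Describe how espeak-ng is used for a Kokoro language.

    Returns "primary" (it performs G2P), "fallback" (it handles words missing
    from the built-in dictionary), or "unused" (English with an OpenVINO
    fallback model configured).
    """
    if language in ESPEAK_PRIMARY_LANGUAGES:
        return "primary"
    if language in ENGLISH_LANGUAGES:
        return "unused" if ov_fallback_configured else "fallback"
    raise ValueError(f"unsupported Kokoro language: {language!r}")


def espeak_ng_on_path() -> bool:
    # Assumption: only checks for an executable named "espeak-ng" on PATH.
    return shutil.which("espeak-ng") is not None


if __name__ == "__main__":
    for lang in ("en-us", "es"):
        print(lang, espeak_ng_role(lang), "espeak-ng on PATH:", espeak_ng_on_path())
```

An application could warn the user when the role is "primary" and `espeak_ng_on_path()` returns False.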

Run samples

1) `text2speech.py`

SpeechT5:

python text2speech.py --speaker_embedding_file_path speaker_embedding.bin speecht5_tts "Hello from OpenVINO GenAI"

Kokoro:

python text2speech.py --speaker_embedding_file_path ov_Kokoro-82M/voices/af_heart.bin --language en-us ov_Kokoro-82M "Hello, and welcome to speech generation using OpenVINO GenAI."

Kokoro (non-English):

python text2speech.py --speaker_embedding_file_path ov_Kokoro-82M/voices/ef_dora.bin --language es ov_Kokoro-82M "Hola y bienvenidos a la generación de voz utilizando OpenVINO GenAI."

Speed control (Kokoro example):

python text2speech.py --speaker_embedding_file_path ov_Kokoro-82M/voices/af_heart.bin --language en-us --speed 1.15 ov_Kokoro-82M "Hello from OpenVINO GenAI with a faster speaking rate."

2) `kokoro_phonemize_fallback.py` (Kokoro only)

This sample demonstrates how to use an OpenVINO-based fallback model for phonemization, so that English (en-us, en-gb) generation does not rely on espeak-ng.

Why use a fallback model instead of espeak-ng?

espeak-ng provides robust phonemization, but it is an external dependency with its own licensing requirements (GPLv3). An OpenVINO-based fallback model removes that dependency for English, which allows a more self-contained and uniformly licensed deployment.

Export OV fallback models:

US:

optimum-cli export openvino --model PeterReid/graphemes_to_phonemes_en_us --task text2text-generation graphemes_to_phonemes_en_us-ov

GB:

optimum-cli export openvino --model PeterReid/graphemes_to_phonemes_en_gb --task text2text-generation graphemes_to_phonemes_en_gb-ov

Run using fallback models:

US model + en-us:

python kokoro_phonemize_fallback.py ov_Kokoro-82M "Vellorin traded copperchimes for rainmint at Candlehaven." --speaker_embedding_file_path ov_Kokoro-82M/voices/af_heart.bin --language en-us --phonemize_fallback_model_dir graphemes_to_phonemes_en_us-ov

GB model + en-gb:

python kokoro_phonemize_fallback.py ov_Kokoro-82M "Vellorin traded copperchimes for rainmint at Candlehaven." --speaker_embedding_file_path ov_Kokoro-82M/voices/bf_emma.bin --language en-gb --phonemize_fallback_model_dir graphemes_to_phonemes_en_gb-ov

Use default espeak-ng fallback (omit --phonemize_fallback_model_dir):

python kokoro_phonemize_fallback.py ov_Kokoro-82M "Vellorin traded copperchimes for rainmint at Candlehaven." --speaker_embedding_file_path ov_Kokoro-82M/voices/af_heart.bin --language en-us

Set --language to match the fallback model variant (en-us with ..._en_us-ov, en-gb with ..._en_gb-ov). The OpenVINO fallback models above are English-only (en-us, en-gb). For non-English Kokoro languages, espeak-ng handles phonemization directly as the primary G2P path, and the fallback-model option is not used.
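
The pairing rule between --language and the fallback model can be kept in one place when the sample is launched from a script. This is a sketch that uses only the flags and directory names shown in the commands above. The helper names are hypothetical, and the directory names assume the export commands were run unchanged.

```python
import sys
from typing import List, Optional

# Export output directories from the commands above, keyed by language.
FALLBACK_MODEL_DIRS = {
    "en-us": "graphemes_to_phonemes_en_us-ov",
    "en-gb": "graphemes_to_phonemes_en_gb-ov",
}


def fallback_model_dir_for(language: str) -> Optional[str]:
    """Return the matching fallback model directory, or None for non-English."""
    return FALLBACK_MODEL_DIRS.get(language)


def build_fallback_command(
    model_dir: str,
    text: str,
    voice_file: str,
    language: str,
    use_ov_fallback: bool = True,
) -> List[str]:
    """Build the argument list for kokoro_phonemize_fallback.py."""
    cmd = [
        sys.executable, "kokoro_phonemize_fallback.py", model_dir, text,
        "--speaker_embedding_file_path", voice_file,
        "--language", language,
    ]
    fallback_dir = fallback_model_dir_for(language) if use_ov_fallback else None
    if fallback_dir is not None:
        # Omitting this flag leaves the default espeak-ng fallback in place.
        cmd += ["--phonemize_fallback_model_dir", fallback_dir]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the samples folder.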

All samples produce WAV output.
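
A quick way to confirm that a run produced usable audio is to read the WAV header with the Python standard library. This sketch assumes the file is PCM-encoded, since the `wave` module does not read other WAV encodings. The helper name `describe_wav` is hypothetical, and the path is whatever file the sample wrote.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }
```

Printing the result for the generated file shows its sample rate and duration without opening an audio player.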

Refer to Supported Models for model details.

Text-to-speech API usage

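import openvino_genai

model_dir = "speecht5_tts"  # exported model directory (or "ov_Kokoro-82M" for Kokoro)
device = "CPU"              # target OpenVINO device
# speaker_embedding: embedding tensor prepared by the application from a speaker embedding file

pipe = openvino_genai.Text2SpeechPipeline(model_dir, device)

# Basic generation
result = pipe.generate("Hello OpenVINO GenAI", speaker_embedding)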

# Kokoro generation with an application-prepared embedding tensor
result = pipe.generate("Hello from Kokoro", speaker_embedding, language="en-us")

# Kokoro unknown-word fallback via config
cfg = pipe.get_generation_config()
cfg.phonemize_fallback_model_dir = "graphemes_to_phonemes_en_us-ov"  # set -> OV fallback
# cfg.phonemize_fallback_model_dir = None  # unset -> espeak-ng fallback
pipe.set_generation_config(cfg)
result = pipe.generate("Vellorin traded copperchimes for rainmint at Candlehaven.", speaker_embedding, language="en-us")
