Skip to main content

Text-to-speech pipeline sample

This example demonstrates how to use the openvino_genai.Text2SpeechPipeline in Python to convert input text into speech. You can specify a target voice using a speaker embedding vector that captures the desired voice characteristics. Additionally, you can choose the inference device (e.g., CPU, GPU) to control where the model runs.

Download and convert the model and tokenizers

The --upgrade-strategy eager option is needed to ensure optimum-intel is upgraded to the latest version.

Install ../../export-requirements.txt to convert a model.

pip install --upgrade-strategy eager -r ../../export-requirements.txt
optimum-cli export openvino --model microsoft/speecht5_tts --model-kwargs "{\"vocoder\": \"microsoft/speecht5_hifigan\"}" speecht5_tts

Note: Currently, text-to-speech in OpenVINO GenAI supports the SpeechT5 TTS model. When exporting the model, you must specify a vocoder using the --model-kwargs option in JSON format.

Prepare speaker embedding file

To generate speech using the SpeechT5 TTS model, you can specify a target voice by providing a speaker embedding file. This file must contain 512 32-bit floating-point values that represent the voice characteristics of the target speaker. The model will use these characteristics to synthesize the input text in the specified voice.

If no speaker embedding is provided, the model will default to a built-in speaker for speech generation.

You can generate a speaker embedding using the create_speaker_embedding.py script. This script records 5 seconds of audio from your microphone and extracts a speaker embedding vector from the recording.

To run the script:

python create_speaker_embedding.py

Run Text-to-speech sample

Follow Get Started with Samples to run the sample.

text-to-speech speecht5_tts "Hello OpenVINO GenAI" speaker_embedding.bin

It generates output_audio.wav file containing the phrase Hello OpenVINO GenAI spoken in the target voice.

See SUPPORTED_MODELS.md for the list of supported models.

Text-to-speech pipeline usage

#include "openvino/genai/speech_generation/text2speech_pipeline.hpp"

ov::genai::Text2SpeechPipeline pipe(models_path, device);
gen_speech = pipe.generate(prompt, speaker_embedding);

auto speech = gen_speech.speeches[0];
// speech tensor contains the waveform of the spoken phrase