Speech Generation Using SpeechT5
Currently, the speech generation pipeline supports the SpeechT5 TTS model. The generated audio signal is a single-channel (mono) waveform with a sampling rate of 16 kHz.
Convert and Optimize Model
Download a model (e.g. speecht5_tts) and its vocoder from Hugging Face and convert them to OpenVINO format.
SpeechT5 requires specifying a vocoder via --model-kwargs:
optimum-cli export openvino --model microsoft/speecht5_tts --weight-format int4 --model-kwargs '{"vocoder":"microsoft/speecht5_hifigan"}' speecht5_tts
See all supported Speech Generation Models.
Refer to the Model Preparation guide for detailed instructions on how to download, convert and optimize models for OpenVINO GenAI.
Run Model Using OpenVINO GenAI
The Text2SpeechPipeline is the main object for generating speech from text.
It automatically loads the TTS model and vocoder from the converted model directory.
- Python
- C++
- CPU
- GPU
import openvino_genai
import soundfile as sf
pipeline = openvino_genai.Text2SpeechPipeline(model_path, "CPU")
# Generate audio using the default speaker
result = pipeline.generate("Hello OpenVINO GenAI")
# speech tensor contains the waveform of the spoken phrase
speech = result.speeches[0]
sf.write("output_audio.wav", speech.data[0], samplerate=16000)
import openvino_genai
import soundfile as sf
pipeline = openvino_genai.Text2SpeechPipeline(model_path, "GPU")
# Generate audio using the default speaker
result = pipeline.generate("Hello OpenVINO GenAI")
# speech tensor contains the waveform of the spoken phrase
speech = result.speeches[0]
sf.write("output_audio.wav", speech.data[0], samplerate=16000)
- CPU
- GPU
#include "audio_utils.hpp"
#include "openvino/genai/speech_generation/text2speech_pipeline.hpp"
int main(int argc, char* argv[]) {
std::string models_path = argv[1];
ov::genai::Text2SpeechPipeline pipeline(models_path, "CPU");
auto result = pipeline.generate("Hello OpenVINO GenAI");
auto waveform_size = result.speeches[0].get_size();
auto waveform_ptr = result.speeches[0].data<const float>();
auto bits_per_sample = result.speeches[0].get_element_type().bitwidth();
utils::audio::save_to_wav(waveform_ptr, waveform_size, "output_audio.wav", bits_per_sample);
return 0;
}
#include "audio_utils.hpp"
#include "openvino/genai/speech_generation/text2speech_pipeline.hpp"
int main(int argc, char* argv[]) {
std::string models_path = argv[1];
ov::genai::Text2SpeechPipeline pipeline(models_path, "GPU");
auto result = pipeline.generate("Hello OpenVINO GenAI");
auto waveform_size = result.speeches[0].get_size();
auto waveform_ptr = result.speeches[0].data<const float>();
auto bits_per_sample = result.speeches[0].get_element_type().bitwidth();
utils::audio::save_to_wav(waveform_ptr, waveform_size, "output_audio.wav", bits_per_sample);
return 0;
}
Use CPU or GPU as the device without any other code changes.
Additional Usage Options
Use Speaker Embedding File
To generate speech using the SpeechT5 TTS model, you can specify a target voice by providing a speaker embedding file.
This file must contain 512 32-bit floating-point values that represent the voice characteristics of the target speaker. The model will use these characteristics to synthesize the input text in the specified voice.
If no speaker embedding is provided, the model uses the default built-in speaker.
You can generate a speaker embedding using the create_speaker_embedding.py script. This script records 5 seconds of audio from your microphone and extracts a speaker embedding vector from the recording.
python create_speaker_embedding.py
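If the script is not available in your setup, the sketch below outlines one way to produce a compatible embedding file. It is only an illustration: it assumes the sounddevice, torch, and speechbrain packages, uses the speechbrain/spkrec-xvect-voxceleb x-vector model (which produces the 512-dimensional embeddings SpeechT5 expects), and writes to speaker_embedding.bin as an example file name; the bundled create_speaker_embedding.py may differ. Depending on your SpeechBrain version, the encoder may need to be imported from speechbrain.pretrained instead.
import numpy as np
import sounddevice as sd
import torch
from speechbrain.inference.speaker import EncoderClassifier
SAMPLE_RATE = 16000  # SpeechT5 operates on 16 kHz audio
DURATION_S = 5
# Record 5 seconds of mono audio from the default microphone
recording = sd.rec(DURATION_S * SAMPLE_RATE, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
sd.wait()
# Extract a 512-dimensional x-vector speaker embedding from the recording
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")
with torch.no_grad():
    embedding = encoder.encode_batch(torch.from_numpy(recording.T))
    embedding = torch.nn.functional.normalize(embedding, dim=2).squeeze()
# Store the embedding as raw 32-bit floats so it can be read back with
# np.fromfile(..., dtype=np.float32)
embedding.numpy().astype(np.float32).tofile("speaker_embedding.bin")
Pass the path of the resulting file to the pipeline as shown in the examples below.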
- Python
- C++
import openvino_genai
import openvino as ov
import numpy as np
import soundfile as sf
pipeline = openvino_genai.Text2SpeechPipeline(model_path, "CPU")
speaker_embedding_file_path = "speaker_embedding.bin"  # path to a binary file with 512 32-bit floats
speaker_embedding = np.fromfile(speaker_embedding_file_path, dtype=np.float32).reshape(1, 512)
speaker_embedding = ov.Tensor(speaker_embedding)
result = pipeline.generate("Hello OpenVINO GenAI", speaker_embedding)
speech = result.speeches[0]
sf.write("output_audio.wav", speech.data[0], samplerate=16000)
#include "openvino/genai/speech_generation/text2speech_pipeline.hpp"
#include "audio_utils.hpp"
int main(int argc, char* argv[]) {
std::string model_path = argv[1];
std::string speaker_embedding_path = argv[2];  // binary file with 512 32-bit floats
ov::genai::Text2SpeechPipeline pipeline(model_path, "CPU");
auto speaker_embedding = utils::audio::read_speaker_embedding(speaker_embedding_path);
auto result = pipeline.generate("Hello OpenVINO GenAI", speaker_embedding);
auto waveform_size = result.speeches[0].get_size();
auto waveform_ptr = result.speeches[0].data<const float>();
auto bits_per_sample = result.speeches[0].get_element_type().bitwidth();
utils::audio::save_to_wav(waveform_ptr, waveform_size, "output_audio.wav", bits_per_sample);
return 0;
}