Speech Generation Using SpeechT5

Note

Currently, the speech generation pipeline supports the SpeechT5 TTS model. The generated audio signal is a single-channel (mono) waveform with a sampling rate of 16 kHz.

Convert and Optimize Model

Download the model (e.g. speecht5_tts) and its vocoder from Hugging Face and convert them to OpenVINO format. SpeechT5 requires the vocoder to be specified via --model-kwargs:

optimum-cli export openvino --model microsoft/speecht5_tts --weight-format int4 --model-kwargs '{"vocoder":"microsoft/speecht5_hifigan"}' speecht5_tts

See all supported Speech Generation Models.

info

Refer to the Model Preparation guide for detailed instructions on how to download, convert and optimize models for OpenVINO GenAI.

Run Model Using OpenVINO GenAI

The Text2SpeechPipeline is the main object for generating speech from text. It automatically loads the TTS model and vocoder from the converted model directory.

import openvino_genai
import soundfile as sf

# Path to the converted model directory (e.g. the output of the optimum-cli command above)
model_path = "speecht5_tts"

pipeline = openvino_genai.Text2SpeechPipeline(model_path, "CPU")

# Generate audio using the default built-in speaker
result = pipeline.generate("Hello OpenVINO GenAI")
# The speech tensor contains the waveform of the spoken phrase
speech = result.speeches[0]
sf.write("output_audio.wav", speech.data[0], samplerate=16000)

tip

Use CPU or GPU as the device without any other code changes.
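
For example, the same pipeline can target a GPU by changing only the device string. A minimal sketch, reusing the speecht5_tts directory exported above:

import openvino_genai

# Identical code path; only the device name differs from the CPU example above
pipeline = openvino_genai.Text2SpeechPipeline("speecht5_tts", "GPU")
result = pipeline.generate("Hello OpenVINO GenAI")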

Additional Usage Options

tip

Check out Python and C++ speech generation samples.

Use Speaker Embedding File

To generate speech using the SpeechT5 TTS model, you can specify a target voice by providing a speaker embedding file.

This file must contain 512 32-bit floating-point values that represent the voice characteristics of the target speaker. The model will use these characteristics to synthesize the input text in the specified voice.

If no speaker embedding is provided, the model uses the default built-in speaker.
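
Before passing an embedding file to the pipeline, you can check that it matches this layout. A minimal sketch, assuming the file is named speaker_embedding.bin (an example name, not a fixed convention):

import numpy as np

# The pipeline expects exactly 512 float32 values describing the target voice
embedding = np.fromfile("speaker_embedding.bin", dtype=np.float32)
assert embedding.size == 512, f"expected 512 values, got {embedding.size}"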

You can generate a speaker embedding using the create_speaker_embedding.py script. This script records 5 seconds of audio from your microphone and extracts a speaker embedding vector from the recording.

python create_speaker_embedding.py

Pass the resulting embedding file to the pipeline when generating speech:

import openvino_genai
import openvino as ov
import numpy as np
import soundfile as sf

# Converted model directory and the binary speaker embedding file
# (adjust the embedding path to the file produced by the script above)
model_path = "speecht5_tts"
speaker_embedding_file_path = "speaker_embedding.bin"

pipeline = openvino_genai.Text2SpeechPipeline(model_path, "CPU")

# Load the 512 float32 values describing the target voice and wrap them in an OpenVINO tensor
speaker_embedding = np.fromfile(speaker_embedding_file_path, dtype=np.float32).reshape(1, 512)
speaker_embedding = ov.Tensor(speaker_embedding)
result = pipeline.generate("Hello OpenVINO GenAI", speaker_embedding)

speech = result.speeches[0]
sf.write("output_audio.wav", speech.data[0], samplerate=16000)