Text-to-speech pipeline sample

This example demonstrates how to use the Text2SpeechPipeline from openvino-genai-node to convert input text into speech. The application accepts a text string, runs TTS inference, and writes the output to a WAV file using the node-wav package.

You can specify a target voice using a speaker embedding binary file that captures the desired voice characteristics. Additionally, you can choose the inference device (e.g., CPU, GPU) to control where the model runs.

Download and convert the model and tokenizers

Install the export dependencies first. The --upgrade-strategy eager option ensures that optimum-intel is upgraded to the latest version.

You do not need to install ../../export-requirements.txt for deployment if the model has already been exported.

pip install --upgrade-strategy eager -r <GENAI_ROOT_DIR>/samples/export-requirements.txt

Then, run the export with Optimum CLI:

optimum-cli export openvino --model microsoft/speecht5_tts --model-kwargs "{\"vocoder\": \"microsoft/speecht5_hifigan\"}" speecht5_tts

Note: Currently, text-to-speech in OpenVINO GenAI supports the SpeechT5 TTS model. When exporting the model, you must specify a vocoder using the --model-kwargs option in JSON format.

Prepare speaker embedding file (optional)

To generate speech using the SpeechT5 TTS model, you can specify a target voice by providing a speaker embedding file. This file must contain 512 32-bit floating-point values that represent the voice characteristics of the target speaker. The model will use these characteristics to synthesize the input text in the specified voice.

If no speaker embedding is provided, the model will default to a built-in speaker for speech generation.

You can generate a speaker embedding with the create_speaker_embedding.py script from the Python samples.
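Because the file is just 512 raw 32-bit floats (2048 bytes), it can be checked before it is passed to the sample. The following is a minimal sketch, not the sample's own code: the helper name parseSpeakerEmbedding is hypothetical, and it assumes the values are stored in the host's byte order (little-endian on common platforms), which the text above does not state.

```javascript
// Sketch: validate raw speaker-embedding bytes and view them as float32 values.
// Assumption (not stated in the README): values use the host's byte order.
const EMBEDDING_LENGTH = 512; // number of float32 values the model expects

// Hypothetical helper; accepts a Buffer or Uint8Array holding the file contents.
function parseSpeakerEmbedding(bytes) {
  const expectedBytes = EMBEDDING_LENGTH * Float32Array.BYTES_PER_ELEMENT; // 2048
  if (bytes.byteLength !== expectedBytes) {
    throw new RangeError(
      `Expected ${expectedBytes} bytes (${EMBEDDING_LENGTH} float32 values), got ${bytes.byteLength}`
    );
  }
  // Copy into a fresh buffer: a Node.js Buffer can start at an offset that is
  // not a multiple of 4, which a Float32Array view does not allow.
  const aligned = Uint8Array.from(bytes);
  return new Float32Array(aligned.buffer);
}

// Demo with synthetic data standing in for the contents of speaker_embedding.bin.
const demoBytes = new Uint8Array(new Float32Array(EMBEDDING_LENGTH).fill(0.5).buffer);
const embedding = parseSpeakerEmbedding(demoBytes);
console.log(embedding.length, embedding[0]); // 512 0.5
```

In a script, the bytes could come from readFile("speaker_embedding.bin") in node:fs/promises. A file of the wrong size fails early with a RangeError rather than during inference.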

Run

From the samples/js directory, install dependencies (if not already done):

npm install

If you use the master branch, you may need to build openvino-genai-node from source first.

Run the sample:

node speech_generation/text2speech.js speecht5_tts "Hello OpenVINO GenAI"

With a speaker embedding:

node speech_generation/text2speech.js speecht5_tts "Hello OpenVINO GenAI" --speaker_embedding speaker_embedding.bin

Optional positional argument for device (default: CPU):

node speech_generation/text2speech.js speecht5_tts "Hello OpenVINO GenAI" GPU

Custom output file path:

node speech_generation/text2speech.js speecht5_tts "Hello OpenVINO GenAI" --output my_audio.wav

Example output (the performance figures are placeholders):

[Info] Text successfully converted to audio file "output_audio.wav".

=== Performance Summary ===
Throughput              : 123.45 samples/sec.
Total Generation Time   : 1.234 sec.

Refer to the Supported Models page for more details.

Text-to-speech pipeline usage

import { writeFile } from 'node:fs/promises';
import { encode } from 'node-wav';
import { Text2SpeechPipeline } from 'openvino-genai-node';

// Directory produced by the optimum-cli export step
const modelDir = "speecht5_tts";

const pipeline = await Text2SpeechPipeline(modelDir, "CPU");
const result = await pipeline.generate("Hello OpenVINO GenAI");
// result.speeches[0] is an OpenVINO Tensor with the waveform at 16 kHz
// node-wav takes one Float32Array per channel; a single entry yields a mono file
const wavData = encode([result.speeches[0].data], { sampleRate: 16000 });
await writeFile("output_audio.wav", wavData);
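Since the waveform is sampled at 16 kHz, the length of the generated audio follows directly from the number of samples. The sketch below illustrates that arithmetic with a synthetic array in place of result.speeches[0].data; the helper name durationSeconds is hypothetical and not part of openvino-genai-node.

```javascript
// Sketch: audio duration of a mono waveform from its sample count.
// durationSeconds is a hypothetical helper, not a library function.
function durationSeconds(sampleCount, sampleRate = 16000) {
  if (!Number.isInteger(sampleCount) || sampleCount < 0) {
    throw new RangeError("sampleCount must be a non-negative integer");
  }
  if (!(sampleRate > 0)) {
    throw new RangeError("sampleRate must be positive");
  }
  return sampleCount / sampleRate;
}

// Stand-in for result.speeches[0].data: one second of silence at 16 kHz.
const waveform = new Float32Array(16000);
console.log(durationSeconds(waveform.length)); // 1
```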
