# Whisper automatic speech recognition sample

This example showcases inference of speech recognition Whisper models. The application has few configuration options to encourage the reader to explore and modify the source code; for example, you can change the inference device to GPU. The sample features `ov::genai::WhisperPipeline` and uses an audio file in WAV format as the input source.
## Download and convert the model and tokenizers

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version. Installing `../../export-requirements.txt` is not required for deployment if the model has already been exported.

```sh
pip install --upgrade-strategy eager -r ../../requirements.txt
optimum-cli export openvino --trust-remote-code --model openai/whisper-base whisper-base
```
If NPU is the inference device, the additional option `--disable-stateful` is required. See NPU with OpenVINO GenAI for details.
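For example, based on the export command above, an NPU-ready export would look like this (the flag placement is illustrative; `optimum-cli` accepts options in any order):

```sh
# Same export as above, with stateful model representation disabled for NPU
optimum-cli export openvino --trust-remote-code --disable-stateful --model openai/whisper-base whisper-base
```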
## Prepare audio file

Prepare an audio file in WAV format with a 16 kHz sampling rate.

You can download an example audio file: https://storage.openvinotoolkit.org/models_contrib/speech/2021.2/librispeech_s5/how_are_you_doing_today.wav
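If your recording has a different format or sampling rate, you can convert it with ffmpeg (a sketch, assuming ffmpeg is installed; the input file name is illustrative):

```sh
# Convert to 16 kHz mono WAV, which the pipeline expects
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
```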
## Run

Follow Get Started with Samples to run the sample.

```sh
whisper_speech_recognition whisper-base how_are_you_doing_today.wav
```

Output:

```text
How are you doing today?
timestamps: [0, 2] text: How are you doing today?
```
See SUPPORTED_MODELS.md for the list of supported models.
## Whisper pipeline usage

```cpp
#include "openvino/genai/whisper_pipeline.hpp"

ov::genai::WhisperPipeline pipeline(model_dir, "CPU");
// The pipeline expects normalized audio with a sample rate of 16 kHz
ov::genai::RawSpeechInput raw_speech = read_wav("how_are_you_doing_today.wav");
auto result = pipeline.generate(raw_speech);
// How are you doing today?
```
### Transcription

The Whisper pipeline automatically predicts the language of the source audio.

```cpp
ov::genai::RawSpeechInput raw_speech = read_wav("how_are_you_doing_today.wav");
auto result = pipeline.generate(raw_speech);
// How are you doing today?

raw_speech = read_wav("fr_sample.wav");
result = pipeline.generate(raw_speech);
// Il s'agit d'une entité très complexe qui consiste...
```
If the source audio language is known in advance, it can be specified as an argument to the `generate` method:

```cpp
ov::genai::RawSpeechInput raw_speech = read_wav("how_are_you_doing_today.wav");
auto result = pipeline.generate(raw_speech, ov::genai::language("<|en|>"));
// How are you doing today?

raw_speech = read_wav("fr_sample.wav");
result = pipeline.generate(raw_speech, ov::genai::language("<|fr|>"));
// Il s'agit d'une entité très complexe qui consiste...
```
### Translation

By default, Whisper performs the task of speech transcription, where the source audio language is the same as the target text language. To perform speech translation, where the target text is in English, set the task to "translate":

```cpp
ov::genai::RawSpeechInput raw_speech = read_wav("fr_sample.wav");
auto result = pipeline.generate(raw_speech, ov::genai::task("translate"));
// It is a very complex entity that consists...
```
### Timestamps prediction

The model can predict timestamps. For sentence-level timestamps, pass the `return_timestamps` argument:

```cpp
ov::genai::RawSpeechInput raw_speech = read_wav("how_are_you_doing_today.wav");
auto result = pipeline.generate(raw_speech, ov::genai::return_timestamps(true));

std::cout << std::setprecision(2);
for (auto& chunk : *result.chunks) {
    std::cout << "timestamps: [" << chunk.start_ts << ", " << chunk.end_ts << "] text: " << chunk.text << "\n";
}
// timestamps: [0, 2] text: How are you doing today?
```
### Long-form audio transcription

The Whisper model is designed to work on audio samples of up to 30 seconds in duration. The Whisper pipeline transcribes audio samples of arbitrary length using a sequential chunking algorithm, which applies a "sliding window" and transcribes 30-second slices one after another.
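No extra configuration is needed for long inputs. The sketch below (the file name is hypothetical) combines long-form transcription with timestamp prediction to show the chunk boundaries produced by each window:

```cpp
// Recordings longer than 30 seconds are handled transparently by the
// sequential chunking algorithm; no special arguments are needed.
ov::genai::RawSpeechInput raw_speech = read_wav("one_hour_lecture.wav");  // hypothetical long file
auto result = pipeline.generate(raw_speech, ov::genai::return_timestamps(true));

// Chunks cover the whole recording, not just the first 30-second window
for (auto& chunk : *result.chunks) {
    std::cout << "[" << chunk.start_ts << ", " << chunk.end_ts << "] " << chunk.text << "\n";
}
```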
### Initial prompt and hotwords

The Whisper pipeline has `initial_prompt` and `hotwords` generate arguments:

- `initial_prompt`: initial prompt tokens passed as a previous transcription (after the `<|startofprev|>` token) to the first processing window
- `hotwords`: hotwords tokens passed as a previous transcription (after the `<|startofprev|>` token) to all processing windows
The Whisper model can use that context to better understand the speech and maintain a consistent writing style. However, prompts do not need to be genuine transcripts from prior audio segments. Such prompts can be used to steer the model to use particular spellings or styles:
```cpp
auto result = pipeline.generate(raw_speech);
// He has gone and gone for good answered Paul Icrom who...

result = pipeline.generate(raw_speech, ov::genai::initial_prompt("Polychrome"));
// He has gone and gone for good answered Polychrome who...
```
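A minimal sketch of the `hotwords` argument, which applies the same context to every processing window rather than only the first one (the hotword here is illustrative, reusing the example above):

```cpp
// Unlike initial_prompt, hotwords are passed to all processing windows,
// steering spellings like "Polychrome" across the whole recording.
result = pipeline.generate(raw_speech, ov::genai::hotwords("Polychrome"));
```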
## Troubleshooting

### Empty or rubbish output

Example output:

```text
----------------
```

To resolve this, ensure that the audio data has a 16 kHz sampling rate (see the ffmpeg resampling example above).