Speech Recognition Using Whisper
Convert and Optimize Model
Download and convert a model (e.g. `openai/whisper-base`) from Hugging Face to OpenVINO format:
- Default
- INT8 Static Quantization
optimum-cli export openvino --model openai/whisper-base --trust-remote-code whisper_ov
optimum-cli export openvino --model openai/whisper-base --quant-mode int8 --dataset librispeech --num-samples 32 --trust-remote-code whisper_ov_int8
See all supported Speech Recognition Models.
Refer to the Model Preparation guide for detailed instructions on how to download, convert and optimize models for OpenVINO GenAI.
Run Model Using OpenVINO GenAI
OpenVINO GenAI provides the WhisperPipeline for inference of Whisper speech recognition models.
You can construct it directly from the folder containing the converted model.
It automatically loads the model, tokenizer, detokenizer, and default generation configuration.
WhisperPipeline expects normalized audio in WAV format with a 16 kHz sampling rate as input.
- Python
- C++
- JavaScript
- CPU
- GPU
import openvino_genai as ov_genai
import librosa
def read_wav(filepath):
    raw_speech, samplerate = librosa.load(filepath, sr=16000)
    return raw_speech.tolist()
raw_speech = read_wav('sample.wav')
pipe = ov_genai.WhisperPipeline(model_path, "CPU")
result = pipe.generate(raw_speech, max_new_tokens=100)
print(result)
import openvino_genai as ov_genai
import librosa
def read_wav(filepath):
    raw_speech, samplerate = librosa.load(filepath, sr=16000)
    return raw_speech.tolist()
raw_speech = read_wav('sample.wav')
pipe = ov_genai.WhisperPipeline(model_path, "GPU")
result = pipe.generate(raw_speech, max_new_tokens=100)
print(result)
- CPU
- GPU
#include "openvino/genai/whisper_pipeline.hpp"
#include "audio_utils.hpp"
#include <filesystem>
#include <iostream>
int main(int argc, char* argv[]) {
    std::filesystem::path models_path = argv[1];
    std::string wav_file_path = argv[2];
    ov::genai::RawSpeechInput raw_speech = utils::audio::read_wav(wav_file_path);
    ov::genai::WhisperPipeline pipe(models_path, "CPU");
    auto result = pipe.generate(raw_speech, ov::genai::max_new_tokens(100));
    std::cout << result << std::endl;
}
#include "openvino/genai/whisper_pipeline.hpp"
#include "audio_utils.hpp"
#include <filesystem>
#include <iostream>
int main(int argc, char* argv[]) {
    std::filesystem::path models_path = argv[1];
    std::string wav_file_path = argv[2];
    ov::genai::RawSpeechInput raw_speech = utils::audio::read_wav(wav_file_path);
    ov::genai::WhisperPipeline pipe(models_path, "GPU");
    auto result = pipe.generate(raw_speech, ov::genai::max_new_tokens(100));
    std::cout << result << std::endl;
}
- CPU
- GPU
import { WhisperPipeline } from 'openvino-genai-node';
import { readAudio } from './wav_utils.js';
const rawSpeech = readAudio('sample.wav');
const pipeline = await WhisperPipeline(modelPath, "CPU");
const generationConfig = { max_new_tokens: 100 };
const result = await pipeline.generate(rawSpeech, { generationConfig });
console.log(result.texts[0]);
import { WhisperPipeline } from 'openvino-genai-node';
import { readAudio } from './wav_utils.js';
const rawSpeech = readAudio('sample.wav');
const pipeline = await WhisperPipeline(modelPath, "GPU");
const generationConfig = { max_new_tokens: 100 };
const result = await pipeline.generate(rawSpeech, { generationConfig });
console.log(result.texts[0]);
Use CPU or GPU as the device without any other code changes.
Additional Usage Options
Check out Python, C++, and JavaScript Whisper speech recognition samples.
Use Different Generation Parameters
Generation Configuration Workflow
- Get the model default config with `get_generation_config()`
- Modify parameters
- Apply the updated config using one of the following methods:
  - Use `set_generation_config(config)`
  - Pass config directly to `generate()` (e.g. `generate(prompt, config)`)
  - Specify options as inputs in the `generate()` method (e.g. `generate(prompt, max_new_tokens=100)`)
Basic Generation Configuration
- Python
- C++
- JavaScript
import openvino_genai as ov_genai
pipe = ov_genai.WhisperPipeline(model_path, "CPU")
# Get default configuration
config = pipe.get_generation_config()
# Modify parameters
config.max_new_tokens = 100
config.temperature = 0.7
config.top_k = 50
config.top_p = 0.9
config.repetition_penalty = 1.2
# Generate text with custom configuration
result = pipe.generate(raw_speech, config)
int main() {
    ov::genai::WhisperPipeline pipe(model_path, "CPU");
    // Get default configuration
    auto config = pipe.get_generation_config();
    // Modify parameters
    config.max_new_tokens = 100;
    config.temperature = 0.7f;
    config.top_k = 50;
    config.top_p = 0.9f;
    config.repetition_penalty = 1.2f;
    // Generate text with custom configuration
    auto result = pipe.generate(raw_speech, config);
}
import { WhisperPipeline } from 'openvino-genai-node';
const pipeline = await WhisperPipeline(modelPath, "CPU");
// Get default configuration
const config = pipeline.getGenerationConfig();
// Modify parameters
const generationConfig = {
...config,
max_new_tokens: 100,
temperature: 0.7,
top_k: 50,
top_p: 0.9,
repetition_penalty: 1.2
};
// Generate text with custom configuration
const result = await pipeline.generate(rawSpeech, { generationConfig });
- `max_new_tokens`: The maximum number of tokens to generate, excluding the number of tokens in the prompt. `max_new_tokens` has priority over `max_length`.
- `temperature`: Controls the level of creativity in AI-generated text:
  - Low temperature (e.g. 0.2) leads to more focused and deterministic output, choosing tokens with the highest probability.
  - Medium temperature (e.g. 1.0) maintains a balance between creativity and focus, selecting tokens based on their probabilities without significant bias.
  - High temperature (e.g. 2.0) makes output more creative and adventurous, increasing the chances of selecting less likely tokens.
- `top_k`: Limits token selection to the k most likely next tokens. Higher values allow more diverse outputs.
- `top_p`: Selects from the smallest set of tokens whose cumulative probability exceeds p. Helps balance diversity and quality.
- `repetition_penalty`: Reduces the likelihood of repeating tokens. Values above 1.0 discourage repetition.
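To make the interaction between `top_k` and `top_p` concrete, here is a minimal, self-contained sketch of the filtering step. The token probabilities are invented for illustration, and the helper is not the pipeline's actual sampling code:

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Return the (token, prob) pairs that survive top-k and top-p filtering."""
    # Keep only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Then keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

# Hypothetical next-token distribution
probs = {"the": 0.5, "a": 0.2, "an": 0.15, "this": 0.1, "that": 0.05}
print(filter_top_k_top_p(probs, top_k=50, top_p=0.9))  # keeps: the, a, an, this
```

With `top_p=0.9`, the least likely token is dropped because the first four already cover 95% of the probability mass; lowering `top_k` would cut the tail earlier regardless of `top_p`.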
For the full list of generation parameters, refer to the Generation Config API.
Optimizing Generation with Grouped Beam Search
Beam search helps explore multiple possible text completions simultaneously, often leading to higher quality outputs.
- Python
- C++
- JavaScript
import openvino_genai as ov_genai
pipe = ov_genai.WhisperPipeline(model_path, "CPU")
# Get default generation config
config = pipe.get_generation_config()
# Modify parameters
config.max_new_tokens = 256
config.num_beams = 15
config.num_beam_groups = 3
config.diversity_penalty = 1.0
# Generate text with custom configuration
result = pipe.generate(raw_speech, config)
int main() {
    ov::genai::WhisperPipeline pipe(model_path, "CPU");
    // Get default generation config
    ov::genai::GenerationConfig config = pipe.get_generation_config();
    // Modify parameters
    config.max_new_tokens = 256;
    config.num_beams = 15;
    config.num_beam_groups = 3;
    config.diversity_penalty = 1.0f;
    // Generate text with custom configuration
    auto result = pipe.generate(raw_speech, config);
}
const pipeline = await WhisperPipeline(modelPath, "CPU");
// Get default generation config
const config = pipeline.getGenerationConfig();
// Modify parameters
const generationConfig = {
...config,
max_new_tokens: 256,
num_beams: 15,
num_beam_groups: 3,
diversity_penalty: 1.0
};
// Generate text with custom configuration
const result = await pipeline.generate(rawSpeech, { generationConfig });
- `max_new_tokens`: The maximum number of tokens to generate, excluding the number of tokens in the prompt. `max_new_tokens` has priority over `max_length`.
- `num_beams`: The number of beams for beam search. 1 disables beam search.
- `num_beam_groups`: The number of groups to divide `num_beams` into in order to ensure diversity among different groups of beams.
- `diversity_penalty`: This value is subtracted from a beam's score if it generates the same token as any beam from another group at a particular time step.
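As an illustration of what `diversity_penalty` does, here is a simplified, self-contained sketch of the score adjustment. The scores and the helper function are hypothetical, not the pipeline's actual beam search implementation:

```python
def apply_diversity_penalty(log_probs, tokens_from_other_groups, diversity_penalty):
    """Penalize tokens that other beam groups already chose at this step."""
    return {
        token: score - diversity_penalty * tokens_from_other_groups.count(token)
        for token, score in log_probs.items()
    }

scores = {"hello": -1.0, "hi": -2.0}
# Another group already emitted "hello" at this step, so this group is nudged toward alternatives.
print(apply_diversity_penalty(scores, ["hello"], diversity_penalty=0.5))  # {'hello': -1.5, 'hi': -2.0}
```

Because each group sees a penalty for tokens other groups picked, the groups tend to explore different continuations instead of converging on the same one.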
For the full list of generation parameters, refer to the Generation Config API.
Transcription
Whisper models can automatically detect the language of the input audio, or you can specify the language to improve accuracy:
- Python
- C++
- JavaScript
pipe = ov_genai.WhisperPipeline(model_path, "CPU")
# Automatic language detection
raw_speech = read_wav("speech_sample.wav")
result = pipe.generate(raw_speech)
# Explicitly specify language (English)
result = pipe.generate(raw_speech, language="<|en|>")
# French speech sample
raw_speech = read_wav("french_sample.wav")
result = pipe.generate(raw_speech, language="<|fr|>")
int main() {
    ov::genai::WhisperPipeline pipe(model_path, "CPU");
    // Automatic language detection
    ov::genai::RawSpeechInput raw_speech = utils::audio::read_wav("speech_sample.wav");
    auto result = pipe.generate(raw_speech);
    // Explicitly specify language (English)
    result = pipe.generate(raw_speech, ov::genai::language("<|en|>"));
    // French speech sample
    raw_speech = utils::audio::read_wav("french_sample.wav");
    result = pipe.generate(raw_speech, ov::genai::language("<|fr|>"));
}
const pipeline = await WhisperPipeline(modelPath, "CPU");
// Automatic language detection
let result = await pipeline.generate(rawSpeech);
// Explicitly specify language (English)
let generationConfig = { language: "<|en|>" };
result = await pipeline.generate(rawSpeech, { generationConfig });
// French speech sample
generationConfig = { language: "<|fr|>" };
result = await pipeline.generate(frenchRawSpeech, { generationConfig });
Translation
By default, Whisper performs transcription, keeping the output in the same language as the input.
To translate non-English speech to English, use the translate task:
- Python
- C++
- JavaScript
pipe = ov_genai.WhisperPipeline(model_path, "CPU")
# Translate French audio to English
raw_speech = read_wav("french_sample.wav")
result = pipe.generate(raw_speech, task="translate")
int main() {
    ov::genai::WhisperPipeline pipe(model_path, "CPU");
    // Translate French audio to English
    ov::genai::RawSpeechInput raw_speech = utils::audio::read_wav("french_sample.wav");
    auto result = pipe.generate(raw_speech, ov::genai::task("translate"));
}
const pipeline = await WhisperPipeline(modelPath, "CPU");
// Translate French audio to English
const generationConfig = { task: "translate" };
const result = await pipeline.generate(rawSpeech, { generationConfig });
Timestamps Prediction
Whisper can predict timestamps for each segment of speech, which is useful for synchronization or creating subtitles:
- Python
- C++
- JavaScript
pipe = ov_genai.WhisperPipeline(model_path, "CPU")
# Enable timestamp prediction
result = pipe.generate(raw_speech, return_timestamps=True)
# Print timestamps and text segments
for chunk in result.chunks:
print(f"timestamps: [{chunk.start_ts:.2f}, {chunk.end_ts:.2f}] text: {chunk.text}")
int main() {
    ov::genai::WhisperPipeline pipe(model_path, "CPU");
    // Enable timestamp prediction
    auto result = pipe.generate(raw_speech, ov::genai::return_timestamps(true));
    // Print timestamps and text segments
    for (auto& chunk : *result.chunks) {
        std::cout << "timestamps: [" << chunk.start_ts << ", " << chunk.end_ts
                  << "] text: " << chunk.text << "\n";
    }
}
const pipeline = await WhisperPipeline(modelPath, "CPU");
// Enable timestamp prediction
const generationConfig = { return_timestamps: true, language: "<|en|>", task: "transcribe" };
const result = await pipeline.generate(rawSpeech, { generationConfig });
// Print timestamps and text segments
for (const chunk of result.chunks ?? []) {
    console.log(`timestamps: [${chunk.startTs.toFixed(2)}, ${chunk.endTs.toFixed(2)}] text: ${chunk.text}`);
}
Word-level Timestamps Prediction
Whisper can predict timestamps for each word of speech, which provides more granular timing information compared to segment-level timestamps.
- Python
- C++
- JavaScript
# Word timestamps require decomposition of cross-attention decoder SDPA layers,
# so word_timestamps must be passed to the pipeline constructor (not just in generation config)
pipe = ov_genai.WhisperPipeline(model_path, "CPU", word_timestamps=True)
# Enable word-level timestamp prediction
result = pipe.generate(raw_speech, word_timestamps=True)
# Print word-level timestamps
for word in result.words:
    print(f"[{word.start_ts:.2f}, {word.end_ts:.2f}]: {word.word}")
int main() {
    // Word timestamps require decomposition of cross-attention decoder SDPA layers,
    // so word_timestamps must be passed to the pipeline constructor (not just in generation config)
    ov::genai::WhisperPipeline pipeline(model_path, "CPU", ov::genai::word_timestamps(true));
    // Enable word-level timestamp prediction
    auto result = pipeline.generate(raw_speech, ov::genai::word_timestamps(true));
    // Print word-level timestamps
    std::cout << std::fixed << std::setprecision(2);
    for (auto& word : *result.words) {
        std::cout << "[" << word.start_ts << ", " << word.end_ts << "]: " << word.word << "\n";
    }
}
// Word timestamps require decomposition of cross-attention decoder SDPA layers,
// so word_timestamps must be passed to the pipeline constructor (not just in generation config)
const pipeline = await WhisperPipeline(modelPath, "CPU", { word_timestamps: true });
// Enable word-level timestamp prediction
const generationConfig = { return_timestamps: true, word_timestamps: true, language: "<|en|>", task: "transcribe" };
const result = await pipeline.generate(rawSpeech, { generationConfig });
// Print word-level timestamps
for (const w of result.words ?? []) {
    console.log(`[${w.startTs.toFixed(2)}, ${w.endTs.toFixed(2)}]: ${w.word}`);
}
The NPU device requires the `STATIC_PIPELINE=True` property to be passed to the `WhisperPipeline` constructor:
ov_genai.WhisperPipeline(model_path, "NPU", word_timestamps=True, STATIC_PIPELINE=True)
Long-Form Audio Processing
Whisper models are designed for audio segments up to 30 seconds in length. For longer audio, the OpenVINO GenAI Whisper pipeline automatically handles the processing using a sequential chunking algorithm ("sliding window"):
- The audio is divided into 30-second segments
- Each segment is processed sequentially
- Results are combined to produce the complete transcription
This happens automatically when you input longer audio files.
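The chunking described above can be sketched as follows. This is a simplified illustration of fixed 30-second windows, not the pipeline's internal implementation; in practice the pipeline handles windowing for you and you do not need to split the audio yourself:

```python
def chunk_audio(samples, sample_rate=16000, window_seconds=30):
    """Split raw audio samples into sequential fixed-length windows."""
    window = sample_rate * window_seconds
    return [samples[i:i + window] for i in range(0, len(samples), window)]

# 75 seconds of (silent) audio at 16 kHz -> three windows: 30 s, 30 s, 15 s
chunks = chunk_audio([0.0] * (16000 * 75))
print([len(c) / 16000 for c in chunks])  # [30.0, 30.0, 15.0]
```

Each window is then transcribed in turn, and the per-window results are concatenated into the final transcription.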
Using Initial Prompts and Hotwords
You can improve transcription quality and guide the model's output style by providing initial prompts or hotwords using the following parameters:
- `initial_prompt`: initial prompt tokens passed as a previous transcription (after the `<|startofprev|>` token) to the first processing window.
- `hotwords`: hotwords tokens passed as a previous transcription (after the `<|startofprev|>` token) to all processing windows.
Whisper models can use that context to better understand the speech and maintain a consistent writing style. However, prompts do not need to be genuine transcripts from prior audio segments. Such prompts can be used to steer the model to use particular spellings or styles:
- Python
- C++
- JavaScript
pipe = ov_genai.WhisperPipeline(model_path, "CPU")
result = pipe.generate(raw_speech)
# He has gone and gone for good answered Paul Icrom who...
result = pipe.generate(raw_speech, initial_prompt="Polychrome")
# He has gone and gone for good answered Polychrome who...
int main() {
    ov::genai::WhisperPipeline pipe(model_path, "CPU");
    auto result = pipe.generate(raw_speech);
    // He has gone and gone for good answered Paul Icrom who...
    result = pipe.generate(raw_speech, ov::genai::initial_prompt("Polychrome"));
    // He has gone and gone for good answered Polychrome who...
}
const pipeline = await WhisperPipeline(modelPath, "CPU");
let result = await pipeline.generate(rawSpeech);
// He has gone and gone for good answered Paul Icrom who...
const generationConfig = { initial_prompt: "Polychrome" };
result = await pipeline.generate(rawSpeech, { generationConfig });
// He has gone and gone for good answered Polychrome who...
For the full list of Whisper generation parameters, refer to the Whisper Generation Config API.
Streaming the Output
Refer to the Streaming guide for more information on streaming the output with OpenVINO GenAI.