Performance Metrics
Overview
openvino_genai.PerfMetrics (referred to as PerfMetrics for simplicity) is a structure that holds performance metrics for each generate call.
PerfMetrics holds fields with the mean and standard deviation for the following metrics:
- Time To the First Token (TTFT), ms
- Time per Output Token (TPOT), ms/token
- Generate total duration, ms
- Tokenization duration, ms
- Detokenization duration, ms
- Throughput, tokens/s
and single values for:
- Load time, ms
- Number of generated tokens
- Number of tokens in the input prompt
Performance metrics are stored in the perf_metrics field of either DecodedResults or EncodedResults.
In addition to the fields mentioned above, PerfMetrics has a member raw_metrics of type openvino_genai.RawPerfMetrics that contains raw values: the duration of each batch of new token generation, tokenization durations, detokenization durations, and more.
These raw metrics are available if you wish to calculate your own statistics, such as medians or percentiles.
However, since mean and standard deviation values are usually sufficient, we will focus on PerfMetrics.
Python:
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(models_path, "CPU")
result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
perf_metrics = result.perf_metrics
print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f} ms')
print(f'TTFT: {perf_metrics.get_ttft().mean:.2f} ms')
print(f'TPOT: {perf_metrics.get_tpot().mean:.2f} ms/token')
print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
std::string models_path = argv[1];
ov::genai::LLMPipeline pipe(models_path, "CPU");
auto result = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
auto perf_metrics = result.perf_metrics;
std::cout << std::fixed << std::setprecision(2);
std::cout << "Generate duration: " << perf_metrics.get_generate_duration().mean << " ms" << std::endl;
std::cout << "TTFT: " << metrics.get_ttft().mean << " ms" << std::endl;
std::cout << "TPOT: " << metrics.get_tpot().mean << " ms/token " << std::endl;
std::cout << "Throughput: " << metrics.get_throughput().mean << " tokens/s" << std::endl;
}
Sample output:
mean_generate_duration: 76.28
mean_ttft: 42.58
mean_tpot: 3.80
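Because each metric stores both a mean and a standard deviation, the same getters can also report run-to-run spread. A minimal Python sketch, assuming the returned pairs expose mean and std attributes as described above:
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline(models_path, "CPU")
result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
perf_metrics = result.perf_metrics

# Each getter returns a mean/std pair (assumed attribute names: mean, std).
ttft = perf_metrics.get_ttft()
tpot = perf_metrics.get_tpot()
print(f'TTFT: {ttft.mean:.2f} ms (std {ttft.std:.2f} ms)')
print(f'TPOT: {tpot.mean:.2f} ms/token (std {tpot.std:.2f} ms/token)')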
If the input prompt is just a string, the generate function returns only a string without perf_metrics. To obtain perf_metrics, provide the prompt as a list with at least one element or call generate with encoded inputs.
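For illustration, a short sketch of the difference (the prompt is arbitrary):
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline(models_path, "CPU")

# A plain string prompt returns a plain string: no perf_metrics here.
text = pipe.generate("The Sun is yellow because", max_new_tokens=20)

# A list prompt returns DecodedResults, which carries perf_metrics.
result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
print(f'TTFT: {result.perf_metrics.get_ttft().mean:.2f} ms')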
Accumulating Metrics
Several perf_metrics objects can be added together.
In that case, their raw_metrics are concatenated and the mean/std values are recalculated.
This accumulates statistics from several generate() calls.
Python:
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(models_path, "CPU")
res_1 = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
res_2 = pipe.generate(["The Sky is blue because"], max_new_tokens=20)
perf_metrics = res_1.perf_metrics + res_2.perf_metrics
print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f} ms')
print(f'TTFT: {perf_metrics.get_ttft().mean:.2f} ms')
print(f'TPOT: {perf_metrics.get_tpot().mean:.2f} ms/token')
print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
std::string models_path = argv[1];
ov::genai::LLMPipeline pipe(models_path, "CPU");
auto result_1 = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
auto result_2 = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
auto perf_metrics = result_1.perf_metrics + result_2.perf_metrics
std::cout << std::fixed << std::setprecision(2);
std::cout << "Generate duration: " << perf_metrics.get_generate_duration().mean << " ms" << std::endl;
std::cout << "TTFT: " << metrics.get_ttft().mean << " ms" << std::endl;
std::cout << "TPOT: " << metrics.get_tpot().mean << " ms/token " << std::endl;
std::cout << "Throughput: " << metrics.get_throughput().mean << " tokens/s" << std::endl;
}
Using Raw Performance Metrics
In addition to mean and standard deviation values, the perf_metrics object has a raw_metrics field.
This field stores raw data, including:
- Timestamps for each batch of generated tokens
- Batch sizes for each timestamp
- Tokenization durations
- Detokenization durations
- Other relevant metrics
These metrics can be used for more fine-grained analysis, such as calculating exact median values, percentiles, etc.
Below are a few examples of how to use raw metrics.
Getting timestamps for each generated token:
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(models_path, "CPU")
result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
perf_metrics = result.perf_metrics
raw_metrics = perf_metrics.raw_metrics
print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f} ms')
print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
print(f'Timestamps: {", ".join(f"{i:.2f} ms" for i in raw_metrics.m_new_token_times)}')
Getting pure inference time, without tokenization and detokenization durations:
import openvino_genai as ov_genai
import numpy as np
pipe = ov_genai.LLMPipeline(models_path, "CPU")
result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
perf_metrics = result.perf_metrics
print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f} ms')
raw_metrics = perf_metrics.raw_metrics
generate_duration = np.array(raw_metrics.generate_durations)
tok_detok_duration = np.array(raw_metrics.tokenization_durations) + np.array(raw_metrics.detokenization_durations)
pure_inference_duration = np.sum(generate_duration - tok_detok_duration) / 1000  # convert raw microseconds to milliseconds
print(f'Pure Inference duration: {pure_inference_duration:.2f} ms')
Using raw metrics to calculate the median value of token-to-token durations:
import openvino_genai as ov_genai
import numpy as np
pipe = ov_genai.LLMPipeline(models_path, "CPU")
result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
perf_metrics = result.perf_metrics
raw_metrics = perf_metrics.raw_metrics
print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f} ms')
print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
durations = np.array(raw_metrics.m_new_token_times[1:]) - np.array(raw_metrics.m_new_token_times[:-1])
print(f'Median token-to-token duration: {np.median(durations):.2f} ms')
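The same token-to-token durations can be used for percentiles, which a mean/std summary cannot provide. A short sketch along the lines of the example above (the chosen percentiles are illustrative):
import openvino_genai as ov_genai
import numpy as np

pipe = ov_genai.LLMPipeline(models_path, "CPU")
result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
raw_metrics = result.perf_metrics.raw_metrics

# Token-to-token latencies, derived from the raw timestamps as above.
durations = np.array(raw_metrics.m_new_token_times[1:]) - np.array(raw_metrics.m_new_token_times[:-1])

# Percentiles expose tail latency that a mean/std summary hides.
p50, p90, p99 = np.percentile(durations, [50, 90, 99])
print(f'TPOT p50: {p50:.2f} ms, p90: {p90:.2f} ms, p99: {p99:.2f} ms')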
For more examples of how metrics are used, please refer to the Python benchmark_genai.py and C++ benchmark_genai samples.