Tokenization
OpenVINO GenAI provides a way to tokenize and detokenize text using the ov::genai::Tokenizer class. The Tokenizer is a high-level abstraction over the OpenVINO Tokenizers library. It can be initialized from a path, from an in-memory IR representation, or obtained from the ov::genai::LLMPipeline object.
- Python
- C++
import openvino_genai as ov_genai
# Initialize from the path
tokenizer = ov_genai.Tokenizer(models_path)
# Or get tokenizer instance from LLMPipeline
pipe = ov_genai.LLMPipeline(models_path, "CPU")
tokenizer = pipe.get_tokenizer()
#include "openvino/genai/llm_pipeline.hpp"
// Initialize from the path
auto tokenizer = ov::genai::Tokenizer(models_path);
// Or get tokenizer instance from LLMPipeline
ov::genai::LLMPipeline pipe(models_path, "CPU");
tokenizer = pipe.get_tokenizer();
Tokenizer has encode() and decode() methods which support the following arguments: add_special_tokens, skip_special_tokens, pad_to_max_length, and max_length.
Example - disable adding special tokens:
- Python
- C++
tokens = tokenizer.encode("The Sun is yellow because", add_special_tokens=False)
auto tokens = tokenizer.encode("The Sun is yellow because", ov::genai::add_special_tokens(false));
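The decode() method performs the reverse mapping from token ids back to text. Below is a minimal Python sketch (reusing models_path from above, not part of the original example); skip_special_tokens controls whether special tokens, such as BOS, appear in the decoded output:
import openvino_genai as ov_genai
tokenizer = ov_genai.Tokenizer(models_path)
tokens = tokenizer.encode("The Sun is yellow because")
# Decode the [1, N] input_ids tensor back to a list with one string per row.
# With skip_special_tokens=True, special tokens such as BOS are removed from the output.
texts = tokenizer.decode(tokens.input_ids, skip_special_tokens=True)
print(texts)  # expected to round-trip to the original prompt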
The encode() method returns a TokenizedInputs object containing input_ids and attention_mask, both stored as ov::Tensor. Since ov::Tensor requires fixed-length sequences, padding is applied to match the longest sequence in a batch, ensuring a uniform shape. The resulting sequences are also truncated to max_length; if this value is not provided by the user, it is taken from the IR. Both padding and max_length can be controlled by the user: if pad_to_max_length is set to true, sequences are padded to max_length instead of to the longest sequence in the batch.
Example - control padding:
- Python
- C++
import openvino_genai as ov_genai
tokenizer = ov_genai.Tokenizer(models_path)
prompts = ["The Sun is yellow because", "The"]
# Since the prompts are shorter than the maximal length (which is taken from the IR), padding will not affect the shape.
# The resulting shape is defined by the length of the longest token sequence.
# Equivalent of HuggingFace hf_tokenizer.encode(prompt, padding="longest", truncation=True)
tokens = tokenizer.encode(prompts)
# or equivalently
tokens = tokenizer.encode(prompts, pad_to_max_length=False)
print(tokens.input_ids.shape)
# out_shape: [2, 6]
# The resulting tokens tensor will be padded to 1024; sequences exceeding this length will be truncated.
# Equivalent of HuggingFace hf_tokenizer.encode(prompt, padding="max_length", truncation=True, max_length=1024)
tokens = tokenizer.encode(["The Sun is yellow because",
                           "The",
                           "The longest string ever" * 2000], pad_to_max_length=True, max_length=1024)
print(tokens.input_ids.shape)
# out_shape: [3, 1024]
# Truncation and padding are also applied to single-string prompts.
tokens = tokenizer.encode("The Sun is yellow because", pad_to_max_length=True, max_length=128)
print(tokens.input_ids.shape)
# out_shape: [1, 128]
#include "openvino/genai/llm_pipeline.hpp"
auto tokenizer = ov::genai::Tokenizer(models_path);
std::vector<std::string> prompts = {"The Sun is yellow because", "The"};
// Since the prompts are shorter than the maximal length (which is taken from the IR), padding will not affect the shape.
// The resulting shape is defined by the length of the longest token sequence.
// Equivalent of HuggingFace hf_tokenizer.encode(prompt, padding="longest", truncation=True)
auto tokens = tokenizer.encode(prompts);
// or equivalently
tokens = tokenizer.encode(prompts, ov::genai::pad_to_max_length(false));
// out_shape: [2, 6]
// The resulting tokens tensor will be padded to 1024; sequences exceeding this length will be truncated.
// Equivalent of HuggingFace hf_tokenizer.encode(prompt, padding="max_length", truncation=True, max_length=1024)
tokens = tokenizer.encode({"The Sun is yellow because",
                           "The",
                           std::string(2000, 'n')}, ov::genai::pad_to_max_length(true), ov::genai::max_length(1024));
// out_shape: [3, 1024]
// Truncation and padding are also applied to single-string prompts.
tokens = tokenizer.encode("The Sun is yellow because", ov::genai::pad_to_max_length(true), ov::genai::max_length(128));
// out_shape: [1, 128]
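Padded batches can be decoded back to text in the same way. A short Python sketch (assuming the model's padding token is registered as a special token, which is the typical case): decoding with skip_special_tokens=True drops the padding, so short prompts are recovered without trailing pad tokens.
import openvino_genai as ov_genai
tokenizer = ov_genai.Tokenizer(models_path)
tokens = tokenizer.encode(["The Sun is yellow because", "The"],
                          pad_to_max_length=True, max_length=1024)
# Decoding the [2, 1024] tensor yields one string per row.
# skip_special_tokens=True removes padding tokens from the output,
# assuming the pad token is marked as special in the tokenizer.
texts = tokenizer.decode(tokens.input_ids, skip_special_tokens=True)
print(texts)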