Semantic Search using Text Embedding

Convert and Optimize Model

Download a text embedding model (e.g. BAAI/bge-small-en-v1.5) from Hugging Face and convert it to OpenVINO format:

optimum-cli export openvino --model BAAI/bge-small-en-v1.5 --trust-remote-code bge-small-en-v1_5_ov

See all supported Text Embedding Models.

info

Refer to the Model Preparation guide for detailed instructions on how to download, convert and optimize models for OpenVINO GenAI.

Run Model Using OpenVINO GenAI

TextEmbeddingPipeline generates vector representations for text using embedding models.

import openvino_genai as ov_genai

# Path to the converted model directory (see the export step above)
models_path = "bge-small-en-v1_5_ov"

pipeline = ov_genai.TextEmbeddingPipeline(
    models_path,
    "CPU",
    pooling_type=ov_genai.TextEmbeddingPipeline.PoolingType.MEAN,
    normalize=True,
)

documents = [
    "Paris is the capital and largest city of France.",
    "The Eiffel Tower is located in Paris.",
]

documents_embeddings = pipeline.embed_documents(documents)
query_embeddings = pipeline.embed_query("What is the capital of France?")

tip

Use CPU or GPU as the device without any other code changes.
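
Because the embeddings above are L2-normalized, the dot product between the query vector and each document vector equals their cosine similarity, so it can be used directly to rank documents. A minimal sketch, assuming the pipeline returns embeddings as plain float vectors:

import numpy as np

# With normalize=True, dot product == cosine similarity
query = np.asarray(query_embeddings)
docs = np.asarray(documents_embeddings)

scores = docs @ query          # one similarity score per document
best = int(np.argmax(scores))  # index of the most relevant document
print(f"Best match ({scores[best]:.3f}): {documents[best]}")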

Additional Usage Options

tip

Check out Python and C++ text embedding samples.

Pooling Strategies

Text embedding models support different pooling strategies to aggregate token embeddings into a single vector:

  • CLS: Use the first token embedding (default for many models)
  • MEAN: Average all token embeddings
  • LAST_TOKEN: Use the last token embedding

You can set the pooling strategy via the pooling_type parameter.
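
For example, to select CLS pooling instead of the MEAN pooling used above (a sketch; check the model card for the strategy the model was trained with):

import openvino_genai as ov_genai

pipeline = ov_genai.TextEmbeddingPipeline(
    "bge-small-en-v1_5_ov",
    "CPU",
    pooling_type=ov_genai.TextEmbeddingPipeline.PoolingType.CLS,
)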

L2 Normalization

L2 normalization can be applied to the output embeddings for improved retrieval performance. Enable it with the normalize parameter.
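
With normalize=True, each embedding v is scaled to unit length (v / ||v||). A quick check on a returned embedding, assuming it comes back as a plain float vector:

import numpy as np

vec = np.asarray(pipeline.embed_query("example text"))
print(np.linalg.norm(vec))  # ~1.0 when normalize=True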

Input Size and Padding

You can control how input texts are tokenized and padded:

  • max_length: Maximum length of tokens passed to the embedding model. Longer texts will be truncated.
  • pad_to_max_length: If true, model input tensors are padded to the maximum length.
  • padding_side: Side to use for padding ("left" or "right").
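
A sketch combining these options (values are illustrative; pick max_length to fit your documents):

pipeline = ov_genai.TextEmbeddingPipeline(
    "bge-small-en-v1_5_ov",
    "CPU",
    max_length=256,          # inputs longer than 256 tokens are truncated
    pad_to_max_length=True,  # every input is padded to exactly 256 tokens
    padding_side="right",
)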

Batch Size Configuration

The batch_size parameter is useful for optimizing performance during database population:

  • When set, the pipeline fixes the model shape for inference optimization.
  • The number of documents passed to the pipeline must equal batch_size.
  • For query embeddings, set batch_size=1 or leave it unset.
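
For example, with batch_size=4 every embed_documents call must receive exactly four documents (a sketch):

pipeline = ov_genai.TextEmbeddingPipeline(
    "bge-small-en-v1_5_ov",
    "CPU",
    batch_size=4,
)

docs = [
    "First document.",
    "Second document.",
    "Third document.",
    "Fourth document.",
]
embeddings = pipeline.embed_documents(docs)  # len(docs) must equal batch_size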

Fixed Shape Optimization

Setting batch_size, max_length, and pad_to_max_length=true together will fix the model shape for optimal inference performance.

info

Fixed shapes are required for NPU device inference.
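
A sketch of such a fixed-shape configuration targeting NPU (whether a given model runs on NPU depends on your platform):

pipeline = ov_genai.TextEmbeddingPipeline(
    "bge-small-en-v1_5_ov",
    "NPU",
    batch_size=1,            # fixed batch
    max_length=512,
    pad_to_max_length=True,  # together these fix the model shape
)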

Query and Embed Instructions

Some models support special instructions for queries and documents. Use query_instruction and embed_instruction to provide these if needed.

Example: Custom Configuration

import openvino_genai as ov_genai

# Path to the converted model directory
models_path = "bge-small-en-v1_5_ov"

pipeline = ov_genai.TextEmbeddingPipeline(
    models_path,
    "CPU",
    pooling_type=ov_genai.TextEmbeddingPipeline.PoolingType.MEAN,
    normalize=True,
    max_length=512,
    pad_to_max_length=True,
    padding_side="left",
    batch_size=4,
    query_instruction="Represent this sentence for searching relevant passages: ",
    embed_instruction="Represent this passage for retrieval: ",
)

info

For the full list of configuration options, see the TextEmbeddingPipeline API Reference.