Use OpenVINO GenAI in Chat Scenario

For chat applications, OpenVINO GenAI provides special optimizations to maintain conversation context and improve performance using KV-cache.

Refer to How It Works for more information about the KV-cache.

info

Chat mode is supported for both LLMPipeline and VLMPipeline.

ChatHistory

ChatHistory stores conversation messages and optional metadata for chat templates. Messages are stored as JSON-like objects, so ChatHistory supports nested message structures with any field names your model or chat template requires, not just the simple "role" and "content" fields.
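For example, a message can carry nested content parts or additional fields. The structure below is purely illustrative; which fields are actually used depends on the chat template of your model:

import openvino_genai as ov_genai

chat_history = ov_genai.ChatHistory()
# Illustrative message structure; fields beyond "role" and "content"
# are only meaningful if the model's chat template knows how to render them
chat_history.append({
    "role": "user",
    "content": [
        {"type": "text", "text": "Summarize the following document."}
    ],
    "name": "user_1"
})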

A simple chat example (with grouped beam search decoding):

import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline(model_path, 'CPU')

config = {'max_new_tokens': 100, 'num_beam_groups': 3, 'num_beams': 15, 'diversity_penalty': 1.5}
pipe.set_generation_config(config)

chat_history = ov_genai.ChatHistory()

while True:
    try:
        prompt = input('question:\n')
    except EOFError:
        break

    chat_history.append({"role": "user", "content": prompt})
    decoded_results = pipe.generate(chat_history)
    # Add assistant's response to chat history
    chat_history.append({"role": "assistant", "content": decoded_results.texts[0]})

    print('answer:\n')
    print(decoded_results.texts[0])
    print('\n----------\n')

info

ChatHistory messages are not updated automatically when using pipe.generate(). You need to manually append user prompts and model responses to the ChatHistory instance, as shown in the example above.

System Prompt

Add a system message at the beginning to set the assistant's behavior:

import openvino_genai as ov_genai

chat_history = ov_genai.ChatHistory()
chat_history.append({"role": "system", "content": "You are a helpful assistant."})

# Or using constructor
chat_history = ov_genai.ChatHistory([
    {"role": "system", "content": "You are a helpful assistant."}
])

Chat History Metadata

Additionally, ChatHistory manages optional metadata for consistent chat template application:

  • Tools definitions for function calling and agentic scenarios
  • Custom chat template variables (e.g. enable_thinking for models with extended reasoning like Qwen3)

import openvino_genai as ov_genai
import json

chat_history = ov_genai.ChatHistory()
chat_history.append({"role": "system", "content": system_prompt})

# Load tools from JSON string
tools: list[dict] = json.loads("...")

# Set tools definitions
chat_history.set_tools(tools)
# Set custom chat template variables
chat_history.set_extra_context({ "enable_thinking": True })

chat_history.append({"role": "user", "content": user_prompt})
decoded_results = pipe.generate(chat_history, config)
# Add assistant's response to chat history
chat_history.append({"role": "assistant", "content": decoded_results.texts[0]})

start_chat() / finish_chat() API

Deprecation Notice

The start_chat() / finish_chat() API is deprecated and will be removed in the next major release. It is recommended to use ChatHistory for managing chat conversations.

A simple chat example (with grouped beam search decoding):

import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline(model_path, 'CPU')

config = {'max_new_tokens': 100, 'num_beam_groups': 3, 'num_beams': 15, 'diversity_penalty': 1.5}
pipe.set_generation_config(config)

pipe.start_chat()
while True:
    try:
        prompt = input('question:\n')
    except EOFError:
        break
    answer = pipe.generate(prompt)
    print('answer:\n')
    print(answer)
    print('\n----------\n')
pipe.finish_chat()

info

For more information, refer to the Python, C++, and JavaScript chat samples.