Streaming the Output

For more interactive UIs during generation, you can stream output tokens.

info

Streaming is supported for LLMPipeline, VLMPipeline and WhisperPipeline.
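
As a rough illustration of this note, the same kind of callback can be passed to the other pipelines as well. The snippet below sketches it for WhisperPipeline; the audio-loading code (librosa, 16 kHz mono, the "sample.wav" file name) is an illustrative assumption, model_path points to a Whisper model directory, and the streamer callback itself is explained in the next section:

import librosa
import openvino_genai as ov_genai

pipe = ov_genai.WhisperPipeline(model_path, "CPU")

def streamer(subword):
    # Print each decoded chunk as soon as it arrives.
    print(subword, end='', flush=True)
    return ov_genai.StreamingStatus.RUNNING

# Whisper models expect 16 kHz mono input (loading via librosa is an assumption).
raw_speech, _ = librosa.load("sample.wav", sr=16000)
pipe.generate(raw_speech.tolist(), streamer=streamer)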

Streaming Function

In this example, a function outputs words to the console immediately upon generation:

import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline(model_path, "CPU")

# Create a streamer function
def streamer(subword):
    print(subword, end='', flush=True)
    # The returned flag indicates whether generation should continue or stop.
    return ov_genai.StreamingStatus.RUNNING

pipe.start_chat()
while True:
    try:
        prompt = input('question:\n')
    except EOFError:
        break
    pipe.generate(prompt, streamer=streamer, max_new_tokens=100)
    print('\n----------\n')
pipe.finish_chat()
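
Returning ov_genai.StreamingStatus.STOP from the callback asks the pipeline to end generation early. The sketch below stops after a rough character budget; the budget, prompt, and variable names are illustrative, not part of the API:

import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline(model_path, "CPU")

printed_chars = 0

def capped_streamer(subword):
    global printed_chars
    print(subword, end='', flush=True)
    printed_chars += len(subword)
    # Ask the pipeline to stop once roughly 200 characters have been printed.
    if printed_chars >= 200:
        return ov_genai.StreamingStatus.STOP
    return ov_genai.StreamingStatus.RUNNING

pipe.generate("Write a long story about a robot.", streamer=capped_streamer, max_new_tokens=1000)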

Custom Streamer Class

You can also create a custom streamer class for more sophisticated processing:

import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline(model_path, "CPU")

# Create custom streamer class
class CustomStreamer(ov_genai.StreamerBase):
    def __init__(self):
        super().__init__()
        # Initialization logic.

    def write(self, token: int | list[int]) -> ov_genai.StreamingStatus:
        # Custom processing logic for the newly generated token(s).

        # The returned flag indicates whether generation should continue or stop.
        return ov_genai.StreamingStatus.RUNNING

    def end(self):
        # Custom finalization logic.
        pass

pipe.start_chat()
while True:
    try:
        prompt = input('question:\n')
    except EOFError:
        break
    pipe.generate(prompt, streamer=CustomStreamer(), max_new_tokens=100)
    print('\n----------\n')
pipe.finish_chat()

info

For a fully implemented iterable CustomStreamer, refer to the multinomial_causal_lm sample.
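
As a rough sketch of the idea behind that sample (not its actual code): write() can push decoded text into a queue that another thread consumes as an iterator. It assumes pipe.get_tokenizer() and Tokenizer.decode are used for detokenization and skips the token buffering the real sample performs to keep decoded text clean:

import queue
from threading import Thread
import openvino_genai as ov_genai

class IterableStreamer(ov_genai.StreamerBase):
    def __init__(self, tokenizer):
        super().__init__()
        self.tokenizer = tokenizer
        self.text_queue = queue.Queue()

    def write(self, token: int | list[int]) -> ov_genai.StreamingStatus:
        # Decode the new token(s) and hand the text to the consuming thread.
        tokens = token if isinstance(token, list) else [token]
        self.text_queue.put(self.tokenizer.decode(tokens))
        return ov_genai.StreamingStatus.RUNNING

    def end(self):
        # A sentinel tells the iterator that generation has finished.
        self.text_queue.put(None)

    def __iter__(self):
        return self

    def __next__(self):
        chunk = self.text_queue.get()
        if chunk is None:
            raise StopIteration
        return chunk

pipe = ov_genai.LLMPipeline(model_path, "CPU")
streamer = IterableStreamer(pipe.get_tokenizer())

# Run generation in a background thread and iterate over the streamed text.
thread = Thread(target=pipe.generate, args=("What is OpenVINO?",),
                kwargs={'streamer': streamer, 'max_new_tokens': 100})
thread.start()
for chunk in streamer:
    print(chunk, end='', flush=True)
thread.join()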