# Text to Image C++ Generation Pipeline
Examples in this folder showcase inference of text to image models like Stable Diffusion 1.5, Stable Diffusion 2.1, and LCM. The application deliberately has few configuration options, to encourage the reader to explore and modify the source code; for example, you can change the device used for inference to GPU. The sample features `ov::genai::Text2ImagePipeline` and uses a text prompt as the input source.
There are several sample files:

- `text2image.cpp` demonstrates basic usage of the text to image pipeline
- `lora_text2image.cpp` shows how to apply LoRA adapters to the pipeline
- `heterogeneous_stable_diffusion.cpp` shows how to assemble a heterogeneous text2image pipeline from individual subcomponents (scheduler, text encoder, UNet, VAE decoder)
- `image2image.cpp` demonstrates basic usage of the image to image pipeline
- `inpainting.cpp` demonstrates basic usage of the inpainting pipeline
- `benchmark_image_gen.cpp` demonstrates how to benchmark the text to image / image to image / inpainting pipelines
Users can change the sample code and play with the following generation parameters (see the sketch after this list for how they are passed to the pipeline):

- Change width or height of the generated image
- Generate multiple images per prompt
- Adjust the number of inference steps
- Play with the guidance scale (read more details)
- (SD 1.x, 2.x; SD3, SDXL) Add a negative prompt when the guidance scale is > 1
- (SDXL, SD3, FLUX) Specify other positive prompts like `prompt_2`
- Apply multiple different LoRA adapters and mix them with different blending coefficients
- (Image to image and inpainting) Play with the `strength` parameter to control how much the initial image is noised and to reduce the number of inference steps
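As a rough orientation, here is a minimal sketch of how such parameters can be passed to `ov::genai::Text2ImagePipeline`; the model path, device, and all values below are placeholders, so check the sample sources for the exact set of supported properties:

```cpp
#include "openvino/genai/image_generation/text2image_pipeline.hpp"

int main() {
    // Assumed model directory and device; adjust both to your setup.
    ov::genai::Text2ImagePipeline pipe("./dreamlike_anime_1_0_ov/FP16", "CPU");

    ov::Tensor image = pipe.generate(
        "cyberpunk cityscape at dusk, cinematic lighting",
        ov::genai::width(512),                // width of the generated image
        ov::genai::height(512),               // height of the generated image
        ov::genai::num_inference_steps(20),   // number of denoising steps
        ov::genai::num_images_per_prompt(1),  // images generated per generate() call
        ov::genai::guidance_scale(7.5f),      // classifier-free guidance strength
        ov::genai::negative_prompt("blurry, low quality"));  // only takes effect when guidance scale > 1
    // The resulting ov::Tensor can be saved with the imwrite helper shipped with the samples.
    return 0;
}
```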
> [!NOTE]
> An image generated with HuggingFace / Optimum Intel is not the same as an image generated by this C++ sample: C++ random generation with MT19937 differs from `numpy.random.randn()` and `diffusers.utils.randn_tensor` (which uses `torch.Generator` inside). So it is expected that the Diffusers and C++ versions produce different images, because the latent images are initialized differently.
## Download and convert the models and tokenizers
The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

It's not required to install `../../export-requirements.txt` for deployment if the model has already been exported.
```sh
pip install --upgrade-strategy eager -r ../../requirements.txt
optimum-cli export openvino --model dreamlike-art/dreamlike-anime-1.0 --task stable-diffusion --weight-format fp16 dreamlike_anime_1_0_ov/FP16
```
## Run text to image
Follow Get Started with Samples to run the sample.
```sh
stable_diffusion ./dreamlike_anime_1_0_ov/FP16 'cyberpunk cityscape like Tokyo New York with tall buildings at dusk golden hour cinematic lighting'
```
### Examples
Prompt: cyberpunk cityscape like Tokyo New York with tall buildings at dusk golden hour cinematic lighting
## Run with callback
You can also add a callback to the `main.cpp` file to interrupt the image generation process earlier if you are satisfied with an intermediate result, or to add logging. A template for the callback usage is shown below.
```cpp
ov::genai::Text2ImagePipeline pipe(models_path, device);

auto callback = [&](size_t step, size_t num_steps, ov::Tensor& latent) -> bool {
    std::cout << "Image generation step: " << step + 1 << " / " << num_steps << std::endl;
    ov::Tensor img = pipe.decode(latent); // get intermediate image tensor
    if (your_condition) // return true if you want to interrupt image generation
        return true;
    return false;
};

ov::Tensor image = pipe.generate(prompt,
    /* other generation properties */
    ov::genai::callback(callback)
);
```
## Run with optional LoRA adapters
LoRA adapters can be connected to the pipeline to modify generated images so that they have a certain style, details, or quality. Adapters are supported in Safetensors format and can be downloaded from public sources like Civitai or HuggingFace, or trained by the user. Only adapters compatible with the base model should be used. A weighted blend of multiple adapters can be applied by specifying multiple adapter files with corresponding alpha parameters on the command line. Check the `lora_text2image.cpp` source code to learn how to enable adapters and specify them in each `generate` call; a short sketch follows below.
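For orientation, the snippet below sketches how adapters are typically wired in through `ov::genai::AdapterConfig`; the adapter file name and alpha value are placeholders, and the sample source remains the authoritative reference:

```cpp
#include "openvino/genai/image_generation/text2image_pipeline.hpp"

int main() {
    // Placeholder adapter file and blending coefficient (alpha); adjust to your setup.
    ov::genai::Adapter adapter("soulcard.safetensors");
    ov::genai::AdapterConfig adapter_config;
    adapter_config.add(adapter, 0.7f);

    // Register the adapters when constructing the pipeline...
    ov::genai::Text2ImagePipeline pipe("./dreamlike_anime_1_0_ov/FP16", "CPU",
                                       ov::genai::adapters(adapter_config));

    // ...and select which adapters (and blending weights) to apply per generate() call.
    ov::Tensor lora_image = pipe.generate("curly-haired unicorn in the forest, anime, line",
                                          ov::genai::adapters(adapter_config));

    // An empty adapter set disables LoRA for a baseline image with the same prompt.
    ov::Tensor baseline_image = pipe.generate("curly-haired unicorn in the forest, anime, line",
                                              ov::genai::adapters());
    return 0;
}
```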
Here is an example of how to run the sample with a single adapter. First, download the adapter file from the https://civitai.com/models/67927/soulcard page manually and save it as `soulcard.safetensors`, or download it from the command line:
```sh
wget -O soulcard.safetensors https://civitai.com/api/download/models/72591
```
Then run the `lora_stable_diffusion` executable:
```sh
./lora_stable_diffusion dreamlike_anime_1_0_ov/FP16 'curly-haired unicorn in the forest, anime, line' soulcard.safetensors 0.7
```
The sample generates two images, with and without adapters applied, using the same prompt:

- `lora.bmp` with adapters applied
- `baseline.bmp` without adapters applied
Check the difference:

| With adapter | Without adapter |
|---|---|
| ![]() | ![]() |
## Run text to image with multiple devices
The `heterogeneous_stable_diffusion` sample demonstrates how a `Text2ImagePipeline` object can be created from individual subcomponents: scheduler, text encoder, UNet, and VAE decoder. This approach gives fine-grained control over the devices used to execute each stage of the stable diffusion pipeline.
The usage of this sample is:
```sh
./heterogeneous_stable_diffusion <MODEL_DIR> '<PROMPT>' [ <TXT_ENCODE_DEVICE> <UNET_DEVICE> <VAE_DEVICE> ]
```
For example:
```sh
./heterogeneous_stable_diffusion ./dreamlike_anime_1_0_ov/FP16 'cyberpunk cityscape like Tokyo New York with tall buildings at dusk golden hour cinematic lighting' CPU NPU GPU
```
The sample will create a stable diffusion pipeline such that the text encoder is executed on the CPU, UNet on the NPU, and VAE decoder on the GPU.
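In outline, the assembly looks roughly like the sketch below; the class and factory names follow the OpenVINO GenAI image generation API, paths and devices are placeholders, and `heterogeneous_stable_diffusion.cpp` remains the complete, authoritative version:

```cpp
#include "openvino/genai/image_generation/text2image_pipeline.hpp"

int main() {
    const std::string models_path = "./dreamlike_anime_1_0_ov/FP16";  // assumed model directory

    // Each subcomponent is loaded from its own subfolder and compiled for its own device.
    auto scheduler = ov::genai::Scheduler::from_config(models_path + "/scheduler/scheduler_config.json");

    auto text_encoder = ov::genai::CLIPTextModel(models_path + "/text_encoder");
    text_encoder.compile("CPU");

    auto unet = ov::genai::UNet2DConditionModel(models_path + "/unet");
    unet.compile("NPU");  // devices like NPU may require reshaping to static shapes first; the full sample handles this

    auto vae = ov::genai::AutoencoderKL(models_path + "/vae_decoder");
    vae.compile("GPU");

    // Assemble a Stable Diffusion pipeline from the individually compiled parts.
    auto pipe = ov::genai::Text2ImagePipeline::stable_diffusion(scheduler, text_encoder, unet, vae);
    ov::Tensor image = pipe.generate("cyberpunk cityscape at dusk", ov::genai::num_inference_steps(20));
    return 0;
}
```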
## Run image to image pipeline
The `image2image.cpp` sample demonstrates a basic image to image generation pipeline. The difference from the text to image pipeline is that the final image is denoised from an initial image, which is converted to latent space and noised according to the `strength` parameter. `strength` must be in the range `[0., 1.]`, where `1.` means the initial image is fully noised, which is equivalent to text to image generation.

The `strength` parameter also linearly affects the number of inference steps, because lower `strength` values mean the initial latent already has some structure and requires fewer steps to denoise.
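To make the relationship concrete, under the usual convention the effective number of denoising steps is approximately `num_inference_steps * strength` (e.g. 20 steps at `strength` 0.8 run roughly 16 steps). A minimal sketch of the pipeline usage is below; `load_image.hpp` refers to the image-reading helper shipped next to the samples, and the paths and values are placeholders:

```cpp
#include "openvino/genai/image_generation/image2image_pipeline.hpp"
#include "load_image.hpp"  // samples' helper for reading an image file into an ov::Tensor

int main() {
    // Assumed model directory and device; adjust both to your setup.
    ov::genai::Image2ImagePipeline pipe("./dreamlike_anime_1_0_ov/FP16", "CPU");

    ov::Tensor initial_image = utils::load_image("cat.png");

    // strength = 1.0 would fully noise the input, matching plain text to image generation;
    // lower values keep more of the original structure and need fewer denoising steps.
    ov::Tensor image = pipe.generate(
        "cat wizard, detailed, fantasy, cute",
        initial_image,
        ov::genai::strength(0.8f),
        ov::genai::num_inference_steps(20));
    return 0;
}
```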
To run the sample, download the initial image first:
```sh
wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png
```
And then run the sample:
```sh
./image2image ./dreamlike_anime_1_0_ov/FP16 'cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k' cat.png
```
The resulting image is:
Note that LoRA adapters, heterogeneous execution, and other features of `Text2ImagePipeline` are also applicable to `Image2ImagePipeline`.
## Run inpainting pipeline
The `inpainting.cpp` sample demonstrates usage of the inpainting pipeline, which can inpaint an initial image according to a given mask. The inpainting pipeline works with typical text to image models as well as with specialized inpainting models, which are often named `space/model-inpainting`, e.g. `stabilityai/stable-diffusion-2-inpainting`.
Such models can be converted in the same way as regular ones via `optimum-cli`:
```sh
optimum-cli export openvino --model stabilityai/stable-diffusion-2-inpainting --weight-format fp16 stable-diffusion-2-inpainting
```
Let's also download input data:
```sh
wget -O image.png https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png
wget -O mask_image.png https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png
```
And run the sample:
```sh
./inpainting ./stable-diffusion-2-inpainting 'Face of a yellow cat, high resolution, sitting on a park bench' image.png mask_image.png
```
The resulting image is:
Note that LoRA adapters, heterogeneous execution, and other features of `Text2ImagePipeline` are also applicable to `InpaintingPipeline`.
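For reference, the inpainting flow boils down to roughly the sketch below; `load_image.hpp` and `imwrite.hpp` are the helper utilities shipped next to the samples, and the paths are placeholders:

```cpp
#include "openvino/genai/image_generation/inpainting_pipeline.hpp"
#include "load_image.hpp"  // samples' helper for reading an image file into an ov::Tensor
#include "imwrite.hpp"     // samples' helper for writing BMP images

int main() {
    // Assumed model directory and device; adjust both to your setup.
    ov::genai::InpaintingPipeline pipe("./stable-diffusion-2-inpainting", "CPU");

    ov::Tensor image = utils::load_image("image.png");
    ov::Tensor mask = utils::load_image("mask_image.png");  // white pixels mark the region to repaint

    // The masked region is regenerated according to the prompt; the rest of the image is preserved.
    ov::Tensor result = pipe.generate(
        "Face of a yellow cat, high resolution, sitting on a park bench",
        image,
        mask);
    imwrite("inpainted.bmp", result, true);
    return 0;
}
```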
## Benchmarking sample for image generation pipelines
The `benchmark_image_gen.cpp` sample demonstrates how to benchmark the text to image, image to image, and inpainting pipelines. The sample includes functionality for warm-up iterations, image generation, and calculating various performance metrics.
The usage of this sample is:
```sh
./benchmark_image_gen [OPTIONS]
```
Options:

- `-t, --pipeline_type` (default: `"text2image"`): Pipeline type (text2image, image2image, inpainting).
- `-m, --model`: Path to the model and tokenizers base directory.
- `-p, --prompt` (default: `"The Sky is blue because"`): The prompt used for generation.
- `--nw, --num_warmup` (default: `1`): Number of warmup iterations.
- `-n, --num_iter` (default: `3`): Number of iterations.
- `-d, --device` (default: `"CPU"`): Device(s) to run the pipeline with.
- `-w, --width` (default: `512`): The width of the output image.
- `--ht, --height` (default: `512`): The height of the output image.
- `--is, --num_inference_steps` (default: `20`): The number of inference steps.
- `--ni, --num_images_per_prompt` (default: `1`): The number of images to generate per generate() call.
- `-o, --output_dir` (default: `""`): Path to save output images.
- `-i, --image`: Path to the input image.
- `-s, --strength`: Indicates the extent to transform the reference `image`. Must be between 0 and 1.
- `--mi, --mask_image`: Path to the mask image.
- `-r, --reshape`: Reshape the pipeline before compilation. This can improve image generation performance.
For example:
```sh
./benchmark_image_gen -t text2image -m dreamlike_anime_1_0_ov/FP16 -n 10 -d CPU
```
Performance output:
```
[warmup-0] generate time: 85008.00 ms, total infer time:84999.88 ms
[warmup-0] text encoder infer time: 98.00 ms
[warmup-0] unet iteration num:21, first iteration time:4317.94 ms, other iteration avg time:3800.91 ms
[warmup-0] unet inference num:21, first inference time:4317.71 ms, other inference avg time:3800.61 ms
[warmup-0] vae encoder infer time:0.00 ms, vae decoder infer time:4572.00 ms
[iter-0] generate time: 84349.00 ms, total infer time:84340.97 ms
[iter-0] text encoder infer time: 76.00 ms
[iter-0] unet iteration num:21, first iteration time:3805.63 ms, other iteration avg time:3799.68 ms
[iter-0] unet inference num:21, first inference time:3805.42 ms, other inference avg time:3799.38 ms
[iter-0] vae encoder infer time:0.00 ms, vae decoder infer time:4472.00 ms
[iter-1] generate time: 84391.00 ms, total infer time:84384.36 ms
[iter-1] text encoder infer time: 78.00 ms
[iter-1] unet iteration num:21, first iteration time:3801.15 ms, other iteration avg time:3802.17 ms
[iter-1] unet inference num:21, first inference time:3800.93 ms, other inference avg time:3801.87 ms
[iter-1] vae encoder infer time:0.00 ms, vae decoder infer time:4468.00 ms
[iter-2] generate time: 84377.00 ms, total infer time:84366.51 ms
[iter-2] text encoder infer time: 76.00 ms
[iter-2] unet iteration num:21, first iteration time:3783.31 ms, other iteration avg time:3802.25 ms
[iter-2] unet inference num:21, first inference time:3783.09 ms, other inference avg time:3801.82 ms
[iter-2] vae encoder infer time:0.00 ms, vae decoder infer time:4471.00 ms
Test finish, load time: 9356.00 ms
Warmup number:1, first generate warmup time:85008.00 ms, infer warmup time:84999.88 ms
Generate iteration number:3, for one iteration, generate avg time: 84372.34 ms, infer avg time:84363.95 ms, all text encoders infer avg time:76.67 ms, vae encoder infer avg time:0.00 ms, vae decoder infer avg time:4470.33 ms
```