Visual Prompting (Zero-shot learning)#

Visual prompting is a computer vision task that combines an image with prompts such as text, bounding boxes, or points to solve a problem. The main purpose of this task is to obtain labels from unlabeled datasets and then either to use the generated label information on particular domains or to develop a new model with that information.

This section examines the solutions for visual prompting offered by the OpenVINO Training Extensions library. Segment Anything (SAM) is one of the best-known visual prompting methods, and this model is used to adapt to a new dataset domain. In particular, this section covers predicting on given images automatically, without any training, which is called zero-shot learning. Unlike fine-tuning, zero-shot learning needs only a pre-processing component:

  • Pre-processing: Resize the image according to its longest axis and pad the rest with zeros (see the sketch below).

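A minimal sketch of this pre-processing step, assuming a NumPy/OpenCV pipeline; the function name resize_longest_and_pad and the target size of 1024 are illustrative assumptions, not the actual OpenVINO Training Extensions API:

import cv2
import numpy as np

def resize_longest_and_pad(image: np.ndarray, target_size: int = 1024) -> np.ndarray:
    """Resize so the longest side equals target_size, then zero-pad to a square."""
    h, w = image.shape[:2]
    scale = target_size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(image, (new_w, new_h))  # cv2.resize expects (width, height)

    # Zero-pad the bottom/right so the output is target_size x target_size.
    padded = np.zeros((target_size, target_size, image.shape[2]), dtype=image.dtype)
    padded[:new_h, :new_w] = resized
    return padded
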
Note

Currently, only zero-shot learning with SAM using bounding boxes is supported in OpenVINO Training Extensions. Zero-shot learning with other prompts (points and text) will be supported in the near future.

Note

Currently, only Post-Training Quantization (PTQ) is supported for SAM; Quantization Aware Training (QAT) is not.

Dataset Format#

For the dataset handling inside OpenVINO™ Training Extensions, we use Dataset Management Framework (Datumaro).

We support three dataset formats for visual prompting:

If your dataset is organized in a supported format, starting training is very simple. Just pass the path to the data root folder and the desired model template to start training:

$ otx train <model_template> \
    --train-data-roots <path_to_data_root> \
    --val-data-roots <path_to_data_root>

Models#

We support the following model template in the experimental phase:

| Template ID            | Name                   | Complexity (GFLOPs) | Model size (MB) |
|------------------------|------------------------|---------------------|-----------------|
| Zero_Shot_SAM_Tiny_ViT | Zero_Shot_SAM_Tiny_ViT | 38.18               | 25              |

Simple tutorial#

There are two steps for zero-shot inference: learn and infer. Learn extracts reference features from the given reference images and prompts. These reference features are used to get point candidates on given target images, and they are saved in the model checkpoint (such as weight.pth) together with the model. You can run learn with the following command:

$ otx train <model_template> \
    --train-data-roots <path_to_data_root> \
    --val-data-roots <path_to_data_root>

Infer predicts masks on given target images. Unlike learn, this stage doesn't need any prompt information.

$ otx eval <model_template> \
    --load-weights <path_to_weights_from_learn> \
    --test-data-roots <path_to_data_root>

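Conceptually, learn pools the encoder features inside the prompted region into a reference feature, and infer compares that reference feature with the target image features to pick point candidates. The sketch below illustrates this idea with NumPy under assumed shapes and names (learn, infer_point_candidate); it is a simplified illustration, not the actual OpenVINO Training Extensions implementation:

import numpy as np

def learn(image_embedding: np.ndarray, prompt_mask: np.ndarray) -> np.ndarray:
    """Average the embedding inside the prompted region to get a reference feature.

    image_embedding: (C, H, W) features from an image encoder.
    prompt_mask:     (H, W) binary mask derived from the prompt (e.g. a bounding box).
    """
    feats = image_embedding.reshape(image_embedding.shape[0], -1)   # (C, H*W)
    mask = prompt_mask.reshape(-1).astype(bool)                     # (H*W,)
    ref = feats[:, mask].mean(axis=1)                               # (C,)
    return ref / np.linalg.norm(ref)

def infer_point_candidate(ref_feature: np.ndarray, target_embedding: np.ndarray):
    """Return the most similar location in the target image as a point candidate."""
    c, h, w = target_embedding.shape
    feats = target_embedding.reshape(c, -1)
    feats = feats / (np.linalg.norm(feats, axis=0, keepdims=True) + 1e-8)
    similarity = ref_feature @ feats                                # (H*W,) cosine similarity
    y, x = np.unravel_index(similarity.argmax(), (h, w))
    return (y, x), similarity.reshape(h, w)
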
For example, when positive (green) and negative (red) points are given with the reference image in the learn stage, you get the basic SAM prediction result (left). If you give the same reference image as the target image in the infer stage, you get the target prediction results (right).

[Images: vpm_ref_result.png, vpm_ref_prediction.png]

You can get target prediction results for other given images, as shown below.

[Image: vpm_tgt_prediction.png]