Instance Segmentation

Instance segmentation is a computer vision task that involves identifying and segmenting individual objects within an image.

It is a more advanced version of object detection: it not only detects the presence of an object in an image but also segments the object by creating a mask that separates it from the background. This allows more detailed information about the object, such as its shape and location, to be extracted.
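To make the difference from plain detection concrete, the sketch below derives a bounding box and a pixel area from a toy binary mask. The mask layout and the helper name `mask_to_box_and_area` are illustrative assumptions, not part of OpenVINO™ Training Extensions.

```python
from typing import List, Tuple

Mask = List[List[int]]  # rows of 0/1 values; 1 marks object pixels


def mask_to_box_and_area(mask: Mask) -> Tuple[Tuple[int, int, int, int], int]:
    """Return ((x_min, y_min, x_max, y_max), pixel_area) for a non-empty binary mask."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) for v in row if v]
    if not xs:
        raise ValueError("mask contains no object pixels")
    return (min(xs), min(ys), max(xs), max(ys)), len(xs)


# A toy 5x6 image containing one L-shaped object (hypothetical data).
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
box, area = mask_to_box_and_area(mask)
print(box, area)
```

The box alone would describe a 3x3 region, while the mask shows that only 5 of those pixels belong to the object. This per-pixel shape information is what instance segmentation adds on top of detection.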

Instance segmentation is commonly used in applications such as self-driving cars, robotics, and image-editing software.


We integrate two types of instance segmentation architecture within OpenVINO™ Training Extensions: Mask R-CNN and RTMDet.

Mask R-CNN, a widely adopted method, builds upon the Faster R-CNN architecture, known for its two-stage object detection mechanism. In the initial stage, it proposes regions of interest, while in the subsequent stage, it predicts the class and bounding box offsets for each proposal. Distinguishing itself, Mask R-CNN incorporates an additional branch dedicated to predicting object masks concurrently with the existing branches for bounding box regression and object classification.
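As an illustration of the second stage, the sketch below shows how predicted bounding box offsets can refine a proposal, using the center/size parameterization described in the Faster R-CNN paper. The function name `apply_offsets` and the sample numbers are assumptions for illustration; real implementations typically also normalize the offsets and clip the result to the image.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def apply_offsets(proposal: Box, deltas: Tuple[float, float, float, float]) -> Box:
    """Refine a proposal with predicted (dx, dy, dw, dh) offsets."""
    x1, y1, x2, y2 = proposal
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    # Center shifts are relative to the proposal size; size changes are log-scaled.
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (
        new_cx - 0.5 * new_w,
        new_cy - 0.5 * new_h,
        new_cx + 0.5 * new_w,
        new_cy + 0.5 * new_h,
    )


# Hypothetical proposal: shift right by 10% of its width and double its height.
refined = apply_offsets((10.0, 20.0, 50.0, 60.0), (0.1, 0.0, 0.0, math.log(2.0)))
print(refined)
```

In Mask R-CNN, the mask branch then predicts a per-pixel mask for each region alongside these class and box outputs.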

On the other hand, RTMDet leverages the architecture of RTMNet, a lightweight, one-stage model designed for both object detection and instance segmentation tasks. RTMNet prioritizes efficiency, making it particularly suitable for real-time applications. RTMDet-Inst extends the capabilities of RTMNet to encompass instance segmentation by integrating a mask prediction branch.

For supervised training, we use the following algorithm components:

  • Augmentations: We use only a random flip for both augmentation pipelines, train and validation.

  • Optimizer: We use the SGD optimizer with weight decay set to 1e-4 and momentum set to 0.9.

  • Learning rate schedule: To schedule the training process, we use ReduceLrOnPlateau with a linear learning rate warmup for 200 iterations. This method monitors a target metric (in our case, a metric on the validation set), and if no improvement is seen for a patience number of epochs, the learning rate is reduced.

  • Loss functions: For bounding box regression, we use L1 Loss (the sum of the absolute differences between the ground truth value and the predicted value). For category classification and segmentation mask prediction, we use Cross Entropy Loss.

  • Additionally, we use the Exponential Moving Average (EMA) of the model’s weights and early stopping to add adaptability to the training pipeline and prevent overfitting.
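The learning rate schedule and EMA described above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used by the recipes: the 200-iteration warmup comes from the text, while the `patience`, `factor`, and `decay` values, the class and function names, and the sample metric values are assumptions.

```python
from typing import Dict


def warmup_lr(base_lr: float, iteration: int, warmup_iters: int = 200) -> float:
    """Linearly ramp the learning rate up to base_lr over warmup_iters iterations."""
    if iteration >= warmup_iters:
        return base_lr
    return base_lr * (iteration + 1) / warmup_iters


class ReduceOnPlateau:
    """Cut the learning rate when a higher-is-better metric stops improving."""

    def __init__(self, lr: float, patience: int = 2, factor: float = 0.1) -> None:
        self.lr = lr
        self.patience = patience  # assumed value, see the recipe for the real one
        self.factor = factor      # assumed value
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric: float) -> float:
        """Record one epoch's validation metric and return the current learning rate."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


def ema_update(ema: Dict[str, float], weights: Dict[str, float], decay: float = 0.999) -> None:
    """Blend the current weights into the running average, in place."""
    for name, value in weights.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value


scheduler = ReduceOnPlateau(lr=0.01, patience=2, factor=0.1)
val_map = [0.30, 0.35, 0.35, 0.34, 0.35, 0.36]  # hypothetical validation scores
lrs = [scheduler.step(m) for m in val_map]
print(lrs)
```

In this toy run the metric fails to improve for three consecutive epochs, which exceeds the patience of two, so the learning rate is reduced once. Early stopping follows the same counting logic but ends training instead of lowering the learning rate.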

Dataset Format

For dataset handling inside OpenVINO™ Training Extensions, we use the Dataset Management Framework (Datumaro). For instance segmentation, we support the COCO dataset format.
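For orientation, the sketch below builds a minimal COCO-style annotation dictionary with one polygon mask. The top-level keys and annotation fields follow the public COCO format; the file name, category, and coordinates are hypothetical, and the helper `polygon_area` is added only for illustration.

```python
import json
from typing import List


def polygon_area(flat_xy: List[float]) -> float:
    """Shoelace area of a polygon given as [x1, y1, x2, y2, ...]."""
    xs, ys = flat_xy[0::2], flat_xy[1::2]
    n = len(xs)
    twice_area = sum(
        xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n)
    )
    return abs(twice_area) / 2.0


# Hypothetical object outline: a 50x30 rectangle.
polygon = [10.0, 10.0, 60.0, 10.0, 60.0, 40.0, 10.0, 40.0]

coco = {
    "images": [{"id": 1, "file_name": "example.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "car"}],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "segmentation": [polygon],         # list of polygons per instance
            "bbox": [10.0, 10.0, 50.0, 30.0],  # [x, y, width, height]
            "area": polygon_area(polygon),
            "iscrowd": 0,
        }
    ],
}
print(json.dumps(coco, indent=2))
```

Each entry in `annotations` describes one object instance, so an image with several objects of the same category has several annotations pointing at the same `image_id`.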


Please refer to our dedicated tutorial on how to train, validate, and optimize an instance segmentation model for more details.


We support the following ready-to-use model recipes:

| Model Recipe | Complexity (GFLOPs) | Model size (MB) |
| Instance Segmentation MaskRCNN EfficientNetB2B | | |
| Instance Segmentation MaskRCNN ResNet50 | | |
| Instance Segmentation MaskRCNN SwinT | | |
| Instance Segmentation RTMDet-Inst Tiny | | |

The table above can be obtained using the following command:

(otx) ...$ otx find --task INSTANCE_SEGMENTATION

MaskRCNN-SwinT leverages the Swin Transformer architecture as its backbone network for feature extraction. This choice, while yielding superior accuracy, comes with a longer training time and higher computational requirements.

In contrast, the MaskRCNN-ResNet50 model adopts the more conventional ResNet-50 backbone network, striking a balance between accuracy and computational efficiency.

Meanwhile, MaskRCNN-EfficientNetB2B employs the EfficientNet-B2 architecture as its backbone, offering a compromise between accuracy and training speed. This makes it a favorable option when minimizing training time and computational resources is essential.

Recently, we have updated RTMDet-Ins-tiny, integrating work from RTMNet to prioritize real-time instance segmentation inference. While this model is tailored for real-time applications due to its lightweight design, it may not achieve the same level of accuracy as its counterparts and may require more extensive training data.

Our experiments indicate that MaskRCNN-SwinT and MaskRCNN-ResNet50 outperform MaskRCNN-EfficientNetB2B in terms of accuracy. However, if reducing training time is paramount, transitioning to MaskRCNN-EfficientNetB2B is recommended. Conversely, for applications where inference speed is crucial, RTMDet-Ins-tiny presents an optimal solution.

The table below presents the mAP metric on some academic datasets using our supervised pipeline. The results were obtained with our recipes without any changes. We use 1024x1024 image resolution; for other hyperparameters, please refer to the related recipe. We trained each model on a single Nvidia GeForce RTX3090.

Model name



Pascal-VOC 2007