Instance Segmentation#

Instance segmentation is a computer vision task that involves identifying and segmenting individual objects within an image.

It is a more advanced version of object detection: it not only detects the presence of an object in an image but also segments the object by creating a mask that separates it from the background. This allows more detailed information about the object, such as its shape and location, to be extracted.

Instance segmentation is commonly used in applications such as self-driving cars, robotics, and image-editing software.

../../../../_images/instance_seg_example.png

We solve this problem using the Mask R-CNN approach. The main idea of Mask R-CNN is to add a branch for predicting an object mask in parallel with the existing branch for bounding box regression and object classification.

This is done by using a fully convolutional network (FCN) on top of the feature map generated by the last convolutional layer of the backbone network. The model first generates region proposals and then uses a RoIAlign layer to align them with the feature map. For each proposal, the model then predicts the class and box offset, and the FCN predicts a mask for each class.

The mask branch is trained to predict a binary mask for each object instance, where the mask is aligned with the object’s bounding box and has the same size as the region of interest (RoI). The predicted mask is then used to segment the object from the background.
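As a rough illustration of that last step, the sketch below thresholds a fixed-size RoI mask of probabilities and pastes it into a full-image binary mask at the box location. It is plain Python with nearest-neighbor resizing. The function name `paste_mask`, the 0.5 threshold, and the `(x1, y1, x2, y2)` box format are assumptions made for this example, not the implementation used in OpenVINO™ Training Extensions.

```python
def paste_mask(roi_mask, box, image_height, image_width, threshold=0.5):
    """Paste a fixed-size RoI mask of probabilities into a full-size binary mask.

    roi_mask: list of rows with values in [0, 1].
    box: (x1, y1, x2, y2) in integer pixel coordinates, x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = box
    mask_h, mask_w = len(roi_mask), len(roi_mask[0])
    box_h, box_w = y2 - y1, x2 - x1
    full = [[0] * image_width for _ in range(image_height)]
    # Only visit pixels that lie inside both the box and the image.
    for y in range(max(y1, 0), min(y2, image_height)):
        # Nearest-neighbor lookup: map the image row back to a mask row.
        src_y = min((y - y1) * mask_h // box_h, mask_h - 1)
        for x in range(max(x1, 0), min(x2, image_width)):
            src_x = min((x - x1) * mask_w // box_w, mask_w - 1)
            full[y][x] = 1 if roi_mask[src_y][src_x] >= threshold else 0
    return full


# Hypothetical 2x2 mask stretched over a 4x2 box in a 6x4 image.
roi_mask = [[0.9, 0.2],
            [0.8, 0.1]]
full_mask = paste_mask(roi_mask, box=(1, 1, 5, 3), image_height=4, image_width=6)
```

Pixels outside the box stay 0, so the result separates the instance from the background at image resolution.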

For supervised training we use the following algorithm components:

  • Augmentations: We use only a random flip in both augmentation pipelines, train and validation.

  • Optimizer: We use the SGD optimizer with the weight decay set to 1e-4 and momentum set to 0.9.

  • Learning rate schedule: We use ReduceLrOnPlateau with a linear learning rate warmup for the first 200 iterations. This method monitors a target metric (in our case, a metric computed on the validation set) and reduces the learning rate if no improvement is seen for a patience number of epochs.

  • Loss functions: For bounding box regression we use L1 Loss (the sum of the absolute differences between the ground truth and predicted values). For category classification and segmentation mask prediction we use Cross Entropy Loss.

  • Additionally, we use the Exponential Moving Average (EMA) of the model’s weights and early stopping to add adaptability to the training pipeline and prevent overfitting.
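The learning rate schedule in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scheduler used by the training pipeline: the class name `WarmupPlateauSchedule`, the base learning rate, and the `patience`, `factor`, and `min_lr` defaults are hypothetical. Only the 200-iteration linear warmup comes from the text.

```python
class WarmupPlateauSchedule:
    """Linear warmup by iteration, then reduce-on-plateau by epoch (illustrative)."""

    def __init__(self, base_lr, warmup_iters=200, patience=3, factor=0.1, min_lr=1e-6):
        self.lr = base_lr
        self.warmup_iters = warmup_iters
        self.patience = patience      # assumed value, not from the source
        self.factor = factor          # assumed value, not from the source
        self.min_lr = min_lr          # assumed value, not from the source
        self.best = float("-inf")
        self.bad_epochs = 0

    def lr_at_iteration(self, iteration):
        """Ramp linearly up to the current lr over the warmup window."""
        if iteration < self.warmup_iters:
            return self.lr * (iteration + 1) / self.warmup_iters
        return self.lr

    def end_of_epoch(self, val_metric):
        """Call once per epoch with the validation metric (higher is better)."""
        if val_metric > self.best:
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            # Reduce after `patience` consecutive epochs without improvement.
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


schedule = WarmupPlateauSchedule(base_lr=0.01, patience=2)
warm = schedule.lr_at_iteration(99)  # halfway through the warmup window
for metric in [0.30, 0.35, 0.34, 0.35]:
    lr = schedule.end_of_epoch(metric)
```

In this run the metric stops improving after the second epoch, so the learning rate is multiplied by `factor` once the patience of two epochs is exhausted.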

Dataset Format#

To handle datasets inside OpenVINO™ Training Extensions, we use the Dataset Management Framework (Datumaro). For instance segmentation we support the COCO dataset format. If your dataset is in this format, you can start training with a single command:

$ otx train  <model_template> --train-data-roots <path_to_data_root> \
                                        --val-data-roots <path_to_data_root>
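For readers unfamiliar with the COCO layout, the sketch below builds a minimal instance segmentation annotation in memory and runs a few sanity checks on it. The file name, category name, and coordinates are made up for illustration, and `check_coco` is a hypothetical helper rather than part of Datumaro or OpenVINO™ Training Extensions.

```python
import json

coco = {
    "images": [
        {"id": 1, "file_name": "0001.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "bottle"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            # bbox is [x, y, width, height] in pixels.
            "bbox": [100.0, 120.0, 50.0, 80.0],
            # One polygon as a flat list [x1, y1, x2, y2, ...].
            "segmentation": [[100.0, 120.0, 150.0, 120.0, 150.0, 200.0, 100.0, 200.0]],
            "area": 4000.0,
            "iscrowd": 0,
        },
    ],
}


def check_coco(dataset):
    """Light checks on cross-references; not a full COCO validator."""
    image_ids = {img["id"] for img in dataset["images"]}
    category_ids = {cat["id"] for cat in dataset["categories"]}
    for ann in dataset["annotations"]:
        assert ann["image_id"] in image_ids
        assert ann["category_id"] in category_ids
        assert len(ann["bbox"]) == 4
        for polygon in ann["segmentation"]:
            # A polygon needs at least three (x, y) points.
            assert len(polygon) >= 6 and len(polygon) % 2 == 0
    return True


is_valid = check_coco(coco)
serialized = json.dumps(coco, indent=2)
```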

Note

Please refer to our dedicated tutorial on how to train, validate, and optimize an instance segmentation model for more details.

Models#

We support the following ready-to-use model templates:

| Template ID | Name | Complexity (GFLOPs) | Model size (MB) |
|---|---|---|---|
| Custom_Counting_Instance_Segmentation_MaskRCNN_EfficientNetB2B | MaskRCNN-EfficientNetB2B | 68.48 | 13.27 |
| Custom_Counting_Instance_Segmentation_MaskRCNN_ResNet50 | MaskRCNN-ResNet50 | 533.80 | 177.90 |
| Custom_Counting_Instance_Segmentation_MaskRCNN_ConvNeXt | MaskRCNN-ConvNeXt | 266.78 | 192.4 |

MaskRCNN-ResNet50 utilizes the ResNet-50 architecture as the backbone network for extracting image features. This choice of backbone network results in a higher number of parameters and FLOPs, which consequently requires more training time. However, the model offers superior performance in terms of accuracy.

On the other hand, MaskRCNN-EfficientNetB2B employs the EfficientNet-B2 architecture as the backbone network. This selection strikes a balance between accuracy and speed, making it a preferable option when prioritizing training time and computational cost.

We have recently updated MaskRCNN-ConvNeXt, which incorporates the ConvNeXt backbone. In our experiments, this variant achieves better accuracy than MaskRCNN-ResNet50 while using less GPU memory. However, training and inference time may increase slightly. If minimizing training time is a significant concern, we recommend switching to MaskRCNN-EfficientNetB2B.

Semi-supervised Learning#

We employ a modified Unbiased Teacher framework to tackle semi-supervised learning in instance segmentation. This framework leverages two models during training: a “student” model, which serves as the primary model being trained, and a “teacher” model, which acts as a guiding reference for the student model.

During training, the student model is updated using both ground truth annotations (for labeled data) and pseudo-labels (for unlabeled data). These pseudo-labels are generated by the teacher model’s predictions. Notably, the teacher model’s parameters are updated based on the moving average of the student model’s parameters. This means that backward loss propagation is not utilized for updating the teacher model. Once training is complete, only the student model is used for making predictions in the instance segmentation task.
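The moving-average update of the teacher can be sketched as follows. Parameters are represented as plain floats in dictionaries to keep the example self-contained. The function name `ema_update` and the momentum values are assumptions for illustration, since the source does not state the value it uses.

```python
def ema_update(teacher, student, momentum=0.999):
    """Update teacher parameters in place as a moving average of the student's.

    No gradients are involved: the teacher only tracks the student.
    """
    for name, student_value in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student_value
    return teacher


# Hypothetical parameters; a low momentum is used so the change is visible.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}
ema_update(teacher, student, momentum=0.9)
```

With a momentum close to 1, the teacher changes slowly and averages out the noise in individual student updates.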

We also use a warmup stage for the teacher model during the first epochs to avoid relying on misleading pseudo-labels. We use constant thresholding for pseudo bounding boxes and, unlike the original Unbiased Teacher work, we use all unlabeled losses, including the regression and mask losses.

There are some key differences in the augmentation pipelines used for labeled and unlabeled data. Basic augmentations such as random flip, random rotate, and random crop are applied to the teacher model’s input, while stronger augmentations such as color distortion, RGB-to-gray conversion, Gaussian blur, and Random Erasing are applied to the student model’s input. This discrepancy helps improve generalization and prevents overfitting on the pseudo-labels generated by the teacher model.

As in the supervised pipeline, we use EMA smoothing for the student model throughout the whole training process.
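The constant thresholding of pseudo bounding boxes mentioned above amounts to keeping only the teacher predictions whose confidence reaches a fixed value. A minimal sketch follows. The 0.7 threshold, the prediction dictionary layout, and the function name `filter_pseudo_labels` are assumptions, not values taken from the templates.

```python
def filter_pseudo_labels(predictions, score_threshold=0.7):
    """Keep teacher predictions whose confidence reaches a constant threshold."""
    return [p for p in predictions if p["score"] >= score_threshold]


# Hypothetical teacher output for one unlabeled image.
teacher_predictions = [
    {"label": 0, "score": 0.92, "bbox": [10, 20, 50, 80]},
    {"label": 1, "score": 0.40, "bbox": [5, 5, 30, 30]},
    {"label": 0, "score": 0.75, "bbox": [60, 10, 90, 40]},
]
pseudo_labels = filter_pseudo_labels(teacher_predictions)
```

The surviving predictions then serve as targets for the student on that image.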

Note

To obtain better performance after fine-tuning on the small labeled datasets used in Semi-SL tasks, we adopt a repeat dataset, which improves the metric in our experiments. However, the training time also increases noticeably. If training time is important or the Semi-SL dataset has a sufficient number of labeled images, the number of dataset repetitions can be decreased or repetition can be switched off in the corresponding data_pipeline.py config.

The table below presents the mAP metric achieved by our templates on various datasets. We provide these scores for comparison purposes, alongside the supervised baseline trained solely on labeled data.

  • Cityscapes: 8 classes, 267 labeled images, 2708 unlabeled images, and 500 images for validation

  • TrashCan: 22 classes, 606 labeled images, 5459 unlabeled images, and 1147 images for validation

  • WGISD: 5 classes, 11 labeled images, 599 unlabeled images, and 27 images for validation

  • Pascal-tiny: 20 classes, 337 labeled images, 709 unlabeled images, and 303 images for validation

| Model name | Cityscapes | TrashCan | WGISD | Pascal-tiny |
|---|---|---|---|---|
| MaskRCNN-ResNet50-supervised | 33.79 | 25.63 | 23.21 | 22.23 |
| MaskRCNN-ResNet50-semisl | 40.08 | 36.78 | 41.12 | 24.84 |
| MaskRCNN-EfficientNetB2B-supervised | 25.81 | 23.12 | 32.6 | 15.00 |
| MaskRCNN-EfficientNetB2B-semisl | 28.73 | 26.45 | 33.5 | 16.24 |
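To make the comparison easier to read, the short snippet below computes the absolute mAP gain of the semi-supervised MaskRCNN-ResNet50 over its supervised baseline, using only the scores reported above. The variable names are chosen for this example.

```python
# mAP scores for MaskRCNN-ResNet50, copied from the results above.
supervised = {"Cityscapes": 33.79, "TrashCan": 25.63, "WGISD": 23.21, "Pascal-tiny": 22.23}
semisl = {"Cityscapes": 40.08, "TrashCan": 36.78, "WGISD": 41.12, "Pascal-tiny": 24.84}

# Absolute gain in mAP points, rounded to two decimals.
gains = {name: round(semisl[name] - supervised[name], 2) for name in supervised}
largest = max(gains, key=gains.get)
```

The largest gain is on WGISD, the dataset with the fewest labeled images.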