Instance Segmentation

Instance segmentation is a computer vision task that involves identifying and segmenting individual objects within an image.

It is a more advanced version of object detection: it not only detects the presence of an object in an image but also segments the object by creating a mask that separates it from the background. This makes it possible to extract more detailed information about the object, such as its shape and location.

Instance segmentation is commonly used in applications such as self-driving cars, robotics, and image-editing software.

[Figure: instance segmentation example]

We solve this problem with the Mask R-CNN approach. The main idea of Mask R-CNN is to add a branch that predicts an object mask in parallel with the existing branch for bounding box regression and object classification.

This is done by using a fully convolutional network (FCN) on top of the feature map generated by the last convolutional layer of the backbone network. The model first generates region proposals and then uses a RoIAlign layer to align each proposal with the feature map. From the aligned features, the model predicts the class and box offset for each proposal, and the FCN predicts a mask for each class.
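The RoIAlign step described above can be illustrated with a small, dependency-free sketch. It is a simplified illustration rather than the implementation used by the library: the function names, the `out_size` and `samples` parameters, and the convention that `feature[y][x]` sits at coordinate `(y, x)` are all assumptions made for this example.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at fractional (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp the sampling point to the valid range of the feature map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, box, out_size=2, samples=2):
    """Pool a box (x1, y1, x2, y2) in feature-map coordinates to out_size x out_size.

    Each output bin averages samples x samples bilinearly interpolated points,
    so box coordinates are never rounded to integer cells.
    """
    x1, y1, x2, y2 = box
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            for si in range(samples):
                for sj in range(samples):
                    # Regularly spaced sampling points inside bin (i, j).
                    sy = y1 + (i + (si + 0.5) / samples) * bin_h
                    sx = x1 + (j + (sj + 0.5) / samples) * bin_w
                    total += bilinear(feature, sy, sx)
            row.append(total / (samples * samples))
        pooled.append(row)
    return pooled
```

Because the sampling points are interpolated instead of snapped to the grid, a proposal with fractional coordinates still yields features that line up with the underlying image region, which is what the mask branch relies on.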

The mask branch is trained to predict a binary mask for each object instance, where the mask is aligned with the object’s bounding box and has the same size as the region of interest (RoI). The predicted mask is then used to segment the object from the background.
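As a rough illustration of how a predicted mask can be used to segment the object, the sketch below resizes a per-RoI mask of probabilities to its bounding box and pastes it into a full-size binary mask. The function name `paste_mask`, the integer `(x1, y1, x2, y2)` box convention, the nearest-neighbour resizing, and the 0.5 threshold are assumptions for this example, not details confirmed by this page.

```python
def paste_mask(mask, box, image_h, image_w, threshold=0.5):
    """Resize a small RoI mask to its box and paste it into a full-size binary mask.

    mask: 2D list of probabilities predicted for one RoI.
    box:  integer (x1, y1, x2, y2) in image coordinates, with x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = box
    box_w = max(x2 - x1, 1)
    box_h = max(y2 - y1, 1)
    m_h, m_w = len(mask), len(mask[0])
    full = [[0] * image_w for _ in range(image_h)]
    # Only visit pixels that lie inside both the box and the image.
    for y in range(max(y1, 0), min(y2, image_h)):
        for x in range(max(x1, 0), min(x2, image_w)):
            # Nearest-neighbour lookup into the low-resolution mask.
            my = min(int((y - y1) * m_h / box_h), m_h - 1)
            mx = min(int((x - x1) * m_w / box_w), m_w - 1)
            full[y][x] = 1 if mask[my][mx] >= threshold else 0
    return full
```

Pixels outside the box stay at zero, so the resulting mask separates that single instance from the background and from other instances.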

For supervised training we use the following algorithm components:

Augmentations: We use only a random flip in both the training and validation augmentation pipelines.
Optimizer: We use the SGD optimizer with weight decay set to 1e-4 and momentum set to 0.9.
Learning rate schedule: We use ReduceLrOnPlateau with a linear learning rate warmup over the first 200 iterations. This scheduler monitors a target metric (in our case, a metric computed on the validation set) and reduces the learning rate if no improvement is seen for a patience number of epochs.
Loss functions: For bounding box regression we use L1 Loss (the sum of the absolute differences between the ground truth and predicted values). For category classification and segmentation mask prediction we use Cross Entropy Loss.
Additionally, we use an Exponential Moving Average (EMA) of the model’s weights and early stopping to add adaptability to the training pipeline and prevent overfitting.
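The learning rate schedule and the EMA of the weights listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the training code itself: the class and function names are hypothetical, and the values of `warmup_ratio`, `patience`, `factor`, `min_lr`, and `decay` are placeholders chosen for the example. Only the 200 warmup iterations come from the text.

```python
class WarmupPlateauSchedule:
    """Linear warmup per iteration, then reduce-on-plateau per epoch (illustrative)."""

    def __init__(self, base_lr, warmup_iters=200, warmup_ratio=1.0 / 3,
                 patience=3, factor=0.1, min_lr=1e-6):
        self.lr = base_lr            # post-warmup learning rate, lowered on plateaus
        self.warmup_iters = warmup_iters
        self.warmup_ratio = warmup_ratio  # assumed starting fraction of base_lr
        self.patience = patience
        self.factor = factor
        self.min_lr = min_lr
        self.best = None
        self.bad_epochs = 0

    def lr_at_iter(self, it):
        """Learning rate for global iteration `it` (0-based)."""
        if it < self.warmup_iters:
            alpha = it / self.warmup_iters
            start = self.lr * self.warmup_ratio
            # Ramp linearly from `start` up to the current learning rate.
            return start + (self.lr - start) * alpha
        return self.lr

    def step_epoch(self, metric):
        """Call once per epoch with the validation metric (higher is better)."""
        if self.best is None or metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                # No improvement for `patience` epochs: reduce the learning rate.
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


def ema_update(ema_weights, weights, decay=0.999):
    """Blend the current weights into their exponential moving average."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

In a training loop, `lr_at_iter` would be queried before every optimizer step, `ema_update` after it, and `step_epoch` once validation has finished for the epoch.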

Dataset Format

For dataset handling inside OpenVINO™ Training Extensions, we use the Dataset Management Framework (Datumaro). For instance segmentation, we support the COCO dataset format. If your dataset is in this format, you can start training with a single command:

$ otx train --template <model_template> --train-data-roots <path_to_data_root> \
            --val-data-roots <path_to_data_root>

Note

Please refer to our dedicated tutorial on how to train, validate, and optimize an instance segmentation model for more details.
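For readers unfamiliar with the COCO format mentioned above, the sketch below builds a minimal annotation structure with one image, one category, and one polygon-annotated instance, then runs a few basic consistency checks. The file name, category name, and the helper `check_coco` are hypothetical and exist only for this illustration; the keys follow the standard COCO instance annotation layout.

```python
import json

# A minimal COCO-style annotation structure with a single instance.
coco = {
    "images": [{"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "car"}],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [100.0, 120.0, 200.0, 150.0],  # [x, y, width, height]
            # One polygon as a flat list: x1, y1, x2, y2, ...
            "segmentation": [[100.0, 120.0, 300.0, 120.0, 300.0, 270.0, 100.0, 270.0]],
            "area": 30000.0,
            "iscrowd": 0,
        }
    ],
}


def check_coco(data):
    """Return a list of basic consistency problems (empty when none are found)."""
    problems = []
    image_ids = {img["id"] for img in data["images"]}
    category_ids = {cat["id"] for cat in data["categories"]}
    for ann in data["annotations"]:
        if ann["image_id"] not in image_ids:
            problems.append(f"annotation {ann['id']}: unknown image_id")
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann['id']}: unknown category_id")
        for polygon in ann["segmentation"]:
            # A polygon needs at least three (x, y) points.
            if len(polygon) < 6 or len(polygon) % 2:
                problems.append(f"annotation {ann['id']}: malformed polygon")
    return problems


if __name__ == "__main__":
    print(json.dumps(coco, indent=2))
    print(check_coco(coco))
```

A check like this is a cheap way to catch dangling image or category references before starting a long training run.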

Models

We support the following ready-to-use model templates:

Template ID	Name	Complexity (GFLOPs)	Model size (MB)
Custom_Counting_Instance_Segmentation_MaskRCNN_EfficientNetB2B	MaskRCNN-EfficientNetB2B	68.48	13.27
Custom_Counting_Instance_Segmentation_MaskRCNN_ResNet50	MaskRCNN-ResNet50	533.80	177.90

MaskRCNN-ResNet50 uses ResNet-50 as the backbone network for image feature extraction. It has more parameters and FLOPs and takes longer to train, but it provides higher accuracy. MaskRCNN-EfficientNetB2B uses EfficientNet-B2 as the backbone network. It offers a good trade-off between accuracy and speed and is the better choice when training time and computational cost are a priority.