Object Detection#

Object detection is a computer vision task that requires locating objects in an image: the model finds the bounding box coordinates of each object and assigns it a class. The input is an image, and the output for each detected object is a pair of coordinates for the bounding box corners and a class number.

A common approach to building an object detection architecture is to take a feature extractor (backbone), which can be inherited from the classification task, and follow it with a head that calculates coordinates and class probabilities based on aggregated information from the image. Additionally, some architectures use a Feature Pyramid Network (FPN), called the neck, to transfer and process feature maps from the backbone to the head.
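
To make the relation between the head outputs (coordinates and class probabilities) and the final result (corners plus a class number) concrete, here is a minimal sketch in plain Python. The names `Detection` and `decode_predictions`, as well as the `score_threshold` value, are illustrative assumptions and not part of the library.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    x1: float  # first corner of the bounding box
    y1: float
    x2: float  # opposite corner of the bounding box
    y2: float
    class_id: int
    score: float


def decode_predictions(boxes, class_probs, score_threshold=0.5):
    """Turn raw head outputs into a list of detections.

    boxes: list of (x1, y1, x2, y2) tuples, one per candidate.
    class_probs: list of per-class probability lists, one per candidate.
    score_threshold: hypothetical cut-off, not a value taken from the recipes.
    """
    detections = []
    for (x1, y1, x2, y2), probs in zip(boxes, class_probs):
        # The class number is the index of the highest class probability.
        class_id = max(range(len(probs)), key=probs.__getitem__)
        score = probs[class_id]
        if score >= score_threshold:
            detections.append(Detection(x1, y1, x2, y2, class_id, score))
    return detections


# Made-up candidates: the second one is dropped because its best score is low.
raw_boxes = [(10.0, 20.0, 110.0, 220.0), (5.0, 5.0, 15.0, 15.0)]
raw_probs = [[0.1, 0.8, 0.1], [0.4, 0.3, 0.3]]
detections = decode_predictions(raw_boxes, raw_probs)
```

Real detectors usually add further post-processing, such as non-maximum suppression, which this sketch leaves out.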

For supervised training, we use the following algorithm components:

  • Augmentations: We use random crop and random rotation, simple brightness and color distortions, and multiscale training in the training pipeline.

  • Optimizer: We use the SGD optimizer with the weight decay set to 1e-4 and the momentum set to 0.9.

  • Learning rate schedule: ReduceLROnPlateau. This learning rate scheduler has proved efficient in dataset-agnostic training. Its logic is to drop the learning rate (LR) after some time without improvement in the target accuracy metric. We also extend it with an iteration_patience parameter, which ensures that a certain number of training iterations (steps through the dataset) have passed before the LR is dropped.

  • Loss function: We use Generalized IoU Loss as the localization loss, which trains the model to find the coordinates of the objects. For the classification head, we use the standard Focal Loss.

  • Additional training techniques
    • Early stopping: To add adaptability to the training pipeline and prevent overfitting.

    • Anchor clustering for SSD: This model relies heavily on the predefined anchor box hyperparameter, which determines the sizes of objects that can be detected. So, before training, we collect object statistics from the dataset, cluster them, and modify the anchor box sizes to best fit the objects the model is going to detect.

    • Backbone pretraining: We pretrained the MobileNetV2 backbone on the large ImageNet21k dataset to improve the feature extractor and to learn better and faster.
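
As an illustration of the localization loss mentioned in the list, the sketch below computes the Generalized IoU loss for a single pair of boxes in plain Python. It follows the standard definition (GIoU = IoU minus the fraction of the enclosing box not covered by the union, loss = 1 - GIoU) and assumes boxes in (x1, y1, x2, y2) form with positive area. It is not the batched implementation used in the training pipeline.

```python
def giou_loss(box_a, box_b):
    """Return 1 - GIoU for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle (zero area if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box that encloses both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (enclose - union) / enclose
    return 1.0 - giou


perfect = giou_loss((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))
partial = giou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
disjoint = giou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
```

Unlike plain IoU, the loss keeps growing as non-overlapping boxes move apart, so it still gives a useful training signal when the prediction and the target do not intersect.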

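The learning rate schedule described in the list can be sketched as follows. This is a simplified stand-in written for illustration, not the scheduler shipped with the library: the class name `PlateauSchedulerSketch`, its arguments, and the interpretation of `iteration_patience` as "minimum number of iterations since training start or since the last drop" are assumptions.

```python
class PlateauSchedulerSketch:
    """Drop the LR when the metric stalls, but not before enough iterations pass."""

    def __init__(self, lr, factor=0.1, patience=2, iteration_patience=0):
        self.lr = lr
        self.factor = factor
        self.patience = patience                      # epochs without improvement
        self.iteration_patience = iteration_patience  # assumed: min iterations between drops
        self.best = float("-inf")
        self.bad_epochs = 0
        self.iters_since_drop = 0

    def step(self, metric, iterations_in_epoch):
        """Call once per epoch with the validation metric (higher is better)."""
        self.iters_since_drop += iterations_in_epoch
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1

        stalled = self.bad_epochs > self.patience
        enough_iters = self.iters_since_drop >= self.iteration_patience
        if stalled and enough_iters:
            self.lr *= self.factor
            self.bad_epochs = 0
            self.iters_since_drop = 0
        return self.lr


# The metric stops improving after the first epoch. With patience=1 the drop
# would be due after epoch 3, but iteration_patience delays it to epoch 4.
scheduler = PlateauSchedulerSketch(lr=0.01, factor=0.1, patience=1, iteration_patience=400)
metrics = [0.50, 0.49, 0.48, 0.47]
history = [scheduler.step(m, iterations_in_epoch=100) for m in metrics]
```

On small datasets an epoch contains only a few iterations, so an epoch-based patience alone could drop the LR very early. The iteration-based guard is one way to avoid that.
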
Dataset Format#

Currently, we support the COCO and Pascal-VOC dataset formats. Learn more about the formats by following the links above. Here is an example of the expected format for a COCO dataset:

├── annotations/
    ├── instances_train.json
    ├── instances_val.json
    └── instances_test.json
├── images/
    (Split is optional)
    ├── train
    ├── val
    └── test
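
For orientation, the sketch below shows how an annotation file from such a layout could be inspected with the Python standard library. It relies on the publicly documented COCO keys (`images`, `categories`, `annotations`, and `bbox` stored as `[x, y, width, height]`). The helper names and the tiny in-memory example are made up for illustration.

```python
import json
from collections import Counter


def coco_bbox_to_corners(bbox):
    """Convert a COCO [x, y, width, height] box to (x1, y1, x2, y2) corners."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def summarize_coco(coco):
    """Count annotated objects per category name in a parsed COCO dictionary."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    return Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])


# Tiny in-memory stand-in for annotations/instances_train.json (made-up values).
example = json.loads("""
{
  "images": [{"id": 1, "file_name": "0001.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "cat"}, {"id": 2, "name": "dog"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 100, 50]},
    {"id": 11, "image_id": 1, "category_id": 2, "bbox": [200, 40, 60, 80]},
    {"id": 12, "image_id": 1, "category_id": 1, "bbox": [300, 300, 20, 20]}
  ]
}
""")

counts = summarize_coco(example)
corners = [coco_bbox_to_corners(ann["bbox"]) for ann in example["annotations"]]
```

For a real dataset, the dictionary would instead come from `json.load` applied to the opened `annotations/instances_train.json` file.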

Note

Please refer to our dedicated tutorial for more information on how to train, validate, and optimize detection models.

Models#

We support the following ready-to-use model recipes:

| Recipe ID | Name | Complexity (GFLOPs) | Model size (MB) |
|---|---|---|---|
| Custom_Object_Detection_YOLOX | YOLOX-TINY | 6.5 | 20.4 |
| Object_Detection_YOLOX_S | YOLOX_S | 33.51 | 46.0 |
| Object_Detection_YOLOX_L | YOLOX_L | 194.57 | 207.0 |
| Object_Detection_YOLOX_X | YOLOX_X | 352.42 | 378.0 |
| Custom_Object_Detection_Gen3_SSD | SSD | 9.4 | 7.6 |
| Custom_Object_Detection_Gen3_ATSS | MobileNetV2-ATSS | 20.6 | 9.1 |
| Object_Detection_ResNeXt101_ATSS | ResNeXt101-ATSS | 434.75 | 344.0 |
| D-Fine X Detection <https://github.com/openvinotoolkit/training_extensions/blob/develop/src/otx/recipe/detection/dfine_x.yaml> | D-Fine X | 202.486 | 240.0 |

The table above can be produced with the following command:

(otx) ...$ otx find --task DETECTION

MobileNetV2-ATSS is a good medium-range model that works well and fast in most cases. SSD and YOLOX are light models that are well suited for the fastest inference on low-power hardware. YOLOX achieves the same accuracy as SSD and even outperforms it in CPU inference speed by 1.5 times, but it requires 3 times more training time because of Mosaic augmentation, which is even more than ATSS requires. So if you have the resources for long training, you can pick the YOLOX model. ATSS still shows good performance among RetinaNet-based models. Therefore, we added ATSS with a large-scale backbone, ResNeXt101-ATSS: we integrated the large ResNeXt101 backbone into our custom ATSS head, and it shows good transfer learning performance. In addition, we added YOLOX variants to support users’ diverse situations.
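
When the choice is driven mainly by hardware limits, the complexity and size figures from the recipe table above can be filtered programmatically. The snippet below is a small convenience sketch. The `recipes_within_budget` helper and the budget values are hypothetical, and the numbers are simply copied from the table.

```python
# (name, GFLOPs, model size in MB), copied from the recipe table above.
RECIPES = [
    ("YOLOX-TINY", 6.5, 20.4),
    ("YOLOX_S", 33.51, 46.0),
    ("YOLOX_L", 194.57, 207.0),
    ("YOLOX_X", 352.42, 378.0),
    ("SSD", 9.4, 7.6),
    ("MobileNetV2-ATSS", 20.6, 9.1),
    ("ResNeXt101-ATSS", 434.75, 344.0),
    ("D-Fine X", 202.486, 240.0),
]


def recipes_within_budget(max_gflops, max_size_mb):
    """Return recipe names that fit both limits, cheapest (by GFLOPs) first."""
    fitting = [r for r in RECIPES if r[1] <= max_gflops and r[2] <= max_size_mb]
    return [name for name, _, _ in sorted(fitting, key=lambda r: r[1])]


# Hypothetical budget for a low-power device.
light = recipes_within_budget(max_gflops=25.0, max_size_mb=25.0)
```

GFLOPs and model size say nothing about accuracy or training time, so the resulting shortlist still has to be weighed against the trade-offs described above.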

The table below presents the test mAP on some academic datasets using our supervised pipeline.

For the COCO dataset, the accuracy of the pretrained weights is shown, and we report the official COCO mAP with AP50 in parentheses. For all other datasets, we report AP50 as the performance metric.

Five datasets were selected as transfer learning datasets. BDD100K is the largest dataset among those we used: 70000 images are used for training and 10000 images for validation. Brackish and Plantdoc are medium-sized datasets, with around 10000 images for training and 1500 images for validation. BCCD and Chess pieces are small datasets, with around 300 images for training and 100 images for validation. We used our own recipes without any modification. For hyperparameters, please refer to the related recipe. We trained each model on a single Nvidia GeForce RTX3090.

| Model name | COCO mAP (AP50) | BDD100K | Brackish | Plantdoc | BCCD | Chess pieces |
|---|---|---|---|---|---|---|
| YOLOX-TINY | 31.0 (48.2) | 24.8 | 96.3 | 51.5 | 88.5 | 99.2 |
| SSD | 13.5 | 28.2 | 96.5 | 52.9 | 91.1 | 99.1 |
| MobileNetV2-ATSS | 32.5 (49.5) | 40.2 | 99.1 | 63.4 | 93.4 | 99.1 |
| ResNeXt101-ATSS | 45.1 (63.8) | 45.5 | 99.3 | 69.3 | 93.1 | 99.1 |
| YOLOX-S | 40.3 (59.1) | 37.1 | 93.6 | 54.8 | 92.7 | 98.8 |
| YOLOX-L | 49.4 (67.1) | 44.5 | 94.6 | 55.8 | 91.8 | 99.0 |
| YOLOX-X | 50.9 (68.4) | 44.2 | 96.3 | 56.2 | 91.5 | 98.9 |