Action Detection

Spatio-temporal action detection is the problem of localizing the actor (spatial detection) and the action (temporal detection). We solve this problem by combining a 3D action classification backbone with a 2D object detection model. These two models can be combined in several ways. Currently, we support only the simplest one; the others will be supported in the near future.

The X3D + Fast-RCNN architecture comes from the X3D paper. This model requires pre-computed actor proposals, which can be obtained from a COCO pre-trained 2D object detector (e.g. Faster-RCNN, ATSS). If your custom dataset requires fine-tuning of the 2D object detector, please refer to otx.algorithms.detection. Region-of-interest (RoI) features are extracted from the last feature map of X3D by extending a 2D proposal at a keyframe into a 3D RoI, replicating it along the temporal axis. The RoI features are then fed into the RoI head of Fast-RCNN.
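The replication step described above can be illustrated with a minimal sketch. It is not the OpenVINO™ Training Extensions implementation: the function names, the single-channel `[t][y][x]` feature layout, the `stride` parameter, and the crude max/average pooling (a stand-in for the actual RoI head) are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in input-image pixels


def extend_to_3d_roi(box: Box, num_frames: int) -> List[Box]:
    """Replicate a keyframe proposal once per temporal step of the feature map."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return [box] * num_frames


def pool_3d_roi(features: List[List[List[float]]], roi: List[Box], stride: int) -> float:
    """Max-pool each frame inside its box, then average the results over time.

    `features` is a hypothetical single-channel map indexed as [t][y][x];
    `stride` is the assumed downsampling factor between image and feature grid.
    """
    per_frame = []
    for frame, (x1, y1, x2, y2) in zip(features, roi):
        height, width = len(frame), len(frame[0])
        # Project the box from image pixels onto the feature grid,
        # clamping so that at least one cell is always covered.
        left = min(width - 1, max(0, math.floor(x1 / stride)))
        top = min(height - 1, max(0, math.floor(y1 / stride)))
        right = min(width, max(left + 1, math.ceil(x2 / stride)))
        bottom = min(height, max(top + 1, math.ceil(y2 / stride)))
        per_frame.append(
            max(frame[y][x] for y in range(top, bottom) for x in range(left, right))
        )
    return sum(per_frame) / len(per_frame)


# Toy feature map: 2 frames of 4x4 cells, value = t * 16 + y * 4 + x.
features = [[[t * 16 + y * 4 + x for x in range(4)] for y in range(4)] for t in range(2)]
roi_3d = extend_to_3d_roi((16.0, 16.0, 48.0, 48.0), num_frames=len(features))
pooled = pool_3d_roi(features, roi_3d, stride=16)
```

Because the same box is reused on every frame, the 3D RoI is a straight "tube" through time; this assumes the actor does not move far from the keyframe position within the clip.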

For better transfer learning we use the following algorithm components:

Augmentations: We use only random crop and random flip in the training pipeline.
Optimizer: We use the SGD optimizer with the weight decay set to 1e-4 and momentum set to 0.9.
Loss functions: For the multi-label case, binary cross-entropy loss is used. For the single-label case, cross-entropy loss is used for category classification.
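To make the difference between the two losses and the optimizer settings concrete, here is a small pure-Python sketch. It is illustrative only, not the training code: the function names are hypothetical, the learning rate is an assumed value (the text above does not state one), and the update rule assumes the common formulation in which weight decay is added to the gradient before the momentum update.

```python
import math
from typing import List


def binary_cross_entropy(logits: List[float], targets: List[int]) -> float:
    """Multi-label loss: every class is an independent yes/no decision."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable form of -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def cross_entropy(logits: List[float], target: int) -> float:
    """Single-label loss: classes compete with each other through a softmax."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def sgd_step(weights: List[float], grads: List[float], velocity: List[float],
             lr: float, momentum: float = 0.9, weight_decay: float = 1e-4) -> None:
    """One in-place SGD update with momentum and L2 weight decay."""
    for i, (w, g) in enumerate(zip(weights, grads)):
        g = g + weight_decay * w              # weight decay acts as an L2 penalty
        velocity[i] = momentum * velocity[i] + g
        weights[i] = w - lr * velocity[i]


# An actor can be "standing" and "talking" at once, so multi-label targets
# may contain several ones; a single-label target is one class index.
multi_label_loss = binary_cross_entropy([2.0, -1.0, 0.5], [1, 0, 1])
single_label_loss = cross_entropy([2.0, -1.0, 0.5], target=0)

weights, velocity = [1.0], [0.0]
sgd_step(weights, grads=[0.5], velocity=velocity, lr=0.1)  # lr is an assumed value
```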

Dataset Format

For dataset handling inside OpenVINO™ Training Extensions, we use the Dataset Management Framework (Datumaro). Since Datumaro does not currently support the AVA dataset format, conversion to the CVAT dataset format is needed. We provide conversion code from the AVA dataset format to the CVAT dataset format. Please refer to this script.
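As a rough illustration of the first half of such a conversion, the sketch below reads AVA-style CSV rows and groups them per keyframe and actor. It is not the conversion script mentioned above: the function name and the fixed frame size are hypothetical, and it assumes the usual AVA row layout of `video_id, timestamp, x1, y1, x2, y2, action_id, person_id` with box coordinates normalized to [0, 1]. Writing the CVAT output is intentionally left out, since its exact layout is defined by the provided script.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO, Tuple


def load_ava_annotations(csv_file: TextIO, width: int, height: int) -> Dict[Tuple[str, str], dict]:
    """Group AVA rows by (video_id, timestamp), merging action labels per person."""
    frames: Dict[Tuple[str, str], dict] = defaultdict(dict)
    for row in csv.reader(csv_file):
        if len(row) < 8:
            continue  # skip rows without a full box and label
        video_id, timestamp = row[0], row[1]
        x1, y1, x2, y2 = (float(value) for value in row[2:6])
        action_id, person_id = int(row[6]), int(row[7])
        # Convert normalized coordinates to pixels for the assumed frame size.
        box = (x1 * width, y1 * height, x2 * width, y2 * height)
        person = frames[(video_id, timestamp)].setdefault(
            person_id, {"box": box, "actions": []}
        )
        person["actions"].append(action_id)  # one actor can carry several labels
    return frames


# Two rows describing the same actor with two different actions.
sample = io.StringIO(
    "video_a,0902,0.25,0.25,0.75,0.5,12,0\n"
    "video_a,0902,0.25,0.25,0.75,0.5,80,0\n"
)
annotations = load_ava_annotations(sample, width=320, height=240)
```

Each grouped entry holds one box and a list of action labels, which is the information a per-frame, per-actor annotation format needs.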

Once your dataset is in a supported format, you can start training with a single command:

$ otx train src/otx/algorithms/action/configs/detection/x3d_fast_rcnn/template.yaml \
            --train-data-roots <path_to_data_root/train> \
            --val-data-roots <path_to_data_root/val>

Models

We support the following ready-to-use model templates for transfer learning:

Template ID	Name	Complexity (GFLOPs)	Model size (MB)
Custom_Action_Detection_X3D_FAST_RCNN	x3d_fast_rcnn	13.04	8.32

The table below reports the mAP on several academic datasets. Each model is trained from Kinetics-400 pre-trained weights on a single NVIDIA GeForce RTX 3090.

Model name	JHMDB	UCF101-24
x3d_fast_rcnn	92.14	80.7