Action Detection

Spatio-temporal action detection is the problem of localizing the actor (spatial detection) and the action (temporal detection). We solve this problem by combining a 3D action classification backbone with a 2D object detection model. These two models can be combined in several ways. Currently, we support only the simplest one. The other ways will be supported in the near future.

The X3D + Fast-RCNN architecture comes from the X3D paper. This model requires pre-computed actor proposals, which can be obtained from a COCO pre-trained 2D object detector (e.g. Faster-RCNN, ATSS). If the custom dataset requires fine-tuning of the 2D object detector, please refer to otx.algorithms.detection. Region-of-interest (RoI) features are extracted at the last feature map of X3D by extending a 2D proposal at a keyframe into a 3D RoI, replicating it along the temporal axis. The RoI features are then fed into the RoI head of Fast-RCNN.
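The 2D-to-3D RoI extension described above can be illustrated with a small sketch. This is not the OTX implementation: the function names `extend_box_to_3d` and `roi_mean_feature` are hypothetical, the feature map is a single-channel nested list, and plain integer cropping with mean pooling stands in for the RoIAlign operation that a real RoI head would use.

```python
from typing import List, Tuple

# (x1, y1, x2, y2) in feature-map cells; x2 and y2 are exclusive.
Box = Tuple[int, int, int, int]


def extend_box_to_3d(box: Box, num_frames: int) -> List[Box]:
    """Replicate one keyframe proposal along the temporal axis."""
    return [box for _ in range(num_frames)]


def roi_mean_feature(feature_map: List[List[List[float]]], box: Box) -> float:
    """Average a single-channel (T, H, W) feature map inside the 3D RoI.

    Assumption: integer cropping plus mean pooling replaces RoIAlign
    purely to keep the sketch short.
    """
    tube = extend_box_to_3d(box, len(feature_map))
    values = []
    for frame, (x1, y1, x2, y2) in zip(feature_map, tube):
        for y in range(y1, y2):
            for x in range(x1, x2):
                values.append(frame[y][x])
    return sum(values) / len(values)


# Toy feature map with T=2 frames of size 3x3 (made-up values).
feature_map = [
    [[t * 10 + y * 3 + x for x in range(3)] for y in range(3)]
    for t in range(2)
]
box = (1, 1, 3, 3)  # keyframe proposal covering the bottom-right 2x2 cells
pooled = roi_mean_feature(feature_map, box)
```

The same box is applied to every frame, so the RoI is a straight "tube" through time; any actor motion within the clip is not tracked by this scheme.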

For better transfer learning we use the following algorithm components:

  • Augmentations: We use only random crop and random flip in the training pipeline.

  • Optimizer: We use the SGD optimizer with the weight decay set to 1e-4 and momentum set to 0.9.

  • Loss functions: For the multi-label case, binary cross entropy loss is used. Otherwise, cross entropy loss is used for category classification.
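To make the loss and optimizer choices above concrete, here is a minimal pure-Python sketch. It is an illustration under assumptions, not the training code used by the recipes: the helper names are hypothetical, the learning rate is an arbitrary placeholder (the text above does not state one), and the update order in `sgd_step` follows the common convention of adding the weight-decay term to the gradient before applying momentum.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(logits: List[float], targets: List[float]) -> float:
    """Multi-label case: every class is an independent yes/no decision."""
    losses = []
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(t * math.log(p) + (1.0 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


def cross_entropy(logits: List[float], target_index: int) -> float:
    """Single-label case: softmax over classes, then negative log-likelihood."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def sgd_step(
    weights: List[float],
    grads: List[float],
    velocity: List[float],
    lr: float,  # hypothetical value, not taken from the recipes
    momentum: float = 0.9,
    weight_decay: float = 1e-4,
) -> Tuple[List[float], List[float]]:
    """One SGD update with momentum and L2 weight decay."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # weight decay acts as an extra gradient term
        v = momentum * v + g       # momentum buffer
        new_weights.append(w - lr * v)
        new_velocity.append(v)
    return new_weights, new_velocity
```

With binary cross entropy an actor can carry several action labels at once (for example, standing and talking), whereas softmax cross entropy forces exactly one category per actor.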

Dataset Format

We support the popular AVA dataset format.
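As a rough illustration of what AVA-style annotations look like, the sketch below parses rows in the column order used by the public AVA CSV files: video ID, keyframe timestamp in seconds, a person box with coordinates normalized to [0, 1], an action ID, and a person ID. The sample rows, the `AvaRow` type, and the helper names are invented for this example, and the exact files and directory layout that OTX expects may differ.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class AvaRow(NamedTuple):
    video_id: str
    timestamp: float
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized
    action_id: int
    person_id: int


def parse_ava_csv(text: str) -> List[AvaRow]:
    """Parse header-less AVA-style CSV text into typed rows."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        if not rec:
            continue  # skip blank lines
        video_id, ts, x1, y1, x2, y2, action_id, person_id = rec
        box = (float(x1), float(y1), float(x2), float(y2))
        rows.append(AvaRow(video_id, float(ts), box, int(action_id), int(person_id)))
    return rows


def group_labels(rows: List[AvaRow]) -> Dict[Tuple[str, float, int], List[int]]:
    """Collect all action IDs of one person at one keyframe (multi-label)."""
    labels = defaultdict(list)
    for r in rows:
        labels[(r.video_id, r.timestamp, r.person_id)].append(r.action_id)
    return dict(labels)


# Made-up sample: person 0 has two actions at the same keyframe.
SAMPLE = """\
video_a,0902,0.10,0.20,0.50,0.90,12,0
video_a,0902,0.10,0.20,0.50,0.90,80,0
video_a,0903,0.55,0.15,0.95,0.85,12,1
"""

rows = parse_ava_csv(SAMPLE)
labels = group_labels(rows)
```

Each row carries one action label, so an actor with several simultaneous actions appears on several rows with the same box, which is why the multi-label loss is relevant for this format.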


We support the following ready-to-use model recipes for transfer learning:

Recipe ID | Complexity (GFLOPs) | Model size (MB)

To see which models are available for the task, the following command can be executed:

(otx) ...$ otx find --task ACTION_DETECTION

The table below presents the mAP on some academic datasets. Each model is trained from Kinetics-400 pre-trained weights on a single Nvidia GeForce RTX3090.

Model name