Action Classification

Action classification is the task of identifying the action performed in a video. The input to the algorithm is a sequence of video frames, and the output is a label for that action.

For supervised learning, we use the following algorithm components:

Augmentations: We use standard data augmentations for videos, including random resizing, random cropping, and horizontal flipping. During training, we randomly sample a segment of frames from each video.
Optimizer: We use the AdamW optimizer (Adam with decoupled weight decay).
Learning rate schedule: We use a step learning rate schedule, where the learning rate is reduced by a factor of 10 after a fixed number of epochs. We also use the Linear Warmup technique to gradually increase the learning rate at the beginning of training.
Loss function: We use the Cross-Entropy Loss as the loss function.
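
The learning rate schedule described above can be sketched in a few lines of plain Python. This is only an illustration, not the OpenVINO™ Training Extensions implementation: the function name `learning_rate`, the base learning rate, the warmup length, and the decay milestones are all assumed values chosen for the example, and the warmup is applied per epoch for simplicity.

```python
def learning_rate(epoch, base_lr=1e-3, warmup_epochs=5, step_epochs=(30, 60), gamma=0.1):
    """Return the learning rate for a 0-indexed epoch.

    All default values are illustrative assumptions, not values taken
    from the actual training recipes.
    """
    if epoch < warmup_epochs:
        # Linear warmup: ramp from base_lr / warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Step decay: multiply by gamma (a factor of 10 reduction) once for
    # every milestone epoch that has already been reached.
    drops = sum(1 for milestone in step_epochs if epoch >= milestone)
    return base_lr * gamma ** drops


# Learning rate for each epoch of a hypothetical 70-epoch run.
schedule = [learning_rate(epoch) for epoch in range(70)]
```

With these assumed settings, the learning rate rises linearly over the first five epochs, stays at the base value until epoch 30, and is then divided by 10 at epochs 30 and 60.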

Dataset Format

We support the popular Kinetics format for action classification.

The names of the annotation files and the overall dataset structure should match the Kinetics format, so a dataset in another format must first be converted to the Kinetics format.

Refer to our tutorial for more information on how to train, validate, and optimize action classification models.

Models

Currently OpenVINO™ Training Extensions supports X3D and MoViNet for action classification.

Recipe ID	Name	Complexity (GFLOPs)	Model size (MB)
Custom_Action_Classification_X3D	X3D	2.49	3.79
Custom_Action_Classificaiton_MoViNet	MoViNet	2.71	3.10

To see which models are available for the task, the following command can be executed:

(otx) ...$ otx find --task ACTION_CLASSIFICATION

The table below presents the top-1 accuracy (%) on some academic datasets. Each model was trained on a single NVIDIA GeForce RTX 3090.

Model name	HMDB51	UCF101
X3D	67.19	87.89
MoViNet	62.74	81.32
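
For readers unfamiliar with the metric, top-1 accuracy is the percentage of samples for which the class with the highest predicted score equals the ground-truth label. The sketch below shows one way to compute it in plain Python. The helper `top1_accuracy` and the toy scores are hypothetical and are not part of OpenVINO™ Training Extensions; the source does not state whether the reported numbers are computed per clip or per video.

```python
def top1_accuracy(scores, labels):
    """Return the percentage of samples whose highest-scoring class is the true label.

    scores: list of per-class score lists, one per sample (hypothetical input layout).
    labels: list of ground-truth class indices, one per sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = 0
    for sample_scores, label in zip(scores, labels):
        # Index of the class with the highest score for this sample.
        predicted = max(range(len(sample_scores)), key=sample_scores.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example with three classes and four samples (made-up numbers).
scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [0.5, 0.4, 0.1]]
labels = [1, 0, 2, 1]
accuracy = top1_accuracy(scores, labels)
print(f"{accuracy:.2f}")  # prints 75.00
```

In the toy example, three of the four predictions match their labels, which gives 75.00.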