otx.algo.action_classification

Module for OTX action classification models.

Classes

BaseRecognizer(backbone, cls_head[, neck, ...])

Custom 3d recognizer class for OTX.

MoViNetBackbone(**kwargs)

MoViNet wrapper class for OTX.

MoViNetHead(num_classes, in_channels, ...[, ...])

Classification head for MoViNet.

MoViNetRecognizer(**kwargs)

MoViNet recognizer model framework for OTX compatibility.

X3DBackbone(gamma_w, gamma_b, gamma_d, ...)

X3D backbone.

X3DHead(num_classes, in_channels, ...[, ...])

Classification head for X3D.

class otx.algo.action_classification.BaseRecognizer(backbone: torch.nn.Module, cls_head: torch.nn.Module, neck: torch.nn.Module | None = None, test_cfg: dict | None = None)

Bases: BaseModule

Custom 3d recognizer class for OTX.

This class exists so that the forward function can be patched during the export procedure.

Initialize BaseModule, inherited from torch.nn.Module.

extract_feat(inputs: Tensor, stage: str = 'neck', data_samples: list[ActionDataSample] | None = None, test_mode: bool = False) → tuple

Extract features of different stages.

Parameters:
  • inputs (torch.Tensor) – The input data.

  • stage (str) – Which stage to output the feature. Defaults to 'neck'.

  • data_samples (list[ActionDataSample], optional) – Action data samples, which are only needed in training. Defaults to None.

  • test_mode (bool) – Whether in test mode. Defaults to False.

Returns:

A tuple of two items: the extracted features (torch.Tensor), and a dict recording the kwargs for the downstream pipeline. This dict usually includes the key loss_aux.

Return type:

tuple
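The staged extraction described above can be pictured with a minimal, torch-free sketch. Everything here is hypothetical: the function name extract_feat_sketch, the 'backbone' stage name, and the use of plain callables in place of modules are assumptions for illustration, not the OTX implementation.

```python
from typing import Any, Callable, Optional

Stage = Callable[[Any], Any]


def extract_feat_sketch(
    inputs: Any,
    backbone: Stage,
    neck: Optional[Stage] = None,
    stage: str = "neck",
) -> tuple:
    """Run the pipeline up to ``stage`` and return (features, kwargs)."""
    # Assumption: extra kwargs for the downstream pipeline start out empty.
    loss_predict_kwargs: dict = {}
    feats = backbone(inputs)
    # Stop early if only backbone features are requested or no neck exists.
    if stage == "backbone" or neck is None:
        return feats, loss_predict_kwargs
    feats = neck(feats)
    return feats, loss_predict_kwargs


# Toy "modules": the backbone doubles each value, the neck sums them.
neck_feats, extra = extract_feat_sketch([1, 2, 3], lambda x: [v * 2 for v in x], neck=sum)
backbone_feats, _ = extract_feat_sketch(
    [1, 2, 3], lambda x: [v * 2 for v in x], neck=sum, stage="backbone"
)
```

The caller always unpacks two values, which matches the tuple return type in the signature.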

forward(inputs: Tensor, data_samples: list[ActionDataSample] | None = None, mode: str = 'tensor', **kwargs) → dict[str, Tensor] | list[ActionDataSample] | tuple[Tensor] | Tensor

The unified entry for a forward process in both training and test.

The method should accept three modes:

  • tensor: Forward the whole network and return tensor or tuple of tensor without any post-processing, same as a common nn.Module.

  • predict: Forward and return the predictions, which are fully processed to a list of ActionDataSample.

  • loss: Forward and return a dict of losses according to the given inputs and data samples.

Note that this method handles neither back propagation nor optimizer updating; both are done in train_step().

Parameters:
  • inputs (torch.Tensor) – The input tensor with shape (N, C, …) in general.

  • data_samples (list[ActionDataSample], optional) – The annotation data of every sample. Defaults to None.

  • mode (str) – What kind of value to return. Defaults to 'tensor'.

Returns:

The return type depends on mode.

  • If mode="tensor", return a tensor or a tuple of tensors.

  • If mode="predict", return a list of ActionDataSample.

  • If mode="loss", return a dict of tensors.
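The three-mode dispatch can be sketched as follows. ToyRecognizer is a hypothetical stand-in that uses plain lists and dicts instead of tensors and ActionDataSample objects, and raising ValueError on an unknown mode is an assumption. It only illustrates how one entry point fans out to the tensor, predict, and loss paths.

```python
class ToyRecognizer:
    """Hypothetical stand-in mimicking the three-mode dispatch (no torch)."""

    def _logits(self, inputs):
        # Placeholder "network": pass the inputs through unchanged.
        return inputs

    def loss(self, inputs, data_samples):
        # Toy loss: fraction of samples whose arg-max differs from gt_label.
        logits = self._logits(inputs)
        wrong = sum(
            max(range(len(row)), key=row.__getitem__) != sample["gt_label"]
            for row, sample in zip(logits, data_samples)
        )
        return {"loss_cls": wrong / len(logits)}

    def predict(self, inputs, data_samples=None):
        # Wrap each row of scores in a dict, standing in for ActionDataSample.
        return [{"pred_scores": row} for row in self._logits(inputs)]

    def forward(self, inputs, data_samples=None, mode="tensor"):
        if mode == "tensor":
            return self._logits(inputs)
        if mode == "predict":
            return self.predict(inputs, data_samples)
        if mode == "loss":
            return self.loss(inputs, data_samples)
        raise ValueError(f"Invalid mode: {mode!r}")


model = ToyRecognizer()
batch = [[0.1, 0.9], [0.8, 0.2]]
labels = [{"gt_label": 1}, {"gt_label": 1}]

raw = model.forward(batch)                          # mode="tensor"
preds = model.forward(batch, mode="predict")        # list of sample dicts
losses = model.forward(batch, labels, mode="loss")  # dict of losses
```

None of the three paths calls an optimizer or a backward pass, in line with the note that those steps belong to train_step().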

loss(inputs: Tensor, data_samples: list[ActionDataSample] | None, **kwargs) → dict

Calculate losses from a batch of inputs and data samples.

Parameters:
  • inputs (torch.Tensor) – Raw inputs of the recognizer. These should usually be mean-centered and std-scaled.

  • data_samples (list[ActionDataSample]) – The batch data samples. It usually includes information such as gt_label.

Returns:

A dictionary of loss components.

Return type:

dict

predict(inputs: Tensor, data_samples: list[ActionDataSample] | None, **kwargs) → list[ActionDataSample]

Predict results from a batch of inputs and data samples with postprocessing.

Parameters:
  • inputs (torch.Tensor) – Raw inputs of the recognizer. These should usually be mean-centered and std-scaled.

  • data_samples (list[ActionDataSample]) – The batch data samples. It usually includes information such as gt_label.

Returns:

The recognition results. Each returned value is an ActionDataSample, which usually contains pred_scores. pred_scores usually contains the following key:

  • item (torch.Tensor): Classification scores, with shape (num_classes,).

Return type:

list[ActionDataSample]

property with_cls_head: bool

Whether the recognizer has a cls_head.

Type:

bool

property with_neck: bool

Whether the recognizer has a neck.

Type:

bool

class otx.algo.action_classification.MoViNetBackbone(**kwargs)

Bases: MoViNetBackboneBase

MoViNet wrapper class for OTX.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

static fill_conv(conf: DictConfig, input_channels: int, out_channels: int, kernel_size: tuple[int, int, int], stride: tuple[int, int, int], padding: tuple[int, int, int]) → None

Set the conv layer values on a given DictConfig object.

Parameters:
  • conf (DictConfig) – The DictConfig object to be updated.

  • input_channels (int) – The number of input channels.

  • out_channels (int) – The number of output channels.

  • kernel_size (tuple[int, int, int]) – The size of the kernel.

  • stride (tuple[int, int, int]) – The stride of the kernel.

  • padding (tuple[int, int, int]) – The padding of the kernel.

Returns:

None.
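A minimal sketch of what fill_conv might do, assuming it copies each argument onto a same-named field of the config object. The field names and the function name fill_conv_sketch are assumptions, and types.SimpleNamespace stands in for DictConfig so the example needs no third-party package.

```python
from types import SimpleNamespace


def fill_conv_sketch(conf, input_channels, out_channels, kernel_size, stride, padding):
    """Copy conv hyperparameters onto ``conf`` in place (hypothetical field names)."""
    conf.input_channels = input_channels
    conf.out_channels = out_channels
    conf.kernel_size = kernel_size
    conf.stride = stride
    conf.padding = padding


# SimpleNamespace is a stand-in for DictConfig; both accept attribute assignment.
conv_conf = SimpleNamespace()
fill_conv_sketch(
    conv_conf,
    input_channels=3,
    out_channels=16,
    kernel_size=(1, 3, 3),
    stride=(1, 2, 2),
    padding=(0, 1, 1),
)
```

Because the object is updated in place, the function returns None, which matches the documented return value.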

static fill_se_config(conf: DictConfig, input_channels: int, out_channels: int, expanded_channels: int, kernel_size: tuple[int, int, int], stride: tuple[int, int, int], padding: tuple[int, int, int], padding_avg: tuple[int, int, int]) → None

Set the SE module values on a given DictConfig object.

Parameters:
  • conf (DictConfig) – The DictConfig object to be updated.

  • input_channels (int) – The number of input channels.

  • out_channels (int) – The number of output channels.

  • expanded_channels (int) – The number of channels after expansion in the basic block.

  • kernel_size (tuple[int, int, int]) – The size of the kernel.

  • stride (tuple[int, int, int]) – The stride of the kernel.

  • padding (tuple[int, int, int]) – The padding of the kernel.

  • padding_avg (tuple[int, int, int]) – The padding for the average pooling operation.

Returns:

None.

class otx.algo.action_classification.MoViNetHead(num_classes: int, in_channels: int, hidden_dim: int, loss_cls: Module, topk: tuple[int, int] = (1, 5), tf_like: bool = False, conv_type: str = '3d', average_clips: str | None = None)

Bases: BaseHead

Classification head for MoViNet.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • hidden_dim (int) – Number of channels in hidden layer.

  • tf_like (bool) – If True, uses TensorFlow-style padding. Default: False.

  • conv_type (str) – Type of convolutional layer. Default: ‘3d’.

  • loss_cls (nn.Module) – Loss class like CrossEntropyLoss.

  • topk (tuple[int, int]) – Top-K training loss calculation. Default: (1, 5).

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.5.

  • init_std (float) – Standard deviation for initialization. Default: 0.1.

Initialize BaseModule, inherited from torch.nn.Module.

forward(x: Tensor, **kwargs) → Tensor

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The classification scores for input samples.

Return type:

torch.Tensor

init_weights() → None

Initialize the parameters from scratch.

class otx.algo.action_classification.MoViNetRecognizer(**kwargs)

Bases: BaseRecognizer

MoViNet recognizer model framework for OTX compatibility.

Initialize BaseModule, inherited from torch.nn.Module.

static load_state_dict_pre_hook(module: Module, state_dict: dict, prefix: str, *args, **kwargs) → None

Redirect input state_dict to model for OTX model compatibility.

static state_dict_hook(module: Module, state_dict: dict, *args, **kwargs) → None

Redirect model as output state_dict for OTX MoViNet compatibility.
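One way to picture the two hooks is as key renaming on a plain dict. The sketch below assumes the redirection amounts to adding a "model." prefix when loading and stripping it when saving. That prefix, and both helper names, are assumptions rather than confirmed OTX behavior.

```python
def add_model_prefix(state_dict: dict, prefix: str = "") -> None:
    """Pre-load direction: move bare keys under ``<prefix>model.`` in place."""
    for key in list(state_dict):
        if not key.startswith(prefix):
            continue  # leave keys that belong to another submodule untouched
        bare = key[len(prefix):]
        if not bare.startswith("model."):
            state_dict[prefix + "model." + bare] = state_dict.pop(key)


def strip_model_prefix(state_dict: dict) -> None:
    """Save direction: drop a leading ``model.`` from every key in place."""
    for key in list(state_dict):
        if key.startswith("model."):
            state_dict[key[len("model."):]] = state_dict.pop(key)


# Plain values stand in for tensors.
checkpoint = {"conv1.weight": 1, "model.fc.bias": 2}
add_model_prefix(checkpoint)
loaded_keys = sorted(checkpoint)

strip_model_prefix(checkpoint)
saved_keys = sorted(checkpoint)
```

Both helpers mutate the dict and return None, mirroring the hook signatures above. Iterating over list(state_dict) avoids changing the dict while it is being traversed.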

class otx.algo.action_classification.X3DBackbone(gamma_w: float = 1.0, gamma_b: float = 1.0, gamma_d: float = 1.0, pretrained: str | None = None, in_channels: int = 3, num_stages: int = 4, spatial_strides: tuple[int, int, int, int] = (2, 2, 2, 2), frozen_stages: int = -1, se_style: str = 'half', se_ratio: float = 0.0625, use_swish: bool = True, normalization: Callable[..., nn.Module] | None = None, activation: Callable[..., nn.Module] | None = nn.ReLU, norm_eval: bool = False, with_cp: bool = False, zero_init_residual: bool = True, **kwargs)

Bases: Module

X3D backbone. https://arxiv.org/pdf/2004.04730.pdf.

Parameters:
  • gamma_w (float) – Global channel width expansion factor. Default: 1.

  • gamma_b (float) – Bottleneck channel width expansion factor. Default: 1.

  • gamma_d (float) – Network depth expansion factor. Default: 1.

  • pretrained (str | None) – Name of pretrained model. Default: None.

  • in_channels (int) – Channel num of input features. Default: 3.

  • num_stages (int) – Number of ResNet stages. Default: 4.

  • spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (2, 2, 2, 2).

  • frozen_stages (int) – Stages to be frozen (all param fixed). If set to -1, it means not freezing any parameters. Default: -1.

  • se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.

  • se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: 1 / 16.

  • use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.

  • normalization (Callable[..., nn.Module] | None) – Normalization layer module. Defaults to None.

  • activation (Callable[..., nn.Module] | None) – Activation layer module. Defaults to nn.ReLU.

  • norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – Whether to use zero initialization for residual blocks. Default: True.

  • kwargs (dict, optional) – Keyword arguments for make_res_layer.
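The se_style and se_ratio options can be illustrated with a small sketch that decides which blocks of a stage receive an SE module. The helper name se_block_mask is hypothetical. Choosing the even-indexed blocks for 'half' is an assumption, since the text only says that half of the blocks get one.

```python
def se_block_mask(blocks, se_style="half", se_ratio=0.0625):
    """Return one bool per block: True where an SE module would be inserted."""
    if se_ratio is None:
        # No SE unit at all when the reduction ratio is unset.
        return [False] * blocks
    if se_style == "all":
        return [True] * blocks
    if se_style == "half":
        # Assumption: every other block, starting with the first.
        return [index % 2 == 0 for index in range(blocks)]
    raise ValueError(f"Unsupported se_style: {se_style!r}")


half_mask = se_block_mask(5)                  # alternate blocks
all_mask = se_block_mask(3, se_style="all")   # every block
no_se_mask = se_block_mask(4, se_ratio=None)  # SE disabled
```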

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor) → Tensor

Defines the computation performed at every call.

Parameters:

x (torch.Tensor) – The input data.

Returns:

The feature of the input samples extracted by the backbone.

Return type:

torch.Tensor

init_weights() → None

Initialize the parameters either from an existing checkpoint or from scratch.

make_res_layer(block: nn.Module, layer_inplanes: int, inplanes: int, planes: int, blocks: int, spatial_stride: int = 1, se_style: str = 'half', se_ratio: float | None = None, use_swish: bool = True, normalization: Callable[..., nn.Module] | None = None, activation: Callable[..., nn.Module] | None = nn.ReLU, with_cp: bool = False, **kwargs) → Module

Build residual layer for ResNet3D.

Parameters:
  • block (nn.Module) – Residual module to be built.

  • layer_inplanes (int) – Number of channels for the input feature of the res layer.

  • inplanes (int) – Number of channels for the input feature in each block, which is equal to base_channels * gamma_w.

  • planes (int) – Number of channels for the output feature in each block, which is equal to base_channels * gamma_w * gamma_b.

  • blocks (int) – Number of residual blocks.

  • spatial_stride (int) – Spatial strides in residual and conv layers. Default: 1.

  • se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.

  • se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: None.

  • use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.

  • normalization (Callable[..., nn.Module] | None) – Normalization layer module. Defaults to None.

  • activation (Callable[..., nn.Module] | None) – Activation layer module. Defaults to nn.ReLU.

  • with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns:

A residual layer for the given config.

Return type:

nn.Module
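The inplanes and planes formulas from the parameter list can be worked through numerically. In this sketch base_channels = 24 is an assumed value and block_channels is a hypothetical helper. The actual implementation may also round widths to a multiple of some divisor, which is not modeled here.

```python
def block_channels(base_channels, gamma_w, gamma_b):
    """Apply the documented width formulas, truncating to whole channels."""
    inplanes = int(base_channels * gamma_w)
    planes = int(base_channels * gamma_w * gamma_b)
    return inplanes, planes


# Assumed base width of 24 channels; the gamma values are for illustration.
default_widths = block_channels(24, gamma_w=1.0, gamma_b=1.0)
wide_bottleneck = block_channels(24, gamma_w=1.0, gamma_b=2.25)
wide_network = block_channels(24, gamma_w=2.0, gamma_b=2.25)
```

gamma_w scales both values, while gamma_b only widens the bottleneck relative to the block input.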

train(mode: bool = True) → None

Set the optimization status when training.

class otx.algo.action_classification.X3DHead(num_classes: int, in_channels: int, hidden_dim: int, loss_cls: Module, spatial_type: str = 'avg', dropout_ratio: float = 0.5, init_std: float = 0.01, fc1_bias: bool = False, average_clips: str | None = None)

Bases: BaseHead

Classification head for X3D.

Parameters:
  • num_classes (int) – Number of classes to be classified.

  • in_channels (int) – Number of channels in input feature.

  • loss_cls (nn.Module) – Loss class like CrossEntropyLoss.

  • spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.

  • dropout_ratio (float) – Probability of dropout layer. Default: 0.5.

  • init_std (float) – Standard deviation for initialization. Default: 0.01.

  • fc1_bias (bool) – Whether the first fc layer has a bias. Default: False.

Initialize BaseModule, inherited from torch.nn.Module.

forward(x: Tensor, **kwargs) → Tensor

Defines the computation performed at every call.

Parameters:

x (Tensor) – The input data.

Returns:

The classification scores for input samples.

Return type:

Tensor

init_weights() → None

Initialize the parameters from scratch.