otx.algo.action_classification#

Module for OTX action classification models.

Classes

`BaseRecognizer`(backbone, cls_head[, neck, ...])	Custom 3d recognizer class for OTX.
`MoViNetBackbone`(**kwargs)	MoViNet wrapper class for OTX.
`MoViNetHead`(num_classes, in_channels, ...[, ...])	Classification head for MoViNet.
`MoViNetRecognizer`(**kwargs)	MoViNet recognizer model framework for OTX compatibility.
`X3DBackbone`(gamma_w, gamma_b, gamma_d, ...)	X3D backbone.
`X3DHead`(num_classes, in_channels, ...[, ...])	Classification head for X3D.

class otx.algo.action_classification.BaseRecognizer(backbone: torch.Module, cls_head: torch.Module, neck: torch.Module | None = None, test_cfg: dict | None = None)[source]#

Bases: BaseModule

Custom 3d recognizer class for OTX.

This is for patching the forward function during the export procedure.

Initialize BaseModule, inherited from torch.nn.Module.

extract_feat(inputs: Tensor, stage: str = 'neck', data_samples: list[ActionDataSample] | None = None, test_mode: bool = False) → tuple[source]#

Extract features of different stages.

Parameters:

inputs (torch.Tensor) – The input data.
stage (str) – Which stage to output the feature. Defaults to 'neck'.
data_samples (list[ActionDataSample], optional) – Action data samples, which are only needed in training. Defaults to None.
test_mode (bool) – Whether in test mode. Defaults to False.

Returns:

A tuple of two items. torch.Tensor: The extracted features. dict: A dict recording the kwargs for the downstream pipeline; it usually includes the key loss_aux.

Return type:

tuple

forward(inputs: Tensor, data_samples: list[ActionDataSample] | None = None, mode: str = 'tensor', **kwargs) → dict[str, Tensor] | list[ActionDataSample] | tuple[Tensor] | Tensor[source]#

The unified entry for a forward process in both training and test.

The method should accept three modes:

tensor: Forward the whole network and return a tensor or a tuple of tensors without any post-processing, same as a common nn.Module.
predict: Forward and return the predictions, fully processed into a list of ActionDataSample.
loss: Forward and return a dict of losses according to the given inputs and data samples.

Note that this method handles neither back propagation nor optimizer updating, which are done in train_step().

Parameters:

inputs (torch.Tensor) – The input tensor with shape (N, C, …) in general.
data_samples (List[ActionDataSample], optional) – The annotation data of every sample. Defaults to None.
mode (str) – The kind of value to return. Defaults to 'tensor'.

Returns:

The return type depends on mode.

If mode="tensor", return a tensor or a tuple of tensors.
If mode="predict", return a list of ActionDataSample.
If mode="loss", return a dict of tensors.
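The three-mode contract described above can be sketched in plain Python. This is a minimal illustration, not the OTX implementation: `ToyRecognizer`, its identity "network", the dict-based samples, and the toy loss are all hypothetical stand-ins for tensors and ActionDataSample objects.

```python
from typing import Any, Optional


class ToyRecognizer:
    """Hypothetical stand-in that mimics the three-mode forward contract."""

    def _forward(self, inputs: list) -> list:
        # Identity "network": the raw scores are just a copy of the inputs.
        return [list(row) for row in inputs]

    def predict(self, inputs: list, data_samples: Optional[list] = None) -> list:
        # Post-process raw scores into one result record per sample.
        return [
            {"pred_scores": row, "pred_label": row.index(max(row))}
            for row in self._forward(inputs)
        ]

    def loss(self, inputs: list, data_samples: list) -> dict:
        # Toy loss: one minus the score assigned to the ground-truth label.
        scores = self._forward(inputs)
        errors = [1.0 - row[s["gt_label"]] for row, s in zip(scores, data_samples)]
        return {"loss_cls": sum(errors) / len(errors)}

    def forward(self, inputs: list, data_samples: Optional[list] = None, mode: str = "tensor") -> Any:
        # Dispatch on mode; no back propagation or optimizer step happens here.
        if mode == "tensor":
            return self._forward(inputs)
        if mode == "predict":
            return self.predict(inputs, data_samples)
        if mode == "loss":
            return self.loss(inputs, data_samples)
        raise RuntimeError(f"Invalid mode: {mode!r}")


recognizer = ToyRecognizer()
batch = [[0.1, 0.9], [0.8, 0.2]]
raw = recognizer.forward(batch)
preds = recognizer.forward(batch, mode="predict")
losses = recognizer.forward(batch, [{"gt_label": 1}, {"gt_label": 0}], mode="loss")
```

Keeping the dispatch in one place means callers such as a training loop or an exporter only need to choose a mode string, while the per-mode logic stays in separate methods.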

loss(inputs: Tensor, data_samples: list[ActionDataSample] | None, **kwargs) → dict[source]#

Calculate losses from a batch of inputs and data samples.

Parameters:

inputs (torch.Tensor) – Raw inputs of the recognizer. These should usually be mean-centered and std-scaled.
data_samples (List[ActionDataSample]) – The batch data samples. It usually includes information such as gt_label.

Returns:

A dictionary of loss components.

Return type:

dict

predict(inputs: Tensor, data_samples: list[ActionDataSample] | None, **kwargs) → list[ActionDataSample][source]#

Predict results from a batch of inputs and data samples with postprocessing.

Parameters:

inputs (torch.Tensor) – Raw inputs of the recognizer. These should usually be mean-centered and std-scaled.
data_samples (List[ActionDataSample]) – The batch data samples. It usually includes information such as gt_label.

Returns:

The recognition results. Each returned ActionDataSample usually contains pred_scores, which in turn usually contains the following key:

item (torch.Tensor): Classification scores with shape (num_classes,).

Return type:

List[ActionDataSample]

property with_cls_head: bool#

Whether the recognizer has a cls_head.

Type: bool

property with_neck: bool#

Whether the recognizer has a neck.

Type: bool
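A minimal sketch of how such boolean properties are commonly written, assuming the optional submodule is stored as an attribute only when it is provided. `ToyModel` is hypothetical and does not reproduce the OTX class.

```python
class ToyModel:
    """Hypothetical container with optional neck and cls_head components."""

    def __init__(self, backbone, cls_head=None, neck=None):
        self.backbone = backbone
        # Optional parts are attached only when given.
        if neck is not None:
            self.neck = neck
        if cls_head is not None:
            self.cls_head = cls_head

    @property
    def with_neck(self) -> bool:
        # True only if the attribute exists and is not None.
        return getattr(self, "neck", None) is not None

    @property
    def with_cls_head(self) -> bool:
        return getattr(self, "cls_head", None) is not None


bare = ToyModel(backbone=object())
full = ToyModel(backbone=object(), cls_head=object(), neck=object())
```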

class otx.algo.action_classification.MoViNetBackbone(**kwargs)[source]#

Bases: MoViNetBackboneBase

MoViNet wrapper class for OTX.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

static fill_conv(conf: DictConfig, input_channels: int, out_channels: int, kernel_size: tuple[int, int, int], stride: tuple[int, int, int], padding: tuple[int, int, int]) → None[source]#

Set the conv layer values on a given DictConfig object.

Parameters:

conf (DictConfig) – The DictConfig object to be updated.
input_channels (int) – The number of input channels.
out_channels (int) – The number of output channels.
kernel_size (tuple[int, int, int]) – The size of the kernel.
stride (tuple[int, int, int]) – The stride of the kernel.
padding (tuple[int, int, int]) – The padding of the kernel.

Returns:

None.
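The in-place update pattern that fill_conv describes can be sketched as follows. This is an illustration under assumptions: a `types.SimpleNamespace` stands in for the DictConfig object, and the attribute names written onto it are hypothetical, chosen to mirror the parameter names rather than taken from the OTX source.

```python
from types import SimpleNamespace


def fill_conv_sketch(conf, input_channels, out_channels, kernel_size, stride, padding) -> None:
    """Write conv settings onto conf in place (attribute names are assumed)."""
    conf.input_channels = input_channels
    conf.out_channels = out_channels
    conf.kernel_size = kernel_size
    conf.stride = stride
    conf.padding = padding


# The config object is mutated; nothing is returned.
conv_conf = SimpleNamespace()
result = fill_conv_sketch(
    conv_conf,
    input_channels=3,
    out_channels=16,
    kernel_size=(1, 3, 3),
    stride=(1, 2, 2),
    padding=(0, 1, 1),
)
```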

static fill_se_config(conf: DictConfig, input_channels: int, out_channels: int, expanded_channels: int, kernel_size: tuple[int, int, int], stride: tuple[int, int, int], padding: tuple[int, int, int], padding_avg: tuple[int, int, int]) → None[source]#

Set the SE module values on a given DictConfig object.

Parameters:

conf (DictConfig) – The DictConfig object to be updated.
input_channels (int) – The number of input channels.
out_channels (int) – The number of output channels.
expanded_channels (int) – The number of channels after expansion in the basic block.
kernel_size (tuple[int, int, int]) – The size of the kernel.
stride (tuple[int, int, int]) – The stride of the kernel.
padding (tuple[int, int, int]) – The padding of the kernel.
padding_avg (tuple[int, int, int]) – The padding for the average pooling operation.

Returns:

None.

class otx.algo.action_classification.MoViNetHead(num_classes: int, in_channels: int, hidden_dim: int, loss_cls: Module, topk: tuple[int, int] = (1, 5), tf_like: bool = False, conv_type: str = '3d', average_clips: str | None = None)[source]#

Bases: BaseHead

Classification head for MoViNet.

Parameters:

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
hidden_dim (int) – Number of channels in hidden layer.
tf_like (bool) – If True, uses TensorFlow-style padding. Default: False.
conv_type (str) – Type of convolutional layer. Default: '3d'.
loss_cls (nn.Module) – Loss class like CrossEntropyLoss.
topk (tuple[int, int]) – Top-K training loss calculation. Default: (1, 5).
spatial_type (str) – Pooling type in spatial dimension. Default: 'avg'.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Standard deviation for initialization. Default: 0.1.

Initialize BaseModule, inherited from torch.nn.Module.
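As an illustration of the topk parameter, the sketch below computes top-k accuracy for several values of k. It assumes topk selects the k values at which a prediction counts as correct when the true label is among the k highest scores, which is the usual convention in classification heads. The function name and the list-based inputs are hypothetical.

```python
def top_k_accuracy(scores, labels, topk=(1, 5)):
    """Return one accuracy value per k in topk."""
    results = []
    for k in topk:
        hits = 0
        for row, label in zip(scores, labels):
            # Class indices sorted by descending score.
            ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
            if label in ranked[:k]:
                hits += 1
        results.append(hits / len(labels))
    return tuple(results)


sample_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
sample_labels = [1, 1, 0]
top1, top2 = top_k_accuracy(sample_scores, sample_labels, topk=(1, 2))
```

With these inputs only the first sample is correct at k=1, and the second sample becomes correct at k=2.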

forward(x: Tensor, **kwargs) → Tensor[source]#

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The classification scores for input samples.
Return type: torch.Tensor

init_weights() → None[source]#: Initialize the parameters from scratch.

class otx.algo.action_classification.MoViNetRecognizer(**kwargs)[source]#

Bases: BaseRecognizer

MoViNet recognizer model framework for OTX compatibility.

Initialize BaseModule, inherited from torch.nn.Module.

static load_state_dict_pre_hook(module: Module, state_dict: dict, prefix: str, *args, **kwargs) → None[source]#: Redirect input state_dict to model for OTX model compatibility.

static state_dict_hook(module: Module, state_dict: dict, *args, **kwargs) → None[source]#: Redirect model as output state_dict for OTX MoviNet compatibility.
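The two hooks above redirect state_dict keys on load and on save. The exact key mapping is not given in this reference, so the sketch below only shows the general idea on a plain dict: one helper adds a prefix to every key and the other strips it. The "model." prefix and both function names are hypothetical.

```python
def add_prefix(state_dict: dict, prefix: str) -> None:
    """Rename every key in place by prepending prefix."""
    for key in list(state_dict):
        state_dict[prefix + key] = state_dict.pop(key)


def strip_prefix(state_dict: dict, prefix: str) -> None:
    """Rename in place every key that starts with prefix by removing it."""
    for key in list(state_dict):
        if key.startswith(prefix):
            state_dict[key[len(prefix):]] = state_dict.pop(key)


weights = {"backbone.conv.weight": 1, "cls_head.fc.bias": 2}
add_prefix(weights, "model.")
prefixed_keys = sorted(weights)
strip_prefix(weights, "model.")
```

Both helpers mutate the dict and return None, matching the signatures of the hooks documented above.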

class otx.algo.action_classification.X3DBackbone(gamma_w: float = 1.0, gamma_b: float = 1.0, gamma_d: float = 1.0, pretrained: str | None = None, in_channels: int = 3, num_stages: int = 4, spatial_strides: tuple[int, int, int, int] = (2, 2, 2, 2), frozen_stages: int = -1, se_style: str = 'half', se_ratio: float = 0.0625, use_swish: bool = True, normalization: Callable[..., Module] | None = None, activation: Callable[..., Module] | None = nn.ReLU, norm_eval: bool = False, with_cp: bool = False, zero_init_residual: bool = True, **kwargs)[source]#

Bases: Module

X3D backbone. https://arxiv.org/pdf/2004.04730.pdf.

Parameters:

gamma_w (float) – Global channel width expansion factor. Default: 1.
gamma_b (float) – Bottleneck channel width expansion factor. Default: 1.
gamma_d (float) – Network depth expansion factor. Default: 1.
pretrained (str | None) – Name of pretrained model. Default: None.
in_channels (int) – Channel num of input features. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (2, 2, 2, 2).
frozen_stages (int) – Stages to be frozen (all param fixed). If set to -1, it means not freezing any parameters. Default: -1.
se_style (str) – The style of inserting SE modules into BlockX3D. 'half' inserts SE modules into half of the blocks, while 'all' inserts them into all blocks. Default: 'half'.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: 1 / 16.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
normalization (Callable[..., nn.Module] | None) – Normalization layer module. Defaults to None.
activation (Callable[..., nn.Module] | None) – Activation layer module. Defaults to nn.ReLU.
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero initialization for residual blocks. Default: True.
kwargs (dict, optional) – Keyword arguments for make_res_layer.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor) → Tensor[source]#

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The feature of the input samples extracted by the backbone.
Return type: torch.Tensor

init_weights() → None[source]#: Initialize the parameters either from an existing checkpoint or from scratch.

make_res_layer(block: Module, layer_inplanes: int, inplanes: int, planes: int, blocks: int, spatial_stride: int = 1, se_style: str = 'half', se_ratio: float | None = None, use_swish: bool = True, normalization: Callable[..., Module] | None = None, activation: Callable[..., Module] | None = nn.ReLU, with_cp: bool = False, **kwargs) → Module[source]#

Build residual layer for ResNet3D.

Parameters:

block (nn.Module) – Residual module to be built.
layer_inplanes (int) – Number of channels for the input feature of the res layer.
inplanes (int) – Number of channels for the input feature in each block, which equals base_channels * gamma_w.
planes (int) – Number of channels for the output feature in each block, which equals base_channels * gamma_w * gamma_b.
blocks (int) – Number of residual blocks.
spatial_stride (int) – Spatial strides in residual and conv layers. Default: 1.
se_style (str) – The style of inserting SE modules into BlockX3D. 'half' inserts SE modules into half of the blocks, while 'all' inserts them into all blocks. Default: 'half'.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: None.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
normalization (Callable[..., nn.Module] | None) – Normalization layer module. Defaults to None.
activation (Callable[..., nn.Module] | None) – Activation layer module. Defaults to nn.ReLU.
with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns:

A residual layer for the given config.

Return type:

nn.Module
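The relations given for inplanes and planes can be worked through numerically. The sketch below is only arithmetic on those two formulas: the base_channels value of 24 and the truncation to int are assumptions, and an actual implementation may round widths differently (for example to a multiple of a divisor).

```python
def block_channels(base_channels: int, gamma_w: float, gamma_b: float) -> tuple:
    """Apply inplanes = base_channels * gamma_w and planes = inplanes * gamma_b."""
    inplanes = int(base_channels * gamma_w)
    planes = int(base_channels * gamma_w * gamma_b)
    return inplanes, planes


# Assumed values: base_channels=24 with a bottleneck expansion of 2.25.
narrow = block_channels(24, gamma_w=1.0, gamma_b=2.25)
# Doubling gamma_w doubles both the block input width and the bottleneck width.
wide = block_channels(24, gamma_w=2.0, gamma_b=2.25)
```

Here gamma_w scales every block uniformly, while gamma_b widens only the inner bottleneck relative to the block input.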

train(mode: bool = True) → None[source]#: Set the optimization status when training.

class otx.algo.action_classification.X3DHead(num_classes: int, in_channels: int, hidden_dim: int, loss_cls: Module, spatial_type: str = 'avg', dropout_ratio: float = 0.5, init_std: float = 0.01, fc1_bias: bool = False, average_clips: str | None = None)[source]#

Bases: BaseHead

Classification head for X3D.

Parameters:

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (nn.Module) – Loss class like CrossEntropyLoss.
spatial_type (str) – Pooling type in spatial dimension. Default: 'avg'.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Standard deviation for initialization. Default: 0.01.
fc1_bias (bool) – If the first fc layer has bias. Default: False.

Initialize BaseModule, inherited from torch.nn.Module.

forward(x: Tensor, **kwargs) → Tensor[source]#

Defines the computation performed at every call.

Parameters: x (Tensor) – The input data.
Returns: The classification scores for input samples.
Return type: Tensor

init_weights() → None[source]#: Initialize the parameters from scratch.