otx.algo.action_classification
Module for OTX action classification models.
Classes
| BaseRecognizer | Custom 3d recognizer class for OTX. |
| MoViNetBackbone | MoViNet wrapper class for OTX. |
| MoViNetHead | Classification head for MoViNet. |
| MoViNetRecognizer | MoViNet recognizer model framework for OTX compatibility. |
| X3DBackbone | X3D backbone. |
| X3DHead | Classification head for I3D. |
- class otx.algo.action_classification.BaseRecognizer(backbone: torch.Module, cls_head: torch.Module, neck: torch.Module | None = None, test_cfg: dict | None = None)
Bases: BaseModule
Custom 3d recognizer class for OTX.
This class exists so that the forward function can be patched during the export procedure.
Initialize BaseModule, inherited from torch.nn.Module.
- extract_feat(inputs: Tensor, stage: str = 'neck', data_samples: list[ActionDataSample] | None = None, test_mode: bool = False) -> tuple
Extract features of different stages.
- Parameters:
- Returns:
torch.Tensor: The extracted features.
dict: A dict recording the kwargs for the downstream pipeline. It usually includes the key loss_aux.
- Return type:
tuple
- forward(inputs: Tensor, data_samples: list[ActionDataSample] | None = None, mode: str = 'tensor', **kwargs) -> dict[str, Tensor] | list[ActionDataSample] | tuple[Tensor] | Tensor
The unified entry for a forward process in both training and test.
The method should accept three modes:
- tensor: Forward the whole network and return a tensor or a tuple of tensors without any post-processing, same as a common nn.Module.
- predict: Forward and return the predictions, which are fully processed into a list of ActionDataSample.
- loss: Forward and return a dict of losses according to the given inputs and data samples.
Note that this method handles neither back propagation nor optimizer updating; both are done in train_step().
- Parameters:
- Returns:
The return type depends on mode.
- If mode="tensor", return a tensor or a tuple of tensors.
- If mode="predict", return a list of ActionDataSample.
- If mode="loss", return a dict of tensors.
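The three modes amount to a dispatch on the mode string. The following is a minimal sketch of that contract, using a hypothetical ToyRecognizer with plain Python lists in place of tensors and plain dicts in place of ActionDataSample. It is not the OTX implementation; the helper bodies are placeholders chosen only so the example runs.

```python
class ToyRecognizer:
    """Hypothetical stand-in that mimics the three-mode forward contract."""

    def _forward(self, inputs):
        # "tensor" mode: raw network output, no post-processing (placeholder math).
        return [2.0 * x for x in inputs]

    def predict(self, inputs, data_samples):
        # "predict" mode: attach a score to each data sample.
        scores = self._forward(inputs)
        return [{**sample, "pred_score": s} for sample, s in zip(data_samples, scores)]

    def loss(self, inputs, data_samples):
        # "loss" mode: return a dict of named loss components.
        scores = self._forward(inputs)
        errors = [(s - d["gt_label"]) ** 2 for s, d in zip(scores, data_samples)]
        return {"loss_cls": sum(errors) / len(errors)}

    def forward(self, inputs, data_samples=None, mode="tensor"):
        # Dispatch on the mode string; unknown modes are rejected.
        if mode == "tensor":
            return self._forward(inputs)
        if mode == "predict":
            return self.predict(inputs, data_samples)
        if mode == "loss":
            return self.loss(inputs, data_samples)
        raise ValueError(f"Invalid mode: {mode!r}")


model = ToyRecognizer()
samples = [{"gt_label": 1.0}, {"gt_label": 4.0}]
tensor_out = model.forward([0.5, 2.0])                      # list of raw outputs
pred_out = model.forward([0.5, 2.0], samples, mode="predict")  # list of samples
loss_out = model.forward([0.5, 2.0], samples, mode="loss")     # dict of losses
```

In this sketch the loss mode only computes values; any backward pass or optimizer step would live outside forward, matching the note about train_step().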
- loss(inputs: Tensor, data_samples: list[ActionDataSample] | None, **kwargs) -> dict
Calculate losses from a batch of inputs and data samples.
- Parameters:
inputs (torch.Tensor) – Raw Inputs of the recognizer. These should usually be mean centered and std scaled.
data_samples (List[ActionDataSample]) – The batch data samples. It usually includes information such as gt_label.
- Returns:
A dictionary of loss components.
- Return type:
dict
- predict(inputs: Tensor, data_samples: list[ActionDataSample] | None, **kwargs) -> list[ActionDataSample]
Predict results from a batch of inputs and data samples with postprocessing.
- Parameters:
inputs (torch.Tensor) – Raw Inputs of the recognizer. These should usually be mean centered and std scaled.
data_samples (List[ActionDataSample]) – The batch data samples. It usually includes information such as gt_label.
- Returns:
The recognition results. Each returned value is an ActionDataSample, which usually contains pred_scores. pred_scores usually contains the following key:
- item (torch.Tensor): Classification scores, with shape (num_classes,).
- Return type:
List[ActionDataSample]
- class otx.algo.action_classification.MoViNetBackbone(**kwargs)
Bases: MoViNetBackboneBase
MoViNet wrapper class for OTX.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- static fill_conv(conf: DictConfig, input_channels: int, out_channels: int, kernel_size: tuple[int, int, int], stride: tuple[int, int, int], padding: tuple[int, int, int]) -> None
Set the conv layer values on a given DictConfig object.
- static fill_se_config(conf: DictConfig, input_channels: int, out_channels: int, expanded_channels: int, kernel_size: tuple[int, int, int], stride: tuple[int, int, int], padding: tuple[int, int, int], padding_avg: tuple[int, int, int]) -> None
Set the SE module values on a given DictConfig object.
- Parameters:
conf (DictConfig) – The DictConfig object to be updated.
input_channels (int) – The number of input channels.
out_channels (int) – The number of output channels.
expanded_channels (int) – The number of channels after expansion in the basic block.
padding_avg (tuple[int]) – The padding for the average pooling operation.
- Returns:
None.
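Both helpers update the given config object in place and return None. A minimal sketch of that pattern follows, using types.SimpleNamespace as a stand-in for DictConfig. The function name fill_conv_sketch, the attribute names, and the argument values are hypothetical and are not the fields OTX actually sets.

```python
from types import SimpleNamespace


def fill_conv_sketch(conf, input_channels, out_channels, kernel_size, stride, padding):
    """Copy conv hyperparameters onto a config node in place (hypothetical field names)."""
    conf.input_channels = input_channels
    conf.out_channels = out_channels
    conf.kernel_size = kernel_size
    conf.stride = stride
    conf.padding = padding
    # No return statement: like the documented helpers, the result is None.


conv1 = SimpleNamespace()  # stand-in for a DictConfig node
result = fill_conv_sketch(conv1, 3, 8, (1, 3, 3), (1, 2, 2), (0, 1, 1))
```

Because the node is mutated in place, the caller keeps using conv1 after the call rather than a returned value.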
- class otx.algo.action_classification.MoViNetHead(num_classes: int, in_channels: int, hidden_dim: int, loss_cls: Module, topk: tuple[int, int] = (1, 5), tf_like: bool = False, conv_type: str = '3d', average_clips: str | None = None)
Bases: BaseHead
Classification head for MoViNet.
- Parameters:
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
hidden_dim (int) – Number of channels in hidden layer.
tf_like (bool) – If True, uses TensorFlow-style padding. Default: False.
conv_type (str) – Type of convolutional layer. Default: ‘3d’.
loss_cls (nn.Module) – Loss class like CrossEntropyLoss.
topk (tuple[int, int]) – Top-K training loss calculation. Default: (1, 5).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Standard deviation for initialization. Default: 0.1.
Initialize BaseModule, inherited from torch.nn.Module.
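The topk argument of MoViNetHead names the k values at which predictions are scored. As an illustration of what a top-k check means, here is a sketch in plain Python. It assumes topk refers to the usual top-k hit criterion; top_k_hit is a hypothetical helper, not the head's own code.

```python
def top_k_hit(scores, label, k):
    """Return True if `label` is among the k highest-scoring class indices."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return label in ranked[:k]


scores = [0.05, 0.10, 0.60, 0.20, 0.03, 0.02]  # one score per class
# Evaluate the default topk=(1, 5) for a sample whose true class is 3.
hits = {k: top_k_hit(scores, label=3, k=k) for k in (1, 5)}
```

Here class 3 has the second-highest score, so it misses at k=1 and hits at k=5.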
- class otx.algo.action_classification.MoViNetRecognizer(**kwargs)
Bases: BaseRecognizer
MoViNet recognizer model framework for OTX compatibility.
Initialize BaseModule, inherited from torch.nn.Module.
- class otx.algo.action_classification.X3DBackbone(gamma_w: float = 1.0, gamma_b: float = 1.0, gamma_d: float = 1.0, pretrained: str | None = None, in_channels: int = 3, num_stages: int = 4, spatial_strides: tuple[int, int, int, int] = (2, 2, 2, 2), frozen_stages: int = -1, se_style: str = 'half', se_ratio: float = 0.0625, use_swish: bool = True, normalization: Callable[..., nn.Module] | None = None, activation: Callable[..., nn.Module] | None = nn.ReLU, norm_eval: bool = False, with_cp: bool = False, zero_init_residual: bool = True, **kwargs)
Bases: Module
X3D backbone. https://arxiv.org/pdf/2004.04730.pdf.
- Parameters:
gamma_w (float) – Global channel width expansion factor. Default: 1.
gamma_b (float) – Bottleneck channel width expansion factor. Default: 1.
gamma_d (float) – Network depth expansion factor. Default: 1.
pretrained (str | None) – Name of pretrained model. Default: None.
in_channels (int) – Channel num of input features. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (2, 2, 2, 2).
frozen_stages (int) – Stages to be frozen (all parameters fixed). If set to -1, no parameters are frozen. Default: -1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: 1 / 16.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
normalization (Callable[..., nn.Module] | None) – Normalization layer module. Defaults to None.
activation (Callable[..., nn.Module] | None) – Activation layer module. Defaults to nn.ReLU.
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero initialization for residual blocks. Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
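To make the depth factor gamma_d and the se_style option concrete, here is a small sketch. It assumes each stage's block count is a base count scaled by gamma_d and rounded up, and that 'half' places SE modules in every other block starting from the first. The base counts, the rounding rule, and the alternation pattern are all assumptions for illustration, not values read from the OTX source.

```python
import math


def scaled_blocks(base_blocks, gamma_d):
    """Scale per-stage block counts by the depth factor, rounding up (assumed rule)."""
    return [math.ceil(n * gamma_d) for n in base_blocks]


def se_mask(num_blocks, se_style):
    """Return one flag per block saying whether it gets an SE module (assumed pattern)."""
    if se_style == "all":
        return [True] * num_blocks
    if se_style == "half":
        return [i % 2 == 0 for i in range(num_blocks)]
    raise ValueError(f"Unknown se_style: {se_style!r}")


base_blocks = [1, 2, 5, 3]  # hypothetical base depth for num_stages=4
blocks = scaled_blocks(base_blocks, gamma_d=1.5)
mask = se_mask(blocks[2], "half")  # SE placement within the third stage
```

Under these assumptions, a larger gamma_d deepens every stage, while se_style only changes which of those blocks carry an SE unit.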
- forward(x: Tensor) -> Tensor
Defines the computation performed at every call.
- Parameters:
x (torch.Tensor) – The input data.
- Returns:
The feature of the input samples extracted by the backbone.
- Return type:
torch.Tensor
- init_weights() -> None
Initialize the parameters either from an existing checkpoint or from scratch.
- make_res_layer(block: nn.Module, layer_inplanes: int, inplanes: int, planes: int, blocks: int, spatial_stride: int = 1, se_style: str = 'half', se_ratio: float | None = None, use_swish: bool = True, normalization: Callable[..., nn.Module] | None = None, activation: Callable[..., nn.Module] | None = nn.ReLU, with_cp: bool = False, **kwargs) -> Module
Build residual layer for ResNet3D.
- Parameters:
block (nn.Module) – Residual module to be built.
layer_inplanes (int) – Number of channels for the input feature of the res layer.
inplanes (int) – Number of channels for the input feature in each block, which equals base_channels * gamma_w.
planes (int) – Number of channels for the output feature in each block, which equals base_channels * gamma_w * gamma_b.
blocks (int) – Number of residual blocks.
spatial_stride (int) – Spatial strides in residual and conv layers. Default: 1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: None.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
normalization (Callable[..., nn.Module] | None) – Normalization layer module. Defaults to None.
activation (Callable[..., nn.Module] | None) – Activation layer module. Defaults to nn.ReLU.
with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- Returns:
A residual layer for the given config.
- Return type:
nn.Module
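The relations stated for inplanes and planes can be checked with a little arithmetic. The sketch below applies them directly. base_channels = 24 and the gamma values are assumed example inputs, block_channels is a hypothetical helper, and any rounding of channel counts that the real backbone may apply is left out.

```python
def block_channels(base_channels, gamma_w, gamma_b):
    """Apply the documented relations for per-block input and output channels."""
    inplanes = base_channels * gamma_w          # input channels of each block
    planes = base_channels * gamma_w * gamma_b  # output channels of each block
    return inplanes, planes


inplanes, planes = block_channels(base_channels=24, gamma_w=2.0, gamma_b=2.25)
```

With these inputs, gamma_w doubles the base width and gamma_b widens the block output by a further factor of 2.25.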
- class otx.algo.action_classification.X3DHead(num_classes: int, in_channels: int, hidden_dim: int, loss_cls: Module, spatial_type: str = 'avg', dropout_ratio: float = 0.5, init_std: float = 0.01, fc1_bias: bool = False, average_clips: str | None = None)
Bases: BaseHead
Classification head for I3D.
- Parameters:
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (nn.Module) – Loss class like CrossEntropyLoss.
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Standard deviation for initialization. Default: 0.01.
fc1_bias (bool) – Whether the first fc layer has a bias. Default: False.
Initialize BaseModule, inherited from torch.nn.Module.