otx.algo.classification.backbones#

Backbone modules for OTX custom model.

Classes

EfficientNetBackbone(version[, input_size, ...])

EfficientNetBackbone class represents the backbone architecture of EfficientNet models.

TimmBackbone(model_name[, pretrained])

Timm backbone model.

MobileNetV3Backbone([mode, width_mult, ...])

MobileNetV3Backbone class represents the backbone architecture of MobileNetV3.

VisionTransformer([arch, img_size, patch_size, ...])

Implementation of Vision Transformer from Timm.

TorchvisionBackbone(backbone[, pretrained])

TorchvisionBackbone is a class that represents a backbone model from the torchvision library.

class otx.algo.classification.backbones.EfficientNetBackbone(version: Literal['b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7', 'b8'], input_size: tuple[int, int] | None = None, pretrained: bool = True, **kwargs)[source]#

Bases: object

EfficientNetBackbone class represents the backbone architecture of EfficientNet models.

EFFICIENTNET_CFG#

A dictionary containing configuration parameters for different versions of EfficientNet.

Type:

ClassVar[dict[str, Any]]

init_block_channels#

The number of channels in the initial block of the backbone.

Type:

ClassVar[int]

layers#

A list specifying the number of layers in each stage of the backbone.

Type:

ClassVar[list[int]]

downsample#

A list specifying whether downsampling is applied in each stage of the backbone.

Type:

ClassVar[list[int]]

channels_per_layers#

A list specifying the number of channels in each stage of the backbone.

Type:

ClassVar[list[int]]

expansion_factors_per_layers#

A list specifying the expansion factor in each stage of the backbone.

Type:

ClassVar[list[int]]

kernel_sizes_per_layers#

A list specifying the kernel size in each stage of the backbone.

Type:

ClassVar[list[int]]

strides_per_stage#

A list specifying the stride in each stage of the backbone.

Type:

ClassVar[list[int]]

final_block_channels#

The number of channels in the final block of the backbone.

Type:

ClassVar[int]

Create a new instance of the EfficientNet class.

Parameters:
  • version (EFFICIENTNET_VERSION) – The version of EfficientNet to use.

  • input_size (tuple[int, int] | None, optional) – The input size of the model. Defaults to None.

  • pretrained (bool, optional) – Whether to load pretrained weights. Defaults to True.

  • **kwargs – Additional keyword arguments to be passed to the EfficientNet constructor.

Returns:

The created EfficientNet model instance.

Return type:

EfficientNet
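
A minimal usage sketch based on the signature above. Per the return type, the call yields an EfficientNet model instance; the dummy input shape below is an illustrative assumption.

import torch

from otx.algo.classification.backbones import EfficientNetBackbone

# Instantiating the backbone class returns an EfficientNet model (see Return type above)
model = EfficientNetBackbone(version="b0", pretrained=False)

# Pass a dummy batch through the returned module (the 1x3x224x224 shape is an assumption)
features = model(torch.randn(1, 3, 224, 224))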

class otx.algo.classification.backbones.MobileNetV3Backbone(mode: Literal['small', 'large'] = 'large', width_mult: float = 1.0, pretrained: bool = True, **kwargs)[source]#

Bases: object

MobileNetV3Backbone class represents the backbone architecture of MobileNetV3.

Parameters:
  • mode (Literal["small", "large"], optional) – The mode of the backbone architecture. Defaults to “large”.

  • width_mult (float, optional) – Width multiplier for the backbone architecture. Defaults to 1.0.

  • pretrained (bool, optional) – Whether to load pretrained weights. Defaults to True.

  • **kwargs – Additional keyword arguments to be passed to the MobileNetV3 model.

Returns:

An instance of the MobileNetV3 model.

Return type:

MobileNetV3

Examples

# Create a MobileNetV3Backbone instance
backbone = MobileNetV3Backbone(mode="small", width_mult=0.75, pretrained=False)

# Create a MobileNetV3 model with the specified backbone
model = MobileNetV3(backbone=backbone)

Create a new instance of the MobileNetV3 class.

Parameters:
  • mode (Literal["small", "large"], optional) – The mode of the MobileNetV3 model. Defaults to “large”.

  • width_mult (float, optional) – Width multiplier for the MobileNetV3 model. Defaults to 1.0.

  • pretrained (bool, optional) – Whether to load pretrained weights for the MobileNetV3 model. Defaults to True.

  • **kwargs – Additional keyword arguments to be passed to the MobileNetV3 constructor.

Returns:

A new instance of the MobileNetV3 class.

Return type:

MobileNetV3

class otx.algo.classification.backbones.TimmBackbone(model_name: str, pretrained: bool = False, **kwargs)[source]#

Bases: Module

Timm backbone model.

Parameters:
  • model_name (str) – The name of the model. You can find available models at timm.list_models() or timm.list_pretrained().

  • pretrained (bool, optional) – Whether to load pretrained weights. Defaults to False.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

extract_features(x: Tensor) Tensor[source]#

Extract features.

forward(x: Tensor, **kwargs) tuple[Tensor][source]#

Forward.

get_config_optim(lrs: list[float] | float) list[dict[str, float]][source]#

Get optimizer configs.
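
A minimal usage sketch of the methods above. The model name "resnet18" is an illustrative assumption (any entry from timm.list_models() applies), as is the input shape.

import torch

from otx.algo.classification.backbones import TimmBackbone

backbone = TimmBackbone(model_name="resnet18", pretrained=False)

x = torch.randn(1, 3, 224, 224)        # dummy batch; the shape is an assumption
feats = backbone(x)                    # forward() returns a tuple of feature tensors
pooled = backbone.extract_features(x)  # extract_features() returns a single tensor
param_groups = backbone.get_config_optim(lrs=0.01)  # optimizer parameter-group configs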

class otx.algo.classification.backbones.TorchvisionBackbone(backbone: Literal['alexnet', 'convnext_base', 'convnext_large', 'convnext_small', 'convnext_tiny', 'efficientnet_b0', 'efficientnet_b1', 'efficientnet_b2', 'efficientnet_b3', 'efficientnet_b4', 'efficientnet_b5', 'efficientnet_b6', 'efficientnet_b7', 'efficientnet_v2_l', 'efficientnet_v2_m', 'efficientnet_v2_s', 'googlenet', 'mobilenet_v3_large', 'mobilenet_v3_small', 'regnet_x_16gf', 'regnet_x_1_6gf', 'regnet_x_32gf', 'regnet_x_3_2gf', 'regnet_x_400mf', 'regnet_x_800mf', 'regnet_x_8gf', 'regnet_y_128gf', 'regnet_y_16gf', 'regnet_y_1_6gf', 'regnet_y_32gf', 'regnet_y_3_2gf', 'regnet_y_400mf', 'regnet_y_800mf', 'regnet_y_8gf', 'resnet101', 'resnet152', 'resnet18', 'resnet34', 'resnet50', 'resnext101_32x8d', 'resnext101_64x4d', 'resnext50_32x4d', 'swin_b', 'swin_s', 'swin_t', 'swin_v2_b', 'swin_v2_s', 'swin_v2_t', 'vgg11', 'vgg11_bn', 'vgg13', 'vgg13_bn', 'vgg16', 'vgg16_bn', 'vgg19', 'vgg19_bn', 'wide_resnet101_2', 'wide_resnet50_2'], pretrained: bool = False, **kwargs)[source]#

Bases: Module

TorchvisionBackbone is a class that represents a backbone model from the torchvision library.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(*args) Tensor[source]#

Forward pass of the model.
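
A minimal usage sketch; "resnet50" is one of the literals accepted by the backbone argument above, and the input shape is an illustrative assumption.

import torch

from otx.algo.classification.backbones import TorchvisionBackbone

backbone = TorchvisionBackbone(backbone="resnet50", pretrained=False)

features = backbone(torch.randn(1, 3, 224, 224))  # forward() returns a Tensor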

class otx.algo.classification.backbones.VisionTransformer(arch: ~typing.Literal['vit-t', 'vit-tiny', 'vit-s', 'vit-small', 'vit-b', 'vit-base', 'vit-l', 'vit-large', 'vit-h', 'vit-huge', 'dinov2-s', 'dinov2-small', 'dinov2-small-seg', 'dinov2-b', 'dinov2-base', 'dinov2-l', 'dinov2-large', 'dinov2-g', 'dinov2-giant'] | str = 'vit-base', img_size: int | tuple[int, int] = 224, patch_size: int | None = None, in_chans: int = 3, num_classes: int = 1000, embed_dim: int | None = None, depth: int | None = None, num_heads: int | None = None, mlp_ratio: float | None = None, qkv_bias: bool = True, qk_norm: bool = False, init_values: float | None = None, class_token: bool = True, no_embed_class: bool | None = None, reg_tokens: int | None = None, pre_norm: bool = False, dynamic_img_size: bool = False, dynamic_img_pad: bool = False, pos_drop_rate: float = 0.0, patch_drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, embed_layer: ~typing.Callable = <class 'timm.layers.patch_embed.PatchEmbed'>, block_fn: ~torch.nn.modules.module.Module = <class 'timm.models.vision_transformer.Block'>, mlp_layer: ~torch.nn.modules.module.Module | None = None, act_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, norm_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, interpolate_offset: float = 0.1, lora: bool = False)[source]#

Bases: BaseModule

Implementation of Vision Transformer from Timm.

A PyTorch implementation of "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale".

Parameters:
  • arch – Vision Transformer architecture.

  • img_size – Input image size.

  • patch_size – Patch size.

  • in_chans – Number of image input channels.

  • num_classes – Number of classes for the classification head.

  • embed_dim – Transformer embedding dimension.

  • depth – Depth of transformer.

  • num_heads – Number of attention heads.

  • mlp_ratio – Ratio of mlp hidden dim to embedding dim.

  • qkv_bias – Enable bias for qkv projections if True.

  • init_values – Layer-scale init values (layer-scale enabled if not None).

  • class_token – Use class token.

  • no_embed_class – Don’t include position embeddings for class (or reg) tokens.

  • reg_tokens – Number of register tokens.

  • drop_rate – Head dropout rate.

  • pos_drop_rate – Position embedding dropout rate.

  • attn_drop_rate – Attention dropout rate.

  • drop_path_rate – Stochastic depth rate.

  • weight_init – Weight initialization scheme.

  • fix_init – Apply weight initialization fix (scaling w/ layer index).

  • embed_layer – Patch embedding layer.

  • norm_layer – Normalization layer.

  • act_layer – MLP activation layer.

  • block_fn – Transformer block layer.

  • interpolate_offset – Work-around offset to apply when interpolating positional embeddings.

  • lora – Enable LoRA training.

Initialize BaseModule, inherited from torch.nn.Module.

forward(x: Tensor, out_type: Literal['raw', 'cls_token', 'featmap', 'avg_featmap'] = 'cls_token') tuple[source]#

Forward pass of the VisionTransformer model.
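
A minimal sketch of the forward call. The architecture "vit-tiny" and out_type="cls_token" are chosen from the options listed above; the input shape is an assumption.

import torch

from otx.algo.classification.backbones import VisionTransformer

model = VisionTransformer(arch="vit-tiny", img_size=224)

x = torch.randn(1, 3, 224, 224)       # dummy batch; the shape is an assumption
out = model(x, out_type="cls_token")  # returns a tuple, per the signature above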

get_intermediate_layers(x: Tensor, n: int = 1, reshape: bool = False, return_class_token: bool = False, norm: bool = True) tuple[source]#

Get intermediate layers of the VisionTransformer.

Parameters:
  • x (torch.Tensor) – Input tensor.

  • n (int) – Number of last blocks to take. If it’s a list, take the specified blocks.

  • reshape (bool) – Whether to reshape the output feature maps.

  • return_class_token (bool) – Whether to return the class token.

  • norm (bool) – Whether to apply normalization to the outputs.

Returns:

A tuple containing the intermediate layer outputs.

Return type:

tuple
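
A hedged sketch of get_intermediate_layers, continuing the model and x from the forward sketch above; taking the last 4 blocks is an illustrative choice.

# Outputs of the last 4 blocks, reshaped to feature maps, with class tokens returned alongside
outputs = model.get_intermediate_layers(
    x,
    n=4,
    reshape=True,
    return_class_token=True,
    norm=True,
)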

init_weights() None[source]#

Initializes the weights of the VisionTransformer.

interpolate_pos_encoding(x: Tensor, w: int, h: int) Tensor[source]#

Interpolates the positional encoding to match the input dimensions.

Parameters:
  • x (torch.Tensor) – Input tensor.

  • w (int) – Width of the input image.

  • h (int) – Height of the input image.

Returns:

Tensor with interpolated positional encoding.

Return type:

torch.Tensor

load_pretrained(checkpoint_path: Path, prefix: str = '') None[source]#

Loads pretrained weights into the VisionTransformer.

prepare_tokens_with_masks(x: Tensor, masks: Tensor | None = None) Tensor[source]#

Prepare tokens with optional masks.

Parameters:
  • x (torch.Tensor) – Input tensor.

  • masks (torch.Tensor | None) – Optional masks tensor.

Returns:

Tensor with prepared tokens.

Return type:

torch.Tensor