otx.algo.classification.backbones#

Backbone modules for OTX custom model.

Classes

EfficientNetBackbone(version[, input_size, ...])

EfficientNetBackbone class represents the backbone architecture of EfficientNet models.

TimmBackbone(model_name[, pretrained])

Timm backbone model.

MobileNetV3Backbone([mode, width_mult, ...])

MobileNetV3Backbone class represents the backbone architecture of MobileNetV3.

VisionTransformer([arch, img_size, patch_size, ...])

Implementation of Vision Transformer from Timm.

TorchvisionBackbone(backbone[, pretrained])

TorchvisionBackbone is a class that represents a backbone model from the torchvision library.

class otx.algo.classification.backbones.EfficientNetBackbone(version: Literal['b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7', 'b8'], input_size: tuple[int, int] | None = None, pretrained: bool = True, **kwargs)[source]#

Bases: object

EfficientNetBackbone class represents the backbone architecture of EfficientNet models.

EFFICIENTNET_CFG#

A dictionary containing configuration parameters for different versions of EfficientNet.

Type:

ClassVar[dict[str, Any]]

init_block_channels#

The number of channels in the initial block of the backbone.

Type:

ClassVar[int]

layers#

A list specifying the number of layers in each stage of the backbone.

Type:

ClassVar[list[int]]

downsample#

A list specifying whether downsampling is applied in each stage of the backbone.

Type:

ClassVar[list[int]]

channels_per_layers#

A list specifying the number of channels in each stage of the backbone.

Type:

ClassVar[list[int]]

expansion_factors_per_layers#

A list specifying the expansion factor in each stage of the backbone.

Type:

ClassVar[list[int]]

kernel_sizes_per_layers#

A list specifying the kernel size in each stage of the backbone.

Type:

ClassVar[list[int]]

strides_per_stage#

A list specifying the stride in each stage of the backbone.

Type:

ClassVar[list[int]]

final_block_channels#

The number of channels in the final block of the backbone.

Type:

ClassVar[int]

Create a new instance of the EfficientNet class.

Parameters:
  • version (EFFICIENTNET_VERSION) – The version of EfficientNet to use.

  • input_size (tuple[int, int] | None, optional) – The input size of the model. Defaults to None.

  • pretrained (bool, optional) – Whether to load pretrained weights. Defaults to True.

  • **kwargs – Additional keyword arguments to be passed to the EfficientNet constructor.

Returns:

The created EfficientNet model instance.

Return type:

EfficientNet
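
A minimal usage sketch based on the signature above. Per the return type, the call yields an EfficientNet model instance; the dummy input shape below is an illustrative assumption.

import torch

from otx.algo.classification.backbones import EfficientNetBackbone

# Instantiating the backbone class returns an EfficientNet model (see Return type above)
model = EfficientNetBackbone(version="b0", pretrained=False)

# Pass a dummy batch through the returned module (the 1x3x224x224 shape is an assumption)
features = model(torch.randn(1, 3, 224, 224))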

class otx.algo.classification.backbones.MobileNetV3Backbone(mode: Literal['small', 'large'] = 'large', width_mult: float = 1.0, pretrained: bool = True, **kwargs)[source]#

Bases: object

MobileNetV3Backbone class represents the backbone architecture of MobileNetV3.

Parameters:
  • mode (Literal["small", "large"], optional) – The mode of the backbone architecture. Defaults to “large”.

  • width_mult (float, optional) – Width multiplier for the backbone architecture. Defaults to 1.0.

  • pretrained (bool, optional) – Whether to load pretrained weights. Defaults to True.

  • **kwargs – Additional keyword arguments to be passed to the MobileNetV3 model.

Returns:

An instance of the MobileNetV3 model.

Return type:

MobileNetV3

Examples

# Create a MobileNetV3Backbone instance
backbone = MobileNetV3Backbone(mode="small", width_mult=0.75, pretrained=False)

# Create a MobileNetV3 model with the specified backbone
model = MobileNetV3(backbone=backbone)

Create a new instance of the MobileNetV3 class.

Parameters:
  • mode (Literal["small", "large"], optional) – The mode of the MobileNetV3 model. Defaults to “large”.

  • width_mult (float, optional) – Width multiplier for the MobileNetV3 model. Defaults to 1.0.

  • pretrained (bool, optional) – Whether to load pretrained weights for the MobileNetV3 model. Defaults to True.

  • **kwargs – Additional keyword arguments to be passed to the MobileNetV3 constructor.

Returns:

A new instance of the MobileNetV3 class.

Return type:

MobileNetV3

class otx.algo.classification.backbones.TimmBackbone(model_name: str, pretrained: bool = False, **kwargs)[source]#

Bases: Module

Timm backbone model.

Parameters:
  • model_name (str) – The name of the model. You can find available models at timm.list_models() or timm.list_pretrained().

  • pretrained (bool, optional) – Whether to load pretrained weights. Defaults to False.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

extract_features(x: Tensor) Tensor[source]#

Extract features.

forward(x: Tensor, **kwargs) tuple[Tensor][source]#

Forward.

get_config_optim(lrs: list[float] | float) list[dict[str, float]][source]#

Get optimizer configs.
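
A minimal usage sketch of the methods above. The model name "resnet18" is an illustrative assumption (any entry from timm.list_models() applies), as is the input shape.

import torch

from otx.algo.classification.backbones import TimmBackbone

backbone = TimmBackbone(model_name="resnet18", pretrained=False)

x = torch.randn(1, 3, 224, 224)        # dummy batch; the shape is an assumption
feats = backbone(x)                    # forward() returns a tuple of feature tensors
pooled = backbone.extract_features(x)  # extract_features() returns a single tensor
param_groups = backbone.get_config_optim(lrs=0.01)  # optimizer parameter-group configs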

class otx.algo.classification.backbones.TorchvisionBackbone(backbone: Literal['alexnet', 'convnext_base', 'convnext_large', 'convnext_small', 'convnext_tiny', 'efficientnet_b0', 'efficientnet_b1', 'efficientnet_b2', 'efficientnet_b3', 'efficientnet_b4', 'efficientnet_b5', 'efficientnet_b6', 'efficientnet_b7', 'efficientnet_v2_l', 'efficientnet_v2_m', 'efficientnet_v2_s', 'googlenet', 'mobilenet_v3_large', 'mobilenet_v3_small', 'regnet_x_16gf', 'regnet_x_1_6gf', 'regnet_x_32gf', 'regnet_x_3_2gf', 'regnet_x_400mf', 'regnet_x_800mf', 'regnet_x_8gf', 'regnet_y_128gf', 'regnet_y_16gf', 'regnet_y_1_6gf', 'regnet_y_32gf', 'regnet_y_3_2gf', 'regnet_y_400mf', 'regnet_y_800mf', 'regnet_y_8gf', 'resnet101', 'resnet152', 'resnet18', 'resnet34', 'resnet50', 'resnext101_32x8d', 'resnext101_64x4d', 'resnext50_32x4d', 'swin_b', 'swin_s', 'swin_t', 'swin_v2_b', 'swin_v2_s', 'swin_v2_t', 'vgg11', 'vgg11_bn', 'vgg13', 'vgg13_bn', 'vgg16', 'vgg16_bn', 'vgg19', 'vgg19_bn', 'wide_resnet101_2', 'wide_resnet50_2'], pretrained: bool = False, **kwargs)[source]#

Bases: Module

TorchvisionBackbone is a class that represents a backbone model from the torchvision library.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(*args) Tensor[source]#

Forward pass of the model.
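
A minimal usage sketch; "resnet50" is one of the literals accepted by the backbone argument above, and the input shape is an illustrative assumption.

import torch

from otx.algo.classification.backbones import TorchvisionBackbone

backbone = TorchvisionBackbone(backbone="resnet50", pretrained=False)

features = backbone(torch.randn(1, 3, 224, 224))  # forward() returns a Tensor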

class otx.algo.classification.backbones.VisionTransformer(arch: ~typing.Literal['vit-t', 'vit-tiny', 'vit-s', 'vit-small', 'vit-b', 'vit-base', 'vit-l', 'vit-large', 'vit-h', 'vit-huge', 'dinov2-s', 'dinov2-small', 'dinov2-small-seg', 'dinov2-b', 'dinov2-base', 'dinov2-l', 'dinov2-large', 'dinov2-g', 'dinov2-giant'] | str = 'vit-base', img_size: int | tuple[int, int] = 224, patch_size: int | None = None, in_chans: int = 3, num_classes: int = 1000, embed_dim: int | None = None, depth: int | None = None, num_heads: int | None = None, mlp_ratio: float | None = None, qkv_bias: bool = True, qk_norm: bool = False, init_values: float | None = None, class_token: bool = True, no_embed_class: bool | None = None, reg_tokens: int | None = None, pre_norm: bool = False, dynamic_img_size: bool = False, dynamic_img_pad: bool = False, pos_drop_rate: float = 0.0, patch_drop_rate: float = 0.0, proj_drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, embed_layer: ~typing.Callable = <class 'timm.layers.patch_embed.PatchEmbed'>, block_fn: ~torch.nn.modules.module.Module = <class 'timm.models.vision_transformer.Block'>, mlp_layer: ~torch.nn.modules.module.Module | None = None, act_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, norm_layer: str | ~typing.Callable | ~typing.Type[~torch.nn.modules.module.Module] | None = None, interpolate_offset: float = 0.1, lora: bool = False)[source]#

Bases: BaseModule

Implementation of Vision Transformer from Timm.

A PyTorch implementation of "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale".

Parameters:
  • arch – Vision Transformer architecture.

  • img_size – Input image size.

  • patch_size – Patch size.

  • in_chans – Number of image input channels.

  • num_classes – Number of classes for the classification head.

  • embed_dim – Transformer embedding dimension.

  • depth – Depth of transformer.

  • num_heads – Number of attention heads.

  • mlp_ratio – Ratio of mlp hidden dim to embedding dim.

  • qkv_bias – Enable bias for qkv projections if True.

  • init_values – Layer-scale init values (layer-scale enabled if not None).

  • class_token – Use class token.

  • no_embed_class – Don’t include position embeddings for class (or reg) tokens.

  • reg_tokens – Number of register tokens.

  • drop_rate – Head dropout rate.

  • pos_drop_rate – Position embedding dropout rate.

  • attn_drop_rate – Attention dropout rate.

  • drop_path_rate – Stochastic depth rate.

  • weight_init – Weight initialization scheme.

  • fix_init – Apply weight initialization fix (scaling w/ layer index).

  • embed_layer – Patch embedding layer.

  • norm_layer – Normalization layer.

  • act_layer – MLP activation layer.

  • block_fn – Transformer block layer.

  • interpolate_offset – Work-around offset to apply when interpolating positional embeddings.

  • lora – Enable LoRA training.

Initialize BaseModule, inherited from torch.nn.Module.

forward(x: Tensor, out_type: Literal['raw', 'cls_token', 'featmap', 'avg_featmap'] = 'cls_token') tuple[source]#

Forward pass of the VisionTransformer model.
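
A minimal sketch of the forward call. The architecture "vit-tiny" and out_type="cls_token" are chosen from the options listed above; the input shape is an assumption.

import torch

from otx.algo.classification.backbones import VisionTransformer

model = VisionTransformer(arch="vit-tiny", img_size=224)

x = torch.randn(1, 3, 224, 224)       # dummy batch; the shape is an assumption
out = model(x, out_type="cls_token")  # returns a tuple, per the signature above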

get_intermediate_layers(x: Tensor, n: int = 1, reshape: bool = False, return_class_token: bool = False, norm: bool = True) tuple[source]#

Get intermediate layers of the VisionTransformer.

Parameters:
  • x (torch.Tensor) – Input tensor.

  • n (int) – Number of last blocks to take. If it’s a list, take the specified blocks.

  • reshape (bool) – Whether to reshape the output feature maps.

  • return_class_token (bool) – Whether to return the class token.

  • norm (bool) – Whether to apply normalization to the outputs.

Returns:

A tuple containing the intermediate layer outputs.

Return type:

tuple
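
A hedged sketch of get_intermediate_layers, continuing the model and x from the forward sketch above; taking the last 4 blocks is an illustrative choice.

# Outputs of the last 4 blocks, reshaped to feature maps, with class tokens returned alongside
outputs = model.get_intermediate_layers(
    x,
    n=4,
    reshape=True,
    return_class_token=True,
    norm=True,
)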

init_weights() None[source]#

Initializes the weights of the VisionTransformer.

interpolate_pos_encoding(x: Tensor, w: int, h: int) Tensor[source]#

Interpolates the positional encoding to match the input dimensions.

Parameters:
  • x (torch.Tensor) – Input tensor.

  • w (int) – Width of the input image.

  • h (int) – Height of the input image.

Returns:

Tensor with interpolated positional encoding.

Return type:

torch.Tensor

load_pretrained(checkpoint_path: Path, prefix: str = '') None[source]#

Loads pretrained weights into the VisionTransformer.

prepare_tokens_with_masks(x: Tensor, masks: Tensor | None = None) Tensor[source]#

Prepare tokens with optional masks.

Parameters:
  • x (torch.Tensor) – Input tensor.

  • masks (torch.Tensor | None) – Optional masks tensor.

Returns:

Tensor with prepared tokens.

Return type:

torch.Tensor