otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit

Vision Transformers.

Functions

  • add_decomposed_rel_pos(attn, q, rel_pos_h, ...) – Calculate decomposed Relative Positional Embeddings from mvitv2.

  • build_vit(backbone, image_size) – Build ViT backbone.

  • get_rel_pos(q_size, k_size, rel_pos) – Get relative positional embeddings according to the relative positions of query and key sizes.

  • window_partition(x, window_size) – Partition into non-overlapping windows with padding if needed.

  • window_unpartition(windows, window_size, ...) – Reverse window partition into the original sequence and remove padding.

Classes

  • Attention(dim[, num_heads, qkv_bias, ...]) – Multi-head Attention block with relative position embeddings.

  • Block(dim, num_heads, mlp_ratio, qkv_bias, ...) – Transformer block with support for window attention and residual propagation blocks.

  • PatchEmbed([kernel_size, stride, padding, ...]) – Image to Patch Embedding.

  • ViT(img_size, patch_size, in_chans, ...) – Vision Transformer for the visual prompting task.

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.Attention(dim: int, num_heads: int = 8, qkv_bias: bool = True, use_rel_pos: bool = False, rel_pos_zero_init: bool = True, input_size: Tuple[int, int] | None = None)[source]

Bases: Module

Multi-head Attention block with relative position embeddings.

Parameters:
  • dim (int) – Number of input channels.

  • num_heads (int) – Number of attention heads.

  • qkv_bias (bool) – If True, add a learnable bias to query, key, value.

  • use_rel_pos (bool) – If True, add relative positional embeddings to the attention map.

  • rel_pos_zero_init (bool) – If True, zero initialize relative positional parameters.

  • input_size (tuple(int, int) or None) – Input resolution for calculating the relative positional parameter size.

forward(x: Tensor) → Tensor[source]

Forward function.

Parameters:

x (Tensor) – Input tensor of shape (B, H, W, C).

Returns:

Output tensor of shape (B, H, W, C).

Return type:

Tensor
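
A minimal usage sketch (assuming torch is installed and the module is importable at the documented path; the concrete sizes are illustrative only):

   import torch

   from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import Attention

   # Relative positional embeddings require the input resolution to be known.
   attn = Attention(dim=768, num_heads=12, use_rel_pos=True, input_size=(14, 14))
   x = torch.randn(1, 14, 14, 768)  # (B, H, W, C), channels-last as documented
   out = attn(x)
   print(out.shape)  # same (B, H, W, C) shape as the input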

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.Block(dim: int, num_heads: int, mlp_ratio: float = 4.0, qkv_bias: bool = True, norm_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.normalization.LayerNorm'>, act_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.GELU'>, use_rel_pos: bool = False, rel_pos_zero_init: bool = True, window_size: int = 0, input_size: ~typing.Tuple[int, int] | None = None)[source]

Bases: Module

Transformer block with support for window attention and residual propagation blocks.

Parameters:
  • dim (int) – Number of input channels.

  • num_heads (int) – Number of attention heads in each ViT block.

  • mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim.

  • qkv_bias (bool) – If True, add a learnable bias to query, key, value.

  • norm_layer (nn.Module) – Normalization layer.

  • act_layer (nn.Module) – Activation layer.

  • use_rel_pos (bool) – If True, add relative positional embeddings to the attention map.

  • rel_pos_zero_init (bool) – If True, zero initialize relative positional parameters.

  • window_size (int) – Window size for window attention blocks. If it equals 0, then use global attention.

  • input_size (tuple(int, int) or None) – Input resolution for calculating the relative positional parameter size.

forward(x: Tensor) → Tensor[source]

Forward function.

Parameters:

x (Tensor) – Input tensor of shape (B, H, W, C).

Returns:

Output tensor of shape (B, H, W, C).

Return type:

Tensor
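
A minimal usage sketch (illustrative sizes; window_size=14 is just one possible choice):

   import torch

   from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import Block

   # window_size > 0 applies windowed attention inside the block; 0 would use global attention.
   block = Block(dim=768, num_heads=12, window_size=14)
   x = torch.randn(1, 64, 64, 768)  # (B, H, W, C)
   y = block(x)
   print(y.shape)  # the block preserves the (B, H, W, C) shape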

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.PatchEmbed(kernel_size: Tuple[int, int] = (16, 16), stride: Tuple[int, int] = (16, 16), padding: Tuple[int, int] = (0, 0), in_chans: int = 3, embed_dim: int = 768)[source]

Bases: Module

Image to Patch Embedding.

Parameters:
  • kernel_size (Tuple) – kernel size of the projection layer.

  • stride (Tuple) – stride of the projection layer.

  • padding (Tuple) – padding size of the projection layer.

  • in_chans (int) – Number of input image channels.

  • embed_dim (int) – Patch embedding dimension.

forward(x: Tensor) → Tensor[source]

Forward call.

Parameters:

x (Tensor) – input image tensor with shape (B, C, H, W).

Returns:

output tensor with shape (B, H', W', C').

Return type:

Tensor
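
A minimal usage sketch (with the default 16 x 16 kernel and stride, a 1024 x 1024 image becomes a 64 x 64 grid of patch tokens):

   import torch

   from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import PatchEmbed

   patch_embed = PatchEmbed(kernel_size=(16, 16), stride=(16, 16), in_chans=3, embed_dim=768)
   img = torch.randn(1, 3, 1024, 1024)  # (B, C, H, W)
   tokens = patch_embed(img)
   print(tokens.shape)  # (B, H', W', C') -> expected torch.Size([1, 64, 64, 768])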

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.ViT(img_size: int = 1024, patch_size: int = 16, in_chans: int = 3, embed_dim: int = 768, depth: int = 12, num_heads: int = 12, mlp_ratio: float = 4.0, out_chans: int = 256, qkv_bias: bool = True, norm_layer: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, act_layer: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.activation.GELU'>, use_abs_pos: bool = True, use_rel_pos: bool = False, rel_pos_zero_init: bool = True, window_size: int = 0, global_attn_indexes: ~typing.Tuple[int, ...] = ())[source]

Bases: Module

Vision Transformer for the visual prompting task.

Parameters:
  • img_size (int) – Input image size.

  • patch_size (int) – Patch size.

  • in_chans (int) – Number of input image channels.

  • embed_dim (int) – Patch embedding dimension.

  • depth (int) – Depth of ViT.

  • num_heads (int) – Number of attention heads in each ViT block.

  • mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim.

  • out_chans (int) – Number of output channels.

  • qkv_bias (bool) – If True, add a learnable bias to query, key, value.

  • norm_layer (nn.Module) – Normalization layer.

  • act_layer (nn.Module) – Activation layer.

  • use_abs_pos (bool) – If True, use absolute positional embeddings.

  • use_rel_pos (bool) – If True, add relative positional embeddings to the attention map.

  • rel_pos_zero_init (bool) – If True, zero initialize relative positional parameters.

  • window_size (int) – Window size for window attention blocks.

  • global_attn_indexes (Tuple[int, ...]) – Indices of blocks using global attention.

forward(x: Tensor) → Tensor[source]

Forward function.

Parameters:

x (Tensor) – Input tensor of shape (B, C, H, W).

Returns:

Output tensor of shape (B, out_chans, H, W).

Return type:

Tensor
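
A minimal usage sketch with a deliberately shallow depth so it runs quickly; the remaining arguments keep their documented defaults:

   import torch

   from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import ViT

   encoder = ViT(
       img_size=1024,
       patch_size=16,
       embed_dim=768,
       depth=2,                   # kept tiny for illustration; pretrained variants are deeper
       num_heads=12,
       out_chans=256,
       window_size=14,
       global_attn_indexes=(1,),  # the second block uses global attention
   )
   features = encoder(torch.randn(1, 3, 1024, 1024))  # (B, C, H, W) input
   print(features.shape)  # encoded image features with out_chans channels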

otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.add_decomposed_rel_pos(attn: Tensor, q: Tensor, rel_pos_h: Tensor, rel_pos_w: Tensor, q_size: Tuple[int, int], k_size: Tuple[int, int]) → Tensor[source]

Calculate decomposed Relative Positional Embeddings from mvitv2.

Reference: facebookresearch/mvit.

Parameters:
  • attn (Tensor) – attention map.

  • q (Tensor) – query q in the attention layer with shape (B, q_h * q_w, C).

  • rel_pos_h (Tensor) – relative position embeddings (Lh, C) for height axis.

  • rel_pos_w (Tensor) – relative position embeddings (Lw, C) for width axis.

  • q_size (Tuple) – spatial sequence size of query q with (q_h, q_w).

  • k_size (Tuple) – spatial sequence size of key k with (k_h, k_w).

Returns:

attention map with added relative positional embeddings.

Return type:

Tensor
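
A minimal sketch with illustrative sizes, assuming the attention map is laid out as (B, q_h * q_w, k_h * k_w); in the Attention layer above, rel_pos_h and rel_pos_w are its learned relative positional parameters:

   import torch

   from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import (
       add_decomposed_rel_pos,
   )

   q_h = q_w = k_h = k_w = 14
   B, C = 1, 64
   q = torch.randn(B, q_h * q_w, C)
   attn = torch.randn(B, q_h * q_w, k_h * k_w)
   rel_pos_h = torch.randn(2 * max(q_h, k_h) - 1, C)  # (Lh, C) table for the height axis
   rel_pos_w = torch.randn(2 * max(q_w, k_w) - 1, C)  # (Lw, C) table for the width axis
   attn = add_decomposed_rel_pos(attn, q, rel_pos_h, rel_pos_w, (q_h, q_w), (k_h, k_w))
   print(attn.shape)  # shape is unchanged: (B, q_h * q_w, k_h * k_w)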

otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.build_vit(backbone: str, image_size: int)[source]

Build ViT backbone.

Parameters:
  • backbone (str) – backbone name.

  • image_size (int) – input image size.

Returns:

ViT backbone.

Return type:

ViT
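
A minimal sketch; the backbone name "vit_b" below is an assumption used purely for illustration, so check the OTX source for the names this factory actually accepts:

   from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import build_vit

   # "vit_b" is an assumed name, not a value confirmed by this page.
   backbone = build_vit(backbone="vit_b", image_size=1024)
   print(type(backbone).__name__)  # ViT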

otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.get_rel_pos(q_size: int, k_size: int, rel_pos: Tensor) → Tensor[source]

Get relative positional embeddings according to the relative positions of query and key sizes.

Parameters:
  • q_size (int) – size of query q.

  • k_size (int) – size of key k.

  • rel_pos (Tensor) – relative position embeddings (L, C).

Returns:

Extracted positional embeddings according to relative positions.

Return type:

Tensor
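
A minimal sketch; the table length 2 * max(q_size, k_size) - 1 gives one entry per possible relative offset between a query and a key position:

   import torch

   from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import get_rel_pos

   q_size, k_size, C = 14, 14, 64
   rel_pos = torch.randn(2 * max(q_size, k_size) - 1, C)  # (L, C) embedding table
   rel = get_rel_pos(q_size, k_size, rel_pos)
   print(rel.shape)  # expected (q_size, k_size, C): one embedding per query/key position pair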

otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.window_partition(x: Tensor, window_size: int) → Tuple[Tensor, Tuple[int, int]][source]

Partition into non-overlapping windows with padding if needed.

Parameters:
  • x (Tensor) – Input tokens with [B, H, W, C].

  • window_size (int) – Window size.

Returns:

A tuple (windows, (Hp, Wp)) where windows holds the partitioned windows with shape [B * num_windows, window_size, window_size, C] and (Hp, Wp) is the padded height and width before partition.

Return type:

Tuple[Tensor, Tuple[int, int]]
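
A minimal sketch; the spatial size 30 is chosen deliberately so that padding up to a multiple of the window size is exercised:

   import torch

   from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import window_partition

   x = torch.randn(1, 30, 30, 96)  # (B, H, W, C); 30 is not a multiple of 14
   windows, (Hp, Wp) = window_partition(x, window_size=14)
   print(windows.shape, (Hp, Wp))  # expected 9 windows of 14 x 14 and padded size (42, 42)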

otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.window_unpartition(windows: Tensor, window_size: int, pad_hw: Tuple[int, int], hw: Tuple[int, int]) → Tensor[source]

Reverse window partition into the original sequence and remove padding.

Parameters:
  • windows (Tensor) – input tokens with [B * num_windows, window_size, window_size, C].

  • window_size (int) – window size.

  • pad_hw (Tuple) – padded height and width (Hp, Wp).

  • hw (Tuple) – original height and width (H, W) before padding.

Returns:

unpartitioned sequences with [B, H, W, C].

Return type:

Tensor
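
A minimal round-trip sketch: window_partition pads the input, and window_unpartition with the original (H, W) removes that padding again:

   import torch

   from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import (
       window_partition,
       window_unpartition,
   )

   x = torch.randn(1, 30, 30, 96)  # (B, H, W, C)
   windows, pad_hw = window_partition(x, window_size=14)
   restored = window_unpartition(windows, window_size=14, pad_hw=pad_hw, hw=(30, 30))
   print(torch.equal(x, restored))  # True: the round trip recovers the original tensor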