otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit#
Vision Transformers.
Functions

| add_decomposed_rel_pos | Calculate decomposed Relative Positional Embeddings from mvitv2. |
| build_vit | Build ViT backbone. |
| get_rel_pos | Get relative positional embeddings according to the relative positions of query and key sizes. |
| window_partition | Partition into non-overlapping windows with padding if needed. |
| window_unpartition | Unpartition windows into original sequences and remove padding. |

Classes

| Attention | Multi-head Attention block with relative position embeddings. |
| Block | Transformer block with support for window attention and residual propagation blocks. |
| PatchEmbed | Image to Patch Embedding. |
| ViT | Vision Transformer for the visual prompting task. |
- class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.Attention(dim: int, num_heads: int = 8, qkv_bias: bool = True, use_rel_pos: bool = False, rel_pos_zero_init: bool = True, input_size: Tuple[int, int] | None = None)[source]#
Bases:
Module
Multi-head Attention block with relative position embeddings.
- Parameters:
dim (int) – Number of input channels.
num_heads (int) – Number of attention heads.
qkv_bias (bool) – If True, add a learnable bias to query, key, value.
use_rel_pos (bool) – If True, add relative positional embeddings to the attention map.
rel_pos_zero_init (bool) – If True, zero initialize relative positional parameters.
input_size (tuple(int, int) or None) – Input resolution for calculating the relative positional parameter size.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
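A minimal usage sketch, assuming the forward pass takes token maps in (B, H, W, C) layout as in SAM-style image encoders (the layout is not stated in this entry), with illustrative sizes:

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import Attention

# Illustrative sizes; input_size is required here because use_rel_pos=True.
attn = Attention(dim=64, num_heads=8, use_rel_pos=True, input_size=(14, 14))

tokens = torch.randn(2, 14, 14, 64)  # assumed (B, H, W, C) token map
out = attn(tokens)
print(out.shape)  # expected: torch.Size([2, 14, 14, 64])
```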
- class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.Block(dim: int, num_heads: int, mlp_ratio: float = 4.0, qkv_bias: bool = True, norm_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.normalization.LayerNorm'>, act_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.GELU'>, use_rel_pos: bool = False, rel_pos_zero_init: bool = True, window_size: int = 0, input_size: ~typing.Tuple[int, int] | None = None)[source]#
Bases:
Module
Transformer block with support for window attention and residual propagation blocks.
- Parameters:
dim (int) – Number of input channels.
num_heads (int) – Number of attention heads in each ViT block.
mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim.
qkv_bias (bool) – If True, add a learnable bias to query, key, value.
norm_layer (nn.Module) – Normalization layer.
act_layer (nn.Module) – Activation layer.
use_rel_pos (bool) – If True, add relative positional embeddings to the attention map.
rel_pos_zero_init (bool) – If True, zero initialize relative positional parameters.
window_size (int) – Window size for window attention blocks. If it equals 0, then use global attention.
input_size (tuple(int, int) or None) – Input resolution for calculating the relative positional parameter size.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
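A hedged sketch of a windowed block, again assuming (B, H, W, C) token maps; window_size=7 and input_size=(28, 28) are illustrative values only:

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import Block

# window_size > 0 uses local window attention; window_size=0 would fall back to global attention.
block = Block(dim=64, num_heads=8, window_size=7, use_rel_pos=True, input_size=(28, 28))

tokens = torch.randn(1, 28, 28, 64)  # assumed (B, H, W, C) token map
out = block(tokens)
print(out.shape)  # expected: torch.Size([1, 28, 28, 64])
```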
- class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.PatchEmbed(kernel_size: Tuple[int, int] = (16, 16), stride: Tuple[int, int] = (16, 16), padding: Tuple[int, int] = (0, 0), in_chans: int = 3, embed_dim: int = 768)[source]#
Bases:
Module
Image to Patch Embedding.
- Parameters:
kernel_size (Tuple[int, int]) – Kernel size of the patch projection layer.
stride (Tuple[int, int]) – Stride of the patch projection layer.
padding (Tuple[int, int]) – Padding size of the patch projection layer.
in_chans (int) – Number of input image channels.
embed_dim (int) – Patch embedding dimension.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
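A short sketch of the patch projection, assuming it returns embeddings in (B, H', W', embed_dim) layout as in SAM-style encoders (the output layout is not stated in this entry):

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import PatchEmbed

patch_embed = PatchEmbed(kernel_size=(16, 16), stride=(16, 16), in_chans=3, embed_dim=96)

image = torch.randn(1, 3, 224, 224)  # (B, C, H, W) image batch
patches = patch_embed(image)
print(patches.shape)  # assumed: torch.Size([1, 14, 14, 96]) for 16x16 patches
```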
- class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.ViT(img_size: int = 1024, patch_size: int = 16, in_chans: int = 3, embed_dim: int = 768, depth: int = 12, num_heads: int = 12, mlp_ratio: float = 4.0, out_chans: int = 256, qkv_bias: bool = True, norm_layer: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, act_layer: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.activation.GELU'>, use_abs_pos: bool = True, use_rel_pos: bool = False, rel_pos_zero_init: bool = True, window_size: int = 0, global_attn_indexes: ~typing.Tuple[int, ...] = ())[source]#
Bases:
Module
Vision Transformer for the visual prompting task.
- Parameters:
img_size (int) – Input image size.
patch_size (int) – Patch size.
in_chans (int) – Number of input image channels.
embed_dim (int) – Patch embedding dimension.
depth (int) – Depth of ViT.
num_heads (int) – Number of attention heads in each ViT block.
mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim.
out_chans (int) – Number of output channels.
qkv_bias (bool) – If True, add a learnable bias to query, key, value.
norm_layer (nn.Module) – Normalization layer.
act_layer (nn.Module) – Activation layer.
use_abs_pos (bool) – If True, use absolute positional embeddings.
use_rel_pos (bool) – If True, add relative positional embeddings to the attention map.
rel_pos_zero_init (bool) – If True, zero initialize relative positional parameters.
window_size (int) – Window size for window attention blocks.
global_attn_indexes (list) – Indexes for blocks using global attention.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
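A hedged end-to-end sketch with a deliberately tiny configuration (the defaults correspond to a ViT-B sized encoder); the assumed output is a (B, out_chans, img_size / patch_size, img_size / patch_size) feature map, which this entry does not state explicitly:

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import ViT

# Tiny illustrative configuration to keep the example cheap to run.
encoder = ViT(img_size=128, patch_size=16, embed_dim=64, depth=2, num_heads=4, out_chans=32)

images = torch.randn(1, 3, 128, 128)  # (B, C, H, W)
features = encoder(images)
print(features.shape)  # assumed: torch.Size([1, 32, 8, 8]), i.e. (B, out_chans, H / 16, W / 16)
```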
- otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.add_decomposed_rel_pos(attn: Tensor, q: Tensor, rel_pos_h: Tensor, rel_pos_w: Tensor, q_size: Tuple[int, int], k_size: Tuple[int, int]) Tensor [source]#
Calculate decomposed Relative Positional Embeddings from mvitv2.
- Parameters:
attn (Tensor) – attention map.
q (Tensor) – query q in the attention layer with shape (B, q_h * q_w, C).
rel_pos_h (Tensor) – relative position embeddings (Lh, C) for height axis.
rel_pos_w (Tensor) – relative position embeddings (Lw, C) for width axis.
q_size (Tuple) – spatial sequence size of query q with (q_h, q_w).
k_size (Tuple) – spatial sequence size of key k with (k_h, k_w).
- Returns:
attention map with added relative positional embeddings.
- Return type:
attn (Tensor)
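A toy call with self-consistent shapes; the attention-map shape (B, q_h * q_w, k_h * k_w) and the (2 * size - 1, C) relative-position tables are assumptions inferred from the documented q, rel_pos_h and rel_pos_w shapes:

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import (
    add_decomposed_rel_pos,
)

B, (q_h, q_w), (k_h, k_w), C = 1, (4, 4), (4, 4), 8

attn = torch.zeros(B, q_h * q_w, k_h * k_w)        # assumed attention-map shape
q = torch.randn(B, q_h * q_w, C)                   # (B, q_h * q_w, C) as documented
rel_pos_h = torch.randn(2 * max(q_h, k_h) - 1, C)  # (Lh, C)
rel_pos_w = torch.randn(2 * max(q_w, k_w) - 1, C)  # (Lw, C)

attn = add_decomposed_rel_pos(attn, q, rel_pos_h, rel_pos_w, (q_h, q_w), (k_h, k_w))
print(attn.shape)  # expected: torch.Size([1, 16, 16])
```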
- otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.build_vit(backbone: str, image_size: int)[source]#
Build ViT backbone.
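No parameter table is rendered for this helper; a minimal sketch, assuming backbone accepts a SAM-style variant name such as "vit_b" (the accepted names are not listed here, so check the OTX source):

```python
from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import build_vit

# "vit_b" is an assumed variant name, not confirmed by this page.
image_encoder = build_vit(backbone="vit_b", image_size=1024)
```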
- otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.get_rel_pos(q_size: int, k_size: int, rel_pos: Tensor) Tensor [source]#
Get relative positional embeddings according to the relative positions of query and key sizes.
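A toy call, assuming rel_pos is a (2 * max(q_size, k_size) - 1, C) table and the result holds one C-dimensional embedding per (query, key) position pair:

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import get_rel_pos

rel_pos = torch.randn(2 * 4 - 1, 8)  # assumed (2 * max(q_size, k_size) - 1, C) table
embeddings = get_rel_pos(q_size=4, k_size=4, rel_pos=rel_pos)
print(embeddings.shape)  # expected: torch.Size([4, 4, 8])
```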
- otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.window_partition(x: Tensor, window_size: int) Tuple[Tensor, Tuple[int, int]] [source]#
Partition into non-overlapping windows with padding if needed.
- Parameters:
x (Tensor) – Input tokens with [B, H, W, C].
window_size (int) – Window size.
- Returns:
windows (Tensor) – windows after partition with [B * num_windows, window_size, window_size, C].
(Hp, Wp) (Tuple[int, int]) – padded height and width before partition.
- Return type:
Tuple[Tensor, Tuple[int, int]]
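A quick sketch of the padding behaviour, using the documented (B, H, W, C) input layout:

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import window_partition

x = torch.randn(1, 10, 10, 8)  # (B, H, W, C) tokens
windows, (Hp, Wp) = window_partition(x, window_size=4)

print(windows.shape)  # expected: torch.Size([9, 4, 4, 8]) -> 3 x 3 windows per image after padding
print(Hp, Wp)         # expected: 12 12 (10 is padded up to a multiple of window_size)
```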
- otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.window_unpartition(windows: Tensor, window_size: int, pad_hw: Tuple[int, int], hw: Tuple[int, int]) Tensor [source]#
Unpartition windows into original sequences and remove padding.
- Parameters:
windows (Tensor) – input tokens with [B * num_windows, window_size, window_size, C].
window_size (int) – window size.
pad_hw (Tuple) – padded height and width (Hp, Wp).
hw (Tuple) – original height and width (H, W) before padding.
- Returns:
unpartitioned sequences with [B, H, W, C].
- Return type:
x (Tensor)
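A round-trip sketch with window_partition, recovering the original (B, H, W, C) tensor:

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import (
    window_partition,
    window_unpartition,
)

x = torch.randn(2, 10, 10, 8)  # (B, H, W, C) tokens
windows, pad_hw = window_partition(x, window_size=4)
restored = window_unpartition(windows, window_size=4, pad_hw=pad_hw, hw=(10, 10))

print(restored.shape)               # expected: torch.Size([2, 10, 10, 8])
print(torch.allclose(restored, x))  # expected: True, since padding is stripped
```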