otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit#
Vision Transformers.
Functions

| add_decomposed_rel_pos | Calculate decomposed Relative Positional Embeddings from mvitv2. |
| build_vit | Build ViT backbone. |
| get_rel_pos | Get relative positional embeddings according to the relative positions of query and key sizes. |
| window_partition | Partition into non-overlapping windows with padding if needed. |
| window_unpartition | Unpartition windows into original sequences and remove padding. |

Classes

| Attention | Multi-head Attention block with relative position embeddings. |
| Block | Transformer block with support for window attention and residual propagation blocks. |
| PatchEmbed | Image to Patch Embedding. |
| ViT | Vision Transformer for the visual prompting task. |
- class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.Attention(dim: int, num_heads: int = 8, qkv_bias: bool = True, use_rel_pos: bool = False, rel_pos_zero_init: bool = True, input_size: Tuple[int, int] | None = None)[source]#
Bases:
Module
Multi-head Attention block with relative position embeddings.
- Parameters:
dim (int) – Number of input channels.
num_heads (int) – Number of attention heads.
qkv_bias (bool) – If True, add a learnable bias to query, key, value.
use_rel_pos (bool) – If True, add relative positional embeddings to the attention map.
rel_pos_zero_init (bool) – If True, zero initialize relative positional parameters.
input_size (tuple(int, int) or None) – Input resolution for calculating the relative positional parameter size.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
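A minimal usage sketch, assuming the forward pass takes token maps in (B, H, W, C) layout as in SAM-style image encoders (the layout is not stated in this entry), with illustrative sizes:

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import Attention

# Illustrative sizes; input_size is required here because use_rel_pos=True.
attn = Attention(dim=64, num_heads=8, use_rel_pos=True, input_size=(14, 14))

tokens = torch.randn(2, 14, 14, 64)  # assumed (B, H, W, C) token map
out = attn(tokens)
print(out.shape)  # expected: torch.Size([2, 14, 14, 64])
```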
- class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.Block(dim: int, num_heads: int, mlp_ratio: float = 4.0, qkv_bias: bool = True, norm_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.normalization.LayerNorm'>, act_layer: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.GELU'>, use_rel_pos: bool = False, rel_pos_zero_init: bool = True, window_size: int = 0, input_size: ~typing.Tuple[int, int] | None = None)[source]#
Bases:
Module
Transformer block with support for window attention and residual propagation blocks.
- Parameters:
dim (int) – Number of input channels.
num_heads (int) – Number of attention heads in each ViT block.
mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim.
qkv_bias (bool) – If True, add a learnable bias to query, key, value.
norm_layer (nn.Module) – Normalization layer.
act_layer (nn.Module) – Activation layer.
use_rel_pos (bool) – If True, add relative positional embeddings to the attention map.
rel_pos_zero_init (bool) – If True, zero initialize relative positional parameters.
window_size (int) – Window size for window attention blocks. If it equals 0, then use global attention.
input_size (tuple(int, int) or None) – Input resolution for calculating the relative positional parameter size.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
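A hedged sketch of a windowed block, again assuming (B, H, W, C) token maps; window_size=7 and input_size=(28, 28) are illustrative values only:

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import Block

# window_size > 0 uses local window attention; window_size=0 would fall back to global attention.
block = Block(dim=64, num_heads=8, window_size=7, use_rel_pos=True, input_size=(28, 28))

tokens = torch.randn(1, 28, 28, 64)  # assumed (B, H, W, C) token map
out = block(tokens)
print(out.shape)  # expected: torch.Size([1, 28, 28, 64])
```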
- class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.PatchEmbed(kernel_size: Tuple[int, int] = (16, 16), stride: Tuple[int, int] = (16, 16), padding: Tuple[int, int] = (0, 0), in_chans: int = 3, embed_dim: int = 768)[source]#
Bases:
Module
Image to Patch Embedding.
- Parameters:
kernel_size (Tuple[int, int]) – Kernel size of the patch projection layer.
stride (Tuple[int, int]) – Stride of the patch projection layer.
padding (Tuple[int, int]) – Padding size of the patch projection layer.
in_chans (int) – Number of input image channels.
embed_dim (int) – Patch embedding dimension.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
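A short sketch of the patch projection, assuming it returns embeddings in (B, H', W', embed_dim) layout as in SAM-style encoders (the output layout is not stated in this entry):

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import PatchEmbed

patch_embed = PatchEmbed(kernel_size=(16, 16), stride=(16, 16), in_chans=3, embed_dim=96)

image = torch.randn(1, 3, 224, 224)  # (B, C, H, W) image batch
patches = patch_embed(image)
print(patches.shape)  # assumed: torch.Size([1, 14, 14, 96]) for 16x16 patches
```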
- class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.ViT(img_size: int = 1024, patch_size: int = 16, in_chans: int = 3, embed_dim: int = 768, depth: int = 12, num_heads: int = 12, mlp_ratio: float = 4.0, out_chans: int = 256, qkv_bias: bool = True, norm_layer: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, act_layer: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.activation.GELU'>, use_abs_pos: bool = True, use_rel_pos: bool = False, rel_pos_zero_init: bool = True, window_size: int = 0, global_attn_indexes: ~typing.Tuple[int, ...] = ())[source]#
Bases:
Module
Vision Transformer for the visual prompting task.
- Parameters:
img_size (int) – Input image size.
patch_size (int) – Patch size.
in_chans (int) – Number of input image channels.
embed_dim (int) – Patch embedding dimension.
depth (int) – Depth of ViT.
num_heads (int) – Number of attention heads in each ViT block.
mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim.
out_chans (int) – Number of output channels.
qkv_bias (bool) – If True, add a learnable bias to query, key, value.
norm_layer (nn.Module) – Normalization layer.
act_layer (nn.Module) – Activation layer.
use_abs_pos (bool) – If True, use absolute positional embeddings.
use_rel_pos (bool) – If True, add relative positional embeddings to the attention map.
rel_pos_zero_init (bool) – If True, zero initialize relative positional parameters.
window_size (int) – Window size for window attention blocks.
global_attn_indexes (list) – Indexes for blocks using global attention.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
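A hedged end-to-end sketch with a deliberately tiny configuration (the defaults correspond to a ViT-B sized encoder); the assumed output is a (B, out_chans, img_size / patch_size, img_size / patch_size) feature map, which this entry does not state explicitly:

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import ViT

# Tiny illustrative configuration to keep the example cheap to run.
encoder = ViT(img_size=128, patch_size=16, embed_dim=64, depth=2, num_heads=4, out_chans=32)

images = torch.randn(1, 3, 128, 128)  # (B, C, H, W)
features = encoder(images)
print(features.shape)  # assumed: torch.Size([1, 32, 8, 8]), i.e. (B, out_chans, H / 16, W / 16)
```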
- otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.add_decomposed_rel_pos(attn: Tensor, q: Tensor, rel_pos_h: Tensor, rel_pos_w: Tensor, q_size: Tuple[int, int], k_size: Tuple[int, int]) Tensor [source]#
Calculate decomposed Relative Positional Embeddings from mvitv2.
- Parameters:
attn (Tensor) – attention map.
q (Tensor) – query q in the attention layer with shape (B, q_h * q_w, C).
rel_pos_h (Tensor) – relative position embeddings (Lh, C) for height axis.
rel_pos_w (Tensor) – relative position embeddings (Lw, C) for width axis.
q_size (Tuple) – spatial sequence size of query q with (q_h, q_w).
k_size (Tuple) – spatial sequence size of key k with (k_h, k_w).
- Returns:
attention map with added relative positional embeddings.
- Return type:
attn (Tensor)
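A toy call with self-consistent shapes; the attention-map shape (B, q_h * q_w, k_h * k_w) and the (2 * size - 1, C) relative-position tables are assumptions inferred from the documented q, rel_pos_h and rel_pos_w shapes:

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import (
    add_decomposed_rel_pos,
)

B, (q_h, q_w), (k_h, k_w), C = 1, (4, 4), (4, 4), 8

attn = torch.zeros(B, q_h * q_w, k_h * k_w)        # assumed attention-map shape
q = torch.randn(B, q_h * q_w, C)                   # (B, q_h * q_w, C) as documented
rel_pos_h = torch.randn(2 * max(q_h, k_h) - 1, C)  # (Lh, C)
rel_pos_w = torch.randn(2 * max(q_w, k_w) - 1, C)  # (Lw, C)

attn = add_decomposed_rel_pos(attn, q, rel_pos_h, rel_pos_w, (q_h, q_w), (k_h, k_w))
print(attn.shape)  # expected: torch.Size([1, 16, 16])
```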
- otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.build_vit(backbone: str, image_size: int)[source]#
Build ViT backbone.
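No parameter table is rendered for this helper; a minimal sketch, assuming backbone accepts a SAM-style variant name such as "vit_b" (the accepted names are not listed here, so check the OTX source):

```python
from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import build_vit

# "vit_b" is an assumed variant name, not confirmed by this page.
image_encoder = build_vit(backbone="vit_b", image_size=1024)
```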
- otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.get_rel_pos(q_size: int, k_size: int, rel_pos: Tensor) Tensor [source]#
Get relative positional embeddings according to the relative positions of query and key sizes.
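A toy call, assuming rel_pos is a (2 * max(q_size, k_size) - 1, C) table and the result holds one C-dimensional embedding per (query, key) position pair:

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import get_rel_pos

rel_pos = torch.randn(2 * 4 - 1, 8)  # assumed (2 * max(q_size, k_size) - 1, C) table
embeddings = get_rel_pos(q_size=4, k_size=4, rel_pos=rel_pos)
print(embeddings.shape)  # expected: torch.Size([4, 4, 8])
```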
- otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.window_partition(x: Tensor, window_size: int) Tuple[Tensor, Tuple[int, int]] [source]#
Partition into non-overlapping windows with padding if needed.
- Parameters:
x (Tensor) – Input tokens with [B, H, W, C].
window_size (int) – Window size.
- Returns:
windows (Tensor) – windows after partition with [B * num_windows, window_size, window_size, C].
(Hp, Wp) (Tuple[int, int]) – padded height and width before partition.
- Return type:
Tuple[Tensor, Tuple[int, int]]
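A quick sketch of the padding behaviour, using the documented (B, H, W, C) input layout:

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import window_partition

x = torch.randn(1, 10, 10, 8)  # (B, H, W, C) tokens
windows, (Hp, Wp) = window_partition(x, window_size=4)

print(windows.shape)  # expected: torch.Size([9, 4, 4, 8]) -> 3 x 3 windows per image after padding
print(Hp, Wp)         # expected: 12 12 (10 is padded up to a multiple of window_size)
```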
- otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit.window_unpartition(windows: Tensor, window_size: int, pad_hw: Tuple[int, int], hw: Tuple[int, int]) Tensor [source]#
Unpartition windows into original sequences and remove padding.
- Parameters:
windows (Tensor) – input tokens with [B * num_windows, window_size, window_size, C].
window_size (int) – window size.
pad_hw (Tuple) – padded height and width (Hp, Wp).
hw (Tuple) – original height and width (H, W) before padding.
- Returns:
unpartitioned sequences with [B, H, W, C].
- Return type:
x (Tensor)
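A round-trip sketch with window_partition, recovering the original (B, H, W, C) tensor:

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.backbones.vit import (
    window_partition,
    window_unpartition,
)

x = torch.randn(2, 10, 10, 8)  # (B, H, W, C) tokens
windows, pad_hw = window_partition(x, window_size=4)
restored = window_unpartition(windows, window_size=4, pad_hw=pad_hw, hw=(10, 10))

print(restored.shape)               # expected: torch.Size([2, 10, 10, 8])
print(torch.allclose(restored, x))  # expected: True, since padding is stripped
```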