otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder#
Mask decoder module for SAM.
Classes
| Attention | An attention layer. |
| MLP | Simple MLP with ReLU activations. |
| SAMMaskDecoder | Predicts masks given an image and prompt embeddings, using a transformer architecture. |
| TwoWayAttentionBlock | A transformer block with four layers. |
| TwoWayTransformer | A transformer decoder that attends to an input image using queries whose positional embedding is supplied. |
- class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder.Attention(embedding_dim: int, num_heads: int, downsample_rate: int = 1)[source]#
Bases:
Module
An attention layer.
It allows for downscaling the size of the embedding after projection to queries, keys, and values.
- Parameters:
embedding_dim (int) – Channel dimension of the embeddings.
num_heads (int) – Number of attention heads.
downsample_rate (int) – Factor by which the embedding size is reduced after projection to queries, keys, and values, defaults to 1.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(q: Tensor, k: Tensor, v: Tensor) Tensor [source]#
Apply the attention layer to the queries, keys, and values.
- Parameters:
q (Tensor) – Queries to attend to. Should be shape B x N_queries x C for any N_queries.
k (Tensor) – Keys to attend to. Should be shape B x N_keys x C for any N_keys.
v (Tensor) – Values to attend to. Should be shape B x N_values x C for any N_values.
- Returns:
The output of the attention layer. Shape B x N_queries x C.
- Return type:
Tensor
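A minimal usage sketch (a hypothetical configuration with illustrative shapes, not taken from the source):
>>> import torch
>>> from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder import Attention
>>> attn = Attention(embedding_dim=256, num_heads=8, downsample_rate=2)
>>> q = torch.randn(1, 5, 256)     # B x N_queries x C
>>> k = torch.randn(1, 4096, 256)  # B x N_keys x C
>>> v = torch.randn(1, 4096, 256)  # B x N_values x C (N_keys == N_values)
>>> out = attn(q, k, v)            # B x N_queries x C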
- class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder.MLP(input_dim: int, hidden_dim: int, output_dim: int, num_layers: int, sigmoid_output: bool = False)[source]#
Bases:
Module
Simple MLP with ReLU activations.
- Parameters:
input_dim (int) – Dimension of the input features.
hidden_dim (int) – Dimension of the hidden layers.
output_dim (int) – Dimension of the output features.
num_layers (int) – Total number of linear layers in the MLP.
sigmoid_output (bool) – Whether to apply a sigmoid to the output, defaults to False.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
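A minimal usage sketch (dimensions are illustrative, not taken from the source):
>>> import torch
>>> from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder import MLP
>>> mlp = MLP(input_dim=256, hidden_dim=256, output_dim=32, num_layers=3)
>>> x = torch.randn(1, 5, 256)     # any leading dimensions; last dimension == input_dim
>>> out = mlp(x)                   # last dimension mapped to output_dim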
- class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder.SAMMaskDecoder(*, transformer_dim: int, transformer_cfg: dict, num_multimask_outputs: int = 3, activation: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.GELU'>, iou_head_depth: int = 3, iou_head_hidden_dim: int = 256)[source]#
Bases:
Module
Predicts masks given an image and prompt embeddings, using a transformer architecture.
- Parameters:
transformer_dim (int) – Channel dimension of the transformer.
transformer_cfg (dict) – Configuration of the transformer.
num_multimask_outputs (int) – The number of masks to predict when disambiguating masks.
activation (nn.Module) – Type of activation to use when upscaling masks.
iou_head_depth (int) – Depth of the MLP used to predict mask quality.
iou_head_hidden_dim (int) – Hidden dimension of the MLP used to predict mask quality.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(image_embeddings: Tensor, image_pe: Tensor, sparse_prompt_embeddings: Tensor, dense_prompt_embeddings: Tensor, multimask_output: bool) Tuple[Tensor, Tensor] [source]#
Predict masks given image and prompt embeddings.
- Parameters:
image_embeddings (Tensor) – Embeddings from the image encoder.
image_pe (Tensor) – Positional encoding with the shape of image_embeddings.
sparse_prompt_embeddings (Tensor) – Embeddings of the points and boxes.
dense_prompt_embeddings (Tensor) – Embeddings of the mask inputs.
multimask_output (bool) – Whether to return multiple masks or a single mask.
- Returns:
masks (Tensor): Batched predicted masks.
iou_pred (Tensor): Batched predictions of mask quality.
- Return type:
Tuple[Tensor, Tensor]
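A sketch of one forward pass with illustrative shapes; the transformer_cfg keys are assumed to mirror the TwoWayTransformer constructor documented below:
>>> import torch
>>> from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder import SAMMaskDecoder
>>> decoder = SAMMaskDecoder(
...     transformer_dim=256,
...     transformer_cfg=dict(depth=2, embedding_dim=256, num_heads=8, mlp_dim=2048),  # assumed keys
... )
>>> image_embeddings = torch.randn(1, 256, 64, 64)         # from the image encoder
>>> image_pe = torch.randn(1, 256, 64, 64)                 # positional encoding, same shape
>>> sparse_prompt_embeddings = torch.randn(1, 2, 256)      # e.g. two point/box tokens
>>> dense_prompt_embeddings = torch.randn(1, 256, 64, 64)  # mask-input embeddings
>>> masks, iou_pred = decoder(
...     image_embeddings, image_pe, sparse_prompt_embeddings, dense_prompt_embeddings, multimask_output=True
... )                                                      # predicted masks and mask-quality scores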
- predict_masks(image_embeddings: Tensor, image_pe: Tensor, sparse_prompt_embeddings: Tensor, dense_prompt_embeddings: Tensor) Tuple[Tensor, Tensor] [source]#
Predicts masks. See ‘forward’ for more details.
- Parameters:
image_embeddings (Tensor) – Embeddings from the image encoder.
image_pe (Tensor) – Positional encoding with the shape of image_embeddings.
sparse_prompt_embeddings (Tensor) – Embeddings of the points and boxes.
dense_prompt_embeddings (Tensor) – Embeddings of the mask inputs.
- Returns:
masks (Tensor): Batched predicted masks.
iou_pred (Tensor): Batched predictions of mask quality.
- Return type:
Tuple[Tensor, Tensor]
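predict_masks can also be called directly (reusing the tensors from the sketch above); as in the reference SAM implementation, it is assumed to return predictions for all mask tokens, from which forward selects according to multimask_output:
>>> all_masks, all_iou_pred = decoder.predict_masks(
...     image_embeddings, image_pe, sparse_prompt_embeddings, dense_prompt_embeddings
... )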
- class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder.TwoWayAttentionBlock(embedding_dim: int, num_heads: int, mlp_dim: int = 2048, activation: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>, attention_downsample_rate: int = 2, skip_first_layer_pe: bool = False)[source]#
Bases:
Module
A transformer block with four layers:
(1) self-attention of sparse inputs,
(2) cross-attention of sparse inputs to dense inputs,
(3) an MLP block on sparse inputs, and
(4) cross-attention of dense inputs to sparse inputs.
- Parameters:
embedding_dim (int) – Channel dimension of the embeddings in the transformer block.
num_heads (int) – The number of heads in the attention layers of the transformer block.
mlp_dim (int) – Hidden dimension of the MLP block, defaults to 2048.
activation (nn.Module) – Activation of the MLP block, defaults to nn.ReLU.
skip_first_layer_pe (bool) – Skip the PE on the first layer of the transformer block.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(queries: Tensor, keys: Tensor, query_pe: Tensor, key_pe: Tensor) Tuple[Tensor, Tensor] [source]#
Apply the transformer block to the queries and keys.
- Parameters:
queries (Tensor) – Queries to attend to. Should be shape B x N_queries x C for any N_queries.
keys (Tensor) – Keys to attend to. Should be shape B x N_keys x C for any N_keys.
query_pe (Tensor) – Positional encoding to add to the queries. Must have the same shape as queries.
key_pe (Tensor) – Positional encoding to add to the keys. Must have the same shape as keys.
- Returns:
queries (Tensor): Processed queries.
keys (Tensor): Processed keys.
- Return type:
Tuple[Tensor, Tensor]
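A minimal usage sketch (shapes are illustrative; the dense keys are assumed to be flattened image tokens):
>>> import torch
>>> from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder import TwoWayAttentionBlock
>>> block = TwoWayAttentionBlock(embedding_dim=256, num_heads=8)
>>> queries = torch.randn(1, 5, 256)    # sparse tokens, B x N_queries x C
>>> keys = torch.randn(1, 4096, 256)    # dense tokens, e.g. a flattened 64 x 64 image embedding
>>> query_pe = torch.randn(1, 5, 256)   # same shape as queries
>>> key_pe = torch.randn(1, 4096, 256)  # same shape as keys
>>> queries, keys = block(queries, keys, query_pe, key_pe)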
- class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder.TwoWayTransformer(depth: int, embedding_dim: int, num_heads: int, mlp_dim: int, activation: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>, attention_downsample_rate: int = 2)[source]#
Bases:
Module
A transformer decoder that attends to an input image using queries whose positional embedding is supplied.
- Parameters:
depth (int) – Number of layers in the transformer decoder.
embedding_dim (int) – Channel dimension for the input embeddings and the positional embeddings.
num_heads (int) – The number of heads for multihead attention. Must divide embedding_dim evenly.
mlp_dim (int) – Channel dimension internal to the MLP block in the transformer layers.
activation (nn.Module) – Activation to use in the MLP block, defaults to nn.ReLU.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(image_embedding: Tensor, image_pe: Tensor, point_embedding: Tensor) Tuple[Tensor, Tensor] [source]#
Apply the transformer to the image and point embeddings.
- Parameters:
image_embedding (Tensor) – Image to attend to. Should be shape B x embedding_dim x h x w for any h and w.
image_pe (Tensor) – Positional encoding to add to the image. Must have the same shape as image_embedding.
point_embedding (Tensor) – Embedding to add to the query points. Must have shape B x N_points x embedding_dim for any N_points.
- Returns:
point_embedding (Tensor): Processed point_embedding with shape B x N_points x embedding_dim for any N_points.
image_embedding (Tensor): Processed image_embedding with shape B x embedding_dim x h x w for any h and w.
- Return type:
Tuple[Tensor, Tensor]
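A minimal usage sketch (a hypothetical small configuration; shapes follow the parameter descriptions above):
>>> import torch
>>> from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder import TwoWayTransformer
>>> transformer = TwoWayTransformer(depth=2, embedding_dim=256, num_heads=8, mlp_dim=2048)
>>> image_embedding = torch.randn(1, 256, 64, 64)  # B x embedding_dim x h x w
>>> image_pe = torch.randn(1, 256, 64, 64)         # same shape as image_embedding
>>> point_embedding = torch.randn(1, 5, 256)       # B x N_points x embedding_dim
>>> points_out, image_out = transformer(image_embedding, image_pe, point_embedding)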