otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder#

Mask decoder module for SAM.

Classes

  • Attention(embedding_dim, num_heads[, ...]) – An attention layer.

  • MLP(input_dim, hidden_dim, output_dim, ...) – Simple MLP with ReLU activations.

  • SAMMaskDecoder(*, transformer_dim, ...) – Predicts masks given an image and prompt embeddings, using a transformer architecture.

  • TwoWayAttentionBlock(embedding_dim, ...) – A transformer block with four layers.

  • TwoWayTransformer(depth, embedding_dim, ...) – A transformer decoder that attends to an input image using queries whose positional embedding is supplied.

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder.Attention(embedding_dim: int, num_heads: int, downsample_rate: int = 1)[source]#

Bases: Module

An attention layer.

It allows for downscaling the size of the embedding after projection to queries, keys, and values.

Parameters:
  • embedding_dim (int) – Channel dimension of the embeddings.

  • num_heads (int) – The number of heads in the attention layers.

  • downsample_rate (int) – The rate to downsample the embedding by after projection to queries, keys, and values.

forward(q: Tensor, k: Tensor, v: Tensor) Tensor[source]#

Apply the attention layer to the queries, keys, and values.

Parameters:
  • q (Tensor) – Queries to attend to. Should be shape B x N_queries x C for any N_queries.

  • k (Tensor) – Keys to attend to. Should be shape B x N_keys x C for any N_keys.

  • v (Tensor) – Values to attend to. Should be shape B x N_values x C for any N_values.

Returns:

The output of the attention layer. Shape B x N_queries x C.

Return type:

Tensor
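
Example (a minimal usage sketch; the import path is the module documented here, and all tensor sizes are illustrative rather than required by the API):

  import torch

  from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder import Attention

  # Illustrative shapes: B x N x C with C = embedding_dim.
  attn = Attention(embedding_dim=256, num_heads=8, downsample_rate=2)
  q = torch.randn(1, 5, 256)     # B x N_queries x C
  k = torch.randn(1, 4096, 256)  # B x N_keys x C
  v = torch.randn(1, 4096, 256)  # values paired with the keys
  out = attn(q, k, v)            # B x N_queries x C -> torch.Size([1, 5, 256])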

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder.MLP(input_dim: int, hidden_dim: int, output_dim: int, num_layers: int, sigmoid_output: bool = False)[source]#

Bases: Module

Simple MLP with ReLU activations.

Parameters:
  • input_dim (int) – Input dimension.

  • hidden_dim (int) – Hidden dimension.

  • output_dim (int) – Output dimension.

  • num_layers (int) – Number of layers.

  • sigmoid_output (bool) – Whether to apply sigmoid to the output.

forward(x: Tensor) Tensor[source]#

Forward pass.

Parameters:

x (Tensor) – Input tensor.

Returns:

Output tensor.

Return type:

Tensor
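
Example (a minimal sketch; the dimensions are hypothetical and num_layers counts the stacked linear layers):

  import torch

  from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder import MLP

  # Three linear layers with ReLU between them; a final sigmoid is off by default.
  mlp = MLP(input_dim=256, hidden_dim=256, output_dim=32, num_layers=3, sigmoid_output=False)
  x = torch.randn(1, 5, 256)  # any leading dimensions; the last must equal input_dim
  y = mlp(x)                  # last dimension becomes output_dim -> torch.Size([1, 5, 32])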

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder.SAMMaskDecoder(*, transformer_dim: int, transformer_cfg: dict, num_multimask_outputs: int = 3, activation: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.GELU'>, iou_head_depth: int = 3, iou_head_hidden_dim: int = 256)[source]#

Bases: Module

Predicts masks given an image and prompt embeddings, using a transformer architecture.

Parameters:
  • transformer_dim (int) – Channel dimension of the transformer.

  • transformer_cfg (dict) – Configuration of the transformer.

  • num_multimask_outputs (int) – The number of masks to predict when disambiguating masks.

  • activation (nn.Module) – Type of activation to use when upscaling masks.

  • iou_head_depth (int) – Depth of the MLP used to predict mask quality.

  • iou_head_hidden_dim (int) – Hidden dimension of the MLP used to predict mask quality.

forward(image_embeddings: Tensor, image_pe: Tensor, sparse_prompt_embeddings: Tensor, dense_prompt_embeddings: Tensor, multimask_output: bool) Tuple[Tensor, Tensor][source]#

Predict masks given image and prompt embeddings.

Parameters:
  • image_embeddings (Tensor) – Embeddings from the image encoder.

  • image_pe (Tensor) – Positional encoding with the shape of image_embeddings.

  • sparse_prompt_embeddings (Tensor) – Embeddings of the points and boxes.

  • dense_prompt_embeddings (Tensor) – Embeddings of the mask inputs.

  • multimask_output (bool) – Whether to return multiple masks or a single mask.

Returns:
  • masks (Tensor) – Batched predicted masks.

  • iou_pred (Tensor) – Batched predictions of mask quality.

Return type:

Tuple[Tensor, Tensor]
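
Example (a minimal sketch: the 256-channel, 64 x 64 embedding grid mirrors the usual SAM setup, and transformer_cfg is assumed to be forwarded as the keyword arguments of TwoWayTransformer; both are assumptions, not requirements stated here):

  import torch
  from torch import nn

  from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder import SAMMaskDecoder

  decoder = SAMMaskDecoder(
      transformer_dim=256,
      transformer_cfg=dict(depth=2, embedding_dim=256, num_heads=8, mlp_dim=2048),  # assumed TwoWayTransformer kwargs
      num_multimask_outputs=3,
      activation=nn.GELU,
      iou_head_depth=3,
      iou_head_hidden_dim=256,
  )

  image_embeddings = torch.randn(1, 256, 64, 64)         # from the image encoder
  image_pe = torch.randn(1, 256, 64, 64)                 # same shape as image_embeddings
  sparse_prompt_embeddings = torch.randn(1, 2, 256)      # e.g. two point/box tokens
  dense_prompt_embeddings = torch.randn(1, 256, 64, 64)  # mask-input embeddings

  masks, iou_pred = decoder(
      image_embeddings=image_embeddings,
      image_pe=image_pe,
      sparse_prompt_embeddings=sparse_prompt_embeddings,
      dense_prompt_embeddings=dense_prompt_embeddings,
      multimask_output=True,
  )
  # masks: batched predicted masks, one channel per retained mask token
  # iou_pred: batched mask-quality scores, one per returned mask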

predict_masks(image_embeddings: Tensor, image_pe: Tensor, sparse_prompt_embeddings: Tensor, dense_prompt_embeddings: Tensor) Tuple[Tensor, Tensor][source]#

Predicts masks. See ‘forward’ for more details.

Parameters:
  • image_embeddings (Tensor) – Embeddings from the image encoder.

  • image_pe (Tensor) – Positional encoding with the shape of image_embeddings.

  • sparse_prompt_embeddings (Tensor) – Embeddings of the points and boxes.

  • dense_prompt_embeddings (Tensor) – Embeddings of the mask inputs.

Returns:
  • masks (Tensor) – Batched predicted masks.

  • iou_pred (Tensor) – Batched predictions of mask quality.

Return type:

Tuple[Tensor, Tensor]
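
For orientation, continuing the sketch above: in the reference SAM decoder that this module adapts, forward obtains per-token masks from predict_masks and then keeps either the single-mask or the multi-mask channels. The slicing below mirrors that reference behaviour and is an assumption about this adapter's internals, not documented API:

  # Continuing the SAMMaskDecoder sketch above (assumed reference-SAM behaviour).
  all_masks, all_iou = decoder.predict_masks(
      image_embeddings=image_embeddings,
      image_pe=image_pe,
      sparse_prompt_embeddings=sparse_prompt_embeddings,
      dense_prompt_embeddings=dense_prompt_embeddings,
  )
  multimask_output = True
  mask_slice = slice(1, None) if multimask_output else slice(0, 1)
  masks, iou_pred = all_masks[:, mask_slice, :, :], all_iou[:, mask_slice]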

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder.TwoWayAttentionBlock(embedding_dim: int, num_heads: int, mlp_dim: int = 2048, activation: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>, attention_downsample_rate: int = 2, skip_first_layer_pe: bool = False)[source]#

Bases: Module

A transformer block with four layers.

  1. self-attention of sparse inputs,

  2. cross attention of sparse inputs to dense inputs,

  3. mlp block on sparse inputs, and

  4. cross attention of dense inputs to sparse inputs.

Parameters:
  • embedding_dim (int) – Channel dimension of the embeddings in the transformer block.

  • num_heads (int) – The number of heads in the attention layers of the transformer block.

  • mlp_dim (int) – Hidden dimension of the mlp block, defaults to 2048.

  • activation (nn.Module) – Activation of the mlp block, defaults to nn.ReLU.

  • skip_first_layer_pe (bool) – Skip the PE on the first layer of the transformer block.

forward(queries: Tensor, keys: Tensor, query_pe: Tensor, key_pe: Tensor) Tuple[Tensor, Tensor][source]#

Apply the transformer block to the queries and keys.

Parameters:
  • queries (Tensor) – Queries to attend to. Should be shape B x N_queries x C for any N_queries.

  • keys (Tensor) – Keys to attend to. Should be shape B x N_keys x C for any N_keys.

  • query_pe (Tensor) – Positional encoding to add to the queries. Must have the same shape as queries.

  • key_pe (Tensor) – Positional encoding to add to the keys. Must have the same shape as keys.

Returns:
  • queries (Tensor) – Processed queries.

  • keys (Tensor) – Processed keys.

Return type:

Tuple[Tensor, Tensor]
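
Example (a minimal sketch; sparse prompt tokens act as queries and the flattened image embedding as keys, with illustrative sizes):

  import torch
  from torch import nn

  from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder import TwoWayAttentionBlock

  block = TwoWayAttentionBlock(
      embedding_dim=256,
      num_heads=8,
      mlp_dim=2048,
      activation=nn.ReLU,
      attention_downsample_rate=2,
      skip_first_layer_pe=False,
  )

  queries = torch.randn(1, 7, 256)       # B x N_queries x C (prompt tokens)
  keys = torch.randn(1, 64 * 64, 256)    # B x N_keys x C (flattened image embedding)
  query_pe = torch.randn(1, 7, 256)      # same shape as queries
  key_pe = torch.randn(1, 64 * 64, 256)  # same shape as keys

  queries, keys = block(queries, keys, query_pe, key_pe)  # both retain their input shapes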

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder.TwoWayTransformer(depth: int, embedding_dim: int, num_heads: int, mlp_dim: int, activation: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>, attention_downsample_rate: int = 2)[source]#

Bases: Module

A transformer decoder that attends to an input image using queries whose positional embedding is supplied.

Parameters:
  • depth (int) – Number of layers in the transformer decoder.

  • embedding_dim (int) – Channel dimension for the input embeddings and the positional embeddings.

  • num_heads (int) – The number of heads for multihead attention. Must divide embedding_dim evenly.

  • mlp_dim (int) – Channel dimension internal to the MLP block in the transformer layers.

  • activation (nn.Module) – Activation to use in the MLP block, defaults to nn.ReLU.

forward(image_embedding: Tensor, image_pe: Tensor, point_embedding: Tensor) Tuple[Tensor, Tensor][source]#

Apply the transformer to the image and point embeddings.

Parameters:
  • image_embedding (Tensor) – Image to attend to. Should be shape B x embedding_dim x h x w for any h and w.

  • image_pe (Tensor) – Positional encoding to add to the image. Must have the same shape as image_embedding.

  • point_embedding (Tensor) – Embedding to add to the query points. Must have shape B x N_points x embedding_dim for any N_points.

Returns:
  • point_embedding (Tensor) – Processed point_embedding with shape B x N_points x embedding_dim for any N_points.

  • image_embedding (Tensor) – Processed image_embedding with shape B x embedding_dim x h x w for any h and w.

Return type:

Tuple[Tensor, Tensor]
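
Example (a minimal sketch with illustrative sizes: a 64 x 64 grid of 256-channel image embeddings and a handful of point tokens):

  import torch
  from torch import nn

  from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder import TwoWayTransformer

  transformer = TwoWayTransformer(
      depth=2,
      embedding_dim=256,
      num_heads=8,
      mlp_dim=2048,
      activation=nn.ReLU,
      attention_downsample_rate=2,
  )

  image_embedding = torch.randn(1, 256, 64, 64)  # B x embedding_dim x h x w
  image_pe = torch.randn(1, 256, 64, 64)         # same shape as image_embedding
  point_embedding = torch.randn(1, 7, 256)       # B x N_points x embedding_dim

  queries, keys = transformer(image_embedding, image_pe, point_embedding)
  # queries: the processed point embedding; keys: the processed image embedding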