otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder#

Mask decoder module for SAM.

Classes

  • Attention(embedding_dim, num_heads[, ...]) – An attention layer.

  • MLP(input_dim, hidden_dim, output_dim, ...) – Simple MLP with ReLU activations.

  • SAMMaskDecoder(*, transformer_dim, ...) – Predicts masks given an image and prompt embeddings, using a transformer architecture.

  • TwoWayAttentionBlock(embedding_dim, ...) – A transformer block with four layers.

  • TwoWayTransformer(depth, embedding_dim, ...) – A transformer decoder that attends to an input image using queries whose positional embedding is supplied.

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder.Attention(embedding_dim: int, num_heads: int, downsample_rate: int = 1)[source]#

Bases: Module

An attention layer.

It allows for downscaling the size of the embedding after projection to queries, keys, and values.

Parameters:
  • embedding_dim (int) – Channel dimension of the embeddings.

  • num_heads (int) – The number of heads in the attention layers.

  • downsample_rate (int) – The rate to downsample the embedding by after projection to queries, keys, and values.

forward(q: Tensor, k: Tensor, v: Tensor) Tensor[source]#

Apply the attention layer to the queries, keys, and values.

Parameters:
  • q (Tensor) – Queries to attend to. Should be shape B x N_queries x C for any N_queries.

  • k (Tensor) – Keys to attend to. Should be shape B x N_keys x C for any N_keys.

  • v (Tensor) – Values to attend to. Should be shape B x N_values x C for any N_values.

Returns:

The output of the attention layer. Shape B x N_queries x C.

Return type:

Tensor
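
Example (a minimal usage sketch; the import path is the module documented here, and all tensor sizes are illustrative rather than required by the API):

  import torch

  from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder import Attention

  # Illustrative shapes: B x N x C with C = embedding_dim.
  attn = Attention(embedding_dim=256, num_heads=8, downsample_rate=2)
  q = torch.randn(1, 5, 256)     # B x N_queries x C
  k = torch.randn(1, 4096, 256)  # B x N_keys x C
  v = torch.randn(1, 4096, 256)  # values paired with the keys
  out = attn(q, k, v)            # B x N_queries x C -> torch.Size([1, 5, 256])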

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder.MLP(input_dim: int, hidden_dim: int, output_dim: int, num_layers: int, sigmoid_output: bool = False)[source]#

Bases: Module

Simple MLP with ReLU activations.

Parameters:
  • input_dim (int) – Input dimension.

  • hidden_dim (int) – Hidden dimension.

  • output_dim (int) – Output dimension.

  • num_layers (int) – Number of layers.

  • sigmoid_output (bool) – Whether to apply sigmoid to the output.

forward(x: Tensor) Tensor[source]#

Forward pass.

Parameters:

x (Tensor) – Input tensor.

Returns:

Output tensor.

Return type:

Tensor
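
Example (a minimal sketch; the dimensions are hypothetical and num_layers counts the stacked linear layers):

  import torch

  from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder import MLP

  # Three linear layers with ReLU between them; a final sigmoid is off by default.
  mlp = MLP(input_dim=256, hidden_dim=256, output_dim=32, num_layers=3, sigmoid_output=False)
  x = torch.randn(1, 5, 256)  # any leading dimensions; the last must equal input_dim
  y = mlp(x)                  # last dimension becomes output_dim -> torch.Size([1, 5, 32])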

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder.SAMMaskDecoder(*, transformer_dim: int, transformer_cfg: dict, num_multimask_outputs: int = 3, activation: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.GELU'>, iou_head_depth: int = 3, iou_head_hidden_dim: int = 256)[source]#

Bases: Module

Predicts masks given an image and prompt embeddings, using a transformer architecture.

Parameters:
  • transformer_dim (int) – Channel dimension of the transformer.

  • transformer_cfg (dict) – Configuration of the transformer.

  • num_multimask_outputs (int) – The number of masks to predict when disambiguating masks.

  • activation (nn.Module) – Type of activation to use when upscaling masks.

  • iou_head_depth (int) – Depth of the MLP used to predict mask quality.

  • iou_head_hidden_dim (int) – Hidden dimension of the MLP used to predict mask quality.

forward(image_embeddings: Tensor, image_pe: Tensor, sparse_prompt_embeddings: Tensor, dense_prompt_embeddings: Tensor, multimask_output: bool) Tuple[Tensor, Tensor][source]#

Predict masks given image and prompt embeddings.

Parameters:
  • image_embeddings (Tensor) – Embeddings from the image encoder.

  • image_pe (Tensor) – Positional encoding with the shape of image_embeddings.

  • sparse_prompt_embeddings (Tensor) – Embeddings of the points and boxes.

  • dense_prompt_embeddings (Tensor) – Embeddings of the mask inputs.

  • multimask_output (bool) – Whether to return multiple masks or a single mask.

Returns:
  • masks (Tensor) – Batched predicted masks.

  • iou_pred (Tensor) – Batched predictions of mask quality.

Return type:

Tuple[Tensor, Tensor]
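
Example (a minimal sketch: the 256-channel, 64 x 64 embedding grid mirrors the usual SAM setup, and transformer_cfg is assumed to be forwarded as the keyword arguments of TwoWayTransformer; both are assumptions, not requirements stated here):

  import torch
  from torch import nn

  from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder import SAMMaskDecoder

  decoder = SAMMaskDecoder(
      transformer_dim=256,
      transformer_cfg=dict(depth=2, embedding_dim=256, num_heads=8, mlp_dim=2048),  # assumed TwoWayTransformer kwargs
      num_multimask_outputs=3,
      activation=nn.GELU,
      iou_head_depth=3,
      iou_head_hidden_dim=256,
  )

  image_embeddings = torch.randn(1, 256, 64, 64)         # from the image encoder
  image_pe = torch.randn(1, 256, 64, 64)                 # same shape as image_embeddings
  sparse_prompt_embeddings = torch.randn(1, 2, 256)      # e.g. two point/box tokens
  dense_prompt_embeddings = torch.randn(1, 256, 64, 64)  # mask-input embeddings

  masks, iou_pred = decoder(
      image_embeddings=image_embeddings,
      image_pe=image_pe,
      sparse_prompt_embeddings=sparse_prompt_embeddings,
      dense_prompt_embeddings=dense_prompt_embeddings,
      multimask_output=True,
  )
  # masks: batched predicted masks, one channel per retained mask token
  # iou_pred: batched mask-quality scores, one per returned mask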

predict_masks(image_embeddings: Tensor, image_pe: Tensor, sparse_prompt_embeddings: Tensor, dense_prompt_embeddings: Tensor) Tuple[Tensor, Tensor][source]#

Predicts masks. See ‘forward’ for more details.

Parameters:
  • image_embeddings (Tensor) – Embeddings from the image encoder.

  • image_pe (Tensor) – Positional encoding with the shape of image_embeddings.

  • sparse_prompt_embeddings (Tensor) – Embeddings of the points and boxes.

  • dense_prompt_embeddings (Tensor) – Embeddings of the mask inputs.

Returns:
  • masks (Tensor) – Batched predicted masks.

  • iou_pred (Tensor) – Batched predictions of mask quality.

Return type:

Tuple[Tensor, Tensor]
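
For orientation, continuing the sketch above: in the reference SAM decoder that this module adapts, forward obtains per-token masks from predict_masks and then keeps either the single-mask or the multi-mask channels. The slicing below mirrors that reference behaviour and is an assumption about this adapter's internals, not documented API:

  # Continuing the SAMMaskDecoder sketch above (assumed reference-SAM behaviour).
  all_masks, all_iou = decoder.predict_masks(
      image_embeddings=image_embeddings,
      image_pe=image_pe,
      sparse_prompt_embeddings=sparse_prompt_embeddings,
      dense_prompt_embeddings=dense_prompt_embeddings,
  )
  multimask_output = True
  mask_slice = slice(1, None) if multimask_output else slice(0, 1)
  masks, iou_pred = all_masks[:, mask_slice, :, :], all_iou[:, mask_slice]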

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder.TwoWayAttentionBlock(embedding_dim: int, num_heads: int, mlp_dim: int = 2048, activation: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>, attention_downsample_rate: int = 2, skip_first_layer_pe: bool = False)[source]#

Bases: Module

A transformer block with four layers.

  1. self-attention of sparse inputs,

  2. cross attention of sparse inputs to dense inputs,

  3. mlp block on sparse inputs, and

  4. cross attention of dense inputs to sparse inputs.

Parameters:
  • embedding_dim (int) – Channel dimension of the embeddings in the transformer block.

  • num_heads (int) – The number of heads in the attention layers of the transformer block.

  • mlp_dim (int) – Hidden dimension of the mlp block, defaults to 2048.

  • activation (nn.Module) – Activation of the mlp block, defaults to nn.ReLU.

  • skip_first_layer_pe (bool) – Skip the PE on the first layer of the transformer block.

forward(queries: Tensor, keys: Tensor, query_pe: Tensor, key_pe: Tensor) Tuple[Tensor, Tensor][source]#

Apply the transformer block to the queries and keys.

Parameters:
  • queries (Tensor) – Queries to attend to. Should be shape B x N_queries x C for any N_queries.

  • keys (Tensor) – Keys to attend to. Should be shape B x N_keys x C for any N_keys.

  • query_pe (Tensor) – Positional encoding to add to the queries. Must have the same shape as queries.

  • key_pe (Tensor) – Positional encoding to add to the keys. Must have the same shape as keys.

Returns:
  • queries (Tensor) – Processed queries.

  • keys (Tensor) – Processed keys.

Return type:

Tuple[Tensor, Tensor]
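
Example (a minimal sketch; sparse prompt tokens act as queries and the flattened image embedding as keys, with illustrative sizes):

  import torch
  from torch import nn

  from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder import TwoWayAttentionBlock

  block = TwoWayAttentionBlock(
      embedding_dim=256,
      num_heads=8,
      mlp_dim=2048,
      activation=nn.ReLU,
      attention_downsample_rate=2,
      skip_first_layer_pe=False,
  )

  queries = torch.randn(1, 7, 256)       # B x N_queries x C (prompt tokens)
  keys = torch.randn(1, 64 * 64, 256)    # B x N_keys x C (flattened image embedding)
  query_pe = torch.randn(1, 7, 256)      # same shape as queries
  key_pe = torch.randn(1, 64 * 64, 256)  # same shape as keys

  queries, keys = block(queries, keys, query_pe, key_pe)  # both retain their input shapes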

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder.TwoWayTransformer(depth: int, embedding_dim: int, num_heads: int, mlp_dim: int, activation: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>, attention_downsample_rate: int = 2)[source]#

Bases: Module

A transformer decoder that attends to an input image using queries whose positional embedding is supplied.

Parameters:
  • depth (int) – Number of layers in the transformer decoder.

  • embedding_dim (int) – Channel dimension for the input embeddings and the positional embeddings.

  • num_heads (int) – The number of heads for multihead attention. Must divide embedding_dim evenly.

  • mlp_dim (int) – Channel dimension internal to the MLP block in the transformer layers.

  • activation (nn.Module) – Activation to use in the MLP block, defaults to nn.ReLU.

forward(image_embedding: Tensor, image_pe: Tensor, point_embedding: Tensor) Tuple[Tensor, Tensor][source]#

Apply the transformer to the image and point embeddings.

Parameters:
  • image_embedding (Tensor) – Image to attend to. Should be shape B x embedding_dim x h x w for any h and w.

  • image_pe (Tensor) – Positional encoding to add to the image. Must have the same shape as image_embedding.

  • point_embedding (Tensor) – Embedding to add to the query points. Must have shape B x N_points x embedding_dim for any N_points.

Returns:
  • point_embedding (Tensor) – Processed point_embedding with shape B x N_points x embedding_dim for any N_points.

  • image_embedding (Tensor) – Processed image_embedding with shape B x embedding_dim x h x w for any h and w.

Return type:

Tuple[Tensor, Tensor]
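
Example (a minimal sketch with illustrative sizes: a 64 x 64 grid of 256-channel image embeddings and a handful of point tokens):

  import torch
  from torch import nn

  from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.decoders.sam_mask_decoder import TwoWayTransformer

  transformer = TwoWayTransformer(
      depth=2,
      embedding_dim=256,
      num_heads=8,
      mlp_dim=2048,
      activation=nn.ReLU,
      attention_downsample_rate=2,
  )

  image_embedding = torch.randn(1, 256, 64, 64)  # B x embedding_dim x h x w
  image_pe = torch.randn(1, 256, 64, 64)         # same shape as image_embedding
  point_embedding = torch.randn(1, 7, 256)       # B x N_points x embedding_dim

  queries, keys = transformer(image_embedding, image_pe, point_embedding)
  # queries: the processed point embedding; keys: the processed image embedding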