otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.encoders.sam_prompt_encoder#

Prompt encoder module for SAM.

Classes

PositionEmbeddingRandom([num_pos_feats, scale])

Positional encoding using random spatial frequencies.

SAMPromptEncoder(embed_dim, ...)

Encodes prompts for input to SAM's mask decoder.

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.encoders.sam_prompt_encoder.PositionEmbeddingRandom(num_pos_feats: int = 64, scale: float | None = None)[source]#

Bases: Module

Positional encoding using random spatial frequencies.

Parameters:
  • num_pos_feats (int) – The number of positional frequencies.

  • scale (float) – The scale of the positional encoding.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(size: Tuple[int, int]) Tensor[source]#

Generate positional encoding for a grid of the specified size.

Parameters:

size (tuple(int, int)) – The size of the grid to generate the encoding for.

Returns:

The positional encoding, as (num_pos_feats * 2, H, W).

Return type:

Tensor
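
A minimal usage sketch (the import path is taken from this module's name; the grid size and feature count are illustrative):

>>> import torch
>>> from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.encoders.sam_prompt_encoder import (
...     PositionEmbeddingRandom,
... )
>>> pe_layer = PositionEmbeddingRandom(num_pos_feats=64)
>>> grid_pe = pe_layer((64, 64))  # encode a 64x64 grid
>>> grid_pe.shape  # (num_pos_feats * 2, H, W)
torch.Size([128, 64, 64])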

forward_with_coords(coords_input: Tensor, image_size: Tuple[int, int]) Tensor[source]#

Positionally encode points that are not normalized to [0,1].

Parameters:
  • coords_input (Tensor) – The coordinates to encode, as (N, 1, 2).

  • image_size (tuple(int, int)) – The size of the image the coordinates are from.

Returns:

The positional encoding, as (N, 1, num_pos_feats * 2).

Return type:

Tensor
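
A brief sketch of encoding pixel-space point coordinates, reusing the pe_layer instance from the example above (the coordinate values and image size are illustrative):

>>> coords = torch.tensor([[[200.0, 150.0]]])  # (N, 1, 2) pixel coordinates
>>> point_pe = pe_layer.forward_with_coords(coords, image_size=(1024, 1024))
>>> point_pe.shape  # (N, 1, num_pos_feats * 2)
torch.Size([1, 1, 128])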

class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.encoders.sam_prompt_encoder.SAMPromptEncoder(embed_dim: int, image_embedding_size: ~typing.Tuple[int, int], input_image_size: ~typing.Tuple[int, int], mask_in_chans: int, activation: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.GELU'>)[source]#

Bases: Module

Encodes prompts for input to SAM’s mask decoder.

Parameters:
  • embed_dim (int) – The prompts’ embedding dimension.

  • image_embedding_size (tuple(int, int)) – The spatial size of the image embedding, as (H, W).

  • input_image_size (tuple(int, int)) – The padded size of the image as input to the image encoder, as (H, W).

  • mask_in_chans (int) – The number of hidden channels used for encoding input masks.

  • activation (nn.Module) – The activation to use when encoding input masks.

Initializes internal Module state, shared by both nn.Module and ScriptModule.
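
A minimal construction sketch; the values shown are the typical SAM settings (256-dim embeddings, a 64x64 image embedding, a 1024x1024 padded input), not requirements of this class:

>>> from torch import nn
>>> from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.encoders.sam_prompt_encoder import (
...     SAMPromptEncoder,
... )
>>> prompt_encoder = SAMPromptEncoder(
...     embed_dim=256,
...     image_embedding_size=(64, 64),
...     input_image_size=(1024, 1024),
...     mask_in_chans=16,
...     activation=nn.GELU,
... )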

forward(points: Tuple[Tensor, Tensor] | None, boxes: Tensor | None, masks: Tensor | None) Tuple[Tensor, Tensor][source]#

Embeds different types of prompts, returning both sparse and dense embeddings.

Parameters:
  • points (tuple(Tensor, Tensor) or None) – Point coordinates and labels to embed. Point coordinates are BxNx2 arrays of point prompts to the model, with each point given as (X, Y) in pixels. Labels are BxN arrays of labels for the point prompts: 1 indicates a foreground point and 0 indicates a background point.

  • boxes (Tensor or None) – A Bx4 array giving box prompts to the model, in XYXY format.

  • masks (Tensor or None) – A low-resolution mask input to the model, typically coming from a previous prediction iteration, with shape Bx1xHxW, where for SAM H=W=256. Masks returned by a previous iteration of the predict method do not need further transformation.

Returns:

sparse_embeddings (Tensor) – Sparse embeddings for the points and boxes, with shape Nx1x(embed_dim), where N is determined by the number of input points and boxes.

dense_embeddings (Tensor) – Dense embeddings for the masks, with shape Nx(embed_dim)x(embed_H)x(embed_W).

Return type:

Tuple[Tensor, Tensor]
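
A sketch of embedding a single foreground point prompt with the prompt_encoder instance constructed above (coordinates and batch size are illustrative):

>>> point_coords = torch.tensor([[[500.0, 375.0]]])  # BxNx2, (X, Y) in pixels
>>> point_labels = torch.tensor([[1]])               # BxN, 1 = foreground
>>> sparse_embeddings, dense_embeddings = prompt_encoder(
...     points=(point_coords, point_labels),
...     boxes=None,
...     masks=None,
... )

In the reference SAM prompt encoder, passing masks=None substitutes a learned no-mask embedding, broadcast over the image embedding grid, for the dense output.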

get_dense_pe() Tensor[source]#

Returns the positional encoding used to encode point prompts.

It is applied to a dense set of points with the shape of the image encoding.

Returns:

Positional encoding with shape 1x(embed_dim)x(embedding_h)x(embedding_w).

Return type:

Tensor
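
A short sketch, continuing from the prompt_encoder constructed above; the printed shape assumes embed_dim=256 and image_embedding_size=(64, 64):

>>> dense_pe = prompt_encoder.get_dense_pe()
>>> dense_pe.shape  # 1 x (embed_dim) x (embedding_h) x (embedding_w)
torch.Size([1, 256, 64, 64])

This tensor is typically passed to SAM's mask decoder alongside the sparse and dense prompt embeddings.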