otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.encoders.sam_prompt_encoder#
Prompt encoder module for SAM.
Classes

- PositionEmbeddingRandom: Positional encoding using random spatial frequencies.
- SAMPromptEncoder: Encodes prompts for input to SAM's mask decoder.
- class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.encoders.sam_prompt_encoder.PositionEmbeddingRandom(num_pos_feats: int = 64, scale: float | None = None)[source]#
Bases: Module
Positional encoding using random spatial frequencies.
- Parameters:
num_pos_feats (int) – The number of random spatial frequencies used for the encoding. Defaults to 64.
scale (float or None) – Scale applied to the random frequency matrix. Defaults to None.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(size: Tuple[int, int]) Tensor [source]#
Generate positional encoding for a grid of the specified size.
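A minimal usage sketch (not part of the official docs): instantiate PositionEmbeddingRandom and encode a 64x64 grid. The expected channel count of the output, 2 * num_pos_feats (sine and cosine of the random projections), follows the SAM reference implementation and is an assumption here.

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.encoders.sam_prompt_encoder import (
    PositionEmbeddingRandom,
)

pe_layer = PositionEmbeddingRandom(num_pos_feats=64)

# forward() takes the grid size (H, W) and returns a CxHxW encoding.
encoding = pe_layer((64, 64))
print(encoding.shape)  # expected: torch.Size([128, 64, 64]), i.e. 2 * num_pos_feats channels
```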
- class otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.encoders.sam_prompt_encoder.SAMPromptEncoder(embed_dim: int, image_embedding_size: ~typing.Tuple[int, int], input_image_size: ~typing.Tuple[int, int], mask_in_chans: int, activation: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.GELU'>)[source]#
Bases: Module
Encodes prompts for input to SAM’s mask decoder.
- Parameters:
embed_dim (int) – The prompts’ embedding dimension.
image_embedding_size (tuple(int, int)) – The spatial size of the image embedding, as (H, W).
input_image_size (tuple(int, int)) – The padded size of the image as input to the image encoder, as (H, W).
mask_in_chans (int) – The number of hidden channels used for encoding input masks.
activation (nn.Module) – The activation to use when encoding input masks.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(points: Tuple[Tensor, Tensor] | None, boxes: Tensor | None, masks: Tensor | None) Tuple[Tensor, Tensor] [source]#
Embeds different types of prompts, returning both sparse and dense embeddings.
- Parameters:
points (tuple(Tensor, Tensor) or None) – Point coordinates and labels to embed. Point coordinates are BxNx2 arrays of point prompts to the model, with each point given as (X, Y) in pixels. Labels are BxN arrays of labels for the point prompts, where 1 indicates a foreground point and 0 indicates a background point.
boxes (Tensor or None) – A Bx4 array giving a box prompt to the model, in XYXY format.
masks (Tensor or None) – A low-resolution mask input to the model, typically coming from a previous prediction iteration, with shape Bx1xHxW (for SAM, H=W=256). Masks returned by a previous iteration of the predict method do not need further transformation.
- Returns:
sparse_embeddings (Tensor) – Sparse embeddings for the points and boxes, with shape Nx1x(embed_dim), where N is determined by the number of input points and boxes.
dense_embeddings (Tensor) – Dense embeddings for the masks, with shape Nx(embed_dim)x(embed_H)x(embed_W).
- Return type:
Tuple[Tensor, Tensor]
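As an end-to-end sketch, the snippet below constructs a SAMPromptEncoder and encodes a single foreground point. The sizes (embed_dim=256, a 64x64 image embedding, a 1024x1024 padded input, mask_in_chans=16) match the common SAM ViT defaults but are assumptions here, not values mandated by this class.

```python
import torch

from otx.algorithms.visual_prompting.adapters.pytorch_lightning.models.encoders.sam_prompt_encoder import (
    SAMPromptEncoder,
)

prompt_encoder = SAMPromptEncoder(
    embed_dim=256,
    image_embedding_size=(64, 64),
    input_image_size=(1024, 1024),
    mask_in_chans=16,
)

# One foreground point (label 1) at pixel (512, 512), for a batch of 1.
coords = torch.tensor([[[512.0, 512.0]]])  # B x N x 2, (X, Y) in pixels
labels = torch.tensor([[1]])               # B x N, 1 = foreground

sparse, dense = prompt_encoder(points=(coords, labels), boxes=None, masks=None)
print(sparse.shape, dense.shape)
```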