otx.algorithms.detection.adapters.mmdet.models.layers#

Initial file for mmdetection layers for models.

Classes

CustomDINOTransformer([as_two_stage, ...])

Custom DINO transformer.

DINOTransformerDecoder(*args[, ...])

Transformer decoder of DINO.

CdnQueryGenerator(num_classes, embed_dims, ...)

Implement query generator of the Contrastive denoising (CDN).

EfficientTransformerEncoder([...])

TransformerEncoder of Lite-DETR.

EfficientTransformerLayer([small_expand, ...])

Efficient TransformerLayer for Lite-DETR.

SmallExpandFFN([embed_dims, ...])

Implements feed-forward networks (FFNs) with small expand.

class otx.algorithms.detection.adapters.mmdet.models.layers.CdnQueryGenerator(num_classes: int, embed_dims: int, num_matching_queries: int, label_noise_scale: float = 0.5, box_noise_scale: float = 1.0, group_cfg: Config | None = None)[source]#

Bases: BaseModule

Implement query generator of the Contrastive denoising (CDN).

Proposed in "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection" (https://arxiv.org/abs/2203.03605).

Code is modified from the official github repo.

Original implementation: mmdet.models.layers.transformer.dino_layers.CdnQueryGenerator. What’s changed: None.

Parameters:
  • num_classes (int) – Number of object classes.

  • embed_dims (int) – The embedding dimensions of the generated queries.

  • num_matching_queries (int) – The number of queries in the matching part. Used for generating dn_mask.

  • label_noise_scale (float) – The scale of label noise, defaults to 0.5.

  • box_noise_scale (float) – The scale of box noise, defaults to 1.0.

  • group_cfg (ConfigDict or dict, optional) – The config of the denoising queries grouping, includes dynamic, num_dn_queries, and num_groups. Two grouping strategies, ‘static dn groups’ and ‘dynamic dn groups’, are supported. When dynamic is False, the num_groups should be set, and the number of denoising query groups will always be num_groups. When dynamic is True, the num_dn_queries should be set, and the group number will be dynamic to ensure that the denoising queries number will not exceed num_dn_queries to prevent large fluctuations of memory. Defaults to None.
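As an illustration of the two grouping strategies, the fragment below sketches what the group_cfg dictionaries might look like. The keys dynamic, num_groups, and num_dn_queries are the ones named above; the numeric values are assumptions chosen for the example, not recommended settings.

```python
# Static grouping: the number of denoising groups is fixed.
# (The value 5 is an illustrative assumption.)
static_group_cfg = dict(dynamic=False, num_groups=5)

# Dynamic grouping: the number of groups is derived per batch from the
# num_dn_queries budget. (The value 100 is an illustrative assumption.)
dynamic_group_cfg = dict(dynamic=True, num_dn_queries=100)
```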

Initialize BaseModule, inherited from torch.nn.Module

__call__(batch_info: List[Dict[str, Any]]) → tuple[source]#

Generate contrastive denoising (cdn) queries with ground truth.

Descriptions of the Number Values in code and comments:
  • num_target_total: the total target number of the input batch samples.

  • max_num_target: the max target number of the input batch samples.

  • num_noisy_targets: the total targets number after adding noise, i.e., num_target_total * num_groups * 2.

  • num_denoising_queries: the length of the output batched queries, i.e., max_num_target * num_groups * 2.
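The four quantities above can be worked through on a small batch. This is a minimal sketch in plain Python; the helper name dn_sizes and the per-sample list num_target_list are hypothetical and only serve the arithmetic.

```python
def dn_sizes(num_target_list, num_groups):
    """Return (num_target_total, max_num_target,
    num_noisy_targets, num_denoising_queries) for one batch.

    num_target_list is a hypothetical list with the target count of each sample.
    """
    num_target_total = sum(num_target_list)
    max_num_target = max(num_target_list)
    # Formula from the list above: num_target_total * num_groups * 2.
    num_noisy_targets = num_target_total * num_groups * 2
    # Formula from the list above: max_num_target * num_groups * 2.
    num_denoising_queries = max_num_target * num_groups * 2
    return num_target_total, max_num_target, num_noisy_targets, num_denoising_queries


# Two samples with 1 and 2 targets, 2 denoising groups.
sizes = dn_sizes([1, 2], num_groups=2)
print(sizes)  # (3, 2, 12, 8)
```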

NOTE The format of input bboxes in batch_info is unnormalized (x, y, x, y), and the output bbox queries are embedded by normalized (cx, cy, w, h) format bboxes going through inverse_sigmoid.

Parameters:

batch_info (list[dict[str, union[tuple, tensor]]]) – List of the batch information such as image size, and gt information.

Returns:

The outputs of the dn query generator.

  • dn_label_query (Tensor): The output content queries for denoising part, has shape (bs, num_denoising_queries, dim), where num_denoising_queries = max_num_target * num_groups * 2.

  • dn_bbox_query (Tensor): The output reference bboxes as positions of queries for denoising part, which are embedded by normalized (cx, cy, w, h) format bboxes going through inverse_sigmoid, has shape (bs, num_denoising_queries, 4) with the last dimension arranged as (cx, cy, w, h).

  • attn_mask (Tensor): The attention mask to prevent information leakage from different denoising groups and matching parts, will be used as self_attn_mask of the decoder, has shape (num_queries_total, num_queries_total), where num_queries_total is the sum of num_denoising_queries and num_matching_queries.

  • dn_meta (Dict[str, int]): The dictionary saves information about group collation, including ‘num_denoising_queries’ and ‘num_denoising_groups’. It is used to split the outputs of the denoising and matching parts and for loss calculation.

Return type:

tuple
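The coordinate handling described in the NOTE (unnormalized (x, y, x, y) input, normalized (cx, cy, w, h) passed through inverse_sigmoid on output) can be sketched for a single box as follows. This is a plain-Python illustration that leaves out the noise step; the function xyxy_to_dn_query, the eps value, and the clamping are assumptions of this sketch rather than the library's exact code.

```python
import math


def inverse_sigmoid(x, eps=1e-3):
    """Logit of x, clamped away from 0 and 1 (eps is an assumed value)."""
    x = min(max(x, 0.0), 1.0)
    return math.log(max(x, eps) / max(1.0 - x, eps))


def xyxy_to_dn_query(box, img_w, img_h):
    """Map an unnormalized (x1, y1, x2, y2) box to inverse-sigmoid (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    # Normalize the corners by the image size.
    x1, x2 = x1 / img_w, x2 / img_w
    y1, y2 = y1 / img_h, y2 / img_h
    # Convert corners to center and size.
    cxcywh = ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)
    return tuple(inverse_sigmoid(v) for v in cxcywh)


query = xyxy_to_dn_query((100, 50, 300, 250), img_w=400, img_h=400)
```

In this example the normalized box is (cx, cy, w, h) = (0.5, 0.375, 0.5, 0.5), so every entry equal to 0.5 maps to 0 under the inverse sigmoid.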

collate_dn_queries(input_label_query: Tensor, input_bbox_query: Tensor, batch_idx: Tensor, batch_size: int, num_groups: int) → Tuple[Tensor, Tensor][source]#

Collate generated queries to obtain batched dn queries.

The strategy for query collation is as follows:

        input_queries (num_target_total, query_dim)
P_A1 P_B1 P_B2 N_A1 N_B1 N_B2 P'A1 P'B1 P'B2 N'A1 N'B1 N'B2
  |________ group1 ________|    |________ group2 ________|
                             |
                             V
          P_A1 Pad0 N_A1 Pad0 P'A1 Pad0 N'A1 Pad0
          P_B1 P_B2 N_B1 N_B2 P'B1 P'B2 N'B1 N'B2
           |____ group1 ____| |____ group2 ____|
 batched_queries (batch_size, max_num_target, query_dim)

where query_dim is 4 for bbox and self.embed_dims for label.
Notation: _-group 1; '-group 2;
          A-Sample1(has 1 target); B-sample2(has 2 targets)
Parameters:
  • input_label_query (Tensor) – The generated label queries of all targets, has shape (num_target_total, embed_dims) where num_target_total = sum(num_target_list).

  • input_bbox_query (Tensor) – The generated bbox queries of all targets, has shape (num_target_total, 4) with the last dimension arranged as (cx, cy, w, h).

  • batch_idx (Tensor) – The batch index of the corresponding sample for each target, has shape (num_target_total).

  • batch_size (int) – The size of the input batch.

  • num_groups (int) – The number of denoising query groups.

Returns:

Output batched label and bbox queries.

  • batched_label_query (Tensor): The output batched label queries, has shape (batch_size, max_num_target, embed_dims).

  • batched_bbox_query (Tensor): The output batched bbox queries, has shape (batch_size, max_num_target, 4) with the last dimension arranged as (cx, cy, w, h).

Return type:

tuple[Tensor]
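The collation diagram can be reproduced with a short plain-Python sketch. It works on lists of labels such as "P_A1" instead of tensors, and the helper collate_sketch and its pad argument are hypothetical; the only thing it is meant to show is how each flat query is mapped to a (sample, slot) position.

```python
def collate_sketch(input_queries, batch_idx, batch_size, num_groups, pad="Pad0"):
    """Scatter flat queries into one padded row per sample.

    input_queries: flat list ordered as in the diagram, i.e. for every group
        the positives of all samples, then the negatives of all samples.
    batch_idx: sample index of each target (one entry per target).
    """
    num_target_list = [sum(1 for b in batch_idx if b == i) for i in range(batch_size)]
    max_num_target = max(num_target_list)
    num_slots = max_num_target * 2 * num_groups
    batched = [[pad] * num_slots for _ in range(batch_size)]

    # Position of each target inside its own sample: 0, 1, ...
    seen = [0] * batch_size
    within_sample = []
    for b in batch_idx:
        within_sample.append(seen[b])
        seen[b] += 1

    num_targets = len(batch_idx)
    for flat_pos, query in enumerate(input_queries):
        # block counts the positive/negative chunks: P, N, P', N', ...
        block, target = divmod(flat_pos, num_targets)
        row = batch_idx[target]
        col = within_sample[target] + max_num_target * block
        batched[row][col] = query
    return batched


flat = ["P_A1", "P_B1", "P_B2", "N_A1", "N_B1", "N_B2",
        "P'A1", "P'B1", "P'B2", "N'A1", "N'B1", "N'B2"]
rows = collate_sketch(flat, batch_idx=[0, 1, 1], batch_size=2, num_groups=2)
for row in rows:
    print(" ".join(row))
```

With sample A holding one target and sample B holding two, the two printed rows match the two rows of the diagram, including the Pad0 slots for sample A.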

generate_dn_bbox_query(gt_bboxes: Tensor, num_groups: int) → Tensor[source]#

Generate noisy bboxes and their query embeddings.

The strategy for generating noisy bboxes is as follows:

   +--------------------+
   |      negative      |
   |    +----------+    |
   |    | positive |    |
   |    |    +-----|----+------------+
   |    |    |     |    |            |
   |    +----+-----+    |            |
   |         |          |            |
   +---------+----------+            |
             |                       |
             |        gt bbox        |
             |                       |
             |             +---------+----------+
             |             |         |          |
             |             |    +----+-----+    |
             |             |    |    |     |    |
             +-------------|--- +----+     |    |
                           |    | positive |    |
                           |    +----------+    |
                           |      negative      |
                           +--------------------+

The random noise is added to the top-left and bottom-right corner
positions, hence bboxes in normalized (x, y, x, y) format are
required. The noisy bboxes of positive queries have both points
within the inner square, while those of negative queries
have both points between the inner and outer squares.

Besides, the side of the outer square is twice as long as that of the inner square, i.e., self.box_noise_scale * w_or_h / 2.

NOTE The noise is added to all the bboxes. Moreover, one case is still not considered: when one point is within the inner (positive) square and the other is between the inner and outer squares.

Parameters:
  • gt_bboxes (Tensor) – The concatenated gt bboxes of all samples in the batch, has shape (num_target_total, 4) with the last dimension arranged as normalized (x, y, x, y), as required by the noise strategy described above, where num_target_total = sum(num_target_list).

  • num_groups (int) – The number of denoising query groups.

Returns:

The output noisy bboxes, which are embedded by normalized (cx, cy, w, h) format bboxes going through inverse_sigmoid, has shape (num_noisy_targets, 4) with the last dimension arranged as (cx, cy, w, h), where num_noisy_targets = num_target_total * num_groups * 2.

Return type:

Tensor
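The positive/negative noise strategy can be illustrated for one box with the standard random module. This is a rough sketch under stated assumptions: the uniform magnitude, the random sign per coordinate, and the final clamp to [0, 1] are choices made for this example, and add_box_noise is a hypothetical helper, not the method documented here.

```python
import random


def add_box_noise(box_xyxy, box_noise_scale=1.0, negative=False, rng=None):
    """Perturb both corners of a normalized (x1, y1, x2, y2) box."""
    rng = rng if rng is not None else random.Random()
    x1, y1, x2, y2 = box_xyxy
    w, h = x2 - x1, y2 - y1
    noisy = []
    for coord, extent in zip((x1, y1, x2, y2), (w, h, w, h)):
        magnitude = rng.random()      # [0, 1): point stays inside the inner square
        if negative:
            magnitude += 1.0          # [1, 2): point lands between the two squares
        sign = rng.choice((-1.0, 1.0))
        shifted = coord + sign * magnitude * extent * box_noise_scale / 2
        noisy.append(min(max(shifted, 0.0), 1.0))  # assumed clamp to the image
    return tuple(noisy)


rng = random.Random(0)
gt = (0.4, 0.4, 0.6, 0.6)  # width and height are both 0.2
positive = add_box_noise(gt, negative=False, rng=rng)
negative = add_box_noise(gt, negative=True, rng=rng)
```

For this box and box_noise_scale = 1.0, each positive corner coordinate moves by less than 0.1, and each negative corner coordinate moves by at least 0.1 and less than 0.2.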

generate_dn_label_query(gt_labels: Tensor, num_groups: int) → Tensor[source]#

Generate noisy labels and their query embeddings.

The strategy for generating noisy labels is: Randomly choose labels of self.label_noise_scale * 0.5 proportion and override each of them with a random object category label.

NOTE Noise is not added to all labels. Besides, self.label_noise_scale * 0.5 is the ratio of the chosen positions, which is higher than the actual proportion of noisy labels, because a randomly drawn replacement label may coincide with the original, correct label. The gap becomes larger as the number of target categories decreases. Users should be aware of this and adjust the scale argument or the corresponding logic for their specific dataset.

Parameters:
  • gt_labels (Tensor) – The concatenated gt labels of all samples in the batch, has shape (num_target_total, ) where num_target_total = sum(num_target_list).

  • num_groups (int) – The number of denoising query groups.

Returns:

The query embeddings of noisy labels, has shape (num_noisy_targets, embed_dims), where num_noisy_targets = num_target_total * num_groups * 2.

Return type:

Tensor
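The gap between chosen positions and labels that actually change can be seen in a small plain-Python sketch. The helper add_label_noise is hypothetical and skips the embedding lookup; it only shows the selection with probability label_noise_scale * 0.5 and the random override.

```python
import random


def add_label_noise(gt_labels, num_classes, label_noise_scale=0.5, rng=None):
    """Return (noisy_labels, chosen_positions) for a flat list of labels."""
    rng = rng if rng is not None else random.Random()
    noisy = list(gt_labels)
    chosen = []
    for i in range(len(noisy)):
        # Each position is chosen with probability label_noise_scale * 0.5.
        if rng.random() < label_noise_scale * 0.5:
            # The replacement is drawn from all classes, so it may equal
            # the original label and leave the position unchanged.
            noisy[i] = rng.randrange(num_classes)
            chosen.append(i)
    return noisy, chosen


labels = [0, 1, 2, 3] * 50
noisy, chosen = add_label_noise(labels, num_classes=4, rng=random.Random(0))
changed = sum(a != b for a, b in zip(labels, noisy))
print(len(chosen), changed)  # chosen positions vs. labels that really changed
```

The number of changed labels can never exceed the number of chosen positions, and with few classes a noticeable share of the chosen positions keeps its original label.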

generate_dn_mask(max_num_target: int, num_groups: int, device: device | str) → Tensor[source]#

Generate attention mask to prevent information leakage from different denoising groups and matching parts.

               0 0 0 0 1 1 1 1 0 0 0 0 0
               0 0 0 0 1 1 1 1 0 0 0 0 0
               0 0 0 0 1 1 1 1 0 0 0 0 0
               0 0 0 0 1 1 1 1 0 0 0 0 0
               1 1 1 1 0 0 0 0 0 0 0 0 0
               1 1 1 1 0 0 0 0 0 0 0 0 0
               1 1 1 1 0 0 0 0 0 0 0 0 0
               1 1 1 1 0 0 0 0 0 0 0 0 0
               1 1 1 1 1 1 1 1 0 0 0 0 0
               1 1 1 1 1 1 1 1 0 0 0 0 0
               1 1 1 1 1 1 1 1 0 0 0 0 0
               1 1 1 1 1 1 1 1 0 0 0 0 0
               1 1 1 1 1 1 1 1 0 0 0 0 0
max_num_target |_|           |_________| num_matching_queries
               |_____________| num_denoising_queries

      1 -> True  (Masked), means 'can not see'.
      0 -> False (UnMasked), means 'can see'.
Parameters:
  • max_num_target (int) – The max target number of the input batch samples.

  • num_groups (int) – The number of denoising query groups.

  • device (device or str) – The device of generated mask.

Returns:

The attention mask to prevent information leakage from different denoising groups and matching parts, will be used as self_attn_mask of the decoder, has shape (num_queries_total, num_queries_total), where num_queries_total is the sum of num_denoising_queries and num_matching_queries.

Return type:

Tensor
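The mask layout in the diagram can be rebuilt with nested lists. This is a minimal sketch, not the tensor implementation: dn_mask_sketch is a hypothetical helper, and num_matching_queries is passed in explicitly here, whereas the class takes it as a constructor argument.

```python
def dn_mask_sketch(max_num_target, num_groups, num_matching_queries):
    """Build the attention mask as a list of lists (True means 'can not see')."""
    group_size = max_num_target * 2            # positives + negatives of one group
    num_denoising_queries = group_size * num_groups
    total = num_denoising_queries + num_matching_queries
    mask = [[False] * total for _ in range(total)]
    for row in range(total):
        for col in range(num_denoising_queries):
            if row >= num_denoising_queries:
                # Matching queries must not see any denoising query.
                mask[row][col] = True
            elif row // group_size != col // group_size:
                # Denoising queries must not see other denoising groups.
                mask[row][col] = True
    return mask


mask = dn_mask_sketch(max_num_target=2, num_groups=2, num_matching_queries=5)
for row in mask:
    print(" ".join("1" if masked else "0" for masked in row))
```

With max_num_target = 2, two groups, and five matching queries, the printed 13 x 13 grid matches the diagram above.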

get_num_groups(max_num_target: int | None = None) → int[source]#

Calculate denoising query groups number.

Two grouping strategies, ‘static dn groups’ and ‘dynamic dn groups’, are supported. When self.dynamic_dn_groups is False, the number of denoising query groups is always self.num_groups. When self.dynamic_dn_groups is True, the group number is computed per batch so that the number of denoising queries does not exceed self.num_dn_queries, which prevents large fluctuations in memory use.

NOTE The number of groups is shared by all samples in a batch. When the number of targets varies across samples, the denoising queries of the samples containing fewer targets are padded to the max length.

Parameters:

max_num_target (int, optional) – The max target number of the batch samples. It will only be used when self.dynamic_dn_groups is True. Defaults to None.

Returns:

The denoising group number of the current batch.

Return type:

int
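The static and dynamic strategies can be sketched as a small function. The exact rule used in dynamic mode is not spelled out above, so this sketch assumes integer division of num_dn_queries by max_num_target with a floor of one group; whether the budget counts positive and negative queries separately is also an assumption left open here. The function num_groups_sketch is hypothetical.

```python
def num_groups_sketch(dynamic, num_groups=None, num_dn_queries=None, max_num_target=None):
    """Return the number of denoising groups for the current batch."""
    if dynamic:
        if not max_num_target:
            # No targets in the batch: fall back to a single group (assumption).
            groups = 1
        else:
            # Assumed rule: as many groups as the budget allows.
            groups = num_dn_queries // max_num_target
    else:
        groups = num_groups
    # Always keep at least one group (assumption).
    return max(int(groups), 1)


print(num_groups_sketch(False, num_groups=5))                           # 5
print(num_groups_sketch(True, num_dn_queries=100, max_num_target=30))   # 3
print(num_groups_sketch(True, num_dn_queries=100, max_num_target=300))  # 1
```

Under this rule the group count shrinks as the largest sample in the batch gains targets, which is what keeps the number of denoising queries from growing with the batch content.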

class otx.algorithms.detection.adapters.mmdet.models.layers.CustomDINOTransformer(as_two_stage=False, num_feature_levels=4, two_stage_num_proposals=300, **kwargs)[source]#

Bases: DeformableDetrTransformer

Custom DINO transformer.

Original implementation: mmdet.models.utils.transformer.DeformableDETR in mmdet2.x. What’s changed: The forward function is modified.

Modified implementations come from mmdet.models.detectors.dino.DINO in mmdet3.x

Initialize BaseModule, inherited from torch.nn.Module

forward(batch_info: List[Dict[str, Tuple | Tensor]], mlvl_feats: List[Tensor], mlvl_masks: List[Tensor], query_embed: Tensor, mlvl_pos_embeds: List[Tensor], reg_branches: ModuleList | None = None, cls_branches: ModuleList | None = None, **kwargs)[source]#

Forward function for Transformer.

What’s changed:

In mmdet3.x, the forward pass of the transformer is divided into pre_transformer() -> forward_encoder() -> pre_decoder() -> forward_decoder(). In comparison, the mmdet2.x forward function is responsible for all of the steps above. The differences between Deformable DETR and DINO occur in pre_decoder() and forward_decoder(), so this function modifies those parts. The modified implementations come from pre_decoder() and forward_decoder() of mmdet.models.detectors.dino.DINO in mmdet3.x.

Parameters:
  • batch_info (list(dict(str, union(tuple, tensor)))) – Information about batch such as image shape, gt information.

  • mlvl_feats (list(Tensor)) – Input queries from different levels. Each element has shape [bs, embed_dims, h, w].

  • mlvl_masks (list(Tensor)) – The key_padding_mask from different levels used for encoder and decoder, each element has shape [bs, h, w].

  • query_embed (Tensor) – The query embedding for decoder, with shape [num_query, c].

  • mlvl_pos_embeds (list(Tensor)) – The positional encoding of feats from different levels, has the shape [bs, embed_dims, h, w].

  • reg_branches (nn.ModuleList, optional) – Regression heads for feature maps from each decoder layer. Only passed when with_box_refine is True. Defaults to None.

  • cls_branches (nn.ModuleList, optional) – Classification heads for feature maps from each decoder layer. Only passed when as_two_stage is True. Defaults to None.

  • kwargs – Additional argument for forward_transformer function.

Returns:

Results of the decoder, containing the following tensors.

  • inter_states: Outputs from decoder. If return_intermediate_dec is True, the output has shape (num_dec_layers, bs, num_query, embed_dims), else it has shape (1, bs, num_query, embed_dims).

  • inter_references_out: The internal value of reference points in decoder, has shape (num_dec_layers, bs, num_query, embed_dims).

  • enc_outputs_class: The classification score of proposals generated from encoder’s feature maps, has shape (batch, h*w, num_classes). Only returned when as_two_stage is True, otherwise None.

  • enc_outputs_coord_unact: The regression results generated from encoder’s feature maps, has shape (batch, h*w, 4). Only returned when as_two_stage is True, otherwise None.

  • dn_meta (Dict[str, int]): The dictionary saves information about group collation, including ‘num_denoising_queries’ and ‘num_denoising_groups’. It is used to split the outputs of the denoising and matching parts and for loss calculation.

Return type:

tuple[Tensor]
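How dn_meta might be used to separate the denoising and matching parts can be sketched with nested lists standing in for tensors. The helper split_outputs is hypothetical, and the sketch assumes that the denoising queries come first along the query dimension, as in the attention mask layout of CdnQueryGenerator.generate_dn_mask.

```python
def split_outputs(outputs, dn_meta):
    """Split per-sample query outputs into denoising and matching parts.

    outputs: list over the batch, each entry a list over queries.
    dn_meta: dict holding at least 'num_denoising_queries'.
    """
    n = dn_meta["num_denoising_queries"]
    dn_part = [sample[:n] for sample in outputs]        # assumed to come first
    matching_part = [sample[n:] for sample in outputs]  # the remaining queries
    return dn_part, matching_part


# One sample with 8 denoising queries followed by 5 matching queries.
outputs = [[f"dn{i}" for i in range(8)] + [f"m{i}" for i in range(5)]]
dn_meta = {"num_denoising_queries": 8, "num_denoising_groups": 2}
dn_part, matching_part = split_outputs(outputs, dn_meta)
```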

init_layers()[source]#

Initialize layers of the DINO.

Unlike Deformable DETR, DINO does not need pos_trans, pos_trans_norm.

class otx.algorithms.detection.adapters.mmdet.models.layers.DINOTransformerDecoder(*args, return_intermediate=False, **kwargs)[source]#

Bases: DeformableDetrTransformerDecoder

Transformer decoder of DINO.

Initialize BaseModule, inherited from torch.nn.Module

forward(query: Tensor, value: Tensor, key_padding_mask: Tensor, self_attn_mask: Tensor, reference_points: Tensor, spatial_shapes: Tensor, level_start_index: Tensor, valid_ratios: Tensor, reg_branches: ModuleList, **kwargs) → Tensor[source]#

Forward function of Transformer decoder.

Original implementation: forward function of DinoTransformerDecoder in mmdet3.x. What’s changed: Since the implementation of the base transformer layer differs between mmdet2.x and mmdet3.x, the input shape of the layer and some of its input parameters are modified.

Parameters:
  • query (Tensor) – The input query, has shape (num_queries, bs, dim).

  • value (Tensor) – The input values, has shape (num_value, bs, dim).

  • key_padding_mask (Tensor) – The key_padding_mask of self_attn input. ByteTensor, has shape (num_queries, bs).

  • self_attn_mask (Tensor) – The attention mask to prevent information leakage from different denoising groups and matching parts, has shape (num_queries_total, num_queries_total). It is None when self.training is False.

  • reference_points (Tensor) – The initial reference, has shape (bs, num_queries, 4) with the last dimension arranged as (cx, cy, w, h).

  • spatial_shapes (Tensor) – Spatial shapes of features in all levels, has shape (num_levels, 2), last dimension represents (h, w).

  • level_start_index (Tensor) – The start index of each level. A tensor has shape (num_levels, ) and can be represented as [0, h_0*w_0, h_0*w_0+h_1*w_1, …].

  • valid_ratios (Tensor) – The ratios of the valid width and the valid height relative to the width and the height of features in all levels, has shape (bs, num_levels, 2).

  • reg_branches (nn.ModuleList) – Used for refining the regression results.

  • kwargs – Additional argument for attention layers.

Returns:

Output queries of the Transformer decoder, has shape (num_queries, bs, dim).

Return type:

Tensor
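The relationship between spatial_shapes and level_start_index described in the parameter list, [0, h_0*w_0, h_0*w_0+h_1*w_1, …], can be checked with a few lines of plain Python. The helper name and the example feature-map sizes are assumptions made for illustration.

```python
from itertools import accumulate


def level_start_index_sketch(spatial_shapes):
    """Start offset of each level in the flattened multi-level feature sequence."""
    sizes = [h * w for h, w in spatial_shapes]
    # Running sum of the level sizes, shifted by one so the first level starts at 0.
    return [0] + list(accumulate(sizes))[:-1]


# Four hypothetical feature levels given as (h, w).
starts = level_start_index_sketch([(100, 150), (50, 75), (25, 38), (13, 19)])
print(starts)  # [0, 15000, 18750, 19700]
```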

class otx.algorithms.detection.adapters.mmdet.models.layers.EfficientTransformerEncoder(transformerlayers=None, num_layers=None, init_cfg=None, post_norm_cfg={'type': 'LN'}, enc_scale=3, num_expansion=3, **kwargs)[source]#

Bases: BaseModule

TransformerEncoder of Lite-DETR.

Parameters:

post_norm_cfg (dict) – Config of last normalization layer. Default: LN. Only used when self.pre_norm is True.

Initialize BaseModule, inherited from torch.nn.Module

forward(query, key, value, query_pos=None, key_pos=None, attn_masks=None, query_key_padding_mask=None, key_padding_mask=None, level_start_index=None, reference_points=None, **kwargs)[source]#

Forward function for the Transformer encoder.

Parameters:
  • query (Tensor) – Input query with shape (num_queries, bs, embed_dims).

  • key (Tensor) – The key tensor with shape (num_keys, bs, embed_dims).

  • value (Tensor) – The value tensor with shape (num_keys, bs, embed_dims).

  • query_pos (Tensor) – The positional encoding for query. Default: None.

  • key_pos (Tensor) – The positional encoding for key. Default: None.

  • attn_masks (List[Tensor], optional) – Each element is 2D Tensor which is used in calculation of corresponding attention in operation_order. Default: None.

  • query_key_padding_mask (Tensor) – ByteTensor for query, with shape [bs, num_queries]. Only used in self-attention. Default: None.

  • key_padding_mask (Tensor) – ByteTensor for key, with shape [bs, num_keys]. Default: None.

  • level_start_index (Tensor) – Start index for each level.

  • reference_points (Tensor) – BBox predictions’ reference.

  • kwargs – Additional arguments.

Returns:

results with shape [num_queries, bs, embed_dims].

Return type:

Tensor

class otx.algorithms.detection.adapters.mmdet.models.layers.EfficientTransformerLayer(small_expand=False, attn_cfgs=None, ffn_cfgs={'act_cfg': {'inplace': True, 'type': 'ReLU'}, 'embed_dims': 256, 'feedforward_channels': 1024, 'ffn_drop': 0.0, 'num_fcs': 2, 'type': 'FFN'}, operation_order=None, norm_cfg={'type': 'LN'}, init_cfg=None, batch_first=False, enc_scale=3, **kwargs)[source]#

Bases: BaseTransformerLayer

Efficient TransformerLayer for Lite-DETR.

It is the base transformer encoder layer for Lite-DETR (https://arxiv.org/pdf/2303.07335.pdf).

Parameters:
  • attn_cfgs (list[mmcv.ConfigDict] | mmcv.ConfigDict | None) – Configs for self_attention or cross_attention modules. The order of the configs in the list should be consistent with the corresponding attentions in operation_order. If it is a dict, all of the attention modules in operation_order will be built with this config. Default: None.

  • ffn_cfgs (list[mmcv.ConfigDict] | mmcv.ConfigDict | None) – Configs for FFN. The order of the configs in the list should be consistent with the corresponding ffn in operation_order. If it is a dict, all of the FFN modules in operation_order will be built with this config.

  • operation_order (tuple[str]) – The execution order of operations in the transformer, such as (‘self_attn’, ‘norm’, ‘ffn’, ‘norm’). Pre-norm is supported by specifying norm as the first element. Default: None.

  • norm_cfg (dict) – Config dict for normalization layer. Default: dict(type='LN').

  • init_cfg (mmcv.ConfigDict) – The config for initialization. Default: None.

  • batch_first (bool) – Whether Key, Query and Value have shape (batch, n, embed_dim) instead of (n, batch, embed_dim). Defaults to False.

  • enc_scale (int) – Scale of high level features. Default is 3.
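How operation_order relates to the other arguments can be sketched with a small helper. It is a hypothetical illustration, not part of the layer: it reports whether the order is pre-norm (first element is norm) and how many attention, FFN, and norm entries the configs would need to cover. The operation name 'cross_attn' is an assumption based on the mention of cross_attention modules above.

```python
def describe_operation_order(operation_order):
    """Summarize an operation_order tuple such as ('self_attn', 'norm', 'ffn', 'norm')."""
    allowed = {"self_attn", "cross_attn", "norm", "ffn"}  # 'cross_attn' is assumed
    unknown = set(operation_order) - allowed
    if unknown:
        raise ValueError(f"unsupported operations: {sorted(unknown)}")
    return {
        # Pre-norm is selected by putting 'norm' first.
        "pre_norm": operation_order[0] == "norm",
        # A list-valued attn_cfgs or attn_masks needs one entry per attention.
        "num_attn": sum(op in ("self_attn", "cross_attn") for op in operation_order),
        "num_ffns": operation_order.count("ffn"),
        "num_norms": operation_order.count("norm"),
    }


post_norm = describe_operation_order(("self_attn", "norm", "ffn", "norm"))
pre_norm = describe_operation_order(("norm", "self_attn", "norm", "ffn"))
```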

Initialize BaseModule, inherited from torch.nn.Module

forward(query, key=None, value=None, query_pos=None, key_pos=None, attn_masks=None, query_key_padding_mask=None, key_padding_mask=None, level_start_index=None, **kwargs)[source]#

Forward function for EfficientTransformerLayer.

**kwargs contains some specific arguments of attentions.

Parameters:
  • query (Tensor) – The input query with shape [num_queries, bs, embed_dims] if self.batch_first is False, else [bs, num_queries, embed_dims].

  • key (Tensor) – The key tensor with shape [num_keys, bs, embed_dims] if self.batch_first is False, else [bs, num_keys, embed_dims] .

  • value (Tensor) – The value tensor with same shape as key.

  • query_pos (Tensor) – The positional encoding for query. Default: None.

  • key_pos (Tensor) – The positional encoding for key. Default: None.

  • attn_masks (List[Tensor] | None) – 2D Tensor used in calculation of corresponding attention. The length of it should equal to the number of attention in operation_order. Default: None.

  • query_key_padding_mask (Tensor) – ByteTensor for query, with shape [bs, num_queries]. Only used in self_attn layer. Defaults to None.

  • key_padding_mask (Tensor) – ByteTensor for key, with shape [bs, num_keys]. Default: None.

  • level_start_index (Tensor) – Start index for each level.

  • kwargs – Additional arguments.

Returns:

forwarded results with shape [num_queries, bs, embed_dims].

Return type:

Tensor

class otx.algorithms.detection.adapters.mmdet.models.layers.SmallExpandFFN(embed_dims=256, feedforward_channels=1024, num_fcs=2, act_cfg={'inplace': True, 'type': 'ReLU'}, ffn_drop=0.0, dropout_layer=None, add_identity=True, init_cfg=None, **kwargs)[source]#

Bases: FFN

Implements feed-forward networks (FFNs) with small expand.

Parameters:
  • embed_dims (int) – The feature dimension. Same as MultiheadAttention. Defaults: 256.

  • feedforward_channels (int) – The hidden dimension of FFNs. Defaults: 1024.

  • num_fcs (int, optional) – The number of fully-connected layers in FFNs. Default: 2.

  • act_cfg (dict, optional) – The activation config for FFNs. Default: dict(type=’ReLU’)

  • ffn_drop (float, optional) – Probability of an element to be zeroed in FFN. Default 0.0.

  • add_identity (bool, optional) – Whether to add the identity connection. Default: True.

  • dropout_layer (ConfigDict) – The dropout_layer used when adding the shortcut.

  • init_cfg (mmcv.ConfigDict) – The config for initialization. Default: None.

Initialize BaseModule, inherited from torch.nn.Module

forward(x, level_start_index, enc_scale, identity=None)[source]#

Forward function for FFN.

forward_ffn(layers, norm, x, identity=None)[source]#

Forward Feed Forward Network given layers.