otx.algo.object_detection_3d.detectors.monodetr#

MonoDETR core PyTorch detector.

Classes

MonoDETR(backbone, depthaware_transformer, ...)

This is the MonoDETR module that performs monocular 3D object detection.

class otx.algo.object_detection_3d.detectors.monodetr.MonoDETR(backbone: ~torch.nn.modules.module.Module, depthaware_transformer: ~torch.nn.modules.module.Module, depth_predictor: ~torch.nn.modules.module.Module, num_classes: int, num_queries: int, num_feature_levels: int, criterion: ~torch.nn.modules.module.Module | None = None, aux_loss: bool = True, with_box_refine: bool = False, init_box: bool = False, group_num: int = 11, activation: ~typing.Callable[[...], ~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>)[source]#

Bases: Module

This is the MonoDETR module that performs monocular 3D object detection.

Initializes the model.

Parameters:
  • backbone (nn.Module) – torch module of the backbone to be used. See backbone.py

  • depthaware_transformer (nn.Module) – depth-aware transformer architecture. See depth_aware_transformer.py

  • depth_predictor (nn.Module) – depth predictor module

  • criterion (nn.Module | None) – loss criterion module

  • num_classes (int) – number of object classes

  • num_queries (int) – number of object queries, i.e. detection slots. This is the maximal number of objects DETR can detect in a single image. For KITTI, we recommend 50 queries.

  • num_feature_levels (int) – number of feature levels

  • aux_loss (bool) – True if auxiliary decoding losses (loss at each decoder layer) are to be used.

  • with_box_refine (bool) – True if iterative bounding box refinement is to be used

  • init_box (bool) – True if the bounding box embedding layers should be initialized to zero

  • group_num (int) – number of groups for depth-aware bounding box embedding

  • activation (Callable[..., nn.Module]) – activation function to be applied to the output of the transformer

forward(images: Tensor, calibs: Tensor, img_sizes: Tensor, targets: list[dict[str, Tensor]] | None = None, mode: str = 'predict') dict[str, Tensor][source]#

Forward method of the MonoDETR model.

Parameters:
  • images (Tensor) – images for each sample.

  • calibs (Tensor) – camera matrices for each sample.

  • img_sizes (Tensor) – image sizes for each sample.

  • targets (list[dict[str, Tensor]] | None) – ground truth boxes and labels for each sample. Defaults to None.

  • mode (str) – The mode of operation. Defaults to “predict”.

Returns:

A dictionary of tensors. If mode is “loss”, the tensors are the loss values. If mode is “predict”, the tensors are the logits.

Return type:

dict[str, Tensor]
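The mode-based dispatch described above (a single forward() that returns a loss dictionary when mode is “loss” and predictions when mode is “predict”) can be sketched with a minimal, torch-free stand-in. The class name, the returned keys, and the error handling below are illustrative assumptions for demonstration, not the real OTX implementation:

```python
class TinyModeDispatchDetector:
    """Toy stand-in illustrating MonoDETR's forward() mode dispatch.

    All names and return values here are illustrative assumptions; the real
    model runs the backbone, depth-aware transformer, and depth predictor,
    then either applies the criterion (mode="loss") or decodes predictions
    (mode="predict").
    """

    def forward(self, images, calibs, img_sizes, targets=None, mode="predict"):
        if mode == "loss":
            # Training path: ground-truth targets are required so the
            # criterion can compare them against the decoder outputs.
            if targets is None:
                raise ValueError("targets must be provided when mode='loss'")
            # Placeholder loss dictionary; the real model returns per-term
            # losses computed by its criterion module.
            return {"loss": 0.0}
        # Inference path: return prediction tensors (logits, boxes, depth).
        return {"scores": [], "boxes_3d": []}
```

Usage follows the signature documented above: pass images, calibration matrices, and image sizes, plus targets only when computing losses.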