otx.algo.object_detection_3d.detectors.monodetr#
MonoDETR core PyTorch detector.
Classes

MonoDETR: This is the MonoDETR module that performs monocular 3D object detection.
- class otx.algo.object_detection_3d.detectors.monodetr.MonoDETR(backbone: ~torch.nn.modules.module.Module, depthaware_transformer: ~torch.nn.modules.module.Module, depth_predictor: ~torch.nn.modules.module.Module, num_classes: int, num_queries: int, num_feature_levels: int, criterion: ~torch.nn.modules.module.Module | None = None, aux_loss: bool = True, with_box_refine: bool = False, init_box: bool = False, group_num: int = 11, activation: ~typing.Callable[[...], ~torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>)[source]#
Bases: Module
This is the MonoDETR module that performs monocular 3D object detection.
Initializes the model.
- Parameters:
backbone (nn.Module) – torch module of the backbone to be used. See backbone.py
depthaware_transformer (nn.Module) – depth-aware transformer architecture. See depth_aware_transformer.py
depth_predictor (nn.Module) – depth predictor module
criterion (nn.Module | None) – loss criterion module
num_classes (int) – number of object classes
num_queries (int) – number of object queries, i.e. detection slots. This is the maximal number of objects DETR can detect in a single image. For KITTI, we recommend 50 queries.
num_feature_levels (int) – number of feature levels
aux_loss (bool) – True if auxiliary decoding losses (loss at each decoder layer) are to be used.
with_box_refine (bool) – iterative bounding box refinement
init_box (bool) – True if the bounding box embedding layers should be initialized to zero
group_num (int) – number of groups for depth-aware bounding box embedding
activation (Callable[..., nn.Module]) – activation function to be applied to the output of the transformer
- forward(images: Tensor, calibs: Tensor, img_sizes: Tensor, targets: list[dict[str, Tensor]] | None = None, mode: str = 'predict') dict[str, Tensor] [source]#
Forward method of the MonoDETR model.
- Parameters:
images (Tensor) – images for each sample.
calibs (Tensor) – camera matrices for each sample.
img_sizes (Tensor) – image sizes for each sample.
targets (list[dict[str, Tensor]] | None) – ground truth boxes and labels for each sample. Defaults to None.
mode (str) – The mode of operation. Defaults to “predict”.
- Returns:
A dictionary of tensors. If mode is “loss”, the tensors are the loss values. If mode is “predict”, the tensors are the logits.
- Return type:
dict[str, Tensor]
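The mode-switching contract of forward() can be sketched with a plain-Python stand-in (no torch dependency). The class, the dummy criterion, and the placeholder output keys below are illustrative assumptions, not the OTX implementation; they only mirror the documented behavior that "loss" mode returns loss values while "predict" mode returns prediction tensors.

```python
# Hypothetical sketch of MonoDETR's forward() mode dispatch.
# Real signature: forward(images, calibs, img_sizes, targets=None, mode="predict")
class MonoDETRSketch:
    def __init__(self, criterion=None):
        # criterion may be None when the model is used for inference only
        self.criterion = criterion

    def forward(self, images, calibs, img_sizes, targets=None, mode="predict"):
        # Placeholder head outputs; the real model produces tensors
        # (class logits, 3D boxes, depth, etc.).
        outputs = {"scores": [0.9], "boxes_3d": [[0.0] * 6]}
        if mode == "loss":
            if self.criterion is None or targets is None:
                raise ValueError("loss mode requires a criterion and targets")
            # Returns a dict of loss values, one per loss term.
            return self.criterion(outputs, targets)
        # "predict" mode: return the raw prediction dict.
        return outputs


def dummy_criterion(outputs, targets):
    # Stand-in for the real loss module: a dict of named loss terms.
    return {"loss_ce": 0.5, "loss_bbox": 0.2}


model = MonoDETRSketch(criterion=dummy_criterion)
preds = model.forward(images=None, calibs=None, img_sizes=None)        # predict mode
losses = model.forward(None, None, None, targets=[{}], mode="loss")    # loss mode
```

In the real model the same dispatch decides whether the criterion is applied to the transformer outputs, which is why criterion may be left as None for pure inference.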