Introduction

Datumaro is a framework and CLI tool to build, transform, and analyze datasets. It can serve as:

a tool to build composite datasets and iterate over them
a tool to create and maintain datasets
- Version control of annotations and images
- Publication (with removal of sensitive information)
- Editing
- Joining and splitting
- Exporting, format changing
- Image preprocessing
a dataset storage
a tool to debug datasets
- A network can be used to generate informative data subsets (e.g., with false positives) for further analysis

Key Features

Datumaro supports the following features:

Dataset reading, writing, and conversion in any direction. Supported formats include:
- CIFAR-10/100 (classification)
- Cityscapes
- COCO (image_info, instances, person_keypoints, captions, labels, panoptic, stuff)
- CVAT
- ImageNet
- KITTI (segmentation, detection, 3D raw / velodyne points)
- LabelMe
- LFW (classification, person re-identification, landmarks)
- MNIST (classification)
- Open Images
- PASCAL VOC (classification, detection, segmentation, action_classification, person_layout)
- TF Detection API (bboxes, masks)
- YOLO (bboxes)
Other formats and documentation for them can be found here.
Dataset building
- Merging multiple datasets into one
- Dataset filtering by custom criteria:
  - remove polygons of a certain class
  - remove images without annotations of a specific class
  - remove occluded annotations from images
  - keep only vertically-oriented images
  - remove small area bounding boxes from annotations
- Annotation conversions, for instance:
  - polygons to instance masks and vice versa
  - apply a custom colormap for mask annotations
  - rename or remove dataset labels
- Splitting a dataset into multiple subsets like train, val, and test:
  - random split
  - task-specific splits based on annotations, which preserve the initial label and attribute distributions
    - for classification tasks, based on labels
    - for detection tasks, based on bboxes
    - for re-identification tasks, based on labels, so that the same ID never appears in both the training and test splits
- Sampling a dataset
  - analyze inference results on the given dataset and select the smallest set of most informative samples for annotation
  - select the samples best suited to model training
    - sampling with an entropy-based algorithm
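Two of the dataset building operations listed above, a label-based split that preserves class proportions and entropy-based sample selection, can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not Datumaro's implementation: the function names `stratified_split`, `entropy`, and `select_most_uncertain` and the input layouts (a list of labels, a dict of predicted class probabilities) are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, ratios, seed=0):
    """Assign item indices to subsets while keeping per-label proportions.

    labels: list with one class label per item (hypothetical input layout).
    ratios: dict such as {"train": 0.5, "val": 0.25, "test": 0.25}.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    names = list(ratios)
    subsets = {name: [] for name in names}
    for indices in by_label.values():
        rng.shuffle(indices)
        start = 0
        for position, name in enumerate(names):
            if position == len(names) - 1:
                end = len(indices)  # the last subset takes the remainder
            else:
                end = start + round(ratios[name] * len(indices))
            subsets[name].extend(indices[start:end])
            start = end
    return subsets


def entropy(probabilities):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_most_uncertain(predictions, count):
    """Return the ids of the `count` items with the highest prediction entropy."""
    ranked = sorted(predictions, key=lambda item_id: entropy(predictions[item_id]),
                    reverse=True)
    return ranked[:count]


labels = ["cat"] * 8 + ["dog"] * 4
subsets = stratified_split(labels, {"train": 0.5, "val": 0.25, "test": 0.25})
print({name: len(indices) for name, indices in subsets.items()})
# {'train': 6, 'val': 3, 'test': 3}

predictions = {"a": [0.98, 0.01, 0.01], "b": [0.34, 0.33, 0.33], "c": [0.6, 0.3, 0.1]}
print(select_most_uncertain(predictions, 2))  # ['b', 'c']
```

Splitting each label group separately is what keeps the class proportions roughly equal across subsets. Ranking by entropy favors items on which the model is least certain, which is one common heuristic for choosing what to annotate next.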
Dataset quality checking
- Simple checking for errors
- Comparison with model inference
- Merging and comparison of multiple datasets
- Annotation validation based on the task type (classification, etc.)
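To give a rough idea of what task-specific annotation validation can look like, the sketch below checks a classification dataset for a few simple problems. It is an illustration only: the function `validate_classification`, its report format, and the `imbalance_ratio` threshold are assumptions made for this example and do not describe the checks Datumaro actually runs.

```python
from collections import Counter


def validate_classification(items, known_labels, imbalance_ratio=5.0):
    """Collect simple problems in a classification dataset.

    items: list of (item_id, labels) pairs, where labels is a list of strings.
    known_labels: list of label names defined for the dataset.
    Returns a list of (severity, item_id_or_None, message) tuples.
    """
    reports = []
    counts = Counter()
    for item_id, labels in items:
        if not labels:
            reports.append(("error", item_id, "missing label"))
        elif len(labels) > 1:
            reports.append(("error", item_id, "multiple labels"))
        for label in labels:
            if label not in known_labels:
                reports.append(("error", item_id, f"undefined label: {label}"))
            else:
                counts[label] += 1

    # Dataset-level checks: unused labels and strong class imbalance.
    for label in known_labels:
        if counts[label] == 0:
            reports.append(("warning", None, f"label never used: {label}"))
    used = [count for count in counts.values() if count > 0]
    if used and max(used) > imbalance_ratio * min(used):
        reports.append(("warning", None, "imbalanced labels"))
    return reports


items = [("img1", ["cat"]), ("img2", []), ("img3", ["cat", "dog"]), ("img4", ["bird"])]
for report in validate_classification(items, ["cat", "dog", "fish"]):
    print(report)
```

A detection or segmentation validator would follow the same pattern with checks suited to its annotation type, for example boxes with zero area.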
Dataset comparison
Dataset statistics (image mean and std, annotation statistics)
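The image mean and standard deviation mentioned above are per-channel aggregates over every pixel in the dataset. A minimal sketch of that computation follows, assuming images are given as nested lists of pixel tuples. The function name `channel_mean_std` and this input layout are hypothetical and not part of Datumaro.

```python
import math


def channel_mean_std(images):
    """Per-channel mean and standard deviation over all pixels of all images.

    images: iterable of images; each image is a list of rows, each row a list
    of pixels, each pixel a tuple of channel values, for example (R, G, B).
    """
    count = 0
    sums = None
    squares = None
    for image in images:
        for row in image:
            for pixel in row:
                if sums is None:
                    sums = [0.0] * len(pixel)
                    squares = [0.0] * len(pixel)
                for channel, value in enumerate(pixel):
                    sums[channel] += value
                    squares[channel] += value * value
                count += 1
    if count == 0:
        raise ValueError("no pixels to aggregate")
    means = [total / count for total in sums]
    # Population variance E[x^2] - E[x]^2, clamped at 0 against rounding error.
    stds = [math.sqrt(max(square / count - mean * mean, 0.0))
            for square, mean in zip(squares, means)]
    return means, stds


# Two tiny 1x2 RGB images.
images = [[[(0, 10, 255), (2, 10, 255)]], [[(4, 10, 255), (6, 10, 255)]]]
means, stds = channel_mean_std(images)
print(means)  # [3.0, 10.0, 255.0]
```

Accumulating running sums keeps memory use constant regardless of dataset size. For large pixel values or very large datasets, a numerically stabler method such as Welford's algorithm may be preferable.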
Model integration
- Inference (OpenVINO, Caffe, PyTorch, TensorFlow, MXNet, etc.)
- Explainable AI (RISE algorithm)
  - RISE for classification
  - RISE for object detection
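RISE (Randomized Input Sampling for Explanation) treats the model as a black box: it scores many randomly masked copies of the input and averages the masks weighted by those scores. The sketch below shows the core idea for a single-channel image. It is a simplification and not Datumaro's implementation: it uses independent per-pixel binary masks rather than the smooth, upsampled masks of the original algorithm, and `rise_saliency` and `score_fn` are hypothetical names.

```python
import random


def rise_saliency(image, score_fn, num_masks=500, keep_prob=0.5, seed=0):
    """Estimate per-pixel importance for a black-box scoring function.

    image: 2D list of numbers.
    score_fn: callable that takes a masked image and returns a scalar score
    for the class of interest (stands in for model inference).
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    saliency = [[0.0] * width for _ in range(height)]
    for _ in range(num_masks):
        # Simplification: each pixel is kept independently with keep_prob.
        mask = [[1 if rng.random() < keep_prob else 0 for _ in range(width)]
                for _ in range(height)]
        masked = [[image[y][x] * mask[y][x] for x in range(width)]
                  for y in range(height)]
        score = score_fn(masked)
        for y in range(height):
            for x in range(width):
                saliency[y][x] += score * mask[y][x]
    # Normalize by the expected number of times each pixel was kept.
    scale = 1.0 / (num_masks * keep_prob)
    return [[value * scale for value in row] for row in saliency]


# Toy "model" whose score depends only on the pixel at row 1, column 2.
image = [[1.0] * 4 for _ in range(3)]
saliency = rise_saliency(image, lambda masked: masked[1][2], num_masks=200)
for row in saliency:
    print([round(value, 2) for value in row])
```

With this toy scoring function the largest saliency value should land on the pixel at row 1, column 2, because the score is non-zero only when that pixel is kept. For object detection, the same loop applies with a score that measures how well the detections on the masked image match the target box.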