Introduction
############

**Datumaro** is a framework and CLI tool to build, transform, and analyze datasets.

- a tool to build composite datasets and iterate over them
- a tool to create and maintain datasets

  - Version control of annotations and images
  - Publication (with removal of sensitive information)
  - Editing
  - Joining and splitting
  - Exporting, format changing
  - Image preprocessing

- a dataset storage
- a tool to debug datasets

  - A network can be used to generate informative data subsets (e.g., with false-positives) to be analyzed further

Key Features
------------

Datumaro supports the following features:

- Dataset reading, writing, conversion in any direction.

  - CIFAR-10/100 (`classification`)
  - Cityscapes
  - COCO (`image_info`, `instances`, `person_keypoints`, `captions`, `labels`, `panoptic`, `stuff`)
  - CVAT
  - ImageNet
  - Kitti (`segmentation`, `detection`, `3D raw` / `velodyne points`)
  - LabelMe
  - LFW (`classification`, `person re-identification`, `landmarks`)
  - MNIST (`classification`)
  - Open Images
  - PASCAL VOC (`classification`, `detection`, `segmentation`, `action_classification`, `person_layout`)
  - TF Detection API (`bboxes`, `masks`)
  - YOLO (`bboxes`)

  Other formats and documentation for them are listed in the Datumaro documentation.

- Dataset building

  - Merging multiple datasets into one
  - Dataset filtering by custom criteria:

    - remove polygons of a certain class
    - remove images without annotations of a specific class
    - remove ``occluded`` annotations from images
    - keep only vertically-oriented images
    - remove small-area bounding boxes from annotations

  - Annotation conversions, for instance:

    - polygons to instance masks and vice versa
    - apply a custom colormap for mask annotations
    - rename or remove dataset labels

  - Splitting a dataset into multiple subsets like ``train``, ``val``, and ``test``:

    - random split
    - task-specific splits based on annotations, which keep the initial label and attribute distributions

      - for the classification task, based on labels
      - for the detection task, based on bboxes
      - for the re-identification task, based on labels, avoiding the same IDs in training and test splits

  - Sampling a dataset

    - Analyzes inference results from the given dataset and selects the ``best`` and the ``least amount of`` samples for annotation.
    - Selects the samples that best suit model training.

      - sampling with an entropy-based algorithm

- Dataset quality checking

  - Simple checking for errors
  - Comparison with model inference
  - Merging and comparison of multiple datasets
  - Annotation validation based on the task type (classification, etc.)

- Dataset comparison
- Dataset statistics (image mean and std, annotation statistics)
- Model integration

  - Inference (OpenVINO, Caffe, PyTorch, TensorFlow, MXNet, etc.)
  - Explainable AI (RISE algorithm)

    - RISE for classification
    - RISE for object detection
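
To make the idea of a task-specific split for classification more concrete, the following is a minimal sketch in plain Python of a split that keeps the initial label distribution in every subset. It is not Datumaro's implementation and does not use its API: the function ``stratified_split``, the toy item list, and the ratios are all hypothetical names and values chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_split(items, ratios, seed=0):
    """Split (item_id, label) pairs into named subsets, label by label.

    Hypothetical helper for illustration only; not part of Datumaro.
    """
    rng = random.Random(seed)

    # Group item IDs by their class label.
    by_label = defaultdict(list)
    for item_id, label in items:
        by_label[label].append(item_id)

    names = list(ratios)
    subsets = {name: [] for name in names}

    # Split each label group separately so every subset keeps
    # roughly the same label proportions as the full dataset.
    for label, ids in sorted(by_label.items()):
        rng.shuffle(ids)
        start = 0
        cumulative = 0.0
        for i, name in enumerate(names):
            cumulative += ratios[name]
            # The last subset takes the remainder so no item is dropped.
            if i == len(names) - 1:
                end = len(ids)
            else:
                end = round(cumulative * len(ids))
            subsets[name].extend((item_id, label) for item_id in ids[start:end])
            start = end
    return subsets


# Hypothetical toy dataset: 20 "cat" images and 10 "dog" images.
items = [(f"img_{i:03d}", "cat" if i < 20 else "dog") for i in range(30)]
splits = stratified_split(items, {"train": 0.5, "val": 0.2, "test": 0.3})

for name, subset in splits.items():
    cats = sum(1 for _, label in subset if label == "cat")
    print(name, len(subset), "items,", cats, "cats")
```

Splitting each label group on its own, rather than shuffling the whole dataset once, is what keeps the two-to-one ratio of cats to dogs in every subset. A re-identification split would additionally need to group by identity so that one ID never appears in both training and test subsets.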