Introduction

Datumaro is a framework and CLI tool to build, transform, and analyze datasets. It can serve as:

  • a tool to build composite datasets and iterate over them

  • a tool to create and maintain datasets

    • Version control of annotations and images

    • Publication (with removal of sensitive information)

    • Editing

    • Joining and splitting

    • Exporting, format changing

    • Image preprocessing

  • a dataset storage

  • a tool to debug datasets

    • A network can be used to generate informative data subsets (e.g., with false positives) for further analysis

Key Features

Datumaro supports the following features:

  • Dataset reading, writing, and conversion in any direction.

    Other formats and documentation for them can be found here.

  • Dataset building

    • Merging multiple datasets into one

    • Dataset filtering by custom criteria:

      • remove polygons of a certain class

      • remove images without annotations of a specific class

      • remove occluded annotations from images

      • keep only vertically-oriented images

      • remove small-area bounding boxes from annotations

    • Annotation conversions, for instance:

      • polygons to instance masks and vice versa

      • apply a custom colormap for mask annotations

      • rename or remove dataset labels

    • Splitting a dataset into multiple subsets like train, val, and test:

      • random split

      • task-specific splits based on annotations, which preserve the initial label and attribute distributions

        • for classification tasks, based on labels

        • for detection tasks, based on bboxes

        • for re-identification tasks, based on labels, keeping the same IDs out of both the training and test splits

    • Sampling a dataset

      • Analyzes inference results on the given dataset and selects a minimal set of the best samples for annotation.

      • Selects the samples that best suit model training.

        • sampling with an entropy-based algorithm

  • Dataset quality checking

    • Simple checking for errors

    • Comparison with model inference

    • Merging and comparison of multiple datasets

    • Annotation validation based on the task type (classification, etc.)

  • Dataset comparison

  • Dataset statistics (image mean and std, annotation statistics)

  • Model integration

    • Inference (OpenVINO, Caffe, PyTorch, TensorFlow, MXNet, etc.)

    • Explainable AI (RISE algorithm)

      • RISE for classification

      • RISE for object detection
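
Two of the dataset-building features listed above, entropy-based sampling and label-based splitting for classification, can be illustrated with a minimal, library-independent sketch. This is not Datumaro's API or its actual implementation: the function names (`sample_by_entropy`, `stratified_split`), the item ids, and the data layout (plain dictionaries) are all assumptions made for illustration, and only the Python standard library is used.

```python
import math
import random
from collections import defaultdict


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sample_by_entropy(predictions, k):
    """Return the ids of the k items with the most uncertain predictions.

    `predictions` maps a (hypothetical) item id to its class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:k]


def stratified_split(labels, ratios, seed=0):
    """Split item ids into subsets, roughly preserving label proportions.

    `labels` maps item id -> class label; `ratios` maps subset name -> fraction.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item_id, label in labels.items():
        by_label[label].append(item_id)

    names = list(ratios)
    subsets = {name: [] for name in names}
    for items in by_label.values():
        rng.shuffle(items)
        start = 0
        for i, name in enumerate(names):
            if i == len(names) - 1:
                # The last subset takes the remainder so no item is dropped.
                end = len(items)
            else:
                end = start + round(len(items) * ratios[name])
            subsets[name].extend(items[start:end])
            start = end
    return subsets


# Hypothetical model outputs for three images.
predictions = {
    "img_a": [0.98, 0.01, 0.01],  # confident prediction, low entropy
    "img_b": [0.34, 0.33, 0.33],  # nearly uniform, highest entropy
    "img_c": [0.70, 0.20, 0.10],
}
selected = sample_by_entropy(predictions, k=2)
print(selected)  # ['img_b', 'img_c']

# Hypothetical classification labels: 10 "cat" items and 10 "dog" items.
labels = {f"item_{i}": ("cat" if i < 10 else "dog") for i in range(20)}
subsets = stratified_split(labels, {"train": 0.6, "val": 0.2, "test": 0.2})
print({name: len(ids) for name, ids in subsets.items()})
# {'train': 12, 'val': 4, 'test': 4}
```

Splitting each label group separately is what keeps the class balance the same across subsets. A purely random split gives no such guarantee on small datasets.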