Format specification#

CIFAR format specification is available here.

Supported annotation types:

  • Label

Datumaro supports Python version CIFAR-10/100. The difference between CIFAR-10 and CIFAR-100 is how labels are stored in the meta files (batches.meta or meta) and in the annotation files. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs). In CIFAR-10 there are no superclasses.

CIFAR formats contain 32 x 32 images. As an extension, Datumaro supports reading and writing of arbitrary-sized images.

Import CIFAR dataset#

The CIFAR dataset is available for free download:

A Datumaro project with a CIFAR source can be created in the following way:

datum project create
datum project import --format cifar <path/to/dataset>

It is possible to specify project name and project directory. Run datum project create --help for more information.

CIFAR-10 dataset directory should have the following structure:

└─ Dataset/
    ├── dataset_meta.json # a list of non-format labels (optional)
    ├── batches.meta
    ├── <subset_name1>
    ├── <subset_name2>
    └── ...

CIFAR-100 dataset directory should have the following structure:

└─ Dataset/
    ├── dataset_meta.json # a list of non-format labels (optional)
    ├── meta
    ├── <subset_name1>
    ├── <subset_name2>
    └── ...

Dataset files use the Pickle data format.

Meta files:

    num_cases_per_batch: 1000
    label_names: list of strings (['airplane', 'automobile', 'bird', ...])
    num_vis: 3072

    fine_label_names: list of strings (['apple', 'aquarium_fish', ...])
    coarse_label_names: list of strings (['aquatic_mammals', 'fish', ...])

Annotation files:

    'batch_label': 'training batch 1 of <N>'
    'data': numpy.ndarray of uint8, layout N x C x H x W
    'filenames': list of strings

    If images have non-default size (32x32) (Datumaro extension):
        'image_sizes': list of (H, W) tuples

    'labels': list of strings

    'fine_labels': list of integers
    'coarse_labels': list of integers

To add custom classes, you can use dataset_meta.json.

Export to other formats#

Datumaro can convert a CIFAR dataset into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports the classification task (e.g. MNIST, ImageNet, PascalVOC, etc.)

There are several ways to convert a CIFAR dataset to other dataset formats using CLI:

datum project create
datum project import -f cifar <path/to/cifar>
datum project export -f imagenet -o <output/dir>


datum convert -if cifar -i <path/to/dataset> \
    -f imagenet -o <output/dir> -- --save-media

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'cifar')
dataset.export('save_dir', 'imagenet', save_media=True)

Export to CIFAR#

There are several ways to convert a dataset to CIFAR format:

# export dataset into CIFAR format from existing project
datum project export -p <path/to/project> -f cifar -o <output/dir> \
    -- --save-media
# converting to CIFAR format from other format
datum convert -if imagenet -i <path/to/dataset> \
    -f cifar -o <output/dir> -- --save-media

Extra options for exporting to CIFAR format:

  • --save-media allow to export dataset with saving media files (by default False)

  • --image-ext <IMAGE_EXT> allow to specify image extension for exporting the dataset (by default .png)

  • --save-dataset-meta - allow to export dataset with saving dataset meta file (by default False)

The format (CIFAR-10 or CIFAR-100) in which the dataset will be exported depends on the presence of superclasses in the LabelCategories.


Datumaro supports filtering, transformation, merging etc. for all formats and for the CIFAR format in particular. Follow the user manual to get more information about these operations.

There are several examples of using Datumaro operations to solve particular problems with CIFAR dataset:

Example 1. How to create a custom CIFAR-like dataset#

import numpy as np
import datumaro as dm

dataset = dm.Dataset.from_iterable([
    dm.DatasetItem(id=0, image=np.ones((32, 32, 3)),
    dm.DatasetItem(id=1, image=np.ones((32, 32, 3)),
], categories=['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck'])

dataset.export('./dataset', format='cifar')

Example 2. How to filter and convert a CIFAR dataset to ImageNet#

Convert a CIFAR dataset to ImageNet format, keep only images with the dog class present:

# Download CIFAR-10 dataset:
datum convert --input-format cifar --input-path <path/to/cifar> \
              --output-format imagenet \
              --filter '/item[annotation/label="dog"]'

Examples of using this format from the code can be found in the format tests