Level 3: Data Import and Export
Datumaro supports public data formats across a wide range of tasks, such as classification, detection, segmentation, pose estimation, and visual tracking. To facilitate this, Datumaro supports data import and export through both the Python API and the CLI, which makes it easier to work with a variety of data formats.
Prepare dataset
For the segmentation task, we introduce the Cityscapes dataset, which collects road scenes from 50 different cities and contains 5K fine-grained pixel-level annotations and 20K coarse annotations. A more detailed description is given on the Cityscapes website. The Cityscapes dataset is available for free download.
Convert data format
Users sometimes need to compare, merge, or manage various kinds of public datasets in a unified system. To achieve this, Datumaro not only has import and export functionalities, but also provides convert, which combines the import and export steps into a single command.
Let's convert the Cityscapes data into the MS-COCO format, which is described in the MS-COCO format documentation.
Without creating a project, we can achieve this with the single-line convert command in Datumaro:
datum convert -if cityscapes -i <path/to/cityscapes> -f coco_panoptic -o <path/to/output>
With the Python API, we can import the data through Dataset as below.
from datumaro.components.dataset import Dataset
data_path = '/path/to/cityscapes'
data_format = 'cityscapes'
dataset = Dataset.import_from(data_path, data_format)
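Before exporting, it can be useful to check what was actually loaded. The following is a minimal sketch of a hypothetical helper, summarize, which is not part of Datumaro. It assumes only that the imported dataset supports len() and has a subsets() method returning a mapping from subset name to a sized subset; this is worth confirming against your installed version.

```python
def summarize(dataset):
    """Return the total item count and the item count of each subset.

    Hypothetical helper, not a Datumaro function. It relies only on
    len(dataset) and dataset.subsets(), where subsets() is assumed to map
    each subset name to an object that also supports len().
    """
    return {
        "total_items": len(dataset),
        "items_per_subset": {
            name: len(subset) for name, subset in dataset.subsets().items()
        },
    }
```

Calling summarize(dataset) on the dataset imported above returns a plain dictionary that can be printed or logged before the export step.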
We then export the imported dataset as
output_path = '/path/to/output'
dataset.export(output_path, format='coco_panoptic')
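The import and export calls above can be wrapped in one function, which mirrors what convert does on the command line. This is a sketch rather than a Datumaro API: the name convert_dataset and its parameters are hypothetical, and passing save_media as a keyword argument to export is an assumption based on the --save-media exporter option used by the CLI in this section.

```python
def convert_dataset(src_path, src_format, dst_path, dst_format, save_media=True):
    """Import a dataset in one format and export it in another.

    Hypothetical wrapper around the two calls shown above.
    """
    # Imported inside the function so that this module can still be loaded
    # in environments where Datumaro is not installed.
    from datumaro.components.dataset import Dataset

    dataset = Dataset.import_from(src_path, src_format)
    # Assumption: save_media is forwarded to the exporter, like the CLI's
    # --save-media option.
    dataset.export(dst_path, format=dst_format, save_media=save_media)
    return dataset
```

For the example in this section, the call would be convert_dataset('/path/to/cityscapes', 'cityscapes', '/path/to/output', 'coco_panoptic').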
With the project-based CLI, we first need to create a project:
datum project create -o <path/to/project>
We now import the Cityscapes data into the project through
datum project import --format cityscapes -p <path/to/project> <path/to/cityscapes>
(Optional) When we import data, the change is automatically committed in the project. This can be shown through log as
datum project log -p <path/to/project>
(Optional) We can check information about the imported dataset, such as its subsets, number of items, or categories, through info.
datum project info -p <path/to/project>
Finally, we export the data within the project in the MS-COCO format as
datum project export --format coco -p <path/to/project> -o <path/to/save> -- --save-media
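When the project-based steps need to be repeated for several datasets, they can be scripted. The sketch below assembles the create, import, and export commands from this section as argument lists and runs them in order. The function names are hypothetical, the subcommands and flags are copied from the commands above, and the optional log and info steps are left out. By default it only prints the commands, so nothing is executed unless dry_run is turned off.

```python
import shlex
import subprocess


def project_workflow_commands(project_dir, source_dir, output_dir,
                              source_format="cityscapes", target_format="coco"):
    """Build the `datum project` commands from this section, in order."""
    return [
        ["datum", "project", "create", "-o", project_dir],
        ["datum", "project", "import", "--format", source_format,
         "-p", project_dir, source_dir],
        ["datum", "project", "export", "--format", target_format,
         "-p", project_dir, "-o", output_dir, "--", "--save-media"],
    ]


def run_workflow(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError at the first failing step.
            subprocess.run(cmd, check=True)
```

Passing each command as a list avoids shell quoting problems when paths contain spaces; shlex.join is used only to display a copy-pasteable form of the command.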
Even if you are not sure about the format of a dataset, there's no need to worry. Datumaro can detect the format for you, as described in the next level.