Level 7: Merge Two Heterogeneous Datasets

Training foundation models on ever-larger datasets is a prominent trend in deep learning, which makes collecting and preparing massive datasets a crucial part of model development. Because collecting and labeling data at that scale is difficult, consolidating scattered datasets into a unified one is important. For instance, Florence built the massive FLOD-9M dataset for training by combining the MS-COCO, LVIS, OpenImages, and Objects365 datasets.

This tutorial provides a simple example of merging two datasets; a detailed description of the merge operation is given here. A more advanced Python example with label mapping between datasets is given here.

Prepare datasets

We first download two aerial datasets, EuroSAT and UC Merced, and export them in the simple ImageNet format:

datum download get -i tfds:eurosat -f imagenet --output-dir <path/to/eurosat> -- --save-media

datum download get -i tfds:uc_merced -f imagenet --output-dir <path/to/uc_merced> -- --save-media
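Before merging, it can help to check what was actually downloaded. The following is a minimal sketch in plain Python (it does not use Datumaro), assuming the export directory follows the ImageNet-style layout of one subdirectory per label with the image files inside it. The helper name `count_images_per_label` and the extension list are illustrative assumptions.

```python
from pathlib import Path

# Assumed image extensions; extend this set if your export uses others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def count_images_per_label(root):
    """Count image files in each immediate subdirectory of `root`.

    Assumes the layout <root>/<label>/<image file>.
    Files placed directly under `root` are ignored.
    """
    counts = {}
    for label_dir in sorted(Path(root).iterdir()):
        if not label_dir.is_dir():
            continue
        counts[label_dir.name] = sum(
            1
            for f in label_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

Calling `count_images_per_label('<path/to/eurosat>')` and the same for the UC Merced directory gives a quick per-label overview of both sources, which is useful for comparing against the merged result later.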

Merge datasets

Without declaring a project, we can merge multiple datasets directly with

datum merge --merge-policy union --format imagenet --output-dir <path/to/output> <path/to/eurosat> <path/to/uc_merced> -- --save-media

The output directory now contains the merged data together with a merge report named merge_report.json.
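The report is a JSON file, so it can be inspected with the standard library. The sketch below makes no assumption about the report's schema: it only loads the file and lists the top-level keys with their value types. The helper names are hypothetical.

```python
import json
from pathlib import Path


def load_merge_report(output_dir):
    """Load merge_report.json from the merge output directory."""
    report_path = Path(output_dir) / "merge_report.json"
    with report_path.open(encoding="utf-8") as f:
        return json.load(f)


def top_level_summary(report):
    """Map each top-level key to the type name of its value.

    No particular report schema is assumed; a non-dict root is
    reported under the placeholder key "<root>".
    """
    if isinstance(report, dict):
        return {key: type(value).__name__ for key, value in report.items()}
    return {"<root>": type(report).__name__}
```

For example, `top_level_summary(load_merge_report('<path/to/output>'))` shows which sections the report contains before you dig into any of them.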

With the Python API, the equivalent merge is

from datumaro.components.dataset import Dataset
from datumaro.components.hl_ops import HLOps

# Import both datasets in the ImageNet format
eurosat_path = '/path/to/eurosat'
eurosat = Dataset.import_from(eurosat_path, 'imagenet')

uc_merced_path = '/path/to/uc_merced'
uc_merced = Dataset.import_from(uc_merced_path, 'imagenet')

# Merge the two datasets with the union policy
merged = HLOps.merge(eurosat, uc_merced, merge_policy='union')
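To build intuition for what a union of two label spaces looks like, here is a toy sketch in plain Python. It is not Datumaro's implementation and does not describe how Datumaro matches label names. It only shows the general idea: the merged label list contains every distinct name once, and each source needs an index remapping into that list. The label names are hypothetical, not the real label lists of either dataset.

```python
def union_labels(*label_lists):
    """Merge label lists, keeping first-seen order and dropping exact duplicates.

    Returns the merged list and, for each source list, a remap where
    remap[old_index] is the index of that label in the merged list.
    """
    merged = []
    index_of = {}
    for labels in label_lists:
        for name in labels:
            if name not in index_of:
                index_of[name] = len(merged)
                merged.append(name)
    remaps = [[index_of[name] for name in labels] for labels in label_lists]
    return merged, remaps


# Hypothetical label subsets, for illustration only.
labels_a = ["forest", "river", "highway"]
labels_b = ["beach", "river", "runway"]

merged_labels, remaps = union_labels(labels_a, labels_b)
print(merged_labels)  # ['forest', 'river', 'highway', 'beach', 'runway']
print(remaps)         # [[0, 1, 2], [3, 1, 4]]
```

Note that this toy version compares names exactly, so labels that differ only in spelling or case would stay separate; reconciling such labels is what the label mapping mentioned in the introduction is for.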

With the project-based CLI, we first create two projects and import one dataset into each:

datum project create --output-dir <path/to/project1>
datum project import --format imagenet --project <path/to/project1> <path/to/eurosat>

datum project create --output-dir <path/to/project2>
datum project import --format imagenet --project <path/to/project2> <path/to/uc_merced>

We then merge the two projects with

datum merge --merge-policy union --format imagenet --output-dir <path/to/output> <path/to/project1> <path/to/project2> -- --save-media

As with the merge without projects, the output directory contains a merge report named merge_report.json. Finally, we import the merged data (<path/to/output>) into a project. In this tutorial, we create a new project for this purpose.

datum project create --output-dir <path/to/project3>
datum project import --format imagenet --project <path/to/project3> <path/to/output>
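The project-based steps above are repetitive, so they lend themselves to scripting. The sketch below only assembles the command lines, using the same subcommands and flags shown in this tutorial, and prints them as a dry run; it executes nothing. The helper name `project_merge_commands` and the example paths are hypothetical.

```python
import shlex


def project_merge_commands(datasets, project_dirs, output_dir, merged_project,
                           fmt="imagenet", policy="union"):
    """Return argv lists for the project-based merge workflow (nothing is run)."""
    if len(datasets) != len(project_dirs):
        raise ValueError("need exactly one project directory per dataset")
    commands = []
    # One project per source dataset.
    for dataset, project in zip(datasets, project_dirs):
        commands.append(["datum", "project", "create", "--output-dir", project])
        commands.append(["datum", "project", "import", "--format", fmt,
                         "--project", project, dataset])
    # Merge all projects into a single output directory.
    commands.append(["datum", "merge", "--merge-policy", policy,
                     "--format", fmt, "--output-dir", output_dir,
                     *project_dirs, "--", "--save-media"])
    # Import the merged data into a fresh project.
    commands.append(["datum", "project", "create", "--output-dir", merged_project])
    commands.append(["datum", "project", "import", "--format", fmt,
                     "--project", merged_project, output_dir])
    return commands


# Dry run with placeholder paths: print the commands instead of executing them.
for cmd in project_merge_commands(
    ["data/eurosat", "data/uc_merced"], ["project1", "project2"],
    "merged", "project3",
):
    print(shlex.join(cmd))
```

To actually run the workflow, each returned list could be passed to `subprocess.run(cmd, check=True)`, so that a failing step stops the sequence.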