Merge Multiple Datasets for Classification Tasks


Datumaro supports merging multiple datasets into a single dataset.

In this document, we import the EuroSAT and UCMerced datasets. Both contain aerial imagery and are used for classification tasks, but they define different label categories. In this example, you will learn how to combine such heterogeneous datasets into a single dataset using Datumaro's transform and merge operations.

Download Datasets

Datumaro provides a CLI command to download datasets from TensorFlow Datasets.

[ ]:
!datum download get -i tfds:eurosat -o eurosat -- --save-media
!datum download get -i tfds:uc_merced -o uc_merced -- --save-media

Import Datasets

[1]:
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT

import datumaro as dm

eurosat = dm.Dataset.import_from("eurosat")
print(eurosat)

viz = dm.Visualizer(eurosat, figsize=(8, 6))
items = viz.get_random_items(4)
fig = viz.vis_gallery(items)
fig.show()
Dataset
        size=27000
        source_path=eurosat
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=27000
        annotations_count=27000
subsets
        train: # of items=27000, # of annotated items=27000, # of annotations=27000, annotation types=['label']
infos
        categories
        label: ['AnnualCrop', 'Forest', 'HerbaceousVegetation', 'Highway', 'Industrial', 'Pasture', 'PermanentCrop', 'Residential', 'River', 'SeaLake']

[Figure: gallery of four randomly selected EuroSAT items]
[2]:
uc_merced = dm.Dataset.import_from("uc_merced")
print(uc_merced)

viz = dm.Visualizer(uc_merced, figsize=(8, 6))
items = viz.get_random_items(4)
fig = viz.vis_gallery(items)
fig.show()
Dataset
        size=2100
        source_path=uc_merced
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=2100
        annotations_count=2100
subsets
        train: # of items=2100, # of annotated items=2100, # of annotations=2100, annotation types=['label']
infos
        categories
        label: ['agricultural', 'airplane', 'baseballdiamond', 'beach', 'buildings', 'chaparral', 'denseresidential', 'forest', 'freeway', 'golfcourse', 'harbor', 'intersection', 'mediumresidential', 'mobilehomepark', 'overpass', 'parkinglot', 'river', 'runway', 'sparseresidential', 'storagetanks', 'tenniscourt']

[Figure: gallery of four randomly selected UCMerced items]
[3]:
eurosat_label_names = [
    label_cat.name for label_cat in eurosat.categories()[dm.AnnotationType.label]
]
uc_merced_label_names = [
    label_cat.name for label_cat in uc_merced.categories()[dm.AnnotationType.label]
]

print("EuroSAT label names:")
print(eurosat_label_names)

print("UCMerced label names:")
print(uc_merced_label_names)
EuroSAT label names:
['AnnualCrop', 'Forest', 'HerbaceousVegetation', 'Highway', 'Industrial', 'Pasture', 'PermanentCrop', 'Residential', 'River', 'SeaLake']
UCMerced label names:
['agricultural', 'airplane', 'baseballdiamond', 'beach', 'buildings', 'chaparral', 'denseresidential', 'forest', 'freeway', 'golfcourse', 'harbor', 'intersection', 'mediumresidential', 'mobilehomepark', 'overpass', 'parkinglot', 'river', 'runway', 'sparseresidential', 'storagetanks', 'tenniscourt']
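
Before building a remapping table, it can help to check which label names already coincide once letter case is ignored. The following is a minimal sketch that uses only the two label lists printed above; the helper name `case_insensitive_overlap` is hypothetical and not part of Datumaro.

```python
# Label names copied from the printed output above.
eurosat_label_names = [
    "AnnualCrop", "Forest", "HerbaceousVegetation", "Highway", "Industrial",
    "Pasture", "PermanentCrop", "Residential", "River", "SeaLake",
]
uc_merced_label_names = [
    "agricultural", "airplane", "baseballdiamond", "beach", "buildings",
    "chaparral", "denseresidential", "forest", "freeway", "golfcourse",
    "harbor", "intersection", "mediumresidential", "mobilehomepark",
    "overpass", "parkinglot", "river", "runway", "sparseresidential",
    "storagetanks", "tenniscourt",
]


def case_insensitive_overlap(names_a, names_b):
    """Return the sorted lowercase names present in both lists (hypothetical helper)."""
    lowered_a = {name.lower() for name in names_a}
    lowered_b = {name.lower() for name in names_b}
    return sorted(lowered_a & lowered_b)


overlap = case_insensitive_overlap(eurosat_label_names, uc_merced_label_names)
print(overlap)
```

Only `forest` and `river` match directly, so every other label needs an explicit mapping decision.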

Transform - Remap Label Names

The two datasets use different label names, but some of the labels are semantically identical. We use the following table to remap both label sets onto a shared set of destination labels. Once this remapping is complete, the two datasets can be merged into one.

| EuroSAT | UCMerced | Destination |
|---|---|---|
| AnnualCrop, Pasture, PermanentCrop | agricultural | agricultural |
| Industrial | buildings, parkinglot, storagetanks | industrial |
| Forest | forest | forest |
| Highway | freeway, intersection, overpass | highway |
| HerbaceousVegetation | chaparral | chaparral |
| Residential | denseresidential, mediumresidential, baseballdiamond, sparseresidential, golfcourse, tenniscourt, mobilehomepark | residential |
| River | river | river |
| SeaLake | harbor, beach | sea |
| (none) | airplane, runway | airport |

[4]:
eurosat.transform(
    "remap_labels",
    mapping={
        "AnnualCrop": "agricultural",
        "Pasture": "agricultural",
        "PermanentCrop": "agricultural",
        "Industrial": "industrial",
        "Forest": "forest",
        "Highway": "highway",
        "HerbaceousVegetation": "chaparral",
        "Residential": "residential",
        "River": "river",
        "SeaLake": "sea",
    },
)
[4]:
Dataset
        size=27000
        source_path=eurosat
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=27000
        annotations_count=27000
subsets
        train: # of items=27000, # of annotated items=27000, # of annotations=27000, annotation types=['label']
infos
        categories
        label: ['agricultural', 'forest', 'chaparral', 'highway', 'industrial', 'residential', 'river', 'sea']
[5]:
uc_merced.transform(
    "remap_labels",
    mapping={
        "buildings": "industrial",
        "parkinglot": "industrial",
        "storagetanks": "industrial",
        "freeway": "highway",
        "intersection": "highway",
        "overpass": "highway",
        "denseresidential": "residential",
        "mediumresidential": "residential",
        "baseballdiamond": "residential",
        "sparseresidential": "residential",
        "golfcourse": "residential",
        "tenniscourt": "residential",
        "mobilehomepark": "residential",
        "harbor": "sea",
        "beach": "sea",
        "airplane": "airport",
        "runway": "airport",
    },
)
[5]:
Dataset
        size=2100
        source_path=uc_merced
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=2100
        annotations_count=2100
subsets
        train: # of items=2100, # of annotated items=2100, # of annotations=2100, annotation types=['label']
infos
        categories
        label: ['agricultural', 'airport', 'residential', 'sea', 'industrial', 'chaparral', 'forest', 'highway', 'river']
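
The effect of the two remapping calls on the category lists can be reproduced with plain Python. This is only an illustrative sketch, not Datumaro's implementation: `remap_names` is a hypothetical helper, and it assumes that labels missing from the mapping are kept unchanged, which is consistent with the outputs above (for example, `chaparral` and `forest` survive in UCMerced). It also rejects mapping keys that do not exist in the source labels, which catches typos early.

```python
def remap_names(names, mapping):
    """Map label names, keep unmapped ones, and drop duplicates in first-seen order."""
    unknown = set(mapping) - set(names)
    if unknown:
        raise KeyError(f"mapping refers to unknown labels: {sorted(unknown)}")
    return list(dict.fromkeys(mapping.get(name, name) for name in names))


eurosat_names = [
    "AnnualCrop", "Forest", "HerbaceousVegetation", "Highway", "Industrial",
    "Pasture", "PermanentCrop", "Residential", "River", "SeaLake",
]
uc_merced_names = [
    "agricultural", "airplane", "baseballdiamond", "beach", "buildings",
    "chaparral", "denseresidential", "forest", "freeway", "golfcourse",
    "harbor", "intersection", "mediumresidential", "mobilehomepark",
    "overpass", "parkinglot", "river", "runway", "sparseresidential",
    "storagetanks", "tenniscourt",
]

# Same mappings as in the two transform calls above.
eurosat_mapping = {
    "AnnualCrop": "agricultural", "Pasture": "agricultural",
    "PermanentCrop": "agricultural", "Industrial": "industrial",
    "Forest": "forest", "Highway": "highway",
    "HerbaceousVegetation": "chaparral", "Residential": "residential",
    "River": "river", "SeaLake": "sea",
}
uc_merced_mapping = {
    "buildings": "industrial", "parkinglot": "industrial",
    "storagetanks": "industrial", "freeway": "highway",
    "intersection": "highway", "overpass": "highway",
    "denseresidential": "residential", "mediumresidential": "residential",
    "baseballdiamond": "residential", "sparseresidential": "residential",
    "golfcourse": "residential", "tenniscourt": "residential",
    "mobilehomepark": "residential", "harbor": "sea", "beach": "sea",
    "airplane": "airport", "runway": "airport",
}

eurosat_remapped = remap_names(eurosat_names, eurosat_mapping)
uc_merced_remapped = remap_names(uc_merced_names, uc_merced_mapping)
print(eurosat_remapped)
print(uc_merced_remapped)
```

The two printed lists contain 8 and 9 names respectively, matching the `label:` lines in the outputs of cells 4 and 5.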

Merge Heterogeneous Datasets

Since we want to merge heterogeneous datasets whose label categories differ (although some of them overlap), we have to choose merge_policy="union".

[6]:
merged = dm.HLOps.merge(uc_merced, eurosat, merge_policy="union")
merged
[6]:
Dataset
        size=29100
        source_path=None
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=29100
        annotations_count=29100
subsets
        train: # of items=29100, # of annotated items=29100, # of annotations=29100, annotation types=['label']
infos
        categories
        label: ['agricultural', 'airport', 'residential', 'sea', 'industrial', 'chaparral', 'forest', 'highway', 'river']
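
To see what a union means for the label categories, consider the two category lists on their own. The sketch below is plain Python rather than Datumaro's merge logic; `union_labels` is a hypothetical helper that keeps the order of the first list and appends any label that only the second list has.

```python
# Category lists copied from the remapped datasets above.
uc_merced_labels = [
    "agricultural", "airport", "residential", "sea", "industrial",
    "chaparral", "forest", "highway", "river",
]
eurosat_labels = [
    "agricultural", "forest", "chaparral", "highway", "industrial",
    "residential", "river", "sea",
]


def union_labels(first, second):
    """Order-preserving union of two label lists (hypothetical helper)."""
    return list(dict.fromkeys([*first, *second]))


merged_labels = union_labels(uc_merced_labels, eurosat_labels)
only_in_uc_merced = [name for name in uc_merced_labels if name not in eurosat_labels]
merged_size = 2100 + 27000  # UCMerced items + EuroSAT items

print(merged_labels)
print(only_in_uc_merced)
print(merged_size)
```

Here `airport` exists only in UCMerced, so the two category lists are not identical, and the union contains the same nine labels reported for the merged dataset.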

Now, we apply a random split to the merged dataset to make three subsets: “train”, “val”, and “test”.

[7]:
merged.transform("random_split", splits=[("train", 0.5), ("val", 0.2), ("test", 0.3)])
merged
[7]:
Dataset
        size=29100
        source_path=None
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=29100
        annotations_count=29100
subsets
        test: # of items=8730, # of annotated items=8730, # of annotations=8730, annotation types=['label']
        train: # of items=14550, # of annotated items=14550, # of annotations=14550, annotation types=['label']
        val: # of items=5820, # of annotated items=5820, # of annotations=5820, annotation types=['label']
infos
        categories
        label: ['agricultural', 'airport', 'residential', 'sea', 'industrial', 'chaparral', 'forest', 'highway', 'river']
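
The subset sizes in the output follow from the split ratios. The quick arithmetic check below is a sketch that assumes each subset size is the ratio times the total, rounded to the nearest item; the actual transform may resolve rounding differently when the products are not whole numbers.

```python
total_items = 29100
splits = [("train", 0.5), ("val", 0.2), ("test", 0.3)]

# The ratios should cover the whole dataset.
assert abs(sum(ratio for _, ratio in splits) - 1.0) < 1e-9

# Expected number of items per subset under the rounding assumption above.
expected_sizes = {name: round(total_items * ratio) for name, ratio in splits}
print(expected_sizes)

# Every item ends up in exactly one subset.
assert sum(expected_sizes.values()) == total_items
```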

The final step is to export the merged dataset and make it usable for model training!

[8]:
merged.export("merged", format="imagenet_with_subset_dirs", save_media=True)
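
After exporting, a quick look at the output directory can confirm that every subset and label was written. The sketch below uses only the standard library and assumes the export is laid out as `merged/<subset>/<label>/<image file>`. That layout and the helper name `count_exported_images` are assumptions for illustration, so adjust the path handling if your export looks different.

```python
from collections import Counter
from pathlib import Path


def count_exported_images(root):
    """Count files per (subset, label), assuming root/<subset>/<label>/<file>."""
    counts = Counter()
    root = Path(root)
    if not root.is_dir():
        # Nothing exported yet (or wrong path): return an empty result.
        return counts
    for path in root.glob("*/*/*"):
        if path.is_file():
            subset = path.parent.parent.name
            label = path.parent.name
            counts[(subset, label)] += 1
    return counts


# "merged" is the output directory used in the export call above.
for (subset, label), n_files in sorted(count_exported_images("merged").items()):
    print(f"{subset}/{label}: {n_files}")
```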