Merge Multiple Datasets for Classification Tasks#

Datumaro supports merging multiple datasets into a single dataset.

In this document, we import the EuroSAT and UCMerced datasets. Both belong to the aerial imagery domain and are used for classification tasks, but they define different label categories. This example shows how to combine such heterogeneous datasets into a single dataset using Datumaro's transform and merge operations.

Download Datasets#

Datumaro provides a CLI command to download both datasets from TensorFlow Datasets.

[ ]:

!datum download get -i tfds:eurosat -o eurosat -- --save-media
!datum download get -i tfds:uc_merced -o uc_merced -- --save-media

Import Datasets#

[1]:

# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT

import datumaro as dm

eurosat = dm.Dataset.import_from("eurosat")
print(eurosat)

viz = dm.Visualizer(eurosat, figsize=(8, 6))
items = viz.get_random_items(4)
fig = viz.vis_gallery(items)
fig.show()

Dataset
        size=27000
        source_path=eurosat
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=27000
        annotations_count=27000
subsets
        train: # of items=27000, # of annotated items=27000, # of annotations=27000, annotation types=['label']
infos
        categories
        label: ['AnnualCrop', 'Forest', 'HerbaceousVegetation', 'Highway', 'Industrial', 'Pasture', 'PermanentCrop', 'Residential', 'River', 'SeaLake']

[Figure: gallery of four randomly selected EuroSAT items]

[2]:

uc_merced = dm.Dataset.import_from("uc_merced")
print(uc_merced)

viz = dm.Visualizer(uc_merced, figsize=(8, 6))
items = viz.get_random_items(4)
fig = viz.vis_gallery(items)
fig.show()

Dataset
        size=2100
        source_path=uc_merced
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=2100
        annotations_count=2100
subsets
        train: # of items=2100, # of annotated items=2100, # of annotations=2100, annotation types=['label']
infos
        categories
        label: ['agricultural', 'airplane', 'baseballdiamond', 'beach', 'buildings', 'chaparral', 'denseresidential', 'forest', 'freeway', 'golfcourse', 'harbor', 'intersection', 'mediumresidential', 'mobilehomepark', 'overpass', 'parkinglot', 'river', 'runway', 'sparseresidential', 'storagetanks', 'tenniscourt']

[Figure: gallery of four randomly selected UCMerced items]

[3]:

eurosat_label_names = [
    label_cat.name for label_cat in eurosat.categories()[dm.AnnotationType.label]
]
uc_merced_label_names = [
    label_cat.name for label_cat in uc_merced.categories()[dm.AnnotationType.label]
]

print("EuroSAT label names:")
print(eurosat_label_names)

print("UCMerced label names:")
print(uc_merced_label_names)

EuroSAT label names:
['AnnualCrop', 'Forest', 'HerbaceousVegetation', 'Highway', 'Industrial', 'Pasture', 'PermanentCrop', 'Residential', 'River', 'SeaLake']
UCMerced label names:
['agricultural', 'airplane', 'baseballdiamond', 'beach', 'buildings', 'chaparral', 'denseresidential', 'forest', 'freeway', 'golfcourse', 'harbor', 'intersection', 'mediumresidential', 'mobilehomepark', 'overpass', 'parkinglot', 'river', 'runway', 'sparseresidential', 'storagetanks', 'tenniscourt']
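
To see how little the two label sets agree by name alone, we can compare them case-insensitively. The snippet below is a minimal sketch that hard-codes the label names printed above instead of reading them from the datasets, so it runs without Datumaro installed.

```python
# Label names copied from the output above.
eurosat_label_names = [
    "AnnualCrop", "Forest", "HerbaceousVegetation", "Highway", "Industrial",
    "Pasture", "PermanentCrop", "Residential", "River", "SeaLake",
]
uc_merced_label_names = [
    "agricultural", "airplane", "baseballdiamond", "beach", "buildings",
    "chaparral", "denseresidential", "forest", "freeway", "golfcourse",
    "harbor", "intersection", "mediumresidential", "mobilehomepark",
    "overpass", "parkinglot", "river", "runway", "sparseresidential",
    "storagetanks", "tenniscourt",
]

# Ignore case so that "Forest" and "forest" count as the same name.
eurosat_lower = {name.lower() for name in eurosat_label_names}
shared = sorted(
    name for name in uc_merced_label_names if name.lower() in eurosat_lower
)
print(shared)  # ['forest', 'river']
```

Only two names match directly, which is why the semantic remapping in the next section is needed before merging.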

Transform - Remap Label Names#

The label names of the two datasets differ literally, but some of them are semantically identical. We use the following table to remap the labels. Once this remapping is complete, the two datasets can be merged into one.

EuroSAT	UCMerced	Destination
AnnualCrop, Pasture, PermanentCrop	agricultural	agricultural
Industrial	buildings, parkinglot, storagetanks	industrial
Forest	forest	forest
Highway	freeway, intersection, overpass	highway
HerbaceousVegetation	chaparral	chaparral
Residential	denseresidential, mediumresidential, baseballdiamond, sparseresidential, golfcourse, tenniscourt, mobilehomepark	residential
River	river	river
SeaLake	harbor, beach	sea
(none)	airplane, runway	airport
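
The table above can also be kept as a single data structure from which both per-dataset mappings are derived. The following is a sketch under that assumption; `LABEL_TABLE` and `build_mapping` are hypothetical names introduced here for illustration and are not part of Datumaro.

```python
# Destination label -> (EuroSAT source labels, UCMerced source labels),
# transcribed from the table above.
LABEL_TABLE = {
    "agricultural": (["AnnualCrop", "Pasture", "PermanentCrop"], ["agricultural"]),
    "industrial": (["Industrial"], ["buildings", "parkinglot", "storagetanks"]),
    "forest": (["Forest"], ["forest"]),
    "highway": (["Highway"], ["freeway", "intersection", "overpass"]),
    "chaparral": (["HerbaceousVegetation"], ["chaparral"]),
    "residential": (
        ["Residential"],
        ["denseresidential", "mediumresidential", "baseballdiamond",
         "sparseresidential", "golfcourse", "tenniscourt", "mobilehomepark"],
    ),
    "river": (["River"], ["river"]),
    "sea": (["SeaLake"], ["harbor", "beach"]),
    "airport": ([], ["airplane", "runway"]),
}


def build_mapping(column):
    """Flatten one table column into {source: destination}, skipping identity rows."""
    return {
        src: dst
        for dst, columns in LABEL_TABLE.items()
        for src in columns[column]
        if src != dst
    }


eurosat_mapping = build_mapping(0)    # column 0: EuroSAT
uc_merced_mapping = build_mapping(1)  # column 1: UCMerced
print(len(eurosat_mapping), len(uc_merced_mapping))  # 10 17
```

Keeping one table avoids the two mappings drifting apart when a destination label is renamed.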

[4]:

eurosat.transform(
    "remap_labels",
    mapping={
        "AnnualCrop": "agricultural",
        "Pasture": "agricultural",
        "PermanentCrop": "agricultural",
        "Industrial": "industrial",
        "Forest": "forest",
        "Highway": "highway",
        "HerbaceousVegetation": "chaparral",
        "Residential": "residential",
        "River": "river",
        "SeaLake": "sea",
    },
)

[4]:

Dataset
        size=27000
        source_path=eurosat
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=27000
        annotations_count=27000
subsets
        train: # of items=27000, # of annotated items=27000, # of annotations=27000, annotation types=['label']
infos
        categories
        label: ['agricultural', 'forest', 'chaparral', 'highway', 'industrial', 'residential', 'river', 'sea']
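
Remapping changes the category list, and each annotation's label index has to follow it. The plain-Python sketch below illustrates that bookkeeping; it is not Datumaro's implementation, and it assumes new categories are numbered in order of first appearance, which is consistent with the category order printed above.

```python
old_labels = [
    "AnnualCrop", "Forest", "HerbaceousVegetation", "Highway", "Industrial",
    "Pasture", "PermanentCrop", "Residential", "River", "SeaLake",
]
mapping = {
    "AnnualCrop": "agricultural",
    "Pasture": "agricultural",
    "PermanentCrop": "agricultural",
    "Industrial": "industrial",
    "Forest": "forest",
    "Highway": "highway",
    "HerbaceousVegetation": "chaparral",
    "Residential": "residential",
    "River": "river",
    "SeaLake": "sea",
}

new_labels = []
index_map = {}  # old label index -> new label index
for old_idx, old_name in enumerate(old_labels):
    new_name = mapping.get(old_name, old_name)  # names without an entry stay as-is
    if new_name not in new_labels:
        new_labels.append(new_name)
    index_map[old_idx] = new_labels.index(new_name)

print(new_labels)
print(index_map)
```

Here the three crop-related source labels (old indices 0, 5, and 6) all collapse onto new index 0, "agricultural".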

[5]:

uc_merced.transform(
    "remap_labels",
    mapping={
        "buildings": "industrial",
        "parkinglot": "industrial",
        "storagetanks": "industrial",
        "freeway": "highway",
        "intersection": "highway",
        "overpass": "highway",
        "denseresidential": "residential",
        "mediumresidential": "residential",
        "baseballdiamond": "residential",
        "sparseresidential": "residential",
        "golfcourse": "residential",
        "tenniscourt": "residential",
        "mobilehomepark": "residential",
        "harbor": "sea",
        "beach": "sea",
        "airplane": "airport",
        "runway": "airport",
    },
)

[5]:

Dataset
        size=2100
        source_path=uc_merced
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=2100
        annotations_count=2100
subsets
        train: # of items=2100, # of annotated items=2100, # of annotations=2100, annotation types=['label']
infos
        categories
        label: ['agricultural', 'airport', 'residential', 'sea', 'industrial', 'chaparral', 'forest', 'highway', 'river']
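
Note that the UCMerced mapping does not list every label. Judging from the output above, labels without an entry are carried over unchanged. A small sketch, again using hard-coded names rather than the dataset object, lists which labels those are:

```python
uc_merced_labels = [
    "agricultural", "airplane", "baseballdiamond", "beach", "buildings",
    "chaparral", "denseresidential", "forest", "freeway", "golfcourse",
    "harbor", "intersection", "mediumresidential", "mobilehomepark",
    "overpass", "parkinglot", "river", "runway", "sparseresidential",
    "storagetanks", "tenniscourt",
]
# Source labels that appear as keys in the UCMerced mapping above.
mapped_sources = {
    "buildings", "parkinglot", "storagetanks",
    "freeway", "intersection", "overpass",
    "denseresidential", "mediumresidential", "baseballdiamond",
    "sparseresidential", "golfcourse", "tenniscourt", "mobilehomepark",
    "harbor", "beach", "airplane", "runway",
}

kept_as_is = [name for name in uc_merced_labels if name not in mapped_sources]
print(kept_as_is)  # ['agricultural', 'chaparral', 'forest', 'river']
```

These four names already equal their destination labels in the table, so no entry is required for them.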

Merge Heterogeneous Datasets#

After remapping, the label categories of the two datasets mostly overlap but are still not identical (only UCMerced has "airport"). To keep every category from both datasets, we choose merge_policy="union".

[6]:

merged = dm.HLOps.merge(uc_merced, eurosat, merge_policy="union")
merged

[6]:

Dataset
        size=29100
        source_path=None
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=29100
        annotations_count=29100
subsets
        train: # of items=29100, # of annotated items=29100, # of annotations=29100, annotation types=['label']
infos
        categories
        label: ['agricultural', 'airport', 'residential', 'sea', 'industrial', 'chaparral', 'forest', 'highway', 'river']
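
Conceptually, a union of label categories keeps every name from the first dataset and appends any name that appears only in the later ones. The sketch below illustrates that idea on the two remapped label lists; `union_labels` is a hypothetical helper, not Datumaro's merge implementation, and the order-preserving behaviour is an assumption that happens to agree with the output above.

```python
def union_labels(*label_lists):
    """Order-preserving union: keep first occurrences, append unseen names."""
    merged = []
    for labels in label_lists:
        for name in labels:
            if name not in merged:
                merged.append(name)
    return merged


uc_merced_labels = ["agricultural", "airport", "residential", "sea",
                    "industrial", "chaparral", "forest", "highway", "river"]
eurosat_labels = ["agricultural", "forest", "chaparral", "highway",
                  "industrial", "residential", "river", "sea"]

merged_labels = union_labels(uc_merced_labels, eurosat_labels)
print(merged_labels)

# EuroSAT label indices must be rewritten to point into the merged list.
eurosat_to_merged = [merged_labels.index(name) for name in eurosat_labels]
print(eurosat_to_merged)  # [0, 6, 5, 7, 4, 2, 8, 3]

# Item counts simply add up: 2100 UCMerced items + 27000 EuroSAT items.
print(2100 + 27000)  # 29100
```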

Now, we apply a random split to the merged dataset to make three subsets: “train”, “val”, and “test”.

[7]:

merged.transform("random_split", splits=[("train", 0.5), ("val", 0.2), ("test", 0.3)])
merged

[7]:

Dataset
        size=29100
        source_path=None
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=29100
        annotations_count=29100
subsets
        test: # of items=8730, # of annotated items=8730, # of annotations=8730, annotation types=['label']
        train: # of items=14550, # of annotated items=14550, # of annotations=14550, annotation types=['label']
        val: # of items=5820, # of annotated items=5820, # of annotations=5820, annotation types=['label']
infos
        categories
        label: ['agricultural', 'airport', 'residential', 'sea', 'industrial', 'chaparral', 'forest', 'highway', 'river']
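
The subset sizes above follow directly from the ratios: 50%, 20%, and 30% of 29100 items. The following sketch reproduces that arithmetic with a shuffled list of item ids; `split_ids` and its `seed` argument are illustrative names, and the rounding rule is an assumption rather than Datumaro's exact algorithm.

```python
import random


def split_ids(item_ids, splits, seed=0):
    """Shuffle ids and cut them at cumulative ratio boundaries."""
    ids = list(item_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    result, start, cumulative = {}, 0, 0.0
    for i, (name, ratio) in enumerate(splits):
        cumulative += ratio
        # The last subset takes whatever remains, so no item is dropped.
        end = len(ids) if i == len(splits) - 1 else round(cumulative * len(ids))
        result[name] = ids[start:end]
        start = end
    return result


parts = split_ids(range(29100), [("train", 0.5), ("val", 0.2), ("test", 0.3)])
print({name: len(ids) for name, ids in parts.items()})
# {'train': 14550, 'val': 5820, 'test': 8730}
```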

The final step is to export the merged dataset so that it can be used for model training.

[8]:

merged.export("merged", format="imagenet_with_subset_dirs", save_media=True)
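
As a sanity check after exporting, we can count the image files on disk. The sketch below assumes the exported directory is laid out as `<root>/<subset>/<label>/<image file>`; that layout is our assumption about the imagenet_with_subset_dirs format, and `count_exported_images` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def count_exported_images(root):
    """Count files per (subset, label), assuming <root>/<subset>/<label>/<file>."""
    counts = Counter()
    for path in Path(root).glob("*/*/*"):
        if path.is_file():
            subset, label = path.parts[-3], path.parts[-2]
            counts[(subset, label)] += 1
    return counts


# Only inspect the export directory if it exists in the working directory.
if Path("merged").is_dir():
    counts = count_exported_images("merged")
    per_subset = Counter()
    for (subset, _label), n in counts.items():
        per_subset[subset] += n
    print(dict(per_subset))
```

If the layout assumption holds, the per-subset totals should match the subset sizes reported after the random split.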