Merge Multiple Datasets for Classification Tasks#
Datumaro supports merging multiple datasets into a single dataset.
In this document, we import the EuroSAT and UCMerced datasets. Both contain aerial imagery and are used for classification tasks. Although they share a similar domain, their label categories differ. In this example, you will learn how to combine such heterogeneous datasets into a single dataset using the Datumaro transform and merge commands.
Download Datasets#
We provide a CLI command to download the datasets from TensorFlow Datasets.
[ ]:
!datum download get -i tfds:eurosat -o eurosat -- --save-media
!datum download get -i tfds:uc_merced -o uc_merced -- --save-media
Import Datasets#
[1]:
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT
import datumaro as dm
eurosat = dm.Dataset.import_from("eurosat")
print(eurosat)
viz = dm.Visualizer(eurosat, figsize=(8, 6))
items = viz.get_random_items(4)
fig = viz.vis_gallery(items)
fig.show()
Dataset
size=27000
source_path=eurosat
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=27000
annotations_count=27000
subsets
train: # of items=27000, # of annotated items=27000, # of annotations=27000, annotation types=['label']
infos
categories
label: ['AnnualCrop', 'Forest', 'HerbaceousVegetation', 'Highway', 'Industrial', 'Pasture', 'PermanentCrop', 'Residential', 'River', 'SeaLake']
[2]:
uc_merced = dm.Dataset.import_from("uc_merced")
print(uc_merced)
viz = dm.Visualizer(uc_merced, figsize=(8, 6))
items = viz.get_random_items(4)
fig = viz.vis_gallery(items)
fig.show()
Dataset
size=2100
source_path=uc_merced
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=2100
annotations_count=2100
subsets
train: # of items=2100, # of annotated items=2100, # of annotations=2100, annotation types=['label']
infos
categories
label: ['agricultural', 'airplane', 'baseballdiamond', 'beach', 'buildings', 'chaparral', 'denseresidential', 'forest', 'freeway', 'golfcourse', 'harbor', 'intersection', 'mediumresidential', 'mobilehomepark', 'overpass', 'parkinglot', 'river', 'runway', 'sparseresidential', 'storagetanks', 'tenniscourt']
[3]:
eurosat_label_names = [
    label_cat.name for label_cat in eurosat.categories()[dm.AnnotationType.label]
]
uc_merced_label_names = [
    label_cat.name for label_cat in uc_merced.categories()[dm.AnnotationType.label]
]
print("EuroSAT label names:")
print(eurosat_label_names)
print("UCMerced label names:")
print(uc_merced_label_names)
EuroSAT label names:
['AnnualCrop', 'Forest', 'HerbaceousVegetation', 'Highway', 'Industrial', 'Pasture', 'PermanentCrop', 'Residential', 'River', 'SeaLake']
UCMerced label names:
['agricultural', 'airplane', 'baseballdiamond', 'beach', 'buildings', 'chaparral', 'denseresidential', 'forest', 'freeway', 'golfcourse', 'harbor', 'intersection', 'mediumresidential', 'mobilehomepark', 'overpass', 'parkinglot', 'river', 'runway', 'sparseresidential', 'storagetanks', 'tenniscourt']
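To see which names already coincide apart from letter case, a quick pure-Python comparison of the two printed lists can help. This is only a sketch on the label names shown above; it does not call Datumaro, and the variable names are illustrative.

```python
# Label names copied from the printed output above.
eurosat_label_names = [
    "AnnualCrop", "Forest", "HerbaceousVegetation", "Highway", "Industrial",
    "Pasture", "PermanentCrop", "Residential", "River", "SeaLake",
]
uc_merced_label_names = [
    "agricultural", "airplane", "baseballdiamond", "beach", "buildings",
    "chaparral", "denseresidential", "forest", "freeway", "golfcourse",
    "harbor", "intersection", "mediumresidential", "mobilehomepark",
    "overpass", "parkinglot", "river", "runway", "sparseresidential",
    "storagetanks", "tenniscourt",
]

# Compare lower-cased names to find labels that differ only in capitalization.
case_insensitive_overlap = sorted(
    {name.lower() for name in eurosat_label_names}
    & {name.lower() for name in uc_merced_label_names}
)
print(case_insensitive_overlap)
```

Only a couple of names match this way, so the remaining correspondences have to be specified by hand.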
Transform - Remap Label Names#
The two datasets use different label names, but some of them are semantically identical. We use the following table to remap the labels. Once this label remapping is complete, the two datasets can be merged into one.
EuroSAT | UCMerced | Destination |
---|---|---|
AnnualCrop, Pasture, PermanentCrop | agricultural | agricultural |
Industrial | buildings, parkinglot, storagetanks | industrial |
Forest | forest | forest |
Highway | freeway, intersection, overpass | highway |
HerbaceousVegetation | chaparral | chaparral |
Residential | denseresidential, mediumresidential, baseballdiamond, sparseresidential, golfcourse, tenniscourt, mobilehomepark | residential |
River | river | river |
SeaLake | harbor, beach | sea |
(none) | airplane, runway | airport |
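As a rough illustration of what the remapping in this table does to a label list, the following pure-Python sketch applies a name-to-name mapping and drops duplicates while keeping first-seen order. It does not call Datumaro: the helper `remap_label_names` is hypothetical, and it assumes that labels absent from the mapping are kept unchanged.

```python
def remap_label_names(names, mapping):
    """Hypothetical helper: rename labels, keep unmapped names, preserve first-seen order."""
    remapped = []
    for name in names:
        new_name = mapping.get(name, name)  # assumption: unmapped labels stay as they are
        if new_name not in remapped:
            remapped.append(new_name)
    return remapped


eurosat_names = [
    "AnnualCrop", "Forest", "HerbaceousVegetation", "Highway", "Industrial",
    "Pasture", "PermanentCrop", "Residential", "River", "SeaLake",
]
eurosat_mapping = {
    "AnnualCrop": "agricultural",
    "Pasture": "agricultural",
    "PermanentCrop": "agricultural",
    "Industrial": "industrial",
    "Forest": "forest",
    "Highway": "highway",
    "HerbaceousVegetation": "chaparral",
    "Residential": "residential",
    "River": "river",
    "SeaLake": "sea",
}

remapped_eurosat_names = remap_label_names(eurosat_names, eurosat_mapping)
print(remapped_eurosat_names)
```

Ten EuroSAT labels collapse to eight destination labels because three of them map to "agricultural".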
[4]:
eurosat.transform(
    "remap_labels",
    mapping={
        "AnnualCrop": "agricultural",
        "Pasture": "agricultural",
        "PermanentCrop": "agricultural",
        "Industrial": "industrial",
        "Forest": "forest",
        "Highway": "highway",
        "HerbaceousVegetation": "chaparral",
        "Residential": "residential",
        "River": "river",
        "SeaLake": "sea",
    },
)
[4]:
Dataset
size=27000
source_path=eurosat
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=27000
annotations_count=27000
subsets
train: # of items=27000, # of annotated items=27000, # of annotations=27000, annotation types=['label']
infos
categories
label: ['agricultural', 'forest', 'chaparral', 'highway', 'industrial', 'residential', 'river', 'sea']
[5]:
uc_merced.transform(
    "remap_labels",
    mapping={
        "buildings": "industrial",
        "parkinglot": "industrial",
        "storagetanks": "industrial",
        "freeway": "highway",
        "intersection": "highway",
        "overpass": "highway",
        "denseresidential": "residential",
        "mediumresidential": "residential",
        "baseballdiamond": "residential",
        "sparseresidential": "residential",
        "golfcourse": "residential",
        "tenniscourt": "residential",
        "mobilehomepark": "residential",
        "harbor": "sea",
        "beach": "sea",
        "airplane": "airport",
        "runway": "airport",
    },
)
[5]:
Dataset
size=2100
source_path=uc_merced
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=2100
annotations_count=2100
subsets
train: # of items=2100, # of annotated items=2100, # of annotations=2100, annotation types=['label']
infos
categories
label: ['agricultural', 'airport', 'residential', 'sea', 'industrial', 'chaparral', 'forest', 'highway', 'river']
Merge Heterogeneous Datasets#
Since we want to merge heterogeneous datasets with different label categories (although some of them overlap), we have to choose merge_policy="union".
[6]:
merged = dm.HLOps.merge(uc_merced, eurosat, merge_policy="union")
merged
[6]:
Dataset
size=29100
source_path=None
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=29100
annotations_count=29100
subsets
train: # of items=29100, # of annotated items=29100, # of annotations=29100, annotation types=['label']
infos
categories
label: ['agricultural', 'airport', 'residential', 'sea', 'industrial', 'chaparral', 'forest', 'highway', 'river']
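The effect of a union merge on the label categories and the item count can be sketched in plain Python. This is a conceptual illustration only, not Datumaro's implementation: `union_label_names` is a hypothetical helper, and the lists are the remapped label names printed earlier.

```python
def union_label_names(*label_lists):
    """Hypothetical helper: order-preserving union of several label name lists."""
    merged = []
    for names in label_lists:
        for name in names:
            if name not in merged:
                merged.append(name)
    return merged


# Remapped label names, as printed for each dataset after the transform step.
uc_merced_names = [
    "agricultural", "airport", "residential", "sea", "industrial",
    "chaparral", "forest", "highway", "river",
]
eurosat_names = [
    "agricultural", "forest", "chaparral", "highway", "industrial",
    "residential", "river", "sea",
]

merged_names = union_label_names(uc_merced_names, eurosat_names)
merged_size = 2100 + 27000  # UCMerced items + EuroSAT items

print(merged_names)
print(merged_size)
```

Because every remapped EuroSAT label already appears among the remapped UCMerced labels, the union adds no new category, while the item counts simply add up.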
Now, we apply a random split to the merged dataset to make three subsets: “train”, “val”, and “test”.
[7]:
merged.transform("random_split", splits=[("train", 0.5), ("val", 0.2), ("test", 0.3)])
merged
[7]:
Dataset
size=29100
source_path=None
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=29100
annotations_count=29100
subsets
test: # of items=8730, # of annotated items=8730, # of annotations=8730, annotation types=['label']
train: # of items=14550, # of annotated items=14550, # of annotations=14550, annotation types=['label']
val: # of items=5820, # of annotated items=5820, # of annotations=5820, annotation types=['label']
infos
categories
label: ['agricultural', 'airport', 'residential', 'sea', 'industrial', 'chaparral', 'forest', 'highway', 'river']
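The subset sizes above follow directly from the split ratios applied to 29100 items. The sketch below reproduces that arithmetic with a seeded shuffle in plain Python. It is an assumption-laden illustration rather than Datumaro's algorithm: the `random_split` function, its `seed` parameter, and the rounding rule are all hypothetical.

```python
import random


def random_split(items, splits, seed=1234):
    """Hypothetical helper: shuffle items and cut them into named subsets by ratio."""
    if abs(sum(ratio for _, ratio in splits) - 1.0) > 1e-9:
        raise ValueError("split ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    subsets = {}
    start = 0
    cumulative = 0.0
    for name, ratio in splits:
        cumulative += ratio
        # Cut at cumulative boundaries so the subset sizes always sum to the total.
        end = round(cumulative * len(shuffled))
        subsets[name] = shuffled[start:end]
        start = end
    return subsets


subsets = random_split(range(29100), [("train", 0.5), ("val", 0.2), ("test", 0.3)])
subset_sizes = {name: len(members) for name, members in subsets.items()}
print(subset_sizes)
```

Note that a purely random split like this one does not balance labels across subsets.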
The final step is to export the merged dataset and make it usable for model training!
[8]:
merged.export("merged", format="imagenet_with_subset_dirs", save_media=True)
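After exporting, it can be useful to check how many images landed in each subset and label. The sketch below assumes the exported directory follows a `<root>/<subset>/<label>/<image>` layout, which is our reading of the imagenet_with_subset_dirs format name; verify this against your own export. The helper `count_exported_images` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_exported_images(root):
    """Hypothetical helper: count files per (subset, label) under root/<subset>/<label>/."""
    counts = Counter()
    for path in Path(root).glob("*/*/*"):
        if path.is_file():
            subset, label = path.parent.parent.name, path.parent.name
            counts[(subset, label)] += 1
    return counts


# Only report if the export directory from the previous cell exists.
if Path("merged").is_dir():
    for (subset, label), n_images in sorted(count_exported_images("merged").items()):
        print(f"{subset}/{label}: {n_images}")
```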