Merge Heterogeneous Datasets for Detection#
Datumaro supports merging heterogeneous datasets into a unified data format.
In this example, we import two heterogeneous detection datasets and export a merged dataset into a unified data format.
First, we import two datasets, i.e., MS-COCO and Pascal-VOC, and transform them by filtering duplicates, reindexing ids, and remapping labels before merging.
Then, we perform the intersect merge operation and split the result into train, val, and test subsets for model training and evaluation.
[21]:
# Copyright (C) 2021 Intel Corporation
#
# SPDX-License-Identifier: MIT
import datumaro as dm
coco_path = "./coco_dataset"
coco_dataset = dm.Dataset.import_from(coco_path, format="coco_instances")
voc_path = "./VOCdevkit/VOC2007"
voc_dataset = dm.Dataset.import_from(voc_path, format="voc_detection")
print("MS-COCO dataset:")
print(coco_dataset)
print("Pascal-VOC dataset:")
print(voc_dataset)
WARNING:root:File './coco_dataset/annotations/image_info_test-dev2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/image_info_test2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/image_info_unlabeled2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/person_keypoints_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/captions_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
MS-COCO dataset:
Dataset
size=123287
source_path=./coco_dataset
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=122218
annotations_count=1915643
subsets
train2017: # of items=118287, # of annotated items=117266, # of annotations=1836996, annotation types=['mask', 'polygon', 'bbox']
val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['mask', 'polygon', 'bbox']
infos
categories
label: ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']
Pascal-VOC dataset:
Dataset
size=10022
source_path=./VOCdevkit/VOC2007
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=10022
annotations_count=31324
subsets
train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
trainval: # of items=5011, # of annotated items=5011, # of annotations=15662, annotation types=['bbox']
val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
infos
categories
label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored']
Filter Duplicates#
Here, we drop the trainval subset from the Pascal-VOC data, because it is the union of train and val and would therefore cause duplicate items.
[22]:
voc_dataset.filter('/item[subset!="trainval"]')
[22]:
Dataset
size=5011
source_path=./VOCdevkit/VOC2007
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=5011
annotations_count=15662
subsets
train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
infos
categories
label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored']
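The effect of the filter can be checked against the subset sizes printed above. The following is a minimal pure-Python sketch, not Datumaro code: the `(index, subset)` tuples are a hypothetical stand-in for dataset items, and the predicate mirrors the `subset!="trainval"` condition of the filter expression.

```python
from collections import Counter

# Subset sizes copied from the Pascal-VOC summary printed before filtering.
subset_sizes = {"train": 2501, "trainval": 5011, "val": 2510}

# Hypothetical stand-in for dataset items: (index, subset) pairs.
items = [(i, subset) for subset, n in subset_sizes.items() for i in range(n)]

# Same condition as the filter expression: keep items not in "trainval".
kept = [item for item in items if item[1] != "trainval"]

remaining = Counter(subset for _, subset in kept)
print(len(items), "->", len(kept))  # 10022 -> 5011
print(dict(remaining))              # {'train': 2501, 'val': 2510}
```

The remaining 5011 items are exactly train plus val, which matches the size reported after filtering.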
Transform - Remap Label Names#
Since many labels defined in Pascal-VOC data are also included in MS-COCO data, one-to-one mapping of most classes is possible.
Meanwhile, the remaining 60 MS-COCO labels, which cover roadside objects, animals, household items, foods, accessories, and kitchen utensils, are mapped to Pascal-VOC’s background class so that both datasets can be merged into a single unified dataset.
MS-COCO | Pascal-VOC |
---|---|
person | person |
bicycle | bicycle |
car | car |
motorcycle | motorbike |
airplane | aeroplane |
bus | bus |
train | train |
boat | boat |
bird | bird |
cat | cat |
dog | dog |
horse | horse |
sheep | sheep |
cow | cow |
bottle | bottle |
chair | chair |
couch | sofa |
potted plant | pottedplant |
dining table | diningtable |
tv | tvmonitor |
others (60 classes) | background |
[23]:
# MS-COCO labels whose names already match Pascal-VOC
identicals = [
    "person",
    "bicycle",
    "car",
    "bus",
    "train",
    "boat",
    "bird",
    "cat",
    "dog",
    "horse",
    "sheep",
    "cow",
    "bottle",
    "chair",
]
# MS-COCO labels that are renamed to their Pascal-VOC counterparts
mappings = {
    "motorcycle": "motorbike",
    "airplane": "aeroplane",
    "couch": "sofa",
    "potted plant": "pottedplant",
    "dining table": "diningtable",
    "tv": "tvmonitor",
}
# Every remaining MS-COCO label is mapped to "background"
for label in coco_dataset.categories()[dm.AnnotationType.label]:
    if label.name in identicals or label.name in mappings:
        continue
    mappings.update({label.name: "background"})
print(mappings)
coco_dataset.transform("remap_labels", mapping=mappings)
{'motorcycle': 'motorbike', 'airplane': 'aeroplane', 'couch': 'sofa', 'potted plant': 'pottedplant', 'dining table': 'diningtable', 'tv': 'tvmonitor', 'truck': 'background', 'traffic light': 'background', 'fire hydrant': 'background', 'stop sign': 'background', 'parking meter': 'background', 'bench': 'background', 'elephant': 'background', 'bear': 'background', 'zebra': 'background', 'giraffe': 'background', 'backpack': 'background', 'umbrella': 'background', 'handbag': 'background', 'tie': 'background', 'suitcase': 'background', 'frisbee': 'background', 'skis': 'background', 'snowboard': 'background', 'sports ball': 'background', 'kite': 'background', 'baseball bat': 'background', 'baseball glove': 'background', 'skateboard': 'background', 'surfboard': 'background', 'tennis racket': 'background', 'wine glass': 'background', 'cup': 'background', 'fork': 'background', 'knife': 'background', 'spoon': 'background', 'bowl': 'background', 'banana': 'background', 'apple': 'background', 'sandwich': 'background', 'orange': 'background', 'broccoli': 'background', 'carrot': 'background', 'hot dog': 'background', 'pizza': 'background', 'donut': 'background', 'cake': 'background', 'bed': 'background', 'toilet': 'background', 'laptop': 'background', 'mouse': 'background', 'remote': 'background', 'keyboard': 'background', 'cell phone': 'background', 'microwave': 'background', 'oven': 'background', 'toaster': 'background', 'sink': 'background', 'refrigerator': 'background', 'book': 'background', 'clock': 'background', 'vase': 'background', 'scissors': 'background', 'teddy bear': 'background', 'hair drier': 'background', 'toothbrush': 'background'}
[23]:
Dataset
size=123287
source_path=./coco_dataset
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=122218
annotations_count=1915643
subsets
train2017: # of items=118287, # of annotated items=117266, # of annotations=1836996, annotation types=['mask', 'polygon', 'bbox']
val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['mask', 'polygon', 'bbox']
infos
categories
label: ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'background', 'boat', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'bottle', 'chair', 'sofa', 'pottedplant', 'diningtable', 'tvmonitor']
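To make the remapping logic easier to follow without loading the datasets, here is a pure-Python sketch that rebuilds the mapping from the MS-COCO label list printed earlier and applies it to the label names only. It is an illustration, not how `remap_labels` is implemented; it assumes unmapped labels keep their names, which is consistent with the output above.

```python
# MS-COCO label names, copied from the dataset summary printed earlier.
coco_labels = [
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck",
    "boat", "traffic light", "fire hydrant", "stop sign", "parking meter", "bench",
    "bird", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra",
    "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",
    "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
    "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup",
    "fork", "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange",
    "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch",
    "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink",
    "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
    "hair drier", "toothbrush",
]
identicals = {
    "person", "bicycle", "car", "bus", "train", "boat", "bird",
    "cat", "dog", "horse", "sheep", "cow", "bottle", "chair",
}
renames = {
    "motorcycle": "motorbike",
    "airplane": "aeroplane",
    "couch": "sofa",
    "potted plant": "pottedplant",
    "dining table": "diningtable",
    "tv": "tvmonitor",
}

# Labels that are neither identical nor renamed fall back to "background".
mapping = dict(renames)
for name in coco_labels:
    if name not in identicals and name not in renames:
        mapping[name] = "background"

# Apply the mapping to the label names, keeping first-seen order and
# dropping repeats (assumption: unmapped labels keep their names).
remapped = []
for name in coco_labels:
    target = mapping.get(name, name)
    if target not in remapped:
        remapped.append(target)

n_background = sum(1 for target in mapping.values() if target == "background")
print(n_background)   # 60
print(len(remapped))  # 21
print(remapped)
```

The 80 MS-COCO labels collapse to 21 names: the 20 Pascal-VOC object classes plus background.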
Reindex Items#
To avoid conflicts between item ids when merging, we reindex the items so that the two datasets use disjoint id ranges.
[24]:
coco_dataset.transform("reindex", start=0)
voc_dataset.transform("reindex", start=len(coco_dataset))
[24]:
Dataset
size=5011
source_path=./VOCdevkit/VOC2007
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=5011
annotations_count=15662
subsets
train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
infos
categories
label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored']
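The id ranges produced by the two `reindex` calls can be sketched with plain integers. This assumes that reindexing assigns sequential indices beginning at `start`; the sizes are taken from the dataset summaries above.

```python
# Dataset sizes copied from the summaries printed above.
coco_size = 123287
voc_size = 5011  # after dropping the trainval subset

# Assumption: reindexing assigns sequential ids beginning at `start`.
coco_ids = range(0, coco_size)
voc_ids = range(coco_size, coco_size + voc_size)

# The ranges are disjoint when the last COCO id is below the first VOC id.
ranges_are_disjoint = coco_ids[-1] < voc_ids[0]

print(coco_ids[0], coco_ids[-1])  # 0 123286
print(voc_ids[0], voc_ids[-1])    # 123287 128297
print(ranges_are_disjoint)        # True
```

Starting the second dataset at `len(coco_dataset)` is what keeps the two ranges from overlapping.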
Merge Heterogeneous Datasets#
Since we have already aligned the two datasets into a homogeneous form, we choose merge_policy="intersect" here.
[25]:
merged = dm.HLOps.merge(coco_dataset, voc_dataset, merge_policy="intersect")
print(merged)
Dataset
size=128298
source_path=None
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=127229
annotations_count=1931305
subsets
train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
train2017: # of items=118287, # of annotated items=117266, # of annotations=1836996, annotation types=['mask', 'polygon', 'bbox']
val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['mask', 'polygon', 'bbox']
infos
categories
label: ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'background', 'boat', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'bottle', 'chair', 'sofa', 'pottedplant', 'diningtable', 'tvmonitor', 'ignored']
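The merged summary can be cross-checked with simple arithmetic. The sketch below is not Datumaro's merge implementation; it only adds up the counts printed for the two inputs and takes an order-preserving union of their label names. Plain addition is valid here only because the item ids were made disjoint beforehand, so no two items collapse into one.

```python
# Counts copied from the summaries of the two input datasets.
coco = {"size": 123287, "annotated": 122218, "annotations": 1915643}
voc = {"size": 5011, "annotated": 5011, "annotations": 15662}

# With disjoint item ids, the merged counts are plain sums.
merged_counts = {key: coco[key] + voc[key] for key in coco}

# Label names of the remapped MS-COCO data and of Pascal-VOC.
coco_labels = [
    "person", "bicycle", "car", "motorbike", "aeroplane", "bus", "train",
    "background", "boat", "bird", "cat", "dog", "horse", "sheep", "cow",
    "bottle", "chair", "sofa", "pottedplant", "diningtable", "tvmonitor",
]
voc_labels = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus",
    "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
    "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor", "ignored",
]

# Order-preserving union: dict keys keep first-seen order and drop repeats.
merged_labels = list(dict.fromkeys(coco_labels + voc_labels))

print(merged_counts)
# {'size': 128298, 'annotated': 127229, 'annotations': 1931305}
print(merged_labels[-1], len(merged_labels))  # ignored 22
```

The only label contributed solely by Pascal-VOC is ignored, which is why it appears at the end of the merged label list.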
Split into Subsets#
For model training and evaluation, we now reorganize the merged data into train, val, and test subsets.
[26]:
merged.transform("random_split", splits=[("train", 0.5), ("val", 0.2), ("test", 0.3)])
print(merged)
Dataset
size=128298
source_path=None
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=127229
annotations_count=1931305
subsets
test: # of items=38490, # of annotated items=38173, # of annotations=580468, annotation types=['mask', 'polygon', 'bbox']
train: # of items=64149, # of annotated items=63600, # of annotations=967532, annotation types=['mask', 'polygon', 'bbox']
val: # of items=25659, # of annotated items=25456, # of annotations=383305, annotation types=['mask', 'polygon', 'bbox']
infos
categories
label: ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'background', 'boat', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'bottle', 'chair', 'sofa', 'pottedplant', 'diningtable', 'tvmonitor', 'ignored']
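The subset sizes above follow from the 0.5 / 0.2 / 0.3 ratios applied to 128298 items. The sketch below shows one way to do such a split in pure Python: it shuffles the ids and cuts them at cumulative-ratio boundaries. The function `random_split` and its `seed` parameter are hypothetical names for illustration, and the rounding scheme is an assumption that happens to reproduce the sizes printed above; it is not taken from Datumaro's source.

```python
import random


def random_split(item_ids, splits, seed=0):
    """Shuffle ids and cut them at cumulative-ratio boundaries (illustrative)."""
    ids = list(item_ids)
    random.Random(seed).shuffle(ids)
    total = len(ids)

    result = {}
    start = 0
    cumulative = 0.0
    for i, (name, ratio) in enumerate(splits):
        cumulative += ratio
        # The last subset takes everything that is left, so no item is lost
        # to rounding.
        end = total if i == len(splits) - 1 else int(total * cumulative)
        result[name] = ids[start:end]
        start = end
    return result


parts = random_split(range(128298), [("train", 0.5), ("val", 0.2), ("test", 0.3)])
sizes = {name: len(ids) for name, ids in parts.items()}
print(sizes)  # {'train': 64149, 'val': 25659, 'test': 38490}
```

Because the split is random, the original train2017, val2017, train, and val subsets are mixed together, which is why every new subset reports mask, polygon, and bbox annotation types.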