Merge Heterogeneous Datasets for Detection#

Datumaro supports merging heterogeneous datasets into a unified data format.

In this example, we import two heterogeneous detection datasets and export a merged dataset into a unified data format.

First, we import two datasets, MS-COCO and Pascal-VOC, and transform them by filtering duplicates, remapping labels, and reindexing ids before merging.

Then, we perform an intersect merge and split the result into train, val, and test subsets for model training and evaluation.

[21]:

# Copyright (C) 2021 Intel Corporation
#
# SPDX-License-Identifier: MIT

import datumaro as dm

coco_path = "./coco_dataset"
coco_dataset = dm.Dataset.import_from(coco_path, format="coco_instances")

voc_path = "./VOCdevkit/VOC2007"
voc_dataset = dm.Dataset.import_from(voc_path, format="voc_detection")

print("MS-COCO dataset:")
print(coco_dataset)

print("Pascal-VOC dataset:")
print(voc_dataset)

WARNING:root:File './coco_dataset/annotations/image_info_test-dev2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/image_info_test2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/image_info_unlabeled2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/person_keypoints_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/captions_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances

MS-COCO dataset:
Dataset
        size=123287
        source_path=./coco_dataset
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=122218
        annotations_count=1915643
subsets
        train2017: # of items=118287, # of annotated items=117266, # of annotations=1836996, annotation types=['mask', 'polygon', 'bbox']
        val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['mask', 'polygon', 'bbox']
infos
        categories
        label: ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']

Pascal-VOC dataset:
Dataset
        size=10022
        source_path=./VOCdevkit/VOC2007
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=10022
        annotations_count=31324
subsets
        train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
        trainval: # of items=5011, # of annotated items=5011, # of annotations=15662, annotation types=['bbox']
        val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
infos
        categories
        label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored']

Filter Duplicates#

Here, we drop the trainval subset from the Pascal-VOC data because it duplicates the items that are already in train and val.

[22]:

voc_dataset.filter('/item[subset!="trainval"]')

[22]:

Dataset
        size=5011
        source_path=./VOCdevkit/VOC2007
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=5011
        annotations_count=15662
subsets
        train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
        val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
infos
        categories
        label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored']
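
The effect of this filter can be illustrated without Datumaro. The following is a minimal sketch on hypothetical toy records (the item ids and the helper names `drop_subset` and `duplicated_ids` are assumptions for illustration, not part of the Datumaro API). It shows that once the trainval entries are dropped, no item id appears more than once.

```python
from collections import Counter

# Hypothetical toy records: (item id, subset name).
# The "trainval" entries repeat the ids listed under "train" and "val".
items = [
    ("000005", "train"),
    ("000007", "val"),
    ("000005", "trainval"),
    ("000007", "trainval"),
]


def drop_subset(records, subset_name):
    """Keep only the records that do not belong to `subset_name`."""
    return [record for record in records if record[1] != subset_name]


def duplicated_ids(records):
    """Return the sorted item ids that occur in more than one record."""
    counts = Counter(item_id for item_id, _ in records)
    return sorted(item_id for item_id, n in counts.items() if n > 1)


filtered = drop_subset(items, "trainval")
print(duplicated_ids(items))     # ids duplicated before filtering
print(duplicated_ids(filtered))  # ids duplicated after filtering
```

The subset sizes printed above are consistent with this: 2501 (train) + 2510 (val) = 5011, which is exactly the size of trainval.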

Transform - Remap Label Names#

Since many labels defined in Pascal-VOC data are also included in MS-COCO data, one-to-one mapping of most classes is possible.

Meanwhile, the remaining 60 MS-COCO labels, which cover roadside objects, animals, household items, foods, accessories, and kitchen utensils, are mapped to Pascal-VOC’s background class so that the two datasets can be merged into a single unified dataset.

MS-COCO	Pascal-VOC
person	person
bicycle	bicycle
car	car
motorcycle	motorbike
airplane	aeroplane
bus	bus
train	train
boat	boat
bird	bird
cat	cat
dog	dog
horse	horse
sheep	sheep
cow	cow
bottle	bottle
chair	chair
couch	sofa
potted plant	pottedplant
dining table	diningtable
tv	tvmonitor
others (60 classes)	background

[23]:

identicals = [
    "person",
    "bicycle",
    "car",
    "bus",
    "train",
    "boat",
    "bird",
    "cat",
    "dog",
    "horse",
    "sheep",
    "cow",
    "bottle",
    "chair",
]
mappings = {
    "motorcycle": "motorbike",
    "airplane": "aeroplane",
    "couch": "sofa",
    "potted plant": "pottedplant",
    "dining table": "diningtable",
    "tv": "tvmonitor",
}

# Map every remaining MS-COCO label to the Pascal-VOC "background" class.
for label in coco_dataset.categories()[dm.AnnotationType.label]:
    if label.name in identicals or label.name in mappings:
        continue
    mappings[label.name] = "background"

print(mappings)
coco_dataset.transform("remap_labels", mapping=mappings)

{'motorcycle': 'motorbike', 'airplane': 'aeroplane', 'couch': 'sofa', 'potted plant': 'pottedplant', 'dining table': 'diningtable', 'tv': 'tvmonitor', 'truck': 'background', 'traffic light': 'background', 'fire hydrant': 'background', 'stop sign': 'background', 'parking meter': 'background', 'bench': 'background', 'elephant': 'background', 'bear': 'background', 'zebra': 'background', 'giraffe': 'background', 'backpack': 'background', 'umbrella': 'background', 'handbag': 'background', 'tie': 'background', 'suitcase': 'background', 'frisbee': 'background', 'skis': 'background', 'snowboard': 'background', 'sports ball': 'background', 'kite': 'background', 'baseball bat': 'background', 'baseball glove': 'background', 'skateboard': 'background', 'surfboard': 'background', 'tennis racket': 'background', 'wine glass': 'background', 'cup': 'background', 'fork': 'background', 'knife': 'background', 'spoon': 'background', 'bowl': 'background', 'banana': 'background', 'apple': 'background', 'sandwich': 'background', 'orange': 'background', 'broccoli': 'background', 'carrot': 'background', 'hot dog': 'background', 'pizza': 'background', 'donut': 'background', 'cake': 'background', 'bed': 'background', 'toilet': 'background', 'laptop': 'background', 'mouse': 'background', 'remote': 'background', 'keyboard': 'background', 'cell phone': 'background', 'microwave': 'background', 'oven': 'background', 'toaster': 'background', 'sink': 'background', 'refrigerator': 'background', 'book': 'background', 'clock': 'background', 'vase': 'background', 'scissors': 'background', 'teddy bear': 'background', 'hair drier': 'background', 'toothbrush': 'background'}

[23]:

Dataset
        size=123287
        source_path=./coco_dataset
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=122218
        annotations_count=1915643
subsets
        train2017: # of items=118287, # of annotated items=117266, # of annotations=1836996, annotation types=['mask', 'polygon', 'bbox']
        val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['mask', 'polygon', 'bbox']
infos
        categories
        label: ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'background', 'boat', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'bottle', 'chair', 'sofa', 'pottedplant', 'diningtable', 'tvmonitor']
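
Before applying a remapping like this, it can help to check it against the target label list. The sketch below is plain Python and does not call Datumaro. It rebuilds the mapping from the label lists printed earlier and verifies that every target name exists in the Pascal-VOC labels. The helper name `build_mapping` and the `fallback` parameter are assumptions made for this illustration.

```python
COCO_LABELS = [
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train",
    "truck", "boat", "traffic light", "fire hydrant", "stop sign",
    "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow",
    "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag",
    "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite",
    "baseball bat", "baseball glove", "skateboard", "surfboard",
    "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
    "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot",
    "hot dog", "pizza", "donut", "cake", "chair", "couch", "potted plant",
    "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote",
    "keyboard", "cell phone", "microwave", "oven", "toaster", "sink",
    "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
    "hair drier", "toothbrush",
]
VOC_LABELS = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus",
    "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
    "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor", "ignored",
]
IDENTICALS = [
    "person", "bicycle", "car", "bus", "train", "boat", "bird", "cat", "dog",
    "horse", "sheep", "cow", "bottle", "chair",
]
RENAMES = {
    "motorcycle": "motorbike",
    "airplane": "aeroplane",
    "couch": "sofa",
    "potted plant": "pottedplant",
    "dining table": "diningtable",
    "tv": "tvmonitor",
}


def build_mapping(source_labels, identicals, renames, fallback="background"):
    """Send every label that is neither identical nor renamed to `fallback`."""
    mapping = dict(renames)
    for name in source_labels:
        if name not in identicals and name not in mapping:
            mapping[name] = fallback
    return mapping


mapping = build_mapping(COCO_LABELS, IDENTICALS, RENAMES)
to_background = [src for src, dst in mapping.items() if dst == "background"]

# Every name that survives the remapping should exist in the target label list.
targets = set(IDENTICALS) | set(mapping.values())
missing = sorted(targets - set(VOC_LABELS))

print(len(COCO_LABELS), len(to_background), missing)
```

An empty `missing` list means that the remapped MS-COCO labels form a subset of the Pascal-VOC labels, which is what the label list in the output above shows.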

Reindex Items#

To avoid id conflicts when merging, we reindex the items so that the two datasets use disjoint id ranges.

[24]:

coco_dataset.transform("reindex", start=0)
voc_dataset.transform("reindex", start=len(coco_dataset))

[24]:

Dataset
        size=5011
        source_path=./VOCdevkit/VOC2007
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=5011
        annotations_count=15662
subsets
        train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
        val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
infos
        categories
        label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored']
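
The choice of `start` values can be sketched in plain Python. This assumes that reindexing assigns consecutive integer ids beginning at `start`, as the parameter name suggests. The helper `reindex` below and the sample ids are hypothetical and stand in for the Datumaro transform only for illustration.

```python
def reindex(item_ids, start=0):
    """Return new consecutive integer ids for the items, beginning at `start`."""
    return list(range(start, start + len(item_ids)))


# Hypothetical original ids; the two datasets use unrelated naming schemes.
first_ids = ["000000000139", "000000000285", "000000000632"]
second_ids = ["000005", "000007"]

new_first = reindex(first_ids, start=0)
# Starting the second dataset at len(first) keeps the two ranges disjoint.
new_second = reindex(second_ids, start=len(first_ids))

print(new_first, new_second)
```

With the sizes reported above, this scheme would give the MS-COCO items ids 0 to 123286 and the Pascal-VOC items ids 123287 to 128297.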

Merge Heterogeneous Datasets#

Since we have already aligned the two datasets into a homogeneous form, we choose merge_policy="intersect" here.

[25]:

merged = dm.HLOps.merge(coco_dataset, voc_dataset, merge_policy="intersect")
print(merged)

Dataset
        size=128298
        source_path=None
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=127229
        annotations_count=1931305
subsets
        train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
        train2017: # of items=118287, # of annotated items=117266, # of annotations=1836996, annotation types=['mask', 'polygon', 'bbox']
        val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
        val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['mask', 'polygon', 'bbox']
infos
        categories
        label: ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'background', 'boat', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'bottle', 'chair', 'sofa', 'pottedplant', 'diningtable', 'tvmonitor', 'ignored']
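
The printed summary can be sanity-checked with a few lines of plain Python. The sketch below is not Datumaro's merge implementation. It only reproduces two observations from the output above: the merged label list equals the first dataset's labels followed by any labels that appear only in the second, and the item and annotation counts are the sums of the two inputs. The helper name `merge_label_names` is an assumption for this illustration.

```python
def merge_label_names(first, second):
    """Keep the order of `first`, then append names seen only in `second`."""
    merged = list(first)
    for name in second:
        if name not in merged:
            merged.append(name)
    return merged


coco_labels = [
    "person", "bicycle", "car", "motorbike", "aeroplane", "bus", "train",
    "background", "boat", "bird", "cat", "dog", "horse", "sheep", "cow",
    "bottle", "chair", "sofa", "pottedplant", "diningtable", "tvmonitor",
]
voc_labels = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus",
    "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
    "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor", "ignored",
]

merged_labels = merge_label_names(coco_labels, voc_labels)
print(merged_labels)

# Counts reported for the two inputs and for the merged dataset.
total_items = 123287 + 5011
total_annotations = 1915643 + 15662
print(total_items, total_annotations)
```

Only 'ignored' is new, because every other Pascal-VOC label already exists in the remapped MS-COCO label list.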

Split into Subsets#

For model training and evaluation, we now reorganize the merged data into train, val, and test subsets.

[26]:

merged.transform("random_split", splits=[("train", 0.5), ("val", 0.2), ("test", 0.3)])
print(merged)

Dataset
        size=128298
        source_path=None
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=127229
        annotations_count=1931305
subsets
        test: # of items=38490, # of annotated items=38173, # of annotations=580468, annotation types=['mask', 'polygon', 'bbox']
        train: # of items=64149, # of annotated items=63600, # of annotations=967532, annotation types=['mask', 'polygon', 'bbox']
        val: # of items=25659, # of annotated items=25456, # of annotations=383305, annotation types=['mask', 'polygon', 'bbox']
infos
        categories
        label: ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'background', 'boat', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'bottle', 'chair', 'sofa', 'pottedplant', 'diningtable', 'tvmonitor', 'ignored']
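
The subset sizes above can be reproduced with a simple ratio-based split. The following is a minimal sketch, not Datumaro's implementation. It assumes that each split except the last receives the floor of its share and that the last split receives the remainder, which happens to match the printed counts. The helper names `split_sizes` and `random_split` and the `seed` parameter are assumptions for this illustration.

```python
import random


def split_sizes(n_items, splits):
    """Floor each share except the last, which takes the remaining items."""
    sizes = {}
    assigned = 0
    for name, ratio in splits[:-1]:
        sizes[name] = int(n_items * ratio)
        assigned += sizes[name]
    last_name, _ = splits[-1]
    sizes[last_name] = n_items - assigned
    return sizes


def random_split(items, splits, seed=0):
    """Shuffle the items and cut them into consecutive, non-overlapping parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    parts, begin = {}, 0
    for name, size in split_sizes(len(shuffled), splits).items():
        parts[name] = shuffled[begin:begin + size]
        begin += size
    return parts


SPLITS = [("train", 0.5), ("val", 0.2), ("test", 0.3)]

sizes = split_sizes(128298, SPLITS)
print(sizes)

# A toy run on ten items: every item lands in exactly one subset.
parts = random_split(range(10), SPLITS)
print({name: len(part) for name, part in parts.items()})
```

Because the split is random over the whole merged dataset, each new subset mixes items from both sources, which is why all three subsets now report mask, polygon, and bbox annotation types.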