Merge Heterogeneous Datasets for Detection#

Datumaro supports merging heterogeneous datasets into a unified data format.

In this example, we import two heterogeneous detection datasets and export a merged dataset into a unified data format.

First, we import two datasets, MS-COCO and Pascal-VOC, and prepare them for merging with three steps: filtering out duplicates, remapping label names, and reindexing item ids.

Then, we perform the intersect merge operation and split the result into train, val, and test subsets for model training and evaluation.

[21]:
# Copyright (C) 2021 Intel Corporation
#
# SPDX-License-Identifier: MIT

import datumaro as dm

coco_path = "./coco_dataset"
coco_dataset = dm.Dataset.import_from(coco_path, format="coco_instances")

voc_path = "./VOCdevkit/VOC2007"
voc_dataset = dm.Dataset.import_from(voc_path, format="voc_detection")

print("MS-COCO dataset:")
print(coco_dataset)

print("Pascal-VOC dataset:")
print(voc_dataset)
WARNING:root:File './coco_dataset/annotations/image_info_test-dev2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/image_info_test2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/image_info_unlabeled2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/person_keypoints_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/captions_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
MS-COCO dataset:
Dataset
        size=123287
        source_path=./coco_dataset
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=122218
        annotations_count=1915643
subsets
        train2017: # of items=118287, # of annotated items=117266, # of annotations=1836996, annotation types=['mask', 'polygon', 'bbox']
        val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['mask', 'polygon', 'bbox']
infos
        categories
        label: ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']

Pascal-VOC dataset:
Dataset
        size=10022
        source_path=./VOCdevkit/VOC2007
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=10022
        annotations_count=31324
subsets
        train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
        trainval: # of items=5011, # of annotated items=5011, # of annotations=15662, annotation types=['bbox']
        val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
infos
        categories
        label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored']
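
Before aligning anything, it can help to see how far apart the two label vocabularies are. The sketch below is plain Python on the label names printed in the two summaries above; it makes no Datumaro calls, and the variable names are only illustrative.

```python
# Label names copied from the MS-COCO summary above.
coco_labels = [
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck",
    "boat", "traffic light", "fire hydrant", "stop sign", "parking meter", "bench",
    "bird", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra",
    "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",
    "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
    "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup",
    "fork", "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange",
    "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch",
    "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink",
    "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
    "hair drier", "toothbrush",
]

# Label names copied from the Pascal-VOC summary above.
voc_labels = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car",
    "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor", "ignored",
]

coco_names = set(coco_labels)
voc_names = set(voc_labels)

# Names spelled identically in both datasets need no renaming.
shared = [name for name in coco_labels if name in voc_names]

# Names that exist only on the Pascal-VOC side need an MS-COCO counterpart
# (or, for "background" and "ignored", have none at all).
voc_only = [name for name in voc_labels if name not in coco_names]

print(len(coco_labels), len(voc_labels))  # 80 22
print(len(shared), shared)
print(len(voc_only), voc_only)
```

The 14 shared names are the ones that can be kept as they are; the Pascal-VOC-only names show where spelling differs (for example `motorbike` and `diningtable`).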

Filter Duplicates#

Here, we drop the trainval subset from the Pascal-VOC data because it causes duplicates: it holds the same 5011 items as train (2501) and val (2510) combined.

[22]:
voc_dataset.filter('/item[subset!="trainval"]')
[22]:
Dataset
        size=5011
        source_path=./VOCdevkit/VOC2007
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=5011
        annotations_count=15662
subsets
        train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
        val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
infos
        categories
        label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored']
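
The subset sizes in the summaries above are consistent with trainval being the union of train and val. A quick arithmetic check in plain Python, using only the printed numbers (the dictionary name is illustrative):

```python
# Item counts per subset, copied from the Pascal-VOC summary before filtering.
voc_subset_sizes = {"train": 2501, "trainval": 5011, "val": 2510}

# trainval is exactly as large as train and val together ...
assert voc_subset_sizes["train"] + voc_subset_sizes["val"] == voc_subset_sizes["trainval"]

# ... so keeping all three subsets counts every image twice.
total_before = sum(voc_subset_sizes.values())

# Plain-Python analogue of the filter expression: keep everything except trainval.
kept = {name: count for name, count in voc_subset_sizes.items() if name != "trainval"}
total_after = sum(kept.values())

print(total_before, total_after)  # 10022 5011
```

The two totals match the `size` values reported before and after the filter.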

Transform - Remap Label Names#

Since many labels defined in Pascal-VOC data are also included in MS-COCO data, one-to-one mapping of most classes is possible.

Meanwhile, the remaining 60 MS-COCO labels, which cover roadside objects, animals, household items, foods, accessories, and kitchen utensils, are mapped to Pascal-VOC’s background class so that the two datasets share a single label set.

| MS-COCO | Pascal-VOC |
|---|---|
| person | person |
| bicycle | bicycle |
| car | car |
| motorcycle | motorbike |
| airplane | aeroplane |
| bus | bus |
| train | train |
| boat | boat |
| bird | bird |
| cat | cat |
| dog | dog |
| horse | horse |
| sheep | sheep |
| cow | cow |
| bottle | bottle |
| chair | chair |
| couch | sofa |
| potted plant | pottedplant |
| dining table | diningtable |
| tv | tvmonitor |
| others (60 classes) | background |

[23]:
identicals = [
    "person",
    "bicycle",
    "car",
    "bus",
    "train",
    "boat",
    "bird",
    "cat",
    "dog",
    "horse",
    "sheep",
    "cow",
    "bottle",
    "chair",
]
mappings = {
    "motorcycle": "motorbike",
    "airplane": "aeroplane",
    "couch": "sofa",
    "potted plant": "pottedplant",
    "dining table": "diningtable",
    "tv": "tvmonitor",
}

# Every MS-COCO label without a Pascal-VOC counterpart is sent to "background".
for label in coco_dataset.categories()[dm.AnnotationType.label]:
    if label.name in identicals or label.name in mappings:
        continue
    mappings[label.name] = "background"

print(mappings)
coco_dataset.transform("remap_labels", mapping=mappings)
{'motorcycle': 'motorbike', 'airplane': 'aeroplane', 'couch': 'sofa', 'potted plant': 'pottedplant', 'dining table': 'diningtable', 'tv': 'tvmonitor', 'truck': 'background', 'traffic light': 'background', 'fire hydrant': 'background', 'stop sign': 'background', 'parking meter': 'background', 'bench': 'background', 'elephant': 'background', 'bear': 'background', 'zebra': 'background', 'giraffe': 'background', 'backpack': 'background', 'umbrella': 'background', 'handbag': 'background', 'tie': 'background', 'suitcase': 'background', 'frisbee': 'background', 'skis': 'background', 'snowboard': 'background', 'sports ball': 'background', 'kite': 'background', 'baseball bat': 'background', 'baseball glove': 'background', 'skateboard': 'background', 'surfboard': 'background', 'tennis racket': 'background', 'wine glass': 'background', 'cup': 'background', 'fork': 'background', 'knife': 'background', 'spoon': 'background', 'bowl': 'background', 'banana': 'background', 'apple': 'background', 'sandwich': 'background', 'orange': 'background', 'broccoli': 'background', 'carrot': 'background', 'hot dog': 'background', 'pizza': 'background', 'donut': 'background', 'cake': 'background', 'bed': 'background', 'toilet': 'background', 'laptop': 'background', 'mouse': 'background', 'remote': 'background', 'keyboard': 'background', 'cell phone': 'background', 'microwave': 'background', 'oven': 'background', 'toaster': 'background', 'sink': 'background', 'refrigerator': 'background', 'book': 'background', 'clock': 'background', 'vase': 'background', 'scissors': 'background', 'teddy bear': 'background', 'hair drier': 'background', 'toothbrush': 'background'}
[23]:
Dataset
        size=123287
        source_path=./coco_dataset
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=122218
        annotations_count=1915643
subsets
        train2017: # of items=118287, # of annotated items=117266, # of annotations=1836996, annotation types=['mask', 'polygon', 'bbox']
        val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['mask', 'polygon', 'bbox']
infos
        categories
        label: ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'background', 'boat', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'bottle', 'chair', 'sofa', 'pottedplant', 'diningtable', 'tvmonitor']
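
A typo in a target name (for example `potted plant` instead of `pottedplant`) would leave a stray label that never lines up with Pascal-VOC. One way to guard against that is a small check before calling the transform, sketched below in plain Python with the label names from this page. The helper `check_targets` is made up for this example and is not part of Datumaro.

```python
# Pascal-VOC label names, copied from the dataset summary.
voc_labels = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car",
    "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor", "ignored",
]

# Same contents as in the remapping cell above.
identicals = [
    "person", "bicycle", "car", "bus", "train", "boat", "bird",
    "cat", "dog", "horse", "sheep", "cow", "bottle", "chair",
]
renames = {
    "motorcycle": "motorbike",
    "airplane": "aeroplane",
    "couch": "sofa",
    "potted plant": "pottedplant",
    "dining table": "diningtable",
    "tv": "tvmonitor",
}


def check_targets(identicals, renames, target_labels):
    """Return the target names that do not exist in target_labels (hypothetical helper)."""
    targets = set(identicals) | set(renames.values()) | {"background"}
    return sorted(targets - set(target_labels))


# Every name we map to exists on the Pascal-VOC side.
missing = check_targets(identicals, renames, voc_labels)
print(missing)  # []

# Together, the kept and renamed labels cover all 20 Pascal-VOC object classes.
voc_objects = set(voc_labels) - {"background", "ignored"}
covered = set(identicals) | set(renames.values())
print(covered == voc_objects, len(covered))  # True 20

# A misspelled target is reported instead of silently creating a new label.
bad = check_targets(identicals, {"potted plant": "potted plant"}, voc_labels)
print(bad)  # ['potted plant']
```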

Reindex Items#

To avoid id conflicts when merging, we reindex the items so that the two datasets use disjoint id ranges: MS-COCO starts at 0, and Pascal-VOC starts right after the last MS-COCO id.

[24]:
coco_dataset.transform("reindex", start=0)
voc_dataset.transform("reindex", start=len(coco_dataset))
[24]:
Dataset
        size=5011
        source_path=./VOCdevkit/VOC2007
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=5011
        annotations_count=15662
subsets
        train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
        val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
infos
        categories
        label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored']
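
The effect of the two `start` values can be pictured with plain ranges. This sketch only illustrates the intended id layout, using the dataset sizes printed on this page; it does not describe how Datumaro assigns ids internally.

```python
n_coco = 123287  # size of the MS-COCO dataset
n_voc = 5011     # size of the filtered Pascal-VOC dataset

coco_ids = range(0, n_coco)              # layout for start=0
voc_ids = range(n_coco, n_coco + n_voc)  # layout for start=len(coco_dataset)

# The two ranges are adjacent but never overlap.
assert coco_ids[-1] + 1 == voc_ids[0]
assert set(coco_ids).isdisjoint(voc_ids)

print(coco_ids[-1], voc_ids[0], voc_ids[-1])  # 123286 123287 128297
print(len(coco_ids) + len(voc_ids))           # 128298
```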

Merge Heterogeneous Datasets#

Now that the two datasets are aligned to a common label set and use disjoint item ids, we merge them with merge_policy="intersect".

[25]:
merged = dm.HLOps.merge(coco_dataset, voc_dataset, merge_policy="intersect")
print(merged)
Dataset
        size=128298
        source_path=None
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=127229
        annotations_count=1931305
subsets
        train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
        train2017: # of items=118287, # of annotated items=117266, # of annotations=1836996, annotation types=['mask', 'polygon', 'bbox']
        val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
        val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['mask', 'polygon', 'bbox']
infos
        categories
        label: ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'background', 'boat', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'bottle', 'chair', 'sofa', 'pottedplant', 'diningtable', 'tvmonitor', 'ignored']
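
The merged summary can be cross-checked against the two inputs. The sketch below is plain Python on the numbers and label lists printed on this page. It confirms that the counts add up and that the merged label list equals the remapped MS-COCO list followed by the one Pascal-VOC label it lacked. It describes the observed result only and makes no claim about how the intersect policy is implemented.

```python
# Counts copied from the two dataset summaries before merging.
coco = {"size": 123287, "annotated": 122218, "annotations": 1915643}
voc = {"size": 5011, "annotated": 5011, "annotations": 15662}

# With disjoint item ids, nothing collapses, so the counts simply add up.
merged_counts = {key: coco[key] + voc[key] for key in coco}
print(merged_counts)
# {'size': 128298, 'annotated': 127229, 'annotations': 1931305}

# Label lists copied from the summaries after remapping and filtering.
coco_remapped_labels = [
    "person", "bicycle", "car", "motorbike", "aeroplane", "bus", "train",
    "background", "boat", "bird", "cat", "dog", "horse", "sheep", "cow",
    "bottle", "chair", "sofa", "pottedplant", "diningtable", "tvmonitor",
]
voc_labels = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car",
    "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor", "ignored",
]

# Order-preserving union: dict keys keep first-seen order and drop repeats.
merged_labels = list(dict.fromkeys(coco_remapped_labels + voc_labels))
print(len(merged_labels), merged_labels[-1])  # 22 ignored
```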

Split into Subsets#

To prepare for training and evaluation, we now reorganize the merged data into train, val, and test subsets with a 50/20/30 ratio.

[26]:
merged.transform("random_split", splits=[("train", 0.5), ("val", 0.2), ("test", 0.3)])
print(merged)
Dataset
        size=128298
        source_path=None
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=127229
        annotations_count=1931305
subsets
        test: # of items=38490, # of annotated items=38173, # of annotations=580468, annotation types=['mask', 'polygon', 'bbox']
        train: # of items=64149, # of annotated items=63600, # of annotations=967532, annotation types=['mask', 'polygon', 'bbox']
        val: # of items=25659, # of annotated items=25456, # of annotations=383305, annotation types=['mask', 'polygon', 'bbox']
infos
        categories
        label: ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'background', 'boat', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'bottle', 'chair', 'sofa', 'pottedplant', 'diningtable', 'tvmonitor', 'ignored']
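
The subset sizes above follow from the ratios. The sketch below shows one simple rounding scheme (truncate each share and give the remainder to the last subset) that happens to reproduce the printed sizes. The function `split_sizes` is hypothetical, and Datumaro's own rounding and shuffling may differ.

```python
import math
import random


def split_sizes(total, splits):
    """Turn (name, ratio) pairs into item counts that sum to total (hypothetical helper)."""
    if not math.isclose(sum(ratio for _, ratio in splits), 1.0):
        raise ValueError("split ratios must sum to 1")
    # Truncate every share except the last one ...
    sizes = {name: int(total * ratio) for name, ratio in splits[:-1]}
    # ... and let the last subset absorb whatever truncation left over.
    last_name = splits[-1][0]
    sizes[last_name] = total - sum(sizes.values())
    return sizes


splits = [("train", 0.5), ("val", 0.2), ("test", 0.3)]
sizes = split_sizes(128298, splits)
print(sizes)  # {'train': 64149, 'val': 25659, 'test': 38490}

# Assign shuffled item indices to subsets; a fixed seed keeps the split repeatable.
indices = list(range(128298))
random.Random(0).shuffle(indices)

assignment = {}
offset = 0
for name, count in sizes.items():
    assignment[name] = indices[offset:offset + count]
    offset += count

print({name: len(items) for name, items in assignment.items()})
```

Because each index lands in exactly one slice, the three subsets are disjoint and together cover every item.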