# Merge Heterogeneous Datasets for Detection

[![Jupyter Notebook](https://img.shields.io/badge/jupyter-%23FA0F00.svg?style=for-the-badge&logo=jupyter&logoColor=white)](https://github.com/openvinotoolkit/datumaro/blob/develop/notebooks/02_merge_heterogeneous_datasets_for_detection.ipynb)

Datumaro supports merging heterogeneous datasets into a unified data format.

In this example, we import two heterogeneous detection datasets, MS-COCO and Pascal-VOC, and export the merged dataset into a unified data format.

First, we import the two datasets and prepare them for merging: we `filter` out duplicates, `remap` labels, and `reindex` item ids.

Then, we perform the `intersect` merge operation and split the result into `train`, `val`, and `test` subsets for model training and evaluation.

```python
# Copyright (C) 2021 Intel Corporation
#
# SPDX-License-Identifier: MIT

import datumaro as dm

coco_path = "./coco_dataset"
coco_dataset = dm.Dataset.import_from(coco_path, format="coco_instances")

voc_path = "./VOCdevkit/VOC2007"
voc_dataset = dm.Dataset.import_from(voc_path, format="voc_detection")

print("MS-COCO dataset:")
print(coco_dataset)

print("Pascal-VOC dataset:")
print(voc_dataset)
```

```text
MS-COCO dataset:
Dataset
	size=123287
	source_path=./coco_dataset
	media_type=<class 'datumaro.components.media.Image'>
	annotated_items_count=122218
	annotations_count=1915643
subsets
	train2017: # of items=118287, # of annotated items=117266, # of annotations=1836996, annotation types=['mask', 'polygon', 'bbox']
	val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['mask', 'polygon', 'bbox']
infos
	categories
	label: ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'oran
```

## Filter Duplicates

Here, we drop the `trainval` subset from the Pascal-VOC data because it causes duplicates: its items also appear in the `train` and `val` subsets.

```python
voc_dataset.filter('/item[subset!="trainval"]')
```

```text
Dataset
	size=5011
	source_path=./VOCdevkit/VOC2007
	media_type=<class 'datumaro.components.media.Image'>
	annotated_items_count=5011
	annotations_count=15662
subsets
	train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
	val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
infos
	categories
	label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored']
```
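
To see why `trainval` causes duplicates, the sketch below mimics the filter on a toy list of `(item id, subset)` pairs in plain Python. The item ids are made up, and `drop_subset` and `duplicated_ids` are hypothetical helpers rather than Datumaro functions. The only assumption is that `trainval` repeats the items of `train` and `val`, which agrees with the sizes above (2501 + 2510 = 5011).

```python
from collections import Counter

# Toy items as (item id, subset) pairs; the ids are invented for illustration.
items = [
    ("000005", "train"),
    ("000007", "val"),
    ("000005", "trainval"),
    ("000007", "trainval"),
]


def drop_subset(pairs, subset):
    """Keep only the pairs that do not belong to `subset`."""
    return [(item_id, name) for item_id, name in pairs if name != subset]


def duplicated_ids(pairs):
    """Return the ids that occur more than once, in sorted order."""
    counts = Counter(item_id for item_id, _ in pairs)
    return sorted(item_id for item_id, n in counts.items() if n > 1)


filtered = drop_subset(items, "trainval")
print(duplicated_ids(items))     # ['000005', '000007']
print(duplicated_ids(filtered))  # []
```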

## Transform - Remap Label Names

Since many labels defined in Pascal-VOC are also present in MS-COCO, most classes can be mapped one-to-one.

The remaining 61 MS-COCO labels, which cover roadside objects, animals, household items, foods, accessories, and kitchen utensils, are mapped to Pascal-VOC's `background` class so that both datasets can be merged into a single unified dataset.

<table class="blueTable">
<thead>
<tr>
<th>MS-COCO</th>
<th>Pascal-VOC</th>
</tr>
</thead>
<tbody>
<tr>
<td>person</td>
<td>person</td>
</tr>
<tr>
<td>bicycle</td>
<td>bicycle</td>
</tr>
<tr>
<td>car</td>
<td>car</td>
</tr>
<tr>
<td>motorcycle</td>
<td>motorbike</td>
</tr>
<tr>
<td>airplane</td>
<td>aeroplane</td>
</tr>
<tr>
<td>bus</td>
<td>bus</td>
</tr>
<tr>
<td>train</td>
<td>train</td>
</tr>
<tr>
<td>boat</td>
<td>boat</td>
</tr>
<tr>
<td>bird</td>
<td>bird</td>
</tr>
<tr>
<td>cat</td>
<td>cat</td>
</tr>
<tr>
<td>dog</td>
<td>dog</td>
</tr>
<tr>
<td>horse</td>
<td>horse</td>
</tr>
<tr>
<td>sheep</td>
<td>sheep</td>
</tr>
<tr>
<td>cow</td>
<td>cow</td>
</tr>
<tr>
<td>bottle</td>
<td>bottle</td>
</tr>
<tr>
<td>chair</td>
<td>chair</td>
</tr>
<tr>
<td>couch</td>
<td>sofa</td>
</tr>
<tr>
<td>potted plant</td>
<td>pottedplant</td>
</tr>
<tr>
<td>dining table</td>
<td>diningtable</td>
</tr>
<tr>
<td>tv</td>
<td>tvmonitor</td>
</tr>
<tr>
<td>others (61 classes)</td>
<td>background</td>
</tr>
</tbody>
</table>
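
The table can be cross-checked with plain Python sets. In the sketch below, `voc_labels` is copied from the Pascal-VOC label list printed above (without `background` and `ignored`), and `coco_labels` holds the 20 MS-COCO names from the left column of the table; neither is read from Datumaro. The set difference gives the Pascal-VOC names that have no identically named MS-COCO label and therefore need an explicit rename.

```python
voc_labels = {
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
}
coco_labels = {
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train",
    "boat", "bird", "cat", "dog", "horse", "sheep", "cow", "bottle", "chair",
    "couch", "potted plant", "dining table", "tv",
}

# Names spelled the same way in both datasets need no rename.
same_name = sorted(voc_labels & coco_labels)
# Pascal-VOC names without an identically spelled MS-COCO counterpart.
needs_rename = sorted(voc_labels - coco_labels)

print(len(same_name), "labels share a name")
print("Pascal-VOC names that need a rename:", needs_rename)
```

The 14 shared names and 6 renamed ones correspond to the `identicals` list and the initial `mappings` dictionary used in the next cell.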

```python
# MS-COCO labels that have the same name in Pascal-VOC.
# They are left out of the mapping and keep their names.
identicals = [
    "person",
    "bicycle",
    "car",
    "bus",
    "train",
    "boat",
    "bird",
    "cat",
    "dog",
    "horse",
    "sheep",
    "cow",
    "bottle",
    "chair",
]
# MS-COCO labels that exist in Pascal-VOC under a different name.
mappings = {
    "motorcycle": "motorbike",
    "airplane": "aeroplane",
    "couch": "sofa",
    "potted plant": "pottedplant",
    "dining table": "diningtable",
    "tv": "tvmonitor",
}

# Every remaining MS-COCO label is mapped to Pascal-VOC's "background" class.
for label in coco_dataset.categories()[dm.AnnotationType.label]:
    if label.name in identicals or label.name in mappings:
        continue
    mappings[label.name] = "background"

print(mappings)
coco_dataset.transform("remap_labels", mapping=mappings)
```

```text
{'motorcycle': 'motorbike', 'airplane': 'aeroplane', 'couch': 'sofa', 'potted plant': 'pottedplant', 'dining table': 'diningtable', 'tv': 'tvmonitor', 'truck': 'background', 'traffic light': 'background', 'fire hydrant': 'background', 'stop sign': 'background', 'parking meter': 'background', 'bench': 'background', 'elephant': 'background', 'bear': 'background', 'zebra': 'background', 'giraffe': 'background', 'backpack': 'background', 'umbrella': 'background', 'handbag': 'background', 'tie': 'background', 'suitcase': 'background', 'frisbee': 'background', 'skis': 'background', 'snowboard': 'background', 'sports ball': 'background', 'kite': 'background', 'baseball bat': 'background', 'baseball glove': 'background', 'skateboard': 'background', 'surfboard': 'background', 'tennis racket': 'background', 'wine glass': 'background', 'cup': 'background', 'fork': 'background', 'knife': 'background', 'spoon': 'background', 'bowl': 'background', 'banana': 'background', 'apple': 'background', 'sand

Dataset
	size=123287
	source_path=./coco_dataset
	media_type=<class 'datumaro.components.media.Image'>
	annotated_items_count=122218
	annotations_count=1915643
subsets
	train2017: # of items=118287, # of annotated items=117266, # of annotations=1836996, annotation types=['mask', 'polygon', 'bbox']
	val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['mask', 'polygon', 'bbox']
infos
	categories
	label: ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'background', 'boat', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'bottle', 'chair', 'sofa', 'pottedplant', 'diningtable', 'tvmonitor']
```
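
The position of `background` in the remapped label list can be explained with a small, Datumaro-free sketch. It applies a mapping to the first ten MS-COCO labels printed earlier and keeps each resulting name the first time it appears. That Datumaro's `remap_labels` builds its label list this way is an assumption, and `remap_preserving_order` is a hypothetical helper. The sketch is merely consistent with the output above, where `background` takes the slot of `truck`.

```python
def remap_preserving_order(labels, mapping):
    """Rename labels via `mapping` and drop repeats, keeping first-seen order."""
    remapped = []
    for name in labels:
        new_name = mapping.get(name, name)  # unmapped labels keep their name
        if new_name not in remapped:
            remapped.append(new_name)
    return remapped


# The first ten MS-COCO labels, in the order printed at import time.
coco_head = [
    "person", "bicycle", "car", "motorcycle", "airplane",
    "bus", "train", "truck", "boat", "traffic light",
]
# The subset of the notebook's mapping that touches these ten labels.
mapping_head = {
    "motorcycle": "motorbike",
    "airplane": "aeroplane",
    "truck": "background",
    "traffic light": "background",
}

remapped_head = remap_preserving_order(coco_head, mapping_head)
print(remapped_head)
```

Ten input labels yield nine output labels, because `truck` and `traffic light` collapse into the single `background` entry.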

## Reindex Items

To avoid `id` conflicts when merging, we reindex the items so that the two datasets use mutually exclusive id ranges.

```python
coco_dataset.transform("reindex", start=0)
voc_dataset.transform("reindex", start=len(coco_dataset))
```

```text
Dataset
	size=5011
	source_path=./VOCdevkit/VOC2007
	media_type=<class 'datumaro.components.media.Image'>
	annotated_items_count=5011
	annotations_count=15662
subsets
	train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
	val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
infos
	categories
	label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored']
```
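
The effect of the two `reindex` calls can be sketched without Datumaro. The `reindex` helper below is hypothetical and assumes that reindexing assigns consecutive integer ids beginning at `start`; the file names are invented. Starting the second dataset at the length of the first guarantees that the two id ranges do not overlap.

```python
def reindex(items, start):
    """Assign consecutive integer ids to `items`, beginning at `start`."""
    return {start + offset: item for offset, item in enumerate(items)}


coco_items = ["coco_a.jpg", "coco_b.jpg", "coco_c.jpg"]
voc_items = ["voc_a.jpg", "voc_b.jpg"]

coco_ids = reindex(coco_items, start=0)
voc_ids = reindex(voc_items, start=len(coco_items))

print(sorted(coco_ids))  # [0, 1, 2]
print(sorted(voc_ids))   # [3, 4]

# The two id ranges share no element, so a merge cannot confuse items.
assert not set(coco_ids) & set(voc_ids)
```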

## Merge Heterogeneous Datasets

Since we have already aligned the two datasets into a homogeneous form, we merge them with `merge_policy="intersect"`.

```python
merged = dm.HLOps.merge(coco_dataset, voc_dataset, merge_policy="intersect")
print(merged)
```

```text
Dataset
	size=128298
	source_path=None
	media_type=<class 'datumaro.components.media.Image'>
	annotated_items_count=127229
	annotations_count=1931305
subsets
	train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
	train2017: # of items=118287, # of annotated items=117266, # of annotations=1836996, annotation types=['mask', 'polygon', 'bbox']
	val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
	val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['mask', 'polygon', 'bbox']
infos
	categories
	label: ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'background', 'boat', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'bottle', 'chair', 'sofa', 'pottedplant', 'diningtable', 'tvmonitor', 'ignored']
```
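
As a quick sanity check on the printed summaries: the item ids are disjoint after reindexing, so no items should collapse during the merge, and the merged totals should equal the sums of the two inputs. The numbers below are copied by hand from the outputs in this notebook; the dictionary names are only illustrative.

```python
coco_summary = {"size": 123287, "annotated_items": 122218, "annotations": 1915643}
voc_summary = {"size": 5011, "annotated_items": 5011, "annotations": 15662}
merged_summary = {"size": 128298, "annotated_items": 127229, "annotations": 1931305}

# Compare each merged total with the sum of the two source datasets.
checks = {}
for key, expected in merged_summary.items():
    total = coco_summary[key] + voc_summary[key]
    checks[key] = total == expected
    print(f"{key}: {coco_summary[key]} + {voc_summary[key]} = {total}")

print(checks)
```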



## Split into Subsets

For model training and evaluation, we now reorganize the merged data into `train`, `val`, and `test` subsets.

```python
merged.transform("random_split", splits=[("train", 0.5), ("val", 0.2), ("test", 0.3)])
print(merged)
```

```text
Dataset
	size=128298
	source_path=None
	media_type=<class 'datumaro.components.media.Image'>
	annotated_items_count=127229
	annotations_count=1931305
subsets
	test: # of items=38490, # of annotated items=38173, # of annotations=580468, annotation types=['mask', 'polygon', 'bbox']
	train: # of items=64149, # of annotated items=63600, # of annotations=967532, annotation types=['mask', 'polygon', 'bbox']
	val: # of items=25659, # of annotated items=25456, # of annotations=383305, annotation types=['mask', 'polygon', 'bbox']
infos
	categories
	label: ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'background', 'boat', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'bottle', 'chair', 'sofa', 'pottedplant', 'diningtable', 'tvmonitor', 'ignored']
```
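
The subset sizes follow from the split ratios. The sketch below is a generic seeded random split, not Datumaro's implementation: `split_by_ratio` is a hypothetical helper that shuffles item ids, gives every split but the last the floor of its share, and assigns the remainder to the last split. Under that assumption it yields the same subset sizes as the output above for 128298 items.

```python
import random


def split_by_ratio(ids, splits, seed=0):
    """Shuffle `ids` and cut them into named subsets by ratio.

    Every split except the last gets floor(ratio * n) items; the last
    split takes whatever remains so that no item is lost.
    """
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    result, begin = {}, 0
    for index, (name, ratio) in enumerate(splits):
        is_last = index == len(splits) - 1
        end = len(ids) if is_last else begin + int(len(ids) * ratio)
        result[name] = ids[begin:end]
        begin = end
    return result


subsets = split_by_ratio(
    range(128298), [("train", 0.5), ("val", 0.2), ("test", 0.3)]
)
sizes = {name: len(members) for name, members in subsets.items()}
print(sizes)  # {'train': 64149, 'val': 25659, 'test': 38490}
```

Fixing the seed makes the split reproducible, which helps when results from different runs need to be compared.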

