# Transform Dataset: Re-id, Reindexing, Remapping, etc.

[![Jupyter Notebook](https://img.shields.io/badge/jupyter-%23FA0F00.svg?style=for-the-badge&logo=jupyter&logoColor=white)](https://github.com/openvinotoolkit/datumaro/blob/develop/notebooks/05_transform.ipynb)

In this notebook example, we take a look at the Datumaro transform API. Transforms support splitting and merging subsets, redefining annotation information, re-identifying media, and changing the task by converting the annotation format, e.g., from masks to polygons, from bounding boxes to masks, or from shapes to bounding boxes.

## Prerequisite
### Download COCO 2017 validation dataset

Please refer to https://github.com/openvinotoolkit/datumaro/blob/develop/notebooks/03_visualize.ipynb to prepare the COCO 2017 validation dataset.

In [2]:
# Copyright (C) 2022 Intel Corporation
#
# SPDX-License-Identifier: MIT

import datumaro as dm

dataset = dm.Dataset.import_from("coco_dataset", format="coco_instances")

print("Representation for sample COCO dataset")
dataset



Representation for sample COCO dataset


Dataset
	size=123287
	source_path=coco_dataset
	media_type=
	annotated_items_count=122218
	annotations_count=1018861
subsets
	train2017: # of items=118287, # of annotated items=117266, # of annotations=976995, annotation types=['mask', 'polygon']
	val2017: # of items=5000, # of annotated items=4952, # of annotations=41866, annotation types=['mask', 'polygon']
categories
	label: ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'p

### Transform media ID

We first modify the `media_id` through a transformation. The original `media_id` values are listed below.

In [3]:
subsets = list(dataset.subsets().keys())
print("Subset candidates:", subsets)


def get_ids(dataset: dm.Dataset, subset: str):
    # Collect the IDs of all items that belong to the given subset
    ids = []
    for item in dataset:
        if item.subset == subset:
            ids += [item.id]

    return ids


get_ids(dataset, subsets[0])

Subset candidates: ['val2017', 'train2017']


['000000397133',
 '000000037777',
 '000000252219',
 '000000087038',
 '000000174482',
 '000000403385',
 '000000006818',
 '000000480985',
 '000000458054',
 '000000331352',
 '000000296649',
 '000000386912',
 '000000502136',
 '000000491497',
 '000000184791',
 '000000348881',
 '000000289393',
 '000000522713',
 '000000181666',
 '000000017627',
 '000000143931',
 '000000303818',
 '000000463730',
 '000000460347',
 '000000322864',
 '000000226111',
 '000000153299',
 '000000308394',
 '000000456496',
 '000000058636',
 '000000041888',
 '000000184321',
 '000000565778',
 '000000297343',
 '000000336587',
 '000000122745',
 '000000219578',
 '000000555705',
 '000000443303',
 '000000500663',
 '000000418281',
 '000000025560',
 '000000403817',
 '000000085329',
 '000000329323',
 '000000239274',
 '000000286994',
 '000000511321',
 '000000314294',
 '000000233771',
 '000000475779',
 '000000301867',
 '000000312421',
 '000000185250',
 '000000356427',
 '000000572517',
 '000000270244',
 '000000516316',
 '000000125211

Here we apply the `reindex` transformation, which replaces each `media_id` with an incrementing integer beginning at `start`.

In [4]:
reindexing_dataset = dataset.transform("reindex", start=0)
get_ids(reindexing_dataset, subsets[0])

['0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21',
 '22',
 '23',
 '24',
 '25',
 '26',
 '27',
 '28',
 '29',
 '30',
 '31',
 '32',
 '33',
 '34',
 '35',
 '36',
 '37',
 '38',
 '39',
 '40',
 '41',
 '42',
 '43',
 '44',
 '45',
 '46',
 '47',
 '48',
 '49',
 '50',
 '51',
 '52',
 '53',
 '54',
 '55',
 '56',
 '57',
 '58',
 '59',
 '60',
 '61',
 '62',
 '63',
 '64',
 '65',
 '66',
 '67',
 '68',
 '69',
 '70',
 '71',
 '72',
 '73',
 '74',
 '75',
 '76',
 '77',
 '78',
 '79',
 '80',
 '81',
 '82',
 '83',
 '84',
 '85',
 '86',
 '87',
 '88',
 '89',
 '90',
 '91',
 '92',
 '93',
 '94',
 '95',
 '96',
 '97',
 '98',
 '99',
 '100',
 '101',
 '102',
 '103',
 '104',
 '105',
 '106',
 '107',
 '108',
 '109',
 '110',
 '111',
 '112',
 '113',
 '114',
 '115',
 '116',
 '117',
 '118',
 '119',
 '120',
 '121',
 '122',
 '123',
 '124',
 '125',
 '126',
 '127',
 '128',
 '129',
 '130',
 '131',
 '132',
 '133',
 '134',
 '135',
 '136',
 '137',
 '138'
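
The effect of `reindex` on the IDs can be pictured with a few lines of plain Python. This is only a sketch that mirrors the output above, not Datumaro's implementation; `reindex_ids` is a hypothetical helper introduced for illustration.

```python
def reindex_ids(ids, start=0):
    """Replace each ID with an incrementing integer (as a string), beginning at `start`."""
    # The original ID values are ignored; only their order matters.
    return [str(start + offset) for offset in range(len(ids))]


# The first three original IDs from the listing above
original_ids = ["000000397133", "000000037777", "000000252219"]
reindexed = reindex_ids(original_ids, start=0)
print(reindexed)
```

Passing a different `start` shifts the whole sequence, which can help when IDs from several sources must not collide.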

By applying `id_from_image_name`, we can roll the `media_id` back to the media file name.

In [5]:
rollback_dataset = dataset.transform("id_from_image_name")
get_ids(rollback_dataset, subsets[0])

['000000397133',
 '000000037777',
 '000000252219',
 '000000087038',
 '000000174482',
 '000000403385',
 '000000006818',
 '000000480985',
 '000000458054',
 '000000331352',
 '000000296649',
 '000000386912',
 '000000502136',
 '000000491497',
 '000000184791',
 '000000348881',
 '000000289393',
 '000000522713',
 '000000181666',
 '000000017627',
 '000000143931',
 '000000303818',
 '000000463730',
 '000000460347',
 '000000322864',
 '000000226111',
 '000000153299',
 '000000308394',
 '000000456496',
 '000000058636',
 '000000041888',
 '000000184321',
 '000000565778',
 '000000297343',
 '000000336587',
 '000000122745',
 '000000219578',
 '000000555705',
 '000000443303',
 '000000500663',
 '000000418281',
 '000000025560',
 '000000403817',
 '000000085329',
 '000000329323',
 '000000239274',
 '000000286994',
 '000000511321',
 '000000314294',
 '000000233771',
 '000000475779',
 '000000301867',
 '000000312421',
 '000000185250',
 '000000356427',
 '000000572517',
 '000000270244',
 '000000516316',
 '000000125211
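
Conceptually, recovering an ID from the media name amounts to taking the file name without its directory and extension. The sketch below assumes that behavior; the helper name and the image path are hypothetical and only serve as an illustration.

```python
import os.path as osp


def id_from_image_name(image_path):
    """Derive an item ID from a media path: drop the directory and the extension."""
    return osp.splitext(osp.basename(image_path))[0]


# Hypothetical location of one validation image (the directory layout is an assumption)
image_path = "coco_dataset/images/val2017/000000397133.jpg"
recovered_id = id_from_image_name(image_path)
print(recovered_id)
```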

### Transform annotation

For task changing or for merging multiple heterogeneous datasets, we need to redefine the class definitions. Datumaro provides this class redefinition through `remap_labels`, as shown below.

In [6]:
mapping = {"motorcycle": "bicycle", "bus": "car", "truck": "car"}
remap_label_dataset = dataset.transform("remap_labels", mapping=mapping)
remap_label_dataset

Dataset
	size=123287
	source_path=coco_dataset
	media_type=
	annotated_items_count=122218
	annotations_count=1018861
subsets
	train2017: # of items=118287, # of annotated items=117266, # of annotations=976995, annotation types=['mask', 'polygon']
	val2017: # of items=5000, # of annotated items=4952, # of annotations=41866, annotation types=['mask', 'polygon']
categories
	label: ['person', 'bicycle', 'car', 'airplane', 'train', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining t
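
To see why `motorcycle`, `bus`, and `truck` disappear from the label list, it may help to trace the remapping by hand. The following plain-Python sketch assumes that unmapped labels are kept and that merged labels take the index of their target; `remap_label_names` is a hypothetical helper, not a Datumaro function, and it only reproduces the ordering seen above when each target precedes its sources.

```python
def remap_label_names(labels, mapping):
    """Return the new label list and a map from old label index to new label index."""
    new_labels = []
    index_map = {}
    for old_index, name in enumerate(labels):
        # Labels missing from the mapping keep their own name
        target = mapping.get(name, name)
        if target not in new_labels:
            new_labels.append(target)
        index_map[old_index] = new_labels.index(target)
    return new_labels, index_map


# The first nine COCO labels from the dataset representation
labels = ["person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat"]
mapping = {"motorcycle": "bicycle", "bus": "car", "truck": "car"}

new_labels, index_map = remap_label_names(labels, mapping)
print(new_labels)
print(index_map)
```

The index map is what annotations need: every annotation that pointed at an old label index has to be rewritten to the new one.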

### Split datasets

From now on, we give examples of extracting a subset of the imported dataset and splitting it into multiple subsets. Datumaro provides two types of splitters: a per-sample random splitter driven by the given subset ratios, and a task-specific splitter that takes annotation instances into account.

We first extract the validation dataset and split it into multiple cross-validation datasets.

In [7]:
# from datumaro.components.dataset import Dataset

val_dataset = dataset.filter(
    '/item[subset="val2017"]'
)  # or Dataset(dataset.get_subset(subsets[0]))
val_dataset

Dataset
	size=5000
	source_path=coco_dataset
	media_type=
	annotated_items_count=4952
	annotations_count=41866
subsets
	val2017: # of items=5000, # of annotated items=4952, # of annotations=41866, annotation types=['mask', 'polygon']
categories
	label: ['person', 'bicycle', 'car', 'airplane', 'train', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigera
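
The filter expression is an XPath query over items. The sketch below shows how such a predicate selects elements, using Python's standard `xml.etree.ElementTree`. The XML layout (an `item` element with `id` and `subset` children) is an assumption for illustration, not necessarily the exact encoding Datumaro uses, and `example_train_item` is a made-up ID.

```python
import xml.etree.ElementTree as ET

# Illustrative (id, subset) pairs; the train ID is invented for this sketch
items = [
    ("000000397133", "val2017"),
    ("example_train_item", "train2017"),
    ("000000037777", "val2017"),
]

# Build a small XML tree with one <item> element per dataset item
root = ET.Element("dataset")
for item_id, subset in items:
    item = ET.SubElement(root, "item")
    ET.SubElement(item, "id").text = item_id
    ET.SubElement(item, "subset").text = subset

# Keep only items whose <subset> child has the text "val2017"
matched = root.findall("./item[subset='val2017']")
matched_ids = [item.findtext("id") for item in matched]
print(matched_ids)
```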

In [8]:
splits = (("val1", 0.2), ("val2", 0.2), ("val3", 0.2), ("val4", 0.2), ("val5", 0.2))
crossval_dataset = val_dataset.transform("random_split", splits=splits)
crossval_dataset

Dataset
	size=5000
	source_path=coco_dataset
	media_type=
	annotated_items_count=4952
	annotations_count=41866
subsets
	val1: # of items=1000, # of annotated items=991, # of annotations=8344, annotation types=['mask', 'polygon']
	val2: # of items=1000, # of annotated items=991, # of annotations=7646, annotation types=['mask', 'polygon']
	val3: # of items=1000, # of annotated items=993, # of annotations=8625, annotation types=['mask', 'polygon']
	val4: # of items=1000, # of annotated items=986, # of annotations=8752, annotation types=['mask', 'polygon']
	val5: # of items=1000, # of annotated items=991, # of annotations=8499, annotation types=['mask', 'polygon']
categories
	label: ['person', 'bicycle', 'car', 'airplane', 'train', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports b
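
A ratio-based random split can be sketched as "shuffle, then cut at cumulative ratios". This is a minimal illustration under that assumption, not Datumaro's implementation; the function name, the `seed` parameter, and the placeholder IDs are hypothetical.

```python
import random


def random_split_ids(ids, splits, seed=0):
    """Shuffle IDs and cut them into named subsets according to the given ratios."""
    if abs(sum(ratio for _, ratio in splits) - 1.0) > 1e-7:
        raise ValueError("split ratios must sum to 1")

    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    result = {}
    start = 0
    cumulative = 0.0
    for position, (name, ratio) in enumerate(splits):
        cumulative += ratio
        is_last = position == len(splits) - 1
        # The last subset takes the remainder, so rounding never drops an item
        end = len(shuffled) if is_last else round(cumulative * len(shuffled))
        result[name] = shuffled[start:end]
        start = end
    return result


splits = (("val1", 0.2), ("val2", 0.2), ("val3", 0.2), ("val4", 0.2), ("val5", 0.2))
ids = [str(i) for i in range(5000)]  # placeholder IDs, same count as val2017
folds = random_split_ids(ids, splits)
print({name: len(members) for name, members in folds.items()})
```

Each subset receives the same number of items, but nothing constrains the number of annotations per subset, which is consistent with the uneven annotation counts in the output above.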

Furthermore, Datumaro provides a task-specific splitter that splits from the viewpoint of annotations instead of samples. By running the cell below, we get validation subsets that are well distributed in terms of the number of annotations.

In [9]:
import datumaro.plugins.splitter as splitter

task = splitter.SplitTask.segmentation.name
splits = [("val1", 0.2), ("val2", 0.2), ("val3", 0.2), ("val4", 0.2), ("val5", 0.2)]

crossval_per_ann_dataset = val_dataset.transform("split", task=task, splits=splits)
crossval_per_ann_dataset

Dataset
	size=5000
	source_path=coco_dataset
	media_type=
	annotated_items_count=4952
	annotations_count=41866
subsets
	val1: # of items=1000, # of annotated items=1000, # of annotations=8368, annotation types=['mask', 'polygon']
	val2: # of items=967, # of annotated items=919, # of annotations=8374, annotation types=['mask', 'polygon']
	val3: # of items=1032, # of annotated items=1032, # of annotations=8374, annotation types=['mask', 'polygon']
	val4: # of items=987, # of annotated items=987, # of annotations=8376, annotation types=['mask', 'polygon']
	val5: # of items=1014, # of annotated items=1014, # of annotations=8374, annotation types=['mask', 'polygon']
categories
	label: ['person', 'bicycle', 'car', 'airplane', 'train', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports 
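
One simple way to balance annotation counts across subsets is a greedy assignment: visit items from most to fewest annotations and give each one to the subset that is currently furthest below its target share. The sketch below uses that heuristic purely for illustration; it is a swapped-in technique, not the algorithm behind Datumaro's `split` transform, and all names and the synthetic counts are hypothetical.

```python
def split_by_annotation_count(ann_counts, splits):
    """Greedily assign items to subsets so annotation totals follow the given ratios."""
    totals = {name: 0 for name, _ in splits}
    assignment = {name: [] for name, _ in splits}

    # Placing the largest items first lets the small ones even out the remainder
    ordered = sorted(ann_counts.items(), key=lambda pair: pair[1], reverse=True)
    for item_id, count in ordered:
        # Pick the subset with the smallest total relative to its ratio
        target = min(splits, key=lambda split: totals[split[0]] / split[1])[0]
        assignment[target].append(item_id)
        totals[target] += count
    return assignment, totals


# Synthetic data: 70 items with 1 to 7 annotations each (280 annotations in total)
ann_counts = {f"img{i}": (i % 7) + 1 for i in range(70)}
splits = [("val1", 0.2), ("val2", 0.2), ("val3", 0.2), ("val4", 0.2), ("val5", 0.2)]

assignment, totals = split_by_annotation_count(ann_counts, splits)
print(totals)
```

With equal ratios, this heuristic keeps the gap between the largest and smallest subset total no larger than the annotation count of a single item. The price is that the number of items per subset may vary, as it does in the output above.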

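Lastly, we can rename the subsets as shown below. Note that `filter` and `transform` modify the dataset in place and return it, which is why calling `map_subsets` on `dataset` operates on the split validation data.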

In [10]:
mapping = {"val1": "train", "val2": "train", "val3": "train", "val4": "val", "val5": "test"}
test_dataset = dataset.transform("map_subsets", mapping=mapping)
test_dataset

Dataset
	size=5000
	source_path=coco_dataset
	media_type=
	annotated_items_count=4952
	annotations_count=41866
subsets
	test: # of items=1014, # of annotated items=1014, # of annotations=8374, annotation types=['mask', 'polygon']
	train: # of items=2999, # of annotated items=2951, # of annotations=25116, annotation types=['mask', 'polygon']
	val: # of items=987, # of annotated items=987, # of annotations=8376, annotation types=['mask', 'polygon']
categories
	label: ['person', 'bicycle', 'car', 'airplane', 'train', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot'
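
The merged subset sizes can be checked by hand: mapping several subsets to the same name simply adds up their items. The short sketch below redoes that arithmetic with the per-subset item counts reported by the task-specific split; it is a plain-Python check, not a Datumaro call.

```python
from collections import Counter

# Item counts per subset, taken from the task-specific split output
item_counts = {"val1": 1000, "val2": 967, "val3": 1032, "val4": 987, "val5": 1014}
mapping = {"val1": "train", "val2": "train", "val3": "train", "val4": "val", "val5": "test"}

merged = Counter()
for subset, count in item_counts.items():
    # Subsets missing from the mapping would keep their original name
    merged[mapping.get(subset, subset)] += count

merged = dict(merged)
print(merged)
```

The totals agree with the `train`, `val`, and `test` sizes shown in the dataset representation above.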