Import and Export Public Data#
In this notebook, we show how to import public data with Datumaro.
Import MS-COCO data#
MS-COCO is one of the most popular datasets: it contains about 120K images annotated with bounding boxes, polygons, and masks. Here we import MS-COCO for the instance segmentation task; the format also covers other tasks such as panoptic segmentation and person keypoint detection.
[5]:
from datumaro.components.dataset import Dataset
coco_path = "./coco_dataset"
coco_dataset = Dataset.import_from(coco_path, "coco_instances")
print(coco_dataset)
WARNING:root:File './coco_dataset/annotations/image_info_test-dev2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/image_info_test2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/image_info_unlabeled2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/person_keypoints_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File './coco_dataset/annotations/captions_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
Dataset
size=123287
source_path=./coco_dataset
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=122218
annotations_count=1915643
subsets
train2017: # of items=118287, # of annotated items=117266, # of annotations=1836996, annotation types=['bbox', 'polygon', 'mask']
val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['bbox', 'polygon', 'mask']
infos
categories
label: ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']
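For context on what the importer is matching: the coco_instances format looks for annotation files such as annotations/instances_train2017.json, which is why the captions_* and person_keypoints_* files are skipped with warnings above. The sketch below uses only the standard library and entirely made-up values (not real COCO data) to show the rough shape of such a file and how annotated items can be tallied per image:

```python
import json
from collections import Counter

# A hypothetical, minimal instances_*.json payload (values are illustrative).
coco_instances = {
    "images": [
        {"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480},
        {"id": 2, "file_name": "000002.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1,
         "bbox": [100, 120, 50, 80],  # [x, y, width, height]
         "segmentation": [[100, 120, 150, 120, 150, 200, 100, 200]],
         "iscrowd": 0},
        {"id": 11, "image_id": 1, "category_id": 18,
         "bbox": [300, 50, 40, 40],
         "segmentation": [[300, 50, 340, 50, 340, 90, 300, 90]],
         "iscrowd": 0},
    ],
    "categories": [
        {"id": 1, "name": "person"},
        {"id": 18, "name": "dog"},
    ],
}

text = json.dumps(coco_instances)  # what would live on disk
parsed = json.loads(text)

# Count annotations per image; this is the kind of bookkeeping behind
# "# of items" vs. "# of annotated items" in the printout above.
per_image = Counter(a["image_id"] for a in parsed["annotations"])
annotated = sum(1 for img in parsed["images"] if per_image[img["id"]] > 0)
print(annotated, len(parsed["images"]))  # 1 annotated out of 2 images
```

Datumaro does this bookkeeping for you: annotated_items_count=122218 above is exactly this kind of tally over the real annotation files.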
Export MS-COCO data into Pascal-VOC data format#
We now export the imported COCO dataset into another popular data format, Pascal-VOC. This lets us reuse the same data-feeding code in training or deployment frameworks. The listing below shows the COCO dataset reformatted into the Pascal-VOC layout.
[21]:
print("Original MS-COCO data format")
!tree -L 1 ./coco_dataset
save_path = "coco_dataset_with_voc_format"
coco_dataset.export(save_path, "voc", save_media=True)
print("Reformatted MS-COCO data with Pascal-VOC format")
!tree -L 1 ./coco_dataset_with_voc_format
Original MS-COCO data format
./coco_dataset
├── annotations
└── images
2 directories, 0 files
Reformatted MS-COCO data with Pascal-VOC format
./coco_dataset_with_voc_format
├── Annotations
├── ImageSets
├── JPEGImages
├── labelmap.txt
├── SegmentationClass
└── SegmentationObject
5 directories, 1 file
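Each file created under Annotations/ is a per-image XML description in the Pascal-VOC style. As a rough, Datumaro-independent sketch, one such file (with made-up values) can be parsed using only the standard library:

```python
import xml.etree.ElementTree as ET

# A minimal, hand-written Pascal-VOC annotation (values are illustrative).
voc_xml = """
<annotation>
  <folder>VOC2007</folder>
  <filename>000001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <difficult>0</difficult>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
</annotation>
"""

root = ET.fromstring(voc_xml)
for obj in root.iter("object"):
    name = obj.findtext("name")
    box = obj.find("bndbox")
    xmin, ymin = int(box.findtext("xmin")), int(box.findtext("ymin"))
    xmax, ymax = int(box.findtext("xmax")), int(box.findtext("ymax"))
    print(name, (xmax - xmin) * (ymax - ymin))  # dog 19257
```

Note that VOC boxes are stored as corner coordinates (xmin, ymin, xmax, ymax), unlike COCO's (x, y, width, height) convention; converters such as Datumaro's handle this translation during export.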
Import Pascal-VOC data#
We now move on to importing Pascal-VOC data. Like MS-COCO, Pascal-VOC supports multiple tasks, including object detection, segmentation, person layout, and action classification, so we import it in a task-specific way.
First, we import the data for the object detection task, where Pascal-VOC defines 21 classes including the background class. Here we check that items carry only the bounding-box annotation type.
[19]:
voc_path = "VOCdevkit/VOC2007"
voc_dataset = Dataset.import_from(voc_path, "voc_detection")
print(voc_dataset)
Dataset
size=10022
source_path=VOCdevkit/VOC2007
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=10022
annotations_count=31324
subsets
train: # of items=2501, # of annotated items=2501, # of annotations=7844, annotation types=['bbox']
trainval: # of items=5011, # of annotated items=5011, # of annotations=15662, annotation types=['bbox']
val: # of items=2510, # of annotated items=2510, # of annotations=7818, annotation types=['bbox']
infos
categories
label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored']
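By Pascal-VOC convention, trainval is the union of train and val, so the subset counts reported above should add up exactly. A quick sanity check over those numbers:

```python
# Subset statistics as reported by the import above.
subsets = {
    "train":    {"items": 2501, "annotations": 7844},
    "val":      {"items": 2510, "annotations": 7818},
    "trainval": {"items": 5011, "annotations": 15662},
}

# trainval = train + val, for both item and annotation counts.
for key in ("items", "annotations"):
    total = subsets["train"][key] + subsets["val"][key]
    assert total == subsets["trainval"][key], key
print("trainval = train + val: counts consistent")
```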
Import Pascal-VOC data for another task#
We now import Pascal-VOC data for the person layout task, which is composed of bounding boxes for person body parts, e.g., head, hand, and foot, within each person.
[22]:
voc_layout_dataset = Dataset.import_from(voc_path, "voc_layout")
print(voc_layout_dataset)
Dataset
size=1292
source_path=VOCdevkit/VOC2007
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=644
annotations_count=3986
subsets
train: # of items=318, # of annotated items=166, # of annotations=1001, annotation types=['bbox']
trainval: # of items=646, # of annotated items=322, # of annotations=1993, annotation types=['bbox']
val: # of items=328, # of annotated items=156, # of annotations=992, annotation types=['bbox']
infos
categories
label: ['background', 'person', 'head', 'hand', 'foot']
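In the person layout task, the head, hand, and foot boxes are expected to lie inside their parent person box. The following is a minimal, Datumaro-independent containment check; the coordinates are illustrative, not taken from the dataset:

```python
def contains(outer, inner, tol=0):
    """True if `inner` [x, y, w, h] lies within `outer`, up to `tol` pixels."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return (ix >= ox - tol and iy >= oy - tol
            and ix + iw <= ox + ow + tol and iy + ih <= oy + oh + tol)

person = [100, 50, 120, 300]      # illustrative person box
parts = {
    "head": [130, 55, 50, 60],
    "hand": [95, 180, 30, 30],    # slightly outside on the left
}
for name, box in parts.items():
    print(name, contains(person, box))
# head True, hand False (x=95 < 100)
```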