Filter Data through Your Query
In this notebook example, we’ll take a look at the Datumaro filter API.
[1]:
# Copyright (C) 2022 Intel Corporation
#
# SPDX-License-Identifier: MIT
import datumaro as dm
Filtered by subset
We import the sample VOC dataset and filter it to keep only the train subset.
[2]:
dataset = dm.Dataset.import_from("./tests/assets/voc_dataset/voc_dataset1", format="voc")
[3]:
print("Representation for sample VOC dataset")
dataset
Representation for sample VOC dataset
[3]:
Dataset
size=2
source_path=./tests/assets/voc_dataset/voc_dataset1
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=1
annotations_count=15
subsets
test: # of items=1, # of annotated items=0, # of annotations=0, annotation types=[]
train: # of items=1, # of annotated items=1, # of annotations=15, annotation types=['label', 'bbox', 'mask']
categories
label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored', 'head', 'hand', 'foot']
mask: []
The VOC dataset has ‘train’ and ‘test’ subsets. We will keep only the ‘train’ subset.
[4]:
dataset = dm.Dataset.filter(dataset, '/item[subset="train"]')
[5]:
print("Representation for `train` subset of sample VOC dataset")
dataset
Representation for `train` subset of sample VOC dataset
[5]:
Dataset
size=1
source_path=./tests/assets/voc_dataset/voc_dataset1
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=1
annotations_count=15
subsets
train: # of items=1, # of annotated items=1, # of annotations=15, annotation types=['label', 'bbox', 'mask']
categories
label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored', 'head', 'hand', 'foot']
mask: []
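The filter expression is an XPath query, and the predicate in square brackets tests a child field of each item. As a rough illustration of how such a predicate selects items, the sketch below runs a similar query with the standard library's xml.etree.ElementTree. The XML layout, the wrapping dataset element, and the item ids are assumptions made up for this example, not Datumaro's internal representation; ElementTree also needs a relative path (./item) where the notebook's expression uses /item.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for two dataset items.
# The layout and the ids are hypothetical, chosen only for illustration.
DOC = """
<dataset>
  <item><id>img_a</id><subset>train</subset></item>
  <item><id>img_b</id><subset>test</subset></item>
</dataset>
"""

root = ET.fromstring(DOC)

# [subset='train'] keeps items that have a <subset> child whose text is 'train'.
train_items = root.findall("./item[subset='train']")
train_ids = [item.findtext("id") for item in train_items]

print(train_ids)  # ['img_a']
```

The same predicate form with id in place of subset selects a single item by its id.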
Filtered by id
We import the sample WiderFace dataset and keep only the item whose id is 0_Parade_image_01.
[6]:
dataset = dm.Dataset.import_from("./tests/assets/widerface_dataset")
[7]:
print("Representation for sample WiderFace dataset")
dataset
Representation for sample WiderFace dataset
[7]:
Dataset
size=3
source_path=./tests/assets/widerface_dataset
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=3
annotations_count=9
subsets
train: # of items=2, # of annotated items=2, # of annotations=5, annotation types=['label', 'bbox']
val: # of items=1, # of annotated items=1, # of annotations=4, annotation types=['label', 'bbox']
categories
label: ['Parade', 'Handshaking']
[8]:
dataset = dm.Dataset.filter(dataset, '/item[id="0_Parade_image_01"]')
[9]:
print("Representation for `id == 0_Parade_image_01` item of sample WiderFace dataset")
dataset
Representation for `id == 0_Parade_image_01` item of sample WiderFace dataset
[9]:
Dataset
size=1
source_path=./tests/assets/widerface_dataset
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=1
annotations_count=2
subsets
train: # of items=1, # of annotated items=1, # of annotations=2, annotation types=['label', 'bbox']
categories
label: ['Parade', 'Handshaking']
Filtered by width and height
We import the sample COCO dataset and keep only the images whose width is smaller than their height.
[10]:
dataset = dm.Dataset.import_from("./tests/assets/coco_dataset/coco")
WARNING:root:Not implemented: Found potentially conflicting source types with labels: labels, panoptic, stuff, person_keypoints, instances. Only one type will be used: instances
WARNING:root:Not implemented: conflicting source './tests/assets/coco_dataset/coco/annotations/labels_train.json' is skipped.
WARNING:root:Not implemented: conflicting source './tests/assets/coco_dataset/coco/annotations/person_keypoints_train.json' is skipped.
WARNING:root:Not implemented: conflicting source './tests/assets/coco_dataset/coco/annotations/stuff_train.json' is skipped.
WARNING:root:Not implemented: conflicting source './tests/assets/coco_dataset/coco/annotations/panoptic_train.json' is skipped.
WARNING:root:Not implemented: conflicting source './tests/assets/coco_dataset/coco/annotations/panoptic_val.json' is skipped.
WARNING:root:Not implemented: conflicting source './tests/assets/coco_dataset/coco/annotations/labels_val.json' is skipped.
WARNING:root:Not implemented: conflicting source './tests/assets/coco_dataset/coco/annotations/person_keypoints_val.json' is skipped.
WARNING:root:Not implemented: conflicting source './tests/assets/coco_dataset/coco/annotations/stuff_val.json' is skipped.
[11]:
def get_width_height(dataset: dm.Dataset):
    # Map each item id to the size tuple reported by its media.
    size_dict = {}
    for item in dataset:
        size_dict[item.id] = item.media.size
    return size_dict
[12]:
print("Representation for sample COCO dataset")
dataset
Representation for sample COCO dataset
[12]:
Dataset
size=2
source_path=./tests/assets/coco_dataset/coco
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=2
annotations_count=6
subsets
train: # of items=1, # of annotated items=1, # of annotations=2, annotation types=['bbox', 'caption']
val: # of items=1, # of annotated items=1, # of annotations=4, annotation types=['mask', 'caption', 'polygon']
categories
label: ['a', 'b', 'c']
[13]:
print("Width and height for sample COCO dataset images")
get_width_height(dataset)
Width and height for sample COCO dataset images
[13]:
{'a': (5, 10), 'b': (10, 5)}
[14]:
dataset = dm.Dataset.filter(dataset, "/item[image/width < image/height]")
[15]:
print("Representation for `width < height` sample COCO dataset images")
dataset
Representation for `width < height` sample COCO dataset images
[15]:
Dataset
size=1
source_path=./tests/assets/coco_dataset/coco
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=1
annotations_count=4
subsets
val: # of items=1, # of annotated items=1, # of annotations=4, annotation types=['mask', 'caption', 'polygon']
categories
label: ['a', 'b', 'c']
[16]:
print("Width and height for `width < height` sample COCO dataset images")
get_width_height(dataset)
Width and height for `width < height` sample COCO dataset images
[16]:
{'b': (10, 5)}
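The result above is consistent with reading each size tuple as (height, width): under that reading, item 'b' is 5 wide and 10 tall, so it satisfies width < height, while item 'a' does not. The sketch below reproduces the selection in plain Python on the printed sizes. The (height, width) ordering is an assumption about what item.media.size returns, and narrower_than_tall is a hypothetical helper name.

```python
# Sizes as printed by the notebook, read here as (height, width).
# That ordering is an assumption for this sketch.
sizes = {"a": (5, 10), "b": (10, 5)}


def narrower_than_tall(size):
    """Return True when the image width is smaller than its height."""
    height, width = size
    return width < height


kept = {item_id: size for item_id, size in sizes.items() if narrower_than_tall(size)}

print(kept)  # {'b': (10, 5)}
```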
Filtered by label and area
We import the sample VOC dataset and keep only the annotations whose label is not person.
[17]:
dataset = dm.Dataset.import_from("./tests/assets/voc_dataset/voc_dataset1")
[18]:
print("Representation for sample VOC dataset")
dataset
Representation for sample VOC dataset
[18]:
Dataset
size=2
source_path=./tests/assets/voc_dataset/voc_dataset1
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=1
annotations_count=15
subsets
test: # of items=1, # of annotated items=0, # of annotations=0, annotation types=[]
train: # of items=1, # of annotated items=1, # of annotations=15, annotation types=['label', 'bbox', 'mask']
categories
label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored', 'head', 'hand', 'foot']
mask: []
Set filter_annotations to True if the filter should apply to annotations. The default value is False, which applies the filter to items.
[19]:
dataset = dm.Dataset.filter(dataset, '/item/annotation[label!="person"]', filter_annotations=True)
[20]:
print('Representation for sample VOC dataset whose annotation is `label!="person"`')
dataset
Representation for sample VOC dataset whose annotation is `label!="person"`
[20]:
Dataset
size=2
source_path=./tests/assets/voc_dataset/voc_dataset1
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=1
annotations_count=13
subsets
test: # of items=1, # of annotated items=0, # of annotations=0, annotation types=[]
train: # of items=1, # of annotated items=1, # of annotations=13, annotation types=['label', 'bbox', 'mask']
categories
label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored', 'head', 'hand', 'foot']
mask: []
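In the output above the dataset still has 2 items, but the annotation count drops from 15 to 13: the filter removed annotations, not items. The sketch below contrasts the two modes in plain Python. The dictionaries are a hypothetical stand-in for dataset items rather than Datumaro objects, and the item-level behaviour (keep an item when at least one annotation matches) is a simplified reading of how an XPath match decides whether an item survives.

```python
# Hypothetical in-memory stand-in for dataset items (not Datumaro objects).
items = [
    {"id": "img_a", "annotations": [{"label": "person"}, {"label": "cat"}]},
    {"id": "img_b", "annotations": []},
]


def is_not_person(annotation):
    return annotation["label"] != "person"


# Item-level mode: an item is kept whole if any annotation matches;
# its annotations are left untouched.
item_level = [
    item for item in items if any(is_not_person(a) for a in item["annotations"])
]

# Annotation-level mode: every item is kept, and annotations that fail
# the predicate are dropped from it.
annotation_level = [
    {**item, "annotations": [a for a in item["annotations"] if is_not_person(a)]}
    for item in items
]

print([item["id"] for item in item_level])                      # ['img_a']
print([len(item["annotations"]) for item in annotation_level])  # [1, 0]
```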
Filtered by annotation
We import the sample VOC dataset, keep only non-occluded annotations, and remove images that are left without annotations.
[21]:
dataset = dm.Dataset.import_from("./tests/assets/voc_dataset/voc_dataset1")
[22]:
print("Representation for sample VOC dataset")
dataset
Representation for sample VOC dataset
[22]:
Dataset
size=2
source_path=./tests/assets/voc_dataset/voc_dataset1
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=1
annotations_count=15
subsets
test: # of items=1, # of annotated items=0, # of annotations=0, annotation types=[]
train: # of items=1, # of annotated items=1, # of annotations=15, annotation types=['label', 'bbox', 'mask']
categories
label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored', 'head', 'hand', 'foot']
mask: []
[23]:
dm.Dataset.filter(
    dataset, '/item/annotation[occluded="False"]', filter_annotations=True, remove_empty=True
)
[23]:
Dataset
size=1
source_path=./tests/assets/voc_dataset/voc_dataset1
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=1
annotations_count=2
subsets
train: # of items=1, # of annotated items=1, # of annotations=2, annotation types=['bbox']
categories
label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored', 'head', 'hand', 'foot']
mask: []
[24]:
print("Representation for `non-occluded annotations and empty images removed sample VOC dataset`")
dataset
Representation for `non-occluded annotations and empty images removed sample VOC dataset`
[24]:
Dataset
size=1
source_path=./tests/assets/voc_dataset/voc_dataset1
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=1
annotations_count=2
subsets
train: # of items=1, # of annotated items=1, # of annotations=2, annotation types=['bbox']
categories
label: ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'ignored', 'head', 'hand', 'foot']
mask: []
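With remove_empty=True the dataset shrinks from 2 items to 1, because the test item has no annotations left after filtering. The sketch below shows that interaction on hypothetical dictionary items; filter_annotations here is an illustrative helper, not the Datumaro function. Unlike the notebook's call, which appears to modify dataset in place (cell [24] shows the filtered result without reassignment), this helper returns a new list.

```python
def filter_annotations(items, keep, remove_empty=False):
    """Drop annotations failing `keep`; optionally drop items left empty."""
    result = []
    for item in items:
        kept = [a for a in item["annotations"] if keep(a)]
        if remove_empty and not kept:
            continue  # skip items that have no annotations after filtering
        result.append({**item, "annotations": kept})
    return result


# Hypothetical stand-in items: one annotated, one without annotations.
items = [
    {"id": "img_a", "annotations": [{"occluded": False}, {"occluded": True}]},
    {"id": "img_b", "annotations": []},
]


def non_occluded(annotation):
    return not annotation["occluded"]


with_empty = filter_annotations(items, non_occluded)
without_empty = filter_annotations(items, non_occluded, remove_empty=True)

print(len(with_empty), len(without_empty))  # 2 1
```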