Filter Data through Your Query
In this notebook example, we’ll take a look at the Datumaro filter API. Datumaro provides two types of Python query for filtering.
Using the XML XPath query
This is a Python string query that is useful for simple filtering and for CLI users. With this query, each Datumaro dataset item is converted to an XML representation and filtered by the XPath selector. For more details, please refer to this link.
Using the user-provided Python function query
This is a Python callable such as Callable[[DatasetItem], bool] (for filtering dataset items) or Callable[[DatasetItem, Annotation], bool] (for filtering annotations). Users can implement their own Python function that decides whether to keep a given dataset item or annotation.
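To make the first query type more concrete, the sketch below shows how an XPath selector picks items out of an XML representation. It uses only the standard-library `xml.etree.ElementTree` module, which supports a limited XPath subset. The XML layout is a simplified assumption modelled on the queries used in this notebook, not the exact document Datumaro builds internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML for two dataset items (assumed layout).
XML = """
<dataset>
  <item><id>a</id><subset>train</subset></item>
  <item><id>b</id><subset>test</subset></item>
</dataset>
""".strip()

root = ET.fromstring(XML)

# ElementTree requires a relative path, so "./item[...]" stands in for "/item[...]".
matched = root.findall("./item[subset='train']")
matched_ids = [item.findtext("id") for item in matched]
print(matched_ids)  # ['a']
```

Only the item whose `subset` child equals `train` is selected, which is the same idea as the `/item[subset="train"]` query used later in this notebook.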
First, we import Datumaro into our runtime session.
[1]:
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT
import datumaro as dm
Filtered by subset
To show filtering by subset, we first import the dummy VOC dataset from the testing asset in our repository.
[2]:
dataset = dm.Dataset.import_from("../tests/assets/voc_dataset/voc_dataset1", format="voc")
print("Subsets:", list(dataset.subsets().keys()))
Subsets: ['test', 'train']
The VOC dataset has ‘train’ and ‘test’ subsets. We will filter out the ‘test’ subset using an XPath string query. You can see that only the ‘train’ subset remains after filtering.
[3]:
filtered = dataset.clone().filter('/item[subset="train"]')
print("Subsets:", list(filtered.subsets().keys()))
Subsets: ['train']
We can do the same thing with a user-provided Python function, as follows. From now on, we will show both query types for each filter.
[4]:
def retain_train_subset(item):
    return item.subset == "train"


filtered = dataset.clone().filter(retain_train_subset)
print("Subsets:", list(filtered.subsets().keys()))
Subsets: ['train']
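The function query above hard-codes the subset name. As a minimal sketch, a small factory can build the same kind of predicate for any set of subsets. The helper name `make_subset_filter` is hypothetical, and the items here are stand-in objects that expose only the `subset` attribute, so the example runs without Datumaro.

```python
from types import SimpleNamespace


def make_subset_filter(subsets):
    """Return a predicate that keeps items whose subset is in `subsets`."""
    allowed = frozenset(subsets)

    def keep(item):
        return item.subset in allowed

    return keep


# Stand-in items (not real DatasetItem objects) for illustration only.
items = [
    SimpleNamespace(id="1", subset="train"),
    SimpleNamespace(id="2", subset="test"),
    SimpleNamespace(id="3", subset="val"),
]

keep_train_val = make_subset_filter(["train", "val"])
kept_ids = [item.id for item in items if keep_train_val(item)]
print(kept_ids)  # ['1', '3']
```

With a real dataset, the returned predicate could be passed in the same way as `retain_train_subset`, for example `dataset.clone().filter(make_subset_filter(["train"]))`.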
Filtered by image width or height
To show filtering by image width or height, we create a dummy Dataset with the following code. It has two items: one image that is wider than it is tall, and one that is taller than it is wide.
[5]:
import numpy as np

dataset = dm.Dataset.from_iterable(
    [
        dm.DatasetItem(
            id="horizontally_long",
            media=dm.Image.from_numpy(np.zeros(shape=(10, 20, 3), dtype=np.uint8)),
        ),
        dm.DatasetItem(
            id="vertically_long",
            media=dm.Image.from_numpy(np.zeros(shape=(20, 10, 3), dtype=np.uint8)),
        ),
    ]
)

for item in dataset:
    print(f'ID: "{item.id}", Height: {item.media.size[0]}, Width: {item.media.size[1]}')
ID: "horizontally_long", Height: 10, Width: 20
ID: "vertically_long", Height: 20, Width: 10
[6]:
print('"Vertically long" item will remain')
filtered = dataset.clone().filter("/item[image/width < image/height]")
for item in filtered:
    print(f'ID: "{item.id}", Height: {item.media.size[0]}, Width: {item.media.size[1]}')


def retain_horizontally_long(item):
    # media.size is (height, width)
    return item.media.size[0] < item.media.size[1]


print('Now, conversely, "Horizontally long" item will remain conversely')
filtered = dataset.clone().filter(retain_horizontally_long)
for item in filtered:
    print(f'ID: "{item.id}", Height: {item.media.size[0]}, Width: {item.media.size[1]}')
"Vertically long" item will remain
ID: "vertically_long", Height: 20, Width: 10
Now, conversely, "Horizontally long" item will remain conversely
ID: "horizontally_long", Height: 10, Width: 20
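The same pattern extends to size thresholds. The sketch below builds a predicate that keeps items whose image is at least a given height and width. It assumes `item.media.size` is a `(height, width)` pair, as printed in the cells above. The helper name `make_min_size_filter` is hypothetical, and the items are stand-in objects rather than real `DatasetItem` instances.

```python
from types import SimpleNamespace


def make_min_size_filter(min_height, min_width):
    """Return a predicate keeping items whose image is at least the given size."""

    def keep(item):
        size = item.media.size  # (height, width)
        if size is None:  # assumption: an unknown size is treated as "do not keep"
            return False
        height, width = size
        return height >= min_height and width >= min_width

    return keep


# Stand-in items mirroring the two dummy items above.
items = [
    SimpleNamespace(id="horizontally_long", media=SimpleNamespace(size=(10, 20))),
    SimpleNamespace(id="vertically_long", media=SimpleNamespace(size=(20, 10))),
]

keep_tall = make_min_size_filter(min_height=15, min_width=5)
tall_ids = [item.id for item in items if keep_tall(item)]
print(tall_ids)  # ['vertically_long']
```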
Filtered by label and area
Let’s return to the dummy VOC dataset from the first section. We want to remove all annotations associated with the person label in the dataset. You can see that there is one item, with id=2007_000001, that has the person label.
[7]:
def find_item_with_given_label_name(dataset, label_name):
    label_cats = dataset.categories()[dm.AnnotationType.label]
    for item in dataset:
        labels = {label_cats[ann.label].name for ann in item.annotations}
        if label_name in labels:
            print(f'ID: {item.id} has "{label_name}" label')


dataset = dm.Dataset.import_from("../tests/assets/voc_dataset/voc_dataset1", format="voc")
print("There exist a person")
find_item_with_given_label_name(dataset, "person")
There exist a person
ID: 2007_000001 has "person" label
The following XPath query keeps only the annotations whose label is not person, which removes every person annotation. Similarly, using a Python function, we can remove all airplane annotations. As shown, you have to set filter_annotations to True if you want to apply filtering to annotations. The default value is False, which is why the previous examples filtered dataset items rather than annotations.
[8]:
filtered = dataset.clone().filter('/item/annotation[label!="person"]', filter_annotations=True)
print("There is no person")
# Inspect the filtered dataset, not the original one
find_item_with_given_label_name(filtered, "person")

print("There is an airplane")
find_item_with_given_label_name(dataset, "airplane")


def remove_airplane(item, ann):
    label_cats = dataset.categories()[dm.AnnotationType.label]
    return label_cats[ann.label].name != "airplane"


print("Now, we removed it")
filtered = dataset.clone().filter(remove_airplane, filter_annotations=True)
# Again, inspect the filtered dataset
find_item_with_given_label_name(filtered, "airplane")
There is no person
There is an airplane
Now, we removed it
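An annotation-level predicate takes both the item and the annotation. As a minimal sketch, the factory below generalizes `remove_airplane` to any set of label names. It relies only on indexing `label_cats` by `ann.label` and reading `.name`, as the cells above do. The helper name `make_label_exclusion_filter` is hypothetical, and the categories and annotations are stand-in objects so that the example runs without Datumaro.

```python
from types import SimpleNamespace


def make_label_exclusion_filter(label_cats, excluded_names):
    """Return a predicate that drops annotations whose label name is excluded."""
    excluded = frozenset(excluded_names)

    def keep(item, ann):
        return label_cats[ann.label].name not in excluded

    return keep


# Stand-in label categories and annotations (illustration only).
label_cats = [SimpleNamespace(name="person"), SimpleNamespace(name="car")]
item = SimpleNamespace(
    id="demo",
    annotations=[
        SimpleNamespace(label=0),
        SimpleNamespace(label=1),
        SimpleNamespace(label=1),
    ],
)

keep = make_label_exclusion_filter(label_cats, ["person"])
kept_names = [label_cats[ann.label].name for ann in item.annotations if keep(item, ann)]
print(kept_names)  # ['car', 'car']
```

With a real dataset, `label_cats` would be `dataset.categories()[dm.AnnotationType.label]`, and the predicate would be passed to `filter` together with `filter_annotations=True`.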
Filtered by attributes
Some data formats have special attributes for each dataset item or annotation. One example is the occluded boolean, which is used in the COCO format. It indicates whether the object is occluded by another object. We can also filter dataset items or annotations by their attribute fields. The following example shows how to do that.
[9]:
dataset = dm.Dataset.from_iterable(
    [
        dm.DatasetItem(
            id="item_with_occlusion",
            media=dm.Image.from_numpy(np.zeros(shape=(10, 20, 3), dtype=np.uint8)),
            annotations=[
                dm.Bbox(0, 0, 1, 1, attributes={"occluded": True}),
            ],
        ),
        dm.DatasetItem(
            id="item_without_occlusion",
            media=dm.Image.from_numpy(np.zeros(shape=(10, 20, 3), dtype=np.uint8)),
            annotations=[
                dm.Bbox(0, 0, 1, 1, attributes={"occluded": False}),
            ],
        ),
    ]
)

for item in dataset:
    print(f"ID: {item.id}")
ID: item_with_occlusion
ID: item_without_occlusion
Now, we will retain only the annotations with occluded=False. We also set the remove_empty=True flag. With this flag, any dataset item that has no annotations left after annotation filtering is removed as well. Therefore, item_with_occlusion should be removed, because it has no bbox left after filtering.
[10]:
print("There is no item with occlusion")
filtered = dataset.clone().filter(
    '/item/annotation[occluded="False"]', filter_annotations=True, remove_empty=True
)
for item in filtered:
    print(f"ID: {item.id}")


def remove_occluded_ann(item, ann):
    # Annotations without the attribute are treated as not occluded
    return not ann.attributes.get("occluded", False)


print("There is no item with occlusion again")
filtered = dataset.clone().filter(remove_occluded_ann, filter_annotations=True, remove_empty=True)
for item in filtered:
    print(f"ID: {item.id}")
There is no item with occlusion
ID: item_without_occlusion
There is no item with occlusion again
ID: item_without_occlusion
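To see how filter_annotations and remove_empty interact, the following is a conceptual model in plain Python. It is not Datumaro's implementation. The `Ann` and `Item` classes and the `filter_annotations` function are hypothetical stand-ins that mimic only the behaviour described in this section.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Ann:
    attributes: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Item:
    id: str
    annotations: tuple = ()


def filter_annotations(items, keep, remove_empty=False):
    """Keep annotations for which keep(item, ann) is true; optionally drop empty items."""
    result = []
    for item in items:
        kept = tuple(ann for ann in item.annotations if keep(item, ann))
        if remove_empty and not kept:
            continue  # the item lost all of its annotations, so drop it
        result.append(replace(item, annotations=kept))
    return result


items = [
    Item("item_with_occlusion", (Ann({"occluded": True}),)),
    Item("item_without_occlusion", (Ann({"occluded": False}),)),
]


def not_occluded(item, ann):
    return not ann.attributes.get("occluded", False)


kept_all_items = filter_annotations(items, not_occluded)
kept_non_empty = filter_annotations(items, not_occluded, remove_empty=True)

print([(i.id, len(i.annotations)) for i in kept_all_items])
# [('item_with_occlusion', 0), ('item_without_occlusion', 1)]
print([i.id for i in kept_non_empty])
# ['item_without_occlusion']
```

In this model, without remove_empty the occluded item survives with an empty annotation list, and with it the item disappears, which matches the output of the cell above.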