Filter Data through Your Query#

In this notebook example, we take a look at the Datumaro filter API. Datumaro provides two types of Python API for filtering.

  1. Using the XML XPath query

    This is a Python string query, useful for simple filtering or for CLI users. With this query, Datumaro converts each dataset item representation to XML and filters it with the XPath selector. For more details, please refer to this link.

  2. Using the user-provided Python function query

    This is a Python callable, either Callable[[DatasetItem], bool] (for filtering dataset items) or Callable[[DatasetItem, Annotation], bool] (for filtering annotations). Users can implement their own function that decides whether to keep a given dataset item or annotation.
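
To make the XML-based mechanism more concrete, the sketch below applies an XPath-style selector to a hand-written XML view of two items, using Python's standard xml.etree.ElementTree. The XML layout and the `<dataset>` wrapper element are simplifying assumptions for illustration, not Datumaro's exact encoding, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML view of two dataset items; the real
# encoding used by Datumaro contains more fields than shown here.
root = ET.fromstring(
    "<dataset>"
    "<item><id>a</id><subset>train</subset></item>"
    "<item><id>b</id><subset>test</subset></item>"
    "</dataset>"
)

# ElementTree cannot evaluate absolute paths on an element, so the
# selector is written relative to the wrapper instead of '/item[...]'.
selected = root.findall("./item[subset='train']")
selected_ids = [elem.findtext("id") for elem in selected]
print(selected_ids)
```

Only the item whose `subset` child equals "train" matches the predicate in square brackets, which is the same idea as the XPath string queries used in the rest of this notebook.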

First, we import Datumaro into our runtime session.

[1]:
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT

import datumaro as dm

Filtered by subset#

To show filtering by subset, we first import the dummy VOC dataset from the test assets in our repository.

[2]:
dataset = dm.Dataset.import_from("../tests/assets/voc_dataset/voc_dataset1", format="voc")
print("Subsets:", list(dataset.subsets().keys()))
Subsets: ['test', 'train']

The VOC dataset has ‘train’ and ‘test’ subsets. This time, we filter out the ‘test’ subset using an XPath string query. Only the ‘train’ subset remains after filtering.

[3]:
filtered = dataset.clone().filter('/item[subset="train"]')
print("Subsets:", list(filtered.subsets().keys()))
Subsets: ['train']

We can do the same thing with a user-provided Python function, as follows. From now on, we show both query types for each filter.

[4]:
def retain_train_subset(item):
    return item.subset == "train"


filtered = dataset.clone().filter(retain_train_subset)
print("Subsets:", list(filtered.subsets().keys()))
Subsets: ['train']
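
Because the query is an ordinary callable, it can also be built by another function. The sketch below uses a hypothetical helper, `retain_subsets`, that returns an item predicate for any set of subset names. It runs on stand-in objects that only mimic the `id` and `subset` fields, so it does not depend on Datumaro itself.

```python
from types import SimpleNamespace


def retain_subsets(*names):
    """Build an item predicate that keeps items whose subset is in `names`."""
    allowed = set(names)

    def predicate(item):
        return item.subset in allowed

    return predicate


# Stand-in objects that only mimic the `id` and `subset` fields of a dataset item.
items = [
    SimpleNamespace(id="a", subset="train"),
    SimpleNamespace(id="b", subset="test"),
    SimpleNamespace(id="c", subset="val"),
]

keep = retain_subsets("train", "val")
kept_ids = [item.id for item in items if keep(item)]
print(kept_ids)
```

With a real dataset, the returned predicate would presumably be passed in the same way as `retain_train_subset` above, for example `dataset.clone().filter(retain_subsets("train"))`.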

Filtered by image width or height#

To show filtering by image width or height, we create a dummy dataset with the following code. It has two items: one with a horizontally long image and one with a vertically long image.

[5]:
import numpy as np

dataset = dm.Dataset.from_iterable(
    [
        dm.DatasetItem(
            id="horizontally_long",
            media=dm.Image.from_numpy(np.zeros(shape=(10, 20, 3), dtype=np.uint8)),
        ),
        dm.DatasetItem(
            id="vertically_long",
            media=dm.Image.from_numpy(np.zeros(shape=(20, 10, 3), dtype=np.uint8)),
        ),
    ]
)
for item in dataset:
    print(f'ID: "{item.id}", Height: {item.media.size[0]}, Width: {item.media.size[1]}')
ID: "horizontally_long", Height: 10, Width: 20
ID: "vertically_long", Height: 20, Width: 10
[6]:
print('"Vertically long" item will remain')

filtered = dataset.clone().filter("/item[image/width < image/height]")
for item in filtered:
    print(f'ID: "{item.id}", Height: {item.media.size[0]}, Width: {item.media.size[1]}')


def retain_horizontally_long(item):
    return item.media.size[0] < item.media.size[1]


print('Now, conversely, "Horizontally long" item will remain')

filtered = dataset.clone().filter(retain_horizontally_long)
for item in filtered:
    print(f'ID: "{item.id}", Height: {item.media.size[0]}, Width: {item.media.size[1]}')
"Vertically long" item will remain
ID: "vertically_long", Height: 20, Width: 10
Now, conversely, "Horizontally long" item will remain
ID: "horizontally_long", Height: 10, Width: 20
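
Indexing `size[0]` and `size[1]` is easy to mix up, so a predicate can unpack the pair into named values first. The sketch below assumes the (height, width) ordering that the printed output above implies, and runs on stand-in objects that only mimic `id` and `media.size`. The name `retain_landscape` and the extra "square" item are hypothetical.

```python
from types import SimpleNamespace


def retain_landscape(item):
    """Keep items whose image is strictly wider than it is tall."""
    height, width = item.media.size  # assumed (height, width) ordering
    return width > height


# Stand-in objects that only mimic `id` and `media.size`.
items = [
    SimpleNamespace(id="horizontally_long", media=SimpleNamespace(size=(10, 20))),
    SimpleNamespace(id="vertically_long", media=SimpleNamespace(size=(20, 10))),
    SimpleNamespace(id="square", media=SimpleNamespace(size=(10, 10))),
]

kept_ids = [item.id for item in items if retain_landscape(item)]
print(kept_ids)
```

The strict comparison means a square image is dropped; using `>=` instead would keep it.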

Filtered by label and area#

Let’s return to the dummy VOC dataset from the first section. We want to remove all annotations associated with the person label. You can see that one item, id=2007_000001, has the person label.

[7]:
def find_item_with_given_label_name(dataset, label_name):
    label_cats = dataset.categories()[dm.AnnotationType.label]
    for item in dataset:
        labels = {label_cats[ann.label].name for ann in item.annotations}
        if label_name in labels:
            print(f'ID: {item.id} has "{label_name}" label')


dataset = dm.Dataset.import_from("../tests/assets/voc_dataset/voc_dataset1", format="voc")
print("There exists a person")
find_item_with_given_label_name(dataset, "person")
There exists a person
ID: 2007_000001 has "person" label

The following XPath query keeps only the annotations whose label is not person, which removes every person annotation. Likewise, the Python function removes all airplane annotations. As shown, you must set filter_annotations=True to apply the filter to annotations. The default value is False, which is why the previous examples filtered dataset items rather than annotations.

[8]:
filtered = dataset.clone().filter('/item/annotation[label!="person"]', filter_annotations=True)
print("There is no person")
# Inspect the filtered copy; `dataset` is unchanged because we filtered a clone.
find_item_with_given_label_name(filtered, "person")

print("There is an airplane")
find_item_with_given_label_name(dataset, "airplane")


def remove_airplane(item, ann):
    label_cats = dataset.categories()[dm.AnnotationType.label]
    return label_cats[ann.label].name != "airplane"


print("Now, we removed it")
filtered = dataset.clone().filter(remove_airplane, filter_annotations=True)
# Again, check the filtered copy rather than the original dataset.
find_item_with_given_label_name(filtered, "airplane")
There is no person
There is an airplane
Now, we removed it
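
The `remove_airplane` function above looks up the label categories through the global `dataset` on every call. One alternative is a small factory that captures the label names once and returns the annotation predicate. The sketch below is a minimal illustration on stand-in data: `make_label_remover`, the `label_names` list, and the sample annotations are all hypothetical, and with Datumaro the names would have to be derived from `dataset.categories()[dm.AnnotationType.label]` as in the cell above.

```python
from types import SimpleNamespace


def make_label_remover(label_names, unwanted):
    """Build an (item, annotation) predicate that drops annotations named `unwanted`.

    `label_names` maps a label index to its name.
    """

    def predicate(item, ann):
        return label_names[ann.label] != unwanted

    return predicate


# Hypothetical label names and stand-in annotations carrying only a label index.
label_names = ["cat", "person", "airplane"]
annotations = [
    SimpleNamespace(label=0),
    SimpleNamespace(label=2),
    SimpleNamespace(label=1),
]

keep = make_label_remover(label_names, "airplane")
# The item argument is unused by this predicate, so None is passed here.
kept = [label_names[ann.label] for ann in annotations if keep(None, ann)]
print(kept)
```

Keeping the two-argument signature means the result still has the Callable[[DatasetItem, Annotation], bool] shape described at the top of this notebook.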

Filtered by attributes#

Some data formats define special attributes for each dataset item or annotation. One example is the occluded boolean used in the COCO format, which indicates whether an object is occluded by another object. We can also filter dataset items or annotations by their attribute fields, as the following example shows.

[9]:
dataset = dm.Dataset.from_iterable(
    [
        dm.DatasetItem(
            id="item_with_occlusion",
            media=dm.Image.from_numpy(np.zeros(shape=(10, 20, 3), dtype=np.uint8)),
            annotations=[
                dm.Bbox(0, 0, 1, 1, attributes={"occluded": True}),
            ],
        ),
        dm.DatasetItem(
            id="item_without_occlusion",
            media=dm.Image.from_numpy(np.zeros(shape=(10, 20, 3), dtype=np.uint8)),
            annotations=[
                dm.Bbox(0, 0, 1, 1, attributes={"occluded": False}),
            ],
        ),
    ]
)
for item in dataset:
    print(f"ID: {item.id}")
ID: item_with_occlusion
ID: item_without_occlusion

Now we retain only the annotations with occluded=False, and we also set the remove_empty=True flag. With this flag set, any dataset item left without annotations after filtering is removed as well. Therefore, item_with_occlusion is removed because it has no bbox after filtering.

[10]:
print("There is no item with occlusion")
filtered = dataset.clone().filter(
    '/item/annotation[occluded="False"]', filter_annotations=True, remove_empty=True
)
for item in filtered:
    print(f"ID: {item.id}")


def remove_occluded_ann(item, ann):
    return not ann.attributes.get("occluded", False)


print("There is no item with occlusion again")
filtered = dataset.clone().filter(remove_occluded_ann, filter_annotations=True, remove_empty=True)
for item in filtered:
    print(f"ID: {item.id}")
There is no item with occlusion
ID: item_without_occlusion
There is no item with occlusion again
ID: item_without_occlusion
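
The interaction between annotation filtering and remove_empty can be summarized with a small pure-Python model. This is only a simplified mental model of the behavior described above, written against plain dictionaries, and is not Datumaro's implementation. The function name `filter_annotations_model` and the dictionary layout are assumptions made for illustration.

```python
def filter_annotations_model(items, predicate, remove_empty=False):
    """Keep annotations that pass `predicate`; optionally drop items left empty."""
    result = []
    for item in items:
        kept = [ann for ann in item["annotations"] if predicate(item, ann)]
        if remove_empty and not kept:
            continue  # the item has no annotations left, so it is dropped
        result.append({**item, "annotations": kept})
    return result


def not_occluded(item, ann):
    """Annotation predicate: treat a missing `occluded` attribute as False."""
    return not ann.get("occluded", False)


# Plain-dict stand-ins for the two dataset items created above.
items = [
    {"id": "item_with_occlusion", "annotations": [{"occluded": True}]},
    {"id": "item_without_occlusion", "annotations": [{"occluded": False}]},
]

ids_keep_empty = [i["id"] for i in filter_annotations_model(items, not_occluded)]
ids_drop_empty = [
    i["id"] for i in filter_annotations_model(items, not_occluded, remove_empty=True)
]
print(ids_keep_empty)
print(ids_drop_empty)
```

In this model, leaving remove_empty at False keeps item_with_occlusion in the result with an empty annotation list, while setting it to True leaves only item_without_occlusion, matching the output shown above.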