Find Most Similar Data from Image or Text Queries#

In this notebook example, we'll take a look at the Datumaro data exploration Python API. Specifically, we provide example code for exploring a dataset with image queries and text queries, using the MS-COCO 2017 dataset. Please prepare the COCO 2017 validation dataset, or download it by referring to this link.

[13]:
# Copyright (C) 2022 Intel Corporation
#
# SPDX-License-Identifier: MIT

import datumaro as dm
from datumaro.components.algorithms.hash_key_inference.explorer import Explorer
from datumaro.components.visualizer import Visualizer

Data exploration#

Explore with COCO instance segmentation dataset#

To use data exploration, a hash key must be defined for each item in the dataset. Explorer calculates the hash keys automatically. If you want to reuse the calculated hash keys, set save_hashkey_meta to True when exporting the dataset. The default value is False.
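How Explorer derives its hash keys is not described here. As a rough illustration of the general idea behind hash-based similarity search, the sketch below assumes that a hash key is a binary code obtained by thresholding a feature vector, and that two keys are compared by Hamming distance. The helper names `to_hash_key` and `hamming_distance` and the feature values are hypothetical and are not part of the Datumaro API.

```python
# Conceptual sketch only: this is NOT Datumaro's implementation.
# Assumption: a hash key is a binary code, one bit per feature dimension.


def to_hash_key(features):
    """Binarize a feature vector into an integer (bit is 1 if value > 0)."""
    key = 0
    for value in features:
        key = (key << 1) | (1 if value > 0 else 0)
    return key


def hamming_distance(key_a, key_b):
    """Count the bit positions in which two hash keys differ."""
    return bin(key_a ^ key_b).count("1")


# Hypothetical feature vectors standing in for model outputs.
cat_a = [0.9, -0.2, 0.4, -0.7]
cat_b = [0.8, -0.1, 0.5, 0.3]
truck = [-0.6, 0.7, -0.3, 0.2]

key_cat_a = to_hash_key(cat_a)  # bits 1010
key_cat_b = to_hash_key(cat_b)  # bits 1011
key_truck = to_hash_key(truck)  # bits 0101

print(hamming_distance(key_cat_a, key_cat_b))  # 1 (similar)
print(hamming_distance(key_cat_a, key_truck))  # 4 (dissimilar)
```

A smaller distance means more bits agree, so items with similar features end up close to each other. Computing the keys once and reusing them avoids repeating the feature extraction for every query.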

[14]:
dataset = dm.Dataset.import_from("coco_dataset", format="coco_instances")
dataset
WARNING:root:File 'coco_dataset/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_dataset/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_dataset/annotations/panoptic_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
[14]:
Dataset
        size=5000
        source_path=coco_dataset
        media_type=<class 'datumaro.components.media.Image'>
        annotated_items_count=4952
        annotations_count=78647
subsets
        val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['polygon', 'bbox', 'mask']
infos
        categories
        label: ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']

Create an Explorer with the dataset, which serves as the database to search.

[15]:
explorer = Explorer(dataset)

Explore with image query#

Pick one item from the dataset as the query for which you want to find similar items.

[16]:
# Take the item at index 50 as the query and stop iterating once it is found.
for i, item in enumerate(dataset):
    if i == 50:
        query = item
        break
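The loop above selects the item at position 50. A minimal sketch of the same selection with `itertools.islice` is shown below; it works on any iterable and stops consuming it as soon as the requested item is reached. The generator stands in for the dataset, and `nth_item` is a hypothetical helper, not a Datumaro function.

```python
from itertools import islice


def nth_item(iterable, n):
    """Return the item at zero-based position n, or raise IndexError."""
    try:
        return next(islice(iterable, n, None))
    except StopIteration:
        raise IndexError(f"iterable has fewer than {n + 1} items") from None


# Stand-in for a dataset: any iterable behaves the same way.
items = (f"item_{i}" for i in range(100))
query = nth_item(items, 50)
print(query)  # item_50
```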

Use Visualizer to check the item that will be used as the query.

[17]:
visualizer = Visualizer(dataset, figsize=(12, 12), alpha=0)
fig = visualizer.vis_one_sample(query.id, "val2017")
fig.show()
[Figure: visualization of the query item]
[18]:
topk_list = explorer.explore_topk(query, topk=15)
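To make the idea of a top-k search concrete, here is a small sketch of ranking a database by distance to a query and keeping the k closest entries. It assumes binary hash keys compared by Hamming distance; `explore_topk_sketch`, the item ids, and the key values are hypothetical illustrations and do not reproduce Datumaro's internals.

```python
import heapq


def hamming_distance(key_a, key_b):
    """Count the bit positions in which two hash keys differ."""
    return bin(key_a ^ key_b).count("1")


def explore_topk_sketch(query_key, database, topk):
    """Return the ids of the topk database entries closest to query_key."""
    ranked = heapq.nsmallest(
        topk,
        database.items(),
        key=lambda entry: hamming_distance(query_key, entry[1]),
    )
    return [item_id for item_id, _ in ranked]


# Hypothetical database mapping item ids to 8-bit hash keys.
database = {
    "img_a": 0b10110010,
    "img_b": 0b10110011,
    "img_c": 0b01001101,
    "img_d": 0b10100010,
    "img_e": 0b11111111,
}

# The query is taken from the database itself, as in the notebook.
query_key = database["img_a"]
result = explore_topk_sketch(query_key, database, topk=3)
print(result)  # img_a comes first because its distance to itself is 0
```

In this sketch `heapq.nsmallest` avoids sorting the whole database when only a few results are needed.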
[19]:
subset_list = []
id_list = []
for result in topk_list:
    subset_list.append(result.subset)
    id_list.append(result.id)
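The two lists built above can also be written as list comprehensions. The sketch below uses a hypothetical `Result` named tuple with made-up ids as a stand-in for the returned items, since only the `id` and `subset` attributes are needed.

```python
from collections import namedtuple

# Hypothetical stand-in for the result items: only .id and .subset are used.
Result = namedtuple("Result", ["id", "subset"])
topk_list = [
    Result("img_001", "val2017"),
    Result("img_002", "val2017"),
    Result("img_003", "val2017"),
]

# Build the parallel lists in one pass each, keeping the ranking order.
id_list = [result.id for result in topk_list]
subset_list = [result.subset for result in topk_list]

print(id_list)      # ['img_001', 'img_002', 'img_003']
print(subset_list)  # ['val2017', 'val2017', 'val2017']
```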
[20]:
fig = visualizer.vis_gallery(id_list[:12], subset_list[:12])
fig.show()
[Figure: gallery of the 12 items most similar to the image query]

Explore with text query#

Set a text query to find the dataset items most similar to it. The query can be a single word or a sentence.

[21]:
topk_list = explorer.explore_topk("elephant", topk=15)
[22]:
subset_list = []
id_list = []
for result in topk_list:
    subset_list.append(result.subset)
    id_list.append(result.id)
[24]:
fig = visualizer.vis_gallery(id_list[:12], subset_list[:12])
fig.show()
[Figure: gallery of the 12 items most similar to the text query "elephant"]