Find Most Similar Data from Image or Text Queries#
In this notebook example, we’ll take a look at Datumaro data exploration Python API. Specifically, we are going to provide the example codes for data exploration for image query and text query with MS-COCO 2017 dataset. Please prepare COCO 2017 validation dataset or download it referring this link.
[13]:
# Copyright (C) 2022 Intel Corporation
#
# SPDX-License-Identifier: MIT
import datumaro as dm
from datumaro.components.algorithms.hash_key_inference.explorer import Explorer
from datumaro.components.visualizer import Visualizer
Data exploration#
Explore with COCO instance segmentation dataset#
To use data exploration, we need to define hash for each dataset. Explorer
calculates the hash key automatically. If you want to re-use the calculated hash key, please set save_hashkey_meta
as True
when exporting the dataset. The default value is False
.
[14]:
dataset = dm.Dataset.import_from("coco_dataset", format="coco_instances")
dataset
WARNING:root:File 'coco_dataset/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_dataset/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_dataset/annotations/panoptic_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
[14]:
Dataset
size=5000
source_path=coco_dataset
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=4952
annotations_count=78647
subsets
val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['polygon', 'bbox', 'mask']
infos
categories
label: ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']
Set explorer with dataset
which is used to database.
[15]:
explorer = Explorer(dataset)
Explore with image
query#
Set one of dataset as query which you want to find similar dataset.
[16]:
for i, item in enumerate(dataset):
if i == 50:
query = item
Use Visualizer to check which query will be used.
[17]:
visualizer = Visualizer(dataset, figsize=(12, 12), alpha=0)
fig = visualizer.vis_one_sample(query.id, "val2017")
fig.show()
[18]:
topk_list = explorer.explore_topk(query, topk=15)
[19]:
subset_list = []
id_list = []
for result in topk_list:
subset_list.append(result.subset)
id_list.append(result.id)
[20]:
fig = visualizer.vis_gallery(id_list[:12], subset_list[:12])
fig.show()
Explore with text
query#
Set text as query which you want to find similar dataset. You can set it as a sentence or a word.
[21]:
topk_list = explorer.explore_topk("elephant", topk=15)
[22]:
subset_list = []
id_list = []
for result in topk_list:
subset_list.append(result.subset)
id_list.append(result.id)
[24]:
fig = visualizer.vis_gallery(id_list[:12], subset_list[:12])
fig.show()