Level 10: Dataset Exploration from a Query Image/Text

Datumaro supports an exploration feature that finds data in a dataset similar to a query. Given a query, the exploration result contains the top-k most similar items in the dataset. This feature helps you understand the properties of a dataset. You can inspect the exploration results with Visualizer.

A more detailed description of the explorer is given in Explore. A Python example of explorer usage is described here.

With the Python API, we can explore similar items as below.

from datumaro.components.dataset import Dataset
from datumaro.components.environment import Environment
from datumaro.components.algorithms.hash_key_inference.explorer import Explorer

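data_path = '/path/to/data'

# Detect the dataset format and import the dataset
env = Environment()
detected_formats = env.detect_dataset(data_path)

dataset = Dataset.import_from(data_path, detected_formats[0])

# Find the top-k items most similar to the query
explorer = Explorer(dataset)
query = '/path/to/image/file'
topk = 20
topk_result = explorer.explore_topk(query, topk)

# Export with hash keys saved; the output path and the 'datumaro' format are placeholders
output_dir = '/path/to/output'
dataset.export(output_dir, 'datumaro', save_hashkey_meta=True)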

By setting save_hashkey_meta=True, we save the hash_key of each item, which is the basis of the explorer. This allows us to re-explore the dataset later without redundant hash calculations.
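To make the benefit of stored hash keys concrete, here is a minimal sketch of top-k retrieval over precomputed keys. It assumes the keys are binary codes compared by Hamming distance; that is an illustrative assumption, not a description of Datumaro's internals. The names hamming_distance, explore_topk_sketch, and saved_keys are hypothetical.

```python
from typing import Dict, List, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two hash keys stored as ints."""
    return bin(a ^ b).count("1")


def explore_topk_sketch(
    query_key: int, hash_keys: Dict[str, int], topk: int
) -> List[Tuple[str, int]]:
    """Return the topk (item_id, distance) pairs closest to query_key."""
    scored = [
        (item_id, hamming_distance(query_key, key))
        for item_id, key in hash_keys.items()
    ]
    # Sort by distance first, then by id so that ties are deterministic.
    scored.sort(key=lambda pair: (pair[1], pair[0]))
    return scored[:topk]


# Hypothetical 8-bit keys standing in for hash keys loaded from saved metadata.
saved_keys = {
    "img_a": 0b10110010,
    "img_b": 0b10110011,
    "img_c": 0b01001101,
    "img_d": 0b10100010,
}

result = explore_topk_sketch(0b10110010, saved_keys, topk=2)
print(result)
```

Because the keys are already available, each later query needs only cheap comparisons; the expensive step of computing a key per item is not repeated.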

Without declaring a project, we can explore a dataset directly as below.

You can set the query using one of the following options: QUERY_PATH, QUERY_ID, or QUERY_STR

datum explore <target> --query-img-path QUERY_PATH -topk TOPK_NUM

QUERY_PATH can be an image file path or a list of paths.

TOPK_NUM is an integer specifying how many similar results to find for the query.

The exploration result is printed to the log, and the result files are copied into the explore_result folder.

datum explore <target> --query-item-id QUERY_ID -topk TOPK_NUM

QUERY_ID can be a dataset item ID or a list of IDs.

datum explore <target> --query-str QUERY_STR -topk TOPK_NUM

QUERY_STR can be a text description or a list of descriptions.

datum explore <target> --query-str QUERY_STR -topk TOPK_NUM -s -o DST_DIR

To save the result, specify the output directory as DST_DIR
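When these commands are launched from a script, the argument list can be assembled programmatically. The sketch below uses only the flags shown above; the helper build_explore_command and its parameter names are hypothetical, and the rule that exactly one query option must be given is an assumption made for this example.

```python
import shlex
from typing import List, Optional


def build_explore_command(
    target: str,
    topk: int,
    query_img_path: Optional[str] = None,
    query_item_id: Optional[str] = None,
    query_str: Optional[str] = None,
    dst_dir: Optional[str] = None,
) -> List[str]:
    """Assemble a `datum explore` argument list for a single query."""
    options = [
        ("--query-img-path", query_img_path),
        ("--query-item-id", query_item_id),
        ("--query-str", query_str),
    ]
    chosen = [(flag, value) for flag, value in options if value is not None]
    if len(chosen) != 1:
        raise ValueError("specify exactly one query option")
    flag, value = chosen[0]
    cmd = ["datum", "explore", target, flag, value, "-topk", str(topk)]
    if dst_dir is not None:
        # Mirrors the `-s -o DST_DIR` form shown above for saving results.
        cmd += ["-s", "-o", dst_dir]
    return cmd


cmd = build_explore_command("/path/to/data", 5, query_str="a photo of a dog")
# shlex.join quotes arguments that contain spaces for safe copy-pasting.
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run on a machine where the datum CLI is installed.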

With the project-based CLI, we first need to create a project:

datum project create --output-dir <path/to/project>

We now import the data into the project:

datum project import --project <path/to/project> <path/to/data>

We can then explore similar items for the query.

You can set the query using one of the following options: QUERY_PATH, QUERY_ID, or QUERY_STR

datum explore --query-img-path QUERY_PATH -topk TOPK_NUM -p <path/to/project>

QUERY_PATH can be an image file path or a list of paths.

TOPK_NUM is an integer specifying how many similar results to find for the query.

The exploration result is printed to the log, and the result files are copied into the explore_result folder.

datum explore <target> --query-item-id QUERY_ID -topk TOPK_NUM -p <path/to/project>

QUERY_ID can be a dataset item ID or a list of IDs.

datum explore <target> --query-str QUERY_STR -topk TOPK_NUM -p <path/to/project>

QUERY_STR can be a text description or a list of descriptions.
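The project-based steps depend on their order: create, import, then explore. As a rough sketch, the three commands can be generated in sequence for one project directory. The helper project_explore_steps is hypothetical, and the explore step follows the --query-img-path form shown above.

```python
import shlex
from typing import List


def project_explore_steps(
    project_dir: str, data_path: str, query_img_path: str, topk: int
) -> List[List[str]]:
    """Return the create, import, and explore commands in execution order."""
    return [
        ["datum", "project", "create", "--output-dir", project_dir],
        ["datum", "project", "import", "--project", project_dir, data_path],
        [
            "datum", "explore",
            "--query-img-path", query_img_path,
            "-topk", str(topk),
            "-p", project_dir,
        ],
    ]


steps = project_explore_steps(
    "/path/to/project", "/path/to/data", "/path/to/image/file", 20
)
for step in steps:
    print(shlex.join(step))
```

Each entry is a plain argument list, so the steps can be reviewed or logged before anything is executed.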