datumaro.components.hl_ops#

Classes

HLOps()

High-level dataset operations for Python API.

class datumaro.components.hl_ops.HLOps[source]#

Bases: object

High-level dataset operations for Python API.

static compare(first_dataset: IDataset, second_dataset: IDataset, report_dir: str | None = None, method: str = 'table', **kwargs) IDataset[source]#

Compare two datasets and optionally save a comparison report.

Parameters:
  • first_dataset (IDataset) – The first dataset to compare.

  • second_dataset (IDataset) – The second dataset to compare.

  • report_dir (Optional[str], optional) – The directory path to save the comparison report. Defaults to None.

  • method (str, optional) – The comparison method to use. Possible values are “table”, “equality”, “distance”. Defaults to “table”.

  • **kwargs – Additional keyword arguments that can be passed to the comparison method.

Returns:

The result of the comparison.

Return type:

IDataset

Raises:

ValueError – If the method is “distance” and report_dir is not specified.

Example

    result = HLOps.compare(first_dataset, second_dataset, report_dir="./comparison_report")
    print(result)

static transform(dataset: IDataset, method: str | Type[Transform], *, env: Environment | None = None, **kwargs) IDataset[source]#

Applies some function to dataset items.

Results are computed lazily, if the transform supports this.

Parameters:
  • dataset – The dataset to be transformed

  • method – The transformation to be applied to the dataset. If a string is passed, it is treated as a plugin name, which is searched for in the environment set by the ‘env’ argument

  • env – A plugin collection. If not set, the built-in plugins are used

  • **kwargs – Parameters for the transformation

Returns: a wrapper around the input dataset
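The lazy behaviour described above can be pictured with a small stand-alone sketch. It does not use Datumaro: `LazyTransform` and `upper_id` are hypothetical names, and items are modeled as plain dicts. The only point is that wrapping a dataset does no work until the wrapper is iterated.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Item = Dict[str, str]  # toy stand-in for a dataset item (assumption)


class LazyTransform:
    """Hypothetical wrapper: applies `func` to items only while iterating."""

    def __init__(self, source: Iterable[Item], func: Callable[[Item], Item]):
        self._source = source
        self._func = func

    def __iter__(self) -> Iterator[Item]:
        for item in self._source:
            yield self._func(item)


calls: List[str] = []  # records which items were actually processed


def upper_id(item: Item) -> Item:
    calls.append(item["id"])
    # Return a modified copy instead of mutating the input item.
    return {**item, "id": item["id"].upper()}


source = [{"id": "a"}, {"id": "b"}]
wrapped = LazyTransform(source, upper_id)
calls_before_iteration = list(calls)  # nothing has been computed yet

result = list(wrapped)  # iteration triggers the computation
```

In this toy model, `calls_before_iteration` stays empty and the source items are left untouched; whether a real transform is lazy depends on the transform itself, as noted above.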

filter(expr: str, *, filter_annotations: bool = False, remove_empty: bool = False) IDataset[source]#
filter(filter_func: Callable[[DatasetItem], bool] | Callable[[DatasetItem, Annotation], bool], *, filter_annotations: bool = False, remove_empty: bool = False) IDataset
static merge(*datasets: Dataset, merge_policy: str = 'exact', report_path: str | None = None, **kwargs) Dataset[source]#

Merge datasets according to merge_policy. You have to choose an appropriate merge_policy for your purpose. The available merge policies are “union”, “intersect”, and “exact”. For more details about the merge policies, please refer to get_merger().
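As a rough illustration of the "union" policy documented under get_merger(), the following stand-alone sketch models datasets as plain dicts. It is not Datumaro's implementation: `union_merge_sketch` is a hypothetical helper, labels are sorted only to make the result deterministic, and keeping non-colliding ids unchanged is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Item = Tuple[str, str]  # (id, subset): toy stand-in for a dataset item


def union_merge_sketch(datasets: List[Dict[str, list]]) -> Dict[str, list]:
    """Toy model of the "union" policy; not Datumaro's implementation."""
    # Label categories: union of label names (sorted here for determinism).
    labels = sorted({name for ds in datasets for name in ds["labels"]})

    # Group items by their (id, subset) key, remembering the source index.
    groups: Dict[Item, List[int]] = defaultdict(list)
    for src_idx, ds in enumerate(datasets):
        for key in ds["items"]:
            groups[key].append(src_idx)

    items: List[Item] = []
    for (item_id, subset), sources in groups.items():
        if len(sources) == 1:
            # Assumption: ids that do not collide are kept unchanged.
            items.append((item_id, subset))
        else:
            # Colliding ids get a per-source suffix.
            items.extend((f"{item_id}-{i}", subset) for i in sources)
    return {"labels": labels, "items": items}


dataset_a = {
    "labels": ["car", "cat", "dog"],
    "items": [("magic", "train"), ("a1", "train")],
}
dataset_b = {"labels": ["car", "bus", "truck"], "items": [("magic", "train")]}
merged = union_merge_sketch([dataset_a, dataset_b])
```

Here the two "magic" items survive as "magic-0" and "magic-1", while the label list becomes the union of both label sets.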

static run_model(dataset: IDataset, model: Launcher | Type[ModelTransform], *, batch_size: int = 1, append_annotation: bool = False, num_workers: int = 0, **kwargs) IDataset[source]#

Run the model on the dataset item media entities, such as images, to obtain pseudo labels and add them as dataset annotations.

Parameters:
  • dataset – The dataset to be transformed

  • model – The model to be applied to the dataset

  • batch_size – The number of dataset items processed simultaneously by the model

  • append_annotation – Whether to append new annotations to the existing annotations

  • num_workers – The number of worker threads to use for parallel inference. Set to 0 for single-process mode. Default is 0.

  • **kwargs – Parameters for the model

Returns: a wrapper around the input dataset, which is computed lazily during iteration

static export(dataset: IDataset, path: str, format: str | Type[Exporter], *, env: Environment | None = None, **kwargs) None[source]#

Saves the input dataset in some format.

Parameters:
  • dataset – The dataset to be saved

  • path – The output directory

  • format – The desired output format for the dataset. If a string is passed, it is treated as a plugin name, which is searched for in the environment set by the ‘env’ argument

  • env – A plugin collection. If not set, the built-in plugins are used

  • **kwargs – Parameters for the export format

static validate(dataset: IDataset, task: str | TaskType, *, env: Environment | None = None, **kwargs) Dict[source]#

Checks dataset annotations for correctness relative to a task type.

Parameters:
  • dataset – The dataset to check

  • task – Target task type - classification, detection etc.

  • env – A plugin collection. If not set, the built-in plugins are used

  • **kwargs – Parameters for the validator

Returns: a dictionary with validation results

static aggregate(dataset: Dataset, from_subsets: Iterable[str], to_subset: str) Dataset[source]#
class datumaro.components.hl_ops.Dataset(source: IDataset | None = None, *, infos: Dict[str, Any] | None = None, categories: Dict[AnnotationType, Categories] | None = None, media_type: Type[MediaElement] | None = None, task_type: TaskType | None = None, env: Environment | None = None)[source]#

Bases: IDataset

Represents a dataset, contains metainfo about labels and dataset items. Provides iteration and access options to dataset elements.

By default, all operations are done lazily. This can be changed by modifying the eager property or by using the eager_mode context manager.

Dataset is supposed to have a single media type for its items. If the dataset is filled manually or from extractors, and media type does not match, an error is raised.

classmethod from_iterable(iterable: ~typing.Iterable[~datumaro.components.dataset_base.DatasetItem], infos: ~typing.Dict[str, ~typing.Any] | None = None, categories: ~typing.Dict[~datumaro.components.annotation.AnnotationType, ~datumaro.components.annotation.Categories] | ~typing.List[str] | None = None, *, env: ~datumaro.components.environment.Environment | None = None, media_type: ~typing.Type[~datumaro.components.media.MediaElement] = <class 'datumaro.components.media.Image'>, task_type: ~datumaro.components.task.TaskType | None = TaskType.unlabeled) Dataset[source]#

Creates a new dataset from an iterable object producing dataset items - a generator, a list etc. It is a convenient way to create and fill a custom dataset.

Parameters:
  • iterable – An iterable which returns dataset items

  • infos – A dictionary of the dataset specific information

  • categories – A simple list of labels or complete information about labels. If not specified, an empty list of labels is assumed.

  • media_type – Media type for the dataset items. If the sequence contains items with mismatching media type, an error is raised during caching

  • env – A context for plugins, which will be used for this dataset. If not specified, the builtin plugins will be used.

Returns:

A new dataset with specified contents

Return type:

Dataset

classmethod from_extractors(*sources: IDataset, env: Environment | None = None, merge_policy: str = 'exact') Dataset[source]#

Creates a new dataset from one or several `Extractor`s.

In case of a single input, creates a lazy wrapper around the input. In case of several inputs, merges them and caches the resulting dataset.

Parameters:
  • sources – one or many input extractors

  • env – A context for plugins, which will be used for this dataset. If not specified, the builtin plugins will be used.

  • merge_policy – Policy on how to merge multiple datasets. Possible options are “exact”, “intersect”, and “union”.

Returns:

A new dataset with contents produced by input extractors

Return type:

Dataset

define_infos(infos: Dict[str, Any]) None[source]#
define_categories(categories: Dict[AnnotationType, Categories]) None[source]#
init_cache() None[source]#
get_subset(name) DatasetSubset[source]#
subsets() Dict[str, DatasetSubset][source]#

Enumerates subsets in the dataset. Each subset can be a dataset itself.

infos() Dict[str, Any][source]#

Returns meta-info of dataset.

categories() Dict[AnnotationType, Categories][source]#

Returns metainfo about dataset labels.

media_type() Type[MediaElement][source]#

Returns media type of the dataset items.

All the items are supposed to have the same media type. Supposed to be constant and known immediately after the object construction (i.e. doesn’t require dataset iteration).

task_type() TaskType[source]#

Returns available task type from dataset annotation types.

get(id: str, subset: str | None = None) DatasetItem | None[source]#

Provides random access to dataset items.

get_annotated_items()[source]#
get_annotations()[source]#
get_datasetitem_by_path(path)[source]#
get_label_cat_names()[source]#
get_subset_info() str[source]#
get_infos() Tuple[str][source]#
get_categories_info() Tuple[str][source]#
put(item: DatasetItem, id: str | None = None, subset: str | None = None) None[source]#
remove(id: str, subset: str | None = None) None[source]#
filter(expr: str, *, filter_annotations: bool = False, remove_empty: bool = False) Dataset[source]#
filter(filter_func: Callable[[DatasetItem], bool] | Callable[[DatasetItem, Annotation], bool], *, filter_annotations: bool = False, remove_empty: bool = False) Dataset
update(source: DatasetPatch | IDataset | Iterable[DatasetItem]) Dataset[source]#

Updates items of the current dataset from another dataset or an iterable (the source). Items from the source overwrite matching items in the current dataset. Unmatched items are just appended.

If the source is a DatasetPatch, the removed items in the patch will be removed in the current dataset.

If the source is a dataset, labels are matched. If the labels match, but the order is different, the annotation labels will be remapped to the current dataset label order during updating.

Returns: self
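The update rules above can be sketched with plain dictionaries. This is a toy model, not Datumaro's implementation: `update_sketch` is a hypothetical helper, items are assumed to be matched by their (id, subset) pair, and label remapping is left out.

```python
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # (id, subset); assumed matching key


def update_sketch(
    current: Dict[Key, dict],
    source: Dict[Key, dict],
    removed: Iterable[Key] = (),
) -> Dict[Key, dict]:
    """Toy model of the update rules; not Datumaro's implementation."""
    for key, item in source.items():
        # Matching keys are overwritten, unmatched keys are appended.
        current[key] = item
    for key in removed:
        # Only relevant for patch-like sources that record removals.
        current.pop(key, None)
    return current  # mirrors "Returns: self"


current = {
    ("1", "train"): {"label": "cat"},
    ("2", "train"): {"label": "dog"},
}
source = {
    ("2", "train"): {"label": "wolf"},
    ("3", "val"): {"label": "bus"},
}
result = update_sketch(current, source, removed=[("1", "train")])
```

After the call, item "2" carries the source's content, item "3" has been appended, and item "1" is gone because the patch-like source marked it as removed.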

transform(method: str | Type[Transform], **kwargs) Dataset[source]#

Applies some function to dataset items.

Results are stored in-place. Modifications are applied lazily. Transforms are not allowed to change media type of dataset items.

Parameters:
  • method – The transformation to be applied to the dataset. If a string is passed, it is treated as a plugin name, which is searched for in the dataset environment.

  • **kwargs – Parameters for the transformation

Returns: self

run_model(model: Launcher | Type[ModelTransform], *, batch_size: int = 1, append_annotation: bool = False, num_workers: int = 0, **kwargs) Dataset[source]#

Applies a model to dataset items’ media and produces a dataset with media and annotations.

Parameters:
  • model – The model to be applied to the dataset

  • batch_size – The number of dataset items processed simultaneously by the model

  • append_annotation – Whether to append new annotations to the existing annotations

  • num_workers – The number of worker threads to use for parallel inference. Set to 0 for single-process mode. Default is 0.

  • **kwargs – Parameters for the model

Returns: self

select(pred: Callable[[DatasetItem], bool]) Dataset[source]#
property data_path: str | None#
property format: str | None#
property options: Dict[str, Any]#
property is_modified: bool#
get_patch() DatasetPatch[source]#
property env: Environment#
property is_cache_initialized: bool#
property is_eager: bool#
property is_bound: bool#
bind(path: str, format: str | None = None, *, options: Dict[str, Any] | None = None) None[source]#

Binds the dataset to a specific directory. Allows setting default saving parameters.

The following saves will be done to this directory by default and will use the saved parameters.

flush_changes()[source]#
export(save_dir: str, format: str | Type[Exporter], *, progress_reporter: ProgressReporter | None = None, error_policy: ExportErrorPolicy | None = None, **kwargs) None[source]#

Saves the dataset in some format.

Parameters:
  • save_dir – The output directory

  • format – The desired output format. If a string is passed, it is treated as a plugin name, which is searched for in the dataset environment.

  • progress_reporter – An object to report progress

  • error_policy – An object to report format-related errors

  • **kwargs – Parameters for the format

save(save_dir: str | None = None, **kwargs) None[source]#
classmethod load(path: str, **kwargs) Dataset[source]#
classmethod import_from(path: str, format: str | None = None, *, env: Environment | None = None, progress_reporter: ProgressReporter | None = None, error_policy: ImportErrorPolicy | None = None, **kwargs) Dataset[source]#

Creates a Dataset instance from a dataset on the disk.

Parameters:
  • path – The input file or directory

  • format – Dataset format. If a string is passed, it is treated as a plugin name, which is searched for in the env plugin context. If not set, will try to detect automatically, using the env plugin context.

  • env – A plugin collection. If not set, the built-in plugins are used

  • progress_reporter – An object to report progress. Implies eager loading.

  • error_policy – An object to report format-related errors. Implies eager loading.

  • **kwargs – Parameters for the format

static detect(path: str, *, env: Environment | None = None, depth: int = 2) str[source]#

Attempts to detect dataset format of a given directory.

This function tries to detect a single format and fails if it’s not possible. Check Environment.detect_dataset() for a function that reports status for each format checked.

Parameters:
  • path – The directory to check

  • depth – The maximum depth for recursive search

  • env – A plugin collection. If not set, the built-in plugins are used

property is_stream: bool#

Boolean indicating whether the dataset is a stream

If the dataset is a stream, the dataset item is generated on demand from its iterator.

clone() Dataset[source]#

Create a deep copy of this dataset.

Returns:

A cloned instance of the Dataset.

exception datumaro.components.hl_ops.DatasetError[source]#

Bases: DatumaroError

class datumaro.components.hl_ops.DistanceComparator(iou_threshold=0.5)[source]#

Bases: object

Method generated by attrs for class DistanceComparator.

match_annotations(item_a, item_b)[source]#
match_labels(item_a, item_b)[source]#
match_polygons(item_a, item_b)[source]#
match_masks(item_a, item_b)[source]#
match_boxes(item_a, item_b)[source]#
match_points(item_a, item_b)[source]#
match_lines(item_a, item_b)[source]#
class datumaro.components.hl_ops.DistanceCompareVisualizer(comparator, save_dir: str, output_format: None | str | OutputFormat = None)[source]#

Bases: object

class OutputFormat(value)[source]#

Bases: Enum

An enumeration.

simple = 1#
tensorboard = 2#
DEFAULT_FORMAT = 1#
save(a: IDataset, b: IDataset)[source]#
update_label_confusion(label_diff)[source]#
update_bbox_confusion(diff)[source]#
update_polygon_confusion(diff)[source]#
update_mask_confusion(diff)[source]#
classmethod draw_text_with_background(frame, text, origin, font=None, scale=1.0, color=(0, 0, 0), thickness=1, bgcolor=(1, 1, 1))[source]#
draw_detection_roi(frame, x, y, w, h, label, conf, color)[source]#
get_a_label(label_id)[source]#
get_b_label(label_id)[source]#
draw_bbox(img, shape, label, color)[source]#
get_label_diff_file()[source]#
save_item_label_diff(item_a, item_b, diff)[source]#
save_item_bbox_diff(item_a, item_b, diff)[source]#
save_as_tensorboard(img, name)[source]#
save_conf_matrix(conf_matrix, filename)[source]#
class datumaro.components.hl_ops.Environment(use_lazy_import: bool = True)[source]#

Bases: object

property extractors: DatasetBaseRegistry#
property importers: ImporterRegistry#
property launchers: LauncherRegistry#
property exporters: ExporterRegistry#
property generators: GeneratorRegistry#
property transforms: TransformRegistry#
property validators: ValidatorRegistry#
load_plugins(plugins_dir)[source]#
register_plugins(plugins)[source]#
make_extractor(name, *args, **kwargs)[source]#
make_importer(name, *args, **kwargs)[source]#
make_launcher(name, *args, **kwargs)[source]#
make_exporter(name, *args, **kwargs)[source]#
make_transform(name, *args, **kwargs)[source]#
is_format_known(name)[source]#
detect_dataset(path: str, depth: int = 1, rejection_callback: Callable[[str, RejectionReason, str], None] | None = None) List[str][source]#
classmethod merge(envs: Sequence[Environment]) Environment[source]#
classmethod release_builtin_plugins()[source]#
class datumaro.components.hl_ops.EqualityComparator(*, match_images: bool = False, ignored_fields=_Nothing.NOTHING, ignored_attrs=_Nothing.NOTHING, ignored_item_attrs=_Nothing.NOTHING, all=False)[source]#

Bases: object

Method generated by attrs for class EqualityComparator.

match_images: bool#
errors: list#
compare_datasets(a, b)[source]#
static save_compare_report(output: Dict, report_dir: str) None[source]#

Saves the comparison report to JSON and text files.

Parameters:
  • output – A dictionary containing the comparison data.

  • report_dir – A string representing the directory to save the report files.

class datumaro.components.hl_ops.Exporter(extractor: IDataset, save_dir: str, *, save_media: bool | None = None, image_ext: str | None = None, default_image_ext: str | None = None, save_dataset_meta: bool = False, save_hashkey_meta: bool = False, stream: bool = False, ctx: ExportContext | None = None)[source]#

Bases: CliPlugin

DEFAULT_IMAGE_EXT = None#
classmethod build_cmdline_parser(**kwargs)[source]#
classmethod convert(extractor, save_dir, **options)[source]#
classmethod patch(dataset, patch, save_dir, **options)[source]#
apply()[source]#

Execute the data-format conversion

property can_stream: bool#

Flag to indicate whether the exporter can export the dataset in a stream manner or not.

class datumaro.components.hl_ops.IDataset[source]#

Bases: object

subsets() Dict[str, IDataset][source]#

Enumerates subsets in the dataset. Each subset can be a dataset itself.

get_subset(name) IDataset[source]#
infos() Dict[str, Any][source]#

Returns meta-info of dataset.

categories() Dict[AnnotationType, Categories][source]#

Returns metainfo about dataset labels.

get(id: str, subset: str | None = None) DatasetItem | None[source]#

Provides random access to dataset items.

media_type() Type[MediaElement][source]#

Returns media type of the dataset items.

All the items are supposed to have the same media type. Supposed to be constant and known immediately after the object construction (i.e. doesn’t require dataset iteration).

task_type() TaskType[source]#

Returns available task type from dataset annotation types.

property is_stream: bool#

Boolean indicating whether the dataset is a stream

If the dataset is a stream, the dataset item is generated on demand from its iterator.

class datumaro.components.hl_ops.Launcher(model_dir: str | None = None)[source]#

Bases: CliPlugin

preprocess(item: DatasetItem) Tuple[ndarray | Dict[str, ndarray], PrepInfo][source]#

Preprocess single dataset item before launch()

There are two output types:

1. The output is np.ndarray. For example, it can be image data as np.ndarray with BGR format (H, W, C). In this step, you usually implement resizing, normalizing, or color channel conversion for your launcher (or model).

2. The output is Dict[str, np.ndarray]. For example, it can be image and text pairs, so it can be used for models with multimodal image and text inputs.

infer(inputs: Dict[str, ndarray]) List[Dict[str, ndarray] | List[Dict[str, ndarray]]][source]#
infer(inputs: ndarray) List[Dict[str, ndarray] | List[Dict[str, ndarray]]]
postprocess(pred: Dict[str, ndarray] | List[Dict[str, ndarray]], info: PrepInfo) List[Annotation][source]#
launch(batch: Sequence[DatasetItem], stack: bool = True) List[List[Annotation]][source]#

Launch to obtain the inference outputs of items.

Parameters:
  • batch – A batch of DatasetItems

  • stack – If true, launch inference on the input stacked along the batch dimension. Otherwise, launch inference for each input separately.

Returns:

A list of annotation lists, one per input DatasetItem. These annotation lists are the pseudo-labels obtained by model inference.
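The call pattern of launch() can be pictured with a stand-alone toy class. `ToyLauncher` is hypothetical and does not subclass the real Launcher; strings stand in for dataset items and integers for model inputs. It only illustrates how the stack flag might change the number of inference calls while the output keeps one annotation list per item.

```python
from typing import List, Sequence


class ToyLauncher:
    """Hypothetical launcher; it only mimics the call pattern described above."""

    def __init__(self) -> None:
        self.infer_calls = 0

    def preprocess(self, item: str) -> int:
        return len(item)  # stand-in for resizing / normalizing an image

    def infer(self, inputs: List[int]) -> List[int]:
        self.infer_calls += 1
        return [x * 2 for x in inputs]  # stand-in for a model forward pass

    def postprocess(self, pred: int) -> List[str]:
        return [f"label_{pred}"]  # one annotation list per item

    def launch(self, batch: Sequence[str], stack: bool = True) -> List[List[str]]:
        inputs = [self.preprocess(item) for item in batch]
        if stack:
            preds = self.infer(inputs)  # one call for the whole batch
        else:
            preds = [self.infer([x])[0] for x in inputs]  # one call per item
        return [self.postprocess(p) for p in preds]


stacked = ToyLauncher()
stacked_out = stacked.launch(["ab", "cde"], stack=True)

unstacked = ToyLauncher()
unstacked_out = unstacked.launch(["ab", "cde"], stack=False)
```

Both modes return the same pseudo-labels in this sketch; only the number of infer() calls differs.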

infos()[source]#
categories()[source]#
type_check(item: DatasetItem) bool[source]#

Check the media type of dataset item.

If False, the item is excluded from the input batch.

class datumaro.components.hl_ops.ModelTransform(extractor: IDataset, launcher: Launcher, batch_size: int = 1, append_annotation: bool = False, num_workers: int = 0)[source]#

Bases: Transform

A transformation class for applying a model’s inference to dataset items.

This class takes a dataset, a launcher, and other optional parameters, and transforms each dataset item using the model outputs produced by the launcher. It can process items using multiple processes if specified, making it suitable for parallelized inference tasks.

Parameters:
  • extractor – The dataset extractor to obtain items from.

  • launcher – The launcher responsible for model inference.

  • batch_size – The batch size for processing items. Default is 1.

  • append_annotation – Whether to append inference annotations to existing annotations. Default is False.

  • num_workers – The number of worker threads to use for parallel inference. Set to 0 for single-process mode. Default is 0.

get_subset(name)[source]#
infos()[source]#

Returns meta-info of dataset.

categories()[source]#

Returns metainfo about dataset labels.

transform_item(item)[source]#
class datumaro.components.hl_ops.TableComparator[source]#

Bases: object

Class for comparing datasets and generating comparison report table.

Method generated by attrs for class TableComparator.

compare_datasets(first: Dataset, second: Dataset, mode: str = 'all') Tuple[str, str, str, Dict][source]#

Compares two datasets and generates comparison reports.

Parameters:
  • first – The first dataset to compare.

  • second – The second dataset to compare.

Returns:

A tuple containing high-level table, mid-level table, low-level table, and a dictionary representation of the comparison.

static save_compare_report(high_level_table: str, mid_level_table: str, low_level_table: str, comparison_dict: Dict, report_dir: str) None[source]#

Saves the comparison report to JSON and text files.

Parameters:
  • high_level_table – High-level comparison table as a string.

  • mid_level_table – Mid-level comparison table as a string.

  • low_level_table – Low-level comparison table as a string.

  • comparison_dict – A dictionary containing the comparison data.

  • report_dir – A string representing the directory to save the report files.

class datumaro.components.hl_ops.TaskType(value)[source]#

Bases: Enum

An enumeration.

classification = 1#
detection = 2#
segmentation = 3#
class datumaro.components.hl_ops.Transform(extractor: IDataset)[source]#

Bases: DatasetBase, CliPlugin

A base class for dataset transformations that change dataset items or their annotations.

static wrap_item(item, **kwargs)[source]#
categories()[source]#

Returns metainfo about dataset labels.

subsets()[source]#

Enumerates subsets in the dataset. Each subset can be a dataset itself.

media_type()[source]#

Returns media type of the dataset items.

All the items are supposed to have the same media type. Supposed to be constant and known immediately after the object construction (i.e. doesn’t require dataset iteration).

infos()[source]#

Returns meta-info of dataset.

class datumaro.components.hl_ops.UserFunctionAnnotationsFilter(extractor: IDataset, filter_func: Callable[[DatasetItem, Annotation], bool], remove_empty: bool = False)[source]#

Bases: ItemTransform

Filter annotations using a user-provided Python function.

Parameters:
  • extractor – Datumaro Dataset to filter.

  • filter_func – A Python callable that takes DatasetItem and Annotation as its inputs and returns a boolean. If the return value is True, the Annotation will be retained. Otherwise, it is removed.

  • remove_empty – If True, DatasetItem without any annotations is removed after filtering its annotations. Otherwise, do not filter DatasetItem.

Example

This is an example of removing bounding boxes sized greater than 50% of the image size:

    from datumaro.components.annotation import Annotation, Bbox
    from datumaro.components.dataset_base import DatasetItem
    from datumaro.components.media import Image

    def filter_func(item: DatasetItem, ann: Annotation) -> bool:
        # If the annotation is not a Bbox, do not filter it (keep it)
        if not isinstance(ann, Bbox):
            return True

        h, w = item.media_as(Image).size
        image_size = h * w
        bbox_size = ann.h * ann.w

        # Accept Bboxes smaller than 50% of the image size
        return bbox_size < 0.5 * image_size

    filtered = UserFunctionAnnotationsFilter(
        extractor=dataset, filter_func=filter_func)

    # No bounding boxes with a size greater than 50% of their image
    filtered_items = [item for item in filtered]

transform_item(item: DatasetItem) DatasetItem | None[source]#

Returns a modified copy of the input item.

Avoid changing and returning the input item, because it can lead to unexpected problems. Use wrap_item() or item.wrap() to simplify copying.

class datumaro.components.hl_ops.UserFunctionDatasetFilter(extractor: IDataset, filter_func: Callable[[DatasetItem], bool])[source]#

Bases: ItemTransform

Filter dataset items using a user-provided Python function.

Parameters:
  • extractor – Datumaro Dataset to filter.

  • filter_func – A Python callable that takes a DatasetItem as its input and returns a boolean. If the return value is True, that DatasetItem will be retained. Otherwise, it is removed.

Example

This is an example of filtering out dataset items with images larger than 1024 pixels:

    from datumaro.components.dataset_base import DatasetItem
    from datumaro.components.media import Image

    def filter_func(item: DatasetItem) -> bool:
        # Retain only items whose height and width are both at most 1024
        h, w = item.media_as(Image).size
        return h <= 1024 and w <= 1024

    filtered = UserFunctionDatasetFilter(
        extractor=dataset, filter_func=filter_func)

    # No items with an image height or width greater than 1024
    filtered_items = [item for item in filtered]

transform_item(item: DatasetItem) DatasetItem | None[source]#

Returns a modified copy of the input item.

Avoid changing and returning the input item, because it can lead to unexpected problems. Use wrap_item() or item.wrap() to simplify copying.

class datumaro.components.hl_ops.Validator[source]#

Bases: CliPlugin

validate(dataset: IDataset) Dict[source]#

Returns the validation results of a dataset based on task type.

Parameters:

dataset (IDataset) – Dataset to be validated

Raises:

ValueError

Returns:

Dict with validation statistics, reports and summary.

Return type:

validation_results (dict)

compute_statistics(dataset: IDataset) Dict[source]#

Computes statistics of the dataset based on task type.

Parameters:

dataset (IDataset) – a dataset to be validated

Returns:

A dict object containing statistics of the dataset.

Return type:

stats (dict)

generate_reports(stats: Dict) List[Dict][source]#
class datumaro.components.hl_ops.XPathAnnotationsFilter(extractor: IDataset, xpath: str, remove_empty: bool = False)[source]#

Bases: ItemTransform

transform_item(item: DatasetItem) DatasetItem | None[source]#

Returns a modified copy of the input item.

Avoid changing and returning the input item, because it can lead to unexpected problems. Use wrap_item() or item.wrap() to simplify copying.

class datumaro.components.hl_ops.XPathDatasetFilter(extractor: IDataset, xpath: str)[source]#

Bases: ItemTransform

transform_item(item: DatasetItem) DatasetItem | None[source]#

Returns a modified copy of the input item.

Avoid changing and returning the input item, because it can lead to unexpected problems. Use wrap_item() or item.wrap() to simplify copying.

datumaro.components.hl_ops.get_merger(merge_policy: str = 'exact', *args, **kwargs) Merger[source]#

Get Merger according to merge_policy. You have to choose an appropriate Merger for your purpose. The available merge policies are “union”, “intersect”, and “exact”.

  1. UnionMerge

Merge several datasets with “union” policy:

  • Label categories are merged according to the union of their label names. For example, if Dataset-A has {"car", "cat", "dog"} and Dataset-B has {"car", "bus", "truck"} labels, the merged dataset will have {"bus", "car", "cat", "dog", "truck"} labels.

  • If there are two or more dataset items whose (id, subset) pairs match each other, both are included in the merged dataset. Because the same (id, subset) pair cannot appear twice in a dataset, a suffix is added to the id of each source item. For example, if Dataset-A has DatasetItem(id="magic", subset="train") and Dataset-B also has DatasetItem(id="magic", subset="train"), the merged dataset will have DatasetItem(id="magic-0", subset="train") and DatasetItem(id="magic-1", subset="train").

  2. IntersectMerge

Merge several datasets with “intersect” policy:

  • If there are two or more dataset items whose (id, subset) pairs match each other, we can consider this as having an intersection in our dataset. This method merges the annotations of the corresponding DatasetItems into one DatasetItem to handle this intersection. The rule for merging annotations is provided by AnnotationMerger according to their annotation types. For example, DatasetItem(id="item_1", subset="train", annotations=[Bbox(0, 0, 1, 1)]) from Dataset-A and DatasetItem(id="item_1", subset="train", annotations=[Bbox(.5, .5, 1, 1)]) from Dataset-B can be merged into DatasetItem(id="item_1", subset="train", annotations=[Bbox(0, 0, 1, 1)]).

  • Label categories are merged according to the union of their label names (same as UnionMerge). For example, if Dataset-A has {"car", "cat", "dog"} and Dataset-B has {"car", "bus", "truck"} labels, the merged dataset will have {"bus", "car", "cat", "dog", "truck"} labels.

  • This merge has configuration parameters (conf) to control the annotation merge behaviors. For example,

```python
merge = IntersectMerge(
    conf=IntersectMerge.Conf(
        pairwise_dist=0.25,
        groups=[],
        output_conf_thresh=0.0,
        quorum=0,
    )
)
```

For more details for the parameters, please refer to IntersectMerge.Conf.

  3. ExactMerge

Merges several datasets using the “simple” algorithm:

  • All datasets should have the same categories

  • items are matched by (id, subset) pairs

  • matching items share the media info available:
    • nothing + nothing = nothing

    • nothing + something = something

    • something A + something B = conflict

  • annotations are matched by value and shared

  • in case of conflicts, throws an error

datumaro.components.hl_ops.on_error_do(callback, *args, ignore_errors=False, kwargs=None)[source]#
datumaro.components.hl_ops.overload(func)[source]#

Decorator for overloaded functions/methods.

In a stub file, place two or more stub definitions for the same function in a row, each decorated with @overload. For example:

    @overload
    def utf8(value: None) -> None: ...
    @overload
    def utf8(value: bytes) -> bytes: ...
    @overload
    def utf8(value: str) -> bytes: ...

In a non-stub file (i.e. a regular .py file), do the same but follow it with an implementation. The implementation should not be decorated with @overload. For example:

    @overload
    def utf8(value: None) -> None: ...
    @overload
    def utf8(value: bytes) -> bytes: ...
    @overload
    def utf8(value: str) -> bytes: ...
    def utf8(value):
        ...  # implementation goes here

datumaro.components.hl_ops.parse_str_enum_value(value, enum_class, default=<object object>, unknown_member_error=None)[source]#
datumaro.components.hl_ops.scoped(func, arg_name=None)[source]#

A function decorator that allows performing actions within the current scope, such as registering error and exit callbacks and context managers.