datumaro.components.dataset#
- class datumaro.components.dataset.Dataset(source: Optional[IDataset] = None, *, infos: Optional[Dict[str, Any]] = None, categories: Optional[Dict[AnnotationType, Categories]] = None, media_type: Optional[Type[MediaElement]] = None, env: Optional[Environment] = None)[source]#
Bases:
IDataset
Represents a dataset, contains metainfo about labels and dataset items. Provides iteration and access options to dataset elements.
By default, all operations are done lazily, it can be changed by modifying the eager property and by using the eager_mode context manager.
Dataset is supposed to have a single media type for its items. If the dataset is filled manually or from extractors, and media type does not match, an error is raised.
- classmethod from_iterable(iterable: Iterable[DatasetItem], infos: Optional[Dict[str, Any]] = None, categories: Optional[Union[Dict[AnnotationType, Categories], List[str]]] = None, *, env: Optional[Environment] = None, media_type: Type[MediaElement] = <class 'datumaro.components.media.Image'>) Dataset [source]#
Creates a new dataset from an iterable object producing dataset items (a generator, a list, etc.). It is a convenient way to create and fill a custom dataset.
- Parameters:
iterable – An iterable which returns dataset items
infos – A dictionary of the dataset specific information
categories – A simple list of labels or complete information about labels. If not specified, an empty list of labels is assumed.
media_type – Media type for the dataset items. If the sequence contains items with mismatching media type, an error is raised during caching
env – A context for plugins, which will be used for this dataset. If not specified, the builtin plugins will be used.
- Returns:
A new dataset with specified contents
- Return type:
Dataset
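A minimal usage sketch; the item ids and label names are illustrative:

```python
from datumaro.components.annotation import Label
from datumaro.components.dataset import Dataset, DatasetItem

# Build a small in-memory dataset; categories may be a plain list of names.
dataset = Dataset.from_iterable(
    [
        DatasetItem(id="img1", subset="train", annotations=[Label(0)]),
        DatasetItem(id="img2", subset="val", annotations=[Label(1)]),
    ],
    categories=["cat", "dog"],
)
```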
- classmethod from_extractors(*sources: IDataset, env: Optional[Environment] = None, merge_policy: str = 'exact') Dataset [source]#
Creates a new dataset from one or several `Extractor`s.
In case of a single input, creates a lazy wrapper around the input. In case of several inputs, merges them and caches the resulting dataset.
- Parameters:
sources – one or many input extractors
env – A context for plugins, which will be used for this dataset. If not specified, the builtin plugins will be used.
merge_policy – Policy on how to merge multiple datasets. Possible options are “exact”, “intersect”, and “union”.
- Returns:
A new dataset with contents produced by input extractors
- Return type:
Dataset
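A short sketch, assuming dataset_a and dataset_b are existing IDataset instances:

```python
# Merge two sources; a single source would just be wrapped lazily.
merged = Dataset.from_extractors(dataset_a, dataset_b, merge_policy="union")
```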
- define_categories(categories: Dict[AnnotationType, Categories]) None [source]#
- get_subset(name) DatasetSubset [source]#
- subsets() Dict[str, DatasetSubset] [source]#
Enumerates subsets in the dataset. Each subset can be a dataset itself.
- categories() Dict[AnnotationType, Categories] [source]#
Returns metainfo about dataset labels.
- media_type() Type[MediaElement] [source]#
Returns media type of the dataset items.
All the items are supposed to have the same media type. Supposed to be constant and known immediately after the object construction (i.e. doesn’t require dataset iteration).
- get(id: str, subset: Optional[str] = None) Optional[DatasetItem] [source]#
Provides random access to dataset items.
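Continuing the from_iterable sketch above:

```python
item = dataset.get("img1", subset="train")  # returns None if there is no such item
```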
- filter(expr: str, filter_annotations: bool = False, remove_empty: bool = False) Dataset [source]#
Filters out some dataset items or annotations, using a custom filter expression.
Results are stored in-place. Modifications are applied lazily.
- Parameters:
expr – XPath-formatted filter expression (e.g. /item[subset = 'train'], /item/annotation[label = 'cat'])
filter_annotations – Indicates if the filter should be applied to items or annotations
remove_empty – When filtering annotations, allows excluding items that are left with no annotations
Returns: self
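Two sketches modeled on the expressions in the docstring above:

```python
# Item-level filter: keep only items from the "train" subset.
dataset.filter('/item[subset="train"]')

# Annotation-level filter: keep only "cat" annotations and drop items
# that end up with no annotations at all.
dataset.filter('/item/annotation[label="cat"]',
               filter_annotations=True, remove_empty=True)
```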
- update(source: Union[DatasetPatch, IDataset, Iterable[DatasetItem]]) Dataset [source]#
Updates items of the current dataset from another dataset or an iterable (the source). Items from the source overwrite matching items in the current dataset. Unmatched items are just appended.
If the source is a DatasetPatch, the removed items in the patch will be removed in the current dataset.
If the source is a dataset, labels are matched. If the labels match, but the order is different, the annotation labels will be remapped to the current dataset label order during updating.
Returns: self
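A quick sketch, assuming extra is another dataset or an iterable of DatasetItems:

```python
# Items in `extra` overwrite items with matching (id, subset) in `dataset`;
# the remaining ones are appended.
dataset.update(extra)
```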
- transform(method: Union[str, Type[Transform]], **kwargs) Dataset [source]#
Applies some function to dataset items.
Results are stored in-place. Modifications are applied lazily. Transforms are not allowed to change media type of dataset items.
- Parameters:
method – The transformation to be applied to the dataset. If a string is passed, it is treated as a plugin name, which is searched for in the dataset environment.
**kwargs – Parameters for the transformation
Returns: self
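For example, applying a transform by plugin name; "random_split" is one of Datumaro's built-in transforms, and the splits argument shown here is an assumption about that plugin's options:

```python
# Hypothetical options for the built-in "random_split" transform.
dataset.transform("random_split", splits=[("train", 0.8), ("val", 0.2)])
```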
- run_model(model: Union[Launcher, Type[ModelTransform]], *, batch_size: int = 1, append_annotation: bool = False, num_workers: int = 0, **kwargs) Dataset [source]#
Applies a model to dataset items’ media and produces a dataset with media and annotations.
- Parameters:
model – The model to be applied to the dataset
batch_size – The number of dataset items processed simultaneously by the model
append_annotation – Whether to append new annotations to the existing annotations
num_workers – The number of worker threads to use for parallel inference. Set to 0 for single-process mode. Default is 0.
**kwargs – Parameters for the model
Returns: self
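A usage sketch, assuming my_launcher is a Launcher instance (see the ToyLauncher sketch under Launcher below):

```python
dataset.run_model(my_launcher, batch_size=8, num_workers=2)
```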
- get_patch() DatasetPatch [source]#
- property env: Environment#
- bind(path: str, format: Optional[str] = None, *, options: Optional[Dict[str, Any]] = None) None [source]#
Binds the dataset to a specific directory. Allows setting default saving parameters.
Subsequent saves will be done to this directory by default and will use the saved parameters.
- export(save_dir: str, format: Union[str, Type[Exporter]], *, progress_reporter: Optional[ProgressReporter] = None, error_policy: Optional[ExportErrorPolicy] = None, **kwargs) None [source]#
Saves the dataset in some format.
- Parameters:
save_dir – The output directory
format – The desired output format. If a string is passed, it is treated as a plugin name, which is searched for in the dataset environment.
progress_reporter – An object to report progress
error_policy – An object to report format-related errors
**kwargs – Parameters for the format
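A sketch using the built-in "coco" format name; save_media is forwarded through **kwargs to the exporter (see Exporter below):

```python
dataset.export("output/coco_ds", "coco", save_media=True)
```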
- classmethod import_from(path: str, format: Optional[str] = None, *, env: Optional[Environment] = None, progress_reporter: Optional[ProgressReporter] = None, error_policy: Optional[ImportErrorPolicy] = None, **kwargs) Dataset [source]#
Creates a Dataset instance from a dataset on the disk.
- Parameters:
path – The input file or directory
format – Dataset format. If a string is passed, it is treated as a plugin name, which is searched for in the env plugin context. If not set, will try to detect automatically, using the env plugin context.
env – A plugin collection. If not set, the built-in plugins are used
progress_reporter – An object to report progress. Implies eager loading.
error_policy – An object to report format-related errors. Implies eager loading.
**kwargs – Parameters for the format
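A sketch; "voc" stands in for any available format plugin name:

```python
ds = Dataset.import_from("path/to/dataset", "voc")  # explicit format
ds = Dataset.import_from("path/to/dataset")         # auto-detect the format
```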
- static detect(path: str, *, env: Optional[Environment] = None, depth: int = 2) str [source]#
Attempts to detect dataset format of a given directory.
This function tries to detect a single format and fails if it’s not possible. Check Environment.detect_dataset() for a function that reports status for each format checked.
- Parameters:
path – The directory to check
depth – The maximum depth for recursive search
env – A plugin collection. If not set, the built-in plugins are used
- datumaro.components.dataset.eager_mode(new_mode: bool = True, dataset: Optional[Dataset] = None) None [source]#
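As noted in the Dataset description above, eager_mode is a context manager that toggles eager evaluation; a sketch:

```python
from datumaro.components.dataset import eager_mode

with eager_mode(dataset=dataset):
    dataset.filter('/item[subset="train"]')  # applied immediately, not lazily
```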
- class datumaro.components.dataset.AnnotationType(value)[source]#
Bases:
IntEnum
An enumeration.
- unknown = 0#
- label = 1#
- mask = 2#
- points = 3#
- polygon = 4#
- polyline = 5#
- bbox = 6#
- caption = 7#
- cuboid_3d = 8#
- super_resolution_annotation = 9#
- depth_annotation = 10#
- ellipse = 11#
- hash_key = 12#
- feature_vector = 13#
- tabular = 14#
- class datumaro.components.dataset.DatasetBase(*, length: Optional[int] = None, subsets: Optional[Sequence[str]] = None, media_type: Type[MediaElement] = <class 'datumaro.components.media.Image'>, ctx: Optional[ImportContext] = None)[source]#
Bases:
_DatasetBase, CliPlugin
A base class for user-defined and built-in extractors. Should be used in cases where SubsetBase is not enough, or its use causes problems with performance, implementation, etc.
- exception datumaro.components.dataset.DatasetImportError[source]#
Bases:
DatumaroError
- class datumaro.components.dataset.DatasetItem(id: str, *, subset: Optional[str] = None, media: Optional[Union[str, MediaElement]] = None, annotations: Optional[List[Annotation]] = None, attributes: Optional[Dict[str, Any]] = None)[source]#
Bases:
object
- media: Optional[MediaElement]#
- annotations: List[Annotation]#
- class datumaro.components.dataset.DatasetItemStorageDatasetView(parent: DatasetItemStorage, infos: Dict[str, Any], categories: Dict[AnnotationType, Categories], media_type: Optional[Type[MediaElement]])[source]#
Bases:
IDataset
- class Subset(parent: DatasetItemStorageDatasetView, name: str)[source]#
Bases:
IDataset
- class datumaro.components.dataset.DatasetPatch(data: DatasetItemStorage, infos: Dict[str, Any], categories: Dict[AnnotationType, Categories], updated_items: Dict[Tuple[str, str], ItemStatus], updated_subsets: Optional[Dict[str, ItemStatus]] = None)[source]#
Bases:
object
- class DatasetPatchWrapper(patch: DatasetPatch, parent: IDataset)[source]#
- property updated_subsets: Dict[str, ItemStatus]#
- class datumaro.components.dataset.DatasetStorage(source: Union[IDataset, DatasetItemStorage], infos: Optional[Dict[str, Any]] = None, categories: Optional[Dict[AnnotationType, Categories]] = None, media_type: Optional[Type[MediaElement]] = None)[source]#
Bases:
IDataset
- categories() Dict[AnnotationType, Categories] [source]#
Returns metainfo about dataset labels.
- define_categories(categories: Dict[AnnotationType, Categories])[source]#
- media_type() Type[MediaElement] [source]#
Returns media type of the dataset items.
All the items are supposed to have the same media type. Supposed to be constant and known immediately after the object construction (i.e. doesn’t require dataset iteration).
- put(item: DatasetItem) None [source]#
- get(id: str, subset: Optional[str] = None) Optional[DatasetItem] [source]#
Provides random access to dataset items.
- subsets() Dict[str, IDataset] [source]#
Enumerates subsets in the dataset. Each subset can be a dataset itself.
- get_datasetitem_by_path(path: str) Optional[DatasetItem] [source]#
- get_patch() DatasetPatch [source]#
- update(source: Union[DatasetPatch, IDataset, Iterable[DatasetItem]])[source]#
- class datumaro.components.dataset.DatasetSubset(parent: Dataset, name: str)[source]#
Bases:
IDataset
- class datumaro.components.dataset.Environment(use_lazy_import: bool = True)[source]#
Bases:
object
- property extractors: PluginRegistry#
- property importers: PluginRegistry#
- property launchers: PluginRegistry#
- property exporters: PluginRegistry#
- property generators: PluginRegistry#
- property transforms: PluginRegistry#
- property validators: PluginRegistry#
- detect_dataset(path: str, depth: int = 1, rejection_callback: Optional[Callable[[str, RejectionReason, str], None]] = None) List[str] [source]#
- classmethod merge(envs: Sequence[Environment]) Environment [source]#
- class datumaro.components.dataset.ExportContext(progress_reporter=None, error_policy=None)[source]#
Bases:
object
Method generated by attrs for class ExportContext.
- progress_reporter: ProgressReporter#
- error_policy: ExportErrorPolicy#
- class datumaro.components.dataset.ExportErrorPolicy[source]#
Bases:
object
- report_item_error(error: Exception, *, item_id: Tuple[str, str]) None [source]#
Allows reporting a problem with a dataset item. If this function returns, the converter must skip the item.
- class datumaro.components.dataset.Exporter(extractor: IDataset, save_dir: str, *, save_media: Optional[bool] = None, image_ext: Optional[str] = None, default_image_ext: Optional[str] = None, save_dataset_meta: bool = False, save_hashkey_meta: bool = False, stream: bool = False, ctx: Optional[ExportContext] = None)[source]#
Bases:
CliPlugin
- DEFAULT_IMAGE_EXT = None#
- class datumaro.components.dataset.IDataset[source]#
Bases:
object
- subsets() Dict[str, IDataset] [source]#
Enumerates subsets in the dataset. Each subset can be a dataset itself.
- categories() Dict[AnnotationType, Categories] [source]#
Returns metainfo about dataset labels.
- get(id: str, subset: Optional[str] = None) Optional[DatasetItem] [source]#
Provides random access to dataset items.
- media_type() Type[MediaElement] [source]#
Returns media type of the dataset items.
All the items are supposed to have the same media type. Supposed to be constant and known immediately after the object construction (i.e. doesn’t require dataset iteration).
- class datumaro.components.dataset.Image(size: Optional[Tuple[int, int]] = None, ext: Optional[str] = None, *args, **kwargs)[source]#
Bases:
MediaElement[ndarray]
- class datumaro.components.dataset.ImportContext(progress_reporter=None, error_policy=None)[source]#
Bases:
object
Method generated by attrs for class ImportContext.
- progress_reporter: ProgressReporter#
- error_policy: ImportErrorPolicy#
- class datumaro.components.dataset.ImportErrorPolicy[source]#
Bases:
object
- report_item_error(error: Exception, *, item_id: Tuple[str, str]) None [source]#
Allows reporting a problem with a dataset item. If this function returns, the extractor must skip the item.
- class datumaro.components.dataset.ItemTransform(extractor: IDataset)[source]#
Bases:
Transform
- transform_item(item: DatasetItem) Optional[DatasetItem] [source]#
Returns a modified copy of the input item.
Avoid changing and returning the input item, because it can lead to unexpected problems. Use wrap_item() or item.wrap() to simplify copying.
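A minimal subclass sketch; the renaming logic is purely illustrative, and wrap_item() is the copying helper mentioned above:

```python
from datumaro.components.dataset import ItemTransform

class AppendIdSuffix(ItemTransform):
    """Hypothetical transform that renames every item."""

    def transform_item(self, item):
        # Return a modified copy instead of mutating the input item.
        return self.wrap_item(item, id=item.id + "_v2")
```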
- class datumaro.components.dataset.LabelCategories(items: List[str] = _Nothing.NOTHING, label_groups: List[LabelGroup] = _Nothing.NOTHING, *, attributes: Set[str] = _Nothing.NOTHING)[source]#
Bases:
Categories
Method generated by attrs for class LabelCategories.
- class Category(name, parent: str = '', attributes: Set[str] = _Nothing.NOTHING)[source]#
Bases:
object
Method generated by attrs for class LabelCategories.Category.
- class LabelGroup(name, labels: List[str] = [], group_type: GroupType = GroupType.EXCLUSIVE)[source]#
Bases:
object
Method generated by attrs for class LabelCategories.LabelGroup.
- label_groups: List[LabelGroup]#
- classmethod from_iterable(iterable: Iterable[Union[str, Tuple[str], Tuple[str, str], Tuple[str, str, List[str]]]]) LabelCategories [source]#
Creates a LabelCategories object from an iterable.
- Parameters:
iterable –
This iterable object can contain:
a list of str – will be interpreted as a list of Category names
a list of tuples – positional arguments for the Category constructor: (name,), (name, parent), or (name, parent, attributes)
Returns: a LabelCategories object
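A sketch covering both entry forms from the signature above:

```python
from datumaro.components.dataset import LabelCategories

cats = LabelCategories.from_iterable([
    "background",             # a plain Category name
    ("cat", "", ["furry"]),   # (name, parent, attributes)
])
```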
- class datumaro.components.dataset.Launcher(model_dir: Optional[str] = None)[source]#
Bases:
CliPlugin
- preprocess(item: DatasetItem) Tuple[Union[ndarray, Dict[str, ndarray]], PrepInfo] [source]#
Preprocesses a single dataset item before launch().
There are two output types:
1. The output is np.ndarray. For example, it can be image data as np.ndarray with BGR format (H, W, C). In this step, you usually implement resizing, normalizing, or color channel conversion for your launcher (or model).
2. The output is Dict[str, np.ndarray]. For example, it can be an image and text pair, so this form suits models with multiple input modalities, such as image and text.
- infer(inputs: Dict[str, ndarray]) List[Union[Dict[str, ndarray], List[Dict[str, ndarray]]]] [source]#
- infer(inputs: ndarray) List[Union[Dict[str, ndarray], List[Dict[str, ndarray]]]]
- postprocess(pred: Union[Dict[str, ndarray], List[Dict[str, ndarray]]], info: PrepInfo) List[Annotation] [source]#
- launch(batch: Sequence[DatasetItem], stack: bool = True) List[List[Annotation]] [source]#
Launch to obtain the inference outputs of items.
- Parameters:
batch – A batch of DatasetItems
stack – If true, stack the inputs along the batch dimension and run inference on the stacked input; otherwise, run inference on each input separately.
- Returns:
A list of annotation lists. Each annotation list is mapped to its input DatasetItem; these annotations are pseudo-labels obtained by model inference.
- type_check(item: DatasetItem) bool [source]#
Checks the media type of a dataset item.
If it returns False, the item is excluded from the input batch.
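A heavily simplified subclass sketch following the documented preprocess/infer/postprocess contract; the fixed two-class scores stand in for real model inference:

```python
import numpy as np

from datumaro.components.annotation import Label
from datumaro.components.dataset import Launcher

class ToyLauncher(Launcher):
    """Hypothetical launcher that assigns every image the class with
    the higher of two fixed scores."""

    def preprocess(self, item):
        # np.ndarray output form: normalize the image; postprocess
        # needs no extra info, so PrepInfo is None.
        img = item.media.data.astype(np.float32) / 255.0
        return img, None

    def infer(self, inputs):
        # Assume stacked inputs of shape (N, H, W, C); return one
        # prediction dict per input.
        return [{"scores": np.array([0.9, 0.1])} for _ in range(len(inputs))]

    def postprocess(self, pred, info):
        # Turn one prediction dict into a list of annotations.
        return [Label(int(np.argmax(pred["scores"])))]
```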
- class datumaro.components.dataset.MediaElement(crypter: Crypter = <datumaro.components.crypter.NullCrypter object>)[source]#
Bases:
Generic[AnyData]
- class datumaro.components.dataset.ModelTransform(extractor: IDataset, launcher: Launcher, batch_size: int = 1, append_annotation: bool = False, num_workers: int = 0)[source]#
Bases:
Transform
A transformation class for applying a model’s inference to dataset items.
This class takes a dataset, a launcher, and other optional parameters, and transforms dataset items using the model outputs produced by the launcher. It can process items with multiple worker threads if specified, making it suitable for parallelized inference tasks.
- Parameters:
extractor – The dataset extractor to obtain items from.
launcher – The launcher responsible for model inference.
batch_size – The batch size for processing items. Default is 1.
append_annotation – Whether to append inference annotations to existing annotations. Default is False.
num_workers – The number of worker threads to use for parallel inference. Set to 0 for single-process mode. Default is 0.
- exception datumaro.components.dataset.MultipleFormatsMatchError(formats)[source]#
Bases:
DatasetImportError
Method generated by attrs for class MultipleFormatsMatchError.
- formats#
- exception datumaro.components.dataset.NoMatchingFormatsError[source]#
Bases:
DatasetImportError
- class datumaro.components.dataset.NullProgressReporter[source]#
Bases:
ProgressReporter
- iter(iterable: Iterable[T], *, total: Optional[int] = None, desc: Optional[str] = None) Iterable[T] [source]#
Traverses the iterable and reports progress simultaneously.
Starts and finishes the progress bar automatically.
- Parameters:
iterable – An iterable to be traversed
total – The expected number of iterations. If not provided, will try to use iterable.__len__.
desc – The status message
- Returns:
An iterable over elements of the input sequence
- split(count: int) Tuple[ProgressReporter] [source]#
Splits the progress bar into several independent parts. If count is 0, an empty tuple must be returned.
This class is supposed to manage the state of child progress bars and release their resources, if necessary.
- class datumaro.components.dataset.ProgressReporter[source]#
Bases:
object
- Only one set of methods must be called:
start - report_status - finish
iter
split
This class is supposed to manage the state of child progress bars and release their resources, if necessary.
- iter(iterable: Iterable[T], *, total: Optional[int] = None, desc: Optional[str] = None) Iterable[T] [source]#
Traverses the iterable and reports progress simultaneously.
Starts and finishes the progress bar automatically.
- Parameters:
iterable – An iterable to be traversed
total – The expected number of iterations. If not provided, will try to use iterable.__len__.
desc – The status message
- Returns:
An iterable over elements of the input sequence
- split(count: int) Tuple[ProgressReporter, ...] [source]#
Splits the progress bar into several independent parts. If count is 0, an empty tuple must be returned.
This class is supposed to manage the state of child progress bars and release their resources, if necessary.
- class datumaro.components.dataset.StreamDataset(source: Optional[IDataset] = None, *, infos: Optional[Dict[str, Any]] = None, categories: Optional[Dict[AnnotationType, Categories]] = None, media_type: Optional[Type[MediaElement]] = None, env: Optional[Environment] = None)[source]#
Bases:
Dataset
- class datumaro.components.dataset.StreamDatasetStorage(source: IDataset, infos: Optional[Dict[str, Any]] = None, categories: Optional[Dict[AnnotationType, Categories]] = None, media_type: Optional[Type[MediaElement]] = None)[source]#
Bases:
DatasetStorage
- put(item: DatasetItem) None [source]#
- get(id: str, subset: Optional[str] = None) Optional[DatasetItem] [source]#
Provides random access to dataset items.
- property subset_names#
- subsets() Dict[str, IDataset] [source]#
Enumerates subsets in the dataset. Each subset can be a dataset itself.
- get_datasetitem_by_path(path: str) Optional[DatasetItem] [source]#
- update(source: Union[DatasetPatch, IDataset, Iterable[DatasetItem]])[source]#
- categories() Dict[AnnotationType, Categories] [source]#
Returns metainfo about dataset labels.
- class datumaro.components.dataset.Transform(extractor: IDataset)[source]#
Bases:
DatasetBase, CliPlugin
A base class for dataset transformations that change dataset items or their annotations.
- exception datumaro.components.dataset.UnknownFormatError(format)[source]#
Bases:
DatumaroError
Method generated by attrs for class UnknownFormatError.
- format#
- class datumaro.components.dataset.XPathAnnotationsFilter(extractor, xpath=None, remove_empty=False)[source]#
Bases:
ItemTransform
- class datumaro.components.dataset.XPathDatasetFilter(extractor, xpath=None)[source]#
Bases:
ItemTransform
- datumaro.components.dataset.contextmanager(func)[source]#
@contextmanager decorator.
Typical usage:

    @contextmanager
    def some_generator(<arguments>):
        <setup>
        try:
            yield <value>
        finally:
            <cleanup>

This makes this:

    with some_generator(<arguments>) as <variable>:
        <body>

equivalent to this:

    <setup>
    try:
        <variable> = <value>
        <body>
    finally:
        <cleanup>
- datumaro.components.dataset.copy(x)[source]#
Shallow copy operation on arbitrary Python objects.
See the module’s __doc__ string for more info.