datumaro.plugins.transforms
Classes
AnnsToLabels | Collects all labels from annotations (of all types) and transforms them into a set of annotations of type Label. |
AstypeAnnotations | Converts the types of annotations within a dataset based on a specified mapping. |
BboxValuesDecrement | Subtracts one from the coordinates of bounding boxes. |
BoxesToMasks | |
BoxesToPolygons | |
Clean | A class used to refine the media items in a dataset. |
Correct | This class provides functionality to correct and refine a dataset based on a validation report. |
CropCoveredSegments | Sorts polygons and masks ("segments") according to z_order, crops covered areas of underlying segments. |
IdFromImageName | Renames items in the dataset based on the image file name, excluding the extension. |
MapSubsets | Renames subsets in the dataset. |
MasksToPolygons | |
MergeInstanceSegments | Replaces instance masks and, optionally, polygons with a single mask. |
PolygonsToMasks | |
ProjectInfos | Changes the content of infos. |
ProjectLabels | Changes the order of labels in the dataset from the existing to the desired one, removes unknown labels and adds new labels. |
PseudoLabeling | A class used to assign pseudo-labels to items in a dataset based on their similarity to predefined labels. |
RandomSplit | Joins all subsets into one and splits the result into a few parts. |
Reindex | Replaces dataset item IDs with sequential indices. |
ReindexAnnotations | Replaces dataset items' annotations with sequential indices. |
RemapLabels | Changes labels in the dataset. |
RemoveAnnotations | Removes annotations on specific dataset items. |
RemoveAttributes | Removes item and annotation attributes in a dataset. |
RemoveItems | Removes specific items from the dataset by their ids. |
Rename | Renames items in the dataset. |
ResizeTransform | Resizes images and annotations in the dataset to the specified size. |
ShapesToBoxes | |
Sort | Sorts dataset items. |
- class datumaro.plugins.transforms.CropCoveredSegments(extractor: IDataset)
Bases: ItemTransform, CliPlugin
Sorts polygons and masks (“segments”) according to z_order, crops covered areas of underlying segments. If a segment is split into several independent parts by the segments above, produces the corresponding number of separate annotations joined into a group.
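The cropping step can be illustrated with a small stand-alone sketch. It is not Datumaro's implementation: segments are modelled as plain dicts holding a set of pixel coordinates, and the `crop_covered` name, the dict layout, and the choice to drop fully hidden segments are assumptions made for illustration. Splitting a cropped segment into separate grouped annotations is not modelled.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]


def crop_covered(segments: List[Dict]) -> List[Dict]:
    """Drop from every segment the pixels hidden by segments with a higher z_order."""
    # Sort bottom-to-top; sorted() is stable, so ties keep their input order.
    ordered = sorted(segments, key=lambda seg: seg["z_order"])
    covered: Set[Pixel] = set()
    visible_segments = []
    # Walk from the topmost segment down, remembering every pixel already claimed.
    for seg in reversed(ordered):
        visible = seg["pixels"] - covered
        covered |= seg["pixels"]
        if visible:  # assumption: fully hidden segments are dropped
            visible_segments.append({**seg, "pixels": visible})
    visible_segments.reverse()  # restore bottom-to-top order
    return visible_segments


bottom = {"label": "road", "z_order": 0, "pixels": {(0, 0), (0, 1), (0, 2)}}
top = {"label": "car", "z_order": 1, "pixels": {(0, 1)}}
cropped = crop_covered([top, bottom])
```

In this example the "road" segment loses its middle pixel and ends up as two disconnected parts, which is the case where the transform is described as producing several annotations joined into a group.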
- class datumaro.plugins.transforms.MergeInstanceSegments(extractor, include_polygons=False)
Bases: ItemTransform, CliPlugin
Replaces instance masks and, optionally, polygons with a single mask. A group of annotations with the same group id is considered an “instance”. The largest annotation in the group is considered the group “head”, so the resulting mask takes properties from that annotation.
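A rough sketch of the grouping rule, using plain dicts rather than Datumaro annotation objects. The `merge_instances` helper and the dict fields are hypothetical, and "largest" is taken to mean the annotation with the most pixels, which is an assumption about how size is measured.

```python
from collections import defaultdict
from typing import Dict, List


def merge_instances(annotations: List[Dict]) -> List[Dict]:
    """Merge annotations that share a group id into one mask per group."""
    groups = defaultdict(list)
    for ann in annotations:
        groups[ann["group"]].append(ann)

    merged = []
    for group_id, members in groups.items():
        # The largest member is the "head"; the merged mask inherits its label.
        head = max(members, key=lambda ann: len(ann["pixels"]))
        pixels = set().union(*(ann["pixels"] for ann in members))
        merged.append({"group": group_id, "label": head["label"], "pixels": pixels})
    return merged


parts = [
    {"group": 1, "label": "car", "pixels": {(0, 0), (0, 1), (0, 2)}},
    {"group": 1, "label": "wheel", "pixels": {(1, 1)}},
    {"group": 2, "label": "tree", "pixels": {(5, 5)}},
]
instances = merge_instances(parts)
```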
- class datumaro.plugins.transforms.PolygonsToMasks(extractor: IDataset)
Bases: ItemTransform, CliPlugin
- class datumaro.plugins.transforms.BoxesToMasks(extractor: IDataset)
Bases: ItemTransform, CliPlugin
- class datumaro.plugins.transforms.BoxesToPolygons(extractor: IDataset)
Bases: ItemTransform, CliPlugin
- class datumaro.plugins.transforms.MasksToPolygons(extractor: IDataset)
Bases: ItemTransform, CliPlugin
- class datumaro.plugins.transforms.ShapesToBoxes(extractor: IDataset)
Bases: ItemTransform, CliPlugin
- class datumaro.plugins.transforms.Reindex(extractor, start: int = 1)
Replaces dataset item IDs with sequential indices.
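As a minimal illustration of the `start` parameter, the sketch below renumbers a list of item ids. The `reindex_ids` helper is hypothetical, and emitting the new ids as strings is an assumption.

```python
from typing import Iterable, List


def reindex_ids(item_ids: Iterable[str], start: int = 1) -> List[str]:
    """Replace each id with its position in the sequence, counting from `start`."""
    return [str(index) for index, _ in enumerate(item_ids, start=start)]


new_ids = reindex_ids(["img_a", "img_b", "img_c"], start=1)
```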
- class datumaro.plugins.transforms.ReindexAnnotations(extractor, start: int = 1, reindex_each_item: bool = False)
Bases: ItemTransform, CliPlugin
Replaces the ids of dataset items’ annotations with sequential indices.
- transform_item(item: DatasetItem) -> DatasetItem
Returns a modified copy of the input item.
Avoid changing and returning the input item, because it can lead to unexpected problems. Use wrap_item() or item.wrap() to simplify copying.
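The copy-instead-of-mutate advice can be shown with a stand-in item type. The `Item` dataclass below is hypothetical and only imitates the idea behind `item.wrap()`; it is not Datumaro's `DatasetItem`.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for a dataset item."""

    id: str
    subset: str = "default"
    annotations: Tuple[str, ...] = ()

    def wrap(self, **changes) -> "Item":
        # Return a copy with the given fields replaced; the original is untouched.
        return replace(self, **changes)


def transform_item(item: Item) -> Item:
    # Build a modified copy rather than editing the input in place.
    return item.wrap(id=item.id.upper())


original = Item(id="frame_001", subset="train")
transformed = transform_item(original)
```

Because the input is left untouched, other transforms or consumers that still hold a reference to `original` keep seeing the unmodified item.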
- class datumaro.plugins.transforms.MapSubsets(extractor, mapping=None)
Bases: ItemTransform, CliPlugin
Renames subsets in the dataset.
- class datumaro.plugins.transforms.RandomSplit(extractor, splits, seed=None)
Joins all subsets into one and splits the result into a few parts. Item ids are expected to be unique, and subset ratios are expected to sum to 1.
Example:
random_split --subset train:.67 --subset test:.33
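A minimal sketch of a ratio-based split, assuming a seeded shuffle followed by slicing at cumulative boundaries. The `random_split` function here is a hypothetical stand-in that works on a plain list of ids; how Datumaro rounds the boundaries internally is not specified above.

```python
import random
from typing import Dict, Iterable, List, Optional, Tuple


def random_split(
    item_ids: Iterable[str],
    splits: List[Tuple[str, float]],
    seed: Optional[int] = None,
) -> Dict[str, List[str]]:
    """Shuffle ids and cut them into named parts according to the given ratios."""
    if abs(sum(ratio for _, ratio in splits) - 1.0) > 1e-6:
        raise ValueError("subset ratios must sum up to 1")

    ids = list(item_ids)
    random.Random(seed).shuffle(ids)  # a fixed seed makes the split reproducible

    result: Dict[str, List[str]] = {}
    start, cumulative = 0, 0.0
    for position, (name, ratio) in enumerate(splits):
        cumulative += ratio
        # The last part takes whatever remains, so no item is lost to rounding.
        is_last = position == len(splits) - 1
        end = len(ids) if is_last else round(cumulative * len(ids))
        result[name] = ids[start:end]
        start = end
    return result


all_ids = [f"img_{n}" for n in range(100)]
parts = random_split(all_ids, [("train", 0.67), ("test", 0.33)], seed=0)
```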
- class datumaro.plugins.transforms.IdFromImageName(extractor, ensure_unique: bool = False, suffix_length: int = 3)
Bases: ItemTransform, CliPlugin
Renames items in the dataset based on the image file name, excluding the extension. When ‘ensure_unique’ is enabled, a random suffix is appended to ensure each identifier is unique in cases where the image name is not distinct. By default, the random suffix is three characters long, but this can be adjusted with the ‘suffix_length’ parameter.
Examples:
Renames items without duplication check:
id_from_image_name
Renames items with duplication check:
id_from_image_name --ensure_unique
Renames items with duplication check and alters the suffix length (default: 3):
id_from_image_name --ensure_unique --suffix_length 2
- DEFAULT_RETRY = 1000
- SUFFIX_LETTERS = 'abcdefghijklmnopqrstuvwxyz0123456789'
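The renaming rule can be sketched as follows. This is an illustration only: the `ids_from_image_names` helper, the `__` separator between name and suffix, and the way the retry limit is applied are assumptions; only the two constants are taken from the class attributes listed above.

```python
import random
from pathlib import PurePosixPath
from typing import Iterable, List, Optional

DEFAULT_RETRY = 1000
SUFFIX_LETTERS = "abcdefghijklmnopqrstuvwxyz0123456789"


def ids_from_image_names(
    paths: Iterable[str],
    ensure_unique: bool = False,
    suffix_length: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Derive item ids from image file names, optionally de-duplicating them."""
    rng = rng or random.Random()
    used = set()
    new_ids = []
    for path in paths:
        name = PurePosixPath(path).stem  # file name without directory or extension
        if ensure_unique and name in used:
            for _ in range(DEFAULT_RETRY):
                suffix = "".join(rng.choices(SUFFIX_LETTERS, k=suffix_length))
                candidate = f"{name}__{suffix}"  # separator is an assumption
                if candidate not in used:
                    name = candidate
                    break
            else:
                raise RuntimeError(f"could not find a unique id for {path!r}")
        used.add(name)
        new_ids.append(name)
    return new_ids


plain = ids_from_image_names(["a/img.jpg", "b/img.jpg", "c/other.png"])
unique = ids_from_image_names(["a/img.jpg", "b/img.jpg", "c/other.png"], ensure_unique=True)
```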
- class datumaro.plugins.transforms.Rename(extractor, regex)
Bases: ItemTransform, CliPlugin
Renames items in the dataset. Supports regular expressions. The first character in the expression is a delimiter for the pattern and replacement parts. The replacement part can also contain str.format replacement fields, with the item object (of type DatasetItem) available as ‘item’. Please enclose the expression in double quotes.
- Examples:
Replace ‘pattern’ with ‘replacement’:
rename -e "|pattern|replacement|"
Remove ‘frame_’ from item ids:
rename -e "|^frame_||"
Rename by regex:
rename -e "|frame_(\d+)_extra|{item.subset}_id_\1|"
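How such an expression might be taken apart is sketched below. The `rename_id` helper is hypothetical, the item is a `SimpleNamespace` stand-in rather than a `DatasetItem`, and applying the regex substitution before `str.format` is an assumption about the order of operations.

```python
import re
from types import SimpleNamespace


def rename_id(item, expression: str) -> str:
    """Apply a '<d>pattern<d>replacement<d>' expression to item.id."""
    delimiter = expression[0]  # the first character separates the parts
    parts = expression.split(delimiter)
    # "|pattern|replacement|" splits into ["", "pattern", "replacement", ""].
    if len(parts) != 4:
        raise ValueError(f"expected {delimiter}pattern{delimiter}replacement{delimiter}")
    _, pattern, replacement, _ = parts
    # Substitute first, then fill str.format fields such as {item.subset}.
    return re.sub(pattern, replacement, item.id).format(item=item)


item = SimpleNamespace(id="frame_001_extra", subset="train")
renamed = rename_id(item, r"|frame_(\d+)_extra|{item.subset}_id_\1|")
```

With the item above, `renamed` evaluates to `train_id_001`.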
- class datumaro.plugins.transforms.RemapLabels(extractor: IDataset, mapping: Dict[str, str] | List[Tuple[str, str]], default: None | str | DefaultAction = None)
Bases: ItemTransform, CliPlugin
Changes labels in the dataset.
- A label can be:
renamed (and joined with existing) - when ‘--label <old_name>:<new_name>’ is specified
deleted - when ‘--label <name>:’ is specified, or the default action is ‘delete’ and the label is not mentioned in the list. When a label is deleted, all the associated annotations are removed
kept unchanged - when ‘--label <name>:<name>’ is specified, or the default action is ‘keep’ and the label is not mentioned in the list.
Annotations with no label are managed by the default action policy.
Examples:
Remove the ‘person’ label (and corresponding annotations):
remap_labels -l person: --default keep
Rename ‘person’ to ‘pedestrian’ and ‘human’ to ‘pedestrian’, join:
remap_labels -l person:pedestrian -l human:pedestrian --default keep
Rename ‘person’ to ‘car’ and ‘cat’ to ‘dog’, keep ‘bus’, remove others:
remap_labels -l person:car -l bus:bus -l cat:dog --default delete
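The three outcomes (rename, delete, keep) can be sketched as a small lookup. The helpers `parse_label_args` and `remap_label` are hypothetical, the default action is modelled as the plain strings "keep" and "delete", and annotations without a label are not covered.

```python
from typing import Dict, Iterable, Optional


def parse_label_args(args: Iterable[str]) -> Dict[str, str]:
    """Turn '<old>:<new>' strings into a mapping; an empty <new> means delete."""
    mapping = {}
    for arg in args:
        old_name, _, new_name = arg.partition(":")
        mapping[old_name] = new_name
    return mapping


def remap_label(name: str, mapping: Dict[str, str], default: str = "keep") -> Optional[str]:
    """Return the new label name, or None when the label is deleted."""
    if name in mapping:
        return mapping[name] or None  # empty target deletes the label
    if default == "keep":
        return name
    if default == "delete":
        return None
    raise ValueError(f"unknown default action: {default!r}")


mapping = parse_label_args(["person:car", "bus:bus", "cat:dog"])
remapped = {name: remap_label(name, mapping, default="delete")
            for name in ["person", "bus", "cat", "tree"]}
```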
- class datumaro.plugins.transforms.ProjectInfos(extractor: IDataset, dst_infos: Dict[str, Any], overwrite: bool = False)
Changes the content of infos. A user can add dataset metadata such as the author, comments, or related papers. Infos values do not affect the dataset structure, so any metadata can be added freely.
- class datumaro.plugins.transforms.ProjectLabels(extractor: IDataset, dst_labels: Iterable[str] | LabelCategories)
Bases: ItemTransform
Changes the order of labels in the dataset from the existing to the desired one, removes unknown labels and adds new labels. Updates or removes the corresponding annotations.
Labels are matched by name (case-sensitive). Parent labels are only kept if they are present in the resulting set of labels. If new labels are added and the dataset has mask colors defined, the new labels are assigned generated colors.
Useful for merging similar datasets, whose labels need to be aligned.
- Examples:
Align the source dataset labels to [person, cat, dog]:
project_labels -l person -l cat -l dog
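The label alignment can be pictured as an index remapping. In this sketch annotations are plain dicts whose "label" field is an index into the source label list; the `project_labels` helper is hypothetical, and annotations whose label is missing from the target list are simply removed.

```python
from typing import Dict, List


def project_labels(src_labels: List[str], dst_labels: List[str],
                   annotations: List[Dict]) -> List[Dict]:
    """Re-index annotation labels from the source label order to the target one."""
    dst_index = {name: index for index, name in enumerate(dst_labels)}
    # Names are compared exactly, so "Cat" and "cat" are different labels.
    index_map = {index: dst_index.get(name) for index, name in enumerate(src_labels)}

    projected = []
    for ann in annotations:
        new_label = index_map[ann["label"]]
        if new_label is not None:  # labels unknown to the target are removed
            projected.append({**ann, "label": new_label})
    return projected


source = ["cat", "dog", "bird"]
target = ["person", "cat", "dog"]
anns = [{"id": 1, "label": 0}, {"id": 2, "label": 1}, {"id": 3, "label": 2}]
aligned = project_labels(source, target, anns)
```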
- class datumaro.plugins.transforms.AnnsToLabels(extractor: IDataset)
Bases: ItemTransform, CliPlugin
Collects all labels from annotations (of all types) and transforms them into a set of annotations of type Label.
- class datumaro.plugins.transforms.BboxValuesDecrement(extractor: IDataset)
Bases: ItemTransform, CliPlugin
Subtracts one from the coordinates of bounding boxes.
- class datumaro.plugins.transforms.ResizeTransform(extractor: IDataset, width: int, height: int)
Bases: ItemTransform
Resizes images and annotations in the dataset to the specified size. Supports upscaling, downscaling and mixed variants.
- Examples:
Resize all images to 256x256 size
resize -dw 256 -dh 256
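Resizing annotations alongside the image amounts to scaling coordinates by per-axis factors. The sketch below does this for a single bounding box; the `resize_bbox` helper and the (x, y, w, h) tuple layout are assumptions for illustration.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # x, y, width, height


def resize_bbox(bbox: BBox, src_size: Tuple[int, int], dst_size: Tuple[int, int]) -> BBox:
    """Scale a bounding box from an image of src_size to one of dst_size (width, height)."""
    x, y, w, h = bbox
    scale_x = dst_size[0] / src_size[0]
    scale_y = dst_size[1] / src_size[1]
    # The axes scale independently, so one can shrink while the other grows.
    return (x * scale_x, y * scale_y, w * scale_x, h * scale_y)


# A "mixed" case: width is halved (512 -> 256) while height is doubled (128 -> 256).
scaled = resize_bbox((10, 20, 100, 50), src_size=(512, 128), dst_size=(256, 256))
```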
- class datumaro.plugins.transforms.RemoveItems(extractor: IDataset, ids: Iterable[Tuple[str, str]])
Bases: ItemTransform
Removes specific items from the dataset by their ids.
Can be useful for cleaning broken or unnecessary samples out of the dataset.
- Examples:
Remove specific items from the dataset
remove_items --id 'image1:train' --id 'image2:test'
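A small sketch of filtering by (id, subset) pairs. Both helpers are hypothetical, and splitting each reference at its last colon is an assumption about how the `'<id>:<subset>'` form is parsed.

```python
from typing import Dict, Iterable, List, Tuple


def parse_item_ref(ref: str) -> Tuple[str, str]:
    """Split '<id>:<subset>' at the last colon into an (id, subset) pair."""
    item_id, separator, subset = ref.rpartition(":")
    if not separator:
        raise ValueError(f"expected '<id>:<subset>', got {ref!r}")
    return item_id, subset


def remove_items(items: Iterable[Dict], refs: Iterable[str]) -> List[Dict]:
    """Return the items whose (id, subset) pair is not listed in refs."""
    to_remove = {parse_item_ref(ref) for ref in refs}
    return [item for item in items if (item["id"], item["subset"]) not in to_remove]


dataset = [
    {"id": "image1", "subset": "train"},
    {"id": "image1", "subset": "test"},
    {"id": "image2", "subset": "test"},
]
kept = remove_items(dataset, ["image1:train", "image2:test"])
```

Matching on the pair rather than the id alone keeps `image1` in the test subset even though an item with the same id is removed from the train subset.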
- class datumaro.plugins.transforms.RemoveAnnotations(extractor: IDataset, *, ids: Iterable[Tuple[str, str, int | None]])
Bases: ItemTransform
Removes annotations on specific dataset items.
Can be useful for cleaning broken or unnecessary annotations out of the dataset.
- Examples:
Remove annotations from specific items in the dataset
remove_annotations --id 'image1:train' --id 'image2:test'
- transform_item(item: DatasetItem)
Returns a modified copy of the input item.
Avoid changing and returning the input item, because it can lead to unexpected problems. Use wrap_item() or item.wrap() to simplify copying.
- class datumaro.plugins.transforms.RemoveAttributes(extractor: IDataset, ids: Iterable[Tuple[str, str]] | None = None, attributes: Iterable[str] | None = None)
Bases: ItemTransform
Removes item and annotation attributes in a dataset.
Can be useful for cleaning broken or unnecessary attributes out of the dataset.
- Examples:
Remove the is_crowd attribute from dataset
remove_attributes --attr 'is_crowd'
Remove the occluded attribute from annotations of the 2010_001705 item in the train subset
remove_attributes --id '2010_001705:train' --attr 'occluded'
- transform_item(item: DatasetItem)
Returns a modified copy of the input item.
Avoid changing and returning the input item, because it can lead to unexpected problems. Use wrap_item() or item.wrap() to simplify copying.
- class datumaro.plugins.transforms.Correct(extractor: IDataset, reports: str | Dict)
This class provides functionality to correct and refine a dataset based on a validation report. It processes a validation report (typically in JSON format) to identify and rectify various dataset issues, such as undefined labels, missing annotations, outliers, empty labels/captions, and unnecessary characters in captions. The correction process includes:
Adding missing labels and attributes.
Removing or adjusting annotations with invalid or anomalous values.
Filling in missing labels and captions with appropriate values.
Removing unnecessary characters from text-based annotations like captions.
Handling outliers by capping values within specified bounds.
Updating dataset categories and annotations according to the corrections.
The class is designed to be used as part of a command-line interface (CLI) and can be configured with different validation reports. It integrates with the dataset extraction process, ensuring that corrections are applied consistently across the dataset.
- class datumaro.plugins.transforms.AstypeAnnotations(extractor: IDataset, mapping: Dict[str, str] | List[Tuple[str, str]] | None = None)
Bases: ItemTransform
Converts the types of annotations within a dataset based on a specified mapping.
This transform changes annotations to ‘Label’ if they are categorical, and to ‘Caption’ if they are of type string, float, or integer. This is particularly useful when working with tabular data that needs to be converted into a format suitable for specific machine learning tasks.
- Examples:
Converts the type of a title annotation:
astype_annotations --mapping 'title:text'
- transform_item(item: DatasetItem)
Returns a modified copy of the input item.
Avoid changing and returning the input item, because it can lead to unexpected problems. Use wrap_item() or item.wrap() to simplify copying.
- class datumaro.plugins.transforms.Clean(extractor: IDataset, batch_size: int = 1, num_workers: int = 0)
Bases: TabularTransform
A class used to refine the media items in a dataset.
This class provides methods to clean and preprocess media data within a dataset. The media data can be of various types such as strings, numeric values, or categorical values. The cleaning process for each type of data is handled differently:
- String Media: For string data, the class employs natural language processing (NLP) techniques to remove unnecessary characters. This involves cleaning tasks such as removing special characters, punctuation, and other irrelevant elements to refine the textual data.
- Numeric Media: For numeric data, the class identifies and handles outliers and missing values. Outliers are either removed or replaced based on a defined strategy, and missing values are filled using appropriate methods such as mean, median, or a predefined value.
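The missing-value strategies named for numeric data can be sketched with the standard library. The `fill_missing` helper and its `strategy` strings are hypothetical and do not reflect the options the class actually exposes; missing values are modelled as `None`.

```python
import statistics
from typing import List, Optional, Union

Strategy = Union[str, float]


def fill_missing(values: List[Optional[float]], strategy: Strategy = "median") -> List[float]:
    """Replace None entries with the mean, the median, or a predefined value."""
    present = [value for value in values if value is not None]
    if strategy == "mean":
        fill = statistics.mean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    elif isinstance(strategy, (int, float)):
        fill = float(strategy)  # a predefined constant
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if value is None else value for value in values]


column = [1.0, None, 3.0, 100.0]
by_median = fill_missing(column, "median")
by_mean = fill_missing([1.0, None, 3.0], "mean")
```

The median is less sensitive than the mean to extreme values such as the 100.0 above, which is one reason it is often preferred when outliers have not yet been handled.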
- class datumaro.plugins.transforms.PseudoLabeling(extractor: IDataset, labels: List[str] | None = None, explorer: Explorer | None = None)
Bases: ItemTransform
A class used to assign pseudo-labels to items in a dataset based on their similarity to predefined labels.
This class leverages hashing techniques to compute the similarity between dataset items and a set of predefined labels. It assigns the most similar label as a pseudo-label to each item. This is particularly useful in semi-supervised learning scenarios where some labels are missing or uncertain.
- Attributes:
extractor : IDataset
The dataset extractor that provides access to dataset items and their annotations.
labels : Optional[List[str]]
A list of label names to be used for pseudo-labeling. If not provided, all available labels in the dataset will be used.
explorer : Optional[Explorer]
An optional Explorer object used to compute hash keys for items and labels. If not provided, a new Explorer will be created.
- transform_item(item: DatasetItem)
Returns a modified copy of the input item.
Avoid changing and returning the input item, because it can lead to unexpected problems. Use wrap_item() or item.wrap() to simplify copying.
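The idea of picking the most similar label by comparing hash keys can be sketched as below. This is an illustration under assumptions: hash keys are modelled as small integers, similarity is measured with Hamming distance, and the `pseudo_label` helper is hypothetical. How the Explorer actually computes and compares hash keys is not described in this section.

```python
from typing import Dict


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two hash keys differ."""
    return bin(a ^ b).count("1")


def pseudo_label(item_hash: int, label_hashes: Dict[str, int]) -> str:
    """Return the label whose hash key is closest to the item's hash key."""
    return min(label_hashes, key=lambda name: hamming_distance(item_hash, label_hashes[name]))


label_hashes = {"cat": 0b11110000, "dog": 0b00001111}
assigned = pseudo_label(0b11100000, label_hashes)
```

Here the item hash differs from the "cat" key in one bit and from the "dog" key in seven, so "cat" is assigned.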