Correct Dataset from Validation Report#
In this notebook we demonstrate how to rectify a dataset using a pre-generated validation report. The report catalogs various types of anomalies in the data, as shown in the previous notebook example. By leveraging it, we can improve the dataset by addressing issues such as undefined labels, missing annotations, and statistical outliers.
Prerequisite#
Download COCO 2017 validation dataset#
Please refer to this notebook for preparing the COCO 2017 validation dataset.
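If you have not prepared it yet, the sketch below (an addition, not part of the original notebook) fetches the validation images and annotations from the standard cocodataset.org URLs and unpacks them into the directory layout assumed here for the coco_instances format; adjust the paths to match your setup.

import os
import urllib.request
import zipfile

# Assumed target layout:
#   coco_dataset/annotations/instances_val2017.json
#   coco_dataset/images/val2017/*.jpg
os.makedirs("coco_dataset/images", exist_ok=True)
for url, dst in [
    ("http://images.cocodataset.org/annotations/annotations_trainval2017.zip", "coco_dataset"),
    ("http://images.cocodataset.org/zips/val2017.zip", "coco_dataset/images"),
]:
    zip_path, _ = urllib.request.urlretrieve(url)  # download to a temporary file
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dst)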
[60]:
from datumaro.components.dataset import Dataset
path = "coco_dataset"
dataset = Dataset.import_from(path, "coco_instances")
print("Representation for sample COCO dataset")
dataset
WARNING:root:File 'coco_dataset/annotations/panoptic_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_dataset/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_dataset/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
Representation for sample COCO dataset
[60]:
Dataset
size=5000
source_path=coco_dataset
media_type=<class 'datumaro.components.media.Image'>
annotated_items_count=4952
annotations_count=78647
subsets
val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['polygon', 'bbox', 'mask']
infos
categories
label: ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']
Generate the validation report#
Following the previous example, we generate the validation report, broken down by severity (error, warning, and info) and by anomaly type. In this example, we are going to resolve the errors and warnings through the Correct transformation.
[61]:
from datumaro.plugins.validators import DetectionValidator

extra_args = {
    "few_samples_thr": 100,
    "imbalance_ratio_thr": 5,
    "far_from_mean_thr": 10.0,
}

validator = DetectionValidator(**extra_args)


def validate(dataset):
    reports = validator.validate(dataset)
    print("Validation report summary:", reports["summary"])

    # Tally the number of reports per anomaly type, grouped by severity.
    error_cnt = {}
    warning_cnt = {}
    info_cnt = {}
    for report in reports["validation_reports"]:
        anomaly_type = report["anomaly_type"]
        if report["severity"] == "error":
            error_cnt[anomaly_type] = error_cnt.get(anomaly_type, 0) + 1
        elif report["severity"] == "warning":
            warning_cnt[anomaly_type] = warning_cnt.get(anomaly_type, 0) + 1
        elif report["severity"] == "info":
            info_cnt[anomaly_type] = info_cnt.get(anomaly_type, 0) + 1

    print("The number of reports per error type: ", error_cnt)
    print("The number of reports per warning type: ", warning_cnt)
    print("The number of reports per info type: ", info_cnt)
    return reports


reports = validate(dataset)
Validation report summary: {'errors': 36782, 'warnings': 105, 'infos': 69}
The number of reports per error type: {'UndefinedAttribute': 36781, 'NegativeLength': 1}
The number of reports per warning type: {'MissingAnnotation': 48, 'FarFromLabelMean': 57}
The number of reports per info type: {'FewSamplesInLabel': 9, 'ImbalancedLabels': 1, 'ImbalancedDistInLabel': 59}
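Each entry in reports["validation_reports"] is a dictionary; besides the severity and anomaly_type fields counted above, anomaly entries carry the item_id and subset that we use for visualization in the next step. A quick way to inspect one entry (a small addition, not in the original notebook):

# Print the first error-severity report entry to see its structure.
for report in reports["validation_reports"]:
    if report["severity"] == "error":
        print(report)
        break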
Visualize some anomalies#
Let’s visually check how severe the anomalous samples are for a few examples.
[62]:
from datumaro.components.annotation import AnnotationType
from datumaro.components.visualizer import Visualizer

# Collect the (item_id, subset) pairs flagged as FarFromLabelMean, skipping duplicates.
far_from_mean_ids = []
far_from_mean_subsets = []
for report in reports["validation_reports"]:
    if report["anomaly_type"] == "FarFromLabelMean":
        if report["item_id"] in far_from_mean_ids:
            continue
        far_from_mean_ids.append(report["item_id"])
        far_from_mean_subsets.append(report["subset"])

visualizer = Visualizer(dataset, figsize=(12, 8), ignored_types=[AnnotationType.mask], alpha=0.5)
fig = visualizer.vis_gallery(far_from_mean_ids[:8], far_from_mean_subsets[:8])
fig.show()
WARNING:accuracy_checker:/home/wonju/datumaro/datumaro/components/visualizer.py:416: UserWarning: mask in self.ignored_types. Skip it.
warnings.warn(msg)
Correct anomalies#
Among the many transformations provided by Datumaro, the Correct transformation fixes anomalies as shown in the table below.
| Anomaly type | Description | Task | Type | Target | Operation |
|---|---|---|---|---|---|
| MissingLabelCategories | Metadata (ex. LabelCategories) should be defined | common | error | category | add |
| MissingAnnotation | No annotation found for an item | common | warning | item | remove |
| MissingAttribute | An attribute key is missing for an item | common | warning | item | add |
| UndefinedLabel | A label not defined in the metadata is found for an item | common | error | category | add |
| UndefinedAttribute | An attribute not defined in the metadata is found for an item | common | error | category | add |
| MultiLabelAnnotations | Item needs a single label | classification | error | item | remove |
| NegativeLength | The width or height of a bounding box is negative | detection | error | ann | remove |
| InvalidValue | There is an invalid (ex. inf, nan) value in the bounding box info | detection | error | ann | remove |
| FarFromLabelMean | An annotation has a value much smaller or larger than the average for its label | detection, segmentation | warning | ann | remove |
| FarFromAttrMean | An annotation has a value much smaller or larger than the average for an attribute | detection, segmentation | warning | ann | remove |
| LabelDefinedButNotFound | A label is defined, but not actually found | common | warning | | |
| AttributeDefinedButNotFound | An attribute is defined, but not actually found | common | warning | | |
| OnlyOneLabel | The dataset contains only one label | common | info | | |
| OnlyOneAttributeValue | The dataset contains only one attribute value | common | info | | |
| FewSamplesInLabel | The number of samples for a label might be too low | common | info | | |
| FewSamplesInAttribute | The number of samples for an attribute might be too low | common | info | | |
| ImbalancedLabels | There is an imbalance in the label distribution | common | info | | |
| ImbalancedAttribute | There is an imbalance in the attribute distribution | common | info | | |
| ImbalancedDistInLabel | Values (ex. bbox width) are not evenly distributed for a label | detection, segmentation | info | | |
| ImbalancedDistInAttribute | Values (ex. bbox width) are not evenly distributed for an attribute | detection, segmentation | info | | |
[63]:
import datumaro.plugins.transforms as transforms
refined_dataset = transforms.Correct(dataset, reports=reports)
reports = validate(refined_dataset)
Validation report summary: {'errors': 0, 'warnings': 58, 'infos': 281}
The number of reports per error type: {}
The number of reports per warning type: {'MissingAnnotation': 1, 'FarFromLabelMean': 33, 'FarFromAttrMean': 24}
The number of reports per info type: {'FewSamplesInLabel': 9, 'ImbalancedLabels': 1, 'FewSamplesInAttribute': 49, 'ImbalancedAttribute': 41, 'OnlyOneAttributeValue': 39, 'ImbalancedDistInLabel': 48, 'ImbalancedDistInAttribute': 94}
Because annotations are removed for anomalies such as NegativeLength or InvalidValue, the overall statistics of the reduced dataset change, and recomputing them can surface new anomalies, as shown above (additional FarFromLabelMean and FarFromAttrMean warnings). These can be refined again through one more refinement cycle, as below.
[64]:
second_refined_dataset = transforms.Correct(refined_dataset, reports=reports)
reports = validate(second_refined_dataset)
Validation report summary: {'errors': 0, 'warnings': 21, 'infos': 276}
The number of reports per error type: {}
The number of reports per warning type: {'FarFromLabelMean': 12, 'FarFromAttrMean': 9}
The number of reports per info type: {'FewSamplesInLabel': 9, 'ImbalancedLabels': 1, 'FewSamplesInAttribute': 49, 'ImbalancedAttribute': 41, 'OnlyOneAttributeValue': 39, 'ImbalancedDistInLabel': 45, 'ImbalancedDistInAttribute': 92}
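Since each removal shifts the per-label statistics, a few FarFromLabelMean and FarFromAttrMean warnings can keep reappearing. To automate this, a minimal sketch (an addition to this notebook) loops validation and correction with a capped number of iterations:

# Repeat validate-and-correct until no errors or warnings remain (or the cap is hit).
refined = dataset
for _ in range(5):
    reports = validator.validate(refined)
    summary = reports["summary"]
    if summary["errors"] == 0 and summary["warnings"] == 0:
        break
    refined = transforms.Correct(refined, reports=reports)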
We further check below that the noisy bounding box annotations with extreme aspect ratios have been cleaned up.
[65]:
visualizer = Visualizer(
    second_refined_dataset, figsize=(12, 8), ignored_types=[AnnotationType.mask], alpha=0.5
)
fig = visualizer.vis_gallery(far_from_mean_ids[:8], far_from_mean_subsets[:8])
fig.show()
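Once you are satisfied with the result, you can export the cleaned dataset back to disk. A minimal sketch; the save_media flag is assumed here and may be named differently (e.g. save_images) in older Datumaro versions, so check your version's Dataset.export signature.

# Write the refined dataset back out in COCO instances format.
second_refined_dataset.export("refined_coco_dataset", "coco_instances", save_media=True)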