Validate

Validate Dataset

This command inspects annotations with respect to the task type and stores the results in a JSON file.

The supported task types are classification, detection, segmentation, and tabular. Select one with the -t/--task-type parameter.

The validation result contains:

  • annotation statistics based on the task type

  • validation reports, such as

    • items not having annotations

    • items having undefined annotations

    • imbalanced distribution in class/attributes

    • too small or large values

  • summary

Usage:

datum validate [-h] -t TASK [-s SUBSET_NAME] [-p PROJECT_DIR]
               [target] [-- EXTRA_ARGS]

Parameters:

  • <target> (string) - Target dataset revpath. By default, validates the current project.

  • -t, --task-type (string) - Task type for validation

  • -s, --subset (string) - Dataset subset to be validated

  • -p, --project (string) - Directory of the project to operate on (default: current directory).

  • -h, --help - Print the help message and exit.

  • <extra args> - Extra validation parameters. Pass them after the -- separator, following the main command arguments:

    • -fs, --few-samples-thr (number) - The threshold for warning about the minimum number of samples per class

    • -ir, --imbalance-ratio-thr (number) - The threshold for warning about imbalanced data

    • -m, --far-from-mean-thr (number) - The threshold for warning that data is far from the mean

    • -dr, --dominance-ratio-thr (number) - The threshold for warning about bounding box imbalance

    • -k, --topk-bins (number) - The ratio of bins with the highest number of data points to the total number of bins in the histogram
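
As a rough illustration of what the -fs and -ir thresholds act on, the sketch below applies them to a label count mapping shaped like 'defined_labels' in the validation result. The helper name check_label_counts and the comparison rules (a count below the threshold, and the largest count divided by the smallest count above the threshold) are assumptions made for illustration, not the confirmed behavior of datum validate.

```python
def check_label_counts(defined_labels, few_samples_thr, imbalance_ratio_thr):
    """Return (anomaly_type, detail) pairs for a {label: count} mapping.

    Hypothetical helper: the comparison rules below are assumptions.
    """
    findings = []

    # Assumed rule: a label is flagged when its count is below the threshold.
    for label, count in defined_labels.items():
        if count < few_samples_thr:
            findings.append(("FewSamplesInLabel", label))

    # Assumed rule: imbalance is the largest count divided by the smallest.
    counts = [c for c in defined_labels.values() if c > 0]
    if len(counts) > 1:
        ratio = max(counts) / min(counts)
        if ratio > imbalance_ratio_thr:
            findings.append(("ImbalancedLabels", round(ratio, 2)))

    return findings


# Made-up label counts, used only to exercise the helper.
sample = {"cat": 400, "dog": 8, "bird": 120}
print(check_label_counts(sample, few_samples_thr=10, imbalance_ratio_thr=40))
```

With these made-up counts, "dog" falls under the few-samples threshold and the 400-to-8 ratio exceeds 40, so both checks report a finding.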

Examples:

  • Validate a project’s subset as a classification dataset

    datum validate -t classification -s subset
    
  • Give a warning when the imbalance ratio of the data in a classification task exceeds 40

    datum validate -p <path/to/project/> -t classification -- -ir 40
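
When the command is launched from a script, the ordering rule from the usage line (main arguments first, then the -- separator, then the extra arguments) can be captured in a small helper. This is a minimal sketch: build_validate_argv is a hypothetical function name, and only the flags listed under Parameters are used.

```python
import shlex


def build_validate_argv(task_type, target=None, subset=None, project=None,
                        extra_args=()):
    """Assemble an argument list for `datum validate` (hypothetical helper)."""
    argv = ["datum", "validate", "-t", task_type]
    if subset is not None:
        argv += ["-s", subset]
    if project is not None:
        argv += ["-p", project]
    if target is not None:
        argv.append(target)
    # Extra validation parameters must come after the "--" separator.
    if extra_args:
        argv.append("--")
        argv += [str(arg) for arg in extra_args]
    return argv


argv = build_validate_argv(
    "classification", project="path/to/project", extra_args=["-ir", 40]
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) on a machine where the datum executable is installed.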
    

List of validation items (anomaly types)

| Anomaly Type | Description | Task Type |
|---|---|---|
| MissingLabelCategories | Metadata (e.g. LabelCategories) should be defined | common |
| MissingAnnotation | No annotation found for an item | common |
| MissingAttribute | An attribute key is missing for an item | common |
| MultiLabelAnnotations | Item needs a single label | classification |
| UndefinedLabel | A label not defined in the metadata is found for an item | common |
| UndefinedAttribute | An attribute not defined in the metadata is found for an item | common |
| LabelDefinedButNotFound | A label is defined but not actually found | common |
| AttributeDefinedButNotFound | An attribute is defined but not actually found | common |
| OnlyOneLabel | The dataset consists of only one label | common |
| OnlyOneAttributeValue | The dataset consists of only one attribute value | common |
| FewSamplesInLabel | The number of samples in a label might be too low | common |
| FewSamplesInAttribute | The number of samples in an attribute might be too low | common |
| ImbalancedLabels | There is an imbalance in the label distribution | common |
| ImbalancedAttribute | There is an imbalance in the attribute distribution | common |
| ImbalancedDistInLabel | Values (e.g. bbox width) are not evenly distributed for a label | detection, segmentation |
| ImbalancedDistInAttribute | Values (e.g. bbox width) are not evenly distributed for an attribute | detection, segmentation |
| NegativeLength | The width or height of a bounding box is negative | detection |
| InvalidValue | There is an invalid value (e.g. inf, nan) in the bounding box info | detection |
| FarFromLabelMean | An annotation has a value much smaller or larger than the average for a label | detection, segmentation |
| FarFromAttrMean | An annotation has a value much smaller or larger than the average for an attribute | detection, segmentation |
| BrokenAnnotation | Some annotations are not defined for an item | tabular |
| EmptyLabel | A value of the label column is not defined for an item | tabular |
| EmptyCaption | A value of the caption column is not defined for an item | tabular |
| FewSamplesInCaption | The number of samples in a caption might be too low | tabular |
| RedundanciesInCaption | Redundancies in a caption for an item | tabular |
| ImbalancedCaptions | There is an imbalance in the caption distribution | tabular |
| ImbalancedDistInCaption | Values are not evenly distributed for a caption (only if the caption is numeric) | tabular |
| FarFromCaptionMean | An annotation has a value much smaller or larger than the average for a caption (only if the caption is numeric) | tabular |
| OutlierInCaption | An annotation has an outlier value according to the interquartile range method (only if the caption is numeric) | tabular |

Validation Result Format:

{
    'statistics': {
        ## common statistics
        'label_distribution': {
            'defined_labels': <dict>,   # <label:str>: <count:int>
            'undefined_labels': <dict>
            # <label:str>: {
            #     'count': <int>,
            #     'items_with_undefined_label': [<item_key>, ]
            # }
        },
        'attribute_distribution': {
            'defined_attributes': <dict>,
            # <label:str>: {
            #     <attribute:str>: {
            #         'distribution': {<attr_value:str>: <count:int>, },
            #         'items_missing_attribute': [<item_key>, ]
            #     }
            # }
            'undefined_attributes': <dict>
            # <label:str>: {
            #     <attribute:str>: {
            #         'distribution': {<attr_value:str>: <count:int>, },
            #         'items_with_undefined_attr': [<item_key>, ]
            #     }
            # }
        },
        'total_ann_count': <int>,
        'items_missing_annotation': <list>, # [<item_key>, ]

        ## statistics for classification task
        'items_with_multiple_labels': <list>, # [<item_key>, ]

        ## statistics for detection task
        'items_with_invalid_value': <dict>,
        # '<item_key>': {<ann_id:int>: [ <property:str>, ], }
        # - properties: 'x', 'y', 'width', 'height',
        #               'area(wxh)', 'ratio(w/h)', 'short', 'long'
        # - 'short' is min(w,h) and 'long' is max(w,h).
        'items_with_negative_length': <dict>,
        # '<item_key>': { <ann_id:int>: { <'width'|'height'>: <value>, }, }
        'bbox_distribution_in_label': <dict>, # <label:str>: <bbox_template>
        'bbox_distribution_in_attribute': <dict>,
        # <label:str>: {<attribute:str>: { <attr_value>: <bbox_template>, }, }
        'bbox_distribution_in_dataset_item': <dict>,
        # '<item_key>': <bbox count:int>

        ## statistics for segmentation task
        'items_with_invalid_value': <dict>,
        # '<item_key>': {<ann_id:int>: [ <property:str>, ], }
        # - properties: 'area', 'width', 'height'
        'mask_distribution_in_label': <dict>, # <label:str>: <mask_template>
        'mask_distribution_in_attribute': <dict>,
        # <label:str>: {
        #     <attribute:str>: { <attr_value>: <mask_template>, }
        # }
        'mask_distribution_in_dataset_item': <dict>,
        # '<item_key>': <mask/polygon count: int>

        ## statistics for tabular task
        'items_broken_annotation': <list>, # [<item_key>, ]
        'label_distribution': {
            'defined_labels': <dict>,   # <label:str>: <count:int>
            'empty_labels': <dict>
            # <label:str>: {
            #     'count': <int>,
            #     'items_with_empty_label': [<item_key>, ]
            # }
        },
        'caption_distribution': {
            'defined_captions': <dict>,   # <label:str>: <count:int>
            'empty_captions': <dict>
            # <label:str>: {
            #     'count': <int>,
            #     'items_with_empty_label': [<item_key>, ]
            # }
            'redundancies': <dict>
            # <label:str>: {
            #     'stopword': {
            #         'count': <int>,
            #         'items_with_redundancies': [<item_key>, ]
            #     },
            #     'url': {
            #         'count': <int>,
            #         'items_with_redundancies': [<item_key>, ]
            #     }
            # }
        },

    },
    'validation_reports': <list>, # [ <validation_error_format>, ]
    # validation_error_format = {
    #     'anomaly_type': <str>,
    #     'description': <str>,
    #     'severity': <str>, # 'warning' or 'error'
    #     'item_id': <str>,  # optional, when it is related to a DatasetItem
    #     'subset': <str>,   # optional, when it is related to a DatasetItem
    # }
    'summary': {
        'errors': <count: int>,
        'warnings': <count: int>
    }
}
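
To show how the 'validation_reports' list might be consumed once the result is loaded into a Python dict, the following sketch counts reports per anomaly type and per severity. The function name summarize_reports and the sample data are made up for illustration. Only the keys shown in the format above are relied on.

```python
from collections import Counter


def summarize_reports(result):
    """Count validation reports by anomaly type and by severity."""
    reports = result.get("validation_reports", [])
    by_type = Counter(report["anomaly_type"] for report in reports)
    by_severity = Counter(report["severity"] for report in reports)
    return by_type, by_severity


# Hand-written sample that follows the documented shape; not real output.
sample_result = {
    "validation_reports": [
        {"anomaly_type": "MissingAnnotation", "description": "sample text",
         "severity": "warning", "item_id": "img_001", "subset": "train"},
        {"anomaly_type": "UndefinedLabel", "description": "sample text",
         "severity": "error", "item_id": "img_002", "subset": "train"},
        {"anomaly_type": "MissingAnnotation", "description": "sample text",
         "severity": "warning", "item_id": "img_003", "subset": "val"},
    ],
    "summary": {"errors": 1, "warnings": 2},
}

by_type, by_severity = summarize_reports(sample_result)
for anomaly_type, count in by_type.most_common():
    print(f"{anomaly_type}: {count}")
print(dict(by_severity))
```

Because 'item_id' and 'subset' are optional, code that reads them should use report.get(...) rather than direct indexing.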

item_key is defined as:

item_key = (<DatasetItem.id:str>, <DatasetItem.subset:str>)

bbox_template and mask_template are defined as:

bbox_template = {
    'width': <numerical_stat_template>,
    'height': <numerical_stat_template>,
    'area(wxh)': <numerical_stat_template>,
    'ratio(w/h)': <numerical_stat_template>,
    'short': <numerical_stat_template>, # short = min(w, h)
    'long': <numerical_stat_template>   # long = max(w, h)
}
mask_template = {
    'area': <numerical_stat_template>,
    'width': <numerical_stat_template>,
    'height': <numerical_stat_template>
}

numerical_stat_template is defined as:

numerical_stat_template = {
    'items_far_from_mean': <dict>,
    # {'<item_key>': {<ann_id:int>: <value:float>, }, }
    'mean': <float>,
    'stddev': <float>,
    'min': <float>,
    'max': <float>,
    'median': <float>,
    'histogram': {
        'bins': <list>,   # [<float>, ]
        'counts': <list>, # [<int>, ]
    }
}
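
The sketch below fills in a dict with the same keys as numerical_stat_template from a handful of values, which may help when reading the template. It uses only the Python standard library. The function name numerical_stats, the population standard deviation, the equal-width binning, and the far-from-mean rule (absolute deviation greater than the threshold times the standard deviation) are all assumptions, not the confirmed computation behind the validator.

```python
import statistics


def numerical_stats(values, far_from_mean_thr, n_bins):
    """Build a numerical_stat_template-like dict (hypothetical helper).

    values maps (item_key, ann_id) to a number, where item_key is
    (DatasetItem.id, DatasetItem.subset).
    """
    nums = list(values.values())
    mean = statistics.fmean(nums)
    stddev = statistics.pstdev(nums)  # assumption: population stddev
    lo, hi = min(nums), max(nums)

    # Equal-width bins; fall back to width 1.0 when all values are equal.
    width = (hi - lo) / n_bins or 1.0
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for value in nums:
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1

    # Assumed rule: flag values further than thr * stddev from the mean.
    far = {}
    for (item_key, ann_id), value in values.items():
        if stddev > 0 and abs(value - mean) > far_from_mean_thr * stddev:
            far.setdefault(item_key, {})[ann_id] = value

    return {
        "items_far_from_mean": far,
        "mean": mean,
        "stddev": stddev,
        "min": lo,
        "max": hi,
        "median": statistics.median(nums),
        "histogram": {"bins": edges, "counts": counts},
    }


# Made-up bounding box widths keyed by (item_key, ann_id).
widths = {
    (("img_1", "train"), 0): 10.0,
    (("img_1", "train"), 1): 12.0,
    (("img_2", "train"), 0): 11.0,
    (("img_3", "train"), 0): 13.0,
    (("img_4", "train"), 0): 100.0,
}
stats = numerical_stats(widths, far_from_mean_thr=1.5, n_bins=4)
print(stats["items_far_from_mean"])
print(stats["histogram"]["counts"])
```

In this made-up sample, only the width of 100.0 lies far enough from the mean to be listed under 'items_far_from_mean'.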