Level 8: Dataset Validation
When creating a dataset, imbalances between categories arise naturally, and the minority class may end up with very few data points. Annotations can also become inconsistent across annotators or over time. Training a model on such data calls for extra care, and it is sometimes necessary to filter or correct the data in advance. Datumaro provides data validation functionality for this purpose.
More detailed descriptions of the validation errors and warnings are given here. A Python example of using the validator is described in this notebook.
from datumaro.components.environment import Environment
from datumaro.components.dataset import Dataset
from datumaro.plugins.validators import DetectionValidator

data_path = '/path/to/data'

# Detect the dataset format and import the dataset
env = Environment()
detected_formats = env.detect_dataset(data_path)
dataset = Dataset.import_from(data_path, detected_formats[0])

# Validate the dataset for a detection task
validator = DetectionValidator()  # Or ClassificationValidator or SegmentationValidator
reports = validator.validate(dataset)
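The returned reports object is a plain Python dictionary, so it can be inspected or saved directly. The sketch below prints its top-level keys and writes the full report to disk, mirroring the JSON file that the CLI workflow produces; the exact key names (e.g. 'summary', 'validation_reports') may vary between Datumaro versions, so treat them as an assumption.

import json

# Inspect the top-level structure of the validation report
# (assumption: exact keys may differ between Datumaro versions)
print(list(reports.keys()))

# Persist the report for later inspection, similar to the JSON
# file produced by the CLI workflow below
with open('validation-report.json', 'w') as f:
    json.dump(reports, f, indent=4)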
With the project-based CLI, we first need to create a project:
datum project create -o <path/to/project>
We now import MS-COCO validation data into the project:
datum project import --format coco_instances -p <path/to/project> <path/to/data>
(Optional) When we import data, the change is automatically committed in the project. This can be shown through the log command:
datum project log -p <path/to/project>
(Optional) We can check information about the imported dataset, such as its subsets, number of items, and categories, through the dinfo command:
datum dinfo -p <path/to/project>
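The same inspection can be done from Python with the Dataset object imported earlier; a minimal sketch:

# Programmatic counterpart of `datum dinfo`, using the dataset
# from the Python example above
print(dataset.subsets())     # mapping of subset names to subsets
print(len(dataset))          # total number of items
print(dataset.categories())  # categories per annotation type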
Finally, we validate the data within the project:
datum validate --task-type <classification/detection/segmentation> --subset <subset_name> -p <path/to/project>
We now have the validation report named validation-report-<subset_name>.json.
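The generated report can also be inspected programmatically by loading it back as JSON. The sketch below uses validation-report-default.json, a hypothetical file name that assumes the validated subset was named default, and assumes the report contains a 'summary' section with error and warning counts; adjust the names if your Datumaro version differs.

import json

# 'validation-report-default.json' is a hypothetical name, assuming
# the validated subset was called 'default'
with open('validation-report-default.json') as f:
    report = json.load(f)

# Assumption: a 'summary' section with error/warning counts exists;
# fall back to an empty dict if the layout differs
summary = report.get('summary', {})
print(summary)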