Level 6: Data Comparison with Two Heterogeneous Datasets

Comparison is a fundamental tool for identifying and understanding the discrepancies between datasets. It lets you assess differences in data distribution, format, and annotation standards across sources, which in turn streamlines dataset consolidation. This matters because training deep learning models typically requires a cohesive, large-scale dataset.

In this tutorial, we provide a simple example of comparing two datasets. The Compare section describes the comparison operation in detail.

Comparing Datasets

Without declaring a project, you can compare multiple datasets directly using the following command:

datum compare <path/to/dataset1> <path/to/dataset2> -o result

In this case, the table method is used to generate a comparison table. The comparison report is written to the output directory as table_compare.json and table_compare.txt.
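The same invocation can be scripted. The following is a minimal sketch, assuming the `datum` executable is on `PATH`. The helper names `build_compare_command`, `expected_table_reports`, and `run_table_compare` are hypothetical and not part of Datumaro; they use only the `-m` and `-o` options and the report file names shown in this tutorial.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_compare_command(
    first: str, second: str, output_dir: str, method: Optional[str] = None
) -> List[str]:
    """Assemble the `datum compare` argument list (hypothetical helper)."""
    cmd = ["datum", "compare", first, second]
    if method is not None:
        # Only add -m when a method is requested, as in the tutorial commands.
        cmd += ["-m", method]
    cmd += ["-o", output_dir]
    return cmd


def expected_table_reports(output_dir: str) -> List[Path]:
    """Paths of the two report files named in this tutorial for the table method."""
    out = Path(output_dir)
    return [out / "table_compare.json", out / "table_compare.txt"]


def run_table_compare(first: str, second: str, output_dir: str) -> List[Path]:
    """Run the comparison and return any expected report files that are missing."""
    # check=True raises CalledProcessError if `datum` exits with a non-zero status.
    subprocess.run(build_compare_command(first, second, output_dir), check=True)
    return [p for p in expected_table_reports(output_dir) if not p.exists()]
```

Calling `run_table_compare("path/to/dataset1", "path/to/dataset2", "result")` returns an empty list when both report files are present in the output directory.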

To check whether annotations are equal, use:

datum compare <path/to/dataset1> <path/to/dataset2> -m equality -o result

The comparison report is written to the output directory as equality_compare.json.
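To inspect the equality report programmatically, something like the sketch below may help. It deliberately makes no claim about the report's schema: it only assumes that the top level of equality_compare.json is a JSON object, and it reports each top-level key with the size of its value. The function names `load_equality_report` and `summarize_report` are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def load_equality_report(output_dir: str) -> dict:
    """Load equality_compare.json from the comparison output directory."""
    report_path = Path(output_dir) / "equality_compare.json"
    with report_path.open(encoding="utf-8") as f:
        return json.load(f)


def summarize_report(report: dict) -> Dict[str, Optional[int]]:
    """Map each top-level key to the length of its value.

    Lists and objects map to their length; scalar values map to None.
    Assumes the report's top level is a JSON object (an assumption,
    not a documented guarantee).
    """
    return {
        key: len(value) if isinstance(value, (list, dict)) else None
        for key, value in report.items()
    }
```

For example, `summarize_report(load_equality_report("result"))` gives a quick overview of which sections of the report are non-empty before you open the file itself.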

To compare a dataset from another project with a distance metric, use:

datum compare <path/to/other/project/> -m distance -o result

The comparison report is written to the output directory as <annotation_type>_confusion.png. If the labels differ, a label_confusion result is created. This method supports the label, bbox, polygon, and mask annotation types.
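Because the distance method writes one image per annotation type, it can be handy to collect them after a run. The sketch below relies only on the `<annotation_type>_confusion.png` naming pattern described above; the function name `find_confusion_plots` is hypothetical.

```python
from pathlib import Path
from typing import Dict

SUFFIX = "_confusion.png"


def find_confusion_plots(output_dir: str) -> Dict[str, Path]:
    """Map each annotation type to its confusion plot in the output directory."""
    plots = {}
    # sorted() makes the result order deterministic across platforms.
    for path in sorted(Path(output_dir).glob("*" + SUFFIX)):
        annotation_type = path.name[: -len(SUFFIX)]
        plots[annotation_type] = path
    return plots
```

With `-o result` as in the command above, `find_confusion_plots("result")` returns an empty dictionary if no plots were generated.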

With the project-based CLI, you can compare the current project’s main target (project) in the working tree with the specified dataset using the following command:

datum compare <path/to/specified/dataset>

You can also simply compare multiple datasets by using:

datum compare <path/to/dataset1> <path/to/dataset2>