Level 14: Dataset Pruning
Datumaro supports a prune feature that extracts a representative subset of a dataset. The pruned dataset allows us to examine the trade-off between accuracy and convergence time when training on a reduced data sample. By selecting a subset of instances that captures the essential patterns and characteristics of the data, we aim to evaluate the impact of dataset size on model performance.
A more detailed description of pruning is given in Prune. A Python example of pruning is described here.
With the Python API, we can prune a dataset as follows:
from datumaro.components.dataset import Dataset
from datumaro.components.environment import Environment
from datumaro.components.prune import Prune
data_path = '/path/to/data'
# Detect the dataset format and import the dataset using the first match
env = Environment()
detected_formats = env.detect_dataset(data_path)
dataset = Dataset.import_from(data_path, detected_formats[0])
# Choose the pruning method, then retain the requested fraction of the dataset
pruner = Prune(dataset, cluster_method='<how/to/prune/dataset>')
result = pruner.get_pruned(ratio='<how/much/to/prune/dataset>')
We can choose the desired method for <how/to/prune/dataset> from the provided ones. The default value is random.
Additionally, we can specify how much of the dataset we want to retain by providing a float value between 0 and 1 for the <how/much/to/prune/dataset> parameter. The default value is 0.5.
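To make the meaning of the ratio concrete, the following is a minimal pure-Python sketch of random pruning. It is not Datumaro's implementation: the function name `random_prune`, the `seed` argument, and the rounding rule (truncating to an integer count) are assumptions made for illustration only.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def random_prune(items: Sequence[T], ratio: float = 0.5, seed: int = 0) -> List[T]:
    """Keep a `ratio` fraction of `items`, chosen uniformly at random.

    Hypothetical helper for illustration; Datumaro's own Prune class
    may select and round differently.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be a float between 0 and 1")
    keep = int(len(items) * ratio)  # number of items to retain (truncated)
    rng = random.Random(seed)       # seeded so the subset is reproducible
    # Sample indices without replacement, then restore the original order
    kept_indices = sorted(rng.sample(range(len(items)), keep))
    return [items[i] for i in kept_indices]


image_ids = [f"img_{i:03d}" for i in range(10)]
subset = random_prune(image_ids, ratio=0.5)
print(len(subset))  # 5
```

With the default ratio of 0.5, half of the ten items are retained; a ratio of 1.0 would keep everything and a ratio of 0.0 would keep nothing.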
Without declaring a project, we can prune a dataset with
datum prune <target> -m METHOD -r RATIO -h HASH_TYPE
We can use --overwrite instead of setting -o/--output-dir.
We can choose the desired method for METHOD from the provided ones. The default value is random.
Additionally, we can specify how much of the dataset we want to retain by providing a float value between 0 and 1 for the RATIO parameter. The default value is 0.5.
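When the command is driven from a script, it can help to validate RATIO before calling the CLI. The sketch below only assembles the argument list using the flags shown above; `build_prune_command` is a hypothetical helper, not part of Datumaro, and it assumes that `-h HASH_TYPE` may be omitted.

```python
import shlex
from typing import List, Optional


def build_prune_command(
    method: str = "random",
    ratio: float = 0.5,
    target: Optional[str] = None,
    hash_type: Optional[str] = None,
    overwrite: bool = False,
) -> List[str]:
    """Assemble a `datum prune` argument list (hypothetical helper)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("RATIO must be a float between 0 and 1")
    cmd = ["datum", "prune"]
    if target is not None:
        cmd.append(target)              # <target>: path to the dataset
    cmd += ["-m", method, "-r", str(ratio)]
    if hash_type is not None:
        cmd += ["-h", hash_type]        # assumption: the flag is optional
    if overwrite:
        cmd.append("--overwrite")       # used instead of -o/--output-dir
    return cmd


cmd = build_prune_command(target="/path/to/data", overwrite=True)
print(shlex.join(cmd))  # datum prune /path/to/data -m random -r 0.5 --overwrite
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where `datum` is installed.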
With the project-based CLI, we first need to create a project with
datum project create --output-dir <path/to/project>
We then import the data into the project with
datum project import --project <path/to/project> <path/to/data>
We can now prune the dataset with
datum prune -m METHOD -r RATIO -h HASH_TYPE -p <path/to/project>
We can choose the desired method for METHOD from the provided ones. The default value is random.
Additionally, we can specify how much of the dataset we want to retain by providing a float value between 0 and 1 for the RATIO parameter. The default value is 0.5.