datumaro.plugins.sampler.relevancy_sampler

Classes

RelevancySampler(extractor, count, *, ...[, ...])

Sampler that analyzes model inference results on the dataset and picks the best sample for training.

class datumaro.plugins.sampler.relevancy_sampler.RelevancySampler(extractor: IDataset, count: int, *, algorithm: str | Algorithm, sampling_method: str | SamplingMethod, input_subset: str | None = None, sampled_subset: str = 'sample', unsampled_subset: str = 'unsampled', output_file: str | None = None)

Bases: Transform, CliPlugin

Sampler that analyzes model inference results on the dataset and picks the best sample for training.

Creates a dataset from the -k/--count hardest items for a model. The whole dataset or a single subset will be split into the sampled and unsampled subsets based on the model confidence. The dataset must contain model confidence values in the scores attributes of annotations.

There are five methods of sampling (the -m/--method option):
  • topk - Return the k items with the highest uncertainty

  • lowk - Return the k items with the lowest uncertainty

  • randk - Return k random items

  • mixk - Return half of the items using the topk method and the other half using the lowk method

  • randtopk - Select 3*k items randomly, then return the k items with the highest uncertainty among them
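The five methods above can be sketched in plain Python, given one uncertainty score per item. This is only an illustration, not the plugin's implementation: the function name select_indices, the seed argument, and the way an odd k is split in mixk are assumptions made for this sketch.

```python
import random


def select_indices(uncertainty, k, method, seed=None):
    """Pick k item indices from a list of per-item uncertainty scores.

    Hypothetical helper for illustration; not part of Datumaro.
    """
    rng = random.Random(seed)
    n = len(uncertainty)
    k = min(k, n)  # requesting more items than exist returns all of them
    # Indices ordered from the most to the least uncertain item.
    by_uncertainty = sorted(range(n), key=lambda i: uncertainty[i], reverse=True)

    if method == "topk":
        return by_uncertainty[:k]
    if method == "lowk":
        return by_uncertainty[::-1][:k]
    if method == "randk":
        return rng.sample(range(n), k)
    if method == "mixk":
        # Half from the top, the remainder from the bottom.
        # Assumption: with an odd k, the extra item comes from the lowk side.
        top = by_uncertainty[: k // 2]
        rest = [i for i in reversed(by_uncertainty) if i not in top]
        return top + rest[: k - len(top)]
    if method == "randtopk":
        # Draw 3*k random candidates, then keep the k most uncertain of them.
        pool = rng.sample(range(n), min(3 * k, n))
        pool.sort(key=lambda i: uncertainty[i], reverse=True)
        return pool[:k]
    raise ValueError(f"unknown sampling method: {method}")
```

For example, with scores [0.1, 0.9, 0.5, 0.7, 0.3] and k=2, topk selects items 1 and 3, while lowk selects items 0 and 4.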

Notes:
  • Each image’s inference result must contain the probability for all classes.

  • Requesting a sample larger than the number of all images will return all images.
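The first note matters because an entropy-style score is computed over the whole class distribution. A minimal sketch, assuming the standard Shannon entropy with the natural logarithm (the exact formula used by the plugin is not stated here, and the function name is hypothetical):

```python
import math


def entropy(probs):
    """Shannon entropy of a class-probability vector (assumed formula)."""
    # Zero-probability classes contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0)


confident = [0.98, 0.01, 0.01]  # the model clearly favours one class
uncertain = [0.34, 0.33, 0.33]  # the model cannot decide

# The undecided prediction gets the higher uncertainty score.
print(entropy(confident) < entropy(uncertain))
```

Under this assumption, items with near-uniform class probabilities are the "hardest" ones and are the first to be picked by the topk method.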

Example: select the most relevant data subset of 20 images based on model certainty, put the result into the ‘sample’ subset and all the rest into the ‘unsampled’ subset, using the ‘train’ subset as input.

relevancy_sampler \
  --algorithm entropy \
  --subset_name train \
  --sample_name sample \
  --unsampled_name unsampled \
  --sampling_method topk -k 20
Parameters:
  • extractor

  • algorithm – The algorithm used to calculate the uncertainty for sample selection. default: ‘entropy’

  • subset_name – The name of the subset from which to select a sample (input_subset in the signature above).

  • sample_name – Subset name of the selected sample (sampled_subset in the signature above), default: ‘sample’

  • sampling_method – Method of sampling: ‘topk’, ‘lowk’, ‘randk’, ‘mixk’ or ‘randtopk’

  • count – Number of samples to extract

  • output_file – A path to a .csv file for sampling results

classmethod build_cmdline_parser(**kwargs)