datumaro.plugins.ndr

Classes

Algorithm(value) – An enumeration.

NDR(extractor, working_subset[, ...]) – Removes near-duplicated images in a subset.

OverSamplingMethod(value) – An enumeration.

UnderSamplingMethod(value) – An enumeration.

class datumaro.plugins.ndr.Algorithm(value)

Bases: Enum

An enumeration.

gradient = 1

class datumaro.plugins.ndr.OverSamplingMethod(value)

Bases: Enum

An enumeration.

random = 1
similarity = 2

class datumaro.plugins.ndr.UnderSamplingMethod(value)

Bases: Enum

An enumeration.

uniform = 1
inverse = 2
class datumaro.plugins.ndr.NDR(extractor, working_subset, duplicated_subset='duplicated', algorithm=None, num_cut=None, over_sample=None, under_sample=None, seed=None, **kwargs)

Bases: Transform, CliPlugin

Removes near-duplicated images in a subset.

Removes duplicated images from a dataset, keeping at most -k/--num_cut resulting images.

Available oversampling policies (the -e parameter):
  • random - sample from the removed data randomly

  • similarity - sample from the removed data in ascending order of similarity score

Available undersampling policies (the -u parameter):
  • uniform - sample data with a uniform distribution

  • inverse - sample data with probability proportional to the reciprocal of the number of items with the same similarity

Example: apply NDR, return no more than 100 images

    ndr \
        --working_subset train \
        --algorithm gradient \
        --num_cut 100 \
        --over_sample random \
        --under_sample uniform
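The same operation can be expressed through Datumaro's Python API, where the transform is looked up by its plugin name "ndr". The snippet below is a minimal sketch; the dataset path and import format are placeholder assumptions:

    import datumaro as dm

    # Load a dataset (the path and format here are placeholders)
    dataset = dm.Dataset.import_from("path/to/dataset", "voc")

    # Apply NDR with the same settings as the CLI example above
    dataset.transform(
        "ndr",
        working_subset="train",
        algorithm="gradient",
        num_cut=100,
        over_sample="random",
        under_sample="uniform",
    )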

Near-duplicated image removal

Parameters:
  • working_subset (str) – name of the subset to operate on; if None, DEFAULT_SUBSET_NAME is used

  • duplicated_subset (str) – name of the subset that receives the removed data after NDR runs

  • algorithm (str) – name of the algorithm to use; only "gradient" for now

  • num_cut (int) – number of outputs you want; the algorithm cuts the whole dataset down to this amount. If None, the result is returned without any modification

  • over_sample ("random" or "similarity") – strategy used when num_cut > length of the result after removal. If "random", sample from the removed data randomly; if "similarity", select from the removed data in ascending order of similarity

  • under_sample ("uniform" or "inverse") – strategy used when num_cut < length of the result after removal. If "uniform", sample data with a uniform distribution; if "inverse", sample data with probability proportional to the reciprocal of the number of items that share the same hash key

  • Algorithm-specific parameters for "gradient":

    block_shape (tuple, (h, w)) – for robustness, the function operates on blocks; mean and variance are calculated per block

    hash_dim (int) – dimension (number of bits) of the hash function

    sim_threshold (float) – threshold for saving hash-collided samples; a larger value is more generous, i.e. saves more samples

Return type:

None; the other subsets are combined with the result
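As a sketch of how the gradient-specific parameters are supplied, they can be passed as extra keyword arguments alongside the regular ones when constructing the transform directly, reusing the dataset loaded in the earlier sketch. The concrete values below are illustrative assumptions, not recommended defaults:

    from datumaro.plugins.ndr import NDR

    # Wrap an existing extractor/dataset; the numeric values are
    # illustrative assumptions only
    result = NDR(
        dataset,
        working_subset="train",
        algorithm="gradient",
        over_sample="similarity",
        under_sample="inverse",
        block_shape=(4, 4),   # blocks used for mean/variance computation
        hash_dim=32,          # number of bits of the hash
        sim_threshold=0.5,    # larger value saves more hash-collided samples
    )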

classmethod build_cmdline_parser(**kwargs)
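Since NDR is a CliPlugin, the parser built here is argparse-based and accepts the flags shown in the example above. A minimal usage sketch:

    from datumaro.plugins.ndr import NDR

    parser = NDR.build_cmdline_parser()
    args = parser.parse_args([
        "--working_subset", "train",
        "--algorithm", "gradient",
        "--num_cut", "100",
    ])
    print(vars(args))  # parsed options as a plain dict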