datumaro.plugins.ndr#
Classes

- OverSamplingMethod – An enumeration.
- UnderSamplingMethod – An enumeration.
- NDR – Removes near-duplicated images in a subset.
- class datumaro.plugins.ndr.OverSamplingMethod(value)[source]#
Bases:
Enum
An enumeration.
- random = 1#
- similarity = 2#
- class datumaro.plugins.ndr.UnderSamplingMethod(value)[source]#
Bases:
Enum
An enumeration.
- uniform = 1#
- inverse = 2#
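Both policy classes are plain Python Enums, so their members can be looked up by name (e.g. from a CLI string) or by value. A minimal sketch, restating the members listed above with the standard-library Enum outside datumaro:

```python
from enum import Enum

class OverSamplingMethod(Enum):
    random = 1
    similarity = 2

class UnderSamplingMethod(Enum):
    uniform = 1
    inverse = 2

# Look up by name (as parsed from a CLI flag) or by value
over = OverSamplingMethod["random"]
under = UnderSamplingMethod(2)
print(over.name, over.value)    # random 1
print(under.name, under.value)  # inverse 2
```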
- class datumaro.plugins.ndr.NDR(extractor, working_subset, duplicated_subset='duplicated', algorithm=None, num_cut=None, over_sample=None, under_sample=None, seed=None, **kwargs)[source]#
Removes near-duplicated images in a subset.
Remove duplicated images from a dataset. Keep at most -k/--num_cut resulting images.
- Available oversampling policies (the -e parameter):
random - sample from removed data randomly
similarity - sample from removed data with ascending similarity score
- Available undersampling policies (the -u parameter):
uniform - sample data with uniform distribution
inverse - sample data with the reciprocal of the number of items with the same similarity
Example: apply NDR, return no more than 100 images
ndr --working_subset train --algorithm gradient --num_cut 100 --over_sample random --under_sample uniform
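To make the two oversampling policies concrete, here is a toy sketch in plain Python (not datumaro's implementation), assuming each removed item carries a similarity score:

```python
import random

# Hypothetical removed items as (item_id, similarity_score) pairs
removed = [("a", 0.90), ("b", 0.10), ("c", 0.50), ("d", 0.30)]

def over_sample(removed, need, policy, seed=None):
    """Pick `need` items to add back from the removed pool."""
    if policy == "random":
        # random policy: sample from removed data uniformly at random
        rng = random.Random(seed)
        return rng.sample(removed, need)
    elif policy == "similarity":
        # similarity policy: take items in ascending order of similarity,
        # i.e. the least-duplicated items come back first
        return sorted(removed, key=lambda kv: kv[1])[:need]
    raise ValueError(f"unknown policy: {policy}")

print(over_sample(removed, 2, "similarity"))  # [('b', 0.1), ('d', 0.3)]
```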
Near-duplicated image removal
- Parameters:
working_subset (str) – name of the subset to operate on; if None, DEFAULT_SUBSET_NAME is used
duplicated_subset (str) – name of the subset for the removed data after NDR runs
algorithm (str) – name of the algorithm to use; only “gradient” for now
num_cut (int) – number of outputs you want; the algorithm will cut the whole dataset down to this amount. If None, the result is returned without any modification
over_sample ("random" or "similarity") – specify the strategy when num_cut > length of the result after removal. If random, sample from the removed data randomly; if similarity, select from the removed data in ascending order of similarity
under_sample ("uniform" or "inverse") – specify the strategy when num_cut < length of the result after removal. If uniform, sample data with a uniform distribution; if inverse, sample data with probability proportional to the reciprocal of the number of items that share the same hash key
**kwargs – algorithm-specific parameters; for gradient:
- block_shape: tuple, (h, w)
for robustness, the function operates on blocks; the mean and variance are calculated per block
- hash_dim: int
dimension (in bits) of the hash function
- sim_threshold: float
the threshold value for saving hash-collided samples; a larger value is more generous, i.e., saves more samples
- Return type:
None; the other subsets are combined with the result
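The block_shape and sim_threshold parameters describe a block-based hashing scheme. As an illustrative sketch only (a simplified stand-in, not datumaro's actual gradient algorithm): an image is reduced to per-block means, which are binarized against the global mean to form a compact hash, so near-duplicate images collide on the same key.

```python
# Simplified block-hash sketch (illustrative, not datumaro's algorithm):
# split a grayscale image into blocks, take each block's mean, and
# binarize against the global mean to get a compact hash.

def block_hash(img, block_shape=(2, 2)):
    h, w = len(img), len(img[0])
    bh, bw = block_shape
    means = []
    for y in range(0, h, bh):
        for x in range(0, w, bw):
            block = [img[yy][xx]
                     for yy in range(y, min(y + bh, h))
                     for xx in range(x, min(x + bw, w))]
            means.append(sum(block) / len(block))
    global_mean = sum(means) / len(means)
    # one bit per block: 1 if the block is brighter than average
    return tuple(int(m > global_mean) for m in means)

img_a = [[10, 10, 200, 200],
         [10, 10, 200, 200],
         [10, 10,  10,  10],
         [10, 10,  10,  10]]
img_b = [[12,  9, 201, 199],   # slightly perturbed copy of img_a
         [11, 10, 198, 202],
         [ 9, 11,  10,  12],
         [10, 10,  11,   9]]

print(block_hash(img_a) == block_hash(img_b))  # near-duplicates collide: True
```

In the real plugin, samples whose hashes collide are then compared with a similarity score, and only those above sim_threshold are kept as duplicates.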