datumaro.plugins.ndr#
Classes

- OverSamplingMethod – An enumeration.
- UnderSamplingMethod – An enumeration.
- NDR – Removes near-duplicated images in a subset.
- class datumaro.plugins.ndr.OverSamplingMethod(value)[source]#
Bases:
Enum
An enumeration.
- random = 1#
- similarity = 2#
- class datumaro.plugins.ndr.UnderSamplingMethod(value)[source]#
Bases:
Enum
An enumeration.
- uniform = 1#
- inverse = 2#
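Both policy classes are plain Python Enums, so their members can be looked up by name (e.g. from a CLI string) or by value. A minimal sketch, restating the members listed above with the standard-library Enum outside datumaro:

```python
from enum import Enum

class OverSamplingMethod(Enum):
    random = 1
    similarity = 2

class UnderSamplingMethod(Enum):
    uniform = 1
    inverse = 2

# Look up by name (as parsed from a CLI flag) or by value
over = OverSamplingMethod["random"]
under = UnderSamplingMethod(2)
print(over.name, over.value)    # random 1
print(under.name, under.value)  # inverse 2
```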
- class datumaro.plugins.ndr.NDR(extractor, working_subset, duplicated_subset='duplicated', algorithm=None, num_cut=None, over_sample=None, under_sample=None, seed=None, **kwargs)[source]#
Removes near-duplicated images in a subset.
Remove duplicated images from a dataset. Keep at most -k/--num_cut resulting images.
- Available oversampling policies (the -e parameter):
random - sample from removed data randomly
similarity - sample from removed data with ascending similarity score
- Available undersampling policies (the -u parameter):
uniform - sample data with uniform distribution
inverse - sample data with the reciprocal of the number of items with the same similarity
Example: apply NDR, return no more than 100 images
ndr --working_subset train --algorithm gradient --num_cut 100 --over_sample random --under_sample uniform
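To make the two oversampling policies concrete, here is a toy sketch in plain Python (not datumaro's implementation), assuming each removed item carries a similarity score:

```python
import random

# Hypothetical removed items as (item_id, similarity_score) pairs
removed = [("a", 0.90), ("b", 0.10), ("c", 0.50), ("d", 0.30)]

def over_sample(removed, need, policy, seed=None):
    """Pick `need` items to add back from the removed pool."""
    if policy == "random":
        # random policy: sample from removed data uniformly at random
        rng = random.Random(seed)
        return rng.sample(removed, need)
    elif policy == "similarity":
        # similarity policy: take items in ascending order of similarity,
        # i.e. the least-duplicated items come back first
        return sorted(removed, key=lambda kv: kv[1])[:need]
    raise ValueError(f"unknown policy: {policy}")

print(over_sample(removed, 2, "similarity"))  # [('b', 0.1), ('d', 0.3)]
```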
Near-duplicated image removal
- Parameters:
working_subset (str) – name of the subset to operate on; if None, DEFAULT_SUBSET_NAME is used
duplicated_subset (str) – name of the subset for the removed data after NDR runs
algorithm (str) – name of the algorithm to use; only “gradient” for now
num_cut (int) – number of outputs you want; the algorithm will cut the whole dataset down to this amount. If None, the result is returned without any modification
over_sample ("random" or "similarity") – specify the strategy when num_cut > length of the result after removal. If random, sample from the removed data randomly; if similarity, select from the removed data in ascending order of similarity
under_sample ("uniform" or "inverse") – specify the strategy when num_cut < length of the result after removal. If uniform, sample data with a uniform distribution; if inverse, sample data with probability proportional to the reciprocal of the number of items that share the same hash key
**kwargs – algorithm-specific parameters; for gradient:
- block_shape: tuple, (h, w)
for robustness, the function operates on blocks; the mean and variance are calculated per block
- hash_dim: int
dimension (in bits) of the hash function
- sim_threshold: float
the threshold value for saving hash-collided samples; a larger value is more generous, i.e., saves more samples
- Return type:
None; the other subsets are combined with the result
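The block_shape and sim_threshold parameters describe a block-based hashing scheme. As an illustrative sketch only (a simplified stand-in, not datumaro's actual gradient algorithm): an image is reduced to per-block means, which are binarized against the global mean to form a compact hash, so near-duplicate images collide on the same key.

```python
# Simplified block-hash sketch (illustrative, not datumaro's algorithm):
# split a grayscale image into blocks, take each block's mean, and
# binarize against the global mean to get a compact hash.

def block_hash(img, block_shape=(2, 2)):
    h, w = len(img), len(img[0])
    bh, bw = block_shape
    means = []
    for y in range(0, h, bh):
        for x in range(0, w, bw):
            block = [img[yy][xx]
                     for yy in range(y, min(y + bh, h))
                     for xx in range(x, min(x + bw, w))]
            means.append(sum(block) / len(block))
    global_mean = sum(means) / len(means)
    # one bit per block: 1 if the block is brighter than average
    return tuple(int(m > global_mean) for m in means)

img_a = [[10, 10, 200, 200],
         [10, 10, 200, 200],
         [10, 10,  10,  10],
         [10, 10,  10,  10]]
img_b = [[12,  9, 201, 199],   # slightly perturbed copy of img_a
         [11, 10, 198, 202],
         [ 9, 11,  10,  12],
         [10, 10,  11,   9]]

print(block_hash(img_a) == block_hash(img_b))  # near-duplicates collide: True
```

In the real plugin, samples whose hashes collide are then compared with a similarity score, and only those above sim_threshold are kept as duplicates.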