datumaro.components.format_detection#

Module Attributes

FormatDetector

Denotes a callback that implements detection for a specific dataset format.

Functions

apply_format_detector(dataset_root_path, ...)

Checks whether the dataset located at dataset_root_path belongs to the format detected by detector.

detect_dataset_format(formats, path, *[, ...])

Determines which format(s) the dataset at the specified path belongs to.

Classes

DetectedFormat(confidence, name)

FormatDetectionConfidence(value)

Represents the level of confidence that a detector has in a dataset belonging to the detector's format.

FormatDetectionContext(root_path)

An instance of this class is given to a dataset format detector.

RejectionCallback(*args, **kwargs)

RejectionReason(value)

An enumeration.

Exceptions

FormatDetectionUnsupported

Represents a situation where detection is attempted with a format that does not support it.

FormatRequirementsUnmet(failed_alternatives)

Represents a situation where a dataset does not meet the requirements of a given dataset format.

class datumaro.components.format_detection.FormatDetectionConfidence(value)[source]#

Bases: IntEnum

Represents the level of confidence that a detector has in a dataset belonging to the detector’s format.

NONE = 1#
EXTREME_LOW = 5#

Currently assigned only to the ImageDir format, because an ImageDir layout can be detected in any image dataset, whatever its actual format.

LOW = 10#

The dataset seems to belong to the format, but the format is too loosely defined to be able to distinguish it from other formats.

MEDIUM = 20#

The dataset seems to belong to the format, and is unlikely to belong to any other format.
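Because the class derives from IntEnum, confidence levels compare like integers. The following is a minimal sketch using a local stand-in enum whose values are copied from the listing above; `Confidence` and the format names are illustrative assumptions, not part of datumaro.

```python
from enum import IntEnum


class Confidence(IntEnum):
    """Local stand-in mirroring the documented values; not the datumaro class."""

    NONE = 1
    EXTREME_LOW = 5
    LOW = 10
    MEDIUM = 20


# IntEnum members are ordered by their integer value.
assert Confidence.EXTREME_LOW < Confidence.LOW < Confidence.MEDIUM

# Hypothetical detection results as (format name, confidence) pairs.
results = [("image_dir", Confidence.EXTREME_LOW), ("coco", Confidence.MEDIUM)]

# Keep only the formats that reached the highest confidence seen.
best = max(confidence for _, confidence in results)
winners = [name for name, confidence in results if confidence == best]
```

Here `winners` contains only the format that reached the highest level, which is how a loosely defined format such as ImageDir can be outranked by a more specific one.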

class datumaro.components.format_detection.DetectedFormat(confidence: datumaro.components.format_detection.FormatDetectionConfidence, name: str)[source]#

Bases: object

confidence: FormatDetectionConfidence#
name: str#

class datumaro.components.format_detection.RejectionReason(value)[source]#

Bases: Enum

An enumeration.

unmet_requirements = 1#
insufficient_confidence = 2#
detection_unsupported = 3#

exception datumaro.components.format_detection.FormatRequirementsUnmet(failed_alternatives: Sequence[str])[source]#

Bases: _FormatRejected

Represents a situation where a dataset does not meet the requirements of a given dataset format. More specifically, if this exception is raised, then it is necessary (but may not be sufficient) for the dataset to meet at least one of the requirements listed in failed_alternatives to be detected as being in that format.

Each element of failed_alternatives must be a human-readable statement describing a requirement that was not met.

Must not be constructed or raised directly; use FormatDetectionContext methods.

property reason: RejectionReason#

exception datumaro.components.format_detection.FormatDetectionUnsupported[source]#

Bases: _FormatRejected

Represents a situation where detection is attempted with a format that does not support it.

Must not be constructed or raised directly; use FormatDetectionContext.raise_unsupported instead.

property reason: RejectionReason#

class datumaro.components.format_detection.FormatDetectionContext(root_path: str)[source]#

Bases: object

An instance of this class is given to a dataset format detector. See the FormatDetector documentation. The class should not be instantiated directly.

A context encapsulates information about the dataset whose format is being detected. It also offers methods that place requirements on that dataset. Each such method raises a FormatRequirementsUnmet exception if the requirement is not met. If the requirement is met, the return value depends on the method.

property root_path: str#

Returns the path to the root directory of the dataset. Detectors should avoid using this property in favor of specific requirement methods.

raise_unsupported() → NoReturn[source]#

Raises a FormatDetectionUnsupported exception to signal that the current format does not support detection.

fail(requirement_desc: str) → NoReturn[source]#

Places a requirement that is never met. requirement_desc must contain a human-readable description of the requirement.

require_file(pattern: str, *, exclude_fnames: str | Collection[str] = ()) → str[source]#

Places the requirement that the dataset contains at least one file whose relative path matches the given pattern. The pattern must be a glob-like pattern; ** can be used to indicate a sequence of zero or more subdirectories. If the pattern does not describe a relative path, or refers to files outside the dataset root, the requirement is considered unmet. If the requirement is met, the relative path to one of the files that match the pattern is returned. If there are multiple such files, it’s unspecified which one of them is returned.

exclude_fnames must be a collection of patterns or a single pattern. If at least one pattern is supplied, then the placed requirement is narrowed to only accept files with names that match none of these patterns.
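The matching rules described above can be approximated with the standard library. This is a rough sketch only: `find_matching_files` is a hypothetical helper built on `glob` and `fnmatch`, and it is not how datumaro necessarily implements `require_file`.

```python
import fnmatch
import glob
import os
import tempfile


def find_matching_files(root, pattern, exclude_fnames=()):
    """Hypothetical helper: relative paths of files under `root` that match
    `pattern`, skipping files whose name matches any exclusion pattern."""
    if isinstance(exclude_fnames, str):
        exclude_fnames = (exclude_fnames,)
    if os.path.isabs(pattern):
        return []  # an absolute pattern does not describe a relative path
    matches = []
    full_pattern = os.path.join(glob.escape(root), pattern)
    # recursive=True lets "**" stand for zero or more subdirectories.
    for path in glob.iglob(full_pattern, recursive=True):
        if not os.path.isfile(path):
            continue
        rel = os.path.relpath(path, root)
        if rel == os.pardir or rel.startswith(os.pardir + os.sep):
            continue  # the match lies outside the dataset root
        name = os.path.basename(rel)
        if any(fnmatch.fnmatch(name, ex) for ex in exclude_fnames):
            continue
        matches.append(rel)
    return sorted(matches)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "annotations", "train"))
    for rel in ("annotations/a.json",
                "annotations/train/b.json",
                "annotations/train/skip.json"):
        with open(os.path.join(root, *rel.split("/")), "w") as f:
            f.write("{}")
    found = find_matching_files(root, "annotations/**/*.json",
                                exclude_fnames="skip*")
    found = [p.replace(os.sep, "/") for p in found]
```

In this sketch an empty result corresponds to an unmet requirement; the real method raises FormatRequirementsUnmet instead of returning an empty list.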

require_files(pattern: str, *, exclude_fnames: str | Collection[str] = ()) → List[str][source]#

Same as require_file, but returns all matching paths in alphabetical order.

require_files_iter(pattern: str, *, exclude_fnames: str | Collection[str] = ()) → Iterator[str][source]#

Same as require_files, but returns a generator.

probe_text_file(path: str, requirement_desc: str, is_binary_file: bool = False) → Iterator[BufferedReader | TextIO][source]#

Returns a context manager that can be used to place a requirement on the contents of the file referred to by path. To do so, you must enter and exit this context manager (typically, by using the with statement). On entering, the file is opened for reading and the resulting file object is returned. The file is opened in text mode by default; the is_binary_file parameter and the BufferedReader return type in the signature suggest that binary mode is used when is_binary_file is True. On exiting, the file object is closed.

The requirement that is placed by doing this is considered met if all of the following are true:

  • path is a relative path that refers to a file within the dataset root.

  • The file is opened successfully.

  • The context is exited without an exception.

If the context is exited with an exception that was produced by another requirement being unmet, that exception is reraised and the new requirement is abandoned.

requirement_desc must be a human-readable statement describing the requirement.
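The behaviour described above can be modelled with a small generator-based context manager. The sketch below is a simplified stand-in, not the datumaro implementation: `RequirementUnmet`, the free function `probe_text_file(root, ...)`, the error message format, and the file name `meta.txt` are all assumptions made for illustration, and only text mode is modelled.

```python
import contextlib
import os
import tempfile


class RequirementUnmet(Exception):
    """Stand-in for FormatRequirementsUnmet (hypothetical, simplified)."""


@contextlib.contextmanager
def probe_text_file(root, path, requirement_desc):
    """Simplified stand-in for FormatDetectionContext.probe_text_file."""
    if os.path.isabs(path):
        raise RequirementUnmet(f"{path} must be a relative path")
    try:
        with open(os.path.join(root, path), encoding="utf-8") as f:
            yield f
    except RequirementUnmet:
        raise  # another requirement failed inside the block: reraise it
    except Exception:
        # The file could not be opened, or the caller's check raised.
        raise RequirementUnmet(f"{path} must {requirement_desc}") from None


def has_magic_header(root):
    with probe_text_file(root, "meta.txt", "start with '# myformat'") as f:
        if not f.readline().startswith("# myformat"):
            raise Exception  # any exception marks the requirement as unmet
    return True


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "meta.txt"), "w", encoding="utf-8") as f:
        f.write("# other\n")
    try:
        has_magic_header(root)
        outcome = "met"
    except RequirementUnmet as ex:
        outcome = str(ex)
```

The point of the pattern is that the check inside the with block needs no special API: exiting the block with any exception is enough to report the requirement as unmet.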

require_any() → Iterator[None][source]#

Returns a context manager that can be used to place a requirement that is considered met if at least one of several alternative sets of requirements is met. To do so, use a with statement, with the alternative sets of requirements represented as nested with statements using the context manager returned by alternative:

with context.require_any():
    with context.alternative():
        # place requirements from alternative set 1 here
    with context.alternative():
        # place requirements from alternative set 2 here
    ...

The contents of all with context.alternative() blocks will be executed, even if an alternative that is met is found early.

Requirements must not be placed directly within a with context.require_any() block.
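To make the semantics concrete, here is a toy model of require_any and alternative. It is a sketch under stated assumptions: `SketchContext` and `RequirementsUnmet` are hypothetical stand-ins that model only fail, require_any, and alternative, and nested require_any blocks are not supported.

```python
import contextlib


class RequirementsUnmet(Exception):
    """Stand-in for FormatRequirementsUnmet (hypothetical, simplified)."""

    def __init__(self, failed_alternatives):
        super().__init__("; ".join(failed_alternatives))
        self.failed_alternatives = list(failed_alternatives)


class SketchContext:
    """Toy context modelling only fail, require_any and alternative."""

    def __init__(self):
        self._any_met = False
        self._failures = []

    def fail(self, requirement_desc):
        raise RequirementsUnmet([requirement_desc])

    @contextlib.contextmanager
    def require_any(self):
        self._any_met, self._failures = False, []
        yield
        # Reached only after every alternative block has run.
        if not self._any_met:
            raise RequirementsUnmet(self._failures)

    @contextlib.contextmanager
    def alternative(self):
        try:
            yield
        except RequirementsUnmet as ex:
            # Remember why this alternative failed, then keep going.
            self._failures.extend(ex.failed_alternatives)
        else:
            self._any_met = True


executed = []
context = SketchContext()
with context.require_any():
    with context.alternative():
        executed.append("first")   # nothing fails here: alternative is met
    with context.alternative():
        executed.append("second")  # still runs although the first was met
        context.fail("dataset must contain a marker file")
```

Both alternative blocks run, and the require_any block exits cleanly because one alternative was met. If every alternative failed, the toy context would raise a single exception listing all collected descriptions.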

alternative() → Iterator[None][source]#

Returns a context manager that can be used in combination with require_any to define alternative requirements. See the documentation for require_any for more details.

Must only be used directly within a with context.require_any() block.

datumaro.components.format_detection.FormatDetector#

Denotes a callback that implements detection for a specific dataset format. The callback receives an instance of FormatDetectionContext and must call methods on that instance to place requirements that the dataset must meet in order for it to be considered as belonging to the format.

Must terminate in one of the following ways:

  • by returning the level of confidence in the dataset belonging to the format (or None, which is equivalent to the MEDIUM level);

  • by raising a FormatRequirementsUnmet exception via one of the FormatDetectionContext methods;

  • by raising a FormatDetectionUnsupported exception via FormatDetectionContext.raise_unsupported.

alias of Callable[[FormatDetectionContext], Optional[FormatDetectionConfidence]]

datumaro.components.format_detection.apply_format_detector(dataset_root_path: str, detector: Callable[[FormatDetectionContext], FormatDetectionConfidence | None]) → FormatDetectionConfidence[source]#

Checks whether the dataset located at dataset_root_path belongs to the format detected by detector. If it does, returns the confidence level of the detection. Otherwise, terminates with the exception that was raised by the detector.
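The contract between a detector and this function can be sketched as follows. Everything here is a stand-in written for illustration: `Confidence`, `SketchContext`, `apply_detector_sketch`, the two detectors, and the dataset path are assumptions, and the real function may perform additional checks.

```python
from enum import IntEnum
from typing import Callable, Optional


class Confidence(IntEnum):
    """Stand-in for FormatDetectionConfidence (values as documented)."""

    NONE = 1
    EXTREME_LOW = 5
    LOW = 10
    MEDIUM = 20


class SketchContext:
    """Stand-in for FormatDetectionContext; only carries the root path."""

    def __init__(self, root_path: str):
        self.root_path = root_path


Detector = Callable[[SketchContext], Optional[Confidence]]


def apply_detector_sketch(dataset_root_path: str, detector: Detector) -> Confidence:
    context = SketchContext(dataset_root_path)
    result = detector(context)  # exceptions raised by the detector propagate
    # Per the FormatDetector contract, None is equivalent to MEDIUM.
    return Confidence.MEDIUM if result is None else result


def loose_detector(context: SketchContext) -> Optional[Confidence]:
    return Confidence.LOW  # explicit confidence level


def default_detector(context: SketchContext) -> Optional[Confidence]:
    return None  # falls back to MEDIUM


low = apply_detector_sketch("/path/to/dataset", loose_detector)
medium = apply_detector_sketch("/path/to/dataset", default_detector)
```

A detector that places no further requirements and returns nothing is therefore reported with MEDIUM confidence, which is why loosely defined formats should return a lower level explicitly.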

class datumaro.components.format_detection.RejectionCallback(*args, **kwargs)[source]#

Bases: Protocol

datumaro.components.format_detection.detect_dataset_format(formats: Iterable[Tuple[str, Callable[[FormatDetectionContext], FormatDetectionConfidence | None]]], path: str, *, rejection_callback: RejectionCallback | None = None) → Sequence[DetectedFormat][source]#

Determines which format(s) the dataset at the specified path belongs to.

The function applies each supplied detector to the given path and decides whether the corresponding format is detected or rejected. A format may be rejected if the detector fails or if it succeeds with less confidence than another detector (other rejection reasons might be added in the future).

Parameters:
  • formats – The formats to be considered. Each element of the iterable must be a tuple of a format name and a FormatDetector instance.

  • path – The filesystem path to the dataset to be analyzed.

  • rejection_callback – Unless None, called for every rejected format to report the reason it was rejected.

Returns: A sequence of DetectedFormat objects, one per detected format, each carrying the format name and the detection confidence.
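The selection rule described above (reject failing detectors, then keep only the formats with the highest confidence) might look roughly like the sketch below. All names are stand-ins. In particular, the callback signature `(format_name, reason, message)` is an assumption, since the RejectionCallback protocol is not spelled out here, and for brevity the sketch detectors receive the path directly instead of a context object.

```python
from enum import Enum, IntEnum
from typing import NamedTuple


class Confidence(IntEnum):
    LOW = 10
    MEDIUM = 20


class Reason(Enum):
    unmet_requirements = 1
    insufficient_confidence = 2


class Rejected(Exception):
    """Stand-in for the exceptions a detector may raise."""


class Detected(NamedTuple):
    confidence: Confidence
    name: str


def detect_sketch(formats, path, *, rejection_callback=None):
    def report(name, reason, message):
        if rejection_callback is not None:
            rejection_callback(name, reason, message)

    best = None
    detected = []
    for name, detector in formats:
        try:
            confidence = detector(path)
        except Rejected as ex:
            report(name, Reason.unmet_requirements, str(ex))
            continue
        if confidence is None:
            confidence = Confidence.MEDIUM
        if best is not None and confidence < best:
            report(name, Reason.insufficient_confidence, "outranked")
            continue
        if best is None or confidence > best:
            # A stronger match demotes everything accepted so far.
            for earlier in detected:
                report(earlier.name, Reason.insufficient_confidence, "outranked")
            detected, best = [], confidence
        detected.append(Detected(confidence, name))
    return detected


def loose(path):
    return Confidence.LOW


def strict(path):
    return None  # equivalent to MEDIUM


def broken(path):
    raise Rejected("dataset must contain annotations.json")


rejections = []
result = detect_sketch(
    [("loose", loose), ("strict", strict), ("broken", broken)],
    "/path/to/dataset",
    rejection_callback=lambda name, reason, message: rejections.append((name, reason)),
)
```

In this run only the "strict" format survives. The callback is told that "loose" was outranked and that "broken" failed its requirements.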