datumaro.plugins.data_formats.arrow.importer

Classes

ArrowImporter()

class datumaro.plugins.data_formats.arrow.importer.ArrowImporter

Bases: Importer

classmethod detect(context: FormatDetectionContext) → FormatDetectionConfidence | None
classmethod find_sources(path: str) → List[Dict]
classmethod find_sources_with_params(path: str, **extra_params) → List[Dict]
classmethod get_file_extensions() → List[str]
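
In practice, ArrowImporter is rarely called directly; its detection and source discovery run when a dataset is imported by format name. A minimal sketch, assuming an Arrow dataset lives in a hypothetical ./arrow_dataset directory:

from datumaro import Dataset

# "arrow" routes the import through ArrowImporter's detection and
# source discovery; the directory path here is a placeholder.
dataset = Dataset.import_from("./arrow_dataset", format="arrow")
for item in dataset:
    print(item.id, item.subset)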
exception datumaro.plugins.data_formats.arrow.importer.DatasetImportError

Bases: DatumaroError

class datumaro.plugins.data_formats.arrow.importer.DatumaroArrow

Bases: object

SIGNATURE = 'signature:datumaro_arrow'
VERSION = '2.0'
MP_TIMEOUT = 300.0
ID_FIELD = 'id'
SUBSET_FIELD = 'subset'
IMAGE_FIELD = StructType(struct<has_bytes: bool, bytes: binary, path: string, size: fixed_size_list<item: uint16>[2]>)
POINT_CLOUD_FIELD = StructType(struct<has_bytes: bool, bytes: binary, path: string, extra_images: list<image: struct<has_bytes: bool, bytes: binary, path: string, size: fixed_size_list<item: uint16>[2]>>>)
MEDIA_FIELD = StructType(struct<type: uint8, image: struct<has_bytes: bool, bytes: binary, path: string, size: fixed_size_list<item: uint16>[2]>, point_cloud: struct<has_bytes: bool, bytes: binary, path: string, extra_images: list<image: struct<has_bytes: bool, bytes: binary, path: string, size: fixed_size_list<item: uint16>[2]>>>>)
SCHEMA =
id: string
subset: string
media: struct<type: uint8, image: struct<has_bytes: bool, bytes: binary, path: string, size: fixed_size_list<item: uint16>[2]>, point_cloud: struct<has_bytes: bool, bytes: binary, path: string, extra_images: list<image: struct<has_bytes: bool, bytes: binary, path: string, size: fixed_size_list<item: uint16>[2]>>>>
  child 0, type: uint8
  child 1, image: struct<has_bytes: bool, bytes: binary, path: string, size: fixed_size_list<item: uint16>[2]>
      child 0, has_bytes: bool
      child 1, bytes: binary
      child 2, path: string
      child 3, size: fixed_size_list<item: uint16>[2]
          child 0, item: uint16
  child 2, point_cloud: struct<has_bytes: bool, bytes: binary, path: string, extra_images: list<image: struct<has_bytes: bool, bytes: binary, path: string, size: fixed_size_list<item: uint16>[2]>>>
      child 0, has_bytes: bool
      child 1, bytes: binary
      child 2, path: string
      child 3, extra_images: list<image: struct<has_bytes: bool, bytes: binary, path: string, size: fixed_size_list<item: uint16>[2]>>
          child 0, image: struct<has_bytes: bool, bytes: binary, path: string, size: fixed_size_list<item: uint16>[2]>
              child 0, has_bytes: bool
              child 1, bytes: binary
              child 2, path: string
              child 3, size: fixed_size_list<item: uint16>[2]
                  child 0, item: uint16
annotations: binary
attributes: binary
classmethod check_signature(signature: str)
classmethod check_schema(schema: Schema)
classmethod check_version(version: str)
classmethod create_schema_with_metadata(extractor: IDataset)
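
The check_* helpers validate an Arrow file against the constants above and raise DatasetImportError (exposed in this module) when validation fails. A minimal sketch, assuming the signature and version strings are stored in the schema metadata under "signature" and "version" keys (the key names are an assumption for illustration):

import pyarrow as pa

from datumaro.plugins.data_formats.arrow.importer import DatumaroArrow

with pa.OSFile("dataset.arrow", "rb") as f:  # hypothetical file path
    schema = pa.ipc.open_stream(f).schema
    meta = schema.metadata or {}
    # Each check raises DatasetImportError when its requirement fails;
    # the metadata key names below are assumptions.
    DatumaroArrow.check_signature(meta.get(b"signature", b"").decode())
    DatumaroArrow.check_version(meta.get(b"version", b"").decode())
    DatumaroArrow.check_schema(schema)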
class datumaro.plugins.data_formats.arrow.importer.FormatDetectionConfidence(value)

Bases: IntEnum

Represents the level of confidence that a detector has in a dataset belonging to the detector’s format.

NONE = 1
EXTREME_LOW = 5

This level is currently assigned only to the ImageDir format, because an ImageDir source can be detected in virtually any image dataset directory.

LOW = 10

The dataset seems to belong to the format, but the format is too loosely defined to be able to distinguish it from other formats.


MEDIUM = 20

The dataset seems to belong to the format, and is likely not to belong to any other format.

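Because FormatDetectionConfidence derives from IntEnum, confidence levels compare like plain integers, so competing detections can be ranked directly:

from datumaro.plugins.data_formats.arrow.importer import FormatDetectionConfidence

# IntEnum members order and compare like their integer values.
assert FormatDetectionConfidence.MEDIUM > FormatDetectionConfidence.LOW
assert FormatDetectionConfidence.EXTREME_LOW < FormatDetectionConfidence.LOW
assert FormatDetectionConfidence.LOW == 10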

class datumaro.plugins.data_formats.arrow.importer.FormatDetectionContext(root_path: str)

Bases: object

An instance of this class is given to a dataset format detector. See the FormatDetector documentation. The class should not be instantiated directly.

A context encapsulates information about the dataset whose format is being detected. It also offers methods that place requirements on that dataset. Each such method raises a FormatRequirementsUnmet exception if the requirement is not met. If the requirement is met, the return value depends on the method.

property root_path: str

Returns the path to the root directory of the dataset. Detectors should avoid using this property in favor of specific requirement methods.

raise_unsupported() → NoReturn

Raises a FormatDetectionUnsupported exception to signal that the current format does not support detection.

fail(requirement_desc: str) → NoReturn

Places a requirement that is never met. requirement_desc must contain a human-readable description of the requirement.

require_file(pattern: str, *, exclude_fnames: str | Collection[str] = ()) → str

Places the requirement that the dataset contains at least one file whose relative path matches the given pattern. The pattern must be a glob-like pattern; ** can be used to indicate a sequence of zero or more subdirectories. If the pattern does not describe a relative path, or refers to files outside the dataset root, the requirement is considered unmet. If the requirement is met, the relative path to one of the files that match the pattern is returned. If there are multiple such files, it’s unspecified which one of them is returned.

exclude_fnames must be a collection of patterns or a single pattern. If at least one pattern is supplied, then the placed requirement is narrowed to only accept files with names that match none of these patterns.
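
For illustration, a sketch of a detect hook built on require_file; the *.arrow pattern matches this module's data files, while the excluded file name is hypothetical:

from datumaro.plugins.data_formats.arrow.importer import FormatDetectionContext

def detect(context: FormatDetectionContext) -> None:
    # Require at least one .arrow file anywhere under the dataset root,
    # skipping a hypothetical "broken.arrow"; the return value is the
    # root-relative path of one matching file.
    relpath = context.require_file("**/*.arrow", exclude_fnames="broken.arrow")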

require_files(pattern: str, *, exclude_fnames: str | Collection[str] = ()) → List[str]

Same as require_file, but returns all matching paths in alphabetical order.

require_files_iter(pattern: str, *, exclude_fnames: str | Collection[str] = ()) → Iterator[str]

Same as require_files, but returns a generator.

probe_text_file(path: str, requirement_desc: str, is_binary_file: bool = False) → Iterator[BufferedReader | TextIO]

Returns a context manager that can be used to place a requirement on the contents of the file referred to by path. To do so, you must enter and exit this context manager (typically, by using the with statement). On entering, the file is opened for reading in text mode (or in binary mode, if is_binary_file is True) and the resulting file object is returned. On exiting, the file object is closed.

The requirement that is placed by doing this is considered met if all of the following are true:

  • path is a relative path that refers to a file within the dataset root.

  • The file is opened successfully.

  • The context is exited without an exception.

If the context is exited with an exception that was produced by another requirement being unmet, that exception is reraised and the new requirement is abandoned.

requirement_desc must be a human-readable statement describing the requirement.
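
A sketch of how a detector might use this context manager to require that a metadata file parses as JSON; the file name is hypothetical:

import json

from datumaro.plugins.data_formats.arrow.importer import FormatDetectionContext

def detect(context: FormatDetectionContext) -> None:
    # The requirement fails if meta.json is missing or unreadable, or if
    # json.load raises while the with block is open.
    with context.probe_text_file("meta.json", "must be a valid JSON file") as f:
        json.load(f)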

require_any() → Iterator[None]

Returns a context manager that can be used to place a requirement that is considered met if at least one of several alternative sets of requirements is met. To do so, use a with statement, with the alternative sets of requirements represented as nested with statements using the context manager returned by alternative:

with context.require_any():
    with context.alternative():
        ...  # place requirements from alternative set 1 here
    with context.alternative():
        ...  # place requirements from alternative set 2 here
    ...

The contents of all with context.alternative() blocks will be executed, even if an alternative whose requirements are met is found early.

Requirements must not be placed directly within a with context.require_any() block.

alternative() → Iterator[None]

Returns a context manager that can be used in combination with require_any to define alternative requirements. See the documentation for require_any for more details.

Must only be used directly within a with context.require_any() block.

class datumaro.plugins.data_formats.arrow.importer.Importer

Bases: CliPlugin

DETECT_CONFIDENCE = 10
classmethod detect(context: FormatDetectionContext) → FormatDetectionConfidence
classmethod get_file_extensions() → List[str]
classmethod find_sources(path: str) → List[Dict]
classmethod find_sources_with_params(path: str, **extra_params) → List[Dict]
property can_stream: bool

Flag indicating whether the importer can stream dataset items.

get_extractor_merger() → Type[ExtractorMerger] | None

Extractor merger dedicated to the data format.

The Datumaro import process spawns a DatasetBase for each detected source. Multiple sources can be found under a given directory path, and in many data formats each detected source corresponds to one subset of the dataset.

Parameters:

stream – behavior may branch depending on the stream flag

Returns:

If None, Dataset.from_extractors() is used to merge the extractors; otherwise, the returned type is used to merge them.
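
For illustration only, a sketch of a custom importer that opts into a dedicated merger; the subclass is hypothetical and the ExtractorMerger import path is an assumption:

from typing import Optional, Type

from datumaro.components.merge.extractor_merger import ExtractorMerger  # assumed path
from datumaro.plugins.data_formats.arrow.importer import Importer

class MyFormatImporter(Importer):  # hypothetical importer
    def get_extractor_merger(self) -> Optional[Type[ExtractorMerger]]:
        # Returning a type (rather than None) tells the import process to
        # merge per-subset extractors with ExtractorMerger instead of
        # Dataset.from_extractors().
        return ExtractorMerger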