# Filter

## Filter datasets

This command extracts a sub-dataset from a dataset. The new dataset includes only the items that satisfy a given condition. The XML [XPath](https://devhints.io/xpath) is used as the query format.

The command can be applied to a dataset or a project build target, a stage or the combined `project` target, in which case all the project targets will be affected. A build tree stage will be recorded if `--stage` is enabled, and the resulting dataset(-s) will be saved if `--apply` is enabled.

By default, datasets are updated in-place. The `-o/--output-dir` option can be used to specify another output directory. When updating in-place, use the `--overwrite` parameter (in-place updates fail by default to prevent data loss), unless a project target is modified.

The current project (`-p/--project`) is also used as a context for plugins, so it can be useful for dataset paths that use custom formats. When not specified, the current project's working tree is used.

There are several filtering modes available (the `-m/--mode` parameter). Supported modes:
- `i`, `items`
- `a`, `annotations`
- `i+a`, `a+i`, `items+annotations`, `annotations+items`

When filtering annotations, use the `items+annotations` mode to indicate that dataset items left without annotations should be removed; otherwise, they will be kept in the resulting dataset. To select an annotation, write an XPath that returns `annotation` elements (see examples).

Item representations can be printed with the `--dry-run` parameter:

``` xml
<item>
  <id>290768</id>
  <subset>minival2014</subset>
  <image>
    <width>612</width>
    <height>612</height>
    <depth>3</depth>
  </image>
  <annotation>
    <id>80154</id>
    <type>bbox</type>
    <label_id>39</label_id>
    <x>264.59</x>
    <y>150.25</y>
    <w>11.19</w>
    <h>42.31</h>
    <area>473.87</area>
  </annotation>
  <annotation>
    <id>669839</id>
    <type>bbox</type>
    <label_id>41</label_id>
    <x>163.58</x>
    <y>191.75</y>
    <w>76.98</w>
    <h>73.63</h>
    <area>5668.77</area>
  </annotation>
  ...
</item>
```
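The difference between the `annotations` and `items+annotations` modes can be illustrated with a small standalone sketch. This is not Datumaro's implementation: it uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath, and the `<dataset>` wrapper element, the sample items and the `apply_annotation_filter` helper are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; element names follow the XPath examples on this page.
SAMPLE = """
<dataset>
  <item><id>1</id><subset>train</subset>
    <annotation><label>cat</label><occluded>False</occluded></annotation>
    <annotation><label>dog</label><occluded>True</occluded></annotation>
  </item>
  <item><id>2</id><subset>train</subset>
    <annotation><label>cat</label><occluded>True</occluded></annotation>
  </item>
</dataset>
"""


def apply_annotation_filter(xml_text, ann_query, mode):
    """Return {item id: remaining annotation count} after filtering.

    mode "a"   keeps every item, even if it ends up with no annotations.
    mode "i+a" also removes items left without annotations.
    """
    root = ET.fromstring(xml_text)
    for item in list(root.findall("item")):
        matched = item.findall(ann_query)
        # Drop every annotation the query did not select.
        for ann in item.findall("annotation"):
            if ann not in matched:
                item.remove(ann)
        # Only the combined mode removes items that became empty.
        if mode == "i+a" and not matched:
            root.remove(item)
    return {
        item.findtext("id"): len(item.findall("annotation"))
        for item in root.findall("item")
    }


query = "annotation[occluded='False']"
print(apply_annotation_filter(SAMPLE, query, "a"))    # {'1': 1, '2': 0}
print(apply_annotation_filter(SAMPLE, query, "i+a"))  # {'1': 1}
```

In the `a` case item `2` survives with zero annotations, while in the `i+a` case it is dropped, which mirrors the behavior described above.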
Usage:

```console
datum filter [-h] [-e FILTER] [-m MODE] [--dry-run] [--stage STAGE]
  [--apply APPLY] [-o DST_DIR] [--overwrite] [-p PROJECT_DIR] [target]
```

Parameters:
- `<target>` (string) - Target [dataset revpath](../../user-manual/how_to_use_datumaro.md#dataset-path-concepts). By default, filters all targets of the current project.
- `-e, --filter` (string) - XML XPath filter expression for dataset items
- `-m, --mode` (string) - The filtering mode. Default is the `i` mode.
- `--dry-run` - Print XML representations of the filtered dataset and exit.
- `--stage` (bool) - Include this action as a project build step. If true, this operation will be saved in the project build tree, so that the resulting dataset can be reproduced later. Applicable only to main project targets (i.e. data sources and the `project` target, but not intermediate stages). Enabled by default.
- `--apply` (bool) - Run this command immediately. If disabled, only the build tree stage will be written. Enabled by default.
- `-o, --output-dir` (string) - Output directory. Can be omitted for main project targets (i.e. data sources and the `project` target, but not intermediate stages) and dataset targets. If not specified, the results will be saved in-place.
- `--overwrite` - Allows overwriting existing files in the output directory, when it is specified and is not empty.
- `-p, --project` (string) - Directory of the project to operate on (default: current directory).
- `-h, --help` - Print the help message and exit.
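When the command is driven from a script, it can help to assemble the argument list programmatically and preview the result with `--dry-run` first. The sketch below uses only the options listed above; `build_filter_command` is a hypothetical helper, not part of Datumaro, and the block only builds and prints the command line without running it.

```python
import shlex


def build_filter_command(expression, mode="i", target=None, dry_run=False):
    """Hypothetical helper: assemble a `datum filter` argument list."""
    cmd = ["datum", "filter", "-e", expression, "-m", mode]
    if dry_run:
        # Only print the XML representations, do not modify anything.
        cmd.append("--dry-run")
    if target is not None:
        # Positional target, e.g. a source name in the current project.
        cmd.append(target)
    return cmd


preview = build_filter_command(
    '/item/annotation[occluded="False"]', mode="i+a", target="s1", dry_run=True
)
# shlex.join quotes the XPath so the line can be pasted into a shell.
print(shlex.join(preview))
```

The resulting list can be passed to `subprocess.run(preview, check=True)` in an environment where `datum` is installed; passing a list avoids shell quoting problems with the brackets and quotes in XPath expressions.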
Examples:

- Extract a dataset with images with `width` < `height`
  ```console
  datum filter -e '/item[image/width < image/height]'
  ```

- Extract a dataset with images of the `train` subset
  ```console
  datum filter -e '/item[subset="train"]'
  ```

- Extract a dataset containing only large annotations of the `cat` class and any annotations whose label is not `person`
  ```console
  datum filter --mode annotations \
    -e '/item/annotation[(label="cat" and area > 99.5) or label!="person"]'
  ```

- Extract a dataset with non-occluded annotations, remove empty images. Use data only from the "s1" source of the project
  ```console
  datum project create
  datum project import --name s1 --format voc
  datum project import --name s2 --format voc
  datum filter s1 \
    -m i+a -e '/item/annotation[occluded="False"]'
  ```

- Extract a dataset composed solely of items containing annotations.
  ```console
  datum filter -e '/item[annotation]'
  ```
  The `item[annotation]` predicate checks whether the `item` node has a child named `annotation`.
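Simple predicates such as `item[annotation]` or `item[subset="train"]` can be tried out on sample XML before running the command. The following is a minimal sketch under stated assumptions: it uses Python's standard `xml.etree.ElementTree`, which supports child-existence and text-equality predicates but not comparisons such as `<` or the `and`/`or` operators, so it is not equivalent to the command's own XPath evaluation. The `<dataset>` wrapper, the sample items and the `matching_ids` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical items wrapped in a root element so relative paths can be used.
ITEMS = ET.fromstring("""
<dataset>
  <item><id>a</id><subset>train</subset>
    <annotation><label>cat</label></annotation>
  </item>
  <item><id>b</id><subset>train</subset></item>
  <item><id>c</id><subset>val</subset>
    <annotation><label>person</label></annotation>
  </item>
</dataset>
""")


def matching_ids(predicate):
    """Return ids of items selected by ./item[predicate]."""
    return [item.findtext("id") for item in ITEMS.findall(f"./item[{predicate}]")]


# Items that have at least one <annotation> child.
print(matching_ids("annotation"))      # ['a', 'c']
# Items whose <subset> text equals "train".
print(matching_ids("subset='train'"))  # ['a', 'b']
```

Item `b` has no `annotation` child, so the first query skips it, which is the same check that `/item[annotation]` expresses.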