# Tabular ## Format specification Tabular dataset generally refers to table data with multiple rows and columns.
`.csv` files are the most common format, and OpenML uses `.arff` as the official format. Datumaro only supports tabular data in `.csv` format where the first row is a header with unique column names. It's because other formats can be converted to `.csv` easily as shown below. ```python # convert '.arff' to '.csv' from scipy.io.arff import loadarff import pandas as pd data = loadarff("dataset.arff") df = pd.DataFrame(data[0]) categorical = [col for col in df.columns if df[col].dtype=="O"] df[categorical] = df[categorical_columns].apply(lambda x: x.str.decode('utf8')) df.to_csv("arff.csv", index=False) # convert '.parquet', '.feather', '.hdf5', '.pickle' to '.csv'. pd.read_parquet("dataset.parquet").to_csv('parquet.csv', index=False) pd.read_feather("dataset.feather").to_csv('feather.csv', index=False) pd.read_hdf("dataset.hdf5").to_csv('hdf5.csv', index=False) pd.read_pickle("dataset.pickle").to_csv('pickle.csv', index=False) # convert '.jay' to '.csv' import datatable as dt data = dt.fread("dataset.jay") data.to_csv("jay.csv") ``` A tabular dataset can be one of the following: - a single file with a `.csv` extension - a directory contains `.csv` files (supports only 1 depth). ``` dataset/ ├── aaa.csv ├── ... └── zzz.csv ``` Supported annotation types: - `Tabular` ## Import tabular dataset A Datumaro project with a tabular source can be created in the following way: ```bash datum project create datum project import --format tabular ``` It is also possible to import the dataset using Python API: ```python import datumaro as dm dataset = dm.Dataset.import_from('', 'tabular') ``` Datumaro stores the imported table as media (a list of `TableRow`) and annotates the target columns. The last column is regarded as the target column, which can be specified by the user when importing the dataset as shown below. ```bash datum project create datum project import --format tabular -- --target input:length(m),output:breed_category,pet_category datum project import --format tabular ``` ```python import datumaro as dm dataset = dm.Dataset.import_from('', 'tabular', target={"input":"length(m)", "output":["breed_category", "pet_category"]}) dataset = dm.Dataset.import_from('', 'tabular') ``` As shown, the target can be a single column name or a comma-separated list of columns. Note that each tabular file is considered as a subset. ## Export tabular dataset Datumaro supports exporting a tabular dataset using CLI or python API. Each subset will be saved to a separate `.csv` file. ```bash datum project create datum project import -f tabular datum project export -f tabular -o ``` ```python import datumaro as dm dataset = dm.Dataset.import_from('', 'tabular') dataset.export('', 'tabular') ``` Note that converting a tabular dataset into other formats and vice versa is not supproted. ## Examples Examples of using this format from the code can be found in [the format tests](https://github.com/openvinotoolkit/datumaro/blob/develop/tests/unit/test_tabular_format.py). The datasets here are randomly sampled and you can find the full dataset below. - Electricity Datset: | date | day | period | nswprice | nswdemand | vicprice | vicdemand | transfer | class | |---------:|------:|---------:|-----------:|------------:|-----------:|------------:|-----------:|:--------| | 0.425556 | 5 | 0.340426 | 0.076108 | 0.392889 | 0.003467 | 0.422915 | 0.414912 | UP | | 0.425512 | 4 | 0.617021 | 0.060376 | 0.483041 | 0.003467 | 0.422915 | 0.414912 | DOWN | | 0.013982 | 4 | 0.042553 | 0.061967 | 0.521125 | 0.003467 | 0.422915 | 0.414912 | DOWN | | 0.907349 | 3 | 0.06383 | 0.080581 | 0.331003 | 0.00538 | 0.47566 | 0.441228 | DOWN | | 0.889341 | 0 | 0.361702 | 0.027141 | 0.379649 | 0.001624 | 0.248317 | 0.69386 | DOWN | - Buddy Dataset: | pet_id | issue_date | listing_date | condition | color_type | length(m) | height(cm) | X1 | X2 | breed_category | pet_category | |:-----------|:--------------------|:--------------------|------------:|:-------------|------------:|-------------:|-----:|-----:|-----------------:|---------------:| | ANSL_59957 | 2015-10-21 00:00:00 | 2016-11-12 09:00:00 | nan | Lynx Point | 0.49 | 24.53 | 16 | 9 | 2 | 1 | | ANSL_57687 | 2016-08-25 00:00:00 | 2016-09-20 08:11:00 | nan | Red | 0.87 | 43.17 | 15 | 4 | 2 | 4 | | ANSL_62277 | 2014-12-29 00:00:00 | 2017-01-19 14:47:00 | nan | Brown | 0.81 | 25.72 | 15 | 4 | 2 | 4 | | ANSL_72624 | 2016-10-30 00:00:00 | 2017-02-18 14:57:00 | 1 | Brown Tabby | 0.36 | 10.18 | 0 | 1 | 0 | 1 | | ANSL_51838 | 2014-12-29 00:00:00 | 2017-01-19 14:46:00 | nan | Brown | 0.97 | 48.7 | 15 | 4 | 2 | 4 |