# CSVs and Tabular Data LinkML can support both complex interlinked normalized relational data as well as flat/denormalized data as typically found in spreadsheets and in CSVs used by data scientists. Our philosophy is "always have a schema" even when working with simple tabular data. ## Conversion the `linkml-convert` script can be used to convert between CSVs/TSVs and other formats like JSON/RDF. The same tooling for [validating-data](validating-data) operate in the same way. ### Conventions for working with tabular data LinkML allows you to create schemas with complex nested data - these don't necessarily have a simple unified mapping to tables/TSVs. However, you can still work with tabular representations if your schema has a certain "shape" and you provide sufficient hints. ### Container objects See [part 2 of the tutorial](../intro/tutorial02) for an introduction to container objects. To serialize your data objects as TSVs, it's assumed that you have a class in your schema that serves the role of *container*. It can be called whatever you like. You can also annotate this with [tree_root](https://w3id.org/linkml/tree_root) set to true. This class will have a multivalued slot pointing at the list of things you want to serialize in the TSV. This slot is known as the *index slot* For example, in the [PersonSchema](https://github.com/linkml/linkml/tree/main/examples/PersonSchema) schema, the Container class has two possible index slots: * persons: points at a list of Person objects * organizations: points at a list of Organization objects You can only serialize one of these in any one TSV (using more advanced techniques you could create a union class for Person and Organization and serialize this, but this is outside the scope of this tutorial) The linkml command line tools for conversion and validation will do their best to guess the index slot and the container, but if there is no unambiguous choice, then have to provide these using the following arguments: ```text -C, --target-class TEXT name of class in datamodel that the root node instantiates -S, --index-slot TEXT top level slot. Required for CSV dumping/loading ``` For example, to serialize the organizations in the provided YAML data file in this repository, you can run: ```bash linkml-convert -t tsv -s examples/PersonSchema/personinfo.yaml -C Container -S organizations examples/PersonSchema/data/example_personinfo_data.yaml ``` Note that currently serializing the person objects won't work, as the Person class is too nested to be serialized as TSV ### On the fly denormalization The [json-flattener/](https://github.com/cmungall/json-flattener/) library is used to do on-the-fly denormalizations. For example: * multivalued slots are serialized using a `|` separator * nested slots are flattened to paths, e.g if Container has a slot persons, and Person has a slot name, then the path with be `persons_name` ### Customizing multivalued field formatting By default, multivalued fields are serialized with square brackets and pipe delimiters: ``` [value1|value2|value3] ``` However, when working with spreadsheets, users often prefer typing values without brackets: ``` value1|value2|value3 ``` You can customize the formatting using schema-level annotations: ```yaml id: https://example.org/myschema name: myschema annotations: list_wrapper: none # removes brackets from all multivalued fields list_delimiter: "|" # delimiter between values (default) slots: tags: range: string multivalued: true categories: range: string multivalued: true ``` Note: These annotations apply to ALL multivalued fields in the CSV/TSV output. This is because the underlying json-flattener library uses a single global configuration for list formatting. #### Available annotations | Annotation | Values | Default | Description | |------------|--------|---------|-------------| | `list_wrapper` | `square`, `curly`, `paren`, `none` | `square` | `square` uses `[a\|b]`, `curly` uses `{a\|b}`, `paren` uses `(a\|b)`, `none` has no wrapper `a\|b` | | `list_delimiter` | any string | `\|` | Character(s) used to separate list items | | `list_strip_whitespace` | `true`, `false` | `true` | Strip whitespace around delimiters when loading and dumping (e.g., `a \| b` → `['a', 'b']`) | | `refuse_delimiter_in_data` | `true`, `false` | `false` | Raise an error before serializing if any multivalued field value contains the delimiter character. Prevents silent data corruption during round-tripping. | #### CLI options You can override schema annotations using CLI options on `linkml-convert`: ```bash linkml-convert -s schema.yaml -C Container -S items -t tsv \ --list-wrapper none \ --list-delimiter "|" \ --list-strip-whitespace \ input.yaml ``` | CLI Option | Description | |------------|-------------| | `--list-wrapper` | `square`, `curly`, `paren`, or `none` - overrides schema annotation | | `--list-delimiter` | Delimiter string - overrides schema annotation | | `--list-strip-whitespace` / `--no-list-strip-whitespace` | Strip whitespace from list values (default: strip) | | `--refuse-delimiter-in-data` / `--no-refuse-delimiter-in-data` | Raise an error if any multivalued value contains the delimiter (default: allow) | All options apply to both input (loading) and output (dumping): - **On input**: `a | b | c` is parsed as `['a', 'b', 'c']` (stripped) or `['a ', ' b ', ' c']` (preserved) - **On output**: `['dog ', 'cat']` is written as `dog|cat` (stripped) or `dog |cat` (preserved) #### Examples **Default behavior (square brackets):** ```yaml id: https://example.org/default name: default_example # No annotations - uses default square wrapper slots: aliases: range: string multivalued: true ``` TSV output: `[Alice|Bob|Charlie]` **No wrapper with pipe delimiter:** ```yaml id: https://example.org/unwrapped name: unwrapped_example annotations: list_wrapper: none list_delimiter: "|" slots: aliases: range: string multivalued: true ``` TSV output: `Alice|Bob|Charlie` **No wrapper with semicolon delimiter:** ```yaml id: https://example.org/semicolon name: semicolon_example annotations: list_wrapper: none list_delimiter: ";" slots: categories: range: string multivalued: true ``` TSV output: `category1;category2;category3` #### Working example To try this with existing test files in the repository, first compare the default output: ```bash linkml-convert -s tests/linkml_runtime/test_loaders_dumpers/models/books_normalized.yaml \ -C Shop -S all_book_series -t tsv \ tests/linkml_runtime/test_loaders_dumpers/input/books_normalized_01.yaml ``` This produces default (square-bracketed) output like `[scifi|fantasy]`. To get unwrapped output, copy the schema and add the annotations block after `imports:`: ```yaml id: https://w3id.org/example name: example description: example annotations: list_wrapper: none list_delimiter: "|" imports: - linkml:types # ... rest of schema ``` Then run with your modified schema: ```bash linkml-convert -s my_modified_schema.yaml \ -C Shop -S all_book_series -t tsv \ tests/linkml_runtime/test_loaders_dumpers/input/books_normalized_01.yaml ``` This produces unwrapped output like `scifi|fantasy` (no brackets). #### Loading unwrapped TSV back to YAML The annotations also control how TSV data is parsed back into structured YAML. Given a TSV file with unwrapped pipe-delimited values: | id | name | genres | |----|------|--------| | S001 | Lord of the Rings | fantasy | | S002 | The Culture Series | scifi | | S003 | Book of the New Sun | scifi\|fantasy | Using a schema with unwrapped annotations: ```bash linkml-convert -s my_modified_schema.yaml \ -C Shop -S all_book_series -t yaml \ my_data.tsv ``` The `scifi|fantasy` value is correctly parsed as a list: ```yaml all_book_series: - id: S001 name: Lord of the Rings genres: - fantasy - id: S002 name: The Culture Series genres: - scifi - id: S003 name: Book of the New Sun genres: - scifi - fantasy ``` ## Boolean Values Boolean values in CSV/TSV files can be represented in various ways depending on the source (spreadsheets, databases, etc.). LinkML follows [Postel's law](https://en.wikipedia.org/wiki/Robustness_principle): **be liberal in what you accept, conservative in what you produce**. - **Loading** (liberal): accepts multiple truthy/falsy representations, extensible per-schema - **Dumping** (conservative): produces one canonical format, defaulting to `true`/`false` ### Loading Booleans When loading CSV/TSV data, LinkML coerces string values to booleans for slots with `range: boolean`. The defaults follow pandas/R conventions (case-insensitive): | Truthy Values | Falsy Values | |---------------|--------------| | `T`, `TRUE` (case-insensitive) | `F`, `FALSE` (case-insensitive) | This coercion is **schema-aware** — only values in slots declared as `range: boolean` are converted. String values like "yes" in a text field remain as strings. #### Empty strings Empty strings (`""`) in CSV/TSV are coerced to null regardless of the slot type. #### Extending the defaults Many datasets (e.g. NCBI BioSamples, MIxS) use `yes`/`no`, `1`/`0`, or `on`/`off` for boolean values. You can add these via schema annotations or CLI options — they are **added to** the defaults, not replacing them. **Schema annotations:** ```yaml id: https://example.org/myschema name: myschema annotations: boolean_truthy: "yes,on,1" # comma-separated, case-insensitive boolean_falsy: "no,off,0" ``` **CLI options:** ```bash linkml-convert -s schema.yaml -C Container -S items -t json \ --boolean-truthy "yes,on,1" \ --boolean-falsy "no,off,0" \ input.tsv ``` CLI options extend schema annotations if both are set. ### Dumping Booleans By default, boolean values are written as lowercase `true`/`false`. This can be changed for interoperability with systems that expect a different format. #### Schema Annotation Add the `boolean_output` annotation at the schema level: ```yaml id: https://example.org/myschema name: myschema annotations: boolean_output: "yes" # Write booleans as yes/no ``` Supported values: `true`, `True`, `TRUE`, `yes`, `Yes`, `YES`, `on`, `On`, `ON`, `1` #### CLI Option Override the schema annotation with `--boolean-output`: ```bash linkml-convert -s schema.yaml -C Container -S items -t tsv \ --boolean-output yes \ input.yaml ``` ### Example Given this data: ```yaml items: - id: "1" is_active: true - id: "2" is_active: false ``` Default output (`true`/`false`): ``` id is_active 1 true 2 false ``` With `--boolean-output yes`: ``` id is_active 1 yes 2 no ``` With `--boolean-output 1`: ``` id is_active 1 1 2 0 ``` ## Inference of schemas from tabular data Use `generalize-tsv` command in the [schema-automator](https://github.com/linkml/schema-automator)