CSVs and Tabular Data#
LinkML supports both complex, interlinked, normalized relational data and the flat/denormalized data typically found in spreadsheets and in the CSVs used by data scientists.
Our philosophy is “always have a schema” even when working with simple tabular data.
The linkml-convert script can be used to convert between CSVs/TSVs and other formats such as JSON and RDF. The tooling for validating data operates in the same way.
Conventions for working with tabular data#
LinkML allows you to create schemas with complex nested data - these don’t necessarily have a simple unified mapping to tables/TSVs. However, you can still work with tabular representations if your schema has a certain “shape” and you provide sufficient hints.
See part 2 of the tutorial for an introduction to container objects.
To serialize your data objects as TSVs, your schema is assumed to have a class that serves the role of container. It can be called whatever you like. You can also annotate this class with tree_root set to true. The class will have a multivalued slot pointing at the list of things you want to serialize in the TSV. This slot is known as the index slot.
For example, in the PersonSchema schema, the Container class has two possible index slots:
persons: points at a list of Person objects
organizations: points at a list of Organization objects
You can only serialize one of these in any one TSV. (Using more advanced techniques you could create a union class for Person and Organization and serialize that, but this is outside the scope of this tutorial.)
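To make the container and index slot idea concrete, here is a minimal sketch in plain Python using only the standard library. It is not how linkml-convert is implemented; the `container` dictionary, its field values, and the helper `rows_to_tsv` are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical container instance: the keys "persons" and "organizations"
# mirror the two index slots named in the text; the field values are made up.
container = {
    "persons": [{"id": "P:1", "name": "Ann"}],
    "organizations": [
        {"id": "ORG:1", "name": "Acme"},
        {"id": "ORG:2", "name": "Globex"},
    ],
}


def rows_to_tsv(container, index_slot):
    """Write the list held by one index slot as TSV and return the text."""
    rows = container[index_slot]
    # Use the union of keys, in first-seen order, as the column header.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Only one index slot is written to any one TSV.
organizations_tsv = rows_to_tsv(container, "organizations")
print(organizations_tsv)
```

Each TSV corresponds to exactly one index slot, which is why persons and organizations cannot share a file without a union class.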
The linkml command line tools for conversion and validation will do their best to guess the index slot and the container, but if there is no unambiguous choice, you have to provide these using the following arguments:
-C, --target-class TEXT name of class in datamodel that the root
-S, --index-slot TEXT top level slot. Required for CSV
For example, to serialize the organizations in the provided YAML data file in this repository, you can run:
linkml-convert -t tsv -s examples/PersonSchema/personinfo.yaml -C Container -S organizations examples/PersonSchema/data/example_personinfo_data.yaml
Note that serializing the Person objects currently won’t work, as the Person class is too nested to be serialized as TSV.
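The guessing behaviour described above can be pictured with a small sketch. This is an assumption about the general idea (pick the only tree_root class and its only multivalued slot, otherwise ask the user), not the actual logic in the linkml tools. The `schema` dictionary is a hypothetical, simplified stand-in for a real schema, and `guess_container_and_index_slot` is an illustrative helper.

```python
# Hypothetical, simplified schema: class name -> attributes.
schema = {
    "Container": {
        "tree_root": True,
        "slots": {
            "persons": {"range": "Person", "multivalued": True},
            "organizations": {"range": "Organization", "multivalued": True},
        },
    },
    "Person": {"slots": {"name": {"range": "string"}}},
    "Organization": {"slots": {"name": {"range": "string"}}},
}


def guess_container_and_index_slot(schema, target_class=None, index_slot=None):
    """Return (container, index slot), raising if the choice is ambiguous."""
    if target_class is None:
        roots = [name for name, cls in schema.items() if cls.get("tree_root")]
        if len(roots) != 1:
            raise ValueError("cannot infer the container; pass a target class")
        target_class = roots[0]
    if index_slot is None:
        candidates = [
            name
            for name, slot in schema[target_class]["slots"].items()
            if slot.get("multivalued")
        ]
        if len(candidates) != 1:
            raise ValueError(
                f"cannot infer the index slot from {candidates}; pass one explicitly"
            )
        index_slot = candidates[0]
    return target_class, index_slot


# Container has two multivalued slots, so the index slot must be given,
# which is the role -S plays on the command line.
print(guess_container_and_index_slot(schema, index_slot="organizations"))
```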
On the fly denormalization#
The json-flattener library is used to do on-the-fly denormalizations. For example:
multivalued slots are serialized using a
nested slots are flattened to paths, e.g. if Container has a slot persons, and Person has a slot name, then the path will be
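A minimal sketch of the two denormalization rules listed above. The text here does not show the separators, so the `|` list separator and `_` path separator below are assumptions chosen for illustration. This is plain Python, not the json-flattener API, and the `organization` record is made up.

```python
def flatten(obj, path_sep="_", list_sep="|", prefix=""):
    """Flatten a nested dict into a single-level dict suitable for one table row.

    path_sep and list_sep are assumed values, not taken from the source text.
    Only lists of scalars are handled; lists of objects need extra rules.
    """
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{path_sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Nested slot: recurse and extend the path.
            flat.update(flatten(value, path_sep, list_sep, path))
        elif isinstance(value, list):
            # Multivalued slot: join the values into one cell.
            flat[path] = list_sep.join(str(item) for item in value)
        else:
            flat[path] = value
    return flat


# Hypothetical record with one multivalued slot and one nested slot.
organization = {
    "id": "ORG:1",
    "aliases": ["Acme Corp", "Acme Inc"],
    "address": {"city": "Springfield"},
}
flat_row = flatten(organization)
print(flat_row)
```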
Inference of schemas from tabular data#
See the generalize-tsv command in the schema-automator.
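To give a feel for what inferring a schema from a table involves, here is a rough standard-library sketch that guesses a range for each column. It is not the algorithm used by generalize-tsv; the helpers `infer_range` and `infer_slots`, the range names, and the sample data are all hypothetical.

```python
import csv
import io


def infer_range(values):
    """Pick the narrowest of integer, float, or string that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for range_name, caster in (("integer", int), ("float", float)):
        try:
            for v in non_empty:
                caster(v)
            return range_name
        except ValueError:
            continue
    return "string"


def infer_slots(tsv_text):
    """Map each column header to a guessed slot definition."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    return {
        column: {"range": infer_range([row[column] or "" for row in rows])}
        for column in (reader.fieldnames or [])
    }


# Made-up sample table with string, integer, and float columns.
sample_tsv = "id\tname\tage\theight\nP:1\tAnn\t33\t1.7\nP:2\tBob\t41\t1.82\n"
inferred = infer_slots(sample_tsv)
print(inferred)
```

A real inference tool also has to decide on identifiers, enumerations, and multivalued columns, which this sketch leaves out.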