LinkML is designed to allow for a variety of strategies for data validation. The overall philosophy is to provide maximum expressivity in the language to allow model designers to state all consraints in a declarative fashion, and then to leverage existing frameworks and to allow the user to balance concerns such as expressivity vs efficiency.
Currently there are 4 supported strategies:
validation via Python object instantiation
validation through JSON-Schema
validation of triples in a triplestore or RDF file via generation of SPARQL constraints
validation of RDF via generation of ShEx
However, others will be supported in future
Validation of JSON documents
linkml-convert command will automatically perform data validation.
Currently it performs two level validation:
it will convert data to in-memory Python objects, using dataclass validation
it will then convert the LinkML schema to JSON-Schema and employ JSON-Schema validation
Note that you can easily generate JSON-Schema and use your validator of choice, see JSON Schema Generation
Validation of RDF triplestores using generated SPARQL
The LinkML framework can also be used to validate RDF, either in a file, or a triplestore. There are two steps:
generation of SPARQL constraint-style queries (see sparqlgen )
execution of those queries on an in-memory graph or external triplestore
The user can choose to run only the first step, to obtain a bank of SPARQL queries that can be applied selectively
linkml-sparql-validate --help Usage: linkml-sparql-validate [OPTIONS] Validates sparql Example: linkml-sparql-validate -U http://sparql.hegroup.org/sparql -s tests/test_validation/input/omo.yaml Options: -G, --named-graph TEXT Constrain query to a named graph -i, --input TEXT Input file to validate -U, --endpoint-url TEXT URL of sparql endpoint -L, --limit TEXT Max results per query -o, --output TEXT Path to report file -f, --input-format [yaml|json|rdf|csv|tsv] Input format. Inferred from input suffix if not specified -t, --output-format [yaml|json|rdf|csv|tsv] Output format. Inferred from output suffix if not specified -s, --schema TEXT Path to schema specified as LinkML yaml --help Show this message and exit.
Future versions of LinkML will employ a powerful constraint and inference language.
One of the use cases here is being able to specify that the
length field is equal to
end - start. This declarative knowledge can then be used to either (1) infer the value of
length if unspecified (2) infer either
end if only one of these is specified alongside
length (3) check consistency if all three are specified.
These constraints can then be executed over large databases via a variety of strategies including:
generation of datalog programs for efficient engines such as souffle
generation of SQL queries to be used with relational databases
We also plan to support generation of SHACL from LinkML