Data Validation#
LinkML is designed to allow for a variety of strategies for data validation. The overall philosophy is to provide maximum expressivity in the language, so that model designers can state all constraints declaratively, and then to leverage existing validation frameworks, allowing users to balance concerns such as expressivity versus efficiency.
Currently there are 5 supported strategies:

- validation via Python object instantiation
- validation through JSON-Schema
- validation of triples in a triplestore or RDF file via generation of SPARQL constraints
- validation of RDF via generation of ShEx or SHACL
- validation via SQL loading and queries

Other strategies will be supported in the future, in particular scalable validation of massive databases.
Validation of JSON documents#
The linkml-convert command will automatically perform data validation.

Currently it performs two levels of validation:

- it will convert data to in-memory Python objects, using dataclass validation
- it will then convert the LinkML schema to JSON-Schema and employ JSON-Schema validation

Note that you can easily generate JSON-Schema and use the validator of your choice; see JSON Schema Generation.
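For example, once you have generated a JSON Schema from your LinkML schema (e.g. with gen-json-schema), you can validate documents with the standard Python jsonschema package. This is a minimal sketch; the schema and data file names are illustrative placeholders.

```python
import json
import jsonschema

# Load a JSON Schema previously generated from a LinkML schema,
# e.g. via: gen-json-schema personinfo.yaml > personinfo.schema.json
# (file names here are illustrative placeholders)
with open("personinfo.schema.json") as f:
    schema = json.load(f)

with open("example_person.json") as f:
    data = json.load(f)

# Raises jsonschema.ValidationError if the document does not conform
jsonschema.validate(instance=data, schema=schema)
print("Document is valid against the generated JSON Schema")
```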
Validation of RDF triplestores using generated SPARQL#
The LinkML framework can also be used to validate RDF, either in a file or in a triplestore. There are two steps:

- generation of SPARQL constraint-style queries (see sparqlgen)
- execution of those queries on an in-memory graph or external triplestore

The user can choose to run only the first step, to obtain a bank of SPARQL queries that can be applied selectively.
```
linkml-sparql-validate --help
Usage: linkml-sparql-validate [OPTIONS]

  Validates sparql

  Example:

      linkml-sparql-validate -U http://sparql.hegroup.org/sparql -s
      tests/test_validation/input/omo.yaml

Options:
  -G, --named-graph TEXT   Constrain query to a named graph
  -i, --input TEXT         Input file to validate
  -U, --endpoint-url TEXT  URL of sparql endpoint
  -L, --limit TEXT         Max results per query
  -o, --output TEXT        Path to report file
  -f, --input-format [yaml|json|rdf|csv|tsv]
                           Input format. Inferred from input suffix if not
                           specified
  -t, --output-format [yaml|json|rdf|csv|tsv]
                           Output format. Inferred from output suffix if not
                           specified
  -s, --schema TEXT        Path to schema specified as LinkML yaml
  --help                   Show this message and exit.
```
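As a sketch of the second step, generated constraint queries can be executed against an in-memory graph using rdflib. The query below is a simplified, illustrative constraint (reporting instances missing a required slot); it is not the literal output of sparqlgen, and the file, class, and property names are placeholders.

```python
from rdflib import Graph

# Load the RDF data to be validated (file name is a placeholder)
g = Graph()
g.parse("data.ttl", format="turtle")

# Simplified, illustrative constraint-style query: find instances of a class
# that are missing a required property. Not verbatim sparqlgen output.
CONSTRAINT_QUERY = """
PREFIX ex: <https://example.org/>
SELECT ?instance WHERE {
  ?instance a ex:Person .
  FILTER NOT EXISTS { ?instance ex:name ?name }
}
"""

violations = list(g.query(CONSTRAINT_QUERY))
for row in violations:
    print(f"Constraint violation: {row.instance} has no ex:name")
print(f"{len(violations)} violation(s) found")
```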
Validation via shape languages#
Currently the LinkML framework does not provide built-in support for validation using a shape language, but the following strategy can be used (a minimal sketch is shown below):

- Convert data to RDF using linkml-convert
- Convert the schema to a shape language using gen-shex or gen-shacl
- Use a ShEx or SHACL validator
See next section for more details.
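For the final step, one option for SHACL is the pyshacl package. The sketch below assumes you have already produced data.ttl with linkml-convert and a shapes file shapes.ttl with gen-shacl; both file names are placeholders.

```python
from pyshacl import validate

# data.ttl: instance data converted to RDF with linkml-convert
# shapes.ttl: SHACL shapes generated from the LinkML schema with gen-shacl
# (both file names are illustrative placeholders)
conforms, results_graph, results_text = validate(
    "data.ttl",
    shacl_graph="shapes.ttl",
    data_graph_format="turtle",
    shacl_graph_format="turtle",
    inference="rdfs",
)

print("Conforms:", conforms)
if not conforms:
    print(results_text)
```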
Future plans#
Future versions of LinkML will employ a powerful constraint and inference language.
One of the use cases here is being able to specify that the `length` field is equal to `end - start`. This declarative knowledge can then be used to (1) infer the value of `length` if unspecified, (2) infer either `start` or `end` if only one of these is specified alongside `length`, or (3) check consistency if all three are specified.
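As a purely illustrative sketch (this is not an existing LinkML feature or API), the intended semantics of such a constraint might look like the following, where `length`, `start`, and `end` are slots on some hypothetical feature record:

```python
from typing import Optional, Tuple

def apply_length_constraint(
    start: Optional[int], end: Optional[int], length: Optional[int]
) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Illustrative only: enforce the declarative rule length == end - start.

    (1) infer length when start and end are given,
    (2) infer the missing one of start/end when length is given,
    (3) check consistency when all three are given.
    """
    if start is not None and end is not None:
        if length is None:
            length = end - start  # (1) infer length
        elif length != end - start:
            raise ValueError("inconsistent: length != end - start")  # (3)
    elif length is not None:
        if start is not None and end is None:
            end = start + length  # (2) infer end
        elif end is not None and start is None:
            start = end - length  # (2) infer start
    return start, end, length
```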
These constraints can then be executed over large databases via a variety of strategies, including:

- generation of Datalog programs for efficient engines such as Soufflé
- generation of SQL queries to be used with relational databases
Command Line#
.. currentmodule:: linkml.utils.jsonschemavalidator

.. click:: linkml.utils.jsonschemavalidator:cli
    :prog: linkml-validate
    :nested: full