Data Validation#
LinkML is designed to allow for a variety of data validation strategies. The overall philosophy is to provide maximum expressivity in the language so that model designers can state all constraints declaratively, and then to leverage existing frameworks, allowing users to balance concerns such as expressivity versus efficiency.
Currently there are several supported strategies:

- validation with the linkml.validator package and its CLI
- validation via Python object instantiation
- validation via JSON Schema using external tools
- validation of triples in a triplestore or RDF file via generation of SPARQL constraints
- validation of RDF via generation of ShEx or SHACL
- validation via SQL loading and queries
However, others will be supported in future; in particular, scalable validation of massive databases.
The linkml.validator package and CLI#
This package contains the main entry point for various flexible validation strategies.
Validation in Python code#
If you are writing your own Python code to perform validation, the simplest approach is to use the linkml.validator.validate() function. For example:
from linkml.validator import validate

instance = {
    "id": "ORCID:1234",
    "full_name": "Clark Kent",
    "age": 32,
    "phone": "555-555-5555",
}

report = validate(instance, "personinfo.yaml", "Person")

if not report.results:
    print('The instance is valid!')
else:
    for result in report.results:
        print(result.message)
This function takes a single instance (typically represented as a Python dict) and validates it according to the given schema (specified here by a path to the source file, although a dict or object representation of the schema is also accepted). This example also explicitly specifies which class within the schema (Person) the data instance should adhere to. If this is omitted, the function will attempt to infer it.
The other high-level function is linkml.validator.validate_file(). It loads data instances from a file and validates each of them according to a class in a schema. Assuming the contents of people.csv look like:
id,full_name,age,phone
ORCID:1234,Clark Kent,32,555-555-5555
ORCID:5678,Lois Lane,33,555-555-1234
Each row can be validated with:
from linkml.validator import validate_file
report = validate_file("people.csv", "personinfo.yaml", "Person")
Under the hood, both of these functions use a strategy of generating a JSON Schema artifact from the LinkML schema and validating instances using a JSON Schema validator.
While many LinkML constructs can be expressed in JSON Schema (which makes it a good default validation strategy), there are some features of LinkML not supported by JSON Schema. For more fine-grained control over the validation strategy, use the linkml.validator.Validator class. Using this class it is possible to mix JSON Schema validation with other strategies or forego it altogether.
The key idea behind linkml.validator.Validator is that it does not do any validation itself. Instead, it simply orchestrates validation according to a set of validation plugins. In the following example, basic JSON Schema validation will happen (disallowing additional properties because of the closed option), as well as a validation that checks that recommended slots are populated:
from linkml.validator import Validator
from linkml.validator.plugins import JsonschemaValidationPlugin, RecommendedSlotsPlugin

validator = Validator(
    schema="personinfo.yaml",
    validation_plugins=[
        JsonschemaValidationPlugin(closed=True),
        RecommendedSlotsPlugin()
    ]
)

validator.validate({"id": "ORCID:1234", "full_name": "Clark Kent", "age": 32, "phone": "555-555-5555"}, "Person")
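As with the top-level validate() function, the Validator.validate() method returns a report whose results list contains any problems found. Here is a minimal sketch of capturing and inspecting that report, continuing the example above:

report = validator.validate(
    {"id": "ORCID:1234", "full_name": "Clark Kent", "age": 32, "phone": "555-555-5555"},
    "Person",
)
if not report.results:
    print("The instance is valid!")
else:
    for result in report.results:
        print(result.message)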
The next example instead uses only a validation strategy based on generating Pydantic models from the LinkML schema:
from linkml.validator import Validator
from linkml.validator.plugins import PydanticValidationPlugin

validator = Validator(
    schema="personinfo.yaml",
    validation_plugins=[PydanticValidationPlugin()]
)

validator.validate({"id": "ORCID:1234", "full_name": "Clark Kent", "age": 32, "phone": "555-555-5555"}, "Person")
Refer to the linkml.validator.plugins documentation for more information about the available plugins and their benefits and tradeoffs.
The linkml-validate CLI#
The same functionality is also available via the linkml-validate command line interface. For basic validation, simply provide a schema and a source to load data instances from:
$ linkml-validate --schema personinfo.yaml --target-class Person people.csv
No issues found!
Similar to the linkml.validator.validate() and linkml.validator.validate_file() functions, this will perform basic validation based on a JSON Schema validator. If advanced customization is needed, create a configuration YAML file and provide it with the --config argument:
$ linkml-validate --config person-validation.config.yaml
The configuration YAML file can have the following keys. All keys are optional:
Key | Description | Default value
---|---|---
schema | Path to the LinkML schema. Overrides the --schema command line option. | None
target_class | Class in the schema to validate against. Overrides the --target-class command line option. | None
data_sources | A list of sources where each source is either a string or a dictionary with a single key. This value overrides any DATA_SOURCES provided as command line arguments. | None
plugins | A dictionary where each key is the name of a validation plugin and each value is a dictionary of arguments passed to that plugin's constructor. Classes defined in the linkml.validator.plugins package can be referenced by name. | JsonschemaValidationPlugin: {closed: true}
Here is an example configuration file:
# person-validation.config.yaml
schema: personinfo.yaml
target_class: Container

# Data from two files will be validated. A loader for the JSON file will be created
# automatically based on the file extension. A loader for the CSV file is specified
# manually in order to provide custom options.
data_sources:
  - people.json
  - CsvLoader:
      source: people.csv
      index_slot_name: persons

# Data will be validated according to the JsonschemaValidationPlugin with additional
# properties allowed (closed: false) and also the RecommendedSlotsPlugin
plugins:
  JsonschemaValidationPlugin:
    closed: false
  RecommendedSlotsPlugin:
linkml-validate#
Validate data according to a LinkML Schema
linkml-validate [OPTIONS] [DATA_SOURCES]...
Options
- -s, --schema <schema>#
Schema file to validate data against
- -C, --target-class <target_class>#
Class within the schema to validate data against
- --config <config>#
Validation configuration YAML file.
- --exit-on-first-failure#
Exit after the first validation failure is found. If not specified, all validation failures are reported.
- --legacy-mode#
Use legacy linkml-validate behavior.
- -m, --module <module>#
[DEPRECATED: only used in legacy mode] Path to python datamodel module
- -f, --input-format <input_format>#
[DEPRECATED: only used in legacy mode] Input format. Inferred from input suffix if not specified
- Options:
yml | yaml | json | rdf | ttl | json-ld | csv | tsv
- -S, --index-slot <index_slot>#
[DEPRECATED: only used in legacy mode] Top-level slot. Required for CSV dumping/loading
- --include-range-class-descendants, --no-range-class-descendants#
[DEPRECATED: only used in legacy mode] When handling range constraints, include all descendants of the range class instead of just the range class
- -D, --include-context, --no-include-context#
Include additional context when reporting validation errors.
- Default:
False
- -V, --version#
Show the version and exit.
Arguments
- DATA_SOURCES#
Optional argument(s)
Python object instantiation#
If you have generated Python dataclasses or Pydantic models from your LinkML schema, you can also use them as a lightweight form of validation.
$ gen-python personinfo.yaml > personinfo.py
$ echo '{"id":"ORCID:1234","full_name":"Clark Kent","age":32,"phone":"555-555-5555"}' > person.json
from personinfo import Person
import json

with open("person.json") as f:
    person_data = json.load(f)

kent = Person(**person_data)
If you remove the id key from person.json and run the above code again, you will see a ValueError raised indicating that id is required.
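For illustration, the following sketch catches that error (the exact message text may vary between LinkML versions):

from personinfo import Person

try:
    # "id" is deliberately omitted from the keyword arguments
    Person(full_name="Clark Kent", age=32, phone="555-555-5555")
except ValueError as e:
    print(f"Validation failed: {e}")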
JSON Schema with external tools#
If you need to perform validation outside of a Python-based project, JSON Schema validation is often the most straightforward to implement. From your LinkML schema project, generate a JSON Schema artifact:
$ gen-json-schema personinfo.yaml > personinfo.schema.json
The personinfo.schema.json artifact can then be used in any other project where a JSON Schema implementation is available.
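For illustration, here is a minimal sketch using the Python jsonschema package; any other JSON Schema implementation can consume the same artifact in an equivalent way. This assumes the generated schema validates the desired class at its top level (the --top-class option of gen-json-schema controls which class that is):

import json
from jsonschema import validate as validate_json

with open("personinfo.schema.json") as f:
    schema = json.load(f)

instance = {"id": "ORCID:1234", "full_name": "Clark Kent", "age": 32}

# Raises jsonschema.exceptions.ValidationError if the instance does not conform
validate_json(instance=instance, schema=schema)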
Validation of RDF triplestores using generated SPARQL#
The LinkML framework can also be used to validate RDF, either in a file or a triplestore. There are two steps:

- generation of SPARQL constraint-style queries (see the sparqlgen generator documentation)
- execution of those queries on an in-memory graph or external triplestore

The user can choose to run only the first step, to obtain a bank of SPARQL queries that can be applied selectively.
linkml-sparql-validate#
Validates data using generated SPARQL queries
Example:
linkml-sparql-validate -U http://sparql.hegroup.org/sparql -s tests/test_validation/input/omo.yaml
linkml-sparql-validate [OPTIONS]
Options
- -G, --named-graph <named_graph>#
Constrain query to a named graph
- -i, --input <input>#
Input file to validate
- -U, --endpoint-url <endpoint_url>#
URL of sparql endpoint
- -L, --limit <limit>#
Max results per query
- -o, --output <output>#
Path to report file
- -f, --input-format <input_format>#
Input format. Inferred from input suffix if not specified
- Options:
yml | yaml | json | rdf | ttl | json-ld | csv | tsv
- -t, --output-format <output_format>#
Output format. Inferred from output suffix if not specified
- Options:
yml | yaml | json | rdf | ttl | json-ld | csv | tsv
- -s, --schema <schema>#
Path to schema specified as LinkML yaml
- -V, --version#
Show the version and exit.
Validation via shape languages#
Currently the LinkML framework does not provide built-in support for validating using a shape language, but the following strategy can be used (a minimal sketch is shown below):

- Convert data to RDF using linkml-convert
- Convert the schema to a shape language using gen-shex or gen-shacl
- Use a ShEx or SHACL validator
See next section for more details.
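As a rough sketch of the last two steps, assuming data has already been converted to Turtle with linkml-convert and shapes have been generated with gen-shacl (the file names below are hypothetical), a SHACL validator such as the third-party pyshacl package could be used as follows:

from pyshacl import validate

# person_data.ttl:      RDF produced by linkml-convert
# personinfo.shacl.ttl: shapes produced by gen-shacl
conforms, results_graph, results_text = validate(
    "person_data.ttl",
    shacl_graph="personinfo.shacl.ttl",
)
print("Conforms!" if conforms else results_text)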
Future plans#
Future versions of LinkML will employ a powerful constraint and inference language.
One of the use cases here is being able to specify that the length field is equal to end - start. This declarative knowledge can then be used to (1) infer the value of length if unspecified, (2) infer either start or end if only one of them is specified alongside length, or (3) check consistency if all three are specified.
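To make these three cases concrete, here is a plain Python illustration of the intended semantics (this is not an existing LinkML API; the field names are purely illustrative):

def apply_length_constraint(obj: dict) -> dict:
    """Apply the declarative constraint length = end - start to one instance."""
    start, end, length = obj.get("start"), obj.get("end"), obj.get("length")
    if length is None and start is not None and end is not None:
        obj["length"] = end - start        # (1) infer length
    elif start is None and end is not None and length is not None:
        obj["start"] = end - length        # (2) infer start
    elif end is None and start is not None and length is not None:
        obj["end"] = start + length        # (2) infer end
    elif None not in (start, end, length) and length != end - start:
        raise ValueError("inconsistent: length != end - start")  # (3) consistency check
    return obj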
These constraints can then be executed over large databases via a variety of strategies, including:

- generation of datalog programs for efficient engines such as souffle
- generation of SQL queries to be used with relational databases