Data Validation#

LinkML is designed to support a variety of strategies for data validation. The overall philosophy is to provide maximum expressivity in the language, so that model designers can state all constraints declaratively, and then to leverage existing frameworks, allowing the user to balance concerns such as expressivity vs. efficiency.

Currently there are several supported strategies:

  • validation with the linkml.validator package and its CLI

  • validation via Python object instantiation

  • validation via JSON Schema using external tools

  • validation of triples in a triplestore or RDF file via generation of SPARQL constraints

  • validation of RDF via generation of ShEx or SHACL

  • validation via SQL loading and queries

However, others will be supported in the future; in particular, scalable validation of massive databases.

The linkml.validator package and CLI#

This package contains the main entry point for various flexible validation strategies.

Validation in Python code#

If you are writing your own Python code to perform validation, the simplest approach is to use the linkml.validator.validate() function. For example:

from linkml.validator import validate

instance = {
    "id": "ORCID:1234",
    "full_name": "Clark Kent",
    "age": 32,
    "phone": "555-555-5555",
}

report = validate(instance, "personinfo.yaml", "Person")

if not report.results:
    print('The instance is valid!')
else:
    for result in report.results:
        print(result.message)

This function takes a single instance (typically represented as a Python dict) and validates it according to the given schema (specified here by a path to the source file, although a dict or object representation of the schema is also accepted). This example also explicitly specifies which class within the schema (Person) the data instance should adhere to. If this is omitted, the function will attempt to infer it.
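
For example, the target class argument can be omitted, in which case the function attempts the inference just described (a minimal sketch reusing the instance defined above):

report = validate(instance, "personinfo.yaml")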

The other high-level function is linkml.validator.validate_file(). It loads data instances from a file and validates each of them according to a class in a schema. Assuming the contents of people.csv look like:

id,full_name,age,phone
ORCID:1234,Clark Kent,32,555-555-5555
ORCID:5678,Lois Lane,33,555-555-1234

Each row can be validated with:

from linkml.validator import validate_file

report = validate_file("people.csv", "personinfo.yaml", "Person")

Under the hood, both of these functions use a strategy of generating a JSON Schema artifact from the LinkML schema and validating instances using a JSON Schema validator.

While many LinkML constructs can be expressed in JSON Schema (which makes it a good default validation strategy), there are some features of LinkML not supported by JSON Schema. For more fine-grained control over the validation strategy use the linkml.validator.Validator class. Using this class it is possible to mix JSON Schema validation with other strategies or forego it altogether.

The key idea behind the linkml.validator.Validator is that it does not do any validation itself. Instead, it simply orchestrates validation according to a set of validation plugins. In the following example, basic JSON Schema validation is performed (disallowing additional properties because of the closed option), along with a check that recommended slots are populated:

from linkml.validator import Validator
from linkml.validator.plugins import JsonschemaValidationPlugin, RecommendedSlotsPlugin

validator = Validator(
    schema="personinfo.yaml",
    validation_plugins=[
        JsonschemaValidationPlugin(closed=True),
        RecommendedSlotsPlugin()
    ]
)
validator.validate({"id": "ORCID:1234", "full_name": "Clark Kent", "age": 32, "phone": "555-555-5555"}, "Person")
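
Like the top-level validate() function, the Validator.validate() method returns a report object, so the same pattern of checking report.results for messages applies here.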

The next example instead uses only a validation strategy based on generating Pydantic models from the LinkML schema:

from linkml.validator import Validator
from linkml.validator.plugins import PydanticValidationPlugin

validator = Validator(
    schema="personinfo.yaml",
    validation_plugins=[PydanticValidationPlugin()]
)
validator.validate({"id": "ORCID:1234", "full_name": "Clark Kent", "age": 32, "phone": "555-555-5555"}, "Person")

Refer to the linkml.validator.plugins documentation for more information about the available plugins and their benefits and tradeoffs.

The linkml-validate CLI#

The same functionality is also available via the linkml-validate command line interface. For basic validation, simply provide a schema and a source to load data instances from:

$ linkml-validate --schema personinfo.yaml --target-class Person people.csv
No issues found!

Similar to the linkml.validator.validate() and linkml.validator.validate_file() functions, this will perform basic validation based on a JSON Schema validator. If advanced customization is needed, create a configuration YAML file and provide it with the --config argument:

$ linkml-validate --config person-validation.config.yaml

The configuration YAML file can have the following keys. All keys are optional:

schema
    Path to the LinkML schema. Overrides the --schema CLI argument if both are provided.
    Default: None

target_class
    Class in the schema to validate against. Overrides the --target-class CLI argument if both are provided.
    Default: None

data_sources
    A list of sources, where each source is either a string or a dictionary with a single key.

      • If the source is a string, it is interpreted as a file path and data will be loaded from it based on the file extension.

      • If the source is a dictionary, its single key should be the name of a linkml.validator.loaders.Loader subclass. The value is a dictionary that will be interpreted as constructor keyword arguments for that class.

    This value overrides any DATA_SOURCES arguments passed to the CLI.
    Default: None

plugins
    A dictionary where each key is the name of a linkml.validator.plugins.ValidationPlugin subclass. Each value is a dictionary that will be interpreted as constructor keyword arguments for that class.

    Classes defined in the linkml.validator.plugins package do not require a full dotted name (just JsonschemaValidationPlugin is sufficient). Classes outside of this package can be used, but you must specify the full dotted name (e.g. my_project.MyCustomValidationPlugin).

    Default:
      JsonschemaValidationPlugin:
        closed: true

Here is an example configuration file:

# person-validation.config.yaml
schema: personinfo.yaml
target_class: Container

# Data from two files will be validated. A loader for the JSON file will be created
# automatically based on the file extension. A loader for the CSV file is specified
# manually in order to provide custom options.
data_sources:
  - people.json
  - CsvLoader:
      source: people.csv
      index_slot_name: persons

# Data will be validated according to the JsonschemaValidationPlugin with additional
# properties allowed (closed: false) and also the RecommendedSlotsPlugin
plugins:
  JsonschemaValidationPlugin:
    closed: false
  RecommendedSlotsPlugin:

linkml-validate#

linkml-validate [OPTIONS] [DATA_SOURCES]...

Options

-s, --schema <schema>#

Schema file to validate data against

-C, --target-class <target_class>#

Class within the schema to validate data against

--config <config>#

Validation configuration YAML file.

--exit-on-first-failure#

Exit after the first validation failure is found. If not specified all validation failures are reported.

--legacy-mode#

Use legacy linkml-validate behavior.

-m, --module <module>#

[DEPRECATED: only used in legacy mode] Path to python datamodel module

-f, --input-format <input_format>#

[DEPRECATED: only used in legacy mode] Input format. Inferred from input suffix if not specified

Options:

yml | yaml | json | rdf | ttl | json-ld | csv | tsv

-S, --index-slot <index_slot>#

[DEPRECATED: only used in legacy mode] top level slot. Required for CSV dumping/loading

--include-range-class-descendants, --no-range-class-descendants#

[DEPRECATED: only used in legacy mode] When handling range constraints, include all descendants of the range class instead of just the range class

-V, --version#

Show the version and exit.

Arguments

DATA_SOURCES#

Optional argument(s)

Python object instantiation#

If you have generated Python dataclasses or Pydantic models from your LinkML schema, you can also use them as a lightweight form of validation.

$ gen-python personinfo.yaml > personinfo.py
$ echo '{"id":"ORCID:1234","full_name":"Clark Kent","age":32,"phone":"555-555-5555"}' > person.json
Then, in Python:

from personinfo import Person
import json

with open("person.json") as f:
    person_data = json.load(f)

kent = Person(**person_data)

If you remove the id key from person.json and run the above code again, you will see a ValueError raised indicating that id is required.
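
For example, continuing from the snippet above (a minimal sketch):

# Remove the required "id" key and try to instantiate again
del person_data["id"]

try:
    Person(**person_data)
except ValueError as e:
    print(f"Validation failed: {e}")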

JSON Schema with external tools#

If you need to perform validation outside of a Python-based project, JSON Schema validation is often the most straightforward to implement. From your LinkML schema project, generate a JSON Schema artifact:

$ gen-json-schema personinfo.yaml > personinfo.schema.json

The personinfo.schema.json artifact can then be used in any other project where a JSON Schema implementation is available.
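
For example, the artifact could be checked against a JSON document using a generic JSON Schema command line tool such as check-jsonschema (used here purely as an illustration; any compliant JSON Schema implementation can be used, and the file names are assumptions):

$ check-jsonschema --schemafile personinfo.schema.json person.json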

Validation of RDF triplestores using generated SPARQL#

The LinkML framework can also be used to validate RDF, either in a file, or a triplestore. There are two steps:

  1. generation of SPARQL constraint-style queries (see [sparqlgen](../generators/sparql))

  2. execution of those queries on an in-memory graph or external triplestore

The user can choose to run only the first step, obtaining a bank of SPARQL queries that can be applied selectively.
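
For example, the queries alone could be generated with the sparqlgen generator (a sketch; the output file name is an assumption, and the generator documentation describes the available options):

$ gen-sparql personinfo.yaml > personinfo-checks.sparql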

linkml-sparql-validate#

Validates data against a LinkML schema using generated SPARQL queries.

Example:

linkml-sparql-validate -U http://sparql.hegroup.org/sparql -s tests/test_validation/input/omo.yaml

linkml-sparql-validate [OPTIONS]

Options

-G, --named-graph <named_graph>#

Constrain query to a named graph

-i, --input <input>#

Input file to validate

-U, --endpoint-url <endpoint_url>#

URL of sparql endpoint

-L, --limit <limit>#

Max results per query

-o, --output <output>#

Path to report file

-f, --input-format <input_format>#

Input format. Inferred from input suffix if not specified

Options:

yml | yaml | json | rdf | ttl | json-ld | csv | tsv

-t, --output-format <output_format>#

Output format. Inferred from output suffix if not specified

Options:

yml | yaml | json | rdf | ttl | json-ld | csv | tsv

-s, --schema <schema>#

Path to schema specified as LinkML yaml

-V, --version#

Show the version and exit.

Validation via shape languages#

Currently the LinkML framework does not provide built-in support for validation using a shape language, but the following strategy can be used:

  1. Convert data to RDF using linkml-convert

  2. Convert schema to a shape language using gen-shex or gen-shacl

  3. Use a ShEx or SHACL validator

See the gen-shex and gen-shacl generator documentation for more details.
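
As a rough sketch of the full workflow (the file names, the Person target class, and the use of the third-party pySHACL validator are illustrative assumptions):

# 1. Convert data instances to RDF
$ linkml-convert -s personinfo.yaml -C Person person.json -o person.ttl

# 2. Generate SHACL shapes from the schema
$ gen-shacl personinfo.yaml > personinfo.shapes.ttl

# 3. Validate the RDF against the shapes using a SHACL validator
$ pyshacl -s personinfo.shapes.ttl person.ttl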

Future plans#

Future versions of LinkML will employ a powerful constraint and inference language.

One of the use cases here is being able to specify that the length field is equal to end - start. This declarative knowledge can then be used to (1) infer the value of length if it is unspecified, (2) infer either start or end if only one of them is specified alongside length, or (3) check consistency if all three are specified.

These constraints can then be executed over large databases via a variety of strategies including:

  • generation of Datalog programs for efficient engines such as Soufflé

  • generation of SQL queries to be used with relational databases