Data Validation
===============
LinkML is designed to allow for a variety of data validation strategies. The overall philosophy is to provide maximum expressivity in the language, so that model designers can state all constraints declaratively, and then to leverage existing frameworks, allowing users to balance concerns such as expressivity versus efficiency.

Currently there are several supported strategies:

- validation with the ``linkml.validator`` package and its CLI
- validation via Python object instantiation
- validation via JSON Schema using external tools
- validation of triples in a triplestore or RDF file via generation of SPARQL constraints
- validation of RDF via generation of ShEx or SHACL
- validation via SQL loading and queries

Others will be supported in the future, in particular scalable validation of massive databases.

The ``linkml.validator`` package and CLI
----------------------------------------
This package contains the main entry point for various flexible validation strategies.

Validation in Python code
^^^^^^^^^^^^^^^^^^^^^^^^^
If you are writing your own Python code to perform validation, the simplest approach is to use the :func:`linkml.validator.validate` function. For example:

.. code-block:: python

    from linkml.validator import validate

    instance = {
        "id": "ORCID:1234",
        "full_name": "Clark Kent",
        "age": 32,
        "phone": "555-555-5555",
    }

    report = validate(instance, "personinfo.yaml", "Person")

    if not report.results:
        print('The instance is valid!')
    else:
        for result in report.results:
            print(result.message)

This function takes a single instance (typically represented as a Python dict) and validates it according to the given schema (specified here by a path to the source file, but a dict or object representation of the schema is also accepted). This example also explicitly specifies which class within the schema (``Person``) the data instance should adhere to. If the class is omitted, the function will attempt to infer it, as shown below.
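
A minimal sketch of relying on that inference (reusing the ``instance`` from the previous example):

.. code-block:: python

    # target class omitted; validate() will attempt to infer it from the schema
    report = validate(instance, "personinfo.yaml")
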
The other high-level function is :func:`linkml.validator.validate_file`. It loads data instances from a file and validates each of them according to a class in a schema. Assuming the contents of ``people.csv`` look like:

.. code-block:: text

    id,full_name,age,phone
    ORCID:1234,Clark Kent,32,555-555-5555
    ORCID:5678,Lois Lane,33,555-555-1234

Each row can be validated with:

.. code-block:: python

    from linkml.validator import validate_file

    report = validate_file("people.csv", "personinfo.yaml", "Person")

Under the hood, both of these functions use a strategy of generating a JSON Schema artifact from the LinkML schema and validating instances using a JSON Schema validator.

While many LinkML constructs can be expressed in JSON Schema (which makes it a good default validation strategy), some features of LinkML are not supported by JSON Schema. For more fine-grained control over the validation strategy, use the :class:`linkml.validator.Validator` class. Using this class it is possible to mix JSON Schema validation with other strategies or forgo it altogether.

The key idea behind the :class:`linkml.validator.Validator` class is that it does not do any validation itself. Instead, it orchestrates validation according to a set of validation plugins. In the following example, basic JSON Schema validation is performed (with additional properties disallowed because of the ``closed`` option), along with a check that recommended slots are populated:

.. code-block:: python

    from linkml.validator import Validator
    from linkml.validator.plugins import JsonschemaValidationPlugin, RecommendedSlotsPlugin

    validator = Validator(
        schema="personinfo.yaml",
        validation_plugins=[
            JsonschemaValidationPlugin(closed=True),
            RecommendedSlotsPlugin(),
        ],
    )

    validator.validate({"id": "ORCID:1234", "full_name": "Clark Kent", "age": 32, "phone": "555-555-5555"}, "Person")

The next example instead uses only a validation strategy based on generating `Pydantic <https://pydantic.dev>`_ models from the LinkML schema:

.. code-block:: python

    from linkml.validator import Validator
    from linkml.validator.plugins import PydanticValidationPlugin

    validator = Validator(
        schema="personinfo.yaml",
        validation_plugins=[PydanticValidationPlugin()],
    )

    validator.validate({"id": "ORCID:1234", "full_name": "Clark Kent", "age": 32, "phone": "555-555-5555"}, "Person")

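
Whichever plugins are configured, ``Validator.validate`` returns the same kind of report seen earlier. A minimal sketch of inspecting it (the ``severity`` attribute on each result is an assumption; ``message`` appears in the first example above):

.. code-block:: python

    report = validator.validate(
        {"full_name": "Clark Kent"},  # missing the required "id"
        "Person",
    )
    for result in report.results:
        # severity is assumed to be carried on each validation result
        print(f"[{result.severity}] {result.message}")
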
Refer to the :mod:`linkml.validator.plugins` documentation for more information about the available plugins and their benefits and tradeoffs.

The ``linkml-validate`` CLI
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The same functionality is also available via the ``linkml-validate`` command line interface. For basic validation, simply provide a schema and a source to load data instances from:

.. code-block:: bash

    $ linkml-validate --schema personinfo.yaml --target-class Person people.csv
    No issues found!

Similar to the :func:`linkml.validator.validate` and :func:`linkml.validator.validate_file` functions, this will perform basic validation based on a JSON Schema validator. If advanced customization is needed, create a configuration YAML file and provide it with the ``--config`` argument:

.. code-block:: bash

    $ linkml-validate --config person-validation.config.yaml

The configuration YAML file can have the following keys. All keys are optional:

=================== ======================================================== ================================
Key                 Description                                              Default value
=================== ======================================================== ================================
``schema``          Path to the LinkML schema. Overrides the ``--schema``    None
                    CLI argument if both are provided.

``target_class``    Class in the schema to validate against. Overrides the   None
                    ``--target-class`` CLI argument if both are provided.

``data_sources``    A list of sources where each source is either a string   None
                    or a dictionary with a single key.

                    - If the source is a string, it is interpreted as a
                      file path and data will be loaded from it based on
                      the file extension.
                    - If the source is a dictionary, it should have a
                      single key naming a
                      :class:`linkml.validator.loaders.Loader` subclass.
                      The value is a dictionary that will be interpreted
                      as constructor keyword arguments for the given class.

                    This value overrides any ``DATA_SOURCES`` arguments
                    passed to the CLI.

``plugins``         A dictionary where each key is the name of a             .. code-block:: yaml
                    :class:`linkml.validator.plugins.ValidationPlugin`
                    subclass. Each value is a dictionary that will be           JsonschemaValidationPlugin:
                    interpreted as constructor keyword arguments for the          closed: true
                    given class.

                    Classes defined in the ``linkml.validator.plugins``
                    package do not require a full dotted name (e.g. just
                    ``JsonschemaValidationPlugin`` is sufficient). Classes
                    outside of this package can be used, but you must
                    specify the full dotted name (e.g.
                    ``my_project.MyCustomValidationPlugin``).
=================== ======================================================== ================================

Here is an example configuration file:

.. code-block:: yaml

    # person-validation.config.yaml

    schema: personinfo.yaml
    target_class: Container

    # Data from two files will be validated. A loader for the JSON file will be created
    # automatically based on the file extension. A loader for the CSV file is specified
    # manually in order to provide custom options.
    data_sources:
      - people.json
      - CsvLoader:
          source: people.csv
          index_slot_name: persons

    # Data will be validated according to the JsonschemaValidationPlugin with additional
    # properties allowed (closed: false) and also the RecommendedSlotsPlugin.
    plugins:
      JsonschemaValidationPlugin:
        closed: false
      RecommendedSlotsPlugin:

.. click:: linkml.validator.cli:cli
    :prog: linkml-validate

Python object instantiation
---------------------------
If you have generated :doc:`../generators/python` dataclasses or :doc:`../generators/pydantic` models from your LinkML schema, you can also use them as a lightweight form of validation.

.. code-block:: shell

    $ gen-python personinfo.yaml > personinfo.py
    $ echo '{"id":"ORCID:1234","full_name":"Clark Kent","age":32,"phone":"555-555-5555"}' > person.json


.. code-block:: python

    import json

    from personinfo import Person

    with open("person.json") as f:
        person_data = json.load(f)

    kent = Person(**person_data)

If you remove the ``id`` key from ``person.json`` and run the above code again, you will see a ``ValueError`` raised indicating that ``id`` is required.
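
A minimal sketch of catching that error (the exact message text comes from the generated dataclass and is not shown here):

.. code-block:: python

    import json

    from personinfo import Person

    with open("person.json") as f:
        person_data = json.load(f)

    try:
        kent = Person(**person_data)
    except ValueError as e:
        # e.g. a message indicating that the required "id" slot is missing
        print(f"Invalid person record: {e}")
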
JSON Schema with external tools
-------------------------------
If you need to perform validation outside of a Python-based project, JSON Schema validation is often the most straightforward to implement. From your LinkML schema project, generate a JSON Schema artifact:

.. code-block:: shell

    $ gen-json-schema personinfo.yaml > personinfo.schema.json

The ``personinfo.schema.json`` artifact can then be used in any other project where a `JSON Schema implementation <https://json-schema.org/implementations>`_ is available.
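
For example, a minimal sketch using the third-party Python ``jsonschema`` package (any conformant implementation in any language works equally well; note that which class the schema's root describes depends on how the artifact was generated):

.. code-block:: python

    import json

    from jsonschema import ValidationError, validate

    with open("personinfo.schema.json") as f:
        schema = json.load(f)

    instance = {"id": "ORCID:1234", "full_name": "Clark Kent", "age": 32}

    try:
        validate(instance=instance, schema=schema)
        print("The instance is valid!")
    except ValidationError as e:
        print(e.message)
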
Validation of RDF triplestores using generated SPARQL
-----------------------------------------------------
The LinkML framework can also be used to validate RDF, either in a file or in a triplestore. There are two steps:

1. generation of SPARQL constraint-style queries (see :doc:`../generators/sparql`)
2. execution of those queries on an in-memory graph or external triplestore

You can also choose to run only the first step, to obtain a bank of SPARQL queries that can be applied selectively.
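
A minimal sketch of that first step (assuming ``gen-sparql`` writes the generated queries to standard output, like the other generators shown in this document; the output filename is illustrative):

.. code-block:: bash

    $ gen-sparql personinfo.yaml > personinfo-constraints.rq
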
.. click:: linkml.validators.sparqlvalidator:cli
    :prog: linkml-sparql-validate
    :nested: full

Validation via shape languages
------------------------------
Currently the LinkML framework does not provide built-in support for validating using a shape language, but the following strategy can be used:

1. Convert data to RDF using ``linkml-convert``
2. Convert schema to a shape language using ``gen-shex`` or ``gen-shacl``
3. Use a ShEx or SHACL validator

An end-to-end sketch of this strategy follows.
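
This sketch uses SHACL and assumes the third-party ``pyshacl`` CLI for step 3 (the exact flags are illustrative; consult each tool's help for details):

.. code-block:: bash

    # 1. convert the data to RDF (Turtle)
    $ linkml-convert -s personinfo.yaml -C Person person.json -o person.ttl

    # 2. generate SHACL shapes from the schema
    $ gen-shacl personinfo.yaml > personinfo.shacl.ttl

    # 3. validate the data against the shapes with pySHACL
    $ pyshacl -s personinfo.shacl.ttl person.ttl
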
Future plans
------------
Future versions of LinkML will employ a powerful constraint and inference language.

One of the use cases here is being able to specify that a ``length`` field is equal to ``end - start``. This declarative knowledge can then be used to (1) infer the value of ``length`` if it is unspecified, (2) infer either ``start`` or ``end`` if only one of them is specified alongside ``length``, or (3) check consistency if all three are specified.
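
As a purely illustrative sketch, such a rule might be declared along these lines (a hypothetical schema fragment; the eventual constraint language may look quite different):

.. code-block:: yaml

    slots:
      start:
        range: integer
      end:
        range: integer
      length:
        range: integer
        # hypothetical: derive length from end and start
        equals_expression: "{end} - {start}"
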
These constraints can then be executed over large databases via a variety of strategies, including:

* generation of Datalog programs for efficient engines such as Soufflé
* generation of SQL queries to be used with relational databases