Part 3: Adding constraints and performing validation
Now we will add richer information to our schema, including:
adding ranges for fields such as age
using pattern to force a field to conform to a regular expression
declaring the id slot to be an identifier
declaring the full_name slot to be required
adding textual descriptions of schema elements
Example schema
personinfo.yaml:
id: https://w3id.org/linkml/examples/personinfo
name: personinfo
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string

classes:
  Person:
    attributes:
      id:
        identifier: true     ## unique key for a person
      full_name:
        required: true       ## must be supplied
        description:
          name of the person
      aliases:
        multivalued: true    ## range is a list
        description:
          other names for the person
      phone:
        pattern: "^[\\d\\(\\)\\-]+$"   ## regular expression
      age:
        range: integer       ## an int between 0 and 200
        minimum_value: 0
        maximum_value: 200
  Container:
    attributes:
      persons:
        multivalued: true
        inlined_as_list: true
        range: Person
We use YAML comment syntax (i.e. the part after #) for comments; these are ignored by the parser.
Note that we haven’t declared ranges for some fields, but the default_range directive at the schema level ensures things default to string.
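To get a feel for what the pattern on the phone slot accepts, here is a minimal sketch using Python's standard re module. It assumes the pattern behaves the same way under Python's regex engine as it does under the validator; the helper name phone_matches is hypothetical and not part of LinkML.

```python
import re

# The schema stores the pattern as a double-quoted YAML string, so
# "^[\\d\\(\\)\\-]+$" unescapes to the regex ^[\d\(\)\-]+$
PHONE_PATTERN = re.compile(r"^[\d\(\)\-]+$")


def phone_matches(value: str) -> bool:
    """Hypothetical helper: True if value uses only digits, parentheses and hyphens."""
    return PHONE_PATTERN.match(value) is not None


print(phone_matches("1-800-555-0100"))    # True
print(phone_matches("(800)555-0100"))     # True
print(phone_matches("1-800-kryptonite"))  # False
```

The pattern is anchored at both ends, so a single letter anywhere in the value is enough to make the match fail.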
Example data
Let’s deliberately introduce some bad data to make sure our validator is working:
bad-data.yaml:
persons:
  - id: ORCID:1234
    full_name: Clark Kent
    age: 90
    phone: 1-800-kryptonite
  - id: ORCID:5678
    age: 33
Running the following command:
linkml-validate -s personinfo.yaml bad-data.yaml
Will result in:
[ERROR] [bad-data.yaml/0] '1-800-kryptonite' does not match '^[\\d\\(\\)\\-]+$' in /persons/0/phone
[ERROR] [bad-data.yaml/0] 'full_name' is a required property in /persons/1
This indicates there are two issues with our data. The first says that the phone number of the first entry in the persons list (/persons/0/phone) doesn’t conform to the regular expression we stated. The second says that we are missing the required full_name slot on the second entry in the persons list (/persons/1).
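The paths in these messages are zero-based positions within the persons list. As a rough illustration of how such paths relate to the data, the following sketch applies the same three constraints by hand using only the standard library. It is not how linkml-validate is implemented; the function check_persons and its message wording are hypothetical.

```python
import re

# The records from bad-data.yaml, written out as Python literals
data = {
    "persons": [
        {"id": "ORCID:1234", "full_name": "Clark Kent", "age": 90,
         "phone": "1-800-kryptonite"},
        {"id": "ORCID:5678", "age": 33},
    ]
}

PHONE_PATTERN = re.compile(r"^[\d\(\)\-]+$")


def check_persons(container: dict) -> list:
    """Hypothetical helper: collect violations with JSON-pointer-style paths."""
    problems = []
    for index, person in enumerate(container.get("persons", [])):
        path = f"/persons/{index}"
        # required: true on full_name
        if "full_name" not in person:
            problems.append(f"'full_name' is a required property in {path}")
        # pattern on phone
        phone = person.get("phone")
        if phone is not None and not PHONE_PATTERN.match(phone):
            problems.append(f"{phone!r} does not match the phone pattern in {path}/phone")
        # minimum_value / maximum_value on age
        age = person.get("age")
        if age is not None and not 0 <= age <= 200:
            problems.append(f"{age} is outside 0..200 in {path}/age")
    return problems


problems = check_persons(data)
for problem in problems:
    print(problem)
```

Run against the bad data, this reports one problem under /persons/0/phone and one under /persons/1, mirroring the two errors above.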
Let’s fix the second issue.
better-data.yaml:
persons:
  - id: ORCID:1234
    full_name: Clark Kent
    age: 90
    phone: 1-800-kryptonite
  - id: ORCID:5678
    full_name: Lois Lane
    age: 33
linkml-validate -s personinfo.yaml better-data.yaml
Will result in:
[ERROR] [better-data.yaml/0] '1-800-kryptonite' does not match '^[\\d\\(\\)\\-]+$' in /persons/0/phone
We have successfully fixed one of the issues with the data!
Exercises
See if you can iterate on the example data to get something that validates.
Using the JSON Schema directly
The linkml-validate command is a wrapper that can be used for an open-ended number of validator implementations. The current default is to use a JSON Schema validator. This involves converting LinkML to JSON Schema. Note that some features of LinkML are not supported by JSON Schema, so the current validator is not guaranteed to be complete.
If you prefer, you can use your own JSON Schema validator. First compile the schema to JSON Schema. Unlike the linkml-validate command, the gen-json-schema command does not attempt to automatically infer which class in your schema to use for validation. You must either identify it in your schema by setting tree_root: true on one class or pass the -t/--top-class option to gen-json-schema.
gen-json-schema personinfo.yaml --top-class Container > personinfo.schema.json
You can then use the jsonschema command that comes with the Python library (any JSON Schema validator will do here):
jsonschema -i bad-data.json personinfo.schema.json
In general this should give you similar results, with some caveats:
the bad-data.yaml file can be converted to bad-data.json using https://www.json2yaml.com/
linkml-validate will first perform an internal conversion prior to using the jsonschema validator, and some errors may be caught at that stage
the conversion process may mask some errors, e.g. if a slot has range integer and is supplied as a string, implicit conversion is used
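The last caveat can be illustrated with a small standard-library sketch. It does not reproduce LinkML's actual conversion logic; coerce_age is a hypothetical stand-in for an implicit conversion step, and is_json_integer is a hypothetical strict type check of the kind a JSON Schema validator would apply to the raw data.

```python
import json


def is_json_integer(value) -> bool:
    """Strict check: an integer, excluding booleans and numeric strings."""
    return isinstance(value, int) and not isinstance(value, bool)


def coerce_age(record: dict) -> dict:
    """Hypothetical stand-in for implicit conversion: numeric strings become ints."""
    converted = dict(record)
    age = converted.get("age")
    if isinstance(age, str) and age.strip().lstrip("-").isdigit():
        converted["age"] = int(age)
    return converted


# Here age is supplied as a string rather than an integer
raw = json.loads('{"id": "ORCID:5678", "full_name": "Lois Lane", "age": "33"}')

print(is_json_integer(raw["age"]))              # False
print(is_json_integer(coerce_age(raw)["age"]))  # True
```

A validator that sees only the converted record has no way to tell that the original value was a string, which is why the two approaches can report different errors.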
See the JSON-Schema generator docs for more info on JSON-Schema validation
Other validation strategies
Other strategies include
converting data to a relational database and doing performant evaluation in SQL
converting data to RDF and using either Shape validators or SPARQL queries
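As a rough sketch of the first strategy, the example below loads the records from bad-data.yaml into an in-memory SQLite database and finds constraint violations with a single query. The table layout is hypothetical and hand-written for illustration; it is not the DDL that LinkML's own SQL tooling would produce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table layout, written by hand for this illustration
conn.execute(
    """
    CREATE TABLE person (
        id TEXT PRIMARY KEY,
        full_name TEXT,
        age INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO person (id, full_name, age) VALUES (?, ?, ?)",
    [("ORCID:1234", "Clark Kent", 90), ("ORCID:5678", None, 33)],
)

# Rows that break the required constraint on full_name or the 0..200 range on age
violations = conn.execute(
    """
    SELECT id FROM person
    WHERE full_name IS NULL OR age < 0 OR age > 200
    ORDER BY id
    """
).fetchall()
print(violations)  # [('ORCID:5678',)]
conn.close()
```

Expressing the checks as a query lets the database evaluate them over the whole table at once, rather than record by record.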
The next section deals with working with RDF data.
Further reading
Metamodel Specification
identifier slot
required slot
minimum_value slot
maximum_value slot
tree_root slot
FAQ:
Generators: