Using ontology terms as values in data
LinkML provides a flexible way of modeling data. It allows for the optional use of ontologies, vocabularies, or other controlled terminologies to add semantics to data models, for example, by mapping classes or slots to external terms.
This how-to guide deals with another use case, where we want to include ontology elements as data values in our data model. In formal terms, this is called including ontology elements in the domain of discourse.
This is in principle straightforward - we just treat ontology elements the same way we would any other identifier or object. However, in some cases, this can lead to confusion about what the respective roles of the LinkML schema, data, or ontologies are.
Motivating example: associations to ontology terms
Let’s say we want to model associations between genes and phenotypes. This is a standard use case for biological ontologies - creating annotations that associate some kind of entity with a descriptor.
In the simplest case, this might be communicated by a two-column file:
| Gene | Phenotype |
|---|---|
| PEX1 | Seizure |
| PEX1 | Hypotonia |
This uses labels, which is not best practice; we could instead do this:
| Gene | Phenotype |
|---|---|
| NCBIGene:5189 | HP:0001250 |
| NCBIGene:5189 | HP:0001252 |
Or perhaps a denormalized representation:
| Gene | Gene Label | Phenotype | Phenotype Label |
|---|---|---|---|
| NCBIGene:5189 | PEX1 | HP:0001250 | Seizure |
| NCBIGene:5189 | PEX1 | HP:0001252 | Hypotonia |
This is denormalized because we end up repeating values.
If we go with a richer data serialization form like YAML, JSON, RDF, or a relational database model, we can normalize this model. For YAML/JSON this may be implemented by referencing objects in another collection, like this:
associations:
  - gene: NCBIGene:5189
    phenotype: HP:0001250
  - gene: NCBIGene:5189
    phenotype: HP:0001252
genes:
  - id: NCBIGene:5189
    label: PEX1
phenotypes:
  - id: HP:0001250
    label: Seizure
  - id: HP:0001252
    label: Hypotonia
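To make the relationship between the normalized and denormalized forms concrete, here is a minimal Python sketch that joins each association to the labels of the objects it references. The document is written as a plain Python dictionary mirroring the YAML above, and the function name `denormalize` is made up for this illustration; it is not part of LinkML.

```python
# The normalized document from above, written as plain Python data.
doc = {
    "associations": [
        {"gene": "NCBIGene:5189", "phenotype": "HP:0001250"},
        {"gene": "NCBIGene:5189", "phenotype": "HP:0001252"},
    ],
    "genes": [{"id": "NCBIGene:5189", "label": "PEX1"}],
    "phenotypes": [
        {"id": "HP:0001250", "label": "Seizure"},
        {"id": "HP:0001252", "label": "Hypotonia"},
    ],
}


def denormalize(doc):
    """Join each association to the labels of the gene and phenotype it references."""
    gene_labels = {g["id"]: g["label"] for g in doc["genes"]}
    phenotype_labels = {p["id"]: p["label"] for p in doc["phenotypes"]}
    rows = []
    for assoc in doc["associations"]:
        rows.append((
            assoc["gene"],
            gene_labels.get(assoc["gene"]),  # None if the gene is not listed
            assoc["phenotype"],
            phenotype_labels.get(assoc["phenotype"]),
        ))
    return rows


rows = denormalize(doc)
for row in rows:
    print(" | ".join(str(cell) for cell in row))
```

Each label is stored once in the normalized form and is repeated only when the four-column table is produced.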
However, for now let’s return to the simple 2-element model:
| Gene | Phenotype |
|---|---|
| NCBIGene:5189 | HP:0001250 |
| NCBIGene:5189 | HP:0001252 |
Simple schema for pairwise associations
The simplest possible data model that could work for this case is:
classes:
  GenePhenotypeAssociation:
    attributes:
      gene:
      phenotype:
Note that the schema doesn’t care that the phenotypes come from an ontology, or that the genes come from a standard resource - these are just pieces of data.
However, this isn’t quite satisfactory - it allows the data provider to put any free text they like in.
We would like to constrain both `gene` and `phenotype` to be identifiers.
We can do this by specifying a range:
classes:
  GenePhenotypeAssociation:
    attributes:
      gene:
        range: uriorcurie
      phenotype:
        range: uriorcurie
We can constrain it further still, by including a regexp pattern:
classes:
  GenePhenotypeAssociation:
    attributes:
      gene:
        range: uriorcurie
        pattern: "NCBIGene:\\d+"
      phenotype:
        range: uriorcurie
        pattern: "HP:\\d+"
(Obviously, this constrains the schema so tightly that it can’t be used for other phenotype ontologies, which may or may not be what we want.)
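As a rough illustration of what these patterns accept and reject, the following Python sketch applies the same two regular expressions with the standard `re` module. It is not the LinkML validator: the helper name `pattern_errors` is hypothetical, and requiring the whole value to match (via `fullmatch`) is an assumption of this sketch.

```python
import re

# Patterns copied from the schema above; the YAML string "\\d" is the regex \d.
PATTERNS = {
    "gene": re.compile(r"NCBIGene:\d+"),
    "phenotype": re.compile(r"HP:\d+"),
}


def pattern_errors(association):
    """Return the names of the fields whose value does not match its pattern."""
    return [
        field
        for field, pattern in PATTERNS.items()
        # Assumption: the entire value must match, not just a substring.
        if not pattern.fullmatch(association.get(field, ""))
    ]


ok = pattern_errors({"gene": "NCBIGene:5189", "phenotype": "HP:0001250"})
bad = pattern_errors({"gene": "PEX1", "phenotype": "MP:0002064"})
print(ok)
print(bad)
```

The second record fails on both fields: `PEX1` is a label rather than an identifier, and `MP:0002064` is an identifier with a prefix other than `HP`.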
So far so good. But what if we want to have a data model where we can communicate information about the genes and phenotypes themselves, rather than forcing the client to do an external lookup?
Let’s go one step further, and make a class for gene and phenotype:
classes:
  GenePhenotypeAssociation:
    attributes:
      gene:
        range: Gene
      phenotype:
        range: Phenotype
  Gene:
    attributes:
      id:
        range: uriorcurie
        identifier: true
        pattern: "NCBIGene:\\d+"
      label:
  Phenotype:
    attributes:
      id:
        range: uriorcurie
        identifier: true
        pattern: "HP:\\d+"
      label:
We can abstract it a bit further to avoid repetition:
classes:
  GenePhenotypeAssociation:
    attributes:
      gene:
        range: Gene
      phenotype:
        range: Phenotype
  NamedThing:
    attributes:
      id:
        range: uriorcurie
        identifier: true
      label:
  Gene:
    is_a: NamedThing
    id_prefixes:
      - NCBIGene
  Phenotype:
    is_a: NamedThing
    id_prefixes:
      - HP
Note we are taking advantage of the `id_prefixes` metaslot, but strictly speaking this is weaker than the previous regular expression pattern.
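The difference in strength can be seen in a small Python sketch. It compares a prefix-only check with the earlier `HP:\d+` pattern. The helper `has_allowed_prefix` is hypothetical and only approximates what a prefix constraint means; it makes no claim about how LinkML tooling actually enforces `id_prefixes`.

```python
import re


def has_allowed_prefix(curie, prefixes):
    """Prefix-only check: the part before the first ':' must be an allowed prefix."""
    prefix, sep, _local_id = curie.partition(":")
    return sep == ":" and prefix in prefixes


# The stricter regular expression used in the earlier version of the schema.
HP_PATTERN = re.compile(r"HP:\d+")

candidates = ["HP:0001250", "HP:seizure", "MP:0002064"]
prefix_ok = [c for c in candidates if has_allowed_prefix(c, ["HP"])]
pattern_ok = [c for c in candidates if HP_PATTERN.fullmatch(c)]
print(prefix_ok)
print(pattern_ok)
```

The value `HP:seizure` passes the prefix check but not the pattern, because the prefix check says nothing about the local part of the identifier.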
Adding a container
Let’s add a container class, to allow us to bundle lists of objects inside a single JSON or YAML document:
Container:
  tree_root: true
  attributes:
    genes:
      range: Gene
      multivalued: true
      inlined_as_list: true
    phenotypes:
      range: Phenotype
      multivalued: true
      inlined_as_list: true
    associations:
      range: GenePhenotypeAssociation
      multivalued: true
      inlined_as_list: true  ## not necessary as GenePhenotypeAssociation has no id
Our container class allows genes, phenotypes, plus associations between them to be transmitted as a single YAML/JSON object/document.
Note that inlining is non-default if a referenced entity has an identifier. This means that the right way to represent associations is using references (like foreign keys in a relational database):
associations:
  - gene: NCBIGene:5189
    phenotype: HP:0001250
  - gene: NCBIGene:5189
    phenotype: HP:0001252
Example of separate collections
We can optionally communicate information about the referenced entities:
associations:
  - gene: NCBIGene:5189
    phenotype: HP:0001250
  - gene: NCBIGene:5189
    phenotype: HP:0001252
genes:
  - id: NCBIGene:5189
    label: PEX1
phenotypes:
  - id: HP:0001250
    label: Seizure
  - id: HP:0001252
    label: Hypotonia
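Because the referenced objects are optional, a consumer that needs a label for every association may want to check which references are not backed by an entry in the corresponding collection, much like checking foreign keys. The Python sketch below shows one way to do that over plain dictionaries; `dangling_references` is a hypothetical helper, and a missing entry is reported rather than treated as an error.

```python
def dangling_references(doc):
    """List (association index, slot name, id) for references with no matching object."""
    known = {
        "gene": {g["id"] for g in doc.get("genes", [])},
        "phenotype": {p["id"] for p in doc.get("phenotypes", [])},
    }
    missing = []
    for index, assoc in enumerate(doc.get("associations", [])):
        for slot, ids in known.items():
            if assoc[slot] not in ids:
                missing.append((index, slot, assoc[slot]))
    return missing


doc = {
    "associations": [
        {"gene": "NCBIGene:5189", "phenotype": "HP:0001250"},
        {"gene": "NCBIGene:5189", "phenotype": "HP:0001252"},
    ],
    "genes": [{"id": "NCBIGene:5189", "label": "PEX1"}],
    # HP:0001252 is deliberately left out here to show a dangling reference.
    "phenotypes": [{"id": "HP:0001250", "label": "Seizure"}],
}

missing = dangling_references(doc)
print(missing)
```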
Representing the ontology hierarchy as data
It’s common practice to separate the ontology representation from the data, but in some cases it may be useful to transmit everything using the same schema, sending both associations and ontology classification in one YAML/JSON blob.
Let’s do that here, by adding a `parents` slot in the schema:
Phenotype:
  is_a: NamedThing
  attributes:
    parents:
      range: Phenotype
      multivalued: true
      slot_uri: rdfs:subClassOf
Note we could call this whatever we like. We include a `slot_uri` declaration to indicate that this is equivalent to `rdfs:subClassOf`.
This modified schema allows data like:
phenotypes:
  - id: HP:0001250
    label: Seizure
    parents:
      - HP:0012638
  - id: HP:0012638
    label: Abnormal nervous system physiology
    parents:
      ...
This is very practical - consumers of the data can consume the associations and the ontology hierarchy together to perform rollup operations, etc.
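As an illustration of such a rollup, the Python sketch below propagates each gene annotation from a phenotype to all of its ancestors by following the `parents` links. The function names are made up for this example. Only the parent link given in the text is used: the parents of HP:0012638 are elided above, so they are left empty here, and no parents are assumed for HP:0001252.

```python
# Parent links taken from the example data; anything not shown in the text is left empty.
parents_of = {
    "HP:0001250": ["HP:0012638"],  # Seizure -> Abnormal nervous system physiology
    "HP:0012638": [],              # parents elided in the text
}

associations = [
    {"gene": "NCBIGene:5189", "phenotype": "HP:0001250"},
    {"gene": "NCBIGene:5189", "phenotype": "HP:0001252"},
]


def ancestors(term_id, parents_of):
    """All terms reachable by following parent links, excluding the term itself."""
    seen = set()
    stack = list(parents_of.get(term_id, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents_of.get(current, []))
    return seen


def rollup(associations, parents_of):
    """Map each phenotype, direct or inherited, to the set of genes annotated to it."""
    genes_by_term = {}
    for assoc in associations:
        terms = {assoc["phenotype"]} | ancestors(assoc["phenotype"], parents_of)
        for term in terms:
            genes_by_term.setdefault(term, set()).add(assoc["gene"])
    return genes_by_term


rolled = rollup(associations, parents_of)
for term in sorted(rolled):
    print(term, sorted(rolled[term]))
```

The gene is reported for HP:0012638 even though no association mentions that term directly, which is the point of shipping the hierarchy alongside the associations.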
The fact that we have two classification systems co-existing (the LinkML is_a hierarchy and the ontology hierarchy as data) is not a cause for concern.
Ontology classes may be LinkML instances
So far, so good. This should be familiar to people who have modeled this kind of ontological association in JSON-Schema or relational databases.
However, this could be confusing for people coming from a particular kind of ontology modeling background, such as OBO. In this community, a phenotype concept like “Seizure” (HP:0001250) denotes a class, and there are many such classes in an ontology. Instances of seizures would be particular occurrences, such as those experienced by an individual at a particular place and time.
But here we are modeling HP:0001250 as an instance. What’s going on?
In fact this is quite straightforward - ontology classes (typically formalized in OWL) and classes in LinkML are not the same thing, despite the name “class”. And instances in LinkML and instances in “realist” OBO ontologies are not the same thing.
Ontology class hierarchies and LinkML class hierarchies need not be mirrored
Next we will look at a more advanced example. Here we will also talk about how what we are modeling is represented in RDF/OWL, so some knowledge of these frameworks helps here.
A model of organisms in LinkML
Consider a schema that models both individual people and organisms, as well as taxonomic concepts such as Homo sapiens or Vertebrate:
classes:
  NamedThing:
    attributes:
      id:
        range: uriorcurie
      label:
  IndividualOrganism:
    is_a: NamedThing
    attributes:
      species:
        range: Species
    examples:
      - description: Seabiscuit the horse
      - description: Napoleon Bonaparte
  OrganismTaxonomicConcept:
    is_a: NamedThing
    abstract: true
    attributes:
      parent_concept:
        range: OrganismTaxonomicConcept
  Species:
    is_a: OrganismTaxonomicConcept
    examples:
      - description: Homo sapiens
      - description: Felis catus
  Genus:
    is_a: OrganismTaxonomicConcept
    examples:
      - description: Homo
      - description: Felis
Note we have decided to make subclasses of a generic taxon concept class for different taxonomic ranks (we only show species and genus, but we could add more).
Individual organisms are connected to species via a `species` attribute, and species are connected up to parent taxa via a `parent_concept` attribute.
IndividualOrganism:
  id: wikidata:Q517
  label: Napoleon Bonaparte
  species: NCBITaxon:9606
Species:
  id: NCBITaxon:9606
  label: Homo sapiens
  parent_concept: NCBITaxon:9605
Note here that in the LinkML model, our classes are IndividualOrganism, Species, Genus, (and potentially other ranks, and a generic grouping of these). Our instances are Napoleon, Homo sapiens, Homo.
When we translate the YAML above to RDF we get:
wikidata:Q517 rdf:type my:IndividualOrganism .
NCBITaxon:9606 rdf:type my:Species .
NCBITaxon:9606 my:parent_concept NCBITaxon:9605 .
NCBITaxon:9605 rdf:type my:Genus .
In OWL terms, this is called the ABox.
Our LinkML schema can also be represented as RDF or OWL (formally, the TBox):
my:IndividualOrganism a owl:Class .
my:Genus a owl:Class .
my:Species a owl:Class .
my:Genus rdfs:subClassOf my:OrganismTaxonomicConcept .
my:Species rdfs:subClassOf my:OrganismTaxonomicConcept .
(omitting some axioms for brevity)
Again, this should not be such a foreign way of modeling things from a standard database perspective. But if you are coming from ontology modeling this could be confusing.
Next, we’ll look at an ontologist’s way to model the same domain. Let’s first summarize the LinkML model:
- Individuals such as Napoleon, as well as taxonomic concepts such as human or cat, are instances
- Individuals such as Napoleon instantiate “individual organism”, whereas taxonomic concepts instantiate Species, Genus, etc.
- We can add more properties and constraints on each LinkML class, e.g.
  - make `species` a required field
  - constrain the parent of `Species` to be a `Genus` rather than any taxonomic concept
  - add appropriate slots to “IndividualOrganism”, e.g. a single-value-per-time geolocation
  - add appropriate slots to taxonomic concepts
    - common name vs scientific name
    - constrain species names to be binomial
    - geolocation ranges
From a LinkML modeling perspective, these additional properties would be Good Things. They allow us to constrain our data model to avoid instance data that is invalid or surprising (for example, Napoleon having a “species” value of “Vertebrate” or “HistoricHuman”).
A model of organisms following ontology conventions
Consider how this is modeled in ontologies in OBO or clinical terminologies like SNOMED or NCIT. In these ontologies, there is neither an “individual organism” class nor classes for ranks like “species”.
Instead there is just a hierarchy of organism OWL classes, increasingly refined:
Organism
  Vertebrate
    Mammalia
      Homo
        Homo sapiens
      Felis
        Felis catus
          Russian blue
(Intermediate nodes omitted for brevity)
There is also nothing formally prohibiting classes such as “FriendlyMammal” or “HistoricHuman”, but by convention the class hierarchy mirrors conventional classifications that mirror phylogeny.
In this model there are no logical elements “species” or “genus”. It’s common practice to include the taxonomic rank as an OWL annotation property. If we want to include these concepts as true first-class logical citizens in an OWL model, then we need to either introduce punning (OWL-DL) or metaclasses (OWL-Full).
In practice, punning or metaclasses are not used much in OWL, so let’s stick with the rank-free model. Formally, concepts like “Homo sapiens” are not in the domain of discourse.
Individual organisms like Napoleon (Q517 in Wikidata) instantiate the classes in the hierarchy:
wikidata:Q517 rdf:type NCBITaxon:9606 .
NCBITaxon:9606 rdfs:subClassOf NCBITaxon:9605 .
Compare to the RDF serialization of the LinkML instances:
wikidata:Q517 my:species NCBITaxon:9606 .
NCBITaxon:9606 my:parent_concept NCBITaxon:9605 .
In this case, `rdf:type` corresponds roughly to the `species` attribute in the LinkML model. It’s not quite the same, as we might have the following OWL:
wikidata:Q517 rdf:type NCBITaxon:9605 . ## Homo
This is valid (and entailed) but less specific. Note that this would be disallowed in the LinkML model, which intentionally forces the data provider to provide a species-level taxon node ID rather than any other taxon ID.
In the RDF model we might even have:
wikidata:Q517 rdf:type my:HistoricPerson .
my:HistoricPerson rdfs:subClassOf NCBITaxon:9606 .
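To see why the less specific type statements are entailed, here is a minimal Python sketch of the standard RDFS subclass rule: if x has type C and C is a subclass of D, then x has type D. It works over plain tuples rather than an RDF library, so it is only an approximation of what a real reasoner does, and `entailed_types` is a hypothetical name. The triples combine the statements shown above.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples = {
    ("wikidata:Q517", RDF_TYPE, "my:HistoricPerson"),
    ("my:HistoricPerson", SUBCLASS_OF, "NCBITaxon:9606"),
    ("NCBITaxon:9606", SUBCLASS_OF, "NCBITaxon:9605"),
}


def entailed_types(triples):
    """Apply 'x type C, C subClassOf D => x type D' until nothing new is added."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        superclasses = [(s, o) for s, p, o in inferred if p == SUBCLASS_OF]
        # Iterate over a copy, since new type triples are added inside the loop.
        for subject, predicate, cls in list(inferred):
            if predicate != RDF_TYPE:
                continue
            for sub, sup in superclasses:
                new_triple = (subject, RDF_TYPE, sup)
                if sub == cls and new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return {(s, o) for s, p, o in inferred if p == RDF_TYPE}


types = entailed_types(triples)
for subject, cls in sorted(types):
    print(subject, "rdf:type", cls)
```

Napoleon ends up typed as a historic person, as Homo sapiens, and as Homo. In the LinkML model, by contrast, the `species` slot holds one value that must be an instance of `Species`.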
Aligning the LinkML model with the ontological model
Note also the correspondence between the OWL SubClassOf axiom and the `parent_concept` attribute in our LinkML model. These would correspond even further if we extended our model to other taxonomic ranks.
We could map these using `slot_uri`:
classes:
  NamedThing:
    attributes:
      id:
        range: uriorcurie
      label:
  IndividualOrganism:
    class_uri: NCBITaxon:1  ## root node of NCBI taxonomy
    is_a: NamedThing
    attributes:
      species:
        range: Species
        slot_uri: rdf:type  ## map species to instantiation predicate
    examples:
      - description: Seabiscuit the horse
      - description: Napoleon Bonaparte
  OrganismTaxonomicConcept:
    is_a: NamedThing
    abstract: true
    attributes:
      parent_concept:
        range: OrganismTaxonomicConcept
        slot_uri: rdfs:subClassOf  ## map parent_concept to subsumption
  Species:
    is_a: OrganismTaxonomicConcept
    examples:
      - description: Homo sapiens
      - description: Felis catus
  Genus:
    is_a: OrganismTaxonomicConcept
    examples:
      - description: Homo
      - description: Felis
The LinkML instances now serialize as:
wikidata:Q517 rdf:type NCBITaxon:1 .
wikidata:Q517 rdf:type NCBITaxon:9606 .
NCBITaxon:9606 rdf:type my:Species .
NCBITaxon:9606 rdfs:subClassOf NCBITaxon:9605 .
NCBITaxon:9605 rdf:type my:Genus .
Viewed through the lens of RDF/OWL this is potentially confusing. Under OWL2 Description Logic semantics, we have introduced punning, and under OWL-Full we have metaclasses. The latter approach is quite common in knowledge bases such as Wikidata.
Separate models
We can imagine people getting confused, and making incorrect inferences such as the following:
Homo sapiens is a Species
Species is a Genus
Therefore, Homo sapiens is a Genus
Clearly this is wrong. In fact, this entailment is not justified under either the LinkML model or the RDF/OWL model (whether punning or metaclasses are used).
The mistake is confusing the different levels of modeling.
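One way to keep the levels apart is to store instantiation and subsumption as two separate relations and only ever chain them in one direction: from an instance to its direct types, then upward through superclasses. The Python sketch below does this for the aligned triples shown earlier, plus the schema-level subclass axioms. It is an informal illustration with hypothetical names, not an OWL reasoner.

```python
# rdf:type statements from the aligned serialization.
instance_of = {
    "wikidata:Q517": {"NCBITaxon:1", "NCBITaxon:9606"},
    "NCBITaxon:9606": {"my:Species"},
    "NCBITaxon:9605": {"my:Genus"},
}

# rdfs:subClassOf statements, from both the data and the schema.
subclass_of = {
    "NCBITaxon:9606": {"NCBITaxon:9605"},
    "my:Species": {"my:OrganismTaxonomicConcept"},
    "my:Genus": {"my:OrganismTaxonomicConcept"},
}


def all_types(node):
    """Direct types of a node plus every superclass of those types."""
    result = set()
    stack = list(instance_of.get(node, ()))
    while stack:
        cls = stack.pop()
        if cls not in result:
            result.add(cls)
            stack.extend(subclass_of.get(cls, ()))
    return result


homo_sapiens_types = all_types("NCBITaxon:9606")
napoleon_types = all_types("wikidata:Q517")
print(sorted(homo_sapiens_types))
print(sorted(napoleon_types))
```

Homo sapiens is typed as a species and as a taxonomic concept, but never as a genus: being a subclass of Homo does not let it inherit the types that Homo has as an individual.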
When should hierarchies be mirrored?
It should be clear that LinkML (and more generally, schema and shape frameworks such as JSON-Schema, SHACL, and so on) and formal OWL modeling are distinct. By keeping these separate, we avoid problems.
However, there are some cases where hierarchies in our data model do trivially mirror our ontological hierarchies. There are some schemas and data models that also resemble upper ontologies:
- schema.org for everyday concepts like Person, CreativeWork
- biolink for biological concepts like Gene, Chemical, Disease
- chemrof for chemical concepts like atom, isotope, molecule
In the case of schema.org, most elements can do double duty as ontology classes compatible with OBO-style realist modeling (intended to model the world scientifically) as well as schema classes (intended to model how we exchange data about the things in the world).
However, this can get quite nuanced. Sometimes there are classifications that make sense in one perspective and not in the other.
The modeling of personhood in ontologies can get quite involved. Some ontologies will treat Person as a subclass of Homo sapiens (which is scientifically valid but from a modeling perspective mixes two separate concerns); other ontologies may represent personhood as a “role”, which complicates things if you want to have straightforward connections between concepts like “Person” and “Address”.
This gets even more nuanced with biomedical concepts, where we have to deal with multiple interlinked ontological debates about modeling concepts like Gene and Allele, and whether these are classes or instances. Most bio-ontologies eliminate the concept of “levels” in hierarchies, so the concepts “eukaryotic gene”, “gene”, “human Shh gene” and “human Shh gene with foo variant” are all valid gene concepts, just at different levels of the hierarchy.
Additionally, ontologists have a habit of grouping unlike entities or separating like concepts, on the basis of upper ontologies.
A full discussion of these issues is well outside the scope of this guide.
From a modeling perspective, the key points are:
- use the appropriate modeling framework for the problem at hand
- mirror hierarchies where appropriate
- do not assume hierarchies must be mirrored