# Using ontology terms as values in data

LinkML provides a flexible way of modeling data. LinkML allows for the optional use
of *ontologies*, *vocabularies*, or *controlled vocabularies* to add semantics to
datamodels, for example, by mapping classes or slots to external terms.

This howto guide deals with another use case, where we want to include ontology
elements as data values in our data model. In formal terms, this is called including
ontology elements *in the domain of discourse*.

This is in principle straightforward - we just treat ontology elements
the same way we would any other identifier or object. However, in some
cases, this can lead to confusion about what the respective roles of
the LinkML schema, data, or ontologies are.

## Motivating Example: associations to ontology terms.

Let's say we want to model associations between genes and
phenotypes. This is a standard use case for biological ontologies -
creating *annotations* that associate some kind of entity with a descriptor.

In the simplest case, this might be communicated by a two-column file:

|Gene|Phenotype|
|---|---|
|PEX1|Seizure|
|PEX1|Hypotonia|

This uses labels, which is not best practice; we could instead do this:

|Gene|Phenotype|
|---|---|
|NCBIGene:5189|HP:0001250|
|NCBIGene:5189|HP:0001252|

Or perhaps a denormalized representation:

|Gene|Gene Label|Phenotype|Phenotype Label|
|---|---|---|---|
|NCBIGene:5189|PEX1|HP:0001250|Seizure|
|NCBIGene:5189|PEX1|HP:0001252|Hypotonia|

This is *denormalized* because we end up repeating values.

If we go with a richer data serialization form like YAML, JSON, RDF,
or a relational database model, we can *normalize* this model. For
YAML/JSON this may be implemented by *referencing* objects in another
collection, like this:

```yaml
associations:
  - gene: NCBIGene:5189
    phenotype: HP:0001250
  - gene: NCBIGene:5189
    phenotype: HP:0001252
genes:
  - id: NCBIGene:5189
    label: PEX1
phenotypes:
  - id: HP:0001250
    label: Seizure
  - id: HP:0001252
    label: Hypotonia
```

However, for now let's return to the simple 2-element model:

|Gene|Phenotype|
|---|---|
|NCBIGene:5189|HP:0001250|
|NCBIGene:5189|HP:0001252|

### Simple schema for pairwise associations

The simplest possible data model that could work for this case is:

```yaml
classes:
 GenePhenotypeAssociation:
   attributes:
     gene:
     phenotype:
```

Note that the schema doesn't care that the phenotypes come from an ontology, or that the genes
come from a standard resource - these are just pieces of data.

However, this isn't quite satisfactory - it allows the data provider to put any free text they like in.
We would like to constrain both `gene` and `phenotype` to be identifiers.

We can do this by specifying a [range](https://w3id.org/linkml/range):

```yaml
classes:
 GenePhenotypeAssociation:
   attributes:
     gene:
       range: uriorcurie
     phenotype:
       range: uriorcurie
```

We can constrain it further still, by including a regexp [pattern](https://w3id.org/linkml/pattern):

```yaml
classes:
 GenePhenotypeAssociation:
   attributes:
     gene:
       range: uriorcurie
       pattern: "NCBIGene:\\d+"
     phenotype:
       range: uriorcurie
       pattern: "HP:\\d+"
```

(obviously this constrains the schema so tightly it can't be used
for other phenotype ontologies, which may or may not be what we want).

So far so good. But what if we want to have a data model where we
can communicate information about the genes and phenotypes themselves,
rather than forcing the client to do an external lookup?

Let's go one step further, and make a [class](https://w3id.org/linkml/ClassDefinition) for gene and phenotype:

```yaml
classes:
 GenePhenotypeAssociation:
   attributes:
     gene:
       range: Gene
     phenotype:
       range: Phenotype
 Gene:
   attributes:
     id:
       range: uriorcurie
       identifier: true
       pattern: "NCBIGene:\\d+"
     label:
 Phenotype:
   attributes:
     id:
       range: uriorcurie
       identifier: true
       pattern: "HP:\\d+"
     label:
```

We can abstract it a bit further to avoid repetition:

```yaml
classes:
 GenePhenotypeAssociation:
   attributes:
     gene:
       range: Gene
     phenotype:
       range: Phenotype
 NamedThing:
   attributes:
     id:
       range: uriorcurie
       identifier: true
     label:
 Gene:
  is_a: NamedThing
  id_prefixes:
    - NCBIGene
 Phenotype:
  is_a: NamedThing
  id_prefixes:
    - HP
```

Note we are taking advantage of the `id_prefixes` metaslot, but
strictly speaking this is weaker than the previous regular expression pattern.

### Adding a container

Let's add a *container* class, to allow us to bundle lists of objects inside a single JSON or YAML document:

```
  Container:
    tree_root: true
    attributes:
      genes:
        range: Gene
        inlined_as_list: true
      phenotypes:
        range: Phenotype
        inlined_as_list: true
      associations:
        range: Association
        inlined_as_list: true  ## not necessary as Association has no id

```

Our container class allows genes, phenotypes, plus associations between them to be transmitted as a single YAML/JSON object/document.

Note that [inlining](https://linkml.io/linkml/schemas/inlining.html) is non-default if a referenced entity has an identifier. This means
that the right way to represent associations is using references (like foreign
keys in a relational database):

```yaml
associations:
  - gene: NCBIGene:5189
    phenotype: HP:0001250
  - gene: NCBIGene:5189
    phenotype: HP:0001252
```

### Example of separate collections

We can optionally communicate information about the referenced entities:

```yaml
associations:
  - gene: NCBIGene:5189
    phenotype: HP:0001250
  - gene: NCBIGene:5189
    phenotype: HP:0001252
genes:
  - id: NCBIGene:5189
    label: PEX1
phenotypes:
  - id: HP:0001250
    label: Seizure
  - id: HP:0001252
    label: Hypotonia
```

## Representing the ontology hierarchy as data

It's common practice to separate the ontology representation from the data,
but in some cases it may be useful to transmit everything using the same
schema, sending both associations and ontology classificiation in one YAML/JSON blob.

Let's do that here, by adding a `parents` slot in the schema:

```yaml
 Phenotype:
  is_a: NamedThing
  attributes:
    parents:
      range: Phenotype
      multiavalued: true
      slot_uri: rdfs:subClassOf
```

Note we could call this whatever we like. We include a [slot_uri](https://w3id.org/linkml/slot_uri) declaration
to indicate that this is equivalent to `rdfs:subClassOf`.

This modified schema allows data like:

```yaml
phenotypes:
  - id: HP:0001250
    label: Seizure
    parents:
      - HP:0012638
  - id: HP:0012638
    label:
      - Abnormal nervous system physiology
    parents:
       ...
```

This is very practical - consumers of the data can consume the
associations and the ontology hierarchy together to perform rollup
operations, etc.

The fact that we have two classification systems co-existing (LinkML
is_a hierarchy and ontology hierarchy as data) is not be a cause
for concern.

### Ontology classes may be LinkML instances

So far, so good. This should so far be familiar to people who have
modeled this kind of ontological association in JSON-Schema, or
relational databases.

However, this could potentially be confusing for people coming from a
particular kind of ontology modeling background, such as OBO. In this
community, a phenotype concepts like "Seizure" (HP:0001250) denotes a
*class*, and there are many such classes in an ontology. Instances of
seizures would be particular instances such as those experienced by an
individual at a particular space and time.

But here we are modeling HP:0001250 as an *instance*. What's going on?

In fact this is quite straightforward - ontology classes (typically
formalized in OWL) and classes in LinkML are not the same thing,
despite the name "class". And instances in LinkML and instances in
"realist" OBO ontologies are not the same thing.

## Ontology class hierarchies and LinkML class hierarchies need not be mirrored

Next we will look at a more advanced example. Here we will also
talk about how what we are modeling is represented in RDF/OWL, so some
knowledge of these frameworks helps here.

### A model of organisms in LinkML

Consider a schema that models both individual people and organisms, as well as taxonomic concepts
such as Homo sapiens or Vertebrate:

```yaml
classes:
 NamedThing:
   attributes:
     id:
       range: uriorcurie
     label:
 IndividualOrganism:
  is_a: NamedThing
  attributes:
    species:
      range: Species
  examples:
    - description: Seabiscuit the horse
    - description: Napoleon Bonaparte
 OrganismTaxonomicConcept:
  is_a: NamedThing
  abstract: true
  attributes:
    parent_concept:
      range: OrganismTaxonomicConcept
 Species:
  is_a: OrganismTaxonomicConcept
  examples:
    - description: Homo sapiens
    - description: Felis catus
 Genus:
  is_a: OrganismTaxonomicConcept
  examples:
    - description: Homo
    - description: Felis
```

Note we have decided to make subclasses of a generic taxon concept class for different taxonomic ranks
(we only should species and genus but we could add more).

Individual organisms are connected to species via a `species`
attribute, and species are connected up to parent taxa via a
`parent_concept` attribute.


IndividualOrganism:
```yaml
id: wikidata:Q517
label: Napoleon Bonaparte
species: NCBITaxon:9606
```

Species:
```yaml
id: NCBITaxon:9606
label: Homo sapiens
parent_concept: NCBITaxon:9605
```

Note here that in the LinkML model, our __classes__ are
*IndividualOrganism*, *Species*, *Genus*, (and potentially other
ranks, and a generic grouping of these). Our __instances__ are
Napoleon, Homo sapiens, Homo.

When we translate the YAML above to RDF we get:

```turtle
wikidata:Q517 rdf:type my:IndividualOrganism .
NCBITaxon:9606 rdf:type my:Species .
NCBITaxon:9606 my:parent_concept NCBITaxon:9605
NCBITaxon:9605 rdf:type my:Genus .
```

In OWL terms, this is called the **ABox**

Our LinkML schema can also be represented as RDF or OWL (formally: **TBox**)

```turtle
my:IndividualOrganism a owl:Class .
my:Genus a owl:Class .
my:Species a owl:Class .
my:Genus rdfs:subClassOf my:OrganismTaxonomicConcept
my:Species rdfs:subClassOf my:OrganismTaxonomicConcept
```

(omitting some axioms for brevity)

Again, this should not be such a foreign way of modeling things from a standard database perspective.
But if you are coming from ontology modeling this could be confusing.

Next, we'll look at an ontologist's way to model the same domain. Let's first summarize the LinkML model:

- Individuals such as Napoleon as well as taxonomic concepts such as human or cat are *instances*
- individuals such as Napoleon instantiate "individual organism", whereas taxonomic concepts instantiate Species, Genus, etc
- we can add more properties and constraints on each LinkML class, e.g.
    - make `species` a required field
    - constrain the parent of `Species` to be a `Genus` rather than any taxonomic concept
    - add appropriate slots to "IndividualOrganism", e.g. a single-value-per-time geolocation
    - add appropriate slots to taxonomic concepts
        - common name vs scientific name
        - constrain species names to be binomial
        - geolocation ranges

From a LinkML modeling perspective, these additional properties would be Good Things. They allow
us to constrain our data model to avoid instance data that is invalid or surprising (for example,
Napoleon having a "species" value of "Vertebrate" or "HistoricHuman").

### A model of organisms following ontology conventions

Consider how this is modeled in ontologies in OBO or clinical terminologies like SNOMED or NCIT.
In these ontologies, there is neither a "individual organism" class nor classes for ranks like "species".

Instead there is just a hierarchy of organism OWL classes, increasingly refined:

* Organism
    * Vertebrate
        * Mammalia
            * Homo
                * Homo sapiens
            * Felis
                * Felis catus
                    * Russian blue


(Intermediate nodes omitted for brevity)

There is also nothing formally prohibiting classes such as
"FriendlyMammal" or "HistoricHuman", but by convention the class
hierarchy mirrors conventional classifications that mirror phylogeny.

In this model there are no logical elements "species" or "genus". It's common practice
to include the taxonomic rank as an OWL *annotation property*. If we want to include these
concepts as true first-class logical citizens in an OWL model, then we need to either introduce
*punning* (OWL-DL) or *metaclasses* (OWL-Full).

In practice, punning or metaclasses are not used much in OWL, so let's stick with the rank-free
model. Formally, concepts like "Homo sapiens" are not in the *domain of discourse*.

Individual organisms like Napoleon (Q517 in Wikidata) instantiate the classes in the hierarchy:

```
wikidata:Q517 rdf:type NCBITaxon:9606 .
NCBITaxon:9606 rdfs:subClassOf NCBITaxon:9605
```

Compare to the RDF serialization of the LinkML instances:

```
wikidata:Q517 my:species NCBITaxon:9606 .
NCBITaxon:9606 my:parent_concept NCBITaxon:9605
```

In this case, `rdf:type` corresponds roughly to the `species` attribute in the LinkML model. It's not quite the same, as we might have the following OWL:

```
wikidata:Q517 rdf:type NCBITaxon:9605 .  ## Homo
```

This is valid (and entailed) but less specific. Note that this would be disallowed
in the LinkML model, which intentionally forces the data provider to provide a species-level
taxon node ID rather than any other taxon ID.

In the RDF model we might even have:


```
wikidata:Q517 rdf:type My:HistoricPerson .
My:HistoricPerson rdfs:subClassOf NCBITaxon:9606 .
```

### Aligning the LinkML model with the ontological model

Note also the correspondence between the owl SubClassOf axiom and the 'parent_concept` attribute in our LinkML model.
These would correspond even further if we extended our model to other taxonomic ranks.

We could map these using `slot_uri`:

```yaml
classes:
 NamedThing:
   attributes:
     id:
       range: uriorcurie
     label:
 IndividualOrganism:
  class_uri: NCBITaxon:1   ## root node of NCBI taxonomy
  is_a: NamedThing
  attributes:
    species:
      range: Species
      slot_uri: rdf:type   ## map species to instantiation predicate
  examples:
    - description: Seabiscuit the horse
    - description: Napoleon Bonaparte
 OrganismTaxonomicConcept:
  is_a: NamedThing
  abstract: true
  attributes:
    parent_concept:
      range: OrganismTaxonomicConcept
      slot_uri: rdfs:subClassOf   ## map parent_concept to subsumption
 Species:
  is_a: OrganismTaxonomicConcept
  examples:
    - description: Homo sapiens
    - description: Felis catus
 Genus:
  is_a: OrganismTaxonomicConcept
  examples:
    - description: Homo
    - description: Felis
```

The LinkML instances now serialize as:

```
wikidata:Q517 rdf:type NCBITaxon:1 .
wikidata:Q517 rdf:type NCBITaxon:9606 .
NCBITaxon:9606 rdf:type my:Species .
NCBITaxon:9606 rdfs:subClassOf NCBITaxon:9605
NCBITaxon:9605 rdf:type my:Genus .
```

Viewed through the lens of RDF/OWL this is potentially
confusing. Under OWL2 Description Logic semantics, we have introduced
*punning*, and under OWL-Full we have *metaclasses*. The latter
approach is quite common in knowledge bases such as Wikidata.

### Separate models

We can imagine people getting confused, and making incorrect inferences
such as the following:

1. Homo sapiens is a Species
2. Species is a Genus
3. Therefore, Homo sapiens is a Genus

Clearly this is wrong. In fact entailment is thankfully not justified
either via the LinkML or via the RDF/OWL (either punning model or metaclass).

The mistake is confusing the different levels of modeling.

## When should hierarchies be mirrored?

It should be clear that LinkML (and more generally, schema and shape
frameworks such as JSON-Schema, SHACL, and so on) and formal OWL
modeling are distinct. By keeping these separate, we avoid problems.

However, there are some cases where hierarchies in our data model do
trivially mirror our ontological hierarchies. There are some schemas
and data models that also resemble upper ontologies.

* schema.org for everyday concepts like Person, CreativeWork
* biolink for biological concepts like Gene, Chemical, Disease
* chemrof for chemical concepts like atom, isotope, molecule

In the case of schema.org, most elements can do double duty as
ontology classes compatible with OBO-style realist modeling (intended
to model the world scientifically) as well as schema classes (intended
to model how we exchange data about the things in the world).

However, this can get quite nuanced. Sometimes there are
classifications that make sense in one perspective and not in the
other.

The modeling of personhood in ontologies can get quite involved. Some
ontologies will treat Person as a subclass of Homo sapiens (which is
scientifically valid but from a modeling perspective mixes two
separate concerns); other ontologies may represent personhood as a
"role", which complicates things if you want to have straightforward
connections between concepts like "Person" and "Address"

This gets even more nuanced with biomedical concepts, where we have to deal with
multiple interlinked ontological debates about modeling concepts like
Gene and Allele, and whether these are classes or instances. Most
bio-ontologies eliminate the concept of "levels" in hierarchies, so
the concepts "eukaroyotic gene", "gene", "human Shh gene" and "human
Shh gene with foo variant" are all valid gene concepts, just at
different levels of the hierarchy.

Additionally, ontologists have a habit of grouping unlike entities or
separating like concepts, on the basis of upper ontologies.

A full discussion of these issues is well outside the scope of this
guide.

From a modeling perspective, the key points are:

- use the appropriate modeling framework for the problem at hand
- mirror hierarchies where appropriate
- do not assume hierarchies must be mirrored