Part 2: Adding a container object#

In part 1 of this tutorial we created a schema for describing a person, and showed how we could use this to validate YAML or JSON files with a single person instance. In this tutorial we address collections and hierarchy.

In practice our data will typically contain multiple instances; for example, we might want to describe a list of persons (people). How do we express that? We need a way to group the instances together. For this purpose, we can define a class with a multivalued slot and use range to specify the type of the objects we want to collect.

More complex data are also often hierarchical. In order to express hierarchies, multivalued slots alone are not enough. We also need a way to mark which class is the root of our hierarchy. In LinkML, the tree_root slot is used to designate a class as the root of a tree structure. Only one class in a schema can be set as root. If more than one class is marked as tree_root, a validation error will occur.

Marking one class to serve as the root of the tree (as “container” of the other classes) is especially important when serializing and deserializing data. The class marked as tree_root will be the top-level object in the serialized data.
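The round trip between objects and serialized data can be sketched in plain Python. This is only an illustration using the standard library, not the LinkML runtime: the `Person` and `Container` dataclasses and the `dump`/`load` helpers are hypothetical stand-ins, with `Container` playing the role of the class marked as tree_root.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class Person:
    # Hypothetical stand-in for a class collected by the root
    id: str
    full_name: str


@dataclass
class Container:
    # Hypothetical stand-in for the class marked as tree_root
    persons: list[Person] = field(default_factory=list)


def dump(root: Container) -> str:
    """Serialize the root object; it becomes the top-level JSON object."""
    return json.dumps(asdict(root), indent=2)


def load(text: str) -> Container:
    """Deserialize by starting from the root and rebuilding nested objects."""
    raw = json.loads(text)
    return Container(persons=[Person(**p) for p in raw.get("persons", [])])


root = Container(
    persons=[
        Person("ORCID:1234", "Clark Kent"),
        Person("ORCID:4567", "Lois Lane"),
    ]
)
text = dump(root)
print(text)
```

Because the loader knows which class sits at the top, it can interpret the top-level keys as that class's slots and rebuild everything nested below it.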

Example data file#

Let’s start with a simple data file that contains more than one instance of person. We choose to structure this as a YAML/JSON dictionary, with an index slot called persons:

data.yaml:

persons:
  - id: ORCID:1234
    full_name: Clark Kent
    age: 32
    phone: 555-555-5555
  - id: ORCID:4567
    full_name: Lois Lane
    age: 33

In Working with Data we will learn how to express such data in TSV format.

Nesting lists of objects#

We can describe this data using the following schema.

personinfo.yaml:

id: https://w3id.org/linkml/examples/personinfo
name: personinfo
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string

classes:
  Person:
    attributes:
      id:
      full_name:
      aliases:
      phone:
      age:
        range: integer
  Container:
    tree_root: true
    attributes:
      persons:
        multivalued: true
        inlined_as_list: true
        range: Person

We introduce a class called Container. This doesn’t necessarily reflect a “real world” entity in our domain; it’s just a convenient holder for our data. Right now the container has only a single attribute/slot called “persons” because it just needs to hold instances of Person. But it could hold other kinds of data, too.

The persons slot of the Container class has three crucial characteristics:

  • it is multivalued - i.e. it holds a list

  • it is inlined - i.e. the values are nested underneath the container

  • the range is Person - i.e. the expected values in the data are persons (people)
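To make these three characteristics concrete, here is a minimal structural check in plain Python. It is a sketch of what the characteristics imply for the shape of the data, not how LinkML validates; `check_container` and `PERSON_KEYS` are hypothetical names, and treating unknown keys as a problem is an assumption of this sketch.

```python
from __future__ import annotations

from typing import Any

# Attribute names of Person, copied by hand from the schema above
PERSON_KEYS = {"id", "full_name", "aliases", "phone", "age"}


def check_container(container: dict[str, Any]) -> list[str]:
    """Return a list of structural problems; an empty list means the shape matches."""
    problems: list[str] = []
    persons = container.get("persons", [])
    # multivalued: the slot holds a list of values
    if not isinstance(persons, list):
        return ["persons must be a list"]
    for i, person in enumerate(persons):
        # inlined: each value is a nested object, not a bare reference
        if not isinstance(person, dict):
            problems.append(f"persons[{i}] must be a nested object")
            continue
        # range Person: each object only uses Person attributes
        unknown = sorted(set(person) - PERSON_KEYS)
        if unknown:
            problems.append(f"persons[{i}] has unknown keys: {unknown}")
    return problems


data = {
    "persons": [
        {"id": "ORCID:1234", "full_name": "Clark Kent", "age": 32, "phone": "555-555-5555"},
        {"id": "ORCID:4567", "full_name": "Lois Lane", "age": 33},
    ]
}
print(check_container(data))                       # []
print(check_container({"persons": "ORCID:1234"}))  # ['persons must be a list']
```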

Moreover, the Container class is also marked as the root class of our model. In this simple schema, setting tree_root is not strictly necessary: LinkML is able to infer that Container is the root class because it is not referenced as a range in any other class. However, it is good practice to mark the root class explicitly nevertheless.
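The inference rule described above can be sketched as follows. This illustrates the rule as stated in this tutorial and is not LinkML's actual implementation; `find_root` and `infer_root_candidates` are hypothetical helpers, and the schema is copied by hand into a Python dict (attributes declared without a body become None).

```python
from __future__ import annotations

from typing import Any, Optional

classes: dict[str, dict[str, Any]] = {
    "Person": {
        "attributes": {
            "id": None,
            "full_name": None,
            "aliases": None,
            "phone": None,
            "age": {"range": "integer"},
        }
    },
    "Container": {
        "tree_root": True,
        "attributes": {
            "persons": {"multivalued": True, "inlined_as_list": True, "range": "Person"}
        },
    },
}


def infer_root_candidates(classes: dict[str, dict[str, Any]]) -> list[str]:
    """Classes that no attribute in the schema uses as its range."""
    referenced = {
        attr.get("range")
        for cls in classes.values()
        for attr in (cls.get("attributes") or {}).values()
        if attr  # skip attributes declared without a body
    }
    return [name for name in classes if name not in referenced]


def find_root(classes: dict[str, dict[str, Any]]) -> Optional[str]:
    """Prefer an explicit tree_root; otherwise fall back to inference."""
    explicit = [name for name, cls in classes.items() if cls.get("tree_root")]
    if len(explicit) > 1:
        raise ValueError(f"more than one tree_root class: {explicit}")
    if explicit:
        return explicit[0]
    candidates = infer_root_candidates(classes)
    return candidates[0] if len(candidates) == 1 else None


print(find_root(classes))  # Container
```

The fallback only gives an answer when exactly one class is unreferenced, which is one reason an explicit tree_root is more robust once a schema grows.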

Later on we will explore these characteristics in more detail.

Validating#

We can validate this to make sure we got it right:

linkml-validate -s personinfo.yaml data.yaml

This should report no errors.

Visualizing#

We can use yUML to visualize the schema. The gen-yuml command can generate REST URLs.

gen-yuml -f yuml personinfo.yaml

Outputs:

https://yuml.me/diagram/nofunky;dir:TB/class/[Container]++- persons 0..*>[Person|id:string %3F;full_name:string %3F;aliases:string %3F;phone:string %3F;age:integer %3F],[Container]

Requesting the URL returns the schema diagram as an SVG image:

[Image: yUML class diagram showing Container with a persons 0..* association to Person]

We can alternatively let yUML generate the visualization in PNG, JPG or PDF format. In this case a download directory must be passed to the command. To download the visualization as the file personinfo.png to the current directory, run:

gen-yuml -f png -d . personinfo.yaml

Besides yUML, LinkML supports visualizations with Mermaid (gen-erdiagram) and PlantUML (gen-plantuml).

Exercises#

  1. Extend the container object to include dataset-level metadata:

    • description of the dataset

    • name of the dataset

  2. Modify the schema to allow multiple aliases

  3. Modify the test dataset to include multiple aliases for Clark Kent: “Superman” and “Man of Steel”

  4. Validate the data

Next#

Next we will explore how to add constraints to the schema.