Porting LinkML tools to other programming languages

Preamble

This is an advanced how-to guide that is not expected to be useful for typical LinkML users or developers.

If you are thinking of porting any part of the LinkML toolchain to another language (R, Rust, Go, …), please contact us first by making an issue in our GitHub tracker.

Introduction

LinkML is a programming language-neutral framework. Using LinkML you can specify a data model (schema, data dictionary) for your project, and use LinkML tools for validating data. The majority of the tools are written in Python, but they can be executed during the development cycle of your project, avoiding Python dependencies in non-Python code.

Before embarking on a porting project, please read the LinkML specification.

Data class generators

For a small but growing set of languages there is support for generating object models for that language. For example, for Python it’s possible to generate either dataclass-compliant classes or Pydantic models.

These classes follow a Data Access Object (DAO) style pattern, and are intended primarily as data holders. Using these in code is optional, but we recommend this as it gives you strong typing and IDE support.

For other languages it should be possible to write your own generators using the flexible Jinja2 template-based generator framework. This may suffice for some use cases.

However, there may still be cases where it is desirable to have full LinkML support within a language other than Python, typically when writing a generic application that is designed to work with any LinkML model.

Examples of these might be:

  1. A generic data editor or visualizer for LinkML models

  2. A code generator for a particular language where one wants to take advantage of features of that language

For more on the first use case, see the tool developer guide.

As an example of the second, consider writing a code generator for Java that creates data access classes, with the ability to customize the generation for different styles of Java annotations (Jackson, Lombok, Hibernate, …) and to choose between Java records, interfaces, or classes.

While this could be done in Python and Jinja2 templates, there is an advantage to leveraging existing tooling in the host language.

Direct YAML access vs higher level access

It may be tempting to take the “easy” approach and simply write code that works directly from LinkML YAML files. Most modern languages have YAML support.

For example, in Python, to find the identifier attribute of a class:

# `schema` is assumed to be the dict obtained by parsing the schema YAML file
def get_identifier_slot_name(class_name: str) -> str:
    for a_name, a in schema["classes"][class_name]["attributes"].items():
        if a["identifier"]:
            return a_name

This is relatively easy but has downsides: the code will raise exceptions if certain values are absent, or if the class name is not present. There is no typing support to tell you if you made a typo in one of the keys.
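
A more defensive version of the direct-access lookup might look like the sketch below. It assumes `schema` is the plain dictionary produced by parsing a schema YAML file; the inline `schema` literal and the two-argument signature are illustrative assumptions, not part of any LinkML library. It still ignores inherited attributes and top-level slots, which is one reason higher-level access is useful.

```python
from typing import Any, Dict, Optional


def get_identifier_slot_name(schema: Dict[str, Any], class_name: str) -> Optional[str]:
    """Return the name of the identifier attribute of a class, or None."""
    # .get() avoids a KeyError when the class or its attributes are missing
    cls = (schema.get("classes") or {}).get(class_name) or {}
    for a_name, a in (cls.get("attributes") or {}).items():
        # an attribute may be declared with an empty body (None) in YAML
        if a and a.get("identifier", False):
            return a_name
    return None


# hypothetical schema, as it might look after parsing the YAML
schema = {
    "classes": {
        "Person": {"attributes": {"id": {"identifier": True}, "name": None}},
        "Event": {},
    }
}

print(get_identifier_slot_name(schema, "Person"))
print(get_identifier_slot_name(schema, "NoSuchClass"))
```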

For this reason, the Python linkml-runtime has a SchemaView object that provides higher level access to LinkML schemas. SchemaView operates over a Python datamodel that is auto-generated from the LinkML metamodel.

At this time, ports of SchemaView to other languages are underway, but they are not yet as feature complete.

Despite its drawbacks, direct access is a good approach if you are targeting one of the simple sub-profiles of LinkML. However, if you are aiming for a more powerful library, this approach has a number of disadvantages.

Definitions

  • LinkML framework: a general purpose programming library for working with LinkML models

  • Target Language: the programming language one wishes to write a LinkML framework for

  • LinkML metamodel: the schema that describes LinkML itself

Overview of Python Framework

Currently there are 3 core repos:

  • linkml-model: this holds the Source of Truth for the metamodel

  • linkml-runtime: a runtime library that is depended on by schema-specific generated Python dataclass libraries

  • linkml: the core tooling, including generators

This guide assumes familiarity with all of the above.

Note that there are no runtime dependencies on linkml-model; this would create a circularity since the metamodel is like any other schema and depends on the runtime. Instead, linkml-model is synced into linkml-runtime.

How to Port to your language of choice

This section provides a general framework for building support for a target language via bootstrapping.

Step 1: Bootstrap metamodel data access classes

Even if you intend to ultimately use the target language for code generation, we strongly recommend you bootstrap by using the existing generators framework. This may take a small amount of Python coding, but the majority of the work involves writing Jinja2 templates, so no Python expertise is required.

Consult some of the existing generators to see how this is done. Be sure to use the newer style SchemaView generators.

For example: javagen

Definitely make an issue on our repo, so you can coordinate with others interested in support for the target language!

When writing the bootstrap generators there are a number of design decisions:

  • Should the generated code be standalone, or should there be runtime dependencies?

  • If there are runtime dependencies, are these on 3rd party libraries or something you will generate?

  • What are the appropriate constructs to map to in the target language?

  • How do you intend to handle serialization and parsing to/from JSON?

  • What is your strategy for validation within the target language?

  • How should imports be handled in a way that comports well with the target language?

There are no right or wrong answers here. A good strategy is to start with the simplest thing that could work and build out from there.

By way of comparison here are some rough descriptions of existing generators and their decisions:

Case study: pydanticgen

pydanticgen: this generates largely standalone Python classes, with only a dependency on the 3rd party Pydantic framework (all classes ultimately inherit from BaseModel).

Case study: pythongen

pythongen: this builds fully-featured Python classes that follow the dataclasses framework, which is part of the Python standard library. These classes are not standalone, however. Let’s examine the header of generated classes:

import dataclasses
import sys
import re
from jsonasobj2 import JsonObj, as_dict
from typing import Optional, List, Union, Dict, ClassVar, Any
from dataclasses import dataclass
from linkml_runtime.linkml_model.meta import EnumDefinition, PermissibleValue, PvFormulaOptions

from linkml_runtime.utils.slot import Slot
from linkml_runtime.utils.metamodelcore import empty_list, empty_dict, bnode
from linkml_runtime.utils.yamlutils import YAMLRoot, extended_str, extended_float, extended_int
from linkml_runtime.utils.dataclass_extensions_376 import dataclasses_init_fn_with_kwargs
from linkml_runtime.utils.formatutils import camelcase, underscore, sfx
from linkml_runtime.utils.enumerations import EnumDefinitionImpl
from rdflib import Namespace, URIRef
from linkml_runtime.utils.curienamespace import CurieNamespace
from linkml_runtime.linkml_model.types import Boolean, Float, Integer, String, Uriorcurie
from linkml_runtime.utils.metamodelcore import Bool, URIorCURIE

Many of these are part of Python:

The standard typing library is used for type hints:

from typing import Optional, List, Union, Dict, ClassVar, Any

The standard Python dataclasses library is used:

import dataclasses
...

@dataclass
class Person(NamedThing):
   ...

Others are dependencies on either the linkml-runtime itself or on external libraries.

All classes inherit from YAMLRoot, which itself inherits from JsonObj in jsonasobj2:

from linkml_runtime.utils.yamlutils import YAMLRoot, extended_str, extended_float, extended_int
...

@dataclass
class NamedThing(YAMLRoot):
   ...

The as_dict function from jsonasobj2 provides a convenient way to normalize inputs to initialization routines:

from jsonasobj2 import JsonObj, as_dict
...

@dataclass
class MedicalEvent(Event):
    ...
    def __post_init__(self, *_: List[str], **kwargs: Dict[str, Any]):
        ...
        if self.diagnosis is not None and not isinstance(self.diagnosis, DiagnosisConcept):
            self.diagnosis = DiagnosisConcept(**as_dict(self.diagnosis))
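
The normalization pattern above can be illustrated with a stripped-down sketch that uses only the standard library. It substitutes plain dict unpacking for as_dict, and the fields on both classes are simplified assumptions rather than what pythongen actually emits.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Union


@dataclass
class DiagnosisConcept:
    id: str
    name: Optional[str] = None


@dataclass
class MedicalEvent:
    # callers may pass either an object or a plain dict (e.g. parsed from JSON)
    diagnosis: Optional[Union[DiagnosisConcept, Dict[str, Any]]] = None

    def __post_init__(self) -> None:
        # coerce a dict into the expected class so later code sees a single type
        if self.diagnosis is not None and not isinstance(self.diagnosis, DiagnosisConcept):
            self.diagnosis = DiagnosisConcept(**self.diagnosis)


# hypothetical input, as it might arrive from a parsed data file
event = MedicalEvent(diagnosis={"id": "CODE:D0001", "name": "headache"})
print(type(event.diagnosis).__name__)
```

A generator for another target language may not need this step at all if the language's parsing library already builds nested objects.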

rdflib.URIRef is used to allow for introspection to retrieve the URI of any LinkML class at runtime:

from rdflib import Namespace, URIRef
...

@dataclass
class Person(NamedThing):
    """
    A person (alive, dead, undead, or fictional).
    """
    _inherited_slots: ClassVar[List[str]] = []

    class_class_uri: ClassVar[URIRef] = SCHEMA.Person
    class_class_curie: ClassVar[str] = "schema:Person"
    class_name: ClassVar[str] = "Person"
    class_model_uri: ClassVar[URIRef] = PERSONINFO.Person

Additionally, if parts of the metamodel are imported (such as linkml:types), this generates a corresponding Python import:

from linkml_runtime.linkml_model.types import Boolean, Date, Float, Integer, String
from linkml_runtime.utils.metamodelcore import Bool, XSDDate
...

@dataclass
class Event(YAMLRoot):
    ...
    started_at_time: Optional[Union[str, XSDDate]] = None

If your schema is modular and imports other schemas, the relevant import statements are added.

The Python generator will also generate subclasses of builtin Python strings. These are useful for typing the reference slot of non-inlined references:

from linkml_runtime.utils.yamlutils import YAMLRoot, extended_str, extended_float, extended_int

class NamedThingId(extended_str):
    pass

class PersonId(NamedThingId):
    pass

...
@dataclass
class FamilialRelationship(Relationship):
    ...
    type: Union[str, "FamilialRelationshipType"] = None
    related_to: Union[str, PersonId] = None
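
The idea behind these identifier classes can be sketched with the standard library alone. This version subclasses plain `str` instead of linkml-runtime's extended_str, and the coercion in `__post_init__` is an illustrative assumption, not a copy of the generated code.

```python
from dataclasses import dataclass


class NamedThingId(str):
    """Identifier of any NamedThing."""


class PersonId(NamedThingId):
    """Identifier of a Person; still an ordinary string at runtime."""


@dataclass
class FamilialRelationship:
    # the slot holds a reference (an identifier), not an inlined Person object
    related_to: PersonId

    def __post_init__(self) -> None:
        # accept a plain str from parsed data and wrap it in the typed identifier
        if not isinstance(self.related_to, PersonId):
            self.related_to = PersonId(self.related_to)


rel = FamilialRelationship(related_to="P:001")
print(isinstance(rel.related_to, PersonId), rel.related_to == "P:001")
```

Because the identifier is still a string, it serializes unchanged, while type checkers can distinguish a PersonId from an arbitrary string.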

All of these decisions lead to a very fully featured set of generated data access classes, but these decisions do not need to be replicated when writing a generator for your target language.

Case study: typescriptgen

See: gen-typescript

In contrast to the Python approaches, this targets TypeScript interfaces as the construct of choice in the target language.

Note that interfaces are “compiled away” but are used for static analysis and IDE support.

This is being used to bootstrap a JavaScript runtime: linkml-runtime.js

Case study: javagen

See: gen-java

Currently the approaches here are under discussion. There are different choices of target language construct:

  • classes

  • interfaces

  • records (introduced as a preview feature in Java 14 and finalized in Java 16)

The current pattern for generation roughly follows the pydanticgen approach of generating largely standalone Java records.

Implementing for your target language

We recommend starting with a generator for simple standalone classes. These do not need to be as fully featured as pythongen. You can treat things like annotating model elements with URIs, providing introspection support, and generating JSON as separate concerns – at least at first.
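
As a rough illustration of how small a first generator can be, the sketch below renders each class of a schema as a standalone Java record. It works on a plain dictionary and uses string formatting purely to stay self-contained; an actual bootstrap generator would more likely read the schema through SchemaView and render Jinja2 templates. The `JAVA_TYPES` mapping, the function names, the default range of `string`, and the example schema are all assumptions made for illustration.

```python
from typing import Any, Dict

# hypothetical mapping from LinkML type names to Java types
JAVA_TYPES = {"string": "String", "integer": "Integer", "boolean": "Boolean"}


def java_type(attr: Dict[str, Any]) -> str:
    """Map an attribute's range to a Java type name."""
    range_name = attr.get("range", "string")  # assumption: default range is string
    # ranges that are not known types are treated as references to other records
    base = JAVA_TYPES.get(range_name, range_name)
    return f"List<{base}>" if attr.get("multivalued") else base


def generate_record(class_name: str, cls: Dict[str, Any]) -> str:
    """Render one class as a Java record declaration (imports are omitted)."""
    attrs = cls.get("attributes") or {}
    fields = ", ".join(f"{java_type(a or {})} {name}" for name, a in attrs.items())
    return f"public record {class_name}({fields}) {{}}"


# hypothetical schema, as it might look after parsing the YAML
schema = {
    "classes": {
        "Person": {
            "attributes": {
                "id": {"identifier": True},
                "age": {"range": "integer"},
                "aliases": {"multivalued": True},
            }
        }
    }
}

for name, cls in schema["classes"].items():
    print(generate_record(name, cls))
```

URIs, introspection, and JSON handling are deliberately left out, in line with treating them as separate concerns at first.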

Language-specific considerations

This guide is intended to be general purpose, but it is important to consider local idioms for any given language, and how that community may use it.

For example, R developers do not typically develop database infrastructure the same way other LinkML programmatic users might. The underlying R dataframe model is a slightly different way of looking at data than a classic OO, RDF, or JSON approach.

Rust has many considerations around safety and sharing of objects. When converting a LinkML model to Rust data structures, some of these considerations are lower level than what is captured in LinkML.

For some languages, it may suffice to simply generate helper methods and vocabulary helpers, or validators that work with existing structures.

Step 2: Generate data access classes for the metamodel

This will give you the equivalent of meta.py and types.py in linkml-runtime.

For example, if your target language is C and you decided on structs as the target construct, you will have structs with names like ClassDefinition and SchemaDefinition.

Depending on the approach you took in Step 1, this may also include a way to instantiate this target language datamodel from YAML or JSON.

(This will vary tremendously with the target language. For example, for Java we are relying on Jackson annotations.)

If you have that, then you should have the ability to do the equivalent of:

from linkml_runtime.loaders import yaml_loader
from linkml_runtime.linkml_model.meta import SchemaDefinition

schema = yaml_loader.load('my-schema.yaml', target=SchemaDefinition)
# now do things with the schema
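
To make the shape of this step concrete, here is a minimal sketch of what "metamodel classes plus a loader" amounts to, written with the standard library only. The three classes are heavily cut-down stand-ins for the real metamodel classes, which have many more fields; `load_schema` is a hypothetical helper, and JSON is used instead of YAML only to avoid a third-party parser.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SlotDefinition:
    name: str
    range: Optional[str] = None
    identifier: bool = False


@dataclass
class ClassDefinition:
    name: str
    is_a: Optional[str] = None
    attributes: Dict[str, SlotDefinition] = field(default_factory=dict)


@dataclass
class SchemaDefinition:
    id: str
    name: str
    classes: Dict[str, ClassDefinition] = field(default_factory=dict)


def load_schema(text: str) -> SchemaDefinition:
    """Parse JSON text into typed objects; each dict key supplies the element name."""
    raw: Dict[str, Any] = json.loads(text)
    classes: Dict[str, ClassDefinition] = {}
    for cname, c in (raw.get("classes") or {}).items():
        c = c or {}
        attrs = {
            aname: SlotDefinition(
                name=aname,
                range=(a or {}).get("range"),
                identifier=(a or {}).get("identifier", False),
            )
            for aname, a in (c.get("attributes") or {}).items()
        }
        classes[cname] = ClassDefinition(name=cname, is_a=c.get("is_a"), attributes=attrs)
    return SchemaDefinition(id=raw["id"], name=raw["name"], classes=classes)


# hypothetical schema document
doc = '{"id": "https://example.org/s", "name": "s", "classes": {"Person": {"attributes": {"id": {"identifier": true}}}}}'
schema = load_schema(doc)
print(schema.classes["Person"].attributes["id"].identifier)
```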

Step 3: Implement core logic for derived schemas

See: Part 4 of the specification

By this stage you can load and dump schema objects in a type-safe way. It may be tempting to just go off and implement things with this.

However, you will also want to implement the core logic of LinkML as defined in the LinkML specification, under “Derived Schemas”. You can do this simply by implementing the spec in your target language, but a more pragmatic approach is to port the Python class SchemaView.

SchemaView is a class in linkml-runtime here: https://github.com/linkml/linkml-runtime

Note that you may only need to implement this for a simple profile of LinkML before moving to the next step. As an example, see the SchemaView.js class in linkml-runtime.js.

This may include:

  • basic inheritance methods using is_a and mixins

  • induced class slots/attributes
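
The two items above can be sketched for a very simple profile as follows. This is not the algorithm from the specification or from SchemaView: it handles only attributes (no top-level slots or slot_usage), and the breadth-first ordering and "nearer class wins" merge rule are simplifying assumptions. The function names and example schema are hypothetical.

```python
from typing import Any, Dict, List


def class_ancestors(schema: Dict[str, Any], class_name: str) -> List[str]:
    """Return the class followed by its is_a parents and mixins, nearest first."""
    seen: List[str] = []
    queue = [class_name]
    while queue:
        name = queue.pop(0)
        if name in seen:
            continue  # also guards against accidental cycles
        seen.append(name)
        cls = schema["classes"].get(name) or {}
        if cls.get("is_a"):
            queue.append(cls["is_a"])
        queue.extend(cls.get("mixins") or [])
    return seen


def induced_attributes(schema: Dict[str, Any], class_name: str) -> Dict[str, Dict[str, Any]]:
    """Merge attributes down the hierarchy; nearer classes override farther ones."""
    merged: Dict[str, Dict[str, Any]] = {}
    # walk from the most distant ancestor to the class itself
    for name in reversed(class_ancestors(schema, class_name)):
        cls = schema["classes"].get(name) or {}
        for a_name, a in (cls.get("attributes") or {}).items():
            merged[a_name] = {**merged.get(a_name, {}), **(a or {})}
    return merged


# hypothetical schema, as it might look after parsing the YAML
schema = {
    "classes": {
        "NamedThing": {"attributes": {"id": {"identifier": True}, "name": {"range": "string"}}},
        "HasAliases": {"attributes": {"aliases": {"multivalued": True}}},
        "Person": {
            "is_a": "NamedThing",
            "mixins": ["HasAliases"],
            "attributes": {"name": {"required": True}},
        },
    }
}

print(class_ancestors(schema, "Person"))
print(sorted(induced_attributes(schema, "Person")))
```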

Step 4 (optional): Jettison the bootstrap approach

Now that you have a core framework for your target language, even if it is not yet complete, you may wish to jettison the original generator you wrote with Python/Jinja2 and implement it in the target language. This could have advantages such as leveraging code generation frameworks in that target language (this is the case for Java, which has a rich codegen framework).

At this stage you may wish to revisit some design decisions in the original generator, and make the generated code more fully featured like pythongen.

Future Steps

The current LinkML toolchain has many features that you may wish to port over, but not everything needs to be ported.

LinkML follows a “parasitization” strategy where we can leverage other tool chains.

For example, implementing a validator in the target language may not be the highest priority if you can leverage the existing JSON-Schema generator and use existing JSON Schema validators for the target language. The resulting validator may not be as complete, but it may be sufficient for your purposes.

Similarly, to do something like dynamically convert objects instantiated in the target language to RDF and back, it may be sufficient to rely on the existing JSON-LD context generator.

Summary

This guide is by its nature high level, and various decisions will be determined by specific features of the target language or even by developer preference. As we gain more experience and port core LinkML utilities to more target languages, we will update this guide.