Tutorial: Using the Python API

This tutorial walks through the basics of using the LinkML-Store Python API.

We will create an empty database, populate it with some objects, and then query it. We’ll be using the DuckDB adapter.

Create a Client object

A Client acts as a holder for databases.

[26]:
from linkml_store import Client
client = Client()

Attach and populate a DuckDB in-memory database

Here we will create/attach a database to the client, and populate it with some JSON objects. Note that for this example, we don’t provide a schema - this is induced behind the scenes for us.

We will use the duckdb adapter, which is currently the only one in linkml-store. This is a very flexible adapter that can work in-memory or on-disk, and can be used for “classic” relational data as well as complex nested objects.

The attach_database method creates a new database or attaches to an existing one. Here we will specify duckdb with no additional parameters (you could pass a full sqlalchemy URI like duckdb:////tmp/test.db to create a disk-based database). We will also give it the alias mem so we can refer to it later.

[27]:
db = client.attach_database("duckdb", "mem")
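
For comparison, a disk-based database uses the same call with a full URI (a sketch, not executed in this tutorial; the file path is arbitrary):

# Attach a DuckDB database persisted to disk, under the alias "disk"
disk_db = client.attach_database("duckdb:////tmp/test.db", "disk")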

Next we’ll create a Collection and add some objects to it. For the duckdb adapter, a collection corresponds to a table, but linkml-store does not dictate any particular behavior, and different adapters may implement collections differently. In general, though, a collection should be a list of entities of a similar type.

When creating a collection we can also optionally pass an alias, in addition to the type of objects stored in it. Conventionally, this might be the name of the key in an overall container object.

[28]:
pc = db.create_collection("Person", alias="persons")

Now we can add some objects to the collection. We can add them one at a time, or in a batch.

[29]:
pc.insert([
    {"id": "P1", "name": "Jie", "occupation": "StuntDouble", "moon": "Io"},
    {"id": "P2", "name": "Ann", "occupation": "Welder",  "moon": "Io"},
    {"id": "P3", "name": "Joe", "occupation": "Welder",  "moon": "Europa"},
    {"id": "P4", "name": "Baz", "occupation": "Plumber",  "moon": "Europa"},
    {"id": "P5", "name": "Xan", "occupation": "Plumber",  "moon": "Europa"},
    {"id": "P6", "name": "Gav", "occupation": "Plumber",  "moon": "Io"},
    {"id": "P7", "name": "Ann", "occupation": "Mechanic",  "moon": "Io"},
])
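
Objects can also be inserted one at a time; a sketch (not executed here), assuming insert accepts a single dict as well as a list:

# Hypothetical eighth person, inserted individually
pc.insert({"id": "P8", "name": "Kim", "occupation": "Welder", "moon": "Io"})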

Querying

We can query the collection using methods such as find and query. Queries are expressed as mongodb-style key-value pairs; currently only exact matches are supported. The query language is independent of the underlying database, so the same query will work for mongodb, triplestores, etc. later on.

[30]:
qr = pc.find({"moon": "Europa"})

The resulting QueryResult object can be interrogated to get overall row count, actual rows, etc. It is designed to support windowing over larger payloads than the toy example here.

[31]:
qr.num_rows
[31]:
3
[32]:
qr.rows
[32]:
[{'id': 'P3', 'name': 'Joe', 'occupation': 'Welder', 'moon': 'Europa'},
 {'id': 'P4', 'name': 'Baz', 'occupation': 'Plumber', 'moon': 'Europa'},
 {'id': 'P5', 'name': 'Xan', 'occupation': 'Plumber', 'moon': 'Europa'}]
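
As a sketch of the windowing mentioned above, assuming find accepts limit and offset keyword arguments:

# Hypothetical paging: fetch two matching rows, skipping the first
qr_page = pc.find({"moon": "Europa"}, limit=2, offset=1)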
[33]:
qr.rows_dataframe
[33]:
   id    moon name occupation
0  P3  Europa  Joe     Welder
1  P4  Europa  Baz    Plumber
2  P5  Europa  Xan    Plumber
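
Queries with multiple key-value pairs match conjunctively, as in mongodb; for example (a sketch):

# Exact match on both fields; only P2 matches in the data inserted above
pc.find({"moon": "Io", "occupation": "Welder"}).rows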

Faceting

We can also perform faceted queries. Again, behind the scenes, the adapter figures out the best way to do this. For a SQL database this may involve multiple queries, and may be less performant than a dedicated faceting engine such as Solr or Elasticsearch.

[34]:
pc.query_facets(facet_columns=["occupation"])
[34]:
{'occupation': [('Plumber', 3),
  ('Welder', 2),
  ('Mechanic', 1),
  ('StuntDouble', 1)]}
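
Facet counts can also be restricted with a query; a sketch, assuming query_facets accepts a mongodb-style where clause as its first argument:

# Hypothetical filtered facets: occupation counts for people on Io only
pc.query_facets({"moon": "Io"}, facet_columns=["occupation"])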

We can also facet on multiple columns, or specify a combinatorial facet. Here we will get facet counts for different occupations on different moons.

[35]:
pc.query_facets(facet_columns=[("occupation", "moon")])
[35]:
{('occupation', 'moon'): [('Plumber', 'Europa', 2),
  ('StuntDouble', 'Io', 1),
  ('Welder', 'Io', 1),
  ('Mechanic', 'Io', 1),
  ('Welder', 'Europa', 1),
  ('Plumber', 'Io', 1)]}
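
For independent rather than combinatorial counts, the column names would be passed as separate list entries instead of a tuple (a sketch):

# One facet-count list per column, instead of counts over (occupation, moon) pairs
pc.query_facets(facet_columns=["occupation", "moon"])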

Plotting

Although it doesn’t really make sense for such a small dataset, we will plot the facet counts. First we will pivot the data.

[36]:
import pandas as pd

# Provided dictionary
data_dict = pc.query_facets(facet_columns=[("occupation", "moon")])

# Transforming the data into a format suitable for a DataFrame
data_for_df = []
for occupation_moon, values in data_dict.items():
    for occupation, moon, count in values:
        data_for_df.append({
            'Occupation': occupation,
            'Moon': moon,
            'Count': count
        })

# Creating the DataFrame
df = pd.DataFrame(data_for_df)

# Pivoting the DataFrame to get it ready for a stacked bar chart
df_pivoted = df.pivot_table(index='Moon', columns='Occupation', values='Count', fill_value=0)

df_pivoted

[36]:
Occupation  Mechanic  Plumber  StuntDouble  Welder
Moon
Europa           0.0      2.0          0.0     1.0
Io               1.0      1.0          1.0     1.0
[37]:
import matplotlib.pyplot as plt
[38]:
def plot_stacked_counts(pc, col1, col2):
    data_dict = pc.query_facets(facet_columns=[(col1, col2)])
    data_for_df = []
    for pair, values in data_dict.items():
        for v1, v2, count in values:
            data_for_df.append({
                col1: v1,
                col2: v2,
                'Count': count
            })
    df = pd.DataFrame(data_for_df)
    df_pivoted = df.pivot_table(index=col1, columns=col2, values='Count', fill_value=0)
    df_pivoted.plot(kind='bar', stacked=True, figsize=(10, 6))
    plt.title(f'Object count by {col1} and {col2}')
    plt.xlabel(col1)
    plt.ylabel('Objects')
    plt.xticks(rotation=45)
    plt.legend(title=col2)
    plt.show()

plot_stacked_counts(pc, "moon", "occupation")
(Output: a stacked bar chart of object counts per moon, stacked by occupation.)

Complex objects

We can also store complex objects with arbitrary levels of nesting.

For the duckdb adapter, we explicitly do not use the standard linkml relmodel transform; instead we put top-level objects in individual collections, and behind the scenes duckdb uses lists and JSON objects for nesting, rather than rewriting these as normalized tables.

Here we will create a top-level JSON object that lists persons and organizations. Persons can have histories, which are lists of event objects:

[39]:
obj = {
    "persons": [
        {"id": 1, "name": "n1", "history": [
            {"event": "birth", "date": "2021-01-01"},
            {"event": "death", "date": "2021-02-01"},
            {"event": "hired", "date": "2021-02-01", "organization": "Org1"},
        ]},
        {"id": 2, "name": "n2", "age_in_years": 30},
    ],
    "organizations": [
        {"id": "Org1", "name": "org1"},
        {"id": "Org2", "name": "org2", "found_date": "2021-01-01"},
    ],
}

[40]:
db = client.attach_database("duckdb", "complex")
[41]:
db.store(obj)
[42]:
db.get_collection("persons").find().rows_dataframe
[42]:
   age_in_years                                            history  id name
0           NaN  [{'event': 'birth', 'date': '2021-01-01'}, {'e...   1   n1
1          30.0                                               None   2   n2
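
store split the container into one collection per top-level key, so the organizations are queryable in the same way (a sketch; the expected rows are the two organizations defined above):

db.get_collection("organizations").find().rows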

Schemas

Every Database has a schema. These can be provided explicitly via a LinkML specification, or they can be induced from data.

In the first example, the schema is induced. We can query it:

[43]:
from linkml_runtime.dumpers import yaml_dumper
print(yaml_dumper.dumps(pc.class_definition()))
name: Person
from_schema: http://example.org/test-schema
attributes:
  id:
    name: id
    from_schema: http://example.org/test-schema
    domain_of:
    - Person
    range: string
    multivalued: false
    inlined: false
  name:
    name: name
    from_schema: http://example.org/test-schema
    domain_of:
    - Person
    range: string
    multivalued: false
    inlined: false
  occupation:
    name: occupation
    from_schema: http://example.org/test-schema
    domain_of:
    - Person
    range: string
    multivalued: false
    inlined: false
  moon:
    name: moon
    from_schema: http://example.org/test-schema
    domain_of:
    - Person
    range: string
    multivalued: false
    inlined: false
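
The induced class definition is a standard linkml-runtime ClassDefinition, so it can also be inspected programmatically (a sketch based on the YAML above):

cd = pc.class_definition()
# Attributes are keyed by slot name; each value is a SlotDefinition
print(cd.attributes["moon"].range)  # -> string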

Asserting schemas

Inducing schemas is useful for quick operations, but in general you are better off specifying a schema up-front. This makes things more explicit, and in the future can be used to do things like validate at time of insertion.
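
As a rough sketch of what an asserted schema might look like: the schema itself is ordinary LinkML YAML, but note that the final attachment call is hypothetical, since this part of the tutorial is still TODO:

from linkml_runtime import SchemaView

# A standard LinkML schema mirroring the induced Person class above
schema_yaml = """
id: http://example.org/test-schema
name: test-schema
default_range: string
classes:
  Person:
    attributes:
      id:
        identifier: true
      name:
      occupation:
      moon:
"""
schema_view = SchemaView(schema_yaml)
# Hypothetical attachment point; consult the linkml-store API docs for the real call:
# db.set_schema_view(schema_view)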

TODO

Other adapters

TODO

Indexing

Collections can also be indexed to support ranked text search. Here we load a small set of country objects from a test file, facet them by continent, and then attach a simple indexer so we can search with free text.

[44]:
from linkml_store.utils.format_utils import load_objects

COUNTRIES = "../../tests/input/countries/countries.jsonl"
objects = load_objects(COUNTRIES)
[45]:
db = client.attach_database("duckdb", "CountryDB")
collection = db.create_collection("Country", alias="countries")
[46]:
collection.insert(objects)
[47]:
collection.query_facets(facet_columns=["continent"])
[47]:
{'continent': [('Asia', 5),
  ('Europe', 5),
  ('Africa', 3),
  ('North America', 3),
  ('Oceania', 2),
  ('South America', 2)]}
[48]:
collection.attach_indexer("simple")
[49]:
sr = collection.search("Countries in the North that speak english and french")
[50]:
sr.rows_dataframe
[50]:
       score            name code           capital      continent                                          languages
0   0.151858          Canada   CA            Ottawa  North America                                  [English, French]
1   0.139609   United States   US  Washington, D.C.  North America                                          [English]
2   0.128295    South Africa   ZA          Pretoria         Africa  [Zulu, Xhosa, Afrikaans, English, Northern Sot...
3   0.101139  United Kingdom   GB            London         Europe                                          [English]
4   0.096154          Mexico   MX       Mexico City  North America                                          [Spanish]
5   0.095924     New Zealand   NZ        Wellington        Oceania                                   [English, Māori]
6   0.091025          France   FR             Paris         Europe                                           [French]
7   0.087383       Argentina   AR      Buenos Aires  South America                                          [Spanish]
8   0.079243       Australia   AU          Canberra        Oceania                                          [English]
9   0.077674          Brazil   BR          Brasília  South America                                       [Portuguese]
10  0.070797         Nigeria   NG             Abuja         Africa                                          [English]
11  0.066827           India   IN         New Delhi           Asia                                   [Hindi, English]
12  0.061174     South Korea   KR             Seoul           Asia                                           [Korean]
13  0.057554           China   CN           Beijing           Asia                                 [Standard Chinese]
14  0.040782           Spain   ES            Madrid         Europe                                          [Spanish]
15  0.040782           Japan   JP             Tokyo           Asia                                         [Japanese]
16  0.040564           Egypt   EG             Cairo         Africa                                           [Arabic]
17  0.040032         Germany   DE            Berlin         Europe                                           [German]
18  0.028916       Indonesia   ID           Jakarta           Asia                                       [Indonesian]
19  0.027985           Italy   IT              Rome         Europe                                          [Italian]