How to use MongoDB with LinkML-Store

LinkML-Store provides a uniform interface across different backends. It allows you to write database-neutral code and operations where it makes sense, and use database-specific code where you need it.

The best-supported backend is DuckDB; MongoDB is the next best supported.

This tutorial walks through using MongoDB via the Python interface. We recommend working through the main tutorial first.

Creating a client and attaching to a database

First we will create a client as normal:

[1]:
from linkml_store import Client

client = Client()

Next we’ll attach to a MongoDB instance. This assumes you already have one running.

[2]:
db = client.attach_database("mongodb://localhost:27017", "test")
[3]:
db.handle
[3]:
'mongodb://localhost:27017'
[4]:
db.metadata.model_dump_json()
[4]:
'{"handle":"mongodb://localhost:27017","alias":"test","schema_location":null,"schema_dict":null,"collections":{},"recreate_if_exists":false,"collection_type_slot":null,"searchable_slots":null,"ensure_referential_integrity":false}'
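
The handle is an ordinary URL whose scheme names the backend. As a minimal sketch of what such a handle encodes, the standard library can split it into its parts. The helper name `describe_handle` is hypothetical, and this is not how linkml-store itself resolves handles:

```python
from urllib.parse import urlparse


def describe_handle(handle: str) -> dict:
    """Split a database handle URL into scheme, host, and port."""
    parts = urlparse(handle)
    return {"scheme": parts.scheme, "host": parts.hostname, "port": parts.port}


info = describe_handle("mongodb://localhost:27017")
print(info)
```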

Creating a collection

We’ll create a simple test collection. The concept of a collection in linkml-store maps directly to MongoDB collections.

[5]:
collection = db.create_collection("test", recreate_if_exists=True)

Preparing data to load

Next we’ll parse an (incomplete) list of countries in JSON-Lines format:

[6]:
COUNTRIES = "../../tests/input/countries/countries.jsonl"
[7]:
from linkml_store.utils.format_utils import load_objects

objects = load_objects(COUNTRIES)
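
For readers unfamiliar with JSON-Lines, the format is one JSON object per line. The following is a standard-library sketch of parsing it, using a made-up two-row sample; `parse_jsonl` is a hypothetical helper and says nothing about how `load_objects` is implemented:

```python
import json
from typing import Any, Dict, List


def parse_jsonl(text: str) -> List[Dict[str, Any]]:
    """Parse JSON-Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Illustrative sample in the same shape as the countries file
sample = (
    '{"name": "Canada", "code": "CA", "languages": ["English", "French"]}\n'
    '{"name": "Mexico", "code": "MX", "languages": ["Spanish"]}\n'
)
rows = parse_jsonl(sample)
print(len(rows), rows[0]["languages"])
```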

Let’s check with pandas just to make sure it looks as expected:

[8]:
import pandas as pd
pd.DataFrame(objects)
[8]:
    name            code  capital           continent      languages
0   United States   US    Washington, D.C.  North America  [English]
1   Canada          CA    Ottawa            North America  [English, French]
2   Mexico          MX    Mexico City       North America  [Spanish]
3   Brazil          BR    Brasília          South America  [Portuguese]
4   Argentina       AR    Buenos Aires      South America  [Spanish]
5   United Kingdom  GB    London            Europe         [English]
6   France          FR    Paris             Europe         [French]
7   Germany         DE    Berlin            Europe         [German]
8   Italy           IT    Rome              Europe         [Italian]
9   Spain           ES    Madrid            Europe         [Spanish]
10  China           CN    Beijing           Asia           [Standard Chinese]
11  Japan           JP    Tokyo             Asia           [Japanese]
12  India           IN    New Delhi         Asia           [Hindi, English]
13  South Korea     KR    Seoul             Asia           [Korean]
14  Indonesia       ID    Jakarta           Asia           [Indonesian]
15  Australia       AU    Canberra          Oceania        [English]
16  New Zealand     NZ    Wellington        Oceania        [English, Māori]
17  Egypt           EG    Cairo             Africa         [Arabic]
18  Nigeria         NG    Abuja             Africa         [English]
19  South Africa    ZA    Pretoria          Africa         [Zulu, Xhosa, Afrikaans, English, Northern Sot...


Inserting objects

We will call insert on the collection to add the objects. Note that we haven’t specified a schema; one will be induced from the data.

[9]:
collection.insert(objects)
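
To give a feel for what inducing a schema from data can involve, here is a rough sketch that scans objects and records, per field, the value types seen and whether the field is multivalued. The function `induce_fields` and its output shape are assumptions made for illustration; linkml-store's actual induction produces a LinkML schema and may apply different rules:

```python
from typing import Any, Dict, List


def induce_fields(objects: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Infer, per field, the value type names seen and whether it is multivalued."""
    fields: Dict[str, Dict[str, Any]] = {}
    for obj in objects:
        for key, value in obj.items():
            multivalued = isinstance(value, list)
            samples = value if multivalued else [value]
            info = fields.setdefault(key, {"types": set(), "multivalued": False})
            # Record the element types, not the container type
            info["types"] |= {type(v).__name__ for v in samples}
            info["multivalued"] = info["multivalued"] or multivalued
    return fields


sample_objects = [
    {"name": "Canada", "code": "CA", "languages": ["English", "French"]},
    {"name": "Mexico", "code": "MX", "languages": ["Spanish"]},
]
induced = induce_fields(sample_objects)
print(induced["languages"])
```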

Let’s check this worked by querying:

[10]:
qr = collection.find()
[11]:
qr.rows_dataframe
[11]:
    name            code  capital           continent      languages
0   United States   US    Washington, D.C.  North America  [English]
1   Canada          CA    Ottawa            North America  [English, French]
2   Mexico          MX    Mexico City       North America  [Spanish]
3   Brazil          BR    Brasília          South America  [Portuguese]
4   Argentina       AR    Buenos Aires      South America  [Spanish]
5   United Kingdom  GB    London            Europe         [English]
6   France          FR    Paris             Europe         [French]
7   Germany         DE    Berlin            Europe         [German]
8   Italy           IT    Rome              Europe         [Italian]
9   Spain           ES    Madrid            Europe         [Spanish]
10  China           CN    Beijing           Asia           [Standard Chinese]
11  Japan           JP    Tokyo             Asia           [Japanese]
12  India           IN    New Delhi         Asia           [Hindi, English]
13  South Korea     KR    Seoul             Asia           [Korean]
14  Indonesia       ID    Jakarta           Asia           [Indonesian]
15  Australia       AU    Canberra          Oceania        [English]
16  New Zealand     NZ    Wellington        Oceania        [English, Māori]
17  Egypt           EG    Cairo             Africa         [Arabic]
18  Nigeria         NG    Abuja             Africa         [English]
19  South Africa    ZA    Pretoria          Africa         [Zulu, Xhosa, Afrikaans, English, Northern Sot...


Queries

We can specify key-value constraints:

[12]:
qr = collection.find({"continent": "Europe"})
[13]:
qr.rows_dataframe
[13]:
   name            code  capital  continent  languages
0  United Kingdom  GB    London   Europe     [English]
1  France          FR    Paris    Europe     [French]
2  Germany         DE    Berlin   Europe     [German]
3  Italy           IT    Rome     Europe     [Italian]
4  Spain           ES    Madrid   Europe     [Spanish]
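
The semantics of a key-value constraint can be sketched in plain Python: an object matches when every key in the constraint equals the corresponding value in the object. This in-memory filter is only an illustration with made-up sample rows; the names `matches` and `find_in_memory` are hypothetical, and the real query is executed by the backend:

```python
from typing import Any, Dict, List


def matches(obj: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """True if every key in `where` has an equal value in `obj`."""
    return all(obj.get(key) == value for key, value in where.items())


def find_in_memory(objects: List[Dict[str, Any]], where: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the objects that satisfy all key-value constraints."""
    return [obj for obj in objects if matches(obj, where)]


sample_rows = [
    {"name": "France", "continent": "Europe"},
    {"name": "Japan", "continent": "Asia"},
    {"name": "Spain", "continent": "Europe"},
]
european = find_in_memory(sample_rows, {"continent": "Europe"})
print([row["name"] for row in european])
```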

Facet counts

We will now do a query fetching facet counts for all fields.

Unlike Solr, MongoDB doesn’t facet natively, but under the hood linkml-store implements the necessary logic.

[14]:
fc = collection.query_facets()
[15]:
fc["continent"]
[15]:
[('Europe', 5),
 ('Asia', 5),
 ('Africa', 3),
 ('North America', 3),
 ('Oceania', 2),
 ('South America', 2)]
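
Conceptually, a facet count tallies how often each distinct value of a field occurs and sorts by frequency. Below is a small in-memory sketch using `collections.Counter` on made-up rows. Counting each element of a list-valued field separately is a choice made for this sketch, and the aggregation pipeline shown is one way the same tally could be expressed in MongoDB; neither is confirmed to be what linkml-store does internally:

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def facet_counts(objects: List[Dict[str, Any]], field: str) -> List[Tuple[Any, int]]:
    """Count occurrences of each value of `field`, most frequent first."""
    counter: Counter = Counter()
    for obj in objects:
        value = obj.get(field)
        # Sketch-level choice: count each element of a list-valued field
        values = value if isinstance(value, list) else [value]
        counter.update(v for v in values if v is not None)
    return counter.most_common()


# A possible MongoDB aggregation for the same tally (illustrative only)
continent_pipeline = [
    {"$group": {"_id": "$continent", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]

facet_sample = [
    {"name": "Canada", "continent": "North America"},
    {"name": "Mexico", "continent": "North America"},
    {"name": "France", "continent": "Europe"},
    {"name": "Germany", "continent": "Europe"},
    {"name": "Spain", "continent": "Europe"},
]
continent_facets = facet_counts(facet_sample, "continent")
print(continent_facets)
```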

Creating an LLM embedding index

We will now attach an indexer. By default the llm indexer uses OpenAI, so you will need an API key:

[16]:
collection.attach_indexer("llm")

We can now query using the index. Note that search terms need only be semantically related; they don’t need to contain the same lexical elements.

[17]:
qr = collection.search("countries with a King or Queen")
qr.rows_dataframe
[17]:
    score     name            code  capital           continent      languages
0   0.770891  United Kingdom  GB    London            Europe         [English]
1   0.758388  Australia       AU    Canberra          Oceania        [English]
2   0.754203  South Korea     KR    Seoul             Asia           [Korean]
3   0.750652  New Zealand     NZ    Wellington        Oceania        [English, Māori]
4   0.750419  United States   US    Washington, D.C.  North America  [English]
5   0.748973  South Africa    ZA    Pretoria          Africa         [Zulu, Xhosa, Afrikaans, English, Northern Sot...
6   0.748322  Canada          CA    Ottawa            North America  [English, French]
7   0.746444  France          FR    Paris             Europe         [French]
8   0.745408  Germany         DE    Berlin            Europe         [German]
9   0.743449  Spain           ES    Madrid            Europe         [Spanish]
10  0.739726  China           CN    Beijing           Asia           [Standard Chinese]
11  0.739504  Nigeria         NG    Abuja             Africa         [English]
12  0.738601  Egypt           EG    Cairo             Africa         [Arabic]
13  0.735424  Brazil          BR    Brasília          South America  [Portuguese]
14  0.735056  Mexico          MX    Mexico City       North America  [Spanish]
15  0.734002  Japan           JP    Tokyo             Asia           [Japanese]
16  0.731288  Argentina       AR    Buenos Aires      South America  [Spanish]
17  0.728014  Indonesia       ID    Jakarta           Asia           [Indonesian]
18  0.724164  India           IN    New Delhi         Asia           [Hindi, English]
19  0.723299  Italy           IT    Rome              Europe         [Italian]

The precise ranking is debatable, but in terms of rough semantic distance the top hit (United Kingdom) is in the right ballpark, at the time of writing.
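
Embedding-based search generally works by comparing a vector for the query against a vector for each indexed object and sorting by similarity. The sketch below assumes the score is a cosine similarity, which this tutorial does not state, and uses tiny made-up two-dimensional vectors in place of real model embeddings. The function names are hypothetical:

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(
    query_vec: Sequence[float],
    indexed: List[Tuple[Dict[str, Any], Sequence[float]]],
) -> List[Tuple[float, Dict[str, Any]]]:
    """Return (score, object) pairs sorted from most to least similar."""
    scored = [(cosine(query_vec, vec), obj) for obj, vec in indexed]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy vectors, invented for illustration; real embeddings have many more dimensions
toy_index = [
    ({"name": "Italy"}, [0.1, 0.9]),
    ({"name": "United Kingdom"}, [0.9, 0.1]),
]
ranked = rank([1.0, 0.0], toy_index)
print([(round(score, 3), obj["name"]) for score, obj in ranked])
```

Because every object receives a score, a search of this kind returns all rows in ranked order rather than a filtered subset, which is consistent with the 20 rows returned above.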

[18]:
qr.num_rows
[18]:
20
[19]:
qr.ranked_rows
[19]:
[(0.7708908770614274,
  {'name': 'United Kingdom',
   'code': 'GB',
   'capital': 'London',
   'continent': 'Europe',
   'languages': ['English']}),
 (0.7583880255490492,
  {'name': 'Australia',
   'code': 'AU',
   'capital': 'Canberra',
   'continent': 'Oceania',
   'languages': ['English']}),
 (0.754202745445488,
  {'name': 'South Korea',
   'code': 'KR',
   'capital': 'Seoul',
   'continent': 'Asia',
   'languages': ['Korean']}),
 (0.7506523769140084,
  {'name': 'New Zealand',
   'code': 'NZ',
   'capital': 'Wellington',
   'continent': 'Oceania',
   'languages': ['English', 'Māori']}),
 (0.7504190890778679,
  {'name': 'United States',
   'code': 'US',
   'capital': 'Washington, D.C.',
   'continent': 'North America',
   'languages': ['English']}),
 (0.7489726600700292,
  {'name': 'South Africa',
   'code': 'ZA',
   'capital': 'Pretoria',
   'continent': 'Africa',
   'languages': ['Zulu',
    'Xhosa',
    'Afrikaans',
    'English',
    'Northern Sotho',
    'Tswana',
    'Southern Sotho',
    'Tsonga',
    'Swazi',
    'Venda',
    'Southern Ndebele']}),
 (0.7483222334041403,
  {'name': 'Canada',
   'code': 'CA',
   'capital': 'Ottawa',
   'continent': 'North America',
   'languages': ['English', 'French']}),
 (0.7464438929713734,
  {'name': 'France',
   'code': 'FR',
   'capital': 'Paris',
   'continent': 'Europe',
   'languages': ['French']}),
 (0.7454078196210195,
  {'name': 'Germany',
   'code': 'DE',
   'capital': 'Berlin',
   'continent': 'Europe',
   'languages': ['German']}),
 (0.7434487849009042,
  {'name': 'Spain',
   'code': 'ES',
   'capital': 'Madrid',
   'continent': 'Europe',
   'languages': ['Spanish']}),
 (0.7397262074302214,
  {'name': 'China',
   'code': 'CN',
   'capital': 'Beijing',
   'continent': 'Asia',
   'languages': ['Standard Chinese']}),
 (0.7395038203235198,
  {'name': 'Nigeria',
   'code': 'NG',
   'capital': 'Abuja',
   'continent': 'Africa',
   'languages': ['English']}),
 (0.7386007424118528,
  {'name': 'Egypt',
   'code': 'EG',
   'capital': 'Cairo',
   'continent': 'Africa',
   'languages': ['Arabic']}),
 (0.7354238434740793,
  {'name': 'Brazil',
   'code': 'BR',
   'capital': 'Brasília',
   'continent': 'South America',
   'languages': ['Portuguese']}),
 (0.7350558425995254,
  {'name': 'Mexico',
   'code': 'MX',
   'capital': 'Mexico City',
   'continent': 'North America',
   'languages': ['Spanish']}),
 (0.7340019061796953,
  {'name': 'Japan',
   'code': 'JP',
   'capital': 'Tokyo',
   'continent': 'Asia',
   'languages': ['Japanese']}),
 (0.7312880542513781,
  {'name': 'Argentina',
   'code': 'AR',
   'capital': 'Buenos Aires',
   'continent': 'South America',
   'languages': ['Spanish']}),
 (0.7280135748889252,
  {'name': 'Indonesia',
   'code': 'ID',
   'capital': 'Jakarta',
   'continent': 'Asia',
   'languages': ['Indonesian']}),
 (0.7241642577932456,
  {'name': 'India',
   'code': 'IN',
   'capital': 'New Delhi',
   'continent': 'Asia',
   'languages': ['Hindi', 'English']}),
 (0.7232991877572457,
  {'name': 'Italy',
   'code': 'IT',
   'capital': 'Rome',
   'continent': 'Europe',
   'languages': ['Italian']})]