How to use MongoDB with LinkML-Store
LinkML-Store provides a uniform interface across different backends. It allows you to write database-neutral code and operations where it makes sense, and use database-specific code where you need it.
The best-supported backend is DuckDB; MongoDB is the next best.
This tutorial walks through using MongoDB via the Python interface. We recommend working through the main tutorial first.
Creating a client and attaching to a database
First we will create a client as normal:
[1]:
from linkml_store import Client
client = Client()
Next we’ll attach to a MongoDB instance. This assumes you already have one running.
[2]:
db = client.attach_database("mongodb://localhost:27017", "test")
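Because the attach step assumes a running server, it can be handy to check first that something is listening on the MongoDB port. The following is a minimal sketch using only the standard library; the helper name `mongodb_reachable` is hypothetical and is not part of linkml-store, and a successful TCP connection only shows that the port is open, not that the service behind it is MongoDB.

```python
import socket


def mongodb_reachable(host: str = "localhost", port: int = 27017, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    Hypothetical helper: it only checks that the port accepts connections.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


# Host and port are taken from the handle used in this tutorial.
print("MongoDB port reachable:", mongodb_reachable("localhost", 27017))
```

If this prints `False`, start MongoDB before calling `attach_database`.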
[3]:
db.handle
[3]:
'mongodb://localhost:27017'
[4]:
db.metadata.model_dump_json()
[4]:
'{"handle":"mongodb://localhost:27017","alias":"test","schema_location":null,"schema_dict":null,"collections":{},"recreate_if_exists":false,"collection_type_slot":null,"searchable_slots":null,"ensure_referential_integrity":false}'
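The metadata dump is a plain JSON string, so it can be inspected with the standard library. The sketch below copies the string printed above verbatim and lists which settings are still unset; the variable names are illustrative only.

```python
import json

# The JSON string printed by db.metadata.model_dump_json() above, copied verbatim.
metadata_json = (
    '{"handle":"mongodb://localhost:27017","alias":"test","schema_location":null,'
    '"schema_dict":null,"collections":{},"recreate_if_exists":false,'
    '"collection_type_slot":null,"searchable_slots":null,'
    '"ensure_referential_integrity":false}'
)

metadata = json.loads(metadata_json)

# Settings whose value is null have not been configured yet.
unset = sorted(key for key, value in metadata.items() if value is None)

print(metadata["alias"])
print(unset)
```

At this point no schema or collections are attached, which matches the empty `collections` entry.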
Creating a collection
We’ll create a simple test collection. The concept of a collection in linkml-store maps directly to a MongoDB collection.
[5]:
collection = db.create_collection("test", recreate_if_exists=True)
Preparing data to load
Next we’ll parse an (incomplete) list of countries in JSON-Lines format:
[6]:
COUNTRIES = "../../tests/input/countries/countries.jsonl"
[7]:
from linkml_store.utils.format_utils import load_objects
objects = load_objects(COUNTRIES)
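For readers unfamiliar with JSON Lines, the format is one JSON object per line. The sketch below parses such text with the standard library; `parse_jsonl` is a hypothetical helper for illustration, not how `load_objects` is implemented, and the two sample records are abbreviated from the countries file.

```python
import json
from typing import Any, Dict, Iterable, List


def parse_jsonl(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse JSON Lines input: one JSON object per non-blank line."""
    return [json.loads(line) for line in lines if line.strip()]


# Two abbreviated sample records in JSON Lines form.
sample = """{"name": "Canada", "code": "CA", "languages": ["English", "French"]}
{"name": "Mexico", "code": "MX", "languages": ["Spanish"]}
"""

rows = parse_jsonl(sample.splitlines())
print(len(rows), rows[0]["languages"])
```

The result is a list of plain dictionaries, which is the shape the rest of this tutorial works with.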
Let’s check with pandas just to make sure it looks as expected:
[8]:
import pandas as pd
pd.DataFrame(objects)
[8]:
 | name | code | capital | continent | languages |
---|---|---|---|---|---|
0 | United States | US | Washington, D.C. | North America | [English] |
1 | Canada | CA | Ottawa | North America | [English, French] |
2 | Mexico | MX | Mexico City | North America | [Spanish] |
3 | Brazil | BR | Brasília | South America | [Portuguese] |
4 | Argentina | AR | Buenos Aires | South America | [Spanish] |
5 | United Kingdom | GB | London | Europe | [English] |
6 | France | FR | Paris | Europe | [French] |
7 | Germany | DE | Berlin | Europe | [German] |
8 | Italy | IT | Rome | Europe | [Italian] |
9 | Spain | ES | Madrid | Europe | [Spanish] |
10 | China | CN | Beijing | Asia | [Standard Chinese] |
11 | Japan | JP | Tokyo | Asia | [Japanese] |
12 | India | IN | New Delhi | Asia | [Hindi, English] |
13 | South Korea | KR | Seoul | Asia | [Korean] |
14 | Indonesia | ID | Jakarta | Asia | [Indonesian] |
15 | Australia | AU | Canberra | Oceania | [English] |
16 | New Zealand | NZ | Wellington | Oceania | [English, Māori] |
17 | Egypt | EG | Cairo | Africa | [Arabic] |
18 | Nigeria | NG | Abuja | Africa | [English] |
19 | South Africa | ZA | Pretoria | Africa | [Zulu, Xhosa, Afrikaans, English, Northern Sot... |
Inserting objects
We will call insert on the collection to add the objects. Note that we haven’t specified a schema; one will be induced from the data.
[9]:
collection.insert(objects)
Let’s check this worked by querying:
[10]:
qr = collection.find()
[11]:
qr.rows_dataframe
[11]:
 | name | code | capital | continent | languages |
---|---|---|---|---|---|
0 | United States | US | Washington, D.C. | North America | [English] |
1 | Canada | CA | Ottawa | North America | [English, French] |
2 | Mexico | MX | Mexico City | North America | [Spanish] |
3 | Brazil | BR | Brasília | South America | [Portuguese] |
4 | Argentina | AR | Buenos Aires | South America | [Spanish] |
5 | United Kingdom | GB | London | Europe | [English] |
6 | France | FR | Paris | Europe | [French] |
7 | Germany | DE | Berlin | Europe | [German] |
8 | Italy | IT | Rome | Europe | [Italian] |
9 | Spain | ES | Madrid | Europe | [Spanish] |
10 | China | CN | Beijing | Asia | [Standard Chinese] |
11 | Japan | JP | Tokyo | Asia | [Japanese] |
12 | India | IN | New Delhi | Asia | [Hindi, English] |
13 | South Korea | KR | Seoul | Asia | [Korean] |
14 | Indonesia | ID | Jakarta | Asia | [Indonesian] |
15 | Australia | AU | Canberra | Oceania | [English] |
16 | New Zealand | NZ | Wellington | Oceania | [English, Māori] |
17 | Egypt | EG | Cairo | Africa | [Arabic] |
18 | Nigeria | NG | Abuja | Africa | [English] |
19 | South Africa | ZA | Pretoria | Africa | [Zulu, Xhosa, Afrikaans, English, Northern Sot... |
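To give a feel for what inducing a schema from schemaless data can mean, here is a minimal sketch that scans the objects and records, per field, the value types seen and whether any value was a list. This is an illustration under simple assumptions, not linkml-store's actual induction algorithm; `induce_fields` is a hypothetical name.

```python
from typing import Any, Dict, List


def induce_fields(objects: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Summarise each field: the Python type names observed and whether it is list-valued."""
    fields: Dict[str, Dict[str, Any]] = {}
    for obj in objects:
        for key, value in obj.items():
            info = fields.setdefault(key, {"types": set(), "multivalued": False})
            if isinstance(value, list):
                # A list value marks the field as multivalued; record element types.
                info["multivalued"] = True
                info["types"].update(type(v).__name__ for v in value)
            else:
                info["types"].add(type(value).__name__)
    return fields


# Two records abbreviated from the countries data.
sample_objects = [
    {"name": "Canada", "code": "CA", "languages": ["English", "French"]},
    {"name": "Mexico", "code": "MX", "languages": ["Spanish"]},
]
fields = induce_fields(sample_objects)
print(fields["languages"])
```

On the countries data, a scan like this would flag `languages` as the only list-valued field.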
Queries
We can specify key-value constraints:
[12]:
qr = collection.find({"continent": "Europe"})
[13]:
qr.rows_dataframe
[13]:
 | name | code | capital | continent | languages |
---|---|---|---|---|---|
0 | United Kingdom | GB | London | Europe | [English] |
1 | France | FR | Paris | Europe | [French] |
2 | Germany | DE | Berlin | Europe | [German] |
3 | Italy | IT | Rome | Europe | [Italian] |
4 | Spain | ES | Madrid | Europe | [Spanish] |
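The semantics of a key-value constraint can be pictured as an equality test on every listed field. The sketch below applies that rule to plain dictionaries in memory; `matches` is a hypothetical helper, and in practice the filtering is delegated to the backend rather than done in Python like this.

```python
from typing import Any, Dict, List


def matches(obj: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Return True if obj has every key in `where` with an equal value."""
    return all(key in obj and obj[key] == value for key, value in where.items())


# A few records abbreviated from the countries data.
sample_objects: List[Dict[str, Any]] = [
    {"name": "France", "code": "FR", "continent": "Europe"},
    {"name": "Japan", "code": "JP", "continent": "Asia"},
    {"name": "Spain", "code": "ES", "continent": "Europe"},
]

european = [obj["name"] for obj in sample_objects if matches(obj, {"continent": "Europe"})]
print(european)
```

Multiple keys in the constraint are combined with a logical AND, and an empty constraint matches every object.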
Facet counts
We will now run a query that fetches facet counts for all fields.
Unlike Solr, MongoDB does not return facet counts as part of an ordinary query, but under the hood linkml-store implements the necessary logic.
[14]:
fc = collection.query_facets()
[15]:
fc["continent"]
[15]:
[('Europe', 5),
('Asia', 5),
('Africa', 3),
('North America', 3),
('Oceania', 2),
('South America', 2)]
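Conceptually, a facet count is a tally of how often each distinct value of a field occurs. The following sketch computes one in memory with `collections.Counter`; it is an illustration of the idea rather than the logic linkml-store runs against MongoDB, and counting each element of a list-valued field separately is a choice made for this sketch.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def facet_counts(objects: List[Dict[str, Any]], field: str) -> List[Tuple[Any, int]]:
    """Count occurrences of each value of `field`, most frequent first."""
    counter: Counter = Counter()
    for obj in objects:
        value = obj.get(field)
        if value is None:
            continue  # Objects without the field contribute nothing.
        # Treat a list-valued field as several values (an assumption of this sketch).
        counter.update(value if isinstance(value, list) else [value])
    return counter.most_common()


sample_objects = [
    {"name": "France", "continent": "Europe", "languages": ["French"]},
    {"name": "Spain", "continent": "Europe", "languages": ["Spanish"]},
    {"name": "Mexico", "continent": "North America", "languages": ["Spanish"]},
]
print(facet_counts(sample_objects, "continent"))
```

The return value has the same shape as the output above: a list of (value, count) pairs sorted by descending count.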
Creating an LLM embedding index
We will now attach an indexer. By default the llm indexer uses OpenAI, so you will need an OpenAI API key:
[16]:
collection.attach_indexer("llm")
We can now query using the index. Note that search terms need only be semantically related to the data; they don’t need to share the same lexical elements.
[17]:
qr = collection.search("countries with a King or Queen")
qr.rows_dataframe
[17]:
 | score | name | code | capital | continent | languages |
---|---|---|---|---|---|---|
0 | 0.770891 | United Kingdom | GB | London | Europe | [English] |
1 | 0.758388 | Australia | AU | Canberra | Oceania | [English] |
2 | 0.754203 | South Korea | KR | Seoul | Asia | [Korean] |
3 | 0.750652 | New Zealand | NZ | Wellington | Oceania | [English, Māori] |
4 | 0.750419 | United States | US | Washington, D.C. | North America | [English] |
5 | 0.748973 | South Africa | ZA | Pretoria | Africa | [Zulu, Xhosa, Afrikaans, English, Northern Sot... |
6 | 0.748322 | Canada | CA | Ottawa | North America | [English, French] |
7 | 0.746444 | France | FR | Paris | Europe | [French] |
8 | 0.745408 | Germany | DE | Berlin | Europe | [German] |
9 | 0.743449 | Spain | ES | Madrid | Europe | [Spanish] |
10 | 0.739726 | China | CN | Beijing | Asia | [Standard Chinese] |
11 | 0.739504 | Nigeria | NG | Abuja | Africa | [English] |
12 | 0.738601 | Egypt | EG | Cairo | Africa | [Arabic] |
13 | 0.735424 | Brazil | BR | Brasília | South America | [Portuguese] |
14 | 0.735056 | Mexico | MX | Mexico City | North America | [Spanish] |
15 | 0.734002 | Japan | JP | Tokyo | Asia | [Japanese] |
16 | 0.731288 | Argentina | AR | Buenos Aires | South America | [Spanish] |
17 | 0.728014 | Indonesia | ID | Jakarta | Asia | [Indonesian] |
18 | 0.724164 | India | IN | New Delhi | Asia | [Hindi, English] |
19 | 0.723299 | Italy | IT | Rome | Europe | [Italian] |
The precise ranking is debatable, but in terms of rough semantic distance the top result is in the right ballpark at the time of writing.
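Embedding-based search generally works by turning the query and each row into a vector and ranking rows by vector similarity. The sketch below assumes cosine similarity as the measure and uses made-up two-dimensional vectors; the actual embedding model, vector size, and scoring used by the llm indexer are not shown in this tutorial, so treat every name and number here as illustrative.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec: List[float], items: Dict[str, List[float]]) -> List[Tuple[float, str]]:
    """Return (score, label) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy vectors standing in for real embeddings.
toy_items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranking = rank([1.0, 0.1], toy_items)
print([label for _, label in ranking])
```

Scores that sit close together, as in the table above, mean the rows are all roughly equally near the query in embedding space, which is why the ordering beyond the first few results should not be over-interpreted.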
[18]:
qr.num_rows
[18]:
20
[19]:
qr.ranked_rows
[19]:
[(0.7708908770614274,
{'name': 'United Kingdom',
'code': 'GB',
'capital': 'London',
'continent': 'Europe',
'languages': ['English']}),
(0.7583880255490492,
{'name': 'Australia',
'code': 'AU',
'capital': 'Canberra',
'continent': 'Oceania',
'languages': ['English']}),
(0.754202745445488,
{'name': 'South Korea',
'code': 'KR',
'capital': 'Seoul',
'continent': 'Asia',
'languages': ['Korean']}),
(0.7506523769140084,
{'name': 'New Zealand',
'code': 'NZ',
'capital': 'Wellington',
'continent': 'Oceania',
'languages': ['English', 'Māori']}),
(0.7504190890778679,
{'name': 'United States',
'code': 'US',
'capital': 'Washington, D.C.',
'continent': 'North America',
'languages': ['English']}),
(0.7489726600700292,
{'name': 'South Africa',
'code': 'ZA',
'capital': 'Pretoria',
'continent': 'Africa',
'languages': ['Zulu',
'Xhosa',
'Afrikaans',
'English',
'Northern Sotho',
'Tswana',
'Southern Sotho',
'Tsonga',
'Swazi',
'Venda',
'Southern Ndebele']}),
(0.7483222334041403,
{'name': 'Canada',
'code': 'CA',
'capital': 'Ottawa',
'continent': 'North America',
'languages': ['English', 'French']}),
(0.7464438929713734,
{'name': 'France',
'code': 'FR',
'capital': 'Paris',
'continent': 'Europe',
'languages': ['French']}),
(0.7454078196210195,
{'name': 'Germany',
'code': 'DE',
'capital': 'Berlin',
'continent': 'Europe',
'languages': ['German']}),
(0.7434487849009042,
{'name': 'Spain',
'code': 'ES',
'capital': 'Madrid',
'continent': 'Europe',
'languages': ['Spanish']}),
(0.7397262074302214,
{'name': 'China',
'code': 'CN',
'capital': 'Beijing',
'continent': 'Asia',
'languages': ['Standard Chinese']}),
(0.7395038203235198,
{'name': 'Nigeria',
'code': 'NG',
'capital': 'Abuja',
'continent': 'Africa',
'languages': ['English']}),
(0.7386007424118528,
{'name': 'Egypt',
'code': 'EG',
'capital': 'Cairo',
'continent': 'Africa',
'languages': ['Arabic']}),
(0.7354238434740793,
{'name': 'Brazil',
'code': 'BR',
'capital': 'Brasília',
'continent': 'South America',
'languages': ['Portuguese']}),
(0.7350558425995254,
{'name': 'Mexico',
'code': 'MX',
'capital': 'Mexico City',
'continent': 'North America',
'languages': ['Spanish']}),
(0.7340019061796953,
{'name': 'Japan',
'code': 'JP',
'capital': 'Tokyo',
'continent': 'Asia',
'languages': ['Japanese']}),
(0.7312880542513781,
{'name': 'Argentina',
'code': 'AR',
'capital': 'Buenos Aires',
'continent': 'South America',
'languages': ['Spanish']}),
(0.7280135748889252,
{'name': 'Indonesia',
'code': 'ID',
'capital': 'Jakarta',
'continent': 'Asia',
'languages': ['Indonesian']}),
(0.7241642577932456,
{'name': 'India',
'code': 'IN',
'capital': 'New Delhi',
'continent': 'Asia',
'languages': ['Hindi', 'English']}),
(0.7232991877572457,
{'name': 'Italy',
'code': 'IT',
'capital': 'Rome',
'continent': 'Europe',
'languages': ['Italian']})]
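Since ranked_rows is a list of (score, row) tuples sorted by descending score, ordinary list operations are enough to post-process it. The sketch below uses the first three entries from the output above, trimmed to two fields each, and keeps only rows above a score threshold; `names_above` and the threshold value are hypothetical choices for illustration.

```python
from typing import Any, Dict, List, Tuple

# First three entries of the output above, with each row trimmed to two fields.
ranked_rows: List[Tuple[float, Dict[str, Any]]] = [
    (0.7708908770614274, {"name": "United Kingdom", "code": "GB"}),
    (0.7583880255490492, {"name": "Australia", "code": "AU"}),
    (0.754202745445488, {"name": "South Korea", "code": "KR"}),
]


def names_above(ranked: List[Tuple[float, Dict[str, Any]]], min_score: float) -> List[str]:
    """Return the names of rows whose score is at least min_score, in ranked order."""
    return [row["name"] for score, row in ranked if score >= min_score]


print(names_above(ranked_rows, 0.755))
```

A fixed cut-off like this is fragile because absolute scores depend on the embedding model; taking the top few rows with a slice such as `ranked_rows[:3]` is often the more robust option.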