How to use Semantic Search
This tutorial will show you how to use indexing and semantic search.
Background
LinkML-Store allows you to compose different indexing strategies with any backend. Currently there are two indexing strategies:
Simple trigram-based
LLM text or image embedding based (using models from OpenAI, HuggingFace, and others)
These indexes can be added to any backend (DuckDB, MongoDB, …).
Additionally, some backends may have their own indexing strategies:
Solr has a number of text-based indexing strategies
ChromaDB can use text-based vector embeddings
Because indexing is decoupled from storage, LinkML-Store lets you mix and match indexing strategies and backends.
This tutorial shows how to use an OpenAI-based embedding strategy in combination with DuckDB.
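To give a feel for the simpler of the two strategies listed above, here is a minimal sketch of trigram-based similarity. It is an illustration of the general technique only, not LinkML-Store's actual implementation; the function names `trigrams` and `cosine` and the padding scheme are assumptions made for this example.

```python
import math
from collections import Counter


def trigrams(text: str) -> Counter:
    """Count overlapping 3-character windows of the lower-cased, padded text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i : i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse trigram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Rank a few labels against a query that matches none of them exactly
labels = ["alcohol dehydrogenase", "L-xylulose reductase", "has member"]
query = trigrams("alcohol dehydrogenases")
ranked = sorted(((cosine(query, trigrams(label)), label) for label in labels), reverse=True)
for score, label in ranked:
    print(f"{score:.3f}  {label}")
```

Trigram similarity rewards shared spelling, so it handles typos and plurals well but cannot relate words that merely mean similar things; that is the gap embedding-based indexes are meant to fill.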
Obtaining upstream files
We will use the OBO Graphs encoding of the Enzyme Commission (EC) database, via biopragmatics.
We will use the pystow library to cache the upstream file.
[1]:
import pystow
path=pystow.ensure("tmp", "eccode.json", url="https://w3id.org/biopragmatics/resources/eccode/eccode.json")
Let’s examine the structure of the JSON. There is a top-level graphs list, each element of which holds a set of nodes and edges:
[2]:
import json
# Use a context manager so the file handle is closed after loading
with open(path) as f:
    graphdoc = json.load(f)
graph = graphdoc["graphs"][0]
[3]:
len(graph["nodes"]), len(graph["edges"])
[3]:
(7177, 506022)
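The same layout can be explored without downloading anything. The sketch below uses a tiny hand-written document that mirrors the graphs/nodes/edges structure; the node fields match those shown later in this tutorial, while the edge fields (`sub`, `pred`, `obj`) and the edge itself are assumptions made for illustration and are not taken from the EC file.

```python
from collections import Counter

OBO = "http://purl.obolibrary.org/obo/"

# Toy document mirroring the graphs -> nodes/edges layout (values are illustrative)
toy_graphdoc = {
    "graphs": [
        {
            "nodes": [
                {"id": OBO + "eccode_1", "lbl": "Oxidoreductases", "type": "CLASS"},
                {"id": OBO + "eccode_1.1", "lbl": "Acting on the CH-OH group of donors", "type": "CLASS"},
                {"id": OBO + "RO_0002351", "lbl": "has member", "type": "PROPERTY"},
            ],
            "edges": [
                # Assumed edge shape: subject, predicate, object
                {"sub": OBO + "eccode_1.1", "pred": "is_a", "obj": OBO + "eccode_1"},
            ],
        }
    ]
}

toy_graph = toy_graphdoc["graphs"][0]

# Count nodes by type and build an id -> label lookup
type_counts = Counter(node["type"] for node in toy_graph["nodes"])
labels_by_id = {node["id"]: node.get("lbl") for node in toy_graph["nodes"]}

print(type_counts)
for edge in toy_graph["edges"]:
    # Print each edge using human-readable labels where available
    print(labels_by_id.get(edge["sub"]), edge["pred"], labels_by_id.get(edge["obj"]))
```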
Storing the JSON
We will create a DuckDB database to insert the JSON objects. We’ll put this in a tmp/ folder.
[4]:
!mkdir -p tmp
[5]:
from linkml_store import Client
client = Client()
db = client.attach_database("duckdb:///tmp/eccode.db", "eccode", recreate_if_exists=True)
We will create a collection for nodes. (We could make a separate collection for edges, but this is less relevant for this tutorial.)
[6]:
nodes_collection = db.create_collection("Node", "nodes")
For demonstration purposes we’ll only store the first 200 entries (it can be slow to index everything via the OpenAI API)
[7]:
nodes_collection.insert(graph["nodes"][0:200])
[8]:
nodes_collection.find(limit=6).rows
[8]:
[{'id': 'http://purl.obolibrary.org/obo/RO_0002327',
'lbl': 'enables',
'type': 'PROPERTY',
'meta': None},
{'id': 'http://purl.obolibrary.org/obo/RO_0002351',
'lbl': 'has member',
'type': 'PROPERTY',
'meta': None},
{'id': 'http://purl.obolibrary.org/obo/eccode_1',
'lbl': 'Oxidoreductases',
'type': 'CLASS',
'meta': None},
{'id': 'http://purl.obolibrary.org/obo/eccode_1.1',
'lbl': 'Acting on the CH-OH group of donors',
'type': 'CLASS',
'meta': None},
{'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1',
'lbl': 'With NAD(+) or NADP(+) as acceptor',
'type': 'CLASS',
'meta': None},
{'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1.1',
'lbl': 'alcohol dehydrogenase',
'type': 'CLASS',
'meta': ['synonyms']}]
Creating an LLMIndexer
We will create an indexer and configure it to cache calls. This means that the second time we run this notebook it will be much faster, since all the embeddings will be cached.
The indexer will index using the lbl field. In OBO Graphs JSON, this is the name/label of the concept.
[9]:
from linkml_store.index.implementations.llm_indexer import LLMIndexer
index = LLMIndexer(name="test", cached_embeddings_database="tmp/llm_cache.db", index_attributes=["lbl"])
[10]:
nodes_collection.attach_indexer(index)
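The benefit of caching can be pictured with a small sketch. This is not how LLMIndexer is implemented internally; `CachedEmbedder` and `fake_embed` are hypothetical names, the "embedding" is a toy hash-derived vector standing in for a remote API call, and SQLite is used here purely as a convenient stand-in for the cache database.

```python
import hashlib
import json
import sqlite3


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a remote embedding call: 4 numbers derived from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:4]]


class CachedEmbedder:
    """Looks up embeddings in a local table before falling back to the (fake) remote call."""

    def __init__(self, db_path: str = ":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (input_text TEXT PRIMARY KEY, vector TEXT)"
        )
        self.remote_calls = 0

    def embed(self, text: str) -> list[float]:
        row = self.conn.execute(
            "SELECT vector FROM cache WHERE input_text = ?", (text,)
        ).fetchone()
        if row is not None:
            # Cache hit: no remote call needed
            return json.loads(row[0])
        # Cache miss: compute the vector, then store it for next time
        self.remote_calls += 1
        vector = fake_embed(text)
        self.conn.execute("INSERT INTO cache VALUES (?, ?)", (text, json.dumps(vector)))
        self.conn.commit()
        return vector


embedder = CachedEmbedder()
first = embedder.embed("alcohol dehydrogenase")
second = embedder.embed("alcohol dehydrogenase")  # served from the cache
print(embedder.remote_calls)
```

Pointing `db_path` at a file rather than `:memory:` would make the cache survive between runs, which is the effect described above for re-running the notebook.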
Searching using the index
Now that we have attached an index, we can use it in semantic search. We’ll search our EC subset nodes collection for the string sugar transporters. Note that this string doesn’t occur precisely in the index, but we can still rank closeness in semantic space.
When using search, the field ranked_rows is populated in the result object. This is a list of (score, object) tuples, which we will look at by translating into a pandas DataFrame:
[11]:
qr = nodes_collection.search("sugar transporters")
[12]:
results = [{"sim": r[0], "id": r[1]["id"], "name": r[1]["lbl"]} for r in qr.ranked_rows]
[13]:
import pandas as pd
df = pd.DataFrame(results)
[14]:
df
[14]:
 | sim | id | name |
---|---|---|---|
0 | 0.811164 | http://purl.obolibrary.org/obo/eccode_1.1.1.22 | UDP-glucose 6-dehydrogenase |
1 | 0.809901 | http://purl.obolibrary.org/obo/eccode_1.1.1.124 | fructose 5-dehydrogenase (NADP(+)) |
2 | 0.808242 | http://purl.obolibrary.org/obo/eccode_1.1.1.10 | L-xylulose reductase |
3 | 0.804669 | http://purl.obolibrary.org/obo/eccode_1.1.1.162 | erythrulose reductase |
4 | 0.804353 | http://purl.obolibrary.org/obo/eccode_1.1.1.271 | GDP-L-fucose synthase |
... | ... | ... | ... |
195 | 0.741834 | http://purl.obolibrary.org/obo/eccode_1.1.1.141 | 15-hydroxyprostaglandin dehydrogenase (NAD(+)) |
196 | 0.738374 | http://purl.obolibrary.org/obo/eccode_1.1.1.147 | 16alpha-hydroxysteroid dehydrogenase |
197 | 0.738128 | http://purl.obolibrary.org/obo/RO_0002351 | has member |
198 | 0.729969 | http://purl.obolibrary.org/obo/eccode_1.1.1.104 | 4-oxoproline reductase |
199 | 0.722231 | http://purl.obolibrary.org/obo/eccode_1.1.1.223 | isopiperitenol dehydrogenase |
200 rows × 3 columns
Even though our dataset had no actual sugar transporters, there are still ranked results, with the top 3 ranked highly by virtue of concerning sugars (even if they are not transporters).
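Note that if we had indexed all of EC, we would see sugar transporters.
Semantic search can also be combined with a filter via the where argument. Here we restrict the results to nodes of type CLASS: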
[15]:
qr = nodes_collection.search("sugar transporters", where={"type": "CLASS"})
qr.ranked_rows[0:3]
[15]:
[(0.8111048198475599,
{'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1.22',
'lbl': 'UDP-glucose 6-dehydrogenase',
'type': 'CLASS',
'meta': None}),
(0.8098110004639347,
{'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1.124',
'lbl': 'fructose 5-dehydrogenase (NADP(+))',
'type': 'CLASS',
'meta': ['synonyms']}),
(0.8081767571833294,
{'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1.10',
'lbl': 'L-xylulose reductase',
'type': 'CLASS',
'meta': None})]
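Conceptually, a filtered semantic search like the one above amounts to scoring each stored embedding against the query embedding and sorting. The sketch below illustrates that idea with made-up 3-dimensional vectors; it assumes cosine similarity as the metric, and `ranked_search` is a hypothetical helper, not the LinkML-Store API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ranked_search(query_vector, indexed_rows, where=None):
    """Return (score, object) tuples, best first, keeping only rows that match `where`."""
    where = where or {}
    hits = [
        (cosine_similarity(query_vector, vector), obj)
        for obj, vector in indexed_rows
        if all(obj.get(key) == value for key, value in where.items())
    ]
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


# Made-up vectors; real embeddings have many more dimensions
indexed_rows = [
    ({"lbl": "UDP-glucose 6-dehydrogenase", "type": "CLASS"}, [0.9, 0.1, 0.2]),
    ({"lbl": "has member", "type": "PROPERTY"}, [0.1, 0.9, 0.1]),
    ({"lbl": "isopiperitenol dehydrogenase", "type": "CLASS"}, [0.3, 0.2, 0.9]),
]
query_vector = [1.0, 0.0, 0.1]

results = ranked_search(query_vector, indexed_rows, where={"type": "CLASS"})
for score, obj in results:
    print(f"{score:.3f}  {obj['lbl']}")
```

Because every row receives a score, a search like this always returns a full ranking, which is why the results earlier in this section contain all 200 rows even though none is an exact match.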
How it works
Let’s peek under the hood to see how this is all implemented in DuckDB.
To do this we’ll connect to the DuckDB instance directly using the sql extension in Jupyter.
Load the extension:
[16]:
%load_ext sql
[17]:
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False
Connect to the DuckDB database.
Note: in general you don’t need to do this; we are only doing it here to show the internals.
[18]:
%sql duckdb:///tmp/eccode.db
Query the nodes table (no index):
[19]:
%%sql
SELECT * FROM nodes;
[19]:
 | id | lbl | type | meta |
---|---|---|---|---|
0 | http://purl.obolibrary.org/obo/RO_0002327 | enables | PROPERTY | NaN |
1 | http://purl.obolibrary.org/obo/RO_0002351 | has member | PROPERTY | NaN |
2 | http://purl.obolibrary.org/obo/eccode_1 | Oxidoreductases | CLASS | NaN |
3 | http://purl.obolibrary.org/obo/eccode_1.1 | Acting on the CH-OH group of donors | CLASS | NaN |
4 | http://purl.obolibrary.org/obo/eccode_1.1.1 | With NAD(+) or NADP(+) as acceptor | CLASS | NaN |
... | ... | ... | ... | ... |
195 | http://purl.obolibrary.org/obo/eccode_1.1.1.284 | S-(hydroxymethyl)glutathione dehydrogenase | CLASS | [synonyms] |
196 | http://purl.obolibrary.org/obo/eccode_1.1.1.285 | 3''-deamino-3''-oxonicotianamine reductase | CLASS | NaN |
197 | http://purl.obolibrary.org/obo/eccode_1.1.1.286 | isocitrate--homoisocitrate dehydrogenase | CLASS | [synonyms] |
198 | http://purl.obolibrary.org/obo/eccode_1.1.1.287 | D-arabinitol dehydrogenase (NADP(+)) | CLASS | [synonyms] |
199 | http://purl.obolibrary.org/obo/eccode_1.1.1.288 | xanthoxin dehydrogenase | CLASS | [synonyms] |
200 rows × 4 columns
Query the index. Behind the scenes, linkml-store creates a table to cache each index for each collection. The names of these tables currently start with internal__index__, followed by the name of the collection’s table (nodes in this example), followed by the name of the index (test in this example):
[21]:
%%sql
SELECT * FROM internal__index__nodes__test;
[21]:
 | id | lbl | type | meta | __index__ |
---|---|---|---|---|---|
0 | http://purl.obolibrary.org/obo/RO_0002327 | enables | PROPERTY | NaN | [-0.021716245, -0.024930306, -0.015913868, -0.... |
1 | http://purl.obolibrary.org/obo/RO_0002351 | has member | PROPERTY | NaN | [-0.03492431, -0.015462456, 0.002913293, -0.02... |
2 | http://purl.obolibrary.org/obo/eccode_1 | Oxidoreductases | CLASS | NaN | [-0.031664208, -0.026391044, 8.377296e-05, -0.... |
3 | http://purl.obolibrary.org/obo/eccode_1.1 | Acting on the CH-OH group of donors | CLASS | NaN | [-0.023240522, -0.019391688, -0.006624823, -0.... |
4 | http://purl.obolibrary.org/obo/eccode_1.1.1 | With NAD(+) or NADP(+) as acceptor | CLASS | NaN | [0.00993415, -0.039508518, 0.023213472, -0.016... |
... | ... | ... | ... | ... | ... |
195 | http://purl.obolibrary.org/obo/eccode_1.1.1.284 | S-(hydroxymethyl)glutathione dehydrogenase | CLASS | [synonyms] | [-0.020920139, -0.0042932644, 0.0039249077, -0... |
196 | http://purl.obolibrary.org/obo/eccode_1.1.1.285 | 3''-deamino-3''-oxonicotianamine reductase | CLASS | NaN | [-0.02360331, -0.025488937, -0.010397324, -0.0... |
197 | http://purl.obolibrary.org/obo/eccode_1.1.1.286 | isocitrate--homoisocitrate dehydrogenase | CLASS | [synonyms] | [-0.019758105, -0.016041763, 0.017833093, -0.0... |
198 | http://purl.obolibrary.org/obo/eccode_1.1.1.287 | D-arabinitol dehydrogenase (NADP(+)) | CLASS | [synonyms] | [-0.012204822, -0.034410186, 0.007919805, -0.0... |
199 | http://purl.obolibrary.org/obo/eccode_1.1.1.288 | xanthoxin dehydrogenase | CLASS | [synonyms] | [-0.012760352, -0.016459865, 0.011843718, -0.0... |
200 rows × 5 columns
We can see that the index duplicates the content of the main table, and adds an additional vector column with the embedding.
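The layout just described can be reproduced in miniature. The sketch below uses Python's built-in sqlite3 module as a stand-in for DuckDB, stores each vector as a JSON string, and uses made-up vectors; the table names follow the naming pattern seen in this tutorial, but everything else is an assumption made for illustration rather than a description of what linkml-store does internally.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Main table, plus an index table with the same columns and one extra vector column
conn.execute("CREATE TABLE nodes (id TEXT, lbl TEXT, type TEXT)")
conn.execute(
    'CREATE TABLE internal__index__nodes__test (id TEXT, lbl TEXT, type TEXT, "__index__" TEXT)'
)

rows = [
    ("http://purl.obolibrary.org/obo/RO_0002327", "enables", "PROPERTY"),
    ("http://purl.obolibrary.org/obo/eccode_1", "Oxidoreductases", "CLASS"),
]
made_up_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # not real embeddings

conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", rows)
conn.executemany(
    "INSERT INTO internal__index__nodes__test VALUES (?, ?, ?, ?)",
    [row + (json.dumps(vector),) for row, vector in zip(rows, made_up_vectors)],
)

main_rows = conn.execute("SELECT id, lbl, type FROM nodes ORDER BY id").fetchall()
index_rows = conn.execute(
    'SELECT id, lbl, type, "__index__" FROM internal__index__nodes__test ORDER BY id'
).fetchall()

# The index table repeats the main table's content and adds the vector
for row in index_rows:
    print(row[1], json.loads(row[3]))
```

Duplicating the object columns next to the vector trades some storage for simplicity: a ranked search can return complete objects from the index table alone, without joining back to the main table.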