How to use Semantic Search

This tutorial will show you how to use indexing and semantic search.

Background

LinkML-Store allows you to compose different indexing strategies with any backend. Currently there are two indexing strategies:

Simple trigram-based
LLM text or image embedding based (using models from OpenAI, HuggingFace, and others)

These indexes can be added to any backend (duckdb, mongo, …).
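To give a feel for what a trigram-based strategy does, here is a minimal sketch in plain Python. It is an illustration only: the helper names `trigrams` and `jaccard` are hypothetical, and the padding and scoring choices are assumptions rather than a description of how LinkML-Store computes trigram similarity.

```python
from typing import Set


def trigrams(text: str) -> Set[str]:
    """Return the set of overlapping 3-character windows of a padded, lower-cased string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: shared trigrams divided by all distinct trigrams."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


labels = ["alcohol dehydrogenase", "L-xylulose reductase", "has member"]
query = trigrams("alcohol dehydrogenases")

# Score every label against the query and sort from most to least similar.
ranked = sorted(((jaccard(query, trigrams(lbl)), lbl) for lbl in labels), reverse=True)
```

A trigram index only rewards shared character sequences, so it handles typos and plurals well but cannot relate "sugar" to "glucose". That gap is what the embedding-based strategy addresses.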

Additionally, some backends have their own indexing strategies:

Solr has a number of text-based indexing strategies
ChromaDB can use text-based vector embeddings

Because indexing strategies and backends are decoupled, you can pick whichever combination suits your data.

This tutorial shows how to use an OpenAI-based embedding strategy in combination with DuckDB.

Obtaining upstream files

We will use the OBO Graphs encoding of the Enzyme Commission (EC) database, via biopragmatics.

We will use the pystow library to cache the upstream file.

[1]:

import pystow
path=pystow.ensure("tmp", "eccode.json", url="https://w3id.org/biopragmatics/resources/eccode/eccode.json")

Let’s examine the structure of the JSON. There is a top-level graphs list, and each graph in it holds a set of nodes and edges:

[2]:

import json

# Use a context manager so the file handle is closed after loading
with open(path) as f:
    graphdoc = json.load(f)
graph = graphdoc["graphs"][0]

[3]:

len(graph["nodes"]), len(graph["edges"])

[3]:

(7177, 506022)
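As a rough picture of the shape being loaded, the sketch below builds a hand-written miniature document with the same nesting. The node entries are copied from this dataset; the single edge is illustrative and is not claimed to appear in the real file. The `sub`, `pred` and `obj` keys follow the OBO Graphs JSON convention for edges.

```python
from collections import Counter

# A miniature stand-in for the downloaded file: graphs -> nodes / edges.
mini_doc = {
    "graphs": [
        {
            "nodes": [
                {"id": "http://purl.obolibrary.org/obo/RO_0002351", "lbl": "has member", "type": "PROPERTY"},
                {"id": "http://purl.obolibrary.org/obo/eccode_1", "lbl": "Oxidoreductases", "type": "CLASS"},
                {"id": "http://purl.obolibrary.org/obo/eccode_1.1", "lbl": "Acting on the CH-OH group of donors", "type": "CLASS"},
            ],
            "edges": [
                # Illustrative edge only (subject, predicate, object).
                {
                    "sub": "http://purl.obolibrary.org/obo/eccode_1.1",
                    "pred": "is_a",
                    "obj": "http://purl.obolibrary.org/obo/eccode_1",
                },
            ],
        }
    ]
}

mini_graph = mini_doc["graphs"][0]

# Count nodes per type, the same kind of quick inspection as len() above.
type_counts = Counter(node["type"] for node in mini_graph["nodes"])
labels_by_id = {node["id"]: node["lbl"] for node in mini_graph["nodes"]}
```

Only the nodes are used in the rest of this tutorial; the edges are left aside.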

Storing the JSON

We will create a DuckDB database to hold the JSON objects. We’ll put this in a tmp/ folder.

[4]:

!mkdir -p tmp

[5]:

from linkml_store import Client

client = Client()
db = client.attach_database("duckdb:///tmp/eccode.db", "eccode", recreate_if_exists=True)

We will create a collection for nodes. (We could make a separate collection for edges, but this is less relevant for this tutorial.)

[6]:

nodes_collection = db.create_collection("Node", "nodes")

For demonstration purposes we’ll only store the first 200 entries (it can be slow to index everything via the OpenAI API).

[7]:

nodes_collection.insert(graph["nodes"][0:200])

[8]:

nodes_collection.find(limit=6).rows

[8]:

[{'id': 'http://purl.obolibrary.org/obo/RO_0002327',
  'lbl': 'enables',
  'type': 'PROPERTY',
  'meta': None},
 {'id': 'http://purl.obolibrary.org/obo/RO_0002351',
  'lbl': 'has member',
  'type': 'PROPERTY',
  'meta': None},
 {'id': 'http://purl.obolibrary.org/obo/eccode_1',
  'lbl': 'Oxidoreductases',
  'type': 'CLASS',
  'meta': None},
 {'id': 'http://purl.obolibrary.org/obo/eccode_1.1',
  'lbl': 'Acting on the CH-OH group of donors',
  'type': 'CLASS',
  'meta': None},
 {'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1',
  'lbl': 'With NAD(+) or NADP(+) as acceptor',
  'type': 'CLASS',
  'meta': None},
 {'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1.1',
  'lbl': 'alcohol dehydrogenase',
  'type': 'CLASS',
  'meta': ['synonyms']}]

Creating an LLMIndexer

We will create an indexer and configure it to cache calls. This means that the second time we run this notebook it will be much faster, since all the embeddings will already be cached.

The indexer will index using the lbl field. In OBO Graphs JSON, this is the name/label of the concept.

[9]:

from linkml_store.index.implementations.llm_indexer import LLMIndexer

index = LLMIndexer(name="test", cached_embeddings_database="tmp/llm_cache.db", index_attributes=["lbl"])

[10]:

nodes_collection.attach_indexer(index)
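The effect of caching embeddings can be sketched with a small wrapper that stores each text and its vector in a SQLite table. This is an assumption-laden illustration, not the LLMIndexer implementation: `CachedEmbedder` and `fake_embed` are hypothetical names, and the table layout is invented for the example.

```python
import json
import sqlite3
from typing import Callable, List


class CachedEmbedder:
    """Wrap an embedding function with a SQLite-backed text -> vector cache."""

    def __init__(self, embed: Callable[[str], List[float]], db_path: str = ":memory:"):
        self.embed = embed
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (text TEXT PRIMARY KEY, vector TEXT)"
        )

    def __call__(self, text: str) -> List[float]:
        row = self.conn.execute(
            "SELECT vector FROM cache WHERE text = ?", (text,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: the model is not called
        vector = self.embed(text)  # cache miss: call the (slow) model once
        self.conn.execute(
            "INSERT INTO cache (text, vector) VALUES (?, ?)", (text, json.dumps(vector))
        )
        self.conn.commit()
        return vector


calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    """Stand-in for a remote embedding call; records each invocation."""
    calls.append(text)
    return [float(len(text)), float(text.count("e"))]


embedder = CachedEmbedder(fake_embed)
first = embedder("alcohol dehydrogenase")
second = embedder("alcohol dehydrogenase")  # served from the cache
```

Passing a file path instead of `:memory:` would make the cache survive between runs, which is the behaviour that speeds up a second execution of the notebook.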

Searching using the index

Now that we have attached an index, we can use it in semantic search. We’ll search our EC subset nodes collection for the string "sugar transporters". Note that this string doesn’t occur verbatim in the index, but we can still rank entries by closeness in semantic space.

When using search, the ranked_rows field of the result object is populated. This is a list of (score, object) tuples, which we will inspect by converting it into a pandas DataFrame:

[11]:

qr = nodes_collection.search("sugar transporters")

[12]:

results = [{"sim": r[0], "id": r[1]["id"], "name": r[1]["lbl"]} for r in qr.ranked_rows]

[13]:

import pandas as pd
df = pd.DataFrame(results)

[14]:

df

[14]:

	sim	id	name
0	0.811164	http://purl.obolibrary.org/obo/eccode_1.1.1.22	UDP-glucose 6-dehydrogenase
1	0.809901	http://purl.obolibrary.org/obo/eccode_1.1.1.124	fructose 5-dehydrogenase (NADP(+))
2	0.808242	http://purl.obolibrary.org/obo/eccode_1.1.1.10	L-xylulose reductase
3	0.804669	http://purl.obolibrary.org/obo/eccode_1.1.1.162	erythrulose reductase
4	0.804353	http://purl.obolibrary.org/obo/eccode_1.1.1.271	GDP-L-fucose synthase
...	...	...	...
195	0.741834	http://purl.obolibrary.org/obo/eccode_1.1.1.141	15-hydroxyprostaglandin dehydrogenase (NAD(+))
196	0.738374	http://purl.obolibrary.org/obo/eccode_1.1.1.147	16alpha-hydroxysteroid dehydrogenase
197	0.738128	http://purl.obolibrary.org/obo/RO_0002351	has member
198	0.729969	http://purl.obolibrary.org/obo/eccode_1.1.1.104	4-oxoproline reductase
199	0.722231	http://purl.obolibrary.org/obo/eccode_1.1.1.223	isopiperitenol dehydrogenase

200 rows × 3 columns

Even though our dataset had no actual sugar transporters, there are still ranked results, with the top 3 ranked highly by virtue of concerning sugars (even if they are not transporters).
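Every row receives a score because ranking compares the query vector against each stored vector, rather than looking for matching words. The sketch below assumes cosine similarity as the measure and uses made-up 3-dimensional vectors; the function names `cosine_similarity` and `rank` are hypothetical, and real embeddings have far more dimensions.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(
    query_vec: Sequence[float],
    indexed: List[Tuple[Dict, Sequence[float]]],
) -> List[Tuple[float, Dict]]:
    """Score every object against the query and return (score, object) pairs, best first."""
    scored = [(cosine_similarity(query_vec, vec), obj) for obj, vec in indexed]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up vectors for illustration; they are not real embeddings.
indexed = [
    ({"lbl": "has member"}, [0.0, 0.2, 0.9]),
    ({"lbl": "UDP-glucose 6-dehydrogenase"}, [0.9, 0.1, 0.0]),
    ({"lbl": "isopiperitenol dehydrogenase"}, [0.4, 0.8, 0.1]),
]
toy_ranked_rows = rank([1.0, 0.0, 0.0], indexed)
```

The result has the same (score, object) shape as ranked_rows, and no object is ever dropped: a poor match simply ends up near the bottom of the list.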

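Note that if we had indexed all of EC, we would see sugar transporters.

We can also combine semantic search with a where clause, here restricting the results to nodes of type CLASS: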

[15]:

qr = nodes_collection.search("sugar transporters", where={"type": "CLASS"})
qr.ranked_rows[0:3]

[15]:

[(0.8111048198475599,
  {'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1.22',
   'lbl': 'UDP-glucose 6-dehydrogenase',
   'type': 'CLASS',
   'meta': None}),
 (0.8098110004639347,
  {'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1.124',
   'lbl': 'fructose 5-dehydrogenase (NADP(+))',
   'type': 'CLASS',
   'meta': ['synonyms']}),
 (0.8081767571833294,
  {'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1.10',
   'lbl': 'L-xylulose reductase',
   'type': 'CLASS',
   'meta': None})]

How it works

Let’s peek under the hood to see how this is all implemented in DuckDB.

To do this we’ll connect to the DuckDB instance directly, using the sql extension in Jupyter.

Load the extension:

[16]:

%load_ext sql

[17]:

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

Connect to the DuckDB database.

Note: in general you don’t need to do this; we are only doing it here to show the internals.

[18]:

%sql duckdb:///tmp/eccode.db

Query the nodes table (no index):

[19]:

%%sql
SELECT * FROM nodes;

[19]:

	id	lbl	type	meta
0	http://purl.obolibrary.org/obo/RO_0002327	enables	PROPERTY	NaN
1	http://purl.obolibrary.org/obo/RO_0002351	has member	PROPERTY	NaN
2	http://purl.obolibrary.org/obo/eccode_1	Oxidoreductases	CLASS	NaN
3	http://purl.obolibrary.org/obo/eccode_1.1	Acting on the CH-OH group of donors	CLASS	NaN
4	http://purl.obolibrary.org/obo/eccode_1.1.1	With NAD(+) or NADP(+) as acceptor	CLASS	NaN
...	...	...	...	...
195	http://purl.obolibrary.org/obo/eccode_1.1.1.284	S-(hydroxymethyl)glutathione dehydrogenase	CLASS	[synonyms]
196	http://purl.obolibrary.org/obo/eccode_1.1.1.285	3''-deamino-3''-oxonicotianamine reductase	CLASS	NaN
197	http://purl.obolibrary.org/obo/eccode_1.1.1.286	isocitrate--homoisocitrate dehydrogenase	CLASS	[synonyms]
198	http://purl.obolibrary.org/obo/eccode_1.1.1.287	D-arabinitol dehydrogenase (NADP(+))	CLASS	[synonyms]
199	http://purl.obolibrary.org/obo/eccode_1.1.1.288	xanthoxin dehydrogenase	CLASS	[synonyms]

200 rows × 4 columns

Query the index. Behind the scenes, linkml-store creates a table to cache each index for each collection. These table names currently start with internal__index__, followed by the name of the collection (here nodes) and then the name of the index (here test).

[21]:

%%sql
SELECT * FROM internal__index__nodes__test;

[21]:

	id	lbl	type	meta	__index__
0	http://purl.obolibrary.org/obo/RO_0002327	enables	PROPERTY	NaN	[-0.021716245, -0.024930306, -0.015913868, -0....
1	http://purl.obolibrary.org/obo/RO_0002351	has member	PROPERTY	NaN	[-0.03492431, -0.015462456, 0.002913293, -0.02...
2	http://purl.obolibrary.org/obo/eccode_1	Oxidoreductases	CLASS	NaN	[-0.031664208, -0.026391044, 8.377296e-05, -0....
3	http://purl.obolibrary.org/obo/eccode_1.1	Acting on the CH-OH group of donors	CLASS	NaN	[-0.023240522, -0.019391688, -0.006624823, -0....
4	http://purl.obolibrary.org/obo/eccode_1.1.1	With NAD(+) or NADP(+) as acceptor	CLASS	NaN	[0.00993415, -0.039508518, 0.023213472, -0.016...
...	...	...	...	...	...
195	http://purl.obolibrary.org/obo/eccode_1.1.1.284	S-(hydroxymethyl)glutathione dehydrogenase	CLASS	[synonyms]	[-0.020920139, -0.0042932644, 0.0039249077, -0...
196	http://purl.obolibrary.org/obo/eccode_1.1.1.285	3''-deamino-3''-oxonicotianamine reductase	CLASS	NaN	[-0.02360331, -0.025488937, -0.010397324, -0.0...
197	http://purl.obolibrary.org/obo/eccode_1.1.1.286	isocitrate--homoisocitrate dehydrogenase	CLASS	[synonyms]	[-0.019758105, -0.016041763, 0.017833093, -0.0...
198	http://purl.obolibrary.org/obo/eccode_1.1.1.287	D-arabinitol dehydrogenase (NADP(+))	CLASS	[synonyms]	[-0.012204822, -0.034410186, 0.007919805, -0.0...
199	http://purl.obolibrary.org/obo/eccode_1.1.1.288	xanthoxin dehydrogenase	CLASS	[synonyms]	[-0.012760352, -0.016459865, 0.011843718, -0.0...

200 rows × 5 columns

We can see that the index duplicates the content of the main table, and adds an additional vector column with the embedding.
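The layout seen in that table can be mimicked in a few lines of plain Python. This is a sketch under stated assumptions: `index_table_name` and `build_index_rows` are hypothetical helpers, the naming pattern is inferred from the table name queried above, and joining the indexed attributes with spaces is a guess at how the text to embed might be assembled.

```python
from typing import Callable, Dict, List


def index_table_name(collection: str, index: str) -> str:
    """Reproduce the naming pattern observed above (an inference, not an API)."""
    return f"internal__index__{collection}__{index}"


def build_index_rows(
    rows: List[Dict],
    attributes: List[str],
    embed: Callable[[str], List[float]],
) -> List[Dict]:
    """Copy every row and add an __index__ column holding its embedding."""
    indexed = []
    for row in rows:
        # Assumption: the text to embed is the indexed attribute values joined by spaces.
        text = " ".join(str(row[a]) for a in attributes if row.get(a) is not None)
        indexed.append({**row, "__index__": embed(text)})
    return indexed


rows = [
    {"id": "http://purl.obolibrary.org/obo/RO_0002327", "lbl": "enables", "type": "PROPERTY", "meta": None},
    {"id": "http://purl.obolibrary.org/obo/eccode_1", "lbl": "Oxidoreductases", "type": "CLASS", "meta": None},
]

# Toy embedding function so the example runs without any model or API key.
index_rows = build_index_rows(rows, ["lbl"], lambda text: [float(len(text))])
table_name = index_table_name("nodes", "test")
```

Duplicating the source columns alongside the vector means a search can return complete objects from the index table alone, at the cost of storing the data twice.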
