How to use Semantic Search

This tutorial will show you how to use indexing and semantic search.

Background

LinkML-Store allows you to compose different indexing strategies with any backend. Currently there are two indexing strategies:

  • Simple trigram-based indexing

  • LLM-based indexing over text or image embeddings (using models from OpenAI, HuggingFace, and others)

These indexes can be added to any backend (duckdb, mongo, …); a sketch of attaching an indexer to a MongoDB-backed collection is shown at the end of this section.

Additionally, some backends have their own native indexing strategies:

  • Solr has a number of text-based indexing strategies

  • ChromaDB can use text-based vector embeddings

LinkML-Store allows for maximum flexibility.

This tutorial shows how to use an OpenAI-based embedding strategy in combination with DuckDB.
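
To illustrate this composability, the same kind of indexer used later in this tutorial can also be attached to a collection in a different backend. Below is a minimal sketch pairing an LLMIndexer with a MongoDB-backed database; the connection URL, database name, and collection names are illustrative assumptions rather than values used elsewhere in this tutorial.

from linkml_store import Client
from linkml_store.index.implementations.llm_indexer import LLMIndexer

client = Client()
# assumes a local MongoDB server; the URL and database name are illustrative
db = client.attach_database("mongodb://localhost:27017/eccode_demo", "eccode_demo")
collection = db.create_collection("Node", "nodes")

# same indexer class as used with DuckDB below
indexer = LLMIndexer(name="demo", cached_embeddings_database="tmp/llm_cache.db", index_attributes=["lbl"])
collection.attach_indexer(indexer)

results = collection.search("sugar transporters")

The indexing strategy is orthogonal to the storage backend: the same attach_indexer and search calls apply regardless of which database the collection lives in.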

Obtaining upstream files

We will use the OBO Graphs encoding of the Enzyme Commission (EC) database, via biopragmatics.

We will use the pystow library to cache the upstream file.

[1]:
import pystow
path = pystow.ensure("tmp", "eccode.json", url="https://w3id.org/biopragmatics/resources/eccode/eccode.json")

Let's examine the structure of the JSON. There is a top-level graphs list; each graph holds a set of nodes and edges:

[2]:
import json

graphdoc = json.load(open(path))
graph = graphdoc["graphs"][0]
[3]:
len(graph["nodes"]), len(graph["edges"])
[3]:
(7177, 506022)

Storing the JSON

We will create a duckdb database into which to insert the JSON objects. We'll put this in a tmp/ folder:

[4]:
!mkdir -p tmp
[5]:
from linkml_store import Client

client = Client()
db = client.attach_database("duckdb:///tmp/eccode.db", "eccode", recreate_if_exists=True)

We will create a collection for nodes. (We could make a separate collection for edges, but this is less relevant for this tutorial.)

[6]:
nodes_collection = db.create_collection("Node", "nodes")

For demonstration purposes we'll only store the first 200 entries (it can be slow to index everything via the OpenAI API).

[7]:
nodes_collection.insert(graph["nodes"][0:200])
[8]:
nodes_collection.find(limit=6).rows
[8]:
[{'id': 'http://purl.obolibrary.org/obo/RO_0002327',
  'lbl': 'enables',
  'type': 'PROPERTY',
  'meta': None},
 {'id': 'http://purl.obolibrary.org/obo/RO_0002351',
  'lbl': 'has member',
  'type': 'PROPERTY',
  'meta': None},
 {'id': 'http://purl.obolibrary.org/obo/eccode_1',
  'lbl': 'Oxidoreductases',
  'type': 'CLASS',
  'meta': None},
 {'id': 'http://purl.obolibrary.org/obo/eccode_1.1',
  'lbl': 'Acting on the CH-OH group of donors',
  'type': 'CLASS',
  'meta': None},
 {'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1',
  'lbl': 'With NAD(+) or NADP(+) as acceptor',
  'type': 'CLASS',
  'meta': None},
 {'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1.1',
  'lbl': 'alcohol dehydrogenase',
  'type': 'CLASS',
  'meta': ['synonyms']}]

Creating an LLMIndexer

We will create an indexer and configure it to cache embedding calls. This means that the second time we run this notebook it will be much faster, since all the embeddings will be cached.

The indexer will index using the lbl field. In OBO Graphs JSON, this is the name/label of the concept.

[9]:
from linkml_store.index.implementations.llm_indexer import LLMIndexer

index = LLMIndexer(name="test", cached_embeddings_database="tmp/llm_cache.db", index_attributes=["lbl"])
[10]:
nodes_collection.attach_indexer(index)

Searching using the index

Now that we have attached an indexer, we can use it in semantic search. We'll search our EC subset nodes collection for the string sugar transporters. Note that this string doesn't occur verbatim in the index, but we can still rank results by closeness in semantic space.

When using search, the ranked_rows field is populated in the result object. This is a list of (score, object) tuples, which we will examine by converting it into a pandas DataFrame:

[11]:
qr = nodes_collection.search("sugar transporters")
[12]:
results = [{"sim": r[0], "id": r[1]["id"], "name": r[1]["lbl"]} for r in qr.ranked_rows]
[13]:
import pandas as pd
df = pd.DataFrame(results)
[14]:
df
[14]:
sim id name
0 0.811164 http://purl.obolibrary.org/obo/eccode_1.1.1.22 UDP-glucose 6-dehydrogenase
1 0.809901 http://purl.obolibrary.org/obo/eccode_1.1.1.124 fructose 5-dehydrogenase (NADP(+))
2 0.808242 http://purl.obolibrary.org/obo/eccode_1.1.1.10 L-xylulose reductase
3 0.804669 http://purl.obolibrary.org/obo/eccode_1.1.1.162 erythrulose reductase
4 0.804353 http://purl.obolibrary.org/obo/eccode_1.1.1.271 GDP-L-fucose synthase
... ... ... ...
195 0.741834 http://purl.obolibrary.org/obo/eccode_1.1.1.141 15-hydroxyprostaglandin dehydrogenase (NAD(+))
196 0.738374 http://purl.obolibrary.org/obo/eccode_1.1.1.147 16alpha-hydroxysteroid dehydrogenase
197 0.738128 http://purl.obolibrary.org/obo/RO_0002351 has member
198 0.729969 http://purl.obolibrary.org/obo/eccode_1.1.1.104 4-oxoproline reductase
199 0.722231 http://purl.obolibrary.org/obo/eccode_1.1.1.223 isopiperitenol dehydrogenase

200 rows × 3 columns

Even though our dataset had no actual sugar transporters, there are still ranked results, with the top 3 ranked highly by virtue of concerning sugars (even if they are not transporters).

Note that if we had indexed all of EC, we would see sugar transporters. Search can also be combined with a where clause, for example restricting results to nodes of type CLASS:

[15]:
qr = nodes_collection.search("sugar transporters", where={"type": "CLASS"})
qr.ranked_rows[0:3]
[15]:
[(0.8111048198475599,
  {'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1.22',
   'lbl': 'UDP-glucose 6-dehydrogenase',
   'type': 'CLASS',
   'meta': None}),
 (0.8098110004639347,
  {'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1.124',
   'lbl': 'fructose 5-dehydrogenase (NADP(+))',
   'type': 'CLASS',
   'meta': ['synonyms']}),
 (0.8081767571833294,
  {'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1.10',
   'lbl': 'L-xylulose reductase',
   'type': 'CLASS',
   'meta': None})]

How it works

Let's peek under the hood to see how this is all implemented in DuckDB.

To do this we'll connect to the duckdb instance directly, using the sql extension in Jupyter.

Load extension:

[16]:
%load_ext sql
Tip: You may define configurations in /Users/cjm/repos/linkml-store/pyproject.toml or /Users/cjm/.jupysql/config.
Please review our configuration guideline.
Did not find user configurations in /Users/cjm/repos/linkml-store/pyproject.toml.
[17]:
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

Connect to the duckdb database

NOTE: in general you don't need to do this; we are just doing it here to show the internals.

[18]:
%sql duckdb:///tmp/eccode.db

Query the nodes table (no index)

[19]:
%%sql
SELECT * FROM nodes;
[19]:
id lbl type meta
0 http://purl.obolibrary.org/obo/RO_0002327 enables PROPERTY NaN
1 http://purl.obolibrary.org/obo/RO_0002351 has member PROPERTY NaN
2 http://purl.obolibrary.org/obo/eccode_1 Oxidoreductases CLASS NaN
3 http://purl.obolibrary.org/obo/eccode_1.1 Acting on the CH-OH group of donors CLASS NaN
4 http://purl.obolibrary.org/obo/eccode_1.1.1 With NAD(+) or NADP(+) as acceptor CLASS NaN
... ... ... ... ...
195 http://purl.obolibrary.org/obo/eccode_1.1.1.284 S-(hydroxymethyl)glutathione dehydrogenase CLASS [synonyms]
196 http://purl.obolibrary.org/obo/eccode_1.1.1.285 3''-deamino-3''-oxonicotianamine reductase CLASS NaN
197 http://purl.obolibrary.org/obo/eccode_1.1.1.286 isocitrate--homoisocitrate dehydrogenase CLASS [synonyms]
198 http://purl.obolibrary.org/obo/eccode_1.1.1.287 D-arabinitol dehydrogenase (NADP(+)) CLASS [synonyms]
199 http://purl.obolibrary.org/obo/eccode_1.1.1.288 xanthoxin dehydrogenase CLASS [synonyms]

200 rows × 4 columns

Query the index

Behind the scenes, linkml-store creates a table to cache each index for each collection. These table names currently start with internal__index__, followed by the collection name, followed by the name of the index.

[21]:
%%sql
SELECT * FROM internal__index__nodes__test;
[21]:
id lbl type meta __index__
0 http://purl.obolibrary.org/obo/RO_0002327 enables PROPERTY NaN [-0.021716245, -0.024930306, -0.015913868, -0....
1 http://purl.obolibrary.org/obo/RO_0002351 has member PROPERTY NaN [-0.03492431, -0.015462456, 0.002913293, -0.02...
2 http://purl.obolibrary.org/obo/eccode_1 Oxidoreductases CLASS NaN [-0.031664208, -0.026391044, 8.377296e-05, -0....
3 http://purl.obolibrary.org/obo/eccode_1.1 Acting on the CH-OH group of donors CLASS NaN [-0.023240522, -0.019391688, -0.006624823, -0....
4 http://purl.obolibrary.org/obo/eccode_1.1.1 With NAD(+) or NADP(+) as acceptor CLASS NaN [0.00993415, -0.039508518, 0.023213472, -0.016...
... ... ... ... ... ...
195 http://purl.obolibrary.org/obo/eccode_1.1.1.284 S-(hydroxymethyl)glutathione dehydrogenase CLASS [synonyms] [-0.020920139, -0.0042932644, 0.0039249077, -0...
196 http://purl.obolibrary.org/obo/eccode_1.1.1.285 3''-deamino-3''-oxonicotianamine reductase CLASS NaN [-0.02360331, -0.025488937, -0.010397324, -0.0...
197 http://purl.obolibrary.org/obo/eccode_1.1.1.286 isocitrate--homoisocitrate dehydrogenase CLASS [synonyms] [-0.019758105, -0.016041763, 0.017833093, -0.0...
198 http://purl.obolibrary.org/obo/eccode_1.1.1.287 D-arabinitol dehydrogenase (NADP(+)) CLASS [synonyms] [-0.012204822, -0.034410186, 0.007919805, -0.0...
199 http://purl.obolibrary.org/obo/eccode_1.1.1.288 xanthoxin dehydrogenase CLASS [synonyms] [-0.012760352, -0.016459865, 0.011843718, -0.0...

200 rows × 5 columns

We can see that the index duplicates the content of the main table, and adds an additional vector column with the embedding.
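
As a rough illustration of how such a table can be used, the similarity scores reported above are consistent with cosine similarity between the embedding of the query string and each stored vector. The sketch below recomputes a ranking directly from the cached table using plain cosine similarity; it reuses a stored vector as a stand-in for a real query embedding (obtaining one would require a call to the embedding model), and it is not necessarily how linkml-store computes scores internally.

import duckdb
import numpy as np

con = duckdb.connect("tmp/eccode.db")
rows = con.execute(
    "SELECT id, lbl, __index__ FROM internal__index__nodes__test"
).fetchall()

def cosine(a, b):
    # cosine similarity between two embedding vectors
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# stand-in for a real query embedding: reuse the vector of the first row
query_vec = rows[0][2]

ranked = sorted(
    ((cosine(query_vec, vec), node_id, lbl) for node_id, lbl, vec in rows),
    reverse=True,
)
ranked[:3]

In practice you would use qr.ranked_rows as shown above rather than querying the internal table directly.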
