{ "cells": [ { "cell_type": "markdown", "source": [ "# How to use Semantic Search\n", "\n", "This tutorial will show you how to use indexing and semantic search.\n", "\n", "## Background\n", "\n", "LinkML-Store allows you to *compose* different indexing strategies with any backend. Currently there are two\n", "indexing strategies:\n", "\n", "- Simple trigram-based\n", "- LLM text or image embedding based (using models from OpenAI, HuggingFace, and others)\n", "\n", "These indexes can be added into any backend (duckdb, mongo, ...)\n", "\n", "Additionally, some backends may have their own indexing strategy\n", "\n", "- Solr has a number of text-based indexing strategies\n", "- ChromaDB can use text-based vector embeddings\n", "\n", "LinkML-Store allows for maximum flexibility.\n", "\n", "This tutorial shows how to use an OpenAI-based embedding strategy in combination with DuckDB." ], "metadata": { "collapsed": false }, "id": "315813bcb5f486a4" }, { "cell_type": "markdown", "source": [ "## Obtaining upstream files\n", "\n", "We will use the OBO Graphs encoding of the Enzyme Commission (EC) database, via [biogragmatics](https://w3id.org/biopragmatics)\n", "\n", "We will use the pystow library to cache the upstream file. " ], "metadata": { "collapsed": false }, "id": "2559f81507693d24" }, { "cell_type": "code", "execution_count": 21, "outputs": [], "source": [ "import pystow\n", "path=pystow.ensure(\"tmp\", \"eccode.json\", url=\"https://w3id.org/biopragmatics/resources/eccode/eccode.json\")" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-05-09T21:34:38.097147Z", "start_time": "2024-05-09T21:34:38.095093Z" } }, "id": "5a0030a3e24b1545" }, { "cell_type": "markdown", "source": [ "Let's examining the structure of the JSON. There is a top level `graphs` index, each of which holds a set of `nodes` and `edges`:" ], "metadata": { "collapsed": false }, "id": "8e046de9070b82c1" }, { "cell_type": "code", "execution_count": 22, "outputs": [], "source": [ "import json\n", "\n", "graphdoc = json.load(open(path))\n", "graph = graphdoc[\"graphs\"][0]" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-05-09T21:34:38.447475Z", "start_time": "2024-05-09T21:34:38.097539Z" } }, "id": "793566db96fc1cd9" }, { "cell_type": "code", "execution_count": 23, "outputs": [ { "data": { "text/plain": "(7177, 506022)" }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(graph[\"nodes\"]), len(graph[\"edges\"])" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-05-09T21:34:38.453170Z", "start_time": "2024-05-09T21:34:38.450138Z" } }, "id": "b3c3cf57c2d9aeed" }, { "cell_type": "markdown", "source": [ "## Storing the JSON\n", "\n", "We will create a duckdb database to insert the JSON objects. We'll put this in a `tmp/` folder" ], "metadata": { "collapsed": false }, "id": "9f11d6d2b7a2c602" }, { "cell_type": "code", "execution_count": 24, "outputs": [], "source": [ "!mkdir -p tmp" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-05-09T21:34:38.582813Z", "start_time": "2024-05-09T21:34:38.462457Z" } }, "id": "cbae2c783889c9b3" }, { "cell_type": "code", "execution_count": 25, "outputs": [], "source": [ "from linkml_store import Client\n", "\n", "client = Client()\n", "db = client.attach_database(\"duckdb:///tmp/eccode.db\", \"eccode\", recreate_if_exists=True)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-05-09T21:34:38.587860Z", "start_time": "2024-05-09T21:34:38.585172Z" } }, "id": "6a8adce3d3ec93c6" }, { "cell_type": "markdown", "source": [ "We will create an index for nodes. (we could make a separate collection for edges, but this is less relevant\n", "for this tutorial)" ], "metadata": { "collapsed": false }, "id": "1dcfaf6e8d4b3bf6" }, { "cell_type": "code", "execution_count": 26, "outputs": [], "source": [ "nodes_collection = db.create_collection(\"Node\", \"nodes\")" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-05-09T21:34:38.591738Z", "start_time": "2024-05-09T21:34:38.588796Z" } }, "id": "4fa95b75cd1f19cf" }, { "cell_type": "markdown", "source": [ "For demonstration purposes we'll only store the first 200 entries (it can be slow to index everything via the OpenAI API)" ], "metadata": { "collapsed": false }, "id": "ecc62533a80b1175" }, { "cell_type": "code", "execution_count": 27, "outputs": [], "source": [ "nodes_collection.insert(graph[\"nodes\"][0:200])" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-05-09T21:34:38.657942Z", "start_time": "2024-05-09T21:34:38.594941Z" } }, "id": "aee6a95b4a86432d" }, { "cell_type": "code", "execution_count": 28, "outputs": [ { "data": { "text/plain": "[{'id': 'http://purl.obolibrary.org/obo/RO_0002327',\n 'lbl': 'enables',\n 'type': 'PROPERTY',\n 'meta': None},\n {'id': 'http://purl.obolibrary.org/obo/RO_0002351',\n 'lbl': 'has member',\n 'type': 'PROPERTY',\n 'meta': None},\n {'id': 'http://purl.obolibrary.org/obo/eccode_1',\n 'lbl': 'Oxidoreductases',\n 'type': 'CLASS',\n 'meta': None},\n {'id': 'http://purl.obolibrary.org/obo/eccode_1.1',\n 'lbl': 'Acting on the CH-OH group of donors',\n 'type': 'CLASS',\n 'meta': None},\n {'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1',\n 'lbl': 'With NAD(+) or NADP(+) as acceptor',\n 'type': 'CLASS',\n 'meta': None},\n {'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1.1',\n 'lbl': 'alcohol dehydrogenase',\n 'type': 'CLASS',\n 'meta': ['synonyms']}]" }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nodes_collection.find(limit=6).rows" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-05-09T21:34:38.677242Z", "start_time": "2024-05-09T21:34:38.658654Z" } }, "id": "95c06a10ecfcfb67" }, { "cell_type": "markdown", "source": [ "## Creating an LLMIndexer\n", "\n", "We will create an indexer, and configure it to cache calls. This means that the 2nd time we run this notebook\n", "it will be much faster, since all the embeddings will be cached.\n", "\n", "The indexer will index using the `lbl` field. In OBO Graphs JSON, this is the name/label of the concept." ], "metadata": { "collapsed": false }, "id": "80d64d4e6570d69c" }, { "cell_type": "code", "execution_count": 29, "outputs": [], "source": [ "from linkml_store.index.implementations.llm_indexer import LLMIndexer\n", "\n", "index = LLMIndexer(name=\"test\", cached_embeddings_database=\"tmp/llm_cache.db\", index_attributes=[\"lbl\"])" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-05-09T21:34:38.681471Z", "start_time": "2024-05-09T21:34:38.675793Z" } }, "id": "8e75156bfdafe7b" }, { "cell_type": "code", "execution_count": 30, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages/duckdb_engine/__init__.py:580: SAWarning: Did not recognize type 'list' of column 'embedding'\n", " columns = self._get_columns_info(rows, domains, enums, schema) # type: ignore[attr-defined]\n", "/Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages/duckdb_engine/__init__.py:173: DuckDBEngineWarning: duckdb-engine doesn't yet support reflection on indices\n", " warnings.warn(\n" ] } ], "source": [ "nodes_collection.attach_indexer(index)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-05-09T21:36:41.479112Z", "start_time": "2024-05-09T21:34:38.679704Z" } }, "id": "993c0360941dbb3b" }, { "cell_type": "markdown", "source": [ "## Searching using the index\n", "\n", "Now we have attached an index, we can use it in semantic search. We'll search our EC subset nodes collection for a string `sugar transporters`. Note that this string doesn't occur precisely in the index but we can still rank closeness in semantic space.\n", "\n", "When using `search` the field `ranked_rows` is populated in the result object. This is a list of `(score, object)` tuples, which we will look at by translating into a pandas DataFrame:" ], "metadata": { "collapsed": false }, "id": "47c1cc117f379d63" }, { "cell_type": "code", "execution_count": 31, "outputs": [], "source": [ "qr = nodes_collection.search(\"sugar transporters\")" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-05-09T21:36:41.829516Z", "start_time": "2024-05-09T21:36:41.479725Z" } }, "id": "e3abb5d529063c6e" }, { "cell_type": "code", "execution_count": 32, "outputs": [], "source": [ "results = [{\"sim\": r[0], \"id\": r[1][\"id\"], \"name\": r[1][\"lbl\"]} for r in qr.ranked_rows]" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-05-09T21:36:41.834697Z", "start_time": "2024-05-09T21:36:41.830216Z" } }, "id": "cc3bce09e81d66a" }, { "cell_type": "code", "execution_count": 33, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.DataFrame(results)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-05-09T21:36:41.839351Z", "start_time": "2024-05-09T21:36:41.834571Z" } }, "id": "5b9d6d63561b72db" }, { "cell_type": "code", "execution_count": 34, "outputs": [ { "data": { "text/plain": " sim id \\\n0 0.811244 http://purl.obolibrary.org/obo/eccode_1.1.1.22 \n1 0.811244 http://purl.obolibrary.org/obo/eccode_1.1.1.22 \n2 0.809951 http://purl.obolibrary.org/obo/eccode_1.1.1.124 \n3 0.809951 http://purl.obolibrary.org/obo/eccode_1.1.1.124 \n4 0.808296 http://purl.obolibrary.org/obo/eccode_1.1.1.10 \n.. ... ... \n395 0.738170 http://purl.obolibrary.org/obo/RO_0002351 \n396 0.730031 http://purl.obolibrary.org/obo/eccode_1.1.1.104 \n397 0.730031 http://purl.obolibrary.org/obo/eccode_1.1.1.104 \n398 0.722299 http://purl.obolibrary.org/obo/eccode_1.1.1.223 \n399 0.722299 http://purl.obolibrary.org/obo/eccode_1.1.1.223 \n\n name \n0 UDP-glucose 6-dehydrogenase \n1 UDP-glucose 6-dehydrogenase \n2 fructose 5-dehydrogenase (NADP(+)) \n3 fructose 5-dehydrogenase (NADP(+)) \n4 L-xylulose reductase \n.. ... \n395 has member \n396 4-oxoproline reductase \n397 4-oxoproline reductase \n398 isopiperitenol dehydrogenase \n399 isopiperitenol dehydrogenase \n\n[400 rows x 3 columns]", "text/html": "
\n | sim | \nid | \nname | \n
---|---|---|---|
0 | \n0.811244 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.22 | \nUDP-glucose 6-dehydrogenase | \n
1 | \n0.811244 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.22 | \nUDP-glucose 6-dehydrogenase | \n
2 | \n0.809951 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.124 | \nfructose 5-dehydrogenase (NADP(+)) | \n
3 | \n0.809951 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.124 | \nfructose 5-dehydrogenase (NADP(+)) | \n
4 | \n0.808296 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.10 | \nL-xylulose reductase | \n
... | \n... | \n... | \n... | \n
395 | \n0.738170 | \nhttp://purl.obolibrary.org/obo/RO_0002351 | \nhas member | \n
396 | \n0.730031 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.104 | \n4-oxoproline reductase | \n
397 | \n0.730031 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.104 | \n4-oxoproline reductase | \n
398 | \n0.722299 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.223 | \nisopiperitenol dehydrogenase | \n
399 | \n0.722299 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.223 | \nisopiperitenol dehydrogenase | \n
400 rows × 3 columns
\n\n | id | \nlbl | \ntype | \nmeta | \n
---|---|---|---|---|
0 | \nhttp://purl.obolibrary.org/obo/RO_0002327 | \nenables | \nPROPERTY | \nNaN | \n
1 | \nhttp://purl.obolibrary.org/obo/RO_0002351 | \nhas member | \nPROPERTY | \nNaN | \n
2 | \nhttp://purl.obolibrary.org/obo/eccode_1 | \nOxidoreductases | \nCLASS | \nNaN | \n
3 | \nhttp://purl.obolibrary.org/obo/eccode_1.1 | \nActing on the CH-OH group of donors | \nCLASS | \nNaN | \n
4 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1 | \nWith NAD(+) or NADP(+) as acceptor | \nCLASS | \nNaN | \n
... | \n... | \n... | \n... | \n... | \n
195 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.284 | \nS-(hydroxymethyl)glutathione dehydrogenase | \nCLASS | \n[synonyms] | \n
196 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.285 | \n3''-deamino-3''-oxonicotianamine reductase | \nCLASS | \nNaN | \n
197 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.286 | \nisocitrate--homoisocitrate dehydrogenase | \nCLASS | \n[synonyms] | \n
198 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.287 | \nD-arabinitol dehydrogenase (NADP(+)) | \nCLASS | \n[synonyms] | \n
199 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.288 | \nxanthoxin dehydrogenase | \nCLASS | \n[synonyms] | \n
200 rows × 4 columns
\n\n | id | \nlbl | \ntype | \nmeta | \n__index__ | \n
---|---|---|---|---|---|
0 | \nhttp://purl.obolibrary.org/obo/RO_0002327 | \nenables | \nPROPERTY | \nNaN | \n[-0.021716245, -0.024930306, -0.015913868, -0.... | \n
1 | \nhttp://purl.obolibrary.org/obo/RO_0002351 | \nhas member | \nPROPERTY | \nNaN | \n[-0.03492431, -0.015462456, 0.002913293, -0.02... | \n
2 | \nhttp://purl.obolibrary.org/obo/eccode_1 | \nOxidoreductases | \nCLASS | \nNaN | \n[-0.031664208, -0.026391044, 8.377296e-05, -0.... | \n
3 | \nhttp://purl.obolibrary.org/obo/eccode_1.1 | \nActing on the CH-OH group of donors | \nCLASS | \nNaN | \n[-0.023240522, -0.019391688, -0.006624823, -0.... | \n
4 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1 | \nWith NAD(+) or NADP(+) as acceptor | \nCLASS | \nNaN | \n[0.00993415, -0.039508518, 0.023213472, -0.016... | \n
... | \n... | \n... | \n... | \n... | \n... | \n
195 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.284 | \nS-(hydroxymethyl)glutathione dehydrogenase | \nCLASS | \n[synonyms] | \n[-0.020920139, -0.0042932644, 0.0039249077, -0... | \n
196 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.285 | \n3''-deamino-3''-oxonicotianamine reductase | \nCLASS | \nNaN | \n[-0.02360331, -0.025488937, -0.010397324, -0.0... | \n
197 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.286 | \nisocitrate--homoisocitrate dehydrogenase | \nCLASS | \n[synonyms] | \n[-0.019758105, -0.016041763, 0.017833093, -0.0... | \n
198 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.287 | \nD-arabinitol dehydrogenase (NADP(+)) | \nCLASS | \n[synonyms] | \n[-0.012204822, -0.034410186, 0.007919805, -0.0... | \n
199 | \nhttp://purl.obolibrary.org/obo/eccode_1.1.1.288 | \nxanthoxin dehydrogenase | \nCLASS | \n[synonyms] | \n[-0.012760352, -0.016459865, 0.011843718, -0.0... | \n
200 rows × 5 columns
\n