{ "cells": [ { "cell_type": "markdown", "source": [ "# How to use Semantic Search\n", "\n", "This tutorial will show you how to use indexing and semantic search.\n", "\n", "## Background\n", "\n", "LinkML-Store allows you to *compose* different indexing strategies with any backend. Currently there are two\n", "indexing strategies:\n", "\n", "- Simple trigram-based\n", "- LLM text or image embedding based (using models from OpenAI, HuggingFace, and others)\n", "\n", "These indexes can be added to any backend (duckdb, mongo, ...).\n", "\n", "Additionally, some backends may have their own indexing strategies:\n", "\n", "- Solr has a number of text-based indexing strategies\n", "- ChromaDB can use text-based vector embeddings\n", "\n", "LinkML-Store allows for maximum flexibility.\n", "\n", "This tutorial shows how to use an OpenAI-based embedding strategy in combination with DuckDB." ], "metadata": { "collapsed": false }, "id": "315813bcb5f486a4" }, { "cell_type": "markdown", "source": [ "## Obtaining upstream files\n", "\n", "We will use the OBO Graphs encoding of the Enzyme Commission (EC) database, via [biopragmatics](https://w3id.org/biopragmatics).\n", "\n", "We will use the pystow library to cache the upstream file." ], "metadata": { "collapsed": false }, "id": "2559f81507693d24" }, { "cell_type": "code", "execution_count": 1, "outputs": [], "source": [ "import pystow\n", "path = pystow.ensure(\"tmp\", \"eccode.json\", url=\"https://w3id.org/biopragmatics/resources/eccode/eccode.json\")" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-07-05T19:07:04.786347Z", "start_time": "2024-07-05T19:07:04.677568Z" } }, "id": "5a0030a3e24b1545" }, { "cell_type": "markdown", "source": [ "Let's examine the structure of the JSON. 
There is a top-level `graphs` list; each graph in it holds a set of `nodes` and `edges`:" ], "metadata": { "collapsed": false }, "id": "8e046de9070b82c1" }, { "cell_type": "code", "execution_count": 2, "outputs": [], "source": [ "import json\n", "\n", "# Load the OBO Graphs document and take the first (and only) graph\n", "with open(path) as f:\n", "    graphdoc = json.load(f)\n", "graph = graphdoc[\"graphs\"][0]" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-07-05T19:07:04.976093Z", "start_time": "2024-07-05T19:07:04.749596Z" } }, "id": "793566db96fc1cd9" }, { "cell_type": "code", "execution_count": 3, "outputs": [ { "data": { "text/plain": "(7177, 506022)" }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(graph[\"nodes\"]), len(graph[\"edges\"])" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-07-05T19:07:04.979929Z", "start_time": "2024-07-05T19:07:04.977334Z" } }, "id": "b3c3cf57c2d9aeed" }, { "cell_type": "markdown", "source": [ "## Storing the JSON\n", "\n", "We will create a DuckDB database to hold the JSON objects. We'll put it in a `tmp/` folder." ], "metadata": { "collapsed": false }, "id": "9f11d6d2b7a2c602" }, { "cell_type": "code", "execution_count": 4, "outputs": [], "source": [ "!mkdir -p tmp" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-07-05T19:07:05.099895Z", "start_time": "2024-07-05T19:07:04.980371Z" } }, "id": "cbae2c783889c9b3" }, { "cell_type": "code", "execution_count": 5, "outputs": [], "source": [ "from linkml_store import Client\n", "\n", "client = Client()\n", "db = client.attach_database(\"duckdb:///tmp/eccode.db\", \"eccode\", recreate_if_exists=True)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-07-05T19:07:06.181320Z", "start_time": "2024-07-05T19:07:05.100573Z" } }, "id": "6a8adce3d3ec93c6" }, { "cell_type": "markdown", "source": [ "We will create a collection for nodes. 
(we could make a separate collection for edges, but this is less relevant\n", "for this tutorial)" ], "metadata": { "collapsed": false }, "id": "1dcfaf6e8d4b3bf6" }, { "cell_type": "code", "execution_count": 6, "outputs": [], "source": [ "nodes_collection = db.create_collection(\"Node\", \"nodes\")" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-07-05T19:07:06.183588Z", "start_time": "2024-07-05T19:07:06.181604Z" } }, "id": "4fa95b75cd1f19cf" }, { "cell_type": "markdown", "source": [ "For demonstration purposes we'll only store the first 200 entries (it can be slow to index everything via the OpenAI API)" ], "metadata": { "collapsed": false }, "id": "ecc62533a80b1175" }, { "cell_type": "code", "execution_count": 7, "outputs": [], "source": [ "nodes_collection.insert(graph[\"nodes\"][0:200])" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-07-05T19:07:06.360142Z", "start_time": "2024-07-05T19:07:06.184591Z" } }, "id": "aee6a95b4a86432d" }, { "cell_type": "code", "execution_count": 8, "outputs": [ { "data": { "text/plain": "[{'id': 'http://purl.obolibrary.org/obo/RO_0002327',\n 'lbl': 'enables',\n 'type': 'PROPERTY',\n 'meta': None},\n {'id': 'http://purl.obolibrary.org/obo/RO_0002351',\n 'lbl': 'has member',\n 'type': 'PROPERTY',\n 'meta': None},\n {'id': 'http://purl.obolibrary.org/obo/eccode_1',\n 'lbl': 'Oxidoreductases',\n 'type': 'CLASS',\n 'meta': None},\n {'id': 'http://purl.obolibrary.org/obo/eccode_1.1',\n 'lbl': 'Acting on the CH-OH group of donors',\n 'type': 'CLASS',\n 'meta': None},\n {'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1',\n 'lbl': 'With NAD(+) or NADP(+) as acceptor',\n 'type': 'CLASS',\n 'meta': None},\n {'id': 'http://purl.obolibrary.org/obo/eccode_1.1.1.1',\n 'lbl': 'alcohol dehydrogenase',\n 'type': 'CLASS',\n 'meta': ['synonyms']}]" }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nodes_collection.find(limit=6).rows" ], "metadata": { 
"collapsed": false, "ExecuteTime": { "end_time": "2024-07-05T19:07:06.390591Z", "start_time": "2024-07-05T19:07:06.363698Z" } }, "id": "95c06a10ecfcfb67" }, { "cell_type": "markdown", "source": [ "## Creating an LLMIndexer\n", "\n", "We will create an indexer, and configure it to cache calls. This means that the 2nd time we run this notebook\n", "it will be much faster, since all the embeddings will be cached.\n", "\n", "The indexer will index using the `lbl` field. In OBO Graphs JSON, this is the name/label of the concept." ], "metadata": { "collapsed": false }, "id": "80d64d4e6570d69c" }, { "cell_type": "code", "execution_count": 9, "outputs": [], "source": [ "from linkml_store.index.implementations.llm_indexer import LLMIndexer\n", "\n", "index = LLMIndexer(name=\"test\", cached_embeddings_database=\"tmp/llm_cache.db\", index_attributes=[\"lbl\"])" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-07-05T19:07:06.393562Z", "start_time": "2024-07-05T19:07:06.390800Z" } }, "id": "8e75156bfdafe7b" }, { "cell_type": "code", "execution_count": 10, "outputs": [], "source": [ "nodes_collection.attach_indexer(index)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-07-05T19:07:12.102912Z", "start_time": "2024-07-05T19:07:06.393220Z" } }, "id": "993c0360941dbb3b" }, { "cell_type": "markdown", "source": [ "## Searching using the index\n", "\n", "Now we have attached an index, we can use it in semantic search. We'll search our EC subset nodes collection for a string `sugar transporters`. Note that this string doesn't occur precisely in the index but we can still rank closeness in semantic space.\n", "\n", "When using `search` the field `ranked_rows` is populated in the result object. 
This is a list of `(score, object)` tuples, which we will look at by translating into a pandas DataFrame:" ], "metadata": { "collapsed": false }, "id": "47c1cc117f379d63" }, { "cell_type": "code", "execution_count": 11, "outputs": [], "source": [ "qr = nodes_collection.search(\"sugar transporters\")" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-07-05T19:07:12.376668Z", "start_time": "2024-07-05T19:07:12.106295Z" } }, "id": "e3abb5d529063c6e" }, { "cell_type": "code", "execution_count": 12, "outputs": [], "source": [ "results = [{\"sim\": r[0], \"id\": r[1][\"id\"], \"name\": r[1][\"lbl\"]} for r in qr.ranked_rows]" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-07-05T19:07:12.380018Z", "start_time": "2024-07-05T19:07:12.377643Z" } }, "id": "cc3bce09e81d66a" }, { "cell_type": "code", "execution_count": 13, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.DataFrame(results)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-07-05T19:07:12.383310Z", "start_time": "2024-07-05T19:07:12.381381Z" } }, "id": "5b9d6d63561b72db" }, { "cell_type": "code", "execution_count": 14, "outputs": [ { "data": { "text/plain": " sim id \\\n0 0.811164 http://purl.obolibrary.org/obo/eccode_1.1.1.22 \n1 0.809901 http://purl.obolibrary.org/obo/eccode_1.1.1.124 \n2 0.808242 http://purl.obolibrary.org/obo/eccode_1.1.1.10 \n3 0.804669 http://purl.obolibrary.org/obo/eccode_1.1.1.162 \n4 0.804353 http://purl.obolibrary.org/obo/eccode_1.1.1.271 \n.. ... ... 
\n195 0.741834 http://purl.obolibrary.org/obo/eccode_1.1.1.141 \n196 0.738374 http://purl.obolibrary.org/obo/eccode_1.1.1.147 \n197 0.738128 http://purl.obolibrary.org/obo/RO_0002351 \n198 0.729969 http://purl.obolibrary.org/obo/eccode_1.1.1.104 \n199 0.722231 http://purl.obolibrary.org/obo/eccode_1.1.1.223 \n\n name \n0 UDP-glucose 6-dehydrogenase \n1 fructose 5-dehydrogenase (NADP(+)) \n2 L-xylulose reductase \n3 erythrulose reductase \n4 GDP-L-fucose synthase \n.. ... \n195 15-hydroxyprostaglandin dehydrogenase (NAD(+)) \n196 16alpha-hydroxysteroid dehydrogenase \n197 has member \n198 4-oxoproline reductase \n199 isopiperitenol dehydrogenase \n\n[200 rows x 3 columns]", "text/html": "
| | sim | id | name |
|---|---|---|---|
| 0 | 0.811164 | http://purl.obolibrary.org/obo/eccode_1.1.1.22 | UDP-glucose 6-dehydrogenase |
| 1 | 0.809901 | http://purl.obolibrary.org/obo/eccode_1.1.1.124 | fructose 5-dehydrogenase (NADP(+)) |
| 2 | 0.808242 | http://purl.obolibrary.org/obo/eccode_1.1.1.10 | L-xylulose reductase |
| 3 | 0.804669 | http://purl.obolibrary.org/obo/eccode_1.1.1.162 | erythrulose reductase |
| 4 | 0.804353 | http://purl.obolibrary.org/obo/eccode_1.1.1.271 | GDP-L-fucose synthase |
| ... | ... | ... | ... |
| 195 | 0.741834 | http://purl.obolibrary.org/obo/eccode_1.1.1.141 | 15-hydroxyprostaglandin dehydrogenase (NAD(+)) |
| 196 | 0.738374 | http://purl.obolibrary.org/obo/eccode_1.1.1.147 | 16alpha-hydroxysteroid dehydrogenase |
| 197 | 0.738128 | http://purl.obolibrary.org/obo/RO_0002351 | has member |
| 198 | 0.729969 | http://purl.obolibrary.org/obo/eccode_1.1.1.104 | 4-oxoproline reductase |
| 199 | 0.722231 | http://purl.obolibrary.org/obo/eccode_1.1.1.223 | isopiperitenol dehydrogenase |

200 rows × 3 columns

| | id | lbl | type | meta |
|---|---|---|---|---|
| 0 | http://purl.obolibrary.org/obo/RO_0002327 | enables | PROPERTY | NaN |
| 1 | http://purl.obolibrary.org/obo/RO_0002351 | has member | PROPERTY | NaN |
| 2 | http://purl.obolibrary.org/obo/eccode_1 | Oxidoreductases | CLASS | NaN |
| 3 | http://purl.obolibrary.org/obo/eccode_1.1 | Acting on the CH-OH group of donors | CLASS | NaN |
| 4 | http://purl.obolibrary.org/obo/eccode_1.1.1 | With NAD(+) or NADP(+) as acceptor | CLASS | NaN |
| ... | ... | ... | ... | ... |
| 195 | http://purl.obolibrary.org/obo/eccode_1.1.1.284 | S-(hydroxymethyl)glutathione dehydrogenase | CLASS | [synonyms] |
| 196 | http://purl.obolibrary.org/obo/eccode_1.1.1.285 | 3''-deamino-3''-oxonicotianamine reductase | CLASS | NaN |
| 197 | http://purl.obolibrary.org/obo/eccode_1.1.1.286 | isocitrate--homoisocitrate dehydrogenase | CLASS | [synonyms] |
| 198 | http://purl.obolibrary.org/obo/eccode_1.1.1.287 | D-arabinitol dehydrogenase (NADP(+)) | CLASS | [synonyms] |
| 199 | http://purl.obolibrary.org/obo/eccode_1.1.1.288 | xanthoxin dehydrogenase | CLASS | [synonyms] |

200 rows × 4 columns

| | id | lbl | type | meta | __index__ |
|---|---|---|---|---|---|
| 0 | http://purl.obolibrary.org/obo/RO_0002327 | enables | PROPERTY | NaN | [-0.021716245, -0.024930306, -0.015913868, -0.... |
| 1 | http://purl.obolibrary.org/obo/RO_0002351 | has member | PROPERTY | NaN | [-0.03492431, -0.015462456, 0.002913293, -0.02... |
| 2 | http://purl.obolibrary.org/obo/eccode_1 | Oxidoreductases | CLASS | NaN | [-0.031664208, -0.026391044, 8.377296e-05, -0.... |
| 3 | http://purl.obolibrary.org/obo/eccode_1.1 | Acting on the CH-OH group of donors | CLASS | NaN | [-0.023240522, -0.019391688, -0.006624823, -0.... |
| 4 | http://purl.obolibrary.org/obo/eccode_1.1.1 | With NAD(+) or NADP(+) as acceptor | CLASS | NaN | [0.00993415, -0.039508518, 0.023213472, -0.016... |
| ... | ... | ... | ... | ... | ... |
| 195 | http://purl.obolibrary.org/obo/eccode_1.1.1.284 | S-(hydroxymethyl)glutathione dehydrogenase | CLASS | [synonyms] | [-0.020920139, -0.0042932644, 0.0039249077, -0... |
| 196 | http://purl.obolibrary.org/obo/eccode_1.1.1.285 | 3''-deamino-3''-oxonicotianamine reductase | CLASS | NaN | [-0.02360331, -0.025488937, -0.010397324, -0.0... |
| 197 | http://purl.obolibrary.org/obo/eccode_1.1.1.286 | isocitrate--homoisocitrate dehydrogenase | CLASS | [synonyms] | [-0.019758105, -0.016041763, 0.017833093, -0.0... |
| 198 | http://purl.obolibrary.org/obo/eccode_1.1.1.287 | D-arabinitol dehydrogenase (NADP(+)) | CLASS | [synonyms] | [-0.012204822, -0.034410186, 0.007919805, -0.0... |
| 199 | http://purl.obolibrary.org/obo/eccode_1.1.1.288 | xanthoxin dehydrogenase | CLASS | [synonyms] | [-0.012760352, -0.016459865, 0.011843718, -0.0... |

200 rows × 5 columns