{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": [
"# Perform RAG Inference\n",
"\n",
"This notebook demonstrates how to perform inference using RAG (Retrieval-Augmented Generation).\n",
"\n",
"Note that linkml-store is a data-first framework, the main emphasis is not on AI or LLMs. However, it does support a pluggable **Inference** framework, and one of the integrations is a simple RAG-based inference engine.\n",
"\n",
"For this notebook, we will be using the command line interface, but the same can be done programmatically using the Python API."
],
"id": "113e1f5d2f048e03"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Loading the data into duckdb",
"id": "966de1b52f388b87"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T22:53:28.992417Z",
"start_time": "2024-08-21T22:53:25.449555Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"mkdir -p tmp\n",
"rm -rf tmp/countries.ddb\n",
"linkml-store -d duckdb:///tmp/countries.ddb -c countries insert ../../tests/input/countries/countries.jsonl"
],
"id": "da1ed3b6811477ee",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Inserted 20 objects from ../../tests/input/countries/countries.jsonl into collection 'countries'.\n"
]
}
],
"execution_count": 1
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Let's check what this looks like by using `describe` and examining the first entry:",
"id": "88191ea890186dc9"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T22:53:31.506844Z",
"start_time": "2024-08-21T22:53:28.997931Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"linkml-store -d duckdb:///tmp/countries.ddb describe"
],
"id": "af9d9160e75afed4",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" count unique top freq\n",
"capital 20 20 Washington, D.C. 1\n",
"code 20 20 US 1\n",
"continent 20 6 Europe 5\n",
"languages 20 15 [English] 4\n",
"name 20 20 United States 1\n"
]
}
],
"execution_count": 2
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T22:53:34.133244Z",
"start_time": "2024-08-21T22:53:31.595517Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"linkml-store -d duckdb:///tmp/countries.ddb query --limit 1 -O yaml"
],
"id": "45da9e5fd1353ccb",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"name: United States\n",
"code: US\n",
"capital: Washington, D.C.\n",
"continent: North America\n",
"languages:\n",
"- English\n",
"\n"
]
}
],
"execution_count": 3
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"First we will check we don't already have the country we will use for testing in the database\n",
"(the `countries.jsonl` file is intentionally incomplete)"
],
"id": "3c48cefc91936587"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T22:53:36.623611Z",
"start_time": "2024-08-21T22:53:34.144978Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"linkml-store -d duckdb:///tmp/countries.ddb -c countries query -w \"name: Uruguay\""
],
"id": "a0c1dff5eb9e6528",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[]\n"
]
}
],
"execution_count": 4
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Inferring a specific field",
"id": "5723b14db6ae067f"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T22:55:46.001102Z",
"start_time": "2024-08-21T22:55:41.459988Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"linkml-store -d duckdb:///tmp/countries.ddb -c countries infer -t rag -T languages -q \"name: Uruguay\""
],
"id": "e3b5b54814c56690",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"predicted_object:\n",
" languages:\n",
" - Spanish\n",
"\n"
]
}
],
"execution_count": 18
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"The RAG engine works by first indexing the countries collection by embedding each entry. The top N results matching the query are fetched and used as *context* for the LLM query.\n",
"\n",
"Note that in this particular case, we have a very small collection of twenty entries, and it's not even necessary to perform RAG at all, as the entire collection can easily fit within the context window of the LLM query. However, this small set is useful for demo purposes."
],
"id": "1a3c35ac1b902a86"
},
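{
"metadata": {},
"cell_type": "markdown",
"source": [
"The retrieval step described above can be sketched in plain Python. This is a minimal illustration, not linkml-store's implementation: the three-dimensional vectors and the `cosine` and `top_n` helpers are made up for this example, whereas real embeddings come from an embedding model and have many more dimensions. The objects returned by `top_n` play the role of the context passed to the LLM."
],
"id": "rag-retrieval-sketch-md"
},
{
"metadata": {},
"cell_type": "code",
"source": [
"import math\n",
"\n",
"def cosine(a, b):\n",
"    '''Cosine similarity between two equal-length vectors.'''\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))\n",
"    return dot / norm\n",
"\n",
"def top_n(query_vec, index, n=2):\n",
"    '''Return the n objects whose vectors are most similar to the query.'''\n",
"    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry[1]), reverse=True)\n",
"    return [obj for obj, _ in ranked[:n]]\n",
"\n",
"# Hypothetical toy index: (object, embedding) pairs with made-up vectors.\n",
"toy_index = [\n",
"    ({'name': 'Argentina', 'languages': ['Spanish']}, [0.9, 0.1, 0.0]),\n",
"    ({'name': 'South Korea', 'languages': ['Korean']}, [0.0, 0.2, 0.9]),\n",
"    ({'name': 'Nigeria', 'languages': ['English']}, [0.1, 0.9, 0.1]),\n",
"]\n",
"\n",
"# Made-up query vector standing in for the embedding of 'name: Uruguay'.\n",
"context = top_n([0.8, 0.2, 0.1], toy_index, n=2)\n",
"print([c['name'] for c in context])"
],
"id": "rag-retrieval-sketch-code",
"outputs": [],
"execution_count": null
},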
{
"metadata": {},
"cell_type": "markdown",
"source": "## Inferring a whole object",
"id": "4695228dca721456"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T22:53:44.984216Z",
"start_time": "2024-08-21T22:53:41.075755Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"linkml-store -d duckdb:///tmp/countries.ddb -c countries infer -t rag -q \"name: Uruguay\""
],
"id": "f0c9a8f8dd5e319c",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"predicted_object:\n",
" capital: Montevideo\n",
" code: UY\n",
" continent: South America\n",
" languages:\n",
" - Spanish\n",
"\n"
]
}
],
"execution_count": 6
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Inferring from multiple fields",
"id": "53615fc0697e0c39"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T23:02:33.338749Z",
"start_time": "2024-08-21T23:02:29.166889Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"linkml-store -d duckdb:///tmp/countries.ddb -c countries infer -t rag -q \"{continent: South America, languages: [Dutch]}\""
],
"id": "cf1d1e39a0d4b56f",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"predicted_object:\n",
" capital: Paramaribo\n",
" code: SR\n",
" name: Suriname\n",
"\n"
]
}
],
"execution_count": 27
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## RAG configuration - using a different model\n",
"\n",
"The datasette llm framework is used under the hood. This means that you can use the `llm` command to list the available models and configurations, as well as install new ones."
],
"id": "8c65369fed257b0b"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T23:02:42.141357Z",
"start_time": "2024-08-21T23:02:41.334264Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"llm models"
],
"id": "bf740ea5beb16d2a",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"OpenAI Chat: gpt-3.5-turbo (aliases: 3.5, chatgpt)\n",
"OpenAI Chat: gpt-3.5-turbo-16k (aliases: chatgpt-16k, 3.5-16k)\n",
"OpenAI Chat: gpt-4 (aliases: 4, gpt4)\n",
"OpenAI Chat: gpt-4-32k (aliases: 4-32k)\n",
"OpenAI Chat: gpt-4-1106-preview\n",
"OpenAI Chat: gpt-4-0125-preview\n",
"OpenAI Chat: gpt-4-turbo-2024-04-09\n",
"OpenAI Chat: gpt-4-turbo (aliases: gpt-4-turbo-preview, 4-turbo, 4t)\n",
"OpenAI Chat: gpt-4o (aliases: 4o)\n",
"OpenAI Chat: gpt-4o-mini (aliases: 4o-mini)\n",
"OpenAI Completion: gpt-3.5-turbo-instruct (aliases: 3.5-instruct, chatgpt-instruct)\n",
"OpenAI Chat: gpt-4-vision-preview (aliases: 4V, gpt-4-vision)\n",
"OpenAI Chat: litellm-mixtral\n",
"OpenAI Chat: litellm-llama3\n",
"OpenAI Chat: litellm-llama3-chatqa\n",
"OpenAI Chat: litellm-groq-mixtral\n",
"OpenAI Chat: litellm-groq-llama\n",
"OpenAI Chat: gpt-4o-2024-05-13 (aliases: 4o, gpt-4o)\n",
"OpenAI Chat: lbl/llama-3\n",
"OpenAI Chat: lbl/claude-opus\n",
"OpenAI Chat: lbl/claude-sonnet\n",
"OpenAI Chat: lbl/gpt-4o\n",
"OpenAI Chat: lbl/llama-3\n",
"Anthropic Messages: claude-3-opus-20240229 (aliases: claude-3-opus)\n",
"Anthropic Messages: claude-3-sonnet-20240229 (aliases: claude-3-sonnet)\n",
"Anthropic Messages: claude-3-haiku-20240307 (aliases: claude-3-haiku)\n",
"Anthropic Messages: claude-3-5-sonnet-20240620 (aliases: claude-3.5-sonnet)\n"
]
}
],
"execution_count": 28
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We'll try `claude-3-haiku`, a small model. This may not be powerful enough for extraction tasks, but general knowledge about countries should be within its capabilities.",
"id": "d543d1a2277951f8"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T23:02:47.027628Z",
"start_time": "2024-08-21T23:02:42.160752Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"linkml-store -d duckdb:///tmp/countries.ddb -c countries infer -t rag:llm_config.model_name=claude-3-haiku -q \"name: Uruguay\" "
],
"id": "77210bae9000f5b8",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"predicted_object:\n",
" capital: Montevideo\n",
" code: UY\n",
" continent: South America\n",
" languages:\n",
" - Spanish\n",
"\n"
]
}
],
"execution_count": 29
},
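{
"metadata": {},
"cell_type": "markdown",
"source": [
"The `-t` value above combines a predictor type (`rag`) with a dotted configuration path (`llm_config.model_name`). As a reading aid only, the sketch below shows how such a string maps onto a nested configuration. `parse_predictor_spec` is a hypothetical helper written for this notebook; it is not linkml-store's own parser, which may accept more forms than the single `key=value` handled here."
],
"id": "predictor-spec-sketch-md"
},
{
"metadata": {},
"cell_type": "code",
"source": [
"def parse_predictor_spec(spec):\n",
"    '''Split 'type:dotted.key=value' into a type name and a nested config dict.'''\n",
"    predictor_type, _, assignment = spec.partition(':')\n",
"    config = {}\n",
"    if assignment:\n",
"        dotted_key, _, value = assignment.partition('=')\n",
"        *parents, leaf = dotted_key.split('.')\n",
"        node = config\n",
"        # Walk down the dotted path, creating nested dicts as needed.\n",
"        for part in parents:\n",
"            node = node.setdefault(part, {})\n",
"        node[leaf] = value\n",
"    return predictor_type, config\n",
"\n",
"print(parse_predictor_spec('rag:llm_config.model_name=claude-3-haiku'))\n",
"print(parse_predictor_spec('rag'))"
],
"id": "predictor-spec-sketch-code",
"outputs": [],
"execution_count": null
},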
{
"metadata": {},
"cell_type": "markdown",
"source": "## Persisting the RAG model",
"id": "e5aa22d2d79ddfcf"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T23:05:44.704010Z",
"start_time": "2024-08-21T23:05:39.504800Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"linkml-store -d duckdb:///tmp/countries.ddb -c countries infer -t rag -q \"name: Uruguay\" -E tmp/countries.rag.json"
],
"id": "2c0253b043877be5",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"predicted_object:\n",
" capital: Montevideo\n",
" code: UY\n",
" continent: South America\n",
" languages:\n",
" - Spanish\n",
"\n"
]
}
],
"execution_count": 44
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T23:06:06.484062Z",
"start_time": "2024-08-21T23:06:06.456550Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"ls -l tmp/countries.rag.json"
],
"id": "57e0399eeb033544",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-r--r-- 1 cjm staff 498212 Aug 21 16:05 tmp/countries.rag.json\n"
]
}
],
"execution_count": 45
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T23:06:35.105523Z",
"start_time": "2024-08-21T23:06:30.767533Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"linkml-store -d duckdb:///tmp/countries.ddb -c countries infer -t rag -q \"name: Uruguay\" -L tmp/countries.rag.json"
],
"id": "f357aca421d3f29",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"predicted_object:\n",
" capital: Montevideo\n",
" code: UY\n",
" continent: South America\n",
" languages:\n",
" - Spanish\n",
"\n"
]
}
],
"execution_count": 46
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Evaluation",
"id": "56a938a3cfd88ac9"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-22T02:00:08.927287Z",
"start_time": "2024-08-22T02:00:00.774789Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"linkml-store -d duckdb:///tmp/countries.ddb -c countries infer -t rag -T languages -T code -F name -n 5"
],
"id": "c6f50474a64adc8e",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Outcome: true_positive_count=5.0 total_count=5 // accuracy: 1.0\n"
]
}
],
"execution_count": 55
},
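{
"metadata": {},
"cell_type": "markdown",
"source": [
"The outcome line reports a count of correct predictions and an accuracy, which is that count divided by the number of evaluated objects. The sketch below reproduces that arithmetic on made-up data. The per-field partial credit in `score` is an assumption made for illustration, not a description of how linkml-store scores predictions, and the example pairs are invented."
],
"id": "evaluation-accuracy-sketch-md"
},
{
"metadata": {},
"cell_type": "code",
"source": [
"def score(predicted, expected, targets):\n",
"    '''Fraction of target fields on which prediction and expectation agree.'''\n",
"    hits = sum(1 for t in targets if predicted.get(t) == expected.get(t))\n",
"    return hits / len(targets)\n",
"\n",
"targets = ['languages', 'code']\n",
"\n",
"# Hypothetical (expected, predicted) pairs for two held-out objects.\n",
"pairs = [\n",
"    ({'languages': ['Spanish'], 'code': 'UY'}, {'languages': ['Spanish'], 'code': 'UY'}),\n",
"    ({'languages': ['Dutch'], 'code': 'SR'}, {'languages': ['Dutch'], 'code': 'SU'}),\n",
"]\n",
"\n",
"true_positive_count = sum(score(pred, exp, targets) for exp, pred in pairs)\n",
"accuracy = true_positive_count / len(pairs)\n",
"print(f'true_positive_count={true_positive_count} total_count={len(pairs)} accuracy={accuracy}')"
],
"id": "evaluation-accuracy-sketch-code",
"outputs": [],
"execution_count": null
},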
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## How RAG indexing works under the hood\n",
"\n",
"Behind the scenes, whenever you use the RAG inference engine, a separate collection is automatically created for a test dataset; additionally, an index is also created in the same database. This is true regardless of the database backend (DuckDB, MongoDB, etc.).\n",
"\n",
"(note: if you are using an in-memory duckdb instance then the index is forgotten after each run, which\n",
"could get expensive if you have a large collection).\n",
"\n",
"Let's examine our database to see the new collection and index. We will use the Jupyter SQL magic to query the database."
],
"id": "ef995b7df9dc2425"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T23:03:50.219582Z",
"start_time": "2024-08-21T23:03:50.212113Z"
}
},
"cell_type": "code",
"source": [
"%load_ext sql\n",
"%config SqlMagic.autopandas = True\n",
"%config SqlMagic.feedback = False\n",
"%config SqlMagic.displaycon = False"
],
"id": "ac5fd025661ef7ec",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The sql extension is already loaded. To reload it, use:\n",
" %reload_ext sql\n"
]
}
],
"execution_count": 38
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T23:03:50.864378Z",
"start_time": "2024-08-21T23:03:50.850251Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"cp tmp/countries.ddb tmp/countries-copy.ddb"
],
"id": "20e500fe878072b3",
"outputs": [],
"execution_count": 39
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T23:03:51.483877Z",
"start_time": "2024-08-21T23:03:51.416823Z"
}
},
"cell_type": "code",
"source": "%sql duckdb:///tmp/countries-copy.ddb",
"id": "4452ee3d4c8f718f",
"outputs": [],
"execution_count": 40
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T23:03:52.224522Z",
"start_time": "2024-08-21T23:03:52.085675Z"
}
},
"cell_type": "code",
"source": [
"%%sql\n",
"SELECT * FROM information_schema.tables"
],
"id": "9a06cb5c358797cd",
"outputs": [
{
"data": {
"text/plain": [
" table_catalog table_schema table_name \\\n",
"0 countries-copy main countries \n",
"1 countries-copy main countries__rag_train \n",
"2 countries-copy main internal__index__countries__rag_train__llm \n",
"\n",
" table_type self_referencing_column_name reference_generation \\\n",
"0 BASE TABLE None None \n",
"1 BASE TABLE None None \n",
"2 BASE TABLE None None \n",
"\n",
" user_defined_type_catalog user_defined_type_schema user_defined_type_name \\\n",
"0 None None None \n",
"1 None None None \n",
"2 None None None \n",
"\n",
" is_insertable_into is_typed commit_action TABLE_COMMENT \n",
"0 YES NO None None \n",
"1 YES NO None None \n",
"2 YES NO None None "
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 41
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T23:03:53.030257Z",
"start_time": "2024-08-21T23:03:52.863279Z"
}
},
"cell_type": "code",
"source": [
"%%sql\n",
"select * from internal__index__countries__rag_train__llm limit 5"
],
"id": "d16b905ca3e0c87",
"outputs": [
{
"data": {
"text/plain": [
" name code capital continent languages \\\n",
"0 Argentina AR Buenos Aires South America [Spanish] \n",
"1 South Korea KR Seoul Asia [Korean] \n",
"2 United States US Washington, D.C. North America [English] \n",
"3 Nigeria NG Abuja Africa [English] \n",
"4 India IN New Delhi Asia [Hindi, English] \n",
"\n",
" __index__ \n",
"0 [-0.009016353, 0.02336632, 0.007532564, -0.008... \n",
"1 [3.8781454e-05, 0.013463534, 0.017664365, -0.0... \n",
"2 [-0.0077237985, 0.016569635, -0.0042663547, -0... \n",
"3 [-0.0055540577, 0.0037728157, -0.003473751, -0... \n",
"4 [-0.0031975685, 0.025214365, 0.002862445, 0.00... "
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 42
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T23:03:56.796051Z",
"start_time": "2024-08-21T23:03:56.676322Z"
}
},
"cell_type": "code",
"source": [
"%%sql\n",
"select count(*) from internal__index__countries__rag_train__llm"
],
"id": "8412b7da0370589a",
"outputs": [
{
"data": {
"text/plain": [
" count_star()\n",
"0 14"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 43
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T22:56:01.557505Z",
"start_time": "2024-08-21T22:56:01.413646Z"
}
},
"cell_type": "code",
"source": [
"%%sql\n",
"select count(*) from countries"
],
"id": "9b369a4364d3225a",
"outputs": [
{
"data": {
"text/plain": [
" count_star()\n",
"0 20"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 25
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Configuring the training/test split\n",
"\n",
"By default, the infer command will split your data in collection into a test and train set. This is useful for evaluation, but if you want to use the entire dataset, or you want to configure the split size, you can use `--training-test-data-split` (`-S`).\n"
],
"id": "60fc1b7bc202a874"
},
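{
"metadata": {},
"cell_type": "markdown",
"source": [
"As an illustration of what a training/test split means, the sketch below divides 20 stand-in objects by a given fraction. The `split_collection` helper, the fixed seed, and the 0.7 fraction are assumptions chosen for this example: 0.7 matches the 14 of 20 rows seen in the index table above, but the actual default and the way linkml-store selects objects are not stated in this notebook."
],
"id": "train-test-split-sketch-md"
},
{
"metadata": {},
"cell_type": "code",
"source": [
"import random\n",
"\n",
"def split_collection(objects, train_fraction, seed=0):\n",
"    '''Shuffle objects and split them into (train, test) by train_fraction.'''\n",
"    shuffled = list(objects)\n",
"    random.Random(seed).shuffle(shuffled)\n",
"    cut = round(len(shuffled) * train_fraction)\n",
"    return shuffled[:cut], shuffled[cut:]\n",
"\n",
"# Stand-in objects; a real collection would hold the country records.\n",
"objects = [{'id': i} for i in range(20)]\n",
"\n",
"train, test = split_collection(objects, 0.7)\n",
"print(len(train), len(test))\n",
"\n",
"# A fraction of 1.0 keeps everything for training, like -S 1.0 0.0.\n",
"train_all, test_none = split_collection(objects, 1.0)\n",
"print(len(train_all), len(test_none))"
],
"id": "train-test-split-sketch-code",
"outputs": [],
"execution_count": null
},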
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T23:03:36.178653Z",
"start_time": "2024-08-21T23:03:31.285675Z"
}
},
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"predicted_object:\n",
" capital: Montevideo\n",
" code: UY\n",
" continent: South America\n",
" languages:\n",
" - Spanish\n",
"\n"
]
}
],
"execution_count": 37,
"source": [
"%%bash\n",
"linkml-store -d duckdb:///tmp/countries.ddb -c countries infer -t rag -S 1.0 0.0 -q \"name: Uruguay\" "
],
"id": "c6b938a6f63fc481"
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Extraction tasks\n",
"\n",
"We can also use this engine for *extraction tasks* - this involves extracting structured data or knowledge from\n",
"textual or unstructured data.\n",
"\n",
"In fact, we don't need any new capabilities here - extraction can just be seen as a special case of inference,\n",
"where the feature set includes or is restricted to text, and the target set is the whole object.\n",
"\n",
"We can demonstrate this with a simple zero-shot example:"
],
"id": "1cea5554183cdd77"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T23:32:29.163733Z",
"start_time": "2024-08-21T23:32:29.146032Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"echo '{text: I saw the cat sitting on the mat, subject: cat, predicate: sits-on, object: mat}' > tmp/extraction-examples.yaml"
],
"id": "d0e36617f7d6dab7",
"outputs": [],
"execution_count": 53
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2024-08-21T23:32:34.057806Z",
"start_time": "2024-08-21T23:32:29.702387Z"
}
},
"cell_type": "code",
"source": [
"%%bash\n",
"linkml-store -i tmp/extraction-examples.yaml infer -t rag -q \"text: the Earth rotates around the Sun\""
],
"id": "22d81129ff484935",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"predicted_object:\n",
" object: Sun\n",
" predicate: rotates-around\n",
" subject: Earth\n",
"\n"
]
}
],
"execution_count": 54
},
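{
"metadata": {},
"cell_type": "markdown",
"source": [
"The framing used in this section, with text as the feature and the remaining fields as targets, can be shown on the example record itself. The sketch below is only an illustration of that framing: `split_features_targets` is a hypothetical helper and not part of linkml-store."
],
"id": "extraction-framing-sketch-md"
},
{
"metadata": {},
"cell_type": "code",
"source": [
"def split_features_targets(record, feature_fields):\n",
"    '''Separate a record into input features and the fields to be predicted.'''\n",
"    features = {k: v for k, v in record.items() if k in feature_fields}\n",
"    targets = {k: v for k, v in record.items() if k not in feature_fields}\n",
"    return features, targets\n",
"\n",
"# Same content as the record written to tmp/extraction-examples.yaml.\n",
"example = {\n",
"    'text': 'I saw the cat sitting on the mat',\n",
"    'subject': 'cat',\n",
"    'predicate': 'sits-on',\n",
"    'object': 'mat',\n",
"}\n",
"\n",
"features, targets = split_features_targets(example, {'text'})\n",
"print(features)\n",
"print(targets)"
],
"id": "extraction-framing-sketch-code",
"outputs": [],
"execution_count": null
},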
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": "",
"id": "8844aa25ae33472"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}