{ "cells": [ { "cell_type": "markdown", "source": [ "# Tutorial: Using the Command Line Interface\n", "\n", "This tutorial walks through usage of LinkML-Store via the Command Line Interface (CLI).\n", "\n", "This tutorial is a Jupyter notebook: it can be executed in a command line environment,\n", "or you can try it for yourself by running commands directly.\n", "\n", "Note that `%%bash` is a directive for Jupyter itself; you don't need to type it." ], "metadata": { "collapsed": false }, "id": "92e124c26a2d83da" }, { "cell_type": "markdown", "source": [ "## Top level command\n", "\n", "The top level command is `linkml-store`. This command doesn't do anything by itself; instead, there are various *subcommands*.\n", "\n", "The top level command has a few *global options* for specifying the configuration, database, and collection." ], "metadata": { "collapsed": false }, "id": "9ae24f91d65fdda0" }, { "cell_type": "code", "execution_count": 1, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: linkml-store [OPTIONS] COMMAND [ARGS]...\n", "\n", " A CLI for interacting with the linkml-store.\n", "\n", "Options:\n", " -d, --database TEXT Database name\n", " -c, --collection TEXT Collection name\n", " -i, --input TEXT Input file (alternative to\n", " database/collection)\n", " -C, --config PATH Path to the configuration file\n", " --set TEXT Metadata settings in the form PATHEXPR=value\n", " -v, --verbose\n", " -q, --quiet / --no-quiet\n", " -B, --base-dir TEXT Base directory for the client configuration\n", " --stacktrace / --no-stacktrace If set then show full stacktrace on error\n", " [default: no-stacktrace]\n", " --help Show this message and exit.\n", "\n", "Commands:\n", " apply Apply a patch to a collection.\n", " describe Describe the collection schema.\n", " diff Diffs two collectoons to create a patch.\n", " export Exports a database to a standard dump format.\n", " fq Query facets from the specified collection.\n", " import Imports a database from a dump.\n", " 
index Create an index over a collection.\n", " indexes Show the indexes for a collection.\n", " infer Predict a complete object from a partial object.\n", " insert Insert objects from files (JSON, YAML, TSV) into the...\n", " list-collections\n", " query Query objects from the specified collection.\n", " schema Show the schema for a database\n", " search Search objects in the specified collection.\n", " store Store objects from files (JSON, YAML, TSV) into the...\n", " validate Validate objects in the specified collection.\n" ] } ], "source": [ "%%bash\n", "linkml-store --help" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:09:06.042653Z", "start_time": "2024-08-15T22:09:01.607938Z" } }, "id": "f367252f5e8857b4" }, { "cell_type": "markdown", "source": [ "## Inserting objects from a file\n", "\n", "Next we'll explore the ``insert`` command:" ], "metadata": { "collapsed": false }, "id": "684ee59be469e12" }, { "cell_type": "code", "execution_count": 2, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: linkml-store insert [OPTIONS] [FILES]...\n", "\n", " Insert objects from files (JSON, YAML, TSV) into the specified collection.\n", "\n", " Using a configuration:\n", "\n", " linkml-store -C config.yaml -c genes insert data/genes/*.json\n", "\n", " Note: if you don't provide a schema this will be inferred, but it is usually\n", " better to provide an explicit schema\n", "\n", "Options:\n", " -f, --format [json|jsonl|yaml|yamll|tsv|csv|python|parquet|formatted|table|duckdb|postgres|mongodb]\n", " Input format\n", " -i, --object TEXT Input object as YAML\n", " --help Show this message and exit.\n" ] } ], "source": [ "%%bash\n", "linkml-store --stacktrace insert --help" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:09:08.432218Z", "start_time": "2024-08-15T22:09:06.042818Z" } }, "id": "cfe24edc122b04e7" }, { "cell_type": "markdown", "source": [ "We'll insert a small test file (in 
JSON Lines format) into a fresh database." ], "metadata": { "collapsed": false }, "id": "8cf50fcf5f257fdd" }, { "cell_type": "code", "execution_count": 3, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"name\": \"United States\", \"code\": \"US\", \"capital\": \"Washington, D.C.\", \"continent\": \"North America\", \"languages\": [\"English\"]}\n", "{\"name\": \"Canada\", \"code\": \"CA\", \"capital\": \"Ottawa\", \"continent\": \"North America\", \"languages\": [\"English\", \"French\"]}\n", "{\"name\": \"Mexico\", \"code\": \"MX\", \"capital\": \"Mexico City\", \"continent\": \"North America\", \"languages\": [\"Spanish\"]}\n", "{\"name\": \"Brazil\", \"code\": \"BR\", \"capital\": \"Brasília\", \"continent\": \"South America\", \"languages\": [\"Portuguese\"]}\n", "{\"name\": \"Argentina\", \"code\": \"AR\", \"capital\": \"Buenos Aires\", \"continent\": \"South America\", \"languages\": [\"Spanish\"]}\n", "{\"name\": \"United Kingdom\", \"code\": \"GB\", \"capital\": \"London\", \"continent\": \"Europe\", \"languages\": [\"English\"]}\n", "{\"name\": \"France\", \"code\": \"FR\", \"capital\": \"Paris\", \"continent\": \"Europe\", \"languages\": [\"French\"]}\n", "{\"name\": \"Germany\", \"code\": \"DE\", \"capital\": \"Berlin\", \"continent\": \"Europe\", \"languages\": [\"German\"]}\n", "{\"name\": \"Italy\", \"code\": \"IT\", \"capital\": \"Rome\", \"continent\": \"Europe\", \"languages\": [\"Italian\"]}\n", "{\"name\": \"Spain\", \"code\": \"ES\", \"capital\": \"Madrid\", \"continent\": \"Europe\", \"languages\": [\"Spanish\"]}\n" ] } ], "source": [ "%%bash\n", "head ../../tests/input/countries/countries.jsonl" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:09:08.452600Z", "start_time": "2024-08-15T22:09:08.432679Z" } }, "id": "afc4bfb1ecf80cc4" }, { "cell_type": "markdown", "source": [ "To make sure we have a fresh setup, we'll create a temporary directory `tmp` (if it doesn't already exist),\n", "and 
be sure to remove any copy of the database we intend to create.\n", "\n", "We'll then insert the objects:" ], "metadata": { "collapsed": false }, "id": "8ec898e12ac5c6ea" }, { "cell_type": "code", "execution_count": 4, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserted 20 objects from ../../tests/input/countries/countries.jsonl into collection 'countries'.\n" ] } ], "source": [ "%%bash\n", "mkdir -p tmp\n", "rm -rf tmp/countries.db\n", "linkml-store --database duckdb:///tmp/countries.db --collection countries insert ../../tests/input/countries/countries.jsonl" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:09:11.137934Z", "start_time": "2024-08-15T22:09:08.452084Z" } }, "id": "be9cebbea43d03a8" }, { "cell_type": "markdown", "source": [ "Note that the `--database` and `--collection` options come *before* the `insert` subcommand.\n", "\n", "With LinkML-Store, everything must go into a collection, so we specified `countries` as the name" ], "metadata": { "collapsed": false }, "id": "9c4c6c201c6c3188" }, { "cell_type": "markdown", "source": [ "## Querying\n", "\n", "Next we'll explore the `query` command:" ], "metadata": { "collapsed": false }, "id": "4550b33d68b04a8d" }, { "cell_type": "code", "execution_count": 5, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: linkml-store query [OPTIONS]\n", "\n", " Query objects from the specified collection.\n", "\n", " Leave the query field blank to return all objects in the collection.\n", "\n", " Examples:\n", "\n", " linkml-store -d duckdb:///countries.db -c countries query\n", "\n", " Queries can be specified in YAML, as basic key-value pairs\n", "\n", " Examples:\n", "\n", " linkml-store -d duckdb:///countries.db -c countries query -w 'code: NZ'\n", "\n", " More complex queries can be specified using MongoDB-style query syntax\n", "\n", " Examples:\n", "\n", " linkml-store -d file:. 
-c persons query -w 'occupation: {$ne:\n", " Architect}'\n", "\n", " Finds all people who are not architects.\n", "\n", "Options:\n", " -w, --where TEXT WHERE clause for the query, as YAML\n", " -s, --select TEXT SELECT clause for the query, as YAML\n", " -l, --limit INTEGER Maximum number of results to return\n", " -O, --output-type [json|jsonl|yaml|yamll|tsv|csv|python|parquet|formatted|table|duckdb|postgres|mongodb]\n", " Output format\n", " -o, --output PATH Output file path\n", " --help Show this message and exit.\n" ] } ], "source": [ "%%bash\n", "linkml-store query --help" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:09:13.593161Z", "start_time": "2024-08-15T22:09:11.139309Z" } }, "id": "d4d0b66a1a78f50a" }, { "cell_type": "markdown", "source": [ "Let's query for all objects that have `code=\"GB\"`, and get the results back as a table. The argument for the `--where` (or `-w`) option is a YAML object with a MongoDB-style query." ], "metadata": { "collapsed": false }, "id": "99a6d52ab591f584" }, { "cell_type": "code", "execution_count": 6, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----+----------------+--------+-----------+-------------+-------------+\n", "| | name | code | capital | continent | languages |\n", "|----+----------------+--------+-----------+-------------+-------------|\n", "| 0 | United Kingdom | GB | London | Europe | ['English'] |\n", "+----+----------------+--------+-----------+-------------+-------------+\n" ] } ], "source": [ "%%bash\n", "linkml-store --database duckdb:///tmp/countries.db -c countries query -w \"code: GB\" -O table" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:09:16.205212Z", "start_time": "2024-08-15T22:09:13.593771Z" } }, "id": "225613b70b0d57fc" }, { "cell_type": "markdown", "source": [ "We can also get the output in other formats, such as YAML:" ], "metadata": { "collapsed": false }, "id": "e86ae98fe4c48413" }, { "cell_type": "code", 
"execution_count": 7, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name: United Kingdom\n", "code: GB\n", "capital: London\n", "continent: Europe\n", "languages:\n", "- English\n" ] } ], "source": [ "%%bash\n", "linkml-store --database duckdb:///tmp/countries.db -c countries query -w \"code: GB\" -O yaml" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:09:18.789753Z", "start_time": "2024-08-15T22:09:16.206755Z" } }, "id": "5d47e9648428caf0" }, { "cell_type": "markdown", "source": [ "Formats include csv, tsv, yaml, json, jsonl, table, formatted (a human-readable format)" ], "metadata": { "collapsed": false }, "id": "8d980c36b6c9b839" }, { "cell_type": "markdown", "source": [ "## Describing the data set\n", "\n", "The `describe` command gives a high-level overview of the data set:" ], "metadata": { "collapsed": false }, "id": "ae1d98ffa2767e5f" }, { "cell_type": "code", "execution_count": 8, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: linkml-store describe [OPTIONS]\n", "\n", " Describe the collection schema.\n", "\n", "Options:\n", " -w, --where TEXT WHERE clause for the query\n", " -O, --output-type [json|jsonl|yaml|yamll|tsv|csv|python|parquet|formatted|table|duckdb|postgres|mongodb]\n", " Output format\n", " -o, --output PATH Output file path\n", " -l, --limit INTEGER Maximum number of results to return\n", " [default: -1]\n", " --help Show this message and exit.\n" ] } ], "source": [ "%%bash\n", "linkml-store describe --help" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:09:32.349533Z", "start_time": "2024-08-15T22:09:29.849173Z" } }, "id": "45cf8f0e25f8d1ae" }, { "cell_type": "markdown", "source": [ "Let's try with the countries dataset:" ], "metadata": { "collapsed": false }, "id": "ff10a119becb6ad8" }, { "cell_type": "code", "execution_count": 9, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " count unique top 
freq\n", "capital 20 20 Washington, D.C. 1\n", "code 20 20 US 1\n", "continent 20 6 Europe 5\n", "languages 20 15 [English] 4\n", "name 20 20 United States 1\n" ] } ], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db -c countries describe -O formatted" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:09:35.020384Z", "start_time": "2024-08-15T22:09:32.351347Z" } }, "id": "364f240fc0035045" }, { "cell_type": "markdown", "source": [ "Note this command is more useful for numeric data..." ], "metadata": { "collapsed": false }, "id": "bdc0a6d167506809" }, { "cell_type": "markdown", "source": [ "## Facet Counts\n", "\n", "You can combine any query (including an empty query, for fetching the whole database) with a *facet query* which fetches counts for\n", "numbers of objects broken down by some specified slot or slots." ], "metadata": { "collapsed": false }, "id": "91fcaf45c7c8c95a" }, { "cell_type": "code", "execution_count": 10, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: linkml-store fq [OPTIONS]\n", "\n", " Query facets from the specified collection.\n", "\n", " :param ctx: :param where: :param limit: :param columns: :param output_type:\n", " :param output: :return:\n", "\n", "Options:\n", " -w, --where TEXT WHERE clause for the query\n", " -l, --limit INTEGER Maximum number of results to return\n", " -O, --output-type [json|jsonl|yaml|yamll|tsv|csv|python|parquet|formatted|table|duckdb|postgres|mongodb]\n", " Output format\n", " -o, --output PATH Output file path\n", " -S, --columns TEXT Columns to facet on\n", " -U, --wide / --no-wide, --no-U Wide table [default: no-wide]\n", " --help Show this message and exit.\n" ] } ], "source": [ "%%bash\n", "linkml-store fq --help" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:09:37.453076Z", "start_time": "2024-08-15T22:09:35.021303Z" } }, "id": "5676c7a8a30699a7" }, { "cell_type": "code", "execution_count": 
11, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"continent\": {\n", " \"Asia\": 5,\n", " \"Europe\": 5,\n", " \"Africa\": 3,\n", " \"North America\": 3,\n", " \"South America\": 2,\n", " \"Oceania\": 2\n", " }\n", "}\n" ] } ], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db -c countries fq -S continent" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:09:40.000430Z", "start_time": "2024-08-15T22:09:37.453339Z" } }, "id": "6d8152d20290120c" }, { "cell_type": "code", "execution_count": 12, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------------+-------------+-------------+\n", "| | continent | languages |\n", "|------------------+-------------+-------------|\n", "| Europe | 5 | nan |\n", "| Asia | 5 | nan |\n", "| North America | 3 | nan |\n", "| Africa | 3 | nan |\n", "| South America | 2 | nan |\n", "| Oceania | 2 | nan |\n", "| English | nan | 8 |\n", "| Spanish | nan | 3 |\n", "| French | nan | 2 |\n", "| Italian | nan | 1 |\n", "| Standard Chinese | nan | 1 |\n", "| Tswana | nan | 1 |\n", "| Southern Sotho | nan | 1 |\n", "| Portuguese | nan | 1 |\n", "| Māori | nan | 1 |\n", "| Xhosa | nan | 1 |\n", "| Zulu | nan | 1 |\n", "| Tsonga | nan | 1 |\n", "| German | nan | 1 |\n", "| Korean | nan | 1 |\n", "| Northern Sotho | nan | 1 |\n", "| Venda | nan | 1 |\n", "| Southern Ndebele | nan | 1 |\n", "| Hindi | nan | 1 |\n", "| Swazi | nan | 1 |\n", "| Japanese | nan | 1 |\n", "| Indonesian | nan | 1 |\n", "| Arabic | nan | 1 |\n", "| Afrikaans | nan | 1 |\n", "+------------------+-------------+-------------+\n" ] } ], "source": [ "%%bash\n", "linkml-store --stacktrace -d duckdb:///tmp/countries.db -c countries fq -S continent,languages -O table " ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:09:42.570357Z", "start_time": "2024-08-15T22:09:40.004508Z" } }, "id": "fa0717e71e31e101" }, { "cell_type": "markdown", 
"source": [ "Remember that this is a deliberately reduced test dataset, so we don't expect to see all countries there!" ], "metadata": { "collapsed": false }, "id": "b5a1d7cf536cc60e" }, { "cell_type": "markdown", "source": [ "## Search\n", "\n", "LinkML-Store is intended to allow for a flexible range of *search strategies*. Some of these may come from the underlying data store\n", "(for example, Solr and Elasticsearch are backed by Lucene indexing). Others may be integrated independently of the store.\n", "\n", "A key supported search mechanism is *text embedding* via *Large Language Models (LLMs)*. Note that these are not enabled by default.\n", "\n", "Currently the default mechanism (which works regardless of the underlying store) is a highly naive trigram-based vector embedding. This requires\n", "no external model. It is intended primarily for demonstration purposes, and should be swapped out for something else." ], "metadata": { "collapsed": false }, "id": "1fd37a3fabafcac4" }, { "cell_type": "markdown", "source": [ "### Indexing a collection\n", "\n", "First we will explore the `index` command:" ], "metadata": { "collapsed": false }, "id": "82dd185bda0ec1bd" }, { "cell_type": "code", "execution_count": 13, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: linkml-store index [OPTIONS]\n", "\n", " Create an index over a collection.\n", "\n", " By default a simple trigram index is used.\n", "\n", "Options:\n", " -t, --index-type TEXT Type of index to create. 
Values: simple, llm\n", " [default: simple]\n", " -E, --cached-embeddings-database TEXT\n", " Path to the database where embeddings are\n", " cached\n", " -T, --text-template TEXT Template for text embeddings\n", " --help Show this message and exit.\n" ] } ], "source": [ "%%bash\n", "linkml-store index --help" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:14:44.361757Z", "start_time": "2024-08-08T03:14:42.990323Z" } }, "id": "ae0172f931e5f228" }, { "cell_type": "markdown", "source": [ "Next we'll make a (default) index" ], "metadata": { "collapsed": false }, "id": "65f5422c6dd449d9" }, { "cell_type": "code", "execution_count": 13, "outputs": [], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db -c countries index" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:10:33.487481Z", "start_time": "2024-08-15T22:10:30.806144Z" } }, "id": "3c97f99cca09a03d" }, { "cell_type": "markdown", "source": [ "### Searching a collection using an index\n", "\n", "Let's explore the `search` command" ], "metadata": { "collapsed": false }, "id": "981aea1e6dd63508" }, { "cell_type": "code", "execution_count": 14, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: linkml-store search [OPTIONS] SEARCH_TERM\n", "\n", " Search objects in the specified collection.\n", "\n", "Options:\n", " -w, --where TEXT WHERE clause for the search\n", " -l, --limit INTEGER Maximum number of search results\n", " -O, --output-type [json|jsonl|yaml|yamll|tsv|csv|python|parquet|formatted|table|duckdb|postgres|mongodb]\n", " Output format\n", " -o, --output PATH Output file path\n", " --auto-index / --no-auto-index Automatically index the collection\n", " [default: no-auto-index]\n", " -t, --index-type TEXT Type of index to create. 
Values: simple, llm\n", " [default: simple]\n", " --help Show this message and exit.\n" ] } ], "source": [ "%%bash\n", "linkml-store search --help" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:10:58.634667Z", "start_time": "2024-08-15T22:10:56.173548Z" } }, "id": "a6e98ccac65635ba" }, { "cell_type": "markdown", "source": [ "Now we'll search for countries in the North where both English and French are spoken. We'll pose this as a natural language query, but the default index only picks up on trigrams in the strings." ], "metadata": { "collapsed": false }, "id": "f5c0cd805f8d19dc" }, { "cell_type": "code", "execution_count": 15, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "score,name,code,capital,continent,languages\r\n", "0.15670402880167877,Canada,CA,Ottawa,North America,\"['English', 'French']\"\r\n", "0.14806601565681218,South Africa,ZA,Pretoria,Africa,\"['Zulu', 'Xhosa', 'Afrikaans', 'English', 'Northern Sotho', 'Tswana', 'Southern Sotho', 'Tsonga', 'Swazi', 'Venda', 'Southern Ndebele']\"\r\n", "0.13749236361227862,United States,US,\"Washington, D.C.\",North America,['English']\r\n", "0.09860812114511587,Argentina,AR,Buenos Aires,South America,['Spanish']\r\n", "0.09765536333140983,Mexico,MX,Mexico City,North America,['Spanish']\r\n" ] } ], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db -c countries search \"countries in the North where both english and french spoken\" --limit 5 -O csv" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:11:28.958992Z", "start_time": "2024-08-15T22:11:26.350455Z" } }, "id": "8fce64c44a6aae21" }, { "cell_type": "markdown", "source": [ "By default, all fields in the object are indexed. Canada comes out top because the strings for English and French are present (or rather, trigrams from those words). But remember the default method is just for illustration!" 
], "metadata": { "collapsed": false }, "id": "f69630e05da3bd6b" }, { "cell_type": "markdown", "source": [ "## Indexing using an LLM (OPTIONAL)\n", "\n", "Note for this to work, you need to have installed this package with the `llm` extra, like this:\n", "\n", "```bash\n", "pip install linkml-store[llm]\n", "```\n", "\n", "Or if you have this repo checked out and are using Poetry:\n", "\n", "```bash\n", "poetry install --all-extras\n", "```\n", "\n", "You will also need an OpenAI account.\n", "\n", "If this is too much, you can just skip this section!\n" ], "metadata": { "collapsed": false }, "id": "a59443d06387db90" }, { "cell_type": "code", "execution_count": 16, "outputs": [], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db -c countries index -t llm -E tmp/llm_countries_cache.db" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:11:54.616826Z", "start_time": "2024-08-15T22:11:51.391031Z" } }, "id": "180b3f44075c0291" }, { "cell_type": "code", "execution_count": 18, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "score,name,code,capital,continent,languages\r\n", "0.7927589434263863,Canada,CA,Ottawa,North America,\"['English', 'French']\"\r\n", "0.7641071952797124,France,FR,Paris,Europe,['French']\r\n", "0.7546847140878102,United States,US,\"Washington, D.C.\",North America,['English']\r\n", "0.7424773577897005,Australia,AU,Canberra,Oceania,['English']\r\n", "0.741656789495497,United Kingdom,GB,London,Europe,['English']\r\n" ] } ], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db -c countries search -t llm \"countries in the North where both english and french spoken\" --limit 5 -O csv" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-15T22:13:13.645889Z", "start_time": "2024-08-15T22:13:10.649734Z" } }, "id": "9711a8db9c414953" }, { "cell_type": "markdown", "source": [ "The results are not particularly meaningful, but the idea is that this could be 
used in a RAG-style system." ], "metadata": { "collapsed": false }, "id": "df6bdc130db45fa2" }, { "cell_type": "markdown", "source": [ "## Schemas\n", "\n", "Note in the above we did not explicitly specify a schema; instead it is *induced*.\n", "\n", "We can use the `schema` command to see the induced schema in [LinkML YAML](https://linkml.github.io/linkml/)." ], "metadata": { "collapsed": false }, "id": "2661d59e4e665823" }, { "cell_type": "code", "execution_count": 19, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name: test-schema\n", "id: http://example.org/test-schema\n", "imports:\n", "- linkml:types\n", "prefixes:\n", " linkml:\n", " prefix_prefix: linkml\n", " prefix_reference: https://w3id.org/linkml/\n", " test_schema:\n", " prefix_prefix: test_schema\n", " prefix_reference: http://example.org/test-schema/\n", "default_prefix: test_schema\n", "default_range: string\n", "classes:\n", " countries:\n", " name: countries\n", " attributes:\n", " name:\n", " name: name\n", " range: string\n", " required: false\n", " multivalued: false\n", " code:\n", " name: code\n", " range: string\n", " required: false\n", " multivalued: false\n", " capital:\n", " name: capital\n", " range: string\n", " required: false\n", " multivalued: false\n", " continent:\n", " name: continent\n", " range: string\n", " required: false\n", " multivalued: false\n", " languages:\n", " name: languages\n", " range: string\n", " required: false\n", " multivalued: true\n", " internal__index__countries__llm:\n", " name: internal__index__countries__llm\n", " attributes:\n", " name:\n", " name: name\n", " range: string\n", " required: false\n", " multivalued: false\n", " code:\n", " name: code\n", " range: string\n", " required: false\n", " multivalued: false\n", " capital:\n", " name: capital\n", " range: string\n", " required: false\n", " multivalued: false\n", " continent:\n", " name: continent\n", " range: string\n", " required: false\n", " multivalued: false\n", " 
languages:\n", " name: languages\n", " range: string\n", " required: false\n", " multivalued: true\n", " __index__:\n", " name: __index__\n", " range: string\n", " required: false\n", " multivalued: true\n", " internal__index__countries__simple:\n", " name: internal__index__countries__simple\n", " attributes:\n", " name:\n", " name: name\n", " range: string\n", " required: false\n", " multivalued: false\n", " code:\n", " name: code\n", " range: string\n", " required: false\n", " multivalued: false\n", " capital:\n", " name: capital\n", " range: string\n", " required: false\n", " multivalued: false\n", " continent:\n", " name: continent\n", " range: string\n", " required: false\n", " multivalued: false\n", " languages:\n", " name: languages\n", " range: string\n", " required: false\n", " multivalued: true\n", " __index__:\n", " name: __index__\n", " range: string\n", " required: false\n", " multivalued: true\n" ] } ], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db schema" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:14:54.200438Z", "start_time": "2024-08-08T03:14:52.729657Z" } }, "id": "f36b8ae0c4325d2" }, { "cell_type": "markdown", "source": [ "## Configuration Files and Explicit Schemas\n", "\n", "Rather than repeat `--database` and `--collection` each time, we can make use of YAML config files.\n", "\n", "These can also package useful information and schemas.\n", "\n", "First we will create a fresh copy of a directory with both configuration files and schemas:" ], "metadata": { "collapsed": false }, "id": "78696f002f28d51d" }, { "cell_type": "code", "execution_count": 20, "outputs": [], "source": [ "%%bash\n", "cp -pr ../../tests/input/countries tmp\n", "rm tmp/countries/countries.db" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:14:54.217540Z", "start_time": "2024-08-08T03:14:54.200816Z" } }, "id": "dad98c3579f24bbd" }, { "cell_type": "markdown", "source": [ "The 
configuration YAML is fairly minimal: it specifies a single database with a single collection, and a pointer to a schema." ], "metadata": { "collapsed": false }, "id": "9a9be08dcc572a7f" }, { "cell_type": "code", "execution_count": 21, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "databases:\n", " countries_db:\n", " handle: \"duckdb:///{base_dir}/countries.db\"\n", " schema_location: \"{base_dir}/countries.linkml.yaml\"\n", " collections:\n", " countries:\n", " type: Country\n" ] } ], "source": [ "%%bash\n", "cat tmp/countries/countries.config.yaml" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:14:54.232744Z", "start_time": "2024-08-08T03:14:54.218498Z" } }, "id": "bfde580a0ec64091" }, { "cell_type": "markdown", "source": [ "The schema itself is fairly basic: a single class (whose name matches the `type` in the configuration),\n", "with some slots. Note that the slots have some constraints, e.g. a regular-expression pattern on `code`." ], "metadata": { "collapsed": false }, "id": "9242b8942af6f976" }, { "cell_type": "code", "execution_count": 22, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id: https://example.org/countries\n", "name: countries\n", "description: A schema for representing countries\n", "license: https://creativecommons.org/publicdomain/zero/1.0/\n", "\n", "prefixes:\n", " countries: https://example.org/countries/\n", " linkml: https://w3id.org/linkml/\n", "\n", "default_prefix: countries\n", "default_range: string\n", "\n", "imports:\n", " - linkml:types\n", "\n", "classes:\n", " Country:\n", " description: A sovereign state\n", " slots:\n", " - name\n", " - code\n", " - capital\n", " - continent\n", " - languages\n", " Route:\n", " slots:\n", " - origin\n", " - destination\n", " - method\n", "\n", "slots:\n", " name:\n", " description: The name of the country\n", " required: true\n", " # identifier: true\n", " code:\n", " description: The ISO 3166-1 alpha-2 code of the country\n", " required: true\n", 
" pattern: '^[A-Z]{2}$'\n", " identifier: true\n", " capital:\n", " description: The capital city of the country\n", " required: true\n", " continent:\n", " description: The continent where the country is located\n", " required: true\n", " languages:\n", " description: The main languages spoken in the country\n", " range: Language\n", " multivalued: true\n", " origin:\n", " range: Country\n", " destination:\n", " range: Country\n", " method:\n", " range: MethodEnum\n", "\n", "enums:\n", " MethodEnum:\n", " permissible_values:\n", " rail:\n", " air:\n", " road:\n", "\n", "types:\n", " Language:\n", " typeof: string\n", " description: A human language" ] } ], "source": [ "%%bash\n", "cat tmp/countries/countries.linkml.yaml" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:14:54.276205Z", "start_time": "2024-08-08T03:14:54.231054Z" } }, "id": "cebbfe0d134749e7" }, { "cell_type": "code", "execution_count": 23, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name: countries\n", "description: A schema for representing countries\n", "id: https://example.org/countries\n", "imports:\n", "- linkml:types\n", "license: https://creativecommons.org/publicdomain/zero/1.0/\n", "prefixes:\n", " countries:\n", " prefix_prefix: countries\n", " prefix_reference: https://example.org/countries/\n", " linkml:\n", " prefix_prefix: linkml\n", " prefix_reference: https://w3id.org/linkml/\n", "default_prefix: countries\n", "default_range: string\n", "types:\n", " Language:\n", " name: Language\n", " description: A human language\n", " typeof: string\n", "enums:\n", " MethodEnum:\n", " name: MethodEnum\n", " permissible_values:\n", " rail:\n", " text: rail\n", " air:\n", " text: air\n", " road:\n", " text: road\n", "slots:\n", " name:\n", " name: name\n", " description: The name of the country\n", " required: true\n", " code:\n", " name: code\n", " description: The ISO 3166-1 alpha-2 code of the country\n", " identifier: true\n", " 
required: true\n", " pattern: ^[A-Z]{2}$\n", " capital:\n", " name: capital\n", " description: The capital city of the country\n", " required: true\n", " continent:\n", " name: continent\n", " description: The continent where the country is located\n", " required: true\n", " languages:\n", " name: languages\n", " description: The main languages spoken in the country\n", " range: Language\n", " multivalued: true\n", " origin:\n", " name: origin\n", " range: Country\n", " destination:\n", " name: destination\n", " range: Country\n", " method:\n", " name: method\n", " range: MethodEnum\n", "classes:\n", " Country:\n", " name: Country\n", " description: A sovereign state\n", " slots:\n", " - name\n", " - code\n", " - capital\n", " - continent\n", " - languages\n", " Route:\n", " name: Route\n", " slots:\n", " - origin\n", " - destination\n", " - method\n", "source_file: tmp/countries/countries.linkml.yaml\n" ] } ], "source": [ "%%bash\n", "linkml-store -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db schema" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:14:55.883019Z", "start_time": "2024-08-08T03:14:54.266010Z" } }, "id": "f8472c54c0b79cad" }, { "cell_type": "code", "execution_count": 25, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserted 20 objects from tmp/countries/countries.jsonl into collection 'countries'.\n" ] } ], "source": [ "%%bash\n", "linkml-store -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db -c countries insert tmp/countries/countries.jsonl" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:15:32.386438Z", "start_time": "2024-08-08T03:15:30.827847Z" } }, "id": "f0db6bd8db3ed955" }, { "cell_type": "code", "execution_count": 27, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "countries\n", "alias: countries\n", "type: Country\n", "additional_properties: null\n", "attributes: null\n", "indexers: 
null\n", "hidden: false\n", "is_prepopulated: false\n", "source: null\n", "derived_from: null\n", "page_size: null\n", "graph_projection: null\n", "validate_modifications: false\n" ] } ], "source": [ "%%bash\n", "linkml-store -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db list-collections" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:15:48.491259Z", "start_time": "2024-08-08T03:15:47.045855Z" } }, "id": "722706dd4ac509b4" }, { "cell_type": "code", "execution_count": 28, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\n", " {\n", " \"name\": \"United Kingdom\",\n", " \"code\": \"GB\",\n", " \"capital\": \"London\",\n", " \"continent\": \"Europe\",\n", " \"languages\": [\n", " \"English\"\n", " ]\n", " }\n", "]\n" ] } ], "source": [ "%%bash\n", "linkml-store --stacktrace -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db -c countries query -w \"code: GB\" " ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:16:08.888907Z", "start_time": "2024-08-08T03:16:07.367377Z" } }, "id": "b831b4012a6c03a3" }, { "cell_type": "markdown", "source": [ "## Validation\n", "\n", "LinkML-Store is designed to allow for rich validation, regardless of the underlying database store used.\n", "\n", "For validation to work, we need to specify an explicit schema, as we have done with the configuration above.\n", "\n", "To test it, we will insert some fake data:" ], "metadata": { "collapsed": false }, "id": "5f94d882bd97cb78" }, { "cell_type": "code", "execution_count": 31, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserted 3 objects from {name: Foolandia, code: \"X Y\", languages: [\"Fooish\"]} into collection 'countries'.\n" ] } ], "source": [ "%%bash\n", "linkml-store -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db insert --object '{name: Foolandia, code: \"X Y\", languages: [\"Fooish\"]}'" ], "metadata": { 
"collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:16:51.619629Z", "start_time": "2024-08-08T03:16:50.084048Z" } }, "id": "9d81c524271eddfc" }, { "cell_type": "markdown", "source": [ "Let's check that the data is there:" ], "metadata": { "collapsed": false }, "id": "198e7b6dcfd0b1c7" }, { "cell_type": "code", "execution_count": 33, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\n", " {\n", " \"name\": \"Foolandia\",\n", " \"code\": \"X Y\",\n", " \"capital\": null,\n", " \"continent\": null,\n", " \"languages\": [\n", " \"Fooish\"\n", " ]\n", " }\n", "]\n" ] } ], "source": [ "%%bash\n", "linkml-store -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db query -w 'name: Foolandia'" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:17:03.207774Z", "start_time": "2024-08-08T03:17:01.680489Z" } }, "id": "bc79b2c4af397111" }, { "cell_type": "markdown", "source": [ "Note that by default, validation is *deferred*. 
You can insert whatever you like, and then validate later.\n", "\n", "Other configurations may be more suited to your project, including strict/prospective validation.\n", "\n", "Next let's examine the schema:" ], "metadata": { "collapsed": false }, "id": "ef95ca843c46e78b" }, { "cell_type": "code", "execution_count": 35, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name: countries\n", "description: A schema for representing countries\n", "id: https://example.org/countries\n", "imports:\n", "- linkml:types\n", "license: https://creativecommons.org/publicdomain/zero/1.0/\n", "prefixes:\n", " countries:\n", " prefix_prefix: countries\n", " prefix_reference: https://example.org/countries/\n", " linkml:\n", " prefix_prefix: linkml\n", " prefix_reference: https://w3id.org/linkml/\n", "default_prefix: countries\n", "default_range: string\n", "types:\n", " Language:\n", " name: Language\n", " description: A human language\n", " typeof: string\n", "enums:\n", " MethodEnum:\n", " name: MethodEnum\n", " permissible_values:\n", " rail:\n", " text: rail\n", " air:\n", " text: air\n", " road:\n", " text: road\n", "slots:\n", " name:\n", " name: name\n", " description: The name of the country\n", " required: true\n", " code:\n", " name: code\n", " description: The ISO 3166-1 alpha-2 code of the country\n", " identifier: true\n", " required: true\n", " pattern: ^[A-Z]{2}$\n", " capital:\n", " name: capital\n", " description: The capital city of the country\n", " required: true\n", " continent:\n", " name: continent\n", " description: The continent where the country is located\n", " required: true\n", " languages:\n", " name: languages\n", " description: The main languages spoken in the country\n", " range: Language\n", " multivalued: true\n", " origin:\n", " name: origin\n", " range: Country\n", " destination:\n", " name: destination\n", " range: Country\n", " method:\n", " name: method\n", " range: MethodEnum\n", "classes:\n", " Country:\n", " name: 
Country\n", " description: A sovereign state\n", " slots:\n", " - name\n", " - code\n", " - capital\n", " - continent\n", " - languages\n", " Route:\n", " name: Route\n", " slots:\n", " - origin\n", " - destination\n", " - method\n", "source_file: tmp/countries/countries.linkml.yaml\n" ] } ], "source": [ "%%bash\n", "linkml-store -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db schema" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:17:19.068032Z", "start_time": "2024-08-08T03:17:17.696515Z" } }, "id": "ba9a72c20c4f9f8" }, { "cell_type": "markdown", "source": [ "### Run validation\n", "\n", "Next we will run the `validate` command:" ], "metadata": { "collapsed": false }, "id": "e72e6855cff8280c" }, { "cell_type": "code", "execution_count": 37, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----+-----------------------+------------+--------------------------------------------+---------------------------------------------------------------+------------------+----------------+-----------+\n", "| | type | severity | message | instance | instance_index | instantiates | context |\n", "|----+-----------------------+------------+--------------------------------------------+---------------------------------------------------------------+------------------+----------------+-----------|\n", "| 0 | jsonschema validation | ERROR | 'X Y' does not match '^[A-Z]{2}$' in /code | {'name': 'Foolandia', 'code': 'X Y', 'languages': ['Fooish']} | 0 | Country | [] |\n", "| 1 | jsonschema validation | ERROR | 'capital' is a required property in / | {'name': 'Foolandia', 'code': 'X Y', 'languages': ['Fooish']} | 0 | Country | [] |\n", "| 2 | jsonschema validation | ERROR | 'continent' is a required property in / | {'name': 'Foolandia', 'code': 'X Y', 'languages': ['Fooish']} | 0 | Country | [] |\n", 
"+----+-----------------------+------------+--------------------------------------------+---------------------------------------------------------------+------------------+----------------+-----------+\n" ] } ], "source": [ "%%bash\n", "linkml-store -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db validate -O table" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:17:29.769529Z", "start_time": "2024-08-08T03:17:28.076152Z" } }, "id": "597b805524772921" }, { "cell_type": "markdown", "source": [ "Here we can see three issues with the data we added:\n", "\n", "* the code doesn't match the regular expression in the schema (it contains a space)\n", "* the capital is missing\n", "* the continent is missing" ], "metadata": { "collapsed": false }, "id": "e411262d116ec17d" }, { "cell_type": "markdown", "source": [ "## Inference\n", "\n", "LinkML-Store implements the \"CRUDSI\" pattern: in addition to **C**reate, **R**ead, **U**pdate, and **D**elete, it supports **S**earch and **I**nference.\n", "\n", "Inference is a procedure for filling in missing attribute values, or for correcting or repairing existing attribute values.\n", "\n", "Different inference strategies include:\n", "\n", "* procedural or rule-based inference\n", "* projection or transformation of data\n", "* statistical inference or machine learning (ML), for example by inferring decision trees or regression models\n", "* inference using generative AI and Large Language Models (LLMs)\n", "\n", "We will demonstrate the use of LLM inference, via the RAGInferenceEngine. This works by fetching the most relevant\n", "rows from the collection at the time of inference (based on the supplied input), presenting these as example\n", "input-output pairs to the LLM, and then asking the LLM to complete the supplied input.\n", "\n", "Our countries collection is (intentionally) incomplete. 
Let's fill in some missing rows:" ], "metadata": { "collapsed": false }, "id": "53ab43f071b000be" }, { "cell_type": "code", "execution_count": 40, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "predicted_object:\n", " capital: Montevideo\n", " code: UY\n", " continent: South America\n", " languages:\n", " - Spanish\n" ] } ], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db -c countries infer -t rag -q 'name: Uruguay'" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:36:45.390615Z", "start_time": "2024-08-08T03:36:41.726848Z" } }, "id": "49d1162b62cf0459" }, { "cell_type": "code", "execution_count": 48, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: llm-claude-3 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (0.4)\n", "Requirement already satisfied: llm in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from llm-claude-3) (0.15)\n", "Requirement already satisfied: anthropic>=0.17.0 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from llm-claude-3) (0.32.0)\n", "Requirement already satisfied: anyio<5,>=3.5.0 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from anthropic>=0.17.0->llm-claude-3) (4.4.0)\n", "Requirement already satisfied: distro<2,>=1.7.0 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from anthropic>=0.17.0->llm-claude-3) (1.9.0)\n", "Requirement already satisfied: httpx<1,>=0.23.0 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from anthropic>=0.17.0->llm-claude-3) (0.27.0)\n", "Requirement already satisfied: jiter<1,>=0.4.0 in 
/Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from anthropic>=0.17.0->llm-claude-3) (0.5.0)\n", "Requirement already satisfied: pydantic<3,>=1.9.0 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from anthropic>=0.17.0->llm-claude-3) (2.8.2)\n", "Requirement already satisfied: sniffio in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from anthropic>=0.17.0->llm-claude-3) (1.3.1)\n", "Requirement already satisfied: tokenizers>=0.13.0 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from anthropic>=0.17.0->llm-claude-3) (0.19.1)\n", "Requirement already satisfied: typing-extensions<5,>=4.7 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from anthropic>=0.17.0->llm-claude-3) (4.12.2)\n", "Requirement already satisfied: click in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from llm->llm-claude-3) (8.1.7)\n", "Requirement already satisfied: openai>=1.0 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from llm->llm-claude-3) (1.40.1)\n", "Requirement already satisfied: click-default-group>=1.2.3 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from llm->llm-claude-3) (1.2.4)\n", "Requirement already satisfied: sqlite-utils>=3.37 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from llm->llm-claude-3) (3.37)\n", "Requirement already satisfied: sqlite-migrate>=0.1a2 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from llm->llm-claude-3) (0.1b0)\n", "Requirement already satisfied: 
PyYAML in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from llm->llm-claude-3) (6.0.2)\n", "Requirement already satisfied: pluggy in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from llm->llm-claude-3) (1.5.0)\n", "Requirement already satisfied: python-ulid in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from llm->llm-claude-3) (2.7.0)\n", "Requirement already satisfied: setuptools in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from llm->llm-claude-3) (72.1.0)\n", "Requirement already satisfied: pip in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from llm->llm-claude-3) (24.2)\n", "Requirement already satisfied: idna>=2.8 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from anyio<5,>=3.5.0->anthropic>=0.17.0->llm-claude-3) (3.7)\n", "Requirement already satisfied: exceptiongroup>=1.0.2 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from anyio<5,>=3.5.0->anthropic>=0.17.0->llm-claude-3) (1.2.2)\n", "Requirement already satisfied: certifi in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from httpx<1,>=0.23.0->anthropic>=0.17.0->llm-claude-3) (2024.7.4)\n", "Requirement already satisfied: httpcore==1.* in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from httpx<1,>=0.23.0->anthropic>=0.17.0->llm-claude-3) (1.0.5)\n", "Requirement already satisfied: h11<0.15,>=0.13 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from 
httpcore==1.*->httpx<1,>=0.23.0->anthropic>=0.17.0->llm-claude-3) (0.14.0)\n", "Requirement already satisfied: tqdm>4 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from openai>=1.0->llm->llm-claude-3) (4.66.5)\n", "Requirement already satisfied: annotated-types>=0.4.0 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from pydantic<3,>=1.9.0->anthropic>=0.17.0->llm-claude-3) (0.7.0)\n", "Requirement already satisfied: pydantic-core==2.20.1 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from pydantic<3,>=1.9.0->anthropic>=0.17.0->llm-claude-3) (2.20.1)\n", "Requirement already satisfied: sqlite-fts4 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from sqlite-utils>=3.37->llm->llm-claude-3) (1.0.3)\n", "Requirement already satisfied: tabulate in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from sqlite-utils>=3.37->llm->llm-claude-3) (0.9.0)\n", "Requirement already satisfied: python-dateutil in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from sqlite-utils>=3.37->llm->llm-claude-3) (2.9.0.post0)\n", "Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from tokenizers>=0.13.0->anthropic>=0.17.0->llm-claude-3) (0.24.5)\n", "Requirement already satisfied: filelock in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers>=0.13.0->anthropic>=0.17.0->llm-claude-3) (3.15.4)\n", "Requirement already satisfied: fsspec>=2023.5.0 in 
/Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers>=0.13.0->anthropic>=0.17.0->llm-claude-3) (2024.6.1)\n", "Requirement already satisfied: packaging>=20.9 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers>=0.13.0->anthropic>=0.17.0->llm-claude-3) (24.1)\n", "Requirement already satisfied: requests in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers>=0.13.0->anthropic>=0.17.0->llm-claude-3) (2.32.3)\n", "Requirement already satisfied: six>=1.5 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from python-dateutil->sqlite-utils>=3.37->llm->llm-claude-3) (1.16.0)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from requests->huggingface-hub<1.0,>=0.16.4->tokenizers>=0.13.0->anthropic>=0.17.0->llm-claude-3) (3.3.2)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (from requests->huggingface-hub<1.0,>=0.16.4->tokenizers>=0.13.0->anthropic>=0.17.0->llm-claude-3) (2.2.2)\n" ] } ], "source": [ "%%bash\n", "llm install llm-claude-3" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:58:52.225316Z", "start_time": "2024-08-08T03:58:50.030044Z" } }, "id": "1acc22ab6878d507" }, { "cell_type": "code", "execution_count": 49, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "predicted_object:\n", " capital: Montevideo\n", " code: UY\n", " continent: South America\n", " languages:\n", " - Spanish\n" ] } ], "source": [ "%%bash\n", 
"linkml-store -d duckdb:///tmp/countries.db -c countries infer -t rag:llm_config.model_name=claude-3-opus -q 'name: Uruguay'" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:59:19.847912Z", "start_time": "2024-08-08T03:59:15.216553Z" } }, "id": "1a00a3e2d841f9fc" }, { "cell_type": "markdown", "source": [ "We can also restrict the predictions to a specific attribute:" ], "metadata": { "collapsed": false }, "id": "2eefceeec5a08bb9" }, { "cell_type": "code", "execution_count": 44, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "predicted_object:\n", " continent: South America\n" ] } ], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db -c countries infer -t rag -q 'name: Uruguay' -T continent" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T03:41:10.778489Z", "start_time": "2024-08-08T03:41:07.522908Z" } }, "id": "a9d33b99802538" }, { "cell_type": "markdown", "source": [ "Note that LLMs are particularly well suited to this kind of inference, where the input we supply is out of distribution (it is not in our existing collection);\n", "here we are relying on pre-trained knowledge in the model.\n", "\n", "This is *not* expected to work with a traditional ML model: in this case it will complain that it has no data for the provided\n", "feature value:" ], "metadata": { "collapsed": false }, "id": "64fdce7388491e5d" }, { "cell_type": "code", "execution_count": 45, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "KeyError: 'Uruguay'\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "ValueError: y contains previously unseen labels: 'Uruguay'\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Failed as expected\n" ] } ], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db -c countries infer -t sklearn -T continent -q 'name: Uruguay' || echo \"Failed as expected\"" ], "metadata": { "collapsed": 
false, "ExecuteTime": { 
"end_time": "2024-08-08T03:41:29.605873Z", "start_time": "2024-08-08T03:41:27.365712Z" } }, "id": "cbd4a72b3cd06262" }, { "cell_type": "markdown", "source": [ "## Inference using statistical models\n", "\n", "A more appropriate dataset for a traditional ML model would be the Iris dataset. Let's first explore it:" ], "metadata": { "collapsed": false }, "id": "9758390d5809c794" }, { "cell_type": "code", "execution_count": 51, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " count unique top freq mean std min 25% 50% 75% max\n", "petal_length 100.0 NaN NaN NaN 2.861 1.449549 1.0 1.5 2.45 4.325 5.1\n", "petal_width 100.0 NaN NaN NaN 0.786 0.565153 0.1 0.2 0.8 1.3 1.8\n", "sepal_length 100.0 NaN NaN NaN 5.471 0.641698 4.3 5.0 5.4 5.9 7.0\n", "sepal_width 100.0 NaN NaN NaN 3.099 0.478739 2.0 2.8 3.05 3.4 4.4\n", "species 100 2 setosa 50 NaN NaN NaN NaN NaN NaN NaN\n" ] } ], "source": [ "%%bash\n", "linkml-store -i ../../tests/input/iris.jsonl describe" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T04:01:51.686957Z", "start_time": "2024-08-08T04:01:50.066382Z" } }, "id": "537f60cb74580bc3" }, { "cell_type": "code", "execution_count": 52, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "predicted_object:\n", " species: setosa\n" ] } ], "source": [ "%%bash\n", "linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -q '{\"sepal_length\": 5.1, \"sepal_width\": 3.5, \"petal_length\": 1.4, \"petal_width\": 0.2}'" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-08T04:01:56.755020Z", "start_time": "2024-08-08T04:01:54.819477Z" } }, "id": "85e9020ba171c4e" }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [], "metadata": { "collapsed": false }, "id": "cc0d655b88fc9441" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, 
"file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }