{ "cells": [ { "cell_type": "markdown", "source": [ "# Tutorial: Using the Command Line Interface\n", "\n", "This tutorial walks through usage of LinkML-Store via the Command Line Interface (CLI).\n", "\n", "This tutorial is a Jupyter notebook: you can execute it as a notebook,\n", "or you can try it for yourself by running the commands directly on the command line.\n", "\n", "Note that `%%bash` is a directive for Jupyter itself; you don't need to type it on the command line." ], "metadata": { "collapsed": false }, "id": "92e124c26a2d83da" }, { "cell_type": "markdown", "source": [ "## Top level command\n", "\n", "The top level command is `linkml-store`. This command doesn't do anything itself; instead there are various *subcommands*.\n", "\n", "The `linkml-store` command has a few *global options* to specify the configuration, database, and collection." ], "metadata": { "collapsed": false }, "id": "9ae24f91d65fdda0" }, { "cell_type": "code", "execution_count": 1, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: linkml-store [OPTIONS] COMMAND [ARGS]...\n", "\n", " A CLI for interacting with the linkml-store.\n", "\n", "Options:\n", " -d, --database TEXT Database name\n", " -c, --collection TEXT Collection name\n", " -C, --config PATH Path to the configuration file\n", " --set TEXT Metadata settings in the form PATHEXPR=value\n", " -v, --verbose\n", " -q, --quiet / --no-quiet\n", " --stacktrace / --no-stacktrace If set then show full stacktrace on error\n", " [default: no-stacktrace]\n", " --help Show this message and exit.\n", "\n", "Commands:\n", " apply Apply a patch to a collection.\n", " describe Describe the collection schema.\n", " diff Diffs two collectoons to create a patch.\n", " export Exports a database to a dump.\n", " fq Query facets from the specified collection.\n", " import Imports a database from a dump.\n", " index Create an index over a collection.\n", " indexes Show the indexes for a collection.\n", " insert Insert objects from files (JSON, YAML, TSV) into the...\n", 
" list-collections\n", " query Query objects from the specified collection.\n", " schema Show the schema for a database\n", " search Search objects in the specified collection.\n", " store Store objects from files (JSON, YAML, TSV) into the...\n", " validate Validate objects in the specified collection.\n" ] } ], "source": [ "%%bash\n", "linkml-store --help" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:08:18.145877Z", "start_time": "2024-06-25T15:08:15.829611Z" } }, "id": "f367252f5e8857b4" }, { "cell_type": "markdown", "source": [ "## Inserting objects from a file\n", "\n", "Next we'll explore the ``insert`` command:" ], "metadata": { "collapsed": false }, "id": "684ee59be469e12" }, { "cell_type": "code", "execution_count": 2, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: linkml-store insert [OPTIONS] [FILES]...\n", "\n", " Insert objects from files (JSON, YAML, TSV) into the specified collection.\n", "\n", "Options:\n", " -f, --format [json|jsonl|yaml|tsv|csv|parquet|formatted]\n", " Input format\n", " -i, --object TEXT Input object as YAML\n", " --help Show this message and exit.\n" ] } ], "source": [ "%%bash\n", "linkml-store --stacktrace insert --help" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:08:19.687123Z", "start_time": "2024-06-25T15:08:18.146287Z" } }, "id": "cfe24edc122b04e7" }, { "cell_type": "markdown", "source": [ "We'll insert a small test file (in JSON Lines format) into a fresh database." 
], "metadata": { "collapsed": false }, "id": "8cf50fcf5f257fdd" }, { "cell_type": "code", "execution_count": 3, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"name\": \"United States\", \"code\": \"US\", \"capital\": \"Washington, D.C.\", \"continent\": \"North America\", \"languages\": [\"English\"]}\n", "{\"name\": \"Canada\", \"code\": \"CA\", \"capital\": \"Ottawa\", \"continent\": \"North America\", \"languages\": [\"English\", \"French\"]}\n", "{\"name\": \"Mexico\", \"code\": \"MX\", \"capital\": \"Mexico City\", \"continent\": \"North America\", \"languages\": [\"Spanish\"]}\n", "{\"name\": \"Brazil\", \"code\": \"BR\", \"capital\": \"Brasília\", \"continent\": \"South America\", \"languages\": [\"Portuguese\"]}\n", "{\"name\": \"Argentina\", \"code\": \"AR\", \"capital\": \"Buenos Aires\", \"continent\": \"South America\", \"languages\": [\"Spanish\"]}\n", "{\"name\": \"United Kingdom\", \"code\": \"GB\", \"capital\": \"London\", \"continent\": \"Europe\", \"languages\": [\"English\"]}\n", "{\"name\": \"France\", \"code\": \"FR\", \"capital\": \"Paris\", \"continent\": \"Europe\", \"languages\": [\"French\"]}\n", "{\"name\": \"Germany\", \"code\": \"DE\", \"capital\": \"Berlin\", \"continent\": \"Europe\", \"languages\": [\"German\"]}\n", "{\"name\": \"Italy\", \"code\": \"IT\", \"capital\": \"Rome\", \"continent\": \"Europe\", \"languages\": [\"Italian\"]}\n", "{\"name\": \"Spain\", \"code\": \"ES\", \"capital\": \"Madrid\", \"continent\": \"Europe\", \"languages\": [\"Spanish\"]}\n" ] } ], "source": [ "%%bash\n", "head ../../tests/input/countries/countries.jsonl" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:08:20.815369Z", "start_time": "2024-06-25T15:08:20.792716Z" } }, "id": "afc4bfb1ecf80cc4" }, { "cell_type": "markdown", "source": [ "To make sure we have a fresh setup, we'll create a temporary directory `tmp` (if it doesn't already exist),\n", "and be sure to remove any copy of the database 
we intend to create.\n", "\n", "We'll then insert the objects:" ], "metadata": { "collapsed": false }, "id": "8ec898e12ac5c6ea" }, { "cell_type": "code", "execution_count": 4, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserted 20 objects from ../../tests/input/countries/countries.jsonl into collection 'countries'.\n" ] } ], "source": [ "%%bash\n", "mkdir -p tmp\n", "rm -rf tmp/countries.db\n", "linkml-store --database duckdb:///tmp/countries.db --collection countries insert ../../tests/input/countries/countries.jsonl" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:08:27.698074Z", "start_time": "2024-06-25T15:08:26.094359Z" } }, "id": "be9cebbea43d03a8" }, { "cell_type": "markdown", "source": [ "Note that the `--database` and `--collection` options come *before* the `insert` subcommand.\n", "\n", "With LinkML-Store, everything must go into a collection, so we specified `countries` as the collection name." ], "metadata": { "collapsed": false }, "id": "9c4c6c201c6c3188" }, { "cell_type": "markdown", "source": [ "## Querying\n", "\n", "Next we'll explore the `query` command:" ], "metadata": { "collapsed": false }, "id": "4550b33d68b04a8d" }, { "cell_type": "code", "execution_count": 24, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: linkml-store query [OPTIONS]\n", "\n", " Query objects from the specified collection.\n", "\n", " Leave the query field blank to return all objects in the collection.\n", "\n", " Examples:\n", "\n", " linkml-store -d duckdb:///countries.db -c countries query\n", "\n", " Queries can be specified in YAML, as basic key-value pairs\n", "\n", " Examples:\n", "\n", " linkml-store -d duckdb:///countries.db -c countries query -w 'code: NZ'\n", "\n", " More complex queries can be specified using MongoDB-style query syntax\n", "\n", " Examples:\n", "\n", " linkml-store -d file:. 
-c persons query -w 'occupation: {$ne:\n", " Architect}'\n", "\n", " Finds all people who are not architects.\n", "\n", "Options:\n", " -w, --where TEXT WHERE clause for the query, as YAML\n", " -l, --limit INTEGER Maximum number of results to return\n", " -O, --output-type [json|jsonl|yaml|tsv|csv|parquet|formatted]\n", " Output format\n", " -o, --output PATH Output file path\n", " --help Show this message and exit.\n" ] } ], "source": [ "%%bash\n", "linkml-store query --help" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:19:07.166608Z", "start_time": "2024-06-25T15:19:05.030214Z" } }, "id": "d4d0b66a1a78f50a" }, { "cell_type": "markdown", "source": [ "Let's query for all objects that have `code=\"GB\"`, and get the results back as a human-readable table (the `formatted` output type). The argument for the `--where` (or `-w`) option is a YAML object with a MongoDB-style query." ], "metadata": { "collapsed": false }, "id": "99a6d52ab591f584" }, { "cell_type": "code", "execution_count": 5, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " name code capital continent languages\n", "0 United Kingdom GB London Europe [English]\n" ] } ], "source": [ "%%bash\n", "linkml-store --database duckdb:///tmp/countries.db -c countries query -w \"code: GB\" -O formatted" ], "metadata": { "collapsed": false }, "id": "225613b70b0d57fc" }, { "cell_type": "markdown", "source": [ "We can also get the output in other formats, such as YAML:" ], "metadata": { "collapsed": false }, "id": "e86ae98fe4c48413" }, { "cell_type": "code", "execution_count": 6, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name: United Kingdom\n", "code: GB\n", "capital: London\n", "continent: Europe\n", "languages:\n", "- English\n" ] } ], "source": [ "%%bash\n", "linkml-store --database duckdb:///tmp/countries.db -c countries query -w \"code: GB\" -O yaml" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:09:27.368346Z", "start_time": "2024-06-25T15:09:25.681868Z" } 
}, "id": "5d47e9648428caf0" }, { "cell_type": "markdown", "source": [ "Formats include `csv`, `tsv`, `yaml`, `json`, `jsonl`, `parquet`, and `formatted` (a human-readable table)." ], "metadata": { "collapsed": false }, "id": "8d980c36b6c9b839" }, { "cell_type": "markdown", "source": [ "## Describing the data set\n", "\n", "The `describe` command gives a high-level overview of the data set:" ], "metadata": { "collapsed": false }, "id": "ae1d98ffa2767e5f" }, { "cell_type": "code", "execution_count": 25, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: linkml-store describe [OPTIONS]\n", "\n", " Describe the collection schema.\n", "\n", "Options:\n", " -w, --where TEXT WHERE clause for the query\n", " -O, --output-type [json|jsonl|yaml|tsv|csv|parquet|formatted]\n", " Output format\n", " -o, --output PATH Output file path\n", " --help Show this message and exit.\n" ] } ], "source": [ "%%bash\n", "linkml-store describe --help" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:19:57.960276Z", "start_time": "2024-06-25T15:19:55.879771Z" } }, "id": "45cf8f0e25f8d1ae" }, { "cell_type": "markdown", "source": [ "Let's try with the countries dataset:" ], "metadata": { "collapsed": false }, "id": "ff10a119becb6ad8" }, { "cell_type": "code", "execution_count": 8, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " count unique top freq\n", "capital 1 1 Washington, D.C. 1\n", "code 1 1 US 1\n", "continent 1 1 North America 1\n", "languages 1 1 [English] 1\n", "name 1 1 United States 1\n" ] } ], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db -c countries describe" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:10:51.615868Z", "start_time": "2024-06-25T15:10:49.498214Z" } }, "id": "364f240fc0035045" }, { "cell_type": "markdown", "source": [ "Note that this command is more useful for numeric data." 
], "metadata": { "collapsed": false }, "id": "bdc0a6d167506809" }, { "cell_type": "markdown", "source": [ "## Facet Counts\n", "\n", "You can combine any query (including an empty query, which matches the whole collection) with a *facet query*, which counts the\n", "number of objects broken down by one or more specified slots." ], "metadata": { "collapsed": false }, "id": "91fcaf45c7c8c95a" }, { "cell_type": "code", "execution_count": 26, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: linkml-store fq [OPTIONS]\n", "\n", " Query facets from the specified collection.\n", "\n", " :param ctx: :param where: :param limit: :param columns: :param output_type:\n", " :param output: :return:\n", "\n", "Options:\n", " -w, --where TEXT WHERE clause for the query\n", " -l, --limit INTEGER Maximum number of results to return\n", " -O, --output-type [json|jsonl|yaml|tsv|csv|parquet|formatted]\n", " Output format\n", " -o, --output PATH Output file path\n", " -S, --columns TEXT Columns to facet on\n", " --help Show this message and exit.\n" ] } ], "source": [ "%%bash\n", "linkml-store fq --help" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:20:56.187707Z", "start_time": "2024-06-25T15:20:54.588084Z" } }, "id": "5676c7a8a30699a7" }, { "cell_type": "code", "execution_count": 9, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"continent\": {\n", " \"Europe\": 5,\n", " \"Asia\": 5,\n", " \"Africa\": 3,\n", " \"North America\": 3,\n", " \"South America\": 2,\n", " \"Oceania\": 2\n", " }\n", "}\n" ] } ], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db -c countries fq -S continent" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:11:38.152596Z", "start_time": "2024-06-25T15:11:35.949820Z" } }, "id": "6d8152d20290120c" }, { "cell_type": "markdown", "source": [ "Remember that this is a deliberately reduced test dataset, so we don't expect to see 
all countries there!" ], "metadata": { "collapsed": false }, "id": "b5a1d7cf536cc60e" }, { "cell_type": "markdown", "source": [ "## Search\n", "\n", "LinkML-Store is intended to allow for a flexible range of *search strategies*. Some of these may come from the underlying data store\n", "(for example, Solr and Elasticsearch are both backed by Lucene indexing), or they may be integrated orthogonally.\n", "\n", "One key search mechanism that is supported is *text embedding* via *Large Language Models (LLMs)*. Note that these are not enabled by default.\n", "\n", "Currently the default mechanism (which works regardless of the underlying store) is a highly naive trigram-based vector embedding. This requires\n", "no external model. It is intended primarily for demonstration purposes, and should be swapped out for something else in real use." ], "metadata": { "collapsed": false }, "id": "1fd37a3fabafcac4" }, { "cell_type": "markdown", "source": [ "### Indexing a collection\n", "\n", "First we will explore the `index` command:" ], "metadata": { "collapsed": false }, "id": "82dd185bda0ec1bd" }, { "cell_type": "code", "execution_count": 27, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: linkml-store index [OPTIONS]\n", "\n", " Create an index over a collection.\n", "\n", " By default a simple trigram index is used.\n", "\n", "Options:\n", " -t, --index-type TEXT Type of index to create. 
Values: simple, llm\n", " [default: simple]\n", " -E, --cached-embeddings-database TEXT\n", " Path to the database where embeddings are\n", " cached\n", " -T, --text-template TEXT Template for text embeddings\n", " --help Show this message and exit.\n" ] } ], "source": [ "%%bash\n", "linkml-store index --help" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:23:05.815936Z", "start_time": "2024-06-25T15:23:03.735942Z" } }, "id": "ae0172f931e5f228" }, { "cell_type": "markdown", "source": [ "Next we'll make a (default) index" ], "metadata": { "collapsed": false }, "id": "65f5422c6dd449d9" }, { "cell_type": "code", "execution_count": 28, "outputs": [], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db -c countries index" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:23:32.706822Z", "start_time": "2024-06-25T15:23:30.763140Z" } }, "id": "3c97f99cca09a03d" }, { "cell_type": "markdown", "source": [ "### Searching a collection using an index\n", "\n", "Let's explore the `search` command" ], "metadata": { "collapsed": false }, "id": "981aea1e6dd63508" }, { "cell_type": "code", "execution_count": 29, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: linkml-store search [OPTIONS] SEARCH_TERM\n", "\n", " Search objects in the specified collection.\n", "\n", "Options:\n", " -w, --where TEXT WHERE clause for the search\n", " -l, --limit INTEGER Maximum number of search results\n", " -O, --output-type [json|jsonl|yaml|tsv|csv|parquet|formatted]\n", " Output format\n", " -o, --output PATH Output file path\n", " --auto-index / --no-auto-index Automatically index the collection\n", " [default: no-auto-index]\n", " -t, --index-type TEXT Type of index to create. 
Values: simple, llm\n", " [default: simple]\n", " --help Show this message and exit.\n" ] } ], "source": [ "%%bash\n", "linkml-store search --help" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:23:49.293350Z", "start_time": "2024-06-25T15:23:47.731123Z" } }, "id": "a6e98ccac65635ba" }, { "cell_type": "markdown", "source": [ "Now we'll search for countries in the North where both English and French are spoken. We'll pose this as a natural language query, but the default index only picks up on trigram tokens in the strings." ], "metadata": { "collapsed": false }, "id": "f5c0cd805f8d19dc" }, { "cell_type": "code", "execution_count": 30, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "score,name,code,capital,continent,languages\r\n", "0.15670402880167877,Canada,CA,Ottawa,North America,\"['English', 'French']\"\r\n", "0.14806601565681218,South Africa,ZA,Pretoria,Africa,\"['Zulu', 'Xhosa', 'Afrikaans', 'English', 'Northern Sotho', 'Tswana', 'Southern Sotho', 'Tsonga', 'Swazi', 'Venda', 'Southern Ndebele']\"\r\n", "0.13749236361227862,United States,US,\"Washington, D.C.\",North America,['English']\r\n", "0.09860812114511587,Argentina,AR,Buenos Aires,South America,['Spanish']\r\n", "0.09765536333140983,Mexico,MX,Mexico City,North America,['Spanish']\r\n" ] } ], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db -c countries search \"countries in the North where both english and french spoken\" --limit 5 -O csv" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:25:05.946946Z", "start_time": "2024-06-25T15:25:04.289332Z" } }, "id": "8fce64c44a6aae21" }, { "cell_type": "markdown", "source": [ "By default, all fields in the object are indexed. Canada comes out on top because the strings for English and French are present (or rather, trigrams from those words). But remember that the default method is just for illustration!" 
], "metadata": { "collapsed": false }, "id": "f69630e05da3bd6b" }, { "cell_type": "markdown", "source": [ "## Indexing using an LLM (OPTIONAL)\n", "\n", "Note for this to work, you need to have installed this package with the `llm` extra, like this:\n", "\n", "```bash\n", "pip install linkml-store[llm]\n", "```\n", "\n", "Or if you have this repo checked out and are using Poetry:\n", "\n", "```bash\n", "poetry install --all-extras\n", "```\n", "\n", "You will also need an OpenAI account.\n", "\n", "If this is too much, you can just skip this section!\n" ], "metadata": { "collapsed": false }, "id": "a59443d06387db90" }, { "cell_type": "code", "execution_count": 31, "outputs": [], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db -c countries index -t llm -E tmp/llm_cache.db" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:27:24.863515Z", "start_time": "2024-06-25T15:27:21.938776Z" } }, "id": "180b3f44075c0291" }, { "cell_type": "code", "execution_count": 32, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "score,name,code,capital,continent,languages\r\n", "0.7927589434263863,Canada,CA,Ottawa,North America,\"['English', 'French']\"\r\n", "0.7641212153371397,France,FR,Paris,Europe,['French']\r\n", "0.7546847140878102,United States,US,\"Washington, D.C.\",North America,['English']\r\n", "0.7424773577897005,Australia,AU,Canberra,Oceania,['English']\r\n", "0.741656789495497,United Kingdom,GB,London,Europe,['English']\r\n" ] } ], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db -c countries search -t llm \"countries in the North where both english and french spoken\" --limit 5 -O csv" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:28:16.883407Z", "start_time": "2024-06-25T15:28:14.554867Z" } }, "id": "9711a8db9c414953" }, { "cell_type": "markdown", "source": [ "The results are not particularly meaningful, but the idea is that this could be used in a 
RAG-style system." ], "metadata": { "collapsed": false }, "id": "df6bdc130db45fa2" }, { "cell_type": "markdown", "source": [ "## Schemas\n", "\n", "Note in the above we did not explicitly specify a schema; instead it is *induced*.\n", "\n", "We can use the `schema` command to see the induced schema in [LinkML YAML](https://linkml.github.io/linkml/)." ], "metadata": { "collapsed": false }, "id": "2661d59e4e665823" }, { "cell_type": "code", "execution_count": 11, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name: test-schema\n", "id: http://example.org/test-schema\n", "imports:\n", "- linkml:types\n", "prefixes:\n", " linkml:\n", " prefix_prefix: linkml\n", " prefix_reference: https://w3id.org/linkml/\n", " test_schema:\n", " prefix_prefix: test_schema\n", " prefix_reference: http://example.org/test-schema/\n", "default_prefix: test_schema\n", "default_range: string\n", "classes:\n", " countries:\n", " name: countries\n", " attributes:\n", " name:\n", " name: name\n", " multivalued: false\n", " range: string\n", " required: false\n", " code:\n", " name: code\n", " multivalued: false\n", " range: string\n", " required: false\n", " capital:\n", " name: capital\n", " multivalued: false\n", " range: string\n", " required: false\n", " continent:\n", " name: continent\n", " multivalued: false\n", " range: string\n", " required: false\n", " languages:\n", " name: languages\n", " multivalued: true\n", " range: string\n", " required: false\n", " internal__index__countries__simple:\n", " name: internal__index__countries__simple\n", " attributes:\n", " name:\n", " name: name\n", " multivalued: false\n", " range: string\n", " required: false\n", " code:\n", " name: code\n", " multivalued: false\n", " range: string\n", " required: false\n", " capital:\n", " name: capital\n", " multivalued: false\n", " range: string\n", " required: false\n", " continent:\n", " name: continent\n", " multivalued: false\n", " range: string\n", " required: false\n", " 
languages:\n", " name: languages\n", " multivalued: true\n", " range: string\n", " required: false\n", " __index__:\n", " name: __index__\n", " multivalued: true\n", " range: string\n", " required: false\n" ] } ], "source": [ "%%bash\n", "linkml-store -d duckdb:///tmp/countries.db schema" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:11:57.597135Z", "start_time": "2024-06-25T15:11:55.984958Z" } }, "id": "f36b8ae0c4325d2" }, { "cell_type": "markdown", "source": [ "## Configuration Files and Explicit Schemas\n", "\n", "Rather than repeat `--database` and `--collection` each time, we can make use of YAML config files.\n", "\n", "These can also package useful information and schemas.\n", "\n", "First we will create a fresh copy of a directory with both configuration files and schemas:" ], "metadata": { "collapsed": false }, "id": "78696f002f28d51d" }, { "cell_type": "code", "execution_count": 12, "outputs": [], "source": [ "%%bash\n", "cp -pr ../../tests/input/countries tmp\n", "rm tmp/countries/countries.db" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:12:04.752008Z", "start_time": "2024-06-25T15:12:04.733524Z" } }, "id": "dad98c3579f24bbd" }, { "cell_type": "markdown", "source": [ "The configuration YAML is fairly minimal - it specifies a single database with a single collection, and a pointer to a schema" ], "metadata": { "collapsed": false }, "id": "9a9be08dcc572a7f" }, { "cell_type": "code", "execution_count": 13, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "databases:\n", " countries_db:\n", " handle: \"duckdb:///{base_dir}/countries.db\"\n", " schema_location: \"{base_dir}/countries.linkml.yaml\"\n", " collections:\n", " countries:\n", " type: Country\n" ] } ], "source": [ "%%bash\n", "cat tmp/countries/countries.config.yaml" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:12:07.629556Z", "start_time": "2024-06-25T15:12:07.607664Z" } }, 
"id": "bfde580a0ec64091" }, { "cell_type": "markdown", "source": [ "The schema itself is fairly basic - a single class (whose name matches the `type`) in the configuration,\n", "with some slots. Note the slots have some constraints, e.g. regexps" ], "metadata": { "collapsed": false }, "id": "9242b8942af6f976" }, { "cell_type": "code", "execution_count": 14, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id: https://example.org/countries\n", "name: countries\n", "description: A schema for representing countries\n", "license: https://creativecommons.org/publicdomain/zero/1.0/\n", "\n", "prefixes:\n", " countries: https://example.org/countries/\n", " linkml: https://w3id.org/linkml/\n", "\n", "default_prefix: countries\n", "default_range: string\n", "\n", "imports:\n", " - linkml:types\n", "\n", "classes:\n", " Country:\n", " description: A sovereign state\n", " slots:\n", " - name\n", " - code\n", " - capital\n", " - continent\n", " - languages\n", " Route:\n", " slots:\n", " - origin\n", " - destination\n", " - method\n", "\n", "slots:\n", " name:\n", " description: The name of the country\n", " required: true\n", " # identifier: true\n", " code:\n", " description: The ISO 3166-1 alpha-2 code of the country\n", " required: true\n", " pattern: '^[A-Z]{2}$'\n", " identifier: true\n", " capital:\n", " description: The capital city of the country\n", " required: true\n", " continent:\n", " description: The continent where the country is located\n", " required: true\n", " languages:\n", " description: The main languages spoken in the country\n", " range: Language\n", " multivalued: true\n", " origin:\n", " range: Country\n", " destination:\n", " range: Country\n", " method:\n", " range: MethodEnum\n", "\n", "enums:\n", " MethodEnum:\n", " permissible_values:\n", " rail:\n", " air:\n", " road:\n", "\n", "types:\n", " Language:\n", " typeof: string\n", " description: A human language" ] } ], "source": [ "%%bash\n", "cat 
tmp/countries/countries.linkml.yaml" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:12:19.817066Z", "start_time": "2024-06-25T15:12:19.765625Z" } }, "id": "cebbfe0d134749e7" }, { "cell_type": "code", "execution_count": 15, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserted 20 objects from tmp/countries/countries.jsonl into collection 'countries'.\n" ] } ], "source": [ "%%bash\n", "linkml-store -C tmp/countries/countries.config.yaml insert tmp/countries/countries.jsonl" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:12:25.554095Z", "start_time": "2024-06-25T15:12:23.774442Z" } }, "id": "f0db6bd8db3ed955" }, { "cell_type": "code", "execution_count": 16, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "countries\n", "name: countries\n", "alias: null\n", "type: Country\n", "additional_properties: null\n", "attributes: null\n", "indexers: null\n", "hidden: false\n", "is_prepopulated: false\n", "source_location: null\n" ] } ], "source": [ "%%bash\n", "linkml-store -C tmp/countries/countries.config.yaml list-collections" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:12:27.084569Z", "start_time": "2024-06-25T15:12:25.555144Z" } }, "id": "722706dd4ac509b4" }, { "cell_type": "code", "execution_count": 17, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\n", " {\n", " \"name\": \"United Kingdom\",\n", " \"code\": \"GB\",\n", " \"capital\": \"London\",\n", " \"continent\": \"Europe\",\n", " \"languages\": [\n", " \"English\"\n", " ]\n", " }\n", "]\n" ] } ], "source": [ "%%bash\n", "linkml-store --stacktrace -C tmp/countries/countries.config.yaml -c countries query -w \"code: GB\" " ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:12:28.679053Z", "start_time": "2024-06-25T15:12:27.085311Z" } }, "id": "b831b4012a6c03a3" }, { "cell_type": "markdown", "source": [ "## 
Validation\n", "\n", "LinkML-Store is designed to allow for rich validation, regardless of the underlying database store used.\n", "\n", "For validation to work, we need to specify an explicit schema, as we have done with the configuration above.\n", "\n", "To test it, we will insert some fake data:" ], "metadata": { "collapsed": false }, "id": "5f94d882bd97cb78" }, { "cell_type": "code", "execution_count": 18, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserted 3 objects from {name: Foolandia, code: \"X Y\", languages: [\"Fooish\"]} into collection 'countries'.\n" ] } ], "source": [ "%%bash\n", "linkml-store -C tmp/countries/countries.config.yaml insert --object '{name: Foolandia, code: \"X Y\", languages: [\"Fooish\"]}'" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:12:41.190334Z", "start_time": "2024-06-25T15:12:39.647612Z" } }, "id": "9d81c524271eddfc" }, { "cell_type": "markdown", "source": [ "Let's check that the data is there:" ], "metadata": { "collapsed": false }, "id": "198e7b6dcfd0b1c7" }, { "cell_type": "code", "execution_count": 82, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\n", " {\n", " \"name\": \"Foolandia\",\n", " \"code\": \"X Y\",\n", " \"capital\": null,\n", " \"continent\": null,\n", " \"languages\": [\n", " \"Fooish\"\n", " ]\n", " }\n", "]\n" ] } ], "source": [ "%%bash\n", "linkml-store -C tmp/countries/countries.config.yaml query -w 'name: Foolandia'" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-04-22T20:49:31.369509Z", "start_time": "2024-04-22T20:49:30.604751Z" } }, "id": "bc79b2c4af397111" }, { "cell_type": "markdown", "source": [ "Note that by default, validation is *deferred*. 
You can insert whatever you like, and then validate later.\n", "\n", "Other configurations may be more suited to your project, including strict/prospective validation.\n", "\n", "Next let's examine the schema:" ], "metadata": { "collapsed": false }, "id": "ef95ca843c46e78b" }, { "cell_type": "code", "execution_count": 83, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name: countries\n", "description: A schema for representing countries\n", "id: https://example.org/countries\n", "imports:\n", "- linkml:types\n", "license: https://creativecommons.org/publicdomain/zero/1.0/\n", "prefixes:\n", " countries:\n", " prefix_prefix: countries\n", " prefix_reference: https://example.org/countries/\n", " linkml:\n", " prefix_prefix: linkml\n", " prefix_reference: https://w3id.org/linkml/\n", "default_prefix: countries\n", "default_range: string\n", "types:\n", " Language:\n", " name: Language\n", " description: A human language\n", " typeof: string\n", "slots:\n", " name:\n", " name: name\n", " description: The name of the country\n", " identifier: true\n", " required: true\n", " code:\n", " name: code\n", " description: The ISO 3166-1 alpha-2 code of the country\n", " required: true\n", " pattern: ^[A-Z]{2}$\n", " capital:\n", " name: capital\n", " description: The capital city of the country\n", " required: true\n", " continent:\n", " name: continent\n", " description: The continent where the country is located\n", " required: true\n", " languages:\n", " name: languages\n", " description: The main languages spoken in the country\n", " multivalued: true\n", " range: Language\n", "classes:\n", " Country:\n", " name: Country\n", " description: A sovereign state\n", " slots:\n", " - name\n", " - code\n", " - capital\n", " - continent\n", " - languages\n", "source_file: tmp/countries/countries.linkml.yaml\n" ] } ], "source": [ "%%bash\n", "linkml-store -C tmp/countries/countries.config.yaml schema" ], "metadata": { "collapsed": false, "ExecuteTime": { 
"end_time": "2024-04-22T20:49:32.031074Z", "start_time": "2024-04-22T20:49:31.370194Z" } }, "id": "ba9a72c20c4f9f8" }, { "cell_type": "markdown", "source": [ "### Run validation\n", "\n", "Next we will run the `validate` command:" ], "metadata": { "collapsed": false }, "id": "e72e6855cff8280c" }, { "cell_type": "code", "execution_count": 23, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type: jsonschema validation\n", "severity: ERROR\n", "message: '''X Y'' does not match ''^[A-Z]{2}$'' in /code'\n", "instance:\n", " name: Foolandia\n", " code: X Y\n", " languages:\n", " - Fooish\n", "instance_index: 0\n", "instantiates: Country\n", "context: []\n", "---\n", "type: jsonschema validation\n", "severity: ERROR\n", "message: '''capital'' is a required property in /'\n", "instance:\n", " name: Foolandia\n", " code: X Y\n", " languages:\n", " - Fooish\n", "instance_index: 0\n", "instantiates: Country\n", "context: []\n", "---\n", "type: jsonschema validation\n", "severity: ERROR\n", "message: '''continent'' is a required property in /'\n", "instance:\n", " name: Foolandia\n", " code: X Y\n", " languages:\n", " - Fooish\n", "instance_index: 0\n", "instantiates: Country\n", "context: []\n" ] } ], "source": [ "%%bash\n", "linkml-store -C tmp/countries/countries.config.yaml validate -O yaml" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-06-25T15:16:27.611766Z", "start_time": "2024-06-25T15:16:25.805899Z" } }, "id": "597b805524772921" }, { "cell_type": "markdown", "source": [ "Here we can see 3 issues with the data we added:\n", "\n", "* the code doesn't match the regexp we provided (it has a space)\n", "* the capital is missing\n", "* the continent is missing\n", " " ], "metadata": { "collapsed": false }, "id": "e411262d116ec17d" }, { "cell_type": "code", "execution_count": 84, "outputs": [], "source": [], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-04-22T20:49:33.226848Z", "start_time": 
"2024-04-22T20:49:33.223340Z" } }, "id": "935911ccc9e2cd8f" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }