Tutorial: Using the Command Line Interface
This tutorial walks through usage of LinkML-Store via the Command Line Interface (CLI). It is a Jupyter notebook: you can execute it in a Jupyter environment, or try it for yourself by running the commands directly in a terminal.
Note that %%bash is a directive for Jupyter itself; you don’t need to type it.
Top level command
The top level command is linkml-store. This command doesn’t do anything by itself; instead, there are various subcommands.
The linkml-store command has a few global options for specifying the configuration, database, and collection.
[1]:
%%bash
linkml-store --help
Usage: linkml-store [OPTIONS] COMMAND [ARGS]...
A CLI for interacting with the linkml-store.
Options:
-d, --database TEXT Database name
-c, --collection TEXT Collection name
-i, --input TEXT Input file (alternative to
database/collection)
-C, --config PATH Path to the configuration file
--set TEXT Metadata settings in the form PATHEXPR=value
-v, --verbose
-q, --quiet / --no-quiet
-B, --base-dir TEXT Base directory for the client configuration
--stacktrace / --no-stacktrace If set then show full stacktrace on error
[default: no-stacktrace]
--help Show this message and exit.
Commands:
apply Apply a patch to a collection.
describe Describe the collection schema.
diff Diffs two collectoons to create a patch.
export Exports a database to a standard dump format.
fq Query facets from the specified collection.
import Imports a database from a dump.
index Create an index over a collection.
indexes Show the indexes for a collection.
infer Predict a complete object from a partial object.
insert Insert objects from files (JSON, YAML, TSV) into the...
list-collections
query Query objects from the specified collection.
schema Show the schema for a database
search Search objects in the specified collection.
store Store objects from files (JSON, YAML, TSV) into the...
validate Validate objects in the specified collection.
Inserting objects from a file
Next we’ll explore the insert command:
[2]:
%%bash
linkml-store --stacktrace insert --help
Usage: linkml-store insert [OPTIONS] [FILES]...
Insert objects from files (JSON, YAML, TSV) into the specified collection.
Using a configuration:
linkml-store -C config.yaml -c genes insert data/genes/*.json
Note: if you don't provide a schema this will be inferred, but it is usually
better to provide an explicit schema
Options:
-f, --format [json|jsonl|yaml|yamll|tsv|csv|python|parquet|formatted|table|duckdb|postgres|mongodb]
Input format
-i, --object TEXT Input object as YAML
--help Show this message and exit.
We’ll insert a small test file (in JSON Lines format) into a fresh database.
[3]:
%%bash
head ../../tests/input/countries/countries.jsonl
{"name": "United States", "code": "US", "capital": "Washington, D.C.", "continent": "North America", "languages": ["English"]}
{"name": "Canada", "code": "CA", "capital": "Ottawa", "continent": "North America", "languages": ["English", "French"]}
{"name": "Mexico", "code": "MX", "capital": "Mexico City", "continent": "North America", "languages": ["Spanish"]}
{"name": "Brazil", "code": "BR", "capital": "Brasília", "continent": "South America", "languages": ["Portuguese"]}
{"name": "Argentina", "code": "AR", "capital": "Buenos Aires", "continent": "South America", "languages": ["Spanish"]}
{"name": "United Kingdom", "code": "GB", "capital": "London", "continent": "Europe", "languages": ["English"]}
{"name": "France", "code": "FR", "capital": "Paris", "continent": "Europe", "languages": ["French"]}
{"name": "Germany", "code": "DE", "capital": "Berlin", "continent": "Europe", "languages": ["German"]}
{"name": "Italy", "code": "IT", "capital": "Rome", "continent": "Europe", "languages": ["Italian"]}
{"name": "Spain", "code": "ES", "capital": "Madrid", "continent": "Europe", "languages": ["Spanish"]}
To make sure we have a fresh setup, we’ll create a temporary directory tmp (if it doesn’t already exist) and remove any existing copy of the database we intend to create. We’ll then insert the objects:
[4]:
%%bash
mkdir -p tmp
rm -rf tmp/countries.db
linkml-store --database duckdb:///tmp/countries.db --collection countries insert ../../tests/input/countries/countries.jsonl
Inserted 20 objects from ../../tests/input/countries/countries.jsonl into collection 'countries'.
Note that the --database and --collection options come before the insert subcommand. With LinkML-Store, everything must go into a collection, so we specified countries as the collection name.
Querying
Next we’ll explore the query command:
[5]:
%%bash
linkml-store query --help
Usage: linkml-store query [OPTIONS]
Query objects from the specified collection.
Leave the query field blank to return all objects in the collection.
Examples:
linkml-store -d duckdb:///countries.db -c countries query
Queries can be specified in YAML, as basic key-value pairs
Examples:
linkml-store -d duckdb:///countries.db -c countries query -w 'code: NZ'
More complex queries can be specified using MongoDB-style query syntax
Examples:
linkml-store -d file:. -c persons query -w 'occupation: {$ne:
Architect}'
Finds all people who are not architects.
Options:
-w, --where TEXT WHERE clause for the query, as YAML
-s, --select TEXT SELECT clause for the query, as YAML
-l, --limit INTEGER Maximum number of results to return
-O, --output-type [json|jsonl|yaml|yamll|tsv|csv|python|parquet|formatted|table|duckdb|postgres|mongodb]
Output format
-o, --output PATH Output file path
--help Show this message and exit.
Let’s query for all objects that have code="GB", and get the results back as a table. The argument for the --where (or -w) option is a YAML object with a MongoDB-style query.
[6]:
%%bash
linkml-store --database duckdb:///tmp/countries.db -c countries query -w "code: GB" -O table
+----+----------------+--------+-----------+-------------+-------------+
| | name | code | capital | continent | languages |
|----+----------------+--------+-----------+-------------+-------------|
| 0 | United Kingdom | GB | London | Europe | ['English'] |
+----+----------------+--------+-----------+-------------+-------------+
We can get the output in different formats:
[7]:
%%bash
linkml-store --database duckdb:///tmp/countries.db -c countries query -w "code: GB" -O yaml
name: United Kingdom
code: GB
capital: London
continent: Europe
languages:
- English
Formats include csv, tsv, yaml, json, jsonl, table, and formatted (a human-readable format).
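For example, to write the same query result to a JSON file instead of printing it, a command along these lines should work (not executed in this notebook; the output path is just illustrative):
linkml-store --database duckdb:///tmp/countries.db -c countries query -w "code: GB" -O json -o tmp/gb.json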
Describing the data set
The describe command gives a high-level overview of the data set:
[8]:
%%bash
linkml-store describe --help
Usage: linkml-store describe [OPTIONS]
Describe the collection schema.
Options:
-w, --where TEXT WHERE clause for the query
-O, --output-type [json|jsonl|yaml|yamll|tsv|csv|python|parquet|formatted|table|duckdb|postgres|mongodb]
Output format
-o, --output PATH Output file path
-l, --limit INTEGER Maximum number of results to return
[default: -1]
--help Show this message and exit.
Let’s try with the countries dataset:
[9]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries describe -O formatted
count unique top freq
capital 20 20 Washington, D.C. 1
code 20 20 US 1
continent 20 6 Europe 5
languages 20 15 [English] 4
name 20 20 United States 1
Note that this command is more useful for numeric data.
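For example, running describe on the numeric Iris dataset included in the test data (which we will return to at the end of this tutorial) yields summary statistics such as means, standard deviations, and quartiles:
linkml-store -i ../../tests/input/iris.jsonl describe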
Facet Counts
You can combine any query (including an empty query, which matches the whole collection) with a facet query, which returns counts of objects broken down by one or more specified slots.
[10]:
%%bash
linkml-store fq --help
Usage: linkml-store fq [OPTIONS]
Query facets from the specified collection.
:param ctx: :param where: :param limit: :param columns: :param output_type:
:param output: :return:
Options:
-w, --where TEXT WHERE clause for the query
-l, --limit INTEGER Maximum number of results to return
-O, --output-type [json|jsonl|yaml|yamll|tsv|csv|python|parquet|formatted|table|duckdb|postgres|mongodb]
Output format
-o, --output PATH Output file path
-S, --columns TEXT Columns to facet on
-U, --wide / --no-wide, --no-U Wide table [default: no-wide]
--help Show this message and exit.
[11]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries fq -S continent
{
"continent": {
"Asia": 5,
"Europe": 5,
"Africa": 3,
"North America": 3,
"South America": 2,
"Oceania": 2
}
}
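You can also combine a filter with facet counts; for example, to count languages only among European countries, a command along these lines should work (not executed here):
linkml-store -d duckdb:///tmp/countries.db -c countries fq -w "continent: Europe" -S languages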
[12]:
%%bash
linkml-store --stacktrace -d duckdb:///tmp/countries.db -c countries fq -S continent,languages -O table
+------------------+-------------+-------------+
| | continent | languages |
|------------------+-------------+-------------|
| Europe | 5 | nan |
| Asia | 5 | nan |
| North America | 3 | nan |
| Africa | 3 | nan |
| South America | 2 | nan |
| Oceania | 2 | nan |
| English | nan | 8 |
| Spanish | nan | 3 |
| French | nan | 2 |
| Italian | nan | 1 |
| Standard Chinese | nan | 1 |
| Tswana | nan | 1 |
| Southern Sotho | nan | 1 |
| Portuguese | nan | 1 |
| Māori | nan | 1 |
| Xhosa | nan | 1 |
| Zulu | nan | 1 |
| Tsonga | nan | 1 |
| German | nan | 1 |
| Korean | nan | 1 |
| Northern Sotho | nan | 1 |
| Venda | nan | 1 |
| Southern Ndebele | nan | 1 |
| Hindi | nan | 1 |
| Swazi | nan | 1 |
| Japanese | nan | 1 |
| Indonesian | nan | 1 |
| Arabic | nan | 1 |
| Afrikaans | nan | 1 |
+------------------+-------------+-------------+
Remember, this is a test dataset that has been deliberately reduced, so we don’t expect to see all countries here!
Search
LinkML-Store is intended to allow for a flexible range of search strategies. Some of these may come from the underlying data store (for example, Solr and Elasticsearch are backed by Lucene indexing), or they may be integrated orthogonally.
A key search mechanism that is supported is text embedding via Large Language Models (LLMs). Note that these are not enabled by default.
Currently the default mechanism (which works regardless of the underlying store) is a highly naive trigram-based vector embedding. This requires no external model. It is intended primarily for demonstration purposes and should be swapped out for something more capable in real applications.
Indexing a collection
First we will explore the index command:
[13]:
%%bash
linkml-store index --help
Usage: linkml-store index [OPTIONS]
Create an index over a collection.
By default a simple trigram index is used.
Options:
-t, --index-type TEXT Type of index to create. Values: simple, llm
[default: simple]
-E, --cached-embeddings-database TEXT
Path to the database where embeddings are
cached
-T, --text-template TEXT Template for text embeddings
--help Show this message and exit.
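The --text-template option controls what text is embedded for each object. For example, to index only the name and continent fields, something like the following might work (the template syntax shown here is an assumption, not verified against the documentation):
linkml-store -d duckdb:///tmp/countries.db -c countries index -T "{name} {continent}"  # template syntax assumed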
Next we’ll make a (default) index:
[13]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries index
Searching a collection using an index
Let’s explore the search command:
[14]:
%%bash
linkml-store search --help
Usage: linkml-store search [OPTIONS] SEARCH_TERM
Search objects in the specified collection.
Options:
-w, --where TEXT WHERE clause for the search
-l, --limit INTEGER Maximum number of search results
-O, --output-type [json|jsonl|yaml|yamll|tsv|csv|python|parquet|formatted|table|duckdb|postgres|mongodb]
Output format
-o, --output PATH Output file path
--auto-index / --no-auto-index Automatically index the collection
[default: no-auto-index]
-t, --index-type TEXT Type of index to create. Values: simple, llm
[default: simple]
--help Show this message and exit.
Now we’ll search for countries in the North where both English and French are spoken. We’ll pose this as a natural language query, but the default index only picks up on trigram tokens in the strings.
[15]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries search "countries in the North where both english and french spoken" --limit 5 -O csv
score,name,code,capital,continent,languages
0.15670402880167877,Canada,CA,Ottawa,North America,"['English', 'French']"
0.14806601565681218,South Africa,ZA,Pretoria,Africa,"['Zulu', 'Xhosa', 'Afrikaans', 'English', 'Northern Sotho', 'Tswana', 'Southern Sotho', 'Tsonga', 'Swazi', 'Venda', 'Southern Ndebele']"
0.13749236361227862,United States,US,"Washington, D.C.",North America,['English']
0.09860812114511587,Argentina,AR,Buenos Aires,South America,['Spanish']
0.09765536333140983,Mexico,MX,Mexico City,North America,['Spanish']
By default, all fields in the object are indexed. Canada comes out top, as the strings English and French are present (or rather, trigrams from those words). But remember, the default method is just for illustration!
Indexing using an LLM (OPTIONAL)
Note that for this to work, you need to have installed this package with the llm extra, like this:
pip install linkml-store[llm]
Or if you have this repo checked out and are using Poetry:
poetry install --all-extras
You will also need an OpenAI account.
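You will typically also need to make your API key available, for example by setting the standard environment variable before running the commands (one common approach; consult the llm documentation for alternatives):
export OPENAI_API_KEY=your-key-here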
If this is too much, you can just skip this section!
[16]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries index -t llm -E tmp/llm_countries_cache.db
[18]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries search -t llm "countries in the North where both english and french spoken" --limit 5 -O csv
score,name,code,capital,continent,languages
0.7927589434263863,Canada,CA,Ottawa,North America,"['English', 'French']"
0.7641071952797124,France,FR,Paris,Europe,['French']
0.7546847140878102,United States,US,"Washington, D.C.",North America,['English']
0.7424773577897005,Australia,AU,Canberra,Oceania,['English']
0.741656789495497,United Kingdom,GB,London,Europe,['English']
The results are not particularly meaningful, but the idea is that this could be used in a RAG-style system.
Schemas
Note that in the above we did not explicitly specify a schema; instead it is induced from the data. We can use the schema command to see the induced schema in LinkML YAML.
[19]:
%%bash
linkml-store -d duckdb:///tmp/countries.db schema
name: test-schema
id: http://example.org/test-schema
imports:
- linkml:types
prefixes:
linkml:
prefix_prefix: linkml
prefix_reference: https://w3id.org/linkml/
test_schema:
prefix_prefix: test_schema
prefix_reference: http://example.org/test-schema/
default_prefix: test_schema
default_range: string
classes:
countries:
name: countries
attributes:
name:
name: name
range: string
required: false
multivalued: false
code:
name: code
range: string
required: false
multivalued: false
capital:
name: capital
range: string
required: false
multivalued: false
continent:
name: continent
range: string
required: false
multivalued: false
languages:
name: languages
range: string
required: false
multivalued: true
internal__index__countries__llm:
name: internal__index__countries__llm
attributes:
name:
name: name
range: string
required: false
multivalued: false
code:
name: code
range: string
required: false
multivalued: false
capital:
name: capital
range: string
required: false
multivalued: false
continent:
name: continent
range: string
required: false
multivalued: false
languages:
name: languages
range: string
required: false
multivalued: true
__index__:
name: __index__
range: string
required: false
multivalued: true
internal__index__countries__simple:
name: internal__index__countries__simple
attributes:
name:
name: name
range: string
required: false
multivalued: false
code:
name: code
range: string
required: false
multivalued: false
capital:
name: capital
range: string
required: false
multivalued: false
continent:
name: continent
range: string
required: false
multivalued: false
languages:
name: languages
range: string
required: false
multivalued: true
__index__:
name: __index__
range: string
required: false
multivalued: true
Configuration Files and Explicit Schemas
Rather than repeating --database and --collection each time, we can make use of YAML config files.
These can also package useful information and schemas.
First we will create a fresh copy of a directory with both configuration files and schemas:
[20]:
%%bash
cp -pr ../../tests/input/countries tmp
rm tmp/countries/countries.db
The configuration YAML is fairly minimal: it specifies a single database with a single collection, and a pointer to a schema.
[21]:
%%bash
cat tmp/countries/countries.config.yaml
databases:
countries_db:
handle: "duckdb:///{base_dir}/countries.db"
schema_location: "{base_dir}/countries.linkml.yaml"
collections:
countries:
type: Country
The schema itself is fairly basic: a class whose name matches the type in the configuration, with some slots. Note that the slots have some constraints, e.g. regular expression patterns.
[22]:
%%bash
cat tmp/countries/countries.linkml.yaml
id: https://example.org/countries
name: countries
description: A schema for representing countries
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
countries: https://example.org/countries/
linkml: https://w3id.org/linkml/
default_prefix: countries
default_range: string
imports:
- linkml:types
classes:
Country:
description: A sovereign state
slots:
- name
- code
- capital
- continent
- languages
Route:
slots:
- origin
- destination
- method
slots:
name:
description: The name of the country
required: true
# identifier: true
code:
description: The ISO 3166-1 alpha-2 code of the country
required: true
pattern: '^[A-Z]{2}$'
identifier: true
capital:
description: The capital city of the country
required: true
continent:
description: The continent where the country is located
required: true
languages:
description: The main languages spoken in the country
range: Language
multivalued: true
origin:
range: Country
destination:
range: Country
method:
range: MethodEnum
enums:
MethodEnum:
permissible_values:
rail:
air:
road:
types:
Language:
typeof: string
description: A human language
[23]:
%%bash
linkml-store -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db schema
name: countries
description: A schema for representing countries
id: https://example.org/countries
imports:
- linkml:types
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
countries:
prefix_prefix: countries
prefix_reference: https://example.org/countries/
linkml:
prefix_prefix: linkml
prefix_reference: https://w3id.org/linkml/
default_prefix: countries
default_range: string
types:
Language:
name: Language
description: A human language
typeof: string
enums:
MethodEnum:
name: MethodEnum
permissible_values:
rail:
text: rail
air:
text: air
road:
text: road
slots:
name:
name: name
description: The name of the country
required: true
code:
name: code
description: The ISO 3166-1 alpha-2 code of the country
identifier: true
required: true
pattern: ^[A-Z]{2}$
capital:
name: capital
description: The capital city of the country
required: true
continent:
name: continent
description: The continent where the country is located
required: true
languages:
name: languages
description: The main languages spoken in the country
range: Language
multivalued: true
origin:
name: origin
range: Country
destination:
name: destination
range: Country
method:
name: method
range: MethodEnum
classes:
Country:
name: Country
description: A sovereign state
slots:
- name
- code
- capital
- continent
- languages
Route:
name: Route
slots:
- origin
- destination
- method
source_file: tmp/countries/countries.linkml.yaml
[25]:
%%bash
linkml-store -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db -c countries insert tmp/countries/countries.jsonl
Inserted 20 objects from tmp/countries/countries.jsonl into collection 'countries'.
[27]:
%%bash
linkml-store -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db list-collections
countries
alias: countries
type: Country
additional_properties: null
attributes: null
indexers: null
hidden: false
is_prepopulated: false
source: null
derived_from: null
page_size: null
graph_projection: null
validate_modifications: false
[28]:
%%bash
linkml-store --stacktrace -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db -c countries query -w "code: GB"
[
{
"name": "United Kingdom",
"code": "GB",
"capital": "London",
"continent": "Europe",
"languages": [
"English"
]
}
]
Validation
LinkML-Store is designed to allow for rich validation, regardless of the underlying database store used.
For validation to work, we need to specify an explicit schema, as we have done with the configuration above.
To test it, we will insert some fake data:
[31]:
%%bash
linkml-store -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db insert --object '{name: Foolandia, code: "X Y", languages: ["Fooish"]}'
Inserted 3 objects from {name: Foolandia, code: "X Y", languages: ["Fooish"]} into collection 'countries'.
Let’s check that the data is there:
[33]:
%%bash
linkml-store -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db query -w 'name: Foolandia'
[
{
"name": "Foolandia",
"code": "X Y",
"capital": null,
"continent": null,
"languages": [
"Fooish"
]
}
]
Note that by default, validation is deferred. You can insert whatever you like, and then validate later.
Other configurations may be more suited to your project, including strict/prospective validation.
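For example, the collection metadata shown by list-collections above includes a validate_modifications flag; a sketch of a configuration that turns this on for the countries collection might look like the following (the exact placement of the key here is an assumption, so check the configuration documentation):
databases:
  countries_db:
    handle: "duckdb:///{base_dir}/countries.db"
    schema_location: "{base_dir}/countries.linkml.yaml"
    collections:
      countries:
        type: Country
        validate_modifications: true  # assumed placement of this setting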
Next let’s examine the schema:
[35]:
%%bash
linkml-store -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db schema
name: countries
description: A schema for representing countries
id: https://example.org/countries
imports:
- linkml:types
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
countries:
prefix_prefix: countries
prefix_reference: https://example.org/countries/
linkml:
prefix_prefix: linkml
prefix_reference: https://w3id.org/linkml/
default_prefix: countries
default_range: string
types:
Language:
name: Language
description: A human language
typeof: string
enums:
MethodEnum:
name: MethodEnum
permissible_values:
rail:
text: rail
air:
text: air
road:
text: road
slots:
name:
name: name
description: The name of the country
required: true
code:
name: code
description: The ISO 3166-1 alpha-2 code of the country
identifier: true
required: true
pattern: ^[A-Z]{2}$
capital:
name: capital
description: The capital city of the country
required: true
continent:
name: continent
description: The continent where the country is located
required: true
languages:
name: languages
description: The main languages spoken in the country
range: Language
multivalued: true
origin:
name: origin
range: Country
destination:
name: destination
range: Country
method:
name: method
range: MethodEnum
classes:
Country:
name: Country
description: A sovereign state
slots:
- name
- code
- capital
- continent
- languages
Route:
name: Route
slots:
- origin
- destination
- method
source_file: tmp/countries/countries.linkml.yaml
Run validation
Next we will run the validate command:
[37]:
%%bash
linkml-store -B tmp/countries -C tmp/countries/countries.config.yaml -d countries_db validate -O table
+----+-----------------------+------------+--------------------------------------------+---------------------------------------------------------------+------------------+----------------+-----------+
| | type | severity | message | instance | instance_index | instantiates | context |
|----+-----------------------+------------+--------------------------------------------+---------------------------------------------------------------+------------------+----------------+-----------|
| 0 | jsonschema validation | ERROR | 'X Y' does not match '^[A-Z]{2}$' in /code | {'name': 'Foolandia', 'code': 'X Y', 'languages': ['Fooish']} | 0 | Country | [] |
| 1 | jsonschema validation | ERROR | 'capital' is a required property in / | {'name': 'Foolandia', 'code': 'X Y', 'languages': ['Fooish']} | 0 | Country | [] |
| 2 | jsonschema validation | ERROR | 'continent' is a required property in / | {'name': 'Foolandia', 'code': 'X Y', 'languages': ['Fooish']} | 0 | Country | [] |
+----+-----------------------+------------+--------------------------------------------+---------------------------------------------------------------+------------------+----------------+-----------+
Here we can see 3 issues with the data we added:
- the code doesn’t match the regexp we provided (it has a space)
- the capital is missing
- the continent is missing
Inference
LinkML-Store implements the “CRUDSI” pattern: in addition to Create, Read, Update, and Delete, we also support Search and Inference.
Inference is a procedure for filling in missing attribute values, or for correcting or repairing existing attribute values.
Different inference strategies include:
- procedural or rule-based inference
- projection or transformation of data
- statistical inference or machine learning (ML), for example by inferring decision trees or regression models
- inference using generative AI and Large Language Models (LLMs)
We will demonstrate the use of LLM inference, via the RAGInferenceEngine. This works by fetching the most relevant rows from the collection at the time of inference (based on supplied input), presenting these as example input-output pairs to the LLM, and then asking the LLM to complete the supplied input.
Our countries collection is (intentionally) incomplete. Let’s fill in some missing rows:
[40]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries infer -t rag -q 'name: Uruguay'
predicted_object:
capital: Montevideo
code: UY
continent: South America
languages:
- Spanish
The RAG inference engine uses an LLM under the hood; via the llm plugin mechanism we can also switch to other model providers, for example Anthropic’s Claude 3. First we install the corresponding plugin:
[48]:
%%bash
llm install llm-claude-3
Requirement already satisfied: llm-claude-3 in /Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages (0.4)
[49]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries infer -t rag:llm_config.model_name=claude-3-opus -q 'name: Uruguay'
predicted_object:
capital: Montevideo
code: UY
continent: South America
languages:
- Spanish
We can also restrict the predictions to a specific attribute:
[44]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries infer -t rag -q 'name: Uruguay' -T continent
predicted_object:
continent: South America
Note that LLMs are particularly suited to this kind of inference when we supply an out-of-distribution example (i.e. one not present in our existing collection); we are relying on pre-trained knowledge in the model.
This is not expected to work with a traditional ML model; in this case it complains that the provided feature value has not been seen before:
[45]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries infer -t sklearn -T continent -q 'name: Uruguay' || echo "Failed as expected"
KeyError: 'Uruguay'
During handling of the above exception, another exception occurred:
ValueError: y contains previously unseen labels: 'Uruguay'
Failed as expected
Inference using statistical models
A more appropriate dataset for a traditional ML model would be the Iris dataset. Let’s first explore it:
[51]:
%%bash
linkml-store -i ../../tests/input/iris.jsonl describe
count unique top freq mean std min 25% 50% 75% max
petal_length 100.0 NaN NaN NaN 2.861 1.449549 1.0 1.5 2.45 4.325 5.1
petal_width 100.0 NaN NaN NaN 0.786 0.565153 0.1 0.2 0.8 1.3 1.8
sepal_length 100.0 NaN NaN NaN 5.471 0.641698 4.3 5.0 5.4 5.9 7.0
sepal_width 100.0 NaN NaN NaN 3.099 0.478739 2.0 2.8 3.05 3.4 4.4
species 100 2 setosa 50 NaN NaN NaN NaN NaN NaN NaN
[52]:
%%bash
linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -q '{"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}'
predicted_object:
species: setosa