Tutorial: Using the Command Line Interface
This tutorial walks through the usage of LinkML-Store via the Command Line Interface (CLI).
This tutorial is a Jupyter notebook: you can execute it in a notebook environment, or try it for yourself by running the commands directly on the command line.
Note that %%bash is a directive for Jupyter itself; you don’t need to type it.
Top level command
The top level command is linkml-store. This command doesn’t do anything itself; instead there are various subcommands.
The linkml-store command has a few global options to specify the configuration, database, and collection:
[1]:
%%bash
linkml-store --help
Usage: linkml-store [OPTIONS] COMMAND [ARGS]...
A CLI for interacting with the linkml-store.
Options:
-d, --database TEXT Database name
-c, --collection TEXT Collection name
-C, --config PATH Path to the configuration file
--set TEXT Metadata settings in the form PATHEXPR=value
-v, --verbose
-q, --quiet / --no-quiet
--stacktrace / --no-stacktrace If set then show full stacktrace on error
[default: no-stacktrace]
--help Show this message and exit.
Commands:
apply Apply a patch to a collection.
describe Describe the collection schema.
diff Diffs two collectoons to create a patch.
export Exports a database to a dump.
fq Query facets from the specified collection.
import Imports a database from a dump.
index Create an index over a collection.
indexes Show the indexes for a collection.
insert Insert objects from files (JSON, YAML, TSV) into the...
list-collections
query Query objects from the specified collection.
schema Show the schema for a database
search Search objects in the specified collection.
store Store objects from files (JSON, YAML, TSV) into the...
validate Validate objects in the specified collection.
Inserting objects from a file
Next we’ll explore the insert command:
[2]:
%%bash
linkml-store --stacktrace insert --help
Usage: linkml-store insert [OPTIONS] [FILES]...
Insert objects from files (JSON, YAML, TSV) into the specified collection.
Options:
-f, --format [json|jsonl|yaml|tsv|csv|parquet|formatted]
Input format
-i, --object TEXT Input object as YAML
--help Show this message and exit.
We’ll insert a small test file (in JSON Lines format) into a fresh database.
[3]:
%%bash
head ../../tests/input/countries/countries.jsonl
{"name": "United States", "code": "US", "capital": "Washington, D.C.", "continent": "North America", "languages": ["English"]}
{"name": "Canada", "code": "CA", "capital": "Ottawa", "continent": "North America", "languages": ["English", "French"]}
{"name": "Mexico", "code": "MX", "capital": "Mexico City", "continent": "North America", "languages": ["Spanish"]}
{"name": "Brazil", "code": "BR", "capital": "Brasília", "continent": "South America", "languages": ["Portuguese"]}
{"name": "Argentina", "code": "AR", "capital": "Buenos Aires", "continent": "South America", "languages": ["Spanish"]}
{"name": "United Kingdom", "code": "GB", "capital": "London", "continent": "Europe", "languages": ["English"]}
{"name": "France", "code": "FR", "capital": "Paris", "continent": "Europe", "languages": ["French"]}
{"name": "Germany", "code": "DE", "capital": "Berlin", "continent": "Europe", "languages": ["German"]}
{"name": "Italy", "code": "IT", "capital": "Rome", "continent": "Europe", "languages": ["Italian"]}
{"name": "Spain", "code": "ES", "capital": "Madrid", "continent": "Europe", "languages": ["Spanish"]}
To make sure we have a fresh setup, we’ll create a temporary directory tmp (if it doesn’t already exist), and be sure to remove any copy of the database we intend to create.
We’ll then insert the objects:
[4]:
%%bash
mkdir -p tmp
rm -rf tmp/countries.db
linkml-store --database duckdb:///tmp/countries.db --collection countries insert ../../tests/input/countries/countries.jsonl
Inserted 20 objects from ../../tests/input/countries/countries.jsonl into collection 'countries'.
Note that the --database and --collection options come before the insert subcommand.
With LinkML-Store, everything must go into a collection, so we specified countries as the name.
Querying
Next we’ll explore the query command:
[24]:
%%bash
linkml-store query --help
Usage: linkml-store query [OPTIONS]
Query objects from the specified collection.
Leave the query field blank to return all objects in the collection.
Examples:
linkml-store -d duckdb:///countries.db -c countries query
Queries can be specified in YAML, as basic key-value pairs
Examples:
linkml-store -d duckdb:///countries.db -c countries query -w 'code: NZ'
More complex queries can be specified using MongoDB-style query syntax
Examples:
linkml-store -d file:. -c persons query -w 'occupation: {$ne:
Architect}'
Finds all people who are not architects.
Options:
-w, --where TEXT WHERE clause for the query, as YAML
-l, --limit INTEGER Maximum number of results to return
-O, --output-type [json|jsonl|yaml|tsv|csv|parquet|formatted]
Output format
-o, --output PATH Output file path
--help Show this message and exit.
Let’s query for all objects that have code="GB", and get the results back in the human-readable formatted output. The argument for the --where (or -w) option is a YAML object with a MongoDB-style query.
[5]:
%%bash
linkml-store --database duckdb:///tmp/countries.db -c countries query -w "code: GB" -O formatted
             name code capital continent  languages
0  United Kingdom   GB  London    Europe  [English]
We can get the output in different formats:
[6]:
%%bash
linkml-store --database duckdb:///tmp/countries.db -c countries query -w "code: GB" -O yaml
name: United Kingdom
code: GB
capital: London
continent: Europe
languages:
- English
Formats include csv, tsv, yaml, json, jsonl, parquet, and formatted (a human-readable format).
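To make the MongoDB-style --where syntax more concrete, here is a minimal Python sketch of how a where clause, once parsed from YAML into a dictionary, could be matched against an object. It is an illustration only, not linkml-store's implementation: the matches function is hypothetical, and it handles just plain equality and the $ne operator that appears in the help text above.

```python
# Illustrative sketch only: NOT linkml-store's actual query engine.
# `matches` is a hypothetical helper; it supports plain equality and `$ne`.


def matches(obj: dict, where: dict) -> bool:
    """Return True if obj satisfies every condition in the where clause."""
    for field, condition in where.items():
        value = obj.get(field)
        if isinstance(condition, dict):
            # Operator form, e.g. {"$ne": "Architect"}
            for op, operand in condition.items():
                if op == "$ne":
                    if value == operand:
                        return False
                else:
                    raise ValueError(f"Unsupported operator in this sketch: {op}")
        elif value != condition:
            # Plain key-value form, e.g. {"code": "GB"}
            return False
    return True


countries = [
    {"name": "United Kingdom", "code": "GB", "continent": "Europe"},
    {"name": "France", "code": "FR", "continent": "Europe"},
    {"name": "Canada", "code": "CA", "continent": "North America"},
]

# Counterpart of -w "code: GB" after the YAML has been parsed into a dict
gb_only = [c["name"] for c in countries if matches(c, {"code": "GB"})]

# Counterpart of -w 'continent: {$ne: Europe}'
not_europe = [
    c["name"] for c in countries if matches(c, {"continent": {"$ne": "Europe"}})
]

print(gb_only)
print(not_europe)
```

An empty where clause matches every object, which mirrors leaving the query blank to return the whole collection.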
Describing the data set
The describe command gives a high-level overview of the data set:
[25]:
%%bash
linkml-store describe --help
Usage: linkml-store describe [OPTIONS]
Describe the collection schema.
Options:
-w, --where TEXT WHERE clause for the query
-O, --output-type [json|jsonl|yaml|tsv|csv|parquet|formatted]
Output format
-o, --output PATH Output file path
--help Show this message and exit.
Let’s try with the countries dataset:
[8]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries describe
           count  unique  top               freq
capital    1      1       Washington, D.C.  1
code       1      1       US                1
continent  1      1       North America     1
languages  1      1       [English]         1
name       1      1       United States     1
Note this command is more useful for numeric data…
Facet Counts
You can combine any query (including an empty query, which matches the whole collection) with a facet query, which returns counts of objects broken down by one or more specified slots.
[26]:
%%bash
linkml-store fq --help
Usage: linkml-store fq [OPTIONS]
Query facets from the specified collection.
:param ctx: :param where: :param limit: :param columns: :param output_type:
:param output: :return:
Options:
-w, --where TEXT WHERE clause for the query
-l, --limit INTEGER Maximum number of results to return
-O, --output-type [json|jsonl|yaml|tsv|csv|parquet|formatted]
Output format
-o, --output PATH Output file path
-S, --columns TEXT Columns to facet on
--help Show this message and exit.
[9]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries fq -S continent
{
"continent": {
"Europe": 5,
"Asia": 5,
"Africa": 3,
"North America": 3,
"South America": 2,
"Oceania": 2
}
}
Remember that this is a deliberately reduced test dataset, so we don’t expect to see every country there!
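Conceptually, a facet count is a group-and-count over one slot. The following Python sketch shows the idea on the ten rows of countries.jsonl printed earlier (reduced to two fields). The facet_counts function is a hypothetical helper written for this illustration; it is not how linkml-store computes facets internally.

```python
from collections import Counter

# The ten rows shown by `head` earlier, reduced to the fields needed here.
rows = [
    {"code": "US", "continent": "North America"},
    {"code": "CA", "continent": "North America"},
    {"code": "MX", "continent": "North America"},
    {"code": "BR", "continent": "South America"},
    {"code": "AR", "continent": "South America"},
    {"code": "GB", "continent": "Europe"},
    {"code": "FR", "continent": "Europe"},
    {"code": "DE", "continent": "Europe"},
    {"code": "IT", "continent": "Europe"},
    {"code": "ES", "continent": "Europe"},
]


def facet_counts(objects, column):
    """Count how many objects carry each distinct value of `column`."""
    return dict(Counter(obj[column] for obj in objects))


counts = facet_counts(rows, "continent")
print(counts)
```

These ten rows cover only three continents, and the counts for those three agree with the fq output above.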
Search
LinkML-Store is intended to allow for a flexible range of search strategies. Some of these may come from the underlying data store (for example, Solr and Elasticsearch are backed by Lucene indexing). Others may be integrated orthogonally.
A key search mechanism that is supported is text embedding via Large Language Models (LLMs). Note these are not enabled by default.
Currently the default mechanism (which works regardless of the underlying store) is a highly naive trigram-based vector embedding. This requires no external model. It is intended primarily for demonstration purposes, and should be swapped out for something else.
Indexing a collection
First we will explore the index command:
[27]:
%%bash
linkml-store index --help
Usage: linkml-store index [OPTIONS]
Create an index over a collection.
By default a simple trigram index is used.
Options:
-t, --index-type TEXT Type of index to create. Values: simple, llm
[default: simple]
-E, --cached-embeddings-database TEXT
Path to the database where embeddings are
cached
-T, --text-template TEXT Template for text embeddings
--help Show this message and exit.
Next we’ll create an index of the default type:
[28]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries index
Searching a collection using an index
Let’s explore the search command:
[29]:
%%bash
linkml-store search --help
Usage: linkml-store search [OPTIONS] SEARCH_TERM
Search objects in the specified collection.
Options:
-w, --where TEXT WHERE clause for the search
-l, --limit INTEGER Maximum number of search results
-O, --output-type [json|jsonl|yaml|tsv|csv|parquet|formatted]
Output format
-o, --output PATH Output file path
--auto-index / --no-auto-index Automatically index the collection
[default: no-auto-index]
-t, --index-type TEXT Type of index to create. Values: simple, llm
[default: simple]
--help Show this message and exit.
Now we’ll search for countries in the North where both English and French are spoken. We’ll pose this as a natural language query, but the default index is only picking up on trigram tokens in the strings.
[30]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries search "countries in the North where both english and french spoken" --limit 5 -O csv
score,name,code,capital,continent,languages
0.15670402880167877,Canada,CA,Ottawa,North America,"['English', 'French']"
0.14806601565681218,South Africa,ZA,Pretoria,Africa,"['Zulu', 'Xhosa', 'Afrikaans', 'English', 'Northern Sotho', 'Tswana', 'Southern Sotho', 'Tsonga', 'Swazi', 'Venda', 'Southern Ndebele']"
0.13749236361227862,United States,US,"Washington, D.C.",North America,['English']
0.09860812114511587,Argentina,AR,Buenos Aires,South America,['Spanish']
0.09765536333140983,Mexico,MX,Mexico City,North America,['Spanish']
By default, all fields in the object are indexed. Canada comes out top as the strings for English and French are present (or rather trigrams from those words). But remember the default method is just for illustration!
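To see why trigram overlap favours Canada for this query, here is a small Python sketch of a trigram-count vector compared by cosine similarity. This is an assumption-laden illustration of the general technique, not the exact algorithm linkml-store uses: the trigrams and cosine functions are hypothetical helpers, and the scores will not match the ones printed above.

```python
import math
from collections import Counter


def trigrams(text: str) -> Counter:
    """Count overlapping 3-character substrings of the lower-cased text."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse trigram-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = trigrams("english and french")

# Each object is flattened into one string of its field values (an assumption
# made for this sketch).
canada = trigrams("Canada CA Ottawa North America English French")
argentina = trigrams("Argentina AR Buenos Aires South America Spanish")

score_canada = cosine(query, canada)
score_argentina = cosine(query, argentina)

# Canada shares the trigrams of both "english" and "french" with the query,
# so it scores higher than Argentina.
print(score_canada > score_argentina)
```

Because the comparison is purely character-based, an unrelated word that happens to share trigrams with the query also raises the score, which is why this method is only suitable for demonstration.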
Indexing using an LLM (OPTIONAL)
Note that for this to work, you need to have installed this package with the llm extra, like this:
pip install linkml-store[llm]
Or if you have this repo checked out and are using Poetry:
poetry install --all-extras
You will also need an OpenAI account.
If this is too much, you can just skip this section!
[31]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries index -t llm -E tmp/llm_cache.db
[32]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries search -t llm "countries in the North where both english and french spoken" --limit 5 -O csv
score,name,code,capital,continent,languages
0.7927589434263863,Canada,CA,Ottawa,North America,"['English', 'French']"
0.7641212153371397,France,FR,Paris,Europe,['French']
0.7546847140878102,United States,US,"Washington, D.C.",North America,['English']
0.7424773577897005,Australia,AU,Canberra,Oceania,['English']
0.741656789495497,United Kingdom,GB,London,Europe,['English']
The results are not particularly meaningful, but the idea is that this could be used in a RAG-style system.
Schemas
Note that above we did not explicitly specify a schema; instead one was induced from the data.
We can use the schema command to see the induced schema in LinkML YAML.
[11]:
%%bash
linkml-store -d duckdb:///tmp/countries.db schema
name: test-schema
id: http://example.org/test-schema
imports:
- linkml:types
prefixes:
linkml:
prefix_prefix: linkml
prefix_reference: https://w3id.org/linkml/
test_schema:
prefix_prefix: test_schema
prefix_reference: http://example.org/test-schema/
default_prefix: test_schema
default_range: string
classes:
countries:
name: countries
attributes:
name:
name: name
multivalued: false
range: string
required: false
code:
name: code
multivalued: false
range: string
required: false
capital:
name: capital
multivalued: false
range: string
required: false
continent:
name: continent
multivalued: false
range: string
required: false
languages:
name: languages
multivalued: true
range: string
required: false
internal__index__countries__simple:
name: internal__index__countries__simple
attributes:
name:
name: name
multivalued: false
range: string
required: false
code:
name: code
multivalued: false
range: string
required: false
capital:
name: capital
multivalued: false
range: string
required: false
continent:
name: continent
multivalued: false
range: string
required: false
languages:
name: languages
multivalued: true
range: string
required: false
__index__:
name: __index__
multivalued: true
range: string
required: false
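The induced attributes above follow a simple shape: one attribute per key, marked multivalued when the values are lists. The Python sketch below reproduces that shape for illustration. It is not linkml-store's induction code: induce_attributes is a hypothetical helper, and it always assigns the range string, which is a simplification that happens to fit this dataset.

```python
# Illustrative sketch only: NOT linkml-store's schema induction.
# `induce_attributes` is a hypothetical helper; every range is "string" here.


def induce_attributes(objects):
    """Derive one attribute description per key seen in the objects."""
    attributes = {}
    for obj in objects:
        for key, value in obj.items():
            attr = attributes.setdefault(
                key,
                {"name": key, "multivalued": False, "range": "string", "required": False},
            )
            if isinstance(value, list):
                # Any list value marks the attribute as multivalued
                attr["multivalued"] = True
    return attributes


sample = [
    {"name": "Canada", "code": "CA", "capital": "Ottawa",
     "continent": "North America", "languages": ["English", "French"]},
    {"name": "France", "code": "FR", "capital": "Paris",
     "continent": "Europe", "languages": ["French"]},
]

induced = induce_attributes(sample)
for name, attr in induced.items():
    print(name, attr["multivalued"])
```

Nothing is marked as required and no patterns are inferred, which matches the induced schema shown above and motivates supplying an explicit schema when constraints matter.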
Configuration Files and Explicit Schemas
Rather than repeat --database and --collection each time, we can make use of YAML config files.
These can also package useful information and schemas.
First we will create a fresh copy of a directory with both configuration files and schemas:
[12]:
%%bash
cp -pr ../../tests/input/countries tmp
rm tmp/countries/countries.db
The configuration YAML is fairly minimal: it specifies a single database with a single collection, and a pointer to a schema.
[13]:
%%bash
cat tmp/countries/countries.config.yaml
databases:
countries_db:
handle: "duckdb:///{base_dir}/countries.db"
schema_location: "{base_dir}/countries.linkml.yaml"
collections:
countries:
type: Country
The schema itself is fairly basic: the Country class (whose name matches the type in the configuration) has a handful of slots. Note the slots have some constraints, e.g. regular-expression patterns. The file also defines a Route class, which this tutorial does not use.
[14]:
%%bash
cat tmp/countries/countries.linkml.yaml
id: https://example.org/countries
name: countries
description: A schema for representing countries
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
countries: https://example.org/countries/
linkml: https://w3id.org/linkml/
default_prefix: countries
default_range: string
imports:
- linkml:types
classes:
Country:
description: A sovereign state
slots:
- name
- code
- capital
- continent
- languages
Route:
slots:
- origin
- destination
- method
slots:
name:
description: The name of the country
required: true
# identifier: true
code:
description: The ISO 3166-1 alpha-2 code of the country
required: true
pattern: '^[A-Z]{2}$'
identifier: true
capital:
description: The capital city of the country
required: true
continent:
description: The continent where the country is located
required: true
languages:
description: The main languages spoken in the country
range: Language
multivalued: true
origin:
range: Country
destination:
range: Country
method:
range: MethodEnum
enums:
MethodEnum:
permissible_values:
rail:
air:
road:
types:
Language:
typeof: string
description: A human language
[15]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml insert tmp/countries/countries.jsonl
Inserted 20 objects from tmp/countries/countries.jsonl into collection 'countries'.
[16]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml list-collections
countries
name: countries
alias: null
type: Country
additional_properties: null
attributes: null
indexers: null
hidden: false
is_prepopulated: false
source_location: null
[17]:
%%bash
linkml-store --stacktrace -C tmp/countries/countries.config.yaml -c countries query -w "code: GB"
[
{
"name": "United Kingdom",
"code": "GB",
"capital": "London",
"continent": "Europe",
"languages": [
"English"
]
}
]
Validation
LinkML-Store is designed to allow for rich validation, regardless of the underlying database store used.
For validation to work, we need to specify an explicit schema, as we have done with the configuration above.
To test it, we will insert some fake data:
[18]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml insert --object '{name: Foolandia, code: "X Y", languages: ["Fooish"]}'
Inserted 3 objects from {name: Foolandia, code: "X Y", languages: ["Fooish"]} into collection 'countries'.
Let’s check that the data is there:
[82]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml query -w 'name: Foolandia'
[
{
"name": "Foolandia",
"code": "X Y",
"capital": null,
"continent": null,
"languages": [
"Fooish"
]
}
]
Note that by default, validation is deferred. You can insert whatever you like, and then validate later.
Other configurations may be more suited to your project, including strict/prospective validation.
Next let’s examine the schema:
[83]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml schema
name: countries
description: A schema for representing countries
id: https://example.org/countries
imports:
- linkml:types
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
countries:
prefix_prefix: countries
prefix_reference: https://example.org/countries/
linkml:
prefix_prefix: linkml
prefix_reference: https://w3id.org/linkml/
default_prefix: countries
default_range: string
types:
Language:
name: Language
description: A human language
typeof: string
slots:
name:
name: name
description: The name of the country
identifier: true
required: true
code:
name: code
description: The ISO 3166-1 alpha-2 code of the country
required: true
pattern: ^[A-Z]{2}$
capital:
name: capital
description: The capital city of the country
required: true
continent:
name: continent
description: The continent where the country is located
required: true
languages:
name: languages
description: The main languages spoken in the country
multivalued: true
range: Language
classes:
Country:
name: Country
description: A sovereign state
slots:
- name
- code
- capital
- continent
- languages
source_file: tmp/countries/countries.linkml.yaml
Run validation
Next we will run the validate command:
[23]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml validate -O yaml
type: jsonschema validation
severity: ERROR
message: '''X Y'' does not match ''^[A-Z]{2}$'' in /code'
instance:
name: Foolandia
code: X Y
languages:
- Fooish
instance_index: 0
instantiates: Country
context: []
---
type: jsonschema validation
severity: ERROR
message: '''capital'' is a required property in /'
instance:
name: Foolandia
code: X Y
languages:
- Fooish
instance_index: 0
instantiates: Country
context: []
---
type: jsonschema validation
severity: ERROR
message: '''continent'' is a required property in /'
instance:
name: Foolandia
code: X Y
languages:
- Fooish
instance_index: 0
instantiates: Country
context: []
Here we can see 3 issues with the data we added:
- the code doesn’t match the regexp we provided (it has a space)
- the capital is missing
- the continent is missing
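The same three problems can be reproduced by hand with a few lines of Python. This sketch is only an approximation of the checks involved: the validate command reports jsonschema validation results, whereas here the required slots and the code pattern are copied from the schema into hypothetical constants (REQUIRED, CODE_PATTERN) and checked with the standard re module.

```python
import re

# Constraints copied by hand from countries.linkml.yaml for this sketch.
REQUIRED = ["name", "code", "capital", "continent"]
CODE_PATTERN = re.compile(r"^[A-Z]{2}$")


def check_country(obj):
    """Return a list of human-readable problems found in one object."""
    problems = []
    code = obj.get("code")
    if code is not None and not CODE_PATTERN.match(code):
        problems.append(f"'{code}' does not match '{CODE_PATTERN.pattern}' in /code")
    for slot in REQUIRED:
        if obj.get(slot) is None:
            problems.append(f"'{slot}' is a required property")
    return problems


foolandia = {"name": "Foolandia", "code": "X Y", "languages": ["Fooish"]}
problems = check_country(foolandia)
for problem in problems:
    print(problem)
```

A well-formed object such as the United Kingdom record yields an empty list, which corresponds to validate reporting nothing for that object.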