Tutorial: Using the Command Line Interface

This tutorial walks through using LinkML-Store via the command-line interface (CLI).

This tutorial is a Jupyter notebook: you can execute it as a notebook, or try it for yourself by running the commands directly in a command-line environment.

Note that %%bash is a directive for Jupyter itself; you don’t need to type it.

Top level command

The top-level command is linkml-store. This command doesn’t do anything by itself; instead, it provides various subcommands.

The linkml-store command has a few global options for specifying the configuration, database, and collection:

[1]:
%%bash
linkml-store --help
Usage: linkml-store [OPTIONS] COMMAND [ARGS]...

  A CLI for interacting with the linkml-store.

Options:
  -d, --database TEXT             Database name
  -c, --collection TEXT           Collection name
  -C, --config PATH               Path to the configuration file
  --set TEXT                      Metadata settings in the form PATHEXPR=value
  -v, --verbose
  -q, --quiet / --no-quiet
  --stacktrace / --no-stacktrace  If set then show full stacktrace on error
                                  [default: no-stacktrace]
  --help                          Show this message and exit.

Commands:
  apply             Apply a patch to a collection.
  describe          Describe the collection schema.
  diff              Diffs two collectoons to create a patch.
  export            Exports a database to a dump.
  fq                Query facets from the specified collection.
  import            Imports a database from a dump.
  index             Create an index over a collection.
  indexes           Show the indexes for a collection.
  insert            Insert objects from files (JSON, YAML, TSV) into the...
  list-collections
  query             Query objects from the specified collection.
  schema            Show the schema for a database
  search            Search objects in the specified collection.
  store             Store objects from files (JSON, YAML, TSV) into the...
  validate          Validate objects in the specified collection.

Inserting objects from a file

Next we’ll explore the insert command:

[2]:
%%bash
linkml-store --stacktrace insert --help
Usage: linkml-store insert [OPTIONS] [FILES]...

  Insert objects from files (JSON, YAML, TSV) into the specified collection.

Options:
  -f, --format [json|jsonl|yaml|tsv|csv|parquet|formatted]
                                  Input format
  -i, --object TEXT               Input object as YAML
  --help                          Show this message and exit.

We’ll insert a small test file (in JSON Lines format) into a fresh database.

[3]:
%%bash
head ../../tests/input/countries/countries.jsonl
{"name": "United States", "code": "US", "capital": "Washington, D.C.", "continent": "North America", "languages": ["English"]}
{"name": "Canada", "code": "CA", "capital": "Ottawa", "continent": "North America", "languages": ["English", "French"]}
{"name": "Mexico", "code": "MX", "capital": "Mexico City", "continent": "North America", "languages": ["Spanish"]}
{"name": "Brazil", "code": "BR", "capital": "Brasília", "continent": "South America", "languages": ["Portuguese"]}
{"name": "Argentina", "code": "AR", "capital": "Buenos Aires", "continent": "South America", "languages": ["Spanish"]}
{"name": "United Kingdom", "code": "GB", "capital": "London", "continent": "Europe", "languages": ["English"]}
{"name": "France", "code": "FR", "capital": "Paris", "continent": "Europe", "languages": ["French"]}
{"name": "Germany", "code": "DE", "capital": "Berlin", "continent": "Europe", "languages": ["German"]}
{"name": "Italy", "code": "IT", "capital": "Rome", "continent": "Europe", "languages": ["Italian"]}
{"name": "Spain", "code": "ES", "capital": "Madrid", "continent": "Europe", "languages": ["Spanish"]}

To make sure we have a fresh setup, we’ll create a temporary directory tmp (if it doesn’t already exist), and be sure to remove any copy of the database we intend to create.

We’ll then insert the objects:

[4]:
%%bash
mkdir -p tmp
rm -rf tmp/countries.db
linkml-store --database duckdb:///tmp/countries.db --collection countries insert ../../tests/input/countries/countries.jsonl
Inserted 20 objects from ../../tests/input/countries/countries.jsonl into collection 'countries'.

Note that the --database and --collection options come before the insert subcommand.

With LinkML-Store, every object must go into a collection, so we specified countries as the collection name.
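
To try the same steps on your own data, the sketch below writes a three-line JSON Lines file and inserts it into a fresh DuckDB database. The temporary directory and the file names sample.jsonl and sample.db are illustrative assumptions; the records are copied from the test file shown above. The insert step is skipped if linkml-store is not on your PATH.

```shell
# Work in a throwaway directory so nothing existing is overwritten.
workdir=$(mktemp -d)
data="$workdir/sample.jsonl"

# JSON Lines: one complete JSON object per line.
cat > "$data" <<'EOF'
{"name": "Canada", "code": "CA", "capital": "Ottawa", "continent": "North America", "languages": ["English", "French"]}
{"name": "France", "code": "FR", "capital": "Paris", "continent": "Europe", "languages": ["French"]}
{"name": "Germany", "code": "DE", "capital": "Berlin", "continent": "Europe", "languages": ["German"]}
EOF

# Count the records we expect to insert (one per line).
n_records=$(wc -l < "$data" | tr -d ' ')
echo "Prepared $n_records records in $data"

# Global options (--database, --collection) go before the insert subcommand.
if command -v linkml-store >/dev/null 2>&1; then
  (cd "$workdir" && linkml-store --database duckdb:///sample.db --collection countries insert sample.jsonl)
else
  echo "linkml-store not found; skipping insert" >&2
fi
```

Running the insert from inside the working directory keeps the database path relative, in the same style as the commands in this tutorial.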

Querying

Next we’ll explore the query command:

[24]:
%%bash
linkml-store query --help
Usage: linkml-store query [OPTIONS]

  Query objects from the specified collection.

  Leave the query field blank to return all objects in the collection.

  Examples:

      linkml-store -d duckdb:///countries.db -c countries query

  Queries can be specified in YAML, as basic key-value pairs

  Examples:

      linkml-store -d duckdb:///countries.db -c countries query -w 'code: NZ'

  More complex queries can be specified using MongoDB-style query syntax

  Examples:

      linkml-store -d file:. -c persons query  -w 'occupation: {$ne:
      Architect}'

  Finds all people who are not architects.

Options:
  -w, --where TEXT                WHERE clause for the query, as YAML
  -l, --limit INTEGER             Maximum number of results to return
  -O, --output-type [json|jsonl|yaml|tsv|csv|parquet|formatted]
                                  Output format
  -o, --output PATH               Output file path
  --help                          Show this message and exit.

Let’s query for all objects that have code="GB", and get the results back in the human-readable formatted output. The argument for the --where (or -w) option is a YAML object with a MongoDB-style query.

[5]:
%%bash
linkml-store --database duckdb:///tmp/countries.db -c countries query -w "code: GB" -O formatted
             name code capital continent  languages
0  United Kingdom   GB  London    Europe  [English]

We can get the output in different formats:

[6]:
%%bash
linkml-store --database duckdb:///tmp/countries.db -c countries query -w "code: GB" -O yaml
name: United Kingdom
code: GB
capital: London
continent: Europe
languages:
- English

Formats include csv, tsv, yaml, json, jsonl, parquet, and formatted (a human-readable format).
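
One practical detail about the --where argument: MongoDB-style operators such as $ne (used in the help text above) start with a dollar sign, so the shell quoting matters. Inside double quotes the shell treats $ne as a variable and typically replaces it with nothing. The sketch below shows the difference; the continent filter is an illustrative assumption, and the query only runs if linkml-store and the database from the earlier steps are available.

```shell
# Stand-in for the usual situation where no shell variable named "ne" is set.
ne=''

# Single quotes pass the text through untouched, so the operator survives.
where_ok='continent: {$ne: Europe}'

# Double quotes let the shell expand $ne, which silently drops the operator.
where_bad="continent: {$ne: Europe}"

echo "single-quoted: $where_ok"
echo "double-quoted: $where_bad"

# Run the intact query only when the tool and the tutorial database are present.
if command -v linkml-store >/dev/null 2>&1 && [ -f tmp/countries.db ]; then
  linkml-store -d duckdb:///tmp/countries.db -c countries query -w "$where_ok" -l 3 -O tsv
fi
```

Simple key-value queries such as "code: GB" contain no dollar sign, which is why double quotes are harmless in the commands above.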

Describing the data set

The describe command gives a high-level overview of the data set:

[25]:
%%bash
linkml-store describe --help
Usage: linkml-store describe [OPTIONS]

  Describe the collection schema.

Options:
  -w, --where TEXT                WHERE clause for the query
  -O, --output-type [json|jsonl|yaml|tsv|csv|parquet|formatted]
                                  Output format
  -o, --output PATH               Output file path
  --help                          Show this message and exit.

Let’s try with the countries dataset:

[8]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries describe
          count unique               top freq
capital       1      1  Washington, D.C.    1
code          1      1                US    1
continent     1      1     North America    1
languages     1      1         [English]    1
name          1      1     United States    1

Note that this command is more useful for numeric data.

Facet Counts

You can combine any query (including an empty query, which fetches the whole collection) with a facet query, which returns counts of objects broken down by one or more specified slots.

[26]:
%%bash
linkml-store fq --help
Usage: linkml-store fq [OPTIONS]

  Query facets from the specified collection.

  :param ctx: :param where: :param limit: :param columns: :param output_type:
  :param output: :return:

Options:
  -w, --where TEXT                WHERE clause for the query
  -l, --limit INTEGER             Maximum number of results to return
  -O, --output-type [json|jsonl|yaml|tsv|csv|parquet|formatted]
                                  Output format
  -o, --output PATH               Output file path
  -S, --columns TEXT              Columns to facet on
  --help                          Show this message and exit.

For example, to facet on continent:

[9]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries fq -S continent
{
  "continent": {
    "Europe": 5,
    "Asia": 5,
    "Africa": 3,
    "North America": 3,
    "South America": 2,
    "Oceania": 2
  }
}

Remember that this is a deliberately reduced test dataset, so we don’t expect to see every country there!
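
As a rough cross-check of what a facet count means, similar tallies can be produced with standard Unix tools on a JSON Lines file. This sketch is not how linkml-store computes facets; it assumes each record sits on one line with a string-valued "continent" key formatted as in the test file, and it uses five records copied from that file as a sample. The function name facet_counts is a hypothetical helper.

```shell
# Five records copied from the countries test file, written to a scratch file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{"name": "Canada", "code": "CA", "capital": "Ottawa", "continent": "North America", "languages": ["English", "French"]}
{"name": "Mexico", "code": "MX", "capital": "Mexico City", "continent": "North America", "languages": ["Spanish"]}
{"name": "Brazil", "code": "BR", "capital": "Brasília", "continent": "South America", "languages": ["Portuguese"]}
{"name": "France", "code": "FR", "capital": "Paris", "continent": "Europe", "languages": ["French"]}
{"name": "Germany", "code": "DE", "capital": "Berlin", "continent": "Europe", "languages": ["German"]}
EOF

# Pull out each "continent" value, then count how often each value occurs.
facet_counts() {
  grep -o '"continent": "[^"]*"' "$1" \
    | cut -d'"' -f4 \
    | sort \
    | uniq -c \
    | awk '{n = $1; $1 = ""; sub(/^ +/, ""); print $0 ": " n}'
}

facet_counts "$sample"
```

A text-based tally like this breaks down on nested or multi-line JSON, which is one reason to prefer the fq command for real data.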

Indexing using an LLM (OPTIONAL)

Note for this to work, you need to have installed this package with the llm extra, like this:

pip install linkml-store[llm]

Or if you have this repo checked out and are using Poetry:

poetry install --all-extras

You will also need an OpenAI account.

If this is too much, you can just skip this section!

[31]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries index -t llm -E tmp/llm_cache.db

Once the index has been built, we can search the collection with it:

[32]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries search -t llm "countries in the North where both english and french spoken" --limit 5 -O csv
score,name,code,capital,continent,languages
0.7927589434263863,Canada,CA,Ottawa,North America,"['English', 'French']"
0.7641212153371397,France,FR,Paris,Europe,['French']
0.7546847140878102,United States,US,"Washington, D.C.",North America,['English']
0.7424773577897005,Australia,AU,Canberra,Oceania,['English']
0.741656789495497,United Kingdom,GB,London,Europe,['English']

The results are not particularly meaningful, but the idea is that this could be used in a RAG-style system.

Schemas

Note that in the above we did not explicitly specify a schema; instead, it was induced from the data.

We can use the schema command to see the induced schema in LinkML YAML.

[11]:
%%bash
linkml-store -d duckdb:///tmp/countries.db schema
name: test-schema
id: http://example.org/test-schema
imports:
- linkml:types
prefixes:
  linkml:
    prefix_prefix: linkml
    prefix_reference: https://w3id.org/linkml/
  test_schema:
    prefix_prefix: test_schema
    prefix_reference: http://example.org/test-schema/
default_prefix: test_schema
default_range: string
classes:
  countries:
    name: countries
    attributes:
      name:
        name: name
        multivalued: false
        range: string
        required: false
      code:
        name: code
        multivalued: false
        range: string
        required: false
      capital:
        name: capital
        multivalued: false
        range: string
        required: false
      continent:
        name: continent
        multivalued: false
        range: string
        required: false
      languages:
        name: languages
        multivalued: true
        range: string
        required: false
  internal__index__countries__simple:
    name: internal__index__countries__simple
    attributes:
      name:
        name: name
        multivalued: false
        range: string
        required: false
      code:
        name: code
        multivalued: false
        range: string
        required: false
      capital:
        name: capital
        multivalued: false
        range: string
        required: false
      continent:
        name: continent
        multivalued: false
        range: string
        required: false
      languages:
        name: languages
        multivalued: true
        range: string
        required: false
      __index__:
        name: __index__
        multivalued: true
        range: string
        required: false

Configuration Files and Explicit Schemas

Rather than repeat --database and --collection each time, we can make use of YAML config files.

These can also package useful information and schemas.

First we will create a fresh copy of a directory with both configuration files and schemas:

[12]:
%%bash
cp -pr ../../tests/input/countries tmp
rm tmp/countries/countries.db

The configuration YAML is fairly minimal: it specifies a single database with a single collection, and a pointer to a schema.

[13]:
%%bash
cat tmp/countries/countries.config.yaml
databases:
  countries_db:
    handle: "duckdb:///{base_dir}/countries.db"
    schema_location: "{base_dir}/countries.linkml.yaml"
    collections:
      countries:
        type: Country

The schema itself is fairly basic. The collection uses the Country class (whose name matches the type in the configuration), which has a handful of slots. Note that some slots carry constraints, e.g. a regular-expression pattern on code.

[14]:
%%bash
cat tmp/countries/countries.linkml.yaml
id: https://example.org/countries
name: countries
description: A schema for representing countries
license: https://creativecommons.org/publicdomain/zero/1.0/

prefixes:
  countries: https://example.org/countries/
  linkml: https://w3id.org/linkml/

default_prefix: countries
default_range: string

imports:
  - linkml:types

classes:
  Country:
    description: A sovereign state
    slots:
      - name
      - code
      - capital
      - continent
      - languages
  Route:
    slots:
      - origin
      - destination
      - method

slots:
  name:
    description: The name of the country
    required: true
    # identifier: true
  code:
    description: The ISO 3166-1 alpha-2 code of the country
    required: true
    pattern: '^[A-Z]{2}$'
    identifier: true
  capital:
    description: The capital city of the country
    required: true
  continent:
    description: The continent where the country is located
    required: true
  languages:
    description: The main languages spoken in the country
    range: Language
    multivalued: true
  origin:
    range: Country
  destination:
    range: Country
  method:
    range: MethodEnum

enums:
  MethodEnum:
    permissible_values:
      rail:
      air:
      road:

types:
  Language:
    typeof: string
    description: A human language

Now we can insert the objects again, this time using the configuration file:

[15]:
%%bash
linkml-store  -C tmp/countries/countries.config.yaml insert tmp/countries/countries.jsonl
Inserted 20 objects from tmp/countries/countries.jsonl into collection 'countries'.

The list-collections command shows the collections in the database, along with their metadata:

[16]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml list-collections
countries
name: countries
alias: null
type: Country
additional_properties: null
attributes: null
indexers: null
hidden: false
is_prepopulated: false
source_location: null

Queries work as before:

[17]:
%%bash
linkml-store --stacktrace -C tmp/countries/countries.config.yaml -c countries query -w "code: GB"
[
  {
    "name": "United Kingdom",
    "code": "GB",
    "capital": "London",
    "continent": "Europe",
    "languages": [
      "English"
    ]
  }
]
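
If you run many commands against the same configuration, a small shell function can save typing. The function name lms and the LINKML_STORE_BIN and LINKML_STORE_CONFIG variables below are hypothetical conveniences for this tutorial, not something linkml-store defines or reads itself.

```shell
# Hypothetical default: the configuration file used in this section.
LINKML_STORE_CONFIG=${LINKML_STORE_CONFIG:-tmp/countries/countries.config.yaml}

# Prepend the -C option so each call only has to name the subcommand and its arguments.
lms() {
  # LINKML_STORE_BIN lets you swap in a different executable, e.g. for a dry run.
  "${LINKML_STORE_BIN:-linkml-store}" -C "$LINKML_STORE_CONFIG" "$@"
}

# Dry run in a subshell: echo stands in for the real binary, so this only prints
# the arguments that would be passed.
(LINKML_STORE_BIN=echo; lms -c countries query -w 'code: GB')
```

With the real binary on your PATH, lms list-collections then stands in for the longer linkml-store -C tmp/countries/countries.config.yaml list-collections.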

Validation

LinkML-Store is designed to allow for rich validation, regardless of the underlying database store used.

For validation to work, we need to specify an explicit schema, as we have done with the configuration above.

To test it, we will insert some fake data:

[18]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml insert --object '{name: Foolandia, code: "X Y", languages: ["Fooish"]}'
Inserted 3 objects from {name: Foolandia, code: "X Y", languages: ["Fooish"]} into collection 'countries'.

Let’s check that the data is there:

[82]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml query -w 'name: Foolandia'
[
  {
    "name": "Foolandia",
    "code": "X Y",
    "capital": null,
    "continent": null,
    "languages": [
      "Fooish"
    ]
  }
]

Note that by default, validation is deferred. You can insert whatever you like, and then validate later.

Other configurations may be more suited to your project, including strict/prospective validation.

Next let’s examine the schema:

[83]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml schema
name: countries
description: A schema for representing countries
id: https://example.org/countries
imports:
- linkml:types
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
  countries:
    prefix_prefix: countries
    prefix_reference: https://example.org/countries/
  linkml:
    prefix_prefix: linkml
    prefix_reference: https://w3id.org/linkml/
default_prefix: countries
default_range: string
types:
  Language:
    name: Language
    description: A human language
    typeof: string
slots:
  name:
    name: name
    description: The name of the country
    identifier: true
    required: true
  code:
    name: code
    description: The ISO 3166-1 alpha-2 code of the country
    required: true
    pattern: ^[A-Z]{2}$
  capital:
    name: capital
    description: The capital city of the country
    required: true
  continent:
    name: continent
    description: The continent where the country is located
    required: true
  languages:
    name: languages
    description: The main languages spoken in the country
    multivalued: true
    range: Language
classes:
  Country:
    name: Country
    description: A sovereign state
    slots:
    - name
    - code
    - capital
    - continent
    - languages
source_file: tmp/countries/countries.linkml.yaml

Run validation

Next we will run the validate command:

[23]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml validate -O yaml
type: jsonschema validation
severity: ERROR
message: '''X Y'' does not match ''^[A-Z]{2}$'' in /code'
instance:
  name: Foolandia
  code: X Y
  languages:
  - Fooish
instance_index: 0
instantiates: Country
context: []
---
type: jsonschema validation
severity: ERROR
message: '''capital'' is a required property in /'
instance:
  name: Foolandia
  code: X Y
  languages:
  - Fooish
instance_index: 0
instantiates: Country
context: []
---
type: jsonschema validation
severity: ERROR
message: '''continent'' is a required property in /'
instance:
  name: Foolandia
  code: X Y
  languages:
  - Fooish
instance_index: 0
instantiates: Country
context: []

Here we can see 3 issues with the data we added:

  • the code doesn’t match the regexp we provided (it has a space)

  • the capital is missing

  • the continent is missing
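
The first error comes from the pattern constraint on the code slot. As an illustration only (linkml-store performs this check itself during validation), the same regular expression can be tried on candidate values with grep -E. The function name matches_code_pattern is a hypothetical helper.

```shell
# Succeed (exit status 0) only when the value is exactly two uppercase ASCII letters,
# mirroring the pattern '^[A-Z]{2}$' from the schema.
matches_code_pattern() {
  # LC_ALL=C keeps [A-Z] limited to ASCII uppercase regardless of the user's locale.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[A-Z]{2}$'
}

# Try a valid code, the invalid code inserted above, and two other near misses.
for code in GB "X Y" gb USA; do
  if matches_code_pattern "$code"; then
    echo "'$code' matches"
  else
    echo "'$code' does not match"
  fi
done
```

Only GB passes: "X Y" contains a space and has three characters, gb is lowercase, and USA is too long.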
