Perform LLM Inference
This notebook demonstrates how to perform inference using LLMs.
Whereas the previous RAG example drew on existing examples, this notebook performs de novo inference guided by a schema.
Note that linkml-store is a data-first framework; its main emphasis is not on AI or LLMs. However, it does support a pluggable inference framework, and one of the integrations is a simple LLM-based inference engine.
For this notebook, we will use the command line interface, but the same can be done programmatically using the Python API.
Loading the data into DuckDB
For this we will take all UniProt “caution” free-text comments for human proteins and load them into a DuckDB database.
[1]:
%%bash
mkdir -p tmp
rm -rf tmp/up.ddb
linkml-store -d duckdb:///tmp/up.ddb -c Entry insert ../../tests/input/uniprot/uniprot-comments.tsv
Inserted 2390 objects from ../../tests/input/uniprot/uniprot-comments.tsv into collection 'Entry'.
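To make "objects" concrete, here is a minimal sketch of how rows of a tab-separated file can be read as one dictionary per row. It assumes the file has a header row with the columns id, category and text (the three fields reported by describe in the next step). The second row uses a made-up identifier, and `read_objects` is a hypothetical helper, not part of linkml-store.

```python
import csv
import io

# Illustrative sample in an assumed three-column layout (id, category, text).
# The category column is left empty because it has not been inferred yet.
# EXAMPLE_HUMAN is a made-up identifier used only for this sketch.
SAMPLE_TSV = (
    "id\tcategory\ttext\n"
    "IMA4_HUMAN\t\tWas termed importin alpha-4\n"
    "EXAMPLE_HUMAN\t\tCould be the product of a pseudogene\n"
)


def read_objects(handle):
    """Return one dict per data row, keyed by the header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


objects = read_objects(io.StringIO(SAMPLE_TSV))
print(f"Parsed {len(objects)} objects")
for obj in objects:
    print(obj)
```

Each dictionary corresponds to one object in the Entry collection; how linkml-store parses the file internally may differ.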
Let’s check what this looks like by using the describe command, which summarizes each column:
[2]:
%%bash
linkml-store -d tmp/up.ddb describe
| | count | unique | top | freq |
|---|---|---|---|---|
| category | 2390 | 1 | | 2390 |
| id | 2390 | 2284 | EFC2_HUMAN | 4 |
| text | 2390 | 1383 | Could be the product of a pseudogene | 259 |
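The four statistics in the describe output are the usual summary for non-numeric columns: the number of values, the number of distinct values, the most frequent value and how often it occurs. For example, id has 2284 distinct values across 2390 rows, so some proteins carry more than one caution comment (EFC2_HUMAN has 4). The sketch below shows one way to compute these statistics with the standard library; `describe_column` is a hypothetical helper and is not claimed to be how linkml-store computes them.

```python
from collections import Counter


def describe_column(values):
    """Return count, unique, top and freq for a list of values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    # most_common(1) yields the single most frequent value with its count.
    top, freq = counts.most_common(1)[0] if counts else (None, 0)
    return {"count": len(present), "unique": len(counts), "top": top, "freq": freq}


# Toy input: one identifier appears twice, the others once.
ids = ["EFC2_HUMAN", "EFC2_HUMAN", "IMA4_HUMAN", "MOTSC_HUMAN"]
print(describe_column(ids))
```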
Introspecting the schema
Here we use a ready-made LinkML schema that defines the categories we want to assign as a LinkML enum. Each permissible value carries an example, since examples in schemas help both humans and LLMs.
[4]:
%%bash
linkml-store -S ../../tests/input/uniprot/schema.yaml -d tmp/up.ddb schema
name: uniprot-comments
id: http://example.org/uniprot-comments
imports:
- linkml:types
prefixes:
linkml:
prefix_prefix: linkml
prefix_reference: https://w3id.org/linkml/
up:
prefix_prefix: up
prefix_reference: http://example.org/tuniprot-comments
default_prefix: up
default_range: string
enums:
CommentCategory:
name: CommentCategory
permissible_values:
FUNCTION_DISPUTED:
text: FUNCTION_DISPUTED
description: A caution indicating that a previously reported function has
been challenged or disproven in subsequent studies; may warrant GO NOT annotation
examples:
- value: FUNCTION_DISPUTED
description: Originally described for its in vitro hydrolytic activity towards
dGMP, dAMP and dIMP. However, this was not confirmed in vivo
FUNCTION_PREDICTION_ONLY:
text: FUNCTION_PREDICTION_ONLY
description: A caution indicating function is based only on computational
prediction or sequence similarity
examples:
- value: FUNCTION_PREDICTION_ONLY
description: Predicted to be involved in X based on sequence similarity
FUNCTION_LACKS_EVIDENCE:
text: FUNCTION_LACKS_EVIDENCE
description: A caution indicating insufficient experimental evidence to support
predicted function
examples:
- value: FUNCTION_LACKS_EVIDENCE
description: In contrast to other Macro-domain containing proteins, lacks
ADP-ribose glycohydrolase activity
FUNCTION_DEBATED:
text: FUNCTION_DEBATED
description: A caution about ongoing scientific debate regarding function;
differs from DISPUTED in lacking clear evidence against
examples:
- value: FUNCTION_DEBATED
description: Was initially thought to act as a major regulator of cardiac
hypertrophy... However, while PDE5A regulates nitric-oxide-generated cGMP,
nitric oxide signaling is often depressed by heart disease, limiting its
effect
LOCALIZATION_DISPUTED:
text: LOCALIZATION_DISPUTED
description: A caution about conflicting or uncertain cellular localization
evidence
examples:
- value: LOCALIZATION_DISPUTED
description: Cellular localization remains to be finally defined. While
most authors have deduced a localization at the basolateral side, other
studies demonstrated an apical localization
NAMING_CONFUSION:
text: NAMING_CONFUSION
description: A caution about potential confusion with similarly named proteins
or historical naming issues
examples:
- value: NAMING_CONFUSION
description: This protein should not be confused with the conventional myosin-1
(MYH1);Was termed importin alpha-4
GENE_COPY_NUMBER:
text: GENE_COPY_NUMBER
description: A caution about gene duplication or copy number that might affect
interpretation
examples:
- value: GENE_COPY_NUMBER
description: Maps to a duplicated region on chromosome 15; the gene is present
in at least 3 almost identical copies
EXPRESSION_MECHANISM_UNCLEAR:
text: EXPRESSION_MECHANISM_UNCLEAR
description: A caution about unclear or unusual mechanisms of gene expression
or protein production
examples:
- value: EXPRESSION_MECHANISM_UNCLEAR
description: This peptide has been shown to be biologically active but is
the product of a mitochondrial gene. The mechanisms allowing the production
and secretion of the peptide remain unclear
SEQUENCE_FEATURE_MISSING:
text: SEQUENCE_FEATURE_MISSING
description: A caution about missing or unexpected sequence features that
might affect function
examples:
- value: SEQUENCE_FEATURE_MISSING
description: No predictable signal peptide
SPECIES_DIFFERENCE:
text: SPECIES_DIFFERENCE
description: A caution about significant functional or property differences
between orthologs
examples:
- value: SPECIES_DIFFERENCE
description: Affinity and capacity of the transporter for endogenous substrates
vary among orthologs. For endogenous compounds such as dopamine, histamine,
serotonin and thiamine, mouse ortholog display higher affinity
PUBLICATION_CONFLICT:
text: PUBLICATION_CONFLICT
description: A caution about conflicting published evidence or interpretation
examples:
- value: PUBLICATION_CONFLICT
description: Although initially reported to transport carnitine across the
hepatocyte membrane, another study was unable to verify this finding
CLAIMS_RETRACTED:
text: CLAIMS_RETRACTED
description: A caution about function claims that were retracted or withdrawn
examples:
- value: CLAIMS_RETRACTED
description: Has been reported to enhance netrin-induced phosphorylation
of PAK1 and FYN... This article has been withdrawn by the authors
PROTEIN_IDENTITY:
text: PROTEIN_IDENTITY
description: A caution about uncertainty in the identity or existence of distinct
protein products
examples:
- value: PROTEIN_IDENTITY
description: It is not known whether the so-called human ASE1 and human
CAST proteins represent two sides of a single gene product
FUNCTION_UNCERTAIN_INITIATION:
text: FUNCTION_UNCERTAIN_INITIATION
description: A caution about uncertainty in translation initiation site
examples:
- value: FUNCTION_UNCERTAIN_INITIATION
description: It is uncertain whether Met-1 or Met-37 is the initiator
PSEUDOGENE_STATUS:
text: PSEUDOGENE_STATUS
description: A caution about whether the gene encodes a protein
examples:
- value: PSEUDOGENE_STATUS
description: Could be the product of a pseudogene
classes:
Entry:
name: Entry
attributes:
id:
name: id
identifier: true
category:
name: category
range: CommentCategory
text:
name: text
description: The text of the comment
source_file: ../../tests/input/uniprot/schema.yaml
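To illustrate why descriptions and examples in the enum are useful to an LLM, the sketch below turns two of the permissible values shown above into a classification prompt. The prompt wording and the `build_prompt` helper are hypothetical; this is not the prompt that linkml-store's LLM inference engine actually sends.

```python
# Two permissible values copied from the schema output above.
# The prompt layout below is an assumption made for illustration only.
PERMISSIBLE_VALUES = {
    "PSEUDOGENE_STATUS": {
        "description": "A caution about whether the gene encodes a protein",
        "example": "Could be the product of a pseudogene",
    },
    "SEQUENCE_FEATURE_MISSING": {
        "description": (
            "A caution about missing or unexpected sequence features "
            "that might affect function"
        ),
        "example": "No predictable signal peptide",
    },
}


def build_prompt(text, permissible_values):
    """Render the enum as a list of options followed by the comment to classify."""
    lines = ["Assign exactly one category to the comment.", "", "Categories:"]
    for name, info in permissible_values.items():
        lines.append(f"- {name}: {info['description']}. Example: {info['example']}")
    lines += ["", f"Comment: {text}", "Category:"]
    return "\n".join(lines)


prompt = build_prompt("Could be the product of a pseudogene", PERMISSIBLE_VALUES)
print(prompt)
```

Because the category names, their descriptions and their examples all come from the schema, improving the schema documentation improves the prompt without changing any code.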
[52]:
%%bash
linkml-store -d tmp/up.ddb -c cv insert ../../tests/input/uniprot/uniprot-caution-cv.csv
Inserted 15 objects from ../../tests/input/uniprot/uniprot-caution-cv.csv into collection 'cv'.
[53]:
%%bash
linkml-store -d tmp/up.ddb::cv query --limit 3 -O yaml
TERM: FUNCTION_DISPUTED
DEFINITION: A caution indicating that a previously reported function has been challenged
or disproven in subsequent studies; may warrant GO NOT annotation
EXAMPLES: Originally described for its in vitro hydrolytic activity towards dGMP,
dAMP and dIMP. However, this was not confirmed in vivo
---
TERM: FUNCTION_PREDICTION_ONLY
DEFINITION: A caution indicating function is based only on computational prediction
or sequence similarity
EXAMPLES: Predicted to be involved in X based on sequence similarity
---
TERM: FUNCTION_LACKS_EVIDENCE
DEFINITION: A caution indicating insufficient experimental evidence to support predicted
function
EXAMPLES: In contrast to other Macro-domain containing proteins, lacks ADP-ribose
glycohydrolase activity
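The rows of the cv collection carry the same information as the permissible values of the CommentCategory enum in the schema: TERM corresponds to the value name, DEFINITION to its description and EXAMPLES to its example. The following sketch maps such rows onto the enum structure. Whether the schema was actually generated from this table is not stated here, and `rows_to_enum` is a hypothetical helper.

```python
import json

# Two rows of the cv collection, copied from the query output above.
CV_ROWS = [
    {
        "TERM": "FUNCTION_PREDICTION_ONLY",
        "DEFINITION": (
            "A caution indicating function is based only on computational "
            "prediction or sequence similarity"
        ),
        "EXAMPLES": "Predicted to be involved in X based on sequence similarity",
    },
    {
        "TERM": "FUNCTION_LACKS_EVIDENCE",
        "DEFINITION": (
            "A caution indicating insufficient experimental evidence to "
            "support predicted function"
        ),
        "EXAMPLES": (
            "In contrast to other Macro-domain containing proteins, lacks "
            "ADP-ribose glycohydrolase activity"
        ),
    },
]


def rows_to_enum(rows, enum_name):
    """Build an enum fragment shaped like the schema shown earlier."""
    permissible_values = {}
    for row in rows:
        permissible_values[row["TERM"]] = {
            "description": row["DEFINITION"],
            "examples": [{"value": row["TERM"], "description": row["EXAMPLES"]}],
        }
    return {"enums": {enum_name: {"permissible_values": permissible_values}}}


schema_fragment = rows_to_enum(CV_ROWS, "CommentCategory")
print(json.dumps(schema_fragment, indent=2))
```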
[54]:
%%bash
linkml-store -d tmp/up.ddb query --limit 3 -O yaml
TERM: FUNCTION_DISPUTED
DEFINITION: A caution indicating that a previously reported function has been challenged
or disproven in subsequent studies; may warrant GO NOT annotation
EXAMPLES: Originally described for its in vitro hydrolytic activity towards dGMP,
dAMP and dIMP. However, this was not confirmed in vivo
---
TERM: FUNCTION_PREDICTION_ONLY
DEFINITION: A caution indicating function is based only on computational prediction
or sequence similarity
EXAMPLES: Predicted to be involved in X based on sequence similarity
---
TERM: FUNCTION_LACKS_EVIDENCE
DEFINITION: A caution indicating insufficient experimental evidence to support predicted
function
EXAMPLES: In contrast to other Macro-domain containing proteins, lacks ADP-ribose
glycohydrolase activity
Inferring a specific field
[5]:
%%bash
linkml-store -S ../../tests/input/uniprot/schema.yaml -d tmp/up.ddb -c Entry infer -t llm -T category --where "id: MOTSC_HUMAN"
id: MOTSC_HUMAN
category: EXPRESSION_MECHANISM_UNCLEAR
text: This peptide has been shown to be biologically active but is the product of
a mitochondrial gene. Usage of the mitochondrial genetic code yields tandem start
and stop codons so translation must occur in the cytoplasm. The mechanisms allowing
the production and secretion of the peptide remain unclear
Inferring all rows
Here we use an empty --where clause ("{}") to select all rows in our collection and pass them through the inference engine.
[63]:
%%bash
linkml-store -S ../../tests/input/uniprot/schema.yaml -d tmp/up.ddb -c Entry infer -t llm -T category --where "{}" -O csv -o tmp/up.predicted.csv
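The two --where clauses used in this notebook, "id: MOTSC_HUMAN" and "{}", can be read as mappings from field names to required values. The sketch below shows that reading: a row matches when every listed field has the listed value, so an empty mapping matches every row. This assumes simple equality matching; the `matches` and `select` helpers are hypothetical, and the real query is evaluated by the database backend.

```python
def matches(row, where):
    """True when every field named in the where clause equals the row's value."""
    return all(row.get(field) == expected for field, expected in where.items())


def select(rows, where):
    """Return the rows that satisfy the where clause."""
    return [row for row in rows if matches(row, where)]


# Three identifiers taken from the predictions shown in this notebook.
ROWS = [
    {"id": "MOTSC_HUMAN"},
    {"id": "IMA4_HUMAN"},
    {"id": "POTB3_HUMAN"},
]

print(select(ROWS, {"id": "MOTSC_HUMAN"}))  # a single matching row
print(len(select(ROWS, {})))                # the empty clause keeps every row
```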
[6]:
import pandas as pd
[7]:
df = pd.read_csv("tmp/up.predicted.csv")
df
[7]:
| | id | category | text |
|---|---|---|---|
| 0 | MOTSC_HUMAN | EXPRESSION_MECHANISM_UNCLEAR | This peptide has been shown to be biologically... |
| 1 | POTB3_HUMAN | GENE_COPY_NUMBER | Maps to a duplicated region on chromosome 15; ... |
| 2 | MYO1C_HUMAN | NAMING_CONFUSION | Represents an unconventional myosin. This prot... |
| 3 | IMA4_HUMAN | NAMING_CONFUSION | Was termed importin alpha-4 |
| 4 | S22A1_HUMAN | LOCALIZATION_DISPUTED | Cellular localization of OCT1 in the intestine... |
| ... | ... | ... | ... |
| 95 | POK9_HUMAN | NaN | Truncated; frameshift leads to premature stop ... |
| 96 | UB2L3_HUMAN | PSEUDOGENE_STATUS | PubMed:10760570 reported that UBE2L1, UBE2L2 a... |
| 97 | CBX1_HUMAN | CLAIMS_RETRACTED | Was previously reported to interact with ASXL1... |
| 98 | H33_HUMAN | CLAIMS_RETRACTED | The original paper reporting lysine deaminatio... |
| 99 | RELB_HUMAN | FUNCTION_DISPUTED | Was originally thought to inhibit the transcri... |
100 rows × 3 columns
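A quick way to review the predictions is to count how often each category was assigned and to list the rows that received no category (shown as NaN above, as for POK9_HUMAN). The sketch below does this with the standard library on the ten rows visible in the table. The text column is dropped for brevity, and representing the missing category as an empty CSV field is an assumption about the file contents.

```python
import csv
import io
from collections import Counter

# The ten visible rows of the predictions table (id and category only).
# The empty field for POK9_HUMAN stands in for the NaN shown above.
SAMPLE_CSV = """id,category
MOTSC_HUMAN,EXPRESSION_MECHANISM_UNCLEAR
POTB3_HUMAN,GENE_COPY_NUMBER
MYO1C_HUMAN,NAMING_CONFUSION
IMA4_HUMAN,NAMING_CONFUSION
S22A1_HUMAN,LOCALIZATION_DISPUTED
POK9_HUMAN,
UB2L3_HUMAN,PSEUDOGENE_STATUS
CBX1_HUMAN,CLAIMS_RETRACTED
H33_HUMAN,CLAIMS_RETRACTED
RELB_HUMAN,FUNCTION_DISPUTED
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Count assignments per category, grouping empty values under one label.
counts = Counter(row["category"] or "(unassigned)" for row in rows)
unassigned = [row["id"] for row in rows if not row["category"]]

for category, n in counts.most_common():
    print(f"{n:3d}  {category}")
print("Rows without a category:", unassigned)
```

With the full file loaded into the DataFrame above, the same tally could be obtained from `df["category"].value_counts(dropna=False)`.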