Perform LLM Inference

This notebook demonstrates how to perform inference using LLMs.

Whereas the previous RAG example drew on existing examples, this notebook performs de novo inference using a schema.

Note that linkml-store is a data-first framework; its main emphasis is not on AI or LLMs. However, it does support a pluggable inference framework, and one of the integrations is a simple LLM-based inference engine.

For this notebook, we will be using the command line interface, but the same can be done programmatically using the Python API.

Loading the data into DuckDB

For this we will take all UniProt “caution” free-text comments for human proteins and load them into a DuckDB database.

[1]:
%%bash
mkdir -p tmp
rm -rf tmp/up.ddb
linkml-store  -d duckdb:///tmp/up.ddb -c Entry insert ../../tests/input/uniprot/uniprot-comments.tsv
Inserted 2390 objects from ../../tests/input/uniprot/uniprot-comments.tsv into collection 'Entry'.

Let’s check what this looks like by using describe to summarize each column. The category column holds a single blank value across all rows; it is the field we will infer later.

[2]:
%%bash
linkml-store -d tmp/up.ddb describe
         count unique                                   top  freq
category  2390      1                                        2390
id        2390   2284                            EFC2_HUMAN     4
text      2390   1383  Could be the product of a pseudogene   259

Introspecting the schema

Here we use a ready-made LinkML schema in which the categories we want to assign are defined as a LinkML enum. Each permissible value comes with a description and an example; examples in schemas help both humans and LLMs.

[4]:
%%bash
linkml-store -S ../../tests/input/uniprot/schema.yaml  -d tmp/up.ddb schema
name: uniprot-comments
id: http://example.org/uniprot-comments
imports:
- linkml:types
prefixes:
  linkml:
    prefix_prefix: linkml
    prefix_reference: https://w3id.org/linkml/
  up:
    prefix_prefix: up
    prefix_reference: http://example.org/tuniprot-comments
default_prefix: up
default_range: string
enums:
  CommentCategory:
    name: CommentCategory
    permissible_values:
      FUNCTION_DISPUTED:
        text: FUNCTION_DISPUTED
        description: A caution indicating that a previously reported function has
          been challenged or disproven in subsequent studies; may warrant GO NOT annotation
        examples:
        - value: FUNCTION_DISPUTED
          description: Originally described for its in vitro hydrolytic activity towards
            dGMP, dAMP and dIMP. However, this was not confirmed in vivo
      FUNCTION_PREDICTION_ONLY:
        text: FUNCTION_PREDICTION_ONLY
        description: A caution indicating function is based only on computational
          prediction or sequence similarity
        examples:
        - value: FUNCTION_PREDICTION_ONLY
          description: Predicted to be involved in X based on sequence similarity
      FUNCTION_LACKS_EVIDENCE:
        text: FUNCTION_LACKS_EVIDENCE
        description: A caution indicating insufficient experimental evidence to support
          predicted function
        examples:
        - value: FUNCTION_LACKS_EVIDENCE
          description: In contrast to other Macro-domain containing proteins, lacks
            ADP-ribose glycohydrolase activity
      FUNCTION_DEBATED:
        text: FUNCTION_DEBATED
        description: A caution about ongoing scientific debate regarding function;
          differs from DISPUTED in lacking clear evidence against
        examples:
        - value: FUNCTION_DEBATED
          description: Was initially thought to act as a major regulator of cardiac
            hypertrophy... However, while PDE5A regulates nitric-oxide-generated cGMP,
            nitric oxide signaling is often depressed by heart disease, limiting its
            effect
      LOCALIZATION_DISPUTED:
        text: LOCALIZATION_DISPUTED
        description: A caution about conflicting or uncertain cellular localization
          evidence
        examples:
        - value: LOCALIZATION_DISPUTED
          description: Cellular localization remains to be finally defined. While
            most authors have deduced a localization at the basolateral side, other
            studies demonstrated an apical localization
      NAMING_CONFUSION:
        text: NAMING_CONFUSION
        description: A caution about potential confusion with similarly named proteins
          or historical naming issues
        examples:
        - value: NAMING_CONFUSION
          description: This protein should not be confused with the conventional myosin-1
            (MYH1);Was termed importin alpha-4
      GENE_COPY_NUMBER:
        text: GENE_COPY_NUMBER
        description: A caution about gene duplication or copy number that might affect
          interpretation
        examples:
        - value: GENE_COPY_NUMBER
          description: Maps to a duplicated region on chromosome 15; the gene is present
            in at least 3 almost identical copies
      EXPRESSION_MECHANISM_UNCLEAR:
        text: EXPRESSION_MECHANISM_UNCLEAR
        description: A caution about unclear or unusual mechanisms of gene expression
          or protein production
        examples:
        - value: EXPRESSION_MECHANISM_UNCLEAR
          description: This peptide has been shown to be biologically active but is
            the product of a mitochondrial gene. The mechanisms allowing the production
            and secretion of the peptide remain unclear
      SEQUENCE_FEATURE_MISSING:
        text: SEQUENCE_FEATURE_MISSING
        description: A caution about missing or unexpected sequence features that
          might affect function
        examples:
        - value: SEQUENCE_FEATURE_MISSING
          description: No predictable signal peptide
      SPECIES_DIFFERENCE:
        text: SPECIES_DIFFERENCE
        description: A caution about significant functional or property differences
          between orthologs
        examples:
        - value: SPECIES_DIFFERENCE
          description: Affinity and capacity of the transporter for endogenous substrates
            vary among orthologs. For endogenous compounds such as dopamine, histamine,
            serotonin and thiamine, mouse ortholog display higher affinity
      PUBLICATION_CONFLICT:
        text: PUBLICATION_CONFLICT
        description: A caution about conflicting published evidence or interpretation
        examples:
        - value: PUBLICATION_CONFLICT
          description: Although initially reported to transport carnitine across the
            hepatocyte membrane, another study was unable to verify this finding
      CLAIMS_RETRACTED:
        text: CLAIMS_RETRACTED
        description: A caution about function claims that were retracted or withdrawn
        examples:
        - value: CLAIMS_RETRACTED
          description: Has been reported to enhance netrin-induced phosphorylation
            of PAK1 and FYN... This article has been withdrawn by the authors
      PROTEIN_IDENTITY:
        text: PROTEIN_IDENTITY
        description: A caution about uncertainty in the identity or existence of distinct
          protein products
        examples:
        - value: PROTEIN_IDENTITY
          description: It is not known whether the so-called human ASE1 and human
            CAST proteins represent two sides of a single gene product
      FUNCTION_UNCERTAIN_INITIATION:
        text: FUNCTION_UNCERTAIN_INITIATION
        description: A caution about uncertainty in translation initiation site
        examples:
        - value: FUNCTION_UNCERTAIN_INITIATION
          description: It is uncertain whether Met-1 or Met-37 is the initiator
      PSEUDOGENE_STATUS:
        text: PSEUDOGENE_STATUS
        description: A caution about whether the gene encodes a protein
        examples:
        - value: PSEUDOGENE_STATUS
          description: Could be the product of a pseudogene
classes:
  Entry:
    name: Entry
    attributes:
      id:
        name: id
        identifier: true
      category:
        name: category
        range: CommentCategory
      text:
        name: text
        description: The text of the comment
source_file: ../../tests/input/uniprot/schema.yaml
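
To illustrate why descriptions and examples in the schema matter, here is a minimal sketch of how an enum could be rendered into instructions for an LLM. This is not the prompt that linkml-store builds internally; the `build_prompt` helper and the prompt wording are hypothetical, and only two permissible values from the schema above are included.

```python
# Hypothetical illustration: render enum descriptions and examples as LLM instructions.
# This is NOT the prompt linkml-store builds internally.

# Two permissible values copied from the CommentCategory enum shown above.
PERMISSIBLE_VALUES = {
    "PSEUDOGENE_STATUS": {
        "description": "A caution about whether the gene encodes a protein",
        "example": "Could be the product of a pseudogene",
    },
    "SEQUENCE_FEATURE_MISSING": {
        "description": "A caution about missing or unexpected sequence features "
        "that might affect function",
        "example": "No predictable signal peptide",
    },
}


def build_prompt(text: str, values: dict) -> str:
    """Return a classification prompt listing each category with its description and example."""
    lines = ["Assign exactly one category to the comment below.", "", "Categories:"]
    for name, info in values.items():
        description = info["description"]
        example = info["example"]
        lines.append(f"- {name}: {description} (example: {example})")
    lines += ["", f"Comment: {text}", "Category:"]
    return "\n".join(lines)


prompt = build_prompt("No predictable signal peptide", PERMISSIBLE_VALUES)
print(prompt)
```

The point of the sketch is only that every category reaches the model together with its definition and a concrete example, which is the information the enum carries.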

[52]:
%%bash
linkml-store  -d tmp/up.ddb -c cv insert ../../tests/input/uniprot/uniprot-caution-cv.csv
Inserted 15 objects from ../../tests/input/uniprot/uniprot-caution-cv.csv into collection 'cv'.

[53]:
%%bash
linkml-store  -d tmp/up.ddb::cv query --limit 3 -O yaml
TERM: FUNCTION_DISPUTED
DEFINITION: A caution indicating that a previously reported function has been challenged
  or disproven in subsequent studies; may warrant GO NOT annotation
EXAMPLES: Originally described for its in vitro hydrolytic activity towards dGMP,
  dAMP and dIMP. However, this was not confirmed in vivo
---
TERM: FUNCTION_PREDICTION_ONLY
DEFINITION: A caution indicating function is based only on computational prediction
  or sequence similarity
EXAMPLES: Predicted to be involved in X based on sequence similarity
---
TERM: FUNCTION_LACKS_EVIDENCE
DEFINITION: A caution indicating insufficient experimental evidence to support predicted
  function
EXAMPLES: In contrast to other Macro-domain containing proteins, lacks ADP-ribose
  glycohydrolase activity

[54]:
%%bash
linkml-store  -d tmp/up.ddb query --limit 3 -O yaml
TERM: FUNCTION_DISPUTED
DEFINITION: A caution indicating that a previously reported function has been challenged
  or disproven in subsequent studies; may warrant GO NOT annotation
EXAMPLES: Originally described for its in vitro hydrolytic activity towards dGMP,
  dAMP and dIMP. However, this was not confirmed in vivo
---
TERM: FUNCTION_PREDICTION_ONLY
DEFINITION: A caution indicating function is based only on computational prediction
  or sequence similarity
EXAMPLES: Predicted to be involved in X based on sequence similarity
---
TERM: FUNCTION_LACKS_EVIDENCE
DEFINITION: A caution indicating insufficient experimental evidence to support predicted
  function
EXAMPLES: In contrast to other Macro-domain containing proteins, lacks ADP-ribose
  glycohydrolase activity
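
The vocabulary rows carry the same information as the enum's permissible values in the schema. As a rough sketch of that correspondence, the snippet below maps rows with TERM, DEFINITION and EXAMPLES columns onto the `permissible_values` layout. The helper name `cv_to_permissible_values` is hypothetical, and the CSV text is inlined with assumed quoting rather than read from the real file.

```python
import csv
import io

# Two rows in the shape shown by the query above (TERM, DEFINITION, EXAMPLES).
# Assumption: the CSV is inlined here with standard quoting so the sketch is
# self-contained; the real data lives in uniprot-caution-cv.csv.
CV_CSV = """TERM,DEFINITION,EXAMPLES
FUNCTION_PREDICTION_ONLY,A caution indicating function is based only on computational prediction or sequence similarity,Predicted to be involved in X based on sequence similarity
FUNCTION_LACKS_EVIDENCE,A caution indicating insufficient experimental evidence to support predicted function,"In contrast to other Macro-domain containing proteins, lacks ADP-ribose glycohydrolase activity"
"""


def cv_to_permissible_values(rows):
    """Map each vocabulary row onto the permissible_values layout used in the schema."""
    return {
        row["TERM"]: {
            "text": row["TERM"],
            "description": row["DEFINITION"],
            "examples": [{"value": row["TERM"], "description": row["EXAMPLES"]}],
        }
        for row in rows
    }


permissible_values = cv_to_permissible_values(csv.DictReader(io.StringIO(CV_CSV)))
for term, pv in permissible_values.items():
    print(term, "->", pv["examples"][0]["description"])
```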


Inferring a specific field

[5]:
%%bash
linkml-store -S ../../tests/input/uniprot/schema.yaml  -d tmp/up.ddb -c Entry infer -t llm -T category --where "id: MOTSC_HUMAN"
id: MOTSC_HUMAN
category: EXPRESSION_MECHANISM_UNCLEAR
text: This peptide has been shown to be biologically active but is the product of
  a mitochondrial gene. Usage of the mitochondrial genetic code yields tandem start
  and stop codons so translation must occur in the cytoplasm. The mechanisms allowing
  the production and secretion of the peptide remain unclear
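
Because the target field has an enum range, it can be worth checking that a predicted label is one of the permissible values. The sketch below is an assumption-laden illustration in plain Python: `is_valid_category` is a hypothetical helper, and whether linkml-store performs such a check itself is not shown in this notebook.

```python
# The 15 permissible values of CommentCategory, copied from the schema above.
COMMENT_CATEGORIES = {
    "FUNCTION_DISPUTED",
    "FUNCTION_PREDICTION_ONLY",
    "FUNCTION_LACKS_EVIDENCE",
    "FUNCTION_DEBATED",
    "LOCALIZATION_DISPUTED",
    "NAMING_CONFUSION",
    "GENE_COPY_NUMBER",
    "EXPRESSION_MECHANISM_UNCLEAR",
    "SEQUENCE_FEATURE_MISSING",
    "SPECIES_DIFFERENCE",
    "PUBLICATION_CONFLICT",
    "CLAIMS_RETRACTED",
    "PROTEIN_IDENTITY",
    "FUNCTION_UNCERTAIN_INITIATION",
    "PSEUDOGENE_STATUS",
}


def is_valid_category(value) -> bool:
    """Return True when the value is one of the enum's permissible values."""
    return value in COMMENT_CATEGORIES


# The id and category from the inference result shown above.
result = {"id": "MOTSC_HUMAN", "category": "EXPRESSION_MECHANISM_UNCLEAR"}
print(result["id"], is_valid_category(result["category"]))
```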

Inferring all rows

Here we pass an empty --where clause ("{}") to select all rows in the collection, run them through the inference engine, and write the predictions to a CSV file.

[63]:
%%bash
linkml-store -S ../../tests/input/uniprot/schema.yaml  -d tmp/up.ddb -c Entry infer -t llm -T category --where "{}" -O csv -o tmp/up.predicted.csv

[6]:
import pandas as pd

[7]:
df = pd.read_csv("tmp/up.predicted.csv")
df

[7]:
    id           category                      text
0   MOTSC_HUMAN  EXPRESSION_MECHANISM_UNCLEAR  This peptide has been shown to be biologically...
1   POTB3_HUMAN  GENE_COPY_NUMBER              Maps to a duplicated region on chromosome 15; ...
2   MYO1C_HUMAN  NAMING_CONFUSION              Represents an unconventional myosin. This prot...
3   IMA4_HUMAN   NAMING_CONFUSION              Was termed importin alpha-4
4   S22A1_HUMAN  LOCALIZATION_DISPUTED         Cellular localization of OCT1 in the intestine...
... ...          ...                           ...
95  POK9_HUMAN   NaN                           Truncated; frameshift leads to premature stop ...
96  UB2L3_HUMAN  PSEUDOGENE_STATUS             PubMed:10760570 reported that UBE2L1, UBE2L2 a...
97  CBX1_HUMAN   CLAIMS_RETRACTED              Was previously reported to interact with ASXL1...
98  H33_HUMAN    CLAIMS_RETRACTED              The original paper reporting lysine deaminatio...
99  RELB_HUMAN   FUNCTION_DISPUTED             Was originally thought to inhibit the transcri...

100 rows × 3 columns
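
A natural next step is to tally the predicted categories and list the rows that received none (shown as NaN above). The following is a minimal sketch using only the standard library; the rows are a handful copied from the output above with the text column omitted, so the counts are illustrative and not a summary of the full file.

```python
import csv
import io
from collections import Counter

# A few rows copied from the output above; the text column is omitted for brevity.
# POK9_HUMAN has an empty category, which pandas displays as NaN.
PREDICTED_CSV = """id,category
MOTSC_HUMAN,EXPRESSION_MECHANISM_UNCLEAR
MYO1C_HUMAN,NAMING_CONFUSION
IMA4_HUMAN,NAMING_CONFUSION
POK9_HUMAN,
CBX1_HUMAN,CLAIMS_RETRACTED
H33_HUMAN,CLAIMS_RETRACTED
"""

rows = list(csv.DictReader(io.StringIO(PREDICTED_CSV)))

# Count rows per predicted category, skipping rows with no prediction.
counts = Counter(row["category"] for row in rows if row["category"])

# Collect the ids of rows where no category was assigned.
unassigned = [row["id"] for row in rows if not row["category"]]

for category, n in counts.most_common():
    print(f"{category}: {n}")
print("no category assigned:", unassigned)
```

To run this on the real predictions, replace the inline string with `open("tmp/up.predicted.csv", newline="")`. With the DataFrame loaded above, `df["category"].value_counts(dropna=False)` gives a comparable tally.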