{ "cells": [ { "cell_type": "markdown", "id": "113e1f5d2f048e03", "metadata": {}, "source": [ "# Perform LLM Inference\n", "\n", "This notebook demonstrates how to perform inference using LLMs.\n", "\n", "Whereas the [previous RAG example](Perform-RAG-Inference.ipynb) used existing examples,\n", "this one performs de novo inference using a schema.\n", "\n", "Note that linkml-store is a data-first framework; its main emphasis is not on AI or LLMs. However, it does support a pluggable **Inference** framework, and one of the integrations is a simple LLM-based inference engine.\n", "\n", "This notebook uses the command-line interface, but the same can be done programmatically using the Python API." ] }, { "cell_type": "markdown", "id": "966de1b52f388b87", "metadata": {}, "source": [ "## Loading the data into DuckDB\n", "\n", "We take all UniProt \"caution\" free-text comments for human proteins and load them into a DuckDB database." ] }, { "cell_type": "code", "execution_count": 1, "id": "da1ed3b6811477ee", "metadata": { "jupyter": { "is_executing": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserted 2390 objects from ../../tests/input/uniprot/uniprot-comments.tsv into collection 'Entry'.\n" ] } ], "source": [ "%%bash\n", "mkdir -p tmp\n", "rm -rf tmp/up.ddb\n", "linkml-store -d duckdb:///tmp/up.ddb -c Entry insert ../../tests/input/uniprot/uniprot-comments.tsv" ] }, { "cell_type": "markdown", "id": "88191ea890186dc9", "metadata": {}, "source": [ "Let's check what this looks like by using `describe` to summarize each column:" ] }, { "cell_type": "code", "execution_count": 2, "id": "af9d9160e75afed4", "metadata": { "jupyter": { "is_executing": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " count unique top freq\n", "category 2390 1 2390\n", "id 2390 2284 EFC2_HUMAN 4\n", "text 2390 1383 Could be the product of a pseudogene 259\n" ] } ], "source": [ "%%bash\n", 
"linkml-store -d tmp/up.ddb describe" ] }, { "cell_type": "markdown", "id": "8a5c5a6e-c327-412f-9fa7-3720528f7ce9", "metadata": {}, "source": [ "## Introspecting the schema\n", "\n", "Here we will use a ready-made LinkML schema that has the categories we want to assign as a LinkML enum, with\n", "examples (examples in schemas help humans *and* LLMs)" ] }, { "cell_type": "code", "execution_count": 4, "id": "9d75e40f-37b5-45b9-9bb7-ff69486348b7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name: uniprot-comments\n", "id: http://example.org/uniprot-comments\n", "imports:\n", "- linkml:types\n", "prefixes:\n", " linkml:\n", " prefix_prefix: linkml\n", " prefix_reference: https://w3id.org/linkml/\n", " up:\n", " prefix_prefix: up\n", " prefix_reference: http://example.org/tuniprot-comments\n", "default_prefix: up\n", "default_range: string\n", "enums:\n", " CommentCategory:\n", " name: CommentCategory\n", " permissible_values:\n", " FUNCTION_DISPUTED:\n", " text: FUNCTION_DISPUTED\n", " description: A caution indicating that a previously reported function has\n", " been challenged or disproven in subsequent studies; may warrant GO NOT annotation\n", " examples:\n", " - value: FUNCTION_DISPUTED\n", " description: Originally described for its in vitro hydrolytic activity towards\n", " dGMP, dAMP and dIMP. 
However, this was not confirmed in vivo\n", " FUNCTION_PREDICTION_ONLY:\n", " text: FUNCTION_PREDICTION_ONLY\n", " description: A caution indicating function is based only on computational\n", " prediction or sequence similarity\n", " examples:\n", " - value: FUNCTION_PREDICTION_ONLY\n", " description: Predicted to be involved in X based on sequence similarity\n", " FUNCTION_LACKS_EVIDENCE:\n", " text: FUNCTION_LACKS_EVIDENCE\n", " description: A caution indicating insufficient experimental evidence to support\n", " predicted function\n", " examples:\n", " - value: FUNCTION_LACKS_EVIDENCE\n", " description: In contrast to other Macro-domain containing proteins, lacks\n", " ADP-ribose glycohydrolase activity\n", " FUNCTION_DEBATED:\n", " text: FUNCTION_DEBATED\n", " description: A caution about ongoing scientific debate regarding function;\n", " differs from DISPUTED in lacking clear evidence against\n", " examples:\n", " - value: FUNCTION_DEBATED\n", " description: Was initially thought to act as a major regulator of cardiac\n", " hypertrophy... However, while PDE5A regulates nitric-oxide-generated cGMP,\n", " nitric oxide signaling is often depressed by heart disease, limiting its\n", " effect\n", " LOCALIZATION_DISPUTED:\n", " text: LOCALIZATION_DISPUTED\n", " description: A caution about conflicting or uncertain cellular localization\n", " evidence\n", " examples:\n", " - value: LOCALIZATION_DISPUTED\n", " description: Cellular localization remains to be finally defined. 
While\n", " most authors have deduced a localization at the basolateral side, other\n", " studies demonstrated an apical localization\n", " NAMING_CONFUSION:\n", " text: NAMING_CONFUSION\n", " description: A caution about potential confusion with similarly named proteins\n", " or historical naming issues\n", " examples:\n", " - value: NAMING_CONFUSION\n", " description: This protein should not be confused with the conventional myosin-1\n", " (MYH1);Was termed importin alpha-4\n", " GENE_COPY_NUMBER:\n", " text: GENE_COPY_NUMBER\n", " description: A caution about gene duplication or copy number that might affect\n", " interpretation\n", " examples:\n", " - value: GENE_COPY_NUMBER\n", " description: Maps to a duplicated region on chromosome 15; the gene is present\n", " in at least 3 almost identical copies\n", " EXPRESSION_MECHANISM_UNCLEAR:\n", " text: EXPRESSION_MECHANISM_UNCLEAR\n", " description: A caution about unclear or unusual mechanisms of gene expression\n", " or protein production\n", " examples:\n", " - value: EXPRESSION_MECHANISM_UNCLEAR\n", " description: This peptide has been shown to be biologically active but is\n", " the product of a mitochondrial gene. The mechanisms allowing the production\n", " and secretion of the peptide remain unclear\n", " SEQUENCE_FEATURE_MISSING:\n", " text: SEQUENCE_FEATURE_MISSING\n", " description: A caution about missing or unexpected sequence features that\n", " might affect function\n", " examples:\n", " - value: SEQUENCE_FEATURE_MISSING\n", " description: No predictable signal peptide\n", " SPECIES_DIFFERENCE:\n", " text: SPECIES_DIFFERENCE\n", " description: A caution about significant functional or property differences\n", " between orthologs\n", " examples:\n", " - value: SPECIES_DIFFERENCE\n", " description: Affinity and capacity of the transporter for endogenous substrates\n", " vary among orthologs. 
For endogenous compounds such as dopamine, histamine,\n", " serotonin and thiamine, mouse ortholog display higher affinity\n", " PUBLICATION_CONFLICT:\n", " text: PUBLICATION_CONFLICT\n", " description: A caution about conflicting published evidence or interpretation\n", " examples:\n", " - value: PUBLICATION_CONFLICT\n", " description: Although initially reported to transport carnitine across the\n", " hepatocyte membrane, another study was unable to verify this finding\n", " CLAIMS_RETRACTED:\n", " text: CLAIMS_RETRACTED\n", " description: A caution about function claims that were retracted or withdrawn\n", " examples:\n", " - value: CLAIMS_RETRACTED\n", " description: Has been reported to enhance netrin-induced phosphorylation\n", " of PAK1 and FYN... This article has been withdrawn by the authors\n", " PROTEIN_IDENTITY:\n", " text: PROTEIN_IDENTITY\n", " description: A caution about uncertainty in the identity or existence of distinct\n", " protein products\n", " examples:\n", " - value: PROTEIN_IDENTITY\n", " description: It is not known whether the so-called human ASE1 and human\n", " CAST proteins represent two sides of a single gene product\n", " FUNCTION_UNCERTAIN_INITIATION:\n", " text: FUNCTION_UNCERTAIN_INITIATION\n", " description: A caution about uncertainty in translation initiation site\n", " examples:\n", " - value: FUNCTION_UNCERTAIN_INITIATION\n", " description: It is uncertain whether Met-1 or Met-37 is the initiator\n", " PSEUDOGENE_STATUS:\n", " text: PSEUDOGENE_STATUS\n", " description: A caution about whether the gene encodes a protein\n", " examples:\n", " - value: PSEUDOGENE_STATUS\n", " description: Could be the product of a pseudogene\n", "classes:\n", " Entry:\n", " name: Entry\n", " attributes:\n", " id:\n", " name: id\n", " identifier: true\n", " category:\n", " name: category\n", " range: CommentCategory\n", " text:\n", " name: text\n", " description: The text of the comment\n", "source_file: ../../tests/input/uniprot/schema.yaml\n", 
"\n" ] } ], "source": [ "%%bash\n", "linkml-store -S ../../tests/input/uniprot/schema.yaml -d tmp/up.ddb schema" ] }, { "cell_type": "code", "execution_count": 52, "id": "079bba40-cb67-47b0-ba06-b2a0d43e6a86", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserted 15 objects from ../../tests/input/uniprot/uniprot-caution-cv.csv into collection 'cv'.\n" ] } ], "source": [ "%%bash\n", "linkml-store -d tmp/up.ddb -c cv insert ../../tests/input/uniprot/uniprot-caution-cv.csv" ] }, { "cell_type": "code", "execution_count": 53, "id": "e87bfb30-9488-45d3-a133-6fac124a842e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TERM: FUNCTION_DISPUTED\n", "DEFINITION: A caution indicating that a previously reported function has been challenged\n", " or disproven in subsequent studies; may warrant GO NOT annotation\n", "EXAMPLES: Originally described for its in vitro hydrolytic activity towards dGMP,\n", " dAMP and dIMP. However, this was not confirmed in vivo\n", "---\n", "TERM: FUNCTION_PREDICTION_ONLY\n", "DEFINITION: A caution indicating function is based only on computational prediction\n", " or sequence similarity\n", "EXAMPLES: Predicted to be involved in X based on sequence similarity\n", "---\n", "TERM: FUNCTION_LACKS_EVIDENCE\n", "DEFINITION: A caution indicating insufficient experimental evidence to support predicted\n", " function\n", "EXAMPLES: In contrast to other Macro-domain containing proteins, lacks ADP-ribose\n", " glycohydrolase activity\n", "\n" ] } ], "source": [ "%%bash\n", "linkml-store -d tmp/up.ddb::cv query --limit 3 -O yaml" ] }, { "cell_type": "code", "execution_count": 54, "id": "45da9e5fd1353ccb", "metadata": { "ExecuteTime": { "end_time": "2025-02-04T14:34:37.309647Z", "start_time": "2024-08-21T22:53:31.595517Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TERM: FUNCTION_DISPUTED\n", "DEFINITION: A caution indicating that a previously reported 
function has been challenged\n", " or disproven in subsequent studies; may warrant GO NOT annotation\n", "EXAMPLES: Originally described for its in vitro hydrolytic activity towards dGMP,\n", " dAMP and dIMP. However, this was not confirmed in vivo\n", "---\n", "TERM: FUNCTION_PREDICTION_ONLY\n", "DEFINITION: A caution indicating function is based only on computational prediction\n", " or sequence similarity\n", "EXAMPLES: Predicted to be involved in X based on sequence similarity\n", "---\n", "TERM: FUNCTION_LACKS_EVIDENCE\n", "DEFINITION: A caution indicating insufficient experimental evidence to support predicted\n", " function\n", "EXAMPLES: In contrast to other Macro-domain containing proteins, lacks ADP-ribose\n", " glycohydrolase activity\n", "\n" ] } ], "source": [ "%%bash\n", "linkml-store -d tmp/up.ddb query --limit 3 -O yaml" ] },
{ "cell_type": "markdown", "id": "5723b14db6ae067f", "metadata": {}, "source": [ "## Inferring a specific field" ] }, { "cell_type": "code", "execution_count": 5, "id": "e3b5b54814c56690", "metadata": { "ExecuteTime": { "end_time": "2025-02-04T14:34:37.310628Z", "start_time": "2024-08-21T22:55:41.459988Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id: MOTSC_HUMAN\n", "category: EXPRESSION_MECHANISM_UNCLEAR\n", "text: This peptide has been shown to be biologically active but is the product of\n", " a mitochondrial gene. Usage of the mitochondrial genetic code yields tandem start\n", " and stop codons so translation must occur in the cytoplasm. The mechanisms allowing\n", " the production and secretion of the peptide remain unclear\n", "\n" ] } ], "source": [ "%%bash\n", "linkml-store -S ../../tests/input/uniprot/schema.yaml -d tmp/up.ddb -c Entry infer -t llm -T category --where \"id: MOTSC_HUMAN\"" ] }, { "cell_type": "markdown", "id": "c3826927-2022-451a-9cd9-ccf10e303bc3", "metadata": {}, "source": [ "## Inferring all rows\n", "\n", "Here we use a `--where` clause to query all rows in our collection and pass them through the inference engine" ] }, { "cell_type": "code", "execution_count": 63, "id": "970d3e14-b0d5-4468-90f3-0c347c9fe4e5", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "linkml-store -S ../../tests/input/uniprot/schema.yaml -d tmp/up.ddb -c Entry infer -t llm -T category --where \"{}\" -O csv -o tmp/up.predicted.csv" ] }, { "cell_type": "code", "execution_count": 6, "id": "684afb71-1737-4dbc-a7b2-d209fc56f1cb", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 7, "id": "1b82be5f-8c6b-4b51-b41c-5905c936a84d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>id</th><th>category</th><th>text</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>MOTSC_HUMAN</td><td>EXPRESSION_MECHANISM_UNCLEAR</td><td>This peptide has been shown to be biologically...</td></tr>\n",
"    <tr><th>1</th><td>POTB3_HUMAN</td><td>GENE_COPY_NUMBER</td><td>Maps to a duplicated region on chromosome 15; ...</td></tr>\n",
"    <tr><th>2</th><td>MYO1C_HUMAN</td><td>NAMING_CONFUSION</td><td>Represents an unconventional myosin. This prot...</td></tr>\n",
"    <tr><th>3</th><td>IMA4_HUMAN</td><td>NAMING_CONFUSION</td><td>Was termed importin alpha-4</td></tr>\n",
"    <tr><th>4</th><td>S22A1_HUMAN</td><td>LOCALIZATION_DISPUTED</td><td>Cellular localization of OCT1 in the intestine...</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>95</th><td>POK9_HUMAN</td><td>NaN</td><td>Truncated; frameshift leads to premature stop ...</td></tr>\n",
"    <tr><th>96</th><td>UB2L3_HUMAN</td><td>PSEUDOGENE_STATUS</td><td>PubMed:10760570 reported that UBE2L1, UBE2L2 a...</td></tr>\n",
"    <tr><th>97</th><td>CBX1_HUMAN</td><td>CLAIMS_RETRACTED</td><td>Was previously reported to interact with ASXL1...</td></tr>\n",
"    <tr><th>98</th><td>H33_HUMAN</td><td>CLAIMS_RETRACTED</td><td>The original paper reporting lysine deaminatio...</td></tr>\n",
"    <tr><th>99</th><td>RELB_HUMAN</td><td>FUNCTION_DISPUTED</td><td>Was originally thought to inhibit the transcri...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>100 rows × 3 columns</p>\n",
"</div>
" ], "text/plain": [ " id category \\\n", "0 MOTSC_HUMAN EXPRESSION_MECHANISM_UNCLEAR \n", "1 POTB3_HUMAN GENE_COPY_NUMBER \n", "2 MYO1C_HUMAN NAMING_CONFUSION \n", "3 IMA4_HUMAN NAMING_CONFUSION \n", "4 S22A1_HUMAN LOCALIZATION_DISPUTED \n", ".. ... ... \n", "95 POK9_HUMAN NaN \n", "96 UB2L3_HUMAN PSEUDOGENE_STATUS \n", "97 CBX1_HUMAN CLAIMS_RETRACTED \n", "98 H33_HUMAN CLAIMS_RETRACTED \n", "99 RELB_HUMAN FUNCTION_DISPUTED \n", "\n", " text \n", "0 This peptide has been shown to be biologically... \n", "1 Maps to a duplicated region on chromosome 15; ... \n", "2 Represents an unconventional myosin. This prot... \n", "3 Was termed importin alpha-4 \n", "4 Cellular localization of OCT1 in the intestine... \n", ".. ... \n", "95 Truncated; frameshift leads to premature stop ... \n", "96 PubMed:10760570 reported that UBE2L1, UBE2L2 a... \n", "97 Was previously reported to interact with ASXL1... \n", "98 The original paper reporting lysine deaminatio... \n", "99 Was originally thought to inhibit the transcri... \n", "\n", "[100 rows x 3 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"tmp/up.predicted.csv\")\n", "df" ] }, { "cell_type": "markdown", "id": "1a3c35ac1b902a86", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }