{
"cells": [
{
"cell_type": "markdown",
"source": [
"# How to index Phenopackets with LinkML-Store\n",
"\n",
"\n",
"\n"
],
"metadata": {
"collapsed": false
},
"id": "fc4794dd116ed21"
},
{
"cell_type": "markdown",
"source": [
"## Use pystow to download phenopackets\n",
"\n",
"We will download from the Monarch Initiative [phenopacket-store](https://github.com/monarch-initiative/phenopacket-store)"
],
"metadata": {
"collapsed": false
},
"id": "e19f50e1b2fc5d89"
},
{
"cell_type": "code",
"execution_count": 1,
"outputs": [],
"source": [
"import pandas as pd\n",
"import pystow\n",
"import yaml\n",
"\n",
"path = pystow.ensure_untar(\"tmp\", \"phenopackets\", url=\" https://github.com/monarch-initiative/phenopacket-store/releases/latest/download/all_phenopackets.tgz\")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:19:53.029659Z",
"start_time": "2024-08-08T04:19:52.644983Z"
}
},
"id": "158d589d95a155e5"
},
{
"cell_type": "code",
"execution_count": 2,
"outputs": [
{
"data": {
"text/plain": "4876"
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# iterate over all *.json files in the phenopackets directory and parse to an object\n",
"# we will recursively walk the path using os.walk ( we don't worry about loading yet)\n",
"import os\n",
"import json\n",
"objs = []\n",
"for root, dirs, files in os.walk(path):\n",
" for file in files:\n",
" if file.endswith(\".json\"):\n",
" with open(os.path.join(root, file)) as stream:\n",
" obj = json.load(stream)\n",
" objs.append(obj)\n",
"len(objs)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:19:53.857295Z",
"start_time": "2024-08-08T04:19:53.031182Z"
}
},
"id": "142993c7e60551d1"
},
{
"cell_type": "markdown",
"source": [
"## Creating a client and attaching to a database\n",
"\n",
"First we will create a client as normal:"
],
"metadata": {
"collapsed": false
},
"id": "493c7599d2f40c27"
},
{
"cell_type": "code",
"execution_count": 3,
"outputs": [],
"source": [
"from linkml_store import Client\n",
"\n",
"client = Client()"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:19:54.759379Z",
"start_time": "2024-08-08T04:19:53.857607Z"
}
},
"id": "initial_id"
},
{
"cell_type": "markdown",
"source": [
"Next we'll attach to a MongoDB instance. this assumes you have one running already.\n",
"\n",
"We will make a database called \"phenopackets\" and recreate it if it already exists\n",
"\n",
"(note for people running this notebook locally - if you happen to have a database with this name in your current mongo instance it will be deleted!)"
],
"metadata": {
"collapsed": false
},
"id": "470f1cb70bf3641b"
},
{
"cell_type": "code",
"execution_count": 4,
"outputs": [],
"source": [
"db = client.attach_database(\"mongodb://localhost:27017\", \"phenopackets\", recreate_if_exists=True)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:19:54.766122Z",
"start_time": "2024-08-08T04:19:54.759646Z"
}
},
"id": "cc164c0acbe4c39d"
},
{
"cell_type": "markdown",
"source": [
"## Creating a collection\n",
"\n",
"We'll create a simple test collection. The concept of collection in linkml-store maps directly to mongodb collections"
],
"metadata": {
"collapsed": false
},
"id": "334ea2ced79828f7"
},
{
"cell_type": "code",
"execution_count": 5,
"outputs": [],
"source": [
"collection = db.create_collection(\"main\", recreate_if_exists=True)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:19:54.900921Z",
"start_time": "2024-08-08T04:19:54.763519Z"
}
},
"id": "c3a79013f9359a9"
},
{
"cell_type": "markdown",
"source": [
"## Inserting objects into the store\n",
"\n",
"We'll use the standard `insert` method to insert the phenopackets into the collection. At this stage there is no explicit schema."
],
"metadata": {
"collapsed": false
},
"id": "207f35ee61edc14d"
},
{
"cell_type": "code",
"execution_count": 6,
"outputs": [],
"source": [
"collection.insert(objs)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:19:55.204911Z",
"start_time": "2024-08-08T04:19:54.950259Z"
}
},
"id": "4a09a78fe3c8dc33"
},
{
"cell_type": "markdown",
"source": [
"## Check contents\n",
"\n",
"We can check the number of rows in the collection, to ensure everything was inserted correctly:"
],
"metadata": {
"collapsed": false
},
"id": "47f933e901372da8"
},
{
"cell_type": "code",
"execution_count": 7,
"outputs": [
{
"data": {
"text/plain": "4876"
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"collection.find({}, limit=1).num_rows"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:19:55.219375Z",
"start_time": "2024-08-08T04:19:55.206313Z"
}
},
"id": "f505fdc8cc20196e"
},
{
"cell_type": "code",
"execution_count": 8,
"outputs": [],
"source": [
"assert collection.find({}, limit=1).num_rows == len(objs)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:19:55.239170Z",
"start_time": "2024-08-08T04:19:55.220076Z"
}
},
"id": "e6ae22c986b9ba5b"
},
{
"cell_type": "markdown",
"source": [
"Let's check with pandas just to make sure it looks as expected; we'll query for a specific OMIM disease:"
],
"metadata": {
"collapsed": false
},
"id": "90e2e9793375431f"
},
{
"cell_type": "code",
"execution_count": 9,
"outputs": [
{
"data": {
"text/plain": " id \\\n0 PMID_28289718_Higgins-Patient-1 \n1 PMID_31173466_Suzuki-Patient-1 \n2 PMID_28289718_Higgins-Patient-2 \n\n subject \\\n0 {'id': 'Higgins-Patient-1', 'timeAtLastEncount... \n1 {'id': 'Suzuki-Patient-1', 'timeAtLastEncounte... \n2 {'id': 'Higgins-Patient-2', 'timeAtLastEncount... \n\n phenotypicFeatures \\\n0 [{'type': {'id': 'HP:0001714', 'label': 'Ventr... \n1 [{'type': {'id': 'HP:0001714', 'label': 'Ventr... \n2 [{'type': {'id': 'HP:0001714', 'label': 'Ventr... \n\n interpretations \\\n0 [{'id': 'Higgins-Patient-1', 'progressStatus':... \n1 [{'id': 'Suzuki-Patient-1', 'progressStatus': ... \n2 [{'id': 'Higgins-Patient-2', 'progressStatus':... \n\n diseases \\\n0 [{'term': {'id': 'OMIM:618499', 'label': 'Noon... \n1 [{'term': {'id': 'OMIM:618499', 'label': 'Noon... \n2 [{'term': {'id': 'OMIM:618499', 'label': 'Noon... \n\n metaData \n0 {'created': '2024-03-28T11:11:48.590163946Z', ... \n1 {'created': '2024-03-28T11:11:48.594725131Z', ... \n2 {'created': '2024-03-28T11:11:48.592718124Z', ... ",
"text/html": "
\n\n
\n \n \n | \n id | \n subject | \n phenotypicFeatures | \n interpretations | \n diseases | \n metaData | \n
\n \n \n \n 0 | \n PMID_28289718_Higgins-Patient-1 | \n {'id': 'Higgins-Patient-1', 'timeAtLastEncount... | \n [{'type': {'id': 'HP:0001714', 'label': 'Ventr... | \n [{'id': 'Higgins-Patient-1', 'progressStatus':... | \n [{'term': {'id': 'OMIM:618499', 'label': 'Noon... | \n {'created': '2024-03-28T11:11:48.590163946Z', ... | \n
\n \n 1 | \n PMID_31173466_Suzuki-Patient-1 | \n {'id': 'Suzuki-Patient-1', 'timeAtLastEncounte... | \n [{'type': {'id': 'HP:0001714', 'label': 'Ventr... | \n [{'id': 'Suzuki-Patient-1', 'progressStatus': ... | \n [{'term': {'id': 'OMIM:618499', 'label': 'Noon... | \n {'created': '2024-03-28T11:11:48.594725131Z', ... | \n
\n \n 2 | \n PMID_28289718_Higgins-Patient-2 | \n {'id': 'Higgins-Patient-2', 'timeAtLastEncount... | \n [{'type': {'id': 'HP:0001714', 'label': 'Ventr... | \n [{'id': 'Higgins-Patient-2', 'progressStatus':... | \n [{'term': {'id': 'OMIM:618499', 'label': 'Noon... | \n {'created': '2024-03-28T11:11:48.592718124Z', ... | \n
\n \n
\n
"
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"qr = collection.find({\"diseases.term.id\": \"OMIM:618499\"}, limit=3)\n",
"qr.rows_dataframe"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:19:55.262084Z",
"start_time": "2024-08-08T04:19:55.226047Z"
}
},
"id": "e763fe6cd50022e2"
},
{
"cell_type": "markdown",
"source": [
"As expected, there are three rows with the OMIM disease 618499."
],
"metadata": {
"collapsed": false
},
"id": "4a266efbcb405673"
},
{
"cell_type": "markdown",
"source": [
"## Query faceting\n",
"\n",
"We will now demonstrate faceted queries, allowing us to count the number of instances of different categorical values or categorical value combinations.\n",
"\n",
"First we'll facet on the subject sex. We can use path notation, e.g. `subject.sex` here:"
],
"metadata": {
"collapsed": false
},
"id": "d4749758585df35c"
},
{
"cell_type": "code",
"execution_count": 10,
"outputs": [
{
"data": {
"text/plain": "{'subject.sex': [('MALE', 1807), ('FEMALE', 1564)]}"
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"collection.query_facets({}, facet_columns=[\"subject.sex\"])"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:19:55.291740Z",
"start_time": "2024-08-08T04:19:55.261545Z"
}
},
"id": "9b7f01f14d36958b"
},
{
"cell_type": "markdown",
"source": [
"We can also facet by the disease name/label. We'll restrict this to the top 20"
],
"metadata": {
"collapsed": false
},
"id": "ea6e13f82ec50e62"
},
{
"cell_type": "code",
"execution_count": 11,
"outputs": [
{
"data": {
"text/plain": "{'diseases.term.label': [('Developmental and epileptic encephalopathy 4', 463),\n ('Developmental and epileptic encephalopathy 11', 342),\n ('KBG syndrome', 337),\n ('Leber congenital amaurosis 6', 191),\n ('Glass syndrome', 158),\n ('Holt-Oram syndrome', 103),\n ('Mitochondrial DNA depletion syndrome 13 (encephalomyopathic type)', 95),\n ('Neurodevelopmental disorder with coarse facies and mild distal skeletal abnormalities',\n 73),\n ('Jacobsen syndrome', 69),\n ('Coffin-Siris syndrome 8', 65),\n ('Kabuki Syndrome 1', 65),\n ('Houge-Janssen syndrome 2', 60),\n ('ZTTK SYNDROME', 52),\n ('Greig cephalopolysyndactyly syndrome', 51),\n ('Seizures, benign familial infantile, 3', 51),\n ('Mitochondrial DNA depletion syndrome 6 (hepatocerebral type)', 50),\n ('Marfan syndrome', 50),\n ('Developmental delay, dysmorphic facies, and brain anomalies', 49),\n ('Loeys-Dietz syndrome 3', 49),\n ('Hypomagnesemia 3, renal', 46)]}"
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"collection.query_facets({}, facet_columns=[\"diseases.term.label\"], facet_limit=20)\n"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:19:55.352444Z",
"start_time": "2024-08-08T04:19:55.282697Z"
}
},
"id": "27857349279abc41"
},
{
"cell_type": "code",
"execution_count": 12,
"outputs": [
{
"data": {
"text/plain": "{'subject.timeAtLastEncounter.age.iso8601duration': [('P4Y', 131),\n ('P3Y', 114),\n ('P6Y', 100),\n ('P5Y', 97),\n ('P2Y', 95),\n ('P7Y', 85),\n ('P10Y', 82),\n ('P9Y', 77),\n ('P8Y', 71)]}"
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"collection.query_facets({}, facet_columns=[\"subject.timeAtLastEncounter.age.iso8601duration\"], facet_limit=10)\n"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:19:55.369632Z",
"start_time": "2024-08-08T04:19:55.313628Z"
}
},
"id": "86eea02b6c25c2cd"
},
{
"cell_type": "code",
"execution_count": 13,
"outputs": [
{
"data": {
"text/plain": "{'interpretations.diagnosis.genomicInterpretations.variantInterpretation.variationDescriptor.geneContext.symbol': [('STXBP1',\n 463),\n ('SCN2A', 393),\n ('ANKRD11', 337),\n ('RPGRIP1', 273),\n ('SATB2', 158),\n ('FBN1', 151),\n ('LMNA', 127),\n ('FBXL4', 117),\n ('TBX5', 103),\n ('SPTAN1', 85)]}"
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"collection.query_facets({}, facet_columns=[\"interpretations.diagnosis.genomicInterpretations.variantInterpretation.variationDescriptor.geneContext.symbol\"], facet_limit=10)\n"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:19:55.372125Z",
"start_time": "2024-08-08T04:19:55.319912Z"
}
},
"id": "10f2c971ed09c386"
},
{
"cell_type": "markdown",
"source": [
"We can also facet on combinations:"
],
"metadata": {
"collapsed": false
},
"id": "ee540382322111a9"
},
{
"cell_type": "code",
"execution_count": 14,
"outputs": [
{
"data": {
"text/plain": "{('subject.sex', 'diseases.term.label'): [(('MALE', 'KBG syndrome'), 175),\n (('FEMALE', 'KBG syndrome'), 143),\n (('MALE', 'Glass syndrome'), 90),\n (('FEMALE', 'Glass syndrome'), 62),\n (('MALE',\n 'Mitochondrial DNA depletion syndrome 13 (encephalomyopathic type)'),\n 58),\n (('MALE',\n 'Neurodevelopmental disorder with coarse facies and mild distal skeletal abnormalities'),\n 54),\n (('FEMALE', 'Jacobsen syndrome'), 49),\n (('MALE', 'Coffin-Siris syndrome 8'), 37),\n (('FEMALE',\n 'Mitochondrial DNA depletion syndrome 13 (encephalomyopathic type)'),\n 37),\n (('FEMALE', 'Kabuki Syndrome 1'), 35),\n (('MALE', 'Houge-Janssen syndrome 2'), 32),\n (('MALE', 'Kabuki Syndrome 1'), 30),\n (('FEMALE', 'Developmental delay, dysmorphic facies, and brain anomalies'),\n 29),\n (('FEMALE', 'Holt-Oram syndrome'), 28),\n (('MALE', 'Intellectual developmental disorder, autosomal dominant 21'), 28),\n (('MALE', 'Cardiac, facial, and digital anomalies with developmental delay'),\n 28),\n (('FEMALE', 'Developmental and epileptic encephalopathy 28'), 27),\n (('MALE', 'Loeys-Dietz syndrome 3'), 27),\n (('MALE', 'ZTTK SYNDROME'), 26),\n (('FEMALE', 'ZTTK SYNDROME'), 26)]}"
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fqr = collection.query_facets({}, facet_columns=[(\"subject.sex\", \"diseases.term.label\")], facet_limit=20)\n",
"fqr\n"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:19:55.422853Z",
"start_time": "2024-08-08T04:19:55.364925Z"
}
},
"id": "5eca26a67254d3d2"
},
{
"cell_type": "code",
"execution_count": 17,
"outputs": [
{
"data": {
"text/plain": " subject.sex diseases.term.label Value\n0 MALE KBG syndrome 175\n1 FEMALE KBG syndrome 143\n2 MALE Glass syndrome 90\n3 FEMALE Glass syndrome 62\n4 MALE Mitochondrial DNA depletion syndrome 13 (encep... 58\n5 MALE Neurodevelopmental disorder with coarse facies... 54\n6 FEMALE Jacobsen syndrome 49\n7 MALE Coffin-Siris syndrome 8 37\n8 FEMALE Mitochondrial DNA depletion syndrome 13 (encep... 37\n9 FEMALE Kabuki Syndrome 1 35\n10 MALE Houge-Janssen syndrome 2 32\n11 MALE Kabuki Syndrome 1 30\n12 FEMALE Developmental delay, dysmorphic facies, and br... 29\n13 FEMALE Holt-Oram syndrome 28\n14 MALE Intellectual developmental disorder, autosomal... 28\n15 MALE Cardiac, facial, and digital anomalies with de... 28\n16 FEMALE Developmental and epileptic encephalopathy 28 27\n17 MALE Loeys-Dietz syndrome 3 27\n18 MALE ZTTK SYNDROME 26\n19 FEMALE ZTTK SYNDROME 26",
"text/html": "\n\n
\n \n \n | \n subject.sex | \n diseases.term.label | \n Value | \n
\n \n \n \n 0 | \n MALE | \n KBG syndrome | \n 175 | \n
\n \n 1 | \n FEMALE | \n KBG syndrome | \n 143 | \n
\n \n 2 | \n MALE | \n Glass syndrome | \n 90 | \n
\n \n 3 | \n FEMALE | \n Glass syndrome | \n 62 | \n
\n \n 4 | \n MALE | \n Mitochondrial DNA depletion syndrome 13 (encep... | \n 58 | \n
\n \n 5 | \n MALE | \n Neurodevelopmental disorder with coarse facies... | \n 54 | \n
\n \n 6 | \n FEMALE | \n Jacobsen syndrome | \n 49 | \n
\n \n 7 | \n MALE | \n Coffin-Siris syndrome 8 | \n 37 | \n
\n \n 8 | \n FEMALE | \n Mitochondrial DNA depletion syndrome 13 (encep... | \n 37 | \n
\n \n 9 | \n FEMALE | \n Kabuki Syndrome 1 | \n 35 | \n
\n \n 10 | \n MALE | \n Houge-Janssen syndrome 2 | \n 32 | \n
\n \n 11 | \n MALE | \n Kabuki Syndrome 1 | \n 30 | \n
\n \n 12 | \n FEMALE | \n Developmental delay, dysmorphic facies, and br... | \n 29 | \n
\n \n 13 | \n FEMALE | \n Holt-Oram syndrome | \n 28 | \n
\n \n 14 | \n MALE | \n Intellectual developmental disorder, autosomal... | \n 28 | \n
\n \n 15 | \n MALE | \n Cardiac, facial, and digital anomalies with de... | \n 28 | \n
\n \n 16 | \n FEMALE | \n Developmental and epileptic encephalopathy 28 | \n 27 | \n
\n \n 17 | \n MALE | \n Loeys-Dietz syndrome 3 | \n 27 | \n
\n \n 18 | \n MALE | \n ZTTK SYNDROME | \n 26 | \n
\n \n 19 | \n FEMALE | \n ZTTK SYNDROME | \n 26 | \n
\n \n
\n
"
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from linkml_store.utils.pandas_utils import facet_summary_to_dataframe_unmelted\n",
"\n",
"facet_summary_to_dataframe_unmelted(fqr)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:21:54.268683Z",
"start_time": "2024-08-08T04:21:54.265246Z"
}
},
"id": "854f55b91f350de2"
},
{
"cell_type": "markdown",
"source": [
"## Semantic Search\n",
"\n",
"We will index phenopackets using a template that extracts the subject, phenotypic features and diseases.\n",
"\n",
"First we will create a textualization template for a phenopacket. We will keep it minimal for simplicity - this doesn't include treatments, families, etc."
],
"metadata": {
"collapsed": false
},
"id": "648f05e75f250221"
},
{
"cell_type": "code",
"execution_count": 18,
"outputs": [],
"source": [
"template = \"\"\"\n",
"subject: {{subject}}\n",
"phenotypes: {% for p in phenotypicFeatures %}{{p.type.label}}{% endfor %}\n",
"diseases: {% for d in diseases %}{{d.term.label}}{% endfor %}\n",
"\"\"\""
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:22:10.270395Z",
"start_time": "2024-08-08T04:22:10.265368Z"
}
},
"id": "976095541027ce9e"
},
{
"cell_type": "markdown",
"source": [
"Next we will create an indexer using the template. This will use the Jinja2 syntax for templating.\n",
"We will also cache LLM embedding queries, so if we want to incrementally add new phenopackets we can avoid re-running the LLM embeddings calls."
],
"metadata": {
"collapsed": false
},
"id": "76a71f8590bd5602"
},
{
"cell_type": "code",
"execution_count": 19,
"outputs": [],
"source": [
"from linkml_store.index.implementations.llm_indexer import LLMIndexer\n",
"\n",
"index = LLMIndexer(\n",
" name=\"ppkt\", \n",
" cached_embeddings_database=\"tmp/llm_pheno_cache.db\",\n",
" text_template=template,\n",
" text_template_syntax=\"jinja2\",\n",
")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:22:12.817263Z",
"start_time": "2024-08-08T04:22:12.804284Z"
}
},
"id": "e98f9d6eb4a5e385"
},
{
"cell_type": "markdown",
"source": [
"We can test the template on the first row of the collection:"
],
"metadata": {
"collapsed": false
},
"id": "e6c28d4d95b920ba"
},
{
"cell_type": "code",
"execution_count": 20,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"subject: {'id': 'Higgins-Patient-1', 'timeAtLastEncounter': {'age': {'iso8601duration': 'P17Y'}}, 'sex': 'FEMALE'}\n",
"phenotypes: Ventricular hypertrophyHeart murmurHypertrophic cardiomyopathyShort statureHypertelorismLow-set earsPosteriorly rotated earsGlobal developmental delayCognitive impairmentCardiac arrest\n",
"diseases: Noonan syndrome-11\n"
]
}
],
"source": [
"print(index.object_to_text(qr.rows[0]))"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:22:15.579758Z",
"start_time": "2024-08-08T04:22:15.565798Z"
}
},
"id": "16dce837e31c88f6"
},
{
"cell_type": "markdown",
"source": [
"That looks as expected. We can now attach the indexer to the collection and index the collection:"
],
"metadata": {
"collapsed": false
},
"id": "4fbd1fc091c4c7b"
},
{
"cell_type": "code",
"execution_count": 21,
"outputs": [],
"source": [
"collection.attach_indexer(index, auto_index=True)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:25:25.307954Z",
"start_time": "2024-08-08T04:22:19.475790Z"
}
},
"id": "18a0bd86de7f1d81"
},
{
"cell_type": "markdown",
"source": [
"## Semantic Search\n",
"\n",
"Let's query based on text criteria:"
],
"metadata": {
"collapsed": false
},
"id": "f49056b209918a9"
},
{
"cell_type": "code",
"execution_count": 22,
"outputs": [
{
"data": {
"text/plain": " score id \\\n0 0.824639 PMID_30658709_patient \n1 0.824639 PMID_30658709_patient \n2 0.813770 PMID_36932076_Patient_1 \n3 0.813770 PMID_36932076_Patient_1 \n4 0.804126 PMID_37303127_6 \n\n subject \\\n0 {'id': 'patient', 'timeAtLastEncounter': {'age... \n1 {'id': 'patient', 'timeAtLastEncounter': {'age... \n2 {'id': 'Patient 1', 'timeAtLastEncounter': {'a... \n3 {'id': 'Patient 1', 'timeAtLastEncounter': {'a... \n4 {'id': '6', 'timeAtLastEncounter': {'age': {'i... \n\n phenotypicFeatures \\\n0 [{'type': {'id': 'HP:0031956', 'label': 'Eleva... \n1 [{'type': {'id': 'HP:0031956', 'label': 'Eleva... \n2 [{'type': {'id': 'HP:0000979', 'label': 'Purpu... \n3 [{'type': {'id': 'HP:0000979', 'label': 'Purpu... \n4 [{'type': {'id': 'HP:0001397', 'label': 'Hepat... \n\n interpretations \\\n0 [{'id': 'patient', 'progressStatus': 'SOLVED',... \n1 [{'id': 'patient', 'progressStatus': 'SOLVED',... \n2 [{'id': 'Patient 1', 'progressStatus': 'SOLVED... \n3 [{'id': 'Patient 1', 'progressStatus': 'SOLVED... \n4 [{'id': '6', 'progressStatus': 'SOLVED', 'diag... \n\n diseases \\\n0 [{'term': {'id': 'OMIM:615878', 'label': 'Chol... \n1 [{'term': {'id': 'OMIM:615878', 'label': 'Chol... \n2 [{'term': {'id': 'OMIM:620376', 'label': 'Auto... \n3 [{'term': {'id': 'OMIM:620376', 'label': 'Auto... \n4 [{'term': {'id': 'OMIM:151660', 'label': 'Lipo... \n\n metaData \n0 {'created': '2024-05-05T09:03:25.388371944Z', ... \n1 {'created': '2024-05-05T09:03:25.388371944Z', ... \n2 {'created': '2024-04-19T06:07:57.188061952Z', ... \n3 {'created': '2024-04-19T06:07:57.188061952Z', ... \n4 {'created': '2024-03-23T17:41:42.999521017Z', ... ",
"text/html": "\n\n
\n \n \n | \n score | \n id | \n subject | \n phenotypicFeatures | \n interpretations | \n diseases | \n metaData | \n
\n \n \n \n 0 | \n 0.824639 | \n PMID_30658709_patient | \n {'id': 'patient', 'timeAtLastEncounter': {'age... | \n [{'type': {'id': 'HP:0031956', 'label': 'Eleva... | \n [{'id': 'patient', 'progressStatus': 'SOLVED',... | \n [{'term': {'id': 'OMIM:615878', 'label': 'Chol... | \n {'created': '2024-05-05T09:03:25.388371944Z', ... | \n
\n \n 1 | \n 0.824639 | \n PMID_30658709_patient | \n {'id': 'patient', 'timeAtLastEncounter': {'age... | \n [{'type': {'id': 'HP:0031956', 'label': 'Eleva... | \n [{'id': 'patient', 'progressStatus': 'SOLVED',... | \n [{'term': {'id': 'OMIM:615878', 'label': 'Chol... | \n {'created': '2024-05-05T09:03:25.388371944Z', ... | \n
\n \n 2 | \n 0.813770 | \n PMID_36932076_Patient_1 | \n {'id': 'Patient 1', 'timeAtLastEncounter': {'a... | \n [{'type': {'id': 'HP:0000979', 'label': 'Purpu... | \n [{'id': 'Patient 1', 'progressStatus': 'SOLVED... | \n [{'term': {'id': 'OMIM:620376', 'label': 'Auto... | \n {'created': '2024-04-19T06:07:57.188061952Z', ... | \n
\n \n 3 | \n 0.813770 | \n PMID_36932076_Patient_1 | \n {'id': 'Patient 1', 'timeAtLastEncounter': {'a... | \n [{'type': {'id': 'HP:0000979', 'label': 'Purpu... | \n [{'id': 'Patient 1', 'progressStatus': 'SOLVED... | \n [{'term': {'id': 'OMIM:620376', 'label': 'Auto... | \n {'created': '2024-04-19T06:07:57.188061952Z', ... | \n
\n \n 4 | \n 0.804126 | \n PMID_37303127_6 | \n {'id': '6', 'timeAtLastEncounter': {'age': {'i... | \n [{'type': {'id': 'HP:0001397', 'label': 'Hepat... | \n [{'id': '6', 'progressStatus': 'SOLVED', 'diag... | \n [{'term': {'id': 'OMIM:151660', 'label': 'Lipo... | \n {'created': '2024-03-23T17:41:42.999521017Z', ... | \n
\n \n
\n
"
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"qr = collection.search(\"patients with liver diseases\")\n",
"qr.rows_dataframe[0:5]"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:25:28.272146Z",
"start_time": "2024-08-08T04:25:25.308470Z"
}
},
"id": "1ddd4ac75719342d"
},
{
"cell_type": "markdown",
"source": [
"Let's check the first one"
],
"metadata": {
"collapsed": false
},
"id": "b54c088d3d69f8a3"
},
{
"cell_type": "code",
"execution_count": 23,
"outputs": [
{
"data": {
"text/plain": "(0.824638728366563,\n {'id': 'PMID_30658709_patient',\n 'subject': {'id': 'patient',\n 'timeAtLastEncounter': {'age': {'iso8601duration': 'P1Y11M'}},\n 'sex': 'FEMALE'},\n 'phenotypicFeatures': [{'type': {'id': 'HP:0031956',\n 'label': 'Elevated circulating aspartate aminotransferase concentration'},\n 'onset': {'age': {'iso8601duration': 'P1Y11M'}}},\n {'type': {'id': 'HP:0031964',\n 'label': 'Elevated circulating alanine aminotransferase concentration'},\n 'onset': {'age': {'iso8601duration': 'P1Y11M'}}},\n {'type': {'id': 'HP:0003573', 'label': 'Increased total bilirubin'},\n 'onset': {'age': {'iso8601duration': 'P6M'}}},\n {'type': {'id': 'HP:0012202',\n 'label': 'Increased serum bile acid concentration'},\n 'onset': {'age': {'iso8601duration': 'P6M'}}},\n {'type': {'id': 'HP:0002908', 'label': 'Conjugated hyperbilirubinemia'},\n 'onset': {'age': {'iso8601duration': 'P6M'}}},\n {'type': {'id': 'HP:0001433', 'label': 'Hepatosplenomegaly'},\n 'onset': {'age': {'iso8601duration': 'P6M'}}},\n {'type': {'id': 'HP:0001510', 'label': 'Growth delay'},\n 'onset': {'age': {'iso8601duration': 'P6M'}}},\n {'type': {'id': 'HP:0000989', 'label': 'Pruritus'},\n 'onset': {'age': {'iso8601duration': 'P6M'}}},\n {'type': {'id': 'HP:0000952', 'label': 'Jaundice'},\n 'onset': {'age': {'iso8601duration': 'P6M'}}},\n {'type': {'id': 'HP:0100810', 'label': 'Pointed helix'},\n 'onset': {'age': {'iso8601duration': 'P6M'}}},\n {'type': {'id': 'HP:0002650', 'label': 'Scoliosis'}},\n {'type': {'id': 'HP:0003112',\n 'label': 'Abnormal circulating amino acid concentration'},\n 'excluded': True},\n {'type': {'id': 'HP:0001928', 'label': 'Abnormality of coagulation'},\n 'excluded': True},\n {'type': {'id': 'HP:0010701', 'label': 'Abnormal immunoglobulin level'},\n 'excluded': True},\n {'type': {'id': 'HP:0001627', 'label': 'Abnormal heart morphology'},\n 'excluded': True}],\n 'interpretations': [{'id': 'patient',\n 'progressStatus': 'SOLVED',\n 'diagnosis': {'disease': {'id': 
'OMIM:615878',\n 'label': 'Cholestasis, progressive familial intrahepatic 4'},\n 'genomicInterpretations': [{'subjectOrBiosampleId': 'patient',\n 'interpretationStatus': 'CAUSATIVE',\n 'variantInterpretation': {'variationDescriptor': {'id': 'var_kKNGnjOxGXMbcoWzDGEJKVPIB',\n 'geneContext': {'valueId': 'HGNC:11828', 'symbol': 'TJP2'},\n 'expressions': [{'syntax': 'hgvs.c',\n 'value': 'NM_004817.4:c.2355+1G>C'},\n {'syntax': 'hgvs.g', 'value': 'NC_000009.12:g.69238790G>C'}],\n 'vcfRecord': {'genomeAssembly': 'hg38',\n 'chrom': 'chr9',\n 'pos': '69238790',\n 'ref': 'G',\n 'alt': 'C'},\n 'moleculeContext': 'genomic',\n 'allelicState': {'id': 'GENO:0000136', 'label': 'homozygous'}}}}]}}],\n 'diseases': [{'term': {'id': 'OMIM:615878',\n 'label': 'Cholestasis, progressive familial intrahepatic 4'},\n 'onset': {'ontologyClass': {'id': 'HP:0003593',\n 'label': 'Infantile onset'}}}],\n 'metaData': {'created': '2024-05-05T09:03:25.388371944Z',\n 'createdBy': 'ORCID:0000-0002-0736-9199',\n 'resources': [{'id': 'geno',\n 'name': 'Genotype Ontology',\n 'url': 'http://purl.obolibrary.org/obo/geno.owl',\n 'version': '2022-03-05',\n 'namespacePrefix': 'GENO',\n 'iriPrefix': 'http://purl.obolibrary.org/obo/GENO_'},\n {'id': 'hgnc',\n 'name': 'HUGO Gene Nomenclature Committee',\n 'url': 'https://www.genenames.org',\n 'version': '06/01/23',\n 'namespacePrefix': 'HGNC',\n 'iriPrefix': 'https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/'},\n {'id': 'omim',\n 'name': 'An Online Catalog of Human Genes and Genetic Disorders',\n 'url': 'https://www.omim.org',\n 'version': 'January 4, 2023',\n 'namespacePrefix': 'OMIM',\n 'iriPrefix': 'https://www.omim.org/entry/'},\n {'id': 'so',\n 'name': 'Sequence types and features ontology',\n 'url': 'http://purl.obolibrary.org/obo/so.obo',\n 'version': '2021-11-22',\n 'namespacePrefix': 'SO',\n 'iriPrefix': 'http://purl.obolibrary.org/obo/SO_'},\n {'id': 'hp',\n 'name': 'human phenotype ontology',\n 'url': 
'http://purl.obolibrary.org/obo/hp.owl',\n 'version': '2024-04-26',\n 'namespacePrefix': 'HP',\n 'iriPrefix': 'http://purl.obolibrary.org/obo/HP_'}],\n 'phenopacketSchemaVersion': '2.0',\n 'externalReferences': [{'id': 'PMID:30658709',\n 'reference': 'https://pubmed.ncbi.nlm.nih.gov/30658709',\n 'description': 'Novel compound heterozygote mutations of TJP2 in a Chinese child with progressive cholestatic liver disease'}]}})"
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"qr.ranked_rows[0]"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:25:28.279596Z",
"start_time": "2024-08-08T04:25:28.275434Z"
}
},
"id": "5a4fd8fe217fdf6b"
},
{
"cell_type": "markdown",
"source": [
"We can combine semantic search with queries:"
],
"metadata": {
"collapsed": false
},
"id": "4f38cf9889a15086"
},
{
"cell_type": "code",
"execution_count": 24,
"outputs": [
{
"data": {
"text/plain": " score id \\\n0 0.813827 PMID_36932076_Patient_1 \n1 0.813827 PMID_36932076_Patient_1 \n2 0.799738 PMID_36932076_Patient_3 \n3 0.799738 PMID_36932076_Patient_3 \n4 0.799243 PMID_27536553_27536553_P3 \n\n subject \\\n0 {'id': 'Patient 1', 'timeAtLastEncounter': {'a... \n1 {'id': 'Patient 1', 'timeAtLastEncounter': {'a... \n2 {'id': 'Patient 3', 'timeAtLastEncounter': {'a... \n3 {'id': 'Patient 3', 'timeAtLastEncounter': {'a... \n4 {'id': '27536553_P3', 'timeAtLastEncounter': {... \n\n phenotypicFeatures \\\n0 [{'type': {'id': 'HP:0000979', 'label': 'Purpu... \n1 [{'type': {'id': 'HP:0000979', 'label': 'Purpu... \n2 [{'type': {'id': 'HP:0001511', 'label': 'Intra... \n3 [{'type': {'id': 'HP:0001511', 'label': 'Intra... \n4 [{'type': {'id': 'HP:0001396', 'label': 'Chole... \n\n interpretations \\\n0 [{'id': 'Patient 1', 'progressStatus': 'SOLVED... \n1 [{'id': 'Patient 1', 'progressStatus': 'SOLVED... \n2 [{'id': 'Patient 3', 'progressStatus': 'SOLVED... \n3 [{'id': 'Patient 3', 'progressStatus': 'SOLVED... \n4 [{'id': '27536553_P3', 'progressStatus': 'SOLV... \n\n diseases \\\n0 [{'term': {'id': 'OMIM:620376', 'label': 'Auto... \n1 [{'term': {'id': 'OMIM:620376', 'label': 'Auto... \n2 [{'term': {'id': 'OMIM:620376', 'label': 'Auto... \n3 [{'term': {'id': 'OMIM:620376', 'label': 'Auto... \n4 [{'term': {'id': 'OMIM:256810', 'label': 'Mito... \n\n metaData \n0 {'created': '2024-04-19T06:07:57.188061952Z', ... \n1 {'created': '2024-04-19T06:07:57.188061952Z', ... \n2 {'created': '2024-04-19T06:07:57.190312862Z', ... \n3 {'created': '2024-04-19T06:07:57.190312862Z', ... \n4 {'created': '2024-03-23T19:28:35.688389062Z', ... ",
"text/html": "\n\n
\n \n \n | \n score | \n id | \n subject | \n phenotypicFeatures | \n interpretations | \n diseases | \n metaData | \n
\n \n \n \n 0 | \n 0.813827 | \n PMID_36932076_Patient_1 | \n {'id': 'Patient 1', 'timeAtLastEncounter': {'a... | \n [{'type': {'id': 'HP:0000979', 'label': 'Purpu... | \n [{'id': 'Patient 1', 'progressStatus': 'SOLVED... | \n [{'term': {'id': 'OMIM:620376', 'label': 'Auto... | \n {'created': '2024-04-19T06:07:57.188061952Z', ... | \n
\n \n 1 | \n 0.813827 | \n PMID_36932076_Patient_1 | \n {'id': 'Patient 1', 'timeAtLastEncounter': {'a... | \n [{'type': {'id': 'HP:0000979', 'label': 'Purpu... | \n [{'id': 'Patient 1', 'progressStatus': 'SOLVED... | \n [{'term': {'id': 'OMIM:620376', 'label': 'Auto... | \n {'created': '2024-04-19T06:07:57.188061952Z', ... | \n
\n \n 2 | \n 0.799738 | \n PMID_36932076_Patient_3 | \n {'id': 'Patient 3', 'timeAtLastEncounter': {'a... | \n [{'type': {'id': 'HP:0001511', 'label': 'Intra... | \n [{'id': 'Patient 3', 'progressStatus': 'SOLVED... | \n [{'term': {'id': 'OMIM:620376', 'label': 'Auto... | \n {'created': '2024-04-19T06:07:57.190312862Z', ... | \n
\n \n 3 | \n 0.799738 | \n PMID_36932076_Patient_3 | \n {'id': 'Patient 3', 'timeAtLastEncounter': {'a... | \n [{'type': {'id': 'HP:0001511', 'label': 'Intra... | \n [{'id': 'Patient 3', 'progressStatus': 'SOLVED... | \n [{'term': {'id': 'OMIM:620376', 'label': 'Auto... | \n {'created': '2024-04-19T06:07:57.190312862Z', ... | \n
\n \n 4 | \n 0.799243 | \n PMID_27536553_27536553_P3 | \n {'id': '27536553_P3', 'timeAtLastEncounter': {... | \n [{'type': {'id': 'HP:0001396', 'label': 'Chole... | \n [{'id': '27536553_P3', 'progressStatus': 'SOLV... | \n [{'term': {'id': 'OMIM:256810', 'label': 'Mito... | \n {'created': '2024-03-23T19:28:35.688389062Z', ... | \n
\n \n
\n
"
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"qr = collection.search(\"patients with liver diseases\", where={\"subject.sex\": \"MALE\"})\n",
"qr.rows_dataframe[0:5]"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:25:29.336474Z",
"start_time": "2024-08-08T04:25:28.280410Z"
}
},
"id": "8a218f8f7688a2d3"
},
{
"cell_type": "markdown",
"source": [
"## Validation\n",
"\n",
"Next we will demonstrate validation over a whole collection.\n",
"\n",
"Currently validating depends on a LinkML schema - we have previously copied this schema into the test folder.\n",
"We will load the schema into the database object:"
],
"metadata": {
"collapsed": false
},
"id": "41a14e7976a923b3"
},
{
"cell_type": "code",
"execution_count": 25,
"outputs": [],
"source": [
"db.load_schema_view(\"../../tests/input/schemas/phenopackets_linkml/phenopackets.yaml\")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:25:29.445248Z",
"start_time": "2024-08-08T04:25:29.336606Z"
}
},
"id": "5294ee7927a372f1"
},
{
"cell_type": "markdown",
"source": [
"Quick sanity check to ensure that worked:"
],
"metadata": {
"collapsed": false
},
"id": "292d662d92bdfdb4"
},
{
"cell_type": "code",
"execution_count": 26,
"outputs": [
{
"data": {
"text/plain": "['Age',\n 'AgeRange',\n 'Dictionary',\n 'Evidence',\n 'ExternalReference',\n 'File',\n 'GestationalAge',\n 'OntologyClass',\n 'Procedure',\n 'TimeElement']"
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(db.schema_view.all_classes())[0:10]"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:25:29.452594Z",
"start_time": "2024-08-08T04:25:29.446076Z"
}
},
"id": "c211d3ce33b05fd5"
},
{
"cell_type": "code",
"execution_count": 27,
"outputs": [],
"source": [
"collection.metadata.type = \"Phenopacket\""
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:25:29.452722Z",
"start_time": "2024-08-08T04:25:29.449193Z"
}
},
"id": "7109f8da1228fe6a"
},
{
"cell_type": "code",
"execution_count": 28,
"outputs": [],
"source": [
"from linkml_runtime.dumpers import yaml_dumper\n",
"for r in db.iter_validate_database():\n",
" # known issue - https://github.com/monarch-initiative/phenopacket-store/issues/97\n",
" if \"is not of type 'integer'\" in r.message:\n",
" continue\n",
" print(r.message[0:100])\n",
" print(r)\n",
" raise ValueError(\"Unexpected validation error\")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:25:36.577981Z",
"start_time": "2024-08-08T04:25:29.453399Z"
}
},
"id": "bce050193361ecf2"
},
{
"cell_type": "markdown",
"source": [
"## Command Line Usage\n",
"\n",
"We can also use the command line for all of the above operations.\n",
"\n",
"For example, feceted queries:"
],
"metadata": {
"collapsed": false
},
"id": "8ff5109280b990e0"
},
{
"cell_type": "code",
"execution_count": 29,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\r\n",
" \"subject.sex\": {\r\n",
" \"MALE\": 1807,\r\n",
" \"FEMALE\": 1564\r\n",
" }\r\n",
"}\r\n"
]
}
],
"source": [
"!linkml-store -d mongodb://localhost:27017 -c main fq -S subject.sex"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:25:38.974658Z",
"start_time": "2024-08-08T04:25:36.578645Z"
}
},
"id": "92208567bec477fb"
},
{
"cell_type": "code",
"execution_count": 30,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"phenotypicFeatures.type.label:\r\n",
" Global developmental delay: 1705\r\n",
" Hypotonia: 1056\r\n",
" Intellectual disability: 1028\r\n",
" Seizure: 950\r\n",
" Hypertelorism: 925\r\n",
" Delayed speech and language development: 829\r\n",
" Short stature: 806\r\n",
" Microcephaly: 780\r\n",
" Scoliosis: 702\r\n",
" Feeding difficulties: 678\r\n",
" Low-set ears: 598\r\n",
" Autistic behavior: 519\r\n",
" Motor delay: 518\r\n",
" Downslanted palpebral fissures: 505\r\n",
" Strabismus: 504\r\n",
" Long philtrum: 500\r\n",
" Ptosis: 498\r\n",
" Patent foramen ovale: 469\r\n",
" Anteverted nares: 461\r\n",
" Hearing impairment: 451\r\n",
" Epicanthus: 447\r\n",
" Ventricular septal defect: 435\r\n",
" Thick eyebrow: 433\r\n",
" Cleft palate: 423\r\n",
" Joint hypermobility: 388\r\n",
" High palate: 383\r\n",
" Triangular face: 369\r\n",
" Micrognathia: 364\r\n",
" Posteriorly rotated ears: 350\r\n",
" Failure to thrive: 345\r\n",
" Prominent forehead: 343\r\n",
" Thin upper lip vermilion: 338\r\n",
" Sleep abnormality: 331\r\n",
" Wide nasal bridge: 331\r\n",
" Infantile spasms: 325\r\n",
" Long eyelashes: 325\r\n",
" Pectus excavatum: 322\r\n",
" Ataxia: 319\r\n",
" Pes planus: 315\r\n",
" Bilateral tonic-clonic seizure: 314\r\n",
" Bulbous nose: 311\r\n",
" Intellectual disability, severe: 306\r\n",
" Nystagmus: 298\r\n",
" Absent speech: 294\r\n",
" Midface retrusion: 290\r\n",
" Bicuspid aortic valve: 288\r\n",
" Deeply set eye: 283\r\n",
" Delayed ability to walk: 282\r\n",
" Pulmonic stenosis: 280\r\n",
" Cryptorchidism: 279\r\n",
" Talipes equinovarus: 277\r\n",
" Attention deficit hyperactivity disorder: 275\r\n",
" Recurrent otitis media: 275\r\n",
" Macrocephaly: 275\r\n",
" Abnormality of the hand: 273\r\n",
" Depressed nasal bridge: 273\r\n",
" Autism: 270\r\n",
" Macrodontia: 266\r\n",
" Dystonia: 265\r\n",
" Narrow forehead: 261\r\n",
" Smooth philtrum: 249\r\n",
" Microtia: 248\r\n",
" Inguinal hernia: 247\r\n",
" Upslanted palpebral fissure: 246\r\n",
" Ventriculomegaly: 240\r\n",
" Synophrys: 236\r\n",
" Cerebellar atrophy: 234\r\n",
" Ectopia lentis: 234\r\n",
" Thin corpus callosum: 231\r\n",
" EEG abnormality: 230\r\n",
" Short philtrum: 226\r\n",
" Arachnodactyly: 224\r\n",
" Short neck: 223\r\n",
" Highly arched eyebrow: 221\r\n",
" Epileptic encephalopathy: 219\r\n",
" Developmental regression: 218\r\n",
" Generalized tonic seizure: 218\r\n",
" Protruding ear: 217\r\n",
" Atrial septal defect: 213\r\n",
" Umbilical hernia: 213\r\n",
" Cerebral atrophy: 212\r\n",
" Atrioventricular canal defect: 206\r\n",
" Low anterior hairline: 203\r\n",
" Mitral valve prolapse: 199\r\n",
" Focal impaired awareness seizure: 199\r\n",
" Delayed skeletal maturation: 198\r\n",
" Hypsarrhythmia: 198\r\n",
" Intrauterine growth retardation: 196\r\n",
" Hypoplasia of the corpus callosum: 192\r\n",
" Spasticity: 192\r\n",
" Growth delay: 186\r\n",
" Aortic root aneurysm: 181\r\n",
" Severe global developmental delay: 173\r\n",
" Multifocal epileptiform discharges: 169\r\n",
" Mandibular prognathia: 167\r\n",
" Dysarthria: 167\r\n",
" Patent ductus arteriosus: 166\r\n",
" Blue sclerae: 166\r\n",
" Proptosis: 164\r\n",
" Cataract: 162\r\n",
"\r\n"
]
}
],
"source": [
"!linkml-store -d mongodb://localhost:27017 -c main fq -S phenotypicFeatures.type.label -O yaml\n"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:25:40.981467Z",
"start_time": "2024-08-08T04:25:38.982356Z"
}
},
"id": "db26d37f9e60283d"
},
{
"cell_type": "code",
"execution_count": 31,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"diseases.term.label+subject.sex:\r\n",
" ('KBG syndrome', 'MALE'): 175\r\n",
" ('KBG syndrome', 'FEMALE'): 143\r\n",
" ('Glass syndrome', 'MALE'): 90\r\n",
" ('Glass syndrome', 'FEMALE'): 62\r\n",
" ('Mitochondrial DNA depletion syndrome 13 (encephalomyopathic type)', 'MALE'): 58\r\n",
" ('Neurodevelopmental disorder with coarse facies and mild distal skeletal abnormalities', 'MALE'): 54\r\n",
" ('Jacobsen syndrome', 'FEMALE'): 49\r\n",
" ('Coffin-Siris syndrome 8', 'MALE'): 37\r\n",
" ('Mitochondrial DNA depletion syndrome 13 (encephalomyopathic type)', 'FEMALE'): 37\r\n",
" ('Kabuki Syndrome 1', 'FEMALE'): 35\r\n",
" ('Houge-Janssen syndrome 2', 'MALE'): 32\r\n",
" ('Kabuki Syndrome 1', 'MALE'): 30\r\n",
" ('Developmental delay, dysmorphic facies, and brain anomalies', 'FEMALE'): 29\r\n",
" ('Intellectual developmental disorder, autosomal dominant 21', 'MALE'): 28\r\n",
" ('Holt-Oram syndrome', 'FEMALE'): 28\r\n",
" ('Cardiac, facial, and digital anomalies with developmental delay', 'MALE'): 28\r\n",
" ('Loeys-Dietz syndrome 3', 'MALE'): 27\r\n",
" ('Developmental and epileptic encephalopathy 28', 'FEMALE'): 27\r\n",
" ('ZTTK SYNDROME', 'FEMALE'): 26\r\n",
" ('ZTTK SYNDROME', 'MALE'): 26\r\n",
" ('Loeys-Dietz syndrome 4', 'MALE'): 26\r\n",
" ('Marfan syndrome', 'MALE'): 26\r\n",
" ('Hypomagnesemia 3, renal', 'MALE'): 26\r\n",
" ('Intellectual developmental disorder, X-linked 112', 'MALE'): 26\r\n",
" ('Mitochondrial DNA depletion syndrome 6 (hepatocerebral type)', 'MALE'): 26\r\n",
" ('Marfan syndrome', 'FEMALE'): 24\r\n",
" ('Ectopia lentis, familial', 'MALE'): 24\r\n",
" ('Coffin-Siris syndrome 8', 'FEMALE'): 24\r\n",
" ('Mitochondrial DNA depletion syndrome 6 (hepatocerebral type)', 'FEMALE'): 24\r\n",
" ('Houge-Janssen syndrome 2', 'FEMALE'): 24\r\n",
" ('Cardiomyopathy, dilated, 1A', 'MALE'): 23\r\n",
" ('Loeys-Dietz syndrome 5', 'MALE'): 23\r\n",
" ('Holt-Oram syndrome', 'MALE'): 22\r\n",
" ('Mitochondrial complex IV deficiency, nuclear type 2', 'MALE'): 22\r\n",
" ('Loeys-Dietz syndrome 3', 'FEMALE'): 22\r\n",
" ('Cardiomyopathy, dilated, 1A', 'FEMALE'): 21\r\n",
" ('Kufor-Rakeb syndrome', 'MALE'): 21\r\n",
" ('Jacobsen syndrome', 'MALE'): 20\r\n",
" ('Developmental delay, dysmorphic facies, and brain anomalies', 'MALE'): 20\r\n",
" ('Ectopia lentis, familial', 'FEMALE'): 20\r\n",
" ('Ehlers-Danlos syndrome, vascular type', 'FEMALE'): 20\r\n",
" ('Loeys-Dietz syndrome 5', 'FEMALE'): 20\r\n",
" ('Neurodevelopmental disorder with coarse facies and mild distal skeletal abnormalities', 'FEMALE'): 19\r\n",
" ('Hypomagnesemia 3, renal', 'FEMALE'): 19\r\n",
" ('Intellectual developmental disorder, autosomal dominant 21', 'FEMALE'): 18\r\n",
" ('Acrofacial dysostosis 1, Nager type', 'FEMALE'): 18\r\n",
" ('LEOPARD syndrome 1', 'MALE'): 18\r\n",
" ('Anemia, sideroblastic, and spinocerebellar ataxia', 'MALE'): 18\r\n",
" ('Spastic ataxia 8, autosomal recessive, with hypomyelinating leukodystrophy', 'MALE'): 18\r\n",
" ('Albinism, oculocutaneous, type IV', 'FEMALE'): 17\r\n",
" ('Cardiac, facial, and digital anomalies with developmental delay', 'FEMALE'): 17\r\n",
" ('Developmental and epileptic encephalopathy 28', 'MALE'): 16\r\n",
" ('Developmental delay with or without epilepsy', 'MALE'): 16\r\n",
" ('Aarskog-Scott syndrome', 'MALE'): 16\r\n",
" ('Ehlers-Danlos syndrome, vascular type', 'MALE'): 15\r\n",
" ('Spastic paraplegia 91, autosomal dominant, with or without cerebellar ataxia', 'FEMALE'): 15\r\n",
" ('Spastic ataxia 8, autosomal recessive, with hypomyelinating leukodystrophy', 'FEMALE'): 15\r\n",
" ('Marfan lipodystrophy syndrome', 'FEMALE'): 15\r\n",
" ('Noonan syndrome 1', 'MALE'): 14\r\n",
" ('Sulfite oxidase deficiency', 'MALE'): 14\r\n",
" ('Spastic paraplegia 91, autosomal dominant, with or without cerebellar ataxia', 'MALE'): 13\r\n",
" ('Developmental and epileptic encephalopathy 112', 'FEMALE'): 13\r\n",
" ('Noonan syndrome 1', 'FEMALE'): 13\r\n",
" ('Albinism, oculocutaneous, type IV', 'MALE'): 13\r\n",
" ('Neurodevelopmental disorder with motor and language delay, ocular defects, and brain abnormalities', 'FEMALE'): 13\r\n",
" ('Developmental and epileptic encephalopathy 5', 'FEMALE'): 13\r\n",
" ('LEOPARD syndrome 1', 'FEMALE'): 13\r\n",
" ('Loeys-Dietz syndrome 2', 'MALE'): 13\r\n",
" ('Kufor-Rakeb syndrome', 'FEMALE'): 12\r\n",
" ('Ataxia-pancytopenia syndrome', 'MALE'): 12\r\n",
" ('Autoinflammatory syndrome, familial, with or without immunodeficiency', 'FEMALE'): 12\r\n",
" ('Neurodevelopmental disorder with or without anomalies of the brain, eye, or heart', 'MALE'): 12\r\n",
" ('Hypotonia, infantile, with psychomotor retardation and characteristic facies 3', 'FEMALE'): 12\r\n",
" ('Acrofacial dysostosis, Cincinnati type', 'MALE'): 11\r\n",
" ('Noonan syndrome 2', 'FEMALE'): 11\r\n",
" ('Sulfite oxidase deficiency', 'FEMALE'): 11\r\n",
" ('HMG-CoA synthase-2 deficiency', 'MALE'): 11\r\n",
" ('Hypotonia, infantile, with psychomotor retardation and characteristic facies 3', 'MALE'): 11\r\n",
" ('Neurodevelopmental disorder with or without variable brain abnormalities', 'MALE'): 11\r\n",
" ('Autoimmune polyendocrinopathy syndrome , type I, with or without reversible metaphyseal dysplasia', 'FEMALE'): 11\r\n",
" ('Neurodevelopmental disorder with progressive microcephaly, spasticity, and brain anomalies', 'MALE'): 10\r\n",
" ('Spastic paraplegia 76, autosomal recessive', 'FEMALE'): 10\r\n",
" ('Coffin-Siris syndrome 3', 'FEMALE'): 10\r\n",
" ('Noonan syndrome 6', 'MALE'): 10\r\n",
" ('Loeys-Dietz syndrome 6', 'FEMALE'): 10\r\n",
" ('Cornelia de Lange syndrome 6', 'MALE'): 10\r\n",
" ('EZH1-related neurodevelopmental disorder', 'FEMALE'): 10\r\n",
" ('Multiple mitochondrial dysfunctions syndrome 4', 'FEMALE'): 9\r\n",
" ('Intellectual developmental disorder, autosomal dominant 70', 'MALE'): 9\r\n",
" ('Neurodevelopmental disorder with or without variable brain abnormalities', 'FEMALE'): 9\r\n",
" ('Developmental and epileptic encephalopathy 5', 'MALE'): 9\r\n",
" ('Distal renal tubular acidosis 1', 'FEMALE'): 9\r\n",
" ('Developmental and epileptic encephalopathy 112', 'MALE'): 9\r\n",
" ('Noonan syndrome 2', 'MALE'): 9\r\n",
" ('Parkinson disease 15, autosomal recessive', 'MALE'): 9\r\n",
" ('Ataxia-pancytopenia syndrome', 'FEMALE'): 9\r\n",
" ('Muscular dystrophy, limb-girdle, autosomal recessive 28', 'MALE'): 9\r\n",
" ('Immunoskeletal dysplasia with neurodevelopmental abnormalitie', 'FEMALE'): 9\r\n",
" ('Joubert syndrome 10', 'MALE'): 9\r\n",
" ('Contractural arachnodactyly, congenital', 'FEMALE'): 9\r\n",
"\r\n"
]
}
],
"source": [
"!linkml-store -d mongodb://localhost:27017 -c main fq -S diseases.term.label+subject.sex -O yaml\n"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T04:25:42.994702Z",
"start_time": "2024-08-08T04:25:40.981132Z"
}
},
"id": "93d79d7857e40e34"
},
{
"cell_type": "markdown",
"source": [
"## Inference\n",
"\n"
],
"metadata": {
"collapsed": false
},
"id": "987209d3df999bcc"
},
{
"cell_type": "code",
"execution_count": 32,
"outputs": [],
"source": [
"from linkml_store.inference import get_inference_engine\n",
"\n",
"predictor = get_inference_engine(\"sklearn\")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T20:17:34.841221Z",
"start_time": "2024-08-08T20:17:34.803343Z"
}
},
"id": "31cd95bba4c2d6d5"
},
{
"cell_type": "code",
"execution_count": 33,
"outputs": [],
"source": [
"predictor.load_and_split_data(collection)\n"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-08T20:18:34.462521Z",
"start_time": "2024-08-08T20:18:34.421359Z"
}
},
"id": "8a2d1b23e204e977"
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"predictor.config.target_attributes = [\"diseases.term.label\"]"
],
"metadata": {
"collapsed": false
},
"id": "c63d94d5e199c367"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}