How to index Phenopackets with LinkML-Store

Use pystow to download phenopackets

We will download the latest release archive from the Monarch Initiative phenopacket-store.

[2]:

import pandas as pd
import pystow
import yaml

path = pystow.ensure_untar("tmp", "phenopackets", url="https://github.com/monarch-initiative/phenopacket-store/releases/latest/download/all_phenopackets.tgz")

[3]:

# Walk the download directory recursively with os.walk and parse every
# *.json file into a Python object (loading into the database comes later)
import os
import json
objs = []
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(".json"):
            with open(os.path.join(root, file)) as stream:
                obj = json.load(stream)
                objs.append(obj)
len(objs)
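The same traversal can be written more compactly with `pathlib`. The sketch below is an alternative to the cell above, not part of the original notebook; the helper name `load_json_objects` and the throwaway demo directory are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_json_objects(directory):
    """Parse every *.json file found anywhere under `directory`."""
    objs = []
    # sorted() makes the load order deterministic across runs
    for json_path in sorted(Path(directory).rglob("*.json")):
        with json_path.open(encoding="utf-8") as stream:
            objs.append(json.load(stream))
    return objs


# Small self-contained demonstration on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "cohort_a"
    nested.mkdir()
    (nested / "p1.json").write_text(json.dumps({"id": "p1"}), encoding="utf-8")
    (nested / "notes.txt").write_text("not a phenopacket", encoding="utf-8")
    demo_objs = load_json_objects(tmp)

print(len(demo_objs))
```

With the real download, `load_json_objects(path)` should yield the same objects as the `os.walk` loop, although possibly in a different order.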

Creating a client and attaching to a database

First we will create a client as normal:

[4]:

from linkml_store import Client

client = Client()

Next we’ll attach to a MongoDB instance. This assumes you already have one running.

We will make a database called “phenopackets” and recreate it if it already exists.

(Note for people running this notebook locally: if your current MongoDB instance already has a database with this name, it will be deleted!)

[6]:

db = client.attach_database("mongodb://localhost:27017", "phenopackets", recreate_if_exists=True)

Creating a collection

We’ll create a simple test collection. The concept of a collection in linkml-store maps directly to MongoDB collections.

[7]:

collection = db.create_collection("main", recreate_if_exists=True)

Inserting objects into the store

We’ll use the standard insert method to insert the phenopackets into the collection. At this stage there is no explicit schema.

[9]:

collection.insert(objs)
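The cell above inserts every object in a single call. For much larger datasets one option is to insert in batches. The sketch below only shows the batching logic on toy data; the helper `chunked` and the batch size are illustrative assumptions, not part of linkml-store.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


demo = [{"id": f"p{i}"} for i in range(5)]
batches = list(chunked(demo, 2))
# With a real collection one would call collection.insert(batch) for each batch
print([len(b) for b in batches])
```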

Check contents

We can check the number of rows in the collection to ensure everything was inserted correctly. Note that num_rows reports the total number of matching rows, regardless of the limit:

[10]:

collection.find({}, limit=1).num_rows

[11]:

assert collection.find({}, limit=1).num_rows == len(objs)

Let’s check with pandas just to make sure it looks as expected; we’ll query for a specific OMIM disease:

[12]:

qr = collection.find({"diseases.term.id": "OMIM:618499"}, limit=3)
qr.rows_dataframe

[12]:

	id	subject	phenotypicFeatures	interpretations	diseases	metaData
0	PMID_28289718_Higgins-Patient-1	{'id': 'Higgins-Patient-1', 'timeAtLastEncount...	[{'type': {'id': 'HP:0001714', 'label': 'Ventr...	[{'id': 'Higgins-Patient-1', 'progressStatus':...	[{'term': {'id': 'OMIM:618499', 'label': 'Noon...	{'created': '2024-03-28T11:11:48.590163946Z', ...
1	PMID_31173466_Suzuki-Patient-1	{'id': 'Suzuki-Patient-1', 'timeAtLastEncounte...	[{'type': {'id': 'HP:0001714', 'label': 'Ventr...	[{'id': 'Suzuki-Patient-1', 'progressStatus': ...	[{'term': {'id': 'OMIM:618499', 'label': 'Noon...	{'created': '2024-03-28T11:11:48.594725131Z', ...
2	PMID_28289718_Higgins-Patient-2	{'id': 'Higgins-Patient-2', 'timeAtLastEncount...	[{'type': {'id': 'HP:0001714', 'label': 'Ventr...	[{'id': 'Higgins-Patient-2', 'progressStatus':...	[{'term': {'id': 'OMIM:618499', 'label': 'Noon...	{'created': '2024-03-28T11:11:48.592718124Z', ...

As expected, the query returns three rows (the limit we set), each annotated with the disease OMIM:618499.
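The query key `diseases.term.id` is a dotted path that reaches into the list of diseases. To make those semantics concrete, here is a pure-Python sketch that mimics this kind of matching on plain dicts. It is an illustration only: the helpers `path_values` and `matches` are made up here and are not how MongoDB or linkml-store evaluate queries internally.

```python
def path_values(obj, path):
    """Collect all values reachable via a dotted path, descending into lists."""
    current = [obj]
    for key in path.split("."):
        next_level = []
        for item in current:
            # A list at this level means any of its elements may match
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_level.append(candidate[key])
        current = next_level
    return current


def matches(obj, path, expected):
    """True if any value found at `path` equals `expected`."""
    return expected in path_values(obj, path)


packets = [
    {"id": "a", "diseases": [{"term": {"id": "OMIM:618499"}}]},
    {"id": "b", "diseases": [{"term": {"id": "OMIM:615878"}}]},
]
hits = [p["id"] for p in packets if matches(p, "diseases.term.id", "OMIM:618499")]
print(hits)
```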

Query faceting

We will now demonstrate faceted queries, allowing us to count the number of instances of different categorical values or categorical value combinations.

First we’ll facet on the subject sex. We can use path notation, e.g. subject.sex here:

[40]:

collection.query_facets({}, facet_columns=["subject.sex"])

[40]:

{'subject.sex': [('MALE', 1807), ('FEMALE', 1564), (None, 1505)]}
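Conceptually, a facet count is a tally of the values found at a path, with `None` standing in for objects where the value is missing. The following sketch reproduces that idea on toy data with `collections.Counter`; the helpers `get_path` and `facet_counts` are hypothetical and the real computation is done by the database backend.

```python
from collections import Counter


def get_path(obj, path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def facet_counts(objs, path):
    """Count occurrences of each value at `path`, most common first."""
    return Counter(get_path(o, path) for o in objs).most_common()


demo = [
    {"subject": {"sex": "MALE"}},
    {"subject": {"sex": "FEMALE"}},
    {"subject": {"sex": "MALE"}},
    {"subject": {}},  # sex not recorded
]
counts = facet_counts(demo, "subject.sex")
print(counts)
```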

We can also facet by the disease name/label. We’ll restrict this to the top 20:

[49]:

collection.query_facets({}, facet_columns=["diseases.term.label"], facet_limit=20)

[49]:

{'diseases.term.label': [(['Developmental and epileptic encephalopathy 4'],
   463),
  (['Developmental and epileptic encephalopathy 11'], 342),
  (['KBG syndrome'], 337),
  (['Leber congenital amaurosis 6'], 191),
  (['Glass syndrome'], 158),
  (['Holt-Oram syndrome'], 103),
  (['Mitochondrial DNA depletion syndrome 13 (encephalomyopathic type)'], 95),
  (['Neurodevelopmental disorder with coarse facies and mild distal skeletal abnormalities'],
   73),
  (['Jacobsen syndrome'], 69),
  (['Coffin-Siris syndrome 8'], 65),
  (['Kabuki Syndrome 1'], 65),
  (['Houge-Janssen syndrome 2'], 60),
  (['ZTTK SYNDROME'], 52),
  (['Greig cephalopolysyndactyly syndrome'], 51),
  (['Seizures, benign familial infantile, 3'], 51),
  (['Marfan syndrome'], 50),
  (['Mitochondrial DNA depletion syndrome 6 (hepatocerebral type)'], 50),
  (['Loeys-Dietz syndrome 3'], 49),
  (['Developmental delay, dysmorphic facies, and brain anomalies'], 49),
  (['Intellectual developmental disorder, autosomal dominant 21'], 46)]}

Faceting also works on deeper paths, such as the age at last encounter or the gene symbol of an interpreted variant:

[48]:

collection.query_facets({}, facet_columns=["subject.timeAtLastEncounter.age.iso8601duration"], facet_limit=10)

[48]:

{'subject.timeAtLastEncounter.age.iso8601duration': [(None, 2087),
  ('P4Y', 131),
  ('P3Y', 114),
  ('P6Y', 100),
  ('P5Y', 97),
  ('P2Y', 95),
  ('P7Y', 85),
  ('P10Y', 82),
  ('P9Y', 77),
  ('P8Y', 71)]}
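The age facets are raw ISO 8601 duration strings, so they sort alphabetically rather than by age. If a numeric value is needed, a small parser can help. The sketch below is an approximation under stated assumptions: it handles only the year, month and day components (no weeks or time parts), and the function name `duration_to_years` is made up for this example.

```python
import re

# Date-only ISO 8601 durations such as P4Y, P1Y11M or P6M
_DURATION = re.compile(r"^P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$")


def duration_to_years(duration):
    """Approximate a date-only ISO 8601 duration in years."""
    match = _DURATION.match(duration)
    if match is None or duration == "P":
        raise ValueError(f"Unsupported duration: {duration!r}")
    years, months, days = (int(g) if g else 0 for g in match.groups())
    return years + months / 12 + days / 365.25


ages = ["P4Y", "P1Y11M", "P6M"]
ordered = sorted(ages, key=duration_to_years)
print(ordered)
```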

[63]:

collection.query_facets({}, facet_columns=["interpretations.diagnosis.genomicInterpretations.variantInterpretation.variationDescriptor.geneContext.symbol"], facet_limit=10)

[63]:

{'interpretations.diagnosis.genomicInterpretations.variantInterpretation.variationDescriptor.geneContext.symbol': [([['STXBP1']],
   463),
  ([['SCN2A']], 393),
  ([['ANKRD11']], 337),
  ([['RPGRIP1']], 185),
  ([['SATB2']], 158),
  ([['FBN1']], 151),
  ([['LMNA']], 127),
  ([['TBX5']], 103),
  ([['SPTAN1']], 85),
  ([['GLI3']], 82)]}
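Because the path passes through more than one list, the facet values come back as nested lists such as `[['STXBP1']]`. For display it can be convenient to flatten them. This is a minimal post-processing sketch; the helpers `flatten` and `tidy_facets` and the `|` separator are arbitrary choices for illustration.

```python
def flatten(value):
    """Recursively flatten nested lists into a flat list of leaf values."""
    if isinstance(value, list):
        flat = []
        for item in value:
            flat.extend(flatten(item))
        return flat
    return [value]


def tidy_facets(facet_rows):
    """Turn ([['STXBP1']], 463) style rows into ('STXBP1', 463)."""
    return [("|".join(map(str, flatten(value))), count) for value, count in facet_rows]


# The first two rows of the output above
demo = [([["STXBP1"]], 463), ([["SCN2A"]], 393)]
tidy = tidy_facets(demo)
print(tidy)
```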

We can also facet on combinations:

[51]:

fqr = collection.query_facets({}, facet_columns=[("subject.sex", "diseases.term.label")], facet_limit=20)
fqr

[51]:

{('subject.sex',
  'diseases.term.label'): [({'diseasestermlabel': ['Developmental and epileptic encephalopathy 4']},
   463), ({'diseasestermlabel': ['Developmental and epileptic encephalopathy 11']},
   342), ({'diseasestermlabel': ['Leber congenital amaurosis 6']},
   191), ({'subjectsex': 'MALE',
    'diseasestermlabel': ['KBG syndrome']}, 175), ({'subjectsex': 'FEMALE',
    'diseasestermlabel': ['KBG syndrome']},
   143), ({'subjectsex': 'MALE', 'diseasestermlabel': ['Glass syndrome']},
   90), ({'subjectsex': 'FEMALE', 'diseasestermlabel': ['Glass syndrome']},
   62), ({'subjectsex': 'MALE',
    'diseasestermlabel': ['Mitochondrial DNA depletion syndrome 13 (encephalomyopathic type)']},
   58), ({'subjectsex': 'MALE',
    'diseasestermlabel': ['Neurodevelopmental disorder with coarse facies and mild distal skeletal abnormalities']},
   54), ({'diseasestermlabel': ['Holt-Oram syndrome']},
   53), ({'diseasestermlabel': ['Greig cephalopolysyndactyly syndrome']}, 51), ({'diseasestermlabel': ['Seizures, benign familial infantile, 3']},
   51), ({'subjectsex': 'FEMALE', 'diseasestermlabel': ['Jacobsen syndrome']},
   49), ({'diseasestermlabel': ['Emery-Dreifuss muscular dystrophy 2, autosomal dominant']},
   41), ({'diseasestermlabel': ['Cone-rod dystrophy 13']},
   38), ({'subjectsex': 'FEMALE',
    'diseasestermlabel': ['Mitochondrial DNA depletion syndrome 13 (encephalomyopathic type)']}, 37), ({'subjectsex': 'MALE',
    'diseasestermlabel': ['Coffin-Siris syndrome 8']},
   37), ({'subjectsex': 'FEMALE', 'diseasestermlabel': ['Kabuki Syndrome 1']},
   35), ({'subjectsex': 'MALE',
    'diseasestermlabel': ['Houge-Janssen syndrome 2']},
   32), ({'subjectsex': 'MALE', 'diseasestermlabel': ['Kabuki Syndrome 1']},
   30)]}

[60]:

import pandas as pd

def fqr_as_dfs(fqr: dict):
    """Convert each faceted query result into a DataFrame, one row per value combination."""
    dfs = []
    for k, vs in fqr.items():
        rows = []
        for obj, count in vs:
            row = {}
            for col in k:
                # Result keys are the column paths with the dots removed; a key is
                # absent when the value is missing, so look each one up by name
                key = col.replace(".", "")
                if key in obj:
                    val = obj[key]
                    row[col] = val[0] if isinstance(val, list) else val
            row["count"] = count
            rows.append(row)
        df = pd.DataFrame(columns=list(k) + ["count"], data=rows)
        dfs.append(df)
    return dfs

fqr_as_dfs(fqr)[0]

[60]:

	subject.sex	diseases.term.label	count
0	NaN	Developmental and epileptic encephalopathy 4	463
1	NaN	Developmental and epileptic encephalopathy 11	342
2	NaN	Leber congenital amaurosis 6	191
3	MALE	KBG syndrome	175
4	FEMALE	KBG syndrome	143
5	MALE	Glass syndrome	90
6	FEMALE	Glass syndrome	62
7	MALE	Mitochondrial DNA depletion syndrome 13 (encep...	58
8	MALE	Neurodevelopmental disorder with coarse facies...	54
9	NaN	Holt-Oram syndrome	53
10	NaN	Greig cephalopolysyndactyly syndrome	51
11	NaN	Seizures, benign familial infantile, 3	51
12	FEMALE	Jacobsen syndrome	49
13	NaN	Emery-Dreifuss muscular dystrophy 2, autosomal...	41
14	NaN	Cone-rod dystrophy 13	38
15	FEMALE	Mitochondrial DNA depletion syndrome 13 (encep...	37
16	MALE	Coffin-Siris syndrome 8	37
17	FEMALE	Kabuki Syndrome 1	35
18	MALE	Houge-Janssen syndrome 2	32
19	MALE	Kabuki Syndrome 1	30

Semantic Search

We will index phenopackets using a template that extracts the subject, phenotypic features and diseases.

First we will create a textualization template for a phenopacket. We will keep it minimal for simplicity - this doesn’t include treatments, families, etc.

[13]:

template = """
subject: {{subject}}
phenotypes: {% for p in phenotypicFeatures %}{{p.type.label}}{% endfor %}
diseases: {% for d in diseases %}{{d.term.label}}{% endfor %}
"""

Next we will create an indexer using the template, which is written in Jinja2 syntax. We will also cache LLM embedding queries, so that when we incrementally add new phenopackets we can avoid re-running embedding calls for text that has already been embedded.

[18]:

from linkml_store.index.implementations.llm_indexer import LLMIndexer

index = LLMIndexer(
    name="ppkt",
    cached_embeddings_database="tmp/llm_pheno_cache.db",
    text_template=template,
    text_template_syntax="jinja2",
)
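To illustrate why caching helps, here is a minimal sketch of the general idea: key each text by a hash and call the embedding function only on a cache miss. This is not the LLMIndexer implementation (which persists its cache to the database file given above); the class `CachedEmbedder` and the stand-in `toy_embed` are hypothetical.

```python
import hashlib


class CachedEmbedder:
    """Wrap an embedding function so each distinct text is embedded once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.calls = 0  # number of times the underlying function ran

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]


def toy_embed(text):
    # Stand-in for a real model call: a tiny, deterministic "vector"
    return [len(text), text.count(" ")]


embedder = CachedEmbedder(toy_embed)
embedder.embed("subject: A")
embedder.embed("subject: A")  # served from the cache
embedder.embed("subject: B")
print(embedder.calls)
```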

We can test the template on the first row of the collection:

[19]:

print(index.object_to_text(qr.rows[0]))


subject: {'id': 'Higgins-Patient-1', 'timeAtLastEncounter': {'age': {'iso8601duration': 'P17Y'}}, 'sex': 'FEMALE'}
phenotypes: Ventricular hypertrophyHeart murmurHypertrophic cardiomyopathyShort statureHypertelorismLow-set earsPosteriorly rotated earsGlobal developmental delayCognitive impairmentCardiac arrest
diseases: Noonan syndrome-11
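Note that the phenotype labels run together because the template loop emits no separator. One way to see what a separated rendering would look like is the plain-Python sketch below, which mirrors the template but joins labels with "; ". The function `textualize` and the abbreviated demo record are illustrative assumptions; the results later in this notebook were produced with the original template, and changing the text would change the embeddings.

```python
def textualize(ppkt, sep="; "):
    """Render a phenopacket dict as text, separating labels with `sep`."""
    phenotypes = sep.join(p["type"]["label"] for p in ppkt.get("phenotypicFeatures", []))
    diseases = sep.join(d["term"]["label"] for d in ppkt.get("diseases", []))
    return (
        f"subject: {ppkt.get('subject')}\n"
        f"phenotypes: {phenotypes}\n"
        f"diseases: {diseases}\n"
    )


# Abbreviated record based on the first row shown above
demo = {
    "subject": {"id": "Higgins-Patient-1", "sex": "FEMALE"},
    "phenotypicFeatures": [
        {"type": {"label": "Ventricular hypertrophy"}},
        {"type": {"label": "Heart murmur"}},
    ],
    "diseases": [{"term": {"id": "OMIM:618499", "label": "Noonan syndrome-11"}}],
}
text = textualize(demo)
print(text)
```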

That looks as expected. We can now attach the indexer to the collection and index the collection:

[20]:

collection.attach_indexer(index, auto_index=True)

Semantic Search

Let’s query based on text criteria:

[64]:

qr = collection.search("patients with liver diseases")
qr.rows_dataframe[0:5]

[64]:

	score	id	subject	phenotypicFeatures	interpretations	diseases	metaData
0	0.824664	PMID_30658709_patient	{'id': 'patient', 'timeAtLastEncounter': {'age...	[{'type': {'id': 'HP:0031956', 'label': 'Eleva...	[{'id': 'patient', 'progressStatus': 'SOLVED',...	[{'term': {'id': 'OMIM:615878', 'label': 'Chol...	{'created': '2024-05-05T09:03:25.388371944Z', ...
1	0.813827	PMID_36932076_Patient_1	{'id': 'Patient 1', 'timeAtLastEncounter': {'a...	[{'type': {'id': 'HP:0000979', 'label': 'Purpu...	[{'id': 'Patient 1', 'progressStatus': 'SOLVED...	[{'term': {'id': 'OMIM:620376', 'label': 'Auto...	{'created': '2024-04-19T06:07:57.188061952Z', ...
2	0.804126	PMID_37303127_6	{'id': '6', 'timeAtLastEncounter': {'age': {'i...	[{'type': {'id': 'HP:0001397', 'label': 'Hepat...	[{'id': '6', 'progressStatus': 'SOLVED', 'diag...	[{'term': {'id': 'OMIM:151660', 'label': 'Lipo...	{'created': '2024-03-23T17:41:42.999521017Z', ...
3	0.799738	PMID_36932076_Patient_3	{'id': 'Patient 3', 'timeAtLastEncounter': {'a...	[{'type': {'id': 'HP:0001511', 'label': 'Intra...	[{'id': 'Patient 3', 'progressStatus': 'SOLVED...	[{'term': {'id': 'OMIM:620376', 'label': 'Auto...	{'created': '2024-04-19T06:07:57.190312862Z', ...
4	0.799243	PMID_27536553_27536553_P3	{'id': '27536553_P3', 'timeAtLastEncounter': {...	[{'type': {'id': 'HP:0001396', 'label': 'Chole...	[{'id': '27536553_P3', 'progressStatus': 'SOLV...	[{'term': {'id': 'OMIM:256810', 'label': 'Mito...	{'created': '2024-03-23T19:28:35.688389062Z', ...
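The score column reflects how close each row's embedded text is to the embedded query. As a rough mental model, assuming cosine similarity as the comparison (an assumption for this sketch, with toy two-dimensional vectors rather than real embeddings), ranking works like this:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, rows):
    """Return (score, id) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), row_id) for row_id, vec in rows]
    return sorted(scored, reverse=True)


# Toy "embeddings": the first axis loosely stands for liver-related content
rows = [
    ("liver_case", [0.9, 0.1]),
    ("heart_case", [0.1, 0.9]),
    ("mixed_case", [0.6, 0.6]),
]
ranked = rank([1.0, 0.0], rows)
print([row_id for _, row_id in ranked])
```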

Let’s check the first one:

[65]:

qr.ranked_rows[0]

[65]:

(0.8246637496927007,
 {'id': 'PMID_30658709_patient',
  'subject': {'id': 'patient',
   'timeAtLastEncounter': {'age': {'iso8601duration': 'P1Y11M'}},
   'sex': 'FEMALE'},
  'phenotypicFeatures': [{'type': {'id': 'HP:0031956',
     'label': 'Elevated circulating aspartate aminotransferase concentration'},
    'onset': {'age': {'iso8601duration': 'P1Y11M'}}},
   {'type': {'id': 'HP:0031964',
     'label': 'Elevated circulating alanine aminotransferase concentration'},
    'onset': {'age': {'iso8601duration': 'P1Y11M'}}},
   {'type': {'id': 'HP:0003573', 'label': 'Increased total bilirubin'},
    'onset': {'age': {'iso8601duration': 'P6M'}}},
   {'type': {'id': 'HP:0012202',
     'label': 'Increased serum bile acid concentration'},
    'onset': {'age': {'iso8601duration': 'P6M'}}},
   {'type': {'id': 'HP:0002908', 'label': 'Conjugated hyperbilirubinemia'},
    'onset': {'age': {'iso8601duration': 'P6M'}}},
   {'type': {'id': 'HP:0001433', 'label': 'Hepatosplenomegaly'},
    'onset': {'age': {'iso8601duration': 'P6M'}}},
   {'type': {'id': 'HP:0001510', 'label': 'Growth delay'},
    'onset': {'age': {'iso8601duration': 'P6M'}}},
   {'type': {'id': 'HP:0000989', 'label': 'Pruritus'},
    'onset': {'age': {'iso8601duration': 'P6M'}}},
   {'type': {'id': 'HP:0000952', 'label': 'Jaundice'},
    'onset': {'age': {'iso8601duration': 'P6M'}}},
   {'type': {'id': 'HP:0100810', 'label': 'Pointed helix'},
    'onset': {'age': {'iso8601duration': 'P6M'}}},
   {'type': {'id': 'HP:0002650', 'label': 'Scoliosis'}},
   {'type': {'id': 'HP:0003112',
     'label': 'Abnormal circulating amino acid concentration'},
    'excluded': True},
   {'type': {'id': 'HP:0001928', 'label': 'Abnormality of coagulation'},
    'excluded': True},
   {'type': {'id': 'HP:0010701', 'label': 'Abnormal immunoglobulin level'},
    'excluded': True},
   {'type': {'id': 'HP:0001627', 'label': 'Abnormal heart morphology'},
    'excluded': True}],
  'interpretations': [{'id': 'patient',
    'progressStatus': 'SOLVED',
    'diagnosis': {'disease': {'id': 'OMIM:615878',
      'label': 'Cholestasis, progressive familial intrahepatic 4'},
     'genomicInterpretations': [{'subjectOrBiosampleId': 'patient',
       'interpretationStatus': 'CAUSATIVE',
       'variantInterpretation': {'variationDescriptor': {'id': 'var_kKNGnjOxGXMbcoWzDGEJKVPIB',
         'geneContext': {'valueId': 'HGNC:11828', 'symbol': 'TJP2'},
         'expressions': [{'syntax': 'hgvs.c',
           'value': 'NM_004817.4:c.2355+1G>C'},
          {'syntax': 'hgvs.g', 'value': 'NC_000009.12:g.69238790G>C'}],
         'vcfRecord': {'genomeAssembly': 'hg38',
          'chrom': 'chr9',
          'pos': '69238790',
          'ref': 'G',
          'alt': 'C'},
         'moleculeContext': 'genomic',
         'allelicState': {'id': 'GENO:0000136', 'label': 'homozygous'}}}}]}}],
  'diseases': [{'term': {'id': 'OMIM:615878',
     'label': 'Cholestasis, progressive familial intrahepatic 4'},
    'onset': {'ontologyClass': {'id': 'HP:0003593',
      'label': 'Infantile onset'}}}],
  'metaData': {'created': '2024-05-05T09:03:25.388371944Z',
   'createdBy': 'ORCID:0000-0002-0736-9199',
   'resources': [{'id': 'geno',
     'name': 'Genotype Ontology',
     'url': 'http://purl.obolibrary.org/obo/geno.owl',
     'version': '2022-03-05',
     'namespacePrefix': 'GENO',
     'iriPrefix': 'http://purl.obolibrary.org/obo/GENO_'},
    {'id': 'hgnc',
     'name': 'HUGO Gene Nomenclature Committee',
     'url': 'https://www.genenames.org',
     'version': '06/01/23',
     'namespacePrefix': 'HGNC',
     'iriPrefix': 'https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/'},
    {'id': 'omim',
     'name': 'An Online Catalog of Human Genes and Genetic Disorders',
     'url': 'https://www.omim.org',
     'version': 'January 4, 2023',
     'namespacePrefix': 'OMIM',
     'iriPrefix': 'https://www.omim.org/entry/'},
    {'id': 'so',
     'name': 'Sequence types and features ontology',
     'url': 'http://purl.obolibrary.org/obo/so.obo',
     'version': '2021-11-22',
     'namespacePrefix': 'SO',
     'iriPrefix': 'http://purl.obolibrary.org/obo/SO_'},
    {'id': 'hp',
     'name': 'human phenotype ontology',
     'url': 'http://purl.obolibrary.org/obo/hp.owl',
     'version': '2024-04-26',
     'namespacePrefix': 'HP',
     'iriPrefix': 'http://purl.obolibrary.org/obo/HP_'}],
   'phenopacketSchemaVersion': '2.0',
   'externalReferences': [{'id': 'PMID:30658709',
     'reference': 'https://pubmed.ncbi.nlm.nih.gov/30658709',
     'description': 'Novel compound heterozygote mutations of TJP2 in a Chinese child with progressive cholestatic liver disease'}]}})

We can combine semantic search with a structured query, here restricting the results to male subjects:

[66]:

qr = collection.search("patients with liver diseases", where={"subject.sex": "MALE"})
qr.rows_dataframe[0:5]

[66]:

	score	id	subject	phenotypicFeatures	interpretations	diseases	metaData
0	0.813827	PMID_36932076_Patient_1	{'id': 'Patient 1', 'timeAtLastEncounter': {'a...	[{'type': {'id': 'HP:0000979', 'label': 'Purpu...	[{'id': 'Patient 1', 'progressStatus': 'SOLVED...	[{'term': {'id': 'OMIM:620376', 'label': 'Auto...	{'created': '2024-04-19T06:07:57.188061952Z', ...
1	0.799738	PMID_36932076_Patient_3	{'id': 'Patient 3', 'timeAtLastEncounter': {'a...	[{'type': {'id': 'HP:0001511', 'label': 'Intra...	[{'id': 'Patient 3', 'progressStatus': 'SOLVED...	[{'term': {'id': 'OMIM:620376', 'label': 'Auto...	{'created': '2024-04-19T06:07:57.190312862Z', ...
2	0.799243	PMID_27536553_27536553_P3	{'id': '27536553_P3', 'timeAtLastEncounter': {...	[{'type': {'id': 'HP:0001396', 'label': 'Chole...	[{'id': '27536553_P3', 'progressStatus': 'SOLV...	[{'term': {'id': 'OMIM:256810', 'label': 'Mito...	{'created': '2024-03-23T19:28:35.688389062Z', ...
3	0.798670	PMID_29321044_Patient_3	{'id': 'Patient 3', 'timeAtLastEncounter': {'a...	[{'type': {'id': 'HP:0031956', 'label': 'Eleva...	[{'id': 'Patient 3', 'progressStatus': 'SOLVED...	[{'term': {'id': 'OMIM:616829', 'label': 'Cong...	{'created': '2024-05-11T06:05:50.632786035Z', ...
4	0.798010	PMID_36517554_patient_1	{'id': 'patient 1', 'timeAtLastEncounter': {'a...	[{'type': {'id': 'HP:0002240', 'label': 'Hepat...	[{'id': 'patient 1', 'progressStatus': 'SOLVED...	[{'term': {'id': 'OMIM:620603', 'label': 'Immu...	{'created': '2024-03-29T11:25:36.649104833Z', ...

Validation

Next we will demonstrate validation over a whole collection.

Validation currently depends on a LinkML schema. We have previously copied the phenopackets schema into the test folder, so we can load it into the database object:

[26]:

db.load_schema_view("../../tests/input/phenopackets_linkml/phenopackets.yaml")

Quick sanity check to ensure that worked:

[30]:

list(db.schema_view.all_classes())[0:10]

[30]:

['Age',
 'AgeRange',
 'Dictionary',
 'Evidence',
 'ExternalReference',
 'File',
 'GestationalAge',
 'OntologyClass',
 'Procedure',
 'TimeElement']

[32]:

# Declare which class in the schema the objects in this collection instantiate
collection.metadata.type = "Phenopacket"

[39]:

for r in db.iter_validate_database():
    # known issue - https://github.com/monarch-initiative/phenopacket-store/issues/97
    if "is not of type 'integer'" in r.message:
        continue
    # any other validation failure is unexpected: show it and stop
    print(r.message[0:100])
    print(r)
    raise ValueError("Unexpected validation error")
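The loop above stops at the first unexpected error. When exploring a new dataset it can be more useful to collect all unexpected results first. The sketch below shows that filtering step on stand-in objects; it assumes only that each result has a `.message` attribute, and the helper `unexpected_results` and the demo messages are made up for illustration.

```python
from types import SimpleNamespace

# Substrings of messages we have decided to tolerate
KNOWN_ISSUES = ("is not of type 'integer'",)


def unexpected_results(results, known=KNOWN_ISSUES):
    """Keep only results whose message matches none of the known issues."""
    return [r for r in results if not any(k in r.message for k in known)]


# Stand-ins for validation results (illustrative messages, not real output)
demo = [
    SimpleNamespace(message="'69238790' is not of type 'integer'"),
    SimpleNamespace(message="'id' is a required property"),
]
remaining = unexpected_results(demo)
print([r.message for r in remaining])
```

With a live database, `unexpected_results(db.iter_validate_database())` would be the analogous call.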

Command Line Usage

We can also use the command line for all of the above operations.

For example, faceted queries:

[68]:

!linkml-store -d mongodb://localhost:27017 -c main fq -S subject.sex

{
  "subject.sex": {
    "MALE": 1807,
    "FEMALE": 1564,
    "None": 1505
  }
}
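Since the command prints JSON, its output is easy to post-process. One detail to watch is that JSON object keys are always strings, so the missing value appears as the string "None". The sketch below parses the output shown above (pasted in as a literal) and restores a real `None` key; the variable names are arbitrary.

```python
import json

# Output of the command above, pasted as a string
cli_output = """
{
  "subject.sex": {
    "MALE": 1807,
    "FEMALE": 1564,
    "None": 1505
  }
}
"""

facets = json.loads(cli_output)
# Map the string "None" back to a real None so it matches the Python API output
sex_counts = {
    (None if key == "None" else key): count
    for key, count in facets["subject.sex"].items()
}
total = sum(sex_counts.values())
print(total)
```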
