Working with Complex Hierarchical Data¶
This tutorial demonstrates how linkml-data-qc handles deeply nested data structures with multiple levels of lists and objects. We'll use a disease knowledge base as an example.
The Challenge¶
Real-world knowledge bases often have complex structures:
- Diseases with multiple subtypes
- Each subtype has evidence citations
- Pathophysiology entries with cell type annotations
- Phenotypes with nested ontology term bindings
Tracking compliance across all these nested structures requires understanding path notation.
Setup: A Complex Schema¶
Let's create a schema that models a disease knowledge base with nested structures:
%%bash
cat > /tmp/disease_kb_schema.yaml << 'EOF'
id: https://example.org/disease-kb
name: disease_kb_schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string
classes:
  Disease:
    attributes:
      name:
        required: true
      description:
        recommended: true
      disease_term:
        range: OntologyAnnotation
        inlined: true
        recommended: true
      has_subtypes:
        range: DiseaseSubtype
        multivalued: true
        inlined_as_list: true
      pathophysiology:
        range: PathophysiologyStep
        multivalued: true
        inlined_as_list: true
      phenotypes:
        range: Phenotype
        multivalued: true
        inlined_as_list: true
  OntologyAnnotation:
    attributes:
      preferred_term:
        recommended: true
      term:
        range: OntologyTerm
        inlined: true
        recommended: true
  OntologyTerm:
    attributes:
      id:
        recommended: true
      label:
        recommended: true
  DiseaseSubtype:
    attributes:
      name:
        required: true
      description:
        recommended: true
      evidence:
        range: Evidence
        multivalued: true
        inlined_as_list: true
  Evidence:
    attributes:
      reference:
        recommended: true
      supports:
        recommended: true
      snippet:
        recommended: true
      explanation:
        recommended: true
  PathophysiologyStep:
    attributes:
      name:
        required: true
      description:
        recommended: true
      cell_types:
        range: OntologyAnnotation
        multivalued: true
        inlined_as_list: true
      evidence:
        range: Evidence
        multivalued: true
        inlined_as_list: true
  Phenotype:
    attributes:
      name:
        required: true
      category:
        recommended: true
      frequency:
        recommended: true
      phenotype_term:
        range: OntologyAnnotation
        inlined: true
        recommended: true
      evidence:
        range: Evidence
        multivalued: true
        inlined_as_list: true
EOF
echo "Schema created with nested classes!"
Schema created with nested classes!
Test Data: A Disease Entry¶
Now let's create a disease entry with varying levels of completeness:
%%bash
cat > /tmp/disease_aps.yaml << 'EOF'
name: Antiphospholipid Syndrome
description: A systemic autoimmune disorder characterized by antiphospholipid antibodies.
disease_term:
  preferred_term: antiphospholipid syndrome
  term:
    id: MONDO:0000810
    label: antiphospholipid syndrome
has_subtypes:
  - name: Primary APS
    description: occurs in the absence of any other disease
    evidence:
      - reference: PMID:16338214
        supports: SUPPORT
        snippet: the condition can exist on its own
        explanation: Supports primary APS definition
      - reference: PMID:27550302
        supports: SUPPORT
        snippet: APS can be isolated (primary APS)
        # Missing explanation!
  - name: Secondary APS
    description: occurs with other autoimmune diseases
    evidence:
      - reference: PMID:11014973
        supports: SUPPORT
        snippet: APS may be associated with another autoimmune disease
        explanation: Confirms secondary APS
  - name: Asymptomatic APS
    # Missing description!
    evidence:
      - reference: PMID:17145604
        supports: SUPPORT
        # Missing snippet and explanation!
pathophysiology:
  - name: Antibody Production
    description: The immune system produces antiphospholipid antibodies.
    cell_types:
      - preferred_term: B cell
        term:
          id: CL:0000236
          label: B cell
      - preferred_term: T cell
        # Missing nested term!
    evidence:
      - reference: PMID:29867951
        supports: SUPPORT
        snippet: production of antibodies that bind phospholipid-binding proteins
        explanation: Describes antibody production mechanism
  - name: Blood Clot Formation
    description: Antibodies increase thrombosis risk.
    # No cell_types - that's fine, not always applicable
    evidence:
      - reference: PMID:22100379
        supports: SUPPORT
        snippet: antibodies associated with thrombosis risk
        explanation: Supports clotting mechanism
phenotypes:
  - name: Deep Vein Thrombosis
    category: Thrombosis
    frequency: FREQUENT
    phenotype_term:
      preferred_term: DVT
      term:
        id: HP:0002625
        label: deep vein thrombosis
    evidence:
      - reference: PMID:36575066
        supports: SUPPORT
        snippet: venous thromboembolism is frequent
        explanation: Confirms DVT frequency
  - name: Preeclampsia
    category: Pregnancy-Related
    # Missing frequency!
    phenotype_term:
      preferred_term: Preeclampsia
      term:
        id: HP:0100602
        # Missing label!
    evidence:
      - reference: PMID:26815583
        supports: PARTIAL
        # Missing snippet and explanation!
  - name: Livedo Reticularis
    category: Dermatologic
    frequency: OCCASIONAL
    # Missing phenotype_term entirely!
    evidence:
      - reference: PMID:26223086
        supports: SUPPORT
        snippet: common cutaneous manifestation
        explanation: Supports dermatologic involvement
EOF
echo "Complex disease entry created!"
Complex disease entry created!
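Before running the tool, it can help to see which nested objects an entry like this contains and how each one would be addressed. The following is a minimal sketch, not linkml-data-qc's own traversal: the function name `iter_object_paths` and the trimmed-down `disease` dictionary are assumptions made for illustration. It walks nested dictionaries and lists and yields a dotted path with list indices for every nested object.

```python
from typing import Any, Iterator


def iter_object_paths(node: Any, path: str = "") -> Iterator[str]:
    """Yield a path for every nested mapping reachable from `node`."""
    if isinstance(node, dict):
        # The top-level object has an empty path; label it "(root)".
        yield path or "(root)"
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            yield from iter_object_paths(value, child)
    elif isinstance(node, list):
        # List items are addressed by index: slot[0], slot[1], ...
        for index, item in enumerate(node):
            yield from iter_object_paths(item, f"{path}[{index}]")
    # Scalars (strings, numbers) are slot values, not objects, so they yield nothing.


# Hypothetical, trimmed-down excerpt of the disease entry above.
disease = {
    "name": "Antiphospholipid Syndrome",
    "disease_term": {
        "preferred_term": "antiphospholipid syndrome",
        "term": {"id": "MONDO:0000810", "label": "antiphospholipid syndrome"},
    },
    "pathophysiology": [
        {
            "name": "Antibody Production",
            "cell_types": [
                {"preferred_term": "B cell", "term": {"id": "CL:0000236", "label": "B cell"}},
                {"preferred_term": "T cell"},  # nested term is missing
            ],
        }
    ],
}

paths = list(iter_object_paths(disease))
for p in paths:
    print(p)
```

The sketch lists every nested object it finds. A compliance report may list fewer, since an object with no scored slots has nothing to report.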
Analyzing Nested Compliance¶
Let's run the analysis and see how paths are reported:
%%bash
linkml-data-qc /tmp/disease_aps.yaml \
-s /tmp/disease_kb_schema.yaml \
-t Disease
Compliance Report: /tmp/disease_aps.yaml
Target Class: Disease
Global Compliance: 85.7% (24/28)
Weighted Compliance: 82.6%
Summary by Slot:
  category: 100.0%
  disease_term: 100.0%
  frequency: 66.7%
  id: 100.0%
  label: 75.0%
  phenotype_term: 66.7%
  preferred_term: 100.0%
  term: 80.0%
Aggregated Scores by List Path:
  pathophysiology[].cell_types[].preferred_term: 100.0% (2/2)
  pathophysiology[].cell_types[].term: 50.0% (1/2)
  pathophysiology[].cell_types[].term.id: 100.0% (1/1)
  pathophysiology[].cell_types[].term.label: 100.0% (1/1)
  phenotypes[].category: 100.0% (3/3)
  phenotypes[].frequency: 66.7% (2/3)
  phenotypes[].phenotype_term: 66.7% (2/3)
  phenotypes[].phenotype_term.preferred_term: 100.0% (2/2)
  phenotypes[].phenotype_term.term: 100.0% (2/2)
  phenotypes[].phenotype_term.term.id: 100.0% (2/2)
  phenotypes[].phenotype_term.term.label: 50.0% (1/2)
Detailed Path Scores:
  (root) (Disease): 100.0%
    - disease_term: OK
  disease_term (OntologyAnnotation): 100.0%
    - preferred_term: OK
    - term: OK
  disease_term.term (OntologyTerm): 100.0%
    - id: OK
    - label: OK
  pathophysiology[0].cell_types[0] (OntologyAnnotation): 100.0%
    - preferred_term: OK
    - term: OK
  pathophysiology[0].cell_types[0].term (OntologyTerm): 100.0%
    - id: OK
    - label: OK
  pathophysiology[0].cell_types[1] (OntologyAnnotation): 50.0%
    - preferred_term: OK
    - term: MISSING
  phenotypes[0] (Phenotype): 100.0%
    - category: OK
    - frequency: OK
    - phenotype_term: OK
  phenotypes[0].phenotype_term (OntologyAnnotation): 100.0%
    - preferred_term: OK
    - term: OK
  phenotypes[0].phenotype_term.term (OntologyTerm): 100.0%
    - id: OK
    - label: OK
  phenotypes[1] (Phenotype): 66.7%
    - category: OK
    - frequency: MISSING
    - phenotype_term: OK
  phenotypes[1].phenotype_term (OntologyAnnotation): 100.0%
    - preferred_term: OK
    - term: OK
  phenotypes[1].phenotype_term.term (OntologyTerm): 50.0%
    - id: OK
    - label: MISSING
  phenotypes[2] (Phenotype): 66.7%
    - category: OK
    - frequency: OK
    - phenotype_term: MISSING
Understanding the Paths¶
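Aggregated paths replace each list index with [], so a single path covers every item in that list. Paths in this notation look like:
- has_subtypes[].description: all subtype descriptions
- has_subtypes[].evidence[].snippet: all snippets across all evidence in all subtypes
- pathophysiology[].cell_types[].term: all nested term objects in cell_types
- phenotypes[].phenotype_term.term.label: deeply nested labels
Of these, only the last two appear in the report above; that run lists no has_subtypes or evidence paths.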
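The relationship between the indexed paths in the detailed report and the aggregated paths can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's implementation: `aggregate_path` and `roll_up` are hypothetical helper names, and the `checks` rows are copied by hand from the OK/MISSING entries of the detailed report.

```python
import re
from collections import defaultdict

# Matches a concrete list index such as [0] or [12].
INDEX = re.compile(r"\[\d+\]")


def aggregate_path(concrete_path: str) -> str:
    """Replace every list index with [] so all list items share one path."""
    return INDEX.sub("[]", concrete_path)


# (concrete path, slot, populated?) rows taken from the detailed report.
checks = [
    ("pathophysiology[0].cell_types[0]", "preferred_term", True),
    ("pathophysiology[0].cell_types[0]", "term", True),
    ("pathophysiology[0].cell_types[1]", "preferred_term", True),
    ("pathophysiology[0].cell_types[1]", "term", False),
    ("phenotypes[0]", "frequency", True),
    ("phenotypes[1]", "frequency", False),
    ("phenotypes[2]", "frequency", True),
]


def roll_up(rows):
    """Sum populated/total counts per aggregated path."""
    counts = defaultdict(lambda: [0, 0])
    for path, slot, populated in rows:
        key = f"{aggregate_path(path)}.{slot}"
        counts[key][0] += int(populated)
        counts[key][1] += 1
    return {key: tuple(value) for key, value in counts.items()}


aggregated = roll_up(checks)
for key, (populated, total) in aggregated.items():
    print(f"{key}: {100 * populated / total:.1f}% ({populated}/{total})")
```

For these rows the sketch gives 2/2 for pathophysiology[].cell_types[].preferred_term, 1/2 for pathophysiology[].cell_types[].term, and 2/3 for phenotypes[].frequency, which agrees with the corresponding lines of the aggregated section of the report.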
Let's look at the JSON output for more detail:
%%bash
linkml-data-qc /tmp/disease_aps.yaml \
-s /tmp/disease_kb_schema.yaml \
-t Disease \
-f json | python -m json.tool | head -100
{
    "file_path": "/tmp/disease_aps.yaml",
    "target_class": "Disease",
    "schema_path": "/tmp/disease_kb_schema.yaml",
    "global_compliance": 85.71428571428571,
    "weighted_compliance": 82.6086956521739,
    "total_checks": 28,
    "total_populated": 24,
    "path_scores": [
        {
            "path": "(root)",
            "parent_class": "Disease",
            "item_count": 1,
            "slot_scores": [
                {
                    "path": "(root)",
                    "slot_name": "disease_term",
                    "populated": 1,
                    "total": 1,
                    "percentage": 100.0
                }
            ],
            "overall_percentage": 100.0
        },
        {
            "path": "disease_term",
            "parent_class": "OntologyAnnotation",
            "item_count": 1,
            "slot_scores": [
                {
                    "path": "disease_term",
                    "slot_name": "preferred_term",
                    "populated": 1,
                    "total": 1,
                    "percentage": 100.0
                },
                {
                    "path": "disease_term",
                    "slot_name": "term",
                    "populated": 1,
                    "total": 1,
                    "percentage": 100.0
                }
            ],
            "overall_percentage": 100.0
        },
        {
            "path": "disease_term.term",
            "parent_class": "OntologyTerm",
            "item_count": 1,
            "slot_scores": [
                {
                    "path": "disease_term.term",
                    "slot_name": "id",
                    "populated": 1,
                    "total": 1,
                    "percentage": 100.0
                },
                {
                    "path": "disease_term.term",
                    "slot_name": "label",
                    "populated": 1,
                    "total": 1,
                    "percentage": 100.0
                }
            ],
            "overall_percentage": 100.0
        },
        {
            "path": "pathophysiology[0].cell_types[0]",
            "parent_class": "OntologyAnnotation",
            "item_count": 1,
            "slot_scores": [
                {
                    "path": "pathophysiology[0].cell_types[0]",
                    "slot_name": "preferred_term",
                    "populated": 1,
                    "total": 1,
                    "percentage": 100.0
                },
                {
                    "path": "pathophysiology[0].cell_types[0]",
                    "slot_name": "term",
                    "populated": 1,
                    "total": 1,
                    "percentage": 100.0
                }
            ],
            "overall_percentage": 100.0
        },
        {
            "path": "pathophysiology[0].cell_types[0].term",
            "parent_class": "OntologyTerm",
            "item_count": 1,
            "slot_scores": [
                {
                    "path": "pathophysiology[0].cell_types[0].term",
                    "slot_name": "id",
                    "populated": 1,
                    "total": 1,
Configuring Weights for Nested Paths¶
We can prioritize certain nested paths over others:
%%bash
cat > /tmp/disease_qc_config.yaml << 'EOF'
default_weight: 1.0
# Slot-level configuration
slots:
# Ontology term bindings are critical
id:
weight: 3.0
min_compliance: 90.0
label:
weight: 2.0
min_compliance: 80.0
# Evidence snippets are important
snippet:
weight: 2.0
min_compliance: 75.0
explanation:
weight: 1.5
# Descriptions nice-to-have
description:
weight: 0.5
# Path-level overrides for specific nested locations
paths:
# Phenotype terms are especially critical
"phenotypes[].phenotype_term.term.id":
weight: 5.0
min_compliance: 95.0
# Root disease term is essential
"disease_term.term.id":
weight: 5.0
min_compliance: 100.0
EOF
echo "Configuration created!"
Configuration created!
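The comments in the configuration describe the paths section as overriding the slot-level settings. One way to read that precedence is sketched below. This is an assumption about lookup order drawn from those comments, not the tool's documented algorithm, and `resolve_weight` is a hypothetical helper; the `CONFIG` dictionary mirrors the weights in the YAML file above.

```python
# Weights from /tmp/disease_qc_config.yaml, written as a Python dict.
CONFIG = {
    "default_weight": 1.0,
    "slots": {
        "id": {"weight": 3.0},
        "label": {"weight": 2.0},
        "snippet": {"weight": 2.0},
        "explanation": {"weight": 1.5},
        "description": {"weight": 0.5},
    },
    "paths": {
        "phenotypes[].phenotype_term.term.id": {"weight": 5.0},
        "disease_term.term.id": {"weight": 5.0},
    },
}


def resolve_weight(config: dict, aggregated_path: str) -> float:
    """Assumed precedence: exact path rule, then slot rule, then default."""
    path_rule = config.get("paths", {}).get(aggregated_path, {})
    if "weight" in path_rule:
        return path_rule["weight"]
    # The slot name is the last dotted segment of the path.
    slot_name = aggregated_path.rsplit(".", 1)[-1]
    slot_rule = config.get("slots", {}).get(slot_name, {})
    return slot_rule.get("weight", config.get("default_weight", 1.0))


for path in (
    "phenotypes[].phenotype_term.term.id",     # path-level override
    "pathophysiology[].cell_types[].term.id",  # slot-level rule for id
    "phenotypes[].frequency",                  # no rule, default weight
):
    print(path, resolve_weight(CONFIG, path))
```

Under this reading, an id inside a phenotype term counts more than an id inside a cell type annotation, which is the prioritization the configuration comments describe.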
%%bash
linkml-data-qc /tmp/disease_aps.yaml \
-s /tmp/disease_kb_schema.yaml \
-t Disease \
-c /tmp/disease_qc_config.yaml
Compliance Report: /tmp/disease_aps.yaml
Target Class: Disease
Global Compliance: 85.7% (24/28)
Weighted Compliance: 86.1%
Config: /tmp/disease_qc_config.yaml
Summary by Slot:
  category: 100.0%
  disease_term: 100.0%
  frequency: 66.7%
  id: 100.0%
  label: 75.0%
  phenotype_term: 66.7%
  preferred_term: 100.0%
  term: 80.0%
Threshold Violations (1):
  phenotypes[].phenotype_term.term.label: 50.0% < 80.0% (shortfall: 30.0%)
Aggregated Scores by List Path:
  pathophysiology[].cell_types[].preferred_term: 100.0% (2/2)
  pathophysiology[].cell_types[].term: 50.0% (1/2)
  pathophysiology[].cell_types[].term.id: 100.0% (1/1)
  pathophysiology[].cell_types[].term.label: 100.0% (1/1)
  phenotypes[].category: 100.0% (3/3)
  phenotypes[].frequency: 66.7% (2/3)
  phenotypes[].phenotype_term: 66.7% (2/3)
  phenotypes[].phenotype_term.preferred_term: 100.0% (2/2)
  phenotypes[].phenotype_term.term: 100.0% (2/2)
  phenotypes[].phenotype_term.term.id: 100.0% (2/2)
  phenotypes[].phenotype_term.term.label: 50.0% (1/2)
Detailed Path Scores:
  (root) (Disease): 100.0%
    - disease_term: OK
  disease_term (OntologyAnnotation): 100.0%
    - preferred_term: OK
    - term: OK
  disease_term.term (OntologyTerm): 100.0%
    - id: OK
    - label: OK
  pathophysiology[0].cell_types[0] (OntologyAnnotation): 100.0%
    - preferred_term: OK
    - term: OK
  pathophysiology[0].cell_types[0].term (OntologyTerm): 100.0%
    - id: OK
    - label: OK
  pathophysiology[0].cell_types[1] (OntologyAnnotation): 50.0%
    - preferred_term: OK
    - term: MISSING
  phenotypes[0] (Phenotype): 100.0%
    - category: OK
    - frequency: OK
    - phenotype_term: OK
  phenotypes[0].phenotype_term (OntologyAnnotation): 100.0%
    - preferred_term: OK
    - term: OK
  phenotypes[0].phenotype_term.term (OntologyTerm): 100.0%
    - id: OK
    - label: OK
  phenotypes[1] (Phenotype): 66.7%
    - category: OK
    - frequency: MISSING
    - phenotype_term: OK
  phenotypes[1].phenotype_term (OntologyAnnotation): 100.0%
    - preferred_term: OK
    - term: OK
  phenotypes[1].phenotype_term.term (OntologyTerm): 50.0%
    - id: OK
    - label: MISSING
  phenotypes[2] (Phenotype): 66.7%
    - category: OK
    - frequency: OK
    - phenotype_term: MISSING
Checking for Threshold Violations¶
%%bash
linkml-data-qc /tmp/disease_aps.yaml \
-s /tmp/disease_kb_schema.yaml \
-t Disease \
-c /tmp/disease_qc_config.yaml \
--fail-on-violations || echo "Violations detected! Exit code: $?"
Compliance Report: /tmp/disease_aps.yaml
Target Class: Disease
Global Compliance: 85.7% (24/28)
Weighted Compliance: 86.1%
Config: /tmp/disease_qc_config.yaml
Summary by Slot:
  category: 100.0%
  disease_term: 100.0%
  frequency: 66.7%
  id: 100.0%
  label: 75.0%
  phenotype_term: 66.7%
  preferred_term: 100.0%
  term: 80.0%
Threshold Violations (1):
  phenotypes[].phenotype_term.term.label: 50.0% < 80.0% (shortfall: 30.0%)
Aggregated Scores by List Path:
  pathophysiology[].cell_types[].preferred_term: 100.0% (2/2)
  pathophysiology[].cell_types[].term: 50.0% (1/2)
  pathophysiology[].cell_types[].term.id: 100.0% (1/1)
  pathophysiology[].cell_types[].term.label: 100.0% (1/1)
  phenotypes[].category: 100.0% (3/3)
  phenotypes[].frequency: 66.7% (2/3)
  phenotypes[].phenotype_term: 66.7% (2/3)
  phenotypes[].phenotype_term.preferred_term: 100.0% (2/2)
  phenotypes[].phenotype_term.term: 100.0% (2/2)
  phenotypes[].phenotype_term.term.id: 100.0% (2/2)
  phenotypes[].phenotype_term.term.label: 50.0% (1/2)
Detailed Path Scores:
  (root) (Disease): 100.0%
    - disease_term: OK
  disease_term (OntologyAnnotation): 100.0%
    - preferred_term: OK
    - term: OK
  disease_term.term (OntologyTerm): 100.0%
    - id: OK
    - label: OK
  pathophysiology[0].cell_types[0] (OntologyAnnotation): 100.0%
    - preferred_term: OK
    - term: OK
  pathophysiology[0].cell_types[0].term (OntologyTerm): 100.0%
    - id: OK
    - label: OK
  pathophysiology[0].cell_types[1] (OntologyAnnotation): 50.0%
    - preferred_term: OK
    - term: MISSING
  phenotypes[0] (Phenotype): 100.0%
    - category: OK
    - frequency: OK
    - phenotype_term: OK
  phenotypes[0].phenotype_term (OntologyAnnotation): 100.0%
    - preferred_term: OK
    - term: OK
  phenotypes[0].phenotype_term.term (OntologyTerm): 100.0%
    - id: OK
    - label: OK
  phenotypes[1] (Phenotype): 66.7%
    - category: OK
    - frequency: MISSING
    - phenotype_term: OK
  phenotypes[1].phenotype_term (OntologyAnnotation): 100.0%
    - preferred_term: OK
    - term: OK
  phenotypes[1].phenotype_term.term (OntologyTerm): 50.0%
    - id: OK
    - label: MISSING
  phenotypes[2] (Phenotype): 66.7%
    - category: OK
    - frequency: OK
    - phenotype_term: MISSING
1 threshold violation(s) found: phenotypes[].phenotype_term.term.label: 50.0% < 80.0%
Violations detected! Exit code: 1
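The single violation reported above can be reproduced with a small check over the aggregated scores. The sketch below is an approximation under stated assumptions rather than the tool's own logic: `find_violations` is a hypothetical helper, a path-level min_compliance is assumed to take precedence over a slot-level one, and the `scores` dictionary is a hand-copied subset of the aggregated section of the report.

```python
# Aggregated percentages copied from the report (subset).
scores = {
    "pathophysiology[].cell_types[].term.id": 100.0,
    "pathophysiology[].cell_types[].term.label": 100.0,
    "phenotypes[].frequency": 66.7,
    "phenotypes[].phenotype_term.term.id": 100.0,
    "phenotypes[].phenotype_term.term.label": 50.0,
}

# min_compliance values from /tmp/disease_qc_config.yaml.
slot_minimums = {"id": 90.0, "label": 80.0, "snippet": 75.0}
path_minimums = {
    "phenotypes[].phenotype_term.term.id": 95.0,
    "disease_term.term.id": 100.0,
}


def find_violations(scores, slot_minimums, path_minimums):
    """Return (path, percentage, minimum) for every score below its threshold."""
    violations = []
    for path, percentage in scores.items():
        slot_name = path.rsplit(".", 1)[-1]
        # Assumed precedence: path-level threshold first, then slot-level.
        minimum = path_minimums.get(path, slot_minimums.get(slot_name))
        if minimum is not None and percentage < minimum:
            violations.append((path, percentage, minimum))
    return violations


violations = find_violations(scores, slot_minimums, path_minimums)
for path, percentage, minimum in violations:
    shortfall = minimum - percentage
    print(f"{path}: {percentage:.1f}% < {minimum:.1f}% (shortfall: {shortfall:.1f}%)")
```

Paths without a configured minimum, such as phenotypes[].frequency, are never flagged regardless of their score. In a CI script, a non-empty `violations` list would correspond to the non-zero exit code that --fail-on-violations produces.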
Finding Low-Compliance Areas¶
Use CSV output to find the paths with lowest compliance:
%%bash
echo "=== Lowest compliance areas ==="
linkml-data-qc /tmp/disease_aps.yaml \
-s /tmp/disease_kb_schema.yaml \
-t Disease \
-f csv | tail -n +2 | sort -t, -k7 -n | head -10
=== Lowest compliance areas ===
/tmp/disease_aps.yaml,pathophysiology[0].cell_types[1],OntologyAnnotation,term,0,1,0.0
/tmp/disease_aps.yaml,phenotypes[1],Phenotype,frequency,0,1,0.0
/tmp/disease_aps.yaml,phenotypes[1].phenotype_term.term,OntologyTerm,label,0,1,0.0
/tmp/disease_aps.yaml,phenotypes[2],Phenotype,phenotype_term,0,1,0.0
/tmp/disease_aps.yaml,(root),Disease,disease_term,1,1,100.0
/tmp/disease_aps.yaml,disease_term,OntologyAnnotation,preferred_term,1,1,100.0
/tmp/disease_aps.yaml,disease_term,OntologyAnnotation,term,1,1,100.0
/tmp/disease_aps.yaml,disease_term.term,OntologyTerm,id,1,1,100.0
/tmp/disease_aps.yaml,disease_term.term,OntologyTerm,label,1,1,100.0
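The same ranking can be done in Python when the rows need further processing. This is a minimal sketch using only the standard library. The header row is not shown above, so the column meanings used here (file, path, class, slot, populated, total, percentage) are inferred from the values and should be treated as an assumption; the sample rows are copied from the output.

```python
import csv
import io

# A few header-less rows copied from the CSV output above.
rows_text = """\
/tmp/disease_aps.yaml,(root),Disease,disease_term,1,1,100.0
/tmp/disease_aps.yaml,phenotypes[1],Phenotype,frequency,0,1,0.0
/tmp/disease_aps.yaml,disease_term.term,OntologyTerm,label,1,1,100.0
/tmp/disease_aps.yaml,phenotypes[2],Phenotype,phenotype_term,0,1,0.0
"""

rows = list(csv.reader(io.StringIO(rows_text)))

# Sort ascending by the last column (assumed to be the percentage).
# list.sort is stable, so rows with equal scores keep their input order.
rows.sort(key=lambda row: float(row[-1]))

# Keep only incomplete entries, reported as (path, slot).
lowest = [(row[1], row[3]) for row in rows if float(row[-1]) < 100.0]
for path, slot in lowest:
    print(f"{path}: {slot} is missing")
```

Reading real output would mean replacing `io.StringIO(rows_text)` with an open file handle and skipping the header row, as tail -n +2 does in the shell pipeline.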
Path Patterns Summary¶
Here's what we learned about paths in complex data:
| Pattern | Example | Meaning |
|---|---|---|
| (root) | - | Top-level object |
| slot | disease_term | Direct child object |
| slot.nested | disease_term.term | Nested object |
| list[] | has_subtypes[] | All items in list |
| list[].slot | has_subtypes[].description | Slot in all list items |
| list[].nested[] | has_subtypes[].evidence[] | Nested list in list |
| list[].obj.slot | phenotypes[].phenotype_term.term | Deeply nested |
Key Takeaways¶
- Aggregated paths (with []) show compliance across all matching items
- Deeply nested paths help identify specific curation gaps
- Path-specific config lets you prioritize critical nested locations
- CSV output is useful for finding lowest-compliance areas to prioritize curation