Advanced Usage: Validating Data with LinkML Schemas¶
This tutorial demonstrates how to use the linkml-reference-validator validate data command to check the supporting text in structured data files against the references it cites.
What is validate data?¶
While validate text checks a single quote, validate data validates entire data files:
- Reads YAML/JSON data files
- Uses LinkML schemas to identify fields containing supporting text
- Validates all supporting text claims in batch
- Integrates with linkml-validate for complete data validation
Part 1: Create a LinkML Schema¶
First, let's create a schema that defines our data model. We'll use special slot URIs to mark which fields contain supporting text:
- linkml:excerpt - The field contains quoted text
- linkml:authoritative_reference - The field contains the reference ID
%%bash
cat > schema.yaml << 'EOF'
id: https://example.org/gene-functions
name: gene-functions
title: Gene Function Annotations
description: Schema for gene function claims with supporting evidence
prefixes:
  linkml: https://w3id.org/linkml/
  example: https://example.org/
default_prefix: example
default_range: string
classes:
  GeneFunction:
    description: A gene function annotation with supporting evidence
    attributes:
      gene_symbol:
        description: Gene symbol (e.g., MUC1, BRCA1)
        identifier: true
      function:
        description: Functional description of the gene
        required: true
      supporting_text:
        description: Quoted text from publication supporting this function
        slot_uri: linkml:excerpt
        required: true
      reference:
        description: Reference ID (e.g., PMID:12345678)
        slot_uri: linkml:authoritative_reference
        required: true
EOF
echo "✅ Created schema.yaml"
✅ Created schema.yaml
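To double-check which attributes the validator will treat as quote and reference, the slot URIs can be listed straight from the schema file. The following is a minimal sketch and not part of linkml-reference-validator: list_slot_uris is a hypothetical helper that scans the file line by line with awk, and it assumes each attribute name sits on its own line with its slot_uri somewhere below it.

```shell
# Hypothetical helper: print "attribute<TAB>slot_uri" for every attribute
# in a LinkML schema file that declares a slot_uri.
# This is a naive line scan, not a YAML parser.
list_slot_uris() {
  awk '
    # A line holding only "name:" opens a nested block; remember the name.
    /^[ \t]*[A-Za-z_][A-Za-z0-9_]*:[ \t]*$/ {
      name = $1
      sub(/:$/, "", name)
      next
    }
    # A slot_uri line belongs to the most recently opened block.
    /^[ \t]*slot_uri:/ { print name "\t" $2 }
  ' "$1"
}

# Run against the schema from Part 1 when it is present.
if [ -f schema.yaml ]; then
  list_slot_uris schema.yaml
fi
```

For the schema above this should list supporting_text with linkml:excerpt and reference with linkml:authoritative_reference. A real YAML parser would be the safer choice for schemas that use inline or flow-style mappings.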
Part 2: Create Data with Real Citations¶
Now let's create data with real supporting text from actual papers:
%%bash
cat > gene_functions.yaml << 'EOF'
# Real gene function annotations with real citations
- gene_symbol: MUC1
  function: oncoprotein that blocks nuclear targeting of c-Abl
  supporting_text: MUC1 oncoprotein blocks nuclear targeting of c-Abl
  reference: PMID:16888623
- gene_symbol: MUC1
  function: involved in apoptotic response to DNA damage
  supporting_text: blocks nuclear targeting of c-Abl in the apoptotic response to DNA damage
  reference: PMID:16888623
- gene_symbol: MUC1
  function: interacts with c-Abl tyrosine kinase
  supporting_text: MUC1 oncoprotein blocks nuclear targeting of c-Abl
  reference: PMID:16888623
EOF
echo "✅ Created gene_functions.yaml with 3 annotations"
✅ Created gene_functions.yaml with 3 annotations
Part 3: Validate the Data (Success Case)¶
All these quotes come from the same paper (PMID:16888623). The tool will:
- Fetch the reference from PubMed (or use cached copy)
- Validate each supporting text quote
- Report any mismatches
%%bash
linkml-reference-validator validate data \
  gene_functions.yaml \
  --schema schema.yaml \
  --target-class GeneFunction
echo "✅ All validations passed!"
Validating gene_functions.yaml against schema schema.yaml
Cache directory: references_cache
Validation Summary:
Total checks: 0
All validations passed!
✅ All validations passed!
Part 4: Create Data with Errors¶
Let's create data where some supporting text doesn't match the references:
%%bash
cat > bad_annotations.yaml << 'EOF'
- gene_symbol: MUC1
  function: activates JAK-STAT signaling
  supporting_text: MUC1 activates the JAK-STAT pathway
  reference: PMID:16888623
  # This text does NOT appear in PMID:16888623
- gene_symbol: MUC1
  function: suppresses immune response
  supporting_text: MUC1 inhibits T cell activation
  reference: PMID:16888623
  # This text also does NOT appear in the paper
EOF
echo "✅ Created bad_annotations.yaml with intentional errors"
✅ Created bad_annotations.yaml with intentional errors
Part 5: Validate Invalid Data (Failure Cases)¶
%%bash
linkml-reference-validator validate data \
  bad_annotations.yaml \
  --schema schema.yaml \
  --target-class GeneFunction \
  || echo "❌ Validation failed as expected - supporting text not found"
Validating bad_annotations.yaml against schema schema.yaml
Cache directory: references_cache
Validation Issues (2):
[ERROR] Text part not found as substring: 'MUC1 activates the JAK-STAT pathway'
Location: supporting_text
[ERROR] Text part not found as substring: 'MUC1 inhibits T cell activation'
Location: supporting_text
Validation Summary:
Total checks: 2
Issues found: 2
❌ Validation failed as expected - supporting text not found
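The error messages say each quote was "not found as substring". A rough way to picture that check is a fixed-string search over the reference text. The sketch below is an illustration under assumptions, not the validator's actual implementation: quote_in_reference is a hypothetical helper, and the only normalization it applies is collapsing whitespace. The sample text is the title of PMID:16888623 as it appears in the cached reference.

```shell
# Hypothetical helper: succeed if the quote occurs verbatim in the reference text.
# Usage: quote_in_reference QUOTE FILE
quote_in_reference() {
  # Collapse whitespace runs (including line breaks) to single spaces,
  # then search for the quote as a fixed string, not a regular expression.
  tr -s '[:space:]' ' ' < "$2" | grep -qF -- "$1"
}

# Sample reference text: the paper title, wrapped over two lines.
ref=$(mktemp)
printf '%s\n' 'MUC1 oncoprotein blocks nuclear targeting of c-Abl' \
  'in the apoptotic response to DNA damage.' > "$ref"

for quote in \
  'blocks nuclear targeting of c-Abl in the apoptotic response to DNA damage' \
  'MUC1 activates the JAK-STAT pathway'
do
  if quote_in_reference "$quote" "$ref"; then
    echo "found:     $quote"
  else
    echo "not found: $quote"
  fi
done
rm -f "$ref"
```

The first quote spans the line break in the sample and is still found because whitespace is collapsed first; the second quote is the invented claim from bad_annotations.yaml and is reported as not found.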
Part 6: Using Editorial Notes and Ellipsis in Data¶
The same [...] and ... syntax works in data files:
%%bash
cat > annotations_with_edits.yaml << 'EOF'
- gene_symbol: MUC1
  function: oncoprotein blocking c-Abl nuclear targeting
  supporting_text: MUC1 [mucin 1] oncoprotein blocks nuclear targeting of c-Abl
  reference: PMID:16888623
  # Editorial note [mucin 1] is ignored during validation
- gene_symbol: MUC1
  function: involved in apoptosis and DNA damage response
  supporting_text: MUC1 oncoprotein ... apoptotic response to DNA damage
  reference: PMID:16888623
  # Ellipsis allows omitting middle text
- gene_symbol: MUC1
  function: blocks c-Abl function
  supporting_text: MUC1 [an oncoprotein] blocks nuclear targeting of c-Abl [a tyrosine kinase]
  reference: PMID:16888623
  # Multiple editorial notes work too
EOF
echo "✅ Created annotations_with_edits.yaml"
✅ Created annotations_with_edits.yaml
%%bash
linkml-reference-validator validate data \
  annotations_with_edits.yaml \
  --schema schema.yaml \
  --target-class GeneFunction
echo "✅ Editorial notes and ellipsis handled correctly!"
Validating annotations_with_edits.yaml against schema schema.yaml
Cache directory: references_cache
Validation Issues (1):
[ERROR] Text part not found as substring: 'MUC1 blocks nuclear targeting of c-Abl'
Location: supporting_text
Validation Summary:
Total checks: 1
Issues found: 1
✅ Editorial notes and ellipsis handled correctly!
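Note that the run above reports one issue even though the final message is printed. The error shows the third quote after its bracketed notes were removed, 'MUC1 blocks nuclear targeting of c-Abl', and that string lacks the word "oncoprotein" that follows "MUC1" in the paper's title. The sketch below mimics that preprocessing as far as it can be inferred from the output: drop [...] notes, split on ..., and tidy whitespace, leaving parts that would each be searched for separately. quote_parts is a hypothetical helper, and the validator's real normalization may differ.

```shell
# Hypothetical helper: print one line per text part of a quote.
# Steps: remove editorial notes such as [mucin 1], split on the ellipsis
# marker, then collapse whitespace, trim both ends, and drop empty parts.
quote_parts() {
  printf '%s\n' "$1" |
    sed 's/\[[^]]*]//g' |
    awk '{ n = split($0, parts, /\.\.\./); for (i = 1; i <= n; i++) print parts[i] }' |
    sed -e 's/[[:space:]][[:space:]]*/ /g' -e 's/^ //' -e 's/ $//' -e '/^$/d'
}

quote_parts 'MUC1 [an oncoprotein] blocks nuclear targeting of c-Abl [a tyrosine kinase]'
quote_parts 'MUC1 oncoprotein ... apoptotic response to DNA damage'
```

The first call prints a single part that matches the string in the error message; the second prints two parts, one on each side of the ellipsis. When writing quotes, a bracketed note should therefore add to the original wording rather than stand in for a word that appears in the source.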
Part 7: Verbose Output¶
Use --verbose to see detailed information about each validation:
%%bash
linkml-reference-validator validate data \
  gene_functions.yaml \
  --schema schema.yaml \
  --target-class GeneFunction \
  --verbose 2>&1 | head -40
Validating gene_functions.yaml against schema schema.yaml
Cache directory: references_cache
INFO:linkml_reference_validator.plugins.reference_validation_plugin:ReferenceValidationPlugin initialized
INFO:linkml_reference_validator.plugins.reference_validation_plugin:ReferenceValidationPlugin validation complete
INFO:linkml_reference_validator.plugins.reference_validation_plugin:ReferenceValidationPlugin initialized
INFO:linkml_reference_validator.plugins.reference_validation_plugin:ReferenceValidationPlugin validation complete
INFO:linkml_reference_validator.plugins.reference_validation_plugin:ReferenceValidationPlugin initialized
INFO:linkml_reference_validator.plugins.reference_validation_plugin:ReferenceValidationPlugin validation complete
Validation Summary:
Total checks: 0
All validations passed!
Part 9: Integration with LinkML Schema Validation¶
The reference validator is a LinkML plugin, so it works alongside other validation features.
Let's create a schema with additional constraints:
%%bash
cat > strict_schema.yaml << 'EOF'
id: https://example.org/strict-gene-functions
name: strict-gene-functions
prefixes:
  linkml: https://w3id.org/linkml/
  example: https://example.org/
default_prefix: example
default_range: string
classes:
  GeneFunction:
    attributes:
      gene_symbol:
        identifier: true
        pattern: "^[A-Z0-9]+$"  # Must be uppercase alphanumeric
      function:
        required: true
        # minimum_value applies to numeric values, so string length is enforced with a pattern
        pattern: "^.{10,}$"  # At least 10 characters
      supporting_text:
        slot_uri: linkml:excerpt
        required: true
      reference:
        slot_uri: linkml:authoritative_reference
        required: true
        pattern: "^PMID:[0-9]+$"  # Must match PMID format
      confidence:
        range: float
        minimum_value: 0.0
        maximum_value: 1.0
EOF
echo "✅ Created strict_schema.yaml with validation constraints"
✅ Created strict_schema.yaml with validation constraints
%%bash
cat > strict_data.yaml << 'EOF'
- gene_symbol: MUC1
  function: blocks nuclear targeting of c-Abl
  supporting_text: MUC1 oncoprotein blocks nuclear targeting of c-Abl
  reference: PMID:16888623
  confidence: 0.95
EOF
echo "✅ Created strict_data.yaml"
✅ Created strict_data.yaml
%%bash
# Validates BOTH the supporting text AND schema constraints
linkml-reference-validator validate data \
  strict_data.yaml \
  --schema strict_schema.yaml \
  --target-class GeneFunction
echo "✅ All validations (reference text + schema) passed!"
Validating strict_data.yaml against schema strict_schema.yaml
Cache directory: references_cache
Validation Summary:
Total checks: 0
All validations passed!
✅ All validations (reference text + schema) passed!
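The pattern constraints in strict_schema.yaml are ordinary regular expressions, so candidate values can be tried out before they go into a data file. This sketch uses grep -E as a stand-in; LinkML evaluates patterns with its own regex engine, so treat the result as an approximation that should hold for simple patterns like these. matches_pattern is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the value matches the extended regex.
# Usage: matches_pattern VALUE PATTERN
matches_pattern() {
  printf '%s\n' "$1" | grep -Eq -- "$2"
}

# Try a few candidate reference IDs against the PMID pattern from the schema.
for candidate in 'PMID:16888623' 'pmid:16888623' 'PMID:'; do
  if matches_pattern "$candidate" '^PMID:[0-9]+$'; then
    echo "ok       $candidate"
  else
    echo "rejected $candidate"
  fi
done
```

Only the first candidate is accepted: the second uses a lowercase prefix and the third has no digits after the colon.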
Part 10: Batch Validation¶
You can validate multiple files in a loop:
%%bash
# Validate multiple data files
echo "Validating all annotation files..."
for file in gene_functions.yaml annotations_with_edits.yaml; do
  # printf interprets \n; plain echo in bash would print it literally
  printf '\nValidating %s...\n' "$file"
  linkml-reference-validator validate data \
    "$file" \
    --schema schema.yaml \
    --target-class GeneFunction | head -5
done
printf '\n✅ All files validated!\n'
Validating all annotation files...

Validating gene_functions.yaml...
Validating gene_functions.yaml against schema schema.yaml
Cache directory: references_cache
Validation Summary:
Total checks: 0

Validating annotations_with_edits.yaml...
Validating annotations_with_edits.yaml against schema schema.yaml
Cache directory: references_cache
Validation Issues (1):
[ERROR] Text part not found as substring: 'MUC1 blocks nuclear targeting of c-Abl'

✅ All files validated!
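The loop above always ends with the success message, even when a file has issues. Since the command exits with a non-zero status on failure (Part 5 relies on this with ||), a batch script can count failures instead. This is a minimal sketch, assuming the same schema and target class as above; run_validator and validate_all are hypothetical helper names.

```shell
# Hypothetical wrapper around the CLI call used throughout this tutorial.
run_validator() {
  linkml-reference-validator validate data "$1" \
    --schema schema.yaml \
    --target-class GeneFunction
}

# Validate every file given as an argument; succeed only if all of them pass.
validate_all() {
  failed=0
  for file in "$@"; do
    if run_validator "$file" > /dev/null 2>&1; then
      echo "PASS $file"
    else
      echo "FAIL $file"
      failed=$((failed + 1))
    fi
  done
  echo "$failed file(s) failed"
  [ "$failed" -eq 0 ]
}

# Usage: validate_all gene_functions.yaml annotations_with_edits.yaml
```

Because validate_all returns a non-zero status when any file fails, it can gate a CI step or a pre-commit hook directly. Drop the redirection to /dev/null to see the full report for each file.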
Part 11: Understanding the Cache¶
All fetched references are cached in references_cache/:
%%bash
# List all cached references
echo "Cached references:"
ls -lh references_cache/
Cached references:
total 24
-rw-r--r-- 1 cjm staff 2.1K Nov 16 16:32 PMID_16888623.md
-rw-r--r-- 1 cjm staff 2.4K Nov 16 17:08 PMID_21258405.md
-rw-r--r-- 1 cjm staff 1.7K Nov 16 14:11 PMID_9974395.md
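Judging from the listing above, each cache file is named after its reference ID, with the colon replaced by an underscore and a .md extension. Under that assumption (inferred from this listing, not from the tool's documentation), a script can check whether a reference is already cached without going through the CLI. cache_file_for is a hypothetical helper.

```shell
# Hypothetical helper: map a reference ID to its presumed cache file name.
cache_file_for() {
  printf '%s.md\n' "$(printf '%s' "$1" | tr ':' '_')"
}

cache_dir=references_cache
for ref_id in PMID:16888623 PMID:21258405; do
  path="$cache_dir/$(cache_file_for "$ref_id")"
  if [ -f "$path" ]; then
    echo "cached:     $ref_id -> $path"
  else
    echo "not cached: $ref_id"
  fi
done
```

If the naming scheme ever changes, the cache lookup subcommand shown in the next cell is the safer way to resolve a path.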
%%bash
# Show structure of a cached reference
echo "Structure of cached reference PMID:16888623:"
cache_path=$(linkml-reference-validator cache lookup PMID:16888623)
head -25 "$cache_path"
Structure of cached reference PMID:16888623:
---
reference_id: PMID:16888623
title: MUC1 oncoprotein blocks nuclear targeting of c-Abl in the apoptotic response to DNA damage.
authors:
- Raina D
- Ahmad R
- Kumar S
- Ren J
- Yoshida K
- Kharbanda S
- Kufe D
journal: EMBO J
year: '2006'
doi: 10.1038/sj.emboj.7601263
content_type: abstract_only
---
# MUC1 oncoprotein blocks nuclear targeting of c-Abl in the apoptotic response to DNA damage.
**Authors:** Raina D, Ahmad R, Kumar S, Ren J, Yoshida K, Kharbanda S, Kufe D
**Journal:** EMBO J (2006)
**DOI:** [10.1038/sj.emboj.7601263](https://doi.org/10.1038/sj.emboj.7601263)
## Content
1. EMBO J. 2006 Aug 23;25(16):3774-83. doi: 10.1038/sj.emboj.7601263. Epub 2006
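Because each cached file starts with a front matter block between two --- lines, individual fields are easy to pull out with standard tools. The sketch below reads one field from that block. front_matter_field is a hypothetical helper, it only handles simple one-line key: value fields, and the sample file reproduces a few fields from the output above.

```shell
# Hypothetical helper: print the value of a one-line front matter field.
# Usage: front_matter_field KEY FILE
front_matter_field() {
  awk -v key="$1" '
    /^---[ \t]*$/ { fences++; next }    # count the --- delimiter lines
    fences >= 2 { exit }                # stop after the closing delimiter
    fences == 1 && index($0, key ": ") == 1 {
      print substr($0, length(key) + 3)
    }
  ' "$2"
}

# Sample file with a subset of the fields shown in the cached reference.
sample=$(mktemp)
cat > "$sample" << 'EOF'
---
reference_id: PMID:16888623
journal: EMBO J
year: '2006'
content_type: abstract_only
---
# MUC1 oncoprotein blocks nuclear targeting of c-Abl in the apoptotic response to DNA damage.
EOF

front_matter_field journal "$sample"
front_matter_field content_type "$sample"
rm -f "$sample"
```

The content_type field is worth checking in practice: a value of abstract_only means quotes taken from the body of a paper cannot be found in the cached text.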
CLI Help¶
%%bash
linkml-reference-validator validate data --help
Usage: linkml-reference-validator validate data [OPTIONS] DATA_FILE

Validate supporting text in data against references.

This command validates that quoted text (supporting_text) in your data actually appears in the referenced publications using deterministic substring matching.

Examples:
  linkml-reference-validator validate data data.yaml --schema schema.yaml
  linkml-reference-validator validate data data.yaml --schema schema.yaml --target-class Statement --verbose

Arguments:
  * data_file  PATH  Path to data file (YAML/JSON) [required]

Options:
  * --schema        -s  PATH  Path to LinkML schema file [required]
    --target-class  -t  TEXT  Target class to validate
    --cache-dir     -c  PATH  Directory for caching references (default: references_cache)
    --verbose       -v        Verbose output with detailed logging
    --help                    Show this message and exit.
Summary¶
In this tutorial, we learned:
- Schema design: Use linkml:excerpt and linkml:authoritative_reference slot URIs
- Batch validation: Validate all supporting text in data files
- Editorial notes: [...] for clarifications in data
- Ellipsis: ... for omitted text in quotes
- Multiple references: Tool handles different PMIDs automatically
- Schema integration: Works with LinkML validation constraints
- Caching: References cached automatically for reuse
Next Steps¶
- Tutorial 1: Getting started with validate text
- Tutorial 3: Python API for programmatic usage
- Full Documentation
Cleanup¶
%%bash
# Clean up example files
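# Name each file explicitly; a bare *.yaml would also delete unrelated YAML files in this directory
rm -f schema.yaml strict_schema.yaml gene_functions.yaml bad_annotations.yaml \
  annotations_with_edits.yaml strict_data.yaml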
echo "✅ Cleaned up example files"
✅ Cleaned up example files