Getting Started with linkml-reference-validator
This tutorial demonstrates how to use the linkml-reference-validator CLI to validate that supporting text quotes actually appear in their cited references.
What is linkml-reference-validator?
linkml-reference-validator validates that:
- Quoted text exists: Supporting text claims actually appear in the referenced publication
- Accurate citations: References are properly cited and accessible
- Deterministic matching: Uses substring matching (not fuzzy/AI-based)
The tool fetches publications from PubMed and PMC and caches them locally for offline use.
Installation
First, make sure linkml-reference-validator is installed:
%%bash
# Check if installed
linkml-reference-validator --help > /dev/null && echo "✅ linkml-reference-validator is installed" || echo "❌ Install with: pip install linkml-reference-validator"
✅ linkml-reference-validator is installed
Part 1: Basic Validation
Example 1: Successful Validation
Start by validating a quote that does appear in the cited reference:
%%bash
# This quote appears in the referenced paper
linkml-reference-validator validate text \
"MUC1 oncoprotein blocks nuclear targeting of c-Abl" \
PMID:16888623 \
&& echo "✅ Quote validated!"
Validating text against PMID:16888623...
  Text: MUC1 oncoprotein blocks nuclear targeting of c-Abl
Result:
  Valid: True
  Message: Supporting text validated successfully in PMID:16888623
  Matched text: MUC1 oncoprotein blocks nuclear targeting of c-Abl...
✅ Quote validated!
Note: The first time you run this, it fetches the reference from PubMed and caches it locally in references_cache/. Subsequent validations use the cached copy, making them much faster!
Example 2: Validation Failure
What happens when the quote doesn't appear in the reference?
%%bash
# This text does NOT appear in PMID:16888623
linkml-reference-validator validate text \
"MUC1 activates the JAK-STAT pathway" \
PMID:16888623 \
|| echo "❌ Validation failed - text not found in reference"
Validating text against PMID:16888623...
  Text: MUC1 activates the JAK-STAT pathway
Result:
  Valid: False
  Message: Text part not found as substring: 'MUC1 activates the JAK-STAT pathway'
❌ Validation failed - text not found in reference
Example 3: Partial Quotes
You can validate partial quotes from the reference:
%%bash
# Just a portion of the text
linkml-reference-validator validate text \
"blocks nuclear targeting" \
PMID:16888623 \
&& echo "✅ Partial quote validated!"
Validating text against PMID:16888623...
  Text: blocks nuclear targeting
Result:
  Valid: True
  Message: Supporting text validated successfully in PMID:16888623
  Matched text: blocks nuclear targeting...
✅ Partial quote validated!
Part 2: Editorial Notes with [...]
Use square brackets for editorial clarifications that should be ignored during matching.
For example, if you want to clarify what "MUC1" stands for in your quote:
%%bash
# Editorial clarification - brackets are ignored during matching
linkml-reference-validator validate text \
'MUC1 [mucin 1] oncoprotein blocks nuclear targeting of c-Abl' \
PMID:16888623 \
&& echo "✅ Editorial note ignored during matching!"
Validating text against PMID:16888623...
  Text: MUC1 [mucin 1] oncoprotein blocks nuclear targeting of c-Abl
Result:
  Valid: True
  Message: Supporting text validated successfully in PMID:16888623
  Matched text: MUC1 oncoprotein blocks nuclear targeting of c-Abl...
✅ Editorial note ignored during matching!
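%%bash
# Multiple editorial notes. Brackets are removed, not substituted into the
# quote, so the text left over must still be a contiguous quote.
linkml-reference-validator validate text \
'MUC1 [an oncoprotein] blocks nuclear targeting of c-Abl [a tyrosine kinase]' \
PMID:16888623 \
|| echo "❌ Validation failed - remaining text is not a contiguous quote"
Validating text against PMID:16888623...
  Text: MUC1 [an oncoprotein] blocks nuclear targeting of c-Abl [a tyrosine kinase]
Result:
  Valid: False
  Message: Text part not found as substring: 'MUC1 blocks nuclear targeting of c-Abl'
❌ Validation failed - remaining text is not a contiguous quote
Note: This validation fails. With the bracketed notes removed, the remaining text is "MUC1 blocks nuclear targeting of c-Abl", but the reference reads "MUC1 oncoprotein blocks nuclear targeting of c-Abl". Place editorial notes alongside the quoted words, not in place of them.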
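To preview what text is left for matching once editorial notes are removed, a rough stand-in for the bracket-stripping step can help. This is a minimal sketch using sed, not the validator's own implementation: `strip_notes` is a hypothetical helper, and it assumes each note is dropped together with the whitespace in front of it.

```bash
# Hypothetical helper (not part of the CLI): remove every [bracketed note]
# from the quote, along with any whitespace immediately before it.
strip_notes() {
  printf '%s\n' "$1" | sed -E 's/[[:space:]]*\[[^]]*\]//g'
}

# The note sits next to the quoted words, so the quote survives intact.
strip_notes 'MUC1 [mucin 1] oncoprotein blocks nuclear targeting of c-Abl'
# prints: MUC1 oncoprotein blocks nuclear targeting of c-Abl

# The note replaces a word of the source text, so the remainder has a gap.
strip_notes 'MUC1 [an oncoprotein] blocks nuclear targeting of c-Abl [a tyrosine kinase]'
# prints: MUC1 blocks nuclear targeting of c-Abl
```

The second result is the same string the validator reports as not found in the example above.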
Part 3: Ellipsis for Omitted Text (...)
Use ... to indicate omitted text between two parts of a quote. Both parts must be found in the reference.
%%bash
# Multi-part quote with ellipsis
linkml-reference-validator validate text \
"MUC1 oncoprotein ... c-Abl in the apoptotic response" \
PMID:16888623 \
&& echo "✅ Both parts of ellipsis quote found!"
Validating text against PMID:16888623...
  Text: MUC1 oncoprotein ... c-Abl in the apoptotic response
Result:
  Valid: True
  Message: Supporting text validated successfully in PMID:16888623
  Matched text: MUC1 oncoprotein ... c-Abl in the apoptotic response...
✅ Both parts of ellipsis quote found!
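The "both parts must be found" rule can be illustrated with standard tools. This is a minimal sketch under stated assumptions, not the validator's actual algorithm: `all_parts_found` is a hypothetical helper, it only checks that each part is present (it does not check their order), and it skips the normalization the real tool applies.

```bash
# Hypothetical helper: succeed only if every "..."-separated part of the
# quote ($1) occurs as a literal substring of the reference text ($2).
all_parts_found() {
  printf '%s\n' "$1" |
    awk -F'[.][.][.]' '{ for (i = 1; i <= NF; i++) print $i }' |
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |
    while IFS= read -r part; do
      # Ignore empty parts, e.g. from a leading or trailing ellipsis.
      [ -z "$part" ] && continue
      # grep -F treats the part as a fixed string, not a regular expression.
      printf '%s\n' "$2" | grep -qF -- "$part" || exit 1
    done
}

# Reference text: the article title shown elsewhere in this tutorial.
reference='MUC1 oncoprotein blocks nuclear targeting of c-Abl in the apoptotic response to DNA damage.'

all_parts_found 'MUC1 oncoprotein ... c-Abl in the apoptotic response' "$reference" \
  && echo "all parts found"
```

Using fixed-string matching keeps the check deterministic: characters such as `.` or `[` in a quote are compared literally.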
Part 5: Text Normalization
Before matching, text is normalized:
- Lowercased
- Punctuation removed
- Extra whitespace collapsed
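This means differences in case, surrounding punctuation, and spacing do not affect matching. Punctuation inside a word is a different matter: in the example below, hyphens have been added to "MUC-1" and "NUCLEAR-TARGETING", so the normalized quote no longer lines up with the reference's "MUC1" and "nuclear targeting", and validation fails:
%%bash
# Case changes and the trailing "!!!" are normalized away, but the added
# hyphens alter the words themselves, so this quote is not found
linkml-reference-validator validate text \
"MUC-1 ONCOPROTEIN blocks NUCLEAR-TARGETING!!!" \
PMID:16888623 \
|| echo "❌ Validation failed - normalized text not found in reference"
Validating text against PMID:16888623...
  Text: MUC-1 ONCOPROTEIN blocks NUCLEAR-TARGETING!!!
Result:
  Valid: False
  Message: Text part not found as substring: 'MUC-1 ONCOPROTEIN blocks NUCLEAR-TARGETING!!!'
❌ Validation failed - normalized text not found in reference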
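The three normalization steps listed above can be approximated with standard tools, which is handy for previewing how a quote might look before matching. This is a minimal sketch, not the validator's actual code: `normalize` is a hypothetical helper, and it assumes punctuation is deleted outright instead of being replaced with a space (the tutorial does not say which).

```bash
# Hypothetical helper: lowercase, delete punctuation, collapse runs of
# spaces and tabs, then trim a leading or trailing space.
normalize() {
  printf '%s\n' "$1" |
    tr '[:upper:]' '[:lower:]' |
    tr -d '[:punct:]' |
    tr -s '[:blank:]' ' ' |
    sed -e 's/^ //' -e 's/ $//'
}

normalize 'MUC1 Oncoprotein  blocks nuclear targeting.'
# prints: muc1 oncoprotein blocks nuclear targeting

normalize 'MUC-1 ONCOPROTEIN blocks NUCLEAR-TARGETING!!!'
# prints: muc1 oncoprotein blocks nucleartargeting
```

Under this assumption, a hyphen added inside "NUCLEAR-TARGETING" fuses two words into one, so the normalized quote would no longer be a substring of text containing "nuclear targeting".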
Part 6: Pre-caching References with cache reference
You can pre-fetch and cache references for offline use:
%%bash
# Pre-cache a reference (shows metadata)
linkml-reference-validator cache reference PMID:16888623
Fetching PMID:16888623...
Successfully cached PMID:16888623
  Title: MUC1 oncoprotein blocks nuclear targeting of c-Abl in the apoptotic response to DNA damage.
  Authors: Raina D, Ahmad R, Kumar S
  Content type: abstract_only
  Content length: 1569 characters
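To pre-populate the cache for several references at once, the same command can be run in a loop. This is a minimal sketch: `precache_from_file` and the `VALIDATOR` variable are hypothetical names introduced here, and the input is assumed to be a text file with one reference ID per line.

```bash
# Assumption: VALIDATOR names the CLI executable. Override it for a dry run,
# for example VALIDATOR=echo.
VALIDATOR=${VALIDATOR:-linkml-reference-validator}

# Hypothetical helper: cache every reference ID listed in the file "$1".
# Blank lines and lines starting with '#' are skipped. Returns non-zero if
# any reference could not be cached.
precache_from_file() {
  failed=0
  while IFS= read -r ref_id || [ -n "$ref_id" ]; do
    case $ref_id in
      ''|'#'*) continue ;;
    esac
    # </dev/null keeps the command from consuming the rest of the ID list.
    "$VALIDATOR" cache reference "$ref_id" < /dev/null || failed=$((failed + 1))
  done < "$1"
  [ "$failed" -eq 0 ]
}

# Example usage (refs.txt is a hypothetical file name):
#   printf 'PMID:16888623\nPMID:21258405\n' > refs.txt
#   precache_from_file refs.txt
```

Counting failures instead of stopping at the first one means a single unreachable reference does not prevent the rest from being cached.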
Part 7: Verbose Output
Use --verbose to see detailed validation information:
%%bash
# Verbose output shows fetching and matching details
linkml-reference-validator validate text \
"MUC1 oncoprotein blocks nuclear targeting" \
PMID:16888623 \
--verbose
Validating text against PMID:16888623...
  Text: MUC1 oncoprotein blocks nuclear targeting
Result:
  Valid: True
  Message: Supporting text validated successfully in PMID:16888623
  Matched text: MUC1 oncoprotein blocks nuclear targeting...
Part 8: Using in Shell Scripts
The CLI exits with status 0 when validation succeeds and with a non-zero status when it fails, so it can be used directly in shell conditionals:
%%bash
# Example shell script usage
if linkml-reference-validator validate text \
"MUC1 oncoprotein blocks nuclear targeting" \
PMID:16888623 > /dev/null 2>&1; then
echo "✅ Quote verified successfully"
else
echo "❌ Quote validation failed"
exit 1
fi
✅ Quote verified successfully
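The same exit-code check extends naturally to a batch of quotes. This is a minimal sketch under stated assumptions: `validate_tsv` and `VALIDATOR` are hypothetical names, and the input is assumed to be a tab-separated file with a quote in the first column and a reference ID in the second.

```bash
# Assumption: VALIDATOR names the CLI executable (overridable for testing).
VALIDATOR=${VALIDATOR:-linkml-reference-validator}

# Hypothetical helper: validate each "quote<TAB>reference" line of the file
# "$1", print one status line per quote, and return non-zero if any failed.
validate_tsv() {
  failures=0
  tab=$(printf '\t')
  while IFS=$tab read -r quote ref_id || [ -n "$quote" ]; do
    [ -z "$quote" ] && continue
    # Only the exit status is used; the CLI's own output is discarded.
    if "$VALIDATOR" validate text "$quote" "$ref_id" > /dev/null 2>&1 < /dev/null; then
      echo "OK   $ref_id"
    else
      echo "FAIL $ref_id: $quote"
      failures=$((failures + 1))
    fi
  done < "$1"
  [ "$failures" -eq 0 ]
}

# Example usage (quotes.tsv is a hypothetical file name):
#   validate_tsv quotes.tsv || exit 1
```

Because the function's own exit status reflects the overall result, it can be dropped into a CI step or a pre-commit hook the same way as the single-quote check above.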
Part 9: Understanding the Cache
References are cached in references_cache/ by default. Let's see what's in there:
%%bash
# List cached references
ls -lh references_cache/ | head -10
total 24
-rw-r--r-- 1 cjm staff 2.1K Nov 16 16:32 PMID_16888623.md
-rw-r--r-- 1 cjm staff 2.4K Nov 16 17:08 PMID_21258405.md
-rw-r--r-- 1 cjm staff 1.7K Nov 16 14:11 PMID_9974395.md
%%bash
# Peek at a cached reference
cache_path=$(linkml-reference-validator cache lookup PMID:16888623)
head -20 "$cache_path"
---
reference_id: PMID:16888623
title: MUC1 oncoprotein blocks nuclear targeting of c-Abl in the apoptotic response to DNA damage.
authors:
- Raina D
- Ahmad R
- Kumar S
- Ren J
- Yoshida K
- Kharbanda S
- Kufe D
journal: EMBO J
year: '2006'
doi: 10.1038/sj.emboj.7601263
content_type: abstract_only
---
# MUC1 oncoprotein blocks nuclear targeting of c-Abl in the apoptotic response to DNA damage.
**Authors:** Raina D, Ahmad R, Kumar S, Ren J, Yoshida K, Kharbanda S, Kufe D
**Journal:** EMBO J (2006)
The cache files are in markdown format with YAML frontmatter, making them human-readable!
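Because the frontmatter is plain YAML, individual fields can be pulled out with standard tools. This is a minimal sketch, assuming the layout shown above (simple `key: value` lines between the first two `---` markers); `frontmatter_field` is a hypothetical helper, not part of the CLI, and it does not handle nested or multi-line values such as the authors list.

```bash
# Hypothetical helper: print the value of a top-level scalar key ($2) from
# the YAML frontmatter of a cached reference file ($1).
frontmatter_field() {
  awk -v key="$2" '
    /^---[ \t]*$/ { fences++; next }          # count the --- delimiters
    fences == 1 && index($0, key ": ") == 1 { # "key: value" inside frontmatter
      print substr($0, length(key) + 3)
      exit
    }
    fences >= 2 { exit }                      # stop once the frontmatter ends
  ' "$1"
}

# Example usage, with cache_path obtained as in the cell above:
#   frontmatter_field "$cache_path" content_type
```

For the cached file shown above, the `content_type` field reads `abstract_only`, which suggests that only the abstract is available for matching quotes against that reference.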
CLI Help
Get help for any command:
%%bash
linkml-reference-validator --help
Usage: linkml-reference-validator [OPTIONS] COMMAND [ARGS]...

Validation of supporting text from references and publications

Options:
  --install-completion    Install completion for the current shell.
  --show-completion       Show completion for the current shell, to copy it or customize the installation.
  --help                  Show this message and exit.

Commands:
  validate    Validate supporting text against references
  cache       Manage reference cache
%%bash
linkml-reference-validator validate --help
Usage: linkml-reference-validator validate [OPTIONS] COMMAND [ARGS]...

Validate supporting text against references

Options:
  --help    Show this message and exit.

Commands:
  text    Validate a single supporting text quote against a reference.
  data    Validate supporting text in data against references.
%%bash
linkml-reference-validator validate text --help
Usage: linkml-reference-validator validate text [OPTIONS] TEXT REFERENCE_ID

Validate a single supporting text quote against a reference.
Uses deterministic substring matching. Supports [...] for editorial notes and ... for omitted text.

Examples:
  linkml-reference-validator validate text "protein functions in cells" PMID:12345678
  linkml-reference-validator validate text "protein [X] functions ... cells" PMID:12345678 --verbose

Arguments:
  * text            TEXT  Supporting text to validate [required]
  * reference_id    TEXT  Reference ID (e.g., PMID:12345678) [required]

Options:
  --cache-dir  -c  PATH  Directory for caching references (default: references_cache)
  --verbose    -v        Verbose output with detailed logging
  --help                 Show this message and exit.
%%bash
linkml-reference-validator cache reference --help
Usage: linkml-reference-validator cache reference [OPTIONS] REFERENCE_ID

Cache a reference for offline use.
Downloads and caches the full text of a reference for offline validation. Useful for pre-populating the cache or ensuring a reference is available.

Examples:
  linkml-reference-validator cache reference PMID:12345678
  linkml-reference-validator cache reference PMID:12345678 --force --verbose

Arguments:
  * reference_id    TEXT  Reference ID (e.g., PMID:12345678) [required]

Options:
  --cache-dir  -c  PATH  Directory for caching references (default: references_cache)
  --force      -f        Force operation (e.g., re-fetch even if cached)
  --verbose    -v        Verbose output with detailed logging
  --help                 Show this message and exit.
Summary
In this tutorial, we learned:
- Basic validation: validate text "quote" PMID:12345
- Editorial notes: Use [...] for clarifications
- Ellipsis: Use ... for omitted text
- Normalization: Text is lowercased, stripped of punctuation, and whitespace-collapsed before matching
- Caching: References cached automatically in references_cache/
- PMC support: Full-text articles available
Next Steps
- Tutorial 2: Advanced usage with data files and LinkML schemas (validate data)
- Tutorial 3: Python API for programmatic usage
- Full Documentation