How It Works
Understanding the validation process and design decisions.
Overview
linkml-reference-validator validates that quoted text (supporting text) actually appears in cited references. It uses deterministic substring matching rather than fuzzy or AI-based approaches.
The Validation Process
1. Text Normalization
Before matching, both the supporting text and reference content are normalized:
- Lowercased: "MUC1" → "muc1"
- Punctuation removed: "c-Abl" → "c abl"
- Whitespace collapsed: Multiple spaces become a single space
- Editorial notes removed: "[mucin 1]" → ""
Example:
Original: "MUC1 [mucin 1] oncoprotein blocks c-Abl!!!"
Normalized: "muc1 oncoprotein blocks c abl"
This lets a quote match despite differences in case, punctuation, and spacing, while still requiring the words themselves to match exactly.
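The four normalization rules above can be sketched in a few lines of Python. This is a minimal illustration, not the validator's actual code: the function name `normalize` and the regular expressions are assumptions chosen to reproduce the example shown above.

```python
import re


def normalize(text: str) -> str:
    """Normalize text for matching (hypothetical helper, not the tool's API)."""
    # Editorial notes removed: "[mucin 1]" -> ""
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Lowercased: "MUC1" -> "muc1"
    text = text.lower()
    # Punctuation replaced by a space: "c-abl" -> "c abl"
    # (assumption: anything that is not a word character or whitespace)
    text = re.sub(r"[^\w\s]", " ", text)
    # Whitespace collapsed: runs of spaces become a single space
    return re.sub(r"\s+", " ", text).strip()


print(normalize("MUC1 [mucin 1] oncoprotein blocks c-Abl!!!"))
# muc1 oncoprotein blocks c abl
```

Editorial notes are stripped before punctuation is removed; otherwise the brackets would disappear and the note's words would be left behind in the quote.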
2. Substring Matching
After normalization, the validator checks if the supporting text appears as a substring in the reference content.
Simple case:
supporting_text = "MUC1 oncoprotein"
reference_content = "...The MUC1 oncoprotein blocks nuclear..."
# Match: "muc1 oncoprotein" found in normalized reference
3. Ellipsis Handling
When supporting text contains an ellipsis (...), the text is split at each ellipsis and every part is matched separately:
Supporting: "MUC1 oncoprotein ... nuclear targeting"
Parts: ["MUC1 oncoprotein", "nuclear targeting"]
# Both parts must exist in the reference
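Steps 2 and 3 can be combined into one small check. The sketch below assumes the normalization rules from step 1; `normalize` and `supports` are hypothetical names, and the check only tests that each part exists in the reference, as described above, without enforcing the order of the parts.

```python
import re

EDITORIAL_NOTE = re.compile(r"\[[^\]]*\]")


def normalize(text: str) -> str:
    """Lowercase, drop editorial notes and punctuation, collapse whitespace."""
    text = EDITORIAL_NOTE.sub(" ", text)
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def supports(supporting_text: str, reference_content: str) -> bool:
    """Return True if every part of the quote is a substring of the reference."""
    # Remove editorial notes, then split on "..." BEFORE normalizing:
    # normalization strips punctuation, which would erase the ellipsis.
    quote = EDITORIAL_NOTE.sub(" ", supporting_text)
    parts = [normalize(part) for part in quote.split("...")]
    parts = [part for part in parts if part]  # ignore empty fragments
    reference = normalize(reference_content)
    # An empty quote supports nothing; otherwise all parts must be present.
    return bool(parts) and all(part in reference for part in parts)


reference = "The MUC1 oncoprotein blocks nuclear targeting of c-Abl."
print(supports("MUC1 oncoprotein", reference))                        # True
print(supports("MUC1 oncoprotein ... nuclear targeting", reference))  # True
print(supports("MUC1 oncoprotein ... kinase activity", reference))    # False
```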
Why Deterministic Matching?
Not Fuzzy Matching
We explicitly avoid fuzzy/similarity matching because:
- Accuracy: No false positives from "close enough" matches
- Reproducibility: Same input always gives same result
- Explainability: Clear why something matched or didn't
- Trust: Critical for scientific accuracy
Not AI-Based
We don't use LLMs or semantic similarity because:
- Determinism: Results must be reproducible
- Verifiability: Humans can verify the match themselves
- No hallucinations: The text either exists or doesn't
- Simplicity: No model dependencies or API costs
Reference Fetching
PubMed (PMID)
For PMID:12345678:
- Queries NCBI E-utilities API
- Fetches abstract and metadata
- Parses XML response with BeautifulSoup
- Caches as markdown with YAML frontmatter
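As a rough illustration of the PubMed steps, the sketch below builds an E-utilities `efetch` URL and pulls the title and abstract out of the returned XML. It is not the validator's code: the helper names are hypothetical, and it parses with the standard library's `xml.etree.ElementTree` rather than BeautifulSoup so that the example has no third-party dependencies. The sample XML is a hand-written stand-in that uses the element names PubMed returns.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmid: str) -> str:
    """Build the E-utilities efetch URL for one PubMed record."""
    query = urlencode({"db": "pubmed", "id": pmid, "retmode": "xml"})
    return f"{EFETCH_URL}?{query}"


def parse_pubmed_xml(xml_text: str) -> dict:
    """Extract the title and abstract from a PubMed efetch response."""
    article = ET.fromstring(xml_text).find(".//Article")
    if article is None:
        raise ValueError("no <Article> element in response")
    title = "".join(article.find("ArticleTitle").itertext()).strip()
    # Structured abstracts contain several <AbstractText> elements.
    abstract = " ".join(
        "".join(node.itertext()).strip() for node in article.iter("AbstractText")
    )
    return {"title": title, "abstract": abstract}


# Hand-written sample so the example runs without network access.
SAMPLE = """<PubmedArticleSet><PubmedArticle><MedlineCitation>
<Article><ArticleTitle>Example title</ArticleTitle>
<Abstract><AbstractText>First part.</AbstractText>
<AbstractText>Second part.</AbstractText></Abstract>
</Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"""

print(efetch_url("16888623"))
print(parse_pubmed_xml(SAMPLE))
```

To fetch a real record, the URL can be passed to `urllib.request.urlopen` and the response body handed to `parse_pubmed_xml`.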
PubMed Central (PMC)
For PMC:12345:
- Queries PMC API for full-text XML
- Extracts all sections (abstract, introduction, methods, results, discussion)
- Provides more content than abstracts alone
- Also cached as markdown
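Section extraction from PMC full text could look roughly like the following. PMC serves articles as JATS XML, where the body is a series of `<sec>` elements with a `<title>` and `<p>` paragraphs; the function name `extract_sections` and the sample document are assumptions for illustration, and real articles are considerably messier (nested sections, tables, figures).

```python
import xml.etree.ElementTree as ET


def extract_sections(jats_xml: str) -> list[tuple[str, str]]:
    """Return (section title, text) pairs from a JATS article."""

    def text_of(element) -> str:
        # Join all nested text and collapse whitespace.
        return " ".join("".join(element.itertext()).split())

    root = ET.fromstring(jats_xml)
    sections = []
    abstract = root.find(".//abstract")
    if abstract is not None:
        sections.append(("Abstract", text_of(abstract)))
    body = root.find(".//body")
    if body is not None:
        for sec in body.findall("sec"):  # top-level sections only
            title = sec.findtext("title", default="Untitled").strip()
            paragraphs = [text_of(p) for p in sec.iter("p")]
            sections.append((title, " ".join(paragraphs)))
    return sections


# Hand-written sample so the example runs without network access.
SAMPLE = """<article><front><article-meta>
<abstract><p>Short summary.</p></abstract></article-meta></front>
<body><sec><title>Introduction</title><p>Background text.</p></sec>
<sec><title>Results</title><p>Finding one.</p><p>Finding two.</p></sec>
</body></article>"""

for title, text in extract_sections(SAMPLE):
    print(f"{title}: {text}")
```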
Caching
References are cached in references_cache/ as markdown files:
references_cache/
  PMID_16888623.md
  PMC_3458566.md
Cache file format:
---
reference_id: PMID:16888623
title: MUC1 oncoprotein blocks nuclear targeting...
authors:
- Raina D
- Ahmad R
journal: Molecular Cell
year: '2006'
doi: 10.1016/j.molcel.2006.04.017
content_type: abstract_only
---
# MUC1 oncoprotein blocks nuclear targeting...
**Authors:** Raina D, Ahmad R, ...
**Journal:** Molecular Cell (2006)
## Content
The MUC1 oncoprotein blocks nuclear targeting...
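The layout above suggests a simple mapping from reference ID to cache file. The sketch below infers that mapping from the example file names (the colon becomes an underscore) and separates the YAML frontmatter from the markdown body; `cache_path` and `split_frontmatter` are hypothetical helpers, not part of the tool's documented API.

```python
from pathlib import Path

CACHE_DIR = Path("references_cache")  # default location shown above


def cache_path(reference_id: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Map "PMID:16888623" to references_cache/PMID_16888623.md (inferred)."""
    return cache_dir / (reference_id.replace(":", "_") + ".md")


def split_frontmatter(text: str) -> tuple[str, str]:
    """Return (frontmatter, body); frontmatter is "" if the file has none."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":  # closing delimiter
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", text  # unterminated frontmatter: treat everything as body


print(cache_path("PMID:16888623"))
frontmatter, body = split_frontmatter("---\nreference_id: PMID:1\n---\n# Title\nBody")
print(frontmatter)
print(body)
```

The frontmatter string can then be handed to any YAML parser, and the body is what gets normalized and searched.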
Cache Benefits
- Offline usage: Work without network after initial fetch
- Performance: Instant validation after first fetch
- Reproducibility: Same reference version for all validations
- Inspection: Human-readable cache files
LinkML Integration
The validator is a LinkML plugin that uses special slot URIs:
classes:
  Statement:
    attributes:
      supporting_text:
        slot_uri: linkml:excerpt  # Marks as quoted text
      reference:
        slot_uri: linkml:authoritative_reference  # Marks as reference ID
When LinkML validates data, it calls our plugin for fields marked with these URIs.
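To make the role of the slot URIs concrete, the sketch below finds the marked fields in a schema and pulls the quote and reference ID out of a data record. It works on a plain dictionary mirroring the YAML above rather than on LinkML's own schema objects, and `marked_slots` and `extract_pair` are hypothetical helpers; the real plugin hooks into LinkML's validation machinery.

```python
EXCERPT_URI = "linkml:excerpt"
REFERENCE_URI = "linkml:authoritative_reference"

# The schema above, written as a plain dictionary for illustration.
schema = {
    "classes": {
        "Statement": {
            "attributes": {
                "supporting_text": {"slot_uri": EXCERPT_URI},
                "reference": {"slot_uri": REFERENCE_URI},
            }
        }
    }
}


def marked_slots(schema: dict, class_name: str) -> dict:
    """Find which attributes carry the excerpt and reference slot URIs."""
    attributes = schema["classes"][class_name].get("attributes", {})
    by_uri = {spec.get("slot_uri"): name for name, spec in attributes.items()}
    return {
        "excerpt": by_uri.get(EXCERPT_URI),
        "reference": by_uri.get(REFERENCE_URI),
    }


def extract_pair(schema: dict, class_name: str, instance: dict) -> tuple:
    """Return the (supporting text, reference ID) pair to validate."""
    slots = marked_slots(schema, class_name)
    return instance.get(slots["excerpt"]), instance.get(slots["reference"])


record = {"supporting_text": "MUC1 oncoprotein", "reference": "PMID:16888623"}
print(marked_slots(schema, "Statement"))
print(extract_pair(schema, "Statement", record))
```

Because the fields are identified by URI rather than by name, a schema is free to call them something other than `supporting_text` and `reference`.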
Editorial Conventions
Square Brackets [...]
Used for editorial clarifications inserted into quotes:
Original reference: "MUC1 oncoprotein blocks nuclear targeting"
Your quote: "MUC1 [mucin 1] oncoprotein blocks nuclear targeting"
The [mucin 1] is removed before matching.
Ellipsis ...
Used to indicate omitted text between parts:
Original: "MUC1 oncoprotein blocks nuclear targeting of c-Abl"
Your quote: "MUC1 oncoprotein ... c-Abl"
Both parts must exist in the reference.
Design Principles
1. Conservative by Default
- Only exact substring matches count
- No approximations or suggestions
- Fail fast on mismatches
2. Progressive Disclosure
- Simple cases require minimal syntax
- Advanced features (editorial notes, ellipsis) available when needed
- Sensible defaults (cache location, etc.)
3. CLI-First
- Command-line is the primary interface
- Python API available for integration
- No GUI required
4. Standards-Based
- Uses LinkML schemas
- NCBI standard identifiers (PMID, PMC)
- Markdown for cache files
Limitations
What This Tool Does NOT Do
- Semantic matching: Won't match paraphrases
- Citation formatting: Not a bibliography manager
- Fact checking: Only verifies that the quoted text exists in the reference, not that the claim is true
- Plagiarism detection: Not designed for that purpose
Known Limitations
- Abstracts only for most PMIDs: Full text requires PMC
- Network required: For initial reference fetch
- English-focused: Normalization optimized for English text
- No OCR: Can't extract text from images/PDFs in papers
When to Use This Tool
Good Use Cases ✅
- Validating gene function claims in databases
- Checking supporting text in knowledge graphs
- Verifying quotes in scientific documentation
- Batch validation of curated annotations
Not Recommended ❌
- Checking if ideas are supported (use human review)
- Finding similar papers (use search engines)
- Generating citations (use citation managers)
- Paraphrase detection (use plagiarism tools)