How It Works

Understanding the validation process and design decisions.

Overview

linkml-reference-validator validates that quoted text (supporting text) actually appears in cited references. It uses deterministic substring matching rather than fuzzy or AI-based approaches.

The Validation Process

1. Text Normalization

Before matching, both the supporting text and reference content are normalized:

Lowercased: "MUC1" → "muc1"
Punctuation removed: "c-Abl" → "c abl" (each punctuation character becomes a space)
Whitespace collapsed: Runs of whitespace become a single space
Editorial notes removed: "[mucin 1]" → ""

Example:

Original: "MUC1 [mucin 1] oncoprotein blocks c-Abl!!!"
Normalized: "muc1 oncoprotein blocks c abl"

This allows matching despite differences in case, punctuation, and spacing, while still requiring the words themselves to match exactly.
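The four normalization rules above can be sketched in a few lines of Python. This is a minimal illustration, not the validator's actual code: the function name `normalize`, the regular expressions, and the order of the steps are assumptions chosen to reproduce the example shown.

```python
import re


def normalize(text: str) -> str:
    """Hypothetical helper illustrating the normalization rules (not the tool's API)."""
    # Drop editorial notes such as "[mucin 1]" first, while the brackets still exist.
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Lowercase everything.
    text = text.lower()
    # Replace every character that is neither a word character nor whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(normalize("MUC1 [mucin 1] oncoprotein blocks c-Abl!!!"))
# -> muc1 oncoprotein blocks c abl
```

In this sketch, bracketed notes have to be removed before punctuation, because otherwise the brackets would be stripped and the note's words would stay in the text.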

2. Substring Matching

After normalization, the validator checks if the supporting text appears as a substring in the reference content.

Simple case:

supporting_text = "MUC1 oncoprotein"
reference_content = "...The MUC1 oncoprotein blocks nuclear..."
# Match: "muc1 oncoprotein" found in normalized reference
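As a rough sketch of this step, the check reduces to Python's `in` operator applied to the two normalized strings. The names `normalize` and `is_supported` are hypothetical, and the compact `normalize` here is only a stand-in for the normalization described earlier (it omits editorial-note removal).

```python
import re


def normalize(text: str) -> str:
    # Stand-in normalization: lowercase, punctuation to spaces, collapse whitespace.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text.lower())).strip()


def is_supported(supporting_text: str, reference_content: str) -> bool:
    """Return True if the normalized quote occurs verbatim in the normalized reference."""
    needle = normalize(supporting_text)
    # Guard against an empty quote, which would otherwise match any reference.
    return bool(needle) and needle in normalize(reference_content)


print(is_supported("MUC1 oncoprotein", "...The MUC1 oncoprotein blocks nuclear..."))
print(is_supported("MUC1 protein", "...The MUC1 oncoprotein blocks nuclear..."))
```

Note that a plain substring test works on characters, not words, so this sketch would also accept a quote that ends in the middle of a word.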

3. Ellipsis Handling

When supporting text contains an ellipsis (...), it is split there and each part is matched separately:

Supporting: "MUC1 oncoprotein ... nuclear targeting"
Parts: ["MUC1 oncoprotein", "nuclear targeting"]
# Both parts must exist in the reference
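The splitting and per-part check might look like the following. This is a sketch under assumptions: the helper names are hypothetical, the split pattern (three or more dots, or the single "…" character) is a guess, and the code checks only that every part exists in the reference, not that the parts appear in order.

```python
import re


def normalize(text: str) -> str:
    # Stand-in normalization: lowercase, punctuation to spaces, collapse whitespace.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text.lower())).strip()


def split_on_ellipsis(supporting_text: str) -> list[str]:
    """Split a quote at each ellipsis and drop empty fragments."""
    parts = re.split(r"\.{3,}|…", supporting_text)
    return [part.strip() for part in parts if part.strip()]


def all_parts_found(supporting_text: str, reference_content: str) -> bool:
    """Return True only if every non-empty part occurs in the normalized reference."""
    reference = normalize(reference_content)
    parts = [normalize(part) for part in split_on_ellipsis(supporting_text)]
    parts = [part for part in parts if part]
    return bool(parts) and all(part in reference for part in parts)


reference = "MUC1 oncoprotein blocks nuclear targeting of c-Abl"
print(split_on_ellipsis("MUC1 oncoprotein ... nuclear targeting"))
print(all_parts_found("MUC1 oncoprotein ... c-Abl", reference))
```

The split has to happen before normalization, since normalizing first would strip the dots that mark the omission.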

Why Deterministic Matching?

Not Fuzzy Matching

We explicitly avoid fuzzy/similarity matching because:

Accuracy: No false positives from "close enough" matches
Reproducibility: Same input always gives same result
Explainability: Clear why something matched or didn't
Trust: Critical for scientific accuracy

Not AI-Based

We don't use LLMs or semantic similarity because:

Determinism: Results must be reproducible
Verifiability: Humans can verify the match themselves
No hallucinations: The text either exists or doesn't
Simplicity: No model dependencies or API costs

Reference Fetching

PubMed (PMID)

For PMID:12345678:

Queries NCBI E-utilities API
Fetches abstract and metadata
Parses XML response with BeautifulSoup
Caches as markdown with YAML frontmatter
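To make the first three steps concrete, here is a small offline sketch. It builds a request URL for the E-utilities `efetch` endpoint and pulls the abstract out of a PubMed-style XML document. Several things are assumptions: the function names are hypothetical, the sample XML is invented and heavily trimmed, and the standard library's `xml.etree.ElementTree` is swapped in for BeautifulSoup so that the example has no third-party dependencies.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(reference_id: str) -> str:
    """Build an efetch URL for an ID such as 'PMID:12345678' (hypothetical helper)."""
    pmid = reference_id.split(":", 1)[1]
    return EFETCH_URL + "?" + urlencode({"db": "pubmed", "id": pmid, "retmode": "xml"})


def extract_abstract(xml_text: str) -> str:
    """Join the text of all AbstractText elements into one string."""
    root = ET.fromstring(xml_text)
    parts = ["".join(node.itertext()).strip() for node in root.iter("AbstractText")]
    return " ".join(part for part in parts if part)


# Invented, trimmed sample in the shape of a PubMed XML record.
SAMPLE = """<PubmedArticleSet><PubmedArticle><MedlineCitation><Article>
<ArticleTitle>Example title</ArticleTitle>
<Abstract>
<AbstractText>First sentence.</AbstractText>
<AbstractText>Second sentence.</AbstractText>
</Abstract>
</Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"""

print(efetch_url("PMID:12345678"))
print(extract_abstract(SAMPLE))
```

The actual HTTP request is left out on purpose so the sketch runs without network access.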

PubMed Central (PMC)

For PMC:12345:

Queries PMC API for full-text XML
Extracts all sections (abstract, introduction, methods, results, discussion)
Provides more content than abstracts alone
Also cached as markdown

Caching

References are cached in references_cache/ as markdown files:

references_cache/
  PMID_16888623.md
  PMC_3458566.md
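The file names above suggest a simple mapping from reference ID to cache path. A minimal sketch, assuming the only transformation is replacing the colon with an underscore (the function name `cache_path` is hypothetical):

```python
from pathlib import Path


def cache_path(reference_id: str, cache_dir: str = "references_cache") -> Path:
    """Map an ID like 'PMID:16888623' to 'references_cache/PMID_16888623.md'."""
    # Assumption: the colon is the only character that needs replacing.
    return Path(cache_dir) / (reference_id.replace(":", "_") + ".md")


print(cache_path("PMID:16888623").as_posix())
print(cache_path("PMC:3458566").as_posix())
```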

Cache file format:

---
reference_id: PMID:16888623
title: MUC1 oncoprotein blocks nuclear targeting...
authors:
  - Raina D
  - Ahmad R
journal: Molecular Cell
year: '2006'
doi: 10.1016/j.molcel.2006.04.017
content_type: abstract_only
---

# MUC1 oncoprotein blocks nuclear targeting...

**Authors:** Raina D, Ahmad R, ...
**Journal:** Molecular Cell (2006)

## Content

The MUC1 oncoprotein blocks nuclear targeting...
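Because the cache file is plain markdown with a YAML header, it can be taken apart with ordinary string handling. The sketch below separates the front matter from the body and returns the text under the "## Content" heading. The function names are hypothetical, and the front matter is returned as raw text; a real reader would presumably hand it to a YAML parser.

```python
def split_front_matter(text: str) -> tuple[str, str]:
    """Return (front_matter, body); front_matter is '' if no '---' block is found."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            front = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return front, body
    return "", text  # opening delimiter without a closing one


def content_section(body: str) -> str:
    """Return everything after the '## Content' heading, or '' if it is missing."""
    _, found, rest = body.partition("## Content")
    return rest.strip() if found else ""


SAMPLE = (
    "---\n"
    "reference_id: PMID:16888623\n"
    "content_type: abstract_only\n"
    "---\n\n"
    "# MUC1 oncoprotein blocks nuclear targeting...\n\n"
    "## Content\n\n"
    "The MUC1 oncoprotein blocks nuclear targeting...\n"
)

front, body = split_front_matter(SAMPLE)
print(front)
print(content_section(body))
```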

Cache Benefits

Offline usage: Work without network after initial fetch
Performance: Validation reads from local disk after the first fetch
Reproducibility: Same reference version for all validations
Inspection: Human-readable cache files

LinkML Integration

The validator is a LinkML plugin that uses special slot URIs:

classes:
  Statement:
    attributes:
      supporting_text:
        slot_uri: linkml:excerpt  # Marks as quoted text
      reference:
        slot_uri: linkml:authoritative_reference  # Marks as reference ID

When LinkML validates data, it calls our plugin for fields marked with these URIs.
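To illustrate how the two slot URIs identify the relevant fields, the sketch below walks a schema that has already been loaded into a plain Python dict and reports which attribute plays which role. This is not the plugin's code and does not use the LinkML runtime; the function name `marked_fields` and the role labels are hypothetical.

```python
EXCERPT_URI = "linkml:excerpt"
REFERENCE_URI = "linkml:authoritative_reference"

# The schema fragment from above, as it would look after YAML loading.
SCHEMA = {
    "classes": {
        "Statement": {
            "attributes": {
                "supporting_text": {"slot_uri": EXCERPT_URI},
                "reference": {"slot_uri": REFERENCE_URI},
            }
        }
    }
}


def marked_fields(schema: dict, class_name: str) -> dict[str, str]:
    """Map each role ('excerpt' or 'reference') to the attribute that carries it."""
    roles = {EXCERPT_URI: "excerpt", REFERENCE_URI: "reference"}
    attributes = schema["classes"][class_name].get("attributes") or {}
    return {
        roles[spec["slot_uri"]]: name
        for name, spec in attributes.items()
        if (spec or {}).get("slot_uri") in roles
    }


print(marked_fields(SCHEMA, "Statement"))
```

With the roles resolved this way, a validator knows which field holds the quote and which holds the reference ID, whatever the attributes happen to be named.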

Editorial Conventions

Square Brackets `[...]`

Used for editorial clarifications inserted into quotes:

Original reference: "MUC1 oncoprotein blocks nuclear targeting"
Your quote: "MUC1 [mucin 1] oncoprotein blocks nuclear targeting"

The [mucin 1] is removed before matching.

Ellipsis `...`

Used to indicate omitted text between parts:

Original: "MUC1 oncoprotein blocks nuclear targeting of c-Abl"
Your quote: "MUC1 oncoprotein ... c-Abl"

Both parts must exist in the reference.

Design Principles

1. Conservative by Default

Only exact substring matches (after normalization) count
No approximations or suggestions
Fail fast on mismatches

2. Progressive Disclosure

Simple cases require minimal syntax
Advanced features (editorial notes, ellipsis) available when needed
Sensible defaults (cache location, etc.)

3. CLI-First

Command-line is the primary interface
Python API available for integration
No GUI required

4. Standards-Based

Uses LinkML schemas
NCBI standard identifiers (PMID, PMC)
Markdown for cache files

Limitations

What This Tool Does NOT Do

Semantic matching: Won't match paraphrases
Citation formatting: Not a bibliography manager
Fact checking: Only verifies text existence
Plagiarism detection: Not designed for that purpose

Known Limitations

Abstracts only for most PMIDs: Full text is available only when the article is in PMC
Network required: For initial reference fetch
English-focused: Normalization optimized for English text
No OCR: Can't extract text from images/PDFs in papers

When to Use This Tool

Good Use Cases ✅

Validating gene function claims in databases
Checking supporting text in knowledge graphs
Verifying quotes in scientific documentation
Batch validation of curated annotations

Not Recommended ❌

Checking if ideas are supported (use human review)
Finding similar papers (use search engines)
Generating citations (use citation managers)
Paraphrase detection (use plagiarism tools)