How It Works

Understanding the validation process and design decisions.

Overview

linkml-reference-validator validates that quoted text (supporting text) actually appears in cited references. It uses deterministic substring matching rather than fuzzy or AI-based approaches.

The Validation Process

1. Text Normalization

Before matching, both the supporting text and reference content are normalized:

  • Lowercased: "MUC1" → "muc1"
  • Punctuation removed (replaced by a space): "c-Abl" → "c abl"
  • Whitespace collapsed: multiple spaces become a single space
  • Editorial notes removed: "[mucin 1]" → ""

Example:

Original: "MUC1 [mucin 1] oncoprotein blocks c-Abl!!!"
Normalized: "muc1 oncoprotein blocks c abl"

Normalization lets a quote match despite differences in case, punctuation, and spacing, while the words themselves must still match exactly.
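
The four normalization steps can be sketched in a few lines of Python. This is a minimal sketch, not the validator's actual code: the function name normalize and the regular expressions are assumptions chosen to reproduce the example above. Editorial notes are removed first, because stripping punctuation first would delete the brackets and leave their contents behind.

```python
import re


def normalize(text: str) -> str:
    """Hypothetical normalizer following the four steps listed above."""
    # 1. Drop editorial notes such as "[mucin 1]" (must happen before
    #    punctuation is stripped, or the bracket contents would survive).
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # 2. Lowercase.
    text = text.lower()
    # 3. Replace punctuation with a space, so "c-abl" becomes "c abl".
    text = re.sub(r"[^\w\s]", " ", text)
    # 4. Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


print(normalize("MUC1 [mucin 1] oncoprotein blocks c-Abl!!!"))
```

Under these assumptions the call prints muc1 oncoprotein blocks c abl, the same normalized form shown in the example.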

2. Substring Matching

After normalization, the validator checks if the supporting text appears as a substring in the reference content.

Simple case:

supporting_text = "MUC1 oncoprotein"
reference_content = "...The MUC1 oncoprotein blocks nuclear..."
# Match: "muc1 oncoprotein" found in normalized reference

3. Ellipsis Handling

When supporting text contains ..., each part is matched separately:

Supporting: "MUC1 oncoprotein ... nuclear targeting"
Parts: ["MUC1 oncoprotein", "nuclear targeting"]
# Both parts must exist in the reference
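
Substring matching and ellipsis handling can be combined in one small function. The sketch below is illustrative only: normalize, split_on_ellipsis, and is_supported are hypothetical names, and the sketch assumes that parts may appear in any order, since the text above only requires that every part exists in the reference.

```python
import re


def normalize(text: str) -> str:
    """Hypothetical normalizer: notes removed, lowercased, punctuation and spaces collapsed."""
    text = re.sub(r"\[[^\]]*\]", " ", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def split_on_ellipsis(supporting_text: str) -> list[str]:
    """Split a quote on '...' (three or more dots) or the single '…' character."""
    parts = re.split(r"\.{3,}|…", supporting_text)
    return [part.strip() for part in parts if part.strip()]


def is_supported(supporting_text: str, reference_content: str) -> bool:
    """Return True when every normalized part is a substring of the normalized reference."""
    reference = normalize(reference_content)
    # Split before normalizing, because normalization would strip the dots.
    parts = [normalize(part) for part in split_on_ellipsis(supporting_text)]
    parts = [part for part in parts if part]
    # An empty quote supports nothing (assumption of this sketch).
    return bool(parts) and all(part in reference for part in parts)


reference = "The MUC1 oncoprotein blocks nuclear targeting of c-Abl"
print(is_supported("MUC1 oncoprotein", reference))
print(is_supported("MUC1 oncoprotein ... nuclear targeting", reference))
print(is_supported("MUC1 kinase ... nuclear targeting", reference))
```

The first two calls succeed and the third fails, because "muc1 kinase" does not occur anywhere in the normalized reference.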

Why Deterministic Matching?

Not Fuzzy Matching

We explicitly avoid fuzzy/similarity matching because:

  1. Accuracy: No false positives from "close enough" matches
  2. Reproducibility: Same input always gives same result
  3. Explainability: Clear why something matched or didn't
  4. Trust: Critical for scientific accuracy

Not AI-Based

We don't use LLMs or semantic similarity because:

  1. Determinism: Results must be reproducible
  2. Verifiability: Humans can verify the match themselves
  3. No hallucinations: The text either exists or doesn't
  4. Simplicity: No model dependencies or API costs

Reference Fetching

PubMed (PMID)

For PMID:12345678:

  1. Queries NCBI E-utilities API
  2. Fetches abstract and metadata
  3. Parses XML response with BeautifulSoup
  4. Caches as markdown with YAML frontmatter
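
The first three steps can be sketched as follows. This is an illustration under stated assumptions rather than the tool's implementation: efetch_url and parse_pubmed_xml are hypothetical helpers, the request parameters are the common E-utilities EFetch ones, and the parser uses the standard library's xml.etree.ElementTree in place of BeautifulSoup so that the sketch has no third-party dependencies. No network request is made here; the functions only build the URL and parse XML that has already been downloaded.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(reference_id: str) -> str:
    """Build an EFetch URL for an identifier of the form 'PMID:12345678'."""
    prefix, _, number = reference_id.partition(":")
    if prefix != "PMID" or not number.isdigit():
        raise ValueError(f"not a PMID reference: {reference_id!r}")
    query = urlencode({"db": "pubmed", "id": number, "retmode": "xml"})
    return f"{EFETCH_URL}?{query}"


def parse_pubmed_xml(xml_text: str) -> dict:
    """Extract the title and abstract from a PubMed XML response."""
    root = ET.fromstring(xml_text)
    article = root.find(".//Article")
    if article is None:
        raise ValueError("no <Article> element in response")
    title_node = article.find("ArticleTitle")
    title = "".join(title_node.itertext()).strip() if title_node is not None else ""
    # Structured abstracts have several <AbstractText> sections; join them.
    sections = article.findall("./Abstract/AbstractText")
    abstract = " ".join("".join(node.itertext()).strip() for node in sections)
    return {"title": title, "abstract": abstract}


print(efetch_url("PMID:16888623"))
```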

PubMed Central (PMC)

For PMC:12345:

  1. Queries PMC API for full-text XML
  2. Extracts all sections (abstract, introduction, methods, results, discussion)
  3. Provides more content than abstracts alone
  4. Also cached as markdown

Caching

References are cached in references_cache/ as markdown files:

references_cache/
  PMID_16888623.md
  PMC_3458566.md

Cache file format:

---
reference_id: PMID:16888623
title: MUC1 oncoprotein blocks nuclear targeting...
authors:
  - Raina D
  - Ahmad R
journal: Molecular Cell
year: '2006'
doi: 10.1016/j.molcel.2006.04.017
content_type: abstract_only
---

# MUC1 oncoprotein blocks nuclear targeting...

**Authors:** Raina D, Ahmad R, ...
**Journal:** Molecular Cell (2006)

## Content

The MUC1 oncoprotein blocks nuclear targeting...

Cache Benefits

  • Offline usage: Work without network after initial fetch
  • Performance: Validation reads the local file, with no network round trip after the first fetch
  • Reproducibility: Same reference version for all validations
  • Inspection: Human-readable cache files
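
The cache layout described above can be sketched in Python. The helper names cache_path and split_frontmatter are hypothetical, and the naming rule (replace the colon with an underscore and append .md) is inferred from the two example file names rather than taken from the tool's source. The frontmatter is returned as raw text; parsing it into fields would need a YAML library.

```python
from pathlib import Path


def cache_path(reference_id: str, cache_dir: str = "references_cache") -> Path:
    """Map 'PMID:16888623' to references_cache/PMID_16888623.md (inferred rule)."""
    return Path(cache_dir) / (reference_id.replace(":", "_") + ".md")


def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a cache file into (raw YAML frontmatter, markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text  # no frontmatter block
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return frontmatter, body
    return "", text  # opening delimiter without a closing one


path = cache_path("PMID:16888623")
if path.exists():
    frontmatter, body = split_frontmatter(path.read_text(encoding="utf-8"))
    print(frontmatter)
```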

LinkML Integration

The validator is a LinkML plugin that uses special slot URIs:

classes:
  Statement:
    attributes:
      supporting_text:
        slot_uri: linkml:excerpt  # Marks as quoted text
      reference:
        slot_uri: linkml:authoritative_reference  # Marks as reference ID

When LinkML validates data, it calls our plugin for fields marked with these URIs.

Editorial Conventions

Square Brackets [...]

Used for editorial clarifications inserted into quotes:

Original reference: "MUC1 oncoprotein blocks nuclear targeting"
Your quote: "MUC1 [mucin 1] oncoprotein blocks nuclear targeting"

The [mucin 1] is removed before matching.

Ellipsis ...

Used to indicate omitted text between parts:

Original: "MUC1 oncoprotein blocks nuclear targeting of c-Abl"
Your quote: "MUC1 oncoprotein ... c-Abl"

Both parts must exist in the reference.

Design Principles

1. Conservative by Default

  • Only exact substring matches count
  • No approximations or suggestions
  • Fail fast on mismatches

2. Progressive Disclosure

  • Simple cases require minimal syntax
  • Advanced features (editorial notes, ellipsis) available when needed
  • Sensible defaults (cache location, etc.)

3. CLI-First

  • Command-line is the primary interface
  • Python API available for integration
  • No GUI required

4. Standards-Based

  • Uses LinkML schemas
  • NCBI standard identifiers (PMID, PMC)
  • Markdown for cache files

Limitations

What This Tool Does NOT Do

  • Semantic matching: Won't match paraphrases
  • Citation formatting: Not a bibliography manager
  • Fact checking: Only verifies text existence
  • Plagiarism detection: Not designed for that purpose

Known Limitations

  • Abstracts only for most PMIDs: Full text requires PMC
  • Network required: For initial reference fetch
  • English-focused: Normalization optimized for English text
  • No OCR: Can't extract text from images/PDFs in papers

When to Use This Tool

Good Use Cases ✅

  • Validating gene function claims in databases
  • Checking supporting text in knowledge graphs
  • Verifying quotes in scientific documentation
  • Batch validation of curated annotations

Not Appropriate For ❌

  • Checking if ideas are supported (use human review)
  • Finding similar papers (use search engines)
  • Generating citations (use citation managers)
  • Paraphrase detection (use plagiarism tools)