Complete Workflow Tutorial: Building a Validated Gene Annotation System

This tutorial walks you through building a complete gene annotation validation system from scratch, using real examples and best practices.

What We'll Build

A validated gene function annotation system that: - Stores gene function claims with supporting text from publications - Automatically validates that quotes match their cited sources - Supports multiple reference types (PMID, DOI, PMC) - Includes repair capabilities for common errors - Can be integrated into a CI/CD pipeline

Time required: 30-45 minutes

Prerequisites

Python 3.10+ installed
Basic understanding of YAML
Familiarity with command line
(Optional) NCBI API key for higher rate limits

Step 1: Installation and Setup (5 minutes)

Install the Tool

# Using pip
pip install linkml-reference-validator

# Or using uv (faster)
curl -LsSf https://astral.sh/uv/install.sh | sh
uv pip install linkml-reference-validator

Create Project Structure

# Create project directory
mkdir gene-annotation-validator
cd gene-annotation-validator

# Create subdirectories
mkdir -p schemas data references_cache tests

# Verify installation
linkml-reference-validator --version

Configure NCBI Access (Optional)

# Set environment variables
export NCBI_EMAIL="your.email@example.com"

# Test with a simple validation
linkml-reference-validator validate text \
  "MUC1 oncoprotein blocks nuclear targeting of c-Abl" \
  PMID:16888623

Expected output:

Validating text against PMID:16888623...
Result:
  Valid: True
  Message: Supporting text validated successfully in PMID:16888623

Step 2: Design Your Data Model (10 minutes)

Create the LinkML Schema

We'll create a schema for gene function annotations with evidence from literature.

schemas/gene_annotations.yaml:

id: https://example.org/gene-annotations
name: gene-annotations
description: Schema for validated gene function annotations

prefixes:
  linkml: https://w3id.org/linkml/
  dcterms: http://purl.org/dc/terms/
  biolink: https://w3id.org/biolink/vocab/

default_prefix: gene_annotations

classes:
  # Root container class
  GeneAnnotationCollection:
    tree_root: true
    description: Collection of gene function annotations
    attributes:
      annotations:
        multivalued: true
        range: GeneAnnotation
        description: List of gene annotations

  # Main annotation class
  GeneAnnotation:
    description: An annotation describing a gene's function with supporting evidence
    attributes:
      id:
        identifier: true
        required: true
        description: Unique identifier for this annotation

      gene_symbol:
        required: true
        description: Official gene symbol (e.g., TP53, BRCA1)
        pattern: "^[A-Z0-9]+$"

      gene_name:
        description: Full gene name

      function_summary:
        required: true
        description: Brief summary of the gene's function

      function_category:
        range: FunctionCategory
        description: Broad categorization of gene function

      species:
        range: Species
        description: Species this annotation applies to
        required: true

      evidence:
        required: true
        multivalued: true
        range: Evidence
        description: Supporting evidence from literature

      last_reviewed:
        range: date
        description: Date this annotation was last reviewed

      curator:
        description: Person who created/reviewed this annotation

  # Evidence class with reference validation
  Evidence:
    description: Evidence supporting a gene function claim
    attributes:
      reference_id:
        required: true
        slot_uri: linkml:authoritative_reference
        description: |
          Reference identifier (PMID, PMC, DOI, or file path)
          Examples: PMID:16888623, PMC:3458566, DOI:10.1038/nature12373

      reference_title:
        slot_uri: dcterms:title
        description: Title of the referenced publication (validated if provided)

      supporting_text:
        required: true
        slot_uri: linkml:excerpt
        description: |
          Direct quote from the reference supporting the annotation.
          Use [brackets] for editorial clarifications.
          Use ... for omitted text between parts.

      evidence_type:
        range: EvidenceType
        description: Type of experimental evidence

      confidence:
        range: ConfidenceLevel
        description: Curator's confidence in this evidence

      notes:
        description: Additional context or clarifications

# Enumerations
enums:
  FunctionCategory:
    permissible_values:
      TUMOR_SUPPRESSOR:
        description: Prevents uncontrolled cell growth
      ONCOGENE:
        description: Promotes cell growth and division
      DNA_REPAIR:
        description: Repairs damaged DNA
      TRANSCRIPTION_FACTOR:
        description: Regulates gene expression
      CELL_CYCLE_REGULATOR:
        description: Controls cell cycle progression
      KINASE:
        description: Phosphorylates other proteins
      PHOSPHATASE:
        description: Removes phosphate groups
      RECEPTOR:
        description: Receives extracellular signals
      SIGNALING:
        description: Transmits cellular signals

  EvidenceType:
    permissible_values:
      EXPERIMENTAL:
        description: Direct experimental evidence
      COMPUTATIONAL:
        description: Computational prediction or inference
      LITERATURE:
        description: Statement from literature without original data
      CURATOR_INFERENCE:
        description: Inferred by curator from related evidence

  ConfidenceLevel:
    permissible_values:
      HIGH:
        description: Strong, consistent evidence
      MEDIUM:
        description: Good evidence but some uncertainty
      LOW:
        description: Limited or conflicting evidence

  Species:
    permissible_values:
      HUMAN:
        description: Homo sapiens
      MOUSE:
        description: Mus musculus
      RAT:
        description: Rattus norvegicus
      YEAST:
        description: Saccharomyces cerevisiae

Understanding the Schema

Key elements: - slot_uri: linkml:excerpt - Marks supporting_text for validation - slot_uri: linkml:authoritative_reference - Marks reference_id as the reference - slot_uri: dcterms:title - Optionally validates reference titles - Enumerations - Controlled vocabularies for consistency - Required fields - Ensures data completeness

Step 3: Create Sample Data (10 minutes)

Example 1: Simple Annotation

data/tp53_annotation.yaml:

annotations:
  - id: ANN001
    gene_symbol: TP53
    gene_name: Tumor protein p53
    function_summary: Regulates cell cycle and acts as tumor suppressor
    function_category: TUMOR_SUPPRESSOR
    species: HUMAN
    curator: Jane Doe
    last_reviewed: 2024-01-15

    evidence:
      - reference_id: PMID:16888623
        reference_title: MUC1 oncoprotein blocks nuclear targeting of c-Abl
        supporting_text: "MUC1 oncoprotein blocks nuclear targeting of c-Abl"
        evidence_type: EXPERIMENTAL
        confidence: HIGH

Example 2: Multiple Evidence Items

data/brca1_annotation.yaml:

annotations:
  - id: ANN002
    gene_symbol: BRCA1
    gene_name: Breast cancer type 1 susceptibility protein
    function_summary: Critical role in DNA repair and tumor suppression
    function_category: DNA_REPAIR
    species: HUMAN
    curator: John Smith
    last_reviewed: 2024-02-20

    evidence:
      # Evidence 1: DNA repair function
      - reference_id: PMID:12345678
        supporting_text: "BRCA1 plays a critical role in DNA double-strand break repair"
        evidence_type: EXPERIMENTAL
        confidence: HIGH
        notes: Direct experimental demonstration

      # Evidence 2: Tumor suppressor function
      - reference_id: PMID:23456789
        supporting_text: "BRCA1 functions as a tumor suppressor ... maintaining genomic stability"
        evidence_type: EXPERIMENTAL
        confidence: HIGH
        notes: Used ellipsis to connect non-contiguous parts

      # Evidence 3: Using editorial notes
      - reference_id: PMC:3458566
        supporting_text: "BRCA1 [breast cancer type 1] is involved in homologous recombination"
        evidence_type: LITERATURE
        confidence: MEDIUM
        notes: Added gene name clarification in brackets

Example 3: Mixed Reference Types

data/multi_gene_annotations.yaml:

annotations:
  - id: ANN003
    gene_symbol: EGFR
    gene_name: Epidermal growth factor receptor
    function_summary: Receptor tyrosine kinase involved in cell proliferation
    function_category: RECEPTOR
    species: HUMAN
    curator: Jane Doe

    evidence:
      # Using DOI
      - reference_id: DOI:10.1038/nature12373
        supporting_text: "EGFR is a receptor tyrosine kinase"
        evidence_type: EXPERIMENTAL
        confidence: HIGH

      # Using local file
      - reference_id: file:./references/egfr_review.md
        supporting_text: "EGFR mutations are found in many cancers"
        evidence_type: LITERATURE
        confidence: MEDIUM
        notes: From local review article

  - id: ANN004
    gene_symbol: JAK1
    gene_name: Janus kinase 1
    function_summary: Tyrosine kinase in cytokine signaling
    function_category: KINASE
    species: HUMAN
    curator: John Smith

    evidence:
      # Using URL
      - reference_id: url:https://example.org/jak1-article.html
        supporting_text: "JAK1 is a key mediator of cytokine signaling"
        evidence_type: LITERATURE
        confidence: MEDIUM

Step 4: Validate Your Data (10 minutes)

Basic Validation

# Validate single file
linkml-reference-validator validate data \
  data/tp53_annotation.yaml \
  --schema schemas/gene_annotations.yaml \
  --target-class GeneAnnotationCollection

# Expected output:
# Validating data/tp53_annotation.yaml...
# ✓ All validations passed!

Verbose Validation

# See detailed validation info
linkml-reference-validator validate data \
  data/brca1_annotation.yaml \
  --schema schemas/gene_annotations.yaml \
  --target-class GeneAnnotationCollection \
  --verbose

# Shows:
# - Each reference being validated
# - What text is being searched for
# - Whether full text or abstract was used
# - Validation results for each item

Batch Validation

# Validate all files in data directory
for file in data/*.yaml; do
  echo "Validating $file..."
  linkml-reference-validator validate data \
    "$file" \
    --schema schemas/gene_annotations.yaml \
    --target-class GeneAnnotationCollection
done

Step 5: Handle Validation Errors (10 minutes)

Scenario 1: Character Encoding Issues

Create a file with common encoding issues:

data/error_example1.yaml:

annotations:
  - id: ANN005
    gene_symbol: TEST1
    function_summary: Test gene for CO2 transport
    function_category: SIGNALING
    species: HUMAN

    evidence:
      - reference_id: PMID:16888623
        # This will fail: ASCII "O2" instead of subscript
        supporting_text: "protein involved in O2 transport"
        evidence_type: EXPERIMENTAL
        confidence: HIGH

Validate and repair:

# First validate to see the error
linkml-reference-validator validate data \
  data/error_example1.yaml \
  --schema schemas/gene_annotations.yaml \
  --target-class GeneAnnotationCollection

# Use repair to fix (dry run first)
linkml-reference-validator repair data \
  data/error_example1.yaml \
  --schema schemas/gene_annotations.yaml \
  --target-class GeneAnnotationCollection \
  --dry-run

# Review the suggested fixes, then apply
linkml-reference-validator repair data \
  data/error_example1.yaml \
  --schema schemas/gene_annotations.yaml \
  --target-class GeneAnnotationCollection \
  --no-dry-run

Scenario 2: Missing Ellipsis

data/error_example2.yaml:

annotations:
  - id: ANN006
    gene_symbol: TEST2
    function_summary: Test gene
    function_category: SIGNALING
    species: HUMAN

    evidence:
      - reference_id: PMID:16888623
        # This will fail: missing "..." between non-contiguous parts
        supporting_text: "MUC1 oncoprotein blocks c-Abl"
        evidence_type: EXPERIMENTAL
        confidence: HIGH

The repair command will suggest adding ellipsis:

Suggested fix (MEDIUM confidence):
  "MUC1 oncoprotein blocks c-Abl" → "MUC1 oncoprotein ... blocks ... c-Abl"

Scenario 3: Text Not in Reference

data/error_example3.yaml:

annotations:
  - id: ANN007
    gene_symbol: TEST3
    function_summary: Test gene
    function_category: SIGNALING
    species: HUMAN

    evidence:
      - reference_id: PMID:16888623
        # This will fail: text doesn't exist in reference
        supporting_text: "completely fabricated text that doesn't exist"
        evidence_type: EXPERIMENTAL
        confidence: HIGH

The repair command will flag for removal:

RECOMMENDED REMOVALS (low confidence):
  PMID:16888623 at evidence[0]:
    Similarity: 5%
    Snippet: 'completely fabricated text that doesn't exist'
    Action: Remove or find correct reference

Step 6: Create Configuration File (5 minutes)

Create a project configuration:

.linkml-reference-validator.yaml:

# Validation configuration
validation:
  cache_dir: ./references_cache

  # Custom prefix mappings
  reference_prefix_map:
    pubmed: PMID
    pmc: PMC
    doi: DOI

  # Base directory for file:// references
  reference_base_dir: ./references

# Repair configuration
repair:
  # Confidence thresholds
  auto_fix_threshold: 0.95
  suggest_threshold: 0.80
  removal_threshold: 0.50

  # Character normalization
  character_mappings:
    "O2": "O₂"
    "CO2": "CO₂"
    "H2O": "H₂O"
    "N2": "N₂"
    "+/-": "±"
    "alpha": "α"
    "beta": "β"
    "gamma": "γ"

  # Skip references with known issues
  skip_references: []

  # Trusted references (manually verified)
  trusted_low_similarity: []

Use the configuration:

linkml-reference-validator validate data \
  data/*.yaml \
  --schema schemas/gene_annotations.yaml \
  --target-class GeneAnnotationCollection \
  --config .linkml-reference-validator.yaml

Step 7: Integrate with Version Control (5 minutes)

Create Git Pre-commit Hook

.git/hooks/pre-commit:

#!/bin/bash

echo "🔍 Validating gene annotations..."

# Validate all data files
for file in data/*.yaml; do
  if [ -f "$file" ]; then
    echo "  Checking $file..."

    linkml-reference-validator validate data \
      "$file" \
      --schema schemas/gene_annotations.yaml \
      --target-class GeneAnnotationCollection \
      --config .linkml-reference-validator.yaml

    if [ $? -ne 0 ]; then
      echo "❌ Validation failed for $file"
      echo ""
      echo "To fix errors, run:"
      echo "  linkml-reference-validator repair data $file --schema schemas/gene_annotations.yaml --dry-run"
      exit 1
    fi
  fi
done

echo "✅ All validations passed!"
exit 0

Make it executable:

chmod +x .git/hooks/pre-commit

Create Makefile

Makefile:

.PHONY: validate validate-verbose repair clean test

SCHEMA := schemas/gene_annotations.yaml
DATA_DIR := data
CONFIG := .linkml-reference-validator.yaml
TARGET_CLASS := GeneAnnotationCollection

# Validate all data files
validate:
    @echo "Validating all annotations..."
    @for file in $(DATA_DIR)/*.yaml; do \
        echo "Checking $$file..."; \
        linkml-reference-validator validate data \
            $$file \
            --schema $(SCHEMA) \
            --target-class $(TARGET_CLASS) \
            --config $(CONFIG) || exit 1; \
    done
    @echo "✅ All validations passed!"

# Validate with verbose output
validate-verbose:
    @for file in $(DATA_DIR)/*.yaml; do \
        echo "Checking $$file..."; \
        linkml-reference-validator validate data \
            $$file \
            --schema $(SCHEMA) \
            --target-class $(TARGET_CLASS) \
            --config $(CONFIG) \
            --verbose; \
    done

# Show suggested repairs (dry run)
repair:
    @for file in $(DATA_DIR)/*.yaml; do \
        echo "Checking repairs for $$file..."; \
        linkml-reference-validator repair data \
            $$file \
            --schema $(SCHEMA) \
            --target-class $(TARGET_CLASS) \
            --config $(CONFIG) \
            --dry-run; \
    done

# Apply repairs
repair-apply:
    @for file in $(DATA_DIR)/*.yaml; do \
        echo "Applying repairs to $$file..."; \
        linkml-reference-validator repair data \
            $$file \
            --schema $(SCHEMA) \
            --target-class $(TARGET_CLASS) \
            --config $(CONFIG) \
            --no-dry-run; \
    done

# Clean cache
clean:
    rm -rf references_cache/

# Run tests
test: validate
    @echo "Running tests..."
    @python -m pytest tests/ -v

Usage:

make validate              # Validate all files
make validate-verbose      # Verbose output
make repair                # Show suggested repairs
make repair-apply          # Apply repairs
make clean                 # Clear cache

Step 8: CI/CD Integration

GitHub Actions

.github/workflows/validate-annotations.yml:

name: Validate Gene Annotations

on:
  push:
    branches: [ main, develop ]
    paths:
      - 'data/**.yaml'
      - 'schemas/**.yaml'
  pull_request:
    branches: [ main ]
    paths:
      - 'data/**.yaml'
      - 'schemas/**.yaml'

jobs:
  validate:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install linkml-reference-validator

      - name: Cache references
        uses: actions/cache@v3
        with:
          path: references_cache
          key: ${{ runner.os }}-references-${{ hashFiles('data/**/*.yaml') }}
          restore-keys: |
            ${{ runner.os }}-references-

      - name: Validate annotations
        run: |
          make validate
        env:
          NCBI_EMAIL: ${{ secrets.NCBI_EMAIL }}
          NCBI_API_KEY: ${{ secrets.NCBI_API_KEY }}

      - name: Upload cache artifacts
        if: always()
        uses: actions/upload-artifact@v3
        with:
          name: references-cache
          path: references_cache/
          retention-days: 30

Step 9: Testing and Quality Assurance

Create Test Files

tests/test_validation.py:

#!/usr/bin/env python3
"""Test suite for gene annotation validation."""

import subprocess
import yaml
from pathlib import Path

DATA_DIR = Path("data")
SCHEMA = Path("schemas/gene_annotations.yaml")
TARGET_CLASS = "GeneAnnotationCollection"

def test_schema_valid():
    """Test that schema itself is valid."""
    result = subprocess.run(
        ["linkml-validate", "--schema", str(SCHEMA), str(SCHEMA)],
        capture_output=True,
        text=True
    )
    assert result.returncode == 0, f"Schema validation failed: {result.stderr}"

def test_all_data_files_valid():
    """Test that all data files validate against schema."""
    for data_file in DATA_DIR.glob("*.yaml"):
        if "error" in data_file.name:
            continue  # Skip error example files

        print(f"Testing {data_file}...")
        result = subprocess.run(
            [
                "linkml-reference-validator", "validate", "data",
                str(data_file),
                "--schema", str(SCHEMA),
                "--target-class", TARGET_CLASS
            ],
            capture_output=True,
            text=True
        )
        assert result.returncode == 0, \
            f"Validation failed for {data_file}: {result.stderr}"

def test_data_completeness():
    """Test that all required fields are present."""
    for data_file in DATA_DIR.glob("*.yaml"):
        if "error" in data_file.name:
            continue

        with open(data_file) as f:
            data = yaml.safe_load(f)

        # Check each annotation
        for ann in data.get("annotations", []):
            assert "id" in ann, f"Missing id in {data_file}"
            assert "gene_symbol" in ann, f"Missing gene_symbol in {data_file}"
            assert "evidence" in ann, f"Missing evidence in {data_file}"

            # Check each evidence item
            for ev in ann["evidence"]:
                assert "reference_id" in ev, f"Missing reference_id in {data_file}"
                assert "supporting_text" in ev, f"Missing supporting_text in {data_file}"

if __name__ == "__main__":
    test_schema_valid()
    test_all_data_files_valid()
    test_data_completeness()
    print("✅ All tests passed!")

Run tests:

python tests/test_validation.py

Step 10: Documentation and Maintenance

Create README

README.md:

# Gene Annotation Validation System

Validated gene function annotations with supporting evidence from literature.

## Quick Start

```bash
# Validate all annotations
make validate

# Add new annotation
cp templates/annotation_template.yaml data/new_gene.yaml
# Edit data/new_gene.yaml with your annotation
make validate

# Repair validation errors
make repair

Directory Structure

.
├── schemas/
│   └── gene_annotations.yaml    # LinkML schema
├── data/
│   ├── tp53_annotation.yaml     # Gene annotations
│   └── ...
├── references_cache/             # Cached references
├── tests/
│   └── test_validation.py       # Test suite
├── .linkml-reference-validator.yaml  # Config
└── Makefile                     # Build commands

Contributing

Create new annotation file in data/
Validate: make validate
Fix any errors: make repair
Commit and push (pre-commit hook will validate)


### Create Template

**templates/annotation_template.yaml:**
```yaml
annotations:
  - id: ANN_XXX  # Replace with unique ID
    gene_symbol: GENE_SYMBOL  # Official gene symbol
    gene_name: Full Gene Name
    function_summary: Brief summary of function
    function_category: CATEGORY  # See schema for options
    species: HUMAN  # Or MOUSE, RAT, YEAST
    curator: Your Name
    last_reviewed: YYYY-MM-DD

    evidence:
      - reference_id: PMID:XXXXXXXX  # Or DOI:, PMC:, file:, url:
        reference_title: Article title (optional but recommended)
        supporting_text: "Direct quote from the reference"
        evidence_type: EXPERIMENTAL  # Or COMPUTATIONAL, LITERATURE, CURATOR_INFERENCE
        confidence: HIGH  # Or MEDIUM, LOW
        notes: Additional context (optional)

Summary

You've now built a complete gene annotation validation system! You've learned:

✅ How to install and configure linkml-reference-validator
✅ How to design a LinkML schema with validation markers
✅ How to create validated data files
✅ How to validate and repair data
✅ How to integrate validation into your workflow
✅ How to set up CI/CD for automatic validation
✅ How to write tests for your validation system

Next Steps

Expand your schema - Add more gene attributes, relationships, or evidence types
Import existing data - Convert existing annotations to your new format
Integrate with databases - Export validated data to SQL, MongoDB, or RDF
Build a web interface - Create a UI for curators to add/edit annotations
Set up monitoring - Track validation success rates and common error patterns