Python API for Programmatic Usage

This tutorial shows how to use the Python API to integrate reference validation into your own applications.

Prerequisites

Review Tutorial 1 and Tutorial 2 for CLI usage patterns.

When to Use the Python API

Use the Python API when you need to:

Integrate validation into existing Python applications
Build custom validation workflows
Collect statistics and programmatic results
Handle validation errors programmatically

Note: For most use cases, the CLI is simpler and recommended.

Setup

import tempfile
from pathlib import Path
from linkml_reference_validator.validation.supporting_text_validator import SupportingTextValidator
from linkml_reference_validator.models import ReferenceValidationConfig
from linkml_reference_validator.etl.reference_fetcher import ReferenceFetcher

# Create temporary cache directory
temp_dir = tempfile.mkdtemp()
cache_dir = Path(temp_dir) / "reference_cache"
cache_dir.mkdir(exist_ok=True)

print(f"Working directory: {temp_dir}")
print(f"Cache directory: {cache_dir}")

Working directory: /var/folders/nc/m4tx21912kv1b8nk3zzx9plr0000gn/T/tmpclsqvmjr
Cache directory: /var/folders/nc/m4tx21912kv1b8nk3zzx9plr0000gn/T/tmpclsqvmjr/reference_cache

# Create test references
test_cache_file = cache_dir / "PMID_12345678.txt"
test_cache_file.write_text("""ID: PMID:12345678
Title: TP53 Functions in Cell Cycle Regulation
Authors: Smith J, Doe A, Johnson K
Journal: Nature
Year: 2024
DOI: 10.1038/nature12345
ContentType: abstract

The TP53 protein functions in cell cycle regulation and plays a critical role as a tumor suppressor. 
Studies have shown that TP53 regulates cell cycle checkpoints and DNA repair mechanisms.
Loss of TP53 function is associated with various cancers.
""")

print("✓ Created test reference: PMID:12345678")

✓ Created test reference: PMID:12345678
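The hand-written cache file above follows a simple layout: a block of `Key: value` metadata lines, a blank line, then the body text, stored under a file name in which the colon of the reference ID is replaced by an underscore. A minimal sketch of a helper that writes files in that layout is shown below. The layout and naming rule are inferred from this tutorial rather than taken from the library's documentation, and `cache_filename` and `write_cache_entry` are hypothetical names.

```python
import tempfile
from pathlib import Path


def cache_filename(reference_id: str) -> str:
    """Assumed naming rule: replace ':' with '_' and append '.txt'."""
    return reference_id.replace(":", "_") + ".txt"


def write_cache_entry(cache_dir: Path, reference_id: str, title: str, body: str) -> Path:
    """Write a metadata header, a blank separator line, then the body text."""
    header = "\n".join([
        f"ID: {reference_id}",
        f"Title: {title}",
        "ContentType: abstract",
    ])
    path = cache_dir / cache_filename(reference_id)
    path.write_text(f"{header}\n\n{body}\n", encoding="utf-8")
    return path


# Demonstrate in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    written = write_cache_entry(
        Path(tmp), "PMID:12345678", "Example title", "Example body text."
    )
    file_name = written.name
    lines = written.read_text(encoding="utf-8").splitlines()

print(file_name)
print(lines)
```

Generating test files through one helper keeps the file names and the header/body separator consistent across references.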

Example 1: Basic Validation

# Create configuration
config = ReferenceValidationConfig(cache_dir=str(cache_dir))

# Create validator
validator = SupportingTextValidator(config)

# Validate a quote
result = validator.validate(
    supporting_text="TP53 protein functions in cell cycle regulation",
    reference_id="PMID:12345678"
)

print(f"Is valid: {result.is_valid}")
print(f"Message: {result.message}")

Is valid: True
Message: Supporting text validated successfully in PMID:12345678

Example 2: Working with Validation Results

# The ValidationResult object has several useful attributes
print("ValidationResult attributes:")
print(f"  is_valid: {result.is_valid}")
print(f"  message: {result.message}")
print(f"  reference_id: {result.reference_id}")
print(f"  supporting_text: {result.supporting_text}")

ValidationResult attributes:
  is_valid: True
  message: Supporting text validated successfully in PMID:12345678
  reference_id: PMID:12345678
  supporting_text: TP53 protein functions in cell cycle regulation

Example 3: Batch Validation

# Validate multiple quotes
test_cases = [
    ("TP53 protein functions in cell cycle regulation", "PMID:12345678"),
    ("plays a critical role as a tumor suppressor", "PMID:12345678"),
    ("TP53 regulates cell cycle checkpoints", "PMID:12345678"),
    ("TP53 inhibits apoptosis", "PMID:12345678"),  # This will fail
]

results = []
for quote, ref_id in test_cases:
    result = validator.validate(
        supporting_text=quote,
        reference_id=ref_id
    )
    results.append(result)
    status = "✓" if result.is_valid else "✗"
    print(f"{status} {quote[:50]}...")

print(f"\nTotal: {len(results)}, Passed: {sum(r.is_valid for r in results)}, Failed: {sum(not r.is_valid for r in results)}")

✓ TP53 protein functions in cell cycle regulation...
✓ plays a critical role as a tumor suppressor...
✓ TP53 regulates cell cycle checkpoints...
✗ TP53 inhibits apoptosis...

Total: 4, Passed: 3, Failed: 1
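Once a batch has run, it is often useful to see which quotes failed for which reference. The sketch below groups failures by reference ID. To stay runnable without the library, it uses a hypothetical `StubResult` dataclass that carries the same four attributes printed in Example 2; the message strings are placeholders, not the library's actual messages.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StubResult:
    """Stand-in with the attributes shown in Example 2 (not the library class)."""
    is_valid: bool
    message: str
    reference_id: str
    supporting_text: str


def failures_by_reference(results):
    """Map each reference ID to the quotes that failed validation against it."""
    failed = defaultdict(list)
    for r in results:
        if not r.is_valid:
            failed[r.reference_id].append(r.supporting_text)
    return dict(failed)


sample = [
    StubResult(True, "placeholder: matched", "PMID:12345678",
               "TP53 regulates cell cycle checkpoints"),
    StubResult(False, "placeholder: not matched", "PMID:12345678",
               "TP53 inhibits apoptosis"),
]

print(failures_by_reference(sample))
```

In a real workflow, the `results` list built in the loop above could be passed to `failures_by_reference` directly, assuming its items expose the same attributes.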

Example 4: Using the Reference Fetcher

# The fetcher can be used independently
fetcher = ReferenceFetcher(config)

# Fetch a reference
reference = fetcher.fetch("PMID:12345678")

print(f"Reference: {reference.reference_id}")
print(f"Title: {reference.title}")
print(f"Authors: {reference.authors}")
print(f"Year: {reference.year}")
print(f"Content type: {reference.content_type}")
print(f"Content length: {len(reference.content)} characters")
print(f"\nContent preview:\n{reference.content[:200]}...")

Reference: PMID:12345678
Title: TP53 Functions in Cell Cycle Regulation
Authors: ['Smith J', 'Doe A', 'Johnson K']
Year: 2024
Content type: abstract
Content length: 248 characters

Content preview:
The TP53 protein functions in cell cycle regulation and plays a critical role as a tumor suppressor. 
Studies have shown that TP53 regulates cell cycle checkpoints and DNA repair mechanisms.
Loss of T...

Example 5: Text Normalization

This example shows how text is normalized before matching.

# The normalize_text method is a static method
examples = [
    "TP53 (p53) protein",
    "T-Cell Receptor",
    "DNA-binding domain",
    "α-catenin",
]

print("Text Normalization:")
for text in examples:
    normalized = SupportingTextValidator.normalize_text(text)
    print(f"  {text:30} → {normalized}")

Text Normalization:
  TP53 (p53) protein             → tp53 p53 protein
  T-Cell Receptor                → t cell receptor
  DNA-binding domain             → dna binding domain
  α-catenin                      → alpha catenin
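The output above suggests that normalization lowercases the text, spells out Greek letters, and replaces punctuation with single spaces. The sketch below reproduces those four results with the standard library only. It is an approximation for building intuition, not the library's implementation of `normalize_text`; `approx_normalize` and the partial `GREEK_NAMES` table are hypothetical.

```python
import re

# Hypothetical, partial table; the library's real mapping is not shown in this tutorial.
GREEK_NAMES = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}


def approx_normalize(text: str) -> str:
    """Lowercase, spell out Greek letters, and turn punctuation runs into one space."""
    text = text.lower()
    for letter, name in GREEK_NAMES.items():
        text = text.replace(letter, name)
    # Any run of characters other than a-z or 0-9 becomes a single space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


for example in ["TP53 (p53) protein", "T-Cell Receptor", "DNA-binding domain", "α-catenin"]:
    print(f"  {example:30} → {approx_normalize(example)}")
```

Because both the quote and the reference text go through the same normalization, differences in case, hyphenation, or bracketing do not prevent a match.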

Example 6: Custom Configuration

# Create custom configuration
custom_config = ReferenceValidationConfig(
    cache_dir=str(cache_dir),
    email="your.email@example.com",  # For NCBI Entrez
    # api_key="your_api_key"  # Optional for higher rate limits
)

print("Configuration:")
print(f"  Cache directory: {custom_config.cache_dir}")
print(f"  Email: {custom_config.email}")

Configuration:
  Cache directory: /var/folders/nc/m4tx21912kv1b8nk3zzx9plr0000gn/T/tmpclsqvmjr/reference_cache
  Email: your.email@example.com

Example 7: Error Handling

# Validation returns a result object, not exceptions
# This makes it easy to handle failures

def validate_with_error_handling(validator, quote, ref_id):
    """Example of proper error handling."""
    try:
        result = validator.validate(
            supporting_text=quote,
            reference_id=ref_id
        )
        
        if result.is_valid:
            return {"status": "success", "message": result.message}
        else:
            return {"status": "failed", "message": result.message}
    
    except Exception as e:
        return {"status": "error", "message": str(e)}

# Test it
result = validate_with_error_handling(
    validator,
    "TP53 protein functions in cell cycle regulation",
    "PMID:12345678"
)

print(f"Result: {result}")

Result: {'status': 'success', 'message': 'Supporting text validated successfully in PMID:12345678'}
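The run above only exercises the success path. To see all three outcomes (`success`, `failed`, `error`) without network access or the library installed, the same wrapper can be driven by a stub. In the sketch below, `StubValidator` and `StubResult` are hypothetical stand-ins, and the `"BROKEN"` reference ID is an invented trigger for a simulated exception.

```python
from dataclasses import dataclass


@dataclass
class StubResult:
    is_valid: bool
    message: str


class StubValidator:
    """Hypothetical stand-in; not the library's SupportingTextValidator."""

    def validate(self, supporting_text, reference_id):
        if reference_id == "BROKEN":
            # Simulate an unexpected problem such as a failed fetch.
            raise RuntimeError("simulated fetch failure")
        found = "tumor suppressor" in supporting_text
        return StubResult(found, "found" if found else "not found")


def validate_with_error_handling(validator, quote, ref_id):
    """Return a status dict; only the validate() call is inside the try block."""
    try:
        result = validator.validate(supporting_text=quote, reference_id=ref_id)
    except Exception as exc:
        return {"status": "error", "message": str(exc)}
    status = "success" if result.is_valid else "failed"
    return {"status": status, "message": result.message}


stub = StubValidator()
outcomes = [
    validate_with_error_handling(stub, "TP53 functions as a tumor suppressor", "PMID:1"),
    validate_with_error_handling(stub, "TP53 inhibits apoptosis", "PMID:1"),
    validate_with_error_handling(stub, "any quote", "BROKEN"),
]
for outcome in outcomes:
    print(outcome)
```

Keeping the `try` block limited to the `validate` call means a bug in the surrounding result handling is not silently reported as a validation error.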

Example 8: Collecting Statistics

from collections import defaultdict

# Create multiple test references
(cache_dir / "PMID_11111111.txt").write_text("""ID: PMID:11111111
Title: BRCA1 Function
Authors: Smith J
ContentType: abstract
BRCA1 plays a critical role in DNA repair mechanisms.
""")

(cache_dir / "PMID_22222222.txt").write_text("""ID: PMID:22222222
Title: TP53 Function
Authors: Doe A
ContentType: abstract
TP53 functions as a tumor suppressor.
""")

# Gene annotations to validate
gene_annotations = [
    {
        "gene": "BRCA1",
        "evidence": [
            {"ref": "PMID:11111111", "text": "BRCA1 plays a critical role in DNA repair mechanisms"}
        ]
    },
    {
        "gene": "TP53",
        "evidence": [
            {"ref": "PMID:22222222", "text": "TP53 functions as a tumor suppressor"},
            {"ref": "PMID:12345678", "text": "TP53 regulates cell cycle checkpoints"},
        ]
    }
]

# Collect statistics
stats = {
    "total": 0,
    "passed": 0,
    "failed": 0,
    "by_gene": defaultdict(lambda: {"passed": 0, "failed": 0})
}

for gene_data in gene_annotations:
    gene = gene_data["gene"]
    
    for evidence in gene_data["evidence"]:
        result = validator.validate(
            supporting_text=evidence["text"],
            reference_id=evidence["ref"]
        )
        
        stats["total"] += 1
        if result.is_valid:
            stats["passed"] += 1
            stats["by_gene"][gene]["passed"] += 1
        else:
            stats["failed"] += 1
            stats["by_gene"][gene]["failed"] += 1

# Print summary
print("Validation Statistics:")
print(f"  Total validations: {stats['total']}")
print(f"  Passed: {stats['passed']} ({stats['passed']/stats['total']*100:.1f}%)")
print(f"  Failed: {stats['failed']} ({stats['failed']/stats['total']*100:.1f}%)")
print("\nBy Gene:")
for gene, counts in stats["by_gene"].items():
    total = counts["passed"] + counts["failed"]
    print(f"  {gene}: {counts['passed']}/{total} passed")

Validation Statistics:
  Total validations: 3
  Passed: 1 (33.3%)
  Failed: 2 (66.7%)

By Gene:
  BRCA1: 0/1 passed
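  TP53: 1/2 passed

Note: In this recorded run, only the evidence citing PMID:12345678 passed. Unlike the first test reference, the two references created in this cell have no blank line between the metadata header and the body text, which may be why their quotes were not matched.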
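Statistics collected this way are usually handed to another tool, so a JSON report is a natural next step. The sketch below rebuilds the `stats` structure with the counts from the recorded output and serializes it. `pass_rate` is a hypothetical helper; it also guards against division by zero, which the percentage expressions in the print statements above would hit if no validations were run.

```python
import json
from collections import defaultdict

# Same shape as the stats dict above, filled with the counts from the recorded run.
stats = {
    "total": 3,
    "passed": 1,
    "failed": 2,
    "by_gene": defaultdict(lambda: {"passed": 0, "failed": 0}),
}
stats["by_gene"]["BRCA1"]["failed"] += 1
stats["by_gene"]["TP53"]["passed"] += 1
stats["by_gene"]["TP53"]["failed"] += 1


def pass_rate(passed: int, total: int) -> float:
    """Percentage of passed validations; 0.0 when nothing was validated."""
    return passed / total * 100 if total else 0.0


report = {
    "total": stats["total"],
    "passed": stats["passed"],
    "failed": stats["failed"],
    "pass_rate_percent": round(pass_rate(stats["passed"], stats["total"]), 1),
    # Copy the defaultdict into plain dicts so the report has no lambda attached.
    "by_gene": {gene: dict(counts) for gene, counts in stats["by_gene"].items()},
}

report_json = json.dumps(report, indent=2, sort_keys=True)
print(report_json)
```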

Example 9: Integration Example

A more complete example showing how to integrate validation into an application.

class GeneAnnotationValidator:
    """Example class for validating gene annotations."""
    
    def __init__(self, cache_dir: str):
        config = ReferenceValidationConfig(cache_dir=cache_dir)
        self.validator = SupportingTextValidator(config)
    
    def validate_annotation(self, annotation: dict) -> dict:
        """Validate a single gene annotation.
        
        Args:
            annotation: Dict with 'gene', 'function', and 'evidence' keys
            
        Returns:
            Dict with validation results
        """
        gene = annotation["gene"]
        evidence_list = annotation["evidence"]
        
        results = []
        all_valid = True
        
        for evidence in evidence_list:
            result = self.validator.validate(
                supporting_text=evidence["text"],
                reference_id=evidence["ref"]
            )
            results.append({
                "reference": evidence["ref"],
                "text": evidence["text"],
                "valid": result.is_valid,
                "message": result.message
            })
            all_valid = all_valid and result.is_valid
        
        return {
            "gene": gene,
            "valid": all_valid,
            "evidence_results": results
        }

# Use the validator
gene_validator = GeneAnnotationValidator(cache_dir=str(cache_dir))

annotation = {
    "gene": "TP53",
    "function": "tumor suppressor",
    "evidence": [
        {"ref": "PMID:12345678", "text": "TP53 protein functions in cell cycle regulation"},
        {"ref": "PMID:12345678", "text": "plays a critical role as a tumor suppressor"},
    ]
}

result = gene_validator.validate_annotation(annotation)

print(f"Gene: {result['gene']}")
print(f"Overall valid: {result['valid']}")
print("\nEvidence validation:")
for ev_result in result['evidence_results']:
    status = "✓" if ev_result['valid'] else "✗"
    print(f"  {status} {ev_result['reference']}: {ev_result['text'][:50]}...")

Gene: TP53
Overall valid: True

Evidence validation:
  ✓ PMID:12345678: TP53 protein functions in cell cycle regulation...
  ✓ PMID:12345678: plays a critical role as a tumor suppressor...

Summary

Key Classes

ReferenceValidationConfig - Configuration

config = ReferenceValidationConfig(
    cache_dir="path/to/cache",
    email="your@email.com"
)

SupportingTextValidator - Main validator

validator = SupportingTextValidator(config)
result = validator.validate(
    supporting_text="quote",
    reference_id="PMID:12345678"
)

ReferenceFetcher - Fetch references

fetcher = ReferenceFetcher(config)
reference = fetcher.fetch("PMID:12345678")

When to Use Python API vs CLI

Use the CLI for:

Quick one-off validations
Shell scripting
CI/CD pipelines
Standard LinkML workflows

Use the Python API for:

Building custom applications
Programmatic access to results
Custom validation workflows
Collecting statistics/analytics

Next Steps

Review API Documentation
Explore source code for advanced usage
Check GitHub for examples

Cleanup

import shutil
shutil.rmtree(temp_dir)
print(f"Cleaned up: {temp_dir}")

Cleaned up: /var/folders/nc/m4tx21912kv1b8nk3zzx9plr0000gn/T/tmpclsqvmjr
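Outside a notebook, where setup and cleanup live in the same function or script, the standard library's `tempfile.TemporaryDirectory` context manager removes the directory automatically, even if an exception is raised partway through. A minimal sketch follows; the comment marks where the config and validator from this tutorial would be created.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as temp_dir:
    cache_dir = Path(temp_dir) / "reference_cache"
    cache_dir.mkdir(exist_ok=True)
    existed_inside = cache_dir.is_dir()
    # Create the config and validator here, pointing them at cache_dir,
    # and run all validations before leaving the with block.

# The directory and everything in it are gone once the block exits.
removed_after = not Path(temp_dir).exists()
print(existed_inside, removed_after)
```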