Python API for Programmatic Usage
This tutorial shows how to use the Python API to integrate reference validation into your own applications.
Prerequisites
Review Tutorial 1 and Tutorial 2 for CLI usage patterns.
When to Use the Python API
Use the Python API when you need to:
- Integrate validation into existing Python applications
- Build custom validation workflows
- Collect statistics and programmatic results
- Handle validation errors programmatically
Note: For most use cases, the CLI is simpler and recommended.
Setup
import tempfile
from pathlib import Path
from linkml_reference_validator.validation.supporting_text_validator import SupportingTextValidator
from linkml_reference_validator.models import ReferenceValidationConfig
from linkml_reference_validator.etl.reference_fetcher import ReferenceFetcher
# Create temporary cache directory
temp_dir = tempfile.mkdtemp()
cache_dir = Path(temp_dir) / "reference_cache"
cache_dir.mkdir(exist_ok=True)
print(f"Working directory: {temp_dir}")
print(f"Cache directory: {cache_dir}")
Working directory: /var/folders/nc/m4tx21912kv1b8nk3zzx9plr0000gn/T/tmpclsqvmjr
Cache directory: /var/folders/nc/m4tx21912kv1b8nk3zzx9plr0000gn/T/tmpclsqvmjr/reference_cache
# Create test references
test_cache_file = cache_dir / "PMID_12345678.txt"
test_cache_file.write_text("""ID: PMID:12345678
Title: TP53 Functions in Cell Cycle Regulation
Authors: Smith J, Doe A, Johnson K
Journal: Nature
Year: 2024
DOI: 10.1038/nature12345
ContentType: abstract
The TP53 protein functions in cell cycle regulation and plays a critical role as a tumor suppressor.
Studies have shown that TP53 regulates cell cycle checkpoints and DNA repair mechanisms.
Loss of TP53 function is associated with various cancers.
""")
print("✓ Created test reference: PMID:12345678")
✓ Created test reference: PMID:12345678
Example 1: Basic Validation
# Create configuration
config = ReferenceValidationConfig(cache_dir=str(cache_dir))
# Create validator
validator = SupportingTextValidator(config)
# Validate a quote
result = validator.validate(
    supporting_text="TP53 protein functions in cell cycle regulation",
    reference_id="PMID:12345678"
)
print(f"Is valid: {result.is_valid}")
print(f"Message: {result.message}")
Is valid: True
Message: Supporting text validated successfully in PMID:12345678
Example 2: Working with Validation Results
# The ValidationResult object has several useful attributes
print("ValidationResult attributes:")
print(f" is_valid: {result.is_valid}")
print(f" message: {result.message}")
print(f" reference_id: {result.reference_id}")
print(f" supporting_text: {result.supporting_text}")
ValidationResult attributes:
 is_valid: True
 message: Supporting text validated successfully in PMID:12345678
 reference_id: PMID:12345678
 supporting_text: TP53 protein functions in cell cycle regulation
Example 3: Batch Validation
# Validate multiple quotes
test_cases = [
    ("TP53 protein functions in cell cycle regulation", "PMID:12345678"),
    ("plays a critical role as a tumor suppressor", "PMID:12345678"),
    ("TP53 regulates cell cycle checkpoints", "PMID:12345678"),
    ("TP53 inhibits apoptosis", "PMID:12345678"),  # This will fail
]
results = []
for quote, ref_id in test_cases:
    result = validator.validate(
        supporting_text=quote,
        reference_id=ref_id
    )
    results.append(result)
    status = "✓" if result.is_valid else "✗"
    print(f"{status} {quote[:50]}...")
print(f"\nTotal: {len(results)}, Passed: {sum(r.is_valid for r in results)}, Failed: {sum(not r.is_valid for r in results)}")
✓ TP53 protein functions in cell cycle regulation...
✓ plays a critical role as a tumor suppressor...
✓ TP53 regulates cell cycle checkpoints...
✗ TP53 inhibits apoptosis...

Total: 4, Passed: 3, Failed: 1
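Beyond counting passes and failures, it is often useful to keep the failed items for follow-up. The sketch below splits a list of results on their `is_valid` attribute. It uses `SimpleNamespace` stand-ins so it runs without the library; `split_results` and the `message` strings are hypothetical, and only the attribute names (`is_valid`, `supporting_text`, `message`) come from the examples above.

```python
from types import SimpleNamespace


def split_results(results):
    """Return (passed, failed) lists based on each result's is_valid flag."""
    passed = [r for r in results if r.is_valid]
    failed = [r for r in results if not r.is_valid]
    return passed, failed


# Stand-in objects (not real ValidationResult instances) so the sketch is runnable.
demo_results = [
    SimpleNamespace(is_valid=True,
                    supporting_text="TP53 regulates cell cycle checkpoints",
                    message="placeholder success message"),
    SimpleNamespace(is_valid=False,
                    supporting_text="TP53 inhibits apoptosis",
                    message="placeholder failure message"),
]

passed, failed = split_results(demo_results)
for r in failed:
    # Report each failed quote together with the message attached to it.
    print(f"FAILED: {r.supporting_text!r} ({r.message})")
```

With the real validator, the `results` list built in the loop above could be passed to the same helper.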
Example 4: Using the Reference Fetcher
# The fetcher can be used independently
fetcher = ReferenceFetcher(config)
# Fetch a reference
reference = fetcher.fetch("PMID:12345678")
print(f"Reference: {reference.reference_id}")
print(f"Title: {reference.title}")
print(f"Authors: {reference.authors}")
print(f"Year: {reference.year}")
print(f"Content type: {reference.content_type}")
print(f"Content length: {len(reference.content)} characters")
print(f"\nContent preview:\n{reference.content[:200]}...")
Reference: PMID:12345678
Title: TP53 Functions in Cell Cycle Regulation
Authors: ['Smith J', 'Doe A', 'Johnson K']
Year: 2024
Content type: abstract
Content length: 248 characters

Content preview:
The TP53 protein functions in cell cycle regulation and plays a critical role as a tumor suppressor.
Studies have shown that TP53 regulates cell cycle checkpoints and DNA repair mechanisms.
Loss of T...
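To make the relationship between the cached text file and the fetched fields more concrete, here is a rough sketch of how such a file could be split into header fields and body text. This is an illustration only, not the library's parser: `CachedReference`, `parse_cache_text`, and `HEADER_KEYS` are hypothetical names, and the rule "leading `Key: value` lines are headers, everything after is content" is an assumption based on the test files in this tutorial.

```python
from dataclasses import dataclass, field

# Header keys that appear in the test cache files in this tutorial.
HEADER_KEYS = {"ID", "Title", "Authors", "Journal", "Year", "DOI", "ContentType"}


@dataclass
class CachedReference:
    """Hypothetical stand-in for a fetched reference; not the library's class."""
    reference_id: str = ""
    title: str = ""
    authors: list = field(default_factory=list)
    year: str = ""
    content_type: str = ""
    content: str = ""


def parse_cache_text(text: str) -> CachedReference:
    """Split leading 'Key: value' header lines from the remaining body text."""
    lines = text.splitlines()
    headers = {}
    body_start = 0
    for index, line in enumerate(lines):
        key, sep, value = line.partition(": ")
        if not sep or key not in HEADER_KEYS:
            break  # the first non-header line starts the body
        headers[key] = value.strip()
        body_start = index + 1
    # Authors are stored as a comma-separated string in the test files.
    authors = [a.strip() for a in headers.get("Authors", "").split(",") if a.strip()]
    return CachedReference(
        reference_id=headers.get("ID", ""),
        title=headers.get("Title", ""),
        authors=authors,
        year=headers.get("Year", ""),
        content_type=headers.get("ContentType", ""),
        content="\n".join(lines[body_start:]).strip(),
    )


sample = """ID: PMID:12345678
Title: TP53 Functions in Cell Cycle Regulation
Authors: Smith J, Doe A, Johnson K
Year: 2024
ContentType: abstract
Loss of TP53 function is associated with various cancers.
"""
ref = parse_cache_text(sample)
print(ref.title)
print(ref.authors)
```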
Example 5: Text Normalization
The validator normalizes text before matching. This example shows the effect of that normalization on a few strings.
# The normalize_text method is a static method
examples = [
    "TP53 (p53) protein",
    "T-Cell Receptor",
    "DNA-binding domain",
    "α-catenin",
]
print("Text Normalization:")
for text in examples:
    normalized = SupportingTextValidator.normalize_text(text)
    print(f" {text:30} → {normalized}")
Text Normalization:
 TP53 (p53) protein             → tp53 p53 protein
 T-Cell Receptor                → t cell receptor
 DNA-binding domain             → dna binding domain
 α-catenin                      → alpha catenin
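The output above suggests that normalization lowercases the text, spells out Greek letters, and replaces punctuation with spaces. The following sketch reproduces those four results under that assumption. It is not the library's implementation: `sketch_normalize` and `GREEK_NAMES` are hypothetical names, the Greek mapping is deliberately incomplete, and the real `normalize_text` may handle cases this sketch does not.

```python
import re

# Hypothetical, partial mapping of Greek letters to their spelled-out names.
GREEK_NAMES = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}


def sketch_normalize(text: str) -> str:
    """Approximate the normalization shown above (illustrative only)."""
    text = text.lower()
    # Spell out the Greek letters listed in the mapping.
    for letter, name in GREEK_NAMES.items():
        text = text.replace(letter, name)
    # Replace every run of characters other than a-z and 0-9 with one space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    # Remove spaces left at either end by the substitution.
    return text.strip()


for sample in ["TP53 (p53) protein", "T-Cell Receptor", "DNA-binding domain", "α-catenin"]:
    print(f"{sample:30} -> {sketch_normalize(sample)}")
```

Normalizing both the quote and the reference text the same way is what lets a quote match even when casing or punctuation differs from the source.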
Example 6: Custom Configuration
# Create custom configuration
custom_config = ReferenceValidationConfig(
    cache_dir=str(cache_dir),
    email="your.email@example.com",  # For NCBI Entrez
    # api_key="your_api_key"  # Optional for higher rate limits
)
print("Configuration:")
print(f" Cache directory: {custom_config.cache_dir}")
print(f" Email: {custom_config.email}")
Configuration:
 Cache directory: /var/folders/nc/m4tx21912kv1b8nk3zzx9plr0000gn/T/tmpclsqvmjr/reference_cache
 Email: your.email@example.com
Example 7: Error Handling
# Validation returns a result object, not exceptions
# This makes it easy to handle failures
def validate_with_error_handling(validator, quote, ref_id):
    """Example of proper error handling."""
    try:
        result = validator.validate(
            supporting_text=quote,
            reference_id=ref_id
        )
        if result.is_valid:
            return {"status": "success", "message": result.message}
        else:
            return {"status": "failed", "message": result.message}
    except Exception as e:
        return {"status": "error", "message": str(e)}
# Test it
result = validate_with_error_handling(
    validator,
    "TP53 protein functions in cell cycle regulation",
    "PMID:12345678"
)
print(f"Result: {result}")
Result: {'status': 'success', 'message': 'Supporting text validated successfully in PMID:12345678'}
Example 8: Collecting Statistics
from collections import defaultdict
# Create multiple test references
(cache_dir / "PMID_11111111.txt").write_text("""ID: PMID:11111111
Title: BRCA1 Function
Authors: Smith J
ContentType: abstract
BRCA1 plays a critical role in DNA repair mechanisms.
""")
(cache_dir / "PMID_22222222.txt").write_text("""ID: PMID:22222222
Title: TP53 Function
Authors: Doe A
ContentType: abstract
TP53 functions as a tumor suppressor.
""")
# Gene annotations to validate
gene_annotations = [
    {
        "gene": "BRCA1",
        "evidence": [
            {"ref": "PMID:11111111", "text": "BRCA1 plays a critical role in DNA repair mechanisms"}
        ]
    },
    {
        "gene": "TP53",
        "evidence": [
            {"ref": "PMID:22222222", "text": "TP53 functions as a tumor suppressor"},
            {"ref": "PMID:12345678", "text": "TP53 regulates cell cycle checkpoints"},
        ]
    }
]
# Collect statistics
stats = {
    "total": 0,
    "passed": 0,
    "failed": 0,
    "by_gene": defaultdict(lambda: {"passed": 0, "failed": 0})
}
for gene_data in gene_annotations:
    gene = gene_data["gene"]
    for evidence in gene_data["evidence"]:
        result = validator.validate(
            supporting_text=evidence["text"],
            reference_id=evidence["ref"]
        )
        stats["total"] += 1
        if result.is_valid:
            stats["passed"] += 1
            stats["by_gene"][gene]["passed"] += 1
        else:
            stats["failed"] += 1
            stats["by_gene"][gene]["failed"] += 1
# Print summary
print("Validation Statistics:")
print(f" Total validations: {stats['total']}")
print(f" Passed: {stats['passed']} ({stats['passed']/stats['total']*100:.1f}%)")
print(f" Failed: {stats['failed']} ({stats['failed']/stats['total']*100:.1f}%)")
print("\nBy Gene:")
for gene, counts in stats["by_gene"].items():
    total = counts["passed"] + counts["failed"]
    print(f" {gene}: {counts['passed']}/{total} passed")
Validation Statistics:
 Total validations: 3
 Passed: 1 (33.3%)
 Failed: 2 (66.7%)

By Gene:
 BRCA1: 0/1 passed
 TP53: 1/2 passed
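The percentage calculations above divide by `stats['total']`, which raises `ZeroDivisionError` if no evidence was validated. The sketch below adds a guarded helper and shows that the same `stats` structure can be serialized with the standard `json` module. `pass_rate` is a hypothetical helper name, and the counts are filled in by hand to mirror the recorded output above rather than produced by the validator.

```python
import json
from collections import defaultdict


def pass_rate(passed: int, total: int) -> float:
    """Return the pass percentage, or 0.0 when nothing was validated."""
    return passed / total * 100 if total else 0.0


# Same structure as in the example above, populated by hand for illustration.
stats = {
    "total": 3,
    "passed": 1,
    "failed": 2,
    "by_gene": defaultdict(lambda: {"passed": 0, "failed": 0}),
}
stats["by_gene"]["BRCA1"]["failed"] += 1
stats["by_gene"]["TP53"]["passed"] += 1
stats["by_gene"]["TP53"]["failed"] += 1

print(f"Passed: {pass_rate(stats['passed'], stats['total']):.1f}%")

# defaultdict is a dict subclass, so json.dumps can serialize it directly.
report = json.dumps(stats, indent=2, sort_keys=True)
print(report)
```

A JSON report like this is one way to hand results to other tools, such as a dashboard or a later pipeline step.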
Example 9: Integration Example
This more complete example shows how to integrate validation into an application.
class GeneAnnotationValidator:
    """Example class for validating gene annotations."""

    def __init__(self, cache_dir: str):
        config = ReferenceValidationConfig(cache_dir=cache_dir)
        self.validator = SupportingTextValidator(config)

    def validate_annotation(self, annotation: dict) -> dict:
        """Validate a single gene annotation.

        Args:
            annotation: Dict with 'gene', 'function', and 'evidence' keys

        Returns:
            Dict with validation results
        """
        gene = annotation["gene"]
        evidence_list = annotation["evidence"]
        results = []
        all_valid = True
        for evidence in evidence_list:
            result = self.validator.validate(
                supporting_text=evidence["text"],
                reference_id=evidence["ref"]
            )
            results.append({
                "reference": evidence["ref"],
                "text": evidence["text"],
                "valid": result.is_valid,
                "message": result.message
            })
            all_valid = all_valid and result.is_valid
        return {
            "gene": gene,
            "valid": all_valid,
            "evidence_results": results
        }

# Use the validator
gene_validator = GeneAnnotationValidator(cache_dir=str(cache_dir))
annotation = {
    "gene": "TP53",
    "function": "tumor suppressor",
    "evidence": [
        {"ref": "PMID:12345678", "text": "TP53 protein functions in cell cycle regulation"},
        {"ref": "PMID:12345678", "text": "plays a critical role as a tumor suppressor"},
    ]
}
result = gene_validator.validate_annotation(annotation)
print(f"Gene: {result['gene']}")
print(f"Overall valid: {result['valid']}")
print("\nEvidence validation:")
for ev_result in result['evidence_results']:
    status = "✓" if ev_result['valid'] else "✗"
    print(f" {status} {ev_result['reference']}: {ev_result['text'][:50]}...")
Gene: TP53
Overall valid: True

Evidence validation:
 ✓ PMID:12345678: TP53 protein functions in cell cycle regulation...
 ✓ PMID:12345678: plays a critical role as a tumor suppressor...
Summary
Key Classes
ReferenceValidationConfig - Configuration
config = ReferenceValidationConfig(
    cache_dir="path/to/cache",
    email="your@email.com"
)
SupportingTextValidator - Main validator
validator = SupportingTextValidator(config)
result = validator.validate(
    supporting_text="quote",
    reference_id="PMID:12345678"
)
ReferenceFetcher - Fetch references
fetcher = ReferenceFetcher(config)
reference = fetcher.fetch("PMID:12345678")
When to Use Python API vs CLI
Use the CLI for:
- Quick one-off validations
- Shell scripting
- CI/CD pipelines
- Standard LinkML workflows
Use the Python API for:
- Building custom applications
- Programmatic access to results
- Custom validation workflows
- Collecting statistics and analytics
Next Steps
- Review API Documentation
- Explore source code for advanced usage
- Check GitHub for examples
Cleanup
import shutil
shutil.rmtree(temp_dir)
print(f"Cleaned up: {temp_dir}")
Cleaned up: /var/folders/nc/m4tx21912kv1b8nk3zzx9plr0000gn/T/tmpclsqvmjr
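As an alternative to pairing `tempfile.mkdtemp()` with a manual `shutil.rmtree()`, the standard library's `tempfile.TemporaryDirectory` context manager removes the directory automatically when the block exits, even if an error occurs inside it. The sketch below shows only the directory handling; the variable names are illustrative, and the validator setup from the earlier examples would go where the comment indicates.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_cache_dir = Path(tmp) / "reference_cache"
    demo_cache_dir.mkdir()
    # Write a placeholder cache file, as in the setup section.
    (demo_cache_dir / "PMID_12345678.txt").write_text("ID: PMID:12345678\n")
    existed_inside = demo_cache_dir.exists()
    # Create the configuration and run validations here, while the directory exists.

# After the with-block, the directory and its contents have been removed.
exists_after = Path(tmp).exists()
print(f"Existed inside block: {existed_inside}")
print(f"Exists after block: {exists_after}")
```

This fits scripts and tests well. In a notebook, where cells run one at a time, the explicit `mkdtemp()` and `rmtree()` pair used in this tutorial is often more convenient.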