Validating GEO Dataset References¶
This tutorial demonstrates how to validate supporting text quotes against Gene Expression Omnibus (GEO) datasets.
What is GEO?¶
GEO (Gene Expression Omnibus) is NCBI's public repository for gene expression and other functional genomics data. This tutorial covers two accession types:
- GSE accessions: GEO Series (collections of related samples)
- GDS accessions: GEO DataSets (curated, analysis-ready datasets)
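The two prefixes can be told apart with a simple pattern check. The sketch below is illustrative only: `classify_geo_accession` is a hypothetical helper, not part of linkml-reference-validator, and it assumes accessions consist of a `GSE` or `GDS` prefix followed by digits, as in the examples used in this tutorial.

```python
import re
from typing import Optional

# Assumption: accessions are a "GSE" or "GDS" prefix followed by digits,
# matching the examples in this tutorial (e.g. GSE67472, GDS1234).
_ACCESSION_RE = re.compile(r"^(GSE|GDS)(\d+)$")

_KIND_BY_PREFIX = {"GSE": "series", "GDS": "dataset"}


def classify_geo_accession(accession: str) -> Optional[str]:
    """Return "series" or "dataset", or None if the pattern does not match."""
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    return _KIND_BY_PREFIX[match.group(1)]


print(classify_geo_accession("GSE67472"))
print(classify_geo_accession("GDS1234"))
print(classify_geo_accession("PMID:12345"))
```

A check like this can catch typos in an accession before any network request is made.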
The linkml-reference-validator can fetch and validate quotes against GEO dataset metadata.
%%bash
# Cache the GEO dataset
linkml-reference-validator cache reference GEO:GSE67472
Validating Text Against GEO Dataset¶
Now let's validate a quote from the dataset's description:
%%bash
# Validate text that should be in the dataset description
linkml-reference-validator validate text \
  "Airway epithelial" \
  GEO:GSE67472
echo "Validation complete!"
Validation Failure Example¶
What happens when the text isn't in the dataset?
%%bash
# This text is NOT in GSE67472
linkml-reference-validator validate text \
  "This is completely unrelated to the dataset" \
  GEO:GSE67472 \
  || echo "Validation failed - text not found!"
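The same check can be scripted from Python by shelling out to the CLI. This is a minimal sketch: the helper names are hypothetical, and it assumes the CLI exits with status 0 when the quote is found and a non-zero status otherwise, which is what the `|| echo` fallback above implies.

```python
import subprocess
from typing import List


def build_validate_command(quote: str, reference: str) -> List[str]:
    """Build the argument list for the `validate text` command shown above."""
    return ["linkml-reference-validator", "validate", "text", quote, reference]


def quote_is_supported(quote: str, reference: str) -> bool:
    """Run the CLI and report whether it succeeded.

    Assumption: exit status 0 means the quote was found in the reference;
    any other status means validation failed.
    """
    completed = subprocess.run(
        build_validate_command(quote, reference),
        capture_output=True,
        text=True,
        check=False,  # inspect the return code ourselves instead of raising
    )
    return completed.returncode == 0


# Building the command does not touch the network or require the CLI.
print(build_validate_command("Airway epithelial", "GEO:GSE67472"))
```

Passing the arguments as a list avoids shell quoting problems when a quote contains spaces or punctuation.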
Part 2: Python API¶
You can also use the Python API directly:
from linkml_reference_validator.models import ReferenceValidationConfig
from linkml_reference_validator.etl.sources.entrez import GEOSource
# Create config
config = ReferenceValidationConfig(
    email="your-email@example.com",  # Required by NCBI
    rate_limit_delay=0.5,  # Be respectful to the API
)
# Create the GEO source
source = GEOSource()
# Fetch the dataset
result = source.fetch("GSE67472", config)
if result:
    print(f"Reference ID: {result.reference_id}")
    print(f"Title: {result.title}")
    print(f"Content type: {result.content_type}")
    print(f"Entrez UID: {result.metadata.get('entrez_uid')}")
    print(f"\nContent preview:\n{result.content[:500]}...")
else:
    print("Failed to fetch dataset")
Validation with Python API¶
from linkml_reference_validator.validation.supporting_text_validator import SupportingTextValidator
from linkml_reference_validator.models import ReferenceValidationConfig
from pathlib import Path
# Create config with cache directory
config = ReferenceValidationConfig(
    cache_dir=Path("references_cache"),
    email="your-email@example.com",
    rate_limit_delay=0.5,
)
# Create validator
validator = SupportingTextValidator(config)
# Validate some text
result = validator.validate(
    "airway epithelial",  # Text to validate
    "GEO:GSE67472",  # Reference
)
print(f"Valid: {result.is_valid}")
print(f"Severity: {result.severity}")
print(f"Message: {result.message}")
if result.match_result:
    print(f"Found: {result.match_result.found}")
    print(f"Matched text: {result.match_result.matched_text}")
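This tutorial does not spell out how a quote is compared with the reference content. As a purely illustrative sketch, and not the validator's actual implementation, the idea of quote matching can be pictured as a case-insensitive, whitespace-normalized substring test. The helper names and the sample string below are invented for the example; the sample is not the real GSE67472 description.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_in_content(quote: str, content: str) -> bool:
    """Hypothetical helper: True if the normalized quote occurs in the content."""
    return normalize(quote) in normalize(content)


# Invented sample text, used only to exercise the helper.
sample_content = "Airway epithelial   brushings were\nprofiled by microarray."

print(quote_in_content("airway epithelial brushings", sample_content))
print(quote_in_content("unrelated text", sample_content))
```

Normalizing both sides before comparing means line breaks and capitalization in the cached content do not cause spurious mismatches in this simplified model.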
Part 3: How GEO Fetching Works¶
The Accession to UID Conversion¶
GEO accessions (like GSE67472) cannot be used directly with NCBI's esummary API. GEOSource automatically converts accessions to numeric UIDs in two steps:
- esearch: Searches for the accession and returns the numeric UID
- esummary: Uses the UID to fetch the dataset metadata
You can see this in action:
from Bio import Entrez
Entrez.email = "your-email@example.com"
# Step 1: Convert accession to UID via esearch
handle = Entrez.esearch(db="gds", term="GSE67472[Accession]")
search_result = Entrez.read(handle)
handle.close()
print("Accession: GSE67472")
print(f"UID(s) found: {search_result['IdList']}")
if search_result['IdList']:
    uid = search_result['IdList'][0]
    # Step 2: Fetch summary using UID
    handle = Entrez.esummary(db="gds", id=uid)
    summary = Entrez.read(handle)
    handle.close()
    if summary:
        record = summary[0]
        print(f"\nDataset Title: {record.get('title')}")
        print(f"Platform: {record.get('GPL')}")
        print(f"Samples: {record.get('n_samples')}")
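Biopython's `Entrez.esearch` and `Entrez.esummary` wrap NCBI's E-utilities HTTP endpoints. As a rough sketch, the same two requests can be written as plain URLs, which is handy for debugging in a browser or with `curl`. The function names are illustrative, the `retmode=json` parameter is an optional choice made here, and the UID passed to `esummary_url` is a placeholder rather than the real UID for GSE67472.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(accession: str) -> str:
    """Step 1: URL that searches the gds database for an accession."""
    params = {"db": "gds", "term": f"{accession}[Accession]", "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def esummary_url(uid: str) -> str:
    """Step 2: URL that fetches the summary record for a numeric UID."""
    params = {"db": "gds", "id": uid, "retmode": "json"}
    return f"{EUTILS_BASE}/esummary.fcgi?{urlencode(params)}"


print(esearch_url("GSE67472"))
print(esummary_url("12345"))  # placeholder UID, not a real lookup result
```

Only the URLs are built here; no request is sent, so the sketch runs without network access.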
GSE (GEO Series)¶
GSE accessions are collections of related samples:
%%bash
# Fetch a GSE series
linkml-reference-validator cache reference GEO:GSE67472
GDS (GEO DataSet)¶
GDS accessions are curated, analysis-ready datasets:
%%bash
# Fetch a GDS dataset
linkml-reference-validator cache reference GEO:GDS1234
Part 5: Viewing Cached References¶
Cached GEO references are stored as Markdown files:
%%bash
# List GEO references in cache
ls -lh references_cache/GEO_* 2>/dev/null || echo "No GEO references cached yet"
%%bash
# View a cached GEO reference (if it exists)
cache_path=$(linkml-reference-validator cache lookup GEO:GSE67472 2>/dev/null || true)
if [ -n "$cache_path" ] && [ -f "$cache_path" ]; then
head -30 "$cache_path"
else
echo "GEO:GSE67472 not found in cache"
fi
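The cache can also be inspected from Python. The following is a small sketch using `pathlib`; `list_cached_geo_references` is a hypothetical helper, and it assumes cached files follow the `GEO_*` naming pattern used by the `ls` command above, without assuming a particular file extension.

```python
from pathlib import Path
from typing import List


def list_cached_geo_references(cache_dir: Path) -> List[Path]:
    """Return cached files whose names start with "GEO_", sorted by path.

    Assumption: cache files follow the GEO_* naming pattern; a missing
    cache directory is treated as an empty cache.
    """
    if not cache_dir.is_dir():
        return []
    return sorted(p for p in cache_dir.glob("GEO_*") if p.is_file())


# Prints nothing if the cache directory does not exist yet.
for path in list_cached_geo_references(Path("references_cache")):
    print(path.name, path.stat().st_size, "bytes")
```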
Summary¶
In this tutorial, we learned:
- GEO accession types: GSE (Series) and GDS (DataSets)
- CLI usage: cache reference GEO:GSExxxxx and validate text "..." GEO:GSExxxxx
- Python API: Using GEOSource and SupportingTextValidator
- How it works: Automatic accession-to-UID conversion via esearch
Next Steps¶
- Validate quotes in your own data files with validate data
- See Tutorial 3: Python API for more programmatic usage
- Check out Tutorial 4: OBO Validation for ontology validation