Validating OBO Format Files¶
This tutorial demonstrates how to validate supporting text in OBO format ontology files using the validate text-file command.
Background¶
OBO format ontologies can include axiom annotations containing supporting text from publications. For example:
[Term]
id: GO:0043263
name: cellulosome
def: "An extracellular multi-enzyme complex..." [PMID:11601609] {
ex:supporting_text="a unique extracellular multi-enzyme complex,
called cellulosome[PMID:11601609]"
}
The validate text-file command extracts these annotations using regular expressions and validates them against the referenced publications.
Setup: Create a Sample OBO File¶
Let's create a sample OBO file with the cellulosome example from the GO database:
%%bash
# Create a sample OBO file with axiom annotations
cat > cellulosome_example.obo << 'EOF'
format-version: 1.2
idspace: ex https://example.org/
[Term]
id: GO:0043263
name: cellulosome
namespace: cellular_component
alt_id: GO:1990296
def: "An extracellular multi-enzyme complex containing up to 11 different enzymes aligned on a non-catalytic scaffolding glycoprotein. Functions to hydrolyze cellulose." [PMID:11601609] {ex:supporting_text="a unique extracellular multi-enzyme complex, called cellulosome [containing] up to 11 different enzymes [which] are aligned on the non-catalytic scaffolding protein[PMID:11601609]"}
synonym: "scaffoldin complex" NARROW []
xref: Wikipedia:Cellulosome
EOF
cat cellulosome_example.obo
Part 1: Basic OBO Validation¶
The validate text-file command uses regex patterns to extract supporting text and reference IDs from text files.
Command Structure¶
linkml-reference-validator validate text-file <file> \
--regex <pattern> \
--text-group <number> \
--ref-group <number>
For OBO files with ex:supporting_text annotations, we use this regex:
- Pattern: ex:supporting_text="([^"]*)\[(\S+:\S+)\]"
- Group 1 (text-group): Captures the supporting text
- Group 2 (ref-group): Captures the reference ID (e.g., PMID:11601609)
%%bash
# Validate the OBO file
linkml-reference-validator validate text-file cellulosome_example.obo \
--regex 'ex:supporting_text="([^"]*)\[(\S+:\S+)\]"' \
--text-group 1 \
--ref-group 2
What happened?
- The tool extracted the supporting text from the axiom annotation
- It fetched PMID:11601609 from PubMed (and cached it)
- It validated that the supporting text appears in the reference
- ✅ Validation passed! The quote is authentic.
Note that the supporting text includes editorial brackets [containing] and [which] - these are automatically ignored during matching.
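To build intuition for why the bracketed insertions do not break the match, here is a rough sketch of one possible normalization. It is an assumption made for illustration, not the validator's actual algorithm: it drops every `[...]` span with `sed` and collapses the leftover whitespace.

```shell
# Illustrative only: strip editorial insertions such as [containing] or [which].
# This is NOT the tool's real implementation, just one way to picture it.
quote='a unique extracellular multi-enzyme complex, called cellulosome [containing] up to 11 different enzymes [which] are aligned'

normalized=$(printf '%s\n' "$quote" \
  | sed -E 's/\[[^]]*\]//g' \
  | tr -s ' ')

echo "$normalized"
```

After this step only the words taken from the source remain, which is the part a comparison against the publication would need to care about.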
Part 2: Understanding the Regex Pattern¶
Let's break down the regex pattern:
ex:supporting_text="([^"]*)\[(\S+:\S+)\]"
- ex:supporting_text=" - Literal match for the annotation property
- ([^"]*) - Group 1: Captures everything before the final [, excluding quotes
- \[ - Literal [ character (escaped)
- (\S+:\S+) - Group 2: Captures the reference ID (format: PREFIX:ID)
- \] - Literal ] character (escaped)
- " - Closing quote
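The two capture groups can be tried offline with `sed` before running the validator. This is a minimal sketch under two assumptions: the sample line is a shortened stand-in for a real annotation, and `\S` is spelled `[^[:space:]]` so the pattern also works as a POSIX extended regex.

```shell
# Shortened stand-in for one annotated definition line.
line='def: "A complex." [PMID:11601609] {ex:supporting_text="called cellulosome [containing] up to 11 enzymes[PMID:11601609]"}'

# Same structure as the tutorial pattern, with \S written as [^[:space:]].
re='ex:supporting_text="([^"]*)\[([^[:space:]]+:[^[:space:]]+)\]"'

# Print group 1 (supporting text) and group 2 (reference ID) separately.
text=$(printf '%s\n' "$line" | sed -E -n "s/.*${re}.*/\1/p")
ref=$(printf '%s\n' "$line" | sed -E -n "s/.*${re}.*/\2/p")

echo "text: $text"
echo "ref:  $ref"
```

Because group 1 is greedy, the editorial `[containing]` stays inside the captured text and only the final `[PREFIX:ID]` is treated as the reference.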
Let's test pattern extraction without validation:
%%bash
# Show the full annotation value (supporting text plus trailing reference) with grep
grep -oP 'ex:supporting_text="\K[^"]*(?=")' cellulosome_example.obo | head -1
Part 3: Multiple Terms in One File¶
Let's add more terms to see batch validation:
%%bash
# Create a larger OBO file with multiple terms
cat > multi_term_example.obo << 'EOF'
format-version: 1.2
idspace: ex https://example.org/
[Term]
id: GO:0043263
name: cellulosome
def: "An extracellular multi-enzyme complex containing up to 11 different enzymes aligned on a non-catalytic scaffolding glycoprotein." [PMID:11601609] {ex:supporting_text="a unique extracellular multi-enzyme complex, called cellulosome [containing] up to 11 different enzymes [which] are aligned on the non-catalytic scaffolding protein[PMID:11601609]"}
xref: Wikipedia:Cellulosome
[Term]
id: GO:0005737
name: cytoplasm
def: "The contents of a cell excluding the plasma membrane and nucleus, but including other subcellular structures." [PMID:9974395]
[Term]
id: GO:0016020
name: membrane
def: "A lipid bilayer along with all the proteins and protein complexes embedded in it and attached to it." [PMID:21258405] {ex:supporting_text="The membrane is composed of a lipid bilayer[PMID:21258405]"}
EOF
Created multi_term_example.obo with 3 terms:
- GO:0043263 (cellulosome): has supporting text annotation
- GO:0005737 (cytoplasm): no supporting text annotation
- GO:0016020 (membrane): has supporting text annotation
%%bash
# Validate the file with multiple terms
linkml-reference-validator validate text-file multi_term_example.obo \
--regex 'ex:supporting_text="([^"]*)\[(\S+:\S+)\]"' \
--text-group 1 \
--ref-group 2
Key observations:
- Only lines with ex:supporting_text annotations are validated
- Lines without the annotation are silently skipped (GO:0005737)
- Each match shows the line number for easy reference
- Both validations are reported in the summary
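To check in advance which lines the validator will pick up, a plain `grep` over the file is enough. The sketch below uses a small throwaway file with placeholder references (`PMID:1` to `PMID:3` are not real citations) so it can run without network access.

```shell
# Build a tiny stand-in file: two annotated definitions and one without.
tmpfile=$(mktemp)
cat > "$tmpfile" << 'EOF'
def: "First." [PMID:1] {ex:supporting_text="first quote[PMID:1]"}
def: "Second." [PMID:2]
def: "Third." [PMID:3] {ex:supporting_text="third quote[PMID:3]"}
EOF

# -n prefixes each hit with its line number; -c counts matching lines.
grep -n 'ex:supporting_text="' "$tmpfile"
matched=$(grep -c 'ex:supporting_text="' "$tmpfile")
total=$(wc -l < "$tmpfile" | tr -d ' ')
echo "$matched of $total lines carry an annotation"

rm -f "$tmpfile"
```

Running the same `grep -n` against your own ontology gives the line numbers you should expect to see in the validation report.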
Part 4: Summary Mode¶
For large files, use --summary to see only the overall statistics:
%%bash
# Summary mode - only shows statistics
linkml-reference-validator validate text-file multi_term_example.obo \
--regex 'ex:supporting_text="([^"]*)\[(\S+:\S+)\]"' \
--summary
Part 5: Verbose Mode¶
Use --verbose to see detailed matching information:
%%bash
# Verbose mode - shows detailed matching
linkml-reference-validator validate text-file cellulosome_example.obo \
--regex 'ex:supporting_text="([^"]*)\[(\S+:\S+)\]"' \
--verbose
%%bash
# Use a custom cache directory
mkdir -p obo_references_cache
linkml-reference-validator validate text-file cellulosome_example.obo \
--regex 'ex:supporting_text="([^"]*)\[(\S+:\S+)\]"' \
--cache-dir obo_references_cache
echo ""
echo "Cache contents:"
ls -lh obo_references_cache/
%%bash
# View the cached reference
cache_path=$(linkml-reference-validator cache lookup PMID:11601609)
echo "Cached reference metadata:"
head -20 "$cache_path"
echo ""
echo "..."
echo ""
echo "Abstract excerpt:"
tail -10 "$cache_path"
%%bash
# Create OBO with different annotation property
cat > custom_annotation.obo << 'EOF'
format-version: 1.2
[Term]
id: GO:0043263
name: cellulosome
def: "An extracellular multi-enzyme complex." [PMID:11601609]
property_value: evidence_text "cellulosome is a multi-enzyme complex" PMID:11601609
EOF
echo "✅ Created custom_annotation.obo"
cat custom_annotation.obo
%%bash
# Validate with custom regex pattern
linkml-reference-validator validate text-file custom_annotation.obo \
--regex 'evidence_text "([^"]+)" (\S+:\S+)' \
--text-group 1 \
--ref-group 2
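When adapting the pattern to a new annotation style, it helps to preview what each group captures before validating anything. The helper below is hypothetical (`preview_matches` is not part of linkml-reference-validator) and assumes the text is in group 1 and the reference in group 2, with `\S` written as `[^[:space:]]` for `sed -E`.

```shell
# Hypothetical helper: print "reference | text" for every line of a file that
# matches an extended regex with the text in group 1 and the reference in group 2.
preview_matches() {
  file=$1
  pattern=$2
  sed -E -n "s/.*${pattern}.*/\2 | \1/p" "$file"
}

# Try it on a one-line sample that uses the property_value style.
sample=$(mktemp)
cat > "$sample" << 'EOF'
property_value: evidence_text "cellulosome is a multi-enzyme complex" PMID:11601609
EOF

result=$(preview_matches "$sample" 'evidence_text "([^"]+)" ([^[:space:]]+:[^[:space:]]+)')
echo "$result"
rm -f "$sample"
```

If the preview prints nothing, the pattern does not match and the validator would have nothing to check either.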
Part 9: Testing with Invalid Text¶
Let's see what happens when supporting text doesn't match the reference:
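The command in this part expects a file named invalid_example.obo. If you do not already have one, the sketch below creates it. The supporting text is deliberately invented for this illustration (an assumption, not a quote from PMID:11601609), so the validator should be expected to reject it.

```shell
# Hypothetical failing example: the quoted text below is made up on purpose
# and is not expected to appear in PMID:11601609.
cat > invalid_example.obo << 'EOF'
format-version: 1.2
idspace: ex https://example.org/

[Term]
id: GO:0043263
name: cellulosome
def: "An extracellular multi-enzyme complex." [PMID:11601609] {ex:supporting_text="the cellulosome is an intracellular organelle that synthesizes cellulose[PMID:11601609]"}
EOF

cat invalid_example.obo
```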
%%bash
# This should fail validation
linkml-reference-validator validate text-file invalid_example.obo \
--regex 'ex:supporting_text="([^"]*)\[(\S+:\S+)\]"' \
|| true
Expected result: Validation should fail and report that the text doesn't match the reference.
%%bash
# 1. Agent adds new terms with supporting text to OBO file
cat > agent_additions.obo << 'EOF'
format-version: 1.2
idspace: ex https://example.org/
[Term]
id: GO:0043263
name: cellulosome
def: "An extracellular multi-enzyme complex containing up to 11 different enzymes aligned on a non-catalytic scaffolding glycoprotein." [PMID:11601609] {ex:supporting_text="a unique extracellular multi-enzyme complex, called cellulosome [containing] up to 11 different enzymes [which] are aligned on the non-catalytic scaffolding protein[PMID:11601609]"}
EOF
# 2. Validate before committing
echo "Step 1: Validate agent additions..."
if linkml-reference-validator validate text-file agent_additions.obo \
--regex 'ex:supporting_text="([^"]*)\[(\S+:\S+)\]"' \
--summary; then
echo ""
echo "✅ Step 2: All validations passed - safe to commit!"
echo "✅ Step 3: git add agent_additions.obo"
echo "✅ Step 4: git commit -m 'Add cellulosome with validated supporting text'"
else
echo ""
echo "❌ Step 2: Validation failed - review before committing!"
exit 1
fi
Part 11: Integration with CI/CD¶
You can use exit codes for automation:
%%bash
# Example CI/CD script
cat > validate_obo.sh << 'EOF'
#!/bin/bash
# Test the validator's exit status directly in the `if`. With `set -e`, a
# failing command would end the script before the failure message is printed.
echo "Validating OBO file supporting text..."
if linkml-reference-validator validate text-file "$1" \
--regex 'ex:supporting_text="([^"]*)\[(\S+:\S+)\]"' \
--summary; then
echo "✅ All supporting text validated successfully"
exit 0
else
echo "❌ Supporting text validation failed"
echo "Please review the errors above before merging."
exit 1
fi
EOF
chmod +x validate_obo.sh
echo "✅ Created validate_obo.sh script"
echo ""
echo "Usage: ./validate_obo.sh your_ontology.obo"
%%bash
# Test the script
./validate_obo.sh cellulosome_example.obo
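When a script needs to print its own failure message, the exit status has to be tested somewhere `set -e` cannot abort first. The sketch below shows that control flow with a stand-in command, so it runs without the validator installed; `fake_validator` and `check` are hypothetical names used only for this example.

```shell
# Stand-in for the validator so the control flow can be tried offline.
# (fake_validator is hypothetical; it succeeds only for the argument "good".)
fake_validator() {
  [ "$1" = "good" ]
}

check() {
  # Testing the command inside `if` keeps the failure branch reachable,
  # even in scripts that enable `set -e`.
  if fake_validator "$1"; then
    echo "passed"
    return 0
  else
    echo "failed"
    return 1
  fi
}

check good
check bad || echo "exit status was $?"
```

A CI system only looks at the final exit status, so returning non-zero from the failure branch is what actually blocks a merge.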
%%bash
linkml-reference-validator validate text-file --help
Summary¶
In this tutorial, we learned:
✅ Basic OBO validation - Extract and validate supporting text from axiom annotations
✅ Regex patterns - Use custom patterns for different annotation formats
✅ Batch processing - Validate multiple terms in one command
✅ Summary mode - Quick statistics for large files
✅ Verbose mode - Detailed matching information for debugging
✅ Cache management - Organize downloaded references
✅ Error detection - Identify hallucinated or incorrect supporting text
✅ CI/CD integration - Automated validation in workflows
Key Command¶
linkml-reference-validator validate text-file ontology.obo \
--regex 'ex:supporting_text="([^"]*)\[(\S+:\S+)\]"' \
--text-group 1 \
--ref-group 2
Use Cases¶
- AI-generated content - Validate agent-added definitions before committing
- Quality control - Batch validate existing annotations
- Pre-commit hooks - Prevent hallucinated text from entering the repository
- Curation workflows - Verify supporting evidence during manual curation
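For the pre-commit use case, a hook could look roughly like the sketch below. It rests on several assumptions: the repository tracks `*.obo` files, `linkml-reference-validator` is on the PATH, and the regex matches your annotation style. The hook is written to a temporary directory here; in a real repository it would live at `.git/hooks/pre-commit`.

```shell
# Write the hook somewhere harmless for this demonstration.
hook_dir=$(mktemp -d)   # in a real repository this would be .git/hooks

cat > "$hook_dir/pre-commit" << 'EOF'
#!/bin/sh
# Validate every staged OBO file; stop at the first failure so the
# non-zero exit status makes git abort the commit.
git diff --cached --name-only --diff-filter=ACM -- '*.obo' |
while IFS= read -r file; do
  linkml-reference-validator validate text-file "$file" \
    --regex 'ex:supporting_text="([^"]*)\[(\S+:\S+)\]"' \
    --summary || exit 1
done
EOF

chmod +x "$hook_dir/pre-commit"

# Syntax-check the generated hook without executing it.
sh -n "$hook_dir/pre-commit" && echo "hook written to $hook_dir/pre-commit"
```

The loop is the last command in the hook, so its exit status becomes the hook's exit status and a single failing file is enough to stop the commit.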
Next Steps¶
Cleanup (Optional)¶
%%bash
# Clean up example files
rm -f cellulosome_example.obo multi_term_example.obo custom_annotation.obo
rm -f invalid_example.obo agent_additions.obo validate_obo.sh
rm -rf obo_references_cache
echo "✅ Cleaned up example files"