Validating URL References
This guide explains how to validate references that use URLs instead of traditional identifiers like PMIDs or DOIs.
Overview
The linkml-reference-validator supports validating references that point to web content, such as:
- Book chapters hosted online
- Educational resources
- Documentation pages
- Blog posts or articles
- Any static web content
When a reference field contains a URL, the validator:
- Fetches the web page content
- Extracts the page title from
<title>tag (for HTML) - Caches the content for future validations
- Validates your supporting text against the page content
URL Format
Use the url: prefix to specify URL references (the prefix is optional for bare
HTTP/HTTPS URLs and will be normalized to url: internally):
my_field:
value: "Some text from the web page..."
references:
- "url:https://example.com/book/chapter1"
Or via CLI:
linkml-reference-validator validate text \
"Some text from the web page" \
url:https://example.com/book/chapter1
Example
Suppose you have an online textbook chapter at https://example.com/biology/cell-structure with the following content:
<html>
<head>
<title>Chapter 3: Cell Structure and Function</title>
</head>
<body>
<h1>Cell Structure and Function</h1>
<p>The cell is the basic structural and functional unit of all living organisms.</p>
<p>Cells contain various organelles that perform specific functions...</p>
</body>
</html>
You can validate text extracted from this chapter:
linkml-reference-validator validate text \
"The cell is the basic structural and functional unit of all living organisms" \
url:https://example.com/biology/cell-structure
How URL Validation Works
1. Content Fetching
When the validator encounters a URL reference, it:
- Makes an HTTP GET request to fetch the page
- Uses a polite user agent header identifying the tool
- Respects rate limiting (configurable via
rate_limit_delay) - Handles timeouts (default 30 seconds)
2. Content Storage
The fetcher stores:
- Title: Extracted from the
<title>tag (for HTML pages) - Content: The raw page content as received
- Content type: Marked as
urlto distinguish from other reference types
Note: The validator stores raw page content without HTML-to-text conversion.
HTML tags remain in the cached file, and tag names can surface during
normalization. If validation fails because tags interrupt the text, consider
extracting plain text and validating against a local file: reference instead.
3. Caching
Fetched URL content is cached to disk in markdown format with YAML frontmatter:
---
reference_id: url:https://example.com/biology/cell-structure
title: "Chapter 3: Cell Structure and Function"
content_type: url
---
# Chapter 3: Cell Structure and Function
## Content
<html>
<head>
<title>Chapter 3: Cell Structure and Function</title>
</head>
...
Cache files are stored in the configured cache directory (default: references_cache/).
Configuration
URL fetching behavior can be configured:
# config.yaml
rate_limit_delay: 0.5 # Wait 0.5 seconds between requests
email: "your-email@example.com" # Used in user agent
cache_dir: ".cache/references" # Where to cache fetched content
Or via command-line:
linkml-reference-validator validate \
--cache-dir .cache \
--rate-limit-delay 0.5 \
my-data.yaml
Limitations
Static Content Only
URL validation is designed for static web pages. It may not work well with:
- Dynamic content loaded via JavaScript
- Pages requiring authentication
- Content behind paywalls
- Frequently changing content
Raw Content
The validator stores raw page content. For HTML pages:
- HTML tags are preserved in the cache
- The text normalization during validation handles most cases
- Complex HTML layouts may require careful text extraction
No Rendering
The fetcher downloads raw HTML and parses it directly. It does not:
- Execute JavaScript
- Render the page in a browser
- Handle dynamic content
Best Practices
1. Use Stable URLs
Choose URLs that are unlikely to change:
- Versioned documentation:
https://docs.example.com/v1.0/chapter1 - Archived content:
https://archive.example.com/2024/article - Avoid URLs with session parameters
2. Verify Content Quality
After adding a URL reference, verify the extracted content:
# Check what was extracted
linkml-reference-validator cache lookup url:https://example.com/page --content
Ensure the cached content contains the text you're referencing.
3. Cache Management
- Commit cache files to version control for reproducibility
- Use
linkml-reference-validator cache reference url:https://... --forceto update cached content when pages change - Periodically review cached URLs to ensure they're still accessible
4. Mix Reference Types
URL references work alongside PMIDs and DOIs:
findings:
value: "Multiple studies confirm this relationship"
references:
- "PMID:12345678" # Research paper
- "DOI:10.1234/journal.article" # Another paper
- "url:https://example.com/textbook/chapter5" # Textbook chapter
Troubleshooting
URL Not Fetching
If URL content isn't being fetched:
- Check network connectivity
- Verify the URL is accessible in a browser
- Check for rate limiting or IP blocks
- Look for error messages in the logs
Validation Failing
If validation fails for URL references:
- Check the cached content to see what was extracted
- Verify your supporting text actually appears on the page
- Check for whitespace or formatting differences
- Consider if the page content has changed since caching
Force Refresh
To re-fetch content for a URL that may have changed, refresh the cache before validating:
linkml-reference-validator cache reference url:https://example.com/page --force
linkml-reference-validator validate text \
"Updated content" \
url:https://example.com/page
Comparison with Other Reference Types
| Feature | PMID | DOI | URL | file |
|---|---|---|---|---|
| Source | PubMed | Crossref | Any web page | Local filesystem |
| Content Type | Abstract + Full Text | Abstract | Raw HTML/text | Raw file content |
| Metadata | Rich (authors, journal, etc.) | Rich | Minimal (title only) | Minimal (title from heading) |
| Stability | High | High | Variable | High (local control) |
| Access | Free for abstracts | Varies | Varies | Always available |
| Caching | Yes | Yes | Yes | Yes |
See Also
- Using Local Files and URLs - Quick reference for file and URL sources
- Validating DOIs - For journal articles with DOIs
- Validating OBO Files - For ontology-specific validation
- How It Works - Core validation concepts
- CLI Reference - Command-line options