Validating URL References

This guide explains how to validate references that use URLs instead of traditional identifiers like PMIDs or DOIs.

Overview

The linkml-reference-validator supports validating references that point to web content, such as:

Book chapters hosted online
Educational resources
Documentation pages
Blog posts or articles
Any static web content

When a reference field contains a URL, the validator:

Fetches the web page content
Extracts the page title from the <title> tag (for HTML pages)
Caches the content for future validations
Validates your supporting text against the page content

URL Format

Use the url: prefix to specify URL references. The prefix is optional for bare HTTP/HTTPS URLs, which are normalized to url: internally:

my_field:
  value: "Some text from the web page..."
  references:
    - "url:https://example.com/book/chapter1"

Or via CLI:

linkml-reference-validator validate text \
  "Some text from the web page" \
  url:https://example.com/book/chapter1
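
The prefix normalization described above can be pictured with a small sketch. The function name normalize_reference is hypothetical and is not taken from the validator's code; it only illustrates the documented behavior that bare HTTP/HTTPS URLs end up with a url: prefix while other reference types are left alone.

```python
def normalize_reference(ref: str) -> str:
    """Return the reference with an explicit url: prefix for bare HTTP(S) URLs.

    Hypothetical helper for illustration; the validator's own
    normalization may differ in detail.
    """
    stripped = ref.strip()
    if stripped.lower().startswith(("http://", "https://")):
        return f"url:{stripped}"
    # Already prefixed (url:, PMID:, DOI:, file:, ...): leave unchanged.
    return stripped


print(normalize_reference("https://example.com/book/chapter1"))
print(normalize_reference("url:https://example.com/book/chapter1"))
```

Both calls print the same url:-prefixed reference, so either spelling identifies the same source.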

Example

Suppose you have an online textbook chapter at https://example.com/biology/cell-structure with the following content:

<html>
  <head>
    <title>Chapter 3: Cell Structure and Function</title>
  </head>
  <body>
    <h1>Cell Structure and Function</h1>
    <p>The cell is the basic structural and functional unit of all living organisms.</p>
    <p>Cells contain various organelles that perform specific functions...</p>
  </body>
</html>

You can validate text extracted from this chapter:

linkml-reference-validator validate text \
  "The cell is the basic structural and functional unit of all living organisms" \
  url:https://example.com/biology/cell-structure

How URL Validation Works

1. Content Fetching

When the validator encounters a URL reference, it:

Makes an HTTP GET request to fetch the page
Uses a polite user agent header identifying the tool
Waits between requests to respect rate limits (configurable via rate_limit_delay)
Applies a request timeout (default 30 seconds)
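
The fetching steps above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the validator's actual fetcher: the user agent format, the RateLimiter class, and the build_request and fetch functions are all hypothetical. Only the rate_limit_delay setting and the 30-second default timeout come from this guide.

```python
import time
import urllib.request

# Assumed format; the real user agent string is not documented here.
USER_AGENT_TEMPLATE = "linkml-reference-validator (mailto:{email})"


def build_request(url: str, email: str) -> urllib.request.Request:
    """Build a GET request whose user agent identifies the tool."""
    headers = {"User-Agent": USER_AGENT_TEMPLATE.format(email=email)}
    return urllib.request.Request(url, headers=headers, method="GET")


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, delay: float, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return seconds slept."""
        pause = 0.0
        now = self._clock()
        if self._last is not None:
            pause = max(0.0, self.delay - (now - self._last))
            if pause > 0:
                self._sleep(pause)
        self._last = self._clock()
        return pause


def fetch(url: str, email: str, limiter: RateLimiter, timeout: float = 30.0) -> str:
    """Fetch a page as text, honoring the rate limit and the timeout."""
    limiter.wait()
    request = build_request(url, email)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A caller would create one RateLimiter(0.5) and pass it to every fetch call so that the delay applies across all URLs in a run.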

2. Content Storage

The fetcher stores:

Title: Extracted from the <title> tag (for HTML pages)
Content: The raw page content as received
Content type: Marked as url to distinguish from other reference types
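
Title extraction from the <title> tag can be sketched with Python's built-in html.parser module. The TitleParser class and extract_title function are illustrative names, not the validator's API, and the real implementation may use a different parser.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element in a document."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse internal whitespace; return None when no title was found.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html: str):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


page = """<html>
  <head>
    <title>Chapter 3: Cell Structure and Function</title>
  </head>
  <body><h1>Cell Structure and Function</h1></body>
</html>"""
print(extract_title(page))  # Chapter 3: Cell Structure and Function
```

For content without a <title> element, such as plain text, this sketch returns None.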

Note: The validator stores raw page content without HTML-to-text conversion. HTML tags remain in the cached file, and tag names can surface during normalization. If validation fails because tags interrupt the text, consider extracting plain text and validating against a local file: reference instead.
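
To see how tags can interrupt a match, consider the sketch below. The normalization rules here (lowercase, punctuation replaced by spaces, whitespace collapsed) are an assumption made for illustration; the validator's actual normalization is not specified in this guide and may differ.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace (assumed rules)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def is_supported(supporting_text: str, content: str) -> bool:
    """Check whether the normalized supporting text occurs in the normalized content."""
    return normalize(supporting_text) in normalize(content)


raw = (
    "<p>The cell is the basic structural and functional unit "
    "of all living organisms.</p>\n"
    "<p>Cells contain various organelles that perform specific functions...</p>"
)

within = "The cell is the basic structural and functional unit of all living organisms"
across = "all living organisms. Cells contain various organelles"

print(is_supported(within, raw))  # True
print(is_supported(across, raw))  # False
```

The second check fails because the stripped </p> and <p> tags leave the stray tokens "p p" between the two sentences. Quoting text from within a single element, or validating against extracted plain text, avoids the problem.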

3. Caching

Fetched URL content is cached to disk in markdown format with YAML frontmatter:

---
reference_id: url:https://example.com/biology/cell-structure
title: "Chapter 3: Cell Structure and Function"
content_type: url
---

# Chapter 3: Cell Structure and Function

## Content

<html>
  <head>
    <title>Chapter 3: Cell Structure and Function</title>
  </head>
  ...

Cache files are stored in the configured cache directory (default: references_cache/).
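
The cache layout shown above can be reproduced with a short sketch. The function names are hypothetical, and the file naming scheme is not described in this guide, so the filename is left to the caller rather than guessed.

```python
import json
from pathlib import Path


def render_cache_entry(reference_id: str, title: str, content: str) -> str:
    """Render a cache entry: YAML frontmatter followed by a markdown body."""
    frontmatter = [
        "---",
        f"reference_id: {reference_id}",
        # json.dumps yields a double-quoted string, which is also valid YAML
        # and keeps titles containing ':' unambiguous.
        f"title: {json.dumps(title, ensure_ascii=False)}",
        "content_type: url",
        "---",
    ]
    body = ["", f"# {title}", "", "## Content", "", content]
    return "\n".join(frontmatter + body) + "\n"


def write_cache_entry(cache_dir: str, filename: str, text: str) -> Path:
    """Write the rendered entry under cache_dir (filename chosen by the caller)."""
    path = Path(cache_dir) / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


entry = render_cache_entry(
    "url:https://example.com/biology/cell-structure",
    "Chapter 3: Cell Structure and Function",
    "<html>...</html>",
)
print(entry)
```

Because the body holds the raw page, any HTML in the content is written to the cache file unchanged.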

Configuration

URL fetching behavior can be configured:

# config.yaml
rate_limit_delay: 0.5  # Wait 0.5 seconds between requests
email: "your-email@example.com"  # Used in user agent
cache_dir: ".cache/references"  # Where to cache fetched content

Or via command-line:

linkml-reference-validator validate \
  --cache-dir .cache \
  --rate-limit-delay 0.5 \
  my-data.yaml

Limitations

Static Content Only

URL validation is designed for static web pages. It may not work well with:

Dynamic content loaded via JavaScript
Pages requiring authentication
Content behind paywalls
Frequently changing content

Raw Content

The validator stores raw page content. For HTML pages:

HTML tags are preserved in the cache
Text normalization during validation handles most cases, but tag names that fall between quoted passages can still break a match
Complex HTML layouts may require careful text extraction

No Rendering

The fetcher downloads raw HTML and parses it directly. It does not:

Execute JavaScript
Render the page in a browser
Handle dynamic content

Best Practices

1. Use Stable URLs

Choose URLs that are unlikely to change:

Versioned documentation: https://docs.example.com/v1.0/chapter1
Archived content: https://archive.example.com/2024/article
Avoid URLs with session parameters
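
One way to catch session parameters before adding a URL is a quick check on its query string. This is a heuristic sketch only: the function name and the list of parameter hints are assumptions, not part of the validator, and session identifiers embedded in the path are not detected.

```python
from urllib.parse import parse_qsl, urlsplit

# Assumed hints for session-like parameter names; extend as needed.
EXACT_NAMES = {"sid", "token", "phpsessid", "jsessionid"}
SUBSTRINGS = ("session",)


def session_like_params(url: str) -> list[str]:
    """Return query parameter names that look like session identifiers."""
    query = urlsplit(url).query
    names = [name for name, _ in parse_qsl(query, keep_blank_values=True)]
    return [
        name
        for name in names
        if name.lower() in EXACT_NAMES
        or any(hint in name.lower() for hint in SUBSTRINGS)
    ]


print(session_like_params("https://docs.example.com/v1.0/chapter1"))  # []
print(session_like_params("https://example.com/page?sessionid=abc123&lang=en"))  # ['sessionid']
```

An empty list suggests the URL carries no obvious session state; a non-empty list is a prompt to look for a cleaner, permanent link.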

2. Verify Content Quality

After adding a URL reference, verify the extracted content:

# Check what was extracted
linkml-reference-validator cache lookup url:https://example.com/page --content

Ensure the cached content contains the text you're referencing.

3. Cache Management

Commit cache files to version control for reproducibility
Use linkml-reference-validator cache reference url:https://... --force to update cached content when pages change
Periodically review cached URLs to ensure they're still accessible

4. Mix Reference Types

URL references work alongside PMIDs and DOIs:

findings:
  value: "Multiple studies confirm this relationship"
  references:
    - "PMID:12345678"  # Research paper
    - "DOI:10.1234/journal.article"  # Another paper
    - "url:https://example.com/textbook/chapter5"  # Textbook chapter

Troubleshooting

URL Not Fetching

If URL content isn't being fetched:

Check network connectivity
Verify the URL is accessible in a browser
Check for rate limiting or IP blocks
Look for error messages in the logs

Validation Failing

If validation fails for URL references:

Check the cached content to see what was extracted
Verify your supporting text actually appears on the page
Check for whitespace or formatting differences
Consider whether the page content has changed since it was cached

Force Refresh

To re-fetch content for a URL that may have changed, refresh the cache before validating:

linkml-reference-validator cache reference url:https://example.com/page --force

linkml-reference-validator validate text \
  "Updated content" \
  url:https://example.com/page

Comparison with Other Reference Types

| Feature | PMID | DOI | URL | file |
|---|---|---|---|---|
| Source | PubMed | Crossref | Any web page | Local filesystem |
| Content Type | Abstract + Full Text | Abstract | Raw HTML/text | Raw file content |
| Metadata | Rich (authors, journal, etc.) | Rich | Minimal (title only) | Minimal (title from heading) |
| Stability | High | High | Variable | High (local control) |
| Access | Free for abstracts | Varies | Varies | Always available |
| Caching | Yes | Yes | Yes | Yes |