Validating Text Against DOIs
This guide shows how to validate supporting text against publications using Digital Object Identifiers (DOIs).
Overview
DOIs are persistent identifiers for digital objects, commonly used for journal articles and data repositories. The validator fetches publication metadata from:
- Crossref API - Primary source for journal articles
- DataCite API - Fallback for repository DOIs (Zenodo, Figshare, Dryad, OSTI)
This dual-source approach ensures broad coverage across both scholarly publications and data repositories.
Basic Usage
Validate a Single Quote
linkml-reference-validator validate text \
"Nanometre-scale thermometry" \
DOI:10.1038/nature12373
Output:
Validating text against DOI:10.1038/nature12373...
Text: Nanometre-scale thermometry
Result:
Valid: True
Message: Supporting text validated successfully in DOI:10.1038/nature12373
DOI Format
DOIs should be prefixed with DOI:, for example:
DOI:10.1038/nature12373
DOI:10.1126/science.1234567
DOI:10.1016/j.cell.2023.01.001
The DOI itself follows the standard format: 10.prefix/suffix
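As a rough illustration of this format, the following sketch checks a reference string and extracts the bare DOI. The function name `parse_doi_curie` and the regular expression are assumptions made for this example (the 4 to 9 digit registrant code follows common DOI practice); the validator's own parsing rules may differ.

```python
import re
from typing import Optional

# Assumed pattern: the literal "DOI:" prefix, then "10.", a registrant code of
# 4-9 digits, a slash, and a non-empty suffix containing no whitespace.
DOI_CURIE = re.compile(r"^DOI:(10\.\d{4,9}/\S+)$")


def parse_doi_curie(reference: str) -> Optional[str]:
    """Return the bare DOI from a 'DOI:10.prefix/suffix' string, or None."""
    match = DOI_CURIE.match(reference.strip())
    return match.group(1) if match else None


if __name__ == "__main__":
    for candidate in ("DOI:10.1038/nature12373", "10.1038/nature12373"):
        print(candidate, "->", parse_doi_curie(candidate))
```

A string without the DOI: prefix is rejected here, which mirrors the prefix requirement stated above.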
Supported Repository DOIs
The validator supports DOIs from data repositories via the DataCite API:
| Repository | DOI Prefix | Example |
|---|---|---|
| Zenodo | 10.5281/zenodo.* | DOI:10.5281/zenodo.7961621 |
| Figshare | 10.6084/m9.figshare.* | DOI:10.6084/m9.figshare.123456 |
| Dryad | 10.5061/dryad.* | DOI:10.5061/dryad.abc123 |
| OSTI | 10.2172/* | DOI:10.2172/1234567 |
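The prefix table can be expressed as a small lookup. This is only a sketch: `REPOSITORY_PREFIXES` and `repository_for` are hypothetical names for this example, and the validator itself may identify repositories differently (for instance from the DataCite response rather than from the DOI string).

```python
from typing import Optional

# DOI prefixes copied from the table above.
REPOSITORY_PREFIXES = {
    "10.5281/zenodo.": "Zenodo",
    "10.6084/m9.figshare.": "Figshare",
    "10.5061/dryad.": "Dryad",
    "10.2172/": "OSTI",
}


def repository_for(reference: str) -> Optional[str]:
    """Return the repository name for a DOI, or None for other DOIs."""
    # Accept both "DOI:10.x/..." and a bare "10.x/..." string.
    bare = reference[4:] if reference.startswith("DOI:") else reference
    for prefix, name in REPOSITORY_PREFIXES.items():
        if bare.lower().startswith(prefix):
            return name
    return None


if __name__ == "__main__":
    print(repository_for("DOI:10.5281/zenodo.7961621"))
    print(repository_for("DOI:10.1038/nature12373"))
```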
Looking Up Repository DOIs
linkml-reference-validator lookup DOI:10.5281/zenodo.7961621
For Zenodo DOIs, the output includes supplementary file metadata:
Reference: DOI:10.5281/zenodo.7961621
Title: Gene Ontology Curators AI Workshop
Authors: Dickinson R, Carbon S, Mungall CJ
...
Content type: abstract_only
--- Supplementary Files (3) ---
- Dickinson_Varenna2022.pdf (1,975,995 bytes)
- workshop_slides.pptx (2,345,678 bytes)
- data_analysis.xlsx (123,456 bytes)
Downloading Supplementary Files
By default, only metadata about supplementary files is captured. To download the actual files:
linkml-reference-validator lookup -D DOI:10.5281/zenodo.7961621
Downloaded files are stored in:
references_cache/
  files/
    DOI_10.5281_zenodo.7961621/
      Dickinson_Varenna2022.pdf
      workshop_slides.pptx
      data_analysis.xlsx
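To locate downloaded files from a script, the directory name can be derived from the reference. The mapping below (replacing ":" and "/" with "_") is inferred from the single example above and is an assumption, not a documented rule; `supplementary_dir` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def supplementary_dir(reference: str,
                      cache_root: str = "references_cache") -> PurePosixPath:
    """Guess the download directory for a reference such as 'DOI:10.5281/zenodo.7961621'."""
    # Assumption: ':' and '/' are both replaced by '_' in the directory name.
    safe_name = reference.replace(":", "_").replace("/", "_")
    return PurePosixPath(cache_root) / "files" / safe_name


if __name__ == "__main__":
    print(supplementary_dir("DOI:10.5281/zenodo.7961621"))
```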
Publisher DOIs vs Repository DOIs
Important: Supplementary file support only works for repository DOIs (Zenodo, Figshare, Dryad), not for publisher DOIs (Elsevier, Springer, Nature, etc.).
| DOI Type | Example | Supplementary Files |
|---|---|---|
| Repository (Zenodo) | 10.5281/zenodo.7961621 | ✅ File metadata + download |
| Repository (Figshare) | 10.6084/m9.figshare.123456 | ✅ File metadata + download |
| Publisher (Elsevier) | 10.1016/j.neuron.2011.05.021 | ❌ Not available |
| Publisher (Nature) | 10.1038/nature12373 | ❌ Not available |
Why the difference?
- Repository APIs (Zenodo, Figshare) are designed for data sharing and provide open, documented file APIs
- Publisher APIs (Elsevier, Springer) require paid institutional access or text-mining agreements
- Even when articles are in PMC, supplementary files are often not available via the OA API
Workarounds for publisher supplementary files:
- Manual download: Download supplementary files from the publisher website and use file: references
- Data repository: Check whether the authors deposited data separately in Zenodo, Figshare, or Dryad
- PubMed Central: For some OA articles, supplementary files may be available via PMC
Pre-caching DOIs
For offline validation or to speed up repeated validations:
linkml-reference-validator cache reference DOI:10.1038/nature12373
Output:
Fetching DOI:10.1038/nature12373...
Successfully cached DOI:10.1038/nature12373
Title: Nanometre-scale thermometry in a living cell
Authors: G. Kucsko, P. C. Maurer, N. Y. Yao
Content type: abstract_only
Content length: 1234 characters
Cached references are stored in references_cache/ as markdown files with YAML frontmatter.
Using DOIs in Data Files
DOIs work the same as PMIDs in LinkML data files:
schema.yaml:
id: https://example.org/my-schema
name: my-schema
prefixes:
  linkml: https://w3id.org/linkml/
classes:
  Statement:
    attributes:
      id:
        identifier: true
      supporting_text:
        slot_uri: linkml:excerpt
      reference:
        slot_uri: linkml:authoritative_reference
data.yaml:
- id: stmt1
  supporting_text: Nanometre-scale thermometry
  reference: DOI:10.1038/nature12373
- id: stmt2
  supporting_text: MUC1 oncoprotein blocks nuclear targeting
  reference: PMID:16888623
Validate:
linkml-reference-validator validate data \
data.yaml \
--schema schema.yaml \
--target-class Statement
You can mix DOIs and PMIDs in the same data file.
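When a file mixes both kinds of reference, it can help to see which source each statement will be checked against. The sketch below groups references by prefix; it works on plain dictionaries shaped like the data.yaml entries above and does not call the validator. `group_by_prefix` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_prefix(statements: List[dict]) -> Dict[str, List[str]]:
    """Group the local identifiers of each statement's reference by prefix."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for statement in statements:
        # "DOI:10.1038/nature12373" -> prefix "DOI", local id "10.1038/nature12373"
        prefix, _, local_id = statement["reference"].partition(":")
        groups[prefix].append(local_id)
    return dict(groups)


if __name__ == "__main__":
    data = [
        {"id": "stmt1", "reference": "DOI:10.1038/nature12373"},
        {"id": "stmt2", "reference": "PMID:16888623"},
    ]
    print(group_by_prefix(data))
```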
Repairing DOI References
The repair command also works with DOIs:
linkml-reference-validator repair text \
"Nanometre scale thermometry" \
DOI:10.1038/nature12373
DOI vs PMID: When to Use Each
| Feature | PMID | DOI |
|---|---|---|
| Source | NCBI PubMed | Crossref + DataCite |
| Coverage | Biomedical literature | All scholarly content + data repos |
| Full text | Via PMC when available | Metadata only |
| Abstract | Usually available | Depends on publisher/repo |
| Keywords | MeSH terms | Subjects (if available) |
| Supplementary files | No | Yes (Zenodo, etc.) |
Use PMID when:
- Working with biomedical/life science literature
- Full text access is important
- The article is indexed in PubMed
- You need MeSH term keywords
Use DOI when:
- The article is not in PubMed
- Working with non-biomedical journals
- Working with data repositories (Zenodo, Figshare, Dryad)
- You need supplementary file metadata
Content Availability
Unlike PMID lookups, which often return an abstract, DOI metadata from Crossref may have limited content:
- Title: Always available
- Authors: Usually available
- Abstract: Depends on publisher policy
- Full text: Not available via Crossref
If the abstract is not available, validation will be limited to matching against the title and other metadata.
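The effect of a missing abstract can be illustrated with a simplified matcher. This is not the validator's matching algorithm, which is not described here; it assumes a plain case-insensitive, whitespace-normalized substring check, and the metadata keys `title` and `abstract` are chosen for the example.

```python
def searchable_text(metadata: dict) -> str:
    """Join whichever of title and abstract are present."""
    parts = [metadata.get("title"), metadata.get("abstract")]
    return " ".join(part for part in parts if part)


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def is_supported(quote: str, metadata: dict) -> bool:
    """Simplified check: does the quote occur in the available content?"""
    return normalize(quote) in normalize(searchable_text(metadata))


if __name__ == "__main__":
    # Title taken from the pre-caching output above; no abstract available.
    title_only = {"title": "Nanometre-scale thermometry in a living cell",
                  "abstract": None}
    print(is_supported("Nanometre-scale thermometry", title_only))
    print(is_supported("a sentence found only in the abstract", title_only))
```

With only a title to search, a quote taken from the body or abstract of the article cannot match, which is why such validations fail even though the DOI resolved.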
Troubleshooting
"Content type: unavailable"
This means Crossref returned metadata but no abstract. The DOI was fetched successfully, but validation may fail if your text doesn't match the title.
Solution: Consider using the PMID if the article is in PubMed.
"Failed to fetch DOI"
The DOI may be invalid or both APIs (Crossref and DataCite) may have failed.
How DOI resolution works:
1. First, the validator tries the Crossref API
2. If Crossref returns 404, it falls back to the DataCite API
3. If both fail, the error is reported
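The resolution order can be sketched as follows. This is an illustration of the fallback logic only, not the validator's code: the two fetch functions are passed in as parameters (hypothetical callables that return None on a 404), so the sketch runs without network access. The URL templates are the public Crossref and DataCite REST endpoints.

```python
from typing import Callable, Optional, Tuple

# Public REST endpoints a real fetcher would call.
CROSSREF_URL = "https://api.crossref.org/works/{doi}"
DATACITE_URL = "https://api.datacite.org/dois/{doi}"

# A fetcher takes a bare DOI and returns metadata, or None when the API answers 404.
Fetcher = Callable[[str], Optional[dict]]


def resolve_doi(doi: str, fetch_crossref: Fetcher,
                fetch_datacite: Fetcher) -> Tuple[str, dict]:
    """Try Crossref first, then DataCite; raise if neither knows the DOI."""
    record = fetch_crossref(doi)
    if record is not None:
        return "crossref", record
    record = fetch_datacite(doi)
    if record is not None:
        return "datacite", record
    raise LookupError(f"Failed to fetch DOI:{doi} from Crossref and DataCite")


if __name__ == "__main__":
    # Stand-in fetchers: Crossref misses, DataCite hits.
    source, record = resolve_doi(
        "10.5281/zenodo.7961621",
        fetch_crossref=lambda doi: None,
        fetch_datacite=lambda doi: {"doi": doi},
    )
    print(source, record)
```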
Check:
1. Verify the DOI format (should be 10.prefix/suffix)
2. Test the DOI at https://doi.org/YOUR_DOI
3. Try again later if APIs are rate-limiting
4. For repository DOIs (Zenodo, etc.), ensure the record is public
Rate Limiting
The validator automatically respects Crossref rate limits. For bulk operations, consider:
- Pre-caching references before validation
- Using the Crossref polite pool (add your email in the config for higher limits)
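At the HTTP level, Crossref identifies polite-pool traffic by a contact address, which can be supplied as a `mailto` query parameter. The sketch below only builds such a URL; how the validator reads the email from its config is not shown in this guide, and `crossref_works_url` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def crossref_works_url(doi: str, mailto: Optional[str] = None) -> str:
    """Build a Crossref works URL, adding a mailto parameter when given."""
    # Keep the slash between DOI prefix and suffix; escape anything unsafe.
    url = "https://api.crossref.org/works/" + quote(doi, safe="/")
    if mailto:
        url += "?" + urlencode({"mailto": mailto})
    return url


if __name__ == "__main__":
    # The address is a placeholder, not a real contact.
    print(crossref_works_url("10.1038/nature12373", "curator@example.org"))
```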
See Also
- Quickstart - Getting started with validation
- CLI Reference - Complete command documentation
- Validating OBO Files - Working with ontology files