Content Types
When fetching references, the validator tracks what type of content was retrieved using the content_type field. This is important because validation reliability depends on having access to the right content - an excerpt from a paper's Methods section won't be found if only the abstract is available.
Content Type Values
| Value | Source(s) | Description |
|---|---|---|
full_text_xml |
PMID (via PMC) | Full text retrieved from PubMed Central as XML |
full_text_html |
PMID (via PMC) | Full text retrieved from PubMed Central as HTML (fallback) |
abstract_only |
PMID, DOI | Only the abstract was available |
summary |
GEO, BioProject, BioSample, ClinicalTrials | Database summary/description field |
local_file |
file: | Content read from local filesystem |
url |
url: | Content fetched from web URL |
unavailable |
Any | No content could be retrieved |
Internal Values (not typically seen by users)
| Value | Meaning |
|---|---|
no_pmc |
PMID has no corresponding PMC ID |
pmc_restricted |
PMC entry exists but content is restricted |
unknown |
Default value (content type not determined) |
How Content Type Affects Validation
Full Text Available (full_text_xml, full_text_html)
When full text is available, validation is most reliable. The content includes:
- Abstract
- Introduction
- Methods
- Results
- Discussion
- References
Excerpts from any section can be validated.
Abstract Only (abstract_only)
When only the abstract is available:
- Validation works for text from the abstract
- Validation will fail for excerpts from other sections
- Error messages note this limitation:
(note: only abstract available for PMID:nnn, full text may contain this excerpt)
This is common for PMIDs without open-access PMC versions.
Summary (summary)
For database records (GEO, BioProject, etc.), the "summary" or "description" field is used. This is typically a brief description, not a full publication.
Unavailable (unavailable)
No content was retrieved. Validation will fail with a clear error message.
PMID Full Text Cascade
For PMID references, the validator automatically attempts to get the best content available:
When full text is retrieved, it's concatenated with the abstract to provide maximum coverage.
Checking Content Type
In Cache Files
Cached references store content type in YAML frontmatter:
---
reference_id: PMID:16888623
title: Example Article
content_type: abstract_only
---
Via CLI
linkml-reference-validator cache lookup PMID:16888623 --content
Output includes:
content_type: abstract_only
In Validation Results
When validation fails and only abstract was available, the error message includes context:
Text part not found as substring: 'excerpt from methods section'
(note: only abstract available for PMID:16888623, full text may contain this excerpt)
Strategies When Full Text Unavailable
If validation fails due to abstract-only content:
-
Check for PMC version: Search PubMed Central for the article - if available, use
PMC:prefix instead ofPMID: -
Use local file: If you have the PDF/text, save as markdown and reference with
file:./path/to/paper.md -
Use URL: If the full text is freely available online, use
url:https://... -
Remove the excerpt: If the excerpt can't be verified, consider removing it from your data
-
Accept the limitation: Document that certain excerpts couldn't be verified due to access limitations
See Using Local Files and URLs for detailed guidance on alternatives.
Reference Metadata
Beyond the main content, the validator extracts additional metadata from references:
Keywords
The keywords field contains subject terms or tags from the source:
| Source | Keyword Type | Example |
|---|---|---|
| PMID | MeSH terms | Adaptor Proteins, Signal Transducing/metabolism |
| DOI (Crossref) | Subjects | General Chemistry, Biochemistry |
| DOI (DataCite) | Subjects | Climate Change, Bioinformatics |
MeSH (Medical Subject Headings) terms from PubMed are particularly valuable for biomedical literature, providing standardized vocabulary with qualifier subheadings (e.g., /metabolism, /genetics).
Example cache file with keywords:
---
reference_id: PMID:33505029
title: Genomic mechanisms of climate adaptation...
keywords:
- Acclimatization/genetics
- Biofuels
- Genetic Introgression
- Genome, Plant/genetics
content_type: abstract_only
---
Viewing keywords via CLI:
linkml-reference-validator lookup PMID:33505029 --format text
Output:
Reference: PMID:33505029
Title: Genomic mechanisms of climate adaptation...
Keywords: Acclimatization/genetics, Biofuels, Genetic Introgression, ...
Supplementary Files
For repository DOIs (Zenodo, etc.), the supplementary_files field lists associated files:
---
reference_id: DOI:10.5281/zenodo.7961621
supplementary_files:
- filename: data_analysis.xlsx
download_url: https://zenodo.org/api/records/7961621/files/data_analysis.xlsx/content
size_bytes: 123456
checksum: md5:88c66d378d886fea4969949c5877802f
---
See Validating DOIs for details on supplementary files.