Skip to content

Validating ClinicalTrials.gov References

This guide shows how to validate supporting text against clinical trial records from ClinicalTrials.gov.

Overview

The ClinicalTrials.gov source fetches trial data from the ClinicalTrials.gov API v2. It extracts:

  • Title: Official title (falls back to brief title)
  • Content: Brief summary (falls back to detailed description)
  • Metadata: Trial status and lead sponsor

The source uses the bioregistry standard prefix clinicaltrials with identifiers following the pattern NCT followed by 8 digits (e.g., NCT00000001).

Basic Usage

Validate text against a clinical trial using its NCT identifier:

linkml-reference-validator validate text \
  "A randomized controlled trial investigating..." \
  clinicaltrials:NCT00000001

Accepted Identifier Formats

You can use the bioregistry standard prefix or bare NCT identifiers:

clinicaltrials:NCT00000001
clinicaltrials:NCT12345678
NCT00000001
NCT12345678

The prefix is case-insensitive:

clinicaltrials:NCT00000001
CLINICALTRIALS:NCT00000001

Prefix Aliases and Normalization

If your data uses alternate prefix styles (e.g., the legacy NCT: prefix), configure normalization in .linkml-reference-validator.yaml:

validation:
  reference_prefix_map:
    NCT: clinicaltrials
    nct: clinicaltrials
    ct: clinicaltrials
    ClinicalTrials: clinicaltrials

Or programmatically:

from linkml_reference_validator.models import ReferenceValidationConfig

config = ReferenceValidationConfig(
    reference_prefix_map={
        "NCT": "clinicaltrials",
        "ct": "clinicaltrials",
    }
)

Pre-caching Clinical Trial Records

To cache trial data for offline validation or faster repeated access:

linkml-reference-validator cache reference clinicaltrials:NCT00000001

Cached references are stored in references_cache/ as markdown files with YAML frontmatter containing metadata like trial status and sponsor.

Rate Limiting

The ClinicalTrials.gov API has rate limits. The default rate_limit_delay of 0.5 seconds between requests should be sufficient for most use cases:

from linkml_reference_validator.models import ReferenceValidationConfig

config = ReferenceValidationConfig(
    rate_limit_delay=0.5,  # default
)

Content Availability

Not all trials have detailed descriptions. If only a brief summary is available, that will be used for validation. Trials without any description will return content_type: unavailable.

Example: Validating Trial Descriptions

from linkml_reference_validator.etl.sources import ClinicalTrialsSource
from linkml_reference_validator.models import ReferenceValidationConfig

config = ReferenceValidationConfig()
source = ClinicalTrialsSource()

# Fetch trial content
content = source.fetch("NCT00000001", config)

if content:
    print(f"Reference ID: {content.reference_id}")  # clinicaltrials:NCT00000001
    print(f"Title: {content.title}")
    print(f"Summary: {content.content}")
    print(f"Status: {content.metadata.get('status')}")
    print(f"Sponsor: {content.metadata.get('sponsor')}")

Bioregistry Standard

This source follows the bioregistry standard for ClinicalTrials.gov identifiers:

  • Prefix: clinicaltrials
  • Pattern: ^NCT\d{8}$
  • Example CURIE: clinicaltrials:NCT00222573

Alternative prefixes recognized by bioregistry include clinicaltrial, NCT, and ctgov. Use the reference_prefix_map configuration to normalize these to the standard prefix.

See Also