linkml-data-qc

A compliance analysis tool for LinkML data files. It measures how well your data populates the slots marked recommended: true in your LinkML schemas.

Why linkml-data-qc?

When building knowledge bases with LinkML, certain fields may be marked as recommended in your schema - fields that should ideally be populated but aren't strictly required. This tool helps you:

  • Track data quality across your knowledge base
  • Identify gaps where recommended fields are missing
  • Enforce standards with configurable thresholds in CI/CD
  • Prioritize curation by finding low-compliance areas

Quick Start

# Install
pip install linkml-data-qc

# Analyze a single file
linkml-data-qc data.yaml -s schema.yaml -t TargetClass

# Analyze a directory
linkml-data-qc data/ -s schema.yaml -t TargetClass --pattern "*.yaml"

# Fail CI if compliance drops below 70%
linkml-data-qc data/ -s schema.yaml -t TargetClass --min-compliance 70

Features

  • Hierarchical scoring - Compliance at global, path, and per-item levels
  • Aggregated list scoring - Roll up scores using jq-style [] notation
  • Configurable weights - Prioritize important fields
  • Threshold enforcement - Set minimum compliance requirements
  • Multiple formats - JSON, CSV, and human-readable text output
  • Visual dashboards - Generate PNG dashboard images (optional viz extras)
  • CI/CD integration - Exit codes for automated pipelines
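
To make the scoring and weighting ideas above concrete, here is a minimal Python sketch of how plain and weighted compliance could be computed for one record. It is an illustration only, not linkml-data-qc's actual implementation: the helper names (`is_populated`, `compliance`), the default weight of 1.0, and the rule that empty strings and empty containers count as missing are all assumptions.

```python
from typing import Any, Mapping, Optional, Sequence


def is_populated(value: Any) -> bool:
    """Assumption: None, empty strings, and empty containers count as missing."""
    if value is None:
        return False
    if isinstance(value, (str, list, dict, tuple, set)):
        return len(value) > 0
    return True


def compliance(
    record: Mapping[str, Any],
    recommended: Sequence[str],
    weights: Optional[Mapping[str, float]] = None,
) -> dict:
    """Score one record against a list of recommended slot names.

    Hypothetical helper: slots without an explicit weight default to 1.0.
    """
    weights = weights or {}
    populated = [s for s in recommended if is_populated(record.get(s))]
    total_w = sum(weights.get(s, 1.0) for s in recommended)
    got_w = sum(weights.get(s, 1.0) for s in populated)
    return {
        "populated": len(populated),
        "total": len(recommended),
        "percent": 100.0 * len(populated) / len(recommended) if recommended else 100.0,
        "weighted_percent": 100.0 * got_w / total_w if total_w else 100.0,
    }


# Example: "description" is filled in and weighted 3x; "term" is missing.
record = {"name": "Asthma", "description": "Chronic airway disease", "term": None}
result = compliance(record, ["description", "term"], weights={"description": 3.0})
print(f"{result['percent']:.1f}% plain, {result['weighted_percent']:.1f}% weighted")
```

In this sketch, weighting raises the score because the populated slot carries more weight than the missing one. That matches the general pattern in the example report, where weighted and global compliance differ.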

Visual Dashboards

Generate visual QC dashboards to quickly assess data quality:

# Quote the extras so shells such as zsh do not expand the brackets
pip install "linkml-data-qc[viz]"

# Single PNG dashboard
linkml-data-qc data.yaml -s schema.yaml -t MyClass --dashboard qc_dashboard.png

# Full HTML dashboard site (for GitHub Pages)
linkml-data-qc data/ -s schema.yaml -t MyClass --dashboard-dir ./dashboard/

Example Dashboard

See a live example: dismech QC Dashboard

Documentation

Example Output

Compliance Report: data/Asthma.yaml
Target Class: Disease
Global Compliance: 65.4% (125/191)
Weighted Compliance: 71.2%

Summary by Slot:
  description: 78.4%
  term: 72.1%

Aggregated Scores by List Path:
  pathophysiology[].description: 100.0% (5/5)
  pathophysiology[].term: 80.0% (4/5)
  phenotypes[].phenotype_term.term: 60.0% (3/5)
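
The "Aggregated Scores by List Path" lines can be read as follows: each `[]` fans out over every item in a list, and the score is the share of items in which the final slot is populated. The Python sketch below shows one way to compute such a roll-up. It is a hedged illustration under assumptions, not the tool's own code: `resolve`, `aggregate`, and `format_score` are hypothetical helpers, and the sample data is invented for the example.

```python
from typing import Any, Iterator, List, Tuple


def resolve(node: Any, parts: List[str]) -> Iterator[Any]:
    """Yield one value per item reached by the path; None when a slot is absent."""
    if not parts:
        yield node
        return
    head, rest = parts[0], parts[1:]
    is_list = head.endswith("[]")
    key = head[:-2] if is_list else head
    child = node.get(key) if isinstance(node, dict) else None
    if is_list:
        # Fan out: every list item contributes its own value (or None).
        for item in child or []:
            yield from resolve(item, rest)
    else:
        yield from resolve(child, rest)


def aggregate(data: dict, path: str) -> Tuple[int, int]:
    """Return (populated, total) for a jq-style path such as 'items[].slot'."""
    values = list(resolve(data, path.split(".")))
    populated = sum(1 for v in values if v not in (None, "", [], {}))
    return populated, len(values)


def format_score(path: str, populated: int, total: int) -> str:
    """Render one line in the same shape as the report above."""
    percent = 100.0 * populated / total if total else 100.0
    return f"{path}: {percent:.1f}% ({populated}/{total})"


# Invented sample data, for illustration only.
data = {
    "pathophysiology": [
        {"description": "Airway inflammation", "term": {"id": "EX:1"}},
        {"description": "Mucus hypersecretion"},
    ],
    "phenotypes": [
        {"phenotype_term": {"term": "Wheezing"}},
        {"phenotype_term": {}},
        {},
    ],
}

for p in (
    "pathophysiology[].description",
    "pathophysiology[].term",
    "phenotypes[].phenotype_term.term",
):
    print(format_score(p, *aggregate(data, p)))
```

A list item that lacks the slot still counts toward the total, so gaps lower the score instead of disappearing from it.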