The Gap Between Valid Data and Quality Data
This document explores the conceptual foundations of data quality assessment and explains why measuring completeness requires fundamentally different approaches than binary validation.
The Limits of Binary Validation
Most data validation systems operate on a pass/fail model. A record either conforms to a schema or it doesn't. An email field contains a valid email address, or it's rejected. A date falls within an acceptable range, or it fails validation. This binary approach is computationally tractable and provides clear, actionable signals: fix this error, then proceed.
But binary validation answers only one question: "Is this data structurally acceptable?"
It cannot answer the questions that matter most for real-world use:
- Is this data useful for its intended purpose?
- Does this dataset contain enough information to support analysis?
- Are we making progress toward a well-curated knowledge base?
- Which areas of our data need the most attention?
These questions require measurement on a continuum, not binary classification.
Data Quality as a Multidimensional Concept
The foundational work by Wang and Strong (1996) established that data quality is fundamentally about "fitness for use"—a definition that emphasizes the consumer's perspective rather than abstract correctness. Their research identified 15 dimensions of data quality organized into four categories:
| Category | Dimensions | Focus |
|---|---|---|
| Intrinsic | Accuracy, Believability, Objectivity, Reputation | Quality inherent to the data itself |
| Contextual | Completeness, Relevancy, Timeliness, Value-added, Appropriate amount | Quality relative to the task at hand |
| Representational | Interpretability, Ease of understanding, Conciseness, Consistency | How data is presented and formatted |
| Accessibility | Access, Security | How easily data can be obtained |
Notice that validity—the concern of schema validators—maps primarily to the Intrinsic category. But completeness is a Contextual dimension: whether data is complete depends entirely on what you're trying to do with it.
Completeness: The Missing Middle Ground
Completeness is particularly interesting because it occupies the gap between validation and curation:
- Validation asks: "Does this field contain a valid value?"
- Completeness asks: "Is this field populated at all?"
- Curation asks: "Is this field populated with the right information?"
A customer record with an email field containing test@example.com passes validation (it is a syntactically valid email) but can still fall short of completeness and curation expectations in several ways:
- It might be a placeholder that needs replacement
- The field might be present but contextually irrelevant
- The surrounding record might lack other important fields
This is why data quality frameworks distinguish between:
- Required fields — Must be present; absence is a validation error
- Optional fields — May be absent without affecting validity
- Recommended fields — Should be present for quality data, but absence isn't fatal
The third category is crucial. Recommended fields encode institutional knowledge about what makes data useful, not just acceptable. They represent the difference between a minimal skeleton that passes validation and a rich record that serves its purpose well.
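To make the distinction concrete, here is a minimal sketch in Python. The field names and their classification are invented for illustration and are not part of any particular schema or tool:

```python
# Hypothetical field classification for a customer record (illustration only).
REQUIRED = {"id", "email"}          # absence is a validation error
RECOMMENDED = {"phone", "address"}  # absence lowers quality, not validity
OPTIONAL = {"nickname"}             # ignored by both checks

record = {"id": "C-001", "email": "test@example.com", "nickname": "TJ"}

# Binary validation: pass/fail on required fields only.
is_valid = REQUIRED.issubset(record)

# Completeness: share of recommended fields that are actually populated.
populated = sum(1 for field in RECOMMENDED if record.get(field) not in (None, ""))
completeness = 100 * populated / len(RECOMMENDED)

print(is_valid)      # True: the record passes validation
print(completeness)  # 0.0: none of the recommended fields are filled
```

The record is perfectly acceptable to a validator and, at the same time, contributes nothing to the quality signal the recommended fields are meant to capture.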
Why Binary Approaches Fail for Completeness
Consider a biomedical knowledge base with 10,000 disease entries. A binary completeness check might report:
❌ 3,247 entries missing description field
❌ 1,892 entries missing ontology term binding
✓ All entries have required name field
This tells you something, but not enough:
- Are the missing descriptions concentrated in rare diseases (where information is scarce)?
- Is the 81% completion rate for ontology terms good or bad for your use case?
- Which specific areas should curators prioritize?
- Are things getting better or worse over time?
What you need is a measurement approach that produces continuous scores, supports aggregation at multiple levels, and enables tracking over time. This is precisely what distinguishes quality assessment from validation.
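As a first step in that direction, the binary report above can at least be re-expressed as continuous per-field scores. The short sketch below does exactly that, using the same illustrative numbers:

```python
# The same counts as continuous scores rather than pass/fail flags.
# Numbers are taken from the illustrative report above.
total_entries = 10_000
missing = {"description": 3_247, "ontology_term": 1_892, "name": 0}

for field, count in missing.items():
    populated_rate = 100 * (total_entries - count) / total_entries
    print(f"{field}: {populated_rate:.1f}% populated")
# description: 67.5% populated
# ontology_term: 81.1% populated
# name: 100.0% populated
```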
The Measurement Problem
Data quality monitoring in production systems typically involves setting thresholds:
- Alert if completeness drops below 90%
- Fail the pipeline if null rates exceed 5%
- Block deployment if any critical field falls below threshold
But threshold-setting is itself a nuanced problem:
- Different fields have different importance. A missing description is less critical than a missing identifier. Simple percentage calculations treat all fields equally.
- Context matters. 70% completeness might be excellent for a newly created dataset but unacceptable for a mature knowledge base.
- Aggregation level matters. Global completeness masks local problems. A dataset might be 95% complete overall while specific categories hover at 40%.
- Thresholds need calibration. Initial thresholds often prove too strict or too lenient once real data flows through the system.
This suggests that completeness measurement should be:
- Weighted — More important fields contribute more to the score
- Hierarchical — Scores at global, category, and item levels
- Configurable — Thresholds and weights adjustable per context
- Transparent — Individual contributions visible for debugging
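One way to picture these four properties together is as a single configuration object. The sketch below is purely illustrative; the keys, paths, and values are invented and do not describe any particular tool's configuration format:

```python
# Illustrative completeness configuration: weights and thresholds are
# adjustable per context, and per-field contributions are reported so
# scores stay debuggable. Names and values are invented.
completeness_config = {
    "weights": {                 # weighted: important fields count for more
        "ontology_term": 2.0,
        "synonyms": 1.5,
        "description": 0.5,
    },
    "thresholds": {              # configurable: acceptable minimums per path
        "global": 80.0,
        "diseases[].ontology_term": 90.0,
    },
    "report": {                  # transparent and hierarchical output
        "per_field": True,
        "per_category": True,
        "per_item": True,
    },
}
```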
From Measurement to Quality Gates
The FAIR Data Maturity Model provides a useful analogy. Rather than asking "Is this data FAIR?" (binary), it asks "How FAIR is this data?" (continuous) across multiple indicators, each scored on a maturity scale.
Similarly, completeness assessment works best as a quality gate that enforces minimum standards while providing visibility into the full picture:
┌─────────────────────────────────────────────────────────┐
│ Hard Validation │ Completeness Assessment │
│ (Binary: Pass/Fail) │ (Continuous: 0-100%) │
├──────────────────────────┼─────────────────────────────┤
│ Required fields present? │ What % of recommended │
│ Types correct? │ fields are populated? │
│ Values in range? │ Which areas need work? │
│ References resolve? │ Are we improving over time? │
└──────────────────────────┴─────────────────────────────┘
The left side blocks bad data from entering the system. The right side measures how good the data that passes validation actually is.
Field Classification in Practice
Effective completeness tracking requires thoughtful field classification:
| Classification | Validation Behavior | Completeness Behavior |
|---|---|---|
| Required | Reject if missing | Not tracked (always 100%) |
| Recommended | Accept if missing | Tracked and scored |
| Optional | Accept if missing | Not tracked |
| Conditional | Depends on context | May be tracked based on context |
The key insight is that "recommended" fields encode domain expertise about data quality. A disease entry can exist with just a name, but a useful disease entry includes descriptions, ontology term bindings, and evidence citations. Schema designers capture this knowledge by marking fields as recommended.
Weighted Scoring
Not all recommended fields matter equally. Industry practice suggests weighting fields by business impact:
weighted_compliance = Σ(populated_count × weight) / Σ(total_count × weight) × 100
This allows fine-grained prioritization:
- Ontology term bindings might be critical for interoperability (weight: 2.0)
- Free-text descriptions are nice to have (weight: 0.5)
- Machine-readable synonyms are highly valued (weight: 1.5)
Weights reflect organizational priorities and use-case requirements, making completeness measurement contextual—exactly as Wang and Strong's framework suggests it should be.
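A minimal sketch of that formula in Python, with invented field names, weights, and records:

```python
# Weighted compliance over a small set of records, following the formula above.
# Field names, weights, and records are invented for illustration.
WEIGHTS = {"ontology_term": 2.0, "synonyms": 1.5, "description": 0.5}

records = [
    {"ontology_term": "MONDO:0005015", "description": "A metabolic disorder"},
    {"description": "Placeholder text"},
    {"ontology_term": "MONDO:0004975", "synonyms": ["AD"]},
]

def weighted_compliance(records, weights):
    populated_weight = sum(
        weight
        for record in records
        for field, weight in weights.items()
        if record.get(field) not in (None, "", [])
    )
    total_weight = sum(weights.values()) * len(records)  # each field expected once per record
    return 100 * populated_weight / total_weight if total_weight else 0.0

print(f"{weighted_compliance(records, WEIGHTS):.1f}%")  # 54.2%
```

With equal weights the same records would score 55.6%; the weighting pulls the score down because the high-weight synonyms and ontology terms are the fields most often missing.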
Hierarchical Analysis
Aggregate scores hide important details. A dataset at 85% global completeness might have:
- 100% completeness for well-curated categories
- 40% completeness for newly added categories
- A handful of items with 0% completeness dragging down averages
Hierarchical scoring reveals these patterns:
Global Compliance: 85.0%
├── Category A: 100.0%
├── Category B: 95.0%
├── Category C: 42.0% ← Problem area
│ ├── Item C.1: 80.0%
│ ├── Item C.2: 25.0% ← Worst offender
│ └── Item C.3: 20.0%
└── Category D: 88.0%
This transforms completeness from a single number into actionable intelligence: these specific items need attention.
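A simplified, unweighted sketch of such a roll-up (category names, field names, and data are invented):

```python
# Roll item-level scores up into category and global scores so that low
# pockets are not hidden by the aggregate. Names and data are invented.
RECOMMENDED = ["description", "ontology_term", "synonyms"]

dataset = {
    "Category A": [{"description": "x", "ontology_term": "y", "synonyms": ["z"]}],
    "Category C": [{"description": "x"}, {}],
}

def item_score(item):
    filled = sum(1 for field in RECOMMENDED if item.get(field) not in (None, "", []))
    return 100 * filled / len(RECOMMENDED)

all_items = [item for items in dataset.values() for item in items]
print(f"Global: {sum(map(item_score, all_items)) / len(all_items):.1f}%")
for name, items in dataset.items():
    print(f"  {name}: {sum(map(item_score, items)) / len(items):.1f}%")
# Global: 44.4%
#   Category A: 100.0%
#   Category C: 16.7%
```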
The Role of Thresholds
Thresholds convert continuous measurements back into actionable signals. Rather than asking "what's our completeness percentage?" (interesting but not actionable), thresholds answer "is our completeness acceptable?" (actionable).
Effective threshold design recognizes:
- Path-specific thresholds: Critical paths (e.g., `diseases[].ontology_term`) might require 90% while others accept 60%
- Violation reporting: When thresholds are breached, report the gap so teams know how much work is needed
- Progressive tightening: Start with achievable thresholds, then raise them as data quality improves
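A minimal sketch of path-specific thresholds with gap reporting (the paths, measurements, and thresholds are illustrative):

```python
# Check measured completeness against per-path thresholds and report the gap.
# Paths, measurements, and thresholds are invented for illustration.
thresholds = {"diseases[].ontology_term": 90.0, "diseases[].description": 60.0}
measured = {"diseases[].ontology_term": 81.1, "diseases[].description": 67.5}

for path, minimum in thresholds.items():
    actual = measured.get(path, 0.0)
    if actual < minimum:
        gap = minimum - actual
        print(f"VIOLATION {path}: {actual:.1f}% < {minimum:.1f}% (gap: {gap:.1f} points)")
    else:
        print(f"OK        {path}: {actual:.1f}% >= {minimum:.1f}%")
# VIOLATION diseases[].ontology_term: 81.1% < 90.0% (gap: 8.9 points)
# OK        diseases[].description: 67.5% >= 60.0%
```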
Entropy and Data Quality Degradation
Information theory offers another lens on completeness. Shannon entropy measures the information content of a message. A fully populated record with rich metadata has high information content; a sparse record with only required fields has low information content.
More practically, data quality degrades over time unless actively maintained. Records become stale. Field semantics drift. New requirements emerge that existing data doesn't satisfy. Continuous completeness monitoring detects these trends before they become crises.
Connecting to Curation Workflows
Completeness measurement is most valuable when it connects to curation workflows:
- Identify gaps: Which items have the lowest completeness?
- Prioritize work: Given limited curator time, which gaps matter most?
- Track progress: Is curation effort actually improving completeness?
- Enforce standards: Prevent merging data that would regress quality
This positions completeness assessment as infrastructure for data stewardship, not just a metric to report.
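The first two steps can be as simple as sorting items that fall below a target score, as in this sketch with invented item names and numbers:

```python
# Turn item-level completeness scores into a prioritized curation worklist.
# Item names, scores, and the target are invented for illustration.
item_scores = {"Item C.2": 25.0, "Item C.3": 20.0, "Item C.1": 80.0, "Item A.1": 100.0}
TARGET = 60.0  # curation target for an individual item

worklist = sorted(
    (name for name, score in item_scores.items() if score < TARGET),
    key=item_scores.get,  # lowest completeness first
)
for name in worklist:
    print(f"{name}: {item_scores[name]:.1f}% complete")
# Item C.3: 20.0% complete
# Item C.2: 25.0% complete
```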
Summary: Two Complementary Approaches
| Aspect | Validation | Completeness Assessment |
|---|---|---|
| Question | Is this data acceptable? | Is this data useful? |
| Output | Pass/Fail | Percentage (0-100%) |
| Scope | Required constraints | Recommended fields |
| Action | Block bad data | Guide curation effort |
| Granularity | Per-record | Hierarchical/aggregated |
| Weights | Equal (all required) | Configurable by importance |
Both are necessary. Validation ensures data integrity. Completeness assessment measures data utility. Together, they provide a complete picture of data quality.
How This Applies to LinkML
In LinkML schemas, the distinction between required and recommended fields is explicit:
- Required fields (`required: true`) — Enforced by LinkML validators
- Recommended fields (`recommended: true`) — Not enforced, but tracked by this tool
This makes LinkML particularly well-suited for completeness assessment: the schema itself encodes which fields matter for quality, and the tool can automatically discover and measure them without additional configuration.
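For instance, a schema fragment can be inspected directly to see which slots will be validated and which will be tracked. The sketch below uses plain PyYAML and an invented Disease class to keep the example small; a real pipeline would more likely rely on SchemaView from linkml-runtime, which also resolves inherited and imported slots:

```python
# A sketch of discovering required vs. recommended attributes from a LinkML
# schema fragment. Class and slot names are invented for illustration.
import yaml

SCHEMA = """
classes:
  Disease:
    attributes:
      name:
        required: true
      description:
        recommended: true
      ontology_term:
        recommended: true
      comment: {}
"""

attributes = yaml.safe_load(SCHEMA)["classes"]["Disease"]["attributes"]
required = [name for name, spec in attributes.items() if (spec or {}).get("required")]
recommended = [name for name, spec in attributes.items() if (spec or {}).get("recommended")]

print("validated:", required)                    # ['name']
print("tracked for completeness:", recommended)  # ['description', 'ontology_term']
```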
For implementation details, path notation syntax, and configuration options, see the Reference and How-To sections.
Further Reading
- Wang, R. Y. & Strong, D. M. (1996). Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4), 5-33.
- RDA Working Group (2020). FAIR Data Maturity Model. Data Science Journal.
- DAMA-NL (2020). Dimensions of Data Quality. Research Paper.
- TDWG Biodiversity Data Quality Interest Group — Framework for assessing fitness for use in biodiversity data.