Data Dictionary Format¶
Schema Automator’s canonical data dictionary format — the format we ask new studies to produce when they’re onboarding to a data-sharing pipeline.
This is a forward-looking, prescriptive target spec. The audience is a data owner at a new study, preparing their data for sharing, before any inconsistent local conventions have set in. Handling of legacy or messy data dictionaries is a separate concern; those are normalized into this format by an upstream layer (see linkml/schema-automator#200).
Goals¶
The format provides enough computable information for Schema Automator to produce maximally-enriched LinkML schemas, while staying simple enough that any researcher — clinical, environmental, social, business — can author and maintain it without specialized tooling. The format is row-per-variable and deliberately minimal.
Substrate¶
The recommended substrate is TSV (tab-separated values). CSV is also accepted; tabs avoid quoting issues when code values contain commas.
A YAML form is sanctioned for tooling-friendly authoring (e.g., schema emission, programmatic generation). YAML and TSV are equivalent — the same set of rows, the same fields, the same conformance rules. Tooling consumes either.
The canonical machine-readable definition is the LinkML schema at
schema_automator/metamodels/data_dictionary.yaml.
Two specs¶
The format is structured as two specs:
Spec A is the recommended set of fields. This is what we ask researchers to produce.
Spec B is a set of optional columns that researchers may add independently. Adopting one Spec B column does not require adopting any other; tooling consumes whichever optional fields are present and ignores the rest.
Spec A: recommended fields¶
Each row of a data dictionary describes one column of the data file.
Field |
When required |
Description |
|---|---|---|
|
Always |
Column name as it appears in the data file. Accepts a string identifier or a URI/CURIE; URI/CURIE forms are preferred when the column has a known semantic identity in a controlled vocabulary. |
|
Best-practice required |
Data type from the canonical type vocabulary (see below). |
|
Best-practice required |
Prose description of what the variable represents. Must not contain code lists, unit declarations, value ranges, or example values — those have dedicated fields. |
|
When |
The permissible values for this column, encoded as
|
|
When |
Unit of measure. Use the literal token |
|
When |
Minimum permissible value. Use |
|
When |
Maximum permissible value. Use |
The none token is distinct from an empty cell. An empty cell means the
author has not declared the field and is reported as a conformance issue.
none is the explicit declaration that the field is genuinely not
applicable.
Spec B: optional columns¶
Researchers may add any subset of the following columns to a data
dictionary. Each is independently adoptable — including label does not
imply including required.
Field |
Description |
|---|---|
|
Short human-readable name for the variable. Distinct from
|
|
Boolean. Whether each cell of this column contains a list of values rather than a single value (e.g., multi-select responses). |
|
Boolean. Whether this column requires a value in every row. Rarely accurately known at authoring time; if absent, schema inference fills the gap. |
|
Regular expression that all values of this column must match. Power-user field; pattern inference may fill the gap if absent. |
|
URI or CURIE that semantically anchors this variable in a
controlled vocabulary (e.g., |
|
External references — codebooks, study protocols, standards. Multivalued in TSV form: see “Multivalued TSV cells” below. |
|
Sample values for this column. Useful as authoring guidance for
consumers and as a fallback signal for pattern inference when
|
Multivalued TSV cells¶
Multivalued Spec B fields (see_also, example_values) encode
multiple values pipe-separated within a single cell, using the same
convention as the codes field. Whitespace around the | separator
is trimmed; the values themselves are kept verbatim. If a value contains
a literal pipe, escape it as \| (same rule as in codes).
Example (a see_also cell):
LOINC:2160-0 | https://example.org/protocol.pdf
In YAML form, multivalued fields are written as plain YAML lists.
Type vocabulary¶
The canonical type vocabulary is fixed at ten values. Anything outside this set is disallowed.
Value |
Meaning |
|---|---|
|
Sequence of characters. |
|
Whole number. |
|
Real number. Covers |
|
True/false value. |
|
Calendar date, ISO 8601 ( |
|
Date and time, ISO 8601 by default. |
|
Time of day, ISO 8601 by default. |
|
Uniform Resource Identifier (full IRI form). |
|
Compact URI (e.g., |
|
The variable is enumerated. The |
The vocabulary is the researcher-comprehensible subset of LinkML’s built-in
types. Technical LinkML primitives (ncname, nodeidentifier,
jsonpath, etc.) are excluded — researchers won’t write them, and
emission can’t honor the distinction usefully without other context.
Composite or multi-typed declarations (e.g., decimal, encoded) are not
permitted. Enumerated columns use the permissible_values type and
declare their codes in the codes field; an integer column whose values
are codes (e.g., 1=Yes, 0=No) is declared as permissible_values
with codes, not as integer with codes.
Codes¶
The codes field for permissible_values columns is a list of
PermissibleValueDefinition records — each carrying a code, an
optional label, and optional per-code metadata (description, semantic
URI). The canonical structured form is used in YAML; TSV authors use a
compact serialization grammar described below.
Canonical (structured) form¶
In YAML, codes is a list of mappings. Each mapping has at least a
code (the literal data value); label is recommended; description
and uri are optional Spec B-style additions for richer per-code
metadata.
codes:
- {code: F, label: Female}
- {code: M, label: Male}
- {code: O, label: Other}
- {code: U, label: Unknown}
For the bareword case (where the literal data value is itself the
human-readable form), the label is omitted:
codes:
- {code: EHR}
- {code: Survey}
- {code: Lab}
Per-code metadata example:
codes:
- code: "1"
label: Current smoker
description: Self-reported current daily or occasional smoker.
uri: SCTID:77176002
TSV serialization¶
In TSV form, the codes list is serialized as a single string of
pipe-separated tokens. Each token is either:
code, label— a comma-separated pair, wherecodeis the literal value as it appears in the data andlabelis the human-readable meaning. The first comma in a token separates code from label; subsequent commas in the label are literal text. Whitespace around the separators is ignored.value— bareword shorthand, interpreted as a code with no label (the literal value serves as both code and meaning). Use this when the literal data value is itself the human-readable form (e.g., color names, country codes, status strings).
Examples:
1, Yes | 0, No | 2, Unknown
EHR | Survey | Lab
F, Female | M, Male | O, Other | U, Unknown
The TSV serialization is lossy with respect to the canonical structured
form — only code and label survive the round-trip. Per-code
description and uri cannot be expressed in the TSV grammar; if
those fields matter, use YAML or write them in a sibling YAML file.
Escaping¶
To include a literal comma, pipe, or backslash in a code, prefix it with a backslash:
\,— literal comma\|— literal pipe\\— literal backslash
Examples with escaping:
1, Black\, non-Hispanic | 2, White\, non-Hispanic | 3, Hispanic
>=$50\,000, Middle income | <$50\,000, Low income
Note that labels (the part after the first comma in a code, label
token) may contain unescaped commas — only the first comma per token is
parsed as the code/label separator. Codes (the part before the comma, or
the entire bareword) must escape commas because the first unescaped comma
would otherwise be parsed as the separator.
Other backslash sequences (e.g., \n, \t) are reserved for future
revisions and are currently invalid.
Conformance¶
The format defines two conformance modes:
Default mode (warn-only). All best-practice and conditional requirements are reported as warnings. Tooling continues processing; the author sees a list of issues to address.
Strict mode (fail). Any missing best-practice/conditional field, any
invalid type vocabulary value, malformed codes encoding, or
type-inappropriate field (e.g., codes on a boolean row, unit
on a string row) causes validation to fail. Use this in CI or when
consuming a data dictionary that must be clean.
Both modes are implemented by the same validator running against the same
LinkML schema. Strict mode runs linkml-validate directly; default mode
wraps it and downgrades non-fatal issues to warnings.
A small number of content-quality concerns cannot be expressed in the LinkML schema cleanly and are enforced by an additional lint pass:
Descriptions containing code lists, units, ranges, or example values (see “Description content rules” below).
Numeric
min/maxvalues whose numeric form does not match the declared type — e.g., a fractional0.5minimum on anintegercolumn. The schema accepts any number formin/max; the lint pass catches the type mismatch.
Description content rules¶
The description field must contain only prose. The following content
is disallowed:
Code lists or enumerated values (use
codes)Unit declarations (use
unit)Numeric ranges (use
minandmax)Example values (use
example_values, optional Spec B)
Descriptions polluted with such content are less useful, not more, and indicate the author put information in the wrong place. Enforcement is performed by a content-quality lint (separate from structural validation) that flags suspicious description content.
Document-level metadata¶
The format intentionally has no document-level metadata in v1. There is no header section, no sidecar metadata file, and no filename convention. Every productive option (header that breaks CSV/TSV format, sidecar that adds coupling, filename encoding that’s fragile to parse) has worse trade-offs than dropping document-level metadata entirely. If a future revision adds it, the substrate decision will need to be revisited.
Relationship to existing formats¶
Across communities, real-world data dictionaries converge on a small core: name, description, and (sometimes, partially) type information. That common ground is what nearly every author actually writes, regardless of which format they nominally produce. This spec is a deliberately small extension of that core — adding the structure needed to validate and enrich, while staying close to what people already draft in the wild.
The flat-file shape is a deliberate alignment, not an oversight. Every format below (with the partial exception of XML codebook standards) starts from row-per-variable, columns-of-attributes. That’s how data dictionaries naturally exist in researcher hands — a spreadsheet, a CSV, a few columns deep. Adding hierarchy, sidecar files, or required tooling would make authoring harder and make conversion from existing formats harder. We chose simplicity at this layer in exchange for keeping the existing-format-to-this-format adapters tractable.
We are implementing the core of the formats listed below. If our spec ever diverges substantially from that core, that’s a signal we’ve over-rotated on novelty and should re-examine the divergence rather than ship it.
Frictionless Table Schema (data.gov, OpenML, the datasets package).
Our type vocabulary is a researcher-comprehensible subset of Frictionless’s;
pattern and per-column structure align directly. Where we diverge: we
treat enumerated values as a primary type (permissible_values) rather
than a string-with-enum-constraint, matching how researchers think about
multiple-choice columns. We intend to ship a Frictionless adapter
(translator in both directions).
REDCap data dictionary (de facto standard in clinical research). Our
codes encoding (code, label | code, label | ... with bareword
shorthand) is taken directly from REDCap. Our type-as-major-axis approach
mirrors REDCap’s Field Type. We diverge by collapsing REDCap’s UI-flavored
types (radio, dropdown, checkbox) into the single semantic
type permissible_values — a data dictionary describes data, not the
form widget that produced it. We intend to ship a REDCap adapter.
SchemaSheets (LinkML community). This format supersedes SchemaSheets for the data-dictionary use case. Where SchemaSheets requires authors to work with LinkML conventions (slot/class distinctions, slot URIs, mixin semantics), we present a researcher-comprehensible flat surface. SchemaSheets capabilities not directly expressible here will be addressed during the migration period. A SchemaSheets adapter is part of the planned tooling.
DDI / CDISC (heavyweight institutional standards). These describe a different scope — full study lifecycle, regulatory submission. Not displaced; complementary in cases where they apply.
Out of scope¶
This format intentionally covers a single, flat tabular dataset described by a single data dictionary. The following use cases need a higher-order artifact on top of, or alongside, this format, and are explicitly out of scope:
Hierarchical or nested data. Columns containing JSON-shaped structures, repeated groups, document trees. Authoring as a flat row-per-column descriptor doesn’t fit. Use a structured data model (LinkML class definitions, JSON Schema, Avro) for the nesting and reference this format for leaf-level columns where applicable.
Multi-table datasets with relationships. Several related tables sharing keys (e.g., dbGaP’s phs/pht/phv structure, relational databases). Each table can have its own data dictionary in this format, but the relational structure between them — foreign keys, joins, shared dimensions — is the job of a higher-order schema artifact. A simple manifest pattern (a list of data dictionaries plus a relations file) would address this without burdening per-DD authoring.
Variable versioning and lineage. “This variable was renamed from
old_namein v3,” “valid from 2020-01 to 2023-06,” provenance chains. Versioning concerns belong at the dataset/schema level, not per-row.Cross-column constraints. Conditional required-fields, dependencies (“field A is required only if field B is X”), value relationships across columns. The general case needs a real expression language, not a small format extension.
Domain-specific metadata. Clinical (codeset versions, encounter types), survey (question wording, branching logic), environmental (sensor calibration, measurement uncertainty). The format is intentionally domain-naive. Domain extensions can live in optional Spec B columns by convention, but defining them is the responsibility of domain communities, not this spec.
Future revisions¶
The following are deferred from v1 but may be added as small additive extensions to this format in future revisions:
Named, reusable enum definitions (e.g., declaring an enum once and referencing it from multiple
permissible_valuescolumns).A
formatfield for refining types (string+format: email, date format strings, etc.).Document-level metadata and file-level conventions.
Examples¶
Worked examples are provided in docs/examples/:
dd_example_minimal.tsv— Spec A only.dd_example_with_optional.tsv— Spec A plus selected Spec B columns.dd_example_minimal.yaml— Spec A in YAML form.