Data Dictionary Format

Schema Automator’s canonical data dictionary format — the format we ask new studies to produce when they’re onboarding to a data-sharing pipeline.

This is a forward-looking, prescriptive target spec. The audience is a data owner at a new study, preparing their data for sharing, before any inconsistent local conventions have set in. Handling of legacy or messy data dictionaries is a separate concern; those are normalized into this format by an upstream layer (see linkml/schema-automator#200).

Goals

The format provides enough computable information for Schema Automator to produce maximally-enriched LinkML schemas, while staying simple enough that any researcher — clinical, environmental, social, business — can author and maintain it without specialized tooling. The format is row-per-variable and deliberately minimal.

Substrate

The recommended substrate is TSV (tab-separated values). CSV is also accepted; tabs avoid quoting issues when code values contain commas.

A YAML form is sanctioned for tooling-friendly authoring (e.g., schema emission, programmatic generation). YAML and TSV are equivalent — the same set of rows, the same fields, the same conformance rules. Tooling consumes either.

The canonical machine-readable definition is the LinkML schema at schema_automator/metamodels/data_dictionary.yaml.

Two specs

The format is structured as two specs:

Spec A is the recommended set of fields. This is what we ask researchers to produce.

Spec B is a set of optional columns that researchers may add independently. Adopting one Spec B column does not require adopting any other; tooling consumes whichever optional fields are present and ignores the rest.

Spec B: optional columns

Researchers may add any subset of the following columns to a data dictionary. Each is independently adoptable — including label does not imply including required.

Field

Description

label

Short human-readable name for the variable. Distinct from description (prose) and name (machine identifier).

multivalued

Boolean. Whether each cell of this column contains a list of values rather than a single value (e.g., multi-select responses).

required

Boolean. Whether this column requires a value in every row. Rarely accurately known at authoring time; if absent, schema inference fills the gap.

pattern

Regular expression that all values of this column must match. Power-user field; pattern inference may fill the gap if absent.

uri

URI or CURIE that semantically anchors this variable in a controlled vocabulary (e.g., OMOP:1234567). Equivalent to LinkML’s slot_uri for the emitted slot.

see_also

External references — codebooks, study protocols, standards. Multivalued in TSV form: see “Multivalued TSV cells” below.

example_values

Sample values for this column. Useful as authoring guidance for consumers and as a fallback signal for pattern inference when pattern is not provided. Multivalued in TSV form: see “Multivalued TSV cells” below.

Multivalued TSV cells

Multivalued Spec B fields (see_also, example_values) encode multiple values pipe-separated within a single cell, using the same convention as the codes field. Whitespace around the | separator is trimmed; the values themselves are kept verbatim. If a value contains a literal pipe, escape it as \| (same rule as in codes).

Example (a see_also cell):

LOINC:2160-0 | https://example.org/protocol.pdf

In YAML form, multivalued fields are written as plain YAML lists.

Type vocabulary

The canonical type vocabulary is fixed at ten values. Anything outside this set is disallowed.

Value

Meaning

string

Sequence of characters.

integer

Whole number.

decimal

Real number. Covers float and double at the LinkML emission layer; Schema Automator chooses the appropriate LinkML primitive based on observed data.

boolean

True/false value.

date

Calendar date, ISO 8601 (YYYY-MM-DD) by default.

datetime

Date and time, ISO 8601 by default.

time

Time of day, ISO 8601 by default.

uri

Uniform Resource Identifier (full IRI form).

curie

Compact URI (e.g., OMOP:1234567).

permissible_values

The variable is enumerated. The codes field declares its permissible values.

The vocabulary is the researcher-comprehensible subset of LinkML’s built-in types. Technical LinkML primitives (ncname, nodeidentifier, jsonpath, etc.) are excluded — researchers won’t write them, and emission can’t honor the distinction usefully without other context.

Composite or multi-typed declarations (e.g., decimal, encoded) are not permitted. Enumerated columns use the permissible_values type and declare their codes in the codes field; an integer column whose values are codes (e.g., 1=Yes, 0=No) is declared as permissible_values with codes, not as integer with codes.

Codes

The codes field for permissible_values columns is a list of PermissibleValueDefinition records — each carrying a code, an optional label, and optional per-code metadata (description, semantic URI). The canonical structured form is used in YAML; TSV authors use a compact serialization grammar described below.

Canonical (structured) form

In YAML, codes is a list of mappings. Each mapping has at least a code (the literal data value); label is recommended; description and uri are optional Spec B-style additions for richer per-code metadata.

codes:
  - {code: F, label: Female}
  - {code: M, label: Male}
  - {code: O, label: Other}
  - {code: U, label: Unknown}

For the bareword case (where the literal data value is itself the human-readable form), the label is omitted:

codes:
  - {code: EHR}
  - {code: Survey}
  - {code: Lab}

Per-code metadata example:

codes:
  - code: "1"
    label: Current smoker
    description: Self-reported current daily or occasional smoker.
    uri: SCTID:77176002

TSV serialization

In TSV form, the codes list is serialized as a single string of pipe-separated tokens. Each token is either:

  • code, label — a comma-separated pair, where code is the literal value as it appears in the data and label is the human-readable meaning. The first comma in a token separates code from label; subsequent commas in the label are literal text. Whitespace around the separators is ignored.

  • value — bareword shorthand, interpreted as a code with no label (the literal value serves as both code and meaning). Use this when the literal data value is itself the human-readable form (e.g., color names, country codes, status strings).

Examples:

1, Yes | 0, No | 2, Unknown
EHR | Survey | Lab
F, Female | M, Male | O, Other | U, Unknown

The TSV serialization is lossy with respect to the canonical structured form — only code and label survive the round-trip. Per-code description and uri cannot be expressed in the TSV grammar; if those fields matter, use YAML or write them in a sibling YAML file.

Escaping

To include a literal comma, pipe, or backslash in a code, prefix it with a backslash:

  • \, — literal comma

  • \| — literal pipe

  • \\ — literal backslash

Examples with escaping:

1, Black\, non-Hispanic | 2, White\, non-Hispanic | 3, Hispanic
>=$50\,000, Middle income | <$50\,000, Low income

Note that labels (the part after the first comma in a code, label token) may contain unescaped commas — only the first comma per token is parsed as the code/label separator. Codes (the part before the comma, or the entire bareword) must escape commas because the first unescaped comma would otherwise be parsed as the separator.

Other backslash sequences (e.g., \n, \t) are reserved for future revisions and are currently invalid.

Conformance

The format defines two conformance modes:

Default mode (warn-only). All best-practice and conditional requirements are reported as warnings. Tooling continues processing; the author sees a list of issues to address.

Strict mode (fail). Any missing best-practice/conditional field, any invalid type vocabulary value, malformed codes encoding, or type-inappropriate field (e.g., codes on a boolean row, unit on a string row) causes validation to fail. Use this in CI or when consuming a data dictionary that must be clean.

Both modes are implemented by the same validator running against the same LinkML schema. Strict mode runs linkml-validate directly; default mode wraps it and downgrades non-fatal issues to warnings.

A small number of content-quality concerns cannot be expressed in the LinkML schema cleanly and are enforced by an additional lint pass:

  • Descriptions containing code lists, units, ranges, or example values (see “Description content rules” below).

  • Numeric min/max values whose numeric form does not match the declared type — e.g., a fractional 0.5 minimum on an integer column. The schema accepts any number for min/max; the lint pass catches the type mismatch.

Description content rules

The description field must contain only prose. The following content is disallowed:

  • Code lists or enumerated values (use codes)

  • Unit declarations (use unit)

  • Numeric ranges (use min and max)

  • Example values (use example_values, optional Spec B)

Descriptions polluted with such content are less useful, not more, and indicate the author put information in the wrong place. Enforcement is performed by a content-quality lint (separate from structural validation) that flags suspicious description content.

Document-level metadata

The format intentionally has no document-level metadata in v1. There is no header section, no sidecar metadata file, and no filename convention. Every productive option (header that breaks CSV/TSV format, sidecar that adds coupling, filename encoding that’s fragile to parse) has worse trade-offs than dropping document-level metadata entirely. If a future revision adds it, the substrate decision will need to be revisited.

Relationship to existing formats

Across communities, real-world data dictionaries converge on a small core: name, description, and (sometimes, partially) type information. That common ground is what nearly every author actually writes, regardless of which format they nominally produce. This spec is a deliberately small extension of that core — adding the structure needed to validate and enrich, while staying close to what people already draft in the wild.

The flat-file shape is a deliberate alignment, not an oversight. Every format below (with the partial exception of XML codebook standards) starts from row-per-variable, columns-of-attributes. That’s how data dictionaries naturally exist in researcher hands — a spreadsheet, a CSV, a few columns deep. Adding hierarchy, sidecar files, or required tooling would make authoring harder and make conversion from existing formats harder. We chose simplicity at this layer in exchange for keeping the existing-format-to-this-format adapters tractable.

We are implementing the core of the formats listed below. If our spec ever diverges substantially from that core, that’s a signal we’ve over-rotated on novelty and should re-examine the divergence rather than ship it.

Frictionless Table Schema (data.gov, OpenML, the datasets package). Our type vocabulary is a researcher-comprehensible subset of Frictionless’s; pattern and per-column structure align directly. Where we diverge: we treat enumerated values as a primary type (permissible_values) rather than a string-with-enum-constraint, matching how researchers think about multiple-choice columns. We intend to ship a Frictionless adapter (translator in both directions).

REDCap data dictionary (de facto standard in clinical research). Our codes encoding (code, label | code, label | ... with bareword shorthand) is taken directly from REDCap. Our type-as-major-axis approach mirrors REDCap’s Field Type. We diverge by collapsing REDCap’s UI-flavored types (radio, dropdown, checkbox) into the single semantic type permissible_values — a data dictionary describes data, not the form widget that produced it. We intend to ship a REDCap adapter.

SchemaSheets (LinkML community). This format supersedes SchemaSheets for the data-dictionary use case. Where SchemaSheets requires authors to work with LinkML conventions (slot/class distinctions, slot URIs, mixin semantics), we present a researcher-comprehensible flat surface. SchemaSheets capabilities not directly expressible here will be addressed during the migration period. A SchemaSheets adapter is part of the planned tooling.

DDI / CDISC (heavyweight institutional standards). These describe a different scope — full study lifecycle, regulatory submission. Not displaced; complementary in cases where they apply.

Out of scope

This format intentionally covers a single, flat tabular dataset described by a single data dictionary. The following use cases need a higher-order artifact on top of, or alongside, this format, and are explicitly out of scope:

  • Hierarchical or nested data. Columns containing JSON-shaped structures, repeated groups, document trees. Authoring as a flat row-per-column descriptor doesn’t fit. Use a structured data model (LinkML class definitions, JSON Schema, Avro) for the nesting and reference this format for leaf-level columns where applicable.

  • Multi-table datasets with relationships. Several related tables sharing keys (e.g., dbGaP’s phs/pht/phv structure, relational databases). Each table can have its own data dictionary in this format, but the relational structure between them — foreign keys, joins, shared dimensions — is the job of a higher-order schema artifact. A simple manifest pattern (a list of data dictionaries plus a relations file) would address this without burdening per-DD authoring.

  • Variable versioning and lineage. “This variable was renamed from old_name in v3,” “valid from 2020-01 to 2023-06,” provenance chains. Versioning concerns belong at the dataset/schema level, not per-row.

  • Cross-column constraints. Conditional required-fields, dependencies (“field A is required only if field B is X”), value relationships across columns. The general case needs a real expression language, not a small format extension.

  • Domain-specific metadata. Clinical (codeset versions, encounter types), survey (question wording, branching logic), environmental (sensor calibration, measurement uncertainty). The format is intentionally domain-naive. Domain extensions can live in optional Spec B columns by convention, but defining them is the responsibility of domain communities, not this spec.

Future revisions

The following are deferred from v1 but may be added as small additive extensions to this format in future revisions:

  • Named, reusable enum definitions (e.g., declaring an enum once and referencing it from multiple permissible_values columns).

  • A format field for refining types (string + format: email, date format strings, etc.).

  • Document-level metadata and file-level conventions.

Examples

Worked examples are provided in docs/examples/:

  • dd_example_minimal.tsv — Spec A only.

  • dd_example_with_optional.tsv — Spec A plus selected Spec B columns.

  • dd_example_minimal.yaml — Spec A in YAML form.