Data Dictionary Format ====================== Schema Automator's canonical data dictionary format — the format we ask new studies to produce when they're onboarding to a data-sharing pipeline. This is a forward-looking, prescriptive target spec. The audience is a data owner at a *new* study, preparing their data for sharing, before any inconsistent local conventions have set in. Handling of legacy or messy data dictionaries is a separate concern; those are normalized into this format by an upstream layer (see `linkml/schema-automator#200 `_). Goals ----- The format provides enough computable information for Schema Automator to produce maximally-enriched LinkML schemas, while staying simple enough that any researcher — clinical, environmental, social, business — can author and maintain it without specialized tooling. The format is row-per-variable and deliberately minimal. Substrate --------- The recommended substrate is **TSV (tab-separated values)**. CSV is also accepted; tabs avoid quoting issues when code values contain commas. A YAML form is sanctioned for tooling-friendly authoring (e.g., schema emission, programmatic generation). YAML and TSV are equivalent — the same set of rows, the same fields, the same conformance rules. Tooling consumes either. The canonical machine-readable definition is the LinkML schema at ``schema_automator/metamodels/data_dictionary.yaml``. Two specs --------- The format is structured as two specs: **Spec A** is the recommended set of fields. This is what we ask researchers to produce. **Spec B** is a set of optional columns that researchers may add independently. Adopting one Spec B column does not require adopting any other; tooling consumes whichever optional fields are present and ignores the rest. Spec A: recommended fields --------------------------- Each row of a data dictionary describes one column of the data file. .. list-table:: :header-rows: 1 :widths: 15 20 65 * - Field - When required - Description * - ``name`` - Always - Column name as it appears in the data file. Accepts a string identifier or a URI/CURIE; URI/CURIE forms are preferred when the column has a known semantic identity in a controlled vocabulary. * - ``type`` - Best-practice required - Data type from the canonical type vocabulary (see below). * - ``description`` - Best-practice required - Prose description of what the variable represents. **Must not contain code lists, unit declarations, value ranges, or example values** — those have dedicated fields. * - ``codes`` - When ``type`` is ``permissible_values`` - The permissible values for this column, encoded as ``code, label | code, label | ...``. Bareword shorthand is allowed: a token without a comma is interpreted as ``value, value``. * - ``unit`` - When ``type`` is ``integer`` or ``decimal`` - Unit of measure. Use the literal token ``none`` for genuinely unitless quantities (counts, ratios). UCUM units are encouraged but not required; freeform reasonable units (``mg/dL``, ``servings/week``) are accepted. * - ``min`` - When ``type`` is ``integer`` or ``decimal`` - Minimum permissible value. Use ``none`` for genuinely unbounded variables. * - ``max`` - When ``type`` is ``integer`` or ``decimal`` - Maximum permissible value. Use ``none`` for genuinely unbounded variables. The ``none`` token is distinct from an empty cell. An empty cell means the author has not declared the field and is reported as a conformance issue. ``none`` is the explicit declaration that the field is genuinely not applicable. Spec B: optional columns ------------------------- Researchers may add any subset of the following columns to a data dictionary. Each is independently adoptable — including ``label`` does not imply including ``required``. .. list-table:: :header-rows: 1 :widths: 18 82 * - Field - Description * - ``label`` - Short human-readable name for the variable. Distinct from ``description`` (prose) and ``name`` (machine identifier). * - ``multivalued`` - Boolean. Whether each cell of this column contains a list of values rather than a single value (e.g., multi-select responses). * - ``required`` - Boolean. Whether this column requires a value in every row. Rarely accurately known at authoring time; if absent, schema inference fills the gap. * - ``pattern`` - Regular expression that all values of this column must match. Power-user field; pattern inference may fill the gap if absent. * - ``uri`` - URI or CURIE that semantically anchors this variable in a controlled vocabulary (e.g., ``OMOP:1234567``). Equivalent to LinkML's ``slot_uri`` for the emitted slot. * - ``see_also`` - External references — codebooks, study protocols, standards. Multivalued in TSV form: see "Multivalued TSV cells" below. * - ``example_values`` - Sample values for this column. Useful as authoring guidance for consumers and as a fallback signal for pattern inference when ``pattern`` is not provided. Multivalued in TSV form: see "Multivalued TSV cells" below. Multivalued TSV cells ~~~~~~~~~~~~~~~~~~~~~ Multivalued Spec B fields (``see_also``, ``example_values``) encode multiple values pipe-separated within a single cell, using the same convention as the ``codes`` field. Whitespace around the ``|`` separator is trimmed; the values themselves are kept verbatim. If a value contains a literal pipe, escape it as ``\|`` (same rule as in codes). Example (a ``see_also`` cell):: LOINC:2160-0 | https://example.org/protocol.pdf In YAML form, multivalued fields are written as plain YAML lists. Type vocabulary --------------- The canonical type vocabulary is fixed at ten values. Anything outside this set is disallowed. .. list-table:: :header-rows: 1 :widths: 25 75 * - Value - Meaning * - ``string`` - Sequence of characters. * - ``integer`` - Whole number. * - ``decimal`` - Real number. Covers ``float`` and ``double`` at the LinkML emission layer; Schema Automator chooses the appropriate LinkML primitive based on observed data. * - ``boolean`` - True/false value. * - ``date`` - Calendar date, ISO 8601 (``YYYY-MM-DD``) by default. * - ``datetime`` - Date and time, ISO 8601 by default. * - ``time`` - Time of day, ISO 8601 by default. * - ``uri`` - Uniform Resource Identifier (full IRI form). * - ``curie`` - Compact URI (e.g., ``OMOP:1234567``). * - ``permissible_values`` - The variable is enumerated. The ``codes`` field declares its permissible values. The vocabulary is the researcher-comprehensible subset of LinkML's built-in types. Technical LinkML primitives (``ncname``, ``nodeidentifier``, ``jsonpath``, etc.) are excluded — researchers won't write them, and emission can't honor the distinction usefully without other context. Composite or multi-typed declarations (e.g., ``decimal, encoded``) are not permitted. Enumerated columns use the ``permissible_values`` type and declare their codes in the ``codes`` field; an integer column whose values are codes (e.g., ``1=Yes, 0=No``) is declared as ``permissible_values`` with codes, not as ``integer`` with codes. Codes ----- The ``codes`` field for ``permissible_values`` columns is a list of ``PermissibleValueDefinition`` records — each carrying a code, an optional label, and optional per-code metadata (description, semantic URI). The canonical structured form is used in YAML; TSV authors use a compact serialization grammar described below. Canonical (structured) form ~~~~~~~~~~~~~~~~~~~~~~~~~~~ In YAML, ``codes`` is a list of mappings. Each mapping has at least a ``code`` (the literal data value); ``label`` is recommended; ``description`` and ``uri`` are optional Spec B-style additions for richer per-code metadata. :: codes: - {code: F, label: Female} - {code: M, label: Male} - {code: O, label: Other} - {code: U, label: Unknown} For the bareword case (where the literal data value is itself the human-readable form), the ``label`` is omitted: :: codes: - {code: EHR} - {code: Survey} - {code: Lab} Per-code metadata example:: codes: - code: "1" label: Current smoker description: Self-reported current daily or occasional smoker. uri: SCTID:77176002 TSV serialization ~~~~~~~~~~~~~~~~~ In TSV form, the ``codes`` list is serialized as a single string of pipe-separated tokens. Each token is either: - ``code, label`` — a comma-separated pair, where ``code`` is the literal value as it appears in the data and ``label`` is the human-readable meaning. The first comma in a token separates code from label; subsequent commas in the label are literal text. Whitespace around the separators is ignored. - ``value`` — bareword shorthand, interpreted as a code with no label (the literal value serves as both code and meaning). Use this when the literal data value is itself the human-readable form (e.g., color names, country codes, status strings). Examples:: 1, Yes | 0, No | 2, Unknown EHR | Survey | Lab F, Female | M, Male | O, Other | U, Unknown The TSV serialization is lossy with respect to the canonical structured form — only ``code`` and ``label`` survive the round-trip. Per-code ``description`` and ``uri`` cannot be expressed in the TSV grammar; if those fields matter, use YAML or write them in a sibling YAML file. Escaping ~~~~~~~~ To include a literal comma, pipe, or backslash in a code, prefix it with a backslash: - ``\,`` — literal comma - ``\|`` — literal pipe - ``\\`` — literal backslash Examples with escaping:: 1, Black\, non-Hispanic | 2, White\, non-Hispanic | 3, Hispanic >=$50\,000, Middle income | <$50\,000, Low income Note that labels (the part after the first comma in a ``code, label`` token) may contain unescaped commas — only the first comma per token is parsed as the code/label separator. Codes (the part before the comma, or the entire bareword) must escape commas because the first unescaped comma would otherwise be parsed as the separator. Other backslash sequences (e.g., ``\n``, ``\t``) are reserved for future revisions and are currently invalid. Conformance ----------- The format defines two conformance modes: **Default mode (warn-only).** All best-practice and conditional requirements are reported as warnings. Tooling continues processing; the author sees a list of issues to address. **Strict mode (fail).** Any missing best-practice/conditional field, any invalid type vocabulary value, malformed codes encoding, or type-inappropriate field (e.g., ``codes`` on a ``boolean`` row, ``unit`` on a ``string`` row) causes validation to fail. Use this in CI or when consuming a data dictionary that must be clean. Both modes are implemented by the same validator running against the same LinkML schema. Strict mode runs ``linkml-validate`` directly; default mode wraps it and downgrades non-fatal issues to warnings. A small number of content-quality concerns cannot be expressed in the LinkML schema cleanly and are enforced by an additional lint pass: - Descriptions containing code lists, units, ranges, or example values (see "Description content rules" below). - Numeric ``min``/``max`` values whose numeric form does not match the declared type — e.g., a fractional ``0.5`` minimum on an ``integer`` column. The schema accepts any number for ``min``/``max``; the lint pass catches the type mismatch. Description content rules ------------------------- The ``description`` field must contain only prose. The following content is disallowed: - Code lists or enumerated values (use ``codes``) - Unit declarations (use ``unit``) - Numeric ranges (use ``min`` and ``max``) - Example values (use ``example_values``, optional Spec B) Descriptions polluted with such content are less useful, not more, and indicate the author put information in the wrong place. Enforcement is performed by a content-quality lint (separate from structural validation) that flags suspicious description content. Document-level metadata ----------------------- The format intentionally has no document-level metadata in v1. There is no header section, no sidecar metadata file, and no filename convention. Every productive option (header that breaks CSV/TSV format, sidecar that adds coupling, filename encoding that's fragile to parse) has worse trade-offs than dropping document-level metadata entirely. If a future revision adds it, the substrate decision will need to be revisited. Relationship to existing formats -------------------------------- Across communities, real-world data dictionaries converge on a small core: **name, description, and (sometimes, partially) type information**. That common ground is what nearly every author actually writes, regardless of which format they nominally produce. This spec is a deliberately small extension of that core — adding the structure needed to validate and enrich, while staying close to what people already draft in the wild. The flat-file shape is a deliberate alignment, not an oversight. Every format below (with the partial exception of XML codebook standards) starts from row-per-variable, columns-of-attributes. That's how data dictionaries naturally exist in researcher hands — a spreadsheet, a CSV, a few columns deep. Adding hierarchy, sidecar files, or required tooling would make authoring harder *and* make conversion from existing formats harder. We chose simplicity at this layer in exchange for keeping the existing-format-to-this-format adapters tractable. We are implementing the core of the formats listed below. If our spec ever diverges substantially from that core, that's a signal we've over-rotated on novelty and should re-examine the divergence rather than ship it. **Frictionless Table Schema** (data.gov, OpenML, the ``datasets`` package). Our type vocabulary is a researcher-comprehensible subset of Frictionless's; ``pattern`` and per-column structure align directly. Where we diverge: we treat enumerated values as a primary type (``permissible_values``) rather than a string-with-enum-constraint, matching how researchers think about multiple-choice columns. We intend to ship a Frictionless adapter (translator in both directions). **REDCap data dictionary** (de facto standard in clinical research). Our codes encoding (``code, label | code, label | ...`` with bareword shorthand) is taken directly from REDCap. Our type-as-major-axis approach mirrors REDCap's Field Type. We diverge by collapsing REDCap's UI-flavored types (``radio``, ``dropdown``, ``checkbox``) into the single semantic type ``permissible_values`` — a data dictionary describes data, not the form widget that produced it. We intend to ship a REDCap adapter. **SchemaSheets** (LinkML community). This format supersedes SchemaSheets for the data-dictionary use case. Where SchemaSheets requires authors to work with LinkML conventions (slot/class distinctions, slot URIs, mixin semantics), we present a researcher-comprehensible flat surface. SchemaSheets capabilities not directly expressible here will be addressed during the migration period. A SchemaSheets adapter is part of the planned tooling. **DDI / CDISC** (heavyweight institutional standards). These describe a different scope — full study lifecycle, regulatory submission. Not displaced; complementary in cases where they apply. Out of scope ------------ This format intentionally covers a single, flat tabular dataset described by a single data dictionary. The following use cases need a higher-order artifact on top of, or alongside, this format, and are explicitly out of scope: - **Hierarchical or nested data.** Columns containing JSON-shaped structures, repeated groups, document trees. Authoring as a flat row-per-column descriptor doesn't fit. Use a structured data model (LinkML class definitions, JSON Schema, Avro) for the nesting and reference this format for leaf-level columns where applicable. - **Multi-table datasets with relationships.** Several related tables sharing keys (e.g., dbGaP's phs/pht/phv structure, relational databases). Each table can have its own data dictionary in this format, but the relational structure between them — foreign keys, joins, shared dimensions — is the job of a higher-order schema artifact. A simple manifest pattern (a list of data dictionaries plus a relations file) would address this without burdening per-DD authoring. - **Variable versioning and lineage.** "This variable was renamed from ``old_name`` in v3," "valid from 2020-01 to 2023-06," provenance chains. Versioning concerns belong at the dataset/schema level, not per-row. - **Cross-column constraints.** Conditional required-fields, dependencies ("field A is required only if field B is X"), value relationships across columns. The general case needs a real expression language, not a small format extension. - **Domain-specific metadata.** Clinical (codeset versions, encounter types), survey (question wording, branching logic), environmental (sensor calibration, measurement uncertainty). The format is intentionally domain-naive. Domain extensions can live in optional Spec B columns by convention, but defining them is the responsibility of domain communities, not this spec. Future revisions ---------------- The following are deferred from v1 but may be added as small additive extensions to *this* format in future revisions: - Named, reusable enum definitions (e.g., declaring an enum once and referencing it from multiple ``permissible_values`` columns). - A ``format`` field for refining types (``string`` + ``format: email``, date format strings, etc.). - Document-level metadata and file-level conventions. Examples -------- Worked examples are provided in ``docs/examples/``: - ``dd_example_minimal.tsv`` — Spec A only. - ``dd_example_with_optional.tsv`` — Spec A plus selected Spec B columns. - ``dd_example_minimal.yaml`` — Spec A in YAML form.