Simple data dictionaries

A data dictionary is a file (or collection of files) that unambiguously declares, defines, and annotates all the variables collected in a project and associated with a dataset (definition: FAIR Cookbook).

Schemasheets is an ideal framework for managing a data dictionary.

Example Data Dictionary

The FAIR Cookbook provides an example of a data dictionary for tracking various aspects of a research subject or model organism, including:

  • subject_id
  • species
  • strain (for model organisms)
  • age + age unit
  • etc

See Example.

Let's start by copying this directly into a Google Sheet.

You can see this on the v1 tab of this sheet.

| File Name | Variable Name | Variable Label | Variable Ontology ID or RDFtype | Variable ID Source | Variable Statistical Type | Variable Data Type | Variable Size | Max Allowed Value | Min Allowed Value | Regex | Allowed Value Shorthands | Allowed Value Descriptions | Computed Value | Unique (alone) | Unique (Combined with) | Required | Collection Form Name | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1_Subjects.txt | SUBJECT_ID | Subject number | https://schema.org/identifier | https://schema.org | categorical variable | integer | | | | | | | | Y | | Y | FORM 1 | |
| 1_Subjects.txt | SPECIES | Species name | https://schema.org/name | https://schema.org | categorical variable | string | | | | | | | | | | | FORM 1 | |
| 1_Subjects.txt | STRAIN | Strain | TODO substitute broken link https://bioschemas.org/profiles/Taxon/0.6-RELEASE/identifier | https://schemas.org/ | categorical variable | string | | | | | | http://purl.obolibrary.org/obo/NCBITaxon_40674 | | | | | FORM 1 | |
| 1_Subjects.txt | AGE | Age at study initiation | https://bioschemas.org/types/BioSample/0.1-RELEASE-2019_06_19 | https://bioschemas.org/ | continuous variable | integer | | | | | | | | | | Y | FORM 1 | |
| 1_Subjects.txt | AGE_UNIT | Age unit | http://purl.obolibrary.org/obo/UO_0000003 | http://purl.obolibrary.org/obo/uo | categorical variable | string | | | | | | | | | | Y | FORM 1 | |
| 1_Subjects.txt | SEX | Sex | https://schema.org/gender | https://schema.org | categorical variable | enum | | | | | M;F | M=male;F=female | | | | | FORM 1 | |

Adding a descriptor row

Our first task is to add a descriptor row that describes how each column heading maps to a LinkML metamodel element.

We will tackle this incrementally, starting with the first three columns, which we map to:

  • [class](https://w3id.org/linkml/ClassDefinition)
  • [slot](https://w3id.org/linkml/SlotDefinition)
  • [title](https://w3id.org/linkml/title)

The table now looks like this:

| File Name | Variable Name | Variable Label |
| --- | --- | --- |
| > class | slot | title |
| 1_Subjects.txt | SUBJECT_ID | Subject number |
| 1_Subjects.txt | SPECIES | Species name |
| 1_Subjects.txt | STRAIN | Strain |
| 1_Subjects.txt | AGE | Age at study initiation |
| 1_Subjects.txt | AGE_UNIT | Age unit |
| 1_Subjects.txt | SEX | Sex |
| 1_Subjects.txt | SOMEDATE | Date of acquiring subject |
| 1_Subjects.txt | HEMOGLOBIN | Hematology: Hemoglobin |
| 1_Subjects.txt | HEMOGLOBIN_UNIT | Hemoglobin unit |
| 1_Subjects.txt | HEIGHT | Body size |
| 1_Subjects.txt | HEIGHT_UNIT | Body size unit |
| 1_Subjects.txt | WEIGHT | Body weight |
| 1_Subjects.txt | WEIGHT_UNIT | Body weight unit |
| 1_Subjects.txt | BMI | Body mass index |
| 1_Subjects.txt | LAB | Laboratory |
| 2_Samples.txt | SAMPLE_ID | Sample ID |
| 2_Samples.txt | SAMPLE_SITE | Sample collection site |
| 2_Samples.txt | ANALYTE_TYPE | Type of analysis |
| 2_Samples.txt | GENOTYPING_CENTER | GENOTYPING_CENTER |
| 2_Samples.txt | SEQUENCING_CENTER | SEQUENCING_CENTER |
| 3_SampleMapping.txt | SUBJECT_ID | Subject number |
| 3_SampleMapping.txt | SAMPLE_ID | Sample ID |
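
To make the role of the descriptor row concrete, here is a minimal sketch in plain Python of how a reader could use that row to reinterpret each column. It is not how Schemasheets itself is implemented; the function names `read_with_descriptors` and `group_by_class` are hypothetical, and the inline TSV holds only a few rows from the table above.

```python
import csv
import io

# A few rows of the sheet as tab-separated text (values taken from the table above).
SHEET = (
    "File Name\tVariable Name\tVariable Label\n"
    "> class\tslot\ttitle\n"
    "1_Subjects.txt\tSUBJECT_ID\tSubject number\n"
    "1_Subjects.txt\tSPECIES\tSpecies name\n"
    "2_Samples.txt\tSAMPLE_ID\tSample ID\n"
)


def read_with_descriptors(text):
    """Return one dict per data row, keyed by metamodel element rather than column heading."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    descriptor, data = rows[1], rows[2:]  # rows[0] holds the human-readable headings
    if not descriptor[0].startswith(">"):
        raise ValueError("expected a descriptor row starting with '>'")
    # Strip the leading '>' marker from the first descriptor cell.
    elements = [cell.lstrip(">").strip() for cell in descriptor]
    return [dict(zip(elements, row)) for row in data]


def group_by_class(records):
    """Nest the records as {class name: {slot name: title}}."""
    schema = {}
    for rec in records:
        schema.setdefault(rec["class"], {})[rec["slot"]] = rec["title"]
    return schema


schema = group_by_class(read_with_descriptors(SHEET))
print(schema)
```

The human-readable headings in the first row are ignored here; only the descriptor row decides what each column means, which is why the headings can stay exactly as they were in the original data dictionary.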

Our choice of mapping for the first column is a bit odd. It reflects a slight mismatch between Schemasheets/LinkML, which aims to describe a data model that can be reused across multiple instantiations of the same format, and a data dictionary, which is oriented around describing a single distribution.

Here we are implicitly creating classes (records) such as "1_Subjects.txt", a name that doesn't conform to standard class naming conventions in LinkML. Later we will explore rewriting these with names like "Subject", "Sample", and "SampleMapping".
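
As a rough illustration of that renaming, the sketch below derives a class-style name from each file name. The helper `class_name_from_file` is hypothetical, and its singularization rule (dropping a trailing "s") is a deliberate simplification that happens to work for these three files; it is not a general solution.

```python
import re


def class_name_from_file(file_name):
    """Hypothetical helper: turn a file name such as '1_Subjects.txt' into 'Subject'."""
    stem = file_name.rsplit(".", 1)[0]   # drop the extension
    stem = re.sub(r"^\d+_", "", stem)    # drop the numeric prefix, e.g. '1_'
    if stem.endswith("s"):               # naive singularization (assumption)
        stem = stem[:-1]
    return stem


for name in ["1_Subjects.txt", "2_Samples.txt", "3_SampleMapping.txt"]:
    print(name, "->", class_name_from_file(name))
```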

TODO

For the second column, the choice of ALL-CAPS slot names also goes against standard naming conventions, but this matters less: the title (column 3) is the string that should be used in user-facing applications like Data Harmonizer.

Modifications

  • We modified the minimum and maximum values, which had been written with commas instead of periods as the decimal separator.
  • The "regex" field had the value YYYY-MM-DD, which is a date format description rather than an actual regex.
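
Both modifications can be sketched in a few lines of Python. This is only an illustration: the helper name `normalize_decimal` and the sample value "12,5" are hypothetical, and the pattern below is one plausible regex for YYYY-MM-DD dates, not necessarily the one used in the sheet. It checks the shape of the string only, not whether the date exists in the calendar.

```python
import re

# One possible regex for the YYYY-MM-DD shape (assumption; checks digits only).
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def normalize_decimal(value):
    """Replace a decimal comma with a period, e.g. '12,5' becomes '12.5'.

    Assumes the comma is only ever a decimal separator, never a thousands separator.
    """
    return value.strip().replace(",", ".")


print(float(normalize_decimal("12,5")))       # a value written with a decimal comma
print(bool(ISO_DATE.match("2021-03-04")))     # a string with the expected date shape
print(bool(ISO_DATE.match("YYYY-MM-DD")))     # the placeholder itself is not a match
```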

This framework allows you to represent complex relation-style schemas using spreadsheets/TSVs. But it also allows for representation of simple "data dictionaries" or "minimal information lists". These can be thought of as "wide tables", e.g. representing individual observations or observable units such as persons or samples.

TODO