Simple data dictionaries

A data dictionary is a file (or collection of files) which unambiguously declares, defines and annotates all the variables collected in a project and associated to a dataset (_definition: FAIR cookbook).

Schemasheets is an idea framework for managing a data dictionary.

Example Data Dictionary

The FAIR Cookbook provides an example of a data dictionary for tracking various aspects of a research subject or model organism, including:

subject_id
species
strain (for model organisms)
age + age unit
etc

See Example.

Let's start by copying this directly into a google sheet.

You can see this on the v1 tab of this sheet

File Name	Variable Name	Variable Label	Variable Ontology ID or RDFtype	Variable ID Source	Variable Statistical Type	Variable Data Type	Allowed Value Shorthands	Allowed Value Descriptions	Unique (alone)	Required	Collection Form Name
1_Subjects.txt	SUBJECT_ID	Subject number	https://schema.org/identifier	https://schema.org	categorical variable	integer			Y	Y	FORM 1
1_Subjects.txt	SPECIES	Species name	https://schema.org/name	https://schema.org	categorical variable	string					FORM 1
1_Subjects.txt	STRAIN	Strain	TODO substitute broken link https://bioschemas.org/profiles/Taxon/0.6-RELEASE/identifier	https://schemas.org/	categorical variable	string		http://purl.obolibrary.org/obo/NCBITaxon_40674			FORM 1
1_Subjects.txt	AGE	Age at study initiation	https://bioschemas.org/types/BioSample/0.1-RELEASE-2019_06_19	https://bioschemas.org/	continuous variable	integer				Y	FORM 1
1_Subjects.txt	AGE_UNIT	Age unit	http://purl.obolibrary.org/obo/UO_0000003	http://purl.obolibrary.org/obo/uo	categorial variable	string				Y	FORM 1
1_Subjects.txt	SEX	Sex	https://schema.org/gender	https://schema.org	categorical variable	enum	M;F	M=male;F=female			FORM 1

Adding a descriptor row

Our first task is to add a descriptor row that describes how each column heading maps to a LinkML metamodel element.

Here we will tackle this incrementally, starting with the first 3 columns, we will map to:

[class][https://w3id.org/linkml/ClassDefinition]
[slot][https://w3id.org/linkml/SlotDefinition]
[title][https://w3id.org/linkml/title]

The table now looks like this:

File Name	Variable Name	Variable Label
`>` class	slot	title
1_Subjects.txt	SUBJECT_ID	Subject number
1_Subjects.txt	SPECIES	Species name
1_Subjects.txt	STRAIN	Strain
1_Subjects.txt	AGE	Age at study initiation
1_Subjects.txt	AGE_UNIT	Age unit
1_Subjects.txt	SEX	Sex
1_Subjects.txt	SOMEDATE	Date of acquiring subject
1_Subjects.txt	HEMOGLOBIN	Hematology: Hemoglobin
1_Subjects.txt	HEMOGLOBIN_UNIT	Hemoglobin unit
1_Subjects.txt	HEIGHT	Body size
1_Subjects.txt	HEIGHT_UNIT	Body size unit
1_Subjects.txt	WEIGHT	Body weight
1_Subjects.txt	WEIGHT_UNIT	Body weight unit
1_Subjects.txt	BMI	Body mass index
1_Subjects.txt	LAB	Laboratory
2_Samples.txt	SAMPLE_ID	Sample ID
2_Samples.txt	SAMPLE_SITE	Sample collection site
2_Samples.txt	ANALYTE_TYPE	Type of analysis
2_Samples.txt	GENOTYPING_CENTER	GENOTYPING_CENTER
2_Samples.txt	SEQUENCING_CENTER	SEQUENCING_CENTER
3_SampleMapping.txt	SUBJECT_ID	Subject number
3_SampleMapping.txt	SAMPLE_ID	Sample ID

Our choice of how to map the first column is a bit odd, and reflects a slight mismatch between schemasheets/LinkML, which aims to describe a data model that can be used for multiple instantiations of the same format and a data dictionary that is oriented around describing a single distribution.

Here we are implicitly creating classes/records like "1_Subjects.txt" which doesn't really conform to standard class naming conventions in LinkML. Later we will explore rewriting these with names like "Subject", "Sample", and "SampleMapping"

TODO

For the second column, the choice of ALL-CAPS for slot name also goes against standard naming conventions, but this doesn't really matter so much, and the title (col 3) is the string that should be used in user-facing applications like Data Harmonizer.

Modifications

We modified the minimum and maximum values which were specified using commas instead of periods for decimal notation
The "regex" field had a value YYYY-MM-DD, but this isn't an actual regex

This framework allows you to represent complex relation-style schemas using spreadsheets/TSVs. But it also allows for representation of simple "data dictionaries" or "minimal information lists". These can be thought of as "wide tables", e.g. representing individual observations or observable units such as persons or samples.

TODO

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search