Pipeline User Guide¶
This guide walks through using the dm-bip pipeline to transform tabular data into a harmonized LinkML data model.
Prerequisites¶
- dm-bip installed (see Installation)
makeinstalled (pre-installed on most Mac/Linux systems)- Input data as TSV or CSV files
Quick Start with Toy Data¶
The fastest way to understand the pipeline is to run it on the included toy data:
# Simple run — clean TSVs with human-readable columns
make pipeline CONFIG=toy_data/pre_cleaned/config.mk
# Full run — starts from raw dbGaP-format files, exercises all stages
make pipeline CONFIG=toy_data/from_raw/config.mk
See toy_data/README.md for details on the toy dataset.
Pipeline Overview¶
The pipeline has four stages, orchestrated by make. Running make pipeline executes all applicable stages in order. The manual commands below use the toy data as examples.
1. Prepare (make prepare-input)¶
Clean raw input files: strip dbGaP metadata headers, filter tables by ID, output clean TSVs. Only runs when DM_RAW_SOURCE is set. Skip this if your data is already clean TSV/CSV.
uv run python src/dm_bip/cleaners/prepare_input.py --source toy_data/data/raw --mapping toy_data/from_raw/specs --output output/ToyRaw/prepared
2. Schema (make schema-create)¶
Infer a source LinkML schema from the data using schema-automator. Produces one class per file, one slot per column.
uv run schemauto generalize-tsvs -n ToyPreCleaned toy_data/data/pre_cleaned/*.tsv -o output/ToyPreCleaned/ToyPreCleaned.yaml
3. Validate (make validate-data)¶
Validate each input file against the generated schema using linkml validate. Supports parallel execution (make -j 4 validate-data).
uv run linkml validate --schema output/ToyPreCleaned/ToyPreCleaned.yaml --target-class subject toy_data/data/pre_cleaned/subject.tsv
4. Map (make map-data)¶
Transform data to a target schema using linkml-map transformation specifications.
uv run linkml-map map-data \
-T output/ToyPreCleaned/mapped-data/composed-specs/Participant.yaml \
-s output/ToyPreCleaned/ToyPreCleaned.yaml \
--target-schema toy_data/target-schema.yaml \
-o output/ToyPreCleaned/mapped-data/TOY-Participant-data.yaml \
-f yaml \
toy_data/data/pre_cleaned/
Preparing Your Data¶
Input files must meet these requirements: - Format: TSV (tab-separated) or CSV - Filenames: lowercase, no spaces or special characters - One file per source table
To convert a CSV to TSV:
uv run python -c "import pandas as pd; pd.read_csv('file.csv').to_csv('file.tsv', sep='\t', index=False)"
Raw dbGaP Data¶
If you're working with raw dbGaP .txt.gz archives, the prepare step handles extraction and cleaning automatically. Set DM_RAW_SOURCE to the directory containing the archives.
Running the Pipeline¶
Basic Usage¶
make pipeline DM_INPUT_DIR=path/to/your/tsvs DM_SCHEMA_NAME=MyStudy
This runs schema creation and validation. To also run data transformation, provide transformation specs and a target schema:
make pipeline \
DM_INPUT_DIR=path/to/your/tsvs \
DM_SCHEMA_NAME=MyStudy \
DM_TRANS_SPEC_DIR=path/to/specs \
DM_MAP_TARGET_SCHEMA=path/to/target-schema.yaml
Using a Config File¶
For reproducibility, put your variables in a .mk file and pass it with CONFIG=:
make pipeline CONFIG=my-study/config.mk
See toy_data/pre_cleaned/config.mk for an example.
Key Variables¶
| Variable | Description | Default |
|---|---|---|
DM_INPUT_DIR |
Directory containing TSV/CSV files | |
DM_SCHEMA_NAME |
Name for the generated schema | Schema |
DM_OUTPUT_DIR |
Output directory | output/<schema_name> |
DM_TRANS_SPEC_DIR |
Transformation specification directory | |
DM_MAP_TARGET_SCHEMA |
Target schema for transformation | |
DM_RAW_SOURCE |
Directory of raw .txt.gz files (enables prepare step) |
|
DM_MAP_OUTPUT_TYPE |
Output format(s): yaml, jsonl, json, tsv (space-separated for multiple, e.g., yaml jsonl) |
yaml |
DM_MAP_CHUNK_SIZE |
Rows per processing batch | 10000 |
Run make help to see the full list of targets and variables.
Output Structure¶
All output goes to DM_OUTPUT_DIR (default: output/<schema_name>/):
output/MyStudy/
├── MyStudy.yaml # Generated source schema
├── prepared/ # Clean TSVs (if prepare step ran)
├── validation-logs/ # Schema and data validation logs
│ ├── data-validation/ # Per-file validation results
│ └── data-validation-errors/ # Symlinks to files with errors
└── mapped-data/ # Transformed output files
Writing Transformation Specifications¶
Transformation specs are YAML files that tell linkml-map how to map source data to a target schema. Create one spec file per target class.
Basic Structure¶
- class_derivations:
Participant:
populated_from: subject # source filename (without extension)
slot_derivations:
id:
populated_from: subject_id # source column name
external_id:
populated_from: participant_external_id
populated_fromunder the class name identifies which input file provides the data- Each
slot_derivationmaps a target slot to a source column
Slot Value Options¶
Slots can be populated in several ways:
slot_derivations:
# Direct column mapping
age:
populated_from: age_at_enrollment
# Constant value
study_name:
value: "My Study"
# Value mappings (categorical recoding)
sex:
populated_from: gender
value_mappings:
'1': male
'2': female
# Expression (Python expression using column values)
age_in_days:
populated_from: age_years
expr: "{age_years} * 365"
For the full specification format, including enum_derivations and object_derivations, see the LinkML-Map documentation.
Data Requirements¶
For each target class, all source slots must exist in a single input file. If your data spans multiple files, preprocess them into combined files before running the pipeline. You can use any tool for this (pandas, R, dbt, etc.).