Pipeline User Guide

This guide walks through using the dm-bip pipeline to transform tabular data into a harmonized LinkML data model.

Prerequisites

- dm-bip installed (see Installation)
- make installed (pre-installed on most Mac/Linux systems)
- Input data as TSV or CSV files

Quick Start with Toy Data

The fastest way to understand the pipeline is to run it on the included toy data:

# Simple run — clean TSVs with human-readable columns
make pipeline CONFIG=toy_data/pre_cleaned/config.mk

# Full run — starts from raw dbGaP-format files, exercises all stages
make pipeline CONFIG=toy_data/from_raw/config.mk

See toy_data/README.md for details on the toy dataset.

Pipeline Overview

The pipeline has four stages, orchestrated by make. Running make pipeline executes all applicable stages in order. The manual commands below use the toy data as examples.

1. Prepare (`make prepare-input`)

Clean raw input files: strip dbGaP metadata headers, filter tables by ID, output clean TSVs. Only runs when DM_RAW_SOURCE is set. Skip this if your data is already clean TSV/CSV.

uv run python src/dm_bip/cleaners/prepare_input.py --source toy_data/data/raw --mapping toy_data/from_raw/specs --output output/ToyRaw/prepared
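To make the "strip metadata headers" part of this stage concrete, here is a minimal sketch of the idea. It is not the code in prepare_input.py; it assumes that metadata lines in the raw file start with `#`, and the function name `strip_metadata_header` is hypothetical.

```python
import gzip
from pathlib import Path


def strip_metadata_header(src: Path, dest: Path) -> int:
    """Copy a gzipped text table to a plain TSV, dropping metadata lines.

    Assumption (not taken from dm-bip): metadata lines start with '#'.
    Returns the number of lines written, including the column header.
    """
    written = 0
    with gzip.open(src, "rt", encoding="utf-8") as fin, \
            open(dest, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            # Skip metadata comments and blank lines; keep everything else.
            if line.startswith("#") or not line.strip():
                continue
            fout.write(line)
            written += 1
    return written
```

The real prepare step also filters tables by ID, which this sketch leaves out.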

2. Schema (`make schema-create`)

Infer a source LinkML schema from the data using schema-automator. Produces one class per file, one slot per column.

uv run schemauto generalize-tsvs -n ToyPreCleaned toy_data/data/pre_cleaned/*.tsv -o output/ToyPreCleaned/ToyPreCleaned.yaml
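The "one class per file, one slot per column" rule can be pictured with a small sketch. This is an illustration only and does not call schema-automator; the function name `sketch_schema` and the shape of the returned dictionary are assumptions, and the generated LinkML schema will contain more detail than this.

```python
import csv
from pathlib import Path


def sketch_schema(data_dir: Path, name: str) -> dict:
    """Build a schema-like dict: one class per TSV file, one slot per column."""
    classes = {}
    for tsv in sorted(data_dir.glob("*.tsv")):
        with open(tsv, newline="", encoding="utf-8") as handle:
            # Only the header row is needed to list the columns.
            header = next(csv.reader(handle, delimiter="\t"), [])
        # The class is named after the file (without extension).
        classes[tsv.stem] = {"slots": header}
    return {"name": name, "classes": classes}
```

For a directory holding only subject.tsv, the result has a single class named `subject` whose slots are that file's column names.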

3. Validate (`make validate-data`)

Validate each input file against the generated schema using linkml validate. Supports parallel execution (make -j 4 validate-data).

uv run linkml validate --schema output/ToyPreCleaned/ToyPreCleaned.yaml --target-class subject toy_data/data/pre_cleaned/subject.tsv
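If you want to run the manual command over a whole directory, the following sketch builds one command per file. It assumes, as in the example above, that the target class is named after the file; `build_validate_commands` is a hypothetical helper, not part of dm-bip.

```python
from pathlib import Path


def build_validate_commands(schema: Path, data_dir: Path) -> list[list[str]]:
    """Return one `linkml validate` argument list per TSV file in data_dir."""
    commands = []
    for tsv in sorted(data_dir.glob("*.tsv")):
        commands.append([
            "uv", "run", "linkml", "validate",
            "--schema", str(schema),
            # Assumption: the class name equals the file name without extension.
            "--target-class", tsv.stem,
            str(tsv),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run`. The make target remains the supported way to validate, including parallel runs.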

4. Map (`make map-data`)

Transform data to a target schema using linkml-map transformation specifications.

uv run linkml-map map-data \
  -T output/ToyPreCleaned/mapped-data/composed-specs/Participant.yaml \
  -s output/ToyPreCleaned/ToyPreCleaned.yaml \
  --target-schema toy_data/target-schema.yaml \
  -o output/ToyPreCleaned/mapped-data/TOY-Participant-data.yaml \
  -f yaml \
  toy_data/data/pre_cleaned/

Preparing Your Data

Input files must meet these requirements:

- Format: TSV (tab-separated) or CSV
- Filenames: lowercase, no spaces or special characters
- One file per source table
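The filename requirement can be checked or enforced with a small helper. The sketch below is one possible interpretation: it assumes that "special characters" means anything other than lowercase letters, digits, and underscores, and `normalize_filename` is a hypothetical name.

```python
import re
from pathlib import Path


def normalize_filename(name: str) -> str:
    """Lowercase a filename and replace runs of other characters with '_'.

    Assumption: only a-z, 0-9 and '_' are acceptable in the stem.
    """
    path = Path(name)
    stem = re.sub(r"[^a-z0-9]+", "_", path.stem.lower()).strip("_")
    return f"{stem}{path.suffix.lower()}"
```

For example, `normalize_filename("My Table (v2).CSV")` gives `my_table_v2.csv`, while a name that already complies, such as `subject.tsv`, is returned unchanged.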

To convert a CSV to TSV:

uv run python -c "import pandas as pd; pd.read_csv('file.csv').to_csv('file.tsv', sep='\t', index=False)"

Raw dbGaP Data

If you're working with raw dbGaP .txt.gz archives, the prepare step handles extraction and cleaning automatically. Set DM_RAW_SOURCE to the directory containing the archives.

Running the Pipeline

Basic Usage

make pipeline DM_INPUT_DIR=path/to/your/tsvs DM_SCHEMA_NAME=MyStudy

This runs schema creation and validation. To also run data transformation, provide transformation specs and a target schema:

make pipeline \
  DM_INPUT_DIR=path/to/your/tsvs \
  DM_SCHEMA_NAME=MyStudy \
  DM_TRANS_SPEC_DIR=path/to/specs \
  DM_MAP_TARGET_SCHEMA=path/to/target-schema.yaml

Using a Config File

For reproducibility, put your variables in a .mk file and pass it with CONFIG=:

make pipeline CONFIG=my-study/config.mk

See toy_data/pre_cleaned/config.mk for an example.

Key Variables

Variable	Description	Default
`DM_INPUT_DIR`	Directory containing TSV/CSV files
`DM_SCHEMA_NAME`	Name for the generated schema	`Schema`
`DM_OUTPUT_DIR`	Output directory	`output/<schema_name>`
`DM_TRANS_SPEC_DIR`	Transformation specification directory
`DM_MAP_TARGET_SCHEMA`	Target schema for transformation
`DM_RAW_SOURCE`	Directory of raw `.txt.gz` files (enables prepare step)
`DM_MAP_OUTPUT_TYPE`	Output format(s): `yaml`, `jsonl`, `json`, `tsv` (space-separated for multiple, e.g., `yaml jsonl`)	`yaml`
`DM_MAP_CHUNK_SIZE`	Rows per processing batch	`10000`

Run make help to see the full list of targets and variables.
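To illustrate what "rows per processing batch" means for DM_MAP_CHUNK_SIZE, here is a sketch of reading a TSV in fixed-size batches. It only shows the general technique; how the pipeline itself batches rows is not described in this guide, and `iter_chunks` is a hypothetical name.

```python
import csv
from itertools import islice
from typing import Iterator


def iter_chunks(path: str, chunk_size: int = 10000) -> Iterator[list[dict]]:
    """Yield lists of at most chunk_size rows from a TSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        while True:
            # islice pulls the next chunk_size rows without loading the whole file.
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk
```

Smaller batches keep memory use lower at the cost of more iterations.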

Output Structure

All output goes to DM_OUTPUT_DIR (default: output/<schema_name>/):

output/MyStudy/
├── MyStudy.yaml                    # Generated source schema
├── prepared/                       # Clean TSVs (if prepare step ran)
├── validation-logs/                # Schema and data validation logs
│   ├── data-validation/            # Per-file validation results
│   └── data-validation-errors/     # Symlinks to files with errors
└── mapped-data/                    # Transformed output files
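After a run, a quick way to see which inputs failed validation is to list the entries in data-validation-errors. The sketch below assumes only the directory layout shown above; `files_with_errors` is a hypothetical helper.

```python
from pathlib import Path


def files_with_errors(output_dir: Path) -> list[str]:
    """List entry names under validation-logs/data-validation-errors."""
    errors_dir = output_dir / "validation-logs" / "data-validation-errors"
    if not errors_dir.is_dir():
        # No directory means validation has not run or nothing was recorded.
        return []
    return sorted(entry.name for entry in errors_dir.iterdir())
```

For example, `files_with_errors(Path("output/MyStudy"))` returns an empty list when no errors were recorded.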

Writing Transformation Specifications

Transformation specs are YAML files that tell linkml-map how to map source data to a target schema. Create one spec file per target class.

Basic Structure

- class_derivations:
    Participant:
      populated_from: subject        # source filename (without extension)
      slot_derivations:
        id:
          populated_from: subject_id  # source column name
        external_id:
          populated_from: participant_external_id

- populated_from under the class name identifies which input file provides the data.
- Each entry under slot_derivations maps a target slot to a source column.

Slot Value Options

Slots can be populated in several ways:

slot_derivations:
    # Direct column mapping
    age:
      populated_from: age_at_enrollment

    # Constant value
    study_name:
      value: "My Study"

    # Value mappings (categorical recoding)
    sex:
      populated_from: gender
      value_mappings:
        '1': male
        '2': female

    # Expression (Python expression using column values)
    age_in_days:
      populated_from: age_years
      expr: "{age_years} * 365"

For the full specification format, including enum_derivations and object_derivations, see the LinkML-Map documentation.
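As a mental model for the first three options above, the following toy resolver applies a set of slot derivations to one source row. It is not how linkml-map is implemented: `derive_slot` and `derive_row` are hypothetical names, `expr` is deliberately left out, and passing unmapped values through unchanged is an assumption.

```python
def derive_slot(row: dict, derivation: dict):
    """Resolve one target slot from a source row (toy model, not linkml-map)."""
    if "value" in derivation:
        # Constant value: the source row is ignored.
        return derivation["value"]
    source = row.get(derivation.get("populated_from"))
    mappings = derivation.get("value_mappings")
    if mappings is not None:
        # Assumption: values without a mapping pass through unchanged.
        return mappings.get(source, source)
    return source


def derive_row(row: dict, slot_derivations: dict) -> dict:
    """Apply every slot derivation to a single source row."""
    return {slot: derive_slot(row, d) for slot, d in slot_derivations.items()}
```

With a row `{"age_at_enrollment": "42", "gender": "2"}` and the `age`, `study_name`, and `sex` derivations from the example above, `derive_row` returns `{"age": "42", "study_name": "My Study", "sex": "female"}`. Values stay strings here because the sketch does no type conversion.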

Data Requirements

For each target class, all source slots must exist in a single input file. If your data spans multiple files, preprocess them into combined files before running the pipeline. You can use any tool for this (pandas, R, dbt, etc.).
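One way to do this preprocessing with only the standard library is a left join on a shared key column. The sketch below makes several assumptions: both files are TSVs, they share exactly one column (the key), and each key appears at most once in the second file. `join_tsvs` is a hypothetical helper.

```python
import csv


def join_tsvs(left_path: str, right_path: str, key: str, out_path: str) -> int:
    """Left-join two TSV files on `key`; return the number of rows written."""
    with open(right_path, newline="", encoding="utf-8") as handle:
        right_reader = csv.DictReader(handle, delimiter="\t")
        right_columns = [c for c in (right_reader.fieldnames or []) if c != key]
        # Index the second file by key (the last row wins on duplicate keys).
        right_rows = {row[key]: row for row in right_reader}

    count = 0
    with open(left_path, newline="", encoding="utf-8") as fin, \
            open(out_path, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        fieldnames = list(reader.fieldnames or []) + right_columns
        writer = csv.DictWriter(
            fout, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in reader:
            match = right_rows.get(row[key], {})
            for column in right_columns:
                # Rows without a match get empty strings for the added columns.
                row[column] = match.get(column, "")
            writer.writerow(row)
            count += 1
    return count
```

Write the combined file into your input directory under a lowercase name and reference that file in the spec's class-level populated_from.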