Scripts - Format Converters

Merge a column from one file to another

The merge_add_column.py script takes two files and widens the first by adding a column populated with data from the second. For example, biometric data such as age may be collected at several timepoints and stored in long format: one timepoint column and one age column, with a separate row for each participant at each timepoint. To map this data with linkml-map, it should instead be in wide format, with one column per combination of timepoint and metric, e.g. age_at_visit_timepoint_1. For example, in the INCLUDE LinkML model, the Participant.ageAtFirstParticipantEngagement values for the BrainPower study are sourced from the age_at_visit column of a file, using the rows where timepoint equals 1. The script selects those rows and adds their age values to the first file as a new column, e.g. age_at_visit_timepoint_1.
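The filter-then-join step described above can be sketched in pandas as follows. This is a minimal illustration, not the script's actual implementation: the function name add_column_from_long, the sex column, and the sample values are hypothetical, and the real script may handle types, duplicates, and file I/O differently.

```python
import pandas as pd


def add_column_from_long(left, right, new_column, source_column,
                         left_id, right_id, filter_column, filter_value):
    """Add one column to `left` from the rows of `right` that match a filter."""
    # Keep only the rows for the requested filter value (e.g. timepoint 1).
    # Compare as strings so a command-line value "1" also matches an integer 1.
    mask = right[filter_column].astype(str) == str(filter_value)
    subset = right.loc[mask, [right_id, source_column]]
    # Rename so the join key matches `left` and the values get the new name.
    subset = subset.rename(columns={right_id: left_id, source_column: new_column})
    # A left join keeps every row of `left`; ids without a match get NaN.
    # If `subset` repeats an id, the matching `left` row is repeated too.
    return left.merge(subset, on=left_id, how="left")


# Hypothetical sample data: one row per participant on the left,
# one row per participant per timepoint on the right.
left = pd.DataFrame({"id": [123, 124, 125], "sex": ["F", "M", "F"]})
right = pd.DataFrame({
    "id": [123, 123, 124, 125],
    "timepoint": [1, 2, 1, 2],
    "age_at_visit": [10.5, 11.5, 8.0, 9.0],
})

wide = add_column_from_long(
    left, right,
    new_column="age_at_visit_timepoint_1",
    source_column="age_at_visit",
    left_id="id", right_id="id",
    filter_column="timepoint", filter_value=1,
)
print(wide)
```

In this sketch, participant 125 has no row for timepoint 1, so the new column is empty (NaN) for that participant rather than the row being dropped.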

Usage

From the scripts directory run:

python merge_add_column.py \
  --left_path <PATH-TO-INPUT-FILE_1> \
  --right_path <PATH-TO-INPUT-FILE_2> \
  --output_file <PATH-TO-OUTPUT-FILE> \
  --new_column age_at_visit_timepoint_1 \
  --source_column age_at_visit \
  --left_id id \
  --right_id id \
  --filter_column timepoint \
  --filter_value 1

Working example

python merge_add_column.py \
  --left_path ../../../data/BrainPower-STUDY/raw_data/TSV/demographics.tsv \
  --right_path ../../../data/BrainPower-STUDY/raw_data/TSV/ageateventandlatency.tsv \
  --output_file demographics_with_age.tsv \
  --new_column age_at_visit_timepoint_1 \
  --source_column age_at_visit \
  --left_id id \
  --right_id id \
  --filter_column timepoint \
  --filter_value 1

NOTE: If running the script in the linkml-map uv environment, run as uv run python ... after adding pandas to that environment (it is not included by default).

Melt a file - convert from wide to long format

The melt_conditions.py script converts a file of conditions from wide format to long format. It is a pre-processing step that prepares data for annotation with Harmonica. Use it when the conditions to annotate with ontology terms are listed as column headers (see the Wide format example below). The result is a long format file (see the Long format example below) in which each row represents one condition that a participant has, as indicated by the value 1 in the wide file. TODO: Generalize to additional values that indicate the participant was found to have the condition, e.g. true, present, etc.
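As a rough illustration of this wide-to-long conversion, the sketch below uses pandas DataFrame.melt on the small example from the end of this document. It is an assumption about how such a conversion can be done, not the script's actual code: the helper name melt_conditions and the default has_condition column name are chosen here for illustration.

```python
import pandas as pd

# The wide example: one column per condition, 1 = participant has it.
wide = pd.DataFrame({
    "id": [123, 124, 125],
    "timepoint": [1, 1, 1],
    "asd": [0, 1, 1],
    "vsd": [1, 0, 0],
    "pda": [0, 1, 0],
})


def melt_conditions(df, id_vars, var_name, value_name="has_condition"):
    """Turn condition columns into rows, keeping only conditions marked 1."""
    id_vars = list(id_vars)
    # Every column not listed in id_vars becomes a (condition, value) row.
    long = df.melt(id_vars=id_vars, var_name=var_name, value_name=value_name)
    # Keep only the rows where the condition is present.
    long = long[long[value_name] == 1]
    # Sort for a stable, readable order and renumber the index.
    return long.sort_values(id_vars + [var_name]).reset_index(drop=True)


long = melt_conditions(wide, ["id", "timepoint"], "condition_name")
# Write tab-separated text to stdout, mirroring the TSV inputs.
print(long.to_csv(sep="\t", index=False))
```

Filtering on the value 1 after melting is what drops the absent conditions; supporting other markers such as true or present would only require changing that comparison.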

Usage

python melt_conditions.py \
  --input_file <PATH-TO-INPUT-FILE> \
  --output_file <PATH-TO-OUTPUT-FILE> \
  --id_vars_str <comma-separated list of ID variables, e.g. id,timepoint> \
  --var_name <name of new column header for the conditions>

NOTE: If running the script in the linkml-map uv environment, run as uv run python ... after adding pandas to that environment (it is not included by default).

Working example

uv run python melt_conditions.py \
  --input_file ../../../data/BrainPower-STUDY/raw_data/TSV/healthconditions.tsv \
  --id_vars id,timepoint \
  --var_name condition_name

Example file formats for “wide” and “long” formats

Wide format

id   timepoint  asd  vsd  pda
123  1          0    1    0
124  1          1    0    1
125  1          1    0    0

Long format

id   timepoint  condition_name  has_condition
123  1          vsd             1
124  1          asd             1
124  1          pda             1
125  1          asd             1