Scripts - Format Converters
Merge a column from one file to another
The merge_add_column.py script takes two files and widens the first by adding one column populated with data from the second. For example, biometric data such as age may be collected at several timepoints and stored in long format: one timepoint column and one age column, with a separate row for each participant at each timepoint. To map this data with linkml-map, it should instead be in wide format, with one column per combination of timepoint and metric, e.g. age_at_visit_timepoint_1. In the INCLUDE LinkML model, for instance, the Participant.ageAtFirstParticipantEngagement values for the BrainPower study come from the age_at_visit column in rows where timepoint equals 1. The script copies those values into a new column, e.g. age_at_visit_timepoint_1.
Usage
From the scripts directory run:
python merge_add_column.py \
--left_path <PATH-TO-INPUT-FILE_1> \
--right_path <PATH-TO-INPUT-FILE_2> \
--output_file <PATH-TO-OUTPUT-FILE> \
--new_column age_at_visit_timepoint_1 \
--source_column age_at_visit \
--left_id id \
--right_id id \
--filter_column timepoint \
--filter_value 1
Working example
python merge_add_column.py \
--left_path ../../../data/BrainPower-STUDY/raw_data/TSV/demographics.tsv \
--right_path ../../../data/BrainPower-STUDY/raw_data/TSV/ageateventandlatency.tsv \
--output_file demographics_with_age.tsv \
--new_column age_at_visit_timepoint_1 \
--source_column age_at_visit \
--left_id id \
--right_id id \
--filter_column timepoint \
--filter_value 1
NOTE: If running the script in the linkml-map uv environment, run as uv run python ... after adding pandas to that environment (it is not included by default).
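The filter-and-merge logic described in this section can be sketched in pandas as follows. This is an illustration under stated assumptions, not the source of merge_add_column.py: the function name add_filtered_column is hypothetical, its parameters simply mirror the command-line options, and the left join and the string comparison of the filter value are assumed behaviors.

```python
import pandas as pd


def add_filtered_column(left, right, new_column, source_column,
                        left_id, right_id, filter_column, filter_value):
    """Hypothetical helper: widen `left` with one column taken from `right`."""
    # Keep only the long-format rows for the requested filter value.
    # Assumption: compare as strings so a CLI value "1" matches an integer 1.
    subset = right[right[filter_column].astype(str) == str(filter_value)]
    # Keep the ID and the value column, renamed to the new wide-format name.
    subset = subset[[right_id, source_column]].rename(
        columns={right_id: left_id, source_column: new_column}
    )
    # Assumption: a left join, so every row of the left file is preserved.
    return left.merge(subset, how="left", on=left_id)


# Small made-up frames standing in for the two input files.
demographics = pd.DataFrame({"id": [123, 124], "sex": ["F", "M"]})
ages = pd.DataFrame({
    "id": [123, 123, 124],
    "timepoint": [1, 2, 1],
    "age_at_visit": [10, 11, 12],
})

widened = add_filtered_column(
    demographics, ages,
    new_column="age_at_visit_timepoint_1", source_column="age_at_visit",
    left_id="id", right_id="id",
    filter_column="timepoint", filter_value="1",
)
print(widened)
```

Participants with no row for the chosen timepoint would receive an empty (NaN) value in the new column under the left-join assumption.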
Melt a file - convert from wide to long format
The melt_conditions.py script converts a file of conditions from wide format to long format. It is a pre-processing script that prepares data for annotation using Harmonica. Use it when the conditions to annotate with ontology terms are listed as column headers (see the wide format example below). The result is a long format file (see the long format example below) in which each row represents one condition that the participant has, as indicated by the value 1 in the corresponding column.
TODO: Generalize to additional values that indicate the participant was found to have the condition, e.g. true, present, etc.
Usage
python melt_conditions.py \
--input_file <PATH-TO-INPUT-FILE> \
--output_file <PATH-TO-OUTPUT-FILE> \
--id_vars_str <comma-separated list of ID variables, e.g. id,timepoint> \
--var_name <name of new column header for the conditions>
NOTE: If running the script in the linkml-map uv environment, run as uv run python ... after adding pandas to that environment (it is not included by default).
Working example
uv run python melt_and_annotate_conditions.py \
--input_file ../../../data/BrainPower-STUDY/raw_data/TSV/healthconditions.tsv \
--id_vars id,timepoint \
--var_name condition_name
Example file formats for “wide” and “long” formats
Wide format
id timepoint asd vsd pda
123 1 0 1 0
124 1 1 0 1
125 1 1 0 0
Long format
id timepoint condition_name has_condition
123 1 vsd 1
124 1 asd 1
124 1 pda 1
125 1 asd 1
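
The wide-to-long conversion can be sketched with pandas as follows, using the wide format example as input. This is a minimal illustration, not the script itself: the function name melt_conditions_sketch is hypothetical, and sorting the result by ID variables is an assumption made only for readability.

```python
import pandas as pd


def melt_conditions_sketch(wide, id_vars, var_name, value_name="has_condition"):
    """Hypothetical helper: one output row per condition a participant has."""
    # Turn every non-ID column header into a value of the `var_name` column.
    long = wide.melt(id_vars=id_vars, var_name=var_name, value_name=value_name)
    # Keep only rows where the value 1 marks the condition as present.
    long = long[long[value_name] == 1]
    # Assumption: sort by the ID variables, then condition, for readability.
    return long.sort_values(id_vars + [var_name]).reset_index(drop=True)


# The wide format example from this section.
wide = pd.DataFrame({
    "id": [123, 124, 125],
    "timepoint": [1, 1, 1],
    "asd": [0, 1, 1],
    "vsd": [1, 0, 0],
    "pda": [0, 1, 0],
})

long = melt_conditions_sketch(
    wide, id_vars=["id", "timepoint"], var_name="condition_name"
)
print(long.to_string(index=False))
```

Filtering on the value 1 alone matches the current behavior described in this section; handling values such as true or present would require extending the filter step.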