0.1 Data
0.2 Directories
1 Bioinformatic preparation
2 Analysis with just one cis-region: PCSK9 with p-value threshold criterion (P<5E-8)
3 Analysis of PCSK9 on METSIM metabolites (ignoring p-value threshold criterion)
- 3.1 Harmonizing PCSK9 and all METSIM metabolites (ignoring p-value criterion)
- 3.2 HyPrColoc of PCSK9 and all METSIM metabolites (ignoring p-value criterion)
  - 3.2.1 Results: HyPrColoc of PCSK9 and all metabolites (ignoring p-value)
4 Benchmarking loop of selected proteins on METISM metabolites (P<5E-8 threshold)
5 Benchmarking loop of selected proteins on METISM metabolites (P<5E-6 threshold)
6 Comparison of HyPrColoc results for selected proteins using P<5E-8 and P<5E-6 filters
7 Shiny app to deliver results

0.1 Data

0.1.1 UK Biobank Pharma Proteomics Project (UKB-PPP) Proteins

2940 summary statistics; Europeans; both GRCh19/38

Inflammation: 736
Cardiometabolic: 736
Oncology: 735
Neurology: 733

#!/bin/sh

# Directory containing the files
dir_path="/Users/charleenadams/ukbppp"

# Temporary file for storing category counts
temp_counts=$(mktemp)

# Loop through all .tar files in the directory
for file in "$dir_path"/*.tar; do
    # Extract the category from the filename
    category=$(echo "$file" | sed -E 's/.*_v1_([A-Za-z]+)(_II)?\.tar/\1/')

    # Output the category to the temp file
    echo "$category" >> "$temp_counts"
done

# Count the occurrences of each category and save to output
sort "$temp_counts" | uniq -c | awk '{print $1, $2}' | sort -nr > "$dir_path/category_counts.txt"

# Clean up
rm "$temp_counts"

echo "Category counts saved to $dir_path/category_counts.txt"

0.1.2 METSIM

1391 summary statistics; Finnish men; GRCh38; tsv files

Amino Acid: 215
Carbohydrate: 25
Cofactors and Vitamins: 38
Energy: 10
Lipid: 548
Nucleotide: 42
Partially Characterized: 16
Peptide: 42
Uncharacterized: 292
Xenobiotics: 163

0.1.3 Nightingale

249 summary statistics; Europeans; GRCh37; vcf files

Don’t have a breakdown by class, but do know the files include:

Lipoprotein Subclasses:
- Includes particle concentrations and composition
Lipoprotein Particle Size
Apolipoprotein A-I and B
Multiple Cholesterol and Triglyceride Measures
Albumin
Various Fatty Acids
Low-Molecular-Weight Metabolites:
- Amino acids (including branched-chain and aromatic)
- Glycolysis-related measures
- Ketone bodies

0.1.4 Clinical phenotype(s)

Coronary Heart Disease (CHD) GRCh37; vcf file; unformatted and downloaded

Phenotype	Abbreviation	Dataset	Author	PubMed ID	Sample Size (N)	Number of Cases (N Cases)	Population
Coronary Heart Disease	CHD	ieu-a-7	Nikpay	26343387	184,305	60,801	Mixed

NOTE: Remember the Jurgens data…

e.g, for BMI: “/Users/charleenadams/acy1_bmi/GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.tsv”

0.1.5 HyPrColoc sample overlap considerations

I revisited the Staley’s R Markdown example, examined the paper’s supplementary material on their simulation approach, and test-ran the analysis with the required matrices for accounting for sample overlap—LD matrices, sample overlap matrices, and correlation matrices of betas. Unfortunately, incorporating this adjustment into our pipeline is not feasible.

While generating these matrices dynamically is straightforward, running the model with the correlation matrix is computationally prohibitive—the runtime would be impractical. The authors acknowledge this in both the supplement and the R Markdown exercise, cautioning that adjusting for sample overlap may not be necessary. They provide several key arguments against it:

Risk of False Negatives – Colocalization is inherently more likely for correlated traits, and adjusting for overlap might remove true signals.
False Positives May Be Overstated – Even when assuming no sample overlap, correct colocalization clusters were still detected.
Computational Cost vs. Benefit – HyPrColoc was designed for large-scale analyses, but incorporating correlations drastically increases complexity without strong evidence of improving accuracy.

This aligns with a key takeaway from James’s R Markdown exercise:

“When we incorrectly assumed that the traits were measured in studies with no overlapping participants, we correctly detected the three clusters of colocalized traits—at a fraction of the computational cost of accounting for correlations between summary data.”

0.1.6 Future Steps

I have not yet added CHD to the analysis or formatted the data, as I’m prioritizing benchmarking proteins and METSIM metabolites. The plan is to integrate CHD and other clinical phenotypes once that is complete.

The CHD data was generated from the CARDIoGRAMplusC4D chip, which included Europeans and South Asians. The authors of HyPrColoc state that their method assumes all traits in a model derive from the same population, yet they used this dataset themselves. That raises a few questions (in the colloquial sense, not in the petitio principii sense):

Which is it? Can HyPrColoc handle traits from different ancestries or not?
Was this an oversight or an intentional decision?

UKP-PPP, METSIM, and CHD CARDIoGRAMplusC4D appear reasonably independent. We can consider adding in UKB Nightingale or Nightingale summary statistics from a smaller but more broadly European (vs UKB) cohort. Regardless, METSIM remains the priority for now. Likewise, we could add in proteins subset to the cis-regions of proteins of interest.

0.2 Directories

Main: /Users/charleenadams/coloc_clin_mol
- Scripts: /Users/charleenadams/coloc_clin_mol/scripts
UKB-PPP files:
- Original .tar files were at /Users/charleenadams/ukbppp; deleted after extraction to save 1.5TB
- Extracted tars: /Users/charleenadams/ukbppp/coloc_clin_mol_extracted (main directory for UKB-PPP files now)
- rsIDs added files: /Users/charleenadams/merged_ukbppp
METSIM files: /Users/charleenadams/metsim_files
Nightingale files: /Users/charleenadams/met_d_files
Selected clinical phenotype files: /Users/charleenadams/clinical_files

1 Bioinformatic preparation

1.1 Fetch metabolite summary statistics

1.2 Format UKB-PPP pQTLs

I had previously obtained these (see https://rpubs.com/YodaMendel/1243451) for an example of how to programmatically get files from UKB-PPP.

1.2.1 Untar all 2940


#!/bin/bash

# Define the directory containing your tar files
tar_directory="/Users/charleenadams/ukbppp"

# Define the output directory for extracted files
output_directory="/Users/charleenadams/ukbppp/coloc_clin_mol_extracted"

# Define the log file for tracking progress
log_file="/Users/charleenadams/ukbppp/extraction.log"

# Create the output directory if it doesn't exist
mkdir -p "$output_directory"

# Create or clear the log file
> "$log_file"

# Function to extract a single tar file
extract_tar() {
    local tar_file="$1"
    local output_directory="$2"
    local log_file="$3"

    # Log the start of extraction
    echo "Starting extraction: $tar_file" >> "$log_file"

    # Extract the tar file into the output directory
    if tar -xf "$tar_file" -C "$output_directory"; then
        echo "Completed extraction: $tar_file" >> "$log_file"
    else
        echo "Failed extraction: $tar_file" >> "$log_file"
    fi
}

export -f extract_tar

# Use `find` to locate all `.tar` files and process them in parallel
find "$tar_directory" -maxdepth 1 -type f -name "*.tar" | parallel -j 16 extract_tar {} "$output_directory" "$log_file"

echo "Extraction complete. Log saved to $log_file."

# Run with: 
#chmod +x untar.sh
#nohup ./untar.sh >untar.log 2>&1 &
ps -p 97839 -o stat,etime

1.2.2 Add rsIDs by CHR

~28 hours run time parallelized, including Mac sleeping overnight 😞
- Next time use caffeinate -i ./add_rsids.sh

#!/bin/bash

# Base directory where your data is located
base_dir="/Users/charleenadams/ukbppp/coloc_clin_mol_extracted"

# Directory where RSID map files are located
map_dir="/Users/charleenadams/temp_BI/chemokine_rgs_olink/maps"

# Log file to track progress
log_file="$base_dir/add_rsids.log"
echo "Log started on $(date)" > "$log_file"

# Function to process each file
process_file() {
    local file="$1"
    local dir
    dir=$(dirname "$file")
    local chrom
    chrom=$(basename "$file" | sed -E 's/.*chr([0-9XY]+).*/\1/')
    local map_file="$map_dir/olink_rsid_map_mac5_info03_b0_7_chr${chrom}_patched_v2.tsv.gz"
    local output_file="${file%.gz}_rsid_added.gz"

    {
        echo "Processing file: $file (Chromosome $chrom)"

        if [[ -n "$chrom" && -f "$map_file" ]]; then
            echo "  Using RSID map: $map_file"

            # Remove existing rsid_added file to avoid overwrite prompts
            rm -f "$output_file"

            # Join the original file with the RSID map based on the ID column
            gunzip -c "$file" | \
            awk 'BEGIN {OFS="\t"}
                 NR==FNR {map[$1]=$4; genpos37[$1]=$5; genpos38[$1]=$6; next}
                 ($3 in map) {print $0, map[$3], genpos37[$3], genpos38[$3]}
                 !($3 in map) {print $0, ".", ".", "."}' <(gunzip -c "$map_file") - | \
            gzip -c > "$output_file"

            if [[ -s "$output_file" ]]; then
                echo "  RSID added with GENPOS37 and GENPOS38. Output saved to $output_file"

                # Delete the original file if the RSID-added file was successfully created
                rm -f "$file"
                echo "  Deleted original file: $file"
            else
                echo "  Warning: $output_file is empty! Original file not deleted."
            fi
        else
            echo "  Error: RSID map file for chromosome $chrom not found or chromosome not identified."
        fi
    } >> "$log_file" 2>&1
}

export -f process_file
export map_dir
export log_file

# Find all .gz files and process them in parallel
find "$base_dir" -type f -name "*.gz" | parallel -j16 process_file

echo "Log ended on $(date)" >> "$log_file"
echo "Processing complete. Check $log_file for details."

#Run with: 
#nano add_rsids.sh
#chmod +x add_rsids.sh
#nohup ./add_rsids.sh > add_rsids2.log 2>&1 &
#ps -p 54107 -o stat,etime

1.2.3 🚫 DON’T 🚫 merge UKP-PPP chromosome files: draft script!!!

Takes too long; keep as chromosomes; retained because it likely works and we might need the chromosomes merged if we try out a non-cis-region approach later.

#!/bin/bash

# Define directories
base_dir="/Users/charleenadams/ukbppp/coloc_clin_mol_extracted"
output_dir="/Users/charleenadams/merged_ukbppp"
log_file="${output_dir}/merge_rsid_files.log"

# Create output directory if it doesn't exist
mkdir -p "$output_dir"

# Start log file
echo "Processing started on $(date)" > "$log_file"

# Function to process each subdirectory
process_directory() {
    local dir="$1"

    # Extract the subdirectory name for naming the merged file
    subdir_name=$(basename "$dir")
    merged_file="${output_dir}/${subdir_name}_merged_rsid_added.gz"

    echo "Processing directory: $dir" | tee -a "$log_file"

    # Find all *_rsid_added.gz files in the current directory
    rsid_files=("${dir}"/*_rsid_added.gz)

    # Check if there are files to merge
    if [[ ${#rsid_files[@]} -gt 0 ]]; then
        # Extract header from the first file
        gunzip -c "${rsid_files[0]}" | head -n 1 > temp_header.txt

        # Process each file, excluding the header, and append to a temporary content file
        for file in "${rsid_files[@]}"; do
            gunzip -c "$file" | tail -n +2 >> temp_content.txt
        done

        # Concatenate header and content into the final merged file
        cat temp_header.txt temp_content.txt | gzip -c > "$merged_file"

        # Clean up temporary files
        rm temp_header.txt temp_content.txt

        # Confirm if the merged file was created successfully and isn’t empty
        if [[ -s "$merged_file" ]]; then
            echo "  Merged file created: $merged_file" | tee -a "$log_file"
        else
            echo "  Warning: Merged file is empty for directory $dir" | tee -a "$log_file"
        fi
    else
        echo "  No '_rsid_added.gz' files found in directory $dir" | tee -a "$log_file"
    fi
}

export -f process_directory
export log_file output_dir

# Find all subdirectories in base_dir and process them in parallel
find "$base_dir" -mindepth 1 -maxdepth 1 -type d -print0 | parallel -0 -j 15 process_directory

echo "Processing completed on $(date)" >> "$log_file"
echo "Check $log_file for details."

# Run with:
# Script located at: /Users/charleenadams/coloc_clin_mol/scripts
#nano chrom_merge.sh
#chmod +x chrom_merge.sh
#nohup ./chrom_merge.sh > chrom_merge.log 2>&1 &
#ps -p 20853 -o stat,etime

1.2.4 Create cis-regions

The method I devised below obtains cis-regions (500KB up and downstream of TSSes using Ensembl) for 2808 of 2940 (96%) of the files.

#!/bin/bash

# Directories
base_dir="/Users/charleenadams/ukbppp/coloc_clin_mol_extracted"
results_dir="/Users/charleenadams/cis_ukbppp"
mkdir -p "$results_dir"

# Log file
log_file="$results_dir/process_files.log"
echo "Processing started on $(date)" > "$log_file"

# Function to process each directory
process_directory() {
    local dir="$1"
    local protein_name
    protein_name=$(basename "$dir" | cut -d '_' -f1) # Extract protein name (e.g., ITGB6)

    echo "Processing directory: $dir (Protein: $protein_name)" | tee -a "$log_file"

    # Retrieve TSS for the protein using Ensembl REST API
    tss_info=$(curl -s "https://rest.ensembl.org/lookup/symbol/human/${protein_name}?content-type=application/json")
    if [[ -z "$tss_info" ]]; then
        echo "  Error: TSS not found for $protein_name" | tee -a "$log_file"
        return
    fi
    tss=$(echo "$tss_info" | jq '.start') # Extract TSS (start position)
    chrom=$(echo "$tss_info" | jq -r '.seq_region_name') # Extract chromosome

    # Check if TSS and chromosome are valid
    if [[ -z "$tss" || -z "$chrom" ]]; then
        echo "  Error: Invalid TSS or chromosome for $protein_name" | tee -a "$log_file"
        return
    fi

    # Define the region (500kb upstream and downstream)
    start=$((tss - 500000))
    end=$((tss + 500000))
    if ((start < 0)); then start=0; fi

    # Locate the chromosome file for the TSS
    chrom_file=$(find "$dir" -type f -name "discovery_chr${chrom}_*.gz")
    if [[ -z "$chrom_file" ]]; then
        echo "  Error: Chromosome file not found for $protein_name (chr $chrom)" | tee -a "$log_file"
        return
    fi

    # Filter the chromosome file for the region and save results
    output_file="${results_dir}/${protein_name}_cis_region.gz"
    gunzip -c "$chrom_file" | awk -v chrom="$chrom" -v start="$start" -v end="$end" \
        'NR==1 || ($1 == chrom && $2 >= start && $2 <= end)' | gzip -c > "$output_file"

    if [[ -s "$output_file" ]]; then
        echo "  Cis-region file created: $output_file" | tee -a "$log_file"
    else
        echo "  Warning: Cis-region file is empty for $protein_name" | tee -a "$log_file"
    fi
}

export -f process_directory
export results_dir log_file

# Run the process in parallel for all directories
find "$base_dir" -mindepth 1 -maxdepth 1 -type d | parallel -j15 process_directory {}

echo "Processing completed on $(date)" >> "$log_file"
echo "Check $log_file for details."

# Run with:
#nano cis_regions.sh
#chmod +x cis_regions.sh
#nohup ./cis_regions.sh > cis_regions.log 2>&1 &
#ps -p 20853 -o stat,etime

1.3 Debugging headers

Adding in the rsIDs created headers with mixed delimiters, so I fixed that.

#!/bin/bash

input_dir="/Users/charleenadams/cis_ukbppp"
output_dir="/Users/charleenadams/cis_ukbppp/cleaned"

mkdir -p "$output_dir"

for file in "$input_dir"/*_cis_region.gz; do
    if [[ -f "$file" ]]; then
        output_file="$output_dir/$(basename "$file" .gz)_cleaned.gz"
        echo "Processing $file -> $output_file"
        gunzip -c "$file" | \
        awk 'BEGIN {OFS="\t"} {for (i=1; i<=NF; i++) $i=$i} 1' | \
        gzip > "$output_file"
    else
        echo "Skipping $file: Not a valid file."
    fi
done

echo "Processing complete. Cleaned files are saved in $output_dir."

du -sh /Users/charleenadams/cis_ukbppp/* | grep -v '4.0K' | wc -l

du -sh /Users/charleenadams/cis_ukbppp/* | grep -v '4.0K' > non_4K_files.txt

1.4 Format Nightingale VCF files: draft script

#!/bin/bash

# Define the directory containing the .vcf.gz files
input_dir="/Users/charleenadams/met_d_files"
output_dir="/Users/charleenadams/met_d_files/formatted"
mkdir -p "$output_dir"

# Log file for tracking progress
log_file="$output_dir/format_vcf.log"
echo "Log started on $(date)" > "$log_file"

# Function to process each file
process_vcf() {
    local file="$1"
    local base_name
    base_name=$(basename "$file" .vcf.gz)
    local output_file="${output_dir}/${base_name}_formatted.vcf.gz"

    echo "Processing $file..." | tee -a "$log_file"

    # Extract the header and process the data
    gunzip -c "$file" | \
    awk -v OFS="\t" '
        BEGIN {process = 0}
        /^#CHROM/ {
            process = 1;
            print "CHROM", "POS19", "ID", "ALLELE0", "ALLELE1", "QUAL", "FILTER", "INFO", "BETA", "SE", "LOG10P", "A1FREQ", "rsid", "GENPOS";
        }
        process && !/^#/ {
            split($9, format_keys, ":");
            split($10, format_values, ":");
            # Create new columns based on FORMAT field
            for (i = 1; i <= length(format_keys); i++) {
                if (format_keys[i] == "ES") BETA = format_values[i];
                if (format_keys[i] == "SE") SE = format_values[i];
                if (format_keys[i] == "LP") LOG10P = format_values[i];
                if (format_keys[i] == "AF") A1FREQ = format_values[i];
                if (format_keys[i] == "ID") rsid = format_values[i];
            }
            # Map columns to required headers
            print $1, $2, rsid, $4, $5, $6, $7, $8, BETA, SE, LOG10P, A1FREQ, rsid, $2;
        }' | gzip -c > "$output_file"

    # Confirm processing success
    if [[ -s "$output_file" ]]; then
        echo "  Successfully processed: $output_file" | tee -a "$log_file"
        rm "$file"
    else
        echo "  Warning: Processing failed for $file" | tee -a "$log_file"
    fi
}

export -f process_vcf
export output_dir
export log_file

# Find all .vcf.gz files (excluding .tbi and .tbi.1 files) and process them in parallel
find "$input_dir" -type f -name "*.vcf.gz" ! -name "*.tbi*" | parallel -j4 process_vcf

echo "Log ended on $(date)" >> "$log_file"
echo "Processing complete. Check $log_file for details."

2 Analysis with just one cis-region: PCSK9 with p-value threshold criterion (P<5E-8)

2.1 Subset METSIM files by PCSK9 cis-region

#!/bin/bash

#run with: 
#cd /Users/charleenadams/coloc_clin_mol/scripts
#chmod +x subset_PCSK9_metsim.sh
#nohup ./subset_PCSK9_metsim.sh > subset_PCSK9_metsim.log 2>&1 &
#ps -p 42761 -o stat,etime

# Input files and directories
cis_file="/Users/charleenadams/cis_ukbppp/cleaned/PCSK9_cis_region_cleaned.gz"
metsim_dir="/Users/charleenadams/metsim_files"
output_dir="/Users/charleenadams/metsim_files/metsim_cis_PCSK9_fixed"
log_file="subset_PCSK9_metsim.log"
rsid_file="pcsk9_rsid_list.txt"

# Create output directory
mkdir -p "$output_dir"

# Extract rsid list from the cis-region file (column 15) and store in a file
gzcat "$cis_file" | awk 'NR > 1 {print $15}' | sort | uniq > "$rsid_file"

if [[ ! -s "$rsid_file" ]]; then
  echo "Error: No rsid entries found in $cis_file" >> "$log_file"
  exit 1
fi

# Export variables for parallel processing
export metsim_dir output_dir log_file rsid_file

# Function to process a single METSIM file
process_metsim_file() {
  local metsim_file=$1
  local base_name=$(basename "$metsim_file" .txt.gz)
  local output_file="$output_dir/${base_name}_cis_PCSK9.txt"

  gzcat "$metsim_file" | awk '
    BEGIN {
      while ((getline < "'$rsid_file'") > 0) lookup[$1] = 1;
      close("'$rsid_file'");
    }
    NR == 1 {print $0; next} # Print header
    $5 in lookup {print $0}  # RSIDs are in column 5 in METSIM
  ' > "$output_file"

  if [[ ! -s "$output_file" ]]; then
    echo "Warning: No matching rows for $metsim_file" >> "$log_file"
    rm "$output_file"
  else
    echo "Processed $metsim_file" >> "$log_file"
  fi
}

export -f process_metsim_file

# Use GNU parallel to process files
find "$metsim_dir" -name "*.gz" | parallel -j $(nproc) process_metsim_file {}

2.2 Filter METSIM files on at least one SNP with P<5E-8 in PCSK9 cis-region

# Ran interactively in Rstudio

# Load necessary libraries
library(data.table)

# Directory where the txt files reside
directory <- "/Users/charleenadams/metsim_files/metsim_cis_PCSK9_fixed"
# Subdirectory where to copy files with pval < 5e-8
subdirectory <- "p_val_metsim_cis_PCSK9_fixed"

# Ensure subdirectory exists
if (!dir.exists(file.path(directory, subdirectory))) {
  dir.create(file.path(directory, subdirectory))
}

# Get all .txt files in the directory
files <- list.files(path = directory, pattern = "\\.txt$", full.names = TRUE)

# Process each file
for (file in files) {
  cat("Processing file:", basename(file), "\n")
  
  # Read the file
  data <- fread(file)
  
  # Check if any p-value is less than 5e-8
  if (any(data$pval < 5e-8, na.rm = TRUE)) {
    cat("File", basename(file), "has pval < 5e-8\n")
    file.copy(from = file, to = file.path(directory, subdirectory, basename(file)))
    cat("Copied", basename(file), "to", subdirectory, "\n")
  } else {
    cat("No match found in", basename(file), "\n")
  }
}

cat("Script completed\n")

2.2.1 Prep for HyPrColoc

Tricky. Do NOT under any circumstances select “update all” when prompted to install newer versions.

# Follow these installation steps exactly
remotes::install_version("Rcpp", version = "1.0.12", repos = "http://cran.us.r-project.org")
remotes::install_version("RcppEigen", version = "0.3.3.9.3", repos = "http://cran.us.r-project.org")

Sys.setenv(CC = "/opt/homebrew/opt/gcc/bin/gcc-14")
Sys.setenv(CXX = "/opt/homebrew/opt/gcc/bin/g++-14")

devtools::install_github("jrs95/hyprcoloc", 
                         build_opts = c("--no-resave-data", "--no-manual"), 
                         build_vignettes = TRUE)

# Load the necessary libraries
library(hyprcoloc)

# Regression coefficients from ten studies (Traits 1-5, 6-8 & 9-10 colocalize)
betas_t <- hyprcoloc::test.betas
ses_t <- hyprcoloc::test.ses

# Trait names and SNP IDs
traits_t <- paste0("T", 1:10)
rsid_t <- rownames(betas_t)
str(rsid_t)

# Colocalization analysis
test <- hyprcoloc(betas_t, ses_t, trait.names = traits_t, snp.id = rsid_t, snpscores = TRUE)

___
# Define number of traits and SNPs
num_traits <- 800
num_snps <- 1062  # Adjust if necessary

# Generate new trait names
traits_t <- paste0("T", 1:num_traits)

# Expand betas_t matrix (adding random noise to mimic new traits)
betas_t_expanded <- cbind(
  betas_t,  # Existing data
  matrix(
    rnorm(num_snps * (num_traits - ncol(betas_t)), 
          mean = rowMeans(betas_t, na.rm = TRUE), 
          sd = apply(betas_t, 1, sd, na.rm = TRUE)),
    nrow = num_snps
  )
)

# Expand ses_t matrix (adding noise)
ses_t_expanded <- cbind(
  ses_t,  # Existing data
  matrix(
    rnorm(num_snps * (num_traits - ncol(ses_t)), 
          mean = rowMeans(ses_t, na.rm = TRUE), 
          sd = apply(ses_t, 1, sd, na.rm = TRUE)),
    nrow = num_snps
  )
)

# Assign new column names
colnames(betas_t_expanded) <- traits_t
colnames(ses_t_expanded) <- traits_t

# Ensure row names are preserved
rownames(betas_t_expanded) <- rownames(betas_t)
rownames(ses_t_expanded) <- rownames(ses_t)

# Run colocalization analysis with 800 traits
test <- hyprcoloc(betas_t_expanded, ses_t_expanded, trait.names = traits_t, snp.id = rsid_t, snpscores = TRUE)

# Print summary of results
print(test)

2.3 Harmonizing PCSK9 and METSIM metabolites with at least on rsID with P<5E-8

We chose to prioritize which MESTIM metabolites to examine using a p-value theshold with the criterion being that at least one SNP in a respective cis-region being less than the threshold.

# Ran interactively in Rstudio
# Could do:
# nohup Rscript ./harmonize_PCSK9_metsim_pval_subsets_parallel.R > harmonize_PCSK9_metsim_pval_subsets_parallel.log 2>&1 &
# ps -p 17652 -o stat,etime

# Load required libraries
library(data.table)
library(TwoSampleMR)

# Custom logging function
custom_log <- function(level, message) {
  msg <- paste(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), level, message)
  cat(msg, "\n", file = "benchmark_harm_twosampleMR.log", append = TRUE)
  cat(msg, "\n")  # Also print to console
}

# Directories
cis_region_dir <- "/Users/charleenadams/cis_ukbppp/cleaned/"
metsim_dir <- "/Users/charleenadams/metsim_files/metsim_cis_PCSK9_fixed/p_val_metsim_cis_PCSK9_fixed"
output_dir <- "/Users/charleenadams/harmonized_results_fixed/"

# Function to read and format data for TwoSampleMR, including chromosome and position
read_and_format_data <- function(file_path, is_cis) {
  tryCatch({
    # Read file and convert to data.frame
    df <- fread(file_path, sep = "\t", header = TRUE)
    df <- as.data.frame(df)  # Ensure it is a data.frame
    
    # Format data for TwoSampleMR
    formatted_df <- format_data(df, 
                                type = ifelse(is_cis, "exposure", "outcome"), 
                                snp_col = ifelse(is_cis, "rsid", "rsids"), 
                                beta_col = ifelse(is_cis, "BETA", "beta"), 
                                se_col = ifelse(is_cis, "SE", "sebeta"), 
                                effect_allele_col = ifelse(is_cis, "ALLELE1", "alt"), 
                                other_allele_col = ifelse(is_cis, "ALLELE0", "ref"), 
                                eaf_col = ifelse(is_cis, "A1FREQ", "maf"), 
                                pval_col = ifelse(is_cis, "LOG10P", "pval"),
                                chr_col = ifelse(is_cis, "CHROM", "chrom"), 
                                pos_col = ifelse(is_cis, "POS38", "pos"),
                                phenotype_col = basename(file_path))
    
    # Set exposure or outcome names
    if (is_cis) {
      formatted_df$id.exposure <- basename(file_path)
    } else {
      formatted_df$id.outcome <- basename(file_path)
    }
    
    return(formatted_df)
  }, error = function(e) {
    custom_log("ERROR", paste("Error reading file", file_path, ":", e$message))
    return(NULL)
  })
}

# Main function using TwoSampleMR
main <- function() {
  # Create output directory if it doesn't exist
  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  
  # Select the cis-region file
  cis_file <- file.path(cis_region_dir, "PCSK9_cis_region_cleaned.gz")
  
  # Get the list of METSIM subset files
  metsim_files <- list.files(metsim_dir, pattern = "*.txt", full.names = TRUE)
  
  custom_log("INFO", paste("Processing cis-region file:", cis_file))
  custom_log("INFO", paste("Processing METSIM files:", paste(metsim_files, collapse = ", ")))
  
  # Read and format the cis-region file
  cis_df <- read_and_format_data(cis_file, is_cis = TRUE)
  if (is.null(cis_df)) {
    custom_log("ERROR", "Failed to read cis-region file.")
    return()
  }
  
  # Harmonize each METSIM file with the cis-region file
  harmonized_results <- lapply(metsim_files, function(metsim_file) {
    metsim_df <- read_and_format_data(metsim_file, is_cis = FALSE)
    if (!is.null(metsim_df)) {
      tryCatch({
        harmonized <- harmonise_data(exposure_dat = cis_df, outcome_dat = metsim_df, action = 2)
        custom_log("INFO", paste("Harmonized", metsim_file, "with", nrow(harmonized), "rows."))
        return(harmonized)
      }, error = function(e) {
        custom_log("ERROR", paste("Error harmonizing data for", metsim_file, ":", e$message))
        return(NULL)
      })
    }
    return(NULL)
  })
  
  # Combine harmonized results
  harmonized_results <- harmonized_results[!sapply(harmonized_results, is.null)]
  if (length(harmonized_results) > 0) {
    final_results <- rbindlist(harmonized_results)
    output_file <- file.path(output_dir, "harmonized_results_PCSK9.txt")
    fwrite(final_results, file = output_file, sep = "\t")
    custom_log("INFO", paste("Saved harmonized results to", output_file))
  } else {
    custom_log("WARNING", "No harmonized results were generated.")
  }
}

# Run the main function
main()

2.4 HyPrColoc of PCSK9 and METSIM metabolites with at least one SNP with P<5E-8

We conducted a colocalization analysis using HyPrColoc to explore shared genetic loci influencing PCSK9 and ?? METSIM metabolites.

# Run interactively in Rstudio
library(data.table)
library(hyprcoloc)
library(openxlsx)
library(DT)
library(knitr)

# Load phenotype mapping data
phenotype_map <- fread("/Users/charleenadams/metsim_files/phenotypes.tsv")
print(names(phenotype_map))  # Confirm structure of phenotype_map

# Define file path to harmonized results: note, this directory contains the files with at least one SNP P<5E-8
file_path <- "/Users/charleenadams/harmonized_results_fixed/harmonized_results_PCSK9.txt"

#file_path <- "/Users/charleenadams/harmonized_results/harmonized_results_PCSK9.txt"

# Read the harmonized data
data <- fread(file_path)

# Remove all characters after and including the first `_`
data$id.outcome <- sub("_.*", "", data$id.outcome)
data$id.exposure <- sub("_.*", "", data$id.exposure)

# Ensure both data frames are data.tables
setDT(data)
setDT(phenotype_map)

# Merge phenostring from phenotype_map into data using id.outcome and phenocode
data <- merge(data, phenotype_map[, .(phenocode, phenostring)], 
              by.x = "id.outcome", by.y = "phenocode", all.x = TRUE)

# Rename the phenostring column to METSIM_name
setnames(data, "phenostring", "METSIM_name")

# Remove asterisks (*) from the METSIM_name column
data$METSIM_name <- gsub("\\*", "", data$METSIM_name)
cat("Data read successfully. Total rows:", nrow(data), "\n")

# Ensure the data contains the required columns
required_columns <- c("SNP", "id.exposure", "METSIM_name", "beta.exposure", "beta.outcome", "se.exposure", "se.outcome")
if (!all(required_columns %in% colnames(data))) {
  stop("The harmonized data is missing required columns!")
}

# Summarize pval.outcome for each id.outcome
pval_summary <- data[, .(
  min_pval = min(pval.outcome, na.rm = TRUE),
  max_pval = max(pval.outcome, na.rm = TRUE),
  mean_pval = mean(pval.outcome, na.rm = TRUE),
  median_pval = median(pval.outcome, na.rm = TRUE),
  count = .N  # Count number of SNPs per id.outcome
), by = id.outcome]

# Display results
print(pval_summary)
pval_summary <- pval_summary[order(pval_summary$min_pval),]

# Extract unique traits and common SNPs
traits <- unique(c(data$id.exposure, data$METSIM_name))
common_snps <- unique(data$SNP)
cat("Number of traits:", length(traits), "Number of SNPs:", length(common_snps), "\n")

# Filter data to include only common SNPs
data <- data[SNP %in% common_snps]

# Reshape data to create a long table for matrix population
long_data <- rbindlist(list(
  data[, .(SNP, trait = id.exposure, beta = beta.exposure, se = se.exposure)],
  data[, .(SNP, trait = METSIM_name, beta = beta.outcome, se = se.outcome)]
), use.names = TRUE)

# Ensure no duplicate SNP-trait combinations
long_data <- unique(long_data, by = c("SNP", "trait"))
cat("Rows in reshaped long_data:", nrow(long_data), "\n")

# Create matrices for betas and ses
betas_mat <- dcast(long_data, SNP ~ trait, value.var = "beta", fill = NA)
ses_mat <- dcast(long_data, SNP ~ trait, value.var = "se", fill = NA)
snps <- betas_mat$SNP

# Convert matrices to the appropriate format for hyprcoloc
rownames(betas_mat) <- betas_mat$SNP
rownames(ses_mat) <- ses_mat$SNP

betas_mat <- as.matrix(betas_mat[, -1, with = FALSE])
ses_mat <- as.matrix(ses_mat[, -1, with = FALSE])

# Restore the SNP column as rownames
rownames(betas_mat) <- snps
rownames(ses_mat) <- snps

# Confirm rownames are correctly set
head(rownames(betas_mat))

cat("Dimensions of beta matrix:", dim(betas_mat), "\n")
cat("Dimensions of SE matrix:", dim(ses_mat), "\n")

# Check for rows with missing values
rows_with_na_betas <- apply(betas_mat, 1, function(row) any(is.na(row)))
rows_with_na_ses <- apply(ses_mat, 1, function(row) any(is.na(row)))

# Filter matrices to keep only rows with no missing values
betas_mat <- betas_mat[!rows_with_na_betas, , drop = FALSE]
ses_mat <- ses_mat[!rows_with_na_ses, , drop = FALSE]

cat("Dimensions of beta matrix:", dim(betas_mat), "\n")
cat("Dimensions of SE matrix:", dim(ses_mat), "\n")

# Run hyprcoloc
cat("Running hyprcoloc...\n")

betas <- betas_mat
ses <- ses_mat
traits_names <- colnames(betas_mat)
traits_names
snp_ids <- rownames(betas_mat)

# Uniform priors = FALSE
results_unipriorsFALSE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE,uniform.priors=FALSE)


# Static table with kable
kable(
  results_unipriorsFALSE$results,
  caption = "HyprColoc Results with Uniform Priors = FALSE",
  align = 'c'
)

# Uniform priors = TRUE
results_unipriorsTRUE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors=TRUE)

kable(
  results_unipriorsTRUE$results,
  caption = "HyprColoc Results with Uniform Priors = TRUE",
  align = 'c'
)

# Without assuming uniform priors and with reg.thresh and align.thresh
results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95 <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, uniform.priors = FALSE,snpscores = TRUE,
  reg.thresh = 0.95, align.thresh = 0.95
)

# Input
traits_string3 <- results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95[["results"]][["traits"]][1]

# Split the string by commas and count the number of elements
num_traits3 <- length(strsplit(traits_string3, ", ")[[1]])

# Print the result
cat("Number of traits:", num_traits3, "\n")

kable(
  results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95$results,
  caption = "HyprColoc Results with Uniform Priors = FALSE and regional and alignment thresholds = 0.95",
  align = 'c'
)

# Save the results as tabs in Excel
# Define file path for the output Excel workbook
output_file <- "/Users/charleenadams/pcsk9_hyprcoloc_corrected_results_summary_with_snpscores_fixed.xlsx"

# Initialize a new workbook
wb <- createWorkbook()

# Function to add a result set to the workbook
add_results_to_workbook <- function(workbook, result_data, sheet_name) {
  # Extract the results data frame
  results_df <- result_data$results
  
  # Create a worksheet and write the data
  addWorksheet(workbook, sheet_name)
  writeData(workbook, sheet_name, results_df)
  
  # Apply styling
  header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER", 
                              fgFill = "#4F81BD", textDecoration = "Bold")
  addStyle(workbook, sheet_name, header_style, rows = 1, cols = 1:ncol(results_df), gridExpand = TRUE)
}

# Function to add SNP scores to the workbook with rsIDs
add_snpscores_to_workbook <- function(workbook, snp_scores, rsIDs, sheet_name) {
  # Convert SNP scores to a data table and include rsIDs
  snp_scores_df <- as.data.table(snp_scores)
  colnames(snp_scores_df) <- paste0("SNP_Score_Set_", seq_len(ncol(snp_scores_df)))
  snp_scores_df$rsID <- rsIDs
  
  # Reorder columns to put rsID first
  setcolorder(snp_scores_df, c("rsID", setdiff(names(snp_scores_df), "rsID")))
  
  # Create a worksheet and write the data
  addWorksheet(workbook, sheet_name)
  writeData(workbook, sheet_name, snp_scores_df)
  
  # Apply styling
  header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER", 
                              fgFill = "#4F81BD", textDecoration = "Bold")
  addStyle(workbook, sheet_name, header_style, rows = 1, cols = 1:ncol(snp_scores_df), gridExpand = TRUE)
}

# Add results and SNP scores for each analysis
# Ensure rsIDs are available as rownames in the matrices
rsIDs <- rownames(betas_mat)

# results_unipriorsFALSE
add_results_to_workbook(wb, results_unipriorsFALSE, "Results_Unipriors_FALSE")
add_snpscores_to_workbook(wb, results_unipriorsFALSE$snpscores, rsIDs, "SNP_Scores_Unipriors_FALSE")

# results_unipriorsTRUE
add_results_to_workbook(wb, results_unipriorsTRUE, "Results_Unipriors_TRUE")
add_snpscores_to_workbook(wb, results_unipriorsTRUE$snpscores, rsIDs, "SNP_Scores_Unipriors_TRUE")

# results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95
add_results_to_workbook(wb, results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95, 
                        "Unipriors_FALSE_Thresh_95")
add_snpscores_to_workbook(wb, results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95$snpscores, 
                          rsIDs, "SNP_Scores_Thresh_95")

# Save the workbook
saveWorkbook(wb, output_file, overwrite = TRUE)

# Confirm completion
cat("Results and SNP scores saved to:", output_file, "\n")

3 Analysis of PCSK9 on METSIM metabolites (ignoring p-value threshold criterion)

3.1 Harmonizing PCSK9 and all METSIM metabolites (ignoring p-value criterion)

# Ran interactively in rstudio
# Could do:
# nohup Rscript ./harmonize_PCSK9_metsim_pval_subsets_parallel.R > harmonize_PCSK9_metsim_pval_subsets_parallel.log 2>&1 &
# ps -p 17652 -o stat,etime

# Load required libraries
library(data.table)
library(TwoSampleMR)

# Custom logging function
custom_log <- function(level, message) {
  msg <- paste(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), level, message)
  cat(msg, "\n", file = "benchmark_harm_twosampleMR.log", append = TRUE)
  cat(msg, "\n")  # Also print to console
}

# Directories
cis_region_dir <- "/Users/charleenadams/cis_ukbppp/cleaned/"
metsim_dir <- "/Users/charleenadams/metsim_files/metsim_cis_PCSK9_fixed/"
output_dir <- "/Users/charleenadams/harmonized_results_fixed_all/"

# Function to read and format data for TwoSampleMR, including chromosome and position
read_and_format_data <- function(file_path, is_cis) {
  tryCatch({
    # Read file and convert to data.frame
    df <- fread(file_path, sep = "\t", header = TRUE)
    df <- as.data.frame(df)  # Ensure it is a data.frame
    
    # Format data for TwoSampleMR
    formatted_df <- format_data(df, 
                                type = ifelse(is_cis, "exposure", "outcome"), 
                                snp_col = ifelse(is_cis, "rsid", "rsids"), 
                                beta_col = ifelse(is_cis, "BETA", "beta"), 
                                se_col = ifelse(is_cis, "SE", "sebeta"), 
                                effect_allele_col = ifelse(is_cis, "ALLELE1", "alt"), 
                                other_allele_col = ifelse(is_cis, "ALLELE0", "ref"), 
                                eaf_col = ifelse(is_cis, "A1FREQ", "maf"), 
                                pval_col = ifelse(is_cis, "LOG10P", "pval"),
                                chr_col = ifelse(is_cis, "CHROM", "chrom"), 
                                pos_col = ifelse(is_cis, "POS38", "pos"),
                                phenotype_col = basename(file_path))
    
    # Set exposure or outcome names
    if (is_cis) {
      formatted_df$id.exposure <- basename(file_path)
    } else {
      formatted_df$id.outcome <- basename(file_path)
    }
    
    return(formatted_df)
  }, error = function(e) {
    custom_log("ERROR", paste("Error reading file", file_path, ":", e$message))
    return(NULL)
  })
}

# Main function using TwoSampleMR
main <- function() {
  # Create output directory if it doesn't exist
  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  
  # Select the cis-region file
  cis_file <- file.path(cis_region_dir, "PCSK9_cis_region_cleaned.gz")
  
  # Get the list of METSIM subset files
  metsim_files <- list.files(metsim_dir, pattern = "*.txt", full.names = TRUE)
  
  custom_log("INFO", paste("Processing cis-region file:", cis_file))
  custom_log("INFO", paste("Processing METSIM files:", paste(metsim_files, collapse = ", ")))
  
  # Read and format the cis-region file
  cis_df <- read_and_format_data(cis_file, is_cis = TRUE)
  if (is.null(cis_df)) {
    custom_log("ERROR", "Failed to read cis-region file.")
    return()
  }
  
  # Harmonize each METSIM file with the cis-region file
  harmonized_results <- lapply(metsim_files, function(metsim_file) {
    metsim_df <- read_and_format_data(metsim_file, is_cis = FALSE)
    if (!is.null(metsim_df)) {
      tryCatch({
        harmonized <- harmonise_data(exposure_dat = cis_df, outcome_dat = metsim_df, action = 2)
        custom_log("INFO", paste("Harmonized", metsim_file, "with", nrow(harmonized), "rows."))
        return(harmonized)
      }, error = function(e) {
        custom_log("ERROR", paste("Error harmonizing data for", metsim_file, ":", e$message))
        return(NULL)
      })
    }
    return(NULL)
  })
  
  # Combine harmonized results
  harmonized_results <- harmonized_results[!sapply(harmonized_results, is.null)]
  if (length(harmonized_results) > 0) {
    final_results <- rbindlist(harmonized_results)
    output_file <- file.path(output_dir, "harmonized_results_PCSK9.txt")
    fwrite(final_results, file = output_file, sep = "\t")
    custom_log("INFO", paste("Saved harmonized results to", output_file))
  } else {
    custom_log("WARNING", "No harmonized results were generated.")
  }
}

# Run the main function
main()

3.2 HyPrColoc of PCSK9 and all METSIM metabolites (ignoring p-value criterion)

This is technically doable but not feasible for a pipeline. I believe it took over 8 hours.

# Run interactively in Rstudio
library(data.table)
library(hyprcoloc)
library(openxlsx)
library(DT)
library(knitr)

# Load phenotype mapping data
phenotype_map <- fread("/Users/charleenadams/metsim_files/phenotypes.tsv")
print(names(phenotype_map))  # Confirm structure of phenotype_map

# Define file path to harmonized results
file_path <- "/Users/charleenadams/harmonized_results_fixed_all/harmonized_results_PCSK9.txt"
#file_path <- "/Users/charleenadams/harmonized_results/harmonized_results_PCSK9.txt"# legacy

# Read the harmonized data
data <- fread(file_path)

# Remove all characters after and including the first `_`
data$id.outcome <- sub("_.*", "", data$id.outcome)
data$id.exposure <- sub("_.*", "", data$id.exposure)

# Ensure both data frames are data.tables
setDT(data)
setDT(phenotype_map)

# Merge phenostring from phenotype_map into data using id.outcome and phenocode
data <- merge(data, phenotype_map[, .(phenocode, phenostring)], 
              by.x = "id.outcome", by.y = "phenocode", all.x = TRUE)

# Rename the phenostring column to METSIM_name
setnames(data, "phenostring", "METSIM_name")

# Remove asterisks (*) from the METSIM_name column
data$METSIM_name <- gsub("\\*", "", data$METSIM_name)
cat("Data read successfully. Total rows:", nrow(data), "\n")

# Ensure the data contains the required columns
required_columns <- c("SNP", "id.exposure", "METSIM_name", "beta.exposure", "beta.outcome", "se.exposure", "se.outcome")
if (!all(required_columns %in% colnames(data))) {
  stop("The harmonized data is missing required columns!")
}

# Extract unique traits and common SNPs
traits <- unique(c(data$id.exposure, data$METSIM_name))
common_snps <- unique(data$SNP)
cat("Number of traits:", length(traits), "Number of SNPs:", length(common_snps), "\n")

# Filter data to include only common SNPs
data <- data[SNP %in% common_snps]

# Reshape data to create a long table for matrix population
long_data <- rbindlist(list(
  data[, .(SNP, trait = id.exposure, beta = beta.exposure, se = se.exposure)],
  data[, .(SNP, trait = METSIM_name, beta = beta.outcome, se = se.outcome)]
), use.names = TRUE)

# Ensure no duplicate SNP-trait combinations
long_data <- unique(long_data, by = c("SNP", "trait"))
cat("Rows in reshaped long_data:", nrow(long_data), "\n")

# Create matrices for betas and ses
betas_mat <- dcast(long_data, SNP ~ trait, value.var = "beta", fill = NA)
ses_mat <- dcast(long_data, SNP ~ trait, value.var = "se", fill = NA)
snps <- betas_mat$SNP

# Convert matrices to the appropriate format for hyprcoloc
rownames(betas_mat) <- betas_mat$SNP
rownames(ses_mat) <- ses_mat$SNP

betas_mat <- as.matrix(betas_mat[, -1, with = FALSE])
ses_mat <- as.matrix(ses_mat[, -1, with = FALSE])

# Restore the SNP column as rownames
rownames(betas_mat) <- snps
rownames(ses_mat) <- snps

# Confirm rownames are correctly set
head(rownames(betas_mat))

cat("Dimensions of beta matrix:", dim(betas_mat), "\n")
cat("Dimensions of SE matrix:", dim(ses_mat), "\n")

# Check for rows with missing values
rows_with_na_betas <- apply(betas_mat, 1, function(row) any(is.na(row)))
rows_with_na_ses <- apply(ses_mat, 1, function(row) any(is.na(row)))

# Filter matrices to keep only rows with no missing values
betas_mat <- betas_mat[!rows_with_na_betas, , drop = FALSE]
ses_mat <- ses_mat[!rows_with_na_ses, , drop = FALSE]

cat("Dimensions of beta matrix:", dim(betas_mat), "\n")
cat("Dimensions of SE matrix:", dim(ses_mat), "\n")

# This noted-out solution assumes no effect of the missing SNP on the metabolite, which may be false. So, I removed all rows with any missing values instead, which would risk eliminating a causal SNP if in a row with some missing values. Perhaps six of one, half a dozen of the other. The same candidate causal SNP is found regardless.
# betas_mat_filtered[is.na(betas_mat_filtered)] <- 0
# ses_mat_filtered[is.na(ses_mat_filtered)] <- 1e6  # Large SE for low confidence

# Run hyprcoloc
cat("Running hyprcoloc...\n")

betas <- betas_mat
ses <- ses_mat
traits_names <- colnames(betas_mat)
traits_names
snp_ids <- rownames(betas_mat)

# Uniform priors = FALSE
results_unipriorsFALSE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE,uniform.priors=FALSE)

# Static table with kable
kable(
  results_unipriorsFALSE$results,
  caption = "HyprColoc Results with Uniform Priors = FALSE",
  align = 'c'
)

# Uniform priors = TRUE
results_unipriorsTRUE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors=TRUE)

kable(
  results_unipriorsTRUE$results,
  caption = "HyprColoc Results with Uniform Priors = TRUE",
  align = 'c'
)

# Without assuming uniform priors and with reg.thresh and align.thresh
results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95 <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, uniform.priors = FALSE,snpscores = TRUE,
  reg.thresh = 0.95, align.thresh = 0.95
)

# Input
traits_string3 <- results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95[["results"]][["traits"]][1]

# Split the string by commas and count the number of elements
num_traits3 <- length(strsplit(traits_string3, ", ")[[1]])

# Print the result
cat("Number of traits:", num_traits3, "\n")

kable(
  results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95$results,
  caption = "HyprColoc Results with Uniform Priors = FALSE and regional and alignment thresholds = 0.95",
  align = 'c'
)

# Save the results as tabs in Excel
# Define file path for the output Excel workbook
output_file <- "/Users/charleenadams/pcsk9_hyprcoloc_corrected_results_summary_with_snpscores_fixed.xlsx"

# Initialize a new workbook
wb <- createWorkbook()

# Function to add a result set to the workbook
add_results_to_workbook <- function(workbook, result_data, sheet_name) {
  # Extract the results data frame
  results_df <- result_data$results
  
  # Create a worksheet and write the data
  addWorksheet(workbook, sheet_name)
  writeData(workbook, sheet_name, results_df)
  
  # Apply styling
  header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER", 
                              fgFill = "#4F81BD", textDecoration = "Bold")
  addStyle(workbook, sheet_name, header_style, rows = 1, cols = 1:ncol(results_df), gridExpand = TRUE)
}

# Function to add SNP scores to the workbook with rsIDs
add_snpscores_to_workbook <- function(workbook, snp_scores, rsIDs, sheet_name) {
  # Convert SNP scores to a data table and include rsIDs
  snp_scores_df <- as.data.table(snp_scores)
  colnames(snp_scores_df) <- paste0("SNP_Score_Set_", seq_len(ncol(snp_scores_df)))
  snp_scores_df$rsID <- rsIDs
  
  # Reorder columns to put rsID first
  setcolorder(snp_scores_df, c("rsID", setdiff(names(snp_scores_df), "rsID")))
  
  # Create a worksheet and write the data
  addWorksheet(workbook, sheet_name)
  writeData(workbook, sheet_name, snp_scores_df)
  
  # Apply styling
  header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER", 
                              fgFill = "#4F81BD", textDecoration = "Bold")
  addStyle(workbook, sheet_name, header_style, rows = 1, cols = 1:ncol(snp_scores_df), gridExpand = TRUE)
}

# Add results and SNP scores for each analysis
# Ensure rsIDs are available as rownames in the matrices
rsIDs <- rownames(betas_mat)

# results_unipriorsFALSE
add_results_to_workbook(wb, results_unipriorsFALSE, "Results_Unipriors_FALSE")
add_snpscores_to_workbook(wb, results_unipriorsFALSE$snpscores, rsIDs, "SNP_Scores_Unipriors_FALSE")

# results_unipriorsTRUE
add_results_to_workbook(wb, results_unipriorsTRUE, "Results_Unipriors_TRUE")
add_snpscores_to_workbook(wb, results_unipriorsTRUE$snpscores, rsIDs, "SNP_Scores_Unipriors_TRUE")

# results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95
add_results_to_workbook(wb, results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95, 
                        "Unipriors_FALSE_Thresh_95")
add_snpscores_to_workbook(wb, results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95$snpscores, 
                          rsIDs, "SNP_Scores_Thresh_95")

# Save the workbook
saveWorkbook(wb, output_file, overwrite = TRUE)

# Confirm completion
cat("Results and SNP scores saved to:", output_file, "\n")

3.2.1 Results: HyPrColoc of PCSK9 and all metabolites (ignoring p-value)

library(data.table)

# Load phenotype mapping data
phenotype_map <- fread("/Users/charleenadams/metsim_files/phenotypes.tsv")
print(names(phenotype_map))  # Confirm structure of phenotype_map
head(phenotype_map)

Results_Unipriors_FALSE <- fread("/Users/charleenadams/coloc_clin_mol/results/pcsk9/all/Results_Unipriors_FALSE.csv")
Results_Unipriors_FALSE

Results_Unipriors_TRUE <- fread("/Users/charleenadams/coloc_clin_mol/results/pcsk9/all/Results_Unipriors_TRUE.csv")
Results_Unipriors_TRUE

Results_Unipriors_FALSE_Thresh_95 <- fread("/Users/charleenadams/coloc_clin_mol/results/pcsk9/all/Results_Unipriors_FALSE_Thresh_95.csv")
Results_Unipriors_FALSE_Thresh_95
str(Results_Unipriors_FALSE_Thresh_95)
Results_Unipriors_FALSE_Thresh_95$traits
str(Results_Unipriors_FALSE_Thresh_95$traits[1])

split_metabolites <- function(metabolite_string) {
  split_list <- c()  # Initialize empty list for results
  current_part <- ""  # Buffer to accumulate valid metabolite names
  open_parens <- 0  # Tracks nesting level
  prev_char <- ""  # Stores the previous character

  # Loop through characters
  for (char in strsplit(metabolite_string, "")[[1]]) {
    if (char == "(") open_parens <- open_parens + 1
    if (char == ")") open_parens <- open_parens - 1
    
    # 🔥 Fix: Don't split if the comma is followed by a digit (like 1,2- or 2,3-)
    if (char == "," && open_parens == 0 && !grepl("\\d", prev_char)) { 
      split_list <- c(split_list, trimws(current_part))  # Save previous part
      current_part <- ""  # Reset buffer
    } else {
      current_part <- paste0(current_part, char)  # Accumulate valid characters
    }
    
    prev_char <- char  # Update previous character
  }
  
  # Add last part
  split_list <- c(split_list, trimws(current_part))
  
  return(split_list)
}

# Results_Unipriors_FALSE_Thresh_95$traits[1]
metabolite_string_Results_Unipriors_FALSE_Thresh_95 <- Results_Unipriors_FALSE_Thresh_95$traits[1]
split_list_FALSE_Thresh_95 <- split_metabolites(metabolite_string_Results_Unipriors_FALSE_Thresh_95)
print(split_list_FALSE_Thresh_95)

metabolite_string_Results_Unipriors_FALSE <- Results_Unipriors_FALSE$traits[1]
split_list_FALSE <- split_metabolites(metabolite_string_Results_Unipriors_FALSE)
print(split_list_FALSE)

metabolite_string_Results_Unipriors_TRUE <- Results_Unipriors_TRUE$traits[1]
split_list_TRUE <- split_metabolites(metabolite_string_Results_Unipriors_TRUE)
print(split_list_TRUE)

4 Benchmarking loop of selected proteins on METISM metabolites (P<5E-8 threshold)

4.1 Identify 25 proteins in list from Usman

24/25 available

cd /Users/charleenadams/cis_ukbppp/cleaned

# Find the 25 proteins
ls | grep -E '^(ACY1|APOE|CD36|IL6|ACE2|NTproBNP|LEPR|AGXT|LEP|LRP11|LPL|REN|UMOD|FABP4|IL1RL1|FAP|IL6R|ANGPTL3|APOL1|ASAH1|KLKB1|PDGFRA|SIGLEC9|WFIKKN2|MAP4K5|NPPB)_' | wc -l

# List them
ls | grep -E '^(ACY1|APOE|CD36|IL6|ACE2|NTproBNP|LEPR|AGXT|LEP|LRP11|LPL|REN|UMOD|FABP4|IL1RL1|FAP|IL6R|ANGPTL3|APOL1|ASAH1|KLKB1|PDGFRA|SIGLEC9|WFIKKN2|MAP4K5|NPPB)_'

4.2 Loop to make cis-regions for 24 proteins (minus ACE2)

#!/bin/bash
 
# Run with:
# cd /Users/charleenadams/coloc_clin_mol/scripts
# chmod +x loop_subset_24_parallel.sh
# nohup ./loop_subset_24_parallel.sh > loop_subset_24_parallel.log 2>&1 &
# ps -p 2562 -o stat,etime

set -x

# Define directories
cis_dir="/Users/charleenadams/cis_ukbppp/cleaned"
metsim_dir="/Users/charleenadams/metsim_files"
output_base_dir="/Users/charleenadams/metsim_files"

# Identify the specific 24 proteins dynamically
# remove ACE2
cis_files=$(ls "$cis_dir" | grep -E '^(ACY1|APOE|CD36|IL6|NTproBNP|LEPR|AGXT|LEP|LRP11|LPL|REN|UMOD|FABP4|IL1RL1|FAP|IL6R|ANGPTL3|APOL1|ASAH1|KLKB1|PDGFRA|SIGLEC9|WFIKKN2|MAP4K5|NPPB)_')

# Loop through each identified cis-region file
for cis_file in $cis_files; do
    protein_name=$(echo "$cis_file" | cut -d'_' -f1)
    output_dir="${output_base_dir}/${protein_name}_fixed"
    rsid_file="${protein_name}_rsid_list.txt"

    # Create output directory
    mkdir -p "$output_dir"

    # Extract RSID list from column 15 of the cis-region file
    gzcat "$cis_dir/$cis_file" | awk 'NR > 1 {print $15}' | sort | uniq > "$rsid_file"

    # Check if RSID file is empty
    if [[ ! -s "$rsid_file" ]]; then
        echo "Error: No RSIDs found in $cis_file" >> "${protein_name}.log"
        continue
    fi

    export output_dir rsid_file

    # Function to process a single METSIM file
    process_metsim_file() {
        local metsim_file=$1
        local base_name=$(basename "$metsim_file" .txt.gz)
        local output_file="$output_dir/${base_name}_cis_${protein_name}.txt"

        gzcat "$metsim_file" | awk '
            BEGIN {
                while ((getline < "'$rsid_file'") > 0) lookup[$1] = 1;
                close("'$rsid_file'");
            }
            NR == 1 {print $0; next} # Print header
            $5 in lookup {print $0}  # RSIDs are in column 5 in METSIM
        ' > "$output_file"

        if [[ ! -s "$output_file" ]]; then
            echo "Warning: No matching rows for $metsim_file" >> "${protein_name}.log"
            rm "$output_file"
        else
            echo "Processed $metsim_file" >> "${protein_name}.log"
        fi
    }

    export -f process_metsim_file

    # Use GNU parallel to process METSIM files in parallel
    find "$metsim_dir" -name "*.gz" | parallel -j $(nproc) process_metsim_file {}

done

4.3 Filter METSIM files on at least one SNP with P<5E-8 in respective cis-region

# ran interactively in Rstudio

# Load necessary libraries
library(data.table)

# Parent directory where the protein-specific folders are located
parent_directory <- "/Users/charleenadams/metsim_files"

# Explicit list of valid subdirectories to process
valid_subdirectories <- c("LEP_fixed", "FABP4_fixed", "MAP4K5_fixed", "IL1RL1_fixed", 
                          "REN_fixed", "IL6_fixed", "LRP11_fixed", "ASAH1_fixed", "FAP_fixed", 
                          "APOE_fixed", "ACY1_fixed", "AGXT_fixed", "KLKB1_fixed", "APOL1_fixed", 
                          "NPPB_fixed", "SIGLEC9_fixed", "LEPR_fixed", "UMOD_fixed", "CD36_fixed", 
                          "ANGPTL3_fixed", "WFIKKN2_fixed", "IL6R_fixed", "LPL_fixed", "PDGFRA_fixed")

# Get all valid subdirectories
subdirectories <- file.path(parent_directory, valid_subdirectories)
subdirectories <- subdirectories[dir.exists(subdirectories)]

# Process each subdirectory
for (subdir in subdirectories) {
  
  # Extract the protein name from the directory name
  protein_name <- basename(subdir)
  
  # Define the subdirectory for significant p-values
  significant_subdir <- file.path(subdir, paste0("p_val_metsim_cis_", protein_name))
  
  # Ensure the subdirectory exists
  if (!dir.exists(significant_subdir)) {
    dir.create(significant_subdir)
    cat("Created directory:", significant_subdir, "\n")
  }
  
  # Get all .txt files in the subdirectory
  files <- list.files(path = subdir, pattern = "\\.txt$", full.names = TRUE)
  
  # Process each file
  for (file in files) {
    cat("Processing file:", basename(file), "in", protein_name, "\n")
    
    # Read the file
    data <- fread(file)
    
    # Check if any p-value is less than 5e-8
    if (any(data$pval < 5e-8, na.rm = TRUE)) {
      cat("File", basename(file), "has pval < 5e-8\n")
      file.copy(from = file, to = file.path(significant_subdir, basename(file)))
      cat("Copied", basename(file), "to", significant_subdir, "\n")
    } else {
      cat("No match found in", basename(file), "\n")
    }
  }
}

cat("Script completed\n")

4.4 Loop to harmonize selected cis-region proteins and METSIM metabolites with at least on rsID with P<5E-8

# Ran interactively in rstudio
# could do:
# nohup Rscript ./harmonize_loop24_metsim_pval_subsets_parallel.R > harmonize_loop24_metsim_pval_subsets_parallel.log 2>&1 &
# ps -p 17652 -o stat,etime

# Load required libraries
library(data.table)
library(TwoSampleMR)

# Custom logging function
custom_log <- function(level, message) {
  msg <- paste(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), level, message)
  cat(msg, "\n", file = "benchmark_harm_twosampleMR.log", append = TRUE)
  cat(msg, "\n")  # Also print to console
}

# Function to read and format data for TwoSampleMR, including chromosome and position
read_and_format_data <- function(file_path, is_cis) {
  tryCatch({
    # Read file and convert to data.frame
    df <- fread(file_path, sep = "\t", header = TRUE)
    df <- as.data.frame(df)  # Ensure it is a data.frame
    
    # Format data for TwoSampleMR
    formatted_df <- format_data(df, 
                                type = ifelse(is_cis, "exposure", "outcome"), 
                                snp_col = ifelse(is_cis, "rsid", "rsids"), 
                                beta_col = ifelse(is_cis, "BETA", "beta"), 
                                se_col = ifelse(is_cis, "SE", "sebeta"), 
                                effect_allele_col = ifelse(is_cis, "ALLELE1", "alt"), 
                                other_allele_col = ifelse(is_cis, "ALLELE0", "ref"), 
                                eaf_col = ifelse(is_cis, "A1FREQ", "maf"), 
                                pval_col = ifelse(is_cis, "LOG10P", "pval"),
                                chr_col = ifelse(is_cis, "CHROM", "chrom"), 
                                pos_col = ifelse(is_cis, "POS38", "pos"),
                                phenotype_col = basename(file_path))
    
    # Set exposure or outcome names
    if (is_cis) {
      formatted_df$id.exposure <- basename(file_path)
    } else {
      formatted_df$id.outcome <- basename(file_path)
    }
    
    return(formatted_df)
  }, error = function(e) {
    custom_log("ERROR", paste("Error reading file", file_path, ":", e$message))
    return(NULL)
  })
}

# Define directories
cis_region_dir <- "/Users/charleenadams/cis_ukbppp/cleaned/"
metsim_dir <- "/Users/charleenadams/metsim_files"
output_dir <- "/Users/charleenadams/harmonized_results_fixed/"

# List of valid protein subdirectories (ensures only the right ones are processed)
valid_proteins <- c("LEP", "FABP4", "MAP4K5", "IL1RL1", "REN", "IL6", "LRP11", "ASAH1", "FAP",
                    "APOE", "ACY1", "AGXT", "KLKB1", "APOL1", "NPPB", "SIGLEC9", "LEPR",
                    "UMOD", "CD36", "ANGPTL3", "WFIKKN2", "IL6R", "LPL", "PDGFRA")

# Iterate through each protein's directory
for (protein in valid_proteins) {
  
  # Define paths
  protein_fixed_dir <- file.path(metsim_dir, paste0(protein, "_fixed"))
  pval_subdir <- file.path(protein_fixed_dir, paste0("p_val_metsim_cis_", protein, "_fixed"))
  cis_file <- file.path(cis_region_dir, paste0(protein, "_cis_region_cleaned.gz"))
  
  # Check if pval_subdir exists before proceeding
  if (!dir.exists(pval_subdir)) {
    custom_log("WARNING", paste("Skipping", pval_subdir, "- Directory does not exist."))
    next
  }

  # Get list of METSIM subset files
  metsim_files <- list.files(pval_subdir, pattern = "\\.txt$", full.names = TRUE)
  
  if (length(metsim_files) == 0) {
    custom_log("WARNING", paste("No files found in", pval_subdir, "- Skipping."))
    next
  }
  
  # Read and format the cis-region file
  if (!file.exists(cis_file)) {
    custom_log("ERROR", paste("Missing cis-region file:", cis_file, "- Skipping."))
    next
  }
  
  custom_log("INFO", paste("Processing", protein, "with cis-region file:", cis_file))
  
  cis_df <- read_and_format_data(cis_file, is_cis = TRUE)
  if (is.null(cis_df)) {
    custom_log("ERROR", paste("Failed to read cis-region file for", protein, "- Skipping."))
    next
  }

  # Harmonize each METSIM file with the cis-region file
  harmonized_results <- lapply(metsim_files, function(metsim_file) {
    metsim_df <- read_and_format_data(metsim_file, is_cis = FALSE)
    if (!is.null(metsim_df)) {
      tryCatch({
        harmonized <- harmonise_data(exposure_dat = cis_df, outcome_dat = metsim_df, action = 2)
        custom_log("INFO", paste("Harmonized", metsim_file, "with", nrow(harmonized), "rows."))
        return(harmonized)
      }, error = function(e) {
        custom_log("ERROR", paste("Error harmonizing data for", metsim_file, ":", e$message))
        return(NULL)
      })
    }
    return(NULL)
  })

  # Combine harmonized results
  harmonized_results <- harmonized_results[!sapply(harmonized_results, is.null)]
  
  if (length(harmonized_results) > 0) {
    final_results <- rbindlist(harmonized_results)
    output_file <- file.path(output_dir, paste0("harmonized_results_", protein, ".txt"))
    fwrite(final_results, file = output_file, sep = "\t")
    custom_log("INFO", paste("Saved harmonized results for", protein, "to", output_file))
  } else {
    custom_log("WARNING", paste("No harmonized results generated for", protein))
  }
}

cat("Script completed\n")

4.5 HyPrColoc of selected proteins and METSIM metabolites (at least one SNP with P<5E-8)

We conducted a colocalization analysis using HyPrColoc to explore shared genetic loci influencing selected proteins and METSIM metabolites filtered on at least one SNP in the METSIM files have a P<5E-8.

# Run interactively in Rstudio
# Load required libraries
library(data.table)
library(hyprcoloc)
library(openxlsx)
library(DT)
library(knitr)

# Define input and output directories
input_dir <- "/Users/charleenadams/harmonized_results_fixed/"
output_dir <- "/Users/charleenadams/harmonized_results_fixed/hyprcoloc_results"

# Ensure output directory exists
if (!dir.exists(output_dir)) {
  dir.create(output_dir, recursive = TRUE)
}

# Load phenotype mapping data
phenotype_map <- fread("/Users/charleenadams/metsim_files/phenotypes.tsv")

# Get list of all harmonized result files
harmonized_files <- list.files(input_dir, pattern = "^harmonized_results_.*\\.txt$", full.names = TRUE)

# Function to process each file
process_file <- function(file_path) {
  protein_name <- gsub("^harmonized_results_|\\.txt$", "", basename(file_path))  # Extract protein name
  cat("Processing file:", file_path, "\n")

  # Read the harmonized data
  data <- fread(file_path)
  
  # Remove all characters after and including the first `_`
  data$id.outcome <- sub("_.*", "", data$id.outcome)
  data$id.exposure <- sub("_.*", "", data$id.exposure)

  # Merge phenotype names and ensure correct naming format
  setDT(data)
  setDT(phenotype_map)
  data <- merge(data, phenotype_map[, .(phenocode, phenostring)], 
                by.x = "id.outcome", by.y = "phenocode", all.x = TRUE)
  setnames(data, "phenostring", "METSIM_name")
  data$METSIM_name <- gsub("\\*", "", data$METSIM_name)

  # Append the phenocode to the METSIM_name (e.g., "butyrylcarnitine (C4)_C100001054")
  data$METSIM_name <- paste0(data$METSIM_name, "_", data$id.outcome)

  # Ensure required columns exist
  required_columns <- c("SNP", "id.exposure", "METSIM_name", "beta.exposure", "beta.outcome", "se.exposure", "se.outcome")
  if (!all(required_columns %in% colnames(data))) {
    cat("Skipping", protein_name, "- missing required columns!\n")
    return(NULL)
  }

  # Reshape data to create matrices
  long_data <- rbindlist(list(
    data[, .(SNP, trait = id.exposure, beta = beta.exposure, se = se.exposure)],
    data[, .(SNP, trait = METSIM_name, beta = beta.outcome, se = se.outcome)]
  ), use.names = TRUE)
  long_data <- unique(long_data, by = c("SNP", "trait"))

  # Create beta and SE matrices
  betas_mat <- dcast(long_data, SNP ~ trait, value.var = "beta", fill = NA)
  ses_mat <- dcast(long_data, SNP ~ trait, value.var = "se", fill = NA)
  snps <- betas_mat$SNP

  rownames(betas_mat) <- betas_mat$SNP
  rownames(ses_mat) <- ses_mat$SNP
  betas_mat <- as.matrix(betas_mat[, -1, with = FALSE])
  ses_mat <- as.matrix(ses_mat[, -1, with = FALSE])
  rownames(betas_mat) <- snps
  rownames(ses_mat) <- snps

  # Filter matrices to remove missing data rows
  rows_with_na <- apply(betas_mat, 1, function(row) any(is.na(row))) | apply(ses_mat, 1, function(row) any(is.na(row)))
  betas_mat <- betas_mat[!rows_with_na, , drop = FALSE]
  ses_mat <- ses_mat[!rows_with_na, , drop = FALSE]

  # Run HyprColoc
  cat("Running HyprColoc for", protein_name, "...\n")

  betas <- betas_mat
  ses <- ses_mat
  traits_names <- colnames(betas_mat)
  snp_ids <- rownames(betas_mat)

  results_unipriorsFALSE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors = FALSE)
  results_unipriorsTRUE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors = TRUE)
  results_thresh_95 <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors = FALSE, reg.thresh = 0.95, align.thresh = 0.95)

  # Define output file path
  output_file <- file.path(output_dir, paste0(protein_name, "_hyprcoloc_results.xlsx"))

  # Initialize a new workbook
  wb <- createWorkbook()

  # Style for headers
  header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER", fgFill = "#4F81BD", textDecoration = "Bold", wrapText = TRUE)

  # Style for left-aligned titles
  title_style <- createStyle(fontSize = 14, halign = "LEFT", textDecoration = "Bold")

  # Style for wrapping text in traits column
  traits_style <- createStyle(wrapText = TRUE)

  # Function to add results with styled header and left-aligned title
  add_results_to_workbook <- function(workbook, result_data, sheet_name, title) {
    if (!is.null(result_data$results)) {
      addWorksheet(workbook, sheet_name)
      writeData(workbook, sheet_name, data.table(Title = title), startRow = 1, colNames = FALSE)
      addStyle(workbook, sheet_name, title_style, rows = 1, cols = 1, gridExpand = TRUE)
      
      # Count the number of traits in each row
      result_data$results$counts <- sapply(result_data$results$traits, function(x) {
        if (x == "None") {
          return(0)  # If no traits are listed
        } else {
          return(length(strsplit(x, ", ")[[1]]))
        }
      })
      
      writeData(workbook, sheet_name, result_data$results, startRow = 3)
      addStyle(workbook, sheet_name, header_style, rows = 3, cols = 1:ncol(result_data$results), gridExpand = TRUE)
      
      # Apply wrap text style to 'traits' column
      traits_col_index <- which(names(result_data$results) == "traits")
      if (length(traits_col_index) > 0) {
        addStyle(workbook, sheet_name, traits_style, rows = 4:(nrow(result_data$results) + 3), cols = traits_col_index, gridExpand = TRUE, stack = TRUE)
      }
      
      # Auto adjust column widths based on content
      for (i in 1:ncol(result_data$results)) {
        setColWidths(workbook, sheet = sheet_name, cols = i, widths = "auto")
      }
    }
  }

  # Function to add SNP scores with sorting and renaming
  add_snpscores_to_workbook <- function(workbook, snp_scores, sheet_name, title) {
    if (!is.null(snp_scores)) {
      addWorksheet(workbook, sheet_name)
      writeData(workbook, sheet_name, data.table(Title = title), startRow = 1, colNames = FALSE)
      addStyle(workbook, sheet_name, title_style, rows = 1, cols = 1, gridExpand = TRUE)
      
      # Convert to data.table, remove 'V2' column, add 'rsID', rename 'V1', and sort
      snp_scores_df <- as.data.table(snp_scores)
      snp_scores_df <- snp_scores_df[, .(V1)]  # Keep only V1
      snp_scores_df$rsID <- snp_ids
      setnames(snp_scores_df, "V1", "SNP_scores")
      snp_scores_df <- snp_scores_df[order(-SNP_scores)]  # Sort descending
      
      writeData(workbook, sheet_name, snp_scores_df, startRow = 3)
      addStyle(workbook, sheet_name, header_style, rows = 3, cols = 1:ncol(snp_scores_df), gridExpand = TRUE)
      
      # Auto adjust column widths based on content
      for (i in 1:ncol(snp_scores_df)) {
        setColWidths(workbook, sheet = sheet_name, cols = i, widths = "auto")
      }
    }
  }

  # Save results
  add_results_to_workbook(wb, results_unipriorsFALSE, paste0(protein_name, "_Res_Unip_F"), paste("HyprColoc of", protein_name, "(Uniform Priors = FALSE)"))
  add_results_to_workbook(wb, results_unipriorsTRUE, paste0(protein_name, "_Res_Unip_T"), paste("HyprColoc of", protein_name, "(Uniform Priors = TRUE)"))
  add_results_to_workbook(wb, results_thresh_95, paste0(protein_name, "_Res_Unip_F_95"), paste("HyprColoc of", protein_name, "(Uniform Priors = FALSE, Thresholds = 0.95)"))

  add_snpscores_to_workbook(wb, results_unipriorsFALSE$snpscores, paste0(protein_name, "_Scores_Unip_F"), paste("HyprColoc SNP Scores for", protein_name, "(Uniform Priors = FALSE)"))
  add_snpscores_to_workbook(wb, results_unipriorsTRUE$snpscores, paste0(protein_name, "_Scores_Unip_T"), paste("HyprColoc SNP Scores for", protein_name, "(Uniform Priors = TRUE)"))
  add_snpscores_to_workbook(wb, results_thresh_95$snpscores, paste0(protein_name, "_Scores_Unip_F_95"), paste("HyprColoc SNP Scores for", protein_name, "(Uniform Priors = FALSE, Thresholds = 0.95)"))

  saveWorkbook(wb, output_file, overwrite = TRUE)
  cat("Results saved to:", output_file, "\n")
}

# Process all files
for (file_path in harmonized_files) {
  process_file(file_path)
}

cat("All HyprColoc analyses completed!\n")

5 Benchmarking loop of selected proteins on METISM metabolites (P<5E-6 threshold)

5.1 Filter METSIM files on at least one SNP with P<5E-6

# ran interactively in Rstudio

# Load necessary libraries
library(data.table)

# Parent directory where the protein-specific folders are located
parent_directory <- "/Users/charleenadams/metsim_files"

# Explicit list of valid subdirectories to process
valid_subdirectories <- c("LEP_fixed", "FABP4_fixed", "MAP4K5_fixed", "IL1RL1_fixed", 
                          "REN_fixed", "IL6_fixed", "LRP11_fixed", "ASAH1_fixed", "FAP_fixed", 
                          "APOE_fixed", "ACY1_fixed", "AGXT_fixed", "KLKB1_fixed", "APOL1_fixed", 
                          "NPPB_fixed", "SIGLEC9_fixed", "LEPR_fixed", "UMOD_fixed", "CD36_fixed", 
                          "ANGPTL3_fixed", "WFIKKN2_fixed", "IL6R_fixed", "LPL_fixed", "PDGFRA_fixed")

# Get all valid subdirectories
subdirectories <- file.path(parent_directory, valid_subdirectories)
subdirectories <- subdirectories[dir.exists(subdirectories)]

# Process each subdirectory
for (subdir in subdirectories) {
  
  # Extract the protein name from the directory name
  protein_name <- basename(subdir)
  
  # Define the subdirectory for significant p-values
  significant_subdir <- file.path(subdir, paste0("p_val_5E6_metsim_cis_", protein_name))
  
  # Ensure the subdirectory exists
  if (!dir.exists(significant_subdir)) {
    dir.create(significant_subdir)
    cat("Created directory:", significant_subdir, "\n")
  }
  
  # Get all .txt files in the subdirectory
  files <- list.files(path = subdir, pattern = "\\.txt$", full.names = TRUE)
  
  # Process each file
  for (file in files) {
    cat("Processing file:", basename(file), "in", protein_name, "\n")
    
    # Read the file
    data <- fread(file)
    
    # Check if any p-value is less than 5e-8
    if (any(data$pval < 5e-6, na.rm = TRUE)) {
      cat("File", basename(file), "has pval < 5e-6\n")
      file.copy(from = file, to = file.path(significant_subdir, basename(file)))
      cat("Copied", basename(file), "to", significant_subdir, "\n")
    } else {
      cat("No match found in", basename(file), "\n")
    }
  }
}

cat("Script completed\n")

5.2 Loop to harmonize selected cis-region proteins and METSIM metabolites with at least on rsID with P<5E-6

# Ran interactively in rstudio
# Could do but DIDN'T:
# nohup Rscript ./harmonize_loop24_metsim_pval_5E6_subsets_parallel.R > harmonize_loop24_metsim_pval_5E6_subsets_parallel.log 2>&1 &
# ps -p 17652 -o stat,etime

# Do this: make directory
# mkdir pv5E6

# Load required libraries
library(data.table)
library(TwoSampleMR)

# Custom logging function
custom_log <- function(level, message) {
  msg <- paste(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), level, message)
  cat(msg, "\n", file = "benchmark_harm_twosampleMR.log", append = TRUE)
  cat(msg, "\n")  # Also print to console
}

# Function to read and format data for TwoSampleMR, including chromosome and position
read_and_format_data <- function(file_path, is_cis) {
  tryCatch({
    # Read file and convert to data.frame
    df <- fread(file_path, sep = "\t", header = TRUE)
    df <- as.data.frame(df)  # Ensure it is a data.frame
    
    # Format data for TwoSampleMR
    formatted_df <- format_data(df, 
                                type = ifelse(is_cis, "exposure", "outcome"), 
                                snp_col = ifelse(is_cis, "rsid", "rsids"), 
                                beta_col = ifelse(is_cis, "BETA", "beta"), 
                                se_col = ifelse(is_cis, "SE", "sebeta"), 
                                effect_allele_col = ifelse(is_cis, "ALLELE1", "alt"), 
                                other_allele_col = ifelse(is_cis, "ALLELE0", "ref"), 
                                eaf_col = ifelse(is_cis, "A1FREQ", "maf"), 
                                pval_col = ifelse(is_cis, "LOG10P", "pval"),
                                chr_col = ifelse(is_cis, "CHROM", "chrom"), 
                                pos_col = ifelse(is_cis, "POS38", "pos"),
                                phenotype_col = basename(file_path))
    
    # Set exposure or outcome names
    if (is_cis) {
      formatted_df$id.exposure <- basename(file_path)
    } else {
      formatted_df$id.outcome <- basename(file_path)
    }
    
    return(formatted_df)
  }, error = function(e) {
    custom_log("ERROR", paste("Error reading file", file_path, ":", e$message))
    return(NULL)
  })
}

# Define directories
cis_region_dir <- "/Users/charleenadams/cis_ukbppp/cleaned/"
metsim_dir <- "/Users/charleenadams/metsim_files"
output_dir <- "/Users/charleenadams/harmonized_results_fixed/pv5E6"

# List of valid protein subdirectories (ensures only the right ones are processed)
valid_proteins <- c("LEP", "FABP4", "MAP4K5", "IL1RL1", "REN", "IL6", "LRP11", "ASAH1", "FAP",
                    "APOE", "ACY1", "AGXT", "KLKB1", "APOL1", "NPPB", "SIGLEC9", "LEPR",
                    "UMOD", "CD36", "ANGPTL3", "WFIKKN2", "IL6R", "LPL", "PDGFRA")

# Iterate through each protein's directory
for (protein in valid_proteins) {
  
  # Define paths
  protein_fixed_dir <- file.path(metsim_dir, paste0(protein, "_fixed"))
  pval_subdir <- file.path(protein_fixed_dir, paste0("p_val_5E6_metsim_cis_", protein, "_fixed"))
  cis_file <- file.path(cis_region_dir, paste0(protein, "_cis_region_cleaned.gz"))
  
  # Check if pval_subdir exists before proceeding
  if (!dir.exists(pval_subdir)) {
    custom_log("WARNING", paste("Skipping", pval_subdir, "- Directory does not exist."))
    next
  }

  # Get list of METSIM subset files
  metsim_files <- list.files(pval_subdir, pattern = "\\.txt$", full.names = TRUE)
  
  if (length(metsim_files) == 0) {
    custom_log("WARNING", paste("No files found in", pval_subdir, "- Skipping."))
    next
  }
  
  # Read and format the cis-region file
  if (!file.exists(cis_file)) {
    custom_log("ERROR", paste("Missing cis-region file:", cis_file, "- Skipping."))
    next
  }
  
  custom_log("INFO", paste("Processing", protein, "with cis-region file:", cis_file))
  
  cis_df <- read_and_format_data(cis_file, is_cis = TRUE)
  if (is.null(cis_df)) {
    custom_log("ERROR", paste("Failed to read cis-region file for", protein, "- Skipping."))
    next
  }

  # Harmonize each METSIM file with the cis-region file
  harmonized_results <- lapply(metsim_files, function(metsim_file) {
    metsim_df <- read_and_format_data(metsim_file, is_cis = FALSE)
    if (!is.null(metsim_df)) {
      tryCatch({
        harmonized <- harmonise_data(exposure_dat = cis_df, outcome_dat = metsim_df, action = 2)
        custom_log("INFO", paste("Harmonized", metsim_file, "with", nrow(harmonized), "rows."))
        return(harmonized)
      }, error = function(e) {
        custom_log("ERROR", paste("Error harmonizing data for", metsim_file, ":", e$message))
        return(NULL)
      })
    }
    return(NULL)
  })

  # Combine harmonized results
  harmonized_results <- harmonized_results[!sapply(harmonized_results, is.null)]
  
  if (length(harmonized_results) > 0) {
    final_results <- rbindlist(harmonized_results)
    output_file <- file.path(output_dir, paste0("harmonized_results_5E6_", protein, ".txt"))
    fwrite(final_results, file = output_file, sep = "\t")
    custom_log("INFO", paste("Saved harmonized results for", protein, "to", output_file))
  } else {
    custom_log("WARNING", paste("No harmonized results generated for", protein))
  }
}

cat("Script completed\n")

5.3 HyPrColoc of selected proteins and METSIM metabolites with at least one SNP with P<5E-6

# Run interactively in Rstudio
# Load required libraries

# create new directory for the harmzonized pv_5E6 files
# mv *_pv_5E6* /Users/charleenadams/harmonized_results_fixed/pv5E6

library(data.table)
library(hyprcoloc)
library(openxlsx)
library(DT)
library(knitr)

# Define input and output directories
input_dir <- "/Users/charleenadams/harmonized_results_fixed/pv5E6"
output_dir <- "/Users/charleenadams/harmonized_results_fixed/hyprcoloc_results_pval_5E6"

# Ensure output directory exists
if (!dir.exists(output_dir)) {
  dir.create(output_dir, recursive = TRUE)
}

# Load phenotype mapping data
phenotype_map <- fread("/Users/charleenadams/metsim_files/phenotypes.tsv")

# Get list of all harmonized result files
harmonized_files <- list.files(input_dir, pattern = "^harmonized_results_.*\\.txt$", full.names = TRUE)

# Function to process each file
process_file <- function(file_path) {
  protein_name <- gsub("^harmonized_results_|\\.txt$", "", basename(file_path))  # Extract protein name
  cat("Processing file:", file_path, "\n")

  # Read the harmonized data
  data <- fread(file_path)
  
  # Remove all characters after and including the first `_`
  data$id.outcome <- sub("_.*", "", data$id.outcome)
  data$id.exposure <- sub("_.*", "", data$id.exposure)

  # Merge phenotype names and ensure correct naming format
  setDT(data)
  setDT(phenotype_map)
  data <- merge(data, phenotype_map[, .(phenocode, phenostring)], 
                by.x = "id.outcome", by.y = "phenocode", all.x = TRUE)
  setnames(data, "phenostring", "METSIM_name")
  data$METSIM_name <- gsub("\\*", "", data$METSIM_name)

  # Append the phenocode to the METSIM_name (e.g., "butyrylcarnitine (C4)_C100001054")
  data$METSIM_name <- paste0(data$METSIM_name, "_", data$id.outcome)

  # Ensure required columns exist
  required_columns <- c("SNP", "id.exposure", "METSIM_name", "beta.exposure", "beta.outcome", "se.exposure", "se.outcome")
  if (!all(required_columns %in% colnames(data))) {
    cat("Skipping", protein_name, "- missing required columns!\n")
    return(NULL)
  }

  # Reshape data to create matrices
  long_data <- rbindlist(list(
    data[, .(SNP, trait = id.exposure, beta = beta.exposure, se = se.exposure)],
    data[, .(SNP, trait = METSIM_name, beta = beta.outcome, se = se.outcome)]
  ), use.names = TRUE)
  long_data <- unique(long_data, by = c("SNP", "trait"))

  # Create beta and SE matrices
  betas_mat <- dcast(long_data, SNP ~ trait, value.var = "beta", fill = NA)
  ses_mat <- dcast(long_data, SNP ~ trait, value.var = "se", fill = NA)
  snps <- betas_mat$SNP

  rownames(betas_mat) <- betas_mat$SNP
  rownames(ses_mat) <- ses_mat$SNP
  betas_mat <- as.matrix(betas_mat[, -1, with = FALSE])
  ses_mat <- as.matrix(ses_mat[, -1, with = FALSE])
  rownames(betas_mat) <- snps
  rownames(ses_mat) <- snps

  # Filter matrices to remove missing data rows
  rows_with_na <- apply(betas_mat, 1, function(row) any(is.na(row))) | apply(ses_mat, 1, function(row) any(is.na(row)))
  betas_mat <- betas_mat[!rows_with_na, , drop = FALSE]
  ses_mat <- ses_mat[!rows_with_na, , drop = FALSE]

  # Run HyprColoc
  cat("Running HyprColoc for", protein_name, "...\n")

  betas <- betas_mat
  ses <- ses_mat
  traits_names <- colnames(betas_mat)
  snp_ids <- rownames(betas_mat)

  results_unipriorsFALSE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors = FALSE)
  results_unipriorsTRUE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors = TRUE)
  results_thresh_95 <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors = FALSE, reg.thresh = 0.95, align.thresh = 0.95)

  # Define output file path
  output_file <- file.path(output_dir, paste0(protein_name, "_hyprcoloc_results.xlsx"))

  # Initialize a new workbook
  wb <- createWorkbook()

  # Style for headers
  header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER", fgFill = "#4F81BD", textDecoration = "Bold", wrapText = TRUE)

  # Style for left-aligned titles
  title_style <- createStyle(fontSize = 14, halign = "LEFT", textDecoration = "Bold")

  # Style for wrapping text in traits column
  traits_style <- createStyle(wrapText = TRUE)

  # Function to add results with styled header and left-aligned title
  add_results_to_workbook <- function(workbook, result_data, sheet_name, title) {
    if (!is.null(result_data$results)) {
      addWorksheet(workbook, sheet_name)
      writeData(workbook, sheet_name, data.table(Title = title), startRow = 1, colNames = FALSE)
      addStyle(workbook, sheet_name, title_style, rows = 1, cols = 1, gridExpand = TRUE)
      
      # Count the number of traits in each row
      result_data$results$counts <- sapply(result_data$results$traits, function(x) {
        if (x == "None") {
          return(0)  # If no traits are listed
        } else {
          return(length(strsplit(x, ", ")[[1]]))
        }
      })
      
      writeData(workbook, sheet_name, result_data$results, startRow = 3)
      addStyle(workbook, sheet_name, header_style, rows = 3, cols = 1:ncol(result_data$results), gridExpand = TRUE)
      
      # Apply wrap text style to 'traits' column
      traits_col_index <- which(names(result_data$results) == "traits")
      if (length(traits_col_index) > 0) {
        addStyle(workbook, sheet_name, traits_style, rows = 4:(nrow(result_data$results) + 3), cols = traits_col_index, gridExpand = TRUE, stack = TRUE)
      }
      
      # Auto adjust column widths based on content
      for (i in 1:ncol(result_data$results)) {
        setColWidths(workbook, sheet = sheet_name, cols = i, widths = "auto")
      }
    }
  }

  # Function to add SNP scores with sorting and renaming
  add_snpscores_to_workbook <- function(workbook, snp_scores, sheet_name, title) {
    if (!is.null(snp_scores)) {
      addWorksheet(workbook, sheet_name)
      writeData(workbook, sheet_name, data.table(Title = title), startRow = 1, colNames = FALSE)
      addStyle(workbook, sheet_name, title_style, rows = 1, cols = 1, gridExpand = TRUE)
      
      # Convert to data.table, remove 'V2' column, add 'rsID', rename 'V1', and sort
      snp_scores_df <- as.data.table(snp_scores)
      snp_scores_df <- snp_scores_df[, .(V1)]  # Keep only V1
      snp_scores_df$rsID <- snp_ids
      setnames(snp_scores_df, "V1", "SNP_scores")
      snp_scores_df <- snp_scores_df[order(-SNP_scores)]  # Sort descending
      
      writeData(workbook, sheet_name, snp_scores_df, startRow = 3)
      addStyle(workbook, sheet_name, header_style, rows = 3, cols = 1:ncol(snp_scores_df), gridExpand = TRUE)
      
      # Auto adjust column widths based on content
      for (i in 1:ncol(snp_scores_df)) {
        setColWidths(workbook, sheet = sheet_name, cols = i, widths = "auto")
      }
    }
  }

  # Save results
  add_results_to_workbook(wb, results_unipriorsFALSE, paste0(protein_name, "_Res_Unip_F"), paste("HyprColoc of", protein_name, "(Uniform Priors = FALSE)"))
  add_results_to_workbook(wb, results_unipriorsTRUE, paste0(protein_name, "_Res_Unip_T"), paste("HyprColoc of", protein_name, "(Uniform Priors = TRUE)"))
  add_results_to_workbook(wb, results_thresh_95, paste0(protein_name, "_Res_Unip_F_95"), paste("HyprColoc of", protein_name, "(Uniform Priors = FALSE, Thresholds = 0.95)"))

  add_snpscores_to_workbook(wb, results_unipriorsFALSE$snpscores, paste0(protein_name, "_Scores_Unip_F"), paste("HyprColoc SNP Scores for", protein_name, "(Uniform Priors = FALSE)"))
  add_snpscores_to_workbook(wb, results_unipriorsTRUE$snpscores, paste0(protein_name, "_Scores_Unip_T"), paste("HyprColoc SNP Scores for", protein_name, "(Uniform Priors = TRUE)"))
  add_snpscores_to_workbook(wb, results_thresh_95$snpscores, paste0(protein_name, "_Scores_Unip_F_95"), paste("HyprColoc SNP Scores for", protein_name, "(Uniform Priors = FALSE, Thresholds = 0.95)"))

  saveWorkbook(wb, output_file, overwrite = TRUE)
  cat("Results saved to:", output_file, "\n")
}

# Process all files
for (file_path in harmonized_files) {
  process_file(file_path)
}

cat("All HyprColoc analyses completed!\n")

6 Comparison of HyPrColoc results for selected proteins using P<5E-8 and P<5E-6 filters

6.1 Script for individual Excel files of results

Produces individual Excel files for the proteins with three models (uniform priors = FALSE, uniform priors = TRUE, and uniform priors = FALSE and regional and alignment thresholds = 0.95)

# Ran interactively in Rstudio

# Load required libraries
library(openxlsx)
library(data.table)

# Define directories
dir_p5e8 <- "/Users/charleenadams/harmonized_results_fixed/hyprcoloc_results"
dir_p5e6 <- "/Users/charleenadams/harmonized_results_fixed/hyprcoloc_results_pval_5E6"
output_dir <- "/Users/charleenadams/harmonized_results_fixed/hyprcoloc_comparisons_pval_5E8_5E6"

# Ensure output directory exists
if (!dir.exists(output_dir)) {
  dir.create(output_dir, recursive = TRUE)
}

# List files in each directory
files_p5e8 <- list.files(dir_p5e8, pattern = "_hyprcoloc_results.xlsx$", full.names = TRUE)
files_p5e6 <- list.files(dir_p5e6, pattern = "_hyprcoloc_results.xlsx$", full.names = TRUE)

# Extract protein names
protein_names_p5e8 <- sub("_hyprcoloc_results.xlsx", "", basename(files_p5e8))
protein_names_p5e6 <- sub("5E6_", "", sub("_hyprcoloc_results.xlsx", "", basename(files_p5e6)))

# Identify common proteins
common_proteins <- intersect(protein_names_p5e8, protein_names_p5e6)

# Function to process HyprColoc results
process_hyprcoloc_results <- function(file_path) {
  wb <- loadWorkbook(file_path)
  sheet_names <- names(wb)
  
  # Identify sheets that contain results
  result_sheets <- sheet_names[grep("_Res_Unip_", sheet_names)]
  
  if (length(result_sheets) == 0) {
    return(NULL)
  }
  
  results_list <- list()
  
  for (sheet in result_sheets) {
    data <- read.xlsx(file_path, sheet = sheet, startRow = 3)  # Skip first 2 rows
    
    if (!all(c("iteration", "traits", "posterior_prob", "posterior_explained_by_snp", "candidate_snp", "regional_prob") %in% colnames(data))) {
      next
    }
    
    # Clean and process data
    setDT(data)
    data <- data[, .(iteration, traits, candidate_snp, posterior_prob, posterior_explained_by_snp, regional_prob)]
    
    # Remove square brackets from traits
    data[, traits := gsub("\\[.*?\\]", "", traits)]
    
    # Count traits
    data[, num_traits := sapply(strsplit(traits, ", "), length)]
    
    # Wrap traits every 2 per line for readability
    data[, traits := sapply(strsplit(traits, ", "), function(x) paste(x, collapse = ifelse(length(x) > 2, "\n", ", ")))]
    
    results_list[[sheet]] <- data
  }
  
  return(results_list)
}

# Function to compare two datasets
compare_hyprcoloc_results <- function(protein, file1, file2) {
  data_p5e8 <- process_hyprcoloc_results(file1)
  data_p5e6 <- process_hyprcoloc_results(file2)
  
  if (is.null(data_p5e8) | is.null(data_p5e6)) {
    cat("Skipping", protein, "- No comparable HyprColoc results found.\n")
    return(NULL)
  }
  
  comparison_results <- list()
  
  for (sheet in names(data_p5e8)) {
    sheet_p5e6 <- paste0("5E6_", sheet)
    
    if (!(sheet_p5e6 %in% names(data_p5e6))) {
      next
    }
    
    df1 <- data_p5e8[[sheet]]
    df2 <- data_p5e6[[sheet_p5e6]]
    
    # Merge results by iteration
    merged_df <- merge(df1, df2, by = "iteration", suffixes = c("_p5e8", "_p5e6"), all = TRUE)
    
    # Compare traits
    merged_df[, match_traits := traits_p5e8 == traits_p5e6]
    
    # Compare SNP and posterior probability
    merged_df[, match_snp := candidate_snp_p5e8 == candidate_snp_p5e6]
    merged_df[, posterior_prob_diff := posterior_explained_by_snp_p5e8 - posterior_explained_by_snp_p5e6]
    
    # Add regional_prob columns
    merged_df[, regional_prob_diff := regional_prob_p5e8 - regional_prob_p5e6]
    
    comparison_results[[sheet]] <- merged_df
  }
  
  return(comparison_results)
}

# Formatting: Dark lavender header, bold white font, center-aligned columns
header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER",
                            fgFill = "#69306D", textDecoration = "Bold", wrapText = TRUE)

centered_style <- createStyle(halign = "CENTER")
wrapped_text_style <- createStyle(wrapText = TRUE)

# Loop through all common proteins and compare results
for (protein in common_proteins) {
  file1 <- file.path(dir_p5e8, paste0(protein, "_hyprcoloc_results.xlsx"))
  file2 <- file.path(dir_p5e6, paste0("5E6_", protein, "_hyprcoloc_results.xlsx"))
  
  comparison <- compare_hyprcoloc_results(protein, file1, file2)
  
  if (!is.null(comparison)) {
    output_file <- file.path(output_dir, paste0(protein, "_hyprcoloc_comparison.xlsx"))
    
    wb <- createWorkbook()
    
    for (sheet in names(comparison)) {
      addWorksheet(wb, sheet)
      writeData(wb, sheet, comparison[[sheet]], startRow = 2, colNames = TRUE)
      
      # Apply header formatting
      addStyle(wb, sheet, header_style, rows = 2, cols = 1:ncol(comparison[[sheet]]), gridExpand = TRUE)
      
      # Apply center alignment
      addStyle(wb, sheet, centered_style, rows = 3:(nrow(comparison[[sheet]]) + 2), cols = 1:ncol(comparison[[sheet]]), gridExpand = TRUE)
      
      # Set column width
      setColWidths(wb, sheet, cols = 1:ncol(comparison[[sheet]]), widths = 25)
      
      # Apply wrapped text for trait columns
      trait_cols <- grep("traits", names(comparison[[sheet]]))
      if (length(trait_cols) > 0) {
        addStyle(wb, sheet, wrapped_text_style, rows = 3:(nrow(comparison[[sheet]]) + 2), cols = trait_cols, gridExpand = TRUE)
      }
    }
    
    saveWorkbook(wb, output_file, overwrite = TRUE)
    cat("Saved comparison for", protein, "to", output_file, "\n")
  }
}

cat("All comparisons completed!\n")

6.2 Script for three summary Excel files comparing results: one for each uniform prior configuration

This script collects the results for the proteins from the results just above and puts them in tabs by name of protein.

# Load required libraries
library(openxlsx)
library(data.table)

# Define directories
comparison_dir <- "/Users/charleenadams/harmonized_results_fixed/hyprcoloc_comparisons_pval_5E8_5E6"

# Define output file names for each tab
output_top_file_F <- "/Users/charleenadams/coloc_clin_mol/scripts/hypr_results_deliver/www/hyprcoloc_comparisons_unipF_top.xlsx"
output_top_file_T <- "/Users/charleenadams/coloc_clin_mol/scripts/hypr_results_deliver/www/hyprcoloc_comparisons_unipT_top.xlsx"
output_top_file_F_95 <- "/Users/charleenadams/coloc_clin_mol/scripts/hypr_results_deliver/www/hyprcoloc_comparisons_unipF_95_top.xlsx"

# List all comparison files, excluding temporary files
comparison_files <- list.files(comparison_dir, pattern = "_hyprcoloc_comparison.xlsx$", full.names = TRUE)
comparison_files <- comparison_files[!grepl("~$", comparison_files)]

# Function to process each tab and create separate results files
process_comparison_files <- function(comparison_files, tab_index, output_file) {
  wb <- createWorkbook()
  
  # Header Formatting: Dark Lavender, Bold White Font
  header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER",
                              fgFill = "#69306D", textDecoration = "Bold", wrapText = TRUE)
  
  centered_style <- createStyle(halign = "CENTER")
  wrapped_text_style <- createStyle(wrapText = TRUE)
  
  summary_data <- list()
  
  ### 🔹 First, Create the "Top_Hits" Worksheet
  addWorksheet(wb, "Top_Hits")

  for (file in comparison_files) {
    protein_name <- sub("_hyprcoloc_comparison.xlsx", "", basename(file))
    
    # Try loading the workbook, catch errors
    wb_comp <- tryCatch({
      loadWorkbook(file)
    }, error = function(e) {
      message(paste("Skipping", file, "- Could not load workbook:", e$message))
      return(NULL)
    })
    
    if (is.null(wb_comp)) {
      next
    }
    
    sheet_names <- names(wb_comp)

    # Check if the tab index is valid
    if (length(sheet_names) >= tab_index) {
      sheet <- sheet_names[tab_index]
      
      # Read data directly from the specified tab of the Excel file
      data <- tryCatch({
        read.xlsx(file, sheet = sheet, colNames = TRUE, detectDates = FALSE)
      }, error = function(e) {
        message(paste("Skipping sheet:", sheet, "in", file, "- Error reading sheet:", e$message))
        return(NULL)
      })
      
      if (is.null(data)) {
        message(paste("Data is NULL after reading the sheet", sheet, "from", file))
        next
      }
      
      # Convert to data.table
      setDT(data)
      
      # Ensure required columns exist before filtering
      required_cols <- c("traits_p5e8", "posterior_prob_p5e8", "posterior_explained_by_snp_p5e8",
                         "regional_prob_p5e8", "candidate_snp_p5e8",
                         "traits_p5e6", "posterior_prob_p5e6", "posterior_explained_by_snp_p5e6",
                         "regional_prob_p5e6", "candidate_snp_p5e6")
      
      missing_cols <- setdiff(required_cols, names(data))
      if (length(missing_cols) > 0) {
        message(paste("Skipping summary extraction for", protein_name, "due to missing columns:", paste(missing_cols, collapse=", ")))
        next
      }
      
      # Filter traits where posterior_prob, regional_prob, and posterior_explained_by_snp are all >= 0.8
      top_results_p5e8 <- data[posterior_prob_p5e8 >= 0.8 & 
                               posterior_explained_by_snp_p5e8 >= 0.8 & 
                               regional_prob_p5e8 >= 0.8,
                               .(protein = protein_name,
                                 traits = traits_p5e8,
                                 pval_threshold = "5E-8",
                                 candidate_snp = candidate_snp_p5e8,
                                 posterior_probability = posterior_prob_p5e8,
                                 posterior_prob_explained_by_snp = posterior_explained_by_snp_p5e8,
                                 regional_prob = regional_prob_p5e8,
                                 includes_protein = ifelse(grepl(protein_name, traits_p5e8, ignore.case = TRUE), "Yes", "No"))]
      
      top_results_p5e6 <- data[posterior_prob_p5e6 >= 0.8 & 
                               posterior_explained_by_snp_p5e6 >= 0.8 & 
                               regional_prob_p5e6 >= 0.8,
                               .(protein = protein_name,
                                 traits = traits_p5e6,
                                 pval_threshold = "5E-6",
                                 candidate_snp = candidate_snp_p5e6,
                                 posterior_probability = posterior_prob_p5e6,
                                 posterior_prob_explained_by_snp = posterior_explained_by_snp_p5e6,
                                 regional_prob = regional_prob_p5e6,
                                 includes_protein = ifelse(grepl(protein_name, traits_p5e6, ignore.case = TRUE), "Yes", "No"))]
      
      # Combine results
      top_results <- rbind(top_results_p5e8, top_results_p5e6)
      
      if (nrow(top_results) > 0) {
        summary_data[[protein_name]] <- top_results
      }
      
      # Add worksheet for this protein
      addWorksheet(wb, protein_name)
      writeData(wb, protein_name, data, startRow = 1, colNames = TRUE)
      
      # Apply formatting
      addStyle(wb, protein_name, header_style, rows = 1, cols = 1:ncol(data), gridExpand = TRUE)
      addStyle(wb, protein_name, centered_style, rows = 2:(nrow(data) + 1), cols = 1:ncol(data), gridExpand = TRUE)
      
      # Wrap text for trait columns
      trait_cols <- grep("traits", names(data))
      if (length(trait_cols) > 0) {
        addStyle(wb, protein_name, wrapped_text_style, rows = 2:(nrow(data) + 1), cols = trait_cols, gridExpand = TRUE)
      }
      
      # Set column width
      setColWidths(wb, protein_name, cols = 1:ncol(data), widths = 25)
    } else {
      message(paste("Tab index", tab_index, "does not exist for", file))
    }
  }
  
  # Populate "Top_Hits" tab
  if (length(summary_data) > 0) {
    summary_df <- rbindlist(summary_data, use.names = TRUE, fill = TRUE)
    writeData(wb, "Top_Hits", summary_df, startRow = 1, colNames = TRUE)
    
    # Apply formatting to summary tab
    addStyle(wb, "Top_Hits", header_style, rows = 1, cols = 1:ncol(summary_df), gridExpand = TRUE)
    addStyle(wb, "Top_Hits", centered_style, rows = 2:(nrow(summary_df) + 1), cols = 1:ncol(summary_df), gridExpand = TRUE)
    
    # Wrap text for trait column
    addStyle(wb, "Top_Hits", wrapped_text_style, rows = 2:(nrow(summary_df) + 1), cols = 2, gridExpand = TRUE)
    
    # Set column width
    setColWidths(wb, "Top_Hits", cols = 1:ncol(summary_df), widths = 25)
  }

  # Save final summary file
  saveWorkbook(wb, output_file, overwrite = TRUE)
  
  cat("Top hits Excel file created at:", output_file, "\n")
}

# Process for each tab
process_comparison_files(comparison_files, 1, output_top_file_F)
process_comparison_files(comparison_files, 2, output_top_file_T)
process_comparison_files(comparison_files, 3, output_top_file_F_95)

7 Shiny app to deliver results

Rpubs doesn’t let me share files, but this can be done with rsconnect and a shiny app.

Go here for the results: https://yodamendel.shinyapps.io/hypr_results_deliver/

# Run with: 
# rsconnect::deployApp('/Users/charleenadams/coloc_clin_mol/scripts/hypr_results_deliver')
library(shiny)
library(DT)
library(openxlsx)
library(data.table)

# Define file paths relative to www/
file1 <- "www/hyprcoloc_comparisons_unipT_top.xlsx"
file2 <- "www/hyprcoloc_comparisons_unipF_95_top.xlsx"
file3 <- "www/hyprcoloc_comparisons_unipF_top.xlsx"

# Function to read "Top_Hits" sheet from Excel
read_top_hits <- function(file_path) {
  if (!file.exists(file_path)) {
    return(NULL)
  }

  wb <- loadWorkbook(file_path)
  sheet_names <- names(wb)

  if (!"Top_Hits" %in% sheet_names) {
    return(NULL)
  }

  data <- read.xlsx(file_path, sheet = "Top_Hits", colNames = TRUE)

  if (nrow(data) == 0) {
    return(NULL)
  }

  setDT(data)

  # Remove unwanted columns
  data <- data[, !c("regional_prob_p5e8", "regional_prob_p5e6", "Uniform_Prior"), with = FALSE]

  return(data)
}

# Read data
data_F <- read_top_hits(file3)
data_T <- read_top_hits(file1)
data_F_95 <- read_top_hits(file2)

# UI
ui <- fluidPage(

  # Professional Title Section
  div(style = "text-align: center; margin-bottom: 20px;",
      h1("HyPrColoc Results"),
      h4("Colocalization Analysis of UK Biobank Proteomics & METSIM Metabolomics"),
      tags$a(href = "https://www.bidmc.org/research/research-by-department/medicine/cardiovascular-medicine/personal-genomics-and-cardiometabolic-disease",
             target = "_blank",
             "Our Lab",
             style = "font-size: 18px; color: #0033cc; text-decoration: none; font-weight: bold;")
  ),

  sidebarLayout(

    # Reduce sidebar width to make tables larger
    sidebarPanel(
      width = 3,  # Reduced from default (4) to 3
      h3("Instructions"),  # Match font size with Interactive Tables
      p("Click the buttons below to download full results.", br(),
        "Each file will have today's date appended."),
      hr(),
      h3("Download Results"),
      downloadButton("download_file1", "Download hyprcoloc_comparisons_unipT"),
      br(), br(),
      downloadButton("download_file2", "Download hyprcoloc_comparisons_unipF_95"),
      br(), br(),
      downloadButton("download_file3", "Download hyprcoloc_comparisons_unipF"),
      hr(),

      # OBJECTIVE BOX
      div(style = "border: 1px solid #ccc; padding: 10px; background-color: #f9f9f9;",
          strong("Objective"),
          p("The primary objective of this statistical analysis is to develop and implement a pipeline for performing
            colocalization analysis using HyPrColoc on molecular and clinical phenotypes. This includes proteomics
            (UK Biobank Pharma Proteomics Project - UKB-PPP), metabolomics (METSIM and Nightingale), and clinical disease
            traits such as coronary heart disease (CHD). The results presented here are for selected proteins and METSIM
            metabolites within a 1MB region around the transcription start site (TSS) of the respective proteins.")
      )
    ),

    # Increase the main panel width
    mainPanel(
      width = 9,  # Increased from default (8) to 9
      h3("Interactive Tables"),
      h5("Selected results (posterior probability & posterior probability explained by SNP ≥ 0.8)"),

      tabsetPanel(
        tabPanel("Uniform Priors = FALSE",
                 DTOutput("table_F")),
        tabPanel("Uniform Priors = TRUE",
                 DTOutput("table_T")),
        tabPanel("Uniform Priors = FALSE; thresholds = 0.95",
                 DTOutput("table_F_95"))
      )
    )
  ),

  # BIDMC Logo - Bottom Left of Page
  div(style = "position: fixed; bottom: 10px; left: 10px;",
      img(src = "BIDMC_HMS_Stacked-LockUp.png", width = "125px")  # Adjust width as needed
  )
)

# Server
server <- function(input, output) {

  # Download Handlers
  output$download_file1 <- downloadHandler(
    filename = function() { paste0("hyprcoloc_comparisons_unipT_", Sys.Date(), ".xlsx") },
    content = function(file) { file.copy(file1, file) }
  )

  output$download_file2 <- downloadHandler(
    filename = function() { paste0("hyprcoloc_comparisons_unipF_95_", Sys.Date(), ".xlsx") },
    content = function(file) { file.copy(file2, file) }
  )

  output$download_file3 <- downloadHandler(
    filename = function() { paste0("hyprcoloc_comparisons_unipF_", Sys.Date(), ".xlsx") },
    content = function(file) { file.copy(file3, file) }
  )

  # Define table styling with alternating light lavender rows
  datatable_options <- list(
    dom = 'Blfrtip',  # Adds a search button
    pageLength = 25,  # Default to 25 rows per page
    scrollX = TRUE,   # Enable horizontal scrolling
    autoWidth = TRUE, # Auto adjust column width
    rowCallback = JS(
      "function(row, data, index) {
        if (index % 2 === 0) {
          $(row).css('background-color', '#E6E6FA');  // Light lavender
        }
      }"
    )
  )

  # Render DataTables
  output$table_F <- renderDT({
    if (!is.null(data_F)) {
      datatable(data_F, options = datatable_options, rownames = FALSE, filter = "top")
    } else {
      datatable(data.frame(Message = "No data available"), options = list(dom = 't'))
    }
  })

  output$table_T <- renderDT({
    if (!is.null(data_T)) {
      datatable(data_T, options = datatable_options, rownames = FALSE, filter = "top")
    } else {
      datatable(data.frame(Message = "No data available"), options = list(dom = 't'))
    }
  })

  output$table_F_95 <- renderDT({
    if (!is.null(data_F_95)) {
      datatable(data_F_95, options = datatable_options, rownames = FALSE, filter = "top")
    } else {
      datatable(data.frame(Message = "No data available"), options = list(dom = 't'))
    }
  })
}

# Run app
shinyApp(ui = ui, server = server)

Pipeline Development for Colocalization (HyPrColoc) Across Clinical and Molecular Phenotypes

שׁוֹשַׁנָּה🌹

February 01, 2025