We conducted a two-sample Mendelian randomization (MR) analysis to evaluate the causal effect of selected UKB-PPP proteins on systolic blood pressure (SBP), using cis instruments drawn from a 1 Mb region surrounding each protein's transcription start site (TSS).
I had previously obtained the pQTL files. For how to programmatically download files from UKB-PPP and add rsids, refer to:
- /Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_chemokine_list/ukbpp_chemokines_40.Rmd
- https://rpubs.com/YodaMendel/1268255
- https://rpubs.com/YodaMendel/1243451
Download the SBP outcome GWAS (ieu-b-38) from IEU OpenGWAS:
cd /Volumes/LaCie/tahir/mr_ukbppp_bp
wget https://gwas.mrcieu.ac.uk/files/ieu-b-38/ieu-b-38.vcf.gz
From Evangelou et al. (2018), which describes the ieu-b-38 GWAS:
“Here, we report genome-wide discovery analyses of BP traits - systolic (SBP), diastolic (DBP) and pulse pressure (PP) - in people of European ancestry drawn from UK Biobank (UKB) and the International Consortium of Blood Pressure-Genome Wide Association Studies (ICBP). We adopted a combination of a one- and two-stage study design to test common and low-frequency single nucleotide polymorphisms (SNPs) with minor allele frequency (MAF) ≥ 1% associated with BP traits. In all, we studied over 1 million people of European descent, including replication data from the US Million Veterans Program (MVP, N=220,520) and the Estonian Genome Centre, University of Tartu (EGCUT, N=28,742) Biobank.”
For more details, refer to the full paper.
Step 1: Bash script to convert the VCF to a TSV and format the headers for TwoSampleMR
Dependencies:
• gunzip (to decompress the VCF)
• awk (to extract fields and compute p-values from LP)
Run with:
cd /Volumes/LaCie/tahir/mr_ukbppp_bp/scripts
chmod +x format_ieu-b-38.vcf.sh
nohup ./format_ieu-b-38.vcf.sh > format_ieu-b-38.vcf.log 2>&1 &
ps -p 96562 -o stat,etime  # monitor the background job (substitute the PID nohup reports)
# Step 1: Decompress ieu-b-38.vcf.gz without deleting the original
cd /Volumes/LaCie/tahir/mr_ukbppp_bp
gunzip -k ieu-b-38.vcf.gz
# Step 2: Convert the decompressed .vcf file to a .tsv format and calculate p-values from logp
outfile="ieu-b-38_formatted.tsv"
# Write header
echo -e "CHR\tBP\tSNP\tREF\tALT\tEAF\tBETA\tSE\tLP\tPVAL" > "$outfile"
# Extract fields and calculate pval
awk 'BEGIN{OFS="\t"}
!/^#/ {
split($10, a, ":");
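# assumes the GWAS-VCF FORMAT order ES:SE:LP:AF for the sample column (confirm against the VCF header)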
beta = a[1];
se = a[2];
lp = a[3];
eaf = a[4];
pval = 10^(-lp);
print $1, $2, $3, $4, $5, eaf, beta, se, lp, pval
}' ieu-b-38.vcf >> "$outfile"
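As a quick sanity check on the formatted file (a minimal sketch in R; column names follow the header written above):
library(data.table)
sbp <- fread("/Volumes/LaCie/tahir/mr_ukbppp_bp/ieu-b-38_formatted.tsv", nrows = 1000)
head(sbp)
all.equal(sbp$PVAL, 10^(-sbp$LP), tolerance = 1e-5)  # PVAL should reproduce 10^(-LP) up to printing precision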
# Clean MR_bp_proteins.csv
# Output file name
outfile="MR_bp_proteins_cleaned.csv"
# Process:
# - Remove lines that exactly match "Protein" or "Assay" (case-sensitive)
# - Trim leading/trailing whitespace
# - Remove completely empty lines
# - Remove funky hidden characters (like ^M)
# Run
awk '{
gsub(/^[ \t]+|[ \t]+$/, ""); # Trim leading/trailing spaces
gsub(/\r/, ""); # Remove carriage returns if any
if ($0 != "" && $0 != "Protein" && $0 != "Assay") print $0;
}' MR_bp_proteins.csv > "$outfile"
echo "✅ Cleaned file saved as: $outfile"
cd /Volumes/LaCie/tahir/mr_ukbppp_bp/scripts
chmod +x copy_173_ukbppp.sh
./copy_173_ukbppp.sh
#!/bin/bash
# Source and destination directories
SRC_DIR="/Users/charleenadams/ukbppp/coloc_clin_mol_extracted"
DEST_DIR="/Volumes/LaCie/tahir/mr_ukbppp_bp/data_173"
PROTEIN_LIST="/Volumes/LaCie/tahir/mr_ukbppp_bp/MR_bp_proteins_cleaned.csv"
# Make sure the destination directory exists
mkdir -p "$DEST_DIR"
# Start the process
echo "🚀 Starting directory search and copy..."
# Loop through each protein name (the header rows were already removed during cleaning,
# so read every line instead of skipping the first with tail -n +2)
while IFS= read -r PROTEIN; do
echo "🔍 Searching for folders matching: *$PROTEIN*"
# Find matching directories
MATCHES=$(find "$SRC_DIR" -type d -name "*$PROTEIN*" 2>/dev/null)
if [[ -n "$MATCHES" ]]; then
echo "$MATCHES" | while read -r DIR; do
echo "✅ Copying $DIR to $DEST_DIR"
cp -R "$DIR" "$DEST_DIR"
done
else
echo "❌ NOT FOUND: No directory found for $PROTEIN"
fi
done < "$PROTEIN_LIST"
echo "🎉 All done!"
Check how many directories copied over: 198 rather than 173, because Olink includes multiple assays for some gene names.
find . -mindepth 1 -maxdepth 1 -type d | wc -l
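To see where the extra 25 directories came from (a sketch, assuming the GENE_UNIPROT_OID_v1_Panel directory naming seen above):
dirs <- list.dirs("/Volumes/LaCie/tahir/mr_ukbppp_bp/data_173", recursive = FALSE)
genes <- sub("_.*$", "", basename(dirs))
table(genes)[table(genes) > 1]  # gene names matched by more than one Olink assay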
conda env list  # check which environments are available
Python 3 handles the merging, the header/delimiter issues, and the LOG10P-to-P conversion.
Step 1:
source ~/venvs/mr_env/bin/activate #created previously
cd /Volumes/LaCie/tahir/mr_ukbppp_bp/scripts
nano consolidate_chrs.py
chmod +x consolidate_chrs.py
nohup ./consolidate_chrs.py > consolidate_chrs.log 2>&1 &
#!/usr/bin/env python3
import os
import psutil # <-- Need this for RAM and CPU stats! Install via: pip install psutil
import pandas as pd
import numpy as np
import gzip
import multiprocessing
import time
# Base directory where the subdirectories exist
base_dir = "/Volumes/LaCie/tahir/mr_ukbppp_bp/data_173"
def auto_tune_cores(ram_per_process_gb=4, cpu_threshold=80):
"""
Dynamically calculate number of cores based on available RAM and CPU usage.
"""
# Total logical cores
total_cores = os.cpu_count()
# Get available RAM (in GB)
available_ram_gb = psutil.virtual_memory().available / (1024 ** 3)
# Estimate safe core usage based on RAM
cores_by_ram = int(available_ram_gb // ram_per_process_gb)
# CPU load (average over 1 minute)
cpu_load_percent = psutil.cpu_percent(interval=1)
# If CPU is already working hard, reduce the core count
if cpu_load_percent > cpu_threshold:
print(f"⚠️ CPU load is {cpu_load_percent}%. Throttling cores down.")
cores_by_cpu = max(1, total_cores // 4)
else:
cores_by_cpu = total_cores
# Take the safer limit
optimal_cores = max(1, min(cores_by_ram, cores_by_cpu, total_cores - 1))
print(f"🧠 Available RAM: {available_ram_gb:.2f} GB")
print(f"⚙️ CPU load: {cpu_load_percent:.2f}%")
print(f"🚀 Using {optimal_cores} out of {total_cores} cores.")
return optimal_cores
def process_subdir(subdir):
"""Function to process and merge files in a single subdirectory"""
subdir_path = os.path.join(base_dir, subdir)
if not os.path.isdir(subdir_path):
print(f"⛔ Skipping {subdir} (not a directory)")
return
merged_file_path = os.path.join(subdir_path, f"{subdir}_merged.tsv.gz")
if os.path.exists(merged_file_path):
print(f"✅ Already merged: {merged_file_path}")
return
merged_data = []
file_count = 0
print(f"🔨 Processing {subdir}...")
for filename in sorted(os.listdir(subdir_path)):
if not filename.endswith("_rsid_added.gz"):
continue # Skip non-target files
file_path = os.path.join(subdir_path, filename)
try:
with gzip.open(file_path, "rt") as f:
df = pd.read_csv(f, sep=r"\s+", engine="python")
# Standardize column names
df.columns = df.columns.str.strip().str.upper()
if "LOG10P" not in df.columns:
raise ValueError(f"LOG10P column missing in {filename}, columns are: {df.columns.tolist()}")
# Add the P-value
df["P"] = 10 ** (-df["LOG10P"])
merged_data.append(df)
file_count += 1
except Exception as e:
print(f"💥 Error processing {filename} in {subdir}: {e}")
if merged_data and file_count >= 23:
merged_df = pd.concat(merged_data, ignore_index=True)
merged_df.to_csv(merged_file_path, sep="\t", index=False, compression="gzip")
print(f"🎉 Merged {file_count} files in {subdir} -> {merged_file_path}")
elif merged_data:
print(f"⚠️ Only {file_count} files processed in {subdir}, not merging.")
else:
print(f"🚫 No valid files found in {subdir}")
if __name__ == "__main__":
# Auto-tune cores based on system load
num_cores = auto_tune_cores()
# Get all subdirectories to process
all_subdirs = [d for d in os.listdir(base_dir) if os.path.isdir(os.path.join(base_dir, d))]
print(f"🔧 Starting parallel processing for {len(all_subdirs)} directories using {num_cores} cores...\n")
# Start the pool
with multiprocessing.Pool(processes=num_cores) as pool:
pool.map(process_subdir, all_subdirs)
print("\n✅ ALL DONE!")
Step 2: Check headers in R
This verifies that the merge worked and that the expected columns (including the new P column) are present.
# Load necessary library
library(data.table)
# Define file path
file_path <- "/Volumes/LaCie/tahir/mr_ukbppp_bp/data_173/ACAN_P16112_OID20159_v1_Cardiometabolic/ACAN_P16112_OID20159_v1_Cardiometabolic_merged.tsv.gz"
# Read the gzipped TSV file using gunzip -c (macOS-friendly)
data <- fread(cmd = paste("gunzip -c", file_path), sep = "\t", header = TRUE)
# Display first few rows
print(head(data))
# Summary statistics for numeric columns
summary(data)
# Check column names
colnames(data)
table(data$CHROM)
Copy the merged files to data_173/mr_files
mkdir -p /Volumes/LaCie/tahir/mr_ukbppp_bp/data_173/mr_files
find /Volumes/LaCie/tahir/mr_ukbppp_bp/data_173 -name "*_merged.tsv.gz" -exec cp {} /Volumes/LaCie/tahir/mr_ukbppp_bp/data_173/mr_files/ \;
cd /Volumes/LaCie/tahir/mr_ukbppp_bp   # the formatted outcome file was written here in Step 1
cp ieu-b-38_formatted.tsv /Volumes/LaCie/tahir/mr_ukbppp_bp/data_173/mr_files/
Run the MR loop script (mr_loop.R) with:
nohup Rscript /Volumes/LaCie/tahir/mr_ukbppp_bp/scripts/mr_loop.R > /Volumes/LaCie/tahir/mr_ukbppp_bp/scripts/mr_loop.log 2>&1 &
ps -p 30446 -o stat,etime
rm(list = ls())
# Load libraries
library(TwoSampleMR)
library(MendelianRandomization)
library(ieugwasr)
library(MRInstruments)
library(dplyr)
library(tidyr)
library(data.table)
library(openxlsx)
library(ggplot2)
library(scales)
library(RhpcBLASctl)
library(patchwork)
library(genetics.binaRies)
# Define %||% operator for fallback
`%||%` <- function(x, y) if (length(x) > 0) x else y
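# Example: character(0) %||% "fallback" returns "fallback"; note this length()-based version
# differs from rlang's %||%, which tests is.null() instead.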
# Set memory limit (note: R_MAX_VSIZE is normally read at R startup, e.g. from ~/.Renviron; setting it mid-session may have no effect)
Sys.setenv("R_MAX_VSIZE" = 128000)
cat("Max vector size set to:", Sys.getenv("R_MAX_VSIZE"), "MB\n")
# Set directories and files
base_dir <- "/Volumes/LaCie/tahir/mr_ukbppp_bp"
data_dir <- file.path(base_dir, "data_173", "mr_files")
results_dir <- file.path(base_dir, "results_173_tss")
if (!dir.exists(results_dir)) dir.create(results_dir, recursive = TRUE)
# Check if data_dir exists
if (!dir.exists(data_dir)) {
stop(sprintf("Data directory %s does not exist. Please verify the path.", data_dir))
}
# Outcome configuration for SBP
outcome_config <- list(
name = "SBP",
file = file.path(data_dir, "ieu-b-38_formatted.tsv"),
id = "ieu-b-38"
)
# Debug: Print outcome file path
cat("Checking outcome file:", outcome_config$file, "\n")
# Verify outcome file exists
if (!file.exists(outcome_config$file)) {
stop(sprintf("Outcome file %s does not exist. Please verify the file name and path.", outcome_config$file))
}
# Annotation file
local_annotation_file <- file.path(base_dir, "mart_clean_37_alias.txt")
if (!file.exists(local_annotation_file)) {
stop(sprintf("Local annotation file %s not found", local_annotation_file))
}
# List protein files
protein_files <- list.files(data_dir, pattern = "^[A-Z0-9]+_[A-Z0-9]+_OID[0-9]+_v1_.*\\.tsv(\\.gz)?$", full.names = TRUE)
protein_files <- protein_files[!grepl(basename(outcome_config$file), protein_files)]
# Debug: List all files in data_dir
all_files <- list.files(data_dir, full.names = TRUE)
cat("All files in data_dir:\n")
print(all_files)
cat("Protein files to process:\n")
print(protein_files)
if (length(protein_files) == 0) {
stop("No protein files found. Check the file pattern or data_dir contents.")
}
# Load annotation file
gene_data <- fread(local_annotation_file, data.table = FALSE)
cat(sprintf("Loaded local annotation file with %d genes\n", nrow(gene_data)))
# Set up logging
log_file <- file.path(base_dir, "mr_analysis.log")
sink(log_file, append = TRUE, split = TRUE)
cat(sprintf("Starting sequential MR analysis on %s\n", Sys.time()))
forest_plots <- list()
# Format exposure function
format_exposure <- function(file, protein_chrom, tss_start, tss_end) {
dat <- fread(file) %>%
filter(
CHROM == protein_chrom,
POS19 >= tss_start,
POS19 <= tss_end
) %>%
dplyr::select(
SNP = RSID,
chr.exposure = CHROM,
pos.exposure = POS19,
effect_allele.exposure = ALLELE1,
other_allele.exposure = ALLELE0,
eaf.exposure = A1FREQ,
beta.exposure = BETA,
se.exposure = SE,
pval.exposure = P,
samplesize.exposure = N
) %>%
mutate(
exposure = sub("_v1_.*$", "", basename(file)),
id.exposure = exposure,
chr.exposure = as.character(chr.exposure)
) %>%
filter(!is.na(SNP) & grepl("^rs", SNP))
cat(sprintf("Exposure SNPs for %s in TSS window: %d\n", basename(file), nrow(dat)))
return(dat)
}
# Format outcome function (adds the constant outcome sample size needed by steiger_filtering and coloc)
format_outcome <- function(file, protein_chrom, tss_start, tss_end) {
dat <- fread(file)
# Debug: Print column names
cat("Columns in outcome file:", colnames(dat), "\n")
dat <- dat %>%
filter(
CHR == protein_chrom,
BP >= tss_start,
BP <= tss_end
) %>%
dplyr::select(
SNP = SNP,
chr.outcome = CHR,
pos.outcome = BP,
effect_allele.outcome = ALT,
other_allele.outcome = REF,
eaf.outcome = EAF,
beta.outcome = BETA,
se.outcome = SE,
pval.outcome = PVAL
) %>%
mutate(
outcome = outcome_config$name,
id.outcome = outcome_config$id,
chr.outcome = as.character(chr.outcome),
samplesize.outcome = 757601 # UKB + ICBP discovery sample size from Evangelou et al. (2018)
) %>%
filter(!is.na(SNP) & grepl("^rs", SNP))
cat(sprintf("Outcome SNPs in TSS window (chr %s, %d-%d): %d\n", protein_chrom, tss_start, tss_end, nrow(dat)))
return(dat)
}
# Function to add formatted Excel sheet
add_formatted_sheet <- function(wb, sheet_name, data, title, olink_id) {
if (is.null(data) || !is.data.frame(data) || nrow(data) == 0 || ncol(data) == 0) {
cat(sprintf("Skipping sheet '%s' for %s: Invalid or empty data\n", sheet_name, olink_id))
addWorksheet(wb, sheet_name)
writeData(wb, sheet_name, "No data available")
return()
}
addWorksheet(wb, sheet_name)
writeData(wb, sheet_name, title, startRow = 1, startCol = 1)
writeData(wb, sheet_name, data, startRow = 3, startCol = 1, headerStyle = header_style)
mergeCells(wb, sheet_name, cols = 1:ncol(data), rows = 1)
setColWidths(wb, sheet_name, cols = 1:ncol(data), widths = "auto")
addStyle(wb, sheet_name, title_style, rows = 1, cols = 1)
}
# Modified perform_mr function
perform_mr <- function(protein_file, outcome_file, results_dir, gene_data, clump_r2, clump_label) {
olink_id <- sub("_v1_.*$", "", basename(protein_file))
protein_dir <- file.path(results_dir, clump_label, olink_id)
today <- Sys.Date()
excel_file <- file.path(protein_dir, sprintf("MR_Results_%s_%s_r2%s_%s.xlsx", olink_id, outcome_config$name, clump_label, today))
tss_harmonized_file <- file.path(protein_dir, sprintf("tss_harmonized_%s_%s_r2%s_%s.csv", olink_id, outcome_config$name, clump_label, today))
plot_file <- file.path(protein_dir, sprintf("MR_Forestplot_%s_%s_r2%s_%s.png", olink_id, outcome_config$name, clump_label, today))
# Check for existing files
existing_harmonized <- list.files(protein_dir, pattern = sprintf("(full|tss)_harmonized_%s_%s_r2%s_.*\\.csv$", olink_id, outcome_config$name, clump_label), full.names = TRUE)
if (length(existing_harmonized) > 0 || file.exists(excel_file) || file.exists(plot_file)) {
cat(sprintf("Skipping %s (r2 = %s): Existing files found (e.g., %s)\n", olink_id, clump_r2, existing_harmonized[1] %||% excel_file))
return(NULL)
}
if (!dir.exists(protein_dir)) dir.create(protein_dir, recursive = TRUE)
cat(sprintf("Processing %s with clump_r2 = %s...\n", olink_id, clump_r2))
# Get TSS and chromosome
protein <- strsplit(olink_id, "_")[[1]][1]
cat(sprintf("Searching for protein: %s\n", protein))
protein_data <- gene_data %>%
filter(`Gene name` == protein) %>%
dplyr::select(`Chromosome/scaffold name`, `Transcription start site (TSS)`, `Gene stable ID version`, `Gene name`, `Gene Synonym`)
if (nrow(protein_data) == 0) {
cat(sprintf("No match found for Gene name '%s', trying Gene Synonym...\n", protein))
protein_data <- gene_data %>%
filter(`Gene Synonym` == protein) %>%
dplyr::select(`Chromosome/scaffold name`, `Transcription start site (TSS)`, `Gene stable ID version`, `Gene name`, `Gene Synonym`)
}
if (nrow(protein_data) == 0) {
cat(sprintf("No TSS data found for %s in either Gene name or Gene Synonym; skipping\n", olink_id))
return(NULL)
}
cat(sprintf("Protein data for %s:\n", olink_id))
print(protein_data)
chrom_col <- "Chromosome/scaffold name"
tss_col <- "Transcription start site (TSS)"
tss_column <- protein_data[[tss_col]]
if (all(is.na(tss_column)) || length(tss_column) == 0) {
cat(sprintf("No valid TSS values for %s; skipping\n", olink_id))
return(NULL)
}
protein_tss <- min(tss_column, na.rm = TRUE)
protein_chrom <- protein_data[[chrom_col]][1]
if (is.infinite(protein_tss) || is.na(protein_tss) || is.na(protein_chrom)) {
cat(sprintf("Invalid TSS (%s) or chromosome (%s) for %s; skipping\n", protein_tss, protein_chrom, olink_id))
return(NULL)
}
cat(sprintf("Using TSS: %d, Chromosome: %s for %s\n", protein_tss, protein_chrom, olink_id))
window_size <- 500000
tss_start <- protein_tss - window_size
tss_end <- protein_tss + window_size
# Filter and harmonize TSS data
exp_dat <- format_exposure(protein_file, protein_chrom, tss_start, tss_end)
if (nrow(exp_dat) == 0) {
cat(sprintf("No exposure SNPs in TSS window for %s; skipping\n", olink_id))
return(NULL)
}
out_dat <- format_outcome(outcome_file, protein_chrom, tss_start, tss_end)
if (nrow(out_dat) == 0) {
cat(sprintf("No outcome SNPs in TSS window for %s; skipping\n", olink_id))
return(NULL)
}
dat <- harmonise_data(exposure_dat = exp_dat, outcome_dat = out_dat, action = 1)
write.csv(dat, tss_harmonized_file, row.names = FALSE)
cat(sprintf("TSS harmonized data for %s saved to: %s\n", olink_id, tss_harmonized_file))
instruments <- dat %>%
filter(mr_keep == TRUE, pval.exposure < 5e-6) %>%
mutate(
rsid = SNP,
pval = pval.exposure,
id = sprintf("%s exposure", olink_id),
F_stat = (beta.exposure / se.exposure)^2
) %>%
filter(F_stat > 10)
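# (beta/se)^2 approximates the per-SNP F statistic; F > 10 is the conventional weak-instrument cutoff.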
cat(sprintf("Number of instruments for %s after F-stat filtering: %d\n", olink_id, nrow(instruments)))
if (nrow(instruments) == 0) {
cat(sprintf("No instruments for %s after F-stat filtering; skipping\n", olink_id))
return(NULL)
}
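# Steiger filtering keeps SNPs that explain more variance in the exposure (protein) than in
# the outcome, i.e. a direction of effect consistent with protein -> SBP.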
instruments <- steiger_filtering(instruments)
if (!"steiger_dir" %in% colnames(instruments) || nrow(instruments) == 0) {
cat(sprintf("Steiger filtering produced no valid instruments for %s; skipping\n", olink_id))
return(NULL)
}
instruments <- instruments %>% filter(steiger_dir == TRUE)
cat(sprintf("Number of instruments for %s after Steiger filtering: %d\n", olink_id, nrow(instruments)))
if (nrow(instruments) == 0) {
cat(sprintf("No instruments for %s after Steiger filtering; skipping\n", olink_id))
return(NULL)
}
clumped_dat <- NULL
tryCatch({
clumped <- ld_clump(
dplyr::tibble(rsid = instruments$rsid, pval = instruments$pval, id = instruments$id),
plink_bin = genetics.binaRies::get_plink_binary(),
bfile = "/Users/charleenadams/1000G_bfiles/EUR/EUR",
clump_r2 = clump_r2,
clump_kb = 500
)
clumped_dat <- instruments %>% filter(SNP %in% clumped$rsid)
cat(sprintf("Number of SNPs retained for %s after clumping (r2 = %s): %d\n", olink_id, clump_r2, nrow(clumped_dat)))
}, error = function(e) {
cat(sprintf("LD clumping failed for %s (r2 = %s): %s\nSkipping MR analysis.\n", olink_id, clump_r2, e$message))
return(NULL)
})
if (is.null(clumped_dat) || nrow(clumped_dat) == 0) {
cat(sprintf("No SNPs retained for %s after clumping (r2 = %s); skipping MR analysis\n", olink_id, clump_r2))
return(NULL)
}
# Perform MR analyses
ivw_result <- mr(clumped_dat, method_list = "mr_ivw")
egger_result <- if (nrow(clumped_dat) >= 3) {
tryCatch(mr(clumped_dat, method_list = "mr_egger_regression"), error = function(e) NULL)
} else NULL
weighted_median_result <- mr(clumped_dat, method_list = "mr_weighted_median")
mr_input <- mr_input(
bx = clumped_dat$beta.exposure,
bxse = clumped_dat$se.exposure,
by = clumped_dat$beta.outcome,
byse = clumped_dat$se.outcome,
exposure = olink_id,
outcome = outcome_config$name,
snps = clumped_dat$SNP
)
lasso_result <- tryCatch(mr_lasso(mr_input), error = function(e) {
cat(sprintf("MR-Lasso failed for %s: %s\n", olink_id, e$message)); NULL
})
conmix_result <- tryCatch(mr_conmix(mr_input), error = function(e) {
cat(sprintf("MR-ConMix failed for %s: %s\n", olink_id, e$message)); NULL
})
heterogeneity_result <- mr_heterogeneity(clumped_dat)
pleiotropy_result <- if (nrow(clumped_dat) >= 3) mr_pleiotropy_test(clumped_dat) else NULL
loo_result <- mr_leaveoneout(clumped_dat)
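# Per-SNP Wald ratios: b_Y / b_X, with first-order delta-method SE:
# Var(b_Y/b_X) ≈ Var(b_Y)/b_X^2 + b_Y^2 * Var(b_X)/b_X^4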
wald_ratios <- clumped_dat %>%
mutate(
wald_beta = beta.outcome / beta.exposure,
wald_se = sqrt((se.outcome^2 / beta.exposure^2) + ((beta.outcome^2 * se.exposure^2) / (beta.exposure^4))),
pval = 2 * pnorm(abs(wald_beta / wald_se), lower.tail = FALSE),
method = paste("Wald Ratio:", SNP)
) %>%
dplyr::select(SNP, wald_beta, wald_se, pval, method)
# Process MR results for forest plot
mr_results <- bind_rows(
if (nrow(ivw_result) > 0) ivw_result[, c("method", "b", "se", "pval")],
if (!is.null(egger_result) && nrow(egger_result) > 0) egger_result[, c("method", "b", "se", "pval")],
if (nrow(weighted_median_result) > 0) weighted_median_result[, c("method", "b", "se", "pval")],
if (!is.null(lasso_result)) data.frame(
method = "MR-Lasso",
b = lasso_result@Estimate,
se = lasso_result@StdError,
pval = lasso_result@Pvalue
),
if (!is.null(conmix_result)) data.frame(
method = "MR-ConMix",
b = conmix_result@Estimate,
se = (conmix_result@CIUpper - conmix_result@CILower)/(2*1.96),
pval = conmix_result@Pvalue
),
wald_ratios %>% dplyr::select(method, b = wald_beta, se = wald_se, pval)
) %>%
mutate(
method = factor(method, levels = unique(method)),
or = exp(b),
or_lower = exp(b - 1.96 * se),
or_upper = exp(b + 1.96 * se)
) %>%
na.omit()
# Define colors for forest plot
base_colors <- c("IVW" = "#1F78B4", "MR-Egger" = "#FF7F00", "Weighted Median" = "#33A02C",
"MR-Lasso" = "#FB9A99", "MR-ConMix" = "#E41A1C")
if (nrow(mr_results) == 0 || !("method" %in% names(mr_results))) {
cat(sprintf("Warning: No valid MR results for %s; using default colors\n", olink_id))
color_list <- base_colors
} else {
wald_methods <- unique(mr_results$method[grepl("Wald Ratio", mr_results$method)])
n_wald <- length(wald_methods)
wald_colors <- if (n_wald > 0) setNames(hue_pal()(n_wald), wald_methods) else NULL
color_list <- c(base_colors, wald_colors)
color_list <- color_list[names(color_list) %in% unique(mr_results$method)]
}
# Create forest plot
x_min <- min(mr_results$or_lower, na.rm = TRUE)
x_max <- max(mr_results$or_upper, na.rm = TRUE)
forest_plot <- ggplot(mr_results, aes(x = or, y = method, color = method)) +
geom_errorbarh(aes(xmin = or_lower, xmax = or_upper), height = 0.2, na.rm = TRUE, linewidth = 0.8) +
geom_point(size = 3) +
geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
labs(
title = sprintf("Mendelian Randomization Estimates:\n %s on %s (r2 = %s)", protein, outcome_config$name, clump_r2),
x = "Causal Effect (Beta)",
y = ""
) +
theme_minimal() +
theme(
plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
axis.text.y = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 10),
axis.title.x = element_text(size = 12),
panel.grid.major = element_line(color = "grey90"),
panel.grid.minor = element_blank(),
legend.position = "none",
panel.background = element_rect(fill = "white", color = NA),
plot.background = element_rect(fill = "white", color = NA)
) +
scale_color_manual(values = color_list) +
xlim(x_min - 0.02 * (x_max - x_min), x_max + 0.02 * (x_max - x_min))
forest_plots[[paste(olink_id, clump_label, sep = "_")]] <<- forest_plot
# Save outputs to Excel
wb <- createWorkbook()
title_style <- createStyle(fontSize = 14, fontColour = "black", textDecoration = "bold", halign = "center")
header_style <- createStyle(fontColour = "white", fgFill = "#C8A2C8", textDecoration = "bold", halign = "center", border = "TopBottomLeftRight")
toc_data <- data.frame(
Sheet = c("IVW", "MR-Egger", "Weighted_Median", "MR-Lasso", "MR-ConMix",
"Wald_Ratios", "Heterogeneity_Test", "Pleiotropy_Test", "Leave_One_Out",
"Instruments", "Clumped_Data"),
Title = c("Inverse Variance Weighted Results", "MR-Egger Results",
"Weighted Median Results", "MR-Lasso Results", "MR-ConMix Results",
"Wald Ratio Tests", "Heterogeneity Test Results",
"Pleiotropy Test Results", "Leave-One-Out Analysis",
"Filtered Instruments", "Clumped SNPs Data")
)
addWorksheet(wb, "TOC")
writeData(wb, "TOC", "Table of Contents", startRow = 1, startCol = 1)
mergeCells(wb, "TOC", cols = 1:2, rows = 1)
addStyle(wb, "TOC", title_style, rows = 1, cols = 1)
writeData(wb, "TOC", toc_data, startRow = 3, startCol = 1, headerStyle = header_style)
setColWidths(wb, "TOC", cols = 1:2, widths = "auto")
add_formatted_sheet <- function(wb, sheet_name, data, title) {
if (is.null(data) || !is.data.frame(data) || nrow(data) == 0 || ncol(data) == 0) {
cat(sprintf("Skipping sheet '%s' for %s: Invalid or empty data\n", sheet_name, olink_id))
addWorksheet(wb, sheet_name)
writeData(wb, sheet_name, "No data available")
return()
}
addWorksheet(wb, sheet_name)
writeData(wb, sheet_name, title, startRow = 1, startCol = 1)
writeData(wb, sheet_name, data, startRow = 3, startCol = 1, headerStyle = header_style)
mergeCells(wb, sheet_name, cols = 1:ncol(data), rows = 1)
setColWidths(wb, sheet_name, cols = 1:ncol(data), widths = "auto")
addStyle(wb, sheet_name, title_style, rows = 1, cols = 1)
}
add_formatted_sheet(wb, "IVW", ivw_result, "Inverse Variance Weighted Results")
add_formatted_sheet(wb, "MR-Egger", egger_result, "MR-Egger Results")
add_formatted_sheet(wb, "Weighted_Median", weighted_median_result, "Weighted Median Results")
if (!is.null(lasso_result)) {
lasso_df <- data.frame(
Exposure = lasso_result@Exposure,
Outcome = lasso_result@Outcome,
Estimate = lasso_result@Estimate,
StdError = lasso_result@StdError,
CILower = lasso_result@CILower,
CIUpper = lasso_result@CIUpper,
Pvalue = lasso_result@Pvalue,
SNPs = lasso_result@SNPs,
Valid = lasso_result@Valid,
ValidSNPs = paste(lasso_result@ValidSNPs, collapse = ", ")
)
add_formatted_sheet(wb, "MR-Lasso", lasso_df, "MR-Lasso Results")
}
if (!is.null(conmix_result)) {
conmix_df <- data.frame(
Exposure = conmix_result@Exposure,
Outcome = conmix_result@Outcome,
Estimate = conmix_result@Estimate,
Pvalue = conmix_result@Pvalue,
CILower = conmix_result@CILower,
CIUpper = conmix_result@CIUpper,
SNPs = conmix_result@SNPs
)
add_formatted_sheet(wb, "MR-ConMix", conmix_df, "MR-ConMix Results")
}
add_formatted_sheet(wb, "Wald_Ratios", wald_ratios, "Wald Ratio Tests")
add_formatted_sheet(wb, "Heterogeneity_Test", heterogeneity_result, "Heterogeneity Test Results")
add_formatted_sheet(wb, "Pleiotropy_Test", pleiotropy_result, "Pleiotropy Test Results")
add_formatted_sheet(wb, "Leave_One_Out", loo_result, "Leave-One-Out Analysis")
add_formatted_sheet(wb, "Instruments", instruments, "Filtered Instruments")
add_formatted_sheet(wb, "Clumped_Data", clumped_dat, "Clumped SNPs Data")
saveWorkbook(wb, excel_file, overwrite = FALSE)
ggsave(plot_file, forest_plot, width = 12, height = 10, dpi = 600, bg = "white")
cat(sprintf("Results for %s (r2 = %s) saved to: %s and %s\n", olink_id, clump_r2, excel_file, plot_file))
gc(verbose = TRUE)
}
# Main analysis loop for multiple clump_r2 values
for (clump_r2 in c(0.1, 0.01, 0.001)) {
clump_label <- paste0("r2_", gsub("\\.", "", clump_r2))
for (protein_file in protein_files) {
perform_mr(protein_file, outcome_config$file, results_dir, gene_data, clump_r2, clump_label)
}
}
cat(sprintf("MR analysis completed on %s\n", Sys.time()))
sink()
# Master file creation
today <- Sys.Date()
master_file <- file.path(base_dir, sprintf("Master_MR_Results_%s_%s.xlsx", outcome_config$name, today))
result_files <- list.files(results_dir, pattern = sprintf("MR_Results_.*_%s_r2.*\\.xlsx$", outcome_config$name),
full.names = TRUE, recursive = TRUE)
tabs <- c("IVW", "MR-Egger", "Weighted_Median", "MR-Lasso", "MR-ConMix",
"Wald_Ratios", "Heterogeneity_Test", "Pleiotropy_Test", "Leave_One_Out",
"Instruments", "Clumped_Data")
master_data <- setNames(vector("list", length(tabs)), tabs)
extract_protein_id <- function(file) {
file_name <- basename(file)
protein_id <- sub(sprintf("MR_Results_(.*)_%s_r2.*\\.xlsx$", outcome_config$name), "\\1", file_name)
# NB: file names contain "r2r2_<digits>" (the sprintf adds "r2" to clump_label, which already
# starts with "r2_"), so this pattern finds no match and sub() returns the full file name;
# the Python post-processing script below recovers the r2 value from it.
r2_value <- sub(".*_r2([0-9]+)_.*\\.xlsx$", "\\1", file_name)
list(protein_id = protein_id, r2_value = r2_value)
}
for (file in result_files) {
info <- extract_protein_id(file)
protein_id <- info$protein_id
r2_value <- info$r2_value
cat(sprintf("Processing %s (r2 = %s)...\n", protein_id, r2_value))
wb <- tryCatch({
loadWorkbook(file)
}, error = function(e) {
cat(sprintf("Failed to load %s: %s\nSkipping\n", protein_id, e$message))
return(NULL)
}, warning = function(w) {
cat(sprintf("Warning loading %s: %s\n", protein_id, w$message))
suppressWarnings(loadWorkbook(file))
})
if (is.null(wb)) next
sheet_names <- sheets(wb)
for (tab in tabs) {
if (tab %in% sheet_names) {
data <- tryCatch({
readWorkbook(wb, sheet = tab, startRow = 3, colNames = TRUE)
}, error = function(e) {
cat(sprintf("Error reading %s in %s: %s\n", tab, protein_id, e$message))
return(NULL)
}, warning = function(w) {
cat(sprintf("Warning reading %s in %s: %s\n", tab, protein_id, w$message))
suppressWarnings(readWorkbook(wb, sheet = tab, startRow = 3, colNames = TRUE))
})
if (is.null(data) || nrow(data) == 0 ||
(nrow(data) == 1 && length(data[1, 1]) > 0 && data[1, 1] == "No data available")) {
next
}
if ("chr.exposure" %in% names(data)) {
data$chr.exposure <- as.character(data$chr.exposure)
}
if ("chr.outcome" %in% names(data)) {
data$chr.outcome <- as.character(data$chr.outcome)
}
data$Protein_ID <- protein_id
data$Clump_R2 <- r2_value
if (is.null(master_data[[tab]])) {
master_data[[tab]] <- data
} else {
master_data[[tab]] <- bind_rows(master_data[[tab]], data)
}
}
}
}
wb <- createWorkbook()
title_style <- createStyle(fontSize = 14, fontColour = "black", textDecoration = "bold", halign = "center")
header_style <- createStyle(fontColour = "white", fgFill = "#C8A2C8", textDecoration = "bold", halign = "center", border = "TopBottomLeftRight")
toc_data <- data.frame(
Sheet = tabs,
Title = c("Inverse Variance Weighted Results", "MR-Egger Results",
"Weighted Median Results", "MR-Lasso Results", "MR-ConMix Results",
"Wald Ratio Tests", "Heterogeneity Test Results",
"Pleiotropy Test Results", "Leave-One-Out Analysis",
"Filtered Instruments", "Clumped SNPs Data")
)
addWorksheet(wb, "TOC")
writeData(wb, "TOC", "Table of Contents", startRow = 1, startCol = 1)
mergeCells(wb, "TOC", cols = 1:2, rows = 1)
addStyle(wb, "TOC", title_style, rows = 1, cols = 1)
writeData(wb, "TOC", toc_data, startRow = 3, startCol = 1, headerStyle = header_style)
setColWidths(wb, "TOC", cols = 1:2, widths = "auto")
add_formatted_sheet <- function(wb, sheet_name, data, title) {
if (is.null(data) || nrow(data) == 0) {
cat(sprintf("Skipping sheet '%s': No data\n", sheet_name))
addWorksheet(wb, sheet_name)
writeData(wb, sheet_name, "No data available")
return()
}
addWorksheet(wb, sheet_name)
writeData(wb, sheet_name, title, startRow = 1, startCol = 1)
writeData(wb, sheet_name, data, startRow = 3, startCol = 1, headerStyle = header_style)
mergeCells(wb, sheet_name, cols = 1:ncol(data), rows = 1)
setColWidths(wb, sheet_name, cols = 1:ncol(data), widths = "auto")
addStyle(wb, sheet_name, title_style, rows = 1, cols = 1)
}
for (tab in tabs) {
add_formatted_sheet(wb, tab, master_data[[tab]], toc_data$Title[tabs == tab])
}
saveWorkbook(wb, master_file, overwrite = TRUE)
cat(sprintf("Master file saved to: %s\n", master_file))
Python post-processing: parse the r2 value out of the File column of the master Excel and add it as a new R2 column.
import pandas as pd
import openpyxl
from openpyxl.utils.dataframe import dataframe_to_rows
import re
# File paths
input_file = "/Volumes/LaCie/tahir/mr_ukbppp_bp/Master_MR_Results_SBP_2025-04-30_Modified.xlsx"
output_file = "/Volumes/LaCie/tahir/mr_ukbppp_bp/Master_MR_Results_SBP_2025-04-30_Modified_v3.xlsx"
# Load the Excel file
wb = openpyxl.load_workbook(input_file)
sheets_to_process = ["IVW", "MR-Egger", "Weighted_Median", "Wald_Ratios", "Heterogeneity_Test", "Pleiotropy_Test", "Leave_One_Out", "Instruments", "Clumped_Data"]
# Function to extract r2 value from file name and map to R2
def extract_and_convert_r2(file_name):
if pd.isna(file_name):
return pd.NA
# Extract the r2 value from the file name (e.g., r2r2_0001)
match = re.search(r'r2r2_(\d+)_', str(file_name))
if not match:
print(f"Could not extract r2 value from file name: {file_name}")
return pd.NA
r2_value = match.group(1) # e.g., "0001"
# Map to R2
if r2_value == "0001":
return "0.001"
elif r2_value == "001":
return "0.01"
elif r2_value == "01":
return "0.1"
else:
print(f"Unexpected r2 value extracted: {r2_value}")
return pd.NA
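# e.g. extract_and_convert_r2("MR_Results_..._SBP_r2r2_0001_2025-04-30.xlsx") returns "0.001"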
for sheet_name in sheets_to_process:
print(f"Processing sheet: {sheet_name}")
# Read the sheet into a DataFrame
df = pd.read_excel(input_file, sheet_name=sheet_name, header=2)
# Check if the sheet has data
if df.empty or (df.shape[0] == 1 and df.iloc[0, 0] == "No data available"):
print(f"Sheet {sheet_name} is empty or has no data. Skipping.")
continue
# Ensure File column exists (should already be renamed from Clump_R2)
if "File" not in df.columns:
if "Clump_R2" in df.columns:
df = df.rename(columns={"Clump_R2": "File"})
print(f"Renamed Clump_R2 to File in sheet {sheet_name}")
else:
print(f"Sheet {sheet_name} does not have a File or Clump_R2 column. Adding R2 with NA values.")
df["R2"] = pd.NA
# Write back to sheet and continue
ws = wb[sheet_name]
for row in ws["A3":f"{openpyxl.utils.get_column_letter(ws.max_column)}{ws.max_row}"]:
for cell in row:
cell.value = None
for r_idx, row in enumerate(dataframe_to_rows(df, index=False, header=True), 3):
for c_idx, value in enumerate(row, 1):
ws.cell(row=r_idx, column=c_idx, value=value)
continue
# Debug: Print unique values in File column
print(f"Unique values in File column for sheet {sheet_name}: {df['File'].unique()}")
# Create new R2 column by extracting r2 from File column
df["R2"] = df["File"].apply(extract_and_convert_r2)
# Debug: Check if R2 column has values
print(f"R2 column values for sheet {sheet_name}: {df['R2'].unique()}")
# Write the modified DataFrame back to the sheet
ws = wb[sheet_name]
# Clear existing data below the title (row 1) and header (row 3)
for row in ws["A3":f"{openpyxl.utils.get_column_letter(ws.max_column)}{ws.max_row}"]:
for cell in row:
cell.value = None
# Write the updated DataFrame starting from row 3
for r_idx, row in enumerate(dataframe_to_rows(df, index=False, header=True), 3):
for c_idx, value in enumerate(row, 1):
ws.cell(row=r_idx, column=c_idx, value=value)
# Save the modified workbook
wb.save(output_file)
print(f"Modified Excel file saved to: {output_file}")
Colocalization analysis: run coloc.abf on each tss_harmonized file to test for a shared causal variant between protein levels and SBP.
# Clear environment
rm(list = ls())
# Load required libraries
library(coloc)
library(data.table)
library(dplyr)
library(openxlsx)
# Set base directory
base_dir <- "/Volumes/LaCie/tahir/mr_ukbppp_bp"
results_dir <- file.path(base_dir, "results_173_tss")
output_file <- file.path(base_dir, sprintf("Coloc_Results_SBP_%s.xlsx", Sys.Date()))
# Define subdirectory to process (choose one)
r2_dirs <- "r2_01" # You can change this to "r2_001" or "r2_0001" if desired
# Find all tss_harmonized files in the specified r2 directory
tss_files <- unlist(lapply(r2_dirs, function(r2_dir) {
protein_dirs <- list.dirs(file.path(results_dir, r2_dir), recursive = FALSE)
unlist(lapply(protein_dirs, function(protein_dir) {
files <- list.files(protein_dir, pattern = "^tss_harmonized_.*\\.csv$", full.names = TRUE)
return(files)
}))
}))
if (length(tss_files) == 0) {
stop("No tss_harmonized files found in the subdirectory.")
}
cat(sprintf("Found %d tss_harmonized files to process in %s.\n", length(tss_files), r2_dirs))
# Function to extract protein ID from file name
extract_info <- function(file_path) {
file_name <- basename(file_path)
# Extract protein ID (e.g., WIF1_Q9Y5W5_OID21464)
protein_id <- sub("^tss_harmonized_(.*)_SBP_r2r2_.*\\.csv$", "\\1", file_name)
list(protein_id = protein_id, file_name = file_name)
}
# Function to run colocalization analysis
run_coloc <- function(tss_file) {
# Extract info
info <- extract_info(tss_file)
protein_id <- info$protein_id
file_name <- info$file_name
cat(sprintf("Processing %s...\n", protein_id))
# Read the harmonized data
dat <- tryCatch({
fread(tss_file)
}, error = function(e) {
cat(sprintf("Failed to read %s: %s\n", tss_file, e$message))
return(NULL)
})
# Check if data is empty or insufficient
if (is.null(dat) || nrow(dat) < 1) {
cat(sprintf("No data in %s. Skipping.\n", tss_file))
return(NULL)
}
# Verify required columns exist
required_cols <- c("SNP", "beta.exposure", "se.exposure", "samplesize.exposure",
"beta.outcome", "se.outcome", "samplesize.outcome",
"eaf.exposure", "eaf.outcome")
missing_cols <- setdiff(required_cols, colnames(dat))
if (length(missing_cols) > 0) {
cat(sprintf("Missing required columns in %s: %s\n", tss_file, paste(missing_cols, collapse = ", ")))
return(NULL)
}
# Remove duplicate SNPs (keep first occurrence)
if (any(duplicated(dat$SNP))) {
cat(sprintf("Found %d duplicate SNPs in %s. Keeping first occurrence.\n",
sum(duplicated(dat$SNP)), protein_id))
dat <- dat[!duplicated(dat$SNP), ]
}
# Check for NA or invalid values
if (any(is.na(dat$beta.exposure)) || any(is.na(dat$se.exposure)) ||
any(is.na(dat$beta.outcome)) || any(is.na(dat$se.outcome)) ||
any(is.na(dat$eaf.exposure)) || any(is.na(dat$eaf.outcome))) {
cat(sprintf("NA values found in critical columns for %s. Skipping.\n", protein_id))
return(NULL)
}
# Compute MAF (minor allele frequency) as min(eaf, 1-eaf)
maf_exposure <- pmin(dat$eaf.exposure, 1 - dat$eaf.exposure)
maf_outcome <- pmin(dat$eaf.outcome, 1 - dat$eaf.outcome)
# Prepare exposure dataset (protein)
exposure_data <- list(
snp = dat$SNP,
beta = dat$beta.exposure,
varbeta = (dat$se.exposure)^2,
N = dat$samplesize.exposure[1], # Assuming same sample size for all SNPs
MAF = maf_exposure,
type = "quant"
)
# Prepare outcome dataset (SBP)
outcome_data <- list(
snp = dat$SNP,
beta = dat$beta.outcome,
varbeta = (dat$se.outcome)^2,
N = dat$samplesize.outcome[1], # Assuming same sample size for all SNPs
MAF = maf_outcome,
type = "quant"
)
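# coloc.abf() uses default priors p1 = 1e-4, p2 = 1e-4, p12 = 1e-5 unless overridden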
# Run colocalization analysis
coloc_result <- tryCatch({
coloc.abf(dataset1 = exposure_data, dataset2 = outcome_data)
}, error = function(e) {
cat(sprintf("Coloc failed for %s: %s\n", protein_id, e$message))
return(NULL)
})
if (is.null(coloc_result)) {
return(NULL)
}
# Extract summary results
summary <- coloc_result$summary
result <- data.frame(
Protein_ID = protein_id,
File = file_name,
NSNPs = summary["nsnps"],
PP_H0 = summary["PP.H0.abf"],
PP_H1 = summary["PP.H1.abf"],
PP_H2 = summary["PP.H2.abf"],
PP_H3 = summary["PP.H3.abf"],
PP_H4 = summary["PP.H4.abf"],
stringsAsFactors = FALSE
)
return(result)
}
# Process all files
coloc_results <- lapply(tss_files, run_coloc)
coloc_results <- do.call(rbind, Filter(Negate(is.null), coloc_results))
if (is.null(coloc_results) || nrow(coloc_results) == 0) {
stop("No colocalization results to save.")
}
# Clean and format the results (remove R2 and P-value Threshold columns)
coloc_results_clean <- coloc_results %>%
rename(
"Protein ID" = Protein_ID,
"No. SNPs" = NSNPs,
"PP H0 (No Association)" = PP_H0,
"PP H1 (Exposure Only)" = PP_H1,
"PP H2 (Outcome Only)" = PP_H2,
"PP H3 (Distinct Variants)" = PP_H3,
"PP H4 (Shared Variant)" = PP_H4
) %>%
mutate(
across(starts_with("PP "), ~ round(.x, 4))
) %>%
dplyr::select(
"Protein ID", File, "No. SNPs",
"PP H0 (No Association)", "PP H1 (Exposure Only)", "PP H2 (Outcome Only)",
"PP H3 (Distinct Variants)", "PP H4 (Shared Variant)"
)
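# By convention, PP H4 > 0.8 is taken as strong evidence of a shared causal variant, e.g.:
# coloc_results_clean[coloc_results_clean$`PP H4 (Shared Variant)` > 0.8, ]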
# Create Excel workbook with formatted headers
wb <- createWorkbook()
title_style <- createStyle(fontSize = 14, fontColour = "black", textDecoration = "bold", halign = "center")
header_style <- createStyle(fontColour = "white", fgFill = "#C8A2C8", textDecoration = "bold", halign = "center", border = "TopBottomLeftRight")
# Add sheet for colocalization results
addWorksheet(wb, "Colocalization Results")
writeData(wb, "Colocalization Results", "Table 1. Colocalization Analysis Results for Proteins and SBP", startRow = 1, startCol = 1)
mergeCells(wb, "Colocalization Results", cols = 1:ncol(coloc_results_clean), rows = 1)
addStyle(wb, "Colocalization Results", title_style, rows = 1, cols = 1)
writeData(wb, "Colocalization Results", coloc_results_clean, startRow = 3, startCol = 1, headerStyle = header_style)
setColWidths(wb, "Colocalization Results", cols = 1:ncol(coloc_results_clean), widths = "auto")
# Save the workbook
saveWorkbook(wb, output_file, overwrite = TRUE)
cat(sprintf("Colocalization results saved to: %s\n", output_file))
The results are shared via a Shiny app: https://yodamendel.shinyapps.io/mr_ukbppp_bp_app/
# Run with:
# rsconnect::deployApp('/Volumes/LaCie/tahir/mr_ukbppp_bp/mr_ukbppp_bp_app')
library(shiny)
# ---- UI ----
ui <- fluidPage(
# ---- HEAD STYLES ----
tags$head(
tags$style(HTML("
body {
font-family: 'Lato', sans-serif;
color: #1A2526;
background-color: #FFFFFF; /* Pure white background */
}
#main_title {
color: #005566;
margin-bottom: 10px;
font-size: 32px;
font-weight: bold;
}
#lab_link {
display: inline-block;
margin-top: 10px;
font-size: 18px;
color: #A51C30;
text-decoration: none;
font-weight: bold;
}
#lab_link:hover {
color: #7A1422;
}
.info-box {
background-color: #F5F9FF;
border: 2px solid #005566;
border-radius: 10px;
padding: 20px;
margin-bottom: 30px;
box-shadow: 2px 2px 8px rgba(0,0,0,0.1);
}
.download-box {
background-color: #FFF5F5;
border: 2px solid #A51C30;
border-radius: 10px;
padding: 20px;
margin-bottom: 30px;
box-shadow: 2px 2px 8px rgba(0,0,0,0.1);
}
.download-button {
margin-bottom: 15px;
}
.btn-primary {
background-color: #005566;
border-color: #003f4d;
font-size: 16px;
padding: 10px 20px;
border-radius: 8px;
color: #ffffff;
text-decoration: none;
display: inline-block;
transition: background-color 0.3s ease;
}
.btn-primary:hover {
background-color: #007c91;
color: #ffffff;
}
#logo_container {
text-align: center;
margin-top: 30px;
margin-bottom: 20px;
}
#logo_image {
width: 200px;
height: auto;
}
"))
),
# ---- HEADER TITLE & LAB LINK ----
div(
style = "text-align: center; margin-bottom: 10px;",
h1(HTML("Cis-MR and Colocalization of Selected UKB-PPP Proteins & SBP"), id = "main_title")
),
div(
style = "text-align: center; margin-bottom: 30px;",
tags$a(
href = "https://www.bidmc.org/research/research-by-department/medicine/cardiovascular-medicine/personal-genomics-and-cardiometabolic-disease",
target = "_blank",
"🔬 Our Lab",
id = "lab_link"
)
),
# ---- DESCRIPTION BOX ----
div(class = "info-box",
h3("🧬 Analysis Summary"),
p(HTML("
This platform showcases Mendelian Randomization (MR) and colocalization analyses of selected proteins from the UK Biobank Pharma Proteomics Project (UKB-PPP) with systolic blood pressure (SBP).<br><br>
<b>Dataset:</b> We used GWAS summary statistics for SBP (Dataset ID: ieu-b-38) from Evangelou et al. (2018, PMID: <a href='https://pmc.ncbi.nlm.nih.gov/articles/PMC6284793/' target='_blank'>30224653</a>), a meta-analysis of UK Biobank and the International Consortium of Blood Pressure (ICBP).<br><br>
<b>MR Analysis:</b> We constructed <b>cis-acting genetic instruments</b> (1 Mb around the TSS), filtered at <b>P < 5e-6</b> with LD clumping at <b>R² thresholds of 0.1, 0.01, and 0.001</b>. Effect estimates (b) are presented as <b>betas</b>.<br><br>
<b>Colocalization Analysis:</b> We performed colocalization analysis to test for shared causal variants between protein levels and SBP, reporting posterior probabilities for hypotheses H0 to H4.<br><br>
<b>Downloadable File Includes:</b><br><br>
✅ <b>MR Results:</b> Results for selected UKB-PPP proteins across three R² thresholds (0.1, 0.01, 0.001), including IVW, MR-Egger, Weighted Median, MR-LASSO, MR-ConMix, Wald Ratios, heterogeneity, pleiotropy, and instrument data.<br><br>
✅ <b>Colocalization Results:</b> Posterior probabilities for shared causal variants between protein levels and SBP (last tab in the Excel) <br><br>
<b>Notes and Scripts</b> (Walkthrough + Annotated Code):<br>
👉 <a href='https://rpubs.com/YodaMendel/1304438' target='_blank'>https://rpubs.com/YodaMendel/1304438</a>
"))
),
# ---- DOWNLOAD BOX ----
div(class = "download-box",
h3("📦 Download Results"),
p("Click below to download the master Excel file containing MR and colocalization results."),
div(
class = "download-button",
downloadButton(
outputId = "download_mr_results",
label = "Download Master MR and Coloc Results (Excel)",
class = "btn-primary"
)
)
),
# ---- BIDMC LOGO WITH HYPERLINK ----
div(
id = "logo_container",
tags$a(
href = "https://www.bidmc.org/research/research-by-department/medicine/cardiovascular-medicine/personal-genomics-and-cardiometabolic-disease",
target = "_blank",
tags$img(
src = "BIDMC_HMS_Stacked-LockUp.png",
id = "logo_image",
alt = "Beth Israel Deaconess Medical Center Logo"
)
)
)
)
# ---- SERVER ----
server <- function(input, output, session) {
# Download handler for the MR and colocalization results Excel file
output$download_mr_results <- downloadHandler(
filename = function() {
"Master_MR_Results_SBP_2025-04-30_Modified_v3.xlsx"
},
content = function(file) {
file.copy("www/Master_MR_Results_SBP_2025-04-30_Modified_v3.xlsx", file)
},
contentType = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
)
}
# ---- RUN APP ----
shinyApp(ui = ui, server = server)