Cis-MR of 183 UKB-PPP proteins and CHD

1 Methods
2 Bioinformatics Prep: CHD file
5 MR Loop (Cardiogram)
6 MR Loop (FinnGen)
7 Comparison of Cardiogram and FinnGen
8 Formating the Merged Excel Results
9 App

1 Methods

1.1 Study Overview

We conducted a two-sample Mendelian Randomization (MR) analysis to evaluate the causal effect of 183 UKB-PPP proteins circulating protein levels (1 Mb region surrounding their respective transcription start sites [TSSs]), on coronary heart disease (CHD) using CARDIoGRAMplusC4D and FinnGen summary statistics.

1.2 Data

I had previously obtained the pQTL files (see https://rpubs.com/YodaMendel/1243451 for an example of how to programmatically get files from UKB-PPP). For obtaining the files and adding rsids, refer to:

/Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_chemokine_list/ukbpp_chemokines_40.Rmd
https://rpubs.com/YodaMendel/1268255
https://rpubs.com/YodaMendel/1243451

1.3 10 previously selected UKB-PPP proteins

I initially ran a similar analysis with 10 proteins.

SHISA5
IGFBP7
STC2
SLURP1
SPINT1
HYOU1
SELL
CD59
FGF3
CYTL1

Now expanded to 183.

1.4 Cardiogram CHD

cd /Users/charleenadams/mr_ukbppp_chd
wget https://gwas.mrcieu.ac.uk/files/ieu-a-7/ieu-a-7.vcf.gz

MRC IEU ID: ieu-a-7
PMID    26343387
Year    2015
Category:   Disease
Sub category?   Cardiovascular
Population: Mixed (European and South Asian)
Sex Males and Females
ncase:  60,801
ncontrol:   123,504
Sample size:    184,305
Number of SNPs: 9,455,779
Unit:   log odds
Priority    1
Author: Nikpay
Consortium: CARDIoGRAMplusC4D
Build:  HG19/GRCh37

Other CARDIoGRAM GWASes: https://www.cardiogramplusc4d.org/data-downloads/

1.5 FinnGen CHD

FinnGen R12 summary stats
Released Oct 25th 2023

Update Mar 2024: Added summary stats for chr X pseudoautosomal regions (PAR)

Summary stats are available in gs://finngen-production-library-green/finngen_R12/finngen_R12_analysis_data/summary_stats/release/
Analysis document is available in gs://finngen-production-library-green/finngen_R12/finngen_R12_analysis_documentation/finngen_R12_analysis_document_GWAS.pdf

Columns:

#chrom  chromosome
pos chromosome position in GRCh38
ref reference allele
alt alternative (effect) allele
rsids   rsid(s), comma separated
nearest_genes   nearest gene or nearest genes (comma separated) when more than one gene overlap the position
pval    p-value
mlogp   -log10 p-value
beta    effect size beta (log(OR) scale)
sebeta  standard error of effect size
af_alt  alternative allele frequency in cases+controls
af_alt_cases    alternative allele frequency in cases
af_alt_controls alternative allele frequency in controls

Major coronary heart disease event
IX Diseases of the circulatory system (I9_)
https://r12.finngen.fi/pheno/I9_CHD
https://risteys.finngen.fi/endpoints/I9_CHD

56650 cases
443698 controls

# Locaton and details about files:

https://console.cloud.google.com/storage/browser/_details/finngen-public-data-r12/summary_stats/release/finngen_R12_I9_CHD.gz?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%257B_22k_22_3A_22_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22finngen_R12_I9_CHD.gz_5C_22_22_2C_22s_22_3Atrue%257D%255D%22,%22s%22:%5B(%22i%22:%22objectMetadata%2Fsize%22,%22s%22:%220%22),(%22i%22:%22displayName%22,%22s%22:%220%22)%5D))&inv=1&invt=Absn_g

https://console.cloud.google.com/storage/browser/finngen-public-data-r12/summary_stats/release?inv=1&invt=Absn_g&pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%257B_22k_22_3A_22_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22finngen_R12_I9_CHD.gz_5C_22_22_2C_22s_22_3Atrue%257D%255D%22,%22s%22:%5B(%22i%22:%22objectMetadata%2Fsize%22,%22s%22:%220%22),(%22i%22:%22displayName%22,%22s%22:%220%22)%5D))&prefix=&forceOnObjectsSortingFiltering=true

2 Bioinformatics Prep: CHD file

Step 1: Bash script calling bcftools to convert vcf to tsv and format the headers for TwoSampleMR

Dependencies:

•   bcftools (for querying VCF)
•   awk (for formatting)
•   bgzip (for compressing the output)

run with: cd /Users/charleenadams/mr_ukbppp_chd/scripts chmod +x format_ieu_vcf.sh nohup ./format_ieu_vcf.sh > format_ieu_vcf.log 2>&1 & ps -p 96562 -o stat,etime

#!/bin/bash

# Input and output file paths
INPUT_VCF="/Users/charleenadams/mr_ukbppp_chd/ieu-a-7.vcf.gz"
OUTPUT_TSV="/Users/charleenadams/mr_ukbppp_chd/data/ieu-a-7_formatted.tsv.gz"

# Check if bcftools is installed
if ! command -v bcftools &> /dev/null; then
    echo "Error: bcftools is not installed. Install it with 'brew install bcftools' or 'apt install bcftools'."
    exit 1
fi

# Extract the required columns from the FORMAT field, ensuring the correct order
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t[%ES\t%SE\t%LP\t%AF\t%SS]\n' $INPUT_VCF | \
awk -v OFS="\t" '
BEGIN {print "#CHROM", "POS", "ID", "REF", "ALT", "BETA", "SE", "PVAL", "A1FREQ", "N"}
{
    pval = 10^(-$8);  # Convert LP (log10 p-value) to actual p-value
    print $1, $2, $3, $4, $5, $6, $7, pval, $9, $10
}' | bgzip > $OUTPUT_TSV

echo "Formatted file saved as $OUTPUT_TSV"

Step 2: R to verify p-value conversion

# Load necessary library
library(data.table)

# Define file path
file_path <- "/Users/charleenadams/mr_ukbppp_chd/data_183/mr_files/ieu-a-7_formatted.tsv.gz"

# Read the gzipped TSV file using gunzip -c (macOS-friendly)
data <- fread(cmd = paste("gunzip -c", file_path), sep = "\t", header = TRUE)

# Display first few rows
print(head(data))

# Summary statistics for numeric columns
summary(data)

# Check column names
colnames(data)
table(data$`#CHROM`) # doesn't have chrom X info

x=data[which(data$`#CHROM`=="X"),]

# Look at FinnGen CHD
file_path_fin <- "/Users/charleenadams/mr_ukbppp_chd/data_183/mr_files/summary_stats_release_finngen_R12_I9_CHD.gz"

# Read the gzipped TSV file using gunzip -c (macOS-friendly)
data_fin <- fread(cmd = paste("gunzip -c", file_path_fin), sep = "\t", header = TRUE)

# Display first few rows
print(head(data_fin))

# Summary statistics for numeric columns
summary(data_fin)

# Check column names
colnames(data_fin)
table(data_fin$`#chrom`) # X is "23"

2.1 Number of SNPs in Cardiogram CHD

8,479,694

for f in /Users/charleenadams/mr_ukbppp_chd/data_183/mr_files/ieu-a-7_formatted.tsv.gz; do
    echo -n "$f: "
    gzcat "$f" | wc -l
done
# 8,479,694

## /Users/charleenadams/mr_ukbppp_chd/data_183/mr_files/ieu-a-7_formatted.tsv.gz:  8479694

2.2 Number of SNPs in FinnGen CHD

21,327,063

for f in /Users/charleenadams/mr_ukbppp_chd/data_183/mr_files/summary_stats_release_finngen_R12_I9_CHD.gz; do
    echo -n "$f: "
    gzcat "$f" | wc -l
done
# 21,327,063

## /Users/charleenadams/mr_ukbppp_chd/data_183/mr_files/summary_stats_release_finngen_R12_I9_CHD.gz:  21327063

3 Bioinformatics Prep: UKB-PPP files

3.1 Move 183 protein subdirectories

Dan sent the list via email, which caused some funky hidden characters to appear when I copied his list to a csv. I lost an hour’s time trying to figure it out. Kill me. Just kill me.

run with: tr -d '\r' < /Users/charleenadams/mr_ukbppp_chd/protein_list_183.csv > /Users/charleenadams/mr_ukbppp_chd/protein_list_183_clean.csv cat -vet /Users/charleenadams/mr_ukbppp_chd/protein_list_183_clean.csv | head cd /Users/charleenadams/mr_ukbppp_chd/scripts chmod +x copy_protein_dirs_183.sh nohup ./copy_protein_dirs_183.sh > copy_protein_dirs_183.log 2>&1

#!/bin/bash

# Source and destination directories
SRC_DIR="/Users/charleenadams/ukbppp/coloc_clin_mol_extracted"
DEST_DIR="/Users/charleenadams/mr_ukbppp_chd/data_183"
PROTEIN_LIST="/Users/charleenadams/mr_ukbppp_chd/protein_list_183_clean.csv"

# Make sure the destination directory exists
mkdir -p "$DEST_DIR"

# Start the process
echo "🚀 Starting directory search and copy..."

# Loop through each protein name, skipping the header
tail -n +2 "$PROTEIN_LIST" | while IFS= read -r PROTEIN; do
    echo "🔍 Searching for folders matching: *$PROTEIN*"

    # Find matching directories
    MATCHES=$(find "$SRC_DIR" -type d -name "*$PROTEIN*" 2>/dev/null)

    if [[ -n "$MATCHES" ]]; then
        echo "$MATCHES" | while read -r DIR; do
            echo "✅ Copying $DIR to $DEST_DIR"
            cp -R "$DIR" "$DEST_DIR"
        done
    else
        echo "❌ NOT FOUND: No directory found for $PROTEIN"
    fi
done

echo "🎉 All done!"

3.2 Virtual environment

conda env list

For Python3 to tackle merging, header delim issues, and LOG10P conversion

Step 1:

Create the Python3 script to run in parallel with all but 1 available core
to run Python3 to merge the files within each subdirectory,
to deal with mixed delimins in header
convert LOG10P to P

#!/usr/bin/env python3

import os
import psutil  # <-- Need this for RAM and CPU stats! Install via: pip install psutil
import pandas as pd
import numpy as np
import gzip
import multiprocessing
import time

# Base directory where the subdirectories exist
base_dir = "/Users/charleenadams/mr_ukbppp_chd/data_183"

def auto_tune_cores(ram_per_process_gb=4, cpu_threshold=80):
    """
    Dynamically calculate number of cores based on available RAM and CPU usage.
    """
    # Total logical cores
    total_cores = os.cpu_count()

    # Get available RAM (in GB)
    available_ram_gb = psutil.virtual_memory().available / (1024 ** 3)

    # Estimate safe core usage based on RAM
    cores_by_ram = int(available_ram_gb // ram_per_process_gb)

    # CPU load (average over 1 minute)
    cpu_load_percent = psutil.cpu_percent(interval=1)

    # If CPU is already working hard, reduce the core count
    if cpu_load_percent > cpu_threshold:
        print(f"⚠️ CPU load is {cpu_load_percent}%. Throttling cores down.")
        cores_by_cpu = max(1, total_cores // 4)
    else:
        cores_by_cpu = total_cores

    # Take the safer limit
    optimal_cores = max(1, min(cores_by_ram, cores_by_cpu, total_cores - 1))

    print(f"🧠 Available RAM: {available_ram_gb:.2f} GB")
    print(f"⚙️ CPU load: {cpu_load_percent:.2f}%")
    print(f"🚀 Using {optimal_cores} out of {total_cores} cores.")

    return optimal_cores

def process_subdir(subdir):
    """Function to process and merge files in a single subdirectory"""
    subdir_path = os.path.join(base_dir, subdir)

    if not os.path.isdir(subdir_path):
        print(f"⛔ Skipping {subdir} (not a directory)")
        return

    merged_file_path = os.path.join(subdir_path, f"{subdir}_merged.tsv.gz")

    if os.path.exists(merged_file_path):
        print(f"✅ Already merged: {merged_file_path}")
        return

    merged_data = []
    file_count = 0

    print(f"🔨 Processing {subdir}...")

    for filename in sorted(os.listdir(subdir_path)):
        if not filename.endswith("_rsid_added.gz"):
            continue  # Skip non-target files

        file_path = os.path.join(subdir_path, filename)

        try:
            with gzip.open(file_path, "rt") as f:
                df = pd.read_csv(f, sep=r"\s+", engine="python")

            # Standardize column names
            df.columns = df.columns.str.strip().str.upper()

            if "LOG10P" not in df.columns:
                raise ValueError(f"LOG10P column missing in {filename}, columns are: {df.columns.tolist()}")

            # Add the P-value
            df["P"] = 10 ** (-df["LOG10P"])

            merged_data.append(df)
            file_count += 1

        except Exception as e:
            print(f"💥 Error processing {filename} in {subdir}: {e}")

    if merged_data and file_count >= 23:
        merged_df = pd.concat(merged_data, ignore_index=True)
        merged_df.to_csv(merged_file_path, sep="\t", index=False, compression="gzip")
        print(f"🎉 Merged {file_count} files in {subdir} -> {merged_file_path}")
    elif merged_data:
        print(f"⚠️ Only {file_count} files processed in {subdir}, not merging.")
    else:
        print(f"🚫 No valid files found in {subdir}")

if __name__ == "__main__":
    # Auto-tune cores based on system load
    num_cores = auto_tune_cores()

    # Get all subdirectories to process
    all_subdirs = [d for d in os.listdir(base_dir) if os.path.isdir(os.path.join(base_dir, d))]

    print(f"🔧 Starting parallel processing for {len(all_subdirs)} directories using {num_cores} cores...\n")

    # Start the pool
    with multiprocessing.Pool(processes=num_cores) as pool:
        pool.map(process_subdir, all_subdirs)

    print("\n✅ ALL DONE!")

Step 2: Run and check number of SNPs per merged file

I did this because I’ve got multiple pip and python versions and don’t want to deal with it.

# virtual env
cd /Users/charleenadams/mr_ukbppp_chd/scripts
python3 -m venv ~/venvs/mr_env
source ~/venvs/mr_env/bin/activate
pip install pandas numpy

# run
nohup python3 merge_subdir_files_183.py > merge_subdir_files_183.log 2>&1 &
ps -p 4790 -o stat,etime

# Check the merged files are happening
find /Users/charleenadams/mr_ukbppp_chd/data_183 -name "*_merged.tsv.gz" | wc -l
watch -n 5 'find /Users/charleenadams/mr_ukbppp_chd/data_183 -name "*_merged.tsv.gz" | wc -l'

# Check the number of SNPs in each of the merged files: ~16,000,000
for f in /Users/charleenadams/mr_ukbppp_chd/data_183/*/*_merged.tsv.gz; do
    echo -n "$f: "
    gzcat "$f" | wc -l
done

Step 3: Check headers in R

This is to verify that it worked.

# Load necessary library
library(data.table)

# Define file path
file_path <- "/Users/charleenadams/mr_ukbppp_chd/data_183/WFDC2_Q14508_OID21505_v1_Oncology/WFDC2_Q14508_OID21505_v1_Oncology_merged.tsv.gz"

file_path <- "/Users/charleenadams/mr_ukbppp_chd/data_183/mr_files/LETM1_O95202_OID31018_v1_Neurology_II_merged.tsv.gz"

# Read the gzipped TSV file using gunzip -c (macOS-friendly)
data <- fread(cmd = paste("gunzip -c", file_path), sep = "\t", header = TRUE)

# Display first few rows
print(head(data))

# Summary statistics for numeric columns
summary(data)

# Check column names
colnames(data)
table(data$CHROM)

4 Get the files into one spot

Copy the 183 (really 227) merged files to data_183/mr_files

mkdir -p /Users/charleenadams/mr_ukbppp_chd/data_183/mr_files
find /Users/charleenadams/mr_ukbppp_chd/data_183 -name "*_merged.tsv.gz" -exec cp {} /Users/charleenadams/mr_ukbppp_chd/data_183/mr_files/ \;

cd /Users/charleenadams/mr_ukbppp_chd/data/mr_files
cp ieu-a-7_formatted.tsv.gz /Users/charleenadams/mr_ukbppp_chd/data_183/mr_files/

5 MR Loop (Cardiogram)

# run with:
nohup Rscript /Users/charleenadams/mr_ukbppp_chd/scripts/mr_loop_183_cardiogram_tss.R > /Users/charleenadams/mr_ukbppp_chd/scripts/mr_loop_183_cardiogram_tss.log 2>&1 &
ps -p 30446 -o stat,etime

Lessons: Code that works for handfuls of analyses often fails at scale, not for any fault of the code–due to external servers, such as MRC-IEU or ensembl. External servers and api’s have rate limits and internal server clogs. This makes using on them for scaled up analyses utterly unreliable. The solution is finding out how to do what the external groups do and do it oneself locally. I couldn’t run ld_clump or fetch TSS info from ensembl reliably for this loop. Instead, I obtained local copies of plink binaries and an ensembl annotation files (builds 37 and 38) that included alias gene names.

Also, the Cardiogram data didn’t have X chromosome.

rm(list = ls())

library(TwoSampleMR)
library(MendelianRandomization)
library(ieugwasr)
library(MRInstruments)
library(dplyr)
library(tidyr)
library(data.table)
library(openxlsx)
library(ggplot2)
library(scales)
library(RhpcBLASctl)
library(patchwork)

# Define %||% operator for fallback
`%||%` <- function(x, y) if (length(x) > 0) x else y

# Set memory limit to match RAM (128 GB)
Sys.setenv("R_MAX_VSIZE" = 128000)
cat("Max vector size set to:", Sys.getenv("R_MAX_VSIZE"), "MB\n")

# Set directories and files
base_dir <- "/Users/charleenadams/mr_ukbppp_chd"
data_dir <- file.path(base_dir, "data_183", "mr_files")
results_dir <- file.path(base_dir, "results_183_tss")
if (!dir.exists(results_dir)) dir.create(results_dir, recursive = TRUE)

# Highlighted out because run before the main script
# # Prepare ensembl 37 file
# mart <- fread("/Users/charleenadams/mr_ukbppp_chd/mart_export_37_alias.txt")
# 
# # Step 1: Convert Chromosome X to 23
# mart$`Chromosome/scaffold name`[mart$`Chromosome/scaffold name` == "X"] <- "23"
# 
# # Step 2: Filter out chromosomes that are not 1-23
# # First convert column to character (if not already)
# mart$`Chromosome/scaffold name` <- as.character(mart$`Chromosome/scaffold name`)
# 
# # Then filter
# mart_clean <- mart[mart$`Chromosome/scaffold name` %in% as.character(1:23), ]
# 
# # Save as a tab-delimited text file
# write.table(mart_clean,
#             file = "mart_clean_37_alias.txt",  # Change the path and filename as needed
#             sep = "\t",               # Tab delimiter
#             row.names = FALSE,        # Do not write row numbers
#             quote = FALSE)            # Do not add quotes around character fields

local_annotation_file <- "/Users/charleenadams/mr_ukbppp_chd/mart_clean_37_alias.txt" #downloaded from ensembl and curated; build 37
chd_file <- file.path(data_dir, "ieu-a-7_formatted.tsv.gz")
protein_files <- list.files(data_dir, pattern = "^[A-Z0-9]+_[A-Z0-9]+_OID[0-9]+_v1_.*\\.tsv\\.gz$", full.names = TRUE)
protein_files <- protein_files[!grepl("ieu-a-7_formatted.tsv.gz", protein_files)]
cat("Protein files to process:\n")
print(protein_files)

# Load annotation file
if (!file.exists(local_annotation_file)) stop(sprintf("Local annotation file %s not found", local_annotation_file))
gene_data <- fread(local_annotation_file, data.table = FALSE)
cat(sprintf("Loaded local annotation file with %d genes\n", nrow(gene_data)))

# Set up logging
log_file <- file.path(base_dir, "mr_analysis.log")
sink(log_file, append = TRUE, split = TRUE)
cat(sprintf("Starting sequential MR analysis on %s\n", Sys.time()))

forest_plots <- list()

# Format exposure function with TSS filter
format_exposure <- function(file, protein_chrom, tss_start, tss_end) {
  dat <- fread(file) %>%
    filter(
      CHROM == protein_chrom,
      POS19 >= tss_start,
      POS19 <= tss_end
    ) %>%
    dplyr::select(
      SNP = RSID,
      chr.exposure = CHROM,
      pos.exposure = POS19,
      effect_allele.exposure = ALLELE1,
      other_allele.exposure = ALLELE0,
      eaf.exposure = A1FREQ,
      beta.exposure = BETA,
      se.exposure = SE,
      pval.exposure = P,
      samplesize.exposure = N
    ) %>%
    mutate(
      exposure = sub("_v1_.*$", "", basename(file)),
      id.exposure = exposure,
      chr.exposure = as.character(chr.exposure)
    ) %>%
    filter(!is.na(SNP) & grepl("^rs", SNP))
  cat(sprintf("Exposure SNPs for %s in TSS window: %d\n", basename(file), nrow(dat)))
  return(dat)
}

# Format outcome function with TSS filter
format_outcome <- function(file, protein_chrom, tss_start, tss_end) {
  dat <- fread(file) %>%
    filter(
      `#CHROM` == protein_chrom,
      POS >= tss_start,
      POS <= tss_end
    ) %>%
    dplyr::select(
      SNP = ID,
      chr.outcome = `#CHROM`,
      pos.outcome = POS,
      effect_allele.outcome = ALT,
      other_allele.outcome = REF,
      eaf.outcome = A1FREQ,
      beta.outcome = BETA,
      se.outcome = SE,
      pval.outcome = PVAL,
      samplesize.outcome = N
    ) %>%
    mutate(
      outcome = "CHD",
      id.outcome = "ieu-a-7",
      chr.outcome = as.character(chr.outcome)
    ) %>%
    filter(!is.na(SNP) & grepl("^rs", SNP))
  cat(sprintf("Outcome SNPs in TSS window (chr %s, %d-%d): %d\n", protein_chrom, tss_start, tss_end, nrow(dat)))
  return(dat)
}

# Modified perform_mr function with Gene Synonym fallback
perform_mr <- function(protein_file, chd_file, results_dir, gene_data) {
  olink_id <- sub("_v1_.*$", "", basename(protein_file))
  protein_dir <- file.path(results_dir, olink_id)
  today <- Sys.Date()
  
  excel_file <- file.path(protein_dir, sprintf("MR_Results_%s_CHD_%s.xlsx", olink_id, today))
  tss_harmonized_file <- file.path(protein_dir, sprintf("tss_harmonized_%s_CHD_%s.csv", olink_id, today))
  plot_file <- file.path(protein_dir, sprintf("MR_Forestplot_%s_CHD_%s.png", olink_id, today))
  
  # Check for any harmonized or result files
  existing_harmonized <- list.files(protein_dir, pattern = sprintf("(full|tss)_harmonized_%s_CHD_.*\\.csv$", olink_id), full.names = TRUE)
  if (length(existing_harmonized) > 0 || file.exists(excel_file) || file.exists(plot_file)) {
    cat(sprintf("Skipping %s: Existing files found (e.g., %s)\n", olink_id, existing_harmonized[1] %||% excel_file))
    return(NULL)
  }
  
  if (!dir.exists(protein_dir)) dir.create(protein_dir, recursive = TRUE)
  cat(sprintf("Processing %s...\n", olink_id))
  
  # Get TSS and chromosome
  protein <- strsplit(olink_id, "_")[[1]][1]
  cat(sprintf("Searching for protein: %s\n", protein))
  
  # First try Gene name
  protein_data <- gene_data %>%
    filter(`Gene name` == protein) %>%
    dplyr::select(`Chromosome/scaffold name`, `Transcription start site (TSS)`, `Gene stable ID version`, `Gene name`, `Gene Synonym`)
  
  # If no match found, try Gene Synonym
  if (nrow(protein_data) == 0) {
    cat(sprintf("No match found for Gene name '%s', trying Gene Synonym...\n", protein))
    protein_data <- gene_data %>%
      filter(`Gene Synonym` == protein) %>%
      dplyr::select(`Chromosome/scaffold name`, `Transcription start site (TSS)`, `Gene stable ID version`, `Gene name`, `Gene Synonym`)
  }
  
  if (nrow(protein_data) == 0) {
    cat(sprintf("No TSS data found for %s in either Gene name or Gene Synonym; skipping\n", olink_id))
    return(NULL)
  }
  
  cat(sprintf("Protein data for %s:\n", olink_id))
  print(protein_data)
  
  chrom_col <- "Chromosome/scaffold name"
  tss_col <- "Transcription start site (TSS)"
  tss_column <- protein_data[[tss_col]]
  
  if (all(is.na(tss_column)) || length(tss_column) == 0) {
    cat(sprintf("No valid TSS values for %s; skipping\n", olink_id))
    return(NULL)
  }
  
  protein_tss <- min(tss_column, na.rm = TRUE)
  protein_chrom <- protein_data[[chrom_col]][1]
  
  if (is.infinite(protein_tss) || is.na(protein_tss) || is.na(protein_chrom)) {
    cat(sprintf("Invalid TSS (%s) or chromosome (%s) for %s; skipping\n", protein_tss, protein_chrom, olink_id))
    return(NULL)
  }
  
  cat(sprintf("Using TSS: %d, Chromosome: %s for %s\n", protein_tss, protein_chrom, olink_id))
  
  window_size <- 500000
  tss_start <- protein_tss - window_size
  tss_end <- protein_tss + window_size
  
  # Filter and harmonize TSS data
  exp_dat <- format_exposure(protein_file, protein_chrom, tss_start, tss_end)
  if (nrow(exp_dat) == 0) {
    cat(sprintf("No exposure SNPs in TSS window for %s; skipping\n", olink_id))
    return(NULL)
  }
  
  out_dat <- format_outcome(chd_file, protein_chrom, tss_start, tss_end)
  if (nrow(out_dat) == 0) {
    cat(sprintf("No outcome SNPs in TSS window for %s; skipping\n", olink_id))
    return(NULL)
  }
  
  dat <- harmonise_data(exposure_dat = exp_dat, outcome_dat = out_dat, action = 1)
  write.csv(dat, tss_harmonized_file, row.names = FALSE)
  cat(sprintf("TSS harmonized data for %s saved to: %s\n", olink_id, tss_harmonized_file))
  
  instruments <- dat %>%
    filter(mr_keep == TRUE, pval.exposure < 5e-6) %>%
    mutate(
      rsid = SNP,
      pval = pval.exposure,
      id = sprintf("%s exposure", olink_id),
      F_stat = (beta.exposure / se.exposure)^2
    ) %>%
    filter(F_stat > 10)
  
  cat(sprintf("Number of instruments for %s after F-stat filtering: %d\n", olink_id, nrow(instruments)))
  
  if (nrow(instruments) == 0) {
    cat(sprintf("No instruments for %s after F-stat filtering; skipping\n", olink_id))
    return(NULL)
  }
  
  instruments <- steiger_filtering(instruments)
  if (!"steiger_dir" %in% colnames(instruments) || nrow(instruments) == 0) {
    cat(sprintf("Steiger filtering produced no valid instruments for %s; skipping\n", olink_id))
    return(NULL)
  }
  instruments <- instruments %>% filter(steiger_dir == TRUE)
  cat(sprintf("Number of instruments for %s after Steiger filtering: %d\n", olink_id, nrow(instruments)))
  
  if (nrow(instruments) == 0) {
    cat(sprintf("No instruments for %s after Steiger filtering; skipping\n", olink_id))
    return(NULL)
  }
  
  clumped_dat <- NULL
  tryCatch({
    clumped <- ld_clump(
      dplyr::tibble(rsid = instruments$rsid, pval = instruments$pval, id = instruments$id),
      plink_bin = genetics.binaRies::get_plink_binary(),
      bfile = "/Users/charleenadams/1000G_bfiles/EUR/EUR",
      clump_r2 = 0.001,
      clump_kb = 500
    )
    clumped_dat <- instruments %>% filter(SNP %in% clumped$rsid)
    cat(sprintf("Number of SNPs retained for %s after clumping: %d\n", olink_id, nrow(clumped_dat)))
  }, error = function(e) {
    cat(sprintf("LD clumping failed for %s: %s\nSkipping MR analysis for this protein.\n", olink_id, e$message))
    return(NULL)
  })
  
  if (is.null(clumped_dat) || nrow(clumped_dat) == 0) {
    cat(sprintf("No SNPs retained for %s after clumping; skipping MR analysis\n", olink_id))
    return(NULL)
  }
  
  # Perform MR analyses
  ivw_result <- mr(clumped_dat, method_list = "mr_ivw")
  egger_result <- if (nrow(clumped_dat) >= 3) {
    tryCatch(mr(clumped_dat, method_list = "mr_egger_regression"), error = function(e) NULL)
  } else NULL
  weighted_median_result <- mr(clumped_dat, method_list = "mr_weighted_median")
  
  mr_input <- mr_input(
    bx = clumped_dat$beta.exposure,
    bxse = clumped_dat$se.exposure,
    by = clumped_dat$beta.outcome,
    byse = clumped_dat$se.outcome,
    exposure = olink_id,
    outcome = "CHD",
    snps = clumped_dat$SNP
  )
  lasso_result <- tryCatch(mr_lasso(mr_input), error = function(e) { 
    cat(sprintf("MR-Lasso failed for %s: %s\n", olink_id, e$message)); NULL 
  })
  conmix_result <- tryCatch(mr_conmix(mr_input), error = function(e) { 
    cat(sprintf("MR-ConMix failed for %s: %s\n", olink_id, e$message)); NULL 
  })
  
  heterogeneity_result <- mr_heterogeneity(clumped_dat)
  pleiotropy_result <- if (nrow(clumped_dat) >= 3) mr_pleiotropy_test(clumped_dat) else NULL
  loo_result <- mr_leaveoneout(clumped_dat)
  
  wald_ratios <- clumped_dat %>%
    mutate(
      wald_beta = beta.outcome / beta.exposure,
      wald_se = sqrt((se.outcome^2 / beta.exposure^2) + ((beta.outcome^2 * se.exposure^2) / (beta.exposure^4))),
      pval = 2 * pnorm(abs(wald_beta / wald_se), lower.tail = FALSE),
      method = paste("Wald Ratio:", SNP)
    ) %>%
    dplyr::select(SNP, wald_beta, wald_se, pval, method)
  
  # Process MR results
  mr_results <- bind_rows(
    if (nrow(ivw_result) > 0) ivw_result[, c("method", "b", "se", "pval")],
    if (!is.null(egger_result) && nrow(egger_result) > 0) egger_result[, c("method", "b", "se", "pval")],
    if (nrow(weighted_median_result) > 0) weighted_median_result[, c("method", "b", "se", "pval")],
    if (!is.null(lasso_result)) data.frame(
      method = "MR-Lasso",
      b = lasso_result@Estimate,
      se = lasso_result@StdError,
      pval = lasso_result@Pvalue
    ),
    if (!is.null(conmix_result)) data.frame(
      method = "MR-ConMix",
      b = conmix_result@Estimate,
      se = (conmix_result@CIUpper - conmix_result@CILower)/(2*1.96),
      pval = conmix_result@Pvalue
    ),
    wald_ratios %>% dplyr::select(method, b = wald_beta, se = wald_se, pval)
  ) %>%
    mutate(
      method = factor(method, levels = unique(method)),
      or = exp(b),
      or_lower = exp(b - 1.96 * se),
      or_upper = exp(b + 1.96 * se)
    ) %>%
    na.omit()
  
  base_colors <- c("IVW" = "#1F78B4", "MR-Egger" = "#FF7F00", "Weighted Median" = "#33A02C",
                   "MR-Lasso" = "#FB9A99", "MR-ConMix" = "#E41A1C")
  if (nrow(mr_results) == 0 || !("method" %in% names(mr_results))) {
    cat(sprintf("Warning: No valid MR results for %s; using default colors\n", olink_id))
    color_list <- base_colors
  } else {
    wald_methods <- unique(mr_results$method[grepl("Wald Ratio", mr_results$method)])
    n_wald <- length(wald_methods)
    wald_colors <- if (n_wald > 0) setNames(hue_pal()(n_wald), wald_methods) else NULL
    color_list <- c(base_colors, wald_colors)
    color_list <- color_list[names(color_list) %in% unique(mr_results$method)]
  }
  
  x_min <- min(mr_results$or_lower, na.rm = TRUE)
  x_max <- max(mr_results$or_upper, na.rm = TRUE)
  forest_plot <- ggplot(mr_results, aes(x = or, y = method, color = method)) +
    geom_errorbarh(aes(xmin = or_lower, xmax = or_upper), height = 0.2, na.rm = TRUE, linewidth = 0.8) +
    geom_point(size = 3) +
    geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
    labs(
      title = sprintf("Mendelian Randomization Estimates:\n %s on CHD", protein),
      x = "Causal Effect (Odds Ratio)",
      y = ""
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
      axis.text.y = element_text(size = 12, face = "bold"),
      axis.text.x = element_text(size = 10),
      axis.title.x = element_text(size = 12),
      panel.grid.major = element_line(color = "grey90"),
      panel.grid.minor = element_blank(),
      legend.position = "none",
      panel.background = element_rect(fill = "white", color = NA),
      plot.background = element_rect(fill = "white", color = NA)
    ) +
    scale_color_manual(values = color_list) +
    xlim(x_min - 0.02 * (x_max - x_min), x_max + 0.02 * (x_max - x_min))
  
  forest_plots[[olink_id]] <<- forest_plot
  
  # Save outputs with overwrite protection
  wb <- createWorkbook()
  title_style <- createStyle(fontSize = 14, fontColour = "black", textDecoration = "bold", halign = "center")
  header_style <- createStyle(fontColour = "white", fgFill = "#C8A2C8", textDecoration = "bold", halign = "center", border = "TopBottomLeftRight")
  
  toc_data <- data.frame(
    Sheet = c("IVW", "MR-Egger", "Weighted_Median", "MR-Lasso", "MR-ConMix", 
              "Wald_Ratios", "Heterogeneity_Test", "Pleiotropy_Test", "Leave_One_Out",
              "Instruments", "Clumped_Data"),
    Title = c("Inverse Variance Weighted Results", "MR-Egger Results", 
              "Weighted Median Results", "MR-Lasso Results", "MR-ConMix Results",
              "Wald Ratio Tests", "Heterogeneity Test Results", 
              "Pleiotropy Test Results", "Leave-One-Out Analysis",
              "Filtered Instruments", "Clumped SNPs Data")
  )
  addWorksheet(wb, "TOC")
  writeData(wb, "TOC", "Table of Contents", startRow = 1, startCol = 1)
  mergeCells(wb, "TOC", cols = 1:2, rows = 1)
  addStyle(wb, "TOC", title_style, rows = 1, cols = 1)
  writeData(wb, "TOC", toc_data, startRow = 3, startCol = 1, headerStyle = header_style)
  setColWidths(wb, "TOC", cols = 1:2, widths = "auto")
  
  add_formatted_sheet <- function(wb, sheet_name, data, title) {
    if (is.null(data) || !is.data.frame(data) || nrow(data) == 0 || ncol(data) == 0) {
      cat(sprintf("Skipping sheet '%s' for %s: Invalid or empty data\n", sheet_name, olink_id))
      addWorksheet(wb, sheet_name)
      writeData(wb, sheet_name, "No data available")
      return()
    }
    addWorksheet(wb, sheet_name)
    writeData(wb, sheet_name, title, startRow = 1, startCol = 1)
    writeData(wb, sheet_name, data, startRow = 3, startCol = 1, headerStyle = header_style)
    mergeCells(wb, sheet_name, cols = 1:ncol(data), rows = 1)
    setColWidths(wb, sheet_name, cols = 1:ncol(data), widths = "auto")
    addStyle(wb, sheet_name, title_style, rows = 1, cols = 1)
  }
  
  add_formatted_sheet(wb, "IVW", ivw_result, "Inverse Variance Weighted Results")
  add_formatted_sheet(wb, "MR-Egger", egger_result, "MR-Egger Results")
  add_formatted_sheet(wb, "Weighted_Median", weighted_median_result, "Weighted Median Results")
  if (!is.null(lasso_result)) {
    lasso_df <- data.frame(
      Exposure = lasso_result@Exposure,
      Outcome = lasso_result@Outcome,
      Estimate = lasso_result@Estimate,
      StdError = lasso_result@StdError,
      CILower = lasso_result@CILower,
      CIUpper = lasso_result@CIUpper,
      Pvalue = lasso_result@Pvalue,
      SNPs = lasso_result@SNPs,
      Valid = lasso_result@Valid,
      ValidSNPs = paste(lasso_result@ValidSNPs, collapse = ", ")
    )
    add_formatted_sheet(wb, "MR-Lasso", lasso_df, "MR-Lasso Results")
  }
  if (!is.null(conmix_result)) {
    conmix_df <- data.frame(
      Exposure = conmix_result@Exposure,
      Outcome = conmix_result@Outcome,
      Estimate = conmix_result@Estimate,
      Pvalue = conmix_result@Pvalue,
      CILower = conmix_result@CILower,
      CIUpper = conmix_result@CIUpper,
      SNPs = conmix_result@SNPs
    )
    add_formatted_sheet(wb, "MR-ConMix", conmix_df, "MR-ConMix Results")
  }
  add_formatted_sheet(wb, "Wald_Ratios", wald_ratios, "Wald Ratio Tests")
  add_formatted_sheet(wb, "Heterogeneity_Test", heterogeneity_result, "Heterogeneity Test Results")
  add_formatted_sheet(wb, "Pleiotropy_Test", pleiotropy_result, "Pleiotropy Test Results")
  add_formatted_sheet(wb, "Leave_One_Out", loo_result, "Leave-One-Out Analysis")
  add_formatted_sheet(wb, "Instruments", instruments, "Filtered Instruments")
  add_formatted_sheet(wb, "Clumped_Data", clumped_dat, "Clumped SNPs Data")
  
  saveWorkbook(wb, excel_file, overwrite = FALSE)
  ggsave(plot_file, forest_plot, width = 12, height = 10, dpi = 600, bg = "white")
  
  cat(sprintf("Results for %s saved to: %s and %s\n", olink_id, excel_file, plot_file))
  gc(verbose = TRUE)
}

# Run MR analysis for each protein
for (protein_file in protein_files) {
  perform_mr(protein_file, chd_file, results_dir, gene_data)
}

cat(sprintf("MR analysis completed on %s\n", Sys.time()))
sink()

# Master file creation
rm(list = ls())

library(openxlsx)
library(dplyr)
library(data.table)

base_dir <- "/Users/charleenadams/mr_ukbppp_chd"
results_dir <- file.path(base_dir, "results_183_tss")
today <- Sys.Date()
master_file <- file.path(base_dir, sprintf("Master_MR_Results_CHD_%s.xlsx", today))

result_files <- list.files(results_dir, pattern = "MR_Results_.*_CHD_.*\\.xlsx$", 
                           full.names = TRUE, recursive = TRUE)

tabs <- c("IVW", "MR-Egger", "Weighted_Median", "MR-Lasso", "MR-ConMix", 
          "Wald_Ratios", "Heterogeneity_Test", "Pleiotropy_Test", "Leave_One_Out",
          "Instruments", "Clumped_Data")

master_data <- setNames(vector("list", length(tabs)), tabs)

extract_protein_id <- function(file) {
  basename <- basename(file)
  sub("MR_Results_(.*)_CHD_.*\\.xlsx$", "\\1", basename)
}

for (file in result_files) {
  protein_id <- extract_protein_id(file)
  cat(sprintf("Processing %s...\n", protein_id))
  
  wb <- tryCatch({
    loadWorkbook(file)
  }, error = function(e) {
    cat(sprintf("Failed to load %s: %s\nSkipping\n", protein_id, e$message))
    return(NULL)
  }, warning = function(w) {
    cat(sprintf("Warning loading %s: %s\n", protein_id, w$message))
    suppressWarnings(loadWorkbook(file))
  })
  
  if (is.null(wb)) next
  
  sheet_names <- sheets(wb)
  
  for (tab in tabs) {
    if (tab %in% sheet_names) {
      data <- tryCatch({
        readWorkbook(wb, sheet = tab, startRow = 3, colNames = TRUE)
      }, error = function(e) {
        cat(sprintf("Error reading %s in %s: %s\n", tab, protein_id, e$message))
        return(NULL)
      }, warning = function(w) {
        cat(sprintf("Warning reading %s in %s: %s\n", tab, protein_id, w$message))
        suppressWarnings(readWorkbook(wb, sheet = tab, startRow = 3, colNames = TRUE))
      })
      
      if (is.null(data) || nrow(data) == 0 || 
          (nrow(data) == 1 && length(data[1, 1]) > 0 && data[1, 1] == "No data available")) {
        next
      }
      
      if ("chr.exposure" %in% names(data)) {
        data$chr.exposure <- as.character(data$chr.exposure)
      }
      if ("chr.outcome" %in% names(data)) {
        data$chr.outcome <- as.character(data$chr.outcome)
      }
      data$Protein_ID <- protein_id
      
      if (is.null(master_data[[tab]])) {
        master_data[[tab]] <- data
      } else {
        master_data[[tab]] <- bind_rows(master_data[[tab]], data)
      }
    }
  }
}

wb <- createWorkbook()
title_style <- createStyle(fontSize = 14, fontColour = "black", textDecoration = "bold", halign = "center")
header_style <- createStyle(fontColour = "white", fgFill = "#C8A2C8", textDecoration = "bold", halign = "center", border = "TopBottomLeftRight")

toc_data <- data.frame(
  Sheet = tabs,
  Title = c("Inverse Variance Weighted Results", "MR-Egger Results",
            "Weighted Median Results", "MR-Lasso Results", "MR-ConMix Results",
            "Wald Ratio Tests", "Heterogeneity Test Results",
            "Pleiotropy Test Results", "Leave-One-Out Analysis",
            "Filtered Instruments", "Clumped SNPs Data")
)
addWorksheet(wb, "TOC")
writeData(wb, "TOC", "Table of Contents", startRow = 1, startCol = 1)
mergeCells(wb, "TOC", cols = 1:2, rows = 1)
addStyle(wb, "TOC", title_style, rows = 1, cols = 1)
writeData(wb, "TOC", toc_data, startRow = 3, startCol = 1, headerStyle = header_style)
setColWidths(wb, "TOC", cols = 1:2, widths = "auto")

add_formatted_sheet <- function(wb, sheet_name, data, title) {
  if (is.null(data) || nrow(data) == 0) {
    cat(sprintf("Skipping sheet '%s': No data\n", sheet_name))
    addWorksheet(wb, sheet_name)
    writeData(wb, sheet_name, "No data available")
    return()
  }
  addWorksheet(wb, sheet_name)
  writeData(wb, sheet_name, title, startRow = 1, startCol = 1)
  writeData(wb, sheet_name, data, startRow = 3, startCol = 1, headerStyle = header_style)
  mergeCells(wb, sheet_name, cols = 1:ncol(data), rows = 1)
  setColWidths(wb, sheet_name, cols = 1:ncol(data), widths = "auto")
  addStyle(wb, sheet_name, title_style, rows = 1, cols = 1)
}

for (tab in tabs) {
  add_formatted_sheet(wb, tab, master_data[[tab]], toc_data$Title[tabs == tab])
}

saveWorkbook(wb, master_file, overwrite = TRUE)
cat(sprintf("Master file saved to: %s\n", master_file))

6 MR Loop (FinnGen)

run with: 
mr_loop_183_finngen.R

caffeinate nohup Rscript /Users/charleenadams/mr_ukbppp_chd/scripts/mr_loop_183_finngen.R> /Users/charleenadams/mr_ukbppp_chd/scripts/mr_loop_183_finngen.log 2>&1 &

ps -p 30446 -o stat,etime

rm(list = ls())

library(TwoSampleMR)
library(MendelianRandomization)
library(ieugwasr)
library(MRInstruments)
library(dplyr)
library(tidyr)
library(data.table)
library(openxlsx)
library(ggplot2)
library(scales)
library(RhpcBLASctl)
library(patchwork)

# Define %||% operator for fallback
`%||%` <- function(x, y) if (length(x) > 0) x else y

# Set memory limit to match RAM (128 GB)
Sys.setenv("R_MAX_VSIZE" = 128000)
cat("Max vector size set to:", Sys.getenv("R_MAX_VSIZE"), "MB\n")

# Set directories and files
base_dir <- "/Users/charleenadams/mr_ukbppp_chd"
data_dir <- file.path(base_dir, "data_183", "mr_files")
results_dir <- file.path(base_dir, "results_183_finngen")
if (!dir.exists(results_dir)) dir.create(results_dir, recursive = TRUE)

local_annotation_file <- "/Users/charleenadams/mr_ukbppp_chd/mart_export_clean.txt" #build 38
chd_file <- file.path(data_dir, "summary_stats_release_finngen_R12_I9_CHD.gz")
protein_files <- list.files(data_dir, pattern = "^[A-Z0-9]+_[A-Z0-9]+_OID[0-9]+_v1_.*\\.tsv\\.gz$", full.names = TRUE)
protein_files <- protein_files[!grepl("ieu-a-7_formatted.tsv.gz|summary_stats_release_finngen_R12_I9_CHD.gz", protein_files)]
cat("Protein files to process:\n")
print(protein_files)

# Load annotation file
if (!file.exists(local_annotation_file)) stop(sprintf("Local annotation file %s not found", local_annotation_file))
gene_data <- fread(local_annotation_file, data.table = FALSE)
cat(sprintf("Loaded local annotation file with %d genes\n", nrow(gene_data)))

# Set up logging
log_file <- file.path(base_dir, "mr_analysis.log")
sink(log_file, append = TRUE, split = TRUE)
cat(sprintf("Starting sequential MR analysis on %s\n", Sys.time()))

forest_plots <- list()

# Format exposure function with TSS filter
format_exposure <- function(file, protein_chrom, tss_start, tss_end) {
  dat <- fread(file) %>%
    filter(
      CHROM == protein_chrom,
      POS38 >= tss_start,
      POS38 <= tss_end
    ) %>%
    dplyr::select(
      SNP = RSID,
      chr.exposure = CHROM,
      pos.exposure = POS38,
      effect_allele.exposure = ALLELE1,
      other_allele.exposure = ALLELE0,
      eaf.exposure = A1FREQ,
      beta.exposure = BETA,
      se.exposure = SE,
      pval.exposure = P,
      samplesize.exposure = N
    ) %>%
    mutate(
      exposure = sub("_v1_.*$", "", basename(file)),
      id.exposure = exposure,
      chr.exposure = as.character(chr.exposure)
    ) %>%
    filter(!is.na(SNP) & grepl("^rs", SNP))
  cat(sprintf("Exposure SNPs for %s in TSS window: %d\n", basename(file), nrow(dat)))
  return(dat)
}

# Format outcome function with TSS filter
format_outcome <- function(file, protein_chrom, tss_start, tss_end) {
  dat <- fread(file) %>%
    filter(
      `#chrom` == protein_chrom,
      pos >= tss_start,
      pos <= tss_end
    ) %>%
    dplyr::select(
      SNP = rsids,
      chr.outcome = `#chrom`,
      pos.outcome = pos,
      effect_allele.outcome = alt,
      other_allele.outcome = ref,
      eaf.outcome = af_alt,
      beta.outcome = beta,
      se.outcome = sebeta,
      pval.outcome = pval
    ) %>%
    mutate(
      samplesize.outcome = 500348,
      outcome = "FinnGen I9_CHD",
      id.outcome = "FinnGen_I9_CHD",
      chr.outcome = as.character(chr.outcome)
    ) %>%
    filter(!is.na(SNP) & grepl("^rs", SNP))
  cat(sprintf("Outcome SNPs in TSS window (chr %s, %d-%d): %d\n", protein_chrom, tss_start, tss_end, nrow(dat)))
  return(dat)
}

# Modified perform_mr function with robust skip logic
perform_mr <- function(protein_file, chd_file, results_dir, gene_data) {
  olink_id <- sub("_v1_.*$", "", basename(protein_file))
  protein_dir <- file.path(results_dir, olink_id)
  today <- Sys.Date()
  
  excel_file <- file.path(protein_dir, sprintf("MR_Results_%s_CHD_%s.xlsx", olink_id, today))
  tss_harmonized_file <- file.path(protein_dir, sprintf("tss_harmonized_%s_CHD_%s.csv", olink_id, today))
  plot_file <- file.path(protein_dir, sprintf("MR_Forestplot_%s_CHD_%s.png", olink_id, today))
  
  # Check for any harmonized or result files
  existing_harmonized <- list.files(protein_dir, pattern = sprintf("(full|tss)_harmonized_%s_CHD_.*\\.csv$", olink_id), full.names = TRUE)
  if (length(existing_harmonized) > 0 || file.exists(excel_file) || file.exists(plot_file)) {
    cat(sprintf("Skipping %s: Existing files found (e.g., %s)\n", olink_id, existing_harmonized[1] %||% excel_file))
    return(NULL)
  }
  
  if (!dir.exists(protein_dir)) dir.create(protein_dir, recursive = TRUE)
  cat(sprintf("Processing %s...\n", olink_id))
  
  # Get TSS and chromosome
  protein <- strsplit(olink_id, "_")[[1]][1]
  cat(sprintf("Searching for protein: %s\n", protein))
  
  protein_data <- gene_data %>%
    filter(`Gene name` == protein) %>%
    dplyr::select(`Chromosome/scaffold name`, `Transcription start site (TSS)`, `Gene stable ID version`, `Gene name`)
  
  if (nrow(protein_data) == 0) {
    cat(sprintf("No TSS data found for %s; skipping\n", olink_id))
    return(NULL)
  }
  
  cat(sprintf("Protein data for %s:\n", olink_id))
  print(protein_data)
  
  chrom_col <- "Chromosome/scaffold name"
  tss_col <- "Transcription start site (TSS)"
  tss_column <- protein_data[[tss_col]]
  
  if (all(is.na(tss_column)) || length(tss_column) == 0) {
    cat(sprintf("No valid TSS values for %s; skipping\n", olink_id))
    return(NULL)
  }
  
  protein_tss <- min(tss_column, na.rm = TRUE)
  protein_chrom <- protein_data[[chrom_col]][1]
  
  if (is.infinite(protein_tss) || is.na(protein_tss) || is.na(protein_chrom)) {
    cat(sprintf("Invalid TSS (%s) or chromosome (%s) for %s; skipping\n", protein_tss, protein_chrom, olink_id))
    return(NULL)
  }
  
  cat(sprintf("Using TSS: %d, Chromosome: %s for %s\n", protein_tss, protein_chrom, olink_id))
  
  window_size <- 500000
  tss_start <- protein_tss - window_size
  tss_end <- protein_tss + window_size
  
  # Filter and harmonize TSS data
  exp_dat <- format_exposure(protein_file, protein_chrom, tss_start, tss_end)
  if (nrow(exp_dat) == 0) {
    cat(sprintf("No exposure SNPs in TSS window for %s; skipping\n", olink_id))
    return(NULL)
  }
  
  out_dat <- format_outcome(chd_file, protein_chrom, tss_start, tss_end)
  if (nrow(out_dat) == 0) {
    cat(sprintf("No outcome SNPs in TSS window for %s; skipping\n", olink_id))
    return(NULL)
  }
  
  dat <- harmonise_data(exposure_dat = exp_dat, outcome_dat = out_dat, action = 1)
  write.csv(dat, tss_harmonized_file, row.names = FALSE)
  cat(sprintf("TSS harmonized data for %s saved to: %s\n", olink_id, tss_harmonized_file))
  
  instruments <- dat %>%
    filter(mr_keep == TRUE, pval.exposure < 5e-6) %>%
    mutate(
      rsid = SNP,
      pval = pval.exposure,
      id = sprintf("%s exposure", olink_id),
      F_stat = (beta.exposure / se.exposure)^2
    ) %>%
    filter(F_stat > 10)
  
  cat(sprintf("Number of instruments for %s after F-stat filtering: %d\n", olink_id, nrow(instruments)))
  
  if (nrow(instruments) == 0) {
    cat(sprintf("No instruments for %s after F-stat filtering; skipping\n", olink_id))
    return(NULL)
  }
  
  instruments <- steiger_filtering(instruments)
  if (!"steiger_dir" %in% colnames(instruments) || nrow(instruments) == 0) {
    cat(sprintf("Steiger filtering produced no valid instruments for %s; skipping\n", olink_id))
    return(NULL)
  }
  instruments <- instruments %>% filter(steiger_dir == TRUE)
  cat(sprintf("Number of instruments for %s after Steiger filtering: %d\n", olink_id, nrow(instruments)))
  
  if (nrow(instruments) == 0) {
    cat(sprintf("No instruments for %s after Steiger filtering; skipping\n", olink_id))
    return(NULL)
  }
  
  clumped_dat <- NULL
  tryCatch({
    clumped <- ld_clump(
      dplyr::tibble(rsid = instruments$rsid, pval = instruments$pval, id = instruments$id),
      plink_bin = genetics.binaRies::get_plink_binary(),
      bfile = "/Users/charleenadams/1000G_bfiles/EUR/EUR",
      clump_r2 = 0.001,
      clump_kb = 500
    )
    clumped_dat <- instruments %>% filter(SNP %in% clumped$rsid)
    cat(sprintf("Number of SNPs retained for %s after clumping: %d\n", olink_id, nrow(clumped_dat)))
  }, error = function(e) {
    cat(sprintf("LD clumping failed for %s: %s\nSkipping MR analysis for this protein.\n", olink_id, e$message))
    return(NULL)
  })
  
  if (is.null(clumped_dat) || nrow(clumped_dat) == 0) {
    cat(sprintf("No SNPs retained for %s after clumping; skipping MR analysis\n", olink_id))
    return(NULL)
  }
  
  # Perform MR analyses
  ivw_result <- mr(clumped_dat, method_list = "mr_ivw")
  egger_result <- if (nrow(clumped_dat) >= 3) {
    tryCatch(mr(clumped_dat, method_list = "mr_egger_regression"), error = function(e) NULL)
  } else NULL
  weighted_median_result <- mr(clumped_dat, method_list = "mr_weighted_median")
  
  mr_input <- mr_input(
    bx = clumped_dat$beta.exposure,
    bxse = clumped_dat$se.exposure,
    by = clumped_dat$beta.outcome,
    byse = clumped_dat$se.outcome,
    exposure = olink_id,
    outcome = "CHD",
    snps = clumped_dat$SNP
  )
  lasso_result <- tryCatch(mr_lasso(mr_input), error = function(e) { 
    cat(sprintf("MR-Lasso failed for %s: %s\n", olink_id, e$message)); NULL 
  })
  conmix_result <- tryCatch(mr_conmix(mr_input), error = function(e) { 
    cat(sprintf("MR-ConMix failed for %s: %s\n", olink_id, e$message)); NULL 
  })
  
  heterogeneity_result <- mr_heterogeneity(clumped_dat)
  pleiotropy_result <- if (nrow(clumped_dat) >= 3) mr_pleiotropy_test(clumped_dat) else NULL
  loo_result <- mr_leaveoneout(clumped_dat)
  
  wald_ratios <- clumped_dat %>%
    mutate(
      wald_beta = beta.outcome / beta.exposure,
      wald_se = sqrt((se.outcome^2 / beta.exposure^2) + ((beta.outcome^2 * se.exposure^2) / (beta.exposure^4))),
      pval = 2 * pnorm(abs(wald_beta / wald_se), lower.tail = FALSE),
      method = paste("Wald Ratio:", SNP)
    ) %>%
    dplyr::select(SNP, wald_beta, wald_se, pval, method)
  
  # Process MR results
  mr_results <- bind_rows(
    if (nrow(ivw_result) > 0) ivw_result[, c("method", "b", "se", "pval")],
    if (!is.null(egger_result) && nrow(egger_result) > 0) egger_result[, c("method", "b", "se", "pval")],
    if (nrow(weighted_median_result) > 0) weighted_median_result[, c("method", "b", "se", "pval")],
    if (!is.null(lasso_result)) data.frame(
      method = "MR-Lasso",
      b = lasso_result@Estimate,
      se = lasso_result@StdError,
      pval = lasso_result@Pvalue
    ),
    if (!is.null(conmix_result)) data.frame(
      method = "MR-ConMix",
      b = conmix_result@Estimate,
      se = (conmix_result@CIUpper - conmix_result@CILower)/(2*1.96),
      pval = conmix_result@Pvalue
    ),
    wald_ratios %>% dplyr::select(method, b = wald_beta, se = wald_se, pval)
  ) %>%
    mutate(
      method = factor(method, levels = unique(method)),
      or = exp(b),
      or_lower = exp(b - 1.96 * se),
      or_upper = exp(b + 1.96 * se)
    ) %>%
    na.omit()
  
  base_colors <- c("IVW" = "#1F78B4", "MR-Egger" = "#FF7F00", "Weighted Median" = "#33A02C",
                   "MR-Lasso" = "#FB9A99", "MR-ConMix" = "#E41A1C")
  if (nrow(mr_results) == 0 || !("method" %in% names(mr_results))) {
    cat(sprintf("Warning: No valid MR results for %s; using default colors\n", olink_id))
    color_list <- base_colors
  } else {
    wald_methods <- unique(mr_results$method[grepl("Wald Ratio", mr_results$method)])
    n_wald <- length(wald_methods)
    wald_colors <- if (n_wald > 0) setNames(hue_pal()(n_wald), wald_methods) else NULL
    color_list <- c(base_colors, wald_colors)
    color_list <- color_list[names(color_list) %in% unique(mr_results$method)]
  }
  
  x_min <- min(mr_results$or_lower, na.rm = TRUE)
  x_max <- max(mr_results$or_upper, na.rm = TRUE)
  forest_plot <- ggplot(mr_results, aes(x = or, y = method, color = method)) +
    geom_errorbarh(aes(xmin = or_lower, xmax = or_upper), height = 0.2, na.rm = TRUE, linewidth = 0.8) +
    geom_point(size = 3) +
    geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
    labs(
      title = sprintf("Mendelian Randomization Estimates:\n %s on CHD", protein),
      x = "Causal Effect (Odds Ratio)",
      y = ""
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
      axis.text.y = element_text(size = 12, face = "bold"),
      axis.text.x = element_text(size = 10),
      axis.title.x = element_text(size = 12),
      panel.grid.major = element_line(color = "grey90"),
      panel.grid.minor = element_blank(),
      legend.position = "none",
      panel.background = element_rect(fill = "white", color = NA),
      plot.background = element_rect(fill = "white", color = NA)
    ) +
    scale_color_manual(values = color_list) +
    xlim(x_min - 0.02 * (x_max - x_min), x_max + 0.02 * (x_max - x_min))
  
  forest_plots[[olink_id]] <<- forest_plot
  
  # Save outputs with overwrite protection
  wb <- createWorkbook()
  title_style <- createStyle(fontSize = 14, fontColour = "black", textDecoration = "bold", halign = "center")
  header_style <- createStyle(fontColour = "white", fgFill = "#C8A2C8", textDecoration = "bold", halign = "center", border = "TopBottomLeftRight")
  
  toc_data <- data.frame(
    Sheet = c("IVW", "MR-Egger", "Weighted_Median", "MR-Lasso", "MR-ConMix", 
              "Wald_Ratios", "Heterogeneity_Test", "Pleiotropy_Test", "Leave_One_Out",
              "Instruments", "Clumped_Data"),
    Title = c("Inverse Variance Weighted Results", "MR-Egger Results", 
              "Weighted Median Results", "MR-Lasso Results", "MR-ConMix Results",
              "Wald Ratio Tests", "Heterogeneity Test Results", 
              "Pleiotropy Test Results", "Leave-One-Out Analysis",
              "Filtered Instruments", "Clumped SNPs Data")
  )
  addWorksheet(wb, "TOC")
  writeData(wb, "TOC", "Table of Contents", startRow = 1, startCol = 1)
  mergeCells(wb, "TOC", cols = 1:2, rows = 1)
  addStyle(wb, "TOC", title_style, rows = 1, cols = 1)
  writeData(wb, "TOC", toc_data, startRow = 3, startCol = 1, headerStyle = header_style)
  setColWidths(wb, "TOC", cols = 1:2, widths = "auto")
  
  add_formatted_sheet <- function(wb, sheet_name, data, title) {
    if (is.null(data) || !is.data.frame(data) || nrow(data) == 0 || ncol(data) == 0) {
      cat(sprintf("Skipping sheet '%s' for %s: Invalid or empty data\n", sheet_name, olink_id))
      addWorksheet(wb, sheet_name)
      writeData(wb, sheet_name, "No data available")
      return()
    }
    addWorksheet(wb, sheet_name)
    writeData(wb, sheet_name, title, startRow = 1, startCol = 1)
    writeData(wb, sheet_name, data, startRow = 3, startCol = 1, headerStyle = header_style)
    mergeCells(wb, sheet_name, cols = 1:ncol(data), rows = 1)
    setColWidths(wb, sheet_name, cols = 1:ncol(data), widths = "auto")
    addStyle(wb, sheet_name, title_style, rows = 1, cols = 1)
  }
  
  add_formatted_sheet(wb, "IVW", ivw_result, "Inverse Variance Weighted Results")
  add_formatted_sheet(wb, "MR-Egger", egger_result, "MR-Egger Results")
  add_formatted_sheet(wb, "Weighted_Median", weighted_median_result, "Weighted Median Results")
  if (!is.null(lasso_result)) {
    lasso_df <- data.frame(
      Exposure = lasso_result@Exposure,
      Outcome = lasso_result@Outcome,
      Estimate = lasso_result@Estimate,
      StdError = lasso_result@StdError,
      CILower = lasso_result@CILower,
      CIUpper = lasso_result@CIUpper,
      Pvalue = lasso_result@Pvalue,
      SNPs = lasso_result@SNPs,
      Valid = lasso_result@Valid,
      ValidSNPs = paste(lasso_result@ValidSNPs, collapse = ", ")
    )
    add_formatted_sheet(wb, "MR-Lasso", lasso_df, "MR-Lasso Results")
  }
  if (!is.null(conmix_result)) {
    conmix_df <- data.frame(
      Exposure = conmix_result@Exposure,
      Outcome = conmix_result@Outcome,
      Estimate = conmix_result@Estimate,
      Pvalue = conmix_result@Pvalue,
      CILower = conmix_result@CILower,
      CIUpper = conmix_result@CIUpper,
      SNPs = conmix_result@SNPs
    )
    add_formatted_sheet(wb, "MR-ConMix", conmix_df, "MR-ConMix Results")
  }
  add_formatted_sheet(wb, "Wald_Ratios", wald_ratios, "Wald Ratio Tests")
  add_formatted_sheet(wb, "Heterogeneity_Test", heterogeneity_result, "Heterogeneity Test Results")
  add_formatted_sheet(wb, "Pleiotropy_Test", pleiotropy_result, "Pleiotropy Test Results")
  add_formatted_sheet(wb, "Leave_One_Out", loo_result, "Leave-One-Out Analysis")
  add_formatted_sheet(wb, "Instruments", instruments, "Filtered Instruments")
  add_formatted_sheet(wb, "Clumped_Data", clumped_dat, "Clumped SNPs Data")
  
  saveWorkbook(wb, excel_file, overwrite = FALSE)
  ggsave(plot_file, forest_plot, width = 12, height = 10, dpi = 600, bg = "white")
  
  cat(sprintf("Results for %s saved to: %s and %s\n", olink_id, excel_file, plot_file))
  gc(verbose = TRUE)
}

# Run MR analysis for each protein
for (protein_file in protein_files) {
  perform_mr(protein_file, chd_file, results_dir, gene_data)
}

cat(sprintf("MR analysis completed on %s\n", Sys.time()))
sink()

# Master file creation
rm(list = ls())

library(openxlsx)
library(dplyr)
library(data.table)

base_dir <- "/Users/charleenadams/mr_ukbppp_chd"
results_dir <- file.path(base_dir, "results_183_finngen")
today <- Sys.Date()
master_file <- file.path(base_dir, sprintf("Master_MR_Results_finngen_CHD_%s.xlsx", today))

result_files <- list.files(results_dir, pattern = "MR_Results_.*_CHD_.*\\.xlsx$", 
                           full.names = TRUE, recursive = TRUE)

tabs <- c("IVW", "MR-Egger", "Weighted_Median", "MR-Lasso", "MR-ConMix", 
          "Wald_Ratios", "Heterogeneity_Test", "Pleiotropy_Test", "Leave_One_Out",
          "Instruments", "Clumped_Data")

master_data <- setNames(vector("list", length(tabs)), tabs)

extract_protein_id <- function(file) {
  basename <- basename(file)
  sub("MR_Results_(.*)_CHD_.*\\.xlsx$", "\\1", basename)
}

for (file in result_files) {
  protein_id <- extract_protein_id(file)
  cat(sprintf("Processing %s...\n", protein_id))
  
  wb <- tryCatch({
    loadWorkbook(file)
  }, error = function(e) {
    cat(sprintf("Failed to load %s: %s\nSkipping\n", protein_id, e$message))
    return(NULL)
  }, warning = function(w) {
    cat(sprintf("Warning loading %s: %s\n", protein_id, w$message))
    suppressWarnings(loadWorkbook(file))
  })
  
  if (is.null(wb)) next
  
  sheet_names <- sheets(wb)
  
  for (tab in tabs) {
    if (tab %in% sheet_names) {
      data <- tryCatch({
        readWorkbook(wb, sheet = tab, startRow = 3, colNames = TRUE)
      }, error = function(e) {
        cat(sprintf("Error reading %s in %s: %s\n", tab, protein_id, e$message))
        return(NULL)
      }, warning = function(w) {
        cat(sprintf("Warning reading %s in %s: %s\n", tab, protein_id, w$message))
        suppressWarnings(readWorkbook(wb, sheet = tab, startRow = 3, colNames = TRUE))
      })
      
      if (is.null(data) || nrow(data) == 0 || 
          (nrow(data) == 1 && length(data[1, 1]) > 0 && data[1, 1] == "No data available")) {
        next
      }
      
      if ("chr.exposure" %in% names(data)) {
        data$chr.exposure <- as.character(data$chr.exposure)
      }
      if ("chr.outcome" %in% names(data)) {
        data$chr.outcome <- as.character(data$chr.outcome)
      }
      data$Protein_ID <- protein_id
      
      if (is.null(master_data[[tab]])) {
        master_data[[tab]] <- data
      } else {
        master_data[[tab]] <- bind_rows(master_data[[tab]], data)
      }
    }
  }
}

wb <- createWorkbook()
title_style <- createStyle(fontSize = 14, fontColour = "black", textDecoration = "bold", halign = "center")
header_style <- createStyle(fontColour = "white", fgFill = "#C8A2C8", textDecoration = "bold", halign = "center", border = "TopBottomLeftRight")

toc_data <- data.frame(
  Sheet = tabs,
  Title = c("Inverse Variance Weighted Results", "MR-Egger Results",
            "Weighted Median Results", "MR-Lasso Results", "MR-ConMix Results",
            "Wald Ratio Tests", "Heterogeneity Test Results",
            "Pleiotropy Test Results", "Leave-One-Out Analysis",
            "Filtered Instruments", "Clumped SNPs Data")
)
addWorksheet(wb, "TOC")
writeData(wb, "TOC", "Table of Contents", startRow = 1, startCol = 1)
mergeCells(wb, "TOC", cols = 1:2, rows = 1)
addStyle(wb, "TOC", title_style, rows = 1, cols = 1)
writeData(wb, "TOC", toc_data, startRow = 3, startCol = 1, headerStyle = header_style)
setColWidths(wb, "TOC", cols = 1:2, widths = "auto")

add_formatted_sheet <- function(wb, sheet_name, data, title) {
  if (is.null(data) || nrow(data) == 0) {
    cat(sprintf("Skipping sheet '%s': No data\n", sheet_name))
    addWorksheet(wb, sheet_name)
    writeData(wb, sheet_name, "No data available")
    return()
  }
  addWorksheet(wb, sheet_name)
  writeData(wb, sheet_name, title, startRow = 1, startCol = 1)
  writeData(wb, sheet_name, data, startRow = 3, startCol = 1, headerStyle = header_style)
  mergeCells(wb, sheet_name, cols = 1:ncol(data), rows = 1)
  setColWidths(wb, sheet_name, cols = 1:ncol(data), widths = "auto")
  addStyle(wb, sheet_name, title_style, rows = 1, cols = 1)
}

for (tab in tabs) {
  add_formatted_sheet(wb, tab, master_data[[tab]], toc_data$Title[tabs == tab])
}

saveWorkbook(wb, master_file, overwrite = TRUE)
cat(sprintf("Master file saved to: %s\n", master_file))

7 Comparison of Cardiogram and FinnGen

rm(list = ls())

# Load required packages
library(openxlsx)
library(dplyr)
library(data.table)

# Set base directory
base_dir <- "/Users/charleenadams/mr_ukbppp_chd"

# Define paths to the two Excel files and protein list
excel_file_finngen <- file.path(base_dir, "Master_MR_Results_finngen_CHD_2025-03-22.xlsx")
excel_file_cardiogram <- file.path(base_dir, "Master_MR_Results_CHD_2025-03-22.xlsx")
protein_list_file <- file.path(base_dir, "protein_list_clean.txt")

# Check if files exist
if (!file.exists(excel_file_finngen)) {
  stop(sprintf("Excel file %s not found.", excel_file_finngen))
}
if (!file.exists(excel_file_cardiogram)) {
  stop(sprintf("Excel file %s not found.", excel_file_cardiogram))
}
if (!file.exists(protein_list_file)) {
  stop(sprintf("Protein list file %s not found.", protein_list_file))
}

# Load both workbooks
wb_finngen <- loadWorkbook(excel_file_finngen)
wb_cardiogram <- loadWorkbook(excel_file_cardiogram)

# Load protein list
protein_list <- fread(protein_list_file, data.table = FALSE)
if (!("Olink ID" %in% colnames(protein_list))) {
  stop("Column 'Olink ID' not found in protein_list_clean.txt.")
}

# Extract OIDxxxxx from Olink ID (e.g., Olink3K_OID20981 -> OID20981)
protein_list$OID <- sub(".*_(OID[0-9]+)$", "\\1", protein_list$`Olink ID`)
oids <- unique(protein_list$OID)

# List available sheets in both workbooks
sheets_finngen <- sheets(wb_finngen)
sheets_cardiogram <- sheets(wb_cardiogram)

cat("Sheets in FinnGen workbook:\n")
print(sheets_finngen)
cat("\nSheets in Cardiogram workbook:\n")
print(sheets_cardiogram)

# Define the sheets to compare (excluding TOC)
tabs <- c("IVW", "MR-Egger", "Weighted_Median", "MR-Lasso", "MR-ConMix", 
          "Wald_Ratios", "Heterogeneity_Test", "Pleiotropy_Test", "Leave_One_Out",
          "Instruments", "Clumped_Data")

# Function to read a sheet and handle errors
read_sheet <- function(wb, sheet_name, file_label) {
  if (!(sheet_name %in% sheets(wb))) {
    cat(sprintf("Sheet '%s' not found in %s. Skipping.\n", sheet_name, file_label))
    return(NULL)
  }
  
  data <- tryCatch({
    readWorkbook(wb, sheet = sheet_name, startRow = 3, colNames = TRUE)
  }, error = function(e) {
    cat(sprintf("Error reading sheet '%s' in %s: %s\n", sheet_name, file_label, e$message))
    return(NULL)
  })
  
  if (is.null(data) || nrow(data) == 0 || 
      (nrow(data) == 1 && length(data[1, 1]) > 0 && data[1, 1] == "No data available")) {
    cat(sprintf("Sheet '%s' in %s is empty or contains no data.\n", sheet_name, file_label))
    return(NULL)
  }
  
  if (!("Protein_ID" %in% colnames(data))) {
    cat(sprintf("Column 'Protein_ID' not found in sheet '%s' of %s. Skipping.\n", 
                sheet_name, file_label))
    return(NULL)
  }
  
  return(data)
}

# Store comparison results for Excel output
comparison_results <- list()

# Compare Protein_IDs across each sheet and count OID matches
for (sheet_name in tabs) {
  cat(sprintf("\n=== Comparing sheet '%s' ===\n", sheet_name))
  
  # Read data from both files
  data_finngen <- read_sheet(wb_finngen, sheet_name, "FinnGen file")
  data_cardiogram <- read_sheet(wb_cardiogram, sheet_name, "Cardiogram file")
  
  # Skip if both datasets are NULL
  if (is.null(data_finngen) && is.null(data_cardiogram)) {
    cat("Both datasets are empty or unavailable. Skipping comparison.\n")
    next
  }
  
  # Get unique Protein_IDs
  proteins_finngen <- if (!is.null(data_finngen)) unique(data_finngen$Protein_ID) else character(0)
  proteins_cardiogram <- if (!is.null(data_cardiogram)) unique(data_cardiogram$Protein_ID) else character(0)
  
  # Count unique proteins
  cat(sprintf("Number of unique Protein_IDs in FinnGen (%s): %d\n", sheet_name, length(proteins_finngen)))
  cat(sprintf("Number of unique Protein_IDs in Cardiogram (%s): %d\n", sheet_name, length(proteins_cardiogram)))
  
  # Find overlaps and differences
  common_proteins <- intersect(proteins_finngen, proteins_cardiogram)
  finngen_only <- setdiff(proteins_finngen, proteins_cardiogram)
  cardiogram_only <- setdiff(proteins_cardiogram, proteins_finngen)
  
  # Count Protein_IDs matching OIDs from protein_list_clean.txt
  count_oids_finngen <- sum(sapply(oids, function(oid) any(grepl(oid, proteins_finngen))))
  count_oids_cardiogram <- sum(sapply(oids, function(oid) any(grepl(oid, proteins_cardiogram))))
  
  # Report to console
  cat(sprintf("Number of Protein_IDs common to both: %d\n", length(common_proteins)))
  if (length(common_proteins) > 0) {
    cat("Common Protein_IDs:\n")
    print(common_proteins)
  }
  
  cat(sprintf("Number of Protein_IDs unique to FinnGen: %d\n", length(finngen_only)))
  if (length(finngen_only) > 0) {
    cat("Protein_IDs unique to FinnGen:\n")
    print(finngen_only)
  }
  
  cat(sprintf("Number of Protein_IDs unique to Cardiogram: %d\n", length(cardiogram_only)))
  if (length(cardiogram_only) > 0) {
    cat("Protein_IDs unique to Cardiogram:\n")
    print(cardiogram_only)
  }
  
  cat(sprintf("Number of Protein_IDs in FinnGen matching OIDs from protein_list_clean.txt: %d\n", count_oids_finngen))
  cat(sprintf("Number of Protein_IDs in Cardiogram matching OIDs from protein_list_clean.txt: %d\n", count_oids_cardiogram))
  
  # Store results for Excel
  comparison_results[[sheet_name]] <- list(
    counts = data.frame(
      Metric = c("Unique Protein_IDs in FinnGen", "Unique Protein_IDs in Cardiogram", 
                 "Common Protein_IDs", "FinnGen Matching OIDs", "Cardiogram Matching OIDs"),
      Count = c(length(proteins_finngen), length(proteins_cardiogram), length(common_proteins), 
                count_oids_finngen, count_oids_cardiogram)
    ),
    common = if (length(common_proteins) > 0) data.frame(Protein_ID = common_proteins) else NULL,
    finngen_only = if (length(finngen_only) > 0) data.frame(Protein_ID = finngen_only) else NULL,
    cardiogram_only = if (length(cardiogram_only) > 0) data.frame(Protein_ID = cardiogram_only) else NULL
  )
}

# Part 2: Analyze results directories
cat("\n=== Analyzing Results Directories ===\n")

# Define the directories
dir_finngen <- "/Users/charleenadams/mr_ukbppp_chd/results_183_finngen"
dir_cardiogram <- "/Users/charleenadams/mr_ukbppp_chd/results_183_tss"

# Check if directories exist
if (!dir.exists(dir_finngen)) {
  stop(sprintf("Directory %s not found.", dir_finngen))
}
if (!dir.exists(dir_cardiogram)) {
  stop(sprintf("Directory %s not found.", dir_cardiogram))
}

# Function to analyze a directory
analyze_directory <- function(dir_path, label) {
  cat(sprintf("\nAnalyzing directory: %s\n", label))
  
  # List subfolders
  subfolders <- list.dirs(dir_path, full.names = FALSE, recursive = FALSE)
  if (length(subfolders) == 0) {
    cat("No subfolders found.\n")
    return(NULL)
  }
  
  # Initialize lists for categorization
  empty_folders <- character()
  tss_folders <- character()
  mr_results_folders <- character()
  both_folders <- character()
  
  # Analyze each subfolder
  for (folder in subfolders) {
    folder_path <- file.path(dir_path, folder)
    files <- list.files(folder_path, full.names = FALSE)
    
    # Check file types
    has_tss <- any(grepl("^tss_harmonized_.*\\.csv$", files))
    has_mr_results <- any(grepl("^MR_Results_.*\\.xlsx$", files))
    
    # Categorize
    if (length(files) == 0) {
      empty_folders <- c(empty_folders, folder)
    } else if (has_tss && has_mr_results) {
      both_folders <- c(both_folders, folder)
    } else if (has_tss) {
      tss_folders <- c(tss_folders, folder)
    } else if (has_mr_results) {
      mr_results_folders <- c(mr_results_folders, folder)
    }
  }
  
  # Report to console
  cat(sprintf("Total subfolders: %d\n", length(subfolders)))
  cat(sprintf("Empty subfolders (0 files): %d\n", length(empty_folders)))
  if (length(empty_folders) > 0) {
    cat("Empty subfolders:\n")
    print(empty_folders)
  }
  
  cat(sprintf("Subfolders with tss_harmonized_ files only: %d\n", length(tss_folders)))
  if (length(tss_folders) > 0) {
    cat("Subfolders with tss_harmonized_ files only:\n")
    print(tss_folders)
  }
  
  cat(sprintf("Subfolders with MR_Results_ files only: %d\n", length(mr_results_folders)))
  if (length(mr_results_folders) > 0) {
    cat("Subfolders with MR_Results_ files only:\n")
    print(mr_results_folders)
  }
  
  cat(sprintf("Subfolders with both tss_harmonized_ and MR_Results_ files: %d\n", length(both_folders)))
  if (length(both_folders) > 0) {
    cat("Subfolders with both:\n")
    print(both_folders)
  }
  
  # Return results for Excel
  return(list(
    counts = data.frame(
      Metric = c("Total Subfolders", "Empty Subfolders", "tss_harmonized_ Only", 
                 "MR_Results_ Only", "Both tss_harmonized_ and MR_Results_"),
      Count = c(length(subfolders), length(empty_folders), length(tss_folders), 
                length(mr_results_folders), length(both_folders))
    ),
    empty = if (length(empty_folders) > 0) data.frame(Subfolder = empty_folders) else NULL,
    tss_only = if (length(tss_folders) > 0) data.frame(Subfolder = tss_folders) else NULL,
    mr_results_only = if (length(mr_results_folders) > 0) data.frame(Subfolder = mr_results_folders) else NULL,
    both = if (length(both_folders) > 0) data.frame(Subfolder = both_folders) else NULL
  ))
}

# Analyze both directories and store results
dir_results_finngen <- analyze_directory(dir_finngen, "results_183_finngen")
dir_results_cardiogram <- analyze_directory(dir_cardiogram, "results_183_tss")

# Part 3: Create a master results file merging FinnGen and Cardiogram results
cat("\n=== Creating Master Results File ===\n")

# Initialize master data for merging
master_data <- setNames(vector("list", length(tabs)), tabs)

# Function to read and label data
read_and_label <- function(wb, sheet_name, label) {
  data <- read_sheet(wb, sheet_name, label)
  if (!is.null(data)) {
    data$Source <- label
    # Add Our_Set column: Yes if Protein_ID contains an OID from protein_list_clean.txt, No otherwise
    data$Our_Set <- ifelse(sapply(data$Protein_ID, function(pid) any(sapply(oids, function(oid) grepl(oid, pid)))), "Yes", "No")
  }
  return(data)
}

# Merge data from both files
for (tab in tabs) {
  data_finngen <- read_and_label(wb_finngen, tab, "FinnGen")
  data_cardiogram <- read_and_label(wb_cardiogram, tab, "Cardiogram")
  
  # Combine data
  if (!is.null(data_finngen) && !is.null(data_cardiogram)) {
    master_data[[tab]] <- bind_rows(data_finngen, data_cardiogram)
  } else if (!is.null(data_finngen)) {
    master_data[[tab]] <- data_finngen
  } else if (!is.null(data_cardiogram)) {
    master_data[[tab]] <- data_cardiogram
  }
}

# Create a new workbook for comparison results
output_wb <- createWorkbook()
title_style <- createStyle(fontSize = 14, fontColour = "black", textDecoration = "bold", halign = "center")
header_style <- createStyle(fontColour = "white", fgFill = "#C8A2C8", textDecoration = "bold", halign = "center", border = "TopBottomLeftRight")

# Define sheets for TOC (including master results)
toc_data <- data.frame(
  Sheet = c(paste0(tabs, "_Comparison"), "FinnGen_Directory", "Cardiogram_Directory", tabs),
  Title = c(paste(tabs, "Comparison Results"),
            "FinnGen Directory Analysis",
            "Cardiogram Directory Analysis",
            paste(tabs, "Merged Results"))
)

# Add TOC sheet
addWorksheet(output_wb, "TOC")
writeData(output_wb, "TOC", "Table of Contents", startRow = 1, startCol = 1)
mergeCells(output_wb, "TOC", cols = 1:2, rows = 1)
addStyle(output_wb, "TOC", title_style, rows = 1, cols = 1)
writeData(output_wb, "TOC", toc_data, startRow = 3, startCol = 1, headerStyle = header_style)
setColWidths(output_wb, "TOC", cols = 1:2, widths = "auto")

# Function to add a formatted sheet
add_formatted_sheet <- function(wb, sheet_name, data, title) {
  if (is.null(data) || nrow(data) == 0) {
    addWorksheet(wb, sheet_name)
    writeData(wb, sheet_name, "No data available", startRow = 1, startCol = 1)
    return()
  }
  addWorksheet(wb, sheet_name)
  writeData(wb, sheet_name, title, startRow = 1, startCol = 1)
  writeData(wb, sheet_name, data, startRow = 3, startCol = 1, headerStyle = header_style)
  mergeCells(wb, sheet_name, cols = 1:ncol(data), rows = 1)
  setColWidths(wb, sheet_name, cols = 1:ncol(data), widths = "auto")
  addStyle(wb, sheet_name, title_style, rows = 1, cols = 1)
}

# Add comparison results to Excel
for (sheet_name in tabs) {
  if (!is.null(comparison_results[[sheet_name]])) {
    result <- comparison_results[[sheet_name]]
    
    # Create a new sheet for this comparison
    sheet_title <- paste(sheet_name, "Comparison Results")
    sheet_name_out <- paste0(sheet_name, "_Comparison")
    addWorksheet(output_wb, sheet_name_out)
    
    # Write title
    writeData(output_wb, sheet_name_out, sheet_title, startRow = 1, startCol = 1)
    mergeCells(output_wb, sheet_name_out, cols = 1:2, rows = 1)
    addStyle(output_wb, sheet_name_out, title_style, rows = 1, cols = 1)
    
    # Write counts
    writeData(output_wb, sheet_name_out, "Counts Summary", startRow = 3, startCol = 1)
    writeData(output_wb, sheet_name_out, result$counts, startRow = 4, startCol = 1, headerStyle = header_style)
    setColWidths(output_wb, sheet_name_out, cols = 1:2, widths = "auto")
    
    # Write common Protein_IDs
    if (!is.null(result$common)) {
      writeData(output_wb, sheet_name_out, "Common Protein_IDs", startRow = 10, startCol = 1)
      writeData(output_wb, sheet_name_out, result$common, startRow = 11, startCol = 1, headerStyle = header_style)
      setColWidths(output_wb, sheet_name_out, cols = 1, widths = "auto")
    }
    
    # Write FinnGen-only Protein_IDs
    if (!is.null(result$finngen_only)) {
      start_row <- if (!is.null(result$common)) 11 + nrow(result$common) + 2 else 10
      writeData(output_wb, sheet_name_out, "Protein_IDs Unique to FinnGen", startRow = start_row, startCol = 1)
      writeData(output_wb, sheet_name_out, result$finngen_only, startRow = start_row + 1, startCol = 1, headerStyle = header_style)
      setColWidths(output_wb, sheet_name_out, cols = 1, widths = "auto")
    }
    
    # Write Cardiogram-only Protein_IDs
    if (!is.null(result$cardiogram_only)) {
      start_row <- if (!is.null(result$finngen_only)) {
        (if (!is.null(result$common)) 11 + nrow(result$common) + 2 else 10) + nrow(result$finngen_only) + 2
      } else if (!is.null(result$common)) {
        11 + nrow(result$common) + 2
      } else {
        10
      }
      writeData(output_wb, sheet_name_out, "Protein_IDs Unique to Cardiogram", startRow = start_row, startCol = 1)
      writeData(output_wb, sheet_name_out, result$cardiogram_only, startRow = start_row + 1, startCol = 1, headerStyle = header_style)
      setColWidths(output_wb, sheet_name_out, cols = 1, widths = "auto")
    }
  }
}

# Add directory analysis results to Excel
# FinnGen Directory
if (!is.null(dir_results_finngen)) {
  add_formatted_sheet(output_wb, "FinnGen_Directory", dir_results_finngen$counts, "FinnGen Directory Analysis")
  
  # Write detailed lists
  start_row <- 8
  if (!is.null(dir_results_finngen$empty)) {
    writeData(output_wb, "FinnGen_Directory", "Empty Subfolders", startRow = start_row, startCol = 1)
    writeData(output_wb, "FinnGen_Directory", dir_results_finngen$empty, startRow = start_row + 1, startCol = 1, headerStyle = header_style)
    setColWidths(output_wb, "FinnGen_Directory", cols = 1, widths = "auto")
    start_row <- start_row + nrow(dir_results_finngen$empty) + 3
  }
  
  if (!is.null(dir_results_finngen$tss_only)) {
    writeData(output_wb, "FinnGen_Directory", "Subfolders with tss_harmonized_ Only", startRow = start_row, startCol = 1)
    writeData(output_wb, "FinnGen_Directory", dir_results_finngen$tss_only, startRow = start_row + 1, startCol = 1, headerStyle = header_style)
    setColWidths(output_wb, "FinnGen_Directory", cols = 1, widths = "auto")
    start_row <- start_row + nrow(dir_results_finngen$tss_only) + 3
  }
  
  if (!is.null(dir_results_finngen$mr_results_only)) {
    writeData(output_wb, "FinnGen_Directory", "Subfolders with MR_Results_ Only", startRow = start_row, startCol = 1)
    writeData(output_wb, "FinnGen_Directory", dir_results_finngen$mr_results_only, startRow = start_row + 1, startCol = 1, headerStyle = header_style)
    setColWidths(output_wb, "FinnGen_Directory", cols = 1, widths = "auto")
    start_row <- start_row + nrow(dir_results_finngen$mr_results_only) + 3
  }
  
  if (!is.null(dir_results_finngen$both)) {
    writeData(output_wb, "FinnGen_Directory", "Subfolders with Both tss_harmonized_ and MR_Results_", startRow = start_row, startCol = 1)
    writeData(output_wb, "FinnGen_Directory", dir_results_finngen$both, startRow = start_row + 1, startCol = 1, headerStyle = header_style)
    setColWidths(output_wb, "FinnGen_Directory", cols = 1, widths = "auto")
  }
}

# Cardiogram Directory
if (!is.null(dir_results_cardiogram)) {
  add_formatted_sheet(output_wb, "Cardiogram_Directory", dir_results_cardiogram$counts, "Cardiogram Directory Analysis")
  
  # Write detailed lists
  start_row <- 8
  if (!is.null(dir_results_cardiogram$empty)) {
    writeData(output_wb, "Cardiogram_Directory", "Empty Subfolders", startRow = start_row, startCol = 1)
    writeData(output_wb, "Cardiogram_Directory", dir_results_cardiogram$empty, startRow = start_row + 1, startCol = 1, headerStyle = header_style)
    setColWidths(output_wb, "Cardiogram_Directory", cols = 1, widths = "auto")
    start_row <- start_row + nrow(dir_results_cardiogram$empty) + 3
  }
  
  if (!is.null(dir_results_cardiogram$tss_only)) {
    writeData(output_wb, "Cardiogram_Directory", "Subfolders with tss_harmonized_ Only", startRow = start_row, startCol = 1)
    writeData(output_wb, "Cardiogram_Directory", dir_results_cardiogram$tss_only, startRow = start_row + 1, startCol = 1, headerStyle = header_style)
    setColWidths(output_wb, "Cardiogram_Directory", cols = 1, widths = "auto")
    start_row <- start_row + nrow(dir_results_cardiogram$tss_only) + 3
  }
  
  if (!is.null(dir_results_cardiogram$mr_results_only)) {
    writeData(output_wb, "Cardiogram_Directory", "Subfolders with MR_Results_ Only", startRow = start_row, startCol = 1)
    writeData(output_wb, "Cardiogram_Directory", dir_results_cardiogram$mr_results_only, startRow = start_row + 1, startCol = 1, headerStyle = header_style)
    setColWidths(output_wb, "Cardiogram_Directory", cols = 1, widths = "auto")
    start_row <- start_row + nrow(dir_results_cardiogram$mr_results_only) + 3
  }
  
  if (!is.null(dir_results_cardiogram$both)) {
    writeData(output_wb, "Cardiogram_Directory", "Subfolders with Both tss_harmonized_ and MR_Results_", startRow = start_row, startCol = 1)
    writeData(output_wb, "Cardiogram_Directory", dir_results_cardiogram$both, startRow = start_row + 1, startCol = 1, headerStyle = header_style)
    setColWidths(output_wb, "Cardiogram_Directory", cols = 1, widths = "auto")
  }
}

# Add merged results to Excel
for (tab in tabs) {
  if (!is.null(master_data[[tab]])) {
    add_formatted_sheet(output_wb, tab, master_data[[tab]], paste(tab, "Merged Results"))
  }
}

# Save the comparison workbook
output_file <- file.path(base_dir, "Comparison_Results_2025-03-22.xlsx")
saveWorkbook(output_wb, output_file, overwrite = TRUE)
cat(sprintf("Comparison results saved to: %s\n", output_file))

# Create a separate master results workbook
master_wb <- createWorkbook()

# Add TOC sheet for master results
toc_data_master <- data.frame(
  Sheet = tabs,
  Title = paste(tabs, "Merged Results")
)
addWorksheet(master_wb, "TOC")
writeData(master_wb, "TOC", "Table of Contents", startRow = 1, startCol = 1)
mergeCells(master_wb, "TOC", cols = 1:2, rows = 1)
addStyle(master_wb, "TOC", title_style, rows = 1, cols = 1)
writeData(master_wb, "TOC", toc_data_master, startRow = 3, startCol = 1, headerStyle = header_style)
setColWidths(master_wb, "TOC", cols = 1:2, widths = "auto")

# Add merged data to master workbook
for (tab in tabs) {
  if (!is.null(master_data[[tab]])) {
    add_formatted_sheet(master_wb, tab, master_data[[tab]], paste(tab, "Merged Results"))
  }
}

# Save the master results workbook
master_output_file <- file.path(base_dir, "Master_Merged_Results_2025-03-22.xlsx")
saveWorkbook(master_wb, master_output_file, overwrite = TRUE)
cat(sprintf("Master merged results saved to: %s\n", master_output_file))

fread(/Users/charleenadams/mr_ukbppp_chd/Master_Merged_Results_2025-03-23.xlsx)

8 Formating the Merged Excel Results

This is to have results in a format that are easier on the eyes and near-publication ready. I removed the MR-LASSO and MR-ConMix.

# Run with:
# cd /Users/charleenadams/mr_ukbppp_chd/scripts
# nohup python3 pub_ready_results.py > pub_ready_results.log 2>&1 &
# ps -p $(pgrep -f pub_ready_results.py) -o pid,stat,etime

import pandas as pd
import numpy as np
import sys
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows
from datetime import datetime

# --- CONFIGURATION ---
excel_file_path = "/Users/charleenadams/mr_ukbppp_chd/Comparison_Results_2025-03-23.xlsx"

# System date + time in filename (format YYYY-MM-DD_HHMM)
current_time = datetime.now().strftime("%Y-%m-%d_%H%M")
output_path = f"/Users/charleenadams/mr_ukbppp_chd/Publication_Ready_MR_Results_{current_time}.xlsx"

target_sheets = [
    "IVW", "MR-Egger", "Weighted_Median", "Wald_Ratios",
    "Heterogeneity_Test", "Pleiotropy_Test", "Clumped_Data"
]

HEADER_ROW_INDEX = 2  # Header is on row 3 (zero-based index)

# --- READ DATA ---
print("📖 Reading Excel sheets...")

try:
    data_frames = {
        sheet: pd.read_excel(excel_file_path, sheet_name=sheet, header=HEADER_ROW_INDEX)
        for sheet in target_sheets
    }
except Exception as e:
    print(f"❌ Failed to read Excel sheets: {e}")
    sys.exit(1)

# --- COLUMN VALIDATION ---
def validate_columns(df, expected_columns, df_name):
    missing = [col for col in expected_columns if col not in df.columns]
    if missing:
        print(f"❌ Missing columns in {df_name}: {missing}")
        print(f"🔎 Available columns in {df_name}: {list(df.columns)}")
        sys.exit(1)

# --- CLEANING FUNCTIONS ---
def clean_common_mr_results(df, method_name):
    df = df.copy()

    df.rename(columns={
        'Protein_ID': 'Protein ID',
        'Source': 'Data Source',
        'nsnp': 'No. SNPs',
        'b': 'Log Odds',
        'se': 'Standard Error',
        'pval': 'P-value',
        'Our_Set': 'Our Set'
    }, inplace=True)

    expected = ['Protein ID', 'Data Source', 'No. SNPs', 'Log Odds', 'Standard Error', 'P-value', 'Our Set']
    validate_columns(df, expected, method_name)

    df['MR Method'] = df['method']
    df['Lower CI'] = df['Log Odds'] - 1.96 * df['Standard Error']
    df['Upper CI'] = df['Log Odds'] + 1.96 * df['Standard Error']

    df['P-value Threshold (for genetic instrument construction)'] = '5E-6'
    df['LD R² Threshold (for genetic instrument construction)'] = '0.001'

    return df[[
        'Protein ID', 'Data Source', 'No. SNPs', 'MR Method',
        'Log Odds', 'Standard Error', 'Lower CI', 'Upper CI',
        'P-value', 'Our Set',
        'P-value Threshold (for genetic instrument construction)',
        'LD R² Threshold (for genetic instrument construction)'
    ]]

def clean_wald_ratios(df):
    df = df.copy()

    df.rename(columns={
        'Protein_ID': 'Protein ID',
        'Source': 'Data Source',
        'wald_beta': 'Log Odds',
        'wald_se': 'Standard Error',
        'pval': 'P-value',
        'Our_Set': 'Our Set'
    }, inplace=True)

    expected = ['Protein ID', 'Data Source', 'Log Odds', 'Standard Error', 'P-value', 'Our Set']
    validate_columns(df, expected, "Wald Ratios")

    df['No. SNPs'] = 1
    df['MR Method'] = df['method'] if 'method' in df.columns else 'Wald Ratio'

    df['Lower CI'] = df['Log Odds'] - 1.96 * df['Standard Error']
    df['Upper CI'] = df['Log Odds'] + 1.96 * df['Standard Error']

    df['P-value Threshold (for genetic instrument construction)'] = '5E-6'
    df['LD R² Threshold (for genetic instrument construction)'] = '0.001'

    return df[[
        'Protein ID', 'Data Source', 'No. SNPs', 'MR Method',
        'Log Odds', 'Standard Error', 'Lower CI', 'Upper CI',
        'P-value', 'Our Set',
        'P-value Threshold (for genetic instrument construction)',
        'LD R² Threshold (for genetic instrument construction)'
    ]]

def clean_clumped_data(df):
    df = df.copy()

    rename_map = {
        'SNP': 'SNP',
        'chr.exposure': 'Chromosome Exposure',
        'chr.outcome': 'Chromosome Outcome',
        'pos.exposure': 'Position Exposure',
        'pos.outcome': 'Position Outcome',
        'effect_allele.exposure': 'Effect Allele Exposure',
        'other_allele.exposure': 'Other Allele Exposure',
        'effect_allele.outcome': 'Effect Allele Outcome',
        'other_allele.outcome': 'Other Allele Outcome',
        'beta.exposure': 'Beta Exposure',
        'se.exposure': 'SE Exposure',
        'pval.exposure': 'P-value Exposure',
        'eaf.exposure': 'EAF Exposure',
        'samplesize.exposure': 'Sample Size Exposure',
        'beta.outcome': 'Beta Outcome',
        'se.outcome': 'SE Outcome',
        'pval.outcome': 'P-value Outcome',
        'eaf.outcome': 'EAF Outcome',
        'samplesize.outcome': 'Sample Size Outcome',
        'remove': 'Remove',
        'palindromic': 'Palindromic',
        'ambiguous': 'Ambiguous',
        'rsq.exposure': 'RSQ Exposure',
        'rsq.outcome': 'RSQ Outcome',
        'steiger_dir': 'Steiger Direction',
        'steiger_pval': 'Steiger P-value',
        'F_stat': 'F-statistic',
        'Protein_ID': 'Protein ID',
        'Source': 'Data Source',
        'Our_Set': 'Our Set'
    }

    df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns}, inplace=True)

    column_order = [
        'Protein ID', 'Data Source',
        'Sample Size Exposure', 'Sample Size Outcome',
        'SNP',
        'Chromosome Exposure', 'Chromosome Outcome',
        'Position Exposure', 'Position Outcome',
        'Effect Allele Exposure', 'Other Allele Exposure',
        'Effect Allele Outcome', 'Other Allele Outcome',
        'EAF Exposure', 'EAF Outcome',
        'Beta Exposure', 'Beta Outcome',
        'SE Exposure', 'SE Outcome',
        'P-value Exposure', 'P-value Outcome',
        'RSQ Exposure', 'RSQ Outcome',
        'Steiger Direction', 'Steiger P-value',
        'F-statistic',
        'Remove', 'Palindromic', 'Ambiguous',
        'Our Set'
    ]

    existing_cols = [col for col in column_order if col in df.columns]
    return df[existing_cols]

def clean_heterogeneity(df):
    df = df.copy()

    df.rename(columns={
        'Protein_ID': 'Protein ID',
        'Source': 'Data Source',
        'Q': 'Q-Heterogeneity',
        'Q_df': 'Q-Heterogeneity Degrees Freedom',
        'Q_pval': 'Q-Heterogeneity P-value',
        'Our_Set': 'Our Set'
    }, inplace=True)

    return df

def clean_pleiotropy(df):
    df = df.copy()

    df.rename(columns={
        'Protein_ID': 'Protein ID',
        'Source': 'Data Source',
        'egger_intercept': 'MR-Egger Intercept',
        'se': 'MR-Egger Standard Error',
        'pval': 'MR-Egger P-value',
        'Our_Set': 'Our Set'
    }, inplace=True)

    return df

# --- CLEAN DATA ---
print("🧹 Cleaning data...")

ivw_clean = clean_common_mr_results(data_frames['IVW'], 'IVW')
egger_clean = clean_common_mr_results(data_frames['MR-Egger'], 'MR-Egger')
wm_clean = clean_common_mr_results(data_frames['Weighted_Median'], 'Weighted Median')
wald_clean = clean_wald_ratios(data_frames['Wald_Ratios'])
clumped_data_clean = clean_clumped_data(data_frames['Clumped_Data'])

heterogeneity_clean = clean_heterogeneity(data_frames['Heterogeneity_Test'])
pleiotropy_clean = clean_pleiotropy(data_frames['Pleiotropy_Test'])

# --- MERGE RESULTS ---
print("🔀 Merging results...")

mr_results_combined = pd.concat([
    ivw_clean, egger_clean, wm_clean, wald_clean
], ignore_index=True)

# Merge heterogeneity: IVW
hetero_ivw = heterogeneity_clean[heterogeneity_clean['method'] == 'Inverse variance weighted']
hetero_egger = heterogeneity_clean[heterogeneity_clean['method'] == 'MR Egger']

# Merge heterogeneity for IVW
final_results = mr_results_combined.merge(
    hetero_ivw[['Protein ID', 'Data Source', 'Our Set',
                'Q-Heterogeneity', 'Q-Heterogeneity Degrees Freedom', 'Q-Heterogeneity P-value']],
    on=['Protein ID', 'Data Source', 'Our Set'], how='left'
)

# Merge heterogeneity for MR-Egger rows separately
final_results = final_results.merge(
    hetero_egger[['Protein ID', 'Data Source', 'Our Set',
                  'Q-Heterogeneity', 'Q-Heterogeneity Degrees Freedom', 'Q-Heterogeneity P-value']],
    on=['Protein ID', 'Data Source', 'Our Set'], how='left', suffixes=('', '_egger')
)

# Use MR Egger heterogeneity for Egger rows
egger_mask = final_results['MR Method'] == 'MR Egger'
for col in ['Q-Heterogeneity', 'Q-Heterogeneity Degrees Freedom', 'Q-Heterogeneity P-value']:
    final_results.loc[egger_mask, col] = final_results.loc[egger_mask, f"{col}_egger"]
    final_results.drop(columns=[f"{col}_egger"], inplace=True)

# Merge pleiotropy for IVW rows only
final_results = final_results.merge(
    pleiotropy_clean[['Protein ID', 'Data Source', 'Our Set',
                      'MR-Egger Intercept', 'MR-Egger Standard Error', 'MR-Egger P-value']],
    on=['Protein ID', 'Data Source', 'Our Set'], how='left'
)

final_results.loc[final_results['MR Method'] != 'Inverse variance weighted',
                  ['MR-Egger Intercept', 'MR-Egger Standard Error', 'MR-Egger P-value']] = np.nan

# --- EXPORT TO EXCEL ---
print(f"💾 Saving to {output_path}")

wb = Workbook()

# === MR RESULTS SHEET ===
ws1 = wb.active
ws1.title = "MR Results"

# Add title to row 1
ws1.merge_cells(start_row=1, start_column=1, end_row=1, end_column=final_results.shape[1])
ws1.cell(row=1, column=1).value = (
    "Table 1. Associations of genetically predicted protein levels with coronary heart disease (CHD) "
    "in Cardiogram and FinnGen using cis-Mendelian randomization (MR) analyses."
)

# Write data starting from row 3
for r_idx, row in enumerate(dataframe_to_rows(final_results, index=False, header=True), 3):
    for c_idx, value in enumerate(row, 1):
        ws1.cell(row=r_idx, column=c_idx, value=value)

# === DATA SHEET ===
ws2 = wb.create_sheet("Data")

# Add title to row 1
ws2.merge_cells(start_row=1, start_column=1, end_row=1, end_column=clumped_data_clean.shape[1])
ws2.cell(row=1, column=1).value = (
    "Table 2. Instrument data used to construct genetic instruments (P-value<5E-6) using cis-variants "
    "(1MB surrounding the transcription start site of the encoding gene for the protein) associated with circulating protein levels."
)

# Write data starting from row 3
for r_idx, row in enumerate(dataframe_to_rows(clumped_data_clean, index=False, header=True), 3):
    for c_idx, value in enumerate(row, 1):
        ws2.cell(row=r_idx, column=c_idx, value=value)

# SAVE
wb.save(output_path)

print("✅ Publication-ready results exported successfully!")

9 App

See: https://yodamendel.shinyapps.io/mr_ukbppp_chd_app_fresh/

# Deployed with: 
rsconnect::deployApp(
  appDir = "/Users/charleenadams/mr_ukbppp_chd/mr_ukbppp_chd_app",
  appName = "mr_ukbppp_chd_app_fresh",
  forceUpdate = TRUE
)

#setwd("/Users/charleenadams/mr_ukbppp_chd/mr_ukbppp_chd_app") # only for knowledge of where to deploy; must be commented out or app won't run

#options(rsconnect.http.timeout = 3600)

# Run with:
# rsconnect::deployApp('/Users/charleenadams/mr_ukbppp_chd/mr_ukbppp_chd_app')

library(shiny)

# ---- UI ----
ui <- fluidPage(
  
  # ---- HEAD STYLES ----
  tags$head(
    tags$style(HTML("
      body {
        font-family: 'Lato', sans-serif;
        color: #1A2526;
        background-color: #FFFFFF;  /* Pure white background */
      }
      #main_title {
        color: #005566;
        margin-bottom: 10px;
        font-size: 32px;
        font-weight: bold;
      }
      #lab_link {
        display: inline-block;
        margin-top: 10px;
        font-size: 18px;
        color: #A51C30;
        text-decoration: none;
        font-weight: bold;
      }
      #lab_link:hover {
        color: #7A1422;
      }
      .info-box {
        background-color: #F5F9FF;
        border: 2px solid #005566;
        border-radius: 10px;
        padding: 20px;
        margin-bottom: 30px;
        box-shadow: 2px 2px 8px rgba(0,0,0,0.1);
      }
      .download-box {
        background-color: #FFF5F5;
        border: 2px solid #A51C30;
        border-radius: 10px;
        padding: 20px;
        margin-bottom: 30px;
        box-shadow: 2px 2px 8px rgba(0,0,0,0.1);
      }
      .download-button {
        margin-bottom: 15px;
      }
      .btn-primary {
        background-color: #005566;
        border-color: #003f4d;
        font-size: 16px;
        padding: 10px 20px;
        border-radius: 8px;
        color: #ffffff;
        text-decoration: none;
        display: inline-block;
        transition: background-color 0.3s ease;
      }
      .btn-primary:hover {
        background-color: #007c91;
        color: #ffffff;
      }
      #logo_container {
        text-align: center;
        margin-top: 30px;
        margin-bottom: 20px;
      }
      #logo_image {
        width: 200px;
        height: auto;
      }
    "))
  ),
  
  # ---- HEADER TITLE & LAB LINK ----
  div(
    style = "text-align: center; margin-bottom: 10px;",
    h1(HTML("Cis-MR of 183 UKB-PPP Proteins & CHD️"), id = "main_title")
  ),
  
  div(
    style = "text-align: center; margin-bottom: 30px;",
    tags$a(
      href = "https://www.bidmc.org/research/research-by-department/medicine/cardiovascular-medicine/personal-genomics-and-cardiometabolic-disease",
      target = "_blank",
      "🔬 Our Lab",
      id = "lab_link"
    )
  ),
  
  # ---- DESCRIPTION BOX ----
  div(class = "info-box",
      h3("🧬 Analysis Summary"),
      p(HTML("
        This platform showcases Mendelian Randomization (MR) analyses of 183 proteins from UKB-PPP, investigating their causal role in coronary heart disease (CHD).<br><br>
        We constructed <b>cis-acting genetic instruments</b> (1MB around TSS), filtered at <b>P &lt; 5e-06</b> and <b>R² &lt; 0.001</b>.<br><br>
        Effect estimates (b) are presented as <b>log odds</b>.<br><br>
        
        <b>The downloadable Excel file includes:</b><br></b><br>
        
        ✅ MR results for <b>CARDIoGRAMplusC4D</b> and <b>FinnGen</b>, plus proteins sharing gene names but different Olink IDs with our set of 183.<br><br>
        ✅ Pradeep's supplementary table for comparison.<br><br>
        ✅ QC information about what we were able to test.<br><br>
        
        <b>Notes and Scripts</b> (Walkthrough + Annotated Code):<br>
        👉 <a href='https://rpubs.com/YodaMendel/1288109' target='_blank'>https://rpubs.com/YodaMendel/1288109</a>
      "))
  ),
  
  # ---- DOWNLOAD BOX ----
  div(class = "download-box",
      h3("📦 Download Results"),
      p("Click below to download the master Excel file of our MR results directly."),
      div(
        class = "download-button",
        downloadButton(
          outputId = "download_results",
          label = "Download Master MR Results (Excel)",
          class = "btn-primary"
        )
      )
  ),
  
  # ---- BIDMC LOGO WITH HYPERLINK ----
  div(
    id = "logo_container",
    tags$a(
      href = "https://www.bidmc.org/research/research-by-department/medicine/cardiovascular-medicine/personal-genomics-and-cardiometabolic-disease",
      target = "_blank",
      tags$img(
        src = "BIDMC_HMS_Stacked-LockUp.png",
        id = "logo_image",
        alt = "Beth Israel Deaconess Medical Center Logo"
      )
    )
  )
)

# ---- SERVER ----
server <- function(input, output, session) {
  
  # Download handler for the Excel file
  output$download_results <- downloadHandler(
    filename = function() {
      "Publication_Ready_MR_Results.xlsx"
    },
    content = function(file) {
      file.copy("www/Publication_Ready_MR_Results_2025-03-24_1412.xlsx", file)
    },
    contentType = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
  )
}

# ---- RUN APP ----
shinyApp(ui = ui, server = server)