We conducted two-sample Mendelian Randomization (MR) and colocalization
analyses to evaluate the causal effect of acetyl-amino acid levels,
instrumented by variants within a 1 Mb region surrounding the
transcription start site (TSS) of the ACY1 gene, on seven metabolic
outcomes. Genetic instruments were derived from the METSIM cohort;
outcome summary statistics were obtained from the UK Biobank BMI GWAS of
Jurgens et al. (2022), MRC-IEU GWAS-VCF datasets, and the GWAS Catalog,
as described below. All analyses were performed using R (version 4.5.0)
and Python (version 3.12.9).
Summary statistics for acetyl-amino acids were sourced from the
METSIM (Metabolic Syndrome in Men) study, a population-based cohort of
Finnish men. The data, available in GRCh38 coordinates, were downloaded
from the PheWeb repository
(https://pheweb.org/metsim-metab/) as compressed files.
The downloaded files are compressed but lack a .gz extension, so they were renamed and then decompressed with the following command:
for f in C*; do mv "$f" "$f.gz"; bgzip -d "$f.gz"; done
This yielded tab-separated value (tsv) files containing association statistics for approximately 6,000 individuals. Each file includes the following columns:
- chromosome (chrom)
- position (pos)
- reference (ref) and alternate (alt) alleles
- effect size (beta) and its standard error (sebeta)
- p-value (pval)
- minor allele frequency (maf)
- rsIDs (rsids)
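Once a metabolite file has been downloaded and decompressed (commands below), its structure can be sanity-checked in R; a minimal sketch using C1083, assuming it sits in the working directory:
library(data.table)
# Read one decompressed METSIM metabolite file (tab-separated, with header)
c1083 <- fread("C1083")
# Columns used by the downstream formatting code
expected_cols <- c("chrom", "pos", "ref", "alt", "beta", "sebeta", "pval", "maf", "rsids")
stopifnot(all(expected_cols %in% colnames(c1083)))
head(c1083)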
The following acetyl-amino acids were fetched (GRCh38 coordinates):
- N-acetylmethionine (C1083, 6,135 samples): wget https://pheweb.org/metsim-metab/download/C1083
- N-acetylalanine (C1110, 6,136 samples): wget https://pheweb.org/metsim-metab/download/C1110
- N-acetylisoleucine (C100001276, 4,946 samples): wget https://pheweb.org/metsim-metab/download/C100001276
- N-acetylglutamate (C100000282, 5,864 samples): wget https://pheweb.org/metsim-metab/download/C100000282
- N-acetylserine (C100001851, 6,133 samples): wget https://pheweb.org/metsim-metab/download/C100001851

BMI summary statistics were obtained from the Jurgens et al. (2022)
GWAS of European-ancestry individuals in the UK Biobank, available at
https://personal.broadinstitute.org/ryank/Jurgens_Pirruccello_2022_GWAS_Sumstats.zip.
The file GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.tsv
provides association results for approximately 460,000 participants,
originally reported in GRCh37 (hg19) coordinates. Key columns include:
- CHR and BP: chromosome and base-pair position (GRCh37)
- SNP: rsID
- ALLELE1 and ALLELE0: effect and other allele
- A1FREQ: effect allele frequency
- BETA and SE: effect size and standard error
- P: p-value

BMI summary statistics were downloaded and processed as follows:
wget https://personal.broadinstitute.org/ryank/Jurgens_Pirruccello_2022_GWAS_Sumstats.zip
# Select: GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.tsv
# Use LiftOver to convert from GRCh37 to GRCh38 coordinates
The following bash commands were used to decompress and format
MRC-IEU outcome files (.vcf.gz) into tab-separated value
(tsv) files with calculated p-values:
# Step 1: Decompress all .vcf.gz files
for f in ukb-b-12493.vcf.gz ukb-b-12651.vcf.gz ukb-b-15541.vcf.gz ukb-b-970.vcf.gz ebi-a-GCST006867.vcf.gz; do
gunzip -k "$f"
done
# Step 2: Convert to tsv and calculate pval from LP (-log10 p)
# Note: the awk below splits the first sample column (field 10) on ':' and assumes its first
# four values are ordered beta : SE : -log10(p) : EAF, as assumed for these GWAS-VCF files
for f in ukb-b-12493.vcf ukb-b-12651.vcf ukb-b-15541.vcf ukb-b-970.vcf ebi-a-GCST006867.vcf; do
outfile="${f/.vcf/_formatted.tsv}"
echo -e "CHR\tBP\tSNP\tREF\tALT\tEAF\tBETA\tSE\tLP\tPVAL" > "$outfile"
awk 'BEGIN{OFS="\t"}
!/^#/ {
split($10, a, ":");
beta = a[1];
se = a[2];
lp = a[3];
eaf = a[4];
pval = 10^(-lp);
print $1, $2, $3, $4, $5, eaf, beta, se, lp, pval
}' "$f" >> "$outfile"
done
The non-alcoholic fatty liver disease summary statistics (GWAS Catalog study GCST90091033, already on GRCh38) were decompressed and reformatted into the same layout:
gunzip -k 34841290-GCST90091033-EFO_0003095.h.tsv.gz
awk 'BEGIN{OFS="\t"; print "CHR","BP","SNP","REF","ALT","EAF","BETA","SE","LP","PVAL"}
NR>1 {
lp = -log($21)/log(10);
print $3, $4, $2, $5, $6, $11, $7, $20, lp, $21
}' 34841290-GCST90091033-EFO_0003095.h.tsv > 34841290-GCST90091033-EFO_0003095_formatted.tsv
To align the BMI data with the acetylaminos data (GRCh38), we
converted the GRCh37 coordinates to GRCh38 using a three-step
bioinformatics pipeline implemented in Python and
Unix:
Conversion to BED Format: A Python
script (make_bed.py) transformed the BMI TSV file into a
BED format, extracting chromosome, start (BP - 1 for
0-based coordinates), end (BP), and SNP ID. The output was
saved as
GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.bed.
LiftOver: The UCSC liftOver tool
was applied using the chain file hg19ToHg38.over.chain.gz
to map GRCh37 coordinates to GRCh38. Successfully mapped SNPs were
written to
GWAS_sumstats_EUR__invnorm_bmi__TOTALsample_hg38.bed, with
unmapped SNPs logged in unmapped_SNPs.bed.
Merging with Original Data: A second
Python script (format_bmi.py) processed the
lifted BED file, adjusting coordinates to 1-based notation and merging
them with the original TSV data using the SNP column. The
merged dataset was sorted by SNP, processed in chunks (500,000 rows) to
manage memory, and renamed to conform to the TwoSampleMR R
package requirements: effect_allele (ALLELE1),
other_allele (ALLELE0), eaf
(A1FREQ), beta (BETA),
se (SE), pval (P),
and BP_GRCh38 (new position). The final output was saved as
merged_gwas_data_grch38_twosample.tsv.
Both datasets underwent cleaning and harmonization in R
using the data.table, dplyr, and
tidyr packages:
Acetylaminos Data: Rows with missing or invalid
rsIDs were filtered out
(!is.na(rsids) & rsids != "" & grepl("^rs", rsids)),
and the data were formatted for MR analysis with columns including
chr.exposure, pos.exposure,
beta.exposure, se.exposure,
pval.exposure, effect_allele.exposure,
other_allele.exposure, eaf.exposure,
samplesize.exposure (set to each metabolite's sample size listed above), and SNP.
BMI Data: Similar filtering removed invalid
SNPs, and the data were formatted with chr.outcome,
pos.outcome, beta.outcome,
se.outcome, pval.outcome,
effect_allele.outcome, other_allele.outcome,
eaf.outcome, samplesize.outcome (460,000), and
SNP.
Harmonization: The harmonise_data
function from the TwoSampleMR package aligned acetylaminos
(exposure) and BMI (outcome) data by matching alleles and removing
palindromic SNPs with ambiguous effects, producing a harmonized dataset
saved as
harmonized_[exposure]_Jurgens_BMI_dat_[date].csv.
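For orientation, the TwoSampleMR column-naming convention and the harmonise_data call look like this in miniature (a sketch on hypothetical toy values, not the study data; the full pipeline appears in the scripts below):
library(TwoSampleMR)
# Toy single-SNP exposure and outcome data frames, already named per the convention above
# (the rsID and all numbers here are made up for illustration)
exposure_toy <- data.frame(
  SNP = "rs12345",
  beta.exposure = 0.45, se.exposure = 0.05, pval.exposure = 1e-18,
  effect_allele.exposure = "A", other_allele.exposure = "G",
  eaf.exposure = 0.12, samplesize.exposure = 6135,
  exposure = "N-acetylmethionine", id.exposure = "N-acetylmethionine"
)
outcome_toy <- data.frame(
  SNP = "rs12345",
  beta.outcome = 0.010, se.outcome = 0.004, pval.outcome = 0.012,
  effect_allele.outcome = "A", other_allele.outcome = "G",
  eaf.outcome = 0.11, samplesize.outcome = 460000,
  outcome = "BMI", id.outcome = "BMI"
)
# Align alleles between exposure and outcome; ambiguous palindromic SNPs are flagged and removed
dat_toy <- harmonise_data(exposure_dat = exposure_toy, outcome_dat = outcome_toy)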
The BMI summary statistics were reported on GRCh37; GRCh38 coordinates were needed to match the METSIM data and for other post-GWAS analyses.
This was a three-step process:
Step 1: Convert the BMI summary statistics from TSV to BED format
This first script (make_bed.py) converts the BMI summary statistics file (in TSV format) to a BED file format.
import csv
# Open the input TSV file and prepare to write to the output BED file
with open('GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.tsv', 'r') as tsvfile, open('GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.bed', 'w') as bedfile:
    tsv_reader = csv.DictReader(tsvfile, delimiter='\t')
    for row in tsv_reader:
        # Extract the necessary information
        chromosome = f"chr{row['CHR']}"
        start = int(row['BP']) - 1  # Convert to 0-based for BED format
        end = row['BP']
        snp_id = row['SNP']
        # Write to BED file
        bedfile.write(f"{chromosome}\t{start}\t{end}\t{snp_id}\n")
print("Conversion complete. Check 'GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.bed' for the output.")
Step 2: liftOver
# Arguments, in order: input BED file; chain file for converting GRCh37 (hg19) → GRCh38 (hg38);
# output file for successfully mapped SNPs; output file for unmapped SNPs
liftOver \
  GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.bed \
  hg19ToHg38.over.chain.gz \
  GWAS_sumstats_EUR__invnorm_bmi__TOTALsample_hg38.bed \
  unmapped_SNPs.bed
Step 3: This second script (format_bmi.py) does the following:
- Reads the BED file: loads GWAS_sumstats_EUR__invnorm_bmi__TOTALsample_hg38.bed (CHR, START, END, SNP) and copies the END positions to a new column called BP_GRCh38.
- Sorts the BED data: sorts by the SNP column for efficient merging later.
- Processes the GWAS TSV file: reads the GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.tsv file in chunks (500,000 rows at a time) to avoid memory overflow, sorting each chunk by SNP for faster and more efficient merging.
- Merges: left-joins each chunk with the BED data on the SNP column, adding BP_GRCh38 (new genome build position) to the GWAS data.
- Renames columns: renames key columns to fit the format required by the TwoSampleMR package: ALLELE1 → effect_allele, ALLELE0 → other_allele, A1FREQ → eaf (effect allele frequency), BETA → beta, SE → se (standard error), P → pval.
- Saves merged results: writes merged_gwas_data_grch38_original.tsv (original column names) and merged_gwas_data_grch38_twosample.tsv (renamed) for downstream analysis with the TwoSampleMR package.
import pandas as pd
# Define chunk size for reading large TSV files in smaller parts
chunk_size = 500000  # Adjust based on memory availability
# Initialize an empty list to store the merged chunks
merged_chunks = []
try:
    # Step 1: Load the BED file (without headers)
    bed_columns = ['CHR', 'START', 'END', 'SNP']
    bed_data = pd.read_csv('GWAS_sumstats_EUR__invnorm_bmi__TOTALsample_hg38.bed', sep='\t', header=None, names=bed_columns)
    print("Loaded BED file successfully.", flush=True)
    # Step 2: The BED END column is already the one-based position (START + 1), so use it directly
    bed_data['BP_GRCh38'] = bed_data['END']
    print("Converted BED file positions to one-based.", flush=True)
    # Step 3: Sort the BED data by the SNP column to optimize merging
    bed_data = bed_data.sort_values(by='SNP')
    print("Sorted BED data by SNP.", flush=True)
    # Step 4: Process the TSV file in chunks, and merge each chunk with the BED data
    for chunk in pd.read_csv('GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.tsv', sep='\t', chunksize=chunk_size):
        print("Processing chunk...", flush=True)
        # Sort the chunk by SNP for more efficient merging
        chunk = chunk.sort_values(by='SNP')
        # Merge the chunk with the BED data on SNP. Both inputs carry a CHR column,
        # so pandas appends _x/_y suffixes: CHR_x is the original GWAS chromosome (used downstream in R)
        merged_chunk = pd.merge(chunk, bed_data[['CHR', 'SNP', 'BP_GRCh38']], on='SNP', how='left')
        # Append the merged chunk to the list of merged chunks
        merged_chunks.append(merged_chunk)
    # Step 5: Concatenate all merged chunks together into one DataFrame
    merged_data = pd.concat(merged_chunks)
    print("Merged all chunks successfully.", flush=True)
    # Step 6: Save the merged data to a new TSV file before changing column names
    merged_data.to_csv('merged_gwas_data_grch38_original.tsv', sep='\t', index=False)
    print("Saved 'merged_gwas_data_grch38_original.tsv' successfully.", flush=True)
    # Step 7: Perform the column renaming for TwoSampleMR, keeping both BP columns
    merged_data.rename(columns={
        'SNP': 'SNP',
        'CHR': 'CHR',                # Keep the original chromosome column
        'BP': 'BP',                  # Keep the original (GRCh37) BP
        'BP_GRCh38': 'BP_GRCh38',    # Keep the new GRCh38 position
        'ALLELE1': 'effect_allele',  # ALLELE1 -> effect_allele
        'ALLELE0': 'other_allele',   # ALLELE0 -> other_allele
        'A1FREQ': 'eaf',             # A1FREQ -> effect allele frequency
        'BETA': 'beta',              # BETA -> beta
        'SE': 'se',                  # SE -> standard error
        'P': 'pval'                  # P -> p-value
    }, inplace=True)
    # Step 8: Save the final TwoSampleMR-compatible file in TSV format
    merged_data.to_csv('merged_gwas_data_grch38_twosample.tsv', sep='\t', index=False)
    print("Saved 'merged_gwas_data_grch38_twosample.tsv' successfully.", flush=True)
    # Step 9 (optional / didn't work): Save the merged data in Parquet format for faster reading in the future
    # merged_data.to_parquet('merged_gwas_data_grch38.parquet', index=False)
    # print("Saved 'merged_gwas_data_grch38.parquet' successfully (Parquet format).", flush=True)
except Exception as e:
    print(f"An error occurred: {e}", flush=True)
Re-install liftOver so that it can be used from the base conda environment.
#!/bin/bash
# install_liftover.sh
# Script to install UCSC liftOver on Mac M1/M2/M3 (Apple Silicon) correctly
# Downloads Intel (x86_64) binary because Rosetta handles it automatically
echo "Navigating to /opt/homebrew/bin (standard Homebrew location for Apple Silicon)..."
cd /opt/homebrew/bin || { echo "/opt/homebrew/bin not found, trying /usr/local/bin"; cd /usr/local/bin || exit 1; }
echo "Removing any existing broken liftOver binary..."
sudo rm -f liftOver
echo "Downloading working macOS x86_64 (Intel) liftOver binary from UCSC..."
sudo curl -O http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/liftOver
echo "Making liftOver executable..."
sudo chmod +x liftOver
echo "Testing liftOver installation..."
if liftOver --help > /dev/null 2>&1; then
echo "✅ liftOver installed successfully and ready to use!"
else
echo "❌ liftOver installation failed. Please check manually."
fi
# Use liftOver: skip non-alcoholic fatty liver disease, as already on build 38.
cd /Users/charleenadams/mr_acetylaminos_bmi
for f in *_formatted.tsv; do
# Skip the file that's already on GRCh38
if [[ "$f" == "34841290-GCST90091033-EFO_0003095_formatted.tsv" ]]; then
echo "Skipping $f (already GRCh38)"
continue
fi
prefix="${f/.tsv/}"
# Step 1: Create BED file for liftOver (0-based start)
awk 'BEGIN{OFS="\t"} NR>1 {print "chr"$1, $2-1, $2, $3}' "$f" > "${prefix}_lift.bed"
# Step 2: Run liftOver
liftOver "${prefix}_lift.bed" hg19ToHg38.over.chain.gz "${prefix}_lifted.bed" "${prefix}_unmapped.bed"
# Step 3: Map lifted SNPs back (chr prefix removed)
awk 'BEGIN{OFS="\t"} {gsub(/^chr/,"",$1); print $4, $1, $3}' "${prefix}_lifted.bed" > "${prefix}_map.tsv"
# Step 4: Join lifted coordinates back into the original data
awk 'BEGIN{OFS="\t"} FNR==NR{a[$1]=$3; next}
FNR==1{print $0, "BP38"; next}
{print $0, (a[$3] ? a[$3] : "NA")}' "${prefix}_map.tsv" "$f" > "${prefix}_lifted38.tsv"
done
Instrument Selection: From the filtered dataset,
genetic instruments were selected with a relaxed p-value threshold of
5 × 10⁻⁴ (versus the conventional 5 × 10⁻⁸) to maximize power within the
ACY1 TSS region, justified by prior evidence of acetyl-amino acid
associations at this locus. Instruments were required to have an
F-statistic > 10 to ensure strength, calculated as \[ F = \left(
\frac{\beta_{exposure}}{SE_{exposure}} \right)^2 \]. Steiger
filtering (steiger_filtering) retained only SNPs that explained more
variance in the exposure than in the outcome, consistent with the
assumed exposure-to-outcome direction.
Linkage Disequilibrium (LD) Clumping:
Independent instruments were identified using ld_clump with
a clumping window of 500 kb (versus the default 10,000 kb) and an
r² threshold of 0.001, referencing the 1000 Genomes
European (EUR) population reference panel (bfile = EUR).
MR Methods: Five MR methods were applied:
- Inverse variance weighted (mr_ivw)
- MR-Egger (mr_egger_regression)
- Weighted median (mr_weighted_median)
- MR-Lasso (mr_lasso)
- Contamination mixture, MR-ConMix (mr_conmix)
Sensitivity Analyses: Heterogeneity was assessed
with mr_heterogeneity, pleiotropy with
mr_pleiotropy_test, and leave-one-out analysis with
mr_leaveoneout. Wald ratio tests were also computed for each
instrument individually. When only one instrument was available, only
the Wald ratio test was performed.
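For reference, the per-instrument Wald ratio and its delta-method standard error, as computed in the code below, are \[ \beta_{Wald} = \frac{\beta_{outcome}}{\beta_{exposure}}, \qquad SE_{Wald} = \sqrt{\frac{SE_{outcome}^2}{\beta_{exposure}^2} + \frac{\beta_{outcome}^2 \, SE_{exposure}^2}{\beta_{exposure}^4}} \], with a two-sided p-value obtained from the normal approximation.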
# Clear the environment
rm(list = ls())
# Load required libraries
library(data.table)
library(dplyr)
library(TwoSampleMR)
library(MRInstruments)
library(MendelianRandomization)  # mr_input(), mr_lasso(), mr_conmix()
library(ieugwasr)                # ld_clump()
library(openxlsx)
library(ggplot2)
library(scales)
# Define exposure file information
exposure_info <- list(
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/C1083",
exposure_name = "N-acetylmethionine",
sample_size = 6135
),
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/C1110",
exposure_name = "N-acetylalanine",
sample_size = 6136
),
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/C100001276",
exposure_name = "N-acetylisoleucine",
sample_size = 4946
),
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/C100000282",
exposure_name = "N-acetylglutamate",
sample_size = 5864
),
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/C100001851",
exposure_name = "N-acetylserine",
sample_size = 6133
)
)
# Loop through each exposure file
for (exp in exposure_info) {
exposure_file <- exp$file
exposure_name <- exp$exposure_name
sample_size <- exp$sample_size
today <- Sys.Date()
# Create results directory for this exposure with p5E4 threshold
results_dir <- paste0("/Users/charleenadams/mr_acetylaminos_bmi/results_", exposure_name, "_p5E4/")
if (!dir.exists(results_dir)) {
dir.create(results_dir, recursive = TRUE)
cat("Created directory:", results_dir, "\n")
}
# ---------------------------------------------
# Load and Format Exposure Data
# ---------------------------------------------
exp_data <- fread(exposure_file) %>% as.data.frame()
filtered_exp_df <- exp_data %>%
filter(!is.na(rsids) & rsids != "" & grepl("^rs", rsids)) %>%
arrange(rsids)
exp_dat <- filtered_exp_df %>%
mutate(
chr.exposure = chrom,
pos.exposure = pos,
beta.exposure = beta,
se.exposure = sebeta,
exposure = exposure_name,
id.exposure = exposure_name,
pval.exposure = pval,
SNP.exposure = rsids,
SNP = rsids,
effect_allele.exposure = alt,
other_allele.exposure = ref,
eaf.exposure = maf,
samplesize.exposure = sample_size,
id_col = nearest_genes
)
# ---------------------------------------------
# Load and Format BMI Data
# ---------------------------------------------
bmi <- fread("/Users/charleenadams/mr_acetylaminos_bmi/merged_gwas_data_grch38_twosample.tsv")
filtered_bmi_df <- bmi %>%
filter(!is.na(SNP) & SNP != "" & grepl("^rs", SNP)) %>%
arrange(SNP)
out_dat <- filtered_bmi_df %>%
mutate(
SNP = SNP,
SNP.outcome = SNP,
chr.outcome = CHR_x,
pos.outcome = BP_GRCh38,
beta.outcome = beta,
se.outcome = se,
pval.outcome = pval,
effect_allele.outcome = effect_allele,
other_allele.outcome = other_allele,
eaf.outcome = eaf,
samplesize.outcome = 460000,
id.outcome = "BMI",
outcome = "BMI"
)
# ---------------------------------------------
# Harmonize Data
# ---------------------------------------------
dat <- harmonise_data(exposure_dat = exp_dat, outcome_dat = out_dat)
harmonized_file <- paste0(results_dir, "harmonized_", exposure_name, "_Jurgens_BMI_dat_", today, ".csv")
write.csv(dat, harmonized_file, row.names = FALSE)
# ---------------------------------------------
# Select 1MB around ACY1 region
# ---------------------------------------------
# Load the local annotation file
local_annotation <- read.table("/Users/charleenadams/mr_ukbppp_chd/mart_export_clean.txt",
header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# Filter for the gene ACY1
acy1_data <- local_annotation[local_annotation$Gene.name == "ACY1", ]
# Extract the TSS and chromosome
acy1_tss <- min(acy1_data$Transcript.start..bp.)
acy1_chrom <- acy1_data$Chromosome.scaffold.name[1]
window_size <- 500000
tss_start <- acy1_tss - window_size
tss_end <- acy1_tss + window_size
# Display the results
cat("TSS for ACY1:", acy1_tss, "\n")
cat("500kb upstream:", tss_start, "\n")
cat("500kb downstream:", tss_end, "\n")
# Filter data within 1Mb of ACY1 TSS
dat_filtered <- dat %>%
filter(chr.outcome == acy1_chrom & pos.outcome >= tss_start & pos.outcome <= tss_end)
filtered_file <- paste0(results_dir, "filtered_ACY1_1Mb_", exposure_name, "_Jurgens_BMI_", today, ".csv")
write.csv(dat_filtered, file = filtered_file, row.names = FALSE)
cat("Filtered data saved to:", filtered_file, "\n")
cat("Number of observations within 1 Mb of ACY1 TSS:", nrow(dat_filtered), "\n")
# ---------------------------------------------
# MR
# ---------------------------------------------
# Step 1: Filter Instruments
dat_filtered <- read.csv(filtered_file)
instruments <- dat_filtered %>%
filter(mr_keep == TRUE, pval.exposure < 5e-4) %>% # Relaxed threshold to 5e-4
mutate(
rsid = SNP,
pval = pval.exposure,
id = paste0(exposure_name, " (1MB around ACY1 TSS)"),
F_stat = (beta.exposure / se.exposure)^2 # Compute F-statistic
) %>%
filter(F_stat > 10) # Exclude weak instruments
cat("Number of instruments after F-stat filtering:", nrow(instruments), "\n")
# Step 1.5: Steiger Filtering
instruments <- steiger_filtering(instruments)
instruments <- instruments %>% filter(steiger_dir == TRUE) # Keep only SNPs with correct direction
cat("Number of instruments after Steiger filtering:", nrow(instruments), "\n")
cat("Preview of Steiger-filtered instruments:\n")
print(head(instruments))
# Subset instruments to keep only TwoSampleMR, F-stat, and Steiger fields
instruments_subset <- instruments %>%
dplyr::select(
SNP,
effect_allele.exposure, other_allele.exposure,
effect_allele.outcome, other_allele.outcome,
beta.exposure, se.exposure, pval.exposure,
beta.outcome, se.outcome, pval.outcome,
eaf.exposure, eaf.outcome,
id.exposure, exposure,
id.outcome, outcome,
samplesize.exposure, samplesize.outcome,
mr_keep, action,
F_stat,
steiger_dir, steiger_pval
)
# Step 2: Local LD Clumping
clumped <- ld_clump(
dplyr::tibble(rsid = instruments$rsid, pval = instruments$pval, id = instruments$id),
plink_bin = genetics.binaRies::get_plink_binary(),
bfile = "/Users/charleenadams/1000G_bfiles/EUR/EUR",
clump_r2 = 0.001,
clump_kb = 500 # Reduced window
)
clumped_dat <- instruments %>% dplyr::filter(SNP %in% clumped$rsid)
cat("Local clumping completed. Number of SNPs retained:", nrow(clumped_dat), "\n")
cat("Preview of clumped data:\n")
print(head(clumped_dat))
# Subset clumped_dat to keep only TwoSampleMR, F-stat, and Steiger fields
clumped_dat_subset <- clumped_dat %>%
dplyr::select(
SNP,
effect_allele.exposure, other_allele.exposure,
effect_allele.outcome, other_allele.outcome,
beta.exposure, se.exposure, pval.exposure,
beta.outcome, se.outcome, pval.outcome,
eaf.exposure, eaf.outcome,
id.exposure, exposure,
id.outcome, outcome,
samplesize.exposure, samplesize.outcome,
mr_keep, action,
F_stat,
steiger_dir, steiger_pval
)
# Step 3: Perform Selected MR Analyses
# 3.1: Inverse Variance Weighted (IVW)
ivw_result <- tryCatch({
mr(clumped_dat, method_list = "mr_ivw")
}, error = function(e) {
cat("Error in IVW for", exposure_name, ":", conditionMessage(e), "\n")
NULL
})
# 3.2: MR-Egger
egger_result <- tryCatch({
mr(clumped_dat, method_list = "mr_egger_regression")
}, error = function(e) {
cat("Error in MR-Egger for", exposure_name, ":", conditionMessage(e), "\n")
NULL
})
# 3.3: Weighted Median
weighted_median_result <- tryCatch({
mr(clumped_dat, method_list = "mr_weighted_median")
}, error = function(e) {
cat("Error in Weighted Median for", exposure_name, ":", conditionMessage(e), "\n")
NULL
})
# 3.4: MR-Lasso
mr_input <- mr_input(
bx = clumped_dat$beta.exposure,
bxse = clumped_dat$se.exposure,
by = clumped_dat$beta.outcome,
byse = clumped_dat$se.outcome,
exposure = exposure_name,
outcome = "BMI",
snps = clumped_dat$SNP
)
lasso_result <- tryCatch({
mr_lasso(mr_input)
}, error = function(e) {
cat("Error in MR-Lasso for", exposure_name, ":", conditionMessage(e), "\n")
NULL
})
# 3.5: Contamination Mixture (ConMix)
conmix_result <- tryCatch({
mr_conmix(mr_input)
}, error = function(e) {
cat("Error in MR-ConMix for", exposure_name, ":", conditionMessage(e), "\n")
NULL
})
# 3.6: Heterogeneity Test
heterogeneity_result <- tryCatch({
mr_heterogeneity(clumped_dat)
}, error = function(e) {
cat("Error in Heterogeneity Test for", exposure_name, ":", conditionMessage(e), "\n")
NULL
})
# 3.7: Pleiotropy Test
pleiotropy_result <- tryCatch({
mr_pleiotropy_test(clumped_dat)
}, error = function(e) {
cat("Error in Pleiotropy Test for", exposure_name, ":", conditionMessage(e), "\n")
NULL
})
# 3.8: Leave-One-Out Analysis
loo_result <- tryCatch({
mr_leaveoneout(clumped_dat)
}, error = function(e) {
cat("Error in Leave-One-Out for", exposure_name, ":", conditionMessage(e), "\n")
NULL
})
# Step 4: Perform Wald Ratio Tests for Each Instrument
wald_ratios <- clumped_dat %>%
mutate(
wald_beta = beta.outcome / beta.exposure,
wald_se = sqrt((se.outcome^2 / beta.exposure^2) + ((beta.outcome^2 * se.exposure^2) / (beta.exposure^4))),
pval = 2 * pnorm(abs(wald_beta / wald_se), lower.tail = FALSE),
method = paste("Wald Ratio:", SNP)
) %>%
dplyr::select(SNP, wald_beta, wald_se, pval, method)
cat("\n=== Wald Ratio Tests for Each Instrument for", exposure_name, "===\n")
print(wald_ratios)
# Step 5: Prepare Data for Forest Plot
ivw_df <- if (!is.null(ivw_result)) ivw_result %>% mutate(method = "IVW") else NULL
egger_df <- if (!is.null(egger_result)) egger_result %>% mutate(method = "MR-Egger") else NULL
weighted_median_df <- if (!is.null(weighted_median_result)) weighted_median_result %>% mutate(method = "Weighted Median") else NULL
lasso_df <- if (!is.null(lasso_result)) {
data.frame(method = "MR-Lasso", b = lasso_result@Estimate,
se = lasso_result@StdError, pval = lasso_result@Pvalue)
} else {
NULL
}
conmix_se <- if (!is.null(conmix_result)) {
(conmix_result@CIUpper - conmix_result@CILower) / (2 * 1.96) # SE = (CIUpper - CILower) / (2 * 1.96) for 95% CI
} else {
NA
}
conmix_df <- if (!is.null(conmix_result)) {
data.frame(method = "MR-ConMix", b = conmix_result@Estimate,
se = conmix_se, pval = conmix_result@Pvalue)
} else {
NULL
}
mr_results <- bind_rows(
ivw_df,
egger_df,
weighted_median_df,
lasso_df,
conmix_df,
wald_ratios %>% dplyr::select(method, b = wald_beta, se = wald_se, pval)
) %>%
mutate(method = as.factor(method)) %>%
filter(!is.na(method) & method != "NA") %>%
arrange(b) %>%
mutate(method = factor(method, levels = unique(method)))
# Step 6: Create Beautiful Forest Plot with Dynamic Colors (No Numbers)
base_colors <- c(
"IVW" = "#1F78B4",
"MR-Egger" = "#FF7F00",
"Weighted Median" = "#33A02C",
"MR-Lasso" = "#FB9A99",
"MR-ConMix" = "#E41A1C"
)
wald_methods <- unique(wald_ratios$method)
n_wald <- length(wald_methods)
if (n_wald > 0) {
wald_colors <- hue_pal()(n_wald)
names(wald_colors) <- wald_methods
} else {
wald_colors <- NULL
}
color_list <- c(base_colors, wald_colors)
available_methods <- unique(mr_results$method)
color_list <- color_list[names(color_list) %in% available_methods]
# Dynamically determine x-axis limits based on confidence intervals
x_min <- min(mr_results$b - 1.96 * mr_results$se, na.rm = TRUE)
x_max <- max(mr_results$b + 1.96 * mr_results$se, na.rm = TRUE)
forest_plot <- ggplot(mr_results, aes(x = b, y = method, color = method)) +
geom_point(size = 3) +
geom_errorbarh(aes(xmin = b - 1.96 * se, xmax = b + 1.96 * se), height = 0.2) +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
labs(
title = paste0("Mendelian Randomization Estimates:\n", exposure_name, " (1MB around ACY1 TSS) on Jurgens BMI"),
x = "Causal Effect (Beta)",
y = ""
) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.text.y = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 10),
axis.title.x = element_text(size = 12),
panel.grid.major = element_line(color = "grey90"),
panel.grid.minor = element_blank(),
legend.position = "none",
panel.background = element_rect(fill = "white", color = NA),
plot.background = element_rect(fill = "white", color = NA)
) +
scale_color_manual(values = color_list) +
xlim(x_min - 0.02, x_max + 0.02)
# Step 7: Create Forest Plot Ordered by Method
desired_order <- c(
"IVW",
"MR-Egger",
"MR-Lasso",
"MR-ConMix",
"Weighted Median",
unique(wald_ratios$method)
)
desired_order <- desired_order[desired_order %in% unique(mr_results$method)]
mr_results$method <- factor(mr_results$method, levels = desired_order)
mr_results <- mr_results[order(mr_results$method), ]
forest_plot_by_method <- ggplot(mr_results, aes(x = b, y = method, color = method)) +
geom_errorbarh(aes(xmin = b - 1.96 * se, xmax = b + 1.96 * se),
height = 0.2, na.rm = TRUE, linewidth = 0.8) +
geom_point(size = 3) +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
labs(
title = paste0("Mendelian Randomization Estimates:\n", exposure_name, " (1MB around ACY1 TSS)\non Jurgens BMI"),
x = "Causal Effect (Beta)",
y = ""
) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.text.y = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 10),
axis.title.x = element_text(size = 12),
panel.grid.major = element_line(color = "grey90"),
panel.grid.minor = element_blank(),
legend.position = "none",
panel.background = element_rect(fill = "white", color = NA),
plot.background = element_rect(fill = "white", color = NA)
) +
scale_color_manual(values = color_list) +
xlim(x_min - 0.02, x_max + 0.02)
# Save plots
plot_file <- paste0(results_dir, "MR_Forestplot_", exposure_name, "_Jurgens_BMI_", today, ".png")
ggsave(plot_file, forest_plot, width = 12, height = 10, dpi = 600, bg = "white")
cat("Forest plot saved to:", plot_file, "\n")
plot_file_by_method <- paste0(results_dir, "MR_Forestplot_", exposure_name, "_Jurgens_BMI_By_Method_", today, ".png")
ggsave(plot_file_by_method, plot = forest_plot_by_method, dpi = 300, width = 8, height = 5)
cat("Forest plot by method saved to:", plot_file_by_method, "\n")
# Step 8: Save Everything in One Excel Spreadsheet
wb <- createWorkbook()
title_style <- createStyle(fontSize = 14, fontColour = "black", textDecoration = "bold", halign = "center")
header_style <- createStyle(fontColour = "white", fgFill = "#C8A2C8", textDecoration = "bold", halign = "center", border = "TopBottomLeftRight")
add_formatted_sheet <- function(wb, sheet_name, data, title) {
addWorksheet(wb, sheet_name)
if (is.null(data) || nrow(data) == 0 || ncol(data) == 0) {
writeData(wb, sheet_name, paste("No data available for", sheet_name), startRow = 1, startCol = 1)
addStyle(wb, sheet_name, title_style, rows = 1, cols = 1)
} else {
writeData(wb, sheet_name, title, startRow = 1, startCol = 1)
writeData(wb, sheet_name, data, startRow = 3, startCol = 1, headerStyle = header_style)
mergeCells(wb, sheet_name, cols = 1:ncol(data), rows = 1)
setRowHeights(wb, sheet_name, rows = 1, heights = 20)
setColWidths(wb, sheet_name, cols = 1:ncol(data), widths = "auto")
addStyle(wb, sheet_name, title_style, rows = 1, cols = 1)
}
}
# TOC with all sections
toc_data <- data.frame(
Sheet = c("IVW", "MR-Egger", "Weighted_Median", "MR-Lasso", "MR-ConMix",
"Wald_Ratios", "Heterogeneity_Test", "Pleiotropy_Test", "Leave_One_Out",
"Instruments", "Clumped_Data"),
Title = c("Inverse Variance Weighted (IVW) Results", "MR-Egger Results",
"Weighted Median Results", "MR-Lasso Results", "MR-ConMix Results",
"Wald Ratio Tests for Each Instrument", "Heterogeneity Test Results",
"Pleiotropy Test Results", "Leave-One-Out Analysis Results",
"Filtered Instruments Data", "Clumped SNPs Data")
)
toc_data <- toc_data[complete.cases(toc_data), ] # Remove any NA rows
addWorksheet(wb, "TOC")
writeData(wb, "TOC", "Table of Contents", startRow = 1, startCol = 1)
mergeCells(wb, "TOC", cols = 1:2, rows = 1)
addStyle(wb, "TOC", title_style, rows = 1, cols = 1)
writeData(wb, "TOC", toc_data, startRow = 3, startCol = 1, headerStyle = header_style)
setColWidths(wb, "TOC", cols = 1:2, widths = "auto")
# Add all sheets
add_formatted_sheet(wb, "IVW", ivw_result, "Inverse Variance Weighted (IVW) Results")
add_formatted_sheet(wb, "MR-Egger", egger_result, "MR-Egger Results")
add_formatted_sheet(wb, "Weighted_Median", weighted_median_result, "Weighted Median Results")
if (!is.null(lasso_result)) {
lasso_df <- data.frame(
Exposure = lasso_result@Exposure,
Outcome = lasso_result@Outcome,
Estimate = lasso_result@Estimate,
StdError = lasso_result@StdError,
CILower = lasso_result@CILower,
CIUpper = lasso_result@CIUpper,
Pvalue = lasso_result@Pvalue,
SNPs = lasso_result@SNPs,
Valid = lasso_result@Valid,
ValidSNPs = if (length(lasso_result@ValidSNPs) > 0) paste(lasso_result@ValidSNPs, collapse = ", ") else "None",
RegEstimate = lasso_result@RegEstimate,
RegIntercept = paste(lasso_result@RegIntercept, collapse = ", "),
Lambda = lasso_result@Lambda
)
add_formatted_sheet(wb, "MR-Lasso", lasso_df, "MR-Lasso Results")
} else {
addWorksheet(wb, "MR-Lasso")
writeData(wb, "MR-Lasso", "MR-Lasso Analysis Failed", startRow = 1, startCol = 1)
addStyle(wb, "MR-Lasso", title_style, rows = 1, cols = 1)
}
if (!is.null(conmix_result)) {
conmix_df <- data.frame(
Exposure = conmix_result@Exposure,
Outcome = conmix_result@Outcome,
Estimate = conmix_result@Estimate,
Pvalue = conmix_result@Pvalue,
SNPs = conmix_result@SNPs,
Psi = conmix_result@Psi,
CILower = conmix_result@CILower,
CIUpper = conmix_result@CIUpper,
CIRange = paste(conmix_result@CIRange, collapse = ", "),
CIMin = conmix_result@CIMin,
CIMax = conmix_result@CIMax,
CIStep = conmix_result@CIStep,
Valid = paste(conmix_result@Valid, collapse = ", "),
ValidSNPs = paste(conmix_result@ValidSNPs, collapse = ", "),
Alpha = conmix_result@Alpha
)
add_formatted_sheet(wb, "MR-ConMix", conmix_df, "MR-ConMix Results")
} else {
addWorksheet(wb, "MR-ConMix")
writeData(wb, "MR-ConMix", "MR-ConMix Analysis Failed", startRow = 1, startCol = 1)
addStyle(wb, "MR-ConMix", title_style, rows = 1, cols = 1)
}
add_formatted_sheet(wb, "Wald_Ratios", wald_ratios, "Wald Ratio Tests for Each Instrument")
add_formatted_sheet(wb, "Heterogeneity_Test", heterogeneity_result, "Heterogeneity Test Results")
add_formatted_sheet(wb, "Pleiotropy_Test", pleiotropy_result, "Pleiotropy Test Results")
add_formatted_sheet(wb, "Leave_One_Out", loo_result, "Leave-One-Out Analysis Results")
add_formatted_sheet(wb, "Instruments", instruments_subset, "Filtered Instruments Data")
add_formatted_sheet(wb, "Clumped_Data", clumped_dat_subset, "Clumped SNPs Data")
excel_file <- paste0(results_dir, "MR_Results_", exposure_name, "_Jurgens_BMI_", today, ".xlsx")
saveWorkbook(wb, excel_file, overwrite = TRUE)
cat("Excel file saved to:", excel_file, "\n")
}
# Clear the environment
rm(list = ls())
# Load required libraries
library(data.table)
library(dplyr)
library(TwoSampleMR)
library(MRInstruments)
library(MendelianRandomization)
library(ieugwasr)  # ld_clump()
library(openxlsx)
library(ggplot2)
library(scales)
# Note for Mac M3 users:
# Ensure dependencies are installed:
# - Install Homebrew: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# - Install plink: brew install plink
# - Install R packages if missing: install.packages(c("data.table", "dplyr", "TwoSampleMR", "MRInstruments", "openxlsx", "ggplot2", "scales", "MendelianRandomization", "ieugwasr"))
# - ld_clump() also uses the genetics.binaRies package (from the MRC-IEU R-universe) to locate a local plink binary
# Define master results directory
master_results_dir <- "/Users/charleenadams/mr_acetylaminos_bmi/results/"
if (!dir.exists(master_results_dir)) {
dir.create(master_results_dir, recursive = TRUE)
cat("Created master results directory:", master_results_dir, "\n")
}
# Define exposure file information
exposure_info <- list(
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/C1083",
exposure_name = "N-acetylmethionine",
sample_size = 6135
),
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/C1110",
exposure_name = "N-acetylalanine",
sample_size = 6136
),
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/C100001276",
exposure_name = "N-acetylisoleucine",
sample_size = 4946
),
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/C100000282",
exposure_name = "N-acetylglutamate",
sample_size = 5864
),
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/C100001851",
exposure_name = "N-acetylserine",
sample_size = 6133
)
)
# Define outcome file information
outcome_info <- list(
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/merged_gwas_data_grch38_twosample.tsv",
outcome_name = "BMI",
sample_size = 460000,
id = "BMI"
),
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/ukb-b-12651_formatted_lifted38.tsv",
outcome_name = "PureHypercholesterolaemia",
sample_size = 463010,
id = "ukb-b-12651"
),
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/ukb-b-970_formatted_lifted38.tsv",
outcome_name = "Type2DiabetesNoComplications",
sample_size = 463010,
id = "ukb-b-970"
),
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/ukb-b-12493_formatted_lifted38.tsv",
outcome_name = "EssentialHypertension",
sample_size = 463010,
id = "ukb-b-12493"
),
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/ukb-b-15541_formatted_lifted38.tsv",
outcome_name = "Obesity",
sample_size = 463010,
id = "ukb-b-15541"
),
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/34841290-GCST90091033-EFO_0003095_formatted.tsv",
outcome_name = "NonAlcoholicFattyLiverDisease",
sample_size = 778614,
id = "GCST90091033"
),
list(
file = "/Users/charleenadams/mr_acetylaminos_bmi/ebi-a-GCST006867_formatted_lifted38.tsv",
outcome_name = "Type2Diabetes",
sample_size = 655666,
id = "ebi-a-GCST006867"
)
)
# Load local annotation file for ACY1 TSS (outside loops for efficiency)
local_annotation <- read.table("/Users/charleenadams/mr_ukbppp_chd/mart_export_clean.txt",
header = TRUE, sep = "\t", stringsAsFactors = FALSE)
acy1_data <- local_annotation[local_annotation$Gene.name == "ACY1", ]
acy1_tss <- min(acy1_data$Transcript.start..bp.)
acy1_chrom <- acy1_data$Chromosome.scaffold.name[1]
window_size <- 500000
tss_start <- acy1_tss - window_size
tss_end <- acy1_tss + window_size
cat("TSS for ACY1:", acy1_tss, "\n")
cat("500kb upstream:", tss_start, "\n")
cat("500kb downstream:", tss_end, "\n")
# Double loop through each exposure and outcome
for (exp in exposure_info) {
exposure_file <- exp$file
exposure_name <- exp$exposure_name
sample_size_exp <- exp$sample_size
# Load and format exposure data (once per exposure)
exp_data <- fread(exposure_file) %>% as.data.frame()
filtered_exp_df <- exp_data %>%
filter(!is.na(rsids) & rsids != "" & grepl("^rs", rsids)) %>%
arrange(rsids)
exp_dat <- filtered_exp_df %>%
mutate(
chr.exposure = chrom,
pos.exposure = pos,
beta.exposure = beta,
se.exposure = sebeta,
exposure = exposure_name,
id.exposure = exposure_name,
pval.exposure = pval,
SNP.exposure = rsids,
SNP = rsids,
effect_allele.exposure = alt,
other_allele.exposure = ref,
eaf.exposure = maf,
samplesize.exposure = sample_size_exp,
id_col = nearest_genes
)
for (out in outcome_info) {
outcome_file <- out$file
outcome_name <- out$outcome_name
sample_size_out <- out$sample_size
outcome_id <- out$id
today <- Sys.Date()
# Create results subdirectory
results_dir <- paste0(master_results_dir, exposure_name, "_", outcome_name, "/")
if (!dir.exists(results_dir)) {
dir.create(results_dir, recursive = TRUE)
cat("Created directory:", results_dir, "\n")
}
# Check if analysis is already complete
excel_file <- paste0(results_dir, "MR_Results_", exposure_name, "_", outcome_name, "_", today, ".xlsx")
if (file.exists(excel_file)) {
cat("Analysis for", exposure_name, "on", outcome_name, "already complete. Skipping.\n")
next
}
# ---------------------------------------------
# Load and Format Outcome Data
# ---------------------------------------------
out_data <- fread(outcome_file) %>% as.data.frame()
cat("Columns in", outcome_file, ":\n")
print(colnames(out_data))
filtered_out_df <- out_data %>%
filter(!is.na(SNP) & SNP != "" & grepl("^rs", SNP)) %>%
arrange(SNP)
# Standardize column names, handling BMI-specific names
if (outcome_id == "BMI") {
out_dat <- filtered_out_df %>%
mutate(
SNP = SNP,
SNP.outcome = SNP,
chr.outcome = CHR_x,
pos.outcome = BP_GRCh38,
beta.outcome = beta,
se.outcome = se,
pval.outcome = pval,
effect_allele.outcome = effect_allele,
other_allele.outcome = other_allele,
eaf.outcome = eaf,
samplesize.outcome = sample_size_out,
id.outcome = outcome_id,
outcome = outcome_name
)
} else {
out_dat <- filtered_out_df %>%
mutate(
SNP = SNP,
SNP.outcome = SNP,
chr.outcome = CHR,
pos.outcome = BP,
beta.outcome = BETA,
se.outcome = SE,
pval.outcome = PVAL,
effect_allele.outcome = ALT,
other_allele.outcome = REF,
eaf.outcome = EAF,
samplesize.outcome = sample_size_out,
id.outcome = outcome_id,
outcome = outcome_name
)
}
# ---------------------------------------------
# Harmonize Data
# ---------------------------------------------
dat <- harmonise_data(exposure_dat = exp_dat, outcome_dat = out_dat)
harmonized_file <- paste0(results_dir, "harmonized_", exposure_name, "_", outcome_name, "_", today, ".csv")
write.csv(dat, harmonized_file, row.names = FALSE)
cat("Harmonized data saved to:", harmonized_file, "\n")
# ---------------------------------------------
# Filter 1MB around ACY1 TSS
# ---------------------------------------------
dat_filtered <- dat %>%
filter(chr.outcome == acy1_chrom & pos.outcome >= tss_start & pos.outcome <= tss_end)
filtered_file <- paste0(results_dir, "filtered_ACY1_1Mb_", exposure_name, "_", outcome_name, "_", today, ".csv")
write.csv(dat_filtered, file = filtered_file, row.names = FALSE)
cat("Filtered data saved to:", filtered_file, "\n")
cat("Number of observations within 1 Mb of ACY1 TSS:", nrow(dat_filtered), "\n")
# ---------------------------------------------
# MR Analyses
# ---------------------------------------------
# Step 1: Filter Instruments
dat_filtered <- read.csv(filtered_file)
instruments <- dat_filtered %>%
filter(mr_keep == TRUE, pval.exposure < 5e-4) %>%
mutate(
rsid = SNP,
pval = pval.exposure,
id = paste0(exposure_name, " (1MB around ACY1 TSS)"),
F_stat = (beta.exposure / se.exposure)^2
) %>%
filter(F_stat > 10)
cat("Number of instruments after F-stat filtering:", nrow(instruments), "\n")
# Check if instruments are empty
if (nrow(instruments) == 0) {
cat("No valid instruments found for", exposure_name, "on", outcome_name, ". Skipping MR analyses.\n")
# Create placeholder Excel file
wb <- createWorkbook()
title_style <- createStyle(fontSize = 14, fontColour = "black", textDecoration = "bold", halign = "center")
addWorksheet(wb, "Summary")
writeData(wb, "Summary", paste("No valid instruments found for", exposure_name, "on", outcome_name),
startRow = 1, startCol = 1)
addStyle(wb, "Summary", title_style, rows = 1, cols = 1)
saveWorkbook(wb, excel_file, overwrite = TRUE)
cat("Placeholder Excel file saved to:", excel_file, "\n")
next
}
# Step 1.5: Steiger Filtering
instruments <- tryCatch({
steiger_filtering(instruments)
}, error = function(e) {
cat("Error in Steiger filtering for", exposure_name, "on", outcome_name, ":", conditionMessage(e), "\n")
instruments # Return original instruments to avoid breaking
})
# Check if steiger_dir exists before filtering
if ("steiger_dir" %in% colnames(instruments)) {
instruments <- instruments %>% filter(steiger_dir == TRUE)
} else {
cat("Warning: steiger_dir column not found after steiger_filtering for", exposure_name, "on", outcome_name, ". Proceeding without Steiger filtering.\n")
}
cat("Number of instruments after Steiger filtering:", nrow(instruments), "\n")
cat("Preview of Steiger-filtered instruments:\n")
print(head(instruments))
# Subset instruments
instruments_subset <- instruments %>%
dplyr::select(
SNP,
effect_allele.exposure, other_allele.exposure,
effect_allele.outcome, other_allele.outcome,
beta.exposure, se.exposure, pval.exposure,
beta.outcome, se.outcome, pval.outcome,
eaf.exposure, eaf.outcome,
id.exposure, exposure,
id.outcome, outcome,
samplesize.exposure, samplesize.outcome,
mr_keep, action,
F_stat,
steiger_dir, steiger_pval
)
# Step 2: Local LD Clumping
clumped <- ld_clump(
dplyr::tibble(rsid = instruments$rsid, pval = instruments$pval, id = instruments$id),
plink_bin = genetics.binaRies::get_plink_binary(),
bfile = "/Users/charleenadams/1000G_bfiles/EUR/EUR",
clump_r2 = 0.001,
clump_kb = 500
)
clumped_dat <- instruments %>% dplyr::filter(SNP %in% clumped$rsid)
cat("Local clumping completed. Number of SNPs retained:", nrow(clumped_dat), "\n")
cat("Preview of clumped data:\n")
print(head(clumped_dat))
# Subset clumped data
clumped_dat_subset <- clumped_dat %>%
dplyr::select(
SNP,
effect_allele.exposure, other_allele.exposure,
effect_allele.outcome, other_allele.outcome,
beta.exposure, se.exposure, pval.exposure,
beta.outcome, se.outcome, pval.outcome,
eaf.exposure, eaf.outcome,
id.exposure, exposure,
id.outcome, outcome,
samplesize.exposure, samplesize.outcome,
mr_keep, action,
F_stat,
steiger_dir, steiger_pval
)
# Step 3: Perform MR Analyses
ivw_result <- tryCatch({
mr(clumped_dat, method_list = "mr_ivw")
}, error = function(e) {
cat("Error in IVW for", exposure_name, "on", outcome_name, ":", conditionMessage(e), "\n")
NULL
})
egger_result <- tryCatch({
mr(clumped_dat, method_list = "mr_egger_regression")
}, error = function(e) {
cat("Error in MR-Egger for", exposure_name, "on", outcome_name, ":", conditionMessage(e), "\n")
NULL
})
weighted_median_result <- tryCatch({
mr(clumped_dat, method_list = "mr_weighted_median")
}, error = function(e) {
cat("Error in Weighted Median for", exposure_name, "on", outcome_name, ":", conditionMessage(e), "\n")
NULL
})
mr_input <- mr_input(
bx = clumped_dat$beta.exposure,
bxse = clumped_dat$se.exposure,
by = clumped_dat$beta.outcome,
byse = clumped_dat$se.outcome,
exposure = exposure_name,
outcome = outcome_name,
snps = clumped_dat$SNP
)
lasso_result <- tryCatch({
mr_lasso(mr_input)
}, error = function(e) {
cat("Error in MR-Lasso for", exposure_name, "on", outcome_name, ":", conditionMessage(e), "\n")
NULL
})
conmix_result <- tryCatch({
mr_conmix(mr_input)
}, error = function(e) {
cat("Error in MR-ConMix for", exposure_name, "on", outcome_name, ":", conditionMessage(e), "\n")
NULL
})
heterogeneity_result <- tryCatch({
mr_heterogeneity(clumped_dat)
}, error = function(e) {
cat("Error in Heterogeneity Test for", exposure_name, "on", outcome_name, ":", conditionMessage(e), "\n")
NULL
})
pleiotropy_result <- tryCatch({
mr_pleiotropy_test(clumped_dat)
}, error = function(e) {
cat("Error in Pleiotropy Test for", exposure_name, "on", outcome_name, ":", conditionMessage(e), "\n")
NULL
})
loo_result <- tryCatch({
mr_leaveoneout(clumped_dat)
}, error = function(e) {
cat("Error in Leave-One-Out for", exposure_name, "on", outcome_name, ":", conditionMessage(e), "\n")
NULL
})
# Step 4: Wald Ratio Tests
wald_ratios <- clumped_dat %>%
mutate(
wald_beta = beta.outcome / beta.exposure,
wald_se = sqrt((se.outcome^2 / beta.exposure^2) + ((beta.outcome^2 * se.exposure^2) / (beta.exposure^4))),
pval = 2 * pnorm(abs(wald_beta / wald_se), lower.tail = FALSE),
method = paste("Wald Ratio:", SNP)
) %>%
dplyr::select(SNP, wald_beta, wald_se, pval, method)
cat("\n=== Wald Ratio Tests for", exposure_name, "on", outcome_name, "===\n")
print(wald_ratios)
# Step 5: Prepare Data for Forest Plot
ivw_df <- if (!is.null(ivw_result)) ivw_result %>% mutate(method = "IVW") else NULL
egger_df <- if (!is.null(egger_result)) egger_result %>% mutate(method = "MR-Egger") else NULL
weighted_median_df <- if (!is.null(weighted_median_result)) weighted_median_result %>% mutate(method = "Weighted Median") else NULL
lasso_df <- if (!is.null(lasso_result)) {
data.frame(method = "MR-Lasso", b = lasso_result@Estimate,
se = lasso_result@StdError, pval = lasso_result@Pvalue)
} else {
NULL
}
conmix_se <- if (!is.null(conmix_result)) {
(conmix_result@CIUpper - conmix_result@CILower) / (2 * 1.96)
} else {
NA
}
conmix_df <- if (!is.null(conmix_result)) {
data.frame(method = "MR-ConMix", b = conmix_result@Estimate,
se = conmix_se, pval = conmix_result@Pvalue)
} else {
NULL
}
mr_results <- bind_rows(
ivw_df,
egger_df,
weighted_median_df,
lasso_df,
conmix_df,
wald_ratios %>% dplyr::select(method, b = wald_beta, se = wald_se, pval)
) %>%
mutate(method = as.factor(method)) %>%
filter(!is.na(method) & method != "NA") %>%
arrange(b) %>%
mutate(method = factor(method, levels = unique(method)))
# Step 6: Create Forest Plot
base_colors <- c(
"IVW" = "#1F78B4",
"MR-Egger" = "#FF7F00",
"Weighted Median" = "#33A02C",
"MR-Lasso" = "#FB9A99",
"MR-ConMix" = "#E41A1C"
)
wald_methods <- unique(wald_ratios$method)
n_wald <- length(wald_methods)
if (n_wald > 0) {
wald_colors <- hue_pal()(n_wald)
names(wald_colors) <- wald_methods
} else {
wald_colors <- NULL
}
color_list <- c(base_colors, wald_colors)
available_methods <- unique(mr_results$method)
color_list <- color_list[names(color_list) %in% available_methods]
x_min <- min(mr_results$b - 1.96 * mr_results$se, na.rm = TRUE)
x_max <- max(mr_results$b + 1.96 * mr_results$se, na.rm = TRUE)
forest_plot <- ggplot(mr_results, aes(x = b, y = method, color = method)) +
geom_point(size = 3) +
geom_errorbarh(aes(xmin = b - 1.96 * se, xmax = b + 1.96 * se), height = 0.2) +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
labs(
title = paste0("MR Estimates:\n", exposure_name, " (1MB around ACY1 TSS) on ", outcome_name),
x = "Causal Effect (Beta)",
y = ""
) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.text.y = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 10),
axis.title.x = element_text(size = 12),
panel.grid.major = element_line(color = "grey90"),
panel.grid.minor = element_blank(),
legend.position = "none",
panel.background = element_rect(fill = "white", color = NA),
plot.background = element_rect(fill = "white", color = NA)
) +
scale_color_manual(values = color_list) +
xlim(x_min - 0.02, x_max + 0.02)
# Step 7: Create Forest Plot Ordered by Method
desired_order <- c(
"IVW",
"MR-Egger",
"MR-Lasso",
"MR-ConMix",
"Weighted Median",
unique(wald_ratios$method)
)
desired_order <- desired_order[desired_order %in% unique(mr_results$method)]
mr_results$method <- factor(mr_results$method, levels = desired_order)
mr_results <- mr_results[order(mr_results$method), ]
forest_plot_by_method <- ggplot(mr_results, aes(x = b, y = method, color = method)) +
geom_errorbarh(aes(xmin = b - 1.96 * se, xmax = b + 1.96 * se),
height = 0.2, na.rm = TRUE, linewidth = 0.8) +
geom_point(size = 3) +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
labs(
title = paste0("MR Estimates:\n", exposure_name, " (1MB around ACY1 TSS) on ", outcome_name),
x = "Causal Effect (Beta)",
y = ""
) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.text.y = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 10),
axis.title.x = element_text(size = 12),
panel.grid.major = element_line(color = "grey90"),
panel.grid.minor = element_blank(),
legend.position = "none",
panel.background = element_rect(fill = "white", color = NA),
plot.background = element_rect(fill = "white", color = NA)
) +
scale_color_manual(values = color_list) +
xlim(x_min - 0.02, x_max + 0.02)
# Save plots
plot_file <- paste0(results_dir, "MR_Forestplot_", exposure_name, "_", outcome_name, "_", today, ".png")
ggsave(plot_file, forest_plot, width = 12, height = 10, dpi = 600, bg = "white")
cat("Forest plot saved to:", plot_file, "\n")
plot_file_by_method <- paste0(results_dir, "MR_Forestplot_", exposure_name, "_", outcome_name, "_By_Method_", today, ".png")
ggsave(plot_file_by_method, forest_plot_by_method, dpi = 300, width = 8, height = 5)
cat("Forest plot by method saved to:", plot_file_by_method, "\n")
# Step 8: Save Results in Excel
wb <- createWorkbook()
title_style <- createStyle(fontSize = 14, fontColour = "black", textDecoration = "bold", halign = "center")
header_style <- createStyle(fontColour = "white", fgFill = "#C8A2C8", textDecoration = "bold", halign = "center", border = "TopBottomLeftRight")
add_formatted_sheet <- function(wb, sheet_name, data, title) {
addWorksheet(wb, sheet_name)
if (is.null(data) || nrow(data) == 0 || ncol(data) == 0) {
writeData(wb, sheet_name, paste("No data available for", sheet_name), startRow = 1, startCol = 1)
addStyle(wb, sheet_name, title_style, rows = 1, cols = 1)
} else {
writeData(wb, sheet_name, title, startRow = 1, startCol = 1)
writeData(wb, sheet_name, data, startRow = 3, startCol = 1, headerStyle = header_style)
mergeCells(wb, sheet_name, cols = 1:ncol(data), rows = 1)
setRowHeights(wb, sheet_name, rows = 1, heights = 20)
setColWidths(wb, sheet_name, cols = 1:ncol(data), widths = "auto")
addStyle(wb, sheet_name, title_style, rows = 1, cols = 1)
}
}
toc_data <- data.frame(
Sheet = c("IVW", "MR-Egger", "Weighted_Median", "MR-Lasso", "MR-ConMix",
"Wald_Ratios", "Heterogeneity_Test", "Pleiotropy_Test", "Leave_One_Out",
"Instruments", "Clumped_Data"),
Title = c("Inverse Variance Weighted (IVW) Results", "MR-Egger Results",
"Weighted Median Results", "MR-Lasso Results", "MR-ConMix Results",
"Wald Ratio Tests for Each Instrument", "Heterogeneity Test Results",
"Pleiotropy Test Results", "Leave-One-Out Analysis Results",
"Filtered Instruments Data", "Clumped SNPs Data")
)
toc_data <- toc_data[complete.cases(toc_data), ]
addWorksheet(wb, "TOC")
writeData(wb, "TOC", "Table of Contents", startRow = 1, startCol = 1)
mergeCells(wb, "TOC", cols = 1:2, rows = 1)
addStyle(wb, "TOC", title_style, rows = 1, cols = 1)
writeData(wb, "TOC", toc_data, startRow = 3, startCol = 1, headerStyle = header_style)
setColWidths(wb, "TOC", cols = 1:2, widths = "auto")
add_formatted_sheet(wb, "IVW", ivw_result, "Inverse Variance Weighted (IVW) Results")
add_formatted_sheet(wb, "MR-Egger", egger_result, "MR-Egger Results")
add_formatted_sheet(wb, "Weighted_Median", weighted_median_result, "Weighted Median Results")
if (!is.null(lasso_result)) {
lasso_df <- data.frame(
Exposure = lasso_result@Exposure,
Outcome = lasso_result@Outcome,
Estimate = lasso_result@Estimate,
StdError = lasso_result@StdError,
CILower = lasso_result@CILower,
CIUpper = lasso_result@CIUpper,
Pvalue = lasso_result@Pvalue,
SNPs = lasso_result@SNPs,
Valid = lasso_result@Valid,
ValidSNPs = if (length(lasso_result@ValidSNPs) > 0) paste(lasso_result@ValidSNPs, collapse = ", ") else "None",
RegEstimate = lasso_result@RegEstimate,
RegIntercept = paste(lasso_result@RegIntercept, collapse = ", "),
Lambda = lasso_result@Lambda
)
add_formatted_sheet(wb, "MR-Lasso", lasso_df, "MR-Lasso Results")
} else {
addWorksheet(wb, "MR-Lasso")
writeData(wb, "MR-Lasso", "MR-Lasso Analysis Failed", startRow = 1, startCol = 1)
addStyle(wb, "MR-Lasso", title_style, rows = 1, cols = 1)
}
if (!is.null(conmix_result)) {
conmix_df <- data.frame(
Exposure = conmix_result@Exposure,
Outcome = conmix_result@Outcome,
Estimate = conmix_result@Estimate,
Pvalue = conmix_result@Pvalue,
SNPs = conmix_result@SNPs,
Psi = conmix_result@Psi,
CILower = conmix_result@CILower,
CIUpper = conmix_result@CIUpper,
CIRange = paste(conmix_result@CIRange, collapse = ", "),
CIMin = conmix_result@CIMin,
CIMax = conmix_result@CIMax,
CIStep = conmix_result@CIStep,
Valid = paste(conmix_result@Valid, collapse = ", "),
ValidSNPs = paste(conmix_result@ValidSNPs, collapse = ", "),
Alpha = conmix_result@Alpha
)
add_formatted_sheet(wb, "MR-ConMix", conmix_df, "MR-ConMix Results")
} else {
addWorksheet(wb, "MR-ConMix")
writeData(wb, "MR-ConMix", "MR-ConMix Analysis Failed", startRow = 1, startCol = 1)
addStyle(wb, "MR-ConMix", title_style, rows = 1, cols = 1)
}
add_formatted_sheet(wb, "Wald_Ratios", wald_ratios, "Wald Ratio Tests for Each Instrument")
add_formatted_sheet(wb, "Heterogeneity_Test", heterogeneity_result, "Heterogeneity Test Results")
add_formatted_sheet(wb, "Pleiotropy_Test", pleiotropy_result, "Pleiotropy Test Results")
add_formatted_sheet(wb, "Leave_One_Out", loo_result, "Leave-One-Out Analysis Results")
add_formatted_sheet(wb, "Instruments", instruments_subset, "Filtered Instruments Data")
add_formatted_sheet(wb, "Clumped_Data", clumped_dat_subset, "Clumped SNPs Data")
saveWorkbook(wb, excel_file, overwrite = TRUE)
cat("Excel file saved to:", excel_file, "\n")
}
}
# Load required packages
library(coloc)
library(data.table)
library(stringr)
# Set working directory
setwd("/Users/charleenadams/mr_acetylaminos_bmi/results")
# Find all filtered CSV files recursively
filtered_files <- list.files(path = ".", pattern = "^filtered_.*\\.csv$", recursive = TRUE, full.names = TRUE)
# Today's date
today_date <- format(Sys.Date(), "%Y-%m-%d")
# Create output directory if needed
dir.create("coloc_results", showWarnings = FALSE)
# Mapping of known case/control numbers
case_control_info <- list(
"PureHypercholesterolaemia" = list(ncase = 22622, ncontrol = 440388),
"Type2DiabetesNoComplications" = list(ncase = 12045, ncontrol = 450965),
"EssentialHypertension" = list(ncase = 54358, ncontrol = 408652),
"Obesity" = list(ncase = 4688, ncontrol = 458322),
"NonAlcoholicFattyLiverDisease" = list(ncase = 8434, ncontrol = 770180, n_total = 778614),
"Type2Diabetes" = list(ncase = 61714, ncontrol = 593952)
)
# Start looping through each file
for (file in filtered_files) {
cat("Running coloc for:", file, "\n")
# Read the data
dat <- fread(file)
# ❗ Remove duplicated SNPs (keep first occurrence only)
dat <- dat[!duplicated(dat$SNP), ]
# Build clean exposure_outcome name
clean_name <- gsub("^.*/filtered_|\\.csv$", "", file)
# Split the name by "_"
split_name <- strsplit(clean_name, "_")[[1]]
# Outcome is second-to-last element before the date
outcome_name <- split_name[length(split_name) - 1]
# Decide outcome type
if (tolower(outcome_name) == "bmi") {
outcome_type <- "quant"
s_outcome <- NULL
} else {
outcome_type <- "cc"
if (outcome_name %in% names(case_control_info)) {
info <- case_control_info[[outcome_name]]
if (!is.na(info$ncase) & !is.na(info$ncontrol)) {
s_outcome <- info$ncase / (info$ncase + info$ncontrol)
} else if (!is.na(info$n_total)) {
s_outcome <- 0.05 # fallback for NAFLD if needed
} else {
stop(paste0("No valid case/control info for outcome: ", outcome_name))
}
} else {
stop(paste0("Outcome not recognized in case_control_info: ", outcome_name))
}
}
# Build exposure dataset
dataset1 <- list(
beta = dat$beta.exposure,
varbeta = (dat$se.exposure)^2,
snp = dat$SNP,
MAF = dat$eaf.exposure,
type = "quant",
N = unique(dat$samplesize.exposure)
)
# Build outcome dataset
dataset2 <- list(
beta = dat$beta.outcome,
varbeta = (dat$se.outcome)^2,
snp = dat$SNP,
MAF = dat$eaf.outcome,
type = outcome_type,
N = unique(dat$samplesize.outcome)
)
if (!is.null(s_outcome)) {
dataset2$s <- s_outcome
}
# Run coloc.abf
coloc_res <- coloc.abf(dataset1, dataset2)
# Save coloc result
output_file <- paste0("coloc_results/coloc_", clean_name, "_", today_date, ".rds")
saveRDS(coloc_res, file = output_file)
}
# Compile the results
# Find all coloc result .rds files
coloc_files <- list.files(path = "coloc_results", pattern = "^coloc_.*\\.rds$", full.names = TRUE)
# Initialize empty list to store results
coloc_summary_list <- list()
# Loop through each coloc result file
for (file in coloc_files) {
cat("Collecting:", file, "\n")
# Read coloc result
res <- readRDS(file)
# Extract posterior probabilities
pp <- res$summary
# Extract exposure and outcome properly from filename
clean_name <- gsub("^coloc_|\\.rds$", "", basename(file))
parts <- unlist(strsplit(clean_name, "_"))
# parts[1] = ACY1
# parts[2] = 1Mb
# parts[3] = exposure (e.g., N-acetylalanine)
# parts[4] = outcome (e.g., BMI)
# parts[5] = date (ignore)
if (length(parts) < 5) {
stop(paste("Filename", file, "does not have expected structure!"))
}
exposure <- parts[3]
outcome <- parts[4]
# Assemble into a row
coloc_summary_list[[file]] <- data.frame(
exposure = exposure,
outcome = outcome,
PP.H0 = pp["PP.H0.abf"],
PP.H1 = pp["PP.H1.abf"],
PP.H2 = pp["PP.H2.abf"],
PP.H3 = pp["PP.H3.abf"],
PP.H4 = pp["PP.H4.abf"]
)
}
# Combine all into one data frame
coloc_summary_df <- rbindlist(coloc_summary_list)
# Save to CSV
fwrite(coloc_summary_df, file = "coloc_summary_table_2025-04-27.csv")
cat("✅ Coloc summary table saved as coloc_summary_table_2025-04-27.csv\n")
The results are shared through a Shiny app deployed at https://yodamendel.shinyapps.io/mr_acetylaminos_app/.
# Run with:
# shiny::runApp("/Users/charleenadams/mr_acetylaminos_bmi/mr_acetylaminos_app")
# rsconnect::deployApp('/Users/charleenadams/mr_acetylaminos_bmi/mr_acetylaminos_app')
library(shiny)
# Set CRAN repository to avoid mrcieu.r-universe.dev issues
options(repos = c(CRAN = "https://cloud.r-project.org"))
# Define file paths for results, plots, and colocalization summary
base_path <- file.path("data", "results")
exposures <- c("N-acetylmethionine", "N-acetylalanine", "N-acetylisoleucine", "N-acetylglutamate", "N-acetylserine")
outcomes <- c("BMI", "PureHypercholesterolaemia", "Type2DiabetesNoComplications", "EssentialHypertension",
"Obesity", "NonAlcoholicFattyLiverDisease", "Type2Diabetes")
file_date <- "2025-04-27" # Fixed date matching the result files
coloc_file <- file.path("data", "coloc_summary_table_2025-04-27 copy.csv") # Path to colocalization summary
# Generate file paths dynamically for MR results and plots
file_paths <- list()
plot_paths <- list()
for (exp in exposures) {
for (out in outcomes) {
folder <- file.path(base_path, paste0(exp, "_", out))
key <- paste(exp, out, sep = "_")
file_paths[[key]] <- file.path(folder, paste0("MR_Results_", exp, "_", out, "_", file_date, ".xlsx"))
plot_paths[[key]] <- file.path(folder, paste0("MR_Forestplot_", exp, "_", out, "_By_Method_", file_date, ".png"))
}
}
# UI
ui <- fluidPage(
tags$head(
tags$style(HTML("
#logo_container {
text-align: center;
margin-top: 20px;
margin-bottom: 20px;
opacity: 0;
transition: opacity 0.5s ease-in-out;
}
#logo_container.visible {
opacity: 1;
}
.image-grid {
display: flex;
flex-wrap: wrap;
justify-content: center;
gap: 20px;
margin-top: 20px;
}
.image-grid img {
width: 200px;
height: 200px;
object-fit: contain;
}
.error-message {
color: red;
font-size: 16px;
text-align: center;
}
")),
tags$script(HTML("
$(document).on('shiny:connected', function() {
var acy1Image = $('.image-grid');
var logoContainer = $('#logo_container');
$(window).on('scroll', function() {
var scrollPosition = $(window).scrollTop() + $(window).height();
var acy1Bottom = acy1Image.offset().top + acy1Image.outerHeight();
if (scrollPosition > acy1Bottom) {
logoContainer.addClass('visible');
} else {
logoContainer.removeClass('visible');
}
});
});
"))
),
div(style = "text-align: center; margin-bottom: 20px;",
h1("Mendelian Randomization and Colocalization Results", style = "color: #A94800;"),
h4(HTML("Analysis of Five Acetyl-Amino Acids (1MB around <i>ACY1</i> TSS) on Seven Metabolic Outcomes"), style = "color: #A94800;"),
tags$a(href = "https://www.bidmc.org/research/research-by-department/medicine/cardiovascular-medicine/personal-genomics-and-cardiometabolic-disease",
target = "_blank",
"Our Lab",
style = "font-size: 18px; color: #A94800; text-decoration: none; font-weight: bold;")
),
div(style = "background-color: #FFF3E0; padding: 20px; border-radius: 10px; margin-bottom: 20px; border: 1px solid #D2691E;",
h3(HTML("The Story of Acetyl-Amino Acids and <i>ACY1</i>"), style = "color: #D2691E; text-align: center; margin-bottom: 20px;"),
p(style = "text-align: justify; font-size: 16px; line-height: 1.6; color: #333333;",
HTML("Amino acids are the building blocks of life, but their acetylated forms—like <i>N</i>-acetylmethionine, <i>N</i>-acetylalanine, <i>N</i>-acetylisoleucine, <i>N</i>-acetylglutamate, and <i>N</i>-acetylserine—might hold secrets to metabolic health. These metabolites, regulated by the enzyme encoded by the <i>ACY1</i> gene, are more than just biochemical byproducts. Emerging evidence suggests they could influence conditions like obesity, diabetes, and liver disease.")),
p(style = "text-align: justify; font-size: 16px; line-height: 1.6; color: #333333;",
HTML("<i>ACY1</i>, or aminoacylase-1, is an enzyme that breaks down <i>N</i>-acetylated amino acids in the kidneys and other tissues. Genetic variants near its transcription start site (TSS) can alter its activity, potentially affecting circulating levels of these metabolites. But do these acetyl-amino acids actually cause changes in metabolic outcomes, or are they just along for the ride? That’s where Mendelian Randomization (MR) and colocalization come in.")),
p(style = "text-align: justify; font-size: 16px; line-height: 1.6; color: #333333;",
HTML("We used MR to test causality between five acetyl-amino acids and seven metabolic outcomes—BMI, pure hypercholesterolemia, type 2 diabetes (with and without complications), essential hypertension, obesity, and non-alcoholic fatty liver disease—forming 35 exposure-outcome pairs. Our exposure data came from the METSIM cohort (~6,000 Finnish men), and outcome data were sourced from large-scale GWAS (up to 778,614 participants).")),
p(style = "text-align: justify; font-size: 16px; line-height: 1.6; color: #333333;",
HTML("The MR pipeline employed five methods—IVW, MR-Egger, Weighted Median, MR-Lasso, and ConMix—with a relaxed p-value threshold (5E-4) and 500 kb clumping for independent SNPs. Sensitivity analyses checked for heterogeneity and pleiotropy. Additionally, we performed colocalization analysis on all 35 pairs to assess whether the genetic signals for exposures and outcomes share a common causal variant, strengthening causal inference.")),
p(style = "text-align: justify; font-size: 16px; line-height: 1.6; color: #333333;",
HTML("The results? 35 MR analyses showed hints of causality for some pairs, though not all pairs could be tested and most were null. Colocalization was null for all, though. Download the MR results, forest plots, or colocalization summary below to explore the full story. (If we want to use any figures, I'll regenerate them to remove or fix the titles.)"))
),
sidebarLayout(
sidebarPanel(
width = 3,
style = "background-color: #FFE8D6; padding: 15px; border-radius: 5px;",
h3("Instructions", style = "color: #A94800;"),
p("Select an acetyl-amino acid and metabolic outcome to download MR results (Excel) or forest plots (PNG). Download the colocalization summary (CSV) for all 35 pairs.", style = "color: #A94800;"),
hr(),
h3("Download Results", style = "color: #A94800;"),
selectInput("exposure", "Select Acetyl-Amino Acid:",
choices = exposures, selected = exposures[1]),
selectInput("outcome", "Select Metabolic Outcome:",
choices = outcomes, selected = outcomes[1]),
uiOutput("download_results_ui"),
br(),
uiOutput("download_plot_ui"),
br(),
downloadButton("download_coloc", "Colocalization Summary"),
hr(),
div(style = "border: 1px solid #D2691E; padding: 10px; background-color: #D2691E; border-radius: 5px;",
strong("Objective", style = "color: #FFFFFF;"),
p(HTML("This app shares MR and colocalization analyses of five acetyl-amino acids (1MB around <i>ACY1</i> TSS) on seven metabolic outcomes."),
style = "color: #FFFFFF;")
)
),
mainPanel(
width = 9,
div(style = "text-align: center; margin-top: 20px;",
div(style = "margin-bottom: 10px;",
p(HTML("Tap any acetyl-amino acid image to see methods and pipeline"),
style = "color: #A94800; font-size: 24px; font-weight: bold; text-decoration: none;")
),
div(class = "image-grid",
tags$a(href = "https://rpubs.com/YodaMendel/1303448", target = "_blank",
img(id = "acy1_image", src = "image.png", alt = "N-acetylmethionine")),
tags$a(href = "https://rpubs.com/YodaMendel/1303448", target = "_blank",
img(src = "image2.png", alt = "N-acetylalanine")),
tags$a(href = "https://rpubs.com/YodaMendel/1303448", target = "_blank",
img(src = "image3.png", alt = "N-acetylisoleucine")),
tags$a(href = "https://rpubs.com/YodaMendel/1303448", target = "_blank",
img(src = "image4.png", alt = "N-acetylglutamate")),
tags$a(href = "https://rpubs.com/YodaMendel/1303448", target = "_blank",
img(src = "image5.png", alt = "N-acetylserine"))
)
),
div(id = "logo_container",
tags$a(href = "https://www.bidmc.org/research/research-by-department/medicine/cardiovascular-medicine/personal-genomics-and-cardiometabolic-disease",
target = "_blank",
img(src = "BIDMC_HMS_Stacked-LockUp.png", height = "100px", width = "auto"))
)
)
)
)
# Server
server <- function(input, output, session) {
# Reactive file path based on user selection
selected_file <- reactive({
file_paths[[paste(input$exposure, input$outcome, sep = "_")]]
})
selected_plot <- reactive({
plot_paths[[paste(input$exposure, input$outcome, sep = "_")]]
})
# Dynamic UI for results download
output$download_results_ui <- renderUI({
file_path <- selected_file()
if (file.exists(file_path)) {
downloadButton("download_results", "Download MR Results")
} else {
div(class = "error-message",
paste("Results file not available for", input$exposure, "and", input$outcome))
}
})
output$download_results <- downloadHandler(
filename = function() {
paste0("MR_Results_", input$exposure, "_", input$outcome, "_", file_date, ".xlsx")
},
content = function(file) {
tryCatch({
file.copy(selected_file(), file)
cat("Downloaded results file:", selected_file(), "\n")
}, error = function(e) {
cat("Error downloading results file:", selected_file(), "\nError:", conditionMessage(e), "\n")
stop(e)
})
}
)
# Dynamic UI for plot download
output$download_plot_ui <- renderUI({
plot_path <- selected_plot()
if (file.exists(plot_path)) {
downloadButton("download_plot", "Download Forest Plot")
} else {
div(class = "error-message",
paste("Forest plot not available for", input$exposure, "and", input$outcome))
}
})
output$download_plot <- downloadHandler(
filename = function() {
paste0("MR_Forestplot_", input$exposure, "_", input$outcome, "_By_Method_", file_date, ".png")
},
content = function(file) {
tryCatch({
file.copy(selected_plot(), file)
cat("Downloaded plot file:", selected_plot(), "\n")
}, error = function(e) {
cat("Error downloading plot file:", selected_plot(), "\nError:", conditionMessage(e), "\n")
stop(e)
})
}
)
# Download handler for colocalization summary
output$download_coloc <- downloadHandler(
filename = "coloc_summary_table_2025-04-27.csv",
content = function(file) {
tryCatch({
file.copy(coloc_file, file)
cat("Downloaded colocalization file:", coloc_file, "\n")
}, error = function(e) {
cat("Error downloading colocalization file:", coloc_file, "\nError:", conditionMessage(e), "\n")
stop(e)
})
},
contentType = "text/csv"
)
}
# Run app
shinyApp(ui = ui, server = server)