Mendelian randomization of N-Acetylglutamine (1MB around ACY1 TSS) and body mass index

1 Methods
- 1.1 Study Overview
- 1.2 Data Sources
3 Bioinformatics Pipeline Prep
4 Mendelian Randomization Analyses
5 MR
6 MR with COJO Instruments
- 6.1 Heterogeneity and Pleiotropy
7 FinnGen Replication Attempt
8 FinnGen Replication COJO
9 Main Findings

1 Methods

1.1 Study Overview

We conducted a two-sample Mendelian randomization (MR) analysis to evaluate the causal effect of N-acetylglutamine (NAG) levels on body mass index (BMI), using genetic instruments restricted to a 1 Mb region (±500 kb) surrounding the transcription start site (TSS) of the ACY1 gene. Genetic instruments for NAG were derived from the METSIM cohort, and BMI summary statistics were obtained from the Jurgens et al. (2022) UK Biobank (UKB) GWAS. All analyses were performed in R (version 4.4.3) and Python (version 3.12.9).

1.2 Data Sources

1.2.1 Exposure Data: N-Acetylglutamine (NAG)

Summary statistics for NAG were sourced from the METSIM (Metabolic Syndrome in Men) study, a population-based cohort of Finnish men. The NAG data, available in GRCh38 coordinates, were downloaded from the PheWeb repository (https://pheweb.org/metsim-metab/pheno/C100001253) as a compressed file (C100001253). This file was decompressed using gunzip phenocode-C100001253.tsv.gz to yield phenocode-C100001253.tsv, containing association statistics for NAG across 6,099 individuals. The dataset includes:

Chromosome (chrom)
Position (pos)
Reference (ref) and alternate (alt) alleles
Effect sizes (beta)
Standard errors (sebeta)
P-values (pval)
Minor allele frequencies (maf)
rsIDs (rsids)

1.2.2 Outcome Data: Body Mass Index (BMI)

BMI summary statistics were obtained from the Jurgens et al. (2022) GWAS of European ancestry individuals in the UK Biobank (https://personal.broadinstitute.org/ryank/Jurgens_Pirruccello_2022_GWAS_Sumstats.zip). The file GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.tsv provides association results for 460,000 participants, originally reported in GRCh37 (hg19) coordinates. Key columns include:

Chromosome (CHR)
Base pair position (BP)
SNP identifier (SNP)
Effect allele (ALLELE1)
Other allele (ALLELE0)
Effect allele frequency (A1FREQ)
Beta coefficient (BETA)
Standard error (SE)
P-value (P)

2 Getting the Data

# NAG: 
# on GRCh38
wget https://pheweb.org/metsim-metab/download/C100001253

# BMI:
wget https://personal.broadinstitute.org/ryank/Jurgens_Pirruccello_2022_GWAS_Sumstats.zip
# select: GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.tsv
# use LiftOver to obtain GRCh38 coordinates

2.1 Coordinate Conversion for BMI Data

To align the BMI data with the NAG data (GRCh38), we converted the GRCh37 coordinates to GRCh38 using a three-step bioinformatics pipeline implemented in Python and Unix:

Conversion to BED Format: A Python script (make_bed.py) transformed the BMI TSV file into a BED format, extracting chromosome, start (BP - 1 for 0-based coordinates), end (BP), and SNP ID. The output was saved as GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.bed.
LiftOver: The UCSC liftOver tool was applied using the chain file hg19ToHg38.over.chain.gz to map GRCh37 coordinates to GRCh38. Successfully mapped SNPs were written to GWAS_sumstats_EUR__invnorm_bmi__TOTALsample_hg38.bed, with unmapped SNPs logged in unmapped_SNPs.bed.
Merging with Original Data: A second Python script (format_bmi.py) processed the lifted BED file, adjusting coordinates to 1-based notation and merging them with the original TSV data using the SNP column. The merged dataset was sorted by SNP, processed in chunks (500,000 rows) to manage memory, and renamed to conform to the TwoSampleMR R package requirements: effect_allele (ALLELE1), other_allele (ALLELE0), eaf (A1FREQ), beta (BETA), se (SE), pval (P), and BP_GRCh38 (new position). The final output was saved as merged_gwas_data_grch38_twosample.tsv.

2.2 Data Cleaning and Harmonization

Both datasets underwent cleaning and harmonization in R using the data.table, dplyr, and tidyr packages:

NAG Data: Rows with missing or invalid rsIDs were filtered out (!is.na(rsids) & rsids != "" & grepl("^rs", rsids)), and the data were formatted for MR analysis with columns including chr.exposure, pos.exposure, beta.exposure, se.exposure, pval.exposure, effect_allele.exposure, other_allele.exposure, eaf.exposure, samplesize.exposure (5,830), and SNP.
BMI Data: Similar filtering removed invalid SNPs, and the data were formatted with chr.outcome, pos.outcome, beta.outcome, se.outcome, pval.outcome, effect_allele.outcome, other_allele.outcome, eaf.outcome, samplesize.outcome (460,000), and SNP.
Harmonization: The harmonise_data function from the TwoSampleMR package aligned NAG (exposure) and BMI (outcome) data by matching effect alleles and removing palindromic SNPs whose strand could not be resolved, producing a harmonized dataset saved as harmonized_nag_Jurgens_BMI_dat_[date].csv.
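The allele-matching idea behind harmonization can be sketched in a few lines. This is a simplified illustration, not the TwoSampleMR implementation: the function names are hypothetical, strand flips are not handled, and every palindromic SNP is flagged for removal regardless of allele frequency, which may be stricter than what harmonise_data does.

```python
# Simplified allele harmonization for a single SNP (illustrative only).
# Assumption: both datasets report alleles on the same strand.

PALINDROMES = {frozenset("AT"), frozenset("CG")}


def is_palindromic(a1, a2):
    """True for A/T or C/G SNPs, whose strand cannot be inferred from the alleles."""
    return frozenset((a1.upper(), a2.upper())) in PALINDROMES


def harmonise_snp(exp_ea, exp_oa, out_ea, out_oa, beta_out):
    """Return (outcome beta aligned to the exposure effect allele, keep flag)."""
    exp_ea, exp_oa = exp_ea.upper(), exp_oa.upper()
    out_ea, out_oa = out_ea.upper(), out_oa.upper()
    if is_palindromic(exp_ea, exp_oa):
        return beta_out, False   # ambiguous strand: flag for removal
    if (exp_ea, exp_oa) == (out_ea, out_oa):
        return beta_out, True    # already aligned
    if (exp_ea, exp_oa) == (out_oa, out_ea):
        return -beta_out, True   # effect/other alleles swapped: flip the sign
    return beta_out, False       # alleles do not match: flag for removal


# Hypothetical SNP: the outcome GWAS reports the opposite effect allele
print(harmonise_snp("A", "G", "G", "A", 0.02))
```

The sign flip matters because the MR estimate is a ratio of SNP effects, so both effects must refer to the same allele.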

3 Bioinformatics Pipeline Prep

BMI summary statistics were on GRCh37. I wanted the GRCh38 coordinates for other post-GWAS analyses.

This was a 3-step process:

Converting the BMI summary statistics to BED format.
Running liftOver to obtain GRCh38 coordinates
Merging the liftedOver coordinates with the original TSV

Step 1: Converts BMI from TSV to BED file format

This first script (make_bed.py) converts the BMI summary statistics file (in TSV format) to a BED file format.

import csv

# Open the input TSV file and prepare to write to the output BED file
with open('GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.tsv', 'r') as tsvfile, open('GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.bed', 'w') as bedfile:
    tsv_reader = csv.DictReader(tsvfile, delimiter='\t')
    
    for row in tsv_reader:
        # Extract the necessary information
        chromosome = f"chr{row['CHR']}"
        start = int(row['BP']) - 1  # Convert to 0-based for BED format
        end = row['BP']
        snp_id = row['SNP']
        
        # Write to BED file
        bedfile.write(f"{chromosome}\t{start}\t{end}\t{snp_id}\n")

print("Conversion complete. Check 'GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.bed' for the output.")
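The coordinate convention used by the script above can be checked on a single record. The sketch below uses hypothetical helper names (bp_to_bed, bed_to_bp) and a made-up SNP; it only illustrates that a 1-based position becomes a BED interval with a 0-based start and that the BED end column recovers the 1-based position.

```python
def bp_to_bed(chrom, bp, snp_id):
    """Convert a 1-based position to BED fields: (chrom, 0-based start, end, name)."""
    bp = int(bp)
    return (f"chr{chrom}", bp - 1, bp, snp_id)


def bed_to_bp(bed_record):
    """Recover the 1-based position from a single-base BED record (its end column)."""
    return int(bed_record[2])


record = bp_to_bed(3, 1000, "rs123")   # hypothetical SNP
print(record)                          # ('chr3', 999, 1000, 'rs123')
print(bed_to_bp(record))               # 1000
```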

Step 2: liftOver

# Arguments, in order:
#   1. input BED file
#   2. chain file for converting GRCh37 (hg19) → GRCh38 (hg38)
#   3. output BED file (successfully mapped SNPs)
#   4. output BED file for unmapped SNPs
# Comments cannot follow a trailing backslash, so they are listed here instead.
liftOver \
  GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.bed \
  hg19ToHg38.over.chain.gz \
  GWAS_sumstats_EUR__invnorm_bmi__TOTALsample_hg38.bed \
  unmapped_SNPs.bed

Step 3: This 2nd script (format_bmi.py) does the following:

Reads the BED File:

Reads GWAS_sumstats_EUR__invnorm_bmi__TOTALsample_hg38.bed containing:
- Chromosome (CHR)
- Start and end positions
- SNP identifiers
Adjusts coordinates from 0-based to 1-based:
- Assigns END positions to a new column called BP_GRCh38.

Sorts BED Data:

Sorts the BED data by the SNP column for efficient merging later.

Processes the GWAS TSV File:

Reads the large GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.tsv file in chunks (500,000 rows at a time) to avoid memory overflow.
Sorts each chunk by SNP for faster and more efficient merging.

Merges:

Merges the chunked data with the BED file using the SNP column.
Adds the BP_GRCh38 (new genome build position) to the GWAS data.
Uses a left join to ensure all GWAS SNPs are preserved, even if there isn’t a matching position in the BED file.

Rename Columns:

Renames key columns to fit the format required by the TwoSampleMR package:

ALLELE1 → effect_allele
ALLELE0 → other_allele
A1FREQ → eaf (effect allele frequency)
BETA → beta
SE → se (standard error)
P → pval

Saves Merged Results:

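Combines all the merged chunks into one large DataFrame.
Saves the merged DataFrame twice: as merged_gwas_data_grch38_original.tsv with the original column names, and as merged_gwas_data_grch38_twosample.tsv with the renamed columns for downstream analysis using the TwoSampleMR package.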

import pandas as pd

# Define chunk size for reading large TSV files in smaller parts
chunk_size = 500000  # Adjust based on memory availability

# Initialize an empty list to store the merged chunks
merged_chunks = []

try:
    # Step 1: Load the BED file (without headers)
    bed_columns = ['CHR', 'START', 'END', 'SNP']
    bed_data = pd.read_csv('GWAS_sumstats_EUR__invnorm_bmi__TOTALsample_hg38.bed', sep='\t', header=None, names=bed_columns)
    print("Loaded BED file successfully.", flush=True)

    # Step 2: Use the BED END column as the 1-based position (END already equals START + 1 for single-base records)
    bed_data['BP_GRCh38'] = bed_data['END']
    print("Converted BED file positions to one-based.", flush=True)

    # Step 3: Sort the BED data by the SNP column to optimize merging
    bed_data = bed_data.sort_values(by='SNP')
    print("Sorted BED data by SNP.", flush=True)

    # Step 4: Process the TSV file in chunks, and merge each chunk with the BED data
    for chunk in pd.read_csv('GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.tsv', sep='\t', chunksize=chunk_size):
        print(f"Processing chunk...", flush=True)
        # Sort the chunk by SNP for more efficient merging
        chunk = chunk.sort_values(by='SNP')

        # Merge the chunk with the BED data
        merged_chunk = pd.merge(chunk, bed_data[['CHR', 'SNP', 'BP_GRCh38']], on='SNP', how='left')

        # Append the merged chunk to the list of merged chunks
        merged_chunks.append(merged_chunk)

    # Step 5: Concatenate all merged chunks together into one DataFrame
    merged_data = pd.concat(merged_chunks)
    print("Merged all chunks successfully.", flush=True)

    # Step 6: Save the merged data to a new TSV file before changing column names
    merged_data.to_csv('merged_gwas_data_grch38_original.tsv', sep='\t', index=False)
    print("Saved 'merged_gwas_data_grch38_original.tsv' successfully.", flush=True)

    # Step 7: Rename columns for TwoSampleMR. SNP, BP, and BP_GRCh38 keep their names.
    # Note: both inputs carry a CHR column, so the merge above yields CHR_x (from the TSV)
    # and CHR_y (from the BED file, with a "chr" prefix); downstream code reads CHR_x.
    merged_data.rename(columns={
        'ALLELE1': 'effect_allele',  # ALLELE1 -> effect_allele
        'ALLELE0': 'other_allele',   # ALLELE0 -> other_allele
        'A1FREQ': 'eaf',             # A1FREQ -> effect allele frequency
        'BETA': 'beta',              # BETA -> beta
        'SE': 'se',                  # SE -> standard error
        'P': 'pval'                  # P -> p-value
    }, inplace=True)

    # Step 8: Save the final TwoSampleMR-compatible file in TSV format
    merged_data.to_csv('merged_gwas_data_grch38_twosample.tsv', sep='\t', index=False)
    print("Saved 'merged_gwas_data_grch38_twosample.tsv' successfully.", flush=True)

    # Step 9 (optional / didn't work): Save the merged data in Parquet format for faster reading in the future
    # merged_data.to_parquet('merged_gwas_data_grch38.parquet', index=False)
    # print("Saved 'merged_gwas_data_grch38.parquet' successfully (Parquet format).", flush=True)

except Exception as e:
    print(f"An error occurred: {e}", flush=True)
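Because the merge above is a left join, SNPs that liftOver could not map remain in the output with an empty BP_GRCh38 value. A minimal sketch of how one might count them with the standard library follows; count_unlifted is a hypothetical helper, and the in-memory rows are made up to stand in for merged_gwas_data_grch38_twosample.tsv.

```python
import csv
import io


def count_unlifted(handle, column="BP_GRCh38"):
    """Return (total_rows, rows_with_missing_position) for a tab-delimited file."""
    reader = csv.DictReader(handle, delimiter="\t")
    total = missing = 0
    for row in reader:
        total += 1
        value = (row.get(column) or "").strip()
        # pandas writes missing values as empty strings by default; also accept "NaN"
        if value == "" or value.lower() == "nan":
            missing += 1
    return total, missing


# Tiny in-memory stand-in for the merged file (made-up rows)
demo = io.StringIO(
    "SNP\tBP\tBP_GRCh38\n"
    "rs1\t1000\t1200\n"
    "rs2\t2000\t\n"
)
print(count_unlifted(demo))  # (2, 1)
```

For the real file, the same function can be called on an open file handle, for example `with open("merged_gwas_data_grch38_twosample.tsv") as fh: count_unlifted(fh)`.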

4 Mendelian Randomization Analyses

We used two primary MR approaches to assess the causal effect of NAG on BMI within the ACY1 region, using R packages TwoSampleMR, MendelianRandomization, ieugwasr, and MRInstruments, with visualization via ggplot2.

4.1 Suite of MR Approaches

Instrument Selection: From the filtered dataset, genetic instruments were selected with a relaxed p-value threshold of 5 × 10⁻⁶ (versus the conventional 5 × 10⁻⁸) to maximize power within the ACY1 TSS region, justified by prior evidence of NAG association. Instruments were required to have an F-statistic > 10 to ensure strength, calculated per SNP as \[ F = \left( \frac{\beta_{exposure}}{SE_{exposure}} \right)^2 \]. Steiger filtering (steiger_filtering) retained only SNPs that explained more variance in the exposure than in the outcome, consistent with a causal direction from NAG to BMI.
Linkage Disequilibrium (LD) Clumping: Independent instruments were identified using ld_clump with a clumping window of 500 kb (versus the default 10,000 kb) and an r^2 threshold of 0.001, referencing the 1000 Genomes European (EUR) population (bfile = EUR).
MR Methods: Five MR methods were applied:
- Inverse Variance Weighted (IVW): Assumes no horizontal pleiotropy (mr_ivw).
- MR-Egger: Adjusts for directional pleiotropy (mr_egger_regression).
- Weighted Median: Robust to invalid instruments if <50% are pleiotropic (mr_weighted_median).
- MR-Lasso: Identifies and adjusts for pleiotropic outliers (mr_lasso).
- Contamination Mixture (ConMix): Models mixture distributions to account for invalid instruments (mr_conmix).
Sensitivity Analyses: Heterogeneity was assessed with mr_heterogeneity, pleiotropy with mr_pleiotropy_test, and leave-one-out analysis with mr_leaveoneout. Wald ratio tests were computed for each instrument individually.
When only one instrument was available, only the Wald ratio was computed, because the multi-instrument methods and sensitivity analyses above require more than one SNP.
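The instrument filter and the IVW estimator described above can be sketched as follows. This is an illustration under stated assumptions, not the code used by TwoSampleMR: the function names and the numeric inputs are hypothetical, and the IVW shown is the simple fixed-effect form (R packages may apply a random-effects variant).

```python
import math


def f_statistic(beta, se):
    """Per-SNP F-statistic approximated as (beta / SE)^2."""
    return (beta / se) ** 2


def select_instruments(snps, p_threshold=5e-6, f_min=10.0):
    """Keep SNPs passing both the p-value and the F-statistic thresholds."""
    return [
        s for s in snps
        if s["pval"] < p_threshold and f_statistic(s["beta"], s["se"]) > f_min
    ]


def ivw_fixed(beta_exp, beta_out, se_out):
    """Fixed-effect IVW: weighted regression of outcome on exposure effects
    through the origin, with weights 1 / se_out^2."""
    num = sum(x * y / s ** 2 for x, y, s in zip(beta_exp, beta_out, se_out))
    den = sum(x * x / s ** 2 for x, s in zip(beta_exp, se_out))
    return num / den, math.sqrt(1.0 / den)


# Hypothetical SNPs: only the first passes both thresholds
snps = [
    {"rsid": "rsA", "beta": 0.50, "se": 0.10, "pval": 1e-7},
    {"rsid": "rsB", "beta": 0.05, "se": 0.04, "pval": 0.2},
]
print([s["rsid"] for s in select_instruments(snps)])

# With a single instrument, IVW reduces to the Wald ratio beta_out / beta_exp
print(ivw_fixed([0.5], [0.1], [0.05]))
```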

4.1.1 MR with COJO Instruments

Instrument Selection: As in the Suite approach, instruments were filtered at p < 5 × 10⁻⁶ and F > 10, followed by Steiger filtering.
Conditional Joint Analysis (COJO): The GCTA software (version 1.9x) was used to perform COJO analysis (gcta64 --cojo-slct --cojo-p 5e-6) on the filtered instruments, formatted as a tab-delimited file (cojo_input.txt) with SNP, alleles (A1, A2), frequency, beta, SE, p-value, and sample size. LD was estimated using the 1000 Genomes EUR reference panel. Conditionally independent SNPs were extracted from cojo_output.jma.cojo and merged back into the instrument set, updating beta, SE, and p-values.
MR Methods: The same five MR methods (IVW, MR-Egger, Weighted Median, MR-Lasso, ConMix) were applied to the COJO-selected instruments, with identical sensitivity analyses.
Output: Results were saved in MR_Results_ACY1_Jurgens_BMI_COJO_[date].xlsx and visualized in MR_Forestplot_ACY1_Jurgens_BMI_COJO_[date].png.
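For readers who prefer to launch COJO from Python rather than from R's system() call, the command line can be assembled as a list of arguments. This is a sketch only: build_cojo_command is a hypothetical helper, the file names are placeholders, and only the GCTA flags already named in this document are used.

```python
def build_cojo_command(bfile, cojo_file, out_prefix, p_threshold=5e-6, gcta_bin="gcta64"):
    """Assemble the GCTA-COJO stepwise-selection command as an argument list."""
    return [
        gcta_bin,
        "--bfile", bfile,            # PLINK reference panel prefix (e.g. 1000 Genomes EUR)
        "--cojo-file", cojo_file,    # summary statistics: SNP A1 A2 freq b se p N
        "--cojo-slct",               # stepwise selection of conditionally independent SNPs
        "--cojo-p", f"{p_threshold:g}",
        "--out", out_prefix,
    ]


cmd = build_cojo_command("EUR", "cojo_input.txt", "cojo_output")
print(" ".join(cmd))
# To execute (requires GCTA on the PATH): subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.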

4.2 Statistical Software and Visualization

Languages: R (version 4.4.3) and Python (version 3.12.9).
R Packages: TwoSampleMR, MendelianRandomization, ieugwasr, MRInstruments, data.table, dplyr, tidyr, readxl, openxlsx, ggplot2, ggrepel, corrplot, RhpcBLASctl, biomaRt, scales.
Python Libraries: pandas, csv.
Other Tools: liftOver (UCSC), GCTA (version 1.9x), PLINK (via genetics.binaRies).
Visualization: Forest plots were created with ggplot2, using distinct colors for each MR method and error bars representing 95% confidence intervals, saved at 600 DPI.

5 MR

Only one instrument: rs150416778

Performed with a relaxed pval (5E-6) threshold (vs 5E-8) and clumping window = 500 kb
Relaxing the p-value threshold for instrument selection is justified here, given the focus on a TSS region and our a priori knowledge. I’m less concerned about a SNP in ACY1’s TSS being a false positive for a NAG association than I am about another violation of the MR assumptions; the real challenge is horizontal pleiotropy, which I’ve addressed using a range of MR methods and sensitivity analyses.
The default window for ld_clump is 10000 kb, which is 10x larger than our 1 Mb ACY1 region. So, I chose a much smaller window (500 kb).
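The effect of the clumping window can be illustrated with a toy greedy clumping routine. This is a simplified sketch of the general idea (keep the most significant SNP, drop nearby SNPs in LD with it), not PLINK's or ieugwasr's implementation; greedy_clump, the SNP positions, and the r² values are all made up for illustration.

```python
def greedy_clump(snps, r2, clump_kb=500, clump_r2=0.001):
    """Return rsids kept after greedy clumping.

    snps: list of dicts with keys rsid, pos (bp), pval.
    r2:   dict mapping frozenset({rsid_a, rsid_b}) to pairwise r^2 (missing pairs = 0).
    """
    kept, removed = [], set()
    for index in sorted(snps, key=lambda s: s["pval"]):
        if index["rsid"] in removed:
            continue
        kept.append(index["rsid"])
        for other in snps:
            rsid = other["rsid"]
            if rsid == index["rsid"] or rsid in removed or rsid in kept:
                continue
            within_window = abs(other["pos"] - index["pos"]) <= clump_kb * 1000
            in_ld = r2.get(frozenset((index["rsid"], rsid)), 0.0) > clump_r2
            if within_window and in_ld:
                removed.add(rsid)
    return kept


# Made-up SNPs: rsC is 800 kb from rsA and in LD with it
snps = [
    {"rsid": "rsA", "pos": 100_000, "pval": 1e-8},
    {"rsid": "rsB", "pos": 200_000, "pval": 1e-7},
    {"rsid": "rsC", "pos": 900_000, "pval": 1e-6},
]
r2 = {frozenset(("rsA", "rsB")): 0.5, frozenset(("rsA", "rsC")): 0.5}

print(greedy_clump(snps, r2, clump_kb=500))     # ['rsA', 'rsC']
print(greedy_clump(snps, r2, clump_kb=10000))   # ['rsA']
```

In this toy example the wider window removes a SNP that the 500 kb window keeps, which is the trade-off behind the window choice.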

rm(list = ls())

# Load Required Libraries

# Mendelian Randomization
library(TwoSampleMR)          # Core package for Mendelian Randomization (MR) analyses
library(MendelianRandomization) # Additional MR methods, including MR-Lasso
library(ieugwasr)             # For local LD clumping with ld_clump()
library(MRInstruments)        # For proxy SNP lookup (if needed)

# Data Wrangling & Handling
library(dplyr)                # Data manipulation (filtering, joining, summarizing)
library(tidyr)                # Data reshaping
library(data.table)           # Fast and efficient data handling
library(readxl)               # Reading Excel files
library(openxlsx)             # Writing and formatting Excel files efficiently

# Statistical & Visualization
library(ggplot2)              # For plotting
library(ggrepel)              # Improved text labeling in plots
library(corrplot)             # Visualization of correlation matrices
library(RhpcBLASctl)          # Control multithreading for efficiency in MR analyses
library(biomaRt)              # Querying Ensembl for gene annotation
library(scales)               # For dynamic color generation

# Set Working Directory & Setup Folder for Project
setwd("/Users/charleenadams/mr_nag_bmi/")
if (!dir.exists("results_jurgens_p5E6")) dir.create("results_jurgens_p5E6", recursive = TRUE)

# ---------------------------------------------
# Load and Format NAG Data
# ---------------------------------------------

nag <- fread("/Users/charleenadams/mr_nag_bmi/phenocode-C100001253.tsv") %>% as.data.frame()
filtered_nag_df <- nag %>% 
  filter(!is.na(rsids) & rsids != "" & grepl("^rs", rsids)) %>%
  arrange(rsids)

exp_dat <- filtered_nag_df %>%
  mutate(
    chr.exposure = chrom,
    pos.exposure = pos,   
    beta.exposure = beta,
    se.exposure = sebeta,
    exposure = "N-acetylglutamine",
    id.exposure = "N-acetylglutamine",
    pval.exposure = pval,
    SNP.exposure = rsids,
    SNP = rsids,
    effect_allele.exposure = alt,
    other_allele.exposure = ref,
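    eaf.exposure = maf,  # caution: maf is the minor allele frequency; it equals the alt (effect) allele frequency only when alt is the minor allele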
    samplesize.exposure = 5830,
    id_col = nearest_genes
  )

# ---------------------------------------------
# Load and Format BMI Data
# ---------------------------------------------

bmi <- fread("/Users/charleenadams/mr_nag_bmi/merged_gwas_data_grch38_twosample.tsv")
filtered_bmi_df <- bmi %>%
  filter(!is.na(SNP) & SNP != "" & grepl("^rs", SNP)) %>%
  arrange(SNP)

out_dat <- filtered_bmi_df %>%
  mutate(
    SNP = SNP,
    SNP.outcome = SNP,
    chr.outcome = CHR_x,
    pos.outcome = BP_GRCh38,
    beta.outcome = beta,
    se.outcome = se,
    pval.outcome = pval,
    effect_allele.outcome = effect_allele,
    other_allele.outcome = other_allele,
    eaf.outcome = eaf,
    samplesize.outcome = 460000,
    id.outcome = "BMI",
    outcome = "BMI"
  )

# ---------------------------------------------
# Harmonize Data
# ---------------------------------------------

dat <- harmonise_data(exposure_dat = exp_dat, outcome_dat = out_dat)
today <- Sys.Date()
write.csv(dat, paste0("/Users/charleenadams/mr_nag_bmi/harmonized_nag_Jurgens_BMI_dat_", today, ".csv"), row.names = FALSE)

# ---------------------------------------------
# Select 1MB around ACY1 region
# ---------------------------------------------

dat <- fread("/Users/charleenadams/mr_nag_bmi/harmonized_nag_Jurgens_BMI_dat_2025-04-02.csv")
# "/Users/charleenadams/mr_ukbppp_chd/mart_export_clean.txt" #build 38
# Load the necessary data (local_annotation)
local_annotation <- read.table("/Users/charleenadams/mr_ukbppp_chd/mart_export_clean.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Filter for the gene ACY1
acy1_data <- local_annotation[local_annotation$Gene.name == "ACY1", ]

# Check the data for ACY1
head(acy1_data)

# Extract the TSS (Transcription Start Site)
# The TSS is taken as the smallest transcript start in the `Transcript.start..bp.` column.
# This holds for a plus-strand gene; for a minus-strand gene the TSS is the transcript end.
acy1_tss <- min(acy1_data$Transcript.start..bp.)

# Define the 500kb window up and down from the TSS
upstream_window <- acy1_tss - 500000
downstream_window <- acy1_tss + 500000

# Display the results
cat("TSS for ACY1:", acy1_tss, "\n")
cat("500kb upstream:", upstream_window, "\n")
cat("500kb downstream:", downstream_window, "\n")

ACY1_chrom <- acy1_data$Chromosome.scaffold.name[1]

# Define the filtered dataset
dat_filtered <- dat %>%
  filter(chr.outcome == ACY1_chrom & pos.outcome >= upstream_window & pos.outcome <= downstream_window)

# Create the filename with the current system date
file_name <- paste0("/Users/charleenadams/mr_nag_bmi/filtered_ACY1_1Mb_Jurgens_BMI_", Sys.Date(), ".csv")

# Save the filtered data to a CSV file with the system date in the filename
write.csv(dat_filtered, file = file_name, row.names = FALSE)

# ---------------------------------------------
# MR
# ---------------------------------------------

# Step 1: Filter Instruments
dat_filtered <- read.csv("/Users/charleenadams/mr_nag_bmi/filtered_ACY1_1Mb_Jurgens_BMI_2025-04-02.csv")
instruments <- dat_filtered %>%
  filter(mr_keep == TRUE, pval.exposure < 5e-6) %>%  # Relaxed threshold
  mutate(
    rsid = SNP,
    pval = pval.exposure,
    id = "N-acetylglutamine (1MB around ACY1 TSS)",
    F_stat = (beta.exposure / se.exposure)^2  # Compute F-statistic
  ) %>%
  filter(F_stat > 10)  # Exclude weak instruments

cat("Number of instruments after F-stat filtering:", nrow(instruments), "\n")

# Step 1.5: Steiger Filtering
instruments <- steiger_filtering(instruments)
instruments <- instruments %>% filter(steiger_dir == TRUE)  # Keep only SNPs with correct direction
cat("Number of instruments after Steiger filtering:", nrow(instruments), "\n")
cat("Preview of Steiger-filtered instruments:\n")
print(head(instruments))

# Subset instruments to keep only relevant fields
instruments_subset <- instruments %>%
  dplyr::select(
    SNP, 
    effect_allele.exposure, other_allele.exposure,
    effect_allele.outcome, other_allele.outcome,
    beta.exposure, se.exposure, pval.exposure,
    beta.outcome, se.outcome, pval.outcome,
    eaf.exposure, eaf.outcome,
    id.exposure, exposure,
    id.outcome, outcome,
    samplesize.exposure, samplesize.outcome,
    mr_keep, action,
    F_stat,
    steiger_dir, steiger_pval
  )
instruments_subset$rsid <- instruments_subset$SNP

# Step 2: Local LD Clumping
clumped <- ld_clump(
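  # instruments_subset has no `pval` or `id` columns after the select() above,
  # so reference the exposure columns explicitly
  dplyr::tibble(rsid = instruments_subset$SNP, pval = instruments_subset$pval.exposure, id = instruments_subset$id.exposure),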
  plink_bin = genetics.binaRies::get_plink_binary(),
  bfile = "/Users/charleenadams/1000G_bfiles/EUR/EUR",
  clump_r2 = 0.001,
  clump_kb = 500  # Reduced window
)
clumped_dat <- instruments_subset %>% dplyr::filter(SNP %in% clumped$rsid)
cat("Local clumping completed. Number of SNPs retained:", nrow(clumped_dat), "\n")
cat("Preview of clumped data:\n")
print(head(clumped_dat))

# Subset clumped_dat to keep only relevant fields
clumped_dat_subset <- clumped_dat %>%
  dplyr::select(
    SNP, 
    effect_allele.exposure, other_allele.exposure,
    effect_allele.outcome, other_allele.outcome,
    beta.exposure, se.exposure, pval.exposure,
    beta.outcome, se.outcome, pval.outcome,
    eaf.exposure, eaf.outcome,
    id.exposure, exposure,
    id.outcome, outcome,
    samplesize.exposure, samplesize.outcome,
    mr_keep, action,
    F_stat,
    steiger_dir, steiger_pval
  )

# Step 3: Perform Wald Ratio Tests for Each Instrument
wald_ratios <- clumped_dat %>%
  mutate(
    wald_beta = beta.outcome / beta.exposure,
    wald_se = sqrt((se.outcome^2 / beta.exposure^2) + ((beta.outcome^2 * se.exposure^2) / (beta.exposure^4))),
    pval = 2 * pnorm(abs(wald_beta / wald_se), lower.tail = FALSE),
    method = paste("Wald Ratio:", SNP)
  ) %>%
  dplyr::select(SNP, wald_beta, wald_se, pval, method)

cat("\n=== Wald Ratio Tests for Each Instrument ===\n")
print(wald_ratios)

# Step 4: Bind Results
mr_results <- bind_rows(
  wald_ratios %>% dplyr::select(method, b = wald_beta, se = wald_se, pval)
) %>%
  mutate(method = as.factor(method)) %>%
  filter(!is.na(method) & method != "NA") %>%
  arrange(b) %>%
  mutate(method = factor(method, levels = unique(method)))

# Step 5: Create Forest Plot
base_colors <- c("wald_ratios" = "#33A02C")

wald_methods <- unique(wald_ratios$method)
n_wald <- length(wald_methods)
if (n_wald > 0) {
  wald_colors <- hue_pal()(n_wald)
  names(wald_colors) <- wald_methods
} else {
  wald_colors <- NULL
}

color_list <- c(base_colors, wald_colors)
available_methods <- unique(mr_results$method)
color_list <- color_list[names(color_list) %in% available_methods]

forest_plot <- ggplot(wald_ratios, aes(x = wald_beta, y = method, color = method)) +
  geom_point(size = 3) +
  geom_errorbarh(aes(xmin = wald_beta - 1.96 * wald_se, xmax = wald_beta + 1.96 * wald_se), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
  labs(
    title = "Mendelian Randomization Estimates:\n N-acetylglutamine (1MB around ACY1 TSS) on Jurgens BMI",
    x = "Causal Effect (Beta)",
    y = ""
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.text.y = element_text(size = 12, face = "bold"),
    axis.text.x = element_text(size = 10),
    axis.title.x = element_text(size = 12),
    panel.grid.major = element_line(color = "grey90"),
    panel.grid.minor = element_blank(),
    legend.position = "none",
    panel.background = element_rect(fill = "white", color = NA),
    plot.background = element_rect(fill = "white", color = NA)
  ) +
  scale_color_manual(values = color_list) +
  xlim(-0.15, 0.05)

# Step 6: Save Results and Plot
today <- Sys.Date()
results_dir <- "/Users/charleenadams/mr_nag_bmi/results_jurgens_p5E6/"
excel_file <- paste0(results_dir, "MR_Results_ACY1_Jurgens_BMI_", today, ".xlsx")
plot_file <- paste0(results_dir, "MR_Forestplot_ACY1_Jurgens_BMI_", today, ".png")

# Save Wald ratios to CSV
write.csv(wald_ratios, file = paste0(results_dir, "wald_ratios_", today, ".csv"), row.names = FALSE)

# Save forest plot as PNG
ggsave(plot_file, plot = forest_plot, width = 10, height = 8, dpi = 300)

cat("Results and plot saved successfully!")
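The Wald ratio step in the script above can be checked by hand. The sketch below mirrors the same formulas (ratio estimate, delta-method standard error with the second-order term, two-sided normal p-value) in Python; wald_ratio is a hypothetical helper and the input values are made up, not results from this analysis.

```python
import math


def wald_ratio(beta_exp, se_exp, beta_out, se_out):
    """Wald ratio estimate with a delta-method standard error and two-sided p-value."""
    estimate = beta_out / beta_exp
    se = math.sqrt(
        se_out ** 2 / beta_exp ** 2
        + (beta_out ** 2 * se_exp ** 2) / beta_exp ** 4
    )
    z = estimate / se
    # Two-sided normal p-value: 2 * P(Z > |z|) = erfc(|z| / sqrt(2))
    pval = math.erfc(abs(z) / math.sqrt(2))
    return estimate, se, pval


# Hypothetical SNP effects (illustration only)
est, se, p = wald_ratio(beta_exp=0.5, se_exp=0.1, beta_out=0.1, se_out=0.05)
print(f"estimate={est:.3f}, se={se:.4f}, p={p:.3f}")
```

The second term under the square root accounts for uncertainty in the SNP-exposure effect; dropping it gives the simpler first-order standard error se_out / |beta_exp|.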

6 MR with COJO Instruments

Only one instrument: rs150416778

Performed with a relaxed p-value threshold (5E-6).

rm(list = ls())

# Load Required Libraries
library(TwoSampleMR)          # Core MR analyses
library(MendelianRandomization) # MR-Lasso and ConMix
library(ieugwasr)             # LD clumping
library(MRInstruments)        # Proxy SNP lookup
library(dplyr)                # Data manipulation
library(tidyr)                # Data reshaping
library(data.table)           # Fast data handling
library(readxl)               # Read Excel
library(openxlsx)             # Write formatted Excel
library(ggplot2)              # Plotting
library(ggrepel)              # Text labeling in plots
library(corrplot)             # Correlation matrices
library(RhpcBLASctl)          # Multithreading control
library(biomaRt)              # Gene annotation
library(scales)               # Color generation
library(pheatmap)             # Purrty heatmap

# Set Working Directory & Setup Folder
setwd("/Users/charleenadams/temp_BI/mr_nag_ACY1_bmi")
if (!dir.exists("cojo")) dir.create("cojo", recursive = TRUE)

# ---------------------------------------------
# Load and Format NAG Data
# ---------------------------------------------

nag <- fread("/Users/charleenadams/mr_nag_bmi/phenocode-C100001253.tsv") %>% as.data.frame()
filtered_nag_df <- nag %>% 
  filter(!is.na(rsids) & rsids != "" & grepl("^rs", rsids)) %>%
  arrange(rsids)

exp_dat <- filtered_nag_df %>%
  mutate(
    chr.exposure = chrom,
    pos.exposure = pos,   
    beta.exposure = beta,
    se.exposure = sebeta,
    exposure = "N-acetylglutamine",
    id.exposure = "N-acetylglutamine",
    pval.exposure = pval,
    SNP.exposure = rsids,
    SNP = rsids,
    effect_allele.exposure = alt,
    other_allele.exposure = ref,
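    eaf.exposure = maf,  # caution: maf is the minor allele frequency; it equals the alt (effect) allele frequency only when alt is the minor allele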
    samplesize.exposure = 5830,
    id_col = nearest_genes
  )

# ---------------------------------------------
# Load and Format BMI Data
# ---------------------------------------------

bmi <- fread("/Users/charleenadams/mr_nag_bmi/merged_gwas_data_grch38_twosample.tsv")
filtered_bmi_df <- bmi %>%
  filter(!is.na(SNP) & SNP != "" & grepl("^rs", SNP)) %>%
  arrange(SNP)

out_dat <- filtered_bmi_df %>%
  mutate(
    SNP = SNP,
    SNP.outcome = SNP,
    chr.outcome = CHR_x,
    pos.outcome = BP_GRCh38,
    beta.outcome = beta,
    se.outcome = se,
    pval.outcome = pval,
    effect_allele.outcome = effect_allele,
    other_allele.outcome = other_allele,
    eaf.outcome = eaf,
    samplesize.outcome = 460000,
    id.outcome = "BMI",
    outcome = "BMI"
  )

# ---------------------------------------------
# Harmonize Data
# ---------------------------------------------

dat <- harmonise_data(exposure_dat = exp_dat, outcome_dat = out_dat)
today <- Sys.Date()
write.csv(dat, paste0("/Users/charleenadams/mr_nag_bmi/harmonized_nag_Jurgens_BMI_dat_", today, ".csv"), row.names = FALSE)

# ---------------------------------------------
# Select 1MB around ACY1 region
# ---------------------------------------------

dat <- fread("/Users/charleenadams/mr_nag_bmi/harmonized_nag_Jurgens_BMI_dat_2025-04-02.csv")
# "/Users/charleenadams/mr_ukbppp_chd/mart_export_clean.txt" #build 38
# Load the necessary data (local_annotation)
local_annotation <- read.table("/Users/charleenadams/mr_ukbppp_chd/mart_export_clean.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Filter for the gene ACY1
acy1_data <- local_annotation[local_annotation$Gene.name == "ACY1", ]

# Check the data for ACY1
head(acy1_data)

# Extract the TSS (Transcription Start Site)
# The TSS is taken as the smallest transcript start in the `Transcript.start..bp.` column.
# This holds for a plus-strand gene; for a minus-strand gene the TSS is the transcript end.
acy1_tss <- min(acy1_data$Transcript.start..bp.)

# Define the 500kb window up and down from the TSS
upstream_window <- acy1_tss - 500000
downstream_window <- acy1_tss + 500000

# Display the results
cat("TSS for ACY1:", acy1_tss, "\n")
cat("500kb upstream:", upstream_window, "\n")
cat("500kb downstream:", downstream_window, "\n")

ACY1_chrom <- acy1_data$Chromosome.scaffold.name[1]

# Define the filtered dataset
dat_filtered <- dat %>%
  filter(chr.outcome == ACY1_chrom & pos.outcome >= upstream_window & pos.outcome <= downstream_window)

# Create the filename with the current system date
file_name <- paste0("/Users/charleenadams/mr_nag_bmi/filtered_ACY1_1Mb_Jurgens_BMI_", Sys.Date(), ".csv")

# Save the filtered data to a CSV file with the system date in the filename
write.csv(dat_filtered, file = file_name, row.names = FALSE)

# ---------------------------------------------
# MR with COJO Analysis
# ---------------------------------------------

# Step 1: Filter Instruments
dat_filtered <- read.csv("/Users/charleenadams/mr_nag_bmi/filtered_ACY1_1Mb_Jurgens_BMI_2025-04-02.csv")
instruments <- dat_filtered %>%
  filter(mr_keep == TRUE, pval.exposure < 5e-6) %>%  # Relaxed threshold
  mutate(
    rsid = SNP,
    pval = pval.exposure,
    id = "N-acetylglutamine (1MB around ACY1 TSS)",
    F_stat = (beta.exposure / se.exposure)^2  # Compute F-statistic
  ) %>%
  filter(F_stat > 10)  # Exclude weak instruments

cat("Number of instruments after F-stat filtering:", nrow(instruments), "\n")

# Step 1.5: Steiger Filtering
instruments <- steiger_filtering(instruments)
instruments <- instruments %>% filter(steiger_dir == TRUE)  # Keep only SNPs with correct direction
cat("Number of instruments after Steiger filtering:", nrow(instruments), "\n")
cat("Preview of Steiger-filtered instruments:\n")
print(head(instruments))

# Subset instruments to keep only relevant fields
instruments_subset <- instruments %>%
  dplyr::select(
    SNP, 
    chr.exposure,chr.outcome,
    pos.exposure,pos.outcome,
    effect_allele.exposure, other_allele.exposure,
    effect_allele.outcome, other_allele.outcome,
    beta.exposure, se.exposure, pval.exposure,
    beta.outcome, se.outcome, pval.outcome,
    eaf.exposure, eaf.outcome,
    id.exposure, exposure,
    id.outcome, outcome,
    samplesize.exposure, samplesize.outcome,
    mr_keep, action,
    F_stat,
    steiger_dir, steiger_pval
  )

head(instruments_subset)

# Step 2: Prepare COJO Input File: SNP A1 A2 freq b se p N 
cojo_input <- instruments_subset %>%
  dplyr::select(SNP, effect_allele.exposure, other_allele.exposure, 
                eaf.exposure,  # Add frequency
                beta.exposure, se.exposure, pval.exposure, 
                samplesize.exposure, chr.exposure, pos.exposure) %>%
  rename(A1 = effect_allele.exposure,
         A2 = other_allele.exposure,
         freq = eaf.exposure,  # Rename to freq
         b = beta.exposure,
         se = se.exposure,
         p = pval.exposure,
         N = samplesize.exposure,
         CHR = chr.exposure,
         BP = pos.exposure)

# Save COJO input file
fwrite(cojo_input, "/Users/charleenadams/mr_nag_bmi/cojo_input.txt", sep = "\t", quote = FALSE)

cojo_input <- fread("/Users/charleenadams/mr_nag_bmi/cojo_input.txt")

setwd("/Users/charleenadams/mr_nag_bmi/results_cojo")

# Step 3: Run COJO Analysis (requires GCTA installed)
# Note: This step assumes GCTA is installed and accessible from your terminal/command line
# Adjust the GCTA path if necessary or run this command manually in your terminal
system("gcta64 --bfile /Users/charleenadams/1000G_bfiles/EUR/EUR --cojo-file /Users/charleenadams/mr_nag_bmi/cojo_input.txt --cojo-slct --cojo-p 5e-6 --out cojo_output")

# Produces
# cojo_output.cma.cojo
# cojo_output.jma.cojo
# cojo_output.ldr.cojo
# cojo_output.log

# Read the ld matrix file into R
ld_matrix <- read.table("/Users/charleenadams/mr_nag_bmi/results_cojo/cojo_output.ldr.cojo", header = TRUE, row.names = 1)
isSymmetric(as.matrix(ld_matrix))
ld_matrix <- as.matrix(ld_matrix)

# # r2
# ld_matrix_r2 <- ld_matrix^2
# 
# # heatmap
# heatmap(as.matrix(ld_matrix), symm = TRUE, main = "LD Matrix Heatmap")
# 
# # Create the customized heatmap
# pheatmap(ld_matrix,
#          # Color scheme: dark blue to yellow gradient
#          color = colorRampPalette(c("#1f77b4", "white", "#ffcc00"))(100),
#          # No clustering since LD matrices are position-based
#          cluster_rows = FALSE,
#          cluster_cols = FALSE,
#          # Display correlation values in cells
#          display_numbers = TRUE,
#          number_color = "black",          # Number color
#          fontsize_number = 14,             # Font size for numbers
#          # Customize borders
#          border_color = "gray30",         # Thin gray borders around cells
#          # Legend customization
#          legend = TRUE,                   # Include legend (default)
#          legend_breaks = seq(-1, 1, 0.5), # Custom breaks for legend
#          legend_labels = c("-1", "-0.5", "0", "0.5", "1"), # Custom labels
#          # Labels and title
#          main = "LD Matrix Heatmap", # Title
#          fontsize = 14,                   # Title font size
#          fontsize_row = 12,                # Row label font size
#          fontsize_col = 12,                # Column label font size
#          angle_col = 45,                  # Rotate column labels for readability
#          # Output settings
#          filename = "/Users/charleenadams/mr_nag_bmi/custom_ld_heatmap.png", # Save file name
#          width = 10,                      # Width in inches
#          height = 10,                     # Height in inches
#          res = 600)                       # Resolution in DPI
# 
# # Read COJO results
# cojo_results <- fread("cojo_output.jma.cojo") %>%
#   dplyr::select(SNP, CHR = Chr, BP = bp, bJ, bJ_se, pJ)

6.1 Heterogeneity and Pleiotropy

Heterogeneity and pleiotropy could not be tested because only a single instrument was available.
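
Heterogeneity statistics such as Cochran's Q compare per-SNP Wald ratios against a pooled estimate, so they need at least two instruments. The base R sketch below illustrates that guard. `cochran_q` is a hypothetical helper (not a TwoSampleMR function), it uses first-order standard errors, and the numbers in the usage lines are made up for illustration only.

```r
# Hypothetical helper: Cochran's Q across per-SNP Wald ratios.
# Returns NULL when fewer than two instruments are available.
cochran_q <- function(beta_exposure, beta_outcome, se_outcome) {
  n_snps <- length(beta_exposure)
  if (n_snps < 2) {
    message("Fewer than two instruments: heterogeneity cannot be assessed.")
    return(NULL)
  }
  ratio <- beta_outcome / beta_exposure        # per-SNP Wald ratios
  ratio_se <- se_outcome / abs(beta_exposure)  # first-order standard errors
  w <- 1 / ratio_se^2                          # inverse-variance weights
  ivw <- sum(w * ratio) / sum(w)               # fixed-effect IVW estimate
  q <- sum(w * (ratio - ivw)^2)                # Cochran's Q
  data.frame(
    Q = q,
    Q_df = n_snps - 1,
    Q_pval = pchisq(q, df = n_snps - 1, lower.tail = FALSE)
  )
}

# Single-instrument case, as in this analysis (illustrative numbers only)
res_single <- cochran_q(0.5, 0.01, 0.005)

# Two-instrument case (illustrative numbers only)
res_two <- cochran_q(c(0.5, 0.4), c(0.010, 0.012), c(0.005, 0.006))
```

With one SNP the helper returns `NULL` and prints a message, which mirrors the situation in this report.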

7 FinnGen Replication Attempt

Only one instrument (rs150416778) was available, even with the relaxed p-value threshold (5e-6) and a 500 kb clumping window (clump_kb = 500).

rm(list = ls())

# Load Required Libraries

# Mendelian Randomization
library(TwoSampleMR)          # Core package for Mendelian Randomization (MR) analyses
library(MendelianRandomization) # Additional MR methods, including MR-Lasso
library(ieugwasr)             # For local LD clumping with ld_clump()
library(MRInstruments)        # For proxy SNP lookup (if needed)

# Data Wrangling & Handling
library(dplyr)                # Data manipulation (filtering, joining, summarizing)
library(tidyr)                # Data reshaping
library(data.table)           # Fast and efficient data handling
library(readxl)               # Reading Excel files
library(openxlsx)             # Writing and formatting Excel files efficiently

# Statistical & Visualization
library(ggplot2)              # For plotting
library(ggrepel)              # Improved text labeling in plots
library(corrplot)             # Visualization of correlation matrices
library(RhpcBLASctl)          # Control multithreading for efficiency in MR analyses
library(biomaRt)              # Querying Ensembl for gene annotation
library(scales)               # For dynamic color generation

# Set Working Directory & Setup Folder for Project
setwd("/Users/charleenadams/mr_nag_bmi/")
if (!dir.exists("results_finngen_p5E6")) dir.create("results_finngen_p5E6", recursive = TRUE)

# ---------------------------------------------
# Load and Format NAG Data
# ---------------------------------------------

nag <- fread("/Users/charleenadams/mr_nag_bmi/phenocode-C100001253.tsv") %>% as.data.frame()
filtered_nag_df <- nag %>% 
  filter(!is.na(rsids) & rsids != "" & grepl("^rs", rsids)) %>%
  arrange(rsids)

exp_dat <- filtered_nag_df %>%
  mutate(
    chr.exposure = chrom,
    pos.exposure = pos,   
    beta.exposure = beta,
    se.exposure = sebeta,
    exposure = "N-acetylglutamine",
    id.exposure = "N-acetylglutamine",
    pval.exposure = pval,
    SNP.exposure = rsids,
    SNP = rsids,
    effect_allele.exposure = alt,
    other_allele.exposure = ref,
    eaf.exposure = maf,  # NOTE: maf is the minor allele frequency and may not equal the alt (effect) allele frequency
    samplesize.exposure = 5830,
    id_col = nearest_genes
  )

# ---------------------------------------------
# Load and Format BMI Data
# ---------------------------------------------

bmi <- fread("/Users/charleenadams/temp_BI/mr_nat_pter_bmi/summary_stats_release_finngen_R12_BMI_IRN")
filtered_bmi_df <- bmi %>%
  filter(!is.na(rsids) & rsids != "" & grepl("^rs", rsids)) %>%
  arrange(rsids) %>%
  rename(CHR = `#chrom`)

out_dat <- filtered_bmi_df %>%
  mutate(
    SNP = rsids,
    SNP.outcome = rsids,
    chr.outcome = CHR,
    pos.outcome = pos,
    beta.outcome = beta,
    se.outcome = sebeta,
    pval.outcome = pval,
    effect_allele.outcome = alt,
    other_allele.outcome = ref,
    eaf.outcome = af_alt,
    samplesize.outcome = 500348,
    id.outcome = "FinnGen BMI",
    outcome = "FinnGen BMI",
    id_col = "nearest_genes"
  )

# ---------------------------------------------
# Harmonize Data
# ---------------------------------------------

dat <- harmonise_data(exposure_dat = exp_dat, outcome_dat = out_dat)
today <- Sys.Date()
write.csv(dat, paste0("/Users/charleenadams/mr_nag_bmi/harmonized_nag_finngen_BMI_dat_", today, ".csv"), row.names = FALSE)

# ---------------------------------------------
# Select 1MB around ACY1 region
# ---------------------------------------------

dat <- fread("/Users/charleenadams/mr_nag_bmi/harmonized_nag_finngen_BMI_dat_2025-04-02.csv")
# "/Users/charleenadams/mr_ukbppp_chd/mart_export_clean.txt" #build 38
# Load the necessary data (local_annotation)
local_annotation <- read.table("/Users/charleenadams/mr_ukbppp_chd/mart_export_clean.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Filter for the gene ACY1
acy1_data <- local_annotation[local_annotation$Gene.name == "ACY1", ]

# Check the data for ACY1
head(acy1_data)

# Extract the TSS (Transcription Start Site)
# TSS is given as the start site of the transcript, use the `Transcript.start..bp.` column
acy1_tss <- min(acy1_data$Transcript.start..bp.)

# Define the 500kb window up and down from the TSS
upstream_window <- acy1_tss - 500000
downstream_window <- acy1_tss + 500000

# Display the results
cat("TSS for ACY1:", acy1_tss, "\n")
cat("500kb upstream:", upstream_window, "\n")
cat("500kb downstream:", downstream_window, "\n")

ACY1_chrom <- acy1_data$Chromosome.scaffold.name[1]

# Define the filtered dataset
dat_filtered <- dat %>%
  filter(chr.outcome == ACY1_chrom & pos.outcome >= upstream_window & pos.outcome <= downstream_window)

# Create the filename with the current system date
file_name <- paste0("/Users/charleenadams/mr_nag_bmi/filtered_ACY1_1Mb_finngen_BMI_", Sys.Date(), ".csv")

# Save the filtered data to a CSV file with the system date in the filename
write.csv(dat_filtered, file = file_name, row.names = FALSE)

# ---------------------------------------------
# MR
# ---------------------------------------------

# Step 1: Filter Instruments
dat_filtered <- read.csv("/Users/charleenadams/mr_nag_bmi/filtered_ACY1_1Mb_finngen_BMI_2025-04-02.csv")
instruments <- dat_filtered %>%
  filter(mr_keep == TRUE, pval.exposure < 5e-6) %>%  # Relaxed threshold
  mutate(
    rsid = SNP,
    pval = pval.exposure,
    id = "N-acetylglutamine (1MB around ACY1 TSS)",
    F_stat = (beta.exposure / se.exposure)^2  # Compute F-statistic
  ) %>%
  filter(F_stat > 10)  # Exclude weak instruments

cat("Number of instruments after F-stat filtering:", nrow(instruments), "\n")

# Step 1.5: Steiger Filtering
instruments <- steiger_filtering(instruments)
instruments <- instruments %>% filter(steiger_dir == TRUE)  # Keep only SNPs with correct direction
cat("Number of instruments after Steiger filtering:", nrow(instruments), "\n")
cat("Preview of Steiger-filtered instruments:\n")
print(head(instruments))

# Subset instruments to keep only relevant fields
instruments_subset <- instruments %>%
  dplyr::select(
    SNP, 
    chr.exposure,chr.outcome,pos.exposure,pos.outcome,
    effect_allele.exposure, other_allele.exposure,
    effect_allele.outcome, other_allele.outcome,
    beta.exposure, se.exposure, pval.exposure,
    beta.outcome, se.outcome, pval.outcome,
    eaf.exposure, eaf.outcome,
    id.exposure, exposure,
    id.outcome, outcome,
    samplesize.exposure, samplesize.outcome,
    mr_keep, action,
    F_stat,
    steiger_dir, steiger_pval
  )
instruments_subset$rsid <- instruments_subset$SNP

# Step 2: Local LD Clumping
clumped <- ld_clump(
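  # select() above dropped the pval and id columns, so take them from the .exposure fields
  dplyr::tibble(rsid = instruments_subset$rsid, pval = instruments_subset$pval.exposure, id = instruments_subset$id.exposure),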
  plink_bin = genetics.binaRies::get_plink_binary(),
  bfile = "/Users/charleenadams/1000G_bfiles/EUR/EUR",
  clump_r2 = 0.001,
  clump_kb = 500  # Reduced window
)
clumped_dat <- instruments_subset %>% dplyr::filter(SNP %in% clumped$rsid)
cat("Local clumping completed. Number of SNPs retained:", nrow(clumped_dat), "\n")
cat("Preview of clumped data:\n")
print(head(clumped_dat))

# Subset clumped_dat to keep only relevant fields
clumped_dat_subset <- clumped_dat %>%
  dplyr::select(
    SNP, 
    effect_allele.exposure, other_allele.exposure,
    effect_allele.outcome, other_allele.outcome,
    beta.exposure, se.exposure, pval.exposure,
    beta.outcome, se.outcome, pval.outcome,
    eaf.exposure, eaf.outcome,
    id.exposure, exposure,
    id.outcome, outcome,
    samplesize.exposure, samplesize.outcome,
    mr_keep, action,
    F_stat,
    steiger_dir, steiger_pval
  )

# Step 3: Perform Wald Ratio Tests for Each Instrument
wald_ratios <- clumped_dat %>%
  mutate(
    wald_beta = beta.outcome / beta.exposure,
    wald_se = sqrt((se.outcome^2 / beta.exposure^2) + ((beta.outcome^2 * se.exposure^2) / (beta.exposure^4))),
    pval = 2 * pnorm(abs(wald_beta / wald_se), lower.tail = FALSE),
    method = paste("Wald Ratio:", SNP)
  ) %>%
  dplyr::select(SNP, wald_beta, wald_se, pval, method)

cat("\n=== Wald Ratio Tests for Each Instrument ===\n")
print(wald_ratios)

# Step 4: Bind Results
mr_results <- bind_rows(
  wald_ratios %>% dplyr::select(method, b = wald_beta, se = wald_se, pval)
) %>%
  mutate(method = as.factor(method)) %>%
  filter(!is.na(method) & method != "NA") %>%
  arrange(b) %>%
  mutate(method = factor(method, levels = unique(method)))

# Step 5: Create Forest Plot
base_colors <- c("wald_ratios" = "#33A02C")

wald_methods <- unique(wald_ratios$method)
n_wald <- length(wald_methods)
if (n_wald > 0) {
  wald_colors <- hue_pal()(n_wald)
  names(wald_colors) <- wald_methods
} else {
  wald_colors <- NULL
}

color_list <- c(base_colors, wald_colors)
available_methods <- unique(mr_results$method)
color_list <- color_list[names(color_list) %in% available_methods]

forest_plot <- ggplot(wald_ratios, aes(x = wald_beta, y = method, color = method)) +
  geom_point(size = 3) +
  geom_errorbarh(aes(xmin = wald_beta - 1.96 * wald_se, xmax = wald_beta + 1.96 * wald_se), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
  labs(
    title = "Mendelian Randomization Estimates:\n N-acetylglutamine (1MB around ACY1 TSS) on FinnGen BMI",
    x = "Causal Effect (Beta)",
    y = ""
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.text.y = element_text(size = 12, face = "bold"),
    axis.text.x = element_text(size = 10),
    axis.title.x = element_text(size = 12),
    panel.grid.major = element_line(color = "grey90"),
    panel.grid.minor = element_blank(),
    legend.position = "none",
    panel.background = element_rect(fill = "white", color = NA),
    plot.background = element_rect(fill = "white", color = NA)
  ) +
  scale_color_manual(values = color_list) +
  xlim(-0.15, 0.05)

# Step 6: Save Results and Plot
today <- Sys.Date()
results_dir <- "/Users/charleenadams/mr_nag_bmi/results_finngen_p5E6/"
excel_file <- paste0(results_dir, "MR_Results_ACY1_finngen_BMI_", today, ".xlsx")
plot_file <- paste0(results_dir, "MR_Forestplot_ACY1_finngen_BMI_", today, ".png")

# Save Wald ratios to CSV
write.csv(wald_ratios, file = paste0(results_dir, "wald_ratios_", today, ".csv"), row.names = FALSE)

# Save forest plot as PNG
ggsave(plot_file, plot = forest_plot, width = 10, height = 8, dpi = 300)

cat("Results and plot saved successfully!\n")
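
With a single instrument, the causal estimate reduces to the Wald ratio, with the standard error taken from the second-order delta-method expansion used in Step 3 above. The sketch below isolates that calculation in a small function. `wald_ratio` is a hypothetical helper name, and the input values are made up for illustration, not taken from the METSIM or FinnGen data.

```r
# Hypothetical helper: Wald ratio with a second-order delta-method SE,
# matching the formula used in Step 3.
wald_ratio <- function(beta_exposure, se_exposure, beta_outcome, se_outcome) {
  b <- beta_outcome / beta_exposure
  se <- sqrt(
    se_outcome^2 / beta_exposure^2 +
      beta_outcome^2 * se_exposure^2 / beta_exposure^4
  )
  pval <- 2 * pnorm(abs(b / se), lower.tail = FALSE)  # two-sided p-value
  data.frame(b = b, se = se, pval = pval)
}

# Illustrative inputs only
res <- wald_ratio(
  beta_exposure = 0.5, se_exposure = 0.05,
  beta_outcome = -0.02, se_outcome = 0.01
)
print(res)
```

The second term under the square root accounts for uncertainty in the SNP-exposure estimate. It shrinks toward zero as the instrument gets stronger, at which point the result approaches the first-order approximation `se_outcome / abs(beta_exposure)`.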

8 FinnGen Replication COJO

Only one instrument: rs150416778

rm(list = ls())

# Load Required Libraries

# Mendelian Randomization
library(TwoSampleMR)          # Core package for Mendelian Randomization (MR) analyses
library(MendelianRandomization) # Additional MR methods, including MR-Lasso
library(ieugwasr)             # For local LD clumping with ld_clump()
library(MRInstruments)        # For proxy SNP lookup (if needed)

# Data Wrangling & Handling
library(dplyr)                # Data manipulation (filtering, joining, summarizing)
library(tidyr)                # Data reshaping
library(data.table)           # Fast and efficient data handling
library(readxl)               # Reading Excel files
library(openxlsx)             # Writing and formatting Excel files efficiently

# Statistical & Visualization
library(ggplot2)              # For plotting
library(ggrepel)              # Improved text labeling in plots
library(corrplot)             # Visualization of correlation matrices
library(RhpcBLASctl)          # Control multithreading for efficiency in MR analyses
library(biomaRt)              # Querying Ensembl for gene annotation
library(scales)               # For dynamic color generation

# Set Working Directory & Setup Folder for Project
setwd("/Users/charleenadams/mr_nag_bmi/")
if (!dir.exists("results_finngen_cojo")) dir.create("results_finngen_cojo", recursive = TRUE)  # Folder used by setwd() before the COJO step

# ---------------------------------------------
# Load and Format NAG Data
# ---------------------------------------------

nag <- fread("/Users/charleenadams/mr_nag_bmi/phenocode-C100001253.tsv") %>% as.data.frame()
filtered_nag_df <- nag %>% 
  filter(!is.na(rsids) & rsids != "" & grepl("^rs", rsids)) %>%
  arrange(rsids)

exp_dat <- filtered_nag_df %>%
  mutate(
    chr.exposure = chrom,
    pos.exposure = pos,   
    beta.exposure = beta,
    se.exposure = sebeta,
    exposure = "N-acetylglutamine",
    id.exposure = "N-acetylglutamine",
    pval.exposure = pval,
    SNP.exposure = rsids,
    SNP = rsids,
    effect_allele.exposure = alt,
    other_allele.exposure = ref,
    eaf.exposure = maf,  # NOTE: maf is the minor allele frequency and may not equal the alt (effect) allele frequency
    samplesize.exposure = 5830,
    id_col = nearest_genes
  )

# ---------------------------------------------
# Load and Format BMI Data
# ---------------------------------------------

bmi <- fread("/Users/charleenadams/temp_BI/mr_nat_pter_bmi/summary_stats_release_finngen_R12_BMI_IRN")
filtered_bmi_df <- bmi %>%
  filter(!is.na(rsids) & rsids != "" & grepl("^rs", rsids)) %>%
  arrange(rsids) %>%
  rename(CHR = `#chrom`)

out_dat <- filtered_bmi_df %>%
  mutate(
    SNP = rsids,
    SNP.outcome = rsids,
    chr.outcome = CHR,
    pos.outcome = pos,
    beta.outcome = beta,
    se.outcome = sebeta,
    pval.outcome = pval,
    effect_allele.outcome = alt,
    other_allele.outcome = ref,
    eaf.outcome = af_alt,
    samplesize.outcome = 500348,
    id.outcome = "FinnGen BMI",
    outcome = "FinnGen BMI",
    id_col = "nearest_genes"
  )

# ---------------------------------------------
# Harmonize Data
# ---------------------------------------------

dat <- harmonise_data(exposure_dat = exp_dat, outcome_dat = out_dat)
today <- Sys.Date()
write.csv(dat, paste0("/Users/charleenadams/mr_nag_bmi/harmonized_nag_finngen_BMI_dat_", today, ".csv"), row.names = FALSE)

# ---------------------------------------------
# Select 1MB around ACY1 region
# ---------------------------------------------

dat <- fread("/Users/charleenadams/mr_nag_bmi/harmonized_nag_finngen_BMI_dat_2025-04-02.csv")
# "/Users/charleenadams/mr_ukbppp_chd/mart_export_clean.txt" #build 38
# Load the necessary data (local_annotation)
local_annotation <- read.table("/Users/charleenadams/mr_ukbppp_chd/mart_export_clean.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Filter for the gene ACY1
acy1_data <- local_annotation[local_annotation$Gene.name == "ACY1", ]

# Check the data for ACY1
head(acy1_data)

# Extract the TSS (Transcription Start Site)
# TSS is given as the start site of the transcript, use the `Transcript.start..bp.` column
acy1_tss <- min(acy1_data$Transcript.start..bp.)

# Define the 500kb window up and down from the TSS
upstream_window <- acy1_tss - 500000
downstream_window <- acy1_tss + 500000

# Display the results
cat("TSS for ACY1:", acy1_tss, "\n")
cat("500kb upstream:", upstream_window, "\n")
cat("500kb downstream:", downstream_window, "\n")

ACY1_chrom <- acy1_data$Chromosome.scaffold.name[1]

# Define the filtered dataset
dat_filtered <- dat %>%
  filter(chr.outcome == ACY1_chrom & pos.outcome >= upstream_window & pos.outcome <= downstream_window)

# Create the filename with the current system date
file_name <- paste0("/Users/charleenadams/mr_nag_bmi/filtered_ACY1_1Mb_finngen_BMI_", Sys.Date(), ".csv")

# Save the filtered data to a CSV file with the system date in the filename
write.csv(dat_filtered, file = file_name, row.names = FALSE)

# ---------------------------------------------
# MR
# ---------------------------------------------

# Step 1: Filter Instruments
dat_filtered <- read.csv("/Users/charleenadams/mr_nag_bmi/filtered_ACY1_1Mb_finngen_BMI_2025-04-02.csv")
instruments <- dat_filtered %>%
  filter(mr_keep == TRUE, pval.exposure < 5e-6) %>%  # Relaxed threshold
  mutate(
    rsid = SNP,
    pval = pval.exposure,
    id = "N-acetylglutamine (1MB around ACY1 TSS)",
    F_stat = (beta.exposure / se.exposure)^2  # Compute F-statistic
  ) %>%
  filter(F_stat > 10)  # Exclude weak instruments

cat("Number of instruments after F-stat filtering:", nrow(instruments), "\n")

# Step 1.5: Steiger Filtering
instruments <- steiger_filtering(instruments)
instruments <- instruments %>% filter(steiger_dir == TRUE)  # Keep only SNPs with correct direction
cat("Number of instruments after Steiger filtering:", nrow(instruments), "\n")
cat("Preview of Steiger-filtered instruments:\n")
print(head(instruments))

# Subset instruments to keep only relevant fields
instruments_subset <- instruments %>%
  dplyr::select(
    SNP, 
    chr.exposure,chr.outcome,pos.exposure,pos.outcome,
    effect_allele.exposure, other_allele.exposure,
    effect_allele.outcome, other_allele.outcome,
    beta.exposure, se.exposure, pval.exposure,
    beta.outcome, se.outcome, pval.outcome,
    eaf.exposure, eaf.outcome,
    id.exposure, exposure,
    id.outcome, outcome,
    samplesize.exposure, samplesize.outcome,
    mr_keep, action,
    F_stat,
    steiger_dir, steiger_pval
  )
instruments_subset$rsid <- instruments_subset$SNP

# Step 2: Prepare COJO Input File: SNP A1 A2 freq b se p N 
cojo_input <- instruments_subset %>%
  dplyr::select(SNP, effect_allele.exposure, other_allele.exposure, 
                eaf.exposure,  # Add frequency
                beta.exposure, se.exposure, pval.exposure, 
                samplesize.exposure, chr.exposure, pos.exposure) %>%
  rename(A1 = effect_allele.exposure,
         A2 = other_allele.exposure,
         freq = eaf.exposure,  # Rename to freq
         b = beta.exposure,
         se = se.exposure,
         p = pval.exposure,
         N = samplesize.exposure,
         CHR = chr.exposure,
         BP = pos.exposure)

# Save COJO input file
fwrite(cojo_input, "/Users/charleenadams/mr_nag_bmi/cojo_finngen_input.txt", sep = "\t", quote = FALSE)

cojo_input <- fread("/Users/charleenadams/mr_nag_bmi/cojo_finngen_input.txt")

setwd("/Users/charleenadams/mr_nag_bmi/results_finngen_cojo")

# Step 3: Run COJO Analysis (requires GCTA installed)
# Note: This step assumes GCTA is installed and accessible from your terminal/command line
# Adjust the GCTA path if necessary or run this command manually in your terminal
system("gcta64 --bfile /Users/charleenadams/1000G_bfiles/EUR/EUR --cojo-file /Users/charleenadams/mr_nag_bmi/cojo_finngen_input.txt --cojo-slct --cojo-p 5e-6 --out cojo_finngen_output")

# Read the ld matrix file into R
ld_matrix <- read.table("/Users/charleenadams/mr_nag_bmi/results_finngen_cojo/cojo_finngen_output.ldr.cojo", header = TRUE, row.names = 1)
isSymmetric(as.matrix(ld_matrix))
ld_matrix <- as.matrix(ld_matrix)

# # r2
# ld_matrix_r2 <- ld_matrix^2
# 
# # heatmap
# heatmap(as.matrix(ld_matrix), symm = TRUE, main = "LD Matrix Heatmap")
# 
# # Create the customized heatmap
# pheatmap(ld_matrix,
#          # Color scheme: dark blue to yellow gradient
#          color = colorRampPalette(c("#1f77b4", "white", "#ffcc00"))(100),
#          # No clustering since LD matrices are position-based
#          cluster_rows = FALSE,
#          cluster_cols = FALSE,
#          # Display correlation values in cells
#          display_numbers = TRUE,
#          number_color = "black",          # Number color
#          fontsize_number = 14,             # Font size for numbers
#          # Customize borders
#          border_color = "gray30",         # Thin gray borders around cells
#          # Legend customization
#          legend = TRUE,                   # Include legend (default)
#          legend_breaks = seq(-1, 1, 0.5), # Custom breaks for legend
#          legend_labels = c("-1", "-0.5", "0", "0.5", "1"), # Custom labels
#          # Labels and title
#          main = "LD Matrix Heatmap", # Title
#          fontsize = 14,                   # Title font size
#          fontsize_row = 12,                # Row label font size
#          fontsize_col = 12,                # Column label font size
#          angle_col = 45,                  # Rotate column labels for readability
#          # Output settings
#          filename = "/Users/charleenadams/mr_nag_bmi/custom_ld_heatmap.png", # Save file name
#          width = 10,                      # Width in inches
#          height = 10,                     # Height in inches
#          res = 600)                       # Resolution in DPI
# 
# # Read COJO results
# cojo_results <- fread("cojo_finngen_output.jma.cojo") %>%
#   dplyr::select(SNP, CHR = Chr, BP = bp, bJ, bJ_se, pJ)

9 Main Findings

COJO did not yield additional instruments for either the Jurgens or FinnGen analyses, leaving us with a single-instrument Wald ratio analysis for each.

Jurgens: rs150416778 (beta = 0.002, SE = 0.009, p = 0.796)

FinnGen: rs150416778 (beta = -0.037, SE = 0.017, p = 0.033)
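
To put the two Wald ratio estimates on a common footing, approximate 95% confidence intervals can be derived from the reported betas and standard errors. The sketch below does this under a normal approximation. Because it starts from the rounded values printed above, the intervals are approximate, and the data frame and its column names are illustrative.

```r
# Reported Wald ratio estimates for rs150416778 (rounded values from above)
reported <- data.frame(
  outcome = c("Jurgens (UKB) BMI", "FinnGen BMI"),
  beta = c(0.002, -0.037),
  se = c(0.009, 0.017)
)

# Normal-approximation 95% confidence intervals
z <- qnorm(0.975)
reported$ci_lower <- reported$beta - z * reported$se
reported$ci_upper <- reported$beta + z * reported$se

# Flag intervals that exclude the null value of zero
reported$excludes_zero <- reported$ci_lower > 0 | reported$ci_upper < 0

print(reported)
```

Under these assumptions the Jurgens interval spans zero, while the FinnGen interval lies entirely below zero, consistent with the reported p-values.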

Mendelian randomization of N-Acetylglutamine (1MB around ACY1 TSS) and body mass index

MR of NAG on BMI

שׁוֹשַׁנָּה🌹

April 08, 2025
