1 Methods

1.1 Study Overview

We conducted a two-sample Mendelian Randomization (MR) analysis to evaluate the causal effect of N-acetyltaurine (NAT) levels, within a 1 Mb region surrounding the transcription start site (TSS) of the PTER gene, on body mass index (BMI). Genetic instruments for NAT were derived from the METSIM cohort, while BMI summary statistics were obtained from the Jurgens et al. (2022) UK Biobank (UKB) GWAS. All analyses were performed using R (version 4.x) and Python (version 3.x), leveraging multiple statistical packages and bioinformatics tools to ensure robust and reproducible results.

1.2 Data Sources

1.2.1 Exposure Data: N-Acetyltaurine (NAT)

Summary statistics for NAT were sourced from the METSIM (Metabolic Syndrome in Men) study, a population-based cohort of Finnish men. The NAT data, available in GRCh38 coordinates, were downloaded from the PheWeb repository (https://pheweb.org/metsim-metab/download/C100005466) as a compressed file (C100005466). This file was decompressed using bgzip -c -d to yield C100005466.uncompressed, containing association statistics for NAT across 6,099 individuals. The dataset includes:

  • Chromosome (chrom)
  • Position (pos)
  • Reference (ref) and alternate (alt) alleles
  • Effect sizes (beta)
  • Standard errors (sebeta)
  • P-values (pval)
  • Minor allele frequencies (maf)
  • rsIDs (rsids)

1.2.2 Outcome Data: Body Mass Index (BMI)

BMI summary statistics were obtained from the Jurgens et al. (2022) GWAS of European ancestry individuals in the UK Biobank (https://personal.broadinstitute.org/ryank/Jurgens_Pirruccello_2022_GWAS_Sumstats.zip). The file GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.tsv provides association results for 460,000 participants, originally reported in GRCh37 (hg19) coordinates. Key columns include:

  • Chromosome (CHR)
  • Base pair position (BP)
  • SNP identifier (SNP)
  • Effect allele (ALLELE1)
  • Other allele (ALLELE0)
  • Effect allele frequency (A1FREQ)
  • Beta coefficient (BETA)
  • Standard error (SE)
  • P-value (P)

1.3 Coordinate Conversion for BMI Data

To align the BMI data with the NAT data (GRCh38), we converted the GRCh37 coordinates to GRCh38 using a three-step bioinformatics pipeline implemented in Python and Unix:

  1. Conversion to BED Format: A Python script (make_bed.py) transformed the BMI TSV file into a BED format, extracting chromosome, start (BP - 1 for 0-based coordinates), end (BP), and SNP ID. The output was saved as GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.bed.

  2. LiftOver: The UCSC liftOver tool was applied using the chain file hg19ToHg38.over.chain.gz to map GRCh37 coordinates to GRCh38. Successfully mapped SNPs were written to GWAS_sumstats_EUR__invnorm_bmi__TOTALsample_hg38.bed, with unmapped SNPs logged in unmapped_SNPs.bed.

  3. Merging with Original Data: A second Python script (format_bmi.py) processed the lifted BED file, adjusting coordinates to 1-based notation and merging them with the original TSV data using the SNP column. The merged dataset was sorted by SNP, processed in chunks (500,000 rows) to manage memory, and renamed to conform to the TwoSampleMR R package requirements: effect_allele (ALLELE1), other_allele (ALLELE0), eaf (A1FREQ), beta (BETA), se (SE), pval (P), and BP_GRCh38 (new position). The final output was saved as merged_gwas_data_grch38_twosample.tsv.

1.4 Data Cleaning and Harmonization

Both datasets underwent cleaning and harmonization in R using the data.table, dplyr, and tidyr packages:

  • NAT Data: Rows with missing or invalid rsIDs were filtered out (!is.na(rsids) & rsids != "" & grepl("^rs", rsids)), and the data were formatted for MR analysis with columns including chr.exposure, pos.exposure, beta.exposure, se.exposure, pval.exposure, effect_allele.exposure, other_allele.exposure, eaf.exposure, samplesize.exposure (6,099), and SNP.

  • BMI Data: Similar filtering removed invalid SNPs, and the data were formatted with chr.outcome, pos.outcome, beta.outcome, se.outcome, pval.outcome, effect_allele.outcome, other_allele.outcome, eaf.outcome, samplesize.outcome (460,000), and SNP.

  • Harmonization: The harmonise_data function from the TwoSampleMR package aligned NAT (exposure) and BMI (outcome) data by matching alleles and removing palindromic SNPs with ambiguous effects, producing a harmonized dataset saved as harmonized_nat_Jurgens_BMI_dat_[date].csv.

1.5 Defining the PTER Region

To focus on the PTER locus, we used the biomaRt R package to query the Ensembl database (GRCh38, hsapiens_gene_ensembl dataset) for the PTER gene’s TSS. Attributes retrieved included:

  • ensembl_gene_id
  • external_gene_name
  • chromosome_name
  • transcription_start_site
  • strand

The most upstream TSS was selected (min(transcription_start_site)), and a 1 Mb window (±500 kb) was defined around this position on chromosome 10. The harmonized dataset was filtered to retain only SNPs within this region (chr.outcome == pter_chrom & pos.outcome >= tss_start & pos.outcome <= tss_end), saved as filtered_PTER_1Mb_Jurgens_BMI_2025-02-19.csv.

1.6 Mendelian Randomization Analyses

We implemented two primary MR approaches to assess the causal effect of NAT on BMI within the PTER region, using R packages TwoSampleMR, MendelianRandomization, ieugwasr, and MRInstruments, with visualization via ggplot2.

1.6.1 Suite of MR Approaches

  • Instrument Selection: From the filtered dataset, genetic instruments were selected with a relaxed p-value threshold of 5 × 10⁻⁶ (versus the conventional 5 × 10⁻⁸) to maximize power within the PTER TSS region, justified by prior evidence of NAT association. Instruments were required to have an F-statistic > 10 to ensure strength, calculated as \[ F = \left( \frac{\beta_{exposure}}{SE_{exposure}} \right)^2 \]. Steiger filtering (steiger_filtering) retained only SNPs where the exposure effect directionally preceded the outcome.

  • Linkage Disequilibrium (LD) Clumping: Independent instruments were identified using ld_clump with a clumping window of 500 kb (versus the default 10,000 kb) and an r^2 threshold of 0.001, referencing the 1000 Genomes European (EUR) population (bfile = EUR).

  • MR Methods: Five MR methods were applied:

    • Inverse Variance Weighted (IVW): Assumes no horizontal pleiotropy (mr_ivw).
    • MR-Egger: Adjusts for directional pleiotropy (mr_egger_regression).
    • Weighted Median: Robust to invalid instruments if <50% are pleiotropic (mr_weighted_median).
    • MR-Lasso: Identifies and adjusts for pleiotropic outliers (mr_lasso).
    • Contamination Mixture (ConMix): Models mixture distributions to account for invalid instruments (mr_conmix).
  • Sensitivity Analyses: Heterogeneity was assessed with mr_heterogeneity, pleiotropy with mr_pleiotropy_test, and leave-one-out analysis with mr_leaveoneout. Wald ratio tests were computed for each instrument individually.

  • Output: Results were saved in an Excel file (MR_Results_PTER_Jurgens_BMI_[date].xlsx) with sheets for each method, heterogeneity, pleiotropy, and instrument details, alongside a forest plot (MR_Forestplot_PTER_Jurgens_BMI_[date].png) generated using ggplot2.

1.6.2 MR with COJO Instruments

  • Instrument Selection: Identical to the Suite approach, instruments were filtered at p < 5 × 10⁻⁶ and F > 10, followed by Steiger filtering.

  • Conditional Joint Analysis (COJO): The GCTA software (version 1.9x) was used to perform COJO analysis (gcta64 --cojo-slct --cojo-p 5e-6) on the filtered instruments, formatted as a tab-delimited file (cojo_input.txt) with SNP, alleles (A1, A2), frequency, beta, SE, p-value, and sample size. LD was estimated using the 1000 Genomes EUR reference panel. Conditionally independent SNPs were extracted from cojo_output.jma.cojo and merged back into the instrument set, updating beta, SE, and p-values.

  • MR Methods: The same five MR methods (IVW, MR-Egger, Weighted Median, MR-Lasso, ConMix) were applied to the COJO-selected instruments, with identical sensitivity analyses.

  • Output: Results were saved in MR_Results_PTER_Jurgens_BMI_COJO_[date].xlsx and visualized in MR_Forestplot_PTER_Jurgens_BMI_COJO_[date].png.

1.6.3 Alternative Approach

An additional MR analysis with a stricter p-value threshold (5 × 10⁻⁸) and a 10,000 kb clumping window was explored but not prioritized, as the relaxed threshold and smaller window better suited the PTER-specific hypothesis.

1.7 Statistical Software and Visualization

  • Languages: R (version 4.x) and Python (version 3.x).
  • R Packages: TwoSampleMR, MendelianRandomization, ieugwasr, MRInstruments, data.table, dplyr, tidyr, readxl, openxlsx, ggplot2, ggrepel, corrplot, RhpcBLASctl, biomaRt, scales.
  • Python Libraries: pandas, csv.
  • Other Tools: liftOver (UCSC), GCTA (version 1.9x), PLINK (via genetics.binaRies).
  • Visualization: Forest plots were created with ggplot2, using distinct colors for each MR method and error bars representing 95% confidence intervals, saved at 600 DPI.

2 Getting the Data

3 Bioinformatics Pipeline Prep

BMI summary statistics were on GRCh37. I wanted the GRCh38 coordinates for other post-GWAS analyses.

This was a 3-step process:

  1. Converting the BMI summary statistics to BED format.
  2. Running liftOver to obtain GRCh38 coordinates
  3. Merging the liftedOver coordinates with the original TSV

4 Suite of MR Approaches

5 MR with COJO Instruments

6 Results (COJO-selected SNPs)

6.0.1 Main Finding

Across multiple MR methods, including Inverse Variance Weighted (IVW), Weighted Median, MR-Lasso, and MR-ConMix, there is consistent evidence of a negative causal association between NAT and BMI. This suggests that genetically higher levels of NAT are associated with lower BMI. Specifically, the IVW method yielded a beta coefficient of -0.016 (SE = 0.006, p = 0.012), indicating a statistically significant negative association. The Weighted Median method reinforced this finding with a beta coefficient of -0.021 (SE = 0.006, p = 0.001), and MR-ConMix further supported the association with a beta coefficient of -0.033 (p = 0.016). The MR-Lasso analysis also confirmed a beta estimate of -0.016 (SE = 0.006, p = 0.012).

Although multiple MR methods consistently suggest a negative association, MR-Egger did not detect a significant association (beta = -0.002, SE = 0.014, p = 0.886)

6.1 Heterogeneity and Pleiotropy

Heterogeneity tests revealed moderate variability among the SNPs. The IVW heterogeneity test yielded a Q-statistic of 13.3 (degrees of freedom = 6, p = 0.039), suggesting heterogeneity among the genetic instruments. However, no significant directional pleiotropy was detected by the MR-Egger intercept test (intercept = -0.004, SE = 0.004, p = 0.320). While this implies reasonably reliable results, the presence of heterogeneity warrants caution. MR-Lasso mitigates these concerns by applying L1 regularization, effectively shrinking the effects of potentially invalid instruments toward zero. MR-ConMix, on the other hand, addresses pleiotropy by modeling a mixture of valid and invalid instruments, using parameters like Psi (0.024) and Alpha (0.05) to enhance robustness. Together, these methods improve the reliability of causal inference under potential violations of MR assumptions.

Replicating this study in additional populations to confirm causality and enhance the robustness of the findings is warranted.

7 Extra Analysis with More-Stringent Criteria

8 FinnGen Replication Attempt (Spoiler: It’s NULL)

9 FinnGen Replication COJO (Spoiler: Also NULL)

10 Full and Downloadable Results

11 Code for Results App