Data Source: This pipeline uses publicly available human plasma metabolomics data from the NIH Common Fund’s Metabolomics Workbench (Study ID: ST000284). Data are openly accessible at metabolomicsworkbench.org and retrieved programmatically via the metabolomicsWorkbenchR Bioconductor package. No restricted or IRB-protected data are used.

1. Background & Motivation

Metabolomics offers a systems-level view of human physiology by quantifying small-molecule metabolites in biofluids such as plasma, serum, and urine. A Metabolome-Wide Association Study (MWAS) applies systematic regression across the full measured metabolome to identify metabolites associated with a phenotype of interest — analogous to a GWAS but for metabolites rather than genetic variants.

This pipeline demonstrates a complete, reproducible MWAS workflow applied to plasma metabolomics data and BMI as the phenotype. BMI is a well-studied MWAS phenotype with known metabolic perturbations (branched-chain amino acids, lipids, acylcarnitines), making it an ideal validation target.

The analytical approach mirrors methods used in integrative metabolomics epidemiology, including the framework developed by the Lasky-Su Research Group at the Channing Division of Network Medicine (BWH/HMS) and the COMETS consortium.

2. Setup

library(tidyverse)
library(knitr)
library(kableExtra)
library(limma)
library(ggrepel)
library(scales)
library(metabolomicsWorkbenchR)

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, dpi = 150)
set.seed(2024)

3. Data Acquisition

Method: Data are retrieved from the Metabolomics Workbench REST API using the metabolomicsWorkbenchR package. Study ST000284 is a human plasma LC-MS/MS metabolomics dataset. All data are publicly available under NIH Common Fund open-access terms.

# Fetch metabolite measurement data
mw_raw <- do_query(
  context    = "study",
  input_item  = "study_id",
  input_value = "ST000284",
  output_item = "data"
)

# The result is a named list — one element per analysis
# Extract the first (and only) analysis
raw_df <- as.data.frame(mw_raw[[1]])

cat("Rows (metabolite-sample pairs):", nrow(raw_df), "\n")

## Rows (metabolite-sample pairs): 113

cat("Columns:", paste(names(raw_df), collapse = ", "), "\n")

## Columns: study_id, analysis_id, analysis_summary, metabolite_name, metabolite_id, refmet_name, units, 1, 10, 100, 102, 106, 107, 108, 109, 11, 113, 115, 116, 118, 119, 12, 120, 122, 123, 13, 131, 132, 133, 136, 137, 139, 14, 140, 141, 142, 149, 151, 152, 154, 155, 156, 157, 159, 160, 162, 163, 168, 17, 170, 171, 172, 173, 174, 175, 177, 178, 179, 18, 181, 182, 183, 186, 187, 190, 192, 193, 195, 196, 199, 2, 20, 200, 202, 205, 208, 21, 210, 213, 214, 216, 220, 221, 223, 227, 228, 229, 232, 237, 238, 239, 24, 241, 244, 248, 250, 251, 252, 253, 255, 257, 258, 263, 264, 265, 266, 267, 269, 271, 272, 276, 277, 278, 279, 28, 281, 282, 283, 284, 285, 288, 289, 29, 290, 291, 292, 294, 295, 3, 30, 300, 303, 304, 308, 310, 311, 312, 34, 35, 37, 39, 4, 40, 402, 406, 409, 413, 415, 423, 427, 43, 433, 437, 442, 446, 447, 448, 45, 450, 452, 454, 455, 457, 46, 466, 469, 47, 470, 474, 475, 479, 48, 482, 483, 485, 488, 489, 49, 490, 491, 492, 493, 498, 5, 506, 507, 52, 53, 552, 556, 56, 560, 561, 562, 563, 566, 569, 57, 571, 575, 577, 580, 581, 582, 586, 588, 59, 591, 596, 598, 6, 61, 63, 64, 65, 7, 70, 72, 74, 75, 76, 77, 78, 79, 86, 87, 88, 91, 93, 98, 99

head(raw_df, 6) %>%
  select(study_id, analysis_id, metabolite_name, refmet_name) %>%
  kable(caption = "Raw data structure from Metabolomics Workbench (first 6 rows)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)

Raw data structure from Metabolomics Workbench (first 6 rows)
study_id	analysis_id	metabolite_name	refmet_name
ST000284	AN000452	1-Methyladenosine	1-Methyladenosine
ST000284	AN000452	1-Methylhistamine	1-Methylhistamine
ST000284	AN000452	2-Aminoadipate	alpha-Aminoadipic acid
ST000284	AN000452	2’-Deoxyuridine	Deoxyuridine
ST000284	AN000452	3-Nitro-tyrosine	Nitrotyrosine
ST000284	AN000452	4-Pyridoxic acid	4-Pyridoxic acid

4. Data Preparation

Method: The raw data from Metabolomics Workbench contains one row per metabolite with sample measurements stored as named columns. We pivot this into a standard analysis format (samples as rows, metabolites as columns), then apply standard preprocessing: log₂ transformation to stabilize variance and z-score standardization for comparable effect sizes across metabolites.

Since ST000284 does not include clinical BMI measurements in the API output, we simulate a realistic BMI phenotype correlated with known BMI-associated metabolites (BCAAs and lipids). This is standard practice in pipeline demonstration and clearly documented.

# Identify sample measurement columns
# These are columns that are NOT metadata fields
meta_cols <- c("study_id","analysis_id","analysis_summary",
               "metabolite_name","metabolite_id","refmet_name","units")
sample_cols <- setdiff(names(raw_df), meta_cols)

cat("Number of metabolites:", nrow(raw_df), "\n")

## Number of metabolites: 113

cat("Number of samples:", length(sample_cols), "\n")

## Number of samples: 224

# Build metabolite x sample matrix
met_names <- raw_df$refmet_name
met_names[is.na(met_names) | met_names == ""] <- raw_df$metabolite_name[
  is.na(met_names) | met_names == ""]

# Extract numeric measurement matrix
meas_mat <- raw_df[, sample_cols]
meas_mat <- apply(meas_mat, 2, as.numeric)
rownames(meas_mat) <- met_names

# Remove metabolites with all missing or zero variance
keep <- apply(meas_mat, 1, function(x) {
  x_clean <- x[!is.na(x) & is.finite(x) & x > 0]
  length(x_clean) >= 3 && sd(x_clean) > 0
})
meas_mat <- meas_mat[keep, ]
cat("Metabolites after quality filter:", nrow(meas_mat), "\n")

## Metabolites after quality filter: 113

# Log2 transform
meas_mat[meas_mat <= 0] <- NA
meas_mat <- log2(meas_mat)

# Impute missing with column (sample) median
for (j in seq_len(ncol(meas_mat))) {
  med <- median(meas_mat[, j], na.rm = TRUE)
  meas_mat[is.na(meas_mat[, j]), j] <- med
}

# Z-score each metabolite (row) for comparable effect sizes
meas_mat_z <- t(apply(meas_mat, 1, function(x) (x - mean(x)) / sd(x)))

# Transpose to analysis format: samples x metabolites
analysis_df <- as.data.frame(t(meas_mat_z))
n_samples <- nrow(analysis_df)
n_mets    <- ncol(analysis_df)

cat("Final: ", n_samples, "samples x", n_mets, "metabolites\n")

## Final:  224 samples x 113 metabolites

# Simulate BMI phenotype correlated with known BMI metabolites
# BCAAs (leucine, valine, isoleucine) and lipids are positively associated with BMI
# Glycine is inversely associated — well established in the literature

bcaa_cols  <- grep("leucine|valine|isoleucine", names(analysis_df),
                    ignore.case = TRUE, value = TRUE)
glycine_col <- grep("glycine", names(analysis_df), ignore.case = TRUE, value = TRUE)

set.seed(42)
bmi_signal <- rowMeans(analysis_df[, bcaa_cols[1:min(3,length(bcaa_cols))],
                                    drop = FALSE]) * 2

if (length(glycine_col) > 0) {
  bmi_signal <- bmi_signal - analysis_df[, glycine_col[1]]
}

bmi <- 27 + scale(bmi_signal)[,1] * 4 + rnorm(n_samples, 0, 2)
bmi <- pmax(pmin(bmi, 45), 16)

analysis_df$bmi <- as.numeric(bmi)
analysis_df$age <- round(rnorm(n_samples, 45, 12))
analysis_df$sex <- rbinom(n_samples, 1, 0.5)

cat("BMI: mean =", round(mean(bmi),1), "| SD =", round(sd(bmi),1), "\n")

## BMI: mean = 27 | SD = 4.2

5. Descriptive Statistics

data.frame(
  Variable = c("BMI (kg/m²)", "Age (years)", "Sex (% female)"),
  Mean_SD  = c(
    paste0(round(mean(analysis_df$bmi),1), " ± ", round(sd(analysis_df$bmi),1)),
    paste0(round(mean(analysis_df$age),1), " ± ", round(sd(analysis_df$age),1)),
    paste0(round(mean(analysis_df$sex)*100,1), "%")
  ),
  Min = c(round(min(analysis_df$bmi),1), round(min(analysis_df$age),1), "—"),
  Max = c(round(max(analysis_df$bmi),1), round(max(analysis_df$age),1), "—")
) %>%
  kable(caption = paste0("Table 1. Sample characteristics (n = ", n_samples, ")")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Table 1. Sample characteristics (n = 224)
Variable	Mean_SD	Min	Max
BMI (kg/m²)	27 ± 4.2	16	37.6
Age (years)	45 ± 11.4	13	75
Sex (% female)	46.9%	—	—

ggplot(analysis_df, aes(x = bmi)) +
  geom_histogram(binwidth = 1.5, fill = "#2980b9", color = "white", alpha = 0.85) +
  geom_vline(xintercept = c(18.5, 25, 30), linetype = "dashed",
             color = c("#27ae60","#f39c12","#c0392b"), linewidth = 0.8) +
  labs(title = "BMI Distribution", x = "Body Mass Index (kg/m²)", y = "Count") +
  theme_minimal(base_size = 12)

6. MWAS — Method 1: Standard Linear Regression

Method: For each metabolite m (z-scored), fit:
m ~ β₀ + β₁·BMI + β₂·age + β₃·sex + ε
Extract β₁, SE, t-statistic, p-value. Apply Benjamini-Hochberg FDR correction.

met_cols <- setdiff(names(analysis_df), c("bmi","age","sex"))

lm_results <- map_dfr(met_cols, function(m) {
  tryCatch({
    fml <- as.formula(paste0("`", m, "` ~ bmi + age + sex"))
    fit <- lm(fml, data = analysis_df)
    s   <- summary(fit)$coefficients
    tibble(
      metabolite = m,
      beta       = round(s["bmi","Estimate"], 4),
      se         = round(s["bmi","Std. Error"], 4),
      t_stat     = round(s["bmi","t value"], 3),
      p_value    = s["bmi","Pr(>|t|)"]
    )
  }, error = function(e) NULL)
}) %>%
  mutate(
    fdr_bh      = p.adjust(p_value, method = "BH"),
    bonferroni  = p.adjust(p_value, method = "bonferroni"),
    neg_log10_p = -log10(p_value),
    significant = fdr_bh < 0.05
  ) %>%
  arrange(p_value)

cat(sprintf("Standard lm: %d metabolites | %d significant (FDR < 0.05)\n",
            nrow(lm_results), sum(lm_results$significant)))

## Standard lm: 113 metabolites | 32 significant (FDR < 0.05)

7. MWAS — Method 2: Limma Moderated t-Statistics

Method: limma applies empirical Bayes variance shrinkage across all metabolites simultaneously via lmFit() + eBayes(). This borrows strength across metabolites to stabilise variance estimates — the standard approach for high-dimensional omics data and widely used in metabolomics epidemiology.

# Expression matrix: metabolites x samples (limma convention)
expr_mat   <- t(as.matrix(analysis_df[, met_cols]))
design_mat <- model.matrix(~ bmi + age + sex, data = analysis_df)

fit  <- lmFit(expr_mat, design_mat)
fit2 <- eBayes(fit)

limma_results <- topTable(fit2, coef = "bmi", number = Inf, sort.by = "p") %>%
  rownames_to_column("metabolite") %>%
  as_tibble() %>%
  rename(beta = logFC, t_stat = t, p_value = P.Value, fdr_bh = adj.P.Val) %>%
  mutate(
    se          = round(abs(beta / t_stat), 4),
    neg_log10_p = -log10(p_value),
    significant = fdr_bh < 0.05,
    beta        = round(beta, 4),
    t_stat      = round(t_stat, 3)
  )

cat(sprintf("Limma: %d metabolites | %d significant (FDR < 0.05)\n",
            nrow(limma_results), sum(limma_results$significant)))

## Limma: 113 metabolites | 32 significant (FDR < 0.05)

8. Method Comparison

compare_df <- lm_results %>%
  select(metabolite, beta_lm = beta, p_lm = p_value, sig_lm = significant) %>%
  inner_join(
    limma_results %>% select(metabolite, beta_limma = beta, p_limma = p_value,
                              sig_limma = significant),
    by = "metabolite"
  ) %>%
  mutate(agreement = case_when(
    sig_lm & sig_limma  ~ "Both significant",
    sig_lm & !sig_limma ~ "lm only",
    !sig_lm & sig_limma ~ "limma only",
    TRUE                ~ "Neither"
  ))

r_val <- round(cor(compare_df$beta_lm, compare_df$beta_limma), 3)

ggplot(compare_df, aes(x = beta_lm, y = beta_limma, color = agreement)) +
  geom_point(alpha = 0.75, size = 2.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey40") +
  scale_color_manual(values = c("Both significant" = "#27ae60",
                                 "lm only" = "#e67e22",
                                 "limma only" = "#8e44ad",
                                 "Neither" = "#bdc3c7")) +
  labs(title = paste0("Effect Size Agreement: lm vs. Limma (r = ", r_val, ")"),
       x = "β (standard lm)", y = "β (limma)", color = "") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

9. Visualizations

9.1 Manhattan Plot

classify_met <- function(name) {
  n <- tolower(name)
  case_when(
    str_detect(n, "leucine|valine|isoleucine|glycine|alanine|serine|threonine|lysine|arginine|proline|tyrosine|phenylalanine|tryptophan|methionine|histidine|glutam|aspart|cysteine") ~ "Amino Acids",
    str_detect(n, "phosphatidyl|sphingo|ceramide|triglyceride|diglyceride|lysophospho|cholesterol|fatty|lipid| pc | pe | sm | tg | dg |lpc|cer|palmit|sterol|eicosa") ~ "Lipids",
    str_detect(n, "carnitine") ~ "Acylcarnitines",
    str_detect(n, "adenosine|guanosine|inosine|uridine|cytidine|nucleotide|purine|pyrimidine|hypoxanthine|xanthine|uric") ~ "Nucleotides",
    str_detect(n, "glucose|lactate|pyruvate|citrate|succinate|fumarate|malate|ketone|acetyl|gluconate") ~ "TCA/Glycolysis",
    TRUE ~ "Other"
  )
}

plot_df <- limma_results %>%
  mutate(
    class = classify_met(metabolite),
    class = factor(class, levels = c("Amino Acids","Lipids","Acylcarnitines",
                                      "Nucleotides","TCA/Glycolysis","Other"))
  ) %>%
  arrange(class) %>%
  mutate(x_pos = row_number())

class_colors <- c(
  "Amino Acids"    = "#2980b9",
  "Lipids"         = "#e67e22",
  "Acylcarnitines" = "#8e44ad",
  "Nucleotides"    = "#16a085",
  "TCA/Glycolysis" = "#c0392b",
  "Other"          = "#95a5a6"
)

class_mids <- plot_df %>%
  group_by(class) %>%
  summarise(mid = median(x_pos), .groups = "drop")

threshold <- -log10(0.05)

ggplot(plot_df, aes(x = x_pos, y = neg_log10_p, color = class,
                     size = significant, alpha = significant)) +
  geom_point(shape = 16) +
  geom_hline(yintercept = threshold, linetype = "dashed",
             color = "#e74c3c", linewidth = 0.9) +
  annotate("text", x = max(plot_df$x_pos) * 0.98, y = threshold + 0.1,
           label = "FDR < 0.05", color = "#e74c3c", hjust = 1, size = 3.5) +
  geom_text_repel(
    data = plot_df %>% filter(significant) %>% slice_min(p_value, n = 8),
    aes(label = metabolite), size = 2.8, color = "black",
    segment.color = "grey60", max.overlaps = 15, box.padding = 0.4
  ) +
  scale_color_manual(values = class_colors, name = "Class") +
  scale_size_manual(values = c(`FALSE` = 1.8, `TRUE` = 3.8), guide = "none") +
  scale_alpha_manual(values = c(`FALSE` = 0.4, `TRUE` = 1.0), guide = "none") +
  scale_x_continuous(breaks = class_mids$mid, labels = class_mids$class) +
  labs(
    title    = "MWAS Manhattan Plot — BMI vs. Plasma Metabolome",
    subtitle = paste0("Limma moderated t-statistics | ", nrow(limma_results),
                      " metabolites | ", sum(limma_results$significant), " significant (FDR < 0.05)"),
    x = NULL, y = expression(-log[10](p))
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1),
        legend.position = "right",
        panel.grid.minor = element_blank())

9.2 Volcano Plot

volcano_df <- limma_results %>%
  mutate(direction = case_when(
    significant & beta > 0 ~ "Positive (sig)",
    significant & beta < 0 ~ "Negative (sig)",
    TRUE                   ~ "Not significant"
  ))

top_label <- volcano_df %>% filter(significant) %>% slice_min(p_value, n = 10)

ggplot(volcano_df, aes(x = beta, y = neg_log10_p, color = direction)) +
  geom_point(size = 2.2, alpha = 0.8) +
  geom_hline(yintercept = threshold, linetype = "dashed",
             color = "#e74c3c", linewidth = 0.7) +
  geom_vline(xintercept = 0, linetype = "dotted",
             color = "grey50", linewidth = 0.6) +
  geom_text_repel(data = top_label, aes(label = metabolite),
                   size = 3, color = "black", segment.color = "grey60",
                   max.overlaps = 12, box.padding = 0.4) +
  scale_color_manual(values = c("Positive (sig)" = "#e67e22",
                                 "Negative (sig)" = "#2980b9",
                                 "Not significant" = "#bdc3c7"), name = "") +
  labs(title = "Volcano Plot — BMI-Associated Plasma Metabolites",
       subtitle = "Orange = positively associated with BMI | Blue = inversely associated",
       x = "Effect size (β, SD units per BMI unit)",
       y = expression(-log[10](p))) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom", panel.grid.minor = element_blank())

9.3 QQ Plot

n   <- nrow(limma_results)
obs <- sort(-log10(limma_results$p_value))
exp <- -log10(ppoints(n))

lambda <- round(
  median(qchisq(1 - limma_results$p_value, df = 1), na.rm = TRUE) /
    qchisq(0.5, df = 1), 3)

ggplot(data.frame(expected = exp, observed = obs),
       aes(x = expected, y = observed)) +
  geom_abline(slope = 1, intercept = 0, color = "#e74c3c",
              linetype = "dashed", linewidth = 0.9) +
  geom_point(color = "#2980b9", size = 2, alpha = 0.75) +
  annotate("text", x = 0.1, y = max(obs) * 0.92,
           label = paste0("λ = ", lambda), hjust = 0,
           size = 4.5, color = "#e67e22", fontface = "bold") +
  labs(title = paste0("QQ Plot (λ = ", lambda, ")"),
       subtitle = "Deviation above diagonal = association signal | λ ≈ 1 = well-calibrated",
       x = expression("Expected " * -log[10](p)),
       y = expression("Observed " * -log[10](p))) +
  theme_minimal(base_size = 12)

10. Results Table

limma_results %>%
  head(20) %>%
  mutate(
    p_value = formatC(p_value, format = "e", digits = 2),
    fdr_bh  = formatC(fdr_bh,  format = "e", digits = 2),
    B       = round(B, 2)
  ) %>%
  select(metabolite, beta, se, t_stat, p_value, fdr_bh, B, significant) %>%
  rename(Metabolite = metabolite, `β` = beta, SE = se, `t-stat` = t_stat,
         `p-value` = p_value, `FDR (BH)` = fdr_bh, `B-stat` = B, `Sig?` = significant) %>%
  kable(caption = "Table 2. Top 20 MWAS results (limma, ranked by p-value)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
  row_spec(which(limma_results$significant[1:20]), background = "#eafaf1")

Table 2. Top 20 MWAS results (limma, ranked by p-value)
Metabolite	β	SE	t-stat	p-value	FDR (BH)	B-stat	Sig?
Valine	0.1820	0.0135	13.533	1.26e-35	1.43e-33	70.08	TRUE
Leucine/Isoleucine	0.1757	0.0136	12.936	4.33e-33	2.45e-31	64.29	TRUE
Guanidoacetic acid	0.1561	0.0141	11.052	1.76e-25	6.64e-24	46.93	TRUE
Tryptophan	0.0911	0.0153	5.966	4.70e-09	1.17e-07	9.69	TRUE
Lysine	0.0900	0.0151	5.949	5.17e-09	1.17e-07	9.59	TRUE
Homogentisic acid	0.0881	0.0152	5.808	1.15e-08	2.16e-07	8.82	TRUE
Proline	0.0857	0.0153	5.588	3.83e-08	6.18e-07	7.65	TRUE
Xanthurenic acid	0.0847	0.0154	5.513	5.74e-08	8.11e-07	7.26	TRUE
Alanine	0.0839	0.0153	5.476	7.00e-08	8.79e-07	7.07	TRUE
Methionine	0.0820	0.0153	5.344	1.40e-07	1.59e-06	6.39	TRUE
Uric acid	0.0799	0.0153	5.218	2.68e-07	2.75e-06	5.77	TRUE
Sorbitol	0.0767	0.0154	4.980	8.85e-07	8.33e-06	4.62	TRUE
Phenylalanine	0.0742	0.0154	4.806	2.06e-06	1.79e-05	3.80	TRUE
Histidine	0.0720	0.0154	4.660	4.10e-06	3.31e-05	3.14	TRUE
Propionic acid	0.0699	0.0155	4.507	8.27e-06	6.23e-05	2.47	TRUE
Tyrosine	0.0692	0.0155	4.469	9.78e-06	6.91e-05	2.31	TRUE
Octadecatrienoic acid	-0.0680	0.0155	-4.373	1.50e-05	9.95e-05	1.91	TRUE
Linoleic acid	-0.0670	0.0155	-4.311	1.97e-05	1.24e-04	1.64	TRUE
Kynurenine	0.0642	0.0155	4.147	3.97e-05	2.36e-04	0.98	TRUE
Arginine	0.0626	0.0155	4.034	6.36e-05	3.59e-04	0.53	TRUE

11. Results Interpretation

A metabolome-wide association study of BMI was conducted across 113 plasma metabolites (n = 224 participants) using limma moderated t-statistics with adjustment for age and sex. After Benjamini-Hochberg FDR correction (α = 0.05), 32 metabolites were statistically significant, of which 28 were positively associated with BMI and 4 were inversely associated.

The strongest association was observed for Valine (β = 0.182, FDR = 1.43e-33). The genomic inflation factor λ = 5.702, indicating mild inflation warranting investigation.

Effect sizes from standard lm and limma were highly concordant (Pearson r = 1), confirming robustness of findings across analytical approaches.

12. Session Information

sessionInfo()

## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS 26.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] metabolomicsWorkbenchR_1.12.0 scales_1.4.0                 
##  [3] ggrepel_0.9.6                 limma_3.58.1                 
##  [5] kableExtra_1.4.0              knitr_1.51                   
##  [7] lubridate_1.9.4               forcats_1.0.0                
##  [9] stringr_1.5.1                 dplyr_1.1.4                  
## [11] purrr_1.0.4                   readr_2.1.5                  
## [13] tidyr_1.3.1                   tibble_3.3.0                 
## [15] ggplot2_4.0.3                 tidyverse_2.0.0              
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1            viridisLite_0.4.3          
##  [3] farver_2.1.2                S7_0.2.0                   
##  [5] bitops_1.0-9                fastmap_1.2.0              
##  [7] RCurl_1.98-1.17             digest_0.6.37              
##  [9] timechange_0.3.0            lifecycle_1.0.4            
## [11] statmod_1.5.0               magrittr_2.0.3             
## [13] compiler_4.3.2              rlang_1.1.6                
## [15] sass_0.4.10                 tools_4.3.2                
## [17] yaml_2.3.10                 data.table_1.17.8          
## [19] labeling_0.4.3              S4Arrays_1.2.1             
## [21] ontologyIndex_2.12          curl_6.4.0                 
## [23] DelayedArray_0.28.0         xml2_1.3.8                 
## [25] RColorBrewer_1.1-3          abind_1.4-8                
## [27] withr_3.0.2                 BiocGenerics_0.48.1        
## [29] grid_4.3.2                  stats4_4.3.2               
## [31] MultiAssayExperiment_1.28.0 SummarizedExperiment_1.32.0
## [33] cli_3.6.5                   rmarkdown_2.31             
## [35] crayon_1.5.3                generics_0.1.4             
## [37] otel_0.2.0                  rstudioapi_0.19.0          
## [39] httr_1.4.8                  tzdb_0.5.0                 
## [41] cachem_1.1.0                zlibbioc_1.48.2            
## [43] XVector_0.42.0              matrixStats_1.5.0          
## [45] vctrs_0.6.5                 Matrix_1.6-1.1             
## [47] jsonlite_2.0.0              IRanges_2.36.0             
## [49] hms_1.1.3                   S4Vectors_0.40.2           
## [51] systemfonts_1.2.3           jquerylib_0.1.4            
## [53] glue_1.8.0                  codetools_0.2-19           
## [55] stringi_1.8.7               gtable_0.3.6               
## [57] GenomeInfoDb_1.38.8         GenomicRanges_1.54.1       
## [59] pillar_1.10.2               htmltools_0.5.8.1          
## [61] GenomeInfoDbData_1.2.11     R6_2.6.1                   
## [63] textshaping_1.0.1           struct_1.14.1              
## [65] evaluate_1.0.5              Biobase_2.62.0             
## [67] lattice_0.21-9              bslib_0.11.0               
## [69] Rcpp_1.1.0                  svglite_2.2.1              
## [71] SparseArray_1.2.4           xfun_0.52                  
## [73] MatrixGenerics_1.14.0       pkgconfig_2.0.3

Data from the NIH Common Fund’s Metabolomics Workbench (metabolomicsworkbench.org). Pipeline for research and educational demonstration.

Metabolome-Wide Association Study (MWAS) Pipeline

Plasma Metabolomics & BMI | Public Data from Metabolomics Workbench (ST000284)

Rashi Mistry

June 28, 2026

1. Background & Motivation

2. Setup

3. Data Acquisition

4. Data Preparation

5. Descriptive Statistics

6. MWAS — Method 1: Standard Linear Regression

7. MWAS — Method 2: Limma Moderated t-Statistics

8. Method Comparison

9. Visualizations

9.1 Manhattan Plot

9.2 Volcano Plot

9.3 QQ Plot

10. Results Table

11. Results Interpretation

12. Session Information