CODE

Load Required Libraries

To prepare the R environment, we begin by loading the necessary packages. The chunk option knitr::opts_chunk$set(echo = TRUE) ensures that code is displayed by default in each chunk, which supports reproducibility. We load readxl for reading Excel files, and several tidyverse packages—dplyr, tidyr, and purrr—to facilitate data cleaning, transformation, and functional programming. For creating Excel workbooks efficiently, we use openxlsx, which is lightweight and does not require Java. For data visualization and statistical annotation, we rely on ggplot2, ggpubr, and patchwork, which together support layered plotting and combining multiple plots. The use of tidy tools promotes readable, pipeline-style code, while ggplot2 remains a standard for reproducible graphics.

If any package is missing, it should be installed using install.packages() before loading it with library(). To keep the knitting output clean, it’s recommended to suppress startup messages using message = FALSE and warning = FALSE.

knitr::opts_chunk$set(echo = TRUE)
# Data manipulation
library(readxl)
library(dplyr)
library(tidyr)
library(purrr)

# Excel export
library(openxlsx)

# Visualization
library(ggplot2)
library(ggpubr)
library(patchwork)

⸻

Data Import and Cleaning

To begin data processing, we read the raw dataset using read_excel("Merged.xlsx"), which imports the Excel sheet into a tibble or data frame. We then classify each row based on the presence of the string “BSA” in the Sample ID BT column using ifelse(grepl("BSA", df1$Sample ID BT), "BSA", "No_BSA"), creating a new column to indicate the preparation method. Since Sample ID Client is now used as the identifier, we remove the original Sample ID BT column with df1 <- df1[,-1]. To ensure proper data types for analysis, we clean and convert relevant columns—ranging from VLDL-C to HDL-Z (nm)—by replacing “<LOD” strings with NA and coercing them to numeric using mutate(across(..., ~ as.numeric(gsub("<LOD", NA, .x)))). This step is crucial because statistical tests require numeric inputs, and instruments often encode below-detection values as strings. Using across() simplifies applying transformations across multiple columns without listing each one. After conversion, it’s important to verify column types using str(df1) or sapply(df1, class). Additionally, when referencing columns with special characters like spaces or parentheses, backticks should be used. Finally, to ensure accurate classification, inspect the distribution of the new method column with table(df1$Method) to catch any mislabeling.

#Data Import and Cleaning

# Read the dataset
df1 <- read_excel("Merged.xlsx")

# Categorize samples by method
df1$Method <- ifelse(grepl("BSA", df1$`Sample ID BT`), "BSA", "No_BSA")

# Remove 'Sample ID BT' column
df1 <- df1[,-1]

# Convert metabolite columns to numeric; replace <LOD with NA
df1 <- df1 %>%
  ungroup() %>%
  mutate(across(`VLDL-C`:`HDL-Z (nm)`, ~ as.numeric(gsub("<LOD", NA, .x))))

# Add replicate numbers per sample-method
df1 <- df1 %>%
  group_by(`Sample ID Client`, Method) %>%
  mutate(Replicate = row_number()) %>%
  ungroup()

⸻

Pivot wider by method

To reshape the dataset for paired statistical testing, we transform it from a long format—where each row represents a single measurement—to a wide format, where each row corresponds to a unique sample and replicate, with separate columns for BSA and No_BSA values per metabolite. This is achieved using pivot_wider(id_cols = c(\Sample ID Client`, Replicate), names_from = Method, values_from = c(`VLDL-C`:`HDL-Z (nm)`)), which generates new columns such as VLDL-C_BSA and VLDL-C_No_BSA. The theoretical rationale is that paired tests require both conditions (BSA and No_BSA) to be represented in the same row, simplifying comparisons between them.

After pivoting, it’s important to check the structure of the resulting data frame using str(wide) to confirm that the new columns exist and are correctly named. If pivot_wider() produces list-columns, this indicates duplicate entries for the same sample-method combination, which should be resolved by aggregating the data (e.g., using mean or median) before pivoting.

# Pivot wider by method
wide <- df1 %>%
  pivot_wider(
    id_cols = c(`Sample ID Client`, Replicate),
    names_from = Method,
    values_from = c(`VLDL-C`:`HDL-Z (nm)`)
  )

# Show structure
str(wide)

## tibble [54 × 48] (S3: tbl_df/tbl/data.frame)
##  $ Sample ID Client             : chr [1:54] "E17" "EA4 Exp 17" "LA3 Exp 17" "E10" ...
##  $ Replicate                    : int [1:54] 1 1 1 1 1 1 1 1 1 1 ...
##  $ VLDL-C_No_BSA                : num [1:54] 4.65 86.65 129.47 81.72 77.99 ...
##  $ VLDL-C_BSA                   : num [1:54] 4.65 103.77 152.59 93.44 94.37 ...
##  $ IDL-C_No_BSA                 : num [1:54] 2.29 24.24 44.79 21.04 30.47 ...
##  $ IDL-C_BSA                    : num [1:54] NA 10.7 22.25 9.68 17.7 ...
##  $ LDL-C_No_BSA                 : num [1:54] 29.9 328.7 621.5 219.1 460.1 ...
##  $ LDL-C_BSA                    : num [1:54] 26.1 341.5 667.1 229.7 492.8 ...
##  $ HDL-C_No_BSA                 : num [1:54] 22.8 97.4 112 58.2 108.5 ...
##  $ HDL-C_BSA                    : num [1:54] 28.5 81 65.8 47.2 72.1 ...
##  $ VLDL-TG_No_BSA               : num [1:54] 35.1 131 363.8 132.1 139.7 ...
##  $ VLDL-TG_BSA                  : num [1:54] 44.1 158 401.2 158.3 166.4 ...
##  $ IDL-TG_No_BSA                : num [1:54] 7.01 8.96 23.8 12.15 17.21 ...
##  $ IDL-TG_BSA                   : num [1:54] 1.32 1.09 NA 1.4 1.04 ...
##  $ LDL-TG_No_BSA                : num [1:54] 6.64 24.92 93.34 23.24 58.47 ...
##  $ LDL-TG_BSA                   : num [1:54] 1.01 24.42 84.1 17.77 65.09 ...
##  $ HDL-TG_No_BSA                : num [1:54] 17.2 20.3 6.6 23.2 18.7 ...
##  $ HDL-TG_BSA                   : num [1:54] 19.5 NA NA 13.1 NA ...
##  $ VLDL-P (nmol/L)_No_BSA       : num [1:54] 23.4 123 285.3 121.4 128 ...
##  $ VLDL-P (nmol/L)_BSA          : num [1:54] 32 144 298 141 149 ...
##  $ Large VLDL-P (nmol/L)_No_BSA : num [1:54] 0.742 3.404 6.492 3.075 3.065 ...
##  $ Large VLDL-P (nmol/L)_BSA    : num [1:54] 0.795 4.941 6.714 4.443 3.338 ...
##  $ Medium VLDL-P (nmol/L)_No_BSA: num [1:54] 2.66 16.26 32.04 16.19 13.84 ...
##  $ Medium VLDL-P (nmol/L)_BSA   : num [1:54] 0.811 19.457 50.269 17.647 19.97 ...
##  $ Small VLDL-P (nmol/L)_No_BSA : num [1:54] 20 103 247 102 111 ...
##  $ Small VLDL-P (nmol/L)_BSA    : num [1:54] 30.4 119.4 240.8 119.2 125.5 ...
##  $ LDL-P (nmol/L)_No_BSA        : num [1:54] 308 3120 6488 2183 4669 ...
##  $ LDL-P (nmol/L)_BSA           : num [1:54] 227 3438 6871 2167 5029 ...
##  $ Large LDL-P (nmol/L)_No_BSA  : num [1:54] 59.5 595.1 1039.3 373.6 734 ...
##  $ Large LDL-P (nmol/L)_BSA     : num [1:54] 62.9 730.9 1146 554.3 844 ...
##  $ Medium LDL-P (nmol/L)_No_BSA : num [1:54] 77 777 1835 559 1401 ...
##  $ Medium LDL-P (nmol/L)_BSA    : num [1:54] 6.22e-02 2.92e+02 1.73e+03 2.27e+02 1.38e+03 ...
##  $ Small LDL-P (nmol/L)_No_BSA  : num [1:54] 171 1747 3613 1250 2534 ...
##  $ Small LDL-P (nmol/L)_BSA     : num [1:54] 164 2415 3996 1386 2806 ...
##  $ HDL-P (μmol/L)_No_BSA        : num [1:54] 15 38.6 37.7 27.3 45.2 ...
##  $ HDL-P (μmol/L)_BSA           : num [1:54] 11.3 19.6 16.5 14.5 18.2 ...
##  $ Large HDL-P (μmol/L)_No_BSA  : num [1:54] 0.157 0.523 0.727 0.373 0.604 ...
##  $ Large HDL-P (μmol/L)_BSA     : num [1:54] 0.716 0.867 0.522 0.607 0.417 ...
##  $ Medium HDL-P (μmol/L)_No_BSA : num [1:54] 7.21 22.32 22.97 15.32 21.53 ...
##  $ Medium HDL-P (μmol/L)_BSA    : num [1:54] 10.6 18.8 15.9 13.9 17.7 ...
##  $ Small HDL-P (μmol/L)_No_BSA  : num [1:54] 7.63 15.77 14.03 11.56 23.11 ...
##  $ Small HDL-P (μmol/L)_BSA     : num [1:54] 0.01498 0.01815 0.01013 0.0127 0.00865 ...
##  $ VLDL-Z (nm)_No_BSA           : num [1:54] 42.4 42.4 42.2 42.4 42.1 ...
##  $ VLDL-Z (nm)_BSA              : num [1:54] 41.5 42.6 42.6 42.4 42.3 ...
##  $ LDL-Z (nm)_No_BSA            : num [1:54] 21.2 21.2 21 21 21.1 ...
##  $ LDL-Z (nm)_BSA               : num [1:54] 21 20.8 21 21.2 21.1 ...
##  $ HDL-Z (nm)_No_BSA            : num [1:54] 8.43 8.57 8.63 8.55 8.43 ...
##  $ HDL-Z (nm)_BSA               : num [1:54] 9.22 9.18 9.14 9.17 9.12 ...

⸻

Identify Metabolites

To prepare for paired statistical testing, we create a vector called metabolites that contains only the base names of metabolites measured in both BSA and No_BSA conditions. This ensures that we only test pairs where both values are available. We start by retrieving all column names using names(wide), then identify those ending in _BSA with grep("_BSA$", all_cols, value = TRUE). We strip the _BSA suffix using gsub("_BSA$", "", ...) to get the base metabolite names. To confirm that each of these also has a corresponding _No_BSA column, we intersect this list with similarly processed _No_BSA columns. This step is essential to avoid errors during testing and ensures that only complete pairs are included.

After creating the vector, it’s important to print and inspect it to verify that the expected metabolite names are present. Also, be mindful of column naming conventions—if suffixes differ (e.g., _BSA vs .BSA), the pattern used in grep() and gsub() should be adjusted accordingly.

# Identify metabolites measured in both BSA and No_BSA
all_cols <- names(wide)
metabolites <- gsub("_BSA$", "", grep("_BSA$", all_cols, value = TRUE))
metabolites <- metabolites[metabolites %in% gsub("_No_BSA$", "", grep("_No_BSA$", all_cols, value = TRUE))]

metabolites

##  [1] "VLDL-C"                 "IDL-C"                  "LDL-C"                 
##  [4] "HDL-C"                  "VLDL-TG"                "IDL-TG"                
##  [7] "LDL-TG"                 "HDL-TG"                 "VLDL-P (nmol/L)"       
## [10] "Large VLDL-P (nmol/L)"  "Medium VLDL-P (nmol/L)" "Small VLDL-P (nmol/L)" 
## [13] "LDL-P (nmol/L)"         "Large LDL-P (nmol/L)"   "Medium LDL-P (nmol/L)" 
## [16] "Small LDL-P (nmol/L)"   "HDL-P (μmol/L)"         "Large HDL-P (μmol/L)"  
## [19] "Medium HDL-P (μmol/L)"  "Small HDL-P (μmol/L)"   "VLDL-Z (nm)"           
## [22] "LDL-Z (nm)"             "HDL-Z (nm)"

⸻

Statistical Analysis (Paired Wilcoxon Test)

To statistically compare metabolite levels between BSA and No_BSA preparations, we perform a paired Wilcoxon signed-rank test for each metabolite, using only those with valid paired measurements. This is implemented using map_df() from the purrr package, which iterates over the metabolites vector and returns a tidy data frame of results. For each metabolite, we extract the BSA and No_BSA values using:

x <- wide[[paste0(met, "_BSA")]]
y <- wide[[paste0(met, "_No_BSA")]]

We ensure that at least two non-missing pairs exist with if (sum(!is.na(x) & !is.na(y)) >= 2) before running the test. The Wilcoxon test is performed using wilcox.test(x, y, paired = TRUE, exact = FALSE), which is suitable for non-normally distributed data and avoids issues with ties in moderate sample sizes. Alongside the test statistic and p-value, we compute mean_BSA and mean_No_BSA to provide interpretable effect sizes, even though medians might be more robust for skewed data.

Key considerations:

The Wilcoxon test returns a V statistic, not a t-statistic, so interpretation should reflect that.
For very small sample sizes, the test may lack power, so it’s helpful to report the number of valid pairs per metabolite.
Always inspect the results to ensure expected metabolites are included and that the test was applied correctly.

Adjust p-values & Percent change (Post-processing)

To adjust for multiple comparisons and enhance interpretability, we apply false discovery rate (FDR) correction to the raw p-values and calculate percent change between conditions. Specifically, adjusted p-values are computed using p.adjust(p_value, method = "fdr"), which applies the Benjamini–Hochberg procedure to control the expected proportion of false discoveries across all metabolite tests.

Percent change is calculated as (mean_BSA - mean_No_BSA) / mean_No_BSA * 100, providing a clear relative measure of effect—positive values indicate an increase with BSA, while negative values indicate a decrease.

To visually convey statistical significance, we use case_when() to assign significance stars based on adjusted p-value thresholds. It’s important to handle cases where mean_No_BSA is zero or near zero, as this can lead to unstable or infinite percent changes; such cases should be treated separately, possibly by reporting absolute differences or applying a small constant. While means are used for interpretability, medians or log2 fold-changes may be more appropriate when data distributions are skewed.

# Run paired Wilcoxon test for each metabolite
results <- map_df(metabolites, function(met) {
  x <- wide[[paste0(met, "_BSA")]]
  y <- wide[[paste0(met, "_No_BSA")]]
  
  if(sum(!is.na(x) & !is.na(y)) >= 2) {
    test <- wilcox.test(x, y, paired = TRUE, exact = FALSE)
    mean_x <- mean(x, na.rm = TRUE)
    mean_y <- mean(y, na.rm = TRUE)
  } else {
    test <- list(statistic = NA, p.value = NA)
    mean_x <- mean(x, na.rm = TRUE)
    mean_y <- mean(y, na.rm = TRUE)
  }
  
  data.frame(
    Metabolite = met,
    stat = test$statistic,
    p_value = test$p.value,
    mean_BSA = mean_x,
    mean_No_BSA = mean_y
  )
})

# Adjust p-values and calculate percent change
results <- results %>%
  mutate(
    p_adj = p.adjust(p_value, method = "fdr"),
    percent_change = (mean_BSA - mean_No_BSA) / mean_No_BSA * 100,
    significance = case_when(
      p_adj < 0.001 ~ "***",
      p_adj < 0.01  ~ "**",
      p_adj < 0.05  ~ "*",
      TRUE          ~ ""
    )
  ) %>%
  arrange(p_adj)

# Show top results
head(results)

##                  Metabolite stat      p_value    mean_BSA mean_No_BSA
## V...1 Medium LDL-P (nmol/L)    0 7.790492e-10 460.8693934  714.971392
## V...2  Small HDL-P (μmol/L)    2 8.797194e-10   0.7124352   14.776995
## V...3            HDL-Z (nm) 1274 8.278932e-10   9.1331034    8.513289
## V...4        HDL-P (μmol/L)   10 1.425214e-09  14.8606742   33.079170
## V...5               VLDL-TG 1229 1.162832e-08 169.9373223  151.416199
## V...6       VLDL-P (nmol/L) 1225 1.457654e-08 142.5760702  126.449328
##              p_adj percent_change significance
## V...1 6.744515e-09     -35.540163          ***
## V...2 6.744515e-09     -95.178754          ***
## V...3 6.744515e-09       7.280552          ***
## V...4 8.194980e-09     -55.075432          ***
## V...5 4.789436e-08      12.231930          ***
## V...6 4.789436e-08      12.753521          ***

⸻

Volcano Plot

To create a visual summary of the results, we generate a scatter plot showing percent change on the x-axis and statistical significance on the y-axis, represented as negative log10 of the adjusted p-value. This is done using ggplot2, where each point reflects a metabolite, and geom_text() is used to label only the significant ones to avoid overcrowding. The scale_color_identity() function allows custom colors defined per row to be used directly in the plot. This volcano-style plot helps highlight metabolites that exhibit both strong statistical evidence and substantial biological change. Since extremely small p-values can result in infinite values when transformed with -log10(p_adj), we ensure that p_adj has a lower bound by replacing zeros with a small constant (e.g., 1e-300). To maintain clarity, only top hits or those marked as significant are labeled, preventing visual clutter.

volcano <- results %>%
  mutate(
    neg_log10_p = -log10(p_adj),
    color = case_when(
      significance == "***" ~ "red",
      significance == "**"  ~ "orange",
      significance == "*"   ~ "blue",
      TRUE                  ~ "gray"
    )
  )

p_volcano <- ggplot(volcano, aes(x = percent_change, y = neg_log10_p, label = Metabolite)) +
  geom_point(aes(color = color), size = 3) +
  scale_color_identity() +
  geom_text(data = subset(volcano, significance != ""), hjust = 1.1, vjust = 0.5, size = 3) +
  theme_minimal() +
  xlab("Percent Change (BSA vs No_BSA)") +
  ylab("-log10(FDR-adjusted p-value)") +
  ggtitle("Volcano Plot of Metabolite Changes with BSA") +
  theme(plot.title = element_text(hjust = 0.5))

p_volcano

⸻

Export Results to Excel

To make the statistical results easily accessible for collaborators, we save the results table to an Excel file using functions from the openxlsx package. This involves creating a workbook with createWorkbook(), adding a worksheet using addWorksheet(), writing the data with writeData(), and finally saving the file using saveWorkbook(). Excel is a widely used format for data sharing, and openxlsx allows writing files without requiring Java, making it efficient and lightweight. Before saving, it’s important to confirm the file path and ensure write permissions are in place. Using overwrite = TRUE can prevent errors during saving, but care should be taken not to overwrite important files unintentionally. Optionally, separate sheets can be included for significant and non-significant metabolites to enhance clarity and reviewability.

wb <- createWorkbook()
addWorksheet(wb, "Metabolite_Stats")
writeData(wb, "Metabolite_Stats", results)

# Save workbook
saveWorkbook(wb, "Metabolite_Comparison_BSA.xlsx", overwrite = TRUE)

⸻

Comparison Plots for Each Metabolite

To visually support the statistical findings and explore distributional patterns, we generate a consistent panel of plots for each metabolite: - a scatter plot comparing BSA vs No_BSA values - a density plot showing the distribution per method - a boxplot comparing methods with annotated Wilcoxon p-values.

The data is reshaped using pivot_longer() and tidy selection with !!sym(col) to prepare for plotting. The scatter plot uses geom_point() and geom_smooth(method = "lm") to show linear trends, along with stat_cor() to display Pearson correlation. The density plot employs geom_density() to highlight distribution overlap, while the boxplot combines geom_boxplot() and geom_jitter() to show individual data points, with stat_compare_means(method = "wilcox.test", label = "p.signif") for significance annotation.

These plots are arranged using patchwork syntax as (scatter + density) / boxplot, creating a clear and consistent 2x2 layout. This approach helps reveal systematic biases, distribution shapes, and statistical differences. Before plotting, NA values are filtered out to avoid warnings or dropped observations.

For metabolites with very few observations, regression lines or correlation stats may be skipped or annotated to reflect limited sample size. Since each metabolite is plotted separately, axis scaling remains consistent within each panel, avoiding distortion from extreme value ranges.

make_comparison_plots <- function(data, metabolite) {
  col1 <- paste0(metabolite, "_BSA")
  col2 <- paste0(metabolite, "_No_BSA")
  
  if(!(col1 %in% names(data) & col2 %in% names(data))) return(NULL)
  
  df_long <- data %>%
    mutate(Sample = row_number()) %>%
    select(Sample, !!sym(col1), !!sym(col2)) %>%
    pivot_longer(cols = -Sample, names_to = "Method", values_to = "Value") %>%
    mutate(Method = ifelse(Method == col1, "Using BSA", "No BSA"))
  
  # Scatter
  scatter <- ggplot(data %>% filter(!is.na(!!sym(col1)) & !is.na(!!sym(col2))),
                    aes(x = !!sym(col1), y = !!sym(col2))) +
    geom_point(color = "#1f77b4", alpha = 0.7) +
    geom_smooth(method = "lm", color = "purple", se = FALSE) +
    stat_cor(aes(label = after_stat(r.label)), method = "pearson") +
    labs(title = paste0(metabolite, " (BSA vs No_BSA)"),
         x = "Using BSA", y = "No BSA") +
    theme_minimal()
  
  # Density
  density <- ggplot(df_long, aes(x = Value, fill = Method, color = Method)) +
    geom_density(alpha = 0.3) +
    labs(title = paste0(metabolite, " - Distribution"), x = metabolite, y = "Density") +
    theme_minimal()
  
  # Boxplot
  boxplot <- ggplot(df_long, aes(x = Method, y = Value, fill = Method)) +
    geom_boxplot(alpha = 0.6, outlier.shape = NA) +
    geom_jitter(width = 0.1, alpha = 0.3) +
    stat_compare_means(method = "wilcox.test", label = "p.signif") +
    labs(title = paste0(metabolite, " - Comparison"), x = "", y = metabolite) +
    theme_minimal()
  
  combined <- (scatter + density) / boxplot +
    plot_annotation(title = paste("Comparison for", metabolite),
                    theme = theme(plot.title = element_text(face = "bold", size = 14)))
  
  return(combined)
}

# Generate plots for all metabolites
plots_list <- lapply(metabolites, function(met) make_comparison_plots(wide, met))
names(plots_list) <- metabolites

# Display first plot as example
plots_list[[1]]

# Display all metabolite comparison plots
for (i in seq_along(plots_list)) {
  cat("###", names(plots_list)[i], "\n")   # Add a section header in the notebook
  print(plots_list[[i]])                   # Display each plot
}

## ### VLDL-C

## ### IDL-C

## ### LDL-C

## ### HDL-C

## ### VLDL-TG

## ### IDL-TG

## ### LDL-TG

## ### HDL-TG

## ### VLDL-P (nmol/L)

## ### Large VLDL-P (nmol/L)

## ### Medium VLDL-P (nmol/L)

## ### Small VLDL-P (nmol/L)

## ### LDL-P (nmol/L)

## ### Large LDL-P (nmol/L)

## ### Medium LDL-P (nmol/L)

## ### Small LDL-P (nmol/L)

## ### HDL-P (μmol/L)

## ### Large HDL-P (μmol/L)

## ### Medium HDL-P (μmol/L)

## ### Small HDL-P (μmol/L)

## ### VLDL-Z (nm)

## ### LDL-Z (nm)

## ### HDL-Z (nm)

⸻

Export All Plots to PDF

To archive or share the full set of per-metabolite comparison plots, we render them into a single multi-page PDF using lapply() to generate a list of plots with make_comparison_plots(wide) for each metabolite.

The resulting plots_list is then written to a PDF using pdf("All_Metabolite_Comparisons.pdf", ...), looping through each plot with print(p) to place one on each page. This format is ideal for reviewers, as each page presents a complete visualization for a single metabolite. To avoid blank pages, any NULL plots—resulting from missing data or insufficient observations—are skipped during printing. If the number of metabolites is large, the resulting file may be sizable; in such cases, exporting individual PNGs per metabolite can be a useful alternative for embedding in presentations or reports.

pdf("All_Metabolite_Comparisons.pdf", width = 10, height = 12)

for (met in metabolites) {
  p <- make_comparison_plots(wide, met)
  print(p)
}

dev.off()

## quartz_off_screen 
##                 2

⸻

Metabolite Comparison Analysis: BSA vs No_BSA

Ali Mohamed

2025-11-11

Introduction

Overall analysis — what we want and how

Aim of the analysis (overall):

Why this workflow?