This notebook analyzes the effect of BSA on the
metabolome of our samples.
We compare samples prepared using BSA versus the standard method
(No_BSA) across multiple metabolites.
We want to evaluate how adding Bovine Serum Albumin (BSA) during sample preparation changes measured metabolite values compared to the standard (No_BSA) preparation. To do this we:
• Clean and harmonize the raw dataset.
• Pair measurements from the same sample prepared two ways.
• Use a paired non-parametric test (Wilcoxon signed-rank) to identify metabolites significantly changed by BSA (we choose non-parametric because differences may not be normally distributed).
• Compute effect sizes as percent change (BSA vs No_BSA) and correct p-values for multiple testing (FDR).
• Visualize results with a volcano plot and per-metabolite comparison plots (scatter, density, boxplot).
• Export statistics and plots for reporting.
Pairing preserves within-sample correlation (more statistical power). Non-parametric test removes the normality assumption for differences. FDR controls false positives from testing many metabolites. Visualizations help check assumptions, distribution, and effect sizes.
To prepare the R environment, we begin by loading the necessary
packages. The chunk option
knitr::opts_chunk$set(echo = TRUE) ensures that code is
displayed by default in each chunk, which supports reproducibility. We
load readxl for reading Excel files, and several tidyverse
packages—dplyr, tidyr, and
purrr—to facilitate data cleaning, transformation, and
functional programming. For creating Excel workbooks efficiently, we use
openxlsx, which is lightweight and does not require Java.
For data visualization and statistical annotation, we rely on
ggplot2, ggpubr, and patchwork,
which together support layered plotting and combining multiple plots.
The use of tidy tools promotes readable, pipeline-style code, while
ggplot2 remains a standard for reproducible graphics.
If any package is missing, it should be installed using
install.packages() before loading it with
library(). To keep the knitting output clean, it’s
recommended to suppress startup messages using
message = FALSE and warning = FALSE.
knitr::opts_chunk$set(echo = TRUE)
# Data manipulation
library(readxl)
library(dplyr)
library(tidyr)
library(purrr)
# Excel export
library(openxlsx)
# Visualization
library(ggplot2)
library(ggpubr)
library(patchwork)
⸻
To begin data processing, we read the raw dataset using
read_excel("Merged.xlsx"), which imports the Excel sheet
into a tibble or data frame. We then classify each row based on the
presence of the string “BSA” in the Sample ID BT column
using
ifelse(grepl("BSA", df1$Sample ID BT), "BSA", "No_BSA"),
creating a new column to indicate the preparation method. Since
Sample ID Client is now used as the identifier, we remove
the original Sample ID BT column with
df1 <- df1[,-1]. To ensure proper data types for
analysis, we clean and convert relevant columns—ranging from
VLDL-C to HDL-Z (nm)—by replacing “<LOD”
strings with NA and coercing them to numeric using
mutate(across(..., ~ as.numeric(gsub("<LOD", NA, .x)))).
This step is crucial because statistical tests require numeric inputs,
and instruments often encode below-detection values as strings. Using
across() simplifies applying transformations across
multiple columns without listing each one. After conversion, it’s
important to verify column types using str(df1) or
sapply(df1, class). Additionally, when referencing columns
with special characters like spaces or parentheses, backticks should be
used. Finally, to ensure accurate classification, inspect the
distribution of the new method column with
table(df1$Method) to catch any mislabeling.
#Data Import and Cleaning
# Read the dataset
df1 <- read_excel("Merged.xlsx")
# Categorize samples by method
df1$Method <- ifelse(grepl("BSA", df1$`Sample ID BT`), "BSA", "No_BSA")
# Remove 'Sample ID BT' column
df1 <- df1[,-1]
# Convert metabolite columns to numeric; replace <LOD with NA
df1 <- df1 %>%
ungroup() %>%
mutate(across(`VLDL-C`:`HDL-Z (nm)`, ~ as.numeric(gsub("<LOD", NA, .x))))
# Add replicate numbers per sample-method
df1 <- df1 %>%
group_by(`Sample ID Client`, Method) %>%
mutate(Replicate = row_number()) %>%
ungroup()
⸻
To reshape the dataset for paired statistical testing, we transform
it from a long format—where each row represents a single measurement—to
a wide format, where each row corresponds to a unique sample and
replicate, with separate columns for BSA and No_BSA values per
metabolite. This is achieved using
pivot_wider(id_cols = c(\Sample ID Client`, Replicate),
names_from = Method, values_from = c(`VLDL-C`:`HDL-Z (nm)`)), which
generates new columns such as VLDL-C_BSA and
VLDL-C_No_BSA. The theoretical rationale is that paired
tests require both conditions (BSA and No_BSA) to be represented in the
same row, simplifying comparisons between them.
After pivoting, it’s important to check the structure of the
resulting data frame using str(wide) to confirm that the
new columns exist and are correctly named. If pivot_wider()
produces list-columns, this indicates duplicate entries for the same
sample-method combination, which should be resolved by aggregating the
data (e.g., using mean or median) before pivoting.
# Pivot wider by method
wide <- df1 %>%
pivot_wider(
id_cols = c(`Sample ID Client`, Replicate),
names_from = Method,
values_from = c(`VLDL-C`:`HDL-Z (nm)`)
)
# Show structure
str(wide)
## tibble [54 × 48] (S3: tbl_df/tbl/data.frame)
## $ Sample ID Client : chr [1:54] "E17" "EA4 Exp 17" "LA3 Exp 17" "E10" ...
## $ Replicate : int [1:54] 1 1 1 1 1 1 1 1 1 1 ...
## $ VLDL-C_No_BSA : num [1:54] 4.65 86.65 129.47 81.72 77.99 ...
## $ VLDL-C_BSA : num [1:54] 4.65 103.77 152.59 93.44 94.37 ...
## $ IDL-C_No_BSA : num [1:54] 2.29 24.24 44.79 21.04 30.47 ...
## $ IDL-C_BSA : num [1:54] NA 10.7 22.25 9.68 17.7 ...
## $ LDL-C_No_BSA : num [1:54] 29.9 328.7 621.5 219.1 460.1 ...
## $ LDL-C_BSA : num [1:54] 26.1 341.5 667.1 229.7 492.8 ...
## $ HDL-C_No_BSA : num [1:54] 22.8 97.4 112 58.2 108.5 ...
## $ HDL-C_BSA : num [1:54] 28.5 81 65.8 47.2 72.1 ...
## $ VLDL-TG_No_BSA : num [1:54] 35.1 131 363.8 132.1 139.7 ...
## $ VLDL-TG_BSA : num [1:54] 44.1 158 401.2 158.3 166.4 ...
## $ IDL-TG_No_BSA : num [1:54] 7.01 8.96 23.8 12.15 17.21 ...
## $ IDL-TG_BSA : num [1:54] 1.32 1.09 NA 1.4 1.04 ...
## $ LDL-TG_No_BSA : num [1:54] 6.64 24.92 93.34 23.24 58.47 ...
## $ LDL-TG_BSA : num [1:54] 1.01 24.42 84.1 17.77 65.09 ...
## $ HDL-TG_No_BSA : num [1:54] 17.2 20.3 6.6 23.2 18.7 ...
## $ HDL-TG_BSA : num [1:54] 19.5 NA NA 13.1 NA ...
## $ VLDL-P (nmol/L)_No_BSA : num [1:54] 23.4 123 285.3 121.4 128 ...
## $ VLDL-P (nmol/L)_BSA : num [1:54] 32 144 298 141 149 ...
## $ Large VLDL-P (nmol/L)_No_BSA : num [1:54] 0.742 3.404 6.492 3.075 3.065 ...
## $ Large VLDL-P (nmol/L)_BSA : num [1:54] 0.795 4.941 6.714 4.443 3.338 ...
## $ Medium VLDL-P (nmol/L)_No_BSA: num [1:54] 2.66 16.26 32.04 16.19 13.84 ...
## $ Medium VLDL-P (nmol/L)_BSA : num [1:54] 0.811 19.457 50.269 17.647 19.97 ...
## $ Small VLDL-P (nmol/L)_No_BSA : num [1:54] 20 103 247 102 111 ...
## $ Small VLDL-P (nmol/L)_BSA : num [1:54] 30.4 119.4 240.8 119.2 125.5 ...
## $ LDL-P (nmol/L)_No_BSA : num [1:54] 308 3120 6488 2183 4669 ...
## $ LDL-P (nmol/L)_BSA : num [1:54] 227 3438 6871 2167 5029 ...
## $ Large LDL-P (nmol/L)_No_BSA : num [1:54] 59.5 595.1 1039.3 373.6 734 ...
## $ Large LDL-P (nmol/L)_BSA : num [1:54] 62.9 730.9 1146 554.3 844 ...
## $ Medium LDL-P (nmol/L)_No_BSA : num [1:54] 77 777 1835 559 1401 ...
## $ Medium LDL-P (nmol/L)_BSA : num [1:54] 6.22e-02 2.92e+02 1.73e+03 2.27e+02 1.38e+03 ...
## $ Small LDL-P (nmol/L)_No_BSA : num [1:54] 171 1747 3613 1250 2534 ...
## $ Small LDL-P (nmol/L)_BSA : num [1:54] 164 2415 3996 1386 2806 ...
## $ HDL-P (μmol/L)_No_BSA : num [1:54] 15 38.6 37.7 27.3 45.2 ...
## $ HDL-P (μmol/L)_BSA : num [1:54] 11.3 19.6 16.5 14.5 18.2 ...
## $ Large HDL-P (μmol/L)_No_BSA : num [1:54] 0.157 0.523 0.727 0.373 0.604 ...
## $ Large HDL-P (μmol/L)_BSA : num [1:54] 0.716 0.867 0.522 0.607 0.417 ...
## $ Medium HDL-P (μmol/L)_No_BSA : num [1:54] 7.21 22.32 22.97 15.32 21.53 ...
## $ Medium HDL-P (μmol/L)_BSA : num [1:54] 10.6 18.8 15.9 13.9 17.7 ...
## $ Small HDL-P (μmol/L)_No_BSA : num [1:54] 7.63 15.77 14.03 11.56 23.11 ...
## $ Small HDL-P (μmol/L)_BSA : num [1:54] 0.01498 0.01815 0.01013 0.0127 0.00865 ...
## $ VLDL-Z (nm)_No_BSA : num [1:54] 42.4 42.4 42.2 42.4 42.1 ...
## $ VLDL-Z (nm)_BSA : num [1:54] 41.5 42.6 42.6 42.4 42.3 ...
## $ LDL-Z (nm)_No_BSA : num [1:54] 21.2 21.2 21 21 21.1 ...
## $ LDL-Z (nm)_BSA : num [1:54] 21 20.8 21 21.2 21.1 ...
## $ HDL-Z (nm)_No_BSA : num [1:54] 8.43 8.57 8.63 8.55 8.43 ...
## $ HDL-Z (nm)_BSA : num [1:54] 9.22 9.18 9.14 9.17 9.12 ...
⸻
To prepare for paired statistical testing, we create a vector called
metabolites that contains only the base names of
metabolites measured in both BSA and No_BSA conditions. This ensures
that we only test pairs where both values are available. We start by
retrieving all column names using names(wide), then
identify those ending in _BSA with
grep("_BSA$", all_cols, value = TRUE). We strip the
_BSA suffix using gsub("_BSA$", "", ...) to
get the base metabolite names. To confirm that each of these also has a
corresponding _No_BSA column, we intersect this list with
similarly processed _No_BSA columns. This step is essential
to avoid errors during testing and ensures that only complete pairs are
included.
After creating the vector, it’s important to print and inspect it to
verify that the expected metabolite names are present. Also, be mindful
of column naming conventions—if suffixes differ (e.g., _BSA
vs .BSA), the pattern used in grep() and
gsub() should be adjusted accordingly.
# Identify metabolites measured in both BSA and No_BSA
all_cols <- names(wide)
metabolites <- gsub("_BSA$", "", grep("_BSA$", all_cols, value = TRUE))
metabolites <- metabolites[metabolites %in% gsub("_No_BSA$", "", grep("_No_BSA$", all_cols, value = TRUE))]
metabolites
## [1] "VLDL-C" "IDL-C" "LDL-C"
## [4] "HDL-C" "VLDL-TG" "IDL-TG"
## [7] "LDL-TG" "HDL-TG" "VLDL-P (nmol/L)"
## [10] "Large VLDL-P (nmol/L)" "Medium VLDL-P (nmol/L)" "Small VLDL-P (nmol/L)"
## [13] "LDL-P (nmol/L)" "Large LDL-P (nmol/L)" "Medium LDL-P (nmol/L)"
## [16] "Small LDL-P (nmol/L)" "HDL-P (μmol/L)" "Large HDL-P (μmol/L)"
## [19] "Medium HDL-P (μmol/L)" "Small HDL-P (μmol/L)" "VLDL-Z (nm)"
## [22] "LDL-Z (nm)" "HDL-Z (nm)"
⸻
To statistically compare metabolite levels between BSA and No_BSA
preparations, we perform a paired Wilcoxon signed-rank test for each
metabolite, using only those with valid paired measurements. This is
implemented using map_df() from the purrr
package, which iterates over the metabolites vector and
returns a tidy data frame of results. For each metabolite, we extract
the BSA and No_BSA values using:
x <- wide[[paste0(met, "_BSA")]]
y <- wide[[paste0(met, "_No_BSA")]]
We ensure that at least two non-missing pairs exist with
if (sum(!is.na(x) & !is.na(y)) >= 2) before running
the test. The Wilcoxon test is performed using
wilcox.test(x, y, paired = TRUE, exact = FALSE), which is
suitable for non-normally distributed data and avoids issues with ties
in moderate sample sizes. Alongside the test statistic and p-value, we
compute mean_BSA and mean_No_BSA to provide
interpretable effect sizes, even though medians might be more robust for
skewed data.
Key considerations:
To adjust for multiple comparisons and enhance interpretability, we
apply false discovery rate (FDR) correction to the raw p-values and
calculate percent change between conditions. Specifically, adjusted
p-values are computed using
p.adjust(p_value, method = "fdr"), which applies the
Benjamini–Hochberg procedure to control the expected proportion of false
discoveries across all metabolite tests.
Percent change is calculated as
(mean_BSA - mean_No_BSA) / mean_No_BSA * 100, providing a
clear relative measure of effect—positive values indicate an increase
with BSA, while negative values indicate a decrease.
To visually convey statistical significance, we use
case_when() to assign significance stars based on adjusted
p-value thresholds. It’s important to handle cases where
mean_No_BSA is zero or near zero, as this can lead to
unstable or infinite percent changes; such cases should be treated
separately, possibly by reporting absolute differences or applying a
small constant. While means are used for interpretability, medians or
log2 fold-changes may be more appropriate when data distributions are
skewed.
# Run paired Wilcoxon test for each metabolite
results <- map_df(metabolites, function(met) {
x <- wide[[paste0(met, "_BSA")]]
y <- wide[[paste0(met, "_No_BSA")]]
if(sum(!is.na(x) & !is.na(y)) >= 2) {
test <- wilcox.test(x, y, paired = TRUE, exact = FALSE)
mean_x <- mean(x, na.rm = TRUE)
mean_y <- mean(y, na.rm = TRUE)
} else {
test <- list(statistic = NA, p.value = NA)
mean_x <- mean(x, na.rm = TRUE)
mean_y <- mean(y, na.rm = TRUE)
}
data.frame(
Metabolite = met,
stat = test$statistic,
p_value = test$p.value,
mean_BSA = mean_x,
mean_No_BSA = mean_y
)
})
# Adjust p-values and calculate percent change
results <- results %>%
mutate(
p_adj = p.adjust(p_value, method = "fdr"),
percent_change = (mean_BSA - mean_No_BSA) / mean_No_BSA * 100,
significance = case_when(
p_adj < 0.001 ~ "***",
p_adj < 0.01 ~ "**",
p_adj < 0.05 ~ "*",
TRUE ~ ""
)
) %>%
arrange(p_adj)
# Show top results
head(results)
## Metabolite stat p_value mean_BSA mean_No_BSA
## V...1 Medium LDL-P (nmol/L) 0 7.790492e-10 460.8693934 714.971392
## V...2 Small HDL-P (μmol/L) 2 8.797194e-10 0.7124352 14.776995
## V...3 HDL-Z (nm) 1274 8.278932e-10 9.1331034 8.513289
## V...4 HDL-P (μmol/L) 10 1.425214e-09 14.8606742 33.079170
## V...5 VLDL-TG 1229 1.162832e-08 169.9373223 151.416199
## V...6 VLDL-P (nmol/L) 1225 1.457654e-08 142.5760702 126.449328
## p_adj percent_change significance
## V...1 6.744515e-09 -35.540163 ***
## V...2 6.744515e-09 -95.178754 ***
## V...3 6.744515e-09 7.280552 ***
## V...4 8.194980e-09 -55.075432 ***
## V...5 4.789436e-08 12.231930 ***
## V...6 4.789436e-08 12.753521 ***
⸻
To create a visual summary of the results, we generate a scatter plot
showing percent change on the x-axis and statistical significance on the
y-axis, represented as negative log10 of the adjusted p-value. This is
done using ggplot2, where each point reflects a
metabolite, and geom_text() is used to label only the
significant ones to avoid overcrowding. The
scale_color_identity() function allows custom colors
defined per row to be used directly in the plot. This volcano-style plot
helps highlight metabolites that exhibit both strong statistical
evidence and substantial biological change. Since extremely small
p-values can result in infinite values when transformed with
-log10(p_adj), we ensure that p_adj has a
lower bound by replacing zeros with a small constant (e.g.,
1e-300). To maintain clarity, only top hits or those marked
as significant are labeled, preventing visual clutter.
volcano <- results %>%
mutate(
neg_log10_p = -log10(p_adj),
color = case_when(
significance == "***" ~ "red",
significance == "**" ~ "orange",
significance == "*" ~ "blue",
TRUE ~ "gray"
)
)
p_volcano <- ggplot(volcano, aes(x = percent_change, y = neg_log10_p, label = Metabolite)) +
geom_point(aes(color = color), size = 3) +
scale_color_identity() +
geom_text(data = subset(volcano, significance != ""), hjust = 1.1, vjust = 0.5, size = 3) +
theme_minimal() +
xlab("Percent Change (BSA vs No_BSA)") +
ylab("-log10(FDR-adjusted p-value)") +
ggtitle("Volcano Plot of Metabolite Changes with BSA") +
theme(plot.title = element_text(hjust = 0.5))
p_volcano
⸻
To make the statistical results easily accessible for collaborators,
we save the results table to an Excel file using functions from the
openxlsx package. This involves creating a workbook
with createWorkbook(), adding a worksheet using
addWorksheet(), writing the data with
writeData(), and finally saving the file using
saveWorkbook(). Excel is a widely used format for data
sharing, and openxlsx allows writing files without
requiring Java, making it efficient and lightweight. Before saving, it’s
important to confirm the file path and ensure write permissions are in
place. Using overwrite = TRUE can prevent errors during
saving, but care should be taken not to overwrite important files
unintentionally. Optionally, separate sheets can be included for
significant and non-significant metabolites to enhance clarity and
reviewability.
wb <- createWorkbook()
addWorksheet(wb, "Metabolite_Stats")
writeData(wb, "Metabolite_Stats", results)
# Save workbook
saveWorkbook(wb, "Metabolite_Comparison_BSA.xlsx", overwrite = TRUE)
⸻
To visually support the statistical findings and explore distributional patterns, we generate a consistent panel of plots for each metabolite: - a scatter plot comparing BSA vs No_BSA values - a density plot showing the distribution per method - a boxplot comparing methods with annotated Wilcoxon p-values.
The data is reshaped using pivot_longer() and tidy
selection with !!sym(col) to prepare for plotting. The
scatter plot uses geom_point() and
geom_smooth(method = "lm") to show linear trends, along
with stat_cor() to display Pearson correlation. The density
plot employs geom_density() to highlight distribution
overlap, while the boxplot combines geom_boxplot() and
geom_jitter() to show individual data points, with
stat_compare_means(method = "wilcox.test", label = "p.signif")
for significance annotation.
These plots are arranged using patchwork syntax as
(scatter + density) / boxplot, creating a clear and
consistent 2x2 layout. This approach helps reveal systematic biases,
distribution shapes, and statistical differences. Before plotting, NA
values are filtered out to avoid warnings or dropped observations.
For metabolites with very few observations, regression lines or correlation stats may be skipped or annotated to reflect limited sample size. Since each metabolite is plotted separately, axis scaling remains consistent within each panel, avoiding distortion from extreme value ranges.
make_comparison_plots <- function(data, metabolite) {
col1 <- paste0(metabolite, "_BSA")
col2 <- paste0(metabolite, "_No_BSA")
if(!(col1 %in% names(data) & col2 %in% names(data))) return(NULL)
df_long <- data %>%
mutate(Sample = row_number()) %>%
select(Sample, !!sym(col1), !!sym(col2)) %>%
pivot_longer(cols = -Sample, names_to = "Method", values_to = "Value") %>%
mutate(Method = ifelse(Method == col1, "Using BSA", "No BSA"))
# Scatter
scatter <- ggplot(data %>% filter(!is.na(!!sym(col1)) & !is.na(!!sym(col2))),
aes(x = !!sym(col1), y = !!sym(col2))) +
geom_point(color = "#1f77b4", alpha = 0.7) +
geom_smooth(method = "lm", color = "purple", se = FALSE) +
stat_cor(aes(label = after_stat(r.label)), method = "pearson") +
labs(title = paste0(metabolite, " (BSA vs No_BSA)"),
x = "Using BSA", y = "No BSA") +
theme_minimal()
# Density
density <- ggplot(df_long, aes(x = Value, fill = Method, color = Method)) +
geom_density(alpha = 0.3) +
labs(title = paste0(metabolite, " - Distribution"), x = metabolite, y = "Density") +
theme_minimal()
# Boxplot
boxplot <- ggplot(df_long, aes(x = Method, y = Value, fill = Method)) +
geom_boxplot(alpha = 0.6, outlier.shape = NA) +
geom_jitter(width = 0.1, alpha = 0.3) +
stat_compare_means(method = "wilcox.test", label = "p.signif") +
labs(title = paste0(metabolite, " - Comparison"), x = "", y = metabolite) +
theme_minimal()
combined <- (scatter + density) / boxplot +
plot_annotation(title = paste("Comparison for", metabolite),
theme = theme(plot.title = element_text(face = "bold", size = 14)))
return(combined)
}
# Generate plots for all metabolites
plots_list <- lapply(metabolites, function(met) make_comparison_plots(wide, met))
names(plots_list) <- metabolites
# Display first plot as example
plots_list[[1]]
# Display all metabolite comparison plots
for (i in seq_along(plots_list)) {
cat("###", names(plots_list)[i], "\n") # Add a section header in the notebook
print(plots_list[[i]]) # Display each plot
}
## ### VLDL-C
## ### IDL-C
## ### LDL-C
## ### HDL-C
## ### VLDL-TG
## ### IDL-TG
## ### LDL-TG
## ### HDL-TG
## ### VLDL-P (nmol/L)
## ### Large VLDL-P (nmol/L)
## ### Medium VLDL-P (nmol/L)
## ### Small VLDL-P (nmol/L)
## ### LDL-P (nmol/L)
## ### Large LDL-P (nmol/L)
## ### Medium LDL-P (nmol/L)
## ### Small LDL-P (nmol/L)
## ### HDL-P (μmol/L)
## ### Large HDL-P (μmol/L)
## ### Medium HDL-P (μmol/L)
## ### Small HDL-P (μmol/L)
## ### VLDL-Z (nm)
## ### LDL-Z (nm)
## ### HDL-Z (nm)
⸻
To archive or share the full set of per-metabolite comparison plots,
we render them into a single multi-page PDF using
lapply() to generate a list of plots with
make_comparison_plots(wide) for each metabolite.
The resulting plots_list is then written to a PDF using
pdf("All_Metabolite_Comparisons.pdf", ...), looping through
each plot with print(p) to place one on each page. This
format is ideal for reviewers, as each page presents a complete
visualization for a single metabolite. To avoid blank pages, any
NULL plots—resulting from missing data or insufficient
observations—are skipped during printing. If the number of metabolites
is large, the resulting file may be sizable; in such cases, exporting
individual PNGs per metabolite can be a useful alternative for embedding
in presentations or reports.
pdf("All_Metabolite_Comparisons.pdf", width = 10, height = 12)
for (met in metabolites) {
p <- make_comparison_plots(wide, met)
print(p)
}
dev.off()
## quartz_off_screen
## 2
⸻
This notebook:
• Cleans and preprocesses metabolite data.
• Performs paired Wilcoxon tests. • Adjusts p-values for multiple comparisons (FDR).
• Generates volcano plots and detailed per-metabolite comparisons.
• Exports results to Excel and visualizations to PDF.
⸻