Load required libraries for microbiome data analysis and visualization
library(ggplot2) # Widely used package for data visualization
library(broom) # Converts statistical analysis objects into tidy tibbles
library(knitr) # Dynamic report generation; enables formatted tables and results in reports
library(phyloseq) # For organizing and analyzing microbiome census data
library(microbiome) # Additional utilities for microbiome data exploration and transformation
library(vegan) # Ecological diversity analysis and ordination methods
#Other commonly used libraries:
#library(tibble) # Modern reimagining of data.frames (more readable printing and enhanced functionality)
#library(dplyr) # Grammar of data manipulation (filtering, selecting, grouping, summarizing)
#library(compositions) # Tools for compositional data analysis (e.g., CLR transformation)
This code block loads raw microbiome data from CSV files, including
feature counts, taxonomy annotations, and sample metadata. It also
defines a grouping variable (Diet) for downstream
analysis.
Here’s an updated and concise Markdown explanation describing the three input files:
This section loads three key input files required for microbiome analysis:
A matrix of raw feature abundances. Rows represent OTUs or ASVs
(e.g., OTUNAME1), and columns represent samples.
Example format:
#NAME,Sample1,Sample2,Sample3...
OTUNAME1,10,5,0
OTUNAME2,3,8,1
Metadata table describing experimental variables for each sample
(rows = samples, columns = variables such as Diet,
Treatment). Example format:
#NAME,Diet,Treatment
Sample1,Low,T
Sample2,Low,F
Sample3,High,T
Taxonomic assignments for each OTU/ASV. Each row corresponds to a feature and includes hierarchical ranks (e.g., Domain, Phylum, Class, etc.). Example format:
#TAXONOMY,Domain,Phylum,Class,...
OTUNAME1,Bacteria,Firmicutes,Bacilli,...
OTUNAME2,Bacteria,Bacteroidetes,Bacteroidia,...
The Diet variable from the mapping file is extracted and
set for grouping in downstream analyses.
# Read the raw counts
counts_table <- read.csv("counts.csv", row.names = 1)
# Read taxonomy
taxon_table <- read.csv("taxonomy.csv", row.names = 1)
# Read experimental mapping file
mapping <- read.csv("mapping.csv", row.names = 1)
# Set grouping variable
variable <- "Diet"
To streamline downstream microbiome analysis, the raw count data,
taxonomy annotations, and sample metadata are combined into a single
phyloseq object. This unified structure ensures
compatibility across functions for filtering, normalization,
visualization, and statistical modeling.
The code below constructs this object and also extracts a synchronized metadata table to ensure all components (samples and taxa) are aligned for consistent analysis.
otu <- otu_table(as.matrix(counts_table), taxa_are_rows = TRUE)
tax <- tax_table(as.matrix(taxon_table))
sampledata <- sample_data(mapping)
# This command synchronizes the tables, keeping only overlapping samples and taxa
physeq <- phyloseq(otu, tax, sampledata)
# Create a synchronized metadata object for all downstream analysis. This is crucial.
physeq_meta <- as(sample_data(physeq), "data.frame")
This block performs ordination analysis on microbiome data using the Bray-Curtis distance and PCoA (Principal Coordinates Analysis), then plots the result, colored by a chosen variable (e.g. Diet).
# Total Sum Scaling (TSS) normalization
physeq_tss <- transform_sample_counts(physeq, function(x) x / sum(x))
# Bray-Curtis distance (requires samples in rows, so we transpose the OTU table)
bray_dist <- vegdist(t(as.matrix(otu_table(physeq_tss))), method = "bray")
# PCoA
ordination_bray <- cmdscale(bray_dist, k = 2, eig = TRUE)
ordination_df <- as.data.frame(ordination_bray$points)
ordination_df$Sample <- rownames(ordination_df)
ordination_df <- merge(ordination_df, physeq_meta, by.x = "Sample", by.y = "row.names")
# Plot
ggplot(ordination_df, aes(x = V1, y = V2, color = .data[[variable]])) +
geom_point(size = 4) +
labs(
title = "PCoA using Bray-Curtis",
x = "PC1", y = "PC2", color = variable
) +
theme_minimal()
The Compositional Data Analysis (CoDA) approach starts by applying a Centered Log-Ratio (CLR) transformation to the count data. This transformation addresses the compositional nature of microbiome data by expressing each taxon relative to the geometric mean of the sample, making the data suitable for standard multivariate methods.
Once transformed, we compute the Aitchison distance, which is simply the Euclidean distance on CLR-transformed data. This ensures that the distance metric respects the properties of compositional data.
Finally, we perform PCoA and visualize the ordination, just as we did with the Bray-Curtis method, but now in a statistically appropriate space.
# The microbiome::transform function correctly handles zeros by default.
physeq_clr <- microbiome::transform(physeq, "clr")
# Aitchison distance (Euclidean distance on CLR-transformed data)
aitch_dist <- dist(t(as.matrix(otu_table(physeq_clr))), method = "euclidean")
# PCoA
ordination_aitch <- cmdscale(aitch_dist, k = 2, eig = TRUE)
ordination_df2 <- as.data.frame(ordination_aitch$points)
ordination_df2$Sample <- rownames(ordination_df2)
ordination_df2 <- merge(ordination_df2, physeq_meta, by.x = "Sample", by.y = "row.names")
# Plot
ggplot(ordination_df2, aes(x = V1, y = V2, color = .data[[variable]])) +
geom_point(size = 4) +
labs(
title = "PCoA using Aitchison Distance (CLR)",
x = "PC1", y = "PC2", color = variable
) +
theme_minimal()
In this step, we test whether the differences in microbial community
composition between groups (e.g., diet types) are statistically
significant. We use PERMANOVA (Permutational
Multivariate Analysis of Variance), implemented via the
adonis2 function from the vegan package.
This method partitions variation in a distance matrix according to one or more variables from the metadata. We run PERMANOVA twice:
The results tell us whether groupings based on the chosen variable (e.g., Diet) explain a significant portion of the variance in community composition.
# Classical
formula_bray <- as.formula(paste("bray_dist ~", variable))
adonis_bray <- adonis2(formula_bray, data = physeq_meta)
kable(tidy(adonis_bray), caption = "PERMANOVA on Bray-Curtis distances.")
| term | df | SumOfSqs | R2 | statistic | p.value |
|---|---|---|---|---|---|
| Model | 1 | 0.0064268 | 0.8351549 | 20.2652 | 0.1 |
| Residual | 4 | 0.0012685 | 0.1648451 | NA | NA |
| Total | 5 | 0.0076954 | 1.0000000 | NA | NA |
# Compositional
formula_aitch <- as.formula(paste("aitch_dist ~", variable))
adonis_aitch <- adonis2(formula_aitch, data = physeq_meta)
kable(tidy(adonis_aitch), caption = "PERMANOVA on Aitchison distances.")
| term | df | SumOfSqs | R2 | statistic | p.value |
|---|---|---|---|---|---|
| Model | 1 | 6.102910 | 0.6900676 | 8.906041 | 0.1 |
| Residual | 4 | 2.741020 | 0.3099324 | NA | NA |
| Total | 5 | 8.843931 | 1.0000000 | NA | NA |
Before fully trusting the PERMANOVA results, it’s important to check whether differences between groups are due to actual shifts in composition rather than differences in variability or spread within groups.
This is done using a beta-dispersion test, which is conceptually similar to checking for homogeneity of variances in traditional ANOVA. If group dispersions are significantly different, the PERMANOVA might detect “differences” simply because one group is more variable.
Here, we apply the betadisper() function to both the
Bray-Curtis and Aitchison distance matrices to test for such effects,
followed by an ANOVA to assess significance.
# Extract the grouping vector from the synchronized metadata
grouping_vector <- physeq_meta[[variable]]
# Classical
disp_bray <- betadisper(bray_dist, grouping_vector)
cat("### Beta-dispersion results for Bray-Curtis\n")
kable(tidy(anova(disp_bray)), caption = "Beta-dispersion on Bray-Curtis distances.")
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| Groups | 1 | 8.40e-06 | 8.40e-06 | 0.2034321 | 0.6753348 |
| Residuals | 4 | 1.66e-04 | 4.15e-05 | NA | NA |
# Compositional
disp_aitch <- betadisper(aitch_dist, grouping_vector)
cat("\n### Beta-dispersion results for Aitchison\n")
kable(tidy(anova(disp_aitch)), caption = "Beta-dispersion on Aitchison distances.")
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| Groups | 1 | 0.0588322 | 0.0588322 | 0.8841407 | 0.4003081 |
| Residuals | 4 | 0.2661665 | 0.0665416 | NA | NA |
QIB Core Bioinformatics