The Systems Biology Framework

Welcome to the comprehensive Metagenomics pipeline. This document executes the journey from raw, noisy sequencer data to high-resolution ecological inference.

The Parameter Dictionary: Libraries

dada2: The machine-learning engine that models sequencer errors.
phyloseq: The architect. It binds disparate spreadsheets into one ecosystem.
vegan: The foundational ecology package for Alpha and Beta diversity.
DESeq2: The differential abundance engine, adapted from RNA-seq.

Core Configuration & Libraries

Why are we doing this?

Before executing bioinformatics, we must initialize our digital environment. R cannot perform genomics out of the box. We must load the Bioconductor ecosystem—a specialized suite of tools designed specifically for high-throughput DNA analysis. Furthermore, we must explicitly define the exact primer sequences the laboratory used. If we do not tell the computer what the primers look like, it cannot remove them, which will ruin our entire downstream analysis.

library(dada2)
library(phyloseq)
library(DECIPHER)
library(phangorn)
library(biomformat)
library(ggplot2)
library(vegan)
library(ape)
library(writexl)
library(DESeq2)
library(dplyr)
library(DT)
library(microbiome)
library(Biostrings)
library(ggpubr)
library(indicspecies)
library(betapart)
library(igraph)
library(Hmisc)
library(reshape2)
library(ggrepel)

# ---- CORE CONFIGURATION ----
path <- "C:/Users/Owner/Documents/BSP_2022/BSP_2025/Metabarcoding_training"
cutadapt_path <- "C:/Users/Owner/Downloads/cutadapt" # Verify this matches your system installation
system2(cutadapt_path, args = "--version") # Run shell commands from R

# Standard Universal Primers (To be tested via Cutadapt)
fwd_16S <- "CCTACGGGNGGCWGCAG"        # V3 Forward
rev_16S <- "GACTACHVGGGTATCTAATCC"   # V4 Reverse
fwd_ITS <- "GCATCGATGAAGAACGCAGC"    # ITS3 Forward
rev_ITS <- "TCCTCCGCTTATTGATATGC"    # ITS4 Reverse

PHASE 1: BACTERIAL 16S PIPELINE (DADA2)

1.1 Primer Removal (Cutadapt) & Quality Filtering

Why are we doing this?

When the sequencing machine reads DNA, it also sequences the artificial primers attached to the ends of the molecules. If we leave these artificial primers attached, DADA2 will view the slight chemical variations in the primers as actual evolutionary mutations, causing it to invent hundreds of “fake” bacterial species. We use cutadapt to cleanly slice them off.

Parameter Deep-Dive:

--discard-untrimmed: We ruthlessly instruct the algorithm to throw away any DNA string where the primer is missing. If the primer isn’t there, it’s either junk DNA or a massive sequencing error.
truncLen=c(250, 200): Sequencing chemistry physically degrades at the tail end of a read. We rigidly chop the forward reads at 250 bases and reverse reads at 200 to remove the blurry, high-error tails.
maxEE=c(2,2): “Maximum Expected Errors.” We discard any sequence where the accumulated machine errors mathematically equal more than 2 expected typos.

cat("\n\n======================================================\n")
cat("STARTING 16S PIPELINE - WATCH CUTADAPT RETENTION RATES\n")
cat("======================================================\n\n")

fnFs_16S <- sort(list.files(path, pattern = "__1_1\\.fastq\\.gz$", full.names = TRUE))
fnRs_16S <- sort(list.files(path, pattern = "__1_2\\.fastq\\.gz$", full.names = TRUE))
sample.names_16S <- sapply(strsplit(basename(fnFs_16S), "__1"), `[`, 1)

path_cut_16S <- file.path(path, "cutadapt_16S")
dir.create(path_cut_16S, showWarnings = FALSE)
fnFs_cut_16S <- file.path(path_cut_16S, paste0(sample.names_16S, "_F_cut.fastq.gz"))
fnRs_cut_16S <- file.path(path_cut_16S, paste0(sample.names_16S, "_R_cut.fastq.gz"))

# Cutadapt Trimming
for(i in seq_along(fnFs_16S)) {
  system2(cutadapt_path, args = c("-g", fwd_16S, "-G", rev_16S,
                                  "-o", shQuote(fnFs_cut_16S[i]), "-p", shQuote(fnRs_cut_16S[i]),
                                  shQuote(fnFs_16S[i]), shQuote(fnRs_16S[i]),
                                  "--discard-untrimmed", "--minimum-length=100"))
}

# DADA2 Filtering
filtFs_16S <- file.path(path, "filtered_16S", paste0(sample.names_16S, "_F_filt.fastq.gz"))
filtRs_16S <- file.path(path, "filtered_16S", paste0(sample.names_16S, "_R_filt.fastq.gz"))
names(filtFs_16S) <- sample.names_16S; names(filtRs_16S) <- sample.names_16S

out_16S <- filterAndTrim(fnFs_cut_16S, filtFs_16S, fnRs_cut_16S, filtRs_16S, 
                         truncLen=c(250, 200), maxN=0, maxEE=c(2,2), truncQ=2, 
                         rm.phix=TRUE, compress=TRUE, multithread=TRUE)

1.2 Quality Control Plots

We plot the Phred Quality Scores to visually verify that our truncLen parameters successfully removed the low-quality tails from our reads.

# 16S Quality Control Plots
print(plotQualityProfile(fnFs_16S[1:2]) + ggtitle("RAW Forward Reads (16S)"))

print(plotQualityProfile(filtFs_16S[1:2]) + ggtitle("FILTERED Forward Reads (16S)"))

1.3 Error Modeling & Taxonomy Assignment

Why are we doing this?

Older pipelines grouped DNA sequences into “OTUs” based on a blurry 97% similarity threshold. DADA2 abandons this entirely. The learnErrors function calculates the exact probability of an ‘A’ mutating into a ‘G’ inside your specific sequencing machine. By understanding the machine’s exact error rate, it can correct the typos and yield Amplicon Sequence Variants (ASVs)—mathematically exact, single-nucleotide-resolution biological strains. We then align these perfect sequences against the global SILVA database to determine their names.

# Error Learning & Core Algorithm
errF_16S <- learnErrors(filtFs_16S, multithread=TRUE)
errR_16S <- learnErrors(filtRs_16S, multithread=TRUE)
mergers_16S <- mergePairs(dada(filtFs_16S, err=errF_16S, multithread=TRUE), filtFs_16S, 
                          dada(filtRs_16S, err=errR_16S, multithread=TRUE), filtRs_16S, verbose=TRUE)
seqtab_nochim_16S <- removeBimeraDenovo(rowsum(makeSequenceTable(mergers_16S), group = sample.names_16S), method="consensus", multithread=TRUE)

# Taxonomy (SILVA)
seqtab_final_16S <- seqtab_nochim_16S[, nchar(getSequences(seqtab_nochim_16S)) >= 350]
taxa_16S <- assignTaxonomy(seqtab_final_16S, file.path(path, "silva_nr99_v138.2_toGenus_trainset.fa.gz"), multithread=TRUE)
taxa_16S <- addSpecies(taxa_16S, file.path(path, "silva_v138.2_assignSpecies.fa.gz"))

PHASE 2: FUNGAL / YEAST ITS PIPELINE

2.1 The Fungal Pipeline Crisis

Why are we doing this differently?

Fungal processing is distinctly harder than bacterial processing. Bacterial 16S genes are universally the exact same length. Fungal ITS regions, however, wildly fluctuate from 100 to 600 base pairs depending on the species.

** CRITICAL DIFFERENCE:** Notice that truncLen is entirely missing from the fungal filterAndTrim command below. If you apply a rigid truncLen here like you did for bacteria, you will accidentally decapitate valid, naturally long fungal sequences and ruin your taxonomy! Furthermore, we use the UNITE database, which is the global gold standard for fungal classification, rather than SILVA.

cat("\n\n======================================================\n")
cat("STARTING ITS PIPELINE - WATCH CUTADAPT RETENTION RATES\n")
cat("======================================================\n\n")

fnFs_ITS <- sort(list.files(path, pattern = "__2_1\\.fastq\\.gz$", full.names = TRUE))
fnRs_ITS <- sort(list.files(path, pattern = "__2_2\\.fastq\\.gz$", full.names = TRUE))
sample.names_ITS <- sapply(strsplit(basename(fnFs_ITS), "__2"), `[`, 1)

path_cut_ITS <- file.path(path, "cutadapt_ITS")
dir.create(path_cut_ITS, showWarnings = FALSE)
fnFs_cut_ITS <- file.path(path_cut_ITS, paste0(sample.names_ITS, "_F_cut.fastq.gz"))
fnRs_cut_ITS <- file.path(path_cut_ITS, paste0(sample.names_ITS, "_R_cut.fastq.gz"))

# Cutadapt Trimming
for(i in seq_along(fnFs_ITS)) {
  system2(cutadapt_path, args = c("-g", fwd_ITS, "-G", rev_ITS,
                                  "-o", shQuote(fnFs_cut_ITS[i]), "-p", shQuote(fnRs_cut_ITS[i]),
                                  shQuote(fnFs_ITS[i]), shQuote(fnRs_ITS[i]),
                                  "--discard-untrimmed", "--minimum-length=50"))
}

# DADA2 Filtering (NO TRUNCLEN)
filtFs_ITS <- file.path(path, "filtered_ITS", paste0(sample.names_ITS, "_F_filt.fastq.gz"))
filtRs_ITS <- file.path(path, "filtered_ITS", paste0(sample.names_ITS, "_R_filt.fastq.gz"))
names(filtFs_ITS) <- sample.names_ITS; names(filtRs_ITS) <- sample.names_ITS

out_ITS <- filterAndTrim(fnFs_cut_ITS, filtFs_ITS, fnRs_cut_ITS, filtRs_ITS, 
                         maxN=0, maxEE=c(2,2), truncQ=2, minLen=50, 
                         rm.phix=TRUE, compress=TRUE, multithread=TRUE)

# ITS Quality Control Plots
print(plotQualityProfile(fnFs_ITS[1:2]) + ggtitle("RAW Forward Reads (ITS)"))

print(plotQualityProfile(filtFs_ITS[1:2]) + ggtitle("FILTERED Forward Reads (ITS)"))

# Error Learning & Core Algorithm
errF_ITS <- learnErrors(filtFs_ITS, multithread=TRUE)
errR_ITS <- learnErrors(filtRs_ITS, multithread=TRUE)
mergers_ITS <- mergePairs(dada(filtFs_ITS, err=errF_ITS, multithread=TRUE), filtFs_ITS, 
                          dada(filtRs_ITS, err=errR_ITS, multithread=TRUE), filtRs_ITS, verbose=TRUE)
seqtab_final_ITS <- removeBimeraDenovo(rowsum(makeSequenceTable(mergers_ITS), group = sample.names_ITS), method="consensus", multithread=TRUE)

# Taxonomy (UNITE - Single Threaded for Windows Stability)
taxa_ITS <- assignTaxonomy(seqtab_final_ITS, file.path(path, "sh_general_release_dynamic_19.02.2025.fasta"), multithread=FALSE)

PHASE 3: PHYLOGENETIC TREES & PHYLOSEQ ASSEMBLY

3.1 Object Assembly & Host Contaminant Removal

Why are we doing this?

Right now, our data exists in four different objects: an ASV count matrix, a taxonomy dictionary, literal DNA strings, and our experimental metadata. This is useless for complex statistical modeling. phyloseq acts as the architect, fusing them into a single, unified “Ecosystem” object.

** Ecological Imperative:** Notice the subset_taxa commands at the end of the 16S block. In metabarcoding of plant materials, the 16S primers often accidentally sequence the plant’s own Chloroplasts and Mitochondria (which evolved from ancient bacteria). We must explicitly filter these out, or plant DNA will artificially dominate our bacterial statistics!

cat("\n--- Building Phylogenetic Trees & Phyloseq Objects ---\n")

metadata <- data.frame(
  SampleID = rownames(seqtab_final_16S),
  TimePoint = factor(ifelse(grepl("^0", rownames(seqtab_final_16S)), "Start", "Finish"), levels=c("Start", "Finish")),
  Treatment = factor(ifelse(grepl("CET", rownames(seqtab_final_16S)), "Controlled", "Spontaneous"))
)
rownames(metadata) <- metadata$SampleID

# 16S Assembly (Includes ML Tree)
# We use a Generalized Time Reversible (GTR) model to calculate the exact evolutionary distance between bacteria.
generate_ml_tree <- function(seqtab_matrix) {
  seqs <- getSequences(seqtab_matrix); names(seqs) <- seqs
  phang_align <- phyDat(as(AlignSeqs(DNAStringSet(seqs), anchor=NA, verbose=FALSE), "matrix"), type="DNA")
  return(optim.pml(pml(NJ(dist.ml(phang_align)), data=phang_align), model="GTR", optInv=TRUE, optGamma=TRUE, rearrangement="stochastic", control=pml.control(trace=0))$tree)
}

ps_16S <- phyloseq(otu_table(seqtab_final_16S, taxa_are_rows=FALSE), sample_data(metadata), 
                   tax_table(taxa_16S), phy_tree(generate_ml_tree(seqtab_final_16S)))
dna_16S <- Biostrings::DNAStringSet(taxa_names(ps_16S)); names(dna_16S) <- taxa_names(ps_16S)
ps_16S <- merge_phyloseq(ps_16S, dna_16S)
taxa_names(ps_16S) <- paste0("ASV_16S_", seq(ntaxa(ps_16S)))

# Filter Host Contaminants (CRITICAL)
ps_16S <- subset_taxa(ps_16S, (Order != "Chloroplast") | is.na(Order))
ps_16S <- subset_taxa(ps_16S, (Family != "Mitochondria") | is.na(Family))

# ITS Assembly (No Tree)
taxa_ITS <- readRDS(file.path(path, "Saved_R_Data", "taxa_ITS_UNITE.rds"))

ps_ITS <- phyloseq(otu_table(seqtab_final_ITS, taxa_are_rows=FALSE), sample_data(metadata), tax_table(taxa_ITS))
dna_ITS <- Biostrings::DNAStringSet(taxa_names(ps_ITS)); names(dna_ITS) <- taxa_names(ps_ITS)
ps_ITS <- merge_phyloseq(ps_ITS, dna_ITS)
taxa_names(ps_ITS) <- paste0("ASV_ITS_", seq(ntaxa(ps_ITS)))

# Save the final compiled objects
saveRDS(ps_16S, file = file.path(path, "Saved_R_Data", "ps_16S_final.rds"))
saveRDS(ps_ITS, file = file.path(path, "Saved_R_Data", "ps_ITS_final.rds"))
cat("\nPhyloseq objects successfully rebuilt with embedded DNA sequences.\n")

PHASE 4. BACTERIAL COMMUNITY ANALYSIS (16S)

4.1 16S Setup & Background Exports

Why are we doing this?

Sequencing machines are not perfect. Some samples may yield 50,000 reads, while others yield only 50. First, we prune any sample with < 1000 reads because they represent sequencer failures. Second, we must transform_sample_counts. This normalizes the data, converting raw read counts into Relative Abundances (percentages). Without this math, you cannot legally compare a heavily-sequenced tank to a lightly-sequenced tank.

# Clean and Transform
ps_clean_16S <- prune_samples(sample_sums(ps_16S) > 1000, ps_16S)
ps_rel_16S <- transform_sample_counts(ps_clean_16S, function(x) x / sum(x))

# Background File Exports
writeXStringSet(refseq(ps_clean_16S), file.path(path, "Bacteria_16S_rep.fasta"), format="fasta")
if (!is.null(phy_tree(ps_clean_16S, errorIfNULL = FALSE))) { 
  write.tree(phy_tree(ps_clean_16S), file.path(path, "Bacteria_16S_phylo.tre")) 
}

krona_16S <- psmelt(ps_clean_16S) %>% 
  filter(Abundance > 0) %>% 
  select(Abundance, Kingdom, Phylum, Class, Order, Family, Genus, Species) %>% 
  mutate(across(everything(), ~tidyr::replace_na(as.character(.), "Unclassified")))
write.table(krona_16S, file.path(path, "Bacteria_16S_Krona_Input.txt"), sep="\t", row.names=FALSE, col.names=FALSE, quote=FALSE)

4.2 16S Alpha Diversity

Why are we doing this?

Alpha diversity measures the internal complexity of a single sample. * Chao1/Observed: Measures “Richness” (How many different bacterial species are present?) * Shannon/Simpson: Measures “Evenness” (Are all bacteria equally represented, or is there an apex predator dominating the ecosystem?)

** Ecological Interpretation:** If you inoculate a coffee tank with a highly aggressive commercial starter culture, you want to see Alpha Diversity plummet over time. That plummet proves your starter culture successfully dominated the tank and outcompeted the background noise.

alpha_df_16S <- estimate_richness(ps_clean_16S, measures=c("Shannon", "Simpson", "Chao1", "Observed"))
write_xlsx(cbind(SampleID = rownames(alpha_df_16S), alpha_df_16S), file.path(path, "Bacteria_16S_Alpha_Diversity.xlsx"))

htmltools::tagList(datatable(alpha_df_16S, options = list(pageLength = 5, scrollX = TRUE), caption = "16S Alpha Diversity Indices"))

plot_richness(ps_clean_16S, x="TimePoint", color="Treatment", measures=c("Shannon", "Simpson")) +
  geom_boxplot(outlier.shape = NA, alpha=0.7) + 
  geom_jitter(width=0.1, alpha=0.6) + 
  theme_classic() + 
  labs(title="Bacteria 16S: Alpha Diversity Significance") +
  stat_compare_means(aes(group = Treatment), label = "p.signif", method = "wilcox.test")

4.3 16S Beta Diversity & PERMANOVA

Why are we doing this?

While Alpha looks internally, Beta Diversity looks externally, asking: “How mathematically different is Tank A from Tank B?” We use the Bray-Curtis algorithm, which calculates distance based on who is there and how many of them exist.

** Ecological Interpretation:** The PCoA plot below gives you a visual mapping of the tanks. But visuals are not proof. The PERMANOVA (adonis2) is the ultimate statistical test. If it yields a p-value < 0.05, you have mathematical proof that your controlled treatment significantly altered the entire ecosystem architecture compared to spontaneous fermentation.

dist_bray_16S <- phyloseq::distance(ps_clean_16S, method="bray")
write.csv(as.matrix(dist_bray_16S), file.path(path, "Bacteria_16S_BrayCurtis.csv"))

# PERMANOVA (The Statistical Proof)
sample_df_16S <- as(sample_data(ps_clean_16S), "data.frame")
permanova_16S <- adonis2(dist_bray_16S ~ Treatment * TimePoint, data = sample_df_16S, permutations = 999)
htmltools::tagList(datatable(as.data.frame(permanova_16S), options = list(dom = 't'), caption = "16S PERMANOVA Results"))

# PCoA Plot
plot_ordination(ps_rel_16S, ordinate(ps_rel_16S, method="PCoA", distance="bray"), color="TimePoint", shape="Treatment") +
  geom_point(size=4) + theme_minimal() + labs(title="Bacteria 16S: PCoA (Bray-Curtis)")

# UPGMA Tree
png(file.path(path, "Bacteria_16S_UPGMA.png"), width=800, height=600, res=100)
plot(hclust(dist_bray_16S, method="average"), main="Bacteria 16S: UPGMA Clustering", xlab="Samples", sub="")
dev.off()

## png 
##   2

plot(hclust(dist_bray_16S, method="average"), main="Bacteria 16S: UPGMA Clustering", xlab="Samples", sub="")

4.4 16S Taxonomic Composition

Why are we doing this?

Now that we have proven the tanks are different, we need to see who is actually inside them. These barplots and heatmaps strip away the complex math and simply show you the physical bacterial dominance.

top40_16S <- names(sort(taxa_sums(ps_clean_16S), decreasing=TRUE))[1:40]

# Barplot (Top 20)
plot_bar(prune_taxa(top40_16S[1:20], ps_clean_16S), x="SampleID", fill="Genus") + 
  facet_wrap(~Treatment, scales="free_x") + theme_bw() + labs(title="Bacteria 16S: Top 20 Genera")

# Heatmap (Top 40)
plot_heatmap(prune_taxa(top40_16S, ps_clean_16S), method="NMDS", distance="bray", 
             sample.order="Treatment", taxa.label="Genus", taxa.order="Phylum",
             low="#000033", high="#FF3300", na.value="white") + labs(title="Bacteria 16S: Taxonomic Heatmap")

4.5 16S Core Microbiome & DESeq2 (Differential Abundance)

Why are we doing this?

Barplots are visual guesses. DESeq2 is absolute mathematical truth. DESeq2 was built for RNA-seq, but we utilize its Negative Binomial generalized linear model to calculate exactly which bacteria shifted in abundance between treatments.

** The Crucial Parameter:** sfType="poscounts". Microbiome data is filled with zeroes (a bug exists in one tank, but is totally absent in another). Standard math crashes when trying to calculate means with zeroes. poscounts forces the algorithm to bypass the zeroes, ensuring our differential abundance math is valid!

** Ecological Interpretation (The Volcano Plot): * Green Dots (Top-Right):** Bacteria that exploded in population due to your treatment. Look for beneficial organisms here. * Red Dots (Top-Left): Bacteria that suffered a massive, statistically significant die-off. Look for aerobic pathogens and spoilage bugs here.

# Core Microbiome
ps_ctrl_16S <- subset_samples(ps_clean_16S, Treatment == "Controlled")
core_taxa_16S <- core_members(ps_ctrl_16S, detection = 0.001, prevalence = 0.8)

if(length(core_taxa_16S) > 0) {
  cat("\n**16S Core Taxa Detected:**\n")
  htmltools::tagList(datatable(as.data.frame(tax_table(ps_clean_16S)[core_taxa_16S, ]), options = list(pageLength = 5, scrollX = TRUE)))
}

## 
## **16S Core Taxa Detected:**

# DESeq2
ds2_16S <- DESeq(phyloseq_to_deseq2(ps_clean_16S, ~ TimePoint + Treatment), test="Wald", fitType="parametric", sfType="poscounts")
res_16S <- results(ds2_16S, contrast=c("Treatment", "Controlled", "Spontaneous"))
sigtab_16S <- res_16S[which(res_16S$padj < 0.05), ]

# DESeq2 Volcano Plot Visualization (With Taxonomy Annotations)
tax_df_16S <- as.data.frame(tax_table(ps_clean_16S))
volcano_data_16S <- cbind(as.data.frame(res_16S), tax_df_16S[rownames(res_16S), ])

volcano_data_16S <- volcano_data_16S[!is.na(volcano_data_16S$padj), ]

volcano_data_16S$Significance <- "Not Significant"
volcano_data_16S$Significance[volcano_data_16S$log2FoldChange > 1 & volcano_data_16S$padj < 0.05] <- "Significantly Increased (Controlled)"
volcano_data_16S$Significance[volcano_data_16S$log2FoldChange < -1 & volcano_data_16S$padj < 0.05] <- "Significantly Decreased (Spontaneous)"

volcano_data_16S$Label <- NA 
sig_rows_16S <- volcano_data_16S$Significance != "Not Significant"
volcano_data_16S$Label[sig_rows_16S] <- volcano_data_16S$Genus[sig_rows_16S] 

ggplot(volcano_data_16S, aes(x = log2FoldChange, y = -log10(padj), color = Significance)) +
  geom_point(alpha = 0.7, size = 2.5) +
  scale_color_manual(values = c("Not Significant" = "grey80", 
                                "Significantly Increased (Controlled)" = "#2CA02C", 
                                "Significantly Decreased (Spontaneous)" = "#D62728")) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "black") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "black") +
  geom_text_repel(aes(label = Label), size = 3.5, max.overlaps = 20, 
                  box.padding = 0.5, point.padding = 0.3, color = "black", show.legend = FALSE) +
  theme_bw() +
  labs(title = "DESeq2 Volcano Plot: Controlled vs. Spontaneous Fermentation",
       x = "Log2 Fold Change (Biological Magnitude)",
       y = "-Log10 Adjusted p-value (Statistical Confidence)") +
  theme(legend.position = "bottom")

if(nrow(sigtab_16S) > 0) {
  sig_df_16S <- cbind(as(sigtab_16S, "data.frame"), as(tax_table(ps_clean_16S)[rownames(sigtab_16S), ], "matrix"))
  write_xlsx(sig_df_16S, file.path(path, "Bacteria_16S_DESeq2.xlsx"))
  cat("\n**16S Significant DESeq2 Taxa:**\n")
  htmltools::tagList(datatable(sig_df_16S, options = list(pageLength = 5, scrollX = TRUE)))
}

## 
## **16S Significant DESeq2 Taxa:**

4.6 16S Indicator Species Analysis (IndVal)

Why are we doing this?

We are hunting for biomarkers. An Indicator Species must meet two strict statistical criteria: Fidelity (it is found in every single sample of a specific treatment) and Specificity (it is never found outside that treatment). Finding an indicator gives you a perfect diagnostic fingerprint for future field trials.

otu_mat_16S <- as(otu_table(ps_clean_16S), "matrix")
if(taxa_are_rows(ps_clean_16S)){ otu_mat_16S <- t(otu_mat_16S) }

ind_val_16S <- multipatt(otu_mat_16S, sample_data(ps_clean_16S)$Treatment, func = "IndVal.g", control = how(nperm=999))
ind_sum_16S <- ind_val_16S$sign %>% filter(p.value <= 0.05)

if(nrow(ind_sum_16S) > 0) {
  ind_final_16S <- cbind(as.data.frame(tax_table(ps_clean_16S)[rownames(ind_sum_16S), ]), ind_sum_16S)
  write_xlsx(ind_final_16S, file.path(path, "Bacteria_16S_Indicator.xlsx"))
  htmltools::tagList(datatable(ind_final_16S, options = list(pageLength = 5, scrollX = TRUE), caption = "16S Indicator Species"))
} else {
  cat("\n*No significant indicator species found for 16S.*\n")
}

4.7 16S Beta Diversity Partitioning

Why are we doing this?

When bacterial communities shift, we must understand the mechanism of how they shifted. * Turnover: Did the entire community flip, with new species completely replacing the old ones? * Nestedness: Did the environment simply become so harsh that weak species died off, leaving a nested subset of hardy survivors? This algorithm mathematically partitions the beta diversity to give you the answer.

# Extract OTU table and ensure samples are rows
otu_mat_part <- as(otu_table(ps_clean_16S), "matrix")
if(taxa_are_rows(ps_clean_16S)){ otu_mat_part <- t(otu_mat_part) }

# Calculate partitioned beta diversity (Abundance-based Bray-Curtis)
beta_part <- beta.pair.abund(otu_mat_part)

# Extract the mean components
mean_turnover <- mean(as.matrix(beta_part$beta.bray.bal)[lower.tri(as.matrix(beta_part$beta.bray.bal))])
mean_nestedness <- mean(as.matrix(beta_part$beta.bray.gra)[lower.tri(as.matrix(beta_part$beta.bray.gra))])
mean_total <- mean_turnover + mean_nestedness

# Create a summary dataframe
part_df <- data.frame(
  Component = c("Turnover (Species Replacement)", "Nestedness (Species Loss/Gain)"),
  Percentage = c((mean_turnover/mean_total)*100, (mean_nestedness/mean_total)*100)
)

# Plot the Partitioning
ggplot(part_df, aes(x = "", y = Percentage, fill = Component)) +
  geom_bar(stat = "identity", width = 0.5, color="black") +
  coord_flip() + theme_minimal() + 
  scale_fill_manual(values=c("#E69F00", "#56B4E9")) +
  labs(title = "Bacteria 16S: Drivers of Community Dissimilarity", x = "", y = "% Contribution to Beta Diversity")

4.8 16S Target Genera Ratio Dynamics (LAB vs. Spoilage)

Why are we doing this?

In agricultural fermentation (like coffee or cacao), the entire goal is safety and flavor preservation. We must track the specific transition from aerobic spoilage bacteria (which cause mold/rot) to anaerobic Lactic Acid Bacteria (LAB, which lower pH and protect the crop).

** Parameter Deep-Dive: The Pseudo-Count** Notice the + 0.0001 added to the ratio math below. If your Controlled tank was so highly efficient that it killed 100% of the spoilage bacteria, the denominator becomes 0. Dividing by zero violently crashes the R script. We add a microscopic “pseudo-count” so the math computes without altering the biological reality.

# Melt the phyloseq object to a dataframe
melt_16S <- psmelt(ps_rel_16S)

# Define functional groups (You can adjust these genera based on your specific top taxa)
lab_genera <- c("Lactobacillus", "Leuconostoc", "Lactiplantibacillus", "Weissella", "Pediococcus")
spoilage_genera <- c("Acetobacter", "Gluconobacter", "Enterobacter", "Klebsiella", "Pantoea")

# Sum abundances per sample for each group
ratio_df <- melt_16S %>%
  group_by(SampleID, Treatment, TimePoint) %>%
  dplyr::summarize(
    LAB_Abundance = sum(Abundance[Genus %in% lab_genera], na.rm = TRUE),
    Spoilage_Abundance = sum(Abundance[Genus %in% spoilage_genera], na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(LAB_to_Spoilage_Ratio = (LAB_Abundance + 0.0001) / (Spoilage_Abundance + 0.0001))

# Plot the Ratio Dynamics
ggplot(ratio_df, aes(x = TimePoint, y = log10(LAB_to_Spoilage_Ratio), fill = Treatment)) +
  geom_boxplot(alpha = 0.8, color="black", outlier.shape = NA) +
  geom_jitter(position = position_dodge(0.75), size = 2, alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  theme_bw() +
  scale_fill_manual(values=c("#2CA02C", "#D62728")) +
  labs(title = "16S Shift: LAB vs Spoilage/AAB Dynamics", 
       y = "Log10 (LAB / Spoilage Ratio)", 
       subtitle = "Values above 0 (red line) indicate Lactic Acid Bacteria dominance")

PHASE 5. FUNGAL COMMUNITY ANALYSIS (ITS)

In this section, we repeat the rigorous ecological workflow for the Fungal ITS community. Fungi and yeasts are the “flavor engines” of the ecosystem, responsible for synthesizing volatile aromatic compounds (esters) that define the quality of the final crop.

5.1 ITS Setup & Background Exports

Why are we doing this?

Just like the bacteria, we prune failed sequencing runs (<1000 reads) and normalize the library depths by converting to Relative Abundances.

# Clean and Transform
ps_clean_ITS <- prune_samples(sample_sums(ps_ITS) > 1000, ps_ITS)
ps_rel_ITS <- transform_sample_counts(ps_clean_ITS, function(x) x / sum(x))

# Background File Exports
writeXStringSet(refseq(ps_clean_ITS), file.path(path, "Fungi_ITS_rep.fasta"), format="fasta")
# Note: No tree export for ITS because we explicitly bypassed tree creation for fungi.

krona_ITS <- psmelt(ps_clean_ITS) %>% 
  filter(Abundance > 0) %>% 
  select(Abundance, Kingdom, Phylum, Class, Order, Family, Genus, Species) %>% 
  mutate(across(everything(), ~tidyr::replace_na(as.character(.), "Unclassified")))
write.table(krona_ITS, file.path(path, "Fungi_ITS_Krona_Input.txt"), sep="\t", row.names=FALSE, col.names=FALSE, quote=FALSE)

5.2 ITS Alpha Diversity

Why are we doing this?

Monitoring the richness and evenness of the yeast population. A successful yeast inoculation should dramatically drop the Alpha Diversity as the apex yeast outcompetes wild molds.

alpha_df_ITS <- estimate_richness(ps_clean_ITS, measures=c("Shannon", "Simpson", "Chao1", "Observed"))
write_xlsx(cbind(SampleID = rownames(alpha_df_ITS), alpha_df_ITS), file.path(path, "Fungi_ITS_Alpha_Diversity.xlsx"))

htmltools::tagList(datatable(alpha_df_ITS, options = list(pageLength = 5, scrollX = TRUE), caption = "ITS Alpha Diversity Indices"))

plot_richness(ps_clean_ITS, x="TimePoint", color="Treatment", measures=c("Shannon", "Simpson")) +
  geom_boxplot(outlier.shape = NA, alpha=0.7) + 
  geom_jitter(width=0.1, alpha=0.6) + 
  theme_classic() + 
  labs(title="Fungi ITS: Alpha Diversity Significance") +
  stat_compare_means(aes(group = Treatment), label = "p.signif", method = "wilcox.test")

5.3 ITS Beta Diversity & PERMANOVA

Why are we doing this?

Calculating the Bray-Curtis distance matrix to mathematically prove if the fungal architecture of the Controlled tanks differs from the Spontaneous tanks. The adonis2 PERMANOVA provides the final p-value.

dist_bray_ITS <- phyloseq::distance(ps_clean_ITS, method="bray")
write.csv(as.matrix(dist_bray_ITS), file.path(path, "Fungi_ITS_BrayCurtis.csv"))

# PERMANOVA
sample_df_ITS <- as(sample_data(ps_clean_ITS), "data.frame")
permanova_ITS <- adonis2(dist_bray_ITS ~ Treatment * TimePoint, data = sample_df_ITS, permutations = 999)
htmltools::tagList(datatable(as.data.frame(permanova_ITS), options = list(dom = 't'), caption = "ITS PERMANOVA Results"))

# PCoA Plot
plot_ordination(ps_rel_ITS, ordinate(ps_rel_ITS, method="PCoA", distance="bray"), color="TimePoint", shape="Treatment") +
  geom_point(size=4) + theme_minimal() + labs(title="Fungi ITS: PCoA (Bray-Curtis)")

# UPGMA Tree
png(file.path(path, "Fungi_ITS_UPGMA.png"), width=800, height=600, res=100)
plot(hclust(dist_bray_ITS, method="average"), main="Fungi ITS: UPGMA Clustering", xlab="Samples", sub="")
dev.off()

## png 
##   2

plot(hclust(dist_bray_ITS, method="average"), main="Fungi ITS: UPGMA Clustering", xlab="Samples", sub="")

5.4 ITS Taxonomic Composition

Visualizing the exact genera driving the fungal profile.

top40_ITS <- names(sort(taxa_sums(ps_clean_ITS), decreasing=TRUE))[1:40]

# Barplot (Top 20)
plot_bar(prune_taxa(top40_ITS[1:20], ps_clean_ITS), x="SampleID", fill="Genus") + 
  facet_wrap(~Treatment, scales="free_x") + theme_bw() + labs(title="Fungi ITS: Top 20 Genera")

# Heatmap (Top 40)
plot_heatmap(prune_taxa(top40_ITS, ps_clean_ITS), method="NMDS", distance="bray", 
             sample.order="Treatment", taxa.label="Genus", taxa.order="Phylum",
             low="#000033", high="#FF3300", na.value="white") + labs(title="Fungi ITS: Taxonomic Heatmap")

5.5 ITS Core Microbiome & DESeq2

Why are we doing this?

Applying the Negative Binomial model to find statistically significant yeast shifts.

** Ecological Interpretation: Intrageneric Warfare** Notice the specific command in the code below: paste0(rownames... ). We deliberately coded the Volcano Plot to print the ASV ID next to the Genus name (e.g., ASV_ITS_1 (Nakaseomyces)).

Because DADA2 achieves single-nucleotide resolution, it tracks individual strains. If your plot shows one Nakaseomyces ASV heavily in the green, and a different Nakaseomyces ASV heavily in the red, you have mathematically proven Strain Replacement. Your commercial starter culture aggressively consumed the resources, outcompeted, and eradicated the wild field strains of its own genus!

# Core Microbiome
ps_ctrl_ITS <- subset_samples(ps_clean_ITS, Treatment == "Controlled")
core_taxa_ITS <- core_members(ps_ctrl_ITS, detection = 0.001, prevalence = 0.8)

if(length(core_taxa_ITS) > 0) {
  cat("\n**ITS Core Taxa Detected:**\n")
  htmltools::tagList(datatable(as.data.frame(tax_table(ps_clean_ITS)[core_taxa_ITS, ]), options = list(pageLength = 5, scrollX = TRUE)))
}

## 
## **ITS Core Taxa Detected:**

# 2. Run the DESeq2 Engine
ds2_ITS <- DESeq(phyloseq_to_deseq2(ps_clean_ITS, ~ TimePoint + Treatment), test="Wald", fitType="parametric", sfType="poscounts")
res_ITS <- results(ds2_ITS, contrast=c("Treatment", "Controlled", "Spontaneous"))
sigtab_ITS <- res_ITS[which(res_ITS$padj < 0.05), ]

# 3. Export Significant Data to Excel/HTML
if(nrow(sigtab_ITS) > 0) {
  sig_df_ITS <- cbind(as(sigtab_ITS, "data.frame"), as(tax_table(ps_clean_ITS)[rownames(sigtab_ITS), ], "matrix"))
  write_xlsx(sig_df_ITS, file.path(path, "Fungi_ITS_DESeq2.xlsx"))
  cat("\n**ITS Significant DESeq2 Taxa:**\n")
  htmltools::tagList(datatable(sig_df_ITS, options = list(pageLength = 5, scrollX = TRUE)))
}

## 
## **ITS Significant DESeq2 Taxa:**

# ---------------------------------------------------------
# 4. Generate the Annotated Volcano Plot
# ---------------------------------------------------------
# Bind the Taxonomy Table to the full DESeq2 results
tax_df_ITS <- as.data.frame(tax_table(ps_clean_ITS))
volcano_data_ITS <- cbind(as.data.frame(res_ITS), tax_df_ITS[rownames(res_ITS), ])

# Remove NA values that will crash the plot
volcano_data_ITS <- volcano_data_ITS[!is.na(volcano_data_ITS$padj), ]

# Categorize the fungi for coloring
volcano_data_ITS$Significance <- "Not Significant"
volcano_data_ITS$Significance[volcano_data_ITS$log2FoldChange > 1 & volcano_data_ITS$padj < 0.05] <- "Significantly Increased (Controlled)"
volcano_data_ITS$Significance[volcano_data_ITS$log2FoldChange < -1 & volcano_data_ITS$padj < 0.05] <- "Significantly Decreased (Spontaneous)"

# Create a specific column just for the Labels (Genus Level)
volcano_data_ITS$Label <- NA # Leave everything blank by default
sig_rows_ITS <- volcano_data_ITS$Significance != "Not Significant"
volcano_data_ITS$Label[sig_rows_ITS] <- volcano_data_ITS$Genus[sig_rows_ITS] # Only name the significant fungi

# This pastes the unique ASV ID next to the Genus name!
volcano_data_ITS$Label[sig_rows_ITS] <- paste0(rownames(volcano_data_ITS)[sig_rows_ITS], " (", volcano_data_ITS$Genus[sig_rows_ITS], ")")

# Plot the Eruptions
ggplot(volcano_data_ITS, aes(x = log2FoldChange, y = -log10(padj), color = Significance)) +
  geom_point(alpha = 0.7, size = 2.5) +
  scale_color_manual(values = c("Not Significant" = "grey80", 
                                "Significantly Increased (Controlled)" = "#2CA02C", 
                                "Significantly Decreased (Spontaneous)" = "#D62728")) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "black") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "black") +
  geom_text_repel(aes(label = Label), size = 3.5, max.overlaps = 20, 
                  box.padding = 0.5, point.padding = 0.3, color = "black", show.legend = FALSE) +
  theme_bw() +
  labs(title = "ITS DESeq2 Volcano Plot: Fungal Shifts in Fermentation",
       x = "Log2 Fold Change (Biological Magnitude)",
       y = "-Log10 Adjusted p-value (Statistical Confidence)") +
  theme(legend.position = "bottom")

5.6 ITS Indicator Species Analysis (IndVal)

otu_mat_ITS <- as(otu_table(ps_clean_ITS), "matrix")
if(taxa_are_rows(ps_clean_ITS)){ otu_mat_ITS <- t(otu_mat_ITS) }

ind_val_ITS <- multipatt(otu_mat_ITS, sample_data(ps_clean_ITS)$Treatment, func = "IndVal.g", control = how(nperm=999))
ind_sum_ITS <- ind_val_ITS$sign %>% filter(p.value <= 0.05)

if(nrow(ind_sum_ITS) > 0) {
  ind_final_ITS <- cbind(as.data.frame(tax_table(ps_clean_ITS)[rownames(ind_sum_ITS), ]), ind_sum_ITS)
  write_xlsx(ind_final_ITS, file.path(path, "Fungi_ITS_Indicator.xlsx"))
  htmltools::tagList(datatable(ind_final_ITS, options = list(pageLength = 5, scrollX = TRUE), caption = "ITS Indicator Species"))
} else {
  cat("\n*No significant indicator species found for ITS.*\n")
}

5.7 ITS Beta Diversity Partitioning (Turnover vs. Nestedness)

By partitioning the fungal beta diversity, we can prove whether yeast community shifts are driven by complete species replacement (Turnover) or simply species loss (Nestedness).

# Extract OTU table and ensure samples are rows
otu_mat_part <- as(otu_table(ps_clean_ITS), "matrix")
if(taxa_are_rows(ps_clean_ITS)){ otu_mat_part <- t(otu_mat_part) }

# Calculate partitioned beta diversity (Abundance-based Bray-Curtis)
beta_part <- beta.pair.abund(otu_mat_part)

# Extract the mean components
mean_turnover <- mean(as.matrix(beta_part$beta.bray.bal)[lower.tri(as.matrix(beta_part$beta.bray.bal))])
mean_nestedness <- mean(as.matrix(beta_part$beta.bray.gra)[lower.tri(as.matrix(beta_part$beta.bray.gra))])
mean_total <- mean_turnover + mean_nestedness

# Create a summary dataframe
part_df <- data.frame(
  Component = c("Turnover (Replacement)", "Nestedness (Loss/Gain)"),
  Percentage = c((mean_turnover/mean_total)*100, (mean_nestedness/mean_total)*100)
)

# Plot the Partitioning
ggplot(part_df, aes(x = "", y = Percentage, fill = Component)) +
  geom_bar(stat = "identity", width = 0.5, color="black") +
  coord_flip() + theme_minimal() + 
  scale_fill_manual(values=c("#E69F00", "#56B4E9")) +
  labs(title = "Fungi ITS: Drivers of Community Dissimilarity", x = "", y = "% Contribution to Beta Diversity")

5.8 ITS Co-occurrence Network Analysis (The Interactome)

Why are we doing this?

We are visualizing ecological warfare. A fermentation tank is not a peaceful environment; it is a battleground. This algorithm calculates Spearman correlations between the top fungi to see which organisms are cooperating, and which are actively driving each other to extinction.

** Parameter Deep-Dive: * type="spearman": We use Spearman rather than Pearson correlation because microbial growth is highly non-linear. * simplify(..., edge.attr.comb = "first"): This prevents the graph from turning into an unreadable “hairball” by merging redundant connecting lines while preserving the statistical color coding. * layout_with_fr**: The Fruchterman-Reingold algorithm. This is a physics simulation that treats the microbes as charged particles.

** Ecological Interpretation: * Blue Lines (Mutualism):** These fungi rely on each other to survive. The physics algorithm pulls them together. * Red Lines (Competitive Exclusion): These fungi are actively competing for the same resources. The physics algorithm pushes them apart, untangling the web into a readable map.

# Keep only the top 30 most abundant taxa to prevent an unreadable "hairball" graph
top30_taxa_ITS <- names(sort(taxa_sums(ps_clean_ITS), decreasing=TRUE))[1:30]
ps_net_ITS <- prune_taxa(top30_taxa_ITS, ps_clean_ITS)

# Extract OTU table and Genus names
net_otu <- as(otu_table(ps_net_ITS), "matrix")
if(taxa_are_rows(ps_net_ITS)){ net_otu <- t(net_otu) }
colnames(net_otu) <- as.data.frame(tax_table(ps_net_ITS))$Genus

# Calculate Spearman correlation matrix and p-values
cor_res <- rcorr(as.matrix(net_otu), type="spearman")
R_mat <- cor_res$r
P_mat <- cor_res$P

# Flatten matrices to build the network edges
edges <- melt(R_mat, varnames = c("Taxon_A", "Taxon_B"), value.name = "Correlation")
p_edges <- melt(P_mat, varnames = c("Taxon_A", "Taxon_B"), value.name = "p_value")
edges$p_value <- p_edges$p_value

# Filter for strong, significant interactions only
edges_sig <- edges %>%
  filter(Taxon_A != Taxon_B) %>%
  filter(p_value < 0.05) %>%
  filter(abs(Correlation) > 0.6)

# Build the igraph network object
if(nrow(edges_sig) > 0) {
  net_graph <- graph_from_data_frame(edges_sig, directed = FALSE)
  
  # THE FIX: Tell simplify to preserve the edge attributes!
  net_graph <- simplify(net_graph, edge.attr.comb = "first") 
  
  # Assign edge colors (Blue = Positive/Co-occurrence, Red = Negative/Exclusion)
  E(net_graph)$color <- ifelse(E(net_graph)$Correlation > 0, "#0072B2", "#D55E00")
  E(net_graph)$weight <- abs(E(net_graph)$Correlation) * 8 # Edge thickness
  
  # Plot the network
  set.seed(42) # For reproducible layout
  plot(net_graph, 
       layout = layout_with_fr(net_graph), # Fruchterman-Reingold layout
       vertex.size = 12, 
       vertex.label.cex = 0.8, 
       vertex.label.color = "black",
       vertex.color = "lightblue",
       edge.width = E(net_graph)$weight,
       main = "Fungi ITS: Co-occurrence Network (Top 30 Genera)",
       sub = "Blue = Positive Correlation | Red = Negative Correlation")
} else {
  cat("\n*No significant strong correlations found among the top taxa.*\n")
}

The Metabarcoding Masterclass: 16S & ITS

Jeshurun Asher Tarun, Ph.D.

2026-06-21

The Systems Biology Framework

The Parameter Dictionary: Libraries

Core Configuration & Libraries

Why are we doing this?

PHASE 1: BACTERIAL 16S PIPELINE (DADA2)

1.1 Primer Removal (Cutadapt) & Quality Filtering

Why are we doing this?

Parameter Deep-Dive:

1.2 Quality Control Plots

1.3 Error Modeling & Taxonomy Assignment

Why are we doing this?

PHASE 2: FUNGAL / YEAST ITS PIPELINE

2.1 The Fungal Pipeline Crisis

Why are we doing this differently?

PHASE 3: PHYLOGENETIC TREES & PHYLOSEQ ASSEMBLY

3.1 Object Assembly & Host Contaminant Removal

Why are we doing this?

PHASE 4. BACTERIAL COMMUNITY ANALYSIS (16S)

4.1 16S Setup & Background Exports

Why are we doing this?

4.2 16S Alpha Diversity

Why are we doing this?

4.3 16S Beta Diversity & PERMANOVA

Why are we doing this?

4.4 16S Taxonomic Composition

Why are we doing this?

4.5 16S Core Microbiome & DESeq2 (Differential Abundance)

Why are we doing this?

4.6 16S Indicator Species Analysis (IndVal)

Why are we doing this?

4.7 16S Beta Diversity Partitioning

Why are we doing this?

4.8 16S Target Genera Ratio Dynamics (LAB vs. Spoilage)

Why are we doing this?

PHASE 5. FUNGAL COMMUNITY ANALYSIS (ITS)

5.1 ITS Setup & Background Exports

Why are we doing this?

5.2 ITS Alpha Diversity

Why are we doing this?

5.3 ITS Beta Diversity & PERMANOVA

Why are we doing this?

5.4 ITS Taxonomic Composition

5.5 ITS Core Microbiome & DESeq2

Why are we doing this?

5.6 ITS Indicator Species Analysis (IndVal)

5.7 ITS Beta Diversity Partitioning (Turnover vs. Nestedness)

5.8 ITS Co-occurrence Network Analysis (The Interactome)

Why are we doing this?