This tutorial shows you how to compare beta diversity of different sample groups and test for statistically significant differences in microbial community composition. Requires tidyverse, vegan, csv, and phyloseq.
(Packages must be installed prior to running this tutorial)
library(tidyverse)
## Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'
## had status 1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(phyloseq)
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(csv)
library(ape)
##
## Attaching package: 'ape'
##
## The following object is masked from 'package:dplyr':
##
## where
library(pairwiseAdonis)
## Loading required package: cluster
#set seed for reproducible data
set.seed(81)
This tutorial assumes that you have a phyloseq object of the data that you want to plot. The example phyloseq object shown here has 9 samples, 9 sample variables, and 12,003 unique taxa.
toy_ps <- read_rds("/home/lgschaer/old/Tutorials/toy_phyloseq.rds")
toy_ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 12003 taxa and 9 samples ]
## sample_data() Sample Data: [ 9 samples by 9 sample variables ]
## tax_table() Taxonomy Table: [ 12003 taxa by 7 taxonomic ranks ]
This phyloseq object has 9 sample variables: SampleID, Environoment, Sample_Type, Replicate, Substrate, Transfer, Substrate_Label, Environment_Label_Location, and Environment_Label.
head(sample_data(toy_ps))
## SampleID Environment Sample_Type Replicate Substrate Transfer
## CCONR1T1 CCONR1T1 Compost Enrichment R1 Control T1
## CCONR2T1 CCONR2T1 Compost Enrichment R2 Control T1
## CCONR3T1 CCONR3T1 Compost Enrichment R3 Control T1
## CTAR1T1 CTAR1T1 Compost Enrichment R1 TA T1
## CTAR2T1 CTAR2T1 Compost Enrichment R2 TA T1
## CTAR3T1 CTAR3T1 Compost Enrichment R3 TA T1
## Substrate_Label Environment_Label_Location Environment_Label
## CCONR1T1 Control Compost\nCalumet, MI Compost
## CCONR2T1 Control Compost\nCalumet, MI Compost
## CCONR3T1 Control Compost\nCalumet, MI Compost
## CTAR1T1 Terephthalamide Compost\nCalumet, MI Compost
## CTAR2T1 Terephthalamide Compost\nCalumet, MI Compost
## CTAR3T1 Terephthalamide Compost\nCalumet, MI Compost
We visualize beta diversity by making a PCoA (Principal Components Analysis) ordination. Different types of ordinations can be used to evaluate different characteristics of the data.
#ordination
bray <- ordinate(
physeq = toy_ps, #change this to your phyloseq
method = "PCoA",
distance = "bray"
)
Make a simple plot without fancy formatting
plot_ordination(
physeq = toy_ps, #phyloseq object
ordination = bray)+ #ordination
geom_point(aes(fill = Substrate, shape = Replicate), size = 6)
Add layers of formatting to make the plot prettier
#plot
colors <- c("lightblue", "maroon", "olivedrab")
plot_ordination(
physeq = toy_ps, #phyloseq object
ordination = bray)+ #ordination
geom_point(aes(fill = Substrate, shape = Replicate), size = 6) +
scale_fill_manual(values = colors) +
scale_shape_manual(values = c(21, 22, 23, 24, 25))+
theme_linedraw() + #changes theme, removes grey background
theme(
legend.title = element_blank(), #removes legend title
legend.position = "bottom",
legend.text = element_text(size = 20, face = "bold"),
axis.text.y.left = element_text(size = 10),
axis.title.y = element_text(size = 20),
axis.text.x = element_text(size = 10),
axis.title.x = element_text(size = 20),
strip.text = element_text(face = "bold", size = 20))+
guides(fill = guide_legend(override.aes = list(shape = 21))) #fills legend points based on the fill command
Unifrac ordinations are based the phylogenetic distance, so the first step is to make a phylogenetic tree using the ape package
#adding a phylogenetic tree to phyloseq object using ape library
random_tree = rtree(ntaxa(toy_ps), rooted=TRUE, tip.label=taxa_names(toy_ps))
plot(random_tree)
toy_ps2 = merge_phyloseq(toy_ps, random_tree)
toy_ps2
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 12003 taxa and 9 samples ]
## sample_data() Sample Data: [ 9 samples by 9 sample variables ]
## tax_table() Taxonomy Table: [ 12003 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 12003 tips and 12002 internal nodes ]
#ordination
uni <- ordinate(
physeq = toy_ps2,
method = "PCoA",
distance = "unifrac"
)
#summary(distance)
#distance
#plot
colors <- c("lightblue", "maroon", "olivedrab")
plot_ordination(
physeq = toy_ps2, #phyloseq object
ordination = uni)+ #ordination
geom_point(aes(fill = Substrate, shape = Replicate), size = 6) + #sets fill color to sampletype
scale_fill_manual(values = colors) +
scale_shape_manual(values = c(21, 22, 23, 24, 25))+
theme_linedraw() + #changes theme, removes grey background
theme(
legend.title = element_blank(), #removes legend title
#legend.background = element_rect(fill = "white", color = "black"), #adds black boarder around legend
legend.position = "bottom",
legend.text = element_text(size = 20, face = "bold"),
axis.text.y.left = element_text(size = 10),
axis.title.y = element_text(size = 20),
axis.text.x = element_text(size = 10),
axis.title.x = element_text(size = 20),
strip.text = element_text(face = "bold", size = 20))+
guides(fill = guide_legend(override.aes = list(shape = 21))) #fills legend points based on the fill command
We have visualized differences between sample groups, but are these differences statistically significant? We will use a PERMANOVA to find out.
head(sample_data(toy_ps))
## SampleID Environment Sample_Type Replicate Substrate Transfer
## CCONR1T1 CCONR1T1 Compost Enrichment R1 Control T1
## CCONR2T1 CCONR2T1 Compost Enrichment R2 Control T1
## CCONR3T1 CCONR3T1 Compost Enrichment R3 Control T1
## CTAR1T1 CTAR1T1 Compost Enrichment R1 TA T1
## CTAR2T1 CTAR2T1 Compost Enrichment R2 TA T1
## CTAR3T1 CTAR3T1 Compost Enrichment R3 TA T1
## Substrate_Label Environment_Label_Location Environment_Label
## CCONR1T1 Control Compost\nCalumet, MI Compost
## CCONR2T1 Control Compost\nCalumet, MI Compost
## CCONR3T1 Control Compost\nCalumet, MI Compost
## CTAR1T1 Terephthalamide Compost\nCalumet, MI Compost
## CTAR2T1 Terephthalamide Compost\nCalumet, MI Compost
## CTAR3T1 Terephthalamide Compost\nCalumet, MI Compost
bray <- phyloseq::distance(toy_ps, method = "bray")
sam <- data.frame(sample_data(toy_ps))
adonis2(bray ~ Substrate + Replicate + Substrate*Replicate, data = sam, permutations = 999)
## No residual component
##
## adonis2(formula = bray ~ Substrate + Replicate + Substrate * Replicate, data = sam, permutations = 999)
## Df SumOfSqs R2 F Pr(>F)
## Model 8 2.8496 1
## Residual 0 0.0000 0
## Total 8 2.8496 1
#Comparing substrate type (no inocula samples)
sdata <- as.csv("/home/lgschaer/old/Tutorials/phyloseq/sdata.csv", row.names = 1, header = TRUE, sep = ",", check.names = TRUE, stringsAsFactors = TRUE) %>%
rownames_to_column(var = "SampleID")
head(sdata)
## SampleID Environment Sample_Type Replicate Substrate Transfer Substrate_Label
## 1 CCONR1T1 Compost Enrichment R1 Control T1 Control
## 2 CCONR2T1 Compost Enrichment R2 Control T1 Control
## 3 CCONR3T1 Compost Enrichment R3 Control T1 Control
## 4 CTAR1T1 Compost Enrichment R1 TA T1 Terephthalamide
## 5 CTAR2T1 Compost Enrichment R2 TA T1 Terephthalamide
## 6 CTAR3T1 Compost Enrichment R3 TA T1 Terephthalamide
## Environment_Label_Location Environment_Label
## 1 Compost\nCalumet, MI Compost
## 2 Compost\nCalumet, MI Compost
## 3 Compost\nCalumet, MI Compost
## 4 Compost\nCalumet, MI Compost
## 5 Compost\nCalumet, MI Compost
## 6 Compost\nCalumet, MI Compost
adonisData <- otu_table(toy_ps) %>% #use sequence table that has inocula removed
as.data.frame() %>%
rownames_to_column(var = "SampleID") %>% #change rownames to a column so there is a common variable to join by
left_join(sdata, by = "SampleID") %>% #join sample data to the sequence table
select(-c("SampleID", "Environment", "Replicate", "Transfer", "Sample_Type")) %>% #remove all metadata columns except the one to be used to compare
select(Substrate, everything()) #rearrange so substrate column is first
dim(adonisData)
## [1] 9 12007
head(adonisData[,1:10])
## Substrate sp1 sp2 sp3 sp4 sp5 sp6 sp7 sp8 sp9
## 1 Control 13 0 0 0 0 0 0 0 0
## 2 Control 20 0 0 0 0 0 0 0 0
## 3 Control 0 0 0 37 20 0 0 0 0
## 4 TA 0 0 462 0 0 0 0 0 0
## 5 TA 0 259 1968 0 0 0 0 0 0
## 6 TA 0 0 767 0 0 0 20 0 0
pairwise.adonis2(adonisData[,2:3722]~Substrate,method="bray",data=adonisData)
## 'nperm' >= set of all permutations: complete enumeration.
## Set of permutations < 'minperm'. Generating entire set.
## 'nperm' >= set of all permutations: complete enumeration.
## Set of permutations < 'minperm'. Generating entire set.
## 'nperm' >= set of all permutations: complete enumeration.
## Set of permutations < 'minperm'. Generating entire set.
## $parent_call
## [1] "adonisData[, 2:3722] ~ Substrate , strata = Null , permutations 999"
##
## $Control_vs_TA
## Df SumOfSqs R2 F Pr(>F)
## Substrate 1 0.56172 0.35175 2.1705 0.1
## Residual 4 1.03520 0.64825
## Total 5 1.59692 1.00000
##
## $Control_vs_TPA
## Df SumOfSqs R2 F Pr(>F)
## Substrate 1 0.92349 0.49628 3.9409 0.1
## Residual 4 0.93734 0.50372
## Total 5 1.86083 1.00000
##
## $TA_vs_TPA
## Df SumOfSqs R2 F Pr(>F)
## Substrate 1 0.92054 0.65543 7.6086 0.1
## Residual 4 0.48394 0.34457
## Total 5 1.40448 1.00000
##
## attr(,"class")
## [1] "pwadstrata" "list"