Beta Diversity Tutorial

This tutorial shows you how to compare beta diversity of different sample groups and test for statistically significant differences in microbial community composition. Requires tidyverse, vegan, csv, and phyloseq.

Start by loading the required R packages

(Packages must be installed prior to running this tutorial)

library(tidyverse)
## Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'
## had status 1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(phyloseq)
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(csv)
library(ape)
## 
## Attaching package: 'ape'
## 
## The following object is masked from 'package:dplyr':
## 
##     where
library(pairwiseAdonis)
## Loading required package: cluster
#set seed for reproducible data
set.seed(81)

Load your phyloseq object

This tutorial assumes that you have a phyloseq object of the data that you want to plot. The example phyloseq object shown here has 9 samples, 9 sample variables, and 12,003 unique taxa.

toy_ps <- read_rds("/home/lgschaer/old/Tutorials/toy_phyloseq.rds")  
toy_ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 12003 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 9 sample variables ]
## tax_table()   Taxonomy Table:    [ 12003 taxa by 7 taxonomic ranks ]

Explore the sample variables in the test phyloseq object

This phyloseq object has 9 sample variables: SampleID, Environoment, Sample_Type, Replicate, Substrate, Transfer, Substrate_Label, Environment_Label_Location, and Environment_Label.

head(sample_data(toy_ps))
##          SampleID Environment Sample_Type Replicate Substrate Transfer
## CCONR1T1 CCONR1T1     Compost  Enrichment        R1   Control       T1
## CCONR2T1 CCONR2T1     Compost  Enrichment        R2   Control       T1
## CCONR3T1 CCONR3T1     Compost  Enrichment        R3   Control       T1
## CTAR1T1   CTAR1T1     Compost  Enrichment        R1        TA       T1
## CTAR2T1   CTAR2T1     Compost  Enrichment        R2        TA       T1
## CTAR3T1   CTAR3T1     Compost  Enrichment        R3        TA       T1
##          Substrate_Label Environment_Label_Location Environment_Label
## CCONR1T1         Control       Compost\nCalumet, MI           Compost
## CCONR2T1         Control       Compost\nCalumet, MI           Compost
## CCONR3T1         Control       Compost\nCalumet, MI           Compost
## CTAR1T1  Terephthalamide       Compost\nCalumet, MI           Compost
## CTAR2T1  Terephthalamide       Compost\nCalumet, MI           Compost
## CTAR3T1  Terephthalamide       Compost\nCalumet, MI           Compost

Make a bray curtis ordination

We visualize beta diversity by making a PCoA (Principal Components Analysis) ordination. Different types of ordinations can be used to evaluate different characteristics of the data.

Make a bray curtis ordination

#ordination
bray <- ordinate(
  physeq = toy_ps, #change this to your phyloseq
  method = "PCoA", 
  distance = "bray" 
)

Make a simple plot without fancy formatting

plot_ordination(
  physeq = toy_ps,                                                          #phyloseq object
  ordination = bray)+                                                #ordination
  geom_point(aes(fill = Substrate, shape = Replicate), size = 6) 

Add layers of formatting to make the plot prettier

#plot
colors <- c("lightblue", "maroon", "olivedrab")

plot_ordination(
  physeq = toy_ps,                                                          #phyloseq object
  ordination = bray)+                                                        #ordination
  geom_point(aes(fill = Substrate, shape = Replicate), size = 6) +                         
  scale_fill_manual(values = colors) +
  scale_shape_manual(values = c(21, 22, 23, 24, 25))+
  theme_linedraw() +                                                         #changes theme, removes grey background
  theme(                             
    legend.title = element_blank(),                                          #removes legend title
    legend.position = "bottom",
    legend.text = element_text(size = 20, face = "bold"),                                 
    axis.text.y.left = element_text(size = 10),
    axis.title.y = element_text(size = 20),
    axis.text.x = element_text(size = 10),
    axis.title.x = element_text(size = 20),
    strip.text = element_text(face = "bold", size = 20))+
  guides(fill = guide_legend(override.aes = list(shape = 21)))          #fills legend points based on the fill command

Now make a unifrac ordination

Unifrac ordinations are based the phylogenetic distance, so the first step is to make a phylogenetic tree using the ape package

#adding a phylogenetic tree to phyloseq object using ape library
random_tree = rtree(ntaxa(toy_ps), rooted=TRUE, tip.label=taxa_names(toy_ps))
plot(random_tree)

toy_ps2 = merge_phyloseq(toy_ps, random_tree)
toy_ps2
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 12003 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 9 sample variables ]
## tax_table()   Taxonomy Table:    [ 12003 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 12003 tips and 12002 internal nodes ]
#ordination
uni <- ordinate(
  physeq = toy_ps2, 
  method = "PCoA", 
  distance = "unifrac"
)
#summary(distance)
#distance

#plot
colors <- c("lightblue", "maroon", "olivedrab")

plot_ordination(
  physeq = toy_ps2,                                                          #phyloseq object
  ordination = uni)+                                                #ordination
  geom_point(aes(fill = Substrate, shape = Replicate), size = 6) +                         #sets fill color to sampletype
  scale_fill_manual(values = colors) +
  scale_shape_manual(values = c(21, 22, 23, 24, 25))+
  theme_linedraw() +                                                      #changes theme, removes grey background
  theme(                             
    legend.title = element_blank(),                                      #removes legend title
    #legend.background = element_rect(fill = "white", color = "black"),  #adds black boarder around legend
    legend.position = "bottom",
    legend.text = element_text(size = 20, face = "bold"),                                 
    axis.text.y.left = element_text(size = 10),
    axis.title.y = element_text(size = 20),
    axis.text.x = element_text(size = 10),
    axis.title.x = element_text(size = 20),
    strip.text = element_text(face = "bold", size = 20))+
  guides(fill = guide_legend(override.aes = list(shape = 21)))          #fills legend points based on the fill command

Look for statistically significant differences between sample types

We have visualized differences between sample groups, but are these differences statistically significant? We will use a PERMANOVA to find out.

Overall PERMANOVA

head(sample_data(toy_ps))
##          SampleID Environment Sample_Type Replicate Substrate Transfer
## CCONR1T1 CCONR1T1     Compost  Enrichment        R1   Control       T1
## CCONR2T1 CCONR2T1     Compost  Enrichment        R2   Control       T1
## CCONR3T1 CCONR3T1     Compost  Enrichment        R3   Control       T1
## CTAR1T1   CTAR1T1     Compost  Enrichment        R1        TA       T1
## CTAR2T1   CTAR2T1     Compost  Enrichment        R2        TA       T1
## CTAR3T1   CTAR3T1     Compost  Enrichment        R3        TA       T1
##          Substrate_Label Environment_Label_Location Environment_Label
## CCONR1T1         Control       Compost\nCalumet, MI           Compost
## CCONR2T1         Control       Compost\nCalumet, MI           Compost
## CCONR3T1         Control       Compost\nCalumet, MI           Compost
## CTAR1T1  Terephthalamide       Compost\nCalumet, MI           Compost
## CTAR2T1  Terephthalamide       Compost\nCalumet, MI           Compost
## CTAR3T1  Terephthalamide       Compost\nCalumet, MI           Compost
bray <- phyloseq::distance(toy_ps, method = "bray")
sam <- data.frame(sample_data(toy_ps))
adonis2(bray ~ Substrate + Replicate + Substrate*Replicate, data = sam, permutations = 999)
## No residual component
## 
## adonis2(formula = bray ~ Substrate + Replicate + Substrate * Replicate, data = sam, permutations = 999)
##          Df SumOfSqs R2 F Pr(>F)
## Model     8   2.8496  1         
## Residual  0   0.0000  0         
## Total     8   2.8496  1

Pairwise PERMANOVA

#Comparing substrate type (no inocula samples)
sdata <- as.csv("/home/lgschaer/old/Tutorials/phyloseq/sdata.csv", row.names = 1, header = TRUE, sep = ",", check.names = TRUE, stringsAsFactors = TRUE) %>%
  rownames_to_column(var = "SampleID")
head(sdata)
##   SampleID Environment Sample_Type Replicate Substrate Transfer Substrate_Label
## 1 CCONR1T1     Compost  Enrichment        R1   Control       T1         Control
## 2 CCONR2T1     Compost  Enrichment        R2   Control       T1         Control
## 3 CCONR3T1     Compost  Enrichment        R3   Control       T1         Control
## 4  CTAR1T1     Compost  Enrichment        R1        TA       T1 Terephthalamide
## 5  CTAR2T1     Compost  Enrichment        R2        TA       T1 Terephthalamide
## 6  CTAR3T1     Compost  Enrichment        R3        TA       T1 Terephthalamide
##   Environment_Label_Location Environment_Label
## 1       Compost\nCalumet, MI           Compost
## 2       Compost\nCalumet, MI           Compost
## 3       Compost\nCalumet, MI           Compost
## 4       Compost\nCalumet, MI           Compost
## 5       Compost\nCalumet, MI           Compost
## 6       Compost\nCalumet, MI           Compost
adonisData <- otu_table(toy_ps) %>%                                      #use sequence table that has inocula removed
  as.data.frame() %>%
  rownames_to_column(var = "SampleID") %>%                                          #change rownames to a column so there is a common variable to join by
  left_join(sdata, by = "SampleID") %>%                                            #join sample data to the sequence table
  select(-c("SampleID", "Environment", "Replicate", "Transfer", "Sample_Type")) %>% #remove all metadata columns except the one to be used to compare
  select(Substrate, everything())                                                   #rearrange so substrate column is first
dim(adonisData)
## [1]     9 12007
head(adonisData[,1:10])
##   Substrate sp1 sp2  sp3 sp4 sp5 sp6 sp7 sp8 sp9
## 1   Control  13   0    0   0   0   0   0   0   0
## 2   Control  20   0    0   0   0   0   0   0   0
## 3   Control   0   0    0  37  20   0   0   0   0
## 4        TA   0   0  462   0   0   0   0   0   0
## 5        TA   0 259 1968   0   0   0   0   0   0
## 6        TA   0   0  767   0   0   0  20   0   0
pairwise.adonis2(adonisData[,2:3722]~Substrate,method="bray",data=adonisData)
## 'nperm' >= set of all permutations: complete enumeration.
## Set of permutations < 'minperm'. Generating entire set.
## 'nperm' >= set of all permutations: complete enumeration.
## Set of permutations < 'minperm'. Generating entire set.
## 'nperm' >= set of all permutations: complete enumeration.
## Set of permutations < 'minperm'. Generating entire set.
## $parent_call
## [1] "adonisData[, 2:3722] ~ Substrate , strata = Null , permutations 999"
## 
## $Control_vs_TA
##           Df SumOfSqs      R2      F Pr(>F)
## Substrate  1  0.56172 0.35175 2.1705    0.1
## Residual   4  1.03520 0.64825              
## Total      5  1.59692 1.00000              
## 
## $Control_vs_TPA
##           Df SumOfSqs      R2      F Pr(>F)
## Substrate  1  0.92349 0.49628 3.9409    0.1
## Residual   4  0.93734 0.50372              
## Total      5  1.86083 1.00000              
## 
## $TA_vs_TPA
##           Df SumOfSqs      R2      F Pr(>F)
## Substrate  1  0.92054 0.65543 7.6086    0.1
## Residual   4  0.48394 0.34457              
## Total      5  1.40448 1.00000              
## 
## attr(,"class")
## [1] "pwadstrata" "list"