Phyloseq Tutorial

This tutorial shows you how to create a phyloseq object from the output files from DADA2. It also demonstrates how to rarefy the phyloseq object. Packages tidyverse and phyloseq are required.

Start by loading the required R packages

(Packages must be installed prior to running this tutorial)

library(tidyverse)
## Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'
## had status 1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(phyloseq)
library(csv)

#set seed for reproducible data
set.seed(81)

Load your sample “meta” data

In addition to the output files from DADA2, you will need a spreadsheet with additional information about each of your samples. This includes information about treatments, origin of the sample, sampling location, etc. In this example, we have “Environment”, “Sample_Type”, “Replicate”, “Substrate”, “Transfer”, and a few additional variables representing labels. These columns will change depending on the experimental design and the data being processed. There should be one row for each sample that was sequenced. The row names must match the sample IDs used on the sample sheet when sequencing.

sdata <- as.csv("/home/lgschaer/old/Tutorials/phyloseq/sdata.csv", row.names = 1, header = TRUE, sep = ",", check.names = TRUE, stringsAsFactors = TRUE)
head(sdata)
##          Environment Sample_Type Replicate Substrate Transfer Substrate_Label
## CCONR1T1     Compost  Enrichment        R1   Control       T1         Control
## CCONR2T1     Compost  Enrichment        R2   Control       T1         Control
## CCONR3T1     Compost  Enrichment        R3   Control       T1         Control
## CTAR1T1      Compost  Enrichment        R1        TA       T1 Terephthalamide
## CTAR2T1      Compost  Enrichment        R2        TA       T1 Terephthalamide
## CTAR3T1      Compost  Enrichment        R3        TA       T1 Terephthalamide
##          Environment_Label_Location Environment_Label
## CCONR1T1       Compost\nCalumet, MI           Compost
## CCONR2T1       Compost\nCalumet, MI           Compost
## CCONR3T1       Compost\nCalumet, MI           Compost
## CTAR1T1        Compost\nCalumet, MI           Compost
## CTAR2T1        Compost\nCalumet, MI           Compost
## CTAR3T1        Compost\nCalumet, MI           Compost

Load your sequence table

The sequence table is one of the output files from DADA2, it is typically saved in a file called “seqtab.rds”. In this file, the column names are the DNA sequences for each ASV. The rows represent each sample. The numerical values are the number of sequenced reads in each sample belonging to each ASV. The taxa table already contains the annotations for the sequences, so we do not need these sequences for the analysis. These will be removed to speed up computational time. Any taxa with zero counts will be removed from the sequence table.

sequence_table <- readRDS("/home/lgschaer/old/Tutorials/phyloseq/seqtab.rds")
sequence_table[1:3,1:3] #view the sequences
## OTU Table:          [3 taxa and 3 samples]
##                      taxa are columns
##          GTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTGGGCAAGACAGATGTGAAATCCCCGGGCTCAACCTGGGAATGGCATTTGTGACTGCCCGGCTAGAGTGTGTCAGAGGGGGGTAGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAATACCGATGGCGAAGGCAGCCCCCTGGGATAACACTGACGCTCATGCACGAAAGCGTGGGGAGCAAAC
## CCONR1T1                                                                                                                                                                                                                                                      13
## CCONR2T1                                                                                                                                                                                                                                                      20
## CCONR3T1                                                                                                                                                                                                                                                       0
##          GTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTTGTAAGACAGGCGTGAAATCCCCGGGCTCAACCTGGGAATTGCGCTTGTGACTGCAAGGCTGGAGTGCGGCAGAGGGGGATGGAATTCCGCGTGTAGCAGTGAAATGCGTAGATATGCGGAGGAACACCGATGGCGAAGGCAATCCCCTGGGCCTGCACTGACGCTCATGCACGAAAGCGTGGGGAGCAAAC
## CCONR1T1                                                                                                                                                                                                                                                       0
## CCONR2T1                                                                                                                                                                                                                                                       0
## CCONR3T1                                                                                                                                                                                                                                                       0
##          GTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTTGTAAGACAGGCGTGAAATCCCCGGGCTTAACCTGGGAATTGCGCTTGTGACTGCAAGGCTGGAGTGCGGCAGAGGGGGATGGAATTCCGCGTGTAGCAGTGAAATGCGTAGATATGCGGAGGAACACCGATGGCGAAGGCAATCCCCTGGGCCTGCACTGACGCTCATGCACGAAAGCGTGGGGAGCAAAC
## CCONR1T1                                                                                                                                                                                                                                                       0
## CCONR2T1                                                                                                                                                                                                                                                       0
## CCONR3T1                                                                                                                                                                                                                                                       0
colnames(sequence_table) <- NULL #remove the sequences
sequence_table[1:5,1:5] #confirm that the sequences have been removed
## OTU Table:          [5 taxa and 5 samples]
##                      taxa are columns
##          sp1 sp2  sp3 sp4 sp5
## CCONR1T1  13   0    0   0   0
## CCONR2T1  20   0    0   0   0
## CCONR3T1   0   0    0  37  20
## CTAR1T1    0   0  462   0   0
## CTAR2T1    0 259 1968   0   0
#remove any taxa with zero counts
sequence_table <- as.matrix(sequence_table)                            #change to matrix format
m <- (colSums(sequence_table, na.rm=TRUE) != 0)                        
nonzero <- sequence_table[, m]   

Load your taxa table

The taxa table is one of the output files from DADA2, it is typically saved in a file called “taxtab.rds”. This file contains the annotations for the sequences in the same order as the sequence table.

taxa <- readRDS("/home/lgschaer/old/Tutorials/phyloseq/taxtab.rds")
taxa[1:5,1:5]
## Taxonomy Table:     [5 taxa by 5 taxonomic ranks]:
##     Kingdom    Phylum             Class                 Order              
## sp1 "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Burkholderiales"  
## sp2 "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Burkholderiales"  
## sp3 "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Burkholderiales"  
## sp4 "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Pseudomonadales"  
## sp5 "Bacteria" "Actinobacteriota" "Actinobacteria"      "Corynebacteriales"
##     Family            
## sp1 "Oxalobacteraceae"
## sp2 "Comamonadaceae"  
## sp3 "Comamonadaceae"  
## sp4 "Pseudomonadaceae"
## sp5 "Nocardiaceae"

Build the phyloseq object

Do final checks and assemble the phyloseq object.

#format for phyloseq
samdata = sample_data(sdata)
seqtab = otu_table(nonzero, taxa_are_rows = FALSE)
taxtab = tax_table(taxa)


#check that sample names and taxa names match
head(sample_names(samdata))
## [1] "CCONR1T1" "CCONR2T1" "CCONR3T1" "CTAR1T1"  "CTAR2T1"  "CTAR3T1"
head(sample_names(seqtab))
## [1] "CCONR1T1" "CCONR2T1" "CCONR3T1" "CTAR1T1"  "CTAR2T1"  "CTAR3T1"
head(taxa_names(seqtab))
## [1] "sp1" "sp2" "sp3" "sp4" "sp5" "sp6"
head(taxa_names(taxtab))
## [1] "sp1" "sp2" "sp3" "sp4" "sp5" "sp6"
#combine all components into a phyloseq object
toy_ps = phyloseq(otu_table(seqtab), tax_table(taxtab), sample_data(samdata))
toy_ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1385 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 1385 taxa by 7 taxonomic ranks ]
write_rds(toy_ps, "/home/lgschaer/old/Tutorials/phyloseq/toy_ps_out.rds")

Now everything has been combined into a phyloseq object with 9 samples containing 1385 taxa at 7 taxonomic ranks.

Filter the phyloseq object

Remove non-bacteria, mitochondira, and chloroplasts from the data set.

#Filter out eukaryotes and mitochondria
toy_ps2 <- toy_ps %>%
  subset_taxa(
    Kingdom == "Bacteria" &                   #only bacteria
      Family  != "Mitochondria" &             #filter out mitochondria
      Class   != "Chloroplast"                #filter out chloroplasts
  )
toy_ps2
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1226 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 1226 taxa by 7 taxonomic ranks ]

Rarefy and normalize

Rarefy and normalize the dataset by removing samples that sequenced poorly, removing taxa which are underrepresented in repeated subsampling, and normalizing sequencing depth across the dataset.

#remove samples with fewer counts than the chosen threshold
samplesover1000_all <- subset_samples(toy_ps2, sample_sums(toy_ps2) > 1000) #set the smallest sample with more than 1000 reads to be the threshold for rarefying.

#remove taxa with zero counts
any(taxa_sums(samplesover1000_all) == 0)
## [1] TRUE
sum(taxa_sums(samplesover1000_all) == 0)
## [1] 1
prune_samplesover1000_all <- prune_taxa(taxa_sums(samplesover1000_all) > 0, samplesover1000_all)
any(taxa_sums(prune_samplesover1000_all) == 0)
## [1] FALSE
#set the seedfor reproducible data
set.seed(81)

#rarefy
toy_ps3 <- rarefy_even_depth(prune_samplesover1000_all, rngseed= 81, sample.size = min(sample_sums(prune_samplesover1000_all)))
## `set.seed(81)` was used to initialize repeatable random subsampling.
## Please record this for your records so others can reproduce.
## Try `set.seed(81); .Random.seed` for the full vector
## ...
## 450OTUs were removed because they are no longer 
## present in any sample after random subsampling
## ...