This tutorial shows you how to create a phyloseq object from the output files from DADA2. It also demonstrates how to rarefy the phyloseq object. Packages tidyverse and phyloseq are required.
(Packages must be installed prior to running this tutorial)
library(tidyverse)
## Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'
## had status 1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(phyloseq)
library(csv)
#set seed for reproducible data
set.seed(81)
In addition to the output files from DADA2, you will need a spreadsheet with additional information about each of your samples. This includes information about treatments, origin of the sample, sampling location, etc. In this example, we have “Environment”, “Sample_Type”, “Replicate”, “Substrate”, “Transfer”, and a few additional variables representing labels. These columns will change depending on the experimental design and the data being processed. There should be one row for each sample that was sequenced. The row names must match the sample IDs used on the sample sheet when sequencing.
sdata <- as.csv("/home/lgschaer/old/Tutorials/phyloseq/sdata.csv", row.names = 1, header = TRUE, sep = ",", check.names = TRUE, stringsAsFactors = TRUE)
head(sdata)
## Environment Sample_Type Replicate Substrate Transfer Substrate_Label
## CCONR1T1 Compost Enrichment R1 Control T1 Control
## CCONR2T1 Compost Enrichment R2 Control T1 Control
## CCONR3T1 Compost Enrichment R3 Control T1 Control
## CTAR1T1 Compost Enrichment R1 TA T1 Terephthalamide
## CTAR2T1 Compost Enrichment R2 TA T1 Terephthalamide
## CTAR3T1 Compost Enrichment R3 TA T1 Terephthalamide
## Environment_Label_Location Environment_Label
## CCONR1T1 Compost\nCalumet, MI Compost
## CCONR2T1 Compost\nCalumet, MI Compost
## CCONR3T1 Compost\nCalumet, MI Compost
## CTAR1T1 Compost\nCalumet, MI Compost
## CTAR2T1 Compost\nCalumet, MI Compost
## CTAR3T1 Compost\nCalumet, MI Compost
The sequence table is one of the output files from DADA2, it is typically saved in a file called “seqtab.rds”. In this file, the column names are the DNA sequences for each ASV. The rows represent each sample. The numerical values are the number of sequenced reads in each sample belonging to each ASV. The taxa table already contains the annotations for the sequences, so we do not need these sequences for the analysis. These will be removed to speed up computational time. Any taxa with zero counts will be removed from the sequence table.
sequence_table <- readRDS("/home/lgschaer/old/Tutorials/phyloseq/seqtab.rds")
sequence_table[1:3,1:3] #view the sequences
## OTU Table: [3 taxa and 3 samples]
## taxa are columns
## GTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTGGGCAAGACAGATGTGAAATCCCCGGGCTCAACCTGGGAATGGCATTTGTGACTGCCCGGCTAGAGTGTGTCAGAGGGGGGTAGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAATACCGATGGCGAAGGCAGCCCCCTGGGATAACACTGACGCTCATGCACGAAAGCGTGGGGAGCAAAC
## CCONR1T1 13
## CCONR2T1 20
## CCONR3T1 0
## GTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTTGTAAGACAGGCGTGAAATCCCCGGGCTCAACCTGGGAATTGCGCTTGTGACTGCAAGGCTGGAGTGCGGCAGAGGGGGATGGAATTCCGCGTGTAGCAGTGAAATGCGTAGATATGCGGAGGAACACCGATGGCGAAGGCAATCCCCTGGGCCTGCACTGACGCTCATGCACGAAAGCGTGGGGAGCAAAC
## CCONR1T1 0
## CCONR2T1 0
## CCONR3T1 0
## GTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTTGTAAGACAGGCGTGAAATCCCCGGGCTTAACCTGGGAATTGCGCTTGTGACTGCAAGGCTGGAGTGCGGCAGAGGGGGATGGAATTCCGCGTGTAGCAGTGAAATGCGTAGATATGCGGAGGAACACCGATGGCGAAGGCAATCCCCTGGGCCTGCACTGACGCTCATGCACGAAAGCGTGGGGAGCAAAC
## CCONR1T1 0
## CCONR2T1 0
## CCONR3T1 0
colnames(sequence_table) <- NULL #remove the sequences
sequence_table[1:5,1:5] #confirm that the sequences have been removed
## OTU Table: [5 taxa and 5 samples]
## taxa are columns
## sp1 sp2 sp3 sp4 sp5
## CCONR1T1 13 0 0 0 0
## CCONR2T1 20 0 0 0 0
## CCONR3T1 0 0 0 37 20
## CTAR1T1 0 0 462 0 0
## CTAR2T1 0 259 1968 0 0
#remove any taxa with zero counts
sequence_table <- as.matrix(sequence_table) #change to matrix format
m <- (colSums(sequence_table, na.rm=TRUE) != 0)
nonzero <- sequence_table[, m]
The taxa table is one of the output files from DADA2, it is typically saved in a file called “taxtab.rds”. This file contains the annotations for the sequences in the same order as the sequence table.
taxa <- readRDS("/home/lgschaer/old/Tutorials/phyloseq/taxtab.rds")
taxa[1:5,1:5]
## Taxonomy Table: [5 taxa by 5 taxonomic ranks]:
## Kingdom Phylum Class Order
## sp1 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Burkholderiales"
## sp2 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Burkholderiales"
## sp3 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Burkholderiales"
## sp4 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Pseudomonadales"
## sp5 "Bacteria" "Actinobacteriota" "Actinobacteria" "Corynebacteriales"
## Family
## sp1 "Oxalobacteraceae"
## sp2 "Comamonadaceae"
## sp3 "Comamonadaceae"
## sp4 "Pseudomonadaceae"
## sp5 "Nocardiaceae"
Do final checks and assemble the phyloseq object.
#format for phyloseq
samdata = sample_data(sdata)
seqtab = otu_table(nonzero, taxa_are_rows = FALSE)
taxtab = tax_table(taxa)
#check that sample names and taxa names match
head(sample_names(samdata))
## [1] "CCONR1T1" "CCONR2T1" "CCONR3T1" "CTAR1T1" "CTAR2T1" "CTAR3T1"
head(sample_names(seqtab))
## [1] "CCONR1T1" "CCONR2T1" "CCONR3T1" "CTAR1T1" "CTAR2T1" "CTAR3T1"
head(taxa_names(seqtab))
## [1] "sp1" "sp2" "sp3" "sp4" "sp5" "sp6"
head(taxa_names(taxtab))
## [1] "sp1" "sp2" "sp3" "sp4" "sp5" "sp6"
#combine all components into a phyloseq object
toy_ps = phyloseq(otu_table(seqtab), tax_table(taxtab), sample_data(samdata))
toy_ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1385 taxa and 9 samples ]
## sample_data() Sample Data: [ 9 samples by 8 sample variables ]
## tax_table() Taxonomy Table: [ 1385 taxa by 7 taxonomic ranks ]
write_rds(toy_ps, "/home/lgschaer/old/Tutorials/phyloseq/toy_ps_out.rds")
Now everything has been combined into a phyloseq object with 9 samples containing 1385 taxa at 7 taxonomic ranks.
Remove non-bacteria, mitochondira, and chloroplasts from the data set.
#Filter out eukaryotes and mitochondria
toy_ps2 <- toy_ps %>%
subset_taxa(
Kingdom == "Bacteria" & #only bacteria
Family != "Mitochondria" & #filter out mitochondria
Class != "Chloroplast" #filter out chloroplasts
)
toy_ps2
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1226 taxa and 9 samples ]
## sample_data() Sample Data: [ 9 samples by 8 sample variables ]
## tax_table() Taxonomy Table: [ 1226 taxa by 7 taxonomic ranks ]
Rarefy and normalize the dataset by removing samples that sequenced poorly, removing taxa which are underrepresented in repeated subsampling, and normalizing sequencing depth across the dataset.
#remove samples with fewer counts than the chosen threshold
samplesover1000_all <- subset_samples(toy_ps2, sample_sums(toy_ps2) > 1000) #set the smallest sample with more than 1000 reads to be the threshold for rarefying.
#remove taxa with zero counts
any(taxa_sums(samplesover1000_all) == 0)
## [1] TRUE
sum(taxa_sums(samplesover1000_all) == 0)
## [1] 1
prune_samplesover1000_all <- prune_taxa(taxa_sums(samplesover1000_all) > 0, samplesover1000_all)
any(taxa_sums(prune_samplesover1000_all) == 0)
## [1] FALSE
#set the seedfor reproducible data
set.seed(81)
#rarefy
toy_ps3 <- rarefy_even_depth(prune_samplesover1000_all, rngseed= 81, sample.size = min(sample_sums(prune_samplesover1000_all)))
## `set.seed(81)` was used to initialize repeatable random subsampling.
## Please record this for your records so others can reproduce.
## Try `set.seed(81); .Random.seed` for the full vector
## ...
## 450OTUs were removed because they are no longer
## present in any sample after random subsampling
## ...