This is a demo of how to run regression and classification random forest models with microbial community datasets in R. The slides which accompanied this code are available here.
Much of this code utilizes the Phyloseq package.
This page was created with Rmarkdown.
Author: Michelle Berry
Date Created: 2/24/15
Date Updated: 10/4/15
===================================================================
In this tutorial, we are working with illumina 16s data that has already been processed into an OTU and taxonomy table from the mothur pipeline. Phyloseq has a variety of import options if you processed your raw sequence data with a different pipeline.
The data used in this tutorial consists of water samples collected from the Western basin of Lake Erie between May and November 2014. The goal of this dataset was to understand how the bacterial community in Lake Erie shifts during toxic algal blooms caused predominantly by a genus of cyanobacteria called Microcystis. Water samples were fractionated to distinguish free-living bacteria, particle-associated bacteria, and larger colonies of blooming cyanobacteria.
#Load libraries
library(ggplot2)
library(vegan)
library(dplyr)
library(magrittr)
library(scales)
library(grid)
library(reshape2)
library(phyloseq)
library(randomForest)
library(knitr)
# Set working directory
setwd("~/chabs/miseq_may2015/analysis/")
# Source code files
# miseqR.R can be found in this repository
# habs_functions.R are formatting functions specific to this dataset
source("miseqR.R")
source("habs_functions.R")
# Because grey plots are ugly
theme_set(theme_bw())
here we import the mothur shared file, consensus taxonomy file, and our sample metadata and store them in one phyloseq object
# Import mothur files and sample metadata
sharedfile = "mothur/allhabs.shared"
taxfile = "mothur/allhabs.taxonomy"
metadat = "other/habs_metadata.csv"
mothurdata = import_mothur(mothur_shared_file = sharedfile,
mothur_constaxonomy_file = taxfile)
# Import sample metadata
metadata <- read.csv(metadat)
sampdat <- sample_data(metadata)
rownames(sampdat) <- sampdat$SampleID
# Merge mothurdata object with sample metadata
erie.merge <- merge_phyloseq(mothurdata, sampdat)
erie.merge
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 19656 taxa and 305 samples ]
## sample_data() Sample Data: [ 305 samples by 39 sample variables ]
## tax_table() Taxonomy Table: [ 19656 taxa by 6 taxonomic ranks ]
Now we have a phyloseq object called erie.merge.
Before we move on with analysis, we need to do some basic reformatting and filtering
colnames(tax_table(erie.merge))
## [1] "Rank1" "Rank2" "Rank3" "Rank4" "Rank5" "Rank6"
# These taxonomy names are not helpful, so let's rename them
colnames(tax_table(erie.merge)) <- c("Kingdom", "Phylum", "Class",
"Order", "Family", "Genus")
# Filter out non-samples (i.e. water, mock, blanks)
# Note, there is a column in my metadata named "Type"
erie.merge %>%
subset_samples(Type == "sample") %>%
subset_samples(Date != "8/11") -> erie # bad samples from this date
# prune taxa which were only present in removed samples
erie <- prune_taxa(taxa_sums(erie) > 0, erie)
# Filter out non-bacteria, chloroplasts and mitochondria
# You may have done this already in mothur, but it's good to check
erie %>%
subset_taxa(Kingdom == "Bacteria" &
Family != "mitochondria" &
Class != "Chloroplast") -> erie
erie
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 15062 taxa and 285 samples ]
## sample_data() Sample Data: [ 285 samples by 39 sample variables ]
## tax_table() Taxonomy Table: [ 15062 taxa by 6 taxonomic ranks ]
For this random forest we want to build a model which will classify samples based on the location they were taken from. We will compare two sites: one is nearshore and in the densest part of the bloom, and the other is offshore on the edge of the bloom. In doing this, we can pull out which OTUs seem to be associated with nearshore/high bloom sites vs offshore/low bloom sites.
First, we need to do a few things to filter our data and put it in the right format. The first step is to subset our samples to just the free-living and algal/particle-associated algae. Next, we scale our reads by normalizing to 15000 reads which is slightly below our smallest library size
# Set minlib
minlib = 15000
# Select unfractionated samples (whole bacterial community)
# Select our two sites; WE2 is nearshore, WE4 is offshore
# Scale reads to minlib library size
erie %>%
subset_samples(Fraction == "CNA") %>%
subset_samples(Station %in% c("WE2", "WE4")) %>%
scale_reads(n = minlib, round = "round") -> sites.scale
Lets make some barplots to get a sense of what these communities look like. Here are some barplots of phylum composition from the two sites
# Glom at phylum level, prune out taxa < 2%, and format
sites.scale %>%
taxglom_and_melt(taxrank = "Phylum", prune = 0.02) %>%
habs_format() -> sites.long
# Set Colors (RColorBrewer)
phy <- c(Acidobacteria="#4D4D4D", Actinobacteria="#ff7f00",
Armatimonadetes="#ffff99", Bacteroidetes="#6a3d9a",
Candidate_division_SR1 ="#b15928", Chlorobi="#cab2d6",
Chloroflexi="#B2DF8A", Cyanobacteria="#33A02C",
Gemmatimonadetes="#fdbf6f", Planctomycetes="#A6CEE3",
Proteobacteria="#1F78B4", unclassified="#fb9a99",
Verrucomicrobia ="#e31a1c")
# Stacked barplot
ggplot(sites.long, aes(x = Date, y = Abundance, fill = Phylum)) +
facet_grid(Station~.) +
geom_bar(stat = "identity") +
geom_bar(
stat = "identity",
position = "fill",
colour = "black",
show_guide = FALSE
) +
scale_fill_manual(values = phy) +
stackbar_theme +
guides(fill = guide_legend(reverse = TRUE, keywidth = 1, keyheight = 1)) +
xlab("") +
ylab("Relative Abundance (Phyla > 2%) \n") +
ggtitle("Phylum composition of bacteria in Lake Erie\n at a nearshore (WE2) and offshore (WE4) site")
At the phylum level there aren’t very obvious differences between these two sites, but we would expect them to be much more different at lower taxonomic levels.
To get a sense of how these sites differ at the OTU level, let’s make an NMDS to look at the bray-curtis dissimilarity of these two sites using OTUs.
# NMDS scores using bray-curtis distance
set.seed(1)
bray.nmds <- ordinate(physeq = sites.scale, method = "NMDS", distance = "bray")
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1303245
## Run 1 stress 0.1667012
## Run 2 stress 0.1433592
## Run 3 stress 0.1759191
## Run 4 stress 0.1840836
## Run 5 stress 0.143441
## Run 6 stress 0.1303245
## ... New best solution
## ... procrustes: rmse 3.431547e-05 max resid 0.0001311812
## *** Solution reached
# Plot NMDS and color samples by sites
plot_ordination(
physeq = sites.scale,
ordination = bray.nmds,
color = "Station"
) +
geom_point(size = 3) +
ggtitle("Offshore and nearshore bacterial communities")
Okay, so our sites are not distinctly clustering separately, but let’s run an adonis (permanova) test to see if they have different centroids
adonis.site <- phyloseq_to_adonis(
physeq = sites.scale,
dist = "bray",
formula = "Station"
)
##
## Call:
## adonis(formula = f, data = metadata)
##
## Permutation: free
## Number of permutations: 999
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
## Station 1 0.3918 0.39182 3.3301 0.08057 0.005 **
## Residuals 38 4.4711 0.11766 0.91943
## Total 39 4.8629 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
##
## Response: Distances
## Df Sum Sq Mean Sq F N.Perm Pr(>F)
## Groups 1 0.014324 0.0143238 2.4046 999 0.134
## Residuals 38 0.226361 0.0059569
Our two sites have significantly different centroids and our test for homogeneity of dispersion did not come back significant, so we can have more assurance that our adonis result is due to real differences in centroids and not just differences in dispersion. Our random forest should be able to pull out the key OTUs that separate these two sites and accurately classify them.
Random forests can handle sparse matrices, but we still want to prune out lots of our rare OTUs which are just contributing noise. We will do this by eliminating OTUs with an average relative abundance below 0.0001
# How many OTUs do we currently have?
ntaxa(sites.scale)
## [1] 5452
# Set prunescale
prunescale = 0.0001
# Prune out rare OTUs by mean relative abundance set by prunescale
tax.mean <- taxa_sums(sites.scale)/nsamples(sites.scale)
sites.prune <- prune_taxa(tax.mean > prunescale*minlib, sites.scale)
sites.prune
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 387 taxa and 40 samples ]
## sample_data() Sample Data: [ 40 samples by 39 sample variables ]
## tax_table() Taxonomy Table: [ 387 taxa by 6 taxonomic ranks ]
Now we only have 387 OTU’s to put in our model which should speed up computation time and hopefully reduce artifacts in our results.
# Make a dataframe of training data with OTUs as column and samples as rows
predictors <- t(otu_table(sites.prune))
dim(predictors)
## [1] 40 387
We have 40 samples and 387 OTUs
# Make one column for our outcome/response variable
response <- as.factor(sample_data(sites.prune)$Station)
# Combine them into 1 data frame
rf.data <- data.frame(response, predictors)
Now we will use the randomForest package to train and test our random forest model using the “out of bag” error to estimate our model error. OOB is a nice feature of random forest models whereby since the training data is bootstrapped, you only use approximately 2/3 of the data at each iteration. The remaining 1/3 or “out of bag” data can be used to validate your model. This removes the need to use another form of cross-validation such as using a separate validation set or k-folds.
It is important to set a seed for reproducability By default, randomForest uses p/3 variables when building a random forest of regression trees and root(p) variables when building a random forest of classification trees. In this case, p/3 = 129
set.seed(2)
erie.classify <- randomForest(response~., data = rf.data, ntree = 100)
print(erie.classify)
##
## Call:
## randomForest(formula = response ~ ., data = rf.data, ntree = 100)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 19
##
## OOB estimate of error rate: 17.5%
## Confusion matrix:
## WE2 WE4 class.error
## WE2 18 3 0.1428571
## WE4 4 15 0.2105263
In most statistical learning algorithms, the data needs to be split up into “training” and “test” data. The idea is to train the model on one set of data and test it on a naive set of data. Random forests are nice because you have a built-in way of estimating the model error. Since only ~2/3 of the data is used everytime we bootstrap our samples for construction of the kth tree, we can use the remaining ~1/3 of the data (called the out of bag samples) to test model error
Our out of bag error is 17.5% .
# What variables are stored in the output?
names(erie.classify)
## [1] "call" "type" "predicted"
## [4] "err.rate" "confusion" "votes"
## [7] "oob.times" "classes" "importance"
## [10] "importanceSD" "localImportance" "proximity"
## [13] "ntree" "mtry" "forest"
## [16] "y" "test" "inbag"
## [19] "terms"
Lets make some plots of the most important variables in our model. FOr a classification tree, variable importance is measured by mean decrease in GINI coefficient (measure of node purity) due to that variable
# Make a data frame with predictor names and their importance
imp <- importance(erie.classify)
imp <- data.frame(predictors = rownames(imp), imp)
# Order the predictor levels by importance
imp.sort <- arrange(imp, desc(MeanDecreaseGini))
imp.sort$predictors <- factor(imp.sort$predictors, levels = imp.sort$predictors)
# Select the top 10 predictors
imp.20 <- imp.sort[1:20, ]
# ggplot
ggplot(imp.20, aes(x = predictors, y = MeanDecreaseGini)) +
geom_bar(stat = "identity", fill = "indianred") +
coord_flip() +
ggtitle("Most important OTUs for classifying Erie samples\n into nearshore or offshore")
# What are those OTUs?
otunames <- imp.20$predictors
r <- rownames(tax_table(sites.scale)) %in% otunames
kable(tax_table(sites.scale)[r, ])
Kingdom | Phylum | Class | Order | Family | Genus | |
---|---|---|---|---|---|---|
Otu00002 | Bacteria | Proteobacteria | Betaproteobacteria | Burkholderiales | Comamonadaceae | Limnohabitans |
Otu00048 | Bacteria | Bacteroidetes | Cytophagia | Cytophagales | Cytophagaceae | Pseudarcicella |
Otu00130 | Bacteria | Bacteroidetes | Sphingobacteriia | Sphingobacteriales | Chitinophagaceae | unclassified |
Otu00137 | Bacteria | Proteobacteria | Betaproteobacteria | TRA3-20 | unclassified | unclassified |
Otu00141 | Bacteria | Verrucomicrobia | Spartobacteria | Chthoniobacterales | unclassified | unclassified |
Otu00185 | Bacteria | Verrucomicrobia | Verrucomicrobiae | Verrucomicrobiales | Verrucomicrobiaceae | unclassified |
Otu00190 | Bacteria | Actinobacteria | Acidimicrobiia | Acidimicrobiales | Acidimicrobiaceae | CL500-29_marine_group |
Otu00206 | Bacteria | Proteobacteria | Gammaproteobacteria | Chromatiales | Chromatiaceae | Rheinheimera |
Otu00217 | Bacteria | Bacteroidetes | Sphingobacteriia | Sphingobacteriales | PHOS-HE51 | unclassified |
Otu00221 | Bacteria | Verrucomicrobia | Opitutae | Opitutales | Opitutaceae | unclassified |
Otu00260 | Bacteria | Bacteroidetes | Cytophagia | Cytophagales | Cyclobacteriaceae | Algoriphagus |
Otu00262 | Bacteria | Bacteroidetes | Flavobacteriia | Flavobacteriales | Flavobacteriaceae | Flavobacterium |
Otu00307 | Bacteria | Actinobacteria | Actinobacteria | Micrococcales | Microbacteriaceae | MWH-Ta3 |
Otu00418 | Bacteria | Bacteroidetes | unclassified | unclassified | unclassified | unclassified |
Otu00427 | Bacteria | Actinobacteria | Acidimicrobiia | Acidimicrobiales | Acidimicrobiaceae | CL500-29_marine_group |
Otu00483 | Bacteria | Proteobacteria | unclassified | unclassified | unclassified | unclassified |
Otu00485 | Bacteria | Proteobacteria | Alphaproteobacteria | Sphingomonadales | Sphingomonadaceae | Sandarakinorhabdus |
Otu00562 | Bacteria | Bacteroidetes | Sphingobacteriia | Sphingobacteriales | Chitinophagaceae | Ferruginibacter |
Otu00725 | Bacteria | Verrucomicrobia | Verrucomicrobiae | Verrucomicrobiales | Verrucomicrobiaceae | Prosthecobacter |
Otu00729 | Bacteria | Actinobacteria | Actinobacteria | Micrococcales | Microbacteriaceae | Candidatus_Aquiluna |
Now we could follow up with this analysis by making some plots of each of these OTUs at our two sites over time and looking into the literature to see whats known about them