This is a demo of how to run regression and classification random forest models with microbial community datasets in R. The slides which accompanied this code are available here.

Much of this code utilizes the Phyloseq package.

This page was created with Rmarkdown.

Author: Michelle Berry
Date Created: 2/24/15
Date Updated: 10/4/15

===================================================================

In this tutorial, we are working with illumina 16s data that has already been processed into an OTU and taxonomy table from the mothur pipeline. Phyloseq has a variety of import options if you processed your raw sequence data with a different pipeline.

The data used in this tutorial consists of water samples collected from the Western basin of Lake Erie between May and November 2014. The goal of this dataset was to understand how the bacterial community in Lake Erie shifts during toxic algal blooms caused predominantly by a genus of cyanobacteria called Microcystis. Water samples were fractionated to distinguish free-living bacteria, particle-associated bacteria, and larger colonies of blooming cyanobacteria.

Libraries

#Load libraries
library(ggplot2)
library(vegan)
library(dplyr)
library(magrittr)
library(scales)
library(grid)
library(reshape2)
library(phyloseq)
library(randomForest)
library(knitr)
# Set working directory
setwd("~/chabs/miseq_may2015/analysis/")

# Source code files
# miseqR.R can be found in this repository
# habs_functions.R are formatting functions specific to this dataset
source("miseqR.R")
source("habs_functions.R")

# Because grey plots are ugly
theme_set(theme_bw())

Data import

here we import the mothur shared file, consensus taxonomy file, and our sample metadata and store them in one phyloseq object

# Import mothur files and sample metadata
sharedfile = "mothur/allhabs.shared"
taxfile = "mothur/allhabs.taxonomy"
metadat = "other/habs_metadata.csv"
 
mothurdata = import_mothur(mothur_shared_file = sharedfile,
  mothur_constaxonomy_file = taxfile)

# Import sample metadata
metadata <- read.csv(metadat)
sampdat <- sample_data(metadata)
rownames(sampdat) <- sampdat$SampleID
  
# Merge mothurdata object with sample metadata
erie.merge <- merge_phyloseq(mothurdata, sampdat)
erie.merge
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 19656 taxa and 305 samples ]
## sample_data() Sample Data:       [ 305 samples by 39 sample variables ]
## tax_table()   Taxonomy Table:    [ 19656 taxa by 6 taxonomic ranks ]

Now we have a phyloseq object called erie.merge.

Before we move on with analysis, we need to do some basic reformatting and filtering

colnames(tax_table(erie.merge))
## [1] "Rank1" "Rank2" "Rank3" "Rank4" "Rank5" "Rank6"
# These taxonomy names are not helpful, so let's rename them
colnames(tax_table(erie.merge)) <- c("Kingdom", "Phylum", "Class",
  "Order", "Family", "Genus")
  
# Filter out non-samples (i.e. water, mock, blanks) 
# Note, there is a column in my metadata named "Type"
erie.merge %>%
  subset_samples(Type == "sample") %>%
  subset_samples(Date != "8/11") -> erie # bad samples from this date

# prune taxa which were only present in removed samples
erie <- prune_taxa(taxa_sums(erie) > 0, erie)

# Filter out non-bacteria, chloroplasts and mitochondria 
# You may have done this already in mothur, but it's good to check
erie %>%
  subset_taxa(Kingdom == "Bacteria" & 
              Family != "mitochondria" & 
              Class != "Chloroplast") -> erie

erie
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 15062 taxa and 285 samples ]
## sample_data() Sample Data:       [ 285 samples by 39 sample variables ]
## tax_table()   Taxonomy Table:    [ 15062 taxa by 6 taxonomic ranks ]

Classification rf: modeling sample locations

For this random forest we want to build a model which will classify samples based on the location they were taken from. We will compare two sites: one is nearshore and in the densest part of the bloom, and the other is offshore on the edge of the bloom. In doing this, we can pull out which OTUs seem to be associated with nearshore/high bloom sites vs offshore/low bloom sites.

First, we need to do a few things to filter our data and put it in the right format. The first step is to subset our samples to just the free-living and algal/particle-associated algae. Next, we scale our reads by normalizing to 15000 reads which is slightly below our smallest library size

# Set minlib
minlib = 15000

# Select unfractionated samples (whole bacterial community)
# Select our two sites; WE2 is nearshore, WE4 is offshore
# Scale reads to minlib library size
erie %>%
  subset_samples(Fraction == "CNA") %>%
  subset_samples(Station %in% c("WE2", "WE4")) %>%
  scale_reads(n = minlib, round = "round") -> sites.scale   

Lets make some barplots to get a sense of what these communities look like. Here are some barplots of phylum composition from the two sites

# Glom at phylum level, prune out taxa < 2%, and format
sites.scale %>%
  taxglom_and_melt(taxrank = "Phylum", prune = 0.02) %>%
  habs_format() -> sites.long

# Set Colors (RColorBrewer)
phy <- c(Acidobacteria="#4D4D4D", Actinobacteria="#ff7f00", 
         Armatimonadetes="#ffff99", Bacteroidetes="#6a3d9a", 
         Candidate_division_SR1 ="#b15928", Chlorobi="#cab2d6",
         Chloroflexi="#B2DF8A", Cyanobacteria="#33A02C",
         Gemmatimonadetes="#fdbf6f", Planctomycetes="#A6CEE3",
         Proteobacteria="#1F78B4", unclassified="#fb9a99", 
         Verrucomicrobia ="#e31a1c") 

# Stacked barplot 
ggplot(sites.long, aes(x = Date, y = Abundance, fill = Phylum)) + 
  facet_grid(Station~.) + 
  geom_bar(stat = "identity") +
  geom_bar(
    stat = "identity", 
    position = "fill", 
    colour = "black", 
    show_guide = FALSE
  ) + 
  scale_fill_manual(values = phy) +
  stackbar_theme +
  guides(fill = guide_legend(reverse = TRUE, keywidth = 1, keyheight = 1)) +
  xlab("") +
  ylab("Relative Abundance (Phyla > 2%) \n") +
  ggtitle("Phylum composition of bacteria in Lake Erie\n at a nearshore (WE2) and offshore (WE4) site") 

At the phylum level there aren’t very obvious differences between these two sites, but we would expect them to be much more different at lower taxonomic levels.

To get a sense of how these sites differ at the OTU level, let’s make an NMDS to look at the bray-curtis dissimilarity of these two sites using OTUs.

# NMDS scores using bray-curtis distance 
set.seed(1)
bray.nmds <- ordinate(physeq = sites.scale, method = "NMDS", distance = "bray")
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1303245 
## Run 1 stress 0.1667012 
## Run 2 stress 0.1433592 
## Run 3 stress 0.1759191 
## Run 4 stress 0.1840836 
## Run 5 stress 0.143441 
## Run 6 stress 0.1303245 
## ... New best solution
## ... procrustes: rmse 3.431547e-05  max resid 0.0001311812 
## *** Solution reached
# Plot NMDS and color samples by sites
plot_ordination(
  physeq = sites.scale, 
  ordination = bray.nmds, 
  color = "Station"
) + 
  geom_point(size = 3) + 
  ggtitle("Offshore and nearshore bacterial communities")

Okay, so our sites are not distinctly clustering separately, but let’s run an adonis (permanova) test to see if they have different centroids

adonis.site <- phyloseq_to_adonis(
  physeq = sites.scale, 
  dist = "bray", 
  formula = "Station"
)
## 
## Call:
## adonis(formula = f, data = metadata) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)   
## Station    1    0.3918 0.39182  3.3301 0.08057  0.005 **
## Residuals 38    4.4711 0.11766         0.91943          
## Total     39    4.8629                 1.00000          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
## 
## Response: Distances
##           Df   Sum Sq   Mean Sq      F N.Perm Pr(>F)
## Groups     1 0.014324 0.0143238 2.4046    999  0.134
## Residuals 38 0.226361 0.0059569

Our two sites have significantly different centroids and our test for homogeneity of dispersion did not come back significant, so we can have more assurance that our adonis result is due to real differences in centroids and not just differences in dispersion. Our random forest should be able to pull out the key OTUs that separate these two sites and accurately classify them.

Random forests can handle sparse matrices, but we still want to prune out lots of our rare OTUs which are just contributing noise. We will do this by eliminating OTUs with an average relative abundance below 0.0001

# How many OTUs do we currently have? 
ntaxa(sites.scale)
## [1] 5452
# Set prunescale 
prunescale = 0.0001

# Prune out rare OTUs by mean relative abundance set by prunescale
tax.mean <- taxa_sums(sites.scale)/nsamples(sites.scale)
sites.prune <- prune_taxa(tax.mean > prunescale*minlib, sites.scale)

sites.prune
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 387 taxa and 40 samples ]
## sample_data() Sample Data:       [ 40 samples by 39 sample variables ]
## tax_table()   Taxonomy Table:    [ 387 taxa by 6 taxonomic ranks ]

Now we only have 387 OTU’s to put in our model which should speed up computation time and hopefully reduce artifacts in our results.

# Make a dataframe of training data with OTUs as column and samples as rows
predictors <- t(otu_table(sites.prune))
dim(predictors)
## [1]  40 387

We have 40 samples and 387 OTUs

# Make one column for our outcome/response variable 
response <- as.factor(sample_data(sites.prune)$Station)

# Combine them into 1 data frame
rf.data <- data.frame(response, predictors)

Now we will use the randomForest package to train and test our random forest model using the “out of bag” error to estimate our model error. OOB is a nice feature of random forest models whereby since the training data is bootstrapped, you only use approximately 2/3 of the data at each iteration. The remaining 1/3 or “out of bag” data can be used to validate your model. This removes the need to use another form of cross-validation such as using a separate validation set or k-folds.

It is important to set a seed for reproducability By default, randomForest uses p/3 variables when building a random forest of regression trees and root(p) variables when building a random forest of classification trees. In this case, p/3 = 129

Results

set.seed(2)
erie.classify <- randomForest(response~., data = rf.data, ntree = 100)
print(erie.classify)
## 
## Call:
##  randomForest(formula = response ~ ., data = rf.data, ntree = 100) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 19
## 
##         OOB estimate of  error rate: 17.5%
## Confusion matrix:
##     WE2 WE4 class.error
## WE2  18   3   0.1428571
## WE4   4  15   0.2105263

In most statistical learning algorithms, the data needs to be split up into “training” and “test” data. The idea is to train the model on one set of data and test it on a naive set of data. Random forests are nice because you have a built-in way of estimating the model error. Since only ~2/3 of the data is used everytime we bootstrap our samples for construction of the kth tree, we can use the remaining ~1/3 of the data (called the out of bag samples) to test model error

Our out of bag error is 17.5% .

# What variables are stored in the output?
names(erie.classify)
##  [1] "call"            "type"            "predicted"      
##  [4] "err.rate"        "confusion"       "votes"          
##  [7] "oob.times"       "classes"         "importance"     
## [10] "importanceSD"    "localImportance" "proximity"      
## [13] "ntree"           "mtry"            "forest"         
## [16] "y"               "test"            "inbag"          
## [19] "terms"

plots

Lets make some plots of the most important variables in our model. FOr a classification tree, variable importance is measured by mean decrease in GINI coefficient (measure of node purity) due to that variable

# Make a data frame with predictor names and their importance
imp <- importance(erie.classify)
imp <- data.frame(predictors = rownames(imp), imp)

# Order the predictor levels by importance
imp.sort <- arrange(imp, desc(MeanDecreaseGini))
imp.sort$predictors <- factor(imp.sort$predictors, levels = imp.sort$predictors)

# Select the top 10 predictors
imp.20 <- imp.sort[1:20, ]


# ggplot
ggplot(imp.20, aes(x = predictors, y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "indianred") +
  coord_flip() +
  ggtitle("Most important OTUs for classifying Erie samples\n into nearshore or offshore")

# What are those OTUs?
otunames <- imp.20$predictors
r <- rownames(tax_table(sites.scale)) %in% otunames
kable(tax_table(sites.scale)[r, ])
Kingdom Phylum Class Order Family Genus
Otu00002 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Comamonadaceae Limnohabitans
Otu00048 Bacteria Bacteroidetes Cytophagia Cytophagales Cytophagaceae Pseudarcicella
Otu00130 Bacteria Bacteroidetes Sphingobacteriia Sphingobacteriales Chitinophagaceae unclassified
Otu00137 Bacteria Proteobacteria Betaproteobacteria TRA3-20 unclassified unclassified
Otu00141 Bacteria Verrucomicrobia Spartobacteria Chthoniobacterales unclassified unclassified
Otu00185 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Verrucomicrobiaceae unclassified
Otu00190 Bacteria Actinobacteria Acidimicrobiia Acidimicrobiales Acidimicrobiaceae CL500-29_marine_group
Otu00206 Bacteria Proteobacteria Gammaproteobacteria Chromatiales Chromatiaceae Rheinheimera
Otu00217 Bacteria Bacteroidetes Sphingobacteriia Sphingobacteriales PHOS-HE51 unclassified
Otu00221 Bacteria Verrucomicrobia Opitutae Opitutales Opitutaceae unclassified
Otu00260 Bacteria Bacteroidetes Cytophagia Cytophagales Cyclobacteriaceae Algoriphagus
Otu00262 Bacteria Bacteroidetes Flavobacteriia Flavobacteriales Flavobacteriaceae Flavobacterium
Otu00307 Bacteria Actinobacteria Actinobacteria Micrococcales Microbacteriaceae MWH-Ta3
Otu00418 Bacteria Bacteroidetes unclassified unclassified unclassified unclassified
Otu00427 Bacteria Actinobacteria Acidimicrobiia Acidimicrobiales Acidimicrobiaceae CL500-29_marine_group
Otu00483 Bacteria Proteobacteria unclassified unclassified unclassified unclassified
Otu00485 Bacteria Proteobacteria Alphaproteobacteria Sphingomonadales Sphingomonadaceae Sandarakinorhabdus
Otu00562 Bacteria Bacteroidetes Sphingobacteriia Sphingobacteriales Chitinophagaceae Ferruginibacter
Otu00725 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Verrucomicrobiaceae Prosthecobacter
Otu00729 Bacteria Actinobacteria Actinobacteria Micrococcales Microbacteriaceae Candidatus_Aquiluna

Now we could follow up with this analysis by making some plots of each of these OTUs at our two sites over time and looking into the literature to see whats known about them