Lecture: Exploratory analysis of microbiome data in R/Bioconductor

Levi Waldron

June 6, 2018

Outline

Properties of processed microbiome data

Properties of processed microbiome data

Example: Rampelli Africa dataset

Rampelli S et al.: Metagenome Sequencing of the Hadza Hunter-Gatherer Gut Microbiota. Curr. Biol. 2015, 25:1682–1693.

suppressPackageStartupMessages(library(phyloseq))
Rampelli
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 727 taxa and 38 samples ]
## sample_data() Sample Data:       [ 38 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 727 taxa by 8 taxonomic ranks ]
ab <- otu_table(Rampelli)
ab[1:5, 1:4]
## OTU Table:          [5 taxa and 4 samples]
##                      taxa are rows
##                         H1      H10      H11      H12
## k__Bacteria       99.01042 84.50315 67.44338 74.98205
## k__Archaea         0.98958  0.18360  0.28721  0.20239
## p__Firmicutes     55.91770 37.15785 27.61193 30.97439
## p__Bacteroidetes  29.17228 24.42213 16.44710 21.01599
## p__Proteobacteria 10.63735 15.06530 17.95200 16.36751
summary(ab[, 1:4])
##        H1              H10                H11                H12        
##  Min.   : 0.000   Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.000  
##  Median : 0.000   Median : 0.00000   Median : 0.00000   Median : 0.000  
##  Mean   : 1.076   Mean   : 1.08239   Mean   : 1.06046   Mean   : 1.067  
##  3rd Qu.: 0.000   3rd Qu.: 0.02908   3rd Qu.: 0.01579   3rd Qu.: 0.000  
##  Max.   :99.010   Max.   :84.50315   Max.   :67.44338   Max.   :74.982
summary(sample_data(Rampelli))
##   subjectID          body_site         antibiotics_current_use
##  Length:38          Length:38          Length:38              
##  Class :character   Class :character   Class :character       
##  Mode  :character   Mode  :character   Mode  :character       
##                                                               
##                                                               
##                                                               
##  study_condition         age        age_category          gender         
##  Length:38          Min.   : 8.00   Length:38          Length:38         
##  Class :character   1st Qu.:23.25   Class :character   Class :character  
##  Mode  :character   Median :30.00   Mode  :character   Mode  :character  
##                     Mean   :31.68                                        
##                     3rd Qu.:38.00                                        
##                     Max.   :70.00                                        
##    country            location         non_westernized   
##  Length:38          Length:38          Length:38         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  DNA_extraction_kit  number_reads       number_bases      
##  Length:38          Min.   : 3740248   Min.   :3.165e+08  
##  Class :character   1st Qu.: 9756842   1st Qu.:9.056e+08  
##  Mode  :character   Median :15193451   Median :1.462e+09  
##                     Mean   :22303876   Mean   :2.114e+09  
##                     3rd Qu.:29293112   3rd Qu.:2.788e+09  
##                     Max.   :77841428   Max.   :7.485e+09  
##  minimum_read_length median_read_length   curator         
##  Min.   :60          Min.   : 84.00     Length:38         
##  1st Qu.:60          1st Qu.:100.00     Class :character  
##  Median :60          Median :100.00     Mode  :character  
##  Mean   :60          Mean   : 97.18                       
##  3rd Qu.:60          3rd Qu.:100.00                       
##  Max.   :60          Max.   :101.00                       
##  NCBI_accession    
##  Length:38         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Distances in high-dimensional data analysis

The importance of distance

animals

The importance of distance (cont’d)

Heatmap

plot_heatmap(Rampelli, method="PCoA", distance="bray")

Metrics and distances

A metric satisfies the following five properties:

  1. non-negativity \(d(a, b) \ge 0\)
  2. symmetry \(d(a, b) = d(b, a)\)
  3. identification mark \(d(a, a) = 0\)
  4. definiteness \(d(a, b) = 0\) if and only if \(a=b\)
  5. triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)

Euclidian distance (metric)

Euclidian distance in high dimensions

dim(otu_table(Rampelli))
## [1] 727  38
table(sample_data(Rampelli)$location)
## 
## Bologna Dedauko Sengele 
##      11      20       7

Interested in identifying similar samples and similar microbes

Euclidian distance in high dimensions

Euclidian distance in high dimensions

Euclidean distance as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:

\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{727} (Y_{g,i}-Y_{g,j })^2 } \]

and the distance between two features \(h\) and \(g\) is:

\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{38} (Y_{h,i}-Y_{g,i})^2 } \]

3 sample example

ab <- otu_table(Rampelli)
dedauko1 <- ab[, 1]
dedauko2 <- ab[, 2]
bologna1 <- ab[, 29]
sqrt(sum((dedauko1 - dedauko2)^2))
## [1] 70.08112
sqrt(sum((dedauko1 - bologna1)^2))
## [1] 131.1257

3 sample example using dist()

dim(ab)
## [1] 727  38
(d <- dist(t(ab[, c(1, 2, 29)])))
##             H1       H10
## H10   70.08112          
## IT11 131.12574 135.38449
class(d)
## [1] "dist"

Alpha / Beta diversity measures

From Morgan and Huttenhower Human Microbiome Analysis

SVD

These examples describe the A) sequence counts and B) relative abundances of six taxa detected in three samples. C) A collector’s curve using a richness estimator approximates the relationship between the number of sequences drawn from each sample and the number of taxa expected to be present based on detected abundances. D) Alpha diversity captures both the organismal richness of a sample and the evenness of the organisms’ abundance distribution. E) Beta diversity represents the similarity (or difference) in organismal composition between samples.

Beta diversity / dissimilarity

E.g. Bray-Curtis dissimilarity between all pairs of samples:

plot(hclust(phyloseq::distance(Rampelli, method="bray")), 
     main="Bray-Curtis Dissimilarity", xlab="", sub = "")

Dimension reduction and PCA

Motivation for dimension reduction

Simulate the heights of twin pairs:

suppressPackageStartupMessages(library(MASS))
set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0,0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)
## [1]   2 100
cor(t(y))
##           [,1]      [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000

Motivation for dimension reduction

Motivation for dimension reduction

Singular Value Decomposition (SVD)

SVD generalizes the example rotation we looked at:

\[\mathbf{Y} = \mathbf{UDV}^\top\]

SVD

Singular Value Decomposition (SVD)

SVD

SVD of Rampelli dataset

Scaling:

e.standardize.fast <- t(scale(t(ab), scale=FALSE))

SVD:

s <- svd(e.standardize.fast)
names(s)
## [1] "d" "u" "v"

SVD of Rampelli dataset

dim(s$u)     # loadings
## [1] 727  38
length(s$d)  # eigenvalues
## [1] 38
dim(s$v)     # d %*% vT = scores
## [1] 38 38
SVD

SVD vs. prcomp()

rafalib::mypar()
p <- prcomp( t(e.standardize.fast) )
plot(s$u[, 1] ~ p$rotation[, 1])

Note: u and v can each be multiplied by -1 arbitrarily

PCA interpretation: loadings

SVD

PCA interpretation: loadings

plot(p$rotation[, 1], xlab="Index of taxa", ylab="Loadings of PC1", 
     main="Loadings of PC1") #or, predict(p)
abline(h=c(-0.2, 0.2), col="red")

PCA interpretation: loadings

e.pc1taxa <- ab[p$rotation[, 1] < -0.2 | p$rotation[, 1] > 0.2, ]
pheatmap::pheatmap(e.pc1taxa, scale="none", show_rownames=TRUE, 
                  show_colnames = FALSE)

PCA interpretation: eigenvalues

PCA interpretation: eigenvalues

Alternatively as cumulative % variance explained (using cumsum() function):

PCA interpretation: scores

SVD

PCA interpretation: scores

Principal Coordinates Analysis (PCoA)

Principal Coordinates Analysis (cont’d)

Not much “horseshoe” effect here.

Within-sample diversity

Alpha diversity estimates

Note, you can ignore warning about singletons when performing this analysis.

Comparison of alpha diversity estimates