Lecture: Exploratory analysis of microbiome data in R/Bioconductor

Levi Waldron

June 30, 2017

Outline

Statistical properties of metagenomic data
Distances for high dimensional data
Principal Components and Principal Coordinates Analysis
Alpha diversity

Properties of processed microbiome data

taxonomic data for kingdom, phylum or division, class, order, family, genus, (species)
- hundreds of rows
inferred metabolic function
- hundreds of rows
calculated alpha diversity

Properties of processed microbiome data

“count-ish” data, minimum is zero
- MetaPhlAn2 data are not actually integer counts
non-normal
- highly skewed (over-dispersed)
- often has a lot of zero values
samples differ in extraction and amplification efficiency, read depth
- counts do not provide absolute microbial abundance
- we can only infer relative abundance

Example: Rampelli Africa dataset

Rampelli S et al.: Metagenome Sequencing of the Hadza Hunter-Gatherer Gut Microbiota. Curr. Biol. 2015, 25:1682–1693.

suppressPackageStartupMessages(library(phyloseq))
Rampelli

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 727 taxa and 38 samples ]
## sample_data() Sample Data:       [ 38 samples by 18 sample variables ]
## tax_table()   Taxonomy Table:    [ 727 taxa by 8 taxonomic ranks ]

ab <- otu_table(Rampelli)
ab[1:5, 1:4]

## OTU Table:          [5 taxa and 4 samples]
##                      taxa are rows
##                         H1       H2       H3       H4
## k__Bacteria       99.01042 72.28436 99.91570 99.54461
## k__Archaea         0.98958  0.00000  0.08430  0.45539
## p__Firmicutes     55.91770 54.45349 76.49131 82.69539
## p__Bacteroidetes  29.17228 11.65402 22.45855 15.36970
## p__Proteobacteria 10.63735  5.43092  0.12856  0.18853

summary(ab[, 1:4])

##        H1               H2              H3               H4        
##  Min.   : 0.000   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.: 0.00   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 0.000   Median : 0.00   Median : 0.000   Median : 0.000  
##  Mean   : 1.076   Mean   : 1.09   Mean   : 1.098   Mean   : 1.087  
##  3rd Qu.: 0.000   3rd Qu.: 0.00   3rd Qu.: 0.000   3rd Qu.: 0.000  
##  Max.   :99.010   Max.   :72.28   Max.   :99.916   Max.   :99.545

summary(sample_data(Rampelli))

##   subjectID          body_site         antibiotics_current_use
##  Length:38          Length:38          Length:38              
##  Class :character   Class :character   Class :character       
##  Mode  :character   Mode  :character   Mode  :character       
##                                                               
##                                                               
##                                                               
##                                                               
##  study_condition      disease               age        age_category      
##  Length:38          Length:38          Min.   : 8.00   Length:38         
##  Class :character   Class :character   1st Qu.:23.25   Class :character  
##  Mode  :character   Mode  :character   Median :30.00   Mode  :character  
##                                        Mean   :31.68                     
##                                        3rd Qu.:38.00                     
##                                        Max.   :70.00                     
##                                                                          
##     gender               BMI        country            location        
##  Length:38          Min.   : NA   Length:38          Length:38         
##  Class :character   1st Qu.: NA   Class :character   Class :character  
##  Mode  :character   Median : NA   Mode  :character   Mode  :character  
##                     Mean   :NaN                                        
##                     3rd Qu.: NA                                        
##                     Max.   : NA                                        
##                     NA's   :38                                         
##  non_westernized    DNA_extraction_kit  number_reads     
##  Length:38          Length:38          Min.   : 3740248  
##  Class :character   Class :character   1st Qu.: 9756842  
##  Mode  :character   Mode  :character   Median :15193451  
##                                        Mean   :22303876  
##                                        3rd Qu.:29293112  
##                                        Max.   :77841428  
##                                                          
##   number_bases       minimum_read_length median_read_length
##  Min.   :3.165e+08   Min.   :60          Min.   : 84.00    
##  1st Qu.:9.056e+08   1st Qu.:60          1st Qu.:100.00    
##  Median :1.462e+09   Median :60          Median :100.00    
##  Mean   :2.114e+09   Mean   :60          Mean   : 97.18    
##  3rd Qu.:2.788e+09   3rd Qu.:60          3rd Qu.:100.00    
##  Max.   :7.485e+09   Max.   :60          Max.   :101.00    
##                                                            
##  NCBI_accession    
##  Length:38         
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

Distances in high-dimensional data analysis

The importance of distance

High-dimensional data are complex and impossible to visualize in raw form

The importance of distance (cont’d)

Distances can simplify thousands of dimensions
Any clustering or classification of samples and/or features involves combining or identifying objects that are close or similar.
Distances or similarities are mathematical representations of what we mean by close or similar.
The choice of distance is a subject matter-specific, qualitative decision

Heatmap

plot_heatmap(Rampelli, method="PCoA", distance="bray")

Metrics and distances

A metric satisfies the following five properties:

non-negativity \(d(a, b) \ge 0\)
symmetry \(d(a, b) = d(b, a)\)
identification mark \(d(a, a) = 0\)
definiteness \(d(a, b) = 0\) if and only if \(a=b\)
triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)

A distance is only required to satisfy 1-3.
A similarity function satisfies 1-2, and increases as \(a\) and \(b\) become more similar
A dissimilarity function satisfies 1-2, and decreases as \(a\) and \(b\) become more similar

Euclidian distance (metric)

Remember grade school: Euclidean d = \(\sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}\).
Side note: also referred to as \(L_2\) norm

Euclidian distance in high dimensions

dim(otu_table(Rampelli))

## [1] 727  38

table(sample_data(Rampelli)$location)

## 
## Bologna Dedauko Sengele 
##      11      20       7

Interested in identifying similar samples and similar microbes

Euclidian distance in high dimensions

Points are no longer on the Cartesian plane,
instead they are in higher dimensions. For example:
- sample \(i\) is defined by a point in 727 dimensional space: \((Y_{1,i},\dots,Y_{727,i})^\top\).
- feature \(g\) is defined by a point in 38 dimensions \((Y_{g,38},\dots,Y_{g,38})^\top\)

Euclidian distance in high dimensions

Euclidean distance as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:

\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{727} (Y_{g,i}-Y_{g,j })^2 } \]

and the distance between two features \(h\) and \(g\) is:

\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{38} (Y_{h,i}-Y_{g,i})^2 } \]

3 sample example

ab <- otu_table(Rampelli)
dedauko1 <- ab[, 1]
dedauko2 <- ab[, 2]
bologna1 <- ab[, 29]
sqrt(sum((dedauko1 - dedauko2)^2))

## [1] 103.0065

sqrt(sum((dedauko1 - bologna1)^2))

## [1] 124.6045

3 sample example using dist()

dim(ab)

## [1] 727  38

(d <- dist(t(ab[, c(1, 2, 29)])))

##           H1       H2
## H2  103.0065         
## IT2 124.6045 124.1906

class(d)

## [1] "dist"

Alpha / Beta diversity measures

From Morgan and Huttenhower Human Microbiome Analysis

SVD

These examples describe the A) sequence counts and B) relative abundances of six taxa detected in three samples. C) A collector’s curve using a richness estimator approximates the relationship between the number of sequences drawn from each sample and the number of taxa expected to be present based on detected abundances. D) Alpha diversity captures both the organismal richness of a sample and the evenness of the organisms’ abundance distribution. E) Beta diversity represents the similarity (or difference) in organismal composition between samples.

Shannon Index alpha diversity: \(H' = -\sum_{i=1}^{S} \left( p_i ln(p_i) \right )\)
Beta diversity: \(\beta = (n_1 - c) + (n_2 - c)\)

Beta diversity / dissimilarity

E.g. Bray-Curtis dissimilarity between all pairs of samples:

plot(hclust(phyloseq::distance(Rampelli, method="bray")), 
     main="Bray-Curtis Dissimilarity", xlab="", sub = "")

Dozens of distance measures are available
- see ?phyloseq::distance and ?phyloseq::distanceMethodList

Dimension reduction and PCA

Motivation for dimension reduction

Simulate the heights of twin pairs:

suppressPackageStartupMessages(library(MASS))
set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0,0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)

## [1]   2 100

cor(t(y))

##           [,1]      [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000

Motivation for dimension reduction

Not much loss of height differences when just using average heights of twin pairs.
- because twin heights are highly correlated

Singular Value Decomposition (SVD)

SVD generalizes the example rotation we looked at:

\[\mathbf{Y} = \mathbf{UDV}^\top\]

note: the above formulation is for \(m > n\)

Singular Value Decomposition (SVD)

\(\mathbf{Y}\): the m rows x n cols matrix of measurements
\(\mathbf{U}\): m x n matrix relating original scores to PCA scores (loadings)
\(\mathbf{D}\): n x n diagonal matrix (eigenvalues)
\(\mathbf{V}\): n × n orthogonal matrix (eigenvectors or PCA scores)
- orthogonal = unit length and “perpendicular” in 3-D

SVD of Rampelli dataset

Scaling:

e.standardize.fast <- t(scale(t(ab), scale=FALSE))

SVD:

s <- svd(e.standardize.fast)
names(s)

## [1] "d" "u" "v"

SVD of Rampelli dataset

dim(s$u)     # loadings

## [1] 727  38

length(s$d)  # eigenvalues

## [1] 38

dim(s$v)     # d %*% vT = scores

## [1] 38 38

SVD vs. prcomp()

rafalib::mypar()
p <- prcomp( t(e.standardize.fast) )
plot(s$u[, 1] ~ p$rotation[, 1])

Note: u and v can each be multiplied by -1 arbitrarily

PCA interpretation: loadings

\(\mathbf{U}\) (loadings): relate the principal component axes to the original variables
- think of principal component axes as a weighted combination of original axes

PCA interpretation: loadings

plot(p$rotation[, 1], xlab="Index of taxa", ylab="Loadings of PC1", 
     main="Loadings of PC1") #or, predict(p)
abline(h=c(-0.2, 0.2), col="red")

PCA interpretation: loadings

Taxa with high PC1 loadings
- note look at taxanomic relationship of rows
- should probably prune to one taxanomic level and re-do

e.pc1taxa <- ab[p$rotation[, 1] < -0.2 | p$rotation[, 1] > 0.2, ]
pheatmap::pheatmap(e.pc1taxa, scale="none", show_rownames=TRUE, 
                  show_colnames = FALSE)

PCA interpretation: eigenvalues

\(\mathbf{D}\) (eigenvalues): standard deviation scaling factor that each decomposed variable is multiplied by.

PCA interpretation: eigenvalues

Alternatively as cumulative % variance explained (using cumsum() function):

PCA interpretation: scores

\(\mathbf{V}\) (scores): The “datapoints” in the reduced prinipal component space
In some implementations (like prcomp()), scores are \(\mathbf{D V^T}\)

PCA interpretation: scores

Principal Coordinates Analysis (PCoA)

also referred to as Multi-dimensional Scaling (MDS) or ordination
a reduced SVD, performed on a distance matrix
identify two (or more) eigenvalues/vectors that preserve distances
Advantage
- PCA is limited to Euclidian distance, which is less useful for ecological data
- compatible with any dissimilarity measure
Disadvantage
- no loadings to help with interpretation

Principal Coordinates Analysis (cont’d)

Available methods are “DCA”, “CCA”, “RDA”, “CAP”, “DPCoA”, “NMDS”, “MDS”, “PCoA”

Not much “horseshoe” effect here.

Within-sample diversity

Alpha diversity estimates

Look at ?phyloseq::estimate_richness
Supported measures of alpha diversity are:
- “Observed”, “Chao1”, “ACE”, “Shannon”, “Simpson”, “InvSimpson”, “Fisher”
- more information from vegan package

Note, you can ignore warning about singletons when performing this analysis.

Lecture: Exploratory analysis of microbiome data in R/Bioconductor

Outline

Properties of processed microbiome data

Properties of processed microbiome data

Example: Rampelli Africa dataset

Distances in high-dimensional data analysis

The importance of distance

The importance of distance (cont’d)

Heatmap

Metrics and distances

Euclidian distance (metric)

Euclidian distance in high dimensions

Euclidian distance in high dimensions

Euclidian distance in high dimensions

3 sample example

3 sample example using dist()

Alpha / Beta diversity measures

Beta diversity / dissimilarity

Dimension reduction and PCA

Motivation for dimension reduction

Motivation for dimension reduction

Motivation for dimension reduction

Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD)

SVD of Rampelli dataset

SVD of Rampelli dataset

SVD vs. prcomp()

PCA interpretation: loadings

PCA interpretation: loadings

PCA interpretation: loadings

PCA interpretation: eigenvalues

PCA interpretation: eigenvalues

PCA interpretation: scores

PCA interpretation: scores

Principal Coordinates Analysis (PCoA)

Principal Coordinates Analysis (cont’d)

Within-sample diversity

Alpha diversity estimates

Comparison of alpha diversity estimates

Links