An Introduction to Analysis of Microbiome Data

Levi Waldron

May 22, 2017

Outline

Types and properties of microbiome data
Exploratory data analysis
Log-linear regression for count data

Generating microbiome data

From Morgan and Huttenhower, Human Microbiome Analysis, https://doi.org/10.1371/journal.pcbi.1002808

16S rRNA profiling

Pros:

cheap (multiplex hundreds of samples)
relatively small data
genus-level taxonomy and inferred metabolic function for bacteria and archaea

Cons:

taxonomy reliable only to genus level
indirect inference of metabolic function
use of a single marker gene is susceptible to biases

Bioinformatics pipeline of choice: QIIME

Whole metagenome shotgun sequencing (WMS)

Pros:

taxonomy to species and even strain
viruses and fungi
gene variants, e.g. ABX resistance
use of many marker genes is less susceptible to biases
more direct + precise functional inference

Cons:

expensive – probably no multiplexing
contamination from human DNA
gigabytes of sequence data per sample (before processing)

Bioinformatics pipeline of choice: Biobakery (MetaPhlan2, HUMAnN2)

Properties of processed microbiome data

taxonomic data for kingdom, phylum or division, class, order, family, genus, (species)
- hundreds of rows
inferred metabolic function
- hundreds of rows
calculated alpha diversity

Properties of processed microbiome data

count data, minimum is zero
- MetaPhlAn2 data are not actually integer counts
non-normal
- highly skewed (over-dispersed)
- often has a lot of zero values
samples differ in extraction and amplification efficiency, read depth
- counts do not provide absolute microbial abundance
- we can only infer relative abundance

Example: Rampelli Africa dataset

Rampelli S et al.: Metagenome Sequencing of the Hadza Hunter-Gatherer Gut Microbiota. Curr. Biol. 2015, 25:1682–1693.

library(curatedMetagenomicData)
eset <- RampelliS_2015.metaphlan_bugs_list.stool()
library(phyloseq)
Rampelli = ExpressionSet2phyloseq(eset)

Quick look: Rampelli Africa dataset

Rampelli

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 727 taxa and 38 samples ]
## sample_data() Sample Data:       [ 38 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 727 taxa by 8 taxonomic ranks ]

otu_table(Rampelli)[1:5, 1:4]

## OTU Table:          [5 taxa and 4 samples]
##                      taxa are rows
##                         H1      H10      H11      H12
## k__Bacteria       99.01042 84.50315 67.44338 74.98205
## k__Archaea         0.98958  0.18360  0.28721  0.20239
## p__Firmicutes     55.91770 37.15785 27.61193 30.97439
## p__Bacteroidetes  29.17228 24.42213 16.44710 21.01599
## p__Proteobacteria 10.63735 15.06530 17.95200 16.36751

summary(otu_table(Rampelli)[, 1:4])

##        H1              H10                H11                H12        
##  Min.   : 0.000   Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.000  
##  Median : 0.000   Median : 0.00000   Median : 0.00000   Median : 0.000  
##  Mean   : 1.076   Mean   : 1.08239   Mean   : 1.06046   Mean   : 1.067  
##  3rd Qu.: 0.000   3rd Qu.: 0.02908   3rd Qu.: 0.01579   3rd Qu.: 0.000  
##  Max.   :99.010   Max.   :84.50315   Max.   :67.44338   Max.   :74.982

summary(sample_data(Rampelli))

##   subjectID              age           gender              camp          
##  Length:38          Min.   : 8.00   Length:38          Length:38         
##  Class :character   1st Qu.:23.25   Class :character   Class :character  
##  Mode  :character   Median :30.00   Mode  :character   Mode  :character  
##                     Mean   :31.68                                        
##                     3rd Qu.:38.00                                        
##                     Max.   :70.00                                        
##    country            disease            bodysite        
##  Length:38          Length:38          Length:38         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##   number_reads     
##  Min.   : 3740248  
##  1st Qu.: 9756842  
##  Median :15193451  
##  Mean   :22303876  
##  3rd Qu.:29293112  
##  Max.   :77841428

Heatmap

plot_heatmap(Rampelli, method="PCoA", distance="bray")

Barplot

par(mar = c(18, 4, 0, 0) + 0.1) # make more room on bottom margin
barplot(sort(taxa_sums(Rampelli), TRUE)[1:30]/nsamples(Rampelli), las=2)

Subsetting and pruning

head(tax_table(Rampelli))

## Taxonomy Table:     [6 taxa by 8 taxonomic ranks]:
##                   Kingdom    Phylum           Class Order Family Genus
## k__Bacteria       "Bacteria" NA               NA    NA    NA     NA   
## k__Archaea        "Archaea"  NA               NA    NA    NA     NA   
## p__Firmicutes     "Bacteria" "Firmicutes"     NA    NA    NA     NA   
## p__Bacteroidetes  "Bacteria" "Bacteroidetes"  NA    NA    NA     NA   
## p__Proteobacteria "Bacteria" "Proteobacteria" NA    NA    NA     NA   
## p__Spirochaetes   "Bacteria" "Spirochaetes"   NA    NA    NA     NA   
##                   Species Strain
## k__Bacteria       NA      NA    
## k__Archaea        NA      NA    
## p__Firmicutes     NA      NA    
## p__Bacteroidetes  NA      NA    
## p__Proteobacteria NA      NA    
## p__Spirochaetes   NA      NA

Species plus strain-level taxonomy:

(Rampelli.sp_strain = subset_taxa(Rampelli, !is.na(Species)))

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 486 taxa and 38 samples ]
## sample_data() Sample Data:       [ 38 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 486 taxa by 8 taxonomic ranks ]

Keep phylum-level taxonomy only:

taxonomy.level = apply(tax_table(Rampelli), 1, function(x) sum(!is.na(x)))
Rampelli.phy = prune_taxa(taxonomy.level==2, Rampelli)
taxa_names(Rampelli.phy)

##  [1] "p__Firmicutes"                  "p__Bacteroidetes"              
##  [3] "p__Proteobacteria"              "p__Spirochaetes"               
##  [5] "p__Euryarchaeota"               "p__Actinobacteria"             
##  [7] "p__Viruses_noname"              "p__Cyanobacteria"              
##  [9] "p__Verrucomicrobia"             "p__Deinococcus_Thermus"        
## [11] "p__Eukaryota_noname"            "p__Candidatus_Saccharibacteria"
## [13] "p__Acidobacteria"               "p__Microsporidia"              
## [15] "p__Fusobacteria"                "p__Tenericutes"

Advanced subsetting

Keep taxa only if they are in the most abundant 10% of taxa in at least 10 samples:

f1<- filterfun_sample(topp(0.1))
subs <- genefilter_sample(Rampelli.sp_strain, f1, A=10)
Rampelli2 <- subset_taxa(Rampelli.sp_strain, subs)
plot_heatmap(Rampelli2, method="PCoA", distance="bray")

More help here.

Distances in high dimensions

High-dimensional data are complex and impossible to visualize in raw form
- Thousands of dimensions, we can only visualize 2-3
Distances can simplify thousands of dimensions
Any clustering or classification of samples and/or features involves combining or identifying objects that are close or similar.
Distances or similarities are mathematical representations of what we mean by close or similar.
The choice of distance is a subject matter-specific, qualitative decision

Alpha / Beta diversity measures

From Morgan and Huttenhower Human Microbiome Analysis

SVD

These examples describe the A) sequence counts and B) relative abundances of six taxa detected in three samples. C) A collector’s curve using a richness estimator approximates the relationship between the number of sequences drawn from each sample and the number of taxa expected to be present based on detected abundances. D) Alpha diversity captures both the organismal richness of a sample and the evenness of the organisms’ abundance distribution. E) Beta diversity represents the similarity (or difference) in organismal composition between samples.

Shannon Index alpha diversity: $H' = -\sum_{i=1}^{S} \left( p_i ln(p_i) \right )$
Beta diversity: $\beta = (n_1 - c) + (n_2 - c)$

Alpha diversity estimates

Look at ?phyloseq::estimate_richness
Supported measures of alpha diversity are:
- “Observed”, “Chao1”, “ACE”, “Shannon”, “Simpson”, “InvSimpson”, “Fisher”
- more information from vegan package

Note, you can ignore warning about singletons:

Rampelli.counts <- ExpressionSet2phyloseq(eset, relab=FALSE)
alpha_meas = c("Shannon", "Simpson", "InvSimpson")
(p <- plot_richness(Rampelli.counts, "gender", "camp", measures=alpha_meas))

Comparison of alpha diversity estimates

alphas = estimate_richness(Rampelli.counts, measures=alpha_meas)

## Warning in estimate_richness(Rampelli.counts, measures = alpha_meas): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

pairs(alphas)

Beta diversity / dissimilarity

E.g. Bray-Curtis dissimilarity between all pairs of samples:

plot(hclust(phyloseq::distance(Rampelli, method="bray")), 
     main="Bray-Curtis Dissimilarity", xlab="", sub = "")

Dozens of distance measures are available
- see ?phyloseq::distance and ?phyloseq::distanceMethodList

Ordination

Because Euclidian distance isn’t preferred, usually don’t use PCA
Most commonly use Principal Coordinates Analysis
- preserves distances in a low-dimensional projection

ord = ordinate(Rampelli, method="PCoA", distance="bray")
plot_ordination(Rampelli, ord, color="camp", shape="camp") + 
  ggplot2::ggtitle("Bray-Curtis Principal Coordinates Analysis")

Available methods are “DCA”, “CCA”, “RDA”, “CAP”, “DPCoA”, “NMDS”, “MDS”, “PCoA”

Not much “horseshoe” effect here.

Linear modeling for metagenomic data: Two main approaches (1)

normalizing transformation, orinary linear modeling
- calculate relative abundance, dividing by the total number of counts for each sample (account for different sequencing depths)
- variance-stabilizing transformation of features, arcsin(sqrt(x))

Advantages
- simplicity: can directly use PCA, linear models, non-parametric tests
Disadvantages
- data may still not be very normally distributed
- regression coefficients for arcsin-sqrt transformed data not easily interpretable

Two main approaches (2)

treat as count data, log-linear generalized linear model (GLM)
- log-linear systematic component
- typically negative binomially-distributed random component
- model can include an “offset” term to account for different sequencing depths

Advantages
- GLM framework provides great flexibility to deal with sequencing depth, over-dispersion
- coefficients are readily interpretable in “multiplicative” models
- phyloseq and DESeq2 packages simplify the process
Disadvantages
- models are more complicated to understand

Multiple Linear Regression Model (approach 1)

Systematic part of model:

\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]

$E[y|x]$ is the expected value of $y$ given $x$
$y$ is the outcome, response, or dependent variable
$x$ is the vector of predictors / independent variables
$x_p$ are the individual predictors or independent variables
$\beta_p$ are the regression coefficients

Multiple Linear Regression Model (cont’d)

Random part of model:

$y_i = E[y_i|x_i] + \epsilon_i$

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i$

$x_{pi}$ is the value of predictor $x_j$ for observation $i$

Assumption: $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)$

Normal distribution of $\epsilon_i$
Mean zero at every value of predictors
Constant variance at every value of predictors
Observations are independent

Log-linear models

Systematic component is:

\[ log(E[y|x_i]) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]

Or equivalently: \[ E[y|x_i] = exp \left( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \right) \]

where $E[y|x_i]$ is the expected number of counts for a microbe in subject i

Random component:

$y_i$ is typically assumed to be Poisson or Negative Binomal distributed.

Note: Modeling $log(E[y|x_i])$ is not equivalent to modeling $E(log(y|x_i))$

Compare Poisson vs. Negative Binomial

Negative Binomial Distribution has two parameters: # of trials n, and probability of success p

Additive vs. Multiplicative models

Linear regression is an additive model
- e.g. for two binary variables $\beta_1 = 1.5$, $\beta_2 = 1.5$.
- If $x_1=1$ and $x_2=1$, this adds 3.0 to $E(y|x)$
Log-linear models are multiplicative:
- If $x_1=1$ and $x_2=1$, this adds 3.0 to $log(E[y_i])$
- Expected count increases 20-fold: $exp(1.5+1.5)$ or $exp(1.5) * exp(1.5)$
- Coefficients are invariant to multiplicative scaling of the data

This is a very important distinction!

Feature and sample QC

In this example, unecessary to remove low-read samples or taxa:

Rampelli.counts

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 727 taxa and 38 samples ]
## sample_data() Sample Data:       [ 38 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 727 taxa by 8 taxonomic ranks ]

prune_samples(sample_sums(Rampelli.counts) > 5e7, Rampelli.counts)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 727 taxa and 34 samples ]
## sample_data() Sample Data:       [ 34 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 727 taxa by 8 taxonomic ranks ]

prune_taxa(taxa_sums(Rampelli.counts) > 1e3, Rampelli.counts)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 706 taxa and 38 samples ]
## sample_data() Sample Data:       [ 38 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 706 taxa by 8 taxonomic ranks ]

Conversion to DESeq2

More help on converting to DESeq2 from various formats here.

dds.data = phyloseq_to_deseq2(Rampelli.counts, ~country)

## converting counts to integer mode

Note: better to use normalized count data than relative abundance

Negative Binomial log-linear model with DESeq2

Excellent DESeq2 manual here or vignettes(package="DESeq2")

dds = DESeq(dds.data)
res = results(dds)
res = res[order(res$padj, na.last=NA), ]
alpha = 0.01
sigtab = res[(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), 
               as(tax_table(Rampelli)[rownames(sigtab), ], "matrix"))

head(sigtab)

##                           baseMean log2FoldChange    lfcSE     stat
## s__Prevotella_stercorea 1026450.69       22.49632 1.069746 21.02960
## t__GCF_000235885        1026450.69       22.49632 1.069746 21.02960
## p__Spirochaetes          572913.49       21.64745 1.176744 18.39606
## c__Spirochaetia          572913.49       21.64745 1.176744 18.39606
## o__Spirochaetales        572913.49       21.64745 1.176744 18.39606
## g__Catenibacterium        51752.17       30.00000 1.670805 17.95542
##                               pvalue         padj  Kingdom        Phylum
## s__Prevotella_stercorea 3.516377e-98 8.087668e-96 Bacteria Bacteroidetes
## t__GCF_000235885        3.516377e-98 8.087668e-96 Bacteria Bacteroidetes
## p__Spirochaetes         1.412718e-75 1.299700e-73 Bacteria  Spirochaetes
## c__Spirochaetia         1.412718e-75 1.299700e-73 Bacteria  Spirochaetes
## o__Spirochaetales       1.412718e-75 1.299700e-73 Bacteria  Spirochaetes
## g__Catenibacterium      4.353075e-72 2.503018e-70 Bacteria    Firmicutes
##                                    Class              Order
## s__Prevotella_stercorea      Bacteroidia      Bacteroidales
## t__GCF_000235885             Bacteroidia      Bacteroidales
## p__Spirochaetes                     <NA>               <NA>
## c__Spirochaetia             Spirochaetia               <NA>
## o__Spirochaetales           Spirochaetia     Spirochaetales
## g__Catenibacterium      Erysipelotrichia Erysipelotrichales
##                                      Family           Genus
## s__Prevotella_stercorea      Prevotellaceae      Prevotella
## t__GCF_000235885             Prevotellaceae      Prevotella
## p__Spirochaetes                        <NA>            <NA>
## c__Spirochaetia                        <NA>            <NA>
## o__Spirochaetales                      <NA>            <NA>
## g__Catenibacterium      Erysipelotrichaceae Catenibacterium
##                                      Species        Strain
## s__Prevotella_stercorea Prevotella_stercorea          <NA>
## t__GCF_000235885        Prevotella_stercorea GCF_000235885
## p__Spirochaetes                         <NA>          <NA>
## c__Spirochaetia                         <NA>          <NA>
## o__Spirochaetales                       <NA>          <NA>
## g__Catenibacterium                      <NA>          <NA>

Plot results

library("ggplot2")
theme_set(theme_bw())
sigtabgen = subset(sigtab, !is.na(Family))
# Phylum order
x = tapply(sigtabgen$log2FoldChange, sigtabgen$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Phylum = factor(as.character(sigtabgen$Phylum), levels=names(x))
# Family order
x = tapply(sigtabgen$log2FoldChange, sigtabgen$Family, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Family = factor(as.character(sigtabgen$Family), levels=names(x))
ggplot(sigtabgen, aes(y=Family, x=log2FoldChange, color=Phylum)) + 
  geom_vline(xintercept = 0.0, color = "gray", size = 0.5) +
  geom_point(size=6) + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))

Heatmap of differentially abundant taxa

select <- rownames(sigtab)
nt <- normTransform(dds) # defaults to log2(x+1)
log2.norm.counts <- assay(nt)[select, ]
df <- as.data.frame(colData(dds)[,c("country", "gender")])

pheatmap::pheatmap(log2.norm.counts, annotation_col=df, main="log2(counts + 1)")

MA plots

Fold-change vs. mean:

plotMA(res, main="Difference vs. Average")
legend("bottomright", legend="differentially abundant", lty=-1, pch=1, col="red", bty='n')

A note about Multiple Hypothesis Testing

DESeq2 uses “Independent Hypothesis Weighting” (see IHW package)
- like False Discovery Rate, but automatically down-weights hypotheses from low-abundance features
- may make “non-specific” filtering / pruning of low-abundance features
- in my experience, there aren’t enough features in microbiome data for it to work

Regression on ordination vectors and alpha diversity

Prepare a data.frame:

df = data.frame(country=sample_data(Rampelli)$country,
                Shannon=alphas$Shannon)
df = cbind(df, ord$vectors[, 1:5])
head(df)

##      country  Shannon      Axis.1     Axis.2      Axis.3       Axis.4
## H1  tanzania 3.688009 -0.11832297 0.07692996  0.09619371 -0.049329957
## H10 tanzania 3.952527 -0.19796867 0.09884072 -0.16478172 -0.097493755
## H11 tanzania 3.880499 -0.22202543 0.22287671 -0.25653163  0.006247430
## H12 tanzania 3.874827 -0.23351705 0.15462528 -0.22859742 -0.068984544
## H13 tanzania 3.598540 -0.05794346 0.05416069  0.13050213  0.005986638
## H14 tanzania 3.579617  0.01428806 0.09845835  0.12076997  0.012938378
##           Axis.5
## H1  -0.010983849
## H10 -0.063325316
## H11 -0.059823475
## H12 -0.058443856
## H13 -0.020542266
## H14 -0.005270389

par(mfrow=c(3,2))
for (i in 2:7){
  boxplot(df[, i] ~ df$country, main=colnames(df)[i])
}

curatedMetagenomicData pipeline

cMD

curatedMetagenomicData usage

One dataset from R:

curatedMetagenomicData("HMP_2012.metaphlan_bugs_list.stool")

Many datasets from R:

curatedMetagenomicData("HMP_2012.metaphlan_bugs_list.*")

Command-line: $ curatedMetagenomicData -p "HMP_2012.metaphlan_bugs_list.*"

Pasolli, Schiffer et al., bioRxiv 103085

Links

A built html version of this lecture is available.
The source R Markdown is also available from Github.