Statistical analysis for metagenomic data

Levi Waldron and Curtis Huttenhower

June 6-7, 2016

Outline

Learning Objectives

Properties of processed metagenomic data

Linear modeling for metagenomic data: Two main approaches (1)

  1. normalizing transformation, orinary linear modeling
    • calculate relative abundance, dividing by the total number of counts for each sample (account for different sequencing depths)
    • variance-stabilizing transformation of features, arcsin(sqrt(x))

Two main approaches (2)

  1. treat as count data, log-linear generalized linear model (GLM)
    • log-linear systematic component
    • typically negative binomially-distributed random component
    • model can include an “offset” term to account for different sequencing depths

Multiple Linear Regression Model (approach 1)

Systematic part of model:

\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]

Multiple Linear Regression Model (cont’d)

Random part of model:

\(y_i = E[y_i|x_i] + \epsilon_i\)

\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)

Assumption: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)

Generalized Linear Models

Components of GLM

Linear Regression as GLM

Log-linear models

Systematic component is:

\[ log(E[y|x_i]) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]

Or equivalently: \[ E[y|x_i] = exp \left( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \right) \]

where \(E[y|x_i]\) is the expected number of counts for a microbe in subject i

\(\epsilon_i\) is typically Poisson or Negative Binomal distributed.

Additive vs. Multiplicative models

This is a very important distinction!

Poisson model

Visualizing the Poisson Distribution

Negative binomial distribution

Visualizing the Negative Binomial Distribution

Compare Poisson vs. Negative Binomial

Negative Binomial Distribution has two parameters: # of trials n, and probability of success p

Demystifying error models

If there is evidence that the fixed parameters differ between two groups of interest, we say the results are statistically significant.

Zero-inflated models

Poisson Distribution with Zero Inflation

Lab 1: exploration

Example: Candela Africa dataset

Rampelli S et al.: Metagenome Sequencing of the Hadza Hunter-Gatherer Gut Microbiota. Curr. Biol. 2015, 25:1682–1693.

indat = read.delim("data/Candela_Africa_stool.txt")
inmetadat = read.delim("data/Candela_Africa_metadat.txt")
library(phyloseq)
source("https://raw.githubusercontent.com/waldronlab/EPIC2016/master/metaphlanToPhyloseq.R")
Candela = metaphlanToPhyloseq(indat, inmetadat, simplenames = TRUE, 
                              roundtointeger = FALSE)
summary(otu_table(Candela))
summary(sample_data(Candela))

Initial exploration

phyloseq help vignettes here.

Candela
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 3302 taxa and 38 samples ]
## sample_data() Sample Data:       [ 38 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 3302 taxa by 8 taxonomic ranks ]
Candela = prune_taxa(taxa_sums(Candela) > 0, Candela)
Candela
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 648 taxa and 38 samples ]
## sample_data() Sample Data:       [ 38 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 648 taxa by 8 taxonomic ranks ]

Subsetting and pruning

rank_names(Candela)
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"
## [8] "Strain"
subset_taxa(Candela, !is.na(Strain))
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 207 taxa and 38 samples ]
## sample_data() Sample Data:       [ 38 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 207 taxa by 8 taxonomic ranks ]
(Candela.sp_strain = subset_taxa(Candela, !is.na(Species)))
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 444 taxa and 38 samples ]
## sample_data() Sample Data:       [ 38 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 444 taxa by 8 taxonomic ranks ]
taxonomy.level = apply(tax_table(Candela), 1, function(x) sum(!is.na(x)))
Candela.phy = prune_taxa(taxonomy.level==2, Candela)
taxa_names(Candela.phy)
##  [1] "p__Euryarchaeota"    "p__Acidobacteria"    "p__Actinobacteria"  
##  [4] "p__Bacteroidetes"    "p__Cyanobacteria"    "p__Firmicutes"      
##  [7] "p__Fusobacteria"     "p__Proteobacteria"   "p__Spirochaetes"    
## [10] "p__Tenericutes"      "p__Verrucomicrobia"  "p__Eukaryota_noname"
## [13] "p__Microsporidia"

Advanced pruning

Keep taxa only if they are in the most abundant 10% of taxa in at least two samples:

f1<- filterfun_sample(topp(0.1))
pru <- genefilter_sample(Candela, f1, A=2)
summary(pru)
##    Mode   FALSE    TRUE    NA's 
## logical     455     193       0
subset_taxa(Candela, pru)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 193 taxa and 38 samples ]
## sample_data() Sample Data:       [ 38 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 193 taxa by 8 taxonomic ranks ]

More help here.

Heatmap

plot_heatmap(Candela.sp_strain, method="PCoA", distance="bray")

Barplot

par(mar = c(18, 4, 0, 0) + 0.1) # make more room on bottom margin
barplot(sort(taxa_sums(Candela.sp_strain), TRUE)[1:30]/nsamples(Candela.sp_strain), las=2)

Distances in high dimensions

Euclidian distance (metric)

Alpha / Beta diversity measures

From Morgan and Huttenhower Human Microbiome Analysis

SVD

These examples describe the A) sequence counts and B) relative abundances of six taxa detected in three samples. C) A collector’s curve using a richness estimator approximates the relationship between the number of sequences drawn from each sample and the number of taxa expected to be present based on detected abundances. D) Alpha diversity captures both the organismal richness of a sample and the evenness of the organisms’ abundance distribution. E) Beta diversity represents the similarity (or difference) in organismal composition between samples.

Alpha diversity estimates

Note, you can ignore warning about singletons:

Candela.int = Candela
otu_table(Candela.int) = round(otu_table(Candela)*1e4)
alpha_meas = c("Shannon", "Simpson", "InvSimpson")
(p <- plot_richness(Candela.int, "gender", "camp", measures=alpha_meas))

Comparison of alpha diversity estimates

alphas = estimate_richness(Candela.int, measures=alpha_meas)
pairs(alphas)

Beta diversity / dissimilarity

E.g. Bray-Curtis dissimilarity between all pairs of samples:

plot(hclust(phyloseq::distance(Candela, method="bray")), 
     main="Bray-Curtis Dissimilarity", xlab="", sub = "")

Ordination

ord = ordinate(Candela, method="PCoA", distance="bray")
plot_ordination(Candela, ord, color="camp", shape="camp") + 
  ggplot2::ggtitle("Bray-Curtis Principal Coordinates Analysis")

Not much “horseshoe” effect here.

Ordination (cont’d)

plot_scree(ord) + ggplot2::ggtitle("Screeplot")

Lab 2: regression

Sample selection

Not necessary, not evaluated:

Candela.pruned = subset_samples(Candela, country %in% c("italy", "tanzania"))
Candela.pruned = prune_samples(sample_sums(Candela.pruned) > 10, Candela.pruned)
Candela.pruned

Conversion to DESeq2

More help on converting to DESeq2 from various formats here.

dds.data = phyloseq_to_deseq2(Candela, ~country)
## converting counts to integer mode

Note: better to use normalized count data than relative abundance

Negative Binomial log-linear model with DESeq2

Excellent DESeq2 manual here or vignettes(package="DESeq2")

dds = DESeq(dds.data)
res = results(dds)
res = res[order(res$padj, na.last=NA), ]
alpha = 0.01
sigtab = res[(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), 
               as(tax_table(Candela)[rownames(sigtab), ], "matrix"))
head(sigtab)
##                                         baseMean log2FoldChange     lfcSE
## f__Bacteroidaceae                       4.980675      -6.015589 0.8326327
## g__Bacteroides                          4.980675      -6.015589 0.8326327
## f__Prevotellaceae                      33.286897       4.456109 0.6539972
## g__Prevotella                          33.215445       4.663931 0.6916855
## g__Phascolarctobacterium                8.684247       4.311203 0.6774475
## s__Phascolarctobacterium_succinatutens  8.684247       4.311203 0.6774475
##                                             stat       pvalue         padj
## f__Bacteroidaceae                      -7.224782 5.019067e-13 4.115635e-11
## g__Bacteroides                         -7.224782 5.019067e-13 4.115635e-11
## f__Prevotellaceae                       6.813652 9.515189e-12 5.201637e-10
## g__Prevotella                           6.742850 1.553097e-11 6.367698e-10
## g__Phascolarctobacterium                6.363892 1.967046e-10 4.608508e-09
## s__Phascolarctobacterium_succinatutens  6.363892 1.967046e-10 4.608508e-09
##                                         Kingdom        Phylum
## f__Bacteroidaceae                      Bacteria Bacteroidetes
## g__Bacteroides                         Bacteria Bacteroidetes
## f__Prevotellaceae                      Bacteria Bacteroidetes
## g__Prevotella                          Bacteria Bacteroidetes
## g__Phascolarctobacterium               Bacteria    Firmicutes
## s__Phascolarctobacterium_succinatutens Bacteria    Firmicutes
##                                                Class           Order
## f__Bacteroidaceae                        Bacteroidia   Bacteroidales
## g__Bacteroides                           Bacteroidia   Bacteroidales
## f__Prevotellaceae                        Bacteroidia   Bacteroidales
## g__Prevotella                            Bacteroidia   Bacteroidales
## g__Phascolarctobacterium               Negativicutes Selenomonadales
## s__Phascolarctobacterium_succinatutens Negativicutes Selenomonadales
##                                                    Family
## f__Bacteroidaceae                          Bacteroidaceae
## g__Bacteroides                             Bacteroidaceae
## f__Prevotellaceae                          Prevotellaceae
## g__Prevotella                              Prevotellaceae
## g__Phascolarctobacterium               Acidaminococcaceae
## s__Phascolarctobacterium_succinatutens Acidaminococcaceae
##                                                        Genus
## f__Bacteroidaceae                                       <NA>
## g__Bacteroides                                   Bacteroides
## f__Prevotellaceae                                       <NA>
## g__Prevotella                                     Prevotella
## g__Phascolarctobacterium               Phascolarctobacterium
## s__Phascolarctobacterium_succinatutens Phascolarctobacterium
##                                                                    Species
## f__Bacteroidaceae                                                     <NA>
## g__Bacteroides                                                        <NA>
## f__Prevotellaceae                                                     <NA>
## g__Prevotella                                                         <NA>
## g__Phascolarctobacterium                                              <NA>
## s__Phascolarctobacterium_succinatutens Phascolarctobacterium_succinatutens
##                                        Strain
## f__Bacteroidaceae                        <NA>
## g__Bacteroides                           <NA>
## f__Prevotellaceae                        <NA>
## g__Prevotella                            <NA>
## g__Phascolarctobacterium                 <NA>
## s__Phascolarctobacterium_succinatutens   <NA>

Bayesian estimation of dispersion

dds2 <- estimateSizeFactors(dds)
dds2 <- estimateDispersions(dds2)
plotDispEsts(dds2)

Plot results

library("ggplot2")
theme_set(theme_bw())
sigtabgen = subset(sigtab, !is.na(Family))
# Phylum order
x = tapply(sigtabgen$log2FoldChange, sigtabgen$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Phylum = factor(as.character(sigtabgen$Phylum), levels=names(x))
# Family order
x = tapply(sigtabgen$log2FoldChange, sigtabgen$Family, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Family = factor(as.character(sigtabgen$Family), levels=names(x))
ggplot(sigtabgen, aes(y=Family, x=log2FoldChange, color=Phylum)) + 
  geom_vline(xintercept = 0.0, color = "gray", size = 0.5) +
  geom_point(size=6) + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))

Correcting for gender as a potential confounder

table(sample_data(Candela)$country, sample_data(Candela)$gender)
##           
##            female male
##   italy         7    4
##   tanzania      9   18
dds.data2 = phyloseq_to_deseq2(Candela, ~country + gender)
dds2 = DESeq(dds.data)
resultsNames(dds2)
## [1] "Intercept"       "countryitaly"    "countrytanzania"

Correcting for gender as a potential confounder

## italy = numerator, tanzania = denominator
res2 = results(dds, contrast=c("country", "italy", "tanzania"))
res2 = res2[order(res$padj, na.last=NA), ]
alpha = 0.01
sigtab2 = res2[which(res2$padj < alpha), ]
sigtab2 = cbind(as(sigtab2, "data.frame"), as(tax_table(Candela)[rownames(sigtab2), ], "matrix"))
head(sigtab2)
##                                 baseMean log2FoldChange     lfcSE     stat
## p__Actinobacteria               3.339887       3.767790 0.6576299 5.729347
## c__Actinobacteria               3.339887       3.767790 0.6576299 5.729347
## o__Bifidobacteriales            2.495634       5.212101 0.9370110 5.562476
## f__Bifidobacteriaceae           2.495634       5.212101 0.9370110 5.562476
## g__Bifidobacterium              2.495634       5.212101 0.9370110 5.562476
## s__Bifidobacterium_adolescentis 1.182044       4.141045 1.0224334 4.050185
##                                       pvalue         padj  Kingdom
## p__Actinobacteria               1.008178e-08 1.271856e-07 Bacteria
## c__Actinobacteria               1.008178e-08 1.271856e-07 Bacteria
## o__Bifidobacteriales            2.659732e-08 2.726225e-07 Bacteria
## f__Bifidobacteriaceae           2.659732e-08 2.726225e-07 Bacteria
## g__Bifidobacterium              2.659732e-08 2.726225e-07 Bacteria
## s__Bifidobacterium_adolescentis 5.117714e-05 2.997518e-04 Bacteria
##                                         Phylum          Class
## p__Actinobacteria               Actinobacteria           <NA>
## c__Actinobacteria               Actinobacteria Actinobacteria
## o__Bifidobacteriales            Actinobacteria Actinobacteria
## f__Bifidobacteriaceae           Actinobacteria Actinobacteria
## g__Bifidobacterium              Actinobacteria Actinobacteria
## s__Bifidobacterium_adolescentis Actinobacteria Actinobacteria
##                                             Order             Family
## p__Actinobacteria                            <NA>               <NA>
## c__Actinobacteria                            <NA>               <NA>
## o__Bifidobacteriales            Bifidobacteriales               <NA>
## f__Bifidobacteriaceae           Bifidobacteriales Bifidobacteriaceae
## g__Bifidobacterium              Bifidobacteriales Bifidobacteriaceae
## s__Bifidobacterium_adolescentis Bifidobacteriales Bifidobacteriaceae
##                                           Genus
## p__Actinobacteria                          <NA>
## c__Actinobacteria                          <NA>
## o__Bifidobacteriales                       <NA>
## f__Bifidobacteriaceae                      <NA>
## g__Bifidobacterium              Bifidobacterium
## s__Bifidobacterium_adolescentis Bifidobacterium
##                                                      Species Strain
## p__Actinobacteria                                       <NA>   <NA>
## c__Actinobacteria                                       <NA>   <NA>
## o__Bifidobacteriales                                    <NA>   <NA>
## f__Bifidobacteriaceae                                   <NA>   <NA>
## g__Bifidobacterium                                      <NA>   <NA>
## s__Bifidobacterium_adolescentis Bifidobacterium_adolescentis   <NA>

MA plots

Fold-change vs. mean:

plotMA(res, main="Difference vs. Average")
legend("bottomright", legend="differentially abundant", lty=-1, pch=1, col="red", bty='n')

Plot individual counts

par(mfrow=c(1,2))
plotCounts(dds2, gene="p__Actinobacteria", intgroup="country")
plotCounts(dds2, gene="p__Actinobacteria", intgroup="gender")

Heatmap of differentially abundant taxa

select <- rownames(sigtab2)
nt <- normTransform(dds2) # defaults to log2(x+1)
log2.norm.counts <- assay(nt)[select, ]
df <- as.data.frame(colData(dds2)[,c("country", "gender")])
pheatmap::pheatmap(log2.norm.counts, annotation_col=df, main="log2(counts + 1)")

Regression on ordination vectors and alpha diversity

Prepare a data.frame:

df = data.frame(country=sample_data(Candela)$country,
                Shannon=alphas$Shannon)
df = cbind(df, ord$vectors[, 1:5])
par(mfrow=c(3,2))
for (i in 2:7){
  boxplot(df[, i] ~ df$country, main=colnames(df)[i])
}

Regression on ordination vectors and alpha diversity (cont’d)

Multivariate regression:

fit = glm(country ~ ., data=df, family=binomial("logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Univariate regression:

res <- sapply(2:ncol(df), function(i){
  fit = glm(df$country ~ df[, i], family=binomial("logit"))
  summary(fit)$coefficients[2, ]
})
colnames(res) = colnames(df)[2:ncol(df)]
res
##               Shannon        Axis.1       Axis.2    Axis.3    Axis.4
## Estimate   -2.8818298 -15.342356740 -36.21703957 4.3518071 2.7867726
## Std. Error  2.0011503   5.305383309  15.91867326 3.7931338 4.5480462
## z value    -1.4400866  -2.891846988  -2.27512928 1.1472854 0.6127406
## Pr(>|z|)    0.1498429   0.003829844   0.02289818 0.2512637 0.5400479
##                 Axis.5
## Estimate   -0.39981340
## Std. Error  4.84389021
## z value    -0.08253973
## Pr(>|z|)    0.93421752
write.csv(res, file="univariate_shannonPCoA.csv")

Regression in R: model formulae

response variable ~ explanatory variables

Model formulae tutorial

Can use Model matrices for more control of the model (?model.matrix)

Note on read depths and offset