Levi Waldron and Curtis Huttenhower
June 6-7, 2016
phyloseq
Bioconductor package for exploratory data analysisDESeq2
Bioconductor package for regression on metagenomic dataarcsin(sqrt(x))
phyloseq
and DESeq2
packages simplify the processSystematic part of model:
\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]
Random part of model:
\(y_i = E[y_i|x_i] + \epsilon_i\)
\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)
Assumption: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)
The model: \(y_i = E[y|x_i] + \epsilon_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)
Random component of \(y_i\) is normally distributed: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)
Systematic component (linear predictor): \(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi}\)
Link function here is the identity link: \(g(E(y | x)) = E(y | x)\). We are modeling the mean directly, no transformation.
Systematic component is:
\[ log(E[y|x_i]) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]
Or equivalently: \[ E[y|x_i] = exp \left( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \right) \]
where \(E[y|x_i]\) is the expected number of counts for a microbe in subject i
\(\epsilon_i\) is typically Poisson or Negative Binomal distributed.
This is a very important distinction!
Negative Binomial Distribution has two parameters: # of trials n, and probability of success p
If there is evidence that the fixed parameters differ between two groups of interest, we say the results are statistically significant.
pcsl::zeroinfl()
count ~ X|1
)Rampelli S et al.: Metagenome Sequencing of the Hadza Hunter-Gatherer Gut Microbiota. Curr. Biol. 2015, 25:1682–1693.
indat = read.delim("data/Candela_Africa_stool.txt")
inmetadat = read.delim("data/Candela_Africa_metadat.txt")
library(phyloseq)
source("https://raw.githubusercontent.com/waldronlab/EPIC2016/master/metaphlanToPhyloseq.R")
Candela = metaphlanToPhyloseq(indat, inmetadat, simplenames = TRUE,
roundtointeger = FALSE)
summary(otu_table(Candela))
summary(sample_data(Candela))
phyloseq
help vignettes here.
Candela
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3302 taxa and 38 samples ]
## sample_data() Sample Data: [ 38 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 3302 taxa by 8 taxonomic ranks ]
Candela = prune_taxa(taxa_sums(Candela) > 0, Candela)
Candela
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 648 taxa and 38 samples ]
## sample_data() Sample Data: [ 38 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 648 taxa by 8 taxonomic ranks ]
rank_names(Candela)
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
## [8] "Strain"
subset_taxa(Candela, !is.na(Strain))
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 207 taxa and 38 samples ]
## sample_data() Sample Data: [ 38 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 207 taxa by 8 taxonomic ranks ]
(Candela.sp_strain = subset_taxa(Candela, !is.na(Species)))
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 444 taxa and 38 samples ]
## sample_data() Sample Data: [ 38 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 444 taxa by 8 taxonomic ranks ]
taxonomy.level = apply(tax_table(Candela), 1, function(x) sum(!is.na(x)))
Candela.phy = prune_taxa(taxonomy.level==2, Candela)
taxa_names(Candela.phy)
## [1] "p__Euryarchaeota" "p__Acidobacteria" "p__Actinobacteria"
## [4] "p__Bacteroidetes" "p__Cyanobacteria" "p__Firmicutes"
## [7] "p__Fusobacteria" "p__Proteobacteria" "p__Spirochaetes"
## [10] "p__Tenericutes" "p__Verrucomicrobia" "p__Eukaryota_noname"
## [13] "p__Microsporidia"
Keep taxa only if they are in the most abundant 10% of taxa in at least two samples:
f1<- filterfun_sample(topp(0.1))
pru <- genefilter_sample(Candela, f1, A=2)
summary(pru)
## Mode FALSE TRUE NA's
## logical 455 193 0
subset_taxa(Candela, pru)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 193 taxa and 38 samples ]
## sample_data() Sample Data: [ 38 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 193 taxa by 8 taxonomic ranks ]
More help here.
plot_heatmap(Candela.sp_strain, method="PCoA", distance="bray")
par(mar = c(18, 4, 0, 0) + 0.1) # make more room on bottom margin
barplot(sort(taxa_sums(Candela.sp_strain), TRUE)[1:30]/nsamples(Candela.sp_strain), las=2)
From Morgan and Huttenhower Human Microbiome Analysis
These examples describe the A) sequence counts and B) relative abundances of six taxa detected in three samples. C) A collector’s curve using a richness estimator approximates the relationship between the number of sequences drawn from each sample and the number of taxa expected to be present based on detected abundances. D) Alpha diversity captures both the organismal richness of a sample and the evenness of the organisms’ abundance distribution. E) Beta diversity represents the similarity (or difference) in organismal composition between samples.
?phyloseq::estimate_richness
vegan
packageNote, you can ignore warning about singletons:
Candela.int = Candela
otu_table(Candela.int) = round(otu_table(Candela)*1e4)
alpha_meas = c("Shannon", "Simpson", "InvSimpson")
(p <- plot_richness(Candela.int, "gender", "camp", measures=alpha_meas))
alphas = estimate_richness(Candela.int, measures=alpha_meas)
pairs(alphas)
E.g. Bray-Curtis dissimilarity between all pairs of samples:
plot(hclust(phyloseq::distance(Candela, method="bray")),
main="Bray-Curtis Dissimilarity", xlab="", sub = "")
?phyloseq::distance
and ?phyloseq::distanceMethodList
ord = ordinate(Candela, method="PCoA", distance="bray")
plot_ordination(Candela, ord, color="camp", shape="camp") +
ggplot2::ggtitle("Bray-Curtis Principal Coordinates Analysis")
Not much “horseshoe” effect here.
plot_scree(ord) + ggplot2::ggtitle("Screeplot")
Not necessary, not evaluated:
Candela.pruned = subset_samples(Candela, country %in% c("italy", "tanzania"))
Candela.pruned = prune_samples(sample_sums(Candela.pruned) > 10, Candela.pruned)
Candela.pruned
DESeq2
automatically optimizes “non-specific” filtering / pruning of low-abundance featuresMore help on converting to DESeq2 from various formats here.
dds.data = phyloseq_to_deseq2(Candela, ~country)
## converting counts to integer mode
Note: better to use normalized count data than relative abundance
Excellent DESeq2 manual here or vignettes(package="DESeq2")
dds = DESeq(dds.data)
res = results(dds)
res = res[order(res$padj, na.last=NA), ]
alpha = 0.01
sigtab = res[(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"),
as(tax_table(Candela)[rownames(sigtab), ], "matrix"))
head(sigtab)
## baseMean log2FoldChange lfcSE
## f__Bacteroidaceae 4.980675 -6.015589 0.8326327
## g__Bacteroides 4.980675 -6.015589 0.8326327
## f__Prevotellaceae 33.286897 4.456109 0.6539972
## g__Prevotella 33.215445 4.663931 0.6916855
## g__Phascolarctobacterium 8.684247 4.311203 0.6774475
## s__Phascolarctobacterium_succinatutens 8.684247 4.311203 0.6774475
## stat pvalue padj
## f__Bacteroidaceae -7.224782 5.019067e-13 4.115635e-11
## g__Bacteroides -7.224782 5.019067e-13 4.115635e-11
## f__Prevotellaceae 6.813652 9.515189e-12 5.201637e-10
## g__Prevotella 6.742850 1.553097e-11 6.367698e-10
## g__Phascolarctobacterium 6.363892 1.967046e-10 4.608508e-09
## s__Phascolarctobacterium_succinatutens 6.363892 1.967046e-10 4.608508e-09
## Kingdom Phylum
## f__Bacteroidaceae Bacteria Bacteroidetes
## g__Bacteroides Bacteria Bacteroidetes
## f__Prevotellaceae Bacteria Bacteroidetes
## g__Prevotella Bacteria Bacteroidetes
## g__Phascolarctobacterium Bacteria Firmicutes
## s__Phascolarctobacterium_succinatutens Bacteria Firmicutes
## Class Order
## f__Bacteroidaceae Bacteroidia Bacteroidales
## g__Bacteroides Bacteroidia Bacteroidales
## f__Prevotellaceae Bacteroidia Bacteroidales
## g__Prevotella Bacteroidia Bacteroidales
## g__Phascolarctobacterium Negativicutes Selenomonadales
## s__Phascolarctobacterium_succinatutens Negativicutes Selenomonadales
## Family
## f__Bacteroidaceae Bacteroidaceae
## g__Bacteroides Bacteroidaceae
## f__Prevotellaceae Prevotellaceae
## g__Prevotella Prevotellaceae
## g__Phascolarctobacterium Acidaminococcaceae
## s__Phascolarctobacterium_succinatutens Acidaminococcaceae
## Genus
## f__Bacteroidaceae <NA>
## g__Bacteroides Bacteroides
## f__Prevotellaceae <NA>
## g__Prevotella Prevotella
## g__Phascolarctobacterium Phascolarctobacterium
## s__Phascolarctobacterium_succinatutens Phascolarctobacterium
## Species
## f__Bacteroidaceae <NA>
## g__Bacteroides <NA>
## f__Prevotellaceae <NA>
## g__Prevotella <NA>
## g__Phascolarctobacterium <NA>
## s__Phascolarctobacterium_succinatutens Phascolarctobacterium_succinatutens
## Strain
## f__Bacteroidaceae <NA>
## g__Bacteroides <NA>
## f__Prevotellaceae <NA>
## g__Prevotella <NA>
## g__Phascolarctobacterium <NA>
## s__Phascolarctobacterium_succinatutens <NA>
dds2 <- estimateSizeFactors(dds)
dds2 <- estimateDispersions(dds2)
plotDispEsts(dds2)
library("ggplot2")
theme_set(theme_bw())
sigtabgen = subset(sigtab, !is.na(Family))
# Phylum order
x = tapply(sigtabgen$log2FoldChange, sigtabgen$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Phylum = factor(as.character(sigtabgen$Phylum), levels=names(x))
# Family order
x = tapply(sigtabgen$log2FoldChange, sigtabgen$Family, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Family = factor(as.character(sigtabgen$Family), levels=names(x))
ggplot(sigtabgen, aes(y=Family, x=log2FoldChange, color=Phylum)) +
geom_vline(xintercept = 0.0, color = "gray", size = 0.5) +
geom_point(size=6) +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))
table(sample_data(Candela)$country, sample_data(Candela)$gender)
##
## female male
## italy 7 4
## tanzania 9 18
dds.data2 = phyloseq_to_deseq2(Candela, ~country + gender)
dds2 = DESeq(dds.data)
resultsNames(dds2)
## [1] "Intercept" "countryitaly" "countrytanzania"
## italy = numerator, tanzania = denominator
res2 = results(dds, contrast=c("country", "italy", "tanzania"))
res2 = res2[order(res$padj, na.last=NA), ]
alpha = 0.01
sigtab2 = res2[which(res2$padj < alpha), ]
sigtab2 = cbind(as(sigtab2, "data.frame"), as(tax_table(Candela)[rownames(sigtab2), ], "matrix"))
head(sigtab2)
## baseMean log2FoldChange lfcSE stat
## p__Actinobacteria 3.339887 3.767790 0.6576299 5.729347
## c__Actinobacteria 3.339887 3.767790 0.6576299 5.729347
## o__Bifidobacteriales 2.495634 5.212101 0.9370110 5.562476
## f__Bifidobacteriaceae 2.495634 5.212101 0.9370110 5.562476
## g__Bifidobacterium 2.495634 5.212101 0.9370110 5.562476
## s__Bifidobacterium_adolescentis 1.182044 4.141045 1.0224334 4.050185
## pvalue padj Kingdom
## p__Actinobacteria 1.008178e-08 1.271856e-07 Bacteria
## c__Actinobacteria 1.008178e-08 1.271856e-07 Bacteria
## o__Bifidobacteriales 2.659732e-08 2.726225e-07 Bacteria
## f__Bifidobacteriaceae 2.659732e-08 2.726225e-07 Bacteria
## g__Bifidobacterium 2.659732e-08 2.726225e-07 Bacteria
## s__Bifidobacterium_adolescentis 5.117714e-05 2.997518e-04 Bacteria
## Phylum Class
## p__Actinobacteria Actinobacteria <NA>
## c__Actinobacteria Actinobacteria Actinobacteria
## o__Bifidobacteriales Actinobacteria Actinobacteria
## f__Bifidobacteriaceae Actinobacteria Actinobacteria
## g__Bifidobacterium Actinobacteria Actinobacteria
## s__Bifidobacterium_adolescentis Actinobacteria Actinobacteria
## Order Family
## p__Actinobacteria <NA> <NA>
## c__Actinobacteria <NA> <NA>
## o__Bifidobacteriales Bifidobacteriales <NA>
## f__Bifidobacteriaceae Bifidobacteriales Bifidobacteriaceae
## g__Bifidobacterium Bifidobacteriales Bifidobacteriaceae
## s__Bifidobacterium_adolescentis Bifidobacteriales Bifidobacteriaceae
## Genus
## p__Actinobacteria <NA>
## c__Actinobacteria <NA>
## o__Bifidobacteriales <NA>
## f__Bifidobacteriaceae <NA>
## g__Bifidobacterium Bifidobacterium
## s__Bifidobacterium_adolescentis Bifidobacterium
## Species Strain
## p__Actinobacteria <NA> <NA>
## c__Actinobacteria <NA> <NA>
## o__Bifidobacteriales <NA> <NA>
## f__Bifidobacteriaceae <NA> <NA>
## g__Bifidobacterium <NA> <NA>
## s__Bifidobacterium_adolescentis Bifidobacterium_adolescentis <NA>
Fold-change vs. mean:
plotMA(res, main="Difference vs. Average")
legend("bottomright", legend="differentially abundant", lty=-1, pch=1, col="red", bty='n')
par(mfrow=c(1,2))
plotCounts(dds2, gene="p__Actinobacteria", intgroup="country")
plotCounts(dds2, gene="p__Actinobacteria", intgroup="gender")
select <- rownames(sigtab2)
nt <- normTransform(dds2) # defaults to log2(x+1)
log2.norm.counts <- assay(nt)[select, ]
df <- as.data.frame(colData(dds2)[,c("country", "gender")])
pheatmap::pheatmap(log2.norm.counts, annotation_col=df, main="log2(counts + 1)")
Prepare a data.frame
:
df = data.frame(country=sample_data(Candela)$country,
Shannon=alphas$Shannon)
df = cbind(df, ord$vectors[, 1:5])
par(mfrow=c(3,2))
for (i in 2:7){
boxplot(df[, i] ~ df$country, main=colnames(df)[i])
}
Multivariate regression:
fit = glm(country ~ ., data=df, family=binomial("logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Univariate regression:
res <- sapply(2:ncol(df), function(i){
fit = glm(df$country ~ df[, i], family=binomial("logit"))
summary(fit)$coefficients[2, ]
})
colnames(res) = colnames(df)[2:ncol(df)]
res
## Shannon Axis.1 Axis.2 Axis.3 Axis.4
## Estimate -2.8818298 -15.342356740 -36.21703957 4.3518071 2.7867726
## Std. Error 2.0011503 5.305383309 15.91867326 3.7931338 4.5480462
## z value -1.4400866 -2.891846988 -2.27512928 1.1472854 0.6127406
## Pr(>|z|) 0.1498429 0.003829844 0.02289818 0.2512637 0.5400479
## Axis.5
## Estimate -0.39981340
## Std. Error 4.84389021
## z value -0.08253973
## Pr(>|z|) 0.93421752
write.csv(res, file="univariate_shannonPCoA.csv")
lm()
, glm()
, and high-level packages like DESeq2
use a “model formula” interface.response variable ~ explanatory variables
Can use Model matrices for more control of the model (?model.matrix
)
Note: better to use metaphlan2 option:
-t clade_profiles
then differential read depth accounted for by an offset term in log-linear model