Levi Waldron
May 22, 2017
From Morgan and Huttenhower, Human Microbiome Analysis, https://doi.org/10.1371/journal.pcbi.1002808
Pros:
Cons:
Bioinformatics pipeline of choice: QIIME
Pros:
Cons:
Bioinformatics pipeline of choice: Biobakery (MetaPhlan2, HUMAnN2)
Rampelli S et al.: Metagenome Sequencing of the Hadza Hunter-Gatherer Gut Microbiota. Curr. Biol. 2015, 25:1682–1693.
library(curatedMetagenomicData)
eset <- RampelliS_2015.metaphlan_bugs_list.stool()
library(phyloseq)
Rampelli = ExpressionSet2phyloseq(eset)
Rampelli
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 727 taxa and 38 samples ]
## sample_data() Sample Data: [ 38 samples by 8 sample variables ]
## tax_table() Taxonomy Table: [ 727 taxa by 8 taxonomic ranks ]
otu_table(Rampelli)[1:5, 1:4]
## OTU Table: [5 taxa and 4 samples]
## taxa are rows
## H1 H10 H11 H12
## k__Bacteria 99.01042 84.50315 67.44338 74.98205
## k__Archaea 0.98958 0.18360 0.28721 0.20239
## p__Firmicutes 55.91770 37.15785 27.61193 30.97439
## p__Bacteroidetes 29.17228 24.42213 16.44710 21.01599
## p__Proteobacteria 10.63735 15.06530 17.95200 16.36751
summary(otu_table(Rampelli)[, 1:4])
## H1 H10 H11 H12
## Min. : 0.000 Min. : 0.00000 Min. : 0.00000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.00000 1st Qu.: 0.00000 1st Qu.: 0.000
## Median : 0.000 Median : 0.00000 Median : 0.00000 Median : 0.000
## Mean : 1.076 Mean : 1.08239 Mean : 1.06046 Mean : 1.067
## 3rd Qu.: 0.000 3rd Qu.: 0.02908 3rd Qu.: 0.01579 3rd Qu.: 0.000
## Max. :99.010 Max. :84.50315 Max. :67.44338 Max. :74.982
summary(sample_data(Rampelli))
## subjectID age gender camp
## Length:38 Min. : 8.00 Length:38 Length:38
## Class :character 1st Qu.:23.25 Class :character Class :character
## Mode :character Median :30.00 Mode :character Mode :character
## Mean :31.68
## 3rd Qu.:38.00
## Max. :70.00
## country disease bodysite
## Length:38 Length:38 Length:38
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## number_reads
## Min. : 3740248
## 1st Qu.: 9756842
## Median :15193451
## Mean :22303876
## 3rd Qu.:29293112
## Max. :77841428
plot_heatmap(Rampelli, method="PCoA", distance="bray")
par(mar = c(18, 4, 0, 0) + 0.1) # make more room on bottom margin
barplot(sort(taxa_sums(Rampelli), TRUE)[1:30]/nsamples(Rampelli), las=2)
head(tax_table(Rampelli))
## Taxonomy Table: [6 taxa by 8 taxonomic ranks]:
## Kingdom Phylum Class Order Family Genus
## k__Bacteria "Bacteria" NA NA NA NA NA
## k__Archaea "Archaea" NA NA NA NA NA
## p__Firmicutes "Bacteria" "Firmicutes" NA NA NA NA
## p__Bacteroidetes "Bacteria" "Bacteroidetes" NA NA NA NA
## p__Proteobacteria "Bacteria" "Proteobacteria" NA NA NA NA
## p__Spirochaetes "Bacteria" "Spirochaetes" NA NA NA NA
## Species Strain
## k__Bacteria NA NA
## k__Archaea NA NA
## p__Firmicutes NA NA
## p__Bacteroidetes NA NA
## p__Proteobacteria NA NA
## p__Spirochaetes NA NA
Species plus strain-level taxonomy:
(Rampelli.sp_strain = subset_taxa(Rampelli, !is.na(Species)))
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 486 taxa and 38 samples ]
## sample_data() Sample Data: [ 38 samples by 8 sample variables ]
## tax_table() Taxonomy Table: [ 486 taxa by 8 taxonomic ranks ]
Keep phylum-level taxonomy only:
taxonomy.level = apply(tax_table(Rampelli), 1, function(x) sum(!is.na(x)))
Rampelli.phy = prune_taxa(taxonomy.level==2, Rampelli)
taxa_names(Rampelli.phy)
## [1] "p__Firmicutes" "p__Bacteroidetes"
## [3] "p__Proteobacteria" "p__Spirochaetes"
## [5] "p__Euryarchaeota" "p__Actinobacteria"
## [7] "p__Viruses_noname" "p__Cyanobacteria"
## [9] "p__Verrucomicrobia" "p__Deinococcus_Thermus"
## [11] "p__Eukaryota_noname" "p__Candidatus_Saccharibacteria"
## [13] "p__Acidobacteria" "p__Microsporidia"
## [15] "p__Fusobacteria" "p__Tenericutes"
Keep taxa only if they are in the most abundant 10% of taxa in at least 10 samples:
f1<- filterfun_sample(topp(0.1))
subs <- genefilter_sample(Rampelli.sp_strain, f1, A=10)
Rampelli2 <- subset_taxa(Rampelli.sp_strain, subs)
plot_heatmap(Rampelli2, method="PCoA", distance="bray")
More help here.
From Morgan and Huttenhower Human Microbiome Analysis
These examples describe the A) sequence counts and B) relative abundances of six taxa detected in three samples. C) A collector’s curve using a richness estimator approximates the relationship between the number of sequences drawn from each sample and the number of taxa expected to be present based on detected abundances. D) Alpha diversity captures both the organismal richness of a sample and the evenness of the organisms’ abundance distribution. E) Beta diversity represents the similarity (or difference) in organismal composition between samples.
?phyloseq::estimate_richness
vegan
packageNote, you can ignore warning about singletons:
Rampelli.counts <- ExpressionSet2phyloseq(eset, relab=FALSE)
alpha_meas = c("Shannon", "Simpson", "InvSimpson")
(p <- plot_richness(Rampelli.counts, "gender", "camp", measures=alpha_meas))
alphas = estimate_richness(Rampelli.counts, measures=alpha_meas)
## Warning in estimate_richness(Rampelli.counts, measures = alpha_meas): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
pairs(alphas)
E.g. Bray-Curtis dissimilarity between all pairs of samples:
plot(hclust(phyloseq::distance(Rampelli, method="bray")),
main="Bray-Curtis Dissimilarity", xlab="", sub = "")
?phyloseq::distance
and ?phyloseq::distanceMethodList
ord = ordinate(Rampelli, method="PCoA", distance="bray")
plot_ordination(Rampelli, ord, color="camp", shape="camp") +
ggplot2::ggtitle("Bray-Curtis Principal Coordinates Analysis")
Not much “horseshoe” effect here.
arcsin(sqrt(x))
phyloseq
and DESeq2
packages simplify the processSystematic part of model:
\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]
Random part of model:
\(y_i = E[y_i|x_i] + \epsilon_i\)
\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)
Assumption: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)
Systematic component is:
\[ log(E[y|x_i]) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]
Or equivalently: \[ E[y|x_i] = exp \left( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \right) \]
where \(E[y|x_i]\) is the expected number of counts for a microbe in subject i
\(y_i\) is typically assumed to be Poisson or Negative Binomal distributed.
Negative Binomial Distribution has two parameters: # of trials n, and probability of success p
This is a very important distinction!
In this example, unecessary to remove low-read samples or taxa:
Rampelli.counts
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 727 taxa and 38 samples ]
## sample_data() Sample Data: [ 38 samples by 8 sample variables ]
## tax_table() Taxonomy Table: [ 727 taxa by 8 taxonomic ranks ]
prune_samples(sample_sums(Rampelli.counts) > 5e7, Rampelli.counts)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 727 taxa and 34 samples ]
## sample_data() Sample Data: [ 34 samples by 8 sample variables ]
## tax_table() Taxonomy Table: [ 727 taxa by 8 taxonomic ranks ]
prune_taxa(taxa_sums(Rampelli.counts) > 1e3, Rampelli.counts)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 706 taxa and 38 samples ]
## sample_data() Sample Data: [ 38 samples by 8 sample variables ]
## tax_table() Taxonomy Table: [ 706 taxa by 8 taxonomic ranks ]
More help on converting to DESeq2 from various formats here.
dds.data = phyloseq_to_deseq2(Rampelli.counts, ~country)
## converting counts to integer mode
Note: better to use normalized count data than relative abundance
Excellent DESeq2 manual here or vignettes(package="DESeq2")
dds = DESeq(dds.data)
res = results(dds)
res = res[order(res$padj, na.last=NA), ]
alpha = 0.01
sigtab = res[(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"),
as(tax_table(Rampelli)[rownames(sigtab), ], "matrix"))
head(sigtab)
## baseMean log2FoldChange lfcSE stat
## s__Prevotella_stercorea 1026450.69 22.49632 1.069746 21.02960
## t__GCF_000235885 1026450.69 22.49632 1.069746 21.02960
## p__Spirochaetes 572913.49 21.64745 1.176744 18.39606
## c__Spirochaetia 572913.49 21.64745 1.176744 18.39606
## o__Spirochaetales 572913.49 21.64745 1.176744 18.39606
## g__Catenibacterium 51752.17 30.00000 1.670805 17.95542
## pvalue padj Kingdom Phylum
## s__Prevotella_stercorea 3.516377e-98 8.087668e-96 Bacteria Bacteroidetes
## t__GCF_000235885 3.516377e-98 8.087668e-96 Bacteria Bacteroidetes
## p__Spirochaetes 1.412718e-75 1.299700e-73 Bacteria Spirochaetes
## c__Spirochaetia 1.412718e-75 1.299700e-73 Bacteria Spirochaetes
## o__Spirochaetales 1.412718e-75 1.299700e-73 Bacteria Spirochaetes
## g__Catenibacterium 4.353075e-72 2.503018e-70 Bacteria Firmicutes
## Class Order
## s__Prevotella_stercorea Bacteroidia Bacteroidales
## t__GCF_000235885 Bacteroidia Bacteroidales
## p__Spirochaetes <NA> <NA>
## c__Spirochaetia Spirochaetia <NA>
## o__Spirochaetales Spirochaetia Spirochaetales
## g__Catenibacterium Erysipelotrichia Erysipelotrichales
## Family Genus
## s__Prevotella_stercorea Prevotellaceae Prevotella
## t__GCF_000235885 Prevotellaceae Prevotella
## p__Spirochaetes <NA> <NA>
## c__Spirochaetia <NA> <NA>
## o__Spirochaetales <NA> <NA>
## g__Catenibacterium Erysipelotrichaceae Catenibacterium
## Species Strain
## s__Prevotella_stercorea Prevotella_stercorea <NA>
## t__GCF_000235885 Prevotella_stercorea GCF_000235885
## p__Spirochaetes <NA> <NA>
## c__Spirochaetia <NA> <NA>
## o__Spirochaetales <NA> <NA>
## g__Catenibacterium <NA> <NA>
library("ggplot2")
theme_set(theme_bw())
sigtabgen = subset(sigtab, !is.na(Family))
# Phylum order
x = tapply(sigtabgen$log2FoldChange, sigtabgen$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Phylum = factor(as.character(sigtabgen$Phylum), levels=names(x))
# Family order
x = tapply(sigtabgen$log2FoldChange, sigtabgen$Family, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Family = factor(as.character(sigtabgen$Family), levels=names(x))
ggplot(sigtabgen, aes(y=Family, x=log2FoldChange, color=Phylum)) +
geom_vline(xintercept = 0.0, color = "gray", size = 0.5) +
geom_point(size=6) +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))
select <- rownames(sigtab)
nt <- normTransform(dds) # defaults to log2(x+1)
log2.norm.counts <- assay(nt)[select, ]
df <- as.data.frame(colData(dds)[,c("country", "gender")])
pheatmap::pheatmap(log2.norm.counts, annotation_col=df, main="log2(counts + 1)")
Fold-change vs. mean:
plotMA(res, main="Difference vs. Average")
legend("bottomright", legend="differentially abundant", lty=-1, pch=1, col="red", bty='n')
DESeq2
uses “Independent Hypothesis Weighting” (see IHW
package)
Prepare a data.frame
:
df = data.frame(country=sample_data(Rampelli)$country,
Shannon=alphas$Shannon)
df = cbind(df, ord$vectors[, 1:5])
head(df)
## country Shannon Axis.1 Axis.2 Axis.3 Axis.4
## H1 tanzania 3.688009 -0.11832297 0.07692996 0.09619371 -0.049329957
## H10 tanzania 3.952527 -0.19796867 0.09884072 -0.16478172 -0.097493755
## H11 tanzania 3.880499 -0.22202543 0.22287671 -0.25653163 0.006247430
## H12 tanzania 3.874827 -0.23351705 0.15462528 -0.22859742 -0.068984544
## H13 tanzania 3.598540 -0.05794346 0.05416069 0.13050213 0.005986638
## H14 tanzania 3.579617 0.01428806 0.09845835 0.12076997 0.012938378
## Axis.5
## H1 -0.010983849
## H10 -0.063325316
## H11 -0.059823475
## H12 -0.058443856
## H13 -0.020542266
## H14 -0.005270389
par(mfrow=c(3,2))
for (i in 2:7){
boxplot(df[, i] ~ df$country, main=colnames(df)[i])
}
One dataset from R:
curatedMetagenomicData("HMP_2012.metaphlan_bugs_list.stool")
Many datasets from R:
curatedMetagenomicData("HMP_2012.metaphlan_bugs_list.*")
Command-line: $ curatedMetagenomicData -p "HMP_2012.metaphlan_bugs_list.*"
Pasolli, Schiffer et al., bioRxiv 103085