Levi Waldron
June 30, 2017
Rampelli S et al.: Metagenome Sequencing of the Hadza Hunter-Gatherer Gut Microbiota. Curr. Biol. 2015, 25:1682–1693.
suppressPackageStartupMessages(library(phyloseq))
Rampelli
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 727 taxa and 38 samples ]
## sample_data() Sample Data: [ 38 samples by 18 sample variables ]
## tax_table() Taxonomy Table: [ 727 taxa by 8 taxonomic ranks ]
ab <- otu_table(Rampelli)
ab[1:5, 1:4]
## OTU Table: [5 taxa and 4 samples]
## taxa are rows
## H1 H2 H3 H4
## k__Bacteria 99.01042 72.28436 99.91570 99.54461
## k__Archaea 0.98958 0.00000 0.08430 0.45539
## p__Firmicutes 55.91770 54.45349 76.49131 82.69539
## p__Bacteroidetes 29.17228 11.65402 22.45855 15.36970
## p__Proteobacteria 10.63735 5.43092 0.12856 0.18853
summary(ab[, 1:4])
## H1 H2 H3 H4
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.00 Median : 0.000 Median : 0.000
## Mean : 1.076 Mean : 1.09 Mean : 1.098 Mean : 1.087
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.000
## Max. :99.010 Max. :72.28 Max. :99.916 Max. :99.545
summary(sample_data(Rampelli))
## subjectID body_site antibiotics_current_use
## Length:38 Length:38 Length:38
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## study_condition disease age age_category
## Length:38 Length:38 Min. : 8.00 Length:38
## Class :character Class :character 1st Qu.:23.25 Class :character
## Mode :character Mode :character Median :30.00 Mode :character
## Mean :31.68
## 3rd Qu.:38.00
## Max. :70.00
##
## gender BMI country location
## Length:38 Min. : NA Length:38 Length:38
## Class :character 1st Qu.: NA Class :character Class :character
## Mode :character Median : NA Mode :character Mode :character
## Mean :NaN
## 3rd Qu.: NA
## Max. : NA
## NA's :38
## non_westernized DNA_extraction_kit number_reads
## Length:38 Length:38 Min. : 3740248
## Class :character Class :character 1st Qu.: 9756842
## Mode :character Mode :character Median :15193451
## Mean :22303876
## 3rd Qu.:29293112
## Max. :77841428
##
## number_bases minimum_read_length median_read_length
## Min. :3.165e+08 Min. :60 Min. : 84.00
## 1st Qu.:9.056e+08 1st Qu.:60 1st Qu.:100.00
## Median :1.462e+09 Median :60 Median :100.00
## Mean :2.114e+09 Mean :60 Mean : 97.18
## 3rd Qu.:2.788e+09 3rd Qu.:60 3rd Qu.:100.00
## Max. :7.485e+09 Max. :60 Max. :101.00
##
## NCBI_accession
## Length:38
## Class :character
## Mode :character
##
##
##
##
plot_heatmap(Rampelli, method="PCoA", distance="bray")
A metric satisfies the following five properties:
dim(otu_table(Rampelli))
## [1] 727 38
table(sample_data(Rampelli)$location)
##
## Bologna Dedauko Sengele
## 11 20 7
Interested in identifying similar samples and similar microbes
Euclidean distance as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:
\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{727} (Y_{g,i}-Y_{g,j })^2 } \]
and the distance between two features \(h\) and \(g\) is:
\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{38} (Y_{h,i}-Y_{g,i})^2 } \]
ab <- otu_table(Rampelli)
dedauko1 <- ab[, 1]
dedauko2 <- ab[, 2]
bologna1 <- ab[, 29]
sqrt(sum((dedauko1 - dedauko2)^2))
## [1] 103.0065
sqrt(sum((dedauko1 - bologna1)^2))
## [1] 124.6045
dim(ab)
## [1] 727 38
(d <- dist(t(ab[, c(1, 2, 29)])))
## H1 H2
## H2 103.0065
## IT2 124.6045 124.1906
class(d)
## [1] "dist"
From Morgan and Huttenhower Human Microbiome Analysis
These examples describe the A) sequence counts and B) relative abundances of six taxa detected in three samples. C) A collector’s curve using a richness estimator approximates the relationship between the number of sequences drawn from each sample and the number of taxa expected to be present based on detected abundances. D) Alpha diversity captures both the organismal richness of a sample and the evenness of the organisms’ abundance distribution. E) Beta diversity represents the similarity (or difference) in organismal composition between samples.
E.g. Bray-Curtis dissimilarity between all pairs of samples:
plot(hclust(phyloseq::distance(Rampelli, method="bray")),
main="Bray-Curtis Dissimilarity", xlab="", sub = "")
?phyloseq::distance
and ?phyloseq::distanceMethodList
Simulate the heights of twin pairs:
suppressPackageStartupMessages(library(MASS))
set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0,0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)
## [1] 2 100
cor(t(y))
## [,1] [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000
SVD generalizes the example rotation we looked at:
\[\mathbf{Y} = \mathbf{UDV}^\top\]
Scaling:
e.standardize.fast <- t(scale(t(ab), scale=FALSE))
SVD:
s <- svd(e.standardize.fast)
names(s)
## [1] "d" "u" "v"
dim(s$u) # loadings
## [1] 727 38
length(s$d) # eigenvalues
## [1] 38
dim(s$v) # d %*% vT = scores
## [1] 38 38
rafalib::mypar()
p <- prcomp( t(e.standardize.fast) )
plot(s$u[, 1] ~ p$rotation[, 1])
Note: u and v can each be multiplied by -1 arbitrarily
plot(p$rotation[, 1], xlab="Index of taxa", ylab="Loadings of PC1",
main="Loadings of PC1") #or, predict(p)
abline(h=c(-0.2, 0.2), col="red")
e.pc1taxa <- ab[p$rotation[, 1] < -0.2 | p$rotation[, 1] > 0.2, ]
pheatmap::pheatmap(e.pc1taxa, scale="none", show_rownames=TRUE,
show_colnames = FALSE)
Alternatively as cumulative % variance explained (using cumsum()
function):
prcomp()
), scores are \(\mathbf{D V^T}\)Not much “horseshoe” effect here.
?phyloseq::estimate_richness
vegan
packageNote, you can ignore warning about singletons when performing this analysis.