Enterotypes of the human gut microbiome (2011)
Published in Nature in early 2011, this work compared (among other things), the faecal microbial communities from 22 subjects using complete shotgun DNA sequencing. Authors further compared these microbial communities with the faecal communities of subjects from other studies. A total of 280 faecal samples / subjects are represented in this dataset, and 553 genera. The authors claim that the data naturally clumps into three community-level clusters, or “enterotypes”, that are not immediately explained by sequencing technology or demographic features of the subjects, but with potential relevance to understanding human gut microbiota.
Arumugam, M., et al. (2011). Enterotypes of the human gut microbiome.
Nature, 473(7346), 174-180.
See supplemental information for subject data.
An OTU-clustered dataset has been publicly-accessible since around the date of publication. However, controversy over analysis methodology and difficulty in reproducing their results has prompted the authors to post a reproducible example with data and R code
However, there still appear to be some issues, the first of which is dealt with on this page: some samples have been omitted in the revised dataset they have provided to reproduce the figures in their article. Inclusion of the missing samples from the original dataset, and reanalysis, changes the outcome of “3 enterotypes clusters”, a major finding of the article.
There are also fundamental problems with the manner in which some of the clustering procedures were applied and conveyed, and I hope to udpate with further reproducible details of that in an update to this page.
We have taken careful steps to include the originally-published enterotypes dataset - as they provided it - as an example .RData
dataset in the phyloseq package. One can access this dataset by loading the phyloseq package and then calling the included data:
library("phyloseq")
data("enterotype")
We want to compare the differences in the newly-released analysis based on R, with what we would have found originally. In this case, we find that they omitted 9 or so Sanger samples from their main Sanger dataset, and provided this as the “raw” datafile. The following code shows the comparison just in samples names between the two, identifying the missing samples, and their auxiliary data. Moreover, we show that by leaving the samples in the dataset, and using their code, we arrive at a different value for the optimal number of clusters: 4 or 5, instead of 3.
Compare the new 33-sample dataset with the original dataset, now included in the phyloseq package.
data = read.table("MetaHIT_SangerSamples.genus.txt", header = T,
row.names = 1, dec = ".", sep = "\t")
data = data[-1, ]
# subset the just the Sanger samples from the original enterotype data
sanger_only_sd = subset(data.frame(sample_data(enterotype)), SeqTech ==
"Sanger")
# Which samples are the same between the two?
rownames(sanger_only_sd)[rownames(sanger_only_sd) %in% colnames(data)]
## [1] "AM.F10.T1" "AM.F10.T2" "DA.AD.1" "DA.AD.2" "DA.AD.3"
## [6] "DA.AD.4" "ES.AD.1" "ES.AD.2" "ES.AD.3" "ES.AD.4"
## [11] "FR.AD.1" "FR.AD.2" "FR.AD.3" "FR.AD.4" "FR.AD.5"
## [16] "FR.AD.6" "FR.AD.7" "FR.AD.8" "IT.AD.1" "IT.AD.2"
## [21] "IT.AD.3" "IT.AD.4" "IT.AD.5" "IT.AD.6" "JP.AD.1"
## [26] "JP.AD.2" "JP.AD.3" "JP.AD.4" "JP.AD.5" "JP.AD.6"
## [31] "JP.AD.7" "JP.AD.8" "JP.AD.9"
# or not
omitted_sample_names = rownames(sanger_only_sd)[!rownames(sanger_only_sd) %in%
colnames(data)]
print(omitted_sample_names)
## [1] "AM.AD.1" "AM.AD.2" "DA.AD.1T" "DA.AD.3T" "JP.IN.1" "JP.IN.2"
## [7] "JP.IN.3" "JP.IN.4"
# These are the 9 samples for which an enterotype was mysteriously not
# assigned:
subset(sanger_only_sd, is.na(Enterotype))
## Enterotype Sample_ID SeqTech SampleID Project Nationality
## AM.AD.1 <NA> AM.AD.1 Sanger AM.AD.1 gill06 american
## AM.AD.2 <NA> AM.AD.2 Sanger AM.AD.2 gill06 american
## AM.F10.T1 <NA> AM.F10.T1 Sanger AM.F10.T1 turnbaugh09 american
## DA.AD.1T <NA> DA.AD.1T Sanger <NA> <NA> <NA>
## DA.AD.3T <NA> DA.AD.3T Sanger <NA> <NA> <NA>
## JP.IN.1 <NA> JP.IN.1 Sanger JP.IN.1 kurokawa07 japanese
## JP.IN.2 <NA> JP.IN.2 Sanger JP.IN.2 kurokawa07 japanese
## JP.IN.3 <NA> JP.IN.3 Sanger JP.IN.3 kurokawa07 japanese
## JP.IN.4 <NA> JP.IN.4 Sanger JP.IN.4 kurokawa07 japanese
## Gender Age ClinicalStatus
## AM.AD.1 F 28.00 healthy
## AM.AD.2 M 37.00 healthy
## AM.F10.T1 F NA obese
## DA.AD.1T <NA> NA <NA>
## DA.AD.3T <NA> NA <NA>
## JP.IN.1 F 0.58 healthy
## JP.IN.2 M 0.50 healthy
## JP.IN.3 M 0.25 healthy
## JP.IN.4 F 0.33 healthy
omitted_sample_names %in% rownames(subset(sanger_only_sd, is.na(Enterotype)))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# Is there an explanation for omitting these samples? Any prior knowledge
# outside of the abundance data itself that could support the decision to
# ommit?
omitted_sample_names = rownames(sanger_only_sd)[!rownames(sanger_only_sd) %in%
colnames(data)]
# See anything weird in their abundance values?
omitted_samples = subset_samples(enterotype, is.na(Enterotype))
head(otu_table(omitted_samples))
## OTU Table: [6 taxa and 9 samples]
## taxa are rows
## AM.AD.1 AM.AD.2 AM.F10.T1 DA.AD.1T DA.AD.3T JP.IN.1
## -1 0.4265 0.3322619 4.82e-01 6.25e-01 0.8053 0.577
## Bacteria 0.0000 0.0000000 0.00e+00 0.00e+00 0.0000 0.000
## Prosthecochloris 0.0000 0.0000000 0.00e+00 0.00e+00 0.0000 0.000
## Chloroflexus 0.0000 0.0000000 0.00e+00 8.10e-07 0.0000 0.000
## Dehalococcoides 0.0000 0.0002683 9.59e-06 0.00e+00 0.0000 0.000
## Thermus 0.0000 0.0000000 0.00e+00 0.00e+00 0.0000 0.000
## JP.IN.2 JP.IN.3 JP.IN.4
## -1 0.07865 0.1081 0.4619
## Bacteria 0.00000 0.0000 0.0000
## Prosthecochloris 0.00000 0.0000 0.0000
## Chloroflexus 0.00000 0.0000 0.0000
## Dehalococcoides 0.00000 0.0000 0.0000
## Thermus 0.00000 0.0000 0.0000
Let's re-run their analysis but leave-in the omitted samples. First, subset the full “enterotypes” dataset to just the Sanger data:
ps_data = subset_samples(enterotype, SeqTech == "Sanger")
Redefine the “data” variable to be the abundances of all Sanger samples,
not just the 33 they included.
data = as(otu_table(ps_data), "matrix")
From this point, it is possible to directly follow the beginning of the complete script that they provide.
# Uncomment next two lines if R packages are already installed
# install.packages('cluster') install.packages('clusterSim')
library(cluster)
library(clusterSim)
# Omit the un-assigned garbage-bin genus
data = data[-1, ]
# Define a custom JSD calculator function
dist.JSD <- function(inMatrix, pseudocount = 1e-06, ...) {
KLD <- function(x, y) sum(x * log(x/y))
JSD <- function(x, y) sqrt(0.5 * KLD(x, (x + y)/2) + 0.5 * KLD(y, (x + y)/2))
matrixColSize <- length(colnames(inMatrix))
matrixRowSize <- length(rownames(inMatrix))
colnames <- colnames(inMatrix)
resultsMatrix <- matrix(0, matrixColSize, matrixColSize)
inMatrix = apply(inMatrix, 1:2, function(x) ifelse(x == 0, pseudocount,
x))
for (i in 1:matrixColSize) {
for (j in 1:matrixColSize) {
resultsMatrix[i, j] = JSD(as.vector(inMatrix[, i]), as.vector(inMatrix[,
j]))
}
}
rownames(resultsMatrix) <- colnames(resultsMatrix) <- colnames
resultsMatrix <- as.dist(resultsMatrix)
attr(resultsMatrix, "method") <- "dist"
return(resultsMatrix)
}
data.dist = dist.JSD(data)
pam.clustering = function(x, k) {
# x is a distance matrix and k the number of clusters
require(cluster)
cluster = as.vector(pam(as.dist(x), k, diss = TRUE)$clustering)
return(cluster)
}
### The following line is already asking for 3 clusters...
data.cluster = pam.clustering(data.dist, k = 3)
require(clusterSim)
nclusters = index.G1(t(data), data.cluster, d = data.dist, centrotypes = "medoids")
nclusters = NULL
for (k in 1:20) {
if (k == 1) {
nclusters[k] = NA
} else {
data.cluster_temp = pam.clustering(data.dist, k)
nclusters[k] = index.G1(t(data), data.cluster_temp, d = data.dist, centrotypes = "medoids")
}
}
plot(nclusters, type = "h", xlab = "k clusters", ylab = "CH index",
main = "Optimal number of clusters")
obs.silhouette = mean(silhouette(data.cluster, data.dist)[, 3])
cat(obs.silhouette) #0.1899451
## 0.2104
With the original samples included, the optimal number of clusters appears to be 4, with 5 clusters also being a better option than 3.