Re-analyzing the enterotype dataset

Background

Enterotypes of the human gut microbiome (2011)

Published in Nature in early 2011, this work compared (among other things), the faecal microbial communities from 22 subjects using complete shotgun DNA sequencing. Authors further compared these microbial communities with the faecal communities of subjects from other studies. A total of 280 faecal samples / subjects are represented in this dataset, and 553 genera. The authors claim that the data naturally clumps into three community-level clusters, or “enterotypes”, that are not immediately explained by sequencing technology or demographic features of the subjects, but with potential relevance to understanding human gut microbiota.

Original Reference

Arumugam, M., et al. (2011). Enterotypes of the human gut microbiome.

Nature, 473(7346), 174-180.

See supplemental information for subject data.

Goal of this page

An OTU-clustered dataset has been publicly-accessible since around the date of publication. However, controversy over analysis methodology and difficulty in reproducing their results has prompted the authors to post a reproducible example with data and R code

However, there still appear to be some issues, the first of which is dealt with on this page: some samples have been omitted in the revised dataset they have provided to reproduce the figures in their article. Inclusion of the missing samples from the original dataset, and reanalysis, changes the outcome of “3 enterotypes clusters”, a major finding of the article.

There are also fundamental problems with the manner in which some of the clustering procedures were applied and conveyed, and I hope to udpate with further reproducible details of that in an update to this page.

Original enterotype data included in phyloseq package

We have taken careful steps to include the originally-published enterotypes dataset - as they provided it - as an example .RData dataset in the phyloseq package. One can access this dataset by loading the phyloseq package and then calling the included data:

library("phyloseq")
data("enterotype")

We want to compare the differences in the newly-released analysis based on R, with what we would have found originally. In this case, we find that they omitted 9 or so Sanger samples from their main Sanger dataset, and provided this as the “raw” datafile. The following code shows the comparison just in samples names between the two, identifying the missing samples, and their auxiliary data. Moreover, we show that by leaving the samples in the dataset, and using their code, we arrive at a different value for the optimal number of clusters: 4 or 5, instead of 3.

Read / compare included samples

Compare the new 33-sample dataset with the original dataset, now included in the phyloseq package.

data = read.table("MetaHIT_SangerSamples.genus.txt", header = T, 
    row.names = 1, dec = ".", sep = "\t")
data = data[-1, ]

# subset the just the Sanger samples from the original enterotype data
sanger_only_sd = subset(data.frame(sample_data(enterotype)), SeqTech == 
    "Sanger")

# Which samples are the same between the two?
rownames(sanger_only_sd)[rownames(sanger_only_sd) %in% colnames(data)]

##  [1] "AM.F10.T1" "AM.F10.T2" "DA.AD.1"   "DA.AD.2"   "DA.AD.3"  
##  [6] "DA.AD.4"   "ES.AD.1"   "ES.AD.2"   "ES.AD.3"   "ES.AD.4"  
## [11] "FR.AD.1"   "FR.AD.2"   "FR.AD.3"   "FR.AD.4"   "FR.AD.5"  
## [16] "FR.AD.6"   "FR.AD.7"   "FR.AD.8"   "IT.AD.1"   "IT.AD.2"  
## [21] "IT.AD.3"   "IT.AD.4"   "IT.AD.5"   "IT.AD.6"   "JP.AD.1"  
## [26] "JP.AD.2"   "JP.AD.3"   "JP.AD.4"   "JP.AD.5"   "JP.AD.6"  
## [31] "JP.AD.7"   "JP.AD.8"   "JP.AD.9"

# or not
omitted_sample_names = rownames(sanger_only_sd)[!rownames(sanger_only_sd) %in% 
    colnames(data)]
print(omitted_sample_names)

## [1] "AM.AD.1"  "AM.AD.2"  "DA.AD.1T" "DA.AD.3T" "JP.IN.1"  "JP.IN.2" 
## [7] "JP.IN.3"  "JP.IN.4"


# These are the 9 samples for which an enterotype was mysteriously not
# assigned:
subset(sanger_only_sd, is.na(Enterotype))

##           Enterotype Sample_ID SeqTech  SampleID     Project Nationality
## AM.AD.1         <NA>   AM.AD.1  Sanger   AM.AD.1      gill06    american
## AM.AD.2         <NA>   AM.AD.2  Sanger   AM.AD.2      gill06    american
## AM.F10.T1       <NA> AM.F10.T1  Sanger AM.F10.T1 turnbaugh09    american
## DA.AD.1T        <NA>  DA.AD.1T  Sanger      <NA>        <NA>        <NA>
## DA.AD.3T        <NA>  DA.AD.3T  Sanger      <NA>        <NA>        <NA>
## JP.IN.1         <NA>   JP.IN.1  Sanger   JP.IN.1 kurokawa07     japanese
## JP.IN.2         <NA>   JP.IN.2  Sanger   JP.IN.2 kurokawa07     japanese
## JP.IN.3         <NA>   JP.IN.3  Sanger   JP.IN.3  kurokawa07    japanese
## JP.IN.4         <NA>   JP.IN.4  Sanger   JP.IN.4 kurokawa07     japanese
##           Gender   Age ClinicalStatus
## AM.AD.1        F 28.00        healthy
## AM.AD.2        M 37.00        healthy
## AM.F10.T1      F    NA          obese
## DA.AD.1T    <NA>    NA           <NA>
## DA.AD.3T    <NA>    NA           <NA>
## JP.IN.1        F  0.58        healthy
## JP.IN.2        M  0.50        healthy
## JP.IN.3        M  0.25        healthy
## JP.IN.4        F  0.33        healthy

omitted_sample_names %in% rownames(subset(sanger_only_sd, is.na(Enterotype)))

## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE


# Is there an explanation for omitting these samples?  Any prior knowledge
# outside of the abundance data itself that could support the decision to
# ommit?
omitted_sample_names = rownames(sanger_only_sd)[!rownames(sanger_only_sd) %in% 
    colnames(data)]

# See anything weird in their abundance values?
omitted_samples = subset_samples(enterotype, is.na(Enterotype))
head(otu_table(omitted_samples))

## OTU Table:          [6 taxa and 9 samples]
##                      taxa are rows
##                  AM.AD.1   AM.AD.2 AM.F10.T1 DA.AD.1T DA.AD.3T JP.IN.1
## -1                0.4265 0.3322619  4.82e-01 6.25e-01   0.8053   0.577
## Bacteria          0.0000 0.0000000  0.00e+00 0.00e+00   0.0000   0.000
## Prosthecochloris  0.0000 0.0000000  0.00e+00 0.00e+00   0.0000   0.000
## Chloroflexus      0.0000 0.0000000  0.00e+00 8.10e-07   0.0000   0.000
## Dehalococcoides   0.0000 0.0002683  9.59e-06 0.00e+00   0.0000   0.000
## Thermus           0.0000 0.0000000  0.00e+00 0.00e+00   0.0000   0.000
##                  JP.IN.2 JP.IN.3 JP.IN.4
## -1               0.07865  0.1081  0.4619
## Bacteria         0.00000  0.0000  0.0000
## Prosthecochloris 0.00000  0.0000  0.0000
## Chloroflexus     0.00000  0.0000  0.0000
## Dehalococcoides  0.00000  0.0000  0.0000
## Thermus          0.00000  0.0000  0.0000

Re-run their analysis

Let's re-run their analysis but leave-in the omitted samples. First, subset the full “enterotypes” dataset to just the Sanger data:

ps_data = subset_samples(enterotype, SeqTech == "Sanger")

Redefine the “data” variable to be the abundances of all Sanger samples,
not just the 33 they included.

data = as(otu_table(ps_data), "matrix")

From this point, it is possible to directly follow the beginning of the complete script that they provide.

# Uncomment next two lines if R packages are already installed
# install.packages('cluster') install.packages('clusterSim')
library(cluster)
library(clusterSim)

# Omit the un-assigned garbage-bin genus
data = data[-1, ]

# Define a custom JSD calculator function
dist.JSD <- function(inMatrix, pseudocount = 1e-06, ...) {
    KLD <- function(x, y) sum(x * log(x/y))
    JSD <- function(x, y) sqrt(0.5 * KLD(x, (x + y)/2) + 0.5 * KLD(y, (x + y)/2))
    matrixColSize <- length(colnames(inMatrix))
    matrixRowSize <- length(rownames(inMatrix))
    colnames <- colnames(inMatrix)
    resultsMatrix <- matrix(0, matrixColSize, matrixColSize)

    inMatrix = apply(inMatrix, 1:2, function(x) ifelse(x == 0, pseudocount, 
        x))

    for (i in 1:matrixColSize) {
        for (j in 1:matrixColSize) {
            resultsMatrix[i, j] = JSD(as.vector(inMatrix[, i]), as.vector(inMatrix[, 
                j]))
        }
    }
    rownames(resultsMatrix) <- colnames(resultsMatrix) <- colnames
    resultsMatrix <- as.dist(resultsMatrix)
    attr(resultsMatrix, "method") <- "dist"
    return(resultsMatrix)
}

data.dist = dist.JSD(data)

pam.clustering = function(x, k) {
    # x is a distance matrix and k the number of clusters
    require(cluster)
    cluster = as.vector(pam(as.dist(x), k, diss = TRUE)$clustering)
    return(cluster)
}

### The following line is already asking for 3 clusters...
data.cluster = pam.clustering(data.dist, k = 3)

require(clusterSim)
nclusters = index.G1(t(data), data.cluster, d = data.dist, centrotypes = "medoids")

nclusters = NULL

for (k in 1:20) {
    if (k == 1) {
        nclusters[k] = NA
    } else {
        data.cluster_temp = pam.clustering(data.dist, k)
        nclusters[k] = index.G1(t(data), data.cluster_temp, d = data.dist, centrotypes = "medoids")
    }
}

plot(nclusters, type = "h", xlab = "k clusters", ylab = "CH index", 
    main = "Optimal number of clusters")

plot of chunk unnamed-chunk-6


obs.silhouette = mean(silhouette(data.cluster, data.dist)[, 3])
cat(obs.silhouette)  #0.1899451

## 0.2104

Not three clusters

With the original samples included, the optimal number of clusters appears to be 4, with 5 clusters also being a better option than 3.