- Properties of data set
- Unsupervised Analysis
- Supervised Analysis
Mehrdad Yazdani
August 12, 2014
The data come from stool samples from the NIH Human Microbiome Project (HMP) and from Professor Larry Smarr. The HMP cohort includes both healthy and sick subjects.
Here we focus on the populations of the different species.
The data give the composition of species for each subject. Hence, the data set has the properties of a compositional data set: for each subject, the species abundances are non-negative and are expected to sum to a constant (here, 1.0).
In our data, the number of species is:
## [1] 2572
The number of subjects is:
## [1] 249
Note that we have far more species than subjects in this data set.
The composition of the species for each subject must sum to 1.0; however, this is not the case for this data set.
Possible reason: numerical "round-off" errors introduce this discrepancy.
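The sum-to-one check described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the code used for the original analysis; the matrix `abundance`, its values, and the `tolerance` are invented for the example.

```python
import numpy as np

# Hypothetical toy matrix: rows are subjects, columns are species.
# The values are invented for illustration; they are not from the study.
abundance = np.array([
    [0.50, 0.30, 0.20],
    [0.70, 0.20, 0.09],   # this row sums to 0.99, not 1.0
    [0.25, 0.25, 0.50],
])

tolerance = 1e-6  # assumed tolerance for "sums to 1.0"

row_sums = abundance.sum(axis=1)
off_by = np.abs(row_sums - 1.0)

# Indices of subjects whose composition does not sum to 1.0.
not_closed = np.flatnonzero(off_by > tolerance)
print("subjects not summing to 1.0:", not_closed.tolist())
print("largest deviation:", off_by.max())
```

A histogram of `row_sums` would show how far the subjects are from a closed composition.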
Zeros must be handled carefully. There are two classes of zeros in compositional data sets: absolute zeros and round-off zeros.
Absolute zeros are dealt with by removing them. Round-off zeros are trickier and are typically replaced with "small" values (imputation tricks).
The number of species that are zero for every subject is:
## [1] 29
We will treat these species as being absolute zeros and remove them from the data:
## [1] "Marinilabilia sp. AK2"
## [2] "Desulfovibrio piezophilus"
## [3] "Streptomyces bottropensis"
## [4] "Novosphingobium sp. AP12"
## [5] "Acinetobacter sp. NCTC 7422"
## [6] "Caldisphaera lagunensis"
## [7] "Streptomyces auratus"
## [8] "Candidatus Arthromitus sp. SFB-1"
## [9] "Gillisia sp. CBA3202"
## [10] "Thielavia terrestris"
## [11] "Synechococcus sp. PCC 6312"
## [12] "Alcaligenes faecalis"
## [13] "Aspergillus fumigatus"
## [14] "Helicobacter heilmannii"
## [15] "Gibberella zeae"
## [16] "Xanthomonas perforans"
## [17] "marine gamma proteobacterium HTCC2080"
## [18] "Pseudoalteromonas undina"
## [19] "Myceliophthora thermophila"
## [20] "Magnaporthe oryzae"
## [21] "Candida glabrata"
## [22] "Synechococcus sp. RS9916"
## [23] "marine gamma proteobacterium HTCC2148"
## [24] "Aspergillus oryzae"
## [25] "Herbaspirillum sp. GW103"
## [26] "Pseudoalteromonas spongiae"
## [27] "Rivularia sp. PCC 7116"
## [28] "Streptomyces albus"
## [29] "Acinetobacter parvus"
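Removing the species that are zero for every subject amounts to dropping all-zero columns. A minimal sketch in Python with NumPy, assuming rows are subjects and columns are species; the names in `species` and the values in `abundance` are hypothetical.

```python
import numpy as np

# Hypothetical species names and abundances (rows: subjects, columns: species).
species = np.array(["sp_A", "sp_B", "sp_C", "sp_D"])
abundance = np.array([
    [0.6, 0.0, 0.4, 0.0],
    [0.1, 0.0, 0.9, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

# A species is treated as an absolute zero if it is zero for every subject.
always_zero = (abundance == 0).all(axis=0)
removed = species[always_zero]

# Keep only the species observed in at least one subject.
kept_species = species[~always_zero]
filtered = abundance[:, ~always_zero]

print("removed:", removed.tolist())
print("remaining shape:", filtered.shape)
```

Zeros that survive this step, such as the one for `sp_A` in the last row, belong to the round-off class and need separate handling.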
After removing absolute zeros, we observe that a large number of zeros from round-off errors remain.
Since the compositions do not sum to 1.0, we replace these round-off zeros with nonzero values such that each subject's composition sums to 1.0, which makes our data a true compositional data set.
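One simple way to carry out this kind of replacement is to substitute a small pseudo-abundance for each zero and then rescale every row to sum to 1.0. The sketch below (Python with NumPy) illustrates that idea only; the text does not say which replacement values were actually used, and `delta` and the toy matrix are assumptions.

```python
import numpy as np

# Hypothetical abundances (rows: subjects, columns: species); rows need not sum to 1.
abundance = np.array([
    [0.60, 0.00, 0.39],
    [0.00, 0.00, 0.98],
    [0.30, 0.30, 0.40],
])

delta = 1e-4  # assumed pseudo-abundance for round-off zeros; not from the source

# Replace zeros with the small value, leaving observed abundances untouched.
imputed = np.where(abundance == 0, delta, abundance)

# Closure: rescale each subject so that its composition sums to 1.0.
closed = imputed / imputed.sum(axis=1, keepdims=True)

print(closed.sum(axis=1))
```

After this step every entry is strictly positive, so logarithms of the composition are well defined.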
Because of these constraints, the usual algebra of additions, multiplications, etc. that we are used to does not apply. Typically, a transformation function is applied to the composition so that we can apply the usual Euclidean algebra. Many such transformation functions are in use.
Here we apply the log transformation to the compositions.
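A log transformation of a strictly positive composition can be sketched as below in Python with NumPy. The centered log-ratio shown alongside it is a common variant from the compositional-data literature and is included only for comparison; it is not claimed to be the transformation used here. The matrix `composition` is hypothetical.

```python
import numpy as np

# Hypothetical closed compositions (each row sums to 1.0, no zeros).
composition = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
])

# Plain element-wise log transformation.
log_comp = np.log(composition)

# Centered log-ratio (clr): subtract each subject's mean log abundance,
# which is the log of the geometric mean of that subject's composition.
clr = log_comp - log_comp.mean(axis=1, keepdims=True)

print(log_comp)
print(clr)
```

Both transformations require every entry to be positive, which is why the zeros have to be handled first.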
The top 3 PCs explain 80% of the variance.
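The fraction of variance explained by each principal component can be obtained from the singular values of the centered data matrix. A minimal sketch in Python with NumPy follows; the random matrix `X` is a hypothetical stand-in for the log-transformed data, so its numbers will not match those reported here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the transformed data: 20 subjects x 50 species.
X = rng.normal(size=(20, 50))

# Center each species (column) before PCA.
Xc = X - X.mean(axis=0)

# PCA via the singular value decomposition of the centered matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s        # coordinates of each subject on each PC
loadings = Vt.T       # weight of each species in each PC

# Fraction of the total variance carried by each PC.
explained = s**2 / np.sum(s**2)
print("top 3 PCs explain:", explained[:3].sum())
```

The columns of `scores` are what a PCA scatter plot displays, and the columns of `loadings` show how much each species contributes to a component.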

Hypothesis: PC2 is the most useful component for discriminating healthy vs. sick subjects.

Many of the loadings are close to zero; therefore, PC2 can be approximated by a sparse vector. This can lead to more interpretable results as to which species "matter." This is in contrast to PC1.
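One simple way to form such a sparse approximation is to zero out loadings below a cutoff and check how close the result stays to the original direction. The sketch below uses Python with NumPy; the loading values, species names, and `threshold` are invented for illustration, and hard thresholding is only one of several possible sparsification schemes.

```python
import numpy as np

# Hypothetical species names and loadings for one principal component.
species = np.array(["sp_A", "sp_B", "sp_C", "sp_D", "sp_E"])
pc_loadings = np.array([0.02, -0.70, 0.01, 0.71, -0.03])

threshold = 0.1  # assumed cutoff below which a loading is treated as zero

# Hard-threshold the loadings, then rescale the result to unit length.
sparse = np.where(np.abs(pc_loadings) >= threshold, pc_loadings, 0.0)
sparse = sparse / np.linalg.norm(sparse)

# Cosine similarity between the sparse vector and the original direction.
cosine = sparse @ pc_loadings / np.linalg.norm(pc_loadings)

selected = species[sparse != 0]
print("species kept:", selected.tolist())
print("cosine similarity:", cosine)
```

A cosine similarity close to 1 indicates that the few retained species capture almost all of the component.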
We build classifiers to determine which species are important for discriminating healthy from sick subjects. In our approach, we pool all LS, CD, and UC subjects into one group labeled as "sick," and all HE subjects are labeled as "healthy."
The classifier that we use is a logistic regression model, and we compare models using the Akaike information criterion (AIC), which trades off goodness of fit against the number of model parameters.
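For reference, the standard definition of the AIC for a logistic regression model is given below. The symbols are conventional notation rather than notation taken from this analysis: $k$ is the number of estimated parameters, $\hat{L}$ is the maximized likelihood, $y_i \in \{0, 1\}$ is the label of subject $i$, and $\hat{p}_i$ is the fitted probability that subject $i$ is sick.

$$
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\ln\hat{L} = \sum_{i=1}^{n} \left[\, y_i \ln \hat{p}_i + (1 - y_i)\ln(1 - \hat{p}_i) \,\right]
$$

For a model with an intercept and a single predictor, $k = 2$, so models of that form differ in AIC only through their log-likelihoods.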
Note that since we have an order of magnitude fewer subjects than species, this is an underdetermined system (more unknowns than equations), and it is not meaningful to use "all" the data. To mitigate this issue, we take subsets of the predictors that we have. We first take subsets from the PCs, followed by subsets of the species.
We build a logistic regression model on each of the top 3 PCs to measure how well these components separate sick from healthy subjects. Recall that our PCA plots from before appeared to show PC2 to be the most useful for this task. The AIC for the logistic regression model that uses only PC1 is:
## [1] 163.8
The AIC for the logistic regression model that uses only PC2 is:
## [1] 61.6
The AIC for the logistic regression model that uses only PC3 is:
## [1] 190.9
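A lower AIC indicates a better-fitting model. Because each of these models has the same number of parameters, the comparison reduces to a comparison of their likelihoods. These analyses therefore support our earlier hypothesis that PC2 is more discriminative than the other PCs.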
We now build classifiers on each individual species. The AIC for logistic regression models that use single species is as follows:
We select the two species with the lowest individual AIC. Since each AIC was computed for a model that uses a single species, the pair selected this way may be sub-optimal as a joint predictor.
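The per-predictor screening described above can be sketched as follows: fit one logistic regression per column, compute its AIC, and keep the two columns with the lowest values. This is an illustrative NumPy implementation using Newton's method, not the code behind the reported numbers; the helper `logistic_aic`, the synthetic data, and the tiny `ridge` term added for numerical stability are all assumptions. The same helper could be applied to PC scores instead of species.

```python
import numpy as np

def logistic_aic(x, y, n_iter=50, ridge=1e-8):
    """AIC of a logistic regression of y on one predictor x plus an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        weights = p * (1.0 - p)
        grad = X.T @ (y - p)
        # Negative Hessian of the log-likelihood, with a tiny stabilising ridge.
        hess = X.T @ (X * weights[:, None]) + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return 2 * X.shape[1] - 2 * loglik

# Synthetic data: 200 subjects, 4 hypothetical species; 0 = healthy, 1 = sick.
rng = np.random.default_rng(0)
n = 200
y = (np.arange(n) % 2).astype(float)
X = rng.normal(size=(n, 4))
X[:, 1] += 2.0 * y   # strongly informative column
X[:, 2] += 1.0 * y   # weakly informative column

# One single-predictor model per column, ranked by AIC (lowest first).
aic = np.array([logistic_aic(X[:, j], y) for j in range(X.shape[1])])
order = np.argsort(aic)
best_two = order[:2]
print("AIC per column:", np.round(aic, 1))
print("two lowest-AIC columns:", best_two.tolist())
```

Ranking by individual AIC ignores how the predictors interact, so a joint model on the selected pair would need its own fit to be evaluated.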
The two species with the lowest individual AIC are:
## [1] "Bacteroides.dorei" "Bacteroides.oleiciplenus"
Their respective individual AICs are:
## [1] 24.47 49.38
This plot shows that the two species with the lowest AIC separate the groups better than the PCA plot from before. However, a lot of interesting structure that the PCA plot revealed is lost (for example, the sub-cluster within healthy subjects).
While E. coli does not have the lowest AIC, comparing it with the lowest-AIC species reveals good discrimination and the interesting structures that PCA had revealed.