In this tutorial we describe and show examples of R code implementing methods to use standard microbiome reference groups to simplify beta-diversity analyses and faclitate independent validation as described in Using standard microbiome reference groups to simplify beta-diversity analyses and facilitate independent validation by Marlena Maziarz, Ruth M. Pfeiffer, Yunhu Wan and Mitchell H. Gail. The code, test files, HMP reference sets and any other necessary files are on github.com/NCI-biostats/microbiome-fixed-reference (you need to right click on it and open it in a new tab/window.)
This tutorial describes the use of two functions: HMPdistance and GHotelling. HMPdistance calculates distance to the HMP stool and nasal reference sets, GHotelling is a generalized Hotelling test, which accounts for correlation between samples from the same individual.
First, download
microbiome-fixed-reference-functions.r
taxonomy-GG13-8-and-HMP-reference-sets-L2-L6.rdata
from GitHub (again, right click) and save in the current directory. Source the code into your workspace (no need to do anything with taxonomy-GG13-8-and-HMP-reference-sets-L2-L6.rdata at this time, it will be loaded automatically from your current directory):
source('microbiome-fixed-reference-functions.r')
This function computes the mean distance dstool of a microbiome sample (or samples) to a reference set of 92 stool samples from HMP (Human Microbiome Project) and the mean distance dnasal to a reference set of 74 HMP nasal samples. The function definition is
HMPdistance <- function(tax.level = NULL,
d.new.filename = NULL,
d.new.ix.col.not.rel.abu = NULL,
measure = 'bc',
print.details = FALSE,
source.directly.from.github = FALSE)
tax.level = defines the taxonomic level for which the distances are to be computed, should be one of: ‘l2.phylum’, ‘l3.class’, ‘l4.order’, ‘l5.family’, ‘l6.genus’.
d.new.filename = filename of your csv or rdata (including path, ready to load) dataset, samples/subjects in rows, relative abundance (+ possibly metadata) in columns (more details below).
d.new.ix.col.not.rel.abu = index of columns that are not relative abunance columns (ie. index of metadata columnsm if any), default = NULL.
measure = ‘bc’ for Bray-Curtis (default) or ‘corr’ for D_Pearson. For a given two \(1 \times p\) composition vectors, \(X = (X_1, \ldots, X_p)\) and \(Y = (Y_1, \ldots, Y_p)\), the Bray-Curtis distance is \[BC = 1 - \sum_{i = 1}^{p}min(X_i, Y_i).\] The \(D_{Pearson}\) distance is \[D_{Pearson} = \frac{\sum_{i = 1}^{p}(X_i - \bar{X})(Y_i - \bar{Y})}{\{\sum_{i = 1}^{p}(X_i - \bar{X})^2\sum_{i = 1}^{p}(Y_i - \bar{Y})^2\}^{1/2}} = 1 - \frac{\sum_{i = 1}^{p}X_iY_i-(1/p)}{[\{\sum_{i = 1}^{p}X_i^2-(1/p)\}\{\sum_{i = 1}^{p}Y_i^2 - (1/p)\}]^{1/2}}\] since \(\bar{X} = 1/p\) and \(\bar{Y} = 1/p\). Note, these are not metrics, dissimilarity measures would be a more accurate term to describe them.
print.details = TRUE/FALSE, if TRUE prints details of the taxonomy matching.
source.directly.from.github = TRUE/FALSE, FALSE (default) requires that you download the taxonomy-GG13-8-and-HMP-reference-sets-L2-L6.rdata and save it to your current directory, it will be loaded automatically by the function. The file contains the reference sets and the GG13.8 taxonomy object, all are needed by the HMPdistance function. If TRUE, the file will be laoded directly from GitHub (requires that repmis library is loaded).
Your datset should have \(n\) rows (samples) and \(k + p\) columns, where \(p\) = number of relative abundance columns, and \(k\) metadata columns such as sample ID, subject ID. The sum of relative abundances in a given row over the \(p\) columns should equal 1 (up to the 6th decimal place).
Before the distance to the reference set can be calculated, we need to match the taxonomy in your dataset to the reference set. For example, at the genus level, Greengenes 13.8 (GG13.8) has \(p=2929\) different genus sequences (including missing ones). An example taxonomic string is \[k__Bacteria.p__Actinobacteria.c__Acidimicrobiia.o__Acidimicrobiales.f__Acidimicrobiaceae.g__Ferrimicrobium\] which specifies the kingdom, phylum, class, order, family and genus. Another example of a taxonomic string at the \(genus\) level, one that includes missing data on order, family and genus, might be \[GG13.8 is k__Bacteria.p__Acidobacteria.c__TM1.o_.f_.g_\] which has missing information at the order, family and genus levels but is a unique genus sequence in GG13.8. At the phylum level, GG13.8 includes p=91 kingdom/phyla. A list of the 2929 sequences at the genus level and the sequences at the family (p=1116), order (p=664), class(p=319) and kingdom/phylum (p=91) are given in order in the attached file taxonomy-GG13-8-and-HMP-reference-sets.xlsx.
An \(n\) x \(2\) matrix of distances to stool and nasal HMP references (ie. with columns \(dist.to.hmp.stool\), \(dist.to.hmp.nasal\)).
If you decide not to use Greengenes13.8 for your microbiome analysis pipeline, use the following conventions:
If you are only interested in the relative abundances at the kingdom/phylum level and want to compute distances at that level, then enter the sequences using the spelling as shown in the kingdom/phylum column of GG_13_8.xlsx (the spelling is important as it will be used for matching the relative abundance columns in the dataset to those in the reference sets. GG_13_8_taxonomy-used-in-reference.xlsx shows the names used in the reference sets from phylum to genus levels). Note, the kingdom/phylum names should be automatically generated by the analysis pipeline used (such as QIIME) when using GG13.8 closed reference at the phylum level. No information on order, class, family and genus should be included, only information up to the level you are working on. If the kingdom is known but the kingdom/phylum is not among the 91 string patterns in GG13.8, then the non-recognized phylum will be set to missing and its relative abundance will be added to K._.
In order to use the HMPdistance function to compute distances at a given level, you need to:
Use the naming conventions as in GG13.8 (also shown in GG_13_8_taxonomy-used-in-reference.xlsx) for each level.
At a given level, any name that is not in GG13.8 will be assigned missing values up to the lowest level at which the sequence does match a sequence in GG13.8. For example, let K.P.O.C.F.G (kingdom.phylum.order.class.family.genus) denote a particular sequence in GG13.8 and let K.P.O.C.f.g denote a sequence to be classified at the genus level but one that is not in GG13.8. The string will match the one in GG13.8 at the kingdom, phylum, order and class levels, but not at family or genus. It will be assigned K.P.O.C.f._ first, a match will be searched for and not found (since family f is also not in GG13.8), it will then be assigned K.P.O.C._._ and at that point a match will be found in GG13.8. This effectively means that any *leaves$ not found in the reference set will be collapsed and the relative abundances for those summed up and saved at the next level up that is found in GG13.8. Note, not all missingness patterns are in GG13.8. As a rule, if a given name string is not found in GG13.8, the lowest non-missing level will be assigned as missing and an attempt at matching will be made, and this process will repeat until a match is found or K._._._._._ is reached.
You need to be connected to the internet, since HMPdistance function needs to download taxonomy-GG13-8-and-HMP-reference-sets-L2-L6.rdata from GitHub
This example calculates the Bray-Curtis distance from all the samples in HMP to the HMP stool and nasal reference sets:
d.l2.hmp <- HMPdistance(tax.level ='l2.phylum',
d.new.filename = "HMP_L2_030317_phylum-example-input-file.csv",
d.new.ix.col.not.rel.abu = 1:4,
measure = 'bc')
dim(d.l2.hmp)
## [1] 681 2
head(d.l2.hmp)
## dist.to.hmp.stool dist.to.hmp.nasal
## 1 0.2170888 0.7399657
## 2 0.1813009 0.7608636
## 3 0.4449405 0.6322286
## 4 0.8962866 0.5433916
## 5 0.8057144 0.2336688
## 6 0.1859455 0.7474709
summary(d.l2.hmp)
## dist.to.hmp.stool dist.to.hmp.nasal
## Min. :0.1745 Min. :0.2111
## 1st Qu.:0.3478 1st Qu.:0.3839
## Median :0.5990 Median :0.6511
## Mean :0.5584 Mean :0.5854
## 3rd Qu.:0.7267 3rd Qu.:0.7211
## Max. :0.9995 Max. :0.9987
Now, using an external dataset, with 1-Pearson correlation as the dissimilarity measure:
d.l2.baxter <- HMPdistance(tax.level ='l2.phylum',
d.new.filename = "Baxter_L2_rel_abundance.csv",
d.new.ix.col.not.rel.abu = 1:4,
measure = 'corr')
head(d.l2.baxter)
## dist.to.hmp.stool dist.to.hmp.nasal
## 1 0.2114405 0.6749880
## 2 0.2475106 0.6251935
## 3 0.5184979 0.6293766
## 4 0.5577265 0.5815625
## 5 0.1439118 0.6948250
## 6 0.5580645 0.4064870
summary(d.l2.baxter)
## dist.to.hmp.stool dist.to.hmp.nasal
## Min. :0.05934 Min. :0.1631
## 1st Qu.:0.20782 1st Qu.:0.5744
## Median :0.37746 Median :0.6117
## Mean :0.36931 Mean :0.6142
## 3rd Qu.:0.53946 3rd Qu.:0.6637
## Max. :0.69704 Max. :0.8422