Recommended Views/Resources

The phyloseq package [http://joey711.github.io/phyloseq/plot_ordination-examples]
The coursera course exploratory data analysis
see this youtube video [https://www.youtube.com/watch?v=wQhVWUcXM0A]
Dendograms in R: [http://rstudio-pubs-static.s3.amazonaws.com/1876_df0bf890dd54461f98719b461d987c3d.html]
slides on Principal coordinate analysis [http://www.ohio.edu/plantbio/staff/mccarthy/multivariate/PCO&NMDS.pdf]

require(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.1.3

require(dplyr)
require(dendextend)
require(RColorBrewer)
require(vegan)
require(GUniFrac)
require(labdsv)

## Warning: package 'mgcv' was built under R version 3.1.3

## Warning: package 'MASS' was built under R version 3.1.3

Background

This is about methods starting from an abundance table (that could be represented by a heatmap (heatmap function in R)) to define a distance between the samples (distance measures) and to subsequently cluster the samples based on this distance and to (re)present the distance between the samples (PCoA, hierarchical clustering >> dendrogram, k-means clustering)

Several examples are used here to illustrate the calculations, you need in your working directory:

Zeller2014AbundTable.RData

Distances (between samples calculated from abundance tables)

Distance/similarity is king. You want to know which samples are most similar to each other. You therefore need to define a distance measure. What makes samples similar? This is the most important question, a measure that makes sense in the biological context of the samples.

For illustration purposes of what a distance matrix is, assume you have an abundance table of 10 samples (rows) and 200 OTUs (columns). Each sample has therefore 200 coordinates. You then calculate pairwise distances, simply eucledian would be: $DistanceS1toS2 = sqrt((OTU1_{S1} - OTU1_{S2})^2 +..+ (OTU200_{S1} - OTU200_{S2})^2)$

Note already, that because you calculate pairwise distances the resulting distance matrix will be squared (n = number of samples) and symmetric, the distances on the diagonal are obviously 0.

Distance Measures for abundance tables and their implementation in R

General comments

I assume in the end it might be a customized method appropriate for the biological context. For example, if you know in CRC there are 3 index species, it would make sense to weigh them more.

Abundance tables are particularly challenging for distance measures because they are sparse (a lot of zeros occur). The Euclidean distance is not well suited for such tasks.

Bray Curtis…

R provides a function for calculating distances, the dist() function, which provides a fairly narrow range of distances (euclidean, manhattan, binary, canberra, and maximum). However, the vegan library provides the vegdist() function, and the LabDSV library provides the dsvdis() function as alternatives that provide many more indices, including those commonly used in vegetation ecology.

See: [http://ecology.msu.montana.edu/labdsv/R/labs/lab8/lab8.html]

Even better is the phyloseq package, see the distances here [http://joey711.github.io/phyloseq/distance]

Example Euclidean Distance with dist()

For this example I use the filtered Abundance Table from Zeller et al. (2014). In short, the authors provided an abundance table in the supplementary material S3. For their analysis all OTUs that did not reach an abundance of 0.0001 in at least one of the samples were kicked out. On top, I only kept the samples that were classified into Control or CRC (colorectal cancer). The data “Zeller2014AbundTable.RData” must be in the working directory and is a relative abundance table with 114 samples and 301 OTUs.

load("Zeller2014AbundTable.RData")
# There are 114 samples (rows), and 301 OTUs (columns), the first two columns 
# are the sample name and the groups.

AT <- Zeller2014AbundTable[,3:303]

# row.names appear problematic for me, especially since dplyr kicks them out.
# However, if you add them here the final dendrogram will be plotted with the
# sample identifiers. These are however not terribly informative. 
# I left them out, and of course, you could also add them back later.

# row.names(AT) <- Zeller2014AbundTable$Sample

EuclDistAT <- dist(AT, method = "euclidean")
# EuclDistAt is now of class 'dist'
# it is a vector of legnth n*(n-1)/2, with n being the number of samples
# here 114
# all.equal(length(dist.eucl), n*(n-1)/2)

n <- attr(EuclDistAT, "Size")
# The dissimilarity between row (sample) i and row j is given by
i =110
j = 76
EuclDistAT[n*(i-1) - i*(i-1)/2 + j-i]

## [1] 0.3822251

# if you want to see the distances in as a distance matrix:
EuclDistATMatrix <- as.matrix(EuclDistAT)
head(EuclDistATMatrix[1:5,1:5])

##           1         2          3          4         5
## 1 0.0000000 0.1909462 0.14536998 0.14422591 0.2459368
## 2 0.1909462 0.0000000 0.15004534 0.11425745 0.3189353
## 3 0.1453700 0.1500453 0.00000000 0.08448667 0.2354292
## 4 0.1442259 0.1142574 0.08448667 0.00000000 0.2614872
## 5 0.2459368 0.3189353 0.23542922 0.26148722 0.0000000

# and of course you can also look at a histogram to see the distribution of 
# the dissimilarities

#hist(dist.eucl)

UniFrac Distances

Is a distance that consideres the phylogenetic relation between the OTUs, i.e. for its implementation you need a phylogentic tree. Therefore here is an example independent of the Zeller_2014 dataset, the example of the GUniFrac package. Most info is from this package.

Chen et al. (2012): We want to define a distance between any two microbriome samples. “Numerous distance measures have been proposed to compare microbial communities. Phylogenetic distance measures, which account for the phylogenetic relationship among the species, provide far more power because they exploit the degree of divergence between different sequences. Among these, the UniFrac distances are the most popular ones.

unweighted UniFrac distance: considers only species presence and absence information and counts the fraction of branch length unique to either community (good to detect changes in rare lineages)
weighted UniFrac distance: that species abundance information and weights the branch length with abundance difference (good to detect changes in the most abundant lineages)

# require(vegan)
# require(GUniFrac)

Let us start with the example of the “GUniFrac” package. In their paper they state for this data set: “we finally have an OTU table of 60 samples (28 smokers versus 32 non-smokers) and 856 OTUs”

data(throat.otu.tab)
# throat.otu.tab is the abundance table
# dim() shows you it is exactly the 60 samples of 856 OTUs
# they use row.names in the data.frame
range(colSums(throat.otu.tab))

## [1]    1 9740

median(colSums(throat.otu.tab)) # so the vast majority of OTUs is only present in some samples, and with low abundance.

## [1] 4

# NOTE it appears to be absolute abundances here!!
data(throat.tree)
# a rooted phylogenetic tree of R class "phylo"
data(throat.meta)
# the meta information
groups <- throat.meta$SmokingStatus
# tells you which of the 60 samples are smokers and non smokers
# note age is thrown in simply, so a possible confounder 
### NOTE THEY DO a raryfication
# Rarefaction
otu.tab.rff <- Rarefy(throat.otu.tab)$otu.tab.rff
# since depth is not specified the lowest seqeuncing depth is used, see (?Rarefy)
# otu.tab.rff is an abundance matrix again with 60 rows and 856 OTUs
# but now the number of OTU assigned reads per sample is the same
range(rowSums(otu.tab.rff)) # shows you that all samples have now 766 reads

## [1] 766 766

# compare
range(rowSums(throat.otu.tab)) # ranged from 766 to 3763

## [1]  766 3763

unifracs <- GUniFrac(otu.tab.rff, throat.tree, alpha=c(0, 0.5, 1))$unifracs
# alpha is the parameter controlling weight on abundant lineages
dw <- unifracs[, , "d_1"] # Weighted UniFrac
du <- unifracs[, , "d_UW"] # Unweighted UniFrac
dv <- unifracs[, , "d_VAW"] # Variance adjusted weighted UniFrac
d0 <- unifracs[, , "d_0"] # GUniFrac with alpha 0
d5 <- unifracs[, , "d_0.5"] # GUniFrac with alpha 0.5

# ok, so you got the different distances out, see below, maybe as distance
# needed for some commands
# Permanova - Distance based multivariate analysis of variance
# adonis(as.dist(d5) ~ groups)

# Combine d(0), d(0.5), d(1) for testing

# PermanovaG(unifracs[, , c("d_0", "d_0.5", "d_1")] ~ groups)
# PCoA plot
# s.class(cmdscale(d5, k=2), fac = groups)

NOTE: for the phylogenetics, probably the R package ape is used. It might help you to get the rooted tree.

Hierachical clustering resulting in dendograms

As soon as you have your distances you can look for clusters (which samples are close together, do they belong to a biological relevant group). Hierarchical clustering is a deterministic method (same outcome each time), that results in a dendogram which allows you to judge the distances between the samples. You can use the hclust() function in R.

Another clustering method is k-means clustering (maybe below). For k-means you need to know the number of clusters (k). The method is pure clustering, iterative and dependent on the starting values (so not deterministic). It results in clusters but no dendrogram.

hClustering <- hclust(EuclDistAT, method = 'complete')
# do not put the matrix here does not work

The method argument is by default = ‘complete’. During the hierarchical clustering process groups are generated (bottom up method: find the two closest samples, unite them in a group, find the next two closest samples, this could be one sample and the just formed group). When ‘complete’, the distance between two groups is the distance between the two points that are furthest from each other. One alternative is ‘average’ where the distance between the two groups is the distance between the centers of gravity of the two groups.

There is also a top down method. See the youtube video under views if interested.

Plotting the dendrogram

Simple dendogram plots can be directly obtained with hClustering which is of class “hclust”. Definitely see the Rpubs link under recommended views.

plot(hClustering, hang = -1)

# the hang makes sure that all indices are written at the same height
# try
# plot(hClustering)

The height reflects the distance between the samples, want to know the distance between two samples, find their common node and look at which height it is.

NB: For obtaining clusters you had to cut the dendrogram at a certain height.

Getting the labels of the dendrogram coloured according to the biological groups of interest

Here the method I found to achieve this colouring using the as.dendogram() function, the output of class “dendrogram” allows more flexibility for plots.

hcd = as.dendrogram(hClustering)
# hcd is now actually of class "dendogram"

One interesting option is the plot of specific branches, note again that you have to cut at a certain height for clustering.

# gives you the option to show only some branches
plot(cut(hcd, h = 0.4)$upper, main = "Upper tree of cut at h=0.4")

But key is of course the colouring of the labels, and maybe the relabeling. Here I needed the package ‘dendextend’

# require(dendextend)

cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442",
               "#0072B2", "#D55E00", "#CC79A7")

colorCode <- c(Control=cbPalette[2], CRC = cbPalette[3])
# just a character vector
labels_colors(hcd) <- colorCode[Zeller2014AbundTable$Groups][order.dendrogram(hcd)]
# this was actually new for me
# Zeller2014AbundTable$Groups is just a character vector with "Control" and "CRC"
# colorCode[Zeller2014AbundTable$Groups] is then the Color for each "Control" and
# each "CRC"
# the [order.dendrogram(hcd)] puts the colours in the right order

# if you wanted to control the numbers
# ControlNumbers <- which(Zeller2014AbundTable$Groups %in% "Control")
# CRCNumbers <- which(Zeller2014AbundTable$Groups %in% "CRC")

and here the plot

plot(hcd)

And of course the numbers are not so informative, you could also give the IDs back here instead of using row.names from the beginning (this is commented out). Instead, I used just Control and CRC labels.

# labels(hcd) <- Zeller2014AbundTable$Sample[order.dendrogram(hcd)]
labels(hcd) <- Zeller2014AbundTable$Groups[order.dendrogram(hcd)]

plot(hcd)

For ggplot. There is an add on library, but I did not figure out the colour trick yet. See again the link under views.

The myplclust option from the coursera course

In the coursera course a function was provided to allow coulering already from hclust results

myplclust <- function(hclust, lab = hclust$labels, lab.col = rep(1, length(hclust$labels)), 
                      hang = 0.1, ...) {
        ## modifiction of plclust for plotting hclust objects *in colour*!  Copyright
        ## Eva KF Chan 2009 Arguments: hclust: hclust object lab: a character vector
        ## of labels of the leaves of the tree lab.col: colour for the labels;
        ## NA=default device foreground colour hang: as in hclust & plclust Side
        ## effect: A display of hierarchical cluster with coloured leaf labels.
        y <- rep(hclust$height, 2)
        x <- as.numeric(hclust$merge)
        y <- y[which(x < 0)]
        x <- x[which(x < 0)]
        x <- abs(x)
        y <- y[order(x)]
        x <- x[order(x)]
        plot(hclust, labels = FALSE, hang = hang, ...)
        text(x = x, y = y[hclust$order] - (max(hclust$height) * hang), labels = lab[hclust$order], 
             col = lab.col[hclust$order], srt = 90, adj = c(1, 0.5), xpd = NA, ...)
}

A plot using this function can look like this

myplclust(hClustering, lab = Zeller2014AbundTable$Groups, lab.col = labels_colors(hcd))

so this is definitely a function to think about albeit currently the labeling is obviously wrong so I have to figure out how it works.

The effect of the dominating UNMAPPED OTU

Should probably go into the distance part. Key is to notice again that distance is king. The abundance table AT is dominated by the UNMAPPED OTU, see the difference on the Dendrogram by removing this.

EuclDistMatrix2 <- dist(AT[,-1], method = "euclidean")
# see above this is not really a matrix it contains the matrix though.
hClustering2 <- hclust(EuclDistMatrix2)
hcd2 = as.dendrogram(hClustering2)
labels_colors(hcd2) <- colorCode[Zeller2014AbundTable$Groups][order.dendrogram(hcd2)]
labels(hcd2) <- Zeller2014AbundTable$Groups[order.dendrogram(hcd2)]

plot(hcd2)

Heatmaps and Colours in R

The heatmap function takes directly an abundance table and clusters both the Samples and the OTUs. Key will then of course be to define the distfun argument. Here I just played a bit, especially with colours but there is still things to do to get nice heatmap representations of abundance tables (talk to Alireza)

Colours

A heatmap is colours. I still did not really figure out how to adjust the colour range to the actual range of the abundance table. The ColorBrewer was a suggestion of Roger Peng during the exploratory data analysis coursera course.

#require(RColorBrewer)

Creating a colour range

#see http://www.cookbook-r.com/Graphs/Colors_%28ggplot2%29/
cols <- brewer.pal(4, "BuGn") # sets how many colours of the palette are fixed??
# can not be because it allows later even pla(2) so a number smaller than four
pal <- colorRampPalette(cols)
# pal is now a function
colourmap <- pal(20) # makes 20 colours from the function pal

heatmap(data.matrix(AT[,]), col = colourmap)

Obviously the “UNMAPPED” dominates, so let’s see the heatmap without it

heatmap(data.matrix(AT[,-1]), col = colourmap)

#heatmap(data.matrix(log10(AT[,-1]+0.0001)), col = colourmap)

This obviously also changed the clustering. Still not super good, attempt to log

heatmap(data.matrix(log10(AT[,-1]+0.0001)), col = colourmap)

yet another clustering. I just tried to add the same colour a lot of times at the end.

colourmap2 <- c(colourmap, rep(colourmap[20],200)) # makes 20 colours from the function pal

heatmap(data.matrix(AT[,-1]), col = colourmap2)

#heatmap(data.matrix(log10(AT[,-1]+0.0001)), col = colourmap)

That is the current status COLOUR

k-means Clustering

not yet considered

Principal Coordinate Analysis (MDS), NMDS, maybe also something on Principal Component Analysis

Some Background Info

A lot of this info came from these slides here: [http://www.ohio.edu/plantbio/staff/mccarthy/multivariate/PCO&NMDS.pdf]

As metnioned the distance matrix was already the most critical thing. PcoA is to obtain an Euclidean representation of the provided distance matrix (independent on how the similarity or distance was determined).

Gover (1966) developed PCoA also known as MDS (metric multidimensional scaling). So the task is to allow positioning of the objects (samples) in a space with fewer dimensions while preserving their distance relationship as well as possible

Metric distance matrices have no violation of the triangle inequality (if non metric the relationships between objects might produce negative eigenvalues, but even that does often not impair the Euclidean representation for the first few principal components). Transformations of non metric matrices can result in metric ones (e.g. add a constant to distances)

One could argue that PCoO is necessarily inferior to PCA because in PCA each point is placed exactly where it ought to be, whereas in PCoO each point is only approximated based on a best-fit model of the dissimilarities. However, the imprecision is in most cases of no relevance for the biologist.

Example to illustrate the algorithm of PCoA

PCoA works roughly like this, see follwoing example:

you start with a distance matrix D
you transform D into A using $A = -1/2*D^{2}$
you scale matrix a resulting in Matrix Delta, so that $delta_{j,i} = a_{j,i} - mean(j) - mean(i) + mean(A)$ # mean(j) is rowMean, mean(i) is column mean
You calculate eigenvectors and eigenvalues of Delta (results in n eigenvectors with n being the number of samples)
the Eigenvectors are scaled using their corresponding eigenvalues: $NewVector <- Eigenvector*sqrt(Eigenvalue)
When the scaled eigenvectors are written as columns, the rows of the resulting table are the coordinates of the objects in PCoA space. In other words the columns given by for example cmdscale are the eigenvectors.

# the example is from the above mentioned slides but done self in R
########################
# Do PCoA YOurself#######
#########################

DCPO <- matrix(c(0,3.16228, 3.16228, 7.07197, 7.07197, 3.16228, 0, 4.47214, 4.47214,
                 6.32456, 3.16228, 4.47214, 0, 6.32456, 4.47214,
                 7.07197, 4.47214, 6.32456, 0, 4.47214, 
                 7.07197, 6.32456, 4.47214, 4.47214, 0), nrow = 5)

# so DCPO is the distance matrix

# Transform DCPO into matrix A
A <- -1/2*DCPO^2

# Scaling/centering of Matrix A
RowM <- rowMeans(A)
ColM <- colMeans(A)
M <- mean(A)
Delta <- A
for (i in 1:5){
        for (j in 1:5) {
                Delta[i,j] <- Delta[i,j] - RowM[i] - ColM[j] + M
        }
}


# Compute the eigenvalues and eigenvectors of Delta
Eigen <- eigen(Delta)
# SEE YOU GET 5 EIGENVALUES (nrow = ncol of sqare distance matrix)
# Scale the eigenvectors
FirstEigenvector <- Eigen$vectors[,1]
FirstEigenvalue <- Eigen$values[1]
SecondEigenvector <- Eigen$vectors[,2]
SecondEigenvalue <- Eigen$values[2]

PrincipalCoords <- data.frame(First = FirstEigenvector*sqrt(FirstEigenvalue),
                              Sedond = SecondEigenvector * sqrt(SecondEigenvalue))

##################################################################
#### prove that the R functions result in the same PCoA ######
##################################################################

CmdScale <- cmdscale(DCPO, k =2)

#CmdScale is now equal your PrincipalCoords

# the slides used a package for the PCoA
# require(labdsv)


PCO.1 <- pco(DCPO, k =2)
# also the same

Examles using the Zeller Data illustrating different R possibilities for PCoA

cmdscale() stats package

The classical way in R to do PCoA is the function cmdscale(), standing for classical multidimensional scaling (It is a major part of what ecologists call ???ordination???.)

So the first thing to note is that with the argument k you set how many dimensions the points are supposed to have (for plotting puropses most likely two). Since you saw the algorithm above you now know that if k is 2 it just takes the first two eigenvectors ignoring the others. Note, that means clearly that the distances must not be perfectly reflected, anyway not, see the example later.

cmdscale follows the analysis of Mardia (1978), and returns the best-fitting k-dimensional representation

A set of Euclidean distances on n points can be represented exactly in at most n - 1 dimensions. With n =114, you get an error as soon as you put k > 113.

The configuration returned is given in principal-component axes, so the reflection chosen may differ between R platforms (see prcomp). (???)

The outcome If eig = FALSE, add = FALSE and x.ret = FALSE (default), a matrix with k columns whose rows give the coordinates of the points chosen to represent the dissimilarities

PcoA2D_AT <- cmdscale(EuclDistAT, k =2)
# so here simply the Euclidean distance is used, see distances above

PcoA2D_AT <- as.data.frame(PcoA2D_AT)
PcoA2D_AT$Group <- Zeller2014AbundTable$Groups
names(PcoA2D_AT)[1:2] <- c('PC1', 'PC2')

cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442",
               "#0072B2", "#D55E00", "#CC79A7")
Tr_PcoA <- ggplot(PcoA2D_AT, aes(x = PC1, y = PC2, colour = Group, 
                                 label = row.names(PcoA2D_AT)))
Tr_PcoA + geom_point(size =5) +
        scale_colour_manual(values = cbPalette[2:3]) +
        geom_text(col = 'black')

From this plot samples 110 and 76 are very close together, so for sample 110 sample 76 should be the one with the smallest dissimilarity. Just to check this

DistForSample110 <- data.frame(Sample = 1:114, Dist = EuclDistATMatrix[110,])
head(arrange(DistForSample110, Dist), n=8)

##   Sample      Dist
## 1    110 0.0000000
## 2     23 0.2007931
## 3     76 0.2061865
## 4    105 0.2299289
## 5     33 0.2344901
## 6     83 0.2665276
## 7     78 0.2665616
## 8     15 0.2800883

So obviously the distances in the PC1 and PC2 space do not exactly represent the order they were in the original distance matrix. This is because it is not exact and you only use the first two Eigenvectors.

pco() < labsdv package

Just check that this method results in exactly the same plot

PcoA2D_ATpco <- pco(EuclDistAT, k =2) # note here you can actually also use EuclDistATMatrix, so in difference to cmdscale it allows matrice
all.equal(PcoA2D_ATpco$points, cmdscale(EuclDistAT, k =2))

## [1] TRUE

So same result.

NMDS

NMDS (non metric multidimensional scaling) is another method, that might even be better than PCoA in compressing the distance relationships into 2 or 3 dimensions. It is more computer intensive.

Contrary to eigenvector methods such as PCA or PCO, NMDS calculations do not maximize the variability associated with individual axes of the ordination; NMDS axes are arbitrary, so the final plots can be rotated, centered, and inverted

Distances, then Principal coordinate analysis, clustering, Dendrograms