STAT540 Seminar 09: Clustering and PCA

Rebecca Johnston 12/03/2014

Contributor: Gabriela Cohen Freue

In this seminar we'll explore clustering genes and samples using the photoreceptor time series with the two genotypes. It is interesting to compare the results of different clustering algorithms, and to see the effect of filtering and/or definitions of attributes on the resulting clusters.

Load required packages and data

Load the photoRec data. Remember, for more information on the photoRec dataset, go here.

library(RColorBrewer)
library(cluster)
library(pvclust)
library(xtable)
library(limma)
library(plyr)
library(lattice)
library(ggplot2)
library(reshape2)
prDat <- read.table("GSE4051_data.tsv", header = TRUE, row.names = 1)
str(prDat, max.level = 0)
## 'data.frame':    29949 obs. of  39 variables:
prDes <- readRDS("GSE4051_design.rds")
str(prDes)
## 'data.frame':    39 obs. of  4 variables:
##  $ sidChar : chr  "Sample_20" "Sample_21" "Sample_22" "Sample_23" ...
##  $ sidNum  : num  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...

Finally, as an additional step to make visualization easier later, we'll rescale the rows, since we're not interested in absolute differences in expression between genes at the moment. Note that although one can do this step within the heatmap() function, it will not be available for other functions we will use. We can always go back to the original data if we need to.

sprDat <- t(scale(t(prDat)))
str(sprDat, max.level = 0, give.attr = FALSE)
##  num [1:29949, 1:39] 0.0838 0.1758 0.7797 -0.3196 0.8358 ...
round(data.frame(avgBefore = rowMeans(head(prDat)), avgAfter = rowMeans(head(sprDat)), 
    varBefore = apply(head(prDat), 1, var), varAfter = apply(head(sprDat), 1, 
        var)), 2)
##              avgBefore avgAfter varBefore varAfter
## 1415670_at        7.22        0      0.02        1
## 1415671_at        9.37        0      0.35        1
## 1415672_at        9.70        0      0.15        1
## 1415673_at        8.42        0      0.03        1
## 1415674_a_at      8.47        0      0.02        1
## 1415675_at        9.67        0      0.03        1

The data for each row – which is for one probeset – now has mean 0 and variance 1.

Sample Clustering

In this part, we will use the samples as objects to be clustered, using the genes as attributes (i.e., each sample is a vector of ~30K expression values).

Hierarchical clustering for photoRec data

In this section we will illustrate different hierarchical clustering methods. These plots were included in Lecture 16.

However, for most expression data applications, we suggest that you standardize the data, use Euclidean distance (on standardized data this behaves much like a correlation-based distance), and use average linkage.

# compute pairwise distances
pr.dis <- dist(t(sprDat), method = "euclidean")

# create a new factor representing the interaction of gType and devStage
prDes$grp <- with(prDes, interaction(gType, devStage))
summary(prDes$grp)
##        wt.E16     NrlKO.E16         wt.P2      NrlKO.P2         wt.P6 
##             4             3             4             4             4 
##      NrlKO.P6        wt.P10     NrlKO.P10    wt.4_weeks NrlKO.4_weeks 
##             4             4             4             4             4
# compute hierarchical clustering using different linkage types
pr.hc.s <- hclust(pr.dis, method = "single")
pr.hc.c <- hclust(pr.dis, method = "complete")
pr.hc.a <- hclust(pr.dis, method = "average")
pr.hc.w <- hclust(pr.dis, method = "ward")

# plot them
op <- par(mar = c(0, 4, 4, 2), mfrow = c(2, 2))

plot(pr.hc.s, labels = FALSE, main = "Single", xlab = "")
plot(pr.hc.c, labels = FALSE, main = "Complete", xlab = "")
plot(pr.hc.a, labels = FALSE, main = "Average", xlab = "")
plot(pr.hc.w, labels = FALSE, main = "Ward", xlab = "")

[Figure: sample dendrograms using single, complete, average, and Ward linkage]

par(op)

# identify 10 clusters
op <- par(mar = c(1, 4, 4, 1))
plot(pr.hc.w, labels = prDes$grp, cex = 0.6, main = "Ward showing 10 clusters")
rect.hclust(pr.hc.w, k = 10)

[Figure: Ward dendrogram with the 10 clusters boxed]

par(op)

When you call heatmap(), it automatically performs hierarchical clustering for you and reorders the rows and/or columns of the data accordingly. Both the reordering and the dendrograms can be suppressed with Rowv = NA and/or Colv = NA.

Note that when you have a lot of genes the row dendrogram is pretty ugly, so row clustering is suppressed for now.

By default, heatmap() uses the hclust() function, which takes a distance matrix calculated by the dist() function (default method = 'euclidean'). However, you can also supply your own clustering and distance functions. In the examples below, I use hclust() with the Ward linkage method and Euclidean distance.
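As a quick sketch of that idea (not part of the original seminar), here is a correlation-based distance supplied through distfun; colours are left at heatmap()'s defaults since the blue palette is only defined further below:

# A custom distance: 1 - Pearson correlation between samples, supplied via
# distfun. heatmap() hands distfun the transposed matrix when clustering
# columns, so cor() here ends up comparing samples.
corDist <- function(x) as.dist(1 - cor(t(x)))
heatmap(as.matrix(sprDat), Rowv = NA, scale = "none", distfun = corDist, 
    hclustfun = function(d) hclust(d, method = "average"), labCol = prDes$grp, 
    labRow = NA, margins = c(8, 1))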

Note that the dendrogram in the top margin of the heatmap is the same as that of the hclust() function.

Exercise: Play with the options of the heatmap function and compare the different heatmaps. Note that one can also use the original data prDat and set the option scale = "row". You will get the same heatmaps although the columns may be ordered differently (use Colv = NA to suppress reordering).
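A minimal sketch of that exercise (not in the original), using the unscaled prDat and letting heatmap() standardize each row itself:

# Unscaled data with scale = 'row'; Colv = NA suppresses column reordering so
# the layout follows the column order of prDat.
heatmap(as.matrix(prDat), Rowv = NA, Colv = NA, scale = "row", labCol = prDes$grp, 
    labRow = NA, margins = c(8, 1))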

Clustering method = “ward”, with colour side bars for genotype and devStage

jBluesFun <- colorRampPalette(brewer.pal(n = 9, "Blues"))
gTypeCols <- brewer.pal(8, "Dark2")[c(1, 2)]
heatmap(as.matrix(sprDat), Rowv = NA, col = jBluesFun(256), hclustfun = function(x) hclust(x, 
    method = "ward"), scale = "none", labCol = prDes$grp, labRow = NA, margins = c(8, 
    1), ColSideColor = gTypeCols[unclass(prDes$gType)])
legend("topright", legend = levels(prDes$gType), col = gTypeCols, lty = 1, lwd = 5, 
    cex = 0.6)

[Figure: heatmap, Ward linkage, genotype colour side bar]

devStageCols <- brewer.pal(8, "Dark2")
heatmap(as.matrix(sprDat), Rowv = NA, col = jBluesFun(256), hclustfun = function(x) hclust(x, 
    method = "ward"), scale = "none", labCol = prDes$grp, labRow = NA, margin = c(8, 
    1), ColSideColor = devStageCols[unclass(prDes$devStage)])
legend("topright", legend = levels(prDes$devStage), col = devStageCols, lty = 1, 
    lwd = 5, cex = 0.6)

[Figure: heatmap, Ward linkage, devStage colour side bar]

Clustering method = “average”, with colour side bars for genotype and devStage

jBluesFun <- colorRampPalette(brewer.pal(n = 9, "Blues"))
gTypeCols <- brewer.pal(8, "Dark2")[c(1, 2)]
heatmap(as.matrix(sprDat), Rowv = NA, col = jBluesFun(256), hclustfun = function(x) hclust(x, 
    method = "average"), scale = "none", labCol = prDes$grp, labRow = NA, margins = c(8, 
    1), ColSideColor = gTypeCols[unclass(prDes$gType)])
legend("topright", legend = levels(prDes$gType), col = gTypeCols, lty = 1, lwd = 5, 
    cex = 0.6)

[Figure: heatmap, average linkage, genotype colour side bar]

devStageCols <- brewer.pal(8, "Dark2")
heatmap(as.matrix(sprDat), Rowv = NA, col = jBluesFun(256), hclustfun = function(x) hclust(x, 
    method = "average"), scale = "none", labCol = prDes$grp, labRow = NA, margin = c(8, 
    1), ColSideColor = devStageCols[unclass(prDes$devStage)])
legend("topright", legend = levels(prDes$devStage), col = devStageCols, lty = 1, 
    lwd = 5, cex = 0.6)

[Figure: heatmap, average linkage, devStage colour side bar]

Clustering method = “single”, with colour side bars for genotype and devStage

jBluesFun <- colorRampPalette(brewer.pal(n = 9, "Blues"))
gTypeCols <- brewer.pal(8, "Dark2")[c(1, 2)]
heatmap(as.matrix(sprDat), Rowv = NA, col = jBluesFun(256), hclustfun = function(x) hclust(x, 
    method = "single"), scale = "none", labCol = prDes$grp, labRow = NA, margins = c(8, 
    1), ColSideColor = gTypeCols[unclass(prDes$gType)])
legend("topright", legend = levels(prDes$gType), col = gTypeCols, lty = 1, lwd = 5, 
    cex = 0.6)

[Figure: heatmap, single linkage, genotype colour side bar]

devStageCols <- brewer.pal(8, "Dark2")
heatmap(as.matrix(sprDat), Rowv = NA, col = jBluesFun(256), hclustfun = function(x) hclust(x, 
    method = "single"), scale = "none", labCol = prDes$grp, labRow = NA, margin = c(8, 
    1), ColSideColor = devStageCols[unclass(prDes$devStage)])
legend("topright", legend = levels(prDes$devStage), col = devStageCols, lty = 1, 
    lwd = 5, cex = 0.6)

[Figure: heatmap, single linkage, devStage colour side bar]

Partitioning methods for photoRec

Note that the results depend on the randomly generated initial values used to create the first k clusters. To get stable, reproducible results, set a seed and use many random starts (see the nstart parameter).

K-means clustering

K-means is a classic clustering method described in Lecture 16. An important observation about k-means is that it cannot determine the number of clusters for you. In fact, doing this automatically is quite hard (though techniques do exist).
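One informal technique, shown here as a small sketch that is not part of the original seminar, is to scan several values of k and look for an "elbow" in the total within-cluster sum of squares:

# Total within-cluster sum of squares for k = 1..10; a kink ('elbow') in this
# curve is one informal way to pick k. nstart is kept small to save time.
set.seed(540)
wss <- sapply(1:10, function(kk) kmeans(t(sprDat), centers = kk, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", 
    ylab = "Total within-cluster sum of squares")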

Here we'll just do a clustering of samples using all genes (~30K):

# Objects in columns
set.seed(31)
k <- 5
pr.km <- kmeans(t(sprDat), centers = k, nstart = 50)

# We can look at the within sum of squares of each cluster
pr.km$withinss
## [1] 120153  78227 110209 100197 133036
# We can look at the composition of each cluster
pr.kmTable <- data.frame(devStage = prDes$devStage, cluster = pr.km$cluster)
prTable <- xtable(with(pr.kmTable, table(devStage, cluster)), caption = "Number of samples from each developmental stage within each k-means cluster")
align(prTable) <- "lccccc"
print(prTable, type = "html", caption.placement = "top")
Number of samples from each developmental stage within each k-means cluster

devStage    1   2   3   4   5
E16         0   0   6   0   1
P2          4   0   0   0   4
P6          5   1   0   0   2
P10         1   2   0   3   2
4_weeks     0   2   1   5   0

Helpful info and tips:

Repeat the analysis using a different seed and check if you get the same clusters.

# Set a different seed and repeat k-means clustering
set.seed(100)
k <- 5
pr.km <- kmeans(t(sprDat), centers = k, nstart = 50)
pr.km$withinss
## [1] 120153 100197 133036  78227 110209
pr.kmTable <- data.frame(devStage = prDes$devStage, cluster = pr.km$cluster)
prTable <- xtable(with(pr.kmTable, table(devStage, cluster)), caption = "Number of samples from each developmental stage within each k-means cluster")
align(prTable) <- "lccccc"
print(prTable, type = "html", caption.placement = "top")
Number of samples from each developmental stage within each k-means cluster

devStage    1   2   3   4   5
E16         0   0   1   0   6
P2          4   0   4   0   0
P6          5   0   2   1   0
P10         1   3   2   2   0
4_weeks     0   5   0   2   1

The cluster labels change with the seed, but the within-cluster sums of squares are the same values in a different order, so the clusters themselves are essentially the same; only the labelling differs.

PAM algorithm
K representative objects (= medoids) are chosen as cluster centers and objects are assigned to the center (= medoid = cluster) with which they have minimum dissimilarity (Kaufman and Rousseeuw, 1990). Nice features of PAM are: (a) it accepts a dissimilarity matrix (use diss = TRUE); (b) it is more robust to outliers because the cluster centers are actual data objects; (c) one can choose the number of clusters by exploring the average silhouette value.

pr.pam <- pam(pr.dis, k = k)
pr.pamTable <- data.frame(devStage = prDes$devStage, cluster = pr.pam$clustering)
pamTable <- xtable(with(pr.pamTable, table(devStage, cluster)), caption = "Number of samples from each developmental stage within each PAM cluster")
align(pamTable) <- "lccccc"
print(pamTable, type = "html", caption.placement = "top")
Number of samples from each developmental stage within each PAM cluster

devStage    1   2   3   4   5
E16         6   1   0   0   0
P2          0   1   7   0   0
P6          3   2   3   0   0
P10         0   2   1   1   4
4_weeks     1   0   1   4   2

Additional information on the PAM result is available through summary(pr.pam).

The silhouette plot
The cluster package contains the function silhouette() that compares the minimum average dissimilarity of each object to other clusters with the average dissimilarity to objects in its own cluster. The resulting measure is called the “width of each object's silhouette”. A value close to 1 indicates that the object is much more similar to objects in its own cluster than to those in other clusters. Thus, the average of all objects' silhouette widths gives an indication of how well the clusters are defined.

op <- par(mar = c(5, 1, 4, 4))
plot(pr.pam, main = "Silhouette Plot for 5 clusters")

[Figure: silhouette plot for the 5 PAM clusters]

par(op)
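To act on the earlier note about choosing the number of clusters from the average silhouette value, here is a minimal sketch (not in the original) that scans several values of k with PAM:

# Average silhouette width of PAM solutions for k = 2..8; larger values
# suggest better-separated clusters.
avgWidth <- sapply(2:8, function(kk) pam(pr.dis, k = kk)$silinfo$avg.width)
plot(2:8, avgWidth, type = "b", xlab = "Number of clusters k", 
    ylab = "Average silhouette width")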

Gene clustering

A different view at the data can be obtained from clustering genes instead of samples. Since clustering genes is slow when you have a lot of genes, for the sake of time we will work with a smaller subset of genes.

In many cases, analysts use cluster analysis to illustrate the results of a differential expression analysis. Sample clustering following a differential expression (DE) analysis will probably show the separation of the groups identified by the DE analysis. Thus, as mentioned in lectures, we need to be careful not to over-interpret this kind of result. However, note that it is valid to perform a gene clustering to see if differentially expressed genes cluster according to their function, subcellular localization, pathways, etc.

A smaller dataset

In Seminar 6: Fitting and interpreting linear models (high volume), you learned how to use limma to fit a common linear model to a very large number of genes and thus identify genes that show differential expression over the course of development.

## [1] 972
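The chunk that produced this count is not echoed here; a hedged reconstruction, assuming the one-way devStage model from Seminar 6, might look like the following (it is a sketch and may not reproduce the exact gene count):

# Fit the devStage model with limma and keep probes whose moderated F-test
# for any devStage effect has BH-adjusted p-value < 1e-05; topDat holds the
# scaled data for those probes.
desMat <- model.matrix(~devStage, prDes)
fit <- eBayes(lmFit(prDat, desMat))
dsHits <- topTable(fit, coef = grep("devStage", colnames(coef(fit))), 
    p.value = 1e-05, n = Inf)
topDat <- sprDat[rownames(dsHits), ]
nrow(topDat)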

We start by using different clustering algorithms to cluster the top 972 genes that showed differential expression across the developmental stages (BH-adjusted p-value < 1e-05).

Hierarchical:

geneC.dis <- dist(topDat, method = "euclidean")
geneC.hc.a <- hclust(geneC.dis, method = "average")
plot(geneC.hc.a, labels = FALSE, main = "Hierarchical with Average Linkage", 
    xlab = "")

[Figure: average-linkage dendrogram of the top 972 genes]

When there are lots of objects to cluster, the dendrograms are in general not very informative as it is difficult to identify any interesting pattern in the data.
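One way to still extract usable groups from such a tree is to cut it at a fixed number of clusters with cutree(); a small sketch (not in the original):

# Cut the average-linkage gene tree into 5 clusters and count how many of the
# 972 genes land in each.
geneC.groups <- cutree(geneC.hc.a, k = 5)
table(geneC.groups)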

Partitioning

The most interesting thing to look at is the cluster centers (basically the “prototype” for the cluster) and membership sizes. Then we can try to visualize the genes that are in each cluster.

Let's visualize a cluster (remember the data were rescaled) using line plots. This makes sense since we also want to be able to see the cluster center.

Improve the plot by adding sample names to the x-axis (e.g., wt_E16_1)
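As a small sketch of one way to build such labels (the sampleLabel column is illustrative, not part of the original data), number the replicates within each genotype/devStage group:

# Labels like wt_E16_1, made by counting replicates within each gType/devStage
# combination; these could replace prDes$grp on the x axis below.
prDes$sampleLabel <- with(prDes, paste(gType, devStage, 
    ave(seq_along(grp), grp, FUN = seq_along), sep = "_"))
head(prDes$sampleLabel)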

set.seed(1234)
k <- 5
kmeans.genes <- kmeans(topDat, centers = k)

# choose which cluster we want
clusterNum <- 1

# Set up the axes without plotting; ylim set based on trial run.
plot(kmeans.genes$centers[clusterNum, ], ylim = c(-4, 4), type = "n", xlab = "", 
    ylab = "Relative expression", axes = FALSE)

# Plot the expression of all the genes in the selected cluster in grey.
matlines(y = t(topDat[kmeans.genes$cluster == clusterNum, ]), col = "grey")

# Add the cluster center. This is last so it isn't underneath the members
points(kmeans.genes$centers[clusterNum, ], type = "l")

# Optional: colored points to show which development stage the samples are
# from.
points(kmeans.genes$centers[clusterNum, ], col = prDes$devStage, pch = 20)

# Add sample labels for the x axis. First check that the sample IDs are in
# the same order, e.g., all(colnames(sprDat) == prDes$sidChar)
axis(side = 2)
# use 'cex.axis' to change font size, 'las' to change text direction
axis(side = 1, at = 1:39, labels = prDes$grp, cex.axis = 0.8, las = 2)

[Figure: genes in k-means cluster 1 (grey) with the cluster center, across the 39 samples]

Or, more commonly, we can see both dendrograms using a heatmap (through hierarchical clustering):

devStageCols <- brewer.pal(8, "Dark2")
heatmap(as.matrix(topDat), col = jBluesFun(256), hclustfun = function(x) hclust(x, 
    method = "average"), labCol = prDes$grp, labRow = NA, margin = c(8, 1), 
    scale = "none", ColSideColor = devStageCols[unclass(prDes$devStage)])
legend("topleft", levels(prDes$devStage), col = devStageCols, lty = 1, lwd = 5, 
    cex = 0.6)

[Figure: heatmap of the top genes with devStage colour side bar]

Redefining the attributes

In the previous example, all the samples were used as attributes to cluster genes. However, we can define different attributes, for example, by estimating parameters of a linear model. Consider:

\[ X_{gi,devStage} = \mu_{g,devStage} + \epsilon_{gi,devStage} \]

Thus, we can define new attributes for each gene, i.e., \[Att_g=(\mu_{g,E16},\mu_{g,P2},\mu_{g,P6},\mu_{g,P10},\mu_{g,4w})\] and estimate these parameters.

annoTopDat <- stack(as.data.frame(topDat))  # stack probe data tall and skinny
annoTopDat$probeset <- rownames(topDat)  # add probeset ID as variable
# get info on gType and devStage, then average over reps within devStage
annoTopDat <- merge(annoTopDat, prDes, by.x = "ind", by.y = "sidChar")
devStageAvg <- ddply(annoTopDat, ~probeset, function(x) {
    avgByDevStage <- aggregate(values ~ devStage, x, mean)$values
    names(avgByDevStage) <- levels(x$devStage)
    avgByDevStage
})
# put probeset info back into rownames
rownames(devStageAvg) <- devStageAvg$probeset
devStageAvg$probeset <- NULL
str(devStageAvg)
## 'data.frame':    972 obs. of  5 variables:
##  $ E16    : num  -0.628 1.235 -0.419 1.401 0.855 ...
##  $ P2     : num  -1.041 0.7 -0.918 0.737 0.74 ...
##  $ P6     : num  -0.214 -0.26 -0.744 -0.66 0.34 ...
##  $ P10    : num  0.722 -0.683 0.553 -0.779 -0.363 ...
##  $ 4_weeks: num  1.083 -0.838 1.475 -0.523 -1.464 ...
heatmap(as.matrix(devStageAvg), Colv = NA, col = jBluesFun(256), hclustfun = function(x) hclust(x, 
    method = "average"), labCol = colnames(devStageAvg), labRow = NA, margin = c(8, 
    1))

[Figure: heatmap of per-gene developmental-stage averages]

We can look at the average expression of genes within a cluster for each developmental stage:

k <- 4
geneDS.km <- kmeans(devStageAvg, centers = k, nstart = 50)
clust.centers <- geneDS.km$centers
# Look at all clusters
op <- par(mfrow = c(2, 2))
for (clusterNum in 1:4) {
    # Set up the axes without plotting; ylim set based on trial run.
    plot(clust.centers[clusterNum, ], ylim = c(-4, 4), type = "n", xlab = "Developmental Stage", 
        ylab = "Relative expression", axes = FALSE, main = paste("Cluster", clusterNum, 
            sep = " "))
    axis(2)
    axis(1, 1:5, c(colnames(clust.centers)[1:4], "4W"), cex.axis = 0.9)

    # Plot the expression of all the genes in the selected cluster in grey.
    matlines(y = t(devStageAvg[geneDS.km$cluster == clusterNum, ]), col = "grey")

    # Add the cluster center. This is last so it isn't underneath the members
    points(clust.centers[clusterNum, ], type = "l")

    # Optional: points to show development stages.
    points(clust.centers[clusterNum, ], pch = 20)
}

[Figure: genes in each of the 4 clusters (grey) with cluster centers, across developmental stages]

par(op)

Or we can compare all clusters' centers.

plot(clust.centers[clusterNum, ], ylim = c(-4, 4), type = "n", xlab = "Developmental Stage", 
    ylab = "Average expression", axes = FALSE, main = "Cluster centers")
axis(2)
axis(1, 1:5, c(colnames(clust.centers)[1:4], "4W"), cex.axis = 0.9)

for (clusterNum in 1:4) {
    points(clust.centers[clusterNum, ], type = "l", col = clusterNum, lwd = 2)
    points(clust.centers[clusterNum, ], col = clusterNum, pch = 20)
}

[Figure: the 4 cluster centers across developmental stages]

We can look at three dimensions of the data and illustrate the clusters determined by k-means. The most interesting analysis is to follow up with a biological interpretation of the clusters. For that, smaller clusters may be easier to interpret.

cloud(devStageAvg[, "E16"] ~ devStageAvg[, "P6"] * devStageAvg[, "4_weeks"], 
    col = geneDS.km$clust, xlab = "E16", ylab = "P6", zlab = "4_weeks", pch = 19, 
    alpha = 0.3)

[Figure: cloud plot of E16 vs P6 vs 4_weeks averages, coloured by cluster]

Statistical measures to evaluate clusters

An important issue for clustering is the question of certainty of the cluster membership. Clustering always gives you an answer, even if there aren't really any underlying clusters. There are many ways to address this. Here we introduce an approachable one offered in R, pvclust, which you can read about at http://www.is.titech.ac.jp/~shimo/prog/pvclust/.

Important: pvclust clusters the columns, i.e., the samples here. I don't recommend doing this for genes: the computation would take a very long time. Even clustering the samples with all ~30K genes as attributes takes a while, so the example below uses the smaller topDat set.

You control how many bootstrap iterations pvclust does with the nboot parameter. We've also noted that pvclust causes problems on some machines, so if you have trouble with it, it's not critical.

pvc <- pvclust(topDat, nboot = 100)
## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.6)... Done.
## Bootstrap (r = 0.7)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.9)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.1)... Done.
## Bootstrap (r = 1.2)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.4)... Done.
plot(pvc, labels = prDes$grp, cex = 0.7)
pvrect(pvc, alpha = 0.95)

[Figure: pvclust dendrogram with AU/BP values and pvrect boxes]
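To pull out the clusters highlighted by pvrect() programmatically, pvpick() applies the same AU p-value threshold; a small sketch:

# Clusters with AU p-value above the 0.95 threshold, as lists of sample labels.
stable <- pvpick(pvc, alpha = 0.95)
str(stable$clusters)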

Principal Components Analysis

In R, we can use prcomp() to perform Principal Components Analysis (PCA). You can also use svd(). The following code reproduces some of the material shown in the PCA/SVD lecture (the genes used are not the same so it won't be exactly equivalent).

Scaling is suppressed because we already scaled the rows. You can experiment with this to see what happens.

pcs <- prcomp(sprDat, center = FALSE, scale = FALSE)

# scree plot
plot(pcs)

[Figure: scree plot of the principal components]
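As mentioned above, svd() gives an equivalent decomposition; a small sketch (not in the original) checking the correspondence with the prcomp() result:

# The right singular vectors match pcs$rotation up to sign flips, and the
# squared singular values divided by (n - 1) match the variances in the scree
# plot (pcs$sdev^2).
s <- svd(sprDat)
all.equal(abs(unname(pcs$rotation)), abs(s$v))
head(s$d^2/(nrow(sprDat) - 1))
head(pcs$sdev^2)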

# append the sample principal components (rows of pcs$rotation, named by
# sample ID) to the phenodata; indexing by sidChar keeps them aligned with prDes
prinComp <- cbind(prDes, pcs$rotation[prDes$sidChar, 1:10])

# scatter plot showing us how the first few PCs relate to covariates
plot(prinComp[, c("sidNum", "devStage", "gType", "PC1", "PC2", "PC3")], pch = 19, 
    cex = 0.8)

[Figure: pairs plot of sidNum, devStage, gType, and PC1-PC3]

# plot data on first two PCs, colored by development stage
plot(prinComp[, c("PC1", "PC2")], bg = prDes$devStage, pch = 21, cex = 1.5)
legend(list(x = 0.2, y = 0.3), as.character(levels(prDes$devStage)), pch = 21, 
    pt.bg = c(1, 2, 3, 4, 5))

[Figure: samples on PC1 vs PC2, coloured by devStage]

It is common to follow up with a cluster analysis on the first 3 principal components to illustrate and explore the data.
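A minimal sketch of such an analysis (assuming, as elsewhere in this seminar, that the sample order in pcs$rotation matches prDes): hierarchical clustering of the samples using only their first 3 PC coordinates.

# Cluster the samples on the first 3 PCs only and label the tree by group.
pcDis <- dist(pcs$rotation[, 1:3])
plot(hclust(pcDis, method = "average"), labels = prDes$grp, cex = 0.6, 
    main = "Average linkage on the first 3 PCs")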

Most of the plots in this Seminar were done with basic R graphics. As an exercise, you can try to create new plots using lattice and/or ggplot2!

Recreate the last PCA plot using ggplot2:

head(prinComp)
##      sidChar sidNum devStage gType       grp       PC1      PC2      PC3
## 12 Sample_20     20      E16    wt    wt.E16  0.027594 -0.04399 -0.03564
## 13 Sample_21     21      E16    wt    wt.E16 -0.087275 -0.07942 -0.03242
## 14 Sample_22     22      E16    wt    wt.E16 -0.009865 -0.05342 -0.01445
## 15 Sample_23     23      E16    wt    wt.E16  0.233631  0.01071 -0.15440
## 9  Sample_16     16      E16 NrlKO NrlKO.E16  0.210762 -0.06027  0.21504
## 10 Sample_17     17      E16 NrlKO NrlKO.E16 -0.174085  0.03883 -0.01670
##        PC4      PC5     PC6       PC7       PC8      PC9      PC10
## 12 -0.3110  0.05909 -0.1028  0.093788  0.004379 -0.10535 -0.008807
## 13 -0.3030  0.24586 -0.1539  0.125438  0.082397 -0.01789 -0.078415
## 14 -0.3406  0.10805 -0.1973  0.119627  0.008837 -0.08359 -0.065474
## 15 -0.1092 -0.15167 -0.1740  0.198213 -0.029549 -0.03911  0.233301
## 9   0.0717  0.04963  0.3438 -0.007746 -0.009464  0.11675  0.265226
## 10  0.1226 -0.10117  0.1105  0.034155  0.388777  0.17308 -0.253600
ggplot(prinComp, aes(PC1, PC2, colour = devStage)) + geom_point(size = 4)

[Figure: ggplot2 scatter of PC1 vs PC2, coloured by devStage]

ggplot(prinComp, aes(PC1, PC2, colour = gType)) + geom_point(size = 4)

[Figure: ggplot2 scatter of PC1 vs PC2, coloured by gType]