Introduction

Principal component analysis (PCA) and Multidimensional scaling (MDS) are common techniques that are used to visualize high-dimentional data. High-dimentional data are data with features (p) a lot more than observations (n).
That is: p>>n

It is very common in genomic data sets where we have tens of thousands of features(genes, DNA methylation regions), but only a handful of samples (genomic are expensive at least for now).

I am going to use a microarray data set to illustrate PCA and MDS, and then show you how to do clustering in R and make pretty heatmaps. It is a pretty old microarray data set, but the skills can be applied to any other high-dimentional genomic data sets. It contains gene expression profile for different cancer types.

Many of the codes are from the R lab of the first big data Summer Institute in the University of Washington. Module 4

If you find anything that I did wrong or explained wrong, please let me know. I am learning also!

Load in the data

# install the package if you do not have it.
# install.packages("ISLR")
library(ISLR)

ncidat = t(NCI60$data)
colnames(ncidat) = NCI60$labs

dim(ncidat)
## [1] 6830   64
unique(colnames(ncidat))
##  [1] "CNS"         "RENAL"       "BREAST"      "NSCLC"       "UNKNOWN"    
##  [6] "OVARIAN"     "MELANOMA"    "PROSTATE"    "LEUKEMIA"    "K562B-repro"
## [11] "K562A-repro" "COLON"       "MCF7A-repro" "MCF7D-repro"

First 6 rows, 6 columns of the data:

ncidat[1:6,1:6]
##      CNS      CNS    CNS         RENAL BREAST    CNS
## 1  0.300 0.679961  0.940  2.800000e-01  0.485  0.310
## 2  1.180 1.289961 -0.040 -3.100000e-01 -0.465 -0.030
## 3  0.550 0.169961 -0.170  6.800000e-01  0.395 -0.100
## 4  1.140 0.379961 -0.040 -8.100000e-01  0.905 -0.460
## 5 -0.265 0.464961 -0.605  6.250000e-01  0.200 -0.205
## 6 -0.070 0.579961  0.000 -1.387779e-17 -0.005 -0.540

PCA - take SVD to get solution.

Singular value decomposition(SVD) is used for PCA.

To get an idea of how svd works, I strongly recommend you to read this and this

Read about PCA and MDS here

We are going to center (x- mean) the same gene from different samples, but not scale (x-mean)/sd them.
That’s why we transpose the matrix (scale works on columns ?scale), and then transpose it back. Be aware of whether you should center/scale your data or not. It all depends on the intrinsic properties of your data and the purpose of your analysis. If your data are in the similar range, you do not have to center it.

Usually for a SVD analysis:
X is a n x p matrix (n rows and p columns)
Xnxp = Unxn %*% Dnxp %*% Vpxp

One has to be aware that in the microarray data, columns are usually samples(observations n), rows are genes(features p).

we need to first transpose X for the microarray data for the svd analysis

X = t(scale(t(ncidat),center=TRUE,scale=FALSE))

Use svd to get the matrice.

One can use the base R function princomp (default center and scale), but svd gives you more controls.

in a svd analysis, a matrix n x p matrix X is decomposed by X = U*D*V:
1.U is an m×n orthogonal matrix.
2.V is an n×n orthogonal matrix.
3.D is an n×n diagonal matrix.

By definition D should be a n x p (64 x 6830) matrix, but in this case, it becomes 64 x 64 matrix.
The reason is that diagonals of D: d1 >= d2 >= d3 >= ….d(r) where r =rank(X), rank(X) <= min(n,p)

See the rank of a matrix
The other ds are all zeros. (no variations after d(r)), that’s why D is dropped to 64 x 64 matrix.

We can check it:

# we transpose X again for svd
sv = svd(t(X))
U = sv$u
V = sv$v
D = sv$d

## in R calculate the rank of a matrix is by
qr(t(X))$rank
## [1] 63
length(D)
## [1] 64
min(D)
## [1] 1.06447e-13
# the last diagnal in D is very small.
# it is very close to 0, it has to do with the precision of the decimals in computer

let’s plot scatter plot between PCs

PC scatterplots

PCs: Z = XV or Z = UD (U are un-scaled PCs)

Some facts of PCA:

k th column of Z, Z(k), is the k th PC.(the k th pattern)

PC loadings: V
k th column of V, V(k) is the k th PC loading (feature weights). aka. the k th column of V encodes the associated k th pattern in feature space.

PC loadings: U
k th column of U, U(k) is the k th PC loading (observation weights). aka. the k th column of U encodes the associated k th pattern in observation space.

Diagnal matrix: D
diagnals in D: d(k) gives the strength of the k th pattern.

Variance explained by k th PC: d(k)^2
Total variance of the data: sum(d(k1)^2 + d(k2)^2 + …..d(k)^2+….)

proportion of variane explained by k th PC: d(k)^2 / sum(d(k1)^2 + d(k2)^2 + …..d(k)^2+….)

Let’s plot U1 vs U2 (plot the loadings). Remember that they are unscaled PCs

cols = as.numeric(as.factor(colnames(ncidat)))

plot(U[,1],U[,2],type="n",xlab="PC1",ylab="PC2")
text(U[,1],U[,2],colnames(X),col=cols)

we see that Melanoma samples are close to each other.

U are un-scaled PCs. We can also plot Z which is scaled PC:
in R Z<- X%*%V or Z<- U %*% diag(D)

par(mfrow=c(1,1))
Z = t(X)%*%V

# plot PC1 vs PC2
plot(Z[,1], Z[,2], type ="n", xlab="PC1", ylab="PC2")
text(Z[,1], Z[,2], colnames(X), col=cols)

It looks much the same as the the figure of U1 vs U2 above using U, but with different scales.

you can also plot PC2 vs PC3

plot(Z[,2], Z[,3], type ="n", xlab = "PC2", ylab="PC3")
text(Z[,2], Z[,3], colnames(X), col=cols)

We can look as many PCs as we like, but we usually stop when the cumulative variance explained by PCs is over ~90% to ~95% of total variance.

Or plot it by ggplot2 for pretty figure:

pc_dat<- data.frame(type = rownames(Z), PC1 = Z[,1], PC2= Z[,2])
library(ggplot2)
ggplot(pc_dat,aes(x=PC1, y=PC2, col=type)) + geom_point() + geom_text(aes(label = type), hjust=0, vjust=0)

## the text is a bit messy, may try packages below or play more with ggplot2
## use directlabels http://directlabels.r-forge.r-project.org/
## use cowplot https://github.com/wilkelab/cowplot

PC loadings - visualize data by limiting to top genes in magnitude in the PC loadings

The matrix V contains the weigths for the features, and we can use V to select important features(genes) that contribute to the each PC.

## get a gradient of colors for grey, green, red.
## one can do better use other libraries such RcolorBrewer. see examples later.

aa<- grep("grey",colors())
bb<- grep("green",colors())
cc<-  grep("red",colors())
gcol2<- colors()[c(aa[1:30],bb[1:20],rep(cc,2))]

## use the genes that drive the first PC1. This is the first major patter in the data
k=1
ord1<- order(abs(V[,k]),decreasing=TRUE)
x1<- as.matrix(X[ord1[1:250],])
heatmap(x1,col=gcol2)

# use the genes that drive the second PC (PC2)
j<- 2
ord<- order(abs(V[,j]),decreasing=TRUE)

## we just use the first 250 features(genes) to plot a heatmap, This is the second major pattern.
x<- as.matrix(X[ord[1:250],])
heatmap(x,col=gcol2)

We find the genes that drive the Melanoma together.

Variance Explained

varex = 0
cumvar = 0
denom = sum(D^2)
for(i in 1:64){
  varex[i] = D[i]^2/denom
  cumvar[i] = sum(D[1:i]^2)/denom
}

## variance explained by each PC cumulatively
cumvar
##  [1] 0.1489294 0.2319364 0.2977720 0.3408323 0.3793002 0.4143671 0.4431287
##  [8] 0.4713030 0.4976867 0.5192567 0.5402571 0.5602793 0.5785838 0.5962576
## [15] 0.6129791 0.6290826 0.6448598 0.6594001 0.6738773 0.6879947 0.7017289
## [22] 0.7146592 0.7272638 0.7391037 0.7508988 0.7619925 0.7726091 0.7829132
## [29] 0.7930745 0.8030158 0.8126016 0.8219866 0.8307461 0.8394120 0.8477786
## [36] 0.8560582 0.8641702 0.8721606 0.8797465 0.8872086 0.8945074 0.9016548
## [43] 0.9086548 0.9154609 0.9221211 0.9284180 0.9346431 0.9405652 0.9460392
## [50] 0.9514103 0.9565874 0.9614860 0.9662284 0.9708055 0.9751019 0.9792776
## [57] 0.9832045 0.9868178 0.9901928 0.9931137 0.9955657 0.9979035 1.0000000
## [64] 1.0000000

screeplot

par(mfrow=c(1,2))
plot(1:64,varex,type="l",lwd=2,xlab="PC",ylab="% Variance Explained")
plot(1:64,cumvar,type="l",lwd=2,xlab="PC",ylab="Cummulative Variance Explained")

Sparse PCA

When p>>n (we have way more genes than the samples), many featuers(genes) are irrelevant. PCA can perform very badly. Sparse PCA zero out irrelevant features from PC loadings. The advantage is that we can find important features that contribute to major patterns. Tipically, opitmize PCA criterion with sparsity-encouraging penalty of matrix V. It is an active area of research!

In other words, sparse PCA does feature selection for us, it retains only the features that drive the major pattern in the data. Again, we may or may not need to do sparse PCA to do feature selection. We can use some pre-knowledges, say, oncogenes and tumor-suppressors, differentially expressed genes across observations and most variable genes across samples etc.

library("PMA")
## Loading required package: plyr
## Loading required package: impute
## we also look at the first 4 PCs
spc = SPC(t(X),sumabsv=10,K=4)
## 1234567891011121314151617181920
## 1234567891011121314151617181920
## 1234567891011121314151617181920
## 1234567891011121314151617181920
#how many genes selected? we look at the V matrix again, if the weights are zeros, they are not important features. sparse PCA zero them out. For each of the four PCs, how many features are retained?
apply(spc$v!=0, 2, sum)
## [1] 173 231 256 186
#PC scatterplots
cols = as.numeric(as.factor(colnames(ncidat)))
K = 3
pclabs = c("SPC1","SPC2","SPC3","SPC4")
par(mfrow=c(1,K))
for(i in 1:K){
  j = i+1
  plot(spc$u[,i],spc$u[,j],type="n",xlab=pclabs[i],ylab=pclabs[j])
  text(spc$u[,i],spc$u[,j],colnames(X),col=cols)
}

#SPC loadings - visualize data by limiting to genes selected by the sparse PC loadings
aa = grep("grey",colors())
bb = grep("green",colors())
cc = grep("red",colors())
gcol2 = colors()[c(aa[1:30],bb[1:20],rep(cc,2))]

j = 1
ind = which(spc$v[,j]!=0)
x = as.matrix(X[ind,])
heatmap(x,col=gcol2)

It looks different from the heatmap we get using the top 250 features that drive PC1 in our regular PCA, becausein this case, we are not selecting the top 250, rather using only the vs (173 features) that are not zeros. Select different features for heatmap, it will look different.

length(ind)
173 features.

variance explained

spc$prop.var.explained
## [1] 0.04219442 0.07099912 0.09435261 0.11683447

Using MDS (multidimentional scaling )

To perform MDS analysis. we need a measure for similarities or a distance matrix. One can try many distance matrix, ?dist to see the help. eucledian is the default, but it usually does not perform well. try all of them if you can: “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”. Read this to understand the mathematial definition of distance.

MDS is different from PCA in that:
1. it is non-linear.
2. visualizing the proximitites, only need dissimilarities. aka, it does not need the original data. It only needs the distances between the data points. Imagine I only tell you the distances between each of the cities (Boson, LA, Miami, Houston, Seattle ….), you can reconstruct a map of many cities based on only the distances.

Both MDS and PCA are for dimension reduction for visualization.

# default is educledian
d<- dist(t(X))
mds<- cmdscale(d, k=2)
par(mfrow=c(1,1))
plot(mds[,1], mds[,2], type="n", main = "eucledian MDS")
text(mds[,1], mds[,2], rownames(mds), col=cols)

## use manhattan distance matrix
d<- dist(t(X), method = "manhattan")
mds<- cmdscale(d, k=2)
plot(mds[,1], mds[,2], type="n", main = "manhattan MDS")
text(mds[,1], mds[,2], rownames(mds), col=cols)

# use minkowski distance
d<- dist(t(X), method = "minkowski")
mds<- cmdscale(d, k=2)
plot(mds[,1], mds[,2], type="n", main = "minkowski MDS")
text(mds[,1], mds[,2], rownames(mds), col=cols)

These three distance matrix generate comparable MDS plot. We can clearly see Melanoma samples are close to each other. Compare with the PCA analysis, they yeild similar results.

d<- dist(t(X), method = "canberra")
mds<- cmdscale(d, k=2)
plot(mds[,1], mds[,2], type="n", main = "canberra MDS")
text(mds[,1], mds[,2], rownames(mds), col=cols)

It looks like canberra distance is very bad for this particular dataset, but it may be useufl for other data sets.

K-means clustering.

Read about K-means, hierachinal clustering, distance matrix and different linkages here

One needs to specify a K (how many clusters you want).

K = 9
km = kmeans(t(ncidat),centers=K)

#PCA - take SVD to get solution
#center genes, but don't scale. SVD analysis

X = t(scale(t(ncidat),center=TRUE,scale=FALSE))
sv = svd(t(X))
U = sv$u
V = sv$v
D = sv$d
Z = t(X)%*%V

# how do we visualize K-means results?
## overlay K-means result on the PCA plot.

par(mfrow=c(1,1))
plot(Z[,1],Z[,2],col=km$cluster,type="n")
text(Z[,1],Z[,2],colnames(ncidat),cex=.75,col=km$cluster)
cens = km$centers
points(cens%*%V[,1],cens%*%V[,2],col=1:K,pch=16,cex=3)

Re-run K-means and see solution changes! set.seed() if you want reproducible result!

k-means initialize each observation i to a cluster assignment k randomly.

K = 9
km = kmeans(t(ncidat),centers=K)
plot(Z[,1],Z[,2],col=km$cluster,type="n")
text(Z[,1],Z[,2],colnames(ncidat),cex=.75,col=km$cluster)
cens = km$centers
points(cens%*%V[,1],cens%*%V[,2],col=1:K,pch=16,cex=3)

Try different K

K = 5
km = kmeans(t(ncidat),centers=K)
plot(Z[,1],Z[,2],col=km$cluster,type="n")
text(Z[,1],Z[,2],colnames(ncidat),cex=.75,col=km$cluster)
cens = km$centers
points(cens%*%V[,1],cens%*%V[,2],col=1:K,pch=16,cex=3)

Hierarchical clustering

In addition to the distance matrix, we need to define how to calculate the distance between two sets when the points keep merging each other. ?hclust see all the methods: “ward.D”, “ward.D2”, “single”, “complete”, “average” ….

require("ISLR")
ncidat<- t(NCI60$data)
colnames(ncidat)<- NCI60$labs

dim(ncidat)
## [1] 6830   64
unique(colnames(ncidat))
##  [1] "CNS"         "RENAL"       "BREAST"      "NSCLC"       "UNKNOWN"    
##  [6] "OVARIAN"     "MELANOMA"    "PROSTATE"    "LEUKEMIA"    "K562B-repro"
## [11] "K562A-repro" "COLON"       "MCF7A-repro" "MCF7D-repro"

Complete linakge - Euclidean distance

Maximum dissimilarity between points in two sets used to determine which two sets should be merged.
Often gives comparable cluster sizes.
Less sensitive to outliers.
Works better with spherical distributions.

Note that for clustering samples, we did not do any scaling for the data.

cols = as.numeric(as.factor(colnames(ncidat)))
Dmat = dist(t(ncidat))
com.hclust = hclust(Dmat,method="complete")
plot(com.hclust,cex=.7,main="Complete Linkage")

# library(devtools) # get from CRAN with install.packages("devtools")
# install_github("ririzarr/rafalib")
library(rafalib)
## Loading required package: RColorBrewer
# use function in rafalib to plot colored dendragram
myplclust(com.hclust, labels=colnames(ncidat), lab.col=as.fumeric(colnames(ncidat)), main = "complete linkage")

Single linkage

Minimum dissimilarity between points in two sets used to determine which two sets should be merged.
Can handle diverse shapes.
Very sensitive to outliers or noise.
Often results in unbalanced clusters.
Extended, trailing clusters in which observations fused one at a time -chaining.

sing.hclust = hclust(Dmat,method="single")
myplclust(sing.hclust, labels=colnames(ncidat), lab.col=as.fumeric(colnames(ncidat)), main = "single linkage")

Average linkage

Average dissimilarity between points in two sets used to determine
which two sets should be merged.
A compromise between single and complete linkage.
Less sensitive to outliers.
Works better with spherical distributions.

Similar linkage: Ward’s linkage. Join objects that minimize Euclidean distance / average Euclidean distance

ave.hclust = hclust(Dmat,method="average")
myplclust(ave.hclust, labels=colnames(ncidat), lab.col=as.fumeric(colnames(ncidat)), main = "average linkage eucledian distance")

Ward’s linkage

ward.hclust = hclust(Dmat,method="ward.D")
myplclust(ward.hclust, labels=colnames(ncidat), lab.col=as.fumeric(colnames(ncidat)), main = "ward linkage eucledian distance")

Cut the tree

ward.hclust<- hclust(Dmat,method="ward.D")
myplclust(ward.hclust, labels=colnames(ncidat), lab.col=as.fumeric(colnames(ncidat)), main = "ward linkage eucledian distance")

names(ward.hclust)
## [1] "merge"       "height"      "order"       "labels"      "method"     
## [6] "call"        "dist.method"
abline(h=120)
rect.hclust(ward.hclust,h=120)

cl<- cutree(ward.hclust, h= 120)
table(type=colnames(X), clusters=cl)
##              clusters
## type          1 2 3 4 5 6 7 8 9
##   BREAST      2 0 1 0 0 0 0 2 2
##   CNS         5 0 0 0 0 0 0 0 0
##   COLON       0 0 0 0 0 0 7 0 0
##   K562A-repro 0 0 0 0 0 1 0 0 0
##   K562B-repro 0 0 0 0 0 1 0 0 0
##   LEUKEMIA    0 0 0 0 5 1 0 0 0
##   MCF7A-repro 0 0 0 0 0 0 0 1 0
##   MCF7D-repro 0 0 0 0 0 0 0 1 0
##   MELANOMA    1 0 0 0 0 0 0 0 7
##   NSCLC       2 0 1 6 0 0 0 0 0
##   OVARIAN     0 0 1 5 0 0 0 0 0
##   PROSTATE    0 0 0 2 0 0 0 0 0
##   RENAL       1 7 1 0 0 0 0 0 0
##   UNKNOWN     0 0 1 0 0 0 0 0 0

Complete linkage with different distances.

Dmat = dist(t(ncidat),method="manhattan") #L1 distance
com.hclust = hclust(Dmat,method="complete")
myplclust(com.hclust, labels=colnames(ncidat), lab.col=as.fumeric(colnames(ncidat)), main = "complete linkage- L1 distance")

We can try all different combinations of distance matrix and different linkages. Never use eucledian distances! one can also use 1- cor(X) as a distance measure! It is commonly used in the clustering of gene expression. Also, use either average linkage or Ward’s linkage.

Ward’s linakge for manhattan distance

Dmat = dist(t(ncidat),method="manhattan")
ward.hclust = hclust(Dmat,method="ward.D")
myplclust(ward.hclust, labels=colnames(ncidat), lab.col=as.fumeric(colnames(ncidat)), main = "ward linkage-L1 distance")

Ward’s linakge for 1- cor(X) distance

?cor calculate correlation between columns of a matrix. Do not need to transpose the matrix for calculating the distances between samples(columns).

Dmat = as.dist(1-cor(ncidat))
ward.hclust = hclust(Dmat,method="ward.D")
myplclust(ward.hclust, labels=colnames(ncidat), lab.col=as.fumeric(colnames(ncidat)), main = "ward linkage-1-cor(X) distance")

Bi-clustering

In the above clustering examples, we use all the features(genes) to cluster samples. we are going to do biclustering (cluster both features and samples) and make a Heatmap. There are too many features, in order to cluster rows (genes/features), we need to filter the features first.

require("ISLR")
ncidat = t(NCI60$data)
colnames(ncidat) = NCI60$labs

#filter genes using PCA
# scale function scales the columns of a numeric matrix
X = t(scale(t(ncidat),center=TRUE,scale=FALSE))
sv = svd(t(X));
V = sv$v

#PC loadings - visualize data by limiting to top genes in magnitude in the PC loadings
## get some colors
library(RColorBrewer)
hmcols<- colorRampPalette(brewer.pal(9,"GnBu"))(100)

## use feature weigths for the first PC (PC1)
j = 1
ord = order(abs(V[,j]),decreasing=TRUE)
x = as.matrix(X[ord[1:250],])

# the default is eucledian distance and complete linage for both rows and columns
heatmap(x,col=hmcols,hclustfun=function(x)hclust(x,method="ward.D"))

## we can also use weights for the second PC (PC2)
j = 2
ord = order(abs(V[,j]),decreasing=TRUE)
x = as.matrix(X[ord[1:250],])

#cluster heatmap - uses Ward's linkage (complete is default)

heatmap(x,col=hmcols,hclustfun=function(x)hclust(x,method="ward.D"))

Or we can select genes that are most variable across samples, check the genefilter bioconductor package. For gene-expression data, we can select top features(genes) that are differentially expressed.

library(genefilter)
## 
## Attaching package: 'genefilter'
## 
## The following object is masked from 'package:base':
## 
##     anyNA
rv<- rowVars(X)
## select the top 250 most variable genes for clustering
idx<- order(-rv)[1:250]
heatmap(X[idx,], col=hmcols,hclustfun=function(x)hclust(x,method="ward.D"))

Because using PC loadings and rowVars select different features, the heatmaps look different.

use heatmap.2

There are many other packages give better control of the heatmap. One can also check other packages: heatmap.3, pheatmap, gapmap and d3heatmap!!!!

Let’s use heatmap.2

library(RColorBrewer)
library(gplots)
## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:stats':
## 
##     lowess
hmcols<- colorRampPalette(brewer.pal(9,"GnBu"))(100)
heatmap.2(x, hclustfun=function(x)hclust(x,method="ward.D"), col=hmcols, trace= "none", main = "eucledian distance")

It produced the same heatmap as heatmap function, but now the heatmap can be resized if you drag the window of the picture. In addition, it adds a color key on the top left. ?heatmap.2 to see many other options.

we can also change the distance matrix.
Let’s try manhattan distance.

library(RColorBrewer)
library(gplots)
hmcols<- colorRampPalette(brewer.pal(9,"GnBu"))(100)

heatmap.2(x, distfun=function(x) dist(x, method='manhattan'), hclustfun=function(x)hclust(x,method="ward.D"), col=hmcols, trace= "none", main = "manhattan distance")

In this case, the eucledian distance performed OK, but usually I avoid using eucledian distance.

Let’s use correlation distance.

heatmap.2(x, distfun=function(x) as.dist(1-cor(t(x))), hclustfun=function(x)hclust(x,method="ward.D"), col=hmcols, trace= "none", main = "correaltion distance")

add column sidebars.

library(gplots)
# map color to the same cancer type.
## colors avaiable in the package 
display.brewer.all()

# qualitative color scale for side bars, can also use:
# cols<- sample(colors(), 14, replace = F)[as.numeric(as.factor(colnames(ncidat)))]
# the following gives you more control of the color

cols1<- palette(brewer.pal(8, "Dark2"))
cols2<- palette(brewer.pal(12, "Paired"))

cols<- c(cols1, cols2)[as.numeric(as.factor(colnames(ncidat)))]

cbind(colnames(x), cols)  # check which color maps to different cancer types
##                     cols     
##  [1,] "CNS"         "red"    
##  [2,] "CNS"         "red"    
##  [3,] "CNS"         "red"    
##  [4,] "RENAL"       "#66A61E"
##  [5,] "BREAST"      "black"  
##  [6,] "CNS"         "red"    
##  [7,] "CNS"         "red"    
##  [8,] "BREAST"      "black"  
##  [9,] "NSCLC"       "#D95F02"
## [10,] "NSCLC"       "#D95F02"
## [11,] "RENAL"       "#66A61E"
## [12,] "RENAL"       "#66A61E"
## [13,] "RENAL"       "#66A61E"
## [14,] "RENAL"       "#66A61E"
## [15,] "RENAL"       "#66A61E"
## [16,] "RENAL"       "#66A61E"
## [17,] "RENAL"       "#66A61E"
## [18,] "BREAST"      "black"  
## [19,] "NSCLC"       "#D95F02"
## [20,] "RENAL"       "#66A61E"
## [21,] "UNKNOWN"     "#E6AB02"
## [22,] "OVARIAN"     "#7570B3"
## [23,] "MELANOMA"    "#1B9E77"
## [24,] "PROSTATE"    "#E7298A"
## [25,] "OVARIAN"     "#7570B3"
## [26,] "OVARIAN"     "#7570B3"
## [27,] "OVARIAN"     "#7570B3"
## [28,] "OVARIAN"     "#7570B3"
## [29,] "OVARIAN"     "#7570B3"
## [30,] "PROSTATE"    "#E7298A"
## [31,] "NSCLC"       "#D95F02"
## [32,] "NSCLC"       "#D95F02"
## [33,] "NSCLC"       "#D95F02"
## [34,] "LEUKEMIA"    "magenta"
## [35,] "K562B-repro" "cyan"   
## [36,] "K562A-repro" "blue"   
## [37,] "LEUKEMIA"    "magenta"
## [38,] "LEUKEMIA"    "magenta"
## [39,] "LEUKEMIA"    "magenta"
## [40,] "LEUKEMIA"    "magenta"
## [41,] "LEUKEMIA"    "magenta"
## [42,] "COLON"       "green3" 
## [43,] "COLON"       "green3" 
## [44,] "COLON"       "green3" 
## [45,] "COLON"       "green3" 
## [46,] "COLON"       "green3" 
## [47,] "COLON"       "green3" 
## [48,] "COLON"       "green3" 
## [49,] "MCF7A-repro" "yellow" 
## [50,] "BREAST"      "black"  
## [51,] "MCF7D-repro" "gray"   
## [52,] "BREAST"      "black"  
## [53,] "NSCLC"       "#D95F02"
## [54,] "NSCLC"       "#D95F02"
## [55,] "NSCLC"       "#D95F02"
## [56,] "MELANOMA"    "#1B9E77"
## [57,] "BREAST"      "black"  
## [58,] "BREAST"      "black"  
## [59,] "MELANOMA"    "#1B9E77"
## [60,] "MELANOMA"    "#1B9E77"
## [61,] "MELANOMA"    "#1B9E77"
## [62,] "MELANOMA"    "#1B9E77"
## [63,] "MELANOMA"    "#1B9E77"
## [64,] "MELANOMA"    "#1B9E77"
## notice that matrix rows of x are centered  
heatmap.2(x, distfun=function(x) as.dist(1-cor(t(x))), hclustfun=function(x)hclust(x,method="ward.D"), trace="none", ColSideColors=cols, col=hmcols, labCol=colnames(x), labRow = F, margins = c(6,6), density.info="none", main = "centered and correlation distance" )

Also notice that we are not standarizing each row (feature/gene) across different samples. See the color key. We can do it by using scale on matrix x first before feeding into the heatmap.2 function or we can specify scale = "row" in the heatmap.2 function to get a Z-score.

However,one needs to be aware that in the heatmap.2 function, clustering is performed before scaling.

“The defaults of almost every heat map function in R does the hierarchical clustering first, then scales the rows then displays the image”

biostars post1
biostars post2

scale the rows:

heatmap.2(x, distfun=function(x) as.dist(1-cor(t(x))), hclustfun=function(x)hclust(x,method="ward.D"),trace="none", scale = "row", ColSideColors=cols, col=hmcols, labCol=colnames(ncidat), labRow = F, margins = c(6,6), density.info="none", main = "scaled gene and correlation distance")

After scaling rows, the pattern looks the same, but the color is bit different after scaling

Add RowSidebar

# assign the output of heatmap.2 to a variable hm
hm<- heatmap.2(x, distfun=function(x) as.dist(1-cor(t(x))), hclustfun=function(x)hclust(x,method="ward.D"),trace="none", scale = "row", ColSideColors=cols, col=hmcols, labCol=colnames(ncidat), labRow = F, margins = c(6,6), density.info="none", main = "scaled gene and correlation distance")

names(hm)
##  [1] "rowInd"        "colInd"        "call"          "rowMeans"     
##  [5] "rowSDs"        "carpet"        "rowDendrogram" "colDendrogram"
##  [9] "breaks"        "col"           "colorTable"
#return the maxtrix returned after clustering as in the heatmap
m.afterclust<- x[rev(hm$rowInd),rev(hm$colInd)]

# to extract subgroups that are clustered together
# rowDendrogram is a list object 
labels(hm$rowDendrogram[[1]])
##   [1] "2317" "2320" "2025" "6348" "208"  "6347" "6346" "3438" "6738" "5631"
##  [11] "6351" "6390" "6357" "6356" "6354" "6355" "6353" "5586" "5816" "5921"
##  [21] "5841" "5843" "5842" "5791" "5957" "5945" "5913" "5894" "6084" "5833"
##  [31] "5942" "5941" "5540" "5590" "5973" "6322" "6331" "6278" "5587" "5351"
##  [41] "5344" "5796" "5556" "5555" "5557" "5931" "5930" "5920" "6039" "6387"
##  [51] "5558" "5559" "5560" "5561" "5564" "5565" "5566" "5758" "5589" "6535"
##  [61] "196"  "6389" "325"  "78"   "281"  "6604" "329"  "6079" "326"  "328" 
##  [71] "327"  "6392" "6393" "6391" "6420" "6419" "5142" "6404" "6268" "6705"
##  [81] "5729" "6416" "6414" "6415" "254"  "255"  "246"  "253"  "252"  "251" 
##  [91] "269"  "248"  "262"  "247"  "261"  "237"  "234"  "257"  "256"  "264" 
## [101] "258"  "1837" "274"  "214"  "273"  "224"  "266"  "265"  "336"  "337" 
## [111] "6564" "245"  "244"  "340"  "339"  "897"  "213"  "188"  "3034" "1012"
## [121] "241"  "242"  "243"  "112"  "113"  "5338" "5336" "95"   "92"   "6349"
## [131] "276"  "286"  "287"  "128"  "127"  "134"
labels(hm$rowDendrogram[[2]][[2]])
##  [1] "4154" "4153" "4291" "4397" "4362" "5242" "4385" "3863" "3864" "4759"
## [11] "2758" "4228" "5472" "4715" "4716" "2068" "4701" "4700" "4699" "4295"
## [21] "4361" "4294" "4278" "4279" "4039" "4334" "4298" "4355" "4383" "4187"
## [31] "2734" "4184" "4347" "4359" "4384" "4340" "4341" "4095" "4096" "4094"
## [41] "4093" "5661" "5804" "4289" "4288" "6236" "4259" "5467" "4348" "4344"
#Separating clusters
#convert the rowDendrogram to a hclust object
hc.rows<- as.hclust(hm$rowDendrogram)
hc.cols<- as.hclust(hm$colDendrogram)

table(type=colnames(X), clusters=cutree(hc.cols, k=9))
##              clusters
## type          1 2 3 4 5 6 7 8 9
##   BREAST      0 3 0 0 0 0 0 2 2
##   CNS         2 3 0 0 0 0 0 0 0
##   COLON       0 0 0 0 7 0 0 0 0
##   K562A-repro 0 0 0 0 0 0 1 0 0
##   K562B-repro 0 0 0 0 0 0 1 0 0
##   LEUKEMIA    0 0 0 0 0 1 5 0 0
##   MCF7A-repro 0 0 0 0 0 0 0 1 0
##   MCF7D-repro 0 0 0 0 0 0 0 1 0
##   MELANOMA    0 1 0 0 0 0 0 0 7
##   NSCLC       0 2 3 0 0 4 0 0 0
##   OVARIAN     0 0 1 0 1 4 0 0 0
##   PROSTATE    0 0 1 0 1 0 0 0 0
##   RENAL       0 1 1 7 0 0 0 0 0
##   UNKNOWN     0 0 1 0 0 0 0 0 0
names(hc.rows)
## [1] "merge"       "height"      "order"       "labels"      "call"       
## [6] "method"      "dist.method"
plot(hc.rows)  # rotate the dendrogram 90 degree, it is the same as in the heatmap

rect.hclust(hc.rows,h=5)

ct<- cutree(hc.rows,h=5)

# get the members of each subgroup in the order of the cluster(left--->right), the row order will
# be reversed compared to the heatmap.

# ct[hc.rows$order]

table(ct)
## ct
##   1   2   3 
##  52 114  84
# get the matrix after clustering in the order of the heatmap (up--->down)

tableclustn<-  data.frame(m.afterclust, cluster = rev(ct[hc.rows$order]))

# remake the heatmap adding the RowSide bar based on the subgroups

mycolhc<- palette(brewer.pal(8, "Dark2"))
mycolhc<-mycolhc[as.vector(ct)]

heatmap.2(x, distfun=function(x) as.dist(1-cor(t(x))), hclustfun=function(x)hclust(x,method="ward.D"),trace="none", scale = "row", ColSideColors=cols, RowSideColors = mycolhc, col=hmcols, labCol=colnames(ncidat), labRow = F, margins = c(6,6), density.info="none", main = "scaled gene and correlation distance")

A different way to make a similar heatmap.

Use the dendrogram object and feed into the Rowv and Colv arguments.

## cluster for the rows (genes), we do not need to transpose ncidat 
## but if you are clustering columns, hclust(dist(t(ncidat)))

## we use weights for the second PC (PC2)
j = 2
ord = order(abs(V[,j]),decreasing=TRUE)
x = as.matrix(X[ord[1:250],])

hc.cols<- hclust(as.dist(1-cor(x)), method = "ward.D")  # cluster the samples
hc.rows<- hclust(as.dist(1-cor(t(x))), method = "ward.D") # cluster the genes
table(type=colnames(X), clusters=cutree(hc.cols, k=9))
##              clusters
## type          1 2 3 4 5 6 7 8 9
##   BREAST      0 3 0 0 0 0 0 2 2
##   CNS         2 3 0 0 0 0 0 0 0
##   COLON       0 0 0 0 7 0 0 0 0
##   K562A-repro 0 0 0 0 0 0 1 0 0
##   K562B-repro 0 0 0 0 0 0 1 0 0
##   LEUKEMIA    0 0 0 0 0 1 5 0 0
##   MCF7A-repro 0 0 0 0 0 0 0 1 0
##   MCF7D-repro 0 0 0 0 0 0 0 1 0
##   MELANOMA    0 1 0 0 0 0 0 0 7
##   NSCLC       0 2 3 0 0 4 0 0 0
##   OVARIAN     0 0 1 0 1 4 0 0 0
##   PROSTATE    0 0 1 0 1 0 0 0 0
##   RENAL       0 1 1 7 0 0 0 0 0
##   UNKNOWN     0 0 1 0 0 0 0 0 0
names(hc.cols)
## [1] "merge"       "height"      "order"       "labels"      "method"     
## [6] "call"        "dist.method"
hc.cols$labels  #the original label from the maxtrix x
##  [1] "CNS"         "CNS"         "CNS"         "RENAL"       "BREAST"     
##  [6] "CNS"         "CNS"         "BREAST"      "NSCLC"       "NSCLC"      
## [11] "RENAL"       "RENAL"       "RENAL"       "RENAL"       "RENAL"      
## [16] "RENAL"       "RENAL"       "BREAST"      "NSCLC"       "RENAL"      
## [21] "UNKNOWN"     "OVARIAN"     "MELANOMA"    "PROSTATE"    "OVARIAN"    
## [26] "OVARIAN"     "OVARIAN"     "OVARIAN"     "OVARIAN"     "PROSTATE"   
## [31] "NSCLC"       "NSCLC"       "NSCLC"       "LEUKEMIA"    "K562B-repro"
## [36] "K562A-repro" "LEUKEMIA"    "LEUKEMIA"    "LEUKEMIA"    "LEUKEMIA"   
## [41] "LEUKEMIA"    "COLON"       "COLON"       "COLON"       "COLON"      
## [46] "COLON"       "COLON"       "COLON"       "MCF7A-repro" "BREAST"     
## [51] "MCF7D-repro" "BREAST"      "NSCLC"       "NSCLC"       "NSCLC"      
## [56] "MELANOMA"    "BREAST"      "BREAST"      "MELANOMA"    "MELANOMA"   
## [61] "MELANOMA"    "MELANOMA"    "MELANOMA"    "MELANOMA"
hc.cols$order
##  [1] 56 59 57 58 62 60 61 63 64 16 14 15 11 12 13 17 54 55 21 22 10 20 30
## [24]  1  2  8  3  7  9  4  5  6 23 18 19 35 36 37 41 38 39 40 33 31 32 25
## [47] 26 34 53 27 28 52 49 50 51 43 46 47 48 42 44 45 24 29
#print the row labels in the order they appear in the tree
hc.cols$labels[hc.cols$order]  
##  [1] "MELANOMA"    "MELANOMA"    "BREAST"      "BREAST"      "MELANOMA"   
##  [6] "MELANOMA"    "MELANOMA"    "MELANOMA"    "MELANOMA"    "RENAL"      
## [11] "RENAL"       "RENAL"       "RENAL"       "RENAL"       "RENAL"      
## [16] "RENAL"       "NSCLC"       "NSCLC"       "UNKNOWN"     "OVARIAN"    
## [21] "NSCLC"       "RENAL"       "PROSTATE"    "CNS"         "CNS"        
## [26] "BREAST"      "CNS"         "CNS"         "NSCLC"       "RENAL"      
## [31] "BREAST"      "CNS"         "MELANOMA"    "BREAST"      "NSCLC"      
## [36] "K562B-repro" "K562A-repro" "LEUKEMIA"    "LEUKEMIA"    "LEUKEMIA"   
## [41] "LEUKEMIA"    "LEUKEMIA"    "NSCLC"       "NSCLC"       "NSCLC"      
## [46] "OVARIAN"     "OVARIAN"     "LEUKEMIA"    "NSCLC"       "OVARIAN"    
## [51] "OVARIAN"     "BREAST"      "MCF7A-repro" "BREAST"      "MCF7D-repro"
## [56] "COLON"       "COLON"       "COLON"       "COLON"       "COLON"      
## [61] "COLON"       "COLON"       "PROSTATE"    "OVARIAN"
## plot the cluster, and validate the order of the labels
myplclust(hc.cols, labels=colnames(ncidat), lab.col=as.fumeric(colnames(ncidat)), main = "1- cor(x) distance")

rect.hclust(hc.rows, h= 2.5)

cutree(hc.rows,h=2.5)
##  256  286  252 4327  196 4320  243 4354 4306 4280  257 4304 4308 6393 4288 
##    1    2    1    3    4    3    1    5    5    5    1    5    5    4    6 
## 4353 4305 4341  251 4296 6391  248 5557  245  254 4344 5586 4387  242  244 
##    5    5    6    1    5    4    1    7    1    1    6    7    5    1    1 
## 6356  287 2025 4184 5556  247 6392 4289 6415 4388 4321 5843 5796 4347 5561 
##    8    2    8    6    7    1    4    6    4    5    3    7    7    6    7 
## 4295 5931  264  255 4094 5555 4355 5559 3438 6355 4264 4322 4307  241 4340 
##    6    7    1    1    6    7    6    7    8    8    5    3    5    1    6 
## 6354 4383 4326 5930 6416 2758 4154 6348  273 4297 5590 4317 4376 4093 4384 
##    8    6    3    7    4    9    9    8    1    5    7    3    3    6    6 
## 4348 4426 4701 4700 4460 4375 4362 4060 5921  253 4061 5472  336 4062  188 
##    6    5    9    9    3    3    9    3    7    1    3    9    1    3    1 
## 2734 6705 6351 5344 5941 6419 5842  113  339  134 4424  208 5758 5942 4302 
##    6    4    8    7    7    4    7    2    1    2    5    8    4    7    5 
## 4241 4282 5565  246  237 5558 5467 4244 5791 4313 6084 4325 5804 5336  261 
##    3    5    7    1    1    7    6    5    7    3    7    3    6    2    1 
## 5841 4059 4218 4187 4310  274 4461  213  337 5920 6357 4334 2068  328 5587 
##    7    3    3    6    5    1    3    1    1    7    8    6    9    4    7 
## 4291  112 3977 6039 4699 6535 6349  325  262 5913 4095 4473  258  329 5564 
##    9    2    5    7    9    4    2    4    1    7    6    3    1    4    7 
## 4228 5661  224  340 6604 4311 4299 4259 6079  266 5833  897 3960 4298 6268 
##    9    6    1    1    4    5    5    6    4    1    7    1    5    6    4 
## 5973 4173 5894 6414 2317 5540 6564 4314 4279 4309 4300  265   78 6347 4338 
##    7    3    7    4    8    7    1    3    6    5    5    1    4    8    5 
##   95 4318 3034 4312 4303 6387 4089 6278  128 4385 2320 5338 3864 4407 4716 
##    2    3    1    3    5    7    5    7    2    9    8    2    9    5    9 
## 5816  214 5351   92 4361 6236 4715  326 4278 4186 1012  281 6322 4153 5142 
##    7    1    7    2    6    6    9    4    6    3    1    4    7    9    4 
## 6404 1837 5242  276 4759 4301 5957 5729 4039 5589 5560  127 4085 5945 4328 
##    4    1    9    2    9    5    7    4    6    4    7    2    5    7    3 
## 5566 4294 4110  269 4330 4324  234 4359  327 4172 4339 6346 6420 4474 3863 
##    7    6    5    1    3    3    1    6    4    3    5    8    4    3    9 
## 6738 5631 4425 6331 6389 4397 6353 6390 4243 4096 
##    8    8    5    7    4    9    8    8    5    6
ct<- cutree(hc.rows,k=3)
#get the members' names of each clusters

head(ct)
##  256  286  252 4327  196 4320 
##    1    1    1    2    3    2
table(ct)
## ct
##   1   2   3 
##  52 114  84
# sort(ct)

# split(names(ct),ct)

mycolhc<- palette(brewer.pal(8, "Dark2"))
mycolhc<-mycolhc[as.vector(ct)]

Add the RowSidebar

rowDend<- rev(as.dendrogram(hc.rows))
colDend<- rev(as.dendrogram(hc.cols))

heatmap.2(x, Rowv = rowDend, Colv = colDend, trace="none", scale = "row", ColSideColors=cols, RowSideColors= mycolhc, col=hmcols, labCol=colnames(ncidat), labRow = F, margins = c(6,6), density.info="none", main = "scaled gene and correlation distance")

## I am not sure for the sample clustering, it is not exactly the same as the one above

#
#heatmap.2(x[cutree(hc.rows,k=3)==3,])
#heatmap.2(x[cutree(hc.rows,k=3)==2,])
#heatmap.2(x[cutree(hc.rows,k=5)==5,])

Further readings

You probably don’t understand heatmaps by Mick Watson

Why do you look at the speck in your sister’s quilt plot and pay no attention to the plank in your own heat map? by Lior Patcher

posts on biostars: understanding the clustering in heatmap

Making a heatmap with R by Dave Tang.

Using RColorBrewer to colour your figures in R by R bloggers

how to change heatmap2 color range in r

Customizing gplots heatmap.2 - color range for heatmap and legend for RowSideColors

map colour to values in heatmap

“Now for my heatmap power tip of the day, if you will: use the”useRaster=TRUE" parameter in your heatmap.2() call. Excellent extension by R developers since 2.13. But for some reason the R developers explicitly turn it off for interactive session windows, so you’ll only see it in an exported file. (Unless you have a custom image() function which doesn’t disable it, I did this.) It also makes the exported file hugely smaller, especially for PDFs."

“What it does is darn-near essential for nextgen coverage heatmaps – it actually properly resamples the image as it down-sizes the image during export. Without useRaster=TRUE, image() creates a zillion tiny rectangles to represent the heatmap, all pieced together right next to each other. When the display is fewer pixels/points high than the number of rows of data, it discretizes the data – that is, it uses integer values for the rectangles. In many cases, especially onscreen, many rectangles fully overlap others, randomly obscuring the real patterns, and often blunting your otherwise cool-looking signal.”

“Best way to test is export to PDF with useRaster=FALSE, then do it again with useRaster=TRUE. For me, night and day.”