In this exercise we deal with unsupervised learning methods: Principal Component Analysis (PCA) and clustering techniques such as K-means and hierarchical clustering.

Principal Component Analysis

Principal Component Analysis is a dimensionality reduction method: it projects the data onto a smaller number of dimensions chosen so that they capture as much of the variability in the data as possible.

Let us do PCA on the USArrests data set. First let us look at the data and compute the mean and variance of each variable.

head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
## Column means

apply(USArrests,2,mean)
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
## Column variances

apply(USArrests,2,var)
##   Murder  Assault UrbanPop     Rape 
##    18.97  6945.17   209.52    87.73

One crucial point before doing PCA is to ensure that the scales of the variables are comparable. In this data set the variables are on very different scales: for example, Rape is measured as the number of rapes per 100,000 individuals, while UrbanPop measures the percentage of the population in each state living in an urban area. The variances computed above make the problem obvious (Assault has a variance of about 6945, Murder only about 19). These measures are clearly not comparable, so we scale the variables before performing PCA.
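As a quick check, we can standardise the columns with the base R scale() function and confirm that each then has mean approximately zero and variance one:

us.scaled <- scale(USArrests)          # centre and scale each column
round(apply(us.scaled, 2, mean), 3)    # column means, all approximately 0
apply(us.scaled, 2, var)               # column variances, all equal to 1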

Let us perform the principal component analysis using the prcomp() function.

pr.US <- prcomp(USArrests,scale=TRUE)

## Components of the prcomp object

names(pr.US)
## [1] "sdev"     "rotation" "center"   "scale"    "x"

The center and scale components give the means and standard deviations of the four variables that were used for centering and scaling. The rotation matrix contains the principal component loading vectors, one per column. The number of principal components is min(n-1, p), where n is the number of observations and p is the number of variables; here that minimum is p = 4, so there are 4 principal components.
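To make this concrete, we can inspect these components directly (a quick illustration; the printed values are omitted here):

pr.US$center         # variable means used for centering
pr.US$scale          # variable standard deviations used for scaling
dim(pr.US$rotation)  # 4 x 4 matrix of principal component loadings
dim(pr.US$x)         # 50 x 4 matrix of principal component scores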

Let us now plot the first two principal components

biplot(pr.US,scale=0)

[Figure: biplot of the first two principal components of the USArrests data]

Let us now find the proportion of variance explained by each principal component. The sdev component gives the standard deviation of each principal component; squaring it gives the variance.

pr.var <- pr.US$sdev^2
pr.var
## [1] 2.4802 0.9898 0.3566 0.1734
pr.pve <- pr.var/sum(pr.var)
pr.pve
## [1] 0.62006 0.24744 0.08914 0.04336

So, as seen, the first principal component explains 62% of the variance in the data, the second 24.7%, and so on. The first two principal components together explain about 87% of the variance.

Let us plot the proportion of variance explained by each component, followed by the cumulative proportion.

plot(pr.pve,xlab="Principal component",ylab="Proportion of variance explained",ylim=c(0,1),type='b')

[Figure: proportion of variance explained by each principal component]

plot(cumsum(pr.pve),xlab="Principal component",ylab="Cumulative proportion of variance explained",ylim=c(0,1),type='b')

[Figure: cumulative proportion of variance explained]

K-Means Clustering

Let us perform K-means clustering on a simulated data set.

set.seed(2)
x = matrix(rnorm(50*2),ncol=2)   # 50 observations in two dimensions
x[1:25,1] = x[1:25,1]+3          # shift the first 25 observations so that
x[1:25,2] = x[1:25,2]-4          # two well-separated clusters exist

Let us perform K-means clustering with K=2.

km.ex <- kmeans(x,2,nstart=20)
summary(km.ex)
##              Length Class  Mode   
## cluster      50     -none- numeric
## centers       4     -none- numeric
## totss         1     -none- numeric
## withinss      2     -none- numeric
## tot.withinss  1     -none- numeric
## betweenss     1     -none- numeric
## size          2     -none- numeric
## iter          1     -none- numeric
## ifault        1     -none- numeric

Let us plot the data with each of the two clusters coloured separately.

plot(x,col=(km.ex$cluster+1),main="K means clustering",pch=20,cex=2)

[Figure: K-means clustering with K=2, clusters coloured separately]

Since we knew the true number of clusters, we used K=2. In real applications, however, we do not know the number of clusters in advance and have to figure out what value of K to use.
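One common heuristic, sketched below, is the elbow method: run kmeans() for a range of K values and plot the total within-cluster sum of squares, looking for an "elbow" beyond which larger K gives little improvement.

wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")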

Let us try to use 3 clusters

set.seed(4)
km.ex1 <- kmeans(x,3,nstart=20)
km.ex1
## K-means clustering with 3 clusters of sizes 10, 23, 17
## 
## Cluster means:
##     [,1]     [,2]
## 1  2.300 -2.69622
## 2 -0.382 -0.08741
## 3  3.779 -4.56201
## 
## Clustering vector:
##  [1] 3 1 3 1 3 3 3 1 3 1 3 1 3 1 3 1 3 3 3 3 3 1 3 3 3 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 19.56 52.68 25.74
##  (between_SS / total_SS =  79.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
plot(x,col=(km.ex1$cluster+1),main="K means clustering",pch=20,cex=2)

[Figure: K-means clustering with K=3]

The nstart argument makes kmeans() run the algorithm with multiple random initial cluster assignments and report only the best solution found. Let us compare nstart=1 and nstart=20 and look at the within-cluster sums of squares; we should pick the solution with the smallest total within-cluster sum of squares.

km.ex2 <- kmeans(x,3,nstart=1)
km.ex2$tot.withinss
## [1] 99.88
km.ex2$withinss
## [1] 14.58 59.07 26.24
km.ex3 <- kmeans(x,3,nstart=20)
km.ex3$tot.withinss
## [1] 97.98
km.ex3$withinss
## [1] 19.56 25.74 52.68

Hierarchical Clustering

Let us start off with hierarchical clustering. The hclust() function in R will be used for hierarchical clustering, and the dist() function will be used to compute the Euclidean distance matrix as the measure of dissimilarity. We will consider three common linkage methods: complete, single, and average linkage.

hc.complete = hclust(dist(x),method="complete")
names(hc.complete)
## [1] "merge"       "height"      "order"       "labels"      "method"     
## [6] "call"        "dist.method"

Some of the components of the fitted object are explained below:

1. merge : describes the sequence of merges; negative entries refer to single observations, positive entries to clusters formed at earlier steps
2. height : the dissimilarity (height) at which each merge occurs
3. order : the ordering of the observations used to draw the dendrogram so that branches do not cross
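For illustration (a quick look only), we can inspect the first few entries of these components:

hc.complete$merge[1:3, ]   # negative entries are single observations, positive entries are earlier merges
hc.complete$height[1:3]    # dissimilarities at which the first three merges occur
hc.complete$order[1:10]    # ordering of observations used to draw the dendrogram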

hc.average = hclust(dist(x),method="average")
plot(hc.average)

[Figure: dendrogram from average linkage clustering]

hc.single = hclust(dist(x),method="single")
plot(hc.single)

[Figure: dendrogram from single linkage clustering]

To obtain the cluster labels for a given cut of the dendrogram, we can use the cutree() function.

cutree(hc.complete,2)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

To scale the variables to have standard deviation one before performing hierarchical clustering, we can use the scale() function.

xsc = scale(x)

plot(hclust(dist(xsc),method="complete"), main = "HC")

[Figure: complete linkage dendrogram on the scaled data]

Correlation-based distance can be computed by applying the as.dist() function to one minus the correlation matrix of the transposed data.

x = matrix(rnorm(30*3),ncol=3)
dd = as.dist(1-cor(t(x)))
plot(hclust(dd,method="complete"),main="Complete linkage")

[Figure: complete linkage dendrogram using correlation-based distance]

Let us now carry out these analyses on some real data: the NCI60 data set, which is part of the ISLR package. We first perform PCA on this data set.

This data set contains gene expression measurements on 64 cancer cell lines. The dimension of the data is 64 x 6830.

library(ISLR)
nci.labs = NCI60$labs
nci.data = NCI60$data
dim(nci.data)
## [1]   64 6830

Let us do PCA on this data

pr.nc <- prcomp(nci.data,scale=TRUE)

summary(pr.nc)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6
## Standard deviation     27.853 21.4814 19.8205 17.0326 15.9718 15.7211
## Proportion of Variance  0.114  0.0676  0.0575  0.0425  0.0374  0.0362
## Cumulative Proportion   0.114  0.1812  0.2387  0.2812  0.3185  0.3547
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     14.4715 13.5443 13.1440 12.7386 12.6867 12.1577
## Proportion of Variance  0.0307  0.0269  0.0253  0.0238  0.0236  0.0216
## Cumulative Proportion   0.3853  0.4122  0.4375  0.4613  0.4848  0.5065
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     11.8302 11.6255 11.4378 11.0005 10.6567 10.4888
## Proportion of Variance  0.0205  0.0198  0.0192  0.0177  0.0166  0.0161
## Cumulative Proportion   0.5270  0.5467  0.5659  0.5836  0.6002  0.6163
##                           PC19    PC20    PC21    PC22   PC23   PC24
## Standard deviation     10.4352 10.3219 10.1461 10.0544 9.9027 9.6477
## Proportion of Variance  0.0159  0.0156  0.0151  0.0148 0.0144 0.0136
## Cumulative Proportion   0.6323  0.6479  0.6630  0.6778 0.6921 0.7057
##                          PC25   PC26   PC27   PC28   PC29   PC30   PC31
## Standard deviation     9.5076 9.3325 9.2732 9.0900 8.9812 8.7500 8.5996
## Proportion of Variance 0.0132 0.0127 0.0126 0.0121 0.0118 0.0112 0.0108
## Cumulative Proportion  0.7190 0.7317 0.7443 0.7564 0.7682 0.7794 0.7903
##                          PC32   PC33    PC34    PC35    PC36    PC37
## Standard deviation     8.4474 8.3730 8.21579 8.15731 7.97465 7.90446
## Proportion of Variance 0.0104 0.0103 0.00988 0.00974 0.00931 0.00915
## Cumulative Proportion  0.8007 0.8110 0.82087 0.83061 0.83992 0.84907
##                           PC38    PC39    PC40    PC41   PC42    PC43
## Standard deviation     7.82127 7.72156 7.58603 7.45619 7.3444 7.10449
## Proportion of Variance 0.00896 0.00873 0.00843 0.00814 0.0079 0.00739
## Cumulative Proportion  0.85803 0.86676 0.87518 0.88332 0.8912 0.89861
##                          PC44    PC45   PC46    PC47    PC48    PC49
## Standard deviation     7.0131 6.95839 6.8663 6.80744 6.64763 6.61607
## Proportion of Variance 0.0072 0.00709 0.0069 0.00678 0.00647 0.00641
## Cumulative Proportion  0.9058 0.91290 0.9198 0.92659 0.93306 0.93947
##                           PC50    PC51    PC52    PC53    PC54    PC55
## Standard deviation     6.40793 6.21984 6.20326 6.06706 5.91805 5.91233
## Proportion of Variance 0.00601 0.00566 0.00563 0.00539 0.00513 0.00512
## Cumulative Proportion  0.94548 0.95114 0.95678 0.96216 0.96729 0.97241
##                           PC56    PC57   PC58    PC59    PC60    PC61
## Standard deviation     5.73539 5.47261 5.2921 5.02117 4.68398 4.17567
## Proportion of Variance 0.00482 0.00438 0.0041 0.00369 0.00321 0.00255
## Cumulative Proportion  0.97723 0.98161 0.9857 0.98940 0.99262 0.99517
##                           PC62    PC63     PC64
## Standard deviation     4.08212 4.04124 2.07e-14
## Proportion of Variance 0.00244 0.00239 0.00e+00
## Cumulative Proportion  0.99761 1.00000 1.00e+00

Let us create a simple function that assigns a colour to each of the 64 cell lines according to the cancer type it corresponds to.

Cols = function(vec){
  cols = rainbow(length(unique(vec)))
  return(cols[as.numeric(as.factor(vec))])
}

The rainbow() function takes a positive integer as its argument and returns a vector containing that number of distinct colors.
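As a small illustration with a toy input (just to show the mapping), repeated labels receive the same colour:

Cols(c("BREAST", "CNS", "BREAST", "RENAL"))  # one colour per element; the two BREAST entries share a colour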

Let us now plot the principal component score vectors.

par(mfrow=c(1,2))
plot(pr.nc$x[,1:2],col=Cols(nci.labs),pch=19,xlab="Z1",ylab="Z2")
plot(pr.nc$x[,c(1,3)],col=Cols(nci.labs),pch=19,xlab="Z1",ylab="Z3")

[Figure: projections of the NCI60 cell lines onto the first three principal component score vectors, coloured by cancer type]

Cell lines corresponding to a single cancer type tend to have similar values on the first few principal component score vectors. This indicates that cell lines from the same cancer type tend to have fairly similar gene expression levels.

Clustering the observations of the NCI60 data

Let us perform hierarchical clustering on this data. First we scale the data, then carry out the clustering using complete, average, and single linkage.

hc.data <- scale(nci.data)
par(mfrow=c(1,3))
data.dist = dist(hc.data)
plot(hclust(data.dist),labels=nci.labs,main="Complete")
plot(hclust(data.dist,method="average"),labels=nci.labs,main="average")

plot(hclust(data.dist,method="single"),labels=nci.labs,main="single")

[Figure: dendrograms for the NCI60 data using complete, average, and single linkage]

Let us now cut the dendrogram at a height that yields a particular number of clusters, here four.

hc.out = hclust(dist(hc.data))
hc.clusters = cutree(hc.out,4)
table(hc.clusters,nci.labs)
##            nci.labs
## hc.clusters BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA MCF7A-repro
##           1      2   3     2           0           0        0           0
##           2      3   2     0           0           0        0           0
##           3      0   0     0           1           1        6           0
##           4      2   0     5           0           0        0           1
##            nci.labs
## hc.clusters MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE RENAL UNKNOWN
##           1           0        8     8       6        2     8       1
##           2           0        0     1       0        0     1       0
##           3           0        0     0       0        0     0       0
##           4           1        0     0       0        0     0       0

From the above we can see some clear patterns. For example, all the leukemia cell lines fall into cluster 3, while the breast cancer cell lines are spread out over three different clusters.

par(mfrow=c(1,1))
plot(hc.out,labels=nci.labs)
abline(h=139,col="red")

[Figure: complete linkage dendrogram with a horizontal cut at height 139, yielding four clusters]

Let us also do K-means clustering on this data with K=4 and compare the resulting clusters with those obtained from hierarchical clustering.

set.seed(2)
km.out = kmeans(hc.data,4,nstart=20)
km.clusters = km.out$cluster
table(km.clusters,hc.clusters)
##            hc.clusters
## km.clusters  1  2  3  4
##           1 11  0  0  9
##           2  0  0  8  0
##           3  9  0  0  0
##           4 20  7  0  0

Let us now perform hierarchical clustering on the first five principal component score vectors rather than on the full data matrix.

hc.out2 = hclust(dist(pr.nc$x[,1:5]))
table(cutree(hc.out2,4),nci.labs)
##    nci.labs
##     BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA MCF7A-repro
##   1      0   2     7           0           0        2           0
##   2      5   3     0           0           0        0           0
##   3      0   0     0           1           1        4           0
##   4      2   0     0           0           0        0           1
##    nci.labs
##     MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE RENAL UNKNOWN
##   1           0        1     8       5        2     7       0
##   2           0        7     1       1        0     2       1
##   3           0        0     0       0        0     0       0
##   4           1        0     0       0        0     0       0

Performing principal component analysis as a first step before clustering can yield better results, since the leading principal components may capture the signal in the data while filtering out some of the noise.
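As a closing sketch, the same idea could be applied to K-means: instead of clustering on all 6,830 genes, we could cluster on the first five principal component score vectors.

set.seed(2)
km.pc <- kmeans(pr.nc$x[, 1:5], 4, nstart = 20)  # K-means on the first five PC score vectors
table(km.pc$cluster, nci.labs)                   # compare the clusters with the cancer type labels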