In this exercise we deal with unsupervised learning methods: principal component analysis (PCA) and clustering techniques such as K-means and hierarchical clustering.
Principal component analysis is a dimensionality reduction method: it finds a small number of derived dimensions that capture as much of the variability in the data as possible.
Let us do PCA on the USArrests data set. First let us look at the data and compute the mean and variance of each variable.
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
## Column means
apply(USArrests,2,mean)
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
## Column variances
apply(USArrests,2,var)
## Murder Assault UrbanPop Rape
## 18.97 6945.17 209.52 87.73
One crucial point before doing PCA is to ensure that the variables are on comparable scales. In this data set the variables are measured on very different scales: for example, Rape is the number of rapes per 100,000 individuals, while UrbanPop is the percentage of each state's population living in an urban area. These measures are clearly not comparable, so we have to scale the variables.
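As a quick sanity check (a small addition, not in the original exercise), we can standardize the variables with scale() and verify that each column then has unit variance. Note that prcomp() with scale=TRUE performs this standardization internally, so we do not need to do it by hand.
us.scaled <- scale(USArrests)  # centre each variable to mean 0 and scale to variance 1
apply(us.scaled, 2, var)       # every variable now has variance 1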
Let us do the principal component analysis using prcomp() function.
pr.US <- prcomp(USArrests,scale=TRUE)
## Components of the prcomp object
names(pr.US)
## [1] "sdev" "rotation" "center" "scale" "x"
The center and scale components give the means and standard deviations used to centre and scale each of the 4 variables. The rotation matrix contains the principal component loading vectors. The number of principal components is min(n-1, p), where n is the number of observations and p is the number of variables; here n = 50 and p = 4, so there are 4 principal components.
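For instance (a small illustrative addition), these components can be inspected directly:
pr.US$center    # variable means used for centring
pr.US$scale     # variable standard deviations used for scaling
pr.US$rotation  # 4 x 4 matrix whose columns are the loading vectors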
Let us now plot the first two principal components using a biplot. The scale=0 argument ensures that the arrows are scaled to represent the loadings.
biplot(pr.US,scale=0)
Let us now find the proportion of variance explained by each of these principal components. The sdev component gives the standard deviation of each component; squaring it gives the variance.
pr.var <- pr.US$sdev^2
pr.var
## [1] 2.4802 0.9898 0.3566 0.1734
pr.pve <- pr.var/sum(pr.var)
pr.pve
## [1] 0.62006 0.24744 0.08914 0.04336
As seen, the first principal component explains 62% of the variance in the data, the second 24.7%, and so on. The first two principal components together explain about 87% of the variance.
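As a cross-check (a small addition, not shown in the original), summary() reports the same quantities directly:
summary(pr.US)  # standard deviation, proportion of variance and cumulative proportion per component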
Let us plot these
plot(pr.pve,xlab="Principal component",ylab="Proportion of variance explained",ylim=c(0,1),type='b')
plot(cumsum(pr.pve),xlab="Principal component",ylab="Cumulative proportion of variance explained",ylim=c(0,1),type='b')
Let us now perform K-means clustering on a simulated data set.
set.seed(2)
x = matrix(rnorm(50*2),ncol=2)
x[1:25,1] = x[1:25,1]+3
x[1:25,2] = x[1:25,2]-4
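Because the first 25 observations have been shifted, the data contain two well-separated groups. A quick plot (a small addition) makes this visible:
plot(x, pch = 20, main = "Simulated data: two shifted groups")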
Let us perform K-means clustering with K=2.
km.ex <- kmeans(x,2,nstart=20)
summary(km.ex)
## Length Class Mode
## cluster 50 -none- numeric
## centers 4 -none- numeric
## totss 1 -none- numeric
## withinss 2 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 2 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
Let us plot the two clusters, with each cluster coloured separately.
plot(x,col=(km.ex$cluster+1),main="K means clustering",pch=20,cex=2)
Since we simulated the data, we knew the number of clusters and so used K=2. However, in real life we do not know the number of clusters in advance and have to figure out what value of K should be used.
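One common heuristic for choosing K (a suggested addition, not part of the original exercise) is the elbow method: run kmeans() for a range of K values, plot the total within-cluster sum of squares against K, and look for the bend beyond which additional clusters give little improvement.
set.seed(3)
wss <- sapply(1:6, function(k) kmeans(x, k, nstart = 20)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")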
Let us try to use 3 clusters
set.seed(4)
km.ex1 <- kmeans(x,3,nstart=20)
km.ex1
## K-means clustering with 3 clusters of sizes 10, 23, 17
##
## Cluster means:
## [,1] [,2]
## 1 2.300 -2.69622
## 2 -0.382 -0.08741
## 3 3.779 -4.56201
##
## Clustering vector:
## [1] 3 1 3 1 3 3 3 1 3 1 3 1 3 1 3 1 3 3 3 3 3 1 3 3 3 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 19.56 52.68 25.74
## (between_SS / total_SS = 79.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
plot(x,col=(km.ex1$cluster+1),main="K means clustering",pch=20,cex=2)
The nstart argument tells kmeans() to perform multiple random initial cluster assignments and report only the best run, the one with the lowest total within-cluster sum of squares. Let us try nstart values of 1 and 20 and observe the within-cluster sums of squares; we should pick the model with the minimal total within-cluster sum of squares.
km.ex2 <- kmeans(x,3,nstart=1)
km.ex2$tot.withinss
## [1] 99.88
km.ex2$withinss
## [1] 14.58 59.07 26.24
km.ex3 <- kmeans(x,3,nstart=20)
km.ex3$tot.withinss
## [1] 97.98
km.ex3$withinss
## [1] 19.56 25.74 52.68
Let us now move on to hierarchical clustering. The hclust() function in R performs hierarchical clustering, and the dist() function computes the Euclidean distance matrix used as the dissimilarity measure. We will use the three most common linkage methods: complete, single, and average linkage.
hc.complete = hclust(dist(x),method="complete")
names(hc.complete)
## [1] "merge" "height" "order" "labels" "method"
## [6] "call" "dist.method"
Some of the components of the fit are explained below:
1. merge: records which observations or clusters are merged at each step of the dendrogram
2. height: gives the height (dissimilarity) at which each merge takes place
3. order: gives the left-to-right order in which the observations appear in the dendrogram
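For example (a small illustrative addition), we can peek at these components and plot the complete-linkage dendrogram:
head(hc.complete$merge)   # negative entries are single observations, positive entries refer to earlier merges
head(hc.complete$height)  # dissimilarity at which each merge happens
plot(hc.complete, main = "Complete linkage")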
hc.average = hclust(dist(x),method="average")
plot(hc.average)
hc.single = hclust(dist(x),method="single")
plot(hc.single)
To obtain the cluster labels for a given number of clusters we can use the cutree() function.
cutree(hc.complete,2)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
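Since we simulated the data ourselves, we can check (a small addition) how this two-cluster cut lines up with the true group membership of the first and last 25 observations:
true.groups <- rep(1:2, each = 25)         # known groups from the simulation
table(cutree(hc.complete, 2), true.groups)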
To scale the variables before performing hierarchical clustering we can use the scale() function.
xsc = scale(x)
plot(hclust(dist(xsc),method="complete"), main = "HC")
Correlation-based distance can be computed using the as.dist() function. This only makes sense for data with at least three features (with two features, the correlation between any pair of observations is always ±1), so we generate a new data set with three columns.
x = matrix(rnorm(30*3),ncol=3)
dd = as.dist(1-cor(t(x)))
plot(hclust(dd,method="complete"),main="Complete linkage")
Let us now apply these techniques to some real data.
We will use the NCI60 data set, which is part of the ISLR package. Let us first carry out a PCA on this data set.
The data contain gene expression measurements on 64 cancer cell lines; the dimension of the data is 64 x 6830.
library(ISLR)
nci.labs = NCI60$labs
nci.data = NCI60$data
dim(nci.data)
## [1] 64 6830
Let us do PCA on this data
pr.nc <- prcomp(nci.data,scale=TRUE)
summary(pr.nc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 27.853 21.4814 19.8205 17.0326 15.9718 15.7211
## Proportion of Variance 0.114 0.0676 0.0575 0.0425 0.0374 0.0362
## Cumulative Proportion 0.114 0.1812 0.2387 0.2812 0.3185 0.3547
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 14.4715 13.5443 13.1440 12.7386 12.6867 12.1577
## Proportion of Variance 0.0307 0.0269 0.0253 0.0238 0.0236 0.0216
## Cumulative Proportion 0.3853 0.4122 0.4375 0.4613 0.4848 0.5065
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 11.8302 11.6255 11.4378 11.0005 10.6567 10.4888
## Proportion of Variance 0.0205 0.0198 0.0192 0.0177 0.0166 0.0161
## Cumulative Proportion 0.5270 0.5467 0.5659 0.5836 0.6002 0.6163
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 10.4352 10.3219 10.1461 10.0544 9.9027 9.6477
## Proportion of Variance 0.0159 0.0156 0.0151 0.0148 0.0144 0.0136
## Cumulative Proportion 0.6323 0.6479 0.6630 0.6778 0.6921 0.7057
## PC25 PC26 PC27 PC28 PC29 PC30 PC31
## Standard deviation 9.5076 9.3325 9.2732 9.0900 8.9812 8.7500 8.5996
## Proportion of Variance 0.0132 0.0127 0.0126 0.0121 0.0118 0.0112 0.0108
## Cumulative Proportion 0.7190 0.7317 0.7443 0.7564 0.7682 0.7794 0.7903
## PC32 PC33 PC34 PC35 PC36 PC37
## Standard deviation 8.4474 8.3730 8.21579 8.15731 7.97465 7.90446
## Proportion of Variance 0.0104 0.0103 0.00988 0.00974 0.00931 0.00915
## Cumulative Proportion 0.8007 0.8110 0.82087 0.83061 0.83992 0.84907
## PC38 PC39 PC40 PC41 PC42 PC43
## Standard deviation 7.82127 7.72156 7.58603 7.45619 7.3444 7.10449
## Proportion of Variance 0.00896 0.00873 0.00843 0.00814 0.0079 0.00739
## Cumulative Proportion 0.85803 0.86676 0.87518 0.88332 0.8912 0.89861
## PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 7.0131 6.95839 6.8663 6.80744 6.64763 6.61607
## Proportion of Variance 0.0072 0.00709 0.0069 0.00678 0.00647 0.00641
## Cumulative Proportion 0.9058 0.91290 0.9198 0.92659 0.93306 0.93947
## PC50 PC51 PC52 PC53 PC54 PC55
## Standard deviation 6.40793 6.21984 6.20326 6.06706 5.91805 5.91233
## Proportion of Variance 0.00601 0.00566 0.00563 0.00539 0.00513 0.00512
## Cumulative Proportion 0.94548 0.95114 0.95678 0.96216 0.96729 0.97241
## PC56 PC57 PC58 PC59 PC60 PC61
## Standard deviation 5.73539 5.47261 5.2921 5.02117 4.68398 4.17567
## Proportion of Variance 0.00482 0.00438 0.0041 0.00369 0.00321 0.00255
## Cumulative Proportion 0.97723 0.98161 0.9857 0.98940 0.99262 0.99517
## PC62 PC63 PC64
## Standard deviation 4.08212 4.04124 2.07e-14
## Proportion of Variance 0.00244 0.00239 0.00e+00
## Cumulative Proportion 0.99761 1.00000 1.00e+00
Let us create a simple function that assigns a colour to each element of a vector, so that cell lines of the same cancer type receive the same colour.
Cols = function(vec){
cols=rainbow(length(unique(vec)))
return(cols[as.numeric(as.factor(vec))])
}
The rainbow() function takes a positive integer as its argument and returns a vector containing that number of distinct colors.
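For example (a small illustrative addition), repeated labels receive the same colour:
Cols(c("a", "b", "a", "c"))  # the two "a" entries get identical colours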
Let us now plot the principal component score vectors.
par(mfrow=c(1,2))
plot(pr.nc$x[,1:2],col=Cols(nci.labs),pch=19,xlab="Z1",ylab="Z2")
plot(pr.nc$x[,c(1,3)],col=Cols(nci.labs),pch=19,xlab="Z1",ylab="Z3")
Cell lines corresponding to a single cancer type tend to have similar values on the first few principal component score vectors. This indicates that cell lines from the same cancer type tend to have similar gene expression levels.
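As with USArrests, we can also examine the proportion of variance explained by each component (a small addition, computed from the sdev component):
pve <- 100 * pr.nc$sdev^2 / sum(pr.nc$sdev^2)
par(mfrow = c(1, 2))
plot(pve, type = "o", xlab = "Principal component", ylab = "PVE")
plot(cumsum(pve), type = "o", xlab = "Principal component", ylab = "Cumulative PVE")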
Let us now do hierarchical clustering on the data. First we scale the variables, then carry out the clustering with each of the three linkage methods.
hc.data <- scale(nci.data)
par(mfrow=c(1,3))
data.dist = dist(hc.data)
plot(hclust(data.dist),labels=nci.labs,main="Complete")
plot(hclust(data.dist,method="average"),labels=nci.labs,main="average")
plot(hclust(data.dist,method="single"),labels=nci.labs,main="single")
Let us now cut the dendrogram at a height that yields a particular number of clusters, here four.
hc.out = hclust(dist(hc.data))
hc.clusters = cutree(hc.out,4)
table(hc.clusters,nci.labs)
## nci.labs
## hc.clusters BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA MCF7A-repro
## 1 2 3 2 0 0 0 0
## 2 3 2 0 0 0 0 0
## 3 0 0 0 1 1 6 0
## 4 2 0 5 0 0 0 1
## nci.labs
## hc.clusters MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE RENAL UNKNOWN
## 1 0 8 8 6 2 8 1
## 2 0 0 1 0 0 1 0
## 3 0 0 0 0 0 0 0
## 4 1 0 0 0 0 0 0
From the above table we can see patterns emerging. For example, all the leukemia cell lines fall into cluster 3. Plotting the dendrogram with a horizontal line at height 139 shows the cut that produces these four clusters.
par(mfrow=c(1,1))
plot(hc.out,labels=nci.labs)
abline(h=139,col="red")
Let us also do K-means clustering on this data with K=4 and compare the resulting clusters with the hierarchical ones.
set.seed(2)
km.out = kmeans(hc.data,4,nstart=20)
km.clusters = km.out$cluster
table(km.clusters,hc.clusters)
## hc.clusters
## km.clusters 1 2 3 4
## 1 11 0 0 9
## 2 0 0 8 0
## 3 9 0 0 0
## 4 20 7 0 0
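The two methods do not agree perfectly: hierarchical cluster 3 coincides exactly with K-means cluster 2, while hierarchical cluster 1 is split across three K-means clusters. As a further check (a suggested follow-up, not in the original), we could tabulate the K-means clusters against the cancer-type labels as well:
table(km.clusters, nci.labs)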
Let us now perform hierarchical clustering on the first five principal component score vectors.
hc.out2 = hclust(dist(pr.nc$x[,1:5]))
table(cutree(hc.out2,4),nci.labs)
## nci.labs
## BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA MCF7A-repro
## 1 0 2 7 0 0 2 0
## 2 5 3 0 0 0 0 0
## 3 0 0 0 1 1 4 0
## 4 2 0 0 0 0 0 1
## nci.labs
## MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE RENAL UNKNOWN
## 1 0 1 8 5 2 7 0
## 2 0 7 1 1 0 2 1
## 3 0 0 0 0 0 0 0
## 4 1 0 0 0 0 0 0
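We can also plot the corresponding dendrogram (a small addition):
plot(hc.out2, labels = nci.labs, main = "HC on first five score vectors")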
Performing principal component analysis as a first step before clustering can denoise the data and yield better results.