Chapter 10 :

Unsupervised Learning

Principal Component Analysis, K-Means Clustering, and Hierarchical Clustering

Applied (7-11)

Problem 7

In the chapter, we mentioned the use of correlation-based distance and Euclidean distance as dissimilarity measures for hierarchical clustering. It turns out that these two measures are almost equivalent: if each observation has been centered to have mean zero and standard deviation one, and if we let rij denote the correlation between the ith and jth observations, then the quantity 1 - rij is proportional to the squared Euclidean distance between the ith and jth observations.

On the USArrests data, show that this proportionality holds.

Hint: The Euclidean distance can be calculated using the dist() function, and correlations can be calculated using the cor() function.

Solution 7

library(ISLR)
set.seed(1)
dsc <- scale(USArrests)
d1 <- dist(dsc)^2
d2 <- as.dist(1 - cor(t(dsc)))
summary(d2 / d1)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000086 0.069135 0.133943 0.234193 0.262589 4.887686

Problem 8

In Section 10.2.3, a formula for calculating PVE was given in Equation 10.8. We also saw that the PVE can be obtained using the sdev output of the prcomp() function. On the USArrests data, calculate PVE in two ways:

(a)

Using the sdev output of the prcomp() function, as was done in Section 10.2.3.

Solution (a)

#Applying PCA to four variables of USArrest dataset
pca<-prcomp(USArrests,scale=TRUE)
#variation of each component
pcavar<-pca$sdev^2
#proportion of variance explained by each component
pve<-pcavar/sum(pcavar)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752

(b)

By applying Equation 10.8 directly. That is, use the prcomp() function to compute the principal component loadings. Then, use those loadings in Equation 10.8 to obtain the PVE.

Solution (b)

loadings<-pca$rotation
USArrests2 <- scale(USArrests)
sumvar <- sum(apply(as.matrix(USArrests2)^2, 2, sum))
apply((as.matrix(USArrests2) %*% loadings)^2, 2, sum) / sumvar
##        PC1        PC2        PC3        PC4 
## 0.62006039 0.24744129 0.08914080 0.04335752

Problem 9

Consider the USArrests data. We will now perform hierarchical clustering on the states.

(a)

Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.

hc<-(USArrests)
set.seed(2)
hcluster<-hclust(dist(hc),method="complete")
plot(hcluster)

(b)

Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?

Solution (b)

clusters3<-cutree(hcluster,3)
clusters3
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              1              1              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              1              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              3              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              1              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              3              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              2              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2
plot(clusters3)

table(clusters3)
## clusters3
##  1  2  3 
## 16 14 20

The Table summarizes number of states in each cluster and which states belongs to which cluster.

(c)

Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.

hcscale<-scale(USArrests)
set.seed(2)
hclusterscale<-hclust(dist(hcscale),method="complete")
plot(hclusterscale)

(d)

What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

Solution (d)

scaledtree<-cutree(hclusterscale,3)
plot(scaledtree)

table(scaledtree)
## scaledtree
##  1  2  3 
##  8 11 31

Scaling affects the clusters.

Scaling should be done if the units of measure of variables are different.

Problem 10

In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.

(a)

Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables. Hint: There are a number of functions in R that you can use to generate data. One example is the rnorm() function; runif() is another option. Be sure to add a mean shift to the observations in each class so that there are three distinct classes.

Solution (a)

set.seed(2)
x<-matrix(rnorm(20*3*50,mean=0,sd=0.1),ncol=50)
x[1:20,2]<-1
x[21:40,1]<-2
x[21:40,2]<-2
x[41:60,1]<-1
true.labels<-c(rep(1,20),rep(2,20),rep(3,20))

(b)

Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors.

Solution (b)

pca<-prcomp(x)
plot(pca$x[,1:2],col=1:3,xlab="Z1",ylab="Z2",pch=20)

(c)

Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?

Hint: You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: K-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.

Solution (c)

km.clustering <- kmeans(x, 3, nstart = 20)
table(true.labels, km.clustering$cluster)
##            
## true.labels  1  2  3
##           1 20  0  0
##           2  0 20  0
##           3  0  0 20

There were 3 classes in original data set with 20 observations in each. These are perfectly clustered in 3 classes obtained from k-means clustering.

(d)

Perform K-means clustering with K = 2. Describe your results.

Solution (d)

km.clustering.2 <- kmeans(x, 2, nstart = 20)
table(true.labels, km.clustering.2$cluster)
##            
## true.labels  1  2
##           1 20  0
##           2  0 20
##           3 20  0

The original 3 classes are now clustered into 2 only. 3 cluster is clubbed with another completely.

(e)

Now perform K-means clustering with K = 4, and describe your results.

Solution (e)

km.clustering.4<- kmeans(x, 4, nstart = 20)
table(true.labels, km.clustering.4$cluster)
##            
## true.labels  1  2  3  4
##           1  0 11  0  9
##           2 20  0  0  0
##           3  0  0 20  0

The original 3 classes are now classified into 4 clusters. With 2 classes perfectly classified into 2 separate clusters but observations of one class is distributed over 2 clusters.

(f)

Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.

Solution (f)

km.pca<-kmeans(pca$x[,1:2],3,nstart=20)
table(true.labels,km.pca$cluster)
##            
## true.labels  1  2  3
##           1  0  0 20
##           2  0 20  0
##           3 20  0  0

All observations are perfectly clustered once again

(g)

Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.

Solution (g)

kmscale<-kmeans(scale(x),3,nstart=20)
table(true.labels,kmscale$cluster)
##            
## true.labels  1  2  3
##           1  8  2 10
##           2  0 19  1
##           3 11  1  8

Here we see the observations are not perfectly clustered and the results are worse than unscaled clustering. Scaling affects the distance between the observations and hence should be avoided till mandatory

Problem 11

On the book website, “www.StatLearning.com”, there is a gene expression data set (Ch10Ex11.csv) that consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group.

(a)

Load in the data using read.csv(). You will need to select header=F.

Solution (a)

genes<-read.csv("Ch10Ex11.csv",header=FALSE)

(b)

Apply hierarchical clustering to the samples using correlation based distance, and plot the dendrogram. Do the genes separate the samples into the two groups? Do your results depend on the type of linkage used?

Solution (b)

#Complete Linkage
hc.complete<-hclust(as.dist(1-cor(genes)),method="complete")
plot(hc.complete)

#Single Linkage
hc.single<-hclust(as.dist(1-cor(genes)),method="single")
plot(hc.single)

#Average Linkage
hc.average<-hclust(as.dist(1-cor(genes)),method="average")
plot(hc.average)

The results obtained differs based on the type of linkage used.

The complete linkage and single linkage methods gives two clusters while average linkage gives three clusters.

(c)

Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.

Solution (c)

We can use principal component analysis to answer the question.

We will arrange in decreasing order the total weight given to each gene for different principal components

The top 10 in the order will be the genes which differ a lot in the two group

pca.gene<-prcomp(t(genes))
head(pca.gene$rotation)
##               PC1          PC2          PC3          PC4          PC5
## [1,] -0.002434664 -0.030745799  0.009359932  0.009699551 -0.012847866
## [2,] -0.002016598 -0.025927592  0.050300983 -0.026082885  0.003488293
## [3,]  0.011233842 -0.003937802  0.014564920  0.054373032 -0.020411836
## [4,]  0.013912855  0.025625408  0.033998676 -0.011530298 -0.009364524
## [5,]  0.007293322  0.013590353 -0.008229702 -0.001343010  0.030002978
## [6,]  0.017928318 -0.026302699 -0.020728401 -0.024069152 -0.018619253
##               PC6          PC7          PC8          PC9         PC10
## [1,]  0.023439995  0.010152261 -0.024602570 -0.021925557 -0.035003076
## [2,]  0.001605492 -0.037364376 -0.017332292  0.011319311  0.007802611
## [3,]  0.025337127  0.070772412  0.047340581 -0.013963868  0.023624407
## [4,]  0.029529539  0.002885764 -0.093667774 -0.008391226 -0.019226470
## [5,] -0.017042934  0.003555111 -0.053227214 -0.010479774  0.008446406
## [6,] -0.049103273 -0.040473304 -0.005455454 -0.003882692  0.028472950
##              PC11         PC12         PC13         PC14         PC15
## [1,]  0.068133070  0.002322824 -0.050042837 -0.043957087  0.007542896
## [2,] -0.092523227  0.036265781  0.002951734  0.021272662 -0.040075267
## [3,]  0.017649621  0.021512568  0.013587072  0.005264628 -0.002918920
## [4,]  0.006695624  0.025918069 -0.081179098  0.017689681  0.045951951
## [5,]  0.053250618 -0.076682641 -0.049516326 -0.003282028  0.060755699
## [6,] -0.018103035  0.015433035  0.015967833 -0.006985293 -0.025237500
##             PC16         PC17        PC18         PC19         PC20
## [1,] -0.04567334 -0.019899716  0.02946561 -0.009362957 -0.029855408
## [2,]  0.03433259  0.003735211 -0.01218600 -0.023466062 -0.005495696
## [3,]  0.01881913  0.003284517  0.02597233  0.021581732  0.016808524
## [4,] -0.01062858  0.018342677 -0.03334608 -0.052262385 -0.030868339
## [5,] -0.02562691  0.049934804 -0.04221058 -0.012279815  0.018004932
## [6,] -0.00394582  0.037319024 -0.02541592 -0.029423771 -0.012043007
##              PC21          PC22         PC23         PC24         PC25
## [1,] -0.009190761  0.0230209664 -0.028970518  0.033060132  0.021453017
## [2,] -0.002808309  0.0079065160 -0.007921167 -0.034424716  0.011932971
## [3,]  0.010683143 -0.0392265342  0.004592080  0.026463736 -0.038085712
## [4,]  0.079419742 -0.0001627164  0.070396594 -0.002015954  0.006459925
## [5,] -0.038364004 -0.0230993500 -0.047439556 -0.001129421 -0.001285153
## [6,] -0.004522525  0.0304001071  0.016062043 -0.019329595 -0.034486284
##              PC26         PC27         PC28         PC29         PC30
## [1,]  0.034447853  0.017729906  0.034708970 -0.028136309 -0.009873440
## [2,]  0.051079165  0.032435028 -0.006934708 -0.026307151 -0.008143422
## [3,] -0.064720318 -0.004616608  0.038015189  0.006455198  0.004570640
## [4,]  0.022138389 -0.017120199  0.074901678  0.015812685  0.016391804
## [5,] -0.010772594  0.010889806 -0.005305488  0.015248277  0.029303828
## [6,]  0.001489549  0.028082907 -0.036617970 -0.054760935  0.023337598
##             PC31         PC32        PC33          PC34        PC35
## [1,] -0.03576788  0.016708304 -0.01823350  0.0007957941 -0.01443692
## [2,] -0.04439239  0.011968530  0.04168309  0.0123210140  0.02739196
## [3,]  0.02932866  0.026066011  0.02055204 -0.0716448783  0.02726941
## [4,] -0.03954720  0.014714963  0.02846397  0.0316775643  0.01866774
## [5,]  0.05494446 -0.005416152  0.03476606  0.0245476439 -0.04037835
## [6,]  0.01132569  0.006320203 -0.00237484  0.0061140832  0.01402898
##              PC36         PC37         PC38          PC39        PC40
## [1,]  0.010652118 -0.009366629 -0.012754402  0.0020214363  0.07000786
## [2,] -0.002733484 -0.001318693  0.031410461 -0.0108377476 -0.06326465
## [3,]  0.020891497 -0.001380233 -0.025857254  0.0008800921 -0.32824953
## [4,] -0.027363133 -0.006080650 -0.025316130 -0.0235404170 -0.01675446
## [5,] -0.046869227 -0.017973802  0.002917167  0.0342753219  0.04896111
## [6,]  0.042083325  0.055817170 -0.010080327  0.0029965594  0.05407104
Total.loadings<-apply(pca.gene$rotation,1,sum)
index<-order(abs(Total.loadings),decreasing = TRUE)
index[1:10]
##  [1] 865  68 911 428 624  11 524 803 980 822

The gene number 865, 68, 911, 428, 624, 11, 524, 803, 980, 822 are the 10 most different genes across the group.