Q7
In the chapter, we mentioned the use of correlation-based distance and Euclidean distance as dissimilarity measures for hierarchical clustering. It turns out that these two measures are almost equivalent: if each observation has been centered to have mean zero and standard deviation one, and if we let \(r_{ij}\) denote the correlation between the \(i\)th and \(j\)th observations, then the quantity \(1 - r_{ij}\) is proportional to the squared Euclidean distance between the \(i\)th and \(j\)th observations. On the “USArrests” data, show that this proportionality holds.
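Before turning to the data, here is a short sketch of why the claim holds, assuming each observation \(x_i\) has been standardized across its \(p\) measurements (using the sample standard deviation): then \(\sum_{k=1}^{p} x_{ik}^2 = p-1\) and \(r_{ij} = \frac{1}{p-1}\sum_{k=1}^{p} x_{ik}x_{jk}\), so
\[
\|x_i - x_j\|^2 = \sum_{k=1}^{p} x_{ik}^2 + \sum_{k=1}^{p} x_{jk}^2 - 2\sum_{k=1}^{p} x_{ik}x_{jk} = 2(p-1)\left(1 - r_{ij}\right),
\]
i.e. \(1 - r_{ij}\) is exactly proportional to the squared Euclidean distance, with proportionality constant \(1/(2(p-1))\).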
library(ISLR)
set.seed(1)
dsc = scale(USArrests)
d1 = dist(dsc)^2
d2 = as.dist(1-cor(t(dsc)))
summary(d2/d1)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## 0.000086 0.069135 0.133943 0.234193 0.262589 4.887686
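Note that scale() above standardizes each variable (column) rather than each observation (row), which is why the ratio d2/d1 is not constant. As a supplementary check (not part of the original answer), standardizing each observation instead makes the ratio exactly \(1/(2(p-1)) = 1/6\) for the \(p = 4\) variables:
usa = t(scale(t(as.matrix(USArrests))))   # each row (observation) now has mean 0, sd 1
d.sq = dist(usa)^2                        # squared Euclidean distances between observations
d.cor = as.dist(1 - cor(t(usa)))          # correlation-based dissimilarity
summary(d.cor/d.sq)                       # the ratio is constant at 1/6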
Q8
In Section 10.2.3, a formula for calculating PVE was given in Equation 10.8. We also saw that the PVE can be obtained using the “sdev” output of the “prcomp()” function. On the “USArrests” data, calculate PVE in two ways:
a. Using the “sdev” output of the “prcomp()” function, as was done in Section 10.2.3.
pr.out = prcomp(USArrests,scale=TRUE)
pr.var = pr.out$sdev^2
pve = pr.var/sum(pr.var)   # proportion of variance explained by each principal component
sum(pr.var)
## [1] 4
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
b. By applying Equation 10.8 directly. That is, use the “prcomp()” function to compute the principal component loadings. Then, use those loadings in Equation 10.8 to obtain the PVE.
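For reference, Equation 10.8 defines the PVE of the \(m\)th principal component as
\[
\mathrm{PVE}_m = \frac{\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{jm}x_{ij}\right)^{2}}{\sum_{j=1}^{p}\sum_{i=1}^{n}x_{ij}^{2}},
\]
where the \(\phi_{jm}\) are the loadings and the \(x_{ij}\) are the centered (and here also scaled) observations.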
loadings = pr.out$rotation
USArrests2 = scale(USArrests)
sumvar = sum(apply(as.matrix(USArrests2)^2,2,sum))
apply((as.matrix(USArrests2)%*%loadings)^2,2,sum)/sumvar
##        PC1        PC2        PC3        PC4
## 0.62006039 0.24744129 0.08914080 0.04335752
Q9
Consider the “USArrests” data. We will now perform hierarchical clustering on the states.
a. Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
set.seed(2)
hc.complete = hclust(dist(USArrests),method="complete")
plot(hc.complete)
b. Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
cutree(hc.complete,3)
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
c. Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
sd.data = scale(USArrests)
hc.complete.sd = hclust(dist(sd.data),method="complete")
plot(hc.complete.sd)
d. What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
cutree(hc.complete.sd,3)
## Alabama Alaska Arizona Arkansas California
## 1 1 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 2 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 3 1 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 2 3 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 3 1 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 3 3 3
table(cutree(hc.complete,3),cutree(hc.complete.sd,3))
##
## 1 2 3
## 1 6 9 1
## 2 2 2 10
## 3 0 0 20
Scaling the variables affects the clusters obtained, although the trees are somewhat similar. The variables should be scaled before the inter-observation dissimilarities are computed because they are measured in different units: Murder, Assault, and Rape are rates per 100,000 residents, while UrbanPop is a percentage, and without scaling the large-variance Assault variable dominates the Euclidean distances.
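A quick supplementary check (not part of the original answer) makes the unit issue concrete: the sample variance of Assault is orders of magnitude larger than that of the other three variables, so it drives the unscaled distances.
apply(USArrests,2,var)   # Assault's variance dwarfs those of Murder, UrbanPop and Rape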
Q10
In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.
a. Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.
set.seed(2)
x = matrix(rnorm(20*3*50,mean=0,sd=0.001),ncol=50)
x[1:20,2] = 1
x[21:40,1] = 2
x[21:40,2] = 2
x[41:60,1] = 1
true.labels = c(rep(3,20),rep(2,20),rep(1,20))
b. Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors.
pr.out = prcomp(x)
plot(pr.out$x[,1:2],col=true.labels,xlab="Z1",ylab="Z2",pch=19)   # color each observation by its true class
c. Perform K-means clustering of the observations with K=3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?
km.out = kmeans(x,3,nstart=20)
table(true.labels,km.out$cluster)
##
## true.labels 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
K-means recovers the true classes perfectly; the cluster numbers are simply a permutation of the class labels.
d. Perform K-means clustering with K=2. Describe your results.
km.out = kmeans(x,2,nstart=20)
table(true.labels,km.out$cluster)
##
## true.labels 1 2
## 1 20 0
## 2 0 20
## 3 20 0
With only two clusters, two of the three true classes (labels 1 and 3) are merged into a single cluster, while the remaining class forms its own cluster.
e. Now perform K-means clustering with K=4, and describe your results.
km.out = kmeans(x,4,nstart=20)
table(true.labels,km.out$cluster)
##
## true.labels 1 2 3 4
## 1 0 0 0 20
## 2 0 0 20 0
## 3 11 9 0 0
One of the true classes (label 3) is now split across two clusters, while the other two classes remain intact.
f. Now perform K-means clustering with \(K=3\) on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the \(60 \times 2\) matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.
km.out = kmeans(pr.out$x[,1:2],3,nstart=20)
table(true.labels,km.out$cluster)
##
## true.labels 1 2 3
## 1 20 0 0
## 2 0 20 0
## 3 0 0 20
All observations are perfectly clustered once again.
g. Using the “scale()” function, perform K-means clustering with K=3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.
km.out = kmeans(scale(x),3,nstart=20)
table(true.labels,km.out$cluster)
##
## true.labels 1 2 3
## 1 7 1 12
## 2 2 18 0
## 3 9 2 9
The results are worse than with the unscaled data: scaling gives every variable unit variance, so the 48 pure-noise variables (originally with sd = 0.001) are inflated to the same scale as the two informative variables, and the noise then dominates the Euclidean distances used by K-means.
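A supplementary check (not part of the original answer) illustrates this: before scaling, only the first two columns of x have appreciable spread, while the remaining columns are noise with sd around 0.001; after scale(), every column has sd 1.
summary(apply(x,2,sd))        # two informative columns with large sd, the rest ~ 0.001
summary(apply(scale(x),2,sd)) # all columns have sd 1 after scaling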
Q11
On the book website, there is a gene expression data set that consists of 40 tissue samples with measurements on 1000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group.
a. Load the data using “read.csv()”. You will need to select “header = F”.
genes = read.csv("D:/Documents/Ch10Ex11.csv",header=F)
dim(genes)
## [1] 1000   40
b. Apply hierarchical clustering to the samples using correlation-based distance, and plot the dendrogram. Do the genes separate the samples into two groups? Do your results depend on the type of linkage used?
hc.complete = hclust(as.dist(1-cor(genes)),method="complete")
plot(hc.complete)
hc.single = hclust(as.dist(1-cor(genes)),method="single")
plot(hc.single)
hc.average = hclust(as.dist(1-cor(genes)),method="average")
plot(hc.average)
The results depend on the type of linkage used: complete and single linkage separate the samples into two main clusters, whereas average linkage produces three.
c. Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.
We can use PCA to see which genes differ the most across the two groups: we examine the absolute value of the summed loadings for each gene, since this characterizes the total weight each gene carries across the principal components.
pr.out = prcomp(t(genes))
head(pr.out$rotation)
##              PC1          PC2          PC3          PC4          PC5
## [1,] -0.002434664 -0.030745799 0.009359932 0.009699551 -0.012847866
## [2,] -0.002016598 -0.025927592 0.050300983 -0.026082885 0.003488293
## [3,] 0.011233842 -0.003937802 0.014564920 0.054373032 -0.020411836
## [4,] 0.013912855 0.025625408 0.033998676 -0.011530298 -0.009364524
## [5,] 0.007293322 0.013590353 -0.008229702 -0.001343010 0.030002978
## [6,] 0.017928318 -0.026302699 -0.020728401 -0.024069152 -0.018619253
## PC6 PC7 PC8 PC9 PC10
## [1,] 0.023439995 0.010152261 -0.024602570 -0.021925557 -0.035003076
## [2,] 0.001605492 -0.037364376 -0.017332292 0.011319311 0.007802611
## [3,] 0.025337127 0.070772412 0.047340581 -0.013963868 0.023624407
## [4,] 0.029529539 0.002885764 -0.093667774 -0.008391226 -0.019226470
## [5,] -0.017042934 0.003555111 -0.053227214 -0.010479774 0.008446406
## [6,] -0.049103273 -0.040473304 -0.005455454 -0.003882692 0.028472950
## PC11 PC12 PC13 PC14 PC15
## [1,] 0.068133070 0.002322824 -0.050042837 -0.043957087 0.007542896
## [2,] -0.092523227 0.036265781 0.002951734 0.021272662 -0.040075267
## [3,] 0.017649621 0.021512568 0.013587072 0.005264628 -0.002918920
## [4,] 0.006695624 0.025918069 -0.081179098 0.017689681 0.045951951
## [5,] 0.053250618 -0.076682641 -0.049516326 -0.003282028 0.060755699
## [6,] -0.018103035 0.015433035 0.015967833 -0.006985293 -0.025237500
## PC16 PC17 PC18 PC19 PC20
## [1,] -0.04567334 -0.019899716 0.02946561 -0.009362957 -0.029855408
## [2,] 0.03433259 0.003735211 -0.01218600 -0.023466062 -0.005495696
## [3,] 0.01881913 0.003284517 0.02597233 0.021581732 0.016808524
## [4,] -0.01062858 0.018342677 -0.03334608 -0.052262385 -0.030868339
## [5,] -0.02562691 0.049934804 -0.04221058 -0.012279815 0.018004932
## [6,] -0.00394582 0.037319024 -0.02541592 -0.029423771 -0.012043007
## PC21 PC22 PC23 PC24 PC25
## [1,] -0.009190761 0.0230209664 -0.028970518 0.033060132 0.021453017
## [2,] -0.002808309 0.0079065160 -0.007921167 -0.034424716 0.011932971
## [3,] 0.010683143 -0.0392265342 0.004592080 0.026463736 -0.038085712
## [4,] 0.079419742 -0.0001627164 0.070396594 -0.002015954 0.006459925
## [5,] -0.038364004 -0.0230993500 -0.047439556 -0.001129421 -0.001285153
## [6,] -0.004522525 0.0304001071 0.016062043 -0.019329595 -0.034486284
## PC26 PC27 PC28 PC29 PC30
## [1,] 0.034447853 0.017729906 0.034708970 -0.028136309 -0.009873440
## [2,] 0.051079165 0.032435028 -0.006934708 -0.026307151 -0.008143422
## [3,] -0.064720318 -0.004616608 0.038015189 0.006455198 0.004570640
## [4,] 0.022138389 -0.017120199 0.074901678 0.015812685 0.016391804
## [5,] -0.010772594 0.010889806 -0.005305488 0.015248277 0.029303828
## [6,] 0.001489549 0.028082907 -0.036617970 -0.054760935 0.023337598
## PC31 PC32 PC33 PC34 PC35
## [1,] -0.03576788 0.016708304 -0.01823350 0.0007957941 -0.01443692
## [2,] -0.04439239 0.011968530 0.04168309 0.0123210140 0.02739196
## [3,] 0.02932866 0.026066011 0.02055204 -0.0716448783 0.02726941
## [4,] -0.03954720 0.014714963 0.02846397 0.0316775643 0.01866774
## [5,] 0.05494446 -0.005416152 0.03476606 0.0245476439 -0.04037835
## [6,] 0.01132569 0.006320203 -0.00237484 0.0061140832 0.01402898
## PC36 PC37 PC38 PC39 PC40
## [1,] 0.010652118 -0.009366629 -0.012754402 0.0020214363 0.07000786
## [2,] -0.002733484 -0.001318693 0.031410461 -0.0108377476 -0.06326465
## [3,] 0.020891497 -0.001380233 -0.025857254 0.0008800921 -0.32824953
## [4,] -0.027363133 -0.006080650 -0.025316130 -0.0235404170 -0.01675446
## [5,] -0.046869227 -0.017973802 0.002917167 0.0342753219 0.04896111
## [6,] 0.042083325 0.055817170 -0.010080327 0.0029965594 0.05407104
total.load = apply(pr.out$rotation,1,sum)
index = order(abs(total.load),decreasing=TRUE)
index[1:10]
## [1] 865  68 911 428 624  11 524 803 980 822
By this measure, these are the indices of the 10 genes that differ the most across the two groups.
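As a simple cross-check (an alternative not used in the original answer), we could also rank the genes by the absolute difference in mean expression between the healthy samples (columns 1-20) and the diseased samples (columns 21-40); genes that rank highly under both approaches are good candidates to report.
mean.diff = rowMeans(genes[,21:40]) - rowMeans(genes[,1:20])   # group mean difference per gene
order(abs(mean.diff),decreasing=TRUE)[1:10]                    # indices of the 10 largest differences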