Q1. This problem involves the \(K\)-means clustering algorithm.
We may write, using \(\sum_{i'\in C_k}x_{i'j} = |C_k|\overline{x}_{kj}\), \[\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^p(x_{ij} - x_{i'j})^2 = \sum_{i\in C_k}\sum_{j=1}^p x_{ij}^2 - 2\sum_{i\in C_k}\sum_{j=1}^p x_{ij}\overline{x}_{kj} + \sum_{i\in C_k}\sum_{j=1}^p x_{ij}^2 = 2\sum_{i\in C_k}\sum_{j=1}^p x_{ij}^2 - 2|C_k|\sum_{j=1}^p\overline{x}_{kj}^2,\] and, expanding the right-hand side of (10.12), \[2\sum_{i\in C_k}\sum_{j=1}^p(x_{ij} - \overline{x}_{kj})^2 = 2\sum_{i\in C_k}\sum_{j=1}^p x_{ij}^2 - 4\sum_{i\in C_k}\sum_{j=1}^p x_{ij}\overline{x}_{kj} + 2\sum_{i\in C_k}\sum_{j=1}^p\overline{x}_{kj}^2 = 2\sum_{i\in C_k}\sum_{j=1}^p x_{ij}^2 - 2|C_k|\sum_{j=1}^p\overline{x}_{kj}^2.\] The two expansions coincide, which proves (10.12).
This identity shows that at step 2(b) of Algorithm 10.1, when we assign each observation to the cluster whose centroid is closest, we decrease (or leave unchanged) the right-hand side of the identity, and therefore also the left-hand side, which is our objective. Another way of seeing this is that the identity shows that minimizing the sum of pairwise squared Euclidean distances within each cluster is the same as minimizing the within-cluster deviations from the centroids, which is exactly what each iteration of the algorithm does.
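As a quick numerical check (an addition, not part of the original derivation), identity (10.12) can be verified on a small random cluster:
# Numerical check of identity (10.12) on a random cluster of 5 observations and
# 2 features: the average pairwise squared distance within the cluster equals
# twice the sum of squared deviations from the cluster centroid.
set.seed(123)
C <- matrix(rnorm(10), ncol = 2)
lhs <- sum(as.matrix(dist(C))^2) / nrow(C)  # (1 / |C_k|) * sum over all pairs (i, i')
rhs <- 2 * sum(sweep(C, 2, colMeans(C))^2)  # 2 * sum_i sum_j (x_ij - xbar_kj)^2
all.equal(lhs, rhs)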
Q2. Suppose that we have four observations, for which we compute a dissimilarity matrix, given by \[\pmatrix{ &0.3 &0.4 &0.7 \cr 0.3 & &0.5 &0.8 \cr 0.4 &0.5 & & 0.45 \cr 0.7 &0.8 &0.45 & }.\] For instance, the dissimilarity between the first and second observations is 0.3, and the dissimilarity between the second and fourth observations is 0.8.
We will use Algorithm 10.2 to explain the different steps that lead to the dendrogram.
Step 1. We already have \[\pmatrix{ &0.3 &0.4 &0.7 \cr 0.3 & &0.5 &0.8 \cr 0.4 &0.5 & & 0.45 \cr 0.7 &0.8 &0.45 & }.\]
Step 2. \(i = 4\): We see that 0.3 is the minimum dissimilarity, so we fuse observations 1 and 2 to form cluster (1,2) at height 0.3. We now have the new dissimilarity matrix \[\pmatrix{ &0.5 &0.8 \cr 0.5 & &0.45 \cr 0.8 &0.45 & }.\] \(i = 3\): We now see that the minimum dissimilarity is 0.45, so we fuse observations 3 and 4 to form cluster (3,4) at height 0.45. We now have the new dissimilarity matrix \[\pmatrix{ &0.8 \cr 0.8 & }.\] \(i = 2\): It remains to fuse clusters (1,2) and (3,4) to form cluster ((1,2),(3,4)) at height 0.8.
d = as.dist(matrix(c(0, 0.3, 0.4, 0.7,
0.3, 0, 0.5, 0.8,
0.4, 0.5, 0.0, 0.45,
0.7, 0.8, 0.45, 0.0), nrow = 4))
plot(hclust(d, method = "complete"))
We will again use Algorithm 10.2 to explain the different steps that lead to the dendrogram.
Step 1. We already have \[\pmatrix{ &0.3 &0.4 &0.7 \cr 0.3 & &0.5 &0.8 \cr 0.4 &0.5 & & 0.45 \cr 0.7 &0.8 &0.45 & }.\]
Step 2. \(i = 4\): We see that 0.3 is the minimum dissimilarity, so we fuse observations 1 and 2 to form cluster (1,2) at height 0.3. We now have the new dissimilarity matrix \[\pmatrix{ &0.4 &0.7 \cr 0.4 & &0.45 \cr 0.7 &0.45 & }.\] \(i = 3\): We now see that the minimum dissimilarity is 0.4, so we fuse cluster (1,2) and observation 3 to form cluster ((1,2),3) at height 0.4. We now have the new dissimilarity matrix \[\pmatrix{ &0.45 \cr 0.45 & }.\] \(i = 2\): It remains to fuse cluster ((1,2),3) and observation 4 to form cluster (((1,2),3),4) at height 0.45.
plot(hclust(d, method = "single"))
If we cut the complete-linkage dendrogram so as to obtain two clusters, we get the clusters (1,2) and (3,4).
If we cut the single-linkage dendrogram so as to obtain two clusters, we get the clusters ((1,2),3) and (4).
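These cuts can be reproduced with the "cutree()" function applied to the fits above:
# Two-cluster cuts of the complete- and single-linkage dendrograms fitted above.
cutree(hclust(d, method = "complete"), 2)
cutree(hclust(d, method = "single"), 2)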
plot(hclust(d, method = "complete"), labels = c(2,1,4,3))
Q3. In this problem, you will perform \(K\)-means clustering manually, with \(K = 2\), on a small example with \(n = 6\) observations and \(p = 2\) features. The observations are as follows. \[\begin{array}{c|cc} \hline \mbox{Obs.} &X_1 &X_2 \cr \hline 1 &1 &4 \cr 2 &1 &3 \cr 3 &0 &4 \cr 4 &5 &1 \cr 5 &6 &2 \cr 6 &4 &0 \cr \hline \end{array}\]
x <- cbind(c(1, 1, 0, 5, 6, 4), c(4, 3, 4, 1, 2, 0))
plot(x[,1], x[,2])
set.seed(1)
labels <- sample(2, nrow(x), replace = TRUE)
labels
## [1] 1 1 2 2 1 2
plot(x[, 1], x[, 2], col = (labels + 1), pch = 20, cex = 2)
We can compute the centroid for the red cluster (cluster 1, observations 1, 2 and 5) with \[\overline{x}_{11} = \frac{1}{3}(1 + 1 + 6) = \frac{8}{3}\mbox{ and }\overline{x}_{12} = \frac{1}{3}(4 + 3 + 2) = 3,\] and for the green cluster (cluster 2, observations 3, 4 and 6) \[\overline{x}_{21} = \frac{1}{3}(0 + 5 + 4) = 3\mbox{ and }\overline{x}_{22} = \frac{1}{3}(4 + 1 + 0) = \frac{5}{3}.\]
centroid1 <- c(mean(x[labels == 1, 1]), mean(x[labels == 1, 2]))
centroid2 <- c(mean(x[labels == 2, 1]), mean(x[labels == 2, 2]))
plot(x[,1], x[,2], col=(labels + 1), pch = 20, cex = 2)
points(centroid1[1], centroid1[2], col = 2, pch = 4)
points(centroid2[1], centroid2[2], col = 3, pch = 4)
labels <- c(1, 1, 1, 2, 2, 2)  # assign each observation to the closest centroid
plot(x[, 1], x[, 2], col = (labels + 1), pch = 20, cex = 2)
points(centroid1[1], centroid1[2], col = 2, pch = 4)
points(centroid2[1], centroid2[2], col = 3, pch = 4)
We can compute the centroid for the red cluster (cluster 1, observations 1, 2 and 3) with \[\overline{x}_{11} = \frac{1}{3}(1 + 1 + 0) = \frac{2}{3}\mbox{ and }\overline{x}_{12} = \frac{1}{3}(4 + 3 + 4) = \frac{11}{3},\] and for the green cluster (cluster 2, observations 4, 5 and 6) \[\overline{x}_{21} = \frac{1}{3}(5 + 6 + 4) = 5\mbox{ and }\overline{x}_{22} = \frac{1}{3}(1 + 2 + 0) = 1.\]
centroid1 <- c(mean(x[labels == 1, 1]), mean(x[labels == 1, 2]))
centroid2 <- c(mean(x[labels == 2, 1]), mean(x[labels == 2, 2]))
plot(x[,1], x[,2], col=(labels + 1), pch = 20, cex = 2)
points(centroid1[1], centroid1[2], col = 2, pch = 4)
points(centroid2[1], centroid2[2], col = 3, pch = 4)
If we assign each observation to the centroid to which it is closest, nothing changes, so the algorithm has converged and terminates at this step.
plot(x[, 1], x[, 2], col=(labels + 1), pch = 20, cex = 2)
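A small check of this step (an addition), computing each observation's squared Euclidean distance to the two centroids above; which.min picks the nearer centroid and reproduces the labels (1, 1, 1, 2, 2, 2):
# Squared distance of every observation to both centroids; the nearest centroid
# for each observation matches the current labels, so nothing changes.
apply(x, 1, function(obs) {
  which.min(c(sum((obs - centroid1)^2), sum((obs - centroid2)^2)))
})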
Q4. Suppose that for a particular data set, we perform hierarchical clustering using single linkage and using complete linkage. We obtain two dendrograms.
There is not enough information to tell. For example, if \(d(1,4) = 2\), \(d(1,5) = 3\), \(d(2,4) = 1\), \(d(2,5) = 3\), \(d(3,4) = 4\) and \(d(3,5) = 1\), the single linkage dissimilarity between \(\{1,2,3\}\) and \(\{4,5\}\) would be equal to 1 and the complete linkage dissimilarity would be equal to 4. So, with single linkage they would fuse at a height of 1, and with complete linkage at a height of 4. But if all inter-observation distances were equal to 2, the single and complete linkage dissimilarities between \(\{1,2,3\}\) and \(\{4,5\}\) would both be equal to 2, and the two fusions would occur at the same height.
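A minimal sketch (an addition) that simply evaluates the two linkage dissimilarities for the example above:
# Inter-cluster distances between {1,2,3} and {4,5} from the example above.
inter <- c(d14 = 2, d15 = 3, d24 = 1, d25 = 3, d34 = 4, d35 = 1)
min(inter)  # single linkage: the clusters fuse at height 1
max(inter)  # complete linkage: the clusters fuse at height 4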
They would fuse at the same height. For two singleton clusters such as \(\{5\}\) and \(\{6\}\), the single and complete linkage dissimilarities coincide: if \(d(5,6) = 2\), both are equal to 2, so the clusters fuse at a height of 2 under either linkage.
Q5. In words, describe the results that you would expect if you performed \(K\)-means clustering of the eight shoppers in Figure 10.14, on the basis of their sock and computer purchases, with \(K = 2\). Give three answers, one for each of the variable scalings displayed. Explain.
If we take into consideration the unscaled variables, the number of socks purchased plays a much larger role than the binary computer purchase, so the shoppers cluster on the number of socks: \(\{1,2,7,8\}\) (more socks) and \(\{3,4,5,6\}\) (fewer socks), regardless of whether a computer was bought.
socks <- c(8, 11, 7, 6, 5, 6, 7, 8)
computers <- c(0, 0, 0, 0, 1, 1, 1, 1)
x <- cbind(socks, computers)
labels <- c(1, 1, 2, 2, 2, 2, 1, 1)
plot(x[, 1], x[, 2], col=(labels + 1), pch = 20, cex = 2, asp = 1)
If we take into consideration the scaled variables, the computer purchase is no longer dwarfed by the sock counts and now separates the shoppers cleanly, so we have the clusters \(\{5,6,7,8\}\) (computer purchased) and \(\{1,2,3,4\}\) (no computer purchased).
x <- cbind(scale(socks, center = FALSE), scale(computers, center = FALSE))
sd(computers)
## [1] 0.5345225
labels <- c(1, 1, 1, 1, 2, 2, 2, 2)  # clusters determined by the computer purchase
plot(x[, 1], x[, 2], col=(labels + 1), pch = 20, cex = 2, asp = 1)
If we take into consideration the variables measured by the number of dollars spent, the amount spent on computers dwarfs the amount spent on socks, so once again we have the clusters \(\{5,6,7,8\}\) (computer purchased) and \(\{1,2,3,4\}\) (no computer purchased).
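A small sketch of this last case, using hypothetical prices that are not given in the book (say $2 per pair of socks and $2000 per computer):
# Hypothetical prices: $2 per pair of socks and $2000 per computer.
dollars <- cbind(2 * socks, 2000 * computers)
labels <- c(1, 1, 1, 1, 2, 2, 2, 2)  # split determined by the computer purchase
plot(dollars[, 1], dollars[, 2], col = (labels + 1), pch = 20, cex = 2, asp = 1)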
Q6. A researcher collects expression measurements for 1,000 genes in 100 tissue samples. The data can be written as a 1,000 × 100 matrix, which we call \(X\), in which each row represents a gene and each column a tissue sample. Each tissue sample was processed on a different day, and the columns of \(X\) are ordered so that the samples that were processed earliest are on the left, and the samples that were processed later are on the right. The tissue samples belong to two groups: control (C) and treatment (T). The C and T samples were processed in a random order across the days. The researcher wishes to determine whether each gene’s expression measurements differ between the treatment and control groups.
As a pre-analysis (before comparing T versus C), the researcher performs a principal component analysis of the data, and finds that the first principal component (a vector of length 100) has a strong linear trend from left to right, and explains 10% of the variation. The researcher now remembers that each patient sample was run on one of two machines, A and B, and machine A was used more often in the earlier times while B was used more often later. The researcher has a record of which sample was run on which machine.
Saying that the first principal component “explains 10% of the variation” means that 90% of the variance in the data is not contained in the first principal component; equivalently, 90% of the information in the gene data set is lost when the tissue sample observations are projected onto the first principal component.
Given the time-wise linear trend found in the first principal component during the pre-analysis, and since machine A was used more often early on while machine B was used more often later, I would advise the researcher to include the machine used (A vs B) as a feature of the data set. This should capture the machine effect and increase the PVE of the first principal component, before applying the two-sample t-test.
set.seed(1)
Control <- matrix(rnorm(50 * 1000), ncol = 50)
Treatment <- matrix(rnorm(50 * 1000), ncol = 50)
X <- cbind(Control, Treatment)
X[1, ] <- seq(-18, 18 - .36, .36) # strong linear trend across the 100 samples for one gene
pr.out <- prcomp(scale(X))
summary(pr.out)$importance[, 1]
## Standard deviation Proportion of Variance Cumulative Proportion
## 3.148148 0.099110 0.099110
The first principal component explains 9.911% of the variance. Now we add the machine information (A vs B), encoded as 10 vs 0, as an additional row of \(X\).
X <- rbind(X, c(rep(10, 50), rep(0, 50)))
pr.out <- prcomp(scale(X))
summary(pr.out)$importance[, 1]
## Standard deviation Proportion of Variance Cumulative Proportion
## 3.397839 0.115450 0.115450
Now the first principal component explains 11.54% of the variance, an improvement of about 1.6 percentage points.
Q7. In the chapter, we mentioned the use of correlation-based distance and Euclidean distance as dissimilarity measures for hierarchical clustering. It turns out that these two measures are almost equivalent: if each observation has been centered to have mean zero and standard deviation one, and if we let \(r_{ij}\) denote the correlation between the \(i\)th and \(j\)th observations, then the quantity \(1 - r_{ij}\) is proportional to the squared Euclidean distance between the \(i\)th and \(j\)th observations. On the “USArrests” data, show that this proportionality holds.
library(ISLR)
set.seed(1)
dsc <- scale(USArrests)
d1 <- dist(dsc)^2
d2 <- as.dist(1 - cor(t(dsc)))
summary(d2 / d1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000086 0.069140 0.133900 0.234200 0.262600 4.888000
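Note that the ratios above are far from constant because “scale()” standardizes the variables (columns) of “USArrests”, whereas the stated proportionality assumes that each observation (row) has been centered and scaled. A sketch of the row-standardized version, for which every ratio equals \(1/(2(p - 1)) = 1/6\) with \(p = 4\) variables:
# Standardize each observation (row) to have mean zero and standard deviation one,
# as assumed in the exercise, then compare the two dissimilarity measures.
dsc.rows <- t(scale(t(USArrests)))
d1 <- dist(dsc.rows)^2
d2 <- as.dist(1 - cor(t(dsc.rows)))
summary(d2 / d1)  # every ratio equals 1/6, up to floating-point error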
Q8. In Section 10.2.3, a formula for calculating PVE was given in Equation 10.8. We also saw that the PVE can be obtained using the “sdev” output of the “prcomp()” function. On the “USArrests” data, calculate PVE in two ways:
pr.out <- prcomp(USArrests, scale = TRUE)
pr.var <- pr.out$sdev^2
pve <- pr.var / sum(pr.var)
sum(pr.var)
## [1] 4
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
loadings <- pr.out$rotation
USArrests2 <- scale(USArrests)
sumvar <- sum(apply(as.matrix(USArrests2)^2, 2, sum))
apply((as.matrix(USArrests2) %*% loadings)^2, 2, sum) / sumvar
## PC1 PC2 PC3 PC4
## 0.62006039 0.24744129 0.08914080 0.04335752
Q9. Consider the “USArrests” data. We will now perform hierarchical clustering on the states.
set.seed(2)
hc.complete <- hclust(dist(USArrests), method = "complete")
plot(hc.complete)
cutree(hc.complete, 3)
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
sd.data <- scale(USArrests)
hc.complete.sd <- hclust(dist(sd.data), method = "complete")
plot(hc.complete.sd)
cutree(hc.complete.sd, 3)
## Alabama Alaska Arizona Arkansas California
## 1 1 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 2 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 3 1 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 2 3 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 3 1 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 3 3 3
table(cutree(hc.complete, 3), cutree(hc.complete.sd, 3))
##
## 1 2 3
## 1 6 9 1
## 2 2 2 10
## 3 0 0 20
Scaling the variables affects the clusters obtained, although the trees are somewhat similar. Here the variables should be scaled beforehand because they are measured in different units: Murder, Assault and Rape are reported per 100,000 residents, while UrbanPop is a percentage.
Q10. In this problem, you will generate simulated data, and then perform PCA and \(K\)-means clustering on the data.
set.seed(2)
x <- matrix(rnorm(20 * 3 * 50, mean = 0, sd = 0.001), ncol = 50)
x[1:20, 2] <- 1
x[21:40, 1] <- 2
x[21:40, 2] <- 2
x[41:60, 1] <- 1
true.labels <- c(rep(1, 20), rep(2, 20), rep(3, 20))
pr.out <- prcomp(x)
plot(pr.out$x[, 1:2], col = true.labels, xlab = "Z1", ylab = "Z2", pch = 19)  # color by true class
km.out <- kmeans(x, 3, nstart = 20)
table(true.labels, km.out$cluster)
##
## true.labels 1 2 3
## 1 20 0 0
## 2 0 20 0
## 3 0 0 20
The observations are perfectly clustered.
km.out <- kmeans(x, 2, nstart = 20)
table(true.labels, km.out$cluster)
##
## true.labels 1 2
## 1 20 0
## 2 0 20
## 3 20 0
All the observations of one of the three classes are now absorbed into one of the two clusters.
km.out <- kmeans(x, 4, nstart = 20)
table(true.labels, km.out$cluster)
##
## true.labels 1 2 3 4
## 1 9 0 0 11
## 2 0 20 0 0
## 3 0 0 20 0
The observations of the first class are now split between two clusters.
km.out <- kmeans(pr.out$x[, 1:2], 3, nstart = 20)
table(true.labels, km.out$cluster)
##
## true.labels 1 2 3
## 1 0 0 20
## 2 0 20 0
## 3 20 0 0
All observations are perfectly clustered once again.
km.out <- kmeans(scale(x), 3, nstart = 20)
table(true.labels, km.out$cluster)
##
## true.labels 1 2 3
## 1 8 2 10
## 2 0 19 1
## 3 11 1 8
We obtain worse results than with the unscaled data because scaling changes the distances between the observations: only the first two columns carry the class mean shifts, while the remaining 48 columns are pure noise with standard deviation 0.001, so giving every column unit variance inflates the noise columns until they dominate the pairwise distances and wash out the cluster structure.
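A quick check of the column standard deviations (an addition) illustrates this:
# Only the first two columns carry the class mean shifts; the remaining columns
# are pure noise with sd ~ 0.001. After scale(), every column has unit variance,
# so the 48 noise columns dominate the distances between observations.
round(apply(x, 2, sd)[1:5], 4)
round(apply(scale(x), 2, sd)[1:5], 4)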
Q11. On the book website, there is a gene expression data set that consists of 40 tissue samples with measurements on 1000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group.
genes <- read.csv("Ch10Ex11.csv", header = FALSE)
hc.complete <- hclust(as.dist(1 - cor(genes)), method = "complete")
plot(hc.complete)
hc.single <- hclust(as.dist(1 - cor(genes)), method = "single")
plot(hc.single)
hc.average <- hclust(as.dist(1 - cor(genes)), method = "average")
plot(hc.average)
The results are quite different depending on the linkage method: the complete and single linkage dendrograms suggest two clusters, while the average linkage dendrogram suggests three.
We may use PCA to see which genes differ the most across the two groups. We examine the absolute value of the total loading for each gene, summed over all principal components, as it characterizes the weight of each gene.
pr.out <- prcomp(t(genes))
head(pr.out$rotation)
## PC1 PC2 PC3 PC4 PC5
## [1,] -0.002434664 -0.030745799 0.009359932 0.009699551 -0.012847866
## [2,] -0.002016598 -0.025927592 0.050300983 -0.026082885 0.003488293
## [3,] 0.011233842 -0.003937802 0.014564920 0.054373032 -0.020411836
## [4,] 0.013912855 0.025625408 0.033998676 -0.011530298 -0.009364524
## [5,] 0.007293322 0.013590353 -0.008229702 -0.001343010 0.030002978
## [6,] 0.017928318 -0.026302699 -0.020728401 -0.024069152 -0.018619253
## PC6 PC7 PC8 PC9 PC10
## [1,] 0.023439995 0.010152261 -0.024602570 -0.021925557 -0.035003076
## [2,] 0.001605492 -0.037364376 -0.017332292 0.011319311 0.007802611
## [3,] 0.025337127 0.070772412 0.047340581 -0.013963868 0.023624407
## [4,] 0.029529539 0.002885764 -0.093667774 -0.008391226 -0.019226470
## [5,] -0.017042934 0.003555111 -0.053227214 -0.010479774 0.008446406
## [6,] -0.049103273 -0.040473304 -0.005455454 -0.003882692 0.028472950
## PC11 PC12 PC13 PC14 PC15
## [1,] 0.068133070 0.002322824 -0.050042837 -0.043957087 0.007542896
## [2,] -0.092523227 0.036265781 0.002951734 0.021272662 -0.040075267
## [3,] 0.017649621 0.021512568 0.013587072 0.005264628 -0.002918920
## [4,] 0.006695624 0.025918069 -0.081179098 0.017689681 0.045951951
## [5,] 0.053250618 -0.076682641 -0.049516326 -0.003282028 0.060755699
## [6,] -0.018103035 0.015433035 0.015967833 -0.006985293 -0.025237500
## PC16 PC17 PC18 PC19 PC20
## [1,] -0.04567334 -0.019899716 0.02946561 -0.009362957 -0.029855408
## [2,] 0.03433259 0.003735211 -0.01218600 -0.023466062 -0.005495696
## [3,] 0.01881913 0.003284517 0.02597233 0.021581732 0.016808524
## [4,] -0.01062858 0.018342677 -0.03334608 -0.052262385 -0.030868339
## [5,] -0.02562691 0.049934804 -0.04221058 -0.012279815 0.018004932
## [6,] -0.00394582 0.037319024 -0.02541592 -0.029423771 -0.012043007
## PC21 PC22 PC23 PC24 PC25
## [1,] -0.009190761 0.0230209664 -0.028970518 0.033060132 0.021453017
## [2,] -0.002808309 0.0079065160 -0.007921167 -0.034424716 0.011932971
## [3,] 0.010683143 -0.0392265342 0.004592080 0.026463736 -0.038085712
## [4,] 0.079419742 -0.0001627164 0.070396594 -0.002015954 0.006459925
## [5,] -0.038364004 -0.0230993500 -0.047439556 -0.001129421 -0.001285153
## [6,] -0.004522525 0.0304001071 0.016062043 -0.019329595 -0.034486284
## PC26 PC27 PC28 PC29 PC30
## [1,] 0.034447853 0.017729906 0.034708970 -0.028136309 -0.009873440
## [2,] 0.051079165 0.032435028 -0.006934708 -0.026307151 -0.008143422
## [3,] -0.064720318 -0.004616608 0.038015189 0.006455198 0.004570640
## [4,] 0.022138389 -0.017120199 0.074901678 0.015812685 0.016391804
## [5,] -0.010772594 0.010889806 -0.005305488 0.015248277 0.029303828
## [6,] 0.001489549 0.028082907 -0.036617970 -0.054760935 0.023337598
## PC31 PC32 PC33 PC34 PC35
## [1,] -0.03576788 0.016708304 -0.01823350 0.0007957941 -0.01443692
## [2,] -0.04439239 0.011968530 0.04168309 0.0123210140 0.02739196
## [3,] 0.02932866 0.026066011 0.02055204 -0.0716448783 0.02726941
## [4,] -0.03954720 0.014714963 0.02846397 0.0316775643 0.01866774
## [5,] 0.05494446 -0.005416152 0.03476606 0.0245476439 -0.04037835
## [6,] 0.01132569 0.006320203 -0.00237484 0.0061140832 0.01402898
## PC36 PC37 PC38 PC39 PC40
## [1,] 0.010652118 -0.009366629 -0.012754402 0.0020214363 0.07000786
## [2,] -0.002733484 -0.001318693 0.031410461 -0.0108377476 -0.06326465
## [3,] 0.020891497 -0.001380233 -0.025857254 0.0008800921 -0.32824953
## [4,] -0.027363133 -0.006080650 -0.025316130 -0.0235404170 -0.01675446
## [5,] -0.046869227 -0.017973802 0.002917167 0.0342753219 0.04896111
## [6,] 0.042083325 0.055817170 -0.010080327 0.0029965594 0.05407104
total.load <- apply(pr.out$rotation, 1, sum)
index <- order(abs(total.load), decreasing = TRUE)
index[1:10]
## [1] 865 68 911 428 624 11 524 803 980 822
According to this measure, these are the 10 genes whose expression differs the most across the two groups.
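As an additional cross-check (an addition, not part of the loading-based approach above), one could rank the genes by the absolute difference in mean expression between the healthy samples (columns 1 to 20) and the diseased samples (columns 21 to 40):
# Difference in mean expression between the diseased (columns 21-40) and healthy
# (columns 1-20) samples for each gene; the largest absolute differences point to
# the genes that differ the most across the two groups.
group.diff <- rowMeans(genes[, 21:40]) - rowMeans(genes[, 1:20])
head(order(abs(group.diff), decreasing = TRUE), 10)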