On the USArrests data, show that this proportionality holds.
Hint: The Euclidean distance can be calculated using the dist() function, and correlations can be calculated using the cor() function.
library(ISLR)
set.seed(1)
dsc <- scale(USArrests)
d1 <- dist(dsc)^2
d2 <- as.dist(1 - cor(t(dsc)))
summary(d2 / d1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000086 0.069135 0.133943 0.234193 0.262589 4.887686
On the USArrests data, calculate PVE in two ways:
pr.out <- prcomp(USArrests, scale = TRUE)
pr.var <- pr.out$sdev^2
pve <- pr.var / sum(pr.var)
sum(pr.var)
## [1] 4
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
loadings <- pr.out$rotation
USArrests2 <- scale(USArrests)
sumvar <- sum(apply(as.matrix(USArrests2)^2, 2, sum))
apply((as.matrix(USArrests2) %*% loadings)^2, 2, sum) / sumvar
## PC1 PC2 PC3 PC4
## 0.62006039 0.24744129 0.08914080 0.04335752
set.seed(2)
hc.complete <- hclust(dist(USArrests), method = "complete")
plot(hc.complete)
cutree(hc.complete, 3)
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
sd.data <- scale(USArrests)
hc.complete.sd <- hclust(dist(sd.data), method = "complete")
plot(hc.complete.sd)
cutree(hc.complete.sd, 3)
## Alabama Alaska Arizona Arkansas California
## 1 1 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 2 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 3 1 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 2 3 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 3 1 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 3 3 3
table(cutree(hc.complete, 3), cutree(hc.complete.sd, 3))
##
## 1 2 3
## 1 6 9 1
## 2 2 2 10
## 3 0 0 20
Scaling the variables affect the clusters obtained although the trees are somewhat similar. The variables should be scaled beforehand because the data measures have different units.
Hint: There are a number of functions in R that you can use to generate data. One example is the rnorm() function; runif() is another option. Be sure to add a mean shift to the observations in each class so that there are three distinct classes.
set.seed(2)
x <- matrix(rnorm(20 * 3 * 50, mean = 0, sd = 0.001), ncol = 50)
x[1:20, 2] <- 1
x[21:40, 1] <- 2
x[21:40, 2] <- 2
x[41:60, 1] <- 1
true.labels <- c(rep(1, 20), rep(2, 20), rep(3, 20))
pr.out <- prcomp(x)
plot(pr.out$x[, 1:2], col = 1:3, xlab = "Z1", ylab = "Z2", pch = 19)
Hint: You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: K-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.
km.out <- kmeans(x, 3, nstart = 20)
table(true.labels, km.out$cluster)
##
## true.labels 1 2 3
## 1 20 0 0
## 2 0 20 0
## 3 0 0 20
km.out <- kmeans(x, 2, nstart = 20)
table(true.labels, km.out$cluster)
##
## true.labels 1 2
## 1 20 0
## 2 0 20
## 3 20 0
The observations are perfectly clustered.
All observations of one of the three clusters is now absorbed in one of the two clusters.
km.out <- kmeans(x, 4, nstart = 20)
table(true.labels, km.out$cluster)
##
## true.labels 1 2 3 4
## 1 9 0 0 11
## 2 0 20 0 0
## 3 0 0 20 0
The first cluster is splitted into two clusters.
km.out <- kmeans(pr.out$x[, 1:2], 3, nstart = 20)
table(true.labels, km.out$cluster)
##
## true.labels 1 2 3
## 1 0 0 20
## 2 0 20 0
## 3 20 0 0
All observations are perfectly clustered once again.
km.out <- kmeans(scale(x), 3, nstart = 20)
table(true.labels, km.out$cluster)
##
## true.labels 1 2 3
## 1 8 2 10
## 2 0 19 1
## 3 11 1 8
We have worse results than with unscaled data, as scaling affects the distance between the observations.
genes <- read.csv("Ch10Ex11.csv", header = FALSE)
hc.complete <- hclust(as.dist(1 - cor(genes)), method = "complete")
plot(hc.complete)
hc.single <- hclust(as.dist(1 - cor(genes)), method = "single")
plot(hc.single)
hc.average <- hclust(as.dist(1 - cor(genes)), method = "average")
plot(hc.average)
The results are pretty different when using different linkage methods as we obtain two clusters for complete and single linkages or three clusters for average cluster.
We may use PCA to see which genes differ the most. We will examine the absolute values of the total loadings for each gene as it characterizes the weight of each gene.
pr.out <- prcomp(t(genes))
head(pr.out$rotation)
## PC1 PC2 PC3 PC4 PC5
## [1,] -0.002434664 -0.030745799 0.009359932 0.009699551 -0.012847866
## [2,] -0.002016598 -0.025927592 0.050300983 -0.026082885 0.003488293
## [3,] 0.011233842 -0.003937802 0.014564920 0.054373032 -0.020411836
## [4,] 0.013912855 0.025625408 0.033998676 -0.011530298 -0.009364524
## [5,] 0.007293322 0.013590353 -0.008229702 -0.001343010 0.030002978
## [6,] 0.017928318 -0.026302699 -0.020728401 -0.024069152 -0.018619253
## PC6 PC7 PC8 PC9 PC10
## [1,] 0.023439995 0.010152261 -0.024602570 -0.021925557 -0.035003076
## [2,] 0.001605492 -0.037364376 -0.017332292 0.011319311 0.007802611
## [3,] 0.025337127 0.070772412 0.047340581 -0.013963868 0.023624407
## [4,] 0.029529539 0.002885764 -0.093667774 -0.008391226 -0.019226470
## [5,] -0.017042934 0.003555111 -0.053227214 -0.010479774 0.008446406
## [6,] -0.049103273 -0.040473304 -0.005455454 -0.003882692 0.028472950
## PC11 PC12 PC13 PC14 PC15
## [1,] 0.068133070 0.002322824 -0.050042837 -0.043957087 0.007542896
## [2,] -0.092523227 0.036265781 0.002951734 0.021272662 -0.040075267
## [3,] 0.017649621 0.021512568 0.013587072 0.005264628 -0.002918920
## [4,] 0.006695624 0.025918069 -0.081179098 0.017689681 0.045951951
## [5,] 0.053250618 -0.076682641 -0.049516326 -0.003282028 0.060755699
## [6,] -0.018103035 0.015433035 0.015967833 -0.006985293 -0.025237500
## PC16 PC17 PC18 PC19 PC20
## [1,] -0.04567334 -0.019899716 0.02946561 -0.009362957 -0.029855408
## [2,] 0.03433259 0.003735211 -0.01218600 -0.023466062 -0.005495696
## [3,] 0.01881913 0.003284517 0.02597233 0.021581732 0.016808524
## [4,] -0.01062858 0.018342677 -0.03334608 -0.052262385 -0.030868339
## [5,] -0.02562691 0.049934804 -0.04221058 -0.012279815 0.018004932
## [6,] -0.00394582 0.037319024 -0.02541592 -0.029423771 -0.012043007
## PC21 PC22 PC23 PC24 PC25
## [1,] -0.009190761 0.0230209664 -0.028970518 0.033060132 0.021453017
## [2,] -0.002808309 0.0079065160 -0.007921167 -0.034424716 0.011932971
## [3,] 0.010683143 -0.0392265342 0.004592080 0.026463736 -0.038085712
## [4,] 0.079419742 -0.0001627164 0.070396594 -0.002015954 0.006459925
## [5,] -0.038364004 -0.0230993500 -0.047439556 -0.001129421 -0.001285153
## [6,] -0.004522525 0.0304001071 0.016062043 -0.019329595 -0.034486284
## PC26 PC27 PC28 PC29 PC30
## [1,] 0.034447853 0.017729906 0.034708970 -0.028136309 -0.009873440
## [2,] 0.051079165 0.032435028 -0.006934708 -0.026307151 -0.008143422
## [3,] -0.064720318 -0.004616608 0.038015189 0.006455198 0.004570640
## [4,] 0.022138389 -0.017120199 0.074901678 0.015812685 0.016391804
## [5,] -0.010772594 0.010889806 -0.005305488 0.015248277 0.029303828
## [6,] 0.001489549 0.028082907 -0.036617970 -0.054760935 0.023337598
## PC31 PC32 PC33 PC34 PC35
## [1,] -0.03576788 0.016708304 -0.01823350 0.0007957941 -0.01443692
## [2,] -0.04439239 0.011968530 0.04168309 0.0123210140 0.02739196
## [3,] 0.02932866 0.026066011 0.02055204 -0.0716448783 0.02726941
## [4,] -0.03954720 0.014714963 0.02846397 0.0316775643 0.01866774
## [5,] 0.05494446 -0.005416152 0.03476606 0.0245476439 -0.04037835
## [6,] 0.01132569 0.006320203 -0.00237484 0.0061140832 0.01402898
## PC36 PC37 PC38 PC39 PC40
## [1,] 0.010652118 -0.009366629 -0.012754402 0.0020214363 0.07000786
## [2,] -0.002733484 -0.001318693 0.031410461 -0.0108377476 -0.06326465
## [3,] 0.020891497 -0.001380233 -0.025857254 0.0008800921 -0.32824953
## [4,] -0.027363133 -0.006080650 -0.025316130 -0.0235404170 -0.01675446
## [5,] -0.046869227 -0.017973802 0.002917167 0.0342753219 0.04896111
## [6,] 0.042083325 0.055817170 -0.010080327 0.0029965594 0.05407104
total.load <- apply(pr.out$rotation, 1, sum)
index <- order(abs(total.load), decreasing = TRUE)
index[1:10]
## [1] 865 68 911 428 624 11 524 803 980 822
These are the 10 most different genes across the two groups.