STA 6933 Spring 2025 Homework 1

HW 1

1. Consider the USArrests data. We will now perform hierarchical clustering on the states.

data(USArrests)
head(USArrests)

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

states=row.names(USArrests)
states

##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

names(USArrests)

## [1] "Murder"   "Assault"  "UrbanPop" "Rape"

apply(USArrests, 2, mean)

##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232

apply(USArrests, 2, var)

##     Murder    Assault   UrbanPop       Rape 
##   18.97047 6945.16571  209.51878   87.72916

summary(USArrests)

##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.

set.seed(1)
hc.complete <- hclust(dist(USArrests), method="complete")
plot(hc.complete, main = "Hierarchical Clustering of US States", sub = "Complete Linkage, Euclidean Distance", xlab = "States", cex = 0.5)

b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?

clusters <- cutree(hc.complete, 3)
split(rownames(USArrests), clusters)

## $`1`
##  [1] "Alabama"        "Alaska"         "Arizona"        "California"    
##  [5] "Delaware"       "Florida"        "Illinois"       "Louisiana"     
##  [9] "Maryland"       "Michigan"       "Mississippi"    "Nevada"        
## [13] "New Mexico"     "New York"       "North Carolina" "South Carolina"
## 
## $`2`
##  [1] "Arkansas"      "Colorado"      "Georgia"       "Massachusetts"
##  [5] "Missouri"      "New Jersey"    "Oklahoma"      "Oregon"       
##  [9] "Rhode Island"  "Tennessee"     "Texas"         "Virginia"     
## [13] "Washington"    "Wyoming"      
## 
## $`3`
##  [1] "Connecticut"   "Hawaii"        "Idaho"         "Indiana"      
##  [5] "Iowa"          "Kansas"        "Kentucky"      "Maine"        
##  [9] "Minnesota"     "Montana"       "Nebraska"      "New Hampshire"
## [13] "North Dakota"  "Ohio"          "Pennsylvania"  "South Dakota" 
## [17] "Utah"          "Vermont"       "West Virginia" "Wisconsin"

c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.

USArrests.scaled <- scale(USArrests)
hc.complete.scaled <- hclust(dist(USArrests.scaled), method = "complete")
plot(hc.complete.scaled, main = "Hierarchical Clustering of US States (Scaled Features)", sub = "Complete Linkage, Euclidean Distance", xlab = "States", cex = 0.5)

d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

clusters.scaled <- cutree(hc.complete.scaled, 3)
split(rownames(USArrests), clusters.scaled)

## $`1`
## [1] "Alabama"        "Alaska"         "Georgia"        "Louisiana"     
## [5] "Mississippi"    "North Carolina" "South Carolina" "Tennessee"     
## 
## $`2`
##  [1] "Arizona"    "California" "Colorado"   "Florida"    "Illinois"  
##  [6] "Maryland"   "Michigan"   "Nevada"     "New Mexico" "New York"  
## [11] "Texas"     
## 
## $`3`
##  [1] "Arkansas"      "Connecticut"   "Delaware"      "Hawaii"       
##  [5] "Idaho"         "Indiana"       "Iowa"          "Kansas"       
##  [9] "Kentucky"      "Maine"         "Massachusetts" "Minnesota"    
## [13] "Missouri"      "Montana"       "Nebraska"      "New Hampshire"
## [17] "New Jersey"    "North Dakota"  "Ohio"          "Oklahoma"     
## [21] "Oregon"        "Pennsylvania"  "Rhode Island"  "South Dakota" 
## [25] "Utah"          "Vermont"       "Virginia"      "Washington"   
## [29] "West Virginia" "Wisconsin"     "Wyoming"

table(clusters, clusters.scaled)

##         clusters.scaled
## clusters  1  2  3
##        1  6  9  1
##        2  2  2 10
##        3  0  0 20

Scaling the variables had a large impact on the hierarchical clustering obtained.

Cluster 1 (before scaling): 6 states remained in Cluster 1 after scaling, 9 states moved to Cluster 2, and 1 state moved to Cluster 3.

Cluster 2 (before scaling): 2 states moved to Cluster 1, 2 stayed in Cluster 2, and 10 moved to Cluster 3 after scaling.

Cluster 3 (before scaling): All 20 states remained in Cluster 3.

The largest shift happened in Cluster 2, where most states moved to Cluster 3 after scaling. Cluster 3 remained mostly unchanged, meaning that states in this cluster were already closely grouped regardless of scaling.

In my opinion, the variables should be scaled before the inter-observation dissimilarities are computed. Without scaling, clustering outcomes can be dominated by features with larger ranges. By scaling variables to mean 0 and std dev 1, we created a cluster structure that assumes all variables contribute equally.

2. We using the following R code to simulate a data and then perform PCA and K-means clustering on the data.

set.seed(1)
x = matrix(runif(n=3000, 0, 0.01), ncol=50)
x[1:20, 2] = 2
x[21:40, 1] = 5
x[21:40, 2] = 5
x[41:60, 1] = 2

dim(x)

## [1] 60 50

a) Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes.

pr.out <- prcomp(x)
summary(pr.out)

## Importance of components:
##                           PC1    PC2     PC3      PC4      PC5      PC6
## Standard deviation     2.6874 1.1614 0.00526 0.004927 0.004869 0.004755
## Proportion of Variance 0.8426 0.1574 0.00000 0.000000 0.000000 0.000000
## Cumulative Proportion  0.8426 1.0000 0.99996 0.999960 0.999960 0.999960
##                             PC7      PC8     PC9     PC10     PC11     PC12
## Standard deviation     0.004605 0.004264 0.00414 0.004112 0.003979 0.003884
## Proportion of Variance 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000
## Cumulative Proportion  0.999970 0.999970 0.99997 0.999970 0.999980 0.999980
##                            PC13     PC14     PC15     PC16     PC17     PC18
## Standard deviation     0.003783 0.003693 0.003667 0.003471 0.003417 0.003386
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion  0.999980 0.999980 0.999980 0.999980 0.999980 0.999990
##                            PC19     PC20     PC21     PC22     PC23     PC24
## Standard deviation     0.003196 0.003188 0.003041 0.003005 0.002929 0.002773
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion  0.999990 0.999990 0.999990 0.999990 0.999990 0.999990
##                            PC25     PC26     PC27     PC28     PC29     PC30
## Standard deviation     0.002566 0.002498 0.002283 0.002235 0.002146 0.002113
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion  0.999990 0.999990 0.999990 1.000000 1.000000 1.000000
##                            PC31     PC32     PC33     PC34     PC35     PC36
## Standard deviation     0.002053 0.001867 0.001817 0.001729 0.001648 0.001533
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion  1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
##                            PC37     PC38    PC39     PC40     PC41     PC42
## Standard deviation     0.001494 0.001369 0.00131 0.001276 0.001139 0.001133
## Proportion of Variance 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000
## Cumulative Proportion  1.000000 1.000000 1.00000 1.000000 1.000000 1.000000
##                            PC43      PC44      PC45      PC46      PC47
## Standard deviation     0.001006 0.0008864 0.0007426 0.0007119 0.0006409
## Proportion of Variance 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  1.000000 1.0000000 1.0000000 1.0000000 1.0000000
##                             PC48     PC49      PC50
## Standard deviation     0.0005407 0.000424 0.0003072
## Proportion of Variance 0.0000000 0.000000 0.0000000
## Cumulative Proportion  1.0000000 1.000000 1.0000000

pr.out$x[,1:2]

##             PC1           PC2
##  [1,] -1.886084 -1.4126555398
##  [2,] -1.885333 -1.4119024616
##  [3,] -1.883911 -1.4104753285
##  [4,] -1.881536 -1.4081065143
##  [5,] -1.886535 -1.4130992761
##  [6,] -1.881609 -1.4081843495
##  [7,] -1.881284 -1.4078650757
##  [8,] -1.883291 -1.4098560032
##  [9,] -1.883509 -1.4100937102
## [10,] -1.887531 -1.4140963348
## [11,] -1.886506 -1.4130745147
## [12,] -1.886709 -1.4132851764
## [13,] -1.883098 -1.4096604647
## [14,] -1.885245 -1.4118304476
## [15,] -1.882519 -1.4090875485
## [16,] -1.884443 -1.4110047224
## [17,] -1.882886 -1.4094606762
## [18,] -1.880948 -1.4075182107
## [19,] -1.885273 -1.4118401119
## [20,] -1.882462 -1.4090397810
## [21,]  3.768770  0.0001737612
## [22,]  3.768773  0.0001797876
## [23,]  3.768767  0.0001715874
## [24,]  3.768774  0.0001749279
## [25,]  3.768775  0.0001667985
## [26,]  3.768778  0.0001762815
## [27,]  3.768778  0.0001770803
## [28,]  3.768768  0.0001807997
## [29,]  3.768771  0.0001706622
## [30,]  3.768768  0.0001777160
## [31,]  3.768780  0.0001756922
## [32,]  3.768774  0.0001728183
## [33,]  3.768771  0.0001745470
## [34,]  3.768774  0.0001753220
## [35,]  3.768780  0.0001825239
## [36,]  3.768770  0.0001764830
## [37,]  3.768773  0.0001849647
## [38,]  3.768776  0.0001689022
## [39,]  3.768774  0.0001715969
## [40,]  3.768771  0.0001699520
## [41,] -1.883577  1.4092686285
## [42,] -1.885707  1.4113967583
## [43,] -1.886295  1.4119916678
## [44,] -1.881188  1.4068770473
## [45,] -1.883727  1.4094219588
## [46,] -1.886693  1.4123889646
## [47,] -1.887287  1.4129766263
## [48,] -1.884825  1.4105245004
## [49,] -1.881670  1.4073681209
## [50,] -1.883975  1.4096678506
## [51,] -1.881305  1.4070062987
## [52,] -1.883034  1.4087254643
## [53,] -1.885682  1.4113753416
## [54,] -1.885152  1.4108489627
## [55,] -1.887158  1.4128518849
## [56,] -1.888118  1.4138029360
## [57,] -1.883147  1.4088467540
## [58,] -1.887476  1.4131778175
## [59,] -1.885050  1.4107409572
## [60,] -1.883682  1.4093755031

plot(pr.out$x[,1:2], col = 1:3, xlab = "PC1", ylab = "PC2", pch = 19)

b) Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels? Hint: You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: K-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.

km.out <- kmeans(x, 3, nstart = 20)
km.out

## K-means clustering with 3 clusters of sizes 20, 20, 20
## 
## Cluster means:
##          [,1]        [,2]        [,3]        [,4]        [,5]        [,6]
## 1 5.000000000 5.000000000 0.004292492 0.005217101 0.003641431 0.005947972
## 2 0.005551671 2.000000000 0.005166836 0.005484659 0.004972492 0.004618285
## 3 2.000000000 0.004905241 0.006024362 0.004481017 0.003438724 0.005039361
##          [,7]        [,8]        [,9]       [,10]       [,11]       [,12]
## 1 0.004752187 0.004264830 0.005844531 0.005260191 0.004829985 0.005221851
## 2 0.004793031 0.005133796 0.004568124 0.005312861 0.005521645 0.004789253
## 3 0.006430145 0.004849288 0.004879257 0.005388688 0.005607340 0.005460055
##         [,13]       [,14]       [,15]       [,16]       [,17]       [,18]
## 1 0.004623373 0.005507304 0.004386255 0.006040156 0.004216303 0.004591728
## 2 0.004532753 0.004919870 0.004443105 0.003835058 0.005174819 0.004863420
## 3 0.005295802 0.005807342 0.004115938 0.004918375 0.005346474 0.004807140
##         [,19]       [,20]       [,21]       [,22]       [,23]       [,24]
## 1 0.004690692 0.004780629 0.004249589 0.004303261 0.003391395 0.004497712
## 2 0.005089885 0.004356273 0.005575948 0.005273448 0.004913316 0.005158933
## 3 0.004889102 0.004589221 0.005044854 0.004905811 0.004818232 0.005316169
##         [,25]       [,26]       [,27]       [,28]       [,29]       [,30]
## 1 0.004934328 0.005853189 0.004763984 0.004924727 0.003575623 0.004665989
## 2 0.004580133 0.004958652 0.005596034 0.004711264 0.005718374 0.005147582
## 3 0.005135477 0.005391266 0.004634117 0.005557899 0.005459883 0.004388423
##         [,31]       [,32]       [,33]       [,34]       [,35]       [,36]
## 1 0.005631196 0.005004424 0.004121293 0.004412899 0.005321122 0.004177870
## 2 0.004648414 0.004670802 0.004967435 0.004564740 0.004373587 0.005841328
## 3 0.005756571 0.004767306 0.005565726 0.003729663 0.005554961 0.003845988
##         [,37]       [,38]       [,39]       [,40]       [,41]       [,42]
## 1 0.005149373 0.003549839 0.004598301 0.005725721 0.006244830 0.004426540
## 2 0.005256790 0.004157657 0.005288231 0.006248561 0.005244998 0.004281433
## 3 0.004848821 0.005349297 0.004297070 0.004336235 0.005718933 0.005287813
##         [,43]       [,44]       [,45]       [,46]       [,47]       [,48]
## 1 0.005662519 0.004538859 0.004936487 0.004151314 0.005363267 0.005508981
## 2 0.005008912 0.005659766 0.003554611 0.004574720 0.003803449 0.005997626
## 3 0.005807893 0.004656849 0.004331194 0.005114674 0.003690784 0.005671462
##         [,49]       [,50]
## 1 0.005240999 0.005038338
## 2 0.004874033 0.004476749
## 3 0.005114646 0.004802585
## 
## Clustering vector:
##  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 0.007365112 0.008123523 0.008381756
##  (between_SS / total_SS = 100.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

cluster.labels <- km.out$cluster
true.labels <- rep(1:3, each=20)
true.labels

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [39] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

table(true.labels, cluster.labels)

##            cluster.labels
## true.labels  1  2  3
##           1  0 20  0
##           2 20  0  0
##           3  0  0 20

The first row (true label = 1) shows that all 20 observations in class 1 were assigned to cluster 2.

The second row (true label = 2) shows that all 20 observations in class 2 were assigned to cluster 1.

The third row (true label = 3) shows that all 20 observations in class 3 were assigned to cluster 3.

The clusters I obtained in K-means clustering successfully identified the three groups of data. Since K-means clustering arbitrarily numbers the clusters, class 1 does not need to be assigned to cluster 1, etc.

c) Perform K-means clustering with K = 2. Describe your results.

km.out <- kmeans(x, 2, nstart = 20)
km.out

## K-means clustering with 2 clusters of sizes 20, 40
## 
## Cluster means:
##       [,1]     [,2]        [,3]        [,4]        [,5]        [,6]        [,7]
## 1 5.000000 5.000000 0.004292492 0.005217101 0.003641431 0.005947972 0.004752187
## 2 1.002776 1.002453 0.005595599 0.004982838 0.004205608 0.004828823 0.005611588
##          [,8]        [,9]       [,10]       [,11]       [,12]       [,13]
## 1 0.004264830 0.005844531 0.005260191 0.004829985 0.005221851 0.004623373
## 2 0.004991542 0.004723690 0.005350774 0.005564493 0.005124654 0.004914278
##         [,14]       [,15]       [,16]       [,17]       [,18]       [,19]
## 1 0.005507304 0.004386255 0.006040156 0.004216303 0.004591728 0.004690692
## 2 0.005363606 0.004279521 0.004376716 0.005260646 0.004835280 0.004989493
##         [,20]       [,21]       [,22]       [,23]       [,24]       [,25]
## 1 0.004780629 0.004249589 0.004303261 0.003391395 0.004497712 0.004934328
## 2 0.004472747 0.005310401 0.005089629 0.004865774 0.005237551 0.004857805
##         [,26]       [,27]       [,28]       [,29]       [,30]       [,31]
## 1 0.005853189 0.004763984 0.004924727 0.003575623 0.004665989 0.005631196
## 2 0.005174959 0.005115076 0.005134582 0.005589128 0.004768003 0.005202493
##         [,32]       [,33]       [,34]       [,35]       [,36]       [,37]
## 1 0.005004424 0.004121293 0.004412899 0.005321122 0.004177870 0.005149373
## 2 0.004719054 0.005266581 0.004147202 0.004964274 0.004843658 0.005052806
##         [,38]       [,39]       [,40]       [,41]       [,42]       [,43]
## 1 0.003549839 0.004598301 0.005725721 0.006244830 0.004426540 0.005662519
## 2 0.004753477 0.004792650 0.005292398 0.005481965 0.004784623 0.005408403
##         [,44]       [,45]       [,46]       [,47]       [,48]       [,49]
## 1 0.004538859 0.004936487 0.004151314 0.005363267 0.005508981 0.005240999
## 2 0.005158307 0.003942903 0.004844697 0.003747116 0.005834544 0.004994340
##         [,50]
## 1 0.005038338
## 2 0.004639667
## 
## Clustering vector:
##  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1]  0.007365112 79.599093209
##  (between_SS / total_SS =  84.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

cluster.labels <- km.out$cluster
table(true.labels, cluster.labels)

##            cluster.labels
## true.labels  1  2
##           1  0 20
##           2 20  0
##           3  0 20

Here, K-means grouped classes 1 and 3 together, while keeping class 2 separate. This makes sense since class 2 was set as x[21:40,1] = x[21:40,2] = 5, which is much greater than the rest of the data. While class 1 has feature 2 set as 2 and class 3 has feature 1 set as 2, their differences are not enough to justify two separate clusters when only two are allowed.

d) Now perform K-means clustering with K = 4, and describe your results.

km.out <- kmeans(x, 4, nstart = 20)
km.out

## K-means clustering with 4 clusters of sizes 11, 20, 20, 9
## 
## Cluster means:
##          [,1]        [,2]        [,3]        [,4]        [,5]        [,6]
## 1 2.000000000 0.006223974 0.005503457 0.003969071 0.003555036 0.005094137
## 2 5.000000000 5.000000000 0.004292492 0.005217101 0.003641431 0.005947972
## 3 0.005551671 2.000000000 0.005166836 0.005484659 0.004972492 0.004618285
## 4 2.000000000 0.003293456 0.006661023 0.005106729 0.003296564 0.004972412
##          [,7]        [,8]        [,9]       [,10]       [,11]       [,12]
## 1 0.007196633 0.004143518 0.005153079 0.005416735 0.007158928 0.007023599
## 2 0.004752187 0.004264830 0.005844531 0.005260191 0.004829985 0.005221851
## 3 0.004793031 0.005133796 0.004568124 0.005312861 0.005521645 0.004789253
## 4 0.005493327 0.005711896 0.004544585 0.005354408 0.003710956 0.003549057
##         [,13]       [,14]       [,15]       [,16]       [,17]       [,18]
## 1 0.007354389 0.006006205 0.003964197 0.004462192 0.006417839 0.004221412
## 2 0.004623373 0.005507304 0.004386255 0.006040156 0.004216303 0.004591728
## 3 0.004532753 0.004919870 0.004443105 0.003835058 0.005174819 0.004863420
## 4 0.002779752 0.005564286 0.004301398 0.005475931 0.004037028 0.005523029
##         [,19]       [,20]       [,21]       [,22]       [,23]       [,24]
## 1 0.004291688 0.004226678 0.003626661 0.005002857 0.003913671 0.005652864
## 2 0.004690692 0.004780629 0.004249589 0.004303261 0.003391395 0.004497712
## 3 0.005089885 0.004356273 0.005575948 0.005273448 0.004913316 0.005158933
## 4 0.005619273 0.005032328 0.006778200 0.004787200 0.005923806 0.004904654
##         [,25]       [,26]       [,27]       [,28]       [,29]       [,30]
## 1 0.003457210 0.004286576 0.004870858 0.005147259 0.006164609 0.005544888
## 2 0.004934328 0.005853189 0.004763984 0.004924727 0.003575623 0.004665989
## 3 0.004580133 0.004958652 0.005596034 0.004711264 0.005718374 0.005147582
## 4 0.007186693 0.006741443 0.004344768 0.006059793 0.004598551 0.002974966
##         [,31]       [,32]       [,33]       [,34]       [,35]       [,36]
## 1 0.005760799 0.005230798 0.006335986 0.003134177 0.005747493 0.003889846
## 2 0.005631196 0.005004424 0.004121293 0.004412899 0.005321122 0.004177870
## 3 0.004648414 0.004670802 0.004967435 0.004564740 0.004373587 0.005841328
## 4 0.005751404 0.004200815 0.004624298 0.004457480 0.005319644 0.003792382
##         [,37]       [,38]       [,39]       [,40]       [,41]       [,42]
## 1 0.004293253 0.006027733 0.004008244 0.003869813 0.005850709 0.005067579
## 2 0.005149373 0.003549839 0.004598301 0.005725721 0.006244830 0.004426540
## 3 0.005256790 0.004157657 0.005288231 0.006248561 0.005244998 0.004281433
## 4 0.005527848 0.004520097 0.004650079 0.004906305 0.005557873 0.005556988
##         [,43]       [,44]       [,45]       [,46]       [,47]       [,48]
## 1 0.003873039 0.005438298 0.004106648 0.004225592 0.003494039 0.004342667
## 2 0.005662519 0.004538859 0.004936487 0.004151314 0.005363267 0.005508981
## 3 0.005008912 0.005659766 0.003554611 0.004574720 0.003803449 0.005997626
## 4 0.008172716 0.003701744 0.004605640 0.006201330 0.003931250 0.007295545
##         [,49]       [,50]
## 1 0.006378219 0.006466118
## 2 0.005240999 0.005038338
## 3 0.004874033 0.004476749
## 4 0.003570278 0.002769378
## 
## Clustering vector:
##  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [39] 2 2 1 4 1 1 4 4 4 1 1 1 1 1 4 1 1 4 4 4 4 1
## 
## Within cluster sum of squares by cluster:
## [1] 0.004023809 0.007365112 0.008123523 0.003440322
##  (between_SS / total_SS = 100.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

cluster.labels <- km.out$cluster
table(true.labels, cluster.labels)

##            cluster.labels
## true.labels  1  2  3  4
##           1  0  0 20  0
##           2  0 20  0  0
##           3 11  0  0  9

The third class is split into 2 clusters. I think this highlights an issue in K-means: it will always assign K clusters, even if fewer would be more natural.

e) Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.

km.out <- kmeans(pr.out$x[,1:2], 3, nstart = 20)
km.out

## K-means clustering with 3 clusters of sizes 20, 20, 20
## 
## Cluster means:
##         PC1           PC2
## 1 -1.884737  1.4104317022
## 2 -1.884036 -1.4106068124
## 3  3.768773  0.0001751102
## 
## Clustering vector:
##  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [39] 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 1.657870e-04 1.555139e-04 6.907011e-10
##  (between_SS / total_SS = 100.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

cluster.labels <- km.out$cluster
table(true.labels, cluster.labels)

##            cluster.labels
## true.labels  1  2  3
##           1  0 20  0
##           2  0  0 20
##           3 20  0  0

K-means clustering on PCA-reduced data performed just as well as K-means on raw data.

f) Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (a)? Explain.

km.out <- kmeans(scale(x), 3, nstart = 20)
km.out

## K-means clustering with 3 clusters of sizes 21, 22, 17
## 
## Cluster means:
##         [,1]       [,2]       [,3]        [,4]        [,5]        [,6]
## 1 -0.4827857 -0.4367532  0.8250709 -0.03268641  0.21286884 -0.05411195
## 2 -0.1829176 -0.2266505 -0.5681131 -0.04809385 -0.02029717 -0.10870999
## 3  0.8330993  0.8328310 -0.2840000  0.10261643 -0.23668870  0.20752769
##          [,7]        [,8]       [,9]       [,10]       [,11]      [,12]
## 1 -0.06063122  0.25472051 -0.2704699  0.10581486 -0.02641405 -0.2966573
## 2  0.24841538 -0.19760273 -0.1663686 -0.12097774  0.10059243  0.4313126
## 3 -0.24658133 -0.05893356  0.5494104  0.02584696 -0.09754931 -0.1917102
##        [,13]      [,14]      [,15]      [,16]      [,17]      [,18]      [,19]
## 1 -0.3308856  0.4019458 -0.5673036 -0.4375181 -0.3331488 -0.3718166  0.1785589
## 2  0.5291391 -0.2383619  0.3794237  0.3100469  0.4942686  0.1037578 -0.4973151
## 3 -0.2760272 -0.1880529  0.2097680  0.1392264 -0.2281050  0.3250281  0.4230116
##         [,20]       [,21]      [,22]       [,23]       [,24]      [,25]
## 1  0.04779041  0.13371765 -0.1078524  0.25263927 -0.09110564  0.4221038
## 2 -0.29574328 -0.05432027  0.3596750 -0.09345295  0.28129758 -0.1071656
## 3  0.32369139 -0.09488380 -0.3322323 -0.19114469 -0.25148991 -0.3827374
##         [,26]       [,27]        [,28]       [,29]      [,30]       [,31]
## 1  0.02756019  0.17955961 -0.006992715 -0.01299296 -0.0702414 -0.17859720
## 2  0.05034478  0.01683351 -0.045960632  0.52781826  0.2163431  0.10580402
## 3 -0.09919701 -0.24359348  0.068116525 -0.66700881 -0.1932046  0.08369721
##        [,32]      [,33]      [,34]      [,35]       [,36]       [,37]
## 1 -0.2414275  0.1906902 -0.0119394  0.2453456 -0.08336142  0.07912769
## 2  0.4330253  0.3192718  0.1138621 -0.4608903  0.06954770 -0.08456631
## 3 -0.2621518 -0.6487338 -0.1326023  0.2933724  0.01297297  0.01169278
##        [,38]       [,39]        [,40]      [,41]      [,42]       [,43]
## 1  0.4440724  0.04228129 -0.313041151 -0.3771077  0.1829199  0.07044042
## 2 -0.2808206 -0.09256913 -0.001393404  0.2748355  0.2171122 -0.10646463
## 3 -0.1851451  0.06756552  0.388501122  0.1101694 -0.5069286  0.05076312
##        [,44]       [,45]       [,46]      [,47]      [,48]      [,49]
## 1  0.2779180 -0.03773379  0.12495020 -0.5071740  0.4166455 -0.1356745
## 2 -0.1667141 -0.45801646 -0.03767761 -0.1533308 -0.3091538  0.2930603
## 3 -0.1275628  0.63933951 -0.10559098  0.8249371 -0.1145984 -0.2116566
##        [,50]
## 1 -0.6066658
## 2  0.6961651
## 3 -0.1515088
## 
## Clustering vector:
##  [1] 1 2 1 2 1 3 1 2 3 1 2 2 2 2 1 1 1 2 1 1 1 2 2 3 3 3 3 1 3 3 3 3 2 3 3 2 3 3
## [39] 2 3 2 3 2 2 1 3 2 2 2 2 1 1 1 2 2 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 940.0293 964.0236 763.8425
##  (between_SS / total_SS =   9.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

cluster.labels <- km.out$cluster
table(true.labels, cluster.labels)

##            cluster.labels
## true.labels  1  2  3
##           1 10  8  2
##           2  2  5 13
##           3  9  9  2

Unlike previous K-means runs, the clusters do not correspond cleanly to the true labels.

Class 1 and Class 3 are mixed into Clusters 1 and 3.

Class 2 is somewhat better assigned, with 13 out of 20 observations grouped into Cluster 2.

There is significant overlap between the clusters, suggesting that scaling changed the structure of the data.

In this case, scaling negative impacted clustering performance. It flattened the data structure, blurring the group boundaries. Scaling should be used when features have vastly different ranges.

3. On the book website, https://www.statlearning.com/resources-second-edition, there is a gene expression data set (Ch12Ex13.csv) that consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group. The data is also available on the blackboard.

a) Load in the data using read.csv(). You will need to select header = F.

gene_data <- read.csv("Ch12Ex13.csv", header = FALSE)  
dim(gene_data)

## [1] 1000   40

b) Apply hierarchical clustering to the samples using correlation-based distance, and plot the dendrogram. Do the genes separate the samples into the two groups? Do your results depend on the type of linkage used?

cor_dist <- as.dist(1 - cor(gene_data))

hc_complete <- hclust(cor_dist, method = "complete")
hc_average <- hclust(cor_dist, method = "average")
hc_single <- hclust(cor_dist, method = "single")

par(mfrow=c(1,3))

plot(hc_complete, main = "Complete Linkage")
plot(hc_average, main = "Average Linkage")
plot(hc_single, main = "Single Linkage")

The genes correctly separate into two clusters for complete and single linkage, but average linkage splits the genes into three clusters. However, the diseased group’s gene expression features are still correctly grouped as their own cluster.

c) Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.

One way would be to calculate the absolute difference in average gene expression between the tissues collected from healthy patients (first 20 columns) and diseased patients (second 20 columns). Then, we can sort the genes by descending order based on calculated difference.

gene_means_healthy <- rowMeans(gene_data[,1:20]) 
gene_means_diseased <- rowMeans(gene_data[,21:40]) 
gene_diff <- abs(gene_means_healthy - gene_means_diseased)
top_genes <- order(gene_diff, decreasing = TRUE)[1:10]
data.frame(Gene = top_genes, Difference = gene_diff[top_genes])

##    Gene Difference
## 1   600   2.747577
## 2   584   2.601985
## 3   549   2.550757
## 4   540   2.545174
## 5   502   2.544461
## 6   568   2.519418
## 7   582   2.496084
## 8   565   2.470820
## 9   562   2.465549
## 10  554   2.436718