data(USArrests)
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
states=row.names(USArrests)
states
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
names(USArrests)
## [1] "Murder" "Assault" "UrbanPop" "Rape"
apply(USArrests, 2, mean)
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
apply(USArrests, 2, var)
## Murder Assault UrbanPop Rape
## 18.97047 6945.16571 209.51878 87.72916
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
set.seed(1)
hc.complete <- hclust(dist(USArrests), method="complete")
plot(hc.complete, main = "Hierarchical Clustering of US States", sub = "Complete Linkage, Euclidean Distance", xlab = "States", cex = 0.5)
clusters <- cutree(hc.complete, 3)
split(rownames(USArrests), clusters)
## $`1`
## [1] "Alabama" "Alaska" "Arizona" "California"
## [5] "Delaware" "Florida" "Illinois" "Louisiana"
## [9] "Maryland" "Michigan" "Mississippi" "Nevada"
## [13] "New Mexico" "New York" "North Carolina" "South Carolina"
##
## $`2`
## [1] "Arkansas" "Colorado" "Georgia" "Massachusetts"
## [5] "Missouri" "New Jersey" "Oklahoma" "Oregon"
## [9] "Rhode Island" "Tennessee" "Texas" "Virginia"
## [13] "Washington" "Wyoming"
##
## $`3`
## [1] "Connecticut" "Hawaii" "Idaho" "Indiana"
## [5] "Iowa" "Kansas" "Kentucky" "Maine"
## [9] "Minnesota" "Montana" "Nebraska" "New Hampshire"
## [13] "North Dakota" "Ohio" "Pennsylvania" "South Dakota"
## [17] "Utah" "Vermont" "West Virginia" "Wisconsin"
USArrests.scaled <- scale(USArrests)
hc.complete.scaled <- hclust(dist(USArrests.scaled), method = "complete")
plot(hc.complete.scaled, main = "Hierarchical Clustering of US States (Scaled Features)", sub = "Complete Linkage, Euclidean Distance", xlab = "States", cex = 0.5)
clusters.scaled <- cutree(hc.complete.scaled, 3)
split(rownames(USArrests), clusters.scaled)
## $`1`
## [1] "Alabama" "Alaska" "Georgia" "Louisiana"
## [5] "Mississippi" "North Carolina" "South Carolina" "Tennessee"
##
## $`2`
## [1] "Arizona" "California" "Colorado" "Florida" "Illinois"
## [6] "Maryland" "Michigan" "Nevada" "New Mexico" "New York"
## [11] "Texas"
##
## $`3`
## [1] "Arkansas" "Connecticut" "Delaware" "Hawaii"
## [5] "Idaho" "Indiana" "Iowa" "Kansas"
## [9] "Kentucky" "Maine" "Massachusetts" "Minnesota"
## [13] "Missouri" "Montana" "Nebraska" "New Hampshire"
## [17] "New Jersey" "North Dakota" "Ohio" "Oklahoma"
## [21] "Oregon" "Pennsylvania" "Rhode Island" "South Dakota"
## [25] "Utah" "Vermont" "Virginia" "Washington"
## [29] "West Virginia" "Wisconsin" "Wyoming"
table(clusters, clusters.scaled)
## clusters.scaled
## clusters 1 2 3
## 1 6 9 1
## 2 2 2 10
## 3 0 0 20
Scaling the variables had a large impact on the hierarchical clustering obtained.
Cluster 1 (before scaling): 6 states remained in Cluster 1 after scaling, 9 states moved to Cluster 2, and 1 state moved to Cluster 3.
Cluster 2 (before scaling): 2 states moved to Cluster 1, 2 stayed in Cluster 2, and 10 moved to Cluster 3 after scaling.
Cluster 3 (before scaling): All 20 states remained in Cluster 3.
The largest shift happened in Cluster 2, where most states moved to Cluster 3 after scaling. Cluster 3 remained mostly unchanged, meaning that states in this cluster were already closely grouped regardless of scaling.
In my opinion, the variables should be scaled before the inter-observation dissimilarities are computed. Without scaling, clustering outcomes can be dominated by features with larger ranges. By scaling variables to mean 0 and std dev 1, we created a cluster structure that assumes all variables contribute equally.
set.seed(1)
x = matrix(runif(n=3000, 0, 0.01), ncol=50)
x[1:20, 2] = 2
x[21:40, 1] = 5
x[21:40, 2] = 5
x[41:60, 1] = 2
dim(x)
## [1] 60 50
pr.out <- prcomp(x)
summary(pr.out)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.6874 1.1614 0.00526 0.004927 0.004869 0.004755
## Proportion of Variance 0.8426 0.1574 0.00000 0.000000 0.000000 0.000000
## Cumulative Proportion 0.8426 1.0000 0.99996 0.999960 0.999960 0.999960
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.004605 0.004264 0.00414 0.004112 0.003979 0.003884
## Proportion of Variance 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000
## Cumulative Proportion 0.999970 0.999970 0.99997 0.999970 0.999980 0.999980
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.003783 0.003693 0.003667 0.003471 0.003417 0.003386
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion 0.999980 0.999980 0.999980 0.999980 0.999980 0.999990
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.003196 0.003188 0.003041 0.003005 0.002929 0.002773
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion 0.999990 0.999990 0.999990 0.999990 0.999990 0.999990
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.002566 0.002498 0.002283 0.002235 0.002146 0.002113
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion 0.999990 0.999990 0.999990 1.000000 1.000000 1.000000
## PC31 PC32 PC33 PC34 PC35 PC36
## Standard deviation 0.002053 0.001867 0.001817 0.001729 0.001648 0.001533
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
## PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.001494 0.001369 0.00131 0.001276 0.001139 0.001133
## Proportion of Variance 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000
## Cumulative Proportion 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000
## PC43 PC44 PC45 PC46 PC47
## Standard deviation 0.001006 0.0008864 0.0007426 0.0007119 0.0006409
## Proportion of Variance 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion 1.000000 1.0000000 1.0000000 1.0000000 1.0000000
## PC48 PC49 PC50
## Standard deviation 0.0005407 0.000424 0.0003072
## Proportion of Variance 0.0000000 0.000000 0.0000000
## Cumulative Proportion 1.0000000 1.000000 1.0000000
pr.out$x[,1:2]
## PC1 PC2
## [1,] -1.886084 -1.4126555398
## [2,] -1.885333 -1.4119024616
## [3,] -1.883911 -1.4104753285
## [4,] -1.881536 -1.4081065143
## [5,] -1.886535 -1.4130992761
## [6,] -1.881609 -1.4081843495
## [7,] -1.881284 -1.4078650757
## [8,] -1.883291 -1.4098560032
## [9,] -1.883509 -1.4100937102
## [10,] -1.887531 -1.4140963348
## [11,] -1.886506 -1.4130745147
## [12,] -1.886709 -1.4132851764
## [13,] -1.883098 -1.4096604647
## [14,] -1.885245 -1.4118304476
## [15,] -1.882519 -1.4090875485
## [16,] -1.884443 -1.4110047224
## [17,] -1.882886 -1.4094606762
## [18,] -1.880948 -1.4075182107
## [19,] -1.885273 -1.4118401119
## [20,] -1.882462 -1.4090397810
## [21,] 3.768770 0.0001737612
## [22,] 3.768773 0.0001797876
## [23,] 3.768767 0.0001715874
## [24,] 3.768774 0.0001749279
## [25,] 3.768775 0.0001667985
## [26,] 3.768778 0.0001762815
## [27,] 3.768778 0.0001770803
## [28,] 3.768768 0.0001807997
## [29,] 3.768771 0.0001706622
## [30,] 3.768768 0.0001777160
## [31,] 3.768780 0.0001756922
## [32,] 3.768774 0.0001728183
## [33,] 3.768771 0.0001745470
## [34,] 3.768774 0.0001753220
## [35,] 3.768780 0.0001825239
## [36,] 3.768770 0.0001764830
## [37,] 3.768773 0.0001849647
## [38,] 3.768776 0.0001689022
## [39,] 3.768774 0.0001715969
## [40,] 3.768771 0.0001699520
## [41,] -1.883577 1.4092686285
## [42,] -1.885707 1.4113967583
## [43,] -1.886295 1.4119916678
## [44,] -1.881188 1.4068770473
## [45,] -1.883727 1.4094219588
## [46,] -1.886693 1.4123889646
## [47,] -1.887287 1.4129766263
## [48,] -1.884825 1.4105245004
## [49,] -1.881670 1.4073681209
## [50,] -1.883975 1.4096678506
## [51,] -1.881305 1.4070062987
## [52,] -1.883034 1.4087254643
## [53,] -1.885682 1.4113753416
## [54,] -1.885152 1.4108489627
## [55,] -1.887158 1.4128518849
## [56,] -1.888118 1.4138029360
## [57,] -1.883147 1.4088467540
## [58,] -1.887476 1.4131778175
## [59,] -1.885050 1.4107409572
## [60,] -1.883682 1.4093755031
plot(pr.out$x[,1:2], col = 1:3, xlab = "PC1", ylab = "PC2", pch = 19)
km.out <- kmeans(x, 3, nstart = 20)
km.out
## K-means clustering with 3 clusters of sizes 20, 20, 20
##
## Cluster means:
## [,1] [,2] [,3] [,4] [,5] [,6]
## 1 5.000000000 5.000000000 0.004292492 0.005217101 0.003641431 0.005947972
## 2 0.005551671 2.000000000 0.005166836 0.005484659 0.004972492 0.004618285
## 3 2.000000000 0.004905241 0.006024362 0.004481017 0.003438724 0.005039361
## [,7] [,8] [,9] [,10] [,11] [,12]
## 1 0.004752187 0.004264830 0.005844531 0.005260191 0.004829985 0.005221851
## 2 0.004793031 0.005133796 0.004568124 0.005312861 0.005521645 0.004789253
## 3 0.006430145 0.004849288 0.004879257 0.005388688 0.005607340 0.005460055
## [,13] [,14] [,15] [,16] [,17] [,18]
## 1 0.004623373 0.005507304 0.004386255 0.006040156 0.004216303 0.004591728
## 2 0.004532753 0.004919870 0.004443105 0.003835058 0.005174819 0.004863420
## 3 0.005295802 0.005807342 0.004115938 0.004918375 0.005346474 0.004807140
## [,19] [,20] [,21] [,22] [,23] [,24]
## 1 0.004690692 0.004780629 0.004249589 0.004303261 0.003391395 0.004497712
## 2 0.005089885 0.004356273 0.005575948 0.005273448 0.004913316 0.005158933
## 3 0.004889102 0.004589221 0.005044854 0.004905811 0.004818232 0.005316169
## [,25] [,26] [,27] [,28] [,29] [,30]
## 1 0.004934328 0.005853189 0.004763984 0.004924727 0.003575623 0.004665989
## 2 0.004580133 0.004958652 0.005596034 0.004711264 0.005718374 0.005147582
## 3 0.005135477 0.005391266 0.004634117 0.005557899 0.005459883 0.004388423
## [,31] [,32] [,33] [,34] [,35] [,36]
## 1 0.005631196 0.005004424 0.004121293 0.004412899 0.005321122 0.004177870
## 2 0.004648414 0.004670802 0.004967435 0.004564740 0.004373587 0.005841328
## 3 0.005756571 0.004767306 0.005565726 0.003729663 0.005554961 0.003845988
## [,37] [,38] [,39] [,40] [,41] [,42]
## 1 0.005149373 0.003549839 0.004598301 0.005725721 0.006244830 0.004426540
## 2 0.005256790 0.004157657 0.005288231 0.006248561 0.005244998 0.004281433
## 3 0.004848821 0.005349297 0.004297070 0.004336235 0.005718933 0.005287813
## [,43] [,44] [,45] [,46] [,47] [,48]
## 1 0.005662519 0.004538859 0.004936487 0.004151314 0.005363267 0.005508981
## 2 0.005008912 0.005659766 0.003554611 0.004574720 0.003803449 0.005997626
## 3 0.005807893 0.004656849 0.004331194 0.005114674 0.003690784 0.005671462
## [,49] [,50]
## 1 0.005240999 0.005038338
## 2 0.004874033 0.004476749
## 3 0.005114646 0.004802585
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 0.007365112 0.008123523 0.008381756
## (between_SS / total_SS = 100.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
cluster.labels <- km.out$cluster
true.labels <- rep(1:3, each=20)
true.labels
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [39] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
table(true.labels, cluster.labels)
## cluster.labels
## true.labels 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
The first row (true label = 1) shows that all 20 observations in class 1 were assigned to cluster 2.
The second row (true label = 2) shows that all 20 observations in class 2 were assigned to cluster 1.
The third row (true label = 3) shows that all 20 observations in class 3 were assigned to cluster 3.
The clusters I obtained in K-means clustering successfully identified the three groups of data. Since K-means clustering arbitrarily numbers the clusters, class 1 does not need to be assigned to cluster 1, etc.
km.out <- kmeans(x, 2, nstart = 20)
km.out
## K-means clustering with 2 clusters of sizes 20, 40
##
## Cluster means:
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## 1 5.000000 5.000000 0.004292492 0.005217101 0.003641431 0.005947972 0.004752187
## 2 1.002776 1.002453 0.005595599 0.004982838 0.004205608 0.004828823 0.005611588
## [,8] [,9] [,10] [,11] [,12] [,13]
## 1 0.004264830 0.005844531 0.005260191 0.004829985 0.005221851 0.004623373
## 2 0.004991542 0.004723690 0.005350774 0.005564493 0.005124654 0.004914278
## [,14] [,15] [,16] [,17] [,18] [,19]
## 1 0.005507304 0.004386255 0.006040156 0.004216303 0.004591728 0.004690692
## 2 0.005363606 0.004279521 0.004376716 0.005260646 0.004835280 0.004989493
## [,20] [,21] [,22] [,23] [,24] [,25]
## 1 0.004780629 0.004249589 0.004303261 0.003391395 0.004497712 0.004934328
## 2 0.004472747 0.005310401 0.005089629 0.004865774 0.005237551 0.004857805
## [,26] [,27] [,28] [,29] [,30] [,31]
## 1 0.005853189 0.004763984 0.004924727 0.003575623 0.004665989 0.005631196
## 2 0.005174959 0.005115076 0.005134582 0.005589128 0.004768003 0.005202493
## [,32] [,33] [,34] [,35] [,36] [,37]
## 1 0.005004424 0.004121293 0.004412899 0.005321122 0.004177870 0.005149373
## 2 0.004719054 0.005266581 0.004147202 0.004964274 0.004843658 0.005052806
## [,38] [,39] [,40] [,41] [,42] [,43]
## 1 0.003549839 0.004598301 0.005725721 0.006244830 0.004426540 0.005662519
## 2 0.004753477 0.004792650 0.005292398 0.005481965 0.004784623 0.005408403
## [,44] [,45] [,46] [,47] [,48] [,49]
## 1 0.004538859 0.004936487 0.004151314 0.005363267 0.005508981 0.005240999
## 2 0.005158307 0.003942903 0.004844697 0.003747116 0.005834544 0.004994340
## [,50]
## 1 0.005038338
## 2 0.004639667
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 0.007365112 79.599093209
## (between_SS / total_SS = 84.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
cluster.labels <- km.out$cluster
table(true.labels, cluster.labels)
## cluster.labels
## true.labels 1 2
## 1 0 20
## 2 20 0
## 3 0 20
Here, K-means grouped classes 1 and 3 together, while keeping class 2 separate. This makes sense since class 2 was set as x[21:40,1] = x[21:40,2] = 5, which is much greater than the rest of the data. While class 1 has feature 2 set as 2 and class 3 has feature 1 set as 2, their differences are not enough to justify two separate clusters when only two are allowed.
km.out <- kmeans(x, 4, nstart = 20)
km.out
## K-means clustering with 4 clusters of sizes 11, 20, 20, 9
##
## Cluster means:
## [,1] [,2] [,3] [,4] [,5] [,6]
## 1 2.000000000 0.006223974 0.005503457 0.003969071 0.003555036 0.005094137
## 2 5.000000000 5.000000000 0.004292492 0.005217101 0.003641431 0.005947972
## 3 0.005551671 2.000000000 0.005166836 0.005484659 0.004972492 0.004618285
## 4 2.000000000 0.003293456 0.006661023 0.005106729 0.003296564 0.004972412
## [,7] [,8] [,9] [,10] [,11] [,12]
## 1 0.007196633 0.004143518 0.005153079 0.005416735 0.007158928 0.007023599
## 2 0.004752187 0.004264830 0.005844531 0.005260191 0.004829985 0.005221851
## 3 0.004793031 0.005133796 0.004568124 0.005312861 0.005521645 0.004789253
## 4 0.005493327 0.005711896 0.004544585 0.005354408 0.003710956 0.003549057
## [,13] [,14] [,15] [,16] [,17] [,18]
## 1 0.007354389 0.006006205 0.003964197 0.004462192 0.006417839 0.004221412
## 2 0.004623373 0.005507304 0.004386255 0.006040156 0.004216303 0.004591728
## 3 0.004532753 0.004919870 0.004443105 0.003835058 0.005174819 0.004863420
## 4 0.002779752 0.005564286 0.004301398 0.005475931 0.004037028 0.005523029
## [,19] [,20] [,21] [,22] [,23] [,24]
## 1 0.004291688 0.004226678 0.003626661 0.005002857 0.003913671 0.005652864
## 2 0.004690692 0.004780629 0.004249589 0.004303261 0.003391395 0.004497712
## 3 0.005089885 0.004356273 0.005575948 0.005273448 0.004913316 0.005158933
## 4 0.005619273 0.005032328 0.006778200 0.004787200 0.005923806 0.004904654
## [,25] [,26] [,27] [,28] [,29] [,30]
## 1 0.003457210 0.004286576 0.004870858 0.005147259 0.006164609 0.005544888
## 2 0.004934328 0.005853189 0.004763984 0.004924727 0.003575623 0.004665989
## 3 0.004580133 0.004958652 0.005596034 0.004711264 0.005718374 0.005147582
## 4 0.007186693 0.006741443 0.004344768 0.006059793 0.004598551 0.002974966
## [,31] [,32] [,33] [,34] [,35] [,36]
## 1 0.005760799 0.005230798 0.006335986 0.003134177 0.005747493 0.003889846
## 2 0.005631196 0.005004424 0.004121293 0.004412899 0.005321122 0.004177870
## 3 0.004648414 0.004670802 0.004967435 0.004564740 0.004373587 0.005841328
## 4 0.005751404 0.004200815 0.004624298 0.004457480 0.005319644 0.003792382
## [,37] [,38] [,39] [,40] [,41] [,42]
## 1 0.004293253 0.006027733 0.004008244 0.003869813 0.005850709 0.005067579
## 2 0.005149373 0.003549839 0.004598301 0.005725721 0.006244830 0.004426540
## 3 0.005256790 0.004157657 0.005288231 0.006248561 0.005244998 0.004281433
## 4 0.005527848 0.004520097 0.004650079 0.004906305 0.005557873 0.005556988
## [,43] [,44] [,45] [,46] [,47] [,48]
## 1 0.003873039 0.005438298 0.004106648 0.004225592 0.003494039 0.004342667
## 2 0.005662519 0.004538859 0.004936487 0.004151314 0.005363267 0.005508981
## 3 0.005008912 0.005659766 0.003554611 0.004574720 0.003803449 0.005997626
## 4 0.008172716 0.003701744 0.004605640 0.006201330 0.003931250 0.007295545
## [,49] [,50]
## 1 0.006378219 0.006466118
## 2 0.005240999 0.005038338
## 3 0.004874033 0.004476749
## 4 0.003570278 0.002769378
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [39] 2 2 1 4 1 1 4 4 4 1 1 1 1 1 4 1 1 4 4 4 4 1
##
## Within cluster sum of squares by cluster:
## [1] 0.004023809 0.007365112 0.008123523 0.003440322
## (between_SS / total_SS = 100.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
cluster.labels <- km.out$cluster
table(true.labels, cluster.labels)
## cluster.labels
## true.labels 1 2 3 4
## 1 0 0 20 0
## 2 0 20 0 0
## 3 11 0 0 9
The third class is split into 2 clusters. I think this highlights an issue in K-means: it will always assign K clusters, even if fewer would be more natural.
km.out <- kmeans(pr.out$x[,1:2], 3, nstart = 20)
km.out
## K-means clustering with 3 clusters of sizes 20, 20, 20
##
## Cluster means:
## PC1 PC2
## 1 -1.884737 1.4104317022
## 2 -1.884036 -1.4106068124
## 3 3.768773 0.0001751102
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [39] 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 1.657870e-04 1.555139e-04 6.907011e-10
## (between_SS / total_SS = 100.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
cluster.labels <- km.out$cluster
table(true.labels, cluster.labels)
## cluster.labels
## true.labels 1 2 3
## 1 0 20 0
## 2 0 0 20
## 3 20 0 0
K-means clustering on PCA-reduced data performed just as well as K-means on raw data.
km.out <- kmeans(scale(x), 3, nstart = 20)
km.out
## K-means clustering with 3 clusters of sizes 21, 22, 17
##
## Cluster means:
## [,1] [,2] [,3] [,4] [,5] [,6]
## 1 -0.4827857 -0.4367532 0.8250709 -0.03268641 0.21286884 -0.05411195
## 2 -0.1829176 -0.2266505 -0.5681131 -0.04809385 -0.02029717 -0.10870999
## 3 0.8330993 0.8328310 -0.2840000 0.10261643 -0.23668870 0.20752769
## [,7] [,8] [,9] [,10] [,11] [,12]
## 1 -0.06063122 0.25472051 -0.2704699 0.10581486 -0.02641405 -0.2966573
## 2 0.24841538 -0.19760273 -0.1663686 -0.12097774 0.10059243 0.4313126
## 3 -0.24658133 -0.05893356 0.5494104 0.02584696 -0.09754931 -0.1917102
## [,13] [,14] [,15] [,16] [,17] [,18] [,19]
## 1 -0.3308856 0.4019458 -0.5673036 -0.4375181 -0.3331488 -0.3718166 0.1785589
## 2 0.5291391 -0.2383619 0.3794237 0.3100469 0.4942686 0.1037578 -0.4973151
## 3 -0.2760272 -0.1880529 0.2097680 0.1392264 -0.2281050 0.3250281 0.4230116
## [,20] [,21] [,22] [,23] [,24] [,25]
## 1 0.04779041 0.13371765 -0.1078524 0.25263927 -0.09110564 0.4221038
## 2 -0.29574328 -0.05432027 0.3596750 -0.09345295 0.28129758 -0.1071656
## 3 0.32369139 -0.09488380 -0.3322323 -0.19114469 -0.25148991 -0.3827374
## [,26] [,27] [,28] [,29] [,30] [,31]
## 1 0.02756019 0.17955961 -0.006992715 -0.01299296 -0.0702414 -0.17859720
## 2 0.05034478 0.01683351 -0.045960632 0.52781826 0.2163431 0.10580402
## 3 -0.09919701 -0.24359348 0.068116525 -0.66700881 -0.1932046 0.08369721
## [,32] [,33] [,34] [,35] [,36] [,37]
## 1 -0.2414275 0.1906902 -0.0119394 0.2453456 -0.08336142 0.07912769
## 2 0.4330253 0.3192718 0.1138621 -0.4608903 0.06954770 -0.08456631
## 3 -0.2621518 -0.6487338 -0.1326023 0.2933724 0.01297297 0.01169278
## [,38] [,39] [,40] [,41] [,42] [,43]
## 1 0.4440724 0.04228129 -0.313041151 -0.3771077 0.1829199 0.07044042
## 2 -0.2808206 -0.09256913 -0.001393404 0.2748355 0.2171122 -0.10646463
## 3 -0.1851451 0.06756552 0.388501122 0.1101694 -0.5069286 0.05076312
## [,44] [,45] [,46] [,47] [,48] [,49]
## 1 0.2779180 -0.03773379 0.12495020 -0.5071740 0.4166455 -0.1356745
## 2 -0.1667141 -0.45801646 -0.03767761 -0.1533308 -0.3091538 0.2930603
## 3 -0.1275628 0.63933951 -0.10559098 0.8249371 -0.1145984 -0.2116566
## [,50]
## 1 -0.6066658
## 2 0.6961651
## 3 -0.1515088
##
## Clustering vector:
## [1] 1 2 1 2 1 3 1 2 3 1 2 2 2 2 1 1 1 2 1 1 1 2 2 3 3 3 3 1 3 3 3 3 2 3 3 2 3 3
## [39] 2 3 2 3 2 2 1 3 2 2 2 2 1 1 1 2 2 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 940.0293 964.0236 763.8425
## (between_SS / total_SS = 9.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
cluster.labels <- km.out$cluster
table(true.labels, cluster.labels)
## cluster.labels
## true.labels 1 2 3
## 1 10 8 2
## 2 2 5 13
## 3 9 9 2
Unlike previous K-means runs, the clusters do not correspond cleanly to the true labels.
Class 1 and Class 3 are mixed into Clusters 1 and 3.
Class 2 is somewhat better assigned, with 13 out of 20 observations grouped into Cluster 2.
There is significant overlap between the clusters, suggesting that scaling changed the structure of the data.
In this case, scaling negative impacted clustering performance. It flattened the data structure, blurring the group boundaries. Scaling should be used when features have vastly different ranges.
gene_data <- read.csv("Ch12Ex13.csv", header = FALSE)
dim(gene_data)
## [1] 1000 40
cor_dist <- as.dist(1 - cor(gene_data))
hc_complete <- hclust(cor_dist, method = "complete")
hc_average <- hclust(cor_dist, method = "average")
hc_single <- hclust(cor_dist, method = "single")
par(mfrow=c(1,3))
plot(hc_complete, main = "Complete Linkage")
plot(hc_average, main = "Average Linkage")
plot(hc_single, main = "Single Linkage")
The genes correctly separate into two clusters for complete and single linkage, but average linkage splits the genes into three clusters. However, the diseased group’s gene expression features are still correctly grouped as their own cluster.
One way would be to calculate the absolute difference in average gene expression between the tissues collected from healthy patients (first 20 columns) and diseased patients (second 20 columns). Then, we can sort the genes by descending order based on calculated difference.
gene_means_healthy <- rowMeans(gene_data[,1:20])
gene_means_diseased <- rowMeans(gene_data[,21:40])
gene_diff <- abs(gene_means_healthy - gene_means_diseased)
top_genes <- order(gene_diff, decreasing = TRUE)[1:10]
data.frame(Gene = top_genes, Difference = gene_diff[top_genes])
## Gene Difference
## 1 600 2.747577
## 2 584 2.601985
## 3 549 2.550757
## 4 540 2.545174
## 5 502 2.544461
## 6 568 2.519418
## 7 582 2.496084
## 8 565 2.470820
## 9 562 2.465549
## 10 554 2.436718