rm(list =ls())
#install.packages("factoextra")
#install.packages("NbClust")
library(cluster)
library(factoextra)
library(dplyr)
library(NbClust)
setwd("/Users/jayavarshini/Desktop/ms/sem1/dmm/Assing3/")
data_buddy_move <- read.csv("buddymove_holidayiq.csv", header=T, sep=",", comment.char = '#')
head(data_buddy_move)
buddy_move=data_buddy_move[,2:7]
head(buddy_move)
# Compute basic statistics before applying k-means to decide whether standardization is needed.
stats<- data.frame(
Min = apply(buddy_move, 2, min), # minimum
Med = apply(buddy_move, 2, median), # median
Mean = apply(buddy_move, 2, mean), # mean
SD = apply(buddy_move, 2, sd), # Standard deviation
Max = apply(buddy_move, 2, max)
)
stats <- round(stats, 1)
head(stats)
The minimum and maximum of the Sports values are much lower than those of the other columns, so I'm scaling the data.
x<-scale(buddy_move)
head(x)
Sports Religious Nature Theatre Shopping Picnic
[1,] -1.509552 -1.0100142 -0.9973422 -1.4744331 -1.0740003 -0.7783943
[2,] -1.509552 -1.4722052 -1.0630749 -1.2565864 -1.0499404 -1.6057691
[3,] -1.509552 -1.8419580 -0.6029459 -0.9142560 -1.5070790 -1.3912645
[4,] -1.509552 -1.2873288 -1.0411640 -0.6652884 -0.8815209 -1.8202736
[5,] -1.509552 -0.3629468 -1.5451149 -1.7856426 -0.4243823 -1.0541859
[6,] -1.358415 -1.7803325 -0.3400150 -0.7275303 -1.4589591 -1.3606210
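As a side note on what scale() does to each column, here is a minimal sketch on made-up numbers (the `toy` data frame is hypothetical and is not the BuddyMove data). It assumes the default arguments, which center each column on its mean and divide by its standard deviation.

```r
# Stand-in data: two columns on very different ranges (not the BuddyMove data).
toy <- data.frame(a = c(2, 4, 6, 8), b = c(100, 300, 500, 700))

# scale() with defaults: subtract the column mean, divide by the column sd.
z_auto <- scale(toy)

# The same z-scores computed by hand, column by column.
z_manual <- sapply(toy, function(col) (col - mean(col)) / sd(col))

print(round(colMeans(z_auto), 10))       # column means after scaling
print(round(apply(z_auto, 2, sd), 10))   # column standard deviations after scaling
print(all.equal(as.vector(z_auto), as.vector(z_manual)))
```

After scaling, every column has mean 0 and standard deviation 1, so no single column dominates the Euclidean distances that k-means relies on.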
stats<- data.frame(
Min = apply(x, 2, min), # minimum
Med = apply(x, 2, median), # median
Mean = apply(x, 2, mean), # mean
SD = apply(x, 2, sd), # Standard deviation
Max = apply(x, 2, max)
)
stats <- round(stats, 1)
head(stats)
# Initializing total within sum of squares error: wss
wss <- 0
# For 1 to 15 cluster centers
for (i in 1:15) {
kmn <- kmeans(x, centers = i, nstart = 20)
# Saving the total within sum of squares to wss variable
wss[i] <- kmn$tot.withinss
}
# Plot total within sum of squares vs. number of clusters
plot(1:15, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
The plot has an elbow where the quality measure improves more slowly as the number of clusters increases, which shows that the quality of the model is no longer improving substantially as the model complexity increases.
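One way to read the elbow numerically rather than by eye is to look at how much the total within-cluster sum of squares falls with each added cluster. The sketch below is only an illustration: it uses the built-in iris measurements as stand-in data, and the seed and the range of k are arbitrary choices.

```r
set.seed(42)                  # arbitrary seed so the sketch is repeatable
toy <- scale(iris[, 1:4])     # stand-in data, not the BuddyMove ratings

# Total within-cluster sum of squares for k = 1..8.
wss <- sapply(1:8, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

# Improvement gained by adding one more cluster; the elbow is where this flattens.
gain <- -diff(wss)
print(round(data.frame(k = 2:8, improvement = gain), 2))
```

A large improvement followed by a run of small ones marks the elbow.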
The same check can be done more easily with fviz_nbclust():
# How many clusters? A couple of ways to visualize it.
fviz_nbclust(x, kmeans, method="wss") # Elbow method minimizes total
# within-cluster sum of squares (wss). Also called a "Scree" plot.
# Silhouette measures the quality of a cluster, i.e., how well each
# point lies within its cluster.
fviz_nbclust(x, kmeans, method="silhouette")
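The average silhouette width that the silhouette plot summarizes can also be computed directly. This is a rough sketch, assuming the `cluster` package loaded earlier and using the built-in iris data as a stand-in; the range of k and the seed are arbitrary.

```r
library(cluster)  # silhouette()

set.seed(42)                  # arbitrary seed for repeatability
toy <- scale(iris[, 1:4])     # stand-in data, not the BuddyMove ratings
d <- dist(toy)

# Average silhouette width for each candidate number of clusters.
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 20)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- 2:6
print(round(avg_sil, 3))

# The k with the widest average silhouette is the suggested choice.
best_k <- as.integer(names(avg_sil)[which.max(avg_sil)])
print(best_k)
```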
k <- 3
kmean <- kmeans(x, centers = k, nstart = 25)
fviz_cluster(kmean, data=x)
Number of observations in each cluster
kmean$size
[1] 39 113 97
Total SSE of the clusters
print(kmean$tot.withinss)
[1] 649.9424
SSE of each cluster
print(kmean$withinss)
[1] 96.53409 222.64518 330.76311
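The per-cluster values add up to the total reported earlier (96.53409 + 222.64518 + 330.76311 = 649.94238). As a sketch of where these numbers come from, the following recomputes each cluster's SSE by hand on stand-in data (the built-in iris measurements, with an arbitrary seed) and compares it with what kmeans() reports.

```r
set.seed(42)                  # arbitrary seed for repeatability
toy <- scale(iris[, 1:4])     # stand-in data, not the BuddyMove ratings
km <- kmeans(toy, centers = 3, nstart = 25)

# Per-cluster SSE: squared distance of every point to its own centroid.
manual_sse <- sapply(1:3, function(j) {
  pts <- toy[km$cluster == j, , drop = FALSE]
  sum(sweep(pts, 2, km$centers[j, ])^2)
})

print(all.equal(manual_sse, km$withinss))            # matches kmeans' per-cluster SSE
print(all.equal(sum(km$withinss), km$tot.withinss))  # per-cluster SSEs sum to the total
```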
for(i in 1:3)
{
print(i)
print(which(kmean$cluster==i))
}
[1] 1
[1] 51 59 71 91 98 116 117 124 130 137 141 144 146 149 159 161 162 163 166 169 176 180 184 186 191 195
[27] 200 203 205 210 213 214 215 218 225 228 232 237 241
[1] 2
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[27] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 52 53
[53] 54 55 56 57 58 60 61 62 63 64 65 66 67 68 69 70 72 73 74 75 76 77 78 79 80 81
[79] 82 83 84 85 86 87 88 89 90 92 93 94 95 96 97 99 100 101 102 103 104 105 106 107 108 109
[105] 125 129 134 140 142 145 148 156 238
[1] 3
[1] 110 111 112 113 114 115 118 119 120 121 122 123 126 127 128 131 132 133 135 136 138 139 143 147 150 151
[27] 152 153 154 155 157 158 160 164 165 167 168 170 171 172 173 174 175 177 178 179 181 182 183 185 187 188
[53] 189 190 192 193 194 196 197 198 199 201 202 204 206 207 208 209 211 212 216 217 219 220 221 222 223 224
[79] 226 227 229 230 231 233 234 235 236 239 240 242 243 244 245 246 247 248 249
In cluster one, users who gave more reviews about nature and picnics are grouped together. It is plausible that people who enjoy nature also like spending time with family on picnics. For instance, points 73, 84, and 94 have higher nature and picnic values than the rest, so they are clustered together. (Cluster numbers depend on the random initialization of kmeans(); in the membership listing printed above, these three points appear under label 2.)
In cluster two, something I found interesting is that the users may be the mothers/women of the family. I say so because the ratings on Religious, Shopping, and Picnic are notably high.
In cluster three, the sports ratings play a role. Looking at the clusters in more detail, we may find users who prefer watching movies and sports on TV over outdoor activities.
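Interpretations like these are easier to check against a table of per-cluster means on the original (unscaled) columns. The sketch below shows one way to build such a table with aggregate(); it uses the built-in iris measurements as stand-in data, so the column names and values are not those of the BuddyMove file.

```r
set.seed(42)                  # arbitrary seed for repeatability
raw <- iris[, 1:4]            # stand-in for the unscaled rating columns
km <- kmeans(scale(raw), centers = 3, nstart = 25)

# Mean of every original column within each cluster.
profile <- aggregate(raw, by = list(cluster = km$cluster), FUN = mean)
print(round(profile, 2))
```

Columns whose mean is clearly higher in one row than in the others are the ones that characterize that cluster.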
set.seed(1122)
setwd("/Users/jayavarshini/Desktop/ms/sem1/dmm/Assing3/")
DataSet <- read.csv("buddymove_holidayiq.csv", header=T, sep=",", comment.char = '#')
DataSet
SubSet<-sample_n(DataSet, 50)
rownames(SubSet)<-SubSet$User.Id
SubSet<-SubSet[2:7]
SubSet
x<-scale(SubSet)
head(x)
Sports Religious Nature Theatre Shopping Picnic
User 20 -0.8438782 0.2067001 -0.9068927 -1.32468472 0.0397644 -0.5215410
User 4 -1.2561965 -1.2040127 -1.0526953 -0.47224925 -0.8705050 -1.6589443
User 167 1.4925923 0.6360474 0.9156408 2.02822813 1.0937605 0.2552222
User 36 -0.9813176 -0.8666683 -0.8339913 -0.04603152 -0.4632792 -1.3260458
User 116 0.1181979 0.7893858 -0.3965833 0.20969913 0.3990813 0.3384468
User 115 0.3930768 -0.4066533 1.0371431 -0.35859119 -0.4393248 0.8932777
Complete Linkage
complete_linkage<- eclust(x, "hclust", hc_method = "complete",k=1)
fviz_dend(complete_linkage, show_labels=T, palette="jco")
The number of singleton cluster pairs: 19
single_linkage<- eclust(x, "hclust", hc_method = "single",k=1)
fviz_dend(single_linkage, show_labels=T, palette="jco",main='Single Linkage')
The number of singleton cluster pairs: 15
average_linkage<- eclust(x ,"hclust", hc_method ="average",k=1)
fviz_dend(average_linkage ,show_labels=T,palette="jco", main='Average Linkage')
The number of singleton cluster pairs: 18
Complete Linkage: The number of singleton cluster pairs: 19
{User 71,User 98},{User 12,User 11},{User 72,User 73},{User 115,User 56},{User 18,User 41},{User 23,User 58},{User 43,User 35},{User 36,User 60},{User 4,User 37},{User 200,User 195},{User 199,User 168},{User 197,User 217},{User 224,User 221},{User 167,User 240},{User 140,User 145},{User 155,User 136},{User 170,User 225},{User 157,User 131},{User 116,User 139}
Single Linkage: The number of singleton cluster pairs: 15
{User 200,User 195},{User 224,User 221},{User 167,User 240},{User 71,User 98},{User 12,User 11},{User 18,User 41},{User 23,User 38},{User 43,User 35},{User 157,User 131},{User 116,User 139},{User 40,User 45},{User 140,User 145},{User 155,User 136},{User 197,User 217},{User 199,User 168}
Average Linkage: The number of singleton cluster pairs: 18
{User 12,User 11},{User 43,User 35},{User 36,User 60},{User 4,User 37},{User 72,User 73},{User 18,User 41},{User 23,User 38},{User 71,User 98},{User 140,User 145},{User 155,User 136},{User 157,User 131},{User 116,User 139},{User 200,User 195},{User 224,User 221},{User 167,User 240},{User 197,User 217},{User 170,User 225},{User 199,User 168}
Under my assumption, single linkage has the smallest number of singleton pairs (15), so I consider it the purest.
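The pair counts above can also be obtained programmatically instead of reading them off the dendrogram. The sketch below is one possible approach, not necessarily how the counts above were produced: in the `merge` matrix returned by hclust(), negative entries denote individual observations, so a row with two negative entries is a merge of two singletons. It uses the built-in USArrests data as a stand-in, and `count_pairs` is a hypothetical helper name.

```r
toy <- scale(USArrests)       # stand-in data (50 rows), not the BuddyMove sample
d <- dist(toy)

# Count merges in which both sides are single observations.
count_pairs <- function(method) {
  hc <- hclust(d, method = method)
  sum(hc$merge[, 1] < 0 & hc$merge[, 2] < 0)
}

pair_counts <- sapply(c("complete", "single", "average"), count_pairs)
print(pair_counts)
```

With eclust() objects such as `single_linkage`, the same `merge` component should be available, since they are built on top of hclust().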
#cutree(single_linkage,)
cutree(single_linkage,h=1.7)
User 20 User 4 User 167 User 36 User 116 User 115 User 237 User 43 User 71 User 1 User 200 User 23
1 1 1 1 1 1 2 1 1 1 2 1
User 72 User 140 User 197 User 170 User 225 User 35 User 17 User 56 User 155 User 145 User 53 User 240
1 1 1 1 1 1 1 1 1 1 1 1
User 37 User 157 User 18 User 98 User 195 User 185 User 14 User 73 User 217 User 224 User 38 User 199
1 1 1 1 2 3 1 1 1 1 1 3
User 76 User 239 User 168 User 139 User 9 User 41 User 31 User 60 User 12 User 221 User 11 User 131
1 1 3 1 1 1 1 1 1 1 1 1
User 136 User 248
1 3
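Cutting by height and cutting by cluster count are two views of the same operation. The sketch below illustrates this on stand-in data (built-in USArrests, with an arbitrary height of 1.0 rather than the 1.7 used above): it cuts a single-linkage tree at a height, counts the resulting clusters, and confirms that asking for that many clusters gives the same partition.

```r
toy <- scale(USArrests)       # stand-in data, not the BuddyMove sample
hc <- hclust(dist(toy), method = "single")

# Cut at a fixed height, then count how many clusters that produces.
by_height <- cutree(hc, h = 1.0)
n_clusters <- length(unique(by_height))
print(n_clusters)

# Requesting the same number of clusters directly yields the same labels.
by_count <- cutree(hc, k = n_clusters)
print(identical(by_height, by_count))
```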
plot(single_linkage)
abline(h=1.7,col="red")
I get 3 clusters at height 1.7.
#### Part e
complete_linkage2<- eclust(x, "hclust", hc_method = "complete",k=3)
fviz_dend(complete_linkage2, show_labels=T, palette="jco")
single_linkage2<- eclust(x, "hclust", hc_method = "single",k=3)
fviz_dend(single_linkage2, show_labels=T, palette="jco")
average_linkage2<- eclust(x, "hclust", hc_method = "average",k=3)
fviz_dend(average_linkage2, show_labels=T, palette="jco")
Silhouette index for all types of linkage.
complete_statastics <- fpc::cluster.stats(dist(x), complete_linkage2$cluster)
complete_statastics$avg.silwidth
[1] 0.3533173
single_statastics <- fpc::cluster.stats(dist(x), single_linkage2$cluster)
single_statastics$avg.silwidth
[1] 0.2703044
average_statastics <- fpc::cluster.stats(dist(x), average_linkage2$cluster)
average_statastics$avg.silwidth
[1] 0.3775509
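The same comparison can be run without the fpc package. The sketch below is an alternative under stated assumptions: it uses base hclust() with cutree() and cluster::silhouette() in place of eclust() and fpc::cluster.stats(), and the built-in USArrests data as a stand-in, so the numbers will differ from those above.

```r
library(cluster)  # silhouette()

toy <- scale(USArrests)       # stand-in data, not the BuddyMove sample
d <- dist(toy)

# Average silhouette width of the 3-cluster cut for each linkage method.
avg_width <- sapply(c("complete", "single", "average"), function(m) {
  labels <- cutree(hclust(d, method = m), k = 3)
  mean(silhouette(labels, d)[, "sil_width"])
})

print(round(avg_width, 4))
print(names(which.max(avg_width)))  # linkage with the widest average silhouette
```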
According to the average silhouette width, average linkage is the best (0.3776), followed by complete linkage (0.3533) and single linkage (0.2703).
#### Part f
NbClust(x,method = "complete")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 9 proposed 2 as the best number of clusters
* 3 proposed 3 as the best number of clusters
* 3 proposed 5 as the best number of clusters
* 1 proposed 7 as the best number of clusters
* 1 proposed 11 as the best number of clusters
* 1 proposed 12 as the best number of clusters
* 2 proposed 13 as the best number of clusters
* 1 proposed 14 as the best number of clusters
* 2 proposed 15 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
$All.index
KL CH Hartigan CCC Scott Marriot TrCovW TraceW Friedman Rubin Cindex DB
2 14.4914 46.9075 9.1933 -0.0306 80.1498 6636265.36 1058.1850 148.6922 40.2032 1.9772 0.3928 1.0072
3 0.6058 31.8644 8.4750 -1.7867 108.2655 8509357.51 658.9692 124.7913 49.2698 2.3559 0.4817 0.9674
4 0.4633 27.3049 12.7285 -2.6062 147.8576 6853013.19 494.0157 105.7267 56.4402 2.7808 0.4534 1.1392
5 2.8163 28.6898 6.1854 -1.5870 191.3229 4489177.14 207.0485 82.8121 65.7713 3.5502 0.4378 1.1284
6 0.4082 26.7361 11.9718 -1.7656 217.0269 3866042.84 190.8952 72.8048 70.3882 4.0382 0.4135 1.1115
7 2.4003 29.6480 6.2310 -1.4929 266.7837 1945264.45 137.4747 57.2326 77.3620 5.1369 0.4383 1.0173
8 0.7864 29.2879 7.6881 -1.1590 300.4442 1295956.95 99.8483 49.9888 87.3907 5.8813 0.4737 1.0032
9 2.0405 30.5342 4.5339 -0.3742 333.6766 843806.23 64.7258 42.2542 99.6265 6.9579 0.4885 0.9831
10 1.2209 29.8992 3.8888 -0.3057 357.4652 647340.45 56.6232 38.0469 106.6295 7.7273 0.4844 0.9425
11 1.0673 29.1664 3.6622 -0.3375 395.9761 362590.85 48.7869 34.6757 119.3246 8.4786 0.5319 0.7863
12 0.4254 28.5854 7.7659 -0.3723 425.0465 241263.48 43.7133 31.6991 134.3862 9.2747 0.5645 0.7906
13 5.0649 31.3580 2.2639 0.6995 473.9297 106517.70 28.5004 26.3201 154.6270 11.1702 0.5598 0.8439
14 1.0959 30.0561 2.0633 0.3537 511.2982 58507.67 26.8222 24.8026 183.9262 11.8536 0.5547 0.8136
15 0.2971 28.8325 5.2722 -0.0146 531.4134 44917.98 25.2488 23.4581 203.5508 12.5330 0.5518 0.7759
Silhouette Duda Pseudot2 Beale Ratkowsky Ball Ptbiserial Frey McClain Dunn Hubert SDindex
2 0.3992 0.7376 7.8279 1.3094 0.4930 74.3461 0.6241 0.3151 0.5910 0.2437 0.0052 1.5176
3 0.3533 0.7044 7.9744 1.5340 0.4357 41.5971 0.6550 0.8284 0.6888 0.3175 0.0062 1.3622
4 0.3340 0.6023 15.8469 2.4387 0.3992 26.4317 0.6447 0.8799 0.8979 0.2885 0.0067 1.6247
5 0.2888 0.6402 9.5562 2.0425 0.3785 16.5624 0.5934 1.0301 1.4004 0.2675 0.0071 1.4764
6 0.2561 0.5212 11.9433 3.2821 0.3536 12.1341 0.5504 0.2625 1.7957 0.2671 0.0073 1.6565
7 0.2964 0.4385 5.1213 3.9406 0.3389 8.1761 0.5455 0.1703 2.1019 0.3231 0.0078 1.5675
8 0.3020 0.4920 8.2601 3.5310 0.3219 6.2486 0.5465 0.3411 2.1494 0.3603 0.0079 1.4893
9 0.3347 0.3897 7.8290 5.0201 0.3083 4.6949 0.5333 0.5025 2.3646 0.3987 0.0079 1.4925
10 0.3307 0.7936 0.2601 0.5004 0.2950 3.8047 0.5238 0.1855 2.4876 0.4056 0.0080 1.4612
11 0.3495 0.3693 6.8301 5.2555 0.2831 3.1523 0.5234 0.5418 2.5019 0.4464 0.0080 1.3589
12 0.3495 0.6123 8.2315 2.2621 0.2726 2.6416 0.5146 0.5471 2.6150 0.4805 0.0080 1.5709
13 0.3492 3.2551 -0.6928 -1.3327 0.2646 2.0246 0.4344 0.2383 3.8803 0.4575 0.0086 1.9501
14 0.3549 4.1310 -0.7579 -1.4580 0.2557 1.7716 0.4321 0.3750 3.9290 0.4593 0.0087 1.9432
15 0.3593 0.4727 6.6932 3.6787 0.2476 1.5639 0.4288 0.3128 3.9933 0.4612 0.0087 1.8913
Dindex SDbw
2 1.6200 1.2013
3 1.5004 0.7141
4 1.3910 0.3662
5 1.2288 0.3043
6 1.1601 0.2679
7 1.0239 0.2223
8 0.9609 0.1963
9 0.8870 0.1766
10 0.8312 0.1504
11 0.7841 0.1082
12 0.7513 0.1058
13 0.6862 0.1043
14 0.6602 0.0958
15 0.6358 0.0855
$All.CriticalValues
CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
2 0.5432 18.5029 0.2573
3 0.5190 17.6120 0.1733
4 0.5569 19.0934 0.0283
5 0.4997 17.0194 0.0667
6 0.4503 15.8722 0.0062
7 0.1924 16.7852 0.0070
8 0.3506 14.8210 0.0056
9 0.2445 15.4517 0.0011
10 -0.0981 -11.1930 0.7899
11 0.1924 16.7852 0.0014
12 0.4503 15.8722 0.0459
13 -0.0981 -11.1930 1.0000
14 -0.0981 -11.1930 1.0000
15 0.2864 14.9482 0.0059
$Best.nc
KL CH Hartigan CCC Scott Marriot TrCovW TraceW Friedman Rubin Cindex
Number_clusters 2.0000 2.0000 5.0000 13.0000 7.0000 5 3.0000 5.0000 14.0000 13.000 2.0000
Value_Index 14.4914 46.9075 6.5431 0.6995 49.7567 1740702 399.2158 12.9075 29.2992 -1.212 0.3928
DB Silhouette Duda PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain Dunn
Number_clusters 15.0000 2.0000 2.0000 2.0000 2.0000 2.000 3.000 3.000 1 2.000 12.0000
Value_Index 0.7759 0.3992 0.7376 7.8279 1.3094 0.493 32.749 0.655 NA 0.591 0.4805
Hubert SDindex Dindex SDbw
Number_clusters 0 11.0000 0 15.0000
Value_Index 0 1.3589 0 0.0855
$Best.partition
User 20 User 4 User 167 User 36 User 116 User 115 User 237 User 43 User 71 User 1 User 200 User 23
1 1 2 1 2 1 2 1 1 1 2 1
User 72 User 140 User 197 User 170 User 225 User 35 User 17 User 56 User 155 User 145 User 53 User 240
1 2 2 2 2 1 1 1 2 2 1 2
User 37 User 157 User 18 User 98 User 195 User 185 User 14 User 73 User 217 User 224 User 38 User 199
1 2 1 1 2 2 1 1 2 2 1 2
User 76 User 239 User 168 User 139 User 9 User 41 User 31 User 60 User 12 User 221 User 11 User 131
1 2 2 2 1 1 1 1 1 2 1 2
User 136 User 248
2 2
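The majority-rule summary can be reproduced from the `$Best.nc` row printed above. In the sketch below the vector is typed in by hand from that output; dropping values below 2 is an assumption made here so that the tally matches the printed summary, which does not list the graphical indices (reported as 0) or the Frey index (reported as 1 with an NA value).

```r
# Number_clusters row of $Best.nc for complete linkage, copied from the output above.
best_nc <- c(KL = 2, CH = 2, Hartigan = 5, CCC = 13, Scott = 7, Marriot = 5,
             TrCovW = 3, TraceW = 5, Friedman = 14, Rubin = 13, Cindex = 2,
             DB = 15, Silhouette = 2, Duda = 2, PseudoT2 = 2, Beale = 2,
             Ratkowsky = 2, Ball = 3, PtBiserial = 3, Frey = 1, McClain = 2,
             Dunn = 12, Hubert = 0, SDindex = 11, Dindex = 0, SDbw = 15)

# Tally the proposals, ignoring the 0 and 1 entries (assumption, see above).
votes <- table(best_nc[best_nc >= 2])
print(votes)

# Majority rule: the most frequently proposed number of clusters.
winner <- as.integer(names(votes)[which.max(votes)])
print(winner)  # 2
```

When NbClust() is run directly, the same row should be available as `res$Best.nc["Number_clusters", ]` for a result stored in a variable `res` (a hypothetical name).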
NbClust(x,method = "single")
NaNs produced
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 6 proposed 2 as the best number of clusters
* 3 proposed 3 as the best number of clusters
* 1 proposed 4 as the best number of clusters
* 7 proposed 9 as the best number of clusters
* 1 proposed 11 as the best number of clusters
* 2 proposed 13 as the best number of clusters
* 1 proposed 14 as the best number of clusters
* 3 proposed 15 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 9
*******************************************************************
$All.index
KL CH Hartigan CCC Scott Marriot TrCovW TraceW Friedman Rubin Cindex DB
2 1.1780 9.3980 9.4276 -5.8551 30.8464 17789656.0 740.4079 245.8621 13.8019 1.1958 0.4863 0.7795
3 0.5695 10.1204 0.9298 -7.8771 72.3644 17447362.3 564.9911 205.5003 26.1819 1.4307 0.4546 0.9154
4 1.2489 7.0374 0.7992 -11.0882 97.1881 18879523.5 527.0235 201.5136 30.6947 1.4590 0.4549 0.7606
5 1.0190 5.4484 0.3474 -12.1885 107.6349 23937066.0 511.0490 198.0724 32.5858 1.4843 0.4550 0.8901
6 1.1123 4.3627 0.2627 -13.2035 127.8665 22998754.0 504.8444 196.5548 37.6323 1.4958 0.4554 0.8208
7 0.1831 3.6170 16.3558 -16.8886 139.0082 25050847.7 498.4411 195.3883 39.9606 1.5047 0.4557 0.7427
8 0.7618 6.4622 33.3652 -13.8740 180.1798 14361328.6 347.0451 141.5481 58.8712 2.0770 0.4297 0.6351
9 108.7004 13.9761 1.1840 -7.8376 256.2659 3968475.9 311.4268 78.8828 89.4201 3.7270 0.4256 0.7333
10 3.1020 12.5986 1.3796 -8.5311 267.4485 3917492.8 291.5743 76.6687 97.1484 3.8347 0.4242 0.7555
11 0.0129 11.5711 9.1190 -9.1150 276.7544 3935171.9 283.2820 74.1125 99.5336 3.9669 0.4198 0.7621
12 1.3075 13.4537 8.0153 -7.6581 307.3097 2541796.1 154.8078 60.0675 112.6929 4.8945 0.5095 0.7327
13 3.0653 15.1912 3.2781 -6.4375 374.5395 777525.0 147.3928 49.6045 157.3419 5.9269 0.4893 0.6257
14 8.2436 15.0978 1.1168 -6.4226 393.5879 616071.1 124.0001 45.5674 162.8986 6.4520 0.5012 0.6164
15 0.8121 14.1303 1.0981 -7.0034 414.6928 463706.4 122.3676 44.1963 169.6937 6.6521 0.5015 0.6007
Silhouette Duda Pseudot2 Beale Ratkowsky Ball Ptbiserial Frey McClain Dunn Hubert SDindex
2 0.3169 0.8340 8.9553 0.7490 0.2566 122.9310 0.3797 1.8204 0.0837 0.3341 0.0073 1.2977
3 0.2703 0.7695 0.5990 0.7682 0.3013 68.5001 0.4695 -13.0696 0.2235 0.2920 0.0069 1.1548
4 0.2329 0.9942 0.2377 0.0218 0.2680 50.3784 0.4648 -308.6086 0.2275 0.2837 0.0068 1.1520
5 0.0397 3.6404 -0.7253 -1.3952 0.2441 39.6145 0.4258 -5.2215 0.2818 0.2812 0.0066 1.3125
6 0.0400 9.1165 0.0000 0.0000 0.2246 32.7591 0.4213 -4.8152 0.2852 0.2626 0.0066 1.4555
7 0.0753 0.7216 15.4304 1.4479 0.2105 27.9126 0.4188 0.2326 0.2870 0.2586 0.0065 1.3535
8 0.1975 0.5454 29.1740 3.1178 0.2540 17.6935 0.5370 0.4097 0.5029 0.2429 0.0064 1.2653
9 0.3254 1.0038 -0.0378 -0.0132 0.2847 8.7648 0.6213 3.8977 1.1216 0.3236 0.0069 1.4260
10 0.3285 0.9951 0.1135 0.0182 0.2715 7.6669 0.6109 2.5233 1.1734 0.3226 0.0069 1.4502
11 0.2371 0.7093 9.0186 1.5086 0.2604 6.7375 0.5904 0.2228 1.2937 0.3217 0.0069 1.4260
12 0.2861 0.5266 8.0901 3.1125 0.2574 5.0056 0.6031 0.4024 1.4387 0.4147 0.0071 1.4147
13 0.3332 0.9559 0.9218 0.1689 0.2527 3.8157 0.5987 0.5265 1.5496 0.4123 0.0073 1.4474
14 0.3309 0.4259 1.3479 2.5929 0.2455 3.2548 0.5893 -4.8499 1.6702 0.4254 0.0074 1.5205
15 0.3382 4.1310 -0.7579 -1.4580 0.2379 2.9464 0.5861 -6.3823 1.6899 0.4069 0.0074 1.6189
Dindex SDbw
2 2.0890 0.5304
3 1.8907 0.4208
4 1.8480 0.2721
5 1.8100 0.2191
6 1.7840 0.1800
7 1.7535 0.1321
8 1.4531 0.1042
9 1.1105 0.1035
10 1.0803 0.0933
11 1.0467 0.0850
12 0.9442 0.0717
13 0.8568 0.0724
14 0.8149 0.0658
15 0.7884 0.0553
$All.CriticalValues
CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
2 0.6433 24.9549 0.6107
3 0.0348 55.4763 0.6091
4 0.6319 23.8864 1.0000
5 -0.0981 -11.1930 1.0000
6 -0.3211 0.0000 NaN
7 0.6288 23.6160 0.1971
8 0.6114 22.2432 0.0060
9 0.3979 15.1322 1.0000
10 0.5503 18.7987 1.0000
11 0.5432 18.5029 0.1801
12 0.3758 14.9464 0.0108
13 0.5276 17.9093 0.9846
14 -0.0981 -11.1930 0.1356
15 -0.0981 -11.1930 1.0000
$Best.nc
KL CH Hartigan CCC Scott Marriot TrCovW TraceW Friedman Rubin Cindex
Number_clusters 9.0000 13.0000 9.0000 2.0000 9.0000 9 3.0000 9.0000 13.000 9.0000 11.0000
Value_Index 108.7004 15.1912 32.1812 -5.8551 76.0862 10341870 175.4168 60.4512 44.649 -1.5424 0.4198
DB Silhouette Duda PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain Dunn
Number_clusters 15.0000 15.0000 2.000 2.0000 2.000 3.0000 3.000 9.0000 2.0000 2.0000 14.0000
Value_Index 0.6007 0.3382 0.834 8.9553 0.749 0.3013 54.431 0.6213 1.8204 0.0837 0.4254
Hubert SDindex Dindex SDbw
Number_clusters 0 4.000 0 15.0000
Value_Index 0 1.152 0 0.0553
$Best.partition
User 20 User 4 User 167 User 36 User 116 User 115 User 237 User 43 User 71 User 1 User 200 User 23
1 1 2 1 3 4 5 1 1 1 6 1
User 72 User 140 User 197 User 170 User 225 User 35 User 17 User 56 User 155 User 145 User 53 User 240
1 3 3 3 3 1 1 1 3 3 1 2
User 37 User 157 User 18 User 98 User 195 User 185 User 14 User 73 User 217 User 224 User 38 User 199
1 3 1 1 7 8 1 1 3 2 1 8
User 76 User 239 User 168 User 139 User 9 User 41 User 31 User 60 User 12 User 221 User 11 User 131
1 2 8 3 1 1 1 1 1 2 1 3
User 136 User 248
3 9
NbClust(x,method = "average")
NaNs produced
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 8 proposed 2 as the best number of clusters
* 3 proposed 3 as the best number of clusters
* 1 proposed 4 as the best number of clusters
* 4 proposed 5 as the best number of clusters
* 1 proposed 7 as the best number of clusters
* 1 proposed 10 as the best number of clusters
* 1 proposed 13 as the best number of clusters
* 2 proposed 14 as the best number of clusters
* 2 proposed 15 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
$All.index
KL CH Hartigan CCC Scott Marriot TrCovW TraceW Friedman Rubin Cindex DB
2 42.7536 43.6317 7.6164 -0.4374 85.6793 5941492.96 1062.7221 154.0078 49.2029 1.9090 0.3961 1.0019
3 0.3419 28.4798 6.6439 -2.5570 109.3701 8323428.09 632.8272 132.9170 56.8605 2.2119 0.4971 0.9333
4 0.1333 23.3769 22.5870 -3.8772 144.9029 7270186.32 451.9154 116.4551 60.9979 2.5246 0.4871 0.8918
5 4.9560 31.0973 6.8376 -0.8754 223.4252 2362270.56 376.5825 78.1042 76.1266 3.7642 0.4676 1.0482
6 9.1609 29.3582 2.7488 -0.9457 247.4681 2103089.68 266.2071 67.8019 83.1374 4.3362 0.4924 0.9924
7 0.1251 25.8505 5.5486 -2.8580 277.7597 1561859.04 241.2669 63.8152 93.7086 4.6070 0.5223 0.8880
8 0.8923 25.2092 5.9773 -2.6594 301.7204 1263299.18 174.9706 56.5218 101.0233 5.2015 0.5098 0.8824
9 0.4078 25.3267 14.9595 -2.2616 333.6827 843704.50 137.2253 49.4800 108.3780 5.9418 0.4953 0.8287
10 9.9168 31.5989 2.6581 0.2614 385.6633 368304.77 55.8571 36.2526 120.4288 8.1098 0.4509 0.8747
11 0.5897 29.8298 3.5354 -0.1079 402.1022 320779.20 45.6249 33.9937 130.4411 8.6487 0.4430 0.8201
12 2.5897 29.1311 1.9449 -0.1797 429.3941 221171.41 38.7454 31.1682 141.2071 9.4327 0.4365 0.7523
13 0.1525 27.4893 7.8387 -0.6423 465.5747 125890.23 35.7438 29.6507 178.2512 9.9155 0.4339 0.7267
14 4.0987 30.5062 2.5521 0.5054 499.8400 73576.17 18.9938 24.4671 187.8930 12.0161 0.5193 0.7423
15 1.5299 29.6699 1.8832 0.2765 514.4226 63095.82 18.5137 22.8474 192.1438 12.8680 0.5085 0.7160
Silhouette Duda Pseudot2 Beale Ratkowsky Ball Ptbiserial Frey McClain Dunn Hubert SDindex
2 0.4173 0.6441 7.7367 1.9844 0.4830 77.0039 0.6973 0.8723 0.4332 0.2335 0.0063 1.4708
3 0.3776 0.5360 9.5212 3.0526 0.4250 44.3057 0.7039 1.7861 0.4849 0.2994 0.0068 1.2863
4 0.3739 0.5953 21.7589 2.5368 0.3874 29.1138 0.6930 0.8460 0.5450 0.2994 0.0069 1.2588
5 0.3407 0.6544 5.8094 1.8625 0.3819 15.6208 0.6405 0.5383 1.2381 0.3639 0.0073 1.4103
6 0.3227 0.7695 0.5990 0.7682 0.3574 11.3003 0.6319 0.5830 1.3598 0.4007 0.0073 1.3603
7 0.3161 0.4413 8.8639 4.2628 0.3338 9.1165 0.6308 0.9574 1.3734 0.4254 0.0073 1.2350
8 0.3334 0.5155 7.5204 3.2148 0.3174 7.0652 0.6164 0.9971 1.4850 0.4254 0.0074 1.3756
9 0.3400 0.5604 14.9024 2.8667 0.3037 5.4978 0.5968 0.5477 1.6419 0.3739 0.0073 1.4359
10 0.3528 1.0084 -0.0332 -0.0256 0.2960 3.6253 0.5147 0.3954 2.6140 0.3477 0.0082 1.4663
11 0.3573 0.2998 4.6705 5.9897 0.2834 3.0903 0.5100 0.3912 2.6811 0.3477 0.0081 1.4690
12 0.3625 3.6404 -0.7253 -1.3952 0.2729 2.5974 0.5062 0.5096 2.7372 0.3477 0.0082 1.4462
13 0.3682 0.5759 8.1000 2.5969 0.2629 2.2808 0.5038 0.4049 2.7696 0.3477 0.0082 1.5428
14 0.3721 0.8469 0.5424 0.5217 0.2559 1.7477 0.4588 0.3465 3.4606 0.4329 0.0087 1.6594
15 0.3852 5.1205 0.0000 0.0000 0.2479 1.5232 0.4532 0.3601 3.5560 0.4329 0.0087 1.6812
Dindex SDbw
2 1.6730 1.0874
3 1.5618 0.3977
4 1.4462 0.3209
5 1.1904 0.2911
6 1.1109 0.2493
7 1.0682 0.1889
8 0.9972 0.1678
9 0.9303 0.1506
10 0.8036 0.1378
11 0.7711 0.1223
12 0.7361 0.1015
13 0.7101 0.0925
14 0.6528 0.0872
15 0.6253 0.0815
$All.CriticalValues
CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
2 0.4643 16.1499 0.0769
3 0.4174 15.3565 0.0107
4 0.5992 21.4021 0.0219
5 0.4174 15.3565 0.1005
6 0.0348 55.4763 0.6091
7 0.3212 14.7957 0.0019
8 0.3506 14.8210 0.0098
9 0.5190 17.6120 0.0122
10 0.1924 16.7852 1.0000
11 0.0348 55.4763 0.0043
12 -0.0981 -11.1930 1.0000
13 0.4174 15.3565 0.0255
14 0.1255 20.9054 0.7845
15 -0.3211 0.0000 NaN
$Best.nc
KL CH Hartigan CCC Scott Marriot TrCovW TraceW Friedman Rubin Cindex
Number_clusters 2.0000 2.0000 4.0000 14.0000 5.0000 5 3.000 5.0000 13.0000 10.000 2.0000
Value_Index 42.7536 43.6317 15.9432 0.5054 78.5223 4648735 429.895 28.0486 37.0441 -1.629 0.3961
DB Silhouette Duda PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain Dunn
Number_clusters 15.000 2.0000 2.0000 2.0000 5.0000 2.000 3.0000 3.0000 1 2.0000 14.0000
Value_Index 0.716 0.4173 0.6441 7.7367 1.8625 0.483 32.6982 0.7039 NA 0.4332 0.4329
Hubert SDindex Dindex SDbw
Number_clusters 0 7.000 0 15.0000
Value_Index 0 1.235 0 0.0815
$Best.partition
User 20 User 4 User 167 User 36 User 116 User 115 User 237 User 43 User 71 User 1 User 200 User 23
1 1 2 1 1 1 2 1 1 1 2 1
User 72 User 140 User 197 User 170 User 225 User 35 User 17 User 56 User 155 User 145 User 53 User 240
1 1 2 2 2 1 1 1 1 1 1 2
User 37 User 157 User 18 User 98 User 195 User 185 User 14 User 73 User 217 User 224 User 38 User 199
1 1 1 1 2 2 1 1 2 2 1 2
User 76 User 239 User 168 User 139 User 9 User 41 User 31 User 60 User 12 User 221 User 11 User 131
1 2 2 1 1 1 1 1 1 2 1 1
User 136 User 248
1 2
plot(silhouette(cutree(complete_linkage2,3),dist(x)))
plot(silhouette(cutree(single_linkage2,3),dist(x)))
plot(silhouette(cutree(average_linkage2,5),dist(x)))
The purity criterion (the lowest number of singleton pairs) picks single linkage as the best. The clustering evaluated with NbClust gave a good silhouette index for complete linkage (0.3533 at 3 clusters), although average linkage scores slightly higher at 3 clusters (0.3776). In the plots, NbClust shows an elbow-shaped drop at 3 clusters for complete linkage, 3 for single linkage, and 5 for average linkage. The higher the silhouette index, the better defined the cluster structure.
I think complete linkage suits this dataset, since it clusters the users cleanly and gives a well-defined structure.