This dataset was populated from destination reviews published by 249 reviewers of holidayiq.com till October 2014. Reviews are split into 6 categories among destinations across South India were considered and the count of reviews in each category for every reviewer (traveler) is captured.
Attribute Information:
Attribute 1 : Unique user id
Attribute 2 : Number of reviews on stadiums, sports complex, etc.
Attribute 3 : Number of reviews on religious institutions
Attribute 4 : Number of reviews on beach, lake, river, etc.
Attribute 5 : Number of reviews on theatres, exhibitions, etc.
Attribute 6 : Number of reviews on malls, shopping places, etc.
Attribute 7 : Number of reviews on parks, picnic spots, etc.
library(tidyverse)
library(cluster)
library(factoextra)
library(dplyr)
library(NbClust)
bm <- read.csv("~\\db\\buddymove_holidayiq.csv")
head(bm)
## User.Id Sports Religious Nature Theatre Shopping Picnic
## 1 User 1 2 77 79 69 68 95
## 2 User 2 2 62 76 76 69 68
## 3 User 3 2 50 97 87 50 75
## 4 User 4 2 68 77 95 76 61
## 5 User 5 2 98 54 59 95 86
## 6 User 6 3 52 109 93 52 76
bm_reviews <- bm[,2:7]
head(bm_reviews)
## Sports Religious Nature Theatre Shopping Picnic
## 1 2 77 79 69 68 95
## 2 2 62 76 76 69 68
## 3 2 50 97 87 50 75
## 4 2 68 77 95 76 61
## 5 2 98 54 59 95 86
## 6 3 52 109 93 52 76
Applying basic statistics before implementing k-means to check to scale the data or not.
stats <- data.frame(
Min = apply(bm_reviews, 2, min), # minimum
Med = apply(bm_reviews, 2, median), # median
Mean = apply(bm_reviews, 2, mean), # mean
SD = apply(bm_reviews, 2, sd), # standard deviation
Max = apply(bm_reviews, 2, max) # maximum
)
stats <- round(stats, 1)
head(stats)
## Min Med Mean SD Max
## Sports 2 12 12.0 6.6 25
## Religious 50 104 109.8 32.5 203
## Nature 52 119 124.5 45.6 318
## Theatre 59 113 116.4 32.1 213
## Shopping 50 104 112.6 41.6 233
## Picnic 61 119 120.4 32.6 218
So, here the minimum and maximum value of the sport is less than the rest. Therefore, we have to scale the data.
bm_scaled <- scale(bm_reviews)
head(bm_scaled)
## Sports Religious Nature Theatre Shopping Picnic
## [1,] -1.509552 -1.0100142 -0.9973422 -1.4744331 -1.0740003 -0.7783943
## [2,] -1.509552 -1.4722052 -1.0630749 -1.2565864 -1.0499404 -1.6057691
## [3,] -1.509552 -1.8419580 -0.6029459 -0.9142560 -1.5070790 -1.3912645
## [4,] -1.509552 -1.2873288 -1.0411640 -0.6652884 -0.8815209 -1.8202736
## [5,] -1.509552 -0.3629468 -1.5451149 -1.7856426 -0.4243823 -1.0541859
## [6,] -1.358415 -1.7803325 -0.3400150 -0.7275303 -1.4589591 -1.3606210
stats<- data.frame(
Min = apply(bm_scaled, 2, min), # minimum
Med = apply(bm_scaled, 2, median), # median
Mean = apply(bm_scaled, 2, mean), # mean
SD = apply(bm_scaled, 2, sd), # Standard deviation
Max = apply(bm_scaled, 2, max) # maximum
)
stats <- round(stats, 1)
head(stats)
## Min Med Mean SD Max
## Sports -1.5 0.0 0 1 2.0
## Religious -1.8 -0.2 0 1 2.9
## Nature -1.6 -0.1 0 1 4.2
## Theatre -1.8 -0.1 0 1 3.0
## Shopping -1.5 -0.2 0 1 2.9
## Picnic -1.8 0.0 0 1 3.0
Initializing total within the sum of squares error: wss
wss <- 0
# For 1 to 15 cluster centers
for (i in 1:15) {
kmn <- kmeans(bm_scaled, centers = i, nstart = 20)
# Saving the total within sum of squares to wss variable
wss[i] <- kmn$tot.withinss
}
Plot total within the sum of squares vs. number of clusters
plot(1:15, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
Based on the plot, we can say that the elbow where the quality improves slowly as the k (number of clusters) increases, which means since the model complexity increases, the quality of the model is no longer improving.
Let’s do it with fviz_nbclust() to see the optimum number of clusters.
# wss - within-cluster sum of squares
fviz_nbclust(bm_scaled, kmeans, method="wss")
We will check the silhouette method as it measures the quality of a cluster. For instance, how well each point lies within its cluster.
fviz_nbclust(bm_scaled, kmeans, method="silhouette")
Moreover, we can also use GAP statistics to determine the number of clusters
gapstat <- clusGap(bm_scaled,FUN = kmeans,K.max = 25,B = 100)
fviz_gap_stat(gapstat)
According to gap statistics, the optimal number of clusters is 3, however, since our dataset is not big, I will use two clusters as the silhouette method suggested.
Let’s choose 2 clusters and execute them.
k <- 2
kmean <- kmeans(bm_scaled,centers = 2,nstart = 25)
Visualization of the cluster plot
fviz_cluster(kmean, data = bm_scaled)
Now, let’s check the number of observations in each cluster
kmean$size
## [1] 110 139
Total SSE of the clusters and each cluster respectively. SSE stands for Sum of Squared Error
print(kmean$tot.withinss)
## [1] 878.9785
print(kmean$withinss)
## [1] 510.6742 368.3043
for(i in 1:3)
{
print(i)
print(which(kmean$cluster==i))
}
## [1] 1
## [1] 51 110 113 114 115 116 117 119 121 124 126 127 130 131 137 138 139 141
## [19] 144 146 149 157 159 160 161 162 163 166 167 168 169 170 171 172 173 174
## [37] 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
## [55] 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210
## [73] 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228
## [91] 229 230 231 232 233 234 235 236 237 239 240 241 242 243 244 245 246 247
## [109] 248 249
## [1] 2
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 52 53 54 55
## [55] 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
## [73] 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
## [91] 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109
## [109] 111 112 118 120 122 123 125 128 129 132 133 134 135 136 140 142 143 145
## [127] 147 148 150 151 152 153 154 155 156 158 164 165 238
## [1] 3
## integer(0)
In cluster one, the users who have given more reviews about nature and picnic are clustered together. It is obvious that people enjoy nature prefer to spend more time with family having picnics. For instance, for cluster points 73,84,94 nature and picnic values are high than the rest. So they are clustered together. Something I found out interesting was the users who rated can be mothers/women of the family. I say so because the ratings on Religious, Shopping, and Picnic are specifically high.
In cluster two, the sports ratings play a role. If we can see the clusters more in detail, we can find the users who prefer watching movies and sports on tv than outdoors.
Now we will upload the data once again, and execute operations on Hierarchical clustering.
set.seed(1122)
DataSet <- read.csv("~\\db\\buddymove_holidayiq.csv")
DataSet
## User.Id Sports Religious Nature Theatre Shopping Picnic
## 1 User 1 2 77 79 69 68 95
## 2 User 2 2 62 76 76 69 68
## 3 User 3 2 50 97 87 50 75
## 4 User 4 2 68 77 95 76 61
## 5 User 5 2 98 54 59 95 86
## 6 User 6 3 52 109 93 52 76
## 7 User 7 3 64 85 82 73 69
## 8 User 8 3 54 107 92 54 76
## 9 User 9 3 64 108 64 54 93
## 10 User 10 3 86 76 74 74 103
## 11 User 11 3 107 54 64 103 94
## 12 User 12 3 103 60 63 102 93
## 13 User 13 3 64 82 82 75 69
## 14 User 14 3 93 54 74 103 69
## 15 User 15 3 63 82 81 78 69
## 16 User 16 3 82 79 75 75 82
## 17 User 17 5 59 131 103 54 86
## 18 User 18 5 56 124 108 56 85
## 19 User 19 4 85 67 111 65 72
## 20 User 20 5 114 83 65 114 102
## 21 User 21 4 93 82 79 79 90
## 22 User 22 4 105 52 75 113 78
## 23 User 23 5 69 118 74 66 101
## 24 User 24 4 71 123 64 59 102
## 25 User 25 5 88 94 81 79 91
## 26 User 26 5 83 99 89 74 91
## 27 User 27 5 69 133 74 54 101
## 28 User 28 5 128 53 74 117 105
## 29 User 29 5 74 123 69 61 101
## 30 User 30 5 79 93 118 90 72
## 31 User 31 4 51 115 110 51 84
## 32 User 32 5 69 93 93 81 79
## 33 User 33 4 100 53 86 112 78
## 34 User 34 4 88 69 108 71 77
## 35 User 35 4 93 84 75 84 112
## 36 User 36 4 79 86 110 93 73
## 37 User 37 4 87 72 112 63 71
## 38 User 38 5 65 128 79 56 101
## 39 User 39 4 91 66 110 69 71
## 40 User 40 4 93 79 74 92 90
## 41 User 41 5 59 118 108 63 85
## 42 User 42 5 65 103 90 80 79
## 43 User 43 4 87 89 76 89 111
## 44 User 44 4 93 85 79 85 91
## 45 User 45 5 93 94 79 79 113
## 46 User 46 5 74 130 74 53 101
## 47 User 47 4 118 59 69 119 97
## 48 User 48 4 101 61 76 113 78
## 49 User 49 5 88 99 79 83 114
## 50 User 50 4 74 87 87 79 79
## 51 User 51 14 148 59 104 208 119
## 52 User 52 5 93 89 74 89 93
## 53 User 53 4 79 103 68 103 85
## 54 User 54 4 76 92 111 90 71
## 55 User 55 8 94 140 89 118 92
## 56 User 56 8 108 109 89 89 143
## 57 User 57 6 98 89 133 79 79
## 58 User 58 8 84 118 102 94 90
## 59 User 59 8 138 74 79 143 118
## 60 User 60 6 74 109 103 89 88
## 61 User 61 8 118 104 94 94 118
## 62 User 62 8 94 128 89 128 99
## 63 User 63 8 64 155 118 64 97
## 64 User 64 6 84 108 138 103 79
## 65 User 65 8 108 109 94 99 138
## 66 User 66 6 89 108 128 116 81
## 67 User 67 6 103 79 148 74 79
## 68 User 68 6 103 109 99 89 108
## 69 User 69 8 108 99 94 94 138
## 70 User 70 8 108 109 104 79 133
## 71 User 71 8 138 59 84 138 128
## 72 User 72 8 79 143 113 69 109
## 73 User 73 8 84 153 79 64 113
## 74 User 74 8 74 148 118 64 104
## 75 User 75 8 74 153 118 59 99
## 76 User 76 6 118 64 101 133 92
## 77 User 77 6 69 153 113 59 99
## 78 User 78 6 123 68 89 128 90
## 79 User 79 6 99 104 128 64 78
## 80 User 80 6 124 59 84 143 88
## 81 User 81 6 84 131 128 93 81
## 82 User 82 6 113 64 89 141 86
## 83 User 83 6 118 64 99 134 88
## 84 User 84 8 89 143 84 64 118
## 85 User 85 6 98 89 133 69 79
## 86 User 86 6 98 104 109 91 101
## 87 User 87 8 84 143 74 74 133
## 88 User 88 8 123 64 93 148 90
## 89 User 89 6 84 113 141 98 81
## 90 User 90 6 59 148 125 59 97
## 91 User 91 8 143 69 79 143 123
## 92 User 92 8 94 133 84 125 97
## 93 User 93 8 84 108 113 93 90
## 94 User 94 8 84 138 89 69 123
## 95 User 95 6 84 108 133 98 81
## 96 User 96 6 99 84 138 69 83
## 97 User 97 8 113 94 84 109 128
## 98 User 98 6 148 64 74 138 128
## 99 User 99 8 84 138 113 64 114
## 100 User 100 6 98 79 138 79 79
## 101 User 101 8 133 84 84 133 118
## 102 User 102 6 79 113 106 87 88
## 103 User 103 8 113 104 84 94 148
## 104 User 104 6 69 138 113 79 99
## 105 User 105 6 83 103 98 99 90
## 106 User 106 8 69 143 123 71 97
## 107 User 107 8 118 94 99 99 143
## 108 User 108 6 74 133 99 74 113
## 109 User 109 12 115 94 163 109 92
## 110 User 110 12 123 124 109 114 168
## 111 User 111 12 94 133 157 133 95
## 112 User 112 12 104 123 163 128 89
## 113 User 113 14 99 198 89 74 153
## 114 User 114 14 113 148 99 148 110
## 115 User 115 14 94 163 99 94 153
## 116 User 116 12 133 104 119 129 133
## 117 User 117 14 163 69 94 188 143
## 118 User 118 12 74 158 158 74 119
## 119 User 119 12 123 124 109 129 143
## 120 User 120 12 94 137 148 113 100
## 121 User 121 12 123 129 139 99 143
## 122 User 122 12 89 163 99 79 143
## 123 User 123 14 81 173 143 79 122
## 124 User 124 14 135 84 158 139 92
## 125 User 125 12 108 113 133 114 100
## 126 User 126 14 138 119 119 124 158
## 127 User 127 12 128 122 134 109 120
## 128 User 128 14 79 203 153 69 119
## 129 User 129 12 84 150 128 104 102
## 130 User 130 12 148 69 104 173 138
## 131 User 131 14 133 124 119 124 168
## 132 User 132 12 89 128 148 111 102
## 133 User 133 12 79 183 133 70 113
## 134 User 134 12 96 113 113 99 97
## 135 User 135 14 84 161 153 84 116
## 136 User 136 12 109 135 158 123 92
## 137 User 137 12 133 74 114 175 102
## 138 User 138 14 133 139 124 109 128
## 139 User 139 12 128 124 114 111 126
## 140 User 140 10 97 123 108 119 111
## 141 User 141 12 123 104 104 134 153
## 142 User 142 10 113 89 168 89 89
## 143 User 143 12 109 148 94 133 119
## 144 User 144 10 165 64 94 173 97
## 145 User 145 12 93 123 123 124 100
## 146 User 146 14 143 98 119 163 105
## 147 User 147 12 89 158 89 89 158
## 148 User 148 10 113 111 153 84 97
## 149 User 149 12 138 86 114 158 102
## 150 User 150 12 79 158 148 69 119
## 151 User 151 14 84 178 138 87 116
## 152 User 152 14 84 168 148 80 113
## 153 User 153 14 84 173 143 69 110
## 154 User 154 12 104 130 178 118 92
## 155 User 155 10 99 123 153 128 103
## 156 User 156 12 93 128 118 119 100
## 157 User 157 12 128 119 104 119 173
## 158 User 158 14 94 178 109 69 143
## 159 User 159 12 148 69 104 168 114
## 160 User 160 12 113 114 109 119 143
## 161 User 161 12 153 74 94 163 158
## 162 User 162 22 139 163 114 203 139
## 163 User 163 14 163 99 94 158 143
## 164 User 164 12 89 178 89 79 153
## 165 User 165 12 79 183 104 69 148
## 166 User 166 12 168 69 84 198 133
## 167 User 167 22 128 158 183 158 130
## 168 User 168 22 129 188 109 104 198
## 169 User 169 16 178 74 114 198 129
## 170 User 170 20 143 139 139 139 148
## 171 User 171 20 114 198 124 114 153
## 172 User 172 18 105 158 143 129 133
## 173 User 173 20 120 193 104 148 138
## 174 User 174 20 98 193 163 99 140
## 175 User 175 22 148 174 129 129 153
## 176 User 176 20 183 104 164 193 153
## 177 User 177 18 99 203 129 94 158
## 178 User 178 22 139 178 139 163 144
## 179 User 179 22 90 188 163 109 138
## 180 User 180 18 163 84 129 193 139
## 181 User 181 18 88 178 178 89 140
## 182 User 182 18 114 144 213 133 128
## 183 User 183 22 143 124 208 134 124
## 184 User 184 20 188 94 104 208 158
## 185 User 185 18 99 228 104 84 163
## 186 User 186 18 178 94 114 188 173
## 187 User 187 22 119 198 149 114 168
## 188 User 188 20 114 203 109 84 158
## 189 User 189 18 99 193 158 99 144
## 190 User 190 22 153 139 154 154 183
## 191 User 191 20 162 94 144 198 135
## 192 User 192 18 132 149 139 119 144
## 193 User 193 18 104 193 114 99 173
## 194 User 194 18 99 183 128 114 129
## 195 User 195 20 178 89 154 198 183
## 196 User 196 20 114 243 104 79 173
## 197 User 197 22 133 183 114 183 140
## 198 User 198 22 148 149 154 149 148
## 199 User 199 22 114 213 129 94 178
## 200 User 200 22 188 124 114 183 188
## 201 User 201 25 153 159 129 139 218
## 202 User 202 18 114 153 178 138 124
## 203 User 203 22 184 84 129 233 133
## 204 User 204 18 84 193 188 80 138
## 205 User 205 18 178 129 104 183 163
## 206 User 206 18 114 163 163 148 124
## 207 User 207 16 132 94 178 109 120
## 208 User 208 20 93 183 178 89 140
## 209 User 209 16 137 104 173 109 130
## 210 User 210 18 153 74 114 188 129
## 211 User 211 18 128 114 168 124 124
## 212 User 212 22 114 228 119 99 168
## 213 User 213 22 178 89 114 203 139
## 214 User 214 18 173 84 99 193 158
## 215 User 215 20 168 89 124 183 129
## 216 User 216 18 83 198 163 84 140
## 217 User 217 22 144 173 134 173 149
## 218 User 218 18 198 64 89 203 153
## 219 User 219 20 114 208 114 119 168
## 220 User 220 18 114 183 114 109 178
## 221 User 221 18 114 173 183 128 124
## 222 User 222 18 128 134 188 89 124
## 223 User 223 18 143 154 129 129 178
## 224 User 224 18 128 163 173 133 130
## 225 User 225 18 148 99 139 168 129
## 226 User 226 18 83 218 163 74 140
## 227 User 227 22 120 168 158 144 133
## 228 User 228 22 203 99 104 223 173
## 229 User 229 18 143 139 119 139 203
## 230 User 230 22 153 144 139 144 193
## 231 User 231 18 147 144 129 149 144
## 232 User 232 18 188 74 109 213 168
## 233 User 233 25 158 154 154 154 153
## 234 User 234 25 104 208 168 134 144
## 235 User 235 25 139 153 193 158 134
## 236 User 236 25 84 247 168 109 140
## 237 User 237 25 173 89 124 233 158
## 238 User 238 8 93 119 99 89 138
## 239 User 239 22 124 168 208 148 124
## 240 User 240 18 114 158 178 158 124
## 241 User 241 20 188 94 94 223 153
## 242 User 242 25 114 238 124 104 178
## 243 User 243 18 94 188 148 99 139
## 244 User 244 25 129 318 94 89 188
## 245 User 245 18 139 148 129 129 168
## 246 User 246 22 114 228 104 84 168
## 247 User 247 20 124 178 104 158 174
## 248 User 248 20 133 149 139 144 213
## 249 User 249 20 143 149 139 159 143
SubSet<-sample_n(DataSet, 50)
rownames(SubSet)<-SubSet$User.Id
SubSet<-SubSet[2:7]
SubSet
## Sports Religious Nature Theatre Shopping Picnic
## User 202 18 114 153 178 138 124
## User 145 12 93 123 123 124 100
## User 116 12 133 104 119 129 133
## User 170 20 143 139 139 139 148
## User 209 16 137 104 173 109 130
## User 160 12 113 114 109 119 143
## User 156 12 93 128 118 119 100
## User 139 12 128 124 114 111 126
## User 122 12 89 163 99 79 143
## User 126 14 138 119 119 124 158
## User 78 6 123 68 89 128 90
## User 107 8 118 94 99 99 143
## User 200 22 188 124 114 183 188
## User 216 18 83 198 163 84 140
## User 68 6 103 109 99 89 108
## User 177 18 99 203 129 94 158
## User 203 22 184 84 129 233 133
## User 19 4 85 67 111 65 72
## User 45 5 93 94 79 79 113
## User 40 4 93 79 74 92 90
## User 197 22 133 183 114 183 140
## User 238 8 93 119 99 89 138
## User 25 5 88 94 81 79 91
## User 11 3 107 54 64 103 94
## User 246 22 114 228 104 84 168
## User 146 14 143 98 119 163 105
## User 124 14 135 84 158 139 92
## User 201 25 153 159 129 139 218
## User 52 5 93 89 74 89 93
## User 137 12 133 74 114 175 102
## User 225 18 148 99 139 168 129
## User 129 12 84 150 128 104 102
## User 58 8 84 118 102 94 90
## User 245 18 139 148 129 129 168
## User 119 12 123 124 109 129 143
## User 76 6 118 64 101 133 92
## User 247 20 124 178 104 158 174
## User 114 14 113 148 99 148 110
## User 37 4 87 72 112 63 71
## User 33 4 100 53 86 112 78
## User 47 4 118 59 69 119 97
## User 217 22 144 173 134 173 149
## User 222 18 128 134 188 89 124
## User 147 12 89 158 89 89 158
## User 164 12 89 178 89 79 153
## User 103 8 113 104 84 94 148
## User 133 12 79 183 133 70 113
## User 210 18 153 74 114 188 129
## User 192 18 132 149 139 119 144
## User 49 5 88 99 79 83 114
We have to scale the data as before.
x1<-scale(SubSet)
head(x1)
## Sports Religious Nature Theatre Shopping Picnic
## User 202 0.88051950 -0.07290709 0.75528071 2.3334988 0.52688821 -0.0427233
## User 145 -0.09064171 -0.87872231 0.05247608 0.3529057 0.15015507 -0.8079168
## User 116 -0.09064171 0.65616382 -0.39263352 0.2088625 0.28470262 0.2442242
## User 170 1.20423991 1.03988535 0.42730522 0.9290782 0.55379772 0.7224702
## User 209 0.55679910 0.80965243 -0.39263352 2.1534449 -0.25348759 0.1485751
## User 160 -0.09064171 -0.11127924 -0.15836531 -0.1512453 0.01560752 0.5630549
By using linkage methods, we will compare the results.
Let’s check the dissimilarity matrix
dis_mat <- dist(x1, method = "euclidean")
comp_linkage <- hclust(dis_mat, method = "complete")
# dendogram
plot(comp_linkage, cex = 0.6, hang = -1)
We may also check the same with the function called agnes().
comp_linkage2 <- agnes(x1, method = "complete")
# to check the agglomerative coefficient
comp_linkage2$ac
## [1] 0.8584832
The closer the result to 1, the stronger it is. Generally, we want the agglomerative coefficient to get as high as possible close to one.
Let’s now check multiple methods such as Average, Single, Ward.
m_methods <- c("average", "single", "complete", "ward")
names(m_methods) <- c("average", "single", "complete", "ward")
# to measure the coefficiency
coeff <- function(y) {
agnes(x1, method = y)$ac
}
map_dbl(m_methods, coeff)
## average single complete ward
## 0.7549854 0.6103903 0.8584832 0.9296723
comp_linkage3 <- agnes(x1, method = "ward")
pltree(comp_linkage3, cex = 0.6, hang = -1, main = "Agnes Dendrogram")
Since we want it to be as close as possible to one, ward method seems the best. Furthermore, we see that the dendogram result gives us pretty different output from our starting point.
On the other hand, we may try the divisive hierarchical clustering. At first glance the output seems the same but the logic is completely different.
comp_linkage4 <- diana(x1)
# To find the amount of clustering structure, we should use divisive coefficient
comp_linkage4$dc
## [1] 0.8514986
# dendrogram
pltree(comp_linkage4, cex = 0.6, hang = -1, main = "Diana Dendrogram")
comp_linkage5 <- hclust(dis_mat, method = "ward.D2" )
# cut tree into 4 groups
sub_grp <- cutree(comp_linkage5, k = 2)
table(sub_grp)
## sub_grp
## 1 2
## 36 14
# plots with borders
plot(comp_linkage5, cex = 0.6)
rect.hclust(comp_linkage5, k = 2, border = 2:5)
We may also plot the dendogram using triangles and it is much more informative.
To make it a little bit more comparable, even though Hierarchical clustering has very clear output, we might consider to make it a bit more comparable and using other algorithms.
fviz_cluster(list(data = x1, cluster = sub_grp))
# cut agnes() tree into 2 groups
clinkage_agnes <- agnes(x1, method = "ward")
cutree(as.hclust(clinkage_agnes), k = 2)
## User 202 User 145 User 116 User 170 User 209 User 160 User 156 User 139
## 1 1 1 1 1 1 1 1
## User 122 User 126 User 78 User 107 User 200 User 216 User 68 User 177
## 1 1 2 1 1 1 2 1
## User 203 User 19 User 45 User 40 User 197 User 238 User 25 User 11
## 1 2 2 2 1 1 2 2
## User 246 User 146 User 124 User 201 User 52 User 137 User 225 User 129
## 1 1 1 1 2 1 1 1
## User 58 User 245 User 119 User 76 User 247 User 114 User 37 User 33
## 2 1 1 2 1 1 2 2
## User 47 User 217 User 222 User 147 User 164 User 103 User 133 User 210
## 2 1 1 1 1 1 1 1
## User 192 User 49
## 1 2
# cut diana() tree into 2 groups
clinkage_diana <- diana(x1)
cutree(as.hclust(clinkage_diana), k = 2)
## User 202 User 145 User 116 User 170 User 209 User 160 User 156 User 139
## 1 2 2 1 1 2 2 2
## User 122 User 126 User 78 User 107 User 200 User 216 User 68 User 177
## 2 1 2 2 1 1 2 1
## User 203 User 19 User 45 User 40 User 197 User 238 User 25 User 11
## 1 2 2 2 1 2 2 2
## User 246 User 146 User 124 User 201 User 52 User 137 User 225 User 129
## 1 2 2 1 2 2 1 2
## User 58 User 245 User 119 User 76 User 247 User 114 User 37 User 33
## 2 1 2 2 1 2 2 2
## User 47 User 217 User 222 User 147 User 164 User 103 User 133 User 210
## 2 1 1 2 2 2 2 1
## User 192 User 49
## 1 2
The according number under the user ID shows in which clusters users are located. Then we may compare some dendograms by linking labels. But again, we have to compute distance matrix, then compute two hierarchical methods and create two dendograms that you may compare.
res.dist <- dist(x1, method = "euclidean")
# compute 2 hierarchical clusterings
hc1 <- hclust(res.dist, method = "complete")
hc2 <- hclust(res.dist, method = "ward.D2")
# create two dendrograms
dend1 <- as.dendrogram (hc1)
dend2 <- as.dendrogram (hc2)
#install.packages("dendextend")
library(dendextend)
##
## ---------------------
## Welcome to dendextend version 1.14.0
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
tanglegram(dend1, dend2)
Based on the dendogram, we can say that our output is stable and we can easily move to another method to compare.
complete_linkage<- eclust(x1, "hclust", hc_method = "complete",k=1)
fviz_dend(complete_linkage, show_labels=T, palette="jco")
single_linkage<- eclust(x1, "hclust", hc_method = "single",k=1)
fviz_dend(single_linkage, show_labels=T, palette="jco",main='Single Linkage')
Average linkage
average_linkage<- eclust(x1 ,"hclust", hc_method ="average",k=1)
fviz_dend(average_linkage ,show_labels=T,palette="jco", main='Average Linkage')
#cutree(single linkage)
cutree(single_linkage, h=1.7)
## User 202 User 145 User 116 User 170 User 209 User 160 User 156 User 139
## 1 1 1 1 1 1 1 1
## User 122 User 126 User 78 User 107 User 200 User 216 User 68 User 177
## 1 1 1 1 2 3 1 3
## User 203 User 19 User 45 User 40 User 197 User 238 User 25 User 11
## 4 1 1 1 1 1 1 1
## User 246 User 146 User 124 User 201 User 52 User 137 User 225 User 129
## 3 1 1 5 1 1 1 1
## User 58 User 245 User 119 User 76 User 247 User 114 User 37 User 33
## 1 1 1 1 1 1 1 1
## User 47 User 217 User 222 User 147 User 164 User 103 User 133 User 210
## 1 1 1 1 1 1 1 1
## User 192 User 49
## 1 1
plot(single_linkage)
abline(h=1.7, col="red")
complete_linkage2 <- eclust(x1, "hclust", hc_method = "complete", k=2)
fviz_dend(complete_linkage2, show_labels = T, palette = "jco")
single_linkage2 <- eclust(x1, "hclust", hc_method = "single", k = 2)
fviz_dend(single_linkage2, sshow_labels = T, palette = "jco")
average_linkage2 <- eclust(x1, "hclust", hc_method = "average", k = 2)
fviz_dend(average_linkage2, show_labels = T, palette = "jco")
Statistics of methods:
complete_statistics <- fpc::cluster.stats(dist(x1), complete_linkage2$cluster)
complete_statistics$avg.silwidth
## [1] 0.2468318
single_statistics2 <-fpc::cluster.stats(dist(x1), single_linkage2$cluster)
single_statistics2$avg.silwidth
## [1] 0.2630974
average_statistics <- fpc::cluster.stats(dist(x1), average_linkage2$cluster)
average_statistics$avg.silwidth
## [1] 0.3082262
According to the average silhouette index, the average linkage is the best.
#install.packages("NbClust")
library(NbClust)
NbClust(x1, method = "complete")
## Warning in pf(beale, pp, df2): NaNs produced
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 3 proposed 2 as the best number of clusters
## * 11 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 2 proposed 13 as the best number of clusters
## * 4 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 0.6849 19.5729 18.3070 -2.9768 51.3072 29919836.5 1350.2594 208.8411
## 3 5.1531 22.2001 6.8925 -2.9127 109.6610 20954946.5 979.3352 151.1812
## 4 0.6453 18.8581 6.9840 -4.0752 144.2469 18653254.3 627.9395 131.8460
## 5 0.3847 17.6448 13.3193 -5.0722 175.7808 15512275.1 362.8549 114.4670
## 6 2.5021 20.4920 6.7757 -3.0492 218.9723 9416310.1 238.9933 88.3243
## 7 4.0141 20.3621 3.0610 -2.6216 249.5471 6953521.5 197.8484 76.5381
## 8 0.8651 18.6880 2.9577 -3.0856 262.8827 6955963.7 182.0334 71.4518
## 9 0.2734 17.4477 6.5457 -3.4628 282.7657 5915087.1 159.6743 66.7511
## 10 0.9077 18.2560 7.5594 -2.7937 318.0862 3603179.8 140.3495 57.5614
## 11 1.1051 19.7841 7.5410 -1.8172 368.7975 1581243.0 100.2900 48.4122
## 12 1.7248 21.5809 4.9974 -0.8050 398.7893 1032928.4 53.5188 40.5680
## 13 1.7408 22.2005 3.2790 -0.4210 430.8889 637942.2 37.0515 35.8530
## 14 1.0450 21.9514 3.1633 -0.4744 449.4862 510053.7 29.5283 32.9342
## 15 0.4731 21.7782 6.2691 -0.5240 477.6013 333685.9 23.7058 30.2741
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 22.1278 1.4078 0.4552 1.4582 0.2468 0.6182 18.5292 2.2996 0.3393
## 3 37.9504 1.9447 0.4368 1.4533 0.2857 0.3911 9.3416 5.1343 0.3962
## 4 43.7064 2.2299 0.4430 1.1233 0.2755 0.6995 6.8748 1.5558 0.3682
## 5 51.3305 2.5684 0.5024 1.1469 0.2844 0.5756 16.2200 2.7132 0.3483
## 6 65.2189 3.3286 0.4389 1.2066 0.2638 0.6377 6.2502 2.0039 0.3412
## 7 72.6314 3.8412 0.5130 1.1460 0.2727 1.7736 -1.3085 -1.2586 0.3249
## 8 75.1001 4.1147 0.5172 1.0360 0.2798 0.3569 3.6032 4.6209 0.3075
## 9 81.7219 4.4044 0.5407 0.9838 0.2735 0.4918 9.3001 3.5781 0.2929
## 10 88.8362 5.1076 0.5565 0.9321 0.2934 0.4825 8.5805 3.6680 0.2835
## 11 98.7790 6.0728 0.5413 0.8312 0.3283 0.5588 9.4746 2.8040 0.2755
## 12 104.0417 7.2471 0.5078 0.8600 0.3253 0.2217 14.0442 10.8065 0.2680
## 13 113.1321 8.2002 0.4952 0.8207 0.3320 1.4326 -0.6039 -0.7745 0.2598
## 14 119.4751 8.9269 0.5109 0.7777 0.3425 4.1749 0.0000 0.0000 0.2518
## 15 128.7008 9.7113 0.5332 0.7096 0.3791 0.3463 11.3266 6.2253 0.2445
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 104.4205 0.3603 -0.1020 0.6633 0.1774 0.0045 1.6441 1.9176 4.7361
## 3 50.3937 0.5441 0.1180 1.0274 0.2070 0.0057 1.4556 1.6532 1.6051
## 4 32.9615 0.5544 0.6387 1.0676 0.2141 0.0058 1.1456 1.5336 0.4148
## 5 22.8934 0.5463 0.4817 1.3158 0.2542 0.0062 1.2814 1.4224 0.3672
## 6 14.7207 0.5147 0.0742 2.1848 0.2615 0.0071 1.2235 1.2555 0.3206
## 7 10.9340 0.5305 0.2122 2.2971 0.3275 0.0082 1.3848 1.1895 0.2925
## 8 8.9315 0.5304 0.3487 2.3291 0.3335 0.0087 1.2884 1.1400 0.2332
## 9 7.4168 0.5288 0.3143 2.3685 0.3507 0.0087 1.2696 1.1024 0.1990
## 10 5.7561 0.5191 0.2429 2.6236 0.3807 0.0090 1.3699 1.0227 0.1784
## 11 4.4011 0.5105 0.3868 2.9030 0.3994 0.0090 1.2709 0.9341 0.1670
## 12 3.3807 0.4632 0.2693 3.8355 0.3838 0.0100 1.4759 0.8563 0.1571
## 13 2.7579 0.4533 0.1351 4.0702 0.3893 0.0100 1.4550 0.7933 0.1338
## 14 2.3524 0.4519 0.0914 4.1242 0.4093 0.0101 1.4299 0.7565 0.1185
## 15 2.0183 0.4519 0.2612 4.1355 0.4297 0.0102 1.3486 0.7104 0.0927
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.5902 20.8338 0.0365
## 3 0.2864 14.9482 0.0007
## 4 0.4889 16.7256 0.1685
## 5 0.5432 18.5029 0.0162
## 6 0.4174 15.3565 0.0775
## 7 0.1255 20.9054 1.0000
## 8 0.0348 55.4763 0.0117
## 9 0.3758 14.9464 0.0047
## 10 0.3506 14.8210 0.0044
## 11 0.4347 15.6060 0.0165
## 12 0.1924 16.7852 0.0000
## 13 0.0348 55.4763 1.0000
## 14 -0.3211 0.0000 NaN
## 15 0.2864 14.9482 0.0001
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## Number_clusters 3.0000 13.0000 3.0000 13.000 3.0000 3 3.0000 3.0000
## Value_Index 5.1531 22.2005 11.4145 -0.421 58.3539 6663198 370.9242 38.3247
## Friedman Rubin Cindex DB Silhouette Duda PseudoT2
## Number_clusters 3.0000 3.0000 3.0000 15.0000 15.0000 2.0000 2.0000
## Value_Index 15.8226 -0.2517 0.4368 0.7096 0.3791 0.6182 18.5292
## Beale Ratkowsky Ball PtBiserial Frey McClain Dunn Hubert
## Number_clusters 4.0000 3.0000 3.0000 4.0000 1 2.0000 15.0000 0
## Value_Index 1.5558 0.3962 54.0268 0.5544 NA 0.6633 0.4297 0
## SDindex Dindex SDbw
## Number_clusters 4.0000 0 15.0000
## Value_Index 1.1456 0 0.0927
##
## $Best.partition
## User 202 User 145 User 116 User 170 User 209 User 160 User 156 User 139
## 1 2 3 3 1 3 2 3
## User 122 User 126 User 78 User 107 User 200 User 216 User 68 User 177
## 2 3 2 2 3 1 2 1
## User 203 User 19 User 45 User 40 User 197 User 238 User 25 User 11
## 3 2 2 2 3 2 2 2
## User 246 User 146 User 124 User 201 User 52 User 137 User 225 User 129
## 1 3 1 3 2 3 3 2
## User 58 User 245 User 119 User 76 User 247 User 114 User 37 User 33
## 2 3 3 2 3 2 2 2
## User 47 User 217 User 222 User 147 User 164 User 103 User 133 User 210
## 2 3 1 2 2 2 1 3
## User 192 User 49
## 3 2
NbClust(x1,method = "single")
## Warning in pf(beale, pp, df2): NaNs produced
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 6 proposed 6 as the best number of clusters
## * 2 proposed 14 as the best number of clusters
## * 3 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 2.1245 3.0110 3.1761 -6.8674 12.7562 64685768 1535.8033 276.6462
## 3 6.5860 3.1266 4.1968 -10.1472 36.2973 90889876 1464.6641 259.4770
## 4 0.2463 3.5914 5.7168 -12.4350 50.9155 120620503 1173.8047 238.2067
## 5 0.5866 4.3606 6.6677 -14.1252 75.5314 115194170 747.6437 211.8750
## 6 2.2306 5.2203 0.7071 -13.2955 117.3693 71844539 568.1615 184.5324
## 7 0.9474 4.4349 0.3634 -14.1372 121.4764 90076900 533.9666 181.6137
## 8 1.5463 3.7950 0.7028 -14.9944 134.6934 90322449 528.3895 180.0916
## 9 0.3830 3.3816 4.1048 -15.6885 151.9195 80998479 508.8208 177.1277
## 10 2.4260 3.6711 0.2598 -15.3804 180.4913 56470360 472.8815 161.0080
## 11 1.3113 3.2676 0.3973 -16.1511 200.1400 46125249 469.0072 159.9690
## 12 0.6737 2.9591 2.1413 -16.8537 211.2931 43917872 464.3716 158.3560
## 13 0.2116 2.9637 6.8360 -17.0219 229.1902 36034083 448.8294 149.9088
## 14 5.5953 3.6652 1.8930 -15.8811 264.9969 20420679 245.9067 126.5312
## 15 0.1541 3.6143 8.9269 -16.1160 287.0753 15073903 189.1657 120.2102
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 4.2421 1.0627 0.4374 0.5647 0.2831 0.9595 1.9816 0.1588 0.1383
## 3 9.7253 1.1330 0.4345 0.5513 0.2716 0.9411 2.8810 0.2358 0.1749
## 4 15.4531 1.2342 0.4952 0.5154 0.2910 0.8895 5.5926 0.4677 0.1978
## 5 25.3653 1.3876 0.4763 0.7277 0.2002 0.8685 6.3612 0.5692 0.2285
## 6 31.0112 1.5932 0.4543 0.7939 0.1994 2.6232 -1.2376 -1.5871 0.2481
## 7 31.8280 1.6188 0.4548 0.7514 0.1260 2.0754 -0.5182 -0.9968 0.2329
## 8 34.2738 1.6325 0.4553 0.7281 0.1321 2.4241 -0.5875 -1.1301 0.2194
## 9 37.2382 1.6598 0.4556 0.6797 0.1347 0.9081 3.8452 0.3793 0.2093
## 10 46.7896 1.8260 0.4640 0.6814 0.0482 13.3223 0.0000 0.0000 0.2123
## 11 55.0686 1.8379 0.4643 0.6558 0.0711 1.0445 -1.5352 -0.1596 0.2032
## 12 57.6271 1.8566 0.4677 0.8003 0.0210 1.0017 -0.0579 -0.0062 0.1958
## 13 63.7543 1.9612 0.4607 0.7713 0.0395 0.8433 6.3157 0.6942 0.1940
## 14 79.1735 2.3235 0.4369 0.7410 0.1075 0.9903 0.2936 0.0364 0.2014
## 15 85.9245 2.4457 0.4305 0.7338 0.1326 0.7869 7.8541 1.0072 0.1982
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 138.3231 0.2257 0.3068 0.0283 0.3293 0.0083 1.1587 2.1883 0.4809
## 3 86.4923 0.3200 -1.5130 0.0574 0.2985 0.0084 1.2248 2.1047 0.6416
## 4 59.5517 0.4339 4.2424 0.0842 0.3238 0.0090 1.0468 2.0114 0.2181
## 5 42.3750 0.4492 2.2467 0.1963 0.2998 0.0082 1.1425 1.8827 0.2302
## 6 30.7554 0.4567 -5.2110 0.3820 0.2812 0.0076 1.1074 1.7393 0.2230
## 7 25.9448 0.4517 -3.6543 0.3877 0.2801 0.0075 1.1717 1.7025 0.1796
## 8 22.5114 0.4471 -5.1095 0.3919 0.2596 0.0074 1.4284 1.6749 0.1452
## 9 19.6809 0.4439 1.5613 0.3957 0.2561 0.0074 1.4125 1.6411 0.1033
## 10 16.1008 0.4436 -2.5928 0.5146 0.2562 0.0076 1.4331 1.5327 0.0896
## 11 14.5426 0.4413 -5.3126 0.5173 0.2539 0.0076 1.4726 1.5039 0.0716
## 12 13.1963 0.3992 0.5715 0.6043 0.2532 0.0075 1.6610 1.4773 0.0666
## 13 11.5314 0.4071 0.7252 0.6743 0.2488 0.0075 1.6207 1.4178 0.0599
## 14 9.0379 0.4145 0.6614 1.0035 0.2432 0.0075 1.5156 1.2740 0.0614
## 15 8.0140 0.4146 -0.0967 1.1070 0.2247 0.0076 1.5015 1.2241 0.0565
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.6484 25.4817 0.9872
## 3 0.6459 25.2189 0.9645
## 4 0.6433 24.9549 0.8320
## 5 0.6349 24.1554 0.7547
## 6 0.0348 55.4763 1.0000
## 7 -0.0981 -11.1930 1.0000
## 8 -0.0981 -11.1930 1.0000
## 9 0.6222 23.0711 0.8918
## 10 -0.3211 0.0000 NaN
## 11 0.6152 22.5206 1.0000
## 12 0.6114 22.2432 1.0000
## 13 0.6075 21.9643 0.6545
## 14 0.5902 20.8338 0.9998
## 15 0.5853 20.5473 0.4222
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## Number_clusters 3.000 6.0000 15.0000 2.0000 6.0000 6 5.000 6.0000
## Value_Index 6.586 5.2203 7.0339 -6.8674 41.8379 61581991 426.161 24.4238
## Friedman Rubin Cindex DB Silhouette Duda PseudoT2
## Number_clusters 14.0000 14.0000 15.0000 4.0000 4.000 2.0000 2.0000
## Value_Index 15.4192 -0.2402 0.4305 0.5154 0.291 0.9595 1.9816
## Beale Ratkowsky Ball PtBiserial Frey McClain Dunn Hubert
## Number_clusters 2.0000 6.0000 3.0000 6.0000 1 2.0000 2.0000 0
## Value_Index 0.1588 0.2481 51.8308 0.4567 NA 0.0283 0.3293 0
## SDindex Dindex SDbw
## Number_clusters 4.0000 0 15.0000
## Value_Index 1.0468 0 0.0565
##
## $Best.partition
## User 202 User 145 User 116 User 170 User 209 User 160 User 156 User 139
## 1 1 1 1 1 1 1 1
## User 122 User 126 User 78 User 107 User 200 User 216 User 68 User 177
## 1 1 1 1 2 1 1 1
## User 203 User 19 User 45 User 40 User 197 User 238 User 25 User 11
## 1 1 1 1 1 1 1 1
## User 246 User 146 User 124 User 201 User 52 User 137 User 225 User 129
## 1 1 1 1 1 1 1 1
## User 58 User 245 User 119 User 76 User 247 User 114 User 37 User 33
## 1 1 1 1 1 1 1 1
## User 47 User 217 User 222 User 147 User 164 User 103 User 133 User 210
## 1 1 1 1 1 1 1 1
## User 192 User 49
## 1 1
NbClust(x1, method = "average")
## Warning in pf(beale, pp, df2): NaNs produced
## Warning in pf(beale, pp, df2): NaNs produced
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 5 proposed 8 as the best number of clusters
## * 1 proposed 12 as the best number of clusters
## * 2 proposed 13 as the best number of clusters
## * 2 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 1.7257 25.3138 14.6531 -1.8485 71.0440 20161706.1 1421.8373 192.4876
## 3 2.4344 23.3505 7.8929 -2.5798 108.2873 21538679.6 659.0527 147.4692
## 4 0.8801 20.3694 7.2512 -3.4639 149.5701 16769444.4 461.0836 126.2649
## 5 31.7874 19.0742 2.7076 -4.3623 180.5818 14092056.5 258.9438 109.0713
## 6 0.1051 16.3475 3.3766 -5.1707 192.7320 15914810.2 235.2447 102.8811
## 7 0.0818 14.8850 19.3518 -5.5781 213.0593 14425607.7 189.8532 95.5485
## 8 7.6518 20.7704 4.0082 -2.0424 274.5651 5506630.4 170.4845 65.8936
## 9 5.2941 19.9236 1.8971 -2.1589 294.9031 4640205.2 125.8784 60.1530
## 10 0.1635 18.2830 4.0270 -2.7791 311.4568 4114042.2 112.0897 57.4929
## 11 0.5123 18.0511 7.1107 -2.7285 339.1829 2859091.0 108.4392 52.2342
## 12 4.1572 19.5345 2.4735 -1.8099 388.2022 1276524.1 74.3614 44.1792
## 13 0.2195 18.7710 8.6290 -2.1125 399.7630 1188876.3 60.4288 41.4792
## 14 2.1988 21.4363 4.5870 -0.7154 452.7555 477770.6 33.4015 33.6350
## 15 3.1779 22.1366 1.9444 -0.3583 483.9005 294187.2 24.5451 29.8337
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 28.8120 1.5274 0.4266 1.3123 0.3082 0.6363 17.7226 2.1308 0.3812
## 3 43.2597 1.9936 0.4351 1.0943 0.2905 0.6915 6.6934 1.6095 0.4044
## 4 51.1193 2.3284 0.4302 1.2287 0.2807 0.5197 8.3176 3.2000 0.3758
## 5 54.9805 2.6955 0.4739 1.0268 0.2937 1.4284 -1.1996 -0.9230 0.3541
## 6 58.2835 2.8577 0.4716 0.8743 0.2876 0.5533 4.8445 2.6626 0.3285
## 7 65.0968 3.0770 0.4672 0.8799 0.2850 0.5421 20.2712 3.1196 0.3102
## 8 80.0809 4.4617 0.4984 0.9555 0.3112 0.6674 2.4922 1.5981 0.3112
## 9 85.7552 4.8875 0.5239 0.9348 0.3157 3.9211 0.0000 0.0000 0.2972
## 10 92.2823 5.1137 0.5226 0.8216 0.3490 0.2912 7.3017 7.0230 0.2835
## 11 101.2297 5.6285 0.5155 0.7649 0.3615 0.5352 8.6841 3.0373 0.2733
## 12 115.2300 6.6547 0.5362 0.7224 0.3850 1.4391 -0.9153 -0.8804 0.2660
## 13 118.5753 7.0879 0.5308 0.6896 0.3909 0.5588 9.4746 2.8040 0.2570
## 14 131.1214 8.7409 0.5035 0.7254 0.3877 0.4080 5.8034 4.4655 0.2515
## 15 139.0213 9.8546 0.4818 0.7087 0.3871 8.7846 0.0000 0.0000 0.2447
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 96.2438 0.5072 0.4348 0.5607 0.1803 0.0058 1.5469 1.8457 2.7073
## 3 49.1564 0.5608 0.3809 0.9369 0.2085 0.0060 1.2071 1.6336 1.1947
## 4 31.5662 0.5744 0.2992 1.1278 0.2197 0.0066 1.3503 1.5036 0.4620
## 5 21.8143 0.5806 0.5544 1.2046 0.2483 0.0067 1.1031 1.4103 0.3587
## 6 17.1469 0.5801 0.8022 1.2237 0.2483 0.0071 1.1351 1.3574 0.2730
## 7 13.6498 0.5764 0.4008 1.2748 0.2483 0.0076 1.2151 1.3077 0.2461
## 8 8.2367 0.5454 0.2756 2.2875 0.3391 0.0089 1.2037 1.0913 0.2249
## 9 6.6837 0.5427 0.3287 2.3764 0.3647 0.0090 1.2686 1.0466 0.2028
## 10 5.7493 0.5423 0.3809 2.3863 0.3647 0.0090 1.1665 1.0004 0.1557
## 11 4.7486 0.5389 0.4200 2.4508 0.3647 0.0091 1.1708 0.9390 0.1286
## 12 3.6816 0.5142 0.4144 2.8772 0.4061 0.0090 1.2791 0.8584 0.1188
## 13 3.1907 0.5106 0.3490 2.9410 0.4061 0.0090 1.2580 0.8242 0.1028
## 14 2.4025 0.4643 0.2597 3.8667 0.3942 0.0100 1.3485 0.7464 0.0990
## 15 1.9889 0.4541 0.2648 4.1032 0.3942 0.0102 1.3600 0.7005 0.0907
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.5948 21.1187 0.0518
## 3 0.4772 16.4352 0.1536
## 4 0.3758 14.9464 0.0092
## 5 0.1924 16.7852 1.0000
## 6 0.2864 14.9482 0.0305
## 7 0.5569 19.0934 0.0067
## 8 0.2445 15.4517 0.1822
## 9 -0.3211 0.0000 NaN
## 10 0.1255 20.9054 0.0006
## 11 0.3979 15.1322 0.0116
## 12 0.1255 20.9054 1.0000
## 13 0.4347 15.6060 0.0165
## 14 0.1924 16.7852 0.0036
## 15 -0.3211 0.0000 NaN
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 5.0000 2.0000 7.0000 15.0000 8.0000 8 3.0000
## Value_Index 31.7874 25.3138 15.9752 -0.3583 61.5057 8052552 762.7845
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 8.0000 8.000 8.000 2.0000 13.0000 13.0000 2.0000
## Value_Index 23.9143 14.984 -0.959 0.4266 0.6896 0.3909 0.6363
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## Number_clusters 2.0000 3.0000 3.0000 3.0000 5.0000 1 2.0000
## Value_Index 17.7226 1.6095 0.4044 47.0874 0.5806 NA 0.5607
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 12.0000 0 5.0000 0 15.0000
## Value_Index 0.4061 0 1.1031 0 0.0907
##
## $Best.partition
## User 202 User 145 User 116 User 170 User 209 User 160 User 156 User 139
## 1 2 2 1 1 2 2 2
## User 122 User 126 User 78 User 107 User 200 User 216 User 68 User 177
## 2 2 2 2 1 2 2 2
## User 203 User 19 User 45 User 40 User 197 User 238 User 25 User 11
## 1 2 2 2 1 2 2 2
## User 246 User 146 User 124 User 201 User 52 User 137 User 225 User 129
## 2 1 1 1 2 1 1 2
## User 58 User 245 User 119 User 76 User 247 User 114 User 37 User 33
## 2 1 2 2 1 2 2 2
## User 47 User 217 User 222 User 147 User 164 User 103 User 133 User 210
## 2 1 1 2 2 2 2 1
## User 192 User 49
## 1 2
plot(silhouette(cutree(complete_linkage2,2),dist(x1)))
plot(silhouette(cutree(single_linkage2,2),dist(x1)))
plot(silhouette(cutree(average_linkage2,2),dist(x1)))
Based on purity, the lowest number of singleton nodes gives us complete linkage as the best. The clustering performed with NbClust() gave us a good silhouette index for average_linkage. The higher the silhouette index, the good structure is present for clusters. According to this assumption, I think Average linkage will be suitable for the dataset since it clusters properly and gives us a better structure.
Let’s read the dataset again, choose proper columns, and implement the CLARA algorithm.
df <- read.csv("~\\db\\buddymove_holidayiq.csv")
df <- df[2:7]
As we did before, we need to scale the data once again.
df_scaled <- scale(df)
head(df_scaled)
## Sports Religious Nature Theatre Shopping Picnic
## [1,] -1.509552 -1.0100142 -0.9973422 -1.4744331 -1.0740003 -0.7783943
## [2,] -1.509552 -1.4722052 -1.0630749 -1.2565864 -1.0499404 -1.6057691
## [3,] -1.509552 -1.8419580 -0.6029459 -0.9142560 -1.5070790 -1.3912645
## [4,] -1.509552 -1.2873288 -1.0411640 -0.6652884 -0.8815209 -1.8202736
## [5,] -1.509552 -0.3629468 -1.5451149 -1.7856426 -0.4243823 -1.0541859
## [6,] -1.358415 -1.7803325 -0.3400150 -0.7275303 -1.4589591 -1.3606210
Now, we can assign two clusters and execute them.
clara_flex <- eclust(df_scaled, "clara", k=2)
summary(clara_flex)
## Object of class 'clara' from call:
## fun_clust(x = x, k = k)
## Medoids:
## Sports Religious Nature Theatre Shopping Picnic
## [1,] -1.0561401 -0.8251378 -0.5591241 -0.8520141 -0.9296407 -0.9009684
## [2,] 0.9086447 -0.1472576 0.7336194 0.8285172 0.3936552 0.3860590
## Objective function: 1.842772
## Numerical information per cluster:
## size max_diss av_diss isolation
## [1,] 111 3.643449 1.410513 1.042142
## [2,] 138 4.477561 2.190459 1.280725
## Average silhouette width per cluster:
## [1] 0.4985167 0.2183780
## Average silhouette width of best sample: 0.3432591
##
## Best sample:
## [1] 3 23 26 34 37 43 44 45 47 49 55 80 85 90 100 109 124 130 132
## [20] 137 148 152 153 158 159 160 168 172 180 181 184 187 189 198 200 205 210 212
## [39] 217 220 232 238 243 244
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 1 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
##
## Silhouette plot information for best sample:
## cluster neighbor sil_width
## 26 1 2 0.643556145
## 50 1 2 0.642877240
## 25 1 2 0.641830440
## 44 1 2 0.636672033
## 21 1 2 0.636118491
## 16 1 2 0.634231383
## 32 1 2 0.632781183
## 52 1 2 0.624357141
## 40 1 2 0.624087493
## 42 1 2 0.623698779
## 7 1 2 0.615150885
## 10 1 2 0.614920632
## 13 1 2 0.614543236
## 1 1 2 0.613888556
## 15 1 2 0.612269371
## 105 1 2 0.608367716
## 60 1 2 0.606368569
## 43 1 2 0.600861748
## 34 1 2 0.600459475
## 36 1 2 0.599812834
## 53 1 2 0.599033986
## 102 1 2 0.598454342
## 2 1 2 0.598397772
## 4 1 2 0.597558893
## 54 1 2 0.596882345
## 35 1 2 0.593456682
## 23 1 2 0.591771516
## 45 1 2 0.591117689
## 49 1 2 0.590153328
## 8 1 2 0.585237840
## 19 1 2 0.584862326
## 37 1 2 0.583217583
## 39 1 2 0.582553773
## 9 1 2 0.580465438
## 29 1 2 0.576793040
## 6 1 2 0.576500996
## 3 1 2 0.575088232
## 30 1 2 0.573374512
## 14 1 2 0.573268780
## 41 1 2 0.570785141
## 38 1 2 0.566417232
## 48 1 2 0.566379004
## 24 1 2 0.565618534
## 5 1 2 0.563591353
## 12 1 2 0.562690438
## 46 1 2 0.562107071
## 33 1 2 0.560568542
## 58 1 2 0.559124047
## 27 1 2 0.554574506
## 86 1 2 0.551743455
## 17 1 2 0.550325588
## 18 1 2 0.550135570
## 22 1 2 0.547536121
## 11 1 2 0.547523225
## 31 1 2 0.544807838
## 93 1 2 0.543676322
## 68 1 2 0.541613966
## 108 1 2 0.522853573
## 20 1 2 0.511740462
## 104 1 2 0.507619427
## 79 1 2 0.500539265
## 47 1 2 0.499157554
## 85 1 2 0.492274048
## 57 1 2 0.491176990
## 95 1 2 0.476309476
## 100 1 2 0.466080823
## 81 1 2 0.464117232
## 96 1 2 0.463936916
## 77 1 2 0.463291082
## 78 1 2 0.455831230
## 84 1 2 0.454264946
## 66 1 2 0.450848922
## 94 1 2 0.450579479
## 82 1 2 0.450106845
## 73 1 2 0.448329763
## 28 1 2 0.446240795
## 55 1 2 0.442036857
## 64 1 2 0.439462730
## 83 1 2 0.439093685
## 72 1 2 0.437930634
## 92 1 2 0.436882817
## 106 1 2 0.433348610
## 90 1 2 0.432939817
## 76 1 2 0.432306152
## 62 1 2 0.429284661
## 99 1 2 0.427911786
## 89 1 2 0.422992852
## 74 1 2 0.422658728
## 61 1 2 0.419819069
## 75 1 2 0.418211407
## 63 1 2 0.416587186
## 80 1 2 0.410904541
## 67 1 2 0.403258285
## 87 1 2 0.395497661
## 97 1 2 0.393814618
## 134 1 2 0.373553827
## 70 1 2 0.364256048
## 69 1 2 0.362292339
## 238 1 2 0.360753250
## 88 1 2 0.354918927
## 56 1 2 0.334756925
## 65 1 2 0.334477779
## 140 1 2 0.315079698
## 103 1 2 0.296334207
## 101 1 2 0.292990198
## 107 1 2 0.272311232
## 59 1 2 0.258291253
## 98 1 2 0.241246502
## 71 1 2 0.234644731
## 91 1 2 0.222675006
## 144 1 2 0.062272752
## 198 2 1 0.456810658
## 233 2 1 0.450311810
## 178 2 1 0.448621810
## 217 2 1 0.444675143
## 249 2 1 0.442505349
## 190 2 1 0.441120908
## 175 2 1 0.436885786
## 170 2 1 0.435480280
## 227 2 1 0.431924368
## 230 2 1 0.428019231
## 187 2 1 0.426655184
## 223 2 1 0.415573140
## 245 2 1 0.415188664
## 167 2 1 0.415083706
## 231 2 1 0.411585874
## 235 2 1 0.408265904
## 224 2 1 0.404590320
## 234 2 1 0.402027773
## 192 2 1 0.397421755
## 248 2 1 0.391270430
## 201 2 1 0.388631062
## 197 2 1 0.388020774
## 206 2 1 0.383085571
## 171 2 1 0.381683583
## 247 2 1 0.379555387
## 176 2 1 0.375076366
## 240 2 1 0.375039166
## 162 2 1 0.374185280
## 199 2 1 0.369533262
## 200 2 1 0.369289860
## 219 2 1 0.369158110
## 202 2 1 0.369146961
## 229 2 1 0.369075394
## 239 2 1 0.369033178
## 221 2 1 0.366936000
## 195 2 1 0.362830627
## 242 2 1 0.359448378
## 174 2 1 0.356708145
## 179 2 1 0.356559562
## 168 2 1 0.355317630
## 191 2 1 0.354734394
## 172 2 1 0.352441222
## 212 2 1 0.352049542
## 205 2 1 0.350289354
## 189 2 1 0.347631459
## 183 2 1 0.346254187
## 220 2 1 0.345311028
## 173 2 1 0.337487475
## 186 2 1 0.336446484
## 225 2 1 0.336237380
## 237 2 1 0.334393185
## 208 2 1 0.324470775
## 211 2 1 0.323009101
## 213 2 1 0.322368502
## 182 2 1 0.321838234
## 180 2 1 0.320749099
## 215 2 1 0.319245646
## 236 2 1 0.319086191
## 177 2 1 0.316932299
## 243 2 1 0.316615383
## 184 2 1 0.314945301
## 193 2 1 0.312172644
## 228 2 1 0.310994783
## 203 2 1 0.304248638
## 246 2 1 0.298525909
## 188 2 1 0.295904705
## 181 2 1 0.294305489
## 232 2 1 0.289855623
## 222 2 1 0.288853639
## 214 2 1 0.287869340
## 131 2 1 0.287805879
## 241 2 1 0.285089218
## 194 2 1 0.284141798
## 209 2 1 0.277186729
## 216 2 1 0.274468948
## 196 2 1 0.273783741
## 126 2 1 0.271117069
## 204 2 1 0.267131751
## 226 2 1 0.253762794
## 210 2 1 0.247778475
## 244 2 1 0.246695095
## 169 2 1 0.241180718
## 185 2 1 0.239178075
## 218 2 1 0.227749076
## 207 2 1 0.225510219
## 163 2 1 0.193521142
## 138 2 1 0.191430285
## 117 2 1 0.172416797
## 121 2 1 0.169756179
## 157 2 1 0.164825546
## 110 2 1 0.159284363
## 161 2 1 0.132357583
## 146 2 1 0.116684114
## 51 2 1 0.113485023
## 119 2 1 0.107344710
## 130 2 1 0.106539620
## 166 2 1 0.103615958
## 128 2 1 0.099660801
## 116 2 1 0.088096800
## 124 2 1 0.087656607
## 141 2 1 0.081883315
## 154 2 1 0.073330373
## 135 2 1 0.069184942
## 127 2 1 0.068498134
## 113 2 1 0.060183401
## 123 2 1 0.060059761
## 115 2 1 0.060058066
## 151 2 1 0.055141570
## 136 2 1 0.043787456
## 152 2 1 0.041169708
## 158 2 1 0.032119560
## 114 2 1 0.025880582
## 111 2 1 0.018570899
## 112 2 1 0.014934385
## 159 2 1 0.004999305
## 139 2 1 -0.002706371
## 160 2 1 -0.007050038
## 153 2 1 -0.014538329
## 118 2 1 -0.026612485
## 137 2 1 -0.029296656
## 149 2 1 -0.035595583
## 120 2 1 -0.050952615
## 155 2 1 -0.051059592
## 164 2 1 -0.057010644
## 109 2 1 -0.058634219
## 165 2 1 -0.059816082
## 147 2 1 -0.068693695
## 150 2 1 -0.069097424
## 143 2 1 -0.090706443
## 133 2 1 -0.092684625
## 132 2 1 -0.095503832
## 122 2 1 -0.114963961
## 142 2 1 -0.158925691
## 125 2 1 -0.160256494
## 148 2 1 -0.182913886
## 129 2 1 -0.185878026
## 145 2 1 -0.217714469
## 156 2 1 -0.249876571
##
## 946 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2143 2.3296 3.2648 3.3623 4.3727 7.3740
## Metric : euclidean
## Number of objects : 44
##
## Available components:
## [1] "sample" "medoids" "i.med" "clustering" "objective"
## [6] "clusinfo" "diss" "call" "silinfo" "data"
## [11] "clust_plot" "nbclust"
fviz_cluster(clara_flex)
fviz_silhouette(clara_flex)
## cluster size ave.sil.width
## 1 1 111 0.50
## 2 2 138 0.22
Since we have the silhouette results, we can say that the one is close to 1 has the highest quality which means here the first cluster has the highest quality with average silhouette width of 0.50. And we have a lot of errors/probable mistakes in cluster number 2. When the silhouette width for a particular object is close to zero, this is like a boundary line. We are not sure how to assign this. On the other hand, when the index is less than zero, it means that we did a mistake.
I will use another approach with 2 clusters and 6 samples specified by using Euclidean distance.
clara_clust <- clara(df_scaled, 2, metric = "euclidean", stand = FALSE, samples = 6,
sampsize = 50, trace = 0, medoids.x = TRUE,
rngR = FALSE, pamLike = FALSE, correct.d = TRUE)
class(clara_clust)
## [1] "clara" "partition"
clara_clust
## Call: clara(x = df_scaled, k = 2, metric = "euclidean", stand = FALSE, samples = 6, sampsize = 50, trace = 0, medoids.x = TRUE, rngR = FALSE, pamLike = FALSE, correct.d = TRUE)
## Medoids:
## Sports Religious Nature Theatre Shopping Picnic
## [1,] -0.9050028 -0.3629468 -0.4495696 -0.2295951 -0.5206220 -0.5945333
## [2,] 1.2109193 1.0236262 0.3173122 0.7040334 0.6342544 0.8457116
## Objective function: 1.80886
## Clustering vector: int [1:249] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
## Cluster sizes: 148 101
## Best sample:
## [1] 2 8 11 16 18 19 32 34 37 38 43 44 45 52 55 85 86 88 90
## [20] 91 100 106 107 109 113 114 124 126 130 132 135 136 144 148 149 159 160 170
## [39] 174 180 181 185 187 189 205 210 216 223 227 232
##
## Available components:
## [1] "sample" "medoids" "i.med" "clustering" "objective"
## [6] "clusinfo" "diss" "call" "silinfo" "data"
Visualization of the results
fviz_cluster(clara_clust, geom = "point", ellipse.type = "norm")
fviz_cluster(clara_clust, palette = c("#00AFBB", "#FC4E07", "#E7B800"), ellipse.type = "t", geom = "point", pointsize = 1, ggtheme = theme_classic())
fviz_silhouette(clara_clust)
## cluster size ave.sil.width
## 1 1 33 0.35
## 2 2 17 0.26
According to the silhouette measure, we can say that the first cluster is the best one so far with a 0.35 average silhouette index width.
In conclusion, based on the aforementioned results of different types of clustering algorithms, we conclude that CLARA gave us the best result so far when compared with others with average silhouette width of 0.34.
Jacek Lewkowicz. Unsupervised Learning. Presentation from the classes.
Introduction to statistical data analysis with R. Matthias Kohl. Furtwangen University.
https://cran.r-project.org/web/packages/clusterSim/clusterSim.pdf
Data - UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/BuddyMove+Data+Set#