Introduction

This dataset was populated from destination reviews published by 249 reviewers on holidayiq.com up to October 2014. Reviews of destinations across South India are split into 6 categories, and the count of reviews in each category is captured for every reviewer (traveler).

Attribute Information:

Attribute 1 : Unique user id

Attribute 2 : Number of reviews on stadiums, sports complexes, etc.

Attribute 3 : Number of reviews on religious institutions

Attribute 4 : Number of reviews on beach, lake, river, etc.

Attribute 5 : Number of reviews on theatres, exhibitions, etc.

Attribute 6 : Number of reviews on malls, shopping places, etc.

Attribute 7 : Number of reviews on parks, picnic spots, etc.

Data

library(tidyverse) 
library(cluster)
library(factoextra)
library(dplyr)
library(NbClust)

bm <- read.csv("~\\db\\buddymove_holidayiq.csv")
head(bm)
##   User.Id Sports Religious Nature Theatre Shopping Picnic
## 1  User 1      2        77     79      69       68     95
## 2  User 2      2        62     76      76       69     68
## 3  User 3      2        50     97      87       50     75
## 4  User 4      2        68     77      95       76     61
## 5  User 5      2        98     54      59       95     86
## 6  User 6      3        52    109      93       52     76

1. Selection of objects and variables

bm_reviews <- bm[,2:7]
head(bm_reviews)
##   Sports Religious Nature Theatre Shopping Picnic
## 1      2        77     79      69       68     95
## 2      2        62     76      76       69     68
## 3      2        50     97      87       50     75
## 4      2        68     77      95       76     61
## 5      2        98     54      59       95     86
## 6      3        52    109      93       52     76

2. Basic statistical applications

Before implementing k-means, we apply basic statistics to check whether or not the data need to be scaled.

stats <- data.frame(
  Min = apply(bm_reviews, 2, min),      # minimum 
  Med = apply(bm_reviews, 2, median),   # median 
  Mean = apply(bm_reviews, 2, mean),    # mean
  SD = apply(bm_reviews, 2, sd),        # standard deviation
  Max = apply(bm_reviews, 2, max)       # maximum
)
stats <- round(stats, 1)
head(stats)
##           Min Med  Mean   SD Max
## Sports      2  12  12.0  6.6  25
## Religious  50 104 109.8 32.5 203
## Nature     52 119 124.5 45.6 318
## Theatre    59 113 116.4 32.1 213
## Shopping   50 104 112.6 41.6 233
## Picnic     61 119 120.4 32.6 218

Here, the minimum and maximum values of Sports are far smaller than those of the other variables, so the variables are on very different scales. Therefore, we have to standardize (scale) the data so that no single variable dominates the distance calculations.
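As an illustrative aside (not part of the original analysis), standardization simply centres each column and divides it by its standard deviation, which is exactly what scale() computes below:

# manual z-score standardization, equivalent to scale(bm_reviews)
manual_scaled <- apply(bm_reviews, 2, function(x) (x - mean(x)) / sd(x))
head(round(manual_scaled, 3))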

bm_scaled <- scale(bm_reviews)
head(bm_scaled)
##         Sports  Religious     Nature    Theatre   Shopping     Picnic
## [1,] -1.509552 -1.0100142 -0.9973422 -1.4744331 -1.0740003 -0.7783943
## [2,] -1.509552 -1.4722052 -1.0630749 -1.2565864 -1.0499404 -1.6057691
## [3,] -1.509552 -1.8419580 -0.6029459 -0.9142560 -1.5070790 -1.3912645
## [4,] -1.509552 -1.2873288 -1.0411640 -0.6652884 -0.8815209 -1.8202736
## [5,] -1.509552 -0.3629468 -1.5451149 -1.7856426 -0.4243823 -1.0541859
## [6,] -1.358415 -1.7803325 -0.3400150 -0.7275303 -1.4589591 -1.3606210
stats<- data.frame(
  Min = apply(bm_scaled, 2, min),               # minimum
  Med = apply(bm_scaled, 2, median),            # median
  Mean = apply(bm_scaled, 2, mean),             # mean
  SD = apply(bm_scaled, 2, sd),                 # Standard deviation
  Max = apply(bm_scaled, 2, max)                # maximum
)
stats <- round(stats, 1)
head(stats)
##            Min  Med Mean SD Max
## Sports    -1.5  0.0    0  1 2.0
## Religious -1.8 -0.2    0  1 2.9
## Nature    -1.6 -0.1    0  1 4.2
## Theatre   -1.8 -0.1    0  1 3.0
## Shopping  -1.5 -0.2    0  1 2.9
## Picnic    -1.8  0.0    0  1 3.0

Clustering

Initializing the total within-cluster sum of squares (WSS) vector, wss:

wss <- 0

# For 1 to 15 cluster centers
for (i in 1:15) {
  kmn <- kmeans(bm_scaled, centers = i, nstart = 20)
# Saving the total within sum of squares to wss variable
  wss[i] <- kmn$tot.withinss
}

1. Determining the number of clusters

Plot the total within-cluster sum of squares vs. the number of clusters:

plot(1:15, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")

Based on the plot, we look for the elbow: the point after which the within-cluster sum of squares decreases only slowly as k (the number of clusters) increases. Beyond that point, increasing model complexity no longer yields a meaningful improvement in fit.
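To make the elbow a bit more concrete, here is a small illustrative sketch (assuming the wss vector computed above) showing the relative drop in WSS gained by each additional cluster; once these drops become small, adding clusters buys little:

# relative WSS reduction when moving from k to k + 1 clusters
wss_drop <- -diff(wss) / wss[-length(wss)]
round(wss_drop, 2)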

Let’s repeat this with fviz_nbclust() to see the optimal number of clusters.

# wss - within-cluster sum of squares
fviz_nbclust(bm_scaled, kmeans, method="wss") 

We will also check the silhouette method, since it measures the quality of a clustering, i.e., how well each point lies within its cluster.

fviz_nbclust(bm_scaled, kmeans, method="silhouette")

Moreover, we can use the gap statistic to determine the number of clusters.

gapstat <- clusGap(bm_scaled, FUN = kmeans, K.max = 25, B = 100)
fviz_gap_stat(gapstat)

According to the gap statistic, the optimal number of clusters is 3; however, since our dataset is not large, I will use two clusters, as the silhouette method suggested.
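As a quick, illustrative sanity check of that choice (assuming bm_scaled from above), we can compare the average silhouette width for k = 2 and k = 3 directly:

# average silhouette width for k = 2 vs. k = 3
set.seed(123)
for (k in 2:3) {
  km  <- kmeans(bm_scaled, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, dist(bm_scaled))
  cat("k =", k, "average silhouette width:", round(mean(sil[, 3]), 3), "\n")
}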

2. Choice of classification methods

2.1 K-means

Let’s choose k = 2 and run k-means.

k <- 2
kmean <- kmeans(bm_scaled, centers = k, nstart = 25)

Visualization of the cluster plot

fviz_cluster(kmean, data = bm_scaled)

Now, let’s check the number of observations in each cluster.

kmean$size
## [1] 110 139

Total SSE of the clustering and of each cluster, respectively (SSE stands for sum of squared errors):

print(kmean$tot.withinss)
## [1] 878.9785
print(kmean$withinss)
## [1] 510.6742 368.3043
# list the observations assigned to each cluster (cluster 3 does not exist,
# so the loop returns integer(0) for it)
for(i in 1:3)
{
  print(i)
  print(which(kmean$cluster==i))
}
## [1] 1
##   [1]  51 110 113 114 115 116 117 119 121 124 126 127 130 131 137 138 139 141
##  [19] 144 146 149 157 159 160 161 162 163 166 167 168 169 170 171 172 173 174
##  [37] 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
##  [55] 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210
##  [73] 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228
##  [91] 229 230 231 232 233 234 235 236 237 239 240 241 242 243 244 245 246 247
## [109] 248 249
## [1] 2
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  52  53  54  55
##  [55]  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73
##  [73]  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91
##  [91]  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109
## [109] 111 112 118 120 122 123 125 128 129 132 133 134 135 136 140 142 143 145
## [127] 147 148 150 151 152 153 154 155 156 158 164 165 238
## [1] 3
## integer(0)
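As a complementary (illustrative) quality measure, we can also look at the share of the total sum of squares that is explained by the between-cluster separation:

# proportion of total variance captured between clusters
round(kmean$betweenss / kmean$totss, 3)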

In cluster one, users who have given more reviews about nature and picnics are grouped together. It makes sense that people who enjoy nature also prefer to spend time with family on picnics; for instance, for observations 73, 84, and 94 the Nature and Picnic counts are higher than the rest, so they end up in the same cluster. Something I found interesting is that these users could be mothers/women of the family, since the review counts for Religious, Shopping, and Picnic are particularly high.

In cluster two, the Sports review counts play a larger role. Looking at the cluster in more detail, we can find users who prefer watching movies and sports on TV rather than going outdoors.
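To back up this interpretation, here is a small illustrative sketch (using the kmean fit and the unscaled bm_reviews from above) showing the per-cluster means of each review category:

# mean review counts per category within each cluster
aggregate(bm_reviews, by = list(cluster = kmean$cluster), FUN = mean)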

2.2 Hierarchical clustering

Now we will load the data once again and apply hierarchical clustering.

set.seed(1122)
DataSet <- read.csv("~\\db\\buddymove_holidayiq.csv")
head(DataSet)
##   User.Id Sports Religious Nature Theatre Shopping Picnic
## 1  User 1      2        77     79      69       68     95
## 2  User 2      2        62     76      76       69     68
## 3  User 3      2        50     97      87       50     75
## 4  User 4      2        68     77      95       76     61
## 5  User 5      2        98     54      59       95     86
## 6  User 6      3        52    109      93       52     76
# draw a random sample of 50 users and keep only the review-count columns
SubSet <- sample_n(DataSet, 50)
rownames(SubSet) <- SubSet$User.Id
SubSet <- SubSet[2:7]
SubSet
##          Sports Religious Nature Theatre Shopping Picnic
## User 202     18       114    153     178      138    124
## User 145     12        93    123     123      124    100
## User 116     12       133    104     119      129    133
## User 170     20       143    139     139      139    148
## User 209     16       137    104     173      109    130
## User 160     12       113    114     109      119    143
## User 156     12        93    128     118      119    100
## User 139     12       128    124     114      111    126
## User 122     12        89    163      99       79    143
## User 126     14       138    119     119      124    158
## User 78       6       123     68      89      128     90
## User 107      8       118     94      99       99    143
## User 200     22       188    124     114      183    188
## User 216     18        83    198     163       84    140
## User 68       6       103    109      99       89    108
## User 177     18        99    203     129       94    158
## User 203     22       184     84     129      233    133
## User 19       4        85     67     111       65     72
## User 45       5        93     94      79       79    113
## User 40       4        93     79      74       92     90
## User 197     22       133    183     114      183    140
## User 238      8        93    119      99       89    138
## User 25       5        88     94      81       79     91
## User 11       3       107     54      64      103     94
## User 246     22       114    228     104       84    168
## User 146     14       143     98     119      163    105
## User 124     14       135     84     158      139     92
## User 201     25       153    159     129      139    218
## User 52       5        93     89      74       89     93
## User 137     12       133     74     114      175    102
## User 225     18       148     99     139      168    129
## User 129     12        84    150     128      104    102
## User 58       8        84    118     102       94     90
## User 245     18       139    148     129      129    168
## User 119     12       123    124     109      129    143
## User 76       6       118     64     101      133     92
## User 247     20       124    178     104      158    174
## User 114     14       113    148      99      148    110
## User 37       4        87     72     112       63     71
## User 33       4       100     53      86      112     78
## User 47       4       118     59      69      119     97
## User 217     22       144    173     134      173    149
## User 222     18       128    134     188       89    124
## User 147     12        89    158      89       89    158
## User 164     12        89    178      89       79    153
## User 103      8       113    104      84       94    148
## User 133     12        79    183     133       70    113
## User 210     18       153     74     114      188    129
## User 192     18       132    149     139      119    144
## User 49       5        88     99      79       83    114

We have to scale the data as before.

x1<-scale(SubSet)
head(x1)
##               Sports   Religious      Nature    Theatre    Shopping     Picnic
## User 202  0.88051950 -0.07290709  0.75528071  2.3334988  0.52688821 -0.0427233
## User 145 -0.09064171 -0.87872231  0.05247608  0.3529057  0.15015507 -0.8079168
## User 116 -0.09064171  0.65616382 -0.39263352  0.2088625  0.28470262  0.2442242
## User 170  1.20423991  1.03988535  0.42730522  0.9290782  0.55379772  0.7224702
## User 209  0.55679910  0.80965243 -0.39263352  2.1534449 -0.25348759  0.1485751
## User 160 -0.09064171 -0.11127924 -0.15836531 -0.1512453  0.01560752  0.5630549

2.2.1 Method comparisons

We will compare the results of different linkage methods.

Let’s first compute the dissimilarity (distance) matrix.

dis_mat <- dist(x1, method = "euclidean")

comp_linkage <- hclust(dis_mat, method = "complete")

# dendrogram
plot(comp_linkage, cex = 0.6, hang = -1)

We may also do the same with the agnes() function, which additionally reports the agglomerative coefficient.

comp_linkage2 <- agnes(x1, method = "complete")

# to check the agglomerative coefficient
comp_linkage2$ac
## [1] 0.8584832

The closer the coefficient is to 1, the stronger the clustering structure that has been found, so we generally want the agglomerative coefficient to be as close to one as possible.

Let’s now compare several linkage methods: average, single, complete, and Ward.

m_methods <- c("average", "single", "complete", "ward")

names(m_methods) <- c("average", "single", "complete", "ward")

# function to compute the agglomerative coefficient for a given linkage method
coeff <- function(y) {
  agnes(x1, method = y)$ac
}

map_dbl(m_methods, coeff)
##   average    single  complete      ward 
## 0.7549854 0.6103903 0.8584832 0.9296723
comp_linkage3 <- agnes(x1, method = "ward")
pltree(comp_linkage3, cex = 0.6, hang = -1, main = "Agnes Dendrogram")

Since we want the coefficient to be as close to one as possible, Ward’s method seems the best. Furthermore, the resulting dendrogram looks quite different from the one we started with.

On the other hand, we may try divisive hierarchical clustering. At first glance the output looks similar, but the logic is completely different: DIANA starts with all observations in a single cluster and splits it recursively, instead of merging from the bottom up.

comp_linkage4 <- diana(x1)
# the divisive coefficient measures the amount of clustering structure found
comp_linkage4$dc
## [1] 0.8514986
# dendrogram
pltree(comp_linkage4, cex = 0.6, hang = -1, main = "Diana Dendrogram")

comp_linkage5 <- hclust(dis_mat, method = "ward.D2" )
# cut the tree into 2 groups
sub_grp <- cutree(comp_linkage5, k = 2)

table(sub_grp)
## sub_grp
##  1  2 
## 36 14
# plots with borders
plot(comp_linkage5, cex = 0.6)
rect.hclust(comp_linkage5, k = 2, border = 2:5)

We may also plot the dendrogram using triangular branches, which can be more informative.
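As an illustrative sketch (base R supports type = "triangle" for dendrogram objects), the same Ward tree can be drawn with triangular branches:

# triangle-style dendrogram of the ward.D2 clustering
plot(as.dendrogram(comp_linkage5), type = "triangle", main = "Triangle Dendrogram")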

Even though hierarchical clustering produces a very clear output, to make the result easier to compare with other algorithms we can also visualize the cut tree as a cluster plot.

fviz_cluster(list(data = x1, cluster = sub_grp))

# cut agnes() tree into 2 groups
clinkage_agnes <- agnes(x1, method = "ward")
cutree(as.hclust(clinkage_agnes), k = 2)
## User 202 User 145 User 116 User 170 User 209 User 160 User 156 User 139 
##        1        1        1        1        1        1        1        1 
## User 122 User 126  User 78 User 107 User 200 User 216  User 68 User 177 
##        1        1        2        1        1        1        2        1 
## User 203  User 19  User 45  User 40 User 197 User 238  User 25  User 11 
##        1        2        2        2        1        1        2        2 
## User 246 User 146 User 124 User 201  User 52 User 137 User 225 User 129 
##        1        1        1        1        2        1        1        1 
##  User 58 User 245 User 119  User 76 User 247 User 114  User 37  User 33 
##        2        1        1        2        1        1        2        2 
##  User 47 User 217 User 222 User 147 User 164 User 103 User 133 User 210 
##        2        1        1        1        1        1        1        1 
## User 192  User 49 
##        1        2
# cut diana() tree into 2 groups
clinkage_diana <- diana(x1)
cutree(as.hclust(clinkage_diana), k = 2)
## User 202 User 145 User 116 User 170 User 209 User 160 User 156 User 139 
##        1        2        2        1        1        2        2        2 
## User 122 User 126  User 78 User 107 User 200 User 216  User 68 User 177 
##        2        1        2        2        1        1        2        1 
## User 203  User 19  User 45  User 40 User 197 User 238  User 25  User 11 
##        1        2        2        2        1        2        2        2 
## User 246 User 146 User 124 User 201  User 52 User 137 User 225 User 129 
##        1        2        2        1        2        2        1        2 
##  User 58 User 245 User 119  User 76 User 247 User 114  User 37  User 33 
##        2        1        2        2        1        2        2        2 
##  User 47 User 217 User 222 User 147 User 164 User 103 User 133 User 210 
##        2        1        1        2        2        2        2        1 
## User 192  User 49 
##        1        2

The number under each user ID shows which cluster that user belongs to. We can also compare dendrograms directly by linking their labels: we compute the distance matrix, run two hierarchical methods, and create two dendrograms to compare side by side in a tanglegram.

res.dist <- dist(x1, method = "euclidean")

# compute 2 hierarchical clusterings
hc1 <- hclust(res.dist, method = "complete")
hc2 <- hclust(res.dist, method = "ward.D2")

# create two dendrograms
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)

#install.packages("dendextend")
library(dendextend)
tanglegram(dend1, dend2)

Based on the tanglegram, we can say that our output is stable, and we can move on to other methods for comparison.
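If we want to quantify that stability rather than just eyeball it, dendextend’s entanglement() gives a score between 0 (perfect alignment) and 1 (maximal entanglement); a small illustrative sketch:

# align the two dendrograms and measure their entanglement
dendlist(dend1, dend2) %>%
  untangle(method = "step1side") %>%
  entanglement()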

complete_linkage <- eclust(x1, "hclust", hc_method = "complete", k=1)
fviz_dend(complete_linkage, show_labels=T, palette="jco", main='Complete Linkage')

single_linkage<- eclust(x1, "hclust", hc_method = "single",k=1)

fviz_dend(single_linkage, show_labels=T, palette="jco",main='Single Linkage')

Average linkage

average_linkage<- eclust(x1 ,"hclust", hc_method ="average",k=1)

fviz_dend(average_linkage ,show_labels=T,palette="jco", main='Average Linkage')

# cut the single-linkage tree at height 1.7
cutree(single_linkage, h=1.7)
## User 202 User 145 User 116 User 170 User 209 User 160 User 156 User 139 
##        1        1        1        1        1        1        1        1 
## User 122 User 126  User 78 User 107 User 200 User 216  User 68 User 177 
##        1        1        1        1        2        3        1        3 
## User 203  User 19  User 45  User 40 User 197 User 238  User 25  User 11 
##        4        1        1        1        1        1        1        1 
## User 246 User 146 User 124 User 201  User 52 User 137 User 225 User 129 
##        3        1        1        5        1        1        1        1 
##  User 58 User 245 User 119  User 76 User 247 User 114  User 37  User 33 
##        1        1        1        1        1        1        1        1 
##  User 47 User 217 User 222 User 147 User 164 User 103 User 133 User 210 
##        1        1        1        1        1        1        1        1 
## User 192  User 49 
##        1        1
plot(single_linkage)
abline(h=1.7, col="red")

complete_linkage2 <- eclust(x1, "hclust", hc_method = "complete", k=2)
fviz_dend(complete_linkage2, show_labels = T, palette = "jco")

single_linkage2 <- eclust(x1, "hclust", hc_method = "single", k = 2)
fviz_dend(single_linkage2, show_labels = T, palette = "jco")

average_linkage2 <- eclust(x1, "hclust", hc_method = "average", k = 2)
fviz_dend(average_linkage2, show_labels = T, palette = "jco")

Statistics of methods:

complete_statistics <- fpc::cluster.stats(dist(x1), complete_linkage2$cluster)
complete_statistics$avg.silwidth
## [1] 0.2468318
single_statistics2 <-fpc::cluster.stats(dist(x1), single_linkage2$cluster)
single_statistics2$avg.silwidth
## [1] 0.2630974
average_statistics <- fpc::cluster.stats(dist(x1), average_linkage2$cluster)
average_statistics$avg.silwidth
## [1] 0.3082262

According to the average silhouette index, the average linkage is the best.
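As an optional, illustrative extra comparison, the same fpc::cluster.stats() objects also contain the Dunn index (higher values indicate more compact, better-separated clusters):

c(complete = complete_statistics$dunn,
  single   = single_statistics2$dunn,
  average  = average_statistics$dunn)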

#install.packages("NbClust")
library(NbClust)
NbClust(x1, method = "complete")
## Warning in pf(beale, pp, df2): NaNs produced

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 3 proposed 2 as the best number of clusters 
## * 11 proposed 3 as the best number of clusters 
## * 3 proposed 4 as the best number of clusters 
## * 2 proposed 13 as the best number of clusters 
## * 4 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
## $All.index
##        KL      CH Hartigan     CCC    Scott    Marriot    TrCovW   TraceW
## 2  0.6849 19.5729  18.3070 -2.9768  51.3072 29919836.5 1350.2594 208.8411
## 3  5.1531 22.2001   6.8925 -2.9127 109.6610 20954946.5  979.3352 151.1812
## 4  0.6453 18.8581   6.9840 -4.0752 144.2469 18653254.3  627.9395 131.8460
## 5  0.3847 17.6448  13.3193 -5.0722 175.7808 15512275.1  362.8549 114.4670
## 6  2.5021 20.4920   6.7757 -3.0492 218.9723  9416310.1  238.9933  88.3243
## 7  4.0141 20.3621   3.0610 -2.6216 249.5471  6953521.5  197.8484  76.5381
## 8  0.8651 18.6880   2.9577 -3.0856 262.8827  6955963.7  182.0334  71.4518
## 9  0.2734 17.4477   6.5457 -3.4628 282.7657  5915087.1  159.6743  66.7511
## 10 0.9077 18.2560   7.5594 -2.7937 318.0862  3603179.8  140.3495  57.5614
## 11 1.1051 19.7841   7.5410 -1.8172 368.7975  1581243.0  100.2900  48.4122
## 12 1.7248 21.5809   4.9974 -0.8050 398.7893  1032928.4   53.5188  40.5680
## 13 1.7408 22.2005   3.2790 -0.4210 430.8889   637942.2   37.0515  35.8530
## 14 1.0450 21.9514   3.1633 -0.4744 449.4862   510053.7   29.5283  32.9342
## 15 0.4731 21.7782   6.2691 -0.5240 477.6013   333685.9   23.7058  30.2741
##    Friedman  Rubin Cindex     DB Silhouette   Duda Pseudot2   Beale Ratkowsky
## 2   22.1278 1.4078 0.4552 1.4582     0.2468 0.6182  18.5292  2.2996    0.3393
## 3   37.9504 1.9447 0.4368 1.4533     0.2857 0.3911   9.3416  5.1343    0.3962
## 4   43.7064 2.2299 0.4430 1.1233     0.2755 0.6995   6.8748  1.5558    0.3682
## 5   51.3305 2.5684 0.5024 1.1469     0.2844 0.5756  16.2200  2.7132    0.3483
## 6   65.2189 3.3286 0.4389 1.2066     0.2638 0.6377   6.2502  2.0039    0.3412
## 7   72.6314 3.8412 0.5130 1.1460     0.2727 1.7736  -1.3085 -1.2586    0.3249
## 8   75.1001 4.1147 0.5172 1.0360     0.2798 0.3569   3.6032  4.6209    0.3075
## 9   81.7219 4.4044 0.5407 0.9838     0.2735 0.4918   9.3001  3.5781    0.2929
## 10  88.8362 5.1076 0.5565 0.9321     0.2934 0.4825   8.5805  3.6680    0.2835
## 11  98.7790 6.0728 0.5413 0.8312     0.3283 0.5588   9.4746  2.8040    0.2755
## 12 104.0417 7.2471 0.5078 0.8600     0.3253 0.2217  14.0442 10.8065    0.2680
## 13 113.1321 8.2002 0.4952 0.8207     0.3320 1.4326  -0.6039 -0.7745    0.2598
## 14 119.4751 8.9269 0.5109 0.7777     0.3425 4.1749   0.0000  0.0000    0.2518
## 15 128.7008 9.7113 0.5332 0.7096     0.3791 0.3463  11.3266  6.2253    0.2445
##        Ball Ptbiserial    Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  104.4205     0.3603 -0.1020  0.6633 0.1774 0.0045  1.6441 1.9176 4.7361
## 3   50.3937     0.5441  0.1180  1.0274 0.2070 0.0057  1.4556 1.6532 1.6051
## 4   32.9615     0.5544  0.6387  1.0676 0.2141 0.0058  1.1456 1.5336 0.4148
## 5   22.8934     0.5463  0.4817  1.3158 0.2542 0.0062  1.2814 1.4224 0.3672
## 6   14.7207     0.5147  0.0742  2.1848 0.2615 0.0071  1.2235 1.2555 0.3206
## 7   10.9340     0.5305  0.2122  2.2971 0.3275 0.0082  1.3848 1.1895 0.2925
## 8    8.9315     0.5304  0.3487  2.3291 0.3335 0.0087  1.2884 1.1400 0.2332
## 9    7.4168     0.5288  0.3143  2.3685 0.3507 0.0087  1.2696 1.1024 0.1990
## 10   5.7561     0.5191  0.2429  2.6236 0.3807 0.0090  1.3699 1.0227 0.1784
## 11   4.4011     0.5105  0.3868  2.9030 0.3994 0.0090  1.2709 0.9341 0.1670
## 12   3.3807     0.4632  0.2693  3.8355 0.3838 0.0100  1.4759 0.8563 0.1571
## 13   2.7579     0.4533  0.1351  4.0702 0.3893 0.0100  1.4550 0.7933 0.1338
## 14   2.3524     0.4519  0.0914  4.1242 0.4093 0.0101  1.4299 0.7565 0.1185
## 15   2.0183     0.4519  0.2612  4.1355 0.4297 0.0102  1.3486 0.7104 0.0927
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.5902            20.8338       0.0365
## 3          0.2864            14.9482       0.0007
## 4          0.4889            16.7256       0.1685
## 5          0.5432            18.5029       0.0162
## 6          0.4174            15.3565       0.0775
## 7          0.1255            20.9054       1.0000
## 8          0.0348            55.4763       0.0117
## 9          0.3758            14.9464       0.0047
## 10         0.3506            14.8210       0.0044
## 11         0.4347            15.6060       0.0165
## 12         0.1924            16.7852       0.0000
## 13         0.0348            55.4763       1.0000
## 14        -0.3211             0.0000          NaN
## 15         0.2864            14.9482       0.0001
## 
## $Best.nc
##                     KL      CH Hartigan    CCC   Scott Marriot   TrCovW  TraceW
## Number_clusters 3.0000 13.0000   3.0000 13.000  3.0000       3   3.0000  3.0000
## Value_Index     5.1531 22.2005  11.4145 -0.421 58.3539 6663198 370.9242 38.3247
##                 Friedman   Rubin Cindex      DB Silhouette   Duda PseudoT2
## Number_clusters   3.0000  3.0000 3.0000 15.0000    15.0000 2.0000   2.0000
## Value_Index      15.8226 -0.2517 0.4368  0.7096     0.3791 0.6182  18.5292
##                  Beale Ratkowsky    Ball PtBiserial Frey McClain    Dunn Hubert
## Number_clusters 4.0000    3.0000  3.0000     4.0000    1  2.0000 15.0000      0
## Value_Index     1.5558    0.3962 54.0268     0.5544   NA  0.6633  0.4297      0
##                 SDindex Dindex    SDbw
## Number_clusters  4.0000      0 15.0000
## Value_Index      1.1456      0  0.0927
## 
## $Best.partition
## User 202 User 145 User 116 User 170 User 209 User 160 User 156 User 139 
##        1        2        3        3        1        3        2        3 
## User 122 User 126  User 78 User 107 User 200 User 216  User 68 User 177 
##        2        3        2        2        3        1        2        1 
## User 203  User 19  User 45  User 40 User 197 User 238  User 25  User 11 
##        3        2        2        2        3        2        2        2 
## User 246 User 146 User 124 User 201  User 52 User 137 User 225 User 129 
##        1        3        1        3        2        3        3        2 
##  User 58 User 245 User 119  User 76 User 247 User 114  User 37  User 33 
##        2        3        3        2        3        2        2        2 
##  User 47 User 217 User 222 User 147 User 164 User 103 User 133 User 210 
##        2        3        1        2        2        2        1        3 
## User 192  User 49 
##        3        2
NbClust(x1, method = "single")
## Warning in pf(beale, pp, df2): NaNs produced

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 2 proposed 3 as the best number of clusters 
## * 3 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 6 proposed 6 as the best number of clusters 
## * 2 proposed 14 as the best number of clusters 
## * 3 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
## $All.index
##        KL     CH Hartigan      CCC    Scott   Marriot    TrCovW   TraceW
## 2  2.1245 3.0110   3.1761  -6.8674  12.7562  64685768 1535.8033 276.6462
## 3  6.5860 3.1266   4.1968 -10.1472  36.2973  90889876 1464.6641 259.4770
## 4  0.2463 3.5914   5.7168 -12.4350  50.9155 120620503 1173.8047 238.2067
## 5  0.5866 4.3606   6.6677 -14.1252  75.5314 115194170  747.6437 211.8750
## 6  2.2306 5.2203   0.7071 -13.2955 117.3693  71844539  568.1615 184.5324
## 7  0.9474 4.4349   0.3634 -14.1372 121.4764  90076900  533.9666 181.6137
## 8  1.5463 3.7950   0.7028 -14.9944 134.6934  90322449  528.3895 180.0916
## 9  0.3830 3.3816   4.1048 -15.6885 151.9195  80998479  508.8208 177.1277
## 10 2.4260 3.6711   0.2598 -15.3804 180.4913  56470360  472.8815 161.0080
## 11 1.3113 3.2676   0.3973 -16.1511 200.1400  46125249  469.0072 159.9690
## 12 0.6737 2.9591   2.1413 -16.8537 211.2931  43917872  464.3716 158.3560
## 13 0.2116 2.9637   6.8360 -17.0219 229.1902  36034083  448.8294 149.9088
## 14 5.5953 3.6652   1.8930 -15.8811 264.9969  20420679  245.9067 126.5312
## 15 0.1541 3.6143   8.9269 -16.1160 287.0753  15073903  189.1657 120.2102
##    Friedman  Rubin Cindex     DB Silhouette    Duda Pseudot2   Beale Ratkowsky
## 2    4.2421 1.0627 0.4374 0.5647     0.2831  0.9595   1.9816  0.1588    0.1383
## 3    9.7253 1.1330 0.4345 0.5513     0.2716  0.9411   2.8810  0.2358    0.1749
## 4   15.4531 1.2342 0.4952 0.5154     0.2910  0.8895   5.5926  0.4677    0.1978
## 5   25.3653 1.3876 0.4763 0.7277     0.2002  0.8685   6.3612  0.5692    0.2285
## 6   31.0112 1.5932 0.4543 0.7939     0.1994  2.6232  -1.2376 -1.5871    0.2481
## 7   31.8280 1.6188 0.4548 0.7514     0.1260  2.0754  -0.5182 -0.9968    0.2329
## 8   34.2738 1.6325 0.4553 0.7281     0.1321  2.4241  -0.5875 -1.1301    0.2194
## 9   37.2382 1.6598 0.4556 0.6797     0.1347  0.9081   3.8452  0.3793    0.2093
## 10  46.7896 1.8260 0.4640 0.6814     0.0482 13.3223   0.0000  0.0000    0.2123
## 11  55.0686 1.8379 0.4643 0.6558     0.0711  1.0445  -1.5352 -0.1596    0.2032
## 12  57.6271 1.8566 0.4677 0.8003     0.0210  1.0017  -0.0579 -0.0062    0.1958
## 13  63.7543 1.9612 0.4607 0.7713     0.0395  0.8433   6.3157  0.6942    0.1940
## 14  79.1735 2.3235 0.4369 0.7410     0.1075  0.9903   0.2936  0.0364    0.2014
## 15  85.9245 2.4457 0.4305 0.7338     0.1326  0.7869   7.8541  1.0072    0.1982
##        Ball Ptbiserial    Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  138.3231     0.2257  0.3068  0.0283 0.3293 0.0083  1.1587 2.1883 0.4809
## 3   86.4923     0.3200 -1.5130  0.0574 0.2985 0.0084  1.2248 2.1047 0.6416
## 4   59.5517     0.4339  4.2424  0.0842 0.3238 0.0090  1.0468 2.0114 0.2181
## 5   42.3750     0.4492  2.2467  0.1963 0.2998 0.0082  1.1425 1.8827 0.2302
## 6   30.7554     0.4567 -5.2110  0.3820 0.2812 0.0076  1.1074 1.7393 0.2230
## 7   25.9448     0.4517 -3.6543  0.3877 0.2801 0.0075  1.1717 1.7025 0.1796
## 8   22.5114     0.4471 -5.1095  0.3919 0.2596 0.0074  1.4284 1.6749 0.1452
## 9   19.6809     0.4439  1.5613  0.3957 0.2561 0.0074  1.4125 1.6411 0.1033
## 10  16.1008     0.4436 -2.5928  0.5146 0.2562 0.0076  1.4331 1.5327 0.0896
## 11  14.5426     0.4413 -5.3126  0.5173 0.2539 0.0076  1.4726 1.5039 0.0716
## 12  13.1963     0.3992  0.5715  0.6043 0.2532 0.0075  1.6610 1.4773 0.0666
## 13  11.5314     0.4071  0.7252  0.6743 0.2488 0.0075  1.6207 1.4178 0.0599
## 14   9.0379     0.4145  0.6614  1.0035 0.2432 0.0075  1.5156 1.2740 0.0614
## 15   8.0140     0.4146 -0.0967  1.1070 0.2247 0.0076  1.5015 1.2241 0.0565
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.6484            25.4817       0.9872
## 3          0.6459            25.2189       0.9645
## 4          0.6433            24.9549       0.8320
## 5          0.6349            24.1554       0.7547
## 6          0.0348            55.4763       1.0000
## 7         -0.0981           -11.1930       1.0000
## 8         -0.0981           -11.1930       1.0000
## 9          0.6222            23.0711       0.8918
## 10        -0.3211             0.0000          NaN
## 11         0.6152            22.5206       1.0000
## 12         0.6114            22.2432       1.0000
## 13         0.6075            21.9643       0.6545
## 14         0.5902            20.8338       0.9998
## 15         0.5853            20.5473       0.4222
## 
## $Best.nc
##                    KL     CH Hartigan     CCC   Scott  Marriot  TrCovW  TraceW
## Number_clusters 3.000 6.0000  15.0000  2.0000  6.0000        6   5.000  6.0000
## Value_Index     6.586 5.2203   7.0339 -6.8674 41.8379 61581991 426.161 24.4238
##                 Friedman   Rubin  Cindex     DB Silhouette   Duda PseudoT2
## Number_clusters  14.0000 14.0000 15.0000 4.0000      4.000 2.0000   2.0000
## Value_Index      15.4192 -0.2402  0.4305 0.5154      0.291 0.9595   1.9816
##                  Beale Ratkowsky    Ball PtBiserial Frey McClain   Dunn Hubert
## Number_clusters 2.0000    6.0000  3.0000     6.0000    1  2.0000 2.0000      0
## Value_Index     0.1588    0.2481 51.8308     0.4567   NA  0.0283 0.3293      0
##                 SDindex Dindex    SDbw
## Number_clusters  4.0000      0 15.0000
## Value_Index      1.0468      0  0.0565
## 
## $Best.partition
## User 202 User 145 User 116 User 170 User 209 User 160 User 156 User 139 
##        1        1        1        1        1        1        1        1 
## User 122 User 126  User 78 User 107 User 200 User 216  User 68 User 177 
##        1        1        1        1        2        1        1        1 
## User 203  User 19  User 45  User 40 User 197 User 238  User 25  User 11 
##        1        1        1        1        1        1        1        1 
## User 246 User 146 User 124 User 201  User 52 User 137 User 225 User 129 
##        1        1        1        1        1        1        1        1 
##  User 58 User 245 User 119  User 76 User 247 User 114  User 37  User 33 
##        1        1        1        1        1        1        1        1 
##  User 47 User 217 User 222 User 147 User 164 User 103 User 133 User 210 
##        1        1        1        1        1        1        1        1 
## User 192  User 49 
##        1        1
NbClust(x1, method = "average")
## Warning in pf(beale, pp, df2): NaNs produced

## Warning in pf(beale, pp, df2): NaNs produced

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 5 proposed 2 as the best number of clusters 
## * 4 proposed 3 as the best number of clusters 
## * 3 proposed 5 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 5 proposed 8 as the best number of clusters 
## * 1 proposed 12 as the best number of clusters 
## * 2 proposed 13 as the best number of clusters 
## * 2 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
## $All.index
##         KL      CH Hartigan     CCC    Scott    Marriot    TrCovW   TraceW
## 2   1.7257 25.3138  14.6531 -1.8485  71.0440 20161706.1 1421.8373 192.4876
## 3   2.4344 23.3505   7.8929 -2.5798 108.2873 21538679.6  659.0527 147.4692
## 4   0.8801 20.3694   7.2512 -3.4639 149.5701 16769444.4  461.0836 126.2649
## 5  31.7874 19.0742   2.7076 -4.3623 180.5818 14092056.5  258.9438 109.0713
## 6   0.1051 16.3475   3.3766 -5.1707 192.7320 15914810.2  235.2447 102.8811
## 7   0.0818 14.8850  19.3518 -5.5781 213.0593 14425607.7  189.8532  95.5485
## 8   7.6518 20.7704   4.0082 -2.0424 274.5651  5506630.4  170.4845  65.8936
## 9   5.2941 19.9236   1.8971 -2.1589 294.9031  4640205.2  125.8784  60.1530
## 10  0.1635 18.2830   4.0270 -2.7791 311.4568  4114042.2  112.0897  57.4929
## 11  0.5123 18.0511   7.1107 -2.7285 339.1829  2859091.0  108.4392  52.2342
## 12  4.1572 19.5345   2.4735 -1.8099 388.2022  1276524.1   74.3614  44.1792
## 13  0.2195 18.7710   8.6290 -2.1125 399.7630  1188876.3   60.4288  41.4792
## 14  2.1988 21.4363   4.5870 -0.7154 452.7555   477770.6   33.4015  33.6350
## 15  3.1779 22.1366   1.9444 -0.3583 483.9005   294187.2   24.5451  29.8337
##    Friedman  Rubin Cindex     DB Silhouette   Duda Pseudot2   Beale Ratkowsky
## 2   28.8120 1.5274 0.4266 1.3123     0.3082 0.6363  17.7226  2.1308    0.3812
## 3   43.2597 1.9936 0.4351 1.0943     0.2905 0.6915   6.6934  1.6095    0.4044
## 4   51.1193 2.3284 0.4302 1.2287     0.2807 0.5197   8.3176  3.2000    0.3758
## 5   54.9805 2.6955 0.4739 1.0268     0.2937 1.4284  -1.1996 -0.9230    0.3541
## 6   58.2835 2.8577 0.4716 0.8743     0.2876 0.5533   4.8445  2.6626    0.3285
## 7   65.0968 3.0770 0.4672 0.8799     0.2850 0.5421  20.2712  3.1196    0.3102
## 8   80.0809 4.4617 0.4984 0.9555     0.3112 0.6674   2.4922  1.5981    0.3112
## 9   85.7552 4.8875 0.5239 0.9348     0.3157 3.9211   0.0000  0.0000    0.2972
## 10  92.2823 5.1137 0.5226 0.8216     0.3490 0.2912   7.3017  7.0230    0.2835
## 11 101.2297 5.6285 0.5155 0.7649     0.3615 0.5352   8.6841  3.0373    0.2733
## 12 115.2300 6.6547 0.5362 0.7224     0.3850 1.4391  -0.9153 -0.8804    0.2660
## 13 118.5753 7.0879 0.5308 0.6896     0.3909 0.5588   9.4746  2.8040    0.2570
## 14 131.1214 8.7409 0.5035 0.7254     0.3877 0.4080   5.8034  4.4655    0.2515
## 15 139.0213 9.8546 0.4818 0.7087     0.3871 8.7846   0.0000  0.0000    0.2447
##       Ball Ptbiserial   Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  96.2438     0.5072 0.4348  0.5607 0.1803 0.0058  1.5469 1.8457 2.7073
## 3  49.1564     0.5608 0.3809  0.9369 0.2085 0.0060  1.2071 1.6336 1.1947
## 4  31.5662     0.5744 0.2992  1.1278 0.2197 0.0066  1.3503 1.5036 0.4620
## 5  21.8143     0.5806 0.5544  1.2046 0.2483 0.0067  1.1031 1.4103 0.3587
## 6  17.1469     0.5801 0.8022  1.2237 0.2483 0.0071  1.1351 1.3574 0.2730
## 7  13.6498     0.5764 0.4008  1.2748 0.2483 0.0076  1.2151 1.3077 0.2461
## 8   8.2367     0.5454 0.2756  2.2875 0.3391 0.0089  1.2037 1.0913 0.2249
## 9   6.6837     0.5427 0.3287  2.3764 0.3647 0.0090  1.2686 1.0466 0.2028
## 10  5.7493     0.5423 0.3809  2.3863 0.3647 0.0090  1.1665 1.0004 0.1557
## 11  4.7486     0.5389 0.4200  2.4508 0.3647 0.0091  1.1708 0.9390 0.1286
## 12  3.6816     0.5142 0.4144  2.8772 0.4061 0.0090  1.2791 0.8584 0.1188
## 13  3.1907     0.5106 0.3490  2.9410 0.4061 0.0090  1.2580 0.8242 0.1028
## 14  2.4025     0.4643 0.2597  3.8667 0.3942 0.0100  1.3485 0.7464 0.0990
## 15  1.9889     0.4541 0.2648  4.1032 0.3942 0.0102  1.3600 0.7005 0.0907
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.5948            21.1187       0.0518
## 3          0.4772            16.4352       0.1536
## 4          0.3758            14.9464       0.0092
## 5          0.1924            16.7852       1.0000
## 6          0.2864            14.9482       0.0305
## 7          0.5569            19.0934       0.0067
## 8          0.2445            15.4517       0.1822
## 9         -0.3211             0.0000          NaN
## 10         0.1255            20.9054       0.0006
## 11         0.3979            15.1322       0.0116
## 12         0.1255            20.9054       1.0000
## 13         0.4347            15.6060       0.0165
## 14         0.1924            16.7852       0.0036
## 15        -0.3211             0.0000          NaN
## 
## $Best.nc
##                      KL      CH Hartigan     CCC   Scott Marriot   TrCovW
## Number_clusters  5.0000  2.0000   7.0000 15.0000  8.0000       8   3.0000
## Value_Index     31.7874 25.3138  15.9752 -0.3583 61.5057 8052552 762.7845
##                  TraceW Friedman  Rubin Cindex      DB Silhouette   Duda
## Number_clusters  8.0000    8.000  8.000 2.0000 13.0000    13.0000 2.0000
## Value_Index     23.9143   14.984 -0.959 0.4266  0.6896     0.3909 0.6363
##                 PseudoT2  Beale Ratkowsky    Ball PtBiserial Frey McClain
## Number_clusters   2.0000 3.0000    3.0000  3.0000     5.0000    1  2.0000
## Value_Index      17.7226 1.6095    0.4044 47.0874     0.5806   NA  0.5607
##                    Dunn Hubert SDindex Dindex    SDbw
## Number_clusters 12.0000      0  5.0000      0 15.0000
## Value_Index      0.4061      0  1.1031      0  0.0907
## 
## $Best.partition
## User 202 User 145 User 116 User 170 User 209 User 160 User 156 User 139 
##        1        2        2        1        1        2        2        2 
## User 122 User 126  User 78 User 107 User 200 User 216  User 68 User 177 
##        2        2        2        2        1        2        2        2 
## User 203  User 19  User 45  User 40 User 197 User 238  User 25  User 11 
##        1        2        2        2        1        2        2        2 
## User 246 User 146 User 124 User 201  User 52 User 137 User 225 User 129 
##        2        1        1        1        2        1        1        2 
##  User 58 User 245 User 119  User 76 User 247 User 114  User 37  User 33 
##        2        1        2        2        1        2        2        2 
##  User 47 User 217 User 222 User 147 User 164 User 103 User 133 User 210 
##        2        1        1        2        2        2        2        1 
## User 192  User 49 
##        1        2
plot(silhouette(cutree(complete_linkage2,2),dist(x1)))

plot(silhouette(cutree(single_linkage2,2),dist(x1)))

plot(silhouette(cutree(average_linkage2,2),dist(x1)))

Judging by purity, complete linkage looks best because it produces the fewest singleton clusters. However, the run with NbClust() gave the best silhouette index to average linkage, and the higher the silhouette index, the stronger the cluster structure. On that basis I would choose average linkage for this dataset, since it partitions the observations sensibly and yields a better overall structure.
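
To make the comparison concrete, here is a minimal sketch (assuming the objects complete_linkage2, single_linkage2, average_linkage2 and the data matrix x1 from the hierarchical clustering step are still in the workspace) that computes the average silhouette width of each linkage for two clusters; the linkage with the highest value has the strongest structure.

# Minimal sketch: average silhouette width per linkage at k = 2
d <- dist(x1)
linkages <- list(complete = complete_linkage2,
                 single   = single_linkage2,
                 average  = average_linkage2)
sapply(linkages, function(hc) {
  sil <- silhouette(cutree(hc, k = 2), d)
  mean(sil[, "sil_width"])   # higher value = better-defined structure
})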

2.3 CLARA

Let’s read the dataset again, select the relevant columns, and run the CLARA algorithm.

df <- read.csv("~\\db\\buddymove_holidayiq.csv")
df <- df[2:7]

As we did before, we need to scale the data once again.

df_scaled <- scale(df)
head(df_scaled)
##         Sports  Religious     Nature    Theatre   Shopping     Picnic
## [1,] -1.509552 -1.0100142 -0.9973422 -1.4744331 -1.0740003 -0.7783943
## [2,] -1.509552 -1.4722052 -1.0630749 -1.2565864 -1.0499404 -1.6057691
## [3,] -1.509552 -1.8419580 -0.6029459 -0.9142560 -1.5070790 -1.3912645
## [4,] -1.509552 -1.2873288 -1.0411640 -0.6652884 -0.8815209 -1.8202736
## [5,] -1.509552 -0.3629468 -1.5451149 -1.7856426 -0.4243823 -1.0541859
## [6,] -1.358415 -1.7803325 -0.3400150 -0.7275303 -1.4589591 -1.3606210

Now we can run CLARA with two clusters.

clara_flex <- eclust(df_scaled, "clara", k=2)

summary(clara_flex)
## Object of class 'clara' from call:
##  fun_clust(x = x, k = k) 
## Medoids:
##          Sports  Religious     Nature    Theatre   Shopping     Picnic
## [1,] -1.0561401 -0.8251378 -0.5591241 -0.8520141 -0.9296407 -0.9009684
## [2,]  0.9086447 -0.1472576  0.7336194  0.8285172  0.3936552  0.3860590
## Objective function:    1.842772 
## Numerical information per cluster:
##      size max_diss  av_diss isolation
## [1,]  111 3.643449 1.410513  1.042142
## [2,]  138 4.477561 2.190459  1.280725
## Average silhouette width per cluster:
## [1] 0.4985167 0.2183780
## Average silhouette width of best sample: 0.3432591 
## 
## Best sample:
##  [1]   3  23  26  34  37  43  44  45  47  49  55  80  85  90 100 109 124 130 132
## [20] 137 148 152 153 158 159 160 168 172 180 181 184 187 189 198 200 205 210 212
## [39] 217 220 232 238 243 244
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 1 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
## 
## Silhouette plot information for best sample:
##     cluster neighbor    sil_width
## 26        1        2  0.643556145
## 50        1        2  0.642877240
## 25        1        2  0.641830440
## 44        1        2  0.636672033
## 21        1        2  0.636118491
## 16        1        2  0.634231383
## 32        1        2  0.632781183
## 52        1        2  0.624357141
## 40        1        2  0.624087493
## 42        1        2  0.623698779
## 7         1        2  0.615150885
## 10        1        2  0.614920632
## 13        1        2  0.614543236
## 1         1        2  0.613888556
## 15        1        2  0.612269371
## 105       1        2  0.608367716
## 60        1        2  0.606368569
## 43        1        2  0.600861748
## 34        1        2  0.600459475
## 36        1        2  0.599812834
## 53        1        2  0.599033986
## 102       1        2  0.598454342
## 2         1        2  0.598397772
## 4         1        2  0.597558893
## 54        1        2  0.596882345
## 35        1        2  0.593456682
## 23        1        2  0.591771516
## 45        1        2  0.591117689
## 49        1        2  0.590153328
## 8         1        2  0.585237840
## 19        1        2  0.584862326
## 37        1        2  0.583217583
## 39        1        2  0.582553773
## 9         1        2  0.580465438
## 29        1        2  0.576793040
## 6         1        2  0.576500996
## 3         1        2  0.575088232
## 30        1        2  0.573374512
## 14        1        2  0.573268780
## 41        1        2  0.570785141
## 38        1        2  0.566417232
## 48        1        2  0.566379004
## 24        1        2  0.565618534
## 5         1        2  0.563591353
## 12        1        2  0.562690438
## 46        1        2  0.562107071
## 33        1        2  0.560568542
## 58        1        2  0.559124047
## 27        1        2  0.554574506
## 86        1        2  0.551743455
## 17        1        2  0.550325588
## 18        1        2  0.550135570
## 22        1        2  0.547536121
## 11        1        2  0.547523225
## 31        1        2  0.544807838
## 93        1        2  0.543676322
## 68        1        2  0.541613966
## 108       1        2  0.522853573
## 20        1        2  0.511740462
## 104       1        2  0.507619427
## 79        1        2  0.500539265
## 47        1        2  0.499157554
## 85        1        2  0.492274048
## 57        1        2  0.491176990
## 95        1        2  0.476309476
## 100       1        2  0.466080823
## 81        1        2  0.464117232
## 96        1        2  0.463936916
## 77        1        2  0.463291082
## 78        1        2  0.455831230
## 84        1        2  0.454264946
## 66        1        2  0.450848922
## 94        1        2  0.450579479
## 82        1        2  0.450106845
## 73        1        2  0.448329763
## 28        1        2  0.446240795
## 55        1        2  0.442036857
## 64        1        2  0.439462730
## 83        1        2  0.439093685
## 72        1        2  0.437930634
## 92        1        2  0.436882817
## 106       1        2  0.433348610
## 90        1        2  0.432939817
## 76        1        2  0.432306152
## 62        1        2  0.429284661
## 99        1        2  0.427911786
## 89        1        2  0.422992852
## 74        1        2  0.422658728
## 61        1        2  0.419819069
## 75        1        2  0.418211407
## 63        1        2  0.416587186
## 80        1        2  0.410904541
## 67        1        2  0.403258285
## 87        1        2  0.395497661
## 97        1        2  0.393814618
## 134       1        2  0.373553827
## 70        1        2  0.364256048
## 69        1        2  0.362292339
## 238       1        2  0.360753250
## 88        1        2  0.354918927
## 56        1        2  0.334756925
## 65        1        2  0.334477779
## 140       1        2  0.315079698
## 103       1        2  0.296334207
## 101       1        2  0.292990198
## 107       1        2  0.272311232
## 59        1        2  0.258291253
## 98        1        2  0.241246502
## 71        1        2  0.234644731
## 91        1        2  0.222675006
## 144       1        2  0.062272752
## 198       2        1  0.456810658
## 233       2        1  0.450311810
## 178       2        1  0.448621810
## 217       2        1  0.444675143
## 249       2        1  0.442505349
## 190       2        1  0.441120908
## 175       2        1  0.436885786
## 170       2        1  0.435480280
## 227       2        1  0.431924368
## 230       2        1  0.428019231
## 187       2        1  0.426655184
## 223       2        1  0.415573140
## 245       2        1  0.415188664
## 167       2        1  0.415083706
## 231       2        1  0.411585874
## 235       2        1  0.408265904
## 224       2        1  0.404590320
## 234       2        1  0.402027773
## 192       2        1  0.397421755
## 248       2        1  0.391270430
## 201       2        1  0.388631062
## 197       2        1  0.388020774
## 206       2        1  0.383085571
## 171       2        1  0.381683583
## 247       2        1  0.379555387
## 176       2        1  0.375076366
## 240       2        1  0.375039166
## 162       2        1  0.374185280
## 199       2        1  0.369533262
## 200       2        1  0.369289860
## 219       2        1  0.369158110
## 202       2        1  0.369146961
## 229       2        1  0.369075394
## 239       2        1  0.369033178
## 221       2        1  0.366936000
## 195       2        1  0.362830627
## 242       2        1  0.359448378
## 174       2        1  0.356708145
## 179       2        1  0.356559562
## 168       2        1  0.355317630
## 191       2        1  0.354734394
## 172       2        1  0.352441222
## 212       2        1  0.352049542
## 205       2        1  0.350289354
## 189       2        1  0.347631459
## 183       2        1  0.346254187
## 220       2        1  0.345311028
## 173       2        1  0.337487475
## 186       2        1  0.336446484
## 225       2        1  0.336237380
## 237       2        1  0.334393185
## 208       2        1  0.324470775
## 211       2        1  0.323009101
## 213       2        1  0.322368502
## 182       2        1  0.321838234
## 180       2        1  0.320749099
## 215       2        1  0.319245646
## 236       2        1  0.319086191
## 177       2        1  0.316932299
## 243       2        1  0.316615383
## 184       2        1  0.314945301
## 193       2        1  0.312172644
## 228       2        1  0.310994783
## 203       2        1  0.304248638
## 246       2        1  0.298525909
## 188       2        1  0.295904705
## 181       2        1  0.294305489
## 232       2        1  0.289855623
## 222       2        1  0.288853639
## 214       2        1  0.287869340
## 131       2        1  0.287805879
## 241       2        1  0.285089218
## 194       2        1  0.284141798
## 209       2        1  0.277186729
## 216       2        1  0.274468948
## 196       2        1  0.273783741
## 126       2        1  0.271117069
## 204       2        1  0.267131751
## 226       2        1  0.253762794
## 210       2        1  0.247778475
## 244       2        1  0.246695095
## 169       2        1  0.241180718
## 185       2        1  0.239178075
## 218       2        1  0.227749076
## 207       2        1  0.225510219
## 163       2        1  0.193521142
## 138       2        1  0.191430285
## 117       2        1  0.172416797
## 121       2        1  0.169756179
## 157       2        1  0.164825546
## 110       2        1  0.159284363
## 161       2        1  0.132357583
## 146       2        1  0.116684114
## 51        2        1  0.113485023
## 119       2        1  0.107344710
## 130       2        1  0.106539620
## 166       2        1  0.103615958
## 128       2        1  0.099660801
## 116       2        1  0.088096800
## 124       2        1  0.087656607
## 141       2        1  0.081883315
## 154       2        1  0.073330373
## 135       2        1  0.069184942
## 127       2        1  0.068498134
## 113       2        1  0.060183401
## 123       2        1  0.060059761
## 115       2        1  0.060058066
## 151       2        1  0.055141570
## 136       2        1  0.043787456
## 152       2        1  0.041169708
## 158       2        1  0.032119560
## 114       2        1  0.025880582
## 111       2        1  0.018570899
## 112       2        1  0.014934385
## 159       2        1  0.004999305
## 139       2        1 -0.002706371
## 160       2        1 -0.007050038
## 153       2        1 -0.014538329
## 118       2        1 -0.026612485
## 137       2        1 -0.029296656
## 149       2        1 -0.035595583
## 120       2        1 -0.050952615
## 155       2        1 -0.051059592
## 164       2        1 -0.057010644
## 109       2        1 -0.058634219
## 165       2        1 -0.059816082
## 147       2        1 -0.068693695
## 150       2        1 -0.069097424
## 143       2        1 -0.090706443
## 133       2        1 -0.092684625
## 132       2        1 -0.095503832
## 122       2        1 -0.114963961
## 142       2        1 -0.158925691
## 125       2        1 -0.160256494
## 148       2        1 -0.182913886
## 129       2        1 -0.185878026
## 145       2        1 -0.217714469
## 156       2        1 -0.249876571
## 
## 946 dissimilarities, summarized :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2143  2.3296  3.2648  3.3623  4.3727  7.3740 
## Metric :  euclidean 
## Number of objects : 44
## 
## Available components:
##  [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
##  [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"      
## [11] "clust_plot" "nbclust"
fviz_cluster(clara_flex)

fviz_silhouette(clara_flex)
##   cluster size ave.sil.width
## 1       1  111          0.50
## 2       2  138          0.22

Looking at the silhouette results, an average width close to 1 indicates a well-separated cluster, so the first cluster, with an average silhouette width of 0.50, has the higher quality, while the second cluster contains many borderline or probably misassigned observations. A silhouette width close to zero marks a boundary case whose assignment is uncertain, and a negative width suggests that the observation was most likely placed in the wrong cluster.
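
As a rough illustrative check (the 0.25 cut-off below is an arbitrary choice, not anything prescribed by the method), we can count how many observations of the best sample fall into these three zones using the silinfo component stored in clara_flex:

# Count boundary and likely misassigned observations in the best sample
sil_widths <- clara_flex$silinfo$widths[, "sil_width"]
table(cut(sil_widths,
          breaks = c(-Inf, 0, 0.25, Inf),
          labels = c("negative (probably misassigned)",
                     "near zero (boundary case)",
                     "reasonably well assigned")))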

As a second approach, I will call clara() directly with 2 clusters and 6 samples of size 50, using the Euclidean distance.

clara_clust <- clara(df_scaled, 2, metric = "euclidean", stand = FALSE, samples = 6,
                     sampsize = 50, trace = 0, medoids.x = TRUE,
                     rngR = FALSE, pamLike = FALSE, correct.d = TRUE)
class(clara_clust)
## [1] "clara"     "partition"
clara_clust
## Call:     clara(x = df_scaled, k = 2, metric = "euclidean", stand = FALSE,      samples = 6, sampsize = 50, trace = 0, medoids.x = TRUE,      rngR = FALSE, pamLike = FALSE, correct.d = TRUE) 
## Medoids:
##          Sports  Religious     Nature    Theatre   Shopping     Picnic
## [1,] -0.9050028 -0.3629468 -0.4495696 -0.2295951 -0.5206220 -0.5945333
## [2,]  1.2109193  1.0236262  0.3173122  0.7040334  0.6342544  0.8457116
## Objective function:   1.80886
## Clustering vector:    int [1:249] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
## Cluster sizes:            148 101 
## Best sample:
##  [1]   2   8  11  16  18  19  32  34  37  38  43  44  45  52  55  85  86  88  90
## [20]  91 100 106 107 109 113 114 124 126 130 132 135 136 144 148 149 159 160 170
## [39] 174 180 181 185 187 189 205 210 216 223 227 232
## 
## Available components:
##  [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
##  [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"

Visualization of the results

fviz_cluster(clara_clust, geom = "point", ellipse.type = "norm")

fviz_cluster(clara_clust, palette = c("#00AFBB", "#FC4E07", "#E7B800"), ellipse.type = "t", geom = "point", pointsize = 1, ggtheme = theme_classic())

fviz_silhouette(clara_clust)
##   cluster size ave.sil.width
## 1       1   33          0.35
## 2       2   17          0.26

According to the silhouette measure, the first cluster is again the better one, with an average silhouette width of 0.35.
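
One caveat: fviz_silhouette() above summarizes only the best sample of 50 observations (clusters of size 33 and 17), not the whole dataset. A short sketch of how the full partition could be assessed, using the stored clustering vector and the scaled data:

# Silhouette over all 249 observations, not just the best sample
full_sil <- silhouette(clara_clust$clustering, dist(df_scaled))
summary(full_sil)$avg.width   # overall average silhouette width
plot(full_sil)                # full-data silhouette plot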

Summary

In conclusion, comparing the results of the different clustering algorithms considered above, CLARA gave the best result, with an average silhouette width of 0.34.


Sources:

  1. Jacek Lewkowicz. Unsupervised Learning. Course presentation.

  2. Matthias Kohl. Introduction to Statistical Data Analysis with R. Furtwangen University.

  3. clusterSim package documentation: https://cran.r-project.org/web/packages/clusterSim/clusterSim.pdf

  4. BuddyMove Data Set. UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/BuddyMove+Data+Set#