The data set is taken from the UCI Machine Learning Repository and refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.

- Number of attributes: 8
- Number of instances: 440

| Attribute | Variable_Type |
|---|---|
| Fresh | Continuous |
| Milk | Continuous |
| Grocery | Continuous |
| Frozen | Continuous |
| Detergents_Paper | Continuous |
| Delicatessen | Continuous |
| Channel | Nominal |
| Region | Nominal |

- To segment customers based on their annual spending on various products, so that the segments can be used to devise successful, targeted marketing initiatives.
- To find out the buying patterns in each segment.
## Channel Region Fresh Milk
## Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533
## Median :1.000 Median :3.000 Median : 8504 Median : 3627
## Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190
## Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
## Grocery Frozen Detergents_Paper Delicassen
## Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 4756 Median : 1526.0 Median : 816.5 Median : 965.5
## Mean : 7951 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
## 3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :92780 Max. :60869.0 Max. :40827.0 Max. :47943.0
- Looking at the plot, we notice that three is the optimal number of clusters, and that moving on to 5 clusters makes almost no difference.
- Hence we may stick with 3-4 clusters for this data.

>- Grocery and Milk, Detergents_Paper and Grocery, and Milk and Detergents_Paper are strongly correlated.
>- We see that the 'elbow', where the within-cluster sum of squares stops decreasing drastically, is at around 3 clusters; after that the decrease tapers off. This is our k value!
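
The elbow heuristic can be illustrated with a small Python sketch. It is not the code used for this analysis: the nine 1-D points are made up, and `kmeans_1d` is a hypothetical helper that runs Lloyd's algorithm from a deterministic start. It only shows how the total within-cluster sum of squares behaves as k grows.

```python
def kmeans_1d(points, k, iterations=100):
    """Lloyd's algorithm on 1-D data; returns the total within-cluster sum of squares."""
    data = sorted(points)
    n = len(data)
    # Deterministic initial centers: evenly spaced order statistics (an assumption
    # made here for reproducibility; real k-means usually starts at random).
    if k == 1:
        centers = [data[n // 2]]
    else:
        centers = [data[i * (n - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        # Update step: move each center to the mean of its group.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return sum((x - c) ** 2 for g, c in zip(groups, centers) for x in g)


# Made-up data with three well separated groups.
points = [1, 2, 3, 11, 12, 13, 21, 22, 23]
wss = {k: kmeans_1d(points, k) for k in range(1, 6)}
for k, value in wss.items():
    print(k, round(value, 2))
```

On this toy data the sum of squares collapses between k = 1 and k = 3 and barely moves afterwards, which is the shape one looks for in an elbow plot.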
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 1 proposed 2 as the best number of clusters
## * 11 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 4 proposed 8 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 1 proposed 12 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
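
The majority rule in the summary above is a plain vote count over the indices. As a quick check, the Python sketch below tallies the votes copied from that summary; the variable names are illustrative only.

```python
# Votes copied from the summary above: {number of clusters: indices proposing it}.
votes = {2: 1, 3: 11, 4: 2, 5: 1, 8: 4, 10: 1, 12: 1, 14: 1, 15: 1}

total_votes = sum(votes.values())
best_k = max(votes, key=votes.get)
share = votes[best_k] / total_votes
print(f"best k = {best_k}: {votes[best_k]} of {total_votes} indices ({share:.0%})")
```

Note that 11 of the 23 votes is a plurality rather than an absolute majority, so the choice of 3 clusters is the most popular proposal, not a unanimous one.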
## Cluster_size
## 330 60 50
| Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|
| 8253.47 | 3824.60 | 5280.45 | 2572.66 | 1773.06 | 1137.50 |
| 35941.40 | 6044.45 | 6288.62 | 6713.97 | 1039.67 | 3049.47 |
| 8000.04 | 18511.42 | 27573.90 | 1996.68 | 12407.36 | 2252.02 |
| cluster | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|
| 1 | 8253.47 | 3824.60 | 5280.45 | 2572.66 | 1773.06 | 1137.50 |
| 2 | 35941.40 | 6044.45 | 6288.62 | 6713.97 | 1039.67 | 3049.47 |
| 3 | 8000.04 | 18511.42 | 27573.90 | 1996.68 | 12407.36 | 2252.02 |

| Channel / cluster | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 244 | 52 | 2 |
| 2 | 86 | 8 | 48 |
## ARI
## 0.152653
- We can check how well k-means with k = 3 recovers the Channel labels by computing the adjusted Rand index (ARI) with the flexclust package. The agreement with channels 1 and 2 is poor: the index is only about 0.153.
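
To make the index less of a black box, here is a minimal Python sketch that recomputes the adjusted Rand index directly from the channel-by-cluster counts shown above. It uses the standard pair-counting formula rather than the flexclust call, and `adjusted_rand_index` is a name introduced here for illustration.

```python
from math import comb

# Cross-tabulation from the table above: rows are Channel 1 and 2
# (row totals 298 and 142), columns are the three k-means clusters.
table = [
    [244, 52, 2],
    [86, 8, 48],
]


def adjusted_rand_index(table):
    """Adjusted Rand index computed from a contingency table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Pairs of customers that share both a channel and a cluster.
    pairs_together = sum(comb(cell, 2) for row in table for cell in row)
    row_pairs = sum(comb(t, 2) for t in row_totals)
    col_pairs = sum(comb(t, 2) for t in col_totals)
    # Agreement expected by chance, and the maximum attainable agreement.
    expected = row_pairs * col_pairs / comb(n, 2)
    maximum = (row_pairs + col_pairs) / 2
    return (pairs_together - expected) / (maximum - expected)


ari = adjusted_rand_index(table)
print(round(ari, 6))
```

The result agrees with the 0.152653 reported above. An index of 0 corresponds to chance-level agreement and 1 to identical partitions, so the clusters carry little information about the channel.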
## Cluster size in each group
## groups.3
## 1 2 3
## 431 6 3
## Cluster_2
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 24 26373 36423 22019 5154 4337 16523
## 48 44466 54259 55571 7782 24171 6465
## 62 35942 38369 59598 3254 26701 2017
## 86 16117 46197 92780 1026 40827 2944
## 87 22925 73498 32114 987 20070 903
## 334 8565 4980 67298 131 38102 1215
## Clustering for region types
##
## groups.3 1 2 3
## 1 77 46 308
## 2 0 1 5
## 3 0 0 3
- The points that fall in the intersecting (overlapping) regions are the ones that were clustered incorrectly.
## Medoids
## Fresh Milk Grocery Frozen Detergents_Paper
## [1,] -0.4477070 -0.4796863 -0.6611775 -0.3254455 -0.5391300
## [2,] -0.4739576 0.7176780 1.1501142 -0.3940392 0.9529458
## [3,] 0.6363954 -0.5291418 -0.5881492 0.4678107 -0.5181562
## Delicassen
## [1,] -0.2974606
## [2,] 0.2032298
## [3,] -0.2098753
##
## groups.3 1 2 3
## 1 204 102 125
## 2 0 6 0
## 3 0 0 3
## cluster size ave.sil.width
## 1 1 204 0.55
## 2 2 108 -0.02
## 3 3 128 -0.04
- It can be seen that some samples have a negative silhouette width. This means that they are, on average, closer to a neighbouring cluster than to the cluster they were assigned to.
- We can find these samples and determine which cluster they are closer to, as follows:
## cluster neighbor sil_width
## 184 2 3 -0.005140129
## 216 2 1 -0.009284111
## 82 2 1 -0.013158010
## 246 2 1 -0.024135588
## 385 2 1 -0.026916258
## 266 2 1 -0.035541198
## 160 2 1 -0.040767874
## 72 2 3 -0.042711053
## 265 2 1 -0.055863634
## 101 2 1 -0.088466345
## 304 2 1 -0.092631773
## 316 2 1 -0.101626605
## 358 2 1 -0.107719005
## 219 2 1 -0.107833333
## 171 2 1 -0.110234886
## 25 2 1 -0.112960068
## 347 2 1 -0.114337124
## 58 2 1 -0.114723394
## 294 2 1 -0.123912918
## 417 2 1 -0.126184802
## 43 2 1 -0.128267684
## 38 2 1 -0.130458596
## 157 2 1 -0.139418922
## 194 2 1 -0.145276630
## 427 2 1 -0.171382212
## 14 2 1 -0.172906421
## 13 2 3 -0.192307972
## 95 2 1 -0.198502984
## 15 2 1 -0.202515102
## 3 2 1 -0.204840108
## 54 2 1 -0.205292522
## 183 2 1 -0.216433777
## 421 2 1 -0.218332932
## 189 2 1 -0.225629838
## 190 2 1 -0.233866521
## 377 2 1 -0.242218610
## 107 2 1 -0.249877609
## 255 2 1 -0.250959959
## 11 2 1 -0.273223551
## 68 2 1 -0.273876865
## 17 2 1 -0.293786044
## 306 2 1 -0.297877510
## 215 2 1 -0.308792121
## 397 2 1 -0.318984847
## 341 2 1 -0.320178602
## 176 2 1 -0.343385145
## 167 2 1 -0.348713052
## 83 2 1 -0.355724974
## 342 2 1 -0.356574269
## 303 2 1 -0.375947513
## 49 2 1 -0.385240742
## 161 2 1 -0.400333472
## 2 2 1 -0.413519643
## 198 2 1 -0.421167889
## 45 2 1 -0.424031847
## 222 2 1 -0.427708046
## 245 2 1 -0.448292547
## 329 3 1 -0.007396548
## 336 3 1 -0.012875069
## 242 3 1 -0.036751781
## 288 3 1 -0.041071288
## 268 3 1 -0.048971960
## 423 3 1 -0.051994126
## 311 3 1 -0.052685511
## 119 3 1 -0.058794193
## 339 3 1 -0.062845057
## 191 3 1 -0.072744070
## 233 3 1 -0.079655745
## 403 3 1 -0.086725782
## 357 3 1 -0.094989028
## 92 3 1 -0.095573178
## 333 3 1 -0.098508368
## 76 3 1 -0.101849328
## 404 3 1 -0.107392524
## 231 3 1 -0.137084017
## 238 3 1 -0.143746675
## 128 3 1 -0.149640196
## 369 3 1 -0.154389737
## 355 3 1 -0.162266580
## 263 3 1 -0.165341477
## 89 3 1 -0.168159383
## 19 3 1 -0.180778500
## 144 3 1 -0.185370690
## 227 3 1 -0.187357791
## 84 3 1 -0.196791754
## 405 3 1 -0.204888573
## 388 3 1 -0.205945262
## 295 3 1 -0.209524494
## 42 3 1 -0.212114433
## 4 3 1 -0.213002138
## 338 3 1 -0.218203139
## 31 3 1 -0.219012031
## 340 3 1 -0.231144148
## 297 3 1 -0.233898560
## 243 3 1 -0.236141916
## 361 3 1 -0.236511615
## 235 3 1 -0.243969796
## 398 3 1 -0.253539515
## 33 3 1 -0.255154623
## 141 3 1 -0.255976638
## 153 3 1 -0.258442868
## 73 3 1 -0.268074268
## 433 3 1 -0.272958495
## 211 3 1 -0.273785012
## 158 3 1 -0.281238775
## 253 3 1 -0.288822555
## 115 3 1 -0.288894070
## 280 3 1 -0.291954591
## 218 3 1 -0.294220702
## 279 3 1 -0.312698206
## 422 3 1 -0.322429305
## 230 3 1 -0.327207181
## 21 3 1 -0.333035974
## 121 3 1 -0.338918084
## 323 3 1 -0.339390744
## 262 3 1 -0.341483697
## 399 3 1 -0.359947530
## 250 3 1 -0.380436480
## 424 3 1 -0.384695064
## 114 3 1 -0.395670458
## 100 3 1 -0.397307650
## 270 3 1 -0.410403261
## 163 3 1 -0.422128828
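
The columns of the listing above (cluster, neighbor, sil_width) can be reproduced on a toy example. The Python sketch below is not the R code behind the listing: the six 1-D points and their labels are made up, `silhouette` is a hypothetical helper, and it assumes every cluster has at least two members.

```python
def silhouette(points, labels):
    """Return (neighbor cluster, silhouette width) for each 1-D point."""
    clusters = sorted(set(labels))
    results = []
    for i, (x, own) in enumerate(zip(points, labels)):
        # Mean distance from x to the members of each cluster, excluding x itself.
        mean_dist = {}
        for c in clusters:
            members = [p for j, (p, lab) in enumerate(zip(points, labels))
                       if lab == c and j != i]
            mean_dist[c] = sum(abs(x - p) for p in members) / len(members)
        a = mean_dist[own]  # cohesion: mean distance within the own cluster
        neighbor = min((c for c in clusters if c != own), key=mean_dist.get)
        b = mean_dist[neighbor]  # separation: mean distance to the nearest other cluster
        results.append((neighbor, (b - a) / max(a, b)))
    return results


# Made-up data: the point 9 is deliberately assigned to cluster 1,
# although it sits next to the members of cluster 2.
points = [0, 1, 2, 9, 10, 11]
labels = [1, 1, 1, 1, 2, 2]
results = silhouette(points, labels)
for x, lab, (neighbor, width) in zip(points, labels, results):
    print(x, lab, neighbor, round(width, 4))
```

Only the misplaced point gets a negative width, and its neighbor column names the cluster it would fit better, which is how the listing above should be read.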
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVI (diagonal, varying volume and shape) model with 8 components:
##
## log.likelihood n df BIC ICL
## -24319.1 440 103 -49265.15 -49349.14
##
## Clustering table:
## 1 2 3 4 5 6 7 8
## 66 17 55 87 14 36 124 41
##
## Mixing probabilities:
## 1 2 3 4 5 6
## 0.14829213 0.03669908 0.13197503 0.20393897 0.03438032 0.08143027
## 7 8
## 0.27066628 0.09261793
##
## Means:
## [,1] [,2] [,3] [,4] [,5]
## Fresh 6953.7520 33994.066 13926.050 19890.0334 13411.181
## Milk 3965.9269 15547.124 6970.027 3210.3000 29326.427
## Grocery 4734.4980 11701.621 9577.872 3760.4102 39789.058
## Frozen 833.9464 15846.089 2006.443 5979.2225 2558.126
## Detergents_Paper 1481.0508 1529.493 3525.844 555.2186 19124.188
## Delicassen 844.3439 8685.197 2011.469 1614.2546 3602.209
## [,6] [,7] [,8]
## Fresh 5385.383 10132.2356 2000.1319
## Milk 12480.459 1359.6364 7239.0245
## Grocery 21525.643 1938.8324 12343.6111
## Frozen 1560.210 2302.9635 477.2111
## Detergents_Paper 9684.784 269.4009 5486.3243
## Delicassen 1145.202 624.9454 1079.6855
##
## Variances:
## [,,1]
## Fresh Milk Grocery Frozen Detergents_Paper
## Fresh 29865082 0 0 0 0
## Milk 0 5077290 0 0 0
## Grocery 0 0 5957240 0 0
## Frozen 0 0 0 247709 0
## Detergents_Paper 0 0 0 0 1422091
## Delicassen 0 0 0 0 0
## Delicassen
## Fresh 0.0
## Milk 0.0
## Grocery 0.0
## Frozen 0.0
## Detergents_Paper 0.0
## Delicassen 397586.9
## [,,2]
## Fresh Milk Grocery Frozen Detergents_Paper
## Fresh 720674684 0 0 0 0
## Milk 0 154999741 0 0 0
## Grocery 0 0 42204084 0 0
## Frozen 0 0 0 226886772 0
## Detergents_Paper 0 0 0 0 2398598
## Delicassen 0 0 0 0 0
## Delicassen
## Fresh 0
## Milk 0
## Grocery 0
## Frozen 0
## Detergents_Paper 0
## Delicassen 124022950
## [,,3]
## Fresh Milk Grocery Frozen Detergents_Paper
## Fresh 93618643 0 0 0 0
## Milk 0 6554400 0 0 0
## Grocery 0 0 9303005 0 0
## Frozen 0 0 0 2132837 0
## Detergents_Paper 0 0 0 0 2876004
## Delicassen 0 0 0 0 0
## Delicassen
## Fresh 0
## Milk 0
## Grocery 0
## Frozen 0
## Detergents_Paper 0
## Delicassen 2698989
## [,,4]
## Fresh Milk Grocery Frozen Detergents_Paper
## Fresh 229112081 0 0 0 0.0
## Milk 0 3491834 0 0 0.0
## Grocery 0 0 3910598 0 0.0
## Frozen 0 0 0 20078394 0.0
## Detergents_Paper 0 0 0 0 152328.2
## Delicassen 0 0 0 0 0.0
## Delicassen
## Fresh 0
## Milk 0
## Grocery 0
## Frozen 0
## Detergents_Paper 0
## Delicassen 1455411
## [,,5]
## Fresh Milk Grocery Frozen Detergents_Paper
## Fresh 157059668 0 0 0 0
## Milk 0 290636329 0 0 0
## Grocery 0 0 424074652 0 0
## Frozen 0 0 0 5648121 0
## Detergents_Paper 0 0 0 0 109450107
## Delicassen 0 0 0 0 0
## Delicassen
## Fresh 0
## Milk 0
## Grocery 0
## Frozen 0
## Detergents_Paper 0
## Delicassen 3529720
## [,,6]
## Fresh Milk Grocery Frozen Detergents_Paper
## Fresh 18840284 0 0 0 0
## Milk 0 18947561 0 0 0
## Grocery 0 0 27007252 0 0
## Frozen 0 0 0 1045613 0
## Detergents_Paper 0 0 0 0 11990520
## Delicassen 0 0 0 0 0
## Delicassen
## Fresh 0
## Milk 0
## Grocery 0
## Frozen 0
## Detergents_Paper 0
## Delicassen 640757
## [,,7]
## Fresh Milk Grocery Frozen Detergents_Paper
## Fresh 51241385 0.0 0 0 0.00
## Milk 0 574613.9 0 0 0.00
## Grocery 0 0.0 1066482 0 0.00
## Frozen 0 0.0 0 3324868 0.00
## Detergents_Paper 0 0.0 0 0 37720.24
## Delicassen 0 0.0 0 0 0.00
## Delicassen
## Fresh 0.0
## Milk 0.0
## Grocery 0.0
## Frozen 0.0
## Detergents_Paper 0.0
## Delicassen 167037.8
## [,,8]
## Fresh Milk Grocery Frozen Detergents_Paper
## Fresh 3020132 0 0 0.0 0
## Milk 0 9841039 0 0.0 0
## Grocery 0 0 12660623 0.0 0
## Frozen 0 0 0 128456.5 0
## Detergents_Paper 0 0 0 0.0 4219492
## Delicassen 0 0 0 0.0 0
## Delicassen
## Fresh 0.0
## Milk 0.0
## Grocery 0.0
## Frozen 0.0
## Detergents_Paper 0.0
## Delicassen 776244.4
##
## region 1 2 3 4 5 6 7 8
## 1 15 3 7 15 2 7 22 6
## 2 7 1 2 10 1 7 14 5
## 3 44 13 46 62 11 22 88 30
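
The `df` and `BIC` entries of the mixture-model summary above can be checked by hand. The Python sketch below assumes the convention BIC = 2 * log-likelihood - df * log(n), under which larger values are better, and counts the free parameters of a diagonal (VVI) model; `mclust_bic` is a name made up for this illustration.

```python
from math import log


def mclust_bic(log_likelihood, n_obs, n_params):
    """BIC in the sign convention where larger (less negative) is better."""
    return 2 * log_likelihood - n_params * log(n_obs)


# Free parameters of a diagonal mixture with G components in d dimensions:
# (G - 1) mixing proportions + G * d means + G * d variances.
G, d = 8, 6
n_params = (G - 1) + G * d + G * d

bic = mclust_bic(-24319.1, 440, n_params)
print(n_params, round(bic, 2))
```

The parameter count comes out as 103, and the BIC matches the reported -49265.15 up to the rounding of the printed log-likelihood.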
## EDDA - classification
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## EDDA model summary:
##
## log.likelihood n df BIC
## -25794.52 440 41 -51838.6
##
## Classes n Model G
## 1 77 VEE 1
## 2 47 VEE 1
## 3 316 VEE 1
##
## Training classification summary:
##
## Predicted
## Class 1 2 3
## 1 0 2 75
## 2 0 4 43
## 3 0 22 294
##
## Training error = 0.2818182
## Dimension reduction for model-based clustering
##
## -----------------------------------------------------------------
## Dimension reduction for model-based clustering and classification
## -----------------------------------------------------------------
##
## Mixture model type: Mclust (VVI, 8)
##
## Clusters n
## 1 66
## 2 17
## 3 55
## 4 87
## 5 14
## 6 36
## 7 124
## 8 41
##
## Estimated basis vectors:
## Dir1 Dir2 Dir3 Dir4 Dir5
## Fresh -0.016337 -0.00043504 -0.0079574 0.234452 -0.71497
## Milk -0.031128 -0.28420072 -0.6208671 -0.225221 0.11021
## Grocery 0.459986 0.06901699 0.1958439 0.084046 0.24298
## Frozen -0.017386 -0.21628964 0.5303445 -0.764562 -0.38040
## Detergents_Paper -0.871555 0.09803042 0.4857966 0.214116 0.49713
## Delicassen -0.165124 0.92632351 -0.2425744 -0.506797 -0.16060
## Dir6
## Fresh 0.32111
## Milk 0.34435
## Grocery 0.38478
## Frozen 0.21966
## Detergents_Paper 0.73822
## Delicassen 0.19244
##
## Dir1 Dir2 Dir3 Dir4 Dir5 Dir6
## Eigenvalues 184.597 24.996 9.7017 4.0829 0.99462 0.7671
## Cum. % 81.992 93.095 97.4040 99.2175 99.65928 100.0000
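
The `Cum. %` row above is the running share of the eigenvalue total. A short Python check, using the eigenvalues copied from the output:

```python
# Eigenvalues copied from the dimension-reduction summary above (Dir1..Dir6).
eigenvalues = [184.597, 24.996, 9.7017, 4.0829, 0.99462, 0.7671]

total = sum(eigenvalues)
cumulative = []
running = 0.0
for value in eigenvalues:
    running += value
    cumulative.append(100 * running / total)

for i, pct in enumerate(cumulative, start=1):
    print(f"Dir{i}: {pct:.3f}%")
```

The first two directions already account for roughly 93% of the total, so a two-dimensional projection keeps most of the separation between the fitted components.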
## LOF (local outlier factor)
## larger bubbles in the visualization have a larger LOF
- We can extract the cluster labels and the outliers to plot our results.
- In line with our intuition, the DBSCAN algorithm identified one cluster of customers whose grocery and milk purchases lie around the mean. In addition, it flagged customers whose annual purchasing behavior deviated too heavily from that of the other customers.
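
How DBSCAN ends up with one dense cluster plus flagged outliers can be seen in a minimal Python sketch. It is not the implementation used for this analysis: the seven (milk, grocery) pairs are made-up standardized values, and `eps` and `min_pts` are arbitrary choices for the toy data.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point, with -1 meaning noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i (the point itself included).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
        cluster += 1
    return labels


# Made-up standardized (milk, grocery) pairs: five customers near the mean
# and two with extreme spending.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1),
          (3.0, 3.5), (5.0, -4.0)]
labels = dbscan(points, eps=0.3, min_pts=3)
print(labels)
```

The five customers near the mean form a single cluster, while the two extreme ones have too few neighbors within `eps` and are labeled as noise.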
## The top three ranking algorithms for each measure
## 1 2 3
## APN hierarchical-2 hierarchical-3 hierarchical-4
## AD pam-8 pam-7 kmeans-8
## ADM hierarchical-2 hierarchical-4 hierarchical-5
## FOM kmeans-8 pam-8 kmeans-7
## Optimal scores
## Score Method Clusters
## Connectivity 2.9289683 hierarchical 2
## Dunn 0.3852785 hierarchical 2
## Silhouette 0.7957468 hierarchical 2