INTRODUCTION

The data set is taken from the UCI Machine Learning Repository and describes the clients of a wholesale distributor. It includes each client's annual spending, in monetary units (m.u.), on diverse product categories.

ATTRIBUTES

Number of attributes: 8

Number of instances: 440

Attribute                    Variable_Type
Fresh                        Continuous
Milk                         Continuous
Grocery                      Continuous
Frozen                       Continuous
Detergents_Paper             Continuous
Delicassen (delicatessen)    Continuous
Channel                      Nominal
Region                       Nominal

OBJECTIVE:

To segment customers based on their annual spending on various products, so that the segments can be used to devise successful, targeted marketing initiatives.

To identify the buying patterns in each segment.

##     Channel          Region          Fresh             Milk      
##  Min.   :1.000   Min.   :1.000   Min.   :     3   Min.   :   55  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:  3128   1st Qu.: 1533  
##  Median :1.000   Median :3.000   Median :  8504   Median : 3627  
##  Mean   :1.323   Mean   :2.543   Mean   : 12000   Mean   : 5796  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.: 16934   3rd Qu.: 7190  
##  Max.   :2.000   Max.   :3.000   Max.   :112151   Max.   :73498  
##     Grocery          Frozen        Detergents_Paper    Delicassen     
##  Min.   :    3   Min.   :   25.0   Min.   :    3.0   Min.   :    3.0  
##  1st Qu.: 2153   1st Qu.:  742.2   1st Qu.:  256.8   1st Qu.:  408.2  
##  Median : 4756   Median : 1526.0   Median :  816.5   Median :  965.5  
##  Mean   : 7951   Mean   : 3071.9   Mean   : 2881.5   Mean   : 1524.9  
##  3rd Qu.:10656   3rd Qu.: 3554.2   3rd Qu.: 3922.0   3rd Qu.: 1820.2  
##  Max.   :92780   Max.   :60869.0   Max.   :40827.0   Max.   :47943.0

CLUSTERGRAM - VISUALIZATION AND DIAGNOSTICS FOR CLUSTER ANALYSIS

-Looking at the clustergram, three appears to be the optimal number of clusters; moving on to 5 clusters makes almost no difference.

-Hence we may stick to 3-4 clusters for this data.

Correlation plot:

-Grocery and Milk, Detergents_Paper and Grocery, and Milk and Detergents_Paper are strongly correlated.

WSS Plot:

-The "elbow", where the within-cluster sum of squares (WSS) stops decreasing drastically, is at around 3 clusters; after that the decrease tapers off. We therefore take k = 3.
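To make the elbow criterion concrete, here is a minimal Python sketch (the report's own output appears to come from R). The toy points and the hand-assigned partitions are hypothetical and are not the wholesale data; a real analysis would obtain the partitions from k-means for each k.

```python
# Hypothetical toy data: three tight groups of four 2-D points each.
group_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
group_b = [(10, 0), (10, 1), (11, 0), (11, 1)]
group_c = [(5, 9), (5, 10), (6, 9), (6, 10)]
points = group_a + group_b + group_c


def wss(points, labels):
    """Total within-cluster sum of squared distances to each cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dims = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dims))
    return total


# Hand-assigned partitions for k = 1..4 (stand-ins for k-means results).
partitions = {
    1: [0] * 12,
    2: [0] * 8 + [1] * 4,
    3: [0] * 4 + [1] * 4 + [2] * 4,
    4: [0] * 4 + [1] * 4 + [2] * 2 + [3] * 2,
}
wss_by_k = {k: wss(points, labels) for k, labels in partitions.items()}
for k in sorted(wss_by_k):
    print(k, round(wss_by_k[k], 2))
```

On this toy data the WSS falls steeply up to k = 3 and barely changes at k = 4, which is the pattern the WSS plot is read for.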

K-MEANS CLUSTER ANALYSIS

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 1 proposed 2 as the best number of clusters 
## * 11 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 4 proposed 8 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## * 1 proposed 12 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
## Cluster_size
##  330 60 50
Cluster_Centers
   Fresh      Milk   Grocery   Frozen  Detergents_Paper  Delicassen
 8253.47   3824.60   5280.45  2572.66           1773.06     1137.50
35941.40   6044.45   6288.62  6713.97           1039.67     3049.47
 8000.04  18511.42  27573.90  1996.68          12407.36     2252.02
Cluster_Mean
cluster     Fresh      Milk   Grocery   Frozen  Detergents_Paper  Delicassen
      1   8253.47   3824.60   5280.45  2572.66           1773.06     1137.50
      2  35941.40   6044.45   6288.62  6713.97           1039.67     3049.47
      3   8000.04  18511.42  27573.90  1996.68          12407.36     2252.02
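One way to read buying patterns off these cluster means is to divide each one by the overall mean of the same category from the summary table. The Python sketch below does this with the printed figures only; the ratio-to-overall-mean reading is an illustrative choice, not necessarily the method used in the original analysis.

```python
categories = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]

# Overall means, copied from the summary table above.
overall_mean = [12000, 5796, 7951, 3071.9, 2881.5, 1524.9]

# k-means cluster means, copied from the Cluster_Mean table above.
cluster_mean = {
    1: [8253.47, 3824.60, 5280.45, 2572.66, 1773.06, 1137.50],
    2: [35941.40, 6044.45, 6288.62, 6713.97, 1039.67, 3049.47],
    3: [8000.04, 18511.42, 27573.90, 1996.68, 12407.36, 2252.02],
}

# Ratio > 1 means the cluster spends more than the average client on that category.
ratios = {
    cluster: [m / o for m, o in zip(means, overall_mean)]
    for cluster, means in cluster_mean.items()
}

# Category with the largest ratio in each cluster.
top_category = {
    cluster: categories[max(range(len(r)), key=r.__getitem__)]
    for cluster, r in ratios.items()
}

for cluster in sorted(ratios):
    formatted = ", ".join(f"{c}={v:.2f}" for c, v in zip(categories, ratios[cluster]))
    print(f"cluster {cluster}: {formatted} (top: {top_category[cluster]})")
```

Read this way, cluster 1 (the largest) spends below average in every category, cluster 2 stands out on Fresh, and cluster 3 stands out on Detergents_Paper, Grocery and Milk.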

Clustering for channel types (rows: Channel, columns: k-means cluster)
             1    2    3
Channel 1  244   52    2
Channel 2   86    8   48
##      ARI 
## 0.152653
  • Using the flexclust package, we can check how well the k-means solution with 3 clusters agrees with the Channel labels. The clusters do not recover channels 1 and 2 well: the adjusted Rand index (ARI) is low, at about 0.153.
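The ARI can be recomputed directly from the channel-by-cluster counts printed above. The following Python sketch uses the standard pair-counting formula for the adjusted Rand index; it is an independent cross-check, not the flexclust code that produced the report's figure.

```python
from math import comb

# Rows: Channel 1 and Channel 2; columns: k-means clusters 1..3
# (counts copied from the table above).
table = [
    [244, 52, 2],
    [86, 8, 48],
]


def adjusted_rand_index(table):
    """Adjusted Rand index from a contingency table of two partitions."""
    n = sum(sum(row) for row in table)
    sum_cells = sum(comb(count, 2) for row in table for count in row)
    sum_rows = sum(comb(sum(row), 2) for row in table)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2          # largest attainable agreement
    return (sum_cells - expected) / (maximum - expected)


ari = adjusted_rand_index(table)
print(round(ari, 6))
```

The result should agree with the reported 0.152653 to the printed precision. An ARI of 1 means identical partitions and values near 0 mean agreement no better than chance.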

Scatter Plot Matrix

Hierarchical clustering

## Cluster size in each group
## groups.3
##   1   2   3 
## 431   6   3
## Cluster_2
##     Fresh  Milk Grocery Frozen Detergents_Paper Delicassen
## 24  26373 36423   22019   5154             4337      16523
## 48  44466 54259   55571   7782            24171       6465
## 62  35942 38369   59598   3254            26701       2017
## 86  16117 46197   92780   1026            40827       2944
## 87  22925 73498   32114    987            20070        903
## 334  8565  4980   67298    131            38102       1215

## Clustering for region types
##         
## groups.3   1   2   3
##        1  77  46 308
##        2   0   1   5
##        3   0   0   3
  • All the intersecting points are the ones which were clustered incorrectly.

PAM: Partitioning around Medoids

## Medoids
##           Fresh       Milk    Grocery     Frozen Detergents_Paper
## [1,] -0.4477070 -0.4796863 -0.6611775 -0.3254455       -0.5391300
## [2,] -0.4739576  0.7176780  1.1501142 -0.3940392        0.9529458
## [3,]  0.6363954 -0.5291418 -0.5881492  0.4678107       -0.5181562
##      Delicassen
## [1,] -0.2974606
## [2,]  0.2032298
## [3,] -0.2098753

##         
## groups.3   1   2   3
##        1 204 102 125
##        2   0   6   0
##        3   0   0   3
##   cluster size ave.sil.width
## 1       1  204          0.55
## 2       2  108         -0.02
## 3       3  128         -0.04

-Some samples have a negative silhouette width, which suggests that they may not be in the right cluster.

-We can find the names of these samples and determine which clusters they are closer to, as follows:

##     cluster neighbor    sil_width
## 184       2        3 -0.005140129
## 216       2        1 -0.009284111
## 82        2        1 -0.013158010
## 246       2        1 -0.024135588
## 385       2        1 -0.026916258
## 266       2        1 -0.035541198
## 160       2        1 -0.040767874
## 72        2        3 -0.042711053
## 265       2        1 -0.055863634
## 101       2        1 -0.088466345
## 304       2        1 -0.092631773
## 316       2        1 -0.101626605
## 358       2        1 -0.107719005
## 219       2        1 -0.107833333
## 171       2        1 -0.110234886
## 25        2        1 -0.112960068
## 347       2        1 -0.114337124
## 58        2        1 -0.114723394
## 294       2        1 -0.123912918
## 417       2        1 -0.126184802
## 43        2        1 -0.128267684
## 38        2        1 -0.130458596
## 157       2        1 -0.139418922
## 194       2        1 -0.145276630
## 427       2        1 -0.171382212
## 14        2        1 -0.172906421
## 13        2        3 -0.192307972
## 95        2        1 -0.198502984
## 15        2        1 -0.202515102
## 3         2        1 -0.204840108
## 54        2        1 -0.205292522
## 183       2        1 -0.216433777
## 421       2        1 -0.218332932
## 189       2        1 -0.225629838
## 190       2        1 -0.233866521
## 377       2        1 -0.242218610
## 107       2        1 -0.249877609
## 255       2        1 -0.250959959
## 11        2        1 -0.273223551
## 68        2        1 -0.273876865
## 17        2        1 -0.293786044
## 306       2        1 -0.297877510
## 215       2        1 -0.308792121
## 397       2        1 -0.318984847
## 341       2        1 -0.320178602
## 176       2        1 -0.343385145
## 167       2        1 -0.348713052
## 83        2        1 -0.355724974
## 342       2        1 -0.356574269
## 303       2        1 -0.375947513
## 49        2        1 -0.385240742
## 161       2        1 -0.400333472
## 2         2        1 -0.413519643
## 198       2        1 -0.421167889
## 45        2        1 -0.424031847
## 222       2        1 -0.427708046
## 245       2        1 -0.448292547
## 329       3        1 -0.007396548
## 336       3        1 -0.012875069
## 242       3        1 -0.036751781
## 288       3        1 -0.041071288
## 268       3        1 -0.048971960
## 423       3        1 -0.051994126
## 311       3        1 -0.052685511
## 119       3        1 -0.058794193
## 339       3        1 -0.062845057
## 191       3        1 -0.072744070
## 233       3        1 -0.079655745
## 403       3        1 -0.086725782
## 357       3        1 -0.094989028
## 92        3        1 -0.095573178
## 333       3        1 -0.098508368
## 76        3        1 -0.101849328
## 404       3        1 -0.107392524
## 231       3        1 -0.137084017
## 238       3        1 -0.143746675
## 128       3        1 -0.149640196
## 369       3        1 -0.154389737
## 355       3        1 -0.162266580
## 263       3        1 -0.165341477
## 89        3        1 -0.168159383
## 19        3        1 -0.180778500
## 144       3        1 -0.185370690
## 227       3        1 -0.187357791
## 84        3        1 -0.196791754
## 405       3        1 -0.204888573
## 388       3        1 -0.205945262
## 295       3        1 -0.209524494
## 42        3        1 -0.212114433
## 4         3        1 -0.213002138
## 338       3        1 -0.218203139
## 31        3        1 -0.219012031
## 340       3        1 -0.231144148
## 297       3        1 -0.233898560
## 243       3        1 -0.236141916
## 361       3        1 -0.236511615
## 235       3        1 -0.243969796
## 398       3        1 -0.253539515
## 33        3        1 -0.255154623
## 141       3        1 -0.255976638
## 153       3        1 -0.258442868
## 73        3        1 -0.268074268
## 433       3        1 -0.272958495
## 211       3        1 -0.273785012
## 158       3        1 -0.281238775
## 253       3        1 -0.288822555
## 115       3        1 -0.288894070
## 280       3        1 -0.291954591
## 218       3        1 -0.294220702
## 279       3        1 -0.312698206
## 422       3        1 -0.322429305
## 230       3        1 -0.327207181
## 21        3        1 -0.333035974
## 121       3        1 -0.338918084
## 323       3        1 -0.339390744
## 262       3        1 -0.341483697
## 399       3        1 -0.359947530
## 250       3        1 -0.380436480
## 424       3        1 -0.384695064
## 114       3        1 -0.395670458
## 100       3        1 -0.397307650
## 270       3        1 -0.410403261
## 163       3        1 -0.422128828
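For reference, the `cluster`, `neighbor` and `sil_width` columns listed above follow the usual silhouette definition: for each sample, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster (the neighbor), and the width is (b - a) / max(a, b). The Python sketch below computes these on hypothetical toy points with Euclidean distance; it is not the wholesale data or the original PAM run.

```python
from math import dist

# Hypothetical toy data: two well-separated groups, plus one point (index 6)
# that sits next to cluster 1 but is deliberately labelled as cluster 2.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (1, 1)]
labels = [1, 1, 1, 2, 2, 2, 2]


def silhouette(points, labels):
    """Return one (cluster, neighbor, sil_width) tuple per point.

    Assumes at least two clusters and not all points identical.
    """
    results = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Mean distance from p to the members of each cluster, excluding p itself.
        mean_dist = {}
        for cluster in set(labels):
            others = [q for j, (q, lab) in enumerate(zip(points, labels))
                      if lab == cluster and j != i]
            if others:
                mean_dist[cluster] = sum(dist(p, q) for q in others) / len(others)
        # The neighbor is the closest cluster other than the point's own.
        neighbor = min((c for c in mean_dist if c != own), key=mean_dist.get)
        b = mean_dist[neighbor]
        if own in mean_dist:
            a = mean_dist[own]
            width = (b - a) / max(a, b)
        else:
            width = 0.0  # singleton cluster: width is 0 by convention
        results.append((own, neighbor, width))
    return results


rows = silhouette(points, labels)
for index, (cluster, neighbor, width) in enumerate(rows):
    print(index, cluster, neighbor, round(width, 3))
```

The mislabelled point gets a negative width with cluster 1 as its neighbor, which is how the rows in the listing above should be read.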

CLARA Algorithm

AGNES: Agglomerative Nesting

Gaussian finite mixture model

## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust VVI (diagonal, varying volume and shape) model with 8 components:
## 
##  log.likelihood   n  df       BIC       ICL
##        -24319.1 440 103 -49265.15 -49349.14
## 
## Clustering table:
##   1   2   3   4   5   6   7   8 
##  66  17  55  87  14  36 124  41 
## 
## Mixing probabilities:
##          1          2          3          4          5          6 
## 0.14829213 0.03669908 0.13197503 0.20393897 0.03438032 0.08143027 
##          7          8 
## 0.27066628 0.09261793 
## 
## Means:
##                       [,1]      [,2]      [,3]       [,4]      [,5]
## Fresh            6953.7520 33994.066 13926.050 19890.0334 13411.181
## Milk             3965.9269 15547.124  6970.027  3210.3000 29326.427
## Grocery          4734.4980 11701.621  9577.872  3760.4102 39789.058
## Frozen            833.9464 15846.089  2006.443  5979.2225  2558.126
## Detergents_Paper 1481.0508  1529.493  3525.844   555.2186 19124.188
## Delicassen        844.3439  8685.197  2011.469  1614.2546  3602.209
##                       [,6]       [,7]       [,8]
## Fresh             5385.383 10132.2356  2000.1319
## Milk             12480.459  1359.6364  7239.0245
## Grocery          21525.643  1938.8324 12343.6111
## Frozen            1560.210  2302.9635   477.2111
## Detergents_Paper  9684.784   269.4009  5486.3243
## Delicassen        1145.202   624.9454  1079.6855
## 
## Variances:
## [,,1]
##                     Fresh    Milk Grocery Frozen Detergents_Paper
## Fresh            29865082       0       0      0                0
## Milk                    0 5077290       0      0                0
## Grocery                 0       0 5957240      0                0
## Frozen                  0       0       0 247709                0
## Detergents_Paper        0       0       0      0          1422091
## Delicassen              0       0       0      0                0
##                  Delicassen
## Fresh                   0.0
## Milk                    0.0
## Grocery                 0.0
## Frozen                  0.0
## Detergents_Paper        0.0
## Delicassen         397586.9
## [,,2]
##                      Fresh      Milk  Grocery    Frozen Detergents_Paper
## Fresh            720674684         0        0         0                0
## Milk                     0 154999741        0         0                0
## Grocery                  0         0 42204084         0                0
## Frozen                   0         0        0 226886772                0
## Detergents_Paper         0         0        0         0          2398598
## Delicassen               0         0        0         0                0
##                  Delicassen
## Fresh                     0
## Milk                      0
## Grocery                   0
## Frozen                    0
## Detergents_Paper          0
## Delicassen        124022950
## [,,3]
##                     Fresh    Milk Grocery  Frozen Detergents_Paper
## Fresh            93618643       0       0       0                0
## Milk                    0 6554400       0       0                0
## Grocery                 0       0 9303005       0                0
## Frozen                  0       0       0 2132837                0
## Detergents_Paper        0       0       0       0          2876004
## Delicassen              0       0       0       0                0
##                  Delicassen
## Fresh                     0
## Milk                      0
## Grocery                   0
## Frozen                    0
## Detergents_Paper          0
## Delicassen          2698989
## [,,4]
##                      Fresh    Milk Grocery   Frozen Detergents_Paper
## Fresh            229112081       0       0        0              0.0
## Milk                     0 3491834       0        0              0.0
## Grocery                  0       0 3910598        0              0.0
## Frozen                   0       0       0 20078394              0.0
## Detergents_Paper         0       0       0        0         152328.2
## Delicassen               0       0       0        0              0.0
##                  Delicassen
## Fresh                     0
## Milk                      0
## Grocery                   0
## Frozen                    0
## Detergents_Paper          0
## Delicassen          1455411
## [,,5]
##                      Fresh      Milk   Grocery  Frozen Detergents_Paper
## Fresh            157059668         0         0       0                0
## Milk                     0 290636329         0       0                0
## Grocery                  0         0 424074652       0                0
## Frozen                   0         0         0 5648121                0
## Detergents_Paper         0         0         0       0        109450107
## Delicassen               0         0         0       0                0
##                  Delicassen
## Fresh                     0
## Milk                      0
## Grocery                   0
## Frozen                    0
## Detergents_Paper          0
## Delicassen          3529720
## [,,6]
##                     Fresh     Milk  Grocery  Frozen Detergents_Paper
## Fresh            18840284        0        0       0                0
## Milk                    0 18947561        0       0                0
## Grocery                 0        0 27007252       0                0
## Frozen                  0        0        0 1045613                0
## Detergents_Paper        0        0        0       0         11990520
## Delicassen              0        0        0       0                0
##                  Delicassen
## Fresh                     0
## Milk                      0
## Grocery                   0
## Frozen                    0
## Detergents_Paper          0
## Delicassen           640757
## [,,7]
##                     Fresh     Milk Grocery  Frozen Detergents_Paper
## Fresh            51241385      0.0       0       0             0.00
## Milk                    0 574613.9       0       0             0.00
## Grocery                 0      0.0 1066482       0             0.00
## Frozen                  0      0.0       0 3324868             0.00
## Detergents_Paper        0      0.0       0       0         37720.24
## Delicassen              0      0.0       0       0             0.00
##                  Delicassen
## Fresh                   0.0
## Milk                    0.0
## Grocery                 0.0
## Frozen                  0.0
## Detergents_Paper        0.0
## Delicassen         167037.8
## [,,8]
##                    Fresh    Milk  Grocery   Frozen Detergents_Paper
## Fresh            3020132       0        0      0.0                0
## Milk                   0 9841039        0      0.0                0
## Grocery                0       0 12660623      0.0                0
## Frozen                 0       0        0 128456.5                0
## Detergents_Paper       0       0        0      0.0          4219492
## Delicassen             0       0        0      0.0                0
##                  Delicassen
## Fresh                   0.0
## Milk                    0.0
## Grocery                 0.0
## Frozen                  0.0
## Detergents_Paper        0.0
## Delicassen         776244.4
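The BIC values in these model summaries can be cross-checked from the printed log-likelihood, sample size and degrees of freedom. The Python sketch below assumes the convention BIC = 2 * logLik - df * log(n), under which larger values are better; this is the convention the mclust package is documented to use, and the printed numbers are consistent with it.

```python
from math import log


def mclust_bic(log_likelihood, n_obs, n_params):
    """BIC in the convention 2*logLik - df*log(n), where larger is better."""
    return 2 * log_likelihood - n_params * log(n_obs)


# Mclust VVI model with 8 components: logLik = -24319.1, n = 440, df = 103.
bic_vvi = mclust_bic(-24319.1, 440, 103)

# EDDA classification model: logLik = -25794.52, n = 440, df = 41.
bic_edda = mclust_bic(-25794.52, 440, 41)

print(round(bic_vvi, 2), round(bic_edda, 2))
```

Both values should match the printed BICs (-49265.15 and -51838.6) up to the rounding of the log-likelihoods.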

##       
## region  1  2  3  4  5  6  7  8
##      1 15  3  7 15  2  7 22  6
##      2  7  1  2 10  1  7 14  5
##      3 44 13 46 62 11 22 88 30
## EDDA - classification
## ------------------------------------------------
## Gaussian finite mixture model for classification 
## ------------------------------------------------
## 
## EDDA model summary:
## 
##  log.likelihood   n df      BIC
##       -25794.52 440 41 -51838.6
##        
## Classes   n Model G
##       1  77   VEE 1
##       2  47   VEE 1
##       3 316   VEE 1
## 
## Training classification summary:
## 
##      Predicted
## Class   1   2   3
##     1   0   2  75
##     2   0   4  43
##     3   0  22 294
## 
## Training error = 0.2818182

## Dimension reduction for model-based clustering
## 
## -----------------------------------------------------------------
## Dimension reduction for model-based clustering and classification 
## -----------------------------------------------------------------
## 
## Mixture model type: Mclust (VVI, 8)
##         
## Clusters   n
##        1  66
##        2  17
##        3  55
##        4  87
##        5  14
##        6  36
##        7 124
##        8  41
## 
## Estimated basis vectors:
##                       Dir1        Dir2       Dir3      Dir4     Dir5
## Fresh            -0.016337 -0.00043504 -0.0079574  0.234452 -0.71497
## Milk             -0.031128 -0.28420072 -0.6208671 -0.225221  0.11021
## Grocery           0.459986  0.06901699  0.1958439  0.084046  0.24298
## Frozen           -0.017386 -0.21628964  0.5303445 -0.764562 -0.38040
## Detergents_Paper -0.871555  0.09803042  0.4857966  0.214116  0.49713
## Delicassen       -0.165124  0.92632351 -0.2425744 -0.506797 -0.16060
##                     Dir6
## Fresh            0.32111
## Milk             0.34435
## Grocery          0.38478
## Frozen           0.21966
## Detergents_Paper 0.73822
## Delicassen       0.19244
## 
##                Dir1   Dir2    Dir3    Dir4     Dir5     Dir6
## Eigenvalues 184.597 24.996  9.7017  4.0829  0.99462   0.7671
## Cum. %       81.992 93.095 97.4040 99.2175 99.65928 100.0000
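The `Cum. %` row is the running share of the eigenvalue total. A short Python check using the printed eigenvalues:

```python
from itertools import accumulate

# Eigenvalues for Dir1..Dir6, copied from the output above.
eigenvalues = [184.597, 24.996, 9.7017, 4.0829, 0.99462, 0.7671]

total = sum(eigenvalues)
cumulative_pct = [100 * running / total for running in accumulate(eigenvalues)]

for direction, pct in enumerate(cumulative_pct, start=1):
    print(f"Dir{direction}: {pct:.3f}")
```

The first two directions account for roughly 93% of the total, so a two-dimensional projection retains most of the separation between the fitted components.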

DBSCAN: for dealing with outliers

## LOF (local outlier factor)
## larger bubbles in the visualization have a larger LOF

  • We can extract the cluster labels and the outliers to plot the results.

  • In line with our intuition, the DBSCAN algorithm identified one cluster of customers whose grocery and milk purchases are close to the mean. In addition, it flagged customers whose annual purchasing behavior deviated too heavily from that of the other customers.

INFERENCE

## The top three ranking algorithms for each measure
##     1              2              3             
## APN hierarchical-2 hierarchical-3 hierarchical-4
## AD  pam-8          pam-7          kmeans-8      
## ADM hierarchical-2 hierarchical-4 hierarchical-5
## FOM kmeans-8       pam-8          kmeans-7
## Optimal scores
##                  Score       Method Clusters
## Connectivity 2.9289683 hierarchical        2
## Dunn         0.3852785 hierarchical        2
## Silhouette   0.7957468 hierarchical        2