Assignment 4

In this homework I will try to cluster the data into segments based on 5 numerical variables. At the end I will try to validate the results with 2 demographical variables (one numerical and one categorical)

#Importing data set
mydata <- read.csv("./Customer_Segmentation.csv")

#Removing variables from a data set
mydata1 <- mydata[, !(names(mydata) %in% c("Year_Birth", "Kidhome", "Teenhome", "Dt_Customer", "Recency", "NumDealsPurchases", "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases", "NumWebVisitsMonth", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5", "AcceptedCmp1", "AcceptedCmp2", "Complain", "Z_CostContact", "Z_Revenue", "Response", "MntGoldProds", "Marital_Status"))]

head(mydata1)

##     ID  Education Income MntWines MntFruits MntMeatProducts MntFishProducts
## 1 5524 Graduation  58138      635        88             546             172
## 2 2174 Graduation  46344       11         1               6               2
## 3 4141 Graduation  71613      426        49             127             111
## 4 6182 Graduation  26646       11         4              20              10
## 5 5324        PhD  58293      173        43             118              46
## 6 7446     Master  62513      520        42              98               0
##   MntSweetProducts
## 1               88
## 2                1
## 3               21
## 4                3
## 5               27
## 6               42

Explanation of a data set:

unit of observation: a customer
number of observations: 2240
source: Kaggle.com (Customer Segmentation: Clustering)

Explanation of the variables in the data set:

ID: unique identifier for each unit in the data set
Education: the highest level of education attained by individual (“Graduation”, “PhD”, “Master”, “Basic”, “2n Cycle”)
Income: the annual income of individual
MntWines: the amount of money spent on wines
MntFruits: the amount of money spent on fruits
MntMeatProducts: the amount of money spent on meat products
MntFishProducts: the amount of money spent on fish products
MntSweetProducts: the amount of money spend on sweet products

#Checking for NA's
any(is.na(mydata1))

## [1] TRUE

library(tidyr)
mydata1 <- drop_na(mydata)

#Creating a sample of 300 units
set.seed(1)
mydata2 <- mydata1[sample(nrow(mydata1), size = 300), ]

#Checking which of the values of variable "Education" is most commonly selected, so I can use it as a reference group
table(mydata2$Education)

## 
##   2n Cycle      Basic Graduation     Master        PhD 
##         33          3        161         40         63

Since the most commonly selected level of education is “Graduate” I will select it as a reference group and make sure it is on the first place when factoring it.

#Factoring variables
mydata2$Education <- factor(mydata2$Education,
                             levels = c("Graduation", "PhD", "Master", "2n Cycle", "Basic"), 
                             labels = c("Graduation", "PhD", "Master", "2n Cycle", "Basic"))

head(mydata2)

##        ID Year_Birth  Education Marital_Status Income Kidhome Teenhome
## 1017 8985       1964   2n Cycle       Together  68316       0        1
## 679   143       1970 Graduation         Single  61209       0        0
## 2177 9014       1975 Graduation        Married  37085       1        1
## 930  6810       1983 Graduation       Divorced  82025       0        0
## 1533 2217       1975   2n Cycle        Married  37284       1        1
## 471  1676       1975 Graduation        Married  43057       0        1
##      Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts
## 1017  04/11/2012      54      806        80             161             120
## 679   25/08/2013      73      466         0             224             119
## 2177  26/06/2014      65       39         1              16               2
## 930   28/05/2013      76      267        98             606              48
## 1533  29/03/2013      46       11         1               2               2
## 471   30/10/2013      30      213         2              44               0
##      MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
## 1017               11           33                 5              10
## 679                49           99                 1               5
## 2177                0            3                 4               3
## 930                70           98                 1               3
## 1533                1            6                 1               0
## 471                 2            5                 4               4
##      NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3
## 1017                   7                10                 6            0
## 679                    3                 4                 2            0
## 2177                   0                 3                 8            0
## 930                    2                 6                 1            0
## 1533                   0                 3                 6            0
## 471                    2                 5                 5            0
##      AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact
## 1017            0            0            0            0        0             3
## 679             0            0            0            0        0             3
## 2177            0            0            0            0        0             3
## 930             0            1            0            0        0             3
## 1533            0            0            0            0        0             3
## 471             0            0            0            0        0             3
##      Z_Revenue Response
## 1017        11        0
## 679         11        0
## 2177        11        0
## 930         11        1
## 1533        11        0
## 471         11        0

#Descriptive statistics, showing numerical variables
summary(mydata2[c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts")])

##     MntWines        MntFruits      MntMeatProducts  MntFishProducts 
##  Min.   :   0.0   Min.   :  0.00   Min.   :   1.0   Min.   :  0.00  
##  1st Qu.:  30.0   1st Qu.:  2.00   1st Qu.:  16.0   1st Qu.:  3.00  
##  Median : 200.5   Median :  9.00   Median :  71.0   Median : 13.00  
##  Mean   : 327.8   Mean   : 25.80   Mean   : 187.4   Mean   : 37.32  
##  3rd Qu.: 546.2   3rd Qu.: 28.25   3rd Qu.: 252.2   3rd Qu.: 50.25  
##  Max.   :1396.0   Max.   :199.00   Max.   :1622.0   Max.   :237.00  
##  MntSweetProducts
##  Min.   :  0.00  
##  1st Qu.:  1.00  
##  Median :  9.00  
##  Mean   : 27.72  
##  3rd Qu.: 35.50  
##  Max.   :194.00

Mean for variable “MntFruits”: on average customer spend 25.80$ on fruits. (currency is not defined in data set)
Median “MntMeatProducts”: 50% of the customers spend on average 71$ on meat products while the other half spend more than that.

Clustering the data

When clustering data we want units within groups to be as homogenuous as possible, while groups should be as heterogenuous as possible between themselves.

Auumptions for clustering:

There are natural groups of objects in a population (theoretical structure of the analysis is important)
Representativeness of a sample (sufficiently large sample)
We pay attention to a problem of too strong correlation between cluster variables (it should be <0.03). If correlated, we should use the same number of variables from each group of related variables.

As cluster variables I chose numeric variables: “MntWines”, “MntFruits”, “MntMeatProducts”, “MntFishProducts”, “MntSweetProducts”

As a verification variable I chose categorical variable “Education” and numerical variable “Income”

Firstly, all variables should be standardized because they have different scales. We are doing so to make sure they have the equal effect on the result.

mydata2$MntWines_z <- scale(mydata2$MntWines)
mydata2$MntFruits_z <- scale(mydata2$MntFruits)
mydata2$MntMeatProducts_z <- scale(mydata2$MntMeatProducts)
mydata2$MntFishProducts_z <- scale(mydata2$MntFishProducts)
mydata2$MntSweetProducts_z <- scale(mydata2$MntSweetProducts)

#Checking for correlation between variables
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata2[, c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")]),
      type = "pearson")

##                    MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## MntWines_z               1.00        0.36              0.53              0.47
## MntFruits_z              0.36        1.00              0.47              0.53
## MntMeatProducts_z        0.53        0.47              1.00              0.59
## MntFishProducts_z        0.47        0.53              0.59              1.00
## MntSweetProducts_z       0.39        0.47              0.57              0.64
##                    MntSweetProducts_z
## MntWines_z                       0.39
## MntFruits_z                      0.47
## MntMeatProducts_z                0.57
## MntFishProducts_z                0.64
## MntSweetProducts_z               1.00
## 
## n= 300 
## 
## 
## P
##                    MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## MntWines_z                     0           0                 0               
## MntFruits_z         0                      0                 0               
## MntMeatProducts_z   0          0                             0               
## MntFishProducts_z   0          0           0                                 
## MntSweetProducts_z  0          0           0                 0               
##                    MntSweetProducts_z
## MntWines_z          0                
## MntFruits_z         0                
## MntMeatProducts_z   0                
## MntFishProducts_z   0                
## MntSweetProducts_z

From the table we can see some correlation between variables, however I will keep them since we had similar examples in the class. As stated in the assumptions, in this case we should use the same number of variables from each group of related variables.

mydata2$Dissimilarity <- sqrt(mydata2$MntWines_z^2 + mydata2$MntFruits_z^2 + mydata2$MntMeatProducts_z^2 + mydata2$MntFishProducts_z^2 + mydata2$MntSweetProducts_z^2)

head(mydata2[order(-mydata2$Dissimilarity), c("ID" , "Dissimilarity")], 10)

##        ID Dissimilarity
## 444  4947      6.031356
## 998  5236      5.960135
## 843  1456      5.830813
## 675  1501      5.740319
## 912  8931      5.497329
## 634  4611      5.404004
## 1167 5735      5.319408
## 1966 3334      5.128727
## 549  3179      5.063539
## 2171 8722      4.963794

I will not exclude units because there are not too big gaps.

library(factoextra)

## Warning: package 'factoextra' was built under R version 4.3.2

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

#Calculating Euclidean distances
distance <- get_dist(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
                     method = "euclidian")

#Calculating squared euclidian distances
distance2 <- distance^2 

#Showing dissimilarity matrix
fviz_dist(distance2)

From the picture above, we can see that there are some natural groups forming. Dimensions of matrix 300x300.

#Calculating Hopkins statistic
get_clust_tendency(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
                   n = nrow(mydata2) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.7823613
## 
## $plot
## NULL

We want Hopkins statistic to be as high as possible and the cut point is 0.5. Since our value is around 0.78 we can conclude that our data is clusterable.

Hierarchical clustering

library(factoextra)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

WARD <- mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")] %>%
  get_dist(method = "euclidean") %>%
  hclust(method = "ward.D2")

WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 300

fviz_dend(WARD)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

From the dendogram I would say it should be cut in 4 groups. Since I have 300 units in my sample data, it would take 296 steps to divide them into 3 groups.

#Coloring clusters for easier visualization
fviz_dend(WARD,
          k = 4,
          cex = 0.5,
          palette = "jama",
          color_labels_by_k = TRUE,
          rect = TRUE)

From the graph above, we can see that 4 groups would be the most optimal solution in our case.

mydata2$ClusterWard <- cutree(WARD,
                             k = 4)

head(mydata2[c(-2, -3)])

##        ID Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines
## 1017 8985       Together  68316       0        1  04/11/2012      54      806
## 679   143         Single  61209       0        0  25/08/2013      73      466
## 2177 9014        Married  37085       1        1  26/06/2014      65       39
## 930  6810       Divorced  82025       0        0  28/05/2013      76      267
## 1533 2217        Married  37284       1        1  29/03/2013      46       11
## 471  1676        Married  43057       0        1  30/10/2013      30      213
##      MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds
## 1017        80             161             120               11           33
## 679          0             224             119               49           99
## 2177         1              16               2                0            3
## 930         98             606              48               70           98
## 1533         1               2               2                1            6
## 471          2              44               0                2            5
##      NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases
## 1017                 5              10                   7                10
## 679                  1               5                   3                 4
## 2177                 4               3                   0                 3
## 930                  1               3                   2                 6
## 1533                 1               0                   0                 3
## 471                  4               4                   2                 5
##      NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## 1017                 6            0            0            0            0
## 679                  2            0            0            0            0
## 2177                 8            0            0            0            0
## 930                  1            0            0            1            0
## 1533                 6            0            0            0            0
## 471                  5            0            0            0            0
##      AcceptedCmp2 Complain Z_CostContact Z_Revenue Response MntWines_z
## 1017            0        0             3        11        0  1.3421561
## 679             0        0             3        11        0  0.3879834
## 2177            0        0             3        11        0 -0.8103451
## 930             0        0             3        11        1 -0.1704882
## 1533            0        0             3        11        0 -0.8889241
## 471             0        0             3        11        0 -0.3220333
##      MntFruits_z MntMeatProducts_z MntFishProducts_z MntSweetProducts_z
## 1017   1.3479590        -0.1036690         1.6279658         -0.4062618
## 679   -0.6417707         0.1439105         1.6082751          0.5170604
## 2177  -0.6168991        -0.6734948        -0.6955430         -0.6735392
## 930    1.7956482         1.6451068         0.2102316          1.0273174
## 1533  -0.6168991        -0.7285125        -0.6955430         -0.6492413
## 471   -0.5920275        -0.5634595        -0.7349245         -0.6249433
##      Dissimilarity ClusterWard
## 1017      2.538591           1
## 679       1.853918           1
## 2177      1.558286           2
## 930       2.656948           3
## 1533      1.614551           2
## 471       1.304744           2

#Calculating positions of initial leaders
Initial_leaders <- aggregate(mydata2[ , c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
                             by = list(mydata2$ClusterWard),
                             FUN = mean)

Initial_leaders

##   Group.1 MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## 1       1  0.3472698   0.3646019         0.8502833         0.9352628
## 2       2 -0.7200433  -0.5051342        -0.6185020        -0.5906088
## 3       3  1.1860506   2.0707905         1.4238080         1.3122984
## 4       4  1.2220034  -0.2530794        -0.1118028        -0.2637202
##   MntSweetProducts_z
## 1          0.6918606
## 2         -0.5572781
## 3          1.6028752
## 4         -0.2231795

#K-Means clustering, initial leaders are chosen as centroids of groups with hierarchical clustering
K_MEANS <- hkmeans(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")], 
                   k = 4,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

K_MEANS

## Hierarchical K-means clustering with 4 clusters of sizes 50, 174, 25, 51
## 
## Cluster means:
##   MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z MntSweetProducts_z
## 1  0.5249914  0.20834129         1.3442387        1.31291377          1.2746706
## 2 -0.6597517 -0.44837245        -0.5997088       -0.54989936         -0.5164404
## 3  0.8262293  2.77559009         1.1307702        1.26880648          1.2800161
## 4  1.3312057 -0.03509825         0.1738851       -0.03300711         -0.1151628
## 
## Clustering vector:
## 1017  679 2177  930 1533  471  270 1211  597 1301 1974  330 1799 1615 1749   37 
##    4    1    2    3    2    2    2    2    1    3    1    4    2    2    1    2 
## 1129  729  878  485 1826  975  554 1446 2159 1948 1530  343   40  537  248 1222 
##    2    2    2    4    2    1    4    2    2    2    2    2    2    2    2    2 
## 2087 1414 1304 1696  526 1069 1426   22 1742 1766 1395 1128  983 1791 1516 1640 
##    1    2    2    2    2    2    2    4    2    3    4    4    2    2    4    2 
## 1639  843  465  996 1549 1200 1134   84 1895 1165 2191 1328  557 1685  287 1522 
##    2    3    2    2    2    4    2    2    1    2    1    2    2    2    2    4 
##  858 1840  576  990  316 1075  733 1314  811  955 1167 1706  501  536  988 1067 
##    4    2    2    2    2    1    4    2    2    2    3    3    3    2    2    2 
## 1025   29 1942  838 1820 1652 1317  573 1966  369   86 1507 1646  355 1843 1073 
##    2    2    1    2    1    4    1    2    3    2    2    2    4    2    2    4 
##  361 1340 1266 1841  247  751  219  135  111  532  408 1589  912 2151 1836  359 
##    4    2    4    2    2    2    2    4    2    1    2    2    3    4    1    2 
## 1148 2042 1101 1242 1218   19  273  418 1995 1567  867  686  403 1611 1040 1064 
##    2    4    1    4    2    2    2    4    2    2    2    1    1    3    1    4 
## 1801  818  634  664  719  500  423  421  989  140  126  998 2090 1154 1536  504 
##    1    2    3    4    2    2    2    4    4    2    2    1    2    2    3    2 
## 1481  358  785 1572  305 1833  981  129  309  441 2064  470 1360 1822 1790  349 
##    4    2    4    2    2    2    1    2    2    2    1    2    2    2    2    2 
##  894  590 1956  474 2081  952  455 1577   15 1465 1318   62  390 1668 1059  381 
##    1    1    2    4    4    2    2    2    4    4    2    2    2    2    3    2 
## 1891   77 2200 1721 1351  665   31 2030  549 2012 2171  743 1052 2045 1172  797 
##    2    2    2    4    1    2    2    2    3    4    1    1    2    2    2    1 
## 1151  810 1596  961 1308  334 2034 1292   93 2060 1324  714  610 1306 1265   33 
##    1    2    3    2    3    2    2    1    2    2    3    1    4    2    2    1 
## 1020 2043  437 1141 2126  217  792  805  108  271 1294 2115  209 1007 1362 1633 
##    1    2    2    1    2    4    1    1    3    2    2    2    4    2    4    2 
## 1760 1608 2109  568 1002 2029 1735 1458  768  201  916 1378 1381 1373 2091 1538 
##    2    2    1    2    2    2    2    4    1    1    3    2    2    3    2    4 
##  116  643  879 2193 1022 1257 1295 1692 1463 1221 1244  462 1957  299  235  513 
##    4    2    2    2    1    4    4    1    4    2    2    2    2    2    2    2 
## 1498  173 1889 1107 1960  407  324 1744  731  185 2168  895 1894  765 1204  464 
##    2    2    1    2    3    4    3    1    2    2    1    2    3    4    2    2 
## 1698 1757  493  675 1866  966  444 1191   49 1495 1726 1697  291 1931  653 2116 
##    2    1    4    1    2    2    1    2    2    2    2    1    4    1    2    2 
##  434  741 2119 1370 1925  880  936  674  836   56 1049  934 
##    2    3    2    4    2    1    2    3    2    4    4    2 
## 
## Within cluster sum of squares by cluster:
## [1] 251.10677  65.84700 131.68368  84.76356
##  (between_SS / total_SS =  64.3 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

By observing clustering vector, we can see which units is classified to which specific group. We want the ratio between_SS / total_SS to be high, while we want to minimize sum of squares within the group. The value of 64.3% indicates that the groups somehow differentiate between units.

#Visual representation
library(factoextra)
fviz_cluster(K_MEANS, 
             palette = "jama", 
             repel = FALSE,
             ggtheme = theme_classic())

mydata2$ClusterK_Means <- K_MEANS$cluster

head(mydata2[c("ID", "ClusterWard", "ClusterK_Means")])

##        ID ClusterWard ClusterK_Means
## 1017 8985           1              4
## 679   143           1              1
## 2177 9014           2              2
## 930  6810           3              3
## 1533 2217           2              2
## 471  1676           2              2

#Checking for reclassifications
table(mydata2$ClusterWard)

## 
##   1   2   3   4 
##  67 158  32  43

table(mydata2$ClusterK_Means)

## 
##   1   2   3   4 
##  50 174  25  51

table(mydata2$ClusterWard, mydata2$ClusterK_Means)

##    
##       1   2   3   4
##   1  42  10   3  12
##   2   0 158   0   0
##   3   8   0  22   2
##   4   0   6   0  37

Based on information from the table, we can observe that there were some reclassifications.

3rd row: When using hierarchical clustering the 3rd group initially contained of 32 units. After using K-Means clustering method 8 units were reclassified to the first group and 2 units were reclassified to the fourth group.
3rd column: In the 3rd group we have 25 units after using K-Means clustering method. Initially there were 22 units, while 3 units were reclassified from the first group.

#Computing centroids
Centroids <- K_MEANS$centers
Centroids

##   MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z MntSweetProducts_z
## 1  0.5249914  0.20834129         1.3442387        1.31291377          1.2746706
## 2 -0.6597517 -0.44837245        -0.5997088       -0.54989936         -0.5164404
## 3  0.8262293  2.77559009         1.1307702        1.26880648          1.2800161
## 4  1.3312057 -0.03509825         0.1738851       -0.03300711         -0.1151628

library(ggplot2)
library(tidyr)

Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z"))

Figure$Groups <- factor(Figure$id, 
                        levels = c(1, 2, 3, 4, 5), 
                        labels = c("1", "2", "3", "4", "5"))

Figure$nameFactor <- factor(Figure$name, 
                            levels = c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z"), 
                            labels = c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z"))

ggplot(Figure, aes(x = nameFactor, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Groups, col = Groups), size = 3) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables")

Here we can see trends across three groups, whether consumers spend above or below average and on which specific categories of food and drinks.

In the first group consumers spend above average on all categories, however, they spend more on Meat, Fish and Sweets in comparison to Wine and Fruit.

In the second group, consumers spend below average on all categories of food and drinks.

In the third group consumers spend above average on all categories of food and drinks, while they spend the most on Fruits.

In the fourth group, consumers spend above average on Wine and slightly above average on Meat products, while they spend slightly below average on Fruits, Fish products and Sweets.

#Checking if all the variables are successful at classifying units into groups
fit <- aov(cbind(MntWines_z, MntFruits_z, MntMeatProducts_z, MntFishProducts_z, MntSweetProducts_z)~ as.factor(ClusterK_Means),
           data = mydata2)

summary(fit)

##  Response 1 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   3 196.96  65.654  190.45 < 2.2e-16 ***
## Residuals                 296 102.04   0.345                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   3 229.811  76.604  327.72 < 2.2e-16 ***
## Residuals                 296  69.189   0.234                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 3 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   3 186.44  62.145  163.42 < 2.2e-16 ***
## Residuals                 296 112.56   0.380                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 4 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   3 179.10  59.702  147.39 < 2.2e-16 ***
## Residuals                 296 119.89   0.405                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 5 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   3 169.28  56.428  128.76 < 2.2e-16 ***
## Residuals                 296 129.72   0.438                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA test:

H0: Means of the respective variable are equal across all groups.
H1: At least one of the means of the respective variables differs across the groups.

Since ANOVA tests are significant for all 5 variables we can reject H0 and conclude that means differ across the four groups (p<0.001).

Validation of the results

Education

chisq_results <- chisq.test(mydata2$Education, as.factor(mydata2$ClusterK_Means))

## Warning in chisq.test(mydata2$Education, as.factor(mydata2$ClusterK_Means)):
## Chi-squared approximation may be incorrect

chisq_results

## 
##  Pearson's Chi-squared test
## 
## data:  mydata2$Education and as.factor(mydata2$ClusterK_Means)
## X-squared = 13.4, df = 12, p-value = 0.3406

Hypothesis for Chi square test:

H0: There is no association between Education and clusters by K-Means.

H1: There is association between Education and clusters by K-Means.

We don’t have enough statistical evidence to prove that there is an association between Education and clusters by K-Means.

addmargins(chisq_results$observed)

##                  
## mydata2$Education   1   2   3   4 Sum
##        Graduation  28  92  17  24 161
##        PhD         10  33   2  18  63
##        Master       5  27   3   5  40
##        2n Cycle     7  19   3   4  33
##        Basic        0   3   0   0   3
##        Sum         50 174  25  51 300

round(chisq_results$expected, 2)

##                  
## mydata2$Education     1     2     3     4
##        Graduation 26.83 93.38 13.42 27.37
##        PhD        10.50 36.54  5.25 10.71
##        Master      6.67 23.20  3.33  6.80
##        2n Cycle    5.50 19.14  2.75  5.61
##        Basic       0.50  1.74  0.25  0.51

round(chisq_results$res, 2)

##                  
## mydata2$Education     1     2     3     4
##        Graduation  0.23 -0.14  0.98 -0.64
##        PhD        -0.15 -0.59 -1.42  2.23
##        Master     -0.65  0.79 -0.18 -0.69
##        2n Cycle    0.64 -0.03  0.15 -0.68
##        Basic      -0.71  0.96 -0.50 -0.71

Income

aggregate(mydata2$Income, 
          by = list(mydata2$ClusterK_Means), 
          FUN = "mean")

##   Group.1        x
## 1       1 75027.40
## 2       2 40065.02
## 3       3 76797.24
## 4       4 69604.98

Here we can see the means of income across all 4 groups.

#Checking whether Income has statistically significant on groups classified by K-Means
fit <- aov(Income ~ as.factor(ClusterK_Means), 
           data = mydata2)
summary(fit)

##                            Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(ClusterK_Means)   3 8.131e+10 2.710e+10   107.7 <2e-16 ***
## Residuals                 296 7.445e+10 2.515e+08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA test:

H0: The means of income are the same across 4 cluster groups.
H1: At least one cluster group differs in the mean of Income.

We can reject H0 and conclude that at least one cluster group differs from the others in the mean of Income inside it (p<0.001).

We can see that people in the group 1 have the highest income, while they also spend the most on average.

Conclusion

Classification of 300 customers into 4 groups was based on 5 standardized variables: “MntWines”, “MntFruits”, “MntMeatProducts”, “MntFishProducts”, “MntSweetProducts”.

I started with hierarchical clustering using Ward’s algorithm and based on dendogram I decided to divide the sample into 4 groups. After that I applied K-Means method to reassess units among groups.

Second group:

After clustering with K-Means second group contains the highest number of consumers (58%) characterized by below average value of all cluster variables. People in this group are mostly graduates with the lowest income out of all four groups.