In this homework I will try to cluster the data into segments based on 5 numerical variables. At the end I will try to validate the results with 2 demographical variables (one numerical and one categorical)
#Importing data set
mydata <- read.csv("./Customer_Segmentation.csv")
#Removing variables from a data set
mydata1 <- mydata[, !(names(mydata) %in% c("Year_Birth", "Kidhome", "Teenhome", "Dt_Customer", "Recency", "NumDealsPurchases", "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases", "NumWebVisitsMonth", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5", "AcceptedCmp1", "AcceptedCmp2", "Complain", "Z_CostContact", "Z_Revenue", "Response", "MntGoldProds", "Marital_Status"))]
head(mydata1)
## ID Education Income MntWines MntFruits MntMeatProducts MntFishProducts
## 1 5524 Graduation 58138 635 88 546 172
## 2 2174 Graduation 46344 11 1 6 2
## 3 4141 Graduation 71613 426 49 127 111
## 4 6182 Graduation 26646 11 4 20 10
## 5 5324 PhD 58293 173 43 118 46
## 6 7446 Master 62513 520 42 98 0
## MntSweetProducts
## 1 88
## 2 1
## 3 21
## 4 3
## 5 27
## 6 42
Explanation of a data set:
unit of observation: a customer
number of observations: 2240
source: Kaggle.com (Customer Segmentation: Clustering)
Explanation of the variables in the data set:
ID: unique identifier for each unit in the data set
Education: the highest level of education attained by individual (“Graduation”, “PhD”, “Master”, “Basic”, “2n Cycle”)
Income: the annual income of individual
MntWines: the amount of money spent on wines
MntFruits: the amount of money spent on fruits
MntMeatProducts: the amount of money spent on meat products
MntFishProducts: the amount of money spent on fish products
MntSweetProducts: the amount of money spend on sweet products
#Checking for NA's
any(is.na(mydata1))
## [1] TRUE
library(tidyr)
mydata1 <- drop_na(mydata)
#Creating a sample of 300 units
set.seed(1)
mydata2 <- mydata1[sample(nrow(mydata1), size = 300), ]
#Checking which of the values of variable "Education" is most commonly selected, so I can use it as a reference group
table(mydata2$Education)
##
## 2n Cycle Basic Graduation Master PhD
## 33 3 161 40 63
Since the most commonly selected level of education is “Graduate” I will select it as a reference group and make sure it is on the first place when factoring it.
#Factoring variables
mydata2$Education <- factor(mydata2$Education,
levels = c("Graduation", "PhD", "Master", "2n Cycle", "Basic"),
labels = c("Graduation", "PhD", "Master", "2n Cycle", "Basic"))
head(mydata2)
## ID Year_Birth Education Marital_Status Income Kidhome Teenhome
## 1017 8985 1964 2n Cycle Together 68316 0 1
## 679 143 1970 Graduation Single 61209 0 0
## 2177 9014 1975 Graduation Married 37085 1 1
## 930 6810 1983 Graduation Divorced 82025 0 0
## 1533 2217 1975 2n Cycle Married 37284 1 1
## 471 1676 1975 Graduation Married 43057 0 1
## Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts
## 1017 04/11/2012 54 806 80 161 120
## 679 25/08/2013 73 466 0 224 119
## 2177 26/06/2014 65 39 1 16 2
## 930 28/05/2013 76 267 98 606 48
## 1533 29/03/2013 46 11 1 2 2
## 471 30/10/2013 30 213 2 44 0
## MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
## 1017 11 33 5 10
## 679 49 99 1 5
## 2177 0 3 4 3
## 930 70 98 1 3
## 1533 1 6 1 0
## 471 2 5 4 4
## NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3
## 1017 7 10 6 0
## 679 3 4 2 0
## 2177 0 3 8 0
## 930 2 6 1 0
## 1533 0 3 6 0
## 471 2 5 5 0
## AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact
## 1017 0 0 0 0 0 3
## 679 0 0 0 0 0 3
## 2177 0 0 0 0 0 3
## 930 0 1 0 0 0 3
## 1533 0 0 0 0 0 3
## 471 0 0 0 0 0 3
## Z_Revenue Response
## 1017 11 0
## 679 11 0
## 2177 11 0
## 930 11 1
## 1533 11 0
## 471 11 0
#Descriptive statistics, showing numerical variables
summary(mydata2[c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts")])
## MntWines MntFruits MntMeatProducts MntFishProducts
## Min. : 0.0 Min. : 0.00 Min. : 1.0 Min. : 0.00
## 1st Qu.: 30.0 1st Qu.: 2.00 1st Qu.: 16.0 1st Qu.: 3.00
## Median : 200.5 Median : 9.00 Median : 71.0 Median : 13.00
## Mean : 327.8 Mean : 25.80 Mean : 187.4 Mean : 37.32
## 3rd Qu.: 546.2 3rd Qu.: 28.25 3rd Qu.: 252.2 3rd Qu.: 50.25
## Max. :1396.0 Max. :199.00 Max. :1622.0 Max. :237.00
## MntSweetProducts
## Min. : 0.00
## 1st Qu.: 1.00
## Median : 9.00
## Mean : 27.72
## 3rd Qu.: 35.50
## Max. :194.00
Mean for variable “MntFruits”: on average customer spend 25.80$ on fruits. (currency is not defined in data set)
Median “MntMeatProducts”: 50% of the customers spend on average 71$ on meat products while the other half spend more than that.
When clustering data we want units within groups to be as homogenuous as possible, while groups should be as heterogenuous as possible between themselves.
Auumptions for clustering:
There are natural groups of objects in a population (theoretical structure of the analysis is important)
Representativeness of a sample (sufficiently large sample)
We pay attention to a problem of too strong correlation between cluster variables (it should be <0.03). If correlated, we should use the same number of variables from each group of related variables.
As cluster variables I chose numeric variables: “MntWines”, “MntFruits”, “MntMeatProducts”, “MntFishProducts”, “MntSweetProducts”
As a verification variable I chose categorical variable “Education” and numerical variable “Income”
Firstly, all variables should be standardized because they have different scales. We are doing so to make sure they have the equal effect on the result.
mydata2$MntWines_z <- scale(mydata2$MntWines)
mydata2$MntFruits_z <- scale(mydata2$MntFruits)
mydata2$MntMeatProducts_z <- scale(mydata2$MntMeatProducts)
mydata2$MntFishProducts_z <- scale(mydata2$MntFishProducts)
mydata2$MntSweetProducts_z <- scale(mydata2$MntSweetProducts)
#Checking for correlation between variables
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata2[, c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")]),
type = "pearson")
## MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## MntWines_z 1.00 0.36 0.53 0.47
## MntFruits_z 0.36 1.00 0.47 0.53
## MntMeatProducts_z 0.53 0.47 1.00 0.59
## MntFishProducts_z 0.47 0.53 0.59 1.00
## MntSweetProducts_z 0.39 0.47 0.57 0.64
## MntSweetProducts_z
## MntWines_z 0.39
## MntFruits_z 0.47
## MntMeatProducts_z 0.57
## MntFishProducts_z 0.64
## MntSweetProducts_z 1.00
##
## n= 300
##
##
## P
## MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## MntWines_z 0 0 0
## MntFruits_z 0 0 0
## MntMeatProducts_z 0 0 0
## MntFishProducts_z 0 0 0
## MntSweetProducts_z 0 0 0 0
## MntSweetProducts_z
## MntWines_z 0
## MntFruits_z 0
## MntMeatProducts_z 0
## MntFishProducts_z 0
## MntSweetProducts_z
From the table we can see some correlation between variables, however I will keep them since we had similar examples in the class. As stated in the assumptions, in this case we should use the same number of variables from each group of related variables.
mydata2$Dissimilarity <- sqrt(mydata2$MntWines_z^2 + mydata2$MntFruits_z^2 + mydata2$MntMeatProducts_z^2 + mydata2$MntFishProducts_z^2 + mydata2$MntSweetProducts_z^2)
head(mydata2[order(-mydata2$Dissimilarity), c("ID" , "Dissimilarity")], 10)
## ID Dissimilarity
## 444 4947 6.031356
## 998 5236 5.960135
## 843 1456 5.830813
## 675 1501 5.740319
## 912 8931 5.497329
## 634 4611 5.404004
## 1167 5735 5.319408
## 1966 3334 5.128727
## 549 3179 5.063539
## 2171 8722 4.963794
I will not exclude units because there are not too big gaps.
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#Calculating Euclidean distances
distance <- get_dist(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
method = "euclidian")
#Calculating squared euclidian distances
distance2 <- distance^2
#Showing dissimilarity matrix
fviz_dist(distance2)
From the picture above, we can see that there are some natural groups forming. Dimensions of matrix 300x300.
#Calculating Hopkins statistic
get_clust_tendency(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
n = nrow(mydata2) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.7823613
##
## $plot
## NULL
We want Hopkins statistic to be as high as possible and the cut point is 0.5. Since our value is around 0.78 we can conclude that our data is clusterable.
library(factoextra)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
WARD <- mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 300
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
From the dendogram I would say it should be cut in 4 groups. Since I have 300 units in my sample data, it would take 296 steps to divide them into 3 groups.
#Coloring clusters for easier visualization
fviz_dend(WARD,
k = 4,
cex = 0.5,
palette = "jama",
color_labels_by_k = TRUE,
rect = TRUE)
From the graph above, we can see that 4 groups would be the most optimal solution in our case.
mydata2$ClusterWard <- cutree(WARD,
k = 4)
head(mydata2[c(-2, -3)])
## ID Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines
## 1017 8985 Together 68316 0 1 04/11/2012 54 806
## 679 143 Single 61209 0 0 25/08/2013 73 466
## 2177 9014 Married 37085 1 1 26/06/2014 65 39
## 930 6810 Divorced 82025 0 0 28/05/2013 76 267
## 1533 2217 Married 37284 1 1 29/03/2013 46 11
## 471 1676 Married 43057 0 1 30/10/2013 30 213
## MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds
## 1017 80 161 120 11 33
## 679 0 224 119 49 99
## 2177 1 16 2 0 3
## 930 98 606 48 70 98
## 1533 1 2 2 1 6
## 471 2 44 0 2 5
## NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases
## 1017 5 10 7 10
## 679 1 5 3 4
## 2177 4 3 0 3
## 930 1 3 2 6
## 1533 1 0 0 3
## 471 4 4 2 5
## NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## 1017 6 0 0 0 0
## 679 2 0 0 0 0
## 2177 8 0 0 0 0
## 930 1 0 0 1 0
## 1533 6 0 0 0 0
## 471 5 0 0 0 0
## AcceptedCmp2 Complain Z_CostContact Z_Revenue Response MntWines_z
## 1017 0 0 3 11 0 1.3421561
## 679 0 0 3 11 0 0.3879834
## 2177 0 0 3 11 0 -0.8103451
## 930 0 0 3 11 1 -0.1704882
## 1533 0 0 3 11 0 -0.8889241
## 471 0 0 3 11 0 -0.3220333
## MntFruits_z MntMeatProducts_z MntFishProducts_z MntSweetProducts_z
## 1017 1.3479590 -0.1036690 1.6279658 -0.4062618
## 679 -0.6417707 0.1439105 1.6082751 0.5170604
## 2177 -0.6168991 -0.6734948 -0.6955430 -0.6735392
## 930 1.7956482 1.6451068 0.2102316 1.0273174
## 1533 -0.6168991 -0.7285125 -0.6955430 -0.6492413
## 471 -0.5920275 -0.5634595 -0.7349245 -0.6249433
## Dissimilarity ClusterWard
## 1017 2.538591 1
## 679 1.853918 1
## 2177 1.558286 2
## 930 2.656948 3
## 1533 1.614551 2
## 471 1.304744 2
#Calculating positions of initial leaders
Initial_leaders <- aggregate(mydata2[ , c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
by = list(mydata2$ClusterWard),
FUN = mean)
Initial_leaders
## Group.1 MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## 1 1 0.3472698 0.3646019 0.8502833 0.9352628
## 2 2 -0.7200433 -0.5051342 -0.6185020 -0.5906088
## 3 3 1.1860506 2.0707905 1.4238080 1.3122984
## 4 4 1.2220034 -0.2530794 -0.1118028 -0.2637202
## MntSweetProducts_z
## 1 0.6918606
## 2 -0.5572781
## 3 1.6028752
## 4 -0.2231795
#K-Means clustering, initial leaders are chosen as centroids of groups with hierarchical clustering
K_MEANS <- hkmeans(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
k = 4,
hc.metric = "euclidean",
hc.method = "ward.D2")
K_MEANS
## Hierarchical K-means clustering with 4 clusters of sizes 50, 174, 25, 51
##
## Cluster means:
## MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z MntSweetProducts_z
## 1 0.5249914 0.20834129 1.3442387 1.31291377 1.2746706
## 2 -0.6597517 -0.44837245 -0.5997088 -0.54989936 -0.5164404
## 3 0.8262293 2.77559009 1.1307702 1.26880648 1.2800161
## 4 1.3312057 -0.03509825 0.1738851 -0.03300711 -0.1151628
##
## Clustering vector:
## 1017 679 2177 930 1533 471 270 1211 597 1301 1974 330 1799 1615 1749 37
## 4 1 2 3 2 2 2 2 1 3 1 4 2 2 1 2
## 1129 729 878 485 1826 975 554 1446 2159 1948 1530 343 40 537 248 1222
## 2 2 2 4 2 1 4 2 2 2 2 2 2 2 2 2
## 2087 1414 1304 1696 526 1069 1426 22 1742 1766 1395 1128 983 1791 1516 1640
## 1 2 2 2 2 2 2 4 2 3 4 4 2 2 4 2
## 1639 843 465 996 1549 1200 1134 84 1895 1165 2191 1328 557 1685 287 1522
## 2 3 2 2 2 4 2 2 1 2 1 2 2 2 2 4
## 858 1840 576 990 316 1075 733 1314 811 955 1167 1706 501 536 988 1067
## 4 2 2 2 2 1 4 2 2 2 3 3 3 2 2 2
## 1025 29 1942 838 1820 1652 1317 573 1966 369 86 1507 1646 355 1843 1073
## 2 2 1 2 1 4 1 2 3 2 2 2 4 2 2 4
## 361 1340 1266 1841 247 751 219 135 111 532 408 1589 912 2151 1836 359
## 4 2 4 2 2 2 2 4 2 1 2 2 3 4 1 2
## 1148 2042 1101 1242 1218 19 273 418 1995 1567 867 686 403 1611 1040 1064
## 2 4 1 4 2 2 2 4 2 2 2 1 1 3 1 4
## 1801 818 634 664 719 500 423 421 989 140 126 998 2090 1154 1536 504
## 1 2 3 4 2 2 2 4 4 2 2 1 2 2 3 2
## 1481 358 785 1572 305 1833 981 129 309 441 2064 470 1360 1822 1790 349
## 4 2 4 2 2 2 1 2 2 2 1 2 2 2 2 2
## 894 590 1956 474 2081 952 455 1577 15 1465 1318 62 390 1668 1059 381
## 1 1 2 4 4 2 2 2 4 4 2 2 2 2 3 2
## 1891 77 2200 1721 1351 665 31 2030 549 2012 2171 743 1052 2045 1172 797
## 2 2 2 4 1 2 2 2 3 4 1 1 2 2 2 1
## 1151 810 1596 961 1308 334 2034 1292 93 2060 1324 714 610 1306 1265 33
## 1 2 3 2 3 2 2 1 2 2 3 1 4 2 2 1
## 1020 2043 437 1141 2126 217 792 805 108 271 1294 2115 209 1007 1362 1633
## 1 2 2 1 2 4 1 1 3 2 2 2 4 2 4 2
## 1760 1608 2109 568 1002 2029 1735 1458 768 201 916 1378 1381 1373 2091 1538
## 2 2 1 2 2 2 2 4 1 1 3 2 2 3 2 4
## 116 643 879 2193 1022 1257 1295 1692 1463 1221 1244 462 1957 299 235 513
## 4 2 2 2 1 4 4 1 4 2 2 2 2 2 2 2
## 1498 173 1889 1107 1960 407 324 1744 731 185 2168 895 1894 765 1204 464
## 2 2 1 2 3 4 3 1 2 2 1 2 3 4 2 2
## 1698 1757 493 675 1866 966 444 1191 49 1495 1726 1697 291 1931 653 2116
## 2 1 4 1 2 2 1 2 2 2 2 1 4 1 2 2
## 434 741 2119 1370 1925 880 936 674 836 56 1049 934
## 2 3 2 4 2 1 2 3 2 4 4 2
##
## Within cluster sum of squares by cluster:
## [1] 251.10677 65.84700 131.68368 84.76356
## (between_SS / total_SS = 64.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
By observing clustering vector, we can see which units is classified to which specific group. We want the ratio between_SS / total_SS to be high, while we want to minimize sum of squares within the group. The value of 64.3% indicates that the groups somehow differentiate between units.
#Visual representation
library(factoextra)
fviz_cluster(K_MEANS,
palette = "jama",
repel = FALSE,
ggtheme = theme_classic())
mydata2$ClusterK_Means <- K_MEANS$cluster
head(mydata2[c("ID", "ClusterWard", "ClusterK_Means")])
## ID ClusterWard ClusterK_Means
## 1017 8985 1 4
## 679 143 1 1
## 2177 9014 2 2
## 930 6810 3 3
## 1533 2217 2 2
## 471 1676 2 2
#Checking for reclassifications
table(mydata2$ClusterWard)
##
## 1 2 3 4
## 67 158 32 43
table(mydata2$ClusterK_Means)
##
## 1 2 3 4
## 50 174 25 51
table(mydata2$ClusterWard, mydata2$ClusterK_Means)
##
## 1 2 3 4
## 1 42 10 3 12
## 2 0 158 0 0
## 3 8 0 22 2
## 4 0 6 0 37
Based on information from the table, we can observe that there were some reclassifications.
3rd row: When using hierarchical clustering the 3rd group initially contained of 32 units. After using K-Means clustering method 8 units were reclassified to the first group and 2 units were reclassified to the fourth group.
3rd column: In the 3rd group we have 25 units after using K-Means clustering method. Initially there were 22 units, while 3 units were reclassified from the first group.
#Computing centroids
Centroids <- K_MEANS$centers
Centroids
## MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z MntSweetProducts_z
## 1 0.5249914 0.20834129 1.3442387 1.31291377 1.2746706
## 2 -0.6597517 -0.44837245 -0.5997088 -0.54989936 -0.5164404
## 3 0.8262293 2.77559009 1.1307702 1.26880648 1.2800161
## 4 1.3312057 -0.03509825 0.1738851 -0.03300711 -0.1151628
library(ggplot2)
library(tidyr)
Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z"))
Figure$Groups <- factor(Figure$id,
levels = c(1, 2, 3, 4, 5),
labels = c("1", "2", "3", "4", "5"))
Figure$nameFactor <- factor(Figure$name,
levels = c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z"),
labels = c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z"))
ggplot(Figure, aes(x = nameFactor, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Groups, col = Groups), size = 3) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")
Here we can see trends across three groups, whether consumers spend
above or below average and on which specific categories of food and
drinks.
In the first group consumers spend above average on all categories, however, they spend more on Meat, Fish and Sweets in comparison to Wine and Fruit.
In the second group, consumers spend below average on all categories of food and drinks.
In the third group consumers spend above average on all categories of food and drinks, while they spend the most on Fruits.
In the fourth group, consumers spend above average on Wine and slightly above average on Meat products, while they spend slightly below average on Fruits, Fish products and Sweets.
#Checking if all the variables are successful at classifying units into groups
fit <- aov(cbind(MntWines_z, MntFruits_z, MntMeatProducts_z, MntFishProducts_z, MntSweetProducts_z)~ as.factor(ClusterK_Means),
data = mydata2)
summary(fit)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 3 196.96 65.654 190.45 < 2.2e-16 ***
## Residuals 296 102.04 0.345
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 3 229.811 76.604 327.72 < 2.2e-16 ***
## Residuals 296 69.189 0.234
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 3 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 3 186.44 62.145 163.42 < 2.2e-16 ***
## Residuals 296 112.56 0.380
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 4 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 3 179.10 59.702 147.39 < 2.2e-16 ***
## Residuals 296 119.89 0.405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 5 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 3 169.28 56.428 128.76 < 2.2e-16 ***
## Residuals 296 129.72 0.438
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA test:
H0: Means of the respective variable are equal across all groups.
H1: At least one of the means of the respective variables differs across the groups.
Since ANOVA tests are significant for all 5 variables we can reject H0 and conclude that means differ across the four groups (p<0.001).
chisq_results <- chisq.test(mydata2$Education, as.factor(mydata2$ClusterK_Means))
## Warning in chisq.test(mydata2$Education, as.factor(mydata2$ClusterK_Means)):
## Chi-squared approximation may be incorrect
chisq_results
##
## Pearson's Chi-squared test
##
## data: mydata2$Education and as.factor(mydata2$ClusterK_Means)
## X-squared = 13.4, df = 12, p-value = 0.3406
Hypothesis for Chi square test:
H0: There is no association between Education and clusters by K-Means.
H1: There is association between Education and clusters by K-Means.
We don’t have enough statistical evidence to prove that there is an association between Education and clusters by K-Means.
addmargins(chisq_results$observed)
##
## mydata2$Education 1 2 3 4 Sum
## Graduation 28 92 17 24 161
## PhD 10 33 2 18 63
## Master 5 27 3 5 40
## 2n Cycle 7 19 3 4 33
## Basic 0 3 0 0 3
## Sum 50 174 25 51 300
round(chisq_results$expected, 2)
##
## mydata2$Education 1 2 3 4
## Graduation 26.83 93.38 13.42 27.37
## PhD 10.50 36.54 5.25 10.71
## Master 6.67 23.20 3.33 6.80
## 2n Cycle 5.50 19.14 2.75 5.61
## Basic 0.50 1.74 0.25 0.51
round(chisq_results$res, 2)
##
## mydata2$Education 1 2 3 4
## Graduation 0.23 -0.14 0.98 -0.64
## PhD -0.15 -0.59 -1.42 2.23
## Master -0.65 0.79 -0.18 -0.69
## 2n Cycle 0.64 -0.03 0.15 -0.68
## Basic -0.71 0.96 -0.50 -0.71
aggregate(mydata2$Income,
by = list(mydata2$ClusterK_Means),
FUN = "mean")
## Group.1 x
## 1 1 75027.40
## 2 2 40065.02
## 3 3 76797.24
## 4 4 69604.98
Here we can see the means of income across all 4 groups.
#Checking whether Income has statistically significant on groups classified by K-Means
fit <- aov(Income ~ as.factor(ClusterK_Means),
data = mydata2)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 3 8.131e+10 2.710e+10 107.7 <2e-16 ***
## Residuals 296 7.445e+10 2.515e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA test:
H0: The means of income are the same across 4 cluster groups.
H1: At least one cluster group differs in the mean of Income.
We can reject H0 and conclude that at least one cluster group differs from the others in the mean of Income inside it (p<0.001).
We can see that people in the group 1 have the highest income, while they also spend the most on average.
Classification of 300 customers into 4 groups was based on 5 standardized variables: “MntWines”, “MntFruits”, “MntMeatProducts”, “MntFishProducts”, “MntSweetProducts”.
I started with hierarchical clustering using Ward’s algorithm and based on dendogram I decided to divide the sample into 4 groups. After that I applied K-Means method to reassess units among groups.
Second group:
After clustering with K-Means second group contains the highest number of consumers (58%) characterized by below average value of all cluster variables. People in this group are mostly graduates with the lowest income out of all four groups.