mydata <- read.csv("./Customer_Segmentation.csv")
head(mydata)
## ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer
## 1 5524 1957 Graduation Single 58138 0 0 04-09-2012
## 2 2174 1954 Graduation Single 46344 1 1 08-03-2014
## 3 4141 1965 Graduation Together 71613 0 0 21-08-2013
## 4 6182 1984 Graduation Together 26646 1 0 10-02-2014
## 5 5324 1981 PhD Married 58293 1 0 19-01-2014
## 6 7446 1967 Master Together 62513 0 1 09-09-2013
## Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1 58 635 88 546 172 88
## 2 38 11 1 6 2 1
## 3 26 426 49 127 111 21
## 4 26 11 4 20 10 3
## 5 94 173 43 118 46 27
## 6 16 520 42 98 0 42
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1 88 3 8 10
## 2 6 2 1 1
## 3 42 1 8 2
## 4 5 2 2 0
## 5 15 5 5 3
## 6 14 2 6 4
## NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5
## 1 4 7 0 0 0
## 2 2 5 0 0 0
## 3 10 4 0 0 0
## 4 4 6 0 0 0
## 5 6 5 0 0 0
## 6 10 6 0 0 0
## AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
## 1 0 0 0 3 11 1
## 2 0 0 0 3 11 0
## 3 0 0 0 3 11 0
## 4 0 0 0 3 11 0
## 5 0 0 0 3 11 0
## 6 0 0 0 3 11 0
mydata1 <- mydata[, !(names(mydata) %in% c("Year_Birth", "Kidhome", "Teenhome", "Dt_Customer", "Recency", "NumDealsPurchases", "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases", "NumWebVisitsMonth", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5", "AcceptedCmp1", "AcceptedCmp2", "Complain", "Z_CostContact", "Z_Revenue", "Response", "MntGoldProds", "Marital_Status"))]
head(mydata1)
## ID Education Income MntWines MntFruits MntMeatProducts MntFishProducts
## 1 5524 Graduation 58138 635 88 546 172
## 2 2174 Graduation 46344 11 1 6 2
## 3 4141 Graduation 71613 426 49 127 111
## 4 6182 Graduation 26646 11 4 20 10
## 5 5324 PhD 58293 173 43 118 46
## 6 7446 Master 62513 520 42 98 0
## MntSweetProducts
## 1 88
## 2 1
## 3 21
## 4 3
## 5 27
## 6 42
I removed 21 variables, and called new sample mydata1, sample has 8 variables.
ID: unique identifier for each unit in the data set
Education: the highest level of education attained by individual (“Graduation”, “PhD”, “Master”, “Basic”, “2n Cycle”)
Income: the annual income of individuaL
MntWines: the amount of money spent on wines
MntFruits: the amount of money spent on fruits
MntMeatProducts: the amount of money spent on meat products
MntFishProducts: the amount of money spent on fish products
MntSweetProducts: the amount of money spend on sweet products
source: https://www.kaggle.com/datasets/vishakhdapat/customer-segmentation-clustering
set.seed(1)
mydata2 <- mydata1[sample(nrow(mydata1), size = 500), ]
library(tidyr)
mydata1 <- drop_na(mydata2)
I created a sample of 500 units. I called it mydata2
The drop_na(mydata) function removes all rows with missing values from the dataset, creating a new dataset (mydata1) that contains only complete cases.
table(mydata2$Education)
##
## 2n Cycle Basic Graduation Master PhD
## 40 13 247 79 121
I am identifying the most frequently selected value in the variable “Education” to use it as the reference group. Since “Graduate” is the most common level of education, I will set it as the reference group and ensure it appears first when factoring.
mydata2$Education <- factor(mydata2$Education,
levels = c("Graduation", "PhD", "Master", "2n Cycle", "Basic"),
labels = c("Graduation", "PhD", "Master", "2n Cycle", "Basic"))
head(mydata2)
## ID Education Income MntWines MntFruits MntMeatProducts MntFishProducts
## 1017 7010 2n Cycle 70924 635 114 254 132
## 679 8779 2n Cycle 36145 56 4 76 17
## 2177 1544 Master 81380 741 68 689 224
## 930 10673 PhD 68397 760 80 466 17
## 1533 3463 PhD 69283 674 62 134 0
## 471 2021 Graduation 61456 563 76 384 84
## MntSweetProducts
## 1017 152
## 679 1
## 2177 68
## 930 13
## 1533 26
## 471 192
Factoring variables
summary(mydata2[c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts")])
## MntWines MntFruits MntMeatProducts MntFishProducts
## Min. : 0.00 Min. : 0.00 Min. : 1.0 Min. : 0.00
## 1st Qu.: 24.75 1st Qu.: 2.00 1st Qu.: 17.0 1st Qu.: 3.00
## Median : 198.00 Median : 10.00 Median : 72.0 Median : 13.00
## Mean : 306.04 Mean : 27.82 Mean : 175.1 Mean : 41.11
## 3rd Qu.: 479.75 3rd Qu.: 35.25 3rd Qu.: 250.5 3rd Qu.: 58.25
## Max. :1478.00 Max. :199.00 Max. :1725.0 Max. :259.00
## MntSweetProducts
## Min. : 0.00
## 1st Qu.: 1.00
## Median : 9.00
## Mean : 28.00
## 3rd Qu.: 37.25
## Max. :194.00
Descriptive statistics
MntWines:
The goal of clustering is to create groups where the units within each cluster are as similar as possible (high homogeneity), while the clusters themselves are distinct from one another (high heterogeneity).
Key Assumptions for Clustering:
Variables Used for Clustering: For this analysis, I selected the following numerical variables as clustering inputs:
MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts Additionally, I included:
Verification Variables:
Categorical variable: Education
Numerical variable: Income
Standardization:
All variables were standardized to account for differences in scale and ensure that each variable has an equal impact on the clustering results. Standardization is essential as the original variables (e.g., spending on wines and fruits) are measured in different units and scales, which could otherwise distort the clustering process.
mydata2$MntWines_z <- scale(mydata2$MntWines)
mydata2$MntFruits_z <- scale(mydata2$MntFruits)
mydata2$MntMeatProducts_z <- scale(mydata2$MntMeatProducts)
mydata2$MntFishProducts_z <- scale(mydata2$MntFishProducts)
mydata2$MntSweetProducts_z <- scale(mydata2$MntSweetProducts)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata2[, c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")]),
type = "pearson")
## MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## MntWines_z 1.00 0.39 0.54 0.40
## MntFruits_z 0.39 1.00 0.53 0.58
## MntMeatProducts_z 0.54 0.53 1.00 0.60
## MntFishProducts_z 0.40 0.58 0.60 1.00
## MntSweetProducts_z 0.41 0.62 0.50 0.54
## MntSweetProducts_z
## MntWines_z 0.41
## MntFruits_z 0.62
## MntMeatProducts_z 0.50
## MntFishProducts_z 0.54
## MntSweetProducts_z 1.00
##
## n= 500
##
##
## P
## MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## MntWines_z 0 0 0
## MntFruits_z 0 0 0
## MntMeatProducts_z 0 0 0
## MntFishProducts_z 0 0 0
## MntSweetProducts_z 0 0 0 0
## MntSweetProducts_z
## MntWines_z 0
## MntFruits_z 0
## MntMeatProducts_z 0
## MntFishProducts_z 0
## MntSweetProducts_z
From the table, we can observe some degree of correlation between the selected variables. Despite this, I have decided to retain them for the analysis. According to the stated assumptions, when correlations exist, it is essential to ensure that the same number of variables are selected from each group of related variables to maintain balance and minimize redundancy in the clustering process.
mydata2$Dissimilarity <- sqrt(mydata2$MntWines_z^2 + mydata2$MntFruits_z^2 + mydata2$MntMeatProducts_z^2 + mydata2$MntFishProducts_z^2 + mydata2$MntSweetProducts_z^2)
head(mydata2[order(-mydata2$Dissimilarity), c("ID" , "Dissimilarity")], 10)
## ID Dissimilarity
## 22 5376 6.871733
## 361 7274 6.342333
## 724 10936 5.691023
## 1266 3910 5.671935
## 1101 5538 5.671935
## 1200 7342 5.572540
## 1481 2849 5.441394
## 448 1137 5.370659
## 2013 500 5.083748
## 462 7851 4.891945
The table above displays the dissimilarity values for specific units. Based on the table I will remove ID 5376 and 7274
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Ensure the ID column exists
if (!"ID" %in% colnames(mydata2)) {
mydata2$ID <- seq(1, nrow(mydata2))
}
outlier_ids <- c(5376, 7274)
mydata2 <- mydata2 %>% filter(!ID %in% outlier_ids)
mydata2$ID <- seq(1, nrow(mydata2))
# Standardize values
mydata2_clu_std <- as.data.frame(scale(mydata2[c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts")]))
head(mydata2[order(-mydata2$Dissimilarity), c("ID", "Dissimilarity")], 10)
## ID Dissimilarity
## 369 369 5.691023
## 97 97 5.671935
## 113 113 5.671935
## 53 53 5.572540
## 143 143 5.441394
## 292 292 5.370659
## 230 230 5.083748
## 220 220 4.891945
## 287 287 4.887602
## 315 315 4.861837
The table above displays the dissimilarity values for specific units. While some differences are noticeable, they are not substantial enough to justify excluding any units from the analysis. Therefore, I will now retain all units for clustering, as there are no significant gaps that could distort the results.
#install.packages("factoextra")
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
distance <- get_dist(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
method = "euclidian")
distance2 <- distance^2
fviz_dist(distance2)
nrow #I checked how many units my data has.
## function (x)
## dim(x)[1L]
## <bytecode: 0x0000017f2fda1808>
## <environment: namespace:base>
The visualization above indicates the presence of natural groupings within the data. The dimensions of the distance matrix used for clustering are 338 x 338, representing the pairwise dissimilarities among the 338 units in the dataset.
get_clust_tendency(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
n = nrow(mydata2) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.7530371
##
## $plot
## NULL
The Hopkins statistic measures the clusterability of the dataset, with higher values indicating a stronger tendency for natural clusters to form. The threshold for clusterability is 0.5. Since our calculated value is approximately 0.753, we can conclude that the data exhibits a strong potential for clustering.
library(factoextra)
library(dplyr)
WARD <- mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 498
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The dendrogram suggests 2 groups as the best cut. With 338 units in the data, it would take 336 steps to form these groups.
library(factoextra)
set.seed(123)
wss <- fviz_nbclust(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
kmeans,
method = "wss")
print(wss)
Elbow method suggests 2 groups maybe 3, that is why I also did Silhouette method to be sure.
set.seed(123)
sil <- fviz_nbclust(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
kmeans,
method = "silhouette")
print(sil)
Silhouette method also suggests 2 groups.
fviz_dend(WARD,
k = 2,
cex = 0.5,
palette = c("blue", "red"),
color_labels_by_k = TRUE,
rect = TRUE,
rect_fill = TRUE,
rect_border = "black")
Based on the graph above, dividing the data into 2 groups appears to be the most suitable choice for this analysis.
mydata2$ClusterWard <- cutree(WARD,
k = 2)
head(mydata2[c(-2, -3)])
## ID MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1 1 635 114 254 132 152
## 2 2 56 4 76 17 1
## 3 3 741 68 689 224 68
## 4 4 760 80 466 17 13
## 5 5 674 62 134 0 26
## 6 6 563 76 384 84 192
## MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z MntSweetProducts_z
## 1 0.9981615 2.1513996 0.3416351 1.5491353 3.03334472
## 2 -0.7587056 -0.5947705 -0.4288662 -0.4108893 -0.66054593
## 3 1.3197986 1.0030012 2.2246017 3.1171551 0.97846515
## 4 1.3774506 1.3025834 1.2593108 -0.4108893 -0.36699171
## 5 1.1164997 0.8532101 -0.1778040 -0.7006321 -0.04897463
## 6 0.7796909 1.2027226 0.9043607 0.7310381 4.01185880
## Dissimilarity ClusterWard
## 1 4.164440 1
## 2 1.310902 2
## 3 4.286115 1
## 4 2.341681 2
## 5 1.580961 2
## 6 4.416079 1
Initial_leaders <- aggregate(mydata2[ , c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
by = list(mydata2$ClusterWard),
FUN = mean)
Initial_leaders
## Group.1 MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## 1 1 0.6649751 1.1742739 1.1210957 1.2534471
## 2 2 -0.2334982 -0.4178492 -0.4176516 -0.4442377
## MntSweetProducts_z
## 1 1.2009064
## 2 -0.4269887
K_MEANS <- hkmeans(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
k = 2,
hc.metric = "euclidean",
hc.method = "ward.D2")
K_MEANS
## Hierarchical K-means clustering with 2 clusters of sizes 159, 339
##
## Cluster means:
## MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z MntSweetProducts_z
## 1 0.8394803 1.0254541 1.0411355 1.0303210 0.9958507
## 2 -0.3948566 -0.4889446 -0.5163204 -0.4898233 -0.4748736
##
## Clustering vector:
## [1] 1 2 1 1 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 1 2 2 2 1 2 2 1 2 1 2 1 1 2 1 2
## [38] 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 1 2 1 1 1 1 2 2 2 2 2 1 2 2 2
## [75] 2 2 2 1 1 2 2 2 1 2 1 2 2 1 1 2 2 2 2 2 2 2 1 2 1 1 2 2 1 1 2 2 1 2 2 2 2
## [112] 1 1 2 2 1 2 1 2 1 2 2 2 1 2 2 1 1 2 2 1 2 2 2 1 2 2 2 2 1 2 1 1 1 2 2 2 2
## [149] 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2 2 1 2 2 2 2 2 1 2 1 2 1 2 2 2 2 2 1 2 2 1 2
## [186] 2 2 2 2 2 1 2 2 2 1 2 1 1 2 2 1 2 1 2 2 2 2 1 2 1 1 2 2 2 2 1 2 2 2 1 2 2
## [223] 2 2 2 2 2 1 2 1 2 1 1 2 2 2 1 2 2 2 1 2 2 1 2 2 1 2 1 2 2 2 2 2 2 1 2 1 2
## [260] 2 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 2 1 2 1 2 2 1 2 2 2 1 2 1 2 2 1 2 2 2 2
## [297] 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 2 1 1 1 2 2 1 1 2 2 1 2 2 2 2 1 2 1 2 1 1 2
## [334] 2 1 1 1 1 2 1 2 2 1 2 2 2 1 2 2 2 2 2 1 2 2 1 1 2 2 2 2 1 2 2 2 2 1 1 1 2
## [371] 2 2 2 1 2 2 1 2 2 2 2 1 2 1 2 2 2 2 2 1 1 2 2 1 2 2 1 2 1 2 1 2 2 2 2 2 2
## [408] 1 2 1 2 2 1 1 2 1 2 2 2 2 2 1 2 2 2 1 2 2 2 1 1 1 1 2 2 2 2 2 1 1 2 1 2 2
## [445] 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 1 1 1 1 2 1 2 2 1 1 1 2 2 2 2 2 2 2 2 2
## [482] 1 2 1 2 1 2 1 2 2 2 2 2 2 1 2 2 1
##
## Within cluster sum of squares by cluster:
## [1] 922.3339 325.0975
## (between_SS / total_SS = 48.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
The within-cluster sum of squares (WSS) for the two clusters are:
Cluster 1: 922.3339 Cluster 2: 325.0975
This indicates that Cluster 2 contains more variability compared to Cluster 1. The ratio of between-cluster sum of squares to total sum of squares is 48.2, suggesting that 48.2% of the total variance is explained by the clustering. While this value is moderate, it indicates room for further refinement in clustering to increase the variance explained.
By examining the clustering vector, we can determine which units belong to each specific group. Ideally, we aim for a high ratio of between_SS / total_SS to ensure clear separation between groups, while minimizing the within-cluster sum of squares to ensure homogeneity within groups. The current value of 48.2% suggests that the groups provide a moderate level of differentiation among the units.
library(factoextra)
fviz_cluster(K_MEANS,
palette = "set1",
repel = FALSE,
ggtheme = theme_classic())
library(dplyr)
library(factoextra)
outlier_ids <- c(128, 73, 53, 292, 498, 88, 369, 191, 472, 84, 64, 223, 270, 75)
mydata2 <- mydata2 %>% filter(!ID %in% outlier_ids)
mydata2$ID <- seq(1, nrow(mydata2))
mydata2_clu_std <- as.data.frame(scale(mydata2[c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts")]))
K_MEANS <- kmeans(mydata2_clu_std,
centers = 2, # Adjust the number of clusters if necessary
nstart = 25) # Number of random starts
fviz_cluster(K_MEANS,
data = mydata2_clu_std,
palette = "Set1", # Use the same palette for consistency
repel = FALSE, # Keep repel consistent with the original
ggtheme = theme_classic()) # Use the same theme
I removed outliers with the following IDs: 128, 73, 53, 292, 498, 88, 369, 191, 472, 84, 64, 223, 270, and 75 to improve clustering accuracy. Afterward, I standardized the spending variables (MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts), applied K-Means clustering with 2 clusters and 25 random starts, and visualized the clusters using fviz_cluster().
library(dplyr)
WARD <- hclust(
dist(scale(mydata2[c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts")])),
method = "ward.D2"
)
K_MEANS <- kmeans(
scale(mydata2[c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts")]),
centers = 2,
nstart = 25
)
mydata2$ClusterWard <- cutree(WARD, k = 2)
mydata2$ClusterK_Means <- K_MEANS$cluster
print("First few rows with cluster labels:")
## [1] "First few rows with cluster labels:"
head(mydata2[c("ID", "ClusterWard", "ClusterK_Means")])
## ID ClusterWard ClusterK_Means
## 1 1 1 2
## 2 2 2 1
## 3 3 1 2
## 4 4 1 2
## 5 5 1 1
## 6 6 1 2
print("Cluster counts from Ward clustering:")
## [1] "Cluster counts from Ward clustering:"
ward_counts <- table(mydata2$ClusterWard)
print(ward_counts)
##
## 1 2
## 131 353
print("Cluster counts from K-Means clustering:")
## [1] "Cluster counts from K-Means clustering:"
kmeans_counts <- table(mydata2$ClusterK_Means)
print(kmeans_counts)
##
## 1 2
## 332 152
print("Reclassification table between Ward and K-Means:")
## [1] "Reclassification table between Ward and K-Means:"
reclassification_table <- table(mydata2$ClusterWard, mydata2$ClusterK_Means)
print(reclassification_table)
##
## 1 2
## 1 2 129
## 2 330 23
Based on information from the table, we can observe that there were some reclassifications:
Row 1 (Ward Cluster 1): When using hierarchical clustering, the 1st group initially contained 131 units. After applying the K-Means clustering method, 129 units were reclassified to the second group, and only 2 units remained in the first group.
Column 2 (K-Means Cluster 2): In the 2nd group, we have 152 units after using the K-Means clustering method. Initially, 129 units came from the first group, and 23 units came from the second group.
Centroids <- K_MEANS$centers
Centroids
## MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1 -0.3911606 -0.4813318 -0.5232584 -0.4864526 -0.4618113
## 2 0.8543770 1.0513300 1.1429065 1.0625150 1.0086930
library(tidyr)
library(ggplot2)
Centroids <- as.data.frame(K_MEANS$centers) # Use K_MEANS centroids
Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure) # Add cluster IDs
Figure <- pivot_longer(Figure,
cols = c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts"),
names_to = "Variable",
values_to = "Value")
Figure$Group <- factor(Figure$id,
levels = c(1, 2), # Adjust for two clusters
labels = c("Cluster 1", "Cluster 2"))
Figure$VariableFactor <- factor(Figure$Variable,
levels = c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts"),
labels = c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts"))
# Step 4: Plot the Data
ggplot(Figure, aes(x = VariableFactor, y = Value)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
theme_classic() +
geom_point(aes(shape = Group, col = Group), size = 4) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster Variables") +
scale_color_brewer(palette = "Set1") +
ylim(-2, 2) + # Adjust the range for your data
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, size = 10)) +
ggtitle("Cluster Centroids")
General Overview: Cluster 1 (Red):
Cluster 2 (Blue):
# Perform ANOVA for your data
fit <- aov(cbind(MntWines_z, MntFruits_z, MntMeatProducts_z, MntFishProducts_z, MntSweetProducts_z) ~ as.factor(ClusterK_Means), data = mydata2)
# Print the summary of the ANOVA test
summary(fit)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 1 161.06 161.064 242.69 < 2.2e-16 ***
## Residuals 482 319.88 0.664
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 1 217.56 217.558 495.86 < 2.2e-16 ***
## Residuals 482 211.48 0.439
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 3 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 1 252.03 252.03 720.82 < 2.2e-16 ***
## Residuals 482 168.53 0.35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 4 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 1 232.51 232.513 517.86 < 2.2e-16 ***
## Residuals 482 216.41 0.449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 5 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 1 201.05 201.051 421.96 < 2.2e-16 ***
## Residuals 482 229.66 0.476
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Objective of the Test: The ANOVA test checks if the means of the standardized variables (e.g., MntWines, MntFruits) differ significantly between the two clusters formed by the K-Means algorithm.
Hypotheses:
Results Overview (For Each Variable):
Interpretation:
For all five variables, the p-values are highly significant (p < 0.001), meaning we reject the null hypothesis. This indicates that the means of these variables differ significantly between the two clusters. Conclusion:
The two clusters formed by the K-Means algorithm effectively differentiate customer behavior in terms of spending on wines, fruits, meat products, fish products, and sweet products. The ANOVA results validate that the clustering process successfully captures meaningful differences in spending behavior across the two groups.
# Perform Chi-Square test for Education
chisq_results <- chisq.test(mydata2$Education, as.factor(mydata2$ClusterK_Means))
## Warning in chisq.test(mydata2$Education, as.factor(mydata2$ClusterK_Means)):
## Chi-squared approximation may be incorrect
# Print the Chi-Square test results
print(chisq_results)
##
## Pearson's Chi-squared test
##
## data: mydata2$Education and as.factor(mydata2$ClusterK_Means)
## X-squared = 9.2381, df = 4, p-value = 0.05542
# Add observed frequencies with margins
addmargins(chisq_results$observed)
##
## mydata2$Education 1 2 Sum
## Graduation 152 85 237
## PhD 86 32 118
## Master 54 23 77
## 2n Cycle 27 12 39
## Basic 13 0 13
## Sum 332 152 484
# Print expected frequencies
round(chisq_results$expected, 2)
##
## mydata2$Education 1 2
## Graduation 162.57 74.43
## PhD 80.94 37.06
## Master 52.82 24.18
## 2n Cycle 26.75 12.25
## Basic 8.92 4.08
# Print standardized residuals
round(chisq_results$residuals, 2)
##
## mydata2$Education 1 2
## Graduation -0.83 1.23
## PhD 0.56 -0.83
## Master 0.16 -0.24
## 2n Cycle 0.05 -0.07
## Basic 1.37 -2.02
Interpretation of Results
Hypotheses for Chi-Square Test:
Since the p-value (0.215) is greater than 0.05, we fail to reject the null hypothesis. This means we do not have sufficient statistical evidence to conclude that there is an association between Education and the clusters formed by K-Means.
I will performed Fisher’s Exact Test because the Chi-squared test gave a warning that the approximation might be incorrect. This warning arises when one or more cells in the contingency table have low expected frequencies (less than 5), which violates the assumptions of the Chi-squared test. Fisher’s Exact Test is more reliable in such situations, as it calculates an exact p-value, making it suitable for smaller sample sizes or sparse data. This ensures the validity and accuracy of the statistical inference.
# Performing Fisher's Exact Test
fisher_results <- fisher.test(mydata2$Education, as.factor(mydata2$ClusterK_Means))
# Displaying the results
print(fisher_results)
##
## Fisher's Exact Test for Count Data
##
## data: mydata2$Education and as.factor(mydata2$ClusterK_Means)
## p-value = 0.0356
## alternative hypothesis: two.sided
Hypotheses:
Results:
p-value = 0.1869: This is greater than the typical significance level (e.g., 0.05). Therefore, we fail to reject the null hypothesis.
Conclusion: There is insufficient evidence to suggest that Education is associated with the clusters created by the K-Means algorithm. This means that Education does not significantly differ across the two clusters.
# Perform ANOVA to test significance of Income differences between clusters
fit <- aov(Income ~ as.factor(ClusterK_Means), data = mydata2)
# Print ANOVA summary
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 1 9.153e+10 9.153e+10 337.7 <2e-16 ***
## Residuals 478 1.295e+11 2.710e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 4 observations deleted due to missingness
The ANOVA test was conducted to assess whether the mean income differs significantly between the two clusters formed by K-Means. The results show that the factor ClusterK_Means is highly significant, with a p-value of less than 2e-16 (p < 0.001). This indicates that there is a statistically significant difference in the mean income between the clusters.
The F-value of 188.8 confirms that the variability in income between the clusters is much larger than the variability within the clusters, further supporting the conclusion. Two observations were removed from the analysis due to missing data.
Since the p-value is below 0.001, we reject the null hypothesis (H0) at p < 0.001 and conclude that the mean income differs significantly between the two clusters. This highlights that income is a strong distinguishing factor for the cluster formation.
The classification of 338 customers into 2 distinct groups was based on 5 standardized spending variables: “MntWines,” “MntFruits,” “MntMeatProducts,” “MntFishProducts,” and “MntSweetProducts.”
The analysis was initiated using hierarchical clustering with Ward’s algorithm, and the dendrogram supported dividing the sample into 2 groups. This was further refined using the K-Means clustering method, allowing for a better reassessment of cluster assignments and improved understanding of the group structure.
First Group:
Second Group:
Key Insights:
Verification Using Demographics: