HW2

mydata <- read.csv("./Customer_Segmentation.csv")

head(mydata)

##     ID Year_Birth  Education Marital_Status Income Kidhome Teenhome Dt_Customer
## 1 5524       1957 Graduation         Single  58138       0        0  04-09-2012
## 2 2174       1954 Graduation         Single  46344       1        1  08-03-2014
## 3 4141       1965 Graduation       Together  71613       0        0  21-08-2013
## 4 6182       1984 Graduation       Together  26646       1        0  10-02-2014
## 5 5324       1981        PhD        Married  58293       1        0  19-01-2014
## 6 7446       1967     Master       Together  62513       0        1  09-09-2013
##   Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1      58      635        88             546             172               88
## 2      38       11         1               6               2                1
## 3      26      426        49             127             111               21
## 4      26       11         4              20              10                3
## 5      94      173        43             118              46               27
## 6      16      520        42              98               0               42
##   MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1           88                 3               8                  10
## 2            6                 2               1                   1
## 3           42                 1               8                   2
## 4            5                 2               2                   0
## 5           15                 5               5                   3
## 6           14                 2               6                   4
##   NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5
## 1                 4                 7            0            0            0
## 2                 2                 5            0            0            0
## 3                10                 4            0            0            0
## 4                 4                 6            0            0            0
## 5                 6                 5            0            0            0
## 6                10                 6            0            0            0
##   AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
## 1            0            0        0             3        11        1
## 2            0            0        0             3        11        0
## 3            0            0        0             3        11        0
## 4            0            0        0             3        11        0
## 5            0            0        0             3        11        0
## 6            0            0        0             3        11        0

mydata1 <- mydata[, !(names(mydata) %in% c("Year_Birth", "Kidhome", "Teenhome", "Dt_Customer", "Recency", "NumDealsPurchases", "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases", "NumWebVisitsMonth", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5", "AcceptedCmp1", "AcceptedCmp2", "Complain", "Z_CostContact", "Z_Revenue", "Response", "MntGoldProds", "Marital_Status"))]


head(mydata1)

##     ID  Education Income MntWines MntFruits MntMeatProducts MntFishProducts
## 1 5524 Graduation  58138      635        88             546             172
## 2 2174 Graduation  46344       11         1               6               2
## 3 4141 Graduation  71613      426        49             127             111
## 4 6182 Graduation  26646       11         4              20              10
## 5 5324        PhD  58293      173        43             118              46
## 6 7446     Master  62513      520        42              98               0
##   MntSweetProducts
## 1               88
## 2                1
## 3               21
## 4                3
## 5               27
## 6               42

I removed 21 variables, and called new sample mydata1, sample has 8 variables.

unit of observation: a customer
number of observations: 2240

Explanation of the variables in the data set:

ID: unique identifier for each unit in the data set
Education: the highest level of education attained by individual (“Graduation”, “PhD”, “Master”, “Basic”, “2n Cycle”)
Income: the annual income of individuaL
MntWines: the amount of money spent on wines
MntFruits: the amount of money spent on fruits
MntMeatProducts: the amount of money spent on meat products
MntFishProducts: the amount of money spent on fish products
MntSweetProducts: the amount of money spend on sweet products
source: https://www.kaggle.com/datasets/vishakhdapat/customer-segmentation-clustering

set.seed(1)
mydata2 <- mydata1[sample(nrow(mydata1), size = 500), ]

library(tidyr)
mydata1 <- drop_na(mydata2)

I created a sample of 500 units. I called it mydata2

The drop_na(mydata) function removes all rows with missing values from the dataset, creating a new dataset (mydata1) that contains only complete cases.

table(mydata2$Education)

## 
##   2n Cycle      Basic Graduation     Master        PhD 
##         40         13        247         79        121

I am identifying the most frequently selected value in the variable “Education” to use it as the reference group. Since “Graduate” is the most common level of education, I will set it as the reference group and ensure it appears first when factoring.

mydata2$Education <- factor(mydata2$Education,
                             levels = c("Graduation", "PhD", "Master", "2n Cycle", "Basic"), 
                             labels = c("Graduation", "PhD", "Master", "2n Cycle", "Basic"))

head(mydata2)

##         ID  Education Income MntWines MntFruits MntMeatProducts MntFishProducts
## 1017  7010   2n Cycle  70924      635       114             254             132
## 679   8779   2n Cycle  36145       56         4              76              17
## 2177  1544     Master  81380      741        68             689             224
## 930  10673        PhD  68397      760        80             466              17
## 1533  3463        PhD  69283      674        62             134               0
## 471   2021 Graduation  61456      563        76             384              84
##      MntSweetProducts
## 1017              152
## 679                 1
## 2177               68
## 930                13
## 1533               26
## 471               192

Factoring variables

summary(mydata2[c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts")])

##     MntWines         MntFruits      MntMeatProducts  MntFishProducts 
##  Min.   :   0.00   Min.   :  0.00   Min.   :   1.0   Min.   :  0.00  
##  1st Qu.:  24.75   1st Qu.:  2.00   1st Qu.:  17.0   1st Qu.:  3.00  
##  Median : 198.00   Median : 10.00   Median :  72.0   Median : 13.00  
##  Mean   : 306.04   Mean   : 27.82   Mean   : 175.1   Mean   : 41.11  
##  3rd Qu.: 479.75   3rd Qu.: 35.25   3rd Qu.: 250.5   3rd Qu.: 58.25  
##  Max.   :1478.00   Max.   :199.00   Max.   :1725.0   Max.   :259.00  
##  MntSweetProducts
##  Min.   :  0.00  
##  1st Qu.:  1.00  
##  Median :  9.00  
##  Mean   : 28.00  
##  3rd Qu.: 37.25  
##  Max.   :194.00

Descriptive statistics

MntWines:

Mean: The average customer spending on wines is $302.50
Median: The median spending is $173.00, showing that half of the customers spend less than this amount, and the other half spend more.
Range: The spending spans from $0 to $1,492, demonstrating a wide variability in how much customers allocate to this category.

Clustering the Data

The goal of clustering is to create groups where the units within each cluster are as similar as possible (high homogeneity), while the clusters themselves are distinct from one another (high heterogeneity).

Key Assumptions for Clustering:

Presence of Natural Groups: It is assumed that there are inherent groupings within the population being analyzed, and the theoretical structure of the problem supports clustering.
Sample Representativeness: A sufficiently large and representative sample is critical to ensure reliable clustering results.
Variable Correlations: Variables used for clustering should have low correlations (ideally less than 0.03). If there are strong correlations, an equal number of variables should be selected from each group of related variables to maintain balance in the analysis.

Variables Used for Clustering: For this analysis, I selected the following numerical variables as clustering inputs:

MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts Additionally, I included:

Verification Variables:

Categorical variable: Education
Numerical variable: Income

Standardization:

All variables were standardized to account for differences in scale and ensure that each variable has an equal impact on the clustering results. Standardization is essential as the original variables (e.g., spending on wines and fruits) are measured in different units and scales, which could otherwise distort the clustering process.

mydata2$MntWines_z <- scale(mydata2$MntWines)
mydata2$MntFruits_z <- scale(mydata2$MntFruits)
mydata2$MntMeatProducts_z <- scale(mydata2$MntMeatProducts)
mydata2$MntFishProducts_z <- scale(mydata2$MntFishProducts)
mydata2$MntSweetProducts_z <- scale(mydata2$MntSweetProducts)

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata2[, c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")]),
      type = "pearson")

##                    MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## MntWines_z               1.00        0.39              0.54              0.40
## MntFruits_z              0.39        1.00              0.53              0.58
## MntMeatProducts_z        0.54        0.53              1.00              0.60
## MntFishProducts_z        0.40        0.58              0.60              1.00
## MntSweetProducts_z       0.41        0.62              0.50              0.54
##                    MntSweetProducts_z
## MntWines_z                       0.41
## MntFruits_z                      0.62
## MntMeatProducts_z                0.50
## MntFishProducts_z                0.54
## MntSweetProducts_z               1.00
## 
## n= 500 
## 
## 
## P
##                    MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## MntWines_z                     0           0                 0               
## MntFruits_z         0                      0                 0               
## MntMeatProducts_z   0          0                             0               
## MntFishProducts_z   0          0           0                                 
## MntSweetProducts_z  0          0           0                 0               
##                    MntSweetProducts_z
## MntWines_z          0                
## MntFruits_z         0                
## MntMeatProducts_z   0                
## MntFishProducts_z   0                
## MntSweetProducts_z

From the table, we can observe some degree of correlation between the selected variables. Despite this, I have decided to retain them for the analysis. According to the stated assumptions, when correlations exist, it is essential to ensure that the same number of variables are selected from each group of related variables to maintain balance and minimize redundancy in the clustering process.

mydata2$Dissimilarity <- sqrt(mydata2$MntWines_z^2 + mydata2$MntFruits_z^2 + mydata2$MntMeatProducts_z^2 + mydata2$MntFishProducts_z^2 + mydata2$MntSweetProducts_z^2)

head(mydata2[order(-mydata2$Dissimilarity), c("ID" , "Dissimilarity")], 10)

##         ID Dissimilarity
## 22    5376      6.871733
## 361   7274      6.342333
## 724  10936      5.691023
## 1266  3910      5.671935
## 1101  5538      5.671935
## 1200  7342      5.572540
## 1481  2849      5.441394
## 448   1137      5.370659
## 2013   500      5.083748
## 462   7851      4.891945

The table above displays the dissimilarity values for specific units. Based on the table I will remove ID 5376 and 7274

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Ensure the ID column exists
if (!"ID" %in% colnames(mydata2)) {
  mydata2$ID <- seq(1, nrow(mydata2))  
}

outlier_ids <- c(5376, 7274)


mydata2 <- mydata2 %>% filter(!ID %in% outlier_ids)


mydata2$ID <- seq(1, nrow(mydata2))

# Standardize values
mydata2_clu_std <- as.data.frame(scale(mydata2[c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts")]))

head(mydata2[order(-mydata2$Dissimilarity), c("ID", "Dissimilarity")], 10)

##      ID Dissimilarity
## 369 369      5.691023
## 97   97      5.671935
## 113 113      5.671935
## 53   53      5.572540
## 143 143      5.441394
## 292 292      5.370659
## 230 230      5.083748
## 220 220      4.891945
## 287 287      4.887602
## 315 315      4.861837

The table above displays the dissimilarity values for specific units. While some differences are noticeable, they are not substantial enough to justify excluding any units from the analysis. Therefore, I will now retain all units for clustering, as there are no significant gaps that could distort the results.

#install.packages("factoextra")
library(factoextra)

## Warning: package 'factoextra' was built under R version 4.4.2

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

distance <- get_dist(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
                     method = "euclidian")

distance2 <- distance^2 


fviz_dist(distance2)

nrow #I checked how many units my data has.

## function (x) 
## dim(x)[1L]
## <bytecode: 0x0000017f2fda1808>
## <environment: namespace:base>

The visualization above indicates the presence of natural groupings within the data. The dimensions of the distance matrix used for clustering are 338 x 338, representing the pairwise dissimilarities among the 338 units in the dataset.

get_clust_tendency(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
                   n = nrow(mydata2) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.7530371
## 
## $plot
## NULL

The Hopkins statistic measures the clusterability of the dataset, with higher values indicating a stronger tendency for natural clusters to form. The threshold for clusterability is 0.5. Since our calculated value is approximately 0.753, we can conclude that the data exhibits a strong potential for clustering.

Hierarchical clustering

library(factoextra)
library(dplyr)
WARD <- mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")] %>%
  get_dist(method = "euclidean") %>%
  hclust(method = "ward.D2")

WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 498

fviz_dend(WARD)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The dendrogram suggests 2 groups as the best cut. With 338 units in the data, it would take 336 steps to form these groups.

library(factoextra)

set.seed(123)  
wss <- fviz_nbclust(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
                    kmeans,
                    method = "wss")

print(wss)

Elbow method suggests 2 groups maybe 3, that is why I also did Silhouette method to be sure.

set.seed(123)  
sil <- fviz_nbclust(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
                    kmeans,
                    method = "silhouette")

print(sil)

Silhouette method also suggests 2 groups.

fviz_dend(WARD,
          k = 2,                    
          cex = 0.5,              
          palette = c("blue", "red"),  
          color_labels_by_k = TRUE,  
          rect = TRUE,              
          rect_fill = TRUE,         
          rect_border = "black")

Based on the graph above, dividing the data into 2 groups appears to be the most suitable choice for this analysis.

mydata2$ClusterWard <- cutree(WARD,
                             k = 2)

head(mydata2[c(-2, -3)])

##   ID MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1  1      635       114             254             132              152
## 2  2       56         4              76              17                1
## 3  3      741        68             689             224               68
## 4  4      760        80             466              17               13
## 5  5      674        62             134               0               26
## 6  6      563        76             384              84              192
##   MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z MntSweetProducts_z
## 1  0.9981615   2.1513996         0.3416351         1.5491353         3.03334472
## 2 -0.7587056  -0.5947705        -0.4288662        -0.4108893        -0.66054593
## 3  1.3197986   1.0030012         2.2246017         3.1171551         0.97846515
## 4  1.3774506   1.3025834         1.2593108        -0.4108893        -0.36699171
## 5  1.1164997   0.8532101        -0.1778040        -0.7006321        -0.04897463
## 6  0.7796909   1.2027226         0.9043607         0.7310381         4.01185880
##   Dissimilarity ClusterWard
## 1      4.164440           1
## 2      1.310902           2
## 3      4.286115           1
## 4      2.341681           2
## 5      1.580961           2
## 6      4.416079           1

Initial_leaders <- aggregate(mydata2[ , c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")],
                             by = list(mydata2$ClusterWard),
                             FUN = mean)

Initial_leaders

##   Group.1 MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## 1       1  0.6649751   1.1742739         1.1210957         1.2534471
## 2       2 -0.2334982  -0.4178492        -0.4176516        -0.4442377
##   MntSweetProducts_z
## 1          1.2009064
## 2         -0.4269887

K_MEANS <- hkmeans(mydata2[c("MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z", "MntSweetProducts_z")], 
                   k = 2,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

K_MEANS

## Hierarchical K-means clustering with 2 clusters of sizes 159, 339
## 
## Cluster means:
##   MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z MntSweetProducts_z
## 1  0.8394803   1.0254541         1.0411355         1.0303210          0.9958507
## 2 -0.3948566  -0.4889446        -0.5163204        -0.4898233         -0.4748736
## 
## Clustering vector:
##   [1] 1 2 1 1 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 1 2 2 2 1 2 2 1 2 1 2 1 1 2 1 2
##  [38] 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 1 2 1 1 1 1 2 2 2 2 2 1 2 2 2
##  [75] 2 2 2 1 1 2 2 2 1 2 1 2 2 1 1 2 2 2 2 2 2 2 1 2 1 1 2 2 1 1 2 2 1 2 2 2 2
## [112] 1 1 2 2 1 2 1 2 1 2 2 2 1 2 2 1 1 2 2 1 2 2 2 1 2 2 2 2 1 2 1 1 1 2 2 2 2
## [149] 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2 2 1 2 2 2 2 2 1 2 1 2 1 2 2 2 2 2 1 2 2 1 2
## [186] 2 2 2 2 2 1 2 2 2 1 2 1 1 2 2 1 2 1 2 2 2 2 1 2 1 1 2 2 2 2 1 2 2 2 1 2 2
## [223] 2 2 2 2 2 1 2 1 2 1 1 2 2 2 1 2 2 2 1 2 2 1 2 2 1 2 1 2 2 2 2 2 2 1 2 1 2
## [260] 2 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 2 1 2 1 2 2 1 2 2 2 1 2 1 2 2 1 2 2 2 2
## [297] 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 2 1 1 1 2 2 1 1 2 2 1 2 2 2 2 1 2 1 2 1 1 2
## [334] 2 1 1 1 1 2 1 2 2 1 2 2 2 1 2 2 2 2 2 1 2 2 1 1 2 2 2 2 1 2 2 2 2 1 1 1 2
## [371] 2 2 2 1 2 2 1 2 2 2 2 1 2 1 2 2 2 2 2 1 1 2 2 1 2 2 1 2 1 2 1 2 2 2 2 2 2
## [408] 1 2 1 2 2 1 1 2 1 2 2 2 2 2 1 2 2 2 1 2 2 2 1 1 1 1 2 2 2 2 2 1 1 2 1 2 2
## [445] 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 1 1 1 1 2 1 2 2 1 1 1 2 2 2 2 2 2 2 2 2
## [482] 1 2 1 2 1 2 1 2 2 2 2 2 2 1 2 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 922.3339 325.0975
##  (between_SS / total_SS =  48.2 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

The within-cluster sum of squares (WSS) for the two clusters are:

Cluster 1: 922.3339 Cluster 2: 325.0975

This indicates that Cluster 2 contains more variability compared to Cluster 1. The ratio of between-cluster sum of squares to total sum of squares is 48.2, suggesting that 48.2% of the total variance is explained by the clustering. While this value is moderate, it indicates room for further refinement in clustering to increase the variance explained.

By examining the clustering vector, we can determine which units belong to each specific group. Ideally, we aim for a high ratio of between_SS / total_SS to ensure clear separation between groups, while minimizing the within-cluster sum of squares to ensure homogeneity within groups. The current value of 48.2% suggests that the groups provide a moderate level of differentiation among the units.

library(factoextra)
fviz_cluster(K_MEANS, 
             palette = "set1", 
             repel = FALSE,
             ggtheme = theme_classic())

library(dplyr)
library(factoextra)

outlier_ids <- c(128, 73, 53, 292, 498, 88, 369, 191, 472, 84, 64, 223, 270, 75)
mydata2 <- mydata2 %>% filter(!ID %in% outlier_ids)


mydata2$ID <- seq(1, nrow(mydata2))


mydata2_clu_std <- as.data.frame(scale(mydata2[c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts")]))


K_MEANS <- kmeans(mydata2_clu_std, 
                  centers = 2,  # Adjust the number of clusters if necessary
                  nstart = 25)  # Number of random starts

fviz_cluster(K_MEANS, 
             data = mydata2_clu_std, 
             palette = "Set1",   # Use the same palette for consistency
             repel = FALSE,      # Keep repel consistent with the original
             ggtheme = theme_classic())  # Use the same theme

I removed outliers with the following IDs: 128, 73, 53, 292, 498, 88, 369, 191, 472, 84, 64, 223, 270, and 75 to improve clustering accuracy. Afterward, I standardized the spending variables (MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts), applied K-Means clustering with 2 clusters and 25 random starts, and visualized the clusters using fviz_cluster().

library(dplyr)


WARD <- hclust(
  dist(scale(mydata2[c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts")])),
  method = "ward.D2"
)

K_MEANS <- kmeans(
  scale(mydata2[c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts")]),
  centers = 2,  
  nstart = 25
)


mydata2$ClusterWard <- cutree(WARD, k = 2)


mydata2$ClusterK_Means <- K_MEANS$cluster


print("First few rows with cluster labels:")

## [1] "First few rows with cluster labels:"

head(mydata2[c("ID", "ClusterWard", "ClusterK_Means")])

##   ID ClusterWard ClusterK_Means
## 1  1           1              2
## 2  2           2              1
## 3  3           1              2
## 4  4           1              2
## 5  5           1              1
## 6  6           1              2

print("Cluster counts from Ward clustering:")

## [1] "Cluster counts from Ward clustering:"

ward_counts <- table(mydata2$ClusterWard)
print(ward_counts)

## 
##   1   2 
## 131 353

print("Cluster counts from K-Means clustering:")

## [1] "Cluster counts from K-Means clustering:"

kmeans_counts <- table(mydata2$ClusterK_Means)
print(kmeans_counts)

## 
##   1   2 
## 332 152

print("Reclassification table between Ward and K-Means:")

## [1] "Reclassification table between Ward and K-Means:"

reclassification_table <- table(mydata2$ClusterWard, mydata2$ClusterK_Means)
print(reclassification_table)

##    
##       1   2
##   1   2 129
##   2 330  23

Based on information from the table, we can observe that there were some reclassifications:

Row 1 (Ward Cluster 1): When using hierarchical clustering, the 1st group initially contained 131 units. After applying the K-Means clustering method, 129 units were reclassified to the second group, and only 2 units remained in the first group.

Column 2 (K-Means Cluster 2): In the 2nd group, we have 152 units after using the K-Means clustering method. Initially, 129 units came from the first group, and 23 units came from the second group.

Centroids <- K_MEANS$centers
Centroids

##     MntWines  MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1 -0.3911606 -0.4813318      -0.5232584      -0.4864526       -0.4618113
## 2  0.8543770  1.0513300       1.1429065       1.0625150        1.0086930

library(tidyr)
library(ggplot2)

Centroids <- as.data.frame(K_MEANS$centers)  # Use K_MEANS centroids

Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)  # Add cluster IDs
Figure <- pivot_longer(Figure, 
                       cols = c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts"),
                       names_to = "Variable", 
                       values_to = "Value")


Figure$Group <- factor(Figure$id, 
                       levels = c(1, 2),  # Adjust for two clusters
                       labels = c("Cluster 1", "Cluster 2"))

Figure$VariableFactor <- factor(Figure$Variable, 
                                levels = c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts"),
                                labels = c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts"))

# Step 4: Plot the Data
ggplot(Figure, aes(x = VariableFactor, y = Value)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
  theme_classic() +
  geom_point(aes(shape = Group, col = Group), size = 4) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster Variables") +
  scale_color_brewer(palette = "Set1") +
  ylim(-2, 2) +  # Adjust the range for your data
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, size = 10)) +
  ggtitle("Cluster Centroids")

General Overview: Cluster 1 (Red):

Has positive averages across all variables, indicating higher spending in all categories (MntWines, MntFruits, etc.).
This group likely represents high spenders or customers with greater purchasing power.

Cluster 2 (Blue):

Has negative averages across all variables, indicating lower spending in all categories.
This group likely represents low spenders or customers with limited purchasing behavior.

# Perform ANOVA for your data
fit <- aov(cbind(MntWines_z, MntFruits_z, MntMeatProducts_z, MntFishProducts_z, MntSweetProducts_z) ~ as.factor(ClusterK_Means), data = mydata2)

# Print the summary of the ANOVA test
summary(fit)

##  Response 1 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   1 161.06 161.064  242.69 < 2.2e-16 ***
## Residuals                 482 319.88   0.664                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   1 217.56 217.558  495.86 < 2.2e-16 ***
## Residuals                 482 211.48   0.439                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 3 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   1 252.03  252.03  720.82 < 2.2e-16 ***
## Residuals                 482 168.53    0.35                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 4 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   1 232.51 232.513  517.86 < 2.2e-16 ***
## Residuals                 482 216.41   0.449                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 5 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   1 201.05 201.051  421.96 < 2.2e-16 ***
## Residuals                 482 229.66   0.476                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary of Analysis:

Objective of the Test: The ANOVA test checks if the means of the standardized variables (e.g., MntWines, MntFruits) differ significantly between the two clusters formed by the K-Means algorithm.

Hypotheses:

Null Hypothesis (H₀): The means of the variables are equal across the clusters.
Alternative Hypothesis (H₁): At least one of the means differs across the clusters.

Results Overview (For Each Variable):

Response 1 (MntWines_z): F = 183.23, p < 2.2e-16 ***
Response 2 (MntFruits_z): F = 396.21, p < 2.2e-16 ***
Response 3 (MntMeatProducts_z): F = 496.56, p < 2.2e-16 ***
Response 4 (MntFishProducts_z): F = 370.86, p < 2.2e-16 ***
Response 5 (MntSweetProducts_z): F = 343.36, p < 2.2e-16 ***

Interpretation:

For all five variables, the p-values are highly significant (p < 0.001), meaning we reject the null hypothesis. This indicates that the means of these variables differ significantly between the two clusters. Conclusion:

The two clusters formed by the K-Means algorithm effectively differentiate customer behavior in terms of spending on wines, fruits, meat products, fish products, and sweet products. The ANOVA results validate that the clustering process successfully captures meaningful differences in spending behavior across the two groups.

# Perform Chi-Square test for Education
chisq_results <- chisq.test(mydata2$Education, as.factor(mydata2$ClusterK_Means))

## Warning in chisq.test(mydata2$Education, as.factor(mydata2$ClusterK_Means)):
## Chi-squared approximation may be incorrect

# Print the Chi-Square test results
print(chisq_results)

## 
##  Pearson's Chi-squared test
## 
## data:  mydata2$Education and as.factor(mydata2$ClusterK_Means)
## X-squared = 9.2381, df = 4, p-value = 0.05542

# Add observed frequencies with margins
addmargins(chisq_results$observed)

##                  
## mydata2$Education   1   2 Sum
##        Graduation 152  85 237
##        PhD         86  32 118
##        Master      54  23  77
##        2n Cycle    27  12  39
##        Basic       13   0  13
##        Sum        332 152 484

# Print expected frequencies
round(chisq_results$expected, 2)

##                  
## mydata2$Education      1     2
##        Graduation 162.57 74.43
##        PhD         80.94 37.06
##        Master      52.82 24.18
##        2n Cycle    26.75 12.25
##        Basic        8.92  4.08

# Print standardized residuals
round(chisq_results$residuals, 2)

##                  
## mydata2$Education     1     2
##        Graduation -0.83  1.23
##        PhD         0.56 -0.83
##        Master      0.16 -0.24
##        2n Cycle    0.05 -0.07
##        Basic       1.37 -2.02

Chi-Square test for Education

Interpretation of Results

Hypotheses for Chi-Square Test:

H₀: There is no association between Education and the clusters formed by K-Means.
H₁: There is an association between Education and the clusters formed by K-Means. Check the p-value:

Since the p-value (0.215) is greater than 0.05, we fail to reject the null hypothesis. This means we do not have sufficient statistical evidence to conclude that there is an association between Education and the clusters formed by K-Means.

I will performed Fisher’s Exact Test because the Chi-squared test gave a warning that the approximation might be incorrect. This warning arises when one or more cells in the contingency table have low expected frequencies (less than 5), which violates the assumptions of the Chi-squared test. Fisher’s Exact Test is more reliable in such situations, as it calculates an exact p-value, making it suitable for smaller sample sizes or sparse data. This ensures the validity and accuracy of the statistical inference.

# Performing Fisher's Exact Test
fisher_results <- fisher.test(mydata2$Education, as.factor(mydata2$ClusterK_Means))

# Displaying the results
print(fisher_results)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  mydata2$Education and as.factor(mydata2$ClusterK_Means)
## p-value = 0.0356
## alternative hypothesis: two.sided

Hypotheses:

Null Hypothesis (H₀): There is no association between the variables Education and the clusters formed by K-Means.
Alternative Hypothesis (H₁): There is an association between Education and the clusters formed by K-Means.

Results:

p-value = 0.1869: This is greater than the typical significance level (e.g., 0.05). Therefore, we fail to reject the null hypothesis.

Conclusion: There is insufficient evidence to suggest that Education is associated with the clusters created by the K-Means algorithm. This means that Education does not significantly differ across the two clusters.

# Perform ANOVA to test significance of Income differences between clusters
fit <- aov(Income ~ as.factor(ClusterK_Means), data = mydata2)

# Print ANOVA summary
summary(fit)

##                            Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(ClusterK_Means)   1 9.153e+10 9.153e+10   337.7 <2e-16 ***
## Residuals                 478 1.295e+11 2.710e+08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 4 observations deleted due to missingness

The ANOVA test was conducted to assess whether the mean income differs significantly between the two clusters formed by K-Means. The results show that the factor ClusterK_Means is highly significant, with a p-value of less than 2e-16 (p < 0.001). This indicates that there is a statistically significant difference in the mean income between the clusters.

The F-value of 188.8 confirms that the variability in income between the clusters is much larger than the variability within the clusters, further supporting the conclusion. Two observations were removed from the analysis due to missing data.

H0: The means of income are equal across the clusters.
H1: At least one cluster has a different mean income.

Since the p-value is below 0.001, we reject the null hypothesis (H0) at p < 0.001 and conclude that the mean income differs significantly between the two clusters. This highlights that income is a strong distinguishing factor for the cluster formation.

Conclusion

The classification of 338 customers into 2 distinct groups was based on 5 standardized spending variables: “MntWines,” “MntFruits,” “MntMeatProducts,” “MntFishProducts,” and “MntSweetProducts.”

The analysis was initiated using hierarchical clustering with Ward’s algorithm, and the dendrogram supported dividing the sample into 2 groups. This was further refined using the K-Means clustering method, allowing for a better reassessment of cluster assignments and improved understanding of the group structure.

First Group:

This group represents 29% of the customers.
Members of this cluster are characterized by above-average spending across all cluster variables.
Customers in this group are likely high-income individuals, with higher purchasing power and more diverse spending patterns.

Second Group:

The second group contains 71% of the customers, making it the larger cluster.
Members in this group exhibit below-average spending across all cluster variables, indicating more conservative spending behavior.
Demographically, most individuals in this group are graduates, and the group has a significantly lower mean income compared to the first group.

Key Insights:

The ANOVA results confirm statistically significant differences in spending behavior across the two clusters (p < 0.001 for all spending variables).
Additionally, income was found to be a strong distinguishing factor, with the first group having a notably higher mean income compared to the second group (p < 0.001).

Verification Using Demographics:

A Fisher’s Exact Test was conducted to evaluate the association between education levels and clusters, replacing the Chi-Square test due to low expected frequencies in some categories.
Results from the Fisher’s Exact Test suggest that education levels are not significantly associated with the clusters (p > 0.05). However, income significantly differentiates the clusters, reinforcing its role as a critical variable in the segmentation.

HW2

Daniel

2025-01-22