h1.title {color: salmon;font-family: "Copperplate", fantasy; font-size: 48px;}
.author {color: lightsalmon ;font-family: "Copperplate", fantasy; font-size: 24px;}
.date {color: darksalmon ;font-family: "Copperplate", fantasy; font-size: 18px;}

h1.title:hover {
    text-shadow: 2px 2px 8px rgba(255, 20, 147, 0.8);
}

.columns {display: flex;}
h1 {color: #FF91A4;font-family: 'Roboto', sans-serif; font-size: 36px;}

Introduction of Dataset

mydata <- read.csv("C:/Users/Pino/Desktop/IMB/MVA/customer_segmentation_data.csv")
colnames(mydata) [1] <- "ID"
colnames(mydata) [2] <- "Age"
colnames(mydata) [3] <- "Gender"
colnames(mydata) [4] <- "Income"
colnames(mydata) [5] <- "Spending Score"
colnames(mydata) [6] <- "Membership Years"
colnames(mydata) [7] <- "Purchase Frequency"
colnames(mydata) [8] <- "Preferred Category"
colnames(mydata) [9] <- "Last Purchase Amount"


head(mydata)

##   ID Age Gender Income Spending Score Membership Years Purchase Frequency
## 1  1  38 Female  99342             90                3                 24
## 2  2  21 Female  78852             60                2                 42
## 3  3  60 Female 126573             30                2                 28
## 4  4  40  Other  47099             74                9                  5
## 5  5  65 Female 140621             21                3                 25
## 6  6  31  Other  57305             24                3                 30
##   Preferred Category Last Purchase Amount
## 1          Groceries               113.53
## 2             Sports                41.93
## 3           Clothing               424.36
## 4      Home & Garden               991.93
## 5        Electronics               347.08
## 6      Home & Garden                86.85

This dataset contains simulated customer data that can be used for segmentation analysis. It includes demographic and behavioral information about customers, which can help in identifying distinct segments within the customer base. The data is taken from Kaggle website.

Sample size: 1000

Unit of observation: A single customer

Explanation of variables:

ID: Unique identifier for each customer. Age: Age of the customer. Gender: Gender of the customer (Male, Female, Other). Income: Annual income of the customer (in USD). Spending Score: Spending score (1-100), indicating the customer’s spending behavior and loyalty. Membership Years: Number of years the customer has been a member. Purchase Frequency: Number of purchases made by the customer in the last year. Preferred Category: Preferred shopping category (Electronics, Clothing, Groceries, Home & Garden, Sports). Last Purchase Amount: Amount spent by the customer on their last purchase (in USD).

I renamed some variables so the name is more “tidy”.

The dataset includes numerical variables so we can approach it with clustering analysis

Descriptive Statistics

summary(mydata[c(4:9)])

##      Income       Spending Score   Membership Years Purchase Frequency
##  Min.   : 30004   Min.   :  1.00   Min.   : 1.000   Min.   : 1.0      
##  1st Qu.: 57912   1st Qu.: 26.00   1st Qu.: 3.000   1st Qu.:15.0      
##  Median : 87846   Median : 50.00   Median : 5.000   Median :27.0      
##  Mean   : 88501   Mean   : 50.69   Mean   : 5.469   Mean   :26.6      
##  3rd Qu.:116110   3rd Qu.: 76.00   3rd Qu.: 8.000   3rd Qu.:39.0      
##  Max.   :149973   Max.   :100.00   Max.   :10.000   Max.   :50.0      
##  Preferred Category Last Purchase Amount
##  Length:1000        Min.   : 10.4       
##  Class :character   1st Qu.:218.8       
##  Mode  :character   Median :491.6       
##                     Mean   :492.3       
##                     3rd Qu.:747.2       
##                     Max.   :999.7

For the variable “Income”, 1st Quartile (Q1) is 57912, meaning 25% of customers have an income less than $57,912.

For the variable “Spending Score”, min equals 1, indicating that the lowest spending score out of the whole sample is 1.

For the variable “Membership Years”, mean is 5.469, indicating the average membership length of the sample is 5.469 years.

Research Question

My research question is: Based on variables “Spending Score” and “Membership Years”, how can customer segments be defined?

Standardizing Cluster Variables

mydata_clu_std <- as.data.frame(scale(mydata[c(5,6)]))

Here I saved the chosen cluster, which were standardized into a new data frame.

mydata$Dissimilarity = sqrt(mydata_clu_std$`Spending Score`^2 + mydata_clu_std$`Membership Years`^2)

head(mydata[order(-mydata$Dissimilarity), c("ID", "Dissimilarity")], 10)

##      ID Dissimilarity
## 900 900      2.337053
## 987 987      2.337053
## 353 353      2.327687
## 913 913      2.327687
## 384 384      2.322369
## 467 467      2.322369
## 377 377      2.312943
## 425 425      2.312943
## 883 883      2.312943
## 943 943      2.312943

First I created a new variable “Dissimilarity” so I can esentially identify potential outliers. By analyzing the table above, I didn’t identify any potential outliers.

Dissimilarity Matrix

library(factoextra)

## Warning: package 'factoextra' was built under R version 4.4.2

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Distances <- get_dist(mydata_clu_std, 
                      method = "euclidian")

fviz_dist(Distances, 
          gradient = list(low = "darkred",
                          mid = "grey95",
                          high = "white"))

Dissimilarity matrix with euclidian distances helps us se clusters forming, however from this matrix it is difficult to clearly see the number of clusters.

Hopkins Statistics

library(factoextra) 
get_clust_tendency(mydata_clu_std, 
                   n = nrow(mydata_clu_std) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.7005131
## 
## $plot
## NULL

Hopkins Statistics helps us decide if our data is clusterable. With it being 0.7 (over 0.5, which is the threshold), I know that the data is clusterable and can procceed with the analysis.

Finding the Optimal Number of Clusters

library(factoextra)
library(NbClust)

fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

The biggest break we see is at number of clusters 4, indicating by elbow method that the optimal number of clusters is 4.

fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette analysis")

The biggest average silhouette width is at the number of clusters being 4.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(factoextra)
WARD <- mydata_clu_std %>%
  get_dist(method = "euclidean") %>%  
  hclust(method = "ward.D2")          

WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 1000

library(factoextra)
fviz_dend(WARD)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

fviz_dend(WARD,
          k = 4,
          cex = 0.5,
          palette = "jama",
          color_labels_by_k = TRUE,
          rect = TRUE)

By Dendrogram method, we should cut at 4 clusters.

library(NbClust)
NbClust(mydata_clu_std, 
        distance = "euclidean", 
        min.nc = 2, max.nc = 10,
        method = "kmeans", 
        index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 5 proposed 2 as the best number of clusters 
## * 4 proposed 3 as the best number of clusters 
## * 11 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  4 
##  
##  
## *******************************************************************

## $All.index
##        KL        CH Hartigan     CCC    Scott   Marriot    TrCovW    TraceW
## 2  3.4857  597.2601 590.4964 -1.3333 1380.657 1002926.1 514409.19 1249.9554
## 3  0.3452  769.7994 618.3263 -8.6510 1880.087 1369467.8 179137.24  785.3059
## 4  3.3729 1036.5473 174.3501  1.0539 2835.248  936716.1  53528.49  484.7008
## 5  3.5974  956.1230 219.2383 -1.6623 3182.254 1034487.6  40122.32  412.4937
## 6  2.8376  976.3016 158.7808 -1.0112 3714.208  875110.8  32283.43  338.0154
## 7  0.1061  969.0337 209.1246 -1.2870 3921.948  967690.1  22881.07  291.4581
## 8  1.7117 1034.3559  86.6481  0.7921 4248.360  911930.0  13689.34  240.7553
## 9  0.7253  993.9438  61.5396 -0.5372 4432.118  960420.5  12946.35  221.4154
## 10 0.6997  944.2539 172.2686 -2.2197 4596.088 1006386.4  12863.53  208.4697
##    Friedman  Rubin Cindex     DB Silhouette   Duda  Pseudot2   Beale Ratkowsky
## 2    2.9789 1.5985 0.3775 1.2922     0.3518 0.7010  277.2703  0.4257    0.3116
## 3    3.1446 2.5442 0.3138 0.9426     0.3845 0.8963   58.5405  0.1154    0.4496
## 4    6.2666 4.1221 0.4049 0.8220     0.4219 1.5158 -152.7898 -0.3390    0.4351
## 5    7.9550 4.8437 0.3957 0.9115     0.3956 1.1321  -45.2687 -0.1162    0.3983
## 6   11.8910 5.9110 0.3866 1.0279     0.3817 1.5459 -117.2373 -0.3513    0.3718
## 7   12.7435 6.8552 0.3699 0.9606     0.3746 1.9512 -144.7860 -0.4841    0.3492
## 8   14.8769 8.2989 0.3692 0.8462     0.3933 1.5958  -94.0822 -0.3706    0.3316
## 9   16.6548 9.0238 0.3584 0.8791     0.3844 1.4562  -73.6227 -0.3110    0.3143
## 10  18.6874 9.5841 0.3500 0.9506     0.3670 1.3924  -57.7739 -0.2796    0.2992
##        Ball Ptbiserial   Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  624.9777     0.4764 0.2082  0.6292 0.0948 0.0007  2.8815 1.0326 1.8884
## 3  261.7686     0.5731 0.2112  1.0165 0.0097 0.0008  1.8357 0.8097 1.4203
## 4  121.1752     0.6131 0.8020  1.2673 0.0157 0.0009  1.5698 0.6497 1.0042
## 5   82.4987     0.5798 0.4150  1.5382 0.0165 0.0010  1.9411 0.5934 0.6890
## 6   56.3359     0.5594 0.5052  1.7911 0.0181 0.0011  2.2842 0.5354 0.4975
## 7   41.6369     0.5394 0.3311  1.9889 0.0185 0.0012  2.2027 0.4999 0.4816
## 8   30.0944     0.5246 0.6473  2.1537 0.0203 0.0012  2.1541 0.4569 0.6731
## 9   24.6017     0.5072 0.7188  2.3240 0.0204 0.0012  2.6925 0.4347 0.3801
## 10  20.8470     0.4815 0.2633  2.6012 0.0208 0.0013  2.6240 0.4207 0.2204
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.5733           483.8609       0.6534
## 3          0.5530           408.9284       0.8910
## 4          0.5301           398.0160       1.0000
## 5          0.5275           347.6084       1.0000
## 6          0.5036           327.2609       1.0000
## 7          0.4775           325.0467       1.0000
## 8          0.4716           282.3289       1.0000
## 9          0.4701           264.9142       1.0000
## 10         0.4644           236.4154       1.0000
## 
## $Best.nc
##                     KL       CH Hartigan    CCC    Scott  Marriot   TrCovW
## Number_clusters 5.0000    4.000   4.0000 4.0000   4.0000      4.0      3.0
## Value_Index     3.5974 1036.547 443.9762 1.0539 955.1614 530523.3 335271.9
##                   TraceW Friedman   Rubin Cindex    DB Silhouette  Duda
## Number_clusters   4.0000   6.0000  4.0000 3.0000 4.000     4.0000 2.000
## Value_Index     228.3979   3.9361 -0.8563 0.3138 0.822     0.4219 0.701
##                 PseudoT2  Beale Ratkowsky     Ball PtBiserial Frey McClain
## Number_clusters   2.0000 2.0000    3.0000   3.0000     4.0000    1  2.0000
## Value_Index     277.2703 0.4257    0.4496 363.2091     0.6131   NA  0.6292
##                   Dunn Hubert SDindex Dindex    SDbw
## Number_clusters 2.0000      0  4.0000      0 10.0000
## Value_Index     0.0948      0  1.5698      0  0.2204
## 
## $Best.partition
##    [1] 3 3 1 2 1 1 3 2 4 2 4 3 3 1 1 4 1 1 2 3 1 2 3 2 3 2 4 4 2 1 4 3 4 3 2 1 3
##   [38] 2 1 1 1 3 4 3 1 1 4 2 3 3 2 3 2 2 3 1 4 1 4 1 4 2 4 1 1 3 3 2 3 4 3 1 1 2
##   [75] 2 2 1 4 1 3 2 3 2 2 1 4 4 4 4 3 1 2 2 4 2 4 4 2 1 3 1 2 4 1 1 4 2 2 2 2 1
##  [112] 2 4 1 1 2 1 4 2 1 1 2 4 2 3 2 3 2 3 4 2 2 2 2 2 4 4 3 4 3 2 1 2 1 4 1 4 3
##  [149] 3 4 3 1 2 1 3 3 3 1 3 1 1 3 3 4 1 4 2 2 3 2 1 1 2 3 3 3 1 3 2 1 3 4 3 4 2
##  [186] 2 4 1 3 3 3 4 4 2 2 4 2 2 4 4 4 4 4 2 2 1 2 3 1 4 2 1 4 1 3 1 1 4 2 2 4 1
##  [223] 4 1 4 4 2 4 3 1 4 4 4 4 1 2 1 1 4 4 4 3 1 2 1 4 4 2 2 3 4 3 4 2 2 2 1 3 3
##  [260] 2 2 1 2 3 2 2 4 1 2 2 1 1 2 3 3 1 1 4 3 3 1 4 4 1 1 3 3 3 2 3 4 3 3 1 2 1
##  [297] 1 1 4 1 4 1 4 2 4 3 3 1 1 4 1 4 1 3 4 1 3 1 2 1 1 1 1 4 3 3 4 2 3 3 2 3 1
##  [334] 4 3 2 3 2 2 4 4 3 2 1 4 4 1 2 1 4 1 3 2 3 3 2 1 1 1 2 3 1 1 2 1 2 3 2 1 2
##  [371] 3 3 3 2 1 1 3 3 2 3 2 1 1 1 2 2 4 1 3 3 3 3 2 2 4 1 4 4 4 3 3 2 4 4 1 4 3
##  [408] 2 2 1 3 4 4 1 3 3 4 4 1 3 3 4 4 1 3 2 4 1 4 3 3 3 2 2 3 3 1 1 3 2 4 1 2 1
##  [445] 3 4 2 2 1 1 3 4 2 1 2 1 1 2 3 3 3 3 4 2 4 4 1 3 2 2 2 1 1 3 4 1 1 1 4 3 4
##  [482] 3 2 1 2 3 2 2 1 1 3 2 3 1 1 3 4 3 2 4 3 4 2 4 2 3 2 2 3 1 2 4 1 1 4 2 1 3
##  [519] 3 4 3 3 1 4 1 3 1 3 1 3 4 4 3 1 1 1 2 3 4 2 1 3 1 1 3 4 4 4 4 3 3 3 4 3 3
##  [556] 3 3 3 4 3 3 4 1 1 2 3 2 2 4 2 1 2 4 4 3 3 3 2 4 3 2 1 3 4 1 3 3 3 4 1 1 1
##  [593] 4 4 2 1 2 2 3 2 4 3 1 1 2 1 3 1 1 4 3 4 3 1 3 3 4 1 1 2 1 4 3 4 1 4 1 3 2
##  [630] 3 4 1 4 4 1 1 1 3 2 4 1 4 4 3 1 2 1 1 3 4 1 4 4 3 2 4 4 3 4 1 1 4 3 1 1 2
##  [667] 2 4 2 4 1 2 1 1 4 1 2 3 4 2 2 4 2 4 2 3 1 1 1 1 2 4 3 3 2 4 2 1 3 1 4 2 3
##  [704] 1 4 3 1 2 4 3 1 4 4 3 2 4 2 3 4 3 2 3 1 1 4 4 4 1 2 2 1 3 4 4 4 2 3 3 1 1
##  [741] 1 4 3 4 2 2 4 3 1 2 2 1 1 2 4 2 3 4 1 2 1 4 1 3 3 3 4 3 3 4 4 2 4 1 4 1 2
##  [778] 2 3 4 1 4 1 3 4 3 2 2 3 3 2 4 1 3 1 4 3 1 1 1 4 3 3 4 4 2 1 1 4 4 1 2 4 4
##  [815] 2 2 3 1 4 2 3 4 2 3 1 4 2 2 4 1 1 1 2 2 3 1 4 3 1 2 2 4 1 2 1 2 1 1 2 1 1
##  [852] 3 3 2 1 3 2 3 4 4 1 2 2 4 1 3 2 3 1 4 2 4 3 3 2 2 1 3 3 3 3 4 3 3 2 2 1 1
##  [889] 2 3 4 1 3 4 1 1 2 3 1 4 4 1 4 3 2 3 1 4 4 3 3 4 2 2 2 4 3 2 3 2 2 3 3 3 4
##  [926] 4 4 4 1 4 4 4 3 3 1 1 3 1 3 4 2 2 3 3 3 2 4 1 1 1 4 2 4 1 2 3 2 4 2 2 3 2
##  [963] 4 3 4 2 2 1 2 1 1 3 4 2 1 4 3 4 2 4 1 2 2 4 3 4 4 3 3 1 1 3 2 1 1 2 2 1 2
## [1000] 1

Elbow method, Silhouette analysis, Dendrogram and other indices suggest I should create 4 clusters.

K-means Clustering Method

Clustering <- kmeans(mydata_clu_std, 
                     centers = 4, 
                     nstart = 25) 

Clustering

## K-means clustering with 4 clusters of sizes 243, 239, 265, 253
## 
## Cluster means:
##   Spending Score Membership Years
## 1     -0.8665949        0.8661137
## 2      0.9145986        0.9273127
## 3     -0.8547402       -0.8090784
## 4      0.8636349       -0.8604253
## 
## Clustering vector:
##    [1] 4 4 3 2 3 3 4 2 1 2 1 4 4 3 3 1 3 3 2 4 3 2 4 2 4 2 1 1 2 3 1 4 1 4 2 3 4
##   [38] 2 3 3 3 4 1 4 3 3 1 2 4 4 2 4 2 2 4 3 1 3 1 3 1 2 1 3 3 4 4 2 4 1 4 3 3 2
##   [75] 2 2 3 1 3 4 2 4 2 2 3 1 1 1 1 4 3 2 2 1 2 1 1 2 3 4 3 2 1 3 3 1 2 2 2 2 3
##  [112] 2 1 3 3 2 3 1 2 3 3 2 1 2 4 2 4 2 4 1 2 2 2 2 2 1 1 4 1 4 2 3 2 3 1 3 1 4
##  [149] 4 1 4 3 2 3 4 4 4 3 4 3 3 4 4 1 3 1 2 2 4 2 3 3 2 4 4 4 3 4 2 3 4 1 4 1 2
##  [186] 2 1 3 4 4 4 1 1 2 2 1 2 2 1 1 1 1 1 2 2 3 2 4 3 1 2 3 1 3 4 3 3 1 2 2 1 3
##  [223] 1 3 1 1 2 1 4 3 1 1 1 1 3 2 3 3 1 1 1 4 3 2 3 1 1 2 2 4 1 4 1 2 2 2 3 4 4
##  [260] 2 2 3 2 4 2 2 1 3 2 2 3 3 2 4 4 3 3 1 4 4 3 1 1 3 3 4 4 4 2 4 1 4 4 3 2 3
##  [297] 3 3 1 3 1 3 1 2 1 4 4 3 3 1 3 1 3 4 1 3 4 3 2 3 3 3 3 1 4 4 1 2 4 4 2 4 3
##  [334] 1 4 2 4 2 2 1 1 4 2 3 1 1 3 2 3 1 3 4 2 4 4 2 3 3 3 2 4 3 3 2 3 2 4 2 3 2
##  [371] 4 4 4 2 3 3 4 4 2 4 2 3 3 3 2 2 1 3 4 4 4 4 2 2 1 3 1 1 1 4 4 2 1 1 3 1 4
##  [408] 2 2 3 4 1 1 3 4 4 1 1 3 4 4 1 1 3 4 2 1 3 1 4 4 4 2 2 4 4 3 3 4 2 1 3 2 3
##  [445] 4 1 2 2 3 3 4 1 2 3 2 3 3 2 4 4 4 4 1 2 1 1 3 4 2 2 2 3 3 4 1 3 3 3 1 4 1
##  [482] 4 2 3 2 4 2 2 3 3 4 2 4 3 3 4 1 4 2 1 4 1 2 1 2 4 2 2 4 3 2 1 3 3 1 2 3 4
##  [519] 4 1 4 4 3 1 3 4 3 4 3 4 1 1 4 3 3 3 2 4 1 2 3 4 3 3 4 1 1 1 1 4 4 4 1 4 4
##  [556] 4 4 4 1 4 4 1 3 3 2 4 2 2 1 2 3 2 1 1 4 4 4 2 1 4 2 3 4 1 3 4 4 4 1 3 3 3
##  [593] 1 1 2 3 2 2 4 2 1 4 3 3 2 3 4 3 3 1 4 1 4 3 4 4 1 3 3 2 3 1 4 1 3 1 3 4 2
##  [630] 4 1 3 1 1 3 3 3 4 2 1 3 1 1 4 3 2 3 3 4 1 3 1 1 4 2 1 1 4 1 3 3 1 4 3 3 2
##  [667] 2 1 2 1 3 2 3 3 1 3 2 4 1 2 2 1 2 1 2 4 3 3 3 3 2 1 4 4 2 1 2 3 4 3 1 2 4
##  [704] 3 1 4 3 2 1 4 3 1 1 4 2 1 2 4 1 4 2 4 3 3 1 1 1 3 2 2 3 4 1 1 1 2 4 4 3 3
##  [741] 3 1 4 1 2 2 1 4 3 2 2 3 3 2 1 2 4 1 3 2 3 1 3 4 4 4 1 4 4 1 1 2 1 3 1 3 2
##  [778] 2 4 1 3 1 3 4 1 4 2 2 4 4 2 1 3 4 3 1 4 3 3 3 1 4 4 1 1 2 3 3 1 1 3 2 1 1
##  [815] 2 2 4 3 1 2 4 1 2 4 3 1 2 2 1 3 3 3 2 2 4 3 1 4 3 2 2 1 3 2 3 2 3 3 2 3 3
##  [852] 4 4 2 3 4 2 4 1 1 3 2 2 1 3 4 2 4 3 1 2 1 4 4 2 2 3 4 4 4 4 1 4 4 2 2 3 3
##  [889] 2 4 1 3 4 1 3 3 2 4 3 1 1 3 1 4 2 4 3 1 1 4 4 1 2 2 2 1 4 2 4 2 2 4 4 4 1
##  [926] 1 1 1 3 1 1 1 4 4 3 3 4 3 4 1 2 2 4 4 4 2 1 3 3 3 1 2 1 3 2 4 2 1 2 2 4 2
##  [963] 1 4 1 2 2 3 2 3 3 4 1 2 3 1 4 1 2 1 3 2 2 1 4 1 1 4 4 3 3 4 2 3 3 2 2 3 2
## [1000] 3
## 
## Within cluster sum of squares by cluster:
## [1] 125.1556 106.9615 131.4482 121.1354
##  (between_SS / total_SS =  75.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

I created 4 clusters with the sizes of each being as follows: first one includes 243 units, second one 239, third one 265 and lastly fourth one 253 units.

Biggest variability is seen in cluster 3, with it having the within sum of squares of 131.4482, indicating that it is the most heterogeneous cluster.

The total within sum of squares is 484.7007, the between sum of squares is 1509.95239, making the total sum of squares 1994,65309.

The ratio between the between sum of squares and the total sum of squares is 75.7%, which means that 75.7% of total variability in the dataset is explained by the differences between the clusters, and the remaining 24.3% of total variability is the variability within the clusters. 75.7% is essentially the biggest ratio of final leaders with the K-means method.

Cluster Plot

library(factoextra)
library(ggplot2)

fviz_cluster(Clustering,
  palette = "Dark2",   
  repel = TRUE,       
  ggtheme = theme_minimal(),  
  data = mydata_clu_std,
  geom = "point"       
) +
  scale_shape_manual(values = c(15, 16, 17, 18)) + 
  scale_size_manual(values = c(3, 3, 3, 3)) +       
  theme(legend.position = "bottom")

Averages <- Clustering$centers
Averages

##   Spending Score Membership Years
## 1     -0.8665949        0.8661137
## 2      0.9145986        0.9273127
## 3     -0.8547402       -0.8090784
## 4      0.8636349       -0.8604253

Figure <- as.data.frame(Averages)
Figure$id <- 1:nrow(Figure)

Cluster Profile Plot

library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Spending Score", "Membership Years"))

Figure$Group <- factor(Figure$id, 
                       levels = c(1, 2, 3, 4), 
                       labels = c("1", "2", "3", "4"))

Figure$NameF <- factor(Figure$name, 
                       levels = c("Spending Score", "Membership Years"), 
                       labels = c("Spending Score", "Membership Years"))


library(ggplot2)
ggplot(Figure, aes(x = name, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 5, alpha = 0.4) +
  geom_line(aes(group = id), linewidth = 1.5) +
  ylab("Averages") +
  xlab("Cluster variables")+
  ylim(-2.5, 2.5) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 12))

Based on the profiles of the clusters we can conclude that: Cluster 1 represents loyal customers (above-average membership duration) who also have above-average spending scores. Cluster 2 represents long-term customers (significantly above-average membership years) who are conservative spenders. Cluster 3 represents new or less engaged customers (very low membership duration) who also exhibit low spending. Cluster 4 represents newer customers or opportunistic high spenders (slightly lower membership years) who spend significantly above average.

mydata$Group <- Clustering$cluster
fit <- aov(cbind(`Spending Score` , `Membership Years`) ~ Group, data = mydata)
summary(fit)

##  Response Spending Score :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Group         1 119764  119764  166.51 < 2.2e-16 ***
## Residuals   998 717800     719                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Membership Years :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Group         1 4862.2  4862.2  1477.2 < 2.2e-16 ***
## Residuals   998 3284.8     3.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I conducted an ANOVA analysis to determine whether the clustering variables (“Spending Score” and “Membership Years”) differentiate between the groups created by clustering. The purpose of this analysis was to validate the clustering results by checking if the group means of these variables are significantly different.

For “Spending Score”, the F-value is 166.51, with p < 0.001, indicating that the mean of “Spending Score” differs significantly across groups. For “Membership Years”, the F-value is 1477.2, with p < 0.001, confirming that the mean “Membership Years” also differs strongly between groups.

Criterion Validity

chi_square <- chisq.test(mydata$`Preferred Category`, as.factor(mydata$Group))
chi_square

## 
##  Pearson's Chi-squared test
## 
## data:  mydata$`Preferred Category` and as.factor(mydata$Group)
## X-squared = 11.648, df = 12, p-value = 0.4744

H₀ : There is no association between the Preferred Category and classification of clusters in 4 groups. H₁ : There is association between the Preferred Category and classification of clusters in 4 groups. With p-value being greater than 0.05, I can’t reject the null hypothesis meaning I didn’t find enough information to say that there is association.

library(ggplot2)

ggplot(mydata, aes(x = as.factor(Group), fill = `Preferred Category`)) +
  geom_bar(position = "fill") +  # 'fill' scales the bars to proportions
  scale_y_continuous(labels = scales::percent_format()) +  
  labs(
    title = "Distribution of Preferred Categories Across Groups",
    x = "Group",
    y = "Proportion",
    fill = "Preferred Category"
  ) +
  theme_minimal()

I created a stacked bar chart, to visually represent the distribution of Preferred categories across groups.

addmargins(chi_square$observed)

##                            as.factor(mydata$Group)
## mydata$`Preferred Category`    1    2    3    4  Sum
##               Clothing        44   39   36   51  170
##               Electronics     59   54   55   47  215
##               Groceries       47   48   49   55  199
##               Home & Garden   41   48   68   49  206
##               Sports          52   50   57   51  210
##               Sum            243  239  265  253 1000

addmargins(round(chi_square$expected, 2))

##                            as.factor(mydata$Group)
## mydata$`Preferred Category`      1      2      3      4     Sum
##               Clothing       41.31  40.63  45.05  43.01  170.00
##               Electronics    52.24  51.38  56.98  54.40  215.00
##               Groceries      48.36  47.56  52.74  50.35  199.01
##               Home & Garden  50.06  49.23  54.59  52.12  206.00
##               Sports         51.03  50.19  55.65  53.13  210.00
##               Sum           243.00 238.99 265.01 253.01 1000.01

round(chi_square$res, 2)

##                            as.factor(mydata$Group)
## mydata$`Preferred Category`     1     2     3     4
##               Clothing       0.42 -0.26 -1.35  1.22
##               Electronics    0.93  0.36 -0.26 -1.00
##               Groceries     -0.20  0.06 -0.51  0.66
##               Home & Garden -1.28 -0.18  1.81 -0.43
##               Sports         0.14 -0.03  0.18 -0.29

With none of the standardized residuals being statistically significant we can see that there really is no association between the Preferred Category and classification of clusters in 4 groups.

library(effectsize)
effectsize::cramers_v(mydata$`Preferred Category`, mydata$Group)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.00              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.00)

## [1] "tiny"
## (Rules: funder2019)

The effect size is 0, indicating no association or effect. This aligns with our initial finding that there is no significant association between the between the Preferred Category and classification of clusters in 4 groups. However, I conducted all the necessary statistical tests to ensure the validity of our conclusions.

Cross-Validity

Since the clustering was not validated using the criterion validity method, we will now proceed with validation using the cross-validity meth

od.

set.seed(123) 

indices <- sample(1:nrow(mydata))

split_point <- floor(nrow(mydata) / 2)

indices1 <- indices[1:split_point]
indices2 <- indices[(split_point + 1):nrow(mydata)]

data1 <- mydata[indices1, ]
data2 <- mydata[indices2, ]

cat("Summary of first subset (data1):\n")

## Summary of first subset (data1):

print(summary(data1))

##        ID              Age          Gender              Income      
##  Min.   :   5.0   Min.   :18.0   Length:500         Min.   : 30004  
##  1st Qu.: 239.8   1st Qu.:31.0   Class :character   1st Qu.: 57912  
##  Median : 486.5   Median :45.0   Mode  :character   Median : 87558  
##  Mean   : 496.2   Mean   :44.2                      Mean   : 87979  
##  3rd Qu.: 751.2   3rd Qu.:57.0                      3rd Qu.:114631  
##  Max.   :1000.0   Max.   :69.0                      Max.   :149973  
##  Spending Score   Membership Years Purchase Frequency Preferred Category
##  Min.   :  1.00   Min.   : 1.00    Min.   : 1.00      Length:500        
##  1st Qu.: 24.00   1st Qu.: 3.00    1st Qu.:14.00      Class :character  
##  Median : 47.00   Median : 5.00    Median :27.00      Mode  :character  
##  Mean   : 48.99   Mean   : 5.46    Mean   :26.06                        
##  3rd Qu.: 74.00   3rd Qu.: 8.00    3rd Qu.:38.25                        
##  Max.   :100.00   Max.   :10.00    Max.   :50.00                        
##  Last Purchase Amount Dissimilarity        Group      
##  Min.   : 12.36       Min.   :0.1659   Min.   :1.000  
##  1st Qu.:230.49       1st Qu.:0.9258   1st Qu.:1.000  
##  Median :486.36       Median :1.3198   Median :3.000  
##  Mean   :486.94       Mean   :1.2865   Mean   :2.498  
##  3rd Qu.:723.73       3rd Qu.:1.6745   3rd Qu.:3.250  
##  Max.   :999.74       Max.   :2.3371   Max.   :4.000

cat("\nSummary of second subset (data2):\n")

## 
## Summary of second subset (data2):

print(summary(data2))

##        ID             Age           Gender              Income      
##  Min.   :  1.0   Min.   :18.00   Length:500         Min.   : 30058  
##  1st Qu.:256.8   1st Qu.:30.00   Class :character   1st Qu.: 58200  
##  Median :505.5   Median :43.00   Mode  :character   Median : 88098  
##  Mean   :504.8   Mean   :43.37                      Mean   : 89023  
##  3rd Qu.:749.2   3rd Qu.:57.00                      3rd Qu.:117900  
##  Max.   :998.0   Max.   :69.00                      Max.   :149936  
##  Spending Score   Membership Years Purchase Frequency Preferred Category
##  Min.   :  1.00   Min.   : 1.000   Min.   : 1.00      Length:500        
##  1st Qu.: 27.00   1st Qu.: 3.000   1st Qu.:16.00      Class :character  
##  Median : 54.00   Median : 5.000   Median :28.00      Mode  :character  
##  Mean   : 52.38   Mean   : 5.478   Mean   :27.13                        
##  3rd Qu.: 77.00   3rd Qu.: 8.000   3rd Qu.:39.00                        
##  Max.   :100.00   Max.   :10.000   Max.   :50.00                        
##  Last Purchase Amount Dissimilarity        Group      
##  Min.   : 10.4        Min.   :0.1886   Min.   :1.000  
##  1st Qu.:213.4        1st Qu.:1.0014   1st Qu.:2.000  
##  Median :505.6        Median :1.4117   Median :3.000  
##  Mean   :497.8        Mean   :1.3567   Mean   :2.558  
##  3rd Qu.:771.7        3rd Qu.:1.7024   3rd Qu.:4.000  
##  Max.   :997.1        Max.   :2.3371   Max.   :4.000

Here I made two equal smaller samples on which I will conduct Cross-validity check.

data1_clu_std <- as.data.frame(scale(data1[c(5,6)]))

Clustering1 <- kmeans(data1_clu_std, 
                     centers = 4, 
                     nstart = 25) 

Clustering1

## K-means clustering with 4 clusters of sizes 113, 125, 133, 129
## 
## Cluster means:
##   Spending Score Membership Years
## 1      0.9123170        0.8707048
## 2      0.9116663       -0.8405909
## 3     -0.7857385       -0.8109151
## 4     -0.8724565        0.8878754
## 
## Clustering vector:
##  415  463  179  526  195  938  818  118  299  229  244   14  374  665  602  603 
##    2    4    1    2    1    3    3    4    4    2    1    3    1    3    2    3 
##  768  709   91  953  348  649  355  840   26  519  426  979  766  211  932  590 
##    2    4    3    4    1    2    2    1    1    2    1    1    2    1    4    3 
##  593  555  871  373  844  143  544  490  621  775  905  937  842   23  923  309 
##    4    2    1    2    1    1    3    3    3    4    1    2    4    2    2    3 
##  135  821  954  224  166  217  290  581   72  588  575  141  722  865  859  153 
##    1    2    3    3    4    3    2    1    3    2    2    1    2    3    4    1 
##  294  277  999   41  431   90  316  223  528  116  606  774  747  456  598  854 
##    3    3    1    3    2    2    3    4    2    1    3    3    4    3    1    1 
##   39  159  752  209  988  994   34  516   13   69  895  755  409  308  278   89 
##    3    2    3    3    2    3    2    1    2    2    3    4    1    3    4    4 
##  537  291  424  880  286  671  121  110  158   64  483  477  480  711   67  663 
##    1    4    3    2    2    3    3    1    3    3    1    3    2    3    2    2 
##  847   85  165  648   51   74  178  362  236  610  330  726  127  212  686  785 
##    3    3    3    3    1    1    2    3    1    4    2    4    2    3    2    4 
##  814  310  744  243  862  888  792  113  619  893  151  666  614  767  160  391 
##    4    4    4    3    1    3    4    4    3    2    2    1    3    4    3    2 
##  155  974    5  326  784  280  800  789  567  843  238  764  339  920  822  137 
##    2    1    3    2    2    2    3    2    1    3    3    2    1    1    4    4 
##  455  738  560  589   83  696  867  196  769  680  900  926  500  852  344  966 
##    1    2    2    4    1    4    1    4    2    1    4    4    4    2    3    1 
##  459   20  996  164   52  534  177  554   84  523  633  392  302  597  706  864 
##    2    2    1    4    2    3    3    2    1    3    4    2    3    1    2    4 
##  837  430  710  761  712  428  672  250  429  398  928  381  545   40  522  473 
##    4    2    2    3    4    3    1    2    4    4    4    1    2    3    2    3 
##  200  125  265  959  186  573  252  458  152   54  538  235  289  185  765  413 
##    4    2    1    1    1    4    2    1    3    1    2    3    1    1    2    4 
##  627  794  981  783  205  904  564  857  908  727  346  858  468  509   57  457 
##    3    2    3    3    1    2    3    1    4    4    4    2    2    2    4    3 
##  617  357  279  270  646  347  129  218  618  698  337  976  539  975  861  553 
##    1    3    2    1    1    3    2    4    3    3    2    4    4    3    3    4 
##  724  390  498  222  899  657  421  762  660  163  846  673  578  913  878  225 
##    3    2    2    3    3    4    2    4    3    2    1    3    1    1    2    4 
##  389  117  771  885   55  947  811  557  658  682 1000  134  891  688  447  104 
##    2    3    4    1    2    4    3    2    2    4    3    1    4    3    1    3 
##  716  845  210  349  401  258  915  386  941   24  466  130  886  943  377  170 
##    4    3    4    3    2    2    1    1    1    1    4    4    1    2    2    1 
##  445  234  422  508  910   80  894  548  475  903  343  323  479  838  450  111 
##    2    4    4    1    2    2    4    4    4    4    1    3    4    2    3    3 
##  317  741  287  585  292  226  297  605  637  834  237  700  809   33  836  396 
##    2    3    2    3    2    4    3    1    3    1    3    3    4    4    3    3 
##  935  917   76   94   30  723  175  916  685  115  751  608  465  358  902   96 
##    3    2    1    4    3    3    2    4    1    3    1    3    4    3    3    4 
##  782  397  404  148  813  968  714  338  869  106   11  625  364  705  403  461 
##    4    4    4    2    4    3    2    1    3    4    4    3    1    4    4    2 
##  704   31  655  661   16  420  882  417  464  412  810  524  437  732  562  204 
##    3    4    1    3    4    2    4    4    1    4    4    4    3    2    4    1 
##  720  965  624  384  122  399  634  315  259  494  780   48  331  100  108  301 
##    2    4    4    3    1    4    1    4    2    3    4    1    1    2    1    4 
##   10  697  851  980  402  889  804  925  395  986    8  261  541  306  853  883 
##    1    1    3    4    1    1    4    4    4    4    1    1    3    2    2    2 
##  282  267  262  760  219  352  119  452   36  870  961  240  304  600  694  105 
##    4    4    3    1    1    2    1    4    3    4    2    4    1    1    2    3 
##  388  934  180  906  615  241  703  559   37  303   19  378  549  990  733  188 
##    3    2    3    2    2    1    2    4    2    4    1    2    4    3    4    3 
##  860  393  139  992  371  189  311  547  418  382   38  816  319  596  120  604 
##    4    1    4    2    2    2    3    4    4    3    1    1    1    3    3    3 
##  533  441  199  499  944  609   81  942  717  650    6  128   49  476  239  340 
##    2    1    4    1    2    3    1    1    1    4    3    1    2    3    4    4 
##  193  824  561  645  190  191  446  668  630  571  512   59  305  832   61  570 
##    4    2    2    3    2    2    4    4    2    3    4    4    1    3    4    1 
##  591  676  770  955 
##    3    3    4    1 
## 
## Within cluster sum of squares by cluster:
## [1] 55.18674 66.19557 67.75788 67.46634
##  (between_SS / total_SS =  74.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

library(factoextra)
library(ggplot2)

fviz_cluster(Clustering1,
  palette = "Dark2",   
  repel = TRUE,       
  ggtheme = theme_minimal(),  
  data = data1_clu_std,
  geom = "point"       
) +
  scale_shape_manual(values = c(15, 16, 17, 18)) + 
  scale_size_manual(values = c(3, 3, 3, 3)) +       
  theme(legend.position = "bottom")

Averages1 <- Clustering1$centers

Figure1 <- as.data.frame(Averages1)
Figure1$id <- 1:nrow(Figure1)

library(tidyr)
Figure1 <- pivot_longer(Figure1, cols = c("Spending Score", "Membership Years"))

Figure1$Group <- factor(Figure1$id, 
                       levels = c(1, 2, 3, 4), 
                       labels = c("1", "2", "3", "4"))

Figure1$NameF <- factor(Figure1$name, 
                       levels = c("Spending Score", "Membership Years"), 
                       labels = c("Spending Score", "Membership Years"))


library(ggplot2)
ggplot(Figure1, aes(x = name, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 5, alpha = 0.4) +
  geom_line(aes(group = id), linewidth = 1.5) +
  ylab("Averages") +
  xlab("Cluster variables")+
  ylim(-2.5, 2.5) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 12))

data2_clu_std <- as.data.frame(scale(data2[c(5,6)]))

Clustering2 <- kmeans(data2_clu_std, 
                     centers = 4, 
                     nstart = 25) 

Clustering2

## K-means clustering with 4 clusters of sizes 128, 128, 112, 132
## 
## Cluster means:
##   Spending Score Membership Years
## 1      0.8947066        0.9725375
## 2      0.8179644       -0.8780666
## 3     -0.8671055        0.8437603
## 4     -0.9250461       -0.8075260
## 
## Clustering vector:
## 510 962  88 132 438 777 788 471 251 203 246 481 574 440 435 626 492 817 131 667 
##   4   1   3   1   4   1   1   1   3   3   3   3   3   1   2   3   1   2   1   1 
## 478 162 322 692 168 442 276  78 957 819 527 835  95 406 552 379 342 221 184 161 
##   4   2   4   3   1   4   4   3   1   3   4   2   1   3   2   1   2   3   3   4 
## 504 448 242 181 718 812 918 930 414 407 991 949 273 187 535 171 501 753 601 136 
##   3   1   2   2   2   1   1   3   4   2   4   4   1   3   4   4   2   4   3   3 
##  79 951 670 890 295 802 505 443 284 595  87 805 651 138 365 803 933 644 754 336 
##   4   3   3   2   1   2   1   1   4   1   3   3   4   2   4   2   2   2   1   1 
## 739 232 334 987 433 328 876 470 264 779 201 820 729 620 815 707  65 507  29 823 
##   4   3   3   3   1   1   1   1   2   2   3   1   1   1   1   4   4   1   1   3 
## 206 124 691 263 228  45 332 281 982 632 427 629 577 268 969 327 271 746 781 167 
##   4   1   1   1   3   4   2   4   1   4   3   1   2   4   1   3   4   1   4   1 
## 255 807 612  71 260 530 623 451 636 734 757 798 841 872 927  46 863 558 719  98 
##   1   4   3   2   1   2   2   2   4   3   2   4   1   3   3   4   1   2   3   1 
## 434 220 566 679 324 931 721 877 275 687 231 197 831 169 659 960 341 911 320 662 
##   1   1   2   3   3   3   1   4   2   4   3   1   4   2   3   1   3   2   4   3 
##   2 274 376 639 681 532 881 638 207  56 750 613 394 958 503 647  68 725 419  53 
##   2   2   4   1   1   3   2   2   1   4   1   2   1   3   3   4   3   3   4   1 
## 922 230 333 677 256 742 318 940 640 745 678 173 702 351 873 266 372 172 298 778 
##   2   4   4   1   1   3   4   3   3   1   2   1   1   4   2   1   2   4   4   1 
## 797 948  17 536 487 227 897 312 790 444 551 112  93 664 513 690 248 569 469 462 
##   2   4   4   4   1   1   1   3   2   4   2   1   1   4   4   4   1   3   1   2 
##   4  50 550 488 801 540 887 898 114 956  99 361  58 912 635 405 123 582 652 977 
##   1   2   2   1   3   1   4   2   4   2   4   2   4   3   4   4   3   4   3   2 
## 400 493  77 695 808 654 529 901 611 363 563 892  15 174 142 416 856 970 489 758 
##   2   2   4   1   4   2   4   3   2   4   4   4   4   2   4   2   2   4   4   3 
##  18 517 423  42 313 321  63  73 919 383 254 272 787 460  44  92  21 748 656 439 
##   4   4   3   2   4   4   3   4   2   4   1   4   1   2   2   1   4   2   3   2 
## 699 307 496 983 194  82 945  62 572 586 107 998 146 511 410 370 350 866 521 786 
##   2   2   2   1   1   2   2   1   1   2   1   4   4   1   4   1   3   2   2   2 
## 584 791 829 921 985 967 868 482 963 103 387 952 914 631 486 855   9 296 145 796 
##   3   1   3   1   2   1   2   2   3   3   3   1   1   3   2   4   3   4   3   3 
## 514 756 946 772  25 701 565 795 849 247 607 825 408 502 936 288 568 740 216 183 
##   4   1   1   1   2   3   1   4   1   3   2   4   1   3   4   2   1   4   4   2 
## 531 929 924 472 245 154 675 693 144 735 973 293 728 731 353  22 622 909 830 875 
##   3   4   2   4   4   4   3   2   4   3   3   2   4   4   1   1   3   3   4   1 
## 454 157 759 253  86 453 984 827  35 749 583 576  47 192 176  28 828 249 971 329 
##   4   2   4   3   3   1   3   1   1   4   2   2   3   3   2   3   1   1   4   2 
## 773 964 126 776 708 669 616 354 497 579 972 208 939 325 467 896 335 542 140 799 
##   3   2   1   4   1   1   2   2   3   3   2   2   2   2   4   4   2   2   2   4 
##  75 806 518 150 689 285 432 149 743 950 425 156 495 993 369 683  27 233 198 546 
##   1   1   2   3   4   4   2   2   2   4   2   2   4   1   4   1   3   3   1   3 
## 257 826 345 215 995 380 592 375 525   1  97 715 874 684 997  32 300 506 628 520 
##   4   3   3   2   4   2   4   4   4   2   3   1   2   3   1   2   4   2   2   3 
## 213 202 214 989 736 653 449 580 109 133 879  66 907   3 411  70 839  12 713 833 
##   3   3   4   2   1   3   4   2   1   1   2   2   4   4   2   3   4   2   3   1 
## 587 360 674 269 515 978 599 101 594 641 102 884 385 367 314 484  60 848 643 793 
##   2   1   4   1   3   3   2   4   3   4   1   2   1   2   2   4   4   4   3   4 
## 556 763 850 366 283 543 436 737 182 730 485 474  43 359 147 491 368   7 642 356 
##   2   4   4   1   3   4   2   2   3   1   1   2   3   4   3   2   1   2   3   1 
## 
## Within cluster sum of squares by cluster:
## [1] 51.67048 54.97523 56.71744 63.80148
##  (between_SS / total_SS =  77.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

library(factoextra)
library(ggplot2)

fviz_cluster(Clustering2,
  palette = "Dark2",   
  repel = TRUE,       
  ggtheme = theme_minimal(),  
  data = data2_clu_std,
  geom = "point"       
) +
  scale_shape_manual(values = c(15, 16, 17, 18)) + 
  scale_size_manual(values = c(3, 3, 3, 3)) +       
  theme(legend.position = "bottom")

Averages2 <- Clustering2$centers

Figure2 <- as.data.frame(Averages2)
Figure2$id <- 1:nrow(Figure2)

library(tidyr)
Figure2 <- pivot_longer(Figure2, cols = c("Spending Score", "Membership Years"))

Figure2$Group <- factor(Figure2$id, 
                       levels = c(1, 2, 3, 4), 
                       labels = c("1", "2", "3", "4"))

Figure2$NameF <- factor(Figure2$name, 
                       levels = c("Spending Score", "Membership Years"), 
                       labels = c("Spending Score", "Membership Years"))


library(ggplot2)
ggplot(Figure2, aes(x = name, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 5, alpha = 0.4) +
  geom_line(aes(group = id), linewidth = 1.5) +
  ylab("Averages") +
  xlab("Cluster variables")+
  ylim(-2.5, 2.5) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 12))

Above the cross-validity check was performed by dividing the dataset into two smaller samples and replicating the clustering process. The resulting cluster plots and cluster profile plots from both samples were found to be consistent with those from the main dataset. This demonstrates that the clustering solution is stable and robust, as the structure of the clusters remains similar across different subsets of the data. Therefore, the identified clusters are reliable and represent meaningful patterns within the dataset.

Conclusion

Based on the analysis, we clustered 1000 customers into 4 distinct segments using two variables: “Spending Score” and “Membership Years,” which were standardized prior to clustering.

Cluster 1: Represents loyal customers with above-average membership duration and above-average spending scores. This group values long-term engagement and spends consistently. They represent 24.3% of the dataset (243/1000).

Cluster 2: Consists of long-term, conservative spenders with significantly above-average membership years but below-average spending scores. These customers represent 23.9% of the dataset (239/1000). They may value the relationship with the company but exhibit cautious spending habits.

Cluster 3: Represents new or less engaged customers who exhibit very low membership duration and low spending scores. This is the most heterogeneous cluster with the highest within-cluster variability (WSS = 131.4482). This group comprises 26.5% of the dataset (265/1000).

Cluster 4: Consists of newer, opportunistic high spenders with slightly below-average membership years but significantly above-average spending scores. These customers represent 25.3% of the dataset (253/1000).

Homework 2

Pino Sekač

2025-01-21