h1.title {color: salmon;font-family: "Copperplate", fantasy; font-size: 48px;}
.author {color: lightsalmon ;font-family: "Copperplate", fantasy; font-size: 24px;}
.date {color: darksalmon ;font-family: "Copperplate", fantasy; font-size: 18px;}
h1.title:hover {
text-shadow: 2px 2px 8px rgba(255, 20, 147, 0.8);
}
.columns {display: flex;}
h1 {color: #FF91A4;font-family: 'Roboto', sans-serif; font-size: 36px;}
mydata <- read.csv("C:/Users/Pino/Desktop/IMB/MVA/customer_segmentation_data.csv")
colnames(mydata) [1] <- "ID"
colnames(mydata) [2] <- "Age"
colnames(mydata) [3] <- "Gender"
colnames(mydata) [4] <- "Income"
colnames(mydata) [5] <- "Spending Score"
colnames(mydata) [6] <- "Membership Years"
colnames(mydata) [7] <- "Purchase Frequency"
colnames(mydata) [8] <- "Preferred Category"
colnames(mydata) [9] <- "Last Purchase Amount"
head(mydata)
## ID Age Gender Income Spending Score Membership Years Purchase Frequency
## 1 1 38 Female 99342 90 3 24
## 2 2 21 Female 78852 60 2 42
## 3 3 60 Female 126573 30 2 28
## 4 4 40 Other 47099 74 9 5
## 5 5 65 Female 140621 21 3 25
## 6 6 31 Other 57305 24 3 30
## Preferred Category Last Purchase Amount
## 1 Groceries 113.53
## 2 Sports 41.93
## 3 Clothing 424.36
## 4 Home & Garden 991.93
## 5 Electronics 347.08
## 6 Home & Garden 86.85
This dataset contains simulated customer data that can be used for segmentation analysis. It includes demographic and behavioral information about customers, which can help in identifying distinct segments within the customer base. The data is taken from Kaggle website.
Sample size: 1000
Unit of observation: A single customer
Explanation of variables:
ID: Unique identifier for each customer. Age: Age of the customer. Gender: Gender of the customer (Male, Female, Other). Income: Annual income of the customer (in USD). Spending Score: Spending score (1-100), indicating the customer’s spending behavior and loyalty. Membership Years: Number of years the customer has been a member. Purchase Frequency: Number of purchases made by the customer in the last year. Preferred Category: Preferred shopping category (Electronics, Clothing, Groceries, Home & Garden, Sports). Last Purchase Amount: Amount spent by the customer on their last purchase (in USD).
I renamed some variables so the name is more “tidy”.
The dataset includes numerical variables so we can approach it with clustering analysissummary(mydata[c(4:9)])
## Income Spending Score Membership Years Purchase Frequency
## Min. : 30004 Min. : 1.00 Min. : 1.000 Min. : 1.0
## 1st Qu.: 57912 1st Qu.: 26.00 1st Qu.: 3.000 1st Qu.:15.0
## Median : 87846 Median : 50.00 Median : 5.000 Median :27.0
## Mean : 88501 Mean : 50.69 Mean : 5.469 Mean :26.6
## 3rd Qu.:116110 3rd Qu.: 76.00 3rd Qu.: 8.000 3rd Qu.:39.0
## Max. :149973 Max. :100.00 Max. :10.000 Max. :50.0
## Preferred Category Last Purchase Amount
## Length:1000 Min. : 10.4
## Class :character 1st Qu.:218.8
## Mode :character Median :491.6
## Mean :492.3
## 3rd Qu.:747.2
## Max. :999.7
For the variable “Income”, 1st Quartile (Q1) is 57912, meaning 25% of customers have an income less than $57,912.
For the variable “Spending Score”, min equals 1, indicating that the lowest spending score out of the whole sample is 1.
For the variable “Membership Years”, mean is 5.469, indicating the average membership length of the sample is 5.469 years.mydata_clu_std <- as.data.frame(scale(mydata[c(5,6)]))
mydata$Dissimilarity = sqrt(mydata_clu_std$`Spending Score`^2 + mydata_clu_std$`Membership Years`^2)
head(mydata[order(-mydata$Dissimilarity), c("ID", "Dissimilarity")], 10)
## ID Dissimilarity
## 900 900 2.337053
## 987 987 2.337053
## 353 353 2.327687
## 913 913 2.327687
## 384 384 2.322369
## 467 467 2.322369
## 377 377 2.312943
## 425 425 2.312943
## 883 883 2.312943
## 943 943 2.312943
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances <- get_dist(mydata_clu_std,
method = "euclidian")
fviz_dist(Distances,
gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
library(factoextra)
get_clust_tendency(mydata_clu_std,
n = nrow(mydata_clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.7005131
##
## $plot
## NULL
library(factoextra)
library(NbClust)
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette analysis")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(factoextra)
WARD <- mydata_clu_std %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 1000
library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
fviz_dend(WARD,
k = 4,
cex = 0.5,
palette = "jama",
color_labels_by_k = TRUE,
rect = TRUE)
library(NbClust)
NbClust(mydata_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 11 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 4
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 3.4857 597.2601 590.4964 -1.3333 1380.657 1002926.1 514409.19 1249.9554
## 3 0.3452 769.7994 618.3263 -8.6510 1880.087 1369467.8 179137.24 785.3059
## 4 3.3729 1036.5473 174.3501 1.0539 2835.248 936716.1 53528.49 484.7008
## 5 3.5974 956.1230 219.2383 -1.6623 3182.254 1034487.6 40122.32 412.4937
## 6 2.8376 976.3016 158.7808 -1.0112 3714.208 875110.8 32283.43 338.0154
## 7 0.1061 969.0337 209.1246 -1.2870 3921.948 967690.1 22881.07 291.4581
## 8 1.7117 1034.3559 86.6481 0.7921 4248.360 911930.0 13689.34 240.7553
## 9 0.7253 993.9438 61.5396 -0.5372 4432.118 960420.5 12946.35 221.4154
## 10 0.6997 944.2539 172.2686 -2.2197 4596.088 1006386.4 12863.53 208.4697
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 2.9789 1.5985 0.3775 1.2922 0.3518 0.7010 277.2703 0.4257 0.3116
## 3 3.1446 2.5442 0.3138 0.9426 0.3845 0.8963 58.5405 0.1154 0.4496
## 4 6.2666 4.1221 0.4049 0.8220 0.4219 1.5158 -152.7898 -0.3390 0.4351
## 5 7.9550 4.8437 0.3957 0.9115 0.3956 1.1321 -45.2687 -0.1162 0.3983
## 6 11.8910 5.9110 0.3866 1.0279 0.3817 1.5459 -117.2373 -0.3513 0.3718
## 7 12.7435 6.8552 0.3699 0.9606 0.3746 1.9512 -144.7860 -0.4841 0.3492
## 8 14.8769 8.2989 0.3692 0.8462 0.3933 1.5958 -94.0822 -0.3706 0.3316
## 9 16.6548 9.0238 0.3584 0.8791 0.3844 1.4562 -73.6227 -0.3110 0.3143
## 10 18.6874 9.5841 0.3500 0.9506 0.3670 1.3924 -57.7739 -0.2796 0.2992
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 624.9777 0.4764 0.2082 0.6292 0.0948 0.0007 2.8815 1.0326 1.8884
## 3 261.7686 0.5731 0.2112 1.0165 0.0097 0.0008 1.8357 0.8097 1.4203
## 4 121.1752 0.6131 0.8020 1.2673 0.0157 0.0009 1.5698 0.6497 1.0042
## 5 82.4987 0.5798 0.4150 1.5382 0.0165 0.0010 1.9411 0.5934 0.6890
## 6 56.3359 0.5594 0.5052 1.7911 0.0181 0.0011 2.2842 0.5354 0.4975
## 7 41.6369 0.5394 0.3311 1.9889 0.0185 0.0012 2.2027 0.4999 0.4816
## 8 30.0944 0.5246 0.6473 2.1537 0.0203 0.0012 2.1541 0.4569 0.6731
## 9 24.6017 0.5072 0.7188 2.3240 0.0204 0.0012 2.6925 0.4347 0.3801
## 10 20.8470 0.4815 0.2633 2.6012 0.0208 0.0013 2.6240 0.4207 0.2204
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.5733 483.8609 0.6534
## 3 0.5530 408.9284 0.8910
## 4 0.5301 398.0160 1.0000
## 5 0.5275 347.6084 1.0000
## 6 0.5036 327.2609 1.0000
## 7 0.4775 325.0467 1.0000
## 8 0.4716 282.3289 1.0000
## 9 0.4701 264.9142 1.0000
## 10 0.4644 236.4154 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 5.0000 4.000 4.0000 4.0000 4.0000 4.0 3.0
## Value_Index 3.5974 1036.547 443.9762 1.0539 955.1614 530523.3 335271.9
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 4.0000 6.0000 4.0000 3.0000 4.000 4.0000 2.000
## Value_Index 228.3979 3.9361 -0.8563 0.3138 0.822 0.4219 0.701
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## Number_clusters 2.0000 2.0000 3.0000 3.0000 4.0000 1 2.0000
## Value_Index 277.2703 0.4257 0.4496 363.2091 0.6131 NA 0.6292
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 2.0000 0 4.0000 0 10.0000
## Value_Index 0.0948 0 1.5698 0 0.2204
##
## $Best.partition
## [1] 3 3 1 2 1 1 3 2 4 2 4 3 3 1 1 4 1 1 2 3 1 2 3 2 3 2 4 4 2 1 4 3 4 3 2 1 3
## [38] 2 1 1 1 3 4 3 1 1 4 2 3 3 2 3 2 2 3 1 4 1 4 1 4 2 4 1 1 3 3 2 3 4 3 1 1 2
## [75] 2 2 1 4 1 3 2 3 2 2 1 4 4 4 4 3 1 2 2 4 2 4 4 2 1 3 1 2 4 1 1 4 2 2 2 2 1
## [112] 2 4 1 1 2 1 4 2 1 1 2 4 2 3 2 3 2 3 4 2 2 2 2 2 4 4 3 4 3 2 1 2 1 4 1 4 3
## [149] 3 4 3 1 2 1 3 3 3 1 3 1 1 3 3 4 1 4 2 2 3 2 1 1 2 3 3 3 1 3 2 1 3 4 3 4 2
## [186] 2 4 1 3 3 3 4 4 2 2 4 2 2 4 4 4 4 4 2 2 1 2 3 1 4 2 1 4 1 3 1 1 4 2 2 4 1
## [223] 4 1 4 4 2 4 3 1 4 4 4 4 1 2 1 1 4 4 4 3 1 2 1 4 4 2 2 3 4 3 4 2 2 2 1 3 3
## [260] 2 2 1 2 3 2 2 4 1 2 2 1 1 2 3 3 1 1 4 3 3 1 4 4 1 1 3 3 3 2 3 4 3 3 1 2 1
## [297] 1 1 4 1 4 1 4 2 4 3 3 1 1 4 1 4 1 3 4 1 3 1 2 1 1 1 1 4 3 3 4 2 3 3 2 3 1
## [334] 4 3 2 3 2 2 4 4 3 2 1 4 4 1 2 1 4 1 3 2 3 3 2 1 1 1 2 3 1 1 2 1 2 3 2 1 2
## [371] 3 3 3 2 1 1 3 3 2 3 2 1 1 1 2 2 4 1 3 3 3 3 2 2 4 1 4 4 4 3 3 2 4 4 1 4 3
## [408] 2 2 1 3 4 4 1 3 3 4 4 1 3 3 4 4 1 3 2 4 1 4 3 3 3 2 2 3 3 1 1 3 2 4 1 2 1
## [445] 3 4 2 2 1 1 3 4 2 1 2 1 1 2 3 3 3 3 4 2 4 4 1 3 2 2 2 1 1 3 4 1 1 1 4 3 4
## [482] 3 2 1 2 3 2 2 1 1 3 2 3 1 1 3 4 3 2 4 3 4 2 4 2 3 2 2 3 1 2 4 1 1 4 2 1 3
## [519] 3 4 3 3 1 4 1 3 1 3 1 3 4 4 3 1 1 1 2 3 4 2 1 3 1 1 3 4 4 4 4 3 3 3 4 3 3
## [556] 3 3 3 4 3 3 4 1 1 2 3 2 2 4 2 1 2 4 4 3 3 3 2 4 3 2 1 3 4 1 3 3 3 4 1 1 1
## [593] 4 4 2 1 2 2 3 2 4 3 1 1 2 1 3 1 1 4 3 4 3 1 3 3 4 1 1 2 1 4 3 4 1 4 1 3 2
## [630] 3 4 1 4 4 1 1 1 3 2 4 1 4 4 3 1 2 1 1 3 4 1 4 4 3 2 4 4 3 4 1 1 4 3 1 1 2
## [667] 2 4 2 4 1 2 1 1 4 1 2 3 4 2 2 4 2 4 2 3 1 1 1 1 2 4 3 3 2 4 2 1 3 1 4 2 3
## [704] 1 4 3 1 2 4 3 1 4 4 3 2 4 2 3 4 3 2 3 1 1 4 4 4 1 2 2 1 3 4 4 4 2 3 3 1 1
## [741] 1 4 3 4 2 2 4 3 1 2 2 1 1 2 4 2 3 4 1 2 1 4 1 3 3 3 4 3 3 4 4 2 4 1 4 1 2
## [778] 2 3 4 1 4 1 3 4 3 2 2 3 3 2 4 1 3 1 4 3 1 1 1 4 3 3 4 4 2 1 1 4 4 1 2 4 4
## [815] 2 2 3 1 4 2 3 4 2 3 1 4 2 2 4 1 1 1 2 2 3 1 4 3 1 2 2 4 1 2 1 2 1 1 2 1 1
## [852] 3 3 2 1 3 2 3 4 4 1 2 2 4 1 3 2 3 1 4 2 4 3 3 2 2 1 3 3 3 3 4 3 3 2 2 1 1
## [889] 2 3 4 1 3 4 1 1 2 3 1 4 4 1 4 3 2 3 1 4 4 3 3 4 2 2 2 4 3 2 3 2 2 3 3 3 4
## [926] 4 4 4 1 4 4 4 3 3 1 1 3 1 3 4 2 2 3 3 3 2 4 1 1 1 4 2 4 1 2 3 2 4 2 2 3 2
## [963] 4 3 4 2 2 1 2 1 1 3 4 2 1 4 3 4 2 4 1 2 2 4 3 4 4 3 3 1 1 3 2 1 1 2 2 1 2
## [1000] 1
Clustering <- kmeans(mydata_clu_std,
centers = 4,
nstart = 25)
Clustering
## K-means clustering with 4 clusters of sizes 243, 239, 265, 253
##
## Cluster means:
## Spending Score Membership Years
## 1 -0.8665949 0.8661137
## 2 0.9145986 0.9273127
## 3 -0.8547402 -0.8090784
## 4 0.8636349 -0.8604253
##
## Clustering vector:
## [1] 4 4 3 2 3 3 4 2 1 2 1 4 4 3 3 1 3 3 2 4 3 2 4 2 4 2 1 1 2 3 1 4 1 4 2 3 4
## [38] 2 3 3 3 4 1 4 3 3 1 2 4 4 2 4 2 2 4 3 1 3 1 3 1 2 1 3 3 4 4 2 4 1 4 3 3 2
## [75] 2 2 3 1 3 4 2 4 2 2 3 1 1 1 1 4 3 2 2 1 2 1 1 2 3 4 3 2 1 3 3 1 2 2 2 2 3
## [112] 2 1 3 3 2 3 1 2 3 3 2 1 2 4 2 4 2 4 1 2 2 2 2 2 1 1 4 1 4 2 3 2 3 1 3 1 4
## [149] 4 1 4 3 2 3 4 4 4 3 4 3 3 4 4 1 3 1 2 2 4 2 3 3 2 4 4 4 3 4 2 3 4 1 4 1 2
## [186] 2 1 3 4 4 4 1 1 2 2 1 2 2 1 1 1 1 1 2 2 3 2 4 3 1 2 3 1 3 4 3 3 1 2 2 1 3
## [223] 1 3 1 1 2 1 4 3 1 1 1 1 3 2 3 3 1 1 1 4 3 2 3 1 1 2 2 4 1 4 1 2 2 2 3 4 4
## [260] 2 2 3 2 4 2 2 1 3 2 2 3 3 2 4 4 3 3 1 4 4 3 1 1 3 3 4 4 4 2 4 1 4 4 3 2 3
## [297] 3 3 1 3 1 3 1 2 1 4 4 3 3 1 3 1 3 4 1 3 4 3 2 3 3 3 3 1 4 4 1 2 4 4 2 4 3
## [334] 1 4 2 4 2 2 1 1 4 2 3 1 1 3 2 3 1 3 4 2 4 4 2 3 3 3 2 4 3 3 2 3 2 4 2 3 2
## [371] 4 4 4 2 3 3 4 4 2 4 2 3 3 3 2 2 1 3 4 4 4 4 2 2 1 3 1 1 1 4 4 2 1 1 3 1 4
## [408] 2 2 3 4 1 1 3 4 4 1 1 3 4 4 1 1 3 4 2 1 3 1 4 4 4 2 2 4 4 3 3 4 2 1 3 2 3
## [445] 4 1 2 2 3 3 4 1 2 3 2 3 3 2 4 4 4 4 1 2 1 1 3 4 2 2 2 3 3 4 1 3 3 3 1 4 1
## [482] 4 2 3 2 4 2 2 3 3 4 2 4 3 3 4 1 4 2 1 4 1 2 1 2 4 2 2 4 3 2 1 3 3 1 2 3 4
## [519] 4 1 4 4 3 1 3 4 3 4 3 4 1 1 4 3 3 3 2 4 1 2 3 4 3 3 4 1 1 1 1 4 4 4 1 4 4
## [556] 4 4 4 1 4 4 1 3 3 2 4 2 2 1 2 3 2 1 1 4 4 4 2 1 4 2 3 4 1 3 4 4 4 1 3 3 3
## [593] 1 1 2 3 2 2 4 2 1 4 3 3 2 3 4 3 3 1 4 1 4 3 4 4 1 3 3 2 3 1 4 1 3 1 3 4 2
## [630] 4 1 3 1 1 3 3 3 4 2 1 3 1 1 4 3 2 3 3 4 1 3 1 1 4 2 1 1 4 1 3 3 1 4 3 3 2
## [667] 2 1 2 1 3 2 3 3 1 3 2 4 1 2 2 1 2 1 2 4 3 3 3 3 2 1 4 4 2 1 2 3 4 3 1 2 4
## [704] 3 1 4 3 2 1 4 3 1 1 4 2 1 2 4 1 4 2 4 3 3 1 1 1 3 2 2 3 4 1 1 1 2 4 4 3 3
## [741] 3 1 4 1 2 2 1 4 3 2 2 3 3 2 1 2 4 1 3 2 3 1 3 4 4 4 1 4 4 1 1 2 1 3 1 3 2
## [778] 2 4 1 3 1 3 4 1 4 2 2 4 4 2 1 3 4 3 1 4 3 3 3 1 4 4 1 1 2 3 3 1 1 3 2 1 1
## [815] 2 2 4 3 1 2 4 1 2 4 3 1 2 2 1 3 3 3 2 2 4 3 1 4 3 2 2 1 3 2 3 2 3 3 2 3 3
## [852] 4 4 2 3 4 2 4 1 1 3 2 2 1 3 4 2 4 3 1 2 1 4 4 2 2 3 4 4 4 4 1 4 4 2 2 3 3
## [889] 2 4 1 3 4 1 3 3 2 4 3 1 1 3 1 4 2 4 3 1 1 4 4 1 2 2 2 1 4 2 4 2 2 4 4 4 1
## [926] 1 1 1 3 1 1 1 4 4 3 3 4 3 4 1 2 2 4 4 4 2 1 3 3 3 1 2 1 3 2 4 2 1 2 2 4 2
## [963] 1 4 1 2 2 3 2 3 3 4 1 2 3 1 4 1 2 1 3 2 2 1 4 1 1 4 4 3 3 4 2 3 3 2 2 3 2
## [1000] 3
##
## Within cluster sum of squares by cluster:
## [1] 125.1556 106.9615 131.4482 121.1354
## (between_SS / total_SS = 75.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
I created 4 clusters with the sizes of each being as follows: first one includes 243 units, second one 239, third one 265 and lastly fourth one 253 units.
Biggest variability is seen in cluster 3, with it having the within sum of squares of 131.4482, indicating that it is the most heterogeneous cluster.
The total within sum of squares is 484.7007, the between sum of squares is 1509.95239, making the total sum of squares 1994,65309.
The ratio between the between sum of squares and the total sum of squares is 75.7%, which means that 75.7% of total variability in the dataset is explained by the differences between the clusters, and the remaining 24.3% of total variability is the variability within the clusters. 75.7% is essentially the biggest ratio of final leaders with the K-means method.library(factoextra)
library(ggplot2)
fviz_cluster(Clustering,
palette = "Dark2",
repel = TRUE,
ggtheme = theme_minimal(),
data = mydata_clu_std,
geom = "point"
) +
scale_shape_manual(values = c(15, 16, 17, 18)) +
scale_size_manual(values = c(3, 3, 3, 3)) +
theme(legend.position = "bottom")
Averages <- Clustering$centers
Averages
## Spending Score Membership Years
## 1 -0.8665949 0.8661137
## 2 0.9145986 0.9273127
## 3 -0.8547402 -0.8090784
## 4 0.8636349 -0.8604253
Figure <- as.data.frame(Averages)
Figure$id <- 1:nrow(Figure)
library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Spending Score", "Membership Years"))
Figure$Group <- factor(Figure$id,
levels = c(1, 2, 3, 4),
labels = c("1", "2", "3", "4"))
Figure$NameF <- factor(Figure$name,
levels = c("Spending Score", "Membership Years"),
labels = c("Spending Score", "Membership Years"))
library(ggplot2)
ggplot(Figure, aes(x = name, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 5, alpha = 0.4) +
geom_line(aes(group = id), linewidth = 1.5) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.5, 2.5) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 12))
mydata$Group <- Clustering$cluster
fit <- aov(cbind(`Spending Score` , `Membership Years`) ~ Group, data = mydata)
summary(fit)
## Response Spending Score :
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 119764 119764 166.51 < 2.2e-16 ***
## Residuals 998 717800 719
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Membership Years :
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 4862.2 4862.2 1477.2 < 2.2e-16 ***
## Residuals 998 3284.8 3.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I conducted an ANOVA analysis to determine whether the clustering variables (“Spending Score” and “Membership Years”) differentiate between the groups created by clustering. The purpose of this analysis was to validate the clustering results by checking if the group means of these variables are significantly different.
For “Spending Score”, the F-value is 166.51, with p < 0.001, indicating that the mean of “Spending Score” differs significantly across groups. For “Membership Years”, the F-value is 1477.2, with p < 0.001, confirming that the mean “Membership Years” also differs strongly between groups.chi_square <- chisq.test(mydata$`Preferred Category`, as.factor(mydata$Group))
chi_square
##
## Pearson's Chi-squared test
##
## data: mydata$`Preferred Category` and as.factor(mydata$Group)
## X-squared = 11.648, df = 12, p-value = 0.4744
library(ggplot2)
ggplot(mydata, aes(x = as.factor(Group), fill = `Preferred Category`)) +
geom_bar(position = "fill") + # 'fill' scales the bars to proportions
scale_y_continuous(labels = scales::percent_format()) +
labs(
title = "Distribution of Preferred Categories Across Groups",
x = "Group",
y = "Proportion",
fill = "Preferred Category"
) +
theme_minimal()
addmargins(chi_square$observed)
## as.factor(mydata$Group)
## mydata$`Preferred Category` 1 2 3 4 Sum
## Clothing 44 39 36 51 170
## Electronics 59 54 55 47 215
## Groceries 47 48 49 55 199
## Home & Garden 41 48 68 49 206
## Sports 52 50 57 51 210
## Sum 243 239 265 253 1000
addmargins(round(chi_square$expected, 2))
## as.factor(mydata$Group)
## mydata$`Preferred Category` 1 2 3 4 Sum
## Clothing 41.31 40.63 45.05 43.01 170.00
## Electronics 52.24 51.38 56.98 54.40 215.00
## Groceries 48.36 47.56 52.74 50.35 199.01
## Home & Garden 50.06 49.23 54.59 52.12 206.00
## Sports 51.03 50.19 55.65 53.13 210.00
## Sum 243.00 238.99 265.01 253.01 1000.01
round(chi_square$res, 2)
## as.factor(mydata$Group)
## mydata$`Preferred Category` 1 2 3 4
## Clothing 0.42 -0.26 -1.35 1.22
## Electronics 0.93 0.36 -0.26 -1.00
## Groceries -0.20 0.06 -0.51 0.66
## Home & Garden -1.28 -0.18 1.81 -0.43
## Sports 0.14 -0.03 0.18 -0.29
library(effectsize)
effectsize::cramers_v(mydata$`Preferred Category`, mydata$Group)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.00 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.00)
## [1] "tiny"
## (Rules: funder2019)
od.
set.seed(123)
indices <- sample(1:nrow(mydata))
split_point <- floor(nrow(mydata) / 2)
indices1 <- indices[1:split_point]
indices2 <- indices[(split_point + 1):nrow(mydata)]
data1 <- mydata[indices1, ]
data2 <- mydata[indices2, ]
cat("Summary of first subset (data1):\n")
## Summary of first subset (data1):
print(summary(data1))
## ID Age Gender Income
## Min. : 5.0 Min. :18.0 Length:500 Min. : 30004
## 1st Qu.: 239.8 1st Qu.:31.0 Class :character 1st Qu.: 57912
## Median : 486.5 Median :45.0 Mode :character Median : 87558
## Mean : 496.2 Mean :44.2 Mean : 87979
## 3rd Qu.: 751.2 3rd Qu.:57.0 3rd Qu.:114631
## Max. :1000.0 Max. :69.0 Max. :149973
## Spending Score Membership Years Purchase Frequency Preferred Category
## Min. : 1.00 Min. : 1.00 Min. : 1.00 Length:500
## 1st Qu.: 24.00 1st Qu.: 3.00 1st Qu.:14.00 Class :character
## Median : 47.00 Median : 5.00 Median :27.00 Mode :character
## Mean : 48.99 Mean : 5.46 Mean :26.06
## 3rd Qu.: 74.00 3rd Qu.: 8.00 3rd Qu.:38.25
## Max. :100.00 Max. :10.00 Max. :50.00
## Last Purchase Amount Dissimilarity Group
## Min. : 12.36 Min. :0.1659 Min. :1.000
## 1st Qu.:230.49 1st Qu.:0.9258 1st Qu.:1.000
## Median :486.36 Median :1.3198 Median :3.000
## Mean :486.94 Mean :1.2865 Mean :2.498
## 3rd Qu.:723.73 3rd Qu.:1.6745 3rd Qu.:3.250
## Max. :999.74 Max. :2.3371 Max. :4.000
cat("\nSummary of second subset (data2):\n")
##
## Summary of second subset (data2):
print(summary(data2))
## ID Age Gender Income
## Min. : 1.0 Min. :18.00 Length:500 Min. : 30058
## 1st Qu.:256.8 1st Qu.:30.00 Class :character 1st Qu.: 58200
## Median :505.5 Median :43.00 Mode :character Median : 88098
## Mean :504.8 Mean :43.37 Mean : 89023
## 3rd Qu.:749.2 3rd Qu.:57.00 3rd Qu.:117900
## Max. :998.0 Max. :69.00 Max. :149936
## Spending Score Membership Years Purchase Frequency Preferred Category
## Min. : 1.00 Min. : 1.000 Min. : 1.00 Length:500
## 1st Qu.: 27.00 1st Qu.: 3.000 1st Qu.:16.00 Class :character
## Median : 54.00 Median : 5.000 Median :28.00 Mode :character
## Mean : 52.38 Mean : 5.478 Mean :27.13
## 3rd Qu.: 77.00 3rd Qu.: 8.000 3rd Qu.:39.00
## Max. :100.00 Max. :10.000 Max. :50.00
## Last Purchase Amount Dissimilarity Group
## Min. : 10.4 Min. :0.1886 Min. :1.000
## 1st Qu.:213.4 1st Qu.:1.0014 1st Qu.:2.000
## Median :505.6 Median :1.4117 Median :3.000
## Mean :497.8 Mean :1.3567 Mean :2.558
## 3rd Qu.:771.7 3rd Qu.:1.7024 3rd Qu.:4.000
## Max. :997.1 Max. :2.3371 Max. :4.000
data1_clu_std <- as.data.frame(scale(data1[c(5,6)]))
Clustering1 <- kmeans(data1_clu_std,
centers = 4,
nstart = 25)
Clustering1
## K-means clustering with 4 clusters of sizes 113, 125, 133, 129
##
## Cluster means:
## Spending Score Membership Years
## 1 0.9123170 0.8707048
## 2 0.9116663 -0.8405909
## 3 -0.7857385 -0.8109151
## 4 -0.8724565 0.8878754
##
## Clustering vector:
## 415 463 179 526 195 938 818 118 299 229 244 14 374 665 602 603
## 2 4 1 2 1 3 3 4 4 2 1 3 1 3 2 3
## 768 709 91 953 348 649 355 840 26 519 426 979 766 211 932 590
## 2 4 3 4 1 2 2 1 1 2 1 1 2 1 4 3
## 593 555 871 373 844 143 544 490 621 775 905 937 842 23 923 309
## 4 2 1 2 1 1 3 3 3 4 1 2 4 2 2 3
## 135 821 954 224 166 217 290 581 72 588 575 141 722 865 859 153
## 1 2 3 3 4 3 2 1 3 2 2 1 2 3 4 1
## 294 277 999 41 431 90 316 223 528 116 606 774 747 456 598 854
## 3 3 1 3 2 2 3 4 2 1 3 3 4 3 1 1
## 39 159 752 209 988 994 34 516 13 69 895 755 409 308 278 89
## 3 2 3 3 2 3 2 1 2 2 3 4 1 3 4 4
## 537 291 424 880 286 671 121 110 158 64 483 477 480 711 67 663
## 1 4 3 2 2 3 3 1 3 3 1 3 2 3 2 2
## 847 85 165 648 51 74 178 362 236 610 330 726 127 212 686 785
## 3 3 3 3 1 1 2 3 1 4 2 4 2 3 2 4
## 814 310 744 243 862 888 792 113 619 893 151 666 614 767 160 391
## 4 4 4 3 1 3 4 4 3 2 2 1 3 4 3 2
## 155 974 5 326 784 280 800 789 567 843 238 764 339 920 822 137
## 2 1 3 2 2 2 3 2 1 3 3 2 1 1 4 4
## 455 738 560 589 83 696 867 196 769 680 900 926 500 852 344 966
## 1 2 2 4 1 4 1 4 2 1 4 4 4 2 3 1
## 459 20 996 164 52 534 177 554 84 523 633 392 302 597 706 864
## 2 2 1 4 2 3 3 2 1 3 4 2 3 1 2 4
## 837 430 710 761 712 428 672 250 429 398 928 381 545 40 522 473
## 4 2 2 3 4 3 1 2 4 4 4 1 2 3 2 3
## 200 125 265 959 186 573 252 458 152 54 538 235 289 185 765 413
## 4 2 1 1 1 4 2 1 3 1 2 3 1 1 2 4
## 627 794 981 783 205 904 564 857 908 727 346 858 468 509 57 457
## 3 2 3 3 1 2 3 1 4 4 4 2 2 2 4 3
## 617 357 279 270 646 347 129 218 618 698 337 976 539 975 861 553
## 1 3 2 1 1 3 2 4 3 3 2 4 4 3 3 4
## 724 390 498 222 899 657 421 762 660 163 846 673 578 913 878 225
## 3 2 2 3 3 4 2 4 3 2 1 3 1 1 2 4
## 389 117 771 885 55 947 811 557 658 682 1000 134 891 688 447 104
## 2 3 4 1 2 4 3 2 2 4 3 1 4 3 1 3
## 716 845 210 349 401 258 915 386 941 24 466 130 886 943 377 170
## 4 3 4 3 2 2 1 1 1 1 4 4 1 2 2 1
## 445 234 422 508 910 80 894 548 475 903 343 323 479 838 450 111
## 2 4 4 1 2 2 4 4 4 4 1 3 4 2 3 3
## 317 741 287 585 292 226 297 605 637 834 237 700 809 33 836 396
## 2 3 2 3 2 4 3 1 3 1 3 3 4 4 3 3
## 935 917 76 94 30 723 175 916 685 115 751 608 465 358 902 96
## 3 2 1 4 3 3 2 4 1 3 1 3 4 3 3 4
## 782 397 404 148 813 968 714 338 869 106 11 625 364 705 403 461
## 4 4 4 2 4 3 2 1 3 4 4 3 1 4 4 2
## 704 31 655 661 16 420 882 417 464 412 810 524 437 732 562 204
## 3 4 1 3 4 2 4 4 1 4 4 4 3 2 4 1
## 720 965 624 384 122 399 634 315 259 494 780 48 331 100 108 301
## 2 4 4 3 1 4 1 4 2 3 4 1 1 2 1 4
## 10 697 851 980 402 889 804 925 395 986 8 261 541 306 853 883
## 1 1 3 4 1 1 4 4 4 4 1 1 3 2 2 2
## 282 267 262 760 219 352 119 452 36 870 961 240 304 600 694 105
## 4 4 3 1 1 2 1 4 3 4 2 4 1 1 2 3
## 388 934 180 906 615 241 703 559 37 303 19 378 549 990 733 188
## 3 2 3 2 2 1 2 4 2 4 1 2 4 3 4 3
## 860 393 139 992 371 189 311 547 418 382 38 816 319 596 120 604
## 4 1 4 2 2 2 3 4 4 3 1 1 1 3 3 3
## 533 441 199 499 944 609 81 942 717 650 6 128 49 476 239 340
## 2 1 4 1 2 3 1 1 1 4 3 1 2 3 4 4
## 193 824 561 645 190 191 446 668 630 571 512 59 305 832 61 570
## 4 2 2 3 2 2 4 4 2 3 4 4 1 3 4 1
## 591 676 770 955
## 3 3 4 1
##
## Within cluster sum of squares by cluster:
## [1] 55.18674 66.19557 67.75788 67.46634
## (between_SS / total_SS = 74.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
library(factoextra)
library(ggplot2)
fviz_cluster(Clustering1,
palette = "Dark2",
repel = TRUE,
ggtheme = theme_minimal(),
data = data1_clu_std,
geom = "point"
) +
scale_shape_manual(values = c(15, 16, 17, 18)) +
scale_size_manual(values = c(3, 3, 3, 3)) +
theme(legend.position = "bottom")
Averages1 <- Clustering1$centers
Figure1 <- as.data.frame(Averages1)
Figure1$id <- 1:nrow(Figure1)
library(tidyr)
Figure1 <- pivot_longer(Figure1, cols = c("Spending Score", "Membership Years"))
Figure1$Group <- factor(Figure1$id,
levels = c(1, 2, 3, 4),
labels = c("1", "2", "3", "4"))
Figure1$NameF <- factor(Figure1$name,
levels = c("Spending Score", "Membership Years"),
labels = c("Spending Score", "Membership Years"))
library(ggplot2)
ggplot(Figure1, aes(x = name, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 5, alpha = 0.4) +
geom_line(aes(group = id), linewidth = 1.5) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.5, 2.5) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 12))
data2_clu_std <- as.data.frame(scale(data2[c(5,6)]))
Clustering2 <- kmeans(data2_clu_std,
centers = 4,
nstart = 25)
Clustering2
## K-means clustering with 4 clusters of sizes 128, 128, 112, 132
##
## Cluster means:
## Spending Score Membership Years
## 1 0.8947066 0.9725375
## 2 0.8179644 -0.8780666
## 3 -0.8671055 0.8437603
## 4 -0.9250461 -0.8075260
##
## Clustering vector:
## 510 962 88 132 438 777 788 471 251 203 246 481 574 440 435 626 492 817 131 667
## 4 1 3 1 4 1 1 1 3 3 3 3 3 1 2 3 1 2 1 1
## 478 162 322 692 168 442 276 78 957 819 527 835 95 406 552 379 342 221 184 161
## 4 2 4 3 1 4 4 3 1 3 4 2 1 3 2 1 2 3 3 4
## 504 448 242 181 718 812 918 930 414 407 991 949 273 187 535 171 501 753 601 136
## 3 1 2 2 2 1 1 3 4 2 4 4 1 3 4 4 2 4 3 3
## 79 951 670 890 295 802 505 443 284 595 87 805 651 138 365 803 933 644 754 336
## 4 3 3 2 1 2 1 1 4 1 3 3 4 2 4 2 2 2 1 1
## 739 232 334 987 433 328 876 470 264 779 201 820 729 620 815 707 65 507 29 823
## 4 3 3 3 1 1 1 1 2 2 3 1 1 1 1 4 4 1 1 3
## 206 124 691 263 228 45 332 281 982 632 427 629 577 268 969 327 271 746 781 167
## 4 1 1 1 3 4 2 4 1 4 3 1 2 4 1 3 4 1 4 1
## 255 807 612 71 260 530 623 451 636 734 757 798 841 872 927 46 863 558 719 98
## 1 4 3 2 1 2 2 2 4 3 2 4 1 3 3 4 1 2 3 1
## 434 220 566 679 324 931 721 877 275 687 231 197 831 169 659 960 341 911 320 662
## 1 1 2 3 3 3 1 4 2 4 3 1 4 2 3 1 3 2 4 3
## 2 274 376 639 681 532 881 638 207 56 750 613 394 958 503 647 68 725 419 53
## 2 2 4 1 1 3 2 2 1 4 1 2 1 3 3 4 3 3 4 1
## 922 230 333 677 256 742 318 940 640 745 678 173 702 351 873 266 372 172 298 778
## 2 4 4 1 1 3 4 3 3 1 2 1 1 4 2 1 2 4 4 1
## 797 948 17 536 487 227 897 312 790 444 551 112 93 664 513 690 248 569 469 462
## 2 4 4 4 1 1 1 3 2 4 2 1 1 4 4 4 1 3 1 2
## 4 50 550 488 801 540 887 898 114 956 99 361 58 912 635 405 123 582 652 977
## 1 2 2 1 3 1 4 2 4 2 4 2 4 3 4 4 3 4 3 2
## 400 493 77 695 808 654 529 901 611 363 563 892 15 174 142 416 856 970 489 758
## 2 2 4 1 4 2 4 3 2 4 4 4 4 2 4 2 2 4 4 3
## 18 517 423 42 313 321 63 73 919 383 254 272 787 460 44 92 21 748 656 439
## 4 4 3 2 4 4 3 4 2 4 1 4 1 2 2 1 4 2 3 2
## 699 307 496 983 194 82 945 62 572 586 107 998 146 511 410 370 350 866 521 786
## 2 2 2 1 1 2 2 1 1 2 1 4 4 1 4 1 3 2 2 2
## 584 791 829 921 985 967 868 482 963 103 387 952 914 631 486 855 9 296 145 796
## 3 1 3 1 2 1 2 2 3 3 3 1 1 3 2 4 3 4 3 3
## 514 756 946 772 25 701 565 795 849 247 607 825 408 502 936 288 568 740 216 183
## 4 1 1 1 2 3 1 4 1 3 2 4 1 3 4 2 1 4 4 2
## 531 929 924 472 245 154 675 693 144 735 973 293 728 731 353 22 622 909 830 875
## 3 4 2 4 4 4 3 2 4 3 3 2 4 4 1 1 3 3 4 1
## 454 157 759 253 86 453 984 827 35 749 583 576 47 192 176 28 828 249 971 329
## 4 2 4 3 3 1 3 1 1 4 2 2 3 3 2 3 1 1 4 2
## 773 964 126 776 708 669 616 354 497 579 972 208 939 325 467 896 335 542 140 799
## 3 2 1 4 1 1 2 2 3 3 2 2 2 2 4 4 2 2 2 4
## 75 806 518 150 689 285 432 149 743 950 425 156 495 993 369 683 27 233 198 546
## 1 1 2 3 4 4 2 2 2 4 2 2 4 1 4 1 3 3 1 3
## 257 826 345 215 995 380 592 375 525 1 97 715 874 684 997 32 300 506 628 520
## 4 3 3 2 4 2 4 4 4 2 3 1 2 3 1 2 4 2 2 3
## 213 202 214 989 736 653 449 580 109 133 879 66 907 3 411 70 839 12 713 833
## 3 3 4 2 1 3 4 2 1 1 2 2 4 4 2 3 4 2 3 1
## 587 360 674 269 515 978 599 101 594 641 102 884 385 367 314 484 60 848 643 793
## 2 1 4 1 3 3 2 4 3 4 1 2 1 2 2 4 4 4 3 4
## 556 763 850 366 283 543 436 737 182 730 485 474 43 359 147 491 368 7 642 356
## 2 4 4 1 3 4 2 2 3 1 1 2 3 4 3 2 1 2 3 1
##
## Within cluster sum of squares by cluster:
## [1] 51.67048 54.97523 56.71744 63.80148
## (between_SS / total_SS = 77.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
library(factoextra)
library(ggplot2)
fviz_cluster(Clustering2,
palette = "Dark2",
repel = TRUE,
ggtheme = theme_minimal(),
data = data2_clu_std,
geom = "point"
) +
scale_shape_manual(values = c(15, 16, 17, 18)) +
scale_size_manual(values = c(3, 3, 3, 3)) +
theme(legend.position = "bottom")
Averages2 <- Clustering2$centers
Figure2 <- as.data.frame(Averages2)
Figure2$id <- 1:nrow(Figure2)
library(tidyr)
Figure2 <- pivot_longer(Figure2, cols = c("Spending Score", "Membership Years"))
Figure2$Group <- factor(Figure2$id,
levels = c(1, 2, 3, 4),
labels = c("1", "2", "3", "4"))
Figure2$NameF <- factor(Figure2$name,
levels = c("Spending Score", "Membership Years"),
labels = c("Spending Score", "Membership Years"))
library(ggplot2)
ggplot(Figure2, aes(x = name, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 5, alpha = 0.4) +
geom_line(aes(group = id), linewidth = 1.5) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.5, 2.5) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 12))
Based on the analysis, we clustered 1000 customers into 4 distinct segments using two variables: “Spending Score” and “Membership Years,” which were standardized prior to clustering.
Cluster 1: Represents loyal customers with above-average membership duration and above-average spending scores. This group values long-term engagement and spends consistently. They represent 24.3% of the dataset (243/1000).
Cluster 2: Consists of long-term, conservative spenders with significantly above-average membership years but below-average spending scores. These customers represent 23.9% of the dataset (239/1000). They may value the relationship with the company but exhibit cautious spending habits.
Cluster 3: Represents new or less engaged customers who exhibit very low membership duration and low spending scores. This is the most heterogeneous cluster with the highest within-cluster variability (WSS = 131.4482). This group comprises 26.5% of the dataset (265/1000).
Cluster 4: Consists of newer, opportunistic high spenders with slightly below-average membership years but significantly above-average spending scores. These customers represent 25.3% of the dataset (253/1000).