Mall Customer Segmentation
In this project I apply k-means clustering to a dataset of mall customers. The goal is to identify groups of customers so that the mall can adapt its marketing strategies to each group. The dataset was found on Kaggle at https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python/data.
Preliminary
# libraries
library(caret)
library(DescTools)
library(ggplot2)
library(cluster)
library(factoextra)
library(fpc)
library(dplyr)
library(gridExtra)
## [1] 200 5
## spc_tbl_ [200 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ CustomerID : num [1:200] 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr [1:200] "Male" "Male" "Female" "Female" ...
## $ Age : num [1:200] 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual Income (k$) : num [1:200] 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending Score (1-100): num [1:200] 39 81 6 77 40 76 6 94 3 72 ...
## - attr(*, "spec")=
## .. cols(
## .. CustomerID = col_double(),
## .. Gender = col_character(),
## .. Age = col_double(),
## .. `Annual Income (k$)` = col_double(),
## .. `Spending Score (1-100)` = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## CustomerID Gender Age Annual Income (k$)
## Min. : 1.00 Length:200 Min. :18.00 Min. : 15.00
## 1st Qu.: 50.75 Class :character 1st Qu.:28.75 1st Qu.: 41.50
## Median :100.50 Mode :character Median :36.00 Median : 61.50
## Mean :100.50 Mean :38.85 Mean : 60.56
## 3rd Qu.:150.25 3rd Qu.:49.00 3rd Qu.: 78.00
## Max. :200.00 Max. :70.00 Max. :137.00
## Spending Score (1-100)
## Min. : 1.00
## 1st Qu.:34.75
## Median :50.00
## Mean :50.20
## 3rd Qu.:73.00
## Max. :99.00
In summary, there are 200 data points and 5 variables in this dataset. The 5 variables are CustomerID, Gender, Age (from 18 to 70), Annual Income (k$) and Spending Score (1-100).
Checking for missing values:
## [1] 0
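The code behind the outputs above is not shown. A minimal sketch of such checks follows, using only the first four rows visible in the `str()` output as a stand-in. The name `mall_head` and the CSV file name in the comment are assumptions, not taken from the original.

```r
# Stand-in for the full data set: the first four rows shown in the str() output.
# With the real file, `mall` would come from something like
# readr::read_csv("Mall_Customers.csv") (file name assumed).
mall_head <- data.frame(
  CustomerID = c(1, 2, 3, 4),
  Gender = c("Male", "Male", "Female", "Female"),
  Age = c(19, 21, 20, 23),
  `Annual Income (k$)` = c(15, 15, 16, 16),
  `Spending Score (1-100)` = c(39, 81, 6, 77),
  check.names = FALSE  # keep the original column names with spaces
)

dim(mall_head)       # number of rows and columns
str(mall_head)       # column types
summary(mall_head)   # per-column summaries

# Total number of missing cells across the whole data frame
n_missing <- sum(is.na(mall_head))
n_missing
```

Running the same four calls on the full `mall` data frame should give output of the kind shown above.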
Data Exploration
- Gender
- Age
- Age distribution between genders
Female customers predominate between 22 and 55 years of age, while men predominate in the 55-70 age group.
Based on gender and age alone, there are 4 visible groups. The age distributions of both men and women are roughly bimodal:
women have 2 peaks: one at ~32 years and another at ~47,
men have a peak at ~33 years and a smaller one at ~62.
- Annual income and spending score boxplots
# Pass the data frame once and refer to columns by name inside aes()
p1 <- ggplot(mall, aes(y = `Annual Income (k$)`)) + geom_boxplot(fill = '#2FA4FF')
p2 <- ggplot(mall, aes(y = `Spending Score (1-100)`)) + geom_boxplot(fill = '#00AB08')
grid.arrange(p1, p2, ncol = 2)
- Age vs Spending Score
plot(mall$Age, mall$'Spending Score (1-100)', col = "red", xlab = "Age", ylab = "Spending Score", main = "Age VS Spending Score")
There is no clear relationship between age and spending score. The one notable pattern is that customers older than 40 have lower spending scores on average.
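The observation above can also be checked numerically. The sketch below uses only the first ten customers listed in the `str()` output, so it illustrates the calculation rather than reproducing the full-data result. The variable names are hypothetical.

```r
# First ten customers from the str() output (a stand-in, not the full data)
age      <- c(19, 21, 20, 23, 31, 22, 35, 23, 64, 30)
spending <- c(39, 81, 6, 77, 40, 76, 6, 94, 3, 72)

# Pearson correlation between age and spending score
r <- cor(age, spending)

# Mean spending score for customers aged 40 or under versus over 40
group <- ifelse(age > 40, "over 40", "40 or under")
means_by_group <- tapply(spending, group, mean)

r
means_by_group
```

On the full data, the same two calls applied to `mall$Age` and the spending score column would quantify what the scatter plot suggests.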
- Income vs Spending Score
plot(mall$'Annual Income (k$)', mall$'Spending Score (1-100)', col = "red", xlab = "Annual Income", ylab = "Spending Score", main = "Annual Income VS Spending Score")
Based on this graph, we see 5 clusters:
low income with low spending scores
low income with high spending scores
relatively medium income, medium spending scores
higher income, high spending scores
higher income, low spending scores
We will begin our analysis with 5 clusters.
k-Means Clustering
K-means clustering is a popular unsupervised machine learning algorithm used for grouping similar data points into clusters. K-means clusters are formed by partitioning a dataset into K distinct groups, where K is a user-defined parameter. The algorithm works iteratively to minimize the within-cluster variation, which means that data points within the same cluster are as similar as possible. It does this by assigning data points to the nearest cluster center (centroid) and then recalculating the centroids based on the current assignments.
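The two alternating steps described above (assign each point to its nearest centroid, then recompute the centroids) can be sketched in a few lines of base R. This is an illustration on made-up toy data with hand-picked starting centroids, not the routine used later in this project, which relies on the built-in `kmeans()` function.

```r
set.seed(1)
# Toy data: two well-separated groups of 2-D points (illustrative only)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

k <- 2
# Start from one row of each half of the toy data, so that no cluster
# ends up empty in this sketch
centers <- x[c(1, 21), , drop = FALSE]

for (iter in 1:20) {
  # Assignment step: squared Euclidean distance from every point to every centroid
  d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  cluster <- max.col(-d, ties.method = "first")  # index of the nearest centroid

  # Update step: each centroid becomes the mean of its assigned points
  new_centers <- t(sapply(seq_len(k), function(j)
    colMeans(x[cluster == j, , drop = FALSE])))

  if (all(abs(new_centers - centers) < 1e-12)) break  # assignments are stable
  centers <- new_centers
}

# Total within-cluster sum of squares, the quantity k-means tries to minimize
wss <- sum((x - centers[cluster, ])^2)
wss
```

The built-in `kmeans()` uses a more refined algorithm by default (Hartigan-Wong) and supports several random restarts through `nstart`, which is why it is preferred in practice.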
To run k-means, we drop CustomerID, since it is only an identifier and carries no information for the analysis. K-means relies on Euclidean distances and therefore works best with continuous variables, so we remove the categorical Gender variable as well.
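The code for this step is not shown in the original. One way it could look is sketched below on a four-row stand-in taken from the `str()` output. The names `mall_head` and `Kdata_head` are hypothetical; with the full data, the same selection on `mall` would produce `Kdata`.

```r
# Stand-in for `mall`: the first four rows shown in the str() output
mall_head <- data.frame(
  CustomerID = c(1, 2, 3, 4),
  Gender = c("Male", "Male", "Female", "Female"),
  Age = c(19, 21, 20, 23),
  `Annual Income (k$)` = c(15, 15, 16, 16),
  `Spending Score (1-100)` = c(39, 81, 6, 77),
  check.names = FALSE  # keep the original column names with spaces
)

# Keep only the three numeric features used for clustering,
# which drops CustomerID and Gender
Kdata_head <- mall_head[, c("Age", "Annual Income (k$)", "Spending Score (1-100)")]
names(Kdata_head)
```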
For better results, we scale the data. Clustering algorithms rely on distances between data points. Scaling ensures that the distances are meaningful and are not influenced by the scale of the features.
Kdata <- scale(Kdata)
Kdata <- as.data.frame(Kdata) # we need Kdata to be a data frame to identify the customers later
head(Kdata)
As mentioned above, we will use 5 as the initial number of clusters, since we saw 5 distinct groups in the Annual Income vs Spending Score plot.
However, to determine a good number of clusters, we can compute the total within-cluster sum of squares (the sum of squared distances between data points and their cluster centroid) for a range of cluster counts. This is known as the Elbow method. The goal is to find the point where the curve bends and the decrease levels off.
wss <- numeric(10)  # total within-cluster sum of squares for n = 1..10
for (n in 1:10) {
  # nstart = 10: start from 10 random sets of centroids and keep the best run
  km <- kmeans(Kdata, centers = n, nstart = 10, iter.max = 50)
  wss[n] <- km$tot.withinss
}
options(repr.plot.width = 5, repr.plot.height = 5)
plot(wss, type = "b", xlab = "Number of clusters (n)", ylab = "Sum of squares within groups")
From this graph, it seems like 6 clusters is the optimal point. We will try 5 clusters, then 6.
## K-means clustering with 5 clusters of sizes 54, 20, 39, 47, 40
##
## Cluster means:
## Age Annual Income (k$) Spending Score (1-100)
## 1 -0.97822376 -0.7411999 0.46627028
## 2 0.52974416 -1.2872781 -1.23337167
## 3 0.07314728 0.9725047 -1.19429976
## 4 1.20182469 -0.2351832 -0.05223672
## 5 -0.42773261 0.9724070 1.21304137
##
## Clustering vector:
## [1] 1 1 2 1 1 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
## [38] 1 2 1 4 1 2 1 2 1 4 1 1 1 4 1 1 4 4 4 4 4 1 4 4 1 4 4 4 1 4 4 1 1 4 4 4 4
## [75] 4 1 4 4 1 4 4 1 4 4 1 4 4 1 1 4 4 1 4 4 1 1 4 1 4 1 1 4 4 1 4 1 4 4 4 4 4
## [112] 1 3 1 1 1 4 4 4 4 1 3 5 5 3 5 3 5 3 5 3 5 3 5 3 5 3 5 3 5 3 5 3 5 3 5 3 5
## [149] 3 5 3 5 3 5 3 5 3 5 3 5 4 5 3 5 3 5 3 5 3 5 3 5 3 5 3 5 3 5 3 5 3 5 3 5 3
## [186] 5 3 5 3 5 3 5 3 5 3 5 3 5 3 5
##
## Within cluster sum of squares by cluster:
## [1] 51.85673 18.58760 46.38992 26.65665 23.91544
## (between_SS / total_SS = 72.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
From the results, we can see two important figures:
Within Cluster Sum of Squares by Cluster (WSS): This is also known as “Intra-Cluster Variance”. WSS tells us how “close” or “spread out” the customers are within each cluster. In the output above, the WSS for clusters 1 and 3 (51.9 and 46.4) is noticeably higher than for the other clusters (18.6 to 26.7).
Between_SS / Total_SS: This ratio tells us what portion of the total variability is explained by the differences between the clusters. Values closer to 1 mean that most of the variability lies between clusters rather than within them, which is what we want. For 5 clusters, this value is 72%.
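To make the two figures concrete, the sketch below recomputes them by hand from a fitted model and compares them with the components `kmeans()` reports (`totss`, `withinss`, `betweenss`). The data are made-up toy points, not the mall data.

```r
set.seed(42)
# Toy data: three groups of 2-D points (illustrative only)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2),
           matrix(rnorm(60, mean = 8), ncol = 2))
km <- kmeans(x, centers = 3, nstart = 10)

# Total sum of squares: squared distances of all points to the overall mean
totss <- sum(sweep(x, 2, colMeans(x))^2)

# Within-cluster sum of squares: squared distances to each point's own centroid
withinss <- sapply(1:3, function(j) {
  pts <- x[km$cluster == j, , drop = FALSE]
  sum(sweep(pts, 2, km$centers[j, ])^2)
})

# Between-cluster sum of squares is what remains of the total
betweenss <- totss - sum(withinss)

# The "between_SS / total_SS" figure from the printed output
ratio <- betweenss / totss
ratio
```

Because the total always splits into a within part and a between part, a lower total WSS and a higher between_SS / total_SS are two views of the same improvement.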
Visualizing the clusters:
The clusters are well separated from each other, except for cluster 4, which overlaps considerably with cluster 3.
Next we try 6 clusters:
## K-means clustering with 6 clusters of sizes 21, 39, 33, 45, 24, 38
##
## Cluster means:
## Age Annual Income (k$) Spending Score (1-100)
## 1 0.4777583 -1.3049552 -1.19344867
## 2 -0.4408110 0.9891010 1.23640011
## 3 0.2211606 1.0805138 -1.28682305
## 4 1.2515802 -0.2396117 -0.04388764
## 5 -0.9735839 -1.3221791 1.03458649
## 6 -0.8709130 -0.1135003 -0.09334615
##
## Clustering vector:
## [1] 5 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1
## [38] 5 1 5 4 5 1 5 1 5 4 6 6 6 4 6 6 4 4 4 4 4 6 4 4 6 4 4 4 6 4 4 6 6 4 4 4 4
## [75] 4 6 4 6 6 4 4 6 4 4 6 4 4 6 6 4 4 6 4 6 6 6 4 6 4 6 6 4 4 6 4 6 4 4 4 4 4
## [112] 6 6 6 6 6 4 4 4 4 6 6 6 2 6 2 3 2 3 2 3 2 6 2 3 2 3 2 6 2 3 2 6 2 3 2 3 2
## [149] 3 2 3 2 3 2 3 2 3 2 3 2 4 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
## [186] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2
##
## Within cluster sum of squares by cluster:
## [1] 20.52332 22.36267 34.51630 23.87015 11.71664 20.20990
## (between_SS / total_SS = 77.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The WSS values are relatively low, and their total is lower than in the first model. BSS/TSS is also higher, at 77.7%.
These clusters are well defined, with slight overlap between clusters 2 and 6.
The larger the number of clusters, the better the WSS and BSS/TSS will turn out. However, we don’t want to create too many clusters, because that would be inefficient from a marketing perspective - the larger the number of clusters the more groups you need to cater to.
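The first point above can be seen even on data with no cluster structure at all. In the sketch below, which uses made-up random points, the between_SS / total_SS ratio keeps rising as the number of clusters grows, so on its own it cannot tell us when to stop.

```r
set.seed(7)
# Toy data (illustrative only): 60 unstructured points in two dimensions
x <- matrix(rnorm(120), ncol = 2)

# Fit k-means for an increasing number of clusters and record the ratio
ks <- c(2, 5, 10, 20)
ratio <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25, iter.max = 50)
  km$betweenss / km$totss
})

data.frame(k = ks, between_over_total = round(ratio, 3))
```

This is why the choice of 6 clusters here rests on the elbow plot and on what is practical for marketing, not on the ratio alone.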
We can go ahead and identify the customers in each cluster.
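mall$kmeans <- k2$cluster  # k2 is assumed to be the 6-cluster model fitted above
mall <- mall %>%
  mutate(Gender = ifelse(Gender == "Female", 1, 0))  # female: 1, male: 0
mall_clusters <- mall %>%
  group_by(kmeans) %>%
  summarise(Age_mean = mean(Age),
            Income_mean = mean(`Annual Income (k$)`),
            SpenScore_mean = mean(`Spending Score (1-100)`),
            Gender = mean(Gender),  # share of female customers in the cluster
            Count = n())            # number of customers in the cluster
mall_clusters
The 6 clusters are as follows (k-means numbers its clusters arbitrarily, so these labels need not match the numbering in the model output printed above):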
Cluster 1: Early adults, higher income, with high spending scores
Cluster 2: Early adults, medium income, medium spending scores
Cluster 3: Middle aged, low income, low spending scores
Cluster 4: Early adults, low income, high spending scores
Cluster 5: Middle aged, higher income, low spending scores - the only group where men are predominant
Cluster 6: Late middle age, medium income, medium spending scores - largest group.
Conclusion
At this point, we can give some suggestions for the mall owner:
They can target Cluster 5 (high average income, low spending) with incentives to spend more money;
They can invest in personalizing the buying experience for Cluster 1, since this group has the highest average spending scores and a high income.