Customer segmentation divides customers into clusters based on parameters such as demographics, purchasing habits, and purchase frequency. Which parameters to cluster on depends on the business objective or the hypothesis under consideration. There is no single correct way to perform clustering, but meaningful patterns can be observed when the clusters are well chosen.
Unsupervised learning often comes in handy when exploring patterns in the data or forming clusters; methods such as K-Means and hierarchical clustering are commonly used.
Suppose you own a supermarket mall and, through membership cards, you have some basic data about your customers: customer ID, age, gender, annual income, and spending score. The spending score is something you assign to each customer based on parameters you define, such as customer behavior and purchasing data.
Problem Statement: To derive the optimum number of clusters and understand the underlying customer segments based on the data provided. The data contains information on 200 customers and 5 variables: customer ID, gender, age, annual income, and spending score. This dataset is taken from Kaggle’s Mall customer segmentation dataset.
library(ggthemes)
library(tidyverse)
library(grid)
library(gridExtra)
Importing the data
mall_cust <- read_csv('Mall_Customers.csv')
## Parsed with column specification:
## cols(
## CustomerID = col_integer(),
## Gender = col_character(),
## Age = col_integer(),
## `Annual Income (k$)` = col_integer(),
## `Spending Score (1-100)` = col_integer()
## )
head(mall_cust)
## # A tibble: 6 x 5
## CustomerID Gender Age `Annual Income (k$)` `Spending Score (1-100)`
## <int> <chr> <int> <int> <int>
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
#Changing the column names
colnames(mall_cust) <- c('CustomerID', 'Gender', 'Age', 'Annual_Income', 'SpendingScore')
#Changing the variable type
mall_cust$Gender <- as.factor(mall_cust$Gender)
mall_cust$CustomerID <- as.factor(mall_cust$CustomerID)
Exploratory Data Analysis
summary(mall_cust)
## CustomerID Gender Age Annual_Income
## 1 : 1 Female:112 Min. :18.00 Min. : 15.00
## 2 : 1 Male : 88 1st Qu.:28.75 1st Qu.: 41.50
## 3 : 1 Median :36.00 Median : 61.50
## 4 : 1 Mean :38.85 Mean : 60.56
## 5 : 1 3rd Qu.:49.00 3rd Qu.: 78.00
## 6 : 1 Max. :70.00 Max. :137.00
## (Other):194
## SpendingScore
## Min. : 1.00
## 1st Qu.:34.75
## Median :50.00
## Mean :50.20
## 3rd Qu.:73.00
## Max. :99.00
##
Missing values
cat("There are", sum(is.na(mall_cust)), "N/A values.")
## There are 0 N/A values.
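A single total does not show which column any gaps would come from. A minimal sketch of a per-column count, using a small hypothetical data frame (`toy` and its values are illustrative and not taken from the mall data):

```r
# Hypothetical data frame with two missing values (illustrative only)
toy <- data.frame(
  Age = c(19, NA, 20),
  Annual_Income = c(15, 15, NA),
  SpendingScore = c(39, 81, 6)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

The same call on `mall_cust` would return one count per variable instead of a single total.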
Hypothesis: Customers can be grouped according to their spending score given their income
Boxplot of Annual Income and Spending Score
p1 <- ggplot(mall_cust, aes(y=Annual_Income))+
geom_boxplot(fill='Gray')+
theme_bw()+
ggtitle('Boxplot of Annual Income')
p2 <- ggplot(mall_cust, aes(y=SpendingScore))+
geom_boxplot(fill='Gray')+
theme_bw()+
ggtitle('Boxplot of Spending Score')
grid.arrange(p1,p2, ncol=2)
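Comment: No extreme outliers detected. Spending Score has none. Annual Income has a maximum of 137, just above the upper whisker limit of 132.75 (Q3 + 1.5 × IQR = 78 + 1.5 × 36.5), which is mild enough to keep.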
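The boxplot whiskers follow the usual 1.5 × IQR rule, so the comment can be checked against the quartiles printed by summary(mall_cust) earlier. A minimal sketch for Annual_Income, with the quartile values copied from that output (the variable names here are illustrative):

```r
# Quartiles and extremes of Annual_Income, copied from the summary() output
q1 <- 41.50
q3 <- 78.00
income_min <- 15
income_max <- 137

iqr <- q3 - q1                 # interquartile range
lower_fence <- q1 - 1.5 * iqr  # values below this are drawn as outlier points
upper_fence <- q3 + 1.5 * iqr  # values above this are drawn as outlier points

cat("Fences:", lower_fence, "to", upper_fence, "\n")
cat("Max above upper fence:", income_max > upper_fence, "\n")
cat("Min below lower fence:", income_min < lower_fence, "\n")
```

With these numbers the upper fence is 132.75, so an income of 137 sits just above it, while nothing falls below the lower fence.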
K-Means Clustering
Kdata <- mall_cust[,c(4,5)]
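#Elbow curve to estimate number of clusters
#kmeans() picks random starting centers, so fix the seed to make the curve reproducible
set.seed(12345)
tot.withinss <- vector("numeric", length = 10)
for (i in 1:10){
  kDet <- kmeans(Kdata, i)
  tot.withinss[i] <- kDet$tot.withinss
}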
ggplot(as.data.frame(tot.withinss), aes(x = seq(1,10), y = tot.withinss)) +
  geom_point(col = "#F8766D") +
  geom_line(col = "#F8766D") +
  scale_x_continuous(breaks = 1:10) + #one tick per candidate K; the x-axis title is no longer blanked out
  ylab("Within-cluster Sum of Squares") +
  xlab("Number of Clusters") +
  ggtitle("Elbow K Estimation")
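Because kmeans() starts from randomly chosen centers, a single run per K can land in a poor local optimum and make the elbow harder to read. A minimal sketch of a more stable variant, assuming synthetic two-column data in place of Kdata (`toy`, the seed, and `nstart = 25` are illustrative choices, not values from this analysis):

```r
set.seed(42)

# Hypothetical stand-in for Kdata: three well-separated 2-D blobs
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 1), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 1), ncol = 2),
  matrix(rnorm(100, mean = 20, sd = 1), ncol = 2)
)

# For each K, run kmeans from 25 random starts and keep the best fit
wss <- sapply(1:10, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

print(round(wss, 1))
```

The `nstart` argument makes kmeans() try several random initialisations and return the one with the lowest total within-cluster sum of squares, which usually gives a smoother curve.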
K = 5 seems ideal in this case, so we are going to create 5 clusters.
set.seed(12345)
customerClusters <- kmeans(Kdata, 5)
## Visualizing the clusters
ggplot(Kdata, aes(x = Annual_Income, y = SpendingScore)) +
  geom_point(stat = "identity", aes(color = as.factor(customerClusters$cluster))) +
  scale_color_discrete(name=" ",
                       breaks=c("1", "2", "3", "4", "5"),
                       labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5")) + #labels follow the cluster numbers kmeans() reports
  ggtitle("Mall Customer Segments", subtitle = "K-means Clustering")
The resulting clusters are as follows:
customerClusters
## K-means clustering with 5 clusters of sizes 23, 35, 81, 39, 22
##
## Cluster means:
## Annual_Income SpendingScore
## 1 26.30435 20.91304
## 2 88.20000 17.11429
## 3 55.29630 49.51852
## 4 86.53846 82.12821
## 5 25.72727 79.36364
##
## Clustering vector:
## [1] 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1
## [36] 5 1 5 1 5 1 5 1 3 1 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [71] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 2 4 3 4 2 4 2 4 3 4 2 4 2 4 2 4
## [141] 2 4 3 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2
## [176] 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4
##
## Within cluster sum of squares by cluster:
## [1] 5098.696 12511.143 9875.111 13444.051 3519.455
## (between_SS / total_SS = 83.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
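The cluster means printed above can be turned into readable segment descriptions. A minimal sketch, with the centers copied from that output; the cut-offs (40 and 70 for income, 40 and 60 for spending score) and the helper `level()` are assumptions chosen for illustration, not part of the original analysis:

```r
# Cluster centers copied from the kmeans output
centers <- data.frame(
  cluster       = 1:5,
  Annual_Income = c(26.30435, 88.20000, 55.29630, 86.53846, 25.72727),
  SpendingScore = c(20.91304, 17.11429, 49.51852, 82.12821, 79.36364)
)

# Hypothetical helper: bucket a value as low / medium / high using assumed cut-offs
level <- function(x, low, high) {
  ifelse(x < low, "low", ifelse(x > high, "high", "medium"))
}

centers$income_level   <- level(centers$Annual_Income, low = 40, high = 70)
centers$spending_level <- level(centers$SpendingScore, low = 40, high = 60)
centers$segment <- paste(centers$income_level, "income /",
                         centers$spending_level, "spending")

print(centers[, c("cluster", "segment")])
```

Under these assumed cut-offs, clusters 1 and 5 are the low-income groups (low and high spending respectively), clusters 2 and 4 are the high-income groups (low and high spending respectively), and cluster 3 is the middle group on both measures.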