Customer Segmentation using K-Means Clustering

Why is Customer Segmentation done?

Customer segmentation divides customers into clusters based on parameters such as demographics, purchasing habits, and frequency of purchases. Which parameters to cluster on depends on the business objective or the hypothesis under consideration. There is no single correct way to perform clustering, but meaningful patterns can emerge when the clusters are well chosen.

Unsupervised learning often comes in handy for exploring patterns in data or forming clusters; methods such as K-Means and hierarchical clustering are commonly used.

Overview

You own a supermarket mall and, through membership cards, you have some basic data about your customers: customer ID, age, gender, annual income, and spending score. The spending score is something you assign to each customer based on parameters you define, such as customer behavior and purchasing data.

Problem Statement: To derive the optimum number of clusters and understand the underlying customer segments based on the data provided. The data covers 200 customers and 5 variables: customer ID, gender, age, annual income, and spending score. This dataset is taken from Kaggle’s Mall Customer Segmentation dataset.

Code

library(ggthemes)
library(tidyverse)
library(grid)
library(gridExtra)

Importing the data

mall_cust <- read_csv('Mall_Customers.csv')

## Parsed with column specification:
## cols(
##   CustomerID = col_integer(),
##   Gender = col_character(),
##   Age = col_integer(),
##   `Annual Income (k$)` = col_integer(),
##   `Spending Score (1-100)` = col_integer()
## )

head(mall_cust)

## # A tibble: 6 x 5
##   CustomerID Gender   Age `Annual Income (k$)` `Spending Score (1-100)`
##        <int> <chr>  <int>                <int>                    <int>
## 1          1 Male      19                   15                       39
## 2          2 Male      21                   15                       81
## 3          3 Female    20                   16                        6
## 4          4 Female    23                   16                       77
## 5          5 Female    31                   17                       40
## 6          6 Female    22                   17                       76

#Changing the column names
colnames(mall_cust) <- c('CustomerID', 'Gender', 'Age', 'Annual_Income', 'SpendingScore')

#Changing the variable type
mall_cust$Gender <- as.factor(mall_cust$Gender)
mall_cust$CustomerID <- as.factor(mall_cust$CustomerID)

Exploratory Data Analysis

summary(mall_cust)

##    CustomerID     Gender         Age        Annual_Income   
##  1      :  1   Female:112   Min.   :18.00   Min.   : 15.00  
##  2      :  1   Male  : 88   1st Qu.:28.75   1st Qu.: 41.50  
##  3      :  1                Median :36.00   Median : 61.50  
##  4      :  1                Mean   :38.85   Mean   : 60.56  
##  5      :  1                3rd Qu.:49.00   3rd Qu.: 78.00  
##  6      :  1                Max.   :70.00   Max.   :137.00  
##  (Other):194                                                
##  SpendingScore  
##  Min.   : 1.00  
##  1st Qu.:34.75  
##  Median :50.00  
##  Mean   :50.20  
##  3rd Qu.:73.00  
##  Max.   :99.00  
##

Missing values

cat("There are", sum(is.na(mall_cust)), "N/A values.")

## There are 0 N/A values.
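
A single total does not show where missing values would sit if there were any. The following is a minimal sketch of a per-column and per-row breakdown; the `toy` data frame is made up for illustration (it is not the mall data) and has one NA inserted on purpose.

```r
# Toy stand-in for mall_cust with one missing value added on purpose
toy <- data.frame(
  Age           = c(19, 21, NA, 23),
  Annual_Income = c(15, 15, 16, 16),
  SpendingScore = c(39, 81, 6, 77)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Row indices that contain at least one missing value
rows_with_na <- which(!complete.cases(toy))

print(na_per_column)
print(rows_with_na)
```

With the real data, replacing `toy` by `mall_cust` should give all zeros, in line with the total reported above.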

Hypothesis: Customers can be grouped according to their spending score given their income

Boxplot of Annual Income and Spending Score

p1 <- ggplot(mall_cust, aes(y=Annual_Income))+
  geom_boxplot(fill='Gray')+
  theme_bw()+
  ggtitle('Boxplot of Annual Income')

p2 <- ggplot(mall_cust, aes(y=SpendingScore))+
  geom_boxplot(fill='Gray')+
  theme_bw()+
  ggtitle('Boxplot of Spending Score')

grid.arrange(p1,p2, ncol=2)

Comment: No extreme outliers detected, although the maximum annual income (137) lies just beyond the 1.5 × IQR upper whisker limit.
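
As a numeric cross-check on the boxplots, the conventional 1.5 × IQR whisker rule can be applied to the quartiles printed by summary() earlier. This is only a sketch: the helper name `iqr_fences` is made up for illustration, and geom_boxplot() computes its hinges slightly differently from summary(), so the limits should be treated as approximate.

```r
# Hypothetical helper: whisker limits under the k * IQR rule
iqr_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles copied from the summary(mall_cust) output
income_fences <- iqr_fences(q1 = 41.50, q3 = 78.00)
score_fences  <- iqr_fences(q1 = 34.75, q3 = 73.00)

# Compare the observed maxima (137 and 99) against the upper limits
income_max_flagged <- 137 > income_fences[["upper"]]
score_max_flagged  <- 99  > score_fences[["upper"]]

print(income_fences)
print(score_fences)
```

Under these quartiles the upper limit for annual income is 132.75, so the maximum of 137 falls just outside it, while the spending score range (1 to 99) sits well inside its limits.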

K-Means Clustering

Kdata <- mall_cust[,c(4,5)]

#Elbow curve to estimate number of clusters
set.seed(12345)  # kmeans() starts from random centers; fix the seed so the curve is reproducible
tot.withinss <- vector("numeric", length = 10)
for (i in 1:10){
  kDet <- kmeans(Kdata, i)
  tot.withinss[i] <- kDet$tot.withinss
}

#Plot total within-cluster sum of squares against k
elbow <- data.frame(k = 1:10, tot.withinss = tot.withinss)
ggplot(elbow, aes(x = k, y = tot.withinss)) + 
  geom_point(col = "#F8766D") +    
  geom_line(col = "#F8766D") + 
  scale_x_continuous(breaks = 1:10) +
  ylab("Within-cluster Sum of Squares") +
  xlab("Number of Clusters") +
  ggtitle("Elbow K Estimation")

K = 5 seems ideal in this case, so we are going to create 5 clusters.
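
Reading an elbow by eye is subjective, so a second criterion can help confirm the choice of K. One common option is the average silhouette width, sketched below with silhouette() from the `cluster` package. This is not part of the original analysis: the `toy` data frame is synthetic (five invented blobs standing in for `Kdata`), and the range of K values and `nstart = 25` are assumptions.

```r
library(cluster)  # provides silhouette()

# Synthetic stand-in for Kdata: five well-separated blobs (NOT the mall data)
set.seed(1)
blob_centers <- rbind(c(25, 20), c(25, 80), c(55, 50), c(88, 17), c(87, 82))
toy <- data.frame(
  Annual_Income = rep(blob_centers[, 1], each = 40) + rnorm(200, sd = 4),
  SpendingScore = rep(blob_centers[, 2], each = 40) + rnorm(200, sd = 4)
)

d <- dist(toy)    # pairwise Euclidean distances, computed once
k_values <- 2:8   # silhouette width is not defined for a single cluster

# Average silhouette width for each candidate number of clusters
avg_sil <- sapply(k_values, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})

# Candidate K with the highest average silhouette width
best_k <- k_values[which.max(avg_sil)]
print(data.frame(k = k_values, avg_sil = round(avg_sil, 3)))
print(best_k)
```

With the real data, `Kdata` would replace `toy`; if the K with the highest average width agrees with the elbow, that supports the choice.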

set.seed(12345)
customerClusters <- kmeans(Kdata, 5)


## Visualizing the clusters 
ggplot(Kdata, aes(x = Annual_Income, y = SpendingScore)) + 
  geom_point(stat = "identity", aes(color = as.factor(customerClusters$cluster))) +
  # Legend labels follow the cluster numbers assigned by kmeans()
  scale_color_discrete(name=" ",
                       breaks=c("1", "2", "3", "4", "5"),
                       labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5")) +
  ggtitle("Mall Customer Segments", subtitle = "K-means Clustering")

The resulting clusters are as follows:

customerClusters

## K-means clustering with 5 clusters of sizes 23, 35, 81, 39, 22
## 
## Cluster means:
##   Annual_Income SpendingScore
## 1      26.30435      20.91304
## 2      88.20000      17.11429
## 3      55.29630      49.51852
## 4      86.53846      82.12821
## 5      25.72727      79.36364
## 
## Clustering vector:
##   [1] 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1
##  [36] 5 1 5 1 5 1 5 1 3 1 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [71] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 2 4 3 4 2 4 2 4 3 4 2 4 2 4 2 4
## [141] 2 4 3 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2
## [176] 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4
## 
## Within cluster sum of squares by cluster:
## [1]  5098.696 12511.143  9875.111 13444.051  3519.455
##  (between_SS / total_SS =  83.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Based on the cluster means printed above, the segments are:

Low income and low spending score: Cluster 1
Low income and high spending score: Cluster 5
Mid-level income and mid-level spending score: Cluster 3
High income and low spending score: Cluster 2
High income and high spending score: Cluster 4
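
Because K-means numbers its clusters arbitrarily and the numbering can change with the seed, it is safer to derive the segment descriptions from a per-cluster summary than to hard-code them. A minimal sketch on synthetic data follows; the `toy` frame and its blob centers are invented for illustration, and with the real data `mall_cust` and `customerClusters$cluster` would take their place.

```r
# Synthetic stand-in for the two clustering features (NOT the mall data)
set.seed(42)
blob_centers <- rbind(c(25, 20), c(25, 80), c(55, 50), c(88, 17), c(87, 82))
toy <- data.frame(
  Annual_Income = rep(blob_centers[, 1], each = 30) + rnorm(150, sd = 4),
  SpendingScore = rep(blob_centers[, 2], each = 30) + rnorm(150, sd = 4)
)

# Fit K-means first, then attach the cluster label to each row
fit <- kmeans(toy, centers = 5, nstart = 25)
toy$cluster <- factor(fit$cluster)

# Mean income and spending score per cluster, plus cluster sizes
profile <- aggregate(cbind(Annual_Income, SpendingScore) ~ cluster,
                     data = toy, FUN = mean)
profile$n <- as.vector(table(toy$cluster))

# Sort by income, then spending, to make the segments easy to read off
profile <- profile[order(profile$Annual_Income, profile$SpendingScore), ]
print(profile)
```

Each row of `profile` can then be described in words (for example, low income and high spending score) regardless of which number K-means happened to give that cluster.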

Menali

May 4, 2019
