INTRODUCTION

Customer segmentation is an important tool for businesses and other organizations. It helps business owners and marketers identify the customer groups listed below, which need attention and more efficient strategies if customers are to be satisfied. Satisfied customers tend to spend more on products and services and to recommend the business to others.
  • High spenders
  • One timers
  • Location
  • Loyal Customers
  • Inactive Customers
  • Coupon lovers
  • And many more
I will demonstrate this using an unsupervised machine learning technique, the k-means clustering algorithm. Consider a supermarket mall owner who has some basic data about customers: customer ID, age, gender, annual income and spending score. The spending score is a value assigned to each customer based on defined parameters such as customer behavior and purchasing data.
This data helps to understand the customers and to predict target customers, so that the marketing team can create strategies and plan accordingly.

OVERVIEW OF DATASET

The dataset used in this paper is available on the Kaggle website and is used here with a few modifications. It contains basic information about the customers of a particular shopping mall, including their annual income and spending score. There are 200 observations, one per customer, with five variables each. The columns are listed below.
  • CustomerID
  • Gender
  • Age
  • Annual Income
  • Spending Score
# loading the dataset
mallCust <- read.csv("Mall_Customers.csv", header = TRUE)
head(mallCust, 5)
##   CustomerID Gender Age Income SpendingScore
## 1          1   Male  19     15            39
## 2          2   Male  21     15            81
## 3          3 Female  20     16             6
## 4          4 Female  23     16            77
## 5          5 Female  31     17            40
# dimensions of the dataset (rows, columns)
dim(mallCust)
## [1] 200   5
# structure of the dataset (column names and types)
str(mallCust)
## 'data.frame':    200 obs. of  5 variables:
##  $ CustomerID   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender       : chr  "Male" "Male" "Female" "Female" ...
##  $ Age          : int  19 21 20 23 31 22 35 23 64 30 ...
##  $ Income       : int  15 15 16 16 17 17 18 18 19 19 ...
##  $ SpendingScore: int  39 81 6 77 40 76 6 94 3 72 ...

Cleaning the Dataset

Now, I will check whether there are any missing values and, if so, remove them, as they may interfere with the data structure and cause inconsistencies in the results.
sum(is.na(mallCust))
## [1] 0
Since the result shows that there are no observations with NAs, I will go ahead with the dataset as it is.
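The total above does not show which column a missing value would come from. A minimal sketch of a per-column check follows. It assumes a small stand-in data frame, mallSample (a hypothetical name), built from the five rows printed by head(mallCust, 5) above. On the full data the same calls would be applied to mallCust.

```r
# Stand-in data: the five rows printed by head(mallCust, 5) above.
# mallSample is a hypothetical name used only for this sketch.
mallSample <- data.frame(
  CustomerID    = 1:5,
  Gender        = c("Male", "Male", "Female", "Female", "Female"),
  Age           = c(19, 21, 20, 23, 31),
  Income        = c(15, 15, 16, 16, 17),
  SpendingScore = c(39, 81, 6, 77, 40)
)

# Number of missing values in each column
naPerColumn <- colSums(is.na(mallSample))
print(naPerColumn)

# Number of rows with no missing value in any column
completeRows <- sum(complete.cases(mallSample))
print(completeRows)
```

If any count were above zero, na.omit() would be one way to drop the affected rows before clustering.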

Statistics and Exploratory Data Analysis

Basic statistical analysis helps one to properly understand the dataset being analyzed. I run an analysis on the data to obtain basic information about every variable, such as the measures of location (mode, median and mean) and the minimum and maximum observations.
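As a rough illustration of these statistics, the sketch below uses a stand-in data frame built from the five rows printed by head(mallCust, 5) earlier. The names mallSample and statMode are hypothetical. Base R has no function for the statistical mode, so a small helper is defined here as an assumption, not as the method used in this paper.

```r
# Stand-in data: the numeric columns of the five rows printed by head(mallCust, 5).
# mallSample is a hypothetical name used only for this sketch.
mallSample <- data.frame(
  Age           = c(19, 21, 20, 23, 31),
  Income        = c(15, 15, 16, 16, 17),
  SpendingScore = c(39, 81, 6, 77, 40)
)

# Minimum, quartiles, median, mean and maximum for every column
print(summary(mallSample))

# Hypothetical helper: the most frequent value of a numeric vector.
# Ties are resolved in favour of the smallest value.
statMode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

modes <- sapply(mallSample, statMode)
print(modes)
```

On the full dataset, summary(mallCust[, 3:5]) would give the same kind of table for all 200 customers.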
I use ggplot2 for visualisation and exploratory data analysis, and cowplot to arrange plots side by side.
library(ggplot2)
library(cowplot)
Age - Gender variable
ageHist <- ggplot(mallCust, aes(Age, fill=Gender)) +
          geom_histogram(bins = 20) +
          ggtitle("Distribution of data per Age and Gender")
ageBox <- ggplot(mallCust, aes(x=Gender, y=Age, fill = Gender)) +
          geom_boxplot() +
          ggtitle("Boxplot of Age per Gender")
plot_grid(ageHist, ageBox)

Gender - Income variable
incomeHist <- ggplot(mallCust, aes(Income, fill=Gender)) +
              geom_density(alpha = 0.5) + 
              ggtitle("Density of income per Gender")
incomeBox <- ggplot(mallCust, aes(x=Gender, y=Income, fill = Gender)) +
  geom_boxplot() +
  ggtitle("Boxplot of Income per Gender")
plot_grid(incomeHist, incomeBox)

Gender - Spending Score
spendHist <- ggplot(mallCust, aes(SpendingScore, fill=Gender)) +
              geom_dotplot(binwidth = 10) + 
              ggtitle("Dot plot of Spending Score per Gender")
spendBox <- ggplot(mallCust, aes(x=Gender, y=SpendingScore, fill = Gender)) +
  geom_boxplot() +
  ggtitle("Boxplot of Spending Score per Gender")
plot_grid(spendHist, spendBox)

How does the variable ‘Spending Score’ vary in relation to Gender, Age and Income?

This section analyses how the variable ‘Spending Score’ varies with the other variables in order to identify any relationships between them.
The plot below indicates that the spending score is highest among customers aged 23 to 32, peaking at around 28, and lowest among 45 to 60 year olds. It starts increasing again for customers over 60.
ggplot(mallCust, aes(x = Age, y=SpendingScore)) +
            geom_smooth()                      + 
            geom_point(aes(size=SpendingScore, colour = Gender)) + 
            scale_radius(range=c(1, 5))+
            geom_vline(xintercept = c(23, 28, 32, 45,60, 70)) +
            ggtitle("Spending Score v Age")

Although the spending score for women is higher than that of men for each bracket, the mode for men is higher than that of women.
ggplot(mallCust, aes(x = Gender, y=SpendingScore)) +
            geom_violin(scale = "area", aes(fill = Gender)) + 
            ggtitle("Spending Score v Gender") +
            geom_hline(yintercept = c(20, 40, 60, 80, 90)) +
            stat_sum()

High spending scores occur for incomes below 40K and for incomes between 70K and 110K. The highest frequencies of income and spending score combinations fall in the 40K to 70K range.
ggplot(mallCust, aes(x = Income, y=SpendingScore)) +
  geom_hex() +  # geom_hex() needs the hexbin package to be installed
  geom_smooth() +                      
  geom_vline(xintercept = c(40, 70, 110)) +
  ggtitle("Spending Score v Income") +
  scale_fill_gradientn(colours = topo.colors(6, alpha = 0.8))

As income of the customer increases, the variance of the spending score also increases.
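One way to check a statement like this numerically is to split customers into income brackets and compare the variance of the spending score in each. The sketch below uses synthetic data generated so that the spread grows with income, so its output only illustrates the procedure and says nothing about the mall data. The brackets reuse the 40K and 70K cut points from the plot. All object names are hypothetical.

```r
# Synthetic data for illustration only: NOT the mall dataset.
# The spread of the score is made to grow with income by construction.
set.seed(42)
n <- 200
income <- runif(n, min = 15, max = 110)
spending <- 50 + rnorm(n, mean = 0, sd = income / 8)
customers <- data.frame(Income = income, SpendingScore = spending)

# Income brackets using the 40K and 70K cut points from the plot
customers$Bracket <- cut(
  customers$Income,
  breaks = c(0, 40, 70, Inf),
  labels = c("<40K", "40K-70K", ">70K")
)

# Variance of the spending score within each bracket
varByBracket <- sapply(split(customers$SpendingScore, customers$Bracket), var)
print(round(varByBracket, 1))
```

With the real data, customers would be replaced by mallCust, and the claim would hold if the variances rise from the lowest bracket to the highest.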

Clustering

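I first cluster the data with 2 centres and then with 5 centres, using a helper function that runs k-means on the Age, Income and SpendingScore columns and plots the resulting clusters.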
# Run k-means with k centres on Age, Income and SpendingScore (columns 3 to 5)
# and plot the clusters against Age and against Income.
cluster_Data <- function(k, title) {
  km <- kmeans(mallCust[, 3:5], centers = k)

  # Store the cluster labels as a factor so that ggplot maps them to discrete colours
  plotData <- transform(mallCust, Cluster = factor(km$cluster))

  agePlot <- ggplot(plotData, aes(Age, SpendingScore, colour = Cluster)) +
    geom_point(size = 3) +
    ggtitle(title)

  incomePlot <- ggplot(plotData, aes(Income, SpendingScore, colour = Cluster)) +
    geom_point(size = 3) +
    ggtitle(title)

  plot_grid(agePlot, incomePlot)
}
# Cluster with 2 centers
cluster_Data(2, "k-means clustering results with k= 2")

# Cluster with 5 centers
cluster_Data(5, "k-means clustering results with k= 5")
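Because kmeans() starts from randomly chosen centres, repeated runs can assign customers to clusters differently. The sketch below shows one common way to make runs repeatable: fix the random seed and try several random starts with the nstart argument. The seed value, nstart = 25, the helper name runKmeans and the synthetic data are all assumptions made for this illustration.

```r
# Synthetic data for illustration only: two well separated groups of 20 points
set.seed(1)
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 3),
  matrix(rnorm(60, mean = 6), ncol = 3)
)

# Hypothetical helper: k-means with a fixed seed and several random starts.
# nstart = 25 keeps the best of 25 random initialisations.
runKmeans <- function(data, k, seed = 123) {
  set.seed(seed)
  kmeans(data, centers = k, nstart = 25)
}

fitA <- runKmeans(toy, 2)
fitB <- runKmeans(toy, 2)

# The same seed gives the same cluster assignment on both runs
print(identical(fitA$cluster, fitB$cluster))
print(fitA$size)
```

The same idea could be applied inside cluster_Data() by calling set.seed() before kmeans().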

Identifying the best value for centers using internal validation methods - data structure based metrics.

Dunn, Silhouette and Connectivity

# clValid is a package that contains various cluster validation methods.
library(clValid) 
intern <- clValid(mallCust[, 3:5], nClust=2:10, validation="internal", clMethods=c("kmeans"), maxitems = 5000)
summary(intern)
## 
## Clustering Methods:
##  kmeans 
## 
## Cluster sizes:
##  2 3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                            2       3       4       5       6       7       8       9      10
##                                                                                             
## kmeans Connectivity  17.5865  8.3710 21.4183 23.3893 30.7413 40.6952 51.5647 57.9798 75.9155
##        Dunn           0.0635  0.0803  0.0758  0.0522  0.0938  0.1106  0.0994  0.1127  0.1047
##        Silhouette     0.3251  0.3839  0.4055  0.4443  0.4177  0.3982  0.4279  0.4168  0.4020
## 
## Optimal Scores:
## 
##              Score  Method Clusters
## Connectivity 8.3710 kmeans 3       
## Dunn         0.1127 kmeans 9       
## Silhouette   0.4443 kmeans 5
plot(intern)
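To show what the Silhouette row measures, the sketch below computes the average silhouette width directly with the cluster package for several values of k and picks the largest. It uses synthetic data with three well separated groups, not the mall data, and the helper name avgSilhouette is hypothetical. It is meant as an illustration of the metric, not a reproduction of the clValid internals.

```r
library(cluster)  # provides silhouette()

# Synthetic data for illustration only: three well separated groups of 30 points
set.seed(7)
toy <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 8),  ncol = 2),
  matrix(rnorm(60, mean = 16), ncol = 2)
)

# Hypothetical helper: average silhouette width of a k-means solution.
# Values close to 1 indicate compact, well separated clusters.
avgSilhouette <- function(data, k) {
  km <- kmeans(data, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, dist(data))
  mean(sil[, "sil_width"])
}

ks <- 2:6
widths <- sapply(ks, function(k) avgSilhouette(toy, k))
names(widths) <- ks
print(round(widths, 3))

# The k with the largest average silhouette width
bestK <- ks[which.max(widths)]
print(bestK)
```

On the mall data, the same loop over mallCust[, 3:5] would be expected to point to the same k as the Silhouette row above, up to the randomness of the k-means initialisation.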

Stability Based Metrics

par(mfrow=c(2,2))
stab <- clValid(mallCust[3:5], nClust=2:10, validation="stability", clMethods=c("kmeans"))
optimalScores(stab)
##          Score Method Clusters
## APN  0.2684315 kmeans        3
## AD  27.4861619 kmeans       10
## ADM 15.2684189 kmeans        3
## FOM 20.0172807 kmeans       10
plot(stab)

Cluster with 3 centers

cluster_Data(3, "k-means clustering results with k= 3")
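Once a value of k is chosen, the clusters can be profiled by their size and by the average of each variable, which is what turns clusters into customer segments. The sketch below does this on synthetic data with three artificial groups. The data and the object names are assumptions for illustration, and the printed averages say nothing about the mall customers.

```r
# Synthetic data for illustration only: three artificial customer groups
set.seed(3)
toy <- data.frame(
  Age           = c(rnorm(20, 25, 2), rnorm(20, 45, 2), rnorm(20, 65, 2)),
  Income        = c(rnorm(20, 20, 3), rnorm(20, 60, 3), rnorm(20, 100, 3)),
  SpendingScore = c(rnorm(20, 80, 4), rnorm(20, 50, 4), rnorm(20, 20, 4))
)

km <- kmeans(toy, centers = 3, nstart = 25)
toy$Cluster <- factor(km$cluster)

# Number of customers in each cluster
clusterSizes <- table(toy$Cluster)
print(clusterSizes)

# Average Age, Income and SpendingScore per cluster
clusterProfile <- aggregate(
  cbind(Age, Income, SpendingScore) ~ Cluster,
  data = toy,
  FUN = mean
)
print(clusterProfile)
```

The per-cluster averages match km$centers, so either can be used to describe a segment, for example a young, low income, high spending group.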