1 Intro

Business Question: Customer Clustering

Customer segmentation is the subdivision of a market into discrete customer groups that share similar characteristics. It can be a powerful means of identifying unsatisfied customer needs. With that insight, companies can outperform the competition by developing uniquely appealing products and services.

You own a supermarket mall and, through membership cards, have some basic data about your customers, such as customer ID, age, sex, education and annual income. You want to understand who the target customers are so that the marketing team can use this insight to plan its strategy accordingly.

2 Data Preparation

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v stringr 1.4.0
## v tidyr   1.1.4     v forcats 0.5.1
## v readr   2.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR)
customer <- read.csv("segmentation data.csv")
head(customer)

Variable descriptions

  • ID : Integer. A unique identifier of the customer.

  • Sex : 0: male, 1: female

  • Marital status : 0: single, 1: non-single (divorced / separated / married / widowed)

  • Age : The age of the customer in years, calculated as the current year minus the customer's year of birth at the time the dataset was created

  • Education : Level of education of the customer (0: other / unknown, 1: high school, 2: university, 3: graduate school)

  • Income : Real number. Self-reported annual income of the customer in US dollars.

  • Occupation : Category of occupation of the customer (0: unemployed / unskilled, 1: skilled employee / official, 2: management / self-employed / highly qualified employee / officer)

  • Settlement size : The size of the city that the customer lives in (0: small city, 1: mid-sized city, 2: big city)

3 Data Cleansing

Move the ID column to the row names so that the data frame contains only numerical values.

customer_clean <- customer %>% 
  column_to_rownames("ID")

head(customer_clean)

Check for NA values

colSums(is.na(customer_clean))
##             Sex  Marital.status             Age       Education          Income 
##               0               0               0               0               0 
##      Occupation Settlement.size 
##               0               0

Note

There are no NA values in the data frame.

4 Exploratory Data Analysis

Check Data Range

# Summarise the range and quartiles of each variable
summary(customer_clean)
##       Sex        Marital.status        Age          Education    
##  Min.   :0.000   Min.   :0.0000   Min.   :18.00   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:27.00   1st Qu.:1.000  
##  Median :0.000   Median :0.0000   Median :33.00   Median :1.000  
##  Mean   :0.457   Mean   :0.4965   Mean   :35.91   Mean   :1.038  
##  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:42.00   3rd Qu.:1.000  
##  Max.   :1.000   Max.   :1.0000   Max.   :76.00   Max.   :3.000  
##      Income         Occupation     Settlement.size
##  Min.   : 35832   Min.   :0.0000   Min.   :0.000  
##  1st Qu.: 97663   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :115549   Median :1.0000   Median :1.000  
##  Mean   :120954   Mean   :0.8105   Mean   :0.739  
##  3rd Qu.:138072   3rd Qu.:1.0000   3rd Qu.:1.000  
##  Max.   :309364   Max.   :2.0000   Max.   :2.000

Note

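The variables are on very different scales: Income ranges from 35,832 to 309,364, while Sex and Marital.status only take the values 0 and 1. The data need to be scaled before modelling.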
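As a minimal sketch of what scaling does, the example below standardizes a small toy data frame. The data frame toy and its values are hypothetical and only illustrate the idea; they are not taken from the customer data.

```r
# Hypothetical toy data: two variables on very different scales
toy <- data.frame(
  Income = c(35000, 98000, 115000, 138000, 309000),
  Sex    = c(0, 1, 0, 1, 0)
)

# scale() centers each column on its mean and divides by its standard deviation
toy_scaled <- scale(toy)

# The same transformation written out by hand for one column
income_z <- (toy$Income - mean(toy$Income)) / sd(toy$Income)

# Both approaches agree, and every scaled column has mean 0 and sd 1
all.equal(as.numeric(toy_scaled[, "Income"]), income_z)
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)
```

After scaling, a one-unit difference means one standard deviation in every column, so no single variable dominates the distance calculation in k-means.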

customer_scale <- scale(customer_clean)
head(customer_scale)
##                  Sex Marital.status         Age   Education      Income
## 100000001 -0.9171695      -0.992776  2.65295099  1.60392184  0.09749923
## 100000002  1.0897659       1.006773 -1.18683527 -0.06335658  0.78245869
## 100000003 -0.9171695      -0.992776  1.11703649 -0.06335658 -0.83299391
## 100000004 -0.9171695      -0.992776  0.77572215 -0.06335658  1.32805410
## 100000005 -0.9171695      -0.992776  1.45835082 -0.06335658  0.73674749
## 100000006 -0.9171695      -0.992776 -0.07756368 -0.06335658  0.62698289
##           Occupation Settlement.size
## 100000001  0.2967488       1.5519379
## 100000002  0.2967488       1.5519379
## 100000003 -1.2692080      -0.9095021
## 100000004  0.2967488       0.3212179
## 100000005  0.2967488       0.3212179
## 100000006 -1.2692080      -0.9095021
summary(customer_scale)
##       Sex          Marital.status         Age            Education       
##  Min.   :-0.9172   Min.   :-0.9928   Min.   :-1.5281   Min.   :-1.73063  
##  1st Qu.:-0.9172   1st Qu.:-0.9928   1st Qu.:-0.7602   1st Qu.:-0.06336  
##  Median :-0.9172   Median :-0.9928   Median :-0.2482   Median :-0.06336  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 1.0898   3rd Qu.: 1.0068   3rd Qu.: 0.5197   3rd Qu.:-0.06336  
##  Max.   : 1.0898   Max.   : 1.0068   Max.   : 3.4209   Max.   : 3.27120  
##      Income          Occupation      Settlement.size  
##  Min.   :-2.2337   Min.   :-1.2692   Min.   :-0.9095  
##  1st Qu.:-0.6112   1st Qu.:-1.2692   1st Qu.:-0.9095  
##  Median :-0.1419   Median : 0.2967   Median : 0.3212  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4492   3rd Qu.: 0.2967   3rd Qu.: 0.3212  
##  Max.   : 4.9440   Max.   : 1.8627   Max.   : 1.5519

5 K-means

Cluster the data into 5 clusters
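The call that creates customer_clustering is not shown in this section. A minimal sketch of what such a call could look like, assuming it mirrors the seeded kmeans() call in section 5.3 with centers = 5. The matrix demo_scale is a hypothetical stand-in for customer_scale so that the example runs on its own, and its results will not match the output below.

```r
# Hypothetical stand-in for customer_scale: 200 rows, 7 standardized columns
set.seed(100)
demo_scale <- scale(matrix(rnorm(200 * 7), ncol = 7))

# Assumed form of the missing call, by analogy with section 5.3
set.seed(100)
demo_clustering <- kmeans(x = demo_scale, centers = 5)

demo_clustering$iter     # iterations until the algorithm stopped
demo_clustering$size     # observations per cluster
```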

  1. The number of iterations the k-means algorithm needed to reach stable clusters
customer_clustering$iter
## [1] 3
  2. The number of observations in each cluster
# Cluster sizes
customer_clustering$size
## [1] 247 605 362 336 450

Note

The clusters are of broadly comparable size, ranging from 247 to 605 observations.

  3. The location of each cluster center (centroid), commonly used for profiling clusters
# Centroids, in standardized units
customer_clustering$centers
##            Sex Marital.status        Age   Education     Income Occupation
## 1  0.009108399      0.3348597  1.7361005  1.84692598  1.0201097  0.5503451
## 2  0.837655011      0.5705077 -0.3359471  0.13230750 -0.6211426 -0.6661205
## 3  0.396763359      0.9736312 -0.6640209 -0.04032787  0.1584052  0.4957378
## 4 -0.726032758     -0.9927760 -0.1161647 -0.79279089 -0.5931031 -0.6959559
## 5 -0.908249746     -0.9927760  0.1196402 -0.56724517  0.5905869  0.7143373
##   Settlement.size
## 1       0.4806634
## 2      -0.8993308
## 3       0.6917938
## 4      -0.7849649
## 5       0.9748670
  4. The cluster label of each observation
# Cluster labels of the first six customers
head(customer_clustering$cluster)
## 100000001 100000002 100000003 100000004 100000005 100000006 
##         1         3         4         5         5         4

5.1 Goodness of Fit

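A good clustering has:

  • withinss: low (observations are close to the centroid of their own cluster)
  • betweenss: high (the centroids are far apart from each other)
  • betweenss/totss close to 1: the closer the ratio is to 1, the more of the total variation in the data is explained by the clusters
# Within-cluster sum of squares per cluster, then the between-cluster sum of squares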
customer_clustering$withinss
## [1] 1544.5023 1753.7827  965.3264  918.9129 1291.5195
customer_clustering$betweenss
## [1] 7518.956

Note

The within-cluster sums of squares add up to about 6,474, which is lower than the between-cluster sum of squares of about 7,519 (OK)

customer_clustering$totss
## [1] 13993
customer_clustering$betweenss / customer_clustering$totss
## [1] 0.537337

Note

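A betweenss/totss ratio of about 0.54 means the five clusters explain roughly 54% of the total variation. This is treated here as good enough to represent the true distribution of the data.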
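The three quantities are linked: the total sum of squares is the sum of all within-cluster sums of squares plus the between-cluster sum of squares. A short check of that arithmetic, using the values printed above (copied by hand, so small rounding differences are expected):

```r
# Values copied from the output above
withinss  <- c(1544.5023, 1753.7827, 965.3264, 918.9129, 1291.5195)
betweenss <- 7518.956
totss     <- 13993

tot_withinss <- sum(withinss)   # total within-cluster sum of squares
tot_withinss + betweenss        # matches totss up to rounding of the printed values
betweenss / totss               # share of the variation explained by the clusters
```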

5.2 K Optimum Selection

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

# Elbow method: plot the total within-cluster sum of squares against k
fviz_nbclust(x = customer_scale,
             FUNcluster = kmeans,
             method = "wss")

Note

The optimum k value is 4, because beyond 4 clusters the decrease in total WSS is no longer drastic (the "elbow" of the curve).
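The elbow method can also be sketched with base R alone, which makes the quantity behind the plot explicit. This is an illustration on hypothetical simulated data (four well-separated groups), not a reproduction of what fviz_nbclust() does internally. The names demo, group_centers, k_values and wss are made up for the example.

```r
# Hypothetical data: four well-separated groups in two dimensions
set.seed(100)
group_centers <- rbind(c(0, 0), c(6, 0), c(0, 6), c(6, 6))
demo <- do.call(rbind, lapply(1:4, function(i) {
  cbind(rnorm(50, mean = group_centers[i, 1]),
        rnorm(50, mean = group_centers[i, 2]))
}))

# Total within-cluster sum of squares for k = 1..10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(demo, centers = k, nstart = 10)$tot.withinss
})

# The drop in WSS is large up to k = 4 and small afterwards
data.frame(k = k_values, wss = round(wss, 1))
```

Calling plot(k_values, wss, type = "b") on these values draws the elbow curve; the chosen k is the point after which adding clusters gives only a small further reduction.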

5.3 Recreate K-means

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

customer_clustering_4 <- kmeans(x = customer_scale,
                               centers = 4)

5.4 Cluster Profiling

Inspect the centroids of the 4-cluster model, then add the cluster labels to the initial data

customer_clustering_4$centers
##           Sex Marital.status         Age   Education     Income Occupation
## 1  0.09011369      0.3909422  1.68902999  1.81946354  0.9809802  0.4991919
## 2  0.79655407      1.0011004 -0.59268205  0.05016025 -0.3987344 -0.2763247
## 3 -0.85731349     -0.6454860 -0.02337255 -0.50796416  0.5317359  0.7225792
## 4 -0.20909486     -0.9538238 -0.02825041 -0.48558943 -0.6060162 -0.7540014
##   Settlement.size
## 1       0.4569247
## 2      -0.3892828
## 3       0.9646469
## 4      -0.8562241

Create a reference table, data_ref, that maps each cluster number to a label (A, B, C and D)

data_ref <- data.frame(cluster = (1:4), nama = c("A", "B", "C", "D"))

Add the cluster number to the initial data

customer_clean$cluster <- customer_clustering_4$cluster

head(customer_clean)

Join the cluster labels onto the initial data

customer_clean %>% 
  left_join(data_ref)
## Joining, by = "cluster"

Profiling

  1. Profiling with the mean of each cluster
customer_clean %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

Note

Cluster profiles based only on the mean are difficult to interpret, because most variables are coded categories and their means are fractional values. Profiling with the median of each cluster is tried next, since medians usually map directly to a category.

  2. Profiling with the median of each cluster
customer_clean %>% 
  group_by(cluster) %>% 
  summarise_all(median)

Note

  • Cluster 1 : customers are female, non-single, approximately 56 years old, university-educated, work as skilled employees/officials with an income of around 158,338 USD, and live in a mid-sized city

  • Cluster 2 : customers are female, non-single, approximately 29 years old, high-school-educated, work as skilled employees/officials with an income of around 105,759 USD, and live in a small city

  • Cluster 3 : customers are male, single, approximately 35 years old, high-school-educated, work as skilled employees/officials with an income of around 141,218 USD, and live in a big city

  • Cluster 4 : customers are male, single, approximately 35 years old, high-school-educated, unemployed/unskilled with an income of around 97,859 USD, and live in a small city
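The difference between mean and median profiling can be seen on a small example. The sketch below uses base R aggregate() instead of the dplyr pipeline so that it runs without extra packages. The data frame toy is hypothetical and is not drawn from the customer data.

```r
# Hypothetical toy data: six customers in two clusters
toy <- data.frame(
  cluster   = c(1, 1, 1, 2, 2, 2),
  Sex       = c(1, 1, 0, 0, 0, 1),   # 0: male, 1: female
  Education = c(2, 2, 1, 1, 1, 0),   # 1: high school, 2: university
  Income    = c(150000, 160000, 120000, 90000, 100000, 95000)
)

# Mean per cluster: coded variables turn into fractions that match no category
profile_mean <- aggregate(. ~ cluster, data = toy, FUN = mean)

# Median per cluster: coded variables stay on their original codes here
profile_median <- aggregate(. ~ cluster, data = toy, FUN = median)

profile_mean
profile_median
```

In this toy case the mean of Sex in cluster 1 is about 0.67, which is neither male nor female, while the median is 1 (female) and can be read off directly.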

5.5 Visualize Clustering

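  1. K-means
# Plot the clusters using the scaled data the model was fitted on
# (customer_clean now also contains the cluster column)
fviz_cluster(object = customer_clustering_4, 
             data = customer_scale, show.clust.cent = TRUE)

  2. Biplot PCA
customer_pca <- PCA(customer_scale,
                    graph = FALSE)
# Biplot of individuals and variables; habillage = 7 points at column 7 (Settlement.size)
fviz_pca_biplot(customer_pca,
                habillage = 7)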

Note

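  • Arrows that point in similar directions (angle between arrows < 90°) indicate positively correlated variables

  • Arrows that are perpendicular to each other (angle between arrows = 90°) indicate uncorrelated variables

  • Arrows that point in opposite directions (angle between arrows > 90°) indicate negatively correlated variables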
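The link between arrow angles and correlation can be checked numerically. The sketch below uses base R prcomp() rather than FactoMineR::PCA(), on hypothetical simulated variables x1, x2 and x3. When all components are kept, the inner products of the variable arrows equal the correlations exactly; a two-dimensional biplot shows only an approximation of those angles.

```r
# Hypothetical data: x2 is built to correlate with x1, x3 is independent noise
set.seed(100)
x1 <- rnorm(300)
x2 <- 0.8 * x1 + rnorm(300, sd = 0.6)
x3 <- rnorm(300)
demo <- scale(cbind(x1, x2, x3))

# PCA on the standardized data
demo_pca <- prcomp(demo)

# Variable coordinates: loadings scaled by the component standard deviations
var_coord <- demo_pca$rotation %*% diag(demo_pca$sdev)

# Using all components, inner products of the arrows reproduce the correlations
round(tcrossprod(var_coord), 3)
round(cor(demo), 3)

# Angle in degrees between the full-dimensional arrows of x1 and x2
acos(sum(var_coord["x1", ] * var_coord["x2", ])) * 180 / pi
```

Because each arrow has length 1 in the full space, the cosine of the angle between two arrows is the correlation of the two variables: a small angle means a strong positive correlation, and 90 degrees means no correlation.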

6 Summary

  • Before modelling, the variables need to be on the same scale, so the data were scaled.

  • The optimal k value (number of clusters) is 4.

  • Variable correlation

    • Income, Occupation and Settlement Size are positively correlated

    • Sex and Marital Status are positively correlated

    • Age and Sex, Education and Settlement Size, and Marital Status and Income are not correlated

  • The cluster profiles have been created; the next step is to determine a sales strategy that suits each customer profile/cluster.