Business Question: Customer Clustering
Customer segmentation is the subdivision of a market into discrete customer groups that share similar characteristics. It can be a powerful means of identifying unmet customer needs. Using such segments, companies can outperform the competition by developing uniquely appealing products and services.
You own a supermarket mall and, through membership cards, you have some basic data about your customers, such as customer ID, age, gender, annual income and spending score. You want to understand your customers, in particular who the target customers are, so that the insight can be passed to the marketing team, which can then plan its strategy accordingly.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v stringr 1.4.0
## v tidyr 1.1.4 v forcats 0.5.1
## v readr 2.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x lubridate::intersect() masks base::intersect()
## x dplyr::lag() masks stats::lag()
## x lubridate::setdiff() masks base::setdiff()
## x lubridate::union() masks base::union()
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR)
customer <- read.csv("segmentation data.csv")
head(customer)
Variable description (variable : data type, range, description)
ID : numerical, integer. A unique identifier of a customer.
Sex : 0: male, 1: female
Marital status : 0: single, 1: non-single (divorced / separated / married / widowed)
Age : The age of the customer in years, calculated as the current year minus the customer's year of birth at the time the dataset was created
Education : Level of education of the customer (0: other / unknown, 1: high school, 2: university, 3: graduate school)
Income : real. Self-reported annual income of the customer in US dollars.
Occupation : Category of occupation of the customer (0: unemployed / unskilled, 1: skilled employee / official, 2: management / self-employed / highly qualified employee / officer)
Settlement size : The size of the city that the customer lives in (0: small city, 1: mid-sized city, 2: big city)
Move the ID column to the row names so that the data frame contains only numerical variables.
customer_clean <- customer %>%
  column_to_rownames("ID")
head(customer_clean)
Check NA Data
colSums(is.na(customer_clean))
## Sex Marital.status Age Education Income
## 0 0 0 0 0
## Occupation Settlement.size
## 0 0
Note
There are no missing (NA) values in the data frame.
Check Data Range
summary(customer_clean)
## Sex Marital.status Age Education
## Min. :0.000 Min. :0.0000 Min. :18.00 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:27.00 1st Qu.:1.000
## Median :0.000 Median :0.0000 Median :33.00 Median :1.000
## Mean :0.457 Mean :0.4965 Mean :35.91 Mean :1.038
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:42.00 3rd Qu.:1.000
## Max. :1.000 Max. :1.0000 Max. :76.00 Max. :3.000
## Income Occupation Settlement.size
## Min. : 35832 Min. :0.0000 Min. :0.000
## 1st Qu.: 97663 1st Qu.:0.0000 1st Qu.:0.000
## Median :115549 Median :1.0000 Median :1.000
## Mean :120954 Mean :0.8105 Mean :0.739
## 3rd Qu.:138072 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :309364 Max. :2.0000 Max. :2.000
Note
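The variables are on very different scales (for example, Income runs into the hundreds of thousands while Sex is 0 or 1). The data need to be scaled before modelling so that no single variable dominates the distance calculation.
customer_scale <- scale(customer_clean)
head(customer_scale)
## Sex Marital.status Age Education Income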
## 100000001 -0.9171695 -0.992776 2.65295099 1.60392184 0.09749923
## 100000002 1.0897659 1.006773 -1.18683527 -0.06335658 0.78245869
## 100000003 -0.9171695 -0.992776 1.11703649 -0.06335658 -0.83299391
## 100000004 -0.9171695 -0.992776 0.77572215 -0.06335658 1.32805410
## 100000005 -0.9171695 -0.992776 1.45835082 -0.06335658 0.73674749
## 100000006 -0.9171695 -0.992776 -0.07756368 -0.06335658 0.62698289
## Occupation Settlement.size
## 100000001 0.2967488 1.5519379
## 100000002 0.2967488 1.5519379
## 100000003 -1.2692080 -0.9095021
## 100000004 0.2967488 0.3212179
## 100000005 0.2967488 0.3212179
## 100000006 -1.2692080 -0.9095021
summary(customer_scale)
## Sex Marital.status Age Education
## Min. :-0.9172 Min. :-0.9928 Min. :-1.5281 Min. :-1.73063
## 1st Qu.:-0.9172 1st Qu.:-0.9928 1st Qu.:-0.7602 1st Qu.:-0.06336
## Median :-0.9172 Median :-0.9928 Median :-0.2482 Median :-0.06336
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 1.0898 3rd Qu.: 1.0068 3rd Qu.: 0.5197 3rd Qu.:-0.06336
## Max. : 1.0898 Max. : 1.0068 Max. : 3.4209 Max. : 3.27120
## Income Occupation Settlement.size
## Min. :-2.2337 Min. :-1.2692 Min. :-0.9095
## 1st Qu.:-0.6112 1st Qu.:-1.2692 1st Qu.:-0.9095
## Median :-0.1419 Median : 0.2967 Median : 0.3212
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4492 3rd Qu.: 0.2967 3rd Qu.: 0.3212
## Max. : 4.9440 Max. : 1.8627 Max. : 1.5519
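To make the scaling step concrete, here is a minimal sketch of what scale() computes by default: each column is centered on its mean and divided by its standard deviation (a z-score). The data frame demo and its values are made up for illustration and are not taken from the customer data.

```r
# Hypothetical two-column stand-in for customer_clean (values are made up)
demo <- data.frame(
  Age    = c(25, 35, 45, 55),
  Income = c(50000, 90000, 120000, 200000)
)

# scale() centers each column on its mean and divides by its standard deviation
demo_scale <- scale(demo)

# The same z-scores computed by hand for the Income column
income_z <- (demo$Income - mean(demo$Income)) / sd(demo$Income)

all.equal(as.numeric(demo_scale[, "Income"]), income_z)  # TRUE
colMeans(demo_scale)      # approximately 0 for every column
apply(demo_scale, 2, sd)  # 1 for every column
```

This matches the summary above, where every scaled variable has a mean of 0.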
Clustering data into 5 clusters
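The call that creates customer_clustering does not appear in the text. The sketch below shows what such a call presumably looks like, assuming base R kmeans() with centers = 5. The seed and the synthetic matrix demo_scale are assumptions so that the block runs on its own; with the real data, x would be customer_scale.

```r
# Synthetic stand-in for customer_scale (assumption: 200 rows, 7 scaled columns)
set.seed(100)  # the seed used for the 5-cluster run is not shown; 100 is an assumption
demo_scale <- scale(matrix(rnorm(200 * 7), ncol = 7))

# k-means with 5 centroids
demo_clustering <- kmeans(x = demo_scale, centers = 5)

demo_clustering$iter  # number of iterations until convergence
demo_clustering$size  # number of observations in each cluster
```

Because k-means starts from randomly chosen centroids, fixing the seed before the call is what makes the cluster numbering and sizes reproducible.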
customer_clustering$iter
## [1] 3
customer_clustering$size
## [1] 247 605 362 336 450
Note
The cluster sizes are reasonably balanced, ranging from 247 to 605 observations.
customer_clustering$centers
## Sex Marital.status Age Education Income Occupation
## 1 0.009108399 0.3348597 1.7361005 1.84692598 1.0201097 0.5503451
## 2 0.837655011 0.5705077 -0.3359471 0.13230750 -0.6211426 -0.6661205
## 3 0.396763359 0.9736312 -0.6640209 -0.04032787 0.1584052 0.4957378
## 4 -0.726032758 -0.9927760 -0.1161647 -0.79279089 -0.5931031 -0.6959559
## 5 -0.908249746 -0.9927760 0.1196402 -0.56724517 0.5905869 0.7143373
## Settlement.size
## 1 0.4806634
## 2 -0.8993308
## 3 0.6917938
## 4 -0.7849649
## 5 0.9748670
head(customer_clustering$cluster)
## 100000001 100000002 100000003 100000004 100000005 100000006
## 1 3 4 5 5 4
Goodness of clustering
customer_clustering$withinss
## [1] 1544.5023 1753.7827 965.3264 918.9129 1291.5195
customer_clustering$betweenss
## [1] 7518.956
Note
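Each cluster's withinss (within-cluster sum of squares) is low compared with betweenss (between-cluster sum of squares), which points to compact, well-separated clusters (OK)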
customer_clustering$totss
## [1] 13993
customer_clustering$betweenss / customer_clustering$totss
## [1] 0.537337
Note
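Betweenss/totss : the ratio is about 0.54, so the five clusters account for roughly 54% of the total variation in the data. The closer the ratio is to 1, the better the clusters separate the data; a value above 0.5 is taken here as good enough to represent the distribution of the data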
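The three quantities are linked by a simple identity: totss equals the sum of all withinss values plus betweenss. The sketch below checks this on a synthetic matrix (demo_scale and demo_km are hypothetical names, not objects from the analysis) and then repeats the arithmetic with the values printed above.

```r
set.seed(100)
# Synthetic stand-in for customer_scale
demo_scale <- scale(matrix(rnorm(300 * 4), ncol = 4))
demo_km <- kmeans(demo_scale, centers = 3)

# Total sum of squares splits into within-cluster and between-cluster parts
all.equal(demo_km$totss, demo_km$tot.withinss + demo_km$betweenss)  # TRUE
all.equal(demo_km$tot.withinss, sum(demo_km$withinss))              # TRUE

# Share of the total variation explained by the clustering (between 0 and 1)
demo_km$betweenss / demo_km$totss

# The values reported above satisfy the same identity (up to rounding)
withinss_doc <- c(1544.5023, 1753.7827, 965.3264, 918.9129, 1291.5195)
sum(withinss_doc) + 7518.956  # about 13993, the reported totss
```

Since the ratio can only rise as k grows, it is best used to compare solutions with the same number of clusters rather than to choose k.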
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
fviz_nbclust(x = customer_scale,
             FUNcluster = kmeans,
             method = "wss")
Note
The optimum number of clusters is k = 4: beyond 4 clusters, adding another cluster no longer reduces the total WSS substantially
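The elbow reading can also be checked numerically. The sketch below computes the total within-cluster sum of squares for a range of k values by hand, which is the quantity that the "wss" method plots. The synthetic data, the range 1 to 8 and nstart = 10 are assumptions for illustration, not settings taken from the analysis.

```r
set.seed(100)
# Synthetic stand-in for customer_scale: three separated groups in 2 dimensions
demo <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)
demo_scale <- scale(demo)

# Total within-cluster sum of squares for k = 1..8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(demo_scale, centers = k, nstart = 10)$tot.withinss
})

# Reduction in WSS gained by each additional cluster;
# the elbow is the k after which this reduction becomes small
elbow_table <- data.frame(k = k_values, wss = wss, drop = c(NA, -diff(wss)))
elbow_table
```

Reading the drop column makes the choice less dependent on eyeballing the plot, although the cut-off for "small" remains a judgment call.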
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
customer_clustering_4 <- kmeans(x = customer_scale,
                                centers = 4)
Input the cluster label into the initial data
customer_clustering_4$centers
## Sex Marital.status Age Education Income Occupation
## 1 0.09011369 0.3909422 1.68902999 1.81946354 0.9809802 0.4991919
## 2 0.79655407 1.0011004 -0.59268205 0.05016025 -0.3987344 -0.2763247
## 3 -0.85731349 -0.6454860 -0.02337255 -0.50796416 0.5317359 0.7225792
## 4 -0.20909486 -0.9538238 -0.02825041 -0.48558943 -0.6060162 -0.7540014
## Settlement.size
## 1 0.4569247
## 2 -0.3892828
## 3 0.9646469
## 4 -0.8562241
Store the cluster labels (A, B, C & D) in data_ref
data_ref <- data.frame(cluster = (1:4), nama = c("A", "B", "C", "D"))
Input cluster into initial data
customer_clean$cluster <- customer_clustering_4$cluster
head(customer_clean)
Join data label into initial data
customer_clean %>%
  left_join(data_ref)
## Joining, by = "cluster"
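One thing to keep in mind with this join is that dplyr joins do not preserve row names, so the customer IDs stored there are lost in the result. A minimal sketch of one way to keep them, assuming the IDs should become a regular column first; demo_clean, demo_ref and demo_labelled are hypothetical miniatures, not the objects above.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature of customer_clean: IDs in the row names, plus a cluster column
demo_clean <- data.frame(
  Age       = c(67, 22, 49),
  cluster   = c(1L, 3L, 4L),
  row.names = c("100000001", "100000002", "100000003")
)
demo_ref <- data.frame(cluster = 1:4, nama = c("A", "B", "C", "D"))

demo_labelled <- demo_clean %>%
  rownames_to_column("ID") %>%         # move the IDs into a column before joining
  left_join(demo_ref, by = "cluster")  # naming the key avoids the "Joining, by" message
demo_labelled
```

Assigning the result to a new object also keeps the labelled table available for later steps.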
Profiling
customer_clean %>%
  group_by(cluster) %>%
  summarise_all(mean)
Note
Cluster profiling using only the mean is difficult to interpret, because most variables are coded categories and their means are not valid category values. Profiling with the median of each cluster is easier to read.
customer_clean %>%
  group_by(cluster) %>%
  summarise_all(median)
Note
Cluster 1 : customers are female, non-single (married at some point), approximately 56 years old, study/have studied at university, work as skilled employees/officials with an income of around 158,338 USD and live in a mid-sized city
Cluster 2 : customers are female, non-single (married at some point), approximately 29 years old, study/have studied at high school, work as skilled employees/officials with an income of around 105,759 USD and live in a small city
Cluster 3 : customers are male, single, approximately 35 years old, study/have studied at high school, work as skilled employees/officials with an income of around 141,218 USD and live in a big city
Cluster 4 : customers are male, single, approximately 35 years old, study/have studied at high school, are unemployed/unskilled with an income of around 97,859 USD and live in a small city
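The profiles above rest on per-cluster medians. As a variation (not the method used above), the sketch below takes the most frequent value for coded categories and the median for continuous variables, using across(), which supersedes summarise_all() in current dplyr. The data frame demo, its values and the helper most_frequent() are hypothetical.

```r
library(dplyr)

# Hypothetical miniature of customer_clean with a cluster column (values are made up)
demo <- data.frame(
  Sex        = c(0, 0, 1, 1, 1, 0),
  Age        = c(35, 41, 29, 27, 33, 58),
  Income     = c(141000, 150000, 105000, 98000, 110000, 160000),
  Occupation = c(1, 1, 0, 0, 1, 2),
  cluster    = c(1, 1, 2, 2, 2, 1)
)

# Most frequent value of a vector; ties resolve to the smallest value
most_frequent <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

profile <- demo %>%
  group_by(cluster) %>%
  summarise(
    across(c(Sex, Occupation), most_frequent),  # coded categories: most frequent value
    across(c(Age, Income), median),             # continuous variables: median
    n = n()                                     # cluster size
  )
profile
```

For a 0/1 variable the median and the most frequent value usually agree, but for a code with more than two levels, such as Education or Occupation, the most frequent value is guaranteed to be the category that actually dominates the cluster.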
fviz_cluster(object = customer_clustering_4,
             data = customer_clean, show.clust.cent = TRUE)
customer_pca <- PCA(customer_scale,
                    graph = FALSE)
fviz_pca_biplot(customer_pca,
                habillage = 7)
Note
If two arrows point in similar directions (angle between arrows < 90 degrees), the variables are positively correlated
If two arrows are perpendicular to each other (angle between arrows = 90 degrees), the variables are not correlated
Before modelling, the variables need to be on the same scale, so the data need to be scaled.
The optimal number of clusters (k) is 4
Variable Correlation
Income, Occupation and Settlement Size are positively correlated
Sex & Marital Status are positively correlated
Age & Sex, Education & Settlement Size and Marital Status & Income are not correlated
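A biplot shows only the first two principal components, so arrow angles are an approximation of the correlations. They can be cross-checked against exact pairwise values from cor(). The sketch below runs on synthetic data in which Income is built to depend on Occupation while Age is independent; the variable names mirror the dataset, but every value is made up.

```r
set.seed(100)
n <- 500

# Synthetic stand-in data: income depends on occupation, age is independent of both
occupation <- sample(0:2, n, replace = TRUE)
income     <- 80000 + 30000 * occupation + rnorm(n, sd = 15000)
age        <- sample(18:76, n, replace = TRUE)
demo <- data.frame(Occupation = occupation, Income = income, Age = age)

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(demo), 2)
cor_matrix
```

With the real data, cor(customer_scale) would give the corresponding matrix; scaling does not change Pearson correlations, so cor(customer_clean) on the original numeric columns returns the same values.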
The cluster profiles have been created. The next step is to define a sales strategy that fits each customer profile/cluster