Customer Segmentation

About Dataset

The dataset used is customer segmentation data obtained from a mall. from the data writer want to know the cluster of each customer. This cluster is useful for tracking customers who are included in the priority category or customers who have less transactions.

Dataset

data = read.csv("E:/Algoritma/7_lbb_unsupervised/Mall_Customers.csv")
head(data)
  • CustomerID : ID of Customer
  • Gender : Gender of Customer
  • Age : Age of Customer
  • Annual.Income : Income of Customer
  • Spending.Score : Score assigned by the mall based on customer behavior and spending nature

Libraries and Setup

We’ll set-up caching for this notebook given how computationally expensive some of the code we will write can get.

written library is very useful for the results of the analysis

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr     1.1.1     âś” readr     2.1.4
## âś” forcats   1.0.0     âś” stringr   1.5.0
## âś” ggplot2   3.4.2     âś” tibble    3.2.1
## âś” lubridate 1.9.2     âś” tidyr     1.3.0
## âś” purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cluster)
## Warning: package 'cluster' was built under R version 4.2.3
library(ggforce)
## Warning: package 'ggforce' was built under R version 4.2.3
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(cowplot)
## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.2.3
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.2.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(dplyr)
library(tidyr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggplot2)
library(glue)

Exploratori Data

summary(data)
##    CustomerID        Gender               Age        Annual.Income   
##  Min.   :  1.00   Length:200         Min.   :18.00   Min.   : 15.00  
##  1st Qu.: 50.75   Class :character   1st Qu.:28.75   1st Qu.: 41.50  
##  Median :100.50   Mode  :character   Median :36.00   Median : 61.50  
##  Mean   :100.50                      Mean   :38.85   Mean   : 60.56  
##  3rd Qu.:150.25                      3rd Qu.:49.00   3rd Qu.: 78.00  
##  Max.   :200.00                      Max.   :70.00   Max.   :137.00  
##  Spending.Score 
##  Min.   : 1.00  
##  1st Qu.:34.75  
##  Median :50.00  
##  Mean   :50.20  
##  3rd Qu.:73.00  
##  Max.   :99.00
  • CustomerID : there are 200 samples used in the dataset.
  • Age : the youngest is 18 years old and the oldest is 70 years old.
  • Annual.Income : the average income of customers is around 60.56$.
  • Spending.Score : the highest score is at 99 and the lowest is at 1

Gender Plot

data_gender <- data %>% 
  group_by(Gender) %>% 
  summarise(sum = n()) %>% 
  ungroup() %>%
  mutate(label1 = glue(
    "Gender : {Gender}
    Count: {comma(sum)}"
  ))


plot1 <- ggplot(data = data_gender, aes(x = sum, 
                                       y = reorder(Gender, sum), 
                                       text = label1)) +
  geom_col(aes(fill = sum)) +
  scale_fill_gradient(low="pink", high="maroon") +
  labs(title = "Gender",
       x = "Amount Gender",
       y = "Label Gender") +
  theme_minimal() +
  theme(legend.position = "none") 

ggplotly(plot1, tooltip = "text")

In the above plot it can be seen that most of the customers are female, which indicates that women prefer shopping than men. so to increase male customers, the mall management can increase promotions for men’s products, end than male customers more enthusiastic shopping at mall.

Data Preprocesing

Data Wrangling

glimpse(data)
## Rows: 200
## Columns: 5
## $ CustomerID     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ Gender         <chr> "Male", "Male", "Female", "Female", "Female", "Female",…
## $ Age            <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35, 58, 24,…
## $ Annual.Income  <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19, 20, 20,…
## $ Spending.Score <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15, 77, 13…

The initial data structure can be seen above. for the Gender variable is of character type, so it needs to be changed so that it can be processed

data$Gender = ifelse(data$Gender == 'Male', 0,1)
glimpse(data)
## Rows: 200
## Columns: 5
## $ CustomerID     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ Gender         <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0…
## $ Age            <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35, 58, 24,…
## $ Annual.Income  <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19, 20, 20,…
## $ Spending.Score <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15, 77, 13…

the result of changes from the Gender variable 0 = Male 1 = Female

data_clean = data %>%
  select( -c(CustomerID)) %>%
  mutate( Gender = as.factor(Gender))
glimpse(data_clean)
## Rows: 200
## Columns: 4
## $ Gender         <fct> 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0…
## $ Age            <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35, 58, 24,…
## $ Annual.Income  <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19, 20, 20,…
## $ Spending.Score <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15, 77, 13…

then change the Gender data type to categorical and remove unused variables

Missing Value

colSums(is.na(data_clean))
##         Gender            Age  Annual.Income Spending.Score 
##              0              0              0              0

the dataset that we have does not have a missing value so that further analysis can be carried out and no missing value handling is required

Scaling

data_scale = data_clean[,2:4]
data_scale = scale(data_scale)
head(data_scale)
##             Age Annual.Income Spending.Score
## [1,] -1.4210029     -1.734646     -0.4337131
## [2,] -1.2778288     -1.734646      1.1927111
## [3,] -1.3494159     -1.696572     -1.7116178
## [4,] -1.1346547     -1.696572      1.0378135
## [5,] -0.5619583     -1.658498     -0.3949887
## [6,] -1.2062418     -1.658498      0.9990891

because the variables used have different units, it’s necessary to scaling so that all variables are equivalent. and separating Gender at the scaling step, because the range of Gender is the same as the results of the scaling.

Joining Dataset

data_join = data.frame(data_clean$Gender, data_scale)
data_join = data_join %>%
  mutate ( Gender = data_clean.Gender) %>%
  select ( -c(data_clean.Gender))

head(data_join)

at this step, combines the results of scaling with gender variables.

rownames(data_join) <- data$CustomerID
head(data_join)

then labeling the dataset with customerID

Clustering

Optimization

optimization using 2 methods : 1. Elbow Method 2. Silhouette Method

Elbow Method

fviz_nbclust(x = data_join, 
             FUNcluster = kmeans, 
             method = "wss") 

From the Elbow Method, the optimal cluster is 6

Silhouette Method

fviz_nbclust(data_join, kmeans, "silhouette", k.max = 20) + labs(subtitle = "Silhouette method")

From the Silhouette Method, the optimal cluster is 6

Clustering

Cluster = 4

set.seed(321)

data_cl <- kmeans(x=data_join, # data numerik yang ingin di clustering
                    centers = 4)
summary(data_cl)
##              Length Class  Mode   
## cluster      200    -none- numeric
## centers       16    -none- numeric
## totss          1    -none- numeric
## withinss       4    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           4    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

by using cluster = 4, the center value is 16 centers and 1 iteration

data_cl$centers
##           Age Annual.Income Spending.Score    Gender
## 1 -0.42773261     0.9724070      1.2130414 0.5500000
## 2  0.03711223     0.9876366     -1.1857814 0.5000000
## 3 -0.96008279    -0.7827991      0.3910484 0.5964912
## 4  1.08344244    -0.4893373     -0.3961802 0.5692308

The following is the center value of each cluster for each variable

head(data_cl$cluster)
## 1 2 3 4 5 6 
## 3 3 3 3 3 3

For example, the results of clustering show that customers ID 1 to 6 are belongs to cluster 3

data_cl$tot.withinss
## [1] 253.2568
data_cl$betweenss / data_cl$totss 
## [1] 0.6081315

from the results above, the Within Sum of Squares value is obtained 253.257 dan the value of Between Sum of Squares / Total Sum of Squares is 0.608

Cluster = 6

set.seed(123)

data_cl1 <- kmeans(x=data_join, # data numerik yang ingin di clustering
                    centers = 6)
summary(data_cl1)
##              Length Class  Mode   
## cluster      200    -none- numeric
## centers       24    -none- numeric
## totss          1    -none- numeric
## withinss       6    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           6    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

by using cluster = 6, the center value is 24 centers and 1 iteration

data_cl1$centers
##          Age Annual.Income Spending.Score    Gender
## 1  0.2105852     0.7449156     -1.3985957 0.2500000
## 2  1.1956271    -0.4598275     -0.3262196 0.5689655
## 3 -0.9719569    -1.3262173      1.1293439 0.5909091
## 4  0.0823251     1.8290695     -1.0455584 0.8000000
## 5 -0.4408110     0.9891010      1.2364001 0.5384615
## 6 -0.7797657    -0.4020602     -0.2153735 0.6595745

The following is the center value of each cluster for each variable

head(data_cl1$cluster)
## 1 2 3 4 5 6 
## 6 3 6 3 6 3

For example, the results of clustering show that customers ID 2,4,6 are belongs to cluster 3 end than customers ID 1,3,5 are belongs to cluster 6

data_cl1$tot.withinss
## [1] 204.513
data_cl1$betweenss / data_cl1$totss 
## [1] 0.6835535

from the results above, the Within Sum of Squares value is obtained 204.513 dan the value of Between Sum of Squares / Total Sum of Squares is 0.684

Goodness of Fit

clustering selection there are 2 values WSS is getting lower: the observation distance in the same group is getting lower, meaning that each cluster has more similar characteristics

The BSS/TSS ratio is close to 1, because the clustering results are increasingly representative of the actual distribution of the data

so that the selection of clustering falls on the central value = 6

Result Of Clustering

data$kelompok <- data_cl1$cluster

data %>% 
  select (-c(CustomerID,Gender)) %>%
  group_by(kelompok) %>% 
  summarise_all(mean)
  • in cluster 1 obtained an average age of 42 with an income of 80.13 $ and a score of 14.08 points
  • in cluster 2 obtained an average age of 56 with an income of 48.84 $ and a score of 41.78 points
  • in cluster 3 obtained an average age of 25 with an income of 25.73 $ and a score of 79.36 points
  • in cluster 4 obtained an average age of 40 with an income of 108.6 $ and a score of 23.20 points
  • in cluster 5 obtained an average age of 33 with an income of 86.54 $ and a score of 82.13 points
  • in cluster 6 obtained an average age of 28 with an income of 50.00 $ and a score of 44.64 points

Checking Customer

Loyal Customer

data %>% 
  filter(kelompok == 5)

The table above is the id details of loyal customers. Selection of loyal customers can be seen from a fairly high score with above average income and a productive age.

Need Special Treatment

data %>% 
  filter(kelompok == 4)

selection of customers who need special attention is to look at the smallest score with a very large income and are still productive

Visualization Clustering

fviz_cluster(
  object = data_cl1,
  data = data %>% select(-kelompok)
)

It can be seen that cluster 5 occupies the upper right position and the cluster in the lower right, for cluster 2 is in the lower left position so that special attention is not needed because in terms of income it is still below average and is still young.

Conclusion

  • It is necessary to hold special promos for men’s products so that male customers increase
  • 6 customer clusters were obtained, each cluster having special characteristics
  • Cluster 5 is a loyal customer so it needs to given special prizes for maintain customer loyalty such as additional points to exchanged for certain prizes.
  • For cluster 4, a special approach is needed to make shopping at the mall more frequent. such as providing updated information about available promos