Customer Segmentation

About Dataset

The dataset used is customer segmentation data obtained from a mall. from the data writer want to know the cluster of each customer. This cluster is useful for tracking customers who are included in the priority category or customers who have less transactions.

Dataset

data = read.csv("E:/Algoritma/7_lbb_unsupervised/Mall_Customers.csv")
head(data)

CustomerID : ID of Customer
Gender : Gender of Customer
Age : Age of Customer
Annual.Income : Income of Customer
Spending.Score : Score assigned by the mall based on customer behavior and spending nature

Libraries and Setup

We’ll set-up caching for this notebook given how computationally expensive some of the code we will write can get.

written library is very useful for the results of the analysis

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.2.3

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(cluster)

## Warning: package 'cluster' was built under R version 4.2.3

library(ggforce)

## Warning: package 'ggforce' was built under R version 4.2.3

library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

library(cowplot)

## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp

library(FactoMineR)

## Warning: package 'FactoMineR' was built under R version 4.2.3

library(factoextra)

## Warning: package 'factoextra' was built under R version 4.2.3

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(dplyr)
library(tidyr)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(gridExtra)

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(ggplot2)
library(glue)

Exploratori Data

summary(data)

##    CustomerID        Gender               Age        Annual.Income   
##  Min.   :  1.00   Length:200         Min.   :18.00   Min.   : 15.00  
##  1st Qu.: 50.75   Class :character   1st Qu.:28.75   1st Qu.: 41.50  
##  Median :100.50   Mode  :character   Median :36.00   Median : 61.50  
##  Mean   :100.50                      Mean   :38.85   Mean   : 60.56  
##  3rd Qu.:150.25                      3rd Qu.:49.00   3rd Qu.: 78.00  
##  Max.   :200.00                      Max.   :70.00   Max.   :137.00  
##  Spending.Score 
##  Min.   : 1.00  
##  1st Qu.:34.75  
##  Median :50.00  
##  Mean   :50.20  
##  3rd Qu.:73.00  
##  Max.   :99.00

CustomerID : there are 200 samples used in the dataset.
Age : the youngest is 18 years old and the oldest is 70 years old.
Annual.Income : the average income of customers is around 60.56$.
Spending.Score : the highest score is at 99 and the lowest is at 1

Gender Plot

data_gender <- data %>% 
  group_by(Gender) %>% 
  summarise(sum = n()) %>% 
  ungroup() %>%
  mutate(label1 = glue(
    "Gender : {Gender}
    Count: {comma(sum)}"
  ))


plot1 <- ggplot(data = data_gender, aes(x = sum, 
                                       y = reorder(Gender, sum), 
                                       text = label1)) +
  geom_col(aes(fill = sum)) +
  scale_fill_gradient(low="pink", high="maroon") +
  labs(title = "Gender",
       x = "Amount Gender",
       y = "Label Gender") +
  theme_minimal() +
  theme(legend.position = "none") 

ggplotly(plot1, tooltip = "text")

In the above plot it can be seen that most of the customers are female, which indicates that women prefer shopping than men. so to increase male customers, the mall management can increase promotions for men’s products, end than male customers more enthusiastic shopping at mall.

Data Preprocesing

Data Wrangling

glimpse(data)

## Rows: 200
## Columns: 5
## $ CustomerID     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ Gender         <chr> "Male", "Male", "Female", "Female", "Female", "Female",…
## $ Age            <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35, 58, 24,…
## $ Annual.Income  <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19, 20, 20,…
## $ Spending.Score <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15, 77, 13…

The initial data structure can be seen above. for the Gender variable is of character type, so it needs to be changed so that it can be processed

data$Gender = ifelse(data$Gender == 'Male', 0,1)
glimpse(data)

## Rows: 200
## Columns: 5
## $ CustomerID     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ Gender         <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0…
## $ Age            <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35, 58, 24,…
## $ Annual.Income  <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19, 20, 20,…
## $ Spending.Score <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15, 77, 13…

the result of changes from the Gender variable 0 = Male 1 = Female

data_clean = data %>%
  select( -c(CustomerID)) %>%
  mutate( Gender = as.factor(Gender))
glimpse(data_clean)

## Rows: 200
## Columns: 4
## $ Gender         <fct> 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0…
## $ Age            <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35, 58, 24,…
## $ Annual.Income  <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19, 20, 20,…
## $ Spending.Score <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15, 77, 13…

then change the Gender data type to categorical and remove unused variables

Missing Value

colSums(is.na(data_clean))

##         Gender            Age  Annual.Income Spending.Score 
##              0              0              0              0

the dataset that we have does not have a missing value so that further analysis can be carried out and no missing value handling is required

Scaling

data_scale = data_clean[,2:4]
data_scale = scale(data_scale)
head(data_scale)

##             Age Annual.Income Spending.Score
## [1,] -1.4210029     -1.734646     -0.4337131
## [2,] -1.2778288     -1.734646      1.1927111
## [3,] -1.3494159     -1.696572     -1.7116178
## [4,] -1.1346547     -1.696572      1.0378135
## [5,] -0.5619583     -1.658498     -0.3949887
## [6,] -1.2062418     -1.658498      0.9990891

because the variables used have different units, it’s necessary to scaling so that all variables are equivalent. and separating Gender at the scaling step, because the range of Gender is the same as the results of the scaling.

Joining Dataset

data_join = data.frame(data_clean$Gender, data_scale)
data_join = data_join %>%
  mutate ( Gender = data_clean.Gender) %>%
  select ( -c(data_clean.Gender))

head(data_join)

at this step, combines the results of scaling with gender variables.

rownames(data_join) <- data$CustomerID
head(data_join)

then labeling the dataset with customerID

Clustering

Optimization

optimization using 2 methods : 1. Elbow Method 2. Silhouette Method

Elbow Method

fviz_nbclust(x = data_join, 
             FUNcluster = kmeans, 
             method = "wss")

From the Elbow Method, the optimal cluster is 6

Silhouette Method

fviz_nbclust(data_join, kmeans, "silhouette", k.max = 20) + labs(subtitle = "Silhouette method")

From the Silhouette Method, the optimal cluster is 6

Clustering

Cluster = 4

set.seed(321)

data_cl <- kmeans(x=data_join, # data numerik yang ingin di clustering
                    centers = 4)
summary(data_cl)

##              Length Class  Mode   
## cluster      200    -none- numeric
## centers       16    -none- numeric
## totss          1    -none- numeric
## withinss       4    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           4    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

by using cluster = 4, the center value is 16 centers and 1 iteration

data_cl$centers

##           Age Annual.Income Spending.Score    Gender
## 1 -0.42773261     0.9724070      1.2130414 0.5500000
## 2  0.03711223     0.9876366     -1.1857814 0.5000000
## 3 -0.96008279    -0.7827991      0.3910484 0.5964912
## 4  1.08344244    -0.4893373     -0.3961802 0.5692308

The following is the center value of each cluster for each variable

head(data_cl$cluster)

## 1 2 3 4 5 6 
## 3 3 3 3 3 3

For example, the results of clustering show that customers ID 1 to 6 are belongs to cluster 3

data_cl$tot.withinss

## [1] 253.2568

data_cl$betweenss / data_cl$totss

## [1] 0.6081315

from the results above, the Within Sum of Squares value is obtained 253.257 dan the value of Between Sum of Squares / Total Sum of Squares is 0.608

Cluster = 6

set.seed(123)

data_cl1 <- kmeans(x=data_join, # data numerik yang ingin di clustering
                    centers = 6)
summary(data_cl1)

##              Length Class  Mode   
## cluster      200    -none- numeric
## centers       24    -none- numeric
## totss          1    -none- numeric
## withinss       6    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           6    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

by using cluster = 6, the center value is 24 centers and 1 iteration

data_cl1$centers

##          Age Annual.Income Spending.Score    Gender
## 1  0.2105852     0.7449156     -1.3985957 0.2500000
## 2  1.1956271    -0.4598275     -0.3262196 0.5689655
## 3 -0.9719569    -1.3262173      1.1293439 0.5909091
## 4  0.0823251     1.8290695     -1.0455584 0.8000000
## 5 -0.4408110     0.9891010      1.2364001 0.5384615
## 6 -0.7797657    -0.4020602     -0.2153735 0.6595745

The following is the center value of each cluster for each variable

head(data_cl1$cluster)

## 1 2 3 4 5 6 
## 6 3 6 3 6 3

For example, the results of clustering show that customers ID 2,4,6 are belongs to cluster 3 end than customers ID 1,3,5 are belongs to cluster 6

data_cl1$tot.withinss

## [1] 204.513

data_cl1$betweenss / data_cl1$totss

## [1] 0.6835535

from the results above, the Within Sum of Squares value is obtained 204.513 dan the value of Between Sum of Squares / Total Sum of Squares is 0.684

Goodness of Fit

clustering selection there are 2 values WSS is getting lower: the observation distance in the same group is getting lower, meaning that each cluster has more similar characteristics

The BSS/TSS ratio is close to 1, because the clustering results are increasingly representative of the actual distribution of the data

so that the selection of clustering falls on the central value = 6

Result Of Clustering

data$kelompok <- data_cl1$cluster

data %>% 
  select (-c(CustomerID,Gender)) %>%
  group_by(kelompok) %>% 
  summarise_all(mean)

in cluster 1 obtained an average age of 42 with an income of 80.13 $ and a score of 14.08 points
in cluster 2 obtained an average age of 56 with an income of 48.84 $ and a score of 41.78 points
in cluster 3 obtained an average age of 25 with an income of 25.73 $ and a score of 79.36 points
in cluster 4 obtained an average age of 40 with an income of 108.6 $ and a score of 23.20 points
in cluster 5 obtained an average age of 33 with an income of 86.54 $ and a score of 82.13 points
in cluster 6 obtained an average age of 28 with an income of 50.00 $ and a score of 44.64 points

Checking Customer

Loyal Customer

data %>% 
  filter(kelompok == 5)

The table above is the id details of loyal customers. Selection of loyal customers can be seen from a fairly high score with above average income and a productive age.

Need Special Treatment

data %>% 
  filter(kelompok == 4)

selection of customers who need special attention is to look at the smallest score with a very large income and are still productive

Visualization Clustering

fviz_cluster(
  object = data_cl1,
  data = data %>% select(-kelompok)
)

It can be seen that cluster 5 occupies the upper right position and the cluster in the lower right, for cluster 2 is in the lower left position so that special attention is not needed because in terms of income it is still below average and is still young.

Conclusion

It is necessary to hold special promos for men’s products so that male customers increase
6 customer clusters were obtained, each cluster having special characteristics
Cluster 5 is a loyal customer so it needs to given special prizes for maintain customer loyalty such as additional points to exchanged for certain prizes.
For cluster 4, a special approach is needed to make shopping at the mall more frequent. such as providing updated information about available promos

Customer Segmentation

Firmanur

2023-04-27

Customer Segmentation

About Dataset

Dataset

Libraries and Setup

Exploratori Data

Gender Plot

Data Preprocesing

Data Wrangling

Missing Value

Scaling

Joining Dataset

Clustering

Optimization

Elbow Method

Silhouette Method

Clustering

Cluster = 4

Cluster = 6

Goodness of Fit

Result Of Clustering

Checking Customer

Loyal Customer

Need Special Treatment

Visualization Clustering

Conclusion