The dataset used is customer segmentation data obtained from a mall. from the data writer want to know the cluster of each customer. This cluster is useful for tracking customers who are included in the priority category or customers who have less transactions.
data = read.csv("E:/Algoritma/7_lbb_unsupervised/Mall_Customers.csv")
head(data)CustomerID : ID of CustomerGender : Gender of CustomerAge : Age of CustomerAnnual.Income : Income of CustomerSpending.Score : Score assigned by the mall based on
customer behavior and spending natureWe’ll set-up caching for this notebook given how computationally expensive some of the code we will write can get.
written library is very useful for the results of the analysis
library(tidyverse)## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr 1.1.1 âś” readr 2.1.4
## âś” forcats 1.0.0 âś” stringr 1.5.0
## âś” ggplot2 3.4.2 âś” tibble 3.2.1
## âś” lubridate 1.9.2 âś” tidyr 1.3.0
## âś” purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cluster)## Warning: package 'cluster' was built under R version 4.2.3
library(ggforce)## Warning: package 'ggforce' was built under R version 4.2.3
library(scales)##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(cowplot)##
## Attaching package: 'cowplot'
##
## The following object is masked from 'package:lubridate':
##
## stamp
library(FactoMineR)## Warning: package 'FactoMineR' was built under R version 4.2.3
library(factoextra)## Warning: package 'factoextra' was built under R version 4.2.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(dplyr)
library(tidyr)
library(GGally)## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(gridExtra)##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(plotly)##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(ggplot2)
library(glue)summary(data)## CustomerID Gender Age Annual.Income
## Min. : 1.00 Length:200 Min. :18.00 Min. : 15.00
## 1st Qu.: 50.75 Class :character 1st Qu.:28.75 1st Qu.: 41.50
## Median :100.50 Mode :character Median :36.00 Median : 61.50
## Mean :100.50 Mean :38.85 Mean : 60.56
## 3rd Qu.:150.25 3rd Qu.:49.00 3rd Qu.: 78.00
## Max. :200.00 Max. :70.00 Max. :137.00
## Spending.Score
## Min. : 1.00
## 1st Qu.:34.75
## Median :50.00
## Mean :50.20
## 3rd Qu.:73.00
## Max. :99.00
CustomerID : there are 200 samples used in the
dataset.Age : the youngest is 18 years old and the oldest is 70
years old.Annual.Income : the average income of customers is
around 60.56$.Spending.Score : the highest score is at 99 and the
lowest is at 1data_gender <- data %>%
group_by(Gender) %>%
summarise(sum = n()) %>%
ungroup() %>%
mutate(label1 = glue(
"Gender : {Gender}
Count: {comma(sum)}"
))
plot1 <- ggplot(data = data_gender, aes(x = sum,
y = reorder(Gender, sum),
text = label1)) +
geom_col(aes(fill = sum)) +
scale_fill_gradient(low="pink", high="maroon") +
labs(title = "Gender",
x = "Amount Gender",
y = "Label Gender") +
theme_minimal() +
theme(legend.position = "none")
ggplotly(plot1, tooltip = "text")In the above plot it can be seen that most of the customers are female, which indicates that women prefer shopping than men. so to increase male customers, the mall management can increase promotions for men’s products, end than male customers more enthusiastic shopping at mall.
glimpse(data)## Rows: 200
## Columns: 5
## $ CustomerID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ Gender <chr> "Male", "Male", "Female", "Female", "Female", "Female",…
## $ Age <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35, 58, 24,…
## $ Annual.Income <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19, 20, 20,…
## $ Spending.Score <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15, 77, 13…
The initial data structure can be seen above. for the Gender variable is of character type, so it needs to be changed so that it can be processed
data$Gender = ifelse(data$Gender == 'Male', 0,1)
glimpse(data)## Rows: 200
## Columns: 5
## $ CustomerID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ Gender <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0…
## $ Age <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35, 58, 24,…
## $ Annual.Income <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19, 20, 20,…
## $ Spending.Score <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15, 77, 13…
the result of changes from the Gender variable 0 = Male 1 = Female
data_clean = data %>%
select( -c(CustomerID)) %>%
mutate( Gender = as.factor(Gender))
glimpse(data_clean)## Rows: 200
## Columns: 4
## $ Gender <fct> 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0…
## $ Age <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35, 58, 24,…
## $ Annual.Income <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19, 20, 20,…
## $ Spending.Score <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15, 77, 13…
then change the Gender data type to categorical and remove unused variables
colSums(is.na(data_clean))## Gender Age Annual.Income Spending.Score
## 0 0 0 0
the dataset that we have does not have a missing value so that further analysis can be carried out and no missing value handling is required
data_scale = data_clean[,2:4]
data_scale = scale(data_scale)
head(data_scale)## Age Annual.Income Spending.Score
## [1,] -1.4210029 -1.734646 -0.4337131
## [2,] -1.2778288 -1.734646 1.1927111
## [3,] -1.3494159 -1.696572 -1.7116178
## [4,] -1.1346547 -1.696572 1.0378135
## [5,] -0.5619583 -1.658498 -0.3949887
## [6,] -1.2062418 -1.658498 0.9990891
because the variables used have different units, it’s necessary to scaling so that all variables are equivalent. and separating Gender at the scaling step, because the range of Gender is the same as the results of the scaling.
data_join = data.frame(data_clean$Gender, data_scale)
data_join = data_join %>%
mutate ( Gender = data_clean.Gender) %>%
select ( -c(data_clean.Gender))
head(data_join)at this step, combines the results of scaling with gender variables.
rownames(data_join) <- data$CustomerID
head(data_join)then labeling the dataset with customerID
optimization using 2 methods : 1. Elbow Method 2. Silhouette Method
fviz_nbclust(x = data_join,
FUNcluster = kmeans,
method = "wss") From the Elbow Method, the optimal cluster is 6
fviz_nbclust(data_join, kmeans, "silhouette", k.max = 20) + labs(subtitle = "Silhouette method")From the Silhouette Method, the optimal cluster is 6
set.seed(321)
data_cl <- kmeans(x=data_join, # data numerik yang ingin di clustering
centers = 4)
summary(data_cl)## Length Class Mode
## cluster 200 -none- numeric
## centers 16 -none- numeric
## totss 1 -none- numeric
## withinss 4 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 4 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
by using cluster = 4, the center value is 16 centers and 1 iteration
data_cl$centers## Age Annual.Income Spending.Score Gender
## 1 -0.42773261 0.9724070 1.2130414 0.5500000
## 2 0.03711223 0.9876366 -1.1857814 0.5000000
## 3 -0.96008279 -0.7827991 0.3910484 0.5964912
## 4 1.08344244 -0.4893373 -0.3961802 0.5692308
The following is the center value of each cluster for each variable
head(data_cl$cluster)## 1 2 3 4 5 6
## 3 3 3 3 3 3
For example, the results of clustering show that customers ID 1 to 6 are belongs to cluster 3
data_cl$tot.withinss## [1] 253.2568
data_cl$betweenss / data_cl$totss ## [1] 0.6081315
from the results above, the Within Sum of Squares value is obtained 253.257 dan the value of Between Sum of Squares / Total Sum of Squares is 0.608
set.seed(123)
data_cl1 <- kmeans(x=data_join, # data numerik yang ingin di clustering
centers = 6)
summary(data_cl1)## Length Class Mode
## cluster 200 -none- numeric
## centers 24 -none- numeric
## totss 1 -none- numeric
## withinss 6 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 6 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
by using cluster = 6, the center value is 24 centers and 1 iteration
data_cl1$centers## Age Annual.Income Spending.Score Gender
## 1 0.2105852 0.7449156 -1.3985957 0.2500000
## 2 1.1956271 -0.4598275 -0.3262196 0.5689655
## 3 -0.9719569 -1.3262173 1.1293439 0.5909091
## 4 0.0823251 1.8290695 -1.0455584 0.8000000
## 5 -0.4408110 0.9891010 1.2364001 0.5384615
## 6 -0.7797657 -0.4020602 -0.2153735 0.6595745
The following is the center value of each cluster for each variable
head(data_cl1$cluster)## 1 2 3 4 5 6
## 6 3 6 3 6 3
For example, the results of clustering show that customers ID 2,4,6 are belongs to cluster 3 end than customers ID 1,3,5 are belongs to cluster 6
data_cl1$tot.withinss## [1] 204.513
data_cl1$betweenss / data_cl1$totss ## [1] 0.6835535
from the results above, the Within Sum of Squares value is obtained 204.513 dan the value of Between Sum of Squares / Total Sum of Squares is 0.684
clustering selection there are 2 values WSS is getting lower: the observation distance in the same group is getting lower, meaning that each cluster has more similar characteristics
The BSS/TSS ratio is close to 1, because the clustering results are increasingly representative of the actual distribution of the data
so that the selection of clustering falls on the central value = 6
data$kelompok <- data_cl1$cluster
data %>%
select (-c(CustomerID,Gender)) %>%
group_by(kelompok) %>%
summarise_all(mean)data %>%
filter(kelompok == 5)The table above is the id details of loyal customers. Selection of loyal customers can be seen from a fairly high score with above average income and a productive age.
data %>%
filter(kelompok == 4)selection of customers who need special attention is to look at the smallest score with a very large income and are still productive
fviz_cluster(
object = data_cl1,
data = data %>% select(-kelompok)
)It can be seen that cluster 5 occupies the upper right position and the cluster in the lower right, for cluster 2 is in the lower left position so that special attention is not needed because in terms of income it is still below average and is still young.