Have you ever wondered how companies can better understand their customer behavior? In this project, we will enter the world of customer segmentation using the K-means clustering method to analyze wholesale customer data. With this method, we will divide customers into similar groups based on their purchasing patterns. In doing so, we can uncover valuable insights that can help companies identify different customer segments and develop more effective marketing strategies. Get ready to gain a deep understanding of customer behavior and explore exciting new opportunities in the business world.
But before that, we are going to do some Exploratory Data Analysis to understand more about our Wholesale Customer dataset.
The dataset refers to clients of a wholesale distributor and records their annual spending in monetary units (m.u.) on diverse product categories. It comes from the UCI Machine Learning Repository and contains 440 observations with the following variables:
Fresh: annual spending (m.u.) on fresh products (Continuous)
Milk: annual spending (m.u.) on milk products (Continuous)
Grocery: annual spending (m.u.) on grocery products (Continuous)
Frozen: annual spending (m.u.) on frozen products (Continuous)
Detergents_Paper: annual spending (m.u.) on detergents and paper products (Continuous)
Delicassen: annual spending (m.u.) on delicatessen products (Continuous)
Channel: customer channel - Horeca (Hotel/Restaurant/Cafe) or Retail (Nominal)
Region: customer region - Lisbon, Oporto, or Other (Nominal)
wholesale <- read.csv("wholesale.csv", row.names = paste("Customer", 1:440))
rmarkdown::paged_table(wholesale)
glimpse(wholesale)
## Rows: 440
## Columns: 8
## $ Channel <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,…
## $ Region <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…
summary(wholesale)
## Channel Region Fresh Milk
## Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533
## Median :1.000 Median :3.000 Median : 8504 Median : 3627
## Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190
## Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
## Grocery Frozen Detergents_Paper Delicassen
## Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 4756 Median : 1526.0 Median : 816.5 Median : 965.5
## Mean : 7951 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
## 3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :92780 Max. :60869.0 Max. :40827.0 Max. :47943.0
The Channel and Region columns are categorical variables encoded as integers, so their data type needs to be changed to a factor with descriptive labels.
wholesale <- wholesale %>%
  mutate(Channel = factor(Channel, levels = c(1, 2), labels = c("Horeca", "Retail")),
         Region = factor(Region, levels = c(1, 2, 3), labels = c("Lisbon", "Oporto", "Other")))
head(wholesale)
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## Customer 1 Retail Other 12669 9656 7561 214 2674 1338
## Customer 2 Retail Other 7057 9810 9568 1762 3293 1776
## Customer 3 Retail Other 6353 8808 7684 2405 3516 7844
## Customer 4 Horeca Other 13265 1196 4221 6404 507 1788
## Customer 5 Retail Other 22615 5410 7198 3915 1777 5185
## Customer 6 Retail Other 9413 8259 5126 666 1795 1451
colSums(is.na(wholesale))
## Channel Region Fresh Milk
## 0 0 0 0
## Grocery Frozen Detergents_Paper Delicassen
## 0 0 0 0
Based on the result above, there are no missing values recorded.
Exploratory Data Analysis (EDA) involves using graphics and visualizations to explore and analyze a dataset. The goal is to explore, investigate, and learn about the variables in our dataset. First, we will look at the Pearson correlation between all of our numeric variables.
sale_num <- wholesale %>%
  select(-c(Channel, Region))
# Correlation matrix of the numeric spending variables, rounded to 2 decimals
cormat <- round(cor(sale_num), 2)
# Helpers to keep only one triangle of the symmetric correlation matrix
get_lower_tri <- function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}
lower_tri <- get_lower_tri(cormat)
# Melt to long format and build a hover label for the interactive plot
sale_cor_melt <- reshape2::melt(lower_tri, na.rm = TRUE)
sale_cor_melt <- sale_cor_melt %>%
  mutate(label = glue("{Var1} ~ {Var2}"))
plot_corr <- ggplot(sale_cor_melt,aes(Var1, Var2, text = label)) +
geom_tile(aes(fill = value)) +
geom_text(aes(label = round(value, 1)), alpha=0.5, size = 3, color = "White") +
scale_fill_gradientn(colors = c("#A8D8E0","#428793","#27636D"),
values = rescale(c(-1,0,1)),
limits = c(-1,1)) +
labs(x = NULL,
y = NULL,
fill = "Pearson Corr:") +
theme(legend.background = element_rect(fill = "#0E264E", color = "#0E264E"),
plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
panel.background = element_rect(fill = "#0E264E"),
panel.grid = element_line(colour = "#0E264E"),
panel.grid.major.x = element_line(colour = "#0E264E"),
panel.grid.minor.x = element_line(colour = "#0E264E"),
legend.title = element_text(colour = "#1B5249", face ="bold", family = "Times New Roman"),
legend.text = element_text(colour = "#1B5249", face ="bold", family = "Times New Roman"),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal",
axis.text.x = element_text(color = "#1B5249", family = "Times New Roman",
angle = 45, vjust = 1, hjust = 1),
axis.text.y = element_blank(),
axis.ticks = element_blank()) +
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
ggplotly(plot_corr, tooltip = "text") %>%
layout(hoverlabel = list( bgcolor = "rgba(255,255,255,0.75)",
font = list(
color = "Black",
family = "Cardo",
size = 12
)))
## Warning: plotly.js does not (yet) support horizontal legend items
## You can track progress here:
## https://github.com/plotly/plotly.js/issues/53
💡 The insights we can take from the plot above are:
The Detergents_Paper variable is highly correlated with the Grocery variable. This is understandable because both categories cover customers' daily household needs.
We could reduce the dimensionality of these variables with Principal Component Analysis (PCA) while retaining as much information as possible, to avoid multicollinearity in our dataset.
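As a hedged sketch of what that could look like (it is not applied in the rest of this project), base R's prcomp() can compute the principal components of the spending variables; the number of components kept below is purely illustrative.
# Illustrative only: PCA on the numeric spending variables (not used in the clustering below)
sale_pca <- prcomp(sale_num, scale. = TRUE)     # scale. = TRUE standardizes each variable first
summary(sale_pca)                               # proportion of variance explained per component
pca_scores <- as.data.frame(sale_pca$x[, 1:4])  # keeping 4 components here is only an example
head(pca_scores)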
Objective: identify the group of products with the most sales in each cluster.
Clustering refers to the practice of finding meaningful ways to group data (or create subgroups) within a dataset - the resulting groups are usually called clusters. The objective is to obtain partitions where the observations within each partition are similar to one another, while the partitions themselves are distinct from each other.
First, we take only the numerical variables:
df_sale <- wholesale %>%
  select(-c(Channel, Region))
rmarkdown::paged_table(head(df_sale))
Next, we check the variables' scales:
summary(df_sale)
## Fresh Milk Grocery Frozen
## Min. : 3 Min. : 55 Min. : 3 Min. : 25.0
## 1st Qu.: 3128 1st Qu.: 1533 1st Qu.: 2153 1st Qu.: 742.2
## Median : 8504 Median : 3627 Median : 4756 Median : 1526.0
## Mean : 12000 Mean : 5796 Mean : 7951 Mean : 3071.9
## 3rd Qu.: 16934 3rd Qu.: 7190 3rd Qu.:10656 3rd Qu.: 3554.2
## Max. :112151 Max. :73498 Max. :92780 Max. :60869.0
## Detergents_Paper Delicassen
## Min. : 3.0 Min. : 3.0
## 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 816.5 Median : 965.5
## Mean : 2881.5 Mean : 1524.9
## 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :40827.0 Max. :47943.0
We can see that all of our variables are measured in the same unit (annual spending in m.u.) and sit on broadly similar scales, so we do not need to scale the data before clustering.
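For completeness, if the variables had been on very different scales, a standardization step before clustering would look like the hedged sketch below; it is only an illustration and is not applied in this project.
# Illustrative only: z-score standardization before clustering (not applied here)
df_sale_scaled <- as.data.frame(scale(df_sale))  # center each column and divide by its standard deviation
summary(df_sale_scaled)                          # every column now has mean 0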
To find the optimal number of clusters, we seek to minimize the total within-cluster sum of squares (meaning that the distance between observations in the same cluster is as small as possible). We will use the Elbow Method, the Silhouette Method, and the Gap Statistic.
The Elbow Method chooses the number of clusters around the “bend of an elbow” in the curve: the point where diminishing returns start to kick in and adding more clusters no longer brings substantial gains.
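Under the hood, the elbow curve is simply the total within-cluster sum of squares for a range of k. As a rough sketch (the range of k and the nstart value below are chosen only for illustration), it could be computed by hand like this, before we let fviz_nbclust() from factoextra do the work:
# Manual sketch of the elbow curve: total within-cluster sum of squares for k = 1..10
set.seed(100)
wss <- sapply(1:10, function(k) kmeans(df_sale, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")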
elbowplot <- fviz_nbclust(x = df_sale,
FUNcluster = kmeans,
method = "wss",
verbose = F)
# Save to our assets
ggsave("Assets/elbowplot.png", elbowplot, height=4, width=6)
# Call it again using knitr
knitr::include_graphics("Assets/elbowplot.png")
As demonstrated by our Elbow Method, the plot suggests using 6 clusters for our dataset. However, this is still not conclusive, since the bend in the curve is not clearly defined. We will therefore also try another method, called the Silhouette Method.
The Silhouette Method measures the silhouette coefficient by calculating the mean intra-cluster distance and the mean nearest-cluster distance for each observation. What we are looking for is the peak: the number of clusters with the highest average silhouette score.
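For intuition, the average silhouette width for a single candidate k can also be computed directly with silhouette() from the cluster package (assuming that package is available; k = 2 below is just an example); fviz_nbclust() sweeps this quantity over a range of k for us.
# Manual sketch: average silhouette width for one candidate k (k = 2 as an example)
library(cluster)
set.seed(100)
km2 <- kmeans(df_sale, centers = 2)
sil <- silhouette(km2$cluster, dist(df_sale))
mean(sil[, "sil_width"])  # average silhouette width for k = 2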
silhoplot <- fviz_nbclust(x = df_sale,
FUNcluster = kmeans,
method = "silhouette",
verbose = F)
# Save to our assets
ggsave("Assets/silhoplot.png", silhoplot, height=4, width=6)
# Call it again using knitr
knitr::include_graphics("Assets/silhoplot.png")
The Silhouette Method shows that the optimal number of clusters is 2. Since these two methods suggest different optimal values of k, we lean towards the result from the Elbow Method, because fewer than 3 clusters is generally too few to give useful segments.
The Gap Statistic Method compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters is the value that maximizes the gap statistic.
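The same statistic can be computed directly with clusGap() from the cluster package; the sketch below is only an illustration (the number of reference sets B is kept small to shorten the run), and fviz_nbclust() wraps this computation for us.
# Manual sketch: gap statistic via cluster::clusGap (B bootstrap reference sets)
library(cluster)
set.seed(100)
gap <- clusGap(df_sale, FUNcluster = kmeans, K.max = 10, B = 50)
print(gap, method = "firstSEmax")  # suggested k under the "first SE max" rule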
gaplot <- fviz_nbclust(x = df_sale,
FUNcluster = kmeans,
method = "gap_stat",
verbose = F)
# Save to our assets
ggsave("Assets/gaplot.png", gaplot, height=4, width=6)
# Call it again using knitr
knitr::include_graphics("Assets/gaplot.png")
Based on the Gap Statistic Method, the optimal number of clusters is 7. The three methods produce different values of k, so we will fit K-means with each suggested k, compare the results, and favor the largest k.
K-means is a centroid-based clustering algorithm that classifies a given dataset into a pre-determined number of clusters, denoted k. The general process of K-means clustering is as follows (a minimal from-scratch sketch follows this list):
Random initialization: randomly initialize the cluster centers (centroids); the number of centroids equals the chosen k.
Cluster assignment: assign each observation to the nearest centroid based on Euclidean distance. For this step to be meaningful, the variables should be on comparable scales.
Centroid update: move each centroid to the mean of the observations assigned to its cluster.
Repeat the assignment and update steps until the centroids no longer move.
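To make these steps concrete, here is a minimal from-scratch sketch of the algorithm (Lloyd's algorithm). The function name simple_kmeans and its details are our own illustration and ignore edge cases such as empty clusters; the actual analysis uses the built-in kmeans().
# Illustrative from-scratch K-means; not used for the actual analysis below
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centroids <- x[sample(nrow(x), k), , drop = FALSE]        # 1. random initialization
  for (iter in seq_len(max_iter)) {
    # 2. cluster assignment: squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) colSums((t(x) - centroids[j, ])^2))
    assignment <- max.col(-d)                               # index of the nearest centroid
    # 3. centroid update: move each centroid to the mean of its assigned observations
    new_centroids <- t(sapply(seq_len(k), function(j) colMeans(x[assignment == j, , drop = FALSE])))
    # 4. stop once the centroids no longer move
    if (all(abs(new_centroids - centroids) < 1e-8)) break
    centroids <- new_centroids
  }
  list(cluster = assignment, centers = centroids)
}
# Example usage (reproducible thanks to set.seed):
set.seed(100)
str(simple_kmeans(df_sale, k = 2))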
Now that we have our candidate numbers of clusters (2, 6, and 7), we will run the process above using the kmeans() function.
set.seed(100)
sale_cluster_sm <- kmeans(df_sale,centers = 2)
sale_cluster_sm$betweenss/sale_cluster_sm$totss*100
## [1] 28.15958
Based on the result above, the BSS/TSS ratio is about 28%. The ideal score for this ratio is close to 100%, which means our BSS/TSS score is still far from ideal.
set.seed(100)
sale_cluster_gm <- kmeans(df_sale,centers = 7)
sale_cluster_gm$betweenss/sale_cluster_gm$totss*100
## [1] 70.99523
Based on the result above, the BSS/TSS ratio is about 71%. The ideal score for this ratio is close to 100%.
set.seed(100)
sale_cluster_em <- kmeans(df_sale,centers = 6)
sale_cluster_em$betweenss/sale_cluster_em$totss*100
## [1] 69.86398
Based on the result above, the BSS/TSS ratio is about 70%. The ideal score for this ratio is close to 100%.
The goodness of the clustering results can be seen from 3 values (a quick check on how they relate follows below):
Within Sum of Squares ($withinss): the sum of squared distances from each observation to the centroid of its cluster.
Between Sum of Squares ($betweenss): the sum of weighted squared distances from each centroid to the global mean, weighted by the number of observations in the cluster.
Total Sum of Squares ($totss): the sum of squared distances from each observation to the global mean.
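These three quantities are linked by the identity totss = betweenss + sum(withinss); the short sketch below checks this for the k = 7 model (sale_cluster_gm).
# Sanity check: the total sum of squares decomposes into between- and within-cluster parts
sale_cluster_gm$totss
sale_cluster_gm$betweenss + sum(sale_cluster_gm$withinss)  # should equal totss (up to rounding)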
paste("Size", 1:2, ":", sale_cluster_sm$size)
## [1] "Size 1 : 65" "Size 2 : 375"
paste("WSS", 1:2, ":", sale_cluster_sm$withinss)
## [1] "WSS 1 : 52876126598.5847" "WSS 2 : 60341401922.3253"
paste("BSS :",sale_cluster_sm$betweenss)
## [1] "BSS : 44378328644.6991"
paste("TSS :", sale_cluster_sm$totss)
## [1] "TSS : 157595857165.609"
paste("Size", 1:7, ":", sale_cluster_gm$size)
## [1] "Size 1 : 5" "Size 2 : 83" "Size 3 : 37" "Size 4 : 29" "Size 5 : 22"
## [6] "Size 6 : 81" "Size 7 : 183"
paste("WSS", 1:7, ":", sale_cluster_gm$withinss)
## [1] "WSS 1 : 5682449097.6" "WSS 2 : 4204770608"
## [3] "WSS 3 : 3040788197.67568" "WSS 4 : 4820212591.58621"
## [5] "WSS 5 : 15753602546.8636" "WSS 6 : 5777862771.58025"
## [7] "WSS 7 : 6430632569.32239"
paste("BSS :",sale_cluster_gm$betweenss)
## [1] "BSS : 111885538782.981"
paste("TSS :", sale_cluster_gm$totss)
## [1] "TSS : 157595857165.609"
paste("Size", 1:6, ":", sale_cluster_em$size)
## [1] "Size 1 : 5" "Size 2 : 93" "Size 3 : 100" "Size 4 : 30" "Size 5 : 23"
## [6] "Size 6 : 189"
paste("WSS", 1:6, ":", sale_cluster_em$withinss)
## [1] "WSS 1 : 5682449097.6" "WSS 2 : 5506675575.26883"
## [3] "WSS 3 : 7378812588.43" "WSS 4 : 5004238144.5"
## [5] "WSS 5 : 15989905999.1304" "WSS 6 : 7931041137.60845"
paste("BSS :",sale_cluster_em$betweenss)
## [1] "BSS : 110102734623.071"
paste("TSS :", sale_cluster_em$totss)
## [1] "TSS : 157595857165.609"
Now we will visualize our clustering results to make them easier to interpret.
df_sale$cluster <- as.factor(sale_cluster_gm$cluster)
head(df_sale)
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen cluster
## Customer 1 12669 9656 7561 214 2674 1338 3
## Customer 2 7057 9810 9568 1762 3293 1776 2
## Customer 3 6353 8808 7684 2405 3516 7844 2
## Customer 4 13265 1196 4221 6404 507 1788 7
## Customer 5 22615 5410 7198 3915 1777 5185 6
## Customer 6 9413 8259 5126 666 1795 1451 7
as.data.frame(sale_cluster_gm$centers)
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 25603.000 43460.600 61472.200 2636.000 29974.2000 2708.8000
## 2 3390.024 8187.000 12256.024 1291.072 5336.3133 1413.4699
## 3 17967.838 7925.784 11669.216 2136.405 3447.7297 2439.0541
## 4 6375.069 17634.069 26893.448 1902.207 12037.2069 2536.4483
## 5 49899.545 6995.000 6558.773 9887.182 984.5909 4681.9545
## 6 21714.309 3119.543 3345.296 4759.519 565.1728 1419.1358
## 7 6362.885 2417.038 2989.169 2699.787 715.7760 865.1694
sale_centroid <- df_sale %>%
group_by(cluster) %>%
summarise_all(mean)
sale_centroid
## # A tibble: 7 × 7
## cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 25603 43461. 61472. 2636 29974. 2709.
## 2 2 3390. 8187 12256. 1291. 5336. 1413.
## 3 3 17968. 7926. 11669. 2136. 3448. 2439.
## 4 4 6375. 17634. 26893. 1902. 12037. 2536.
## 5 5 49900. 6995 6559. 9887. 985. 4682.
## 6 6 21714. 3120. 3345. 4760. 565. 1419.
## 7 7 6363. 2417. 2989. 2700. 716. 865.
sale_centroid %>%
pivot_longer(-cluster) %>%
group_by(name) %>%
rename(Product = name) %>%
summarize(
Group_Min = which.min(value),
Group_Max = which.max(value))
## # A tibble: 6 × 3
## Product Group_Min Group_Max
## <chr> <int> <int>
## 1 Delicassen 7 5
## 2 Detergents_Paper 6 1
## 3 Fresh 2 5
## 4 Frozen 2 5
## 5 Grocery 7 1
## 6 Milk 7 1
💡 Profiling of each cluster based on the table above:
Cluster 1: the highest-spending group for Detergents_Paper, Grocery, and Milk products
Cluster 5: the highest-spending group for Delicassen, Fresh, and Frozen products
Cluster 2: the lowest-spending group for Fresh and Frozen products
Cluster 6: the lowest-spending group for Detergents_Paper products
Cluster 7: the lowest-spending group for Delicassen, Grocery, and Milk products
sale_centroid %>%
  pivot_longer(-cluster) %>%
  ggplot(aes(x = value, y = tidytext::reorder_within(name, value, cluster), fill = value)) +
  geom_col() +
  tidytext::scale_y_reordered() +  # strip reorder_within()'s suffixes from the axis labels
  scale_fill_gradient(low = "pink", high = "navy") +
  facet_wrap(~cluster, scales = "free_y") +
  theme_minimal(base_size = 8) +
  labs(title = "Product per Cluster",
       y = "",
       x = "Mean spending (m.u.)")
💡 Profiling of each cluster:
The customer clusters with the lowest total purchases, in order, are Cluster 7, Cluster 2, and Cluster 6.
The customer cluster with the highest total purchases is Cluster 1.
Looking across clusters, the products with the most purchases are mostly Grocery, Fresh, Milk, and Frozen.
We can draw several conclusions about our dataset from the clustering process:
We were able to separate our data into 7 clusters based on all numerical features, with more than 70% of the total sum of squares explained by the distances between clusters (BSS/TSS ≈ 71%).
Cluster 1 is the customer cluster with the highest purchases and Cluster 7 the one with the lowest.
The variables in the dataset appear informative enough for clustering, since a fairly large number of well-separated clusters can be formed.
A work by Ahmad Fauzi
wrk.ahmadfauzi@gmail.com
with R Language