Have you ever wondered how companies can better understand their customer behavior? In this project, we will enter the world of customer segmentation using the K-means clustering method to analyze wholesale customer data. With this method, we will divide customers into similar groups based on their purchasing patterns. In doing so, we can uncover valuable insights that can help companies identify different customer segments and develop more effective marketing strategies. Get ready to gain a deep understanding of customer behavior and explore exciting new opportunities in the business world.

But before that, we are going to do some Exploratory Data Analysis to understand more about our Wholesale Customer dataset.
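Throughout this post we rely on a handful of R packages. The setup chunk is not shown in the rendered output, so the list below is reconstructed from the code that follows (packages called with ::, such as rmarkdown, reshape2, and tidytext, do not need to be attached):

library(dplyr)      # data wrangling: glimpse(), mutate(), group_by()
library(tidyr)      # pivot_longer()
library(ggplot2)    # plotting
library(scales)     # rescale() for the correlation color gradient
library(glue)       # tooltip labels for the correlation heatmap
library(plotly)     # ggplotly() for the interactive plot
library(factoextra) # fviz_nbclust() for choosing the number of clusters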

1 Dataset

1.1 Description

The dataset refers to clients of a wholesale distributor and records their annual spending in monetary units (m.u.) on diverse product categories. It comes from the UCI Machine Learning Repository and contains 440 observations with the following variables:

  • FRESH: annual spending (m.u.) on fresh products (Continuous)
  • MILK: annual spending (m.u.) on milk products (Continuous)
  • GROCERY: annual spending (m.u.) on grocery products (Continuous)
  • FROZEN: annual spending (m.u.) on frozen products (Continuous)
  • DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
  • DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
  • CHANNEL: customer's channel - Horeca (Hotel/Restaurant/Cafe) or Retail (Nominal)
  • REGION: customer's region - Lisbon, Oporto, or Other (Nominal)

1.2 Table Preview

wholesale <- read.csv("wholesale.csv", row.names = paste("Customer", 1:440))
rmarkdown::paged_table(wholesale)

1.3 Data Wrangling

  • Check the Data Types and Data Content:
glimpse(wholesale)
## Rows: 440
## Columns: 8
## $ Channel          <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,…
## $ Region           <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…
summary(wholesale)
##     Channel          Region          Fresh             Milk      
##  Min.   :1.000   Min.   :1.000   Min.   :     3   Min.   :   55  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:  3128   1st Qu.: 1533  
##  Median :1.000   Median :3.000   Median :  8504   Median : 3627  
##  Mean   :1.323   Mean   :2.543   Mean   : 12000   Mean   : 5796  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.: 16934   3rd Qu.: 7190  
##  Max.   :2.000   Max.   :3.000   Max.   :112151   Max.   :73498  
##     Grocery          Frozen        Detergents_Paper    Delicassen     
##  Min.   :    3   Min.   :   25.0   Min.   :    3.0   Min.   :    3.0  
##  1st Qu.: 2153   1st Qu.:  742.2   1st Qu.:  256.8   1st Qu.:  408.2  
##  Median : 4756   Median : 1526.0   Median :  816.5   Median :  965.5  
##  Mean   : 7951   Mean   : 3071.9   Mean   : 2881.5   Mean   : 1524.9  
##  3rd Qu.:10656   3rd Qu.: 3554.2   3rd Qu.: 3922.0   3rd Qu.: 1820.2  
##  Max.   :92780   Max.   :60869.0   Max.   :40827.0   Max.   :47943.0
  • It can be seen that Channel and Region are stored as integers even though they encode a small set of categories, so we convert these two columns to factors with descriptive labels:
wholesale <- wholesale %>% 
  mutate(Channel = factor(Channel, levels = c(1, 2), labels = c("Horeca", "Retail")),
         Region  = factor(Region, levels = c(1, 2, 3), labels = c("Lisbon", "Oporto", "Other")))
head(wholesale)
##            Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## Customer 1  Retail  Other 12669 9656    7561    214             2674       1338
## Customer 2  Retail  Other  7057 9810    9568   1762             3293       1776
## Customer 3  Retail  Other  6353 8808    7684   2405             3516       7844
## Customer 4  Horeca  Other 13265 1196    4221   6404              507       1788
## Customer 5  Retail  Other 22615 5410    7198   3915             1777       5185
## Customer 6  Retail  Other  9413 8259    5126    666             1795       1451
  • Check for Missing Values:
colSums(is.na(wholesale))
##          Channel           Region            Fresh             Milk 
##                0                0                0                0 
##          Grocery           Frozen Detergents_Paper       Delicassen 
##                0                0                0                0

Based on the result above, there are no missing values recorded.

2 Exploratory Data Analysis

Exploratory Data Analysis (EDA) involves using graphics and visualizations to explore and analyze a dataset. The goal is to explore, investigate, and learn about the variables within our dataset. First, we will examine the Pearson correlation between all of our numerical variables.

# Keep only the numerical spending variables
sale_num <- wholesale %>% 
  select(-c(Channel, Region))

cormat <- round(cor(sale_num), 2)

# Helpers that blank out one triangle of the (symmetric) correlation matrix
get_lower_tri <- function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}

get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}

lower_tri <- get_lower_tri(cormat)

# Melt to long format for ggplot and build a tooltip label per cell
sale_cor_melt <- reshape2::melt(lower_tri, na.rm = TRUE)
sale_cor_melt <- sale_cor_melt %>% 
  mutate(label = glue("{Var1} ~ {Var2}"))

plot_corr <- ggplot(sale_cor_melt,aes(Var1, Var2, text = label)) +
  geom_tile(aes(fill = value)) +
  geom_text(aes(label = round(value, 1)), alpha=0.5, size = 3, color = "White") + 
  scale_fill_gradientn(colors = c("#A8D8E0","#428793","#27636D"),
                      values = rescale(c(-1,0,1)),
                      limits = c(-1,1)) +
  labs(x = NULL,
      y = NULL,
      fill = "Pearson Corr:") +
  theme(legend.background = element_rect(fill = "#0E264E", color = "#0E264E"),
        plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
        panel.background = element_rect(fill = "#0E264E"),
        panel.grid = element_line(colour = "#0E264E"),
        panel.grid.major.x = element_line(colour = "#0E264E"),
        panel.grid.minor.x = element_line(colour = "#0E264E"),
        legend.title = element_text(colour = "#1B5249", face ="bold", family = "Times New Roman"),
        legend.text = element_text(colour = "#1B5249", face ="bold", family = "Times New Roman"),
        legend.justification = c(1, 0),
        legend.position = c(0.6, 0.7),
        legend.direction = "horizontal",
        axis.text.x = element_text(color = "#1B5249", family = "Times New Roman",
                                angle = 45, vjust = 1, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks = element_blank()) +
        guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                  title.position = "top", title.hjust = 0.5))

ggplotly(plot_corr, tooltip = "text") %>% 
  layout(hoverlabel = list( bgcolor = "rgba(255,255,255,0.75)",
                         font = list(
                           color = "Black",
                           family = "Cardo",
                           size = 12
                         )))
## Warning: plotly.js does not (yet) support horizontal legend items 
## You can track progress here: 
## https://github.com/plotly/plotly.js/issues/53

💡 The insights we can take from the plot above are:

  • The Detergents_Paper variable is highly correlated with the Grocery variable (roughly 0.9). This is understandable: both categories cover everyday household staples that tend to be purchased together.

  • To avoid multicollinearity in downstream models, we could reduce the dimensionality of these variables with Principal Component Analysis (PCA) while retaining as much information as possible; a quick check is sketched below.
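As a quick, optional check (not applied in the remainder of this post), we can see how much variance PCA would retain on the standardized spending variables:

# Optional exploratory check: variance retained by principal components
pca <- prcomp(sale_num, scale. = TRUE)
summary(pca)  # the "Cumulative Proportion" row shows variance retained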

3 Clustering

Objective: segment customers by their spending patterns and identify which product categories sell the most within each cluster.

Clustering refers to the practice of finding meaningful ways to group data (or create subgroups) within a dataset; the resulting groups are usually called clusters. The objective is to obtain partitions where the observations within each partition are similar to one another, while the partitions themselves are distinct.

3.1 Data Pre-processing

  • Select only the numerical variables
df_sale <- wholesale %>% 
  select(-c(Channel, Region))

rmarkdown::paged_table(head(df_sale))
  • Check our variables' scales
summary(df_sale)
##      Fresh             Milk          Grocery          Frozen       
##  Min.   :     3   Min.   :   55   Min.   :    3   Min.   :   25.0  
##  1st Qu.:  3128   1st Qu.: 1533   1st Qu.: 2153   1st Qu.:  742.2  
##  Median :  8504   Median : 3627   Median : 4756   Median : 1526.0  
##  Mean   : 12000   Mean   : 5796   Mean   : 7951   Mean   : 3071.9  
##  3rd Qu.: 16934   3rd Qu.: 7190   3rd Qu.:10656   3rd Qu.: 3554.2  
##  Max.   :112151   Max.   :73498   Max.   :92780   Max.   :60869.0  
##  Detergents_Paper    Delicassen     
##  Min.   :    3.0   Min.   :    3.0  
##  1st Qu.:  256.8   1st Qu.:  408.2  
##  Median :  816.5   Median :  965.5  
##  Mean   : 2881.5   Mean   : 1524.9  
##  3rd Qu.: 3922.0   3rd Qu.: 1820.2  
##  Max.   :40827.0   Max.   :47943.0

We can see that all of our variables are measured in the same unit (annual spending in m.u.) and on comparable scales, so we do not need to scale them.
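If we did prefer fully comparable ranges, standardization would be a one-liner; here is a sketch of the alternative path we are not taking (df_sale_scaled is a hypothetical name):

# Optional: z-score standardization (not used in the rest of this analysis)
df_sale_scaled <- as.data.frame(scale(df_sale))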

3.2 Finding Optimum Number of Clusters

In finding the optimal number of clusters, we seek to minimize the total within-cluster sum of squares, i.e., the distance between observations in the same cluster should be as small as possible. To find the optimal number of clusters we will use the Elbow Method, the Silhouette Method, and the Gap Statistic.
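Concretely, for $K$ clusters $C_1, \dots, C_K$ with centroids $\mu_k$, the quantity being minimized is

$$\mathrm{WSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$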

  1. Elbow Method

With the Elbow Method, we choose the number of clusters at the "bend of the elbow": the point where diminishing returns kick in and adding more clusters stops yielding a substantial drop in the total within-cluster sum of squares.

elbowplot <- fviz_nbclust(x = df_sale, 
                          FUNcluster = kmeans,
                          method = "wss",
                          verbose = F)


# Save to our assets
ggsave("Assets/elbowplot.png", elbowplot, height=4, width=6)

# Call it again using knitr
knitr::include_graphics("Assets/elbowplot.png")

As our Elbow Method demonstrates, the plot suggests using 6 clusters for our dataset. However, the elbow itself is not sharply defined, so we will also try another method, called the Silhouette Method.

  2. Silhouette Method

The Silhouette Method measures the silhouette coefficient by calculating the mean intra-cluster distance and the mean nearest-cluster distance for each observation. We are looking for the peak: the number of clusters with the highest average silhouette score.
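For a single observation $i$, let $a(i)$ be its mean distance to the other members of its own cluster and $b(i)$ its mean distance to the members of the nearest other cluster; then

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\ b(i)\}}$$

Values near 1 indicate a well-placed observation, while values near 0 indicate one sitting between two clusters.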

silhoplot <- fviz_nbclust(x = df_sale, 
                          FUNcluster = kmeans,
                          method = "silhouette",
                          verbose = F) 

# Save to our assets
ggsave("Assets/silhoplot.png", silhoplot, height=4, width=6)

# Call it again using knitr
knitr::include_graphics("Assets/silhoplot.png")

The Silhouette Method shows that the optimal number of clusters is 2.

Since these two methods suggest different optimal values of k, we lean toward the Elbow Method result: fewer than 3 clusters is generally too coarse a segmentation to be useful.

  3. Gap Statistic Method

The Gap Statistic Method compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimated optimal number of clusters is the value that maximizes the gap statistic.
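Formally, with $W_k$ denoting the total within-cluster sum of squares for $k$ clusters,

$$\mathrm{Gap}(k) = \mathbb{E}^*\left[\log W_k\right] - \log W_k,$$

where the expectation is estimated by clustering datasets drawn from a reference (e.g., uniform) distribution.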

gaplot <- fviz_nbclust(x = df_sale, 
                          FUNcluster = kmeans,
                          method = "gap_stat",
                          verbose = F)

# Save to our assets
ggsave("Assets/gaplot.png", gaplot, height=4, width=6)

# Call it again using knitr
knitr::include_graphics("Assets/gaplot.png")

Based on the Gap Statistic Method, the optimal number of clusters is 7. The three methods thus produce different values of k, so we will fit K-means with each candidate and compare the results.

3.3 K-Means Clustering

K-means is a centroid-based clustering algorithm that partitions a given dataset into a pre-determined number of clusters, denoted k. The general process of K-means clustering is as follows (a minimal from-scratch sketch follows the list):

  • Random initialization: randomly initialize the cluster centers (centroids); the number of centroids equals the chosen k.

  • Cluster assignment: assign each observation to the nearest centroid by Euclidean distance. For this step, the variables should ideally be on the same scale.

  • Centroid update: move each centroid to the mean of the observations assigned to its cluster.

  • Repeat the assignment and update steps until the centroids no longer move (convergence).
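To make these steps concrete, here is a minimal from-scratch sketch of the loop (illustrative only; the actual analysis below uses the built-in kmeans() function, and kmeans_sketch() is our own hypothetical name):

# Minimal illustrative K-means loop -- use stats::kmeans() in practice
kmeans_sketch <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # Random initialization: pick k observations as the starting centroids
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  assignment <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # Cluster assignment: squared Euclidean distance to each centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
    new_assignment <- max.col(-d)                 # index of the nearest centroid
    if (all(new_assignment == assignment)) break  # converged: nothing moved
    assignment <- new_assignment
    # Centroid update: mean of the observations in each cluster
    # (a production implementation would also guard against empty clusters)
    for (j in seq_len(k)) {
      centroids[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}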

  1. Optimal K value based on Silhouette Method

Now that we have the number of cluster centers/centroids (which is 2), we run the process above using the kmeans() function.

set.seed(100)

sale_cluster_sm <- kmeans(df_sale, centers = 2)
sale_cluster_sm$betweenss / sale_cluster_sm$totss * 100
## [1] 28.15958

Based on the result above, the BSS/TSS ratio is about 28%. The ideal score for this ratio is close to 100%, so the 2-cluster solution leaves most of the variation unexplained.

  2. Optimal K value based on Gap Statistic Method
set.seed(100)

sale_cluster_gm <- kmeans(df_sale, centers = 7)
sale_cluster_gm$betweenss / sale_cluster_gm$totss * 100
## [1] 70.99523

Based on the result above, the BSS/TSS ratio is about 71%, much closer to the ideal of 100%.

  3. Optimal K value based on Elbow Method
set.seed(100)

sale_cluster_em <- kmeans(df_sale, centers = 6)
sale_cluster_em$betweenss / sale_cluster_em$totss * 100
## [1] 69.86398

Based on the result above, the BSS/TSS ratio is about 70%, essentially on par with the 7-cluster solution.
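For convenience, the three ratios can be placed side by side using the model objects fitted above (the list names are our own labels):

# Compare BSS/TSS (%) across the three candidate values of k
sapply(list(k2_silhouette = sale_cluster_sm,
            k6_elbow      = sale_cluster_em,
            k7_gap        = sale_cluster_gm),
       function(m) round(m$betweenss / m$totss * 100, 2))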

3.4 Goodness of Fit

The goodness of the clustering results can be assessed from three values:

  • Within Sum of Squares ($withinss): sum of the squared distances from each observation to the centroid of each cluster.

  • Between Sum of Squares ($betweenss): the sum of the weighted squared distances from each centroid to the global mean. Weighted based on the number of observations in the cluster.

  • Total Sum of Squares ($totss): sum of the squared distances from each observation to the global mean.
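These three quantities are tied together by the decomposition

$$\mathrm{TSS} = \mathrm{BSS} + \sum_{k=1}^{K} \mathrm{WSS}_k,$$

which is why a BSS/TSS ratio close to 1 indicates that most of the total variation is explained by the separation between clusters.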

paste("Size", 1:2, ":", sale_cluster_sm$size)
## [1] "Size 1 : 65"  "Size 2 : 375"
paste("WSS", 1:2, ":", sale_cluster_sm$withinss)
## [1] "WSS 1 : 52876126598.5847" "WSS 2 : 60341401922.3253"
paste("BSS :",sale_cluster_sm$betweenss)
## [1] "BSS : 44378328644.6991"
paste("TSS :", sale_cluster_sm$totss)
## [1] "TSS : 157595857165.609"
  2. Gap Statistic Method
paste("Size", 1:7, ":", sale_cluster_gm$size)
## [1] "Size 1 : 5"   "Size 2 : 83"  "Size 3 : 37"  "Size 4 : 29"  "Size 5 : 22" 
## [6] "Size 6 : 81"  "Size 7 : 183"
paste("WSS", 1:7, ":", sale_cluster_gm$withinss)
## [1] "WSS 1 : 5682449097.6"     "WSS 2 : 4204770608"      
## [3] "WSS 3 : 3040788197.67568" "WSS 4 : 4820212591.58621"
## [5] "WSS 5 : 15753602546.8636" "WSS 6 : 5777862771.58025"
## [7] "WSS 7 : 6430632569.32239"
paste("BSS :",sale_cluster_gm$betweenss)
## [1] "BSS : 111885538782.981"
paste("TSS :", sale_cluster_gm$totss)
## [1] "TSS : 157595857165.609"
  3. Elbow Method
paste("Size", 1:6, ":", sale_cluster_em$size)
## [1] "Size 1 : 5"   "Size 2 : 93"  "Size 3 : 100" "Size 4 : 30"  "Size 5 : 23" 
## [6] "Size 6 : 189"
paste("WSS", 1:6, ":", sale_cluster_em$withinss)
## [1] "WSS 1 : 5682449097.6"     "WSS 2 : 5506675575.26883"
## [3] "WSS 3 : 7378812588.43"    "WSS 4 : 5004238144.5"    
## [5] "WSS 5 : 15989905999.1304" "WSS 6 : 7931041137.60845"
paste("BSS :",sale_cluster_em$betweenss)
## [1] "BSS : 110102734623.071"
paste("TSS :", sale_cluster_em$totss)
## [1] "TSS : 157595857165.609"

3.5 Cluster Analysis & Profiling

Now we will visualize our clustering results for easier reading. We proceed with the 7-cluster solution from the Gap Statistic Method and attach each customer's cluster label to the data:

df_sale$cluster <- as.factor(sale_cluster_gm$cluster)
head(df_sale)
##            Fresh Milk Grocery Frozen Detergents_Paper Delicassen cluster
## Customer 1 12669 9656    7561    214             2674       1338       3
## Customer 2  7057 9810    9568   1762             3293       1776       2
## Customer 3  6353 8808    7684   2405             3516       7844       2
## Customer 4 13265 1196    4221   6404              507       1788       7
## Customer 5 22615 5410    7198   3915             1777       5185       6
## Customer 6  9413 8259    5126    666             1795       1451       7
  1. Grouping data based on cluster label
as.data.frame(sale_cluster_gm$centers)
##       Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1 25603.000 43460.600 61472.200 2636.000       29974.2000  2708.8000
## 2  3390.024  8187.000 12256.024 1291.072        5336.3133  1413.4699
## 3 17967.838  7925.784 11669.216 2136.405        3447.7297  2439.0541
## 4  6375.069 17634.069 26893.448 1902.207       12037.2069  2536.4483
## 5 49899.545  6995.000  6558.773 9887.182         984.5909  4681.9545
## 6 21714.309  3119.543  3345.296 4759.519         565.1728  1419.1358
## 7  6362.885  2417.038  2989.169 2699.787         715.7760   865.1694
sale_centroid <- df_sale %>% 
  group_by(cluster) %>% 
  summarise_all(mean)
sale_centroid
## # A tibble: 7 × 7
##   cluster  Fresh   Milk Grocery Frozen Detergents_Paper Delicassen
##   <fct>    <dbl>  <dbl>   <dbl>  <dbl>            <dbl>      <dbl>
## 1 1       25603  43461.  61472.  2636            29974.      2709.
## 2 2        3390.  8187   12256.  1291.            5336.      1413.
## 3 3       17968.  7926.  11669.  2136.            3448.      2439.
## 4 4        6375. 17634.  26893.  1902.           12037.      2536.
## 5 5       49900.  6995    6559.  9887.             985.      4682.
## 6 6       21714.  3120.   3345.  4760.             565.      1419.
## 7 7        6363.  2417.   2989.  2700.             716.       865.
  2. Simplified profiling: a table that displays the clusters with the lowest and highest mean spending for each product
sale_centroid %>% 
  pivot_longer(-cluster) %>% 
  group_by(name) %>% 
  rename(Product = name) %>% 
  summarize(
    Group_Min = which.min(value),
    Group_Max = which.max(value))
## # A tibble: 6 × 3
##   Product          Group_Min Group_Max
##   <chr>                <int>     <int>
## 1 Delicassen               7         5
## 2 Detergents_Paper         6         1
## 3 Fresh                    2         5
## 4 Frozen                   2         5
## 5 Grocery                  7         1
## 6 Milk                     7         1

💡 Profiling of each cluster (reading off the table above):

  • Cluster 1: has the highest mean spending on Detergents_Paper, Grocery, and Milk products

  • Cluster 5: has the highest mean spending on Delicassen, Fresh, and Frozen products

  • Cluster 7: has the lowest mean spending on Delicassen, Grocery, and Milk products

  • Cluster 2: has the lowest mean spending on Fresh and Frozen products, while Cluster 6 is lowest on Detergents_Paper

  3. Simplified profiling: graphs that rank product spending in each cluster
sale_centroid %>% 
  pivot_longer(-cluster) %>%
  ggplot(aes(x = value, y = tidytext::reorder_within(name, value, cluster), fill = value)) +
  geom_col() +
  tidytext::scale_y_reordered() +  # strip the ___cluster suffix added by reorder_within()
  scale_fill_gradient(low = "pink", high = "navy") +
  facet_wrap(~cluster, scales = "free_y") +
  theme_minimal(base_size = 8) +
  labs(title = "Product per Cluster",
       y = NULL,
       x = "Score")

💡 Profiling of each cluster:

  • The customer clusters with the lowest total purchases are, in order, Cluster 7, Cluster 2, and Cluster 6

  • The customer cluster with the highest total purchases is Cluster 1

  • Across most clusters, the product categories with the highest spending are Grocery, Fresh, Milk, and Frozen

4 Conclusions

We can draw several conclusions regarding our dataset based on the clustering process:

  • We were able to separate our data into 7 clusters based on all numerical features, with the between-cluster sum of squares accounting for roughly 71% of the total sum of squares (BSS/TSS).

  • Cluster 1 is the customer cluster with the highest purchases and Cluster 7 the one with the lowest.

  • The spending variables in the dataset carry enough structure to support a useful segmentation, since the resulting clusters show clearly distinct spending profiles.

 

A work by Ahmad Fauzi

wrk.ahmadfauzi@gmail.com

with R Language