Introduction

The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories. The data set was downloaded from the UCI Machine Learning Repository.

My goal today is to use various clustering techniques to segment customers. Clustering is an unsupervised learning technique that groups data points based on their similarity. Thus, there is no outcome to be predicted; the algorithm just tries to find patterns in the data.

The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable k. The algorithm works iteratively to assign each data point to one of the k groups based on the features that are provided, so that data points are clustered based on feature similarity. The results of the K-means clustering algorithm are the centroids of the k clusters and a cluster label for each data point.

Import Libraries

library(tidyverse)   # dplyr, ggplot2, tidyr, etc.
library(magrittr)    # %<>% compound-assignment pipe
library(corrplot)    # correlation plots
library(cluster)     # clusGap() for the gap statistic
library(gridExtra)   # grid.arrange() for combining plots
library(factoextra)  # fviz_nbclust() and fviz_cluster()

Load Data

customer <- read.csv("./data/wholesale-customers.csv")
head(customer)
##   Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1       2      3 12669 9656    7561    214             2674       1338
## 2       2      3  7057 9810    9568   1762             3293       1776
## 3       2      3  6353 8808    7684   2405             3516       7844
## 4       1      3 13265 1196    4221   6404              507       1788
## 5       2      3 22615 5410    7198   3915             1777       5185
## 6       2      3  9413 8259    5126    666             1795       1451

Attribute Information:

  1. FRESH - annual spending (m.u.) on fresh products (Continuous)
  2. MILK - annual spending (m.u.) on milk products (Continuous)
  3. GROCERY - annual spending (m.u.) on grocery products (Continuous)
  4. FROZEN - annual spending (m.u.) on frozen products (Continuous)
  5. DETERGENTS_PAPER - annual spending (m.u.) on detergents and paper products (Continuous)
  6. DELICATESSEN - annual spending (m.u.) on delicatessen products (Continuous)
  7. CHANNEL - customer channel: Horeca (Hotel/Restaurant/Cafe) or Retail (Nominal)
  8. REGION - customer region: Lisbon, Oporto, or Other (Nominal)

summary(customer)
##     Channel          Region          Fresh             Milk      
##  Min.   :1.000   Min.   :1.000   Min.   :     3   Min.   :   55  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:  3128   1st Qu.: 1533  
##  Median :1.000   Median :3.000   Median :  8504   Median : 3627  
##  Mean   :1.323   Mean   :2.543   Mean   : 12000   Mean   : 5796  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.: 16934   3rd Qu.: 7190  
##  Max.   :2.000   Max.   :3.000   Max.   :112151   Max.   :73498  
##     Grocery          Frozen        Detergents_Paper    Delicassen     
##  Min.   :    3   Min.   :   25.0   Min.   :    3.0   Min.   :    3.0  
##  1st Qu.: 2153   1st Qu.:  742.2   1st Qu.:  256.8   1st Qu.:  408.2  
##  Median : 4756   Median : 1526.0   Median :  816.5   Median :  965.5  
##  Mean   : 7951   Mean   : 3071.9   Mean   : 2881.5   Mean   : 1524.9  
##  3rd Qu.:10656   3rd Qu.: 3554.2   3rd Qu.: 3922.0   3rd Qu.: 1820.2  
##  Max.   :92780   Max.   :60869.0   Max.   :40827.0   Max.   :47943.0

The summary shows a large difference between the minimum and maximum spending of customers, which gives an initial hint that the distributor has both low- and high-spending clients.

Check for missing data in the dataset

sum(is.na(customer))
## [1] 0
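
There are no missing values. For a per-column breakdown, base R's colSums works too:

colSums(is.na(customer))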

Data Preprocessing

All six spending attributes are recorded in the same unit (m.u.), while Channel and Region are nominal, so we exclude the latter two from clustering. First, though, we recode them into readable labels:

customer %<>%
  mutate(Channel = ifelse(Channel == 1, "Horeca", "Retail"),
         Region = case_when(Region == 1 ~ "Lisbon",
                            Region == 2 ~ "Oporto",
                            Region == 3 ~ "Others"))

head(customer)
##   Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1  Retail Others 12669 9656    7561    214             2674       1338
## 2  Retail Others  7057 9810    9568   1762             3293       1776
## 3  Retail Others  6353 8808    7684   2405             3516       7844
## 4  Horeca Others 13265 1196    4221   6404              507       1788
## 5  Retail Others 22615 5410    7198   3915             1777       5185
## 6  Retail Others  9413 8259    5126    666             1795       1451

As we don’t want the clustering algorithm to depend on an arbitrary variable unit, we start by scaling/standardizing the data using the R function scale:

customer_sc <- as_tibble(scale(customer[3:8]))

head(customer_sc)
## # A tibble: 6 x 6
##     Fresh    Milk Grocery Frozen Detergents_Paper Delicassen
##     <dbl>   <dbl>   <dbl>  <dbl>            <dbl>      <dbl>
## 1  0.0529  0.523  -0.0411 -0.589          -0.0435    -0.0663
## 2 -0.391   0.544   0.170  -0.270           0.0863     0.0890
## 3 -0.447   0.408  -0.0281 -0.137           0.133      2.24  
## 4  0.100  -0.623  -0.393   0.686          -0.498      0.0933
## 5  0.839  -0.0523 -0.0793  0.174          -0.232      1.30  
## 6 -0.205   0.334  -0.297  -0.496          -0.228     -0.0262

For now, the Channel and Region columns are excluded, as they do not contain spending information.
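
As a quick sanity check (not part of the original pipeline), scale() centers each column and divides it by its standard deviation, so the following comparison is expected to return TRUE:

# scale() is equivalent to (x - mean(x)) / sd(x), column by column
all.equal(customer_sc$Fresh,
          (customer$Fresh - mean(customer$Fresh)) / sd(customer$Fresh))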

Clustering

KMeans

The K-means algorithm (also referred to as Lloyd’s algorithm) is the most commonly used unsupervised machine learning algorithm for partitioning data into a set of k groups or clusters.

How does K-means work?

  1. Define the number of clusters (k).
  2. Initialize k centroids randomly.
  3. Assignment step: assign each observation to its closest centroid by computing the squared Euclidean distance between the observation and every centroid (i.e., an observation is assigned to the centroid for which this distance is smallest).
  4. Update step: recompute each centroid as the mean of the observations assigned to it.
  5. Repeat the assignment and update steps (steps 3 and 4) until the assignments no longer change (the total within-cluster sum of squares has stopped decreasing) or the maximum number of iterations is reached.
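
To make the assignment and update steps concrete, here is a minimal base-R sketch of Lloyd's algorithm. It is illustrative only: simple_kmeans is a made-up name, it skips edge cases such as empty clusters, and in practice we use stats::kmeans, as in the rest of this post.

simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 2: pick k random observations as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (i in seq_len(max_iter)) {
    # Step 3 (assignment): distances from every point to every centroid
    # (computing the full pairwise distance matrix is wasteful but simple)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    new_cluster <- max.col(-d)                 # index of the nearest centroid
    if (identical(new_cluster, cluster)) break # assignments stable: converged
    cluster <- new_cluster
    # Step 4 (update): recompute each centroid as the mean of its points
    centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
  }
  list(cluster = cluster, centers = centers)
}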

Determining optimal number of clusters (k)

Before we do the actual clustering, we need to identify the optimal number of clusters (k) for this data set of wholesale customers. Popular ways of determining the number of clusters are:

  1. Elbow Method
  2. Silhouette Method
  3. Gap Statistic Method

The elbow and silhouette methods are direct methods, while the gap statistic is a statistical testing method.

In this demonstration we will try all three, starting with the elbow method.

Elbow Method

set.seed(212)

fviz_nbclust(customer_sc, kmeans, method = "wss")

The elbow plot does not show a sharp knee bend in this case, but we could take k = 5.
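
Under the hood, method = "wss" plots the total within-cluster sum of squares for each k. An equivalent hand-rolled version (illustrative, not part of the original analysis) would be:

set.seed(212)
# tot.withinss is the total within-cluster sum of squares for a fitted model
wss <- sapply(1:10, function(k) kmeans(customer_sc, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")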

Silhouette method

set.seed(212)

fviz_nbclust(customer_sc, kmeans, method = "silhouette")

The silhouette method shows that the optimal number of clusters is 2.
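
fviz_nbclust with method = "silhouette" plots the average silhouette width for each k; the same number can be computed directly with cluster::silhouette (km2 and sil below are throwaway names used for illustration):

set.seed(212)
km2 <- kmeans(customer_sc, centers = 2, nstart = 25)
sil <- silhouette(km2$cluster, dist(customer_sc))
mean(sil[, "sil_width"])   # average silhouette width for k = 2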

Gap Statistic

set.seed(212)
gap_stat <- clusGap(customer_sc, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations
fviz_gap_stat(gap_stat)
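
As an aside, the "did not converge" warnings come from kmeans' default cap of 10 Lloyd iterations. Since clusGap forwards extra arguments to FUN, they can be avoided by raising that cap, for example:

gap_stat <- clusGap(customer_sc, FUN = kmeans, nstart = 25, iter.max = 30,
                    K.max = 10, B = 50)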

The gap statistic method shows that the optimal number of clusters is 3.

With the above estimates for k, we will compute K-means clustering for k = 2, 3, and 5. We can also view the results using fviz_cluster, which provides a nice illustration of the clusters. If there are more than two dimensions (variables), fviz_cluster performs principal component analysis (PCA) and plots the data points according to the first two principal components, which explain the majority of the variance.
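
As a quick side check (not part of the original pipeline), we can confirm how much variance the first two principal components actually capture:

pca <- prcomp(customer_sc)
summary(pca)$importance["Cumulative Proportion", 1:2]  # variance captured by PC1 + PC2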

set.seed(212)

k2 <- kmeans(customer_sc, centers = 2, nstart = 30)
k3 <- kmeans(customer_sc, centers = 3, nstart = 30)
k5 <- kmeans(customer_sc, centers = 5, nstart = 30)

# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = customer_sc) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = customer_sc) + ggtitle("k = 3")
p3 <- fviz_cluster(k5, geom = "point",  data = customer_sc) + ggtitle("k = 5")

grid.arrange(p1, p2, p3, nrow = 2)

The visual assessment above shows that 2 and 3 clusters separate the data into distinct groups, consistent with the silhouette and gap statistic estimates.

Let’s have a look at the cluster details for k = 2 and k = 3.

print(k2)
## K-means clustering with 2 clusters of sizes 399, 41
## 
## Cluster means:
##         Fresh       Milk    Grocery      Frozen Detergents_Paper Delicassen
## 1 -0.00542930 -0.2122882 -0.2302493 -0.03310806       -0.2320799 -0.0826124
## 2  0.05283636  2.0659269  2.2407190  0.32219794        2.2585338  0.8039597
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 2 1 2 2 2 1 2 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1
##  [75] 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [149] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 2 1 1 1 1 2 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 2 1 1 2 1 2 1 1 2 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 2 1
## [334] 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [408] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 982.9619 966.3860
##  (between_SS / total_SS =  26.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
print(k3)
## K-means clustering with 3 clusters of sizes 13, 322, 105
## 
## Cluster means:
##        Fresh       Milk    Grocery      Frozen Detergents_Paper  Delicassen
## 1  1.2628701  3.8420545  3.4733327  1.70013636        3.2968964  2.37006056
## 2  0.1121903 -0.3514030 -0.4256645  0.04403387       -0.4182375 -0.12285483
## 3 -0.5004055  0.6019528  0.8753395 -0.34553027        0.8744079  0.08331873
## 
## Clustering vector:
##   [1] 2 3 3 2 2 2 2 2 2 3 3 2 3 3 3 2 3 2 2 2 2 2 2 1 3 2 2 2 3 2 2 2 2 2 2 3 2
##  [38] 3 3 2 2 2 3 3 3 3 3 1 3 3 2 2 2 3 2 2 3 3 2 2 2 1 2 3 2 1 2 3 2 2 2 3 2 2
##  [75] 2 2 2 3 2 2 2 3 3 2 2 1 1 2 2 2 2 2 1 2 3 2 2 2 2 2 3 3 3 2 2 2 3 3 2 3 2
## [112] 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
## [149] 2 2 2 2 2 2 2 3 3 2 3 3 3 2 2 3 2 3 3 2 2 2 3 3 2 3 2 3 2 2 2 2 2 1 3 1 2
## [186] 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 3 3 2 2 2 3 2 2 2 3 2 1 2 2 3 3 3 2 3 2 2 3
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 1 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 3 3 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
## [297] 2 2 3 2 2 3 3 3 3 3 3 2 2 3 2 2 3 2 2 3 2 2 2 3 2 2 2 2 2 1 2 2 2 2 2 3 2
## [334] 1 2 2 2 2 2 2 3 3 3 3 2 2 3 2 2 3 2 3 2 3 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2
## [371] 2 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2
## [408] 3 2 2 2 2 3 2 2 2 3 3 3 2 3 2 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 3 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 693.5608 677.2089 239.5571
##  (between_SS / total_SS =  38.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

From the above details we can conclude that 3 clusters will suit us best, as this separates the high-variance observations into their own group. That group can contain potentially high-spending customers.

Thus we will run our final analysis using k = 3.

set.seed(212)
final <- kmeans(customer_sc, 3, nstart = 30)
print(final)
## K-means clustering with 3 clusters of sizes 3, 393, 44
## 
## Cluster means:
##          Fresh       Milk    Grocery      Frozen Detergents_Paper  Delicassen
## 1  3.840444845  3.2957757  0.9852919  7.20489292       -0.1527927  6.79967230
## 2  0.004950908 -0.2277887 -0.2542638 -0.02703683       -0.2486071 -0.08005229
## 3 -0.306069120  1.8098555  2.2038591 -0.24975466        2.2309315  0.25139852
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2 2 2 2 2 2
##  [38] 2 3 2 2 2 2 3 2 3 3 3 2 3 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 3 3 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
## [149] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 3 2 2 2 2 2 3 2 3 2 2 2 2 2 2 2 1 2 1 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 3 2 2 2 3 2 3 2 2 2 2 3 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 3 2 2 3 2 3 2 2 3 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 1 2 2 2 2 2 3 2
## [334] 3 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 3 2 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [371] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [408] 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 214.5396 944.8291 441.0021
##  (between_SS / total_SS =  39.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
fviz_cluster(final, geom = "point",  data = customer_sc)

cluster_mean <- customer_sc %>%
  mutate(Cluster = final$cluster) %>%  # attach the final cluster labels
  group_by(Cluster) %>%
  summarise_all("mean")                # mean standardized spending per cluster

cluster_mean
## # A tibble: 3 x 7
##   Cluster    Fresh   Milk Grocery  Frozen Detergents_Paper Delicassen
##     <int>    <dbl>  <dbl>   <dbl>   <dbl>            <dbl>      <dbl>
## 1       1  3.84     3.30    0.985  7.20             -0.153     6.80  
## 2       2  0.00495 -0.228  -0.254 -0.0270           -0.249    -0.0801
## 3       3 -0.306    1.81    2.20  -0.250             2.23      0.251
# Visualize customer segments with average product spending  

cluster_mean %>%
  gather(Product, MU, Fresh:Delicassen) %>%
  ggplot(aes(x = Product, y = MU, fill = Product)) +
  geom_col(width = 0.5) +
  facet_grid(. ~ Cluster) +
  scale_fill_brewer(palette = "Accent") +
  ylab("Customer Spending in Monetary Units") +
  ggtitle("Customer Segments with Average Product Spending") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank())

The statistics above and their visualization show that customer spending habits differ in each segment. We will summarize them in the last section.
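
Since the cluster means above are in standardized units, interpretation is often easier in the original monetary units. A quick way to get them (output not shown; uses dplyr's across(), available from dplyr 1.0):

customer %>%
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarise(across(Fresh:Delicassen, mean))  # mean spending in m.u. per cluster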

Now let’s analyse the Channel and Region distribution in each cluster.

# Channel and Region in Cluster 1/Segment 1

customer[1:2] %>%
  mutate(Cluster = final$cluster) %>%
  filter(Cluster == 1) %>%
  count(Channel, Region)
##   Channel Region n
## 1  Horeca Oporto 1
## 2  Horeca Others 2
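
The same breakdown can be produced for all segments at once (output not shown):

customer[1:2] %>%
  mutate(Cluster = final$cluster) %>%
  count(Cluster, Channel, Region)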

Conclusion

From the above analysis we can draw the following conclusions about the spending habits of each customer segment (segment numbers correspond to the cluster numbers of the final k = 3 model):

Segment 1: This small, well-separated segment contains only a few Hotel/Restaurant/Cafe customers, who spend heavily on Fresh and Frozen products, followed by Milk, and who also have the highest spending on Delicassen.

Segment 2: This segment, by far the largest, consists of a majority of Hotel/Restaurant/Cafe customers along with Retail customers, who spend decently on Fresh products followed by Grocery and Milk, but spend the least of all groups on Detergents_Paper and Delicassen.

Segment 3: This segment contains only Retail customers, who spend mainly on Grocery and Milk, followed by Detergents_Paper, and then Fresh products.

This concludes our customer segmentation exercise using the K-means algorithm.