1 Introduction

Wholesale is a retail company with sales records for various types of products. We want to perform customer segmentation based on purchase characteristics so that we can make targeted product offerings to customers. In this case, we will use PCA (Principal Component Analysis) and K-means clustering to create the customer segmentation.

Load the required libraries.

# import library
library(dplyr)
library(FactoMineR)
library(factoextra)
library(tidyverse)
library(ggiraphExtra)
library(ggplot2)
library(reshape2)
library(cowplot)
library(GGally)
library(plotly)

2 Read Data

The dataset contains retail data with 440 observations and 8 variables.

wholesale <- read.csv("wholesale.csv")
glimpse(wholesale)
#> Rows: 440
#> Columns: 8
#> $ Channel          <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,~
#> $ Region           <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
#> $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5~
#> $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,~
#> $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,~
#> $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115~
#> $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, ~
#> $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2~

Variable descriptions:

  • Channel: customer channel, Horeca (hotel/restaurant/cafe) or Retail (1 = Horeca, 2 = Retail)
  • Region: branch location of the wholesale company (Region 1, 2, and 3)
  • Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen: number of purchase/quantity transactions for each product category

3 Data Wrangling

Since Channel and Region should be categorical, we will convert their data type.

wholesale <- wholesale %>% 
  mutate_at(vars(Channel,Region), as.factor)

glimpse(wholesale)
#> Rows: 440
#> Columns: 8
#> $ Channel          <fct> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,~
#> $ Region           <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
#> $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5~
#> $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,~
#> $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,~
#> $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115~
#> $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, ~
#> $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2~

Check if there are missing values.

# check the number of missing values per column
colSums(is.na(wholesale))
#>          Channel           Region            Fresh             Milk 
#>                0                0                0                0 
#>          Grocery           Frozen Detergents_Paper       Delicassen 
#>                0                0                0                0

There are no missing values in the data, so we can continue to the next step.

4 Exploratory Data Analysis

4.1 Clustering Opportunity

Let’s see whether we can segment the customers by Channel, and check whether the Retail channel has more purchase transactions in Grocery and Detergents_Paper.

# inspect numeric variables by Channel (shape distinguishes Region)
ggplot(wholesale, aes(Grocery, Detergents_Paper, color = Channel, shape = Region)) + 
    geom_point(alpha = 0.5) + theme_minimal()

Based on the plot above, there is a clear segmentation between the Retail and Horeca channels: the Retail channel (Channel 2) has more purchase transactions in Grocery and Detergents_Paper than the Horeca channel (Channel 1).

Let’s also check how some other numeric variables differ between the Retail and Horeca channels.

p1 <- ggplot(wholesale, aes(Channel, Milk, fill = Channel)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "Milk")

p2 <- ggplot(wholesale, aes(Channel, Frozen, fill = Channel)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "Frozen")

p3 <- ggplot(wholesale, aes(Channel, Fresh, fill = Channel)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "Fresh")

p4 <- ggplot(wholesale, aes(Channel, Delicassen, fill = Channel)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "Delicassen")

plot_grid(p1, p2, p3, p4)

As the boxplots show, the Retail channel (Channel 2) has higher purchase transactions in Milk, but for the other variables there is not much difference between the Retail and Horeca channels.
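To put a number on this comparison, a quick summary of the median purchase per channel can be used (a minimal sketch using the same data as above):

# median purchase quantity per channel for each product category
wholesale %>% 
  group_by(Channel) %>% 
  summarise_at(vars(Fresh:Delicassen), median)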

Based on this, we can try to segment the wholesale transactions by their purchase characteristics to uncover more interesting patterns in the data.

4.2 PCA (Principal Component Analysis) Possibility

Here we want to explore whether there are correlations between the numeric variables in the wholesale data. Strong correlation between variables indicates multicollinearity, which we can address by reducing the dimensionality of the data while retaining as much information as possible. This also lowers the computational cost of further analysis.

ggpairs(wholesale[,c(3:8)], showStrips = F) + 
  theme(axis.text = element_text(colour = "black", size = 11),
        strip.background = element_rect(fill = "#d63d2d"),
        strip.text = element_text(colour = "white", size = 12,
                                  face = "bold"))

As we can see, some variables are highly correlated, such as Grocery with Milk and Detergents_Paper with Grocery. Therefore we can reduce the data dimensionality using PCA.
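The same pairwise correlations can also be read off numerically (a quick sketch):

# numeric correlation matrix of the six product variables
round(cor(wholesale[, 3:8]), 2)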

5 Data Pre-Processing

Since Channel and Region are categorical (not numeric) and K-means clustering only uses numeric data to calculate distances, we will remove them; they are not needed for this analysis.

# remove categorical variables Channel and Region
wholesale <- wholesale %>% 
  select(-c(Channel,Region)) 

5.1 Check Scaling

Let’s see if we need to do data scaling or not by exploring if the data has the same units.

# check the range of each variable
summary(wholesale)
#>      Fresh             Milk          Grocery          Frozen       
#>  Min.   :     3   Min.   :   55   Min.   :    3   Min.   :   25.0  
#>  1st Qu.:  3128   1st Qu.: 1533   1st Qu.: 2153   1st Qu.:  742.2  
#>  Median :  8504   Median : 3627   Median : 4756   Median : 1526.0  
#>  Mean   : 12000   Mean   : 5796   Mean   : 7951   Mean   : 3071.9  
#>  3rd Qu.: 16934   3rd Qu.: 7190   3rd Qu.:10656   3rd Qu.: 3554.2  
#>  Max.   :112151   Max.   :73498   Max.   :92780   Max.   :60869.0  
#>  Detergents_Paper    Delicassen     
#>  Min.   :    3.0   Min.   :    3.0  
#>  1st Qu.:  256.8   1st Qu.:  408.2  
#>  Median :  816.5   Median :  965.5  
#>  Mean   : 2881.5   Mean   : 1524.9  
#>  3rd Qu.: 3922.0   3rd Qu.: 1820.2  
#>  Max.   :40827.0   Max.   :47943.0

The variables in the wholesale data have very different ranges, so we will need to scale the data later on.
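A quick look at the per-variable spread makes this point concrete; the standard deviations differ by orders of magnitude (a small sketch):

# standard deviation of each variable; the large differences confirm
# that z-score scaling is needed before any distance-based method
sapply(wholesale, sd)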

5.2 Outlier Detection

Before clustering, let’s check whether there are outliers in the data. Outliers affect the clustering result, so it is better to remove them first. Here we will visualize the outliers with the individual factor map from PCA.

5.3 PCA

# use PCA from FactoMineR library and do scaling at the same time
pca_ws <- PCA(X = wholesale, # clean data
                scale.unit = T, # scaling
                graph = F)

5.4 Visualize

# visualize the individual factor plot and identify 5 outliers
plot.PCA(x=pca_ws,
         choix = "ind",
         select = "contrib 5")

As we can see, there are outliers in the data. Let’s save their indices in an object so that we can remove them later on.

# save the outliers into object
outlier <- c(326,184,48,62,86)
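Before dropping these rows, it can be worth inspecting what makes them extreme (a quick sketch):

# inspect the flagged observations before removing them
wholesale[outlier, ]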

Now remove the outliers from the wholesale data, then save the result into a new object called ws_no_outlier to be used for clustering.

# remove the outliers and save into a new object
ws_no_outlier <- wholesale[-outlier,]
# check if the outliers have been removed by comparing with previous data
dim(wholesale)
#> [1] 440   6
dim(ws_no_outlier)
#> [1] 435   6

6 Clustering

Before clustering, let’s make sure that our data only contains the numeric variables that will be used for clustering.

# check data structure
glimpse(ws_no_outlier)
#> Rows: 435
#> Columns: 6
#> $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5~
#> $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,~
#> $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,~
#> $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115~
#> $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, ~
#> $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2~

Since we use the data without outliers, we need to scale it again before clustering.

# scale the wholesale data with the outliers removed
ws_no_outlier_scale <- scale(ws_no_outlier)
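As a minimal sanity check, after scale() every column should have mean 0 and standard deviation 1:

# verify the scaling: each column should now have standard deviation 1
round(apply(ws_no_outlier_scale, 2, sd), 2)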

6.1 Choose Optimum K

Here we will use the Elbow Method to select the optimum value of k.

# use function to plot elbow method
RNGkind(sample.kind = "Rounding")
kmeansTunning <- function(data, maxK) {
  withinall <- NULL
  total_k <- NULL
  for (i in 2:maxK) {
    set.seed(567)
    temp <- kmeans(data,i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k,i)
  }
  plot(x = total_k, y = withinall, type = "o", xlab = "Number of Cluster", ylab = "Total within")
}

kmeansTunning(ws_no_outlier_scale, maxK = 8)

From the plot above, the optimum value of k is 6, because increasing k beyond 6 does not decrease the total within-cluster sum of squares significantly.
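As a cross-check, the factoextra package (already loaded) provides fviz_nbclust(), which draws the same elbow curve; a minimal sketch:

# elbow plot via factoextra; this should agree with the manual loop above
fviz_nbclust(ws_no_outlier_scale, kmeans, method = "wss", k.max = 8)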

6.2 K-means Clustering

Now we run k-means clustering with the k value from the elbow method and save the result in a new object, ws_kmeans.

RNGkind(sample.kind = "Rounding")
set.seed(333)

# k-means clustering
ws_kmeans <- kmeans(ws_no_outlier_scale, centers = 6)
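Before profiling, it is worth a quick look at the clustering result itself, e.g. the cluster sizes and the between/total sum-of-squares ratio, a rough measure of separation (a small sketch):

# number of observations assigned to each cluster
ws_kmeans$size

# proportion of total variance captured by the cluster assignment
ws_kmeans$betweenss / ws_kmeans$totss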

6.3 Cluster Profiling

After clustering, we attach the cluster label to each observation in our wholesale data (before scaling, but after removing the outliers).

# create a new column segment to insert cluster label resulted from k-means clustering
ws_no_outlier$segment <- ws_kmeans$cluster
ws_no_outlier

Then we profile the clusters using the mean value of each product in the wholesale data.

# group by segment
ws_no_outlier %>% 
  group_by(segment) %>% 
  summarise_all(mean)
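To make the table easier to scan, the profile can be reshaped so that, for each product, only the segment with the highest average purchase is kept; a small sketch using reshape2 (already loaded):

# for each product category, keep the segment with the highest mean purchase
ws_no_outlier %>% 
  group_by(segment) %>% 
  summarise_all(mean) %>% 
  melt(id.vars = "segment") %>% 
  group_by(variable) %>% 
  filter(value == max(value))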

To make the profiling easier, let’s use a radar plot to visualize the clusters.

# use ggRadar to visualize clustering
ggRadar(data=ws_no_outlier, aes(colour=segment), interactive=TRUE)

Profiling:

  • Cluster 1: customers with the lowest purchase quantity in Frozen
  • Cluster 2: customers with the highest purchase quantities in Fresh and Frozen
  • Cluster 3: customers with the highest purchase quantities in Milk, Grocery, and Detergents_Paper
  • Cluster 4: customers who tend to have the lowest purchase quantities in almost every product category
  • Cluster 5: customers with the highest purchase quantity in Fresh
  • Cluster 6: customers who tend to have the highest purchase quantities in almost every product category

We can also visualize the clusters using a function from the factoextra package.

# clustering visualization (use the scaled data that the k-means model was fit on)
fviz_cluster(object = ws_kmeans, 
             data = ws_no_outlier_scale)

7 PCA (Principal Component Analysis)

The characteristics of each cluster can also be visualized by combining clustering with PCA.

# PCA
pca_ws_new <- PCA(ws_no_outlier[,1:6], graph = F)

# analyze the cumulative variance of each PC from PCA
pca_ws_new$eig
#>        eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 2.58552805              43.092134                          43.09213
#> comp 2 1.59084740              26.514123                          69.60626
#> comp 3 0.71699730              11.949955                          81.55621
#> comp 4 0.62647238              10.441206                          91.99742
#> comp 5 0.39569059               6.594843                          98.59226
#> comp 6 0.08446427               1.407738                         100.00000

In this case we want to retain at least 90% of the information in the wholesale data, so we select PC1-PC4 out of the 6 PCs. This reduces the dimensionality of the original data by about 33% while retaining around 92% of its information.
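The same decision can be read from a scree plot; factoextra’s fviz_eig() draws the percentage of variance per component (a quick sketch):

# scree plot with the percentage of explained variance labelled per PC
fviz_eig(pca_ws_new, addlabels = TRUE)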

Now we extract the values of PC1-PC4 and put them into a new object. Later on, we can also use this data for supervised analysis.

# making a new data frame from PCA result
wholesale_x <- data.frame(pca_ws_new$ind$coord[,1:4]) %>% bind_cols(segment = as.factor(ws_no_outlier$segment))
wholesale_x

8 Combining Clustering and PCA

Clustering and PCA can be combined to obtain a better visualization of the clusters and to understand the patterns in the dataset. We will do this with an individual observations map that uses PC1 and PC2 as the axes.

fviz_pca_ind(pca_ws_new,
             geom.ind = "point", # show points only (but not "text")
             col.ind = wholesale_x$segment, # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07", "#ee25f5", "#60fa41", "#c2bfbe"),
             addEllipses = TRUE, # Concentration ellipses
             legend.title = "Segment"
             )

The individual observations map shows how each observation is positioned along PC1 and PC2.

💡 Insights:

  • There are still some outliers in the data that could be examined with further analysis
  • Segment 6 has the most varied PC values, indicated by its larger ellipse, and also the highest PC1 scores compared to the other segments
  • However, the data cannot be fully represented by only 2 dimensions, since the clusters overlap each other

Let’s try adding one more dimension to see whether the clusters still overlap.

plot_ly(wholesale_x, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~segment, colors = c("black", 
    "red", "green", "blue", "yellow", "cyan")) %>% add_markers() %>% 
  layout(scene = list(xaxis = list(title = 'Dim.1'),
                     yaxis = list(title = 'Dim.2'),
                     zaxis = list(title = 'Dim.3')))

Now we look at the Variable Factor Map, which represents the correlations between variables.

plot.PCA(x = pca_ws_new, choix = "var", title = "Variable Factor Map")

💡 Insights:

  • The variables that contribute most to PC1 are Grocery, Milk, and Detergents_Paper
  • The variables that contribute most to PC2 are Frozen and Fresh
  • Fresh and Frozen have a strong positive correlation (their arrows point in similar directions)
  • Milk, Grocery, and Detergents_Paper also have positive correlations with each other
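These readings can be verified numerically with FactoMineR’s dimdesc(), which lists the variables most strongly correlated with each dimension (a minimal sketch):

# variables most correlated with the first two PCs, with significance tests
dimdesc(pca_ws_new, axes = 1:2)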

Based on both Individual Observation Map and Variable Factor Map, we can also see that in terms of profiling:

  • Cluster 1: some observations sit below 0 on PC2, meaning lower values in Frozen or Fresh
  • Cluster 2: most observations sit high on PC2, meaning high values in Frozen and Fresh
  • Cluster 3: most observations sit high on PC1, meaning high values in Milk, Grocery, and Detergents_Paper
  • Cluster 4: most observations sit below 0 on both PC1 and PC2, meaning lower values in almost all variables
  • Cluster 5: some observations sit above 0 on PC2, meaning high values in Frozen or Fresh
  • Cluster 6: most observations are spread above 0 on both PC1 and PC2, meaning higher values in almost all variables

9 Conclusion

From this unsupervised learning analysis of the wholesale data, we can conclude:

  1. K-means clustering on this dataset produces 6 clusters for customer segmentation, which can support targeted product offerings to customers.
  2. Using PCA (Principal Component Analysis), the data dimensionality can be reduced from 6 to 4 while still retaining about 92% of the information.
  3. The dataset obtained from PCA can be used for further supervised analysis or visualization.