The wholesale company in this case study is a retailer with sales records across various product types. We want to perform customer segmentation based on purchase characteristics so that we can make targeted product offerings to customers. In this case, we will use PCA (Principal Component Analysis) and K-means clustering to create the customer segmentation.
Load the required libraries.
# import library
library(dplyr)
library(FactoMineR)
library(factoextra)
library(tidyverse)
library(ggiraphExtra)
library(ggplot2)
library(reshape2)
library(cowplot)
library(GGally)
library(plotly)

The dataset contains retail data consisting of 440 observations and 8 variables.
wholesale <- read.csv("wholesale.csv")
glimpse(wholesale)

#> Rows: 440
#> Columns: 8
#> $ Channel <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,~
#> $ Region <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
#> $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5~
#> $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,~
#> $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,~
#> $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115~
#> $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, ~
#> $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2~
Variables description:
- Channel: customer channel, Horeca (hotel, restaurant, cafe) or Retail (1 = Horeca, 2 = Retail)
- Region: branch location of the wholesale company (Region 1, 2, and 3)
- Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen: number of purchase/quantity transactions for each product

Since Channel and Region should be categorical, we will convert their data types.
wholesale <- wholesale %>%
mutate_at(vars(Channel,Region), as.factor)
glimpse(wholesale)

#> Rows: 440
#> Columns: 8
#> $ Channel <fct> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,~
#> $ Region <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
#> $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5~
#> $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,~
#> $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,~
#> $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115~
#> $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, ~
#> $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2~
Check if there are missing values.
# your code here
colSums(is.na(wholesale))

#> Channel Region Fresh Milk
#> 0 0 0 0
#> Grocery Frozen Detergents_Paper Delicassen
#> 0 0 0 0
There are no missing values in the data, so we can continue with the next step.
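As a quick aside, base R also offers a one-line version of this check, if preferred:

# one-line alternative: anyNA() returns TRUE if any value is missing
anyNA(wholesale)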
Let's see if we can segment the customers by Channel, and check whether the Retail channel has more purchase transactions in Grocery and Detergents_Paper.
# Inspecting numerical variable based on Channel
ggplot(wholesale, aes(Grocery, Detergents_Paper, color = Channel, size = Region)) +
geom_point(alpha = 0.5) + theme_minimal()

Based on the plot above, there is a clear segmentation between the two channels: the Retail channel (Channel 2) has more purchase transactions in Grocery and Detergents_Paper than the Horeca channel (Channel 1).
Let's also check how some other numeric variables differ between the Retail and Horeca channels.
p1 <- ggplot(wholesale, aes(Channel, Milk, fill = Channel)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Milk")
p2 <- ggplot(wholesale, aes(Channel, Frozen, fill = Channel)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Frozen")
p3 <- ggplot(wholesale, aes(Channel, Fresh, fill = Channel)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Fresh")
p4 <- ggplot(wholesale, aes(Channel, Delicassen, fill = Channel)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Delicassen")
plot_grid(p1, p2, p3, p4)

As we can see above, the Retail channel (Channel 2) has higher purchase transactions in Milk, but for the other variables there is not much difference between the Retail and Horeca channels.

Based on this, we can try to segment the wholesale transactions by their purchase characteristics to uncover more interesting patterns in the data.
Here we want to explore whether there are correlations between the numeric variables in the wholesale data. Strong correlation between variables indicates that the data may have multicollinearity; we can address this by reducing the dimensionality of the data while still retaining as much information as possible. This is also helpful for further analysis, as it lowers the computational cost.
ggpairs(wholesale[,c(3:8)], showStrips = F) +
theme(axis.text = element_text(colour = "black", size = 11),
strip.background = element_rect(fill = "#d63d2d"),
strip.text = element_text(colour = "white", size = 12,
face = "bold"))

It can be seen that there are some variables with high correlation, such as Grocery with Milk and Detergents_Paper with Grocery. Therefore we can reduce the data dimension using PCA.
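To back the visual impression with numbers, a minimal cross-check is to print the correlation matrix for the same six columns:

# numeric cross-check of the pairwise correlations shown in the plot
round(cor(wholesale[, 3:8]), 2)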
Since Channel and Region should be categorical (not numeric), and K-means clustering only uses numeric data to calculate distances, we will remove them; they are not needed for this analysis.
# remove categorical variables Channel and Region
wholesale <- wholesale %>%
select(-c(Channel, Region))

Let's see whether we need to scale the data by checking if the variables share the same units and ranges.
# your code here
summary(wholesale)

#> Fresh Milk Grocery Frozen
#> Min. : 3 Min. : 55 Min. : 3 Min. : 25.0
#> 1st Qu.: 3128 1st Qu.: 1533 1st Qu.: 2153 1st Qu.: 742.2
#> Median : 8504 Median : 3627 Median : 4756 Median : 1526.0
#> Mean : 12000 Mean : 5796 Mean : 7951 Mean : 3071.9
#> 3rd Qu.: 16934 3rd Qu.: 7190 3rd Qu.:10656 3rd Qu.: 3554.2
#> Max. :112151 Max. :73498 Max. :92780 Max. :60869.0
#> Detergents_Paper Delicassen
#> Min. : 3.0 Min. : 3.0
#> 1st Qu.: 256.8 1st Qu.: 408.2
#> Median : 816.5 Median : 965.5
#> Mean : 2881.5 Mean : 1524.9
#> 3rd Qu.: 3922.0 3rd Qu.: 1820.2
#> Max. :40827.0 Max. :47943.0
The wholesale variables span very different ranges, so we will need to scale the data later on.
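For clarity, scaling here means z-score standardization: each column is centered by its mean and divided by its standard deviation. A small sketch of the equivalence, using the Fresh column as an example:

# scale() is equivalent to (x - mean(x)) / sd(x) applied per column
z_manual <- (wholesale$Fresh - mean(wholesale$Fresh)) / sd(wholesale$Fresh)
z_scaled <- scale(wholesale$Fresh)[, 1]
all.equal(z_manual, z_scaled) # should be TRUE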
Before clustering, let's check whether there are outliers in the data. Outliers affect the clustering result, so it is better to remove them first. Here we will visualize the outliers with a biplot from PCA.
# use PCA from FactoMineR library and do scaling at the same time
pca_ws <- PCA(X = wholesale, # clean data
scale.unit = T, # scaling
graph = F)

# visualize the individual factor plot and identify 5 outliers
plot.PCA(x=pca_ws,
choix = "ind",
select = "contrib 5")

As we can see, there are outliers in the data. Let's save their indices in an object so that we can remove them later on.
# save the outliers into object
outlier <- c(326,184,48,62,86)
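These indices were read off the individual factor plot. As a hedged cross-check, we can rank individuals by their summed contribution to the two plotted dimensions, using the contrib matrix stored in the PCA object:

# cross-check: top 5 individuals by contribution to Dim 1 + Dim 2
contrib_12 <- rowSums(pca_ws$ind$contrib[, 1:2])
head(sort(contrib_12, decreasing = TRUE), 5)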
Now remove the outliers from the wholesale data and save the result into a new object called ws_no_outlier to be used for clustering.
# remove the outliers and save into a new object
ws_no_outlier <- wholesale[-outlier,]

# check if the outliers have been removed by comparing with previous data
dim(wholesale)

#> [1] 440 6

dim(ws_no_outlier)

#> [1] 435 6
Before doing clustering, let's ensure that our data only contains the numeric variables that will be used for clustering.

# check data structure
glimpse(ws_no_outlier)

#> Rows: 435
#> Columns: 6
#> $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5~
#> $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,~
#> $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,~
#> $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115~
#> $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, ~
#> $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2~
Since we now use the data without outliers, we need to scale it again before clustering.
# scale data wholes with no outliers
ws_no_outlier_scale <- scale(ws_no_outlier)

Here we will use the Elbow Method to select the optimum value of k.
# use function to plot elbow method
RNGkind(sample.kind = "Rounding")
kmeansTunning <- function(data, maxK) {
  withinall <- NULL
  total_k <- NULL
  for (i in 2:maxK) {
    set.seed(567)
    # total within-cluster sum of squares for k = i clusters
    temp <- kmeans(data, i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k, i)
  }
  # plot total within-ss against the number of clusters
  plot(x = total_k, y = withinall, type = "o",
       xlab = "Number of Cluster", ylab = "Total within")
}
kmeansTunning(ws_no_outlier_scale, maxK = 8)

From the plot above, the optimum value of k is 6, because increasing k beyond 6 does not decrease the total within sum of squares significantly.
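As an alternative to the handwritten loop, the factoextra package provides a helper that draws the same within-sum-of-squares elbow plot; a sketch with the same scaled data:

# alternative elbow plot via factoextra (same wss criterion)
fviz_nbclust(ws_no_outlier_scale, kmeans, method = "wss", k.max = 8)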
Now we run k-means clustering with the k value from the elbow method and save the result in a new object, ws_kmeans.
RNGkind(sample.kind = "Rounding")
set.seed(333)
# k-means clustering
ws_kmeans <- kmeans(ws_no_outlier_scale, centers = 6)

After clustering, we attach the cluster label to each observation in our wholesale data (before scaling, but after removing the outliers).
# create a new column segment to insert cluster label resulted from k-means clustering
ws_no_outlier$segment <- ws_kmeans$cluster
ws_no_outlier
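Before profiling, it can help to sanity-check the clustering itself, for example the cluster sizes and the share of variance captured between clusters:

# cluster sizes and goodness-of-fit of the clustering
ws_kmeans$size
ws_kmeans$betweenss / ws_kmeans$totss # closer to 1 = better separation between clusters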
Then we profile the clusters using the mean value of each product in the wholesale data.

# group by segment
ws_no_outlier %>%
group_by(segment) %>%
summarise_all(mean)
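One hedged way to read this profile table is to ask which segment has the highest average purchase for each product (assuming segments are numbered 1 to 6 in ascending order, so the row index equals the segment number):

# which segment has the highest mean purchase per product
profile <- ws_no_outlier %>% group_by(segment) %>% summarise_all(mean)
sapply(profile[, -1], which.max)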
To make profiling easier, let's use a radar plot to visualize the clusters.

# use ggRadar to visualize clustering
ggRadar(data = ws_no_outlier %>% mutate(segment = as.factor(segment)), # treat segment as a factor so each cluster gets its own colour
        aes(colour = segment), interactive = TRUE)

Profiling:
We can also visualize the clustering using function from factoextra package.
# clustering visualization
fviz_cluster(object = ws_kmeans,
             data = ws_no_outlier_scale) # use the scaled data fed to kmeans; ws_no_outlier now also contains the segment column

The characteristics of each cluster can also be visualized using a combination of clustering and PCA.
# PCA
pca_ws_new <- PCA(ws_no_outlier[,1:6], graph = F)
# analyze cummulative variance of each PC from PCA
pca_ws_new$eig

#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 2.58552805 43.092134 43.09213
#> comp 2 1.59084740 26.514123 69.60626
#> comp 3 0.71699730 11.949955 81.55621
#> comp 4 0.62647238 10.441206 91.99742
#> comp 5 0.39569059 6.594843 98.59226
#> comp 6 0.08446427 1.407738 100.00000
In this case we want to retain at least 90% of the information in the wholesale data, therefore we select PC1-PC4 out of the 6 PCs. Thus we reduce the dimensionality of the original data by around 33% while still retaining around 91.99% of the information.
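The same cutoff can be found programmatically from the eig matrix printed above (its third column holds the cumulative percentage of variance):

# smallest number of PCs whose cumulative variance reaches 90%
which(pca_ws_new$eig[, "cumulative percentage of variance"] >= 90)[1]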
Now we extract the values of PC1-PC4 and put them into a new object. Later on, we can also use this data for supervised analysis.
# making a new data frame from PCA result
wholesale_x <- data.frame(pca_ws_new$ind$coord[, 1:4]) %>%
  bind_cols(segment = as.factor(ws_no_outlier$segment))
wholesale_x

Clustering and PCA can be combined to obtain a better visualization of the clusters and to understand the patterns in the dataset. This is done with a biplot using PC1 and PC2 as the axes.
fviz_pca_ind(pca_ws_new,
geom.ind = "point", # show points only (but not "text")
col.ind = wholesale_x$segment, # color by groups
palette = c("#00AFBB", "#E7B800", "#FC4E07", "#ee25f5", "#60fa41", "#c2bfbe"),
addEllipses = TRUE, # Concentration ellipses
legend.title = "Segment"
)

The individual observations map describes how each observation is positioned in PC1 and PC2.
💡 Insights:
Let's add one more dimension to see whether the clusters still overlap.
plot_ly(wholesale_x, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~segment, colors = c("black",
"red", "green", "blue", "yellow", "cyan")) %>% add_markers() %>%
layout(scene = list(xaxis = list(title = 'Dim.1'),
yaxis = list(title = 'Dim.2'),
zaxis = list(title = 'Dim.3')))

Now let's look at the Variable Factor Map, which represents the correlations between variables.
plot.PCA(x = pca_ws_new, choix = "var", title = "Variable Factor Map")

💡 Insights:

- Dim 1 is contributed to mostly by Grocery, Milk, and Detergents_Paper, while Dim 2 is contributed to mostly by Frozen and Fresh
- Fresh and Frozen have a very high positive correlation
- Milk, Grocery, and Detergents_Paper also have positive correlations with one another

Based on both the Individual Observation Map and the Variable Factor Map, we can also see the following in terms of profiling:
From this unsupervised learning analysis of the wholesale data, we can conclude: