The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories. The data set was downloaded from the UCI Machine Learning Repository.
My goal today is to use various clustering techniques to segment customers. Clustering is an unsupervised learning technique that groups data points based on their similarity. There is no outcome to be predicted; the algorithm only tries to find patterns in the data.
The goal of the K-means algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features provided, so data points are clustered by feature similarity. It produces two results: the centroids of the K clusters and a cluster label for each data point. The first rows of the data set are:
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 2 3 12669 9656 7561 214 2674 1338
## 2 2 3 7057 9810 9568 1762 3293 1776
## 3 2 3 6353 8808 7684 2405 3516 7844
## 4 1 3 13265 1196 4221 6404 507 1788
## 5 2 3 22615 5410 7198 3915 1777 5185
## 6 2 3 9413 8259 5126 666 1795 1451
Attribute Information:
FRESH - annual spending (m.u.) on fresh products (Continuous)
MILK - annual spending (m.u.) on milk products (Continuous)
GROCERY - annual spending (m.u.) on grocery products (Continuous)
FROZEN - annual spending (m.u.) on frozen products (Continuous)
DETERGENTS_PAPER - annual spending (m.u.) on detergents and paper products (Continuous)
DELICATESSEN - annual spending (m.u.) on delicatessen products (Continuous)
CHANNEL - customer's channel: Horeca (Hotel/Restaurant/Cafe) or Retail (Nominal)
REGION - customer's region: Lisbon, Oporto or Other (Nominal)
## Channel Region Fresh Milk
## Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533
## Median :1.000 Median :3.000 Median : 8504 Median : 3627
## Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190
## Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
## Grocery Frozen Detergents_Paper Delicassen
## Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 4756 Median : 1526.0 Median : 816.5 Median : 965.5
## Mean : 7951 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
## 3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :92780 Max. :60869.0 Max. :40827.0 Max. :47943.0
The summary shows a large difference between the minimum and maximum spending of customers. This is an initial hint that the distributor has both low-spending and high-spending clients.
Check for missing data in the data set:
## [1] 0
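The call that produced the count above is not shown. A minimal sketch of one common way to get it follows, assuming the data frame is named `customer` as in the rest of the post; here a small stand-in built from the first rows printed earlier is used so the snippet runs on its own.

```r
# Stand-in data frame using values from the head() output above;
# in this post the real object would be `customer`.
customer_demo <- data.frame(
  Fresh = c(12669, 7057, 6353),
  Milk  = c(9656, 9810, 8808)
)

# Total number of missing values across all columns
n_missing <- sum(is.na(customer_demo))
print(n_missing)

# Missing values per column, useful when the total is not zero
missing_by_col <- colSums(is.na(customer_demo))
print(missing_by_col)
```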
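All the attributes are measured in the same unit (m.u.) except Channel and Region, which are nominal codes. We can ignore those two attributes for clustering. The spending columns share a unit, but their ranges differ widely (see the summary above), so standardizing them is still worthwhile.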
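library(dplyr)     # mutate(), case_when()
library(magrittr)  # %<>% assignment pipe

# Recode the numeric Channel and Region codes into readable labels
customer %<>%
  mutate(Channel = ifelse(Channel == 1, "Horeca", "Retail"),
         Region = case_when(Region == 1 ~ "Lisbon",
                            Region == 2 ~ "Oporto",
                            Region == 3 ~ "Others"))
head(customer)
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen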
## 1 Retail Others 12669 9656 7561 214 2674 1338
## 2 Retail Others 7057 9810 9568 1762 3293 1776
## 3 Retail Others 6353 8808 7684 2405 3516 7844
## 4 Horeca Others 13265 1196 4221 6404 507 1788
## 5 Retail Others 22615 5410 7198 3915 1777 5185
## 6 Retail Others 9413 8259 5126 666 1795 1451
Because we don’t want the clustering algorithm to depend on an arbitrary variable unit, we start by scaling/standardizing the data using the R function scale:
## # A tibble: 6 x 6
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0529 0.523 -0.0411 -0.589 -0.0435 -0.0663
## 2 -0.391 0.544 0.170 -0.270 0.0863 0.0890
## 3 -0.447 0.408 -0.0281 -0.137 0.133 2.24
## 4 0.100 -0.623 -0.393 0.686 -0.498 0.0933
## 5 0.839 -0.0523 -0.0793 0.174 -0.232 1.30
## 6 -0.205 0.334 -0.297 -0.496 -0.228 -0.0262
For now, the Channel and Region columns are excluded because they do not carry spending information.
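The scaling step itself is not shown in the post. The sketch below illustrates what `scale` does, using only the first six rows of three spending columns from the `head()` output above. Because it uses six rows rather than the full data set, its numbers will not match the standardized values printed above, and the names `spend` and `spend_sc` are placeholders.

```r
# First rows of three spending columns, copied from the head() output above
spend <- data.frame(
  Fresh   = c(12669, 7057, 6353, 13265, 22615, 9413),
  Milk    = c(9656, 9810, 8808, 1196, 5410, 8259),
  Grocery = c(7561, 9568, 7684, 4221, 7198, 5126)
)

# scale() centers each column to mean 0 and divides by its standard deviation
spend_sc <- as.data.frame(scale(spend))

col_means <- colMeans(spend_sc)      # numerically 0 for every column
col_sds   <- apply(spend_sc, 2, sd)  # 1 for every column
print(round(col_means, 10))
print(col_sds)
```

After this step every column contributes on a comparable scale to the Euclidean distances that K-means relies on.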
K-means (its standard iterative procedure is often called Lloyd’s algorithm) is one of the most commonly used unsupervised machine learning algorithms for partitioning data into a set of k groups, or clusters.
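To make the iteration concrete, here is a minimal sketch of Lloyd's procedure in base R. The helper `lloyd_kmeans` is hypothetical and written only for illustration; it is not the routine behind `stats::kmeans`, which uses the Hartigan-Wong algorithm by default. The input is made-up synthetic data.

```r
# Minimal sketch of Lloyd's iteration; `lloyd_kmeans` is a hypothetical helper.
lloyd_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k distinct observations chosen at random
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance from every point to every center
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    if (all(new_cluster == cluster)) break  # assignments stable: converged
    cluster <- new_cluster
    # Update step: move each center to the mean of its assigned points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)
# Two well-separated synthetic blobs (made-up data, for illustration only)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2))
fit <- lloyd_kmeans(x, k = 2)
print(table(fit$cluster))
```

The result depends on the random starting centers, which is why the `kmeans` calls later in this post use `nstart = 30` to try several starts and keep the best one.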
Before we do the actual clustering, we need to identify the optimal number of clusters (k) for this data set of wholesale customers. Popular ways of determining the number of clusters are the elbow method, the silhouette method, and the gap statistic.
The elbow and silhouette methods are direct methods, while the gap statistic is a statistical testing method.
In this demonstration, we look at what each of the three methods suggests.
The elbow plot does not show a sharp bend in this case, but k = 5 is a reasonable choice.
The silhouette method suggests that the optimal number of clusters is 2.
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
The gap statistic method suggests that the optimal number of clusters is 3.
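The code behind these three estimates is not shown. It was probably a plotting helper, but that is an assumption. As a rough sketch of what the elbow and silhouette methods compute, the example below uses base R plus the `cluster` package on synthetic stand-in data; the real input would be `customer_sc`.

```r
library(cluster)  # silhouette(); ships with standard R installations

set.seed(212)
# Synthetic stand-in for the scaled data: three made-up blobs
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2),
           matrix(rnorm(60, mean = 10), ncol = 2))
ks <- 2:6

# Elbow method: total within-cluster sum of squares for each k
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 30)$tot.withinss)

# Silhouette method: average silhouette width for each k
d <- dist(x)
avg_sil <- sapply(ks, function(k) {
  cl <- kmeans(x, centers = k, nstart = 30)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

print(data.frame(k = ks, wss = round(wss, 1), avg_sil = round(avg_sil, 3)))

# The silhouette method picks the k with the largest average width
best_k <- ks[which.max(avg_sil)]
print(best_k)
```

The "did not converge in 10 iterations" warnings shown above are consistent with the default `iter.max = 10` of `kmeans`; passing a larger `iter.max` is one way to address them.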
With the above estimates for k, we will compute K-means clustering for k = 2, 3, and 5. We can also view the results using fviz_cluster, which provides a nice illustration of the clusters. If there are more than two dimensions (variables), fviz_cluster performs principal component analysis (PCA) and plots the data points according to the first two principal components, which capture the largest share of the variance.
library(factoextra)  # fviz_cluster()
library(gridExtra)   # grid.arrange()

set.seed(212)
k2 <- kmeans(customer_sc, centers = 2, nstart = 30)
k3 <- kmeans(customer_sc, centers = 3, nstart = 30)
k5 <- kmeans(customer_sc, centers = 5, nstart = 30)
# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = customer_sc) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = customer_sc) + ggtitle("k = 3")
p3 <- fviz_cluster(k5, geom = "point", data = customer_sc) + ggtitle("k = 5")
grid.arrange(p1, p2, p3, nrow = 2)
The visual assessment above shows that 2 and 3 clusters separate the data into distinct groups, in line with the values suggested by the silhouette and gap statistic methods.
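The two-dimensional view used in these plots can be approximated by hand. The sketch below runs PCA with `prcomp` on a made-up stand-in matrix and reports how much variance the first two components capture; it illustrates the idea only and is not a reproduction of the internals of `fviz_cluster`.

```r
set.seed(212)
# Made-up stand-in for the scaled spending matrix (4 columns, 2 groups)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 4),
           matrix(rnorm(200, mean = 3), ncol = 4))
km <- kmeans(x, centers = 2, nstart = 30)

# Principal component analysis on the clustering input
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Share of total variance captured by each component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_share[1:2], 3))

# Scatter of the first two component scores, colored by cluster
plot(pca$x[, 1:2], col = km$cluster, pch = 19,
     main = "k-means clusters on PC1 and PC2")
```

If the first two shares are small, clusters that overlap in the plot may still be well separated in the full feature space.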
Let’s look at the cluster details for these cases.
## K-means clustering with 2 clusters of sizes 399, 41
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 -0.00542930 -0.2122882 -0.2302493 -0.03310806 -0.2320799 -0.0826124
## 2 0.05283636 2.0659269 2.2407190 0.32219794 2.2585338 0.8039597
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 2 1 2 2 2 1 2 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1
## [75] 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [149] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 2 1 1 1 1 2 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 2 1 1 2 1 2 1 1 2 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 2 1
## [334] 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [408] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
##
## Within cluster sum of squares by cluster:
## [1] 982.9619 966.3860
## (between_SS / total_SS = 26.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
## K-means clustering with 3 clusters of sizes 13, 322, 105
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 1.2628701 3.8420545 3.4733327 1.70013636 3.2968964 2.37006056
## 2 0.1121903 -0.3514030 -0.4256645 0.04403387 -0.4182375 -0.12285483
## 3 -0.5004055 0.6019528 0.8753395 -0.34553027 0.8744079 0.08331873
##
## Clustering vector:
## [1] 2 3 3 2 2 2 2 2 2 3 3 2 3 3 3 2 3 2 2 2 2 2 2 1 3 2 2 2 3 2 2 2 2 2 2 3 2
## [38] 3 3 2 2 2 3 3 3 3 3 1 3 3 2 2 2 3 2 2 3 3 2 2 2 1 2 3 2 1 2 3 2 2 2 3 2 2
## [75] 2 2 2 3 2 2 2 3 3 2 2 1 1 2 2 2 2 2 1 2 3 2 2 2 2 2 3 3 3 2 2 2 3 3 2 3 2
## [112] 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
## [149] 2 2 2 2 2 2 2 3 3 2 3 3 3 2 2 3 2 3 3 2 2 2 3 3 2 3 2 3 2 2 2 2 2 1 3 1 2
## [186] 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 3 3 2 2 2 3 2 2 2 3 2 1 2 2 3 3 3 2 3 2 2 3
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 1 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 3 3 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
## [297] 2 2 3 2 2 3 3 3 3 3 3 2 2 3 2 2 3 2 2 3 2 2 2 3 2 2 2 2 2 1 2 2 2 2 2 3 2
## [334] 1 2 2 2 2 2 2 3 3 3 3 2 2 3 2 2 3 2 3 2 3 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2
## [371] 2 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2
## [408] 3 2 2 2 2 3 2 2 2 3 3 3 2 3 2 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 3 2 2
##
## Within cluster sum of squares by cluster:
## [1] 693.5608 677.2089 239.5571
## (between_SS / total_SS = 38.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
From the details above we can conclude that three clusters suit us best, since this solution separates the high-variation observations into a group of their own. This cluster may contain potential high-spending customers.
Thus we will carry out our final analysis using 3 as the optimal number of clusters.
## K-means clustering with 3 clusters of sizes 3, 393, 44
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 3.840444845 3.2957757 0.9852919 7.20489292 -0.1527927 6.79967230
## 2 0.004950908 -0.2277887 -0.2542638 -0.02703683 -0.2486071 -0.08005229
## 3 -0.306069120 1.8098555 2.2038591 -0.24975466 2.2309315 0.25139852
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2 2 2 2 2 2
## [38] 2 3 2 2 2 2 3 2 3 3 3 2 3 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 3 3 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
## [149] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 3 2 2 2 2 2 3 2 3 2 2 2 2 2 2 2 1 2 1 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 3 2 2 2 3 2 3 2 2 2 2 3 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 3 2 2 3 2 3 2 2 3 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 1 2 2 2 2 2 3 2
## [334] 3 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 3 2 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [371] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [408] 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
##
## Within cluster sum of squares by cluster:
## [1] 214.5396 944.8291 441.0021
## (between_SS / total_SS = 39.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
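The `between_SS / total_SS` percentage printed in these summaries can be recomputed from the components listed under "Available components". A short sketch on made-up data follows; with the fitted object from this post, the same two expressions would apply.

```r
set.seed(212)
# Made-up data: two synthetic blobs, for illustration only
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 30)

# Share of the total variation explained by the split into clusters
ratio <- fit$betweenss / fit$totss
print(round(100 * ratio, 1))

# The total sum of squares decomposes into within and between parts
decomposes <- isTRUE(all.equal(fit$totss, fit$tot.withinss + fit$betweenss))
print(decomposes)
```

A higher ratio means tighter, better separated clusters, but it always rises as k grows, so it should not be used alone to choose k.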
cluster_mean <- customer_sc %>%
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
cluster_mean
## # A tibble: 3 x 7
## Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 3.84 3.30 0.985 7.20 -0.153 6.80
## 2 2 0.00495 -0.228 -0.254 -0.0270 -0.249 -0.0801
## 3 3 -0.306 1.81 2.20 -0.250 2.23 0.251
# Visualize customer segments with average product spending
# (values come from customer_sc, so the y axis is in standardized units)
cluster_mean %>%
  gather(Product, MU, Fresh:Delicassen) %>%
  ggplot(aes(x = Product, y = MU, fill = Product)) +
  geom_col(width = 0.5) +
  facet_grid(. ~ Cluster) +
  scale_fill_brewer(palette = "Accent") +
  ylab("Standardized Customer Spending") +
  ggtitle("Customer Segments with Average Product Spending") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank())
The statistics above and their visualization show that customer spending habits vary from segment to segment. We summarize them in the last section.
Now let’s analyse the Channel and Region distribution in each cluster.
# Channel and Region in Cluster 1/Segment 1
customer[1:2] %>%
  mutate(Cluster = final$cluster) %>%
  filter(Cluster == 1) %>%
  ungroup() %>%
  count(Channel, Region)
## Channel Region n
## 1 Horeca Oporto 1
## 2 Horeca Others 2
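The same breakdown can be produced for all clusters in one pass instead of filtering one cluster at a time. Below is a base R sketch on a small hypothetical stand-in; in this post the three columns would come from `customer$Channel`, `customer$Region` and `final$cluster`.

```r
# Stand-in labels (hypothetical values, for illustration only)
demo <- data.frame(
  Channel = c("Horeca", "Horeca", "Retail", "Retail", "Horeca", "Retail"),
  Region  = c("Oporto", "Others", "Others", "Lisbon", "Others", "Others"),
  Cluster = c(1, 1, 3, 3, 2, 2)
)

# Cross-tabulate, then keep only the combinations that actually occur
counts <- as.data.frame(table(demo), stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
counts <- counts[order(counts$Cluster), ]
print(counts)
```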
From the analysis above we can draw the following conclusions about the spending habits of each segment identified in the clustered data:
Segment 1: This segment contains only a few Hotel/Restaurant/Cafe customers who spend heavily on Fresh and Frozen, followed by Milk and Groceries. They also have the highest average spending on Delicassen. These customers form a well-separated group with these spending habits.
Segment 2: This segment consists mostly of Hotel/Restaurant/Cafe customers, along with some Retail customers, who spend moderately on Fresh, followed by Groceries and Milk, but spend the least of all groups on Detergents_Paper and Delicassen.
Segment 3: This segment contains only Retail customers who spend mainly on Groceries and Milk, followed by Detergents_Paper and then Fresh products.
This concludes our customer segmentation exercise using the K-means algorithm.