Introduction

Customers segmentation is a prerequisite for formulating marketing strategies. This document demonstrates exploratory data analysis using hierarchical clustering and its associated visualization techniques on wholesale customers data set, with highlights on the effects of feature normalization, feature discretization, and selection of dissimilarity measures on clustering outcomes.

Wholesale Customers Data Set

The data set was published by Margarida G. M. S. Cardoso in 2014. It contains 440 observations and 8 variables: Channel (Horeca (Hotel/Restaurant/Café) or Retail), Region (Lisbon, Oporto or Other), and 6 Product Categories - Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen (with values of annual spending).

## 'data.frame':    440 obs. of  8 variables:
##  $ Channel         : int  2 2 2 1 2 2 2 2 1 2 ...
##  $ Region          : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Fresh           : int  12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
##  $ Milk            : int  9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
##  $ Grocery         : int  7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
##  $ Frozen          : int  214 1762 2405 6404 3915 666 480 1669 425 1159 ...
##  $ Detergents_Paper: int  2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
##  $ Delicassen      : int  1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
##     Channel          Region          Fresh             Milk      
##  Min.   :1.000   Min.   :1.000   Min.   :     3   Min.   :   55  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:  3128   1st Qu.: 1533  
##  Median :1.000   Median :3.000   Median :  8504   Median : 3627  
##  Mean   :1.323   Mean   :2.543   Mean   : 12000   Mean   : 5796  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.: 16934   3rd Qu.: 7190  
##  Max.   :2.000   Max.   :3.000   Max.   :112151   Max.   :73498  
##     Grocery          Frozen        Detergents_Paper    Delicassen     
##  Min.   :    3   Min.   :   25.0   Min.   :    3.0   Min.   :    3.0  
##  1st Qu.: 2153   1st Qu.:  742.2   1st Qu.:  256.8   1st Qu.:  408.2  
##  Median : 4756   Median : 1526.0   Median :  816.5   Median :  965.5  
##  Mean   : 7951   Mean   : 3071.9   Mean   : 2881.5   Mean   : 1524.9  
##  3rd Qu.:10656   3rd Qu.: 3554.2   3rd Qu.: 3922.0   3rd Qu.: 1820.2  
##  Max.   :92780   Max.   :60869.0   Max.   :40827.0   Max.   :47943.0

The spending for Fresh products are far greater than those for other products, indicating that Fresh product will dominate dissimilarity measures.

par(mfrow=c(2,3))
for (i in c(3:8))
    hist(RawData[,c(i)], breaks = 200, main = colnames(RawData)[i], xlab="Annual Spending (m.u.)", ylab = "No. of Customers")

The spending distributions are positively skewed in all 6 product categories, suggesting uevenly sized clusters.

Feature normalization and discretization

If the concern is total spending across all product categories, then the raw data can be used for clustering. On the other hand, if product categories should be weighted equally, then we need to normalize the features. Here, we normalize the features to mean = 0 and SD = 1.

NormalizedData <- cbind(RawData[,c(1,2)], scale(RawData[,c(3:8)]))
summary(NormalizedData)
##     Channel          Region          Fresh              Milk        
##  Min.   :1.000   Min.   :1.000   Min.   :-0.9486   Min.   :-0.7779  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:-0.7015   1st Qu.:-0.5776  
##  Median :1.000   Median :3.000   Median :-0.2764   Median :-0.2939  
##  Mean   :1.323   Mean   :2.543   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.: 0.3901   3rd Qu.: 0.1889  
##  Max.   :2.000   Max.   :3.000   Max.   : 7.9187   Max.   : 9.1732  
##     Grocery            Frozen         Detergents_Paper    Delicassen     
##  Min.   :-0.8364   Min.   :-0.62763   Min.   :-0.6037   Min.   :-0.5396  
##  1st Qu.:-0.6101   1st Qu.:-0.47988   1st Qu.:-0.5505   1st Qu.:-0.3960  
##  Median :-0.3363   Median :-0.31844   Median :-0.4331   Median :-0.1984  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2846   3rd Qu.: 0.09935   3rd Qu.: 0.2182   3rd Qu.: 0.1047  
##  Max.   : 8.9264   Max.   :11.90545   Max.   : 7.9586   Max.   :16.4597
par(mfrow=c(2,3))
for (i in c(3:8))
    hist(NormalizedData[,c(i)], breaks = 200, main = colnames(RawData)[i], xlab="Normalized Annual Spending (m.u.)", ylab = "No. of Customers")

A rule of thumb in marketing is 20/80, meaning top 20% of the most profitable customers contribute to 80% of the profits. Therefore, it is meaningful to discretize features into quintiles.

RawData <- within(RawData, FreshQuintile <- as.integer(cut(Fresh, quantile(Fresh, probs=0:5/5), include.lowest=TRUE)))
RawData <- within(RawData, MilkQuintile <- as.integer(cut(Milk, quantile(Milk, probs=0:5/5), include.lowest=TRUE)))
RawData <- within(RawData, GroceryQuintile <- as.integer(cut(Grocery, quantile(Grocery, probs=0:5/5), include.lowest=TRUE)))
RawData <- within(RawData, FrozenQuintile <- as.integer(cut(Frozen, quantile(Frozen, probs=0:5/5), include.lowest=TRUE)))
RawData <- within(RawData, Detergents_PaperQuintile <- as.integer(cut(Detergents_Paper, quantile(Detergents_Paper, probs=0:5/5), include.lowest=TRUE)))
RawData <- within(RawData, DelicassenQuintile <- as.integer(cut(Delicassen, quantile(Delicassen, probs=0:5/5), include.lowest=TRUE)))
par(mfrow=c(2,3))
for (i in c(9:14))
    hist(RawData[,c(i)], breaks = 20, main = colnames(RawData)[i], xlab="Discretized Annual Spending Rank", ylab = "No. of Customers")

Hierarchical Clustering on Different Feature Representations

Now, we will compare dendrograms of the hierarchical clustering from the three feature representations, using “complete” dissimilarity measure method.

require(fastcluster)
require(graphics)
library(dendextend)
library(gplots)
d_raw <- dist(RawData[,c(1:8)])
d_nom <- dist(NormalizedData)
d_dis <- dist(RawData[,c(1,2,9:14)])
hc_raw <- hclust(d_raw, method = "complete")
hc_nom <- hclust(d_nom, method = "complete")
hc_dis_complete <- hclust(d_dis, method = "complete")
par(mfrow=c(1,3))
plot(hc_raw)
plot(hc_nom)
plot(hc_dis_complete)

The results show that there are several 1-member clusters in both the raw data set and the normalized data set; the main branches are also very unbalanced in both. By contrast, the dendrogram from discretized data set is more balanced; if we make a cut at height = 8.5, we will get 3 roughly equally sized clusters.

Hierarchical Clustering Using Different Dissimilarity Measures

There are multiple ways to compare distance of two clusters. Here, we will compare “single”, “complete”, “average”, and “centroid” methods on the discretized data set. The “single” method is a nearest-neighbor technique and tends to produce clusters with large diameters and violates “compactness” property. The “complete” method is a furthest-neighbor technique and tends to produce compact clusters with small diameters but violates “closeness” property. The “average” method is a compromise between “single” and “complete”, but its results may vary by the numerical scale on which the observation dissimilarities are measured. The “centroid” method is similar to K-Mean clustering where centers of clusters are computed first then distance are determined between centroids. By optimizing cluster centers, instead of cluster borders, the “centroid” method often leads to incorrectly cut borders in between of clusters.

hc_dis_single <- hclust(d_dis, method = "single")
hc_dis_average <- hclust(d_dis, method = "average")
hc_dis_centroid <- hclust(d_dis, method = "centroid")
par(mfrow=c(1,2))
plot(hc_dis_complete)
plot(hc_dis_average)

The “average” method resulted in a somewhat different structure to the “complete” method.

par(mfrow=c(1,2))
plot(hc_dis_single)
plot(hc_dis_centroid)

The “single” method exhibits chaining phenomenon due to combining, at relatively low thresholds, observations linked by a series of close intermediate observations. The “centroid” method exhibits crossed borders between clusters. The two methods can be considered as defective for the clustering task here.

Visualization Techniques

First, let’s use tanglegram to better visualize the differences between the two dendrograms produced by “complete” and “average” methods.

wsc_dendlist <- dendlist()
wsc_dendlist <- dendlist(wsc_dendlist, as.dendrogram(hc_dis_complete))
wsc_dendlist <- dendlist(wsc_dendlist, as.dendrogram(hc_dis_average))
names(wsc_dendlist) <- c("complete", "average")
par(mfrow=c(1,1))
wsc_dendlist %>% dendlist(which = c(1,2)) %>% ladderize %>% 
set("rank_branches") %>%
tanglegram(common_subtrees_color_branches = TRUE)

Number of sub-trees that are identical between the two dendrograms:

length(unique(common_subtrees_clusters(wsc_dendlist[[1]], wsc_dendlist[[2]]))[-1])
## [1] 111

Finally, let’s use heat map to visualize the relations between the dendrogram and the features.

dend <- as.dendrogram(hc_dis_complete)
dend <- rotate(dend, 1:440)
dend <- color_branches(dend, k=3)
dend <- hang.dendrogram(dend,hang_height=0.1)
dend <- set(dend, "labels_cex", 0.5)

par(mfrow=c(1,1))
col_func <- function(n) rev(colorspace::heat_hcl(n, c = c(80, 30), l = c(30, 90), power = c(1/5, 1.5)))
gplots::heatmap.2(as.matrix(RawData[,c(1,2,9:14)]), 
       main = "Heatmap for Discretized Data Set",
       srtCol = 20,
       dendrogram = "row",
       Rowv = dend,
       Colv = "NA", 
       trace="none",          
       margins =c(5.5,0.5),      
       key.xlab = "Cm",
       denscol = "grey",
       density.info = "density",
       col = col_func
      )

Conclusion

Hierarchical clustering results depend on feature representation and measure of dissimilarity. Feature representation, in turn, depends on business objectives.

To determine the most “natural” number of clusters on a hierarchical clustering result, the Gap statistic[1] method can be used.

Reference

[1] The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (2013) Trevor Hastie, Robert Tibshirani, Jerome Friedman