Customer segmentation is a prerequisite for formulating marketing strategies. This document demonstrates exploratory data analysis using hierarchical clustering and its associated visualization techniques on the Wholesale Customers data set, highlighting the effects of feature normalization, feature discretization, and the choice of dissimilarity measure on clustering outcomes.
The data set was published by Margarida G. M. S. Cardoso in 2014. It contains 440 observations and 8 variables: Channel (Horeca (Hotel/Restaurant/Café) or Retail), Region (Lisbon, Oporto, or Other), and 6 product categories (Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen), whose values are annual spending in monetary units (m.u.).
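The analysis below assumes the data have been read into a data frame named RawData; a minimal sketch (the file name follows the UCI repository download and is an assumption):
RawData <- read.csv("Wholesale customers data.csv") # file name is an assumption
str(RawData)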
## 'data.frame': 440 obs. of 8 variables:
## $ Channel : int 2 2 2 1 2 2 2 2 1 2 ...
## $ Region : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
## $ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
## $ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
## $ Frozen : int 214 1762 2405 6404 3915 666 480 1669 425 1159 ...
## $ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
## $ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
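Summary statistics make the differences in scale among the product categories explicit:
summary(RawData)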
## Channel Region Fresh Milk
## Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533
## Median :1.000 Median :3.000 Median : 8504 Median : 3627
## Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190
## Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
## Grocery Frozen Detergents_Paper Delicassen
## Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 4756 Median : 1526.0 Median : 816.5 Median : 965.5
## Mean : 7951 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
## 3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :92780 Max. :60869.0 Max. :40827.0 Max. :47943.0
The spending on Fresh products is far greater than on the other categories, indicating that Fresh will dominate any distance-based dissimilarity measure computed on the raw data.
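One quick way to see this is to compare the standard deviations of the six categories; under Euclidean distance, the feature with the largest spread dominates. A minimal check (output omitted):
sapply(RawData[, 3:8], sd) # Fresh is expected to show by far the largest spread
The histograms below show the per-category spending distributions.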
par(mfrow = c(2, 3)) # arrange the six histograms in a 2x3 grid
for (i in 3:8)
  hist(RawData[, i], breaks = 200, main = colnames(RawData)[i],
       xlab = "Annual Spending (m.u.)", ylab = "No. of Customers")
The spending distributions are positively skewed in all 6 product categories, suggesting unevenly sized clusters.
If the concern is total spending across all product categories, then the raw data can be used for clustering. If, on the other hand, the product categories should be weighted equally, then the features need to be normalized. Here, we normalize each feature to mean = 0 and SD = 1.
NormalizedData <- cbind(RawData[, c(1, 2)], scale(RawData[, 3:8])) # z-score the six spending columns
summary(NormalizedData)
## Channel Region Fresh Milk
## Min. :1.000 Min. :1.000 Min. :-0.9486 Min. :-0.7779
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:-0.7015 1st Qu.:-0.5776
## Median :1.000 Median :3.000 Median :-0.2764 Median :-0.2939
## Mean :1.323 Mean :2.543 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 0.3901 3rd Qu.: 0.1889
## Max. :2.000 Max. :3.000 Max. : 7.9187 Max. : 9.1732
## Grocery Frozen Detergents_Paper Delicassen
## Min. :-0.8364 Min. :-0.62763 Min. :-0.6037 Min. :-0.5396
## 1st Qu.:-0.6101 1st Qu.:-0.47988 1st Qu.:-0.5505 1st Qu.:-0.3960
## Median :-0.3363 Median :-0.31844 Median :-0.4331 Median :-0.1984
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2846 3rd Qu.: 0.09935 3rd Qu.: 0.2182 3rd Qu.: 0.1047
## Max. : 8.9264 Max. :11.90545 Max. : 7.9586 Max. :16.4597
par(mfrow = c(2, 3))
for (i in 3:8)
  hist(NormalizedData[, i], breaks = 200, main = colnames(NormalizedData)[i],
       xlab = "Normalized Annual Spending (SD units)", ylab = "No. of Customers")
A rule of thumb in marketing is the 80/20 rule: the top 20% of the most profitable customers contribute 80% of the profits. It is therefore meaningful to discretize the features into quintiles.
# discretize each spending category into quintiles (1 = lowest 20%, 5 = highest 20%)
for (col in colnames(RawData)[3:8]) {
  RawData[[paste0(col, "Quintile")]] <- as.integer(
    cut(RawData[[col]], quantile(RawData[[col]], probs = 0:5/5), include.lowest = TRUE))
}
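By construction, each quintile should contain roughly 440/5 = 88 customers; a quick sanity check on one column (output omitted):
table(RawData$FreshQuintile) # expect about 88 customers per quintile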
par(mfrow = c(2, 3))
for (i in 9:14)
  hist(RawData[, i], breaks = 20, main = colnames(RawData)[i],
       xlab = "Discretized Annual Spending Rank", ylab = "No. of Customers")
Now, we will compare dendrograms of the hierarchical clusterings from the three feature representations, using Euclidean distance and the “complete” linkage method.
library(fastcluster) # drop-in, faster replacement for stats::hclust
library(graphics)
library(dendextend)
library(gplots)
d_raw <- dist(RawData[, 1:8]) # Euclidean distances on the raw features
d_nom <- dist(NormalizedData) # Euclidean distances on the normalized features
d_dis <- dist(RawData[, c(1, 2, 9:14)]) # Euclidean distances on the discretized features
hc_raw <- hclust(d_raw, method = "complete")
hc_nom <- hclust(d_nom, method = "complete")
hc_dis_complete <- hclust(d_dis, method = "complete")
par(mfrow=c(1,3))
plot(hc_raw)
plot(hc_nom)
plot(hc_dis_complete)
The results show several 1-member clusters in both the raw and the normalized data sets, and the main branches are very unbalanced in both. By contrast, the dendrogram from the discretized data set is more balanced; if we cut it at height = 8.5, we get 3 roughly equally sized clusters.
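This can be verified with cutree; cutting the “complete” dendrogram at height 8.5 should yield three clusters of comparable size (output omitted):
table(cutree(hc_dis_complete, h = 8.5)) # cluster sizes at the chosen cut height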
There are multiple ways to measure the distance between two clusters (the linkage method). Here, we will compare the “single”, “complete”, “average”, and “centroid” methods on the discretized data set. The “single” method is a nearest-neighbor technique; it tends to produce clusters with large diameters, violating the “compactness” property. The “complete” method is a furthest-neighbor technique; it tends to produce compact clusters with small diameters, but violates the “closeness” property. The “average” method is a compromise between “single” and “complete”, but its results can change with the numerical scale on which the observation dissimilarities are measured. The “centroid” method is similar to k-means clustering: centers of clusters are computed first, and distances are then determined between centroids. By optimizing cluster centers instead of cluster borders, the “centroid” method often leads to incorrectly cut borders between clusters.
hc_dis_single <- hclust(d_dis, method = "single")
hc_dis_average <- hclust(d_dis, method = "average")
hc_dis_centroid <- hclust(d_dis, method = "centroid")
par(mfrow=c(1,2))
plot(hc_dis_complete)
plot(hc_dis_average)
The “average” method produced a somewhat different structure from that of the “complete” method.
par(mfrow=c(1,2))
plot(hc_dis_single)
plot(hc_dis_centroid)
The “single” method exhibits the chaining phenomenon: observations linked by a series of close intermediate observations are merged at relatively low heights. The “centroid” method exhibits crossed borders (inversions) between clusters because its merge heights are not monotone; note also that hclust expects squared Euclidean distances for the “centroid” method, whereas plain Euclidean distances are supplied here. Both methods can be considered defective for the clustering task at hand.
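The chaining effect can be made concrete by cutting the “single” dendrogram into three clusters; chaining typically yields one dominant cluster plus near-singletons (a quick check; output omitted):
table(cutree(hc_dis_single, k = 3)) # expect one large cluster and tiny remnants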
First, let’s use a tanglegram to better visualize the differences between the two dendrograms produced by the “complete” and “average” methods.
# collect the two dendrograms in a dendlist for comparison
wsc_dendlist <- dendlist(as.dendrogram(hc_dis_complete), as.dendrogram(hc_dis_average))
names(wsc_dendlist) <- c("complete", "average")
par(mfrow = c(1, 1))
wsc_dendlist %>% dendlist(which = c(1, 2)) %>% ladderize %>%
  set("rank_branches") %>%
  tanglegram(common_subtrees_color_branches = TRUE)
Number of sub-trees that are identical between the two dendrograms:
length(unique(common_subtrees_clusters(wsc_dendlist[[1]], wsc_dendlist[[2]]))[-1])
## [1] 111
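As a complementary summary of agreement, dendextend can compute the cophenetic correlation between the two trees (a sketch; output omitted):
cor_cophenetic(wsc_dendlist[[1]], wsc_dendlist[[2]]) # correlation of the two trees' cophenetic distances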
Finally, let’s use a heat map to visualize the relations between the dendrogram and the features.
dend <- as.dendrogram(hc_dis_complete)
dend <- rotate(dend, 1:440) # order branches by observation index
dend <- color_branches(dend, k = 3) # color the three main clusters
dend <- hang.dendrogram(dend, hang_height = 0.1)
dend <- set(dend, "labels_cex", 0.5) # shrink leaf labels
par(mfrow = c(1, 1))
col_func <- function(n) rev(colorspace::heat_hcl(n, c = c(80, 30), l = c(30, 90), power = c(1/5, 1.5)))
gplots::heatmap.2(as.matrix(RawData[, c(1, 2, 9:14)]),
                  main = "Heatmap for the Discretized Data Set",
                  srtCol = 20,
                  dendrogram = "row",
                  Rowv = dend,
                  Colv = FALSE, # keep the original column order
                  trace = "none",
                  margins = c(5.5, 0.5),
                  key.xlab = "Quintile Rank",
                  denscol = "grey",
                  density.info = "density",
                  col = col_func
)
Hierarchical clustering results depend on feature representation and measure of dissimilarity. Feature representation, in turn, depends on business objectives.
To determine the most “natural” number of clusters from a hierarchical clustering result, the Gap statistic [1] can be used.
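A minimal sketch of how the Gap statistic could be computed here with cluster::clusGap, wrapping the same hierarchical procedure used above (the wrapper function, K.max, and B are illustrative assumptions):
library(cluster)
# clusGap expects a function(x, k) that returns a list with a 'cluster' component
hclust_cut <- function(x, k) {
  list(cluster = cutree(hclust(dist(x), method = "complete"), k = k))
}
gap <- clusGap(as.matrix(RawData[, c(1, 2, 9:14)]), FUNcluster = hclust_cut, K.max = 8, B = 50)
plot(gap, main = "Gap Statistic for the Discretized Data Set")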
[1] Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer, 2009.