Introduction

Marketers are interested in understanding and forecasting how customers purchase products and services and how they respond to marketing actions initiated by the firm. Marketing analysts develop quantitative models that leverage business data, statistical computation, and machine learning to forecast sales and to support important marketing decisions involving customer relationship management (CRM), market segmentation, monetization, and value creation and communication. In this article, we focus on machine learning techniques widely used for market segmentation.

Cluster Analysis

Market Segmentation is about understanding the drivers of choice as well as the options that are available to consumers. A famous and widely used technique for market segmentation is Cluster Analysis: it refers to a class of unsupervised learning techniques used to classify individuals into groups such that individuals within the same group are as similar as possible, while individuals in different groups are as dissimilar as possible.

This example shows two different popular cluster analysis techniques:

  1. Hierarchical clustering
  2. K-Means clustering

Reading and outputting data

The sample includes data from 73 students who were asked to allocate 100 points across six automobile attributes (Trendiness, Styling, Reliability, Sportiness, Performance, and Comfort) in proportion to each attribute’s importance in their decision of which car to buy. We use this dataset to answer the following questions:

  1. Are there different benefit segments among this student population?
  2. How many segments?
  3. How are they different in their constant-sum allocation?
  4. How can this information be transformed into actionable levers from a managerial standpoint?

Let’s start by reading the raw data. A minimal sketch of the R code is shown below, followed by the first 6 observations from our sample:
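
The file name and the use of read.csv are assumptions, since the original code is not shown; adjust them to your own data source.

```r
# Sketch: read the raw data (file name is assumed)
seg_df <- read.csv("car_preferences.csv", stringsAsFactors = TRUE)

# Display the first 6 observations
head(seg_df)
```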

ID Trendiness Styling Reliability Sportiness Performance Comfort MBA Choice
1 10 20 35 5 20 10 MBA Lexus
2 25 5 25 5 25 15 MBA BMW
3 10 20 30 10 10 20 MBA Lexus
4 10 15 30 10 20 15 MBA BMW
5 20 10 40 1 14 15 MBA Mercedes
6 20 30 10 20 10 10 MBA Lexus

Hierarchical Clustering Analysis

Hierarchical Clustering Analysis is one of the most popular techniques used for market segmentation. It is a numerical procedure that separates a set of observations into clusters from the bottom up, joining individuals sequentially until one large cluster is obtained. Hence, this technique does not require pre-specifying the number of clusters, which can instead be assessed through the “dendrogram” (a tree-like representation of the data). This is known as agglomerative hierarchical clustering. In theory, clustering can also proceed top-down, by initially grouping all the observations into one cluster and then successively splitting it; this is known as divisive hierarchical clustering, but it is rarely done in practice.

More specifically, the agglomerative algorithm works as follows:

  1. Each respondent is initially assigned to his or her own cluster
  2. Identify the distance between each pair of clusters (initially between pairs of respondents)
  3. The two closest clusters are combined into one
  4. Repeat steps 2 and 3 until there is one unique cluster containing all the observations
  5. Represent the clusters in a dendrogram

A key aspect of hierarchical clustering is choosing how to compute the distance between two clusters. Is it the maximal distance between two points from each of these clusters? The minimal distance? And how is the distance between two points itself defined? In this handout, we will use Ward’s criterion, which aims to minimize the total within-cluster variance. To do so, we use the R function hclust. We start by standardizing the data so that each variable is on the same scale, meaning each variable has a mean of zero and a standard deviation of one. The distance between two observations is computed as the length of the straight line drawn from one point to the other, commonly referred to as Euclidean distance.
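
A minimal sketch of the standardization and distance computation; it assumes the six attribute ratings sit in columns 2 to 7 of seg_df (the data frame from the reading step).

```r
# Standardize each attribute: mean 0, standard deviation 1
seg_scaled <- scale(seg_df[, 2:7])

# Pairwise Euclidean distances between the standardized observations
d <- dist(seg_scaled, method = "euclidean")

# Distance matrix of the first 5 observations
round(as.matrix(d)[1:5, 1:5], 4)
```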

The distance matrix of the first 5 observations is shown in the table below:

Distance Matrix (first 5 observations)

        1        2        3        4        5
1  0.0000   3.7302   2.8022   1.7756   2.7466
2  3.7302   0.0000   4.2187   3.0175   2.9845
3  2.8022   4.2187   0.0000   1.9747   3.3311
4  1.7756   3.0175   1.9747   0.0000   2.9241
5  2.7466   2.9845   3.3311   2.9241   0.0000

The choice of distance metric should be made based on theoretical concerns from the domain of study. For example, if clustering crime sites in a city, city block distance may be appropriate. Or, better yet, the time taken to travel between each location.

The function hclust() is used to apply hierarchical clustering to the sample data. We obtain the dendrogram shown below, which can help us decide the number of clusters to retain. At first sight, this number seems to be either 3 or 4. Note: in order to get the same labeling of the clusters, it is important to set the seed to a specific value. Otherwise, cluster 1 in one analysis may correspond to cluster 3 in another.
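
A sketch of the clustering call, using the objects defined above; method = "ward.D2" is an assumption of how Ward’s criterion was invoked.

```r
set.seed(1990)                       # fix the seed for reproducible cluster labeling
hc <- hclust(d, method = "ward.D2")  # agglomerative clustering with Ward's criterion
plot(hc, labels = FALSE, hang = -1)  # dendrogram used to choose the number of clusters
```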

The four-cluster solution

We cut the tree into 4 clusters, as sketched below:
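
A minimal sketch; object names follow the earlier chunks.

```r
# Cut the dendrogram into 4 clusters
h_cluster <- cutree(hc, k = 4)

# Number of individuals in each cluster
table(h_cluster)

# Cluster profiles: means of the standardized attributes by cluster
aggregate(as.data.frame(seg_scaled), by = list(h_cluster), FUN = mean)
```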

Let us now look at a description of this clustering. The table below shows the number of individuals in each cluster:

h_cluster Freq
1 18
2 29
3 17
4 9

The table below reports the profiles of the four clusters (i.e., the means of the clustering variables by cluster).

Group.1 Trendiness Styling Reliability Sportiness Performance Comfort
1 -0.504 -0.684 1.100 -0.946 0.655 0.086
2 -0.016 -0.425 -0.282 0.501 -0.099 0.586
3 1.147 0.855 -0.657 0.163 -0.919 -0.698
4 -1.109 1.121 -0.052 -0.030 0.746 -0.743

Looking at this table, we can describe the clusters as follows:

  1. Cluster 1 values Reliability and Performance
  2. Cluster 2 values Sportiness and Comfort
  3. Cluster 3 values Trendiness and Styling
  4. Cluster 4 values Styling and Performance

Hence, it seems that Cluster 4 is a combination of Clusters 1 and 3. This suggests that 3 clusters may be better at capturing the heterogeneity of the subjects in this dataset.

Three-Cluster Solution
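
Cutting the same tree at three clusters is a one-line change; a sketch under the same assumptions as before:

```r
# Three-cluster solution from the same dendrogram
h_cluster <- cutree(hc, k = 3)
table(h_cluster)                                                        # cluster sizes
aggregate(as.data.frame(seg_scaled), by = list(h_cluster), FUN = mean)  # cluster profiles
```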

In the table below we can see the number of students in each cluster:

h_cluster Freq
1 27
2 29
3 17

The table below reports the profiles of the three clusters (the means of the clustering variables by cluster):

Group.1 Trendiness Styling Reliability Sportiness Performance Comfort
1 -0.7054 -0.0821 0.7159 -0.6405 0.6851 -0.1902
2 -0.0158 -0.4249 -0.2816 0.5005 -0.0989 0.5862
3 1.1473 0.8552 -0.6566 0.1635 -0.9193 -0.6979

Looking at this table, we can describe the clusters as follows:

  • Cluster 1 values Performance and Reliability (Performance Driven)
  • Cluster 2 values Comfort and Sportiness (Comfort Driven)
  • Cluster 3 values Trendiness and Style (Appearance Driven)

This solution has clusters of similar sizes. In addition, we can easily characterize each of them: the first cluster emphasizes Performance and Reliability, Cluster 2 values Comfort and Sportiness, and the third cluster is driven by appearance.

We can also focus on a given cluster. Here’s the Appearance-Driven Segment and the cluster members:
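
The member listing itself is not reproduced here, but a one-line sketch of how it can be obtained (assuming the ID column shown earlier and the three-cluster labeling above, in which cluster 3 is the Appearance-Driven segment):

```r
# IDs of the students in the Appearance-Driven segment (cluster 3 in this labeling)
seg_df$ID[h_cluster == 3]
```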

Number of Clusters

As seen above, one can use the dendrogram to decide on the appropriate number of clusters. The function NbClust examines 26 indices/criteria used to determine the optimal number of clusters and outputs the optimal number based on the majority rule. Note that, since the ratings are a constant-sum allocation across the attributes, we must use only 5 variables to avoid collinearity issues: if we know the point allocations for the first 5 attributes, then the last one can be computed as the residual.
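
A sketch of the NbClust() call for the hierarchical solution; the ward.D2 method, the 2-to-15 cluster range, and dropping the sixth attribute are assumptions consistent with the text.

```r
library(NbClust)

# Evaluate candidate numbers of clusters on 5 of the 6 attributes
set.seed(1990)
nb_h <- NbClust(seg_scaled[, 1:5], distance = "euclidean",
                min.nc = 2, max.nc = 15, method = "ward.D2")
```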

The D index is a graphical method of determining the number of clusters. In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase of the value of the measure.

Among all indices:

  • 7 proposed 3 as the best number of clusters
  • 2 proposed 4 as the best number of clusters
  • 2 proposed 5 as the best number of clusters
  • 2 proposed 6 as the best number of clusters
  • 4 proposed 9 as the best number of clusters
  • 3 proposed 12 as the best number of clusters
  • 1 proposed 13 as the best number of clusters
  • 2 proposed 15 as the best number of clusters

According to the majority rule, the best number of clusters is 3.

Targeting the Segments: Demographics

Examining the above results, the following questions arise:

  • Are these segments managerially meaningful?
  • Do the segments lead to different marketing strategies?
  • Do they give any insights about the nature of the competition?

Segmentation is about understanding the drivers of choice as well as the options that are available to consumers. In our case the drivers are Performance, Comfort, and Appearance. We hope that there is a relationship between these drivers, the students’ education level, and their car model choices. We can now analyze our demographics in light of these cluster results using the function CrossTable() from package gmodels:
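
A sketch of the cross-tabulation call; the MBA column name and the h_cluster labels follow the earlier chunks, and the proportion options are assumptions chosen to mirror the tables below.

```r
library(gmodels)

# Hierarchical segments vs. education (MBA vs. undergraduate)
CrossTable(seg_df$MBA, h_cluster,
           prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
```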

Education

 
Education (rows) by H. Clust segment (columns); each cell shows N, Row %, Column %

                    Performance    Comfort   Appearance   Row Total
MBA        N              14            6            4    24 (32.88%)
           Row %       58.33%       25.00%       16.67%
           Col %       51.85%       20.69%       23.53%
Undergrad  N              13           23           13    49 (67.12%)
           Row %       26.53%       46.94%       26.53%
           Col %       48.15%       79.31%       76.47%
Col Total  N              27           29           17    73
           % of all    36.99%       39.73%       23.29%

In the table above, the rows are MBA or undergraduate students and the columns are the three clusters suggested by the hierarchical cluster analysis. There are 24 MBA students, who represent 33% of the sample, and if we zoom in on that population, 14 out of 24 belong to the performance segment; that is, 58% of the MBA students are classified in this cluster. Overall, MBA students are more likely than undergraduate students to pay attention to the performance of a car. If we look at the undergraduate students, only 13 out of 49 of them emphasized performance, while 23 of them emphasized comfort. So it can be concluded that the most important attribute for the undergraduate students is comfort.

It seems that these segments can be identified through some demographics, even though we use only education as a predictor. The education variable captures an age-gap effect, because MBA students tend to be at least five or six years older than undergraduate students. In light of the results, older people tend to emphasize performance, while younger people tend to emphasize comfort and appearance. This relationship is significant at the 5% significance level, which means one could target these clusters using the education variable.

Car Model Choice

Here we analyze if there is a relationship between the drivers and the choice of cars that students made.
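
A sketch of the corresponding cross-tabulation against the Choice column:

```r
# Hierarchical segments vs. car model choice
CrossTable(seg_df$Choice, h_cluster,
           prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
```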

 
Choice (rows) by H. Clust segment (columns); each cell shows N, Row %, Column %

                    Performance    Comfort   Appearance   Row Total
BMW        N              14           10            8    32 (43.84%)
           Row %       43.75%       31.25%       25.00%
           Col %       51.85%       34.48%       47.06%
Lexus      N               9            8            5    22 (30.14%)
           Row %       40.91%       36.36%       22.73%
           Col %       33.33%       27.59%       29.41%
Mercedes   N               4           11            4    19 (26.03%)
           Row %       21.05%       57.89%       21.05%
           Col %       14.81%       37.93%       23.53%
Col Total  N              27           29           17    73
           % of all    36.99%       39.73%       23.29%

It can be seen that the performance-driven segment contains 27 people, who represent 37% of the dataset. Also, 14 out of 27 chose BMW, which means the share of BMW in this market segment is approximately 52%. It can also be observed that the BMW choice is appealing to all three clusters, but especially to the Performance segment. In the case of the Lexus choice, it appears to have a roughly equal share across the three segments. As for the Comfort segment, there are 29 people classified in this cluster, and 11 of these chose Mercedes: the market share of Mercedes in this cluster is around 38%.

There seems to be some relationship: BMW is appealing to all segments, whereas Mercedes is mostly appealing to the Comfort segment, and Lexus seems to have a roughly equal share across the three segments. Unfortunately, the relationship is not statistically significant, which is not surprising given the small sample of only 73 students.

K-Means Clustering

K-Means clustering is an unsupervised machine learning technique that requires us to specify the number of clusters in advance. A couple of classic examples are clustering different types of customers in company loyalty programs and separating medical patients into different risk categories. This technique groups the observations based on their similarity using an optimization procedure: the aim is to minimize the within-cluster variation, defined as the sum of squared Euclidean distances between each data point and the centroid of its cluster. More precisely, the algorithm works as follows:

  1. Start by assigning each point to a cluster randomly
  2. Compute the centroid of each cluster and the distances of each point to each centroid
  3. Reassign each observation to the closest centroid
  4. Repeat steps 2 and 3 until the assignments no longer change (the within-cluster variance stops decreasing)

(For a more mathematically rigorous interpretation and implementation of this method, see here).

Three Cluster Solution obtained using K-Means

Let us start by observing how the algorithm works on our data with 3 segments. We use the function kmeans() from package stats, and we set the seed to a specific value (e.g., 1990) to make the results replicable.
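
A sketch of the K-means call on the standardized data; nstart = 25 is an assumption added to make the result less sensitive to the random initialization.

```r
set.seed(1990)                                      # make the results replicable
km <- kmeans(seg_scaled, centers = 3, nstart = 25)  # K-means with 3 clusters

km$size                # cluster sizes
round(km$centers, 3)   # cluster profiles (standardized attribute means)
```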

cluster Trendiness Styling Reliability Sportiness Performance Comfort
1 0.503 1.062 -0.436 0.116 -0.547 -0.814
2 -0.003 -0.379 -0.350 0.498 -0.045 0.536
3 -0.637 -0.684 1.178 -1.033 0.779 0.086

Looking at the kmeans method results, we can describe the clusters as follows:

  • Cluster 1 values Styling and Trendiness (Appearance Driven)
  • Cluster 2 values Comfort and Sportiness (Comfort Driven)
  • Cluster 3 values Performance and Reliability (Performance Driven)

Finding the optimal number of clusters

A key question when using the K-Means clustering technique is choosing the optimal number of segments. To do so, we can use the function NbClust() as in hierarchical clustering, this time specifying method = "kmeans". From the output, we see that the three-cluster solution is best.
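
A sketch of the call, under the same assumptions as before:

```r
# NbClust with the kmeans method on 5 of the 6 attributes
set.seed(1990)
nb_km <- NbClust(seg_scaled[, 1:5], distance = "euclidean",
                 min.nc = 2, max.nc = 15, method = "kmeans")
```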

The Hubert index is a graphical method of determining the number of clusters. In the plot of Hubert index, we seek a significant knee that corresponds to a significant increase of the value of the measure i.e the significant peak in the Hubert index second differences plot.

The D index is a graphical method of determining the number of clusters. In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase in the value of the measure.

Index Table: K-Means
Index             KL     CH    Hartigan   CCC    Scott   TrCovW   TraceW   Friedman   Rubin   Cindex   DB    Silhouette
Best n. clusters  12.0   3     7.0        12.0   8.0     4        6.0      15.0       12.0    7.0      15    9.0
Index value       24.7   23    7.2        -2.3   47.9    674      12.5     3.3        -0.3    0.3      1     0.3

Index             Duda   PseudoT2   Beale   Ratkowsky   Ball   PtBiserial   Frey   McClain   Dunn   Hubert   SDindex   Dindex   SDbw
Best n. clusters  3.0    3.0        3.0     3.0         4.0    9.0          2      3.0       9.0    0        9.0       0        14.0
Index value       1.1    -1.3       -0.1    0.4         23.3   0.5          NA     1.3       0.2    0        1.6       0        0.3

Among all indices in the table above, the following can be concluded:

  • 6 proposed 3 as the best number of clusters
  • 2 proposed 4 as the best number of clusters
  • 2 proposed 6 as the best number of clusters
  • 2 proposed 7 as the best number of clusters
  • 1 proposed 8 as the best number of clusters
  • 4 proposed 9 as the best number of clusters
  • 3 proposed 12 as the best number of clusters
  • 1 proposed 14 as the best number of clusters
  • 2 proposed 15 as the best number of clusters

According to the majority rule, the best number of clusters is 3.

Looking at the cluster means above, we see that the clusters defined with the kmeans function are characterized similarly to those from the hclust method: whether we use K-means or hierarchical clustering, we receive the same recommendation of three clusters. We can now compare this clustering to the demographics, as done with the hierarchical clustering.
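
The same CrossTable() sketch can be reused with the K-means labels:

```r
# K-means segments vs. education
CrossTable(seg_df$MBA, km$cluster,
           prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
```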

Education

 
Education (rows) by K-Means segment (columns); each cell shows N, Row %, Column %

                    Appearance KM   Comfort KM   Perf. KM   Row Total
MBA        N                    5            8         11   24 (32.88%)
           Row %           20.83%       33.33%     45.83%
           Col %           21.74%       25.00%     61.11%
Undergrad  N                   18           24          7   49 (67.12%)
           Row %           36.73%       48.98%     14.29%
           Col %           78.26%       75.00%     38.89%
Col Total  N                   23           32         18   73
           % of all        31.51%       43.84%     24.66%

Overall, it can be observed that MBA students are more likely than undergraduate students to pay attention to the performance of a car, with 11 out of 24 in this segment. Looking at the undergraduate students, only 7 out of 49 of them emphasized performance, while 24 of them emphasized comfort. Therefore, it can be concluded that, as in the hclust results, the most important attribute for the undergraduate students is the comfort of a car.

As we can see, education captures an age-gap effect in our sample; MBA students tend to be at least five or six years older than undergraduate students. In light of the results, it can be said that older people tend to emphasize performance, while younger people tend to emphasize comfort and appearance. This relationship is significant at the 5% significance level, which means one could target these clusters using the education variable.

Car Model Choice
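
As with the hierarchical solution, we can cross-tabulate the K-means segments against the students’ car choices; a sketch:

```r
# K-means segments vs. car model choice
CrossTable(seg_df$Choice, km$cluster,
           prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
```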

 
Choice (rows) by K-Means segment (columns); each cell shows N, Row %, Column %

                    Appearance KM   Comfort KM   Perf. KM   Row Total
BMW        N                   11           14          7   32 (43.84%)
           Row %           34.38%       43.75%     21.88%
           Col %           47.83%       43.75%     38.89%
Lexus      N                    6            8          8   22 (30.14%)
           Row %           27.27%       36.36%     36.36%
           Col %           26.09%       25.00%     44.44%
Mercedes   N                    6           10          3   19 (26.03%)
           Row %           31.58%       52.63%     15.79%
           Col %           26.09%       31.25%     16.67%
Col Total  N                   23           32         18   73
           % of all        31.51%       43.84%     24.66%

If we look at the performance-driven segment (Perf. KM), there are 18 people, representing about 25% of the sample; 7 out of 18 of these cluster members chose BMW, so the share of BMW in this market segment is approximately 39%. We can observe that the BMW choice is mainly appealing to the Appearance segment. The Lexus choice takes a roughly similar share of the Appearance and Comfort segments and its largest share (about 44%) in the Performance segment. As for the Comfort segment, there are 32 people classified in this cluster, and 10 of them chose Mercedes, around 31% of the segment.

There seems to be some relationship: BMW is appealing to all the segments, whereas Mercedes is mostly appealing to the Comfort segment, and Lexus is spread fairly evenly across the segments, quite similar to the hclust results. Unfortunately, the relationship is not statistically significant, and this is not that surprising because the sample size is on the lower side, with only 73 students.

Clustering Results Comparison

The table below shows the comparison between the K-Means and Hierarchical clustering assignments based on our implementations:
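
A sketch of how this comparison can be produced with the objects from the earlier chunks:

```r
# Agreement between the hierarchical and K-means partitions
CrossTable(h_cluster, km$cluster,
           prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
```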

 
Hierarchical segment (rows) by K-Means segment (columns); each cell shows N, Row %, Column %

                       Appearance KM   Comfort KM   Perf. KM   Row Total
Performance   N                    7            3         17   27 (36.99%)
              Row %           25.93%       11.11%     62.96%
              Col %           30.43%        9.38%     94.44%
Comfort       N                    1           27          1   29 (39.73%)
              Row %            3.45%       93.10%      3.45%
              Col %            4.35%       84.38%      5.56%
Appearance    N                   15            2          0   17 (23.29%)
              Row %           88.24%       11.76%      0.00%
              Col %           65.22%        6.25%      0.00%
Col Total     N                   23           32         18   73
              % of all        31.51%       43.84%     24.66%

We can see that the biggest difference between the two classification methods lies in the Performance segments (18 in kmeans and 27 in hclust). From the table above we can now calculate the hit rate, which is a measure of concordance between the two classification methods:

\[ Hit\ Rate = \frac{(15+27+17)}{73} = 0.808 \approx 81\% \]

This means that 81% of the people in the sample are classified in the same cluster by both techniques. The association is highly significant, as indicated by the p-value \(p = 7.27 \times 10^{-16}\), which is essentially zero.
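
A sketch of how the hit rate and the associated test can be computed, matching each hierarchical cluster to its best-fitting K-means cluster:

```r
conc <- table(h_cluster, km$cluster)    # 3 x 3 concordance table
sum(apply(conc, 1, max)) / sum(conc)    # hit rate: (15 + 27 + 17) / 73 = 0.808
chisq.test(conc)                        # test of association between the two partitions
```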

As we know, clustering is a subjective statistical analysis, and there is more than one appropriate algorithm for every dataset and type of problem. So how should one choose between K-means and hierarchical clustering?

  • If the number of clusters is known in advance but the group memberships are not, K-means is a natural choice.
  • If there is no prior belief about the number of clusters, hierarchical clustering can be used to determine it.
  • With a large dataset, K-means computes faster than HCA.
  • K-Means represents each cluster by its centroid (the mean of its members), while HCA offers various linkage methods that may or may not use centroids.
  • K-Means returns a flat partition, whereas the hierarchical result (the dendrogram) is often more interpretable and informative.
  • It is easier to determine the number of clusters from a hierarchical clustering dendrogram.
  • HCA benefits from not needing an explicit “k” number of clusters a priori, so one can inspect all the potential clusterings and decide which make the most sense for a particular problem.

As seen in this article, the two clustering techniques give very similar results on this dataset. That said, our results indicate that hierarchical clustering works especially well with smaller datasets, as in our case. Agglomerative algorithms become more computationally expensive as more data points are considered, because at each iteration they must evaluate the distances between all pairs of clusters to decide which to merge, and one must then decide where to cut the resulting tree.

References

  1. https://www.sciencedirect.com/topics/computer-science/hierarchical-clustering

  2. https://uc-r.github.io/kmeans_clustering

  3. https://web.stanford.edu/~hastie/Papers/gap.pdf

  4. https://www.statmethods.net/advstats/cluster.html

  5. https://link.springer.com/chapter/10.1007/978-3-662-44980-6_5

  6. https://www.springer.com/gp/book/9783540877523