Marketers are interested in understanding and forecasting how customers purchase products and services and how they respond to marketing actions initiated by the firm. Marketing analysts develop quantitative models that leverage business data, statistical computation, and machine learning to forecast sales and to support important marketing decisions involving customer relationship management (CRM), market segmentation, monetization, value creation and communication. In this article we will focus on the machine learning techniques broadly used in the market segmentation field.
Market Segmentation is about understanding the drivers of choice as well as the options that are available to consumers. A famous and widely used technique for market segmentation is Cluster Analysis: a class of unsupervised learning techniques used to classify individuals into groups such that:
- individuals within the same group are as similar as possible, and
- individuals in different groups are as dissimilar as possible.
This article walks through two popular cluster analysis techniques: Hierarchical Clustering and K-Means Clustering.
The sample includes data from 73 students who were asked to allocate 100 points across six automobile attributes (Trendiness, Styling, Reliability, Sportiness, Performance, and Comfort) according to their importance in deciding which car to buy. We use this dataset to answer the following questions:
Let’s start by reading the raw data. The first 6 observations from our sample data can be seen here:
| ID | Trendiness | Styling | Reliability | Sportiness | Performance | Comfort | MBA | Choice |
|---|---|---|---|---|---|---|---|---|
| 1 | 10 | 20 | 35 | 5 | 20 | 10 | MBA | Lexus |
| 2 | 25 | 5 | 25 | 5 | 25 | 15 | MBA | BMW |
| 3 | 10 | 20 | 30 | 10 | 10 | 20 | MBA | Lexus |
| 4 | 10 | 15 | 30 | 10 | 20 | 15 | MBA | BMW |
| 5 | 20 | 10 | 40 | 1 | 14 | 15 | MBA | Mercedes |
| 6 | 20 | 30 | 10 | 20 | 10 | 10 | MBA | Lexus |
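To load this data in R, a minimal sketch (the file name car_preferences.csv and the object name car_data are assumptions, not part of the original analysis):

```r
# Read the raw survey data; the CSV file name is hypothetical.
car_data <- read.csv("car_preferences.csv", stringsAsFactors = TRUE)
head(car_data)   # first 6 observations, as shown in the table above
```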
Hierarchical Clustering Analysis is one of the most popular techniques used for market segmentation. It is a numerical procedure which attempts to separate a set of observations into clusters from the bottom up, by joining single individuals sequentially until one large cluster is obtained. Hence, this technique does not require the pre-specification of the number of clusters, which can instead be assessed through the "dendrogram" (a tree-like representation of the data). This is known as agglomerative hierarchical clustering. In theory, it can also be done by initially grouping all the observations into one cluster, and then successively splitting these clusters. This is known as divisive hierarchical clustering, but is rarely done in practice.
More specifically, the agglomerative algorithm works as follows:
1. Start with each observation in its own cluster.
2. Compute the distances between all pairs of clusters.
3. Merge the two closest clusters into one.
4. Repeat steps 2-3 until all observations belong to a single cluster.
A key aspect of hierarchical clustering is choosing how to compute the distance between two clusters. Is it the maximal distance between two points taken from each cluster? The minimal distance? And how should the distance between two individual points be measured? In this article, we use Ward's criterion, which aims to minimize the total within-cluster variance, via the R function hclust. We start by standardizing the data so that each variable is on the same scale: after standardization every variable has mean zero and standard deviation one. The distance between two observations is then the length of the straight line between them in this standardized space, commonly referred to as the Euclidean distance.
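A sketch of the standardization and distance computation, assuming the attribute columns are stored in the car_data object introduced above:

```r
attrs <- c("Trendiness", "Styling", "Reliability",
           "Sportiness", "Performance", "Comfort")
scaled_data <- scale(car_data[, attrs])       # each column now has mean 0 and sd 1
d <- dist(scaled_data, method = "euclidean")  # pairwise Euclidean distances
round(as.matrix(d)[1:5, 1:5], 4)              # distance matrix of the first 5 observations
```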
The distance matrix of the first 5 observations is shown in the table below:
| Obs. | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 0.0000 | 3.7302 | 2.8022 | 1.7756 | 2.7466 |
| 2 | 3.7302 | 0.0000 | 4.2187 | 3.0175 | 2.9845 |
| 3 | 2.8022 | 4.2187 | 0.0000 | 1.9747 | 3.3311 |
| 4 | 1.7756 | 3.0175 | 1.9747 | 0.0000 | 2.9241 |
| 5 | 2.7466 | 2.9845 | 3.3311 | 2.9241 | 0.0000 |
The choice of distance metric should be made based on theoretical concerns from the domain of study. For example, if clustering crime sites in a city, city block distance may be appropriate. Or, better yet, the time taken to travel between each location.
The function hclust() is used to apply hierarchical clustering to the sample data. We obtain the dendrogram shown below, which can help us decide how many clusters to retain; at first sight, this number seems to be either 3 or 4. Note: to obtain the same labeling of the clusters across runs of the analysis, it is important to set the seed to a specific value. Otherwise, cluster 1 in one analysis may correspond to cluster 3 in another.
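A minimal sketch of this step, reusing the distance object d from above:

```r
hc <- hclust(d, method = "ward.D2")  # agglomerative clustering with Ward's criterion
plot(hc, labels = FALSE, hang = -1,
     main = "Dendrogram of the 73 students")
```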
We cut the tree into 4 clusters, as sketched below:
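A sketch of this step, using cutree() on the tree built above:

```r
h_cluster4 <- cutree(hc, k = 4)  # assign each student to one of 4 clusters
table(h_cluster4)                # number of individuals per cluster
```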
Let us now look at a description of this clustering. The table below shows the number of individuals in each cluster:
| h_cluster | Freq |
|---|---|
| 1 | 18 |
| 2 | 29 |
| 3 | 17 |
| 4 | 9 |
The table below reports the profiles of the four clusters (i.e., the means of the clustering variables by cluster).
| Group.1 | Trendiness | Styling | Reliability | Sportiness | Performance | Comfort |
|---|---|---|---|---|---|---|
| 1 | -0.504 | -0.684 | 1.100 | -0.946 | 0.655 | 0.086 |
| 2 | -0.016 | -0.425 | -0.282 | 0.501 | -0.099 | 0.586 |
| 3 | 1.147 | 0.855 | -0.657 | 0.163 | -0.919 | -0.698 |
| 4 | -1.109 | 1.121 | -0.052 | -0.030 | 0.746 | -0.743 |
Looking at this table, we can describe the clusters as follows:
- Cluster 1 places the most weight on Reliability and Performance.
- Cluster 2 values Comfort and Sportiness.
- Cluster 3 values Trendiness and Styling, i.e., appearance.
- Cluster 4 values Styling and Performance, with little weight on Trendiness and Comfort.
Hence, it seems that Cluster 4 is a combination of Clusters 1 and 3. This suggests that 3 clusters may be better at capturing the heterogeneity of the subjects in this dataset.
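Re-cutting the same tree at 3 clusters and profiling them could look like this (a sketch, with object names carried over from the earlier sketches):

```r
h_cluster3 <- cutree(hc, k = 3)                            # 3-cluster solution
table(h_cluster3)                                          # cluster sizes
aggregate(scaled_data, by = list(h_cluster3), FUN = mean)  # standardized attribute means by cluster
```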
In the table below we can see the number of students in each cluster:
| h_cluster | Freq |
|---|---|
| 1 | 27 |
| 2 | 29 |
| 3 | 17 |
The corresponding cluster profiles (means of the standardized clustering variables by cluster) are:

| Group.1 | Trendiness | Styling | Reliability | Sportiness | Performance | Comfort |
|---|---|---|---|---|---|---|
| 1 | -0.7054 | -0.0821 | 0.7159 | -0.6405 | 0.6851 | -0.1902 |
| 2 | -0.0158 | -0.4249 | -0.2816 | 0.5005 | -0.0989 | 0.5862 |
| 3 | 1.1473 | 0.8552 | -0.6566 | 0.1635 | -0.9193 | -0.6979 |
Looking at this table, we can describe the clusters as follows:
- Cluster 1 values Reliability and Performance.
- Cluster 2 values Comfort and Sportiness.
- Cluster 3 values Trendiness and Styling.

This solution has clusters of similar sizes, and each of them is easy to characterize: the first cluster cares about Performance and Reliability, Cluster 2 values Comfort and Sportiness, and the third cluster cares about appearance.
We can also focus on a given cluster, for example the appearance-driven segment, and list its members; a sketch of how to extract them follows.
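A simple way to pull out those members, assuming cluster 3 of the 3-cluster solution is the appearance-driven segment as in the profile table above:

```r
car_data[h_cluster3 == 3, c("ID", "MBA", "Choice")]  # students in the appearance-driven segment
```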
As seen above, one can use the dendrogram to decide on the appropriate number of clusters. The function NbClust examines 26 indices/criteria used to determine the optimal number of clusters and outputs the optimal number based on the majority rule. Note that since the attributes are a constant-sum allocation, we must use only 5 of the variables to avoid collinearity: if we know the point allocations for the first 5 attributes, then the last one can be computed as the residual.
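A sketch of the NbClust() call, dropping the last attribute to avoid the collinearity issue (the range of candidate solutions, 2 to 10, is an assumption):

```r
library(NbClust)
nb <- NbClust(scaled_data[, 1:5], distance = "euclidean",
              min.nc = 2, max.nc = 10,
              method = "ward.D2", index = "all")
nb$Best.nc  # optimal number of clusters proposed by each index
```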
The D index is a graphical method of determining the number of clusters. In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase of the value of the measure.
Among all indices, and according to the majority rule, the best number of clusters is 3.
Examining the above results, the following questions arise:
Segmentation is about understanding the drivers of choice as well as the options that are available to consumers. In our case the drivers are Performance, Comfort and Appearance. We hope that there is a relationship between these drivers, the students' education level, and the car models they chose. We can now analyze our demographics in light of these cluster results using the function CrossTable() from package gmodels:
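A sketch of the call (the column holding education level is assumed to be car_data$MBA, and h_cluster3 is the 3-cluster membership vector from before):

```r
library(gmodels)
CrossTable(car_data$MBA, h_cluster3,
           prop.t = FALSE, prop.chisq = FALSE,
           chisq = TRUE)   # counts, row %, column %, plus a chi-square test
```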
Each cell shows N / Row % / Column %; the Total row and column show N / % of sample.

| Education | H. Clust: Performance | Comfort | Appearance | Total |
|---|---|---|---|---|
| MBA | 14 / 58.33% / 51.85% | 6 / 25.00% / 20.69% | 4 / 16.67% / 23.53% | 24 / 32.88% |
| Undergrad | 13 / 26.53% / 48.15% | 23 / 46.94% / 79.31% | 13 / 26.53% / 76.47% | 49 / 67.12% |
| Total | 27 / 36.99% | 29 / 39.73% | 17 / 23.29% | 73 |
In the table above, the rows are MBA or undergraduate students and the columns are the three clusters suggested by the hierarchical cluster analysis. There are 24 MBA students, who represent about 33% of the sample; zooming in on that group, 14 out of 24 belong to the performance segment, meaning 58% of the MBA students are classified in this cluster. Overall, MBA students are more likely than undergraduate students to pay attention to the performance of a car. If we look at the undergraduate students, only 13 out of 49 of them emphasized performance, while 23 emphasized comfort. So it can be concluded that the most important attribute for the undergraduate students is car comfort.
It seems that these segments can be identified through demographics, even though we use only education as a predictor. The education variable captures an age-gap effect, because MBA students tend to be at least five or six years older than undergraduate students. In light of the results, older people tend to emphasize performance while younger people tend to emphasize comfort and appearance. This relationship is significant at the 5% significance level, which means one could target these clusters using the education variable.
Here we analyze whether there is a relationship between the drivers and the cars the students chose.
Each cell shows N / Row % / Column %; the Total row and column show N / % of sample.

| Choice | H. Clust: Performance | Comfort | Appearance | Total |
|---|---|---|---|---|
| BMW | 14 / 43.75% / 51.85% | 10 / 31.25% / 34.48% | 8 / 25.00% / 47.06% | 32 / 43.84% |
| Lexus | 9 / 40.91% / 33.33% | 8 / 36.36% / 27.59% | 5 / 22.73% / 29.41% | 22 / 30.14% |
| Mercedes | 4 / 21.05% / 14.81% | 11 / 57.89% / 37.93% | 4 / 21.05% / 23.53% | 19 / 26.03% |
| Total | 27 / 36.99% | 29 / 39.73% | 17 / 23.29% | 73 |
It can be seen that in the performance driven segment there are 27 people, which represent 37% of the dataset. Also, 14 out of 27 chose BMW, which means the share of BMW in this market segment is approximately 52%. It can also be observed that the BMW choice is appealing to all three clusters, but mainly in the Performance segment. In the case of the Lexus choice, it appears to have equal share across the three segments. As for the Comfort segment, there are 29 people classified in this cluster, and 11 of these chose Mercedes: the market share of Mercedes in this cluster is around 38%.
There seems to be some relationship: BMW appeals to all segments, Mercedes appeals mostly to the Comfort segment, and Lexus has a roughly equal share across the three segments. Unfortunately, the relationship is not particularly significant, which is not surprising given the small sample of only 73 students.
K-Means clustering is an unsupervised machine learning technique which requires us to specify the number of clusters in advance. A couple of classic applications are grouping different types of customers in company loyalty programs and separating medical patients into different risk categories. This technique groups the observations based on their similarity using an optimization procedure: the aim is to minimize the within-cluster variation, defined as the sum of squared Euclidean distances between each data point and the centroid of its cluster. More precisely, the algorithm works as follows:
1. Choose K initial centroids (typically at random).
2. Assign each observation to the cluster whose centroid is closest.
3. Recompute each centroid as the mean of the observations assigned to it.
4. Repeat steps 2-3 until the assignments no longer change.
(For a more mathematically rigorous interpretation and implementation of this method, see here).
Let us start by observing how the algorithm works on our data for 3 segments. We use the function kmeans() from package stats, and set the seed to a specific value (e.g., 1990) to make the results reproducible.
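A minimal sketch of this step, reusing the standardized data from the hierarchical analysis:

```r
set.seed(1990)                          # k-means starts from random centroids, so fix the seed
km <- kmeans(scaled_data, centers = 3)  # 3-cluster K-means solution
km$size                                 # cluster sizes
round(km$centers, 3)                    # cluster means on the standardized attributes
```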
| cluster | Trendiness | Styling | Reliability | Sportiness | Performance | Comfort |
|---|---|---|---|---|---|---|
| 1 | 0.503 | 1.062 | -0.436 | 0.116 | -0.547 | -0.814 |
| 2 | -0.003 | -0.379 | -0.350 | 0.498 | -0.045 | 0.536 |
| 3 | -0.637 | -0.684 | 1.178 | -1.033 | 0.779 | 0.086 |
Looking at the kmeans results, we can describe the clusters as follows:
- Cluster 1 values Styling and Trendiness, i.e., appearance.
- Cluster 2 values Comfort and Sportiness.
- Cluster 3 values Reliability and Performance.
A key question when using the K-Means clustering technique is choosing the optimal number of segments. To do that, we can use the function NbClust() as in hierarchical clustering, this time specifying method = "kmeans". From the output, we see that the three-cluster solution is again best.
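The NbClust() call is the same as before except for the method argument (a sketch, with the same assumed candidate range of 2 to 10 clusters):

```r
nb_km <- NbClust(scaled_data[, 1:5], distance = "euclidean",
                 min.nc = 2, max.nc = 10,
                 method = "kmeans", index = "all")
nb_km$Best.nc
```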
The Hubert index is a graphical method of determining the number of clusters. In the plot of Hubert index, we seek a significant knee that corresponds to a significant increase of the value of the measure i.e the significant peak in the Hubert index second differences plot.
The D index is a graphical method of determining the number of clusters. In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase in the value of the measure.
| | KL | CH | Hartigan | CCC | Scott | TrCovW | TraceW | Friedman | Rubin | Cindex | DB | Silhouette |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of clusters | 12.0 | 3 | 7.0 | 12.0 | 8.0 | 4 | 6.0 | 15.0 | 12.0 | 7.0 | 15 | 9.0 |
| Index value | 24.7 | 23 | 7.2 | -2.3 | 47.9 | 674 | 12.5 | 3.3 | -0.3 | 0.3 | 1 | 0.3 |

| | Duda | PseudoT2 | Beale | Ratkowsky | Ball | PtBiserial | Frey | McClain | Dunn | Hubert | SDindex | Dindex | SDbw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of clusters | 3.0 | 3.0 | 3.0 | 3.0 | 4.0 | 9.0 | 2 | 3.0 | 9.0 | 0 | 9.0 | 0 | 14.0 |
| Index value | 1.1 | -1.3 | -0.1 | 0.4 | 23.3 | 0.5 | NA | 1.3 | 0.2 | 0 | 1.6 | 0 | 0.3 |
Among all indices in the table above, the three-cluster solution receives the most votes; according to the majority rule, the best number of clusters is 3.
Looking at the cluster means above, we see that the clusters obtained with the kmeans function are characterized similarly to those from hclust: whether we use K-means or hierarchical clustering, we receive the same recommendation of three clusters with comparable profiles. We can now compare this clustering to the demographics, as was done with the hierarchical clustering.
Each cell shows N / Row % / Column %; the Total row and column show N / % of sample.

| Education | K-Means: Appearance KM | Comfort KM | Perf. KM | Total |
|---|---|---|---|---|
| MBA | 5 / 20.83% / 21.74% | 8 / 33.33% / 25.00% | 11 / 45.83% / 61.11% | 24 / 32.88% |
| Undergrad | 18 / 36.73% / 78.26% | 24 / 48.98% / 75.00% | 7 / 14.29% / 38.89% | 49 / 67.12% |
| Total | 23 / 31.51% | 32 / 43.84% | 18 / 24.66% | 73 |
Overall, it can be observed that MBA students are more likely than undergraduate students to pay attention to the performance of a car, with 11 out of 24 in this segment. Looking at the undergraduate students, we can see that only 7 out of 49 of them emphasized performance, and 24 of them emphasized comfort. Therefore, it can be concluded that the most important attribute for the undergraduate students is the comfort of a car as seen in the hclust results as well.
As we can see, education captures the age gap in the sample: MBA students tend to be at least five or six years older than undergraduate students. In light of the results, older people tend to emphasize performance while younger people tend to emphasize comfort and appearance. This relationship is significant at the 5% significance level, which means one could target these clusters using the education variable.
Each cell shows N / Row % / Column %; the Total row and column show N / % of sample.

| Choice | K-Means: Appearance KM | Comfort KM | Perf. KM | Total |
|---|---|---|---|---|
| BMW | 11 / 34.38% / 47.83% | 14 / 43.75% / 43.75% | 7 / 21.88% / 38.89% | 32 / 43.84% |
| Lexus | 6 / 27.27% / 26.09% | 8 / 36.36% / 25.00% | 8 / 36.36% / 44.44% | 22 / 30.14% |
| Mercedes | 6 / 31.58% / 26.09% | 10 / 52.63% / 31.25% | 3 / 15.79% / 16.67% | 19 / 26.03% |
| Total | 23 / 31.51% | 32 / 43.84% | 18 / 24.66% | 73 |
If we look at the performance-driven segment (Perf. KM), it contains 18 people, representing about 25% of the sample; 7 of these 18 chose BMW, so the share of BMW in this segment is approximately 39%. We can observe that the BMW choice is mainly appealing to the Appearance segment. The Lexus choice appears to have a fairly even share across the three segments. As for the Comfort segment, there are 32 people classified in this cluster, and 10 of them chose Mercedes, around 31% of the segment.
There seems to be some relationship: in this case BMW is appealing to all the segments, whereas Mercedes is mostly appealing to the comfort segment, and Lexus seems to have equal share across these segments, quite similar to the hclust results. Unfortunately, the relationship is not that significant and this is not that surprising, because the sample size is really on the lower side, with only 73 students.
The below table shows the comparison between K-Means and Hierarchical clustering algorithms based on our implementations:
Each cell shows N / Row % / Column %; the Total row and column show N / % of sample.

| Hierarchical C. | K-Means C.: Appearance KM | Comfort KM | Perf. KM | Total |
|---|---|---|---|---|
| Performance | 7 / 25.93% / 30.43% | 3 / 11.11% / 9.38% | 17 / 62.96% / 94.44% | 27 / 36.99% |
| Comfort | 1 / 3.45% / 4.35% | 27 / 93.10% / 84.38% | 1 / 3.45% / 5.56% | 29 / 39.73% |
| Appearance | 15 / 88.24% / 65.22% | 2 / 11.76% / 6.25% | 0 / 0.00% / 0.00% | 17 / 23.29% |
| Total | 23 / 31.51% | 32 / 43.84% | 18 / 24.66% | 73 |
We can see that the biggest difference between the two classification methods lies in the Performance segments (18 in kmeans and 27 in hclust). From the table above we can now calculate the hit rate, which is a measure of concordance between the two classification methods:
\[ Hit\ Rate = \frac{(15+27+17)}{73} = 0.808 \approx 81\% \]
This means that 81% of the people in the sample are classified in the same cluster by both techniques. The agreement is highly significant, as can be seen from the p-value \(p = 7.27 \times 10^{-16}\), which is essentially zero.
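A sketch of how the concordance table, hit rate, and significance test could be computed (assuming the cluster labels of the two solutions have been matched so that agreements fall on the diagonal):

```r
concordance <- table(h_cluster3, km$cluster)            # hierarchical vs. K-means assignments
hit_rate <- sum(diag(concordance)) / sum(concordance)   # share of students classified identically
hit_rate
chisq.test(concordance)                                  # tests the association between the two solutions
```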
As we know, clustering is a subjective statistical analysis, and more than one algorithm can be appropriate for a given dataset and type of problem. So how should one choose between K-means and hierarchical clustering?
As seen in this article, the two clustering techniques produce very similar results on the same dataset. That said, our results indicate that hierarchical clustering works especially well with smaller datasets such as ours. Agglomerative algorithms become increasingly expensive as the number of observations grows, because at each iteration they must evaluate the distances between all remaining pairs of clusters in order to decide which two to merge.