Nowadays, most industries are driven by customer behavior and data analysis. For any organization, these two factors help decide budgets, marketing schemes, and promotional campaigns, and help steer the business in a lucrative direction. For example, suppose you work for an online fashion retail store and want to decide which discount offers to target at different types of customers to boost sales. It would be beneficial to have a tool in your arsenal that segments customers into groups based on their likes and dislikes, previous shopping history, and search queries, helping you design strategies to boost sales.
Figure 1.1: An example of boosting product sales by finding the target audience using clustering
In this tutorial, we will learn about Hierarchical clustering— the tool that will help us cluster data into groups based on the dissimilarity between the observations in the data (see Figure 1.1 (Ali 2022)). To begin with, we familiarise ourselves with the broader categorization that Hierarchical clustering falls under— supervised versus unsupervised machine learning algorithms— and learn about clustering. Then, we will focus on Hierarchical clustering, delving deep into the parameters that govern its behavior. Finally, we will polish our concepts through a demonstration using R on publicly available data.
Machine learning, a sub-field of Artificial Intelligence (AI), refers to the computer’s ability to solve a problem without explicit instructions. There are many different types of learning in machine learning. In the context of Hierarchical Clustering, it is sufficient to understand the difference between 2 types— supervised and unsupervised. The tutorial (available here) (Brownlee 2019) provides a well-detailed description of all types of learning with examples.
To understand supervised and unsupervised learning, we first define two terms: the input variable(s), the attributes of the data used to identify patterns or make predictions, and the label, which contains the binary or multi-category information we may want to predict.
In supervised learning, the dataset has labels to “supervise” the algorithms in identifying patterns or learning relationships between the input variables and the label. Supervised learning can subsume different types of problems, not limited to the ones mentioned below:
Classification— predicting a discrete label, such as spam versus not spam. Algorithms include logistic regression and decision trees.
Regression— predicting a continuous value, such as the price of a house. Algorithms include linear regression.
In unsupervised learning, the dataset doesn’t have labels to “supervise” the algorithms, leaving them to learn patterns in the dataset. Unsupervised learning can subsume different types of problems, not limited to the ones mentioned below:
Clustering— grouping data points based on their similarities or differences. Algorithms such as K-means and Hierarchical clustering are examples of clustering.
Dimensionality reduction— reducing the number of features (or dimensions) in the dataset when there are too many. Algorithms include principal component analysis and manifold learning.
As discussed before, clustering helps in grouping data based on similarities or dissimilarities. It can be used in situations where we want to segment customers or recommend similar movies. Before learning specifically about Hierarchical clustering, it is important to understand why and when it is required by comparing it with K-means clustering. Figure 3.1 (Traverso A 2019) shows the steps involved in implementing the K-means algorithm to cluster the data. A brief procedure is mentioned below:
1. Choose the number of clusters, k.
2. Initialize k cluster centroids, for example by picking k random data points.
3. Assign every data point to its nearest centroid.
4. Recompute each centroid as the mean of the data points assigned to it.
5. Repeat steps 3 and 4 until the assignments no longer change.
Figure 3.1: K-means clustering algorithm
The K-means algorithm relies on predefining the number of clusters. Although there are ways of finding the optimal number of clusters (refer to this tutorial) (Banerjee 2022), they are not ideal since the decision is based on the sample data at hand and hence might not be robust to new data. Additionally, visualizing high-dimensional clustered data obtained from K-means requires an additional step— dimensionality reduction. The short example below illustrates how the number of clusters must be fixed up front.
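As a minimal sketch (using R's built-in iris data purely for illustration, not the retail example above), K-means in base R looks like this; note that centers has to be supplied before the algorithm ever sees the data:
# K-means requires the number of clusters up front (illustration on the built-in iris data)
set.seed(42);                                     # K-means starts from random centroids
features <- iris[, c("Petal.Length", "Petal.Width")];
km <- kmeans(features, centers = 3);              # the number of clusters is predefined here
table(km$cluster, iris$Species);                  # compare the 3 clusters with the known species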
Hierarchical clustering avoids predefining the number of clusters and makes it easier to visualize the grouping of high-dimensional data. Sounds like the perfect clustering tool? Let’s learn more about Hierarchical clustering to understand its advantages and disadvantages over K-means clustering.
Hierarchical clustering merges unlabelled data into a hierarchy of clusters forming a tree-shaped structure called a dendrogram. Before discussing how to build and interpret a dendrogram, it’s important to understand the 2 types of Hierarchical clustering— Agglomerative and Divisive (see Figure 4.1) (Jarman 2020).
Figure 4.1: Types of Hierarchical clustering
Agglomerative clustering is a bottom-up approach in which every data point starts as its own cluster (a leaf of the tree). At each step, the two least dissimilar clusters are merged, and the process continues until all data points come under a single cluster.
Divisive clustering is a top-down approach in which all data points start in one single cluster (the root of the tree). At each step, the most heterogeneous cluster is divided into two clusters, and the process continues until every data point becomes an individual cluster. In this tutorial, we will focus only on Agglomerative clustering.
Some questions need to be answered before one can efficiently grasp the Agglomerative clustering algorithm, such as:
How do we measure the dissimilarity between two data points?
How do we measure the dissimilarity between two clusters?
How do we visualize the merging of clusters and decide how many clusters to keep?
The answers to these questions lie in understanding three topics— distance measure, linkage methods, and dendrogram.
The dissimilarity between data points can be measured by calculating the distance between them. Some of the commonly used distance measures are mentioned below.
Euclidean distance— It is the straight-line distance between two data points. In n-dimensional space, the Euclidean distance between two vectors, p and q, is \(d(p,q) = \sqrt{\sum_{i=1}^n (p_i - q_i)^2}\).
Manhattan distance— It is the distance between two data points measured along the axes at right angles. In n-dimensional space, the Manhattan distance between two vectors, p and q, is \(d(p,q) = \sum_{i=1}^n |p_i - q_i|\).
Cosine distance— It measures the dissimilarity between two data points as one minus their cosine similarity. In n-dimensional space, the cosine distance between two vectors, p and q, is \(d(p,q) = 1 - cosine\_similarity\), where the cosine similarity is \(cosine\_similarity = \cos(\theta) = \frac{\sum_{i=1}^n p_i q_i}{\sqrt{\sum_{i=1}^n p_i^2}\,\sqrt{\sum_{i=1}^n q_i^2}}\).
Figure 5.1: An example illustrating the difference between Euclidean, Manhattan and Cosine distance between points A and B in a two-dimensional space
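As a quick illustration of the three measures, here is a minimal R sketch on two made-up points (the points and values are purely illustrative):
# two illustrative points in 2-dimensional space
p <- c(1, 2);
q <- c(4, 6);
# Euclidean and Manhattan distances via dist()
dist(rbind(p, q), method = "euclidean");   # sqrt((1-4)^2 + (2-6)^2) = 5
dist(rbind(p, q), method = "manhattan");   # |1-4| + |2-6| = 7
# cosine distance = 1 - cosine similarity, computed manually
1 - sum(p * q) / (sqrt(sum(p^2)) * sqrt(sum(q^2)));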
The distance measures mentioned above explain how to calculate the distance between 2 points. But how do we calculate the distance between 2 clusters? There are several ways to calculate the distance between clusters, called linkage methods, and the choice affects the type of clusters we get. The common linkage methods are given below.
Single Linkage— It is the shortest distance between two clusters, A and B. To compute this, we calculate the pairwise distances between data points in clusters A and B and use the smallest distance.
Complete Linkage— It is the farthest distance between two clusters, A and B. To compute this, we calculate the pairwise distances between data points in clusters A and B and use the largest distance.
Average Linkage— It is the average of the distances between all pairs of points in the two clusters, A and B. To compute this, we calculate the pairwise distances between data points in clusters A and B and take their average.
Centroid Linkage— It is the distance between the centroids (mean) of clusters A and B.
Figure 5.2 (Bonthu 2021) illustrates the different types of linkage methods between two clusters, A and B, in a two-dimensional space.
Figure 5.2: An example illustrating the different types of linkage methods in a two-dimensional space
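In R, these linkage methods correspond to the method argument of hclust(). A minimal sketch on made-up data (the points are random and purely illustrative):
# hierarchical clustering with different linkage methods (illustration on random data)
set.seed(1);
toy <- matrix(rnorm(20), ncol = 2);       # 10 made-up points in 2-D
d <- dist(toy, method = "euclidean");     # pairwise distance matrix between the points
hclust(d, method = "single");             # single linkage
hclust(d, method = "complete");           # complete linkage
hclust(d, method = "average");            # average linkage
hclust(d, method = "centroid");           # centroid linkage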
A dendrogram is a tree diagram that summarizes the distances at which clusters are merged. It helps visualize the different clusters in the data, as seen in Figure 5.3 (Polamuri 2020). The left-hand side of the figure shows a scatter plot of 7 data points, P1 to P7, in a two-dimensional space. The right-hand side shows a dendrogram for the data points, where the y-axis represents the distance measure.
To interpret the dendrogram, we focus on the y-axis. The dendrogram shows that P6 and P7 are the most similar clusters since the height of the link joining them is the shortest (as is evident from the scatter plot as well). The cluster containing P5 and the newly merged cluster containing P6 and P7 form the second most similar pair. Furthermore, the large height separating the cluster containing P5, P6, and P7 from the one containing P4 shows that they are the most dissimilar. By drawing a horizontal line at a given height, we can group the data into clusters based on the formed (linked) clusters below the line.
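In R, drawing such a horizontal line corresponds to cutting the tree with cutree(), either at a chosen height or into a chosen number of groups. A minimal, self-contained sketch on made-up data (the cut height of 1.5 is arbitrary):
# toy example: cut a dendrogram at a chosen height
set.seed(2);
toyPoints <- matrix(rnorm(20), ncol = 2);     # 10 made-up points in 2-D
toyHc <- hclust(dist(toyPoints), method = "average");
plot(toyHc);                                  # dendrogram of the toy data
abline(h = 1.5, col = "red");                 # the horizontal "cut" line
cutree(toyHc, h = 1.5);                       # cluster membership implied by that cut
cutree(toyHc, k = 3);                         # alternatively, ask directly for 3 clusters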
There are several misconceptions related to the interpretation of a dendrogram, such as assuming that two observations are similar because they appear close to each other along the horizontal axis (only the height at which their branches merge reflects dissimilarity), or that the number of clusters suggested by the shape of the dendrogram is necessarily optimal.
Hence, we should verify the number of clusters suggested by the shape of the dendrogram against methods such as the elbow and average silhouette methods (refer to this tutorial) (Hierarchical Cluster Analysis, n.d.).
Figure 5.3: An example of a dendrogram (right) based on a scatter plot (left)
The steps involved in implementing Agglomerative clustering are as follows:
1. Treat every data point as its own cluster.
2. Compute the pairwise dissimilarity (distance) between all clusters.
3. Merge the two least dissimilar clusters according to the chosen linkage method.
4. Recompute the distances between the newly formed cluster and the remaining clusters.
5. Repeat steps 3 and 4 until all data points belong to a single cluster, and summarise the sequence of merges in a dendrogram.
It’s time to implement Hierarchical Clustering in R! We will be utilizing the Formula 1 World Championship dataset, available for download here (Vopani 2022). The Formula 1 World Championship, one of the premier forms of auto racing, involves a series of races across the globe annually, with each race called a Grand Prix. F1 participating teams are known as constructors, and each constructor team has two racers.
The purpose of using the dataset is to cluster constructors based on their performance during the era of hybrid engines, 2014 to 2021, and to provide valuable insights based on points earned, fastest lap speed and fastest lap time in each race. If you are new to the world of F1 and curious about it, I would highly recommend watching “Formula 1: Drive to Survive” available on Netflix.
Let’s load the required packages in R.
####################
# Load libraries
####################
library(tidyverse)
library(dplyr)
library(lubridate)
library(magrittr)
library(ggfortify)
####################
# Clear environment
####################
rm(list = ls());
The F1 dataset on Kaggle contains multiple CSV files with varied information. To gauge constructors’ performance, we will require only 3 CSV files— races.csv, constructors.csv, and results.csv.
# load files
# note: read.csv2 assumes ',' as the decimal mark, so columns containing '.' decimals
# are read in as character; we convert them to numeric types later
races <- read.csv2("./data/races.csv", sep = ',', na.strings = "\\N");
constructors <- read.csv2('./data/constructors.csv', sep = ',', na.strings = "\\N");
results <- read.csv2('./data/results.csv', sep = ',', na.strings = "\\N");
In the above code, na.strings specifies the string that should be read as NA. The results.csv file contains each constructor’s results for both of its drivers in every race from 1950 to 2022. The code below selects the required columns from results.csv, merges in the Grand Prix name and year and the constructor’s name from the other 2 CSV files, and filters the data for the period 2014 to 2021.
# select the attributes to use
data <- results %>%
  select(raceId, constructorId, points, fastestLapTime, fastestLapSpeed);
# add the year and name of the grand prix based on raceId
data <- data %>%
  inner_join(races[, c("raceId", "year", "name")], by = "raceId") %>%
  rename(grandPrix = `name`) %>%
  select(-c("raceId"));
# add the constructor's name based on constructorId
data <- data %>%
  inner_join(constructors[, c("constructorId", "name")], by = "constructorId") %>%
  rename(constructorName = `name`) %>%
  select(-c("constructorId"));
# select race data between 2014 and 2021
data <- data[data$year >= 2014 & data$year <= 2021, ];
head(data)
## points fastestLapTime fastestLapSpeed year grandPrix
## 22128 25 1:32.478 206.436 2014 Australian Grand Prix
## 22129 18 1:33.066 205.131 2014 Australian Grand Prix
## 22130 15 1:32.917 205.460 2014 Australian Grand Prix
## 22131 12 1:33.186 204.867 2014 Australian Grand Prix
## 22132 10 1:32.616 206.128 2014 Australian Grand Prix
## 22133 8 1:32.568 206.235 2014 Australian Grand Prix
## constructorName
## 22128 Mercedes
## 22129 McLaren
## 22130 McLaren
## 22131 Ferrari
## 22132 Williams
## 22133 Force India
The final data frame, data, contains the following 6 attributes:
points— points scored in the race.
fastestLapTime— the driver’s fastest lap time in the race.
fastestLapSpeed— the driver’s fastest lap speed in the race.
year— the year of the Grand Prix.
grandPrix— the name of the Grand Prix.
constructorName— the name of the constructor.
Constructor team names have changed multiple times over the years. Hence, we replace names that represent the same constructor with its most recent name.
# create a new variable to store the analysis results
dataAnalyzed <- data.frame(data);
replaceConstructorName <- c('Force India' = "Aston Martin", 'Racing Point' = "Aston Martin",
                            'Renault' = "Alpine F1 Team", 'Lotus F1' = "Alpine F1 Team",
                            'Toro Rosso' = "AlphaTauri", 'Sauber' = "Alfa Romeo",
                            'Haas F1 Team' = "Marussia", 'Haas F1 Team' = "Manor Marussia");
dataAnalyzed$constructorName <- str_replace_all(dataAnalyzed$constructorName, replaceConstructorName)
To perform a fair comparison between constructors, we select only the constructors that participated every year from 2014 to 2021. Furthermore, we select only those Grand Prix circuits that held a race every year from 2014 to 2021.
# use constructors present in all years for comparison
numYears <- unique(dataAnalyzed$year);
commonConstructors <- dataAnalyzed[dataAnalyzed$year == numYears[1], c("constructorName")];
for (i in 2:length(numYears)) {
  temp <- dataAnalyzed[dataAnalyzed$year == numYears[i], c("constructorName")];
  commonConstructors <- intersect(commonConstructors, temp);
}
commonConstructors
## [1] "Mercedes" "McLaren" "Ferrari" "Williams"
## [5] "Aston Martin" "AlphaTauri" "Alfa Romeo" "Alpine F1 Team"
## [9] "Red Bull"
# select only the common constructors
dataAnalyzed <- dataAnalyzed[dataAnalyzed$constructorName %in% commonConstructors, ];
# since different circuits can affect a team's performance, we keep only those
# circuits that hosted a race every year from 2014 to 2021
commonPrix <- dataAnalyzed[dataAnalyzed$year == numYears[1], c("grandPrix")];
for (i in 2:length(numYears)) {
  temp <- dataAnalyzed[dataAnalyzed$year == numYears[i], c("grandPrix")];
  commonPrix <- intersect(commonPrix, temp);
}
commonPrix
## [1] "Bahrain Grand Prix" "Spanish Grand Prix" "Austrian Grand Prix"
## [4] "British Grand Prix" "Hungarian Grand Prix" "Belgian Grand Prix"
## [7] "Italian Grand Prix" "Russian Grand Prix" "Abu Dhabi Grand Prix"
# select only the common circuits
dataAnalyzed <- dataAnalyzed[dataAnalyzed$grandPrix %in% commonPrix, ];
Let’s check for missing values in the data.
# create a new variable to store the preprocessing results.
dataPreprocessed <- data.frame(dataAnalyzed);
# check if there are missing values in the data
sapply(dataPreprocessed, function(x) sum(is.na(x)))
## points fastestLapTime fastestLapSpeed year grandPrix
## 0 81 81 0 0
## constructorName
## 0
There are 81 rows with missing values in columns fastestLapSpeed and fastestLapTime. We simply remove the rows with missing values and confirm the removal.
# removing rows with NA
dataPreprocessed <- na.omit(dataPreprocessed);
# confirm removal of NA
sapply(dataPreprocessed, function(x) sum(is.na(x)))
## points fastestLapTime fastestLapSpeed year grandPrix
## 0 0 0 0 0
## constructorName
## 0
Let’s check the data types for the columns.
glimpse(dataPreprocessed)
## Rows: 1,215
## Columns: 6
## $ points <chr> "25", "18", "15", "12", "10", "8", "6", "4", "2", "1",…
## $ fastestLapTime <chr> "1:37.108", "1:37.020", "1:39.320", "1:39.269", "1:38.…
## $ fastestLapSpeed <chr> "200.634", "200.816", "196.165", "196.266", "197.228",…
## $ year <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
## $ grandPrix <chr> "Bahrain Grand Prix", "Bahrain Grand Prix", "Bahrain G…
## $ constructorName <chr> "Mercedes", "Mercedes", "Aston Martin", "Red Bull", "A…
To process the data for clustering, we need to make the following changes to the data types:
1. Convert points from character to integer.
2. Convert fastestLapTime from a character string in the minutes:seconds.milliseconds format to a numeric value in milliseconds.
3. Convert fastestLapSpeed from character to numeric.
4. Convert constructorName and grandPrix from character to factor.
# change column types
# 1. points: character to integer
dataPreprocessed$points <- sapply(dataPreprocessed$points, as.integer);
# 2. fastestLapTime: "minutes:seconds" string to milliseconds
my_options <- options(digits.secs = 3);  # modify and save the default global options
dataPreprocessed <- dataPreprocessed %>%
  mutate(temp = strptime(fastestLapTime, format = "%M:%OS")) %>%
  select(-c(fastestLapTime)) %>%
  mutate(fastestLapTime = minute(temp) * 60 * 1000 + second(temp) * 1000) %>%
  select(-c(temp));
# 3. fastestLapSpeed: character to numeric
dataPreprocessed$fastestLapSpeed <- as.numeric(dataPreprocessed$fastestLapSpeed);
# 4. constructorName and grandPrix: character to factor
# (lapply, rather than sapply, preserves the factor class of each column)
chr2factorCols <- c("constructorName", "grandPrix");
dataPreprocessed[chr2factorCols] <- lapply(dataPreprocessed[chr2factorCols], as.factor);
# check the data attributes and types after processing
glimpse(dataPreprocessed)
## Rows: 1,215
## Columns: 6
## $ points <int> 25, 18, 15, 12, 10, 8, 6, 4, 2, 1, 0, 0, 0, 0, 0, 0, 0…
## $ fastestLapSpeed <dbl> 200.634, 200.816, 196.165, 196.266, 197.228, 196.181, …
## $ year <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
## $ grandPrix <fct> Bahrain Grand Prix, Bahrain Grand Prix, Bahrain G…
## $ constructorName <fct> Mercedes, Mercedes, Aston Martin, Red Bull, A…
## $ fastestLapTime <dbl> 97108, 97020, 99320, 99269, 98785, 99312, 99272, 99762…
We will now cluster constructors based on three features— points, fastestLapTime, and fastestLapSpeed. To do this, the following steps are performed:
1. For every year and constructor, aggregate the race results into the total points, the minimum fastestLapSpeed, and the maximum fastestLapTime.
2. For every constructor, average these yearly aggregates over 2014 to 2021 to obtain a single row of features per constructor.
# yearly aggregates per constructor
intermediateResults <- dataPreprocessed %>%
  select(c(year, points, fastestLapSpeed, fastestLapTime, constructorName)) %>%
  group_by(year, constructorName) %>%
  summarise(totalPoints = sum(points),
            min_fastestLapSpeed = min(fastestLapSpeed),
            max_fastestLapTime = max(fastestLapTime),
            .groups = 'drop') %>%
  as.data.frame();
head(intermediateResults)
## year constructorName totalPoints min_fastestLapSpeed max_fastestLapTime
## 1 2014 Alfa Romeo 0 174.300 114000
## 2 2014 AlphaTauri 7 176.414 114159
## 3 2014 Alpine F1 Team 4 148.352 115649
## 4 2014 Aston Martin 82 155.921 114532
## 5 2014 Ferrari 98 179.257 114090
## 6 2014 McLaren 80 176.898 114203
inputData<- intermediateResults%>%
group_by(constructorName) %>%
summarise(avg_totalPoints = mean(totalPoints),
avg_minFastestLapSpeed = mean(min_fastestLapSpeed),
avg_maxFastestLapTime = mean(max_fastestLapTime)) %>%
as.data.frame()%>%
set_rownames(.$constructorName) %>%
select(-c(constructorName)) ;
inputData
## avg_totalPoints avg_minFastestLapSpeed avg_maxFastestLapTime
## Alfa Romeo 7.875 185.2259 112947.6
## AlphaTauri 28.125 183.2430 113121.6
## Alpine F1 Team 40.750 184.4110 110569.8
## Aston Martin 61.250 184.7014 109880.0
## Ferrari 173.125 189.6946 110108.8
## McLaren 61.500 182.4089 110111.6
## Mercedes 309.875 192.0040 108147.2
## Red Bull 158.500 189.4468 109780.0
## Williams 55.375 186.9545 110388.0
Let’s analyze the constructor teams’ performance based on intermediateResults. The code below plots the density of min_fastestLapSpeed across the years for each constructor.
ggplot(data = intermediateResults, aes(x = min_fastestLapSpeed))+
geom_density() +
facet_wrap(~constructorName);
Teams such as Mercedes, Red Bull, and Ferrari show consistently high performance, as is evident from their heavy-tailed distributions. The rest of the teams show indications of more variable performance. It won’t be a surprise if the Hierarchical clustering algorithm groups similarly performing teams together.
Let’s get to clustering now! The data frame inputData contains three features for each constructor, so it is finally time to cluster the data. Clustering involves three easy steps in R:
1. Scale the features so that each one contributes equally to the distance calculation.
2. Compute the distance matrix between constructors with dist().
3. Build the hierarchy of clusters with hclust().
The code below shows the implementation of these 3 steps. For clustering the constructors’ performance, we will be using Euclidean distance and Average linkage.
# feature scaling
inputData <- inputData %>%
  scale() %>%
  as.data.frame();
# calculate the distance matrix and the hierarchical clusters
distance_mat <- dist(inputData, method = 'euclidean');
hc <- hclust(distance_mat, method = "average");
Let’s visualize the clusters using a dendrogram and a scatter plot. To make the scatter plot, we use prcomp() to project the three features onto a two-dimensional space.
# plot dendrogram
dend <- as.dendrogram(hc);
plot(dend);
# plot a scatter plot of the first two principal components
pca <- prcomp(inputData);
ggplot(data = as.data.frame(pca$x), mapping = aes(x = PC1, y = PC2)) +
  geom_point() +
  geom_text(aes(label = rownames(pca$x)), vjust = -1) +
  ylim(c(-1, 1.2)) +
  xlim(c(-2.2, 3.5))
The clusters shown in the dendrogram can be verified using the scatter plot. For instance, the pairs Red Bull and Ferrari, and AlphaTauri and Alfa Romeo, show similar performance, as seen from their early merging. Mercedes is merged into the clustering last, as can be confirmed by its distance from the other constructors on the scatter plot.
By gauging the shape of the dendrogram, one might be tempted to draw a horizontal line at around 0.75 on the y-axis, creating 5 clusters that indicate 5 levels of performance during the era of hybrid engines. The formed clusters give quite an accurate picture of F1 team performance from 2014 to 2021, with Group 1 leading the Championship for 7 consecutive years and performance decreasing from Group 1 to Group 5 (see Figure below).
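To extract these groups programmatically rather than reading them off the plot, we can cut the fitted tree; a minimal sketch using the hc object created above (the exact group memberships come from the fitted dendrogram):
# cut the dendrogram into 5 performance groups
performanceGroups <- cutree(hc, k = 5);
sort(performanceGroups);            # constructor-to-group assignments
# alternatively, cut at the height of roughly 0.75 suggested by the dendrogram
cutree(hc, h = 0.75);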
Yay! We have successfully implemented Hierarchical Clustering. There are a few more steps that one can perform to improve the results, such as trying other distance measures and linkage methods and comparing the resulting dendrograms, or validating the chosen number of clusters with methods such as the elbow and average silhouette methods (a brief silhouette sketch is given below).
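As one example of such a validation step, here is a minimal sketch of the average silhouette width for the 5-cluster solution, assuming the cluster package (which ships with standard R installations) and reusing hc and distance_mat from above:
# validate the 5-cluster solution with silhouette widths (cluster package)
library(cluster);
sil <- silhouette(cutree(hc, k = 5), distance_mat);
summary(sil);       # per-cluster and overall average silhouette widths
plot(sil);          # silhouette plot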
Hope you had fun while learning how to implement Hierarchical Clustering!