Clustering of Airline Fleet Project - Unsupervised Learning

1 Introduction

The objective of this paper is to analyse the clustering of airline fleet data. My interest in aircraft was first piqued in primary school, and I have been employed by an airline for over two years, which motivated me to look for data from this field.

More specifically, the paper examines the characteristics of airline fleets and identifies the similarities and differences between them. Airlines are highly diverse, and the composition of their fleets depends largely on their business profiles. Some regional carriers use smaller, more cost-effective aircraft, while others combine narrow-body and wide-body aircraft to serve both short-haul and long-haul routes. Finally, some airlines, particularly in the Middle East, specialise in luxurious long-distance travel. The number of aircraft is as important as their type. For instance, the largest low-cost airlines in Europe use only narrow-body aircraft, while smaller carriers operate the same type of plane in smaller numbers. With regard to regionality, it is commonly held in the business community that, because of the relatively high population density in most Asian countries, airlines there tend to use wide-body aircraft on domestic flights. Conversely, airlines based in the United States have developed diverse fleets owing to the vast distances between population centres and the lack of railway infrastructure. The largest traditional European airlines have a comparable fleet structure, albeit with smaller fleets. The aim of this article is to ascertain whether this general business knowledge is reflected in the data.

2 Preparation of the dataset

2.1 Importing data

The data set analysed in this article comes from Kaggle (https://www.kaggle.com/datasets/traceyvanp/airlinefleet). It contains information on over 100 airlines, collated in January 2017. The original file comprises eleven columns. Nine of them are described below; the remaining two hold the historic and total aircraft quantities:

Parent Airline
Airline
Aircraft Type: Manufacturer & Model
Current: Quantity of airplanes in Operation
Future: Quantity of planned airplanes
Order: Quantity of airplanes on order
Unit Cost: Average unit cost ($M) of Aircraft Type
Total Cost: Current quantity * Unit Cost ($M)
Average Age: Average age of “Current” airplanes by “Aircraft Type”

# Importing necessary libraries for whole project
library(knitr)
library(dplyr)
library(factoextra)
library(cluster)
library(ClusterR)
library(ggplot2)
library(gridExtra)
library(ggpubr)
library(ggrepel) 
library(ClustGeo)
library(ape)

fleet_raw <- read.csv("Fleet Data.csv")

colnames(fleet_raw) <- c("Airline", "Airline_Company", "Aircraft", "Current", 
                  "Future", "Ordered", "Historic", "Total", "Unit_Value", "Currents_Value", "Avg_Age")

kable(head(fleet_raw[98:103,]))

	Airline	Airline_Company	Aircraft	Current	Future	Ordered	Historic	Total	Unit_Value	Currents_Value	Avg_Age
98	Air Berlin	Air Berlin	McDonnell Douglas MD-80	NA	NA	1	1	NA	$45	$0	NA
99	Air Canada	Air Canada	Airbus A319	15	NA	33	48	NA	$90	$1,344	18.9
100	Air Canada	Air Canada Jetz	Airbus A319	3	NA	2	5	NA	$90	$269	18.7
101	Air Canada	Air Canada Rouge	Airbus A319	20	NA	NA	20	NA	$90	$1,792	18.6
102	Air Canada	Air Canada	Airbus A320	42	NA	13	55	NA	$98	$4,116	23.3
103	Air Canada	Air Canada Jetz	Airbus A320	NA	NA	5	5	NA		$0	NA

2.2 Cleaning and transforming data

To focus the analysis on airlines as a whole rather than on their subsidiaries or individual aircraft types, I removed the Airline_Company and Aircraft columns.

fleet_raw$Airline_Company <- NULL
fleet_raw$Aircraft <- NULL

kable(head(fleet_raw[98:103,]))

	Airline	Current	Future	Ordered	Historic	Total	Unit_Value	Currents_Value	Avg_Age
98	Air Berlin	NA	NA	1	1	NA	$45	$0	NA
99	Air Canada	15	NA	33	48	NA	$90	$1,344	18.9
100	Air Canada	3	NA	2	5	NA	$90	$269	18.7
101	Air Canada	20	NA	NA	20	NA	$90	$1,792	18.6
102	Air Canada	42	NA	13	55	NA	$98	$4,116	23.3
103	Air Canada	NA	NA	5	5	NA		$0	NA

In the next step I checked the columns for missing values. I expected many of them, because each row describes a single aircraft type in a single airline's fleet, and the four quantity columns rarely all apply to it. I also checked the exact number of rows in the table.

colSums(is.na(fleet_raw) | fleet_raw == "")

##        Airline        Current         Future        Ordered       Historic 
##              0            724           1395            470             99 
##          Total     Unit_Value Currents_Value        Avg_Age 
##           1235             35             27            763

nrow(fleet_raw)

## [1] 1583
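The missing-value check above counts a cell as missing when it is either NA or an empty string. A minimal sketch of how that expression behaves, using a small hypothetical data frame (the values and the name toy are illustrative assumptions, not rows from the fleet file):

```r
# Toy data frame (hypothetical values) mixing numeric and character columns
toy <- data.frame(
  Current    = c(15, NA, 20),
  Unit_Value = c("$90", "", "$98"),
  stringsAsFactors = FALSE
)

# is.na() flags true NAs; the comparison with "" flags empty strings
missing_mask <- is.na(toy) | toy == ""

# Column-wise count of cells that are NA or empty
missing_counts <- colSums(missing_mask)
print(missing_counts)
```

Each column of the toy table has exactly one cell that is NA or empty, so both counts come out as 1.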

It is evident that there is a significant dearth of data pertaining to the average age of aircraft. Given the inherent difficulty in accurately estimating this parameter due to the lengthy production cycles of certain aircraft types, I have elected to exclude this column from the analysis.

fleet_raw$Avg_Age <- NULL

Furthermore, I consolidated the Future and Ordered categories into a single column and removed rows lacking a unit value, so that no airline would end up without an average aircraft value.

fleet_raw <- fleet_raw %>%
  filter(!is.na(Unit_Value) & Unit_Value != "")
fleet_raw$Future <- fleet_raw$Future + fleet_raw$Ordered
fleet_raw$Ordered <- NULL
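One point worth keeping in mind when adding two columns that contain missing values: in R, any arithmetic involving NA returns NA. The sketch below illustrates this on hypothetical quantities and shows one possible alternative, filling missing counts with zero before adding. The helper zero_if_na is an assumption introduced for illustration only and is not part of the analysis.

```r
# Hypothetical quantities for three fleet rows
future  <- c(NA, 5, NA)
ordered <- c(33, NA, NA)

# Plain addition: any NA operand makes the result NA
naive_sum <- future + ordered

# Alternative (an assumption, not the approach used in this analysis):
# treat missing counts as zero before adding
zero_if_na <- function(x) ifelse(is.na(x), 0, x)
filled_sum <- zero_if_na(future) + zero_if_na(ordered)

print(naive_sum)   # NA NA NA
print(filled_sum)  # 33  5  0
```

Which behaviour is preferable depends on whether a missing count should be read as "none" or as "unknown".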

The NA values in the Current, Future, and Historic columns were replaced with zero. Additionally, the Total column was recalculated as the sum of the Current, Future, and Historic values.

fleet_raw$Current[is.na(fleet_raw$Current)] <- 0
fleet_raw$Future[is.na(fleet_raw$Future)] <- 0
fleet_raw$Historic[is.na(fleet_raw$Historic)] <- 0

fleet_raw$Total <- fleet_raw$Current + fleet_raw$Future + fleet_raw$Historic
kable(head(fleet_raw[96:100,]))

	Airline	Current	Future	Historic	Total	Unit_Value	Currents_Value
96	Air Berlin	0	0	18	18	$20	$0
97	Air Berlin	0	0	1	1	$45	$0
98	Air Canada	15	0	48	63	$90	$1,344
99	Air Canada	3	0	5	8	$90	$269
100	Air Canada	20	0	20	40	$90	$1,792

At this point I aggregated the table to the level of Airline and removed the Currents_Value column, as it is derived from both Unit_Value and Current. I also needed to strip the dollar sign from Unit_Value in order to treat it as numeric. Current, Future, Historic and Total were calculated as sums over the aggregated rows, while Unit_Value was replaced by four columns:

Avg_Value_Curr,
Avg_Value_Future,
Avg_Value_Hist,
Avg_Value_Ttl,

each calculated as a quantity-weighted mean for the corresponding group of aircraft.
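Converting currency strings to numbers can be sketched as follows. This is a hedged alternative that also removes thousands separators, which matters for strings such as "$1,344"; the input vector is hypothetical.

```r
# Hypothetical currency strings in the style of Unit_Value and Currents_Value
raw_values <- c("$45", "$1,344", "")

# Remove the dollar sign and thousands separators, then convert;
# an empty string becomes NA
parsed_values <- as.numeric(gsub("[$,]", "", raw_values))
print(parsed_values)
```

A pattern-based replacement does not depend on the dollar sign being the first character, at the cost of silently accepting strings with commas in unexpected places.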

fleet_raw$Currents_Value <- NULL
fleet_raw$Unit_Value <- as.numeric(substr(fleet_raw$Unit_Value, 2, nchar(fleet_raw$Unit_Value)))
colnames(fleet_raw) <- c( "Airline", "Current", "Future", "Historic", "Total", "Unit_Value_Mln_USD")

# Calculating sums for each aircraft category
fleet_raw <- fleet_raw %>%
  mutate(
    Value_Current = Unit_Value_Mln_USD * Current,
    Value_Future = Unit_Value_Mln_USD * Future,
    Value_Historic = Unit_Value_Mln_USD * Historic,
    Value_Total = Unit_Value_Mln_USD * Total
  )

fleet_raw$Unit_Value_Mln_USD <- NULL

# Grouping by airline to aggregate

fleet_raw<- fleet_raw %>%
  group_by(Airline) %>%
  summarise(
    Current = sum(Current, na.rm = TRUE),
    Future = sum(Future, na.rm = TRUE),
    Historic = sum(Historic, na.rm = TRUE),
    Total = sum(Total, na.rm = TRUE),
    
    Value_Current = sum(Value_Current, na.rm = TRUE),
    Value_Future = sum(Value_Future, na.rm = TRUE),
    Value_Historic = sum(Value_Historic, na.rm = TRUE),
    Value_Total = sum(Value_Total, na.rm = TRUE)
  ) %>%

  ungroup()

# Calculating average aircraft values for each group
fleet_raw <- fleet_raw %>%
  mutate(
    Avg_Value_Curr = replace(Value_Current / Current, is.nan(Value_Current / Current), 0),
    Avg_Value_Future = replace(Value_Future / Future, is.nan(Value_Future / Future), 0),
    Avg_Value_Hist = replace(Value_Historic / Historic, is.nan(Value_Historic / Historic), 0),
    Avg_Value_Ttl = replace(Value_Total / Total, is.nan(Value_Total / Total), 0)
  ) %>%
  select(Airline, Current, Future, Historic, Total, 
         Avg_Value_Curr, Avg_Value_Future, 
         Avg_Value_Hist, Avg_Value_Ttl)


kable(head(fleet_raw[37:46,]))

Airline	Current	Future	Historic	Total	Avg_Value_Curr	Avg_Value_Future	Avg_Value_Hist	Avg_Value_Ttl
El Al	49	0	114	163	151.77551	0	180.66667	171.98160
Emirates	249	15	351	615	342.56225	295	302.78917	318.70244
Ethiopian Airlines	82	0	128	210	158.98780	0	146.12500	151.14762
Etihad Airways	124	12	161	297	227.61290	240	224.00000	226.15488
FedEx Express	696	0	688	1384	70.53017	0	113.66424	91.97254
Finnair	71	0	191	262	122.46479	0	87.96859	97.31679
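The average values above are quantity-weighted means: total value divided by total quantity. A small worked example with hypothetical quantities and unit values shows that this matches base R's weighted.mean():

```r
# Hypothetical fleet rows for one airline: quantity and unit value ($M)
quantity   <- c(15, 3, 42)
unit_value <- c(90, 90, 98)

# Weighted mean as total value divided by total quantity
avg_value <- sum(quantity * unit_value) / sum(quantity)

# Base R's weighted.mean() gives the same result
avg_value_check <- weighted.mean(unit_value, w = quantity)

print(avg_value)  # 95.6
```

Here the total value is 5736 and the total quantity is 60, so the weighted average is 95.6, closer to the unit value of the most numerous type than a simple mean of the three unit values would be.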

The next step of the analysis was to print a summary of every column with the most important statistical measures.

summary(fleet_raw)

##    Airline             Current         Future          Historic     
##  Length:113         Min.   :  17   Min.   :  0.00   Min.   :  45.0  
##  Class :character   1st Qu.:  63   1st Qu.:  0.00   1st Qu.: 113.0  
##  Mode  :character   Median : 105   Median :  0.00   Median : 195.0  
##                     Mean   : 182   Mean   : 20.29   Mean   : 326.3  
##                     3rd Qu.: 206   3rd Qu.: 26.00   3rd Qu.: 367.0  
##                     Max.   :1410   Max.   :319.00   Max.   :2679.0  
##      Total        Avg_Value_Curr   Avg_Value_Future Avg_Value_Hist  
##  Min.   :  71.0   Min.   : 28.44   Min.   :  0.0    Min.   : 21.44  
##  1st Qu.: 173.0   1st Qu.: 78.33   1st Qu.:  0.0    1st Qu.: 74.82  
##  Median : 327.0   Median :103.19   Median :  0.0    Median : 98.34  
##  Mean   : 528.7   Mean   :115.07   Mean   : 54.6    Mean   :108.00  
##  3rd Qu.: 588.0   3rd Qu.:134.74   3rd Qu.: 98.0    3rd Qu.:124.19  
##  Max.   :4139.0   Max.   :342.56   Max.   :314.2    Max.   :302.79  
##  Avg_Value_Ttl   
##  Min.   : 24.79  
##  1st Qu.: 78.22  
##  Median :100.58  
##  Mean   :109.89  
##  3rd Qu.:126.96  
##  Max.   :318.70

The future aircraft data are clearly sparse: the median of both Future and Avg_Value_Future is zero. To keep the core data set intact, I retained these two columns but did not use them directly in the subsequent analysis.

3 Clustering

3.1 Hopkins Statistic

In order to commence the clustering process, it was deemed appropriate to undertake a Hopkins Statistic Test on three distinct groups of data: Current, Historic and Total.

# Data preparation for Current & Avg_Value_Curr
data_for_clustering <- fleet_raw %>%
  select(Current, Avg_Value_Curr)
data_normalized <- scale(data_for_clustering)

# Clustering tendency for Current & Avg_Value_Curr
clust_tendency <- get_clust_tendency(data_normalized, n = 2, graph = FALSE, 
                                     gradient = list(low = "blue", high = "white"), 
                                     seed = 123)
print(paste("Hopkins Statistic for Current & Avg_Value_Curr:", clust_tendency$hopkins_stat))

## [1] "Hopkins Statistic for Current & Avg_Value_Curr: 0.959875188336194"

plot1 <- fviz_dist(
  dist(data_normalized), 
  show_labels = FALSE, 
  gradient = list(low = "blue", mid = "white", high = "red")
) +
  labs(title = "Dissimilarity: Current & Avg_Value_Curr") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 8),
    axis.text = element_blank(),
    legend.position = "bottom"
  )

# Data preparation for Total & Avg_Value_Ttl
data_for_clustering_ttl <- fleet_raw %>%
  select(Total, Avg_Value_Ttl)
data_normalized_ttl <- scale(data_for_clustering_ttl)

# Clustering tendency for Total & Avg_Value_Ttl
clust_tendency_ttl <- get_clust_tendency(data_normalized_ttl, n = 3, graph = FALSE, 
                                         gradient = list(low = "blue", high = "white"), 
                                         seed = 123)
print(paste("Hopkins Statistic for Total & Avg_Value_Ttl:", clust_tendency_ttl$hopkins_stat))

## [1] "Hopkins Statistic for Total & Avg_Value_Ttl: 0.93814424475394"

plot2 <- fviz_dist(
  dist(data_normalized_ttl), 
  show_labels = FALSE, 
  gradient = list(low = "blue", mid = "white", high = "red")
) +
  labs(title = "Dissimilarity: Total & Avg_Value_Ttl") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 8),
    axis.text = element_blank(),
    legend.position = "bottom"
  )

# Data preparation for Historic & Avg_Value_Hist
data_for_clustering_hist <- fleet_raw %>%
  select(Historic, Avg_Value_Hist)
data_normalized_hist <- scale(data_for_clustering_hist)

# Clustering tendency for Historic & Avg_Value_Hist
clust_tendency_hist <- get_clust_tendency(data_normalized_hist, n = 2, graph = FALSE, 
                                          gradient = list(low = "blue", high = "white"), 
                                          seed = 123)
print(paste("Hopkins Statistic for Historic & Avg_Value_Hist:", clust_tendency_hist$hopkins_stat))

## [1] "Hopkins Statistic for Historic & Avg_Value_Hist: 0.943784621539787"

plot3 <- fviz_dist(
  dist(data_normalized_hist), 
  show_labels = FALSE, 
  gradient = list(low = "blue", mid = "white", high = "red")
) +
  labs(title = "Dissimilarity: Historic & Avg_Value_Hist") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 8),
    axis.text = element_blank(),
    legend.position = "bottom"
  )

# Arrange the three plots side by side
grid.arrange(plot1, plot2, plot3, ncol = 3)

All three pairs of variables yield Hopkins statistics above 0.5, the threshold conventionally taken to indicate that data are suitable for clustering. Moreover, the statistic exceeds 0.9 for each data category, which suggests that clustering is a viable approach for this dataset.
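To make the idea behind the statistic concrete, the following is a minimal base-R sketch of a Hopkins-style calculation on synthetic data. It compares nearest-neighbour distances of uniformly generated points (u) with those of sampled real points (w). The function name hopkins_sketch is hypothetical, and conventions differ between implementations (for example in the use of exponents or in which direction indicates clustering), so it should not be expected to reproduce the values returned by get_clust_tendency.

```r
# Minimal Hopkins-style statistic for a numeric matrix (illustrative only)
hopkins_sketch <- function(x, n, seed = 123) {
  set.seed(seed)
  x <- as.matrix(x)
  idx <- sample(nrow(x), n)

  # w: distance from each sampled real point to its nearest other real point
  d_real <- as.matrix(dist(x))
  diag(d_real) <- Inf
  w <- apply(d_real[idx, , drop = FALSE], 1, min)

  # u: distance from n uniform random points (inside the data's bounding box)
  # to their nearest real point
  rand <- sapply(seq_len(ncol(x)), function(j) runif(n, min(x[, j]), max(x[, j])))
  rand <- matrix(rand, nrow = n)
  u <- apply(rand, 1, function(p) min(sqrt(colSums((t(x) - p)^2))))

  # Values near 1 suggest clustered data under this convention
  sum(u) / (sum(u) + sum(w))
}

# Two tight, well-separated blobs (synthetic data)
set.seed(1)
blobs <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)
h_blobs <- hopkins_sketch(blobs, n = 10)
print(h_blobs)
```

For strongly clustered data such as these blobs, random points tend to fall far from any observation while real points have close neighbours, which pushes the ratio toward 1. With very small sample sizes n the statistic can vary considerably from one seed to another.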

3.2 Adjusting number of clusters

3.2.1 For K-Means

3.2.1.1 NbClust

As a preliminary approach to identifying the optimal number of clusters, I employed the fviz_nbclust function from the factoextra package, as demonstrated in the course material. For all three datasets, I utilised the “silhouette” method.

# Data preparation for Current and Avg_Value_Curr
data_for_clustering_1 <- fleet_raw %>%
  select(Current, Avg_Value_Curr)
data_normalized_1 <- scale(data_for_clustering_1)

# Silhouette Method for Current and Avg_Value_Curr
km1s_1 <- fviz_nbclust(data_normalized_1, kmeans, method = "silhouette") + 
  ggtitle("Silhouette Method: Current & Avg_Value_Curr") +
  theme(
    plot.title = element_text(size = 8),  # Decrease font size of the title
    axis.text = element_text(size = 6)    # Decrease font size of axis labels
  )

# Data preparation for Total and Avg_Value_Ttl
data_for_clustering_2 <- fleet_raw %>%
  select(Total, Avg_Value_Ttl)
data_normalized_2 <- scale(data_for_clustering_2)

# Silhouette Method for Total and Avg_Value_Ttl
km1s_2 <- fviz_nbclust(data_normalized_2, kmeans, method = "silhouette") + 
  ggtitle("Silhouette Method: Total & Avg_Value_Ttl") +
  theme(
    plot.title = element_text(size = 8),  # Decrease font size of the title
    axis.text = element_text(size = 6)    # Decrease font size of axis labels
  )

# Data preparation for Historic and Avg_Value_Hist
data_for_clustering_3 <- fleet_raw %>%
  select(Historic, Avg_Value_Hist)
data_normalized_3 <- scale(data_for_clustering_3)

# Silhouette Method for Historic and Avg_Value_Hist
km1s_3 <- fviz_nbclust(data_normalized_3, kmeans, method = "silhouette") + 
  ggtitle("Silhouette Method: Historic & Avg_Value_Hist") +
  theme(
    plot.title = element_text(size = 8),  # Decrease font size of the title
    axis.text = element_text(size = 6)    # Decrease font size of axis labels
  )

# Arrange the plots side by side using grid.arrange
grid.arrange(km1s_1, km1s_2, km1s_3, ncol = 3)

For both the current and historic data, k=3 was identified as the optimal number of clusters, while for the total data set the optimum was k=2. In all three cases, however, the difference between k=2 and k=3 was relatively minor. For the total and current data sets, the average silhouette width for both k=2 and k=3 is approximately 0.6, while for the historic data set it lies between 0.4 and 0.5. This suggests that the clustering of the historic data may be of lower quality, although its silhouette values still make it a reasonable approximation.
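The silhouette method can also be reproduced by hand, which makes it easier to see what the plots summarise. The sketch below is an illustration on synthetic data, not the fleet data: it runs k-means for several values of k and computes the average silhouette width with silhouette() from the cluster package. The blob centres and spreads are arbitrary assumptions.

```r
library(cluster)  # silhouette()

# Three well-separated synthetic blobs in two dimensions
set.seed(42)
centers <- rbind(c(0, 0), c(6, 0), c(3, 6))
pts <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, centers[i, 1], 0.4), rnorm(30, centers[i, 2], 0.4))
}))
pts <- scale(pts)
d <- dist(pts)

# Average silhouette width for k = 2..6
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(pts, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# The k with the highest average silhouette width is taken as optimal
best_k <- ks[which.max(avg_sil)]
print(round(avg_sil, 2))
print(best_k)
```

Using several random starts (nstart) reduces the chance that a poor k-means initialisation distorts the comparison between values of k.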

3.2.1.2 Optimal Clusters

As an alternative approach, the Optimal_Clusters_KMeans function from the ClusterR package was employed. The criterion selected for all three data types was “silhouette.”

# Data preparation and scaling for Current & Avg_Value_Curr
data_for_clustering1 <- fleet_raw %>%
  select(Current, Avg_Value_Curr)
data_normalized1 <- scale(data_for_clustering1)

# Optimal cluster number analysis for Current & Avg_Value_Curr
plot1 <- Optimal_Clusters_KMeans(
  data_normalized1,
  max_clusters = 10, 
  plot_clusters = TRUE,
  criterion = "silhouette"
)

# Data preparation and scaling for Total & Avg_Value_Ttl
data_for_clustering2 <- fleet_raw %>%
  select(Total, Avg_Value_Ttl)
data_normalized2 <- scale(data_for_clustering2)

# Optimal cluster number analysis for Total & Avg_Value_Ttl
plot2 <- Optimal_Clusters_KMeans(
  data_normalized2,
  max_clusters = 10, 
  plot_clusters = TRUE,
  criterion = "silhouette"
)

# Data preparation and scaling for Historic & Avg_Value_Hist
data_for_clustering3 <- fleet_raw %>%
  select(Historic, Avg_Value_Hist)
data_normalized3 <- scale(data_for_clustering3)

# Optimal cluster number analysis for Historic & Avg_Value_Hist
plot3 <- Optimal_Clusters_KMeans(
  data_normalized3,
  max_clusters = 10, 
  plot_clusters = TRUE,
  criterion = "silhouette"
)

The results are broadly similar to those obtained with fviz_nbclust, although there are some discrepancies even though the silhouette criterion was used in both cases. Here, the silhouette value for all data groups is approximately 0.6 for both k=2 and k=3. For the current data, even k=4 and k=5 show promising values.

3.2.2 For PAM

To identify the optimal number of clusters for PAM, I again used the fviz_nbclust function, this time with the clustering function set to pam. All other parameters were kept as in the K-Means approach.

# Data preparation for Current and Avg_Value_Curr
data_for_clustering_1 <- fleet_raw %>%
  select(Current, Avg_Value_Curr)
data_normalized_1 <- scale(data_for_clustering_1)

# Silhouette Method for Current and Avg_Value_Curr
km1s_1 <- fviz_nbclust(data_normalized_1, pam, method = "silhouette") + 
  ggtitle("Silhouette Method: Current & Avg_Value_Curr") +
  theme(
    plot.title = element_text(size = 8),  # Decrease font size of the title
    axis.text = element_text(size = 6)    # Decrease font size of axis labels
  )

# Data preparation for Total and Avg_Value_Ttl
data_for_clustering_2 <- fleet_raw %>%
  select(Total, Avg_Value_Ttl)
data_normalized_2 <- scale(data_for_clustering_2)

# Silhouette Method for Total and Avg_Value_Ttl
km1s_2 <- fviz_nbclust(data_normalized_2, pam, method = "silhouette") + 
  ggtitle("Silhouette Method: Total & Avg_Value_Ttl") +
  theme(
    plot.title = element_text(size = 8),  # Decrease font size of the title
    axis.text = element_text(size = 6)    # Decrease font size of axis labels
  )

# Data preparation for Historic and Avg_Value_Hist
data_for_clustering_3 <- fleet_raw %>%
  select(Historic, Avg_Value_Hist)
data_normalized_3 <- scale(data_for_clustering_3)

# Silhouette Method for Historic and Avg_Value_Hist
km1s_3 <- fviz_nbclust(data_normalized_3, pam, method = "silhouette") + 
  ggtitle("Silhouette Method: Historic & Avg_Value_Hist") +
  theme(
    plot.title = element_text(size = 8),  # Decrease font size of the title
    axis.text = element_text(size = 6)    # Decrease font size of axis labels
  )

# Arrange the plots side by side using grid.arrange
grid.arrange(km1s_1, km1s_2, km1s_3, ncol = 3)

The outcomes yielded by PAM diverge somewhat from those obtained through K-Means. The difference between k=2 and k=3 is more pronounced for all data sets, with k=3 emerging as the optimal choice in every category. Here, the silhouette value for the current data set was approximately 0.6, while for the other two sets it was approximately 0.5.
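One reason PAM and K-Means can disagree is that a k-means centre is the mean of its cluster, whereas a PAM medoid must be an actual observation. The following sketch illustrates this with a single cluster of hypothetical fleet sizes containing one extreme value. It uses pam() from the cluster package, and the numbers are invented for illustration.

```r
library(cluster)  # pam()

# One-dimensional toy fleet sizes with a single extreme value (hypothetical)
fleet_size <- matrix(c(50, 55, 60, 65, 70, 75, 1400), ncol = 1)

# The centre of a single k-means cluster is the mean of its points,
# which is pulled toward the outlier
centroid <- mean(fleet_size)

# The PAM medoid must be one of the observed values
pam_medoid <- as.numeric(pam(fleet_size, k = 1)$medoids)

print(centroid)
print(pam_medoid)
```

The medoid stays among the typical values while the mean is dragged toward the extreme one, which is why PAM is generally considered less sensitive to outliers such as very large fleets.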

3.3 K-Means

Following an investigation into the optimal number of clusters, a decision was taken to commence the clustering process using the K-Means method. The analysis was conducted for k=3. Initially, the data were clustered based on the current data set. To facilitate the interpretation of the results, the airline names were added to the relevant points on the graph.

# Preparing and scaling the data
data_for_clustering <- fleet_raw %>%
  select(Current, Avg_Value_Curr)

data_for_clustering <- scale(data_for_clustering)

# Performing enhanced k-means clustering using eclust on scaled data
res.km <- eclust(data_for_clustering, "kmeans", hc_metric = "euclidean", k = 3, graph = FALSE)

# Creating the clustering plot without displaying it immediately
cluster_plot <- fviz_cluster(res.km, main = "K-Means Clustering on Current data", labelsize = 0)

# Adding airline names & plotting
cluster_plot + 
  geom_text_repel(aes(label = fleet_raw$Airline), size = 2)

The K-Means clustering for k=3 reveals a noteworthy division, particularly given that only two variables were considered. The first cluster comprises the largest European and American airlines, together with the three largest Chinese airlines. The second cluster comprises all airlines that do not stand out in terms of either the size or the average aircraft value of their current fleet. This group is notable for its diversity, encompassing airlines from across the globe, including both traditional and low-cost carriers. In contrast, the third cluster is more homogeneous, comprising primarily airlines from the Middle East and from Asia outside mainland China (China Airlines is a Taiwanese carrier). There are only two exceptions, namely Virgin Atlantic, a relatively atypical UK airline, and Ethiopian Airlines, which is also somewhat unusual for its region.

data_for_clustering <- fleet_raw %>%
  select(Historic, Avg_Value_Hist)

data_for_clustering <- scale(data_for_clustering)

res.km <- eclust(data_for_clustering, "kmeans", hc_metric = "euclidean", k = 3, graph = FALSE)

cluster_plot <- fviz_cluster(res.km, main = "K-Means Clustering on Historic Data", labelsize = 0)

cluster_plot + 
  geom_text_repel(aes(label = fleet_raw$Airline), size = 2)

The historic data do not diverge significantly from the current situation. However, they point to a notable advancement in Chinese aviation: in the historic data, Air China is the sole Chinese member of cluster 1, which comprises the largest airlines in terms of aircraft quantity. The historic data also show that FedEx Express had a considerably higher number of aircraft and is classified in cluster 2. Finally, South African Airways is the sole African carrier in cluster 3 in the historic data, whereas in the current data that position is held by Ethiopian Airlines.

data_for_clustering <- fleet_raw %>%
  select(Total, Avg_Value_Ttl)

data_for_clustering <- scale(data_for_clustering)


res.km <- eclust(data_for_clustering, "kmeans", hc_metric = "euclidean", k = 3, graph = FALSE)


cluster_plot <- fviz_cluster(res.km, main = "K-Means Clustering on Total Data", labelsize = 0)

cluster_plot + 
  geom_text_repel(aes(label = fleet_raw$Airline), size = 2)

Clustering the total data set produces results comparable to those of the two previous examples. This indicates that the future fleet does not significantly alter the profiles of the airlines in question, which appear to be relatively consistent over time.
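Comparisons between two clusterings of the same airlines can be quantified with a contingency table rather than by eye. The sketch below uses hypothetical label vectors; in practice the labels would come from the cluster component of two fitted models.

```r
# Hypothetical cluster labels for the same eight airlines from two runs
labels_current <- c(1, 1, 2, 2, 2, 3, 3, 3)
labels_total   <- c(1, 1, 2, 2, 3, 3, 3, 3)

# Contingency table: rows = first clustering, columns = second clustering
agreement <- table(current = labels_current, total = labels_total)
print(agreement)

# Share of airlines that keep their label
# (valid only if cluster numbers mean the same thing in both runs)
share_same <- mean(labels_current == labels_total)
print(share_same)  # 0.875
```

Because cluster numbering is arbitrary, the table is usually more informative than the simple share: a clustering that is identical up to relabelling shows one non-zero cell per row.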

3.4 PAM

Following the completion of the K-Means clustering process, I proceeded to undertake the PAM analysis with k=3, repeating this for all three data categories.

# Data preparation
data_for_clustering <- fleet_raw %>%
   select(Current, Avg_Value_Curr)
data_normalized <- scale(data_for_clustering)

# PAM Clustering
pam_clusters <- eclust(data_normalized, "pam", k = 3, graph = FALSE)

# Creating the clustering plot without displaying it immediately
cluster_plot <- fviz_cluster(pam_clusters, main = "PAM Clustering for Current Data", labelsize = 0)

# Adding airline names & plotting
cluster_plot + 
  geom_text_repel(aes(label = fleet_raw$Airline), size = 2)

The results of the PAM clustering on the current data set resemble those of the K-Means clustering, but there are several notable differences. Cluster 1 now includes Qantas Airways, the largest Australian carrier. Cluster 3 now also includes Turkish Airlines, Air India and Gulf Air. These reassigned airlines appear similar to the other airlines within clusters 1 and 3, respectively. From a geographical perspective, however, Air India and Turkish Airlines count as Asian airlines even though they do not fit the profile conventionally associated with this region. Gulf Air, on the other hand, is a Middle Eastern carrier, yet it differs in its business approach from other prominent Middle Eastern airlines such as Emirates and Etihad.

data_for_clustering <- fleet_raw %>%
  select(Historic, Avg_Value_Hist)
data_normalized <- scale(data_for_clustering)

pam_clusters <- eclust(data_normalized, "pam", k = 3, graph = FALSE)

cluster_plot <- fviz_cluster(pam_clusters, main = "PAM Clustering for Historical Data", labelsize = 0)

cluster_plot + 
  geom_text_repel(aes(label = fleet_raw$Airline), size = 2)

With regard to PAM for historical data, a clear distinction can be observed in comparison to the K-Means approach, which was already evident in the analysis of current data. The notable aspect here is the considerable size of Cluster 3 and the inclusion of FedEx Express, which challenges the regionality hypothesis associated with this specific clustering.

data_for_clustering <- fleet_raw %>%
  select(Total, Avg_Value_Ttl)
data_normalized <- scale(data_for_clustering)

pam_clusters <- eclust(data_normalized, "pam", k = 3, graph = FALSE)

cluster_plot <- fviz_cluster(pam_clusters, main = "PAM Clustering for Total Data", labelsize = 0)

cluster_plot + 
  geom_text_repel(aes(label = fleet_raw$Airline), size = 2)

The PAM clustering for the total dataset is nearly identical to the PAM clustering of the current dataset. The comparison with K-Means likewise mirrors that observed for the current data, indicating that the influence of the future fleet is relatively small.

3.5 Hierarchical clustering

data_for_clustering <- fleet_raw %>%
  select(Current, Avg_Value_Curr)
data_normalized <- scale(data_for_clustering)

distance_matrix <- dist(data_normalized, method = "euclidean")

hclust_result <- hclust(distance_matrix, method = "complete")

plot(hclust_result, labels = fleet_raw$Airline, 
     main = "Hierarchical clustering", 
     xlab = "Airlines", ylab = "Distance", sub = "",
     cex = 0.4, 
     lwd = 1) 

rect.hclust(hclust_result, k=3, border="red")

Despite the complexity of the plot and the multitude of airlines included, it enables the observation of not only the three-cluster division similar to the one seen in previous examples, but also the identification of outlier airlines, namely American Airlines (with the largest fleet) and Emirates (with the highest average aircraft price).
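When a dendrogram is too dense to read, cluster membership and candidate outliers can be extracted programmatically with cutree(). The sketch below is an illustration on synthetic data; the row names are hypothetical placeholders rather than real airlines.

```r
# Synthetic two-dimensional data: two groups plus one isolated point
set.seed(7)
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.2), ncol = 2),
  c(10, 10)
)
rownames(toy) <- paste0("airline_", seq_len(nrow(toy)))

hc_toy <- hclust(dist(toy), method = "complete")

# Cut the tree into three groups, the same k as used with rect.hclust
membership <- cutree(hc_toy, k = 3)
sizes <- table(membership)
print(sizes)

# Clusters with a single member are candidate outliers
singletons <- names(membership)[membership %in% names(sizes)[sizes == 1]]
print(singletons)
```

In this toy example the isolated point forms a cluster of its own, which is the same pattern that makes airlines with extreme values stand out in a dendrogram.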

data_for_clustering <- fleet_raw %>%
  select(Total, Avg_Value_Ttl)

data_normalized <- scale(data_for_clustering)

res.dist <- dist(data_normalized, method = "euclidean")

hc <- hclust(res.dist, method = "ward.D2")

hc$labels <- fleet_raw$Airline

plot(as.phylo(hc), type = "fan", cex = 0.5, no.margin = TRUE)

This unusual plot is analogous to the one presented in class, and it offers intriguing insights. Particularly noteworthy are the groups of three or four airlines that are most similar to each other. Notable examples include Singapore Airlines and Cathay Pacific, which are major competitors based in relatively small but highly developed regions, namely Singapore and Hong Kong. The plot also shows groupings that follow airline profiles rather than regions. For instance, the Chinese low-cost carrier Spring Airlines is placed next to the European airline Wizz Air. Nevertheless, there are cases where the number of aircraft and their average value are similar despite significant differences in airline profiles, as with Air New Zealand, a traditional carrier, and Thomas Cook Airlines, a typical tourist operator. Including a charter indicator in the analysis might yield different results.

4 Summary

In this paper, I analysed airline fleet data with the objective of identifying both differences and similarities between fleets. Clustering made it possible to classify airlines into three principal groups for each data type: current, historic, and total. The resulting groups differed slightly between the K-Means and PAM methods, but the primary correlation with geography remained consistently visible. The majority of airlines exhibited consistent profiles across the data types, indicating that their characteristics remained relatively stable throughout their development. This finding is consistent with the prevailing business understanding of the industry.

The first group comprised large European, American, and mainland Chinese airlines. These airlines combine a considerable fleet size with a relatively modest average aircraft value. This is consistent with the observation that they operate a diverse range of routes, including short-haul and regional services as well as long-haul routes, with a significant number of aircraft. Domestic and regional routes often have lower demand, which explains why larger aircraft are not deployed on them.

The second group comprised airlines based in Asia and the Middle East. It is not immediately apparent why these should be grouped together, given the significant business differences between them. However, they are united by the fact that the majority of their fleets consist of wide-body aircraft. Middle Eastern airlines primarily operate long-haul routes, while Asian airlines also offer a significant number of domestic flights; owing to the high population density in these regions, wide-body aircraft are used on those domestic routes as well.

The final group is the most heterogeneous, comprising airlines from across the globe that do not stand out in either the number or the average value of their aircraft. This demonstrates the limitations of such a simple clustering approach. For instance, Ryanair and Aeroflot, despite being in the same cluster, have very different business models. Overall, I believe that clustering with additional factors can be a valuable tool for airlines to benchmark themselves against their market rivals. Even with the relatively straightforward criteria employed in this study, the results were informative.