The objective of this paper is to analyse and cluster airline fleet data. My interest in aircraft was first piqued in primary school, and I have been employed by an airline for over two years, which motivated me to look for data from this field.
More specifically, the aim is to examine the characteristics of airline fleets and to identify the similarities and differences between them. Airlines are highly diverse, and the composition of their fleets depends largely on their business profiles. Some regional carriers utilise smaller and more cost-effective aircraft, while others employ a combination of narrow-body and wide-body aircraft to serve both short-haul and long-haul routes. Finally, there are those, particularly from the Middle East, which specialise in luxurious long-distance travel. The number of aircraft is as important as their type. For instance, the largest low-cost airlines in Europe use only narrow-body aircraft, while smaller carriers operate the same type of plane but in smaller numbers. With regard to regionality, it is a commonly held view in the business community that, owing to the relatively high population density in most Asian countries, airlines there tend to use wide-body aircraft on domestic flights. Conversely, airlines based in the United States have developed diverse fleets because of the vast distances between population centres and the lack of railway infrastructure. The largest traditional European airlines have a comparable structure, albeit with smaller fleets. This article sets out to ascertain whether this general business knowledge is reflected in the data.
The data set analysed in this article is sourced from Kaggle (https://www.kaggle.com/datasets/traceyvanp/airlinefleet). It contains information on over 100 airlines, collated in January 2017. The original file comprises eleven columns, which are renamed on import as follows:
# Importing the libraries needed for the whole project
library(knitr)
library(dplyr)
library(factoextra)
library(cluster)
library(ClusterR)
library(ggplot2)
library(gridExtra)
library(ggpubr)
library(ggrepel)
library(ClustGeo)
library(ape)
fleet_raw <- read.csv("Fleet Data.csv")
colnames(fleet_raw) <- c("Airline", "Airline_Company", "Aircraft", "Current",
"Future", "Ordered", "Historic", "Total", "Unit_Value", "Currents_Value", "Avg_Age")
kable(head(fleet_raw[98:103,]))
Row | Airline | Airline_Company | Aircraft | Current | Future | Ordered | Historic | Total | Unit_Value | Currents_Value | Avg_Age |
---|---|---|---|---|---|---|---|---|---|---|---|
98 | Air Berlin | Air Berlin | McDonnell Douglas MD-80 | NA | NA | 1 | 1 | NA | $45 | $0 | NA |
99 | Air Canada | Air Canada | Airbus A319 | 15 | NA | 33 | 48 | NA | $90 | $1,344 | 18.9 |
100 | Air Canada | Air Canada Jetz | Airbus A319 | 3 | NA | 2 | 5 | NA | $90 | $269 | 18.7 |
101 | Air Canada | Air Canada Rouge | Airbus A319 | 20 | NA | NA | 20 | NA | $90 | $1,792 | 18.6 |
102 | Air Canada | Air Canada | Airbus A320 | 42 | NA | 13 | 55 | NA | $98 | $4,116 | 23.3 |
103 | Air Canada | Air Canada Jetz | Airbus A320 | NA | NA | 5 | 5 | NA |  | $0 | NA |
In order to focus the analysis on the parent airlines and to disregard the split into subsidiaries and aircraft types, I removed the Airline_Company and Aircraft columns.
fleet_raw$Airline_Company <- NULL
fleet_raw$Aircraft <- NULL
kable(head(fleet_raw[98:103,]))
Row | Airline | Current | Future | Ordered | Historic | Total | Unit_Value | Currents_Value | Avg_Age |
---|---|---|---|---|---|---|---|---|---|
98 | Air Berlin | NA | NA | 1 | 1 | NA | $45 | $0 | NA |
99 | Air Canada | 15 | NA | 33 | 48 | NA | $90 | $1,344 | 18.9 |
100 | Air Canada | 3 | NA | 2 | 5 | NA | $90 | $269 | 18.7 |
101 | Air Canada | 20 | NA | NA | 20 | NA | $90 | $1,792 | 18.6 |
102 | Air Canada | 42 | NA | 13 | 55 | NA | $98 | $4,116 | 23.3 |
103 | Air Canada | NA | NA | 5 | 5 | NA |  | $0 | NA |
In the next step I checked the columns for missing values. I expected many of them, because each row describes one aircraft type in the fleet of one airline and carries several quantity columns, most of which are empty for any given type. I also wanted to check the exact number of rows in the table.
colSums(is.na(fleet_raw) | fleet_raw == "")
## Airline Current Future Ordered Historic
## 0 724 1395 470 99
## Total Unit_Value Currents_Value Avg_Age
## 1235 35 27 763
nrow(fleet_raw)
## [1] 1583
The average age of aircraft is missing in 763 of the 1,583 rows. Given the inherent difficulty of estimating this parameter accurately, owing to the lengthy production cycles of certain aircraft types, I excluded this column from the analysis.
fleet_raw$Avg_Age <- NULL
Furthermore, I consolidated the Future and Ordered columns into a single column and removed the rows lacking a unit value, so that no airline would end up without an average aircraft value.
fleet_raw <- fleet_raw %>%
filter(!is.na(Unit_Value) & Unit_Value != "")
fleet_raw$Future <- fleet_raw$Future + fleet_raw$Ordered
fleet_raw$Ordered <- NULL
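One detail of the consolidation step is worth illustrating: in R, adding NA to a number yields NA, so the plain `+` used above leaves Future missing whenever either of the two counts is missing. The sketch below uses made-up rows (the `toy` data frame and its values are hypothetical, not taken from the data set) to contrast plain addition with an NA-safe sum. It is only one possible alternative, should the intention be to keep counts that appear in just one of the two columns.

```r
# Toy rows with hypothetical counts (not taken from the data set)
toy <- data.frame(
  Future  = c(NA, 4, NA, 2),
  Ordered = c(33, NA, NA, 5)
)

# Plain addition propagates NA: only the fully populated row keeps a value
toy$naive_sum <- toy$Future + toy$Ordered

# rowSums(..., na.rm = TRUE) treats each missing count as zero instead
toy$safe_sum <- rowSums(toy[, c("Future", "Ordered")], na.rm = TRUE)

print(toy)
```

Whether the NA-safe variant is preferable depends on how the missing entries in the source file should be read; the figures reported in this article were produced with the plain addition.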
The NA values in the Current, Future, and Historic columns were replaced with zero, and the Total column was recomputed as the sum of the Current, Future, and Historic values.
fleet_raw$Current[is.na(fleet_raw$Current)] <- 0
fleet_raw$Future[is.na(fleet_raw$Future)] <- 0
fleet_raw$Historic[is.na(fleet_raw$Historic)] <- 0
fleet_raw$Total <- fleet_raw$Current + fleet_raw$Future + fleet_raw$Historic
kable(head(fleet_raw[96:100,]))
Row | Airline | Current | Future | Historic | Total | Unit_Value | Currents_Value |
---|---|---|---|---|---|---|---|
96 | Air Berlin | 0 | 0 | 18 | 18 | $20 | $0 |
97 | Air Berlin | 0 | 0 | 1 | 1 | $45 | $0 |
98 | Air Canada | 15 | 0 | 48 | 63 | $90 | $1,344 |
99 | Air Canada | 3 | 0 | 5 | 8 | $90 | $269 |
100 | Air Canada | 20 | 0 | 20 | 40 | $90 | $1,792 |
At this point I aggregated the table to the level of Airline and removed the Currents_Value column, as it is derived from both Unit_Value and Current. I also needed to strip the dollar sign from Unit_Value in order to treat it as numeric. Current, Future, Historic and Total were calculated as sums over the aggregated rows, while Unit_Value was replaced by four columns (Avg_Value_Curr, Avg_Value_Future, Avg_Value_Hist and Avg_Value_Ttl), each calculated as a weighted mean for the corresponding group of aircraft.
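The weighted mean used here can be checked on a small example. The sketch below assumes a hypothetical airline with two aircraft types (the `toy` values are invented for illustration) and shows that dividing the summed fleet value by the summed aircraft count gives the same figure as base R's `weighted.mean()`.

```r
# Hypothetical fleet of one airline: two aircraft types
toy <- data.frame(
  Current            = c(10, 30),    # number of aircraft of each type
  Unit_Value_Mln_USD = c(100, 200)   # unit value in millions of USD
)

# Total fleet value divided by the total number of aircraft
by_hand <- sum(toy$Unit_Value_Mln_USD * toy$Current) / sum(toy$Current)

# Same figure via base R's weighted.mean(), using the counts as weights
built_in <- weighted.mean(toy$Unit_Value_Mln_USD, w = toy$Current)

print(c(by_hand = by_hand, built_in = built_in))  # both 175
```

The more numerous type pulls the average towards its own unit value, which is why the result (175) lies closer to 200 than to 100.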
fleet_raw$Currents_Value <- NULL
fleet_raw$Unit_Value <- as.numeric(substr(fleet_raw$Unit_Value, 2, nchar(fleet_raw$Unit_Value)))
colnames(fleet_raw) <- c( "Airline", "Current", "Future", "Historic", "Total", "Unit_Value_Mln_USD")
# Calculating sums for each aircraft category
fleet_raw <- fleet_raw %>%
mutate(
Value_Current = Unit_Value_Mln_USD * Current,
Value_Future = Unit_Value_Mln_USD * Future,
Value_Historic = Unit_Value_Mln_USD * Historic,
Value_Total = Unit_Value_Mln_USD * Total
)
fleet_raw$Unit_Value_Mln_USD <- NULL
# Grouping by airline to aggregate
fleet_raw <- fleet_raw %>%
group_by(Airline) %>%
summarise(
Current = sum(Current, na.rm = TRUE),
Future = sum(Future, na.rm = TRUE),
Historic = sum(Historic, na.rm = TRUE),
Total = sum(Total, na.rm = TRUE),
Value_Current = sum(Value_Current, na.rm = TRUE),
Value_Future = sum(Value_Future, na.rm = TRUE),
Value_Historic = sum(Value_Historic, na.rm = TRUE),
Value_Total = sum(Value_Total, na.rm = TRUE)
) %>%
ungroup()
# Calculating average aircraft values for each group
fleet_raw <- fleet_raw %>%
mutate(
Avg_Value_Curr = replace(Value_Current / Current, is.nan(Value_Current / Current), 0),
Avg_Value_Future = replace(Value_Future / Future, is.nan(Value_Future / Future), 0),
Avg_Value_Hist = replace(Value_Historic / Historic, is.nan(Value_Historic / Historic), 0),
Avg_Value_Ttl = replace(Value_Total / Total, is.nan(Value_Total / Total), 0)
) %>%
select(Airline, Current, Future, Historic, Total,
Avg_Value_Curr, Avg_Value_Future,
Avg_Value_Hist, Avg_Value_Ttl)
kable(head(fleet_raw[37:46,]))
Airline | Current | Future | Historic | Total | Avg_Value_Curr | Avg_Value_Future | Avg_Value_Hist | Avg_Value_Ttl |
---|---|---|---|---|---|---|---|---|
El Al | 49 | 0 | 114 | 163 | 151.77551 | 0 | 180.66667 | 171.98160 |
Emirates | 249 | 15 | 351 | 615 | 342.56225 | 295 | 302.78917 | 318.70244 |
Ethiopian Airlines | 82 | 0 | 128 | 210 | 158.98780 | 0 | 146.12500 | 151.14762 |
Etihad Airways | 124 | 12 | 161 | 297 | 227.61290 | 240 | 224.00000 | 226.15488 |
FedEx Express | 696 | 0 | 688 | 1384 | 70.53017 | 0 | 113.66424 | 91.97254 |
Finnair | 71 | 0 | 191 | 262 | 122.46479 | 0 | 87.96859 | 97.31679 |
The next step of the analysis was to print a summary of every column, showing the most important statistical measures.
summary(fleet_raw)
## Airline Current Future Historic
## Length:113 Min. : 17 Min. : 0.00 Min. : 45.0
## Class :character 1st Qu.: 63 1st Qu.: 0.00 1st Qu.: 113.0
## Mode :character Median : 105 Median : 0.00 Median : 195.0
## Mean : 182 Mean : 20.29 Mean : 326.3
## 3rd Qu.: 206 3rd Qu.: 26.00 3rd Qu.: 367.0
## Max. :1410 Max. :319.00 Max. :2679.0
## Total Avg_Value_Curr Avg_Value_Future Avg_Value_Hist
## Min. : 71.0 Min. : 28.44 Min. : 0.0 Min. : 21.44
## 1st Qu.: 173.0 1st Qu.: 78.33 1st Qu.: 0.0 1st Qu.: 74.82
## Median : 327.0 Median :103.19 Median : 0.0 Median : 98.34
## Mean : 528.7 Mean :115.07 Mean : 54.6 Mean :108.00
## 3rd Qu.: 588.0 3rd Qu.:134.74 3rd Qu.: 98.0 3rd Qu.:124.19
## Max. :4139.0 Max. :342.56 Max. :314.2 Max. :302.79
## Avg_Value_Ttl
## Min. : 24.79
## 1st Qu.: 78.22
## Median :100.58
## Mean :109.89
## 3rd Qu.:126.96
## Max. :318.70
The future aircraft data are dominated by zeros: the median of both Future and Avg_Value_Future is equal to zero, which means that at least half of the airlines have no future aircraft recorded. In order to maintain the integrity of the core data set, I retained the Future column, along with the variable Avg_Value_Future, but refrained from using them in subsequent tests.
In order to commence the clustering process, it was deemed appropriate to undertake a Hopkins Statistic Test on three distinct groups of data: Current, Historic and Total.
# Data preparation for Current & Avg_Value_Curr
data_for_clustering <- fleet_raw %>%
select(Current, Avg_Value_Curr)
data_normalized <- scale(data_for_clustering)
# Clustering tendency for Current & Avg_Value_Curr
clust_tendency <- get_clust_tendency(data_normalized, n = 2, graph = FALSE,
gradient = list(low = "blue", high = "white"),
seed = 123)
print(paste("Hopkins Statistic for Current & Avg_Value_Curr:", clust_tendency$hopkins_stat))
## [1] "Hopkins Statistic for Current & Avg_Value_Curr: 0.959875188336194"
plot1 <- fviz_dist(
dist(data_normalized),
show_labels = FALSE,
gradient = list(low = "blue", mid = "white", high = "red")
) +
labs(title = "Dissimilarity: Current & Avg_Value_Curr") +
theme_minimal() +
theme(
plot.title = element_text(size = 8),
axis.text = element_blank(),
legend.position = "bottom"
)
# Data preparation for Total & Avg_Value_Ttl
data_for_clustering_ttl <- fleet_raw %>%
select(Total, Avg_Value_Ttl)
data_normalized_ttl <- scale(data_for_clustering_ttl)
# Clustering tendency for Total & Avg_Value_Ttl
clust_tendency_ttl <- get_clust_tendency(data_normalized_ttl, n = 3, graph = FALSE,
gradient = list(low = "blue", high = "white"),
seed = 123)
print(paste("Hopkins Statistic for Total & Avg_Value_Ttl:", clust_tendency_ttl$hopkins_stat))
## [1] "Hopkins Statistic for Total & Avg_Value_Ttl: 0.93814424475394"
plot2 <- fviz_dist(
dist(data_normalized_ttl),
show_labels = FALSE,
gradient = list(low = "blue", mid = "white", high = "red")
) +
labs(title = "Dissimilarity: Total & Avg_Value_Ttl") +
theme_minimal() +
theme(
plot.title = element_text(size = 8),
axis.text = element_blank(),
legend.position = "bottom"
)
# Data preparation for Historic & Avg_Value_Hist
data_for_clustering_hist <- fleet_raw %>%
select(Historic, Avg_Value_Hist)
data_normalized_hist <- scale(data_for_clustering_hist)
# Clustering tendency for Historic & Avg_Value_Hist
clust_tendency_hist <- get_clust_tendency(data_normalized_hist, n = 2, graph = FALSE,
gradient = list(low = "blue", high = "white"),
seed = 123)
print(paste("Hopkins Statistic for Historic & Avg_Value_Hist:", clust_tendency_hist$hopkins_stat))
## [1] "Hopkins Statistic for Historic & Avg_Value_Hist: 0.943784621539787"
plot3 <- fviz_dist(
dist(data_normalized_hist),
show_labels = FALSE,
gradient = list(low = "blue", mid = "white", high = "red")
) +
labs(title = "Dissimilarity: Historic & Avg_Value_Hist") +
theme_minimal() +
theme(
plot.title = element_text(size = 8),
axis.text = element_blank(),
legend.position = "bottom"
)
# Arrange the three plots side by side
grid.arrange(plot1, plot2, plot3, ncol = 3)
The Hopkins statistic exceeds 0.5 for the pair of variables in all three categories, which is the conventional threshold for data that are suitable for clustering. Moreover, it exceeds 0.9 for each data category, which suggests that clustering is a viable approach for this dataset.
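To make the reading of this statistic more concrete, the sketch below computes a Hopkins-type statistic by hand on synthetic data. It is an illustrative version under stated assumptions, not the implementation used by `get_clust_tendency`: the function name `hopkins_sketch`, the bounding-box sampling and the toy data are all hypothetical. In this variant, nearest-neighbour distances from uniformly placed points (`u`) are compared with those from real observations (`w`), and values close to 1 point to clustered data.

```r
# Illustrative Hopkins-type statistic; a sketch, not factoextra's implementation
hopkins_sketch <- function(x, m, seed = 123) {
  set.seed(seed)
  x <- as.matrix(x)
  n <- nrow(x)

  # Distance from point p to its nearest neighbour among the rows of pts
  nn_dist <- function(p, pts) {
    min(sqrt(colSums((t(pts) - p)^2)))
  }

  # m points drawn uniformly from the bounding box of the data
  lo <- apply(x, 2, min)
  hi <- apply(x, 2, max)
  unif <- sapply(seq_len(ncol(x)), function(j) runif(m, lo[j], hi[j]))
  unif <- matrix(unif, nrow = m)

  # m real observations, each compared with the remaining observations
  idx <- sample(n, m)

  u <- sapply(seq_len(m), function(i) nn_dist(unif[i, ], x))
  w <- sapply(idx, function(i) nn_dist(x[i, ], x[-i, , drop = FALSE]))

  sum(u) / (sum(u) + sum(w))
}

# Two tight, well-separated synthetic groups (hypothetical data)
set.seed(1)
clustered <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2)
)

h_clustered <- hopkins_sketch(clustered, m = 10)
print(h_clustered)
```

For clustered data the uniform points tend to fall in empty space, so `u` dominates and the ratio approaches 1; for data without structure both sums are of similar size and the ratio tends towards 0.5.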
As a preliminary approach to identifying the optimal number of clusters, I employed the fviz_nbclust function from the factoextra package, as demonstrated in the course material. For all three datasets, I utilised the “silhouette” method.
# Data preparation for Current and Avg_Value_Curr
data_for_clustering_1 <- fleet_raw %>%
select(Current, Avg_Value_Curr)
data_normalized_1 <- scale(data_for_clustering_1)
# Silhouette Method for Current and Avg_Value_Curr
km1s_1 <- fviz_nbclust(data_normalized_1, kmeans, method = "silhouette") +
ggtitle("Silhouette Method: Current & Avg_Value_Curr") +
theme(
plot.title = element_text(size = 8), # Decrease font size of the title
axis.text = element_text(size = 6) # Decrease font size of axis labels
)
# Data preparation for Total and Avg_Value_Ttl
data_for_clustering_2 <- fleet_raw %>%
select(Total, Avg_Value_Ttl)
data_normalized_2 <- scale(data_for_clustering_2)
# Silhouette Method for Total and Avg_Value_Ttl
km1s_2 <- fviz_nbclust(data_normalized_2, kmeans, method = "silhouette") +
ggtitle("Silhouette Method: Total & Avg_Value_Ttl") +
theme(
plot.title = element_text(size = 8), # Decrease font size of the title
axis.text = element_text(size = 6) # Decrease font size of axis labels
)
# Data preparation for Historic and Avg_Value_Hist
data_for_clustering_3 <- fleet_raw %>%
select(Historic, Avg_Value_Hist)
data_normalized_3 <- scale(data_for_clustering_3)
# Silhouette Method for Historic and Avg_Value_Hist
km1s_3 <- fviz_nbclust(data_normalized_3, kmeans, method = "silhouette") +
ggtitle("Silhouette Method: Historic & Avg_Value_Hist") +
theme(
plot.title = element_text(size = 8), # Decrease font size of the title
axis.text = element_text(size = 6) # Decrease font size of axis labels
)
# Arrange the plots side by side using grid.arrange
grid.arrange(km1s_1, km1s_2, km1s_3, ncol = 3)
For both the current and the historic data, k=3 was indicated as the optimal number of clusters, while for the total data set the optimum was k=2. For all three categories, however, the difference between k=2 and k=3 was relatively minor. For the total and current data sets, the average silhouette width for both k=2 and k=3 is approximately 0.6, while for the historical data set it lies between 0.4 and 0.5. This suggests that the clusters in the historical data are less well separated than those in the current and total data, although the values are still high enough for the clustering to be meaningful.
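The silhouette criterion behind these plots can be reproduced in a few lines. The sketch below is a minimal illustration on synthetic data (the `toy` matrix is hypothetical and unrelated to the fleet data): for each candidate k it runs `kmeans`, computes the silhouette width of every point with `cluster::silhouette`, and averages it. The k with the highest average is the one such a plot would mark as optimal.

```r
library(cluster)

# Two well-separated synthetic groups (hypothetical data)
set.seed(123)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2)
)
toy_dist <- dist(toy)

# Average silhouette width for k = 2..5
avg_sil <- sapply(2:5, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, toy_dist)[, "sil_width"])
})
names(avg_sil) <- 2:5

# The candidate with the highest average width
best_k <- as.integer(names(which.max(avg_sil)))
print(round(avg_sil, 2))
print(best_k)
```

An average width close to 1 means that points sit much nearer to their own cluster than to the neighbouring one, which is why values around 0.6 are read above as better separated than values between 0.4 and 0.5.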
As an alternative approach, the Optimal_Clusters_KMeans function from the ClusterR package was employed. The criterion selected for all three data types was “silhouette.”
# Data preparation and scaling for Current & Avg_Value_Curr
data_for_clustering1 <- fleet_raw %>%
select(Current, Avg_Value_Curr)
data_normalized1 <- scale(data_for_clustering1)
# Optimal cluster number analysis for Current & Avg_Value_Curr
plot1 <- Optimal_Clusters_KMeans(
data_normalized1,
max_clusters = 10,
plot_clusters = TRUE,
criterion = "silhouette"
)
# Data preparation and scaling for Total & Avg_Value_Ttl
data_for_clustering2 <- fleet_raw %>%
select(Total, Avg_Value_Ttl)
data_normalized2 <- scale(data_for_clustering2)
# Optimal cluster number analysis for Total & Avg_Value_Ttl
plot2 <- Optimal_Clusters_KMeans(
data_normalized2,
max_clusters = 10,
plot_clusters = TRUE,
criterion = "silhouette"
)
# Data preparation and scaling for Historic & Avg_Value_Hist
data_for_clustering3 <- fleet_raw %>%
select(Historic, Avg_Value_Hist)
data_normalized3 <- scale(data_for_clustering3)
# Optimal cluster number analysis for Historic & Avg_Value_Hist
plot3 <- Optimal_Clusters_KMeans(
data_normalized3,
max_clusters = 10,
plot_clusters = TRUE,
criterion = "silhouette"
)
The results are largely analogous to those obtained from the fviz_nbclust function, although some discrepancies are visible despite the same silhouette criterion being used in both cases. Here, the silhouette value for all data groups is approximately 0.6 for both k=2 and k=3. For the current data, even k=4 and k=5 show promising values.
In order to identify the optimal number of clusters for PAM, I employed the fviz_nbclust function again, this time with the clustering function set to pam. All other parameters were kept at the values used for the K-Means approach.
# Data preparation for Current and Avg_Value_Curr
data_for_clustering_1 <- fleet_raw %>%
select(Current, Avg_Value_Curr)
data_normalized_1 <- scale(data_for_clustering_1)
# Silhouette Method for Current and Avg_Value_Curr
km1s_1 <- fviz_nbclust(data_normalized_1, pam, method = "silhouette") +
ggtitle("Silhouette Method: Current & Avg_Value_Curr") +
theme(
plot.title = element_text(size = 8), # Decrease font size of the title
axis.text = element_text(size = 6) # Decrease font size of axis labels
)
# Data preparation for Total and Avg_Value_Ttl
data_for_clustering_2 <- fleet_raw %>%
select(Total, Avg_Value_Ttl)
data_normalized_2 <- scale(data_for_clustering_2)
# Silhouette Method for Total and Avg_Value_Ttl
km1s_2 <- fviz_nbclust(data_normalized_2, pam, method = "silhouette") +
ggtitle("Silhouette Method: Total & Avg_Value_Ttl") +
theme(
plot.title = element_text(size = 8), # Decrease font size of the title
axis.text = element_text(size = 6) # Decrease font size of axis labels
)
# Data preparation for Historic and Avg_Value_Hist
data_for_clustering_3 <- fleet_raw %>%
select(Historic, Avg_Value_Hist)
data_normalized_3 <- scale(data_for_clustering_3)
# Silhouette Method for Historic and Avg_Value_Hist
km1s_3 <- fviz_nbclust(data_normalized_3, pam, method = "silhouette") +
ggtitle("Silhouette Method: Historic & Avg_Value_Hist") +
theme(
plot.title = element_text(size = 8), # Decrease font size of the title
axis.text = element_text(size = 6) # Decrease font size of axis labels
)
# Arrange the plots side by side using grid.arrange
grid.arrange(km1s_1, km1s_2, km1s_3, ncol = 3)
The outcomes yielded by PAM diverge somewhat from those obtained through K-Means. The difference between k=2 and k=3 is more pronounced for all data sets, with k=3 emerging as the optimal choice for all categories. The silhouette value for the current data set was approximately 0.6, while for the other two sets it was approximately 0.5.
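Part of the reason the two methods can disagree is that PAM represents each cluster by a medoid, an actual observation, whereas K-Means uses the mean of the cluster, which usually does not coincide with any observation. The sketch below shows this on six invented points (the `toy` matrix is hypothetical), using `pam` from the cluster package and base R's `kmeans`.

```r
library(cluster)

# Two small hypothetical groups of points
toy <- rbind(
  c(1, 1), c(2, 1), c(1, 3),    # first group
  c(8, 8), c(9, 8), c(8, 10)    # second group
)

pam_fit <- pam(toy, k = 2)

set.seed(123)
km_fit <- kmeans(toy, centers = 2, nstart = 10)

# PAM medoids are rows of the data; id.med gives their row indices
print(pam_fit$medoids)
print(toy[pam_fit$id.med, ])

# K-Means centres are coordinate-wise means of the cluster members
print(km_fit$centers)
```

Because a medoid must be a real data point, it is pulled less strongly by extreme observations than a mean, which can shift cluster boundaries in data with a few very large values.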
Following an investigation into the optimal number of clusters, a decision was taken to commence the clustering process using the K-Means method. The analysis was conducted for k=3. Initially, the data were clustered based on the current data set. To facilitate the interpretation of the results, the airline names were added to the relevant points on the graph.
# Preparing and scaling the data
data_for_clustering <- fleet_raw %>%
select(Current, Avg_Value_Curr)
data_for_clustering <- scale(data_for_clustering)
# Performing enhanced k-means clustering using eclust on scaled data
res.km <- eclust(data_for_clustering, "kmeans", hc_metric = "euclidean", k = 3, graph = FALSE)
# Creating the clustering plot without displaying it immediately
cluster_plot <- fviz_cluster(res.km, main = "K-Means Clustering on Current data", labelsize = 0)
# Adding airline names & plotting
cluster_plot +
geom_text_repel(aes(label = fleet_raw$Airline), size = 2)
The K-Means clustering for k=3 reveals a noteworthy division, particularly given that only two variables were considered. The first cluster comprises the largest European and American airlines, together with the three largest Chinese airlines. The second cluster comprises all airlines that do not stand out in terms of either the size or the average aircraft value of their current fleet. This group is notable for its diversity, encompassing airlines from across the globe, including both traditional and low-cost carriers. In contrast, the third cluster is more homogeneous, comprising primarily airlines from the Middle East and from Asia outside mainland China (China Airlines is a Taiwanese carrier). There are only two exceptions, namely Virgin Atlantic, which is a relatively atypical UK airline, and Ethiopian Airlines, which is also somewhat unusual for its region.
data_for_clustering <- fleet_raw %>%
select(Historic, Avg_Value_Hist)
data_for_clustering <- scale(data_for_clustering)
res.km <- eclust(data_for_clustering, "kmeans", hc_metric = "euclidean", k = 3, graph = FALSE)
cluster_plot <- fviz_cluster(res.km, main = "K-Means Clustering on Historic Data", labelsize = 0)
cluster_plot +
geom_text_repel(aes(label = fleet_raw$Airline), size = 2)
The historical data do not diverge significantly from the current situation. However, they point to a notable advancement in Chinese aviation: in the historical data, Air China is the only Chinese member of cluster 1, which comprises the largest airlines in terms of aircraft quantity. Additionally, FedEx Express had a considerably higher number of aircraft in the historical data and is classified in cluster 2. Finally, the historical data place South African Airways as the sole African carrier in cluster 3, whereas in the current data Ethiopian Airlines is the only African carrier in this cluster.
data_for_clustering <- fleet_raw %>%
select(Total, Avg_Value_Ttl)
data_for_clustering <- scale(data_for_clustering)
res.km <- eclust(data_for_clustering, "kmeans", hc_metric = "euclidean", k = 3, graph = FALSE)
cluster_plot <- fviz_cluster(res.km, main = "K-Means Clustering on Total Data", labelsize = 0)
cluster_plot +
geom_text_repel(aes(label = fleet_raw$Airline), size = 2)
Clustering the total data set yields results comparable to those of the two previous examples. This indicates that the future fleet does not significantly alter the profiles of the airlines in question, which appear to be relatively consistent over time.
Following the completion of the K-Means clustering process, I proceeded to undertake the PAM analysis with k=3, repeating this for all three data categories.
# Data preparation
data_for_clustering <- fleet_raw %>%
select(Current, Avg_Value_Curr)
data_normalized <- scale(data_for_clustering)
# PAM Clustering
pam_clusters <- eclust(data_normalized, "pam", k = 3, graph = FALSE)
# Creating the clustering plot without displaying it immediately
cluster_plot <- fviz_cluster(pam_clusters, main = "PAM Clustering for Current Data", labelsize = 0)
# Adding airline names & plotting
cluster_plot +
geom_text_repel(aes(label = fleet_raw$Airline), size = 2)
The results of the PAM clustering on the current data set appear similar to those of the K-Means clustering, but there are a number of notable differences. With regard to Cluster One, it is notable that Qantas Airways, the largest Australian carrier, has been included. The third cluster also included airlines such as Turkish Airlines, Air India and Gulf Air. The aforementioned airlines, which have undergone a change in cluster assignment, appear to exhibit similarities to other airlines within clusters 1 and 3, respectively. However, it is noteworthy that, from a geographical perspective, Air India and Turkish Airlines are classified as Asian airlines, though they do not adhere to the conventional characteristics associated with this region. Gulf Air, on the other hand, is a Middle Eastern carrier, yet it differs from other prominent Middle Eastern airlines, such as Emirates and Etihad, in terms of its business approach.
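A compact way to quantify how far two clusterings agree is to cross-tabulate their assignments. The sketch below is a minimal illustration on synthetic data (the `toy` matrix is hypothetical): it fits both K-Means and PAM with k=3 and tabulates the two label vectors. Since cluster numbers are arbitrary, agreement shows up as one non-zero cell per row rather than as a diagonal.

```r
library(cluster)

# Three well-separated synthetic groups (hypothetical data)
set.seed(123)
toy <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)

km_fit  <- kmeans(toy, centers = 3, nstart = 25)
pam_fit <- pam(toy, k = 3)

# Rows: K-Means labels, columns: PAM labels
agreement <- table(kmeans = km_fit$cluster, pam = pam_fit$clustering)
print(agreement)
```

On data such as the fleet table, any row with more than one non-zero cell would correspond to observations that changed cluster between the two methods, such as the airlines discussed above.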
data_for_clustering <- fleet_raw %>%
select(Historic, Avg_Value_Hist)
data_normalized <- scale(data_for_clustering)
pam_clusters <- eclust(data_normalized, "pam", k = 3, graph = FALSE)
cluster_plot <- fviz_cluster(pam_clusters, main = "PAM Clustering for Historical Data", labelsize = 0)
cluster_plot +
geom_text_repel(aes(label = fleet_raw$Airline), size = 2)
With regard to PAM for historical data, a clear distinction can be observed in comparison to the K-Means approach, which was already evident in the analysis of current data. The notable aspect here is the considerable size of Cluster 3 and the inclusion of FedEx Express, which challenges the regionality hypothesis associated with this specific clustering.
data_for_clustering <- fleet_raw %>%
select(Total, Avg_Value_Ttl)
data_normalized <- scale(data_for_clustering)
pam_clusters <- eclust(data_normalized, "pam", k = 3, graph = FALSE)
cluster_plot <- fviz_cluster(pam_clusters, main = "PAM Clustering for Total Data", labelsize = 0)
cluster_plot +
geom_text_repel(aes(label = fleet_raw$Airline), size = 2)
The PAM clustering of the total dataset is nearly identical to the PAM clustering of the current dataset. The comparison with K-Means likewise resembles the one observed for the current data, indicating that the future fleet has little influence on the result.
data_for_clustering <- fleet_raw %>%
select(Current, Avg_Value_Curr)
data_normalized <- scale(data_for_clustering)
distance_matrix <- dist(data_normalized, method = "euclidean")
hclust_result <- hclust(distance_matrix, method = "complete")
plot(hclust_result, labels = fleet_raw$Airline,
main = "Hierarchical clustering",
xlab = "Airlines", ylab = "Distance", sub = "",
cex = 0.4,
lwd = 1)
rect.hclust(hclust_result, k=3, border="red")
Despite the complexity of the plot and the multitude of airlines included, it enables the observation of not only the three-cluster division similar to the one seen in previous examples, but also the identification of outlier airlines, namely American Airlines (with the largest fleet) and Emirates (with the highest average aircraft price).
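When a dendrogram is too dense to read, the cluster membership marked by the rectangles can also be extracted as a plain vector with `cutree`. The sketch below is an illustration on synthetic data with invented labels (`toy` and the `airline_` names are hypothetical); it uses the same complete linkage and Euclidean distance as above.

```r
# Three synthetic groups with hypothetical row labels
set.seed(123)
toy <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2)
)
rownames(toy) <- paste0("airline_", seq_len(nrow(toy)))

hc <- hclust(dist(toy, method = "euclidean"), method = "complete")

# Cut the tree into three groups; the result is a named integer vector
groups <- cutree(hc, k = 3)

print(table(groups))               # size of each group
print(names(groups)[groups == 1])  # members of the first group
```

Very small groups in such a table are a quick way to spot outliers of the kind visible in the dendrogram.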
data_for_clustering <- fleet_raw %>%
select(Total, Avg_Value_Ttl)
data_normalized <- scale(data_for_clustering)
res.dist <- dist(data_normalized, method = "euclidean")
hc <- hclust(res.dist, method = "ward.D2")
hc$labels <- fleet_raw$Airline
plot(as.phylo(hc), type = "fan", cex = 0.5, no.margin = TRUE)
This unusual plot is analogous to the one presented in class, and it offers intriguing insights. Particularly noteworthy are the groups of three or four airlines that are most similar to one another. Notable examples include Singapore Airlines and Cathay Pacific, which are major competitors based in relatively small but highly developed regions, namely Singapore and Hong Kong. The plot also shows groupings that are not based on region but follow the airlines' profiles. For instance, the Chinese low-cost carrier Spring Airlines is situated adjacent to the European airline Wizz Air. Nevertheless, there are instances where the number of aircraft and their average value are similar despite significant differences in airline profiles. This is evident in the case of Air New Zealand, a traditional carrier, and Thomas Cook Airlines, a typical tourist operator. Including a charter indicator in the analysis could possibly yield different results.
In this paper, I analysed airline fleet data with the objective of identifying both differences and similarities between the various fleets. Clustering made it possible to classify the airlines into three principal groups for each data type in question: current, historic, and total. The resulting groups differed slightly between the K-Means and PAM methods, but the primary correlation with the geographical factor remained consistently visible. The majority of airlines exhibited consistent profiles across the data types, indicating that their characteristics remained relatively stable throughout their development. This finding is consistent with the prevailing business understanding of the industry.
The first group comprised large European, American, and mainland Chinese airlines. These airlines were characterised by a considerable fleet size coupled with a relatively modest average aircraft value. This is consistent with the observation that they operate a diverse range of routes, including short-haul and regional services as well as long-haul routes, with a significant number of aircraft. Domestic and regional routes often have lower demand, which explains why larger aircraft are not deployed on them.
The second group comprised airlines based in Asia and the Middle East. It is not immediately apparent why these two regions should be grouped together, given the significant business differences between them. However, they are united by the fact that the majority of their fleets consist of wide-body aircraft. Middle Eastern airlines primarily operate long-haul routes, while Asian airlines also offer a significant number of domestic flights; owing to the high population density in these regions, wide-body aircraft are used on these domestic routes as well.
The final group is the most heterogeneous, comprising airlines from across the globe that do not distinguish themselves by the number or average value of their aircraft. This demonstrates the limitations of such a simple clustering approach. For instance, Ryanair and Aeroflot, despite being in the same cluster, are very different airlines in terms of their business models. Overall, I believe that clustering with additional factors can be a valuable tool for airlines to benchmark themselves against their market rivals. Even with the relatively simple criteria employed in this study, the results were informative.