In this study, I use the European Commercial Flights Dataset (2016–2022), which includes airport level IFR (Instrument Flight Rules) flight movements across Europe. Instead of analyzing individual flights, I focus on structural traffic patterns at the airport level. The objective is to examine whether European airports can be grouped according to their overall traffic scale, variability over time, and sensitivity to the COVID-19 shock. By applying unsupervised learning techniques, I aim to identify meaningful structural differences across airports that may not be visible through simple traffic rankings.
library(tidyverse)
library(lubridate)
library(janitor)
library(cluster)
library(factoextra)
library(mclust)
flights <- read_csv("flights.csv") %>%
clean_names()
glimpse(flights)
## Rows: 688,099
## Columns: 14
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20…
## $ month_num <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "0…
## $ month_mon <chr> "JAN", "JAN", "JAN", "JAN", "JAN", "JAN", "JAN", "JAN", …
## $ flt_date <dttm> 2016-01-01, 2016-01-01, 2016-01-01, 2016-01-01, 2016-01…
## $ apt_icao <chr> "EBAW", "EBBR", "EBCI", "EBLG", "EBOS", "EDDB", "EDDC", …
## $ apt_name <chr> "Antwerp", "Brussels", "Charleroi", "Liège", "Ostend-Bru…
## $ state_name <chr> "Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "…
## $ flt_dep_1 <dbl> 4, 174, 45, 6, 7, 98, 18, 1, 401, 3, 122, 92, 172, 276, …
## $ flt_arr_1 <dbl> 3, 171, 47, 7, 7, 99, 21, 1, 341, 4, 130, 90, 175, 236, …
## $ flt_tot_1 <dbl> 7, 345, 92, 13, 14, 197, 39, 2, 742, 7, 252, 182, 347, 5…
## $ flt_dep_ifr_2 <dbl> NA, 174, 45, NA, NA, NA, NA, NA, 401, NA, 125, NA, NA, 2…
## $ flt_arr_ifr_2 <dbl> NA, 161, 45, NA, NA, NA, NA, NA, 306, NA, 129, NA, NA, 2…
## $ flt_tot_ifr_2 <dbl> NA, 335, 90, NA, NA, NA, NA, NA, 707, NA, 254, NA, NA, 5…
## $ pivot_label <chr> "Antwerp (EBAW)", "Brussels (EBBR)", "Charleroi (EBCI)",…
In this step, airport-level features are constructed by aggregating total traffic, average traffic, traffic variability, and the coefficient of variation for each airport.
airport_features <- flights %>%
group_by(apt_icao, apt_name, state_name) %>%
summarise(
total_traffic = sum(flt_tot_1, na.rm = TRUE),
mean_traffic = mean(flt_tot_1, na.rm = TRUE),
sd_traffic = sd(flt_tot_1, na.rm = TRUE),
cv_traffic = sd_traffic / mean_traffic,
.groups = "drop"
)
glimpse(airport_features)
## Rows: 333
## Columns: 7
## $ apt_icao <chr> "EBAW", "EBBR", "EBCI", "EBLG", "EBOS", "EDDB", "EDDC", …
## $ apt_name <chr> "Antwerp", "Brussels", "Charleroi", "Liège", "Ostend-Bru…
## $ state_name <chr> "Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "…
## $ total_traffic <dbl> 85510, 1178495, 288556, 228654, 41831, 594689, 107059, 2…
## $ mean_traffic <dbl> 36.589645, 502.985489, 125.132697, 97.590269, 17.853606,…
## $ sd_traffic <dbl> 15.238451, 200.624375, 43.544414, 31.960764, 7.447820, 9…
## $ cv_traffic <dbl> 0.4164690, 0.3988671, 0.3479859, 0.3274995, 0.4171605, 0…
In this section, I measure the impact of the COVID-19 pandemic by calculating the percentage change in total traffic between 2019 and 2020 for each airport.
covid_impact <- flights %>%
filter(year %in% c(2019, 2020)) %>%
group_by(apt_icao, year) %>%
summarise(year_total = sum(flt_tot_1, na.rm = TRUE), .groups = "drop") %>%
tidyr::pivot_wider(names_from = year, values_from = year_total) %>%
mutate(
covid_drop = (`2020` - `2019`) / `2019`
) %>%
select(apt_icao, covid_drop)
summary(covid_impact$covid_drop)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.9932 -0.5867 -0.4909 -0.4034 -0.2792 5.5250
The COVID-19 impact is defined as the relative change in total IFR movements between 2019 and 2020.
covid_impact %>%
arrange(covid_drop) %>%
head(10)
## # A tibble: 10 × 2
## apt_icao covid_drop
## <chr> <dbl>
## 1 EPRA -0.993
## 2 LIPH -0.833
## 3 EIWT -0.829
## 4 EVVA -0.8
## 5 EGLC -0.768
## 6 LELC -0.766
## 7 LTBA -0.748
## 8 EIKN -0.737
## 9 GMMX -0.719
## 10 EGKK -0.718
In this step, airport-level features are constructed by aggregating total traffic, average traffic, traffic variability, and the coefficient of variation for each airport.
airport_features2 <- airport_features %>%
left_join(covid_impact, by = "apt_icao") %>%
filter(!is.na(covid_drop))
dim(airport_features2)
## [1] 331 8
colSums(is.na(airport_features2))
## apt_icao apt_name state_name total_traffic mean_traffic
## 0 0 0 0 0
## sd_traffic cv_traffic covid_drop
## 0 0 0
For the clustering analysis, five continuous variables are selected: total traffic, average traffic, traffic variability (standard deviation), coefficient of variation, and the COVID-19 impact measure. These variables capture both structural size differences and crisis sensitivity across airports.
cluster_data <- airport_features2 %>%
select(total_traffic, mean_traffic, sd_traffic, cv_traffic, covid_drop)
summary(cluster_data)
## total_traffic mean_traffic sd_traffic cv_traffic
## Min. : 26 Min. : 1.13 Min. : 0.4577 Min. :0.2424
## 1st Qu.: 21127 1st Qu.: 10.26 1st Qu.: 5.3374 1st Qu.:0.3793
## Median : 68060 Median : 30.69 Median : 14.8455 Median :0.4430
## Mean : 262810 Mean : 116.24 Mean : 50.1357 Mean :0.5006
## 3rd Qu.: 313826 3rd Qu.: 136.46 3rd Qu.: 55.8057 3rd Qu.:0.5854
## Max. :2694025 Max. :1149.82 Max. :581.3177 Max. :1.5223
## covid_drop
## Min. :-0.9932
## 1st Qu.:-0.5878
## Median :-0.4912
## Mean :-0.4042
## 3rd Qu.:-0.2800
## Max. : 5.5250
Since the variables are measured on different scales, standardization is applied before clustering. This ensures that no single variable dominates the distance calculations.
cluster_scaled <- scale(cluster_data)
summary(cluster_scaled)
## total_traffic mean_traffic sd_traffic cv_traffic
## Min. :-0.5936 Min. :-0.6003 Min. :-0.6081 Min. :-1.4671
## 1st Qu.:-0.5459 1st Qu.:-0.5527 1st Qu.:-0.5483 1st Qu.:-0.6891
## Median :-0.4399 Median :-0.4461 Median :-0.4319 Median :-0.3274
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.1152 3rd Qu.: 0.1055 3rd Qu.: 0.0694 3rd Qu.: 0.4817
## Max. : 5.4918 Max. : 5.3901 Max. : 6.5016 Max. : 5.8047
## covid_drop
## Min. :-1.3838
## 1st Qu.:-0.4312
## Median :-0.2045
## Mean : 0.0000
## 3rd Qu.: 0.2917
## Max. :13.9283
The aim is to identify the point where adding more clusters does not significantly improve the model fit.
fviz_nbclust(cluster_scaled, kmeans, method = "wss")
Based on the elbow plot and the overall structure of the data, three clusters appear to provide a reasonable and interpretable grouping of airports.
set.seed(123)
kmeans_result <- kmeans(cluster_scaled, centers = 3, nstart = 25)
airport_features2$cluster <- as.factor(kmeans_result$cluster)
table(airport_features2$cluster)
##
## 1 2 3
## 78 32 221
The clustering results suggest three broad groups of airports in Europe. One group consists of a small number of very large hub airports with significantly higher total and average traffic levels. These airports also show a stronger decline during the COVID-19 period.
Another group includes smaller airports with relatively low traffic volumes but higher variability over time. This may indicate more seasonal or irregular traffic patterns. The third group represents medium-sized airports, forming the largest share of the sample, with moderate traffic levels and a more balanced response to the pandemic shock.
airport_features2 %>%
group_by(cluster) %>%
summarise(
avg_total = mean(total_traffic),
avg_mean = mean(mean_traffic),
avg_sd = mean(sd_traffic),
avg_cv = mean(cv_traffic),
avg_covid = mean(covid_drop),
n_airports = n()
)
## # A tibble: 3 × 7
## cluster avg_total avg_mean avg_sd avg_cv avg_covid n_airports
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 39450. 17.9 14.3 0.745 -0.226 78
## 2 2 1403046. 612. 259. 0.430 -0.607 32
## 3 3 176541. 79.1 32.5 0.424 -0.438 221
fviz_cluster(kmeans_result,
data = cluster_scaled,
ellipse.type = "convex",
geom = "point",
stand = FALSE)
The PCA based cluster visualization shows a clear separation between large hub airports (Cluster 2) and the rest of the airports. Cluster 2 is positioned distinctly along the first principal component, reflecting its significantly higher traffic levels.
Clusters 1 and 3 are relatively closer to each other but still exhibit visible structural differences. Cluster 1 appears more vertically dispersed, suggesting higher variability, while Cluster 3 forms a more compact and homogeneous group. Overall, the clustering structure appears stable and well separated.
The average silhouette width for the three-cluster solution is 0.41, indicating a reasonable cluster structure. Testing a four cluster solution resulted in a lower average silhouette value (0.40), confirming that the three cluster specification provides the most stable and interpretable structure for the data.
sil <- silhouette(kmeans_result$cluster, dist(cluster_scaled))
fviz_silhouette(sil)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## cluster size ave.sil.width
## 1 1 78 0.19
## 2 2 32 0.46
## 3 3 221 0.49
mean(sil[, 3])
## [1] 0.4148623
set.seed(123)
k4 <- kmeans(cluster_scaled, centers = 4, nstart = 25)
sil4 <- silhouette(k4$cluster, dist(cluster_scaled))
mean(sil4[,3])
## [1] 0.4011342
I obtained an ARI value of 0.37, which indicates a moderate level of agreement between the K-means and hierarchical clustering solutions. This suggests that the overall cluster structure is relatively stable across different methods.
hc <- hclust(dist(cluster_scaled), method = "ward.D2")
hc_cut <- cutree(hc, k = 3)
airport_features2$cluster_hc <- as.factor(hc_cut)
ari <- mclust::adjustedRandIndex(
as.integer(airport_features2$cluster),
as.integer(airport_features2$cluster_hc)
)
ari
## [1] 0.3721248
plot(hc,
labels = FALSE,
main = "Hierarchical Clustering Dendrogram",
xlab = "",
sub = "")
The boxplots confirm substantial structural differences across clusters. Cluster 2 exhibits significantly higher traffic volumes and variability, while Cluster 1 consists of smaller and more irregular airports. The pandemic impact is strongest among large hub airports, reinforcing the clustering interpretation.
library(ggplot2)
library(tidyr)
airport_features2 %>%
select(cluster, total_traffic, mean_traffic, sd_traffic, covid_drop) %>%
pivot_longer(-cluster, names_to = "variable", values_to = "value") %>%
ggplot(aes(x = cluster, y = value, fill = cluster)) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~variable, scales = "free") +
theme_minimal()
In this analysis, I show that European airports can be broadly grouped into three categories over the period 2016–2022. Large hub airports clearly stand out due to their high traffic volumes and their stronger decline during the COVID-19 period. Medium-sized airports form the largest group and display more balanced traffic patterns, while smaller airports tend to operate at lower volumes but with relatively higher variability.
Although clustering simplifies a complex aviation system, it provides a useful way to understand structural differences across airports. Overall, the results suggest that airport size, traffic stability, and crisis exposure play an important role in shaping the European airport landscape.