Introduction

In this study, I use the European Commercial Flights Dataset (2016–2022), which includes airport level IFR (Instrument Flight Rules) flight movements across Europe. Instead of analyzing individual flights, I focus on structural traffic patterns at the airport level. The objective is to examine whether European airports can be grouped according to their overall traffic scale, variability over time, and sensitivity to the COVID-19 shock. By applying unsupervised learning techniques, I aim to identify meaningful structural differences across airports that may not be visible through simple traffic rankings.

Packages

library(tidyverse)
library(lubridate)
library(janitor)
library(cluster)
library(factoextra)
library(mclust)

Data Import

flights <- read_csv("flights.csv") %>%
  clean_names()

glimpse(flights)
## Rows: 688,099
## Columns: 14
## $ year          <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20…
## $ month_num     <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "0…
## $ month_mon     <chr> "JAN", "JAN", "JAN", "JAN", "JAN", "JAN", "JAN", "JAN", …
## $ flt_date      <dttm> 2016-01-01, 2016-01-01, 2016-01-01, 2016-01-01, 2016-01…
## $ apt_icao      <chr> "EBAW", "EBBR", "EBCI", "EBLG", "EBOS", "EDDB", "EDDC", …
## $ apt_name      <chr> "Antwerp", "Brussels", "Charleroi", "Liège", "Ostend-Bru…
## $ state_name    <chr> "Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "…
## $ flt_dep_1     <dbl> 4, 174, 45, 6, 7, 98, 18, 1, 401, 3, 122, 92, 172, 276, …
## $ flt_arr_1     <dbl> 3, 171, 47, 7, 7, 99, 21, 1, 341, 4, 130, 90, 175, 236, …
## $ flt_tot_1     <dbl> 7, 345, 92, 13, 14, 197, 39, 2, 742, 7, 252, 182, 347, 5…
## $ flt_dep_ifr_2 <dbl> NA, 174, 45, NA, NA, NA, NA, NA, 401, NA, 125, NA, NA, 2…
## $ flt_arr_ifr_2 <dbl> NA, 161, 45, NA, NA, NA, NA, NA, 306, NA, 129, NA, NA, 2…
## $ flt_tot_ifr_2 <dbl> NA, 335, 90, NA, NA, NA, NA, NA, 707, NA, 254, NA, NA, 5…
## $ pivot_label   <chr> "Antwerp (EBAW)", "Brussels (EBBR)", "Charleroi (EBCI)",…

Data Preparation

In this step, airport-level features are constructed by aggregating total traffic, average traffic, traffic variability, and the coefficient of variation for each airport.

airport_features <- flights %>%
  group_by(apt_icao, apt_name, state_name) %>%
  summarise(
    total_traffic = sum(flt_tot_1, na.rm = TRUE),
    mean_traffic  = mean(flt_tot_1, na.rm = TRUE),
    sd_traffic    = sd(flt_tot_1, na.rm = TRUE),
    cv_traffic    = sd_traffic / mean_traffic,
    .groups = "drop"
  )

glimpse(airport_features)
## Rows: 333
## Columns: 7
## $ apt_icao      <chr> "EBAW", "EBBR", "EBCI", "EBLG", "EBOS", "EDDB", "EDDC", …
## $ apt_name      <chr> "Antwerp", "Brussels", "Charleroi", "Liège", "Ostend-Bru…
## $ state_name    <chr> "Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "…
## $ total_traffic <dbl> 85510, 1178495, 288556, 228654, 41831, 594689, 107059, 2…
## $ mean_traffic  <dbl> 36.589645, 502.985489, 125.132697, 97.590269, 17.853606,…
## $ sd_traffic    <dbl> 15.238451, 200.624375, 43.544414, 31.960764, 7.447820, 9…
## $ cv_traffic    <dbl> 0.4164690, 0.3988671, 0.3479859, 0.3274995, 0.4171605, 0…

Pandemic Impact Variable

In this section, I measure the impact of the COVID-19 pandemic by calculating the percentage change in total traffic between 2019 and 2020 for each airport.

covid_impact <- flights %>%
  filter(year %in% c(2019, 2020)) %>%
  group_by(apt_icao, year) %>%
  summarise(year_total = sum(flt_tot_1, na.rm = TRUE), .groups = "drop") %>%
  tidyr::pivot_wider(names_from = year, values_from = year_total) %>%
  mutate(
    covid_drop = (`2020` - `2019`) / `2019`
  ) %>%
  select(apt_icao, covid_drop)

summary(covid_impact$covid_drop)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.9932 -0.5867 -0.4909 -0.4034 -0.2792  5.5250

Pandemic Impact (COVID-19 Shock)

The COVID-19 impact is defined as the relative change in total IFR movements between 2019 and 2020.

covid_impact %>%
  arrange(covid_drop) %>%
  head(10)
## # A tibble: 10 × 2
##    apt_icao covid_drop
##    <chr>         <dbl>
##  1 EPRA         -0.993
##  2 LIPH         -0.833
##  3 EIWT         -0.829
##  4 EVVA         -0.8  
##  5 EGLC         -0.768
##  6 LELC         -0.766
##  7 LTBA         -0.748
##  8 EIKN         -0.737
##  9 GMMX         -0.719
## 10 EGKK         -0.718

Dataset Preparation for Clustering

In this step, airport-level features are constructed by aggregating total traffic, average traffic, traffic variability, and the coefficient of variation for each airport.

airport_features2 <- airport_features %>%
  left_join(covid_impact, by = "apt_icao") %>%
  filter(!is.na(covid_drop))

dim(airport_features2)
## [1] 331   8
colSums(is.na(airport_features2))
##      apt_icao      apt_name    state_name total_traffic  mean_traffic 
##             0             0             0             0             0 
##    sd_traffic    cv_traffic    covid_drop 
##             0             0             0

Selection of Clustering Variables

For the clustering analysis, five continuous variables are selected: total traffic, average traffic, traffic variability (standard deviation), coefficient of variation, and the COVID-19 impact measure. These variables capture both structural size differences and crisis sensitivity across airports.

cluster_data <- airport_features2 %>%
  select(total_traffic, mean_traffic, sd_traffic, cv_traffic, covid_drop)

summary(cluster_data)
##  total_traffic      mean_traffic       sd_traffic         cv_traffic    
##  Min.   :     26   Min.   :   1.13   Min.   :  0.4577   Min.   :0.2424  
##  1st Qu.:  21127   1st Qu.:  10.26   1st Qu.:  5.3374   1st Qu.:0.3793  
##  Median :  68060   Median :  30.69   Median : 14.8455   Median :0.4430  
##  Mean   : 262810   Mean   : 116.24   Mean   : 50.1357   Mean   :0.5006  
##  3rd Qu.: 313826   3rd Qu.: 136.46   3rd Qu.: 55.8057   3rd Qu.:0.5854  
##  Max.   :2694025   Max.   :1149.82   Max.   :581.3177   Max.   :1.5223  
##    covid_drop     
##  Min.   :-0.9932  
##  1st Qu.:-0.5878  
##  Median :-0.4912  
##  Mean   :-0.4042  
##  3rd Qu.:-0.2800  
##  Max.   : 5.5250

Scaling the Variables

Since the variables are measured on different scales, standardization is applied before clustering. This ensures that no single variable dominates the distance calculations.

cluster_scaled <- scale(cluster_data)

summary(cluster_scaled)
##  total_traffic      mean_traffic       sd_traffic        cv_traffic     
##  Min.   :-0.5936   Min.   :-0.6003   Min.   :-0.6081   Min.   :-1.4671  
##  1st Qu.:-0.5459   1st Qu.:-0.5527   1st Qu.:-0.5483   1st Qu.:-0.6891  
##  Median :-0.4399   Median :-0.4461   Median :-0.4319   Median :-0.3274  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.1152   3rd Qu.: 0.1055   3rd Qu.: 0.0694   3rd Qu.: 0.4817  
##  Max.   : 5.4918   Max.   : 5.3901   Max.   : 6.5016   Max.   : 5.8047  
##    covid_drop     
##  Min.   :-1.3838  
##  1st Qu.:-0.4312  
##  Median :-0.2045  
##  Mean   : 0.0000  
##  3rd Qu.: 0.2917  
##  Max.   :13.9283

Determining the Optimal Number of Clusters

The aim is to identify the point where adding more clusters does not significantly improve the model fit.

fviz_nbclust(cluster_scaled, kmeans, method = "wss")

K-Means Clustering (k = 3)

Based on the elbow plot and the overall structure of the data, three clusters appear to provide a reasonable and interpretable grouping of airports.

set.seed(123)

kmeans_result <- kmeans(cluster_scaled, centers = 3, nstart = 25)

airport_features2$cluster <- as.factor(kmeans_result$cluster)

table(airport_features2$cluster)
## 
##   1   2   3 
##  78  32 221

Cluster Profiling

The clustering results suggest three broad groups of airports in Europe. One group consists of a small number of very large hub airports with significantly higher total and average traffic levels. These airports also show a stronger decline during the COVID-19 period.

Another group includes smaller airports with relatively low traffic volumes but higher variability over time. This may indicate more seasonal or irregular traffic patterns. The third group represents medium-sized airports, forming the largest share of the sample, with moderate traffic levels and a more balanced response to the pandemic shock.

airport_features2 %>%
  group_by(cluster) %>%
  summarise(
    avg_total = mean(total_traffic),
    avg_mean  = mean(mean_traffic),
    avg_sd    = mean(sd_traffic),
    avg_cv    = mean(cv_traffic),
    avg_covid = mean(covid_drop),
    n_airports = n()
  )
## # A tibble: 3 × 7
##   cluster avg_total avg_mean avg_sd avg_cv avg_covid n_airports
##   <fct>       <dbl>    <dbl>  <dbl>  <dbl>     <dbl>      <int>
## 1 1          39450.     17.9   14.3  0.745    -0.226         78
## 2 2        1403046.    612.   259.   0.430    -0.607         32
## 3 3         176541.     79.1   32.5  0.424    -0.438        221

Cluster Visualization using PCA

fviz_cluster(kmeans_result, 
             data = cluster_scaled,
             ellipse.type = "convex",
             geom = "point",
             stand = FALSE)

The PCA based cluster visualization shows a clear separation between large hub airports (Cluster 2) and the rest of the airports. Cluster 2 is positioned distinctly along the first principal component, reflecting its significantly higher traffic levels.

Clusters 1 and 3 are relatively closer to each other but still exhibit visible structural differences. Cluster 1 appears more vertically dispersed, suggesting higher variability, while Cluster 3 forms a more compact and homogeneous group. Overall, the clustering structure appears stable and well separated.

Cluster Validation (Silhouette Analysis)

The average silhouette width for the three-cluster solution is 0.41, indicating a reasonable cluster structure. Testing a four cluster solution resulted in a lower average silhouette value (0.40), confirming that the three cluster specification provides the most stable and interpretable structure for the data.

sil <- silhouette(kmeans_result$cluster, dist(cluster_scaled))

fviz_silhouette(sil)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
##   cluster size ave.sil.width
## 1       1   78          0.19
## 2       2   32          0.46
## 3       3  221          0.49

mean(sil[, 3])
## [1] 0.4148623
set.seed(123)
k4 <- kmeans(cluster_scaled, centers = 4, nstart = 25)

sil4 <- silhouette(k4$cluster, dist(cluster_scaled))
mean(sil4[,3])
## [1] 0.4011342

Cluster Stability Check (Hierarchical Comparison)

I obtained an ARI value of 0.37, which indicates a moderate level of agreement between the K-means and hierarchical clustering solutions. This suggests that the overall cluster structure is relatively stable across different methods.

hc <- hclust(dist(cluster_scaled), method = "ward.D2")

hc_cut <- cutree(hc, k = 3)

airport_features2$cluster_hc <- as.factor(hc_cut)

ari <- mclust::adjustedRandIndex(
  as.integer(airport_features2$cluster),
  as.integer(airport_features2$cluster_hc)
)

ari
## [1] 0.3721248
plot(hc,
     labels = FALSE,
     main = "Hierarchical Clustering Dendrogram",
     xlab = "",
     sub = "")

Cluster Comparison of Key Variables

The boxplots confirm substantial structural differences across clusters. Cluster 2 exhibits significantly higher traffic volumes and variability, while Cluster 1 consists of smaller and more irregular airports. The pandemic impact is strongest among large hub airports, reinforcing the clustering interpretation.

library(ggplot2)
library(tidyr)

airport_features2 %>%
  select(cluster, total_traffic, mean_traffic, sd_traffic, covid_drop) %>%
  pivot_longer(-cluster, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = cluster, y = value, fill = cluster)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~variable, scales = "free") +
  theme_minimal()

Conclusion

In this analysis, I show that European airports can be broadly grouped into three categories over the period 2016–2022. Large hub airports clearly stand out due to their high traffic volumes and their stronger decline during the COVID-19 period. Medium-sized airports form the largest group and display more balanced traffic patterns, while smaller airports tend to operate at lower volumes but with relatively higher variability.

Although clustering simplifies a complex aviation system, it provides a useful way to understand structural differences across airports. Overall, the results suggest that airport size, traffic stability, and crisis exposure play an important role in shaping the European airport landscape.