Introduction

Air traffic systems generate large volumes of data describing airport operations, including departures, arrivals, total movements, and instrument flight rules (IFR) traffic. These variables are typically highly correlated, making it difficult to identify underlying patterns using standard univariate or bivariate analysis.

This project applies unsupervised learning techniques to Polish air traffic data. Specifically, Principal Component Analysis (PCA) is used to reduce the dimensionality of the dataset while preserving its main structure. Subsequently, clustering is applied in the reduced-dimensional space to group airports with similar traffic characteristics.

The objective of this analysis is to explore similarities between airports and to identify distinct types of airports based on their operational profiles.

Data Description

The dataset contains monthly air traffic records for Polish airports. Each observation includes temporal information (year and month), airport identifiers, and multiple traffic-related variables. These variables include counts of departures, arrivals, total flights, and IFR operations.

The year 2025 is excluded from the analysis because the data for that year are incomplete and could bias the results. The Polish records were extracted from the files published under the Airport Traffic (from 2016) section at the link provided below.

https://ansperformance.eu/csv/

The files from that section were combined, and the records for Poland covering 2016-2025 were extracted into a single CSV file.

Data Loading and Cleaning

rm(list = ls())

library(tidyverse)
library(lubridate)

air_raw <- read.csv("Poland air traffic.csv")

air_clean <- air_raw %>%
  rename(
    year = YEAR,
    month_num = MONTH_NUM,
    month_name = MONTH_MON,
    flight_date = FLT_DATE,
    airport_icao = APT_ICAO,
    airport_name = APT_NAME,
    country = STATE_NAME,
    departures = FLT_DEP_1,
    arrivals = FLT_ARR_1,
    total_flights = FLT_TOT_1,
    departures_ifr = FLT_DEP_IFR_2,
    arrivals_ifr = FLT_ARR_IFR_2,
    total_ifr = FLT_TOT_IFR_2
  ) %>%
  filter(year != 2025) %>%
  mutate(
    departures = ifelse(is.na(departures), 0, departures),
    arrivals = ifelse(is.na(arrivals), 0, arrivals),
    total_flights = ifelse(is.na(total_flights), 0, total_flights),
    departures_ifr = ifelse(is.na(departures_ifr), 0, departures_ifr),
    arrivals_ifr = ifelse(is.na(arrivals_ifr), 0, arrivals_ifr),
    total_ifr = ifelse(is.na(total_ifr), 0, total_ifr)
  )

air_traffic <- air_clean

The data were cleaned by renaming variables for clarity, removing the observations from 2025, and replacing missing traffic values with zeros. A missing count is treated here as zero flights recorded for that airport and period. The 2025 observations were removed because the dataset does not cover the whole of that year. Together, these steps make the dataset consistent and suitable for multivariate analysis.
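The zero-filling step can also be written more compactly, and it helps to count the missing values before overwriting them. The sketch below is illustrative only: the toy table and its values are hypothetical, and it assumes the dplyr and tidyr packages (both part of the tidyverse already loaded above).

```r
library(dplyr)
library(tidyr)

# Hypothetical toy records standing in for the renamed traffic columns
toy <- tibble(
  airport_icao = c("AAAA", "AAAA", "BBBB"),
  departures   = c(10, NA, 3),
  arrivals     = c(12, 7, NA)
)

# Count missing values per traffic column before replacing them
na_counts <- toy %>%
  summarise(across(c(departures, arrivals), ~ sum(is.na(.x))))

# Replace NA with 0 in every listed traffic column in one step
toy_clean <- toy %>%
  mutate(across(c(departures, arrivals), ~ replace_na(.x, 0)))

print(na_counts)
print(toy_clean)
```

Keeping the NA counts makes it possible to report how much of the data was imputed rather than observed.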

Exploratory Data Analysis

ggplot(air_traffic, aes(x = total_flights)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "white") +
  theme_minimal() +
  labs(
    title = "Distribution of Total Flights",
    x = "Total Flights",
    y = "Frequency"
  )

The distribution of total flights is highly right-skewed. Because each observation is one airport in one month, most airport-month records show low traffic volumes, while a small number account for a disproportionately large number of flights. This suggests the presence of major hubs alongside many smaller regional airports.

ggplot(air_traffic, aes(x = departures, y = arrivals)) +
  geom_point(alpha = 0.3) +
  theme_minimal() +
  labs(
    title = "Arrivals vs Departures",
    x = "Departures",
    y = "Arrivals"
  )

A strong linear relationship between arrivals and departures is observed, indicating substantial redundancy among traffic variables. This further motivates the use of dimension reduction techniques.
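The redundancy seen in the scatterplot can also be quantified with a correlation matrix. The following is a minimal sketch on simulated counts, not on the real dataset; the variable names mirror the cleaned columns, but the numbers are made up for illustration.

```r
set.seed(1)

# Simulated monthly counts: arrivals track departures with small noise
departures <- rpois(200, lambda = 50)
arrivals   <- departures + sample(-2:2, 200, replace = TRUE)

toy <- data.frame(
  departures    = departures,
  arrivals      = arrivals,
  total_flights = departures + arrivals
)

# Pairwise Pearson correlations between the traffic variables
cor_matrix <- cor(toy)
print(round(cor_matrix, 3))
```

Off-diagonal values close to 1 indicate that the variables carry nearly the same information, which is the situation PCA is designed to compress.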

Airport-Level Aggregation

To compare airports as independent units, the data were aggregated at the airport level using mean values. Mean traffic measures were preferred over totals to avoid bias due to unequal numbers of observations across airports.

airport_summary <- air_traffic %>%
  group_by(airport_icao, airport_name) %>%
  summarise(
    mean_departures = mean(departures),
    mean_arrivals = mean(arrivals),
    mean_total_flights = mean(total_flights),
    mean_dep_ifr = mean(departures_ifr),
    mean_arr_ifr = mean(arrivals_ifr),
    mean_total_ifr = mean(total_ifr),
    .groups = "drop"
  ) %>%
  mutate(
    ifr_share = ifelse(mean_total_flights > 0,
                       mean_total_ifr / mean_total_flights,
                       0)
  )

This transformation results in one observation per airport and forms the basis for subsequent PCA.
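To illustrate why means were preferred over totals, consider two hypothetical airports with identical monthly traffic but different numbers of recorded months. This is a small sketch with invented values, assuming dplyr is available.

```r
library(dplyr)

# Hypothetical data: same traffic per month, but BBBB has only 6 months on record
toy <- tibble(
  airport_icao  = c(rep("AAAA", 12), rep("BBBB", 6)),
  total_flights = 100
)

toy_summary <- toy %>%
  group_by(airport_icao) %>%
  summarise(
    n_months   = n(),
    sum_total  = sum(total_flights),   # depends on how many months are recorded
    mean_total = mean(total_flights),  # comparable across airports
    .groups = "drop"
  )

print(toy_summary)
```

The totals differ by a factor of two purely because of coverage, while the means are equal, which is the behaviour wanted when comparing airports.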

Principal Component Analysis

Before applying PCA, all variables were standardized to ensure equal contribution to the analysis, as they are measured on different scales.

airport_features <- airport_summary %>%
  select(
    mean_departures,
    mean_arrivals,
    mean_total_flights,
    mean_dep_ifr,
    mean_arr_ifr,
    mean_total_ifr,
    ifr_share
  )

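# Standardize each feature to mean 0 and standard deviation 1
airport_scaled <- scale(airport_features)

# The input is already standardized, so center and scale. leave it unchanged here
pca_airports <- prcomp(airport_scaled, center = TRUE, scale. = TRUE)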

summary(pca_airports)

## Importance of components:
##                           PC1    PC2       PC3       PC4       PC5 PC6 PC7
## Standard deviation     2.5878 0.5505 0.0003133 1.117e-16 1.436e-17   0   0
## Proportion of Variance 0.9567 0.0433 0.0000000 0.000e+00 0.000e+00   0   0
## Cumulative Proportion  0.9567 1.0000 1.0000000 1.000e+00 1.000e+00   1   1

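The PCA results show that the first principal component explains 95.7% of the variance and the second a further 4.3%, so two components capture essentially all of the variation in the seven variables. The remaining components have standard deviations at or near zero, which is consistent with several variables being linear combinations of others (for example, total flights as the sum of departures and arrivals). The dimensionality of the dataset can therefore be reduced substantially without significant loss of information.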
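The near-zero standard deviations in the output are what one would expect when some variables are exact sums of others. The sketch below reproduces the effect on simulated data (the values are invented) and also checks that standardizing with scale() before calling prcomp() with scale. = TRUE gives the same result as standardizing once.

```r
set.seed(42)
n <- 30

# Simulated counts where the total is an exact sum of the other two columns
dep <- rpois(n, 40)
arr <- rpois(n, 40)
toy <- data.frame(dep = dep, arr = arr, total = dep + arr)

# PCA on standardized variables
pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)
print(pca_toy$sdev)  # the last value is numerically zero

# Standardizing first and then again inside prcomp() changes nothing
pca_twice <- prcomp(scale(toy), center = TRUE, scale. = TRUE)
same_sdev <- isTRUE(all.equal(pca_toy$sdev, pca_twice$sdev))
print(same_sdev)
```

Three columns with one exact linear dependency have only two directions of real variation, so the third component carries no variance.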

Scree Plot and Variance Explained

eigenvalues <- pca_airports$sdev^2

variance_df <- data.frame(
  PC = paste0("PC", seq_along(eigenvalues)),
  Variance = eigenvalues / sum(eigenvalues)
)

ggplot(variance_df, aes(x = PC, y = Variance)) +
  geom_col(fill = "steelblue") +
  geom_line(aes(group = 1)) +
  geom_point() +
  theme_minimal() +
  labs(
    title = "Scree Plot",
    x = "Principal Component",
    y = "Proportion of Variance Explained"
  )

The scree plot indicates a sharp drop in explained variance after the first component, and components beyond the second contribute essentially nothing. Based on this pattern, the first two principal components are retained for further analysis.
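The retention decision can be cross-checked against cumulative explained variance. The sketch below uses the standard deviations printed in the PCA summary above; the 95% and 99% thresholds are illustrative assumptions, not values taken from the analysis.

```r
# Standard deviations copied from the summary(pca_airports) output above
sdev <- c(2.5878, 0.5505, 0.0003133, 1.117e-16, 1.436e-17, 0, 0)

# Eigenvalues are squared standard deviations
eigenvalues <- sdev^2
prop_var    <- eigenvalues / sum(eigenvalues)
cum_var     <- cumsum(prop_var)

# Smallest number of components reaching a given share of variance
n_for_95 <- which(cum_var >= 0.95)[1]  # assumed threshold
n_for_99 <- which(cum_var >= 0.99)[1]  # assumed threshold

print(round(cum_var, 4))
print(c(n_for_95 = n_for_95, n_for_99 = n_for_99))
```

A single component already passes the looser threshold, but keeping two retains practically all of the variance and allows a two-dimensional plot.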

PCA Visualization and Interpretation

pca_scores <- as.data.frame(pca_airports$x) %>%
  mutate(
    airport_icao = airport_summary$airport_icao,
    airport_name = airport_summary$airport_name
  )

pc1_var <- round(variance_df$Variance[1] * 100, 1)
pc2_var <- round(variance_df$Variance[2] * 100, 1)

ggplot(pca_scores, aes(x = PC1, y = PC2)) +
  geom_point(size = 3, alpha = 0.8) +
  theme_minimal() +
  labs(
    title = "PCA of Polish Airports",
    x = paste0("PC1 (", pc1_var, "% variance)"),
    y = paste0("PC2 (", pc2_var, "% variance)")
  )

The PCA scatterplot reveals clear separation between airports based on traffic characteristics. The first principal component primarily reflects overall traffic volume, separating major hubs from smaller airports. The second principal component captures differences in operational structure, particularly related to IFR traffic.
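Interpretations of this kind are usually checked against the component loadings, which prcomp() stores in the rotation element. The sketch below shows the idea on simulated data; the three variables and their distributions are hypothetical and only mimic a volume-plus-share structure.

```r
set.seed(7)
n <- 25

# Simulated airports: two volume variables that move together, one share variable
volume <- rlnorm(n, meanlog = 5, sdlog = 1)
toy <- data.frame(
  mean_departures = volume / 2,
  mean_arrivals   = volume / 2 * runif(n, 0.95, 1.05),
  ifr_share       = runif(n, 0.2, 1)
)

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Loadings: rows are variables, columns are principal components
loadings <- pca_toy$rotation
print(round(loadings[, 1:2], 2))

# Variable with the largest absolute loading on each component
top_var <- apply(abs(loadings), 2, function(col) rownames(loadings)[which.max(col)])
print(top_var)
```

In the real analysis the same inspection would be run on pca_airports$rotation to see which variables drive PC1 and PC2.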

Clustering Analysis

To further explore similarities between airports, k-means clustering was applied in the reduced PCA space.

set.seed(123)

pca_for_clustering <- pca_scores %>% select(PC1, PC2)

kmeans_model <- kmeans(pca_for_clustering, centers = 3, nstart = 25)

pca_clustered <- pca_scores %>%
  mutate(cluster = factor(kmeans_model$cluster))

ggplot(pca_clustered, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(
    title = "Airport Clusters Based on PCA",
    x = paste0("PC1 (", pc1_var, "% variance)"),
    y = paste0("PC2 (", pc2_var, "% variance)")
  )

The clustering results reveal three distinct groups of airports. These clusters can be interpreted as major international hubs, medium-sized regional airports, and small local airports with low traffic volumes.
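The choice of three clusters can be examined with an elbow check on the total within-cluster sum of squares. The following is a sketch on synthetic two-dimensional scores with three planted groups, not on the real PCA scores; the group centres and spreads are invented for illustration.

```r
set.seed(123)

# Synthetic scores: three well-separated groups in a two-dimensional space
scores <- rbind(
  cbind(rnorm(10, -2, 0.3), rnorm(10,  0, 0.3)),
  cbind(rnorm(10,  0, 0.3), rnorm(10,  1, 0.3)),
  cbind(rnorm(10,  4, 0.3), rnorm(10, -1, 0.3))
)
colnames(scores) <- c("PC1", "PC2")

# Total within-cluster sum of squares for a range of candidate k
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(scores, centers = k, nstart = 25)$tot.withinss
})

elbow_df <- data.frame(k = k_values, wss = wss)
print(elbow_df)
```

A pronounced flattening of the curve after a given k suggests that additional clusters add little; plotting wss against k makes the bend easier to see.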

Discussion

The combined use of PCA and clustering provides a clear and interpretable framework for analyzing airport traffic data. PCA successfully reduced dimensionality while preserving key patterns, and clustering highlighted meaningful groupings of airports based on both traffic volume and operational characteristics.

Limitations

This analysis relies on aggregated average values, which may obscure seasonal or short-term variations. Additionally, PCA assumes linear relationships among variables, and the choice of cluster number involves some subjectivity. Furthermore, the dataset does not include passenger numbers, which limits the scope for a more in-depth analysis.

Conclusion

This project demonstrates how unsupervised learning methods can be effectively applied to real-world transportation data. By combining PCA and clustering, the analysis identified clear patterns and meaningful similarities between Polish airports, showing the value of dimension reduction for exploratory data analysis.

Airport Similarity Analysis Using Unsupervised Learning

Mohib Usman

2026-01-08