# Load only the packages actually used in this report.
library(tidyverse) # data manipulation (dplyr, tidyr) and visualisation (ggplot2)
library(data.table) # fast reading of the very large (1.8 GB) source CSV files
library(janitor) # clean_names() for consistent snake_case column names
library(skimr) # rich one-line statistical summaries
library(lubridate) # date/time handling
library(scales) # axis formatting (commas, percentages)
library(corrplot) # correlation heatmaps
library(patchwork) # composing multiple ggplots
library(knitr) # kable() tables
library(kableExtra) # styled HTML tables
theme_set(theme_minimal(base_size = 12) +
theme(panel.grid = element_blank()))
report_fill <- "#2c7fb8" # primary colour used throughout
Flight delays remain a persistent challenge in the U.S. aviation industry, causing significant economic losses and passenger dissatisfaction. This project analyzes flight delay patterns using the Aeolus dataset, a comprehensive multi-modal dataset sourced from the U.S. Department of Transportation’s Bureau of Transportation Statistics. The analysis focuses on domestic flights during December 2023 and December 2024, encompassing over 1.13 million flight records with rich operational, meteorological, and airport-level attributes.
The project has two main analytical components. First, exploratory data analysis is conducted to understand delay patterns, weather impacts, and year-over-year trends. Second, predictive models are developed for two complementary tasks: a regression task to predict exact arrival delay in minutes, and a classification task to predict whether a flight will experience a significant delay exceeding 15 minutes. All analyses are implemented using the tidymodels framework in R.
Flight delays increase substantially during the December holiday season due to higher passenger demand, tighter flight schedules, and more frequent winter weather disruptions. These delays create operational challenges for airlines and airports, resulting in increased costs, resource constraints, and reduced service reliability. Understanding the factors that contribute to December flight delays, and how these factors vary across holiday seasons, can support more effective planning and delay mitigation strategies.
The Aeolus dataset provides a comprehensive foundation for this analysis by combining flight operations, weather conditions, and airport information within a single dataset. This integration enables the joint examination of operational and environmental factors, allowing a more complete assessment of the drivers of flight delays.
Flight delays cost the U.S. economy billions of dollars annually, yet accurately predicting delays remains difficult due to the complex interaction of weather conditions, operational factors, airport congestion, and delay propagation across flight networks. During the December holiday travel season, the problem intensifies as flight volumes surge and weather disruptions become more frequent.
Existing research has rarely compared regression and classification approaches on a unified dataset with integrated weather features. Furthermore, delay pattern changes between consecutive holiday seasons remain underexplored. This project addresses these gaps by leveraging the Aeolus dataset to simultaneously investigate both predictive paradigms while focusing specifically on December 2023 and 2024.
The objectives of this project are as follows:
The dataset used in this project is titled Aeolus - A Multi-Modal Flight Delay Dataset. It originates from the U.S. Department of Transportation’s Bureau of Transportation Statistics, which collects and publishes comprehensive data on domestic flight operations. The dataset covers domestic flights within the United States during the month of December for two consecutive years, specifically December 1–31, 2023 and December 1–31, 2024. This time period was selected to focus on the peak holiday travel season, allowing for comparative analysis of delay patterns between two holiday periods.
# December flight counts per year, with a bold total row.
tibble(
Year = c("December 2023", "December 2024", "Total"),
`Number of Flights` = c(558609, 574434, 1133043)
) |>
kable(caption = "December flight records by year.",
format.args = list(big.mark = ","),
align = c("l", "r")) |>
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover")) |>
row_spec(3, bold = TRUE)
| Year | Number of Flights |
|---|---|
| December 2023 | 558,609 |
| December 2024 | 574,434 |
| Total | 1,133,043 |
After filtering to December, the combined dataset comprises
1,133,043 rows and 35 columns (34
original variables plus an engineered year column).
The 35 variables fall into six conceptual categories:
tibble(
Category = c("Flight Identification", "Temporal", "Delay / Operational",
"Weather (Origin)", "Weather (Destination)", "Geospatial"),
Variables = c(
"op_carrier, op_carrier_fl_num, origin, dest, origin_index, dest_index",
"fl_date, crs_dep_time, dep_time, crs_arr_time, arr_time, month, day_of_month, day_of_week, year",
"dep_delay, arr_delay, taxi_out, taxi_in, air_time, crs_elapsed_time, actual_elapsed_time, wheels_off, wheels_on, flights",
"o_temp, o_prcp, o_wspd",
"d_temp, d_prcp, d_wspd",
"o_latitude, o_longitude, d_latitude, d_longitude"
)
) |>
kable(caption = "The 35 variables grouped into six conceptual categories.",
align = c("l", "l")) |>
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover")) |>
column_spec(1, bold = TRUE)
| Category | Variables |
|---|---|
| Flight Identification | op_carrier, op_carrier_fl_num, origin, dest, origin_index, dest_index |
| Temporal | fl_date, crs_dep_time, dep_time, crs_arr_time, arr_time, month, day_of_month, day_of_week, year |
| Delay / Operational | dep_delay, arr_delay, taxi_out, taxi_in, air_time, crs_elapsed_time, actual_elapsed_time, wheels_off, wheels_on, flights |
| Weather (Origin) | o_temp, o_prcp, o_wspd |
| Weather (Destination) | d_temp, d_prcp, d_wspd |
| Geospatial | o_latitude, o_longitude, d_latitude, d_longitude |
The target variables for modelling stage are
arr_delay for regression and is_delayed
(arr_delay > 15) minutes for classification.
# load dataset file paths
path_2023 <- 'flight_with_weather_2023.csv'
path_2024 <- 'flight_with_weather_2024.csv'
# read the data
flight_2023 <- read_csv(path_2023)
flight_2024 <- read_csv(path_2024)
File paths for the 2023 and 2024 datasets are first assigned to
variables path_2023 and path_2024. The read_csv() function
then reads each CSV file into R as a data frame.
# inspect column names 2023
names(flight_2023)
## [1] "FL_DATE" "OP_CARRIER" "OP_CARRIER_FL_NUM"
## [4] "ORIGIN" "DEST" "CRS_DEP_TIME"
## [7] "DEP_TIME" "DEP_DELAY" "TAXI_OUT"
## [10] "WHEELS_OFF" "WHEELS_ON" "TAXI_IN"
## [13] "CRS_ARR_TIME" "ARR_TIME" "ARR_DELAY"
## [16] "CRS_ELAPSED_TIME" "ACTUAL_ELAPSED_TIME" "AIR_TIME"
## [19] "FLIGHTS" "MONTH" "DAY_OF_MONTH"
## [22] "DAY_OF_WEEK" "ORIGIN_INDEX" "DEST_INDEX"
## [25] "O_TEMP" "O_PRCP" "O_WSPD"
## [28] "D_TEMP" "D_PRCP" "D_WSPD"
## [31] "O_LATITUDE" "O_LONGITUDE" "D_LATITUDE"
## [34] "D_LONGITUDE"
# inspect column names 2024
names(flight_2024)
## [1] "FL_DATE" "OP_CARRIER" "OP_CARRIER_FL_NUM"
## [4] "ORIGIN" "DEST" "CRS_DEP_TIME"
## [7] "DEP_TIME" "DEP_DELAY" "TAXI_OUT"
## [10] "WHEELS_OFF" "WHEELS_ON" "TAXI_IN"
## [13] "CRS_ARR_TIME" "ARR_TIME" "ARR_DELAY"
## [16] "CRS_ELAPSED_TIME" "ACTUAL_ELAPSED_TIME" "AIR_TIME"
## [19] "FLIGHTS" "MONTH" "DAY_OF_MONTH"
## [22] "DAY_OF_WEEK" "ORIGIN_INDEX" "DEST_INDEX"
## [25] "O_TEMP" "O_PRCP" "O_WSPD"
## [28] "D_TEMP" "D_PRCP" "D_WSPD"
## [31] "O_LATITUDE" "O_LONGITUDE" "D_LATITUDE"
## [34] "D_LONGITUDE"
# verify both datasets have identical column names and order
identical(names(flight_2023), names(flight_2024))
## [1] TRUE
Column name inspection reveals 34 variables organized into flight
identification, temporal data, delay metrics, weather conditions
(origin/destination), and geospatial coordinates. The
identical() function confirms both datasets have identical
column names and ordering, validating that row-binding can proceed
without alignment issues.
# check unique values for MONTH column 2023 and 2024
unique(flight_2023$MONTH)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
unique(flight_2024$MONTH)
## [1] 1 2 4 5 6 7 8 9 10 11 12
The unique() function was applied to the MONTH column of
the 2023 dataset to identify all distinct month values present. The
output shows months 1 through 12, indicating that the 2023 dataset
contains flight records from all twelve months of the year (January to
December).
However, for year 2024, the output shows months 1, 2, and 4 through 12, but notably month 3 (March) is missing. This indicates that the 2024 dataset does not contain any flight records for March. The reason for this missing month could be due to data unavailability at the time of download or an intentional subset provided by the data source.
# filter to only keep Dec rows
dec_flight_2023 <- flight_2023 %>%
filter(MONTH == 12) %>%
mutate(YEAR = 2023L)
dec_flight_2024 <- flight_2024 %>%
filter(MONTH == 12) %>%
mutate(YEAR = 2024L)
unique(dec_flight_2023$MONTH)
## [1] 12
unique(dec_flight_2024$MONTH)
## [1] 12
cat("December 2023:", nrow(dec_flight_2023), "flights\n")
## December 2023: 558609 flights
cat("December 2024:", nrow(dec_flight_2024), "flights\n")
## December 2024: 574434 flights
The analysis is restricted to December flights to focus on the peak holiday travel period while reducing the dataset to a more manageable size. A YEAR variable is retained to enable direct comparisons between the 2023 and 2024 holiday seasons and to support subsequent temporal analyses.
df <- bind_rows(dec_flight_2023, dec_flight_2024) %>%
janitor::clean_names()
The December 2023 and 2024 datasets were combined into a single dataset, and all variable names were standardized to a consistent naming convention. The resulting dataset contains 1,133,043 flight records and 35 variables, providing a unified foundation for subsequent cleaning, exploration, and modelling activities.
glimpse(df)
## Rows: 1,133,043
## Columns: 35
## $ fl_date <dttm> 2023-12-01, 2023-12-01, 2023-12-01, 2023-12-02, 2…
## $ op_carrier <chr> "DL", "OO", "OO", "DL", "OO", "OO", "DL", "OO", "O…
## $ op_carrier_fl_num <dbl> 1189, 3888, 3920, 1189, 3888, 3920, 1189, 3888, 39…
## $ origin <chr> "DLH", "DLH", "DLH", "DLH", "DLH", "DLH", "DLH", "…
## $ dest <chr> "MSP", "MSP", "MSP", "MSP", "MSP", "MSP", "MSP", "…
## $ crs_dep_time <dttm> 2023-12-01 05:05:00, 2023-12-01 10:44:00, 2023-12…
## $ dep_time <dttm> 2023-12-01 05:23:00, 2023-12-01 10:36:00, 2023-12…
## $ dep_delay <dbl> 18, -8, -5, -7, 0, -5, 1, -6, -8, -8, -7, -5, 0, -…
## $ taxi_out <dbl> 20, 12, 21, 23, 14, 25, 16, 29, 21, 11, 10, 35, 26…
## $ wheels_off <dttm> 2023-12-01 05:43:00, 2023-12-01 10:48:00, 2023-12…
## $ wheels_on <dttm> 2023-12-01 06:14:00, 2023-12-01 11:20:00, 2023-12…
## $ taxi_in <dbl> 3, 4, 5, 6, 7, 4, 4, 5, 5, 3, 3, 3, 5, 5, 8, 3, 4,…
## $ crs_arr_time <dttm> 2023-12-01 06:19:00, 2023-12-01 11:53:00, 2023-12…
## $ arr_time <dttm> 2023-12-01 06:17:00, 2023-12-01 11:24:00, 2023-12…
## $ arr_delay <dbl> -2, -29, -19, -20, -15, -18, -24, -6, -27, -39, -3…
## $ crs_elapsed_time <dbl> 74, 69, 75, 74, 69, 75, 74, 69, 75, 74, 69, 75, 74…
## $ actual_elapsed_time <dbl> 54, 48, 61, 61, 54, 62, 49, 69, 56, 43, 46, 71, 60…
## $ air_time <dbl> 31, 32, 35, 32, 33, 33, 29, 35, 30, 29, 33, 33, 29…
## $ flights <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ month <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12…
## $ day_of_month <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6,…
## $ day_of_week <dbl> 5, 5, 5, 6, 6, 6, 7, 7, 7, 1, 1, 1, 2, 2, 2, 3, 3,…
## $ origin_index <dbl> 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90…
## $ dest_index <dbl> 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, …
## $ o_temp <dbl> -7.2, -6.7, 0.6, -8.3, -9.4, 0.6, -1.1, -0.6, 0.6,…
## $ o_prcp <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ o_wspd <dbl> 7.6, 5.4, 16.6, 0.0, 0.0, 14.8, 14.8, 9.4, 9.4, 5.…
## $ d_temp <dbl> -2.8, -5.6, 2.2, -3.9, -5.6, 3.3, 0.6, 0.6, 2.2, 0…
## $ d_prcp <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ d_wspd <dbl> 5.4, 7.6, 7.6, 0.0, 11.2, 16.6, 5.4, 0.0, 5.4, 0.0…
## $ o_latitude <dbl> 46.84209, 46.84209, 46.84209, 46.84209, 46.84209, …
## $ o_longitude <dbl> -92.19365, -92.19365, -92.19365, -92.19365, -92.19…
## $ d_latitude <dbl> 44.88055, 44.88055, 44.88055, 44.88055, 44.88055, …
## $ d_longitude <dbl> -93.21692, -93.21692, -93.21692, -93.21692, -93.21…
## $ year <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 20…
# Tabulate how many columns fall into each broad data type.
tibble(variable = names(df),
type = map_chr(df, ~ class(.x)[1])) |>
mutate(type_group = case_when(
type %in% c("POSIXct", "Date", "IDate", "ITime") ~ "Datetime",
type %in% c("numeric", "integer", "double") ~ "Numeric",
TRUE ~ "Character")) |>
count(type_group, name = "n_columns") |>
kable(caption = "Number of columns by broad data type.") |>
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| type_group | n_columns |
|---|---|
| Character | 3 |
| Datetime | 7 |
| Numeric | 25 |
The dataset is dominated by numerical flight and weather variables, with seven datetime fields and a small number of categorical identifiers. The datetime fields were successfully parsed, enabling temporal analysis and feature engineering. Preliminary inspection suggests that flights and month may be constant variables and should be examined further.
# Identify columns that carry no information
df |>
summarise(across(everything(), n_distinct)) |>
pivot_longer(everything(), names_to = "variable", values_to = "n_unique") |>
filter(n_unique <= 1) |>
kable(caption = "Constant columns (single unique value = zero predictive value).") |>
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| variable | n_unique |
|---|---|
| flights | 1 |
| month | 1 |
The variables month and flights contain only a single value across all observations and therefore provide no predictive information. These columns are retained during initial exploration for transparency but are removed during data cleaning to reduce unnecessary model complexity and ensure a more efficient modelling process.
# summary of df
skim(df) |> yank("numeric")
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| op_carrier_fl_num | 0 | 1 | 2418.27 | 1622.07 | 1.00 | 1103.00 | 2142.00 | 3562.00 | 8819.00 | ▇▆▃▂▁ |
| dep_delay | 0 | 1 | 10.79 | 51.96 | -99.00 | -6.00 | -2.00 | 8.00 | 3786.00 | ▇▁▁▁▁ |
| taxi_out | 0 | 1 | 18.00 | 9.42 | 1.00 | 12.00 | 16.00 | 21.00 | 184.00 | ▇▁▁▁▁ |
| taxi_in | 0 | 1 | 8.23 | 6.60 | 1.00 | 4.00 | 6.00 | 10.00 | 244.00 | ▇▁▁▁▁ |
| arr_delay | 0 | 1 | 3.74 | 54.02 | -126.00 | -17.00 | -7.00 | 7.00 | 3795.00 | ▇▁▁▁▁ |
| crs_elapsed_time | 0 | 1 | 150.32 | 73.98 | 20.00 | 95.00 | 134.00 | 182.00 | 1003.00 | ▇▂▁▁▁ |
| actual_elapsed_time | 0 | 1 | 143.27 | 73.22 | 16.00 | 90.00 | 127.00 | 175.00 | 743.00 | ▇▃▁▁▁ |
| air_time | 0 | 1 | 117.03 | 71.35 | 6.00 | 65.00 | 100.00 | 147.00 | 673.00 | ▇▂▁▁▁ |
| flights | 0 | 1 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▇▁▁ |
| month | 0 | 1 | 12.00 | 0.00 | 12.00 | 12.00 | 12.00 | 12.00 | 12.00 | ▁▁▇▁▁ |
| day_of_month | 0 | 1 | 15.94 | 8.95 | 1.00 | 8.00 | 16.00 | 24.00 | 31.00 | ▇▇▇▇▇ |
| day_of_week | 0 | 1 | 4.07 | 2.05 | 1.00 | 2.00 | 4.00 | 6.00 | 7.00 | ▇▃▃▅▇ |
| origin_index | 0 | 1 | 159.54 | 87.61 | 0.00 | 85.00 | 174.00 | 235.00 | 321.00 | ▆▇▆▇▆ |
| dest_index | 0 | 1 | 159.55 | 87.60 | 0.00 | 85.00 | 174.00 | 235.00 | 321.00 | ▆▇▆▇▆ |
| o_temp | 216 | 1 | 8.71 | 8.12 | -36.00 | 2.80 | 8.30 | 13.90 | 33.30 | ▁▁▅▇▂ |
| o_prcp | 216 | 1 | 0.12 | 0.68 | 0.00 | 0.00 | 0.00 | 0.00 | 49.50 | ▇▁▁▁▁ |
| o_wspd | 216 | 1 | 11.68 | 8.23 | 0.00 | 5.40 | 11.20 | 16.60 | 79.60 | ▇▃▁▁▁ |
| d_temp | 771 | 1 | 9.20 | 8.12 | -36.00 | 3.30 | 8.90 | 15.00 | 33.90 | ▁▁▅▇▂ |
| d_prcp | 771 | 1 | 0.12 | 0.69 | 0.00 | 0.00 | 0.00 | 0.00 | 49.50 | ▇▁▁▁▁ |
| d_wspd | 771 | 1 | 11.94 | 8.21 | 0.00 | 7.60 | 11.20 | 16.60 | 79.60 | ▇▃▁▁▁ |
| o_latitude | 0 | 1 | 36.40 | 6.00 | 13.48 | 32.90 | 36.12 | 40.69 | 71.29 | ▁▇▇▁▁ |
| o_longitude | 0 | 1 | -95.00 | 18.39 | -176.65 | -111.98 | -87.90 | -80.94 | -64.80 | ▁▁▅▅▇ |
| d_latitude | 0 | 1 | 36.40 | 6.01 | 13.48 | 32.90 | 36.12 | 40.69 | 71.29 | ▁▇▇▁▁ |
| d_longitude | 0 | 1 | -94.99 | 18.39 | -176.65 | -111.98 | -87.90 | -80.94 | -64.80 | ▁▁▅▅▇ |
| year | 0 | 1 | 2023.51 | 0.50 | 2023.00 | 2023.00 | 2024.00 | 2024.00 | 2024.00 | ▇▁▁▁▇ |
The dataset is highly complete, with all core flight and delay variables fully populated. Missing values are limited to a small number of weather-related variables and are examined separately. Preliminary statistics also indicate that departure and arrival delays are strongly right-skewed, with most flights departing and arriving on time or early, but a small number experiencing extremely large delays. This distributional pattern has important implications for subsequent modelling and evaluation.
# check missing values
missing_summary <- df |>
summarise(across(everything(), ~ sum(is.na(.x)))) |>
pivot_longer(everything(), names_to = "variable", values_to = "n_missing") |>
mutate(pct_missing = round(100 * n_missing / nrow(df), 3)) |>
filter(n_missing > 0) |>
arrange(desc(n_missing))
missing_summary |>
kable(caption = "Columns containing missing values.") |>
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| variable | n_missing | pct_missing |
|---|---|---|
| d_temp | 771 | 0.068 |
| d_prcp | 771 | 0.068 |
| d_wspd | 771 | 0.068 |
| o_temp | 216 | 0.019 |
| o_prcp | 216 | 0.019 |
| o_wspd | 216 | 0.019 |
missing_summary |>
ggplot(aes(x = reorder(variable, pct_missing), y = pct_missing)) +
geom_col(fill = report_fill) +
geom_text(aes(label = paste0(n_missing, " (", pct_missing, "%)")),
hjust = -0.1, size = 3.2) +
coord_flip(ylim = c(0, max(missing_summary$pct_missing) * 1.4)) +
labs(title = "Missing values are confined to the six weather variables",
subtitle = "All operational and delay variables are 100% complete",
x = NULL, y = "% missing")
Missing values are minimal, affecting less than 0.1% of observations and occurring only in weather-related variables. The identical missing counts across origin and destination weather fields suggest that missingness is linked to weather station reporting rather than flight delays themselves. Given the very low proportion of affected records, these missing values are unlikely to influence the analysis significantly. Therefore, rows are retained for EDA, with appropriate imputation to be applied during the modelling phase. Importantly, the target variable (arr_delay) contains no missing values, ensuring that all flights remain available for prediction tasks.
# check for duplicate
n_dupes <- sum(duplicated(df))
cat("Fully duplicated rows:", n_dupes, "\n")
## Fully duplicated rows: 0
No duplicate records were identified in the dataset, indicating that each row represents a unique flight. As a result, no de-duplication is required, and there is minimal risk of double-counting flights in subsequent analyses or modelling activities.
# inspecting values fall within physically plausible ranges.
validity <- tibble(
check = c("Negative air_time",
"Negative taxi_out",
"Negative taxi_in",
"CRS departure timestamps missing",
"Actual departure timestamps missing",
"Months other than December",
"arr_delay outside [-180, 4000] min",
"dep_delay outside [-180, 4000] min"),
n = c(sum(df$air_time < 0, na.rm = TRUE),
sum(df$taxi_out < 0, na.rm = TRUE),
sum(df$taxi_in < 0, na.rm = TRUE),
sum(is.na(df$crs_dep_time)),
sum(is.na(df$dep_time)),
sum(df$month != 12),
sum(df$arr_delay < -180 | df$arr_delay > 4000),
sum(df$dep_delay < -180 | df$dep_delay > 4000)))
validity |>
kable(caption = "Data-validity checks (all counts should be 0).") |>
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| check | n |
|---|---|
| Negative air_time | 0 |
| Negative taxi_out | 0 |
| Negative taxi_in | 0 |
| CRS departure timestamps missing | 0 |
| Actual departure timestamps missing | 0 |
| Months other than December | 0 |
| arr_delay outside [-180, 4000] min | 0 |
| dep_delay outside [-180, 4000] min | 0 |
All data quality checks were passed successfully, with no invalid values, missing key timestamps, out-of-scope records, or implausible delay values detected. This indicates that the dataset is internally consistent and free from major data integrity issues. Consequently, subsequent cleaning efforts can focus on handling genuine extreme delays rather than correcting data errors.
# Quantify the extreme right tail of arrival delay using the IQR rule.
q <- quantile(df$arr_delay, c(.25, .75))
iqr <- diff(q)
upper_fence <- q[2] + 1.5 * iqr
df |>
summarise(
p95 = quantile(arr_delay, 0.95),
p99 = quantile(arr_delay, 0.99),
max = max(arr_delay),
upper_fence = upper_fence,
n_above_fence = sum(arr_delay > upper_fence),
pct_above = round(100 * mean(arr_delay > upper_fence), 2)) |>
kable(caption = "Upper-tail outlier diagnostics for arr_delay (minutes).") |>
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| p95 | p99 | max | upper_fence | n_above_fence | pct_above |
|---|---|---|---|---|---|
| 68 | 193 | 3795 | 43 | 92010 | 8.12 |
# Boxplot of arrival delay, with the x-axis clipped so the body is legible.
ggplot(df, aes(x = arr_delay)) +
geom_boxplot(fill = report_fill, alpha = 0.7, outlier.alpha = 0.05) +
coord_cartesian(xlim = c(-60, 300)) +
labs(title = "Arrival delay is tightly centred with a long right tail",
subtitle = "x-axis clipped to [-60, 300] min; rare delays extend beyond 3,700 min",
x = "Arrival delay (minutes)", y = NULL) +
theme(axis.text.y = element_blank())
Arrival delays exhibit a highly right-skewed distribution, with most flights experiencing little or no delay and a small number of flights experiencing extremely long delays. Although many observations are classified as statistical outliers, they represent genuine operational events rather than data errors and are therefore retained. This distributional pattern highlights the need for modelling approaches that can accommodate skewed data and reinforces the value of the binary delay classification used in this study.
# - drop the two constant columns (month, flights);
# - derive the binary classification target;
# - extract scheduled departure hour and a time-of-day band;
# - add a weekend flag and human-readable weekday labels;
# - construct a route identifier;
# - set year and day_of_week as ordered factors for clean plotting.
weekday_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
df_clean <- df |>
select(-month, -flights) |> # remove zero-information columns
mutate(
year = factor(year),
dep_hour = as.integer(format(crs_dep_time, "%H")),
time_of_day = cut(dep_hour,
breaks = c(-1, 5, 11, 16, 20, 24),
labels = c("Overnight (0-5)", "Morning (6-11)",
"Afternoon (12-16)", "Evening (17-20)",
"Late (21-23)")),
weekday = factor(weekday_labels[day_of_week], levels = weekday_labels),
is_weekend = day_of_week %in% c(6, 7),
route = paste(origin, dest, sep = "-"),
# Classification target: significant arrival delay (> 15 minutes).
is_delayed = factor(if_else(arr_delay > 15, "Delayed", "OnTime"),
levels = c("OnTime", "Delayed"))
)
cat("Cleaned dataset:", nrow(df_clean), "rows x", ncol(df_clean), "columns\n")
## Cleaned dataset: 1133043 rows x 39 columns
glimpse(df_clean |> select(year, dep_hour, time_of_day, weekday, is_weekend,
route, arr_delay, is_delayed))
## Rows: 1,133,043
## Columns: 8
## $ year <fct> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023…
## $ dep_hour <int> 5, 10, 18, 5, 10, 18, 5, 10, 18, 5, 10, 18, 5, 10, 18, 5, …
## $ time_of_day <fct> Overnight (0-5), Morning (6-11), Evening (17-20), Overnigh…
## $ weekday <fct> Fri, Fri, Fri, Sat, Sat, Sat, Sun, Sun, Sun, Mon, Mon, Mon…
## $ is_weekend <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, F…
## $ route <chr> "DLH-MSP", "DLH-MSP", "DLH-MSP", "DLH-MSP", "DLH-MSP", "DL…
## $ arr_delay <dbl> -2, -29, -19, -20, -15, -18, -24, -6, -27, -39, -30, -9, -…
## $ is_delayed <fct> OnTime, OnTime, OnTime, OnTime, OnTime, OnTime, OnTime, On…
Data transformation produced an analysis-ready dataset while preserving all flight records. Constant variables (month and flights) were removed as they provide no predictive value, while new features were engineered to capture temporal patterns, travel routes, and delay outcomes. These transformations were designed to support the study’s research objectives and enhance subsequent analysis and modelling without reducing the dataset size.
saveRDS(df_clean, "df_clean.rds")
# review target distribution
target_tab <- df_clean |>
count(is_delayed) |>
mutate(pct = round(100 * n / sum(n), 2))
target_tab |>
kable(caption = "Class distribution of the delay target (arr_delay > 15 min).",
format.args = list(big.mark = ",")) |>
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| is_delayed | n | pct |
|---|---|---|
| OnTime | 931,892 | 82.25 |
| Delayed | 201,151 | 17.75 |
ggplot(target_tab, aes(is_delayed, n, fill = is_delayed)) +
geom_col(width = 0.6) +
geom_text(aes(label = paste0(comma(n), "\n(", pct, "%)")), vjust = -0.2, size = 3.5) +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
scale_fill_manual(values = c(OnTime = "#41ab5d", Delayed = "#e34a33")) +
labs(title = "About one in six December flights is significantly delayed",
x = NULL, y = "Number of flights") +
theme(legend.position = "none")
Approximately 17.8% of flights are classified as delayed, while 82.2% are on time or early, resulting in a moderately imbalanced dataset. Although the imbalance is not extreme, it is sufficient to make accuracy an unreliable performance measure, as a naive model could achieve high accuracy by predicting all flights as on time. Consequently, classification models are evaluated using metrics such as balanced accuracy, precision, recall, specificity, and Cohen’s Kappa, which remain informative under class imbalance rather than relying on accuracy alone.
ggplot(filter(df_clean, arr_delay >= -60, arr_delay <= 180),
aes(arr_delay)) +
geom_histogram(binwidth = 5, fill = report_fill, colour = "white") +
geom_vline(xintercept = 15, linetype = "dashed", colour = "red") +
annotate("text", x = 22, y = Inf, label = "15-min\nthreshold",
colour = "red", hjust = 0, vjust = 1.2, size = 3) +
scale_y_continuous(labels = comma) +
labs(title = "Arrival-delay distribution is right-skewed and peaks before zero",
subtitle = "Clipped to [-60, 180] min for readability",
x = "Arrival delay (minutes)", y = "Number of flights")
Arrival delays follow a right-skewed distribution, with most flights arriving on time or slightly early and a smaller proportion experiencing substantial delays. The long tail of severe delays highlights the complexity of modelling delay duration and suggests that methods capable of handling skewed outcomes may be more appropriate. The 15-minute threshold provides a practical distinction between routine variation and operationally significant delays, supporting its use as the classification boundary in this study.
ggplot(filter(df_clean, dep_delay >= -40, dep_delay <= 180),
aes(dep_delay)) +
geom_histogram(binwidth = 5, fill = "#756bb1", colour = "white") +
scale_y_continuous(labels = comma) +
labs(title = "Departure delay mirrors arrival delay",
subtitle = "Clipped to [-40, 180] min",
x = "Departure delay (minutes)", y = "Number of flights")
Departure delays exhibit a distribution similar to arrival delays, with most flights departing on time or slightly early and a smaller number experiencing substantial delays. This pattern suggests a strong relationship between departure and arrival delays, as late departures often translate into late arrivals. Consequently, departure delay is expected to be one of the most influential predictors of arrival delay in subsequent modelling analyses.
df_clean |>
count(op_carrier, sort = TRUE) |>
ggplot(aes(reorder(op_carrier, n), n)) +
geom_col(fill = report_fill) +
geom_text(aes(label = comma(n)), hjust = -0.1, size = 2.8) +
coord_flip(clip = "off") +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.18))) +
labs(title = "Flight volume is concentrated among a few large carriers",
x = "Carrier", y = "Number of flights")
Flight activity is concentrated among a small number of major carriers, with the top five airlines accounting for the majority of December flights. This uneven distribution means that some carriers have substantially more observations than others, which may influence how carrier information is encoded during modelling. The relationship between airline and delay performance is explored further in the bivariate analysis.
# Origin-airport weather distributions on a common panel.
df_clean |>
select(o_temp, o_prcp, o_wspd) |>
pivot_longer(everything(), names_to = "var", values_to = "value") |>
mutate(var = recode(var,
o_temp = "Temperature (°C)",
o_prcp = "Precipitation (mm)",
o_wspd = "Wind speed (km/h)")) |>
ggplot(aes(value)) +
geom_histogram(bins = 40, fill = "#2b8cbe", colour = "white") +
facet_wrap(~ var, scales = "free", ncol = 3) +
scale_y_continuous(labels = comma) +
labs(title = "Origin-airport weather: cold, mostly dry, breezy December conditions",
x = NULL, y = "Number of flights")
Weather conditions vary considerably across flights. Temperature follows a broad distribution, reflecting the diverse climates of U.S. airports, while precipitation is highly concentrated at zero, indicating that most flights depart under dry conditions. Wind speed is moderately right-skewed, with occasional periods of strong winds. These patterns suggest that weather-related effects may be non-linear, making threshold-based or categorical representations potentially more informative than simple linear relationships.
# Scatter on a 25k sample with a reference line
set.seed(2024)
df_clean |>
slice_sample(n = 25000) |>
ggplot(aes(dep_delay, arr_delay)) +
geom_point(alpha = 0.08, colour = report_fill) +
geom_abline(slope = 1, intercept = 0, colour = "red", linetype = "dashed") +
coord_cartesian(xlim = c(-50, 300), ylim = c(-50, 300)) +
labs(title = "Departure delay almost perfectly tracks arrival delay (r = 0.97)",
subtitle = "25,000-flight sample; red line is y = x",
x = "Departure delay (minutes)", y = "Arrival delay (minutes)")
Departure delay exhibits an exceptionally strong positive relationship with arrival delay, indicating that flights departing late are highly likely to arrive late. This identifies departure delay as the most influential operational predictor of arrival performance. However, its practical usefulness depends on when predictions are required, as departure delay may not be available for advance forecasting before the flight leaves the gate.
The raw Pearson correlations between weather and
arr_delay are weak (≈ 0.03–0.04), which might suggest
weather is irrelevant. Binning the weather variables reveals a very
different, and much stronger, non-linear picture.
# Delay rate (% > 15 min) across binned weather conditions at the origin.
make_bin_summary <- function(data, var, breaks, labels, varname) {
data |>
filter(!is.na(.data[[var]])) |>
mutate(bin = cut(.data[[var]], breaks = breaks, labels = labels)) |>
group_by(bin) |>
summarise(delay_rate = 100 * mean(arr_delay > 15), .groups = "drop") |>
mutate(driver = varname)
}
prcp_b <- make_bin_summary(df_clean, "o_prcp",
c(-Inf, 0, 1, 5, 10, Inf),
c("0", "0-1", "1-5", "5-10", "10+"), "Precipitation (mm)")
wspd_b <- make_bin_summary(df_clean, "o_wspd",
c(-Inf, 5, 15, 25, 40, Inf),
c("0-5", "5-15", "15-25", "25-40", "40+"), "Wind speed (km/h)")
temp_b <- make_bin_summary(df_clean, "o_temp",
c(-Inf, -10, 0, 10, 20, Inf),
c("<-10", "-10-0", "0-10", "10-20", "20+"), "Temperature (°C)")
bind_rows(prcp_b, wspd_b, temp_b) |>
ggplot(aes(bin, delay_rate, group = driver)) +
geom_col(fill = "#2b8cbe") +
geom_text(aes(label = paste0(round(delay_rate, 1), "%")), vjust = -0.3, size = 2.8) +
facet_wrap(~ driver, scales = "free_x") +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(title = "Weather effects on delay are strong but NON-linear",
subtitle = "Delay rate (% of flights > 15 min late) by binned origin weather",
x = NULL, y = "Delay rate (%)")
Weather conditions are associated with higher delay rates, but the relationship is clearly non-linear. Delay rates increase under heavier precipitation and stronger winds, while temperature exhibits a U-shaped pattern, with both very cold and very warm conditions linked to higher delays. These findings indicate that weather effects are driven by thresholds and extremes rather than simple linear relationships, suggesting that non-linear modelling approaches may be better suited to capturing their impact on flight delays.
df_clean |>
group_by(op_carrier) |>
summarise(delay_rate = 100 * mean(arr_delay > 15),
mean_delay = mean(arr_delay), n = n(), .groups = "drop") |>
ggplot(aes(reorder(op_carrier, delay_rate), delay_rate, fill = mean_delay)) +
geom_col() +
geom_text(aes(label = paste0(round(delay_rate, 1), "%")), hjust = -0.1, size = 2.7) +
coord_flip(clip = "off") +
scale_fill_gradient(low = "#41ab5d", high = "#e34a33",
name = "Mean delay (min)") +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(title = "Punctuality varies widely across carriers",
x = "Carrier", y = "Delay rate (% > 15 min)")
Delay performance varies noticeably across airlines, with some carriers experiencing substantially higher delay rates than others. Notably, airline size does not appear to be strongly associated with punctuality, as high-volume carriers do not necessarily perform better or worse than smaller operators. These findings suggest that carrier-specific operational factors contribute to delay performance, making op_carrier a meaningful predictor for subsequent modelling.
year_summary <- df_clean |>
group_by(year) |>
summarise(n_flights = n(),
mean_arr_delay = round(mean(arr_delay), 2),
median_arr = median(arr_delay),
delay_rate_pct = round(100 * mean(arr_delay > 15), 2),
mean_dep_delay = round(mean(dep_delay), 2),
.groups = "drop")
year_summary |>
kable(caption = "Year-over-year delay comparison",
format.args = list(big.mark = ",")) |>
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| year | n_flights | mean_arr_delay | median_arr | delay_rate_pct | mean_dep_delay |
|---|---|---|---|---|---|
| 2023 | 558,609 | 0.43 | -8 | 15.05 | 8.16 |
| 2024 | 574,434 | 6.95 | -6 | 20.38 | 13.34 |
ggplot(year_summary, aes(year, delay_rate_pct, fill = year)) +
geom_col(width = 0.55) +
geom_text(aes(label = paste0(delay_rate_pct, "%")), vjust = -0.3, size = 4) +
scale_fill_manual(values = c("2023" = "#41ab5d", "2024" = "#e34a33")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(title = "December 2024 Experienced Higher Delay Rates Than December 2023",
x = NULL, y = "Delay rate (% > 15 min)") +
theme(legend.position = "none")
Flight performance deteriorated noticeably between December 2023 and December 2024. Average arrival and departure delays increased, while the proportion of flights delayed by more than 15 minutes rose substantially despite only a modest increase in flight volume. These findings indicate that December 2024 was both busier and less punctual than the previous year. As a result, the year variable is likely to be informative for modelling, and both seasons should be represented during model training and evaluation to ensure the model can generalize across changing operating conditions.
corr_vars <- c("arr_delay", "dep_delay", "taxi_out", "taxi_in", "air_time",
"crs_elapsed_time", "actual_elapsed_time",
"o_temp", "o_prcp", "o_wspd", "d_temp", "d_prcp", "d_wspd",
"day_of_month", "day_of_week")
cmat <- cor(select(df_clean, all_of(corr_vars)), use = "pairwise.complete.obs")
corrplot(cmat, method = "color", type = "upper",
tl.col = "black", tl.cex = 0.75, tl.srt = 45,
addCoef.col = "grey20", number.cex = 0.55,
col = colorRampPalette(c("#2166ac", "white", "#b2182b"))(200),
mar = c(0, 0, 1.5, 0),
title = "Correlation matrix of numeric variables")
The correlation analysis reinforces earlier findings that departure delay is the strongest predictor of arrival delay, exhibiting a substantially stronger relationship than any other variable. It also reveals high correlations among several flight duration measures, indicating potential redundancy and multicollinearity. In contrast, weather variables show only weak linear relationships with delays, supporting the earlier conclusion that their effects are likely non-linear. These findings suggest that redundant duration variables should be reduced and that modelling approaches less sensitive to correlated predictors may be more appropriate than standard linear regression.
df_clean |>
group_by(year, dep_hour) |>
summarise(delay_rate = 100 * mean(arr_delay > 15), n = n(), .groups = "drop") |>
filter(n >= 200) |>
ggplot(aes(dep_hour, delay_rate, colour = year)) +
geom_line(linewidth = 1) +
geom_point(size = 1.3) +
scale_colour_manual(values = c("2023" = "#41ab5d", "2024" = "#e34a33")) +
scale_x_continuous(breaks = seq(0, 23, 3)) +
labs(title = "Delays build through the day and 2024 sits above 2023 at every hour",
x = "Scheduled departure hour", y = "Delay rate (% > 15 min)",
colour = "Year")
Delay rates increase steadily throughout the day, with early-morning flights experiencing the lowest risk of delay and evening departures the highest. This pattern reflects the accumulation and propagation of operational disruptions as the day progresses, highlighting the importance of departure time as a predictor of delays. Additionally, delay rates are consistently higher in 2024 across nearly all time periods, indicating that the year-over-year decline in punctuality was widespread rather than limited to specific parts of the day.
df_clean |>
group_by(day_of_month) |>
summarise(delay_rate = 100 * mean(arr_delay > 15), .groups = "drop") |>
ggplot(aes(day_of_month, delay_rate)) +
geom_line(colour = report_fill, linewidth = 1) +
geom_point(colour = report_fill) +
scale_x_continuous(breaks = seq(1, 31, 2)) +
labs(title = "Delay risk spikes around the post-Christmas travel surge",
x = "Day of December", y = "Delay rate (% > 15 min)")
Delay rates vary throughout the month, with the highest levels occurring during the late-December holiday travel period. This pattern suggests that holiday-related travel demand is a key contributor to flight delays, while early December remains comparatively less congested. The findings indicate that day-of-month effects contain meaningful predictive information and should be considered in subsequent modelling.
This phase develops predictive models addressing two complementary analytical objectives derived from the research questions:
Both objectives are evaluated under two modelling scenarios:
dep_delay),
taxi times, and air time. This represents a gate-arrival estimation
setting once the flight is airborne.The objective of the classification task is to predict whether a flight will arrive on time or experience a significant arrival delay. A binary target variable, is_delayed, was created from arrival delay, where flights arriving more than 15 minutes late were classified as Delayed and all others as OnTime.
Two modelling scenarios were evaluated. Model A utilised operational, temporal, geographical, and weather-related variables, while Model B additionally included departure delay (dep_delay). Logistic Regression and Random Forest were implemented and compared using Accuracy, Precision, Recall (Sensitivity), Balanced Accuracy, and Cohen’s Kappa.
Missing values in six weather-related variables (o_temp, o_prcp, o_wspd, d_temp, d_prcp, and d_wspd) were handled using median imputation. This approach ensured complete data for model training while remaining robust to outliers. Following imputation, no missing values remained in the dataset.
#=========================================================
# Load Cleaned Dataset
#=========================================================
df_clean <- readRDS("df_clean.rds")
dim(df_clean)
## [1] 1133043 39
#=========================================================
# Missing Value Treatment
#=========================================================
weather_cols <- c(
"o_temp", "o_prcp", "o_wspd",
"d_temp", "d_prcp", "d_wspd"
)
for(col in weather_cols) {
df_clean[[col]][is.na(df_clean[[col]])] <-
median(df_clean[[col]], na.rm = TRUE)
}
colSums(is.na(df_clean))
## fl_date op_carrier op_carrier_fl_num origin
## 0 0 0 0
## dest crs_dep_time dep_time dep_delay
## 0 0 0 0
## taxi_out wheels_off wheels_on taxi_in
## 0 0 0 0
## crs_arr_time arr_time arr_delay crs_elapsed_time
## 0 0 0 0
## actual_elapsed_time air_time day_of_month day_of_week
## 0 0 0 0
## origin_index dest_index o_temp o_prcp
## 0 0 0 0
## o_wspd d_temp d_prcp d_wspd
## 0 0 0 0
## o_latitude o_longitude d_latitude d_longitude
## 0 0 0 0
## year dep_hour time_of_day weekday
## 0 0 0 0
## is_weekend route is_delayed
## 0 0 0
Variables that directly revealed the final arrival outcome were excluded to prevent target leakage. Consequently, arr_delay was removed because it was used to construct the target variable, is_delayed.
The route variable was also excluded as its information was already represented by the origin and dest variables. Finally, categorical variables were converted to factors to ensure compatibility with the classification algorithms.
#=========================================================
# Model A Dataset
# Excludes departure delay
#=========================================================
modelA_df <- df_clean %>%
select(
is_delayed,
op_carrier,
day_of_month,
day_of_week,
year,
dep_hour,
time_of_day,
weekday,
is_weekend,
o_temp,
o_prcp,
o_wspd,
d_temp,
d_prcp,
d_wspd,
o_latitude,
o_longitude,
d_latitude,
d_longitude
)
glimpse(modelA_df)
## Rows: 1,133,043
## Columns: 19
## $ is_delayed <fct> OnTime, OnTime, OnTime, OnTime, OnTime, OnTime, OnTime, O…
## $ op_carrier <chr> "DL", "OO", "OO", "DL", "OO", "OO", "DL", "OO", "OO", "DL…
## $ day_of_month <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, …
## $ day_of_week <dbl> 5, 5, 5, 6, 6, 6, 7, 7, 7, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, …
## $ year <fct> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
## $ dep_hour <int> 5, 10, 18, 5, 10, 18, 5, 10, 18, 5, 10, 18, 5, 10, 18, 5,…
## $ time_of_day <fct> Overnight (0-5), Morning (6-11), Evening (17-20), Overnig…
## $ weekday <fct> Fri, Fri, Fri, Sat, Sat, Sat, Sun, Sun, Sun, Mon, Mon, Mo…
## $ is_weekend <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ o_temp <dbl> -7.2, -6.7, 0.6, -8.3, -9.4, 0.6, -1.1, -0.6, 0.6, 0.0, -…
## $ o_prcp <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.…
## $ o_wspd <dbl> 7.6, 5.4, 16.6, 0.0, 0.0, 14.8, 14.8, 9.4, 9.4, 5.4, 0.0,…
## $ d_temp <dbl> -2.8, -5.6, 2.2, -3.9, -5.6, 3.3, 0.6, 0.6, 2.2, 0.0, -3.…
## $ d_prcp <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.…
## $ d_wspd <dbl> 5.4, 7.6, 7.6, 0.0, 11.2, 16.6, 5.4, 0.0, 5.4, 0.0, 7.6, …
## $ o_latitude <dbl> 46.84209, 46.84209, 46.84209, 46.84209, 46.84209, 46.8420…
## $ o_longitude <dbl> -92.19365, -92.19365, -92.19365, -92.19365, -92.19365, -9…
## $ d_latitude <dbl> 44.88055, 44.88055, 44.88055, 44.88055, 44.88055, 44.8805…
## $ d_longitude <dbl> -93.21692, -93.21692, -93.21692, -93.21692, -93.21692, -9…
#=========================================================
# Train-Test Split
#=========================================================
set.seed(123)
split_obj <- initial_split(
modelA_df,
prop = 0.80,
strata = is_delayed
)
train_data <- training(split_obj)
test_data <- testing(split_obj)
dim(train_data)
## [1] 906433 19
dim(test_data)
## [1] 226610 19
#=========================================================
# Prepare Data for Classification Models
#=========================================================
train_data <- train_data %>%
mutate(across(where(is.character), as.factor))
test_data <- test_data %>%
mutate(across(where(is.character), as.factor))
train_data$is_weekend <- as.factor(train_data$is_weekend)
test_data$is_weekend <- as.factor(test_data$is_weekend)
str(train_data)
## tibble [906,433 × 19] (S3: tbl_df/tbl/data.frame)
## $ is_delayed : Factor w/ 2 levels "OnTime","Delayed": 2 2 2 2 2 2 2 2 2 2 ...
## $ op_carrier : Factor w/ 15 levels "9E","AA","AS",..: 12 12 12 12 12 12 12 12 5 5 ...
## $ day_of_month: num [1:906433] 6 9 19 24 26 27 30 30 3 4 ...
## $ day_of_week : num [1:906433] 3 6 2 7 2 3 6 6 7 1 ...
## $ year : Factor w/ 2 levels "2023","2024": 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_hour : int [1:906433] 18 17 5 5 5 17 17 10 13 12 ...
## $ time_of_day : Factor w/ 5 levels "Overnight (0-5)",..: 4 4 1 1 1 4 4 2 3 3 ...
## $ weekday : Factor w/ 7 levels "Mon","Tue","Wed",..: 3 6 2 7 2 3 6 6 7 1 ...
## $ is_weekend : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 1 2 2 2 1 ...
## $ o_temp : num [1:906433] 1.1 1.7 -11.7 5 2.8 1.1 -2.2 -1.7 16.7 5.6 ...
## $ o_prcp : num [1:906433] 0 0.3 0 0 0.5 0 0 0 0 0 ...
## $ o_wspd : num [1:906433] 27.7 11.2 13 18.4 38.9 14.8 16.6 11.2 7.6 5.4 ...
## $ d_temp : num [1:906433] 7.2 2.2 -8.3 7.8 10 2.2 -2.2 -1.7 1.1 -3.3 ...
## $ d_prcp : num [1:906433] 0 0.3 0 0.4 1.5 0 0 0 0 0 ...
## $ d_wspd : num [1:906433] 22.3 22.3 9.4 24.1 25.9 0 20.5 9.4 7.6 13 ...
## $ o_latitude : num [1:906433] 46.8 46.8 46.8 46.8 46.8 ...
## $ o_longitude : num [1:906433] -92.2 -92.2 -92.2 -92.2 -92.2 ...
## $ d_latitude : num [1:906433] 44.9 44.9 44.9 44.9 44.9 ...
## $ d_longitude : num [1:906433] -93.2 -93.2 -93.2 -93.2 -93.2 ...
Logistic Regression was selected as the baseline classification model due to its simplicity and interpretability. The model estimates the probability that a flight belongs to the delayed class based on the relationship between the predictor variables and the binary target variable.
Model A was trained using operational, temporal, geographical, and weather-related variables. Model performance was evaluated using Accuracy, Precision, Recall, Balanced Accuracy, and Cohen’s Kappa.
#=========================================================
# Logistic Regression - Model A
#=========================================================
# Purpose:
# Establish a baseline classification model using
# operational, temporal, geographical and weather
# variables without departure delay information.
#=========================================================
log_model <- glm(
is_delayed ~ .,
data = train_data,
family = binomial
)
summary(log_model)
##
## Call:
## glm(formula = is_delayed ~ ., family = binomial, data = train_data)
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8789999 0.0500054 -57.574 < 2e-16 ***
## op_carrierAA 0.3837809 0.0190314 20.166 < 2e-16 ***
## op_carrierAS 0.5860142 0.0243522 24.064 < 2e-16 ***
## op_carrierB6 0.7621720 0.0213970 35.620 < 2e-16 ***
## op_carrierDL 0.0194867 0.0191469 1.018 0.30880
## op_carrierF9 0.5438666 0.0231450 23.498 < 2e-16 ***
## op_carrierG4 0.4795950 0.0352921 13.589 < 2e-16 ***
## op_carrierHA 0.5310647 0.0378163 14.043 < 2e-16 ***
## op_carrierMQ 0.1633752 0.0229987 7.104 1.21e-12 ***
## op_carrierNK 0.5726918 0.0221659 25.837 < 2e-16 ***
## op_carrierOH 0.3898769 0.0226217 17.235 < 2e-16 ***
## op_carrierOO 0.4025351 0.0198905 20.238 < 2e-16 ***
## op_carrierUA 0.1494766 0.0196600 7.603 2.89e-14 ***
## op_carrierWN 0.3262375 0.0187275 17.420 < 2e-16 ***
## op_carrierYX -0.1533109 0.0229023 -6.694 2.17e-11 ***
## day_of_month 0.0212945 0.0003183 66.903 < 2e-16 ***
## day_of_week 0.0572587 0.0016754 34.176 < 2e-16 ***
## year2024 0.4081505 0.0057490 70.995 < 2e-16 ***
## dep_hour 0.0638555 0.0020154 31.684 < 2e-16 ***
## time_of_dayMorning (6-11) 0.1326760 0.0208922 6.350 2.15e-10 ***
## time_of_dayAfternoon (12-16) 0.1636631 0.0270586 6.048 1.46e-09 ***
## time_of_dayEvening (17-20) 0.1098965 0.0337384 3.257 0.00112 **
## time_of_dayLate (21-23) -0.1708269 0.0401194 -4.258 2.06e-05 ***
## weekdayTue -0.0884717 0.0103649 -8.536 < 2e-16 ***
## weekdayWed -0.0168882 0.0099229 -1.702 0.08876 .
## weekdayThu 0.0561261 0.0090770 6.183 6.28e-10 ***
## weekdayFri 0.0424225 0.0087182 4.866 1.14e-06 ***
## weekdaySat 0.0071342 0.0093160 0.766 0.44380
## weekdaySun NA NA NA NA
## is_weekendTRUE NA NA NA NA
## o_temp -0.0012473 0.0005803 -2.149 0.03161 *
## o_prcp 0.1354156 0.0036347 37.256 < 2e-16 ***
## o_wspd 0.0062396 0.0003514 17.758 < 2e-16 ***
## d_temp 0.0024596 0.0005781 4.255 2.09e-05 ***
## d_prcp 0.1185959 0.0036137 32.818 < 2e-16 ***
## d_wspd 0.0079420 0.0003488 22.770 < 2e-16 ***
## o_latitude 0.0016229 0.0007632 2.126 0.03347 *
## o_longitude 0.0018834 0.0002128 8.849 < 2e-16 ***
## d_latitude -0.0065656 0.0007689 -8.538 < 2e-16 ***
## d_longitude 0.0063090 0.0002134 29.568 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 847749 on 906432 degrees of freedom
## Residual deviance: 812210 on 906395 degrees of freedom
## AIC: 812286
##
## Number of Fisher Scoring iterations: 4
#=========================================================
# Generate Prediction Probabilities
#=========================================================
# Predict the probability that each flight belongs
# to the delayed class.
#=========================================================
log_prob <- predict(
log_model,
newdata = test_data,
type = "response"
)
summary(log_prob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02276 0.12126 0.16458 0.17751 0.22017 0.99670
#=========================================================
# Convert Probabilities into Class Labels
#=========================================================
# Flights with probability greater than 0.50 are
# classified as Delayed; otherwise OnTime.
#=========================================================
log_pred <- ifelse(
log_prob > 0.5,
"Delayed",
"OnTime"
)
log_pred <- factor(
log_pred,
levels = levels(test_data$is_delayed)
)
table(log_pred)
## log_pred
## OnTime Delayed
## 226123 487
#=========================================================
# Evaluate Logistic Regression Model A
#=========================================================
log_cm <- confusionMatrix(
log_pred,
test_data$is_delayed,
positive = "Delayed"
)
log_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction OnTime Delayed
## OnTime 186166 39957
## Delayed 213 274
##
## Accuracy : 0.8227
## 95% CI : (0.8212, 0.8243)
## No Information Rate : 0.8225
## P-Value [Acc > NIR] : 0.3699
##
## Kappa : 0.0093
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.006811
## Specificity : 0.998857
## Pos Pred Value : 0.562628
## Neg Pred Value : 0.823295
## Prevalence : 0.177534
## Detection Rate : 0.001209
## Detection Prevalence : 0.002149
## Balanced Accuracy : 0.502834
##
## 'Positive' Class : Delayed
##
Logistic Regression Model A achieved an accuracy of 82.27%. However, despite the relatively high accuracy, the model performed poorly in identifying delayed flights, achieving a recall of only 0.68%. In contrast, specificity reached 99.89%, indicating that the model primarily predicted the majority class (OnTime). This behaviour is reflected in the low balanced accuracy (50.28%) and Cohen’s Kappa (0.0093). Overall, Logistic Regression Model A served as a useful baseline but was ineffective for detecting flight delays in this dataset.
Random Forest was implemented to capture nonlinear relationships and interactions among the predictor variables. Unlike Logistic Regression, Random Forest combines multiple decision trees to improve predictive performance and robustness.
Model A was trained using operational, temporal, geographical, and weather-related variables. A Random Forest classifier consisting of 200 trees was developed and evaluated using the same training and testing datasets as Logistic Regression Model A.
#=========================================================
# Random Forest Model A
#=========================================================
library(ranger)
rf_model <- ranger(
is_delayed ~ .,
data = train_data,
num.trees = 200,
probability = TRUE,
importance = "impurity",
seed = 123
)
## Growing trees.. Progress: 29%. Estimated remaining time: 1 minute, 15 seconds.
## Growing trees.. Progress: 58%. Estimated remaining time: 44 seconds.
## Growing trees.. Progress: 87%. Estimated remaining time: 13 seconds.
rf_model
## Ranger result
##
## Call:
## ranger(is_delayed ~ ., data = train_data, num.trees = 200, probability = TRUE, importance = "impurity", seed = 123)
##
## Type: Probability estimation
## Number of trees: 200
## Sample size: 906433
## Number of independent variables: 18
## Mtry: 4
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.123461
#=========================================================
# Generate Random Forest Predictions
#=========================================================
# Predict class probabilities for each flight in the
# test dataset using the trained Random Forest model.
#=========================================================
rf_pred_prob <- predict(
rf_model,
data = test_data
)$predictions
head(rf_pred_prob)
## OnTime Delayed
## [1,] 0.9351981 0.06480194
## [2,] 0.9374902 0.06250980
## [3,] 0.8094519 0.19054806
## [4,] 0.9405006 0.05949938
## [5,] 0.9270507 0.07294933
## [6,] 0.8719644 0.12803561
#=========================================================
# Convert Probabilities into Class Labels
#=========================================================
# Flights with a predicted probability greater than
# 0.50 are classified as delayed.
#=========================================================
rf_pred_class <- ifelse(
rf_pred_prob[, "Delayed"] > 0.5,
"Delayed",
"OnTime"
)
rf_pred_class <- factor(
rf_pred_class,
levels = levels(test_data$is_delayed)
)
table(rf_pred_class)
## rf_pred_class
## OnTime Delayed
## 214669 11941
#=========================================================
# Evaluate Random Forest Performance
#=========================================================
# Generate confusion matrix and classification
# performance metrics.
#=========================================================
rf_cm <- confusionMatrix(
rf_pred_class,
test_data$is_delayed,
positive = "Delayed"
)
rf_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction OnTime Delayed
## OnTime 182585 32084
## Delayed 3794 8147
##
## Accuracy : 0.8417
## 95% CI : (0.8402, 0.8432)
## No Information Rate : 0.8225
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2515
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.20251
## Specificity : 0.97964
## Pos Pred Value : 0.68227
## Neg Pred Value : 0.85054
## Prevalence : 0.17753
## Detection Rate : 0.03595
## Detection Prevalence : 0.05269
## Balanced Accuracy : 0.59107
##
## 'Positive' Class : Delayed
##
Random Forest Model A achieved an accuracy of 84.17%, outperforming Logistic Regression Model A (82.27%). The model recorded a precision of 68.23% and a recall of 20.25%, demonstrating a substantially improved ability to identify delayed flights compared to Logistic Regression, which achieved a recall of only 0.68%.
The balanced accuracy increased to 59.11%, indicating improved performance across both classes while maintaining high specificity (97.96%). These results suggest that Random Forest was better able to capture nonlinear relationships and interactions among operational, temporal, geographical, and weather-related variables, resulting in superior classification performance when departure delay information was excluded.
Logistic Regression Model B was developed using the same predictors as Model A, with the addition of departure delay (dep_delay). This variable provides information about operational disruptions occurring before take-off and is expected to be strongly associated with arrival delays.
The inclusion of departure delay allows the model to evaluate whether predictive performance improves when real-time operational information is available. Model performance was assessed using Accuracy, Precision, Recall, Balanced Accuracy, and Cohen’s Kappa.
#=========================================================
# Construct Model B Dataset
#=========================================================
# Model B includes departure delay in addition to the
# operational, temporal, geographical and weather
# variables used in Model A.
#=========================================================
modelB_df <- df_clean %>%
select(
is_delayed,
dep_delay,
op_carrier,
day_of_month,
day_of_week,
year,
dep_hour,
time_of_day,
weekday,
is_weekend,
o_temp,
o_prcp,
o_wspd,
d_temp,
d_prcp,
d_wspd,
o_latitude,
o_longitude,
d_latitude,
d_longitude
)
dim(modelB_df)
## [1] 1133043 20
#=========================================================
# Training-Test Split
#=========================================================
set.seed(123)
splitB <- initial_split(
modelB_df,
prop = 0.80,
strata = is_delayed
)
trainB <- training(splitB)
testB <- testing(splitB)
dim(trainB)
## [1] 906433 20
dim(testB)
## [1] 226610 20
#=========================================================
# Prepare Data for Classification Models
#=========================================================
trainB <- trainB %>%
mutate(across(where(is.character), as.factor))
testB <- testB %>%
mutate(across(where(is.character), as.factor))
trainB$is_weekend <- as.factor(trainB$is_weekend)
testB$is_weekend <- as.factor(testB$is_weekend)
str(trainB)
## tibble [906,433 × 20] (S3: tbl_df/tbl/data.frame)
## $ is_delayed : Factor w/ 2 levels "OnTime","Delayed": 2 2 2 2 2 2 2 2 2 2 ...
## $ dep_delay : num [1:906433] 88 8 231 391 96 43 132 17 38 16 ...
## $ op_carrier : Factor w/ 15 levels "9E","AA","AS",..: 12 12 12 12 12 12 12 12 5 5 ...
## $ day_of_month: num [1:906433] 6 9 19 24 26 27 30 30 3 4 ...
## $ day_of_week : num [1:906433] 3 6 2 7 2 3 6 6 7 1 ...
## $ year : Factor w/ 2 levels "2023","2024": 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_hour : int [1:906433] 18 17 5 5 5 17 17 10 13 12 ...
## $ time_of_day : Factor w/ 5 levels "Overnight (0-5)",..: 4 4 1 1 1 4 4 2 3 3 ...
## $ weekday : Factor w/ 7 levels "Mon","Tue","Wed",..: 3 6 2 7 2 3 6 6 7 1 ...
## $ is_weekend : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 1 2 2 2 1 ...
## $ o_temp : num [1:906433] 1.1 1.7 -11.7 5 2.8 1.1 -2.2 -1.7 16.7 5.6 ...
## $ o_prcp : num [1:906433] 0 0.3 0 0 0.5 0 0 0 0 0 ...
## $ o_wspd : num [1:906433] 27.7 11.2 13 18.4 38.9 14.8 16.6 11.2 7.6 5.4 ...
## $ d_temp : num [1:906433] 7.2 2.2 -8.3 7.8 10 2.2 -2.2 -1.7 1.1 -3.3 ...
## $ d_prcp : num [1:906433] 0 0.3 0 0.4 1.5 0 0 0 0 0 ...
## $ d_wspd : num [1:906433] 22.3 22.3 9.4 24.1 25.9 0 20.5 9.4 7.6 13 ...
## $ o_latitude : num [1:906433] 46.8 46.8 46.8 46.8 46.8 ...
## $ o_longitude : num [1:906433] -92.2 -92.2 -92.2 -92.2 -92.2 ...
## $ d_latitude : num [1:906433] 44.9 44.9 44.9 44.9 44.9 ...
## $ d_longitude : num [1:906433] -93.2 -93.2 -93.2 -93.2 -93.2 ...
#=========================================================
# Logistic Regression Model B
#=========================================================
# Model B extends Model A by including departure delay
# as an additional predictor.
#=========================================================
log_model_B <- glm(
is_delayed ~ .,
data = trainB,
family = binomial
)
summary(log_model_B)
##
## Call:
## glm(formula = is_delayed ~ ., family = binomial, data = trainB)
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.748e+00 8.034e-02 -34.203 < 2e-16 ***
## dep_delay 1.307e-01 3.457e-04 378.114 < 2e-16 ***
## op_carrierAA 8.837e-02 3.210e-02 2.753 0.005906 **
## op_carrierAS 3.407e-01 3.962e-02 8.600 < 2e-16 ***
## op_carrierB6 2.375e-01 3.671e-02 6.471 9.77e-11 ***
## op_carrierDL -1.631e-01 3.215e-02 -5.072 3.94e-07 ***
## op_carrierF9 2.938e-01 3.955e-02 7.430 1.08e-13 ***
## op_carrierG4 1.850e-01 6.219e-02 2.975 0.002934 **
## op_carrierHA 4.566e-01 5.658e-02 8.071 6.97e-16 ***
## op_carrierMQ 2.732e-01 3.737e-02 7.310 2.66e-13 ***
## op_carrierNK -1.668e-01 3.794e-02 -4.397 1.10e-05 ***
## op_carrierOH 3.242e-01 3.866e-02 8.386 < 2e-16 ***
## op_carrierOO 3.172e-01 3.382e-02 9.379 < 2e-16 ***
## op_carrierUA 3.433e-02 3.295e-02 1.042 0.297371
## op_carrierWN -3.197e-01 3.146e-02 -10.163 < 2e-16 ***
## op_carrierYX 3.088e-01 3.742e-02 8.251 < 2e-16 ***
## day_of_month 4.054e-05 5.129e-04 0.079 0.936997
## day_of_week 3.395e-02 2.732e-03 12.425 < 2e-16 ***
## year2024 1.801e-01 9.299e-03 19.367 < 2e-16 ***
## dep_hour -4.665e-02 3.213e-03 -14.520 < 2e-16 ***
## time_of_dayMorning (6-11) 3.510e-01 3.214e-02 10.921 < 2e-16 ***
## time_of_dayAfternoon (12-16) 4.049e-01 4.245e-02 9.537 < 2e-16 ***
## time_of_dayEvening (17-20) 7.283e-01 5.351e-02 13.611 < 2e-16 ***
## time_of_dayLate (21-23) 6.753e-01 6.410e-02 10.535 < 2e-16 ***
## weekdayTue 2.415e-02 1.676e-02 1.441 0.149666
## weekdayWed 1.778e-01 1.573e-02 11.306 < 2e-16 ***
## weekdayThu 1.252e-01 1.468e-02 8.528 < 2e-16 ***
## weekdayFri 9.705e-02 1.405e-02 6.906 4.98e-12 ***
## weekdaySat -3.837e-03 1.538e-02 -0.249 0.803002
## weekdaySun NA NA NA NA
## is_weekendTRUE NA NA NA NA
## o_temp 3.481e-03 9.247e-04 3.764 0.000167 ***
## o_prcp 8.803e-02 5.623e-03 15.656 < 2e-16 ***
## o_wspd 2.889e-03 5.702e-04 5.067 4.05e-07 ***
## d_temp -2.260e-03 9.225e-04 -2.450 0.014298 *
## d_prcp 1.189e-01 5.265e-03 22.586 < 2e-16 ***
## d_wspd 1.129e-02 5.630e-04 20.045 < 2e-16 ***
## o_latitude 9.253e-03 1.226e-03 7.545 4.53e-14 ***
## o_longitude -6.892e-03 3.401e-04 -20.267 < 2e-16 ***
## d_latitude -1.637e-02 1.222e-03 -13.399 < 2e-16 ***
## d_longitude 1.131e-02 3.361e-04 33.662 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 847749 on 906432 degrees of freedom
## Residual deviance: 355790 on 906394 degrees of freedom
## AIC: 355868
##
## Number of Fisher Scoring iterations: 8
#=========================================================
# Generate Prediction Probabilities
#=========================================================
log_prob_B <- predict(
log_model_B,
newdata = testB,
type = "response"
)
summary(log_prob_B)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.301e-05 2.129e-02 3.596e-02 1.776e-01 1.126e-01 1.000e+00
#=========================================================
# Convert Probabilities to Classes
#=========================================================
log_pred_B <- ifelse(log_prob_B > 0.5,
"Delayed",
"OnTime")
log_pred_B <- factor(
log_pred_B,
levels = c("OnTime", "Delayed")
)
table(log_pred_B)
## log_pred_B
## OnTime Delayed
## 194548 32062
#=========================================================
# Evaluate Logistic Regression Model B
#=========================================================
library(caret)
log_cm_B <- confusionMatrix(
log_pred_B,
testB$is_delayed,
positive = "Delayed"
)
log_cm_B
## Confusion Matrix and Statistics
##
## Reference
## Prediction OnTime Delayed
## OnTime 182972 11576
## Delayed 3407 28655
##
## Accuracy : 0.9339
## 95% CI : (0.9329, 0.9349)
## No Information Rate : 0.8225
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.754
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7123
## Specificity : 0.9817
## Pos Pred Value : 0.8937
## Neg Pred Value : 0.9405
## Prevalence : 0.1775
## Detection Rate : 0.1265
## Detection Prevalence : 0.1415
## Balanced Accuracy : 0.8470
##
## 'Positive' Class : Delayed
##
Model B achieved an accuracy of 93.39%, substantially improving upon Logistic Regression Model A (82.27%). The inclusion of departure delay provided significant predictive power, enabling the model to better distinguish between delayed and on-time flights.
The model achieved a recall of 71.23% and a precision of 89.37%, indicating strong performance in identifying delayed flights while maintaining a low rate of false positives. Specificity remained high at 98.17%, demonstrating excellent performance in correctly classifying on-time flights. The balanced accuracy of 84.70% and Cohen’s Kappa of 0.7540 further indicate strong classification performance across both classes.
Compared with Model A, the substantial improvement in recall confirms that departure delay is a highly influential predictor of arrival delay. These findings suggest that incorporating real-time operational information significantly enhances flight delay prediction performance.
Random Forest Model B was developed using the same predictors as Random Forest Model A, with the addition of departure delay (dep_delay). The inclusion of departure delay allows the model to incorporate real-time operational information that may influence a flight’s final arrival status.
By combining this enhanced feature set with an ensemble learning approach, Random Forest Model B is designed to capture complex nonlinear relationships and interactions among operational, temporal, geographical, and weather-related variables. The model was trained using 200 decision trees and evaluated using Accuracy, Precision, Recall, Balanced Accuracy, and Cohen’s Kappa.
#=========================================================
# Random Forest Model B
#=========================================================
# Model B includes departure delay as an additional
# predictor variable.
#=========================================================
library(ranger)
rf_model_B <- ranger(
is_delayed ~ .,
data = trainB,
num.trees = 200,
probability = TRUE,
importance = "impurity",
seed = 123
)
## Growing trees.. Progress: 34%. Estimated remaining time: 1 minute, 0 seconds.
## Growing trees.. Progress: 69%. Estimated remaining time: 28 seconds.
rf_model_B
## Ranger result
##
## Call:
## ranger(is_delayed ~ ., data = trainB, num.trees = 200, probability = TRUE, importance = "impurity", seed = 123)
##
## Type: Probability estimation
## Number of trees: 200
## Sample size: 906433
## Number of independent variables: 19
## Mtry: 4
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.04750652
#=========================================================
# Generate Random Forest Predictions
#=========================================================
# Predict probability of each flight belonging
# to the delayed class.
#=========================================================
rf_pred_B <- predict(
rf_model_B,
data = testB
)
rf_prob_B <- rf_pred_B$predictions[, "Delayed"]
summary(rf_prob_B)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.01530 0.03937 0.18118 0.14168 1.00000
#=========================================================
# Convert Probabilities to Classes
#=========================================================
# Flights with probability > 0.5 are classified
# as Delayed, otherwise OnTime.
#=========================================================
rf_class_B <- ifelse(
rf_prob_B > 0.5,
"Delayed",
"OnTime"
)
rf_class_B <- factor(
rf_class_B,
levels = c("OnTime", "Delayed")
)
table(rf_class_B)
## rf_class_B
## OnTime Delayed
## 194409 32201
#=========================================================
# Evaluate Random Forest Model B
#=========================================================
library(caret)
rf_cm_B <- confusionMatrix(
rf_class_B,
testB$is_delayed,
positive = "Delayed"
)
rf_cm_B
## Confusion Matrix and Statistics
##
## Reference
## Prediction OnTime Delayed
## OnTime 183557 10852
## Delayed 2822 29379
##
## Accuracy : 0.9397
## 95% CI : (0.9387, 0.9406)
## No Information Rate : 0.8225
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7758
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7303
## Specificity : 0.9849
## Pos Pred Value : 0.9124
## Neg Pred Value : 0.9442
## Prevalence : 0.1775
## Detection Rate : 0.1296
## Detection Prevalence : 0.1421
## Balanced Accuracy : 0.8576
##
## 'Positive' Class : Delayed
##
Random Forest Model B was developed using the same feature set as Model A, with the addition of departure delay (dep_delay). By incorporating this operational variable, the model was able to capture both nonlinear relationships and real-time flight conditions that influence arrival delays.
The model achieved an accuracy of 93.97%, representing the highest classification performance among all models evaluated. It also achieved a precision of 91.24% and a recall of 73.03%, indicating strong performance in identifying delayed flights while maintaining a low false-positive rate. Furthermore, the model achieved a specificity of 98.49%, demonstrating excellent performance in correctly classifying on-time flights. The balanced accuracy of 85.76% and Cohen’s Kappa value of 0.7758 further confirm strong classification performance across both classes.
Compared with Logistic Regression Model B, Random Forest Model B produced slightly higher accuracy, recall, balanced accuracy, and Kappa, suggesting that nonlinear relationships and feature interactions contribute additional predictive value beyond what can be captured by a linear model. The results indicate that combining operational, temporal, geographical (coordinates), weather, and departure delay information with an ensemble learning approach provides the most effective solution for flight delay classification in this study.
| Model | Accuracy | Recall | Specificity | Balanced_Accuracy |
|---|---|---|---|---|
| Logistic Regression A | 82.27 | 0.68 | 99.89 | 50.28 |
| Random Forest A | 84.17 | 20.25 | 97.96 | 59.11 |
| Logistic Regression B | 93.39 | 71.23 | 98.17 | 84.70 |
| Random Forest B | 93.97 | 73.03 | 98.49 | 85.76 |
Among the four classification models evaluated, Random Forest Model B achieved the best overall performance, with an accuracy of 93.97%, recall of 73.03%, specificity of 98.49%, and balanced accuracy of 85.76%. The inclusion of departure delay substantially improved predictive performance across both Logistic Regression and Random Forest models, highlighting its importance as a predictor of arrival delays.
Overall, Random Forest models consistently outperformed Logistic Regression models, suggesting that nonlinear relationships and interactions among operational, temporal, geographical, and weather-related variables contribute to flight delay outcomes. Consequently, Random Forest Model B was selected as the best-performing classification model for flight delay prediction.
***
## 4.2 Regression Modelling
The regression task predicts the **exact arrival delay duration in minutes** (`arr_delay`).
Two algorithms are evaluated in order of increasing complexity:
- **Linear Regression** — interpretable baseline assuming linear feature relationships
- **Random Forest** — ensemble of decision trees capturing non-linear interactions
Both models are trained and evaluated under two scenarios:
| Scenario | Features Available | Use Case |
|----------|-------------------|----------|
| **A — Pre-Departure** | Schedule, carrier, route, weather forecast | Prediction at booking or check-in |
| **B — Post-Departure** | Above + `dep_delay`, taxi times, air time | Real-time gate-arrival estimation |
***
## 4.2.1 Data Preparation
### Loading the Dataset
The cleaned dataset saved during Phase 2 is loaded directly, ensuring consistency with all prior cleaning, feature engineering, and imputation decisions.
## 4.2 Load Cleaned Dataset
The cleaned dataset produced in Phase 2 is loaded. It contains 1,133,043 flight records with missing values confined to the six weather variables.
``` r
df_model <- readRDS("df_clean.rds")
cat("Rows:", nrow(df_model), "\n")
## Rows: 1133043
cat("Columns:", ncol(df_model), "\n")
## Columns: 39
Missing values are confined to the six weather variables at less than 0.1%of observations. Median imputation is applied as it is more robust than mean imputation for skewed distributions.
weather_vars <- c("o_temp", "o_prcp", "o_wspd",
"d_temp", "d_prcp", "d_wspd")
df_model |>
summarise(across(all_of(weather_vars), ~ sum(is.na(.x)))) |>
pivot_longer(everything(),
names_to = "variable",
values_to = "n_missing") |>
mutate(pct_missing = round(100 * n_missing / nrow(df_model), 4)) |>
kable(caption = "Missing values before imputation.") |>
kable_styling(full_width = FALSE,
bootstrap_options = c("striped", "hover"))
| variable | n_missing | pct_missing |
|---|---|---|
| o_temp | 216 | 0.0191 |
| o_prcp | 216 | 0.0191 |
| o_wspd | 216 | 0.0191 |
| d_temp | 771 | 0.0680 |
| d_prcp | 771 | 0.0680 |
| d_wspd | 771 | 0.0680 |
df_model <- df_model |>
mutate(across(all_of(weather_vars),
~ if_else(is.na(.x),
median(.x, na.rm = TRUE),
.x)))
df_model |>
summarise(across(all_of(weather_vars), ~ sum(is.na(.x)))) |>
pivot_longer(everything(),
names_to = "variable",
values_to = "n_missing") |>
kable(caption = "Missing values after imputation (all should be 0).") |>
kable_styling(full_width = FALSE,
bootstrap_options = c("striped", "hover"))
| variable | n_missing |
|---|---|
| o_temp | 0 |
| o_prcp | 0 |
| o_wspd | 0 |
| d_temp | 0 |
| d_prcp | 0 |
| d_wspd | 0 |
cat("Total missing values remaining:", sum(is.na(df_model)), "\n")
## Total missing values remaining: 0
Variables that would cause data leakage or are redundant are removed before modelling. Leakage occurs when information that would not be available at prediction time is included as a predictor, leading to overly optimistic model performance.
The following variables are excluded:
arr_time, actual_elapsed_time,
wheels_off, wheels_on — only known after the
flight landsis_delayed — derived directly from
arr_delay, our target variableroute — too many unique combinations;
origin and dest already capture thisfl_date, crs_dep_time — replaced by
engineered features dep_hour, day_of_month
etcop_carrier_fl_num — individual flight numbers do not
generalisereg_df <- df_model |>
select(
arr_delay, # target
dep_delay, taxi_out, taxi_in, # operational
air_time, crs_elapsed_time, # schedule
dep_hour, time_of_day, # temporal
day_of_month, day_of_week,
weekday, is_weekend, year,
op_carrier, origin, dest, # categorical
o_temp, o_prcp, o_wspd, # origin weather
d_temp, d_prcp, d_wspd, # destination weather
o_latitude, o_longitude, # geospatial
d_latitude, d_longitude
)
cat("Final dataset for modelling:", nrow(reg_df), "rows x",
ncol(reg_df), "columns\n")
## Final dataset for modelling: 1133043 rows x 26 columns
The dataset is split into 80% training and 20% testing using a random sample. A seed is set to ensure reproducibility.
set.seed(42)
train_index <- sample(1:nrow(reg_df), size = 0.8 * nrow(reg_df))
reg_train <- reg_df[train_index, ]
reg_test <- reg_df[-train_index, ]
cat("Training set:", nrow(reg_train), "rows\n")
## Training set: 906434 rows
cat("Test set:", nrow(reg_test), "rows\n")
## Test set: 226609 rows
Categorical variables such as op_carrier,
origin, dest, time_of_day and
weekday are converted to numeric using label encoding.
Boolean and factor variables are also converted to integer for
compatibility with all three models.
# Convert categoricals to numeric for all three models
encode_df <- function(df) {
df |>
mutate(
op_carrier = as.integer(factor(op_carrier)),
origin = as.integer(factor(origin)),
dest = as.integer(factor(dest)),
time_of_day = as.integer(factor(time_of_day)),
weekday = as.integer(factor(weekday)),
year = as.integer(as.character(year)),
is_weekend = as.integer(is_weekend)
)
}
reg_train_enc <- encode_df(reg_train)
reg_test_enc <- encode_df(reg_test)
cat("Any missing values in training set:", sum(is.na(reg_train_enc)), "\n")
## Any missing values in training set: 0
cat("Any missing values in test set:", sum(is.na(reg_test_enc)), "\n")
## Any missing values in test set: 0
dep_delay)In this scenario, only features available before the flight
departs are used. Variables such as dep_delay,
taxi_out, taxi_in, and air_time
are excluded because they are only known after the flight has operated.
This represents a realistic advance-prediction setting.
safe_features <- c(
"arr_delay", # target
"crs_elapsed_time", # scheduled duration
"dep_hour", "day_of_month", # temporal
"day_of_week", "is_weekend", "year",
"op_carrier", "origin", "dest", # categorical
"time_of_day", "weekday",
"o_temp", "o_prcp", "o_wspd", # origin weather
"d_temp", "d_prcp", "d_wspd", # destination weather
"o_latitude", "o_longitude", # geospatial
"d_latitude", "d_longitude"
)
reg_train_safe <- reg_train_enc |> select(all_of(safe_features))
reg_test_safe <- reg_test_enc |> select(all_of(safe_features))
cat("Features used (Scenario A):", ncol(reg_train_safe) - 1, "\n")
## Features used (Scenario A): 21
Linear Regression serves as the interpretable baseline model. It estimates a linear relationship between the available pre-departure features and arrival delay. Given the non-linear delay patterns identified during EDA, Linear Regression is expected to underperform compared to the ensemble approach, but provides a useful performance benchmark.
lm_model_a <- lm(arr_delay ~ ., data = reg_train_safe)
lm_pred_a <- predict(lm_model_a, newdata = reg_test_safe)
lm_rmse_a <- sqrt(mean((reg_test_safe$arr_delay - lm_pred_a)^2))
lm_mae_a <- mean(abs(reg_test_safe$arr_delay - lm_pred_a))
lm_rsq_a <- cor(reg_test_safe$arr_delay, lm_pred_a)^2
cat("Linear Regression (Pre-Departure):\n")
## Linear Regression (Pre-Departure):
cat("RMSE:", round(lm_rmse_a, 3), "\n")
## RMSE: 54.135
cat("MAE: ", round(lm_mae_a, 3), "\n")
## MAE: 24.06
cat("R²: ", round(lm_rsq_a, 3), "\n")
## R²: 0.018
Random Forest captures non-linear relationships and feature interactions that Linear Regression cannot model. It constructs an ensemble of decision trees, each trained on a random subset of data and features, with the final prediction averaged across all trees. To manage computational resources, the model is trained on a 20% sample of the training data with 100 trees.
library(ranger)
set.seed(42)
train_sample_a <- reg_train_safe |> slice_sample(prop = 0.2)
rf_model_a <- ranger(
arr_delay ~ .,
data = train_sample_a,
num.trees = 100,
mtry = 5,
min.node.size = 20,
importance = "impurity",
seed = 42
)
rf_pred_a <- predict(rf_model_a, data = reg_test_safe)$predictions
rf_rmse_a <- sqrt(mean((reg_test_safe$arr_delay - rf_pred_a)^2))
rf_mae_a <- mean(abs(reg_test_safe$arr_delay - rf_pred_a))
rf_rsq_a <- cor(reg_test_safe$arr_delay, rf_pred_a)^2
cat("Random Forest (Pre-Departure):\n")
## Random Forest (Pre-Departure):
cat("RMSE:", round(rf_rmse_a, 3), "\n")
## RMSE: 52.274
cat("MAE: ", round(rf_mae_a, 3), "\n")
## MAE: 22.725
cat("R²: ", round(rf_rsq_a, 3), "\n")
## R²: 0.085
In Scenario B, features that become available after the
flight has departed are added — specifically
dep_delay, taxi_out, taxi_in, and
air_time. This represents a real-time gate-arrival
estimation setting. As identified in the EDA phase, departure delay
alone exhibits a correlation of r = 0.97 with arrival delay, and its
inclusion is expected to substantially improve model performance.
post_features <- c(
"arr_delay", # target
"dep_delay", # key post-departure signal
"taxi_out", "taxi_in", # operational (post-departure)
"air_time", # in-flight duration
"crs_elapsed_time", # scheduled duration
"dep_hour", "day_of_month", # temporal
"day_of_week", "is_weekend", "year",
"op_carrier", "origin", "dest", # categorical
"time_of_day", "weekday",
"o_temp", "o_prcp", "o_wspd", # origin weather
"d_temp", "d_prcp", "d_wspd", # destination weather
"o_latitude", "o_longitude", # geospatial
"d_latitude", "d_longitude"
)
reg_train_post <- reg_train_enc |> select(all_of(post_features))
reg_test_post <- reg_test_enc |> select(all_of(post_features))
cat("Features used (Scenario B):", ncol(reg_train_post) - 1, "\n")
## Features used (Scenario B): 25
lm_model_b <- lm(arr_delay ~ ., data = reg_train_post)
lm_pred_b <- predict(lm_model_b, newdata = reg_test_post)
lm_rmse_b <- sqrt(mean((reg_test_post$arr_delay - lm_pred_b)^2))
lm_mae_b <- mean(abs(reg_test_post$arr_delay - lm_pred_b))
lm_rsq_b <- cor(reg_test_post$arr_delay, lm_pred_b)^2
cat("Linear Regression (Post-Departure):\n")
## Linear Regression (Post-Departure):
cat("RMSE:", round(lm_rmse_b, 3), "\n")
## RMSE: 0.173
cat("MAE: ", round(lm_mae_b, 3), "\n")
## MAE: 0.074
cat("R²: ", round(lm_rsq_b, 3), "\n")
## R²: 1
set.seed(42)
train_sample_b <- reg_train_post |> slice_sample(prop = 0.2)
rf_model_b <- ranger(
arr_delay ~ .,
data = train_sample_b,
num.trees = 100,
mtry = 5,
min.node.size = 20,
importance = "impurity",
seed = 42
)
rf_pred_b <- predict(rf_model_b, data = reg_test_post)$predictions
rf_rmse_b <- sqrt(mean((reg_test_post$arr_delay - rf_pred_b)^2))
rf_mae_b <- mean(abs(reg_test_post$arr_delay - rf_pred_b))
rf_rsq_b <- cor(reg_test_post$arr_delay, rf_pred_b)^2
cat("Random Forest (Post-Departure):\n")
## Random Forest (Post-Departure):
cat("RMSE:", round(rf_rmse_b, 3), "\n")
## RMSE: 16.174
cat("MAE: ", round(rf_mae_b, 3), "\n")
## MAE: 7.031
cat("R²: ", round(rf_rsq_b, 3), "\n")
## R²: 0.931
reg_results <- tibble(
Scenario = c(rep("A: Pre-Departure", 2),
rep("B: Post-Departure", 2)),
Model = rep(c("Linear Regression", "Random Forest"), 2),
RMSE = c(lm_rmse_a, rf_rmse_a, lm_rmse_b, rf_rmse_b),
MAE = c(lm_mae_a, rf_mae_a, lm_mae_b, rf_mae_b),
R2 = c(lm_rsq_a, rf_rsq_a, lm_rsq_b, rf_rsq_b)
) |>
mutate(across(c(RMSE, MAE, R2), ~ round(.x, 3)))
reg_results |>
kable(caption = "Regression model performance across both scenarios.",
col.names = c("Scenario", "Model", "RMSE", "MAE", "R²")) |>
kable_styling(full_width = FALSE,
bootstrap_options = c("striped", "hover")) |>
collapse_rows(columns = 1, valign = "middle")
| Scenario | Model | RMSE | MAE | R² |
|---|---|---|---|---|
| A: Pre-Departure | Linear Regression | 54.135 | 24.060 | 0.018 |
| Random Forest | 52.274 | 22.725 | 0.085 | |
| B: Post-Departure | Linear Regression | 0.173 | 0.074 | 1.000 |
| Random Forest | 16.174 | 7.031 | 0.931 |
# Scenario A
saveRDS(lm_model_a, "lm_model_pre.rds")
saveRDS(rf_model_a, "rf_model_pre.rds")
# Scenario B
saveRDS(lm_model_b, "lm_model_post.rds")
saveRDS(rf_model_b, "rf_model_post.rds")
cat("All models saved successfully!\n")
## All models saved successfully!
The low R² values across both pre-departure models reflect the inherent difficulty of predicting flight delays using only information available before departure. Real-time operational disruptions — such as late incoming aircraft and air traffic control intervention — cannot be captured by weather and schedule features alone. Weather variables showed only weak linear correlations with arrival delay (r ≈ 0.03–0.04) during EDA, and features such as carrier, route, and time of day provide only a general population-level signal rather than flight-specific predictions. Despite modest absolute performance, Random Forest is expected to outperform Linear Regression, confirming that delay relationships are non-linear in nature — consistent with the non-linear weather effects and time-of-day patterns identified during exploratory analysis.
The inclusion of dep_delay, taxi_out,
taxi_in, and air_time is expected to
substantially improve both RMSE and R² across both models. Departure
delay in particular acts as a near-direct signal of arrival delay (r =
0.97, EDA Phase 3), and its inclusion transforms the problem from a weak
signal estimation task to a high-confidence operational calculation.
This scenario is appropriate for real-time airline systems estimating
gate-arrival times once a flight is airborne. The performance gap
between Scenario A and Scenario B directly quantifies the informational
value of post-departure operational data.
This section evaluates the modelling results across both prediction scenarios. Instead of reviewing each model separately, the focus is on comparing how model performance changes when moving from Scenario A, which uses only pre-departure information, to Scenario B, which includes post-departure operational information.
The classification task compared Logistic Regression and Random Forest models under two scenarios:
dep_delaydep_delayclassification_eval <- tibble(
Model = c("Logistic Regression", "Random Forest",
"Logistic Regression", "Random Forest"),
Scenario = c("A: Pre-Departure", "A: Pre-Departure",
"B: Post-Departure", "B: Post-Departure"),
Accuracy = c(82.27, 84.17, 93.39, 93.97),
Precision = c(NA, 68.23, 89.37, 91.24),
Recall = c(0.68, 20.25, 71.23, 73.03),
Specificity = c(99.89, 97.96, 98.17, 98.49),
Balanced_Accuracy = c(50.28, 59.11, 84.70, 85.76),
Kappa = c(0.0093, NA, 0.7540, 0.7758)
)
classification_eval |>
kable(caption = "Classification model evaluation across both scenarios.") |>
kable_styling(full_width = FALSE,
bootstrap_options = c("striped", "hover"))
| Model | Scenario | Accuracy | Precision | Recall | Specificity | Balanced_Accuracy | Kappa |
|---|---|---|---|---|---|---|---|
| Logistic Regression | A: Pre-Departure | 82.27 | NA | 0.68 | 99.89 | 50.28 | 0.0093 |
| Random Forest | A: Pre-Departure | 84.17 | 68.23 | 20.25 | 97.96 | 59.11 | NA |
| Logistic Regression | B: Post-Departure | 93.39 | 89.37 | 71.23 | 98.17 | 84.70 | 0.7540 |
| Random Forest | B: Post-Departure | 93.97 | 91.24 | 73.03 | 98.49 | 85.76 | 0.7758 |
classification_eval |>
select(Model, Scenario, Accuracy, Recall, Specificity, Balanced_Accuracy) |>
pivot_longer(cols = c(Accuracy, Recall, Specificity, Balanced_Accuracy),
names_to = "Metric",
values_to = "Value") |>
ggplot(aes(x = Scenario, y = Value, fill = Model)) +
geom_col(position = "dodge") +
facet_wrap(~ Metric, scales = "free_y") +
labs(title = "Classification performance improves strongly in Scenario B",
x = NULL,
y = "Metric value (%)",
fill = "Model") +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
In Scenario A, both classification models achieved relatively high accuracy because most flights were on time. However, recall was weak, especially for Logistic Regression, meaning that the model failed to identify most delayed flights. Random Forest performed better than Logistic Regression in Scenario A because it was able to capture non-linear patterns and interactions among weather, carrier, temporal, and airport-related variables.
In Scenario B, performance improved substantially for both models
after adding dep_delay. This confirms that departure delay
is the strongest predictor of arrival delay. Random Forest Model B
achieved the best overall classification performance, with the highest
accuracy, recall, specificity, balanced accuracy, and Kappa. Therefore,
Random Forest Model B was selected as the best classification model.
The regression task compared Linear Regression and Random Forest models under the same two scenarios:
reg_results |>
kable(caption = "Regression model evaluation across both scenarios.",
col.names = c("Scenario", "Model", "RMSE", "MAE", "R²")) |>
kable_styling(full_width = FALSE,
bootstrap_options = c("striped", "hover"))
| Scenario | Model | RMSE | MAE | R² |
|---|---|---|---|---|
| A: Pre-Departure | Linear Regression | 54.135 | 24.060 | 0.018 |
| A: Pre-Departure | Random Forest | 52.274 | 22.725 | 0.085 |
| B: Post-Departure | Linear Regression | 0.173 | 0.074 | 1.000 |
| B: Post-Departure | Random Forest | 16.174 | 7.031 | 0.931 |
reg_results |>
ggplot(aes(x = Scenario, y = RMSE, fill = Model)) +
geom_col(position = "dodge") +
labs(title = "Regression RMSE comparison across scenarios",
x = NULL,
y = "RMSE",
fill = "Model") +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
reg_results |>
ggplot(aes(x = Scenario, y = R2, fill = Model)) +
geom_col(position = "dodge") +
labs(title = "Regression R² comparison across scenarios",
x = NULL,
y = "R²",
fill = "Model") +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
Scenario A produced weaker regression performance because only pre-departure information was available. Features such as weather, carrier, airport, date, and scheduled time provide useful general patterns, but they cannot fully capture real-time operational disruptions. This explains the lower R² values and higher prediction errors in the pre-departure setting.
Scenario B produced much stronger regression results because it
included operational information such as dep_delay,
taxi_out, taxi_in, and air_time.
These variables are much closer to the final arrival outcome and
therefore reduce prediction error. This confirms that post-departure
information greatly improves the ability to estimate exact arrival delay
duration.
Feature importance was examined for the Random Forest models because they provide an interpretable ranking of influential predictors. This helps identify which variables contributed most strongly to delay prediction.
rf_imp_class <- tibble(
Feature = names(rf_model_B$variable.importance),
Importance = rf_model_B$variable.importance
) |>
arrange(desc(Importance)) |>
slice_head(n = 15)
rf_imp_class |>
ggplot(aes(x = reorder(Feature, Importance), y = Importance)) +
geom_col(fill = report_fill) +
coord_flip() +
labs(title = "Top 15 feature importance values for Random Forest Classification Model B",
x = NULL,
y = "Importance")
The feature importance plot shows which predictors were most influential in classifying flights as delayed or on time. Departure delay is expected to dominate the classification model because it directly reflects whether the flight has already experienced an operational delay before arrival. Other important variables may include carrier, departure hour, airport location, weather conditions, and day-of-month effects.
rf_imp_reg <- tibble(
Feature = names(rf_model_b$variable.importance),
Importance = rf_model_b$variable.importance
) |>
arrange(desc(Importance)) |>
slice_head(n = 15)
rf_imp_reg |>
ggplot(aes(x = reorder(Feature, Importance), y = Importance)) +
geom_col(fill = report_fill) +
coord_flip() +
labs(title = "Top 15 feature importance values for Random Forest Regression Model B",
x = NULL,
y = "Importance")
For the regression task, feature importance helps explain which
variables were most useful in predicting the exact number of arrival
delay minutes. Similar to the classification task, post-departure
variables such as dep_delay, taxi_out,
taxi_in, and air_time are expected to rank
highly because they provide direct operational information about the
flight’s progress.
Across both classification and regression tasks, Scenario B consistently outperformed Scenario A. This demonstrates that real-time operational variables provide substantially stronger predictive information than schedule, weather, carrier, and airport variables alone.
For classification, Random Forest Model B was the strongest model because it achieved the best balance between accuracy, recall, specificity, and balanced accuracy. For regression, the best-performing model should be selected based on the lowest RMSE and MAE, together with the highest R². Overall, the evaluation confirms that Random Forest models are more effective than simpler linear models because they can capture non-linear relationships and interactions among flight delay predictors.
The comparison also highlights an important practical distinction. Scenario A is more useful for early prediction before departure, but its accuracy is limited. Scenario B is much more accurate, but it can only be used after the flight has already departed. Therefore, the preferred model depends on the intended use case: advance planning or real-time operational prediction.
This project investigated U.S. domestic flight delays during the December holiday periods of 2023 and 2024 using the Aeolus dataset, which contains over 1.13 million flight records enriched with operational, temporal, weather, and airport-related information. Both classification and regression approaches were developed and evaluated to better understand the factors associated with flight delays and to assess the feasibility of predicting them.
The results consistently showed that operational information is the strongest driver of arrival delay, with departure delay emerging as the single most influential predictor. For the classification task, Random Forest achieved the best overall performance, reaching 94.0% accuracy, 73.0% recall, and 85.8% balanced accuracy. For the regression task, the best-performing Random Forest model achieved an RMSE of approximately 16 minutes and an R² of 0.93, demonstrating strong predictive capability once post-departure information becomes available.
A key finding of this study is the clear trade-off between prediction timing and predictive accuracy. While pre-departure models provide useful insights for advance planning and resource allocation, their predictive power remains limited. In contrast, post-departure models achieve substantially higher accuracy by leveraging real-time operational information.
Overall, the findings demonstrate the value of machine learning for flight-delay prediction and provide practical insights that can support both advance planning and real-time operational decision-making within the aviation industry.
sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.7.4
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Kuala_Lumpur
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ranger_0.18.0 caret_7.0-1 lattice_0.22-9
## [4] rsample_1.3.2 kableExtra_1.4.0 knitr_1.51
## [7] patchwork_1.3.2 corrplot_0.95 scales_1.4.0
## [10] skimr_2.2.2 janitor_2.2.1 data.table_1.18.2.1
## [13] lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0
## [16] dplyr_1.2.1 purrr_1.2.2 readr_2.2.0
## [19] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.3
## [22] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.3 timeDate_4052.112
## [4] farver_2.1.2 S7_0.2.1 fastmap_1.2.0
## [7] pROC_1.19.0.1 digest_0.6.39 rpart_4.1.24
## [10] timechange_0.4.0 lifecycle_1.0.5 survival_3.8-6
## [13] magrittr_2.0.4 compiler_4.5.2 rlang_1.2.0
## [16] sass_0.4.10 tools_4.5.2 yaml_2.3.12
## [19] plyr_1.8.9 xml2_1.5.2 repr_1.1.7
## [22] RColorBrewer_1.1-3 withr_3.0.2 stats4_4.5.2
## [25] nnet_7.3-20 grid_4.5.2 future_1.70.0
## [28] globals_0.19.1 iterators_1.0.14 MASS_7.3-65
## [31] cli_3.6.6 rmarkdown_2.30 generics_0.1.4
## [34] future.apply_1.20.2 rstudioapi_0.18.0 reshape2_1.4.5
## [37] tzdb_0.5.0 cachem_1.1.0 splines_4.5.2
## [40] parallel_4.5.2 base64enc_0.1-6 vctrs_0.7.1
## [43] hardhat_1.4.3 Matrix_1.7-4 jsonlite_2.0.0
## [46] hms_1.1.4 listenv_0.10.1 systemfonts_1.3.2
## [49] foreach_1.5.2 gower_1.0.2 jquerylib_0.1.4
## [52] recipes_1.3.2 glue_1.8.0 parallelly_1.47.0
## [55] codetools_0.2-20 stringi_1.8.7 gtable_0.3.6
## [58] furrr_0.4.0 pillar_1.11.1 htmltools_0.5.9
## [61] ipred_0.9-15 lava_1.9.0 R6_2.6.1
## [64] textshaping_1.0.5 evaluate_1.0.5 snakecase_0.11.1
## [67] bslib_0.10.0 class_7.3-23 Rcpp_1.1.1-1.1
## [70] svglite_2.2.2 nlme_3.1-168 prodlim_2026.03.11
## [73] xfun_0.56 ModelMetrics_1.2.2.2 pkgconfig_2.0.3