# Load only the packages actually used in this report.
library(tidyverse)   # data manipulation (dplyr, tidyr) and visualisation (ggplot2)
library(data.table)  # fast reading of the very large (1.8 GB) source CSV files
library(janitor)     # clean_names() for consistent snake_case column names
library(skimr)       # rich one-line statistical summaries
library(lubridate)   # date/time handling
library(scales)      # axis formatting (commas, percentages)
library(corrplot)    # correlation heatmaps
library(patchwork)   # composing multiple ggplots
library(knitr)       # kable() tables
library(kableExtra)  # styled HTML tables


theme_set(theme_minimal(base_size = 12) +
            theme(panel.grid = element_blank()))
report_fill <- "#2c7fb8"   # primary colour used throughout

Phase 1: Introduction

Flight delays remain a persistent challenge in the U.S. aviation industry, causing significant economic losses and passenger dissatisfaction. This project analyzes flight delay patterns using the Aeolus dataset, a comprehensive multi-modal dataset sourced from the U.S. Department of Transportation’s Bureau of Transportation Statistics. The analysis focuses on domestic flights during December 2023 and December 2024, encompassing over 1.13 million flight records with rich operational, meteorological, and airport-level attributes.

The project has two main analytical components. First, exploratory data analysis is conducted to understand delay patterns, weather impacts, and year-over-year trends. Second, predictive models are developed for two complementary tasks: a regression task to predict exact arrival delay in minutes, and a classification task to predict whether a flight will experience a significant delay exceeding 15 minutes. All analyses are implemented using the tidymodels framework in R.

1.1 Background

Flight delays increase substantially during the December holiday season due to higher passenger demand, tighter flight schedules, and more frequent winter weather disruptions. These delays create operational challenges for airlines and airports, resulting in increased costs, resource constraints, and reduced service reliability. Understanding the factors that contribute to December flight delays, and how these factors vary across holiday seasons, can support more effective planning and delay mitigation strategies.

The Aeolus dataset provides a comprehensive foundation for this analysis by combining flight operations, weather conditions, and airport information within a single dataset. This integration enables the joint examination of operational and environmental factors, allowing a more complete assessment of the drivers of flight delays.

1.2 Problem Statement

Flight delays cost the U.S. economy billions of dollars annually, yet accurately predicting delays remains difficult due to the complex interaction of weather conditions, operational factors, airport congestion, and delay propagation across flight networks. During the December holiday travel season, the problem intensifies as flight volumes surge and weather disruptions become more frequent.

Existing research has rarely compared regression and classification approaches on a unified dataset with integrated weather features. Furthermore, delay pattern changes between consecutive holiday seasons remain underexplored. This project addresses these gaps by leveraging the Aeolus dataset to simultaneously investigate both predictive paradigms while focusing specifically on December 2023 and 2024.

1.3 Objectives

The objectives of this project are as follows:

To perform data cleaning and preprocessing on the Aeolus dataset for both regression and classification tasks.
To conduct exploratory data analysis characterizing flight delay patterns in December 2023 and December 2024.
To develop a regression model predicting exact arrival delay (minutes) using operational, temporal, meteorological, and geospatial features.
To develop a classification model predicting whether a flight will be delayed more than 15 minutes.
To identify the most influential predictors of flight delays through feature importance analysis.
To compare the performance of regression versus classification approaches for delay prediction.

1.4 Research Questions

To what extent can weather conditions (temperature, precipitation, wind speed) predict flight arrival delays during December?
How do flight delay patterns differ between December 2023 and December 2024?
Which features (operational, temporal, meteorological, or geospatial) are the strongest predictors of arrival delay?
How does a regression model compare to a classification model in predicting flight delays?
Can a binary classification model accurately identify flights delayed more than 15 minutes?

Phase 2: Data Preparation

2.1 Data Source

The dataset used in this project is titled Aeolus - A Multi-Modal Flight Delay Dataset. It originates from the U.S. Department of Transportation’s Bureau of Transportation Statistics, which collects and publishes comprehensive data on domestic flight operations. The dataset covers domestic flights within the United States during the month of December for two consecutive years, specifically December 1–31, 2023 and December 1–31, 2024. This time period was selected to focus on the peak holiday travel season, allowing for comparative analysis of delay patterns between two holiday periods.

2.2 Dataset Dimension

# December flight counts per year, with a bold total row.
tibble(
  Year                = c("December 2023", "December 2024", "Total"),
  `Number of Flights` = c(558609, 574434, 1133043)
) |>
  kable(caption = "December flight records by year.",
        format.args = list(big.mark = ","),
        align = c("l", "r")) |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover")) |>
  row_spec(3, bold = TRUE)

December flight records by year.
Year	Number of Flights
December 2023	558,609
December 2024	574,434
Total	1,133,043

After filtering to December, the combined dataset comprises 1,133,043 rows and 35 columns (34 original variables plus an engineered year column).

2.3 Variable Structure

The 35 variables fall into six conceptual categories:

tibble(
  Category = c("Flight Identification", "Temporal", "Delay / Operational",
               "Weather (Origin)", "Weather (Destination)", "Geospatial"),
  Variables = c(
    "op_carrier, op_carrier_fl_num, origin, dest, origin_index, dest_index",
    "fl_date, crs_dep_time, dep_time, crs_arr_time, arr_time, month, day_of_month, day_of_week, year",
    "dep_delay, arr_delay, taxi_out, taxi_in, air_time, crs_elapsed_time, actual_elapsed_time, wheels_off, wheels_on, flights",
    "o_temp, o_prcp, o_wspd",
    "d_temp, d_prcp, d_wspd",
    "o_latitude, o_longitude, d_latitude, d_longitude"
  )
) |>
  kable(caption = "The 35 variables grouped into six conceptual categories.",
        align = c("l", "l")) |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover")) |>
  column_spec(1, bold = TRUE)

The 35 variables grouped into six conceptual categories.
Category	Variables
Flight Identification	op_carrier, op_carrier_fl_num, origin, dest, origin_index, dest_index
Temporal	fl_date, crs_dep_time, dep_time, crs_arr_time, arr_time, month, day_of_month, day_of_week, year
Delay / Operational	dep_delay, arr_delay, taxi_out, taxi_in, air_time, crs_elapsed_time, actual_elapsed_time, wheels_off, wheels_on, flights
Weather (Origin)	o_temp, o_prcp, o_wspd
Weather (Destination)	d_temp, d_prcp, d_wspd
Geospatial	o_latitude, o_longitude, d_latitude, d_longitude

The target variables for modelling stage are arr_delay for regression and is_delayed (arr_delay > 15) minutes for classification.

2.4 Data Loading and Filtering

  # load dataset file paths
path_2023 <- 'flight_with_weather_2023.csv'
path_2024 <- 'flight_with_weather_2024.csv'

# read the data
flight_2023 <- read_csv(path_2023)
flight_2024 <- read_csv(path_2024)

File paths for the 2023 and 2024 datasets are first assigned to variables path_2023 and path_2024. The read_csv() function then reads each CSV file into R as a data frame.

# inspect column names 2023
names(flight_2023)

##  [1] "FL_DATE"             "OP_CARRIER"          "OP_CARRIER_FL_NUM"  
##  [4] "ORIGIN"              "DEST"                "CRS_DEP_TIME"       
##  [7] "DEP_TIME"            "DEP_DELAY"           "TAXI_OUT"           
## [10] "WHEELS_OFF"          "WHEELS_ON"           "TAXI_IN"            
## [13] "CRS_ARR_TIME"        "ARR_TIME"            "ARR_DELAY"          
## [16] "CRS_ELAPSED_TIME"    "ACTUAL_ELAPSED_TIME" "AIR_TIME"           
## [19] "FLIGHTS"             "MONTH"               "DAY_OF_MONTH"       
## [22] "DAY_OF_WEEK"         "ORIGIN_INDEX"        "DEST_INDEX"         
## [25] "O_TEMP"              "O_PRCP"              "O_WSPD"             
## [28] "D_TEMP"              "D_PRCP"              "D_WSPD"             
## [31] "O_LATITUDE"          "O_LONGITUDE"         "D_LATITUDE"         
## [34] "D_LONGITUDE"

# inspect column names 2024
names(flight_2024)

##  [1] "FL_DATE"             "OP_CARRIER"          "OP_CARRIER_FL_NUM"  
##  [4] "ORIGIN"              "DEST"                "CRS_DEP_TIME"       
##  [7] "DEP_TIME"            "DEP_DELAY"           "TAXI_OUT"           
## [10] "WHEELS_OFF"          "WHEELS_ON"           "TAXI_IN"            
## [13] "CRS_ARR_TIME"        "ARR_TIME"            "ARR_DELAY"          
## [16] "CRS_ELAPSED_TIME"    "ACTUAL_ELAPSED_TIME" "AIR_TIME"           
## [19] "FLIGHTS"             "MONTH"               "DAY_OF_MONTH"       
## [22] "DAY_OF_WEEK"         "ORIGIN_INDEX"        "DEST_INDEX"         
## [25] "O_TEMP"              "O_PRCP"              "O_WSPD"             
## [28] "D_TEMP"              "D_PRCP"              "D_WSPD"             
## [31] "O_LATITUDE"          "O_LONGITUDE"         "D_LATITUDE"         
## [34] "D_LONGITUDE"

# verify both datasets have identical column names and order
identical(names(flight_2023), names(flight_2024))

## [1] TRUE

Column name inspection reveals 34 variables organized into flight identification, temporal data, delay metrics, weather conditions (origin/destination), and geospatial coordinates. The identical() function confirms both datasets have identical column names and ordering, validating that row-binding can proceed without alignment issues.

# check unique values for MONTH column 2023 and 2024
unique(flight_2023$MONTH)

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12

unique(flight_2024$MONTH)

##  [1]  1  2  4  5  6  7  8  9 10 11 12

The unique() function was applied to the MONTH column of the 2023 dataset to identify all distinct month values present. The output shows months 1 through 12, indicating that the 2023 dataset contains flight records from all twelve months of the year (January to December).

However, for year 2024, the output shows months 1, 2, and 4 through 12, but notably month 3 (March) is missing. This indicates that the 2024 dataset does not contain any flight records for March. The reason for this missing month could be due to data unavailability at the time of download or an intentional subset provided by the data source.

# filter to only keep Dec rows 
dec_flight_2023 <- flight_2023 %>%
  filter(MONTH == 12) %>%
  mutate(YEAR = 2023L)

dec_flight_2024 <- flight_2024 %>%
  filter(MONTH == 12) %>%
  mutate(YEAR = 2024L)

unique(dec_flight_2023$MONTH)

## [1] 12

unique(dec_flight_2024$MONTH)

## [1] 12

cat("December 2023:", nrow(dec_flight_2023), "flights\n")

## December 2023: 558609 flights

cat("December 2024:", nrow(dec_flight_2024), "flights\n")

## December 2024: 574434 flights

The analysis is restricted to December flights to focus on the peak holiday travel period while reducing the dataset to a more manageable size. A YEAR variable is retained to enable direct comparisons between the 2023 and 2024 holiday seasons and to support subsequent temporal analyses.

df <- bind_rows(dec_flight_2023, dec_flight_2024) %>%
  janitor::clean_names()

The December 2023 and 2024 datasets were combined into a single dataset, and all variable names were standardized to a consistent naming convention. The resulting dataset contains 1,133,043 flight records and 35 variables, providing a unified foundation for subsequent cleaning, exploration, and modelling activities.

2.5 Data Inspection

2.5.1 Structure and Data Types

glimpse(df)

## Rows: 1,133,043
## Columns: 35
## $ fl_date             <dttm> 2023-12-01, 2023-12-01, 2023-12-01, 2023-12-02, 2…
## $ op_carrier          <chr> "DL", "OO", "OO", "DL", "OO", "OO", "DL", "OO", "O…
## $ op_carrier_fl_num   <dbl> 1189, 3888, 3920, 1189, 3888, 3920, 1189, 3888, 39…
## $ origin              <chr> "DLH", "DLH", "DLH", "DLH", "DLH", "DLH", "DLH", "…
## $ dest                <chr> "MSP", "MSP", "MSP", "MSP", "MSP", "MSP", "MSP", "…
## $ crs_dep_time        <dttm> 2023-12-01 05:05:00, 2023-12-01 10:44:00, 2023-12…
## $ dep_time            <dttm> 2023-12-01 05:23:00, 2023-12-01 10:36:00, 2023-12…
## $ dep_delay           <dbl> 18, -8, -5, -7, 0, -5, 1, -6, -8, -8, -7, -5, 0, -…
## $ taxi_out            <dbl> 20, 12, 21, 23, 14, 25, 16, 29, 21, 11, 10, 35, 26…
## $ wheels_off          <dttm> 2023-12-01 05:43:00, 2023-12-01 10:48:00, 2023-12…
## $ wheels_on           <dttm> 2023-12-01 06:14:00, 2023-12-01 11:20:00, 2023-12…
## $ taxi_in             <dbl> 3, 4, 5, 6, 7, 4, 4, 5, 5, 3, 3, 3, 5, 5, 8, 3, 4,…
## $ crs_arr_time        <dttm> 2023-12-01 06:19:00, 2023-12-01 11:53:00, 2023-12…
## $ arr_time            <dttm> 2023-12-01 06:17:00, 2023-12-01 11:24:00, 2023-12…
## $ arr_delay           <dbl> -2, -29, -19, -20, -15, -18, -24, -6, -27, -39, -3…
## $ crs_elapsed_time    <dbl> 74, 69, 75, 74, 69, 75, 74, 69, 75, 74, 69, 75, 74…
## $ actual_elapsed_time <dbl> 54, 48, 61, 61, 54, 62, 49, 69, 56, 43, 46, 71, 60…
## $ air_time            <dbl> 31, 32, 35, 32, 33, 33, 29, 35, 30, 29, 33, 33, 29…
## $ flights             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ month               <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12…
## $ day_of_month        <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6,…
## $ day_of_week         <dbl> 5, 5, 5, 6, 6, 6, 7, 7, 7, 1, 1, 1, 2, 2, 2, 3, 3,…
## $ origin_index        <dbl> 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90…
## $ dest_index          <dbl> 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, …
## $ o_temp              <dbl> -7.2, -6.7, 0.6, -8.3, -9.4, 0.6, -1.1, -0.6, 0.6,…
## $ o_prcp              <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ o_wspd              <dbl> 7.6, 5.4, 16.6, 0.0, 0.0, 14.8, 14.8, 9.4, 9.4, 5.…
## $ d_temp              <dbl> -2.8, -5.6, 2.2, -3.9, -5.6, 3.3, 0.6, 0.6, 2.2, 0…
## $ d_prcp              <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ d_wspd              <dbl> 5.4, 7.6, 7.6, 0.0, 11.2, 16.6, 5.4, 0.0, 5.4, 0.0…
## $ o_latitude          <dbl> 46.84209, 46.84209, 46.84209, 46.84209, 46.84209, …
## $ o_longitude         <dbl> -92.19365, -92.19365, -92.19365, -92.19365, -92.19…
## $ d_latitude          <dbl> 44.88055, 44.88055, 44.88055, 44.88055, 44.88055, …
## $ d_longitude         <dbl> -93.21692, -93.21692, -93.21692, -93.21692, -93.21…
## $ year                <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 20…

# Tabulate how many columns fall into each broad data type.
tibble(variable = names(df),
       type     = map_chr(df, ~ class(.x)[1])) |>
  mutate(type_group = case_when(
    type %in% c("POSIXct", "Date", "IDate", "ITime") ~ "Datetime",
    type %in% c("numeric", "integer", "double")      ~ "Numeric",
    TRUE                                              ~ "Character")) |>
  count(type_group, name = "n_columns") |>
  kable(caption = "Number of columns by broad data type.") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Number of columns by broad data type.
type_group	n_columns
Character	3
Datetime	7
Numeric	25

The dataset is dominated by numerical flight and weather variables, with seven datetime fields and a small number of categorical identifiers. The datetime fields were successfully parsed, enabling temporal analysis and feature engineering. Preliminary inspection suggests that flights and month may be constant variables and should be examined further.

2.5.2 Constant and Redundant Columns

# Identify columns that carry no information
df |>
  summarise(across(everything(), n_distinct)) |>
  pivot_longer(everything(), names_to = "variable", values_to = "n_unique") |>
  filter(n_unique <= 1) |>
  kable(caption = "Constant columns (single unique value = zero predictive value).") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Constant columns (single unique value = zero predictive value).
variable	n_unique
flights	1
month	1

The variables month and flights contain only a single value across all observations and therefore provide no predictive information. These columns are retained during initial exploration for transparency but are removed during data cleaning to reduce unnecessary model complexity and ensure a more efficient modelling process.

2.5.3 Numeric Summary

# summary of df
skim(df) |> yank("numeric")

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
op_carrier_fl_num	0	1	2418.27	1622.07	1.00	1103.00	2142.00	3562.00	8819.00	▇▆▃▂▁
dep_delay	0	1	10.79	51.96	-99.00	-6.00	-2.00	8.00	3786.00	▇▁▁▁▁
taxi_out	0	1	18.00	9.42	1.00	12.00	16.00	21.00	184.00	▇▁▁▁▁
taxi_in	0	1	8.23	6.60	1.00	4.00	6.00	10.00	244.00	▇▁▁▁▁
arr_delay	0	1	3.74	54.02	-126.00	-17.00	-7.00	7.00	3795.00	▇▁▁▁▁
crs_elapsed_time	0	1	150.32	73.98	20.00	95.00	134.00	182.00	1003.00	▇▂▁▁▁
actual_elapsed_time	0	1	143.27	73.22	16.00	90.00	127.00	175.00	743.00	▇▃▁▁▁
air_time	0	1	117.03	71.35	6.00	65.00	100.00	147.00	673.00	▇▂▁▁▁
flights	0	1	1.00	0.00	1.00	1.00	1.00	1.00	1.00	▁▁▇▁▁
month	0	1	12.00	0.00	12.00	12.00	12.00	12.00	12.00	▁▁▇▁▁
day_of_month	0	1	15.94	8.95	1.00	8.00	16.00	24.00	31.00	▇▇▇▇▇
day_of_week	0	1	4.07	2.05	1.00	2.00	4.00	6.00	7.00	▇▃▃▅▇
origin_index	0	1	159.54	87.61	0.00	85.00	174.00	235.00	321.00	▆▇▆▇▆
dest_index	0	1	159.55	87.60	0.00	85.00	174.00	235.00	321.00	▆▇▆▇▆
o_temp	216	1	8.71	8.12	-36.00	2.80	8.30	13.90	33.30	▁▁▅▇▂
o_prcp	216	1	0.12	0.68	0.00	0.00	0.00	0.00	49.50	▇▁▁▁▁
o_wspd	216	1	11.68	8.23	0.00	5.40	11.20	16.60	79.60	▇▃▁▁▁
d_temp	771	1	9.20	8.12	-36.00	3.30	8.90	15.00	33.90	▁▁▅▇▂
d_prcp	771	1	0.12	0.69	0.00	0.00	0.00	0.00	49.50	▇▁▁▁▁
d_wspd	771	1	11.94	8.21	0.00	7.60	11.20	16.60	79.60	▇▃▁▁▁
o_latitude	0	1	36.40	6.00	13.48	32.90	36.12	40.69	71.29	▁▇▇▁▁
o_longitude	0	1	-95.00	18.39	-176.65	-111.98	-87.90	-80.94	-64.80	▁▁▅▅▇
d_latitude	0	1	36.40	6.01	13.48	32.90	36.12	40.69	71.29	▁▇▇▁▁
d_longitude	0	1	-94.99	18.39	-176.65	-111.98	-87.90	-80.94	-64.80	▁▁▅▅▇
year	0	1	2023.51	0.50	2023.00	2023.00	2024.00	2024.00	2024.00	▇▁▁▁▇

The dataset is highly complete, with all core flight and delay variables fully populated. Missing values are limited to a small number of weather-related variables and are examined separately. Preliminary statistics also indicate that departure and arrival delays are strongly right-skewed, with most flights departing and arriving on time or early, but a small number experiencing extremely large delays. This distributional pattern has important implications for subsequent modelling and evaluation.

2.6 Data Cleaning

2.6.1 Missing Values

# check missing values
missing_summary <- df |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") |>
  mutate(pct_missing = round(100 * n_missing / nrow(df), 3)) |>
  filter(n_missing > 0) |>
  arrange(desc(n_missing))

missing_summary |>
  kable(caption = "Columns containing missing values.") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Columns containing missing values.
variable	n_missing	pct_missing
d_temp	771	0.068
d_prcp	771	0.068
d_wspd	771	0.068
o_temp	216	0.019
o_prcp	216	0.019
o_wspd	216	0.019

missing_summary |>
  ggplot(aes(x = reorder(variable, pct_missing), y = pct_missing)) +
  geom_col(fill = report_fill) +
  geom_text(aes(label = paste0(n_missing, " (", pct_missing, "%)")),
            hjust = -0.1, size = 3.2) +
  coord_flip(ylim = c(0, max(missing_summary$pct_missing) * 1.4)) +
  labs(title = "Missing values are confined to the six weather variables",
       subtitle = "All operational and delay variables are 100% complete",
       x = NULL, y = "% missing")

Missing values are minimal, affecting less than 0.1% of observations and occurring only in weather-related variables. The identical missing counts across origin and destination weather fields suggest that missingness is linked to weather station reporting rather than flight delays themselves. Given the very low proportion of affected records, these missing values are unlikely to influence the analysis significantly. Therefore, rows are retained for EDA, with appropriate imputation to be applied during the modelling phase. Importantly, the target variable (arr_delay) contains no missing values, ensuring that all flights remain available for prediction tasks.

2.6.2 Duplicate Records

# check for duplicate 
n_dupes <- sum(duplicated(df))
cat("Fully duplicated rows:", n_dupes, "\n")

## Fully duplicated rows: 0

No duplicate records were identified in the dataset, indicating that each row represents a unique flight. As a result, no de-duplication is required, and there is minimal risk of double-counting flights in subsequent analyses or modelling activities.

2.6.3 Invalid, Inconsistent, and Impossible Values

# inspecting values fall within physically plausible ranges.
validity <- tibble(
  check = c("Negative air_time",
            "Negative taxi_out",
            "Negative taxi_in",
            "CRS departure timestamps missing",
            "Actual departure timestamps missing",
            "Months other than December",
            "arr_delay outside [-180, 4000] min",
            "dep_delay outside [-180, 4000] min"),
  n = c(sum(df$air_time < 0, na.rm = TRUE),
        sum(df$taxi_out < 0, na.rm = TRUE),
        sum(df$taxi_in  < 0, na.rm = TRUE),
        sum(is.na(df$crs_dep_time)),
        sum(is.na(df$dep_time)),
        sum(df$month != 12),
        sum(df$arr_delay < -180 | df$arr_delay > 4000),
        sum(df$dep_delay < -180 | df$dep_delay > 4000)))

validity |>
  kable(caption = "Data-validity checks (all counts should be 0).") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Data-validity checks (all counts should be 0).
check	n
Negative air_time	0
Negative taxi_out	0
Negative taxi_in	0
CRS departure timestamps missing	0
Actual departure timestamps missing	0
Months other than December	0
arr_delay outside [-180, 4000] min	0
dep_delay outside [-180, 4000] min	0

All data quality checks were passed successfully, with no invalid values, missing key timestamps, out-of-scope records, or implausible delay values detected. This indicates that the dataset is internally consistent and free from major data integrity issues. Consequently, subsequent cleaning efforts can focus on handling genuine extreme delays rather than correcting data errors.

2.6.4 Outlier Assessment

# Quantify the extreme right tail of arrival delay using the IQR rule.
q  <- quantile(df$arr_delay, c(.25, .75))
iqr <- diff(q)
upper_fence <- q[2] + 1.5 * iqr

df |>
  summarise(
    p95          = quantile(arr_delay, 0.95),
    p99          = quantile(arr_delay, 0.99),
    max          = max(arr_delay),
    upper_fence  = upper_fence,
    n_above_fence = sum(arr_delay > upper_fence),
    pct_above    = round(100 * mean(arr_delay > upper_fence), 2)) |>
  kable(caption = "Upper-tail outlier diagnostics for arr_delay (minutes).") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Upper-tail outlier diagnostics for arr_delay (minutes).
p95	p99	max	upper_fence	n_above_fence	pct_above
68	193	3795	43	92010	8.12

# Boxplot of arrival delay, with the x-axis clipped so the body is legible.
ggplot(df, aes(x = arr_delay)) +
  geom_boxplot(fill = report_fill, alpha = 0.7, outlier.alpha = 0.05) +
  coord_cartesian(xlim = c(-60, 300)) +
  labs(title = "Arrival delay is tightly centred with a long right tail",
       subtitle = "x-axis clipped to [-60, 300] min; rare delays extend beyond 3,700 min",
       x = "Arrival delay (minutes)", y = NULL) +
  theme(axis.text.y = element_blank())

Arrival delays exhibit a highly right-skewed distribution, with most flights experiencing little or no delay and a small number of flights experiencing extremely long delays. Although many observations are classified as statistical outliers, they represent genuine operational events rather than data errors and are therefore retained. This distributional pattern highlights the need for modelling approaches that can accommodate skewed data and reinforces the value of the binary delay classification used in this study.

2.6.5 Data Transformation and Feature Engineering

#  - drop the two constant columns (month, flights);
#  - derive the binary classification target;
#  - extract scheduled departure hour and a time-of-day band;
#  - add a weekend flag and human-readable weekday labels;
#  - construct a route identifier;
#  - set year and day_of_week as ordered factors for clean plotting.

weekday_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

df_clean <- df |>
  select(-month, -flights) |>                       # remove zero-information columns
  mutate(
    year        = factor(year),
    dep_hour    = as.integer(format(crs_dep_time, "%H")),
    time_of_day = cut(dep_hour,
                      breaks = c(-1, 5, 11, 16, 20, 24),
                      labels = c("Overnight (0-5)", "Morning (6-11)",
                                 "Afternoon (12-16)", "Evening (17-20)",
                                 "Late (21-23)")),
    weekday     = factor(weekday_labels[day_of_week], levels = weekday_labels),
    is_weekend  = day_of_week %in% c(6, 7),
    route       = paste(origin, dest, sep = "-"),
    # Classification target: significant arrival delay (> 15 minutes).
    is_delayed  = factor(if_else(arr_delay > 15, "Delayed", "OnTime"),
                         levels = c("OnTime", "Delayed"))
  )

cat("Cleaned dataset:", nrow(df_clean), "rows x", ncol(df_clean), "columns\n")

## Cleaned dataset: 1133043 rows x 39 columns

glimpse(df_clean |> select(year, dep_hour, time_of_day, weekday, is_weekend,
                           route, arr_delay, is_delayed))

## Rows: 1,133,043
## Columns: 8
## $ year        <fct> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023…
## $ dep_hour    <int> 5, 10, 18, 5, 10, 18, 5, 10, 18, 5, 10, 18, 5, 10, 18, 5, …
## $ time_of_day <fct> Overnight (0-5), Morning (6-11), Evening (17-20), Overnigh…
## $ weekday     <fct> Fri, Fri, Fri, Sat, Sat, Sat, Sun, Sun, Sun, Mon, Mon, Mon…
## $ is_weekend  <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, F…
## $ route       <chr> "DLH-MSP", "DLH-MSP", "DLH-MSP", "DLH-MSP", "DLH-MSP", "DL…
## $ arr_delay   <dbl> -2, -29, -19, -20, -15, -18, -24, -6, -27, -39, -30, -9, -…
## $ is_delayed  <fct> OnTime, OnTime, OnTime, OnTime, OnTime, OnTime, OnTime, On…

Data transformation produced an analysis-ready dataset while preserving all flight records. Constant variables (month and flights) were removed as they provide no predictive value, while new features were engineered to capture temporal patterns, travel routes, and delay outcomes. These transformations were designed to support the study’s research objectives and enhance subsequent analysis and modelling without reducing the dataset size.

saveRDS(df_clean, "df_clean.rds")

Phase 3: Exploratory Data Analysis

3.1 Target Variable Overview

# review target distribution
target_tab <- df_clean |>
  count(is_delayed) |>
  mutate(pct = round(100 * n / sum(n), 2))

target_tab |>
  kable(caption = "Class distribution of the delay target (arr_delay > 15 min).",
        format.args = list(big.mark = ",")) |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Class distribution of the delay target (arr_delay > 15 min).
is_delayed	n	pct
OnTime	931,892	82.25
Delayed	201,151	17.75

ggplot(target_tab, aes(is_delayed, n, fill = is_delayed)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = paste0(comma(n), "\n(", pct, "%)")), vjust = -0.2, size = 3.5) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  scale_fill_manual(values = c(OnTime = "#41ab5d", Delayed = "#e34a33")) +
  labs(title = "About one in six December flights is significantly delayed",
       x = NULL, y = "Number of flights") +
  theme(legend.position = "none")

Approximately 17.8% of flights are classified as delayed, while 82.2% are on time or early, resulting in a moderately imbalanced dataset. Although the imbalance is not extreme, it is sufficient to make accuracy an unreliable performance measure, as a naive model could achieve high accuracy by predicting all flights as on time. Consequently, classification models are evaluated using metrics such as balanced accuracy, precision, recall, specificity, and Cohen’s Kappa, which remain informative under class imbalance rather than relying on accuracy alone.

3.2 Univariate Analysis

3.2.1 Distribution of Arrival Delay

ggplot(filter(df_clean, arr_delay >= -60, arr_delay <= 180),
       aes(arr_delay)) +
  geom_histogram(binwidth = 5, fill = report_fill, colour = "white") +
  geom_vline(xintercept = 15, linetype = "dashed", colour = "red") +
  annotate("text", x = 22, y = Inf, label = "15-min\nthreshold",
           colour = "red", hjust = 0, vjust = 1.2, size = 3) +
  scale_y_continuous(labels = comma) +
  labs(title = "Arrival-delay distribution is right-skewed and peaks before zero",
       subtitle = "Clipped to [-60, 180] min for readability",
       x = "Arrival delay (minutes)", y = "Number of flights")

Arrival delays follow a right-skewed distribution, with most flights arriving on time or slightly early and a smaller proportion experiencing substantial delays. The long tail of severe delays highlights the complexity of modelling delay duration and suggests that methods capable of handling skewed outcomes may be more appropriate. The 15-minute threshold provides a practical distinction between routine variation and operationally significant delays, supporting its use as the classification boundary in this study.

3.2.2 Departure Delay

ggplot(filter(df_clean, dep_delay >= -40, dep_delay <= 180),
       aes(dep_delay)) +
  geom_histogram(binwidth = 5, fill = "#756bb1", colour = "white") +
  scale_y_continuous(labels = comma) +
  labs(title = "Departure delay mirrors arrival delay",
       subtitle = "Clipped to [-40, 180] min",
       x = "Departure delay (minutes)", y = "Number of flights")

Departure delays exhibit a distribution similar to arrival delays, with most flights departing on time or slightly early and a smaller number experiencing substantial delays. This pattern suggests a strong relationship between departure and arrival delays, as late departures often translate into late arrivals. Consequently, departure delay is expected to be one of the most influential predictors of arrival delay in subsequent modelling analyses.

3.2.3 Carrier Volume

df_clean |>
  count(op_carrier, sort = TRUE) |>
  ggplot(aes(reorder(op_carrier, n), n)) +
  geom_col(fill = report_fill) +
  geom_text(aes(label = comma(n)), hjust = -0.1, size = 2.8) +
  coord_flip(clip = "off") +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.18))) +
  labs(title = "Flight volume is concentrated among a few large carriers",
       x = "Carrier", y = "Number of flights")

Flight activity is concentrated among a small number of major carriers, with the top five airlines accounting for the majority of December flights. This uneven distribution means that some carriers have substantially more observations than others, which may influence how carrier information is encoded during modelling. The relationship between airline and delay performance is explored further in the bivariate analysis.

3.2.4 Weather Variable Distributions

# Origin-airport weather distributions on a common panel.
df_clean |>
  select(o_temp, o_prcp, o_wspd) |>
  pivot_longer(everything(), names_to = "var", values_to = "value") |>
  mutate(var = recode(var,
                      o_temp = "Temperature (°C)",
                      o_prcp = "Precipitation (mm)",
                      o_wspd = "Wind speed (km/h)")) |>
  ggplot(aes(value)) +
  geom_histogram(bins = 40, fill = "#2b8cbe", colour = "white") +
  facet_wrap(~ var, scales = "free", ncol = 3) +
  scale_y_continuous(labels = comma) +
  labs(title = "Origin-airport weather: cold, mostly dry, breezy December conditions",
       x = NULL, y = "Number of flights")

Weather conditions vary considerably across flights. Temperature follows a broad distribution, reflecting the diverse climates of U.S. airports, while precipitation is highly concentrated at zero, indicating that most flights depart under dry conditions. Wind speed is moderately right-skewed, with occasional periods of strong winds. These patterns suggest that weather-related effects may be non-linear, making threshold-based or categorical representations potentially more informative than simple linear relationships.

3.3 Bivariate Analysis

3.3.1 Departure Delay vs Arrival Delay (Numeric vs Numeric)

# Scatter on a 25k sample with a reference line
set.seed(2024)
df_clean |>
  slice_sample(n = 25000) |>
  ggplot(aes(dep_delay, arr_delay)) +
  geom_point(alpha = 0.08, colour = report_fill) +
  geom_abline(slope = 1, intercept = 0, colour = "red", linetype = "dashed") +
  coord_cartesian(xlim = c(-50, 300), ylim = c(-50, 300)) +
  labs(title = "Departure delay almost perfectly tracks arrival delay (r = 0.97)",
       subtitle = "25,000-flight sample; red line is y = x",
       x = "Departure delay (minutes)", y = "Arrival delay (minutes)")

Departure delay exhibits an exceptionally strong positive relationship with arrival delay, indicating that flights departing late are highly likely to arrive late. This identifies departure delay as the most influential operational predictor of arrival performance. However, its practical usefulness depends on when predictions are required, as departure delay may not be available for advance forecasting before the flight leaves the gate.

3.3.2 Weather Effects on Flight Delays

The raw Pearson correlations between weather and arr_delay are weak (≈ 0.03–0.04), which might suggest weather is irrelevant. Binning the weather variables reveals a very different, and much stronger, non-linear picture.

# Delay rate (% > 15 min) across binned weather conditions at the origin.
make_bin_summary <- function(data, var, breaks, labels, varname) {
  data |>
    filter(!is.na(.data[[var]])) |>
    mutate(bin = cut(.data[[var]], breaks = breaks, labels = labels)) |>
    group_by(bin) |>
    summarise(delay_rate = 100 * mean(arr_delay > 15), .groups = "drop") |>
    mutate(driver = varname)
}

prcp_b <- make_bin_summary(df_clean, "o_prcp",
            c(-Inf, 0, 1, 5, 10, Inf),
            c("0", "0-1", "1-5", "5-10", "10+"), "Precipitation (mm)")
wspd_b <- make_bin_summary(df_clean, "o_wspd",
            c(-Inf, 5, 15, 25, 40, Inf),
            c("0-5", "5-15", "15-25", "25-40", "40+"), "Wind speed (km/h)")
temp_b <- make_bin_summary(df_clean, "o_temp",
            c(-Inf, -10, 0, 10, 20, Inf),
            c("<-10", "-10-0", "0-10", "10-20", "20+"), "Temperature (°C)")

bind_rows(prcp_b, wspd_b, temp_b) |>
  ggplot(aes(bin, delay_rate, group = driver)) +
  geom_col(fill = "#2b8cbe") +
  geom_text(aes(label = paste0(round(delay_rate, 1), "%")), vjust = -0.3, size = 2.8) +
  facet_wrap(~ driver, scales = "free_x") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Weather effects on delay are strong but NON-linear",
       subtitle = "Delay rate (% of flights > 15 min late) by binned origin weather",
       x = NULL, y = "Delay rate (%)")

Weather conditions are associated with higher delay rates, but the relationship is clearly non-linear. Delay rates increase under heavier precipitation and stronger winds, while temperature exhibits a U-shaped pattern, with both very cold and very warm conditions linked to higher delays. These findings indicate that weather effects are driven by thresholds and extremes rather than simple linear relationships, suggesting that non-linear modelling approaches may be better suited to capturing their impact on flight delays.

3.3.3 Carrier Performance and Delay Rates

df_clean |>
  group_by(op_carrier) |>
  summarise(delay_rate = 100 * mean(arr_delay > 15),
            mean_delay  = mean(arr_delay), n = n(), .groups = "drop") |>
  ggplot(aes(reorder(op_carrier, delay_rate), delay_rate, fill = mean_delay)) +
  geom_col() +
  geom_text(aes(label = paste0(round(delay_rate, 1), "%")), hjust = -0.1, size = 2.7) +
  coord_flip(clip = "off") +
  scale_fill_gradient(low = "#41ab5d", high = "#e34a33",
                      name = "Mean delay (min)") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Punctuality varies widely across carriers",
       x = "Carrier", y = "Delay rate (% > 15 min)")

Delay performance varies noticeably across airlines, with some carriers experiencing substantially higher delay rates than others. Notably, airline size does not appear to be strongly associated with punctuality, as high-volume carriers do not necessarily perform better or worse than smaller operators. These findings suggest that carrier-specific operational factors contribute to delay performance, making op_carrier a meaningful predictor for subsequent modelling.

3.3.4 Delay by Year

year_summary <- df_clean |>
  group_by(year) |>
  summarise(n_flights      = n(),
            mean_arr_delay = round(mean(arr_delay), 2),
            median_arr     = median(arr_delay),
            delay_rate_pct = round(100 * mean(arr_delay > 15), 2),
            mean_dep_delay = round(mean(dep_delay), 2),
            .groups = "drop")

year_summary |>
  kable(caption = "Year-over-year delay comparison",
        format.args = list(big.mark = ",")) |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Year-over-year delay comparison
year	n_flights	mean_arr_delay	median_arr	delay_rate_pct	mean_dep_delay
2023	558,609	0.43	-8	15.05	8.16
2024	574,434	6.95	-6	20.38	13.34

ggplot(year_summary, aes(year, delay_rate_pct, fill = year)) +
  geom_col(width = 0.55) +
  geom_text(aes(label = paste0(delay_rate_pct, "%")), vjust = -0.3, size = 4) +
  scale_fill_manual(values = c("2023" = "#41ab5d", "2024" = "#e34a33")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "December 2024 Experienced Higher Delay Rates Than December 2023",
       x = NULL, y = "Delay rate (% > 15 min)") +
  theme(legend.position = "none")

Flight performance deteriorated noticeably between December 2023 and December 2024. Average arrival and departure delays increased, while the proportion of flights delayed by more than 15 minutes rose substantially despite only a modest increase in flight volume. These findings indicate that December 2024 was both busier and less punctual than the previous year. As a result, the year variable is likely to be informative for modelling, and both seasons should be represented during model training and evaluation to ensure the model can generalize across changing operating conditions.

3.4 Multivariate Analysis

3.4.1 Correlation Heatmap

corr_vars <- c("arr_delay", "dep_delay", "taxi_out", "taxi_in", "air_time",
               "crs_elapsed_time", "actual_elapsed_time",
               "o_temp", "o_prcp", "o_wspd", "d_temp", "d_prcp", "d_wspd",
               "day_of_month", "day_of_week")

cmat <- cor(select(df_clean, all_of(corr_vars)), use = "pairwise.complete.obs")

corrplot(cmat, method = "color", type = "upper",
         tl.col = "black", tl.cex = 0.75, tl.srt = 45,
         addCoef.col = "grey20", number.cex = 0.55,
         col = colorRampPalette(c("#2166ac", "white", "#b2182b"))(200),
         mar = c(0, 0, 1.5, 0),
         title = "Correlation matrix of numeric variables")

The correlation analysis reinforces earlier findings that departure delay is the strongest predictor of arrival delay, exhibiting a substantially stronger relationship than any other variable. It also reveals high correlations among several flight duration measures, indicating potential redundancy and multicollinearity. In contrast, weather variables show only weak linear relationships with delays, supporting the earlier conclusion that their effects are likely non-linear. These findings suggest that redundant duration variables should be reduced and that modelling approaches less sensitive to correlated predictors may be more appropriate than standard linear regression.

3.4.2 Time of Day × Year Interaction

df_clean |>
  group_by(year, dep_hour) |>
  summarise(delay_rate = 100 * mean(arr_delay > 15), n = n(), .groups = "drop") |>
  filter(n >= 200) |>
  ggplot(aes(dep_hour, delay_rate, colour = year)) +
  geom_line(linewidth = 1) +
  geom_point(size = 1.3) +
  scale_colour_manual(values = c("2023" = "#41ab5d", "2024" = "#e34a33")) +
  scale_x_continuous(breaks = seq(0, 23, 3)) +
  labs(title = "Delays build through the day and 2024 sits above 2023 at every hour",
       x = "Scheduled departure hour", y = "Delay rate (% > 15 min)",
       colour = "Year")

Delay rates increase steadily throughout the day, with early-morning flights experiencing the lowest risk of delay and evening departures the highest. This pattern reflects the accumulation and propagation of operational disruptions as the day progresses, highlighting the importance of departure time as a predictor of delays. Additionally, delay rates are consistently higher in 2024 across nearly all time periods, indicating that the year-over-year decline in punctuality was widespread rather than limited to specific parts of the day.

3.4.3 Day of Month Pattern

df_clean |>
  group_by(day_of_month) |>
  summarise(delay_rate = 100 * mean(arr_delay > 15), .groups = "drop") |>
  ggplot(aes(day_of_month, delay_rate)) +
  geom_line(colour = report_fill, linewidth = 1) +
  geom_point(colour = report_fill) +
  scale_x_continuous(breaks = seq(1, 31, 2)) +
  labs(title = "Delay risk spikes around the post-Christmas travel surge",
       x = "Day of December", y = "Delay rate (% > 15 min)")

Delay rates vary throughout the month, with the highest levels occurring during the late-December holiday travel period. This pattern suggests that holiday-related travel demand is a key contributor to flight delays, while early December remains comparatively less congested. The findings indicate that day-of-month effects contain meaningful predictive information and should be considered in subsequent modelling.

Phase 4: Modelling

This phase develops predictive models addressing two complementary analytical objectives derived from the research questions:

Can we predict whether a flight will be delayed by more than 15 mins? — answered through classification
Can we predict how long a flight will be delayed? — answered through regression

Both objectives are evaluated under two modelling scenarios:

Scenario A (Pre-Departure): Uses only features available before the flight departs — such as scheduled times, carrier, route, and weather forecasts. This represents a realistic advance-prediction setting applicable at the time of booking or check-in.
Scenario B (Post-Departure): Adds real-time operational features such as departure delay (dep_delay), taxi times, and air time. This represents a gate-arrival estimation setting once the flight is airborne.

4.1 Classification Modelling

The objective of the classification task is to predict whether a flight will arrive on time or experience a significant arrival delay. A binary target variable, is_delayed, was created from arrival delay, where flights arriving more than 15 minutes late were classified as Delayed and all others as OnTime.

Two modelling scenarios were evaluated. Model A utilised operational, temporal, geographical, and weather-related variables, while Model B additionally included departure delay (dep_delay). Logistic Regression and Random Forest were implemented and compared using Accuracy, Precision, Recall (Sensitivity), Balanced Accuracy, and Cohen’s Kappa.

4.1.1 Classification Data Preparation

Missing Value Treatment

Missing values in six weather-related variables (o_temp, o_prcp, o_wspd, d_temp, d_prcp, and d_wspd) were handled using median imputation. This approach ensured complete data for model training while remaining robust to outliers. Following imputation, no missing values remained in the dataset.

#=========================================================
# Load Cleaned Dataset
#=========================================================

df_clean <- readRDS("df_clean.rds")

dim(df_clean)

## [1] 1133043      39

#=========================================================
# Missing Value Treatment
#=========================================================

weather_cols <- c(
  "o_temp", "o_prcp", "o_wspd",
  "d_temp", "d_prcp", "d_wspd"
)

for(col in weather_cols) {
  df_clean[[col]][is.na(df_clean[[col]])] <-
    median(df_clean[[col]], na.rm = TRUE)
}

colSums(is.na(df_clean))

##             fl_date          op_carrier   op_carrier_fl_num              origin 
##                   0                   0                   0                   0 
##                dest        crs_dep_time            dep_time           dep_delay 
##                   0                   0                   0                   0 
##            taxi_out          wheels_off           wheels_on             taxi_in 
##                   0                   0                   0                   0 
##        crs_arr_time            arr_time           arr_delay    crs_elapsed_time 
##                   0                   0                   0                   0 
## actual_elapsed_time            air_time        day_of_month         day_of_week 
##                   0                   0                   0                   0 
##        origin_index          dest_index              o_temp              o_prcp 
##                   0                   0                   0                   0 
##              o_wspd              d_temp              d_prcp              d_wspd 
##                   0                   0                   0                   0 
##          o_latitude         o_longitude          d_latitude         d_longitude 
##                   0                   0                   0                   0 
##                year            dep_hour         time_of_day             weekday 
##                   0                   0                   0                   0 
##          is_weekend               route          is_delayed 
##                   0                   0                   0

Feature Selection

Variables that directly revealed the final arrival outcome were excluded to prevent target leakage. Consequently, arr_delay was removed because it was used to construct the target variable, is_delayed.

The route variable was also excluded as its information was already represented by the origin and dest variables. Finally, categorical variables were converted to factors to ensure compatibility with the classification algorithms.

#=========================================================
# Model A Dataset
# Excludes departure delay
#=========================================================

modelA_df <- df_clean %>%
  select(
    is_delayed,
    op_carrier,
    day_of_month,
    day_of_week,
    year,
    dep_hour,
    time_of_day,
    weekday,
    is_weekend,
    o_temp,
    o_prcp,
    o_wspd,
    d_temp,
    d_prcp,
    d_wspd,
    o_latitude,
    o_longitude,
    d_latitude,
    d_longitude
  )

glimpse(modelA_df)

## Rows: 1,133,043
## Columns: 19
## $ is_delayed   <fct> OnTime, OnTime, OnTime, OnTime, OnTime, OnTime, OnTime, O…
## $ op_carrier   <chr> "DL", "OO", "OO", "DL", "OO", "OO", "DL", "OO", "OO", "DL…
## $ day_of_month <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, …
## $ day_of_week  <dbl> 5, 5, 5, 6, 6, 6, 7, 7, 7, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, …
## $ year         <fct> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
## $ dep_hour     <int> 5, 10, 18, 5, 10, 18, 5, 10, 18, 5, 10, 18, 5, 10, 18, 5,…
## $ time_of_day  <fct> Overnight (0-5), Morning (6-11), Evening (17-20), Overnig…
## $ weekday      <fct> Fri, Fri, Fri, Sat, Sat, Sat, Sun, Sun, Sun, Mon, Mon, Mo…
## $ is_weekend   <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ o_temp       <dbl> -7.2, -6.7, 0.6, -8.3, -9.4, 0.6, -1.1, -0.6, 0.6, 0.0, -…
## $ o_prcp       <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.…
## $ o_wspd       <dbl> 7.6, 5.4, 16.6, 0.0, 0.0, 14.8, 14.8, 9.4, 9.4, 5.4, 0.0,…
## $ d_temp       <dbl> -2.8, -5.6, 2.2, -3.9, -5.6, 3.3, 0.6, 0.6, 2.2, 0.0, -3.…
## $ d_prcp       <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.…
## $ d_wspd       <dbl> 5.4, 7.6, 7.6, 0.0, 11.2, 16.6, 5.4, 0.0, 5.4, 0.0, 7.6, …
## $ o_latitude   <dbl> 46.84209, 46.84209, 46.84209, 46.84209, 46.84209, 46.8420…
## $ o_longitude  <dbl> -92.19365, -92.19365, -92.19365, -92.19365, -92.19365, -9…
## $ d_latitude   <dbl> 44.88055, 44.88055, 44.88055, 44.88055, 44.88055, 44.8805…
## $ d_longitude  <dbl> -93.21692, -93.21692, -93.21692, -93.21692, -93.21692, -9…

#=========================================================
# Train-Test Split
#=========================================================

set.seed(123)

split_obj <- initial_split(
  modelA_df,
  prop = 0.80,
  strata = is_delayed
)

train_data <- training(split_obj)
test_data  <- testing(split_obj)

dim(train_data)

## [1] 906433     19

dim(test_data)

## [1] 226610     19

#=========================================================
# Prepare Data for Classification Models
#=========================================================

train_data <- train_data %>%
  mutate(across(where(is.character), as.factor))

test_data <- test_data %>%
  mutate(across(where(is.character), as.factor))

train_data$is_weekend <- as.factor(train_data$is_weekend)
test_data$is_weekend  <- as.factor(test_data$is_weekend)

str(train_data)

## tibble [906,433 × 19] (S3: tbl_df/tbl/data.frame)
##  $ is_delayed  : Factor w/ 2 levels "OnTime","Delayed": 2 2 2 2 2 2 2 2 2 2 ...
##  $ op_carrier  : Factor w/ 15 levels "9E","AA","AS",..: 12 12 12 12 12 12 12 12 5 5 ...
##  $ day_of_month: num [1:906433] 6 9 19 24 26 27 30 30 3 4 ...
##  $ day_of_week : num [1:906433] 3 6 2 7 2 3 6 6 7 1 ...
##  $ year        : Factor w/ 2 levels "2023","2024": 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_hour    : int [1:906433] 18 17 5 5 5 17 17 10 13 12 ...
##  $ time_of_day : Factor w/ 5 levels "Overnight (0-5)",..: 4 4 1 1 1 4 4 2 3 3 ...
##  $ weekday     : Factor w/ 7 levels "Mon","Tue","Wed",..: 3 6 2 7 2 3 6 6 7 1 ...
##  $ is_weekend  : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 1 2 2 2 1 ...
##  $ o_temp      : num [1:906433] 1.1 1.7 -11.7 5 2.8 1.1 -2.2 -1.7 16.7 5.6 ...
##  $ o_prcp      : num [1:906433] 0 0.3 0 0 0.5 0 0 0 0 0 ...
##  $ o_wspd      : num [1:906433] 27.7 11.2 13 18.4 38.9 14.8 16.6 11.2 7.6 5.4 ...
##  $ d_temp      : num [1:906433] 7.2 2.2 -8.3 7.8 10 2.2 -2.2 -1.7 1.1 -3.3 ...
##  $ d_prcp      : num [1:906433] 0 0.3 0 0.4 1.5 0 0 0 0 0 ...
##  $ d_wspd      : num [1:906433] 22.3 22.3 9.4 24.1 25.9 0 20.5 9.4 7.6 13 ...
##  $ o_latitude  : num [1:906433] 46.8 46.8 46.8 46.8 46.8 ...
##  $ o_longitude : num [1:906433] -92.2 -92.2 -92.2 -92.2 -92.2 ...
##  $ d_latitude  : num [1:906433] 44.9 44.9 44.9 44.9 44.9 ...
##  $ d_longitude : num [1:906433] -93.2 -93.2 -93.2 -93.2 -93.2 ...

4.1.2 Logistic Regression – Model A

Logistic Regression was selected as the baseline classification model due to its simplicity and interpretability. The model estimates the probability that a flight belongs to the delayed class based on the relationship between the predictor variables and the binary target variable.

Model A was trained using operational, temporal, geographical, and weather-related variables. Model performance was evaluated using Accuracy, Precision, Recall, Balanced Accuracy, and Cohen’s Kappa.

#=========================================================
# Logistic Regression - Model A
#=========================================================
# Purpose:
# Establish a baseline classification model using
# operational, temporal, geographical and weather
# variables without departure delay information.
#=========================================================

log_model <- glm(
  is_delayed ~ .,
  data = train_data,
  family = binomial
)

summary(log_model)

## 
## Call:
## glm(formula = is_delayed ~ ., family = binomial, data = train_data)
## 
## Coefficients: (2 not defined because of singularities)
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -2.8789999  0.0500054 -57.574  < 2e-16 ***
## op_carrierAA                  0.3837809  0.0190314  20.166  < 2e-16 ***
## op_carrierAS                  0.5860142  0.0243522  24.064  < 2e-16 ***
## op_carrierB6                  0.7621720  0.0213970  35.620  < 2e-16 ***
## op_carrierDL                  0.0194867  0.0191469   1.018  0.30880    
## op_carrierF9                  0.5438666  0.0231450  23.498  < 2e-16 ***
## op_carrierG4                  0.4795950  0.0352921  13.589  < 2e-16 ***
## op_carrierHA                  0.5310647  0.0378163  14.043  < 2e-16 ***
## op_carrierMQ                  0.1633752  0.0229987   7.104 1.21e-12 ***
## op_carrierNK                  0.5726918  0.0221659  25.837  < 2e-16 ***
## op_carrierOH                  0.3898769  0.0226217  17.235  < 2e-16 ***
## op_carrierOO                  0.4025351  0.0198905  20.238  < 2e-16 ***
## op_carrierUA                  0.1494766  0.0196600   7.603 2.89e-14 ***
## op_carrierWN                  0.3262375  0.0187275  17.420  < 2e-16 ***
## op_carrierYX                 -0.1533109  0.0229023  -6.694 2.17e-11 ***
## day_of_month                  0.0212945  0.0003183  66.903  < 2e-16 ***
## day_of_week                   0.0572587  0.0016754  34.176  < 2e-16 ***
## year2024                      0.4081505  0.0057490  70.995  < 2e-16 ***
## dep_hour                      0.0638555  0.0020154  31.684  < 2e-16 ***
## time_of_dayMorning (6-11)     0.1326760  0.0208922   6.350 2.15e-10 ***
## time_of_dayAfternoon (12-16)  0.1636631  0.0270586   6.048 1.46e-09 ***
## time_of_dayEvening (17-20)    0.1098965  0.0337384   3.257  0.00112 ** 
## time_of_dayLate (21-23)      -0.1708269  0.0401194  -4.258 2.06e-05 ***
## weekdayTue                   -0.0884717  0.0103649  -8.536  < 2e-16 ***
## weekdayWed                   -0.0168882  0.0099229  -1.702  0.08876 .  
## weekdayThu                    0.0561261  0.0090770   6.183 6.28e-10 ***
## weekdayFri                    0.0424225  0.0087182   4.866 1.14e-06 ***
## weekdaySat                    0.0071342  0.0093160   0.766  0.44380    
## weekdaySun                           NA         NA      NA       NA    
## is_weekendTRUE                       NA         NA      NA       NA    
## o_temp                       -0.0012473  0.0005803  -2.149  0.03161 *  
## o_prcp                        0.1354156  0.0036347  37.256  < 2e-16 ***
## o_wspd                        0.0062396  0.0003514  17.758  < 2e-16 ***
## d_temp                        0.0024596  0.0005781   4.255 2.09e-05 ***
## d_prcp                        0.1185959  0.0036137  32.818  < 2e-16 ***
## d_wspd                        0.0079420  0.0003488  22.770  < 2e-16 ***
## o_latitude                    0.0016229  0.0007632   2.126  0.03347 *  
## o_longitude                   0.0018834  0.0002128   8.849  < 2e-16 ***
## d_latitude                   -0.0065656  0.0007689  -8.538  < 2e-16 ***
## d_longitude                   0.0063090  0.0002134  29.568  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 847749  on 906432  degrees of freedom
## Residual deviance: 812210  on 906395  degrees of freedom
## AIC: 812286
## 
## Number of Fisher Scoring iterations: 4

#=========================================================
# Generate Prediction Probabilities
#=========================================================
# Predict the probability that each flight belongs
# to the delayed class.
#=========================================================

log_prob <- predict(
  log_model,
  newdata = test_data,
  type = "response"
)

summary(log_prob)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02276 0.12126 0.16458 0.17751 0.22017 0.99670

#=========================================================
# Convert Probabilities into Class Labels
#=========================================================
# Flights with probability greater than 0.50 are
# classified as Delayed; otherwise OnTime.
#=========================================================

log_pred <- ifelse(
  log_prob > 0.5,
  "Delayed",
  "OnTime"
)

log_pred <- factor(
  log_pred,
  levels = levels(test_data$is_delayed)
)

table(log_pred)

## log_pred
##  OnTime Delayed 
##  226123     487

#=========================================================
# Evaluate Logistic Regression Model A
#=========================================================

log_cm <- confusionMatrix(
  log_pred,
  test_data$is_delayed,
  positive = "Delayed"
)

log_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction OnTime Delayed
##    OnTime  186166   39957
##    Delayed    213     274
##                                           
##                Accuracy : 0.8227          
##                  95% CI : (0.8212, 0.8243)
##     No Information Rate : 0.8225          
##     P-Value [Acc > NIR] : 0.3699          
##                                           
##                   Kappa : 0.0093          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.006811        
##             Specificity : 0.998857        
##          Pos Pred Value : 0.562628        
##          Neg Pred Value : 0.823295        
##              Prevalence : 0.177534        
##          Detection Rate : 0.001209        
##    Detection Prevalence : 0.002149        
##       Balanced Accuracy : 0.502834        
##                                           
##        'Positive' Class : Delayed         
##

Logistic Regression Model A achieved an accuracy of 82.27%. However, despite the relatively high accuracy, the model performed poorly in identifying delayed flights, achieving a recall of only 0.68%. In contrast, specificity reached 99.89%, indicating that the model primarily predicted the majority class (OnTime). This behaviour is reflected in the low balanced accuracy (50.28%) and Cohen’s Kappa (0.0093). Overall, Logistic Regression Model A served as a useful baseline but was ineffective for detecting flight delays in this dataset.

4.1.3 Random Forest – Model A

Random Forest was implemented to capture nonlinear relationships and interactions among the predictor variables. Unlike Logistic Regression, Random Forest combines multiple decision trees to improve predictive performance and robustness.

Model A was trained using operational, temporal, geographical, and weather-related variables. A Random Forest classifier consisting of 200 trees was developed and evaluated using the same training and testing datasets as Logistic Regression Model A.

#=========================================================
# Random Forest Model A
#=========================================================

library(ranger)

rf_model <- ranger(
  is_delayed ~ .,
  data = train_data,
  num.trees = 200,
  probability = TRUE,
  importance = "impurity",
  seed = 123
)

## Growing trees.. Progress: 29%. Estimated remaining time: 1 minute, 15 seconds.
## Growing trees.. Progress: 58%. Estimated remaining time: 44 seconds.
## Growing trees.. Progress: 87%. Estimated remaining time: 13 seconds.

rf_model

## Ranger result
## 
## Call:
##  ranger(is_delayed ~ ., data = train_data, num.trees = 200, probability = TRUE,      importance = "impurity", seed = 123) 
## 
## Type:                             Probability estimation 
## Number of trees:                  200 
## Sample size:                      906433 
## Number of independent variables:  18 
## Mtry:                             4 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.123461

#=========================================================
# Generate Random Forest Predictions
#=========================================================
# Predict class probabilities for each flight in the
# test dataset using the trained Random Forest model.
#=========================================================

rf_pred_prob <- predict(
  rf_model,
  data = test_data
)$predictions

head(rf_pred_prob)

##         OnTime    Delayed
## [1,] 0.9351981 0.06480194
## [2,] 0.9374902 0.06250980
## [3,] 0.8094519 0.19054806
## [4,] 0.9405006 0.05949938
## [5,] 0.9270507 0.07294933
## [6,] 0.8719644 0.12803561

#=========================================================
# Convert Probabilities into Class Labels
#=========================================================
# Flights with a predicted probability greater than
# 0.50 are classified as delayed.
#=========================================================

rf_pred_class <- ifelse(
  rf_pred_prob[, "Delayed"] > 0.5,
  "Delayed",
  "OnTime"
)

rf_pred_class <- factor(
  rf_pred_class,
  levels = levels(test_data$is_delayed)
)

table(rf_pred_class)

## rf_pred_class
##  OnTime Delayed 
##  214669   11941

#=========================================================
# Evaluate Random Forest Performance
#=========================================================
# Generate confusion matrix and classification
# performance metrics.
#=========================================================

rf_cm <- confusionMatrix(
  rf_pred_class,
  test_data$is_delayed,
  positive = "Delayed"
)

rf_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction OnTime Delayed
##    OnTime  182585   32084
##    Delayed   3794    8147
##                                           
##                Accuracy : 0.8417          
##                  95% CI : (0.8402, 0.8432)
##     No Information Rate : 0.8225          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2515          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.20251         
##             Specificity : 0.97964         
##          Pos Pred Value : 0.68227         
##          Neg Pred Value : 0.85054         
##              Prevalence : 0.17753         
##          Detection Rate : 0.03595         
##    Detection Prevalence : 0.05269         
##       Balanced Accuracy : 0.59107         
##                                           
##        'Positive' Class : Delayed         
##

Random Forest Model A achieved an accuracy of 84.17%, outperforming Logistic Regression Model A (82.27%). The model recorded a precision of 68.23% and a recall of 20.25%, demonstrating a substantially improved ability to identify delayed flights compared to Logistic Regression, which achieved a recall of only 0.68%.

The balanced accuracy increased to 59.11%, indicating improved performance across both classes while maintaining high specificity (97.96%). These results suggest that Random Forest was better able to capture nonlinear relationships and interactions among operational, temporal, geographical, and weather-related variables, resulting in superior classification performance when departure delay information was excluded.

4.1.4 Logistic Regression – Model B

Logistic Regression Model B was developed using the same predictors as Model A, with the addition of departure delay (dep_delay). This variable provides information about operational disruptions occurring before take-off and is expected to be strongly associated with arrival delays.

The inclusion of departure delay allows the model to evaluate whether predictive performance improves when real-time operational information is available. Model performance was assessed using Accuracy, Precision, Recall, Balanced Accuracy, and Cohen’s Kappa.

#=========================================================
# Construct Model B Dataset
#=========================================================
# Model B includes departure delay in addition to the
# operational, temporal, geographical and weather
# variables used in Model A.
#=========================================================

modelB_df <- df_clean %>%
  select(
    is_delayed,

    dep_delay,

    op_carrier,

    day_of_month,
    day_of_week,
    year,

    dep_hour,
    time_of_day,
    weekday,
    is_weekend,

    o_temp,
    o_prcp,
    o_wspd,

    d_temp,
    d_prcp,
    d_wspd,

    o_latitude,
    o_longitude,
    d_latitude,
    d_longitude
  )

dim(modelB_df)

## [1] 1133043      20

#=========================================================
# Training-Test Split
#=========================================================

set.seed(123)

splitB <- initial_split(
  modelB_df,
  prop = 0.80,
  strata = is_delayed
)

trainB <- training(splitB)
testB  <- testing(splitB)

dim(trainB)

## [1] 906433     20

dim(testB)

## [1] 226610     20

#=========================================================
# Prepare Data for Classification Models
#=========================================================

trainB <- trainB %>%
  mutate(across(where(is.character), as.factor))

testB <- testB %>%
  mutate(across(where(is.character), as.factor))

trainB$is_weekend <- as.factor(trainB$is_weekend)
testB$is_weekend  <- as.factor(testB$is_weekend)

str(trainB)

## tibble [906,433 × 20] (S3: tbl_df/tbl/data.frame)
##  $ is_delayed  : Factor w/ 2 levels "OnTime","Delayed": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dep_delay   : num [1:906433] 88 8 231 391 96 43 132 17 38 16 ...
##  $ op_carrier  : Factor w/ 15 levels "9E","AA","AS",..: 12 12 12 12 12 12 12 12 5 5 ...
##  $ day_of_month: num [1:906433] 6 9 19 24 26 27 30 30 3 4 ...
##  $ day_of_week : num [1:906433] 3 6 2 7 2 3 6 6 7 1 ...
##  $ year        : Factor w/ 2 levels "2023","2024": 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_hour    : int [1:906433] 18 17 5 5 5 17 17 10 13 12 ...
##  $ time_of_day : Factor w/ 5 levels "Overnight (0-5)",..: 4 4 1 1 1 4 4 2 3 3 ...
##  $ weekday     : Factor w/ 7 levels "Mon","Tue","Wed",..: 3 6 2 7 2 3 6 6 7 1 ...
##  $ is_weekend  : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 1 2 2 2 1 ...
##  $ o_temp      : num [1:906433] 1.1 1.7 -11.7 5 2.8 1.1 -2.2 -1.7 16.7 5.6 ...
##  $ o_prcp      : num [1:906433] 0 0.3 0 0 0.5 0 0 0 0 0 ...
##  $ o_wspd      : num [1:906433] 27.7 11.2 13 18.4 38.9 14.8 16.6 11.2 7.6 5.4 ...
##  $ d_temp      : num [1:906433] 7.2 2.2 -8.3 7.8 10 2.2 -2.2 -1.7 1.1 -3.3 ...
##  $ d_prcp      : num [1:906433] 0 0.3 0 0.4 1.5 0 0 0 0 0 ...
##  $ d_wspd      : num [1:906433] 22.3 22.3 9.4 24.1 25.9 0 20.5 9.4 7.6 13 ...
##  $ o_latitude  : num [1:906433] 46.8 46.8 46.8 46.8 46.8 ...
##  $ o_longitude : num [1:906433] -92.2 -92.2 -92.2 -92.2 -92.2 ...
##  $ d_latitude  : num [1:906433] 44.9 44.9 44.9 44.9 44.9 ...
##  $ d_longitude : num [1:906433] -93.2 -93.2 -93.2 -93.2 -93.2 ...

#=========================================================
# Logistic Regression Model B
#=========================================================
# Model B extends Model A by including departure delay
# as an additional predictor.
#=========================================================

log_model_B <- glm(
  is_delayed ~ .,
  data = trainB,
  family = binomial
)

summary(log_model_B)

## 
## Call:
## glm(formula = is_delayed ~ ., family = binomial, data = trainB)
## 
## Coefficients: (2 not defined because of singularities)
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -2.748e+00  8.034e-02 -34.203  < 2e-16 ***
## dep_delay                     1.307e-01  3.457e-04 378.114  < 2e-16 ***
## op_carrierAA                  8.837e-02  3.210e-02   2.753 0.005906 ** 
## op_carrierAS                  3.407e-01  3.962e-02   8.600  < 2e-16 ***
## op_carrierB6                  2.375e-01  3.671e-02   6.471 9.77e-11 ***
## op_carrierDL                 -1.631e-01  3.215e-02  -5.072 3.94e-07 ***
## op_carrierF9                  2.938e-01  3.955e-02   7.430 1.08e-13 ***
## op_carrierG4                  1.850e-01  6.219e-02   2.975 0.002934 ** 
## op_carrierHA                  4.566e-01  5.658e-02   8.071 6.97e-16 ***
## op_carrierMQ                  2.732e-01  3.737e-02   7.310 2.66e-13 ***
## op_carrierNK                 -1.668e-01  3.794e-02  -4.397 1.10e-05 ***
## op_carrierOH                  3.242e-01  3.866e-02   8.386  < 2e-16 ***
## op_carrierOO                  3.172e-01  3.382e-02   9.379  < 2e-16 ***
## op_carrierUA                  3.433e-02  3.295e-02   1.042 0.297371    
## op_carrierWN                 -3.197e-01  3.146e-02 -10.163  < 2e-16 ***
## op_carrierYX                  3.088e-01  3.742e-02   8.251  < 2e-16 ***
## day_of_month                  4.054e-05  5.129e-04   0.079 0.936997    
## day_of_week                   3.395e-02  2.732e-03  12.425  < 2e-16 ***
## year2024                      1.801e-01  9.299e-03  19.367  < 2e-16 ***
## dep_hour                     -4.665e-02  3.213e-03 -14.520  < 2e-16 ***
## time_of_dayMorning (6-11)     3.510e-01  3.214e-02  10.921  < 2e-16 ***
## time_of_dayAfternoon (12-16)  4.049e-01  4.245e-02   9.537  < 2e-16 ***
## time_of_dayEvening (17-20)    7.283e-01  5.351e-02  13.611  < 2e-16 ***
## time_of_dayLate (21-23)       6.753e-01  6.410e-02  10.535  < 2e-16 ***
## weekdayTue                    2.415e-02  1.676e-02   1.441 0.149666    
## weekdayWed                    1.778e-01  1.573e-02  11.306  < 2e-16 ***
## weekdayThu                    1.252e-01  1.468e-02   8.528  < 2e-16 ***
## weekdayFri                    9.705e-02  1.405e-02   6.906 4.98e-12 ***
## weekdaySat                   -3.837e-03  1.538e-02  -0.249 0.803002    
## weekdaySun                           NA         NA      NA       NA    
## is_weekendTRUE                       NA         NA      NA       NA    
## o_temp                        3.481e-03  9.247e-04   3.764 0.000167 ***
## o_prcp                        8.803e-02  5.623e-03  15.656  < 2e-16 ***
## o_wspd                        2.889e-03  5.702e-04   5.067 4.05e-07 ***
## d_temp                       -2.260e-03  9.225e-04  -2.450 0.014298 *  
## d_prcp                        1.189e-01  5.265e-03  22.586  < 2e-16 ***
## d_wspd                        1.129e-02  5.630e-04  20.045  < 2e-16 ***
## o_latitude                    9.253e-03  1.226e-03   7.545 4.53e-14 ***
## o_longitude                  -6.892e-03  3.401e-04 -20.267  < 2e-16 ***
## d_latitude                   -1.637e-02  1.222e-03 -13.399  < 2e-16 ***
## d_longitude                   1.131e-02  3.361e-04  33.662  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 847749  on 906432  degrees of freedom
## Residual deviance: 355790  on 906394  degrees of freedom
## AIC: 355868
## 
## Number of Fisher Scoring iterations: 8

#=========================================================
# Generate Prediction Probabilities
#=========================================================

log_prob_B <- predict(
  log_model_B,
  newdata = testB,
  type = "response"
)

summary(log_prob_B)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 2.301e-05 2.129e-02 3.596e-02 1.776e-01 1.126e-01 1.000e+00

#=========================================================
# Convert Probabilities to Classes
#=========================================================

log_pred_B <- ifelse(log_prob_B > 0.5,
                     "Delayed",
                     "OnTime")

log_pred_B <- factor(
  log_pred_B,
  levels = c("OnTime", "Delayed")
)

table(log_pred_B)

## log_pred_B
##  OnTime Delayed 
##  194548   32062

#=========================================================
# Evaluate Logistic Regression Model B
#=========================================================

library(caret)

log_cm_B <- confusionMatrix(
  log_pred_B,
  testB$is_delayed,
  positive = "Delayed"
)

log_cm_B

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction OnTime Delayed
##    OnTime  182972   11576
##    Delayed   3407   28655
##                                           
##                Accuracy : 0.9339          
##                  95% CI : (0.9329, 0.9349)
##     No Information Rate : 0.8225          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.754           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7123          
##             Specificity : 0.9817          
##          Pos Pred Value : 0.8937          
##          Neg Pred Value : 0.9405          
##              Prevalence : 0.1775          
##          Detection Rate : 0.1265          
##    Detection Prevalence : 0.1415          
##       Balanced Accuracy : 0.8470          
##                                           
##        'Positive' Class : Delayed         
##

Model B achieved an accuracy of 93.39%, substantially improving upon Logistic Regression Model A (82.27%). The inclusion of departure delay provided significant predictive power, enabling the model to better distinguish between delayed and on-time flights.

The model achieved a recall of 71.23% and a precision of 89.37%, indicating strong performance in identifying delayed flights while maintaining a low rate of false positives. Specificity remained high at 98.17%, demonstrating excellent performance in correctly classifying on-time flights. The balanced accuracy of 84.70% and Cohen’s Kappa of 0.7540 further indicate strong classification performance across both classes.

Compared with Model A, the substantial improvement in recall confirms that departure delay is a highly influential predictor of arrival delay. These findings suggest that incorporating real-time operational information significantly enhances flight delay prediction performance.

4.1.5 Random Forest – Model B

Random Forest Model B was developed using the same predictors as Random Forest Model A, with the addition of departure delay (dep_delay). The inclusion of departure delay allows the model to incorporate real-time operational information that may influence a flight’s final arrival status.

By combining this enhanced feature set with an ensemble learning approach, Random Forest Model B is designed to capture complex nonlinear relationships and interactions among operational, temporal, geographical, and weather-related variables. The model was trained using 200 decision trees and evaluated using Accuracy, Precision, Recall, Balanced Accuracy, and Cohen’s Kappa.

#=========================================================
# Random Forest Model B
#=========================================================
# Model B includes departure delay as an additional
# predictor variable.
#=========================================================

library(ranger)

rf_model_B <- ranger(
  is_delayed ~ .,
  data = trainB,
  num.trees = 200,
  probability = TRUE,
  importance = "impurity",
  seed = 123
)

## Growing trees.. Progress: 34%. Estimated remaining time: 1 minute, 0 seconds.
## Growing trees.. Progress: 69%. Estimated remaining time: 28 seconds.

rf_model_B

## Ranger result
## 
## Call:
##  ranger(is_delayed ~ ., data = trainB, num.trees = 200, probability = TRUE,      importance = "impurity", seed = 123) 
## 
## Type:                             Probability estimation 
## Number of trees:                  200 
## Sample size:                      906433 
## Number of independent variables:  19 
## Mtry:                             4 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.04750652

#=========================================================
# Generate Random Forest Predictions
#=========================================================
# Predict probability of each flight belonging
# to the delayed class.
#=========================================================

rf_pred_B <- predict(
  rf_model_B,
  data = testB
)

rf_prob_B <- rf_pred_B$predictions[, "Delayed"]

summary(rf_prob_B)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01530 0.03937 0.18118 0.14168 1.00000

#=========================================================
# Convert Probabilities to Classes
#=========================================================
# Flights with probability > 0.5 are classified
# as Delayed, otherwise OnTime.
#=========================================================

rf_class_B <- ifelse(
  rf_prob_B > 0.5,
  "Delayed",
  "OnTime"
)

rf_class_B <- factor(
  rf_class_B,
  levels = c("OnTime", "Delayed")
)

table(rf_class_B)

## rf_class_B
##  OnTime Delayed 
##  194409   32201

#=========================================================
# Evaluate Random Forest Model B
#=========================================================

library(caret)

rf_cm_B <- confusionMatrix(
  rf_class_B,
  testB$is_delayed,
  positive = "Delayed"
)

rf_cm_B

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction OnTime Delayed
##    OnTime  183557   10852
##    Delayed   2822   29379
##                                           
##                Accuracy : 0.9397          
##                  95% CI : (0.9387, 0.9406)
##     No Information Rate : 0.8225          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7758          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7303          
##             Specificity : 0.9849          
##          Pos Pred Value : 0.9124          
##          Neg Pred Value : 0.9442          
##              Prevalence : 0.1775          
##          Detection Rate : 0.1296          
##    Detection Prevalence : 0.1421          
##       Balanced Accuracy : 0.8576          
##                                           
##        'Positive' Class : Delayed         
##

Random Forest Model B was developed using the same feature set as Model A, with the addition of departure delay (dep_delay). By incorporating this operational variable, the model was able to capture both nonlinear relationships and real-time flight conditions that influence arrival delays.

The model achieved an accuracy of 93.97%, representing the highest classification performance among all models evaluated. It also achieved a precision of 91.24% and a recall of 73.03%, indicating strong performance in identifying delayed flights while maintaining a low false-positive rate. Furthermore, the model achieved a specificity of 98.49%, demonstrating excellent performance in correctly classifying on-time flights. The balanced accuracy of 85.76% and Cohen’s Kappa value of 0.7758 further confirm strong classification performance across both classes.

Compared with Logistic Regression Model B, Random Forest Model B produced slightly higher accuracy, recall, balanced accuracy, and Kappa, suggesting that nonlinear relationships and feature interactions contribute additional predictive value beyond what can be captured by a linear model. The results indicate that combining operational, temporal, geographical (coordinates), weather, and departure delay information with an ensemble learning approach provides the most effective solution for flight delay classification in this study.

4.1.6 Classification Model Comparison

Classification Model Performance Comparison
Model	Accuracy	Recall	Specificity	Balanced_Accuracy
Logistic Regression A	82.27	0.68	99.89	50.28
Random Forest A	84.17	20.25	97.96	59.11
Logistic Regression B	93.39	71.23	98.17	84.70
Random Forest B	93.97	73.03	98.49	85.76

Among the four classification models evaluated, Random Forest Model B achieved the best overall performance, with an accuracy of 93.97%, recall of 73.03%, specificity of 98.49%, and balanced accuracy of 85.76%. The inclusion of departure delay substantially improved predictive performance across both Logistic Regression and Random Forest models, highlighting its importance as a predictor of arrival delays.

Overall, Random Forest models consistently outperformed Logistic Regression models, suggesting that nonlinear relationships and interactions among operational, temporal, geographical, and weather-related variables contribute to flight delay outcomes. Consequently, Random Forest Model B was selected as the best-performing classification model for flight delay prediction.

***

## 4.2 Regression Modelling

The regression task predicts the **exact arrival delay duration in minutes** (`arr_delay`). 

Two algorithms are evaluated in order of increasing complexity:

- **Linear Regression** — interpretable baseline assuming linear feature relationships
- **Random Forest** — ensemble of decision trees capturing non-linear interactions

Both models are trained and evaluated under two scenarios:

| Scenario | Features Available | Use Case |
|----------|-------------------|----------|
| **A — Pre-Departure** | Schedule, carrier, route, weather forecast | Prediction at booking or check-in |
| **B — Post-Departure** | Above + `dep_delay`, taxi times, air time | Real-time gate-arrival estimation |

***

## 4.2.1 Data Preparation

### Loading the Dataset

The cleaned dataset saved during Phase 2 is loaded directly, ensuring consistency with all prior cleaning, feature engineering, and imputation decisions.

## 4.2 Load Cleaned Dataset

The cleaned dataset produced in Phase 2 is loaded. It contains 1,133,043 flight records with missing values confined to the six weather variables.


``` r
df_model <- readRDS("df_clean.rds")

cat("Rows:", nrow(df_model), "\n")

## Rows: 1133043

cat("Columns:", ncol(df_model), "\n")

## Columns: 39

Missing Value Imputation

Missing values are confined to the six weather variables at less than 0.1%of observations. Median imputation is applied as it is more robust than mean imputation for skewed distributions.

Before Imputation

weather_vars <- c("o_temp", "o_prcp", "o_wspd",
                  "d_temp", "d_prcp", "d_wspd")

df_model |>
  summarise(across(all_of(weather_vars), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(),
               names_to  = "variable",
               values_to = "n_missing") |>
  mutate(pct_missing = round(100 * n_missing / nrow(df_model), 4)) |>
  kable(caption = "Missing values before imputation.") |>
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover"))

Missing values before imputation.
variable	n_missing	pct_missing
o_temp	216	0.0191
o_prcp	216	0.0191
o_wspd	216	0.0191
d_temp	771	0.0680
d_prcp	771	0.0680
d_wspd	771	0.0680

Apply Median Imputation

df_model <- df_model |>
  mutate(across(all_of(weather_vars),
                ~ if_else(is.na(.x),
                          median(.x, na.rm = TRUE),
                          .x)))

After Imputation

df_model |>
  summarise(across(all_of(weather_vars), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(),
               names_to  = "variable",
               values_to = "n_missing") |>
  kable(caption = "Missing values after imputation (all should be 0).") |>
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover"))

Missing values after imputation (all should be 0).
variable	n_missing
o_temp	0
o_prcp	0
o_wspd	0
d_temp	0
d_prcp	0
d_wspd	0

cat("Total missing values remaining:", sum(is.na(df_model)), "\n")

## Total missing values remaining: 0

Feature Selection

Variables that would cause data leakage or are redundant are removed before modelling. Leakage occurs when information that would not be available at prediction time is included as a predictor, leading to overly optimistic model performance.

The following variables are excluded:

arr_time, actual_elapsed_time, wheels_off, wheels_on — only known after the flight lands
is_delayed — derived directly from arr_delay, our target variable
route — too many unique combinations; origin and dest already capture this
fl_date, crs_dep_time — replaced by engineered features dep_hour, day_of_month etc
op_carrier_fl_num — individual flight numbers do not generalise

reg_df <- df_model |>
  select(
    arr_delay,                          # target
    dep_delay, taxi_out, taxi_in,       # operational
    air_time, crs_elapsed_time,         # schedule
    dep_hour, time_of_day,              # temporal
    day_of_month, day_of_week,
    weekday, is_weekend, year,
    op_carrier, origin, dest,           # categorical
    o_temp, o_prcp, o_wspd,             # origin weather
    d_temp, d_prcp, d_wspd,             # destination weather
    o_latitude, o_longitude,            # geospatial
    d_latitude, d_longitude
  )

cat("Final dataset for modelling:", nrow(reg_df), "rows x", 
    ncol(reg_df), "columns\n")

## Final dataset for modelling: 1133043 rows x 26 columns

Train / Test Split

The dataset is split into 80% training and 20% testing using a random sample. A seed is set to ensure reproducibility.

set.seed(42)

train_index <- sample(1:nrow(reg_df), size = 0.8 * nrow(reg_df))

reg_train <- reg_df[train_index, ]
reg_test  <- reg_df[-train_index, ]

cat("Training set:", nrow(reg_train), "rows\n")

## Training set: 906434 rows

cat("Test set:", nrow(reg_test), "rows\n")

## Test set: 226609 rows

Preprocessing

Categorical variables such as op_carrier, origin, dest, time_of_day and weekday are converted to numeric using label encoding. Boolean and factor variables are also converted to integer for compatibility with all three models.

# Convert categoricals to numeric for all three models
encode_df <- function(df) {
  df |>
    mutate(
      op_carrier  = as.integer(factor(op_carrier)),
      origin      = as.integer(factor(origin)),
      dest        = as.integer(factor(dest)),
      time_of_day = as.integer(factor(time_of_day)),
      weekday     = as.integer(factor(weekday)),
      year        = as.integer(as.character(year)),
      is_weekend  = as.integer(is_weekend)
    )
}

reg_train_enc <- encode_df(reg_train)
reg_test_enc  <- encode_df(reg_test)

cat("Any missing values in training set:", sum(is.na(reg_train_enc)), "\n")

## Any missing values in training set: 0

cat("Any missing values in test set:", sum(is.na(reg_test_enc)), "\n")

## Any missing values in test set: 0

4.2.2 Scenario A: Pre-Departure Prediction (No `dep_delay`)

In this scenario, only features available before the flight departs are used. Variables such as dep_delay, taxi_out, taxi_in, and air_time are excluded because they are only known after the flight has operated. This represents a realistic advance-prediction setting.

safe_features <- c(
  "arr_delay",                        # target
  "crs_elapsed_time",                 # scheduled duration
  "dep_hour", "day_of_month",         # temporal
  "day_of_week", "is_weekend", "year",
  "op_carrier", "origin", "dest",     # categorical
  "time_of_day", "weekday",
  "o_temp", "o_prcp", "o_wspd",       # origin weather
  "d_temp", "d_prcp", "d_wspd",       # destination weather
  "o_latitude", "o_longitude",        # geospatial
  "d_latitude", "d_longitude"
)

reg_train_safe <- reg_train_enc |> select(all_of(safe_features))
reg_test_safe  <- reg_test_enc  |> select(all_of(safe_features))

cat("Features used (Scenario A):", ncol(reg_train_safe) - 1, "\n")

## Features used (Scenario A): 21

4.2.2.1 Linear Regression — Scenario A

Linear Regression serves as the interpretable baseline model. It estimates a linear relationship between the available pre-departure features and arrival delay. Given the non-linear delay patterns identified during EDA, Linear Regression is expected to underperform compared to the ensemble approach, but provides a useful performance benchmark.

lm_model_a <- lm(arr_delay ~ ., data = reg_train_safe)

lm_pred_a <- predict(lm_model_a, newdata = reg_test_safe)

lm_rmse_a <- sqrt(mean((reg_test_safe$arr_delay - lm_pred_a)^2))
lm_mae_a  <- mean(abs(reg_test_safe$arr_delay - lm_pred_a))
lm_rsq_a  <- cor(reg_test_safe$arr_delay, lm_pred_a)^2

cat("Linear Regression (Pre-Departure):\n")

## Linear Regression (Pre-Departure):

cat("RMSE:", round(lm_rmse_a, 3), "\n")

## RMSE: 54.135

cat("MAE: ", round(lm_mae_a,  3), "\n")

## MAE:  24.06

cat("R²:  ", round(lm_rsq_a,  3), "\n")

## R²:   0.018

4.2.2.2 Random Forest — Scenario A

Random Forest captures non-linear relationships and feature interactions that Linear Regression cannot model. It constructs an ensemble of decision trees, each trained on a random subset of data and features, with the final prediction averaged across all trees. To manage computational resources, the model is trained on a 20% sample of the training data with 100 trees.

library(ranger)

set.seed(42)
train_sample_a <- reg_train_safe |> slice_sample(prop = 0.2)

rf_model_a <- ranger(
  arr_delay ~ .,
  data          = train_sample_a,
  num.trees     = 100,
  mtry          = 5,
  min.node.size = 20,
  importance    = "impurity",
  seed          = 42
)

rf_pred_a <- predict(rf_model_a, data = reg_test_safe)$predictions

rf_rmse_a <- sqrt(mean((reg_test_safe$arr_delay - rf_pred_a)^2))
rf_mae_a  <- mean(abs(reg_test_safe$arr_delay - rf_pred_a))
rf_rsq_a  <- cor(reg_test_safe$arr_delay, rf_pred_a)^2

cat("Random Forest (Pre-Departure):\n")

## Random Forest (Pre-Departure):

cat("RMSE:", round(rf_rmse_a, 3), "\n")

## RMSE: 52.274

cat("MAE: ", round(rf_mae_a,  3), "\n")

## MAE:  22.725

cat("R²:  ", round(rf_rsq_a,  3), "\n")

## R²:   0.085

4.2.3 Scenario B: Post-Departure Prediction

In Scenario B, features that become available after the flight has departed are added — specifically dep_delay, taxi_out, taxi_in, and air_time. This represents a real-time gate-arrival estimation setting. As identified in the EDA phase, departure delay alone exhibits a correlation of r = 0.97 with arrival delay, and its inclusion is expected to substantially improve model performance.

post_features <- c(
  "arr_delay",                        # target
  "dep_delay",                        # key post-departure signal
  "taxi_out", "taxi_in",              # operational (post-departure)
  "air_time",                         # in-flight duration
  "crs_elapsed_time",                 # scheduled duration
  "dep_hour", "day_of_month",         # temporal
  "day_of_week", "is_weekend", "year",
  "op_carrier", "origin", "dest",     # categorical
  "time_of_day", "weekday",
  "o_temp", "o_prcp", "o_wspd",       # origin weather
  "d_temp", "d_prcp", "d_wspd",       # destination weather
  "o_latitude", "o_longitude",        # geospatial
  "d_latitude", "d_longitude"
)

reg_train_post <- reg_train_enc |> select(all_of(post_features))
reg_test_post  <- reg_test_enc  |> select(all_of(post_features))

cat("Features used (Scenario B):", ncol(reg_train_post) - 1, "\n")

## Features used (Scenario B): 25

4.2.3.1 Linear Regression — Scenario B

lm_model_b <- lm(arr_delay ~ ., data = reg_train_post)

lm_pred_b <- predict(lm_model_b, newdata = reg_test_post)

lm_rmse_b <- sqrt(mean((reg_test_post$arr_delay - lm_pred_b)^2))
lm_mae_b  <- mean(abs(reg_test_post$arr_delay - lm_pred_b))
lm_rsq_b  <- cor(reg_test_post$arr_delay, lm_pred_b)^2

cat("Linear Regression (Post-Departure):\n")

## Linear Regression (Post-Departure):

cat("RMSE:", round(lm_rmse_b, 3), "\n")

## RMSE: 0.173

cat("MAE: ", round(lm_mae_b,  3), "\n")

## MAE:  0.074

cat("R²:  ", round(lm_rsq_b,  3), "\n")

## R²:   1

4.2.3.2 Random Forest — Scenario B

set.seed(42)
train_sample_b <- reg_train_post |> slice_sample(prop = 0.2)

rf_model_b <- ranger(
  arr_delay ~ .,
  data          = train_sample_b,
  num.trees     = 100,
  mtry          = 5,
  min.node.size = 20,
  importance    = "impurity",
  seed          = 42
)

rf_pred_b <- predict(rf_model_b, data = reg_test_post)$predictions

rf_rmse_b <- sqrt(mean((reg_test_post$arr_delay - rf_pred_b)^2))
rf_mae_b  <- mean(abs(reg_test_post$arr_delay - rf_pred_b))
rf_rsq_b  <- cor(reg_test_post$arr_delay, rf_pred_b)^2

cat("Random Forest (Post-Departure):\n")

## Random Forest (Post-Departure):

cat("RMSE:", round(rf_rmse_b, 3), "\n")

## RMSE: 16.174

cat("MAE: ", round(rf_mae_b,  3), "\n")

## MAE:  7.031

cat("R²:  ", round(rf_rsq_b,  3), "\n")

## R²:   0.931

4.2.4 Regression Model Comparison

reg_results <- tibble(
  Scenario = c(rep("A: Pre-Departure",  2),
               rep("B: Post-Departure", 2)),
  Model    = rep(c("Linear Regression", "Random Forest"), 2),
  RMSE     = c(lm_rmse_a, rf_rmse_a, lm_rmse_b, rf_rmse_b),
  MAE      = c(lm_mae_a,  rf_mae_a,  lm_mae_b,  rf_mae_b),
  R2       = c(lm_rsq_a,  rf_rsq_a,  lm_rsq_b,  rf_rsq_b)
) |>
  mutate(across(c(RMSE, MAE, R2), ~ round(.x, 3)))

reg_results |>
  kable(caption  = "Regression model performance across both scenarios.",
        col.names = c("Scenario", "Model", "RMSE", "MAE", "R²")) |>
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover")) |>
  collapse_rows(columns = 1, valign = "middle")

Regression model performance across both scenarios.
Scenario	Model	RMSE	MAE	R²
A: Pre-Departure	Linear Regression	54.135	24.060	0.018
A: Pre-Departure	Random Forest	52.274	22.725	0.085
B: Post-Departure	Linear Regression	0.173	0.074	1.000
B: Post-Departure	Random Forest	16.174	7.031	0.931

4.2.5 Saving Regression Models

# Scenario A
saveRDS(lm_model_a, "lm_model_pre.rds")
saveRDS(rf_model_a, "rf_model_pre.rds")

# Scenario B
saveRDS(lm_model_b, "lm_model_post.rds")
saveRDS(rf_model_b, "rf_model_post.rds")

cat("All models saved successfully!\n")

## All models saved successfully!

4.2.6 Discussion

Scenario A — Pre-Departure

The low R² values across both pre-departure models reflect the inherent difficulty of predicting flight delays using only information available before departure. Real-time operational disruptions — such as late incoming aircraft and air traffic control intervention — cannot be captured by weather and schedule features alone. Weather variables showed only weak linear correlations with arrival delay (r ≈ 0.03–0.04) during EDA, and features such as carrier, route, and time of day provide only a general population-level signal rather than flight-specific predictions. Despite modest absolute performance, Random Forest is expected to outperform Linear Regression, confirming that delay relationships are non-linear in nature — consistent with the non-linear weather effects and time-of-day patterns identified during exploratory analysis.

Scenario B — Post-Departure

The inclusion of dep_delay, taxi_out, taxi_in, and air_time is expected to substantially improve both RMSE and R² across both models. Departure delay in particular acts as a near-direct signal of arrival delay (r = 0.97, EDA Phase 3), and its inclusion transforms the problem from a weak signal estimation task to a high-confidence operational calculation. This scenario is appropriate for real-time airline systems estimating gate-arrival times once a flight is airborne. The performance gap between Scenario A and Scenario B directly quantifies the informational value of post-departure operational data.

Phase 5: Evaluation & Comparison

This section evaluates the modelling results across both prediction scenarios. Instead of reviewing each model separately, the focus is on comparing how model performance changes when moving from Scenario A, which uses only pre-departure information, to Scenario B, which includes post-departure operational information.

5.1 Classification Model Evaluation

The classification task compared Logistic Regression and Random Forest models under two scenarios:

Scenario A: Pre-departure prediction without dep_delay
Scenario B: Post-departure prediction with dep_delay

classification_eval <- tibble(
  Model = c("Logistic Regression", "Random Forest",
            "Logistic Regression", "Random Forest"),
  Scenario = c("A: Pre-Departure", "A: Pre-Departure",
               "B: Post-Departure", "B: Post-Departure"),
  Accuracy = c(82.27, 84.17, 93.39, 93.97),
  Precision = c(NA, 68.23, 89.37, 91.24),
  Recall = c(0.68, 20.25, 71.23, 73.03),
  Specificity = c(99.89, 97.96, 98.17, 98.49),
  Balanced_Accuracy = c(50.28, 59.11, 84.70, 85.76),
  Kappa = c(0.0093, NA, 0.7540, 0.7758)
)

classification_eval |>
  kable(caption = "Classification model evaluation across both scenarios.") |>
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover"))

Classification model evaluation across both scenarios.
Model	Scenario	Accuracy	Precision	Recall	Specificity	Balanced_Accuracy	Kappa
Logistic Regression	A: Pre-Departure	82.27	NA	0.68	99.89	50.28	0.0093
Random Forest	A: Pre-Departure	84.17	68.23	20.25	97.96	59.11	NA
Logistic Regression	B: Post-Departure	93.39	89.37	71.23	98.17	84.70	0.7540
Random Forest	B: Post-Departure	93.97	91.24	73.03	98.49	85.76	0.7758

classification_eval |>
  select(Model, Scenario, Accuracy, Recall, Specificity, Balanced_Accuracy) |>
  pivot_longer(cols = c(Accuracy, Recall, Specificity, Balanced_Accuracy),
               names_to = "Metric",
               values_to = "Value") |>
  ggplot(aes(x = Scenario, y = Value, fill = Model)) +
  geom_col(position = "dodge") +
  facet_wrap(~ Metric, scales = "free_y") +
  labs(title = "Classification performance improves strongly in Scenario B",
       x = NULL,
       y = "Metric value (%)",
       fill = "Model") +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

In Scenario A, both classification models achieved relatively high accuracy because most flights were on time. However, recall was weak, especially for Logistic Regression, meaning that the model failed to identify most delayed flights. Random Forest performed better than Logistic Regression in Scenario A because it was able to capture non-linear patterns and interactions among weather, carrier, temporal, and airport-related variables.

In Scenario B, performance improved substantially for both models after adding dep_delay. This confirms that departure delay is the strongest predictor of arrival delay. Random Forest Model B achieved the best overall classification performance, with the highest accuracy, recall, specificity, balanced accuracy, and Kappa. Therefore, Random Forest Model B was selected as the best classification model.

5.2 Regression Model Evaluation

The regression task compared Linear Regression and Random Forest models under the same two scenarios:

Scenario A: Predicting arrival delay before departure
Scenario B: Predicting arrival delay after departure information becomes available

reg_results |>
  kable(caption = "Regression model evaluation across both scenarios.",
        col.names = c("Scenario", "Model", "RMSE", "MAE", "R²")) |>
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover"))

Regression model evaluation across both scenarios.
Scenario	Model	RMSE	MAE	R²
A: Pre-Departure	Linear Regression	54.135	24.060	0.018
A: Pre-Departure	Random Forest	52.274	22.725	0.085
B: Post-Departure	Linear Regression	0.173	0.074	1.000
B: Post-Departure	Random Forest	16.174	7.031	0.931

reg_results |>
  ggplot(aes(x = Scenario, y = RMSE, fill = Model)) +
  geom_col(position = "dodge") +
  labs(title = "Regression RMSE comparison across scenarios",
       x = NULL,
       y = "RMSE",
       fill = "Model") +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

reg_results |>
  ggplot(aes(x = Scenario, y = R2, fill = Model)) +
  geom_col(position = "dodge") +
  labs(title = "Regression R² comparison across scenarios",
       x = NULL,
       y = "R²",
       fill = "Model") +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

Scenario A produced weaker regression performance because only pre-departure information was available. Features such as weather, carrier, airport, date, and scheduled time provide useful general patterns, but they cannot fully capture real-time operational disruptions. This explains the lower R² values and higher prediction errors in the pre-departure setting.

Scenario B produced much stronger regression results because it included operational information such as dep_delay, taxi_out, taxi_in, and air_time. These variables are much closer to the final arrival outcome and therefore reduce prediction error. This confirms that post-departure information greatly improves the ability to estimate exact arrival delay duration.

5.3 Feature Importance Analysis

Feature importance was examined for the Random Forest models because they provide an interpretable ranking of influential predictors. This helps identify which variables contributed most strongly to delay prediction.

5.3.1 Classification Feature Importance

rf_imp_class <- tibble(
  Feature = names(rf_model_B$variable.importance),
  Importance = rf_model_B$variable.importance
) |>
  arrange(desc(Importance)) |>
  slice_head(n = 15)

rf_imp_class |>
  ggplot(aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_col(fill = report_fill) +
  coord_flip() +
  labs(title = "Top 15 feature importance values for Random Forest Classification Model B",
       x = NULL,
       y = "Importance")

The feature importance plot shows which predictors were most influential in classifying flights as delayed or on time. Departure delay is expected to dominate the classification model because it directly reflects whether the flight has already experienced an operational delay before arrival. Other important variables may include carrier, departure hour, airport location, weather conditions, and day-of-month effects.

5.3.2 Regression Feature Importance

rf_imp_reg <- tibble(
  Feature = names(rf_model_b$variable.importance),
  Importance = rf_model_b$variable.importance
) |>
  arrange(desc(Importance)) |>
  slice_head(n = 15)

rf_imp_reg |>
  ggplot(aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_col(fill = report_fill) +
  coord_flip() +
  labs(title = "Top 15 feature importance values for Random Forest Regression Model B",
       x = NULL,
       y = "Importance")

For the regression task, feature importance helps explain which variables were most useful in predicting the exact number of arrival delay minutes. Similar to the classification task, post-departure variables such as dep_delay, taxi_out, taxi_in, and air_time are expected to rank highly because they provide direct operational information about the flight’s progress.

5.4 Overall Evaluation Summary

Across both classification and regression tasks, Scenario B consistently outperformed Scenario A. This demonstrates that real-time operational variables provide substantially stronger predictive information than schedule, weather, carrier, and airport variables alone.

For classification, Random Forest Model B was the strongest model because it achieved the best balance between accuracy, recall, specificity, and balanced accuracy. For regression, the best-performing model should be selected based on the lowest RMSE and MAE, together with the highest R². Overall, the evaluation confirms that Random Forest models are more effective than simpler linear models because they can capture non-linear relationships and interactions among flight delay predictors.

The comparison also highlights an important practical distinction. Scenario A is more useful for early prediction before departure, but its accuracy is limited. Scenario B is much more accurate, but it can only be used after the flight has already departed. Therefore, the preferred model depends on the intended use case: advance planning or real-time operational prediction.

Conclusion

This project investigated U.S. domestic flight delays during the December holiday periods of 2023 and 2024 using the Aeolus dataset, which contains over 1.13 million flight records enriched with operational, temporal, weather, and airport-related information. Both classification and regression approaches were developed and evaluated to better understand the factors associated with flight delays and to assess the feasibility of predicting them.

The results consistently showed that operational information is the strongest driver of arrival delay, with departure delay emerging as the single most influential predictor. For the classification task, Random Forest achieved the best overall performance, reaching 94.0% accuracy, 73.0% recall, and 85.8% balanced accuracy. For the regression task, the best-performing Random Forest model achieved an RMSE of approximately 16 minutes and an R² of 0.93, demonstrating strong predictive capability once post-departure information becomes available.

A key finding of this study is the clear trade-off between prediction timing and predictive accuracy. While pre-departure models provide useful insights for advance planning and resource allocation, their predictive power remains limited. In contrast, post-departure models achieve substantially higher accuracy by leveraging real-time operational information.

Overall, the findings demonstrate the value of machine learning for flight-delay prediction and provide practical insights that can support both advance planning and real-time operational decision-making within the aviation industry.

References

U.S. Department of Transportation, Bureau of Transportation Statistics — Reporting Carrier On-Time Performance Data.
Aeolus — A Multi-Modal Flight Delay Dataset (https://www.kaggle.com/datasets/flnny123/mfddmulti-modal-flight-delay-dataset/versions/3/data), accessed June 2026.

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.7.4
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Kuala_Lumpur
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ranger_0.18.0       caret_7.0-1         lattice_0.22-9     
##  [4] rsample_1.3.2       kableExtra_1.4.0    knitr_1.51         
##  [7] patchwork_1.3.2     corrplot_0.95       scales_1.4.0       
## [10] skimr_2.2.2         janitor_2.2.1       data.table_1.18.2.1
## [13] lubridate_1.9.5     forcats_1.0.1       stringr_1.6.0      
## [16] dplyr_1.2.1         purrr_1.2.2         readr_2.2.0        
## [19] tidyr_1.3.2         tibble_3.3.1        ggplot2_4.0.3      
## [22] tidyverse_2.0.0    
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     viridisLite_0.4.3    timeDate_4052.112   
##  [4] farver_2.1.2         S7_0.2.1             fastmap_1.2.0       
##  [7] pROC_1.19.0.1        digest_0.6.39        rpart_4.1.24        
## [10] timechange_0.4.0     lifecycle_1.0.5      survival_3.8-6      
## [13] magrittr_2.0.4       compiler_4.5.2       rlang_1.2.0         
## [16] sass_0.4.10          tools_4.5.2          yaml_2.3.12         
## [19] plyr_1.8.9           xml2_1.5.2           repr_1.1.7          
## [22] RColorBrewer_1.1-3   withr_3.0.2          stats4_4.5.2        
## [25] nnet_7.3-20          grid_4.5.2           future_1.70.0       
## [28] globals_0.19.1       iterators_1.0.14     MASS_7.3-65         
## [31] cli_3.6.6            rmarkdown_2.30       generics_0.1.4      
## [34] future.apply_1.20.2  rstudioapi_0.18.0    reshape2_1.4.5      
## [37] tzdb_0.5.0           cachem_1.1.0         splines_4.5.2       
## [40] parallel_4.5.2       base64enc_0.1-6      vctrs_0.7.1         
## [43] hardhat_1.4.3        Matrix_1.7-4         jsonlite_2.0.0      
## [46] hms_1.1.4            listenv_0.10.1       systemfonts_1.3.2   
## [49] foreach_1.5.2        gower_1.0.2          jquerylib_0.1.4     
## [52] recipes_1.3.2        glue_1.8.0           parallelly_1.47.0   
## [55] codetools_0.2-20     stringi_1.8.7        gtable_0.3.6        
## [58] furrr_0.4.0          pillar_1.11.1        htmltools_0.5.9     
## [61] ipred_0.9-15         lava_1.9.0           R6_2.6.1            
## [64] textshaping_1.0.5    evaluate_1.0.5       snakecase_0.11.1    
## [67] bslib_0.10.0         class_7.3-23         Rcpp_1.1.1-1.1      
## [70] svglite_2.2.2        nlme_3.1-168         prodlim_2026.03.11  
## [73] xfun_0.56            ModelMetrics_1.2.2.2 pkgconfig_2.0.3

Predicting US December Flight Delays Using R

WQD7004 Group Project by Group 13

2026-06-14

Phase 1: Introduction

1.1 Background

1.2 Problem Statement

1.3 Objectives

1.4 Research Questions

Phase 2: Data Preparation

2.1 Data Source

2.2 Dataset Dimension

2.3 Variable Structure

2.4 Data Loading and Filtering

2.5 Data Inspection

2.5.1 Structure and Data Types

2.5.2 Constant and Redundant Columns

2.5.3 Numeric Summary

2.6 Data Cleaning

2.6.1 Missing Values

2.6.2 Duplicate Records

2.6.3 Invalid, Inconsistent, and Impossible Values

2.6.4 Outlier Assessment

2.6.5 Data Transformation and Feature Engineering

Phase 3: Exploratory Data Analysis

3.1 Target Variable Overview

3.2 Univariate Analysis

3.2.1 Distribution of Arrival Delay

3.2.2 Departure Delay

3.2.3 Carrier Volume

3.2.4 Weather Variable Distributions

3.3 Bivariate Analysis

3.3.1 Departure Delay vs Arrival Delay (Numeric vs Numeric)

3.3.2 Weather Effects on Flight Delays

3.3.3 Carrier Performance and Delay Rates

3.3.4 Delay by Year

3.4 Multivariate Analysis

3.4.1 Correlation Heatmap

3.4.2 Time of Day × Year Interaction

3.4.3 Day of Month Pattern

Phase 4: Modelling

4.1 Classification Modelling

4.1.1 Classification Data Preparation

Missing Value Treatment

Feature Selection

4.1.2 Logistic Regression – Model A

4.1.3 Random Forest – Model A

4.1.4 Logistic Regression – Model B

4.1.5 Random Forest – Model B

4.1.6 Classification Model Comparison

Missing Value Imputation

Before Imputation

Apply Median Imputation

After Imputation

Feature Selection

Train / Test Split

Preprocessing

4.2.2 Scenario A: Pre-Departure Prediction (No dep_delay)

4.2.2.1 Linear Regression — Scenario A

4.2.2.2 Random Forest — Scenario A

4.2.3 Scenario B: Post-Departure Prediction

4.2.3.1 Linear Regression — Scenario B

4.2.3.2 Random Forest — Scenario B

4.2.4 Regression Model Comparison

4.2.5 Saving Regression Models

4.2.6 Discussion

Scenario A — Pre-Departure

Scenario B — Post-Departure

Phase 5: Evaluation & Comparison

5.1 Classification Model Evaluation

5.2 Regression Model Evaluation

5.3 Feature Importance Analysis

5.3.1 Classification Feature Importance

5.3.2 Regression Feature Importance

5.4 Overall Evaluation Summary

Conclusion

References

4.2.2 Scenario A: Pre-Departure Prediction (No `dep_delay`)