Introduction
This analysis examines flight delays from Alaska Airlines and AM West.
We clean the data, explore trends, compare airline performance, and
attempt to forecast future delays. Due to dataset limitations,
alternative approaches are also discussed.
Load Necessary Libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(fable)
## Loading required package: fabletools
library(fabletools)
library(lubridate)
Read & Clean Data
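* The dataset is transformed from wide to long format for analysis.
* The table has no missing values; any missing Airline/Status combination
would be filled with 0 (values_fill = 0) when counts are later widened.
* A placeholder Date column (one consecutive day per row) is created so
the data can be treated as a time series. These are not real flight dates.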
# Create the dataset
airline_delays <- data.frame(
Airline = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
Status = c("On Time", "Delayed", "On Time", "Delayed"),
Los_Angeles = c(497, 62, 694, 117),
Phoenix = c(221, 12, 4840, 415),
San_Diego = c(212, 20, 383, 65),
San_Francisco = c(503, 102, 320, 129),
Seattle = c(1841, 305, 201, 61)
)
# Save CSV (Only if not already created)
if (!file.exists("airline_delays.csv")) {
write.csv(airline_delays, "airline_delays.csv", row.names = FALSE)
}
# Read the CSV file
df <- read_csv("airline_delays.csv", show_col_types = FALSE)
# Transform data from wide to long format
df_tidy <- df %>%
pivot_longer(cols = c("Los_Angeles", "Phoenix", "San_Diego", "San_Francisco", "Seattle"),
names_to = "Destination",
values_to = "Flight_Count") %>%
mutate(Date = seq.Date(from = as.Date("2023-01-01"), by = "days", length.out = n()))
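Because the Date column is generated from the row order, it helps to check which dates each Airline/Status series receives. The sketch below is a base R reconstruction, not the pipeline above: it assumes the long table keeps the row order that pivot_longer() produces (five destinations per row of the wide table), and the names `long`, `series_id`, and `date_span` are illustrative only.

```r
# Rebuild the key columns of the long table in the assumed row order:
# five destinations for each Airline/Status row of the wide table.
destinations <- c("Los_Angeles", "Phoenix", "San_Diego", "San_Francisco", "Seattle")
long <- data.frame(
  Airline = rep(c("ALASKA", "ALASKA", "AM WEST", "AM WEST"), each = 5),
  Status = rep(c("On Time", "Delayed", "On Time", "Delayed"), each = 5),
  Destination = rep(destinations, times = 4)
)

# Same date assignment as in the mutate() call: one consecutive day per row.
long$Date <- seq.Date(from = as.Date("2023-01-01"), by = "days", length.out = nrow(long))

# First and last date of each Airline/Status series.
series_id <- paste(long$Airline, long$Status, sep = " / ")
date_span <- tapply(format(long$Date), series_id, function(d) paste(min(d), "to", max(d)))
print(date_span)
```

Under these assumptions each series gets five consecutive days, and a "day" within a series is really a destination. Any trend a model picks up along this index reflects the column order of the wide table rather than time.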
Aggregate Data for Analysis
df_ts <- df_tidy %>%
group_by(Date, Airline, Status) %>%
summarise(Total_Flight_Count = sum(Flight_Count), .groups = "drop") %>%
as_tsibble(index = Date, key = c(Airline, Status))
Percentage-Based Comparison of Delays (Overall)
# Compute percentage of delayed flights per airline
delay_percentage <- df_tidy %>%
group_by(Airline, Status) %>%
summarise(Total_Flights = sum(Flight_Count), .groups = "drop") %>%
pivot_wider(names_from = Status, values_from = Total_Flights, values_fill = 0) %>%
mutate(Delay_Percentage = (Delayed / (Delayed + `On Time`)) * 100)
print(delay_percentage)
## # A tibble: 2 × 4
## Airline Delayed `On Time` Delay_Percentage
## <chr> <dbl> <dbl> <dbl>
## 1 ALASKA 501 3274 13.3
## 2 AM WEST 787 6438 10.9
# Visualization
ggplot(delay_percentage, aes(x = Airline, y = Delay_Percentage, fill = Airline)) +
geom_bar(stat = "identity") +
labs(title = "Percentage of Delayed Flights by Airline",
x = "Airline",
y = "Percentage of Delayed Flights (%)") +
theme_minimal()
Percentage-Based Comparison of Delays (City-Level)
# Compute delay percentage per city
delay_percentage_city <- df_tidy %>%
group_by(Airline, Destination, Status) %>%
summarise(Total_Flights = sum(Flight_Count), .groups = "drop") %>%
pivot_wider(names_from = Status, values_from = Total_Flights, values_fill = 0) %>%
mutate(Delay_Percentage = (Delayed / (Delayed + `On Time`)) * 100)
print(delay_percentage_city)
## # A tibble: 10 × 5
## Airline Destination Delayed `On Time` Delay_Percentage
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 ALASKA Los_Angeles 62 497 11.1
## 2 ALASKA Phoenix 12 221 5.15
## 3 ALASKA San_Diego 20 212 8.62
## 4 ALASKA San_Francisco 102 503 16.9
## 5 ALASKA Seattle 305 1841 14.2
## 6 AM WEST Los_Angeles 117 694 14.4
## 7 AM WEST Phoenix 415 4840 7.90
## 8 AM WEST San_Diego 65 383 14.5
## 9 AM WEST San_Francisco 129 320 28.7
## 10 AM WEST Seattle 61 201 23.3
# Visualization
ggplot(delay_percentage_city, aes(x = Destination, y = Delay_Percentage, fill = Airline)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Percentage of Delayed Flights Across Cities",
x = "City",
y = "Percentage of Delayed Flights (%)") +
theme_minimal()
Discrepancy Between Overall and City-Based Performance
## Observations:
# - Alaska Airlines has a higher overall delay percentage (13.3%) than AM West (10.9%).
# - However, AM West has the higher delay percentage in every one of the five cities.
# - San Francisco shows the highest delays for both airlines (16.9% and 28.7%).
# - Looking at total airline performance alone is therefore misleading (an instance of Simpson's paradox).
## Explanation:
# - The airlines fly different route mixes: most AM West flights go to Phoenix, which has the lowest delay rate, while most Alaska flights go to Seattle, where delays are more common.
# - City-level delays may also be influenced by external factors (e.g., weather, airport congestion).
# - This reinforces the importance of analyzing both overall and city-specific performance.
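The gap between the overall and city-level results can be traced with a short calculation: an airline's overall delay rate is the average of its city rates, weighted by each city's share of that airline's flights. The following base R sketch uses the counts from the airline_delays table; the matrix names (`on_time`, `delayed`, `share`, and so on) are illustrative and do not appear in the analysis above.

```r
# Counts copied from the airline_delays table (rows = airline, columns = city).
cities <- c("Los_Angeles", "Phoenix", "San_Diego", "San_Francisco", "Seattle")
on_time <- rbind(
  "ALASKA"  = c(497, 221, 212, 503, 1841),
  "AM WEST" = c(694, 4840, 383, 320, 201)
)
delayed <- rbind(
  "ALASKA"  = c(62, 12, 20, 102, 305),
  "AM WEST" = c(117, 415, 65, 129, 61)
)
colnames(on_time) <- colnames(delayed) <- cities

total <- on_time + delayed
city_rate <- delayed / total       # delay rate per airline and city
share <- total / rowSums(total)    # each city's share of the airline's flights

# Overall rate = sum over cities of (flight share) x (city delay rate).
overall_rate <- rowSums(share * city_rate)

print(round(100 * city_rate, 1))
print(round(100 * share, 1))
print(round(100 * overall_rate, 1))
```

The weighted sums reproduce the overall figures of 13.3% for Alaska and 10.9% for AM West. About 73% of AM West's flights go to Phoenix, its lowest-delay city, while about 57% of Alaska's flights go to Seattle, so the weights, not the city rates, drive the overall ranking.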
Forecasting Attempt: ETS Model
ets_model <- df_ts %>%
model(ETS(Total_Flight_Count))
forecast_results <- ets_model %>%
forecast(h = 30)
autoplot(forecast_results) +
facet_wrap(~Airline + Status) +
labs(title = "ETS Forecast of Flight Counts by Airline and Status (Next 30 Days)",
y = "Total Flight Count", x = "Date")
# Model accuracy
accuracy(ets_model)
## # A tibble: 4 × 12
## Airline Status .model .type ME RMSE MAE MPE MAPE MASE RMSSE
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA Delayed ETS(Tota… Trai… 56.6 102. 76.6 -33.1 134. NaN NaN
## 2 ALASKA On Time ETS(Tota… Trai… 236. 633. 419. -7.25 59.6 NaN NaN
## 3 AM WEST Delayed ETS(Tota… Trai… 0.469 132. 103. -58.5 83.3 NaN NaN
## 4 AM WEST On Time ETS(Tota… Trai… -1.22 1784. 1422. -219. 248. NaN NaN
## # ℹ 1 more variable: ACF1 <dbl>
# Check ETS model type
report(ets_model)
## Warning in report.mdl_df(ets_model): Model reporting is only supported for
## individual models, so a glance will be shown. To see the report for a specific
## model, use `select()` and `filter()` to identify a single model.
## # A tibble: 4 × 11
## Airline Status .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA Delayed ETS(To… 8.36e0 -25.4 56.8 80.8 55.6 1.04e4 2.08e4 1.87e+0
## 2 ALASKA On Time ETS(To… 3.23e0 -35.4 76.8 101. 75.6 4.01e5 6.93e5 9.95e-1
## 3 AM WEST Delayed ETS(To… 2.89e4 -28.4 62.8 86.8 61.7 1.73e4 1.49e4 1.03e+2
## 4 AM WEST On Time ETS(To… 5.30e6 -41.5 88.9 113. 87.7 3.18e6 2.68e6 1.42e+3
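The warning above notes that report() only describes one model at a time. A minimal sketch of narrowing the model table to a single series first is shown below. It rebuilds the four five-point series by hand so that it runs on its own; the object names `series`, `fits`, and `single_fit` are illustrative, and the placeholder dates are assumed to follow the row order of the long table.

```r
library(dplyr)
library(tsibble)
library(fable)

# Rebuild the four series with the placeholder dates (assumed row order).
series <- tsibble(
  Airline = rep(c("ALASKA", "ALASKA", "AM WEST", "AM WEST"), each = 5),
  Status = rep(c("On Time", "Delayed", "On Time", "Delayed"), each = 5),
  Date = as.Date("2023-01-01") + 0:19,
  Total_Flight_Count = c(
    497, 221, 212, 503, 1841,   # ALASKA, On Time
    62, 12, 20, 102, 305,       # ALASKA, Delayed
    694, 4840, 383, 320, 201,   # AM WEST, On Time
    117, 415, 65, 129, 61       # AM WEST, Delayed
  ),
  key = c(Airline, Status),
  index = Date
)

fits <- series %>%
  model(ets = ETS(Total_Flight_Count))

# Keep one row of the model table, then report that single model.
single_fit <- fits %>%
  filter(Airline == "ALASKA", Status == "Delayed")
report(single_fit)
```

With one model selected, report() prints the chosen ETS specification and its parameter estimates instead of the glance table.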
Alternative Approach: ARIMA Model
arima_model <- df_ts %>%
model(ARIMA(Total_Flight_Count))
forecast_results_arima <- arima_model %>%
forecast(h = 30)
autoplot(forecast_results_arima) +
facet_wrap(~Airline + Status) +
labs(title = "ARIMA Forecast of Flight Counts by Airline and Status (Next 30 Days)",
y = "Total Flight Count", x = "Date")
Baseline Comparison: Naïve Forecasting
naive_model <- df_ts %>%
model(NAIVE(Total_Flight_Count))
forecast_results_naive <- naive_model %>%
forecast(h = 30)
autoplot(forecast_results_naive) +
facet_wrap(~Airline + Status) +
labs(title = "Naïve Forecast of Flight Counts by Airline and Status (Next 30 Days)",
y = "Total Flight Count", x = "Date")
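To compare the three approaches on the same footing, one option is to fit them side by side in a single model() call and score them together with accuracy(). The sketch below does this for one series only (Alaska, Delayed), rebuilt by hand with its placeholder dates. The names `alaska_delayed`, `fits`, and `acc` are illustrative, and the code assumes the packages loaded earlier and their dependencies are installed.

```r
library(dplyr)
library(tsibble)
library(fable)

# One five-point series: Alaska delayed counts on the placeholder dates.
alaska_delayed <- tsibble(
  Date = as.Date("2023-01-06") + 0:4,
  Total_Flight_Count = c(62, 12, 20, 102, 305),
  index = Date
)

# Fit all three models in one model table so they share the same data.
fits <- alaska_delayed %>%
  model(
    ets = ETS(Total_Flight_Count),
    arima = ARIMA(Total_Flight_Count),
    naive = NAIVE(Total_Flight_Count)
  )

# Training-set accuracy, one row per model.
acc <- accuracy(fits) %>%
  select(.model, RMSE, MAE, MAPE)
print(acc)
```

These are in-sample measures on five points, so they indicate how closely each model tracks the data it was fitted to and say little about out-of-sample forecast quality.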
Conclusion
Predicting flight delays is challenging, and our analysis showed that
traditional forecasting models struggle with accuracy on this dataset.
The ETS models fit poorly (training MAPE between about 60% and 248%),
which is expected when each series has only five observations. ARIMA
appeared to offer minor improvements but still lacked strong predictive
power. The simple Naïve model, which assumes tomorrow’s count will be
the same as today’s, appeared to do about as well as the more complex
models, although only the ETS fit was scored with accuracy(). This
suggests that past counts alone are not enough to predict future trends.
A deeper look at city-by-city delays revealed large differences between airports: delay percentages ranged from 5.15% (Alaska, Phoenix) to 28.7% (AM West, San Francisco). Factors like weather, airport congestion, and airline scheduling likely have a major impact, but our dataset does not capture them.
To improve delay predictions, we need more detailed data, including weather conditions, time of day, and airport congestion levels. Without these factors, time-series forecasting alone does not provide much value. Future work could explore machine learning models that incorporate additional variables to better understand what drives airline delays.