Flight Delay Analysis

Course: WQD7004 Programming for Data Science ─ (OCC1) GROUP 3

Group Members:

#	Name	Matric No.
1	BONG YICK LING	25081955
2	TAN PHAIK YEE	S2120650
3	LEE KENT	25071713
4	NUR ATHIRAH BATRISYIA BINTI HAMIZUL	25076033
5	FARAH ANNISA BINTI NORHISHAM	25092902

Dataset: Flight Delay & Cancellation Analysis (Kaggle)

Analysis Overview

This exploratory data analysis examines U.S. domestic flight delays in early 2024 (January–February), using approximately one million flight records from the Bureau of Transportation Statistics. The dataset captures flight schedules, origin airports, taxi times, air time, distance, and delay minutes attributed to weather and late-arriving aircraft.

The analysis is structured around six research questions, each corresponding to a chart-based investigation:

Delay Cause Breakdown — Is weather or late-arriving aircraft the dominant source of delay minutes?
State-Level Delay — Which states experience the highest average delays, and what drives them?
Monthly Trends — How did delay rates change between January and February 2024?
Day-of-Week Performance — Which days show the worst operational performance?
Distance–Air Time Relationship — How strongly correlated are flight distance and duration? (Regression)
Airport Performance Quadrant — How do airports compare on taxi-out efficiency versus delay rate? (Classification)

Together, these analyses identify operational bottlenecks, quantify the role of cascade delays versus weather disruptions, and classify airport performance to support airline operations planning.

EDA Roadmap

The analysis progresses through five phases, from data preparation to airport performance evaluation:

Phase	Checkpoint	Focus
Data Preparation	1 ─ Dataset Integrity	Schema validation, column types, duplicate detection
	2 ─ Data Quality	Missing value analysis
	3 ─ Data Cleaning & Feature Engineering	Type casting, deduplication, imputation, invalid row removal, feature engineering
Operational Metrics	4 ─ Taxi-Out & Distance	Top airports by taxi-out time; distance–air time correlation; speed distribution
Delay Analysis	5 ─ Delay Cause Breakdown	Weather vs late aircraft delay: proportion and volume
	6 ─ State-Level Delay	Top 15 states by average delay, cause breakdown
Temporal Patterns	7 ─ Monthly Trends	January vs February delay rate comparison
	8 ─ Day-of-Week Performance	Delay rate and taxi-out time patterns across days
Airport Performance	9 ─ Airport Quadrant	Taxi-out vs delay rate 4-quadrant classification

Research Objectives & Questions

#	Main Objective	Research Problem
1	To identify the main cause and geographical hotspots of flight delays.	Is weather a major contributor to delays?
2		Which states have the highest average flight delays and what are the main contributing delay causes?
3	To analyze temporal trends in flight performance by identifying peak periods of delays across months and days of the week.	Which month has the highest delay frequency?
4		Which day of week has the worst operational performance?
5	To examine the relationships among flight variables to identify the key drivers of flight performance.	Is there a strong relationship between flight distance and airtime?
6		How do taxi-out time, delay rate, and flight volume affect airport performance and contribute to flight delays?

Flight Operations - Which origin airports experience the longest taxi-out times? - Which states have the highest average delays?

Temporal Patterns - Which month has the highest delay frequency? - Which day of week has the worst operational performance?

Delay Root Causes - Is weather a major contributor to delays? - How frequently are delays caused by late aircraft?

Efficiency Metrics - Is there a strong relationship between distance and air_time? - Do longer taxi times correlate with higher delays?

Airport Performance - Which airports have the highest delay rate? - Which airports show the most efficient operations?

Data Preparation

Imports & Configuration
Dataset Integrity
Data Quality Assessment
Data Cleaning & Feature Engineering

Analysis

Delay Cause Breakdown
State-Level Delay
Monthly Flight Volume & Delay Rate
Day-of-Week Operational Performance
Distance–Air Time & Speed Analysis
Airport Performance Quadrant

1. Imports, Configuration & Data Loading

All required libraries are imported, global settings are configured, and the raw dataset is loaded. The notebook is self-contained and uses only packages available in the standard R environment.

# ── Standard Library ─────────────────────────────────────────────────────────
suppressWarnings({
  suppressPackageStartupMessages({
    library(dplyr)
    library(tidyr)
    library(readr)
    library(ggplot2)
  })
})

# ── Settings ──────────────────────────────────────────────────────────────────
options(warn = -1)
theme_set(theme_minimal(base_size = 11))
PALETTE <- c(
  "#4C78A8", "#F58518", "#54A24B", "#E45756",
  "#72B7B2", "#EECA3B", "#B279A2", "#FF9DA6"
)

MONTH_LABELS <- c(
  `1` = "Jan", `2` = "Feb", `3` = "Mar", `4` = "Apr",
  `5` = "May", `6` = "Jun", `7` = "Jul", `8` = "Aug",
  `9` = "Sep", `10` = "Oct", `11` = "Nov", `12` = "Dec"
)
DOW_LABELS <- c(
  `1` = "Mon", `2` = "Tue", `3` = "Wed", `4` = "Thu",
  `5` = "Fri", `6` = "Sat", `7` = "Sun"
)

fmt_int <- function(x) {
  format(x, big.mark = ",", scientific = FALSE, trim = TRUE)
}

infer_dtype_label <- function(x) {
  if (inherits(x, "integer")) return("Int64")
  if (inherits(x, "numeric")) return("Float64")
  if (inherits(x, "character")) return("String")
  if (inherits(x, "logical")) return("Boolean")
  if (inherits(x, "Date")) return("Date")
  if (inherits(x, "POSIXct")) return("Datetime")
  "Unknown"
}

cat("✅ Libraries loaded successfully.\n")

## ✅ Libraries loaded successfully.

cat(sprintf("   readr  : %s\n", as.character(packageVersion("readr"))))

##    readr  : 2.2.0

cat(sprintf("   dplyr  : %s\n", as.character(packageVersion("dplyr"))))

##    dplyr  : 1.2.1

cat(sprintf("   ggplot2: %s\n", as.character(packageVersion("ggplot2"))))

##    ggplot2: 4.0.3

# ── Data Path & Loading ──────────────────────────────────────────────────────
DATA_PATH <- "flight_data_2024.csv"

# Additional libraries used in later analysis
suppressWarnings({
  suppressPackageStartupMessages({
    library(scales)
    library(patchwork)
  })
})

df_raw <- read_csv(
  DATA_PATH,
  guess_max  = 10000,
  na         = c("", "NA", "N/A", "null"),
  show_col_types = FALSE,
  progress   = FALSE
)

cat(sprintf("Dataset loaded: %s rows x %s columns\n", fmt_int(nrow(df_raw)), ncol(df_raw)))

## Dataset loaded: 1,048,575 rows x 18 columns

2. Dataset Integrity

Goals: - Verify all 18 required columns are present - Check column dtype alignment with the expected schema - Identify exact duplicate records

# ── Expected schema ────────────────────────────────────────────────────────────
REQUIRED_COLUMNS <- c(
  "year", "month", "day_of_month", "day_of_week", "fl_date",
  "origin", "origin_city_name", "origin_state_nm", "dep_time",
  "taxi_out", "wheels_off", "wheels_on", "taxi_in", "cancelled",
  "air_time", "distance", "weather_delay", "late_aircraft_delay"
)

EXPECTED_SCHEMA <- c(
  year = "Int64", month = "Int64", day_of_month = "Int64", day_of_week = "Int64",
  fl_date = "String", origin = "String", origin_city_name = "String",
  origin_state_nm = "String", dep_time = "Int64", taxi_out = "Float64",
  wheels_off = "Int64", wheels_on = "Int64", taxi_in = "Float64",
  cancelled = "Int64", air_time = "Float64", distance = "Float64",
  weather_delay = "Float64", late_aircraft_delay = "Float64"
)

# ── Column presence check ──────────────────────────────────────────────────────
missing_cols <- setdiff(REQUIRED_COLUMNS, names(df_raw))
cat("=== Required Column Check ===\n")

## === Required Column Check ===

if (length(missing_cols) > 0) {
  cat(sprintf("⚠️  Missing columns: %s\n", paste(missing_cols, collapse = ", ")))
} else {
  cat(sprintf("✅ All %s required columns are present.\n", length(REQUIRED_COLUMNS)))
}

## ✅ All 18 required columns are present.

# ── Column type overview ───────────────────────────────────────────────────────
cat("\n=== Column Dtype Overview ===\n")

## 
## === Column Dtype Overview ===

type_rows <- data.frame(
  column = names(df_raw),
  actual_dtype = vapply(df_raw, infer_dtype_label, character(1)),
  stringsAsFactors = FALSE,
  check.names = FALSE
)
type_rows$expected_dtype <- unname(EXPECTED_SCHEMA[type_rows$column])
type_rows$expected_dtype[is.na(type_rows$expected_dtype)] <- "—"
type_rows$match <- ifelse(
  type_rows$actual_dtype == type_rows$expected_dtype | !(type_rows$column %in% names(EXPECTED_SCHEMA)),
  "✅",
  "⚠️"
)
print(type_rows, row.names = FALSE)

##               column actual_dtype expected_dtype match
##                 year      Float64          Int64     ⚠️
##                month      Float64          Int64     ⚠️
##         day_of_month      Float64          Int64     ⚠️
##          day_of_week      Float64          Int64     ⚠️
##              fl_date       String         String    ✅
##               origin       String         String    ✅
##     origin_city_name       String         String    ✅
##      origin_state_nm       String         String    ✅
##             dep_time      Float64          Int64     ⚠️
##             taxi_out      Float64        Float64    ✅
##           wheels_off      Float64          Int64     ⚠️
##            wheels_on      Float64          Int64     ⚠️
##              taxi_in      Float64        Float64    ✅
##            cancelled      Float64          Int64     ⚠️
##             air_time      Float64        Float64    ✅
##             distance      Float64        Float64    ✅
##        weather_delay      Float64        Float64    ✅
##  late_aircraft_delay      Float64        Float64    ✅

# ── Duplicate detection ────────────────────────────────────────────────────────
total_rows <- nrow(df_raw)
unique_rows <- nrow(unique(df_raw))
duplicate_count <- total_rows - unique_rows

cat("\n=== Duplicate Records ===\n")

## 
## === Duplicate Records ===

cat(sprintf("Total rows    : %s\n", fmt_int(total_rows)))

## Total rows    : 1,048,575

cat(sprintf("Unique rows   : %s\n", fmt_int(unique_rows)))

## Unique rows   : 1,041,174

cat(sprintf("Duplicates    : %s  (%.2f%%)\n", fmt_int(duplicate_count), duplicate_count / total_rows * 100))

## Duplicates    : 7,401  (0.71%)

if (duplicate_count == 0) {
  cat("✅ No exact duplicate rows detected.\n")
} else {
  cat(sprintf("⚠️  %s duplicate rows will be removed during cleaning.\n", fmt_int(duplicate_count)))
}

## ⚠️  7,401 duplicate rows will be removed during cleaning.

3. Data Quality Assessment

Goals: - Visualise missing value distribution per column

Note: Cancelled flights are expected to have nulls in operational columns (taxi_out, taxi_in, air_time, dep_time, wheels_off, wheels_on) — this is not a data error. Range validation (e.g. air_time > 0, distance > 0) is performed during data cleaning in the next section.

# ── Missing value summary ──────────────────────────────────────────────────────
null_counts <- data.frame(
  column = names(df_raw),
  null_count = vapply(df_raw, function(x) sum(is.na(x)), numeric(1)),
  stringsAsFactors = FALSE,
  check.names = FALSE
)
null_counts$null_pct <- round(null_counts$null_count / nrow(df_raw) * 100, 2)
null_counts <- null_counts[order(-null_counts$null_count), ]

cat("=== Missing Value Summary ===\n")

## === Missing Value Summary ===

print(null_counts, row.names = FALSE)

##               column null_count null_pct
##             air_time      25751     2.46
##            wheels_on      23677     2.26
##              taxi_in      23677     2.26
##             taxi_out      23125     2.21
##           wheels_off      23125     2.21
##             dep_time      22553     2.15
##                 year          0     0.00
##                month          0     0.00
##         day_of_month          0     0.00
##          day_of_week          0     0.00
##              fl_date          0     0.00
##               origin          0     0.00
##     origin_city_name          0     0.00
##      origin_state_nm          0     0.00
##            cancelled          0     0.00
##             distance          0     0.00
##        weather_delay          0     0.00
##  late_aircraft_delay          0     0.00

null_pd <- null_counts[null_counts$null_count > 0, , drop = FALSE]

if (nrow(null_pd) == 0) {
  cat("\n✅ No missing values detected.\n")
} else {
  null_pd$column <- factor(null_pd$column, levels = rev(null_pd$column))

  fig <- ggplot(null_pd, aes(x = column, y = null_pct)) +
    geom_col(fill = PALETTE[4], color = "white") +
    geom_text(aes(label = sprintf("%.1f%%", null_pct)), hjust = -0.1, size = 3) +
    coord_flip() +
    labs(
      x = "Missing Values (%)",
      title = "Missing Value Distribution by Column"
    ) +
    expand_limits(y = max(null_pd$null_pct) * 1.2) +
    theme(plot.title = element_text(face = "bold"))

  print(fig)

  cancelled_count <- sum(df_raw$cancelled == 1, na.rm = TRUE)
  cat(sprintf(
    "\nNote: %s cancelled flights (%.1f%%) — operational nulls in these rows are expected.\n",
    fmt_int(cancelled_count),
    cancelled_count / nrow(df_raw) * 100
  ))
}

## 
## Note: 23,306 cancelled flights (2.2%) — operational nulls in these rows are expected.

4. Data Cleaning & Feature Engineering

This section runs the full data pipeline inline (no external modules required for Kaggle):

Cast column types — Ensure all columns match expected dtypes
Remove duplicates — Drop exact duplicate rows
Fill delay nulls — Replace null delay values with 0 for non-cancelled flights
Remove invalid rows — Drop non-cancelled flights with impossible operational values
Engineer features — Add total_delay, delay_flag, delay_category, time_of_day, season, speed_mph_per_min

# ═══════════════════════════════════════════════════════════════════════════════
# STEP 1: Cast column types
# ═══════════════════════════════════════════════════════════════════════════════
as_int_safe <- function(x) as.integer(suppressWarnings(as.numeric(x)))
as_num_safe <- function(x) suppressWarnings(as.numeric(x))

df <- df_raw %>%
  mutate(
    year = as_int_safe(year),
    month = as_int_safe(month),
    day_of_month = as_int_safe(day_of_month),
    day_of_week = as_int_safe(day_of_week),
    dep_time = as_int_safe(dep_time),
    taxi_out = as_num_safe(taxi_out),
    wheels_off = as_int_safe(wheels_off),
    wheels_on = as_int_safe(wheels_on),
    taxi_in = as_num_safe(taxi_in),
    cancelled = as_int_safe(cancelled),
    air_time = as_num_safe(air_time),
    distance = as_num_safe(distance),
    weather_delay = as_num_safe(weather_delay),
    late_aircraft_delay = as_num_safe(late_aircraft_delay)
  )
cat(sprintf("Step 1 — Cast types: %s rows\n", fmt_int(nrow(df))))

## Step 1 — Cast types: 1,048,575 rows

# ═══════════════════════════════════════════════════════════════════════════════
# STEP 2: Remove duplicates
# ═══════════════════════════════════════════════════════════════════════════════
before <- nrow(df)
df <- distinct(df)
cat(sprintf(
  "Step 2 — Remove duplicates: removed %s rows → %s remaining\n",
  fmt_int(before - nrow(df)),
  fmt_int(nrow(df))
))

## Step 2 — Remove duplicates: removed 7,401 rows → 1,041,174 remaining

# ═══════════════════════════════════════════════════════════════════════════════
# STEP 3: Fill delay nulls (non-cancelled flights only)
# ═══════════════════════════════════════════════════════════════════════════════
for (delay_col in c("weather_delay", "late_aircraft_delay")) {
  df[[delay_col]] <- ifelse(
    df$cancelled == 0,
    ifelse(is.na(df[[delay_col]]), 0, df[[delay_col]]),
    df[[delay_col]]
  )
}
cat("Step 3 — Filled null delays for non-cancelled flights\n")

## Step 3 — Filled null delays for non-cancelled flights

# ═══════════════════════════════════════════════════════════════════════════════
# STEP 4: Remove invalid operational rows (non-cancelled only)
# ═══════════════════════════════════════════════════════════════════════════════
before <- nrow(df)
df <- df[
  df$cancelled == 1 |
    (!is.na(df$air_time) & df$air_time > 0 &
       !is.na(df$distance) & df$distance > 0 &
       !is.na(df$taxi_out) & df$taxi_out >= 0 &
       !is.na(df$taxi_in) & df$taxi_in >= 0),
]
cat(sprintf(
  "Step 4 — Remove invalid rows: removed %s → %s remaining\n",
  fmt_int(before - nrow(df)),
  fmt_int(nrow(df))
))

## Step 4 — Remove invalid rows: removed 2,445 → 1,038,729 remaining

# ═══════════════════════════════════════════════════════════════════════════════
# STEP 5: Feature engineering
# ═══════════════════════════════════════════════════════════════════════════════
df$total_delay <- ifelse(is.na(df$weather_delay), 0, df$weather_delay) + ifelse(is.na(df$late_aircraft_delay), 0, df$late_aircraft_delay)
df$delay_flag <- as.integer(df$total_delay > 0)
df$delay_category <- ifelse(
  df$total_delay == 0, "none",
  ifelse(df$total_delay <= 15, "minor",
    ifelse(df$total_delay <= 60, "moderate", "severe")
  )
)
df$time_of_day <- ifelse(
  is.na(df$dep_time), "unknown",
  ifelse(df$dep_time < 600, "night",
    ifelse(df$dep_time < 1200, "morning",
      ifelse(df$dep_time < 1800, "afternoon", "evening")
    )
  )
)
df$season <- ifelse(
  df$month %in% c(12, 1, 2), "winter",
  ifelse(df$month %in% c(3, 4, 5), "spring",
    ifelse(df$month %in% c(6, 7, 8), "summer", "fall")
  )
)
df$speed_mph_per_min <- df$distance / df$air_time

cat(sprintf("Step 5 — Feature engineering complete: %s total columns\n", ncol(df)))

## Step 5 — Feature engineering complete: 24 total columns

# ── Active (non-cancelled) flights ─────────────────────────────────────────────
df_active <- df[df$cancelled == 0, ]

cat(sprintf("\n%s\n", strrep("=", 50)))

## 
## ==================================================

cat(sprintf("Total flights     : %s\n", fmt_int(nrow(df))))

## Total flights     : 1,038,729

cat(sprintf("Cancelled flights : %s\n", fmt_int(sum(df$cancelled == 1, na.rm = TRUE))))

## Cancelled flights : 15,905

cat(sprintf("Active flights    : %s\n", fmt_int(nrow(df_active))))

## Active flights    : 1,022,824

cat(sprintf("Delayed flights   : %s\n", fmt_int(sum(df_active$delay_flag == 1, na.rm = TRUE))))

## Delayed flights   : 109,174

cat(sprintf("Overall delay rate: %.1f%%\n", mean(df_active$delay_flag, na.rm = TRUE) * 100))

## Overall delay rate: 10.7%

cat("Engineered cols   : total_delay, delay_flag, delay_category, time_of_day, season, speed_mph_per_min\n")

## Engineered cols   : total_delay, delay_flag, delay_category, time_of_day, season, speed_mph_per_min

Delay Cause Breakdown

This analysis provides a breakdown of flight delay minutes across two primary categories—Late Aircraft Delay and Weather Delay—measured in both absolute volume (thousands of minutes) and percentage share of the total delay time.

Key Findings

Dominance of Late Aircraft Delays: Late aircraft delays are the primary contributor to overall delay time, accounting for 81.7% of the total share. This translates to an absolute volume of 5,585,402 minutes (over 5.5 million minutes).
Secondary Role of Weather Delays: Weather-related delays play a much smaller role, representing 18.3% of the total delay minutes, with an absolute volume of 1,252,335 minutes (approximately 1.25 million minutes).
Significant Volume Disparity: The volume of delays caused by late aircraft arrivals is more than four times greater than the delay time caused directly by adverse weather conditions, highlighting that operational propagation is the main driver of flight disruptions.

# Delay cause breakdown
delayed_active <- df_active %>% filter(total_delay > 0)
total_weather <- sum(ifelse(is.na(delayed_active$weather_delay), 0, delayed_active$weather_delay), na.rm = TRUE)
total_late <- sum(ifelse(is.na(delayed_active$late_aircraft_delay), 0, delayed_active$late_aircraft_delay), na.rm = TRUE)
grand_total <- total_weather + total_late

cause_pd <- data.frame(
  cause = c("Weather Delay", "Late Aircraft Delay"),
  total_minutes = c(total_weather, total_late),
  share_pct = c(total_weather / grand_total * 100, total_late / grand_total * 100)
)
print("=== Delay Cause Breakdown ===")

## [1] "=== Delay Cause Breakdown ==="

print(cause_pd, row.names = FALSE)

##                cause total_minutes share_pct
##        Weather Delay       1252335  18.31505
##  Late Aircraft Delay       5585402  81.68495

fig <- ggplot(cause_pd, aes(x = "", y = share_pct, fill = cause)) +
  geom_col(width = 1, color = "white") +
  coord_polar("y") +
  geom_text(
    aes(label = sprintf("%.1f%%", share_pct)),
    position = position_stack(vjust = 0.5),
    fontface = "bold",
    size = 4.5
  ) +
  scale_fill_manual(values = c(PALETTE[2], PALETTE[4])) +
  labs(title = "Share of Total Delay Minutes by Cause") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 10, lineheight = 0.95), plot.title.position = "plot", plot.margin = margin(8, 24, 2, 24))

bar_plot <- ggplot(cause_pd, aes(x = cause, y = total_minutes / 1000, fill = cause)) +
  geom_col(color = "white") +
  coord_flip() +
  scale_fill_manual(values = c(PALETTE[2], PALETTE[4])) +
  scale_y_continuous(labels = scales::label_number(suffix = "K")) +
  labs(title = "Absolute Delay Volume by Cause", x = NULL, y = "Total Delay Minutes (thousands)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 10, hjust = 0.5), plot.title.position = "plot", plot.margin = margin(2, 12, 8, 8))

if (requireNamespace("patchwork", quietly = TRUE)) {
  combined <- fig / bar_plot + patchwork::plot_annotation(
    title = "Delay Cause Breakdown",
    theme = theme(plot.title = element_text(size = 12.5, face = "bold", hjust = 0.5))
  ) + patchwork::plot_layout(heights = c(1.05, 0.95), guides = "keep")
  print(combined)
} else {
  print(fig)
  print(bar_plot)
}

State-Level Delay

The State-Level Delay Analysis examines average flight delay minutes across the top 15 U.S. states (and territories) with the highest delay burden, based on 1,025,269 active (non-cancelled) flights recorded in January–February 2024.

Key Findings

Iowa leads in average delays: Iowa has the highest average total delay of 13.5 minutes per flight, followed closely by Montana (13.1 minutes) and North Dakota (12.8 minutes). As relatively low-traffic Midwestern states, this suggests that smaller regional airports may be more vulnerable to delay propagation.
West Virginia’s unique weather factor: West Virginia stands out as the only state in the top 5 where weather delay (8.1 minutes) exceeds late aircraft delay (4.1 minutes). This indicates its delays are more directly driven by adverse weather conditions rather than cascading operational issues.
Late aircraft delay is the dominant systemic issue: Late aircraft delay is the primary cause in 13 out of 15 states, accounting for 81.7% of total delay minutes across the entire dataset. This highlights a systemic, operational root cause where delays from one flight ripple forward to subsequent flights using the same aircraft.
Geographical isolation amplifies delays in territories: Puerto Rico and the U.S. Virgin Islands both appear in the top 10 despite being island territories. Their delays are almost entirely driven by late aircraft (9.3 minutes and 8.6 minutes respectively), with near-zero weather delays, reflecting how isolation from the mainland network amplifies cascading effects when incoming flights arrive late.
High-volume hubs are not immune: High-traffic states like Illinois (50,148 flights) also appear in the top 15. The sheer volume of aircraft rotations increases exposure to cascading late-aircraft delays, showing that major hub states remain highly vulnerable.

# State-level delay analysis using the cleaned df_active from Section 4
# (PALETTE and df_active are already defined)

# State-level aggregation
N <- 15   # top N states to display

state_df <- df_active %>%
  group_by(origin_state_nm) %>%
  summarise(
    avg_total_delay         = mean(total_delay,         na.rm = TRUE),
    avg_weather_delay       = mean(weather_delay,       na.rm = TRUE),
    avg_late_aircraft_delay = mean(late_aircraft_delay, na.rm = TRUE),
    flight_count            = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_total_delay)) %>%
  slice_head(n = N) %>%
  # Sort ascending so highest bar appears at top of horizontal chart
  mutate(origin_state_nm = factor(origin_state_nm,
                                  levels = origin_state_nm[order(avg_total_delay)]))

#Plot 1: Top N States by Avg Total Delay
p1 <- ggplot(state_df, aes(x = avg_total_delay, y = origin_state_nm)) +
  geom_col(fill = PALETTE[3], color = "white", width = 0.8) +
  geom_text(
    aes(label = sprintf("%.1f min", avg_total_delay)),
    hjust = -0.15, size = 3.4
  ) +
  scale_x_continuous(
    expand = expansion(mult = c(0, 0.25))
  ) +
  labs(
    x     = "Avg Total Delay (min)",
    y     = NULL,
    title = paste0("Top ", N, " States by Avg Total Delay")
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title   = element_text(face = "bold", size = 12, margin = margin(b = 6)),
    plot.margin  = margin(10, 30, 10, 10),
    panel.grid.major.y = element_blank(),
    axis.text.y  = element_text(margin = margin(r = 8)),
    axis.title.x = element_text(margin = margin(t = 6))
  )

# Plot 2: Delay Cause Breakdown by State
# Pivot to long format for grouped bars
state_long <- state_df %>%
  select(origin_state_nm, avg_weather_delay, avg_late_aircraft_delay) %>%
  pivot_longer(
    cols      = c(avg_weather_delay, avg_late_aircraft_delay),
    names_to  = "cause",
    values_to = "avg_delay"
  ) %>%
  mutate(
    cause = recode(cause,
      avg_weather_delay       = "Avg Weather Delay",
      avg_late_aircraft_delay = "Avg Late Aircraft Delay"
    )
  )

p2 <- ggplot(state_long,
             aes(x = avg_delay, y = origin_state_nm, fill = cause)) +
  geom_col(
    position = position_dodge(width = 0.85),
    width    = 0.8,
    color    = "white"
  ) +
  scale_fill_manual(
    values = c(
      "Avg Weather Delay"       = PALETTE[2],   # orange
      "Avg Late Aircraft Delay" = PALETTE[4]    # red
    )
  ) +
  labs(
    x     = "Average Delay (min)",
    y     = NULL,
    fill  = NULL,
    title = "Delay Cause Breakdown by State"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title        = element_text(face = "bold", size = 12, hjust = 0.5, margin = margin(b = 6)),
    legend.position   = "bottom",
    legend.margin     = margin(t = 4),
    plot.margin       = margin(10, 10, 10, 30),
    panel.grid.major.y = element_blank(),
    axis.text.y       = element_text(margin = margin(r = 8)),
    axis.title.x      = element_text(margin = margin(t = 6))
  )

# Stack vertically so the full plot fits within the notebook viewport
combined_with_caption <- p1 / p2 +
  plot_annotation(
    title   = "State-Level Delay Analysis",
    theme = theme(
      plot.title   = element_text(face = "bold", size = 13, hjust = 0.5, margin = margin(b = 8))
    )
  ) &
  theme(
    plot.margin = margin(12, 10, 6, 10)
  )

print(combined_with_caption)

Monthly Flight Volume and Delay Rate Summary

This analysis compares U.S. flight activity and performance over the first two months of 2024 (January and February), examining the relationship between total monthly flight volume (total flight count) and the proportion of delayed flights (delay rate).

Key Findings

Sharp Decline in Delay Rates: Flight reliability improved substantially between the two months. The delay rate dropped from its peak of 13.5% in January to 7.7% in February, representing an absolute decrease of 5.8 percentage points (nearly a 43% reduction in the likelihood of a flight being delayed).
Stable Flight Volume: While delay rates dropped dramatically, the overall volume of scheduled flights remained relatively steady. January recorded slightly over 500,000 flights, while February experienced a minor seasonal decline to just under 500,000 flights.
Operational Recovery: The significant improvement in the delay rate alongside a relatively constant flight volume suggests that the high delay rates in January were likely driven by temporary winter weather disruptions or post-holiday congestion, which cleared up considerably by February.

library(dplyr)

# ── Data Aggregation ──────────────────────────────────────────────────────────
monthly <- df_active %>%
  group_by(month) %>%
  summarize(
    flight_count = n(),
    # Calculate delay rate using total_delay > 0
    delay_rate = mean(total_delay > 0, na.rm = TRUE),
    avg_total_delay = mean(total_delay, na.rm = TRUE)
  ) %>%
  arrange(month) %>%
  mutate(
    # Map numeric months (1, 2, ...) to abbreviated names ("Jan", "Feb", ...)
    month_label = month.abb[month] 
  )

# ── Visualization (Dual Y-Axis Plot) ──────────────────────────────────────────
# Save original graphical parameters to restore later
old_par <- par(no.readonly = TRUE)

# Adjust margins to leave space on the right (side 4) for the secondary y-axis
par(mar = c(5, 5, 4, 5) + 0.1)

# 1. Primary Plot: Bar chart for Flight Count (Left Axis)
bar_centers <- barplot(
  monthly$flight_count,
  names.arg = monthly$month_label,
  col = adjustcolor(PALETTE[1], alpha.f = 0.7),
  border = NA,
  ylim = c(0, 550000),
  ylab = "",
  xlab = "Month",
  yaxt = "n",
  main = "Monthly Flight Volume and Delay Rate (2024)",
  font.main = 2,
  cex.main = 1.1
)

# Add subtle horizontal gridlines
grid(nx = NA, ny = NULL, col = "gray90", lty = "solid")

# Re-draw the bars over the gridlines so the grid sits neatly in the background
barplot(
  monthly$flight_count,
  col = adjustcolor(PALETTE[1], alpha.f = 0.7),
  border = NA,
  ylim = c(0, 550000),
  yaxt = "n",
  add = TRUE
)

# Format and add the Left Y-Axis
axis_left_ticks <- seq(0, 500000, by = 100000)
axis(
  side = 2, 
  at = axis_left_ticks, 
  labels = paste0(axis_left_ticks / 1000, "K"), 
  las = 1, 
  col.axis = PALETTE[1], 
  col = PALETTE[1]
)
mtext("Flight Count", side = 2, line = 3.5, col = PALETTE[1], cex = 1.1, font = 2)

# Enable drawing over the current plot for the secondary axis
par(new = TRUE)

# 2. Secondary Plot: Line chart for Delay Rate (Right Axis)
plot(
  bar_centers,
  monthly$delay_rate * 100,
  type = "o",
  col = PALETTE[4],
  pch = 16,
  cex = 1.5,
  lwd = 2.5,
  ylim = c(7, 14),
  axes = FALSE,
  xlab = "",
  ylab = ""
)

# Format and add the Right Y-Axis
axis_right_ticks <- seq(8, 13, by = 1)
axis(
  side = 4, 
  at = axis_right_ticks, 
  labels = paste0(axis_right_ticks, ".0%"), 
  las = 1, 
  col.axis = PALETTE[4], 
  col = PALETTE[4]
)
mtext("Delay Rate (%)", side = 4, line = 3.5, col = PALETTE[4], cex = 1.1, font = 2)

# Draw a clean border around the plot
box(col = "gray80")

# Legends
legend(
  "topleft",
  legend = "Flight Count",
  fill = adjustcolor(PALETTE[1], alpha.f = 0.7),
  border = NA,
  bty = "n",
  inset = c(0.02, 0.02)
)
legend(
  "topright",
  legend = "Delay Rate (%)",
  col = PALETTE[4],
  lty = 1,
  lwd = 2.5,
  pch = 16,
  bty = "n",
  inset = c(0.02, 0.02)
)

# Restore default graphical parameters
par(old_par)

# ── Print Summary to Console ──────────────────────────────────────────────────
worst_idx <- which.max(monthly$delay_rate)
best_idx  <- which.min(monthly$delay_rate)

worst_month <- monthly[worst_idx, ]
best_month  <- monthly[best_idx, ]

cat(sprintf("Highest delay rate: %s — %.1f%%\n", worst_month$month_label, worst_month$delay_rate * 100))

## Highest delay rate: Jan — 13.5%

cat(sprintf("Lowest  delay rate: %s  — %.1f%%\n", best_month$month_label, best_month$delay_rate * 100))

## Lowest  delay rate: Feb  — 7.7%

Day-of-week operational performance

This analysis tracks weekly flight efficiency by comparing two major performance indicators across each day of the week: the flight delay rate (%) (left axis) and the average taxi-out time in minutes (right axis).

Key Findings

Mid-Week Delay Minimums: Flights are most reliable during the middle of the week. Wednesday is the best-performing day with the lowest delay rate of 8.0%, closely followed by Thursday (approximately 8.2%).
High-Delay Peaks on Tuesdays and Weekends: Delay rates spike significantly at both the beginning and end of the week. Tuesday is the worst-performing day with a 12.1% delay rate, closely followed by Sunday (roughly 12.0%). Mondays, Fridays, and Saturdays also experience elevated delay rates between 11% and 11.8%.
Highly Stable Ground Operations: Unlike the fluctuating delay rates, average taxi-out times are remarkably consistent across all seven days of the week, staying within a tight window of 17.5 to 18.5 minutes. This stability suggests that basic ground congestion at the terminal/runway remains steady day-to-day, and that the weekly variations in overall flight delays are likely driven by other external issues (such as passenger volume fluctuations or cumulative network delay propagation).

library(dplyr)

# ── Data Aggregation ──────────────────────────────────────────────────────────
dow <- df_active %>%
  group_by(day_of_week) %>%
  summarize(
    flight_count = n(),
    delay_rate = mean(total_delay > 0, na.rm = TRUE),
    avg_taxi_out = mean(taxi_out, na.rm = TRUE),
    avg_total_delay = mean(total_delay, na.rm = TRUE)
  ) %>%
  arrange(day_of_week) %>%
  mutate(
    day_label = if (exists("DOW_LABELS")) DOW_LABELS[day_of_week] else c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")[day_of_week]
  )

# Ensure PALETTE exists
if (!exists("PALETTE")) {
  PALETTE <- c("#4A90E2", "#50E3C2", "#F5A623", "#D0021B")
}

# ── Dynamic Axis Scaling ──────────────────────────────────────────────────────
max_delay <- max(dow$delay_rate * 100, na.rm = TRUE)
max_taxi  <- max(dow$avg_taxi_out, na.rm = TRUE)

# Round up both maximums to clean numbers for upper axis limits
limit_left  <- ceiling(max_delay / 5) * 5
limit_right <- ceiling(max_taxi / 5) * 5

# Define scale factor to align the right-axis values to the left-axis coordinate system
scale_factor <- limit_left / limit_right
scaled_taxi  <- dow$avg_taxi_out * scale_factor
height_matrix <- rbind(dow$delay_rate * 100, scaled_taxi)

# ── Visualization ─────────────────────────────────────────────────────────────
old_par <- par(no.readonly = TRUE)

# Increase the right margin (side 4) for the secondary axis
par(mar = c(5, 5, 4, 5) + 0.1)

# Plot the grouped bar chart
# Note: Added a * 1.18 multiplier to ylim to create headroom at the top for the legends
bar_centers <- barplot(
  height_matrix,
  beside = TRUE,
  names.arg = dow$day_label,
  col = c(adjustcolor(PALETTE[4], alpha.f = 0.85), adjustcolor(PALETTE[1], alpha.f = 0.85)),
  border = NA,
  ylim = c(0, limit_left * 1.18), 
  ylab = "",
  xlab = "Day of Week",
  yaxt = "n",
  space = c(0, 0.5),
  main = "Day-of-Week Operational Performance",
  font.main = 2,
  cex.main = 1.1
)

# Add horizontal gridlines behind the bars
grid(nx = NA, ny = NULL, col = "gray92", lty = "solid")

# Re-draw the bars over the gridlines
# Note: Re-applied the * 1.18 multiplier to ylim to keep the plots perfectly aligned
barplot(
  height_matrix,
  beside = TRUE,
  col = c(adjustcolor(PALETTE[4], alpha.f = 0.85), adjustcolor(PALETTE[1], alpha.f = 0.85)),
  border = NA,
  ylim = c(0, limit_left * 1.18),
  yaxt = "n",
  space = c(0, 0.5),
  add = TRUE
)

# Left Y-Axis (Delay Rate)
axis_left_ticks <- seq(0, limit_left, by = 5)
axis(
  side = 2, 
  at = axis_left_ticks, 
  labels = paste0(axis_left_ticks, "%"), 
  las = 1, 
  col.axis = PALETTE[4], 
  col = PALETTE[4]
)
mtext("Delay Rate (%)", side = 2, line = 3.5, col = PALETTE[4], cex = 1.1, font = 2)

# Right Y-Axis (Avg Taxi-Out)
axis_right_ticks <- seq(0, limit_right, by = 5)
axis(
  side = 4, 
  at = axis_right_ticks * scale_factor, 
  labels = axis_right_ticks, 
  las = 1, 
  col.axis = PALETTE[1], 
  col = PALETTE[1]
)
mtext("Avg Taxi-Out (min)", side = 4, line = 3.5, col = PALETTE[1], cex = 1.1, font = 2)

box(col = "gray80")

# Legends (Positions are now automatically pushed upward by the expanded ylim)
legend(
  "topleft",
  legend = "Delay Rate (%)",
  fill = adjustcolor(PALETTE[4], alpha.f = 0.85),
  border = NA,
  bty = "n",
  inset = c(0.04, 0.01)
)
legend(
  "topright",
  legend = "Avg Taxi-Out (min)",
  fill = adjustcolor(PALETTE[1], alpha.f = 0.85),
  border = NA,
  bty = "n",
  inset = c(0.04, 0.01)
)

# Restore default parameters
par(old_par)

# ── Summary Statistics Output ─────────────────────────────────────────────────
worst_day <- dow[which.max(dow$delay_rate), ]
best_day  <- dow[which.min(dow$delay_rate), ]

cat(sprintf("Worst day (delay rate): %s — %.1f%%\n", worst_day$day_label, worst_day$delay_rate * 100))

## Worst day (delay rate): Tue — 12.1%

cat(sprintf("Best  day (delay rate): %s  — %.1f%%\n", best_day$day_label, best_day$delay_rate * 100))

## Best  day (delay rate): Wed  — 8.0%

Distance-Air Time Relationship & Speed Analysis

This visualization provides a dual-pane analysis of flight performance characteristics. The top scatterplot examines the correlation between flight distance (in miles) and actual air time (in minutes), while the bottom histogram displays the statistical distribution of average flight speeds (in miles per minute).

Key Findings

Strong Linear Relationship: There is a highly linear positive correlation between flight distance and air time (Pearson \(r = 0.9759\)). This indicates that flight durations increase in a highly consistent and predictable manner relative to the distance traveled.
Average Flight Speed: The mean flight speed across the dataset is 6.76 miles per minute, which is equivalent to approximately 406 mph (or \(653 \text{ km/h}\)).
Tightly Clustered Speed Distribution: The distribution of flight speed is unimodal and roughly symmetric, with the vast majority of flights operating within a tight window between 5 and 9 miles per minute. Very few flights exceed 10 miles per minute or fall below 3 miles per minute.

# ── Distance vs Air Time ──────────────────────────────────────────────────────
dist_air <- df_active %>%
  filter(!is.na(air_time) & !is.na(distance)) %>%
  # Calculate speed (miles per minute) on-the-fly before selecting it
  mutate(speed_mph_per_min = distance / air_time) %>% 
  select(distance, air_time, speed_mph_per_min)

sample_size <- min(6000, nrow(dist_air))
set.seed(42)
sample_idx <- sample.int(nrow(dist_air), sample_size)
sample_da <- dist_air[sample_idx, ]
r_da <- cor(dist_air$distance, dist_air$air_time, use = "complete.obs")
mean_speed <- mean(dist_air$speed_mph_per_min, na.rm = TRUE)
speed_limit <- min(15, as.numeric(quantile(dist_air$speed_mph_per_min, 0.99, na.rm = TRUE)))
speed_plot <- dist_air$speed_mph_per_min[dist_air$speed_mph_per_min <= speed_limit]
speed_min <- floor(min(speed_plot, na.rm = TRUE))
speed_max <- ceiling(max(speed_plot, na.rm = TRUE))
speed_breaks <- pretty(c(speed_min, speed_max), n = 8)

old_par <- par(no.readonly = TRUE)
layout(matrix(c(1, 2), ncol = 1), heights = c(1.7, 1.7))

# Outer margins (oma) with 3 lines of space at the top (side 3) for the title
par(mar = c(5, 6.5, 4, 2) + 0.1, oma = c(0, 0, 3, 0))

# ── Plot 1: Distance vs Air Time (Scatterplot) ────────────────────────────────
plot(
  sample_da$distance,
  sample_da$air_time,
  pch = 16,
  cex = 0.6,
  col = adjustcolor(PALETTE[1], alpha.f = 0.2),
  xaxt = "n",
  xlab = "Distance (miles)",
  ylab = "", 
  main = sprintf("Distance vs Air Time\n(Pearson r = %.4f)", r_da)
)
axis(1, at = axTicks(1), labels = formatC(axTicks(1), format = "f", big.mark = ",", digits = 0))
abline(lm(air_time ~ distance, data = sample_da), col = PALETTE[4], lwd = 2.5)
box()

# Custom y-axis label
mtext("Air Time (minutes)", side = 2, line = 4.5)

# ── Plot 2: Distribution of Flight Speed (Histogram) ─────────────────────────
hist(
  speed_plot,
  breaks = speed_breaks,
  xlim = c(0, 30),
  ylim = c(0, 500000),
  col = adjustcolor(PALETTE[3], alpha.f = 0.85),
  border = "white",
  xlab = "Speed (miles/minute)",
  ylab = "", 
  main = "Distribution of Flight Speed",
  axes = FALSE
)
axis(1, at = seq(0, 30, by = 5), labels = seq(0, 30, by = 5))
axis(2, at = seq(0, 500000, by = 100000), labels = format(seq(0, 500000, by = 100000), big.mark = ",", scientific = FALSE, trim = TRUE), las = 1)
box()

# Custom y-axis label
mtext("Frequency", side = 2, line = 4.5)

abline(v = mean_speed, col = "red", lty = 2, lwd = 1.8)
legend(
  "topright",
  legend = sprintf("Mean: %.2f mi/min", mean_speed),
  lty = 2,
  lwd = 1.8,
  col = "red",
  bty = "n",
  inset = c(0, 0.05)
)

# ── Global Title (Top Middle) ────────────────────────────────────────────────
mtext(
  "Distance-Air Time Relationship & Speed Analysis", 
  side = 3, 
  line = 1, 
  outer = TRUE, 
  cex = 1.3, 
  font = 2
)

# Restore default graphical parameters
par(old_par)

Airport Performance Quadrant Analysis

This analysis evaluates airport operational efficiency by cross-referencing Average Taxi-Out Time (minutes) against Flight Delay Rate (%). Using the median thresholds (17.2 minutes for taxi-out and 9.8% for delay rates) as baselines, 282 airports are categorized into one of four performance quadrants: Efficient, Weather/Route Risk, Bottleneck, and Congested-Resilient.

Key Findings

System-Wide Benchmarks: The national operational baseline is defined by a median taxi-out time of 17.2 minutes and a median delay rate of 9.8%.
Extreme Taxi-Out Bottlenecks: Chippewa County International Airport (CIU) and Williston Basin International Airport (XWA) lead the nation in taxi-out times at 29.8 minutes and 29.1 minutes respectively—nearly double the national median.
Operational Quadrant Breakdown:
- Congested-Resilient (72 airports - High Taxi, Low Delay): These airports experience long taxi-out times but successfully maintain low delay rates. Notable examples include CIU and XWA, suggesting robust operational recovery despite surface constraints.
- Weather/Route Risk (72 airports - Low Taxi, High Delay): These facilities boast fast ground operations but suffer from high delay rates, likely driven by severe regional weather or airspace constraints. Extreme outliers include BET (Bethel) and BLV (MidAmerica St. Louis).
- Efficient (69 airports - Low Taxi, Low Delay): The optimal performers, maintaining quick ground turnarounds and low delay rates.
- Bottleneck (69 airports - High Taxi, High Delay): The most operationally challenged category, experiencing both prolonged taxi-out times and high delay rates (e.g., FCA, DRO, ASE, HDN).
Impact of Flight Volume: Larger bubbles (representing higher flight volumes) tend to cluster closer to the median intersection lines. This indicates that major high-volume hubs maintain more consistent, standardized performance, whereas smaller, regional airports (smaller bubbles) exhibit much higher statistical volatility in both taxi times and delay rates.

# ── Top taxi-out airports ─────────────────────────────────────────────────────
MIN_FLIGHTS <- if (exists("MIN_FLIGHTS")) MIN_FLIGHTS else 100
N <- if (exists("N")) N else 15

taxi_all <- df_active %>%
  filter(!is.na(taxi_out)) %>%
  group_by(origin) %>%
  summarise(avg_taxi_out = mean(taxi_out), n = n(), .groups = "drop") %>%
  filter(n >= MIN_FLIGHTS) %>%
  arrange(desc(avg_taxi_out))

taxi_top <- taxi_all %>% slice_head(n = N) %>% arrange(avg_taxi_out)

fig1 <- ggplot(taxi_top, aes(x = reorder(origin, avg_taxi_out), y = avg_taxi_out)) +
  geom_col(fill = PALETTE[1], color = "white") +
  geom_text(aes(label = sprintf("%.1f min", avg_taxi_out)), hjust = -0.15, size = 3) +
  coord_flip() +
  labs(
    x = "",
    y = "Average Taxi-Out Time (minutes)",
    title = sprintf("Top %s Airports by Average Taxi-Out Time", N)
  ) +
  scale_y_continuous(limits = c(0, max(taxi_top$avg_taxi_out) * 1.2)) +
  theme(plot.title = element_text(face = "bold"))

print(fig1)

# ── Airport performance quadrant analysis ─────────────────────────────────────
rates_filtered <- df_active %>%
  filter(!is.na(total_delay)) %>%
  group_by(origin) %>%
  summarise(
    delay_rate = mean(delay_flag, na.rm = TRUE),
    avg_total_delay = mean(total_delay, na.rm = TRUE),
    flight_count = n(),
    .groups = "drop"
  )

quad_df <- rates_filtered %>%
  select(origin, delay_rate, avg_total_delay, flight_count) %>%
  inner_join(taxi_all %>% select(origin, avg_taxi_out), by = "origin")

med_taxi <- median(quad_df$avg_taxi_out, na.rm = TRUE)
med_delay <- median(quad_df$delay_rate, na.rm = TRUE)

quad_df <- quad_df %>%
  mutate(quadrant = case_when(
    avg_taxi_out <= med_taxi & delay_rate <= med_delay ~ "Efficient",
    avg_taxi_out > med_taxi & delay_rate > med_delay ~ "Bottleneck",
    avg_taxi_out <= med_taxi & delay_rate > med_delay ~ "Weather/Route Risk",
    TRUE ~ "Congested-Resilient"
  ))

color_map_q <- c(
  "Efficient" = PALETTE[3],
  "Bottleneck" = PALETTE[4],
  "Weather/Route Risk" = PALETTE[2],
  "Congested-Resilient" = PALETTE[1]
)

extremes <- bind_rows(
  quad_df %>% slice_max(delay_rate, n = 5, with_ties = FALSE),
  quad_df %>% slice_max(avg_taxi_out, n = 5, with_ties = FALSE)
) %>% distinct(origin, .keep_all = TRUE)

fig2 <- ggplot(quad_df, aes(x = avg_taxi_out, y = delay_rate * 100, color = quadrant)) +
  geom_point(aes(size = flight_count), alpha = 0.7) +
  geom_vline(xintercept = med_taxi, linetype = "dashed", color = "gray50", linewidth = 1.2) +
  geom_hline(yintercept = med_delay * 100, linetype = "dashed", color = "gray50", linewidth = 1.2) +
  geom_text(
    data = extremes,
    aes(label = origin),
    size = 2.6,
    nudge_x = 0.5,
    nudge_y = 0.5,
    show.legend = FALSE
  ) +
  scale_color_manual(values = color_map_q) +
  scale_size_continuous(range = c(3, 10)) +
  labs(
    x = "Avg Taxi-Out Time (min)",
    y = "Delay Rate (%)",
    title = "Airport Performance Quadrant",
    subtitle = paste0("bubble size ", intToUtf8(0x221D), " flight volume"),
    color = "Quadrant"
  ) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5, family = "Apple Symbols")
  )

print(fig2)

cat(sprintf("\nMedian taxi-out: %.1f min  |  Median delay rate: %.1f%%\n", med_taxi, med_delay * 100))

## 
## Median taxi-out: 17.2 min  |  Median delay rate: 9.8%

cat("\nAirport count by quadrant:\n")

## 
## Airport count by quadrant:

print(quad_df %>% count(quadrant, name = "count") %>% arrange(quadrant))

## # A tibble: 4 × 2
##   quadrant            count
##   <chr>               <int>
## 1 Bottleneck             69
## 2 Congested-Resilient    72
## 3 Efficient              69
## 4 Weather/Route Risk     72

5. Conclusion

Key Findings

Analysis	Problem Type	Key Result
Delay Cause Breakdown	Descriptive	Late aircraft caused 81.7% of delay minutes (5.59M min); weather contributed 18.3% (1.25M min). Cascade effects dominate.
State-Level Delay	Descriptive	Iowa led with 13.5 min avg delay. Midwestern and rural states are most vulnerable. West Virginia is the only state where weather exceeded late-aircraft delay.
Monthly Trends	Descriptive	Delay rate dropped from 13.5% (Jan) to 7.7% (Feb) — a 43% reduction — despite stable flight volume (~500K/month).
Day-of-Week Performance	Descriptive	Wednesday best (8.0%), Tuesday worst (12.1%). Taxi-out times stable across all days (17.5–18.5 min).
Distance–Air Time	Regression	Pearson r = 0.9759; distance explains ~95% of air time variance. Mean speed: 6.76 mi/min (406 mph).
Airport Performance Quadrant	Classification	282 airports classified into 4 quadrants: 69 Efficient, 72 Congested-Resilient, 72 Weather/Route Risk, 69 Bottleneck.

Limitations

Limitation	Detail
Time window	Only Jan–Feb 2024; seasonal patterns (summer, holidays) not captured.
Regression scope	Single-predictor model (distance only); excludes aircraft type, wind, ATC constraints.
Classification thresholds	Median-based (17.2 min taxi-out, 9.8% delay); dataset-dependent and may shift with full-year data.

Recommendations

#	Recommendation
1	Prioritise on-time aircraft turnaround to reduce late-aircraft cascade delays.
2	Improve winter resilience at Midwestern and rural airports.
3	Extend analysis to a full year to capture seasonal variation.
4	Expand regression with additional predictors (departure time, congestion, aircraft type).

Flight Delay Analysis

Analysis Overview

EDA Roadmap

Research Objectives & Questions

Table of Contents

1. Imports, Configuration & Data Loading

2. Dataset Integrity

3. Data Quality Assessment

4. Data Cleaning & Feature Engineering

Delay Cause Breakdown

Key Findings

State-Level Delay

Key Findings

Monthly Flight Volume and Delay Rate Summary

Key Findings

Day-of-week operational performance

Key Findings

Distance-Air Time Relationship & Speed Analysis

Key Findings

Airport Performance Quadrant Analysis

Key Findings

5. Conclusion

Key Findings

Limitations

Recommendations