2015 U.S. Flight Delays Analysis

Introduction

This report analyzes U.S. commercial flight performance in 2015 using a large flight-level data set containing arrival delays, departure delays, cancellations, diversions, and delay categories (weather, carrier, air system, etc.). Two supporting data sets are used: an airlines look up table that maps IATA airline codes to airline names, and an airports data set with airport names, state, and latitude/longitude coordinates.

The goal was to tell a descriptive story in this analysis:
- What is the average arrival delay change across months?
- Why flight operations cancelled?
- When are delays worst?
- Who experiences the most consistent delay problems?
- Where do cancellations concentrate across the U.S.?

Since cancelled and diverted flights are not comparable to completed flights in terms of arrival delay, they are handled separately when calculating delay averages.

Data Preparation

This section loads the raw data sets and performs cleaning steps such as:
- Removing duplicate airport and airline records
- Converting time variables into usable hour/minute format
- Creating a proper date variable
- Handling cancelled and diverted flights appropriately

fp  <- "C:/Users/wwwma/OneDrive - Loyola University Maryland/IS460W/Flight Delay/flight delays.csv"
arp <- "C:/Users/wwwma/OneDrive - Loyola University Maryland/IS460W/Flight Delay/airports.csv"
alp <- "C:/Users/wwwma/OneDrive - Loyola University Maryland/IS460W/Flight Delay/airlines.csv"

flights  <- fread(fp, na.strings = c("", "NA"))
airports <- fread(arp, na.strings = c("", "NA"))
airlines <- fread(alp, na.strings = c("", "NA"))

# Airports cleanup
airports_clean <- airports
airports_clean$CITY <- trimws(airports_clean$CITY)
airports_clean$STATE <- trimws(airports_clean$STATE)
airports_clean$COUNTRY <- trimws(airports_clean$COUNTRY)
airports_clean$IATA_CODE <- trimws(airports_clean$IATA_CODE)
airports_clean <- unique(airports_clean)

# Airlines cleanup
airlines_clean <- airlines
airlines_clean$IATA_CODE <- trimws(airlines_clean$IATA_CODE)
airlines_clean <- unique(airlines_clean)

# Flights cleanup
flights_clean <- flights
flights_clean$YEAR <- as.integer(flights_clean$YEAR)
flights_clean$MONTH <- as.integer(flights_clean$MONTH)
flights_clean$DAY <- as.integer(flights_clean$DAY)
flights_clean$DAY_OF_WEEK <- as.integer(flights_clean$DAY_OF_WEEK)
flights_clean$CANCELLED <- as.integer(flights_clean$CANCELLED)
flights_clean$DIVERTED <- as.integer(flights_clean$DIVERTED)

flights_clean$flight_date <- make_date(flights_clean$YEAR, flights_clean$MONTH, flights_clean$DAY)
flights_clean$is_cancelled <- flights_clean$CANCELLED == 1
flights_clean$is_diverted <- flights_clean$DIVERTED == 1

# Parse scheduled times into hour/minute and validate ranges
flights_clean <- flights_clean %>%
  mutate(
    sched_dep_hour = floor(SCHEDULED_DEPARTURE / 100),
    sched_dep_min  = SCHEDULED_DEPARTURE %% 100,
    sched_arr_hour = floor(SCHEDULED_ARRIVAL / 100),
    sched_arr_min  = SCHEDULED_ARRIVAL %% 100
  ) %>%
  mutate(
    sched_dep_hour = ifelse(sched_dep_hour >= 0 & sched_dep_hour <= 23, sched_dep_hour, NA),
    sched_arr_hour = ifelse(sched_arr_hour >= 0 & sched_arr_hour <= 23, sched_arr_hour, NA),
    sched_dep_min  = ifelse(sched_dep_min  >= 0 & sched_dep_min  <= 59, sched_dep_min, NA),
    sched_arr_min  = ifelse(sched_arr_min  >= 0 & sched_arr_min  <= 59, sched_arr_min, NA)
  )

# Delay columns: replace missing with 0 for non-cancelled flights; keep NA for cancelled flights
delay_cols <- c("AIR_SYSTEM_DELAY","SECURITY_DELAY","AIRLINE_DELAY",
                "LATE_AIRCRAFT_DELAY","WEATHER_DELAY")

flights_clean <- flights_clean %>%
  mutate(across(all_of(delay_cols),
                ~ ifelse(is_cancelled, NA,
                         ifelse(is.na(.), 0L, .))))

Information About The Dataset

This section summarizes key characteristics of the data set, including total flights, cancellation rates, diversion rates, and the arrival delay. These statistic will provide a better understanding for the visualization charts.

Overview

total_flights <- nrow(flights_clean)
cancel_rate <- mean(flights_clean$is_cancelled)
divert_rate <- mean(flights_clean$is_diverted)

data.frame(
  total_flights = total_flights,
  cancellation_rate = cancel_rate,
  diversion_rate = divert_rate
)

##   total_flights cancellation_rate diversion_rate
## 1       5819079        0.01544643    0.002609863

Interpretation: The data set contains a large amount of flights from 2015. A small but meaningful percentage of flights are cancelled and a smaller portion are diverted. Because cancelled/diverted flights do not produce comparable arrival delay values, delay averages focus on completed, non-diverted flights.

Arrival Delay Summary (Completed Flights)

flights_clean %>%
  filter(!is_cancelled, !is_diverted, !is.na(ARRIVAL_DELAY)) %>%
  summarise(
    mean_arr_delay = mean(ARRIVAL_DELAY),
    median_arr_delay = median(ARRIVAL_DELAY),
    p90_arr_delay = quantile(ARRIVAL_DELAY, 0.90)
  )

##   mean_arr_delay median_arr_delay p90_arr_delay
## 1       4.407057               -5            34

Interpretation: The mean and median arrival delays differ, which suggests delays are skewed by a smaller number of very late flights. This provides a useful benchmark for high delay flights.

Top Airlines & Airports by Volume

flights_clean %>%
  filter(!is_cancelled, !is_diverted) %>%
  count(AIRLINE, sort = TRUE) %>%
  head(10)

##     AIRLINE       n
##      <char>   <int>
##  1:      WN 1242403
##  2:      DL  870275
##  3:      AA  712935
##  4:      OO  576814
##  5:      EV  554752
##  6:      UA  507762
##  7:      MQ  278791
##  8:      B6  262042
##  9:      US  194223
## 10:      AS  171439

flights_clean %>%
  filter(!is_cancelled, !is_diverted) %>%
  count(ORIGIN_AIRPORT, sort = TRUE) %>%
  head(10)

##     ORIGIN_AIRPORT      n
##             <char>  <int>
##  1:            ATL 343506
##  2:            ORD 276554
##  3:            DFW 232647
##  4:            DEN 193402
##  5:            LAX 192003
##  6:            PHX 145552
##  7:            SFO 145491
##  8:            IAH 144019
##  9:            LAS 131937
## 10:            MSP 111055

Interpretation: A handful of airlines and major airports account for a large portion of total flights. This matters because system-wide delay patterns are heavily influenced by these high-volume carriers and hubs.

Findings

1) Monthly Delay Trend (Line Plot)

This plot shows average arrival delay by month for completed, non-diverted flights. Highlighting the worst month helps identify seasonal pressure periods.

Interpretation:
The monthly trend reveals clear seasonality in flight delays. Average arrival delays rise during the peak summer travel demand, suggesting that the summer demand pressure contribute to system congestion. The highlighted month June represent the highest average delay, indicating a period of widespread operational strain rather than isolated incidents.

2) Cancellation reasons by airline (horizontal stacked bar)

This chart focuses only on cancelled flights and shows the composition of cancellation causes by airline. A stacked bar chart is useful here because it highlights how each airline’s cancellations are distributed across major categories (Weather, Carrier, National Air System, Security).

Interpretation:
Weather and carrier-related factors account for the majority of cancellations across most airlines. However, the relative composition varies by carrier, indicating differences in operational resilience and exposure to weather-sensitive routes. Security-related cancellations remain minimal across all airlines, suggesting they are not a major driver of overall system disruption.

3) Delay Hotspots (Heatmap: Day-of-Week × Departure Hour)

Each tile represents the average arrival delay for a specific day of week and scheduled departure hour. Cells are grey when fewer than 500 flights exist for that day/hour combination to avoid unstable averages. This heatmap highlights “hotspot” times when delays are consistently higher.

Interpretation:
The heatmap highlights clear temporal hotspots in delay intensity. Delays tend to increase during certain departure hours, particularly later in the day, suggesting cumulative effects of earlier disruptions. Weekday patterns also differ slightly from weekends, indicating that travel demand and scheduling density may influence system congestion.

4) Airline Comparison (Trellis Bar Plot by Day of Week)

This trellis plot compares average arrival delay by day of week for the top 8 airlines by volume. Airlines are ordered from lowest to highest overall average delay, making it easy to identify which carriers consistently perform better versus worse. Color is used to distinguish days of the week and highlight end-of-week patterns.

Interpretation: The heatmap highlights clear temporal hotspots in delay intensity. Delays tend to increase during certain departure hours, particularly later in the day, suggesting cumulative effects of earlier disruptions. Weekday patterns also differ slightly from weekends, indicating that business travel demand and scheduling density may influence system congestion.

5) Where Cancellations Cluster (Leaflet Map)

This map visualizes the top 50 origin airports by number of cancelled flights. Circle size increases with more cancellations. This geographic view helps identify whether cancellations are concentrated in major hubs and whether certain regions show higher disruption.

Interpretation:
Cancellations are geographically concentrated at major origin airports, reflecting the influence of high-volume hubs on system-wide reliability. Larger airports experience more cancellations in absolute terms, which may result from higher traffic density and increased sensitivity to weather and air traffic control constraints. The geographic clustering reinforces the idea that delays and cancellations are not randomly distributed but are tied to network structure.

Conclusion

Across 2015, flight delays show clear variation by time of year, day of week, and departure hour. Airlines also differ meaningfully in their average delay performance, and the trellis visualization makes those differences easier to compare. Cancellations are primarily driven by a mix of weather and carrier-related factors, and cancellations concentrate at high-volume origin airports.