This report analyzes U.S. commercial flight performance in 2015 using a large flight-level data set containing arrival delays, departure delays, cancellations, diversions, and delay categories (weather, carrier, air system, etc.). Two supporting data sets are used: an airlines look up table that maps IATA airline codes to airline names, and an airports data set with airport names, state, and latitude/longitude coordinates.
Since cancelled and diverted flights are not comparable to completed flights in terms of arrival delay, they are handled separately when calculating delay averages.
fp <- "C:/Users/wwwma/OneDrive - Loyola University Maryland/IS460W/Flight Delay/flight delays.csv"
arp <- "C:/Users/wwwma/OneDrive - Loyola University Maryland/IS460W/Flight Delay/airports.csv"
alp <- "C:/Users/wwwma/OneDrive - Loyola University Maryland/IS460W/Flight Delay/airlines.csv"
flights <- fread(fp, na.strings = c("", "NA"))
airports <- fread(arp, na.strings = c("", "NA"))
airlines <- fread(alp, na.strings = c("", "NA"))
# Airports cleanup
airports_clean <- airports
airports_clean$CITY <- trimws(airports_clean$CITY)
airports_clean$STATE <- trimws(airports_clean$STATE)
airports_clean$COUNTRY <- trimws(airports_clean$COUNTRY)
airports_clean$IATA_CODE <- trimws(airports_clean$IATA_CODE)
airports_clean <- unique(airports_clean)
# Airlines cleanup
airlines_clean <- airlines
airlines_clean$IATA_CODE <- trimws(airlines_clean$IATA_CODE)
airlines_clean <- unique(airlines_clean)
# Flights cleanup
flights_clean <- flights
flights_clean$YEAR <- as.integer(flights_clean$YEAR)
flights_clean$MONTH <- as.integer(flights_clean$MONTH)
flights_clean$DAY <- as.integer(flights_clean$DAY)
flights_clean$DAY_OF_WEEK <- as.integer(flights_clean$DAY_OF_WEEK)
flights_clean$CANCELLED <- as.integer(flights_clean$CANCELLED)
flights_clean$DIVERTED <- as.integer(flights_clean$DIVERTED)
flights_clean$flight_date <- make_date(flights_clean$YEAR, flights_clean$MONTH, flights_clean$DAY)
flights_clean$is_cancelled <- flights_clean$CANCELLED == 1
flights_clean$is_diverted <- flights_clean$DIVERTED == 1
# Parse scheduled times into hour/minute and validate ranges
flights_clean <- flights_clean %>%
mutate(
sched_dep_hour = floor(SCHEDULED_DEPARTURE / 100),
sched_dep_min = SCHEDULED_DEPARTURE %% 100,
sched_arr_hour = floor(SCHEDULED_ARRIVAL / 100),
sched_arr_min = SCHEDULED_ARRIVAL %% 100
) %>%
mutate(
sched_dep_hour = ifelse(sched_dep_hour >= 0 & sched_dep_hour <= 23, sched_dep_hour, NA),
sched_arr_hour = ifelse(sched_arr_hour >= 0 & sched_arr_hour <= 23, sched_arr_hour, NA),
sched_dep_min = ifelse(sched_dep_min >= 0 & sched_dep_min <= 59, sched_dep_min, NA),
sched_arr_min = ifelse(sched_arr_min >= 0 & sched_arr_min <= 59, sched_arr_min, NA)
)
# Delay columns: replace missing with 0 for non-cancelled flights; keep NA for cancelled flights
delay_cols <- c("AIR_SYSTEM_DELAY","SECURITY_DELAY","AIRLINE_DELAY",
"LATE_AIRCRAFT_DELAY","WEATHER_DELAY")
flights_clean <- flights_clean %>%
mutate(across(all_of(delay_cols),
~ ifelse(is_cancelled, NA,
ifelse(is.na(.), 0L, .))))
This section summarizes key characteristics of the data set, including total flights, cancellation rates, diversion rates, and the arrival delay. These statistic will provide a better understanding for the visualization charts.
total_flights <- nrow(flights_clean)
cancel_rate <- mean(flights_clean$is_cancelled)
divert_rate <- mean(flights_clean$is_diverted)
data.frame(
total_flights = total_flights,
cancellation_rate = cancel_rate,
diversion_rate = divert_rate
)
## total_flights cancellation_rate diversion_rate
## 1 5819079 0.01544643 0.002609863
Interpretation: The data set contains a large amount of flights from 2015. A small but meaningful percentage of flights are cancelled and a smaller portion are diverted. Because cancelled/diverted flights do not produce comparable arrival delay values, delay averages focus on completed, non-diverted flights.
flights_clean %>%
filter(!is_cancelled, !is_diverted, !is.na(ARRIVAL_DELAY)) %>%
summarise(
mean_arr_delay = mean(ARRIVAL_DELAY),
median_arr_delay = median(ARRIVAL_DELAY),
p90_arr_delay = quantile(ARRIVAL_DELAY, 0.90)
)
## mean_arr_delay median_arr_delay p90_arr_delay
## 1 4.407057 -5 34
Interpretation: The mean and median arrival delays differ, which suggests delays are skewed by a smaller number of very late flights. This provides a useful benchmark for high delay flights.
flights_clean %>%
filter(!is_cancelled, !is_diverted) %>%
count(AIRLINE, sort = TRUE) %>%
head(10)
## AIRLINE n
## <char> <int>
## 1: WN 1242403
## 2: DL 870275
## 3: AA 712935
## 4: OO 576814
## 5: EV 554752
## 6: UA 507762
## 7: MQ 278791
## 8: B6 262042
## 9: US 194223
## 10: AS 171439
flights_clean %>%
filter(!is_cancelled, !is_diverted) %>%
count(ORIGIN_AIRPORT, sort = TRUE) %>%
head(10)
## ORIGIN_AIRPORT n
## <char> <int>
## 1: ATL 343506
## 2: ORD 276554
## 3: DFW 232647
## 4: DEN 193402
## 5: LAX 192003
## 6: PHX 145552
## 7: SFO 145491
## 8: IAH 144019
## 9: LAS 131937
## 10: MSP 111055
Interpretation: A handful of airlines and major airports account for a large portion of total flights. This matters because system-wide delay patterns are heavily influenced by these high-volume carriers and hubs.
This plot shows average arrival delay by month for completed, non-diverted flights. Highlighting the worst month helps identify seasonal pressure periods.
Interpretation:
The monthly trend reveals clear seasonality in flight delays. Average
arrival delays rise during the peak summer travel demand, suggesting
that the summer demand pressure contribute to system congestion. The
highlighted month June represent the highest average delay, indicating a
period of widespread operational strain rather than isolated
incidents.
This chart focuses only on cancelled flights and shows the composition of cancellation causes by airline. A stacked bar chart is useful here because it highlights how each airline’s cancellations are distributed across major categories (Weather, Carrier, National Air System, Security).
Interpretation:
Weather and carrier-related factors account for the majority of
cancellations across most airlines. However, the relative composition
varies by carrier, indicating differences in operational resilience and
exposure to weather-sensitive routes. Security-related cancellations
remain minimal across all airlines, suggesting they are not a major
driver of overall system disruption.
Each tile represents the average arrival delay for a specific day of week and scheduled departure hour. Cells are grey when fewer than 500 flights exist for that day/hour combination to avoid unstable averages. This heatmap highlights “hotspot” times when delays are consistently higher.
Interpretation:
The heatmap highlights clear temporal hotspots in delay intensity.
Delays tend to increase during certain departure hours, particularly
later in the day, suggesting cumulative effects of earlier disruptions.
Weekday patterns also differ slightly from weekends, indicating that
travel demand and scheduling density may influence system
congestion.
This trellis plot compares average arrival delay by day of week for the top 8 airlines by volume. Airlines are ordered from lowest to highest overall average delay, making it easy to identify which carriers consistently perform better versus worse. Color is used to distinguish days of the week and highlight end-of-week patterns.
Interpretation: The heatmap highlights clear temporal hotspots in delay intensity. Delays tend to increase during certain departure hours, particularly later in the day, suggesting cumulative effects of earlier disruptions. Weekday patterns also differ slightly from weekends, indicating that business travel demand and scheduling density may influence system congestion.
This map visualizes the top 50 origin airports by number of cancelled flights. Circle size increases with more cancellations. This geographic view helps identify whether cancellations are concentrated in major hubs and whether certain regions show higher disruption.
Interpretation:
Cancellations are geographically concentrated at major origin airports,
reflecting the influence of high-volume hubs on system-wide reliability.
Larger airports experience more cancellations in absolute terms, which
may result from higher traffic density and increased sensitivity to
weather and air traffic control constraints. The geographic clustering
reinforces the idea that delays and cancellations are not randomly
distributed but are tied to network structure.
Across 2015, flight delays show clear variation by time of year, day of week, and departure hour. Airlines also differ meaningfully in their average delay performance, and the trellis visualization makes those differences easier to compare. Cancellations are primarily driven by a mix of weather and carrier-related factors, and cancellations concentrate at high-volume origin airports.