This project analyzes arrival delays of airlines across different destinations. The objective is to recreate the original table in a structured data format, transform it into a tidy dataset using R, and evaluate airline performance using percentage-based comparisons rather than raw counts. The final deliverable includes a fully reproducible R Markdown file with narrative explanations of data preparation, transformation, analysis, and conclusions, published to both GitHub and RPubs.
Kaggle = https://www.kaggle.com/datasets/umeradnaan/flight-delays-dataset?select=flight_delays.csv Notes: Given the size of the file, the data was filtered to Airlines “Delta” “United” with destination to “JFK”
url <- "https://raw.githubusercontent.com/JZunaRepo/Week-5/refs/heads/main/filtered_flights.csv"
df <- read.csv(
file = url
)
glimpse(df)
## Rows: 174,559
## Columns: 16
## $ FlightID <int> 24, 28, 34, 38, 44, 55, 60, 77, 81, 85, 93, 98, 118…
## $ Airline <chr> "Delta", "Delta", "Delta", "United", "United", "Uni…
## $ FlightNumber <int> 8277, 6347, 5581, 2482, 8577, 460, 7504, 2034, 1514…
## $ Origin <chr> "JFK", "LAX", "ORD", "ATL", "ORD", "LAX", "ATL", "D…
## $ Destination <chr> "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "J…
## $ ScheduledDeparture <chr> "2024-09-01 01:27", "2024-09-01 22:55", "2024-09-01…
## $ ActualDeparture <chr> "2024-09-01 01:40", "2024-09-01 23:15", "2024-09-01…
## $ ScheduledArrival <chr> "2024-09-01 02:27", "2024-09-02 03:55", "2024-09-01…
## $ ActualArrival <chr> "2024-09-01 02:53", "2024-09-02 03:59", "2024-09-01…
## $ DelayMinutes <int> 26, 4, 12, 18, -6, 30, 9, 19, 18, 17, 18, 19, 18, 3…
## $ DelayReason <chr> "Maintenance", "Maintenance", "Air Traffic Control"…
## $ Cancelled <chr> "True", "True", "False", "True", "False", "True", "…
## $ Diverted <chr> "False", "True", "False", "False", "False", "False"…
## $ AircraftType <chr> "Boeing 737", "Boeing 777", "Boeing 777", "Boeing 7…
## $ TailNumber <chr> "N40985", "N79992", "N63330", "N95927", "N79176", "…
## $ Distance <int> 2198, 2511, 1554, 1919, 226, 1659, 2797, 1027, 570,…
In order to replicate the wide table provided for thsi assignment, I will select the top 5 origins to compare
top5_origins <- df %>%
count(Origin, sort = TRUE) %>%
slice_head(n = 5) %>%
pull(Origin)
In the following section first I will use the top 5 origins I selected above and then use them to filter the data set. At the same time, I will create labels based om the “DelayMinutes” filed to label the flights that were delayed and the flights that were on time for each airline.
counts_long <- df %>%
filter(Origin %in% top5_origins) %>%
mutate(
status = if_else(DelayMinutes > 0, "delayed", "on time")
) %>%
count(Airline, status, Origin, name = "count") %>%
complete(
Airline,
status = c("on time", "delayed"),
Origin = top5_origins
) %>%
replace_na(list(count = 0))
The last step is to recreate the wide table using the dataframe created above by creating a status for each ariline and a column for each of the top airlines in the original dataset.
flights_wide <- counts_long %>%
mutate(status = factor(status, levels = c("on time", "delayed"))) %>%
arrange(Airline, status) %>%
pivot_wider(names_from = Origin, values_from = count)
flights_wide
## # A tibble: 4 × 7
## Airline status ATL DFW JFK LAX ORD
## <chr> <fct> <int> <int> <int> <int> <int>
## 1 Delta on time 4756 4675 4623 4612 4648
## 2 Delta delayed 12642 12877 12698 12816 12919
## 3 United on time 4659 4662 4694 4721 4644
## 4 United delayed 12874 12731 12709 12716 12883
Transforming the data from wide to long
counts_long <- flights_wide %>%
pivot_longer(
cols = all_of(top5_origins),
names_to = "Origin",
values_to = "count"
)
Overall delay by airline Table
overall_rates <- counts_long %>%
group_by(Airline, status) %>%
summarise(total = sum(count), .groups = "drop") %>%
pivot_wider(names_from = status, values_from = total) %>%
mutate(
total_flights = `on time` + delayed,
delay_rate = delayed / total_flights
) %>%
select(Airline, `on time`, delayed, total_flights, delay_rate)
overall_rates
## # A tibble: 2 × 5
## Airline `on time` delayed total_flights delay_rate
## <chr> <int> <int> <int> <dbl>
## 1 Delta 23314 63952 87266 0.733
## 2 United 23380 63913 87293 0.732
Overall delay by airline chart
overall_rates %>%
ggplot(aes(x = Airline, y = delay_rate, fill = Airline)) +
geom_col() +
geom_text(
aes(label = paste0(round(delay_rate * 100, 1), "%")),
vjust = 1.5,
color = "white",
fontface = "bold",
size = 4
) +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous(labels = scales::percent_format()) +
labs(
title = "Overall Delay Rate by Airline",
y = "Delay Rate",
x = NULL
) +
theme_minimal() +
theme(legend.position = "none")
Anthropic. (2025). Claude Opus 4.5 [Large language model]. https://claude.ai. Accessed March 1, 2026.
Delay by city and airline table
city_rates <- counts_long %>%
group_by(Origin, Airline, status) %>%
summarise(total = sum(count), .groups = "drop") %>%
pivot_wider(names_from = status, values_from = total) %>%
mutate(
total_flights = `on time` + delayed,
delay_rate = delayed / total_flights
) %>%
arrange(Origin, delay_rate)
city_rates
## # A tibble: 10 × 6
## Origin Airline `on time` delayed total_flights delay_rate
## <chr> <chr> <int> <int> <int> <dbl>
## 1 ATL Delta 4756 12642 17398 0.727
## 2 ATL United 4659 12874 17533 0.734
## 3 DFW United 4662 12731 17393 0.732
## 4 DFW Delta 4675 12877 17552 0.734
## 5 JFK United 4694 12709 17403 0.730
## 6 JFK Delta 4623 12698 17321 0.733
## 7 LAX United 4721 12716 17437 0.729
## 8 LAX Delta 4612 12816 17428 0.735
## 9 ORD United 4644 12883 17527 0.735
## 10 ORD Delta 4648 12919 17567 0.735
Delay by city and airline chart
city_rates %>%
ggplot(aes(x = Origin, y = delay_rate, fill = Airline)) +
geom_col(position = "dodge") +
geom_text(
aes(label = paste0(round(delay_rate * 100, 1), "%")),
position = position_dodge(width = 0.9),
vjust = 1.5,
color = "white",
fontface = "bold",
size = 3
) +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous(labels = scales::percent_format()) +
labs(
title = "Delay Rate by Origin and Airline",
y = "Delay Rate",
x = NULL
) +
theme_minimal() +
theme(legend.position = "bottom")
Anthropic. (2025). Claude Opus 4.5 [Large language model]. https://claude.ai. Accessed March 1, 2026.
Looking at the aggregated delayed rates, one airline perform almost the same. Delta’s delay rate is about 73.28% while United’s is 73.22% these difference is less than a percentage point. This tells us that when all flights are combined, neither airline is better overall. The aggregate comparison suggest a head to head in performance and reinforces that small numerical difference at this scale is practically redundant.
However, when we take a look at the city by city breakdown it shows a subtle variation. United shows slightly lower delay rates in the individual origins, while Delta performs marginally better in one and ties in another. These differences are small but it still shows that performance can shift depending on the location. This contrast between aggregated results and stratified results reinforces the idea of how important considering weighting effects. Overall delay rates are weighted averages influenced by flight volume at each origin, the combined metric can hide each city level differences.
This reflects the logic behind Simpson’s paradox which tells us that conclusions drawn from aggregated data may not fully capture patterns observed within subgroups, underscoring why both overall and stratified analyses are necessary before forming final judgments.