Introduction

This project analyzes arrival delays of airlines across different destinations. The objective is to recreate the original table in a structured data format, transform it into a tidy dataset using R, and evaluate airline performance using percentage-based comparisons rather than raw counts. The final deliverable includes a fully reproducible R Markdown file with narrative explanations of data preparation, transformation, analysis, and conclusions, published to both GitHub and RPubs.

Body

Source

Kaggle = https://www.kaggle.com/datasets/umeradnaan/flight-delays-dataset?select=flight_delays.csv Notes: Given the size of the file, the data was filtered to Airlines “Delta” “United” with destination to “JFK”

url <- "https://raw.githubusercontent.com/JZunaRepo/Week-5/refs/heads/main/filtered_flights.csv"

df <- read.csv(
  file = url
)

glimpse(df)
## Rows: 174,559
## Columns: 16
## $ FlightID           <int> 24, 28, 34, 38, 44, 55, 60, 77, 81, 85, 93, 98, 118…
## $ Airline            <chr> "Delta", "Delta", "Delta", "United", "United", "Uni…
## $ FlightNumber       <int> 8277, 6347, 5581, 2482, 8577, 460, 7504, 2034, 1514…
## $ Origin             <chr> "JFK", "LAX", "ORD", "ATL", "ORD", "LAX", "ATL", "D…
## $ Destination        <chr> "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "J…
## $ ScheduledDeparture <chr> "2024-09-01 01:27", "2024-09-01 22:55", "2024-09-01…
## $ ActualDeparture    <chr> "2024-09-01 01:40", "2024-09-01 23:15", "2024-09-01…
## $ ScheduledArrival   <chr> "2024-09-01 02:27", "2024-09-02 03:55", "2024-09-01…
## $ ActualArrival      <chr> "2024-09-01 02:53", "2024-09-02 03:59", "2024-09-01…
## $ DelayMinutes       <int> 26, 4, 12, 18, -6, 30, 9, 19, 18, 17, 18, 19, 18, 3…
## $ DelayReason        <chr> "Maintenance", "Maintenance", "Air Traffic Control"…
## $ Cancelled          <chr> "True", "True", "False", "True", "False", "True", "…
## $ Diverted           <chr> "False", "True", "False", "False", "False", "False"…
## $ AircraftType       <chr> "Boeing 737", "Boeing 777", "Boeing 777", "Boeing 7…
## $ TailNumber         <chr> "N40985", "N79992", "N63330", "N95927", "N79176", "…
## $ Distance           <int> 2198, 2511, 1554, 1919, 226, 1659, 2797, 1027, 570,…

Wide Table Recreation

In order to replicate the wide table provided for thsi assignment, I will select the top 5 origins to compare

top5_origins <- df %>%
  count(Origin, sort = TRUE) %>%
  slice_head(n = 5) %>%
  pull(Origin)

In the following section first I will use the top 5 origins I selected above and then use them to filter the data set. At the same time, I will create labels based om the “DelayMinutes” filed to label the flights that were delayed and the flights that were on time for each airline.

counts_long <- df %>%
  filter(Origin %in% top5_origins) %>%
  mutate(
    status = if_else(DelayMinutes > 0, "delayed", "on time")
  ) %>%
  count(Airline, status, Origin, name = "count") %>%
  complete(
    Airline,
    status = c("on time", "delayed"),
    Origin = top5_origins
  ) %>%
  replace_na(list(count = 0))

The last step is to recreate the wide table using the dataframe created above by creating a status for each ariline and a column for each of the top airlines in the original dataset.

flights_wide <- counts_long %>%
  mutate(status = factor(status, levels = c("on time", "delayed"))) %>%
  arrange(Airline, status) %>%
  pivot_wider(names_from = Origin, values_from = count)

flights_wide
## # A tibble: 4 × 7
##   Airline status    ATL   DFW   JFK   LAX   ORD
##   <chr>   <fct>   <int> <int> <int> <int> <int>
## 1 Delta   on time  4756  4675  4623  4612  4648
## 2 Delta   delayed 12642 12877 12698 12816 12919
## 3 United  on time  4659  4662  4694  4721  4644
## 4 United  delayed 12874 12731 12709 12716 12883

Data Transformation

Transforming the data from wide to long

counts_long <- flights_wide %>%
  pivot_longer(
    cols = all_of(top5_origins),
    names_to = "Origin",
    values_to = "count"
  )

Analysis

Overall delay by airline Table

overall_rates <- counts_long %>%
  group_by(Airline, status) %>%
  summarise(total = sum(count), .groups = "drop") %>%
  pivot_wider(names_from = status, values_from = total) %>%
  mutate(
    total_flights = `on time` + delayed,
    delay_rate = delayed / total_flights
  ) %>%
  select(Airline, `on time`, delayed, total_flights, delay_rate)

overall_rates
## # A tibble: 2 × 5
##   Airline `on time` delayed total_flights delay_rate
##   <chr>       <int>   <int>         <int>      <dbl>
## 1 Delta       23314   63952         87266      0.733
## 2 United      23380   63913         87293      0.732

Overall delay by airline chart

overall_rates %>% 
  ggplot(aes(x = Airline, y = delay_rate, fill = Airline)) + 
  geom_col() + 
  geom_text(
    aes(label = paste0(round(delay_rate * 100, 1), "%")),
    vjust = 1.5,
    color = "white",
    fontface = "bold",
    size = 4
  ) +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Overall Delay Rate by Airline",
    y = "Delay Rate",
    x = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Anthropic. (2025). Claude Opus 4.5 [Large language model]. https://claude.ai. Accessed March 1, 2026.

Delay by city and airline table

city_rates <- counts_long %>%
  group_by(Origin, Airline, status) %>%
  summarise(total = sum(count), .groups = "drop") %>%
  pivot_wider(names_from = status, values_from = total) %>%
  mutate(
    total_flights = `on time` + delayed,
    delay_rate = delayed / total_flights
  ) %>%
  arrange(Origin, delay_rate)

city_rates
## # A tibble: 10 × 6
##    Origin Airline `on time` delayed total_flights delay_rate
##    <chr>  <chr>       <int>   <int>         <int>      <dbl>
##  1 ATL    Delta        4756   12642         17398      0.727
##  2 ATL    United       4659   12874         17533      0.734
##  3 DFW    United       4662   12731         17393      0.732
##  4 DFW    Delta        4675   12877         17552      0.734
##  5 JFK    United       4694   12709         17403      0.730
##  6 JFK    Delta        4623   12698         17321      0.733
##  7 LAX    United       4721   12716         17437      0.729
##  8 LAX    Delta        4612   12816         17428      0.735
##  9 ORD    United       4644   12883         17527      0.735
## 10 ORD    Delta        4648   12919         17567      0.735

Delay by city and airline chart

city_rates %>%
  ggplot(aes(x = Origin, y = delay_rate, fill = Airline)) +
  geom_col(position = "dodge") +
  geom_text(
    aes(label = paste0(round(delay_rate * 100, 1), "%")),
    position = position_dodge(width = 0.9),
    vjust = 1.5,
    color = "white",
    fontface = "bold",
    size = 3
  ) +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Delay Rate by Origin and Airline",
    y = "Delay Rate",
    x = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Anthropic. (2025). Claude Opus 4.5 [Large language model]. https://claude.ai. Accessed March 1, 2026.

Conclusion

Overall Comparison

Looking at the aggregated delayed rates, one airline perform almost the same. Delta’s delay rate is about 73.28% while United’s is 73.22% these difference is less than a percentage point. This tells us that when all flights are combined, neither airline is better overall. The aggregate comparison suggest a head to head in performance and reinforces that small numerical difference at this scale is practically redundant.

However, when we take a look at the city by city breakdown it shows a subtle variation. United shows slightly lower delay rates in the individual origins, while Delta performs marginally better in one and ties in another. These differences are small but it still shows that performance can shift depending on the location. This contrast between aggregated results and stratified results reinforces the idea of how important considering weighting effects. Overall delay rates are weighted averages influenced by flight volume at each origin, the combined metric can hide each city level differences.

This reflects the logic behind Simpson’s paradox which tells us that conclusions drawn from aggregated data may not fully capture patterns observed within subgroups, underscoring why both overall and stratified analyses are necessary before forming final judgments.