Week 5A Code

Body

Source

Kaggle = https://www.kaggle.com/datasets/umeradnaan/flight-delays-dataset?select=flight_delays.csv Notes: Given the size of the file, the data was filtered to Airlines “Delta” “United” with destination to “JFK”

url <- "https://raw.githubusercontent.com/JZunaRepo/Week-5/refs/heads/main/filtered_flights.csv"

df <- read.csv(
  file = url
)

glimpse(df)

## Rows: 174,559
## Columns: 16
## $ FlightID           <int> 24, 28, 34, 38, 44, 55, 60, 77, 81, 85, 93, 98, 118…
## $ Airline            <chr> "Delta", "Delta", "Delta", "United", "United", "Uni…
## $ FlightNumber       <int> 8277, 6347, 5581, 2482, 8577, 460, 7504, 2034, 1514…
## $ Origin             <chr> "JFK", "LAX", "ORD", "ATL", "ORD", "LAX", "ATL", "D…
## $ Destination        <chr> "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "J…
## $ ScheduledDeparture <chr> "2024-09-01 01:27", "2024-09-01 22:55", "2024-09-01…
## $ ActualDeparture    <chr> "2024-09-01 01:40", "2024-09-01 23:15", "2024-09-01…
## $ ScheduledArrival   <chr> "2024-09-01 02:27", "2024-09-02 03:55", "2024-09-01…
## $ ActualArrival      <chr> "2024-09-01 02:53", "2024-09-02 03:59", "2024-09-01…
## $ DelayMinutes       <int> 26, 4, 12, 18, -6, 30, 9, 19, 18, 17, 18, 19, 18, 3…
## $ DelayReason        <chr> "Maintenance", "Maintenance", "Air Traffic Control"…
## $ Cancelled          <chr> "True", "True", "False", "True", "False", "True", "…
## $ Diverted           <chr> "False", "True", "False", "False", "False", "False"…
## $ AircraftType       <chr> "Boeing 737", "Boeing 777", "Boeing 777", "Boeing 7…
## $ TailNumber         <chr> "N40985", "N79992", "N63330", "N95927", "N79176", "…
## $ Distance           <int> 2198, 2511, 1554, 1919, 226, 1659, 2797, 1027, 570,…

Wide Table Recreation

In order to replicate the wide table provided for thsi assignment, I will select the top 5 origins to compare

top5_origins <- df %>%
  count(Origin, sort = TRUE) %>%
  slice_head(n = 5) %>%
  pull(Origin)

In the following section first I will use the top 5 origins I selected above and then use them to filter the data set. At the same time, I will create labels based om the “DelayMinutes” filed to label the flights that were delayed and the flights that were on time for each airline.

counts_long <- df %>%
  filter(Origin %in% top5_origins) %>%
  mutate(
    status = if_else(DelayMinutes > 0, "delayed", "on time")
  ) %>%
  count(Airline, status, Origin, name = "count") %>%
  complete(
    Airline,
    status = c("on time", "delayed"),
    Origin = top5_origins
  ) %>%
  replace_na(list(count = 0))

The last step is to recreate the wide table using the dataframe created above by creating a status for each ariline and a column for each of the top airlines in the original dataset.

flights_wide <- counts_long %>%
  mutate(status = factor(status, levels = c("on time", "delayed"))) %>%
  arrange(Airline, status) %>%
  pivot_wider(names_from = Origin, values_from = count)

flights_wide

## # A tibble: 4 × 7
##   Airline status    ATL   DFW   JFK   LAX   ORD
##   <chr>   <fct>   <int> <int> <int> <int> <int>
## 1 Delta   on time  4756  4675  4623  4612  4648
## 2 Delta   delayed 12642 12877 12698 12816 12919
## 3 United  on time  4659  4662  4694  4721  4644
## 4 United  delayed 12874 12731 12709 12716 12883

Data Transformation

Transforming the data from wide to long

counts_long <- flights_wide %>%
  pivot_longer(
    cols = all_of(top5_origins),
    names_to = "Origin",
    values_to = "count"
  )

Analysis

Overall delay by airline Table

overall_rates <- counts_long %>%
  group_by(Airline, status) %>%
  summarise(total = sum(count), .groups = "drop") %>%
  pivot_wider(names_from = status, values_from = total) %>%
  mutate(
    total_flights = `on time` + delayed,
    delay_rate = delayed / total_flights
  ) %>%
  select(Airline, `on time`, delayed, total_flights, delay_rate)

overall_rates

## # A tibble: 2 × 5
##   Airline `on time` delayed total_flights delay_rate
##   <chr>       <int>   <int>         <int>      <dbl>
## 1 Delta       23314   63952         87266      0.733
## 2 United      23380   63913         87293      0.732

Overall delay by airline chart

overall_rates %>% 
  ggplot(aes(x = Airline, y = delay_rate, fill = Airline)) + 
  geom_col() + 
  geom_text(
    aes(label = paste0(round(delay_rate * 100, 1), "%")),
    vjust = 1.5,
    color = "white",
    fontface = "bold",
    size = 4
  ) +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Overall Delay Rate by Airline",
    y = "Delay Rate",
    x = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Anthropic. (2025). Claude Opus 4.5 [Large language model]. https://claude.ai. Accessed March 1, 2026.

Delay by city and airline table

city_rates <- counts_long %>%
  group_by(Origin, Airline, status) %>%
  summarise(total = sum(count), .groups = "drop") %>%
  pivot_wider(names_from = status, values_from = total) %>%
  mutate(
    total_flights = `on time` + delayed,
    delay_rate = delayed / total_flights
  ) %>%
  arrange(Origin, delay_rate)

city_rates

## # A tibble: 10 × 6
##    Origin Airline `on time` delayed total_flights delay_rate
##    <chr>  <chr>       <int>   <int>         <int>      <dbl>
##  1 ATL    Delta        4756   12642         17398      0.727
##  2 ATL    United       4659   12874         17533      0.734
##  3 DFW    United       4662   12731         17393      0.732
##  4 DFW    Delta        4675   12877         17552      0.734
##  5 JFK    United       4694   12709         17403      0.730
##  6 JFK    Delta        4623   12698         17321      0.733
##  7 LAX    United       4721   12716         17437      0.729
##  8 LAX    Delta        4612   12816         17428      0.735
##  9 ORD    United       4644   12883         17527      0.735
## 10 ORD    Delta        4648   12919         17567      0.735

Delay by city and airline chart

city_rates %>%
  ggplot(aes(x = Origin, y = delay_rate, fill = Airline)) +
  geom_col(position = "dodge") +
  geom_text(
    aes(label = paste0(round(delay_rate * 100, 1), "%")),
    position = position_dodge(width = 0.9),
    vjust = 1.5,
    color = "white",
    fontface = "bold",
    size = 3
  ) +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Delay Rate by Origin and Airline",
    y = "Delay Rate",
    x = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Anthropic. (2025). Claude Opus 4.5 [Large language model]. https://claude.ai. Accessed March 1, 2026.

Week 5A Code

Jonnathan Zuna Largo

2026-03-02

Introduction

Body

Source

Wide Table Recreation

Data Transformation

Analysis

Conclusion

Overall Comparison