week_5A_airline_delays

Author

Brandon Chanderban

Introduction/Approach

The objective of this Week 5 Airplane Delays assignment is to reconstruct the flight arrival/delay dataset shown in the prompt (two airlines spanning five destinations) and then use R to tidy, transform, and analyse the data. The analysis will focus on percentages (not just the raw counts of on-time and delayed flights), comparing the two airlines’ performances both overall and city-by-city, and then explaining why those two comparisons may lead to different conclusion. The discrepancy may reflect Simpson’s Paradox, where aggregated results differ from subgroup results due to differences in weighting across the subgroups themselves.

Data Structure

The data will be constructed initially in a wide format, to mirror how the information appears within the presented prompt, then transformed into a long/tidy format for analysis.

Specifically, the constructed wide formatting will reflect the image below:

In this, we can observe that the column headings feature the destinations, whilst the row headings contain the airline names and the ‘on time’/‘delayed’ status labels.

After transformation into long format, each row will instead reflect a singular observation, with the following columns: airline | status | city | count.

Proposed Plan

The analytical approach will likely follow the steps outlined below:

  • Construct and publish the the wide CSV (from Excel to GitHub)

    • The dataset will be recreated in Excel using the wide structure in the prior presented image.

    • In this, any layout styled blanks will be preserved (such as the blank row between the two airline data rows).

    • Export the Excel file as a CSV.

    • Upload said CSV to my public GitHub repository so it can be read directly into RStudio.

  • Import in R and clean the wide file

    • Once the CSV file has been read into R, the blank separator row between the two airlines must be removed.

    • The counts of flights for each city must be converted to numeric.

  • Transform from wide to long, with the use of the pivot_longer() functionality. In this, the columns of city and count will be created, while retaining airline and status.

  • Compare the airlines’ overall percentages. For each airline, we will calculate total flights and delayed flights to determine the overall delay relate (the total delayed flights/the total number of flights).

  • Compare the airlines’ city-by-city performance. In this, the delay rate for each airline will be calculated at the city level.

  • Explain any discrepancies observed between the overall versus the city-by-city results.

Anticipated Challenges

One expected challenge involves data cleaning that will be required on import. The blank separator row may be read in as an NA record and should thus be removed, and the count values may import as character strings (due to the comma formatting), requiring conversion before analysis.

Additionally, the analytical write up must clearly explain any mismatch between overall and city-level performance as a weighting effect (and potentially an example of Simpson’s paradox) rather than as a contradiction or error within the data itself.

Optional Endeavors

  • Provide a table/chart for both the overall delay rate comparison and the city-by-city delay rate comparison.

Code Base/Body

As with preliminary data cleaning, the first step involves importing the required libraries.

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'stringr' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
Warning: package 'janitor' was built under R version 4.5.2

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

Next, we will import the reconstructed raw CSV file (initially created within Excel, then saved as a CSV file, and housed within my GitHub repository).

csv_url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/week_5A_assignment/recreated%20airline%20delays%20data%20table.csv"

delays_raw_wide <- read.csv(csv_url, stringsAsFactors = FALSE, check.names = FALSE)

head(delays_raw_wide)
                  Los Angeles Phoenix San Diego San Francisco Seattle
1  ALASKA on time         497     221       212           503   1,841
2         delayed          62      12        20           102     305
3                          NA                NA            NA        
4 AM WEST on time         694   4,840       383           320     201
5         delayed         117     415        65           129      61
glimpse(delays_raw_wide)
Rows: 5
Columns: 7
$ ``              <chr> "ALASKA", "", "", "AM WEST", ""
$ ``              <chr> "on time", "delayed", "", "on time", "delayed"
$ `Los Angeles`   <int> 497, 62, NA, 694, 117
$ Phoenix         <chr> "221", "12", "", "4,840", "415"
$ `San Diego`     <int> 212, 20, NA, 383, 65
$ `San Francisco` <int> 503, 102, NA, 320, 129
$ Seattle         <chr> "1,841", "305", "", "201", "61"

Now that we have imported our recreated airline delays CSV file, the next stage in the analysis involves cleaning the wide table.

#Rename the two unnamed columns
delays_wide <- delays_raw_wide %>%
  rename(
    airline = 1,
    status = 2
  )

#Remove the blank separator row between the two airlines
city_cols <- c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")

delays_wide <- delays_wide %>%
  mutate(
    airline = na_if(airline, ""),
    status  = na_if(status, "")
  ) %>%
  filter(!(is.na(airline) & is.na(status) & if_all(all_of(city_cols), ~ is.na(.x) | .x == "")))

#Ensure that the airline is filled down
delays_wide <- delays_wide %>%
  fill(airline, .direction = "down")

#Strip commas within flight counts and convert to numeric
delays_wide <- delays_wide %>%
  mutate(across(all_of(city_cols), ~ str_replace_all(as.character(.x),",", ""))) %>%
  mutate(across(all_of(city_cols), as.numeric))

delays_wide
  airline  status Los Angeles Phoenix San Diego San Francisco Seattle
1  ALASKA on time         497     221       212           503    1841
2  ALASKA delayed          62      12        20           102     305
3 AM WEST on time         694    4840       383           320     201
4 AM WEST delayed         117     415        65           129      61
glimpse(delays_wide)
Rows: 4
Columns: 7
$ airline         <chr> "ALASKA", "ALASKA", "AM WEST", "AM WEST"
$ status          <chr> "on time", "delayed", "on time", "delayed"
$ `Los Angeles`   <dbl> 497, 62, 694, 117
$ Phoenix         <dbl> 221, 12, 4840, 415
$ `San Diego`     <dbl> 212, 20, 383, 65
$ `San Francisco` <dbl> 503, 102, 320, 129
$ Seattle         <dbl> 1841, 305, 201, 61

Now that we have cleaned the delays_wide data object, we can convert it from wide to long formatting for easier analysis.

delays_long <- delays_wide %>%
  pivot_longer(
    cols = all_of(city_cols),
    names_to = "city",
    values_to = "count"
  ) %>%
  mutate(
    status = str_to_lower(str_squish(status)),
    airline = str_squish(airline)
  )

delays_long
# A tibble: 20 × 4
   airline status  city          count
   <chr>   <chr>   <chr>         <dbl>
 1 ALASKA  on time Los Angeles     497
 2 ALASKA  on time Phoenix         221
 3 ALASKA  on time San Diego       212
 4 ALASKA  on time San Francisco   503
 5 ALASKA  on time Seattle        1841
 6 ALASKA  delayed Los Angeles      62
 7 ALASKA  delayed Phoenix          12
 8 ALASKA  delayed San Diego        20
 9 ALASKA  delayed San Francisco   102
10 ALASKA  delayed Seattle         305
11 AM WEST on time Los Angeles     694
12 AM WEST on time Phoenix        4840
13 AM WEST on time San Diego       383
14 AM WEST on time San Francisco   320
15 AM WEST on time Seattle         201
16 AM WEST delayed Los Angeles     117
17 AM WEST delayed Phoenix         415
18 AM WEST delayed San Diego        65
19 AM WEST delayed San Francisco   129
20 AM WEST delayed Seattle          61

Overall Comparisons: Delay Rate by Airline

Given that our data is now in long format, we are at a stage where we can discern the overall delay rates for each airline. This will be computed as follows:

overall_rates <- delays_long %>%
  group_by(airline) %>%
  summarise(
    total_flights = sum(count),
    delayed_flights = sum(count[status == 'delayed']),
    delay_rate = round((delayed_flights/total_flights*100),2),
    .groups = "drop"
  )
overall_rates
# A tibble: 2 × 4
  airline total_flights delayed_flights delay_rate
  <chr>           <dbl>           <dbl>      <dbl>
1 ALASKA           3775             501       13.3
2 AM WEST          7225             787       10.9

As can be viewed in the table above, we can see that although AM West has a greater number of delayed flights than Alaska airlines does, it (AM West) actually features a lower overall delay rate (10.89%), owing to its larger number of total flights. In contrast, although Alaska has a lower number of delayed flights (501 compared to AM West’s 787), because those 501 delays make up a greater portion of Alaska’s overall flights (which are fewer in number than AM West’s), Alaska’s resulting delay rate is higher.

An illustration setting out the above with more clarity will be presented below.

plot_overall <- overall_rates %>%
  mutate(on_time_flights = total_flights - delayed_flights) %>%
  select(airline, on_time_flights, delayed_flights, total_flights, delay_rate) %>%
  pivot_longer(
    cols = c(on_time_flights, delayed_flights),
    names_to = "flight_status",
    values_to = "flights"
  ) %>%
  mutate(
    flight_status = recode(
      flight_status,
      on_time_flights = "On time",
      delayed_flights = "Delayed"
    )
  )

ggplot(plot_overall, aes(x = airline, y = flights, fill = flight_status)) +
  geom_col() +
  geom_text(
    data = overall_rates,
    aes(
      x = airline,
      y = total_flights,
      label = paste0(
        "Delay rate: ", delay_rate, "% (",
        delayed_flights, "/", total_flights, ")"
      )
    ),
    vjust = -0.4,
    inherit.aes = FALSE
  ) +
  labs(
    title = "Overall Flights and Delay Rate by Airline",
    x = "Airline",
    y = "Number of flights",
    fill = ""
  ) +
  expand_limits(y = max(overall_rates$total_flights) * 1.10) +
  theme(plot.title = element_text(hjust = 0.5))

City-by-City Comparisons: Delay Rate by Airline within Each City

Now that we have the overall delay rates for each of the two airlines, our next task entails calculating the delay rate that each airline exhibits at each of the five destinations.

city_rates <- delays_long %>%
  group_by(airline, city) %>%
  summarise(
    total_flights = sum(count),
    delayed_flights = sum(count[status == 'delayed']),
    delay_rate = round((delayed_flights/total_flights*100),2),
    .groups = "drop"
  )

city_rates
# A tibble: 10 × 5
   airline city          total_flights delayed_flights delay_rate
   <chr>   <chr>                 <dbl>           <dbl>      <dbl>
 1 ALASKA  Los Angeles             559              62      11.1 
 2 ALASKA  Phoenix                 233              12       5.15
 3 ALASKA  San Diego               232              20       8.62
 4 ALASKA  San Francisco           605             102      16.9 
 5 ALASKA  Seattle                2146             305      14.2 
 6 AM WEST Los Angeles             811             117      14.4 
 7 AM WEST Phoenix                5255             415       7.9 
 8 AM WEST San Diego               448              65      14.5 
 9 AM WEST San Francisco           449             129      28.7 
10 AM WEST Seattle                 262              61      23.3 

An illustraton setting out the delay rates by city and airline will be presented below for additional clarity.

# Build a city-by-city plotting table
plot_city <- delays_long %>%
  group_by(city, airline, status) %>%
  summarise(flights = sum(count), .groups = "drop") %>%
  mutate(
    flight_status = if_else(status == "delayed", "Delayed", "On time"),
    flight_status = factor(flight_status, levels = c("Delayed", "On time"))  # Delayed on top
  )

# Create a label table using the city_rates output
city_labels <- city_rates %>%
  mutate(
    label = paste0(
      delay_rate, "%\n(",
      delayed_flights, "/", total_flights, ")"
    )
  )

# Generate the plot
ggplot(plot_city, aes(x = airline, y = flights, fill = flight_status)) +
  geom_col() +
  geom_text(
    data = city_labels,
    aes(x = airline, y = total_flights, label = label),
    vjust = -0.35,
    size = 3,
    lineheight = 0.9,
    inherit.aes = FALSE
  ) +
  facet_wrap(~ city) +
  labs(
    title = "City-by-City Flights and Delay Rate by Airline",
    x = "Airline",
    y = "Number of flights",
    fill = ""
  ) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 20, hjust = 1)
  ) +
  expand_limits(y = max(city_rates$total_flights) * 1.12)

In examining the plot above, which sets out the delay rate percentages of each airline at each of the five examined destinations, a key observation is that in all five destinations, Alaska airline exhibits lower delay rates compared to AM West. For example, in Los Angeles, Alaska’s delay rate is 11.09% while AM West’s was 14.43%, and in San Diego, Alaska’s rate is 8.62%, whilst AM West’s is 14.51%.

Describing the Discrepancy

Now, in considering the overall delay rate by airline and the city-by-city delay rate (of the airlines at each of the examined five cities), there does seem to be a discrepancy in the reasoning as to which airline possesses the higher rate of delay. Specifically, how is it that Alaska airlines has a higher overall delay rate (13.27% compared to AM West’s 10.89%) and yet appears to have a lower delay rate in each of the five cities when examined city-by-city? This apparent contradiction can be explained using the framework of Simpson’s Paradox.

Simpson’s Paradox occurs when a pattern observed within subgroups reverses (or substantially changes) once the data has been aggregated. In this context, the key is that the overall delay rate is not a simple average of the five city-level delay rates. Instead, it is a weighted average, where the weights are the numbers of flights each airline has in each city. AM West operates a very large share of its flights in Phoenix (5,255 total flights), where its delay rate is relatively low (7.90%). By contrast, Alaska operates a much larger share of its flights in Seattle (2,146 total flights), where its delay rate is higher (14.21%). Because these high volume cities dominate each airline’s totals, they disproportionately influence the overall percentages. As a result, even though Alaska has lower delay rates than AM West in each city when compared directly, Alaska’s overall delay rate is pulled upward by having so many flights concentrated in a higher-delay city (Seattle), while AM West’s overall delay rate is pulled downward by having most of its flights concentrated in a lower-delay city (Phoenix).

In other words, the discrepancy is not due to any errors or contradiction in the dataset, but rather due to differences in the flight volume distributions across the cities. When the data is aggregated, the cities do not contribute equally to each airline’s overall delay rate, and that imbalance is what produces the Simpson’s Paradox outcome viewed here.

Conclusion

In completing this Week 5A Airline Delays assignment, the reconstructed dataset was successfully imported from a wide Excel styled CSV and transformed into a tidy long format suitable for analysis. This reshaping step allowed delay performance to be evaluated both at the overall airline level and at the city-by-city level using percentages rather than raw counts.

From the overall comparison, Alaska recorded a higher aggregate delay rate (13.27%) than AM West (10.89%), suggesting that Alaska performed worse when all flights across the five destinations were pooled together. However, when the same comparison was conducted city-by-city, Alaska displayed a lower delay rate in each of the five cities, meaning that within every destination Alaska was, in observed effect, the better performer. The apparent contradiction between overall and city-level delay results is driven by high volume cities dominating the totals, which is consistent with Simpson’s Paradox and shows why aggregate summaries must be considered in context.

LLM Used

  • OpenAI. (2026). ChatGPT (Version 4o) [Large language model]. https://chat.openai.com . Accessed February 28, 2026.