The objective of this Week 5 Airplane Delays assignment is to reconstruct the flight arrival/delay dataset shown in the prompt (two airlines spanning five destinations) and then use R to tidy, transform, and analyse the data. The analysis will focus on percentages (not just the raw counts of on-time and delayed flights), comparing the two airlines’ performances both overall and city-by-city, and then explaining why those two comparisons may lead to different conclusion. The discrepancy may reflect Simpson’s Paradox, where aggregated results differ from subgroup results due to differences in weighting across the subgroups themselves.
Data Structure
The data will be constructed initially in a wide format, to mirror how the information appears within the presented prompt, then transformed into a long/tidy format for analysis.
Specifically, the constructed wide formatting will reflect the image below:
In this, we can observe that the column headings feature the destinations, whilst the row headings contain the airline names and the ‘on time’/‘delayed’ status labels.
After transformation into long format, each row will instead reflect a singular observation, with the following columns: airline | status | city | count.
Proposed Plan
The analytical approach will likely follow the steps outlined below:
Construct and publish the the wide CSV (from Excel to GitHub)
The dataset will be recreated in Excel using the wide structure in the prior presented image.
In this, any layout styled blanks will be preserved (such as the blank row between the two airline data rows).
Export the Excel file as a CSV.
Upload said CSV to my public GitHub repository so it can be read directly into RStudio.
Import in R and clean the wide file
Once the CSV file has been read into R, the blank separator row between the two airlines must be removed.
The counts of flights for each city must be converted to numeric.
Transform from wide to long, with the use of the pivot_longer() functionality. In this, the columns of city and count will be created, while retaining airline and status.
Compare the airlines’ overall percentages. For each airline, we will calculate total flights and delayed flights to determine the overall delay relate (the total delayed flights/the total number of flights).
Compare the airlines’ city-by-city performance. In this, the delay rate for each airline will be calculated at the city level.
Explain any discrepancies observed between the overall versus the city-by-city results.
Anticipated Challenges
One expected challenge involves data cleaning that will be required on import. The blank separator row may be read in as an NA record and should thus be removed, and the count values may import as character strings (due to the comma formatting), requiring conversion before analysis.
Additionally, the analytical write up must clearly explain any mismatch between overall and city-level performance as a weighting effect (and potentially an example of Simpson’s paradox) rather than as a contradiction or error within the data itself.
Optional Endeavors
Provide a table/chart for both the overall delay rate comparison and the city-by-city delay rate comparison.
Code Base/Body
As with preliminary data cleaning, the first step involves importing the required libraries.
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'stringr' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
Warning: package 'janitor' was built under R version 4.5.2
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
Next, we will import the reconstructed raw CSV file (initially created within Excel, then saved as a CSV file, and housed within my GitHub repository).
Los Angeles Phoenix San Diego San Francisco Seattle
1 ALASKA on time 497 221 212 503 1,841
2 delayed 62 12 20 102 305
3 NA NA NA
4 AM WEST on time 694 4,840 383 320 201
5 delayed 117 415 65 129 61
Now that we have imported our recreated airline delays CSV file, the next stage in the analysis involves cleaning the wide table.
#Rename the two unnamed columnsdelays_wide <- delays_raw_wide %>%rename(airline =1,status =2 )#Remove the blank separator row between the two airlinescity_cols <-c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")delays_wide <- delays_wide %>%mutate(airline =na_if(airline, ""),status =na_if(status, "") ) %>%filter(!(is.na(airline) &is.na(status) &if_all(all_of(city_cols), ~is.na(.x) | .x =="")))#Ensure that the airline is filled downdelays_wide <- delays_wide %>%fill(airline, .direction ="down")#Strip commas within flight counts and convert to numericdelays_wide <- delays_wide %>%mutate(across(all_of(city_cols), ~str_replace_all(as.character(.x),",", ""))) %>%mutate(across(all_of(city_cols), as.numeric))delays_wide
airline status Los Angeles Phoenix San Diego San Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 ALASKA delayed 62 12 20 102 305
3 AM WEST on time 694 4840 383 320 201
4 AM WEST delayed 117 415 65 129 61
# A tibble: 20 × 4
airline status city count
<chr> <chr> <chr> <dbl>
1 ALASKA on time Los Angeles 497
2 ALASKA on time Phoenix 221
3 ALASKA on time San Diego 212
4 ALASKA on time San Francisco 503
5 ALASKA on time Seattle 1841
6 ALASKA delayed Los Angeles 62
7 ALASKA delayed Phoenix 12
8 ALASKA delayed San Diego 20
9 ALASKA delayed San Francisco 102
10 ALASKA delayed Seattle 305
11 AM WEST on time Los Angeles 694
12 AM WEST on time Phoenix 4840
13 AM WEST on time San Diego 383
14 AM WEST on time San Francisco 320
15 AM WEST on time Seattle 201
16 AM WEST delayed Los Angeles 117
17 AM WEST delayed Phoenix 415
18 AM WEST delayed San Diego 65
19 AM WEST delayed San Francisco 129
20 AM WEST delayed Seattle 61
Overall Comparisons: Delay Rate by Airline
Given that our data is now in long format, we are at a stage where we can discern the overall delay rates for each airline. This will be computed as follows:
# A tibble: 2 × 4
airline total_flights delayed_flights delay_rate
<chr> <dbl> <dbl> <dbl>
1 ALASKA 3775 501 13.3
2 AM WEST 7225 787 10.9
As can be viewed in the table above, we can see that although AM West has a greater number of delayed flights than Alaska airlines does, it (AM West) actually features a lower overall delay rate (10.89%), owing to its larger number of total flights. In contrast, although Alaska has a lower number of delayed flights (501 compared to AM West’s 787), because those 501 delays make up a greater portion of Alaska’s overall flights (which are fewer in number than AM West’s), Alaska’s resulting delay rate is higher.
An illustration setting out the above with more clarity will be presented below.
City-by-City Comparisons: Delay Rate by Airline within Each City
Now that we have the overall delay rates for each of the two airlines, our next task entails calculating the delay rate that each airline exhibits at each of the five destinations.
# A tibble: 10 × 5
airline city total_flights delayed_flights delay_rate
<chr> <chr> <dbl> <dbl> <dbl>
1 ALASKA Los Angeles 559 62 11.1
2 ALASKA Phoenix 233 12 5.15
3 ALASKA San Diego 232 20 8.62
4 ALASKA San Francisco 605 102 16.9
5 ALASKA Seattle 2146 305 14.2
6 AM WEST Los Angeles 811 117 14.4
7 AM WEST Phoenix 5255 415 7.9
8 AM WEST San Diego 448 65 14.5
9 AM WEST San Francisco 449 129 28.7
10 AM WEST Seattle 262 61 23.3
An illustraton setting out the delay rates by city and airline will be presented below for additional clarity.
# Build a city-by-city plotting tableplot_city <- delays_long %>%group_by(city, airline, status) %>%summarise(flights =sum(count), .groups ="drop") %>%mutate(flight_status =if_else(status =="delayed", "Delayed", "On time"),flight_status =factor(flight_status, levels =c("Delayed", "On time")) # Delayed on top )# Create a label table using the city_rates outputcity_labels <- city_rates %>%mutate(label =paste0( delay_rate, "%\n(", delayed_flights, "/", total_flights, ")" ) )# Generate the plotggplot(plot_city, aes(x = airline, y = flights, fill = flight_status)) +geom_col() +geom_text(data = city_labels,aes(x = airline, y = total_flights, label = label),vjust =-0.35,size =3,lineheight =0.9,inherit.aes =FALSE ) +facet_wrap(~ city) +labs(title ="City-by-City Flights and Delay Rate by Airline",x ="Airline",y ="Number of flights",fill ="" ) +theme(plot.title =element_text(hjust =0.5),axis.text.x =element_text(angle =20, hjust =1) ) +expand_limits(y =max(city_rates$total_flights) *1.12)
In examining the plot above, which sets out the delay rate percentages of each airline at each of the five examined destinations, a key observation is that in all five destinations, Alaska airline exhibits lower delay rates compared to AM West. For example, in Los Angeles, Alaska’s delay rate is 11.09% while AM West’s was 14.43%, and in San Diego, Alaska’s rate is 8.62%, whilst AM West’s is 14.51%.
Describing the Discrepancy
Now, in considering the overall delay rate by airline and the city-by-city delay rate (of the airlines at each of the examined five cities), there does seem to be a discrepancy in the reasoning as to which airline possesses the higher rate of delay. Specifically, how is it that Alaska airlines has a higher overall delay rate (13.27% compared to AM West’s 10.89%) and yet appears to have a lower delay rate in each of the five cities when examined city-by-city? This apparent contradiction can be explained using the framework of Simpson’s Paradox.
Simpson’s Paradox occurs when a pattern observed within subgroups reverses (or substantially changes) once the data has been aggregated. In this context, the key is that the overall delay rate is not a simple average of the five city-level delay rates. Instead, it is a weighted average, where the weights are the numbers of flights each airline has in each city. AM West operates a very large share of its flights in Phoenix (5,255 total flights), where its delay rate is relatively low (7.90%). By contrast, Alaska operates a much larger share of its flights in Seattle (2,146 total flights), where its delay rate is higher (14.21%). Because these high volume cities dominate each airline’s totals, they disproportionately influence the overall percentages. As a result, even though Alaska has lower delay rates than AM West in each city when compared directly, Alaska’s overall delay rate is pulled upward by having so many flights concentrated in a higher-delay city (Seattle), while AM West’s overall delay rate is pulled downward by having most of its flights concentrated in a lower-delay city (Phoenix).
In other words, the discrepancy is not due to any errors or contradiction in the dataset, but rather due to differences in the flight volume distributions across the cities. When the data is aggregated, the cities do not contribute equally to each airline’s overall delay rate, and that imbalance is what produces the Simpson’s Paradox outcome viewed here.
Conclusion
In completing this Week 5A Airline Delays assignment, the reconstructed dataset was successfully imported from a wide Excel styled CSV and transformed into a tidy long format suitable for analysis. This reshaping step allowed delay performance to be evaluated both at the overall airline level and at the city-by-city level using percentages rather than raw counts.
From the overall comparison, Alaska recorded a higher aggregate delay rate (13.27%) than AM West (10.89%), suggesting that Alaska performed worse when all flights across the five destinations were pooled together. However, when the same comparison was conducted city-by-city, Alaska displayed a lower delay rate in each of the five cities, meaning that within every destination Alaska was, in observed effect, the better performer. The apparent contradiction between overall and city-level delay results is driven by high volume cities dominating the totals, which is consistent with Simpson’s Paradox and shows why aggregate summaries must be considered in context.
LLM Used
OpenAI. (2026). ChatGPT (Version 4o) [Large language model]. https://chat.openai.com . Accessed February 28, 2026.