This assignment analyzes arrival delay data for two airlines across five cities. The data is originally presented in a wide summary format, so it must first be recreated and transformed into a tidy structure in R. Once properly structured, delay rates will be calculated using percentages rather than raw counts to allow for meaningful comparison. The analysis will compare overall airline performance as well as performance within each city, highlighting how aggregated results can differ from grouped comparisons.
Approach
The first thing I will do for this assignment is recreate the airline arrival data from the provided chart as a CSV file in the same wide format shown in the source table. After that, I will read the CSV file into R and use tidyr and dplyr to transform the data from wide to long format so that each row represents one airline, one city, one status (on time or delayed), and its corresponding count.
Next, I will calculate delay percentages by dividing the number of delayed flights by the total number of flights for each airline. I will first compare the overall delay percentages between the two airlines across all cities combined. Then, I will compare delay percentages city-by-city to examine differences within each destination.
Finally, I will describe the discrepancy between the overall comparison and the city-level comparison, explaining how differences in the distribution of flights across cities can lead to different conclusions when the data is aggregated versus when it is grouped.
Code Base
Creating, saving, loading, and checking CSV data file
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Airline Status Los_Angeles Phoenix San_Diego San_Francisco Seattle
1 Alaska on time 497 221 212 503 1841
2 Alaska delayed 62 12 20 102 305
3 AM West on time 694 4840 383 320 201
4 AM West delayed 117 415 65 129 61
# Export dataset to CSV file (wide format as required)write.csv(airline_delays_wide,"airline_delays_wide.csv",row.names =FALSE)
# Save wide dataset as CSV in working directorywrite_csv(airline_delays_wide, "airline_delays_wide.csv")# Read it from the GitHub repositoryairline_wide <-read_csv("https://raw.githubusercontent.com/MKudanova/Data607/refs/heads/main/5A/airline_delays_wide.csv",show_col_types =FALSE)glimpse(airline_wide)
# Wide to long airline_long <- airline_wide %>%pivot_longer(cols =-c(Airline, Status),names_to ="City",values_to ="Count" )airline_long
# A tibble: 20 × 4
Airline Status City Count
<chr> <chr> <chr> <dbl>
1 Alaska on time Los_Angeles 497
2 Alaska on time Phoenix 221
3 Alaska on time San_Diego 212
4 Alaska on time San_Francisco 503
5 Alaska on time Seattle 1841
6 Alaska delayed Los_Angeles 62
7 Alaska delayed Phoenix 12
8 Alaska delayed San_Diego 20
9 Alaska delayed San_Francisco 102
10 Alaska delayed Seattle 305
11 AM West on time Los_Angeles 694
12 AM West on time Phoenix 4840
13 AM West on time San_Diego 383
14 AM West on time San_Francisco 320
15 AM West on time Seattle 201
16 AM West delayed Los_Angeles 117
17 AM West delayed Phoenix 415
18 AM West delayed San_Diego 65
19 AM West delayed San_Francisco 129
20 AM West delayed Seattle 61
# A tibble: 2 × 4
Airline Total_Flights delayed Delay_Percent
<chr> <dbl> <dbl> <dbl>
1 AM West 7225 787 10.9
2 Alaska 3775 501 13.3
Overall totals and delay percentages per airline: Summary
When comparing overall delay percentages across all five cities combined, AM West has a lower delay rate of 10.89% compared to Alaska’s 13.27%. Based on the aggregated results, AM West appears to perform better overall in terms of on-time arrivals. However, this comparison reflects total flights across all destinations rather than performance within individual cities.
City-Level Delay Percentages by Airline
# Compute delay percentages per airline within each citycity_results <- airline_long %>%group_by(Airline, City, Status) %>%summarise(Total =sum(Count), .groups ="drop") %>%pivot_wider(names_from = Status, values_from = Total) %>%mutate(Total_Flights =`on time`+ delayed,Delay_Percent = (delayed / Total_Flights) *100 ) %>%select(Airline, City, Total_Flights, delayed, Delay_Percent)city_results
# A tibble: 10 × 5
Airline City Total_Flights delayed Delay_Percent
<chr> <chr> <dbl> <dbl> <dbl>
1 AM West Los_Angeles 811 117 14.4
2 AM West Phoenix 5255 415 7.90
3 AM West San_Diego 448 65 14.5
4 AM West San_Francisco 449 129 28.7
5 AM West Seattle 262 61 23.3
6 Alaska Los_Angeles 559 62 11.1
7 Alaska Phoenix 233 12 5.15
8 Alaska San_Diego 232 20 8.62
9 Alaska San_Francisco 605 102 16.9
10 Alaska Seattle 2146 305 14.2
City-Level Delay Percentages: Summary
The city-level delay percentage calculations show noticeable variation across destinations for both airlines. For AM West, delay rates range from 7.90% in Phoenix to 28.73% in San Francisco, with relatively higher delays in San Francisco and Seattle. For Alaska, delay rates range from 5.15% in Phoenix to 16.86% in San Francisco, with Phoenix showing the lowest delay rate and San Francisco and Seattle showing comparatively higher rates. Overall, delay percentages differ substantially by city, indicating that destination plays an important role in flight performance.
Discrepancy Between Overall and City-Level Comparisons
The overall comparison shows that AM West has a lower total delay percentage than Alaska when all cities are combined. However, the city-level analysis shows that Alaska has a lower delay rate in each of the five individual cities. In other words, the airline that appears to perform better overall does not perform better within any single city.
Explanation of the Discrepancy
This discrepancy occurs because the overall delay percentage is influenced by the distribution of flights across cities. AM West operates a large number of flights in Phoenix, where delay rates are relatively low, which reduces its overall delay percentage. In contrast, Alaska operates a substantial number of flights in Seattle, where delay rates are higher, increasing its overall percentage. As a result, the aggregated comparison reflects differences in flight volume across cities rather than consistent performance advantages within each destination.
Conclusion
This analysis shows that overall airline performance can differ from city-level results due to differences in flight distribution. Although AM West appears to have a lower overall delay rate, Alaska performs better within each individual city. The discrepancy highlights how aggregated data can mask underlying patterns. Transforming the dataset into tidy format enabled clearer and more meaningful comparisons.