5A Airline Delays

Author

Madina Kudanova

Introduction

This assignment analyzes arrival delay data for two airlines across five cities. The data is originally presented in a wide summary format, so it must first be recreated and transformed into a tidy structure in R. Once properly structured, delay rates will be calculated using percentages rather than raw counts to allow for meaningful comparison. The analysis will compare overall airline performance as well as performance within each city, highlighting how aggregated results can differ from grouped comparisons.

Approach

The first thing I will do for this assignment is recreate the airline arrival data from the provided chart as a CSV file in the same wide format shown in the source table. After that, I will read the CSV file into R and use tidyr and dplyr to transform the data from wide to long format so that each row represents one airline, one city, one status (on time or delayed), and its corresponding count.

Next, I will calculate delay percentages by dividing the number of delayed flights by the total number of flights for each airline. I will first compare the overall delay percentages between the two airlines across all cities combined. Then, I will compare delay percentages city-by-city to examine differences within each destination.

Finally, I will describe the discrepancy between the overall comparison and the city-level comparison, explaining how differences in the distribution of flights across cities can lead to different conclusions when the data is aggregated versus when it is grouped.

Code Base

Creating, saving, loading, and checking CSV data file

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Create the airline delays dataset in wide format
airline_delays_wide <- data.frame(
  Airline = c("Alaska", "Alaska", "AM West", "AM West"),
  Status = c("on time", "delayed", "on time", "delayed"),
  Los_Angeles = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415),
  San_Diego = c(212, 20, 383, 65),
  San_Francisco = c(503, 102, 320, 129),
  Seattle = c(1841, 305, 201, 61)
)

# Preview the dataset
airline_delays_wide
  Airline  Status Los_Angeles Phoenix San_Diego San_Francisco Seattle
1  Alaska on time         497     221       212           503    1841
2  Alaska delayed          62      12        20           102     305
3 AM West on time         694    4840       383           320     201
4 AM West delayed         117     415        65           129      61
# Export dataset to CSV file (wide format as required)
write.csv(airline_delays_wide,
          "airline_delays_wide.csv",
          row.names = FALSE)
# Save wide dataset as CSV in working directory
write_csv(airline_delays_wide, "airline_delays_wide.csv")
# Read it from the GitHub repository
airline_wide <- read_csv("https://raw.githubusercontent.com/MKudanova/Data607/refs/heads/main/5A/airline_delays_wide.csv",
                         show_col_types = FALSE)
glimpse(airline_wide)
Rows: 4
Columns: 7
$ Airline       <chr> "Alaska", "Alaska", "AM West", "AM West"
$ Status        <chr> "on time", "delayed", "on time", "delayed"
$ Los_Angeles   <dbl> 497, 62, 694, 117
$ Phoenix       <dbl> 221, 12, 4840, 415
$ San_Diego     <dbl> 212, 20, 383, 65
$ San_Francisco <dbl> 503, 102, 320, 129
$ Seattle       <dbl> 1841, 305, 201, 61
# Check for missing values
colSums(is.na(airline_wide))
      Airline        Status   Los_Angeles       Phoenix     San_Diego 
            0             0             0             0             0 
San_Francisco       Seattle 
            0             0 

Transforming data from wide to long format.

# Wide to long 
airline_long <- airline_wide %>%
  pivot_longer(
    cols = -c(Airline, Status),
    names_to = "City",
    values_to = "Count"
  )

airline_long
# A tibble: 20 × 4
   Airline Status  City          Count
   <chr>   <chr>   <chr>         <dbl>
 1 Alaska  on time Los_Angeles     497
 2 Alaska  on time Phoenix         221
 3 Alaska  on time San_Diego       212
 4 Alaska  on time San_Francisco   503
 5 Alaska  on time Seattle        1841
 6 Alaska  delayed Los_Angeles      62
 7 Alaska  delayed Phoenix          12
 8 Alaska  delayed San_Diego        20
 9 Alaska  delayed San_Francisco   102
10 Alaska  delayed Seattle         305
11 AM West on time Los_Angeles     694
12 AM West on time Phoenix        4840
13 AM West on time San_Diego       383
14 AM West on time San_Francisco   320
15 AM West on time Seattle         201
16 AM West delayed Los_Angeles     117
17 AM West delayed Phoenix         415
18 AM West delayed San_Diego        65
19 AM West delayed San_Francisco   129
20 AM West delayed Seattle          61

Overall Delay Percentages by Airline

# Compute overall totals and delay percentages per airline
overall_results <- airline_long %>%
  group_by(Airline, Status) %>%
  summarise(Total = sum(Count), .groups = "drop") %>%
  pivot_wider(names_from = Status, values_from = Total) %>%
  mutate(
    Total_Flights = `on time` + delayed,
    Delay_Percent = (delayed / Total_Flights) * 100
  ) %>%
  select(Airline, Total_Flights, delayed, Delay_Percent)

overall_results
# A tibble: 2 × 4
  Airline Total_Flights delayed Delay_Percent
  <chr>           <dbl>   <dbl>         <dbl>
1 AM West          7225     787          10.9
2 Alaska           3775     501          13.3

Overall totals and delay percentages per airline: Summary

When comparing overall delay percentages across all five cities combined, AM West has a lower delay rate of 10.89% compared to Alaska’s 13.27%. Based on the aggregated results, AM West appears to perform better overall in terms of on-time arrivals. However, this comparison reflects total flights across all destinations rather than performance within individual cities.

City-Level Delay Percentages by Airline

# Compute delay percentages per airline within each city
city_results <- airline_long %>%
  group_by(Airline, City, Status) %>%
  summarise(Total = sum(Count), .groups = "drop") %>%
  pivot_wider(names_from = Status, values_from = Total) %>%
  mutate(
    Total_Flights = `on time` + delayed,
    Delay_Percent = (delayed / Total_Flights) * 100
  ) %>%
  select(Airline, City, Total_Flights, delayed, Delay_Percent)

city_results
# A tibble: 10 × 5
   Airline City          Total_Flights delayed Delay_Percent
   <chr>   <chr>                 <dbl>   <dbl>         <dbl>
 1 AM West Los_Angeles             811     117         14.4 
 2 AM West Phoenix                5255     415          7.90
 3 AM West San_Diego               448      65         14.5 
 4 AM West San_Francisco           449     129         28.7 
 5 AM West Seattle                 262      61         23.3 
 6 Alaska  Los_Angeles             559      62         11.1 
 7 Alaska  Phoenix                 233      12          5.15
 8 Alaska  San_Diego               232      20          8.62
 9 Alaska  San_Francisco           605     102         16.9 
10 Alaska  Seattle                2146     305         14.2 

City-Level Delay Percentages: Summary

The city-level delay percentage calculations show noticeable variation across destinations for both airlines. For AM West, delay rates range from 7.90% in Phoenix to 28.73% in San Francisco, with relatively higher delays in San Francisco and Seattle. For Alaska, delay rates range from 5.15% in Phoenix to 16.86% in San Francisco, with Phoenix showing the lowest delay rate and San Francisco and Seattle showing comparatively higher rates. Overall, delay percentages differ substantially by city, indicating that destination plays an important role in flight performance.

Discrepancy Between Overall and City-Level Comparisons

The overall comparison shows that AM West has a lower total delay percentage than Alaska when all cities are combined. However, the city-level analysis shows that Alaska has a lower delay rate in each of the five individual cities. In other words, the airline that appears to perform better overall does not perform better within any single city.

Explanation of the Discrepancy

This discrepancy occurs because the overall delay percentage is influenced by the distribution of flights across cities. AM West operates a large number of flights in Phoenix, where delay rates are relatively low, which reduces its overall delay percentage. In contrast, Alaska operates a substantial number of flights in Seattle, where delay rates are higher, increasing its overall percentage. As a result, the aggregated comparison reflects differences in flight volume across cities rather than consistent performance advantages within each destination.

Conclusion

This analysis shows that overall airline performance can differ from city-level results due to differences in flight distribution. Although AM West appears to have a lower overall delay rate, Alaska performs better within each individual city. The discrepancy highlights how aggregated data can mask underlying patterns. Transforming the dataset into tidy format enabled clearer and more meaningful comparisons.