library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(tidyr)
flights <- read.csv("flights.csv")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'flights.csv'
head(flights)
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2  Alaska delayed          62      12        20           102     305
## 3 AM West on time         694    4840       383           320     201
## 4 AM West delayed         117     415        65           129      61
str(flights)
## 'data.frame':    4 obs. of  7 variables:
##  $ Airline      : chr  "Alaska" "Alaska" "AM West" "AM West"
##  $ Status       : chr  "on time" "delayed" "on time" "delayed"
##  $ Los.Angeles  : int  497 62 694 117
##  $ Phoenix      : int  221 12 4840 415
##  $ San.Diego    : int  212 20 383 65
##  $ San.Francisco: int  503 102 320 129
##  $ Seattle      : int  1841 305 201 61

Introduction

This dataset contains arrival delay information for two airlines, Alaska and AM West, across five destinations: Los Angeles, Phoenix, San Diego, San Francisco, and Seattle. The goal is to compare their delay rates and determine which airline actually performs better.

flights_long <- pivot_longer(flights, 
                             cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
                             names_to = "Destination",
                             values_to = "Count" )
head(flights_long)
## # A tibble: 6 × 4
##   Airline Status  Destination   Count
##   <chr>   <chr>   <chr>         <int>
## 1 Alaska  on time Los.Angeles     497
## 2 Alaska  on time Phoenix         221
## 3 Alaska  on time San.Diego       212
## 4 Alaska  on time San.Francisco   503
## 5 Alaska  on time Seattle        1841
## 6 Alaska  delayed Los.Angeles      62
delay_rates <- flights_long %>%
  group_by(Airline, Destination)%>%
  summarize(
    total = sum(Count),
    delayed = sum(Count[Status == "delayed"])
  ) %>%
  mutate(delay_rate = delayed / total)
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by Airline and Destination.
## ℹ Output is grouped by Airline.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(Airline, Destination))` for per-operation grouping
##   (`?dplyr::dplyr_by`) instead.

Data Cleanup

The data came in wide format with cities spread across column headers, which made analysis difficult. I used pivot_longer() from the tidyr package to collapse the five city columns into two new columns, Destination and Count, making the data tidy and ready for analysis.

head(delay_rates)
## # A tibble: 6 × 5
## # Groups:   Airline [2]
##   Airline Destination   total delayed delay_rate
##   <chr>   <chr>         <int>   <int>      <dbl>
## 1 AM West Los.Angeles     811     117     0.144 
## 2 AM West Phoenix        5255     415     0.0790
## 3 AM West San.Diego       448      65     0.145 
## 4 AM West San.Francisco   449     129     0.287 
## 5 AM West Seattle         262      61     0.233 
## 6 Alaska  Los.Angeles     559      62     0.111
print(delay_rates)
## # A tibble: 10 × 5
## # Groups:   Airline [2]
##    Airline Destination   total delayed delay_rate
##    <chr>   <chr>         <int>   <int>      <dbl>
##  1 AM West Los.Angeles     811     117     0.144 
##  2 AM West Phoenix        5255     415     0.0790
##  3 AM West San.Diego       448      65     0.145 
##  4 AM West San.Francisco   449     129     0.287 
##  5 AM West Seattle         262      61     0.233 
##  6 Alaska  Los.Angeles     559      62     0.111 
##  7 Alaska  Phoenix         233      12     0.0515
##  8 Alaska  San.Diego       232      20     0.0862
##  9 Alaska  San.Francisco   605     102     0.169 
## 10 Alaska  Seattle        2146     305     0.142
overall_rates <- flights_long %>%
  group_by(Airline) %>%
  summarize(
    total = sum(Count),
    delayed = sum(Count[Status == "delayed"])
  ) %>%
  mutate(delay_rate = delayed / total)
overall_rates <- flights_long %>%
  group_by(Airline) %>%
  summarize(total = sum(Count), delayed = sum(Count[Status == "delayed"])) %>%
  mutate(delay_rate = delayed / total)
print(overall_rates)
## # A tibble: 2 × 4
##   Airline total delayed delay_rate
##   <chr>   <int>   <int>      <dbl>
## 1 AM West  7225     787      0.109
## 2 Alaska   3775     501      0.133

Analysis

When broken down by destination, Alaska had a lower delay rate at every single city. However, looking at the overall numbers, AM West appeared to perform better with a 10.9% delay rate compared to Alaska’s 13.3%.

Conclusion

This contradiction is known as Simpson’s Paradox. AM West looks better overall because the majority of their flights go to Phoenix, a low delay destination, which skews their aggregate rate downward. When comparing city by city, Alaska is actually the better performing airline and would be the recommended choice for travelers prioritizing on time arrivals.