library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(tidyr)
flights <- read.csv("flights.csv")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'flights.csv'
head(flights)
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 3 AM West on time 694 4840 383 320 201
## 4 AM West delayed 117 415 65 129 61
str(flights)
## 'data.frame': 4 obs. of 7 variables:
## $ Airline : chr "Alaska" "Alaska" "AM West" "AM West"
## $ Status : chr "on time" "delayed" "on time" "delayed"
## $ Los.Angeles : int 497 62 694 117
## $ Phoenix : int 221 12 4840 415
## $ San.Diego : int 212 20 383 65
## $ San.Francisco: int 503 102 320 129
## $ Seattle : int 1841 305 201 61
This dataset contains arrival delay information for two airlines, Alaska and AM West, across five destinations: Los Angeles, Phoenix, San Diego, San Francisco, and Seattle. The goal is to compare their delay rates and determine which airline actually performs better.
flights_long <- pivot_longer(flights,
cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
names_to = "Destination",
values_to = "Count" )
head(flights_long)
## # A tibble: 6 × 4
## Airline Status Destination Count
## <chr> <chr> <chr> <int>
## 1 Alaska on time Los.Angeles 497
## 2 Alaska on time Phoenix 221
## 3 Alaska on time San.Diego 212
## 4 Alaska on time San.Francisco 503
## 5 Alaska on time Seattle 1841
## 6 Alaska delayed Los.Angeles 62
delay_rates <- flights_long %>%
group_by(Airline, Destination)%>%
summarize(
total = sum(Count),
delayed = sum(Count[Status == "delayed"])
) %>%
mutate(delay_rate = delayed / total)
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by Airline and Destination.
## ℹ Output is grouped by Airline.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(Airline, Destination))` for per-operation grouping
## (`?dplyr::dplyr_by`) instead.
The data came in wide format with cities spread across column
headers, which made analysis difficult. I used
pivot_longer() from the tidyr package to collapse the five
city columns into two new columns, Destination and Count, making the
data tidy and ready for analysis.
head(delay_rates)
## # A tibble: 6 × 5
## # Groups: Airline [2]
## Airline Destination total delayed delay_rate
## <chr> <chr> <int> <int> <dbl>
## 1 AM West Los.Angeles 811 117 0.144
## 2 AM West Phoenix 5255 415 0.0790
## 3 AM West San.Diego 448 65 0.145
## 4 AM West San.Francisco 449 129 0.287
## 5 AM West Seattle 262 61 0.233
## 6 Alaska Los.Angeles 559 62 0.111
print(delay_rates)
## # A tibble: 10 × 5
## # Groups: Airline [2]
## Airline Destination total delayed delay_rate
## <chr> <chr> <int> <int> <dbl>
## 1 AM West Los.Angeles 811 117 0.144
## 2 AM West Phoenix 5255 415 0.0790
## 3 AM West San.Diego 448 65 0.145
## 4 AM West San.Francisco 449 129 0.287
## 5 AM West Seattle 262 61 0.233
## 6 Alaska Los.Angeles 559 62 0.111
## 7 Alaska Phoenix 233 12 0.0515
## 8 Alaska San.Diego 232 20 0.0862
## 9 Alaska San.Francisco 605 102 0.169
## 10 Alaska Seattle 2146 305 0.142
overall_rates <- flights_long %>%
group_by(Airline) %>%
summarize(
total = sum(Count),
delayed = sum(Count[Status == "delayed"])
) %>%
mutate(delay_rate = delayed / total)
overall_rates <- flights_long %>%
group_by(Airline) %>%
summarize(total = sum(Count), delayed = sum(Count[Status == "delayed"])) %>%
mutate(delay_rate = delayed / total)
print(overall_rates)
## # A tibble: 2 × 4
## Airline total delayed delay_rate
## <chr> <int> <int> <dbl>
## 1 AM West 7225 787 0.109
## 2 Alaska 3775 501 0.133
When broken down by destination, Alaska had a lower delay rate at every single city. However, looking at the overall numbers, AM West appeared to perform better with a 10.9% delay rate compared to Alaska’s 13.3%.
This contradiction is known as Simpson’s Paradox. AM West looks better overall because the majority of their flights go to Phoenix, a low delay destination, which skews their aggregate rate downward. When comparing city by city, Alaska is actually the better performing airline and would be the recommended choice for travelers prioritizing on time arrivals.