This analysis examines arrival delay data for two airlines — Alaska Airlines and AM West — across five destination cities: Los Angeles, Phoenix, San Diego, San Francisco, and Seattle. The data comes from Numbersense by Kaiser Fung (McGraw Hill, 2013).
The CSV file uses a wide format that mirrors the original table, with one row per airline-status combination and one column per destination city. Empty cells appear in the airline name column where rows share the same airline.
# Read the wide-format CSV
df_wide <- read.csv("https://raw.githubusercontent.com/nanadanquah05-jpg/DATA-607/refs/heads/main/airline_delays.csv",
check.names = FALSE)
# If reading locally:
# df_wide <- read.csv("airline_delays.csv", check.names = FALSE)
# Display raw data
knitr::kable(df_wide, caption = "Raw Wide-Format Data")
| Airline | Status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
# Alternatively, create dataframe directly from code
df_wide <- data.frame(
Airline = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
Status = c("on time", "delayed", "on time", "delayed"),
`Los Angeles` = c(497, 62, 694, 117),
Phoenix = c(221, 12, 4840, 415),
`San Diego` = c(212, 20, 383, 65),
`San Francisco` = c(503, 102, 320, 129),
Seattle = c(1841, 305, 201, 61),
check.names = FALSE
)
knitr::kable(df_wide, caption = "Wide-Format Airline Delay Data")
| Airline | Status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
In the original table, the airline name is only shown once per pair of rows (a common “merged cell” pattern). I fill in the missing (NA) airline names by carrying forward the last non-NA value.
# Simulate the blank airline cells as NA
df_wide_na <- df_wide
df_wide_na$Airline[c(2, 4)] <- NA
cat("Before fill:\n")
## Before fill:
print(df_wide_na)
## Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 <NA> delayed 117 415 65 129 61
# Fill down missing airline names
df_wide_filled <- df_wide_na %>%
tidyr::fill(Airline, .direction = "down")
cat("\nAfter fill:\n")
##
## After fill:
print(df_wide_filled)
## Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
I use tidyr::pivot_longer() to convert from wide (one
column per city) to long (one row per airline-status-city combination).
This is the tidy format required for analysis.
df_long <- df_wide %>%
pivot_longer(
cols = c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
names_to = "City",
values_to = "Count"
)
knitr::kable(head(df_long, 10), caption = "Long-Format Data (first 10 rows)")
| Airline | Status | City | Count |
|---|---|---|---|
| ALASKA | on time | Los Angeles | 497 |
| ALASKA | on time | Phoenix | 221 |
| ALASKA | on time | San Diego | 212 |
| ALASKA | on time | San Francisco | 503 |
| ALASKA | on time | Seattle | 1841 |
| ALASKA | delayed | Los Angeles | 62 |
| ALASKA | delayed | Phoenix | 12 |
| ALASKA | delayed | San Diego | 20 |
| ALASKA | delayed | San Francisco | 102 |
| ALASKA | delayed | Seattle | 305 |
I compute the percentage of delayed flights for each airline overall (across all cities).
overall <- df_long %>%
group_by(Airline, Status) %>%
summarise(Total = sum(Count), .groups = "drop") %>%
group_by(Airline) %>%
mutate(
Grand_Total = sum(Total),
Pct = round(100 * Total / Grand_Total, 1)
) %>%
filter(Status == "delayed")
knitr::kable(overall %>% select(Airline, Delayed = Total, Total_Flights = Grand_Total, Delay_Pct = Pct),
caption = "Overall Delay Rate by Airline")
| Airline | Delayed | Total_Flights | Delay_Pct |
|---|---|---|---|
| ALASKA | 501 | 3775 | 13.3 |
| AM WEST | 787 | 7225 | 10.9 |
ggplot(overall, aes(x = Airline, y = Pct, fill = Airline)) +
geom_col(width = 0.5) +
geom_text(aes(label = paste0(Pct, "%")), vjust = -0.5, size = 5) +
scale_fill_manual(values = c("ALASKA" = "#0072B2", "AM WEST" = "#E69F00")) +
labs(title = "Overall Delay Rate by Airline",
x = NULL, y = "Delay Rate (%)") +
theme_minimal(base_size = 13) +
theme(legend.position = "none") +
ylim(0, 20)
Finding: Overall, AM WEST has a lower delay rate (~11%) compared to Alaska Airlines (~13%). This might lead us to conclude AM WEST performs better — but city-level analysis tells a different story.
city_rates <- df_long %>%
group_by(Airline, City) %>%
mutate(City_Total = sum(Count)) %>%
filter(Status == "delayed") %>%
mutate(Delay_Pct = round(100 * Count / City_Total, 1)) %>%
select(Airline, City, Delayed = Count, City_Total, Delay_Pct)
knitr::kable(city_rates, caption = "Delay Rate by Airline and City")
| Airline | City | Delayed | City_Total | Delay_Pct |
|---|---|---|---|---|
| ALASKA | Los Angeles | 62 | 559 | 11.1 |
| ALASKA | Phoenix | 12 | 233 | 5.2 |
| ALASKA | San Diego | 20 | 232 | 8.6 |
| ALASKA | San Francisco | 102 | 605 | 16.9 |
| ALASKA | Seattle | 305 | 2146 | 14.2 |
| AM WEST | Los Angeles | 117 | 811 | 14.4 |
| AM WEST | Phoenix | 415 | 5255 | 7.9 |
| AM WEST | San Diego | 65 | 448 | 14.5 |
| AM WEST | San Francisco | 129 | 449 | 28.7 |
| AM WEST | Seattle | 61 | 262 | 23.3 |
ggplot(city_rates, aes(x = City, y = Delay_Pct, fill = Airline)) +
geom_col(position = "dodge") +
geom_text(aes(label = paste0(Delay_Pct, "%")),
position = position_dodge(width = 0.9), vjust = -0.4, size = 3.5) +
scale_fill_manual(values = c("ALASKA" = "#0072B2", "AM WEST" = "#E69F00")) +
labs(title = "Delay Rate by Airline and City",
x = "Destination City", y = "Delay Rate (%)") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
Finding: In every single city, Alaska Airlines has a lower delay rate than AM WEST. This is the opposite of the overall comparison!
# Show volume of flights per city per airline
volume <- df_long %>%
group_by(Airline, City) %>%
summarise(Flights = sum(Count), .groups = "drop") %>%
pivot_wider(names_from = Airline, values_from = Flights)
knitr::kable(volume, caption = "Total Flights per City per Airline")
| City | ALASKA | AM WEST |
|---|---|---|
| Los Angeles | 559 | 811 |
| Phoenix | 233 | 5255 |
| San Diego | 232 | 448 |
| San Francisco | 605 | 449 |
| Seattle | 2146 | 262 |
This reversal is a classic example of Simpson’s Paradox — a phenomenon where a trend that appears in aggregated data disappears or reverses when the data is broken into subgroups.
Why does it happen here?
Conclusion: AM WEST’s favorable overall delay rate is a statistical artifact of its route mix, not a genuine reflection of on-time performance. Alaska Airlines actually performs better at every individual destination. This example illustrates why it is critical to examine data at multiple levels of aggregation before drawing conclusions.