Overview

This analysis examines arrival delay data for two airlines — Alaska Airlines and AM West — across five destination cities: Los Angeles, Phoenix, San Diego, San Francisco, and Seattle. The data comes from Numbersense by Kaiser Fung (McGraw Hill, 2013).


Step 1: Load the CSV Data

The CSV file uses a wide format that mirrors the original table, with one row per airline-status combination and one column per destination city. Empty cells appear in the airline name column where rows share the same airline.

# Read the wide-format CSV
df_wide <- read.csv("https://raw.githubusercontent.com/nanadanquah05-jpg/DATA-607/refs/heads/main/airline_delays.csv",
                    check.names = FALSE)

# If reading locally:
# df_wide <- read.csv("airline_delays.csv", check.names = FALSE)

# Display raw data
knitr::kable(df_wide, caption = "Raw Wide-Format Data")
Raw Wide-Format Data
Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61
# Alternatively, create dataframe directly from code
df_wide <- data.frame(
  Airline  = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Status   = c("on time", "delayed", "on time", "delayed"),
  `Los Angeles`   = c(497, 62, 694, 117),
  Phoenix         = c(221, 12, 4840, 415),
  `San Diego`     = c(212, 20, 383, 65),
  `San Francisco` = c(503, 102, 320, 129),
  Seattle         = c(1841, 305, 201, 61),
  check.names = FALSE
)

knitr::kable(df_wide, caption = "Wide-Format Airline Delay Data")
Wide-Format Airline Delay Data
Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61

Step 2: Handle Missing Data

In the original table, the airline name is only shown once per pair of rows (a common “merged cell” pattern). I fill in the missing (NA) airline names by carrying forward the last non-NA value.

# Simulate the blank airline cells as NA
df_wide_na <- df_wide
df_wide_na$Airline[c(2, 4)] <- NA

cat("Before fill:\n")
## Before fill:
print(df_wide_na)
##   Airline  Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2    <NA> delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4    <NA> delayed         117     415        65           129      61
# Fill down missing airline names
df_wide_filled <- df_wide_na %>%
  tidyr::fill(Airline, .direction = "down")

cat("\nAfter fill:\n")
## 
## After fill:
print(df_wide_filled)
##   Airline  Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

Step 3: Transform from Wide to Long Format

I use tidyr::pivot_longer() to convert from wide (one column per city) to long (one row per airline-status-city combination). This is the tidy format required for analysis.

df_long <- df_wide %>%
  pivot_longer(
    cols      = c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
    names_to  = "City",
    values_to = "Count"
  )

knitr::kable(head(df_long, 10), caption = "Long-Format Data (first 10 rows)")
Long-Format Data (first 10 rows)
Airline Status City Count
ALASKA on time Los Angeles 497
ALASKA on time Phoenix 221
ALASKA on time San Diego 212
ALASKA on time San Francisco 503
ALASKA on time Seattle 1841
ALASKA delayed Los Angeles 62
ALASKA delayed Phoenix 12
ALASKA delayed San Diego 20
ALASKA delayed San Francisco 102
ALASKA delayed Seattle 305

Step 4: Overall Delay Rate Comparison

I compute the percentage of delayed flights for each airline overall (across all cities).

overall <- df_long %>%
  group_by(Airline, Status) %>%
  summarise(Total = sum(Count), .groups = "drop") %>%
  group_by(Airline) %>%
  mutate(
    Grand_Total  = sum(Total),
    Pct          = round(100 * Total / Grand_Total, 1)
  ) %>%
  filter(Status == "delayed")

knitr::kable(overall %>% select(Airline, Delayed = Total, Total_Flights = Grand_Total, Delay_Pct = Pct),
             caption = "Overall Delay Rate by Airline")
Overall Delay Rate by Airline
Airline Delayed Total_Flights Delay_Pct
ALASKA 501 3775 13.3
AM WEST 787 7225 10.9
ggplot(overall, aes(x = Airline, y = Pct, fill = Airline)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = paste0(Pct, "%")), vjust = -0.5, size = 5) +
  scale_fill_manual(values = c("ALASKA" = "#0072B2", "AM WEST" = "#E69F00")) +
  labs(title = "Overall Delay Rate by Airline",
       x = NULL, y = "Delay Rate (%)") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none") +
  ylim(0, 20)

Finding: Overall, AM WEST has a lower delay rate (~11%) compared to Alaska Airlines (~13%). This might lead us to conclude AM WEST performs better — but city-level analysis tells a different story.


Step 5: City-by-City Delay Rate Comparison

city_rates <- df_long %>%
  group_by(Airline, City) %>%
  mutate(City_Total = sum(Count)) %>%
  filter(Status == "delayed") %>%
  mutate(Delay_Pct = round(100 * Count / City_Total, 1)) %>%
  select(Airline, City, Delayed = Count, City_Total, Delay_Pct)

knitr::kable(city_rates, caption = "Delay Rate by Airline and City")
Delay Rate by Airline and City
Airline City Delayed City_Total Delay_Pct
ALASKA Los Angeles 62 559 11.1
ALASKA Phoenix 12 233 5.2
ALASKA San Diego 20 232 8.6
ALASKA San Francisco 102 605 16.9
ALASKA Seattle 305 2146 14.2
AM WEST Los Angeles 117 811 14.4
AM WEST Phoenix 415 5255 7.9
AM WEST San Diego 65 448 14.5
AM WEST San Francisco 129 449 28.7
AM WEST Seattle 61 262 23.3
ggplot(city_rates, aes(x = City, y = Delay_Pct, fill = Airline)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = paste0(Delay_Pct, "%")),
            position = position_dodge(width = 0.9), vjust = -0.4, size = 3.5) +
  scale_fill_manual(values = c("ALASKA" = "#0072B2", "AM WEST" = "#E69F00")) +
  labs(title = "Delay Rate by Airline and City",
       x = "Destination City", y = "Delay Rate (%)") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

Finding: In every single city, Alaska Airlines has a lower delay rate than AM WEST. This is the opposite of the overall comparison!


Step 6: Explaining the Discrepancy (Simpson’s Paradox)

# Show volume of flights per city per airline
volume <- df_long %>%
  group_by(Airline, City) %>%
  summarise(Flights = sum(Count), .groups = "drop") %>%
  pivot_wider(names_from = Airline, values_from = Flights)

knitr::kable(volume, caption = "Total Flights per City per Airline")
Total Flights per City per Airline
City ALASKA AM WEST
Los Angeles 559 811
Phoenix 233 5255
San Diego 232 448
San Francisco 605 449
Seattle 2146 262

This reversal is a classic example of Simpson’s Paradox — a phenomenon where a trend that appears in aggregated data disappears or reverses when the data is broken into subgroups.

Why does it happen here?

Conclusion: AM WEST’s favorable overall delay rate is a statistical artifact of its route mix, not a genuine reflection of on-time performance. Alaska Airlines actually performs better at every individual destination. This example illustrates why it is critical to examine data at multiple levels of aggregation before drawing conclusions.