In this assignment, I worked with flight delay data for two airlines across five cities. The goal was to tidy the dataset, transform it into long format, and compare on-time vs delayed flights both overall and by city.
I started by loading the tidyverse package, which
includes the tools for data manipulation and visualization.
library(tidyverse)
The assignment provided data in a table. Since I only had the raw numbers, I created the dataset manually in R.
delays <- data.frame(
Airline = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
Status = c("on time", "delayed", "on time", "delayed"),
Los.Angeles = c(497, 62, 694, 117),
Phoenix = c(221, 12, 4840, 415),
San.Diego = c(212, 20, 383, 65),
San.Francisco = c(503, 102, 320, 129),
Seattle = c(1841, 305, 201, 61)
)
delays
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
I verified whether there were missing values in the dataset. Since there were none, no imputation was needed.
sum(is.na(delays))
## [1] 0
To make analysis easier, I reshaped the dataset into long format using pivot_longer().
delays_long <- delays %>%
pivot_longer(cols = -c(Airline, Status),
names_to = "City",
values_to = "Count")
delays_long
## # A tibble: 20 × 4
## Airline Status City Count
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San.Diego 20
## 9 ALASKA delayed San.Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los.Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San.Diego 383
## 14 AM WEST on time San.Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los.Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San.Diego 65
## 19 AM WEST delayed San.Francisco 129
## 20 AM WEST delayed Seattle 61
Next, I calculated the overall percentages of on-time vs delayed flights for each airline.
overall <- delays_long %>%
group_by(Airline, Status) %>%
summarise(Total = sum(Count), .groups = "drop") %>%
group_by(Airline) %>%
mutate(Percent = round(100 * Total / sum(Total), 1))
overall
## # A tibble: 4 × 4
## # Groups: Airline [2]
## Airline Status Total Percent
## <chr> <chr> <dbl> <dbl>
## 1 ALASKA delayed 501 13.3
## 2 ALASKA on time 3274 86.7
## 3 AM WEST delayed 787 10.9
## 4 AM WEST on time 6438 89.1
Observation: The overall percentages indicate that Alaska experiences a higher proportion of delays (13.3%) compared to AM West (10.9%). Based on these totals alone, Alaska appears to perform worse overall.
I then compared percentages of delays across the five cities.
by_city <- delays_long %>%
group_by(Airline, City, Status) %>%
summarise(Total = sum(Count), .groups = "drop") %>%
group_by(Airline, City) %>%
mutate(Percent = round(100 * Total / sum(Total), 1))
by_city
## # A tibble: 20 × 5
## # Groups: Airline, City [10]
## Airline City Status Total Percent
## <chr> <chr> <chr> <dbl> <dbl>
## 1 ALASKA Los.Angeles delayed 62 11.1
## 2 ALASKA Los.Angeles on time 497 88.9
## 3 ALASKA Phoenix delayed 12 5.2
## 4 ALASKA Phoenix on time 221 94.8
## 5 ALASKA San.Diego delayed 20 8.6
## 6 ALASKA San.Diego on time 212 91.4
## 7 ALASKA San.Francisco delayed 102 16.9
## 8 ALASKA San.Francisco on time 503 83.1
## 9 ALASKA Seattle delayed 305 14.2
## 10 ALASKA Seattle on time 1841 85.8
## 11 AM WEST Los.Angeles delayed 117 14.4
## 12 AM WEST Los.Angeles on time 694 85.6
## 13 AM WEST Phoenix delayed 415 7.9
## 14 AM WEST Phoenix on time 4840 92.1
## 15 AM WEST San.Diego delayed 65 14.5
## 16 AM WEST San.Diego on time 383 85.5
## 17 AM WEST San.Francisco delayed 129 28.7
## 18 AM WEST San.Francisco on time 320 71.3
## 19 AM WEST Seattle delayed 61 23.3
## 20 AM WEST Seattle on time 201 76.7
Observation: When the data is broken down by city, AM West has a higher delay percentage in every single city.
The contrast between the city-by-city results and the overall totals reveals a clear example of Simpson’s paradox.
City-by-city: AM West performs worse in every city.
Overall: Alaska appears worse. This happens because AM West has an extremely large number of flights in Phoenix, and Phoenix has a relatively low delay percentage. Because Phoenix dominates the total number of flights, it pulls AM West’s overall delay percentage down, even though AM West is worse in each individual city.
Through this analysis, I tidied the dataset, reshaped it into long format, and compared delay rates across both airlines and cities. The key insight is that AM West performs worse in every city, yet Alaska appears worse overall due to Simpson’s paradox. This demonstrates the importance of examining both aggregated and disaggregated data to avoid misleading conclusions.