In this assignment, I worked with flight delay data for two airlines
across five cities.
The goal was to tidy the dataset, transform it into long format, and
compare on-time vs delayed flights both overall and by city.
I started by loading the tidyverse package, which
includes the tools for data manipulation and visualization.
library(tidyverse)
The assignment provided data in a table. Since I only had the raw numbers, I created the dataset manually in R.
delays <- data.frame(
Airline = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
Status = c("on time", "delayed", "on time", "delayed"),
Los.Angeles = c(497, 62, 694, 117),
Phoenix = c(221, 12, 4840, 415),
San.Diego = c(212, 20, 383, 65),
San.Francisco = c(503, 102, 320, 129),
Seattle = c(1841, 305, 201, 61)
)
delays
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
I verified whether there were missing values in the dataset. Since there were none, no imputation was needed.
sum(is.na(delays))
## [1] 0
To make analysis easier, I reshaped the dataset into long format using pivot_longer().
delays_long <- delays %>%
pivot_longer(cols = -c(Airline, Status),
names_to = "City",
values_to = "Count")
delays_long
## # A tibble: 20 × 4
## Airline Status City Count
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San.Diego 20
## 9 ALASKA delayed San.Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los.Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San.Diego 383
## 14 AM WEST on time San.Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los.Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San.Diego 65
## 19 AM WEST delayed San.Francisco 129
## 20 AM WEST delayed Seattle 61
Next, I calculated the overall percentages of on-time vs delayed flights for each airline.
overall <- delays_long %>%
group_by(Airline, Status) %>%
summarise(Total = sum(Count), .groups = "drop") %>%
mutate(Percent = round(100 * Total / sum(Total), 1))
overall
## # A tibble: 4 × 4
## Airline Status Total Percent
## <chr> <chr> <dbl> <dbl>
## 1 ALASKA delayed 501 4.6
## 2 ALASKA on time 3274 29.8
## 3 AM WEST delayed 787 7.2
## 4 AM WEST on time 6438 58.5
Explanation: Overall, Alaska Airlines had fewer delays in proportion compared to AM West. The imbalance is influenced by AM West’s large number of flights out of Phoenix.
I then compared percentages of delays across the five cities.
by_city <- delays_long %>%
group_by(Airline, City, Status) %>%
summarise(Total = sum(Count), .groups = "drop") %>%
group_by(City) %>%
mutate(Percent = round(100 * Total / sum(Total), 1))
by_city
## # A tibble: 20 × 5
## # Groups: City [5]
## Airline City Status Total Percent
## <chr> <chr> <chr> <dbl> <dbl>
## 1 ALASKA Los.Angeles delayed 62 4.5
## 2 ALASKA Los.Angeles on time 497 36.3
## 3 ALASKA Phoenix delayed 12 0.2
## 4 ALASKA Phoenix on time 221 4
## 5 ALASKA San.Diego delayed 20 2.9
## 6 ALASKA San.Diego on time 212 31.2
## 7 ALASKA San.Francisco delayed 102 9.7
## 8 ALASKA San.Francisco on time 503 47.7
## 9 ALASKA Seattle delayed 305 12.7
## 10 ALASKA Seattle on time 1841 76.5
## 11 AM WEST Los.Angeles delayed 117 8.5
## 12 AM WEST Los.Angeles on time 694 50.7
## 13 AM WEST Phoenix delayed 415 7.6
## 14 AM WEST Phoenix on time 4840 88.2
## 15 AM WEST San.Diego delayed 65 9.6
## 16 AM WEST San.Diego on time 383 56.3
## 17 AM WEST San.Francisco delayed 129 12.2
## 18 AM WEST San.Francisco on time 320 30.4
## 19 AM WEST Seattle delayed 61 2.5
## 20 AM WEST Seattle on time 201 8.3
Explanation: When broken down by city, Alaska performed better in Phoenix, but worse in Seattle. AM West looked worse overall because of its very high delay counts in Phoenix.
Comparing the overall results to the city-by-city breakdown reveals a discrepancy.
City-by-city: Alaska sometimes performs worse (Seattle).
Overall: AM West looks much worse. This happens because Phoenix has so many AM West flights that its delay rate dominates the overall average. This is an example of Simpson’s paradox, where aggregated data can give a misleading impression.
Through this analysis, I tidied the dataset, transformed it into a long format, and compared delay rates across airlines and cities. The key insight is that while AM West appears worse overall, Alaska struggles in certain cities, showing the importance of looking at both aggregated and disaggregated data.