Load libraries:
Load the data and take a look at how I represented the PDF file:
##   airline  status los_angeles phoenix san_diego san_francisco seattle
## 1  Alaska on time         497     221       212           503    1841
## 2  Alaska delayed          62      12        20           102     305
## 3 AM West on time         694    4840       383           320     201
## 4 AM West delayed         117     415        65           129      61
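The loading code itself is not shown here. As a minimal sketch, the same table can be rebuilt inline from the printed values. The name `df` matches the code that follows, but how the original was actually read in (for example, from a CSV file) is an assumption left open.

```r
# Rebuild the wide table from the values printed above.
# Assumption: the original was loaded from a file; this inline version is only a stand-in.
df <- data.frame(
  airline       = c("Alaska", "Alaska", "AM West", "AM West"),
  status        = c("on time", "delayed", "on time", "delayed"),
  los_angeles   = c(497L, 62L, 694L, 117L),
  phoenix       = c(221L, 12L, 4840L, 415L),
  san_diego     = c(212L, 20L, 383L, 65L),
  san_francisco = c(503L, 102L, 320L, 129L),
  seattle       = c(1841L, 305L, 201L, 61L),
  stringsAsFactors = FALSE
)
df
```

Keeping one column per airport mirrors the layout of the source table, which is what makes the reshaping step necessary.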
Now let’s convert to long format:
library(tidyr)
library(dplyr)  # provides select(), arrange(), group_by() and the join used below
df_long <- gather(df, airport, qty_flights, los_angeles:seattle) %>%
  select(airline, airport, status, qty_flights) %>%
  arrange(airline, airport, status)
head(df_long)
##   airline     airport  status qty_flights
## 1  Alaska los_angeles delayed          62
## 2  Alaska los_angeles on time         497
## 3  Alaska     phoenix delayed          12
## 4  Alaska     phoenix on time         221
## 5  Alaska   san_diego delayed          20
## 6  Alaska   san_diego on time         212
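In current versions of tidyr, `gather()` is superseded by `pivot_longer()`. The sketch below shows what the same reshape might look like with the newer function. The inline `df` is a stand-in copied from the printed table, and `df_long_alt` is a hypothetical name chosen so it does not clash with `df_long`.

```r
library(tidyr)
library(dplyr)

# Stand-in for the wide table (values copied from the printed output above).
df <- data.frame(
  airline       = c("Alaska", "Alaska", "AM West", "AM West"),
  status        = c("on time", "delayed", "on time", "delayed"),
  los_angeles   = c(497, 62, 694, 117),
  phoenix       = c(221, 12, 4840, 415),
  san_diego     = c(212, 20, 383, 65),
  san_francisco = c(503, 102, 320, 129),
  seattle       = c(1841, 305, 201, 61),
  stringsAsFactors = FALSE
)

# names_to receives the old column names (the airports),
# values_to receives the cell values (the flight counts).
df_long_alt <- df %>%
  pivot_longer(los_angeles:seattle,
               names_to = "airport",
               values_to = "qty_flights") %>%
  select(airline, airport, status, qty_flights) %>%
  arrange(airline, airport, status)

head(df_long_alt)
```

Row order after `arrange()` can depend on the dplyr version and locale, so the first rows may not match the output shown above exactly.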
The primary variable of analysis is the percentage of flights delayed. To compute it, first total the flights for each airline and airport:
total_flights <- df_long %>%
  group_by(airline, airport) %>%
  summarize(total = sum(qty_flights))
total_flights
## # A tibble: 10 x 3
## # Groups:   airline [?]
##    airline airport       total
##    <chr>   <chr>         <int>
##  1 Alaska  los_angeles     559
##  2 Alaska  phoenix         233
##  3 Alaska  san_diego       232
##  4 Alaska  san_francisco   605
##  5 Alaska  seattle        2146
##  6 AM West los_angeles     811
##  7 AM West phoenix        5255
##  8 AM West san_diego       448
##  9 AM West san_francisco   449
## 10 AM West seattle         262
Join total_flights to df_long, compute the proportion of flights in each status, and keep only the delayed rows:
df_long <- df_long %>%
  inner_join(total_flights, by = c('airline', 'airport')) %>%
  mutate(delays = qty_flights / total) %>%
  filter(status == 'delayed')
Use a ggplot2 heatmap to summarize the differences in delay rates:
library(ggplot2)
ggplot(df_long, aes(airline, airport)) +
  geom_tile(aes(fill = delays), color = 'white') +
  scale_fill_gradient(low = 'white', high = 'red')
We see that delay rates differ between cities and airlines. Overall, Alaska's flights are delayed 13.27 percent of the time and AM West's about 10.89 percent of the time. This is interesting because the graph tells a different story: Alaska has a lower delay rate than AM West at every airport, even though its overall rate is worse. This pattern is known as Simpson's paradox. AM West's total is dominated by Phoenix (5,255 of its 7,225 flights), where delays are rare, while Alaska's is dominated by Seattle (2,146 of its 3,775 flights), where delays are more common. A lesson in the difference between aggregate and 'atomic' statistics, I suppose.
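The aggregate and per-airport figures can be checked directly from the counts in the wide table. The following base R sketch is only an illustration: the list names `delayed`, `on_time`, `overall` and `per_airport` are hypothetical and not part of the analysis above.

```r
# Flight counts copied from the wide table
# (order: los_angeles, phoenix, san_diego, san_francisco, seattle).
delayed <- list(
  Alaska  = c(62, 12, 20, 102, 305),
  AM_West = c(117, 415, 65, 129, 61)
)
on_time <- list(
  Alaska  = c(497, 221, 212, 503, 1841),
  AM_West = c(694, 4840, 383, 320, 201)
)

# Overall delay rate per airline: total delayed / total flights.
overall <- sapply(names(delayed), function(a) {
  sum(delayed[[a]]) / (sum(delayed[[a]]) + sum(on_time[[a]]))
})
round(100 * overall, 2)
# Alaska 13.27, AM_West 10.89

# Per-airport delay rates: one column per airline, one row per airport.
per_airport <- sapply(names(delayed), function(a) {
  delayed[[a]] / (delayed[[a]] + on_time[[a]])
})

# Alaska has the lower rate at every single airport.
all(per_airport[, "Alaska"] < per_airport[, "AM_West"])
# TRUE
```

The two results together reproduce the reversal: the airline with the better rate at each airport has the worse rate once the airports are pooled.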