DATA 607—Homework No. 5

Load libraries:

library(dplyr)

Load the data and take a look at how I represented the table from the PDF file:

df <- read.csv('flights.csv', stringsAsFactors=FALSE)
head(df)

##   airline  status los_angeles phoenix san_diego san_francisco seattle
## 1  Alaska on time         497     221       212           503    1841
## 2  Alaska delayed          62      12        20           102     305
## 3 AM West on time         694    4840       383           320     201
## 4 AM West delayed         117     415        65           129      61
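Since flights.csv itself is not shown here, the same data frame can be rebuilt from the printed output. This is only a sketch: it assumes the file contains exactly the four rows and seven columns displayed above, with the values copied from that output.

```r
# Rebuild the wide table from the head(df) output above.
# Assumption: flights.csv holds exactly these four rows.
df <- data.frame(
    airline       = c("Alaska", "Alaska", "AM West", "AM West"),
    status        = c("on time", "delayed", "on time", "delayed"),
    los_angeles   = c(497L, 62L, 694L, 117L),
    phoenix       = c(221L, 12L, 4840L, 415L),
    san_diego     = c(212L, 20L, 383L, 65L),
    san_francisco = c(503L, 102L, 320L, 129L),
    seattle       = c(1841L, 305L, 201L, 61L),
    stringsAsFactors = FALSE
)
str(df)
```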

Now let’s convert to long format:

library(tidyr)
df_long <- gather(df, airport, qty_flights, los_angeles:seattle) %>%
    select(airline, airport, status, qty_flights) %>%
    arrange(airline, airport, status)
head(df_long)

##   airline     airport  status qty_flights
## 1  Alaska los_angeles delayed          62
## 2  Alaska los_angeles on time         497
## 3  Alaska     phoenix delayed          12
## 4  Alaska     phoenix on time         221
## 5  Alaska   san_diego delayed          20
## 6  Alaska   san_diego on time         212
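For reference, newer versions of tidyr (1.0.0 and later) offer pivot_longer() as the successor to gather(). The sketch below should give the same long table, assuming the data frame matches the printed output above; the name df_long_alt is my own and is not used elsewhere in this homework.

```r
library(dplyr)
library(tidyr)

# Wide table copied from the printed output (assumed to match flights.csv).
df <- data.frame(
    airline       = c("Alaska", "Alaska", "AM West", "AM West"),
    status        = c("on time", "delayed", "on time", "delayed"),
    los_angeles   = c(497L, 62L, 694L, 117L),
    phoenix       = c(221L, 12L, 4840L, 415L),
    san_diego     = c(212L, 20L, 383L, 65L),
    san_francisco = c(503L, 102L, 320L, 129L),
    seattle       = c(1841L, 305L, 201L, 61L),
    stringsAsFactors = FALSE
)

# One row per airline, airport, and status; the city columns become
# values of `airport` and their counts go into `qty_flights`.
df_long_alt <- df %>%
    pivot_longer(cols = los_angeles:seattle,
                 names_to = "airport",
                 values_to = "qty_flights") %>%
    select(airline, airport, status, qty_flights) %>%
    arrange(airline, airport, status)
head(df_long_alt)
```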

The primary variable of analysis is the percentage of flights delayed. To get there, first compute the total number of flights for each airline and airport:

total_flights <- df_long %>%
    group_by(airline, airport) %>%
    summarize(total=sum(qty_flights))
total_flights

## # A tibble: 10 x 3
## # Groups:   airline [?]
##    airline airport       total
##    <chr>   <chr>         <int>
##  1 Alaska  los_angeles     559
##  2 Alaska  phoenix         233
##  3 Alaska  san_diego       232
##  4 Alaska  san_francisco   605
##  5 Alaska  seattle        2146
##  6 AM West los_angeles     811
##  7 AM West phoenix        5255
##  8 AM West san_diego       448
##  9 AM West san_francisco   449
## 10 AM West seattle         262

Join total_flights to our original df_long, calculate the proportion of flights in each status, and keep only the delayed rows:

df_long <- df_long %>%
    inner_join(total_flights, by=c('airline', 'airport')) %>%
    mutate(delays=qty_flights / total) %>%
    filter(status == 'delayed')
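The same proportions could probably also be computed without a separate totals table, using a grouped mutate(). This is an alternative sketch rather than the approach used above; the long table is rebuilt by hand from the printed output, and delay_rates is a name I made up for the example.

```r
library(dplyr)

# Long table rebuilt from the printed output (assumed to match the real data).
airports <- c("los_angeles", "phoenix", "san_diego", "san_francisco", "seattle")
df_long <- data.frame(
    airline     = rep(c("Alaska", "AM West"), each = 10),
    airport     = rep(rep(airports, each = 2), times = 2),
    status      = rep(c("delayed", "on time"), times = 10),
    qty_flights = c(62, 497, 12, 221, 20, 212, 102, 503, 305, 1841,
                    117, 694, 415, 4840, 65, 383, 129, 320, 61, 201),
    stringsAsFactors = FALSE
)

# Within each airline/airport pair, divide each count by the pair's total,
# then keep only the delayed rows.
delay_rates <- df_long %>%
    group_by(airline, airport) %>%
    mutate(total = sum(qty_flights),
           delays = qty_flights / total) %>%
    ungroup() %>%
    filter(status == "delayed")
delay_rates
```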

Use ggplot2 to draw a heatmap summarizing the differences in delay rates:

library(ggplot2)
ggplot(df_long, aes(airline, airport)) + 
    geom_tile(aes(fill=delays), color='white') +
    scale_fill_gradient(low='white', high='red')

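We see that delay rates differ between cities and between airlines. Overall, Alaska has delays 13.27 percent of the time, and AM West about 10.89 percent of the time. This is interesting because the graph tells a different story: Alaska has a lower delay rate than AM West at every airport, even though its overall rate is worse. The reversal comes from how the flights are distributed. AM West's flights are concentrated in Phoenix (5,255 of 7,225), where delays are relatively rare, while Alaska's are concentrated in Seattle (2,146 of 3,775), where delays are more common. This is an instance of Simpson's paradox, and a lesson in the difference between aggregate and ‘atomic’ statistics, I suppose.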
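The overall figures quoted above are not computed in the code shown, so here is a minimal base R sketch of how they could be reproduced and compared with the per-airport rates. It assumes the counts printed earlier are the full data set; the object names (counts, overall, by_airport, rate_by_airport) are mine.

```r
# Long table rebuilt from the printed output (assumed to be the full data).
airports <- c("los_angeles", "phoenix", "san_diego", "san_francisco", "seattle")
df_long <- data.frame(
    airline     = rep(c("Alaska", "AM West"), each = 10),
    airport     = rep(rep(airports, each = 2), times = 2),
    status      = rep(c("delayed", "on time"), times = 10),
    qty_flights = c(62, 497, 12, 221, 20, 212, 102, 503, 305, 1841,
                    117, 694, 415, 4840, 65, 383, 129, 320, 61, 201),
    stringsAsFactors = FALSE
)

# Aggregate view: flights per airline and status, summed over all airports.
counts <- xtabs(qty_flights ~ airline + status, data = df_long)
overall <- counts[, "delayed"] / rowSums(counts)
round(100 * overall, 2)  # the 13.27 (Alaska) and 10.89 (AM West) quoted above

# Per-airport view: proportion delayed for each airport and airline.
by_airport <- xtabs(qty_flights ~ airport + airline + status, data = df_long)
rate_by_airport <- by_airport[, , "delayed"] /
    (by_airport[, , "delayed"] + by_airport[, , "on time"])
round(100 * rate_by_airport, 1)

# Is Alaska's rate lower than AM West's at every airport?
all(rate_by_airport[, "Alaska"] < rate_by_airport[, "AM West"])
```

Putting the two views side by side makes the reversal easier to see than the heatmap alone.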

Ben Horvath

September 30, 2018