Data607 Assignment 5

Intro

This week’s assignment is to take a table of flight delays, enter that information into a csv file, then tidy, transform, and analyze the information in R.

The information we’ll be looking at today looks like counts of flights that were either on time or delayed for two airlines. The information is available by arrival airport location.

The csv has been prepared and uploaded to github to make this Rmarkdown reproducible.

Method

Lets start by loading the libraries we’ll be using. We’ll use dplyr and tidyr to organize the data and ggplot to visualize it.

library(dplyr)
library(tidyr)
library(ggplot2)

Next, lets import the data from github into a dataframe object:

(raw <- read.csv(url('https://raw.githubusercontent.com/dataconsumer101/data607/master/assignment_5_source_table.csv'), stringsAsFactors = F, check.names = F))

##                   Los Angeles Phoenix San Diego San Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

It looks like the first two columns need names. The first column is the airline name and the second indicates whether the flight was on time or not- we’ll call that the flight status.

names(raw)[1] <- 'airline'
names(raw)[2] <- 'status'

Next, we’ll need to fill in the airline for row 2 and 5 as well as remove the blank row in the middle:

raw$airline[2] <- 'ALASKA'
raw$airline[5] <- 'AM WEST'

(raw <- filter(raw, nchar(airline) > 0))

##   airline  status Los Angeles Phoenix San Diego San Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

Lets gather the locations into a variable field using tidyr:

df <- gather(raw, airport, flight_count, -airline, -status)
head(df)

##   airline  status     airport flight_count
## 1  ALASKA on time Los Angeles          497
## 2  ALASKA delayed Los Angeles           62
## 3 AM WEST on time Los Angeles          694
## 4 AM WEST delayed Los Angeles          117
## 5  ALASKA on time     Phoenix          221
## 6  ALASKA delayed     Phoenix           12

I think the next logical step would be to convert all these flight status counts to rates, so that we can compare the two airlines at the same level.

Lets look at it two ways- the airline delay rate across each arrival location and also in aggregate. Lets prepare the dataframe by calculating the totals on each level, then joining back to the original dataset.

Lets check the overall rates first:

airline_total <- group_by(df, airline) %>%
  summarize(total_count = sum(flight_count))

overall <- group_by(df, airline, status) %>%
  summarize(fcount = sum(flight_count)) %>%
  left_join(airline_total, by = c('airline' = 'airline')) %>%
  mutate(pct = fcount / total_count)

overall

## # A tibble: 4 x 5
## # Groups:   airline [2]
##   airline status  fcount total_count   pct
##   <chr>   <chr>    <int>       <int> <dbl>
## 1 ALASKA  delayed    501        3775 0.133
## 2 ALASKA  on time   3274        3775 0.867
## 3 AM WEST delayed    787        7225 0.109
## 4 AM WEST on time   6438        7225 0.891

Lets visualize the results to get a better sense of the difference:

ggplot(overall, aes(x = airline, y = fcount/1000, fill = status)) +
  geom_col() +
  theme_bw() +
  labs(x = element_blank(),
       y = 'Flight Count (Thousands)',
       title = 'Total Flights Across 5 Cities',
       subtitle = 'By Arrival Status')

Clearly, AM West has more flights, but it isn’t very clear which airline is generally more likely to be delayed or not. Here’s another view that might make that easier:

filter(overall, status == 'delayed') %>%
ggplot(aes(x = airline, y = pct)) +
  geom_col() +
  theme_bw() +
  labs(x = element_blank(),
       y = element_blank(),
       title = 'Rate of Delayed Flights Across 5 Cities') +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  coord_flip()

So it looks like in general, if you have a choice to fly with one of these two airlines, the safer bet would be AM West. There are more flights to choose from and a lower chance of finding yourself in a delayed flight.

The data includes 5 arrival cities that we can compare, if you happen to be flying to one of those cities and need to decide which airline to book. We’ll be do a very similar comparison with a bit more detail.

airline_port_total <- group_by(df, airline, airport) %>%
  summarize(total_count = sum(flight_count))

airports <- left_join(df, airline_port_total, by = c('airline' = 'airline', 'airport' = 'airport')) %>%
  mutate(pct = flight_count / total_count)

head(airports)

##   airline  status     airport flight_count total_count        pct
## 1  ALASKA on time Los Angeles          497         559 0.88908766
## 2  ALASKA delayed Los Angeles           62         559 0.11091234
## 3 AM WEST on time Los Angeles          694         811 0.85573366
## 4 AM WEST delayed Los Angeles          117         811 0.14426634
## 5  ALASKA on time     Phoenix          221         233 0.94849785
## 6  ALASKA delayed     Phoenix           12         233 0.05150215

Now that our second dataset is prepared, lets see how the airlines compare by city:

ggplot(airline_port_total, aes(x = airport, y = total_count/1000, fill = airline)) +
  geom_col(position = 'dodge') +
  theme_bw() +
  labs(x = element_blank(),
       y = 'Flight Count (Thousands)',
       title = 'Flight Count by City and Airline')

Wow. If you’re flying to Phoenix you’re clearly more likely to find more options flying AM West and if you’re heading to Seattle, Alaska airlines should have more options for you.

What about delays?

filter(airports, status == 'delayed') %>%
  ggplot(aes(x = airport, y = pct, fill = airline)) +
  geom_col(position = 'dodge') + 
  theme_bw() +
  labs(x = element_blank(),
       y = element_blank(),
       title = 'Comparison of Delay Rate by City') +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  coord_flip()

Wow, thought we were almost done here! It looks like for every one of these cities, AM West is more likely to be delayed! This contradicts the calculation from earlier.

My guess is that since AM West has so many flights in Phoenix with a relatively low delay rate, the weighted average is brought down. Also, Alaska airlines delay rates in the top 3 cities are all above 10%, so the weighted average is kept high.

Lets look at the percent contribution of delays for each airline:

airline_total <- group_by(df, airline) %>%
  summarize(line_total = sum(flight_count))

share <- filter(airports, status == 'delayed') %>%
  left_join(airline_total, by = ('airline' = 'airline')) %>%
  mutate(pct_share = total_count / line_total * pct)

ggplot(share, aes(x = airline, y = pct_share, fill = airport)) +
  geom_col() +
  theme_bw() +
  labs(x = element_blank(),
       y = element_blank(),
       title = 'Delay Rate by Airline',
       subtitle = 'With Airport Weighted Contribution') +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  coord_flip() +
  scale_fill_discrete(name = element_blank())

It looks like Alaska’s delays in Seattle really make them look bad.

Data607 Assignment 5

Leo Yi

2020-02-24

Intro

Method

Conclusion