This week’s assignment is to take a table of flight delays, enter that information into a csv file, then tidy, transform, and analyze the information in R.
The information we’ll be looking at today looks like counts of flights that were either on time or delayed for two airlines. The information is available by arrival airport location.
The csv has been prepared and uploaded to github to make this Rmarkdown reproducible.
Lets start by loading the libraries we’ll be using. We’ll use dplyr and tidyr to organize the data and ggplot to visualize it.
library(dplyr)
library(tidyr)
library(ggplot2)
Next, lets import the data from github into a dataframe object:
(raw <- read.csv(url('https://raw.githubusercontent.com/dataconsumer101/data607/master/assignment_5_source_table.csv'), stringsAsFactors = F, check.names = F))
## Los Angeles Phoenix San Diego San Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
It looks like the first two columns need names. The first column is the airline name and the second indicates whether the flight was on time or not- we’ll call that the flight status.
names(raw)[1] <- 'airline'
names(raw)[2] <- 'status'
Next, we’ll need to fill in the airline for row 2 and 5 as well as remove the blank row in the middle:
raw$airline[2] <- 'ALASKA'
raw$airline[5] <- 'AM WEST'
(raw <- filter(raw, nchar(airline) > 0))
## airline status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
Lets gather the locations into a variable field using tidyr:
df <- gather(raw, airport, flight_count, -airline, -status)
head(df)
## airline status airport flight_count
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA delayed Los Angeles 62
## 3 AM WEST on time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
I think the next logical step would be to convert all these flight status counts to rates, so that we can compare the two airlines at the same level.
Lets look at it two ways- the airline delay rate across each arrival location and also in aggregate. Lets prepare the dataframe by calculating the totals on each level, then joining back to the original dataset.
Lets check the overall rates first:
airline_total <- group_by(df, airline) %>%
summarize(total_count = sum(flight_count))
overall <- group_by(df, airline, status) %>%
summarize(fcount = sum(flight_count)) %>%
left_join(airline_total, by = c('airline' = 'airline')) %>%
mutate(pct = fcount / total_count)
overall
## # A tibble: 4 x 5
## # Groups: airline [2]
## airline status fcount total_count pct
## <chr> <chr> <int> <int> <dbl>
## 1 ALASKA delayed 501 3775 0.133
## 2 ALASKA on time 3274 3775 0.867
## 3 AM WEST delayed 787 7225 0.109
## 4 AM WEST on time 6438 7225 0.891
Lets visualize the results to get a better sense of the difference:
ggplot(overall, aes(x = airline, y = fcount/1000, fill = status)) +
geom_col() +
theme_bw() +
labs(x = element_blank(),
y = 'Flight Count (Thousands)',
title = 'Total Flights Across 5 Cities',
subtitle = 'By Arrival Status')
Clearly, AM West has more flights, but it isn’t very clear which airline is generally more likely to be delayed or not. Here’s another view that might make that easier:
filter(overall, status == 'delayed') %>%
ggplot(aes(x = airline, y = pct)) +
geom_col() +
theme_bw() +
labs(x = element_blank(),
y = element_blank(),
title = 'Rate of Delayed Flights Across 5 Cities') +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
coord_flip()
So it looks like in general, if you have a choice to fly with one of these two airlines, the safer bet would be AM West. There are more flights to choose from and a lower chance of finding yourself in a delayed flight.
The data includes 5 arrival cities that we can compare, if you happen to be flying to one of those cities and need to decide which airline to book. We’ll be do a very similar comparison with a bit more detail.
airline_port_total <- group_by(df, airline, airport) %>%
summarize(total_count = sum(flight_count))
airports <- left_join(df, airline_port_total, by = c('airline' = 'airline', 'airport' = 'airport')) %>%
mutate(pct = flight_count / total_count)
head(airports)
## airline status airport flight_count total_count pct
## 1 ALASKA on time Los Angeles 497 559 0.88908766
## 2 ALASKA delayed Los Angeles 62 559 0.11091234
## 3 AM WEST on time Los Angeles 694 811 0.85573366
## 4 AM WEST delayed Los Angeles 117 811 0.14426634
## 5 ALASKA on time Phoenix 221 233 0.94849785
## 6 ALASKA delayed Phoenix 12 233 0.05150215
Now that our second dataset is prepared, lets see how the airlines compare by city:
ggplot(airline_port_total, aes(x = airport, y = total_count/1000, fill = airline)) +
geom_col(position = 'dodge') +
theme_bw() +
labs(x = element_blank(),
y = 'Flight Count (Thousands)',
title = 'Flight Count by City and Airline')
Wow. If you’re flying to Phoenix you’re clearly more likely to find more options flying AM West and if you’re heading to Seattle, Alaska airlines should have more options for you.
What about delays?
filter(airports, status == 'delayed') %>%
ggplot(aes(x = airport, y = pct, fill = airline)) +
geom_col(position = 'dodge') +
theme_bw() +
labs(x = element_blank(),
y = element_blank(),
title = 'Comparison of Delay Rate by City') +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
coord_flip()
Wow, thought we were almost done here! It looks like for every one of these cities, AM West is more likely to be delayed! This contradicts the calculation from earlier.
My guess is that since AM West has so many flights in Phoenix with a relatively low delay rate, the weighted average is brought down. Also, Alaska airlines delay rates in the top 3 cities are all above 10%, so the weighted average is kept high.
Lets look at the percent contribution of delays for each airline:
airline_total <- group_by(df, airline) %>%
summarize(line_total = sum(flight_count))
share <- filter(airports, status == 'delayed') %>%
left_join(airline_total, by = ('airline' = 'airline')) %>%
mutate(pct_share = total_count / line_total * pct)
ggplot(share, aes(x = airline, y = pct_share, fill = airport)) +
geom_col() +
theme_bw() +
labs(x = element_blank(),
y = element_blank(),
title = 'Delay Rate by Airline',
subtitle = 'With Airport Weighted Contribution') +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
coord_flip() +
scale_fill_discrete(name = element_blank())
It looks like Alaska’s delays in Seattle really make them look bad.
The numbers don’t lie, you’re better off sticking with Alaska airlines, even though their numbers overall seem worse.