The goal of this project is to make a wide dataset long with dplyr and tidyr and then perform some simple analytics on the tidy dataset.
Here we use dplyr::tbl_df() to create a new dataframe after loading from csv.
setwd('~')
path = "./Data/airline_raw.csv"
airline_raw = read.csv(path)
library(dplyr)
library(tidyr)
library(ggplot2) # for analysis
airline = tbl_df(airline_raw)
airline
## Source: local data frame [4 x 7]
##
## airline time Los_Angeles Phoenix San_Diego San_Francisco Seattle
## 1 Alaska on_time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 3 AM_west on_time 694 4840 383 320 201
## 4 AM_west delayed 117 415 65 129 61
We now use tidyr and dplyr to gather and spread the original dataframe and store it in a new df. The resulting df is now long instead of wide.
airline_tidy = airline %>%
gather("city", "delay_amount", 3:7) %>%
spread(time, delay_amount)
airline_tidy
## Source: local data frame [10 x 4]
##
## airline city delayed on_time
## 1 Alaska Los_Angeles 62 497
## 2 Alaska Phoenix 12 221
## 3 Alaska San_Diego 20 212
## 4 Alaska San_Francisco 102 503
## 5 Alaska Seattle 305 1841
## 6 AM_west Los_Angeles 117 694
## 7 AM_west Phoenix 415 4840
## 8 AM_west San_Diego 65 383
## 9 AM_west San_Francisco 129 320
## 10 AM_west Seattle 61 201
First we add a new column to airline_tidy with dplyr::mutate() to create the delay rates for each city for analysis.
# create delay rate column
airline_tidy = airline_tidy %>%
mutate(delay_rate_pcnt = round(delayed /(delayed + on_time),4) * 100)
head(airline_tidy,n = 3)
## Source: local data frame [3 x 5]
##
## airline city delayed on_time delay_rate_pcnt
## 1 Alaska Los_Angeles 62 497 11.09
## 2 Alaska Phoenix 12 221 5.15
## 3 Alaska San_Diego 20 212 8.62
This dataset is a classic example of aggregate data not providing a realistic interpretation of events as outlined in Kaiser Fung’s book Numbersense.
If we wanted to determine which airline had the lowest delay rate we might be inclined to reference summary data.
airline_tidy %>%
group_by(airline) %>%
summarise(total_delay = sum(delayed), total_ontime = sum(on_time)) %>%
mutate(total_delay_rate_pcnt = round(total_delay /(total_delay + total_ontime),4)* 100)
## Source: local data frame [2 x 4]
##
## airline total_delay total_ontime total_delay_rate_pcnt
## 1 Alaska 501 3274 13.27
## 2 AM_west 787 6438 10.89
The results of this computation makes it appear that AM West outperformed Alaska.
However if we take the city by city comparison that is reflected in the airline table, it paints a much different picture of airline efficiency.
ggplot(airline_tidy, aes(x = city, y = delay_rate_pcnt, color = airline, group = airline)) +
geom_point(size = 5) +
geom_line(size = 2)
The city by city comparison offers a much better picture of the delay rates between the two airlines. We can now see that Alaska outperformed AM West in every city, in many cases by almost twice as much. The aggregated results were skewed because of a disproportionate number of total flights by AM West and an exceptionally low delay rate in Phoenix. This managed to push down the aggregated delay rate significantly.