The goal of this project is to make a wide dataset long with dplyr and tidyr and then perform some simple analytics on the tidy dataset.

1. Load Data and Libraries

Here we use dplyr::tbl_df() to create a new dataframe after loading from csv.

setwd('~')
path = "./Data/airline_raw.csv"
airline_raw = read.csv(path)

library(dplyr)
library(tidyr)
library(ggplot2)  # for analysis

airline = tbl_df(airline_raw)
airline
## Source: local data frame [4 x 7]
## 
##   airline    time Los_Angeles Phoenix San_Diego San_Francisco Seattle
## 1  Alaska on_time         497     221       212           503    1841
## 2  Alaska delayed          62      12        20           102     305
## 3 AM_west on_time         694    4840       383           320     201
## 4 AM_west delayed         117     415        65           129      61

2. Tidyr and Dplyr

We now use tidyr and dplyr to gather and spread the original dataframe and store it in a new df. The resulting df is now long instead of wide.

airline_tidy = airline %>%
    gather("city", "delay_amount", 3:7) %>%
        spread(time, delay_amount)

airline_tidy
## Source: local data frame [10 x 4]
## 
##    airline          city delayed on_time
## 1   Alaska   Los_Angeles      62     497
## 2   Alaska       Phoenix      12     221
## 3   Alaska     San_Diego      20     212
## 4   Alaska San_Francisco     102     503
## 5   Alaska       Seattle     305    1841
## 6  AM_west   Los_Angeles     117     694
## 7  AM_west       Phoenix     415    4840
## 8  AM_west     San_Diego      65     383
## 9  AM_west San_Francisco     129     320
## 10 AM_west       Seattle      61     201

3. Analysis

First we add a new column to airline_tidy with dplyr::mutate() to create the delay rates for each city for analysis.

# create delay rate column
airline_tidy = airline_tidy %>% 
    mutate(delay_rate_pcnt = round(delayed /(delayed + on_time),4) * 100)

head(airline_tidy,n = 3)
## Source: local data frame [3 x 5]
## 
##   airline        city delayed on_time delay_rate_pcnt
## 1  Alaska Los_Angeles      62     497           11.09
## 2  Alaska     Phoenix      12     221            5.15
## 3  Alaska   San_Diego      20     212            8.62

This dataset is a classic example of aggregate data not providing a realistic interpretation of events as outlined in Kaiser Fung’s book Numbersense.

If we wanted to determine which airline had the lowest delay rate we might be inclined to reference summary data.

airline_tidy %>% 
    group_by(airline) %>% 
        summarise(total_delay = sum(delayed), total_ontime = sum(on_time)) %>% 
            mutate(total_delay_rate_pcnt = round(total_delay /(total_delay + total_ontime),4)* 100)
## Source: local data frame [2 x 4]
## 
##   airline total_delay total_ontime total_delay_rate_pcnt
## 1  Alaska         501         3274                 13.27
## 2 AM_west         787         6438                 10.89

The results of this computation makes it appear that AM West outperformed Alaska.

However if we take the city by city comparison that is reflected in the airline table, it paints a much different picture of airline efficiency.

ggplot(airline_tidy, aes(x = city, y = delay_rate_pcnt, color = airline, group = airline)) +
    geom_point(size = 5) + 
    geom_line(size = 2)

The city by city comparison offers a much better picture of the delay rates between the two airlines. We can now see that Alaska outperformed AM West in every city, in many cases by almost twice as much. The aggregated results were skewed because of a disproportionate number of total flights by AM West and an exceptionally low delay rate in Phoenix. This managed to push down the aggregated delay rate significantly.