Tidying and Transforming Data

For this assignment, we are tasked with ‘tidying’ the example data. First I reproduced the data into a csv file which we load into an R dataframe.

library(tidyr)
library(dplyr)
df <-read.csv("https://raw.githubusercontent.com/jreznyc/DATA607/master/HW/HW%205/flights.csv")
head(df)
##   airline arr_status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA    on time         497     221       212           503    1841
## 2  ALASKA    delayed          62      12        20           102     305
## 3 AM WEST    on time         694    4840       383           320     201
## 4 AM WEST    delayed         117     415        65           129      61

The first transformation will be to gather the variable columns into one column named “city”

df2 <- gather(df,"city","count",3:7)
head(df2)
##   airline arr_status        city count
## 1  ALASKA    on time Los.Angeles   497
## 2  ALASKA    delayed Los.Angeles    62
## 3 AM WEST    on time Los.Angeles   694
## 4 AM WEST    delayed Los.Angeles   117
## 5  ALASKA    on time     Phoenix   221
## 6  ALASKA    delayed     Phoenix    12

Next, we’ll spread the arrival status column, contains on time and delayed counts across two rows for each city and airline. We’ll spread this out into two columns, one for delayed values and the other for on time values.

df3 <- spread(df2, arr_status, count)
head(df3)
##   airline          city delayed on time
## 1  ALASKA   Los.Angeles      62     497
## 2  ALASKA       Phoenix      12     221
## 3  ALASKA     San.Diego      20     212
## 4  ALASKA San.Francisco     102     503
## 5  ALASKA       Seattle     305    1841
## 6 AM WEST   Los.Angeles     117     694

Now we have a tidy dataframe where each row corresponds to a combination of airline and city. This facilitates further analysis. Now we can easily compare the arrival times for both airlines. Before we do that, it’s worth mentioning that the above operations can be done on one line using pipes. Below is an example of the same transformations performed in one line, then we will count on-time arrivals and delays for both airlines.

df4 <- df %>% gather("city","count",3:7) %>% spread(arr_status, count)
head(df4)
##   airline          city delayed on time
## 1  ALASKA   Los.Angeles      62     497
## 2  ALASKA       Phoenix      12     221
## 3  ALASKA     San.Diego      20     212
## 4  ALASKA San.Francisco     102     503
## 5  ALASKA       Seattle     305    1841
## 6 AM WEST   Los.Angeles     117     694
df4 %>% group_by(airline) %>% summarise(delays=sum(delayed), on_time=sum(`on time`))
## # A tibble: 2 x 3
##   airline delays on_time
##   <fct>    <int>   <int>
## 1 ALASKA     501    3274
## 2 AM WEST    787    6438

The counts don’t provide a clear picture to compare the relative performance of both airlines, so let’s look at the number of delays in proportion to total flights for each airline.

df4 %>% group_by(airline) %>% summarise(delays_percent=round(sum(delayed)/sum(delayed,`on time`)*100,2))
## # A tibble: 2 x 2
##   airline delays_percent
##   <fct>            <dbl>
## 1 ALASKA            13.3
## 2 AM WEST           10.9

Now we can see that Alaska Airlines has a higher percentage of delays than AM West.