IS 607 Week 6 Assignment

The goal here is to transform a wide data set into a “tidy” one. A tidy data set is defined as one where each variable is a column, each observation is a row, and each type of observational unit is a table.

I’ll be using the tidyr and dplyr packages.

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

First I’ll read in the data.

w6 <- read.csv('week_6.csv')

get rid of the blank row and add in the other two missing variables.

The way the data is structured I can see that the “Alaska” and “AM West” observations span two lines. I’ll add them to the 2nd line of each and get rid of the blank middle line.

w6[2,1] <- 'ALASKA'
w6[5,1] <- 'AM WEST'
w6 <- filter(w6, w6[,1] != "")

Since there were no headers on iehter the airline or on time/delayed columns, they were named X and X.1 by default. I will rename those something more usable.

names(w6)[1] <- "Airline"
names(w6)[2] <- "timeliness"

If I look at the data…

w6

##   Airline timeliness Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA    on time         497     221       212           503    1841
## 2  ALASKA    delayed          62      12        20           102     305
## 3 AM WEST    on time         694    4840       383           320     201
## 4 AM WEST    delayed         117     415        65           129      61

…everything is filled in, so now I’m ready to transform via tidyr.

I’ll use piping syntax because it makes reduces the amount of typing.

w6_tidy <- w6 %>% gather(city, frequency, -Airline, -timeliness) %>%
                  spread(timeliness, frequency)

First I used gather() to take all columns other than airline and timeliness, stack them in a new column called “city”, and have the new value column called frequency.

Next, I thought it would make sense to replace the frequency column with a ‘delayed’ column and a ‘on time’ column.

Here’s the result:

w6_tidy

##    Airline          city delayed on time
## 1   ALASKA   Los.Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San.Diego      20     212
## 4   ALASKA San.Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM WEST   Los.Angeles     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San.Diego      65     383
## 9  AM WEST San.Francisco     129     320
## 10 AM WEST       Seattle      61     201

I thought about combining the airline with the ontime and delayed columns, so there would be a “Alaska On Time” column, “Alaska Delayed”, and so forth. But this seemed to combine two variables in a column, thus violating the definition of ‘tidy’ data.

However it seems to me that while there are good general rules for what constitutes ‘tidy’ data, there are some areas where it is subjective, and one is probably best served by thinking about making it ‘tidy’ in terms of the anlysis they are trying to perform.

So with that in mind, I’ll add some columns to measure relative delays per city per airline.

w6_tidy$total <- w6_tidy$`on time` + w6_tidy$delayed
w6_tidy$delay_pct <- w6_tidy$delayed / w6_tidy$total

Now I can compare how the airlines performed in any city I want to look at directly.

filter(w6_tidy, city == 'Los.Angeles')

##   Airline        city delayed on time total delay_pct
## 1  ALASKA Los.Angeles      62     497   559 0.1109123
## 2 AM WEST Los.Angeles     117     694   811 0.1442663

So Alaska had relatively (and absolutely) fewer delays in Los Angeles.

If I use the arrange() function to sort by the city, I can compare all the airlines at once:

arrange(w6_tidy, city)

##    Airline          city delayed on time total  delay_pct
## 1   ALASKA   Los.Angeles      62     497   559 0.11091234
## 2  AM WEST   Los.Angeles     117     694   811 0.14426634
## 3   ALASKA       Phoenix      12     221   233 0.05150215
## 4  AM WEST       Phoenix     415    4840  5255 0.07897241
## 5   ALASKA     San.Diego      20     212   232 0.08620690
## 6  AM WEST     San.Diego      65     383   448 0.14508929
## 7   ALASKA San.Francisco     102     503   605 0.16859504
## 8  AM WEST San.Francisco     129     320   449 0.28730512
## 9   ALASKA       Seattle     305    1841  2146 0.14212488
## 10 AM WEST       Seattle      61     201   262 0.23282443

And it appears Alaska has relatively fewer delays in every city.

IS 607 Week 6 Assignment

Chris Fenton

October 4, 2015