The goal here is to transform a wide data set into a “tidy” one. A tidy data set is defined as one where each variable is a column, each observation is a row, and each type of observational unit is a table.
I’ll be using the tidyr and dplyr packages.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
First I’ll read in the data.
w6 <- read.csv('week_6.csv')
get rid of the blank row and add in the other two missing variables.
The way the data is structured I can see that the “Alaska” and “AM West” observations span two lines. I’ll add them to the 2nd line of each and get rid of the blank middle line.
w6[2,1] <- 'ALASKA'
w6[5,1] <- 'AM WEST'
w6 <- filter(w6, w6[,1] != "")
Since there were no headers on iehter the airline or on time/delayed columns, they were named X and X.1 by default. I will rename those something more usable.
names(w6)[1] <- "Airline"
names(w6)[2] <- "timeliness"
If I look at the data…
w6
## Airline timeliness Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
…everything is filled in, so now I’m ready to transform via tidyr.
I’ll use piping syntax because it makes reduces the amount of typing.
w6_tidy <- w6 %>% gather(city, frequency, -Airline, -timeliness) %>%
spread(timeliness, frequency)
First I used gather() to take all columns other than airline and timeliness, stack them in a new column called “city”, and have the new value column called frequency.
Next, I thought it would make sense to replace the frequency column with a ‘delayed’ column and a ‘on time’ column.
Here’s the result:
w6_tidy
## Airline city delayed on time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
I thought about combining the airline with the ontime and delayed columns, so there would be a “Alaska On Time” column, “Alaska Delayed”, and so forth. But this seemed to combine two variables in a column, thus violating the definition of ‘tidy’ data.
However it seems to me that while there are good general rules for what constitutes ‘tidy’ data, there are some areas where it is subjective, and one is probably best served by thinking about making it ‘tidy’ in terms of the anlysis they are trying to perform.
So with that in mind, I’ll add some columns to measure relative delays per city per airline.
w6_tidy$total <- w6_tidy$`on time` + w6_tidy$delayed
w6_tidy$delay_pct <- w6_tidy$delayed / w6_tidy$total
Now I can compare how the airlines performed in any city I want to look at directly.
filter(w6_tidy, city == 'Los.Angeles')
## Airline city delayed on time total delay_pct
## 1 ALASKA Los.Angeles 62 497 559 0.1109123
## 2 AM WEST Los.Angeles 117 694 811 0.1442663
So Alaska had relatively (and absolutely) fewer delays in Los Angeles.
If I use the arrange() function to sort by the city, I can compare all the airlines at once:
arrange(w6_tidy, city)
## Airline city delayed on time total delay_pct
## 1 ALASKA Los.Angeles 62 497 559 0.11091234
## 2 AM WEST Los.Angeles 117 694 811 0.14426634
## 3 ALASKA Phoenix 12 221 233 0.05150215
## 4 AM WEST Phoenix 415 4840 5255 0.07897241
## 5 ALASKA San.Diego 20 212 232 0.08620690
## 6 AM WEST San.Diego 65 383 448 0.14508929
## 7 ALASKA San.Francisco 102 503 605 0.16859504
## 8 AM WEST San.Francisco 129 320 449 0.28730512
## 9 ALASKA Seattle 305 1841 2146 0.14212488
## 10 AM WEST Seattle 61 201 262 0.23282443
And it appears Alaska has relatively fewer delays in every city.