For this assignment, we are tasked with ‘tidying’ the example data. First I reproduced the data into a csv file which we load into an R dataframe.
library(tidyr)
library(dplyr)
df <-read.csv("https://raw.githubusercontent.com/jreznyc/DATA607/master/HW/HW%205/flights.csv")
head(df)
## airline arr_status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
The first transformation will be to gather the variable columns into one column named “city”
df2 <- gather(df,"city","count",3:7)
head(df2)
## airline arr_status city count
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
Next, we’ll spread the arrival status column, contains on time and delayed counts across two rows for each city and airline. We’ll spread this out into two columns, one for delayed values and the other for on time values.
df3 <- spread(df2, arr_status, count)
head(df3)
## airline city delayed on time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
Now we have a tidy dataframe where each row corresponds to a combination of airline and city. This facilitates further analysis. Now we can easily compare the arrival times for both airlines. Before we do that, it’s worth mentioning that the above operations can be done on one line using pipes. Below is an example of the same transformations performed in one line, then we will count on-time arrivals and delays for both airlines.
df4 <- df %>% gather("city","count",3:7) %>% spread(arr_status, count)
head(df4)
## airline city delayed on time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
df4 %>% group_by(airline) %>% summarise(delays=sum(delayed), on_time=sum(`on time`))
## # A tibble: 2 x 3
## airline delays on_time
## <fct> <int> <int>
## 1 ALASKA 501 3274
## 2 AM WEST 787 6438
The counts don’t provide a clear picture to compare the relative performance of both airlines, so let’s look at the number of delays in proportion to total flights for each airline.
df4 %>% group_by(airline) %>% summarise(delays_percent=round(sum(delayed)/sum(delayed,`on time`)*100,2))
## # A tibble: 2 x 2
## airline delays_percent
## <fct> <dbl>
## 1 ALASKA 13.3
## 2 AM WEST 10.9
Now we can see that Alaska Airlines has a higher percentage of delays than AM West.