Today we’re going to tidy a csv file and see if we can better understand arrival delays for two airlines across five destinations.
After opening the dplyr, tidyr, readr, and knitr packages we’re going to download our csv file from github and name it ‘flights’.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(readr)
library(knitr)
assignment5<-"https://raw.githubusercontent.com/RandallThompson/Data607/master/Assignment5.csv"
flights<-read.csv(assignment5)
Next we’re going to turn it into a data frame to make it easier to manipulate.
#turning csv into dataframe
data_frame(flights)
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
## # A tibble: 5 x 1
## flights$X $X.1 $Los.Angeles $Phoenix $San.Diego $San.Francisco $Seattle
## <fct> <fct> <int> <int> <int> <int> <int>
## 1 ALASKA on ti~ 497 221 212 503 1841
## 2 "" delay~ 62 12 20 102 305
## 3 "" "" NA NA NA NA NA
## 4 AM WEST on ti~ 694 4840 383 320 201
## 5 "" delay~ 117 415 65 129 61
As you can see, the data looks messy. There are many empty cells and a row of NAs. The data is formatted in a way that makes it difficult to apply functions to.
Let’s combine the cities into one column called city, get rid of the NAs, and fill the empty cells.
#using gather() to turn each column of cities into one column with 5 columns of values into one column. There is an empty row that I'm removing with the na.omit() function.
fly<-flights %>% gather(city, number, c(3:7)) %>% na.omit()
#changing the names of the columns
fly<-rename(fly, 'airline'='X', 'arrival'='X.1')
Now we’re going to fill in the empty cells in the column airline with the value above it using the fill() funciton.
fly %>% fill(airline)
## airline arrival city number
## 1 ALASKA on time Los.Angeles 497
## 2 delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 delayed Phoenix 415
## 9 ALASKA on time San.Diego 212
## 10 delayed San.Diego 20
## 11 AM WEST on time San.Diego 383
## 12 delayed San.Diego 65
## 13 ALASKA on time San.Francisco 503
## 14 delayed San.Francisco 102
## 15 AM WEST on time San.Francisco 320
## 16 delayed San.Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 delayed Seattle 61
#not working, not sure why
#fly %>% mutate(airline=replace(airline, airline=="", NA)) #tried to turn the empty cells into NAs
#fly %>% fill(airline) #still not working
#as.character(fly$airline) #What if we turn that column into a character?
#fly %>% fill(airline) #No luck. Unfortunately when I run these codes, it makes the data harder to work with so we're just going to skip to the solution I landed on: manually imputting the correct data
fly$airline <- rep(c('ALASKA', 'ALASKA', 'AM WEST', 'AM WEST'), 5)
print(fly)
## airline arrival city number
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 4 AM WEST on time Los.Angeles 694
## 5 AM WEST delayed Los.Angeles 117
## 6 ALASKA on time Phoenix 221
## 7 ALASKA delayed Phoenix 12
## 9 AM WEST on time Phoenix 4840
## 10 AM WEST delayed Phoenix 415
## 11 ALASKA on time San.Diego 212
## 12 ALASKA delayed San.Diego 20
## 14 AM WEST on time San.Diego 383
## 15 AM WEST delayed San.Diego 65
## 16 ALASKA on time San.Francisco 503
## 17 ALASKA delayed San.Francisco 102
## 19 AM WEST on time San.Francisco 320
## 20 AM WEST delayed San.Francisco 129
## 21 ALASKA on time Seattle 1841
## 22 ALASKA delayed Seattle 305
## 24 AM WEST on time Seattle 201
## 25 AM WEST delayed Seattle 61
That looks better! Even though there are more rows, this format will be easier to work with.
#Organizing the data first by airlines then by arrivals will give us the sums of the values in arrivals.
fly %>%
group_by(airline, arrival) %>%
summarise(sum = sum(number))
## # A tibble: 4 x 3
## # Groups: airline [2]
## airline arrival sum
## <chr> <fct> <int>
## 1 ALASKA delayed 501
## 2 ALASKA on time 3274
## 3 AM WEST delayed 787
## 4 AM WEST on time 6438
We can see the total number of flights that were on time and delayed for each airline. It looks like AM WEST has more delayed flights.
#what percent was delayed for each airline
fly %>%
group_by(airline, arrival) %>%
summarise(sum = sum(number)) %>%
summarise(percentdelayed = sum[1]/(sum[1]+sum[2]))
## # A tibble: 2 x 2
## airline percentdelayed
## <chr> <dbl>
## 1 ALASKA 0.133
## 2 AM WEST 0.109
We can see between the two tables that ALASKA had fewer delayed flights but a higher percentage of their flights were delayed.
#To make this calculation easier, we're going to use the spread() function to create new columns. These new columns will make it easier to do a calculation within each row. Let's look at the before and after.
print(fly)
## airline arrival city number
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 4 AM WEST on time Los.Angeles 694
## 5 AM WEST delayed Los.Angeles 117
## 6 ALASKA on time Phoenix 221
## 7 ALASKA delayed Phoenix 12
## 9 AM WEST on time Phoenix 4840
## 10 AM WEST delayed Phoenix 415
## 11 ALASKA on time San.Diego 212
## 12 ALASKA delayed San.Diego 20
## 14 AM WEST on time San.Diego 383
## 15 AM WEST delayed San.Diego 65
## 16 ALASKA on time San.Francisco 503
## 17 ALASKA delayed San.Francisco 102
## 19 AM WEST on time San.Francisco 320
## 20 AM WEST delayed San.Francisco 129
## 21 ALASKA on time Seattle 1841
## 22 ALASKA delayed Seattle 305
## 24 AM WEST on time Seattle 201
## 25 AM WEST delayed Seattle 61
fly<-fly %>%
spread(arrival, number)
print(fly)
## airline city delayed on time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
#We can see that the number column has disappeared but the values have been split into columns named for the values of the original arrivals column.
#Now to calcualte the percent of delays. To make the calculation easier, we're going to remove the space in the column named 'on time'.
fly<-rename(fly, 'ontime' = 'on time')
fly<-fly %>%
mutate(percent = (delayed / (delayed + ontime))) %>%
arrange(city)
print(fly)
## airline city delayed ontime percent
## 1 ALASKA Los.Angeles 62 497 0.11091234
## 2 AM WEST Los.Angeles 117 694 0.14426634
## 3 ALASKA Phoenix 12 221 0.05150215
## 4 AM WEST Phoenix 415 4840 0.07897241
## 5 ALASKA San.Diego 20 212 0.08620690
## 6 AM WEST San.Diego 65 383 0.14508929
## 7 ALASKA San.Francisco 102 503 0.16859504
## 8 AM WEST San.Francisco 129 320 0.28730512
## 9 ALASKA Seattle 305 1841 0.14212488
## 10 AM WEST Seattle 61 201 0.23282443
#The function mutate() adds the new column of percentages and then we arranged them by city.
We can see that in each city, ALASKA has a higher percentage of delays, but that’s not really what we wanted to see. We wanted to see what are the differences in the percentage of delays for each city overall. For that we have to combine the delay percentage of both airlines.
#We're intersted in each city so we're going to group_by city. We want the sums of delayed and ontime from the two airlines for each city. We do the same percent calcuation as earlier and arrange in accending order which is the default setting.
fly %>%
group_by(city) %>%
summarise(delayed = sum(delayed), ontime = sum(ontime)) %>%
mutate(percent = (delayed / (delayed + ontime))) %>%
arrange(percent)
## # A tibble: 5 x 4
## city delayed ontime percent
## <chr> <int> <int> <dbl>
## 1 Phoenix 427 5061 0.0778
## 2 San.Diego 85 595 0.125
## 3 Los.Angeles 179 1191 0.131
## 4 Seattle 366 2042 0.152
## 5 San.Francisco 231 823 0.219
Now we can see that overall, Phoenix has the lowest percentage of delayed flights and San.Francisco has the highest percentage. For such a busy airport, you would think Phoenix would have the MOST delays. It could be that they have earned so much business due to their lower percentage of delayed flights or it could be that when the city invested in such a large airport, they made sure to hire the world’s best logistical analysts to run the place!