Today we’re going to tidy a csv file and see if we can better understand arrival delays for two airlines across five destinations.

After opening the dplyr, tidyr, readr, and knitr packages we’re going to download our csv file from github and name it ‘flights’.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(readr)
library(knitr)
assignment5<-"https://raw.githubusercontent.com/RandallThompson/Data607/master/Assignment5.csv"
flights<-read.csv(assignment5)

Next we’re going to turn it into a data frame to make it easier to manipulate.

#turning csv into dataframe
data_frame(flights)
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
## # A tibble: 5 x 1
##   flights$X $X.1   $Los.Angeles $Phoenix $San.Diego $San.Francisco $Seattle
##   <fct>     <fct>         <int>    <int>      <int>          <int>    <int>
## 1 ALASKA    on ti~          497      221        212            503     1841
## 2 ""        delay~           62       12         20            102      305
## 3 ""        ""               NA       NA         NA             NA       NA
## 4 AM WEST   on ti~          694     4840        383            320      201
## 5 ""        delay~          117      415         65            129       61

As you can see, the data looks messy. There are many empty cells and a row of NAs. The data is formatted in a way that makes it difficult to apply functions to.

Time to clean the data

Let’s combine the cities into one column called city, get rid of the NAs, and fill the empty cells.

#using gather() to turn each column of cities into one column with 5 columns of values into one column. There is an empty row that I'm removing with the na.omit() function.
fly<-flights %>% gather(city, number, c(3:7)) %>% na.omit()
#changing the names of the columns 
fly<-rename(fly, 'airline'='X', 'arrival'='X.1')

Now we’re going to fill in the empty cells in the column airline with the value above it using the fill() funciton.

fly %>% fill(airline) 
##    airline arrival          city number
## 1   ALASKA on time   Los.Angeles    497
## 2          delayed   Los.Angeles     62
## 3  AM WEST on time   Los.Angeles    694
## 4          delayed   Los.Angeles    117
## 5   ALASKA on time       Phoenix    221
## 6          delayed       Phoenix     12
## 7  AM WEST on time       Phoenix   4840
## 8          delayed       Phoenix    415
## 9   ALASKA on time     San.Diego    212
## 10         delayed     San.Diego     20
## 11 AM WEST on time     San.Diego    383
## 12         delayed     San.Diego     65
## 13  ALASKA on time San.Francisco    503
## 14         delayed San.Francisco    102
## 15 AM WEST on time San.Francisco    320
## 16         delayed San.Francisco    129
## 17  ALASKA on time       Seattle   1841
## 18         delayed       Seattle    305
## 19 AM WEST on time       Seattle    201
## 20         delayed       Seattle     61
#not working, not sure why
#fly %>% mutate(airline=replace(airline, airline=="", NA)) #tried to turn the empty cells into NAs
#fly %>% fill(airline) #still not working
#as.character(fly$airline) #What if we turn that column into a character?
#fly %>% fill(airline) #No luck. Unfortunately when I run these codes, it makes the data harder to work with so we're just going to skip to the solution I landed on: manually imputting the correct data

fly$airline <- rep(c('ALASKA', 'ALASKA', 'AM WEST', 'AM WEST'), 5)
print(fly)
##    airline arrival          city number
## 1   ALASKA on time   Los.Angeles    497
## 2   ALASKA delayed   Los.Angeles     62
## 4  AM WEST on time   Los.Angeles    694
## 5  AM WEST delayed   Los.Angeles    117
## 6   ALASKA on time       Phoenix    221
## 7   ALASKA delayed       Phoenix     12
## 9  AM WEST on time       Phoenix   4840
## 10 AM WEST delayed       Phoenix    415
## 11  ALASKA on time     San.Diego    212
## 12  ALASKA delayed     San.Diego     20
## 14 AM WEST on time     San.Diego    383
## 15 AM WEST delayed     San.Diego     65
## 16  ALASKA on time San.Francisco    503
## 17  ALASKA delayed San.Francisco    102
## 19 AM WEST on time San.Francisco    320
## 20 AM WEST delayed San.Francisco    129
## 21  ALASKA on time       Seattle   1841
## 22  ALASKA delayed       Seattle    305
## 24 AM WEST on time       Seattle    201
## 25 AM WEST delayed       Seattle     61

That looks better! Even though there are more rows, this format will be easier to work with.

Lets look at how many flights were delayed and on time for each airline.

#Organizing the data first by airlines then by arrivals will give us the sums of the values in arrivals.
fly %>% 
  group_by(airline, arrival) %>% 
  summarise(sum = sum(number))
## # A tibble: 4 x 3
## # Groups:   airline [2]
##   airline arrival   sum
##   <chr>   <fct>   <int>
## 1 ALASKA  delayed   501
## 2 ALASKA  on time  3274
## 3 AM WEST delayed   787
## 4 AM WEST on time  6438

We can see the total number of flights that were on time and delayed for each airline. It looks like AM WEST has more delayed flights.

What percentage was delayed for each airline?

#what percent was delayed for each airline
fly %>% 
  group_by(airline, arrival) %>% 
  summarise(sum = sum(number)) %>% 
  summarise(percentdelayed = sum[1]/(sum[1]+sum[2]))
## # A tibble: 2 x 2
##   airline percentdelayed
##   <chr>            <dbl>
## 1 ALASKA           0.133
## 2 AM WEST          0.109

We can see between the two tables that ALASKA had fewer delayed flights but a higher percentage of their flights were delayed.

Is there a difference in delayed flights between cities?

#To make this calculation easier, we're going to use the spread() function to create new columns. These new columns will make it easier to do a calculation within each row. Let's look at the before and after.
print(fly)
##    airline arrival          city number
## 1   ALASKA on time   Los.Angeles    497
## 2   ALASKA delayed   Los.Angeles     62
## 4  AM WEST on time   Los.Angeles    694
## 5  AM WEST delayed   Los.Angeles    117
## 6   ALASKA on time       Phoenix    221
## 7   ALASKA delayed       Phoenix     12
## 9  AM WEST on time       Phoenix   4840
## 10 AM WEST delayed       Phoenix    415
## 11  ALASKA on time     San.Diego    212
## 12  ALASKA delayed     San.Diego     20
## 14 AM WEST on time     San.Diego    383
## 15 AM WEST delayed     San.Diego     65
## 16  ALASKA on time San.Francisco    503
## 17  ALASKA delayed San.Francisco    102
## 19 AM WEST on time San.Francisco    320
## 20 AM WEST delayed San.Francisco    129
## 21  ALASKA on time       Seattle   1841
## 22  ALASKA delayed       Seattle    305
## 24 AM WEST on time       Seattle    201
## 25 AM WEST delayed       Seattle     61
fly<-fly %>% 
  spread(arrival, number)
print(fly)
##    airline          city delayed on time
## 1   ALASKA   Los.Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San.Diego      20     212
## 4   ALASKA San.Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM WEST   Los.Angeles     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San.Diego      65     383
## 9  AM WEST San.Francisco     129     320
## 10 AM WEST       Seattle      61     201
#We can see that the number column has disappeared but the values have been split into columns named for the values of the original arrivals column.
#Now to calcualte the percent of delays. To make the calculation easier, we're going to remove the space in the column named 'on time'. 
fly<-rename(fly, 'ontime' = 'on time')
fly<-fly %>%   
  mutate(percent = (delayed / (delayed + ontime))) %>% 
  arrange(city)
print(fly)
##    airline          city delayed ontime    percent
## 1   ALASKA   Los.Angeles      62    497 0.11091234
## 2  AM WEST   Los.Angeles     117    694 0.14426634
## 3   ALASKA       Phoenix      12    221 0.05150215
## 4  AM WEST       Phoenix     415   4840 0.07897241
## 5   ALASKA     San.Diego      20    212 0.08620690
## 6  AM WEST     San.Diego      65    383 0.14508929
## 7   ALASKA San.Francisco     102    503 0.16859504
## 8  AM WEST San.Francisco     129    320 0.28730512
## 9   ALASKA       Seattle     305   1841 0.14212488
## 10 AM WEST       Seattle      61    201 0.23282443
#The function mutate() adds the new column of percentages and then we arranged them by city. 

We can see that in each city, ALASKA has a higher percentage of delays, but that’s not really what we wanted to see. We wanted to see what are the differences in the percentage of delays for each city overall. For that we have to combine the delay percentage of both airlines.

#We're intersted in each city so we're going to group_by city. We want the sums of delayed and ontime from the two airlines for each city. We do the same percent calcuation as earlier and arrange in accending order which is the default setting. 
fly %>% 
  group_by(city) %>% 
  summarise(delayed = sum(delayed), ontime = sum(ontime)) %>% 
  mutate(percent = (delayed / (delayed + ontime))) %>% 
  arrange(percent)
## # A tibble: 5 x 4
##   city          delayed ontime percent
##   <chr>           <int>  <int>   <dbl>
## 1 Phoenix           427   5061  0.0778
## 2 San.Diego          85    595  0.125 
## 3 Los.Angeles       179   1191  0.131 
## 4 Seattle           366   2042  0.152 
## 5 San.Francisco     231    823  0.219

Now we can see that overall, Phoenix has the lowest percentage of delayed flights and San.Francisco has the highest percentage. For such a busy airport, you would think Phoenix would have the MOST delays. It could be that they have earned so much business due to their lower percentage of delayed flights or it could be that when the city invested in such a large airport, they made sure to hire the world’s best logistical analysts to run the place!