Creating the CSV & Reading It In

Our first step to create our csv file, which I’ve created as an exact replica of the image in our assignment description. Our next step is to load in our libraries (dplyr, tidyr) and then read in the messy csv file. We’ll also load ggplot2 for the analysis portion of this assignment and the “zoo” library for reasons I’ll explain later.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
flights_mess <- read.csv('flights.csv')
flights_mess
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Manipulating and Tidying the Data

From the above dataframe, we can see that it’s quite messy. First we’ll rearrange the data from “wide-to-long” using tidyr’s gather() function. Let’s take a brief look to see what it looks like now.

flights <- gather(flights_mess, "City", "n", 3:7)

flights
##          X     X.1          City    n
## 1   ALASKA on time   Los.Angeles  497
## 2          delayed   Los.Angeles   62
## 3                    Los.Angeles   NA
## 4  AM WEST on time   Los.Angeles  694
## 5          delayed   Los.Angeles  117
## 6   ALASKA on time       Phoenix  221
## 7          delayed       Phoenix   12
## 8                        Phoenix   NA
## 9  AM WEST on time       Phoenix 4840
## 10         delayed       Phoenix  415
## 11  ALASKA on time     San.Diego  212
## 12         delayed     San.Diego   20
## 13                     San.Diego   NA
## 14 AM WEST on time     San.Diego  383
## 15         delayed     San.Diego   65
## 16  ALASKA on time San.Francisco  503
## 17         delayed San.Francisco  102
## 18                 San.Francisco   NA
## 19 AM WEST on time San.Francisco  320
## 20         delayed San.Francisco  129
## 21  ALASKA on time       Seattle 1841
## 22         delayed       Seattle  305
## 23                       Seattle   NA
## 24 AM WEST on time       Seattle  201
## 25         delayed       Seattle   61

We can see that because we had the empty row spacer in the messy data set, there are null rows for our count. We can filter that out. While we’re at it, let’s also rename our airline and flight status columns appropriately while using dplyr’s pipes.

flights <- flights %>%
  filter(!is.na(n)) %>%
  rename(Airline = X, Status = X.1)
flights
##    Airline  Status          City    n
## 1   ALASKA on time   Los.Angeles  497
## 2          delayed   Los.Angeles   62
## 3  AM WEST on time   Los.Angeles  694
## 4          delayed   Los.Angeles  117
## 5   ALASKA on time       Phoenix  221
## 6          delayed       Phoenix   12
## 7  AM WEST on time       Phoenix 4840
## 8          delayed       Phoenix  415
## 9   ALASKA on time     San.Diego  212
## 10         delayed     San.Diego   20
## 11 AM WEST on time     San.Diego  383
## 12         delayed     San.Diego   65
## 13  ALASKA on time San.Francisco  503
## 14         delayed San.Francisco  102
## 15 AM WEST on time San.Francisco  320
## 16         delayed San.Francisco  129
## 17  ALASKA on time       Seattle 1841
## 18         delayed       Seattle  305
## 19 AM WEST on time       Seattle  201
## 20         delayed       Seattle   61

Now we have one more thing to fix. For every “delayed” status, we have no value for airline, which is an artifact of the messy dataset. Let’s fix that. What we need to do is fill each empty cell with the previous cell that is not empty to appropriately assign the right airline to the delayed status. I couldn’t find anything that could do this specifically with tidyr or dplyr, so I went to the internet and found a library called “zoo”. I use na.locf(.) to solve my problem. Before I use that function, I still need to fill the empty Airline cells in the dataframe with null values.

flights[flights$Airline=="","Airline"] <- NA
flights <- flights %>%
  do(na.locf(.))
flights
##    Airline  Status          City    n
## 1   ALASKA on time   Los.Angeles  497
## 2   ALASKA delayed   Los.Angeles   62
## 3  AM WEST on time   Los.Angeles  694
## 4  AM WEST delayed   Los.Angeles  117
## 5   ALASKA on time       Phoenix  221
## 6   ALASKA delayed       Phoenix   12
## 7  AM WEST on time       Phoenix 4840
## 8  AM WEST delayed       Phoenix  415
## 9   ALASKA on time     San.Diego  212
## 10  ALASKA delayed     San.Diego   20
## 11 AM WEST on time     San.Diego  383
## 12 AM WEST delayed     San.Diego   65
## 13  ALASKA on time San.Francisco  503
## 14  ALASKA delayed San.Francisco  102
## 15 AM WEST on time San.Francisco  320
## 16 AM WEST delayed San.Francisco  129
## 17  ALASKA on time       Seattle 1841
## 18  ALASKA delayed       Seattle  305
## 19 AM WEST on time       Seattle  201
## 20 AM WEST delayed       Seattle   61

Analysis of Flight Status

Now that our data is in the long format, we can feed the data into ggplot2 for some simple data analysis. From viewing the dataset, it would be intuitive to compare flight status (delayed or on time) between the airlines by city. Let’s do that using two side-by-side bar plots.

f <- ggplot(flights, aes(x = City,y = n))
f + geom_bar(stat = "sum", position="dodge", aes(fill = Status)) +
  guides(colour = "colorbar",size = "none") +
  facet_grid(. ~ Airline) +
  theme(axis.title.y = element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1))

This is nice and gives us a basic picture, but doesn’t answer the question posed by the assignment. The assignment is asking us to be able to compare delayed arrival times for the airlines. This would be best accomplished by looking at the proportion of delayed arrivals over all arrivals by city and airline. While thinking about this, I realize I could still do a better job rearranging the data for this type of analysis.

So, let’s go back to the data frame, arrange it so that delayed and on time have their own columns, and add some calculated columns to be able to better plot our data in ggplot2 for proportion.

flights <- flights %>%
  spread(Status,n) %>%
  mutate(total = delayed + `on time`, `delay rate` = delayed / total)

flights
##    Airline          City delayed on time total delay rate
## 1   ALASKA   Los.Angeles      62     497   559 0.11091234
## 2   ALASKA       Phoenix      12     221   233 0.05150215
## 3   ALASKA     San.Diego      20     212   232 0.08620690
## 4   ALASKA San.Francisco     102     503   605 0.16859504
## 5   ALASKA       Seattle     305    1841  2146 0.14212488
## 6  AM WEST   Los.Angeles     117     694   811 0.14426634
## 7  AM WEST       Phoenix     415    4840  5255 0.07897241
## 8  AM WEST     San.Diego      65     383   448 0.14508929
## 9  AM WEST San.Francisco     129     320   449 0.28730512
## 10 AM WEST       Seattle      61     201   262 0.23282443

Now our dataset is prepared for delay rate analysis between the airlines by city.

delays <- ggplot(flights,aes(x = City, y = `delay rate`))
delays + geom_bar(stat = "identity", position="dodge", aes(fill = Airline))

From the visualization above, it seems clear that for every city, AM West has proportionally more delayed arrivals than Alaska airlines.

Conclusion

If we were to decide on which is the better airline I would argue that Alaska Airlines would be the better choice. It’s true that there is a higher frequency of flights with AM West suggesting that there is more choice with that airline. However, I would personally still prefer a higher likelihood that I would arrive on time with Alaska Airlines.