Creating the CSV & Reading It In
Our first step to create our csv file, which I’ve created as an exact replica of the image in our assignment description. Our next step is to load in our libraries (dplyr, tidyr) and then read in the messy csv file. We’ll also load ggplot2 for the analysis portion of this assignment and the “zoo” library for reasons I’ll explain later.
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(zoo)##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
flights_mess <- read.csv('flights.csv')
flights_mess## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Manipulating and Tidying the Data
From the above dataframe, we can see that it’s quite messy. First we’ll rearrange the data from “wide-to-long” using tidyr’s gather() function. Let’s take a brief look to see what it looks like now.
flights <- gather(flights_mess, "City", "n", 3:7)
flights## X X.1 City n
## 1 ALASKA on time Los.Angeles 497
## 2 delayed Los.Angeles 62
## 3 Los.Angeles NA
## 4 AM WEST on time Los.Angeles 694
## 5 delayed Los.Angeles 117
## 6 ALASKA on time Phoenix 221
## 7 delayed Phoenix 12
## 8 Phoenix NA
## 9 AM WEST on time Phoenix 4840
## 10 delayed Phoenix 415
## 11 ALASKA on time San.Diego 212
## 12 delayed San.Diego 20
## 13 San.Diego NA
## 14 AM WEST on time San.Diego 383
## 15 delayed San.Diego 65
## 16 ALASKA on time San.Francisco 503
## 17 delayed San.Francisco 102
## 18 San.Francisco NA
## 19 AM WEST on time San.Francisco 320
## 20 delayed San.Francisco 129
## 21 ALASKA on time Seattle 1841
## 22 delayed Seattle 305
## 23 Seattle NA
## 24 AM WEST on time Seattle 201
## 25 delayed Seattle 61
We can see that because we had the empty row spacer in the messy data set, there are null rows for our count. We can filter that out. While we’re at it, let’s also rename our airline and flight status columns appropriately while using dplyr’s pipes.
flights <- flights %>%
filter(!is.na(n)) %>%
rename(Airline = X, Status = X.1)
flights## Airline Status City n
## 1 ALASKA on time Los.Angeles 497
## 2 delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 delayed Phoenix 415
## 9 ALASKA on time San.Diego 212
## 10 delayed San.Diego 20
## 11 AM WEST on time San.Diego 383
## 12 delayed San.Diego 65
## 13 ALASKA on time San.Francisco 503
## 14 delayed San.Francisco 102
## 15 AM WEST on time San.Francisco 320
## 16 delayed San.Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 delayed Seattle 61
Now we have one more thing to fix. For every “delayed” status, we have no value for airline, which is an artifact of the messy dataset. Let’s fix that. What we need to do is fill each empty cell with the previous cell that is not empty to appropriately assign the right airline to the delayed status. I couldn’t find anything that could do this specifically with tidyr or dplyr, so I went to the internet and found a library called “zoo”. I use na.locf(.) to solve my problem. Before I use that function, I still need to fill the empty Airline cells in the dataframe with null values.
flights[flights$Airline=="","Airline"] <- NA
flights <- flights %>%
do(na.locf(.))
flights## Airline Status City n
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San.Diego 212
## 10 ALASKA delayed San.Diego 20
## 11 AM WEST on time San.Diego 383
## 12 AM WEST delayed San.Diego 65
## 13 ALASKA on time San.Francisco 503
## 14 ALASKA delayed San.Francisco 102
## 15 AM WEST on time San.Francisco 320
## 16 AM WEST delayed San.Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
Analysis of Flight Status
Now that our data is in the long format, we can feed the data into ggplot2 for some simple data analysis. From viewing the dataset, it would be intuitive to compare flight status (delayed or on time) between the airlines by city. Let’s do that using two side-by-side bar plots.
f <- ggplot(flights, aes(x = City,y = n))
f + geom_bar(stat = "sum", position="dodge", aes(fill = Status)) +
guides(colour = "colorbar",size = "none") +
facet_grid(. ~ Airline) +
theme(axis.title.y = element_blank(),
axis.text.x = element_text(angle = 90, hjust = 1))This is nice and gives us a basic picture, but doesn’t answer the question posed by the assignment. The assignment is asking us to be able to compare delayed arrival times for the airlines. This would be best accomplished by looking at the proportion of delayed arrivals over all arrivals by city and airline. While thinking about this, I realize I could still do a better job rearranging the data for this type of analysis.
So, let’s go back to the data frame, arrange it so that delayed and on time have their own columns, and add some calculated columns to be able to better plot our data in ggplot2 for proportion.
flights <- flights %>%
spread(Status,n) %>%
mutate(total = delayed + `on time`, `delay rate` = delayed / total)
flights## Airline City delayed on time total delay rate
## 1 ALASKA Los.Angeles 62 497 559 0.11091234
## 2 ALASKA Phoenix 12 221 233 0.05150215
## 3 ALASKA San.Diego 20 212 232 0.08620690
## 4 ALASKA San.Francisco 102 503 605 0.16859504
## 5 ALASKA Seattle 305 1841 2146 0.14212488
## 6 AM WEST Los.Angeles 117 694 811 0.14426634
## 7 AM WEST Phoenix 415 4840 5255 0.07897241
## 8 AM WEST San.Diego 65 383 448 0.14508929
## 9 AM WEST San.Francisco 129 320 449 0.28730512
## 10 AM WEST Seattle 61 201 262 0.23282443
Now our dataset is prepared for delay rate analysis between the airlines by city.
delays <- ggplot(flights,aes(x = City, y = `delay rate`))
delays + geom_bar(stat = "identity", position="dodge", aes(fill = Airline))From the visualization above, it seems clear that for every city, AM West has proportionally more delayed arrivals than Alaska airlines.
Conclusion
If we were to decide on which is the better airline I would argue that Alaska Airlines would be the better choice. It’s true that there is a higher frequency of flights with AM West suggesting that there is more choice with that airline. However, I would personally still prefer a higher likelihood that I would arrive on time with Alaska Airlines.