I used dplyr
and tidyr
since they are required for this assignment, and ggplot2
to create a basic bar plot summarising the flight delay information.
require(dplyr)
require(tidyr)
require(ggplot2)
For this assignment, I simply created a .csv file that looked identical to the given chart in the assignment description, nested rows, blank cells and all. The file is hosted in my GitHub repository, so we’ll just do a simple connection, read the data, and then set that to an object called delays
. Note that I set the header variable to true, so that the given cities are the column headers (the first two columns will be blank).
# open file
path <- ("https://raw.githubusercontent.com/Logan213/DATA607_Week6/master/airlinedly.csv")
con <- file(path, open="r")
delays <- read.csv(con, header = TRUE, stringsAsFactors = FALSE)
# close file
close(con)
delays
## X X.1 LA PHX Sandiego SanFran Seattle
## 1 Alaska On Time 497 221 212 503 1841
## 2 Delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM West On Time 694 4840 383 320 201
## 5 Delayed 117 415 65 129 61
I played around with tidyr
and dplyr
for quite a bit, hoping to somehow rotate and fill in the blank cells I created. In the end, I set the names for my two blank column headers, dropped the empty row using slice
, and then filled in the empty Airline “cells” with the appropriate names using a simple assignment.
# name empty column headers
names(delays)[1:2] <- c("Airline", "Status")
# drop empty row
delays <- slice(delays, -3)
delays[2,1] <- "Alaska"
delays[4,1] <- "AM West"
delays
## Airline Status LA PHX Sandiego SanFran Seattle
## 1 Alaska On Time 497 221 212 503 1841
## 2 Alaska Delayed 62 12 20 102 305
## 3 AM West On Time 694 4840 383 320 201
## 4 AM West Delayed 117 415 65 129 61
Below is the “piped” R code I used to reshape the data frame from wide to long format. I wanted each airline, city and flight count to be it’s own row. I then used mutate
to add a Total Flights column, and then again to get a percent of delayed flights.
delays <- delays %>%
gather(City, Flights, LA:Seattle) %>%
arrange(Airline, Status) %>%
spread(Status, Flights) %>%
mutate(Tot_Flights = Delayed + `On Time`) %>%
mutate(Pct_Delayed = Tot_Flights / Delayed)
# sort by City to easily compare % of flights delayed
arrange(delays, City)
## Airline City Delayed On Time Tot_Flights Pct_Delayed
## 1 Alaska LA 62 497 559 9.016129
## 2 AM West LA 117 694 811 6.931624
## 3 Alaska PHX 12 221 233 19.416667
## 4 AM West PHX 415 4840 5255 12.662651
## 5 Alaska Sandiego 20 212 232 11.600000
## 6 AM West Sandiego 65 383 448 6.892308
## 7 Alaska SanFran 102 503 605 5.931373
## 8 AM West SanFran 129 320 449 3.480620
## 9 Alaska Seattle 305 1841 2146 7.036066
## 10 AM West Seattle 61 201 262 4.295082
To get a total of the three columns containing the number of flights, I filtered by airline, and then used summarise_each
. I did not include the Pct_Delayed
that I created previously, because summing this would not calculate the correct % of delayed flights for that airline. Instead, I summarised first, and then simply added a new column using mutate
again.
# total flights and % delayed for Alaska
delays %>%
filter(Airline == "Alaska") %>%
select(Delayed:Tot_Flights) %>%
summarise_each(funs(sum)) %>%
mutate(Pct_Delayed_All = Delayed / Tot_Flights)
## Delayed On Time Tot_Flights Pct_Delayed_All
## 1 501 3274 3775 0.1327152
# total flights and % delayed for AM West
delays %>%
filter(Airline == "AM West") %>%
select(Delayed:Tot_Flights) %>%
summarise_each(funs(sum)) %>%
mutate(Pct_Delayed_All = Delayed / Tot_Flights)
## Delayed On Time Tot_Flights Pct_Delayed_All
## 1 787 6438 7225 0.1089273
Upon first glance of the data, it looks like Alaska Airlines has a much better performance, as three of the five destination cities have less than 100 delayed flights. However, when we look at the overall performance for the airlines, we can see that AM West, despite having 400+ delayed flights in Phoenix, is actually the more efficient carrier. AM West has more delays, but also conducts more flights, with a lower ratio of delayed flights than Alaska.
Just to get a visual, let’s plot the data in the transformed delays
data frame. The following will create a plot that separates the data by airline, color codes it by city, and shows the percent of flights delayed by bar height.
ggplot(data=delays, aes(Airline, Pct_Delayed, fill=City)) + geom_bar(stat="identity", position="dodge") + scale_fill_brewer()
I looks like if we were going to choose an Airline to fly to a West Coast destination, AM West would be the better bet to get you there on time across the board.
Lastly, we can get a quick side-by-side comparison of the average delayed flight percentage for each airline by using the summarise
function, grouping by airline, and calculating the mean percent of delayed flights. Note, this is not the same as the above where all flights are summed and the delays are divided by the total, rather the average % of flights delayed across the five destinations.
summarise(group_by(delays, Airline), mean(Pct_Delayed))
## Source: local data frame [2 x 2]
##
## Airline mean(Pct_Delayed)
## (chr) (dbl)
## 1 Alaska 10.600047
## 2 AM West 6.852457