Load Libraries

I used dplyr and tidyr since they are required for this assignment, and ggplot2 to create a basic bar plot summarising the flight delay information.

require(dplyr)
require(tidyr)
require(ggplot2)

Reading Data into R

For this assignment, I simply created a .csv file that looked identical to the given chart in the assignment description, nested rows, blank cells and all. The file is hosted in my GitHub repository, so we’ll just do a simple connection, read the data, and then set that to an object called delays. Note that I set the header variable to true, so that the given cities are the column headers (the first two columns will be blank).

# open file
path <- ("https://raw.githubusercontent.com/Logan213/DATA607_Week6/master/airlinedly.csv")
con <- file(path, open="r")
delays <- read.csv(con, header = TRUE, stringsAsFactors = FALSE)

# close file
close(con)

delays
##         X     X.1  LA  PHX Sandiego SanFran Seattle
## 1  Alaska On Time 497  221      212     503    1841
## 2         Delayed  62   12       20     102     305
## 3                  NA   NA       NA      NA      NA
## 4 AM West On Time 694 4840      383     320     201
## 5         Delayed 117  415       65     129      61

Reshaping the Data

I played around with tidyr and dplyr for quite a bit, hoping to somehow rotate and fill in the blank cells I created. In the end, I set the names for my two blank column headers, dropped the empty row using slice, and then filled in the empty Airline “cells” with the appropriate names using a simple assignment.

# name empty column headers
names(delays)[1:2] <- c("Airline", "Status")

# drop empty row
delays <- slice(delays, -3)
delays[2,1] <- "Alaska"
delays[4,1] <- "AM West"

delays
##   Airline  Status  LA  PHX Sandiego SanFran Seattle
## 1  Alaska On Time 497  221      212     503    1841
## 2  Alaska Delayed  62   12       20     102     305
## 3 AM West On Time 694 4840      383     320     201
## 4 AM West Delayed 117  415       65     129      61

Below is the “piped” R code I used to reshape the data frame from wide to long format. I wanted each airline, city and flight count to be it’s own row. I then used mutate to add a Total Flights column, and then again to get a percent of delayed flights.

delays <- delays %>% 
  gather(City, Flights, LA:Seattle) %>%
  arrange(Airline, Status) %>%
  spread(Status, Flights) %>%
  mutate(Tot_Flights = Delayed + `On Time`) %>%
  mutate(Pct_Delayed = Tot_Flights / Delayed)

# sort by City to easily compare % of flights delayed
arrange(delays, City)
##    Airline     City Delayed On Time Tot_Flights Pct_Delayed
## 1   Alaska       LA      62     497         559    9.016129
## 2  AM West       LA     117     694         811    6.931624
## 3   Alaska      PHX      12     221         233   19.416667
## 4  AM West      PHX     415    4840        5255   12.662651
## 5   Alaska Sandiego      20     212         232   11.600000
## 6  AM West Sandiego      65     383         448    6.892308
## 7   Alaska  SanFran     102     503         605    5.931373
## 8  AM West  SanFran     129     320         449    3.480620
## 9   Alaska  Seattle     305    1841        2146    7.036066
## 10 AM West  Seattle      61     201         262    4.295082

Comparing Alaska and AM West

To get a total of the three columns containing the number of flights, I filtered by airline, and then used summarise_each. I did not include the Pct_Delayed that I created previously, because summing this would not calculate the correct % of delayed flights for that airline. Instead, I summarised first, and then simply added a new column using mutate again.

# total flights and % delayed for Alaska
delays %>%
  filter(Airline == "Alaska") %>%
  select(Delayed:Tot_Flights) %>%
  summarise_each(funs(sum)) %>%
  mutate(Pct_Delayed_All = Delayed / Tot_Flights)
##   Delayed On Time Tot_Flights Pct_Delayed_All
## 1     501    3274        3775       0.1327152
# total flights and % delayed for AM West
delays %>%
  filter(Airline == "AM West") %>%
  select(Delayed:Tot_Flights) %>%
  summarise_each(funs(sum)) %>%
  mutate(Pct_Delayed_All = Delayed / Tot_Flights)
##   Delayed On Time Tot_Flights Pct_Delayed_All
## 1     787    6438        7225       0.1089273

Upon first glance of the data, it looks like Alaska Airlines has a much better performance, as three of the five destination cities have less than 100 delayed flights. However, when we look at the overall performance for the airlines, we can see that AM West, despite having 400+ delayed flights in Phoenix, is actually the more efficient carrier. AM West has more delays, but also conducts more flights, with a lower ratio of delayed flights than Alaska.

Plotting by City

Just to get a visual, let’s plot the data in the transformed delays data frame. The following will create a plot that separates the data by airline, color codes it by city, and shows the percent of flights delayed by bar height.

ggplot(data=delays, aes(Airline, Pct_Delayed, fill=City)) + geom_bar(stat="identity", position="dodge") + scale_fill_brewer()

I looks like if we were going to choose an Airline to fly to a West Coast destination, AM West would be the better bet to get you there on time across the board.

Summarize Delayed Flights by Airline

Lastly, we can get a quick side-by-side comparison of the average delayed flight percentage for each airline by using the summarise function, grouping by airline, and calculating the mean percent of delayed flights. Note, this is not the same as the above where all flights are summed and the delays are divided by the total, rather the average % of flights delayed across the five destinations.

summarise(group_by(delays, Airline), mean(Pct_Delayed))
## Source: local data frame [2 x 2]
## 
##   Airline mean(Pct_Delayed)
##     (chr)             (dbl)
## 1  Alaska         10.600047
## 2 AM West          6.852457