Tidying and Reformatting

Call necessary packages

I will be working with tidyr and dplyr to do most of my data tidying and transformation as per the assignment instructions.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Load data

I wrote the data exactly as it was presented in the assignment instructions and loaded it into R using read.csv(). The data is written in a wide format as suggested. I set stringsAsFactors to False because it makes it easier to wrangle the data when it isn’t wrapped up into factors.

ArrivalsDelays = read.csv("ArrivalsDelays.csv",stringsAsFactors = F)
ArrivalsDelays
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Remove na values and rename columns

The data has a blank row with na values across all columns, so I removed that blank row using na.omit(). Also, column names were missing for the airline companies and whether a flight was delayed or on time, so I used rename() to rename the columns.

ArrivalsDelays = na.omit(ArrivalsDelays)
ArrivalsDelays = rename(ArrivalsDelays, Airline = X, IsOnTime = X.1)

ArrivalsDelays
##   Airline IsOnTime Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA  on time         497     221       212           503    1841
## 2          delayed          62      12        20           102     305
## 4 AM WEST  on time         694    4840       383           320     201
## 5          delayed         117     415        65           129      61

Fill in missing airlines

The raw table omits the airline name for the delayed flights to improve visual clarity when reading the table in a book. However, since I am doing analysis on the table, I need to reformat the Airline column to include the airline name for delayed flights. The code below subsets all unique non-blank airline names and fills in the Airline vector. The vector is then reattached to the dataframe.

Airlines = ArrivalsDelays$Airline[ArrivalsDelays$Airline != ""]
AirlineVector = c(rep(Airlines[1],2),rep(Airlines[2],2))

ArrivalsDelays$Airline = AirlineVector

ArrivalsDelays
##   Airline IsOnTime Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA  on time         497     221       212           503    1841
## 2  ALASKA  delayed          62      12        20           102     305
## 4 AM WEST  on time         694    4840       383           320     201
## 5 AM WEST  delayed         117     415        65           129      61

Gather the data

After filling in the blank values, removing empty rows, and renaming columns, I can finally gather the data together into a tidy format. I used the gather() function to reorganize the data into a tidy format.

ArrivalsDelays = gather(ArrivalsDelays,"City","Count",3:7)
ArrivalsDelays
##    Airline IsOnTime          City Count
## 1   ALASKA  on time   Los.Angeles   497
## 2   ALASKA  delayed   Los.Angeles    62
## 3  AM WEST  on time   Los.Angeles   694
## 4  AM WEST  delayed   Los.Angeles   117
## 5   ALASKA  on time       Phoenix   221
## 6   ALASKA  delayed       Phoenix    12
## 7  AM WEST  on time       Phoenix  4840
## 8  AM WEST  delayed       Phoenix   415
## 9   ALASKA  on time     San.Diego   212
## 10  ALASKA  delayed     San.Diego    20
## 11 AM WEST  on time     San.Diego   383
## 12 AM WEST  delayed     San.Diego    65
## 13  ALASKA  on time San.Francisco   503
## 14  ALASKA  delayed San.Francisco   102
## 15 AM WEST  on time San.Francisco   320
## 16 AM WEST  delayed San.Francisco   129
## 17  ALASKA  on time       Seattle  1841
## 18  ALASKA  delayed       Seattle   305
## 19 AM WEST  on time       Seattle   201
## 20 AM WEST  delayed       Seattle    61

Analysis

Which airline is more reliable to fly to specific cities?

I am interested in seeing which airline is more reliable to fly to each city. The way I am going to determine this is by calculating the rate of “on time” flights for each airline in each city. I will then compare them in a nice chart at the end.

Calculating the percentage of successful flights

The first step in creating this new metric is to group the data by Airline and by City and then summing the “on time” and “delayed” flights to create the TotalFlights metric.

The next step is to join the new dataframe onto the original one. This allows the calculation between count and TotalFlights to be easy using mutate().

The final step is to mutate() Count and TotalFlights to calculate the percentage of “on time” and “delayed” flights.

TotalFlights = 
  ArrivalsDelays %>%
  group_by(Airline,City) %>%
  summarize(TotalFlights = sum(Count))

ArrivalsDelays = left_join(ArrivalsDelays,TotalFlights, by = c("Airline","City"))

ArrivalsDelays = mutate(ArrivalsDelays, Percentage = Count/TotalFlights)

ArrivalsDelays
##    Airline IsOnTime          City Count TotalFlights Percentage
## 1   ALASKA  on time   Los.Angeles   497          559 0.88908766
## 2   ALASKA  delayed   Los.Angeles    62          559 0.11091234
## 3  AM WEST  on time   Los.Angeles   694          811 0.85573366
## 4  AM WEST  delayed   Los.Angeles   117          811 0.14426634
## 5   ALASKA  on time       Phoenix   221          233 0.94849785
## 6   ALASKA  delayed       Phoenix    12          233 0.05150215
## 7  AM WEST  on time       Phoenix  4840         5255 0.92102759
## 8  AM WEST  delayed       Phoenix   415         5255 0.07897241
## 9   ALASKA  on time     San.Diego   212          232 0.91379310
## 10  ALASKA  delayed     San.Diego    20          232 0.08620690
## 11 AM WEST  on time     San.Diego   383          448 0.85491071
## 12 AM WEST  delayed     San.Diego    65          448 0.14508929
## 13  ALASKA  on time San.Francisco   503          605 0.83140496
## 14  ALASKA  delayed San.Francisco   102          605 0.16859504
## 15 AM WEST  on time San.Francisco   320          449 0.71269488
## 16 AM WEST  delayed San.Francisco   129          449 0.28730512
## 17  ALASKA  on time       Seattle  1841         2146 0.85787512
## 18  ALASKA  delayed       Seattle   305         2146 0.14212488
## 19 AM WEST  on time       Seattle   201          262 0.76717557
## 20 AM WEST  delayed       Seattle    61          262 0.23282443

Making the plot

Below, I have written a little ggplot barchart comparing “on time” flight percentage in each city by airline. According to the plot, Alaska is more reliable across the board for every city.

ggplot(filter(ArrivalsDelays, IsOnTime == "on time")) +
  geom_bar(aes(x = City, y = Percentage, fill = Airline),stat = "identity", position = "dodge", width = 0.7) +
  labs(title = "On Time Flight Percentage by City")