1 Libraries

This script uses dplyr, tidyr, and ggplot from the tidyverse family of packages.

library(tidyverse)
library(magrittr)

2 Import csv dataset

Data ‘’Assignment4.csv’’can be obtained from GitHub repository https://github.com/robertwelk/DATA607_Assignment4.git

# The file is read in as a csv, with strings preserved as characters rather than factors. Two of the columns are renamed for clarity and ease in downstream processes. 
data <- read.csv('Assignment4.csv', stringsAsFactors = F) %>% 
        rename(carrier=X, flight.status=X.1) %>% 
        as_tibble()
data
## # A tibble: 5 x 7
##   carrier flight.status Los.Angeles Pheonix San.Diego San.Fransisco Seattle
##   <chr>   <chr>               <int>   <int>     <int>         <int>   <int>
## 1 Alaska  on time               497     221       212           503    1841
## 2 ""      delayed                62      12        20           102     305
## 3 ""      ""                     NA      NA        NA            NA      NA
## 4 AM West on time               694    4840       383           230     201
## 5 ""      delayed               117     415        65           129      61

3 Wrangle data into tidy format

Each column should be a variable and each row an observation. The data in its original format is considered to be wide as can be seen by the cross tabulation between carrier and individual cities. The city names are themselves part of one variable yet are given unique columns. Using tidyr, the function ‘gather()’ can be used to create a new column called ‘city’ with each displaying a unique value while preserving their associated counts.
For this data it makes sense to keep serparate the counts of flights for on-time and delayed, as they are each a distinct variable and should be given their own column. The ‘spread()’ is then used to create a sperate variable for on-time counts and delayed flight counts for each city for each carrier. The use of dplyr piping allows for readable code even while grouping processing steps together.

#add airline name to column 1
data[c(2,5),1] <- c('Alaska','AM West')

# make tidy by putting into long format, and remove blank rows
data.tidy <- data %>% 
            gather(city, count, 3:7) %>% 
            filter(!is.na(count)) %>% 
            spread(flight.status,count) 
data.tidy
## # A tibble: 10 x 4
##    carrier city          delayed `on time`
##    <chr>   <chr>           <int>     <int>
##  1 Alaska  Los.Angeles        62       497
##  2 Alaska  Pheonix            12       221
##  3 Alaska  San.Diego          20       212
##  4 Alaska  San.Fransisco     102       503
##  5 Alaska  Seattle           305      1841
##  6 AM West Los.Angeles       117       694
##  7 AM West Pheonix           415      4840
##  8 AM West San.Diego          65       383
##  9 AM West San.Fransisco     129       230
## 10 AM West Seattle            61       201

4 Arrival time analysis

To compare arrival delays between the two carriers, it is convenient to use a ratio describing the average number of on-time flight per a delayed flight. This calcualtion can be made with ease due to the format of the now tidy data set. Comparison is easy to make with a dodged barplot that displays much of the information (carrier, city, on-time/delayed flight counts) in one graph. As can be seen, Alaska has a higher percentage of flight that are on time in each of the 5 cities compared to AM West.

data.tidy %<>% 
  mutate(ratio = `on time`/`delayed`) %>% 
  mutate(total = `on time` + `delayed`)
data.tidy
## # A tibble: 10 x 6
##    carrier city          delayed `on time` ratio total
##    <chr>   <chr>           <int>     <int> <dbl> <int>
##  1 Alaska  Los.Angeles        62       497  8.02   559
##  2 Alaska  Pheonix            12       221 18.4    233
##  3 Alaska  San.Diego          20       212 10.6    232
##  4 Alaska  San.Fransisco     102       503  4.93   605
##  5 Alaska  Seattle           305      1841  6.04  2146
##  6 AM West Los.Angeles       117       694  5.93   811
##  7 AM West Pheonix           415      4840 11.7   5255
##  8 AM West San.Diego          65       383  5.89   448
##  9 AM West San.Fransisco     129       230  1.78   359
## 10 AM West Seattle            61       201  3.30   262
ggplot(data.tidy, aes(x=city,y=ratio,fill=carrier)) + 
        geom_bar(stat='identity',position = 'dodge')