Load and transform data

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'dplyr' was built under R version 3.4.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
untidy <- read.csv(url("https://raw.githubusercontent.com/agCS/DATA607/master/flights%20wk5.csv"), stringsAsFactors = FALSE, header = TRUE)
# remove blank row
untidy <- untidy[-3,]
untidy
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2  Alaska delayed          62      12        20           102     305
## 4 AM West on time         694    4840       383           320     201
## 5 AM West delayed         117     415        65           129      61

The city represents destination variable’s observations so I need to move it to rows by using gather(). ‘delayed’ and ‘on time’ are variables as well that need to be spread() into columns.

tidy1 <- untidy %>%
  gather(Destination, Count, 3:7)
tidy2 <- tidy1 %>%
  spread('X.1',Count)
# give column name to carrier
tidyFinal <- rename(tidy2, Carrier=X)
# display tidy data
tidyFinal
##    Carrier   Destination delayed on time
## 1   Alaska   Los.Angeles      62     497
## 2   Alaska       Phoenix      12     221
## 3   Alaska     San.Diego      20     212
## 4   Alaska San.Francisco     102     503
## 5   Alaska       Seattle     305    1841
## 6  AM West   Los.Angeles     117     694
## 7  AM West       Phoenix     415    4840
## 8  AM West     San.Diego      65     383
## 9  AM West San.Francisco     129     320
## 10 AM West       Seattle      61     201

Perform analysis to compare the arrival delays for the two airlines

I’d like to quickly see how total delays compare by two carriers:

ggplot(data = tidyFinal) +
  geom_col(mapping = aes(x = Carrier, y = delayed))

I see AM West has higher number of delays overall. Looking closer at averages, AM West has on average 57 more delayed flights.

(avg <- tidyFinal %>%
  group_by(Carrier) %>%
  summarize(avg = mean(delayed)))
## # A tibble: 2 x 2
##   Carrier   avg
##     <chr> <dbl>
## 1  Alaska 100.2
## 2 AM West 157.4

When comparing delays per destination, however, we see that although AM West has higher total delays, Alaska airline has much higher number of arrival delays in Seattle. Does this make sense? Perhaps, this could be explained by availability of routes by each carrier, for example, it’s possible that AM West flies to Seattle more frequently than AM West. In my case, however, i suspect the graphs are misleading as the numbers are not comparable in proportion to all flights by the carrier.

ggplot(data = tidyFinal) +
  geom_col(
    mapping = aes(x = Destination, y = delayed, fill = Carrier),
    position = "dodge")

So, let’s instead calculate proportions of delayed flights for each destination by carrier and average for delay in proportion to all flights. I will use mutate() function from dplyr to calculate and store the value in a new column.

tidyFinal <- mutate(tidyFinal, 
                     delayProp = tidyFinal$delayed / (tidyFinal$delayed + tidyFinal$'on time'))
ggplot(data = tidyFinal) +
  geom_col(
    mapping = aes(x = Destination, y = delayProp, fill = Carrier),
    position = "dodge")

(avg_prop <- tidyFinal %>%
  group_by(Carrier) %>%
  summarize(avg_prop = mean(delayProp)))
## # A tibble: 2 x 2
##   Carrier  avg_prop
##     <chr>     <dbl>
## 1  Alaska 0.1118683
## 2 AM West 0.1776915

Plotting proportions allows us to see that AM West has higher arrival delays at each destniations. Overall, AM West on average has ~7% more delays.