1. Create a CSV file that includes the flight delay information

The CSV file was created from a MySQL script, which is saved in GitHub at:
https://github.com/kecbenson/DATA_607_Wk5/blob/master/DATA607_Wk5_airlines.sql

The output CSV file from the MySQL script is saved in GitHub at:
https://github.com/kecbenson/DATA_607_Wk5/blob/master/arrival_delays.csv

2. Read the CSV file into R, and use tidyr and dplyr as needed to tidy and transform the data

First, let’s read in the CSV file and review the data.

library(tidyverse)
library(knitr)

url <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Wk5/master/arrival_delays.csv"
raw <- read_csv(url)
kable(raw)
Airline On Time/Delayed Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61

Second, let’s use gather and spread functions to create a tidy data frame.

df1 <- raw %>% 
        # gather flight counts into destination and frequency columns
        gather("Dest", "Freq", 3:7) %>%  
        # spread delayed and ontime frequency into separate columns
        spread(2, 4) %>% 
        # rename delayed and ontime columns
        rename(Delayed = delayed, OnTime = 'on time') 
df1
## # A tibble: 10 x 4
##    Airline Dest          Delayed OnTime
##    <chr>   <chr>           <int>  <int>
##  1 ALASKA  Los Angeles        62    497
##  2 ALASKA  Phoenix            12    221
##  3 ALASKA  San Diego          20    212
##  4 ALASKA  San Francisco     102    503
##  5 ALASKA  Seattle           305   1841
##  6 AM WEST Los Angeles       117    694
##  7 AM WEST Phoenix           415   4840
##  8 AM WEST San Diego          65    383
##  9 AM WEST San Francisco     129    320
## 10 AM WEST Seattle            61    201

Third, let’s use mutate and arrange functions to create additional variables and re-order the data frame.

df2 <- df1 %>% 
        # add total flight count, % delayed and % ontime
        mutate(Total = Delayed + OnTime, PctDelayed = round(100 * Delayed / Total, 2), 
               PctOnTime = round(100 * OnTime / Total, 2)) %>% 
        # re-arrange rows by destination then airline
        arrange(Dest, Airline)
df2
## # A tibble: 10 x 7
##    Airline Dest          Delayed OnTime Total PctDelayed PctOnTime
##    <chr>   <chr>           <int>  <int> <int>      <dbl>     <dbl>
##  1 ALASKA  Los Angeles        62    497   559      11.1       88.9
##  2 AM WEST Los Angeles       117    694   811      14.4       85.6
##  3 ALASKA  Phoenix            12    221   233       5.15      94.8
##  4 AM WEST Phoenix           415   4840  5255       7.9       92.1
##  5 ALASKA  San Diego          20    212   232       8.62      91.4
##  6 AM WEST San Diego          65    383   448      14.5       85.5
##  7 ALASKA  San Francisco     102    503   605      16.9       83.1
##  8 AM WEST San Francisco     129    320   449      28.7       71.3
##  9 ALASKA  Seattle           305   1841  2146      14.2       85.8
## 10 AM WEST Seattle            61    201   262      23.3       76.7

Finally, let’s use group_by and bind_rows functions to summarize the delay data by airline, and then append to the end of the data frame.

df3 <- df2 %>% 
        # group by airline
        group_by(Airline) %>% 
        # summarize data by airline group
        summarize(Dest = "Total", Delayed = sum(Delayed), OnTime = sum(OnTime), Total = sum(Total), 
                  PctDelayed = round(100 * Delayed / Total, 2), PctOnTime = round(100 * OnTime / Total, 2))
# append summary rows at bottom of dataframe
df4 <- bind_rows(df2, df3)
kable(df4)
Airline Dest Delayed OnTime Total PctDelayed PctOnTime
ALASKA Los Angeles 62 497 559 11.09 88.91
AM WEST Los Angeles 117 694 811 14.43 85.57
ALASKA Phoenix 12 221 233 5.15 94.85
AM WEST Phoenix 415 4840 5255 7.90 92.10
ALASKA San Diego 20 212 232 8.62 91.38
AM WEST San Diego 65 383 448 14.51 85.49
ALASKA San Francisco 102 503 605 16.86 83.14
AM WEST San Francisco 129 320 449 28.73 71.27
ALASKA Seattle 305 1841 2146 14.21 85.79
AM WEST Seattle 61 201 262 23.28 76.72
ALASKA Total 501 3274 3775 13.27 86.73
AM WEST Total 787 6438 7225 10.89 89.11

3. Perform analysis to compare the arrival delays for the two airlines

We start by comparing the flight count for each airline, broken out by destination. Several observations can be made:

ggplot(df4) + geom_bar(aes(x = Airline, y = Total, fill = Airline), stat = "identity") + 
                facet_wrap(~ Dest, nrow = 2)

Next we compare the percentage of flights delayed for each airline, by destination. Two observations are apparent:

ggplot(df4) + geom_bar(aes(x = Airline, y = PctDelayed, fill = Airline), stat = "identity") + 
                facet_wrap(~ Dest, nrow = 2)

Conclusions

This exercise demonstrates that comparing (a) average statistics across subsets to (b) aggregate statistics across the entire dataset can be misleading. It is important to keep in mind the relative sizes of the subsets when making such comparisons. In particular, for the flight delay dataset:

  • AM WEST has a higher delay percentage than ALASKA for each of the destinations, yet across all destinations in total, AM WEST has a lower total delay percentage.

  • This is explained by the flight counts of each airline into each destination:

    • The average across all destinations for AM WEST is heavily weighted by the statistics for Phoenix, which accounts for 73% (5255 / 7225) of the total AM WEST flights. In Phoenix AM WEST has a relatively lower delay percentage of 7.9%.

    • The average across all destinations for ALASKA is heavily weighted by the statistics for Seattle, which accounts for 57% (2146 / 3775) of the total ALASKA flights. In Seattle ALASKA has a relatively higher delay percentage of 14.2%.

  • In aggregating across all flights for each airline, the average delay percentage is implicitly weighted by the flight counts into each destination. So even though AM WEST has a higher delay percentage than ALASKA for each destination individually, AM WEST’s overall average is heavily weighted by Phoenix, which results in a lower delay percentage than ALASKA across all destinations in total.