DATA 607 - Week 5 Assignment

1. Create a CSV file that includes the flight delay information
2. Read the CSV file into R, and use tidyr and dplyr as needed to tidy and transform the data
3. Perform analysis to compare the arrival delays for the two airlines
- Conclusions

1. Create a CSV file that includes the flight delay information

The CSV file was created from a MySQL script, which is saved in GitHub at:
https://github.com/kecbenson/DATA_607_Wk5/blob/master/DATA607_Wk5_airlines.sql

The output CSV file from the MySQL script is saved in GitHub at:
https://github.com/kecbenson/DATA_607_Wk5/blob/master/arrival_delays.csv

2. Read the CSV file into R, and use `tidyr` and `dplyr` as needed to tidy and transform the data

First, let’s read in the CSV file and review the data.

library(tidyverse)
library(knitr)

url <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Wk5/master/arrival_delays.csv"
raw <- read_csv(url)
kable(raw)

Airline	On Time/Delayed	Los Angeles	Phoenix	San Diego	San Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
ALASKA	delayed	62	12	20	102	305
AM WEST	on time	694	4840	383	320	201
AM WEST	delayed	117	415	65	129	61

Second, let’s use gather and spread functions to create a tidy data frame.

df1 <- raw %>% 
        # gather flight counts into destination and frequency columns
        gather("Dest", "Freq", 3:7) %>%  
        # spread delayed and ontime frequency into separate columns
        spread(2, 4) %>% 
        # rename delayed and ontime columns
        rename(Delayed = delayed, OnTime = 'on time') 
df1

## # A tibble: 10 x 4
##    Airline Dest          Delayed OnTime
##    <chr>   <chr>           <int>  <int>
##  1 ALASKA  Los Angeles        62    497
##  2 ALASKA  Phoenix            12    221
##  3 ALASKA  San Diego          20    212
##  4 ALASKA  San Francisco     102    503
##  5 ALASKA  Seattle           305   1841
##  6 AM WEST Los Angeles       117    694
##  7 AM WEST Phoenix           415   4840
##  8 AM WEST San Diego          65    383
##  9 AM WEST San Francisco     129    320
## 10 AM WEST Seattle            61    201

Third, let’s use mutate and arrange functions to create additional variables and re-order the data frame.

df2 <- df1 %>% 
        # add total flight count, % delayed and % ontime
        mutate(Total = Delayed + OnTime, PctDelayed = round(100 * Delayed / Total, 2), 
               PctOnTime = round(100 * OnTime / Total, 2)) %>% 
        # re-arrange rows by destination then airline
        arrange(Dest, Airline)
df2

## # A tibble: 10 x 7
##    Airline Dest          Delayed OnTime Total PctDelayed PctOnTime
##    <chr>   <chr>           <int>  <int> <int>      <dbl>     <dbl>
##  1 ALASKA  Los Angeles        62    497   559      11.1       88.9
##  2 AM WEST Los Angeles       117    694   811      14.4       85.6
##  3 ALASKA  Phoenix            12    221   233       5.15      94.8
##  4 AM WEST Phoenix           415   4840  5255       7.9       92.1
##  5 ALASKA  San Diego          20    212   232       8.62      91.4
##  6 AM WEST San Diego          65    383   448      14.5       85.5
##  7 ALASKA  San Francisco     102    503   605      16.9       83.1
##  8 AM WEST San Francisco     129    320   449      28.7       71.3
##  9 ALASKA  Seattle           305   1841  2146      14.2       85.8
## 10 AM WEST Seattle            61    201   262      23.3       76.7

Finally, let’s use group_by and bind_rows functions to summarize the delay data by airline, and then append to the end of the data frame.

df3 <- df2 %>% 
        # group by airline
        group_by(Airline) %>% 
        # summarize data by airline group
        summarize(Dest = "Total", Delayed = sum(Delayed), OnTime = sum(OnTime), Total = sum(Total), 
                  PctDelayed = round(100 * Delayed / Total, 2), PctOnTime = round(100 * OnTime / Total, 2))
# append summary rows at bottom of dataframe
df4 <- bind_rows(df2, df3)
kable(df4)

Airline	Dest	Delayed	OnTime	Total	PctDelayed	PctOnTime
ALASKA	Los Angeles	62	497	559	11.09	88.91
AM WEST	Los Angeles	117	694	811	14.43	85.57
ALASKA	Phoenix	12	221	233	5.15	94.85
AM WEST	Phoenix	415	4840	5255	7.90	92.10
ALASKA	San Diego	20	212	232	8.62	91.38
AM WEST	San Diego	65	383	448	14.51	85.49
ALASKA	San Francisco	102	503	605	16.86	83.14
AM WEST	San Francisco	129	320	449	28.73	71.27
ALASKA	Seattle	305	1841	2146	14.21	85.79
AM WEST	Seattle	61	201	262	23.28	76.72
ALASKA	Total	501	3274	3775	13.27	86.73
AM WEST	Total	787	6438	7225	10.89	89.11

3. Perform analysis to compare the arrival delays for the two airlines

We start by comparing the flight count for each airline, broken out by destination. Several observations can be made:

AM WEST has substantially more flights into Phoenix than ALASKA (5255 vs. 233)
ALASKA has substantially more flights into Seattle than AM WEST (2146 vs. 262)
For the other destinations, ALASKA and AM WEST have comparable flight counts
Overall, AM WEST has almost double the total number of flights compared to ALASKA (7225 vs 3775).

ggplot(df4) + geom_bar(aes(x = Airline, y = Total, fill = Airline), stat = "identity") + 
                facet_wrap(~ Dest, nrow = 2)

Next we compare the percentage of flights delayed for each airline, by destination. Two observations are apparent:

AM WEST has a higher delay percentage than ALASKA for each of the five destinations
However, AM WEST has a lower delay percentage than ALASKA across all five destinations in total (10.9% vs. 13.3%).

ggplot(df4) + geom_bar(aes(x = Airline, y = PctDelayed, fill = Airline), stat = "identity") + 
                facet_wrap(~ Dest, nrow = 2)

Conclusions

This exercise demonstrates that comparing (a) average statistics across subsets to (b) aggregate statistics across the entire dataset can be misleading. It is important to keep in mind the relative sizes of the subsets when making such comparisons. In particular, for the flight delay dataset:

AM WEST has a higher delay percentage than ALASKA for each of the destinations, yet across all destinations in total, AM WEST has a lower total delay percentage.
This is explained by the flight counts of each airline into each destination:
- The average across all destinations for AM WEST is heavily weighted by the statistics for Phoenix, which accounts for 73% (5255 / 7225) of the total AM WEST flights. In Phoenix AM WEST has a relatively lower delay percentage of 7.9%.
- The average across all destinations for ALASKA is heavily weighted by the statistics for Seattle, which accounts for 57% (2146 / 3775) of the total ALASKA flights. In Seattle ALASKA has a relatively higher delay percentage of 14.2%.
In aggregating across all flights for each airline, the average delay percentage is implicitly weighted by the flight counts into each destination. So even though AM WEST has a higher delay percentage than ALASKA for each destination individually, AM WEST’s overall average is heavily weighted by Phoenix, which results in a lower delay percentage than ALASKA across all destinations in total.

DATA 607 - Week 5 Assignment

Kevin Benson

September 25, 2018

1. Create a CSV file that includes the flight delay information

2. Read the CSV file into R, and use `tidyr` and `dplyr` as needed to tidy and transform the data

3. Perform analysis to compare the arrival delays for the two airlines

Conclusions

DATA 607 - Week 5 Assignment

Kevin Benson

September 25, 2018

1. Create a CSV file that includes the flight delay information

2. Read the CSV file into R, and use tidyr and dplyr as needed to tidy and transform the data

3. Perform analysis to compare the arrival delays for the two airlines

Conclusions

2. Read the CSV file into R, and use `tidyr` and `dplyr` as needed to tidy and transform the data