In this report we compare on-time performance between two carriers: AM West and Alaska Airlines. Our dataset is the ontime and delay frequency for multiple airports over the same sample period.

library(dplyr);library(tidyr);
library(ggplot2)
raw<-read.csv('~/Documents/CUNY/data_class/week4-tidyData/flights.csv')
raw_tbl<-tbl_df(raw)
#raw_tbl

For efficient analysis, we will transform the data into a long format.

#names(raw)
raw.c<-raw[complete.cases(raw),]
#raw_tbl[complete.cases(raw_tbl),]

We need to normalize the data and fill in missing values in the ‘carrier’ field. I’ll show how to do this using base R and tidyr.

Base R:

names(raw.c)[1:2]<-c('carrier','track')
#some prep work
g<-raw.c %>%
  gather(destination,freq,Los.Angeles:Seattle) 

g[g$carrier=='',1]<-rep(c('Alaska','AM WEST'),I(length(g$carrier)/4))

head(g)
##   carrier   track destination freq
## 1  Alaska on time Los.Angeles  497
## 2  Alaska delayed Los.Angeles   62
## 3 AM WEST on time Los.Angeles  694
## 4 AM WEST delayed Los.Angeles  117
## 5  Alaska on time     Phoenix  221
## 6  Alaska delayed     Phoenix   12

Using tidyr functions:

h<-raw.c %>%
  gather(destination,freq,Los.Angeles:Seattle) %>%
  mutate(carrier.c=as.character(carrier), carrier.1=ifelse(carrier.c=='',lag(carrier.c),carrier.c)) %>%
  select(carrier.1,track:freq)

head(h)
##   carrier.1   track destination freq
## 1    Alaska on time Los.Angeles  497
## 2    Alaska delayed Los.Angeles   62
## 3   AM WEST on time Los.Angeles  694
## 4   AM WEST delayed Los.Angeles  117
## 5    Alaska on time     Phoenix  221
## 6    Alaska delayed     Phoenix   12

Final Step

Perform analysis to compare the arrival delays for the two airlines using dplyr and the grouping and summarise functions.

First statistic is the proportion of tardiness by carrier and city. Having this, we calculate average rank across all cities for the carriers (using the min_rank() function).

h.1<-h %>%
  group_by(carrier.1,destination) %>%
  mutate(total=sum(freq), prop=round(freq/total,2))

head(h.1)
## Source: local data frame [6 x 6]
## Groups: carrier.1, destination [3]
## 
##   carrier.1   track destination  freq total  prop
##       <chr>  <fctr>       <chr> <int> <int> <dbl>
## 1    Alaska on time Los.Angeles   497   559  0.89
## 2    Alaska delayed Los.Angeles    62   559  0.11
## 3   AM WEST on time Los.Angeles   694   811  0.86
## 4   AM WEST delayed Los.Angeles   117   811  0.14
## 5    Alaska on time     Phoenix   221   233  0.95
## 6    Alaska delayed     Phoenix    12   233  0.05
h.1 %>%
  group_by(destination) %>%
  filter(track=='delayed') %>%
  mutate(rank=min_rank(prop)) %>%
  select(-c(freq:prop))
## Source: local data frame [10 x 4]
## Groups: destination [5]
## 
##    carrier.1   track   destination  rank
##        <chr>  <fctr>         <chr> <int>
## 1     Alaska delayed   Los.Angeles     1
## 2    AM WEST delayed   Los.Angeles     2
## 3     Alaska delayed       Phoenix     1
## 4    AM WEST delayed       Phoenix     2
## 5     Alaska delayed     San.Diego     1
## 6    AM WEST delayed     San.Diego     2
## 7     Alaska delayed San.Francisco     1
## 8    AM WEST delayed San.Francisco     2
## 9     Alaska delayed       Seattle     1
## 10   AM WEST delayed       Seattle     2

Alaska Airlines is the best performer (relative to its own activity) for each airport.

We prove this by looking at average rank across airport:

h.1 %>%
  group_by(destination) %>%
  filter(track=='delayed') %>%
  mutate(rank=min_rank(prop)) %>%
  ungroup() %>%
  group_by(carrier.1) %>%
  summarise(avg.rank=mean(rank))
## # A tibble: 2 x 2
##   carrier.1 avg.rank
##       <chr>    <dbl>
## 1    Alaska        1
## 2   AM WEST        2

Graphics

We run some charts to display volatility/distribution of delay proportions by carrier.

#boxplot of proportional delays by destination
ggplot(h.1[h.1$track=='delayed',],aes(x=carrier.1,y=prop))+geom_boxplot()+ggtitle("Variance in delays by Carrier")+labs(x="Carrier",y="proportion of delays")

The plots demonstrates how much more tardy AM West tends to be: 75% of AM Wests service has a higher proportion of delays than the typical delay at Alaska Airlines.

The below plot shows a destination-by-destination view of the proportion delayed. There is an overarching dominance by AM West.

ggplot(h.1[h.1$track=='delayed',],aes(y=prop,x=factor(destination),fill=factor(carrier.1),color=factor(carrier.1)))+geom_bar(stat="identity",position="dodge")+theme(axis.text.x=element_text(angle=45, hjust=1))+ggtitle("Proportion of Delays by City")+labs(x="Destination",y="proportion of delays")

Conclusion

Alaska performs better by way of proportion of flights delayed to on time arrival. This study readily lends itself to a Bayesian probability estimation model. With the model we would estimate the probability of being delayed given a destination airport and particular carrier.