In this report we compare on-time performance between two carriers: AM West and Alaska Airlines. Our dataset is the ontime and delay frequency for multiple airports over the same sample period.
library(dplyr);library(tidyr);
library(ggplot2)
raw<-read.csv('~/Documents/CUNY/data_class/week4-tidyData/flights.csv')
raw_tbl<-tbl_df(raw)
#raw_tbl
For efficient analysis, we will transform the data into a long format.
#names(raw)
raw.c<-raw[complete.cases(raw),]
#raw_tbl[complete.cases(raw_tbl),]
We need to normalize the data and fill in missing values in the ‘carrier’ field. I’ll show how to do this using base R and tidyr.
Base R:
names(raw.c)[1:2]<-c('carrier','track')
#some prep work
g<-raw.c %>%
gather(destination,freq,Los.Angeles:Seattle)
g[g$carrier=='',1]<-rep(c('Alaska','AM WEST'),I(length(g$carrier)/4))
head(g)
## carrier track destination freq
## 1 Alaska on time Los.Angeles 497
## 2 Alaska delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 Alaska on time Phoenix 221
## 6 Alaska delayed Phoenix 12
Using tidyr functions:
h<-raw.c %>%
gather(destination,freq,Los.Angeles:Seattle) %>%
mutate(carrier.c=as.character(carrier), carrier.1=ifelse(carrier.c=='',lag(carrier.c),carrier.c)) %>%
select(carrier.1,track:freq)
head(h)
## carrier.1 track destination freq
## 1 Alaska on time Los.Angeles 497
## 2 Alaska delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 Alaska on time Phoenix 221
## 6 Alaska delayed Phoenix 12
Perform analysis to compare the arrival delays for the two airlines using dplyr and the grouping and summarise functions.
First statistic is the proportion of tardiness by carrier and city. Having this, we calculate average rank across all cities for the carriers (using the min_rank() function).
h.1<-h %>%
group_by(carrier.1,destination) %>%
mutate(total=sum(freq), prop=round(freq/total,2))
head(h.1)
## Source: local data frame [6 x 6]
## Groups: carrier.1, destination [3]
##
## carrier.1 track destination freq total prop
## <chr> <fctr> <chr> <int> <int> <dbl>
## 1 Alaska on time Los.Angeles 497 559 0.89
## 2 Alaska delayed Los.Angeles 62 559 0.11
## 3 AM WEST on time Los.Angeles 694 811 0.86
## 4 AM WEST delayed Los.Angeles 117 811 0.14
## 5 Alaska on time Phoenix 221 233 0.95
## 6 Alaska delayed Phoenix 12 233 0.05
h.1 %>%
group_by(destination) %>%
filter(track=='delayed') %>%
mutate(rank=min_rank(prop)) %>%
select(-c(freq:prop))
## Source: local data frame [10 x 4]
## Groups: destination [5]
##
## carrier.1 track destination rank
## <chr> <fctr> <chr> <int>
## 1 Alaska delayed Los.Angeles 1
## 2 AM WEST delayed Los.Angeles 2
## 3 Alaska delayed Phoenix 1
## 4 AM WEST delayed Phoenix 2
## 5 Alaska delayed San.Diego 1
## 6 AM WEST delayed San.Diego 2
## 7 Alaska delayed San.Francisco 1
## 8 AM WEST delayed San.Francisco 2
## 9 Alaska delayed Seattle 1
## 10 AM WEST delayed Seattle 2
Alaska Airlines is the best performer (relative to its own activity) for each airport.
We prove this by looking at average rank across airport:
h.1 %>%
group_by(destination) %>%
filter(track=='delayed') %>%
mutate(rank=min_rank(prop)) %>%
ungroup() %>%
group_by(carrier.1) %>%
summarise(avg.rank=mean(rank))
## # A tibble: 2 x 2
## carrier.1 avg.rank
## <chr> <dbl>
## 1 Alaska 1
## 2 AM WEST 2
We run some charts to display volatility/distribution of delay proportions by carrier.
#boxplot of proportional delays by destination
ggplot(h.1[h.1$track=='delayed',],aes(x=carrier.1,y=prop))+geom_boxplot()+ggtitle("Variance in delays by Carrier")+labs(x="Carrier",y="proportion of delays")
The plots demonstrates how much more tardy AM West tends to be: 75% of AM Wests service has a higher proportion of delays than the typical delay at Alaska Airlines.
The below plot shows a destination-by-destination view of the proportion delayed. There is an overarching dominance by AM West.
ggplot(h.1[h.1$track=='delayed',],aes(y=prop,x=factor(destination),fill=factor(carrier.1),color=factor(carrier.1)))+geom_bar(stat="identity",position="dodge")+theme(axis.text.x=element_text(angle=45, hjust=1))+ggtitle("Proportion of Delays by City")+labs(x="Destination",y="proportion of delays")
Alaska performs better by way of proportion of flights delayed to on time arrival. This study readily lends itself to a Bayesian probability estimation model. With the model we would estimate the probability of being delayed given a destination airport and particular carrier.