Delay time is one of the great important index for evaluating an airline company’s performance, since it can affect customers’ satisfication and selection to the company. Therefore, controlling delay time for the airline companies should be considered one of the key managerial aspects especially, in the severe competitive airline marketing environment.
For many companies, they record the reasons for both departure and arrive delay, for instance weather, security , carrier delay, National Air System Delay ,and Late Aircraft Delay in our case, which can be found in the database. When I took 2016-02-01 - 2016-02-29 as study period, 272360 departure delay cases and 292410 arrive delays out of 423889 cases were found out, however, mostly records of the these delay reasons are empty. This is probably because there are others factors leading to the airline delay which cannot be identified. Hence, a deeper analysis by description analysis and modelling by the state-of-the-art data mining techniques need to be employed. The results can benefit relative managers to find out the main factors that cannot be observed but important in order to reduce the delay time and improve the performance for airline companies. In the following parts, a series analysis will be promoted embedding with R code.
knitr::opts_chunk$set(echo = TRUE)
pkgs<-c("xlsx","ggplot2","ggmap")
installed.packages(pkgs)
library(ggplot2)
library(ggmap)
library(caret)
library(rpart)
library(doParallel)
if(file.exists("df.RData")){
load("df.RData")
}else{
df<-as.data.frame(read.csv("airlinesInf.csv"))
###remove the state name in the name of city
df$ORIGIN_CITY_NAME<-gsub(",.*","",df[,c("ORIGIN_CITY_NAME")])
save(df,file="df.RData")
}
if(file.exists("df_ap.RData")){
load("df_ap.RData")
}else{
df_ap<-data.frame(read.csv("airportsnew1.csv"))
save(df_ap,file="df_ap.RData")
}
Discovery of the possible impact factors on the time delays can benefit from description analysis based on the dataset. Moreover, many graphics are produced as well in order to show the data structure easily and effectively.
if(file.exists("df_order.RData"))
{load("df_order.RData")}else{
df_delay<-df[,c("CARRIER","DEP_DELAY","ARR_DELAY")]
###insert the NA to 0
df_delay[is.na(df_delay)]<-0
df_delay[,c("DEP_DELAY")]<-as.numeric(df_delay[,c("DEP_DELAY")])
df_delay[,c("ARR_DELAY")]<-as.numeric(df_delay[,c("ARR_DELAY")])
df_delay_mean<-aggregate(df_delay[,c("DEP_DELAY","ARR_DELAY")],by=list(df_delay[,"CARRIER"]),FUN=mean)
colnames(df_delay_mean)[1]<-"CARRIER"
as.data.frame(reshape(as.data.frame(df_delay_mean),varying=c(2:3),v.names="TIME",timevar="DELAY",times=colnames(df_delay[,2:3]),direction="long"))->df_delay_mean_long
df_order<-df_delay_mean_long[with(df_delay_mean_long, order(DELAY, -TIME)), ]
df_order$CARRIER <- factor(df_order$CARRIER, levels = df_order$CARRIER[order(df_order$TIME)])
save(df_order,file="df_order.RData")
}
ggplot(df_order,aes(x=CARRIER,y=TIME,group=DELAY,fill=DELAY))+xlab("CARRIERS")+ylab("")+
geom_bar(position="dodge",stat="identity")+
theme(panel.background = element_blank())+
theme(axis.text=element_text(size=20),axis.title=element_text(size=20))+
theme(axis.line=element_line())+
theme(legend.text = element_text(size = 18),legend.title=element_text(size=18))+scale_fill_grey()
In Feb of 2016, there are 12 carriers are recorded in the database. The sum of delays time for each carrier in this month was calculated. The figure above illustrates that the carrier AS has the best performance (shortest delay time) and airline NK has the worstest performance(longest delay time). In such case, carrier AS and NK were selected for further analysis afterwards. ##2.2 Analyze the possible factors that affect the performance( carrier AS and NK)
###select the airline which has the shortest delay time()
df_AS<-df[df$CARRIER=="AS",c("ORIGIN_CITY_NAME","DEST_CITY_NAME")]
counts_AS_ORG<- table(df_AS$ORIGIN_CITY_NAME)
counts_AS_ORG_f10<-counts_AS_ORG[order(counts_AS_ORG,decreasing = TRUE)][1:10]
names(counts_AS_ORG_f10)<-gsub(",.*","",names(counts_AS_ORG_f10))
barplot(counts_AS_ORG_f10, main="Original City distribution(AS)(TOP10)",
ylab="Number of flight")
counts_AS_DES<- table(df_AS$DEST_CITY_NAME)
counts_AS_DES_f10<-counts_AS_DES[order(counts_AS_DES,decreasing = TRUE)][1:10]
names(counts_AS_DES_f10)<-gsub(",.*","",names(counts_AS_DES_f10))
barplot(counts_AS_DES_f10, main="Destination City distribution(AS)(TOP10)",
ylab="Number of flight")
##select the airline which has the longest delay time(NK)(worst performance)
df_NK<-df[df$CARRIER=="NK",c("ORIGIN_CITY_NAME","DEST_CITY_NAME")]
##nlevels(df_NK$ORIGIN_CITY_NAME)
##nlevels(df_NK$DEST_CITY_NAME)
counts_NK_ORG<- table(df_NK$ORIGIN_CITY_NAME)
counts_NK_ORG_f10<-counts_NK_ORG[order(counts_NK_ORG,decreasing = TRUE)][1:10]
names(counts_NK_ORG_f10)<-gsub(",.*","",names(counts_NK_ORG_f10))
barplot(counts_NK_ORG_f10, main="Original City distribution(NK)(TOP10)",
ylab="Number of flight")
counts_NK_DES<- table(df_NK$DEST_CITY_NAME)
counts_NK_DES_f10<-counts_NK_DES[order(counts_NK_DES,decreasing = TRUE)][1:10]
names(counts_NK_DES_f10)<-gsub(",.*","",names(counts_NK_DES_f10))
barplot(counts_NK_DES_f10, main="Destination City distribution(NK)(TOP10)",
ylab="Number of flight")
The four bar charts above show the cities where the airline frequently departure and arrive corresponding to AS and NK carrier respectively. It can be said that the pattern of cities corresponding to the two carriers are significant different. The figures depict that the flies from carrier AS are mainly at Seattle. On the other hand, the flights NK arrive and departure more at Fort Lauderdale, Las Vegas, and Chicago than other cities. In such case, the data regarding the airport and cities should be analyzed.
if(file.exists("counts_GEO.RData")){
load("counts_GEO.RData")
}else{
CITY_df<-df[,c("CARRIER","ORIGIN_CITY_NAME")]
CITY_df$ORIGIN_CITY_NAME<-as.character(CITY_df$ORIGIN_CITY_NAME)
GEO<-merge(CITY_df,df_ap,by.x="ORIGIN_CITY_NAME",by.y="city")
GEO2<-GEO[,c("ORIGIN_CITY_NAME","airport","lat","long")]
GEO2$lat<-round(GEO2$lat,digits=3)
GEO2$long<-round(GEO2$long,digits=3)
counts_GEO <- ddply(GEO2, .(GEO2$ORIGIN_CITY_NAME, GEO2$airport,GEO2$lat,GEO2$long), nrow)
names(counts_GEO) <- c("CITY", "AIRPORT", "LAT","LONG", "Freq")
###The cities that most AS carriers departures from
counts_GEO_ORG_AS<-counts_GEO[counts_GEO$CITY%in%names(counts_AS_ORG_f10),"CITY"]
##The cities that most flights from AS arrived at
counts_GEO_DES_AS<-counts_GEO[counts_GEO$CITY%in%names(counts_AS_DES_f10),"CITY"]
unique(c(counts_GEO_DES_AS,counts_GEO_ORG_AS))->GEO_AS
##The cities that most NK carriers departure from
counts_GEO_ORG_NK<-counts_GEO[counts_GEO$CITY%in%names(counts_NK_ORG_f10),"CITY"]
##The cities that most flights from NK arrived at
counts_GEO_DES_NK<-counts_GEO[counts_GEO$CITY%in%names(counts_NK_DES_f10),"CITY"]
unique(c(counts_GEO_DES_NK,counts_GEO_ORG_NK))->GEO_NK
GEO_AS_UNI<-GEO_AS[!GEO_AS%in%GEO_NK]
GEO_NK_UNI<-GEO_NK[!GEO_NK%in%GEO_AS]
###The rest data which is not in either of the cities
counts_GEO$type="REST"
counts_GEO[counts_GEO$CITY%in%GEO_AS_UNI,"type"]<-"AS"
counts_GEO[counts_GEO$CITY%in%GEO_NK_UNI,"type"]<-"NK"
type=counts_GEO$type.abb
counts_GEO$type=as.factor(counts_GEO$type,levels=counts_GEO$type.abb)
save(counts_GEO,file="counts_GEO.RData")
}
###plot on map###
map <- get_map(location = 'US', zoom = 4)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=US&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=US&sensor=false
mapPoints <- ggmap(map) +geom_point(aes(x = LONG, y = LAT, size = Freq,col=as.factor(type)), data = counts_GEO, alpha = 0.5)
mapPoints
We emerge the location information of the airports and the flights information for AS and NK. After that, the number of flights was also counted by city and airport.
A map with various size and colour circle is produced above. The colour presents the types of the carriers, which involve AS, NK and rest of the carriers. The size of the points represents the number of fights departure or arrive at the corresponding the airports.
It is worthy to notice that cities where NK frequently departure or arrives have higher volume of flights the AS.
df$DEP_HOUR<-floor(df$DEP_TIME/100)
df$ARR_HOUR<-floor(df$ARR_TIME/100)
clnames<-c("TIME","HOUR","DELAY","TYPE","CARRIER")
HOUR_AS_DEP<-df[df$CARRIER=="AS"&df$DEP_DELAY>0,c("DEP_TIME","DEP_HOUR","DEP_DELAY")]
HOUR_AS_DEP$type<-"DEP"
HOUR_AS_DEP$CARRIER<-"AS"
colnames(HOUR_AS_DEP)<-clnames
HOUR_NK_DEP<-df[df$CARRIER=="NK"&df$DEP_DELAY>0,c("DEP_TIME","DEP_HOUR","DEP_DELAY")]
HOUR_NK_DEP$type<-"DEP"
HOUR_NK_DEP$CARRIER<-"NK"
colnames(HOUR_NK_DEP)<-clnames
HOUR_AS_ARR<-df[df$CARRIER=="AS"&df$ARR_DELAY>0,c("ARR_TIME","ARR_HOUR","ARR_DELAY")]
HOUR_AS_ARR$type<-"ARR"
HOUR_AS_ARR$CARRIER<-"AS"
colnames(HOUR_AS_ARR)<-clnames
HOUR_NK_ARR<-df[df$CARRIER=="NK"&df$ARR_DELAY>0,c("ARR_TIME","ARR_HOUR","ARR_DELAY")]
HOUR_NK_ARR$type<-"ARR"
HOUR_NK_ARR$CARRIER<-"NK"
colnames(HOUR_NK_ARR)<-clnames
HOUR_DELAY<-rbind(HOUR_NK_ARR,HOUR_AS_ARR,HOUR_NK_DEP,HOUR_AS_DEP)
HOUR_DELAY<-na.omit(HOUR_DELAY)
ggplot(HOUR_DELAY, aes(x=HOUR, fill=as.factor(CARRIER))) +geom_histogram(binwidth=.5, position="dodge")+theme(legend.text = element_text(size = 18),legend.title=element_text(size=18))+ guides(fill=guide_legend(title="CARRIER"))
From the plot above, the hour distribution is similar for both carriers. It means that the times for fights delay have similar pattern.
Here I am going to use classification and regression tree for training models, the output is the departure delay time (L<15, 15<=M<60,L>=60). The inputs include the time of departure, the time of arrive, the airport, the city, wheels on, taxi time. A plot will be given as follows and feature importance analysis will be given as well.
In this report, the training data sets only contains carrier AS and NK, and the records that the delay time is bigger than 0 in order to save the computation time.
df_output<-df[df$CARRIER=="AS"|df$CARRIER=="NK",]
df_output[is.na(df_output)]<-0
df_output<-df_output[df_output$DEP_DELAY>0,]
#df_output[df_output$DEP_DELAY<0,"OUTPUT_DEP_DELAY"]<-"E"
df_output[df_output$DEP_DELAY>=0 & df_output$DEP_DELAY<15,"OUTPUT_DEP_DELAY"]<-"L"
df_output[df_output$DEP_DELAY>=15 & df_output$DEP_DELAY<60,"OUTPUT_DEP_DELAY"]<-"M"
df_output[df_output$DEP_DELAY>=60 ,"OUTPUT_DEP_DELAY"]<-"H"
set.seed(123)
indexTrain<-sample(1:nrow(df_output),nrow(df_output)*0.5,replace=FALSE)
df_output$CARRIER<-as.character(df_output$CARRIER)
inputs<-c("OUTPUT_DEP_DELAY", "DAY_OF_MONTH", "CARRIER" ,"ORIGIN_AIRPORT_ID" , "DEST_AIRPORT_ID", "DEP_HOUR" , "CANCELLED", "FLIGHTS" , "CARRIER_DELAY" ,"WEATHER_DELAY", "SECURITY_DELAY" )
train_df<-df_output[indexTrain,inputs]
test_df<-df_output[-indexTrain,inputs]
if(file.exists("rpartFit.RData")){
load("rpartFit.RData")
}else{
library(rpart)
library(doParallel)
numc <- makeCluster(detectCores()-8)
registerDoParallel(numc)
##5-folds cross valdiation used for avoiding over-fitting
cvcontrol<-trainControl(method="cv",number=2,returnData=FALSE,verboseIter=TRUE,allowParallel=TRUE,savePredictions=TRUE,index=createMultiFolds(train_df$OUTPUT_DEP_DELAY, k=5, times=2),seeds = NULL)
rpart.grid<-expand.grid(.cp=c(0.0001,0.001,0.1,1))
rpartFit<-caret::train(OUTPUT_DEP_DELAY ~ .,data=as.data.frame(train_df),method="rpart",trControl=cvcontrol,tuneGrid=rpart.grid)
save(rpartFit,file="rpartFit.RData")
}
plot(rpartFit$finalModel, uniform=TRUE,
main="Classification Tree")
text(rpartFit$finalModel,use.n=TRUE, all=TRUE, cex=1.2)
predict(rpartFit,test_df)->pred
table(pred,test_df$OUTPUT_DEP_DELAY)
##
## pred H L M
## H 419 52 175
## L 258 1258 632
## M 183 203 546
A classification and regression tree was built. The plot classificaiton tree shows the tree graphically. A testing result was made by employing a new data set- test data set, which is shown in the table above. This can be used for calculating the accuracy of this model.The result illustrats that the rpart model has a high performance of classfication based on the prepared training data set.
varImp(rpartFit)
## rpart variable importance
##
## Overall
## CARRIER_DELAY 100.00
## DEP_HOUR 66.03
## DAY_OF_MONTH 55.92
## ORIGIN_AIRPORT_ID 49.22
## CARRIERNK 44.43
## DEST_AIRPORT_ID 34.90
## WEATHER_DELAY 16.58
## SECURITY_DELAY 0.00
## FLIGHTS 0.00
## CANCELLED 0.00
Finally, a variable importance was given based on CART model. The important factors for instance the “day of month”, “departure hour”, “destination airport” are the factors that affect delay time significantly. The The original airport and destination airport play an important role on the classification which is consisted with the description analysis. The “hour” factor is a key variable as well which is against the description analysis that the pattern of delay time are similar for the two airports.
In this report, a main framework of analysis by using the open data sets were given. However, there are still some limitaions which can be worked in the future work.
1 The target, this report only takes the departure delay time as example for training models. The arrive delay time was not analyzed due to the time limitation. Some relevents should be re-considered. One of them should be feature selection. For instance, some new features, like departure delay time and taxi in need to be added as new input feature since the arrive delay(target) could be affected by it.
2 The size of training dataset. In this report, the training data sets containing two carriers were selected. Due to the large computation time of the whole dataset(more than 423889) and the limitation of our computation capability, it is hard to finish the training process before deadline(7 days) Nevertheless, the whole data set can be involoved for getting more accurate training model
3 The model selection. This report employ classification and regression tree(CART) as training model. However, with the increasing number of training data set, the computation time would grow faster than other data mining model such as support vector machine. Moreover, CART is less robust than others such as random forest. Therefore, in the future work, more data mining techniques should be tested.