2 Description analysis

Discovery of the possible impact factors on the time delays can benefit from description analysis based on the dataset. Moreover, many graphics are produced as well in order to show the data structure easily and effectively.

2.1 Find out the best and worst performance airport(shortest delay time and longest delay time)

if(file.exists("df_order.RData"))
        {load("df_order.RData")}else{
        df_delay<-df[,c("CARRIER","DEP_DELAY","ARR_DELAY")]
        ###insert the NA to 0 
        df_delay[is.na(df_delay)]<-0
        df_delay[,c("DEP_DELAY")]<-as.numeric(df_delay[,c("DEP_DELAY")])
        df_delay[,c("ARR_DELAY")]<-as.numeric(df_delay[,c("ARR_DELAY")])
        df_delay_mean<-aggregate(df_delay[,c("DEP_DELAY","ARR_DELAY")],by=list(df_delay[,"CARRIER"]),FUN=mean)
        colnames(df_delay_mean)[1]<-"CARRIER"
        as.data.frame(reshape(as.data.frame(df_delay_mean),varying=c(2:3),v.names="TIME",timevar="DELAY",times=colnames(df_delay[,2:3]),direction="long"))->df_delay_mean_long
        df_order<-df_delay_mean_long[with(df_delay_mean_long, order(DELAY, -TIME)), ]
        df_order$CARRIER <- factor(df_order$CARRIER, levels = df_order$CARRIER[order(df_order$TIME)])
        save(df_order,file="df_order.RData")
        }

ggplot(df_order,aes(x=CARRIER,y=TIME,group=DELAY,fill=DELAY))+xlab("CARRIERS")+ylab("")+
geom_bar(position="dodge",stat="identity")+
        theme(panel.background = element_blank())+
        theme(axis.text=element_text(size=20),axis.title=element_text(size=20))+
        theme(axis.line=element_line())+
        theme(legend.text = element_text(size = 18),legend.title=element_text(size=18))+scale_fill_grey()

In Feb of 2016, there are 12 carriers are recorded in the database. The sum of delays time for each carrier in this month was calculated. The figure above illustrates that the carrier AS has the best performance (shortest delay time) and airline NK has the worstest performance(longest delay time). In such case, carrier AS and NK were selected for further analysis afterwards. ##2.2 Analyze the possible factors that affect the performance( carrier AS and NK)

###select the airline which has the shortest delay time()
df_AS<-df[df$CARRIER=="AS",c("ORIGIN_CITY_NAME","DEST_CITY_NAME")]
counts_AS_ORG<- table(df_AS$ORIGIN_CITY_NAME)
counts_AS_ORG_f10<-counts_AS_ORG[order(counts_AS_ORG,decreasing = TRUE)][1:10]
names(counts_AS_ORG_f10)<-gsub(",.*","",names(counts_AS_ORG_f10))
barplot(counts_AS_ORG_f10, main="Original City distribution(AS)(TOP10)", 
    ylab="Number of flight")

counts_AS_DES<- table(df_AS$DEST_CITY_NAME)
counts_AS_DES_f10<-counts_AS_DES[order(counts_AS_DES,decreasing = TRUE)][1:10]
names(counts_AS_DES_f10)<-gsub(",.*","",names(counts_AS_DES_f10))
barplot(counts_AS_DES_f10, main="Destination City distribution(AS)(TOP10)", 
    ylab="Number of flight")

##select the airline which has the longest delay time(NK)(worst performance)
df_NK<-df[df$CARRIER=="NK",c("ORIGIN_CITY_NAME","DEST_CITY_NAME")]
##nlevels(df_NK$ORIGIN_CITY_NAME)
##nlevels(df_NK$DEST_CITY_NAME)

counts_NK_ORG<- table(df_NK$ORIGIN_CITY_NAME)
counts_NK_ORG_f10<-counts_NK_ORG[order(counts_NK_ORG,decreasing = TRUE)][1:10]
names(counts_NK_ORG_f10)<-gsub(",.*","",names(counts_NK_ORG_f10))
barplot(counts_NK_ORG_f10, main="Original City distribution(NK)(TOP10)", 
    ylab="Number of flight")

counts_NK_DES<- table(df_NK$DEST_CITY_NAME)
counts_NK_DES_f10<-counts_NK_DES[order(counts_NK_DES,decreasing = TRUE)][1:10]
names(counts_NK_DES_f10)<-gsub(",.*","",names(counts_NK_DES_f10))
barplot(counts_NK_DES_f10, main="Destination City distribution(NK)(TOP10)", 
    ylab="Number of flight")

The four bar charts above show the cities where the airline frequently departure and arrive corresponding to AS and NK carrier respectively. It can be said that the pattern of cities corresponding to the two carriers are significant different. The figures depict that the flies from carrier AS are mainly at Seattle. On the other hand, the flights NK arrive and departure more at Fort Lauderdale, Las Vegas, and Chicago than other cities. In such case, the data regarding the airport and cities should be analyzed.

if(file.exists("counts_GEO.RData")){
        load("counts_GEO.RData")
}else{
        CITY_df<-df[,c("CARRIER","ORIGIN_CITY_NAME")]
        CITY_df$ORIGIN_CITY_NAME<-as.character(CITY_df$ORIGIN_CITY_NAME)
        GEO<-merge(CITY_df,df_ap,by.x="ORIGIN_CITY_NAME",by.y="city")
        GEO2<-GEO[,c("ORIGIN_CITY_NAME","airport","lat","long")]
        GEO2$lat<-round(GEO2$lat,digits=3)
        GEO2$long<-round(GEO2$long,digits=3)
        counts_GEO <- ddply(GEO2, .(GEO2$ORIGIN_CITY_NAME, GEO2$airport,GEO2$lat,GEO2$long), nrow)
        names(counts_GEO) <- c("CITY", "AIRPORT", "LAT","LONG", "Freq")
        ###The cities that most AS carriers departures from
        counts_GEO_ORG_AS<-counts_GEO[counts_GEO$CITY%in%names(counts_AS_ORG_f10),"CITY"]
        
        ##The  cities that most flights from AS arrived at
        counts_GEO_DES_AS<-counts_GEO[counts_GEO$CITY%in%names(counts_AS_DES_f10),"CITY"]
        unique(c(counts_GEO_DES_AS,counts_GEO_ORG_AS))->GEO_AS
        ##The  cities that most NK carriers departure from
        counts_GEO_ORG_NK<-counts_GEO[counts_GEO$CITY%in%names(counts_NK_ORG_f10),"CITY"]
         ##The  cities that most flights from NK arrived at
        counts_GEO_DES_NK<-counts_GEO[counts_GEO$CITY%in%names(counts_NK_DES_f10),"CITY"]
        unique(c(counts_GEO_DES_NK,counts_GEO_ORG_NK))->GEO_NK
        GEO_AS_UNI<-GEO_AS[!GEO_AS%in%GEO_NK]
        GEO_NK_UNI<-GEO_NK[!GEO_NK%in%GEO_AS]
        ###The rest data which is not in either of the cities 
        counts_GEO$type="REST"
        counts_GEO[counts_GEO$CITY%in%GEO_AS_UNI,"type"]<-"AS"
        counts_GEO[counts_GEO$CITY%in%GEO_NK_UNI,"type"]<-"NK"
        type=counts_GEO$type.abb
        counts_GEO$type=as.factor(counts_GEO$type,levels=counts_GEO$type.abb)
        save(counts_GEO,file="counts_GEO.RData")
        }


###plot on map###
map <- get_map(location = 'US', zoom = 4)

## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=US&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false

## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=US&sensor=false

mapPoints <- ggmap(map) +geom_point(aes(x = LONG, y = LAT, size = Freq,col=as.factor(type)), data = counts_GEO, alpha = 0.5)
mapPoints

We emerge the location information of the airports and the flights information for AS and NK. After that, the number of flights was also counted by city and airport.

A map with various size and colour circle is produced above. The colour presents the types of the carriers, which involve AS, NK and rest of the carriers. The size of the points represents the number of fights departure or arrive at the corresponding the airports.

It is worthy to notice that cities where NK frequently departure or arrives have higher volume of flights the AS.

2.3 Time factor

df$DEP_HOUR<-floor(df$DEP_TIME/100)
df$ARR_HOUR<-floor(df$ARR_TIME/100)
clnames<-c("TIME","HOUR","DELAY","TYPE","CARRIER")
HOUR_AS_DEP<-df[df$CARRIER=="AS"&df$DEP_DELAY>0,c("DEP_TIME","DEP_HOUR","DEP_DELAY")]
HOUR_AS_DEP$type<-"DEP"
HOUR_AS_DEP$CARRIER<-"AS"
colnames(HOUR_AS_DEP)<-clnames

HOUR_NK_DEP<-df[df$CARRIER=="NK"&df$DEP_DELAY>0,c("DEP_TIME","DEP_HOUR","DEP_DELAY")]
HOUR_NK_DEP$type<-"DEP"
HOUR_NK_DEP$CARRIER<-"NK"
colnames(HOUR_NK_DEP)<-clnames

HOUR_AS_ARR<-df[df$CARRIER=="AS"&df$ARR_DELAY>0,c("ARR_TIME","ARR_HOUR","ARR_DELAY")]
HOUR_AS_ARR$type<-"ARR"
HOUR_AS_ARR$CARRIER<-"AS"
colnames(HOUR_AS_ARR)<-clnames

HOUR_NK_ARR<-df[df$CARRIER=="NK"&df$ARR_DELAY>0,c("ARR_TIME","ARR_HOUR","ARR_DELAY")]
HOUR_NK_ARR$type<-"ARR"
HOUR_NK_ARR$CARRIER<-"NK"
colnames(HOUR_NK_ARR)<-clnames

HOUR_DELAY<-rbind(HOUR_NK_ARR,HOUR_AS_ARR,HOUR_NK_DEP,HOUR_AS_DEP)
HOUR_DELAY<-na.omit(HOUR_DELAY)
ggplot(HOUR_DELAY, aes(x=HOUR, fill=as.factor(CARRIER))) +geom_histogram(binwidth=.5, position="dodge")+theme(legend.text = element_text(size = 18),legend.title=element_text(size=18))+ guides(fill=guide_legend(title="CARRIER"))

From the plot above, the hour distribution is similar for both carriers. It means that the times for fights delay have similar pattern.

Here I am going to use classification and regression tree for training models, the output is the departure delay time (L<15, 15<=M<60,L>=60). The inputs include the time of departure, the time of arrive, the airport, the city, wheels on, taxi time. A plot will be given as follows and feature importance analysis will be given as well.

In this report, the training data sets only contains carrier AS and NK, and the records that the delay time is bigger than 0 in order to save the computation time.

Modeling results

df_output<-df[df$CARRIER=="AS"|df$CARRIER=="NK",]
df_output[is.na(df_output)]<-0
df_output<-df_output[df_output$DEP_DELAY>0,]
#df_output[df_output$DEP_DELAY<0,"OUTPUT_DEP_DELAY"]<-"E"
df_output[df_output$DEP_DELAY>=0 & df_output$DEP_DELAY<15,"OUTPUT_DEP_DELAY"]<-"L"
df_output[df_output$DEP_DELAY>=15 & df_output$DEP_DELAY<60,"OUTPUT_DEP_DELAY"]<-"M"
df_output[df_output$DEP_DELAY>=60 ,"OUTPUT_DEP_DELAY"]<-"H"

set.seed(123)
indexTrain<-sample(1:nrow(df_output),nrow(df_output)*0.5,replace=FALSE)
df_output$CARRIER<-as.character(df_output$CARRIER)
inputs<-c("OUTPUT_DEP_DELAY", "DAY_OF_MONTH", "CARRIER" ,"ORIGIN_AIRPORT_ID" , "DEST_AIRPORT_ID", "DEP_HOUR" , "CANCELLED", "FLIGHTS" , "CARRIER_DELAY" ,"WEATHER_DELAY", "SECURITY_DELAY" )
train_df<-df_output[indexTrain,inputs]
test_df<-df_output[-indexTrain,inputs]
  
if(file.exists("rpartFit.RData")){
        load("rpartFit.RData")
}else{
      library(rpart)
        library(doParallel)
        numc <- makeCluster(detectCores()-8)
        registerDoParallel(numc)
        ##5-folds cross valdiation used for avoiding over-fitting 
        cvcontrol<-trainControl(method="cv",number=2,returnData=FALSE,verboseIter=TRUE,allowParallel=TRUE,savePredictions=TRUE,index=createMultiFolds(train_df$OUTPUT_DEP_DELAY, k=5, times=2),seeds = NULL)
        rpart.grid<-expand.grid(.cp=c(0.0001,0.001,0.1,1))
        rpartFit<-caret::train(OUTPUT_DEP_DELAY ~ .,data=as.data.frame(train_df),method="rpart",trControl=cvcontrol,tuneGrid=rpart.grid)
        save(rpartFit,file="rpartFit.RData")  
}


plot(rpartFit$finalModel, uniform=TRUE, 
    main="Classification Tree")
text(rpartFit$finalModel,use.n=TRUE, all=TRUE, cex=1.2)

predict(rpartFit,test_df)->pred
table(pred,test_df$OUTPUT_DEP_DELAY)

##     
## pred    H    L    M
##    H  419   52  175
##    L  258 1258  632
##    M  183  203  546

A classification and regression tree was built. The plot classificaiton tree shows the tree graphically. A testing result was made by employing a new data set- test data set, which is shown in the table above. This can be used for calculating the accuracy of this model.The result illustrats that the rpart model has a high performance of classfication based on the prepared training data set.

varImp(rpartFit)

## rpart variable importance
## 
##                   Overall
## CARRIER_DELAY      100.00
## DEP_HOUR            66.03
## DAY_OF_MONTH        55.92
## ORIGIN_AIRPORT_ID   49.22
## CARRIERNK           44.43
## DEST_AIRPORT_ID     34.90
## WEATHER_DELAY       16.58
## SECURITY_DELAY       0.00
## FLIGHTS              0.00
## CANCELLED            0.00

Finally, a variable importance was given based on CART model. The important factors for instance the “day of month”, “departure hour”, “destination airport” are the factors that affect delay time significantly. The The original airport and destination airport play an important role on the classification which is consisted with the description analysis. The “hour” factor is a key variable as well which is against the description analysis that the pattern of delay time are similar for the two airports.

In this report, a main framework of analysis by using the open data sets were given. However, there are still some limitaions which can be worked in the future work.

1 The target, this report only takes the departure delay time as example for training models. The arrive delay time was not analyzed due to the time limitation. Some relevents should be re-considered. One of them should be feature selection. For instance, some new features, like departure delay time and taxi in need to be added as new input feature since the arrive delay(target) could be affected by it.

2 The size of training dataset. In this report, the training data sets containing two carriers were selected. Due to the large computation time of the whole dataset(more than 423889) and the limitation of our computation capability, it is hard to finish the training process before deadline(7 days) Nevertheless, the whole data set can be involoved for getting more accurate training model

3 The model selection. This report employ classification and regression tree(CART) as training model. However, with the increasing number of training data set, the computation time would grow faster than other data mining model such as support vector machine. Moreover, CART is less robust than others such as random forest. Therefore, in the future work, more data mining techniques should be tested.

SpotifyAssignment

gongbing

03/10/2016

1 Introduction

2 Description analysis

2.1 Find out the best and worst performance airport(shortest delay time and longest delay time)

2.3 Time factor

Modeling results