Predicting Flight Delay @ US Airports

Introduction

Airline delays. They are the bane of every traveller’s existence and a constant source of anxiety. Airlines won’t tell you whether your flight is likely to be delayed. Delayed flights can cause you to miss a connecting flight or an important business meeting. So why hasn’t anyone tried to explore if airline delays can be predicted with a reasonable degree of accuracy? In this analysis I try to develop a machine learning model that predicts whether a flight’s arrival will be delayed by 15 minutes or more.

Data Explanation

Here’s a description of all the data used in the analysis:

Variable	Description
“DAY_OF_MONTH”	Day of Month
“DAY_OF_WEEK”	Day of Week
“AIRLINE_ID”	An identification number assigned by US DOT to identify a unique airline (carrier).
“CARRIER”	Code assigned by IATA and commonly used to identify a carrier.
“TAIL_NUM”	Tail Number
“FL_NUM”	Flight Number
“ORIGIN_AIRPORT_ID”	Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport.
“ORIGIN”	Origin Airport
“DEST_AIRPORT_ID”	Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport.
“DEST”	Destination Airport
“DEP_TIME”	Actual Departure Time (local time: hhmm)
“DEP_DEL15”	Departure Delay Indicator, 15 Minutes or More (1=Yes)
“DEP_TIME_BLK”	CRS Departure Time Block, Hourly Intervals
“ARR_TIME”	Actual Arrival Time (local time: hhmm)
“ARR_DEL15”	Arrival Delay Indicator, 15 Minutes or More (1=Yes)
“CANCELLED”	Cancelled Flight Indicator (1=Yes)
“DIVERTED”	Diverted Flight Indicator (1=Yes)
“DISTANCE”	Distance between airports (miles)

Data Origin and Collection

The data can be found & downloaded by navigating to the following link:

On-Time Performance

The data is downloaded as a CSV file from the Department of Transportation website.

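# Will a given flight arrive 15 minutes or more late?
# PART 1: CLEANING DATA
# Import the flights dataset after downloading it from the Department of Transportation website

library(readr)
setwd("~/Google Drive/ix 2016 /Project3")
# read.csv2() expects a comma as the decimal mark, so columns holding values
# such as "0.00" are read as character strings, with blank fields kept as ""
flights <- read.csv2('flights.csv', sep=",", header=TRUE, stringsAsFactors = FALSE)
# View(flights)
head(flights)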

##   DAY_OF_MONTH DAY_OF_WEEK UNIQUE_CARRIER AIRLINE_ID CARRIER TAIL_NUM
## 1            1           4             AA      19805      AA   N787AA
## 2            1           4             AA      19805      AA   N795AA
## 3            1           4             AA      19805      AA   N798AA
## 4            1           4             AA      19805      AA   N799AA
## 5            1           4             AA      19805      AA   N376AA
## 6            1           4             AA      19805      AA   N398AA
##   FL_NUM ORIGIN_AIRPORT_ID ORIGIN_AIRPORT_SEQ_ID ORIGIN DEST_AIRPORT_ID
## 1      1             12478               1247802    JFK           12892
## 2      2             12892               1289203    LAX           12478
## 3      3             12478               1247802    JFK           12892
## 4      4             12892               1289203    LAX           12478
## 5      5             11298               1129803    DFW           12173
## 6      6             13830               1383002    OGG           11298
##   DEST_AIRPORT_SEQ_ID DEST DEP_TIME DEP_DEL15 DEP_TIME_BLK ARR_TIME
## 1             1289203  LAX      855      0.00    0900-0959     1237
## 2             1247802  JFK      856      0.00    0900-0959     1651
## 3             1289203  LAX     1226      0.00    1200-1259     1548
## 4             1247802  JFK     1214      0.00    1200-1259     2033
## 5             1217302  HNL     1754      1.00    1300-1359     2240
## 6             1129803  DFW       NA              1800-1859       NA
##   ARR_DEL15 CANCELLED DIVERTED DISTANCE  X
## 1      0.00      0.00     0.00  2475.00 NA
## 2      0.00      0.00     0.00  2475.00 NA
## 3      0.00      0.00     0.00  2475.00 NA
## 4      0.00      0.00     0.00  2475.00 NA
## 5      1.00      0.00     0.00  3784.00 NA
## 6                1.00     0.00  3711.00 NA

Preparation

The data set covers flights originating and ending in the United States in 2015. The downloaded dataset included more than 450,000 rows. Because that was too much data for R to handle when fitting a regression model and the random forest algorithm, the dataset was subsetted to the 10 busiest airports in the US by total passenger traffic.
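The subsetting step itself is not shown in this write-up. A minimal sketch of how it might look, assuming that flight count in the data stands in for passenger traffic (which the dataset does not contain) and that both ends of a flight must be among the busiest airports; the helper `top_n_airports` and the toy data frame are hypothetical:

```r
# Hypothetical helper: keep flights whose origin AND destination are both
# among the n airports with the most departures in the data.
# Assumption: departures per airport approximate passenger traffic.
top_n_airports <- function(df, n = 10) {
  counts  <- sort(table(df$ORIGIN), decreasing = TRUE)
  busiest <- names(counts)[seq_len(min(n, length(counts)))]
  df[df$ORIGIN %in% busiest & df$DEST %in% busiest, , drop = FALSE]
}

# Toy data standing in for flights.csv
toy <- data.frame(
  ORIGIN = c("ATL", "ATL", "ATL", "LAX", "LAX", "OGG"),
  DEST   = c("LAX", "LAX", "OGG", "ATL", "ATL", "ATL"),
  stringsAsFactors = FALSE
)

toy_subset <- top_n_airports(toy, n = 2)
nrow(toy_subset)
```

On the real data the call would be something like `flights <- top_n_airports(flights, n = 10)`, run before the cleaning steps.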

Some of the columns are useless to our data analysis, so we NULL ’em.

#   Clean data

flights$DEST_AIRPORT_SEQ_ID <- NULL 
flights$UNIQUE_CARRIER <- NULL
flights$X <- NULL 
flights$ORIGIN_AIRPORT_SEQ_ID <- NULL

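Then we keep only the flights that have both an arrival and a departure delay indicator, dropping rows where either one is missing or blank:

# Keep flights with known arrival and departure delay status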

ontime <- flights[!is.na(flights$ARR_DEL15) & flights$ARR_DEL15!="" & !is.na(flights$DEP_DEL15) & flights$DEP_DEL15!="",]
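To see what this filter does, here is a small sketch on made-up rows. The helper `has_value` and the toy data frame are hypothetical; the logic mirrors the condition used above.

```r
# Toy rows: complete, complete, blank indicators, NA arrival indicator
toy <- data.frame(
  ARR_DEL15 = c("0.00", "1.00", "", NA),
  DEP_DEL15 = c("0.00", "1.00", "", "0.00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a value is neither NA nor an empty string
has_value <- function(x) !is.na(x) & x != ""

keep <- has_value(toy$ARR_DEL15) & has_value(toy$DEP_DEL15)
toy_ontime <- toy[keep, ]

# How many rows were kept and how many were dropped
table(keep)
```

Comparing `nrow(flights)` with `nrow(ontime)` in the same way shows how many flights were discarded by the filter.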

Change the data class of the filtered data to enable data processing and running algorithms.

ontime$DEST_AIRPORT_ID <- as.factor(ontime$DEST_AIRPORT_ID)
ontime$ORIGIN_AIRPORT_ID <- as.factor(ontime$ORIGIN_AIRPORT_ID)
ontime$DAY_OF_WEEK <- as.factor(ontime$DAY_OF_WEEK)
ontime$DISTANCE <- as.integer(ontime$DISTANCE)
ontime$CANCELLED <- as.integer(ontime$CANCELLED)
ontime$DIVERTED <- as.integer(ontime$DIVERTED)
ontime$ORIGIN <- as.factor(ontime$ORIGIN)
ontime$DEP_TIME_BLK <- as.factor(ontime$DEP_TIME_BLK)
ontime$CARRIER <- as.factor(ontime$CARRIER)
ontime$ARR_DEL15 <- as.factor(ontime$ARR_DEL15)
ontime$DEP_DEL15 <- as.factor(ontime$DEP_DEL15)
ontime$DEST <- as.factor(ontime$DEST)

Exploratory Analysis

Since I love civil aviation in general, I already had a good idea of what the data would say, but here are some graphs nonetheless:

The following plot shows when flights are scheduled for every day of the week.

plot(ontime$DAY_OF_WEEK ~ ontime$DEP_TIME_BLK)

The following plot shows how far destination cities are from a given originating airport. This matters because the longer the flight, the more time airlines can make up in the air.

plot(ontime$DISTANCE ~ ontime$ORIGIN)

The following plot looks for delays by day of week. It’s evident that Sunday and Monday suffer from more delays than the rest of the week.

plot(ontime$ARR_DEL15 ~ ontime$DAY_OF_WEEK)
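The same comparison can be read off as numbers rather than from the plot. A minimal sketch using made-up rows (the toy values are illustrative only, not results from the flight data):

```r
# Toy data: day of week and arrival delay indicator as factors
toy <- data.frame(
  DAY_OF_WEEK = factor(c(1, 1, 2, 2, 7, 7, 7, 7)),
  ARR_DEL15   = factor(c("1.00", "0.00", "0.00", "0.00",
                         "1.00", "1.00", "0.00", "0.00"))
)

# Row-wise proportions: share of on-time ("0.00") and delayed ("1.00")
# flights within each day of the week
delay_share <- prop.table(table(toy$DAY_OF_WEEK, toy$ARR_DEL15), margin = 1)

# Share of delayed flights per day
round(delay_share[, "1.00"], 2)
```

Replacing `toy` with `ontime` gives the delayed share for each day in the real data.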

Modelling

The following code shows the creation of the training and testing data.

# PART 2: TRAINING DATA

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

set.seed(13) 

# Select columns to be used in algorithm training

feature<- c("ARR_DEL15", "DAY_OF_WEEK", "CARRIER", "DEST","ORIGIN","DEP_TIME_BLK")

# Keep only the selected columns of the ontime data

ontime_sorted <- ontime[,feature] 

# Select data to put into training

training_index <- createDataPartition(ontime_sorted$ARR_DEL15, p=0.75, list=FALSE)

# Create training & testing datasets
# (the testing set is the 25% of rows not selected for training)

training_data <- ontime_sorted[training_index,] 
testing_data <- ontime_sorted[-training_index,]
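For a factor outcome, `createDataPartition` samples within each class so that the split preserves the class proportions. The sketch below imitates that idea in base R on toy data, as an illustration of the principle rather than of caret's actual implementation, and checks that the two sets do not share any rows:

```r
set.seed(13)

# Toy outcome with an 80/20 class imbalance
toy <- data.frame(
  ARR_DEL15 = factor(rep(c("0.00", "1.00"), times = c(80, 20))),
  id        = seq_len(100)
)

# Sample 75% of the row indices within each class
idx_by_class <- split(seq_len(nrow(toy)), toy$ARR_DEL15)
train_idx <- unlist(
  lapply(idx_by_class, function(i) {
    i[sample.int(length(i), size = round(0.75 * length(i)))]
  }),
  use.names = FALSE
)

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]   # everything NOT in the training set

# Class proportions in the training set, and overlap between the two sets
prop.table(table(train$ARR_DEL15))
length(intersect(train$id, test$id))
```

Keeping the test rows separate from the training rows matters because accuracy measured on data the model has already seen tends to be optimistic.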

Model Evaluation and Results

I used two methods to build and evaluate a model.

The first is the logistic regression model.

# METHOD 1: Logistic Regression
# 5-fold cross-validation; `repeats` is dropped because it only applies
# to method = "repeatedcv"

log_reg_mod <- train(ARR_DEL15 ~ ., data = training_data, method = "glm", family = "binomial",
                     trControl=trainControl(method = "cv", number = 5))

# Predict

log_reg_predict <- predict(log_reg_mod, testing_data)

# Confusion matrix 

confusion_matrix_reg <- confusionMatrix(log_reg_predict, testing_data[,"ARR_DEL15"])
confusion_matrix_reg

I’ve suppressed the evaluation of the above code, as it takes about 10 minutes to run through the huge dataset. Instead, here’s a screenshot of the results:

Results of Logistic Regression Model
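When reading a confusion matrix for an imbalanced outcome like this one, it helps to compare accuracy against the share of the largest class, which `confusionMatrix` reports as the "No Information Rate". A short sketch of the arithmetic, using made-up counts that are purely illustrative and not the results of this analysis:

```r
# Illustrative counts only (rows = predicted, columns = actual)
cm <- matrix(
  c(70, 5,     # actual "0.00": predicted "0.00", predicted "1.00"
    15, 10),   # actual "1.00": predicted "0.00", predicted "1.00"
  nrow = 2,
  dimnames = list(predicted = c("0.00", "1.00"),
                  actual    = c("0.00", "1.00"))
)

total <- sum(cm)

# Accuracy: correctly classified flights over all flights
accuracy <- sum(diag(cm)) / total

# No-information rate: accuracy of always predicting the largest class
no_info_rate <- max(colSums(cm)) / total

# Share of truly delayed flights that the model flags as delayed
delayed_recall <- cm["1.00", "1.00"] / sum(cm[, "1.00"])

c(accuracy = accuracy, no_info_rate = no_info_rate, delayed_recall = delayed_recall)
```

In this toy example the accuracy of 0.80 is only slightly above the 0.75 that always guessing "on time" would achieve, and just 40% of delayed flights are caught.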

The second is the random forest algorithm.

# METHOD 2: Random Forest

library(randomForest) 

# Fit on all columns except the first one (ARR_DEL15, the outcome)
random_forest <- randomForest(training_data[-1], training_data$ARR_DEL15, proximity = TRUE, importance = TRUE)
random_forest

# Predict
random_forest_validation <- predict(random_forest, testing_data)

# Confusion matrix 

confusion_matrix_rf <- confusionMatrix(random_forest_validation, testing_data[,"ARR_DEL15"])
confusion_matrix_rf

I’ve suppressed the evaluation of the above code again, as this one takes about 40 minutes to run through the huge dataset. The random forest algorithm improved accuracy over the logistic regression model.

Conclusion

The logistic regression model has an accuracy of 78%. The random forest algorithm improves on it by 6 percentage points, reaching 84%.

Future Improvements

The model could be improved by adding factors such as weather, which could be pulled in through the API of a weather service. I also decided not to split the data by month, as the computation was already intensive with so much data to process. Grouping the months by season should give better insights as well.
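As a rough sketch of the season idea, the snippet below maps a month number to a season factor. It assumes a `MONTH` column (1 to 12), which is not part of the dataset used here, and uses northern-hemisphere meteorological seasons; the helper `month_to_season` is hypothetical.

```r
# Hypothetical helper: map month numbers (1-12) to a season factor
# Assumption: December-February = winter, March-May = spring, and so on
month_to_season <- function(month) {
  season <- c("winter", "winter", "spring", "spring", "spring", "summer",
              "summer", "summer", "autumn", "autumn", "autumn", "winter")
  factor(season[month], levels = c("winter", "spring", "summer", "autumn"))
}

seasons <- month_to_season(c(1, 4, 7, 10, 12))
seasons
```

With a month column available, something like `ontime$SEASON <- month_to_season(ontime$MONTH)` would add a four-level predictor, which is much cheaper to model than twelve separate months.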