Airline delays. They are the bane of every traveller’s existence and a constant source of anxiety. Airlines won’t tell you whether your flight is likely to be delayed. Delayed flights can cause you to miss a connecting flight or an important business meeting. So why hasn’t anyone tried to explore whether airline delays can be predicted with a reasonable degree of accuracy? In this analysis I try to develop a machine learning model that predicts whether a flight’s arrival will be delayed by 15 minutes or more.
Here’s a description of all the data used in the analysis:
| Variable | Description |
|---|---|
| “DAY_OF_MONTH” | Day of Month |
| “DAY_OF_WEEK” | Day of Week |
| “AIRLINE_ID” | An identification number assigned by US DOT to identify a unique airline (carrier). |
| “CARRIER” | Code assigned by IATA and commonly used to identify a carrier. |
| “TAIL_NUM” | Tail Number |
| “FL_NUM” | Flight Number |
| “ORIGIN_AIRPORT_ID” | Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. |
| “ORIGIN” | Origin Airport |
| “DEST_AIRPORT_ID” | Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. |
| “DEST” | Destination Airport |
| “DEP_TIME” | Actual Departure Time (local time: hhmm) |
| “DEP_DEL15” | Departure Delay Indicator, 15 Minutes or More (1=Yes) |
| “DEP_TIME_BLK” | CRS Departure Time Block, Hourly Intervals |
| “ARR_TIME” | Actual Arrival Time (local time: hhmm) |
| “ARR_DEL15” | Arrival Delay Indicator, 15 Minutes or More (1=Yes) |
| “CANCELLED” | Cancelled Flight Indicator (1=Yes) |
| “DIVERTED” | Diverted Flight Indicator (1=Yes) |
| “DISTANCE” | Distance between airports (miles) |
The data can be found & downloaded by navigating to the following link:
The data is automatically downloaded as a CSV file from the Department of Transportation website.
# Will a given flight arrive 15 or more minutes late?
# PART 1 CLEANING DATA
# Import flights dataset after downloading from the Department of Transportation website
library(readr)
setwd("~/Google Drive/ix 2016 /Project3")
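# Note: read.csv2() defaults to dec = ",", so columns such as "0.00" are read in as character strings
flights <- read.csv2('flights.csv', sep=",", header=TRUE, stringsAsFactors = FALSE)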
# View(flights)
head(flights)
## DAY_OF_MONTH DAY_OF_WEEK UNIQUE_CARRIER AIRLINE_ID CARRIER TAIL_NUM
## 1 1 4 AA 19805 AA N787AA
## 2 1 4 AA 19805 AA N795AA
## 3 1 4 AA 19805 AA N798AA
## 4 1 4 AA 19805 AA N799AA
## 5 1 4 AA 19805 AA N376AA
## 6 1 4 AA 19805 AA N398AA
## FL_NUM ORIGIN_AIRPORT_ID ORIGIN_AIRPORT_SEQ_ID ORIGIN DEST_AIRPORT_ID
## 1 1 12478 1247802 JFK 12892
## 2 2 12892 1289203 LAX 12478
## 3 3 12478 1247802 JFK 12892
## 4 4 12892 1289203 LAX 12478
## 5 5 11298 1129803 DFW 12173
## 6 6 13830 1383002 OGG 11298
## DEST_AIRPORT_SEQ_ID DEST DEP_TIME DEP_DEL15 DEP_TIME_BLK ARR_TIME
## 1 1289203 LAX 855 0.00 0900-0959 1237
## 2 1247802 JFK 856 0.00 0900-0959 1651
## 3 1289203 LAX 1226 0.00 1200-1259 1548
## 4 1247802 JFK 1214 0.00 1200-1259 2033
## 5 1217302 HNL 1754 1.00 1300-1359 2240
## 6 1129803 DFW NA 1800-1859 NA
## ARR_DEL15 CANCELLED DIVERTED DISTANCE X
## 1 0.00 0.00 0.00 2475.00 NA
## 2 0.00 0.00 0.00 2475.00 NA
## 3 0.00 0.00 0.00 2475.00 NA
## 4 0.00 0.00 0.00 2475.00 NA
## 5 1.00 0.00 0.00 3784.00 NA
## 6 1.00 0.00 3711.00 NA
The data set covers flights originating and ending in the United States in 2015. The downloaded dataset included 450,000+ rows. As this was too much data for R to handle when running a logistic regression model and the random forest algorithm, the dataset was subset to include only the 10 busiest airports in the US by total passenger traffic.
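The subsetting step itself is not shown in this post. Here is a minimal sketch of one way it could be done, using a toy data frame in place of `flights`. Note the assumption: it ranks airports by the number of departing flights in the data as a proxy, whereas the analysis ranked them by total passenger traffic, which would need an external list. The names `toy_flights`, `top_n_airports`, and `busiest` are illustrative only.

```r
# Toy stand-in for the flights data (hypothetical values)
toy_flights <- data.frame(
  ORIGIN = c("ATL", "ATL", "ATL", "LAX", "LAX", "ORD", "BOS"),
  DEST   = c("LAX", "ORD", "BOS", "ATL", "ORD", "ATL", "ATL"),
  stringsAsFactors = FALSE
)

# Return the n airports with the most departures (a proxy for "busiest")
top_n_airports <- function(df, n) {
  counts <- sort(table(df$ORIGIN), decreasing = TRUE)
  names(counts)[seq_len(min(n, length(counts)))]
}

busiest <- top_n_airports(toy_flights, 2)

# Keep only flights departing from one of those airports
toy_subset <- toy_flights[toy_flights$ORIGIN %in% busiest, ]
print(busiest)
print(nrow(toy_subset))
```

With a real list of the ten busiest airports, the same `%in%` filter would apply with that vector of airport codes in place of `busiest`.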
Some of the columns are useless to our data analysis, so we NULL ’em.
# Clean data
flights$DEST_AIRPORT_SEQ_ID <- NULL
flights$UNIQUE_CARRIER <- NULL
flights$X <- NULL
flights$ORIGIN_AIRPORT_SEQ_ID <- NULL
Then we keep only the flights that have both an arrival and a departure delay indicator recorded:
# Keep flights with non-missing delay indicators
ontime <- flights[!is.na(flights$ARR_DEL15) & flights$ARR_DEL15!="" & !is.na(flights$DEP_DEL15) & flights$DEP_DEL15!="",]
Change the data class of the filtered data to enable data processing and running algorithms.
ontime$DEST_AIRPORT_ID <- as.factor(ontime$DEST_AIRPORT_ID)
ontime$ORIGIN_AIRPORT_ID <- as.factor(ontime$ORIGIN_AIRPORT_ID)
ontime$DAY_OF_WEEK <- as.factor(ontime$DAY_OF_WEEK)
ontime$DISTANCE <- as.integer(ontime$DISTANCE)
ontime$CANCELLED <- as.integer(ontime$CANCELLED)
ontime$DIVERTED <- as.integer(ontime$DIVERTED)
ontime$ORIGIN <- as.factor(ontime$ORIGIN)
ontime$DEP_TIME_BLK <- as.factor(ontime$DEP_TIME_BLK)
ontime$CARRIER <- as.factor(ontime$CARRIER)
ontime$ARR_DEL15 <- as.factor(ontime$ARR_DEL15)
ontime$DEP_DEL15 <- as.factor(ontime$DEP_DEL15)
ontime$DEST <- as.factor(ontime$DEST)
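The same kind of conversion can be written more compactly by applying `as.factor` to a vector of column names. This is only a sketch on made-up data: `toy` and `factor_cols` are hypothetical names, and the values are invented.

```r
# Toy stand-in with columns read in as character strings (hypothetical values)
toy <- data.frame(
  CARRIER  = c("AA", "DL"),
  DEST     = c("LAX", "JFK"),
  DISTANCE = c("2475.00", "760.00"),
  stringsAsFactors = FALSE
)

# Convert several columns to factors in one step
factor_cols <- c("CARRIER", "DEST")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Character strings such as "2475.00" convert to whole numbers
toy$DISTANCE <- as.integer(toy$DISTANCE)

str(toy)
```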
Since I love civil aviation in general, I knew what the data would say, but here are some random graphs nonetheless:
The following plot shows when flights are scheduled for every day of the week.
plot(ontime$DAY_OF_WEEK ~ ontime$DEP_TIME_BLK)
The following plot shows how far destination cities are from a given originating airport. This matters because the longer the flight, the more time airlines can make up in the air.
plot(ontime$DISTANCE ~ ontime$ORIGIN)
The following plot looks for delays by day of week. It’s evident that Sunday and Monday suffer from more delays than the rest of the week.
plot(ontime$ARR_DEL15 ~ ontime$DAY_OF_WEEK)
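To read the same comparison as numbers rather than off the plot, the share of delayed arrivals per weekday can be tabulated. Here is a small sketch on made-up data (the `toy` data frame and its values are hypothetical). On the real data the delayed level appears as `"1.00"` rather than `"1"`, as the `head(flights)` output shows, so the column selector would need to change accordingly.

```r
# Toy stand-in for the ontime data (hypothetical values)
toy <- data.frame(
  DAY_OF_WEEK = factor(c(1, 1, 1, 1, 2, 2, 7, 7)),
  ARR_DEL15   = factor(c(1, 1, 0, 0, 0, 0, 1, 0))
)

# Cross-tabulate weekday against the delay indicator
counts <- table(toy$DAY_OF_WEEK, toy$ARR_DEL15)

# margin = 1 normalises each row, giving per-weekday proportions;
# the "1" column is the share of delayed arrivals
delay_share <- prop.table(counts, margin = 1)[, "1"]
print(round(delay_share, 2))
```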
The following code shows the creation of the training and testing data.
# PART 2: TRAINING DATA
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(13)
# Select columns to be used in algorithm training
feature<- c("ARR_DEL15", "DAY_OF_WEEK", "CARRIER", "DEST","ORIGIN","DEP_TIME_BLK")
# Create a version of the ontime data containing only the selected columns
ontime_sorted <- ontime[,feature]
# Select data to put into training
training_index <- createDataPartition(ontime_sorted$ARR_DEL15, p=0.75, list=FALSE)
# Create training & testing dataset
training_data <- ontime_sorted[training_index,]
testing_data <- ontime_sorted[-training_index,]
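A held-out test set should share no rows with the training set, otherwise the reported accuracy measures memorisation rather than prediction. A quick sanity check for that, sketched in base R only: `sample()` stands in here for `createDataPartition()` (which additionally stratifies on the outcome), and `toy` is made-up data.

```r
set.seed(13)

# Toy stand-in for the modelling data (hypothetical values)
toy <- data.frame(id = 1:20, ARR_DEL15 = factor(rep(c(0, 1), 10)))

# Stand-in for createDataPartition(): a plain 75% random sample of row indices
train_idx <- sample(nrow(toy), size = 0.75 * nrow(toy))

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]   # negative index = every row NOT in training

# No row appears in both sets, and no row is lost
print(length(intersect(train$id, test$id)))
print(nrow(train) + nrow(test) == nrow(toy))
```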
I used two methods to evaluate the data and build a model.
The first is the logistic regression model.
# METHOD 1: Logistic Regression
log_reg_mod <- train(ARR_DEL15 ~ ., data = training_data, method = "glm", family = "binomial",
                     # 5-fold cross-validation; `repeats` only applies to method = "repeatedcv", so it is dropped
                     trControl = trainControl(method = "cv", number = 5))
# Predict
log_reg_predict <- predict(log_reg_mod, testing_data)
# Confusion matrix
confusion_matrix_reg <- confusionMatrix(log_reg_predict, testing_data[,"ARR_DEL15"])
confusion_matrix_reg
I’ve suppressed the evaluation of the above code, as it takes about 10 minutes to run through the huge dataset. Instead, here’s a screenshot of the results:
Results of Logistic Regression Model
The second is the random forest algorithm.
library(randomForest)
random_forest <- randomForest(training_data[-1], training_data$ARR_DEL15, proximity = TRUE, importance = TRUE)
random_forest
random_forest_validation <- predict(random_forest, testing_data)
# Confusion matrix
confusion_matrix_rf <- confusionMatrix(random_forest_validation, testing_data[,"ARR_DEL15"])
confusion_matrix_rf
I’ve suppressed the evaluation of the above code again; this one takes about 40 minutes to run through the huge dataset. The random forest algorithm improved accuracy over the logistic regression model.
The logistic regression model has an accuracy of 78%. The random forest algorithm improves on it by 6 percentage points, to 84%.
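Accuracy on its own can flatter a classifier when most flights are on time, so it helps to set it against the share of the majority class (what `confusionMatrix()` reports as the no-information rate). Here is a base-R sketch of that calculation on made-up labels. The vectors below are hypothetical and are not the results reported above.

```r
# Hypothetical true and predicted delay indicators
actual    <- factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 0, 0, 1, 1, 1, 0, 0), levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are the truth
cm <- table(Predicted = predicted, Actual = actual)

# Accuracy is the share of cases on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Baseline: always predict the most common class
baseline <- max(table(actual)) / length(actual)

print(cm)
print(c(accuracy = accuracy, baseline = baseline))
```

A model is only earning its keep to the extent that its accuracy exceeds that baseline.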
The model could be improved by adding factors such as weather, which could be pulled in through the API of a weather service. I also decided not to segregate by month, as the computation was already intensive with so much data to process. Segregating by month according to season should give better insights as well.
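As a rough sketch of the seasonal grouping idea, a month number can be mapped to a season with a lookup vector. Everything here is an assumption for illustration: the dataset used above has no month column, so `month` is a hypothetical vector, and the grouping follows northern-hemisphere meteorological seasons.

```r
# Hypothetical month values (1 = January ... 12 = December)
month <- c(1, 4, 7, 10, 12)

# Lookup vector: position i holds the season of month i (assumed grouping)
season_of <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
               "Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter")

# Index the lookup by month number and store the result as a factor
season <- factor(season_of[month],
                 levels = c("Winter", "Spring", "Summer", "Autumn"))
print(as.character(season))
```

The resulting factor could then be added to the feature list like any other categorical predictor.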