Techniques involved: sample splitting, linear regression, converting regression to classification, CART

On any given day, more than 87,000 flights take place in the United States alone. About one-third of these flights are commercial flights, operated by companies like United, American Airlines, and JetBlue. While about 80% of commercial flights take off and land as scheduled, the other 20% are delayed for various reasons. Some delays are unavoidable because of unexpected events, but others could be avoided if the factors causing them were better understood and addressed.

In this problem, we’ll use a dataset of 9,381 flights that occurred in June through August of 2014 between the three busiest US airports – Atlanta (ATL), Los Angeles (LAX), and Chicago (ORD) – to predict flight delays. The dataset AirlineDelay.csv includes the following 23 variables:

Flight = the origin-destination pair (LAX-ORD, ATL-LAX, etc.)

Carrier = the carrier operating the flight (American Airlines, Delta Air Lines, etc.)

Month = the month of the flight (June, July, or August)

DayOfWeek = the day of the week of the flight (Monday, Tuesday, etc.)

NumPrevFlights = the number of previous flights taken by this aircraft in the same day

PrevFlightGap = the amount of time between when this flight’s aircraft is scheduled to arrive at the airport and when it’s scheduled to depart for this flight

HistoricallyLate = the proportion of time this flight has been late historically

InsufficientHistory = whether or not we have enough data to determine the historical record of the flight (equal to 1 if we don’t have at least 3 records, equal to 0 if we do)

OriginInVolume = the amount of incoming traffic volume at the origin airport, normalized by the typical volume during the flight’s time and day of the week

OriginOutVolume = the amount of outgoing traffic volume at the origin airport, normalized by the typical volume during the flight’s time and day of the week

DestInVolume = the amount of incoming traffic volume at the destination airport, normalized by the typical volume during the flight’s time and day of the week

DestOutVolume = the amount of outgoing traffic volume at the destination airport, normalized by the typical volume during the flight’s time and day of the week

OriginPrecip = the amount of rain at the origin over the course of the day, in tenths of millimeters

OriginAvgWind = average daily wind speed at the origin, in miles per hour

OriginWindGust = fastest wind speed during the day at the origin, in miles per hour

OriginFog = whether or not there was fog at some point during the day at the origin (1 if there was, 0 if there wasn’t)

OriginThunder = whether or not there was thunder at some point during the day at the origin (1 if there was, 0 if there wasn’t)

DestPrecip = the amount of rain at the destination over the course of the day, in tenths of millimeters

DestAvgWind = average daily wind speed at the destination, in miles per hour

DestWindGust = fastest wind speed during the day at the destination, in miles per hour

DestFog = whether or not there was fog at some point during the day at the destination (1 if there was, 0 if there wasn’t)

DestThunder = whether or not there was thunder at some point during the day at the destination (1 if there was, 0 if there wasn’t)

TotalDelay = the amount of time the aircraft was delayed, in minutes (this is our dependent variable)

Load the data

setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Final")
Airlines <- read.csv("AirlineDelay.csv")

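Randomly split it into a training set (70% of the data) and a testing set (30% of the data)

Since our dependent variable is continuous, we don’t use the sample.split function, which is meant for stratifying on a categorical outcome. Instead, we draw a random 70% of the row indices with sample.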

set.seed(15071)
spl <- sample(nrow(Airlines), 0.7*nrow(Airlines))
AirlinesTrain <- Airlines[spl, ]
AirlinesTest <- Airlines[-spl, ]
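
As a quick sanity check on this kind of split, the sketch below repeats the same sample() pattern on a small synthetic data frame and confirms that the two pieces are disjoint and together cover every row. The toy data frame and its columns are made up for illustration; floor() is written out only to make the rounding explicit.

```r
# Hypothetical toy data standing in for Airlines
set.seed(15071)
toy <- data.frame(id = 1:100, delay = rexp(100, rate = 1 / 20))

# Draw 70% of the row indices without replacement
idx <- sample(nrow(toy), floor(0.7 * nrow(toy)))
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

nrow(toy_train)                                # 70
nrow(toy_test)                                 # 30
length(intersect(toy_train$id, toy_test$id))   # 0: no row lands in both sets
```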

Build a linear regression model to predict “TotalDelay” using all of the other variables as independent variables.

LR <- lm(TotalDelay ~., data = AirlinesTrain)
summary(LR)
## 
## Call:
## lm(formula = TotalDelay ~ ., data = AirlinesTrain)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -68.30 -16.77  -7.71   2.78 817.93 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -45.055881  10.107693  -4.458 8.43e-06 ***
## FlightATL-ORD                4.379016   2.175661   2.013  0.04418 *  
## FlightLAX-ATL                1.901867   1.967008   0.967  0.33364    
## FlightLAX-ORD                5.660679  12.596163   0.449  0.65316    
## FlightORD-ATL                6.303303   2.262117   2.786  0.00534 ** 
## FlightORD-LAX               11.006282  12.597601   0.874  0.38232    
## CarrierAmerican Airlines    -5.724074  13.025146  -0.439  0.66034    
## CarrierDelta Air Lines       0.355603   3.527356   0.101  0.91970    
## CarrierExpressJet Airlines   7.142202   6.574938   1.086  0.27740    
## CarrierSkyWest Airlines      5.527777   4.921477   1.123  0.26140    
## CarrierSouthwest Airlines    0.241642   3.924019   0.062  0.95090    
## CarrierUnited Airlines       1.148722  12.937041   0.089  0.92925    
## CarrierVirgin America       -5.505712  13.222651  -0.416  0.67714    
## MonthJuly                   -6.345533   1.279019  -4.961 7.18e-07 ***
## MonthJune                   -3.784569   1.333093  -2.839  0.00454 ** 
## DayOfWeekMonday             -0.810539   1.914973  -0.423  0.67212    
## DayOfWeekSaturday           -4.506943   2.065833  -2.182  0.02917 *  
## DayOfWeekSunday             -5.418356   1.944548  -2.786  0.00534 ** 
## DayOfWeekThursday            1.571501   1.937850   0.811  0.41742    
## DayOfWeekTuesday            -4.206489   2.011211  -2.092  0.03652 *  
## DayOfWeekWednesday           1.585338   1.953771   0.811  0.41715    
## NumPrevFlights               1.563247   0.504670   3.098  0.00196 ** 
## PrevFlightGap                0.015831   0.008055   1.965  0.04940 *  
## HistoricallyLate            47.913638   3.326901  14.402  < 2e-16 ***
## InsufficientHistory         13.510716   1.586589   8.516  < 2e-16 ***
## OriginInVolume               5.121318   4.874897   1.051  0.29350    
## OriginOutVolume              6.682176   6.209972   1.076  0.28195    
## DestInVolume                14.971479   6.439830   2.325  0.02011 *  
## DestOutVolume                1.221879   2.268822   0.539  0.59021    
## OriginPrecip                 0.019734   0.006278   3.143  0.00168 ** 
## OriginAvgWind               -0.656333   0.296051  -2.217  0.02666 *  
## OriginWindGust               0.948098   0.130843   7.246 4.78e-13 ***
## OriginFog                   -0.239182   1.666246  -0.144  0.88586    
## OriginThunder               -0.818011   3.184140  -0.257  0.79726    
## DestPrecip                   0.036874   0.006381   5.778 7.90e-09 ***
## DestAvgWind                 -0.282348   0.296227  -0.953  0.34055    
## DestWindGust                 0.351908   0.129804   2.711  0.00672 ** 
## DestFog                     -0.997796   1.665534  -0.599  0.54914    
## DestThunder                  1.363956   3.185359   0.428  0.66852    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.77 on 6527 degrees of freedom
## Multiple R-squared:  0.09475,    Adjusted R-squared:  0.08948 
## F-statistic: 17.98 on 38 and 6527 DF,  p-value: < 2.2e-16
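
The coefficient table above is long, so it can help to pull out only the terms that clear a significance threshold. A minimal sketch, using R's built-in mtcars data as a stand-in for AirlinesTrain (the stand-in model and the 0.05 cutoff are illustrative assumptions); the same lines should work with LR in place of fit.

```r
# Stand-in model on a built-in dataset; substitute LR for fit on the flight data
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# coef(summary(.)) is a matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

# Keep rows with a p-value below 0.05, sorted from most to least significant
signif_terms <- coefs[coefs[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
signif_terms <- signif_terms[order(signif_terms[, "Pr(>|t|)"]), , drop = FALSE]
print(signif_terms)
```

Significance is separate from fit quality: the flight model above has many significant terms, yet its multiple R-squared is only 0.09475.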

PREDICTIONS ON THE TEST SET

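# Predict delays for the held-out flights
LRpred <- predict(LR, newdata = AirlinesTest)
# Out-of-sample R-squared: the baseline always predicts the training-set mean
SSE <- sum((LRpred - AirlinesTest$TotalDelay)^2)
SST <- sum((mean(AirlinesTrain$TotalDelay) - AirlinesTest$TotalDelay)^2)
R_squared <- 1 - SSE/SST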
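
The out-of-sample R-squared above compares the model's squared error on the test set with that of a baseline that always predicts the training-set mean. A minimal sketch wrapping the same arithmetic in a helper; the name osr2 and the toy numbers are hypothetical.

```r
# Out-of-sample R-squared: 1 - SSE/SST, with SST measured against the training mean
osr2 <- function(pred, actual, train_mean) {
  sse <- sum((pred - actual)^2)
  sst <- sum((train_mean - actual)^2)
  1 - sse / sst
}

actual <- c(0, 10, 35, 5)

# Predicting the training mean for every row reproduces the baseline
r2_baseline <- osr2(rep(12, 4), actual, train_mean = 12)   # 0
# Perfect predictions leave no squared error
r2_perfect <- osr2(actual, actual, train_mean = 12)        # 1
```

On the flight data the call would be osr2(LRpred, AirlinesTest$TotalDelay, mean(AirlinesTrain$TotalDelay)). Unlike the in-sample R-squared, this value can be negative when the model does worse than the baseline on new data.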

A CLASSIFICATION PROBLEM

Let’s turn this problem into a multi-class classification problem by creating a new dependent variable. Our new dependent variable will take three different values: “No Delay” when TotalDelay is 0, “Major Delay” when it is at least 30 minutes, and “Minor Delay” otherwise.

Airlines$DelayClass <- factor(ifelse(Airlines$TotalDelay == 0, "No Delay", ifelse(Airlines$TotalDelay >= 30, "Major Delay", "Minor Delay")))
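
To see how the nested ifelse assigns the boundary cases, the sketch below applies the same rule to a few made-up delay values (the delays vector is hypothetical, not taken from the dataset).

```r
# Hypothetical delays in minutes, chosen to hit the boundaries at 0 and 30
delays <- c(0, 5, 29, 30, 120)

delay_class <- factor(ifelse(delays == 0, "No Delay",
                      ifelse(delays >= 30, "Major Delay", "Minor Delay")))

data.frame(delays, delay_class)
# 0 -> No Delay, 5 and 29 -> Minor Delay, 30 and 120 -> Major Delay

# factor() orders levels alphabetically, which fixes the row and column
# order of any table() built from this variable
levels(delay_class)   # "Major Delay" "Minor Delay" "No Delay"
```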

Remove the original dependent variable “TotalDelay” from the dataset

Airlines$TotalDelay <- NULL

Split the dataset

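library(caTools)
set.seed(15071)
# sample.split expects the outcome vector, not the whole data frame, so that it can stratify by class.
# The counts printed in the sections below came from a split on the full data frame, so rerunning this call gives slightly different numbers.
spl <- sample.split(Airlines$DelayClass, SplitRatio = 0.7)
train <- subset(Airlines, spl == TRUE)
test <- subset(Airlines, spl == FALSE)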
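
A stratified split aims to keep the class proportions the same in the training and testing sets, which requires one TRUE/FALSE flag per row computed from the outcome vector. The sketch below illustrates the idea in base R. It is not the caTools implementation, and the helper name stratified_flags and the toy labels are hypothetical.

```r
# Hypothetical helper: flag roughly `ratio` of the rows within each class as TRUE
stratified_flags <- function(y, ratio) {
  flags <- logical(length(y))
  for (cls in unique(as.character(y))) {
    rows <- which(y == cls)
    n_take <- round(ratio * length(rows))
    # Index into rows so that a class with a single row is handled correctly
    flags[rows[sample.int(length(rows), n_take)]] <- TRUE
  }
  flags
}

set.seed(15071)
y <- factor(rep(c("Major Delay", "Minor Delay", "No Delay"), times = c(20, 30, 50)))
in_train <- stratified_flags(y, 0.7)

length(in_train)      # 100: one flag per row
table(y, in_train)    # each class is split 70/30
```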

Build a CART model

library(rpart)
library(rpart.plot)
CARTmodel <- rpart(DelayClass ~., data = train, method = "class")
prp(CARTmodel)
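
Besides plotting, a fitted tree can be inspected numerically. A minimal sketch on R's built-in iris data, used here as a stand-in for train (the stand-in model is an assumption for illustration); the same calls should apply to CARTmodel.

```r
library(rpart)

# Stand-in classification tree on a built-in dataset
tree <- rpart(Species ~ ., data = iris, method = "class")

# type = "class" returns one predicted label per row
class_pred <- predict(tree, type = "class")
head(class_pred)

# type = "prob" returns a matrix with one column of class probabilities per level
prob_pred <- predict(tree, type = "prob")
head(prob_pred)

# Complexity table: one row per candidate tree size
printcp(tree)
```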

TRAINING SET ACCURACY

CARTpred_train <- predict(CARTmodel, type = "class")
table(train$DelayClass, CARTpred_train)
##              CARTpred_train
##               Major Delay Minor Delay No Delay
##   Major Delay           0         298      812
##   Minor Delay           0         334     1833
##   No Delay              0         163     3086
(0+334+3086)/nrow(train)
## [1] 0.5240576
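
The accuracy above is the sum of the confusion matrix's diagonal divided by the total count. The sketch below rebuilds the matrix from the printed output and also computes, for each actual class, the share of flights labeled correctly. The variable names are illustrative.

```r
# Training confusion matrix copied from the output above
# (rows = actual class, columns = predicted class)
classes <- c("Major Delay", "Minor Delay", "No Delay")
cm_train <- matrix(c(0, 298,  812,
                     0, 334, 1833,
                     0, 163, 3086),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(actual = classes, predicted = classes))

# Overall accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(cm_train)) / sum(cm_train)
round(accuracy, 7)    # 0.5240576, matching the value above

# Per-class recall: correct predictions over the actual count of each class
recall <- diag(cm_train) / rowSums(cm_train)
recall
```

The "Major Delay" column is all zeros, so the tree never predicts that class and its recall is 0.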

BASELINE MODEL (always predict the most frequent class, “No Delay”)

table(train$DelayClass)
## 
## Major Delay Minor Delay    No Delay 
##        1110        2167        3249
3249/nrow(train)
## [1] 0.4978547

TESTING SET ACCURACY

CARTpred <- predict(CARTmodel, newdata = test, type = "class")
table(test$DelayClass, CARTpred)
##              CARTpred
##               Major Delay Minor Delay No Delay
##   Major Delay           0         124      363
##   Minor Delay           0         135      794
##   No Delay              0          79     1360
(0+135+1360)/nrow(test)
## [1] 0.5236427
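
For a like-for-like comparison, the baseline of always predicting “No Delay” can also be scored on the test set. The sketch below derives it from the test confusion matrix printed above; the variable names are illustrative.

```r
# Test confusion matrix copied from the output above
# (rows = actual class, columns = predicted class)
classes <- c("Major Delay", "Minor Delay", "No Delay")
cm_test <- matrix(c(0, 124,  363,
                    0, 135,  794,
                    0,  79, 1360),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(actual = classes, predicted = classes))

test_accuracy <- sum(diag(cm_test)) / sum(cm_test)

# Baseline accuracy: the share of test flights whose actual class is "No Delay"
test_baseline <- sum(cm_test["No Delay", ]) / sum(cm_test)

round(c(CART = test_accuracy, baseline = test_baseline), 4)
test_accuracy - test_baseline   # improvement of CART over the baseline
```

With these counts the tree beats the test-set baseline by roughly two percentage points (1495 versus 1439 correct out of 2855).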