In the world of air travel, 2022 was a year of air travel defined by cancellations. While 2023 has seen a decrease in cancellations, FlightAware reports an increase in delay time from last year.
Per federal data, most flight delays in recent years have been caused by issues within the control of airlines.
Discovering and understanding which specific issues cause delays could inform airlines’ priorities, and could mitigate delay time moving forward.
The flight delay data has 3593 observations of 11 variables. They are:
Carrier: The airline
Airport_Distance: Distance between two airports
Number_of_Flights: Total number of flights in the
airport
Weather: Weather condition, measured on a scale from 0
(mild) to 10 (extreme)
Support_Crew_Available: Number of support crew
Baggage_loading_time: Time in minutes spent loading
baggage
Late_Arrival_o: Time in minutes the plane arrived
late
Cleaning_o: Time in minutes spent cleaning the
aircraft
Fueling_o: Time in minutes spent fueling the
aircraft
Security_o: Time in minutes spent in security
checking
Arr_Delay: Flight delay in minutes. This is the
dependent variable of the dataset.
For logistic regression, an additional variable,
Hour_Delay, will be created to see if a plane is 1 hour
late or more.
A copy of this publicly available data is stored at https://pengdsci.github.io/datasets/FlightDelay/Flight_delay-data.csv.
flight<-read.csv("https://pengdsci.github.io/datasets/FlightDelay/Flight_delay-data.csv")
flight.hour<-flight
flight.hour$Hour_Delay = 0
flight.hour$Hour_Delay<-ifelse(flight.hour$Arr_Delay > 59, 1, 0)
flight.hour<-flight.hour[-c(11)]
First, as the majority of the variables are numerical, a summary stats overview is warranted.
summary(flight)
## Carrier Airport_Distance Number_of_flights Weather
## Length:3593 Min. :376.0 Min. :29475 Min. :5.000
## Class :character 1st Qu.:431.0 1st Qu.:41634 1st Qu.:5.000
## Mode :character Median :443.0 Median :43424 Median :5.000
## Mean :442.4 Mean :43311 Mean :5.353
## 3rd Qu.:454.0 3rd Qu.:45140 3rd Qu.:6.000
## Max. :499.0 Max. :53461 Max. :6.000
## Support_Crew_Available Baggage_loading_time Late_Arrival_o Cleaning_o
## Min. : 0 Min. :14.00 Min. :15.00 Min. :-4.00
## 1st Qu.: 56 1st Qu.:17.00 1st Qu.:18.00 1st Qu.: 8.00
## Median : 83 Median :17.00 Median :19.00 Median :10.00
## Mean : 85 Mean :16.98 Mean :18.74 Mean :10.02
## 3rd Qu.:112 3rd Qu.:17.00 3rd Qu.:19.00 3rd Qu.:12.00
## Max. :222 Max. :19.00 Max. :22.00 Max. :23.00
## Fueling_o Security_o Arr_Delay
## Min. :13.00 Min. :13.00 Min. : 0.0
## 1st Qu.:23.00 1st Qu.:32.00 1st Qu.: 49.0
## Median :25.00 Median :37.00 Median : 70.0
## Mean :25.01 Mean :37.09 Mean : 69.8
## 3rd Qu.:27.00 3rd Qu.:42.00 3rd Qu.: 90.0
## Max. :36.00 Max. :63.00 Max. :180.0
Amazingly, there are no missing values. Of note is the extreme
variance within the Support_Crew_Available and
Number_of_Flights variables and minimal variance within the
weather category.
While the summary of variables like Airport Distance do not appear to cause alarm, the high variance of some stats suggests the possibility of outliers. While outliers are expected, it may be worth identifying and individually reviewing outliers to see if there could have been an inputting error.
testsup<-rosnerTest(flight$Support_Crew_Available, k = 5)
testnum<-rosnerTest(flight$Number_of_flights, k = 5)
testdist<-rosnerTest(flight$Airport_Distance, k = 5)
testbag<-rosnerTest(flight$Baggage_loading_time, k = 5)
The test of outliers for number of flights returns 3 possible outliers - observations 1652, 3163, and 2729. The delay for each observation is, 0. While these observations do not appear to be errors, they make a strong suggestion that number of flights could be a good predictor for delay time. Similarly, the baggage time also shows 2729 and 3163 as low outliers.
As there are multiple variables related to airplane prep (fueling, cleaning, baggage loading, and security), it may be worth checking for possible collinearity between two predictor variables. If one can be identified as redundant and removed, it will save us time and computational power.
pcor(flight[,2:11], method = "pearson")
As no variable appears to have a high correlation with another, there is no preliminary evidence of multicollinearity, so all variables could potentially be included into the model.
#Preliminary Variable Investigations
We will use the lm function for a quick analysis of what to include.
fit<-lm(Arr_Delay~Carrier+Airport_Distance+Number_of_flights+Weather+Support_Crew_Available+Baggage_loading_time+Late_Arrival_o+Cleaning_o+Fueling_o+Security_o, data=flight)
summary(fit)
##
## Call:
## lm(formula = Arr_Delay ~ Carrier + Airport_Distance + Number_of_flights +
## Weather + Support_Crew_Available + Baggage_loading_time +
## Late_Arrival_o + Cleaning_o + Fueling_o + Security_o, data = flight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.824 -8.308 -0.549 8.154 69.985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.746e+02 7.643e+00 -75.177 <2e-16 ***
## CarrierAA -3.068e-01 1.375e+00 -0.223 0.823
## CarrierAS 1.306e+00 4.133e+00 0.316 0.752
## CarrierB6 -1.769e+00 1.318e+00 -1.342 0.180
## CarrierDL -1.756e+00 1.350e+00 -1.301 0.193
## CarrierEV -1.724e+00 1.362e+00 -1.266 0.206
## CarrierF9 -9.615e-01 4.573e+00 -0.210 0.833
## CarrierFL 6.205e-01 2.247e+00 0.276 0.782
## CarrierHA 1.714e+00 6.352e+00 0.270 0.787
## CarrierMQ -6.700e-01 1.421e+00 -0.472 0.637
## CarrierUA -1.507e+00 1.313e+00 -1.147 0.251
## CarrierUS -8.850e-01 1.582e+00 -0.559 0.576
## CarrierVX 1.462e-01 2.122e+00 0.069 0.945
## CarrierWN -1.905e+00 1.673e+00 -1.139 0.255
## Airport_Distance 1.727e-01 1.356e-02 12.742 <2e-16 ***
## Number_of_flights 4.425e-03 1.088e-04 40.665 <2e-16 ***
## Weather 4.437e+00 4.554e-01 9.742 <2e-16 ***
## Support_Crew_Available -4.894e-02 5.333e-03 -9.176 <2e-16 ***
## Baggage_loading_time 1.350e+01 4.395e-01 30.718 <2e-16 ***
## Late_Arrival_o 6.911e+00 3.336e-01 20.714 <2e-16 ***
## Cleaning_o 5.481e-02 6.002e-02 0.913 0.361
## Fueling_o -6.140e-02 5.960e-02 -1.030 0.303
## Security_o 7.788e-03 2.958e-02 0.263 0.792
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.45 on 3570 degrees of freedom
## Multiple R-squared: 0.8195, Adjusted R-squared: 0.8184
## F-statistic: 736.9 on 22 and 3570 DF, p-value: < 2.2e-16
This model suggests that Carrier, Cleaning time, Fueling time, and Security time are not needed in the final model.
fit2<-lm(Arr_Delay~Airport_Distance+Number_of_flights+Weather+Support_Crew_Available+Baggage_loading_time+Late_Arrival_o, data=flight)
summary(fit2)
##
## Call:
## lm(formula = Arr_Delay ~ Airport_Distance + Number_of_flights +
## Weather + Support_Crew_Available + Baggage_loading_time +
## Late_Arrival_o, data = flight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.344 -8.247 -0.564 8.080 69.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.767e+02 7.269e+00 -79.334 <2e-16 ***
## Airport_Distance 1.735e-01 1.352e-02 12.838 <2e-16 ***
## Number_of_flights 4.434e-03 1.085e-04 40.868 <2e-16 ***
## Weather 4.463e+00 4.542e-01 9.826 <2e-16 ***
## Support_Crew_Available -4.892e-02 5.313e-03 -9.209 <2e-16 ***
## Baggage_loading_time 1.347e+01 4.380e-01 30.751 <2e-16 ***
## Late_Arrival_o 6.898e+00 3.327e-01 20.731 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.45 on 3586 degrees of freedom
## Multiple R-squared: 0.8189, Adjusted R-squared: 0.8186
## F-statistic: 2703 on 6 and 3586 DF, p-value: < 2.2e-16
We are making a separate dataset to compare future transformations.
model1<-flight[,c(2, 3, 4, 5, 6, 7, 11)]
model2<-flight[,c(2, 3, 4, 5, 6, 7, 11)]
To see how our initial model does, we will graph it compared to residuals.
fit3<-lm(Arr_Delay~Airport_Distance+Number_of_flights+Weather+Support_Crew_Available+Baggage_loading_time+Late_Arrival_o, data=model1)
summary(fit3)
##
## Call:
## lm(formula = Arr_Delay ~ Airport_Distance + Number_of_flights +
## Weather + Support_Crew_Available + Baggage_loading_time +
## Late_Arrival_o, data = model1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.344 -8.247 -0.564 8.080 69.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.767e+02 7.269e+00 -79.334 <2e-16 ***
## Airport_Distance 1.735e-01 1.352e-02 12.838 <2e-16 ***
## Number_of_flights 4.434e-03 1.085e-04 40.868 <2e-16 ***
## Weather 4.463e+00 4.542e-01 9.826 <2e-16 ***
## Support_Crew_Available -4.892e-02 5.313e-03 -9.209 <2e-16 ***
## Baggage_loading_time 1.347e+01 4.380e-01 30.751 <2e-16 ***
## Late_Arrival_o 6.898e+00 3.327e-01 20.731 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.45 on 3586 degrees of freedom
## Multiple R-squared: 0.8189, Adjusted R-squared: 0.8186
## F-statistic: 2703 on 6 and 3586 DF, p-value: < 2.2e-16
res<-resid(fit3)
plot(model1$Arr_Delay, res, ylab="Residuals", xlab="Arrival Delay")
abline(0,0)
While the residuals appear to be in a slightly upward line, what stands out the most is the large inaccuracy at predicting little to no delay.
Since there are two instances of flights with outliers of both baggage and number of flights, we will see if replacing those variables with bins of values has an impact on the model. we could not find any precedent for this online, so we will use the quartile ranges of each to split each variable into ‘low’, ‘middle’ and ‘high’.
model2$Number_of_flights <- ifelse(model2$Number_of_flights < 41634, 'Low',
ifelse(model2$Number_of_flights >= 45140, 'High', 'Middle'))
model2$Baggage_loading_time <- ifelse(model2$Baggage_loading_time < 17, 'Low',
ifelse(model2$Number_of_flights >= 18, 'High', 'Middle'))
fit4<-lm(Arr_Delay~Airport_Distance+Number_of_flights+Weather+Support_Crew_Available+Baggage_loading_time+Late_Arrival_o, data=model2)
res2<-resid(fit4)
plot(model2$Arr_Delay, res2, ylab="Residuals", xlab="Arrival Delay")
abline(0,0)
deviance(fit3)
## [1] 555449.5
deviance(fit4)
## [1] 743704.8
While this new model has changed the shape of the residual plot, it has created a much higher sum of squared residuals - which is to say, there is more error in this model! We will go back to the older model, without bins. This is computationally advantageous anyway, being .1 MB smaller.
As every remaining variable in our model is numeric, we will look at pairwise associations of all variables, looking mostly at the correlation with the variable of interest, Arr_Delay.
ggpairs(model1,
columns = 1:7)
This output shows the strength of some correlation between variables (specifically baggage loading, late arrival, and number of flights) and the variable of interest. These variables make a strong case for their inclusion in our predictive model.
#Building a Model
As we are attempting to predict the probability of a delay of one hour or more, our response variable can only take values between 0 and 1. The predictive model most appropriate for this will be a logistic regression, following a form of:
\(P(Y) = (e^\delta)/1+e^\delta\)
Where \(\delta\) is equal to:
\(\beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + ... + \beta_{k}x_{k}\)
Where each \(\beta\) represents the coefficient associated with each variable (\(x_{k}\)).
The initial model is simply using every variable given in the initial dataframe. While very possibly accurate, this model is quite large and lacks transformations that possibly could assist with predictive modeling.
initial.model<-glm(Hour_Delay~. , family = binomial, data=flight.hour, na.action = na.exclude)
plot(initial.model$fitted.values, initial.model$residuals, ylab="Residuals", xlab="Predicted Probability of Hour+ Delay")
abline(0,0)
deviance(initial.model)
## [1] 1969.465
This model will be good to have as a baseline comparison for other candidate models.
Our initial model residuals appear to follow a cubic line, so the inclusion of a second or third order term may help.
This can help mitigate errors with residuals, as seen above. Only numerical variables will be transformed, as mathematical functions will not work well with categorical variables such as the name of the airline!
quad.df<-flight.hour
quad.df$flights2<-quad.df$Number_of_flights**2
quad.df$bag2<-quad.df$Baggage_loading_time**2
quad.df$dist2<-quad.df$Airport_Distance**2
quad.df$weather2<-quad.df$Weather**2
quad.df$security2<-quad.df$Security_o**2
quad.df$support2<-quad.df$Support_Crew_Available**2
quad.df$late2<-quad.df$Late_Arrival_o**2
quad.df$clean2<-quad.df$Cleaning_o**2
quad.df$fuel2<-quad.df$Fueling_o**2
quad.model<-glm(Hour_Delay~. , family = binomial , data=quad.df)
plot(quad.model$fitted.values, quad.model$residuals, ylab="Residuals", xlab="Predicted Probability of Hour+ Delay")
abline(0,0)
deviance(quad.model)
## [1] 1957.575
And because of the extremely obvious cubic shape to the residual plot of our initial model, we will try some cubed variables.
cube.df<-quad.df
cube.df$flights3<-cube.df$Number_of_flights**3
cube.df$bag3<-cube.df$Baggage_loading_time**3
cube.df$dist3<-cube.df$Airport_Distance**3
cube.df$weather3<-cube.df$Weather**3
cube.df$security3<-cube.df$Security_o**3
cube.df$support3<-cube.df$Support_Crew_Available**3
cube.df$late3<-cube.df$Late_Arrival_o**3
cube.df$clean3<-cube.df$Cleaning_o**3
cube.df$fuel3<-cube.df$Fueling_o**3
cube.model<-glm(Hour_Delay~. , family = binomial , data=cube.df)
plot(cube.model$fitted.values, cube.model$residuals, ylab="Residuals", xlab="Predicted Probability of Hour+ Delay")
abline(0,0)
deviance(cube.model)
## [1] 1946.978
The cubic and quadratic models both, somehow, reduce error slightly despite having a single residual that is over 300 off.
To identify the optimal model, we will use the ‘step’ function in R to eliminate redundant variables in each of our model sets.
First, we will apply various selection methods to our initial dataframe.
model.reg.both<-step(initial.model, direction = "both" , trace = 1)
Both selection processes narrowed down the model to Security_o, Airport_Distance, Number_of_Flights, Weather, Support_Crew_Available, Baggage_Loading_Time, and Late_arrival_o. This suggests that in a candidate model with no transformations or additional terms, those variables should be included.
Next, our quadratic and cubic models
model.quad.both<-step(quad.model, direction = "both" , trace = 1)
model.cube.both<-step(cube.model, direction = "both" , trace = 1)
The model with squared terms settled on: Weather, support crew, baggage loading, late arrival, security, number of flights^2, baggage time^2, and distance^2.
With cubed terms: Weather, support crew, late arrival, security, baggage time^2, distance^2, security^2, number of flights^3, baggage time^3, and security^3.
We do not want a final model to contain the same variable at multiple levels, but since this may indicate a strong connection between the reiterated terms and the variable of interest, we will create a reduced model with those.
Based on the previous results, we have 2 Candidate models, as well as our initial model:
can.reg<-glm(Hour_Delay ~ Security_o + Airport_Distance + Number_of_flights + Support_Crew_Available + Weather + Baggage_loading_time + Late_Arrival_o , family = binomial, data=flight.hour)
can.min<-glm(Hour_Delay ~ Airport_Distance + Number_of_flights + Baggage_loading_time + Security_o , family = binomial , data=flight.hour)
We will look at the significance of each coefficient, of each model, as an initial comparison:
full.table=summary(initial.model)$coef
kable(full.table, caption = "Significance of each coefficient in the full model")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -102.9257623 | 4.2859982 | -24.0144202 | 0.0000000 |
| CarrierAA | 0.0628418 | 0.3963437 | 0.1585538 | 0.8740205 |
| CarrierAS | -0.2516016 | 1.4132306 | -0.1780330 | 0.8586971 |
| CarrierB6 | -0.1677252 | 0.3783899 | -0.4432602 | 0.6575776 |
| CarrierDL | -0.3374653 | 0.3865832 | -0.8729433 | 0.3826940 |
| CarrierEV | -0.2168769 | 0.3862913 | -0.5614335 | 0.5745020 |
| CarrierF9 | -1.2389646 | 1.3356440 | -0.9276159 | 0.3536068 |
| CarrierFL | -0.8452742 | 0.6406167 | -1.3194695 | 0.1870122 |
| CarrierHA | -3.8793661 | 4.6055613 | -0.8423221 | 0.3996077 |
| CarrierMQ | -0.3142455 | 0.4121902 | -0.7623799 | 0.4458333 |
| CarrierUA | -0.2084461 | 0.3752656 | -0.5554627 | 0.5785782 |
| CarrierUS | -0.1524217 | 0.4372685 | -0.3485770 | 0.7274069 |
| CarrierVX | -0.4453947 | 0.5916292 | -0.7528275 | 0.4515536 |
| CarrierWN | -0.2453927 | 0.4772950 | -0.5141321 | 0.6071596 |
| Airport_Distance | 0.0272436 | 0.0038857 | 7.0111599 | 0.0000000 |
| Number_of_flights | 0.0006873 | 0.0000378 | 18.2063162 | 0.0000000 |
| Weather | 0.6592219 | 0.1265406 | 5.2095670 | 0.0000002 |
| Support_Crew_Available | -0.0081644 | 0.0014465 | -5.6444012 | 0.0000000 |
| Baggage_loading_time | 2.4382013 | 0.1787911 | 13.6371526 | 0.0000000 |
| Late_Arrival_o | 0.9455250 | 0.0954982 | 9.9009740 | 0.0000000 |
| Cleaning_o | 0.0095363 | 0.0163363 | 0.5837477 | 0.5593900 |
| Fueling_o | -0.0051694 | 0.0161900 | -0.3192934 | 0.7495040 |
| Security_o | 0.0140206 | 0.0080607 | 1.7393731 | 0.0819692 |
reg.table=summary(can.reg)$coef
kable(reg.table, caption = "Significance of each coefficient in the reduced model")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -102.6831868 | 4.2206844 | -24.328563 | 0.0000000 |
| Security_o | 0.0138565 | 0.0080020 | 1.731626 | 0.0833401 |
| Airport_Distance | 0.0268672 | 0.0038634 | 6.954268 | 0.0000000 |
| Number_of_flights | 0.0006841 | 0.0000375 | 18.238906 | 0.0000000 |
| Support_Crew_Available | -0.0082071 | 0.0014400 | -5.699314 | 0.0000000 |
| Weather | 0.6593858 | 0.1260339 | 5.231814 | 0.0000002 |
| Baggage_loading_time | 2.4304099 | 0.1778089 | 13.668666 | 0.0000000 |
| Late_Arrival_o | 0.9436507 | 0.0945881 | 9.976420 | 0.0000000 |
min.table=summary(can.min)$coef
kable(min.table, caption = "Significance of each coefficient in the minimal model")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -91.1238012 | 3.7056368 | -24.590592 | 0.0000000 |
| Airport_Distance | 0.0290380 | 0.0036613 | 7.931082 | 0.0000000 |
| Number_of_flights | 0.0007462 | 0.0000359 | 20.793681 | 0.0000000 |
| Baggage_loading_time | 2.7358787 | 0.1717724 | 15.927352 | 0.0000000 |
| Security_o | 0.0151718 | 0.0076559 | 1.981719 | 0.0475107 |
As expected, the full model shows many variables with insignificant p values. The minimal model, while exclusively significant values, does not have as many strong predictors as the mid-sized model. The mid-sized model has Security time not achieving statistical significance, but is fairly close.
As expected, almost all variables show a positive relationship with probability of longer delays. This should not come as a surprise, as most variables are number of minutes spent at some stage before a flight takes off (security, baggage loading time, etc). The number of support staff is negatively correlated, which also makes sense, as more hands available should lead to faster performance of those (and other, possibly unrecorded) tasks.
To test the similarity of the candidates, we will create a sample dataset by choosing values for each variable. We will then use R to predict the Delay for each sample flight in both the initial and candidate model. We will then compare the two models’ results.
samp.data=data.frame(Carrier=c("UA", "AA", "DL", "EV", "VX"), Airport_Distance=c(350, 500, 445, 430, 450), Number_of_flights=c(30123, 25001, 43333, 55757, 45454), Weather=c(5, 5, 2, 6, 9), Support_Crew_Available=c(17, 300, 85, 92, 71), Baggage_loading_time=c(14.5, 20, 16, 12, 17), Late_Arrival_o=c(10, 20, 15, 19, 18), Cleaning_o=c(25, 6, 8, 10, 10), Fueling_o=c(19.5, 22, 25, 27, 24), Security_o=c(40, 38, 18, 40, 26) )
pred.ini=predict(initial.model, newdata=samp.data , type = "response")
cut.off.prob = 0.5
pred.response.ini = ifelse(pred.ini > cut.off.prob, 1, 0)
pred.can=predict(can.reg, newdata=samp.data , type = "response")
cut.off.prob = 0.5
pred.response.can = ifelse(pred.can > cut.off.prob, 1, 0)
combined.table=cbind(samp.data, Initial_Model=pred.response.ini , Candidate_Model=pred.response.can)
kable(combined.table, caption = "Predicted Delay Time of Each Model Based on Sample Data")
| Carrier | Airport_Distance | Number_of_flights | Weather | Support_Crew_Available | Baggage_loading_time | Late_Arrival_o | Cleaning_o | Fueling_o | Security_o | Initial_Model | Candidate_Model |
|---|---|---|---|---|---|---|---|---|---|---|---|
| UA | 350 | 30123 | 5 | 17 | 14.5 | 10 | 25 | 19.5 | 40 | 0 | 0 |
| AA | 500 | 25001 | 5 | 300 | 20.0 | 20 | 6 | 22.0 | 38 | 0 | 0 |
| DL | 445 | 43333 | 2 | 85 | 16.0 | 15 | 8 | 25.0 | 18 | 0 | 0 |
| EV | 430 | 55757 | 6 | 92 | 12.0 | 19 | 10 | 27.0 | 40 | 0 | 0 |
| VX | 450 | 45454 | 9 | 71 | 17.0 | 18 | 10 | 24.0 | 26 | 1 | 1 |
As demonstrated in the table, the two models have identical predictions despite very extreme x values. It is our belief that the candidate model is ideal, as it generates comparable predictions at a fraction of the memory. This advantage will only continue to grow as larger datasets are used with this same model.
can.df<-subset(flight.hour , select=c('Hour_Delay' , 'Security_o' , 'Airport_Distance' , 'Number_of_flights' , 'Support_Crew_Available' , 'Weather' , 'Baggage_loading_time' , 'Late_Arrival_o'))
A copy of this data is now publicly available at: https://raw.githubusercontent.com/AlexDragonetti/flightcandidate/main/can.csv
To train our model, we will only a portion of the data to determine an optimal cut-off probability and then test it on the remaining portion.
First, we need to create separate dataframes for testing and training purposes.
nn = dim(can.df)[1]
train.id = sample(1:nn, round(nn*0.7), replace = FALSE)
training = can.df[train.id,]
testing = can.df[-train.id,]
Next, we ‘train’ the model by finding the most accurate cutoff probability. We do this by splitting our training data into 5 ‘folds’. Next, we use different combinations of those folds to test 20 different cutoff points. Finally, we see which cutoff point had the highest combined probability in each combination. This is called cross-validation.
n0 = dim(training)[1]/5
cut.0ff.prob = seq(0,1, length = 22)[-c(1,22)]
pred.accuracy = matrix(0,ncol=20, nrow=5, byrow = T)
for (i in 1:5){
valid.id = ((i-1)*n0 + 1):(i*n0)
valid.data = training[valid.id,]
train.data = training[-valid.id,]
train.model = glm(Hour_Delay ~ Security_o + Number_of_flights + Airport_Distance + Support_Crew_Available + Weather + Baggage_loading_time + Late_Arrival_o , family = binomial(link = logit), data = can.df)
newdata = data.frame(Security_o = valid.data$Security_o , Number_of_flights=valid.data$Number_of_flights , Airport_Distance=valid.data$Airport_Distance , Support_Crew_Available=valid.data$Support_Crew_Available , Weather=valid.data$Weather , Baggage_loading_time=valid.data$Baggage_loading_time , Late_Arrival_o=valid.data$Late_Arrival_o )
pred.prob = predict.glm(train.model, newdata, type = "response")
for(j in 1:20){
pred.status = rep(0,length(pred.prob))
valid.data$pred.status = as.numeric(pred.prob >cut.0ff.prob[j])
a11 = sum(valid.data$pred.status == valid.data$Hour_Delay)
pred.accuracy[i,j] = a11/length(pred.prob)
}
}
avg.accuracy = apply(pred.accuracy, 2, mean)
max.id = which(avg.accuracy ==max(avg.accuracy ))
tick.label = as.character(round(cut.0ff.prob,2))
plot(1:20, avg.accuracy, type = "b",
xlim=c(1,20),
ylim=c(0.5,1),
axes = FALSE,
xlab = "Cut-off Probability",
ylab = "Accuracy",
main = "5-fold CV performance"
)
axis(1, at=1:20, label = tick.label, las = 2)
axis(2)
segments(max.id, 0.5, max.id, avg.accuracy[max.id], col = "purple")
text(max.id, avg.accuracy[max.id]+0.03, as.character(round(avg.accuracy[max.id],4)), col = "purple", cex = 0.8)
The above shows the model’s optimal cutoff probability and accuracy.
We now will test this cutoff probability against the separate testing dataframe. This probability should be different from our previous probabilities, as the testing data has not been used in its calculation.
test.model = glm(Hour_Delay ~ Security_o + Airport_Distance + Number_of_flights + Support_Crew_Available + Weather + Baggage_loading_time + Late_Arrival_o , family = binomial (link=logit), data = training)
newdata = data.frame(Hour_Delay= testing$Hour_Delay , Security_o = testing$Security_o , Number_of_flights=testing$Number_of_flights , Airport_Distance=testing$Airport_Distance , Support_Crew_Available=testing$Support_Crew_Available , Weather=testing$Weather , Baggage_loading_time=testing$Baggage_loading_time , Late_Arrival_o=testing$Late_Arrival_o)
pred.prob.test = predict.glm(test.model, newdata, type = "response")
testing$test.status = as.numeric(pred.prob.test > 0.48)
a11 = sum(testing$test.status == testing$Hour_Delay)
test.accuracy = a11/length(pred.prob.test)
kable(as.data.frame(test.accuracy), align='c')
| test.accuracy |
|---|
| 0.864564 |
Our model’s accuracy, when applied to the testing datasubset is .8682. This means that roughly 87% of it’s predictions were correct.
Specificity and Sensitivity are two important pieces of information about our predictive model. The Sensitivity is the percentage of positive predictions that are correct: “How many of our predicted yeses are correct?”
The Specificity answers the opposite. It is the percentage of correct negative guesses: “How many of our predicted nos are correct?”
We will use the newdata dataframe generated in the
previous section to split our guesses into four groups:
p0.a0 - the number of correct nos p0.a1 -
the number of incorrect nos p1.a0 - the number of incorrect
yeses p1.a1 - the number of correct yeses
pred.prob.test = predict.glm(test.model, newdata, type = "response")
testing$test.status = as.numeric(pred.prob.test > 0.52)
p0.a0 = sum(testing$test.status ==0 & testing$Hour_Delay==0)
p0.a1 = sum(testing$test.status ==0 & testing$Hour_Delay ==1)
p1.a0 = sum(testing$test.status ==1 & testing$Hour_Delay==0)
p1.a1 = sum(testing$test.status ==1 & testing$Hour_Delay ==1)
sensitivity = p1.a1 / (p1.a1 + p0.a1)
specificity = p0.a0 / (p0.a0 + p1.a0)
precision = p1.a1 / (p1.a1 + p1.a0)
recall = sensitivity
F1 = 2*precision*recall/(precision + recall)
metric.list = cbind(sensitivity = sensitivity,
specificity = specificity,
precision = precision,
recall = recall,
F1 = F1)
kable(as.data.frame(metric.list), align='c', caption = "Local performance metrics")
| sensitivity | specificity | precision | recall | F1 |
|---|---|---|---|---|
| 0.908284 | 0.7960199 | 0.8821839 | 0.908284 | 0.8950437 |
From this, we can see that our sensitivity is about 91%, while our specificity is only 81%. Our model is better at predicting the presence of an hour delay than the lack of an hour delay. If we wanted to adjust the cutoff point to increase the specificity, we could use an ROC curve to establish an ideal point while factoring in both sensitivity and specificity.
An ROC curve can help us make decisions about a cutoff point. While our cutoff point is optimal for success, we may prefer a cutoff point with a higher sensitivity (correct designation of positive, which is in this case, the presence of an hour+ delay) or specificity (correct prediction of a negative (delay under an hour.)).
cut.off.seq = seq(0.01, 0.99, length = 100)
sensitivity.vec = NULL
specificity.vec = NULL
for (i in 1:100){
testing$test.status = as.numeric(pred.prob.test > cut.off.seq[i])
p0.a0 = sum(testing$test.status ==0 & testing$Hour_Delay==0)
p0.a1 = sum(testing$test.status ==0 & testing$Hour_Delay ==1)
p1.a0 = sum(testing$test.status ==1 & testing$Hour_Delay==0)
p1.a1 = sum(testing$test.status ==1 & testing$Hour_Delay ==1)
sensitivity.vec[i] = p1.a1 / (p1.a1 + p0.a1)
specificity.vec[i] = p0.a0 / (p0.a0 + p1.a0)
}
one.minus.spec = c(1,1 - specificity.vec)
sens.vec = c(1,sensitivity.vec)
par(pty = "s")
plot(one.minus.spec, sens.vec, type = "l", xlim = c(0,1), ylim = c(0,1) ,
xlab ="1 - specificity",
ylab = "sensitivity",
main = "ROC curve of Logistic Regression Model",
lwd = 2,
col = "blue", )
segments(0,0,1,1, col = "red", lty = 2, lwd = 2)
AUC = round(sum(sens.vec*(one.minus.spec[-101]-one.minus.spec[-1])),4)
text(0.8, 0.3, paste("AUC = ", AUC), col = "blue")
The above graph was generated with a series of cutoff thresholds and could be used to determine the ideal cutoff point, if specificity or sensitivity need to me accomodated.
Another possible means to an ideal cutoff could could be creating a Neural Network (NN) model. Creating an NN model, code-wise, will be very similar to the cross validation done in the previous section.
For good results using the neuralnet function, we will normalize our numerical variables. Our scaling formula for each numeric value is as follows:
\(\hat{x}_{ij} = \frac{x_{ij} - min value_{i}(x_{ij})}{range(x_{ij})}\)
Meaning that the scale X values will be arrived at by starting with the original x, subtracting the lowest value of that variable, then diving the difference by the difference between the highest and lowest x value. We will only do this for numerical variables, not dummy variables, as it will help mitigate variance. As dummy variables are only taking on values of 0 or 1, it would not change much, but it will be import for variables like airport distance (which has a range in the hundreds ).
flight.h2<-flight.hour
flight.h2$Airport_Distance = (flight.h2$Airport_Distance-min(flight.h2$Airport_Distance))/(max(flight.h2$Airport_Distance)-min(flight.h2$Airport_Distance))
flight.h2$Number_of_flights = (flight.h2$Number_of_flights-min(flight.h2$Number_of_flights))/(max(flight.h2$Number_of_flights)-min(flight.h2$Number_of_flights))
flight.h2$Weather = (flight.h2$Weather-min(flight.h2$Weather))/(max(flight.h2$Weather)-min(flight.h2$Weather))
flight.h2$Support_Crew_Available = (flight.h2$Support_Crew_Available-min(flight.h2$Support_Crew_Available))/(max(flight.h2$Support_Crew_Available)-min(flight.h2$Support_Crew_Available))
flight.h2$Baggage_loading_time = (flight.h2$Baggage_loading_time-min(flight.h2$Baggage_loading_time))/(max(flight.h2$Baggage_loading_time)-min(flight.h2$Baggage_loading_time))
flight.h2$Late_Arrival_o = (flight.h2$Late_Arrival_o-min(flight.h2$Late_Arrival_o))/(max(flight.h2$Late_Arrival_o)-min(flight.h2$Late_Arrival_o))
flight.h2$Cleaning_o = (flight.h2$Cleaning_o-min(flight.h2$Cleaning_o))/(max(flight.h2$Cleaning_o)-min(flight.h2$Cleaning_o))
flight.h2$Fueling_o = (flight.h2$Fueling_o-min(flight.h2$Fueling_o))/(max(flight.h2$Fueling_o)-min(flight.h2$Fueling_o))
flight.h2$Security_o = (flight.h2$Security_o-min(flight.h2$Security_o))/(max(flight.h2$Security_o)-min(flight.h2$Security_o))
Now, we will organize our standardized, complete, flight.hour
dataframe in such a way that it can be read as a model formula. For
conveneince, we will make a new dataset and adhere to
CamelCase naming conventions, where words in the phrase are
separated but capital letters, allowing us to avoid spaces.
flightmtx<-model.matrix(~. , data=flight.h2)
colnames(flightmtx)
## [1] "(Intercept)" "CarrierAA" "CarrierAS"
## [4] "CarrierB6" "CarrierDL" "CarrierEV"
## [7] "CarrierF9" "CarrierFL" "CarrierHA"
## [10] "CarrierMQ" "CarrierUA" "CarrierUS"
## [13] "CarrierVX" "CarrierWN" "Airport_Distance"
## [16] "Number_of_flights" "Weather" "Support_Crew_Available"
## [19] "Baggage_loading_time" "Late_Arrival_o" "Cleaning_o"
## [22] "Fueling_o" "Security_o" "Hour_Delay"
Many variable names are workable. We will keep all of the carrier names as is and rename the others.
colnames(flightmtx)[15]<-"AirportDistance"
colnames(flightmtx)[16]<-"NumberOfFlights"
colnames(flightmtx)[17]<-"Weather"
colnames(flightmtx)[18]<-"SupportCrewAvailable"
colnames(flightmtx)[19]<-"BaggageLoadingTime"
colnames(flightmtx)[20]<-"LateArrival"
colnames(flightmtx)[21]<-"Cleaning"
colnames(flightmtx)[22]<-"Fueling"
colnames(flightmtx)[23]<-"Security"
colnames(flightmtx)[24]<-"HourDelay"
Now that our variables are renamed and scaled, we will write the
model as an object called modelFormula which we can call
upon to put into code at any time. Note that we are dropping the
intercept, as our NN model will generate its own.
columnNames = colnames(flightmtx)
columnList = paste(columnNames[-c(1,length(columnNames))], collapse = "+")
columnList = paste(c(columnNames[length(columnNames)],"~",columnList), collapse="")
modelFormula = formula(columnList)
Just like the previous section, we will split our dataset into training and testing data for the purpose of cross-validation. See the previous section for a more detailed explanation.
n = dim(flightmtx)[1]
testID = sample(1:n, round(n*0.7), replace = FALSE)
testDat = flightmtx[testID,]
trainDat = flightmtx[-testID,]
train.Dat=as.data.frame(trainDat[,-1])
Our NN model will use the modelFormula object that we
coded earlier. Aside from the use of the modelFormula
object, the neuralnet command begins similarly to our glm code in the
previous section. It differs by adding commands afterwards, such as the
hidden layers and the use of the rprop+ algorithim, telling
the NN to use Backpropagation. Backpropagation is a means to improve
predictive accuracy by calculating the gradient of a loss function
(which we would call the Mean Square Error in classical statistics) of
weights within the NN and using them to update other weights.
NetworkModel = neuralnet(HourDelay ~ .,
data = train.Dat,hidden = 1, rep = 1, threshold = 0.01, learningrate = 0.1, linear.output = TRUE, algorithm = "rprop+")
kable(NetworkModel$result.matrix)
| error | 48.5786842 |
| reached.threshold | 0.0082445 |
| steps | 4540.0000000 |
| Intercept.to.1layhid1 | -15.7442088 |
| CarrierAA.to.1layhid1 | -0.5845317 |
| CarrierAS.to.1layhid1 | -0.3359010 |
| CarrierB6.to.1layhid1 | -0.1948270 |
| CarrierDL.to.1layhid1 | -0.8686852 |
| CarrierEV.to.1layhid1 | -0.5732558 |
| CarrierF9.to.1layhid1 | -4.7146590 |
| CarrierFL.to.1layhid1 | -1.2881158 |
| CarrierHA.to.1layhid1 | 0.3423838 |
| CarrierMQ.to.1layhid1 | -0.6459463 |
| CarrierUA.to.1layhid1 | -0.4935499 |
| CarrierUS.to.1layhid1 | -0.8189279 |
| CarrierVX.to.1layhid1 | -0.7320115 |
| CarrierWN.to.1layhid1 | -0.2956864 |
| AirportDistance.to.1layhid1 | 2.1083699 |
| NumberOfFlights.to.1layhid1 | 13.0659828 |
| Weather.to.1layhid1 | 0.6821870 |
| SupportCrewAvailable.to.1layhid1 | -1.5165616 |
| BaggageLoadingTime.to.1layhid1 | 9.7273746 |
| LateArrival.to.1layhid1 | 6.3078694 |
| Cleaning.to.1layhid1 | 0.2102520 |
| Fueling.to.1layhid1 | -1.0809530 |
| Security.to.1layhid1 | 0.5555216 |
| Intercept.to.HourDelay | -0.0237742 |
| 1layhid1.to.HourDelay | 1.0340119 |
Above is a table of each variable’s coefficient, below is a graphical model of them and how they are used to determine the predicted outcome.
plot(NetworkModel, rep="best")
And below is our logistic regression model, derived by the NN.
logiModel = glm(factor(Hour_Delay) ~., family = binomial, data = flight.hour)
pander(summary(logiModel)$coefficients)
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -102.9 | 4.286 | -24.01 | 1.966e-127 |
| CarrierAA | 0.06284 | 0.3963 | 0.1586 | 0.874 |
| CarrierAS | -0.2516 | 1.413 | -0.178 | 0.8587 |
| CarrierB6 | -0.1677 | 0.3784 | -0.4433 | 0.6576 |
| CarrierDL | -0.3375 | 0.3866 | -0.8729 | 0.3827 |
| CarrierEV | -0.2169 | 0.3863 | -0.5614 | 0.5745 |
| CarrierF9 | -1.239 | 1.336 | -0.9276 | 0.3536 |
| CarrierFL | -0.8453 | 0.6406 | -1.319 | 0.187 |
| CarrierHA | -3.879 | 4.606 | -0.8423 | 0.3996 |
| CarrierMQ | -0.3142 | 0.4122 | -0.7624 | 0.4458 |
| CarrierUA | -0.2084 | 0.3753 | -0.5555 | 0.5786 |
| CarrierUS | -0.1524 | 0.4373 | -0.3486 | 0.7274 |
| CarrierVX | -0.4454 | 0.5916 | -0.7528 | 0.4516 |
| CarrierWN | -0.2454 | 0.4773 | -0.5141 | 0.6072 |
| Airport_Distance | 0.02724 | 0.003886 | 7.011 | 2.364e-12 |
| Number_of_flights | 0.0006873 | 3.775e-05 | 18.21 | 4.599e-74 |
| Weather | 0.6592 | 0.1265 | 5.21 | 1.893e-07 |
| Support_Crew_Available | -0.008164 | 0.001446 | -5.644 | 1.658e-08 |
| Baggage_loading_time | 2.438 | 0.1788 | 13.64 | 2.408e-42 |
| Late_Arrival_o | 0.9455 | 0.0955 | 9.901 | 4.122e-23 |
| Cleaning_o | 0.009536 | 0.01634 | 0.5837 | 0.5594 |
| Fueling_o | -0.005169 | 0.01619 | -0.3193 | 0.7495 |
| Security_o | 0.01402 | 0.008061 | 1.739 | 0.08197 |
Finally, we will have our NN model find an optimal cutoff probability, using the same cross-validation technique that we used in our previous section. We expect that it will be slightly different, partially because it is a different technique, but also because our code generates a completely different random split for training and testing.
n0 = dim(train.Dat)[1]/5
cut.0ff.prob = seq(0,1, length = 20)
pred.accuracy = matrix(0,ncol=20, nrow=5, byrow = T)
for (i in 1:5){
valid.id = ((i-1)*n0 + 1):(i*n0)
valid.data = train.Dat[valid.id,]
train.data = train.Dat[-valid.id,]
train.model = neuralnet(modelFormula, train.data, hidden = 1, rep = 1, threshold = 0.1, learningrate = 0.1, algorithm = "rprop+")
pred.nn.val = predict(NetworkModel, newdata=valid.data, type="prob")
for(j in 1:20){
pred.status = rep(0,length(pred.nn.val))
valid.data$pred.status = as.numeric(pred.nn.val >cut.0ff.prob[j])
a11 = sum(valid.data$pred.status == valid.data$HourDelay)
pred.accuracy[i,j] = a11/length(pred.nn.val)
}
}
avg.accuracy = apply(pred.accuracy, 2, mean)
max.id = which(avg.accuracy ==max(avg.accuracy ))
tick.label = as.character(round(cut.0ff.prob,2))
plot(1:20, avg.accuracy, type = "b",
xlim=c(1,20),
ylim=c(0.5,1),
axes = FALSE,
xlab = "Cut-off Probability",
ylab = "Accuracy",
main = "5-fold CV performance"
)
axis(1, at=1:20, label = tick.label, las = 2)
axis(2)
segments(max.id, 0.5, max.id, avg.accuracy[max.id], col = "purple")
text(max.id, avg.accuracy[max.id]+0.03, as.character(round(avg.accuracy[max.id],4)), col = "purple", cex = 0.8)
Above is the optimal Cut-off Probability, as suggested by our NN model. The ROC curve below will show the same information as in the previous section.
nn.results = predict(NetworkModel, train.Dat)
cut0 = seq(0,1, length = 20)
SenSpe = matrix(0, ncol = length(cut0), nrow = 2, byrow = FALSE)
for (i in 1:length(cut0)){
a = sum(train.Dat[,"HourDelay"] == 1 & (nn.results > cut0[i]))
d = sum(train.Dat[,"HourDelay"] == 0 & (nn.results < cut0[i]))
b = sum(train.Dat[,"HourDelay"] == 0 & (nn.results > cut0[i]))
c = sum(train.Dat[,"HourDelay"] == 1 & (nn.results < cut0[i]))
sen = a/(a + c)
spe = d/(b + d)
SenSpe[,i] = c(sen, spe)
}
plot(1-SenSpe[2,], SenSpe[1,], type ="l", xlim=c(0,1), ylim=c(0,1),
xlab = "1 - specificity", ylab = "Sensitivity", lty = 1,
main = "ROC Curve", col = "blue")
abline(0,1, lty = 2, col = "red")
xx = 1-SenSpe[2,]
yy = SenSpe[1,]
width = xx[-length(xx)] - xx[-1]
height = yy[-1]
prediction = nn.results
category = train.Dat[,"HourDelay"] == 1
ROCobj <- roc(category, prediction)
AUC = auc(ROCobj)[1]
AUC =mean(sum(width*height), sum(width*yy[-length(yy)]))
text(0.8, 0.3, paste("AUC = ", round(AUC,4)), col = "blue4", cex = 0.9)
legend("bottomright", c("ROC of the model", "Random (Coin Flip) Guess"), lty=c(1,2),
col = c("blue4", "pink2"), bty = "n", cex = 0.8)
Similar to the previous section, the ROC curve helps determine an ideal cutoff probability, based on the desired sensitivity and specificity of our model. While favoring a more sensitive or specific model could be important due to external factors, there is no apparent need for either in the realm of flight delay predictions.