Overview

In the world of air travel, 2022 was a year of air travel defined by cancellations. While 2023 has seen a decrease in cancellations, FlightAware reports an increase in delay time from last year.

Per federal data, most flight delays in recent years have been caused by issues within the control of airlines.

Discovering and understanding which specific issues cause delays could inform airlines’ priorities, and could mitigate delay time moving forward.

Of course, with something as complicated as air travel, it is not unreasonable to expect small delays. Instead of splitting hairs over delays in the minutes, we will attempt to see which variables are most closely tied to a delay of 1 hour or more. While it is clear that ‘Air Carrier Delay’ is a sizable chunk of delay times, that is very vague. We will use a dataset of flight records to create a predictive logistical model, with a goal of seeing which variables have an effect on delays of an hour or more, and how strong that effect is.

Description of Data

The flight delay data has 3593 observations of 11 variables. They are:

Carrier: The airline

Airport_Distance: Distance between two airports

Number_of_Flights: Total number of flights in the airport

Weather: Weather condition, measured on a scale from 0 (mild) to 10 (extreme)

Support_Crew_Available: Number of support crew

Baggage_loading_time: Time in minutes spent loading baggage

Late_Arrival_o: Time in minutes the plane arrived late

Cleaning_o: Time in minutes spent cleaning the aircraft

Fueling_o: Time in minutes spent fueling the aircraft

Security_o: Time in minutes spent in security checking

Arr_Delay: Flight delay in minutes. This is the dependent variable of the dataset.

For logistic regression, an additional variable, Hour_Delay, will be created to see if a plane is 1 hour late or more.

A copy of this publicly available data is stored at https://pengdsci.github.io/datasets/FlightDelay/Flight_delay-data.csv.

flight<-read.csv("https://pengdsci.github.io/datasets/FlightDelay/Flight_delay-data.csv")
flight.hour<-flight

flight.hour$Hour_Delay = 0 

flight.hour$Hour_Delay<-ifelse(flight.hour$Arr_Delay > 59, 1, 0)

flight.hour<-flight.hour[-c(11)]

Dataset Overview

First, as the majority of the variables are numerical, a summary stats overview is warranted.

summary(flight)
##    Carrier          Airport_Distance Number_of_flights    Weather     
##  Length:3593        Min.   :376.0    Min.   :29475     Min.   :5.000  
##  Class :character   1st Qu.:431.0    1st Qu.:41634     1st Qu.:5.000  
##  Mode  :character   Median :443.0    Median :43424     Median :5.000  
##                     Mean   :442.4    Mean   :43311     Mean   :5.353  
##                     3rd Qu.:454.0    3rd Qu.:45140     3rd Qu.:6.000  
##                     Max.   :499.0    Max.   :53461     Max.   :6.000  
##  Support_Crew_Available Baggage_loading_time Late_Arrival_o    Cleaning_o   
##  Min.   :  0            Min.   :14.00        Min.   :15.00   Min.   :-4.00  
##  1st Qu.: 56            1st Qu.:17.00        1st Qu.:18.00   1st Qu.: 8.00  
##  Median : 83            Median :17.00        Median :19.00   Median :10.00  
##  Mean   : 85            Mean   :16.98        Mean   :18.74   Mean   :10.02  
##  3rd Qu.:112            3rd Qu.:17.00        3rd Qu.:19.00   3rd Qu.:12.00  
##  Max.   :222            Max.   :19.00        Max.   :22.00   Max.   :23.00  
##    Fueling_o       Security_o      Arr_Delay    
##  Min.   :13.00   Min.   :13.00   Min.   :  0.0  
##  1st Qu.:23.00   1st Qu.:32.00   1st Qu.: 49.0  
##  Median :25.00   Median :37.00   Median : 70.0  
##  Mean   :25.01   Mean   :37.09   Mean   : 69.8  
##  3rd Qu.:27.00   3rd Qu.:42.00   3rd Qu.: 90.0  
##  Max.   :36.00   Max.   :63.00   Max.   :180.0

Amazingly, there are no missing values. Of note is the extreme variance within the Support_Crew_Available and Number_of_Flights variables and minimal variance within the weather category.

Looking for Outliers

While the summary of variables like Airport Distance do not appear to cause alarm, the high variance of some stats suggests the possibility of outliers. While outliers are expected, it may be worth identifying and individually reviewing outliers to see if there could have been an inputting error.

testsup<-rosnerTest(flight$Support_Crew_Available, k = 5)
testnum<-rosnerTest(flight$Number_of_flights, k = 5)
testdist<-rosnerTest(flight$Airport_Distance, k = 5)
testbag<-rosnerTest(flight$Baggage_loading_time, k = 5)

The test of outliers for number of flights returns 3 possible outliers - observations 1652, 3163, and 2729. The delay for each observations is 0. While these observations do not appear to be errors, they make a strong suggestion that number of flights could be a good predictor for delay time. Similarly, the baggage time also shows 2729 and 3163 as low outliers.

Investigation for Collinearity

As there are multiple variables related to airplane prep (fueling, cleaning, baggage loading, and security), it may be worth checking for possible collinearity between two predictor variables. If one can be identified as redundant and removed, it will save us time and computational power.

pcor(flight[,2:11], method = "pearson")

As no variable appears to have a high correlation with another, there is no preliminary evidence of multicollinearity, so all variables could potentially be included into the model.

Discretizing Variables

Since there are two instances of flights with outliers of both baggage and number of flights, we will see if replacing those variables with bins of values has an impact on the model. we could not find any precedent for this online, so we will use the quartile ranges of each to split each variable into ‘low’, ‘middle’ and ‘high’.

model2<-flight.hour

model2$Number_of_flights <- ifelse(model2$Number_of_flights < 41634, 'Low',
               ifelse(model2$Number_of_flights >= 45140, 'High', 'Middle'))

model2$Baggage_loading_time <- ifelse(model2$Baggage_loading_time < 17, 'Low',
               ifelse(model2$Number_of_flights >= 18, 'High', 'Middle'))
fit<-glm(Hour_Delay~., family = binomial, data=flight.hour)


res<-resid(fit)

fit2<-glm(Hour_Delay~., family = binomial, data=model2)


res2<-resid(fit2)
plot(flight.hour$Hour_Delay, res, ylab="Residuals", xlab="Hour Delay - 0=No, 1=Yes", main="Residuals of Intial Model", col="blue")
abline(0,0)

plot(model2$Hour_Delay, res2, ylab="Residuals", xlab="Hour Delay - 0=No, 1=Yes", main="Residuals of Discretized Model", col="orange")
abline(0,0)

While the residual plots clearly look different, it is not entirely clear which model has less error.

errmat<-matrix(c(deviance(fit), deviance(fit2)),ncol=2) 
colnames(errmat)<-c('Original Model', 'Discretized Model')
rownames(errmat)<-c('Error')

error.sum<-as.table(errmat)

kable(error.sum)
Original Model Discretized Model
Error 1969.465 2193.526

While not apparent, the discretized model appears to have brought our residuals up. We will not consider a discretized model as a possiblity moving forward.

Pairwise Associations

Many variables in our model are numeric, we will look at pairwise associations of all variables, looking mostly at the correlation with the variable of interest, Arr_Delay.

ggpairs(flight.hour,                  
        columns = 2:11)         

This output shows the strength of some correlation between variables (specifically baggage loading, late arrival, and number of flights) and the variable of interest. These variables make a strong case for their inclusion in our predictive model.

Building a Model

As stated in our introduction, building a logistic model will allow us to see which variables have a significant effect on the presence of a long delay and how strong that effect is.

Logistic Regression

As we are attempting to predict the probability of a delay of one hour or more, our response variable can only take values between 0 and 1. The predictive model most appropriate for this will be a logistic regression, following a form of:

\(P(Y) = (e^\delta)/1+e^\delta\)

Where \(\delta\) is equal to:

\(\beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + ... + \beta_{k}x_{k}\)

Where each \(\beta\) represents the coefficient associated with each variable (\(x_{k}\)).

Development of Candidate Models

Initial Model

The initial model is simply using every variable given in the initial dataframe. While very possibly accurate, this model is quite large and lacks transformations that possibly could assist with predictive modeling.

initial.model<-glm(Hour_Delay~. , family = binomial, data=flight.hour, na.action = na.exclude)

plot(initial.model$fitted.values, initial.model$residuals, ylab="Residuals", xlab="Predicted Probability of Hour+ Delay")
abline(0,0)

deviance(initial.model)
## [1] 1969.465

This model will be good to have as a baseline comparison for other candidate models.

Our initial model residuals appear to follow a cubic line, so the inclusion of a second or third order term may help.

Including a Second-order variable

This can help mitigate errors with residuals, as seen above. Only numerical variables will be transformed, as mathematical functions will not work well with categorical variables such as the name of the airline!

quad.df<-flight.hour

quad.df$flights2<-quad.df$Number_of_flights**2
quad.df$bag2<-quad.df$Baggage_loading_time**2
quad.df$dist2<-quad.df$Airport_Distance**2
quad.df$weather2<-quad.df$Weather**2
quad.df$security2<-quad.df$Security_o**2
quad.df$support2<-quad.df$Support_Crew_Available**2
quad.df$late2<-quad.df$Late_Arrival_o**2
quad.df$clean2<-quad.df$Cleaning_o**2
quad.df$fuel2<-quad.df$Fueling_o**2




quad.model<-glm(Hour_Delay~. , family = binomial , data=quad.df)


plot(quad.model$fitted.values, quad.model$residuals, ylab="Residuals", xlab="Predicted Probability of Hour+ Delay")
abline(0,0)

deviance(quad.model)
## [1] 1957.575

And because of the extremely obvious cubic shape to the residual plot of our initial model, we will try some cubed variables.

cube.df<-quad.df

cube.df$flights3<-cube.df$Number_of_flights**3
cube.df$bag3<-cube.df$Baggage_loading_time**3
cube.df$dist3<-cube.df$Airport_Distance**3
cube.df$weather3<-cube.df$Weather**3
cube.df$security3<-cube.df$Security_o**3
cube.df$support3<-cube.df$Support_Crew_Available**3
cube.df$late3<-cube.df$Late_Arrival_o**3
cube.df$clean3<-cube.df$Cleaning_o**3
cube.df$fuel3<-cube.df$Fueling_o**3




cube.model<-glm(Hour_Delay~. , family = binomial , data=cube.df)


plot(cube.model$fitted.values, cube.model$residuals, ylab="Residuals", xlab="Predicted Probability of Hour+ Delay")
abline(0,0)

deviance(cube.model)
## [1] 1946.978

The cubic and quadratic models both, somehow, reduce error slightly despite having a single residual that is over 300 off.

Selection of Optimal Model

To identify the optimal model, we will use the ‘step’ function in R to eliminate redundant variables in each of our model sets.

First, we will apply various selection methods to our initial dataframe.

#model.reg.both<-step(initial.model, direction = "both" , trace = 1)

Both selection processes narrowed down the model to Security_o, Airport_Distance, Number_of_Flights, Weather, Support_Crew_Available, Baggage_Loading_Time, and Late_arrival_o. This suggests that in a candidate model with no transformations or additional terms, those variables should be included.

Next, our quadratic and cubic models

#model.quad.both<-step(quad.model, direction = "both" , trace = 1)

#model.cube.both<-step(cube.model, direction = "both" , trace = 1)

The model with squared terms settled on: Weather, support crew, baggage loading, late arrival, security, number of flights^2, baggage time^2, and distance^2.

With cubed terms: Weather, support crew, late arrival, security, baggage time^2, distance^2, security^2, number of flights^3, baggage time^3, and security^3.

We do not want a final model to contain the same variable at multiple levels, but since this may indicate a strong connection between the reiterated terms and the variable of interest, we will create a reduced model with those.

Candidate Models

Based on the previous results, we have 2 Candidate models, as well as our initial model:

can.reg<-glm(Hour_Delay ~ Security_o + Airport_Distance + Number_of_flights + Support_Crew_Available + Weather + Baggage_loading_time + Late_Arrival_o , family = binomial, data=flight.hour)

can.min<-glm(Hour_Delay ~ Airport_Distance + Number_of_flights + Baggage_loading_time + Security_o , family = binomial , data=flight.hour)

We will look at the significance of each coefficient, of each model, as an initial comparison:

full.table=summary(initial.model)$coef
kable(full.table, caption = "Significance of each coefficient in the full model")
Significance of each coefficient in the full model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -102.9257623 4.2859982 -24.0144202 0.0000000
CarrierAA 0.0628418 0.3963437 0.1585538 0.8740205
CarrierAS -0.2516016 1.4132306 -0.1780330 0.8586971
CarrierB6 -0.1677252 0.3783899 -0.4432602 0.6575776
CarrierDL -0.3374653 0.3865832 -0.8729433 0.3826940
CarrierEV -0.2168769 0.3862913 -0.5614335 0.5745020
CarrierF9 -1.2389646 1.3356440 -0.9276159 0.3536068
CarrierFL -0.8452742 0.6406167 -1.3194695 0.1870122
CarrierHA -3.8793661 4.6055613 -0.8423221 0.3996077
CarrierMQ -0.3142455 0.4121902 -0.7623799 0.4458333
CarrierUA -0.2084461 0.3752656 -0.5554627 0.5785782
CarrierUS -0.1524217 0.4372685 -0.3485770 0.7274069
CarrierVX -0.4453947 0.5916292 -0.7528275 0.4515536
CarrierWN -0.2453927 0.4772950 -0.5141321 0.6071596
Airport_Distance 0.0272436 0.0038857 7.0111599 0.0000000
Number_of_flights 0.0006873 0.0000378 18.2063162 0.0000000
Weather 0.6592219 0.1265406 5.2095670 0.0000002
Support_Crew_Available -0.0081644 0.0014465 -5.6444012 0.0000000
Baggage_loading_time 2.4382013 0.1787911 13.6371526 0.0000000
Late_Arrival_o 0.9455250 0.0954982 9.9009740 0.0000000
Cleaning_o 0.0095363 0.0163363 0.5837477 0.5593900
Fueling_o -0.0051694 0.0161900 -0.3192934 0.7495040
Security_o 0.0140206 0.0080607 1.7393731 0.0819692
reg.table=summary(can.reg)$coef
kable(reg.table, caption = "Significance of each coefficient in the reduced model")
Significance of each coefficient in the reduced model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -102.6831868 4.2206844 -24.328563 0.0000000
Security_o 0.0138565 0.0080020 1.731626 0.0833401
Airport_Distance 0.0268672 0.0038634 6.954268 0.0000000
Number_of_flights 0.0006841 0.0000375 18.238906 0.0000000
Support_Crew_Available -0.0082071 0.0014400 -5.699314 0.0000000
Weather 0.6593858 0.1260339 5.231814 0.0000002
Baggage_loading_time 2.4304099 0.1778089 13.668666 0.0000000
Late_Arrival_o 0.9436507 0.0945881 9.976420 0.0000000
min.table=summary(can.min)$coef
kable(min.table, caption = "Significance of each coefficient in the minimal model")
Significance of each coefficient in the minimal model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -91.1238012 3.7056368 -24.590592 0.0000000
Airport_Distance 0.0290380 0.0036613 7.931082 0.0000000
Number_of_flights 0.0007462 0.0000359 20.793681 0.0000000
Baggage_loading_time 2.7358787 0.1717724 15.927352 0.0000000
Security_o 0.0151718 0.0076559 1.981719 0.0475107

As expected, the full model shows many variables with insignificant p values. The minimal model, while exclusively significant values, does not have as many strong predictors as the mid-sized model. The mid-sized model has Security time not achieving statistical significance, but is fairly close.

As expected, almost all variables show a positive relationship with probability of longer delays. This should not come as a surprise, as most variables are number of minutes spent at some stage before a flight takes off (security, baggage loading time, etc). The number of support staff is negatively correlated, which also makes sense, as more hands available should lead to faster performance of those (and other, possibly unrecorded) tasks.

Prediction

To test the similarity of the candidates, we will create a sample dataset by choosing values for each variable. We will then use R to predict the Delay for each sample flight in both the initial and candidate model. We will then compare the two models’ results.

samp.data=data.frame(Carrier=c("UA", "AA", "DL", "EV", "VX"), Airport_Distance=c(350, 500, 445, 430, 450), Number_of_flights=c(30123, 25001, 43333, 55757, 45454), Weather=c(5, 5, 2, 6, 9), Support_Crew_Available=c(17, 300, 85, 92, 71), Baggage_loading_time=c(14.5, 20, 16, 12, 17), Late_Arrival_o=c(10, 20, 15, 19, 18), Cleaning_o=c(25, 6, 8, 10, 10), Fueling_o=c(19.5, 22, 25, 27, 24), Security_o=c(40, 38, 18, 40, 26) )
pred.ini=predict(initial.model, newdata=samp.data , type = "response")
cut.off.prob = 0.5
pred.response.ini = ifelse(pred.ini > cut.off.prob, 1, 0)

pred.can=predict(can.reg, newdata=samp.data , type = "response")
cut.off.prob = 0.5
pred.response.can = ifelse(pred.can > cut.off.prob, 1, 0)

combined.table=cbind(samp.data, Initial_Model=pred.response.ini , Candidate_Model=pred.response.can)


kable(combined.table, caption = "Predicted Delay Time of Each Model Based on Sample Data")
Predicted Delay Time of Each Model Based on Sample Data
Carrier Airport_Distance Number_of_flights Weather Support_Crew_Available Baggage_loading_time Late_Arrival_o Cleaning_o Fueling_o Security_o Initial_Model Candidate_Model
UA 350 30123 5 17 14.5 10 25 19.5 40 0 0
AA 500 25001 5 300 20.0 20 6 22.0 38 0 0
DL 445 43333 2 85 16.0 15 8 25.0 18 0 0
EV 430 55757 6 92 12.0 19 10 27.0 40 0 0
VX 450 45454 9 71 17.0 18 10 24.0 26 1 1

As demonstrated in the table, the models have similar predictions despite very extreme x values. It is our belief that the candidate model is ideal, as it generates comparable predictions at a fraction of the memory. This advantage will only continue to grow as larger datasets are used with this same model.

Model Building Conclusions

Here we utilized different strategies to optimize our predictive model and arrived at a reduced model formula. We were able to test and rule out different transformations of variables, compare possible models, and select a ‘champion’ model to be compared against other model building strategies in subsequent sections.

can.df<-subset(flight.hour , select=c('Hour_Delay' , 'Security_o' , 'Airport_Distance' , 'Number_of_flights' , 'Support_Crew_Available' , 'Weather' , 'Baggage_loading_time' , 'Late_Arrival_o'))

A copy of this data is now publicly available at: https://raw.githubusercontent.com/AlexDragonetti/flightcandidate/main/can.csv