1 Introduction

For this report, I will be using a data set consisting of delayed flight times of various airlines. In this data set, there are 11 variables and 3593 observations. The variables are:

  1. Carrier (Categorical) - The airline that the flight is taken with
  2. Airport_Distance (Numeric) - The Distance between the airports in miles
  3. Number_of_flights (Numeric) - Total no of flights in airport
  4. Weather (Numeric) - Delay due to weather condition ranked 0-10, with 0 being mild and 10 being extreme
  5. Support_Crew_Available (Numeric) - Total number of support crew available
  6. Baggage_loading_time (Numeric) - Time in minutes for loading the baggage
  7. Late_Arrival_o (Numeric) - Time in minutes for late arriving aircraft of the same flight
  8. Cleaning_o (Numeric) - Time in minutes for aircraft cleaning
  9. Fueling_o (Numeric) - Time in minutes for aircraft fueling
  10. Security_o (Numeric) - Time in minutes for security checking
  11. Arr_Delay (Numeric) - Flight Arr_Delay in minutes. It is dependent variable in the model

The goal for this analysis is to determine what factors affect the delayed arrival times.

flight1 <- read.csv("flight_delay-data.csv")

2 Fitting the Model: Part 1 - Exploration

2.1 Full Model

To start, the full Mulitple Linear Regression model will be fitted with all of the variables in the data set.

full.model <- lm(Arr_Delay ~ ., data = flight1)

Can this model be changed before any sort of analysis is done?

kable(summary(full.model)$coef)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -574.5592897 7.6427175 -75.1773555 0.0000000
CarrierAA -0.3068142 1.3750922 -0.2231227 0.8234528
CarrierAS 1.3060328 4.1334965 0.3159632 0.7520489
CarrierB6 -1.7685278 1.3177103 -1.3421219 0.1796418
CarrierDL -1.7558491 1.3498032 -1.3008186 0.1934045
CarrierEV -1.7244875 1.3619601 -1.2661806 0.2055310
CarrierF9 -0.9614506 4.5728315 -0.2102528 0.8334824
CarrierFL 0.6205342 2.2470500 0.2761550 0.7824450
CarrierHA 1.7141523 6.3522393 0.2698501 0.7872912
CarrierMQ -0.6700343 1.4206138 -0.4716513 0.6372045
CarrierUA -1.5067974 1.3133452 -1.1472972 0.2513357
CarrierUS -0.8850271 1.5820406 -0.5594212 0.5759094
CarrierVX 0.1461768 2.1223356 0.0688754 0.9450926
CarrierWN -1.9051889 1.6726924 -1.1389954 0.2547815
Airport_Distance 0.1727325 0.0135557 12.7424675 0.0000000
Number_of_flights 0.0044246 0.0001088 40.6645781 0.0000000
Weather 4.4365237 0.4553804 9.7424566 0.0000000
Support_Crew_Available -0.0489392 0.0053332 -9.1762611 0.0000000
Baggage_loading_time 13.4992387 0.4394619 30.7176559 0.0000000
Late_Arrival_o 6.9106449 0.3336251 20.7138026 0.0000000
Cleaning_o 0.0548127 0.0600167 0.9132910 0.3611512
Fueling_o -0.0614043 0.0596022 -1.0302365 0.3029688
Security_o 0.0077875 0.0295840 0.2632347 0.7923849

According to the table, our new linear and reduced model is Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o.

Analysis from this point forward will be performed with this reduced model.

reduced.model <- lm(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = flight1)

Before any model interpretations are made, residual analysis needs to be done to determine if there are any assumption violations in the model. The variance inflation factors of the model can also be checked for any issues concerning multicollinearity.

par(mfrow=c(2,2))
plot(reduced.model)

vif(reduced.model)
##       Airport_Distance      Number_of_flights                Weather 
##               1.248330               2.136143               1.093489 
## Support_Crew_Available   Baggage_loading_time         Late_Arrival_o 
##               1.125079               2.074031               1.616311
barplot(vif(reduced.model), main = "VIF Values", horiz = FALSE, col = "steelblue")

It seems there are some assumption violations with this model. The QQ plot shows that the assumption of normality may not be satisfied. As for the Residuals vs Fit plot, it shows that the assumption of homoscedasticity, and because of the curved pattern, this model may not be good for prediction. So, changes need to be made first.

There is some good news, though: the VIFs suggest that there is no multicollinearity.

3 Fitting the Model: Part 2

3.1 Transformations to the Response Variable

Several transformations of the response variable will be used to correct the heteroscedasticity and non-normality in the model. Those transformations will be compared to see which transformation is the best and thus which model is the best.

To correct normality issues, a box-cox transformation will be applied to the response variable. To do this, multiple values of lambda will be used to determine the best transformation.

*Note, in order for the box-cox transformation to work, all values of the response variable must be positive, i.e. greater than 0. There are three observations that are equal to 0, so those observations will be removed before applying the box cox transformations.

flight2 <- flight1[-which(flight1$Arr_Delay == 0),]

par(pty = "s", mfrow = c(2, 2), oma=c(.1,.1,.1,.1), mar=c(4, 0, 2, 0))

boxcox(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + 
Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = flight2, 
lambda = seq(-0.5, 1, length = 10), xlab=expression(paste(lambda)))

boxcox(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + 
Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = flight2, 
lambda = seq(0, 1, length = 10), xlab=expression(paste(lambda)))

boxcox(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + 
Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = flight2, 
lambda = seq(0.5, 0.7, length = 10), xlab=expression(paste(lambda)))

I’m not quite certain which transformation is the best, but it seems that the optimal value for lambda consistently seems to be between 0.5 and 0.7, so 3 different transformations with lambda values 0.5, 0.6, and 0.7 will all be applied.

lambdaTrans1 <- lm((Arr_Delay)^0.5 ~ Airport_Distance + Number_of_flights + 
Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, 
data = flight2) 
kable(summary(lambdaTrans1)$coef, caption = "Lambda = 0.5 (SQRT Transformation)")
Lambda = 0.5 (SQRT Transformation)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -34.0259067 0.4393635 -77.443629 0
Airport_Distance 0.0109793 0.0008131 13.503838 0
Number_of_flights 0.0002962 0.0000065 45.303973 0
Weather 0.2377667 0.0273096 8.706342 0
Support_Crew_Available -0.0035656 0.0003196 -11.157298 0
Baggage_loading_time 0.8818184 0.0263490 33.466824 0
Late_Arrival_o 0.4553751 0.0200190 22.747170 0
lambdaTrans2 <- lm((Arr_Delay)^0.6 ~ Airport_Distance + Number_of_flights + 
Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, 
data = flight2) 
kable(summary(lambdaTrans2)$coef, caption = "lambda = 0.6")
lambda = 0.6
Estimate Std. Error t value Pr(>|t|)
(Intercept) -63.2738888 0.7893245 -80.162070 0
Airport_Distance 0.0198537 0.0014607 13.592265 0
Number_of_flights 0.0005295 0.0000117 45.074760 0
Weather 0.4463256 0.0490622 9.097139 0
Support_Crew_Available -0.0062481 0.0005741 -10.882899 0
Baggage_loading_time 1.5821825 0.0473365 33.424147 0
Late_Arrival_o 0.8162203 0.0359645 22.695194 0
lambdaTrans3 <- lm((Arr_Delay)^0.7 ~ Airport_Distance + Number_of_flights + 
Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, 
data = flight2) 
kable(summary(lambdaTrans3)$coef, caption = "lambda = 0.7")
lambda = 0.7
Estimate Std. Error t value Pr(>|t|)
(Intercept) -113.5599912 1.3935691 -81.488599 0
Airport_Distance 0.0350158 0.0025788 13.578204 0
Number_of_flights 0.0009235 0.0000207 44.529044 0
Weather 0.8128393 0.0866203 9.383933 0
Support_Crew_Available -0.0106732 0.0010136 -10.529777 0
Baggage_loading_time 2.7676442 0.0835736 33.116248 0
Late_Arrival_o 1.4274972 0.0634960 22.481685 0

Because the square root transformation has the smallest standard errors, that will be the transformation that will be chosen.

par(mfrow = c(2,2))
plot(lambdaTrans1)

All of the diagnostics look good for this model. They look much better than diagnostics from the original model without the transformation.

3.2 Goodness of fit measures

Now, the goodness of fit measures of the three candidate models will be generated. The model with the best goodness of fit measures will be chosen as the final model.

select = function(m) { # m is an object: model
e = m$resid                           # residuals
n0 = length(e)                        # sample size
SSE=(m$df)*(summary(m)$sigma)^2       # sum of squared error
R.sq=summary(m)$r.squared             # Coefficient of determination: R square!
R.adj=summary(m)$adj.r                # Adjusted R square
MSE=(summary(m)$sigma)^2              # square error
Cp=(SSE/MSE)-(n0-2*(n0-m$df))         # Mellow's p
AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)          # Akaike information criterion
SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)  # Schwarz Bayesian Information criterion
X=model.matrix(m)                     # design matrix of the model
H=X%*%solve(t(X)%*%X)%*%t(X)          # hat matrix
d=e/(1-diag(H))                       
PRESS=t(d)%*%d   # predicted residual error sum of squares (PRESS)- a cross-validation measure
 tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
 names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
 tbl
}

output.sum = rbind(select(full.model), select(reduced.model), select(lambdaTrans1))
row.names(output.sum) = c("full.model", "reduced.model", "lambdaTrans1")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
Goodness-of-fit Measures of Candidate Models
SSE R.sq R.adj Cp AIC SBC PRESS
full.model 553651.41 0.8195269 0.8184147 23 18145.909 18288.204 561015.961
reduced.model 555449.53 0.8189407 0.8186378 7 18125.559 18168.867 557806.554
lambdaTrans1 2005.76 0.8419195 0.8416548 7 -2075.844 -2032.543 2013.419

The model with the box-cox transformation by far has the best goodness of fit measures. Adjusted R^2 is the highest, and all of the reported errors are the smallest of the three models. So, this will be chosen as the final model.

3.3 The Final Model

kable(summary(lambdaTrans1)$coef, caption = "Inferential Statistics of Final Model")
Inferential Statistics of Final Model
Estimate Std. Error t value Pr(>|t|)
(Intercept) -34.0259067 0.4393635 -77.443629 0
Airport_Distance 0.0109793 0.0008131 13.503838 0
Number_of_flights 0.0002962 0.0000065 45.303973 0
Weather 0.2377667 0.0273096 8.706342 0
Support_Crew_Available -0.0035656 0.0003196 -11.157298 0
Baggage_loading_time 0.8818184 0.0263490 33.466824 0
Late_Arrival_o 0.4553751 0.0200190 22.747170 0

Referring to this table from earlier, we can see the coefficients for this model. And because the sample size is big enough to satisfy the central limit theorem, we can use the p-values to assess for any further variable selection. Since all of the p-values are close to zero, there’s no need for any further variable selection. And, the explicit regression equation can be written as: \[ sqrt(Arr_Delay) = -34.0259 + 0.011 \times Airport_Distance + 0.0003 \times Number_of_flights + 0.2378 \times Weather - 0.0036 \times Support_Crew_Available + 0.8818 \times Baggage_loading_time + 0.4554 \times Late_Arrival_o \]

4 Summary

This study focused on finding any factors that contributed to airplanes’ delayed arrival times. Many linear regression techniques were used to generate several candidate models, and comparisons were made between those models. Ultimately, the best model chosen had a box-cox transformation used on the response variable with a lambda value of 0.5.