In this report, we will be using multiple linear regression models and determining which is best.
The data set I chose for this report is called flight delay data, and it can be found at “https://pengdsci.github.io/datasets/FlightDelay/Flight_delay-data.csv”. It gives a continuous response in minutes of how long a fight is delayed, which depends on one character and nine numeric variables. The variables are listed below:
Arr_Delay: time in minutes of how long a fight is delayed
Carrier: categorical company the flight is with
Airport_Distance: the distance from the airport in miles
Number_of_flights: total number of all of the flights in the airport
Weather: scale of 1-10 with 10 being severe weather
Support_Crew_Available: number of support crew members
Baggage_loading_time: the time it takes in minutes to load the bags
Late_Arrival_o: the time in minutes for the late arriving aircraft of the same flight
Cleaning_o: the time it takes for aircraft cleaning
Fueling_o: the time it takes for aircraft fueling
Security_o: the time for security checking.
In order to make this model, we need two categorical variables and as I mentioned above, we only have one so I turned support crew members (Support_Crew_Available variable) into a categorical variable, with breaks between every 50 crew members available. I don’t believe there are any variables we should drop yet. My practical question is which relevant variables effect the time a flight is delayed.
The original full model that the linear model function made was as follows:
\[ y= -574.5592897 \,+\,-0.3068142*CarrierAA \,+\,1.3060328*CarrierAS \\-\,1.7685278*CarrierB6 \,-\,1.7558491 *CarrierDL \,-\,1.7244875*CarrierEV \\,-\,0.9614506*CarrierF9 \,+\,0.6205342*CarrierFL \,+\,1.7141523*CarrierHA \\,-\,0.6700343*CarrierMQ \,-\,1.5067974*CarrierUA \,-\,0.8850271*CarrierUS \\ +0.1461768*CarrierVX \,-\,1.9051889*CarrierWN \,+\,0.1727325*AirportDistance \\,+\,0.0044246*NumberofFlights \,+\,4.4365237*Weather \,-\,0.0489392*SupportCrewAvailable \\+\,13.4992387*Baggageloadingtime \,+\,6.9106449*LateArrivalo \,+\,0.0548127*Cleaningo \\-\,0.0614043*Fuelingo \,+\,0.0077875*Securityo \]
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -574.5842125 | 7.6442278 | -75.1657623 | 0.0000000 |
| CarrierAA | -0.3176558 | 1.3742454 | -0.2311492 | 0.8172121 |
| CarrierAS | 1.3025919 | 4.1330152 | 0.3151675 | 0.7526529 |
| CarrierB6 | -1.7847244 | 1.3170082 | -1.3551354 | 0.1754602 |
| CarrierDL | -1.8249200 | 1.3494939 | -1.3522995 | 0.1763653 |
| CarrierEV | -1.7253348 | 1.3612518 | -1.2674619 | 0.2050729 |
| CarrierF9 | -1.0441150 | 4.5723713 | -0.2283531 | 0.8193849 |
| CarrierFL | 0.5732866 | 2.2461324 | 0.2552328 | 0.7985580 |
| CarrierHA | 2.1250862 | 6.3510956 | 0.3346015 | 0.7379454 |
| CarrierMQ | -0.6735115 | 1.4197022 | -0.4744034 | 0.6352413 |
| CarrierUA | -1.5041356 | 1.3127539 | -1.1457864 | 0.2519604 |
| CarrierUS | -0.8654488 | 1.5812309 | -0.5473260 | 0.5841890 |
| CarrierVX | 0.2205063 | 2.1217789 | 0.1039252 | 0.9172345 |
| CarrierWN | -1.9711735 | 1.6721496 | -1.1788260 | 0.2385461 |
| Airport_Distance | 0.1732896 | 0.0135518 | 12.7872150 | 0.0000000 |
| Number_of_flights | 0.0044376 | 0.0001090 | 40.7268944 | 0.0000000 |
| Weather | 4.4104382 | 0.4552719 | 9.6874808 | 0.0000000 |
| Support_Crew_Available | -0.0593364 | 0.0150282 | -3.9483467 | 0.0000802 |
| Baggage_loading_time | 13.4801105 | 0.4397689 | 30.6527125 | 0.0000000 |
| Late_Arrival_o | 6.9358981 | 0.3335567 | 20.7937628 | 0.0000000 |
| Cleaning_o | 0.0577420 | 0.0599874 | 0.9625690 | 0.3358292 |
| Fueling_o | -0.0587160 | 0.0595792 | -0.9855126 | 0.3244392 |
| Security_o | 0.0072173 | 0.0295649 | 0.2441177 | 0.8071537 |
| Support_Crew_Available.catLow to Medium | -0.4796695 | 0.8928880 | -0.5372112 | 0.5911552 |
| Support_Crew_Available.catMedium | 0.3808286 | 1.4950053 | 0.2547340 | 0.7989433 |
| Support_Crew_Available.catMedium to High | 1.5030377 | 2.2463979 | 0.6690879 | 0.5034827 |
| Support_Crew_Available.catHigh | 7.9897897 | 4.0327490 | 1.9812266 | 0.0476426 |
par(mfrow=c(2,2))
plot(full.model)
Residual plots of the full model
In this model, the constant variance is not 0, as seen in the first plot. We can see the model is not normal. The other plots are not normal either. The Cooke’s distance plot shows big gaps in the data. All of these observatins lead us to doing a transformation on the variables or droppping them.
Since none of the p-values from the dummy variables (each carrier variable) are statistically significant in the previous model, we can remove this variable to create a new model.
The linear equation for the equation where we dropped the Carrier variable is as follows
\[ y = -576.0172760 \,+\,0.1730925*AirportDistance \\,+\,0.0044296*NumberofFlights \,+\,4.4659931*Weather \,-\,0.0486269*SupportCrewAvailable \\+\,13.4862291*Baggageloadingtime \,+\,6.8977733*LateArrivalo \,+\,0.0526302*Cleaningo \\-\,0.0587412*Fuelingo \,+\,0.0085256*Securityo \]
kable(summary(model2)$coef, caption ="Statistics of Regression Coefficients without Carrier variable")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -576.0172760 | 7.4850911 | -76.9552798 | 0.0000000 |
| Airport_Distance | 0.1730925 | 0.0135235 | 12.7993778 | 0.0000000 |
| Number_of_flights | 0.0044296 | 0.0001086 | 40.7783777 | 0.0000000 |
| Weather | 4.4659931 | 0.4543878 | 9.8285949 | 0.0000000 |
| Support_Crew_Available | -0.0486269 | 0.0053197 | -9.1409110 | 0.0000000 |
| Baggage_loading_time | 13.4862291 | 0.4384808 | 30.7567126 | 0.0000000 |
| Late_Arrival_o | 6.8977733 | 0.3330868 | 20.7086345 | 0.0000000 |
| Cleaning_o | 0.0526302 | 0.0599114 | 0.8784680 | 0.3797487 |
| Fueling_o | -0.0587412 | 0.0594237 | -0.9885143 | 0.3229676 |
| Security_o | 0.0085256 | 0.0295234 | 0.2887744 | 0.7727707 |
par(mfrow=c(2,2))
plot(model2)
Residual plots of the full model
The model where we only dropped the variable Carrier is very similar to the full model and has the same assumptions violated with the graphs. Another way we can eliminate variables from the model is the stepAIC function, which uses a stepwise procedure to add or drop variables in the model until it finds the model with the lowest AIC value.
The reduced model using the stepwise procedure is as follows
\[ y = -576.7053849 \,+\,0.1735146*AirportDistance \\,+\,0.0044340*NumberofFlights \,+\,4.4626954*Weather \,-\,0.0489240*SupportCrewAvailable \\+\,13.4697937*Baggageloadingtime \,+\,6.8982019*LateArrivalo \]
kable(summary(reduced.model)$coef, caption ="Statistics of Reduced Model")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -576.6675653 | 7.2728480 | -79.2904736 | 0.0000000 |
| Airport_Distance | 0.1740504 | 0.0135132 | 12.8800688 | 0.0000000 |
| Number_of_flights | 0.0044460 | 0.0001086 | 40.9247070 | 0.0000000 |
| Weather | 4.4357382 | 0.4541083 | 9.7680176 | 0.0000000 |
| Support_Crew_Available | -0.0587552 | 0.0149628 | -3.9267525 | 0.0000877 |
| Baggage_loading_time | 13.4545639 | 0.4383386 | 30.6944553 | 0.0000000 |
| Late_Arrival_o | 6.9219880 | 0.3326825 | 20.8065884 | 0.0000000 |
| Support_Crew_Available.catLow to Medium | -0.5156077 | 0.8901764 | -0.5792197 | 0.5624773 |
| Support_Crew_Available.catMedium | 0.3286920 | 1.4902084 | 0.2205678 | 0.8254415 |
| Support_Crew_Available.catMedium to High | 1.4322439 | 2.2391682 | 0.6396321 | 0.5224527 |
| Support_Crew_Available.catHigh | 7.5915823 | 4.0203081 | 1.8883086 | 0.0590653 |
par(mfrow=c(2,2))
plot(reduced.model)
Residual plots of the full model
The model still has model violations so we will need to do a linear transformation. Since the variance is not constant, we can use a variable transformation on the reduced model.
For the Yeo Johnson transformation, we will be transforming the response variable instead of taking out explanatory variables. In the method we use here, we will be using the variables from the stepAIC reduced model. Similarly to a Box Cox transformation, the plots will give us an optimal lambda value under different transformed predictor variables. The only thing that makes this transformation different from the Box Cox transformation is that the response variable can include numbers less than 0.
yeoJmodel <- lm(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data=flightDelay)
yj<- boxCox(yeoJmodel, family="yjPower", plotit =TRUE, lambda = seq(0.5, 0.75, 1/100),)
In this model, we are doing a transformation on the response variable, Arr_Delay. Above, we found an optimal lambda value with the graph to plug into the equation \[ArrDelay^(\lambda-1))/\lambda\] to transform the response. We used the new response in to make a new model.
This table will give us our regression coefficients for the model.
cmtrx <- summary(sqrt.model)$coef
kable(cmtrx, caption = "Inferential Statistics of Final Model")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -106.6133879 | 1.3105262 | -81.351587 | 0 |
| Airport_Distance | 0.0328899 | 0.0024366 | 13.498098 | 0 |
| Number_of_flights | 0.0008780 | 0.0000196 | 44.890061 | 0 |
| Weather | 0.7500768 | 0.0818800 | 9.160684 | 0 |
| Support_Crew_Available | -0.0104628 | 0.0009578 | -10.923845 | 0 |
| Baggage_loading_time | 2.6314888 | 0.0789673 | 33.323758 | 0 |
| Late_Arrival_o | 1.3517826 | 0.0599875 | 22.534398 | 0 |
par(mfrow=c(2,2))
plot(sqrt.model)
The transformed model is definitely our best model so far, but the plots for this model still show us violations to the model assumptions. The variance is not constant and the normality assumptions are still not met. Since there are still assumption violations and we have a large sample size, the next thing we can try is a nonparametric method.
There are two different non-parametric bootstrap methods for bootstrap regression modeling. We can bootstrap bases on the cases or the residuals. We will be using both methods in this project.
To bootstrap cases, we will be taking a random sample of the observation ID and using it to take a corresponding record from the bootstrap sample with replacement. We repeat this many times to get bootstrap coefficients that can be used to make a confidence interval of the coefficient.
B = 1000
num.p = dim(model.frame(sqrt.model))[2]
smpl.n = dim(model.frame(sqrt.model))[1]
coef.mtrx = matrix(rep(0, B*num.p), ncol = num.p)
for (i in 1:B){
bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE)
sqrt.model.btc = lm(ArrDelay.25 ~ Airport_Distance + Number_of_flights + Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = flightDelay[bootc.id,])
coef.mtrx[i,] = coef(sqrt.model.btc)
}
The following histograms of the bootstrap estimates of regression coefficients represent the sampling distributions of the corresponding estimates in the final model. All of the histograms are normal.
par(mfrow=c(2,4)) # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="Airport Distance" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="Num of Flights" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=4, var.nm ="Weather" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=5, var.nm ="Support Crew" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=6, var.nm ="Baggage Loading")
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=7, var.nm ="Late Arrival")
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btc.ci.95=btc.ci)),
caption = "Regression Coefficient Matrix")
| Estimate | Std. Error | t value | Pr(>|t|) | btc.ci.95 | |
|---|---|---|---|---|---|
| (Intercept) | -106.6134 | 1.3105 | -81.3516 | 0.0000 | [ -109.1117 , -104.2545 ] |
| Airport_Distance | 0.0329 | 0.0024 | 13.4981 | 0.0000 | [ 0.0281 , 0.0378 ] |
| Number_of_flights | 0.0009 | 0.0000 | 44.8901 | 0.0000 | [ 8e-04 , 9e-04 ] |
| Weather | 0.7501 | 0.0819 | 9.1607 | 0.0000 | [ 0.5979 , 0.9116 ] |
| Support_Crew_Available | -0.0105 | 0.0010 | -10.9238 | 0.0000 | [ -0.0123 , -0.0087 ] |
| Baggage_loading_time | 2.6315 | 0.0790 | 33.3238 | 0.0000 | [ 2.4853 , 2.7738 ] |
| Late_Arrival_o | 1.3518 | 0.0600 | 22.5344 | 0.0000 | [ 1.2403 , 1.4704 ] |
Since 0 is not included in any of the confidence intervals, they are all significant.
In this method of bootstrapping the residuals, we will be taking a bootstrap sample from the set of residuals, and fitting the final model to these bootstrapped residuals many times. Then, we can denote the residual bootstrap regression coefficients and make confidence intervals.
We can start by making a histogram of the residuals, which shows us our residuals are approximately normal with a few outliers.
hist(sort(sqrt.model$residuals),n=40,
xlab="Residuals",
col = "lightblue",
border="navy",
main = "Histogram of Residuals")
Then, we proceed with bootstrapping the residuals.
model.resid = sqrt.model$residuals
B=1000
num.p = dim(model.matrix(sqrt.model))[2] # number of parameters
samp.n = dim(model.matrix(sqrt.model))[1] # sample size
btr.mtrx = matrix(rep(0,7*B), ncol=num.p) # zero matrix to store boot coefs
for (i in 1:B){
## Bootstrap response values
bt.sqrt = sqrt.model$fitted.values +
sample(sqrt.model$residuals, samp.n, replace = TRUE) # bootstrap residuals
# replace PriceUnitArea with bootstrap log price
flightDelay$bt.sqrt = bt.sqrt # send the boot response to the data
btr.model = lm(bt.sqrt~ Airport_Distance + Number_of_flights + Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data=flightDelay) # b
btr.mtrx[i, ]=btr.model$coefficients
}
Next, I make histograms of the residual bootstrap estimates of the regression coefficients. All of the histograms are normal.
par(mfrow=c(2,4)) # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="Airport Distance" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="Num of Flights" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=4, var.nm ="Weather" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=5, var.nm ="Support Crew" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=6, var.nm ="Baggage Loading" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=7, var.nm ="Late Arrival")
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btr.ci.95=btr.ci)),
caption = "Regression Coefficient Matrix with 95% Residual Bootstrap CI")
| Estimate | Std. Error | t value | Pr(>|t|) | btr.ci.95 | |
|---|---|---|---|---|---|
| (Intercept) | -106.6134 | 1.3105 | -81.3516 | 0.0000 | [ -109.0143 , -104.0873 ] |
| Airport_Distance | 0.0329 | 0.0024 | 13.4981 | 0.0000 | [ 0.0282 , 0.0378 ] |
| Number_of_flights | 0.0009 | 0.0000 | 44.8901 | 0.0000 | [ 8e-04 , 9e-04 ] |
| Weather | 0.7501 | 0.0819 | 9.1607 | 0.0000 | [ 0.5915 , 0.9217 ] |
| Support_Crew_Available | -0.0105 | 0.0010 | -10.9238 | 0.0000 | [ -0.0123 , -0.0087 ] |
| Baggage_loading_time | 2.6315 | 0.0790 | 33.3238 | 0.0000 | [ 2.4784 , 2.7809 ] |
| Late_Arrival_o | 1.3518 | 0.0600 | 22.5344 | 0.0000 | [ 1.2353 , 1.4698 ] |
Since 0 is not included in any of the confidence intervals, they are all significant. Our residual bootstrap confidence intervals yield the same results as p-values do because the sample size is large enough so that the sampling distributions of estimated coefficients have sufficiently good approximations of normal distributions.
The following table gives us the goodness of fit measures of each model, which will tell us which model is best for linear regression analysis. We are looking for a model with the lowest SSE, CP, AIC, SBC and press. On the other hand, we want the model with the highest r squared adjusted value. Our best models are the transformed, bootstrap cases, and bootstrap residual models.
output.sum = rbind(select(full.model), select(model2), select(reduced.model), select(sqrt.model), select(sqrt.model.btc), select(btr.model))
row.names(output.sum) = c("Full model", "Model without Carrier", "Step AIC model", "Transformed model", "Bootstrap Cases", "Bootstrap Residuals")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
| SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS | |
|---|---|---|---|---|---|---|---|
| Full model | 552202.17 | 0.8199993 | 0.8186869 | 27 | 18144.492 | 18311.534 | 560823.11 |
| Model without Carrier | 555166.20 | 0.8190331 | 0.8185785 | 10 | 18129.726 | 18191.594 | 558445.60 |
| Step AIC model | 554058.67 | 0.8193941 | 0.8188899 | 11 | 18124.551 | 18192.605 | 557661.58 |
| Transformed model | 18052.92 | 0.8424637 | 0.8422001 | 7 | 5814.252 | 5857.559 | 18122.12 |
| Bootstrap Cases | 18166.60 | 0.8398981 | 0.8396302 | 7 | 5836.806 | 5880.113 | 18237.05 |
| Bootstrap Residuals | 18007.69 | 0.8430131 | 0.8427504 | 7 | 5805.237 | 5848.545 | 18078.25 |
Looking at the width of the confidence interval shows us the bootstrap cases and bootstrap residuals are very similar.
kable(round(cbind(btc.wd, btr.wd),4), caption="width of the two bootstrap confidence intervals")
| btc.wd | btr.wd |
|---|---|
| 4.8572 | 4.9270 |
| 0.0097 | 0.0096 |
| 0.0001 | 0.0001 |
| 0.3138 | 0.3302 |
| 0.0036 | 0.0036 |
| 0.2884 | 0.3025 |
| 0.2301 | 0.2346 |
Considering the interpretability, goodness-of-fit, and simplicity, the best model to predict the number in minutes that a flight is delayed is the final model we made is either the parametric model, bootstrap cases model, or bootstrap residual model. Because the results from the parametric and both bootstrap models yield similar results and the parametic model has no serious assumption violations, we can choose any one to do our discussion about.
The model we will interpret is the transformed model, the inferential statistics of this model are as follows:
cmtrx <- summary(sqrt.model)$coef
kable(cmtrx, caption = "Inferential Statistics of Final Model")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -106.6133879 | 1.3105262 | -81.351587 | 0 |
| Airport_Distance | 0.0328899 | 0.0024366 | 13.498098 | 0 |
| Number_of_flights | 0.0008780 | 0.0000196 | 44.890061 | 0 |
| Weather | 0.7500768 | 0.0818800 | 9.160684 | 0 |
| Support_Crew_Available | -0.0104628 | 0.0009578 | -10.923845 | 0 |
| Baggage_loading_time | 2.6314888 | 0.0789673 | 33.323758 | 0 |
| Late_Arrival_o | 1.3517826 | 0.0599875 | 22.534398 | 0 |
For example, the regression coefficient for the Weather variable is .7500. This means the Weather is 7.5% associated with the Arr_Delay variable. We can use this rule of thumb for each of the other regression coefficients.
The six variables (Airport_Distance, Number_of_flights, Weather, Support_Crew_Available, Baggage_loading_time, and Late_Arrival_o) are the most significant and relevant to the Arr_Delay response variable.