For this report, I will be using a data set consisting of delayed flight times of various airlines. In this data set, there are 11 variables and 3593 observations. The variables are:
The goal for this analysis is to determine what factors affect the delayed arrival times.
flight1 <- read.csv("flight_delay-data.csv")
To start, the full Mulitple Linear Regression model will be fitted with all of the variables in the data set.
full.model <- lm(Arr_Delay ~ ., data = flight1)
Can this model be changed before any sort of analysis is done?
kable(summary(full.model)$coef)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -574.5592897 | 7.6427175 | -75.1773555 | 0.0000000 |
| CarrierAA | -0.3068142 | 1.3750922 | -0.2231227 | 0.8234528 |
| CarrierAS | 1.3060328 | 4.1334965 | 0.3159632 | 0.7520489 |
| CarrierB6 | -1.7685278 | 1.3177103 | -1.3421219 | 0.1796418 |
| CarrierDL | -1.7558491 | 1.3498032 | -1.3008186 | 0.1934045 |
| CarrierEV | -1.7244875 | 1.3619601 | -1.2661806 | 0.2055310 |
| CarrierF9 | -0.9614506 | 4.5728315 | -0.2102528 | 0.8334824 |
| CarrierFL | 0.6205342 | 2.2470500 | 0.2761550 | 0.7824450 |
| CarrierHA | 1.7141523 | 6.3522393 | 0.2698501 | 0.7872912 |
| CarrierMQ | -0.6700343 | 1.4206138 | -0.4716513 | 0.6372045 |
| CarrierUA | -1.5067974 | 1.3133452 | -1.1472972 | 0.2513357 |
| CarrierUS | -0.8850271 | 1.5820406 | -0.5594212 | 0.5759094 |
| CarrierVX | 0.1461768 | 2.1223356 | 0.0688754 | 0.9450926 |
| CarrierWN | -1.9051889 | 1.6726924 | -1.1389954 | 0.2547815 |
| Airport_Distance | 0.1727325 | 0.0135557 | 12.7424675 | 0.0000000 |
| Number_of_flights | 0.0044246 | 0.0001088 | 40.6645781 | 0.0000000 |
| Weather | 4.4365237 | 0.4553804 | 9.7424566 | 0.0000000 |
| Support_Crew_Available | -0.0489392 | 0.0053332 | -9.1762611 | 0.0000000 |
| Baggage_loading_time | 13.4992387 | 0.4394619 | 30.7176559 | 0.0000000 |
| Late_Arrival_o | 6.9106449 | 0.3336251 | 20.7138026 | 0.0000000 |
| Cleaning_o | 0.0548127 | 0.0600167 | 0.9132910 | 0.3611512 |
| Fueling_o | -0.0614043 | 0.0596022 | -1.0302365 | 0.3029688 |
| Security_o | 0.0077875 | 0.0295840 | 0.2632347 | 0.7923849 |
According to the table, our new linear and reduced model is Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o.
Analysis from this point forward will be performed with this reduced model.
reduced.model <- lm(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = flight1)
Before any model interpretations are made, residual analysis needs to be done to determine if there are any assumption violations in the model. The variance inflation factors of the model can also be checked for any issues concerning multicollinearity.
par(mfrow=c(2,2))
plot(reduced.model)
vif(reduced.model)
## Airport_Distance Number_of_flights Weather
## 1.248330 2.136143 1.093489
## Support_Crew_Available Baggage_loading_time Late_Arrival_o
## 1.125079 2.074031 1.616311
barplot(vif(reduced.model), main = "VIF Values", horiz = FALSE, col = "steelblue")
It seems there are some assumption violations with this model. The QQ plot shows that the assumption of normality may not be satisfied. As for the Residuals vs Fit plot, it shows that the assumption of homoscedasticity, and because of the curved pattern, this model may not be good for prediction. So, changes need to be made first.
There is some good news, though: the VIFs suggest that there is no multicollinearity.
Several transformations of the response variable will be used to correct the heteroscedasticity and non-normality in the model. Those transformations will be compared to see which transformation is the best and thus which model is the best.
To correct normality issues, a box-cox transformation will be applied to the response variable. To do this, multiple values of lambda will be used to determine the best transformation.
*Note, in order for the box-cox transformation to work, all values of the response variable must be positive, i.e. greater than 0. There are three observations that are equal to 0, so those observations will be removed before applying the box cox transformations.
flight2 <- flight1[-which(flight1$Arr_Delay == 0),]
par(pty = "s", mfrow = c(2, 2), oma=c(.1,.1,.1,.1), mar=c(4, 0, 2, 0))
boxcox(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather +
Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = flight2,
lambda = seq(-0.5, 1, length = 10), xlab=expression(paste(lambda)))
boxcox(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather +
Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = flight2,
lambda = seq(0, 1, length = 10), xlab=expression(paste(lambda)))
boxcox(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather +
Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = flight2,
lambda = seq(0.5, 0.7, length = 10), xlab=expression(paste(lambda)))
I’m not quite certain which transformation is the best, but it seems that the optimal value for lambda consistently seems to be between 0.5 and 0.7, so 3 different transformations with lambda values 0.5, 0.6, and 0.7 will all be applied.
lambdaTrans1 <- lm((Arr_Delay)^0.5 ~ Airport_Distance + Number_of_flights +
Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o,
data = flight2)
kable(summary(lambdaTrans1)$coef, caption = "Lambda = 0.5 (SQRT Transformation)")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -34.0259067 | 0.4393635 | -77.443629 | 0 |
| Airport_Distance | 0.0109793 | 0.0008131 | 13.503838 | 0 |
| Number_of_flights | 0.0002962 | 0.0000065 | 45.303973 | 0 |
| Weather | 0.2377667 | 0.0273096 | 8.706342 | 0 |
| Support_Crew_Available | -0.0035656 | 0.0003196 | -11.157298 | 0 |
| Baggage_loading_time | 0.8818184 | 0.0263490 | 33.466824 | 0 |
| Late_Arrival_o | 0.4553751 | 0.0200190 | 22.747170 | 0 |
lambdaTrans2 <- lm((Arr_Delay)^0.6 ~ Airport_Distance + Number_of_flights +
Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o,
data = flight2)
kable(summary(lambdaTrans2)$coef, caption = "lambda = 0.6")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -63.2738888 | 0.7893245 | -80.162070 | 0 |
| Airport_Distance | 0.0198537 | 0.0014607 | 13.592265 | 0 |
| Number_of_flights | 0.0005295 | 0.0000117 | 45.074760 | 0 |
| Weather | 0.4463256 | 0.0490622 | 9.097139 | 0 |
| Support_Crew_Available | -0.0062481 | 0.0005741 | -10.882899 | 0 |
| Baggage_loading_time | 1.5821825 | 0.0473365 | 33.424147 | 0 |
| Late_Arrival_o | 0.8162203 | 0.0359645 | 22.695194 | 0 |
lambdaTrans3 <- lm((Arr_Delay)^0.7 ~ Airport_Distance + Number_of_flights +
Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o,
data = flight2)
kable(summary(lambdaTrans3)$coef, caption = "lambda = 0.7")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -113.5599912 | 1.3935691 | -81.488599 | 0 |
| Airport_Distance | 0.0350158 | 0.0025788 | 13.578204 | 0 |
| Number_of_flights | 0.0009235 | 0.0000207 | 44.529044 | 0 |
| Weather | 0.8128393 | 0.0866203 | 9.383933 | 0 |
| Support_Crew_Available | -0.0106732 | 0.0010136 | -10.529777 | 0 |
| Baggage_loading_time | 2.7676442 | 0.0835736 | 33.116248 | 0 |
| Late_Arrival_o | 1.4274972 | 0.0634960 | 22.481685 | 0 |
Because the square root transformation has the smallest standard errors, that will be the transformation that will be chosen.
par(mfrow = c(2,2))
plot(lambdaTrans1)
All of the diagnostics look good for this model. They look much better than diagnostics from the original model without the transformation.
Now, the goodness of fit measures of the three candidate models will be generated. The model with the best goodness of fit measures will be chosen as the final model.
select = function(m) { # m is an object: model
e = m$resid # residuals
n0 = length(e) # sample size
SSE=(m$df)*(summary(m)$sigma)^2 # sum of squared error
R.sq=summary(m)$r.squared # Coefficient of determination: R square!
R.adj=summary(m)$adj.r # Adjusted R square
MSE=(summary(m)$sigma)^2 # square error
Cp=(SSE/MSE)-(n0-2*(n0-m$df)) # Mellow's p
AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df) # Akaike information criterion
SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df) # Schwarz Bayesian Information criterion
X=model.matrix(m) # design matrix of the model
H=X%*%solve(t(X)%*%X)%*%t(X) # hat matrix
d=e/(1-diag(H))
PRESS=t(d)%*%d # predicted residual error sum of squares (PRESS)- a cross-validation measure
tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
tbl
}
output.sum = rbind(select(full.model), select(reduced.model), select(lambdaTrans1))
row.names(output.sum) = c("full.model", "reduced.model", "lambdaTrans1")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
| SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS | |
|---|---|---|---|---|---|---|---|
| full.model | 553651.41 | 0.8195269 | 0.8184147 | 23 | 18145.909 | 18288.204 | 561015.961 |
| reduced.model | 555449.53 | 0.8189407 | 0.8186378 | 7 | 18125.559 | 18168.867 | 557806.554 |
| lambdaTrans1 | 2005.76 | 0.8419195 | 0.8416548 | 7 | -2075.844 | -2032.543 | 2013.419 |
The model with the box-cox transformation by far has the best goodness of fit measures. Adjusted R^2 is the highest, and all of the reported errors are the smallest of the three models. So, this will be chosen as the final model.
kable(summary(lambdaTrans1)$coef, caption = "Inferential Statistics of Final Model")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -34.0259067 | 0.4393635 | -77.443629 | 0 |
| Airport_Distance | 0.0109793 | 0.0008131 | 13.503838 | 0 |
| Number_of_flights | 0.0002962 | 0.0000065 | 45.303973 | 0 |
| Weather | 0.2377667 | 0.0273096 | 8.706342 | 0 |
| Support_Crew_Available | -0.0035656 | 0.0003196 | -11.157298 | 0 |
| Baggage_loading_time | 0.8818184 | 0.0263490 | 33.466824 | 0 |
| Late_Arrival_o | 0.4553751 | 0.0200190 | 22.747170 | 0 |
Referring to this table from earlier, we can see the coefficients for this model. And because the sample size is big enough to satisfy the central limit theorem, we can use the p-values to assess for any further variable selection. Since all of the p-values are close to zero, there’s no need for any further variable selection. And, the explicit regression equation can be written as: \[ sqrt(Arr_Delay) = -34.0259 + 0.011 \times Airport_Distance + 0.0003 \times Number_of_flights + 0.2378 \times Weather - 0.0036 \times Support_Crew_Available + 0.8818 \times Baggage_loading_time + 0.4554 \times Late_Arrival_o \]
This study focused on finding any factors that contributed to airplanes’ delayed arrival times. Many linear regression techniques were used to generate several candidate models, and comparisons were made between those models. Ultimately, the best model chosen had a box-cox transformation used on the response variable with a lambda value of 0.5.