1 Introduction

For this report, I will be using a data set consisting of delayed flight times of various airlines. In this data set, there are 11 variables and 3593 observations. The variables are:

Carrier (Categorical) - The airline that the flight is taken with
Airport_Distance (Numeric) - The Distance between the airports in miles
Number_of_flights (Numeric) - Total no of flights in airport
Weather (Numeric) - Delay due to weather condition ranked 0-10, with 0 being mild and 10 being extreme
Support_Crew_Available (Numeric) - Total number of support crew available
Baggage_loading_time (Numeric) - Time in minutes for loading the baggage
Late_Arrival_o (Numeric) - Time in minutes for late arriving aircraft of the same flight
Cleaning_o (Numeric) - Time in minutes for aircraft cleaning
Fueling_o (Numeric) - Time in minutes for aircraft fueling
Security_o (Numeric) - Time in minutes for security checking
Arr_Delay (Numeric) - Flight Arr_Delay in minutes. It is dependent variable in the model

The goal for this analysis is to determine what factors affect the delayed arrival times.

flight1 <- read.csv("flight_delay-data.csv")

2 Fitting the Model: Part 1 - Exploration

2.1 Full Model

To start, the full Mulitple Linear Regression model will be fitted with all of the variables in the data set.

full.model <- lm(Arr_Delay ~ ., data = flight1)

Can this model be changed before any sort of analysis is done?

kable(summary(full.model)$coef)

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-574.5592897	7.6427175	-75.1773555	0.0000000
CarrierAA	-0.3068142	1.3750922	-0.2231227	0.8234528
CarrierAS	1.3060328	4.1334965	0.3159632	0.7520489
CarrierB6	-1.7685278	1.3177103	-1.3421219	0.1796418
CarrierDL	-1.7558491	1.3498032	-1.3008186	0.1934045
CarrierEV	-1.7244875	1.3619601	-1.2661806	0.2055310
CarrierF9	-0.9614506	4.5728315	-0.2102528	0.8334824
CarrierFL	0.6205342	2.2470500	0.2761550	0.7824450
CarrierHA	1.7141523	6.3522393	0.2698501	0.7872912
CarrierMQ	-0.6700343	1.4206138	-0.4716513	0.6372045
CarrierUA	-1.5067974	1.3133452	-1.1472972	0.2513357
CarrierUS	-0.8850271	1.5820406	-0.5594212	0.5759094
CarrierVX	0.1461768	2.1223356	0.0688754	0.9450926
CarrierWN	-1.9051889	1.6726924	-1.1389954	0.2547815
Airport_Distance	0.1727325	0.0135557	12.7424675	0.0000000
Number_of_flights	0.0044246	0.0001088	40.6645781	0.0000000
Weather	4.4365237	0.4553804	9.7424566	0.0000000
Support_Crew_Available	-0.0489392	0.0053332	-9.1762611	0.0000000
Baggage_loading_time	13.4992387	0.4394619	30.7176559	0.0000000
Late_Arrival_o	6.9106449	0.3336251	20.7138026	0.0000000
Cleaning_o	0.0548127	0.0600167	0.9132910	0.3611512
Fueling_o	-0.0614043	0.0596022	-1.0302365	0.3029688
Security_o	0.0077875	0.0295840	0.2632347	0.7923849

According to the table, our new linear and reduced model is Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o.

Analysis from this point forward will be performed with this reduced model.

reduced.model <- lm(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = flight1)

Before any model interpretations are made, residual analysis needs to be done to determine if there are any assumption violations in the model. The variance inflation factors of the model can also be checked for any issues concerning multicollinearity.

par(mfrow=c(2,2))
plot(reduced.model)

vif(reduced.model)

##       Airport_Distance      Number_of_flights                Weather 
##               1.248330               2.136143               1.093489 
## Support_Crew_Available   Baggage_loading_time         Late_Arrival_o 
##               1.125079               2.074031               1.616311

barplot(vif(reduced.model), main = "VIF Values", horiz = FALSE, col = "steelblue")

It seems there are some assumption violations with this model. The QQ plot shows that the assumption of normality may not be satisfied. As for the Residuals vs Fit plot, it shows that the assumption of homoscedasticity, and because of the curved pattern, this model may not be good for prediction. So, changes need to be made first.

There is some good news, though: the VIFs suggest that there is no multicollinearity.

3 Fitting the Model: Part 2

3.1 Transformations to the Response Variable

Several transformations of the response variable will be used to correct the heteroscedasticity and non-normality in the model. Those transformations will be compared to see which transformation is the best and thus which model is the best.

To correct normality issues, a box-cox transformation will be applied to the response variable. To do this, multiple values of lambda will be used to determine the best transformation.

*Note, in order for the box-cox transformation to work, all values of the response variable must be positive, i.e. greater than 0. There are three observations that are equal to 0, so those observations will be removed before applying the box cox transformations.

flight2 <- flight1[-which(flight1$Arr_Delay == 0),]

par(pty = "s", mfrow = c(2, 2), oma=c(.1,.1,.1,.1), mar=c(4, 0, 2, 0))

boxcox(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + 
Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = flight2, 
lambda = seq(-0.5, 1, length = 10), xlab=expression(paste(lambda)))

boxcox(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + 
Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = flight2, 
lambda = seq(0, 1, length = 10), xlab=expression(paste(lambda)))

boxcox(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + 
Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = flight2, 
lambda = seq(0.5, 0.7, length = 10), xlab=expression(paste(lambda)))

I’m not quite certain which transformation is the best, but it seems that the optimal value for lambda consistently seems to be between 0.5 and 0.7, so 3 different transformations with lambda values 0.5, 0.6, and 0.7 will all be applied.

lambdaTrans1 <- lm((Arr_Delay)^0.5 ~ Airport_Distance + Number_of_flights + 
Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, 
data = flight2) 
kable(summary(lambdaTrans1)$coef, caption = "Lambda = 0.5 (SQRT Transformation)")

Lambda = 0.5 (SQRT Transformation)
	Estimate	Std. Error	t value
(Intercept)	-34.0259067	0.4393635	-77.443629
Airport_Distance	0.0109793	0.0008131	13.503838
Number_of_flights	0.0002962	0.0000065	45.303973
Weather	0.2377667	0.0273096	8.706342
Support_Crew_Available	-0.0035656	0.0003196	-11.157298
Baggage_loading_time	0.8818184	0.0263490	33.466824
Late_Arrival_o	0.4553751	0.0200190	22.747170

lambdaTrans2 <- lm((Arr_Delay)^0.6 ~ Airport_Distance + Number_of_flights + 
Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, 
data = flight2) 
kable(summary(lambdaTrans2)$coef, caption = "lambda = 0.6")

lambda = 0.6
	Estimate	Std. Error	t value
(Intercept)	-63.2738888	0.7893245	-80.162070
Airport_Distance	0.0198537	0.0014607	13.592265
Number_of_flights	0.0005295	0.0000117	45.074760
Weather	0.4463256	0.0490622	9.097139
Support_Crew_Available	-0.0062481	0.0005741	-10.882899
Baggage_loading_time	1.5821825	0.0473365	33.424147
Late_Arrival_o	0.8162203	0.0359645	22.695194

lambdaTrans3 <- lm((Arr_Delay)^0.7 ~ Airport_Distance + Number_of_flights + 
Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, 
data = flight2) 
kable(summary(lambdaTrans3)$coef, caption = "lambda = 0.7")

lambda = 0.7
	Estimate	Std. Error	t value
(Intercept)	-113.5599912	1.3935691	-81.488599
Airport_Distance	0.0350158	0.0025788	13.578204
Number_of_flights	0.0009235	0.0000207	44.529044
Weather	0.8128393	0.0866203	9.383933
Support_Crew_Available	-0.0106732	0.0010136	-10.529777
Baggage_loading_time	2.7676442	0.0835736	33.116248
Late_Arrival_o	1.4274972	0.0634960	22.481685

Because the square root transformation has the smallest standard errors, that will be the transformation that will be chosen.

par(mfrow = c(2,2))
plot(lambdaTrans1)

All of the diagnostics look good for this model. They look much better than diagnostics from the original model without the transformation.

3.2 Goodness of fit measures

Now, the goodness of fit measures of the three candidate models will be generated. The model with the best goodness of fit measures will be chosen as the final model.

select = function(m) { # m is an object: model
e = m$resid                           # residuals
n0 = length(e)                        # sample size
SSE=(m$df)*(summary(m)$sigma)^2       # sum of squared error
R.sq=summary(m)$r.squared             # Coefficient of determination: R square!
R.adj=summary(m)$adj.r                # Adjusted R square
MSE=(summary(m)$sigma)^2              # square error
Cp=(SSE/MSE)-(n0-2*(n0-m$df))         # Mellow's p
AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)          # Akaike information criterion
SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)  # Schwarz Bayesian Information criterion
X=model.matrix(m)                     # design matrix of the model
H=X%*%solve(t(X)%*%X)%*%t(X)          # hat matrix
d=e/(1-diag(H))                       
PRESS=t(d)%*%d   # predicted residual error sum of squares (PRESS)- a cross-validation measure
 tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
 names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
 tbl
}

output.sum = rbind(select(full.model), select(reduced.model), select(lambdaTrans1))
row.names(output.sum) = c("full.model", "reduced.model", "lambdaTrans1")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")

Goodness-of-fit Measures of Candidate Models
	SSE	R.sq	R.adj	Cp	AIC	SBC	PRESS
full.model	553651.41	0.8195269	0.8184147	23	18145.909	18288.204	561015.961
reduced.model	555449.53	0.8189407	0.8186378	7	18125.559	18168.867	557806.554
lambdaTrans1	2005.76	0.8419195	0.8416548	7	-2075.844	-2032.543	2013.419

The model with the box-cox transformation by far has the best goodness of fit measures. Adjusted R^2 is the highest, and all of the reported errors are the smallest of the three models. So, this will be chosen as the final model.

3.3 The Final Model

kable(summary(lambdaTrans1)$coef, caption = "Inferential Statistics of Final Model")

Inferential Statistics of Final Model
	Estimate	Std. Error	t value
(Intercept)	-34.0259067	0.4393635	-77.443629
Airport_Distance	0.0109793	0.0008131	13.503838
Number_of_flights	0.0002962	0.0000065	45.303973
Weather	0.2377667	0.0273096	8.706342
Support_Crew_Available	-0.0035656	0.0003196	-11.157298
Baggage_loading_time	0.8818184	0.0263490	33.466824
Late_Arrival_o	0.4553751	0.0200190	22.747170

Referring to this table from earlier, we can see the coefficients for this model. And because the sample size is big enough to satisfy the central limit theorem, we can use the p-values to assess for any further variable selection. Since all of the p-values are close to zero, there’s no need for any further variable selection. And, the explicit regression equation can be written as: \[ sqrt(Arr_Delay) = -34.0259 + 0.011 \times Airport_Distance + 0.0003 \times Number_of_flights + 0.2378 \times Weather - 0.0036 \times Support_Crew_Available + 0.8818 \times Baggage_loading_time + 0.4554 \times Late_Arrival_o \]

4 Summary

This study focused on finding any factors that contributed to airplanes’ delayed arrival times. Many linear regression techniques were used to generate several candidate models, and comparisons were made between those models. Ultimately, the best model chosen had a box-cox transformation used on the response variable with a lambda value of 0.5.

STA321 Homework 3: What factors affect Arrival Time?

Ian VanWright

10/01/2023