Introduction
This report compares standard multiple linear regression with
bootstrap-resampled multiple linear regression.
The data set used for this report is publicly available in the GitHub
repository ncbrechbill/STA321, the repository for the class STA321,
Topics in Advanced Statistics: https://github.com/ncbrechbill/STA321.
The following continuous variables will be used for all further
analyses:
Arr_Delay \(Y\): Arrival delay to destination, in minutes.
Airport_Distance \(X_1\): The distance between the airports in miles.
Number_of_flights \(X_2\): Total number of flights in the departure airport.
Support_Crew_Available \(X_3\): Total number of support crew available.
Baggage_loading_time \(X_4\): Time to load all baggage in minutes.
Late_Arrival_o \(X_5\): Aircraft’s late arrival to airport in minutes.
Cleaning_o \(X_6\): Time to clean the aircraft in minutes.
Fueling_o \(X_7\): Time to fuel the aircraft in minutes.
Security_o \(X_8\): Time to complete security clearing in minutes.

Practical Question
We aim to model and predict a flight’s arrival delay from measurable
operational variables. Such a model lets an airport give airlines,
aircraft crew, and passengers an estimate of the expected delay. It also
points to the factors worth targeting in order to reduce delays and
improve airport efficiency and customer satisfaction.
Standard Multiple Linear Regression
Let \(\{x_1, x_2, \cdots, x_k \}\)
be \(k\) explanatory variables and
\(y\) be the response variable. The
general form of the multiple linear regression model is defined as
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k +
\epsilon.
\]
We first start with a full model.
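As a minimal sketch of this step, the full model regresses Arr_Delay on all eight candidate predictors. The report does not show how the data are loaded, so the sketch below simulates a stand-in data frame with the same column names. The value ranges, the coefficients used in the simulation, and the object name `full.model` are assumptions made only for illustration.

```r
set.seed(321)   # arbitrary seed so the sketch is reproducible
n <- 200

# Simulated stand-in for the flight data (NOT the real data set)
flight <- data.frame(
  Airport_Distance       = runif(n, 300, 500),
  Number_of_flights      = runif(n, 30000, 60000),
  Support_Crew_Available = runif(n, 50, 250),
  Baggage_loading_time   = runif(n, 14, 19),
  Late_Arrival_o         = runif(n, 15, 20),
  Cleaning_o             = runif(n, 5, 15),
  Fueling_o              = runif(n, 15, 30),
  Security_o             = runif(n, 20, 50)
)
# Arbitrary data-generating process, chosen only so that lm() has a signal to fit
flight$Arr_Delay <- with(flight,
  0.005 * Number_of_flights + 14 * Baggage_loading_time + rnorm(n, sd = 10))

# Full model: the "." stands for every other column in the data frame
full.model <- lm(Arr_Delay ~ ., data = flight)
coef(summary(full.model))   # estimates, standard errors, t values, p-values
```

With the real data, only the `lm()` call and the summary would be needed.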




Stepwise selection will determine which variables are most important
to the model. We will perform stepwise selection in both directions
twice to check consistency: once starting from an “empty” model with no
predictors and once from a “full” model with all predictors. The trace
below is from the search that starts at the empty model.
## Start: AIC=24253.75
## Arr_Delay ~ 1
##
## Df Sum of Sq RSS AIC
## + Number_of_flights 1 2077915 989863 20192
## + Baggage_loading_time 1 1883866 1183911 20835
## + Late_Arrival_o 1 1364163 1703614 22142
## + Airport_Distance 1 713213 2354565 23305
## + Support_Crew_Available 1 401766 2666011 23751
## + Security_o 1 19673 3048105 24233
## + Fueling_o 1 4034 3063744 24251
## <none> 3067777 24254
## + Cleaning_o 1 37 3067740 24256
##
## Step: AIC=20191.55
## Arr_Delay ~ Number_of_flights
##
## Df Sum of Sq RSS AIC
## + Baggage_loading_time 1 297505 692358 18909
## + Late_Arrival_o 1 176896 812967 19486
## + Airport_Distance 1 80801 909062 19888
## + Support_Crew_Available 1 44206 945656 20029
## + Security_o 1 1072 988791 20190
## <none> 989863 20192
## + Cleaning_o 1 119 989744 20193
## + Fueling_o 1 3 989859 20194
## - Number_of_flights 1 2077915 3067777 24254
##
## Step: AIC=18909.19
## Arr_Delay ~ Number_of_flights + Baggage_loading_time
##
## Df Sum of Sq RSS AIC
## + Late_Arrival_o 1 80895 611462 18465
## + Airport_Distance 1 34139 658219 18730
## + Support_Crew_Available 1 18656 673702 18813
## <none> 692358 18909
## + Fueling_o 1 375 691982 18909
## + Security_o 1 230 692128 18910
## + Cleaning_o 1 185 692173 18910
## - Baggage_loading_time 1 297505 989863 20192
## - Number_of_flights 1 491553 1183911 20835
##
## Step: AIC=18464.76
## Arr_Delay ~ Number_of_flights + Baggage_loading_time + Late_Arrival_o
##
## Df Sum of Sq RSS AIC
## + Airport_Distance 1 27479 583984 18302
## + Support_Crew_Available 1 14610 596853 18380
## <none> 611462 18465
## + Fueling_o 1 327 611135 18465
## + Cleaning_o 1 282 611181 18465
## + Security_o 1 15 611447 18467
## - Late_Arrival_o 1 80895 692358 18909
## - Baggage_loading_time 1 201504 812967 19486
## - Number_of_flights 1 324014 935476 19991
##
## Step: AIC=18301.55
## Arr_Delay ~ Number_of_flights + Baggage_loading_time + Late_Arrival_o +
## Airport_Distance
##
## Df Sum of Sq RSS AIC
## + Support_Crew_Available 1 13580 570404 18219
## <none> 583984 18302
## + Cleaning_o 1 264 583720 18302
## + Fueling_o 1 180 583803 18302
## + Security_o 1 8 583976 18304
## - Airport_Distance 1 27479 611462 18465
## - Late_Arrival_o 1 74235 658219 18730
## - Baggage_loading_time 1 172621 756605 19230
## - Number_of_flights 1 283853 867836 19723
##
## Step: AIC=18219.02
## Arr_Delay ~ Number_of_flights + Baggage_loading_time + Late_Arrival_o +
## Airport_Distance + Support_Crew_Available
##
## Df Sum of Sq RSS AIC
## <none> 570404 18219
## + Cleaning_o 1 150 570254 18220
## + Fueling_o 1 98 570306 18220
## + Security_o 1 20 570384 18221
## - Support_Crew_Available 1 13580 583984 18302
## - Airport_Distance 1 26448 596853 18380
## - Late_Arrival_o 1 70633 641037 18637
## - Baggage_loading_time 1 159829 730234 19105
## - Number_of_flights 1 267074 837478 19597
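
A trace like the one above can be produced with `step()`. The sketch below runs the two searches on simulated data rather than the flight data. The variable names `x1`, `x2`, `x3` and the object names are hypothetical. It also checks the criterion that `step()` minimizes for a linear model, \(n \log(\mathrm{RSS}/n) + 2p\), where \(p\) is the number of estimated coefficients.

```r
set.seed(321)   # arbitrary seed so the sketch is reproducible
n <- 300

# Simulated stand-in data: x1 and x2 drive y, x3 is pure noise
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 + 3 * dat$x1 - 2 * dat$x2 + rnorm(n)

null.model <- lm(y ~ 1, data = dat)   # "empty" model
full.model <- lm(y ~ ., data = dat)   # "full" model

# Search in both directions, once from each starting point
from.null <- step(null.model,
                  scope = list(lower = ~ 1, upper = formula(full.model)),
                  direction = "both", trace = 0)
from.full <- step(full.model, direction = "both", trace = 0)

formula(from.null)
formula(from.full)

# The AIC column printed by step() is n * log(RSS / n) + 2 * p
rss         <- sum(resid(from.full)^2)
p           <- length(coef(from.full))
aic.by.hand <- n * log(rss / n) + 2 * p
aic.step    <- extractAIC(from.full)[2]
c(by.hand = aic.by.hand, extractAIC = aic.step)
```

Setting `trace = 1` instead of `trace = 0` prints the step-by-step table in the same layout as the output shown above.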

Both searches selected the same model, which is fitted as follows:
\[
\hat{Y} = -569 + 0.004497 X_2 + 13.97 X_4 + 7.093 X_5 + 0.1766 X_1 - 0.04974 X_3
\]
Because residual analysis revealed violations of the regression
assumptions, bootstrap sampling will be used to obtain coefficient
estimates and confidence intervals that rely less on those assumptions.
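The residual analysis itself is not shown in this report. As a rough sketch of what such a check can look like, the code below fits a regression to simulated data with deliberately skewed errors and applies two simple numeric diagnostics. The data, the object names, and the choice of diagnostics are assumptions for illustration, not the checks that were run on the flight model.

```r
set.seed(321)   # arbitrary seed so the sketch is reproducible
n <- 300
x <- runif(n, 0, 10)

# Simulated response with right-skewed (non-normal) errors; NOT the flight data
y <- 1 + 2 * x + (rexp(n, rate = 1) - 1)
fit <- lm(y ~ x)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(resid(fit))
sw$p.value

# Crude check for non-constant variance: correlation between the absolute
# residuals and the fitted values (close to 0 when the spread is constant)
cor(abs(resid(fit)), fitted(fit))
```

The graphical counterpart is `plot(fit, which = 1:2)`, which draws the residuals-versus-fitted plot and the normal Q-Q plot.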
Bootstrap Cases
We will begin by using bootstrap cases to estimate the coefficients.
The distributions and confidence intervals will be found and
displayed.
library(knitr) # provides kable() for the coefficient tables below

## final model selected by the stepwise search
delay <- lm(Arr_Delay ~ Number_of_flights + Baggage_loading_time + Late_Arrival_o + Airport_Distance + Support_Crew_Available, data = flight)
cmtrx <- summary(delay)$coef # coefficient matrix of the final model
##
B = 1000 # choose the number of bootstrap replicates.
##
num.p = length(coef(delay)) # number of parameters (intercept + 5 slopes)
smpl.n = nrow(model.frame(delay)) # sample size
## zero matrix to store bootstrap coefficients
coef.mtrx = matrix(rep(0, B*num.p), ncol = num.p)
##
for (i in 1:B){
  bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE) # resample row indices with replacement
  ## fit final model to the bootstrap sample
  delay.btc = lm(Arr_Delay ~ Number_of_flights + Baggage_loading_time + Late_Arrival_o + Airport_Distance + Support_Crew_Available, data = flight[bootc.id,])
  coef.mtrx[i,] = coef(delay.btc) # extract coefs from bootstrap regression model
}
boot.hist = function(bt.coef.mtrx, var.id, var.nm){
  ## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name on the hist title, must be the string in the double quotes
  ## Bootstrap sampling distribution of the estimated coefficients
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  ## normal density with the mean and sd of the bootstrap estimates
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  # height of the histogram - use it to make a nice-looking histogram.
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density)
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="",
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3") # normal density curve
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") # kernel density estimate
}
The following histograms represent the distribution of each of the
bootstrapped coefficients.
par(mfrow=c(2,3)) # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="Number of Flights" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="Baggage Loading Time" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=4, var.nm ="Late Arrival" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=5, var.nm ="Airport Distance" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=6, var.nm ="Support Crew Available" )

Two density curves were placed on each of the histograms.
The red curve is a normal density whose mean and standard deviation
are those of the bootstrap estimates. It plays the role of the
normal-theory sampling distribution on which the p-values reported in
the regression output are based.
The blue curve is a non-parametric, data-driven (kernel) estimate of
the density of the bootstrap sampling distribution. The bootstrap
confidence intervals of the regression coefficients are based on these
non-parametric bootstrap sampling distributions.
We can see from the above histograms that the two density curves in
all histograms are close to each other. We would therefore expect the
significance test results and the corresponding bootstrap confidence
intervals to be consistent. Next, we find the 95% bootstrap confidence
interval of each regression coefficient and combine it with the
output of the final model.
num.p = dim(coef.mtrx)[2] # number of parameters
btc.ci = NULL # percentile intervals, formatted as text
btc.wd = NULL # interval widths
for (i in 1:num.p){
  lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2),8)
  uci.975 = round(quantile(coef.mtrx[, i],0.975, type = 2 ),8)
  btc.wd[i] = uci.975 - lci.025
  btc.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btc.ci.95=btc.ci)),
      caption = "Regression Coefficient Matrix")
Regression Coefficient Matrix
| | Estimate | Std. Error | t value | p-value | 95% CI (case bootstrap) |
|---|---|---|---|---|---|
| (Intercept) | -569.0375 | 7.3229 | -77.7061 | 0.0000 | [ -585.7031 , -551.5961 ] |
| Number_of_flights | 0.0045 | 0.0001 | 40.9817 | 0.0000 | [ 0.0043 , 0.0047 ] |
| Baggage_loading_time | 13.9737 | 0.4408 | 31.7032 | 0.0000 | [ 13.1707 , 14.8182 ] |
| Late_Arrival_o | 7.0929 | 0.3365 | 21.0755 | 0.0000 | [ 6.4137 , 7.7148 ] |
| Airport_Distance | 0.1766 | 0.0137 | 12.8966 | 0.0000 | [ 0.1495 , 0.2027 ] |
| Support_Crew_Available | -0.0497 | 0.0054 | -9.2410 | 0.0000 | [ -0.0598 , -0.0397 ] |
All the coefficients are statistically significant at the 0.05 level:
every p-value is below 0.0001, and none of the 95% bootstrap confidence
intervals contains zero.
Bootstrap Residuals
Assume that the fitted regression model is given by
\[
\begin{array}{ccc}
y_1 & = & \hat{\beta}_0 + \hat{\beta}_1 x_{11} + \hat{\beta}_2
x_{12} + \cdots + \hat{\beta}_k x_{1k} + e_1 \\
y_2 & = & \hat{\beta}_0 + \hat{\beta}_1 x_{21} + \hat{\beta}_2
x_{22} + \cdots + \hat{\beta}_k x_{2k} + e_2 \\
y_3 & = & \hat{\beta}_0 + \hat{\beta}_1 x_{31} + \hat{\beta}_2
x_{32} + \cdots + \hat{\beta}_k x_{3k} + e_3 \\
\vdots & \vdots & \vdots \\
y_n & = & \hat{\beta}_0 + \hat{\beta}_1 x_{n1} + \hat{\beta}_2
x_{n2} + \cdots + \hat{\beta}_k x_{nk} + e_n
\end{array}
\]
where \(\{e_1, e_2, \cdots, e_n \}\)
is the set of residuals obtained from the final model. \(\{x_{i1}, x_{i2}, \cdots, x_{ik} \}\) is
the i-th record from the data, and \(\{
\hat{\beta}_0, \hat{\beta}_1, \cdots, \hat{\beta}_k \}\) are the
estimated regression coefficients based on the original random
sample.
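The resampling step used in this section can be written out explicitly. This is the usual residual-bootstrap construction, stated as a sketch in the notation above. The starred symbols and the replicate index \(b\) are introduced here only for illustration. For each replicate \(b = 1, 2, \cdots, B\), residuals \(\{e_1^{*}, e_2^{*}, \cdots, e_n^{*}\}\) are drawn with replacement from \(\{e_1, e_2, \cdots, e_n\}\) and added to the fitted values to form a bootstrap response:
\[
y_i^{*} = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_k x_{ik} + e_i^{*},
\qquad i = 1, 2, \cdots, n.
\]
Regressing \(y^{*}\) on the original explanatory variables gives one set of bootstrap coefficients \(\{\hat{\beta}_0^{*}, \hat{\beta}_1^{*}, \cdots, \hat{\beta}_k^{*}\}\). Unlike the case bootstrap, the explanatory variables stay fixed at their observed values and only the response changes between replicates.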
The distribution of the residuals is depicted in the following
histogram.
hist(delay$residuals, breaks = 40, # about 40 bins; sorting is not needed for a histogram
     xlab="Residuals",
     col = "lightblue",
     border="navy",
     main = "Histogram of Residuals")

delay.resid <- delay$residuals # residuals of the final model
B = 1000 # number of bootstrap replicates
num.p = dim(model.matrix(delay))[2] # number of parameters
samp.n = dim(model.matrix(delay))[1] # sample size
btr.mtrx = matrix(rep(0, B*num.p), ncol=num.p) # zero matrix to store boot coefs
for (i in 1:B){
  ## Bootstrap response values: fitted values plus resampled residuals
  bt.delay = delay$fitted.values +
    sample(delay.resid, samp.n, replace = TRUE) # bootstrap residuals
  flight$bt.delay = bt.delay # send the boot response to the data
  ## refit the final model with the bootstrap response in place of Arr_Delay
  btr.model = lm(bt.delay ~ Number_of_flights + Baggage_loading_time + Late_Arrival_o + Airport_Distance + Support_Crew_Available, data = flight)
  btr.mtrx[i,] = btr.model$coefficients
}
boot.hist = function(bt.coef.mtrx, var.id, var.nm){
  ## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name on the hist title, must be the string in the double quotes
  ## Bootstrap sampling distribution of the estimated coefficients
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  # height of the histogram - use it to make a nice-looking histogram.
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density)
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="",
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3") # normal density curve
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") # kernel density estimate
}
par(mfrow=c(2,3)) # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="Number of Flights" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="Baggage Loading Time" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=4, var.nm ="Late Arrival" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=5, var.nm ="Airport Distance" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=6, var.nm ="Support Crew Available" )

num.p = dim(btr.mtrx)[2] # number of parameters
btr.ci = NULL # percentile intervals, formatted as text
btr.wd = NULL # interval widths
for (i in 1:num.p){
  lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2),8)
  uci.975 = round(quantile(btr.mtrx[, i],0.975, type = 2 ),8)
  btr.wd[i] = uci.975 - lci.025
  btr.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btr.ci.95=btr.ci)),
      caption = "Regression Coefficient Matrix with 95% Residual Bootstrap CI")
Regression Coefficient Matrix with 95% Residual Bootstrap CI
| | Estimate | Std. Error | t value | p-value | 95% CI (residual bootstrap) |
|---|---|---|---|---|---|
| (Intercept) | -569.0375 | 7.3229 | -77.7061 | 0.0000 | [ -583.3537 , -554.4205 ] |
| Number_of_flights | 0.0045 | 0.0001 | 40.9817 | 0.0000 | [ 0.0043 , 0.0047 ] |
| Baggage_loading_time | 13.9737 | 0.4408 | 31.7032 | 0.0000 | [ 13.0799 , 14.8642 ] |
| Late_Arrival_o | 7.0929 | 0.3365 | 21.0755 | 0.0000 | [ 6.4633 , 7.7799 ] |
| Airport_Distance | 0.1766 | 0.0137 | 12.8966 | 0.0000 | [ 0.15 , 0.2015 ] |
| Support_Crew_Available | -0.0497 | 0.0054 | -9.2410 | 0.0000 | [ -0.0598 , -0.0393 ] |
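The code above stores the interval widths (`btc.wd`, `btr.wd`) but never displays them. As a small sketch, the widths can be recomputed directly from the interval endpoints reported in the two tables. The endpoints below are copied from those tables; the object names are hypothetical.

```r
# Coefficient names in the order used by the tables
term.names <- c("(Intercept)", "Number_of_flights", "Baggage_loading_time",
                "Late_Arrival_o", "Airport_Distance", "Support_Crew_Available")

# 95% interval endpoints (lower, upper) copied from the case-bootstrap table
case.ci <- rbind(c(-585.7031, -551.5961), c(0.0043, 0.0047),
                 c(13.1707, 14.8182),     c(6.4137, 7.7148),
                 c(0.1495, 0.2027),       c(-0.0598, -0.0397))

# 95% interval endpoints (lower, upper) copied from the residual-bootstrap table
resid.ci <- rbind(c(-583.3537, -554.4205), c(0.0043, 0.0047),
                  c(13.0799, 14.8642),     c(6.4633, 7.7799),
                  c(0.15, 0.2015),         c(-0.0598, -0.0393))

# Width of each interval = upper endpoint minus lower endpoint
ci.width <- data.frame(term        = term.names,
                       case.width  = case.ci[, 2]  - case.ci[, 1],
                       resid.width = resid.ci[, 2] - resid.ci[, 1])
ci.width
```

Neither method is uniformly wider: the intercept interval is wider under case resampling (about 34.11 versus 28.93), while the baggage loading time interval is wider under residual resampling (about 1.78 versus 1.65).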
Analysis
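The results of the standard and the bootstrapped models were very
similar: both bootstrap methods produced confidence intervals that
bracket the least-squares estimates and exclude zero, in agreement with
the p-values. The standard model was computationally the easiest to
fit, but its standard errors and p-values depend on the assumptions
that the residual analysis called into question. The bootstrap
intervals rely less on those assumptions, so the bootstrap results
should be preferred when building predictions and reporting their
uncertainty.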
The variables excluded from the model (cleaning, fueling, and security
clearance times) are routine airport activities. This suggests that,
once the other predictors are accounted for, the time to complete these
activities contributes little to arrival delays. Some of the factors
that were important to the model relate to general airport traffic and
busyness (total flights, baggage loading time, support crew available).
The positive coefficients for number of flights and baggage loading
time, together with the negative coefficient for support crew
available, indicate that having more staff available to handle the
number of flights may reduce the number and magnitude of delays. Some
factors are almost impossible for an airport to control (airport
distance, late arrival) and will likely always account for some amount
of delay.
