1 Introduction

For this report, I will be using a data set consisting of delayed flight times of various airlines. In this data set, there are 11 variables and 3593 observations. The variables are:

Carrier (Categorical) - The airline that the flight is taken with
Airport_Distance (Numeric) - The Distance between the airports in miles
Number_of_flights (Numeric) - Total no of flights in airport
Weather (Numeric) - Delay due to weather condition ranked 0-10, with 0 being mild and 10 being extreme
Support_Crew_Available (Numeric) - Total number of support crew available
Baggage_loading_time (Numeric) - Time in minutes for loading the baggage
Late_Arrival_o (Numeric) - Time in minutes for late arriving aircraft of the same flight
Cleaning_o (Numeric) - Time in minutes for aircraft cleaning
Fueling_o (Numeric) - Time in minutes for aircraft fueling
Security_o (Numeric) - Time in minutes for security checking
Arr_Delay (Numeric) - Flight Arr_Delay in minutes. It is dependent variable in the model

The goal for this analysis is to determine whether a parametric model (least squares) or nonparametric (bootstrapping) is better to build a regression model.

flight1<- read.csv("flight_delay-data.csv")
flight2 <- flight1[-which(flight1$Arr_Delay == 0),]

2 Recap of Building the Final Model

To start, the full model was fit with Arr_Delay as the response variable and all others as predictors. T tests determined that Carrier, Cleaning_o, Fueling_o, and Security_o, A new, reduced model was created with the remaining variables.

Residual Analysis was done with the reduced model, and the assumptions of normality and constant variances were violated. As a result, a box-cox transformation with \[ \lambda = 0.5 \] was applied to the response variable, and that corrected the issues with normality and constant variances.

Goodness of fit measures of the full model, reduced model, and the transformed model show that the transformed model is the best. So the model with the transformation was chosen as the final model. This chosen model will be used for analysis for this report.

lambdaTrans1 <- lm((Arr_Delay)^0.5 ~ Airport_Distance + Number_of_flights + 
Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, 
data = flight2) 
cmtrx <- summary(lambdaTrans1)$coef
kable(cmtrx, caption = "Inferential Statistics of Final Model")

Inferential Statistics of Final Model
	Estimate	Std. Error	t value
(Intercept)	-34.0259067	0.4393635	-77.443629
Airport_Distance	0.0109793	0.0008131	13.503838
Number_of_flights	0.0002962	0.0000065	45.303973
Weather	0.2377667	0.0273096	8.706342
Support_Crew_Available	-0.0035656	0.0003196	-11.157298
Baggage_loading_time	0.8818184	0.0263490	33.466824
Late_Arrival_o	0.4553751	0.0200190	22.747170

3 Bootstrapping Part 1

3.1 Bootstrapping Observations

For this report, I will be bootstrapping both the observations and the residuals. For this part of the report, I will be bootstrapping the observations in the model.

I will bootstrap 1000 times. Every coefficient estimate for each variable will be plotted and a sampling distribution for each variable will be made.

B = 1000       # choose the number of bootstrap replicates.
## 
num.p = dim(model.frame(lambdaTrans1))[2]  # returns number of parameters in the model
smpl.n = dim(model.frame(lambdaTrans1))[1] # sample size
## zero matrix to store bootstrap coefficients 
coef.mtrx = matrix(rep(0, B*num.p), ncol = num.p)       
## 
for (i in 1:B) {
 bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE) # fit final model to the bootstrap sample
 lambdaTrans1.btc =  lm((Arr_Delay)^0.5 ~ Airport_Distance + Number_of_flights + 
 Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, 
 data = flight2[bootc.id, ])  
                      
 coef.mtrx[i,] = coef(lambdaTrans1.btc)    # extract coefs from bootstrap regression model    
}

boot.hist = function(cmtrx, bt.coef.mtrx, var.id, var.nm) {
  ## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name on the hist title, must be the string in the double quotes
  ## coefficient matrix of the final model
  ## Bootstrap sampling distribution of the estimated coefficients
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  # height of the histogram - use it to make a nice-looking histogram.
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") 
  #legend("topright", c(""))
}

par(mfrow = c(2, 3))  # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 2, var.nm ="Airport_Distance" )
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 3, var.nm ="Number_of_Flights" )
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 4, var.nm ="Weather")

par(mfrow = c(2, 3))
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 5, var.nm ="Support_Crew_Available" )
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 6, var.nm ="Baggage_loading_time" )
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 7, var.nm ="Late_Arrival_o" )

The red curves represent the actual distribution of the coefficient estimates based on the estimate and error generated in the final model summary output. The blue curves represent the density of the sampling distribution. As we can see, they look very similar. The curves overlap a lot. This suggests that the bootstrap did a very good job of estimating the coefficients and there is not a significant difference between the least squares method vs bootstrapping the observations.

3.2 Confidence Interval

Next, we can get the bootstrapped estimates as well as the 95% confidence intervals for those estimates.

num.p = dim(coef.mtrx)[2]  # number of parameters
btc.ci = NULL
btc.wd = NULL
for (i in 1:num.p) {
lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2), 8)
uci.975 = round(quantile(coef.mtrx[, i], 0.975, type = 2 ), 8)
btc.wd[i] =  uci.975 - lci.025
btc.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}

boot.hist = function(bt.coef.mtrx, var.id, var.nm){
  ## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name on the hist title, must be the string in the double quotes
  ## Bootstrap sampling distribution of the estimated coefficients
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  # height of the histogram - use it to make a nice-looking histogram.
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")       # normal density curve         
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")    # loess curve
} 
#as.data.frame(btc.ci)
kable(as.data.frame(cbind(formatC(cmtrx, 4, format = "f"), btc.ci.95 = btc.ci)), caption = "Regression Coefficient Matrix (Bootstrapped Cases)")

Regression Coefficient Matrix (Bootstrapped Cases)
	Estimate	Std. Error	t value	btc.ci.95
(Intercept)	-34.0259	0.4394	-77.4436	[ -34.848 , -33.1699 ]
Airport_Distance	0.0110	0.0008	13.5038	[ 0.0093 , 0.0127 ]
Number_of_flights	0.0003	0.0000	45.3040	[ 3e-04 , 3e-04 ]
Weather	0.2378	0.0273	8.7063	[ 0.1833 , 0.2944 ]
Support_Crew_Available	-0.0036	0.0003	-11.1573	[ -0.0042 , -0.003 ]
Baggage_loading_time	0.8818	0.0263	33.4668	[ 0.832 , 0.935 ]
Late_Arrival_o	0.4554	0.0200	22.7472	[ 0.4147 , 0.4906 ]

Comparing these estimates to the estimates in the final model, we can see that they are very similar and consistent. So I can conclude that the bootstrap did a good job of estimating the model.

4 Bootstrapping Part 2

4.1 Bootstrapping the Residuals

For this part of the report, i will be bootstrapping the residuals of the model. Just as in i did in part 1, i will bootstrap 1000 times. Sampling distributions and their density curves will be plotted. Before that happens, we can start to illustrate what might happen with a histogram of the residuals.

hist(sort(lambdaTrans1$residuals),n=40, xlab="Residuals", col = "lightblue",
border="navy", main = "Histogram of Final Model Residuals")

model.resid = lambdaTrans1$residuals

B=1000
num.p = dim(model.matrix(lambdaTrans1))[2]   # number of parameters
samp.n = dim(model.matrix(lambdaTrans1))[1]  # sample size
btr.mtrx = matrix(rep(0, B*num.p), ncol=num.p) # zero matrix to store boot coefs

for (i in 1:B) {
## Bootstrap response values
bt.lambdaTrans1 = (lambdaTrans1$fitted.values + sample(model.resid, samp.n, replace = TRUE))  # bootstrap residuals
# replace PriceUnitArea with bootstrap log price

flight2$bt.lambdaTrans1 =  bt.lambdaTrans1  #  send the boot response to the data
  lambdaTrans.btr <- lm(bt.lambdaTrans1 ~ Airport_Distance + Number_of_flights + 
  Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, 
  data = flight2) 
  btr.mtrx[i,]=lambdaTrans.btr$coefficients
}

Now, to bootstrap and plot the sampling distributions:

par(mfrow=c(2,3))  # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 2, var.nm ="Airport_Distance" )
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 3, var.nm ="Number_Of_Flights" )

par(mfrow = c(2,3))
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 4, var.nm ="Weather" )
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 5, var.nm ="Support_Crew_Available" )
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 6, var.nm ="Baggage_loading_time" )
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 7, var.nm ="Late.arrival_o" )

Again, The curves look similar. There is a lot of overlap as well. And just like in part 1, we can summarize the estimates and generate a 95% confidence interval.

num.p = dim(coef.mtrx)[2]  # number of parameters
btr.ci = NULL
btr.wd = NULL
for (i in 1:num.p) {
    lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2),8)
    uci.975 = round(quantile(btr.mtrx[, i],0.975, type = 2 ),8)
    btr.wd[i] = uci.975 - lci.025
    btr.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
  #as.data.frame(btc.ci)

  kable(as.data.frame(cbind(formatC(cmtrx, 4, format= "f"), btr.ci.95=btr.ci)), 
        caption = "Regression Coefficient Matrix (Bootstrapped Residuals)")

Regression Coefficient Matrix (Bootstrapped Residuals)
	Estimate	Std. Error	t value	btr.ci.95
(Intercept)	-34.0259	0.4394	-77.4436	[ -34.9259 , -33.1231 ]
Airport_Distance	0.0110	0.0008	13.5038	[ 0.0094 , 0.0125 ]
Number_of_flights	0.0003	0.0000	45.3040	[ 3e-04 , 3e-04 ]
Weather	0.2378	0.0273	8.7063	[ 0.1853 , 0.2889 ]
Support_Crew_Available	-0.0036	0.0003	-11.1573	[ -0.0041 , -0.003 ]
Baggage_loading_time	0.8818	0.0263	33.4668	[ 0.8278 , 0.9331 ]
Late_Arrival_o	0.4554	0.0200	22.7472	[ 0.4173 , 0.4956 ]

Comparing these estimates to the estimates of the final model, we can see that they’re also very similar and consistent. So i can conclude that this bootstrap also did a good job of estimating the model.

5 Putting everything together

Now, let’s compare all 3 methods and see how they did:

kable(as.data.frame(cbind(formatC(cmtrx[,-3],4,format="f"), 
btc.ci.95=btc.ci,btr.ci.95=btr.ci)), 
caption="Final Combined Inferential Statistics: p-values and Bootstrap CIs")

Final Combined Inferential Statistics: p-values and Bootstrap CIs
	Estimate	Std. Error	btc.ci.95	btr.ci.95
(Intercept)	-34.0259	0.4394	[ -34.848 , -33.1699 ]	[ -34.9259 , -33.1231 ]
Airport_Distance	0.0110	0.0008	[ 0.0093 , 0.0127 ]	[ 0.0094 , 0.0125 ]
Number_of_flights	0.0003	0.0000	[ 3e-04 , 3e-04 ]	[ 3e-04 , 3e-04 ]
Weather	0.2378	0.0273	[ 0.1833 , 0.2944 ]	[ 0.1853 , 0.2889 ]
Support_Crew_Available	-0.0036	0.0003	[ -0.0042 , -0.003 ]	[ -0.0041 , -0.003 ]
Baggage_loading_time	0.8818	0.0263	[ 0.832 , 0.935 ]	[ 0.8278 , 0.9331 ]
Late_Arrival_o	0.4554	0.0200	[ 0.4147 , 0.4906 ]	[ 0.4173 , 0.4956 ]

kable(round(cbind(btc.wd, btr.wd),4), 
caption = "width of the two bootstrap confidence intervals")

width of the two bootstrap confidence intervals
btc.wd	btr.wd
1.6781	1.8028
0.0034	0.0031
0.0000	0.0000
0.1111	0.1036
0.0013	0.0012
0.1030	0.1053
0.0760	0.0783

kable(cmtrx, caption = "Inferential Statistics of Final Model")

Inferential Statistics of Final Model
	Estimate	Std. Error	t value
(Intercept)	-34.0259067	0.4393635	-77.443629
Airport_Distance	0.0109793	0.0008131	13.503838
Number_of_flights	0.0002962	0.0000065	45.303973
Weather	0.2377667	0.0273096	8.706342
Support_Crew_Available	-0.0035656	0.0003196	-11.157298
Baggage_loading_time	0.8818184	0.0263490	33.466824
Late_Arrival_o	0.4553751	0.0200190	22.747170

All 3 methods give the same estimates (within rounding) and variable significance, and they give us similar confidence interval widths. So both the parametric and nonparametric methods worked well to produce a good model. So there really is no difference between the two methods. As a result, we can choose the parametric model to use for any future analysis.

note: this code chunk is just extra and is only here to be used in the project presentation

select = function(m) { # m is an object: model
e = m$resid                           # residuals
n0 = length(e)                        # sample size
SSE=(m$df)*(summary(m)$sigma)^2       # sum of squared error
R.sq=summary(m)$r.squared             # Coefficient of determination: R square!
R.adj=summary(m)$adj.r                # Adjusted R square
MSE=(summary(m)$sigma)^2              # square error
Cp=(SSE/MSE)-(n0-2*(n0-m$df))         # Mellow's p
AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)          # Akaike information criterion
SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)  # Schwarz Bayesian Information criterion
X=model.matrix(m)                     # design matrix of the model
H=X%*%solve(t(X)%*%X)%*%t(X)          # hat matrix
d=e/(1-diag(H))                       
PRESS=t(d)%*%d   # predicted residual error sum of squares (PRESS)- a cross-validation measure
 tbl = as.data.frame(cbind(SSE = SSE, R.sq = R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
 names(tbl)=c("SSE", "R.sq", "R.adj",   " Cp",   "AIC", "SBC",  "PRESS")
 tbl
}

output.sum = rbind(select(lambdaTrans1), select(lambdaTrans1.btc), select(lambdaTrans.btr))
row.names(output.sum) = c("lambdaTrans1", "lambdaTrans1.btc", "lambdaTrans1.btr")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")

Goodness-of-fit Measures of Candidate Models
	SSE	R.sq	R.adj	Cp	AIC	SBC	PRESS
lambdaTrans1	2005.760	0.8419195	0.8416548	7	-2075.844	-2032.543	2013.419
lambdaTrans1.btc	1908.979	0.8439453	0.8436839	7	-2253.385	-2210.084	1916.206
lambdaTrans1.btr	1999.669	0.8430520	0.8427892	7	-2086.763	-2043.461	2007.515

6 Summary and conclusions

This study focused on parametric vs nonparametric ways to build a model. Using a model built with least squares in an earlier analysis, we were able to use the nonparametric method of bootstrapping to build two models and compare them. All 3 models ended up being very similar, and thus I decided to choose the parametric method.

MIni Project 1: Bootstrapping Multiple Regression Models

Ian VanWright

10/07/2023