1 Introduction

For this report, I will be using a data set consisting of delayed flight times of various airlines. In this data set, there are 11 variables and 3593 observations. The variables are:

  1. Carrier (Categorical) - The airline that the flight is taken with
  2. Airport_Distance (Numeric) - The Distance between the airports in miles
  3. Number_of_flights (Numeric) - Total no of flights in airport
  4. Weather (Numeric) - Delay due to weather condition ranked 0-10, with 0 being mild and 10 being extreme
  5. Support_Crew_Available (Numeric) - Total number of support crew available
  6. Baggage_loading_time (Numeric) - Time in minutes for loading the baggage
  7. Late_Arrival_o (Numeric) - Time in minutes for late arriving aircraft of the same flight
  8. Cleaning_o (Numeric) - Time in minutes for aircraft cleaning
  9. Fueling_o (Numeric) - Time in minutes for aircraft fueling
  10. Security_o (Numeric) - Time in minutes for security checking
  11. Arr_Delay (Numeric) - Flight Arr_Delay in minutes. It is dependent variable in the model

The goal for this analysis is to determine whether a parametric model (least squares) or nonparametric (bootstrapping) is better to build a regression model.

flight1<- read.csv("flight_delay-data.csv")
flight2 <- flight1[-which(flight1$Arr_Delay == 0),]

2 Recap of Building the Final Model

To start, the full model was fit with Arr_Delay as the response variable and all others as predictors. T tests determined that Carrier, Cleaning_o, Fueling_o, and Security_o, A new, reduced model was created with the remaining variables.

Residual Analysis was done with the reduced model, and the assumptions of normality and constant variances were violated. As a result, a box-cox transformation with \[ \lambda = 0.5 \] was applied to the response variable, and that corrected the issues with normality and constant variances.

Goodness of fit measures of the full model, reduced model, and the transformed model show that the transformed model is the best. So the model with the transformation was chosen as the final model. This chosen model will be used for analysis for this report.

lambdaTrans1 <- lm((Arr_Delay)^0.5 ~ Airport_Distance + Number_of_flights + 
Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, 
data = flight2) 
cmtrx <- summary(lambdaTrans1)$coef
kable(cmtrx, caption = "Inferential Statistics of Final Model")
Inferential Statistics of Final Model
Estimate Std. Error t value Pr(>|t|)
(Intercept) -34.0259067 0.4393635 -77.443629 0
Airport_Distance 0.0109793 0.0008131 13.503838 0
Number_of_flights 0.0002962 0.0000065 45.303973 0
Weather 0.2377667 0.0273096 8.706342 0
Support_Crew_Available -0.0035656 0.0003196 -11.157298 0
Baggage_loading_time 0.8818184 0.0263490 33.466824 0
Late_Arrival_o 0.4553751 0.0200190 22.747170 0

3 Bootstrapping Part 1

3.1 Bootstrapping Observations

For this report, I will be bootstrapping both the observations and the residuals. For this part of the report, I will be bootstrapping the observations in the model.

I will bootstrap 1000 times. Every coefficient estimate for each variable will be plotted and a sampling distribution for each variable will be made.

B = 1000       # choose the number of bootstrap replicates.
## 
num.p = dim(model.frame(lambdaTrans1))[2]  # returns number of parameters in the model
smpl.n = dim(model.frame(lambdaTrans1))[1] # sample size
## zero matrix to store bootstrap coefficients 
coef.mtrx = matrix(rep(0, B*num.p), ncol = num.p)       
## 
for (i in 1:B) {
 bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE) # fit final model to the bootstrap sample
 lambdaTrans1.btc =  lm((Arr_Delay)^0.5 ~ Airport_Distance + Number_of_flights + 
 Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, 
 data = flight2[bootc.id, ])  
                      
 coef.mtrx[i,] = coef(lambdaTrans1.btc)    # extract coefs from bootstrap regression model    
}

boot.hist = function(cmtrx, bt.coef.mtrx, var.id, var.nm) {
  ## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name on the hist title, must be the string in the double quotes
  ## coefficient matrix of the final model
  ## Bootstrap sampling distribution of the estimated coefficients
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  # height of the histogram - use it to make a nice-looking histogram.
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") 
  #legend("topright", c(""))
}

par(mfrow = c(2, 3))  # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 2, var.nm ="Airport_Distance" )
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 3, var.nm ="Number_of_Flights" )
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 4, var.nm ="Weather")

par(mfrow = c(2, 3))
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 5, var.nm ="Support_Crew_Available" )
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 6, var.nm ="Baggage_loading_time" )
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 7, var.nm ="Late_Arrival_o" )

The red curves represent the actual distribution of the coefficient estimates based on the estimate and error generated in the final model summary output. The blue curves represent the density of the sampling distribution. As we can see, they look very similar. The curves overlap a lot. This suggests that the bootstrap did a very good job of estimating the coefficients and there is not a significant difference between the least squares method vs bootstrapping the observations.

3.2 Confidence Interval

Next, we can get the bootstrapped estimates as well as the 95% confidence intervals for those estimates.

num.p = dim(coef.mtrx)[2]  # number of parameters
btc.ci = NULL
btc.wd = NULL
for (i in 1:num.p) {
lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2), 8)
uci.975 = round(quantile(coef.mtrx[, i], 0.975, type = 2 ), 8)
btc.wd[i] =  uci.975 - lci.025
btc.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}

boot.hist = function(bt.coef.mtrx, var.id, var.nm){
  ## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name on the hist title, must be the string in the double quotes
  ## Bootstrap sampling distribution of the estimated coefficients
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  # height of the histogram - use it to make a nice-looking histogram.
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")       # normal density curve         
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")    # loess curve
} 
#as.data.frame(btc.ci)
kable(as.data.frame(cbind(formatC(cmtrx, 4, format = "f"), btc.ci.95 = btc.ci)), caption = "Regression Coefficient Matrix (Bootstrapped Cases)")
Regression Coefficient Matrix (Bootstrapped Cases)
Estimate Std. Error t value Pr(>|t|) btc.ci.95
(Intercept) -34.0259 0.4394 -77.4436 0.0000 [ -34.848 , -33.1699 ]
Airport_Distance 0.0110 0.0008 13.5038 0.0000 [ 0.0093 , 0.0127 ]
Number_of_flights 0.0003 0.0000 45.3040 0.0000 [ 3e-04 , 3e-04 ]
Weather 0.2378 0.0273 8.7063 0.0000 [ 0.1833 , 0.2944 ]
Support_Crew_Available -0.0036 0.0003 -11.1573 0.0000 [ -0.0042 , -0.003 ]
Baggage_loading_time 0.8818 0.0263 33.4668 0.0000 [ 0.832 , 0.935 ]
Late_Arrival_o 0.4554 0.0200 22.7472 0.0000 [ 0.4147 , 0.4906 ]

Comparing these estimates to the estimates in the final model, we can see that they are very similar and consistent. So I can conclude that the bootstrap did a good job of estimating the model.

4 Bootstrapping Part 2

4.1 Bootstrapping the Residuals

For this part of the report, i will be bootstrapping the residuals of the model. Just as in i did in part 1, i will bootstrap 1000 times. Sampling distributions and their density curves will be plotted. Before that happens, we can start to illustrate what might happen with a histogram of the residuals.

hist(sort(lambdaTrans1$residuals),n=40, xlab="Residuals", col = "lightblue",
border="navy", main = "Histogram of Final Model Residuals")

model.resid = lambdaTrans1$residuals

B=1000
num.p = dim(model.matrix(lambdaTrans1))[2]   # number of parameters
samp.n = dim(model.matrix(lambdaTrans1))[1]  # sample size
btr.mtrx = matrix(rep(0, B*num.p), ncol=num.p) # zero matrix to store boot coefs

for (i in 1:B) {
## Bootstrap response values
bt.lambdaTrans1 = (lambdaTrans1$fitted.values + sample(model.resid, samp.n, replace = TRUE))  # bootstrap residuals
# replace PriceUnitArea with bootstrap log price

flight2$bt.lambdaTrans1 =  bt.lambdaTrans1  #  send the boot response to the data
  lambdaTrans.btr <- lm(bt.lambdaTrans1 ~ Airport_Distance + Number_of_flights + 
  Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, 
  data = flight2) 
  btr.mtrx[i,]=lambdaTrans.btr$coefficients
}

Now, to bootstrap and plot the sampling distributions:

par(mfrow=c(2,3))  # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 2, var.nm ="Airport_Distance" )
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 3, var.nm ="Number_Of_Flights" )

par(mfrow = c(2,3))
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 4, var.nm ="Weather" )
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 5, var.nm ="Support_Crew_Available" )
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 6, var.nm ="Baggage_loading_time" )
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 7, var.nm ="Late.arrival_o" )

Again, The curves look similar. There is a lot of overlap as well. And just like in part 1, we can summarize the estimates and generate a 95% confidence interval.

num.p = dim(coef.mtrx)[2]  # number of parameters
btr.ci = NULL
btr.wd = NULL
for (i in 1:num.p) {
    lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2),8)
    uci.975 = round(quantile(btr.mtrx[, i],0.975, type = 2 ),8)
    btr.wd[i] = uci.975 - lci.025
    btr.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
  #as.data.frame(btc.ci)

  kable(as.data.frame(cbind(formatC(cmtrx, 4, format= "f"), btr.ci.95=btr.ci)), 
        caption = "Regression Coefficient Matrix (Bootstrapped Residuals)")
Regression Coefficient Matrix (Bootstrapped Residuals)
Estimate Std. Error t value Pr(>|t|) btr.ci.95
(Intercept) -34.0259 0.4394 -77.4436 0.0000 [ -34.9259 , -33.1231 ]
Airport_Distance 0.0110 0.0008 13.5038 0.0000 [ 0.0094 , 0.0125 ]
Number_of_flights 0.0003 0.0000 45.3040 0.0000 [ 3e-04 , 3e-04 ]
Weather 0.2378 0.0273 8.7063 0.0000 [ 0.1853 , 0.2889 ]
Support_Crew_Available -0.0036 0.0003 -11.1573 0.0000 [ -0.0041 , -0.003 ]
Baggage_loading_time 0.8818 0.0263 33.4668 0.0000 [ 0.8278 , 0.9331 ]
Late_Arrival_o 0.4554 0.0200 22.7472 0.0000 [ 0.4173 , 0.4956 ]

Comparing these estimates to the estimates of the final model, we can see that they’re also very similar and consistent. So i can conclude that this bootstrap also did a good job of estimating the model.

5 Putting everything together

Now, let’s compare all 3 methods and see how they did:

kable(as.data.frame(cbind(formatC(cmtrx[,-3],4,format="f"), 
btc.ci.95=btc.ci,btr.ci.95=btr.ci)), 
caption="Final Combined Inferential Statistics: p-values and Bootstrap CIs")
Final Combined Inferential Statistics: p-values and Bootstrap CIs
Estimate Std. Error Pr(>|t|) btc.ci.95 btr.ci.95
(Intercept) -34.0259 0.4394 0.0000 [ -34.848 , -33.1699 ] [ -34.9259 , -33.1231 ]
Airport_Distance 0.0110 0.0008 0.0000 [ 0.0093 , 0.0127 ] [ 0.0094 , 0.0125 ]
Number_of_flights 0.0003 0.0000 0.0000 [ 3e-04 , 3e-04 ] [ 3e-04 , 3e-04 ]
Weather 0.2378 0.0273 0.0000 [ 0.1833 , 0.2944 ] [ 0.1853 , 0.2889 ]
Support_Crew_Available -0.0036 0.0003 0.0000 [ -0.0042 , -0.003 ] [ -0.0041 , -0.003 ]
Baggage_loading_time 0.8818 0.0263 0.0000 [ 0.832 , 0.935 ] [ 0.8278 , 0.9331 ]
Late_Arrival_o 0.4554 0.0200 0.0000 [ 0.4147 , 0.4906 ] [ 0.4173 , 0.4956 ]
kable(round(cbind(btc.wd, btr.wd),4), 
caption = "width of the two bootstrap confidence intervals")
width of the two bootstrap confidence intervals
btc.wd btr.wd
1.6781 1.8028
0.0034 0.0031
0.0000 0.0000
0.1111 0.1036
0.0013 0.0012
0.1030 0.1053
0.0760 0.0783
kable(cmtrx, caption = "Inferential Statistics of Final Model")
Inferential Statistics of Final Model
Estimate Std. Error t value Pr(>|t|)
(Intercept) -34.0259067 0.4393635 -77.443629 0
Airport_Distance 0.0109793 0.0008131 13.503838 0
Number_of_flights 0.0002962 0.0000065 45.303973 0
Weather 0.2377667 0.0273096 8.706342 0
Support_Crew_Available -0.0035656 0.0003196 -11.157298 0
Baggage_loading_time 0.8818184 0.0263490 33.466824 0
Late_Arrival_o 0.4553751 0.0200190 22.747170 0

All 3 methods give the same estimates (within rounding) and variable significance, and they give us similar confidence interval widths. So both the parametric and nonparametric methods worked well to produce a good model. So there really is no difference between the two methods. As a result, we can choose the parametric model to use for any future analysis.

note: this code chunk is just extra and is only here to be used in the project presentation

select = function(m) { # m is an object: model
e = m$resid                           # residuals
n0 = length(e)                        # sample size
SSE=(m$df)*(summary(m)$sigma)^2       # sum of squared error
R.sq=summary(m)$r.squared             # Coefficient of determination: R square!
R.adj=summary(m)$adj.r                # Adjusted R square
MSE=(summary(m)$sigma)^2              # square error
Cp=(SSE/MSE)-(n0-2*(n0-m$df))         # Mellow's p
AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)          # Akaike information criterion
SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)  # Schwarz Bayesian Information criterion
X=model.matrix(m)                     # design matrix of the model
H=X%*%solve(t(X)%*%X)%*%t(X)          # hat matrix
d=e/(1-diag(H))                       
PRESS=t(d)%*%d   # predicted residual error sum of squares (PRESS)- a cross-validation measure
 tbl = as.data.frame(cbind(SSE = SSE, R.sq = R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
 names(tbl)=c("SSE", "R.sq", "R.adj",   " Cp",   "AIC", "SBC",  "PRESS")
 tbl
}

output.sum = rbind(select(lambdaTrans1), select(lambdaTrans1.btc), select(lambdaTrans.btr))
row.names(output.sum) = c("lambdaTrans1", "lambdaTrans1.btc", "lambdaTrans1.btr")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
Goodness-of-fit Measures of Candidate Models
SSE R.sq R.adj Cp AIC SBC PRESS
lambdaTrans1 2005.760 0.8419195 0.8416548 7 -2075.844 -2032.543 2013.419
lambdaTrans1.btc 1908.979 0.8439453 0.8436839 7 -2253.385 -2210.084 1916.206
lambdaTrans1.btr 1999.669 0.8430520 0.8427892 7 -2086.763 -2043.461 2007.515

6 Summary and conclusions

This study focused on parametric vs nonparametric ways to build a model. Using a model built with least squares in an earlier analysis, we were able to use the nonparametric method of bootstrapping to build two models and compare them. All 3 models ended up being very similar, and thus I decided to choose the parametric method.