For this report, I will be using a data set consisting of delayed flight times of various airlines. In this data set, there are 11 variables and 3593 observations. The variables are:
The goal for this analysis is to determine whether a parametric model (least squares) or nonparametric (bootstrapping) is better to build a regression model.
flight1<- read.csv("flight_delay-data.csv")
flight2 <- flight1[-which(flight1$Arr_Delay == 0),]
To start, the full model was fit with Arr_Delay as the response variable and all others as predictors. T tests determined that Carrier, Cleaning_o, Fueling_o, and Security_o, A new, reduced model was created with the remaining variables.
Residual Analysis was done with the reduced model, and the assumptions of normality and constant variances were violated. As a result, a box-cox transformation with \[ \lambda = 0.5 \] was applied to the response variable, and that corrected the issues with normality and constant variances.
Goodness of fit measures of the full model, reduced model, and the transformed model show that the transformed model is the best. So the model with the transformation was chosen as the final model. This chosen model will be used for analysis for this report.
lambdaTrans1 <- lm((Arr_Delay)^0.5 ~ Airport_Distance + Number_of_flights +
Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o,
data = flight2)
cmtrx <- summary(lambdaTrans1)$coef
kable(cmtrx, caption = "Inferential Statistics of Final Model")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -34.0259067 | 0.4393635 | -77.443629 | 0 |
| Airport_Distance | 0.0109793 | 0.0008131 | 13.503838 | 0 |
| Number_of_flights | 0.0002962 | 0.0000065 | 45.303973 | 0 |
| Weather | 0.2377667 | 0.0273096 | 8.706342 | 0 |
| Support_Crew_Available | -0.0035656 | 0.0003196 | -11.157298 | 0 |
| Baggage_loading_time | 0.8818184 | 0.0263490 | 33.466824 | 0 |
| Late_Arrival_o | 0.4553751 | 0.0200190 | 22.747170 | 0 |
For this report, I will be bootstrapping both the observations and the residuals. For this part of the report, I will be bootstrapping the observations in the model.
I will bootstrap 1000 times. Every coefficient estimate for each variable will be plotted and a sampling distribution for each variable will be made.
B = 1000 # choose the number of bootstrap replicates.
##
num.p = dim(model.frame(lambdaTrans1))[2] # returns number of parameters in the model
smpl.n = dim(model.frame(lambdaTrans1))[1] # sample size
## zero matrix to store bootstrap coefficients
coef.mtrx = matrix(rep(0, B*num.p), ncol = num.p)
##
for (i in 1:B) {
bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE) # fit final model to the bootstrap sample
lambdaTrans1.btc = lm((Arr_Delay)^0.5 ~ Airport_Distance + Number_of_flights +
Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o,
data = flight2[bootc.id, ])
coef.mtrx[i,] = coef(lambdaTrans1.btc) # extract coefs from bootstrap regression model
}
boot.hist = function(cmtrx, bt.coef.mtrx, var.id, var.nm) {
## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
## var.id = variable ID (1, 2, ..., k+1)
## var.nm = variable name on the hist title, must be the string in the double quotes
## coefficient matrix of the final model
## Bootstrap sampling distribution of the estimated coefficients
x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
# height of the histogram - use it to make a nice-looking histogram.
highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density)
ylimit <- max(c(y1.1,highestbar))
hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="",
col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
lines(x = x1.1, y = y1.1, col = "red3")
lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")
#legend("topright", c(""))
}
par(mfrow = c(2, 3)) # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 2, var.nm ="Airport_Distance" )
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 3, var.nm ="Number_of_Flights" )
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 4, var.nm ="Weather")
par(mfrow = c(2, 3))
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 5, var.nm ="Support_Crew_Available" )
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 6, var.nm ="Baggage_loading_time" )
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 7, var.nm ="Late_Arrival_o" )
The red curves represent the actual distribution of the coefficient
estimates based on the estimate and error generated in the final model
summary output. The blue curves represent the density of the sampling
distribution. As we can see, they look very similar. The curves overlap
a lot. This suggests that the bootstrap did a very good job of
estimating the coefficients and there is not a significant difference
between the least squares method vs bootstrapping the observations.
Next, we can get the bootstrapped estimates as well as the 95% confidence intervals for those estimates.
num.p = dim(coef.mtrx)[2] # number of parameters
btc.ci = NULL
btc.wd = NULL
for (i in 1:num.p) {
lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2), 8)
uci.975 = round(quantile(coef.mtrx[, i], 0.975, type = 2 ), 8)
btc.wd[i] = uci.975 - lci.025
btc.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
boot.hist = function(bt.coef.mtrx, var.id, var.nm){
## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
## var.id = variable ID (1, 2, ..., k+1)
## var.nm = variable name on the hist title, must be the string in the double quotes
## Bootstrap sampling distribution of the estimated coefficients
x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
# height of the histogram - use it to make a nice-looking histogram.
highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density)
ylimit <- max(c(y1.1,highestbar))
hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="",
col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
lines(x = x1.1, y = y1.1, col = "red3") # normal density curve
lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") # loess curve
}
#as.data.frame(btc.ci)
kable(as.data.frame(cbind(formatC(cmtrx, 4, format = "f"), btc.ci.95 = btc.ci)), caption = "Regression Coefficient Matrix (Bootstrapped Cases)")
| Estimate | Std. Error | t value | Pr(>|t|) | btc.ci.95 | |
|---|---|---|---|---|---|
| (Intercept) | -34.0259 | 0.4394 | -77.4436 | 0.0000 | [ -34.848 , -33.1699 ] |
| Airport_Distance | 0.0110 | 0.0008 | 13.5038 | 0.0000 | [ 0.0093 , 0.0127 ] |
| Number_of_flights | 0.0003 | 0.0000 | 45.3040 | 0.0000 | [ 3e-04 , 3e-04 ] |
| Weather | 0.2378 | 0.0273 | 8.7063 | 0.0000 | [ 0.1833 , 0.2944 ] |
| Support_Crew_Available | -0.0036 | 0.0003 | -11.1573 | 0.0000 | [ -0.0042 , -0.003 ] |
| Baggage_loading_time | 0.8818 | 0.0263 | 33.4668 | 0.0000 | [ 0.832 , 0.935 ] |
| Late_Arrival_o | 0.4554 | 0.0200 | 22.7472 | 0.0000 | [ 0.4147 , 0.4906 ] |
Comparing these estimates to the estimates in the final model, we can see that they are very similar and consistent. So I can conclude that the bootstrap did a good job of estimating the model.
For this part of the report, i will be bootstrapping the residuals of the model. Just as in i did in part 1, i will bootstrap 1000 times. Sampling distributions and their density curves will be plotted. Before that happens, we can start to illustrate what might happen with a histogram of the residuals.
hist(sort(lambdaTrans1$residuals),n=40, xlab="Residuals", col = "lightblue",
border="navy", main = "Histogram of Final Model Residuals")
model.resid = lambdaTrans1$residuals
B=1000
num.p = dim(model.matrix(lambdaTrans1))[2] # number of parameters
samp.n = dim(model.matrix(lambdaTrans1))[1] # sample size
btr.mtrx = matrix(rep(0, B*num.p), ncol=num.p) # zero matrix to store boot coefs
for (i in 1:B) {
## Bootstrap response values
bt.lambdaTrans1 = (lambdaTrans1$fitted.values + sample(model.resid, samp.n, replace = TRUE)) # bootstrap residuals
# replace PriceUnitArea with bootstrap log price
flight2$bt.lambdaTrans1 = bt.lambdaTrans1 # send the boot response to the data
lambdaTrans.btr <- lm(bt.lambdaTrans1 ~ Airport_Distance + Number_of_flights +
Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o,
data = flight2)
btr.mtrx[i,]=lambdaTrans.btr$coefficients
}
Now, to bootstrap and plot the sampling distributions:
par(mfrow=c(2,3)) # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 2, var.nm ="Airport_Distance" )
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 3, var.nm ="Number_Of_Flights" )
par(mfrow = c(2,3))
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 4, var.nm ="Weather" )
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 5, var.nm ="Support_Crew_Available" )
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 6, var.nm ="Baggage_loading_time" )
boot.hist(bt.coef.mtrx = btr.mtrx, var.id = 7, var.nm ="Late.arrival_o" )
Again, The curves look similar. There is a lot of overlap as well. And just like in part 1, we can summarize the estimates and generate a 95% confidence interval.
num.p = dim(coef.mtrx)[2] # number of parameters
btr.ci = NULL
btr.wd = NULL
for (i in 1:num.p) {
lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2),8)
uci.975 = round(quantile(btr.mtrx[, i],0.975, type = 2 ),8)
btr.wd[i] = uci.975 - lci.025
btr.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
#as.data.frame(btc.ci)
kable(as.data.frame(cbind(formatC(cmtrx, 4, format= "f"), btr.ci.95=btr.ci)),
caption = "Regression Coefficient Matrix (Bootstrapped Residuals)")
| Estimate | Std. Error | t value | Pr(>|t|) | btr.ci.95 | |
|---|---|---|---|---|---|
| (Intercept) | -34.0259 | 0.4394 | -77.4436 | 0.0000 | [ -34.9259 , -33.1231 ] |
| Airport_Distance | 0.0110 | 0.0008 | 13.5038 | 0.0000 | [ 0.0094 , 0.0125 ] |
| Number_of_flights | 0.0003 | 0.0000 | 45.3040 | 0.0000 | [ 3e-04 , 3e-04 ] |
| Weather | 0.2378 | 0.0273 | 8.7063 | 0.0000 | [ 0.1853 , 0.2889 ] |
| Support_Crew_Available | -0.0036 | 0.0003 | -11.1573 | 0.0000 | [ -0.0041 , -0.003 ] |
| Baggage_loading_time | 0.8818 | 0.0263 | 33.4668 | 0.0000 | [ 0.8278 , 0.9331 ] |
| Late_Arrival_o | 0.4554 | 0.0200 | 22.7472 | 0.0000 | [ 0.4173 , 0.4956 ] |
Comparing these estimates to the estimates of the final model, we can see that they’re also very similar and consistent. So i can conclude that this bootstrap also did a good job of estimating the model.
Now, let’s compare all 3 methods and see how they did:
kable(as.data.frame(cbind(formatC(cmtrx[,-3],4,format="f"),
btc.ci.95=btc.ci,btr.ci.95=btr.ci)),
caption="Final Combined Inferential Statistics: p-values and Bootstrap CIs")
| Estimate | Std. Error | Pr(>|t|) | btc.ci.95 | btr.ci.95 | |
|---|---|---|---|---|---|
| (Intercept) | -34.0259 | 0.4394 | 0.0000 | [ -34.848 , -33.1699 ] | [ -34.9259 , -33.1231 ] |
| Airport_Distance | 0.0110 | 0.0008 | 0.0000 | [ 0.0093 , 0.0127 ] | [ 0.0094 , 0.0125 ] |
| Number_of_flights | 0.0003 | 0.0000 | 0.0000 | [ 3e-04 , 3e-04 ] | [ 3e-04 , 3e-04 ] |
| Weather | 0.2378 | 0.0273 | 0.0000 | [ 0.1833 , 0.2944 ] | [ 0.1853 , 0.2889 ] |
| Support_Crew_Available | -0.0036 | 0.0003 | 0.0000 | [ -0.0042 , -0.003 ] | [ -0.0041 , -0.003 ] |
| Baggage_loading_time | 0.8818 | 0.0263 | 0.0000 | [ 0.832 , 0.935 ] | [ 0.8278 , 0.9331 ] |
| Late_Arrival_o | 0.4554 | 0.0200 | 0.0000 | [ 0.4147 , 0.4906 ] | [ 0.4173 , 0.4956 ] |
kable(round(cbind(btc.wd, btr.wd),4),
caption = "width of the two bootstrap confidence intervals")
| btc.wd | btr.wd |
|---|---|
| 1.6781 | 1.8028 |
| 0.0034 | 0.0031 |
| 0.0000 | 0.0000 |
| 0.1111 | 0.1036 |
| 0.0013 | 0.0012 |
| 0.1030 | 0.1053 |
| 0.0760 | 0.0783 |
kable(cmtrx, caption = "Inferential Statistics of Final Model")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -34.0259067 | 0.4393635 | -77.443629 | 0 |
| Airport_Distance | 0.0109793 | 0.0008131 | 13.503838 | 0 |
| Number_of_flights | 0.0002962 | 0.0000065 | 45.303973 | 0 |
| Weather | 0.2377667 | 0.0273096 | 8.706342 | 0 |
| Support_Crew_Available | -0.0035656 | 0.0003196 | -11.157298 | 0 |
| Baggage_loading_time | 0.8818184 | 0.0263490 | 33.466824 | 0 |
| Late_Arrival_o | 0.4553751 | 0.0200190 | 22.747170 | 0 |
All 3 methods give the same estimates (within rounding) and variable significance, and they give us similar confidence interval widths. So both the parametric and nonparametric methods worked well to produce a good model. So there really is no difference between the two methods. As a result, we can choose the parametric model to use for any future analysis.
note: this code chunk is just extra and is only here to be used in the project presentation
select = function(m) { # m is an object: model
e = m$resid # residuals
n0 = length(e) # sample size
SSE=(m$df)*(summary(m)$sigma)^2 # sum of squared error
R.sq=summary(m)$r.squared # Coefficient of determination: R square!
R.adj=summary(m)$adj.r # Adjusted R square
MSE=(summary(m)$sigma)^2 # square error
Cp=(SSE/MSE)-(n0-2*(n0-m$df)) # Mellow's p
AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df) # Akaike information criterion
SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df) # Schwarz Bayesian Information criterion
X=model.matrix(m) # design matrix of the model
H=X%*%solve(t(X)%*%X)%*%t(X) # hat matrix
d=e/(1-diag(H))
PRESS=t(d)%*%d # predicted residual error sum of squares (PRESS)- a cross-validation measure
tbl = as.data.frame(cbind(SSE = SSE, R.sq = R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
names(tbl)=c("SSE", "R.sq", "R.adj", " Cp", "AIC", "SBC", "PRESS")
tbl
}
output.sum = rbind(select(lambdaTrans1), select(lambdaTrans1.btc), select(lambdaTrans.btr))
row.names(output.sum) = c("lambdaTrans1", "lambdaTrans1.btc", "lambdaTrans1.btr")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
| SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS | |
|---|---|---|---|---|---|---|---|
| lambdaTrans1 | 2005.760 | 0.8419195 | 0.8416548 | 7 | -2075.844 | -2032.543 | 2013.419 |
| lambdaTrans1.btc | 1908.979 | 0.8439453 | 0.8436839 | 7 | -2253.385 | -2210.084 | 1916.206 |
| lambdaTrans1.btr | 1999.669 | 0.8430520 | 0.8427892 | 7 | -2086.763 | -2043.461 | 2007.515 |
This study focused on parametric vs nonparametric ways to build a model. Using a model built with least squares in an earlier analysis, we were able to use the nonparametric method of bootstrapping to build two models and compare them. All 3 models ended up being very similar, and thus I decided to choose the parametric method.