For this report, I will be using a data set consisting of delayed flight times of various airlines. In this data set, there are 11 variables and 3593 observations. The variables are:
The goal of this analysis is to build a simple linear regression model that explores a potential relationship between a flight’s arrival delay time and one of the other variables in this data set. I will fit the model by parametric least squares, refit it with a nonparametric bootstrap, and compare the two approaches to see which is more appropriate.
library(psych)   # pairs.panels()
library(MASS)    # boxcox()
library(pander)  # pander()
flight1 <- read.csv("flight_delay-data.csv", header = TRUE)
Before fitting the model, we can generate a pairwise scatterplot to look for associations between variables. Because the data set contains a categorical variable (Carrier), we must remove it first.
pairs.panels(flight1[, -1], pch = 21, main = "Pair-wise Scatter Plot of Numerical Variables") # drop column 1 (Carrier)
Since Arr_Delay and Number_of_flights are highly correlated (r = 0.82), I will use Number_of_flights as the explanatory variable.
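The correlation read off the scatterplot matrix can also be computed directly. The sketch below uses simulated data as a stand-in for the flight data (the values and the data frame name `sim` are hypothetical, not taken from the real file) and shows that, in simple linear regression, R-squared equals the squared Pearson correlation. By that identity, r = 0.82 corresponds to an R-squared of about 0.67.

```r
# Simulated stand-in for the flight data (hypothetical values, not the real file)
set.seed(1)
n <- 200
Number_of_flights <- round(runif(n, 1000, 100000))
Arr_Delay <- 0.008 * Number_of_flights + rnorm(n, sd = 100)
sim <- data.frame(Number_of_flights, Arr_Delay)

# Pearson correlation between the response and the candidate predictor
r <- cor(sim$Arr_Delay, sim$Number_of_flights)

# In simple linear regression, R-squared is the squared correlation
fit <- lm(Arr_Delay ~ Number_of_flights, data = sim)
r2 <- summary(fit)$r.squared

print(c(r = r, r_squared = r2))
```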
We can now fit the simple linear regression model, using Arr_Delay as the response variable and Number_of_flights as the explanatory variable.
slm <- lm(Arr_Delay ~ Number_of_flights, data = flight1)
Now that the model is fitted, we can check for any potential violation of model assumptions using residual analysis.
par(mfrow = c(2,2))
plot(slm)
Based on the plots, the assumptions of normality and constant variance appear to be violated. A transformation of the response variable may correct them.
To address the problems with normality and constant variance, I will use a Box-Cox transformation. Values of lambda from -1 to 1, in steps of 0.1, will be compared to find the best one. The Box-Cox transformation requires a strictly positive response, however, so we must first remove the observations whose Arr_Delay is not positive (not > 0).
flight2 <- flight1[which(flight1$Arr_Delay > 0), ] # keep only strictly positive delays
boxcox(lm(Arr_Delay ~ Number_of_flights, data = flight2), lambda = seq(-1, 1, 1/10))
It looks like the optimal value of lambda is about 0.75, so the transformation is applied as follows.
slm.trans <- lm((Arr_Delay^0.75) ~ Number_of_flights, data = flight2)
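The value of lambda does not have to be read off the plot by eye. The sketch below shows one way to extract it from the object that `boxcox()` returns, along with an approximate 95% interval from the profile log-likelihood. It uses simulated data (the variables `x` and `y` are hypothetical), built so that a lambda near 0.5 is expected; it is not a rerun of the flight analysis.

```r
library(MASS)  # provides boxcox()

# Simulated data where sqrt(y) is linear in x, so lambda near 0.5 is expected
set.seed(2)
n <- 300
x <- runif(n, 1, 10)
y <- (2 + 0.5 * x + rnorm(n, sd = 0.3))^2

# plotit = FALSE returns the lambda grid (bc$x) and profile log-likelihood (bc$y)
bc <- boxcox(lm(y ~ x), lambda = seq(-1, 1, 1/10), plotit = FALSE)

# Grid value that maximizes the profile log-likelihood
lambda.hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: grid values within qchisq(0.95, 1) / 2 of the maximum
lambda.ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])

print(lambda.hat)
print(lambda.ci)
```

If the interval contains a convenient value such as 0.5 or 1, that value is often preferred over the exact maximizer because it is easier to interpret.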
Now let’s see if there were any improvements:
par(mfrow=c(2,2))
plot(slm.trans)
It seems our transformation fixed the issues with normality and constant variance.
Now let’s check the coefficient estimates in both our original model and the transformed model:
reg.table1 <- coef(summary(slm))
pander(reg.table1, caption = "Inferential statistics for the parametric linear regression model: Flight Arrival Delay Time to Number of Flights")
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -302.6 | 4.298 | -70.4 | 0 |
| Number_of_flights | 0.008598 | 9.903e-05 | 86.82 | 0 |
reg.table2 <- coef(summary(slm.trans))
pander(reg.table2, caption = "Inferential statistics for the parametric linear regression model with transformation")
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -77.14 | 1.127 | -68.43 | 0 |
| Number_of_flights | 0.002328 | 2.597e-05 | 89.65 | 0 |
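We should stick with the transformed model because the original model violated the assumptions. Its slope is on the transformed scale: for every additional flight at an airport, we expect Arr_Delay^0.75 (the arrival delay time raised to the power 0.75) to increase by about 0.0023. This is not a change of 0.0023 minutes in the delay itself, because the response was transformed before fitting.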
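Because the model was fitted to Arr_Delay^0.75, a fitted value has to be raised to the power 1/0.75 to get back to the original delay scale. The sketch below does this with the rounded coefficients from the table above. The airport size of 50,000 flights is a hypothetical input chosen only for illustration, and the back-transformation is the naive one (it ignores any retransformation bias), so treat the result as a rough guide.

```r
# Rounded coefficients of the transformed model, taken from the table above
b0 <- -77.14
b1 <- 0.002328
lambda <- 0.75

# Hypothetical number of flights, chosen only for illustration
n.flights <- 50000

# Fitted value on the transformed scale (Arr_Delay^0.75)
fitted.trans <- b0 + b1 * n.flights

# Naive back-transformation to the original delay scale
delay.orig <- fitted.trans^(1 / lambda)

# Approximate change in delay per additional flight at this airport size:
# derivative of (b0 + b1 * x)^(1 / lambda) with respect to x
slope.orig <- (1 / lambda) * b1 * fitted.trans^(1 / lambda - 1)

print(c(fitted.trans = fitted.trans, delay.orig = delay.orig, slope.orig = slope.orig))
```

On the original scale the effect of one more flight is not constant: it depends on the number of flights at which it is evaluated. The back-transformation is also only defined where the fitted value on the transformed scale is positive.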
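Now that we have built the model parametrically, we can turn to the nonparametric bootstrap. I will draw 1000 bootstrap samples, each the same size as the data set and sampled with replacement, refit the regression on each one, and collect the coefficient estimates to build their sampling distributions. From these we can construct 95% percentile confidence intervals for the coefficients. Note that the bootstrap regressions use the untransformed Arr_Delay, so their results are comparable to the original (untransformed) model.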
Arr_Delay <- flight2$Arr_Delay
Number_of_flights <- flight2$Number_of_flights
n <- length(Arr_Delay) # number of observations
B <- 1000 # number of bootstrap replicates
# preallocate vectors to store the bootstrap regression coefficients
boot.beta0 <- numeric(B)
boot.beta1 <- numeric(B)
## bootstrap regression models using a for-loop
for(i in 1:B){
  boot.id <- sample(n, n, replace = TRUE) # resample observation IDs with replacement
  boot.delay <- Arr_Delay[boot.id] # bootstrap arrival delays
  boot.flights <- Number_of_flights[boot.id] # corresponding bootstrap numbers of flights
  ## regression on the bootstrap sample
  boot.reg <- lm(boot.delay ~ boot.flights)
  boot.beta0[i] <- coef(boot.reg)[1] # bootstrap intercept
  boot.beta1[i] <- coef(boot.reg)[2] # bootstrap slope
}
boot.beta0.ci <- quantile(boot.beta0, c(0.025, 0.975), type = 2)
boot.beta1.ci <- quantile(boot.beta1, c(0.025, 0.975), type = 2)
boot.coef <- data.frame(rbind(boot.beta0.ci, boot.beta1.ci))
names(boot.coef) <- c("2.5%", "97.5%")
pander(boot.coef, caption="Bootstrap confidence intervals of regression coefficients.")
| | 2.5% | 97.5% |
|---|---|---|
| boot.beta0.ci | -313.7 | -297.8 |
| boot.beta1.ci | 0.008483 | 0.008856 |
Because the 95% confidence interval for the slope lies entirely above 0, we can conclude that the number of flights at an airport and the arrival delay time are positively associated, and that the relationship is statistically significant.
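One way to compare the two approaches directly is to put the parametric interval from `confint()` next to the bootstrap percentile interval. The sketch below does this on simulated data (the variables `x` and `y` and their generating values are hypothetical, chosen only to resemble the scale of the flight data); it illustrates the comparison and does not reproduce the results above.

```r
# Simulated stand-in data (hypothetical values, not the real flight data)
set.seed(3)
n <- 500
x <- runif(n, 1000, 100000)
y <- -300 + 0.0086 * x + rnorm(n, sd = 150)

# Parametric (least squares) 95% confidence intervals
fit <- lm(y ~ x)
param.ci <- confint(fit, level = 0.95)

# Nonparametric bootstrap: resample cases, refit, keep the coefficients
B <- 1000
boot.coefs <- t(replicate(B, {
  id <- sample(n, n, replace = TRUE)
  coef(lm(y[id] ~ x[id]))
}))

# Percentile 95% intervals, one row per coefficient
boot.ci <- t(apply(boot.coefs, 2, quantile, probs = c(0.025, 0.975)))

# Side by side: parametric bounds first, then bootstrap bounds
print(cbind(param.ci, boot.ci))
```

When the model assumptions hold and the sample is large, the two sets of intervals should be close; a clear disagreement would suggest that the parametric standard errors are not trustworthy.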
This study examined potential relationships between a flight’s arrival delay time and other factors, focusing on the number of flights at an airport. A simple linear regression model was fitted; because the assumptions of normality and constant variance were violated, a Box-Cox transformation was applied.
Then we compared the parametric and nonparametric model-building methods. The bootstrap produced coefficient estimates similar to those of the original least squares fit, and the 95% bootstrap confidence interval for the slope shows that Arr_Delay and Number_of_flights are positively associated and that the relationship is statistically significant. Given this agreement and our large sample size, we can choose the parametric method of building the model.