1 Introduction

This report uses a data set of flight delay times for various airlines. The data set contains 11 variables and 3593 observations. The variables are:

Carrier (Categorical) - The airline that the flight is taken with
Airport_Distance (Numeric) - The distance between the airports in miles
Number_of_flights (Numeric) - Total number of flights at the airport
Weather (Numeric) - Delay due to weather condition ranked 0-10, with 0 being mild and 10 being extreme
Support_Crew_Available (Numeric) - Total number of support crew available
Baggage_loading_time (Numeric) - Time in minutes for loading the baggage
Late_Arrival_o (Numeric) - Time in minutes for late arriving aircraft of the same flight
Cleaning_o (Numeric) - Time in minutes for aircraft cleaning
Fueling_o (Numeric) - Time in minutes for aircraft fueling
Security_o (Numeric) - Time in minutes for security checking
Arr_Delay (Numeric) - Flight arrival delay in minutes. It is the dependent variable in the model

The goal of this analysis is to build a simple linear regression model and explore a potential relationship between a flight’s arrival delay time and one of the other variables in this data set. I will also compare the parametric least squares method with the nonparametric bootstrap method of building the model and see which one is better.

flight1 <- read.csv("flight_delay-data.csv", header = TRUE)

2 Data Exploration

Before we fit the model, we can generate a pairwise scatterplot to look for associations between variables. However, because the data set contains a categorical variable (Carrier, the first column), we must remove it first.

library(psych)   # provides pairs.panels()
pairs.panels(flight1[, -c(1)], pch=21, main = "Pair-wise Scatter Plot of Numerical Variables") # drop column 1 (Carrier); pch = 21 draws filled circles

Since Arr_Delay and Number_of_flights seem to be highly correlated (r = 0.82), I will use Number_of_flights as the explanatory variable.
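The correlation read off the scatterplot matrix can also be computed directly. The sketch below uses simulated stand-in data (the data frame `sim` and its parameters are hypothetical, not the flight data); replacing `sim` with `flight1` should give the value shown in the plot.

```r
# Simulated stand-in data (hypothetical); substitute flight1 to reproduce the report
set.seed(321)
sim <- data.frame(Number_of_flights = round(runif(200, 30000, 55000)))
sim$Arr_Delay <- -300 + 0.0086 * sim$Number_of_flights + rnorm(200, sd = 15)

# Pearson correlation between the response and the candidate predictor
r <- cor(sim$Arr_Delay, sim$Number_of_flights)

# cor.test() adds a confidence interval and a test of zero correlation
ct <- cor.test(sim$Arr_Delay, sim$Number_of_flights)

print(r)
print(ct$conf.int)
```

`cor.test()` is used here only to attach an interval to the point estimate; the choice of predictor in the report rests on the scatterplot matrix.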

3 Fitting the model

3.1 The Model

We can now fit the simple linear regression model, using Arr_Delay as the response variable and Number_of_flights as the explanatory variable.

slm <- lm(Arr_Delay ~ Number_of_flights, data = flight1)

Now that the model is fitted, we can check for any potential violation of model assumptions using residual analysis.

par(mfrow = c(2,2))
plot(slm)

Based on the plots, the assumptions of normality and constant variance appear to be violated. A transformation of the response variable may correct them.
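The visual reading of the residual plots can be backed up with simple numerical checks. This is a minimal sketch on simulated data whose noise grows with the predictor (the variables `x`, `y` and `fit` are hypothetical); with the real model, `fit` would be replaced by `slm`.

```r
# Hypothetical data: the noise standard deviation grows with x (non-constant variance)
set.seed(1)
x <- runif(300, 1, 10)
y <- 2 + 3 * x + rnorm(300, sd = x)
fit <- lm(y ~ x)

# Shapiro-Wilk test of normality applied to the residuals
sw <- shapiro.test(residuals(fit))

# Rough check for non-constant variance: do absolute residuals
# increase with the fitted values? (Spearman rank correlation)
sp <- cor.test(abs(residuals(fit)), fitted(fit), method = "spearman")

print(sw$p.value)
print(sp$estimate)
```

A small Shapiro-Wilk p-value points to non-normal residuals, and a clearly positive rank correlation points to a spread that grows with the fitted values. These checks are a complement to the diagnostic plots, not a replacement for them.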

3.2 Transformation of the Response Variable

To correct the issues with normality and constant variance, I will use a Box-Cox transformation. Lambda values from -1 to 1 will be compared to find the best value for lambda. Before we make the transformation, however, we must remove some of the observations in the data set, because the Box-Cox transformation requires a strictly positive response and some values of Arr_Delay are not positive (not > 0).

flight2 <- flight1[which(flight1$Arr_Delay > 0), ]   # keep only strictly positive delays

library(MASS)   # provides boxcox()
boxcox(lm(Arr_Delay ~ Number_of_flights, data = flight2), lambda = seq(-1, 1, 1/10))

The profile log-likelihood plot suggests that the optimal value of lambda is about 0.75. The response is therefore raised to the power 0.75 in the new model.
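Rather than reading lambda off the plot, the grid value with the highest profile log-likelihood can be extracted from the object that `boxcox()` returns. The sketch below assumes simulated data (hypothetical `x` and `y`, built so that the square root of `y` is linear in `x`) and is not the flight data.

```r
library(MASS)   # boxcox() lives in MASS

# Hypothetical positive response: sqrt(y) is linear in x, so lambda near 0.5 is expected
set.seed(2)
x <- runif(300, 1, 10)
y <- (5 + 2 * x + rnorm(300))^2

# plotit = FALSE suppresses the plot and returns the lambda grid (x)
# together with the profile log-likelihood at each grid value (y)
bc <- boxcox(lm(y ~ x, y = TRUE), lambda = seq(-1, 1, 1/10), plotit = FALSE)

# Grid value that maximises the profile log-likelihood
lambda.hat <- bc$x[which.max(bc$y)]
print(lambda.hat)
```

For the report, the same two lines applied to the model fitted on `flight2` would return the grid value nearest the peak of the curve. A finer grid (for example steps of 0.05) would be needed for the result to land on a value such as 0.75.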

slm.trans <- lm((Arr_Delay^0.75) ~ Number_of_flights, data = flight2)

Now let’s see if there were any improvements:

par(mfrow=c(2,2))
plot(slm.trans)

The transformation appears to have fixed the issues with normality and constant variance.

Now let’s check the coefficient estimates in both our original model and the transformed model:

library(pander)   # provides pander() for formatted tables
reg.table1 <- coef(summary(slm))
pander(reg.table1, caption = "Inferential statistics for the parametric linear regression model: Flight Arrival Delay Time to Number of Flights")

Inferential statistics for the parametric linear regression model: Flight Arrival Delay Time to Number of Flights
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	-302.6	4.298	-70.4	0
Number_of_flights	0.008598	9.903e-05	86.82	0

reg.table2 <- coef(summary(slm.trans))
pander(reg.table2, caption = "Inferential statistics for the parametric linear regression model with transformation")

Inferential statistics for the parametric linear regression model with transformation
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	-77.14	1.127	-68.43	0
Number_of_flights	0.002328	2.597e-05	89.65	0

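We should stick with the transformed model because the original model violated the assumptions. Because the response in this model is Arr_Delay^0.75, the slope is on the transformed scale: for every additional flight at an airport, we expect Arr_Delay^0.75 to increase by about 0.0023. This is not the same as an increase of 0.0023 minutes in the delay itself.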
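To express a prediction from the transformed model in minutes, the fitted value has to be raised to the power 1/0.75. The sketch below uses the coefficients from the table above; the traffic level `flights` is a hypothetical value chosen only for illustration, and the back-transformed value is a rough point prediction rather than an exact mean on the original scale.

```r
b0 <- -77.14       # intercept from the transformed-model table
b1 <- 0.002328     # slope from the transformed-model table
lambda <- 0.75     # power applied to Arr_Delay

flights <- 44000   # hypothetical number of flights (illustration only)

# Prediction on the transformed scale, i.e. for Arr_Delay^0.75
fitted.trans <- b0 + b1 * flights

# Back-transform to minutes
fitted.min <- fitted.trans^(1 / lambda)

# Approximate change in minutes for one additional flight at this traffic level
delta <- (b0 + b1 * (flights + 1))^(1 / lambda) - fitted.min

print(c(fitted.trans = fitted.trans, fitted.min = fitted.min, delta = delta))
```

Because the back-transformation is nonlinear, the change in minutes per additional flight depends on the traffic level, which is why the slope cannot be read directly as minutes.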

4 Bootstrapping the model

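Now that we’ve built the model with the parametric method, we can use the nonparametric method of bootstrapping. For this, I will draw 1000 bootstrap samples, each the same size as the data set and sampled with replacement, refit the regression on each sample, and build a sampling distribution of the coefficient estimates. After that, we can build 95% percentile confidence intervals for the coefficients. Note that the bootstrap regressions below use the untransformed Arr_Delay, so the results are comparable to the original model.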

Arr_Delay <- flight2$Arr_Delay
Number_of_flights <- flight2$Number_of_flights

n <- length(Arr_Delay)   # number of observations
B <- 1000                # number of bootstrap replicates
# preallocate vectors to store the bootstrap regression coefficients
boot.beta0 <- numeric(B)
boot.beta1 <- numeric(B)
## bootstrap regression models using a for-loop
for(i in 1:B){
  boot.id <- sample(1:n, n, replace = TRUE)   # resample observation IDs with replacement
  boot.delay <- Arr_Delay[boot.id]            # bootstrap arrival delays
  boot.flights <- Number_of_flights[boot.id]  # corresponding numbers of flights
  ## regression on the bootstrap sample
  boot.reg <- lm(boot.delay ~ boot.flights)
  boot.beta0[i] <- coef(boot.reg)[1]   # bootstrap intercept
  boot.beta1[i] <- coef(boot.reg)[2]   # bootstrap slope
}

boot.beta0.ci <- quantile(boot.beta0, c(0.025, 0.975), type = 2)
boot.beta1.ci <- quantile(boot.beta1, c(0.025, 0.975), type = 2)
boot.coef <- data.frame(rbind(boot.beta0.ci, boot.beta1.ci)) 
names(boot.coef) <- c("2.5%", "97.5%")
pander(boot.coef, caption="Bootstrap confidence intervals of regression coefficients.")

Bootstrap confidence intervals of regression coefficients.
	2.5%	97.5%
boot.beta0.ci	-313.7	-297.8
boot.beta1.ci	0.008483	0.008856

Because the entire 95% confidence interval for the slope lies above 0, we can conclude that the relationship between the number of flights at an airport and the arrival delay time is statistically significant at the 5% level, and that the two variables are positively associated.
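A direct way to compare the two methods is to put the t-based interval from `confint()` next to the bootstrap percentile interval for the same slope. This is a minimal sketch on simulated data (the variables `x`, `y` and the simulation parameters are hypothetical); for the report, the same comparison would be run on `flight2`.

```r
# Hypothetical data standing in for the flight data
set.seed(3)
n <- 200
x <- runif(n, 30000, 50000)
y <- -300 + 0.0086 * x + rnorm(n, sd = 20)

# Parametric (least squares, t-based) 95% interval for the slope
fit <- lm(y ~ x)
param.ci <- confint(fit, level = 0.95)["x", ]

# Nonparametric bootstrap: resample cases and refit the regression
B <- 1000
boot.slope <- numeric(B)
for (i in 1:B) {
  id <- sample(n, n, replace = TRUE)
  boot.slope[i] <- coef(lm(y[id] ~ x[id]))[2]
}
boot.ci <- quantile(boot.slope, c(0.025, 0.975), type = 2)

# Side-by-side comparison of the two intervals
print(rbind(parametric = param.ci, bootstrap = boot.ci))
```

When the two intervals nearly coincide, as is typical with a large sample, the simpler parametric interval is usually adequate.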

5 Summary and Conclusion

This study focused on potential relationships between a flight’s arrival delay time and various factors. The factor that was analyzed was the number of flights at an airport. A simple linear regression model was fitted, and because the assumptions of normality and constant variance were violated, a Box-Cox transformation was applied.

Then we compared the nonparametric and parametric model building methods. Bootstrapping produced coefficient estimates similar to those of the original least squares fit. A 95% confidence interval for the bootstrap slope estimate shows that Arr_Delay and Number_of_flights are positively associated and that the relationship is statistically significant. Considering all of this information, and the fact that our sample size is large, we can choose the parametric method of building the model.

STA321 Week 3 Homework: Simple Linear Regression and Bootstrapping

Ian VanWright

09/16/2023