Data Set Description and Cleaning

The data set I chose for this report is the flight delay data. Its response (dependent) variable, Arr_Delay, is continuous and records how long a flight is delayed, in minutes. The response depends on one character variable and nine numeric variables. Five of these are Carrier: the airline operating the flight (categorical), Airport_Distance: the distance from the airport in miles, Number_of_flights: the total number of flights in the airport, Weather: a scale of 1-10 with 10 being severe weather, and Support_Crew_Available: the number of support crew members. The other five variables are all measured in minutes: Baggage_loading_time: the time it takes to load the bags, Late_Arrival_o: the delay of the late-arriving aircraft for the same flight, Cleaning_o: the time it takes to clean the aircraft, Fueling_o: the time it takes to fuel the aircraft, and Security_o: the time for security checking. The model requires two categorical variables and, as mentioned above, the data set has only one, so I turned Support_Crew_Available into a categorical variable, with a break at every 50 crew members available.
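The binning step described above can be sketched with `cut()`. This is only an illustration under stated assumptions: the vector `crew` is a synthetic stand-in for flightDelay$Support_Crew_Available, the labels are taken from the summary output in this report, and whether each interval is closed on the left or on the right is a guess, since the report does not say.

```r
# Synthetic stand-in for flightDelay$Support_Crew_Available (made-up values)
set.seed(1)
crew <- c(0, 222, round(runif(98, min = 0, max = 222)))

# One break every 50 crew members. right = FALSE gives [0,50), [50,100), ...
# (an assumption; the report does not state which end of each interval is closed)
crew_cat <- cut(crew,
                breaks = c(0, 50, 100, 150, 200, 250),
                labels = c("Low", "Low to Medium", "Medium",
                           "Medium to High", "High"),
                right = FALSE)

# Counts per category, analogous to summary() on the new factor
summary(crew_cat)
```

With `right = FALSE` a crew count of 0 falls in the lowest category instead of becoming NA, which matters here because the minimum in the data is 0.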

summary(flightDelay$Support_Crew_Available)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      56      83      85     112     222
summary(flightDelay$Support_Crew_Available.cat)
##            Low  Low to Medium         Medium Medium to High           High 
##            738           1583           1030            224             18
pairwise_scatter<- data.frame(Arr_Delay, Airport_Distance, Number_of_flights, Cleaning_o, Fueling_o, Security_o)
pairs(pairwise_scatter, main ="Pair-wise Association: Scatter Plot")

Scatterplot and Checking Assumptions

# scatterplot to check whether the association is positive and linear
# note: alpha() is not base R; it comes from the scales package (also re-exported by ggplot2)
plot(flightDelay$Arr_Delay, flightDelay$Number_of_flights, pch=16, col="white",
                        xlab = "Arr_Delay",
                        ylab = "Number_of_flights",
                        main = "Arr_Delay vs Number_of_flights",
                        col.main = "navy",
                        cex.main = 0.8,
                        bty="n")
points(flightDelay$Arr_Delay, flightDelay$Number_of_flights, pch=16, col=alpha("pink", 0.5))

# simulated residuals showing what an ideal residual plot looks like
residual <- rnorm(70, 0, 5)
plot(1:70, residual, pch = 19, col = "magenta",
     xlab = "", ylab = "",
     ylim=c(-15, 15),
     main = "Ideal Residual Plot")
abline(h=0, col= "black")   # reference line at zero, the mean of the simulated residuals

My practical and analytical questions are how we can minimize flight delays and which factor contributes most to them. Ideally, I would like to see how much each variable changes the response, but since we are using a simple linear regression model, we will look only at Number_of_flights, because it is positively and linearly associated with the response variable, Arr_Delay. The data set has enough information to answer these questions. Since the plot of the residuals shows no pattern and appears random, we do not need to transform the variables. Our assumptions are met and we can continue with simple linear regression.
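The residual check can also be run on the residuals of a fitted model instead of simulated ones. The sketch below uses a synthetic data frame `sim` (made-up values, not the flight delay data) so that it runs on its own; with the real data the same calls would take `flightDelay` instead.

```r
# Synthetic stand-in data: a linear trend plus random noise (values are made up)
set.seed(2)
sim <- data.frame(Number_of_flights = runif(200, min = 30000, max = 50000))
sim$Arr_Delay <- -300 + 0.0086 * sim$Number_of_flights + rnorm(200, sd = 10)

# Fit the simple linear regression and extract its residuals
fit <- lm(Arr_Delay ~ Number_of_flights, data = sim)

# Residuals vs fitted values: a random band around zero supports
# the linearity and constant-variance assumptions
plot(fitted(fit), resid(fit), pch = 19,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "black")
```

A curve or funnel shape in this plot would suggest transforming a variable before fitting.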

Simple Linear Regression

parametric.model <- lm(Arr_Delay ~ Number_of_flights)
par(mfrow = c(2,2))
plot(parametric.model)

reg.table <- coef(summary(parametric.model))
kable(reg.table, caption = "Inferential statistics for the parametric linear
      regression model: Air Delay and Number of Flights in the Airport")
Inferential statistics for the parametric linear regression model: Air Delay and Number of Flights in the Airport
                        Estimate   Std. Error     t value   Pr(>|t|)
(Intercept)         -302.5832840     4.297879   -70.40294          0
Number_of_flights      0.0085978     0.000099    86.82287          0

Since almost all of the points on the Normal Q-Q plot fall on the line, we can assume the residuals are approximately normal. Both plots on the left side (Residuals vs Fitted and Scale-Location) look acceptable because the points form one large group with no clear pattern. The plot with Cook’s distance shows a few outliers, but none serious enough to change the model. After fitting the simple linear regression, the model is
\[ \hat{y}_i = -302.5832840 + 0.0085978\,x_i \] and the confidence interval for the slope is \((0.008403643, 0.008791952)\).
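As a rough check, the slope interval can be reproduced from the estimate and standard error in the table. The sketch assumes a 95% level and n = 3593 observations (the sum of the category counts shown earlier, assuming no missing values); neither is stated explicitly in the report. With the fitted model object available, `confint(parametric.model)` returns the same kind of interval directly.

```r
# Values copied from the regression table (the standard error is rounded there)
est <- 0.0085978
se  <- 0.000099

# Assumed sample size: sum of the Support_Crew_Available.cat counts
n  <- 738 + 1583 + 1030 + 224 + 18
df <- n - 2                      # two estimated coefficients

# t-based interval: estimate +/- critical value * standard error
ci <- est + c(-1, 1) * qt(0.975, df) * se
print(ci)
```

The result agrees with the reported interval to about seven decimal places, with the small difference coming from the rounded standard error.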

Bootstrap Linear Regression Model

vec.id <- 1:length(Arr_Delay)   # vector of observation ID
boot.id <- sample(vec.id, length(Arr_Delay), replace = TRUE)   # bootstrap obs ID.
boot.Arr_Delay <- Arr_Delay[boot.id]           # bootstrap Arr_Delay
boot.Number_of_flights <- Number_of_flights[boot.id]     # corresponding bootstrap Number_of_flights

B <- 1000    # number of bootstrap replicates
# preallocate vectors to store bootstrap regression coefficients
boot.beta0 <- numeric(B)
boot.beta1 <- numeric(B)
## bootstrap regression models using for-loop
vec.id <- 1:length(Arr_Delay)   # vector of observation ID
for(i in 1:B){
  boot.id <- sample(vec.id, length(Arr_Delay), replace = TRUE)   # bootstrap obs ID.
  boot.Arr_Delay <- Arr_Delay[boot.id]           # bootstrap Arr_Delay
  boot.Number_of_flights<- Number_of_flights[boot.id]     # corresponding bootstrap Number_of_flights
  ## regression
  boot.reg <-lm(Arr_Delay[boot.id] ~ Number_of_flights[boot.id]) 
  boot.beta0[i] <- coef(boot.reg)[1]   # bootstrap intercept
  boot.beta1[i] <- coef(boot.reg)[2]   # bootstrap slope
}
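# note: after the loop, boot.reg holds only the final (B-th) bootstrap fit,
# so this table summarizes that single replicate
boot.table <- coef(summary(boot.reg))
kable(boot.table, caption = "Inferential statistics for the bootstrap linear
      regression model: Air Delay and Number of Flights in the Airport")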
Inferential statistics for the bootstrap linear regression model: Air Delay and Number of Flights in the Airport
                                 Estimate   Std. Error     t value   Pr(>|t|)
(Intercept)                  -299.7981742    4.2207824   -71.02905          0
Number_of_flights[boot.id]      0.0085426    0.0000972    87.89552          0

After fitting the bootstrap linear regression, the model is
\[ \hat{y}_i = -302.6686009 + 0.0086599\,x_i \]

Comparing the Models and Conclusions

The bootstrap confidence intervals for the regression coefficients are as follows: the interval for the y-intercept is \((-310.644868, -293.8679736)\) and the interval for the slope is \((0.008393, 0.0087843)\). For the simple linear regression model, the confidence interval for the slope is \((0.008403643, 0.008791952)\). The two slope intervals are very similar, and since the sample size is very large, either model, the simple model or the bootstrap model, can be used to predict what number of flights coming into the airport would minimize the arrival delay.
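One common way to turn the stored bootstrap coefficients into confidence intervals is the percentile method, sketched below. The report does not say which method or level produced its intervals, so the percentile method and the 95% level are both assumptions, and `x` and `y` are synthetic stand-ins for Number_of_flights and Arr_Delay.

```r
# Synthetic stand-in data (made-up values, not the flight delay data)
set.seed(3)
n <- 300
x <- runif(n, min = 30000, max = 50000)
y <- -300 + 0.0086 * x + rnorm(n, sd = 10)

B <- 1000                      # number of bootstrap replicates
boot.beta0 <- numeric(B)       # bootstrap intercepts
boot.beta1 <- numeric(B)       # bootstrap slopes

for (i in seq_len(B)) {
  id  <- sample(n, n, replace = TRUE)   # resample observation IDs with replacement
  fit <- lm(y[id] ~ x[id])              # refit on the resampled pairs
  boot.beta0[i] <- coef(fit)[1]
  boot.beta1[i] <- coef(fit)[2]
}

# Percentile intervals: the 2.5th and 97.5th percentiles of the bootstrap estimates
ci.beta0 <- quantile(boot.beta0, c(0.025, 0.975))
ci.beta1 <- quantile(boot.beta1, c(0.025, 0.975))
print(ci.beta0)
print(ci.beta1)
```

Because the intervals come from all B replicates and not from a single refit, they reflect the sampling variability the bootstrap is meant to capture.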