The data set I chose for this report is the flight delay data. Its response (dependent) variable, Arr_Delay, is a continuous measure, in minutes, of how long a flight is delayed. The response depends on one character variable and nine numeric variables. The first five are Carrier: the company the flight is with (categorical), Airport_Distance: the distance from the airport in miles, Number_of_flights: the total number of flights in the airport, Weather: a scale of 1-10 with 10 being severe weather, and Support_Crew_Available: the number of support crew members. The other five variables are all measured in minutes: Baggage_loading_time: the time it takes to load the bags, Late_Arrival_o: the time for the late arriving aircraft of the same flight, Cleaning_o: the time it takes for aircraft cleaning, Fueling_o: the time it takes for aircraft fueling, and Security_o: the time for security checking. The model requires two categorical variables and the data contain only one, so I converted Support_Crew_Available into a categorical variable, Support_Crew_Available.cat, with breaks every 50 crew members.
summary(flightDelay$Support_Crew_Available)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 56 83 85 112 222
summary(flightDelay$Support_Crew_Available.cat)
## Low Low to Medium Medium Medium to High High
## 738 1583 1030 224 18
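The conversion itself is not shown in the report. A minimal sketch of how it could be done with `cut()` follows, assuming break points at 0, 50, 100, 150, 200, and 250 and right-closed intervals. The toy vector `crew` and the exact interval boundaries are assumptions, so the counts will not match the summary above.

```r
# Hypothetical stand-in for flightDelay$Support_Crew_Available
crew <- c(0, 12, 50, 56, 83, 100, 112, 149, 151, 200, 222)

# Breaks every 50 crew members; the top break (250) is assumed so that
# the observed maximum of 222 falls inside the last bin.
crew.cat <- cut(crew,
                breaks = seq(0, 250, by = 50),
                labels = c("Low", "Low to Medium", "Medium",
                           "Medium to High", "High"),
                include.lowest = TRUE)  # keeps a value of 0 in the "Low" bin

summary(crew.cat)  # counts per category
```

Whether a boundary value such as 50 belongs to the lower or the upper bin depends on the `right` argument of `cut()`; the default (`right = TRUE`) places it in the lower bin.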
pairwise_scatter<- data.frame(Arr_Delay, Airport_Distance, Number_of_flights, Cleaning_o, Fueling_o, Security_o)
pairs(pairwise_scatter, main ="Pair-wise Association: Scatter Plot")
# scatterplot to check whether the association is positive and linear;
# the response (Arr_Delay) goes on the y-axis and the predictor on the x-axis
plot(flightDelay$Number_of_flights, flightDelay$Arr_Delay, pch=16, col="white",
xlab = "Number_of_flights",
ylab = "Arr_Delay",
main = "Arr_Delay vs Number_of_flights",
col.main = "navy",
cex.main = 0.8,
bty="n")
# alpha() comes from the scales package (also re-exported by ggplot2)
points(flightDelay$Number_of_flights, flightDelay$Arr_Delay, pch=16, col=alpha("pink", 0.5))
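# simulated reference: what an ideal residual plot looks like
# (70 independent normal draws with mean 0 and sd 5, not the model's residuals)
residual <- rnorm(70, 0, 5)
plot(1:70, residual, pch = 19, col = "magenta",
xlab = "", ylab = "",
ylim = c(-15, 15),
main = "Ideal Residual Plot")
abline(h = 0, col = "black") # reference line at zero, the mean of the residuals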
My practical and analytical questions are how we can minimize flight delays and which factor contributes most to them. For this data set, I would like to see how much each variable changes the response variable, but since we are using a simple linear regression model, we will look only at Number_of_flights, because it is positively and linearly correlated with the response variable, Arr_Delay. This data set has enough information to answer these questions. Since the residual plot shows no pattern and looks random, we do not need to transform the variables. Our assumptions are met and we can continue with simple linear regression.
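The residual check can also be run on a fitted model directly. A minimal sketch, using simulated data in place of flightDelay (the names `x`, `y`, and `fit`, and the numbers used to simulate them, are illustrative assumptions):

```r
set.seed(1)
# Simulated stand-in for the predictor and the response
x <- runif(200, 30000, 60000)
y <- -300 + 0.0086 * x + rnorm(200, 0, 20)

fit <- lm(y ~ x)

# Residuals against fitted values; a patternless band around zero
# supports the linearity and constant-variance assumptions.
plot(fitted(fit), resid(fit), pch = 19,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "black")  # least-squares residuals average to zero
```

A curve or a funnel shape in this plot would suggest transforming the response or the predictor before fitting.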
parametric.model <- lm(Arr_Delay ~ Number_of_flights, data = flightDelay)
par(mfrow = c(2,2))
plot(parametric.model)
reg.table <- coef(summary(parametric.model))
kable(reg.table, caption = "Inferential statistics for the parametric linear
regression model: Air Delay and Number of Flights in the Airport")
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -302.5832840 | 4.297879 | -70.40294 | 0 |
| Number_of_flights | 0.0085978 | 0.000099 | 86.82287 | 0 |
Since almost all of the points on the Normal Q-Q plot fall on the line, we can assume the residuals are approximately normal. The two plots on the left side (Residuals vs Fitted and Scale-Location) look acceptable because the points form one large group. The Residuals vs Leverage plot, which marks Cook’s distance, shows a few outliers, but none serious enough to change the model. After fitting the simple linear regression, we find our model to be
\[
\hat{y}_i = -302.5832840 + 0.0085978\,x_i
\]
where \(x_i\) is Number_of_flights and \(\hat{y}_i\) is the predicted Arr_Delay in minutes. The 95% confidence interval for the slope is \((0.008403643, 0.008791952)\).
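The slope interval can be reproduced from a regression table as estimate ± t-quantile × standard error, which is what `confint()` computes for an `lm` fit. A minimal sketch on simulated data (the data and the names `x`, `y`, and `fit` are assumptions, not the flight data):

```r
set.seed(2)
# Simulated stand-in for the predictor and the response
x <- runif(500, 30000, 60000)
y <- -300 + 0.0086 * x + rnorm(500, 0, 20)
fit <- lm(y ~ x)

est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
tq  <- qt(0.975, df = df.residual(fit))  # t quantile for a 95% interval

manual.ci <- c(est - tq * se, est + tq * se)
manual.ci
confint(fit, "x", level = 0.95)          # same interval from the built-in
```

With the rounded values in the regression table, 0.0085978 ± 1.96 × 0.000099 gives roughly (0.00840, 0.00879), in line with the reported interval.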
B <- 1000 # number of bootstrap replicates
n <- length(Arr_Delay) # sample size
vec.id <- 1:n # vector of observation IDs
# pre-allocate vectors to store the bootstrap regression coefficients
boot.beta0 <- numeric(B)
boot.beta1 <- numeric(B)
## bootstrap regression models using a for-loop
for(i in 1:B){
boot.id <- sample(vec.id, n, replace = TRUE) # bootstrap obs IDs, drawn with replacement
## regression on the resampled (Arr_Delay, Number_of_flights) pairs
boot.reg <- lm(Arr_Delay[boot.id] ~ Number_of_flights[boot.id])
boot.beta0[i] <- coef(boot.reg)[1] # bootstrap intercept
boot.beta1[i] <- coef(boot.reg)[2] # bootstrap slope
}
boot.table <- coef(summary(boot.reg)) # boot.reg holds the fit from the last bootstrap replicate only
kable(boot.table, caption = "Inferential statistics for the bootstrap linear
regression model: Air Delay and Number of Flights in the Airport")
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -299.7981742 | 4.2207824 | -71.02905 | 0 |
| Number_of_flights[boot.id] | 0.0085426 | 0.0000972 | 87.89552 | 0 |
After fitting the bootstrap linear regression, we find our model to be
\[
\hat{y}_i = -302.6686009 + 0.0086599\,x_i
\]
The bootstrap confidence intervals for the regression coefficients are as follows: the interval for the y-intercept is \((-310.644868, -293.8679736)\) and the interval for the slope is \((0.008393, 0.0087843)\). For the simple linear regression model, the confidence interval for the slope is \((0.008403643, 0.008791952)\). The two slope intervals are very similar, and since the sample size is very large, we could use either model, the simple model or the bootstrap model, to predict the delay from the number of flights coming into the airport and to judge what number of flights keeps the delay low.
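The report does not show how the bootstrap intervals were computed. One common choice is the percentile method, which takes the 2.5% and 97.5% quantiles of the bootstrap coefficients. A minimal sketch on simulated data, assuming that method and a 95% level (both are assumptions, and `dat`, `x`, and `y` are illustrative names):

```r
set.seed(3)
n <- 300
# Simulated stand-in for the flight data
dat <- data.frame(x = runif(n, 30000, 60000))
dat$y <- -300 + 0.0086 * dat$x + rnorm(n, 0, 20)

B <- 1000                      # number of bootstrap replicates
boot.beta0 <- numeric(B)
boot.beta1 <- numeric(B)
for (i in 1:B) {
  boot.id <- sample(n, n, replace = TRUE)  # resample rows with replacement
  boot.reg <- lm(y ~ x, data = dat[boot.id, ])
  boot.beta0[i] <- coef(boot.reg)[1]       # bootstrap intercept
  boot.beta1[i] <- coef(boot.reg)[2]       # bootstrap slope
}

# Percentile bootstrap 95% confidence intervals
ci.beta0 <- quantile(boot.beta0, c(0.025, 0.975))
ci.beta1 <- quantile(boot.beta1, c(0.025, 0.975))
ci.beta0
ci.beta1
```

Because the intervals come from the empirical distribution of the resampled coefficients, they do not rely on the normality assumption behind the t-based interval, which is why agreement between the two is reassuring.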