Data Set Description and Cleaning

In this report, we will be using multiple linear regression models and determining which is best.

The data set I chose for this report is called flight delay data, and it can be found at “https://pengdsci.github.io/datasets/FlightDelay/Flight_delay-data.csv”. It gives a continuous response in minutes of how long a fight is delayed, which depends on one character and nine numeric variables. The variables are listed below:

Arr_Delay: time in minutes of how long a fight is delayed
Carrier: categorical company the flight is with
Airport_Distance: the distance from the airport in miles
Number_of_flights: total number of all of the flights in the airport
Weather: scale of 1-10 with 10 being severe weather
Support_Crew_Available: number of support crew members
Baggage_loading_time: the time it takes in minutes to load the bags
Late_Arrival_o: the time in minutes for the late arriving aircraft of the same flight
Cleaning_o: the time it takes for aircraft cleaning
Fueling_o: the time it takes for aircraft fueling
Security_o: the time for security checking.

In order to make this model, we need two categorical variables and as I mentioned above, we only have one so I turned support crew members (Support_Crew_Available variable) into a categorical variable, with breaks between every 50 crew members available. I don’t believe there are any variables we should drop yet. My practical question is which relevant variables effect the time a flight is delayed.

Multiple Linear Regression Modeling

Full Linear Model

To begin, we will be using the lm(() function to build a full multiple linear regression model with all of the variables.

kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")

Statistics of Regression Coefficients
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-574.5842125	7.6442278	-75.1657623	0.0000000
CarrierAA	-0.3176558	1.3742454	-0.2311492	0.8172121
CarrierAS	1.3025919	4.1330152	0.3151675	0.7526529
CarrierB6	-1.7847244	1.3170082	-1.3551354	0.1754602
CarrierDL	-1.8249200	1.3494939	-1.3522995	0.1763653
CarrierEV	-1.7253348	1.3612518	-1.2674619	0.2050729
CarrierF9	-1.0441150	4.5723713	-0.2283531	0.8193849
CarrierFL	0.5732866	2.2461324	0.2552328	0.7985580
CarrierHA	2.1250862	6.3510956	0.3346015	0.7379454
CarrierMQ	-0.6735115	1.4197022	-0.4744034	0.6352413
CarrierUA	-1.5041356	1.3127539	-1.1457864	0.2519604
CarrierUS	-0.8654488	1.5812309	-0.5473260	0.5841890
CarrierVX	0.2205063	2.1217789	0.1039252	0.9172345
CarrierWN	-1.9711735	1.6721496	-1.1788260	0.2385461
Airport_Distance	0.1732896	0.0135518	12.7872150	0.0000000
Number_of_flights	0.0044376	0.0001090	40.7268944	0.0000000
Weather	4.4104382	0.4552719	9.6874808	0.0000000
Support_Crew_Available	-0.0593364	0.0150282	-3.9483467	0.0000802
Baggage_loading_time	13.4801105	0.4397689	30.6527125	0.0000000
Late_Arrival_o	6.9358981	0.3335567	20.7937628	0.0000000
Cleaning_o	0.0577420	0.0599874	0.9625690	0.3358292
Fueling_o	-0.0587160	0.0595792	-0.9855126	0.3244392
Security_o	0.0072173	0.0295649	0.2441177	0.8071537
Support_Crew_Available.catLow to Medium	-0.4796695	0.8928880	-0.5372112	0.5911552
Support_Crew_Available.catMedium	0.3808286	1.4950053	0.2547340	0.7989433
Support_Crew_Available.catMedium to High	1.5030377	2.2463979	0.6690879	0.5034827
Support_Crew_Available.catHigh	7.9897897	4.0327490	1.9812266	0.0476426

The original full model that the linear model function made was as follows:

\[ y= -574.5592897 \,+\,-0.3068142*CarrierAA \,+\,1.3060328*CarrierAS \\-\,1.7685278*CarrierB6 \,-\,1.7558491 *CarrierDL \,-\,1.7244875*CarrierEV \\,-\,0.9614506*CarrierF9 \,+\,0.6205342*CarrierFL \,+\,1.7141523*CarrierHA \\,-\,0.6700343*CarrierMQ \,-\,1.5067974*CarrierUA \,-\,0.8850271*CarrierUS \\ +0.1461768*CarrierVX \,-\,1.9051889*CarrierWN \,+\,0.1727325*AirportDistance \\,+\,0.0044246*NumberofFlights \,+\,4.4365237*Weather \,-\,0.0489392*SupportCrewAvailable \\+\,13.4992387*Baggageloadingtime \,+\,6.9106449*LateArrivalo \,+\,0.0548127*Cleaningo \\-\,0.0614043*Fuelingo \,+\,0.0077875*Securityo \]

par(mfrow=c(2,2))
plot(full.model)

Residual plots of the full model

In this model, the constant variance is not 0, as seen in the first plot. We can see the model is not normal. The other plots are not normal either. The Cooke’s distance plot shows big gaps in the data. All of these observatins lead us to doing a transformation on the variables or droppping them.

Model without Carrier Variable

Since none of the p-values from the dummy variables (each carrier variable) are statistically significant in the previous model, we can remove this variable to create a new model.

kable(summary(model2)$coef, caption ="Statistics of Regression Coefficients without Carrier variable")

Statistics of Regression Coefficients without Carrier variable
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-576.0172760	7.4850911	-76.9552798	0.0000000
Airport_Distance	0.1730925	0.0135235	12.7993778	0.0000000
Number_of_flights	0.0044296	0.0001086	40.7783777	0.0000000
Weather	4.4659931	0.4543878	9.8285949	0.0000000
Support_Crew_Available	-0.0486269	0.0053197	-9.1409110	0.0000000
Baggage_loading_time	13.4862291	0.4384808	30.7567126	0.0000000
Late_Arrival_o	6.8977733	0.3330868	20.7086345	0.0000000
Cleaning_o	0.0526302	0.0599114	0.8784680	0.3797487
Fueling_o	-0.0587412	0.0594237	-0.9885143	0.3229676
Security_o	0.0085256	0.0295234	0.2887744	0.7727707

The linear equation for the equation where we dropped the Carrier variable is as follows

\[ y = -576.0172760 \,+\,0.1730925*AirportDistance \\,+\,0.0044296*NumberofFlights \,+\,4.4659931*Weather \,-\,0.0486269*SupportCrewAvailable \\+\,13.4862291*Baggageloadingtime \,+\,6.8977733*LateArrivalo \,+\,0.0526302*Cleaningo \\-\,0.0587412*Fuelingo \,+\,0.0085256*Securityo \]

par(mfrow=c(2,2))
plot(model2)

Residual plots of the full model

The model where we only dropped the variable Carrier is very similar to the full model and has the same assumptions violated with the graphs. Another way we can eliminate variables from the model is the stepAIC function, which uses a stepwise procedure to add or drop variables in the model until it finds the model with the lowest AIC value.

StepAIC Reduced Model

kable(summary(reduced.model)$coef, caption ="Statistics of Reduced Model")

Statistics of Reduced Model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-576.6675653	7.2728480	-79.2904736	0.0000000
Airport_Distance	0.1740504	0.0135132	12.8800688	0.0000000
Number_of_flights	0.0044460	0.0001086	40.9247070	0.0000000
Weather	4.4357382	0.4541083	9.7680176	0.0000000
Support_Crew_Available	-0.0587552	0.0149628	-3.9267525	0.0000877
Baggage_loading_time	13.4545639	0.4383386	30.6944553	0.0000000
Late_Arrival_o	6.9219880	0.3326825	20.8065884	0.0000000
Support_Crew_Available.catLow to Medium	-0.5156077	0.8901764	-0.5792197	0.5624773
Support_Crew_Available.catMedium	0.3286920	1.4902084	0.2205678	0.8254415
Support_Crew_Available.catMedium to High	1.4322439	2.2391682	0.6396321	0.5224527
Support_Crew_Available.catHigh	7.5915823	4.0203081	1.8883086	0.0590653

par(mfrow=c(2,2))
plot(reduced.model)

Residual plots of the full model

The reduced model using the stepwise procedure is as follows

\[ y = -576.7053849 \,+\,0.1735146*AirportDistance \\,+\,0.0044340*NumberofFlights \,+\,4.4626954*Weather \,-\,0.0489240*SupportCrewAvailable \\+\,13.4697937*Baggageloadingtime \,+\,6.8982019*LateArrivalo \] The model still has model violations so we will need to do a linear transformation. Since the variance is not constant, we can use a variable transformation on the reduced model.

Variable Transformation Models

Yeo Johnson Variable Transformation

For the Yeo Johnson transformation, we will be transforming the response variable instead of taking out explanatory variables. In the method we use here, we will be using the variables from the stepAIC reduced model. Similarly to a Box Cox transformation, the plots will give us an optimal lambda value under different transformed predictor variables. The only thing that makes this transformation different from the Box Cox transformation is that the response variable can include numbers less than 0.

yeoJmodel <- lm(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data=flightDelay)
yj<- boxCox(yeoJmodel, family="yjPower", plotit =TRUE, lambda = seq(0.5, 0.75, 1/100),)

Transformed Response Model

In this model, we are doing a transformation on the response variable, Arr_Delay. Above, we found an optimal lambda value of .6 with the graph to plug into the equation \[((\lambda-1)/\lambda)\] to transform the response. This is the exponent of our ArrDelay variable in the equation. We used the new response in to make a new model.

par(mfrow=c(2,2))
plot(sqrt.model)

Analysis

The following table gives us the goodness of fit measures of each model, which will tell us which model is best for linear regression analysis. We are looking for a model with the lowest SSE, CP, AIC, SBC and press. On the other hand, we want the model with the highest r squared adjusted value.

output.sum = rbind(select(full.model), select(model2), select(reduced.model),  select(sqrt.model))
row.names(output.sum) = c("Full model", "Model without Carrier", "Step AIC model", "Transformed model")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")

Goodness-of-fit Measures of Candidate Models
	SSE	R.sq	R.adj	Cp	AIC	SBC	PRESS
Full model	552202.17	0.8199993	0.8186869	27	18144.492	18311.534	560823.11
Model without Carrier	555166.20	0.8190331	0.8185785	10	18129.726	18191.594	558445.60
Step AIC model	554058.67	0.8193941	0.8188899	11	18124.551	18192.605	557661.58
Transformed model	18052.92	0.8424637	0.8422001	7	5814.252	5857.559	18122.12

Final Model Discussion

Considering the interpretability, goodness-of-fit, and simplicity, the best model to predict the number in minutes that a flight is delayed is the final model we made, the transformed response model. This model is given by the equation

\[ y = -106.613388 \ +\ 0.032890 *AirportDistance \\ +\, 0.000878 *NumberofFlights \,+\, 0.750077 *Weather \,-\,0.010463 *SupportCrewAvailable \\+\, 2.631489 *Baggageloadingtime \,+\, 1.351783 *LateArrivalo \]

Since we used a variable transformation, the values from this equation will not be correct for interpretting the regression coefficients. The regression coefficients are giving us a value on how much each variable is associated with the flight delay in minutes. Basically, as an airport is one mile from the airport, the air delay time increases by 3.29%. We can similarly interpret other regression coefficients.

Conclusion

Again, here are the residual plots of the transformed response model:

par(mfrow=c(2,2))
plot(sqrt.model)

The transformed response model is definitely our best model, but the plots for this model still show us violations to the model assumptions. The variance is not constant and the normality assumptions are still not met, so this model is only as correct as we could make it using parametric techniques.

Multiple Linear Regression

Emma Laughlin

Data Set Description and Cleaning

Multiple Linear Regression Modeling

Full Linear Model

Model without Carrier Variable

StepAIC Reduced Model

Variable Transformation Models

Yeo Johnson Variable Transformation

Transformed Response Model

Analysis

Final Model Discussion

Conclusion