knitr::opts_chunk$set(fig.width = 12, fig.height = 8,
                      echo = FALSE, warning = FALSE, message = FALSE)
We know weather affects airport performance, but how bad does it have to be to force an airport to delay a plane? To help answer this, we will look at the airports in the New York Metro area (LaGuardia, John F. Kennedy, and Newark) for the year 2013. We will examine the effect of wind, precipitation, and visibility on 285,820 flights. If a predictive model can be built, fliers should be able to tell from the local weather whether or not to expect a delay.
All of the data come from the R package nycflights13, which is sourced from the Bureau of Transportation Statistics. It is an observational study, as each row is simply recorded data from each given flight. We load the data directly from the package to clean it to our specifications.
library(nycflights13)
nycflights <- flights
nycweather <- weather
We are only concerned with airports, not the individual airlines or airplanes, so we remove those columns.
nycflights <- nycflights[, c(1:9, 13:17)]
Lastly, we merge the two dataframes into one big dataframe, then remove any incomplete entries.
nycfinal <- merge(nycflights, nycweather, by = c("origin", "year", "month", "day", "hour"))
nycfinal <- nycfinal[complete.cases(nycfinal), ]
We are left with a dataframe of 38,726 entries, each of which is a flight from one of the three airports. However, each airport likely has different policies regarding delays, so we will split the data by airport to better compare them against each other.
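As a brief aside on the merge and filtering step above, the sketch below shows how rows are lost at each stage. The two toy tables and their values are hypothetical stand-ins, not the real nycflights13 data, and only the `origin` and `hour` keys are used to keep the example small.

```r
# Toy stand-ins for the flights and weather tables (all values are made up).
toy_flights <- data.frame(
  origin    = c("LGA", "JFK", "EWR", "LGA"),
  hour      = c(6, 7, 8, 9),
  dep_delay = c(5, NA, 12, -2)
)
toy_weather <- data.frame(
  origin = c("LGA", "JFK", "EWR"),
  hour   = c(6, 7, 8),
  visib  = c(10, 8, NA)
)

# merge() performs an inner join by default, so the 9 o'clock LGA flight,
# which has no matching weather row, is dropped here.
toy_merged <- merge(toy_flights, toy_weather, by = c("origin", "hour"))

# complete.cases() keeps only the rows with no NA in any column.
toy_final <- toy_merged[complete.cases(toy_merged), ]

n_flights <- nrow(toy_flights)  # 4 rows before merging
n_merged  <- nrow(toy_merged)   # 3 rows after the inner join
n_final   <- nrow(toy_final)    # 1 row after dropping incomplete cases
```

Comparing `nrow()` before and after each step on the real tables would show how much of the reduction comes from unmatched weather hours and how much from missing values.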
lga <- subset(nycfinal, origin == "LGA")
jfk <- subset(nycfinal, origin == "JFK")
ewr <- subset(nycfinal, origin == "EWR")
Now that we have our data, let’s examine which weather variables might help predict delays. Let us start with our initial assumptions of wind, precipitation, and visibility.
All three of our initial choices look terrible. There is very little evidence of linearity present. If we look at the data directly, we’ll find that all three variables are heavily skewed.
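One way to put a number on the skew mentioned above is a moment-based skewness statistic. The helper below, `sample_skewness`, is a hypothetical name introduced for illustration, and the toy vector is made up; it is a sketch of the idea rather than the calculation used in this report.

```r
# Moment-based sample skewness: the third central moment divided by the
# 1.5 power of the second central moment. Positive values indicate a long
# right tail; values near zero indicate a roughly symmetric distribution.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up precipitation-like values: mostly zero with one large reading.
toy_precip     <- c(0, 0, 0, 0, 10)
skew_precip    <- sample_skewness(toy_precip)    # 1.5, strongly right-skewed
skew_symmetric <- sample_skewness(c(1, 2, 3))    # 0, symmetric
```

Assuming the merged dataframe keeps the weather columns `wind_gust`, `precip`, and `visib`, the same helper could be applied to all three at once with `sapply(nycfinal[, c("wind_gust", "precip", "visib")], sample_skewness)`.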
Let’s examine other variables in the weather dataset.
It looks as though all three (humidity, air pressure, and dew point) might work better for our purposes, but nothing is ideal. Next, we examine our response variable, dep_delay.
There might be some suggestion of a normal distribution here, but the distribution is far too skewed to the right. That skew will likely have a strong influence on our models. It will probably be better for us to simulate all of our data, even though we have 285,820 cases.
Here the distribution is normal and centered around the actual mean of our data. If we simulate the data for all the other variables, then attempt to see the relationship, we might find more instances of linearity. If not, we will keep to the raw data and use humidity, air pressure, and dew point.
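The simulation step described above might look something like the following sketch. The report does not show how the values were generated, so the number of draws, the standard deviation, the seed, and the toy `observed_delay` vector are all assumptions made for illustration.

```r
set.seed(2013)  # arbitrary seed, chosen only so the draws are reproducible

# Hypothetical right-skewed departure delays in minutes (not the real data).
observed_delay <- c(-5, -3, -2, 0, 1, 4, 9, 20, 45, 130)

# Assumption: 100 draws from a normal distribution that shares the mean and
# standard deviation of the observed values.
n_sim <- 100
simulated_delay <- rnorm(n_sim,
                         mean = mean(observed_delay),
                         sd   = sd(observed_delay))

# Compare the center of the simulated values with the observed mean.
c(observed_mean = mean(observed_delay), simulated_mean = mean(simulated_delay))
```

Fixing the seed keeps the fitted models and any numbers quoted from them in agreement between runs. Note also that if every variable is drawn independently in this way, the draws carry no information about how weather and delay move together, so a weak fit on such data would be expected.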
The data looks better, but nothing has a real suggestion of linearity. It seems as though wind gusts, precipitation, and visibility have no linear relation to airport delays.
None of our variables seem to have a strictly linear relation to airport delays, but the simulated data is somewhat less messy than the actual data. Based on this alone, we can expect the linear model to account for some of the delays, but we should not expect great performance. The use of simulated data does mean that we will not split apart the data by airports, but rather evaluate the three New York airports as one.
https://www.statmethods.net/stats/regression.html
We’ll use the first three variables to see if they can provide a strong enough model. Our alternative hypothesis is that the use of weather variables (visibility, precipitation, and wind gusts) will help to predict whether or not a flight gets delayed.
##
## Call:
## lm(formula = delay ~ visib + precip + wind, data = simulated.data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.7963 -2.1438 -0.4747  2.1727 13.5743
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   30.166     22.801   1.323    0.189
## visib         -2.130      2.233  -0.954    0.342
## precip      -548.468    357.203  -1.535    0.128
## wind           0.165      0.530   0.311    0.756
##
## Residual standard error: 3.737 on 96 degrees of freedom
## Multiple R-squared: 0.03772, Adjusted R-squared: 0.007646
## F-statistic: 1.254 on 3 and 96 DF, p-value: 0.2945
With a large p-value of 0.2945, we fail to reject our null hypothesis: we find no evidence that visibility, precipitation, and wind gusts help predict flight delays. With an R-squared of 0.03772, our model accounts for only about 4% of the simulated data’s variation.
That being said, the distribution of our residuals is nearly normal.
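A residual check of this kind can be sketched as follows. The `toy_data` dataframe is a hypothetical stand-in for `simulated.data` with made-up means and spreads, and the Shapiro-Wilk test is one common choice for assessing normality, not necessarily the check used in this report.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Hypothetical stand-in for simulated.data: 100 independent normal draws
# per variable, with made-up means and standard deviations.
toy_data <- data.frame(
  delay  = rnorm(100, mean = 12,    sd = 4),
  visib  = rnorm(100, mean = 9,     sd = 0.2),
  precip = rnorm(100, mean = 0.003, sd = 0.001),
  wind   = rnorm(100, mean = 20,    sd = 0.7)
)

fit <- lm(delay ~ visib + precip + wind, data = toy_data)
res <- residuals(fit)

# Shapiro-Wilk test: a large p-value means the residuals give no evidence
# against normality.
normality_test <- shapiro.test(res)
normality_test$p.value

# For a visual check, a normal Q-Q plot can be drawn with:
# qqnorm(res); qqline(res)
```

Points that fall close to the reference line in the Q-Q plot, together with a large Shapiro-Wilk p-value, would support the reading that the residuals are nearly normal.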
##
## Call:
## lm(formula = delay ~ humid + pressure + dew, data = simulated.data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -9.825 -2.260 -0.295  2.105 12.994
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 696.00348  601.21756   1.158    0.250
## humid        -0.01421    0.21341  -0.067    0.947
## pressure     -0.67270    0.59134  -1.138    0.258
## dew           0.01243    0.17791   0.070    0.944
##
## Residual standard error: 3.784 on 96 degrees of freedom
## Multiple R-squared: 0.01351, Adjusted R-squared: -0.01732
## F-statistic: 0.4383 on 3 and 96 DF, p-value: 0.7261
We see that the p-value of 0.7261 is even worse in this case, as is the R-squared of 0.01351. We’re forced to conclude that weather data alone is not enough to provide a linear model for predicting flight delays at an airport.
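The overall p-values and adjusted R-squared values printed in the two model summaries can be recomputed from the other printed quantities, which is a useful sanity check when quoting them. The sketch below uses only the F statistics, degrees of freedom, and R-squared values shown in the output; `adjusted_r2` is a hypothetical helper name introduced here.

```r
# Overall F-test p-values from the printed F statistics and degrees of
# freedom (3 predictors, 96 residual degrees of freedom).
p_model_1 <- pf(1.254,  df1 = 3, df2 = 96, lower.tail = FALSE)  # visib + precip + wind
p_model_2 <- pf(0.4383, df1 = 3, df2 = 96, lower.tail = FALSE)  # humid + pressure + dew

# Adjusted R-squared penalizes R-squared for the number of predictors p,
# given n observations: 1 - (1 - R2) * (n - 1) / (n - p - 1).
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = 100 follows from 96 residual degrees of freedom plus 4 coefficients.
adj_model_1 <- adjusted_r2(0.03772, n = 100, p = 3)
adj_model_2 <- adjusted_r2(0.01351, n = 100, p = 3)

round(c(p_model_1, p_model_2), 4)
round(c(adj_model_1, adj_model_2), 4)
```

The second model's adjusted R-squared comes out negative because its R-squared is smaller than the penalty for fitting three predictors, which is another sign that the predictors explain essentially none of the variation.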
Based on these models, it does not appear feasible to predict delays from weather at the flight’s origin. To begin with, the distribution of delays was not normal: far more flights were on time than delayed. It is likely that delays are not determined entirely by the weather at the airport of origin. If weather is to be considered, one must also know the weather at the destination airport and along the flight path. With those data points, it might be worth revisiting the model.
A stronger indicator might be the number of inbound and outbound flights at any given airport, along with the number of runways and gates available. Delays could also be a result of passengers arriving late or planes having mechanical issues. In short, weather is only one factor with regard to airport delays. However, if our data categorized the reason for each delay, we would probably be able to use it to predict when a delay is weather related.