knitr::opts_chunk$set(fig.width = 12, fig.height = 8,
               echo = FALSE, warning = FALSE, message = FALSE)

1. Introduction

We know weather affects airport performance, but how bad does it have to be to force an airport to delay a plane? To help answer this, we will look at the airports in the New York metro area (LaGuardia, John F. Kennedy, and Newark) for the year 2013. We will examine the effect of wind, precipitation, and visibility on 285,820 flights. If a predictive model can be built, fliers could use the local weather to judge whether or not to expect a delay.


2. Data Acquisition and Cleaning


All of the data come from the R package nycflights13. Its flight records are sourced from the Bureau of Transportation Statistics, and its hourly weather table holds observations recorded at the three airports. This is an observational study, as each row is simply the recorded data for a given flight. We load the data directly from the package and then clean it to our specifications.

library(nycflights13)  # provides the flights and weather data frames

nycflights <- flights
nycweather <- weather

We are only concerned with airports, not the individual airlines or airplanes, and so we remove those columns.

nycflights <- nycflights[, c(1:9, 13:17)]
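Selecting columns by position depends on the column order of the installed package version. A minimal sketch of the same idea using column names, on a toy data frame (the toy values and the `drop_cols` vector are illustrative assumptions, not part of the original analysis):

```r
# Toy stand-in for the flights table; values are made up for illustration.
toy <- data.frame(
  year = 2013L, month = 1L, day = 1L,
  dep_delay = c(2, -4, 11),
  carrier = c("UA", "AA", "B6"),
  flight = c(1545L, 1141L, 725L),
  tailnum = c("N14228", "N619AA", "N804JB"),
  origin = c("EWR", "JFK", "JFK"),
  stringsAsFactors = FALSE
)

# Hypothetical helper vector: columns that identify airlines or aircraft.
drop_cols <- c("carrier", "flight", "tailnum")

# Keep every column whose name is not in drop_cols.
toy_trimmed <- toy[, !(names(toy) %in% drop_cols)]
names(toy_trimmed)
```

Selecting by name keeps working if columns are added or reordered, at the cost of a slightly longer expression.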

Lastly, we merge the two data frames into one, matching each flight to the weather recorded at its origin airport for that hour, then remove any incomplete entries.

nycfinal <- merge(nycflights, nycweather, by = c("origin", "year", "month", "day", "hour"))
nycfinal <- nycfinal[complete.cases(nycfinal), ]
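Both steps can silently drop rows, so it may help to see where they go. The sketch below uses two made-up toy tables (`toy_flights` and `toy_weather` are assumptions for illustration) to show how an inner join and a complete-case filter each shrink the data:

```r
# Toy flights: three departures, one from an airport-hour with no weather record.
toy_flights <- data.frame(
  origin = c("JFK", "JFK", "LGA"),
  hour = c(6L, 7L, 6L),
  dep_delay = c(5, NA, 20)
)

# Toy weather: observations exist only for JFK.
toy_weather <- data.frame(
  origin = c("JFK", "JFK"),
  hour = c(6L, 7L),
  visib = c(10, 4)
)

# merge() is an inner join by default, so the LGA flight
# (no matching weather row) is dropped here.
joined <- merge(toy_flights, toy_weather, by = c("origin", "hour"))

# complete.cases() then drops any row containing an NA
# (here, the flight with a missing dep_delay).
cleaned <- joined[complete.cases(joined), ]

# Row counts at each stage.
c(flights = nrow(toy_flights), joined = nrow(joined), cleaned = nrow(cleaned))
```

Comparing row counts before and after each step is a quick way to see how much data the cleaning removes.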

We are left with a data frame of 38,726 entries. Each entry is a single flight departing from one of the three airports. However, each airport likely has different policies regarding delays, so we will split the data by airport to better compare the airports against each other.

lga <- subset(nycfinal, origin == "LGA")
jfk <- subset(nycfinal, origin == "JFK")
ewr <- subset(nycfinal, origin == "EWR")
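As an alternative to one `subset()` call per airport, base R's `split()` produces all the per-airport data frames at once. A small sketch on made-up values (the `toy` data frame is an assumption for illustration):

```r
# Toy data: four flights across the three origin codes.
toy <- data.frame(
  origin = c("LGA", "JFK", "EWR", "JFK"),
  dep_delay = c(3, 12, -1, 40)
)

# split() returns a named list with one data frame per origin.
by_airport <- split(toy, toy$origin)
names(by_airport)

# Summaries can then be computed over the list, e.g. mean delay per airport.
mean_delay <- sapply(by_airport, function(d) mean(d$dep_delay))
mean_delay
```

A list keeps the pieces together, which is convenient when the same summary or plot is applied to every airport.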


3. Exploratory Analysis


A. Raw Data

Now that we have our data, let’s examine which weather variables might help predict delays. Let us start with our initial assumptions of wind, precipitation, and visibility.

None of our three initial choices shows a clear linear relationship with departure delay. If we look at the data directly, we’ll find that all three variables are heavily skewed.

Let’s examine other variables in the weather dataset.

It looks as though humidity, air pressure, and dew point might all work better for our purposes, but none is ideal. Next, we examine our response variable, dep_delay.

There might be some suggestion of a normal distribution here, but it is far too skewed to the right. That skew will likely have a strong influence on our models. It will probably be better for us to simulate all of our data, even though we have 285,820 cases.
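Right skew can be checked numerically as well as visually: the mean sits well above the median, and the moment-based skewness is positive. A minimal sketch on a made-up delay vector (the values are illustrative, not taken from the data set):

```r
# Made-up departure delays in minutes: mostly small, with a long right tail.
delays <- c(-5, -3, -2, 0, 0, 1, 2, 4, 15, 60, 180)

delay_mean <- mean(delays)
delay_median <- median(delays)

# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5.
centered <- delays - delay_mean
skew <- mean(centered^3) / mean(centered^2)^1.5

c(mean = delay_mean, median = delay_median, skewness = skew)
```

A mean far above the median together with a positive skewness is the numeric counterpart of the long right tail in the histogram.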

B. Simulated Data

Here the distribution is normal and centered around the actual mean of our data. If we simulate the data for all the other variables, then attempt to see the relationship, we might find more instances of linearity. If not, we will keep to the raw data and use humidity, air pressure, and dew point.
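The simulation code itself is not shown here. One way to draw a normal sample centered on an observed mean is `rnorm()`; the sketch below assumes 100 draws and a made-up `observed` vector, so it illustrates the idea rather than reproducing the original simulation:

```r
# Fixing the seed makes the simulated values, and anything fitted
# to them, reproducible from run to run.
set.seed(2013)

# Made-up observed delays (illustrative only).
observed <- c(-5, -3, -2, 0, 0, 1, 2, 4, 15, 60, 180)

# Draw 100 values from a normal distribution with the observed mean and sd.
simulated <- rnorm(100, mean = mean(observed), sd = sd(observed))

c(observed_mean = mean(observed), simulated_mean = mean(simulated))
```

Note that if each variable is drawn independently in this way, the draws carry no information about how the variables relate to one another in the raw data.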

The data looks better, but nothing has a real suggestion of linearity. It seems as though wind gusts, precipitation, and visibility have no linear relation to airport delays.

None of our variables seems to have a strictly linear relation to airport delays, but the simulated data is somewhat less messy than the actual data. Based on this alone, we can expect the linear model to account for some of the delays, but we should not expect great performance. The use of simulated data does mean that we will not split the data by airport, but rather evaluate the three New York airports as one.


4. Multiple Linear Regression


https://www.statmethods.net/stats/regression.html

We’ll use the first three variables to see if they can provide a strong enough model. Our null hypothesis is that none of the weather variables (visibility, precipitation, and wind gusts) has a linear relationship with flight delay. Our alternative hypothesis is that at least one of them helps to predict how long a flight is delayed.

A. First Group

## 
## Call:
## lm(formula = delay ~ visib + precip + wind, data = simulated.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7963 -2.1438 -0.4747  2.1727 13.5743 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   30.166     22.801   1.323    0.189
## visib         -2.130      2.233  -0.954    0.342
## precip      -548.468    357.203  -1.535    0.128
## wind           0.165      0.530   0.311    0.756
## 
## Residual standard error: 3.737 on 96 degrees of freedom
## Multiple R-squared:  0.03772,    Adjusted R-squared:  0.007646 
## F-statistic: 1.254 on 3 and 96 DF,  p-value: 0.2945

With a large p-value of 0.2945, we fail to reject our null hypothesis: we find no evidence that visibility, precipitation, and wind gusts help predict flight delays. With an R-squared of 0.03772, our model accounts for only about 4% of the simulated data’s variation.
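The overall p-value and R-squared follow directly from the F-statistic and its degrees of freedom in the printed summary above. A short sketch of that arithmetic, using only the printed values (1.254 on 3 and 96 degrees of freedom):

```r
# Values read from the printed model summary.
f_stat <- 1.254
df1 <- 3    # number of predictors
df2 <- 96   # residual degrees of freedom

# Overall p-value: upper tail of the F distribution.
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

# R-squared recovered from F: R^2 = F * df1 / (F * df1 + df2).
r_squared <- (f_stat * df1) / (f_stat * df1 + df2)

# Adjusted R-squared penalizes for the number of predictors.
n <- df1 + df2 + 1   # number of observations
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df2

round(c(p_value = p_value, r_squared = r_squared,
        adj_r_squared = adj_r_squared), 4)
```

Because the inputs are rounded as printed, the results agree with the summary only to about three or four decimal places.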

That being said, the distribution of our residuals is nearly normal.
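Residual normality can be checked numerically as well as by eye. The sketch below fits a toy model to made-up data (the `toy` data frame and its coefficients are assumptions for illustration) and then compares the residual quantiles with normal quantiles:

```r
set.seed(1)  # reproducible toy data

# Made-up predictor and response with normally distributed noise.
toy <- data.frame(x = rnorm(100))
toy$y <- 2 + 0.5 * toy$x + rnorm(100)

fit <- lm(y ~ x, data = toy)
res <- residuals(fit)

# Q-Q coordinates without drawing; calling qqnorm(res) and qqline(res)
# instead would produce the usual plot.
qq <- qqnorm(res, plot.it = FALSE)

# Correlation between theoretical and sample quantiles:
# values near 1 suggest approximately normal residuals.
qq_cor <- cor(qq$x, qq$y)

# Shapiro-Wilk test: a small p-value would indicate departure from normality.
sw <- shapiro.test(res)

c(qq_correlation = qq_cor, shapiro_p = sw$p.value)
```

Normal residuals support the validity of the model's inference, but they say nothing about how much variation the model explains.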

B. Second Group

## 
## Call:
## lm(formula = delay ~ humid + pressure + dew, data = simulated.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.825 -2.260 -0.295  2.105 12.994 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 696.00348  601.21756   1.158    0.250
## humid        -0.01421    0.21341  -0.067    0.947
## pressure     -0.67270    0.59134  -1.138    0.258
## dew           0.01243    0.17791   0.070    0.944
## 
## Residual standard error: 3.784 on 96 degrees of freedom
## Multiple R-squared:  0.01351,    Adjusted R-squared:  -0.01732 
## F-statistic: 0.4383 on 3 and 96 DF,  p-value: 0.7261

We see that the p-value of 0.7261 is even worse in this case, as is the R-squared of 0.01351. We’re forced to conclude that weather data alone is not enough to provide a linear model for predicting flight delays at an airport.

5. Conclusion

It is not feasible to predict delays based on weather at the flight’s origin. To begin with, the distribution of delays was not normal. Far more flights seemed to be on time than delayed. It is likely that delays are not determined entirely by the weather at the airport of origin. If weather is to be considered, one must also know the weather at the destination airport and along the flight path. With those data points, it might be worth revisiting the model.

A stronger indicator might be the number of inbound and outbound flights at any given airport, along with the number of runways and gates available. Delays could also be a result of passengers arriving late or planes having mechanical issues. In short, weather is only one factor in airport delays. However, if our data categorized the reason for each delay, we would probably be able to use it to predict when a delay is weather related.