In this analysis, we examine several regression models for predicting the arrival delay (in minutes) of 2013 airplane flights in the ‘flights’ dataset from the ‘nycflights13’ package. To begin, we must first prepare the data.

Data Preparation

The data preparation is relatively simple. First, we can outright remove observations with missing values, since we have a very large number of observations to work with (300,000+). Then we must consider which variables are meaningful for prediction, and which variables we cannot keep.


colnames(flights)
##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"


Above are the names of the variables in the ‘flights’ dataset. We must be sure not to include any pair of variables that allows us to calculate arrival delay outright (think ‘sched_arr_time’ paired with ‘arr_time’), as that would leak the target directly into the predictors. We can also drop ‘year’, which is the same across all observations, and redundant variables such as ‘hour’, ‘minute’, and ‘time_hour’, which are all just different representations of the scheduled departure time.
In the end, the variables chosen below will be suitable:


colnames(flights_final)
## [1] "month"          "day"            "dep_time"       "sched_dep_time"
## [5] "sched_arr_time" "arr_delay"      "origin"         "distance"


‘month’ and ‘day’ may be important, as arrival delay may change throughout the year. ‘dep_time’ and ‘sched_dep_time’ are useful for evaluating delay based on when the flight leaves, ‘sched_arr_time’ is useful for comparison with the departure time, and ‘origin’ and ‘distance’ help capture where the flights are coming from and how far they travel, as different airports will have different delays.
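
Though the exact preparation code is not shown above, it can be sketched roughly as follows, assuming the dplyr package (the object name flights_final is simply matched to the output above):

library(nycflights13)
library(dplyr)

# Keep the response plus the chosen predictors, drop rows with missing
# values, and encode the categorical origin as a factor
flights_final <- flights %>%
  select(month, day, dep_time, sched_dep_time,
         sched_arr_time, arr_delay, origin, distance) %>%
  na.omit() %>%
  mutate(origin = as.factor(origin))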


I experimented with more variables, such as ‘carrier’ and ‘dest’. These would technically be fine to include in the models, but due to the RAM restrictions in Posit Cloud, I found that the models performed worse when they were included.

Predictive Models


We fit 3 different models to test: a multiple linear regression, a random forest, and a bagging model. Note that I had to drastically reduce the number of observations via random (Monte Carlo) subsampling, as Posit Cloud only offers me 1 GB of RAM for free. However, the models should still be somewhat usable, as each is still built from a sample of 2,000 observations (instead of the full 300,000+).


To train these models, we partition the 2,000 randomly selected observations, using 80% (1,600) for training and the remaining 20% (400) for testing.
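
A minimal sketch of this subsample-and-split step (the seed and the object names flights_small, flights_train, and flights_test are assumptions; the original code is not shown):

set.seed(2013)  # assumed seed, purely for reproducibility
flights_small <- flights_final[sample(nrow(flights_final), 2000), ]
train_idx <- sample(nrow(flights_small), 1600)  # 80% of the sample
flights_train <- flights_small[train_idx, ]
flights_test <- flights_small[-train_idx, ]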

Multiple Regression

summary(flights_lm)
## 
## Call:
## lm(formula = arr_delay ~ ., data = flights_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -80.67 -22.54  -9.96   7.83 803.94 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -8.066479   5.450667  -1.480 0.139096    
## month          -0.058624   0.343092  -0.171 0.864349    
## day             0.133264   0.133021   1.002 0.316581    
## dep_time        0.062956   0.006760   9.313  < 2e-16 ***
## sched_dep_time -0.033034   0.007043  -4.690 2.97e-06 ***
## sched_arr_time -0.012879   0.003781  -3.406 0.000675 ***
## originJFK      -7.478948   2.877632  -2.599 0.009436 ** 
## originLGA      -5.715013   2.900225  -1.971 0.048950 *  
## distance       -0.001944   0.001719  -1.131 0.258261    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.95 on 1591 degrees of freedom
## Multiple R-squared:  0.08703,    Adjusted R-squared:  0.08244 
## F-statistic: 18.96 on 8 and 1591 DF,  p-value: < 2.2e-16


Here is the multiple linear regression, fit on the 1,600 training observations. Note that R processed the categorical ‘origin’ predictor by splitting its levels into individual dummy variables (originJFK and originLGA, with EWR absorbed into the intercept as the baseline). We can see that dep_time and sched_dep_time are the most significant, which makes sense: a flight that leaves later than scheduled tends to arrive later too. Additionally, originJFK and originLGA are also significant, which is interesting!

## [1] "Root mean square error: 34.9573017245197"
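
The RMSE above was presumably computed along these lines (a sketch; pred_lm is an assumed helper name):

flights_lm <- lm(arr_delay ~ ., data = flights_train)  # origin is dummy-coded automatically
pred_lm <- predict(flights_lm, newdata = flights_test)
sqrt(mean((flights_test$arr_delay - pred_lm)^2))  # RMSE on the 400 held-out observations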

Random Forest

summary(flights_rf)
##                 Length Class  Mode     
## call               4   -none- call     
## type               1   -none- character
## predicted       1600   -none- numeric  
## mse              150   -none- numeric  
## rsq              150   -none- numeric  
## oob.times       1600   -none- numeric  
## importance         7   -none- numeric  
## importanceSD       0   -none- NULL     
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            11   -none- list     
## coefs              0   -none- NULL     
## y               1600   -none- numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call


Here is the random forest, trained with 150 trees (150 was chosen as the point at which the error plateaued).

## [1] "Root mean square error: 30.0200927868016"
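
A sketch of the fit and evaluation; plot() is one way to see where the out-of-bag error flattens (the object names are again assumptions):

library(randomForest)
flights_rf <- randomForest(arr_delay ~ ., data = flights_train, ntree = 150)
plot(flights_rf)  # OOB MSE by number of trees; the curve flattens near 150
pred_rf <- predict(flights_rf, newdata = flights_test)
sqrt(mean((flights_test$arr_delay - pred_rf)^2))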

Bagging

summary(flights_bg)
##        Length Class      Mode   
## y      1600   -none-     numeric
## X         7   data.frame list   
## mtrees  100   -none-     list   
## OOB       1   -none-     logical
## comb      1   -none-     logical
## call      4   -none-     call


Here is the bagging model. Unfortunately, this model was very heavy on my limited RAM, so I was only able to train it with nbagg set to 100. However, it still performed comparatively well. From independent testing, it seemed that even with nbagg = 150, the root mean square error was still very close to the random forest’s.

## [1] "Root mean square error: 29.4978332665526"
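
The summary output above matches the ipred package’s bagging(), so a sketch of the fit might look like this (object names are assumptions):

library(ipred)
flights_bg <- bagging(arr_delay ~ ., data = flights_train, nbagg = 100)
pred_bg <- predict(flights_bg, newdata = flights_test)
sqrt(mean((flights_test$arr_delay - pred_bg)^2))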

Conclusion


Comparing the root mean square errors, we can see that the bagging model performs the best of the three. Here are some sample predictions for arrival delay (in minutes) from the bagging model, shown against their true values:

##    Actual  Predicted
## 1     136 119.522541
## 2     -22  -5.204575
## 3       7  -4.873682
## 4     -23  -4.805875
## 5     -14  -1.223808
## 6     164  35.944538
## 7     -11  -5.204575
## 8      -4   4.106499
## 9      31  -4.873682
## 10    -28  -4.873682
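
A table of this shape can be assembled along these lines, reusing the assumed objects from the sketches above:

# Compare the first ten held-out flights against the bagging predictions
head(data.frame(Actual = flights_test$arr_delay,
                Predicted = pred_bg), 10)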


However, according to the data, the average arrival delay is only about 6.5 minutes, while even the best model’s root mean square error is roughly 30 minutes: the typical prediction error is several times larger than the quantity being predicted, so the models produced are not adequate predictors of arrival delay. Still, on more powerful hardware, more predictors and observations could have been used, and greater accuracy could likely have been achieved. Even from this quick analysis, though, we can see that both the random forest and bagging models should outperform multiple linear regression for this data.
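
One way to quantify this inadequacy (a suggested check, not part of the original analysis) is to compare against a trivial baseline that always predicts the training-set mean delay:

baseline <- mean(flights_train$arr_delay)  # always predict the average delay
sqrt(mean((flights_test$arr_delay - baseline)^2))  # baseline RMSE for comparison

If a model’s RMSE is not well below this baseline, its predictors are adding little value.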