In this analysis, we compare several different regression models
for predicting the arrival delay (in minutes) of 2013 airplane flights
in the ‘flights’ dataset from the ‘nycflights13’ package. To begin, we
must first prepare the data.
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
The data preparation is relatively simple. First, we can remove
observations with missing values outright, as we have a very large
number of observations to work with (300,000+). Then we must decide
which variables are meaningful for prediction, and which variables we
cannot keep.
colnames(flights)
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
Above are the names of each of the variables in the ‘flights’
dataset. We must be sure not to include any pair of variables that
allows us to calculate arrival delay outright (think ‘scheduled arrival
time’ paired with ‘arrival time’), as that would leak the response into
the predictors instead of producing a model that genuinely predicts it.
We can also drop other variables, such as ‘year’, which is the same
across all observations. Additionally, some variables are redundant:
‘hour’, ‘minute’, and ‘time_hour’ are all different representations of
the scheduled departure time. In the end, the variables chosen below
are suitable:
colnames(flights_final)
## [1] "month" "day" "dep_time" "sched_dep_time"
## [5] "sched_arr_time" "arr_delay" "origin" "distance"
‘month’ and ‘day’ may be important, as the arrival delay may
change throughout the year. ‘dep_time’ and ‘sched_dep_time’ are useful
for evaluating delay based on when the flight leaves, ‘sched_arr_time’
is useful for comparison with the departure time, and ‘origin’ plus
‘distance’ describe which airport the flight leaves from and how far it
travels, as different airports will have different delays.
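The preparation described above can be sketched in base R as follows. This is a minimal illustration, not necessarily the exact code used for this report: the small `flights` data frame is a made-up stand-in with the same column names, and dropping incomplete rows before selecting columns is an assumption about the order of the steps.

```r
# Toy stand-in with the same column names as `flights` (all values made up).
# With the package installed, use `flights <- nycflights13::flights` instead.
flights <- data.frame(
  year = 2013L, month = c(1L, 1L, 2L, 3L), day = c(1L, 2L, 3L, 4L),
  dep_time = c(517L, NA, 1540L, 2001L),
  sched_dep_time = c(515L, 600L, 1530L, 1945L),
  dep_delay = c(2, NA, 10, 16),
  arr_time = c(830L, NA, 1805L, 2250L),
  sched_arr_time = c(819L, 900L, 1800L, 2230L),
  arr_delay = c(11, NA, 5, 20),
  carrier = c("UA", "AA", "B6", "DL"),
  flight = c(1545L, 1141L, 725L, 461L),
  tailnum = c("N14228", NA, "N804JB", "N668DN"),
  origin = c("EWR", "JFK", "JFK", "LGA"),
  dest = c("IAH", "MIA", "BQN", "ATL"),
  air_time = c(227, NA, 183, 116),
  distance = c(1400, 1089, 1576, 762),
  hour = c(5, 6, 15, 19), minute = c(15, 0, 30, 45),
  time_hour = as.POSIXct("2013-01-01 05:00:00", tz = "UTC"),
  stringsAsFactors = FALSE
)

# Columns kept for modelling (the response, arr_delay, included)
keep <- c("month", "day", "dep_time", "sched_dep_time",
          "sched_arr_time", "arr_delay", "origin", "distance")

# Drop rows with any missing value, then keep only the chosen columns
flights_final <- na.omit(flights)[, keep]

# Treat the origin airport as a categorical predictor
flights_final$origin <- factor(flights_final$origin)

colnames(flights_final)
```

Note that ‘arr_time’ is deliberately left out: together with ‘sched_arr_time’ it would let a model reconstruct the arrival delay directly.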
I also experimented with more variables, such as ‘carrier’ and
‘dest’. These would technically be fine to include in the models, but I
found that, due to RAM restrictions in Posit Cloud, the models
performed worse when they were included.
We make 3 different models to test: a multiple linear
regression, a random forest, and a bagging model. Note that I had to
drastically reduce the number of observations via random sampling, as
Posit Cloud only offers me 1 GB of RAM for free. However, the models
should still be somewhat usable, as they are each still built from a
sample of 2,000 observations (instead of the full 300,000+).
To train these models, we partition the 2,000 randomly selected
observations by selecting 80% (1600) for training, and the remaining 20%
(400) for testing.
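The sampling and the 80/20 partition can be sketched as below. The seed and the placeholder data frame are assumptions for illustration only; in the real analysis `flights_final` would be the cleaned flights data.

```r
set.seed(1)  # hypothetical seed, only so the sketch is reproducible

# Placeholder for the cleaned data; only the number of rows matters here
flights_final <- data.frame(id = seq_len(300000))

# Draw 2,000 rows at random, without replacement
sample_idx <- sample(nrow(flights_final), 2000)
flights_small <- flights_final[sample_idx, , drop = FALSE]

# 80/20 split: 1,600 rows for training, the remaining 400 for testing
n_train <- round(0.8 * nrow(flights_small))
train_idx <- sample(nrow(flights_small), n_train)
flights_train <- flights_small[train_idx, , drop = FALSE]
flights_test  <- flights_small[-train_idx, , drop = FALSE]

c(train = nrow(flights_train), test = nrow(flights_test))
```

Indexing with `-train_idx` guarantees that no observation appears in both sets, so the test error is computed on rows the models have not seen.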
summary(flights_lm)
##
## Call:
## lm(formula = arr_delay ~ ., data = flights_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.67 -22.54 -9.96 7.83 803.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.066479 5.450667 -1.480 0.139096
## month -0.058624 0.343092 -0.171 0.864349
## day 0.133264 0.133021 1.002 0.316581
## dep_time 0.062956 0.006760 9.313 < 2e-16 ***
## sched_dep_time -0.033034 0.007043 -4.690 2.97e-06 ***
## sched_arr_time -0.012879 0.003781 -3.406 0.000675 ***
## originJFK -7.478948 2.877632 -2.599 0.009436 **
## originLGA -5.715013 2.900225 -1.971 0.048950 *
## distance -0.001944 0.001719 -1.131 0.258261
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.95 on 1591 degrees of freedom
## Multiple R-squared: 0.08703, Adjusted R-squared: 0.08244
## F-statistic: 18.96 on 8 and 1591 DF, p-value: < 2.2e-16
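Here is the multiple linear regression, fitted to the 1,600
training observations. Note that R processed the categorical ‘origin’
predictor by splitting its levels into individual indicator predictors
(originJFK and originLGA, with EWR as the baseline level). We can see
that dep_time and sched_dep_time are the most significant, which makes
sense: their coefficients have opposite signs, so together they act as
a rough proxy for departure delay. Additionally, originJFK and
originLGA are also significant, which is interesting! Note, however,
that the R-squared is only about 0.087, so the model explains little of
the variation in arrival delay.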
## [1] "Root mean square error: 34.9573017245197"
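For reference, a root mean square error of this kind can be computed on the held-out test set as sketched below. The `rmse` helper and the simulated data are illustrative assumptions, not the code or data behind the figure above.

```r
# Root mean square error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(1)  # hypothetical seed, for reproducibility of the sketch

# Simulated stand-in data (made-up relationship, not the real flights)
toy <- data.frame(dep_time = runif(200, 500, 2300))
toy$arr_delay <- 0.02 * toy$dep_time + rnorm(200, sd = 10)
train <- toy[1:160, ]
test  <- toy[161:200, ]

# Fit on the training rows, evaluate on the unseen test rows
fit <- lm(arr_delay ~ ., data = train)
test_rmse <- rmse(test$arr_delay, predict(fit, newdata = test))

print(paste("Root mean square error:", test_rmse))
```

Because the error is measured in the units of the response, it can be read directly as a typical prediction error in minutes.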
summary(flights_rf)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 1600 -none- numeric
## mse 150 -none- numeric
## rsq 150 -none- numeric
## oob.times 1600 -none- numeric
## importance 7 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 1600 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
Here is the random forest trained with 150 trees (150 was chosen
as the point at which the error plateaued).
## [1] "Root mean square error: 30.0200927868016"
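One way to check where the error levels off is to inspect the out-of-bag error stored in the fitted forest, as in the sketch below. It assumes the randomForest package is installed and uses simulated stand-in data; it is not necessarily how the value of 150 was chosen for this report.

```r
set.seed(1)  # hypothetical seed, for reproducibility of the sketch

# Simulated stand-in for flights_train (all values made up)
n <- 300
flights_train <- data.frame(
  dep_time = runif(n, 500, 2300),
  distance = runif(n, 100, 2500),
  origin   = factor(sample(c("EWR", "JFK", "LGA"), n, replace = TRUE))
)
flights_train$arr_delay <- 0.02 * flights_train$dep_time + rnorm(n, sd = 15)

oob_mse <- NULL
if (requireNamespace("randomForest", quietly = TRUE)) {
  flights_rf <- randomForest::randomForest(arr_delay ~ ., data = flights_train,
                                           ntree = 150)
  # For regression forests, $mse holds the out-of-bag mean squared error
  # after 1, 2, ..., ntree trees
  oob_mse <- flights_rf$mse
  # Out-of-bag RMSE for the last few forest sizes
  print(tail(sqrt(oob_mse)))
}
```

Calling `plot(flights_rf)` draws the same error curve against the number of trees, which makes a plateau easy to spot by eye.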
summary(flights_bg)
## Length Class Mode
## y 1600 -none- numeric
## X 7 data.frame list
## mtrees 100 -none- list
## OOB 1 -none- logical
## comb 1 -none- logical
## call 4 -none- call
Here is the bagging model. Unfortunately, this model was very
heavy on my limited RAM, and so I was only able to train it with nbagg
set to 100. However, it still performed comparatively well. In separate
test runs with nbagg = 150, its root mean square error was slightly
higher than the random forest's, although in the run reported here
(nbagg = 100) it is slightly lower.
## [1] "Root mean square error: 29.4978332665526"
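A bagging model of this kind can be sketched as below. The components in the summary above (y, X, mtrees, OOB, comb) look like those of a regression bagging object from the ipred package, but that is an assumption, as is the simulated stand-in data.

```r
set.seed(1)  # hypothetical seed, for reproducibility of the sketch

# Simulated stand-in for the real training/test sets (all values made up)
n <- 300
toy <- data.frame(
  dep_time = runif(n, 500, 2300),
  distance = runif(n, 100, 2500),
  origin   = factor(sample(c("EWR", "JFK", "LGA"), n, replace = TRUE))
)
toy$arr_delay <- 0.02 * toy$dep_time + rnorm(n, sd = 15)
flights_train <- toy[1:240, ]
flights_test  <- toy[241:300, ]

bag_rmse <- NULL
if (requireNamespace("ipred", quietly = TRUE)) {
  # nbagg is the number of bootstrap replications, i.e. trees grown
  flights_bg <- ipred::bagging(arr_delay ~ ., data = flights_train,
                               nbagg = 100)
  bag_pred <- predict(flights_bg, newdata = flights_test)
  bag_rmse <- sqrt(mean((flights_test$arr_delay - bag_pred)^2))
  print(paste("Root mean square error:", bag_rmse))
}
```

The fitted object keeps every one of the nbagg trees, so memory use grows with nbagg, which is consistent with the RAM pressure described above.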
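We can see, upon comparing the root mean square errors, that the
bagging model performs the best out of the three models (29.50, against
30.02 for the random forest and 34.96 for the linear regression),
though its margin over the random forest is small. Here are some sample
predictions (from the bagging model) for arrival delay (in minutes)
against their true values: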
## Actual Predicted
## 1 136 119.522541
## 2 -22 -5.204575
## 3 7 -4.873682
## 4 -23 -4.805875
## 5 -14 -1.223808
## 6 164 35.944538
## 7 -11 -5.204575
## 8 -4 4.106499
## 9 31 -4.873682
## 10 -28 -4.873682
However, according to the data, the average arrival delay is
about 6.5 minutes, so a root mean square error of roughly 30 minutes is
several times larger than the typical delay itself. This means that the
models produced are not adequate predictors of arrival delay. If run on
a machine with more memory, more predictors and observations could have
been used, and greater accuracy could likely be achieved. Even from
this quick analysis, we can see that both the random forest and bagging
models outperform multiple linear regression on this data.
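One way to make the adequacy judgement concrete is to compare each model against a naive baseline that always predicts the mean training delay. The sketch below does this on simulated stand-in data; the numbers it produces are not results from this analysis.

```r
# Root mean square error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(1)  # hypothetical seed, for reproducibility of the sketch

# Simulated stand-in data: a weak signal buried in a lot of noise
toy <- data.frame(x = runif(200))
toy$arr_delay <- 6.5 + 20 * toy$x + rnorm(200, sd = 30)
train <- toy[1:160, ]
test  <- toy[161:200, ]

# Baseline: predict the mean training delay for every test flight
baseline_rmse <- rmse(test$arr_delay, mean(train$arr_delay))

# Model: a simple linear regression on the single toy predictor
fit <- lm(arr_delay ~ x, data = train)
model_rmse <- rmse(test$arr_delay, predict(fit, newdata = test))

print(c(baseline = baseline_rmse, model = model_rmse))
```

If a model's error is not clearly below the baseline's, it adds little beyond knowing the average delay.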