Sigrid Keydana, Trivadis
2017/07/10
data("AirPassengers")
alldata <- AirPassengers
train <- window(AirPassengers,end=1958.99)
test <- window(AirPassengers, start = 1959)
fit <- auto.arima(train)
autoplot(forecast(fit,h=20))
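To put a number on the fit, forecast's accuracy() can score the point forecasts against the held-out window; a quick sketch:
fc <- forecast(fit, h = length(test))
accuracy(fc, test)   # RMSE, MAE, MAPE, ... on training and test sets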
Let's compare ARIMA and RNNs on a set of synthetic and real-world benchmarks.
autoregressive part: AR(p):
\( y_{t} = c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \dots + \phi_{p}y_{t-p} + e_{t} \)
differencing: I(d): number of times differencing has to be applied to obtain a stationary series
moving average part: MA(q):
\( y_{t} = c + e_t + \theta_{1}e_{t-1} + \theta_{2}e_{t-2} + \dots + \theta_{q}e_{t-q} \)
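Putting the three parts together, ARIMA(p, d, q) applies the AR(p) and MA(q) terms to the d-times differenced series \( y'_{t} \):
\( y'_{t} = c + \phi_{1}y'_{t-1} + \dots + \phi_{p}y'_{t-p} + \theta_{1}e_{t-1} + \dots + \theta_{q}e_{t-q} + e_{t} \)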
Googling the perfect LSTM picture… ;-)
Source: Stanford CS224n, lecture 11 http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture11.pdf
LSTMs add a further gate and redefine how state is transported, but are similar in effect:
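In equations (the standard LSTM formulation, with \( \sigma \) the logistic sigmoid and \( \odot \) elementwise multiplication):
\( i_{t} = \sigma(W_{i}x_{t} + U_{i}h_{t-1} + b_{i}) \quad f_{t} = \sigma(W_{f}x_{t} + U_{f}h_{t-1} + b_{f}) \quad o_{t} = \sigma(W_{o}x_{t} + U_{o}h_{t-1} + b_{o}) \)
\( c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot \tanh(W_{c}x_{t} + U_{c}h_{t-1} + b_{c}) \)
\( h_{t} = o_{t} \odot \tanh(c_{t}) \)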
Let's get started!
Series: trend_train
ARIMA(4,1,0) with drift
Coefficients:
        ar1      ar2      ar3      ar4  drift
     -0.7700  -0.6768  -0.3714  -0.2292  1.010
s.e.  0.0998   0.1214   0.1214   0.0997  0.076
sigma^2 estimated as 5.433: log likelihood=-222.16
AIC=456.33 AICc=457.24 BIC=471.9
Wait - what kind of parameters/configuration are we talking about?
if (!model_exists) {
  set.seed(22222)
  model <- keras_model_sequential()
  model %>%
    # lstm_units is 32, num_steps (number of timesteps) is 5
    layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features)) %>%
    # a single dense output unit: one-step-ahead forecast
    layer_dense(units = 1) %>%
    compile(
      loss = 'mean_squared_error',
      optimizer = 'adam'
    )
  model %>% fit(
    X_train, y_train,
    batch_size = batch_size, epochs = epochs,
    validation_data = list(X_test, y_test)
  )
  model %>% save_model_hdf5(filepath = paste0(model_name, ".h5"))
} else {
  model <- load_model_hdf5(filepath = paste0(model_name, ".h5"))
}
pred_train <- model %>% predict(X_train, batch_size = batch_size)
pred_test <- model %>% predict(X_test, batch_size = batch_size)
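Predictions are then scored by RMSE; a minimal helper (assuming pred_test has been rescaled back to the original units where applicable):
rmse <- function(actual, predicted) sqrt(mean((as.numeric(actual) - as.numeric(predicted))^2))
rmse(y_test, pred_test)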
OOPS…????
LSTM cannot extrapolate (from the known range of the data).
Because for an in-range test set, it works very well:
ARIMA was working on differenced data … shouldn't LSTM be, too? (It would also eliminate our out-of-range problem.)
We're using 4 instead of 5 timesteps now (differencing shortens each window by one).
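A sketch of that preprocessing, and of mapping the forecasts back to the original scale (variable names like pred_diff are assumptions):
y <- as.numeric(trend_train)
y_diff <- diff(y)   # each window of 5 raw values becomes 4 differences
# ... build windows from y_diff and fit the LSTM as above, with num_steps = 4 ...
# invert the differencing for one-step-ahead forecasts:
# previous observed value plus predicted difference
pred_levels <- y[num_steps + seq_along(pred_diff)] + pred_diff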
Result:
We also scale the differenced data (although in this case the differences are quite small and homogeneous already).
Result:
Quite an improvement! Seems like this one goes to LSTM ;-)
Let's see ARIMA first!
Series: trend_train
ARIMA(4,1,0) with drift
Coefficients:
        ar1      ar2      ar3      ar4  drift
     -0.7700  -0.6768  -0.3714  -0.2292  1.010
s.e.  0.0998   0.1214   0.1214   0.0997  0.076
sigma^2 estimated as 5.433: log likelihood=-222.16
AIC=456.33 AICc=457.24 BIC=471.9
We're using Keras' TimeDistributed layer to get multi-step forecasts.
if (!model_exists) {
  set.seed(22222)
  model <- keras_model_sequential()
  model %>%
    # return_sequences = TRUE makes the LSTM emit an output at every timestep
    layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features),
               return_sequences = TRUE) %>%
    # time_distributed applies the same dense layer to each timestep's output
    time_distributed(layer_dense(units = 1)) %>%
    compile(
      loss = 'mean_squared_error',
      optimizer = 'adam'
    )
  model %>% fit(
    X_train, y_train,
    batch_size = batch_size, epochs = epochs,
    validation_data = list(X_test, y_test)
  )
  model %>% save_model_hdf5(filepath = paste0(model_name, ".h5"))
} else {
  model <- load_model_hdf5(filepath = paste0(model_name, ".h5"))
}
pred_train <- model %>% predict(X_train, batch_size = 1)
pred_test <- model %>% predict(X_test, batch_size = 1)
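One thing to watch: with return_sequences = TRUE the model's output has shape (samples, num_steps, 1), so the targets must be shaped to match (a sketch, assuming y_train starts out as a samples x num_steps matrix):
dim(y_train) <- c(dim(y_train)[1], num_steps, 1)   # reshape targets to the output shape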
So this one goes to ARIMA. But we haven't really done any hyperparameter tuning for LSTM.
As before:
Series: seasonal_train
ARIMA(3,0,2) with non-zero mean
Coefficients:
         ar1      ar2      ar3      ma1     ma2    mean
      0.8573  -0.5073  -0.3516  -1.6424  0.8175  4.0510
s.e.  0.1132   0.1358   0.1121   0.0778  0.0688  0.0186
sigma^2 estimated as 1.073: log likelihood=-132.77
AIC=279.53 AICc=280.88 BIC=297.11
So this one's for LSTM again.
Series: seasonal_train
ARIMA(3,0,2) with non-zero mean
Coefficients:
         ar1      ar2      ar3      ma1     ma2    mean
      0.8573  -0.5073  -0.3516  -1.6424  0.8175  4.0510
s.e.  0.1132   0.1358   0.1121   0.0778  0.0688  0.0186
sigma^2 estimated as 1.073: log likelihood=-132.77
AIC=279.53 AICc=280.88 BIC=297.11
[1] 16 6
Let's see:
Pretty close!
And again, no hyperparameter tuning involved.
ARIMA leading with 2.5 : 1.5
Time to look at some real data.
We will now zoom in on one-step-ahead forecasts only, and we will choose two very different datasets: one a classic we are sure ARIMA will handle flawlessly, and one … well, let's see!
Things to note:
Just one: One-step-ahead rolling forecast
Series: airp_train
ARIMA(1,1,0)(0,1,0)[12]
Coefficients:
          ar1
      -0.2397
s.e.   0.0935
sigma^2 estimated as 103.6: log likelihood=-399.64
AIC=803.28 AICc=803.4 BIC=808.63
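The rolling one-step-ahead forecast itself can be obtained without any re-estimation, via forecast's Arima(..., model = ...) idiom (a sketch; the 1959 split is an assumption carried over from the intro):
# reuse the coefficients fitted on airp_train on the full series;
# the fitted values over the test period are then one-step-ahead forecasts
fit_full <- Arima(AirPassengers, model = fit)
pred_test <- window(fitted(fit_full), start = 1959)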
With this small dataset, we are pretty restricted regarding the number of timesteps used. We're using 12 here:
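For reference, a minimal sketch of how such input tensors can be built (build_windows is an assumed helper, not the talk's actual preprocessing code):
# build sliding windows of length num_steps from a numeric vector;
# X has shape (samples, timesteps, features), y holds each window's next value
build_windows <- function(y, num_steps) {
  n <- length(y) - num_steps
  X <- t(sapply(seq_len(n), function(i) y[i:(i + num_steps - 1)]))
  dim(X) <- c(n, num_steps, 1)
  list(X = X, y = y[(num_steps + 1):length(y)])
}
train_data <- build_windows(as.numeric(airp_train), num_steps = 12)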
… ARIMA leading with 3.5 : 1.5.
Wait - is this really fair? No experimenting with hyperparameters, and not even with the number of hidden units?
For sure it's not - for example, with 128 hidden units instead of 32, we're already down to RMSE = 19.5.
But let's accept the current standings and look at the final dataset…
As per the description…
“Internet traffic data (in bits) from a private ISP with centres in 11 European cities. The data corresponds to a transatlantic link and was collected from 06:57 hours on 7 June to 11:17 hours on 31 July 2005. Hourly data.”
Let's do it the auto.arima way as usual…
train <- traffic_df$bits[1:800]
test <- traffic_df$bits[801:nrow(traffic_df)]
fit <- auto.arima(train)
fit
Series: train
ARIMA(0,0,0) with zero mean
sigma^2 estimated as 2.77e+21: log likelihood=-20884.4
AIC=41770.81 AICc=41770.81 BIC=41775.49
… hm???
Let's give ARIMA some information…
train_ts <- msts(train,seasonal.periods = c(24, 24*7))
fit <- auto.arima(train_ts)
fit
Series: train_ts
ARIMA(0,0,0) with zero mean
sigma^2 estimated as 2.77e+21: log likelihood=-20884.4
AIC=41770.81 AICc=41770.81 BIC=41775.49
Still not what we wanted.
Perhaps an exhaustive search - over all models up to a specific capacity - will make things better?
#fit <- auto.arima(train_ts, stepwise = FALSE)
Honestly, I did not have the patience to wait for this to finish… but it didn't look good ;-)
Anything else we could do?
TBATS (Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components) is a method designed to cope with complex seasonal patterns.
fit <- tbats(train_ts)
fit
TBATS(0.063, {0,0}, -, {<24,5>, <168,6>})
Call: tbats(y = train_ts)
Parameters
Lambda: 0.062943
Alpha: 0.3257773
Gamma-1 Values: 0.09115799 0.1397115
Gamma-2 Values: 0.04353518 -0.02302253
Seed States:
[,1]
[1,] 59.0734941480
[2,] 0.3671490053
[3,] 0.0052660470
[4,] -0.0002850197
[5,] -0.0004175278
[6,] -0.0016098408
[7,] -0.0378062777
[8,] -0.0024151893
[9,] -0.0140024775
[10,] 0.0054073064
[11,] -0.0039832672
[12,] 0.1899170827
[13,] -0.0036139405
[14,] -0.0349537288
[15,] -0.0400270883
[16,] 0.0647173754
[17,] 0.0533587007
[18,] 0.1766179797
[19,] -0.1665938268
[20,] 0.0164856227
[21,] -0.0025646632
[22,] 0.0697743591
[23,] -0.0653420542
Sigma: 0.3752594
AIC: 40404.27
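(Reading the model spec: Box-Cox lambda of about 0.063, no ARMA error terms, no damping, and two seasonal components: period 24 with 5 Fourier term pairs, and period 168 with 6.)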
You might ask whether our rules even allow TBATS to help out ARIMA here…
Anyway, let's see the predictions:
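(A sketch of how they can be produced:)
fc <- forecast(fit, h = length(test))   # TBATS forecast over the test horizon
autoplot(fc)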
We're gonna choose a sequence length of 7 * 24 = 168, i.e., one full week of hourly data.
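In terms of the window helper sketched earlier, and with the raw bit counts scaled first (the standardization choice here is an assumption):
num_steps <- 7 * 24                                  # one week of hourly observations
train_scaled <- (train - mean(train)) / sd(train)    # standardize the raw bit counts
train_data <- build_windows(train_scaled, num_steps)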
And the results?
… after a training time of just 22 epochs!
How's that gonna affect the match?
Hm. How about we just give up that scenario and come to a conclusion?
LSTM works fine for tasks where ARIMA is known to work well, even without any hyperparameter tuning.
LSTM works great for multiple seasonality, again, without any tuning!
A benchmark-like approach (“shootout”) helps us understand models/algorithms better (e.g.: would a GRU work as well as an LSTM on multiple-seasonality time series? Why?)
Thanks for your attention!