Time series shootout: ARIMA vs. LSTM

Sigrid Keydana, Trivadis
2017/07/10

 

 

Forecasting? That’s running ARIMA, right?

Running ARIMA can be as easy as...

 

data("AirPassengers")
alldata <- AirPassengers
train <- window(AirPassengers,end=1958.99)
test <- window(AirPassengers, start = 1959)

fit <- auto.arima(train)
autoplot(forecast(fit,h=20))

[Plot: auto.arima forecast (h = 20) for AirPassengers]

So why would we want to use anything else?

 

  • ARIMA comes with preconditions and restrictions: stationarity (after differencing), constant error variance, no level shifts, linear relationships
  • neural networks can, in principle, model any non-linear function
  • RNNs (Recurrent Neural Networks) in particular look promising for sequential data, even outside NLP

Let's compare ARIMA and RNNs on a set of synthetic and real-world benchmarks.

Quick recap: ARIMA

 

  • autoregressive part: AR(p):

\( y_{t} = c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \dots + \phi_{p}y_{t-p} + e_{t} \)

  • differencing: I(d): number of times differencing has to be applied to obtain a stationary series

  • moving average part: MA(q):

\( y_{t} = c + e_t + \theta_{1}e_{t-1} + \theta_{2}e_{t-2} + \dots + \theta_{q}e_{t-q} \)
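
Putting the parts together, an ARIMA(p, d, q) model applies the AR and MA terms to the d-times differenced series \( y'_{t} \):

\( y'_{t} = c + \phi_{1}y'_{t-1} + \dots + \phi_{p}y'_{t-p} + \theta_{1}e_{t-1} + \dots + \theta_{q}e_{t-q} + e_{t} \)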

 

The contender: LSTM

 

Googling the perfect LSTM picture… ;-)

Why LSTM? Or: Why Gated Recurrent Units, and not plain RNNs?

 

  • How can we preserve relevant information over many timesteps (the vanishing gradient problem)?
  • How can we avoid getting distracted by all the recent stuff?

 

Source: Stanford CS224n, lecture 11
http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture11.pdf

 

Just a little bit more complicated: LSTMs

 

LSTMs add a further gate and redefine how state is transported, but are similar in effect:
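
For reference, the standard LSTM update equations: the forget, input and output gates control what is kept, written and exposed, while the cell state \( c_{t} \) carries information across timesteps:

\( f_{t} = \sigma(W_{f}x_{t} + U_{f}h_{t-1} + b_{f}) \)

\( i_{t} = \sigma(W_{i}x_{t} + U_{i}h_{t-1} + b_{i}) \)

\( o_{t} = \sigma(W_{o}x_{t} + U_{o}h_{t-1} + b_{o}) \)

\( \tilde{c}_{t} = \tanh(W_{c}x_{t} + U_{c}h_{t-1} + b_{c}) \)

\( c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot \tilde{c}_{t} \)

\( h_{t} = o_{t} \odot \tanh(c_{t}) \)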

 

 

Round 1: Linear trend (synthetic dataset 1)

The data

 

[Plot: the synthetic linear trend series]

The tasks

 

  • Task 1: One-step-ahead rolling forecast
  • Task 2: Multi-step-ahead rolling forecast (n=4)

 

Let's get started!

One-step-ahead rolling forecast... ARIMA please!

 

Series: trend_train 
ARIMA(4,1,0) with drift         

Coefficients:
          ar1      ar2      ar3      ar4  drift
      -0.7700  -0.6768  -0.3714  -0.2292  1.010
s.e.   0.0998   0.1214   0.1214   0.0997  0.076

sigma^2 estimated as 5.433:  log likelihood=-222.16
AIC=456.33   AICc=457.24   BIC=471.9

[Plot: ARIMA one-step-ahead rolling forecast on the trend series]
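
One way such a rolling one-step-ahead forecast can be produced with the forecast package is to re-apply the fitted model to the test period without re-estimating it; a minimal sketch, assuming trend_train and trend_test are the train/test series:

fit <- auto.arima(trend_train)
# apply the same model (coefficients fixed) to the test series;
# fitted() then contains the one-step-ahead forecasts over the test period
refit <- Arima(trend_test, model = fit)
onestep_preds <- fitted(refit)

# test-set RMSE of the rolling one-step-ahead forecast
sqrt(mean((onestep_preds - trend_test)^2))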

Now, for the LSTM...

 

Wait - what kind of parameters/configuration are we talking about?
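
Before the model itself, a minimal sketch of how X_train / y_train might be built from the raw series (the actual preprocessing code is not shown here; build_windows and the variable names are assumptions):

library(keras)

num_steps <- 5      # length of each input sequence
num_features <- 1   # univariate series

# turn a numeric vector into overlapping windows: each sample holds
# num_steps consecutive values, the target is the value right after them
build_windows <- function(series, num_steps) {
  n <- length(series) - num_steps
  X <- t(sapply(1:n, function(i) series[i:(i + num_steps - 1)]))
  y <- series[(num_steps + 1):length(series)]
  list(X = array(X, dim = c(n, num_steps, 1)), y = y)
}

train_windows <- build_windows(as.numeric(trend_train), num_steps)
X_train <- train_windows$X
y_train <- train_windows$y
# X_test / y_test are built the same way from the test series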

 

if (!model_exists) {
  set.seed(22222)
  model <- keras_model_sequential() 
  model %>% 
    # lstm_units is 32
    # number of timesteps is 5
    layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features)) %>% 
    layer_dense(units = 1) %>% 
    compile(
      loss = 'mean_squared_error',
      optimizer = 'adam'
    )
  model %>% fit( 
    X_train, y_train, batch_size = batch_size, epochs = epochs, validation_data = list(X_test, y_test)
  )
  model %>% save_model_hdf5(filepath = paste0(model_name, ".h5"))
} else {
  model <- load_model_hdf5(filepath = paste0(model_name, ".h5"))
}

pred_train <- model %>% predict(X_train, batch_size = batch_size)
pred_test <- model %>% predict(X_test, batch_size = batch_size)
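
The test-set error can then be summarized just as for ARIMA; a sketch, assuming predictions and targets are on the same scale:

# RMSE of the LSTM one-step-ahead predictions on the test set
sqrt(mean((pred_test - y_test)^2))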

OK! Results please

 

[Plot: LSTM one-step-ahead forecast on the trend series]

OOPS…????

Seems like LSTM does not like extrapolating

 

(from the known range of the data)

Because for an in-range test set, it works very well:

 

[Plot: LSTM one-step-ahead forecast on an in-range test set]

ARIMA is allowed to work with differences...

 

… shouldn't LSTM be, too? (It would also eliminate our out-of-range problem.)

We're using 4 instead of 5 timesteps now.
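
A sketch of what that could look like (variable names are assumptions; the actual preprocessing code is not shown):

# train the LSTM on first differences instead of raw levels,
# mirroring the I(1) part of the ARIMA model
trend_train_diff <- diff(as.numeric(trend_train))
trend_test_diff  <- diff(as.numeric(trend_test))

num_steps <- 4   # 4 timesteps on the differenced series

# X_train / y_train are built from the differenced series as before;
# for a one-step-ahead forecast, each predicted difference is added back
# to the previous observed level:
# pred_level[t] = observed_level[t - 1] + pred_diff[t]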

Result:

[Plot: LSTM one-step-ahead forecast using differenced data]

We could try to get further improvement by scaling the data

 

(although in this case the differences are quite small and homogeneous already)
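
For example (a sketch; the scaling statistics have to come from the training set only, so nothing leaks in from the test period):

train_mean <- mean(trend_train_diff)
train_sd   <- sd(trend_train_diff)

# standardize both sets with the training-set statistics
trend_train_scaled <- (trend_train_diff - train_mean) / train_sd
trend_test_scaled  <- (trend_test_diff  - train_mean) / train_sd

# predictions are mapped back before computing errors on the original scale:
# pred_diff <- pred_scaled * train_sd + train_mean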

Result:

[Plot: LSTM one-step-ahead forecast using differenced and scaled data]

Quite an improvement! Seems like this one goes to LSTM ;-)

Next up: Multi-step-ahead rolling forecast

 

Let's see ARIMA first!

Series: trend_train 
ARIMA(4,1,0) with drift         

Coefficients:
          ar1      ar2      ar3      ar4  drift
      -0.7700  -0.6768  -0.3714  -0.2292  1.010
s.e.   0.0998   0.1214   0.1214   0.0997  0.076

sigma^2 estimated as 5.433:  log likelihood=-222.16
AIC=456.33   AICc=457.24   BIC=471.9

[Plot: ARIMA multi-step-ahead rolling forecast on the trend series]

For LSTM, let's now directly use differenced and scaled data

 

We're using Keras' TimeDistributed layer to get multi-step forecasts.
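
With return_sequences = TRUE and time_distributed(), the model emits one value per input timestep, so the targets have to be three-dimensional as well. One possible way to build such input/target pairs (an assumption about the data prep, not code from the talk):

# each input window x_t ... x_{t+num_steps-1} is paired with the target
# window x_{t+1} ... x_{t+num_steps}, i.e. the same window shifted by one step
build_seq2seq <- function(series, num_steps) {
  n <- length(series) - num_steps
  X <- t(sapply(1:n, function(i) series[i:(i + num_steps - 1)]))
  Y <- t(sapply(1:n, function(i) series[(i + 1):(i + num_steps)]))
  list(X = array(X, dim = c(n, num_steps, 1)),
       y = array(Y, dim = c(n, num_steps, 1)))
}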

 

if (!model_exists) {
  set.seed(22222)
  model <- keras_model_sequential() 
  model %>% 
    # return_sequences = TRUE: the LSTM emits an output at every timestep
    layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features),
               return_sequences = TRUE) %>% 
    # the same dense layer is applied to each timestep's output
    time_distributed(layer_dense(units = 1)) %>% 
    compile(
      loss = 'mean_squared_error',
      optimizer = 'adam'
    )
  model %>% fit( 
    X_train, y_train, batch_size = batch_size, epochs = epochs, validation_data = list(X_test, y_test)
  )
  model %>% save_model_hdf5(filepath = paste0(model_name, ".h5"))
} else {
  model <- load_model_hdf5(filepath = paste0(model_name, ".h5"))
}
pred_train <- model %>% predict(X_train, batch_size = 1)
pred_test <- model %>% predict(X_test, batch_size = 1)

Multi-step forecast from LSTM

 

[Plot: LSTM multi-step-ahead forecast on the trend series]

So this one goes to ARIMA. But we haven't really done any hyperparameter tuning for LSTM.

 

 

Round 2: Seasonal data (synthetic dataset 2)

The data

 

[Plot: the synthetic seasonal series]

The tasks

 

As before:

  • Task 1: One-step-ahead rolling forecast
  • Task 2: Multi-step-ahead rolling forecast (n=6 this time)

 

ARIMA for seasonal data, one-step-ahead forecast

 

Series: seasonal_train 
ARIMA(3,0,2) with non-zero mean 

Coefficients:
         ar1      ar2      ar3      ma1     ma2    mean
      0.8573  -0.5073  -0.3516  -1.6424  0.8175  4.0510
s.e.  0.1132   0.1358   0.1121   0.0778  0.0688  0.0186

sigma^2 estimated as 1.073:  log likelihood=-132.77
AIC=279.53   AICc=280.88   BIC=297.11

[Plot: ARIMA one-step-ahead forecast on the seasonal series]

Seasonal data, one-step-ahead forecast: Enter: LSTM

 

[Plot: LSTM one-step-ahead forecast on the seasonal series]

So this one's for LSTM again.

ARIMA for seasonal data, multi-step-ahead forecast

 

Series: seasonal_train 
ARIMA(3,0,2) with non-zero mean 

Coefficients:
         ar1      ar2      ar3      ma1     ma2    mean
      0.8573  -0.5073  -0.3516  -1.6424  0.8175  4.0510
s.e.  0.1132   0.1358   0.1121   0.0778  0.0688  0.0186

sigma^2 estimated as 1.073:  log likelihood=-132.77
AIC=279.53   AICc=280.88   BIC=297.11
Dimensions of the rolling multi-step forecast matrix (forecast origins × steps ahead):

[1] 16  6

[Plot: ARIMA multi-step-ahead forecast on the seasonal series]

Seasonal data, multi-step forecast: Enter: LSTM

 

Let's see:

[Plot: LSTM multi-step-ahead forecast on the seasonal series]

Pretty close!

And again, no hyperparameter tuning involved.

Intermediate result after synthetic datasets?

 

ARIMA leading with 2.5 : 1.5

 

Time to look at some real data.

We will now zoom in on one-step-ahead forecasts only, and we will choose two very different datasets: one a classic we are sure ARIMA will handle flawlessly, and one … well, let's see!

 

 

 

Round 3: AirPassengers

The "Iris of time series"

[Plot: the AirPassengers series]

Things to note:

  • we have a trend
  • we have seasonality
  • variance is increasing (heteroscedasticity)

The task

 

Just one: One-step-ahead rolling forecast  

AirPassengers, enter: ARIMA

 

Series: airp_train 
ARIMA(1,1,0)(0,1,0)[12]                    

Coefficients:
          ar1
      -0.2397
s.e.   0.0935

sigma^2 estimated as 103.6:  log likelihood=-399.64
AIC=803.28   AICc=803.4   BIC=808.63

[Plot: ARIMA one-step-ahead forecast on AirPassengers]

AirPassengers, enter: LSTM

 

With this small dataset, we are pretty restricted regarding the number of timesteps used. We're using 12 here:

[Plot: LSTM one-step-ahead forecast on AirPassengers]

Intermediate result, before final round... Drumroll!

 

… ARIMA leading with 3.5 : 1.5.

Wait - is this really fair? No experimenting with hyperparameters, and not even with the number of hidden units?

It surely isn't - for example, with 128 hidden units instead of 32, we're already down to an RMSE of 19.5.

But let's accept the current standings and look at the final dataset…

 

 

Round 4: Internet traffic data

The data

 

As per the description…

Internet traffic data (in bits) from a private ISP with centres in 11 European cities. The data corresponds to a transatlantic link and was collected from 06:57 hours on 7 June to 11:17 hours on 31 July 2005. Hourly data.

[Plot: hourly internet traffic (bits)]

Enter: ARIMA

 

Let's do it the auto.arima way as usual…

 

train <- traffic_df$bits[1:800]
test <- traffic_df$bits[801:nrow(traffic_df)]

fit <- auto.arima(train)
fit
Series: train 
ARIMA(0,0,0) with zero mean     

sigma^2 estimated as 2.77e+21:  log likelihood=-20884.4
AIC=41770.81   AICc=41770.81   BIC=41775.49

 

… hm???

ARIMA: 2nd try

 

Let's give ARIMA some information…

 

train_ts <- msts(train,seasonal.periods = c(24, 24*7))
fit <- auto.arima(train_ts)
fit
Series: train_ts 
ARIMA(0,0,0)             with zero mean     

sigma^2 estimated as 2.77e+21:  log likelihood=-20884.4
AIC=41770.81   AICc=41770.81   BIC=41775.49

Still not what we wanted.

ARIMA: 3rd try

 

Perhaps an exhaustive search - over all models up to a given order, instead of auto.arima's stepwise search - will make things better?

 

#fit <- auto.arima(train_ts, stepwise = FALSE)

 

Honestly, I did not have the patience to wait for this to finish… but it didn't look good ;-)

Anything else we could do?

Trying TBATS

 

TBATS (Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components) is a method designed to cope with complex seasonal patterns.

 

fit <- tbats(train_ts)
fit
TBATS(0.063, {0,0}, -, {<24,5>, <168,6>})

Call: tbats(y = train_ts)

Parameters
  Lambda: 0.062943
  Alpha: 0.3257773
  Gamma-1 Values: 0.09115799 0.1397115
  Gamma-2 Values: 0.04353518 -0.02302253

Seed States:
               [,1]
 [1,] 59.0734941480
 [2,]  0.3671490053
 [3,]  0.0052660470
 [4,] -0.0002850197
 [5,] -0.0004175278
 [6,] -0.0016098408
 [7,] -0.0378062777
 [8,] -0.0024151893
 [9,] -0.0140024775
[10,]  0.0054073064
[11,] -0.0039832672
[12,]  0.1899170827
[13,] -0.0036139405
[14,] -0.0349537288
[15,] -0.0400270883
[16,]  0.0647173754
[17,]  0.0533587007
[18,]  0.1766179797
[19,] -0.1665938268
[20,]  0.0164856227
[21,] -0.0025646632
[22,]  0.0697743591
[23,] -0.0653420542

Sigma: 0.3752594
AIC: 40404.27

 

Predictions from TBATS

 

You might ask whether our rules even allow letting TBATS help out ARIMA here…

Anyway, let's see the predictions:
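
One way the TBATS predictions and their error might be obtained (a sketch; test is the assumed hold-out vector from above):

# forecast the length of the test period with the fitted TBATS model
fc <- forecast(fit, h = length(test))

# RMSE on the test set
sqrt(mean((as.numeric(fc$mean) - test)^2))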

 

[Plot: TBATS predictions on the traffic test set]

 

Time for LSTM...!

 

We're gonna choose a sequence length of 7*24, i.e., 168.

And the results?

 

[Plot: LSTM predictions on the traffic test set]

 

That's about half the RMSE of TBATS...

 

… after training for just 22 epochs!

How's that gonna affect the match?

Hm. How about we just give up that scenario and come to a conclusion?

Conclusion

 

  • LSTM works fine for tasks where ARIMA is known to work well, even

    • given very little data
    • without much hyperparameter tuning
    • with just a single LSTM layer
    • for multi-step forecasts
  • LSTM works great for multiple seasonality, again without any tuning!

  • A benchmark-like approach (“shootout”) helps us understand models/algorithms better (e.g.: would a GRU work as well as an LSTM on time series with multiple seasonality? Why?)

Questions?

 

Thanks for your attention!