Time series shootout: ARIMA vs. LSTM

Sigrid Keydana, Trivadis
2017/07/10

 

 

Forecasting? That’s running ARIMA, right?

Running ARIMA can be as easy as...

 

data("AirPassengers")
alldata <- AirPassengers
train <- window(AirPassengers,end=1958.99)
test <- window(AirPassengers, start = 1959)

fit <- auto.arima(train)
autoplot(forecast(fit,h=20))

[Plot: auto.arima forecast (h = 20) for AirPassengers]

So why would we want to use anything else?

 

  • ARIMA comes with preconditions and restrictions: stationarity (after differencing), constant error variance, no level shifts, linear relationships
  • neural networks can, in principle, model any non-linear function
  • RNNs (Recurrent Neural Networks) in particular look promising for sequential data, even outside NLP

Let's compare ARIMA and RNNs on a set of synthetic and real-world benchmarks.

Quick recap: ARIMA

 

  • autoregressive part: AR(p):

\( y_{t} = c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \dots + \phi_{p}y_{t-p} + e_{t} \)

  • differencing: I(d): number of times differencing has to be applied to obtain a stationary series

  • moving average part: MA(q):

\( y_{t} = c + e_t + \theta_{1}e_{t-1} + \theta_{2}e_{t-2} + \dots + \theta_{q}e_{t-q} \)
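
Putting the parts together, an ARIMA(p, d, q) model applies the AR and MA terms to the d-times differenced series \( y'_{t} \):

\( y'_{t} = c + \phi_{1}y'_{t-1} + \dots + \phi_{p}y'_{t-p} + \theta_{1}e_{t-1} + \dots + \theta_{q}e_{t-q} + e_{t} \)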

 

The contender: LSTM

 

Googling the perfect LSTM picture… ;-)

Why LSTM? Or: Why Gated Recurrent Units, and not plain RNNs?

 

  • How can we preserve relevant information over many timesteps (the vanishing gradient problem)?
  • How can we avoid getting distracted by all the recent stuff?

 

Source: Stanford CS224n, lecture 11
http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture11.pdf

 

Just a little bit more complicated: LSTMs

 

LSTMs add a further gate and redefine how state is transported, but are similar in effect:
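
For reference, the standard LSTM update equations: the forget, input and output gates control what is kept, written and exposed, while the cell state \( c_{t} \) carries information across timesteps:

\( f_{t} = \sigma(W_{f}x_{t} + U_{f}h_{t-1} + b_{f}) \)

\( i_{t} = \sigma(W_{i}x_{t} + U_{i}h_{t-1} + b_{i}) \)

\( o_{t} = \sigma(W_{o}x_{t} + U_{o}h_{t-1} + b_{o}) \)

\( \tilde{c}_{t} = \tanh(W_{c}x_{t} + U_{c}h_{t-1} + b_{c}) \)

\( c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot \tilde{c}_{t} \)

\( h_{t} = o_{t} \odot \tanh(c_{t}) \)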

 

 

Round 1: Linear trend (synthetic dataset 1)

The data

 

[Plot: the synthetic linear trend series]

The tasks

 

  • Task 1: One-step-ahead rolling forecast
  • Task 2: Multi-step-ahead rolling forecast (n=4)

 

Let's get started!

One-step-ahead rolling forecast... ARIMA please!

 

Series: trend_train 
ARIMA(4,1,0) with drift         

Coefficients:
          ar1      ar2      ar3      ar4  drift
      -0.7700  -0.6768  -0.3714  -0.2292  1.010
s.e.   0.0998   0.1214   0.1214   0.0997  0.076

sigma^2 estimated as 5.433:  log likelihood=-222.16
AIC=456.33   AICc=457.24   BIC=471.9

[Plot: ARIMA one-step-ahead rolling forecast on the trend series]
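
One way such a rolling one-step-ahead forecast can be produced with the forecast package is to re-apply the fitted model to the test period without re-estimating it; a minimal sketch, assuming trend_train and trend_test are the train/test series:

fit <- auto.arima(trend_train)
# apply the same model (coefficients fixed) to the test series;
# fitted() then contains the one-step-ahead forecasts over the test period
refit <- Arima(trend_test, model = fit)
onestep_preds <- fitted(refit)

# test-set RMSE of the rolling one-step-ahead forecast
sqrt(mean((onestep_preds - trend_test)^2))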

Now, for the LSTM...

 

Wait - what kind of parameters/configuration are we talking about?
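
Before the model itself, a minimal sketch of how X_train / y_train might be built from the raw series (the actual preprocessing code is not shown here; build_windows and the variable names are assumptions):

library(keras)

num_steps <- 5      # length of each input sequence
num_features <- 1   # univariate series

# turn a numeric vector into overlapping windows: each sample holds
# num_steps consecutive values, the target is the value right after them
build_windows <- function(series, num_steps) {
  n <- length(series) - num_steps
  X <- t(sapply(1:n, function(i) series[i:(i + num_steps - 1)]))
  y <- series[(num_steps + 1):length(series)]
  list(X = array(X, dim = c(n, num_steps, 1)), y = y)
}

train_windows <- build_windows(as.numeric(trend_train), num_steps)
X_train <- train_windows$X
y_train <- train_windows$y
# X_test / y_test are built the same way from the test series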

 

if (!model_exists) {
  set.seed(22222)
  model <- keras_model_sequential() 
  model %>% 
    # lstm_units is 32
    # number of timesteps is 5
    layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features)) %>% 
    layer_dense(units = 1) %>% 
    compile(
      loss = 'mean_squared_error',
      optimizer = 'adam'
    )
  model %>% fit( 
    X_train, y_train, batch_size = batch_size, epochs = epochs, validation_data = list(X_test, y_test)
  )
  model %>% save_model_hdf5(filepath = paste0(model_name, ".h5"))
} else {
  model <- load_model_hdf5(filepath = paste0(model_name, ".h5"))
}

pred_train <- model %>% predict(X_train, batch_size = batch_size)
pred_test <- model %>% predict(X_test, batch_size = batch_size)
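
The test-set error can then be summarized just as for ARIMA; a sketch, assuming predictions and targets are on the same scale:

# RMSE of the LSTM one-step-ahead predictions on the test set
sqrt(mean((pred_test - y_test)^2))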

OK! Results please

 

[Plot: LSTM one-step-ahead forecast on the trend series]

OOPS…????

Seems like LSTM does not like extrapolating

 

(from the known range of the data)

Because for an in-range test set, it works very well:

 

[Plot: LSTM one-step-ahead forecast on an in-range test set]

ARIMA is allowed to work with differences...

 

… shouldn't LSTM be, too? (It would also eliminate our out-of-range problem.)

We're using 4 instead of 5 timesteps now.
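
A sketch of what that could look like (variable names are assumptions; the actual preprocessing code is not shown):

# train the LSTM on first differences instead of raw levels,
# mirroring the I(1) part of the ARIMA model
trend_train_diff <- diff(as.numeric(trend_train))
trend_test_diff  <- diff(as.numeric(trend_test))

num_steps <- 4   # 4 timesteps on the differenced series

# X_train / y_train are built from the differenced series as before;
# for a one-step-ahead forecast, each predicted difference is added back
# to the previous observed level:
# pred_level[t] = observed_level[t - 1] + pred_diff[t]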

Result:

[Plot: LSTM one-step-ahead forecast using differenced data]

We could try to get further improvement by scaling the data

 

(although in this case the differences are quite small and homogeneous already)
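
For example (a sketch; the scaling statistics have to come from the training set only, so nothing leaks in from the test period):

train_mean <- mean(trend_train_diff)
train_sd   <- sd(trend_train_diff)

# standardize both sets with the training-set statistics
trend_train_scaled <- (trend_train_diff - train_mean) / train_sd
trend_test_scaled  <- (trend_test_diff  - train_mean) / train_sd

# predictions are mapped back before computing errors on the original scale:
# pred_diff <- pred_scaled * train_sd + train_mean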

Result:

[Plot: LSTM one-step-ahead forecast using differenced and scaled data]

Quite an improvement! Seems like this one goes to LSTM ;-)

Next up: Multi-step-ahead rolling forecast

 

Let's see ARIMA first!

Series: trend_train 
ARIMA(4,1,0) with drift         

Coefficients:
          ar1      ar2      ar3      ar4  drift
      -0.7700  -0.6768  -0.3714  -0.2292  1.010
s.e.   0.0998   0.1214   0.1214   0.0997  0.076

sigma^2 estimated as 5.433:  log likelihood=-222.16
AIC=456.33   AICc=457.24   BIC=471.9

[Plot: ARIMA multi-step-ahead rolling forecast on the trend series]

For LSTM, let's now directly use differenced and scaled data

 

We're using Keras' TimeDistributed layer to get multi-step forecasts.
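
With return_sequences = TRUE and time_distributed(), the model emits one value per input timestep, so the targets have to be three-dimensional as well. One possible way to build such input/target pairs (an assumption about the data prep, not code from the talk):

# each input window x_t ... x_{t+num_steps-1} is paired with the target
# window x_{t+1} ... x_{t+num_steps}, i.e. the same window shifted by one step
build_seq2seq <- function(series, num_steps) {
  n <- length(series) - num_steps
  X <- t(sapply(1:n, function(i) series[i:(i + num_steps - 1)]))
  Y <- t(sapply(1:n, function(i) series[(i + 1):(i + num_steps)]))
  list(X = array(X, dim = c(n, num_steps, 1)),
       y = array(Y, dim = c(n, num_steps, 1)))
}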

 

if (!model_exists) {
  set.seed(22222)
  model <- keras_model_sequential() 
  model %>% 
    # return_sequences = TRUE: the LSTM emits an output at every timestep
    layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features),
               return_sequences = TRUE) %>% 
    # the same dense layer is applied to each timestep's output
    time_distributed(layer_dense(units = 1)) %>% 
    compile(
      loss = 'mean_squared_error',
      optimizer = 'adam'
    )
  model %>% fit( 
    X_train, y_train, batch_size = batch_size, epochs = epochs, validation_data = list(X_test, y_test)
  )
  model %>% save_model_hdf5(filepath = paste0(model_name, ".h5"))
} else {
  model <- load_model_hdf5(filepath = paste0(model_name, ".h5"))
}
pred_train <- model %>% predict(X_train, batch_size = 1)
pred_test <- model %>% predict(X_test, batch_size = 1)

Multi-step forecast from LSTM

 

[Plot: LSTM multi-step-ahead forecast on the trend series]

So this one goes to ARIMA. But we haven't really done any hyperparameter tuning for LSTM.

 

 

Round 2: Seasonal data (synthetic dataset 2)

The data

 

[Plot: the synthetic seasonal series]

The tasks

 

As before:

  • Task 1: One-step-ahead rolling forecast
  • Task 2: Multi-step-ahead rolling forecast (n=6 this time)

 

ARIMA for seasonal data, one-step-ahead forecast

 

Series: seasonal_train 
ARIMA(3,0,2) with non-zero mean 

Coefficients:
         ar1      ar2      ar3      ma1     ma2    mean
      0.8573  -0.5073  -0.3516  -1.6424  0.8175  4.0510
s.e.  0.1132   0.1358   0.1121   0.0778  0.0688  0.0186

sigma^2 estimated as 1.073:  log likelihood=-132.77
AIC=279.53   AICc=280.88   BIC=297.11

[Plot: ARIMA one-step-ahead forecast on the seasonal series]

Seasonal data, one-step-ahead forecast: Enter: LSTM

 

[Plot: LSTM one-step-ahead forecast on the seasonal series]

So this one's for LSTM again.

ARIMA for seasonal data, multi-step-ahead forecast

 

Series: seasonal_train 
ARIMA(3,0,2) with non-zero mean 

Coefficients:
         ar1      ar2      ar3      ma1     ma2    mean
      0.8573  -0.5073  -0.3516  -1.6424  0.8175  4.0510
s.e.  0.1132   0.1358   0.1121   0.0778  0.0688  0.0186

sigma^2 estimated as 1.073:  log likelihood=-132.77
AIC=279.53   AICc=280.88   BIC=297.11
Dimensions of the rolling multi-step forecast matrix (forecast origins × steps ahead):

[1] 16  6

[Plot: ARIMA multi-step-ahead forecast on the seasonal series]

Seasonal data, multi-step forecast: Enter: LSTM

 

Let's see:

[Plot: LSTM multi-step-ahead forecast on the seasonal series]

Pretty close!

And again, no hyperparameter tuning involved.

Intermediate result after synthetic datasets?

 

ARIMA leading with 2.5 : 1.5

 

Time to look at some real data.

We will now zoom in on one-step-ahead forecasts only, and we will choose two very different datasets: one a classic we are sure ARIMA will handle flawlessly, and one … well, let's see!

 

 

 

Round 3: AirPassengers

The "Iris of time series"

[Plot: the AirPassengers series]

Things to note:

  • we have a trend
  • we have seasonality
  • variance is increasing (heteroscedasticity)

The task

 

Just one: One-step-ahead rolling forecast  

AirPassengers, enter: ARIMA

 

Series: airp_train 
ARIMA(1,1,0)(0,1,0)[12]                    

Coefficients:
          ar1
      -0.2397
s.e.   0.0935

sigma^2 estimated as 103.6:  log likelihood=-399.64
AIC=803.28   AICc=803.4   BIC=808.63

[Plot: ARIMA one-step-ahead forecast on AirPassengers]

AirPassengers, enter: LSTM

 

With this small dataset, we are pretty restricted regarding the number of timesteps used. We're using 12 here:

[Plot: LSTM one-step-ahead forecast on AirPassengers]

Intermediate result, before final round... Drumroll!

 

… ARIMA leading with 3.5 : 1.5.

Wait - is this really fair? No experimenting with hyperparameters, and not even with the number of hidden units?

It surely isn't - for example, with 128 hidden units instead of 32, we're already down to an RMSE of 19.5.

But let's accept the current standings and look at the final dataset…

 

 

Round 4: Internet traffic data

The data

 

As per the description…

Internet traffic data (in bits) from a private ISP with centres in 11 European cities. The data corresponds to a transatlantic link and was collected from 06:57 hours on 7 June to 11:17 hours on 31 July 2005. Hourly data.

[Plot: hourly internet traffic (bits)]

Enter: ARIMA

 

Let's do it the auto.arima way as usual…

 

train <- traffic_df$bits[1:800]
test <- traffic_df$bits[801:nrow(traffic_df)]

fit <- auto.arima(train)
fit
Series: train 
ARIMA(0,0,0) with zero mean     

sigma^2 estimated as 2.77e+21:  log likelihood=-20884.4
AIC=41770.81   AICc=41770.81   BIC=41775.49

 

… hm???

ARIMA: 2nd try

 

Let's give ARIMA some information…

 

train_ts <- msts(train,seasonal.periods = c(24, 24*7))
fit <- auto.arima(train_ts)
fit
Series: train_ts 
ARIMA(0,0,0)             with zero mean     

sigma^2 estimated as 2.77e+21:  log likelihood=-20884.4
AIC=41770.81   AICc=41770.81   BIC=41775.49

Still not what we wanted.

ARIMA: 3rd try

 

Perhaps an exhaustive search - over all models up to a given order, instead of auto.arima's stepwise search - will make things better?

 

#fit <- auto.arima(train_ts, stepwise = FALSE)

 

Honestly, I did not have the patience to wait for this to finish… but it didn't look good ;-)

Anything else we could do?

Trying TBATS

 

TBATS (Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components) is a method designed to cope with complex seasonal patterns.

 

fit <- tbats(train_ts)
fit
TBATS(0.063, {0,0}, -, {<24,5>, <168,6>})

Call: tbats(y = train_ts)

Parameters
  Lambda: 0.062943
  Alpha: 0.3257773
  Gamma-1 Values: 0.09115799 0.1397115
  Gamma-2 Values: 0.04353518 -0.02302253

Seed States:
               [,1]
 [1,] 59.0734941480
 [2,]  0.3671490053
 [3,]  0.0052660470
 [4,] -0.0002850197
 [5,] -0.0004175278
 [6,] -0.0016098408
 [7,] -0.0378062777
 [8,] -0.0024151893
 [9,] -0.0140024775
[10,]  0.0054073064
[11,] -0.0039832672
[12,]  0.1899170827
[13,] -0.0036139405
[14,] -0.0349537288
[15,] -0.0400270883
[16,]  0.0647173754
[17,]  0.0533587007
[18,]  0.1766179797
[19,] -0.1665938268
[20,]  0.0164856227
[21,] -0.0025646632
[22,]  0.0697743591
[23,] -0.0653420542

Sigma: 0.3752594
AIC: 40404.27

 

Predictions from TBATS

 

You might ask whether our rules even allow letting TBATS help out ARIMA here…

Anyway, let's see the predictions:
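
One way the TBATS predictions and their error might be obtained (a sketch; test is the assumed hold-out vector from above):

# forecast the length of the test period with the fitted TBATS model
fc <- forecast(fit, h = length(test))

# RMSE on the test set
sqrt(mean((as.numeric(fc$mean) - test)^2))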

 

[Plot: TBATS predictions on the traffic test set]

 

Time for LSTM...!

 

We're gonna choose a sequence length of 7*24, i.e., 168.

And the results?

 

[Plot: LSTM predictions on the traffic test set]

 

That's about half the RMSE of TBATS...

 

… after training for just 22 epochs!

How's that gonna affect the match?

Hm. How about we just give up that scenario and come to a conclusion?

Conclusion

 

  • LSTM works fine for tasks where ARIMA is known to work well, even

    • given very little data
    • without much hyperparameter tuning
    • with just a single LSTM layer
    • for multi-step forecasts
  • LSTM works great for multiple seasonality, again without any tuning!

  • A benchmark-like approach (“shootout”) helps us understand models/algorithms better (e.g.: would a GRU work as well as an LSTM on time series with multiple seasonality? Why?)

Questions?

 

Thanks for your attention!