Dynamic forecasts - with Bayesian linear models and neural networks

Sigrid Keydana, Trivadis
2017/11/14

About me & my employer

 

Trivadis

  • DACH-based IT consulting and service company, from traditional technologies to big data/machine learning/data science

My background

  • from psychology/statistics via software development and database engineering to data science and ML/DL

My passion

  • machine learning and deep learning
  • data science and (Bayesian) statistics
  • explanation/understanding over prediction accuracy

Where to find me

 

 

Dynamic models for time series prediction - why?

Our task today: forecast men's 400m Olympic winning times

 

It's 2000, just before the Olympics. This is the data we have:

[Plot: men's 400m Olympic winning times up to 1996]

This looks like we might even try linear regression...

 

lm(seconds ~ year, male400_1996) %>% coefficients()
(Intercept)        year 
  207.46609    -0.08257 

[Plot: linear regression fit to the winning times up to 1996]
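 

The corresponding forecast for the 2000 Olympics could be obtained like this (a sketch, assuming `male400_1996` has the columns `year` and `seconds` used above):

fit <- lm(seconds ~ year, male400_1996)
# predicted winning time (in seconds) for the year 2000
predict(fit, newdata = data.frame(year = 2000))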

Unfortunately...

 

[Plot: the linear fit extrapolated beyond 1996]

OK then... let's just use ARIMA

 

(yes, the series is not totally regular, but let me make the point ;-))

library(forecast)

arima_fit <- auto.arima(male400_1996$seconds)
arima_fit$coef
    ma1   drift 
-0.7999 -0.3807 

[Plot: ARIMA forecast of the winning times]
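 

For reference, a forecast like the one in the plot could be produced roughly like this (a sketch, using forecast() from the forecast package on the arima_fit object from above):

library(forecast)
# one-step-ahead forecast from the fitted ARIMA model
forecast(arima_fit, h = 1)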

What if...

 

What if we didn't have to:

  • use static coefficients?

or even:

  • model linear relationships?

No static coefficients needed: Dynamic Linear Models (DLMs)

 

Deep and non-linear: LSTM networks

 

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

 

 

Dynamic linear models

State space models

 

Source: Wikipedia

 

 

States form a Markov chain.

Observations are independent conditional on states.

Hidden states, observations, and time

 

  • Hidden states evolve in time (state equation):

         \( \theta_t = G_t\theta_{t-1} + w_t \)

  • Hidden states generate observables (observation equation):

         \( Y_t = F_t\theta_t + v_t \)

  • System noise is normally distributed with mean \( 0 \) and variance \( W_t \):

         \( w_t \sim \mathcal{N}(0, W_t) \)

  • Observation noise is normally distributed with mean \( 0 \) and variance \( V_t \):

         \( v_t \sim \mathcal{N}(0, V_t) \)
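 

To make the two equations concrete, here is a minimal simulation sketch for the univariate case (\( G, F, W, V \) are all assumed scalar and constant; names follow the dlm convention GG/FF):

set.seed(777)
n  <- 100
GG <- 1;   FF <- 1      # state transition and observation "matrices" (scalars here)
W  <- 0.1; V  <- 1      # system and observation noise variances
theta <- numeric(n); y <- numeric(n)
theta[1] <- rnorm(1, 0, sqrt(W))                        # initial state
y[1]     <- FF * theta[1] + rnorm(1, 0, sqrt(V))
for (t in 2:n) {
  theta[t] <- GG * theta[t - 1] + rnorm(1, 0, sqrt(W))  # state equation
  y[t]     <- FF * theta[t] + rnorm(1, 0, sqrt(V))      # observation equation
}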

How is this dynamic?

 

In theory, any and all of

  • system noise
  • observation noise
  • state equation
  • observation equation

could change over time.

 

Example: Where's the unicorn?

 

  • All we have is a sequence of (noisy) observations
  • And a sequence of hypothesized true positions (states)

How can we optimise our assumptions about the true position?

Bayesian updating

 

  • At any time, we have a current best estimate of where the unicorn is (given past observations)
  • We compute the next true position (state) based on the state equation (this is the prior)
  • We predict the corresponding observation based on the observation equation (this is the likelihood)
  • As soon as the new observation arrives, we can correct our estimate and compute the posterior on the state

Show me the math: Kalman filter

 

Here's the model again

  • state equation \( \theta_t = G_t\theta_{t-1} + w_t \), where \( w_t \sim \mathcal{N}(0, W_t) \)
  • observation equation \( Y_t = F_t\theta_t + v_t \), where \( v_t \sim \mathcal{N}(0, V_t) \)
  • prior \( \theta_0 \sim \mathcal{N}_p(m_0, C_0) \)

 

Kalman steps (given: \( \theta_{t-1}|y_{1:t-1} \sim \mathcal{N}(m_{t-1}, C_{t-1}) \))

  1. One-step-ahead predictive distribution of \( \theta_t \) given \( y_{1:t-1} \):

    \( E(\theta_t|y_{1:t-1}) = a_t = G_tm_{t-1} \)             \( Var(\theta_t|y_{1:t-1}) = R_t = G_tC_{t-1}G'_t + W_t \)

  2. One-step-ahead predictive distribution of \( Y_t \) given \( y_{1:t-1} \):

    \( E(Y_t|y_{1:t-1}) = f_t = F_t a_t \)             \( Var(Y_t|y_{1:t-1}) = Q_t = F_tR_tF'_t + V_t \)

  3. Filtering distribution of \( \theta_t \) given \( y_{1:t} \):

    \( E(\theta_t|y_{1:t}) = m_t = a_t + R_tF'_tQ_t^{-1}e_t \)             \( Var(\theta_t|y_{1:t}) = C_t = R_t - R_tF'_tQ_t^{-1}F_tR_t \)       where \( e_t = Y_t - f_t \) is the forecast error.

Kalman filter in a nutshell

 

We correct our prediction by how much it differs from the incoming observation.

   

Posterior estimate: \( E(\theta_t|y_{1:t}) = m_t = a_t + R_tF'_tQ_t^{-1}e_t \)

Kalman gain: \( R_tF'_tQ_t^{-1} \)
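 

The three steps translate almost literally into code. A sketch of a single Kalman update in the univariate case (scalar \( G, F, W, V \); m and C are the previous filtering mean and variance):

kalman_step <- function(y, m, C, GG, FF, W, V) {
  # 1. one-step-ahead predictive distribution of the state
  a <- GG * m
  R <- GG * C * GG + W
  # 2. one-step-ahead predictive distribution of the observation
  f <- FF * a
  Q <- FF * R * FF + V
  # 3. filtering distribution: correct by Kalman gain * forecast error
  K <- R * FF / Q          # Kalman gain
  e <- y - f               # forecast error
  list(m = a + K * e, C = R - K * FF * R)
}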

Aside: Filtering vs. smoothing

 

  • In filtering, we base our estimates of the true states on all observations made so far.
  • In smoothing, we form our opinion about the states “in hindsight”, based on all past and (available) future observations.

 

Local level model ("random walk plus noise")

 

State equation: \( \mu_t = \mu_{t-1} + w_t, w_t \sim \mathcal{N}(0, W) \)

Observation equation: \( Y_t = \mu_t + v_t, v_t \sim \mathcal{N}(0, V) \)

\( G = F = 1 \)

 

 

W and V could be time-invariant or time-dependent.

Local level example

 

library(dlm)

# local level model; observation (dV) and system (dW) variances
# are estimated by maximum likelihood on the log scale
build <- function(params) {
  dlmModPoly(order = 1, dV = exp(params[1]), dW = exp(params[2]))
}
initial_params <- rep(log(var(diff(lakeSup))), 2)
mle <- dlmMLE(lakeSup, initial_params, build)

model <- dlmModPoly(order = 1, dV = exp(mle$par[1]), dW = exp(mle$par[2]))
Map(function(x) round(x,2), unlist(model)) %>% unlist()
         m0          C0          FF           V          GG           W 
       0.00 10000000.00        1.00        9.47        1.00        0.12 

[Plot: the lakeSup series]

 

Local level example: Filtering and smoothing

 

[Plot: filtered and smoothed state estimates for the lakeSup series]
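 

The filtered and smoothed estimates shown here could be obtained along these lines (a sketch, using the model fitted on the previous slide and assuming lakeSup is a ts object):

filt     <- dlmFilter(lakeSup, model)
smoothed <- dlmSmooth(lakeSup, model)
# drop the first element (the prior), then plot against the data
ts.plot(lakeSup, dropFirst(filt$m), dropFirst(smoothed$s),
        col = c("black", "blue", "red"))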

Two types of uncertainty

 

How would the filtering and smoothing curves have looked had we specified equal degrees of system and observation noise?

 

model2 <- dlmModPoly(order = 1, dV = 10, dW = 10)
filt2 <- dlmFilter(lakeSup, model2)
smooth2 <- dlmSmooth(lakeSup, model2)

[Plot: filtering and smoothing with dV = dW = 10]

 

Local level example: Forecasting

 

[Plot: local level forecasts for the lakeSup series]
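 

A hedged sketch of how such forecasts might be produced with dlm (the horizon of 10 is arbitrary):

filt <- dlmFilter(lakeSup, model)
fore <- dlmForecast(filt, nAhead = 10)
fore$f   # expected values of the future observations
fore$Q   # forecast variances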

Another local level example: Nvidia stock prices

 

build <- function(params) {
  dlmModPoly(order = 1, dV = exp(params[1]), dW = exp(params[2]))
}
initial_params <- rep(log_var_diff, 2)
mle <- dlmMLE(nvda, initial_params, build)

model <- dlmModPoly(order = 1, dV = exp(mle$par[1]), dW = exp(mle$par[2]))
Map(function(x) round(x,2), unlist(model)) %>% unlist()
         m0          C0          FF           V          GG           W 
       0.00 10000000.00        1.00        0.43        1.00       12.06 

[Plot: the nvda series]

Nvidia: filtering and smoothing

 

[Plot: filtering and smoothing for the nvda series]

(No, the lines did not turn invisible…)

Nvidia: Forecasting (or, let's beat the market)

 

[Plot: forecasts for the nvda series]

Not with this we won't…

DLMs - besides the local level model

 

  • linear trend model
  • seasonal models
  • regression models (with X as observation-generating matrix F and the coefficients as hidden state)
  • multivariate models, exploiting underlying commonalities (“seemingly unrelated time series equations”)
  • multivariate regression models (“seemingly unrelated regression models”)

Again, system and observation noise (as well as the transition equations) could be time-invariant or changing over time.
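 

In the dlm package, such component models can be combined by simple addition (model superposition). A sketch, with placeholder variance values:

library(dlm)
# local linear trend plus a monthly seasonal component (variances are placeholders)
model <- dlmModPoly(order = 2, dV = 1, dW = c(0.1, 0.01)) +
  dlmModSeas(frequency = 12, dV = 0, dW = c(0.1, rep(0, 10)))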

Nvidia revisited: Exploring relationships

 

Comparing Nvidia to AMD, MU (Micron), and IBM:

[Plot: NVDA, AMD, MU, and IBM stock prices]

Does it look like Nvidia and IBM have a somewhat “complementary” relationship?

Nvidia and IBM: Linear regression

 

lm(nvda ~ ibm) %>% summary()

Call:
lm(formula = nvda ~ ibm)

Residuals:
   Min     1Q Median     3Q    Max 
-44.30  -8.29   2.70  10.08  29.24 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 457.2733    14.4070    31.7   <2e-16 ***
ibm          -2.0611     0.0914   -22.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.9 on 186 degrees of freedom
Multiple R-squared:  0.732, Adjusted R-squared:  0.731 
F-statistic:  509 on 1 and 186 DF,  p-value: <2e-16

So there's a strong relationship...

 

… but is it really constant over time?

lm(nvda ~ ibm) %>% residuals() %>% qqnorm()

[Plot: normal Q-Q plot of the regression residuals]

 

DLM regression to the rescue!

 

build <- function(u) {
  # dV: observation variance; dW: system variances for intercept and slope
  dlmModReg(ibm, dV = exp(u[1]), dW = exp(u[2 : 3]))
}
mle <- dlmMLE(nvda, parm = rep(0, 3), build)
model <- build(mle$par)

model$FF
     [,1] [,2]
[1,]    1    1
model$GG
     [,1] [,2]
[1,]    1    0
[2,]    0    1
model$W
     [,1]      [,2]
[1,] 12.2 0.000e+00
[2,]  0.0 2.393e-12
model$V
       [,1]
[1,] 0.3863

Dynamic regression coefficients

 

Final coefficients:

[1] 170.38366   0.05716

 

Development of the coefficients over time (excluding the noisy initial estimates):

[Plot: dynamic regression coefficients over time]
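 

A sketch of how these numbers might be extracted (assuming nvda, ibm, and the fitted model from the previous slide):

filt <- dlmFilter(nvda, model)
# most recent filtered coefficients (intercept and slope)
filt$m[nrow(filt$m), ]
# their development over time; dropFirst() removes the prior mean m0
coefs <- dropFirst(filt$m)
matplot(coefs, type = "l")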

Back to the Olympics question...

 

Let's model this as a dynamic regression problem!

build <- function(u) {
  # dV: observation variance; dW: system variances for intercept and slope
  dlmModReg(male400_1996$year, dV = exp(u[1]), dW = exp(u[2 : 3]))
}
mle <- dlmMLE(male400_1996$seconds, parm = rep(0, 3), build)
model <- build(mle$par)
filt <- dlmFilter(male400_1996$seconds, model)
filt$m[1 + length(male400_1996$seconds), ] # most recent coefficients
[1] 209.48317  -0.08335

[Plot: dynamic regression fit to the 400m winning times]
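 

Using the most recent filtered coefficients, the point forecast for 2000 is then just the regression line evaluated at year 2000 (a sketch, ignoring forecast uncertainty):

# latest intercept and slope from the filtered state
coef_latest <- filt$m[1 + length(male400_1996$seconds), ]
# predicted winning time (seconds) for the 2000 Olympics
coef_latest[1] + coef_latest[2] * 2000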

 

 

Deep Learning: Long Short-Term Memory (LSTM)

Learning temporal relationships: Recurrent Neural Networks

 

Source: Goodfellow et al. 2016, Deep Learning

I need to forget: Long Short-Term Memory

Stock market again: Micron Technology

 

[Plot: Micron Technology (MU) stock price]

 

Clearly remembering the downward walk won't help us to predict the near future…

One-step-ahead forecasts from LSTM

 

[Plot: one-step-ahead LSTM forecasts]

 

Here, the sequence length (length of the hidden state) was chosen to be 30.
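 

The one-step-ahead model itself is not shown here; a minimal Keras sketch of such a setup might look like this (assuming lstm_units, num_steps, and num_features defined as on the following slides):

library(keras)

model <- keras_model_sequential()
model %>%
  layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features)) %>%
  layer_dense(units = 1) %>%   # a single next-step prediction
  compile(
    loss = 'mean_squared_error',
    optimizer = 'adam'
  )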

How about multi-step forecasts?

 

A way to get several predictions from an LSTM network in Keras: use a TimeDistributed layer

 

model <- keras_model_sequential() 

model %>% 
  # return_sequences = TRUE: emit an output at every time step
  layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features), return_sequences = TRUE) %>% 
  # apply the same dense layer to each time step's output
  time_distributed(layer_dense(units = 1)) %>% 
  compile(
    loss = 'mean_squared_error',
    optimizer = 'adam'
  )
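 

To train and use this model, inputs are arrays of shape (samples, num_steps, num_features) and, because of return_sequences = TRUE plus the TimeDistributed dense layer, targets of shape (samples, num_steps, 1). A usage sketch with hypothetical X_train / y_train / X_test arrays:

# X_train: (samples, num_steps, num_features); y_train: (samples, num_steps, 1)
model %>% fit(X_train, y_train, epochs = 50, batch_size = 32)
preds <- model %>% predict(X_test)   # (samples, num_steps, 1)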

What happens if we produce forecasts keeping the same state size?

 

[Plot: multi-step forecasts, state size 30]

Lowering state size to 12

 

[Plot: multi-step forecasts, state size 12]

... or even, to 6.

 

[Plot: multi-step forecasts, state size 6]

Clearly LSTM cannot show its true strength on a random walk...

 

… let's give it some more adequate input!

[Plot: a series with multiple seasonality]

This series has multiple seasonality.

Multiple seasonality

 

  • requires special treatment in classical state space models (use TBATS instead of ARIMA)
  • here, we just choose the length of the “outer” period as state size:

 

lstm_units <- 24 * 7 # 24 hours per day * 7 days per week

model <- keras_model_sequential()
model %>% 
  layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features),
             return_sequences = TRUE) %>% 
  time_distributed(layer_dense(units = 1)) %>% 
  compile(
    loss = 'mean_squared_error',
    optimizer = 'adam'
  )

168-step-ahead forecasts!

 

And here are the forecasts, 168 steps ahead (displaying only 3 of them)

[Plot: 168-step-ahead forecasts (3 shown)]

Takeaways

 

  • Time series forecasting is not just ARIMA, exponential smoothing & co.
  • Especially for dynamically changing data, interesting alternatives are DLMs and Deep Learning
  • Lots of opportunities for experimentation… let's find the unicorn :-)

 

Have fun!