Dynamic forecasts - with Bayesian linear models and neural networks

Sigrid Keydana, Trivadis
2017/11/14

About me & my employer

 

Trivadis

  • DACH-based IT consulting and service company, from traditional technologies to big data/machine learning/data science

My background

  • from psychology/statistics via software development and database engineering to data science and ML/DL

My passion

  • machine learning and deep learning
  • data science and (Bayesian) statistics
  • explanation/understanding over prediction accuracy

Where to find me

 

 

Dynamic models for time series prediction - why?

Our task today: forecast men's 400m Olympic winning times

 

It's 2000, just before the Olympics. This is the data we have:

[Plot: men's 400m Olympic winning times up to 1996]

This looks like we might even try linear regression...

 

lm(seconds ~ year, male400_1996) %>% coefficients()
(Intercept)        year 
  207.46609    -0.08257 

[Plot: linear regression fit to the winning times up to 1996]
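 

The corresponding forecast for the 2000 Olympics could be obtained like this (a sketch, assuming `male400_1996` has the columns `year` and `seconds` used above):

fit <- lm(seconds ~ year, male400_1996)
# predicted winning time (in seconds) for the year 2000
predict(fit, newdata = data.frame(year = 2000))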

Unfortunately...

 

[Plot: the linear fit extrapolated beyond 1996]

OK then... let's just use ARIMA

 

(yes, the series is not totally regular, but let me make the point ;-))

library(forecast)

arima_fit <- auto.arima(male400_1996$seconds)
arima_fit$coef
    ma1   drift 
-0.7999 -0.3807 

[Plot: ARIMA forecast of the winning times]
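 

For reference, a forecast like the one in the plot could be produced roughly like this (a sketch, using forecast() from the forecast package on the arima_fit object from above):

library(forecast)
# one-step-ahead forecast from the fitted ARIMA model
forecast(arima_fit, h = 1)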

What if...

 

What if we didn't have to:

  • use static coefficients?

or even:

  • model linear relationships?

No static coefficients needed: Dynamic Linear Models (DLMs)

 

Deep and non-linear: LSTM networks

 

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

 

 

Dynamic linear models

State space models

 

Source: Wikipedia

 

 

States form a Markov chain.

Observations are independent conditional on states.

Hidden states, observations, and time

 

  • Hidden states evolve in time (state equation):

         \( \theta_t = G_t\theta_{t-1} + w_t \)

  • Hidden states generate observables (observation equation):

         \( Y_t = F_t\theta_t + v_t \)

  • System noise is normally distributed with mean \( 0 \) and variance \( W_t \):

         \( w_t \sim \mathcal{N}(0, W_t) \)

  • Observation noise is normally distributed with mean \( 0 \) and variance \( V_t \):

         \( v_t \sim \mathcal{N}(0, V_t) \)
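 

To make the two equations concrete, here is a minimal simulation sketch for the univariate case (\( G, F, W, V \) are all assumed scalar and constant; names follow the dlm convention GG/FF):

set.seed(777)
n  <- 100
GG <- 1;   FF <- 1      # state transition and observation "matrices" (scalars here)
W  <- 0.1; V  <- 1      # system and observation noise variances
theta <- numeric(n); y <- numeric(n)
theta[1] <- rnorm(1, 0, sqrt(W))                        # initial state
y[1]     <- FF * theta[1] + rnorm(1, 0, sqrt(V))
for (t in 2:n) {
  theta[t] <- GG * theta[t - 1] + rnorm(1, 0, sqrt(W))  # state equation
  y[t]     <- FF * theta[t] + rnorm(1, 0, sqrt(V))      # observation equation
}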

How is this dynamic?

 

In theory, any and all of

  • system noise
  • observation noise
  • state equation
  • observation equation

could change over time.

 

Example: Where's the unicorn?

 

  • All we have is a sequence of (noisy) observations
  • And a sequence of hypothesized true positions (states)

How can we optimise our assumptions about the true position?

Bayesian updating

 

  • At any time, we have a current best estimate of where the unicorn is (given past observations)
  • We compute the next true position (state) based on the state equation (this is the prior)
  • We predict the corresponding observation based on the observation equation (this is the likelihood)
  • As soon as the new observation arrives, we can correct our estimate and compute the posterior on the state

Show me the math: Kalman filter

 

Here's the model again

  • state equation \( \theta_t = G_t\theta_{t-1} + w_t \), where \( w_t \sim \mathcal{N}(0, W_t) \)
  • observation equation \( Y_t = F_t\theta_t + v_t \), where \( v_t \sim \mathcal{N}(0, V_t) \)
  • prior \( \theta_0 \sim \mathcal{N}_p(m_0, C_0) \)

 

Kalman steps (given: \( \theta_{t-1}|y_{1:t-1} \sim \mathcal{N}(m_{t-1}, C_{t-1}) \))

  1. One-step-ahead predictive distribution of \( \theta_t \) given \( y_{1:t-1} \):

    \( E(\theta_t|y_{1:t-1}) = a_t = G_tm_{t-1} \)             \( Var(\theta_t|y_{1:t-1}) = R_t = G_tC_{t-1}G'_t + W_t \)

  2. One-step-ahead predictive distribution of \( Y_t \) given \( y_{1:t-1} \):

    \( E(Y_t|y_{1:t-1}) = f_t = F_t a_t \)             \( Var(Y_t|y_{1:t-1}) = Q_t = F_tR_tF'_t + V_t \)

  3. Filtering distribution of \( \theta_t \) given \( y_{1:t} \):

    \( E(\theta_t|y_{1:t}) = m_t = a_t + R_tF'_tQ_t^{-1}e_t \)             \( Var(\theta_t|y_{1:t}) = C_t = R_t - R_tF'_tQ_t^{-1}F_tR_t \)       where \( e_t = Y_t - f_t \) is the forecast error.

Kalman filter in a nutshell

 

We correct our prediction by how much it differs from the incoming observation.

   

Posterior estimate: \( E(\theta_t|y_{1:t}) = m_t = a_t + R_tF'_tQ_t^{-1}e_t \)

Kalman gain: \( R_tF'_tQ_t^{-1} \)
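 

The three steps translate almost literally into code. A sketch of a single Kalman update in the univariate case (scalar \( G, F, W, V \); m and C are the previous filtering mean and variance):

kalman_step <- function(y, m, C, GG, FF, W, V) {
  # 1. one-step-ahead predictive distribution of the state
  a <- GG * m
  R <- GG * C * GG + W
  # 2. one-step-ahead predictive distribution of the observation
  f <- FF * a
  Q <- FF * R * FF + V
  # 3. filtering distribution: correct by Kalman gain * forecast error
  K <- R * FF / Q          # Kalman gain
  e <- y - f               # forecast error
  list(m = a + K * e, C = R - K * FF * R)
}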

Aside: Filtering vs. smoothing

 

  • In filtering, we base our estimates of the true states on all observations made so far.
  • In smoothing, we form our opinion about the states “in hindsight”, based on all past and (available) future observations.

 

Local level model ("random walk plus noise")

 

State equation: \( \mu_t = \mu_{t-1} + w_t, w_t \sim \mathcal{N}(0, W) \)

Observation equation: \( Y_t = \mu_t + v_t, v_t \sim \mathcal{N}(0, V) \)

\( G = F = 1 \)

 

 

W and V could be time-invariant or time-dependent.

Local level example

 

library(dlm)

# local level model; observation (dV) and system (dW) variances
# are estimated by maximum likelihood on the log scale
build <- function(params) {
  dlmModPoly(order = 1, dV = exp(params[1]), dW = exp(params[2]))
}
initial_params <- rep(log(var(diff(lakeSup))), 2)
mle <- dlmMLE(lakeSup, initial_params, build)

model <- dlmModPoly(order = 1, dV = exp(mle$par[1]), dW = exp(mle$par[2]))
Map(function(x) round(x,2), unlist(model)) %>% unlist()
         m0          C0          FF           V          GG           W 
       0.00 10000000.00        1.00        9.47        1.00        0.12 

[Plot: the lakeSup series]

 

Local level example: Filtering and smoothing

 

[Plot: filtered and smoothed state estimates for the lakeSup series]
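 

The filtered and smoothed estimates shown here could be obtained along these lines (a sketch, using the model fitted on the previous slide and assuming lakeSup is a ts object):

filt     <- dlmFilter(lakeSup, model)
smoothed <- dlmSmooth(lakeSup, model)
# drop the first element (the prior), then plot against the data
ts.plot(lakeSup, dropFirst(filt$m), dropFirst(smoothed$s),
        col = c("black", "blue", "red"))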

Two types of uncertainty

 

How would the filtering and smoothing curves have looked had we specified equal degrees of system and observation noise?

 

model2 <- dlmModPoly(order = 1, dV = 10, dW = 10)
filt2 <- dlmFilter(lakeSup, model2)
smooth2 <- dlmSmooth(lakeSup, model2)

[Plot: filtering and smoothing with dV = dW = 10]

 

Local level example: Forecasting

 

[Plot: local level forecasts for the lakeSup series]
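 

A hedged sketch of how such forecasts might be produced with dlm (the horizon of 10 is arbitrary):

filt <- dlmFilter(lakeSup, model)
fore <- dlmForecast(filt, nAhead = 10)
fore$f   # expected values of the future observations
fore$Q   # forecast variances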

Another local level example: Nvidia stock prices

 

build <- function(params) {
  dlmModPoly(order = 1, dV = exp(params[1]), dW = exp(params[2]))
}
initial_params <- rep(log_var_diff, 2)
mle <- dlmMLE(nvda, initial_params, build)

model <- dlmModPoly(order = 1, dV = exp(mle$par[1]), dW = exp(mle$par[2]))
Map(function(x) round(x,2), unlist(model)) %>% unlist()
         m0          C0          FF           V          GG           W 
       0.00 10000000.00        1.00        0.43        1.00       12.06 

[Plot: the nvda series]

Nvidia: filtering and smoothing

 

[Plot: filtering and smoothing for the nvda series]

(No, the lines did not turn invisible…)

Nvidia: Forecasting (or, let's beat the market)

 

[Plot: forecasts for the nvda series]

Not with this we won't…

DLMs - besides the local level model

 

  • linear trend model
  • seasonal models
  • regression models (with X as observation-generating matrix F and the coefficients as hidden state)
  • multivariate models, exploiting underlying commonalities (“seemingly unrelated time series equations”)
  • multivariate regression models (“seemingly unrelated regression models”)

Again, system and observation noise (as well as the transition equations) could be time-invariant or changing over time.
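 

In the dlm package, such component models can be combined by simple addition (model superposition). A sketch, with placeholder variance values:

library(dlm)
# local linear trend plus a monthly seasonal component (variances are placeholders)
model <- dlmModPoly(order = 2, dV = 1, dW = c(0.1, 0.01)) +
  dlmModSeas(frequency = 12, dV = 0, dW = c(0.1, rep(0, 10)))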

Nvidia revisited: Exploring relationships

 

Comparing Nvidia to AMD, MU (Micron), and IBM:

[Plot: NVDA, AMD, MU, and IBM stock prices]

Does it look like Nvidia and IBM have a somewhat “complementary” relationship?

Nvidia and IBM: Linear regression

 

lm(nvda ~ ibm) %>% summary()

Call:
lm(formula = nvda ~ ibm)

Residuals:
   Min     1Q Median     3Q    Max 
-44.30  -8.29   2.70  10.08  29.24 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 457.2733    14.4070    31.7   <2e-16 ***
ibm          -2.0611     0.0914   -22.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.9 on 186 degrees of freedom
Multiple R-squared:  0.732, Adjusted R-squared:  0.731 
F-statistic:  509 on 1 and 186 DF,  p-value: <2e-16

So there's a strong relationship...

 

… but is it really constant over time?

lm(nvda ~ ibm) %>% residuals() %>% qqnorm()

[Plot: normal Q-Q plot of the regression residuals]

 

DLM regression to the rescue!

 

build <- function(u) {
  # dV: observation variance; dW: system variances for intercept and slope
  dlmModReg(ibm, dV = exp(u[1]), dW = exp(u[2 : 3]))
}
mle <- dlmMLE(nvda, parm = rep(0, 3), build)
model <- build(mle$par)

model$FF
     [,1] [,2]
[1,]    1    1
model$GG
     [,1] [,2]
[1,]    1    0
[2,]    0    1
model$W
     [,1]      [,2]
[1,] 12.2 0.000e+00
[2,]  0.0 2.393e-12
model$V
       [,1]
[1,] 0.3863

Dynamic regression coefficients

 

Final coefficients:

[1] 170.38366   0.05716

 

Development of the coefficients over time (excluding the noisy initial estimates):

[Plot: dynamic regression coefficients over time]
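 

A sketch of how these numbers might be extracted (assuming nvda, ibm, and the fitted model from the previous slide):

filt <- dlmFilter(nvda, model)
# most recent filtered coefficients (intercept and slope)
filt$m[nrow(filt$m), ]
# their development over time; dropFirst() removes the prior mean m0
coefs <- dropFirst(filt$m)
matplot(coefs, type = "l")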

Back to the Olympics question...

 

Let's model this as a dynamic regression problem!

build <- function(u) {
  # dV: observation variance; dW: system variances for intercept and slope
  dlmModReg(male400_1996$year, dV = exp(u[1]), dW = exp(u[2 : 3]))
}
mle <- dlmMLE(male400_1996$seconds, parm = rep(0, 3), build)
model <- build(mle$par)
filt <- dlmFilter(male400_1996$seconds, model)
filt$m[1 + length(male400_1996$seconds), ] # most recent coefficients
[1] 209.48317  -0.08335

[Plot: dynamic regression fit to the 400m winning times]
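 

Using the most recent filtered coefficients, the point forecast for 2000 is then just the regression line evaluated at year 2000 (a sketch, ignoring forecast uncertainty):

# latest intercept and slope from the filtered state
coef_latest <- filt$m[1 + length(male400_1996$seconds), ]
# predicted winning time (seconds) for the 2000 Olympics
coef_latest[1] + coef_latest[2] * 2000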

 

 

Deep Learning: Long Short-Term Memory (LSTM)

Learning temporal relationships: Recurrent Neural Networks

 

Source: Goodfellow et al. 2016, Deep Learning

I need to forget: Long Short-Term Memory

Stock market again: Micron Technology

 

[Plot: Micron Technology (MU) stock price]

 

Clearly remembering the downward walk won't help us to predict the near future…

One-step-ahead forecasts from LSTM

 

[Plot: one-step-ahead LSTM forecasts]

 

Here, the sequence length (length of the hidden state) was chosen to be 30.
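 

The one-step-ahead model itself is not shown here; a minimal Keras sketch of such a setup might look like this (assuming lstm_units, num_steps, and num_features defined as on the following slides):

library(keras)

model <- keras_model_sequential()
model %>%
  layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features)) %>%
  layer_dense(units = 1) %>%   # a single next-step prediction
  compile(
    loss = 'mean_squared_error',
    optimizer = 'adam'
  )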

How about multi-step forecasts?

 

A way to get several predictions from an LSTM network in Keras: use a TimeDistributed layer

 

model <- keras_model_sequential() 

model %>% 
  # return_sequences = TRUE: emit an output at every time step
  layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features), return_sequences = TRUE) %>% 
  # apply the same dense layer to each time step's output
  time_distributed(layer_dense(units = 1)) %>% 
  compile(
    loss = 'mean_squared_error',
    optimizer = 'adam'
  )
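 

To train and use this model, inputs are arrays of shape (samples, num_steps, num_features) and, because of return_sequences = TRUE plus the TimeDistributed dense layer, targets of shape (samples, num_steps, 1). A usage sketch with hypothetical X_train / y_train / X_test arrays:

# X_train: (samples, num_steps, num_features); y_train: (samples, num_steps, 1)
model %>% fit(X_train, y_train, epochs = 50, batch_size = 32)
preds <- model %>% predict(X_test)   # (samples, num_steps, 1)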

What happens if we produce forecasts keeping the same state size?

 

[Plot: multi-step forecasts, state size 30]

Lowering state size to 12

 

[Plot: multi-step forecasts, state size 12]

... or even, to 6.

 

[Plot: multi-step forecasts, state size 6]

Clearly LSTM cannot show its true strength on a random walk...

 

… let's give it some more adequate input!

[Plot: a series with multiple seasonality]

This series has multiple seasonality.

Multiple seasonality

 

  • requires special treatment in classical state space models (use TBATS instead of ARIMA)
  • here, we just choose the length of the “outer” period as state size:

 

lstm_units <- 24 * 7 # 24 hours per day * 7 days per week

model <- keras_model_sequential()
model %>% 
  layer_lstm(units = lstm_units, input_shape = c(num_steps, num_features),
             return_sequences = TRUE) %>% 
  time_distributed(layer_dense(units = 1)) %>% 
  compile(
    loss = 'mean_squared_error',
    optimizer = 'adam'
  )

168-step-ahead forecasts!

 

And here are the forecasts, 168 steps ahead (displaying only 3 of them)

[Plot: 168-step-ahead forecasts (3 shown)]

Takeaways

 

  • Time series forecasting is not just ARIMA, exponential smoothing & co.
  • Especially for dynamically changing data, interesting alternatives are DLMs and Deep Learning
  • Lots of opportunities for experimentation… let's find the unicorn :-)

 

Have fun!