Traditional time series models can rely on assumptions about dependence structure that poorly describe real time series. For example, time series are rarely simple auto-regressive processes, and data are often non-Gaussian. Bayesian state-space models estimate latent states together with parameters, so there is broad flexibility in their posterior covariance structure.


Logistics

Resources

Software:

source('clarkFunctions2024.r')

Valuable reference:

From last time

Simulate populations of size n = 5, 10, 20 individuals with positive inbreeding. Use 10 chains to see if you can recover parameter estimates. Is recovery better, worse, or the same as with negative inbreeding?

Today

This vignette

Objectives


State-space models for time series

Hierarchical modeling allows for conditional independence of observations, taking up the dependence at a second, latent stage: the underlying process. These latent states are random and, therefore, must be estimated.

There have been many efforts to handle the combination of process and observation error in time series. Classical state-space models include the Kalman filter (Harvey 1989), an updating procedure whereby the prior estimate of the state is ‘corrected’ by how well it can predict the next observation. It is carried out as an iterative procedure.

Perhaps I would like to estimate the underlying state \(x_t\), given that \(y_t\) is observed. The underlying state might be population abundance (\(x_t\)) observed as counts on plots or sightings (\(y_t\)). The underlying state might be dissolved O\(_2\) in a river observed as assays from vials extracted from different locations and times. I might want to estimate the transition parameters in a function \(f(x_t)\) that describes the process, e.g., population growth or the time variation in gas exchange.

I might want to predict \(y_{t+k}\) for some time \(k\) in the future. Why predict \(y_{t+k}\), rather than just \(x_{t+k}\)? I could be interested in both, but I may eventually observe \(y_t\), whereas I will never observe \(x_t\). To compare observation with prediction, I need \(y_t\). The Kalman filter is not ideal for nonlinear models or for non-Gaussian errors. In this section I describe how the Bayesian framework is used for time series data with underlying states. Here I summarize basic elements of the approach.

Conditional independence in time series data

The joint distribution of two observations factors as \([x_1, x_2] = [x_1|x_2][x_2]\). The independence assumption allows us to treat it as a product of (independent) marginal distributions, \([x_1] [x_2]\). This assumption implies that \([x_1|x_2] = [x_1]\): the distribution of \(x_1\) does not change when \(x_2\) changes. Time series data are usually not independent. In a state-space model, dependence is taken up at a second stage.

*Three state variables arranged in a time series.*



Recall that the joint probability for three states is written like this: \(\left[ x_1, x_2, x_3 \right]\). For most models I assume that observations are independent and, thus, the likelihood is

\[ \left[ x_1, x_2, x_3 \right] = \left[ x_1 \right]\left[ x_2 \right]\left[ x_3 \right] \] The requirement for independent samples motivates much of experimental design. So what do I mean here by the term independent?

There is a common misconception that ‘independent’ means ‘dissimilar’. Here’s a definition from a recent text referring to the assumption of independence: ‘This assumption is violated when the value of one observation tends to be too similar to the values of other observations.’ There are many definitions like this in the literature. If I consider a simple regression model, the \(Y\)’s corresponding to similar values of \(X\) will be ‘similar’. If this were not true, then I would not fit the regression model. If the mean structure takes up the relationships between responses (e.g., \(\mathbf{x}' \boldsymbol{\beta}\) in regression), then the residual variation could be largely independent. If it does not, then there is residual dependence.
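To make this concrete, here is a small illustration of my own (not part of the course code): in both simulations below, the \(Y\)'s at similar \(X\)'s are similar because of the mean structure, but only the second has residual dependence, which the residual autocorrelation reveals.

set.seed(1)
n  <- 200
xx <- sort( runif(n) )
e1 <- rnorm( n, 0, .2 )                                        # independent errors
e2 <- as.numeric( arima.sim( list(ar = .8), n = n, sd = .2 ) ) # AR(1) errors
y1 <- 1 + 2*xx + e1
y2 <- 1 + 2*xx + e2

par( mfrow = c(1,2), bty = 'n', las = 1 )
acf( resid( lm( y1 ~ xx ) ), main = 'independent errors' )     # no residual dependence
acf( resid( lm( y2 ~ xx ) ), main = 'AR(1) errors' )           # residual dependence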

A likelihood that assumes independence could be equal to the joint distribution only in the special case where a conditional distribution is equal to a marginal distribution, i.e., \([x_1 | x_2] = [x_1]\). If samples are not independent, then

\[ \left[ x_1, x_2, x_3 \right] \neq \left[ x_1 \right]\left[ x_2 \right]\left[ x_3 \right] \] because marginal distributions are not the same as conditional distributions.

The joint distribution on the left is hard to think about, because it is \(n = 3\)-dimensional. I am not very good at thinking in 3 dimensions. What if \(n = 100\)? The traditional likelihood (the assumption of independence) reduces an \(n\)-dimensional distribution to a product of \(n\) univariate distributions. A hierarchical model accomplishes the same dimension reduction, but without ignoring the joint distribution. It does so by factoring the joint distribution into univariate conditional distributions. When I factor the joint distribution I get this:

\[ \begin{aligned} \left[ x_1, x_2, x_3 \right] &= \left[ x_3 | x_2 , x_1 \right]\left[ x_2, x_1 \right] \\ &= \left[ x_3 | x_2, x_1 \right] \left[ x_2 | x_1 \right]\left[x_1 \right] \end{aligned} \] This is a product of three univariate distributions, which are simple and easy to implement.

Despite the apparent resolution of the dimensionality challenge, I still have a problem. The remaining challenge is my difficulty in defining the relationship between \(x_3\) and the two things that affect it, \(x_2\) and \(x_1\). If this does not seem hard, consider that \(n\) values of \(x\) would require me to specify the distribution \(\left[x_n | x_1, \dots, x_{n-1} \right]\). I may not know how \(n-1\) different states combine into one effect.

Although the joint distribution is high-dimensional, the conditional distributions could be simple. In the state-space model, where the subscript on \(x\) is time, a given state \(x_t\) may conditionally depend only on \(x_{t-1}\) and \(x_{t+1}\). Said a bit differently, given \(x_t\), the values \(x_{t-1}\) and \(x_{t+1}\) are independent of each other: the past is conditionally independent of the future (given \(x_t\)). This is the Markov property. Where this holds, I can simplify,

\[ \left[x_t | x_1, \dots, x_{t-1}, x_{t+1}, \dots, x_T \right] = \left[x_t | x_{t-1}, x_{t+1} \right] \]

*The Markov property: there are direct links only between sequential values. Given $x_t$, the future is independent of the past.*



We encountered the Markov property with MCMC, where it describes the simulator. Here we apply it to data. If the Markov assumption is roughly correct, then models can be specified simply.

Conditional independence simplifies hierarchical models. It allows me to sample conditionally, using Gibbs sampling. I can achieve conditional independence if I treat the data as a separate stage in the model, and I treat all of the links as stochastic. I let \(y_t\) represent the observed value at time \(t\), and I assume that it results from observation error on the true \(x_t\). With this second stage, the combined process and observation model is now,

\[ [\mathbf{x}] \left[ \mathbf{y} | \mathbf{x} \right] = [x_1] \prod_{t=2}^T \left[x_t | x_{t-1} \right] \prod_{t=1}^T \left[y_t | x_t \right] \]

*The random walk with data $y_t$ treated as a separate stage in the model.*



I use a simple random walk to demonstrate.

Exercise 1. I previously included dependence structure in the Gaussian distribution. Write the joint distribution for \(\left[ x_1, x_2, x_3 \right]\) as a multivariate normal. Under what circumstances would this joint distribution be equivalent to \([ x_1] [x_2] [x_3]\)?

A random walk

The random walk model describes a path that moves by stochastic increments. A random walk can occur in multiple dimensions, as in movements left or right (1-D), across a surface (2-D), through the atmosphere or ocean (3-D), and so on. A random walk can occur in continuous space (e.g., diffusing gases) or discrete space (e.g., repeated coin tosses or moves on a lattice). A continuous random walk is Gaussian when steps are normally distributed.

A random walk model that has observation error is a state-space model. Consider the pair of equations

\[ \begin{aligned} x_{t+1} &= f(x_t) + \epsilon_t \\ y_t &= g(x_t) + \eta_t \end{aligned} \] The first equation is the process model, system equation, or evolution equation. The process model includes a deterministic component \(f(x_t)\), which encapsulates understanding of a process. The stochasticity in this equation is the second term, the process error, \(\epsilon_t\). It affects the process, because variability introduced at one time step affects the trajectory of change. Here I assume these terms are additive, but I could have used some other assumption. This stochasticity in the first equation means that the process model can accommodate some misspecification.

As a specific example, a random walk could look like this,

\[ \begin{aligned} f(x_t) &= x_t \\ \epsilon_t & \sim N(0,\sigma^2) \end{aligned} \] This random walk is only part of the model.

A state-space model includes the additional equation for observations. The true value of \(x_t\) is clouded by observation error. The observed quantity is \(y_t\). If the observations contain no additive or multiplicative bias a data model might look like this,

\[ y_t \sim N(x_t,\tau^2) \] This data model says that observations are centered on the true value \(x_t\) with noise described by a variance \(\tau^2\). In other words, \(g(x_t) = x_t\).

Simulated data

As previously, it is a good idea to check algorithms against simulated data where parameter values are known. Before going further I simulate a data set with missing values:

nt   <- 200                                               # no. time intervals
time <- c( 1:nt )                                         # sample times
sig  <- 5                                                 # process err variance
tau  <- 20                                                # obs error variance
x    <- rep( 0, nt )                                      # initial state
for( t in 2:nt )x[t] <- x[t-1] + rnorm( 1, 0, sqrt(sig) ) # simulate process

y     <- rnorm( nt, x, sqrt(tau) )                        # simulate observations

nm    <- 20                                               # no. missing values
wm    <- sample( c(2:nt), nm )                            # times of missing observations
y[wm] <- NA                                               # set missing values to NA

plot( time, y, pch = 16, cex = .8, bty = 'n', las = 1, col = '#1f78b4' )  # observations
lines( time, x )                                                          # states
points( time[wm], x[wm], col = '#ff7f00', cex = .8, pch = 16 )            # missing
*Observed (dots) and latent states (lines). Missing values are orange*



Note that the loop to simulate time series data runs from index t = 2, ..., nt. The model is initialized at x[1] = 0.

The observations in the figure are shown as dots. Some are missing. The line shows the ‘true’ latent states \(x_t\) through time. These values would not be observed. They might be viewed as a breakdown of how I think uncertainty in the model and data could affect what is observed. At the same time, the latent states make observations conditionally independent: conditioned on the latent state \(x_t\), \(y_t\) is independent of \(y_{t-1}\) and \(y_{t+1}\).

Run this code a number of times to get a feel for the dynamics of a random walk. The autocorrelation in the series makes these random sequences appear as though there is some underlying, non-random control. A better way to appreciate the random walk is to look at the differenced series. Here are plots of x[t] and diff( x[t] ):

par(mfrow=c(2,1), mar=c(4,4,1,4), bty='n', las = 1 )
plot( time, x, type = 'l', xlab = '' )
plot( time[-1], diff(x), type = 'l', xlab = 'time' )
abline(h=0,lty=2)
abline( h = quantile( diff(x),c(.025,.975)), col ='#993404' ) # from time series
*Random walk above with differenced series below*


An analyst looking at the plot of \(x_t\) could fail to appreciate the uncorrelated nature of transitions, which is clear from the lower plot. I have drawn brown lines that bound 95% of the values.

Fitting the model

As in all hierarchical models, the state-space model uses distributions to connect states.

Here I fit the random walk model to simulated data. This model is shown graphically in the figure below. I specify prior distributions and set up a simple MCMC algorithm using Gibbs sampling. I use Gaussian distributions for the process and observation model.

*The random walk with data, process, and parameters. The square boxes (observations) and prior parameter values anchor the fit. All arrows are stochastic.*



Prior distribution

First, note that the process model treats the previous value as a prior for the current one, \([x_t | x_{t-1}]\). However, the first value at time \(t = 1\) does not have a previous value. I can specify a prior distribution for the first value as \(N(x_1|x_0,\sigma^2)\). Then the distribution for \(\mathbf{x} = (x_1, \dots, x_T)\) is

\[ \left[ \mathbf{x} |x_0,\sigma^2 \right] \propto N(x_1|x_0,\sigma^2) \times \prod_{t=2}^T N \left(x_t | x_{t-1}, \sigma^2 \right) \]

I place (conjugate) inverse gamma prior distributions on the variance parameters,

\[ \begin{aligned} \sigma^2 &\sim IG(s_1, s_2) \\ \tau^2 &\sim IG(t_1,t_2) \end{aligned} \]

I want these prior distributions to be centered on values that are sensible. Typically, I will have some idea of how variable the process and observations could be. Together, the relative sizes of the two variances determine how much of the variation in the data is attributed to the process versus to observation error. In this example, I pretend that I don’t know the values used to simulate the data. For demonstration, my prior distributions for sig and tau are centered on 8 and 15, respectively.

In addition to mean values, I need to specify weights. I decide to place more weight on the observation error, proportional to the number of observations, nt = 200, and less weight on the process error, 2. Together, these two values will determine the balance of weight assigned to the underlying process model and to the observed values. The function varPrior( mu, wt ) takes as arguments the mean of the distribution and the weight and returns parameter values for the inverse gamma prior on the variances. Here are parameter values and weights:

sp <- varPrior( mu = 8, wt = 2 )       # weak prior
tp <- varPrior( 15, nt )               # wt set to no. observations
sg <- 1/rgamma( 1, sp[1], sp[2] )      # initialize values
tg <- 1/rgamma( 1, tp[1], tp[2] )
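The function varPrior is supplied in clarkFunctions2024.r and is not reproduced here. As a rough sketch of what a moment-matching function like this could do (an assumption on my part; the course version may differ), treat wt as a prior sample size and choose inverse gamma parameters whose mean is mu:

varPriorSketch <- function( mu, wt ){    # hypothetical stand-in for varPrior()
  s1 <- wt/2 + 1                         # shape: prior worth roughly wt observations
  s2 <- mu*( s1 - 1 )                    # rate: IG mean s2/(s1 - 1) equals mu
  c( s1, s2 )
}

With parameters like these, 1/rgamma( 1, s1, s2 ) draws a variance centered near mu, and a larger wt concentrates the prior more tightly around mu.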

Posterior distribution

With these prior distributions on parameters, the posterior distribution for the random walk model is

\[ \begin{aligned} \left[ \mathbf{x}, \sigma^2, \tau^2 | \mathbf{y}, s_1, s_2, t_1, t_2 \right] & \propto N(x_1|x_0,\sigma^2) \prod_{t=2}^T N \left(x_t | x_{t-1}, \sigma^2 \right) \\ & \times \prod_{t=1}^T N\left(y_t | x_t, \tau^2 \right) \\ & \times IG(\sigma^2 | s_1, s_2) IG(\tau^2| t_1,t_2) \end{aligned} \]

I have missing values in the data. I need an index for the locations of these missing values and initialize them to values that are reasonable. I use this function:

tmp     <- initialStatesSS( y )
xg      <- tmp$x
notMiss <- tmp$notMiss
miss    <- tmp$miss
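The function initialStatesSS also comes from clarkFunctions2024.r. A minimal sketch of what an initializer of this kind might do (again an assumption, not the course code) is to start the latent states at the observations and fill missing times by linear interpolation:

initialStatesSketch <- function( y ){    # hypothetical stand-in for initialStatesSS()
  miss    <- which( is.na(y) )           # indices of missing observations
  notMiss <- which( !is.na(y) )          # indices of observed values
  x <- approx( notMiss, y[notMiss], xout = 1:length(y), rule = 2 )$y  # interpolate gaps
  list( x = x, miss = miss, notMiss = notMiss )
}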


A Gibbs sampler

A Gibbs sampler for the state-space model updates each latent state in turn, conditional on the current values of its neighbors, before moving to the next. For this example, I will simply omit a prior for \(x_1\). Once each of the states has been updated, the variance parameters are updated. The basic algorithm is the following:

  1. Sample each latent state conditional on all others,

\[ \begin{aligned} x_t | x_{t-1}, x_{t+1}, y_t, \sigma^2, \tau^2 &\propto [x_t|x_{t-1}, \sigma^2] \\ & \times [x_{t+1} | x_t, \sigma^2] \\ & \times [y_t | x_t, \tau^2 ] \end{aligned} \] This conditional distribution is given in the Appendix. In this figure I highlight the boxes and arrows that are included in this conditional distribution:

*Sampling the latent state $x_t$ engages only the arrows that connect it to $x_{t-1}, x_{t+1}$, and $y_t$.*


I can find each of the arrows in this graph in the conditional distribution, pointing either to or from \(x_t\).

  2. Sample the process variance,

\[ \sigma^2 | \mathbf{x}, s_1, s_2 \propto \left( \prod_{t=2}^T [x_t | x_{t-1}, \sigma^2] \right) [\sigma^2 | s_1, s_2] \]

Here are the boxes and arrows involved in the update of the process variance:

*Sampling the process variance involves only the latent states and the prior distribution.*


Again, only arrows pointing to or from \(\sigma^2\) are included in the conditional distribution. This includes every value of \(x_t\). It does not involve \(y_t\).

  3. Sample the observation variance from the conditional distribution,

\[ \tau^2 | \mathbf{x}, \mathbf{y}, t_1, t_2 \propto \left( \prod_{t=1}^T [y_t | x_t, \tau^2] \right) [\tau^2 | t_1, t_2] \]

This graph shows boxes and arrows included in the conditional distribution for the observation variance:

*Sampling the observation variance involves only the observations and the prior distribution.*


The conditional distribution includes every observation \(y_t\). It also includes the latent states \(x_t\), because they are part of the observation model.

  4. Repeat until a long converged sequence is available for analysis.


I will need to store the samples. I define a relatively small number of total steps ng and establish some matrices to hold values:

ng     <- 1000                            # no. Gibbs steps
vgibbs <- matrix( 0, ng, 2)               # store variances
colnames(vgibbs) <- c( 'sigma', 'tau' )
xgibbs <- matrix( 0, ng, nt )             # store latent states

Here is a loop that executes the Gibbs sampling algorithm:

for(g in 1:ng){
  xg <- updateSSRW( xg, y, missing = wm, tg = tg, sg = sg )      # latent states
  sg <- updateVariance( xg[-1], xg[-nt], sp[1], sp[2] )          # process var
  tg <- updateVariance( y[-wm], xg[-wm], tp[1], tp[2] )          # obs variance
  vgibbs[g,] <- c( sg, tg )                                      # store values
  xgibbs[g,] <- xg
}

Note three pieces to the algorithm, i) the latent states, ii) the process variance, and iii) the observation variance. The process variance includes only transitions between values of \(x_t\), i.e., \(x_2 - x_1, x_3 - x_2, \dots, x_T - x_{T-1}\). The missing values (wm) are not part of the observation model, \(y_t - x_t\).

Here are plots of the MCMC chains and posterior densities:

trueValues <- c( sig, tau )
chainPlot( vgibbs, trueValues = trueValues )
*MCMC chains for the variance estimates in the random walk example.*


chain2density( vgibbs, trueValues = trueValues, burnin = nt/2 )
*Posterior densities of the variances for the random walk example.*



Here are estimates of latent states:

xpost <- t( apply( xgibbs, 2, quantile, c(.5,.025,.975)) )               # 95% CI

shadeInterval( time, xpost[,2:3], add=F, xlab = 't', ylab = 'x_t, y_t' ) # credible interval
lines( time, x, lwd=2 )                                                  # true values
points( time, y, lwd=2, col='#1f78b4', cex=.5, pch=16 )                  # obs values
points( time[wm], x[wm], col='#ff7f00', cex=.5, pch=16 )                 # missing 
*Posterior estimates of states as a 95% CI (shaded), true values (black line), observations (blue), and missing values (orange).*



The state-space model can do a good job of recovering latent states. It struggles, however, to separately identify the two variances, so sensible priors on the variances are needed. Despite this limitation, state-space models have a proven track record of recovering parameters that describe the process, as shown in the next section.

Exercise 2. Use the foregoing code for the following:

  1. Conduct simulation experiments to determine how different values of sig and tau affect the behavior of x. What is the effect of sig if you hold tau constant? And vice versa?

  2. Determine how these same parameters affect the standard deviation of x and y.

  3. Determine how the length of the series nt affects the standard deviation in x and in the differenced sequence of x.

With predictors

Now suppose that the process is driven by predictors that change over time,

\[ x_t \sim N(x_{t-1} + \mathbf{z}'_{t-1} \boldsymbol{\beta}, \sigma^2) \] The second term in the mean structure describes how predictors in \(\mathbf{z}_t\) affect the process. Estimates of \(\boldsymbol{\beta}\) quantify these effects on the process. Here is a graph of the model with predictors:

*The model with covariates.*



The second term looks like a regression with a design vector \(\mathbf{z}_t\). At each time step there is an effect that comes from predictors and from the process error. Here is a simulation from this model that includes a data gap. Extended gaps are common in time-series data, as when a sensor fails and is eventually fixed or replaced.

nt    <- 300                                       # no. time intervals
time  <- c( 1:nt )                                 # sample times
wm    <- 30:60                                     # a sequence of missing values
sig   <- .01                                       # process err variance
tau   <- .1                                        # obs error variance
x     <- rep( 0, nt )
q     <- 3
z     <- matrix( rnorm(q*(nt - 1), .1), nt-1, q )  # design matrix
z[,1] <- 1
beta  <- matrix( runif(q, -.01,.01) )              # coefficient vector
zb    <- z%*%beta

for(t in 2:nt)x[t] <- x[t-1] + rnorm(1, zb[t-1], sqrt(sig))

y     <- rnorm( nt, x, sqrt(tau) )                 # observations
y[wm] <- NA                                        # missing

plot( time, y, col='grey', pch = 16, bty = 'n', las = 1 )
lines( time, x )
*Observed (dots) and latent states (line).*


If I am interested in the controls on this process, then why bother estimating the latent states \(x_t\)? These controls are quantified by the coefficients \(\boldsymbol{\beta}\). Again, whether or not I care about the \(x_t\), they serve the valuable role of conferring conditional independence on the observed \(y_t\).

I use Gibbs sampling to estimate the coefficients in \(\boldsymbol{\beta}\) together with the latent states and the variance parameters. In the above code the coefficients are the vector beta. Here are some prior distributions, initial states, and matrices to hold samples.

sp <- varPrior( 2, 2 )             # weak prior
tp <- varPrior( 6, 2 )             # weak prior on obs error
sg <- 1/rgamma( 1, sp[1], sp[2] )  # initialize values for variances, coefficients
tg <- 1/rgamma( 1, tp[1], tp[2] )
bg <- beta*0

tmp     <- initialStatesSS( y )    # initial states
xg      <- tmp$x
notMiss <- tmp$notMiss
miss    <- tmp$miss

ng     <- 1000
vgibbs <- matrix( 0, ng, 2 )
xgibbs <- matrix( 0, ng, nt )
bgibbs <- matrix( 0, ng, q )
colnames(vgibbs) <- c( 'sigma', 'tau' )
colnames(bgibbs) <- paste( 'beta', c(0:2), sep='' )

Note below the additional step in the Gibbs sampler to update coefficients. Otherwise it looks like the previous algorithm:

for(g in 1:ng){
  
  bg <- updateSSbeta( xg, z = z, sg, priorB = matrix(0,q),
                     priorIVB = diag(.001,q), addStates=T )
  zb <- z%*%bg
  xg <- updateSSRW( states = xg, yy = y, zb = zb, missing = wm, tg = tg, sg = sg )   
  sg <- updateVariance( xg[-1], xg[-nt] + zb, sp[1], sp[2] )
  tg <- updateVariance( y[-wm], xg[-wm], tp[1], tp[2] )
  vgibbs[g,] <- c( sg, tg )
  xgibbs[g,] <- xg
  bgibbs[g,] <- bg
}
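The coefficient step is a conjugate Gaussian regression update. As a hedged sketch of the idea (my own minimal version; the clarkFunctions2024.r function updateSSbeta may differ, e.g., in how addStates is handled), the state increments \(x_t - x_{t-1}\) are regressed on the design matrix with a normal prior on \(\boldsymbol{\beta}\):

updateBetaSketch <- function( x, z, sg, priorB, priorIVB ){   # hypothetical stand-in
  dx <- diff( x )                                    # increments x_t - x_(t-1)
  V  <- solve( crossprod(z)/sg + priorIVB )          # conditional posterior covariance
  v  <- crossprod( z, dx )/sg + priorIVB %*% priorB  # precision-weighted mean vector
  mu <- V %*% v
  mu + t( chol(V) ) %*% rnorm( length(mu) )          # one draw from N(Vv, V)
}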

The chains show rapid convergence to the true values in \(\boldsymbol{\beta}\).

Here are variance chains:

chainPlot( vgibbs, trueValues = c(sig,tau) )
*Variance estimates for model with covariates.*


Here are the coefficient chains:

chainPlot( bgibbs, trueValues = beta )
*Posterior estimates of coefficients. True values are brown.*


Here are plots of latent states:

xpost <- t( apply( xgibbs, 2, quantile, c(.5,.025,.975)) )
shadeInterval( time, xpost[,2:3], add=F )               # credible interval
lines( time, x, lwd=2, col = '#993404' )                  # true values
lines( time, xpost[,1], lwd=2 )                         # posterior mean
points( time, y, lwd=2, cex=.5, col='#1f78b4', pch=16 )    # obs values
points( time[wm], x[wm], col='#ff7f00',cex=.5, pch=16 )  # missing values
*Posterior estimates of states as a 95% CI (shaded), true values (brown line), observations (blue), and missing values (orange).*


Note the effect of a prolonged period of missing data. Experiment with some different values for nt, sig, and tau. We expect roughly 95% of the latent states to fall within the 95% credible interval.


A state-space model in jags

Here is a version of the same model in jags. Note that I condition on the first observation, which has a prior distribution centered on the observed value. JAGS parameterizes the normal distribution with a precision, so I place flat (uniform) prior distributions on the precisions and derive the variances sigma and tau as their reciprocals.

cat( "model {

 mu[1] ~ dnorm(y[1],10)                              # prior for first state, centered on y[1]
 xg[1] <- mu[1]

 for(t in 2:nt) {
   mu[t] <- xg[t-1] + b1 + b2*z[t-1,2] + b3*z[t-1,3] # process mean with covariates
   xg[t] ~ dnorm(mu[t],procErr)                      # process model (procErr is a precision)
   y[t] ~ dnorm(xg[t],obsErr)                        # observation model (obsErr is a precision)
 }

 procErr ~ dunif(0,20)                               # flat priors on precisions
 obsErr  ~ dunif(0,20)
 sigma <- 1/procErr                                  # variances reported as output
 tau   <- 1/obsErr
 b1 ~ dunif(-1,1)
 b2 ~ dunif(-1,1)
 b3 ~ dunif(-1,1)
}  ", fill=T, file="stateSpaceModel.txt" )

Here is an analysis:

library(rjags)
ssData   <- list(y = y, z = z, nt = nt )
ssFit    <- jags.model( data = ssData, file = "stateSpaceModel.txt" )
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 268
##    Unobserved stochastic nodes: 336
##    Total graph size: 2407
## 
## Initializing model
update(ssFit)

ssFit <- coda.samples( ssFit, variable.names = c('b1','b2','b3', 'sigma', 'tau'),
                       n.iter = 5000 )
par(mar=c(3,3,1,1))
plot( as.mcmc(ssFit), bty = 'n' )


Exercise 3. Working in your group, analyze the following model in jags. Start by simulating the data set. Then write jags code to fit the model.

\[ \begin{aligned} x_t &\sim N(\alpha + \beta x_{t-1}, \sigma^2) \\ y_t &\sim Poi( exp(x_t) ) \end{aligned} \]

  1. Simulate data: Change the block of code above from a Gaussian data model to Poisson. What link function is this?
  2. Change the jags code: In addition to the code for the Gaussian model in this example, look back at the implementation of the Poisson in unit 8.
  3. Run the model using the same code as shown above.


Recap

State-space models treat serial dependence hierarchically. Hierarchical models do not require marginal independence, only conditional independence. Modeling latent states allows for dependence structure that could be quite different from what could be achieved with standard models (e.g., auto-regressive). The flexibility of hierarchical modeling makes it attractive for time series data. Recall that adding predictors to the random walk model required only minor changes to the existing code: one more sampling step for the coefficients.

Informative prior distributions are important for variances. Examples here show how they affect estimates of coefficients and of latent states. The relative magnitudes of process and observation error determine the extent to which data versus process dominate the fit.

Appendix

The random walk has a conditionally normal posterior distribution that can be written this way:

\[ \begin{aligned} x_t | x_{t-1}, x_{t+1}, y_t, \sigma^2, \tau^2 &\propto N(x_t|x_{t-1}, \sigma^2) \times N(x_{t+1} | x_t, \sigma^2) \times N(y_t | x_t, \tau^2 ) \\ &\propto exp \left[ -\frac{(x_t - x_{t-1})^2}{2\sigma^2} - \frac{(x_{t+1} - x_{t})^2}{2\sigma^2} - \frac{(y_t - x_t)^2}{2\tau^2} \right] \end{aligned} \] Again, the three factors on the right-hand side are just the arrows in the graph that point to or from \(x_t\).

Because this is a product of normals, it will be normal, with this form,

\[ \begin{aligned} x_t &\sim N(Vv, V) \\ V &= \left( \frac{2}{ \sigma^2} + \frac{1}{\tau^2} \right)^{-1} \\ v &= \frac{x_{t-1} + x_{t+1}}{\sigma^2} + \frac{y_t}{\tau^2} \end{aligned} \] Recall that \(v\) is a weighted sum of the previous and next values in the series \((t-1, t+1)\) and the observation \(y_t\), with the variances acting as weights. The fitted latent states will be pulled to the data if the observation variance \(\tau^2\) is small relative to \(\sigma^2\). The latent states will be pulled to the mean of the previous and next values if \(\tau^2\) is large relative to \(\sigma^2\).

Here is the function:

updateSSRW <- function( states, yy, zb=rep(0,length(yy)), missing, tg, sg){       

  # state-space random walk 
  # update continuous states, random walk
  # missing times, obs y, obs error tg, process error sg
  # zb = z%*%beta
  
  nn <- length(states)
  
  for(t in 1:nn){
    
    VI <- 0
    v  <- 0
    
    if(!t %in% missing){                    # observation
      v  <- yy[t]/tg
      VI <- 1/tg
    }
    
    if(t < nn){                             # t+1 term excluded for last 
      v  <- v + (states[t+1] - zb[t])/sg 
      VI <- VI + 1/sg
    }
    
    if(t > 1){                              # t-1 term excluded for 1st 
      v  <- v + (states[t-1] + zb[t-1])/sg
      VI <- VI + 1/sg
    }
    
    V     <- 1/VI
    states[t] <- rnorm( 1, V*v, sqrt(V) )
  }
  states
}

The conditional distribution for the process variance looks like this:

\[ \begin{aligned} \left[ \sigma^2 | \mathbf{x}, s_1, s_2 \right] &\propto \left( \prod_{t=2}^T N(x_t | x_{t-1}, \sigma^2) \right) IG(\sigma^2 | s_1, s_2 ) \\ &\propto \left( \sigma^2 \right)^{-(T-1)/2}exp \left[ -\frac{1}{2\sigma^2} \sum_{t=2}^T (x_t - x_{t-1})^2 \right] \times \left( \sigma^2 \right)^{-(s_1 + 1)} exp(-s_2/\sigma^2) \\ &= \left( \sigma^2 \right)^{-(T-1)/2 - s_1 - 1} exp \left[ -\frac{1}{2\sigma^2} \sum_{t=2}^T (x_t - x_{t-1})^2 - \frac{s_2}{\sigma^2} \right] \\ &= IG \left( \sigma^2 \bigg| (T-1)/2 + s_1, \frac{1}{2}\sum_{t=2}^T (x_t - x_{t-1})^2 + s_2 \right) \end{aligned} \]

The conditional distribution for the observation variance looks like this: \[ \begin{aligned} \left[ \tau^2 | \mathbf{x}, \mathbf{y}, t_1, t_2 \right] &\propto \left( \prod_{t=1}^T N(y_t | x_t, \tau^2) \right) IG(\tau^2 | t_1, t_2 ) \\ &\propto \left( \tau^2 \right)^{-T/2}exp \left[ -\frac{1}{2\tau^2} \sum_{t=1}^T (y_t - x_t)^2 \right] \times \left( \tau^2 \right)^{-(t_1 + 1)} exp(-t_2/\tau^2) \\ &= IG \left( \tau^2 \bigg| T/2 + t_1, \frac{1}{2}\sum_{t=1}^T (y_t - x_t)^2 + t_2 \right) \end{aligned} \]
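Both conditionals reduce to the same conjugate inverse gamma update. Here is a hedged sketch of the draw that a function like updateVariance presumably makes (the version in clarkFunctions2024.r may differ in details):

updateVarianceSketch <- function( yy, mu, s1, s2 ){  # hypothetical stand-in
  n  <- length( yy )
  a1 <- s1 + n/2                                     # posterior shape
  a2 <- s2 + 0.5*sum( (yy - mu)^2 )                  # posterior rate
  1/rgamma( 1, a1, a2 )                              # one draw from IG(a1, a2)
}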