Traditional time series models can rely on assumptions about dependence structure that poorly describe real time series. For example, time series are rarely simple auto-regressive processes, and data are often non-Gaussian. Bayesian state-space models estimate latent states together with parameters, so there is broad flexibility in their posterior covariance structure.


Software

source('clarkFunctions2022.r')

From last time: Group exercise

  1. Summarize in a few sentences the differences between a Bayesian view of uncertainty and a frequentist view.

  2. Why do 1.96 standard errors approximate a 95% CI and when might this approximation be poor?

  3. What role does a P value play in the likelihood profile method for a CI? Why might a Bayesian object to it?

  4. How does Fisher Information differ from the likelihood profile?

  5. Why does the bootstrap conform to a frequentist perspective, and what is the key assumption it uses?

  6. Which approach involves integrating a distribution?

Objectives


State-space models for time series

Hierarchical modeling allows for conditional independence of observations, taking up the dependence at a second, latent stage: the underlying process. These states must be random and, therefore, must be estimated.

There have been many efforts to handle the combination of process and observation error in time series. Classical state-space models include the Kalman filter (Harvey 1989), an updating procedure whereby the prior estimate of the state is ‘corrected’ by how well it can predict the next observation. It is carried out as an iterative procedure. Perhaps I would like to estimate the underlying state \(x_t\), given that \(y_t\) is observed. I might want to estimate the transition parameters in a function \(f(x_t)\) that describes the process. I might want to predict \(y_{t+k}\) for some time \(k\) in the future. Why predict \(y_{t+k}\), rather than just \(x_{t+k}\)? I could be interested in both, but I may eventually observe \(y_t\), whereas I will never observe \(x_t\). To compare observation with prediction, I need \(y_t\). The Kalman filter is not ideal for nonlinear models or for non-Gaussian errors. In this section I describe how the Bayesian framework is used for time series data with underlying states. Here I summarize basic elements of the approach.

Conditional independence in time series data

The independence assumption allows us to treat the likelihood, or data model, for multiple observations \([x_1, x_2] = [x_1|x_2][x_2]\) as a product of (independent) distributions \([x_1] [x_2]\). This assumption implies that \([x_1|x_2] = [x_1]\): the marginal distribution of \(x_1\) does not change when \(x_2\) changes. Time series data usually cannot be assumed to be independent, but this dependence can be taken up at a second stage.

*Three state variables arranged in a time series.*


Before considering the likelihood explicitly, recall the joint probability, now in the context of a time series. I begin with three states, having joint probability \(\left[ x_1, x_2, x_3 \right]\). For most models I assume that observations are independent and, thus, the likelihood is

\[ \left[ x_1, x_2, x_3 \right] = \left[ x_1 \right]\left[ x_2 \right]\left[ x_3 \right] \] The requirement for independent samples motivates much of experimental design. So what do I mean here by the term independent?

There is a common misconception that ‘independent’ means ‘dissimilar’. Here’s a definition from a recent text referring to the assumption of independence: ‘This assumption is violated when the value of one observation tends to be too similar to the values of other observations.’ There are many definitions like this in the literature. If I consider a simple regression model, the \(Y\)’s corresponding to similar values of \(X\) will be ‘similar’. If this were not the case, why would I fit a regression? Clearly, definitions like this miss the point.

The importance of independent observations results from the fact that the likelihood is used to represent the true joint distribution. A likelihood could be equal to the joint distribution only in the special case where a conditional distribution is equal to a marginal distribution, i.e., \([x_1 | x_2] = [x_1]\). If samples are not independent, then

\[ \left[ x_1, x_2, x_3 \right] \neq \left[ x_1 \right]\left[ x_2 \right]\left[ x_3 \right] \] because marginal distributions are not the same as conditional distributions.

The joint distribution on the left is hard to think about, because it is an \(n = 3\)-dimensional problem. I am not very good at thinking in \(n\) dimensions. What if \(n = 100\)? The traditional likelihood (the assumption of independence) reduces an \(n\)-dimensional distribution to a product of \(n\) univariate distributions. A hierarchical model accomplishes the same dimension reduction, but without ignoring the joint distribution. It does so by factoring the joint distribution into univariate conditional distributions. When I factor the joint distribution

\[ \begin{aligned} \left[ x_1, x_2, x_3 \right] &= \left[ x_3 | x_2 , x_1 \right]\left[ x_2, x_1 \right] \\ &= \left[ x_3 | x_2, x_1 \right] \left[ x_2 | x_1 \right]\left[x_1 \right] \end{aligned} \] I get a product of three univariate distributions.

Despite the apparent resolution of the dimensionality challenge, I still have a problem. The remaining challenge is my difficulty in defining the relationship between \(x_3\) and the two things that affect it, \(x_2\) and \(x_1\). If this does not seem hard to you, consider that \(n\) values of \(x\) would require me to specify the distribution \(\left[x_n | x_1, \dots, x_{n-1} \right]\). How do I specify how \(n-1\) different states combine into one effect?

Although the joint distribution is high-dimensional, the conditional distributions could be simple. In the state-space model, where the subscript on \(x\) is time, a given state \(x_t\) may conditionally depend only on \(x_{t-1}\) and \(x_{t+1}\). This is equivalent to saying that, given \(x_t\), the values \(x_{t-1}\) and \(x_{t+1}\) are conditionally independent of each other: the past is conditionally independent of the future (given \(x_t\)). This is the Markov property. Where this holds, I can simplify,

\[ \left[x_t | x_1, \dots, x_{t-1}, x_{t+1}, \dots, x_T \right] = \left[x_t | x_{t-1}, x_{t+1} \right] \]

*The Markov property: there are direct links only between sequential values. Given $x_t$, the future is independent of the past.*

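A quick numerical check of this factorization for a three-step Gaussian random walk, with \(x_1 \sim N(0,1)\) and \(x_{t+1} = x_t + N(0, s)\). This is a sketch; the values, the covariance construction, and the use of the mvtnorm package are my own additions:

library(mvtnorm)                      # dmvnorm for the joint density

s  <- 2                               # process variance (arbitrary)
x1 <- 0.3; x2 <- 1.1; x3 <- 0.6       # arbitrary values for the three states

S <- matrix( c(1,     1,       1,     # covariance implied by the random walk
               1, 1 + s,   1 + s,
               1, 1 + s, 1 + 2*s), 3, 3)

dmvnorm( c(x1, x2, x3), sigma = S )   # joint density [x1, x2, x3]
dnorm(x1, 0, 1) *                     # product of conditionals
  dnorm(x2, x1, sqrt(s)) *
  dnorm(x3, x2, sqrt(s))

The two numbers agree, because the Markov factorization is exact for this model.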

Conditional independence allows me to simplify in a hierarchical context. It allows me to sample conditionally, using Gibbs sampling. I can achieve conditional independence if I treat the data as a separate stage in the model, and I treat all of the links as stochastic. I let \(y_t\) represent the observed value at time \(t\), and I assume that it results from observation error on the true \(x_t\). With this second stage, the combined process and observation model is now,

\[ [\mathbf{x}] \left[ \mathbf{y} | \mathbf{x} \right] = [x_1] \prod_{t=2}^T \left[x_t | x_{t-1} \right] \prod_{t=1}^T \left[y_t | x_t \right] \]

*The random walk with data $y_t$ treated as a separate stage in the model.*


I use a simple random walk to demonstrate.

Exercise 1. We have previously included dependence structure in the Gaussian distribution. Write the joint distribution for \(\left[ x_1, x_2, x_3 \right]\) where the marginal distribution of each \(x_i\) is normal and the joint distribution is multivariate normal. Under what circumstances would they be equivalent?

A random walk model

Consider the pair of equations

\[ \begin{aligned} x_{t+1} &= f_t(x_t) + \epsilon_t \\ y_t &= g_t(x_t) + \eta_t \end{aligned} \] The first equation is the process model, system equation, or evolution equation. The process model includes a deterministic component \(f_t(x_t)\), which encapsulates understanding of a process. The stochasticity in this equation is the second term, the process error, \(\epsilon_t\). It affects the process, because variability introduced at one time step affects the trajectory of change in the future. Here I assume this is additive, but I could have used some other assumption. This stochasticity in the first equation means that the process can accommodate model misspecification.

As a specific example, a random walk could look like this,

\[ \begin{aligned} f_t(x_t) &= x_t \\ \epsilon_t & \sim N(0,\sigma^2) \end{aligned} \]

If the observations contain no additive or multiplicative bias, a data model might look like this,

\[ y_t \sim N(x_t,\tau^2) \]
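
Taken together, these two equations define the joint density \([\mathbf{x}]\left[\mathbf{y}|\mathbf{x}\right]\) written earlier. Here is a minimal sketch of a function that evaluates its logarithm, conditioning on \(x_1\) rather than giving it a prior (the function name is my own):

logJointRW <- function(x, y, sig, tau){
  # log joint density [x][y|x] for the random walk state-space model
  # x: latent states; y: observations (may contain NA); sig, tau: variances
  process <- sum( dnorm(x[-1], x[-length(x)], sqrt(sig), log=TRUE) )
  obs     <- sum( dnorm(y, x, sqrt(tau), log=TRUE), na.rm=TRUE )
  process + obs
}

Up to the priors on the variances, this is the quantity that the Gibbs sampler below explores, one conditional distribution at a time.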

Simulated data

As previously, it is a good idea to check algorithms against simulated data where parameter values are known. Before going further I simulate a data set with missing values:

nt   <- 200                      # no. time intervals
time <- c(1:nt)                  # sample times
nm   <- 20                       # number of missing values
wm   <- sort(sample(c(2:nt),nm)) # which are missing
sig  <- 5                        # process err variance
tau  <- 20                       # obs error variance
x    <- rep(10,nt)
for(t in 2:nt)x[t] <- x[t-1] + rnorm(1,0,sqrt(sig))

y     <- rnorm(nt,x,sqrt(tau))   # simulate observations
y[wm] <- NA                      # some are missing

plot(time, y, col='grey')
lines(time, x)
points( time[wm], x[wm], pch=16)
*Observed (dots) and latent states (lines). Missing values are shown as solid dots.*


Note that the loop to simulate time series data runs from index 2, ..., nt.

The line in the figure shows the ‘true’ latent states through time. These values would not be observed. They are not ‘real’. They are a device that allows for conditional independence in observations and a breakdown of how I think my uncertainty in the model and in the data could affect what is observed.

Run this code a number of times to get a feel for the dynamics of a random walk. The autocorrelation in the series makes these random sequences appear as though there is some underlying, non-random control. A better way to appreciate the random walk is to look at the differenced series,

par(mfrow=c(2,1), mar=c(2,4,1,3), bty='n')
plot(time, x, type='l')
plot(time[-1], diff(x), type='l')
abline(h=0,lty=2)
abline(h=sqrt(sig)*1.96*c(-1,1),lty=2)           # Gaussian approximation
abline(h=quantile(diff(x),c(.025,.975)),col=2)   # from time series
*Random walk above with differenced series below*


An analyst looking at the plot of \(x_t\) could fail to appreciate the uncorrelated nature of values, shown in the lower plot. I have drawn dashed lines to bound 95% of the expected values.

*The random walk with data, process, and parameters. The square boxes (observations) and prior parameter values anchor the fit. All arrows are stochastic.*


Fitting the model

As in all hierarchical models, the state-space model uses distributions to connect states.

Here I fit the random walk model to simulated data. I specify prior distributions and set up a simple MCMC algorithm using Gibbs sampling.

First I place inverse gamma prior distributions on the variance parameters,

\[ \begin{aligned} \sigma^2 &\sim IG(s_1, s_2) \\ \tau^2 &\sim IG(t_1,t_2) \end{aligned} \]

I want these prior distributions to be centered on values that I believe to be accurate. Assuming I don’t know the values used to simulate the data, my prior distributions for sig and tau are centered on 8 and 15, respectively. In addition to mean values, I need to specify weights. I decide to place more weight on the observation error, comparable to the number of observations \(T\), and less weight on the process error (weight 2). Together, these two values determine the balance of weight assigned to the underlying process model and to the observed values. Here it is in R:

sp <- varPrior(mu=8, wt=2)          # weak prior
tp <- varPrior(15,nt)               # wt comparable to data
sg <- 1/rgamma(1,sp[1],sp[2])       # initialize values
tg <- 1/rgamma(1,tp[1],tp[2])
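
The function varPrior comes from clarkFunctions2022.r. As a minimal sketch of what such a function might do, suppose the two IG parameters are chosen so that the prior mean equals mu and wt acts like a prior sample size; the sourced implementation may differ:

varPrior <- function(mu, wt){
  # hypothetical sketch: IG(s1, s2) with prior mean s2/(s1 - 1) = mu
  # and a shape that grows with the prior weight wt
  s1 <- wt/2 + 1
  s2 <- mu*wt/2
  c(s1, s2)
}

If you source clarkFunctions2022.r, use its version rather than this sketch.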

With these prior distributions on parameters, the posterior distribution for the random walk model is

\[ \begin{aligned} \left[ \mathbf{x}, \sigma^2, \tau^2 | \mathbf{y}, s_1, s_2, t_1, t_2 \right] & \propto N(x_1|x_0,\sigma^2) \\ & \times \prod_{t=2}^T N \left(x_t | x_{t-1}, \sigma^2 \right) \\ & \times \prod_{t=1}^T N\left(y_t | x_t, \tau^2 \right) \\ & \times IG(\sigma^2 | s_1, s_2) IG(\tau^2| t_1,t_2) \end{aligned} \] Note that the time index for states is \(2, \dots, T\), whereas for observations it is \(1, \dots, T\). Each \(x_t\) can be viewed as the prior mean for \(x_{t+1}\). For this reason, I explicitly specify a prior distribution for \(x_1\), centered on a value \(x_0\).

I have missing values in the data. I need an index for the locations of these missing values and initialize them to values that are reasonable. I use this function:

tmp     <- initialStatesSS(y)
xg      <- tmp$x
notMiss <- tmp$notMiss
miss    <- tmp$miss
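
initialStatesSS is also supplied by clarkFunctions2022.r. A minimal sketch of what such an initialization could look like, assuming missing observations are filled by linear interpolation between observed values (the sourced version may differ):

initialStatesSS <- function(y){
  # hypothetical sketch: start latent states at the observations,
  # interpolating linearly across missing values
  miss    <- which( is.na(y) )
  notMiss <- which( !is.na(y) )
  x       <- y
  if(length(miss) > 0)
    x[miss] <- approx(notMiss, y[notMiss], xout=miss, rule=2)$y
  list(x = x, miss = miss, notMiss = notMiss)
}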

A Gibbs sampler

A Gibbs sampler for the state-space model is arranged such that the sample for each latent variable is updated before the next sample is drawn. For this example, I will simply omit a prior for \(x_1\). Once each of the states has been updated, update the variance parameters. The basic algorithm is the following:

  1. Sample each latent state conditional on all others,

\[ \begin{aligned} x_t | x_{t-1}, x_{t+1}, y_t, \sigma^2, \tau^2 &\propto [x_t|x_{t-1}, \sigma^2] \\ & \times [x_{t+1} | x_t, \sigma^2] \\ & \times [y_t | x_t, \tau^2 ] \end{aligned} \] This conditional distribution is shown in the Appendix.

*Sampling the latent state $x_t$ engages only the arrows that connect it to $x_{t-1}, x_{t+1}$, and $y_t$.*


  2. Sample the process variance,

\[ \sigma^2 | \mathbf{x}, s_1, s_2 \propto \left( \prod_{t=2}^T [x_t | x_{t-1}, \sigma^2] \right) [\sigma^2 | s_1, s_2] \]

*Sampling the process variance involves only the latent states and the prior distribution.*


  3. Sample the observation variance from the conditional distribution,

\[ \tau^2 | \mathbf{x}, \mathbf{y}, t_1, t_2 \propto \left( \prod_{t=1}^T [y_t | x_t, \tau^2] \right) [\tau^2 | t_1, t_2] \]

*Sampling the observation variance involves only the observations and the prior distribution.*


  4. Repeat until a long converged sequence is available for analysis.

I will need to store the samples. I define a relatively small number of total steps and establish some matrices to hold samples:

ng     <- 1000                         # no. Gibbs steps
vgibbs <- matrix(0,ng,2)               # store variances
colnames(vgibbs) <- c('sigma','tau')
xgibbs <- matrix(0,ng,nt)              # store latent states

Here is a loop that executes the Gibbs sampling algorithm:

for(g in 1:ng){
  xg <- updateSSRW(states=xg,y,missing=wm,tg=tg,sg=sg) # latent states
  sg <- updateVariance(xg[-1],xg[-nt],sp[1],sp[2])     # process var
  tg <- updateVariance(y[-wm],xg[-wm],tp[1],tp[2])     # obs variance
  vgibbs[g,] <- c(sg,tg)                               # store values
  xgibbs[g,] <- xg
}

Note three pieces to the algorithm, i) the latent states, ii) the process variance, and iii) the observation variance.

Here are plots of samples and densities:

.processPars(vgibbs,xtrue=c(sig,tau),CPLOT=T,DPLOT=T) 
*Posterior estimates of variances for the random walk example*.


## $summary
##       estimate    se  0.025 0.975 true value
## sigma    6.662 1.757  3.871 10.71          5
## tau     16.010 1.066 14.020 18.21         20

Here are estimates of latent states:

xpost <- t( apply(xgibbs,2,quantile,c(.5,.025,.975)) ) #95% CI

shadeInterval(time,xpost[,2:3],add=F)              #credible interval
lines(time,x,lwd=2)                                #true values
points(time,y,lwd=2,col='blue',cex=.5, pch=16)     #obs values
points(time[wm],x[wm],col='orange',cex=.5, pch=16) #missing 
*Posterior estimates of states as a 95% CI (shaded), true values (black line), observations (blue), and missing values (orange)*


Exercise 2. Use the foregoing code for the following:

  1. Conduct simulation experiments to determine how different values of sig and tau affect the behavior of x. What is the effect of sig if you hold tau constant, and vice versa?

  2. Determine how these same parameters affect the standard deviation of x and y.

  3. Determine how the length of the series nt affects the standard deviation in x and in the differenced sequence of x.

A state-space model with covariates

Now suppose that the process is driven by covariates that change over time,

\[ x_t \sim N(x_{t-1} + \mathbf{z}'_{t-1} \boldsymbol{\beta}, \sigma^2) \]

*The model with covariates.*


The second term looks like a regression with a design vector \(\mathbf{z}_t\). At each time step there is an effect that comes from covariates and from the process error. Here is a simulation from this model:

nt    <- 300                                     # no. time intervals
time  <- c(1:nt)                                 # sample times
wm    <- 30:60                                   # a sequence of missing values
sig   <- .01                                     # process err variance
tau   <- .1                                      # obs error variance
x     <- rep(0,nt)
q     <- 3
z     <- matrix( rnorm(q*(nt - 1), .1), nt-1, q) # design matrix
z[,1] <- 1
beta  <- matrix(runif(q, -.01,.01))              # coefficient vector
zb    <- z%*%beta

for(t in 2:nt)x[t] <- x[t-1] + rnorm(1, zb[t-1], sqrt(sig))

y     <- rnorm(nt,x,sqrt(tau))                   # observations
y[wm] <- NA                                      # missing

plot(time,y, col='grey')
lines(time,x)
*Observed (dots) and latent states (line).*


I now need to estimate the coefficients in \(\boldsymbol{\beta}\) together with the latent states and the variance parameters. In the above code the coefficients are the vector beta. Here are some prior distributions, initial states, and matrices to hold samples.

sp <- varPrior(2, 2)          # weak prior
tp <- varPrior(6, 2)          # wt comparable to data
sg <- 1/rgamma(1,sp[1],sp[2]) # initialize values for variances, coefficients
tg <- 1/rgamma(1,tp[1],tp[2])
bg <- beta*0

tmp     <- initialStatesSS(y) # initial states
xg      <- tmp$x
notMiss <- tmp$notMiss
miss    <- tmp$miss

ng     <- 1000
vgibbs <- matrix(0,ng,2)
xgibbs <- matrix(0,ng,nt)
bgibbs <- matrix(0,ng,q)
colnames(vgibbs) <- c('sigma','tau')
colnames(bgibbs) <- paste('beta',c(0:2),sep='')

Note the additional step in the Gibbs sampler to update coefficients. Otherwise it looks like the previous algorithm:

for(g in 1:ng){
  
  bg <- updateSSbeta(xg, z=z, sg, priorB=matrix(0,q),
                     priorIVB=diag(.001,q), addStates=T)
  zb <- z%*%bg
  xg <- updateSSRW(states=xg,yy=y,zb=zb,missing=wm,tg=tg,sg=sg)   
  sg <- updateVariance(xg[-1],xg[-nt] + zb,sp[1],sp[2])
  tg <- updateVariance(y[-wm],xg[-wm],tp[1],tp[2])
  vgibbs[g,] <- c(sg,tg)
  xgibbs[g,] <- xg
  bgibbs[g,] <- bg
}
par(mar=c(2,3,1,1))
.processPars(bgibbs,xtrue=beta,CPLOT=T,DPLOT=T) 
## $summary
##        estimate      se    0.025   0.975 true value
## beta0 -0.001247 0.01631 -0.03301 0.03030  -0.002328
## beta1 -0.021650 0.03098 -0.08188 0.03909  -0.003614
## beta2 -0.041580 0.02714 -0.09298 0.01215  -0.009536
*Posterior estimates of coefficients.*


Here are variances:

.processPars(vgibbs,xtrue=c(sig,tau),CPLOT=T,DPLOT=T) 
*Variance estimates for model with covariates.*


## $summary
##       estimate      se   0.025   0.975 true value
## sigma  0.07694 0.09572 0.05249 0.09814       0.01
## tau    0.14350 0.05639 0.11280 0.17360       0.10

In addition to variances there are now plots for estimates of the coefficients. Here are plots of latent states:

xpost <- t( apply(xgibbs,2,quantile,c(.5,.025,.975)) )
shadeInterval(time,xpost[,2:3],add=F)              # credible interval
lines(time,x,lwd=2, col = 'brown')                 # true values
lines(time,xpost[,1],lwd=2)                        # posterior mean
points(time,y,lwd=2,cex=.5, col='blue', pch=16)    # obs values
points(time[wm],x[wm],col='orange',cex=.5, pch=16) # missing values
*Posterior estimates of states as a 95% CI (shaded), true values (brown line), observations (blue), and missing values (orange).*


Note the effect of a prolonged period of missing data. Experiment with some different values for nt, sig, and tau.
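
The coefficient update called in the loop above, updateSSbeta, is a conjugate Gaussian regression on the differenced latent states, \(x_t - x_{t-1} = \mathbf{z}'_{t-1} \boldsymbol{\beta} + \epsilon_t\). A minimal sketch of such an update is below; the version in clarkFunctions2022.r may differ (for example, in how addStates is used):

updateSSbeta <- function(states, z, sg, priorB, priorIVB, addStates=TRUE){
  # hypothetical sketch: conjugate draw for beta in
  # x_t = x_{t-1} + z'_{t-1} beta + eps_t, eps_t ~ N(0, sg)
  # addStates is not used in this sketch
  dx <- diff(states)                          # x_t - x_{t-1}, length nt - 1
  V  <- solve( crossprod(z)/sg + priorIVB )   # conditional covariance
  v  <- crossprod(z, dx)/sg + priorIVB%*%priorB
  mu <- V%*%v                                 # conditional mean
  mu + t(chol(V))%*%rnorm(ncol(z))            # one multivariate normal draw
}

The prior mean priorB and prior inverse covariance priorIVB play the same roles as in a Bayesian regression; with diag(.001, q) the prior is weak.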

A state-space model in jags

Here is a version of the same model in jags. Note that I simply condition on the first observation, which has a prior distribution centered on the observed value.

cat( "model {

 mu[1] ~ dnorm(y[1],10)
 xg[1] <- mu[1]

 for(t in 2:nt) {
   mu[t] <- xg[t-1] + b1 + b2*z[t-1,2] + b3*z[t-1,3]
   xg[t] ~ dnorm(mu[t],procErr)
   y[t] ~ dnorm(xg[t],obsErr)
 }

 procErr ~ dunif(0,10)
 obsErr  ~ dunif(0,10)
 sigma <- 1/procErr
 tau   <- 1/obsErr
 b1 ~ dunif(-1,1)
 b2 ~ dunif(-1,1)
 b3 ~ dunif(-1,1)
}  ", fill=T,file="stateSpaceModel.txt" )

Here is the analysis:

library(rjags)
ssData   <- list(y = y, z = z, nt = nt )
parNames <- c('b1','b2','b3', 'sigma', 'tau')
ssFit    <- jags.model(data=ssData, file="stateSpaceModel.txt")
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 268
##    Unobserved stochastic nodes: 336
##    Total graph size: 2406
## 
## Initializing model
update(ssFit)

ssFit <- coda.samples(ssFit, variable.names = parNames, n.iter=5000)
par(mar=c(3,3,1,1))
plot( as.mcmc(ssFit) )
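
The trace and density plots give a first look at mixing. To go further, one could run several chains and compute standard convergence diagnostics from the coda package; a sketch, reusing the same data and model file:

ssFit3 <- jags.model(data=ssData, file="stateSpaceModel.txt", n.chains=3)
update(ssFit3)                                # burn-in
ssOut3 <- coda.samples(ssFit3, variable.names=parNames, n.iter=5000)
gelman.diag(ssOut3)                           # values near 1 suggest convergence
effectiveSize(ssOut3)                         # effective sample sizes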

Exercise 3. Working in your group, analyze the following model in jags. Start by simulating the data set. Then write jags code to fit the model.

\[ \begin{aligned} x_t &\sim N(\alpha + \beta x_{t-1}, \sigma^2) \\ y_t &\sim Poi( exp(x_t) ) \end{aligned} \]

  1. Simulate data: Change the block of code above from a Gaussian data model to Poisson. What link function is this?
  2. Change the jags code: In addition to the code for the Gaussian model in this example, look back at the implementation of the Poisson in unit 8.
  3. Run the model using the same code as shown above.


Recap

State-space models handle serial dependence hierarchically. Hierarchical models do not require marginal independence, only conditional independence. Modeling latent states allows for dependence structure that could be quite different from what could be achieved with standard models (e.g., autoregressive). The flexibility of hierarchical modeling makes it attractive for time series data.

Informative prior distributions are important for variances. Examples here show how they affect estimates of covariates and of latent states. The relative magnitudes of process and observation error determine the extent to which data versus process dominate the fit.

Appendix

The random walk has a conditionally normal posterior distribution that can be written this way:

\[ \begin{aligned} x_t | x_{t-1}, x_{t+1}, y_t, \sigma^2, \tau^2 &\propto N(x_t|x_{t-1}, \sigma^2) \times N(x_{t+1} | x_t, \sigma^2) \times N(y_t | x_t, \tau^2 ) \\ &\propto exp \left[ -\frac{(x_t - x_{t-1})^2}{2\sigma^2} - \frac{(x_{t+1} - x_{t})^2}{2\sigma^2} - \frac{(y_t - x_t)^2}{2\tau^2} \right] \end{aligned} \] Because this is a product of normals, it will be normal, with this form,

\[ \begin{aligned} x_t &\sim N(Vv, V) \\ V &= \left( \frac{2}{ \sigma^2} + \frac{1}{\tau^2} \right)^{-1} \\ v &= \frac{x_{t-1} + x_{t+1}}{\sigma^2} + \frac{y_t}{\tau^2} \end{aligned} \] Recall, \(v\) is a sum of the previous and next values in the series \((t-1, t+1)\) and the observation \(y_t\). The variances act as weights. The fitted latent states will be pulled to the data if the observation variance \(\tau^2\) is small relative to \(\sigma^2\). The latent states will be pulled to the mean of the previous and next values if \(\tau^2\) is large relative to \(\sigma^2\).
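
A quick numerical illustration of this weighting, with values chosen only for the example: when \(\tau^2\) is small relative to \(\sigma^2\), the conditional mean \(Vv\) sits close to the observation.

sig2  <- 1;  tau2  <- 0.01             # small observation error
xprev <- 0;  xnext <- 2;  yobs <- 5    # arbitrary values

V <- 1/( 2/sig2 + 1/tau2 )
v <- (xprev + xnext)/sig2 + yobs/tau2
V*v                                    # close to yobs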

Here is the function:

updateSSRW <- function( states, yy, zb=rep(0,length(states)), missing, tg, sg){

  # state-space random walk 
  # update continuous states, random walk
  # missing times, obs y, obs error tg, process error sg
  # zb = z%*%beta
  
  nn <- length(states)
  
  for(t in 1:nn){
    
    VI <- 0
    v  <- 0
    
    if(!t %in% missing){                    # observation
      v  <- yy[t]/tg
      VI <- 1/tg
    }
    
    if(t < nn){                             # t+1 term excluded for last 
      v  <- v + (states[t+1] - zb[t])/sg 
      VI <- VI + 1/sg
    }
    
    if(t > 1){                              # t-1 term excluded for 1st 
      v  <- v + (states[t-1] + zb[t-1])/sg
      VI <- VI + 1/sg
    }
    
    V     <- 1/VI
    states[t] <- rnorm( 1, V*v, sqrt(V) )
  }
  states
}

The conditional distribution for the process variance looks like this:

\[ \begin{aligned} \left[ \sigma^2 | \mathbf{x}, \mathbf{y} \right] &\propto \prod_{t=2}^T N(x_t | x_{t-1}, \sigma^2) IG(\sigma^2 | s_1, s_2 ) \\ &\propto \left( \sigma^2 \right)^{-(T-1)/2}exp \left[ -\frac{1}{2\sigma^2} \sum_{t=2}^T (x_t - x_{t-1})^2 \right] \times \left( \sigma^2 \right)^{-(s_1 + 1)} exp(-s_2/\sigma^2) \\ &= \left( \sigma^2 \right)^{-(T-1)/2 - s_1 - 1} exp \left[ -\frac{1}{2\sigma^2} \sum_{t=2}^T (x_t - x_{t-1})^2 - \frac{s_2}{\sigma^2} \right] \\ &= IG \left( \sigma^2 \bigg| (T-1)/2 + s_1, \frac{1}{2}\sum_{t=2}^T (x_t - x_{t-1})^2 + s_2 \right) \end{aligned} \]

The conditional distribution for the observation variance looks like this: \[ \begin{aligned} \left[ \tau^2 | \mathbf{x}, \mathbf{y} \right] &\propto \prod_{t=1}^T N(y_t | x_t, \tau^2) IG(\tau^2 | t_1, t_2 ) \\ &\propto \left( \tau^2 \right)^{-T/2}exp \left[ -\frac{1}{2\tau^2} \sum_{t=1}^T (y_t - x_t)^2 \right] \times \left( \tau^2 \right)^{-(t_1 + 1)} exp(-t_2/\tau^2) \\ &= IG \left( \tau^2 \bigg| T/2 + t_1, \frac{1}{2}\sum_{t=1}^T (y_t - x_t)^2 + t_2 \right) \end{aligned} \]
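
The function updateVariance used in the Gibbs loops draws from conditionals of exactly this form, with the values as its first argument and their means as its second. A minimal sketch consistent with these derivations; the version in clarkFunctions2022.r may differ in details:

updateVariance <- function(yvals, ymean, s1, s2){
  # conjugate inverse gamma draw for a variance parameter
  # yvals: values; ymean: their means; s1, s2: prior IG parameters
  u1 <- s1 + length(yvals)/2
  u2 <- s2 + 0.5*sum( (yvals - ymean)^2 )
  1/rgamma(1, u1, u2)                  # rate parameterization
}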