Kalman Filter Basics

Boat Localization Example Localization with measurement update

A boat sails from one island to another
Boat is trying to identify its location $\theta \sim N(m_0, C_0)$
Using a sequence of measurements to one of the islands $x_1,\ldots,x_n$

Measurements are noisy due to dilution of precision (Your GPS Can Kill You)

Boat Localozation

Reckoning Localization with no measurement updates is called reckoning

Source: http://www.hakaimagazine.com/article-short/traversing-seas

Kalman Filter \[\theta \sim N(m_0, C_0)\] \[x_t = \theta + w_t,~~~w_t \sim N(0,\sigma^2)\] \[x_1,x_2,\ldots \mid \theta \sim N(\theta,\sigma^2)\] The prior variance $C_0$ might be quite large if you are very uncertain about your guess $m_0$
Given the measurements $x^n = (x_1,\ldots,x_n)$, you update your opinion about $\theta$ computing its posterior density, using the Bayes formula

Normal Model \[f(x) = \dfrac{1}{\sqrt{2 \pi \sigma^2}} \exp^{-\dfrac{1}{2}\dfrac{(x-\mu)^2}{\sigma^2}}\] Or multivariate equivalent

\[f(x) = (2 \pi)^{-k/2} |\Sigma|^{-1/2}\exp^{-0.5(x-\mu)^T\Sigma^{-1}(x-\mu)}\]

The conjugate rior for the mean of a Normal distribution is normal. We will look at the Gaussian distribution from a Bayesian point of view. In the standard form, the likelihood has two parameters, the mean $\mu$ and the variance $\sigma^2$

\[p(x^n | \mu, \sigma^2) \propto \dfrac{1}{\sigma^n}\exp\left(-\dfrac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\right)\]

In case when we know the variance $\sigma^2$, but do not know mean $\mu$, we assume $\mu$ is random. To have conjugate prior we choose

\[p(\mu | \mu_0, \sigma_0) \propto \dfrac{1}{\sigma_0}\exp\left(-\dfrac{1}{2\sigma_0^2}(\mu-\mu_0^2)\right)\]

In practice, when little is known about $\mu$, it is common to set the location hyper-parameter to zero and the scale to some large value.

When likelihood is Normal with unknown mean $\mu$ and a known variance $\sigma^2$. Our parameter of interest is $\mu$. We can use a conjugate Normal prior on $\mu$, with mean $\mu_0$ and variance $\sigma_0^2$. \[\begin{aligned} p(\mu| x^n, \sigma^2) & \propto p(x^n | \mu, \sigma^2)p(\mu) ~~~\mbox{(Bayes rule)}\\ % N(\mu_1,\tau_1) & = N(\mu, \sigma^2)\times N(\mu_0, \sigma_0^2) \end{aligned}\]

Useful Identity One of the most useful algebraic tricks for calculating posterior distribution is completing the square.

\[\dfrac{(x-\mu_1)^2}{\sigma_1} + \dfrac{(x-\mu_2)^2}{\sigma_2} = \dfrac{(x - \mu_3)^2}{\sigma_3} + \dfrac{(\mu_1-\mu_2)^2}{\sigma_1 + \sigma_2}\] where \[\mu_3 = \sigma_3 (\mu_1/\sigma_1 + \mu_2/\sigma_2)\] and \[\sigma_3 = (1/\sigma_1 + 1/\sigma_2)^{-1}\]

Prior: \[\theta \sim \frac{e^{-\frac{(\theta-\mu )^2}{2 \sigma ^2}}}{\sqrt{2 \pi } \sigma }\] Likelihood: \[x \mid \theta \sim \frac{e^{-\frac{(\theta-y )^2}{2 r^2}}}{\sqrt{2 \pi } r}\] Posterior mean: \[\frac{x \sigma ^2+\mu r^2}{r^2+\sigma ^2}\] Posterior variance: \[\frac{1}{\frac{1}{r^2}+\frac{1}{\sigma ^2}}\]

Prior, Likelihood, Posterior

After $n$ steps \[\begin{aligned} p(\mu | x^n) & \propto \prod_{i=1}^{n}\dfrac{1}{\sqrt{2\pi \sigma^2}}\exp\left(-\frac{(x_i - \mu )^2}{2\sigma^2}\right)\times\dfrac{1}{\sqrt{2\pi \sigma_0^2}}\exp\left(-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}\right)\\ & \propto \exp\left(-\sum_{i=1}^{n}\frac{(x_i - \mu )^2}{2\sigma^2} - \frac{(\mu - \mu_0)^2}{2\sigma_0^2}\right)\\ & = \exp\left(-\dfrac{1}{2}\left[\sum_{i=1}^{n}\frac{(x_i - \mu )^2}{\sigma^2} + \frac{(\mu - \mu_0)^2}{\sigma_0^2}\right]\right)\\ & = \exp\left(-\dfrac{1}{2\sigma^2 \sigma_0^2}\left[\sigma_0^2\sum_{i=1}^{n}(x_i - \mu )^2 + \sigma^2 (\mu - \mu_0)^2\right]\right)\\ & = \exp\left(-\dfrac{1}{2\sigma^2 \sigma_0^2}\left[\sigma_0^2\sum_{i=1}^{n}(x_i^2 - 2\mu x_i+ \mu ^2) + \sigma^2 (\mu ^2 - 2\mu \mu_0 + \mu_0^2)\right]\right)\\ \end{aligned}\]

We can multiply the $2\mu x_i$ term in the summation by $n/n$ in order to get the equations in terms of the sufficient statistic $\bar{x}^n$ \[\begin{aligned} p(\mu | x^n) & \propto \exp\left(-\dfrac{1}{2\sigma^2 \sigma_0^2}\left[\sigma_0^2\sum_{i=1}^{n}(x_i^2 - \dfrac{n}{n}2\mu x_i+ \mu ^2) + \sigma^2 (\mu ^2 - 2\mu \mu_0 + \mu_0^2)\right]\right)\\ & = \exp\left(-\dfrac{1}{2\sigma^2 \sigma_0^2}\left[\sigma_0^2\sum_{i=1}^{n}x_i^2 - \sigma_0^22\mu n\bar{x}^n+ \tau_n^0n\mu ^2 + \sigma^2 \mu ^2 - 2\mu \mu_0\sigma^2 + \mu_0^2\sigma^2\right]\right) \end{aligned}\] set $k = \sigma_0^2\sum_{i=1}^{n}x_i^2 + \mu_0^2\sigma^2$ (they do not contain $\mu$) \[p(\mu | x^n) \propto \exp\left(-\dfrac{1}{2}\left[\mu ^2\left(\dfrac{1}{\sigma_0^2} + \dfrac{n}{\sigma^2}\right) - 2\mu\left(\dfrac{\mu_0}{\sigma_0^2} + \dfrac{n\bar{x}^n}{\sigma^2}\right) + k\right]\right)\]

Let’s multiply by \[\dfrac{1/\sigma_0^2 + n/\sigma^2}{1/\sigma_0^2 + n/\sigma^2}\] Now \[p(\mu | x^n) \propto \exp\left(-\dfrac{1}{2}\left(1/\sigma_0^2 + n/\sigma^2\right)\left(\mu - \dfrac{\mu_0/\sigma_0^2 + n\bar{x}^n/\sigma^2}{1/\sigma_0^2 + n/\sigma^2}\right)^2\right)\]

\[p(\mu | x^n) \propto \exp\left(-\dfrac{1}{2}\left(1/\sigma_0^2 + n/\sigma^2\right)\left(\mu - \dfrac{\mu_0/\sigma_0^2 + n\bar{x}^n/\sigma^2}{1/\sigma_0^2 + n/\sigma^2}\right)^2\right)\]

Posterior mean: $\mu_n = \dfrac{\mu_0/\sigma_0^2 + n\bar{x}^n/\sigma^2}{1/\sigma_0^2 + n/\sigma^2}$
Posterior variance: $\sigma_n^2 = \left(1/\sigma_0^2 + n/\sigma^2\right)^{-1}$
Posterior precision:: $\tau_n^2 = 1/\sigma_0^2 + n/\sigma^2$

Posterior Precision is just the sum of the prior precision and the data precision.

Posteior Mean \[\begin{aligned} \mu_n & = \dfrac{\mu_0/\sigma_0^2 + n\bar{x}^n/\sigma^2}{1/\sigma_0^2 + n/\sigma^2}\\ & = \dfrac{\mu_0\sigma^2}{\sigma^2 + n\sigma_0^2} + \dfrac{\sigma_0^2n\bar{x}^n}{\sigma^2 + n\sigma_0^2} \end{aligned}\]

As $n$ increases, data mean dominates prior mean.
As $\sigma_0^2$ decreases (less prior variance, greater prior precision), our prior mean becomes more important.

State space model

A state space model consists of two equations: \[\begin{aligned} Z_t&=HS_t+w_t\\ S_{t+1} &= FS_t + v_t \end{aligned}\] where $S_t$ is a state vector of dimension $m$, $Z_t$ is the observed time series, $F$, $G$, $H$ are matrices of parameters, $\{w_t\}$ and $\{v_t\}$ are i.i.d random vectors satisfying \[\mbox{E}(w_t)=0, \hspace{0.5cm} \mbox{E}(v_t)=0, \hspace{0.5cm}\mathrm{cov}(v_t)=V, \hspace{0.5cm} \mathrm{cov}(w_t)=W\] and $\{w_t\}$ and $\{v_t\}$ are independent.

State Space Models

State space models consider a time series as the output of a dynamic system perturbed by random disturbances.
Natural interpretation of a time series as the combination of several components, such as trend, seasonal or regressive components.
Computations can be implemented by recursive algorithms.

There are several types of inference to be performed using state-space models when analysing time series data

Model building versus inferring unknown variable. Assume a linear model $Z = HS + \epsilon$
- Model building: know signal $S$, observe $Z$, infer $H$ (a.k.a. model identification, learning)
- Estimation: know $H$, observe $Z$, estimate $S$
Hypothesis testing: unknown takes one of few possible values; aim at small probability of incorrect decision
Estimation: aim at a small estimation error

Time Series Estimation Tasks

Filtering: To recover the state vector $S_t$ given $Z^t$
Prediction: To predict $S_{t+h}$ or $Z_{t+h}$ for $h > 0$, given $Z^t$
Smoothing: To estimate $S_t$ given $Z^T$ , where $T > t$

Property of Multivariate Normal Under normality, we have

that normal prior plus normal likelihood results in a normal posterior,
that if the random vector $(X, Y )$ are jointly normal \[\begin{bmatrix} X\\ Y \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_x\\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx}&\Sigma_{xy}\\ \Sigma_{yx}&\Sigma_{yy}\\ \end{bmatrix}\right),\]
then the conditional distribution of $X$ given $Y = y$ is normal \[X\mid Y=y = y\sim N\left[\mu_x+\Sigma_{xy}\Sigma_{yy}^{-1}(y-\mu_y),\Sigma_{xx}-\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right].\]

Let’s apply this relation to the state-space model. Say we have the estimate for the state of the system $S_0 \sim N(\mu_0,P_0)$. We want to have an estimate for the next state $S$. A priory (from the system dynamics model) $S \sim N(F\mu_0, FP_0F^T + V)$, we denote the a priori variance by $P' = FP_0F^T + V$. Now, we recieve a measurement and we want to find the posterior parameters, $S \mid Z \sim(\mu,P)$. The joint distribution $P(S, Z)$ is given by \[ \begin{bmatrix}S\\Z\end{bmatrix} \sim N \left( \begin{bmatrix} F\mu_0\\ H\mu_0 \end{bmatrix}, \begin{bmatrix} P' & PH^T\\ HP' & HP'H^T+W \end{bmatrix} \right) \] We will denote the measurement covariance by $S = HP'H^T+W$ and we introduce a new intermidiate variable, called Kalman gain $K = P'H^TS^{-1}$ (which corresponds to $\Sigma_{xy}\Sigma_{yy}^{-1}$ in the formula above). Now, we can calculate the posterior $S\mid Z$ covariance \[ S\mid Z \sim N\left(F\mu_0 + K(Z - H\mu_0),(I-KH)P'\right) \] At the the beginning of applying the Kalman filter, you need to choose your initial mean and variance for the state. Effect of the initial values is decreasing as $t$ increases. To guaranty the stationary (stability), all eigenvalues of the coefficient matrix $F$ are less than one in modulus

Traditional stats models as state-space models

We start with a loclal trend model \[ \begin{aligned} y_t =& \mu_t + e_t,~e_t \sim N(0,\sigma_e^2)\\\ \mu_{t+1} =& \mu_t + \eta_t,~ \eta_t \sim N(0,\sigma_{\eta}^2 ) \end{aligned} \]

$e_t$ and $\eta_t$ are iid Gaussian white noise
$\mu_0$ is given (possible as a distributed value)
trend $\mu_t$ is not observable
we observe some noisy version of the trend $y_t$

Such a model can be used to analyze realized volatility: $\mu_t$ is the log volatility and $y_t$ is constructed from high frequency transactions data. This model is equivalent to several types of ARIMA, depending on the errors:

if $\sigma_e=0$, then we have ARIMA(0,1,0) model
if $\sigma_e > 0$, then we have ARIMA(0,1,1) model, satisfying \[(1-B)y_t = (1-\theta B)a_t, ~ a_t \sim N(0,\sigma_a^2)\] $\sigma_a$ and $\theta$ are determined by $\sigma_e$ and $\sigma_{\eta}$ \[(1-B)y_t = \eta_{t-1} + e_t - e_{t-1}\]

Let’s demonstrate the Kalman filtering applied to the local trend model. Instead of implementing filtering from scratch, we will use implementation from the dlm package. Consider the Nile river data set. We assume that state dynamics is a random walk plus noise, with unknown system and observation variances. Parametrizing variances on a log scale, to ensure positivity, the model can be build using the function defined below. Our model takes into account a jump in the flow of the river following the construction of Ashwan dam in 1898. This can be done by inflating the system variance in 1899 using a multiplier bigger than one.

library(dlm)
buildFun <- function(x) {
  m <- dlmModPoly(1, dV = exp(x[1]))
  m$JW <- matrix(1)
  m$X <- matrix(exp(x[2]), nc = 1, nr = length(Nile))
  j <- which(time(Nile) == 1899)
  m$X[j,1] <- m$X[j,1] * (1 + exp(x[3]))
  return(m)
}

Then we fit the model using MLE estimator

fit <- dlmMLE(Nile, parm = c(0,0,0), build = buildFun)
fit$conv # check if converged

## [1] 0

dlmNileJump <- buildFun(fit$par) # assign parsmeters to the model
V(dlmNileJump)

##          [,1]
## [1,] 16300.33

W(dlmNileJump)

## Warning in W.dlm(dlmNileJump): Time varying 'W'

##      [,1]
## [1,]    1

Now, we apply a Kalman filter to this model

nileJumpFilt <- dlmFilter(Nile, dlmNileJump)
plot(Nile, type = 'o', col = "seagreen")
lines(dropFirst(nileJumpFilt$m), type = 'o',
      pch = 20, col = "brown")

What about Kalman smoothing?

nileJumpSmooth <- dlmSmooth(nileJumpFilt) 
plot(Nile, type = 'o', col = "seagreen")
attach(nileJumpSmooth)
lines(dropFirst(s), type = 'o', pch = 20, col = "brown")
v <- unlist(dlmSvd2var(U.S, D.S))
pl <- dropFirst(s) + qnorm(0.05, sd = sqrt(v[-1]))
pu <- dropFirst(s) + qnorm(0.95, sd = sqrt(v[-1]))
detach()
lines(pl, lty = 2, col = "brown")
lines(pu, lty = 2, col = "brown")

plot(Nile, type = 'o', col = "seagreen")
for (i in 1:10) # 10 simulated "true" levels
  lines(dropFirst(dlmBSample(nileJumpFilt)), col = "brown")

Next, consider liner regression model with time dependent parameters

\[ \begin{aligned} y_t &= \alpha_t + \beta_t \, x_t + \epsilon_t \qquad & \epsilon_t \, \sim N(0,\sigma^2) \\ \alpha_t &= \quad \alpha_{t-1} + \epsilon_t^{\alpha} \qquad & \epsilon_t^{\alpha} \sim N(0,\sigma_{\alpha}^2) \\ \beta_t &= \quad \beta_{t-1} + \epsilon_t^{\beta} \qquad & \epsilon_t^{\beta} \sim N(0, \sigma_{\beta}^2) \\ \end{aligned} \]

We treat the coefficients as unobserved state and assume the dynamics of the variables is a simple random walk, unless we have a better guess about how coefficients change from one time step to another. In practice, often, random walk is our best guess.

dlm Package

dlmModARMA: for an ARMA process, potentially multivariate
dlmModPoly: for an $n^{th}$ order polynomial
dlmModReg : for Linear regression
dlmModSeas: for periodic – Seasonal factors
dlmModTrig: for periodic – Trigonometric form

Local Linear Trend \[\begin{aligned} y_t &= \qquad \quad \mu_t + \upsilon_t \quad &\upsilon_t \sim N(0,V) \\ \mu_t &= \mu_{t-1} + \delta_{t-1} + \omega_t^{\mu} \quad & \omega_t^{\mu} \sim N(0,W^{\mu}) \\ \delta_t &= \qquad \,\, \, \delta_{t-1} + \omega_t^{\delta} \quad & \omega_t^{\delta} \sim N(0,W^{\delta}) \\ \end{aligned}\]

Simple exponential smoothing with additive errors \[x_t = \ell_{t-1} + \varepsilon_t\] \[\ell_t=\ell_{t-1}+\alpha \varepsilon_t.\]

Holt’s linear method with additive errors \[\begin{aligned} y_t&=\ell_{t-1}+b_{t-1}+\varepsilon_t\\ \ell_t&=\ell_{t-1}+b_{t-1}+\alpha \varepsilon_t\\ b_t&=b_{t-1}+\beta \varepsilon_t, \end{aligned}\]

Relation to ARMA models Consider relation with ARMA models. The basic relations are

an ARMA model can be put into a state space form in "infinite" many ways;
for a given state space model in, there is an ARMA model.

State space model to ARMA model The second possibility is that there is an observational noise. Then, the same argument gives \[(1+\alpha_1B+\cdots+\alpha_mB^m)(Z_{t+m}-\epsilon_{t+m})=(1-\theta_1B-\cdots -\theta_{m-1}B^{m-1})a_{t+m}\] By combining $\epsilon_t$ with $a_t$ , the above equation is an ARMA$(m, m)$ model.

ARMA model to state space model: AR(2) \[Z_t=\phi_1Z_{t-1}+\phi_2Z_{t-2}+a_t\] For such an AR(2) process, to compute the forecasts, we need $Z_{t-1}$ and $Z_{t-2}$ . Therefore, it is easily seen that \[\begin{bmatrix} Z_{t+1}\\ Z_t \end{bmatrix} = \begin{bmatrix} \phi_1 & \phi_2\\ 1 & 0 \end{bmatrix} \begin{bmatrix} Z_t\\ Z_{t-1} \end{bmatrix} + \begin{bmatrix} 1\\ 0 \end{bmatrix} e_t,\] where $e_t = a_{t+1}$ and \[Z_t=[1, 0]S_t\] where $S_t = (Z_t , Z_{t-1})^T$ and there is no observational noise.

ARMA model to state space model: MA(2) \[Z_t=a_t-\theta_1a_{t-1}-\theta_2a_{t-2}\] Method 1:
\[\begin{bmatrix} a_t\\ a_{t-1} \end{bmatrix} = \begin{bmatrix} 0 & 0\\ 1 & 0 \end{bmatrix} \begin{bmatrix} a_{t-1}\\ a_{t-2} \end{bmatrix} + \begin{bmatrix} 1\\ 0 \end{bmatrix} a_t\] \[Z_t=[-\theta_1, -\theta_2]S_{t-1} + a_t\] Here the innovation $a_t$ shows up in both the state transition equation and the observation equation. The state vector is of dimension 2.

ARMA model to state space model: MA(2) Method 2: For an MA(2) model, we have \[\begin{aligned} Z_{t}^t&=Z_t\\ Z_{t+1}^t&=-\theta_1a_t-\theta_2a_{t-1}\\ Z_{t+2}^t&= -\theta_2a_t \end{aligned}\] Let $S_t = (Z_t , -\theta_1 a_t - \theta_2 a_{t-1} , -\theta_2 a_t )^T$ . Then, \[S_{t+1}= \begin{bmatrix} 0 & 1& 0\\ 0& 0& 1\\ 0& 0& 0 \end{bmatrix} S_t+ \begin{bmatrix} 1\\ -\theta_1\\ -\theta_2 \end{bmatrix} a_{t+1}\] and \[Z_t=[1,0,0]S_t\] Here the state vector is of dimension 3, but there is no observational noise.

ARMA model to state space model: Akaike’s approach Consider ARMA$(p, q)$ process, let $m = max\{p, q + 1\}$, $\phi_i = 0$ for $i > p$ and $\theta_j = 0$ for $j > q$. \[S_t = (Z_t , Z_{t+1}^t , Z_{t+2}^t ,\cdots , Z_{t+m-1}^t )^T\] where $Z_{t+\ell}^t$ is the conditional expectation of $Z_{t+\ell}$ given $\Psi_t = \{Z_t , Z_{t-1} , \cdots\}$. By using the updating equation $f$ forecasts (recall what we discussed before) \[Z_{t+1}(\ell -1)=Z_t(\ell)+\psi_{\ell-1}a_{t+1},\]

ARMA model to state space model: Akaike’s approach \[S_t = (Z_t , Z_{t+1}^t , Z_{t+2}^t ,\cdots , Z_{t+m-1}^t )^T\] \[S_{t+1}=FS_t+Ga_{t+1}\] \[Z_t=[1,0, \cdots ,0]S_t\] where \[F= \left[ \begin{array}{c|cccc} 0 &1& 0& \cdots& 0\\ 0 & 0 & 1 & \cdots & 0\\ \vdots & \vdots & & \\ \phi_m & \phi_{m-1} & \cdots & \phi_2 & \phi_1 \end{array}\right], G= \begin{bmatrix} 1\\ \psi_1\\ \psi_2\\ \vdots\\ \psi_{m-1} \end{bmatrix}\] The matrix $F$ is call a companion matrix of the polynomial $1 - \phi_1 B - \cdots - \phi_m B^m$.

ARMA model to state space model: Aoki’s Method Two-step procedure: First, consider the MA$(q)$ part: \[W_t = a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q}\] \[\begin{bmatrix} a_t\\ a_{t-1}\\ \vdots\\ a_{t-q+1} \end{bmatrix} = \begin{bmatrix} 0&0&\cdots & 0&0\\ 1&0& \cdots&0&0\\ \vdots & & & &\\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix} \begin{bmatrix} a_{t-1}\\ a_{t-2}\\ \vdots\\ a_{t-q} \end{bmatrix} + \begin{bmatrix} 1\\ 0\\ \vdots\\ 0 \end{bmatrix} a_t\] \[W_t=[-\theta_1,-\theta_2, \cdots, -\theta_q]S_t+a_t\]

ARMA model to state space model: Aoki’s Method First, consider the AR$(p)$ part: \[Z_t = \phi_1 Z_{t-1} + ... + \phi_p Z_{t-p} + W_t\] Define state-space vector as \[S_t=(Z_{t-1},Z_{t-2}, \cdots , Z_{t-p},a_{t-1}, \cdots, a_{t-q})'\] Then, we have \[\begin{bmatrix} Z-t\\ Z_{t-1}\\ \vdots\\ Z_{t-p+1}\\ a_t\\ a_{t-1}\\ \vdots\\ a_{t-q+1} \end{bmatrix} = \left[ \begin{array}{cccc|cccc} \phi_1&\phi_2&\cdots&\phi_p&-\theta_1&-\theta_2&\cdots&-\theta_q\\ 1&0&\cdots&0&0&0&\cdots&0\\ \vdots &&&&\vdots&&&\\ 0&\cdots&1&0&0&0&\cdots&0\\ 0&0&\cdots&0&0&0&\cdots&0\\ 0&0&\cdots&0&1&0&\cdots&0\\ \vdots &&&&0&&&\\ 0&0&\cdots&0&0&\cdots&1&0 \end{array}\right] \begin{bmatrix} Z_{t-1}\\ Z_{t-2}\\ \vdots\\ Z_{t-p}\\ a_{t-1}\\ a_{t-2}\\ \vdots\\ a_{t-q} \end{bmatrix}+ \begin{bmatrix} 1\\ 0\\ \vdots\\ 0\\ 1\\ 0\\ \vdots\\ 0 \end{bmatrix}a_t\] and \[Z_t=[\phi_1,\cdots,\phi_p,-\theta_1,\cdots,-\theta_q]S_t+a_t\]

MLE Estimation Innovations are given by \[\epsilon_t = Z_t - HS_t^{t-1}\] can be shown that $\mathrm{var}(\epsilon_t) = \Sigma_t$, where \[\Sigma_t = HP_t^{t-1}H^T + R\] Incomplete Data Likelihood: \[-\ln L(\Theta) = \dfrac{1}{2}\sum_{t=1}^{n}\log|\Sigma_t(\Theta)| + \dfrac{1}{2}\sum_{t=1}^{n}\epsilon_t(\Theta)^T\Sigma(\Theta)^{-1}\epsilon_t(\Theta)\] Here $\Theta = (F, Q, R)$. Use BFGS to find a sequence of $\Theta$’s and stop when stagnation happens.

Kalman Smoother

Input: initial distribution $X_0$ and data $Z_1,...,Z_T$
Algorithm: forward-backward pass
Forward pass: Kalman filter: compute $S_{t+1}^t$ and $S_{t+1}^{t+1}$ for $0 \le t < T$
Backward pass: Compute $S_t^T$ for $0 \le t < T$

Backward Pass

Compute $X_t^T$ given $S_{t+1}^T \sim N(m_{t+1}^T,C_{t+1}^T)$
Reverse arrow: $S_t^t \leftarrow X_{t+1}^t$
Same as incorporating measurement in filter
- Compute joint $(S_t^t, S_{t+1}^t)$
- Compute conditional $(S_t^t \mid S_{t+1}^t )$
New: $S_{t+1}$ is not “known”, we only know its distribution: $S_{t+1} \sim S_{t+1}^T$
“Uncondition” on $S_{t+1}$ to compute $S_t^T$ using laws of total expectation and variance

Kalman Smoother A smoothed version of data (an estimate, based on the entire data set) If $S_n$ and $P_n$ obtained via Kalman recursions, then for $t=n,..,1$ \[\begin{aligned} S_{t-1}^t &= S_{t-1} + J_{t-1}(S_t^n - S_t^{t-1})\\ P_{t-1}^n &= P^{t-1} + J_{t-1}(P_t^n - P_t^{t-1})J^T_{t-1}\\ J_{t-1} & = P_{t-1}F^T[P_t^{t-1}]^{-1} \end{aligned}\]

Kalman and Histogran Filter Shortciomings Kalman:

linear dynamics
linear measurement model
normal errors
unimodal uncertainty

Histogram:

discrete states
approximation
inefficient in memory

MCMC Financial Econometrics

Set of tools for inference and pricing in continuous-time models.

Simulation-based and provides a unified approach to state and parameter inference. Can also be applied sequentially.
Can handle Estimation and Model risk. Important implications for financial decision making
Bayesian inference. Uses conditional probability to solve an inverse problem and estimates expectations using Monte Carlo.

Filtering, Smoothing, Learning and Prediction

Data $y_{t}$ depends on a , $x_{t}$. \[\begin{aligned} \text{Observation equation} & \text{:\ }y_{t}=f\left( x_{t},\varepsilon _{t}^{y}\right) \\ \text{State evolution} & \text{: }x_{t+1}=g\left( x_{t},\varepsilon _{t+1}^{x}\right) ,\end{aligned}\]

Posterior distribution of $p\left(x_{t}|y^{t}\right)$ where $y^{t}=\left( y_{1},...,y_{t}\right)$
Prediction and Bayesian updating. \[p\left( x_{t+1}|y^{t}\right) =\int p\left( x_{t+1}|x_{t}\right) p\left( x_{t}|y^{t}\right) dx_{t},\label{Predict}%\] updated by Bayes rule \[\underset{\text{Posterior}}{\underbrace{p\left( x_{t+1}|y^{t+1}\right) }% }\propto\underset{\text{Likelihood}}{\underbrace{p\left( y_{t+1}% |x_{t+1}\right) }}\underset{\text{Prior}}{\underbrace{p\left( x_{t+1}% |y^{t}\right) }}.\label{Update}%\]

Nonlinear Model

The observation and evolution dynamics are \[\begin{aligned} y_t & = \frac{x_t}{1+ x_t^2} + v_t \; , {\rm where} \; v_t \sim N(0,1) \\ x_t & = x_{t-1} + w_t \; , {\rm where} \; w_t \sim N(0,0.5)\end{aligned}\]
Initial condition $x_0 \sim N( 1 , 10 )$

Fundamental question:

How do the filtering distributions $p(x_t|y^t)$ propagate in time?

Nonlinear: $y_t = x_t / 1+ x_t^2 + v_t$

Simulate Data

# MC sample size
N = 1000

# Posterior at time t=0 p(x[0]|y[0])=N(1,10)

x = rnorm(N,1,sqrt(10))
hist(x,main="Pr(x[0]|y[0])")

# Obtain draws from prior p(x[1]|y[0])

x1 = x + rnorm(N,0,sqrt(0.5))
hist(x1,main="Pr(x[1]|y[0])")

# Likelihood function p(y[1]|x[1])

y1  = 5
ths = seq(-30,30,length=1000)
plot(ths,dnorm(y1,ths/(1+ths^2),1),type="l",xlab="",ylab="")
title(paste("p(y[1]=",y1,"|x[1])",sep=""))

Nonlinear Filtering

Resampling

Observe $y_1=5, y_2=-2$

Key: resample and propagate particles

# Computing resampling weights

w = dnorm(y1,x1/(1+x1^2),1)

# Resample to obtain draws from p(x[1]|y[1])

k = sample(1:N,size=N,replace=TRUE,prob=w)
x = x1[k]
hist(x,main="Pr(x[1]|y[1])")

# Obtain draws from prior p(x[2]|y[1])
x2 = x + rnorm(N,0,sqrt(0.5))
hist(x2,main="Pr(x[2]|y[1])")

...

Propagation of MC error

Dynamic Linear Model (DLM)

Kalman Filter

Kalman filter for linear Gaussian systems

FFBS (Filter Forward Backwards Sample)

This determines the posterior distribution of the states \[p( x_t | y^t ) \; {\rm and} \; p( x_t | y^T )\] Also the joint distribution $p( x^T | y^T )$ of the hidden states.
Discrete Hidden Markov Model HMM (Baum-Welch, Viterbi)
With parameters known the Kalman filter gives the exact recursions.

Simulate DLM

Dynamic Linear Models \[y_t = x_t + v_t \; \; {\rm and} \; \; x_t = \alpha + \beta x_{t-1} + w_t\] Simulate Data

T1=50; alpha=-0.01; beta=0.98; sW=0.1; sV=0.25; W=sW^2; V=sV^2

y = rep(0,2*T1)
ht = rep(0,2*T1)

for (t in 2:(2*T1)){
  ht[t] = rnorm(1,alpha+beta*ht[t-1],sW)
  y[t] = rnorm(1,ht[t],sV)
}
ht = ht[(T1+1):(2*T1)]
y = y[(T1+1):(2*T1)]

Exact calculations

Kalman Filter recursions

m=rep(0,T1);C=rep(0,T1); m0=0; C0=100

R = C0+W; Q = R+V; A = R/Q
m[1] = m0+A*(y[1]-m0); C[1] = R-A^2*Q

for (t in 2:T1){
  R    = C[t-1]+W
  Q    = R+V
  A    = R/Q
  m[t] = m[t-1]+A*(y[t]-m[t-1])
  C[t] = R-A^2*Q
}

# Quantiles

q025 = function(x){quantile(x,0.025)}
q975 = function(x){quantile(x,0.975)}
h025 = apply(hs,2,q025)
sd   = sqrt(apply(hs,2,var))

DLM Data

Bootstrap Filter

Now run the particle filter

# Bootstrap filter

M = 1000
h = rnorm(M,m0,sqrt(C0))
hs = NULL

for (t in 1:T1){
  h1 = rnorm(M,alpha+beta*h,sW)
  w  = dnorm(y[t],h1,sV)
  w  = w/sum(w)
  h  = sample(h1,size=M,replace=T,prob=w)
  hs = cbind(hs,h)
}

DLM Filtering

Streaming Data

How do Parameter Distributions change in Time?

{.columns} Online Dynamic Learning

Real-time surveillance
Bayes means sequential updating of information
Update posterior density ${p}(\mbox{\boldmath{$$}} \mid y_t )$ with every new observation ($t = 1, \ldots, T$) - “sequential learning”

Bayes theorem: \[p(\mbox{\boldmath{$\theta$}} \mid y^t) \propto p(y_t \mid \mbox{\boldmath{$\theta$}}) \, p(\mbox{\boldmath{$\theta$}} \mid y^{t-1})\]

Galton 1877: First Particle Filter

Streaming Data

Online Learning

Construct an essential state vector $Z_{t+1}$. \[\begin{aligned} p(Z_{t+1}|y^{t+1}) &= \int p(Z_{t+1}|Z_t, y_{t+1}) \;d\mathbb{P}(Z_t|y^{t+1}) \\ &\propto \int \underbrace{p(Z_{t+1}|Z_t, y_{t+1})}_{propagate} \overbrace{ \underbrace{p(y_{t+1} | Z_t)}_{resample} \;d\mathbb{P}(Z_t|y^t)}\end{aligned}\]

Re-sample with weights proportional to $p(y_{t+1} | Z_t^{(i)})$ and generate $\{Z_t^{\zeta(i)}\}_{i=1}^N$
Propagate with $Z_{t+1}^{(i)} \sim p(Z_{t+1}|Z_t^{\zeta(i)}, y_{t+1})$ to obtain $\{Z_{t+1}^{(i)}\}_{i=1}^N$

Parameters: $p( \theta | Z_{t+1} )$ drawn “offline”

Sample – Resample

Resample – Sample

Particle Methods: Blind Propagation

{r, fig.cap = "Propagate-Resample is replaced by Resample-Propagate", out.width="100%"} knitr::include_graphics("fig/smc.png")

Traffic Problem State-Space

Wave Speed Propagation is a Mixture Distribution Shock wave propagation speed is a mixture, when calculated using Godunov scheme \[w = \frac{q(\rho_l) - q(\rho_r)}{\rho_l-\rho_r} \ \left[\frac{mi}{h}\right] = \left[\frac{veh}{h}\right]\left[\frac{mi}{veh}\right].\] Assume $\rho_l \sim TN(32, 16, 0, 320)$ and $\rho_r \sim TN(48, 16, 0,320)$
$q_c = 1600 \ veh/h$, $\rho_c = 40 \ veh/mi$, and $\rho_{jam} = 320 \ veh/mi$

Traffic Flow Speed Forecast is a Mixtrue Dsitribution Theorem: The solution (including numerical) to the LWR model with stochastic initial conditions is a mixture distribution.

A moment based filters such as Kalman Filter or Extended Kalman Filter would not capture the mixture.

Problem at Hand

The Parameter Learning and State Estimation Problem

Goal: given sparse sensor measurements, find the distribution over traffic state and underlying traffic flow parameters $p(\theta_t, \phi|y_1, y_2,...,y_t); \ \phi=(q_c,\rho_c)$
Parameters of the evolution equation (LWR) are stochastic
Distribution over state is a mixture
Can’t use moment based filters (KF, EKF,...)

Data Assimilation

State Space Representation

State space formulation allows to combine knowledge from analytical model with the one from field measurements, while taking model and measurement errors into account

State Space Representation

State vector $\theta_t = ( \rho_{1t} , \ldots , \rho_{nt} )$
Boundary conditionals $\rho_{0t}$ and $\rho_{(n+1)t}$
Underlying parameters $\phi = (q_c, \rho_c)$ are stochastic

\[\begin{aligned} \mbox{Observation: }&y_{t+1} = H\theta_{t+1} + v; \ v \sim N(0,V) \label{eqn-y}\\ \mbox{Evolution: }&\theta_{t+1} = f_{\phi}(\theta_t) + w; \ w \sim N(0,W) \label{(eqn-x)} \end{aligned}\] $H: \mathbb{R}^{M} \rightarrow \mathbb{R}^k$ in the measurement model. $\phi = (q_c,\rho_c, \rho_{max})$.

Parameter priors: $q_c \sim N(\mu_q, \sigma_c^2)$, $\rho_c = Uniform(\rho_{min}, \rho_{max})$

Particle Parameter Learning

Sample-based PDF Representation

Regions of high density: Many particles and Large weight of particles
Uneven partitioning
Discrete approximation for continuous pdf

\[p^{N}\left( \theta_{t+1}|y^{t+1}\right) \propto \sum_{i=1}^{N}w_{t}^{\left( i\right) }p\left( \theta_{t+1}|\theta_t^{\left( i\right) },y_{t+1}\right)\]

PF

Bayes Rule \[p(y_{t+1},\theta_{t+1}|\theta_t)=p(y_{t+1}|\theta_t)\,p(\theta_{t+1}|% \theta_t,y_{t+1}).\]

Given a particle approximation to $p^{N}\left( \theta_t|y^{t}\right)$ \[\begin{aligned} p^{N}\left( \theta_{t+1}|y^{t+1}\right) & \propto\sum_{i=1}^{N}p\left( y_{t+1}|\theta_t^{\left( i\right) }\right) p\left( \theta_{t+1}|\theta_t^{\left( i\right) },y_{t+1}\right) \label{Mixture2}\\ & =\sum_{i=1}^{N}w_{t}^{\left( i\right) }p\left( \theta_{t+1}|\theta_t^{\left( i\right) },y_{t+1}\right) \text{,}% \end{aligned}\] where \[w_{t}^{\left( i\right) }=\frac{p\left( y_{t+1}|\theta_t^{\left( i\right) }\right) }{\sum_{i=1}^{N}p\left( y_{t+1}|\theta_t^{\left( i\right) }\right) }\text{.}%\]
Essentially a mixture Kalman filter

Particle Parameter Learning

Given particles (a.k.a. random draws) $(\theta^{(i)}_t,\phi^{(i)},s^{(i)}_t),$ $i=1,\ldots,N$ \[p( \theta_t | y_{1:t} ) = \frac{1}{N} \sum_{i=1}^N \delta_{ \theta^{(i)} } \; .\]

First resample $(\theta^{k(i)}_t,\phi^{k(i)},s^{k(i)}_t)$ with weights proportional to $p(y_{t+1}|\theta^{k(i)}_t,\phi^{k(i)})$ and $s_t^{k(i)}=S(s^{(i)}_t,\theta^{k(i)}_t,y_{t+1})$ and then propogate to $p(\theta_{t+1}|y_{1:t+1})$ by drawing $\theta^{(i)}_{t+1}\,$from $p(\theta_{t+1}|\theta^{k(i)}_t,\phi^{k(i)},y_{t+1}),\,i=1,\ldots,N$.
Next we update the sufficient statistic as \[s_{t+1}=S(s_t^{k(i)},\theta^{(i)}_{t+1},y_{t+1}),\] for $i=1,\ldots,N$, which represents a deterministic propogation.
Finally, parameter learning is completed by drawing $\phi^{(i)}$ using $p(\phi|s^{(i)}_{t+1})$ for $i=1,\ldots,N$.

Galton 1877: First Particle Filter

Streaming Data

Online Learning

Re-sample with weights proportional to $p(y_{t+1} | Z_t^{(i)})$ and generate $\{Z_t^{\zeta(i)}\}_{i=1}^N$
Propagate with $Z_{t+1}^{(i)} \sim p(Z_{t+1}|Z_t^{\zeta(i)}, y_{t+1})$ to obtain $\{Z_{t+1}^{(i)}\}_{i=1}^N$

Parameters: $p( \theta | Z_{t+1} )$ drawn “offline”

Resample – Propagate

Algorithm

These ingredients then define a particle filtering and learning algorithm for the sequence of joint posterior distributions $p( \theta_t , \phi | y_{1:t} )$: \[\begin{aligned} & \text{Step 1. (Resample) Draw an index } k_t \left( i\right) \sim Mult_{N}\left( w_{t}^{\left( 1\right) },...,w_{t}^{\left( N\right) }\right), \\ & \mbox{where the weights are given by } w_t^{(i)} \propto p(y_{t+1}|(\theta_t,\phi)^{(i)}), \ \text{ for }i=1,...,N\\ \text{ } & \text{Step 2. (Propagate) Draw }\theta_{t+1}^{\left( i\right) }\sim p\left( \theta_{t+1}|\theta_t^{k_t \left( i\right) },y_{t+1}\right) \text{ for }i=1,...,N.\\ \text{ } & \text{Step 3. (Update) } s_{t+1}^{(i)} =S(s_t^{k_t(i)},\theta_{t+1}^{(i)},y_{t+1})\\ \text{ } & \text{Step 4. (Replenish) } \phi^{(i)} \sim p( \phi | s_{t+1}^{(i)} ) \end{aligned}\] There are a number of efficiency gains from such an approach, e.g. it does not suffer from degeneracy problems associated with traditional propagate-resample algorithms when $y_{t+1}$ is an outliers.

Obtaining state estimates from particles

Any estimate of a function $f(\theta_t)$ can be calculated by discrete PDF-approximation \[\mbox{E}(f(\theta_t)) = \dfrac{1}{N}\sum_{j=1}^{N}w_t^{(j)}f(\theta_t^{(j)})\]
Mean: \[\mbox{E}(\theta_t) = \dfrac{1}{N}\sum_{j=1}^{N}w_t^{(j)}\theta_t^{(j)}\]
MAP-estimate: particle with largest weight
Robust mean: mean within window around MAP-estimate

Particle Filters: Pluses

Estimation of full PDFs
Non-Gaussian distributions (multi-modal)
Non-linear state and observation model
Parallelizable

Particle Filters: Minuses

Degeneracy problem
High number of particles needed
Computationally expensive
Linear-Gaussian assumption is often sufficient

Applications: Localization

Track car position in given road map
Track car position from radio frequency measurements
Track aircraft position from estimated terrain elevation
Collision Avoidance (Prediction)

Applications: Model Estimation

Tracking with multiple motion-models
Recovery of signal from noisy measurements
Neural Network model selection (on-line classification)

Applications: Other

Visual Tracking
Prediction of (financial) time series
Quality control in semiconductor industry
Military applications: Target recognition from single or multiple images, Guidance of missiles
Reinforcement Learning

Mixture Kalman Filter For Traffic

\[\begin{aligned} \mbox{Observation: }&y_{t+1} = Hx_{t+1} + \gamma^Tz_{t+1} + v_{t+1} , \ v_{t+1} \sim N(0, V_{t+1})\\ \mbox{Evolution: }&x_{t+1} = F_{\alpha_{t+1}}x_t + (1-F_{\alpha_{t+1}})\mu + \alpha_t\beta_{t} + \omega_{1} \\ &\beta_{t+1} = \max(0,\beta_{t} + \omega_{2} \label{eqn-beta})\\ \mbox{Switching Evolution: }&\alpha_{t+1} \sim p(\alpha_{t+1} |\alpha_{t},Z_{t}) \end{aligned}\] where $z_t$ is an exogenous variable that effects the sensor model, $\mu$ is an average free flow speed \[\alpha_t \in \{0,1,-1\}\] \[\omega = (\omega_{1}, \omega_{2})^T \sim N(0, W), \ v \sim N(0,V)\] \[F_{\alpha_t} = \left\{ \begin{aligned} &1, \ \alpha_t \in \{1,-1\}\\ &F, \ \alpha_t = 0 \end{aligned} \right.\] No boundary conditions estimation is needed. No capacity/critical density is needed.

Bayes AI

Vadim Sokolov

Fall 2021

Kalman Filter Basics

State space model

Traditional stats models as state-space models

MCMC Financial Econometrics

Filtering, Smoothing, Learning and Prediction

Nonlinear Model

Nonlinear: \(y_t = x_t / 1+ x_t^2 + v_t\)

Simulate Data

Nonlinear Filtering

Resampling

Observe \(y_1=5, y_2=-2\)

Propagation of MC error

Dynamic Linear Model (DLM)

Kalman Filter

Simulate DLM

Exact calculations

DLM Data

Bootstrap Filter

DLM Filtering

Streaming Data

How do Parameter Distributions change in Time?

Galton 1877: First Particle Filter

Streaming Data

Online Learning

Sample – Resample

Resample – Sample

Particle Methods: Blind Propagation

The Parameter Learning and State Estimation Problem

Data Assimilation

State Space Representation

State Space Representation

Particle Parameter Learning

PF

Particle Parameter Learning

Galton 1877: First Particle Filter

Streaming Data

Online Learning

Resample – Propagate

Algorithm