\[ \definecolor{gray}{RGB}{192,192,192} \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\prob{\text{P}} \def\expec{\text{E}} \def\part#1#2{#1^{(#2)}} \def\rR#1{\overrightarrow{#1}} \def\lR#1{\overleftarrow{#1}} \def\bR#1{\overleftrightarrow{#1}} \def\apxprop{\underset{\displaystyle\sim}{\propto}} \]
Motivate interest in extrapolation.
Alternative estimation methods.
Importance sampling and particle filtering.
as implemented in the dynamichazard package.
as implemented in the mssm package.
\(\tilde T_i\) is the time to event for individual \(i\).
\(C_i\) is the censoring time and \(T_i = \min(\tilde T_i, C_i)\) is the observed time. Define the event indicator \(D_i = 1_{\{\tilde T_i < C_i\}}\).
\(\lfloor T_i \rfloor\) is the floor of \(T_i\).
The exact specification is implicitly given by the context.
Could also be for a group of conditionally correlated individuals or an individual with recurrent events.
Interested in, e.g.,
\[\prob\Cond{\tilde T_i \leq \bar t + \Delta t}{\tilde T_i \geq \bar t, \vec x_{i\lfloor \bar t \rfloor}}, \qquad \Delta t > 0\]
Assume from now on that \(\bar t\) is an integer.
Time-invariant model like
\[ \prob\Cond{\tilde T_i \leq k + 1} {\tilde T_i \geq k, \vec x_{ik}} = 1 - \exp(-\vec\beta^\top\vec x_{ik}) \]
like the Royston-Parmar model (Royston and Parmar 2002).
\[ \begin{align*} \Delta t &\in (0,1) \\ \prob\Cond{\tilde T_i \leq k + \Delta t} {\tilde T_i \geq k, \vec x_{ik}, \vec z_{ik}, \vec u_k} \hspace{-70pt}&\\ &= 1 - \exp\left(-(\vec\beta^\top\vec x_{ik} + \vec u_k^\top\vec z_{ik}) \Delta t\right) \\ \vec U_k \mid \vec U_{k - 1} = \vec u_{k - 1} &\sim N(\mat F \vec u_{k -1} , \mat\Sigma) \end{align*} \]
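To make the setup concrete, here is a minimal R sketch simulating event times from the model above; all parameter values are hypothetical, the random effect is a scalar intercept, and the hazard is truncated at a small positive value since the identity link does not guarantee positivity.

```r
# Minimal sketch with hypothetical parameter values: simulate event times
# from the piecewise-constant hazard model with an AR(1) random intercept.
set.seed(1)
n <- 100L                  # individuals
t_bar <- 10L               # time periods
F. <- .9; sigma <- .1      # state transition and innovation standard deviation
beta <- c(.5, .25)         # fixed effect coefficients
x <- cbind(1, runif(n))    # fixed effect design matrix

# simulate the random effects; the first draw is from the time-invariant
# distribution of the AR(1) process
u <- numeric(t_bar)
u[1] <- rnorm(1, sd = sigma / sqrt(1 - F.^2))
for(k in 2:t_bar)
  u[k] <- rnorm(1, F. * u[k - 1], sigma)

# simulate event times; individual i's hazard in period k is
# beta' x_i + u_k, which the identity link does not keep positive, so we
# truncate at a small value in this sketch
event_time <- rep(Inf, n)
for(k in 1:t_bar){
  at_risk <- is.infinite(event_time)
  if(!any(at_risk)) break
  rate <- pmax(drop(x[at_risk, , drop = FALSE] %*% beta) + u[k], 1e-8)
  draw <- rexp(sum(at_risk), rate)   # exponential waiting time in the period
  hit <- draw <= 1
  event_time[at_risk][hit] <- k - 1 + draw[hit]
}
```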
with a random mean function; typically with a simpler random effect structure.
May have, e.g., unobserved time-varying macro or industry effects.
\[\begin{align*} T_{ik} &= \min(\max(T_i - k, 0), 1) \\ D_{ik} &= 1_{\{\tilde T_i < C_i\wedge T_i \in (k, k + 1]\}} \end{align*}\]
\[ \begin{align*} \vec t_k &= (t_{1k}, \dots, t_{nk})^\top & \vec d_k &= (d_{1k}, \dots, d_{nk})^\top \\ \vec t_{s:k} &= (\vec t_s, \vec t_{s + 1}, \dots, \vec t_k) \end{align*} \]
\[ \begin{align*} f(\vec t_{1:\bar t}, \vec d_{1:\bar t}, \vec u_{1:\bar t}) &= \mu(\vec u_1) g(\vec t_1, \vec d_1; \mat X_1, \mat Z_1, \vec u_1) \\ &\hspace{20pt}\cdot\prod_{k = 2}^{\bar t} g(\vec t_k, \vec d_k; \mat X_k, \mat Z_k, \vec u_k) \phi(\vec u_k; \mat F\vec u_{k - 1}, \mat\Sigma) \end{align*} \]
where \(\vec U_k\in\mathbb{R}^{d}\), \(\mu\) is the time-invariant distribution, \(\mat X_s\) and \(\mat Z_s\) are design matrices, \(g\) is a suitable conditional density function, and \(\phi\) is a multivariate normal density function.
MQL, PQL, and Laplace approximations.
Extended and unscented Kalman filters.
Variational approximations.
Quadrature.
Markov chain Monte Carlo.
Importance sampling.
Also known as sequential Monte Carlo.
Expectation maximization (Dempster, Laird, and Rubin 1977). \[ \begin{align*} Q\Cond{\vec\beta, \mat\Sigma} {\widehat{\vec\beta}, \widehat{\mat\Sigma}}\hspace{-30pt} &\\ &= \expec_{(\widehat{\vec\beta}, \widehat{\mat\Sigma})} \Cond{\log f(\vec T_{1:\bar t}, \vec D_{1:\bar t}, \vec U_{1:\bar t}; \vec\beta, \mat\Sigma)} {\vec t_{1:\bar t}, \vec d_{1:\bar t}} \end{align*} \]
using an approximation of \(L'(\vec\beta, \mat \Sigma)\).
Estimate of \(L''(\vec\beta, \mat \Sigma)\) at the maximum.
Want \(\expec(c(x)) = \int c(x)h(x)\der x\) where \(h = \tilde h/\alpha\); \(h\) and \(\alpha\) are unknown and \(\tilde h\) is known. Sample \(\part{x}{j} \sim q\) and set \(\part{W}{j} \propto \tilde h(\part{x}{j})/q(\part{x}{j})\), normalized such that \(\sum_{j = 1}^N \part{W}{j} = 1\).
Yields the discrete approximation \[ \expec (c(x)) \approx \sum_{j = 1}^N \part{W}{j} c(\part{x}{j}) \]
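A minimal R sketch of self-normalized importance sampling; the target and proposal below are hypothetical choices for illustration.

```r
# Minimal sketch of self-normalized importance sampling with a toy target:
# h_tilde is an unnormalized N(2, 1) density and q is a t-distribution.
set.seed(1)
N <- 10000L
h_tilde <- function(x) exp(-(x - 2)^2 / 2)  # known unnormalized target
q_sample <- function(n) rt(n, df = 5) + 2   # proposal draws
q_dens <- function(x) dt(x - 2, df = 5)     # proposal density

x <- q_sample(N)
w <- h_tilde(x) / q_dens(x)                 # unnormalized weights
W <- w / sum(w)                             # normalized weights W^(j)
sum(W * x)                                  # approximates E(X) = 2
```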
With only one time period
\[ \expec_{(\widehat{\vec\beta}, \widehat{\mat\Sigma})} \Cond{\log \mu(\vec U_1) + \log g(\vec T_1, \vec D_1; \mat X_1, \mat Z_1, \vec U_1)} {\vec t_1, \vec d_1} \] requires \[p\Cond{\vec u_1}{\vec t_1, \vec d_1} = \frac{\mu(\vec u_1)g(\vec t_1, \vec d_1; \mat X_1, \mat Z_1, \vec u_1)} {p(\vec t_1, \vec d_1)}\]
Some \(\part{W}{j}\) may be very small.
Can resample, setting the new weights to \(N^{-1}\).
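A minimal R sketch of this remedy: monitor the effective sample size and resample (multinomially here) when it drops below a rule-of-thumb threshold.

```r
# Minimal sketch: resample when the effective sample size is low and
# reset all weights to 1 / N. The N / 2 threshold is a rule of thumb.
ess <- function(W) 1 / sum(W^2)
resample <- function(x, W){
  idx <- sample.int(length(W), replace = TRUE, prob = W)
  list(x = x[idx], W = rep(1 / length(W), length(W)))
}

set.seed(1)
x <- rnorm(100L)
W <- runif(100L); W <- W / sum(W)
if(ess(W) < length(W) / 2){
  tmp <- resample(x, W)
  x <- tmp$x
  W <- tmp$W
}
```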
\(q\) needs to be fast to sample from and evaluate.
\(q\) needs to be a good approximation of \(h\).
E.g., antithetic variables as used in the mssm package; see Durbin and Koopman (1997).
Often inaccurate in higher dimensions.
Cover smoothers in the dynamichazard package.
\[ \begin{align*} Q\Cond{\vec\beta, \mat\Sigma} {\widehat{\vec\beta}, \widehat{\mat\Sigma}} \hspace{-30pt}& \\ &=\expec_{(\widehat{\vec\beta}, \widehat{\mat\Sigma})} \Big( \log \mu(\vec U_1) + \log g(\vec T_1, \vec D_1; \mat X_1, \mat Z_1, \vec U_1) \\ &\hspace{50pt} +\sum_{k = 2}^{\bar t} \log g(\vec T_k, \vec D_k; \mat X_k, \mat Z_k, \vec U_k) \\ &\hspace{70pt}+ \log\phi(\vec U_k; \mat F\vec U_{k-1}, \mat\Sigma) \,\Big|\, \vec t_{1:\bar t}, \vec d_{1:\bar t}\Big) \end{align*} \]
Need \(p\Cond{\vec U_{k-1:k}}{\vec t_{1:\bar t}, \vec d_{1:\bar t}}\).
(Forward) particle filter \((\part{\rR{\vec u}}{s}_{1:k}, \part{\rR{W}}{s}_{k})_{s=1,\dots,N}\).
Backward particle filter \((\part{\lR{\vec u}}{s}_k, \part{\lR{W}}{s}_k)_{s=1,\dots,N}\).
Smoothed particles \((\part{\bR{\vec u}}{s}_k, \part{\bR{W}}{s}_k)_{s=1,\dots,N}\).
Have some estimate of \(p\Cond{\vec u_{1:k-1}}{\vec t_{1:k-1}, \vec d_{1:k-1}}\) of the form \[ \hat p\Cond{\vec u_{1:k-1}}{\vec t_{1:k-1}, \vec d_{1:k-1}} = \sum_{s = 1}^N \part{W}{s}_{k-1} \delta\left(\vec u_{1:k-1} - \part{\vec u}{s}_{1:k-1}\right) \] and use the recursion \[ p\Cond{\vec u_{1:k}}{\vec t_{1:k}, \vec d_{1:k}} \propto g_k\Cond{\vec t_k, \vec d_k}{\vec u_k} h\Cond{\vec u_k}{\vec u_{1:k-1}} p\Cond{\vec u_{1:k-1}}{\vec t_{1:k-1}, \vec d_{1:k-1}} \]
where \(\delta\) is the Dirac delta function, \(g_k\Cond{\vec t_k, \vec d_k}{\vec u_k} = g(\vec t_k, \vec d_k;\mat X_k, \mat Z_k, \vec u_k)\), and \(h\Cond{\vec u_k}{\vec u_{1:k - 1}} = \phi(\vec u_k; \mat F\vec u_{k-1}, \mat\Sigma)\).
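A minimal R sketch of a bootstrap particle filter (Gordon, Salmond, and Smith 1993) for a scalar AR(1) state; `loglik_k` is a hypothetical stand-in for the conditional density \(g_k\), and the toy example uses Gaussian observations rather than the survival outcomes.

```r
# Minimal sketch of a bootstrap particle filter for a scalar AR(1) state:
# resample, propose from the transition density h, and reweight with the
# conditional density g_k. `loglik_k(u, k)` returns log g_k for a vector
# of particle values.
bootstrap_filter <- function(loglik_k, t_bar, N, F., sigma){
  u <- rnorm(N, sd = sigma / sqrt(1 - F.^2))  # draws from mu
  lws <- loglik_k(u, 1L)
  W <- exp(lws - max(lws)); W <- W / sum(W)   # stable normalization
  means <- sum(W * u)                          # filter mean at k = 1
  for(k in 2:t_bar){
    idx <- sample.int(N, replace = TRUE, prob = W)  # resample
    u <- rnorm(N, F. * u[idx], sigma)               # propose from h
    lws <- loglik_k(u, k)                           # reweight with g_k
    W <- exp(lws - max(lws)); W <- W / sum(W)
    means <- c(means, sum(W * u))
  }
  means
}

# toy example with one Gaussian 'observation' per period instead of the
# survival outcomes
set.seed(1)
y <- rnorm(10L, sd = .5)
bootstrap_filter(function(u, k) dnorm(y[k], u, .5, log = TRUE),
                 t_bar = 10L, N = 1000L, F. = .9, sigma = .3)
```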
Weight degeneracy.
An auxiliary particle filter (Pitt and Shephard 1999).
Both sampling and computation of weights require \(N\) similar and independent computations.
as in the dynamichazard and mssm packages using the C++ thread library.
as \(g_k\) is relatively expensive to evaluate when \(\vec t_k\) and \(\vec d_k\) have many elements.
Can use \((\part{\rR{\vec u}}{s}_{1:\bar t}, \part{\rR{W}}{s}_{\bar t})\) for \(s = 1, \dots, N\).
due to weight degeneracy.
Use the identity
\[ \begin{align*} p\Cond{\vec u_k}{\vec t_{1:\bar t}, \vec d_{1:\bar t}} &= \frac {p\Cond{\vec u_k}{\vec t_{1:k -1}, \vec d_{1:k -1}} p\Cond{\vec t_{k:\bar t}, \vec d_{k:\bar t}}{\vec u_k}} {p\Cond{\vec t_{k:\bar t}, \vec d_{k:\bar t}} {\vec t_{1:k-1}, \vec d_{1:k-1}}} \\ &\propto\int h\Cond{\vec u_k}{\vec u_{k - 1}} p\Cond{\vec u_{k-1}}{\vec t_{1:k-1}, \vec d_{1:k-1}}\der\vec u_{k-1}\\ &\hspace{40pt} \cdot p\Cond{\vec t_{k:\bar t}, \vec d_{k:\bar t}}{\vec u_k} \end{align*} \]
Have an approximation of \(p\Cond{\vec u_{k-1}}{\vec t_{1:k-1}, \vec d_{1:k-1}}\) but not \(p\Cond{\vec t_{k:\bar t}, \vec d_{k:\bar t}}{\vec u_k}\).
Run a backward particle filter to get \((\part{\lR{\vec u}}{s}_k, \part{\lR{W}}{s}_k)_{s=1,\dots,N}\) targeting an artificial \[ \tilde p\Cond{\vec u_k}{\vec t_{k:\bar t}, \vec d_{k:\bar t}} \propto \gamma_k(\vec u_k)p\Cond{\vec t_{k:\bar t}, \vec d_{k:\bar t}}{\vec u_k} \]
for a prespecified density function \(\gamma_k\).
\[ \begin{align*} p\Cond{\vec u_k}{\vec t_{1:\bar t}, \vec d_{1:\bar t}}\hspace{-50pt}& \\ &= \frac {p\Cond{\vec u_k}{\vec t_{1:k -1}, \vec d_{1:k -1}} g_k\Cond{\vec t_k, \vec d_k}{\vec u_k} p\Cond{\vec t_{k + 1:\bar t}, \vec d_{k + 1:\bar t}}{\vec u_k}} {p\Cond{\vec t_{k:\bar t}, \vec d_{k:\bar t}} {\vec t_{1:k-1}, \vec d_{1:k-1}}} \\ &\propto\int h\Cond{\vec u_k}{\vec u_{k - 1}} p\Cond{\vec u_{k-1}}{\vec t_{1:k-1}, \vec d_{1:k-1}}\der\vec u_{k-1}\\ &\hspace{20pt} \cdot g_k\Cond{\vec t_k, \vec d_k}{\vec u_k} \int p\Cond{\vec t_{k + 1:\bar t}, \vec d_{k + 1:\bar t}}{\vec u_{k + 1}} \\ &\hspace{100pt} \cdot h\Cond{\vec u_{k+1}}{\vec u_k}\der \vec u_{k + 1} \end{align*} \]
We get \(3N\) evaluations of \(g\).
The computational complexity is \(\bigO{N}\) (Fearnhead, Wyncoll, and Tawn 2010).
Can be generalized to handle singular components.
Both smoothers.
All in C++ with parallel computation using the C++ thread library.
using QR decompositions as in the bam function in the mgcv package.
Cover methods in the mssm package.
using an approximation of \(L'(\vec\beta, \mat \Sigma)\).
Estimate of \(L''(\vec\beta, \mat \Sigma)\) at the maximum.
Can use the forward particle filter to approximate
\[ \begin{align*} L'(\vec\theta) &= \int \nabla_{\vec\theta}\log f(\vec T_{1:\bar t}, \vec D_{1:\bar t}, \vec U_{1:\bar t}; \vec\theta) \\ &\hspace{80pt} \cdot p\Cond{\vec u_{1:\bar t}}{\vec t_{1:\bar t}, \vec d_{1:\bar t}} \der\vec u_{1:\bar t} \end{align*} \]where \(\vec\theta = (\vec\beta, \mat\Sigma)\).
And a similar expression for \(L''(\vec\beta, \mat \Sigma)\) using Louis’ identity (Louis 1982).
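A minimal R sketch of the resulting score estimator; `paths`, `W`, and `grad_logf` are hypothetical names for illustration, not the mssm interface.

```r
# Minimal sketch: approximate the score by averaging the gradient of the
# complete-data log-likelihood over the forward particle paths. `paths`
# is an N x t_bar matrix of paths, `W` the normalized weights at the
# final period, and `grad_logf` a user-supplied gradient function.
score_estimate <- function(paths, W, grad_logf){
  grads <- apply(paths, 1L, grad_logf)  # gradient for each particle path
  if(is.matrix(grads)) drop(grads %*% W) else sum(grads * W)
}
```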
Inaccurate due to weight degeneracy.
Poyiadjis, Doucet, and Singh (2011) show a recursive method with better properties.
for some function \(\psi\), as in the smoother suggested by Briers, Doucet, and Maskell (2009).
Can use that for some \(s\in\mathcal{S}\), \(i\in\mathcal{I}\), and constant \(C\)
\[ h\Cond{\part{\rR{\vec u}}{s}_k}{\part{\rR{\vec u}}{i}_{k - 1}} \approx C \]
when the \(\part{\rR{\vec u}}{s}_k\)s are similar for \(s\in\mathcal{S}\) and similarly for the \(\part{\rR{\vec u}}{i}_{k - 1}\)s with \(i\in\mathcal{I}\).
Form two k-d trees and make an approximation as in Klaas et al. (2006).
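For contrast, a minimal R sketch of the naive \(\bigO{N^2}\) sum over forward particles that the dual-tree method approximates, with a scalar state for simplicity; the timing comparison below is between the two approaches.

```r
# Minimal sketch of the naive O(N^2) computation: for every backward
# particle, sum the Gaussian transition density over all forward
# particles (scalar state for simplicity).
naive_sums <- function(u_bw, u_fw, W_fw, F., sigma)
  vapply(u_bw, function(u)
    sum(dnorm(u, mean = F. * u_fw, sd = sigma) * W_fw), numeric(1))
```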
##           method
## N          Dual-tree      Naive
##    12288    0.039704  0.7137595
##    24576    0.062140  2.6771194
##    49152    0.115722 10.9796716
##    98304    0.227879         NA
##   196608    0.450209         NA
##   393216    0.913649         NA
##   786432    1.844256         NA
##  1572864    4.057902         NA
Units are seconds.
Supports a wider range of models (not limited to survival models).
All the computation is in C++ and parallelized using the C++ thread library.
as the method can be used to get gradients for multiple different groups.
“More work” required by the user.
The proposal distribution is important.
using either a multivariate normal distribution or a multivariate \(t\)-distribution.
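A scalar R sketch of one common mode-based construction of such a proposal; it is illustrative and not necessarily the exact scheme in the mssm package.

```r
# Scalar sketch of a mode-based proposal: find the mode of the log
# target with optim, match the curvature, and use a t-distribution for
# heavier tails than the normal.
make_proposal <- function(log_target, df = 5){
  opt <- optim(0, function(u) -log_target(u), method = "BFGS",
               hessian = TRUE)
  mu <- opt$par
  sc <- sqrt(1 / opt$hessian[1L, 1L])  # scale from the curvature
  list(rq = function(n) mu + sc * rt(n, df),
       dq = function(u) dt((u - mu) / sc, df) / sc)
}

prop <- make_proposal(function(u) dnorm(u, 1, .5, log = TRUE))
u_draws <- prop$rq(1000L)              # proposal draws
```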
Mixed models are an alternative when one is interested in extrapolation.
Particle based methods can provide an efficient means of working with the presented mixed models.
There are complications with particle based methods and potential solutions.
A broad overview of the methods in dynamichazard and mssm package.
The presentation is at rpubs.com/boennecd/PFnSiML.
The markdown is at github.com/boennecd/Talks.
The dynamichazard package is at github.com/boennecd/dynamichazard.
The mssm package is at github.com/boennecd/mssm.
References are on the next slide.
Briers, Mark, Arnaud Doucet, and Simon Maskell. 2009. “Smoothing Algorithms for State–Space Models.” Annals of the Institute of Statistical Mathematics 62 (1): 61. doi:10.1007/s10463-009-0236-2.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society. Series B (Methodological) 39 (1). [Royal Statistical Society, Wiley]: 1–38. http://www.jstor.org/stable/2984875.
Durbin, J., and S. J. Koopman. 1997. “Monte Carlo Maximum Likelihood Estimation for Non-Gaussian State Space Models.” Biometrika 84 (3). [Oxford University Press, Biometrika Trust]: 669–84. http://www.jstor.org/stable/2337587.
Fearnhead, Paul, David Wyncoll, and Jonathan Tawn. 2010. “A Sequential Smoothing Algorithm with Linear Computational Cost.” Biometrika 97 (2). [Oxford University Press, Biometrika Trust]: 447–64. http://www.jstor.org/stable/25734097.
Gordon, N. J., D. J. Salmond, and A. F. M. Smith. 1993. “Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation.” IEE Proceedings F - Radar and Signal Processing 140 (2): 107–13. doi:10.1049/ip-f-2.1993.0015.
Klaas, Mike, Mark Briers, Nando de Freitas, Arnaud Doucet, Simon Maskell, and Dustin Lang. 2006. “Fast Particle Smoothing: If I Had a Million Particles.” In Proceedings of the 23rd International Conference on Machine Learning, 481–88. ICML ’06. New York, NY, USA: ACM. doi:10.1145/1143844.1143905.
Louis, Thomas A. 1982. “Finding the Observed Information Matrix When Using the EM Algorithm.” Journal of the Royal Statistical Society. Series B (Methodological) 44 (2). [Royal Statistical Society, Wiley]: 226–33. http://www.jstor.org/stable/2345828.
Pitt, Michael K., and Neil Shephard. 1999. “Filtering via Simulation: Auxiliary Particle Filters.” Journal of the American Statistical Association 94 (446). [American Statistical Association, Taylor & Francis, Ltd.]: 590–99. http://www.jstor.org/stable/2670179.
Poyiadjis, George, Arnaud Doucet, and Sumeetpal S. Singh. 2011. “Particle Approximations of the Score and Observed Information Matrix in State Space Models with Application to Parameter Estimation.” Biometrika 98 (1). Biometrika Trust: 65–80. http://www.jstor.org/stable/29777165.
Royston, Patrick, and Mahesh K. B. Parmar. 2002. “Flexible Parametric Proportional-Hazards and Proportional-Odds Models for Censored Survival Data, with Application to Prognostic Modelling and Estimation of Treatment Effects.” Statistics in Medicine 21 (15): 2175–97. doi:10.1002/sim.1203.
Comments
Smoothed weights are given by
\[\begin{align*} \part{\bR{W}}{s}_k \propto \sum_{i = 1}^N h\Cond{\part{\lR{\vec u}}{s}_k}{\part{\rR{\vec u}}{i}_{k -1}} \part{\rR{W}}{i}_{k -1}\frac {\part{\lR{W}}{s}_k} {\gamma_k\left({\part{\lR{\vec u}}{s}_k}\right)} \end{align*}\]
We get \(2N\) evaluations of \(g\).
but the \(\bigO{N^2}\) cost of computing the smoothed weights can be reduced to an average case of \(\bigO{N\log N}\) with the dual k-d tree method as implemented in the mssm package.