\[ \definecolor{gray}{RGB}{192,192,192} \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\prob{\text{P}} \def\expec{\text{E}} \def\part#1#2{#1^{(#2)}} \def\rR#1{\overrightarrow{#1}} \def\lR#1{\overleftarrow{#1}} \def\bR#1{\overleftrightarrow{#1}} \def\apxprop{\underset{\displaystyle\sim}{\propto}} \]
Motivate interest in extrapolation.
Alternative estimation methods.
Importance sampling and particle filtering.
as implemented in the dynamichazard package.
as implemented in the mssm package.
\(\tilde T_i\) is the time to event for individual \(i\).
\(C_i\) is the censoring time and \(T_i = \min(\tilde T_i, C_i)\) is the observed time. Define the event indicator \(D_i = 1_{\{\tilde T_i < C_i\}}\).
\(\lfloor T_i \rfloor\) is the floor of \(T_i\).
The exact specification is implicitly given by the context.
Could also be for a group of conditionally correlated individuals or an individual with recurrent events.
Interested in, e.g.,
\[\prob\Cond{\tilde T_i \leq \bar t + \Delta t}{\tilde T_i \geq \bar t, \vec x_{i\lfloor \bar t \rfloor}}, \qquad \Delta t > 0\]
Assume from now on that \(\bar t\) is an integer.
Time-invariant model like
\[ \prob\Cond{\tilde T_i \leq k + 1} {\tilde T_i \geq k, \vec x_{ik}} = 1 - \exp(-\vec\beta^\top\vec x_{ik}) \]
like the Royston-Parmar model (Royston and Parmar 2002).
\[ \begin{align*} \Delta t &\in (0,1) \\ \prob\Cond{\tilde T_i \leq k + \Delta t} {\tilde T_i \geq k, \vec x_{ik}, \vec z_{ik}, \vec u_k} \hspace{-70pt}&\\ &= 1 - \exp\left(-(\vec\beta^\top\vec x_{ik} + \vec u_k^\top\vec z_{ik}) \Delta t\right) \\ \vec U_k \mid \vec U_{k - 1} = \vec u_{k - 1} &\sim N(\mat F \vec u_{k -1} , \mat\Sigma) \end{align*} \]
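To make the setup concrete, here is a minimal R sketch simulating event times from the model above; all parameter values are hypothetical, the random effect is a scalar intercept, and the hazard is truncated at a small positive value since the identity link does not guarantee positivity.

```r
# Minimal sketch with hypothetical parameter values: simulate event times
# from the piecewise-constant hazard model with an AR(1) random intercept.
set.seed(1)
n <- 100L                  # individuals
t_bar <- 10L               # time periods
F. <- .9; sigma <- .1      # state transition and innovation standard deviation
beta <- c(.5, .25)         # fixed effect coefficients
x <- cbind(1, runif(n))    # fixed effect design matrix

# simulate the random effects; the first draw is from the time-invariant
# distribution of the AR(1) process
u <- numeric(t_bar)
u[1] <- rnorm(1, sd = sigma / sqrt(1 - F.^2))
for(k in 2:t_bar)
  u[k] <- rnorm(1, F. * u[k - 1], sigma)

# simulate event times; individual i's hazard in period k is
# beta' x_i + u_k, which the identity link does not keep positive, so we
# truncate at a small value in this sketch
event_time <- rep(Inf, n)
for(k in 1:t_bar){
  at_risk <- is.infinite(event_time)
  if(!any(at_risk)) break
  rate <- pmax(drop(x[at_risk, , drop = FALSE] %*% beta) + u[k], 1e-8)
  draw <- rexp(sum(at_risk), rate)   # exponential waiting time in the period
  hit <- draw <= 1
  event_time[at_risk][hit] <- k - 1 + draw[hit]
}
```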
with a random mean function; typically with a simpler random effect structure.
May have, e.g., unobserved time-varying macro or industry effects.
\[\begin{align*} T_{ik} &= \min(\max(T_i - k, 0), 1) \\ D_{ik} &= 1_{\{\tilde T_i < C_i\wedge T_i \in (k, k + 1]\}} \end{align*}\]
\[ \begin{align*} \vec t_k &= (t_{1k}, \dots, t_{nk})^\top & \vec d_k &= (d_{1k}, \dots, d_{nk})^\top \\ \vec t_{s:k} &= (\vec t_s, \vec t_{s + 1}, \dots, \vec t_k) \end{align*} \]
\[ \begin{align*} f(\vec t_{1:\bar t}, \vec d_{1:\bar t}, \vec u_{1:\bar t}) &= \mu(\vec u_1) g(\vec t_1, \vec d_1; \mat X_1, \mat Z_1, \vec u_1) \\ &\hspace{20pt}\cdot\prod_{k = 2}^{\bar t} g(\vec t_k, \vec d_k; \mat X_k, \mat Z_k, \vec u_k) \phi(\vec u_k; \mat F\vec u_{k - 1}, \mat\Sigma) \end{align*} \]
where \(\vec U_k\in\mathbb{R}^{d}\), \(\mu\) is the time-invariant distribution, \(\mat X_s\) and \(\mat Z_s\) are design matrices, \(g\) is a suitable conditional density function, and \(\phi\) is a multivariate normal density function.
MQL, PQL, and Laplace approximations.
Extended and unscented Kalman filters.
Variational approximations.
Quadrature.
Markov chain Monte Carlo.
Importance sampling.
Also known as sequential Monte Carlo.
Expectation maximization (Dempster, Laird, and Rubin 1977). \[ \begin{align*} Q\Cond{\vec\beta, \mat\Sigma} {\widehat{\vec\beta}, \widehat{\mat\Sigma}}\hspace{-30pt} &\\ &= \expec_{(\widehat{\vec\beta}, \widehat{\mat\Sigma})} \Cond{\log f(\vec T_{1:\bar t}, \vec D_{1:\bar t}, \vec U_{1:\bar t}; \vec\beta, \mat\Sigma)} {\vec t_{1:\bar t}, \vec d_{1:\bar t}} \end{align*} \]
using an approximation of \(L'(\vec\beta, \mat \Sigma)\).
Estimate of \(L''(\vec\beta, \mat \Sigma)\) at the maximum.
Want \(\expec(c(x)) = \int c(x)h(x)\der x\) where \(h = \tilde h/\alpha\); \(h\) and \(\alpha\) are unknown and \(\tilde h\) is known. Sample \(\part{x}{j} \sim q\) and set \(\part{W}{j} \propto \tilde h(\part{x}{j})/q(\part{x}{j})\), normalized such that \(\sum_{j = 1}^N \part{W}{j} = 1\).
Yields the discrete approximation \[ \expec (c(x)) \approx \sum_{j = 1}^N \part{W}{j} c(\part{x}{j}) \]
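A minimal R sketch of self-normalized importance sampling; the target and proposal below are hypothetical choices for illustration.

```r
# Minimal sketch of self-normalized importance sampling with a toy target:
# h_tilde is an unnormalized N(2, 1) density and q is a t-distribution.
set.seed(1)
N <- 10000L
h_tilde <- function(x) exp(-(x - 2)^2 / 2)  # known unnormalized target
q_sample <- function(n) rt(n, df = 5) + 2   # proposal draws
q_dens <- function(x) dt(x - 2, df = 5)     # proposal density

x <- q_sample(N)
w <- h_tilde(x) / q_dens(x)                 # unnormalized weights
W <- w / sum(w)                             # normalized weights W^(j)
sum(W * x)                                  # approximates E(X) = 2
```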
With only one time period
\[ \expec_{(\widehat{\vec\beta}, \widehat{\mat\Sigma})} \Cond{\log \mu(\vec U_1) + \log g(\vec T_1, \vec D_1; \mat X_1, \mat Z_1, \vec U_1)} {\vec t_1, \vec d_1} \] requires \[p\Cond{\vec u_1}{\vec t_1, \vec d_1} = \frac{\mu(\vec u_1)g(\vec t_1, \vec d_1; \mat X_1, \mat Z_1, \vec u_1)} {p(\vec t_1, \vec d_1)}\]
Some \(\part{W}{j}\) may be very small.
Can resample, setting the new weights to \(N^{-1}\).
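A minimal R sketch of this remedy: monitor the effective sample size and resample (multinomially here) when it drops below a rule-of-thumb threshold.

```r
# Minimal sketch: resample when the effective sample size is low and
# reset all weights to 1 / N. The N / 2 threshold is a rule of thumb.
ess <- function(W) 1 / sum(W^2)
resample <- function(x, W){
  idx <- sample.int(length(W), replace = TRUE, prob = W)
  list(x = x[idx], W = rep(1 / length(W), length(W)))
}

set.seed(1)
x <- rnorm(100L)
W <- runif(100L); W <- W / sum(W)
if(ess(W) < length(W) / 2){
  tmp <- resample(x, W)
  x <- tmp$x
  W <- tmp$W
}
```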
\(q\) needs to be fast to sample from and evaluate.
\(q\) needs to be a good approximation of \(h\).
E.g., antithetic variables as used in the mssm package; see Durbin and Koopman (1997).
Often inaccurate in higher dimensions.
Cover smoothers in the dynamichazard package.
\[ \begin{align*} Q\Cond{\vec\beta, \mat\Sigma} {\widehat{\vec\beta}, \widehat{\mat\Sigma}} \hspace{-30pt}& \\ &=\expec_{(\widehat{\vec\beta}, \widehat{\mat\Sigma})} \Big( \log \mu(\vec U_1) + \log g(\vec T_1, \vec D_1; \mat X_1, \mat Z_1, \vec U_1) \\ &\hspace{50pt} +\sum_{k = 2}^{\bar t} \log g(\vec T_k, \vec D_k; \mat X_k, \mat Z_k, \vec U_k) \\ &\hspace{70pt}+ \log\phi(\vec U_k; \mat F\vec U_{k-1}, \mat\Sigma) \,\Big|\, \vec t_{1:\bar t}, \vec d_{1:\bar t}\Big) \end{align*} \]
Need \(p\Cond{\vec U_{k-1:k}}{\vec t_{1:\bar t}, \vec d_{1:\bar t}}\).
(Forward) particle filter \((\part{\rR{\vec u}}{s}_{1:k}, \part{\rR{W}}{s}_{k})_{s=1,\dots,N}\).
Backward particle filter \((\part{\lR{\vec u}}{s}_k, \part{\lR{W}}{s}_k)_{s=1,\dots,N}\).
Smoothed particles \((\part{\bR{\vec u}}{s}_k, \part{\bR{W}}{s}_k)_{s=1,\dots,N}\).
Have some estimate of \(p\Cond{\vec u_{1:k-1}}{\vec t_{1:k-1}, \vec d_{1:k-1}}\) of the form \[ \hat p\Cond{\vec u_{1:k-1}}{\vec t_{1:k-1}, \vec d_{1:k-1}} = \sum_{s = 1}^N \part{W}{s}_{k-1} \delta\left(\vec u_{1:k-1} - \part{\vec u}{s}_{1:k-1}\right) \] and use the recursion \[ p\Cond{\vec u_{1:k}}{\vec t_{1:k}, \vec d_{1:k}} \propto g_k\Cond{\vec t_k, \vec d_k}{\vec u_k} h\Cond{\vec u_k}{\vec u_{1:k-1}} p\Cond{\vec u_{1:k-1}}{\vec t_{1:k-1}, \vec d_{1:k-1}} \]
where \(\delta\) is the Dirac delta function, \(g_k\Cond{\vec t_k, \vec d_k}{\vec u_k} = g(\vec t_k, \vec d_k;\mat X_k, \mat Z_k, \vec u_k)\), and \(h\Cond{\vec u_k}{\vec u_{1:k - 1}} = \phi(\vec u_k; \mat F\vec u_{k-1}, \mat\Sigma)\).
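A minimal R sketch of a bootstrap particle filter (Gordon, Salmond, and Smith 1993) for a scalar AR(1) state; `loglik_k` is a hypothetical stand-in for the conditional density \(g_k\), and the toy example uses Gaussian observations rather than the survival outcomes.

```r
# Minimal sketch of a bootstrap particle filter for a scalar AR(1) state:
# resample, propose from the transition density h, and reweight with the
# conditional density g_k. `loglik_k(u, k)` returns log g_k for a vector
# of particle values.
bootstrap_filter <- function(loglik_k, t_bar, N, F., sigma){
  u <- rnorm(N, sd = sigma / sqrt(1 - F.^2))  # draws from mu
  lws <- loglik_k(u, 1L)
  W <- exp(lws - max(lws)); W <- W / sum(W)   # stable normalization
  means <- sum(W * u)                          # filter mean at k = 1
  for(k in 2:t_bar){
    idx <- sample.int(N, replace = TRUE, prob = W)  # resample
    u <- rnorm(N, F. * u[idx], sigma)               # propose from h
    lws <- loglik_k(u, k)                           # reweight with g_k
    W <- exp(lws - max(lws)); W <- W / sum(W)
    means <- c(means, sum(W * u))
  }
  means
}

# toy example with one Gaussian 'observation' per period instead of the
# survival outcomes
set.seed(1)
y <- rnorm(10L, sd = .5)
bootstrap_filter(function(u, k) dnorm(y[k], u, .5, log = TRUE),
                 t_bar = 10L, N = 1000L, F. = .9, sigma = .3)
```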
Weight degeneracy.
An auxiliary particle filter (Pitt and Shephard 1999).
Both sampling and computation of weights require \(N\) similar and independent computations.
as in the dynamichazard and mssm packages using the C++ thread library.
as \(g_k\) is relatively expensive to evaluate when \(\vec t_k\) and \(\vec d_k\) have many elements.
Can use \((\part{\rR{\vec u}}{s}_{1:\bar t}, \part{\rR{W}}{s}_{\bar t})\) for \(s = 1, \dots, N\).
due to weight degeneracy.
Use the identity
\[ \begin{align*} p\Cond{\vec u_k}{\vec t_{1:\bar t}, \vec d_{1:\bar t}} &= \frac {p\Cond{\vec u_k}{\vec t_{1:k -1}, \vec d_{1:k -1}} p\Cond{\vec t_{k:\bar t}, \vec d_{k:\bar t}}{\vec u_k}} {p\Cond{\vec t_{k:\bar t}, \vec d_{k:\bar t}} {\vec t_{1:k-1}, \vec d_{1:k-1}}} \\ &\propto\int h\Cond{\vec u_k}{\vec u_{k - 1}} p\Cond{\vec u_{k-1}}{\vec t_{1:k-1}, \vec d_{1:k-1}}\der\vec u_{k-1}\\ &\hspace{40pt} \cdot p\Cond{\vec t_{k:\bar t}, \vec d_{k:\bar t}}{\vec u_k} \end{align*} \]
Have an approximation of \(p\Cond{\vec u_{k-1}}{\vec t_{1:k-1}, \vec d_{1:k-1}}\) but not \(p\Cond{\vec t_{k:\bar t}, \vec d_{k:\bar t}}{\vec u_k}\).
Run a backward particle filter to get \((\part{\lR{\vec u}}{s}_k, \part{\lR{W}}{s}_k)_{s=1,\dots,N}\) targeting an artificial \[ \tilde p\Cond{\vec u_k}{\vec t_{k:\bar t}, \vec d_{k:\bar t}} \propto \gamma_k(\vec u_k)p\Cond{\vec t_{k:\bar t}, \vec d_{k:\bar t}}{\vec u_k} \]
for a prespecified density function \(\gamma_k\).
\[ \begin{align*} p\Cond{\vec u_k}{\vec t_{1:\bar t}, \vec d_{1:\bar t}}\hspace{-50pt}& \\ &= \frac {p\Cond{\vec u_k}{\vec t_{1:k -1}, \vec d_{1:k -1}} g_k\Cond{\vec t_k, \vec d_k}{\vec u_k} p\Cond{\vec t_{k + 1:\bar t}, \vec d_{k + 1:\bar t}}{\vec u_k}} {p\Cond{\vec t_{k:\bar t}, \vec d_{k:\bar t}} {\vec t_{1:k-1}, \vec d_{1:k-1}}} \\ &\propto\int h\Cond{\vec u_k}{\vec u_{k - 1}} p\Cond{\vec u_{k-1}}{\vec t_{1:k-1}, \vec d_{1:k-1}}\der\vec u_{k-1}\\ &\hspace{20pt} \cdot g_k\Cond{\vec t_k, \vec d_k}{\vec u_k} \int p\Cond{\vec t_{k + 1:\bar t}, \vec d_{k + 1:\bar t}}{\vec u_{k + 1}} \\ &\hspace{100pt} \cdot h\Cond{\vec u_{k+1}}{\vec u_k}\der \vec u_{k + 1} \end{align*} \]
We get \(3N\) evaluations of \(g\).
The computational complexity is \(\bigO{N}\) (Fearnhead, Wyncoll, and Tawn 2010).
Can be generalized to handle singular components.
Both smoothers.
All in C++ with parallel computation using the C++ thread library.
using QR decompositions as in the bam function in the mgcv package.
Cover methods in the mssm package.
using an approximation of \(L'(\vec\beta, \mat \Sigma)\).
Estimate of \(L''(\vec\beta, \mat \Sigma)\) at the maximum.
Can use the forward particle filter to approximate
\[ \begin{align*} L'(\vec\theta) &= \int \nabla_{\vec\theta}\log f(\vec T_{1:\bar t}, \vec D_{1:\bar t}, \vec U_{1:\bar t}; \vec\theta) \\ &\hspace{80pt} \cdot p\Cond{\vec u_{1:\bar t}}{\vec t_{1:\bar t}, \vec d_{1:\bar t}} \der\vec u_{1:\bar t} \end{align*} \]where \(\vec\theta = (\vec\beta, \mat\Sigma)\).
And a similar expression for \(L''(\vec\beta, \mat \Sigma)\) using Louis’ identity (Louis 1982).
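A minimal R sketch of the resulting score estimator; `paths`, `W`, and `grad_logf` are hypothetical names for illustration, not the mssm interface.

```r
# Minimal sketch: approximate the score by averaging the gradient of the
# complete-data log-likelihood over the forward particle paths. `paths`
# is an N x t_bar matrix of paths, `W` the normalized weights at the
# final period, and `grad_logf` a user-supplied gradient function.
score_estimate <- function(paths, W, grad_logf){
  grads <- apply(paths, 1L, grad_logf)  # gradient for each particle path
  if(is.matrix(grads)) drop(grads %*% W) else sum(grads * W)
}
```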
Inaccurate due to weight degeneracy.
Poyiadjis, Doucet, and Singh (2011) show a recursive method with better properties.
for some function \(\psi\), as in the smoother suggested by Briers, Doucet, and Maskell (2009).
Can use that for some \(s\in\mathcal{S}\), \(i\in\mathcal{I}\), and constant \(C\)
\[ h\Cond{\part{\rR{\vec u}}{s}_k}{\part{\rR{\vec u}}{i}_{k - 1}} \approx C \]
when the \(\part{\rR{\vec u}}{s}_k\)s are similar for \(s\in\mathcal{S}\) and similarly for the \(\part{\rR{\vec u}}{i}_{k - 1}\)s with \(i\in\mathcal{I}\).
Form two k-d trees and make an approximation as in Klaas et al. (2006).
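For contrast, a minimal R sketch of the naive \(\bigO{N^2}\) sum over forward particles that the dual-tree method approximates, with a scalar state for simplicity; the timing comparison below is between the two approaches.

```r
# Minimal sketch of the naive O(N^2) computation: for every backward
# particle, sum the Gaussian transition density over all forward
# particles (scalar state for simplicity).
naive_sums <- function(u_bw, u_fw, W_fw, F., sigma)
  vapply(u_bw, function(u)
    sum(dnorm(u, mean = F. * u_fw, sd = sigma) * W_fw), numeric(1))
```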
##           method
## N          Dual-tree      Naive
##    12288    0.039704  0.7137595
##    24576    0.062140  2.6771194
##    49152    0.115722 10.9796716
##    98304    0.227879         NA
##   196608    0.450209         NA
##   393216    0.913649         NA
##   786432    1.844256         NA
##  1572864    4.057902         NA
Units are seconds.
Supports a wider range of models (not limited to survival models).
All the computation is in C++ and parallelized using the C++ thread library.
as the method can be used to get gradients for multiple different groups.
“More work” required by the user.
The proposal distribution is important.
using either a multivariate normal distribution or a multivariate \(t\)-distribution.
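A scalar R sketch of one common mode-based construction of such a proposal; it is illustrative and not necessarily the exact scheme in the mssm package.

```r
# Scalar sketch of a mode-based proposal: find the mode of the log
# target with optim, match the curvature, and use a t-distribution for
# heavier tails than the normal.
make_proposal <- function(log_target, df = 5){
  opt <- optim(0, function(u) -log_target(u), method = "BFGS",
               hessian = TRUE)
  mu <- opt$par
  sc <- sqrt(1 / opt$hessian[1L, 1L])  # scale from the curvature
  list(rq = function(n) mu + sc * rt(n, df),
       dq = function(u) dt((u - mu) / sc, df) / sc)
}

prop <- make_proposal(function(u) dnorm(u, 1, .5, log = TRUE))
u_draws <- prop$rq(1000L)              # proposal draws
```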
Mixed models are an alternative when one is interested in extrapolation.
Particle based methods can provide an efficient means of working with the presented mixed models.
There are complications with particle based methods and potential solutions.
A broad overview of the methods in dynamichazard and mssm package.
The presentation is at rpubs.com/boennecd/PFnSiML.
The markdown is at github.com/boennecd/Talks.
The dynamichazard package is at github.com/boennecd/dynamichazard.
The mssm package is at github.com/boennecd/mssm.
References are on the next slide.
Briers, Mark, Arnaud Doucet, and Simon Maskell. 2009. “Smoothing Algorithms for State–Space Models.” Annals of the Institute of Statistical Mathematics 62 (1): 61. doi:10.1007/s10463-009-0236-2.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society. Series B (Methodological) 39 (1). [Royal Statistical Society, Wiley]: 1–38. http://www.jstor.org/stable/2984875.
Durbin, J., and S. J. Koopman. 1997. “Monte Carlo Maximum Likelihood Estimation for Non-Gaussian State Space Models.” Biometrika 84 (3). [Oxford University Press, Biometrika Trust]: 669–84. http://www.jstor.org/stable/2337587.
Fearnhead, Paul, David Wyncoll, and Jonathan Tawn. 2010. “A Sequential Smoothing Algorithm with Linear Computational Cost.” Biometrika 97 (2). [Oxford University Press, Biometrika Trust]: 447–64. http://www.jstor.org/stable/25734097.
Gordon, N. J., D. J. Salmond, and A. F. M. Smith. 1993. “Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation.” IEE Proceedings F - Radar and Signal Processing 140 (2): 107–13. doi:10.1049/ip-f-2.1993.0015.
Klaas, Mike, Mark Briers, Nando de Freitas, Arnaud Doucet, Simon Maskell, and Dustin Lang. 2006. “Fast Particle Smoothing: If I Had a Million Particles.” In Proceedings of the 23rd International Conference on Machine Learning, 481–88. ICML ’06. New York, NY, USA: ACM. doi:10.1145/1143844.1143905.
Louis, Thomas A. 1982. “Finding the Observed Information Matrix When Using the EM Algorithm.” Journal of the Royal Statistical Society. Series B (Methodological) 44 (2). [Royal Statistical Society, Wiley]: 226–33. http://www.jstor.org/stable/2345828.
Pitt, Michael K., and Neil Shephard. 1999. “Filtering via Simulation: Auxiliary Particle Filters.” Journal of the American Statistical Association 94 (446). [American Statistical Association, Taylor & Francis, Ltd.]: 590–99. http://www.jstor.org/stable/2670179.
Poyiadjis, George, Arnaud Doucet, and Sumeetpal S. Singh. 2011. “Particle Approximations of the Score and Observed Information Matrix in State Space Models with Application to Parameter Estimation.” Biometrika 98 (1). Biometrika Trust: 65–80. http://www.jstor.org/stable/29777165.
Royston, Patrick, and Mahesh K. B. Parmar. 2002. “Flexible Parametric Proportional-Hazards and Proportional-Odds Models for Censored Survival Data, with Application to Prognostic Modelling and Estimation of Treatment Effects.” Statistics in Medicine 21 (15): 2175–97. doi:10.1002/sim.1203.
Comments
Smoothed weights are given by
\[\begin{align*} \part{\bR{W}}{s}_k \propto \sum_{i = 1}^N h\Cond{\part{\lR{\vec u}}{s}_k}{\part{\rR{\vec u}}{i}_{k -1}} \part{\rR{W}}{i}_{k -1}\frac {\part{\lR{W}}{s}_k} {\gamma_k\left({\part{\lR{\vec u}}{s}_k}\right)} \end{align*}\]
We get \(2N\) evaluations of \(g\).
but the \(\bigO{N^2}\) cost of computing the smoothed weights can be reduced to an average case of \(\bigO{N\log N}\) with the dual k-d tree method as implemented in the mssm package.