Variational Approximations


Introduction

\[ \definecolor{gray}{RGB}{192,192,192} \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \]

Presentation Outline

Introduce survival analysis.

Show computational issues in survival analysis

particularly with random effects.

Show applications of variational approximations (VAs)

in survival analysis.

Cover future work.

Survival Analysis

Typical Application

Model the time of a terminal event.

Time zero is the time of diagnosis, time of onset, or birth date.

Some events are (right) censored.

Motivating Example

Full circle: observed event. Open circle: censored. Order is only for visualization purposes.

Motivating Example

Black: treatment. Blue and dashed: placebo.

Remarks

A large fraction of the observations may be censored

e.g., due to a short trial period or a low event rate.

Need to account for censoring.

Notation

Let \(T_i^*\) be the event time of individual \(i\) and let \(f\) denote its density function.

Observe \(T_i = \min (T_i^*, C_i)\) where \(C_i\) is the censoring time.

Let \(D_i = 1_{\{T_i^* < C_i\}}\) be the event indicator.

The observed likelihood is

\[p(t_i, d_i) = f(t_i)^{d_i}P(T_i^* > t_i)^{1 - d_i}\]

Notation

Let \(S\) be the survival function

\[S(t_i) = P(T_i^* > t_i) = \int_{t_i}^\infty f(s)\der s\]

Let \(\lambda\) be the hazard function

\[ \lambda(t) = \lim_{h\rightarrow 0^+} \frac {P(t\leq T_i^* < t + h \mid T_i^*\geq t)}{h} \]

which is the instantaneous event rate.

Notation

Then

\[p(t_i, d_i) = \lambda(t_i)^{d_i}S(t_i) = \lambda(t_i)^{d_i}\exp\left(-\int_0^{t_i}\lambda(s)\der s\right)\]

Important: two types of terms in maximum likelihood estimation

\[d_i\log\lambda(t_i) + \log S(t_i)\]
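As a small numeric check, a minimal Python sketch (assuming an exponential model with a constant, illustrative hazard rate) confirms that the density form and the hazard form of the likelihood contribution agree:

```python
import numpy as np

# Exponential model: lambda(t) = rate, S(t) = exp(-rate * t),
# f(t) = rate * exp(-rate * t); rate and (t, d) are illustrative values.
rate, t, d = 0.5, 2.3, 1

# Density form: d * log f(t) + (1 - d) * log S(t).
ll_density = d * np.log(rate * np.exp(-rate * t)) + (1 - d) * (-rate * t)

# Hazard form: d * log lambda(t) + log S(t).
ll_hazard = d * np.log(rate) - rate * t

print(np.isclose(ll_density, ll_hazard))  # True
```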

Nonparametric Estimates

Remarks

Want to control for additional factors

e.g., with observational data sets.

Want to impose (parametric) assumptions

to improve prediction accuracy or to make inferences.

Proportional Hazards (PH) Models

Typical choice is

\[ \lambda\Cond{t}{\vec x} = \lambda_0(t)\exp\left(\vec\beta^\top\vec x\right) \]
where \(\vec x\) is a vector of known covariates and \(\lambda_0\) is the baseline hazard.

\(\lambda_0\) may be parametric or nonparametric (Cox 1972).

A unit increase in \(x_l\) yields a factor \(\exp\beta_l\) increase in \(\lambda\).

Proportional Hazards (PH) comments

Special classes of \(\lambda_0\) yield a tractable (partial) log-likelihood function.

Generalization may be intractable due to

\[S\Cond{t}{\vec x} = \exp\left(-\int_0^t\lambda\Cond{s}{\vec x}\der s\right)\]

This leads to one-dimensional numerical integration.
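A minimal sketch of this one-dimensional integration, assuming a hypothetical Weibull baseline hazard and illustrative parameter values:

```python
import numpy as np
from scipy.integrate import quad

def baseline_hazard(t, shape=1.3, scale=2.0):
    # Hypothetical Weibull baseline hazard.
    return (shape / scale) * (t / scale) ** (shape - 1.0)

def ph_hazard(t, x, beta):
    return baseline_hazard(t) * np.exp(beta @ x)

x = np.array([1.0, 0.5])
beta = np.array([0.4, -0.2])
t_obs = 3.0

# One-dimensional quadrature for the cumulative hazard in S(t | x).
cum_hazard, _ = quad(ph_hazard, 0.0, t_obs, args=(x, beta))
print(np.exp(-cum_hazard))  # S(t_obs | x)
```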

Generalized Survival Models (GSMs)

Let

\[g\left(S\Cond{t}{\vec x}\right) = g\left(S_0(t)\right) + \vec\beta^\top\vec x\]

where \(g\) is a link function and \(S_0\) is a baseline survival function.

E.g., see Younes and Lachin (1997) and Royston and Parmar (2002).

Avoids integration.

Imposes a monotonicity constraint on \(S\Cond{t}{\vec x}\) for all \(\vec x\).
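A minimal sketch of a GSM with the \(\log(-\log)\) link, under which the model reduces to \(S\Cond{t}{\vec x} = S_0(t)^{\exp(\vec\beta^\top\vec x)}\); the baseline survival function and parameters below are illustrative:

```python
import numpy as np

def g(s):
    return np.log(-np.log(s))     # the log(-log) link

def g_inv(eta):
    return np.exp(-np.exp(eta))

def baseline_survival(t, scale=2.0, shape=1.3):
    return np.exp(-(t / scale) ** shape)  # illustrative Weibull baseline

x = np.array([1.0, 0.5])
beta = np.array([0.4, -0.2])
t = 3.0

# g(S(t | x)) = g(S_0(t)) + beta^T x; no numerical integration needed.
survival = g_inv(g(baseline_survival(t)) + beta @ x)
print(survival)  # equals baseline_survival(t) ** exp(beta @ x)
```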

Computational Issues

Random Effects

We may have unobserved factors that we want to account for or that are of interest

e.g., twins who share genetic background and environment.

GSMs with Random Effects

\[ \begin{align*} g\left(S\Cond{t_{ki}}{\vec x_{ki}, \vec z_{ki}, \vec u_k}\right) &= g\left(S_0(t_{ki})\right) + \vec\beta^\top\vec x_{ki} + \vec z^\top_{ki}\vec u_k \\ &= g\left(S_0(t_{ki}; \vec x_{ki})\right) + \vec z^\top_{ki}\vec u_k \\ \vec U_k &\sim h(\vec \theta) \end{align*} \]
\(\vec z_{ki}\) is a vector of known covariates and \(\vec u_k\) is group \(k\)’s random effect.

\(k=1,\dots,m\) indicates the group and group \(k\) has \(i=1,\dots,n_k\) members.

In particular, we consider

\[\vec U_k \sim N(\vec 0, \mat \Sigma)\]
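A minimal simulation sketch of such clustered data, assuming an exponential proportional hazards model with a shared random intercept per group (all names and values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

m, n_k = 50, 4            # number of groups and members per group
beta = np.array([0.4])    # fixed effect
sigma = 0.7               # random-intercept standard deviation

records = []
for k in range(m):
    u_k = rng.normal(0.0, sigma)              # U_k ~ N(0, sigma^2)
    for i in range(n_k):
        x = rng.normal(size=1)
        rate = np.exp(beta @ x + u_k)         # exponential hazard, PH form
        t_star = rng.exponential(1.0 / rate)  # latent event time T*
        c = rng.exponential(2.0)              # censoring time C
        records.append((min(t_star, c), int(t_star < c), x[0], k))

print(records[:2])  # (t, d, x, group) tuples
```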

Notation

Let functions be applied elementwise

\[ \begin{align*} \vec x &= (x_1, x_2)^\top \\ f(\vec x) &= (f(x_1), f(x_2))^\top \end{align*} \]

Further, let

\[ \begin{align*} \vec t_k &= (t_{k1}, \dots, t_{kn_k})^\top, & \vec d_k &= (d_{k1}, \dots, d_{kn_k})^\top \\ \mat X_k &= (\vec x_{k1}, \dots, \vec x_{kn_k})^\top, & \mat Z_k &= (\vec z_{k1}, \dots, \vec z_{kn_k})^\top \end{align*} \]

Marginal Log-likelihood

The marginal log-likelihood (or model evidence) term for each group \(k\) is

\[ \begin{align*} l_k(\vec\beta, \mat\Sigma) &= \log \int \exp\left(h_k(\vec\beta, \mat\Sigma, \vec u) + \log \phi(\vec u;\mat \Sigma)\right) \der \vec u \\ h_k(\vec\beta, \mat\Sigma, \vec u) &= \vec d^\top_k \log \lambda\Cond{\vec t_k}{\mat X_k, \mat Z_k, \vec u} \\ &\hspace{20pt} + \vec 1^\top\log S\Cond{\vec t_k}{\mat X_k, \mat Z_k, \vec u} \end{align*} \]

where \(\phi(\cdot ;\mat \Sigma)\) is the density function of the multivariate normal distribution with a zero mean vector and covariance matrix \(\mat \Sigma\). \(\vec 1\) is a vector of ones.

\(l_k(\vec\beta, \mat\Sigma)\) is intractable in general.

Common Approximations

Laplace Approximation.

Fast, scales well, but may perform poorly e.g., for small groups (\(n_k\) small).

Adaptive Gaussian quadrature.

Fast in low dimensions, scales poorly, and performs well. For examples, see Liu and Pierce (1994) and Pinheiro and Bates (1995).

Monte Carlo methods.

Slow, scales poorly, and performs well.
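A naive Monte Carlo sketch of \(l_k = \log E_\phi[\exp(h_k(\vec U))]\), with a hypothetical stand-in for \(h_k\):

```python
import numpy as np

rng = np.random.default_rng(2)

def h_k(u):
    # Hypothetical stand-in for h_k, not the survival-model integrand.
    return -np.sum(u ** 2, axis=-1) - 1.0

# l_k = log E_phi[exp(h_k(U))] estimated by averaging over U ~ N(0, Sigma).
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
draws = rng.multivariate_normal(np.zeros(2), Sigma, size=100_000)
print(np.log(np.mean(np.exp(h_k(draws)))))
```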

Adaptive Gaussian Quadrature

\[\begin{align*} l &= \int c(\vec u)\der \vec u \qquad\qquad\qquad\qquad\qquad \text{(intractable)} \\ &= \int \frac{c(\vec u)}{\phi(\vec u; \vec\mu, \mat\Lambda)} \phi(\vec u; \vec\mu, \mat\Lambda)\der\vec u \\ &= \int\underbrace{(2\pi)^{k/2} \lvert\mat\Lambda\rvert^{1/2} c(\vec\mu + \mat\Lambda^{1/2}\vec s) \exp\left(\vec s^\top\vec s / 2\right)}_{ \tilde c(\vec s; \vec\mu, \mat\Lambda)} \phi(\vec s; \vec 0, \mat I)\der\vec s \end{align*}\]

where \(\vec u\in \mathbb{R}^k\), \(\phi(\cdot; \vec\mu, \mat\Lambda)\) is the density of a multivariate normal distribution with mean \(\vec\mu\) and covariance matrix \(\mat\Lambda\), and \(\mat I\) is the identity matrix.

Adaptive Gaussian Quadrature

Apply Gauss-Hermite quadrature to each coordinate of \(\vec s\).

Each coordinate, \(s_l\), is approximated at \(b\) points.

Let \((g_i, w_i)\), \(i = 1,\dots,b\), denote the nodes and corresponding weights used for each coordinate, \(s_l\), and let

\[\vec s_{j_1,\cdots,j_k} = (g_{j_1},\cdots,g_{j_k})^\top\]

Adaptive Gaussian Quadrature

Repeatedly applying Gauss-Hermite quadrature yields

\[\begin{align*} l &= \int \tilde c(\vec s; \vec\mu, \mat\Lambda) \phi(\vec s; \vec 0, \mat I)\der\vec s \\ &\approx \sum_{j_1=1}^b\cdots\sum_{j_k=1}^b \tilde c(\vec s_{j_1,\cdots,j_k}; \vec\mu, \mat\Lambda) \prod_{q = 1}^k w_{j_q} \end{align*}\]

Scales poorly in the dimension of the random effect, \(k\).

Fast alternatives are attractive.
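A minimal sketch of the tensor-product rule above, assuming NumPy's probabilists' Gauss-Hermite nodes and a Cholesky factor as the matrix square root:

```python
import numpy as np
from itertools import product

def agq(c, mu, Lam, b=10):
    # Tensor-product Gauss-Hermite rule for l = E_{phi(0, I)}[c_tilde(s)].
    k = mu.size
    nodes, weights = np.polynomial.hermite_e.hermegauss(b)
    weights = weights / np.sqrt(2.0 * np.pi)  # normalize to N(0, 1)
    L = np.linalg.cholesky(Lam)               # a matrix square root of Lam
    const = (2.0 * np.pi) ** (k / 2.0) * np.sqrt(np.linalg.det(Lam))
    total = 0.0
    for idx in product(range(b), repeat=k):   # b**k node combinations
        s = nodes[list(idx)]
        total += (np.prod(weights[list(idx)]) * const
                  * c(mu + L @ s) * np.exp(s @ s / 2.0))
    return total

# Check: integrating a N(0, Sigma) density should give roughly one.
Sigma = np.array([[1.0, 0.4], [0.4, 0.8]])
c = lambda u: np.exp(-0.5 * u @ np.linalg.solve(Sigma, u)) / (
    2.0 * np.pi * np.sqrt(np.linalg.det(Sigma)))
print(agq(c, np.zeros(2), Sigma))
```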

Variational Approximations

Lower Bound

A lower bound of the marginal log-likelihood is

\[ \begin{align*} \log p\left(\vec t_k, \vec d_k\right) &= \int q(\vec u;\vec\theta_k) \log \left(\frac {p\left(\vec t_k, \vec d_k, \vec u\right) / q(\vec u;\vec\theta_k)} {p\Cond{\vec u}{\vec t_k, \vec d_k} / q(\vec u;\vec\theta_k)} \right)\der\vec u \\ &\geq \int q(\vec u;\vec\theta_k) \log \left(\frac {p\left(\vec t_k, \vec d_k, \vec u\right)} {q(\vec u;\vec\theta_k)} \right)\der\vec u \\ &= \log \tilde p \left(\vec t_k, \vec d_k;\vec\theta_k\right) \end{align*} \]

for some density function \(q\). Equality holds if and only if

\[ q(\vec u_k;\vec\theta_k) = p\Cond{\vec u_k}{\vec t_k, \vec d_k} \]

Lower Bound

The approximate maximum likelihood estimator is

\[\argmax_{\vec\beta, \mat \Sigma, \vec\theta_1, \cdots, \vec\theta_m} \sum_{k=1}^m\log \tilde p \left(\vec t_k, \vec d_k;\vec\theta_k\right)\]
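A toy sketch of this joint maximization, assuming scalar random intercepts, exponential hazards, no censoring, and Gaussian \(q_k\); in this simple case the lower bound is available in closed form via the normal moment generating function:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

# Simulate clustered data: hazard lambda(t | x, u_k) = exp(beta * x + u_k).
m, n_k, beta0, sd0 = 100, 5, 0.5, 0.8
u = rng.normal(0.0, sd0, m)
x = rng.normal(size=(m, n_k))
t = rng.exponential(np.exp(-(beta0 * x + u[:, None])))  # all events observed

def neg_elbo(par):
    beta, log_sd = par[0], par[1]
    mu_k, log_sd_k = par[2:2 + m], par[2 + m:]
    sig2, lam2 = np.exp(2.0 * log_sd), np.exp(2.0 * log_sd_k)
    eta = beta * x + mu_k[:, None]
    # E_q[log lambda + log S]: the survival term uses E[exp(u)] with
    # u ~ N(mu_k, lam2), i.e. exp(mu_k + lam2 / 2).
    ll = np.sum(eta - t * np.exp(eta + lam2[:, None] / 2.0))
    # KL divergence between q_k = N(mu_k, lam2) and the N(0, sig2) prior.
    kl = 0.5 * np.sum(np.log(sig2) - np.log(lam2) - 1.0
                      + (lam2 + mu_k ** 2) / sig2)
    return -(ll - kl)

fit = minimize(neg_elbo, np.zeros(2 + 2 * m), method="L-BFGS-B")
print(fit.x[:2])  # roughly (beta0, log(sd0))
```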

Marginal Log-likelihood

Recall the marginal log-likelihood

\[ \begin{align*} l_k(\vec\beta, \mat\Sigma) &= \log \int \exp\underbrace{\left(h_k(\vec\beta, \mat\Sigma, \vec u) + \log \phi(\vec u;\mat \Sigma)\right)}_{ \log p(\vec t_k, \vec d_k, \vec u)} \der \vec u \\ h_k(\vec\beta, \mat\Sigma, \vec u) &= \vec d^\top_k \log \lambda\Cond{\vec t_k}{\mat X_k, \mat Z_k, \vec u} \\ &\hspace{20pt} + \vec 1^\top\log S\Cond{\vec t_k}{\mat X_k, \mat Z_k, \vec u} \end{align*} \]

Applying the Lower Bound

Suppose \(q(\vec u; \vec\mu, \mat\Lambda) = \phi(\vec u; \vec\mu, \mat\Lambda)\).

The \(\left(\log \phi(\vec u;\mat \Sigma) - \log q(\vec u; \vec\mu, \mat\Lambda)\right)q(\cdot)\) term is then tractable.

The remaining two types of terms are

\[\left(\vec d^\top_k \log \lambda\Cond{\vec t_k}{\mat X_k, \mat Z_k, \vec u} + \vec 1^\top\log S\Cond{\vec t_k}{\mat X_k, \mat Z_k, \vec u}\right)q(\cdot)\]
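The tractable term equals minus the Kullback-Leibler divergence between the two Gaussians; a minimal sketch checking the closed form against Monte Carlo, with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_mvn(u, m, S):
    # Log density of N(m, S) evaluated at the rows of u.
    d = u - m
    _, ld = np.linalg.slogdet(S)
    return -0.5 * (np.sum(d * np.linalg.solve(S, d.T).T, axis=1)
                   + ld + m.size * np.log(2.0 * np.pi))

def neg_kl(mu, Lam, Sigma):
    # Closed form of E_q[log phi(u; Sigma) - log q(u; mu, Lam)].
    k = mu.size
    _, ld_Lam = np.linalg.slogdet(Lam)
    _, ld_Sig = np.linalg.slogdet(Sigma)
    return 0.5 * (ld_Lam - ld_Sig + k
                  - np.trace(np.linalg.solve(Sigma, Lam))
                  - mu @ np.linalg.solve(Sigma, mu))

mu = np.array([0.3, -0.1])
Lam = np.array([[0.5, 0.1], [0.1, 0.4]])
Sigma = np.array([[1.0, 0.3], [0.3, 0.8]])

u = rng.multivariate_normal(mu, Lam, size=200_000)
mc = np.mean(log_mvn(u, np.zeros(2), Sigma) - log_mvn(u, mu, Lam))
print(neg_kl(mu, Lam, Sigma), mc)  # the two estimates should be close
```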

Applying the Lower Bound

Suppose

\[ \lambda\Cond{\vec t_k}{\mat X_k, \mat Z_k, \vec u_k} = s\left(\vec t_k; \eta_1(\mat X_k) + \mat Z_k\vec u_k\right) \]

for some function \(s\). Then

\[ \begin{align*} \hspace{20pt}&\hspace{-20pt} \int \log\lambda\Cond{t_{ki}}{\vec x_{ki}, \vec z_{ki}, \vec u} \phi(\vec u; \vec\mu, \mat\Lambda) \der \vec u \\ &= \int \log s \left(t_{ki};\eta_1(\vec x_{ki}) + \vec z_{ki}^\top\vec\mu + \sqrt{\vec z_{ki}^\top\mat\Lambda\vec z_{ki}}\,x\right) \phi(x)\der x \end{align*} \]

I.e., \(n_k\) one-dimensional integrals

instead of one \(k\)-dimensional integral, as shown by Ormerod and Wand (2012) for generalized linear mixed models.
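A minimal sketch of this reduction, with illustrative values: since \(\vec u \sim N(\vec\mu, \mat\Lambda)\) implies the linear predictor \(\vec z^\top\vec u\) is univariate normal, a one-dimensional Gauss-Hermite rule suffices.

```python
import numpy as np

# Univariate Gauss-Hermite rule normalized to the standard normal.
nodes, weights = np.polynomial.hermite_e.hermegauss(30)
weights = weights / np.sqrt(2.0 * np.pi)

def expected(fn, eta, z, mu, Lam):
    # E[fn(eta + z^T u)] for u ~ N(mu, Lam) via one-dimensional quadrature.
    loc, scale = eta + z @ mu, np.sqrt(z @ Lam @ z)
    return np.sum(weights * fn(loc + scale * nodes))

mu = np.array([0.2, -0.1])
Lam = np.array([[0.4, 0.1], [0.1, 0.3]])
z = np.array([1.0, 2.0])

# A linear fn recovers the mean; a nonlinear fn (here log of softplus,
# a hypothetical log s) shows the quadrature doing real work.
print(expected(lambda e: e, 0.5, z, mu, Lam))  # equals 0.5 + z @ mu
print(expected(lambda e: np.log(np.log1p(np.exp(e))), 0.5, z, mu, Lam))
```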

Generalized PH Model

\[ \lambda\Cond{s}{\vec x_{ki}, \vec z_{ki}, \vec u_k} = \lambda_0(s ;\vec x_{ki}) \exp\left(\vec f(s, \vec z_{ki})^\top\vec u_k\right) \]

where \(\vec f:\, [0,\infty)\times \mathbb{R}^v\rightarrow\mathbb{R}^k\) is not applied elementwise. Then

\[\begin{align*} \log S\Cond{t_{ki}}{\vec x_{ki}, \vec z_{ki}, \vec u_k} &\\ &\hspace{-40pt}= -\int_0^{t_{ki}} \lambda_0(s ;\vec x_{ki}) \exp\left(\vec f(s, \vec z_{ki})^\top\vec u_k\right) \der s \\ \log\lambda\Cond{t_{ki}}{\vec x_{ki}, \vec z_{ki}, \vec u_k} &= \log \lambda_0(t_{ki};\vec x_{ki}) + \vec f(t_{ki}, \vec z_{ki})^\top\vec u_k \end{align*}\]

Generalized PH Model

\[ \begin{align*} \hspace{40pt}&\hspace{-40pt} \int \log S\Cond{t_{ki}}{\vec x_{ki}, \vec z_{ki}, \vec u} \phi(\vec u; \vec\mu, \mat\Lambda)\der\vec u \\ &=- \int_0^{t_{ki}} \lambda_0(s ;\vec x_{ki}) \bigg(\int \exp\left(\vec f(s, \vec z_{ki})^\top\vec u\right) \\ &\hspace{110pt}\cdot \phi(\vec u; \vec\mu, \mat\Lambda) \der\vec u\bigg)\der s \end{align*} \]

This assumes we can change the order of integration. The result is similar to one by Yue and Kontar (2019).
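A minimal sketch of this computation: the inner Gaussian integral is the normal moment generating function, \(E[\exp(\vec f^\top\vec u)] = \exp(\vec f^\top\vec\mu + \vec f^\top\mat\Lambda\vec f / 2)\), leaving a single time integral (\(\vec f\) and \(\lambda_0\) below are illustrative choices only):

```python
import numpy as np
from scipy.integrate import quad

mu = np.array([0.2, -0.1])
Lam = np.array([[0.4, 0.1], [0.1, 0.3]])

def f(s, z):
    return np.array([z, z * s])  # hypothetical random intercept and slope

def baseline_hazard(s):
    return 0.3                   # illustrative constant baseline hazard

def integrand(s, z):
    # lambda_0(s) * E[exp(f(s, z)^T u)] with u ~ N(mu, Lam), using the
    # multivariate normal moment generating function for the inner integral.
    fs = f(s, z)
    return baseline_hazard(s) * np.exp(fs @ mu + fs @ Lam @ fs / 2.0)

# Expected log survival at t = 2.5 via a single one-dimensional integral.
expected_log_surv = -quad(integrand, 0.0, 2.5, args=(1.0,))[0]
print(expected_log_surv)
```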

Future work

Precision

The lower bounds are computationally attractive, but their precision is unknown.

Compare the Laplace approximation with the VA in a simple case.

Joint Models

Markers that are supposedly related to the event rate are often observed in discrete time.

Dependence is often assumed through a random mean function of the markers.

Thank You!

The presentation is at rpubs.com/boennecd/KTH-group-meeting-Oct19.

The markdown is at github.com/boennecd/Talks.

Notes are available at bit.ly/KTH19-VA-notes.

References are on the next slide.

References

Cox, D. R. 1972. “Regression Models and Life-Tables.” Journal of the Royal Statistical Society Series B 34 (2). [Royal Statistical Society, Wiley]: 187–220. http://www.jstor.org/stable/2985181.

Liu, Qing, and Donald A. Pierce. 1994. “A Note on Gauss-Hermite Quadrature.” Biometrika 81 (3). [Oxford University Press, Biometrika Trust]: 624–29. http://www.jstor.org/stable/2337136.

Ormerod, J. T., and M. P. Wand. 2012. “Gaussian Variational Approximate Inference for Generalized Linear Mixed Models.” Journal of Computational and Graphical Statistics 21 (1). Taylor & Francis: 2–17. doi:10.1198/jcgs.2011.09118.

Pinheiro, José C., and Douglas M. Bates. 1995. “Approximations to the Log-Likelihood Function in the Nonlinear Mixed-Effects Model.” Journal of Computational and Graphical Statistics 4 (1). American Statistical Association, Taylor & Francis, Ltd., Institute of Mathematical Statistics, Interface Foundation of America: 12–35. http://www.jstor.org/stable/1390625.

Royston, Patrick, and Mahesh K. B. Parmar. 2002. “Flexible Parametric Proportional-Hazards and Proportional-Odds Models for Censored Survival Data, with Application to Prognostic Modelling and Estimation of Treatment Effects.” Statistics in Medicine 21 (15): 2175–97. doi:10.1002/sim.1203.

Younes, Naji, and John Lachin. 1997. “Link-Based Models for Survival Data with Interval and Continuous Time Censoring.” Biometrics 53 (4). [Wiley, International Biometric Society]: 1199–1211. http://www.jstor.org/stable/2533490.

Yue, Xubo, and Raed Kontar. 2019. “Variational Inference of Joint Models Using Multivariate Gaussian Convolution Processes.”