Benjamin Christoffersen
28/07/2017
Survival analysis with time-varying coefficients
Continuous time model
Fast and scalable
Individual \( i = 1,\dots,n \)
Failure times \( T_1,\dots,T_n\in \mathbb{R}_+ \)
Timer intervals \( t = 1, \dots, d \)
Binary outcomes \( y_{it} = \left\{ \begin{matrix} 1 & T_i \in (t-1, t] \\ 0 & \text{otherwise} \end{matrix}\right. \)
Censoring indicators \( D_{i1},\dots,D_{i2},\dots \) (one if censored)
Covariates \( \vec{x}_{i1}, \vec{x}_{i2}, \dots \)
\[ \begin{array}{ll} \vec{\alpha}_{t + 1} = \mat{F}\vec{\alpha}_t + \mat{R}\vec{\eta}_t \qquad & \vec{\eta}_t \sim N(\vec{0}, \mat{Q}) \\ & \vec{\alpha}_{0} \sim N(\vec{a}_0, \mat{Q}_0) \end{array} , \qquad t = 1,\dots, d \]
Storage provider
More than 65000 hard disks Today
3 years of data (2013Q2‒2016Q3)
84366 observations and 623268 data rows
hd_dat[
1:10, c("serial_number", "model", "tstart",
"tstop", "fails", "smart_12")]
serial_number model tstart tstop fails smart_12
505 5XW004AJ ST31500541AS 30.0 40.0 0 0
506 5XW004AJ ST31500541AS 40.0 43.2 0 24
507 5XW004AJ ST31500541AS 43.2 56.9 0 25
508 5XW004Q0 ST31500541AS 40.6 51.0 0 0
509 5XW004Q0 ST31500541AS 51.0 53.7 0 54
510 5XW004Q0 ST31500541AS 53.7 54.1 0 56
511 5XW004Q0 ST31500541AS 54.1 54.4 0 57
512 5XW004Q0 ST31500541AS 54.4 54.5 0 58
513 5XW004Q0 ST31500541AS 54.5 54.7 0 59
514 5XW004Q0 ST31500541AS 54.7 57.2 0 61
| Number of hard disks | Number of failures | |
|---|---|---|
| ST4000DM000 | 36286 | 1556 |
| HMS5C4040BLE640 | 9333 | 40 |
| HMS5C4040ALE640 | 7162 | 93 |
| ST8000DM002 | 5155 | 17 |
| HDS722020ALA330 | 4774 | 230 |
| ST3000DM001 | 4698 | 1701 |
Top 6 hard disk versions by number of disks
\[ L(\mat{Q},\mat{Q}_0, \vec{a}_0) = p(\vec{\alpha}_0)\prod_{t = 1}^d \proppCond{\vec{\alpha}_t}{\vec{\alpha}_{t-1}} \prod_{i \in R_t} \proppCond{y_{it}}{\vec{\alpha}_t} \]
Risk set \( R_t = \Lbrace{i \in \{1,\dots,n\}: D_{it} = 0} \)
\[ \begin{aligned} \log L \Lparen{\mat{Q},\mat{Q}_0, \vec{a}_0} = & - \frac{1}{2} \Lparen{\vec{\alpha}_0 - \vec{a}_0}^\top \mat{Q}^{-1}_0\Lparen{\vec{\alpha}_0 - \vec{a}_0} \\ & - \frac{1}{2} \sum_{t = 1}^d \Lparen{\vec{\alpha}_t - \mat{F}\vec{\alpha}_{t - 1}}^\top\mat{R}^\top\mat{Q}^{-1}\mat{R}\Lparen{\vec{\alpha}_t - \mat{F}\vec{\alpha}_{t - 1}} \\ & - \frac{1}{2} \log \deter{\mat{Q}_0} - \frac{1}{2d} \log \deter{\mat{Q}} \\ & + \sum_{t = 1}^d \sum_{i \in R_t} l_{it}({\vec{\alpha}_t}) + \dots \end{aligned} \]
Same prediction step as KF
Make 1. order Taylor expansion around \( \emNotee{\vec{a}}{t}{t - 1} \)
Same as one Fisher scoring step in:
\[ \begin{array}{c} \emNotee{\vec{a}}{t}{t} = \argminu{\vec{\alpha}} - \log \proppCond{\vec{\alpha}}{\emNotee{\vec{a}}{t}{t-1}, \emNotee{\mat{V}}{t}{t-1}} -\sum_{i \in R_t} \log\proppCond{y_{it}}{\vec{\alpha}} \\ % \vec{\alpha} \sim N\Lparen{\emNotee{\vec{a}}{t}{t-1}, \emNotee{\mat{V}}{t}{t-1}} \end{array} \]
Essentially L2 penalized generalized linear models
Mode approximation: Minimize directly using Newton Raphson
EKF with extra iteration: More iterations using working responses as glm
library(dynamichazard)
frm <- Surv(tstart, tstop, fails) ~ -1 + model
system.time(
ddfit <- ddhazard(
formula = frm, data = hd_dat,
id = hd_dat$serial_number,
Q_0 = diag(1, 17), Q = diag(.01, 17),
by = 1, max_T = 60,
control = list(
method = "EKF",
eps = .005,
NR_eps = .00001)))
user system elapsed
17.0 1.2 10.9
See this blog post for explanation
frm <- Surv(tstart, tstop, fails) ~ -1 + model + smart_12 : model_large
Slides at rpubs.com/boennecd/EMS17 and code at github/boennecd/Talks
dynamichazard::ddhazard_app()
[2] L. Fahrmeir. “Posterior Mode Estimation by
Extended Kalman Filtering for Multivariate Dynamic
Generalized Linear Models”. In: Journal of the
American Statistical Association 87.418 (1992), pp.
501-509. [3] R. H. Shumway and D. S. Stoffer. “An Approach to
Time Series Smoothing and Forecasting Using the Em
Algorithm”. In: Journal of Time Series Analysis 3.4
(1982), pp. 253-264. ISSN: 1467-9892. DOI:
10.1111/j.1467-9892.1982.tb00349.x.