Survival Analysis with R and Aster

Based on the paper “Cox Proportional-Hazards Regression for Survival Data in R” (John Fox & Sanford Weisberg; last revision: 23 February 2011) https://socserv.socsci.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Cox-Regression.pdf

Paper Wordcloud

Basic Concepts and Notation

Let \(T\) be a random variable representing survival time with cumulative distribution function:

\[\begin{aligned} P(t) = Pr(T<=t) \end{aligned}\]

Then survival function is the compliment of the distribution function:

\[\begin{aligned} S(t) = Pr(T>t) = 1 - P(t) \end{aligned}\]

Finally, the hazard function (or hazard rate) assesses the instantaneous risk of demise at time \(t\), conditional on survival to that time:

\[\begin{aligned} h(t) = \lim_{\Delta t\to 0} \frac{Pr(t<=T<t+\Delta t|T>=t)}{\Delta t} \end{aligned}\]

For example, in acturial science , the hazard rate (or failure rate or force of mortality) is the rate of death for lives aged \(x\).
Survival data almost always includes censoring: the most common form is right-censoring - the period of observation expires or subject is removed from study before the event occurs. Survival models account for censoring which complicates the estimation of survival models.
Censoring must be independent of the future value of the hazard for the individual. Censoring that meets this requirement is noninformative (good thing).

Survival Analysis Applications

There are four major applications of survival analysis into analytics:

Business Planning (Churn): Profiling customers who has a higher survival rate and make strategy accordingly.
Lifetime Value Prediction (LTV): Engage with customers according to their lifetime value
Active customers : Predict when the customer will be active for the next time and take interventions accordingly.
Campaign evaluation: Monitor effect of campaign on the survival rate of customers.

Following are some industrial specific applications of survival analysis :

Banking – customer lifetime and LTV
Insurance – time to lapsing on policy
Mortgages – time to mortgage redemption
Mail Order Catalogue – time to next purchase
Retail – time till food customer starts purchasing non-food
Manufacturing – lifetime of a machine component
Public Sector – time intervals to critical events

Source: Is survival analysis the right model for you?

The Cox Proportional-Hazards Model

Survival analysis typically examines the relationship of the survival distribution to covariates: \[\begin{aligned} h_i(t) = f(t, x_{i1}, x_{i2}, ..., x_{ik}) \end{aligned}\]

For example, a parametric model based on the exponential distribution may be written as \[\begin{aligned} h_i(t) = exp(\alpha + \beta_1*x_{i1} + \beta_2*x_{i2} + ... + \beta_k*x_{ik}) \end{aligned}\] Note, that a constant hazard implies an exponential distribution of survival times.

The Cox model leaves the baseline hazard function \(\alpha(t)=log(h_0(t))\) unspecified: \[\begin{aligned} h_i(t) = h_0(t)*exp(\beta_1*x_{i1} + \beta_2*x_{i2} + ... + \beta_k*x_{ik}) \end{aligned}\] This model is semi-parametric because the baseline hazard function \(h_0(t)\) can take any form while covariates enter the model lineary. But any two observations with different \(x\)-values will have their hazard ratio independent of time \(t\) since non-parametric component that depends on time is the same for both.

The R coxph Function

The Cox proportional-hazards regression model is fit in R with the coxph() function (found in the survival package):

library(survival)
args(coxph)

## function (formula, data, weights, subset, na.action, init, control, 
##     ties = c("efron", "breslow", "exact"), singular.ok = TRUE, 
##     robust = FALSE, model = FALSE, x = FALSE, y = TRUE, tt, method = ties, 
##     ...) 
## NULL

Most of the arguments to coxph(), including gdata, weights, subset, na.action, singular.ok, model, x and y, are familiar from lm(). The formula argument is a little different. The right-hand side of the formula for coxph() is the same as for a linear model. The left-hand side is a survival object, created by the Surv() function:

Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio

In the simple case of right-censored data, the call to Surv() takes the form Surv(time, event), where time is either the event time or the censoring time, and event is a dummy variable coded 1 if the event is observed or 0 if the observation is censored.

Experimental Study of Recidivism

For illustration consider data from an experimental study of recidivism of 432 male prisoners, who were observed for a year after being released from prison (Rossi et al., 1980). The following variables are included in the data:

week: week of first arrest after release, or censoring time.
arrest: the event indicator, equal to 1 for those arrested during the period of the study and 0 for those who were not arrested.
fin: a factor, with levels yes if the individual received financial aid after release from prison, and no if he did not; financial aid was a randomly assigned factor manipulated by the researchers.
age: in years at the time of release.
race: a factor with levels black and other.
wexp: a factor with levels yes if the individual had full-time work experience prior to incarceration and no if he did not.
mar: a factor with levels married if the individual was married at the time of release and not married if he was not.
paro: a factor coded yes if the individual was released on parole and no if he was not.
prio: number of prior convictions.
educ: education, a categorical variable coded numerically, with codes 2 (grade 6 or less), 3 (grades 6 through 9), 4 (grades 10 and 11), 5 (grade 12), or 6 (some post-secondary).
emp1 – emp52: factors coded yes if the individual was employed in the corresponding week of the study and no otherwise.

We read the data file into a data frame, and print the first few observations (omitting the variables emp1 – emp52, which are in columns 11–62 of the data frame):

url = "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
Rossi = read.table(url, header=TRUE)
Rossi[1:5, 1:10]

##   week arrest fin age  race wexp         mar paro prio educ
## 1   20      1  no  27 black   no not married  yes    3    3
## 2   17      1  no  18 black   no not married  yes    8    4
## 3   25      1  no  19 other  yes not married  yes   13    3
## 4   52      0 yes  23 black  yes     married  yes    1    5
## 5   52      0  no  19 other  yes not married  yes    3    3

Thus, for example, the first individual was arrested in week 20 of the study, while the fourth individual was never rearrested, and hence has a censoring time of 52.

Kaplan–Meier Estimator

Kaplan–Meier estimator is used to estimate the survival function from lifetime data. A plot of the Kaplan–Meier estimator is a series of declining horizontal steps which, with a large enough sample size, approaches the true survival function for that population.

Cox Regression in R

A Cox regression of time to rearrest on the time-constant covariates is specified as follows:

mod.allison = 
  coxph(Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio, 
        data=Rossi)
summary(mod.allison)

## Call:
## coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp + 
##     mar + paro + prio, data = Rossi)
## 
##   n= 432, number of events= 114 
## 
##                    coef exp(coef) se(coef)      z Pr(>|z|)   
## finyes         -0.37942   0.68426  0.19138 -1.983  0.04742 * 
## age            -0.05744   0.94418  0.02200 -2.611  0.00903 **
## raceother      -0.31390   0.73059  0.30799 -1.019  0.30812   
## wexpyes        -0.14980   0.86088  0.21222 -0.706  0.48029   
## marnot married  0.43370   1.54296  0.38187  1.136  0.25606   
## paroyes        -0.08487   0.91863  0.19576 -0.434  0.66461   
## prio            0.09150   1.09581  0.02865  3.194  0.00140 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                exp(coef) exp(-coef) lower .95 upper .95
## finyes            0.6843     1.4614    0.4702    0.9957
## age               0.9442     1.0591    0.9043    0.9858
## raceother         0.7306     1.3688    0.3995    1.3361
## wexpyes           0.8609     1.1616    0.5679    1.3049
## marnot married    1.5430     0.6481    0.7300    3.2614
## paroyes           0.9186     1.0886    0.6259    1.3482
## prio              1.0958     0.9126    1.0360    1.1591
## 
## Concordance= 0.64  (se = 0.027 )
## Rsquare= 0.074   (max possible= 0.956 )
## Likelihood ratio test= 33.27  on 7 df,   p=2.362e-05
## Wald test            = 32.11  on 7 df,   p=3.871e-05
## Score (logrank) test = 33.53  on 7 df,   p=2.11e-05

Several important observations from this report:

The column marked \(z\) in the output records the ratio of each regression coefficient to its standard error, a Wald statistic which is asymptotically standard normal under the hypothesis that the corresponding β is 0. The covariates age and prio (prior convictions) have highly statistically significant coefficients, while the coefficient for fin (financial aid—the focus of the study) is marginally significant.
The exponentiated coefficients in the second column of the first panel (and in the first column of the second panel) of the output are interpretable as multiplicative effects on the hazard. Thus, for example, holding the other covariates constant, an additional year of age reduces the weekly hazard of rearrest by a factor of \(e^{b_2} = 0.944\) on average—that is, by 5.6 percent. Similarly, each prior conviction (prio) increases the hazard by a factor of 1.096, or 9.6 percent.
The likelihood-ratio, Wald, and score chi-square statistics at the bottom of the output are asymptotically equivalent tests of the omnibus null hypothesis that all of the βs are 0. In this instance, the test statistics are in close agreement, and the omnibus null hypothesis is soundly rejected.

Estimated Survival Function Distribution

Having fit a Cox model to the data, it is often of interest to examine the estimated distribution of survival times. The survfit() function estimates \(S(t)\), by default at the mean values of the covariates. The plot method for objects returned by survfit() graphs the estimated surivival function, along with a point-wise 95-percent confidence band. For example, for the model just fit to the recidivism data:

surv.model = survfit(mod.allison)

Even more cogently, we may wish to display how estimated survival depends upon the value of a covariate. Because the principal purpose of the recidivism study was to assess the impact of financial aid on rearrest, we focus on this covariate. We construct a new data frame with two rows, one for each value of fin; the other covariates are fixed to their average values:

Recieved Fin. Aid	Age	Race	FT Work Exp.	Married	Paroled	Prior Convictions
No	mean(age)	mean(Other)	mean(Yes)	mean(Not married)	mean(Yes)	mean(prio)
Yes	mean(age)	mean(Other)	mean(Yes)	mean(Not married)	mean(Yes)	mean(prio)

For a dummy covariate, such as the contrast associated with race, the average value is the proportion coded 1 in the data set—in the case of race, the proportion of non-blacks. This data frame is passed to survfit() via the newdata argument:

Rossi.fin = with(Rossi, 
  data.frame(fin=c(0, 1), 
             age=rep(mean(age), 2), 
             race=rep(mean(race == "other"), 2),
             wexp=rep(mean(wexp == "yes"), 2), 
             mar=rep(mean(mar == "not married"), 2),
             paro=rep(mean(paro == "yes"), 2), 
             prio=rep(mean(prio), 2)))

surv.fin = survfit(mod.allison, newdata=Rossi.fin)

Survival Analysis with Aster R

Aster R Environment

library(TeradataAsterR)

## Loading required package: RODBC

ta.connect("PreSalesCluster1-dallas", database = "baseball")

## RODBC Connection 1
## Details:
##   case=nochange
##   DSN=PreSalesCluster1-dallas
##   DATABASE=baseball

Data Load

Replicating recidivism study data in Aster:

ta.dropTable("rossi_original", schemaName = "public")

## [1] -2

ta.create(data=Rossi[,1:10] %>% mutate(id = row_number()), 
          table="rossi_original", schemaName = "public", tableType = "dimension")

rossi.ta = ta.transform(ta.data.frame("rossi_original", "table", schemaName = "public"),
             fin = if(fin == 'yes') 1 else 0,
             race = if(race == 'other') 1 else 0,
             wexp = if(wexp == 'yes') 1 else 0,
             mar = if(mar == 'not married') 1 else 0,
             paro = if(paro == 'yes') 1 else 0)
ta.dropTable("rossi", schemaName = "public")

## [1] -2

ta.create(data=rossi.ta, table="rossi", schemaName = "public", 
          tableType = "fact", partitionKey = "id")
ta.rm(rossi.ta)
rossi.ta = ta.data.frame("rossi", schemaName = "public")

Cox Regression Model

Function aa.coxph() is analog of R coxph() but with slightly different argument convention:

mod.aa = aa.coxph(data = rossi.ta, 
        feature.columns = c('fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'),
        # categorical.columns = c('fin', 'race', 'wexp', 'mar', 'paro'),
        time.interval.column = 'week', event.column = 'arrest'
        #,accumulate = "id"  # see JIRA ANLY-1849
        ) 

names(mod.aa)

## [1] "coefficient.table"      "linear.predictor.table"
## [3] "output"

Comparing Regression Models Between R and Aster

##   id predictor category coefficient  exp_coef  std_error    z_score
## 6  1       fin       NA -0.37942217 0.6842567 0.19137948 -1.9825645
## 3  2       age       NA -0.05743774 0.9441807 0.02199947 -2.6108693
## 1  3      race       NA -0.31389979 0.7305922 0.30799278 -1.0191791
## 2  4      wexp       NA -0.14979570 0.8608838 0.21222430 -0.7058367
## 7  5       mar       NA  0.43370388 1.5429619 0.38186806  1.1357427
## 4  6      paro       NA -0.08487108 0.9186307 0.19575667 -0.4335540
## 5  7      prio       NA  0.09149708 1.0958136 0.02864855  3.1937770
##       p_value significance
## 6 0.047416095            *
## 3 0.009031240           **
## 1 0.308117970             
## 2 0.480289695             
## 7 0.256064244             
## 4 0.664612364             
## 5 0.001404246           **

Survival Function Distribution

As before with survfit() using aa.cox.survfit() we can generate survival probabilities for selected subjects (based on their features). And as before we construct a subject with mean covariate values and show the resulting survival distribution:

# mod.allison$means
all_means = ta.colMeans(rossi.ta)[c('fin','age','race','wexp','mar','paro','prio')]
ta.dropTable("rossi_means", schemaName = "public")

## [1] -2

ta.create(data=as.data.frame(t(all_means)), 
          table="rossi_means", schemaName = "public", tableType = "dimension")
rossi.means.ta = ta.data.frame("rossi_means", schemaName = "public")

surv.fin.aa = aa.cox.survfit(
  object = mod.aa$coefficient.table,
  cox.model.table = mod.aa$linear.predictor.table,
  predict.table = rossi.means.ta,
  # survival.probability = "rossi_survival_prob",
  predict.feature.columns = c('fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'),
  predict.feature.names = c('fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'))

Predicted Hazard Ratios

Next, we’d like to generate hazard ratios between predictive features and reference mean features (mean covariate subject created in the previous section):

predicted.ratios.aa = aa.cox.predict(
  object = mod.aa, 
  predicts = rossi.ta,
  refs = rossi.means.ta,
  predicts.partition.column = "1",
  predict.feature.columns = c('fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'),
  predict.feature.names = c('fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'),
  ref.feature.columns = c('fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'),
  refs.partition.column = c('1'),
  accumulate = c("id")
  )

predictions.aa = predicted.ratios.aa[[1]]

Noticably, financial aid is missing in just 3 of 20 lowest risk subjects while it is present in just 4 of the 20 highest risk subjects. You can observe this trend on the whole dataset just by tracking how colors shift from left to right:

But what about highly significant prio and age covariates? Let’s observe their distribution across subjects with similar visuals:

Princeton Divorce Study

Princton hosts a collection of f small datasets used in the statistical course by Germán Rodríguez, classified by the type of statistical technique that may be used to analyze them. In particular, for survival data they have Marriage Dissolution in the U.S. that tracks married couples by husband education, husband race, couple mixed or not ethnicity. Event of interest (failure) is divorce and time is in years:

# http://data.princeton.edu/wws509/datasets/#divorce
divorce.df = read.csv(file="../data/divorce_raw.txt", sep="", header = FALSE)
names(divorce.df) = c('id','heduc','heblack','mixed','years','div')

head(divorce.df)

##   id heduc heblack mixed  years div
## 1  9     1       0     0 10.546   0
## 2 11     0       0     0 34.943   0
## 3 13     0       0     0  2.834   1
## 4 15     0       0     0 17.532   1
## 5 33     1       0     0  1.418   0
## 6 36     0       0     0 48.033   0

Cox PH Regression Model

mod.divorce = 
  coxph(Surv(years, div) ~ heduc + heblack + mixed, 
        data=divorce.df)

summary(mod.divorce)

## Call:
## coxph(formula = Surv(years, div) ~ heduc + heblack + mixed, data = divorce.df)
## 
##   n= 3371, number of events= 1032 
## 
##            coef exp(coef) se(coef)     z Pr(>|z|)   
## heduc   0.09426   1.09885  0.04718 1.998  0.04571 * 
## heblack 0.18367   1.20162  0.07974 2.303  0.02126 * 
## mixed   0.22936   1.25779  0.07929 2.893  0.00382 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## heduc       1.099     0.9100     1.002     1.205
## heblack     1.202     0.8322     1.028     1.405
## mixed       1.258     0.7950     1.077     1.469
## 
## Concordance= 0.524  (se = 0.009 )
## Rsquare= 0.006   (max possible= 0.99 )
## Likelihood ratio test= 18.81  on 3 df,   p=0.0002987
## Wald test            = 19.6  on 3 df,   p=0.0002051
## Score (logrank) test = 19.68  on 3 df,   p=0.0001976

Survival Function

surv.model.divorce = survfit(mod.divorce)

Survival Function by the Strongest Covariate - Mixed

divorce.mixed = with(divorce.df, 
  data.frame(mixed=c(0, 1), 
             heduc=rep(mean(heduc), 2), 
             heblack=rep(mean(heblack), 2)))

surv.mixed = survfit(mod.divorce, newdata=divorce.mixed)