Based on the paper “Cox Proportional-Hazards Regression for Survival Data in R” (John Fox & Sanford Weisberg; last revision: 23 February 2011) https://socserv.socsci.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Cox-Regression.pdf

Paper Wordcloud

Paper Wordcloud

Basic Concepts and Notation

\[\begin{aligned} P(t) = Pr(T<=t) \end{aligned}\]

\[\begin{aligned} S(t) = Pr(T>t) = 1 - P(t) \end{aligned}\]

\[\begin{aligned} h(t) = \lim_{\Delta t\to 0} \frac{Pr(t<=T<t+\Delta t|T>=t)}{\Delta t} \end{aligned}\]

Survival Analysis Applications

There are four major applications of survival analysis into analytics:

  1. Business Planning (Churn): Profiling customers who has a higher survival rate and make strategy accordingly.
  2. Lifetime Value Prediction (LTV): Engage with customers according to their lifetime value
  3. Active customers : Predict when the customer will be active for the next time and take interventions accordingly.
  4. Campaign evaluation: Monitor effect of campaign on the survival rate of customers.

Following are some industrial specific applications of survival analysis :

Source: Is survival analysis the right model for you?

The Cox Proportional-Hazards Model

Survival analysis typically examines the relationship of the survival distribution to covariates: \[\begin{aligned} h_i(t) = f(t, x_{i1}, x_{i2}, ..., x_{ik}) \end{aligned}\]

For example, a parametric model based on the exponential distribution may be written as \[\begin{aligned} h_i(t) = exp(\alpha + \beta_1*x_{i1} + \beta_2*x_{i2} + ... + \beta_k*x_{ik}) \end{aligned}\] Note, that a constant hazard implies an exponential distribution of survival times.

The Cox model leaves the baseline hazard function \(\alpha(t)=log(h_0(t))\) unspecified: \[\begin{aligned} h_i(t) = h_0(t)*exp(\beta_1*x_{i1} + \beta_2*x_{i2} + ... + \beta_k*x_{ik}) \end{aligned}\] This model is semi-parametric because the baseline hazard function \(h_0(t)\) can take any form while covariates enter the model lineary. But any two observations with different \(x\)-values will have their hazard ratio independent of time \(t\) since non-parametric component that depends on time is the same for both.

The R coxph Function

The Cox proportional-hazards regression model is fit in R with the coxph() function (found in the survival package):

library(survival)
args(coxph)
## function (formula, data, weights, subset, na.action, init, control, 
##     ties = c("efron", "breslow", "exact"), singular.ok = TRUE, 
##     robust = FALSE, model = FALSE, x = FALSE, y = TRUE, tt, method = ties, 
##     ...) 
## NULL

Most of the arguments to coxph(), including gdata, weights, subset, na.action, singular.ok, model, x and y, are familiar from lm(). The formula argument is a little different. The right-hand side of the formula for coxph() is the same as for a linear model. The left-hand side is a survival object, created by the Surv() function:

Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio

In the simple case of right-censored data, the call to Surv() takes the form Surv(time, event), where time is either the event time or the censoring time, and event is a dummy variable coded 1 if the event is observed or 0 if the observation is censored.

Experimental Study of Recidivism

For illustration consider data from an experimental study of recidivism of 432 male prisoners, who were observed for a year after being released from prison (Rossi et al., 1980). The following variables are included in the data:

We read the data file into a data frame, and print the first few observations (omitting the variables emp1 – emp52, which are in columns 11–62 of the data frame):

url = "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
Rossi = read.table(url, header=TRUE)
Rossi[1:5, 1:10]
##   week arrest fin age  race wexp         mar paro prio educ
## 1   20      1  no  27 black   no not married  yes    3    3
## 2   17      1  no  18 black   no not married  yes    8    4
## 3   25      1  no  19 other  yes not married  yes   13    3
## 4   52      0 yes  23 black  yes     married  yes    1    5
## 5   52      0  no  19 other  yes not married  yes    3    3

Thus, for example, the first individual was arrested in week 20 of the study, while the fourth individual was never rearrested, and hence has a censoring time of 52.

Kaplan–Meier Estimator

Kaplan–Meier estimator is used to estimate the survival function from lifetime data. A plot of the Kaplan–Meier estimator is a series of declining horizontal steps which, with a large enough sample size, approaches the true survival function for that population.

Cox Regression in R

A Cox regression of time to rearrest on the time-constant covariates is specified as follows:

mod.allison = 
  coxph(Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio, 
        data=Rossi)
summary(mod.allison)
## Call:
## coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp + 
##     mar + paro + prio, data = Rossi)
## 
##   n= 432, number of events= 114 
## 
##                    coef exp(coef) se(coef)      z Pr(>|z|)   
## finyes         -0.37942   0.68426  0.19138 -1.983  0.04742 * 
## age            -0.05744   0.94418  0.02200 -2.611  0.00903 **
## raceother      -0.31390   0.73059  0.30799 -1.019  0.30812   
## wexpyes        -0.14980   0.86088  0.21222 -0.706  0.48029   
## marnot married  0.43370   1.54296  0.38187  1.136  0.25606   
## paroyes        -0.08487   0.91863  0.19576 -0.434  0.66461   
## prio            0.09150   1.09581  0.02865  3.194  0.00140 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                exp(coef) exp(-coef) lower .95 upper .95
## finyes            0.6843     1.4614    0.4702    0.9957
## age               0.9442     1.0591    0.9043    0.9858
## raceother         0.7306     1.3688    0.3995    1.3361
## wexpyes           0.8609     1.1616    0.5679    1.3049
## marnot married    1.5430     0.6481    0.7300    3.2614
## paroyes           0.9186     1.0886    0.6259    1.3482
## prio              1.0958     0.9126    1.0360    1.1591
## 
## Concordance= 0.64  (se = 0.027 )
## Rsquare= 0.074   (max possible= 0.956 )
## Likelihood ratio test= 33.27  on 7 df,   p=2.362e-05
## Wald test            = 32.11  on 7 df,   p=3.871e-05
## Score (logrank) test = 33.53  on 7 df,   p=2.11e-05

Several important observations from this report:

Estimated Survival Function Distribution

Having fit a Cox model to the data, it is often of interest to examine the estimated distribution of survival times. The survfit() function estimates \(S(t)\), by default at the mean values of the covariates. The plot method for objects returned by survfit() graphs the estimated surivival function, along with a point-wise 95-percent confidence band. For example, for the model just fit to the recidivism data:

surv.model = survfit(mod.allison)

Even more cogently, we may wish to display how estimated survival depends upon the value of a covariate. Because the principal purpose of the recidivism study was to assess the impact of financial aid on rearrest, we focus on this covariate. We construct a new data frame with two rows, one for each value of fin; the other covariates are fixed to their average values:

Recieved Fin. Aid Age Race FT Work Exp. Married Paroled Prior Convictions
No mean(age) mean(Other) mean(Yes) mean(Not married) mean(Yes) mean(prio)
Yes mean(age) mean(Other) mean(Yes) mean(Not married) mean(Yes) mean(prio)

For a dummy covariate, such as the contrast associated with race, the average value is the proportion coded 1 in the data set—in the case of race, the proportion of non-blacks. This data frame is passed to survfit() via the newdata argument:

Rossi.fin = with(Rossi, 
  data.frame(fin=c(0, 1), 
             age=rep(mean(age), 2), 
             race=rep(mean(race == "other"), 2),
             wexp=rep(mean(wexp == "yes"), 2), 
             mar=rep(mean(mar == "not married"), 2),
             paro=rep(mean(paro == "yes"), 2), 
             prio=rep(mean(prio), 2)))

surv.fin = survfit(mod.allison, newdata=Rossi.fin)

Survival Analysis with Aster R

Aster R Environment

library(TeradataAsterR)
## Loading required package: RODBC
ta.connect("PreSalesCluster1-dallas", database = "baseball")
## RODBC Connection 1
## Details:
##   case=nochange
##   DSN=PreSalesCluster1-dallas
##   DATABASE=baseball

Data Load

Replicating recidivism study data in Aster:

ta.dropTable("rossi_original", schemaName = "public") 
## [1] -2
ta.create(data=Rossi[,1:10] %>% mutate(id = row_number()), 
          table="rossi_original", schemaName = "public", tableType = "dimension")

rossi.ta = ta.transform(ta.data.frame("rossi_original", "table", schemaName = "public"),
             fin = if(fin == 'yes') 1 else 0,
             race = if(race == 'other') 1 else 0,
             wexp = if(wexp == 'yes') 1 else 0,
             mar = if(mar == 'not married') 1 else 0,
             paro = if(paro == 'yes') 1 else 0)
ta.dropTable("rossi", schemaName = "public")
## [1] -2
ta.create(data=rossi.ta, table="rossi", schemaName = "public", 
          tableType = "fact", partitionKey = "id")
ta.rm(rossi.ta)
rossi.ta = ta.data.frame("rossi", schemaName = "public")

Cox Regression Model

Function aa.coxph() is analog of R coxph() but with slightly different argument convention:

mod.aa = aa.coxph(data = rossi.ta, 
        feature.columns = c('fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'),
        # categorical.columns = c('fin', 'race', 'wexp', 'mar', 'paro'),
        time.interval.column = 'week', event.column = 'arrest'
        #,accumulate = "id"  # see JIRA ANLY-1849
        ) 

names(mod.aa)
## [1] "coefficient.table"      "linear.predictor.table"
## [3] "output"

Comparing Regression Models Between R and Aster

##   id predictor category coefficient  exp_coef  std_error    z_score
## 6  1       fin       NA -0.37942217 0.6842567 0.19137948 -1.9825645
## 3  2       age       NA -0.05743774 0.9441807 0.02199947 -2.6108693
## 1  3      race       NA -0.31389979 0.7305922 0.30799278 -1.0191791
## 2  4      wexp       NA -0.14979570 0.8608838 0.21222430 -0.7058367
## 7  5       mar       NA  0.43370388 1.5429619 0.38186806  1.1357427
## 4  6      paro       NA -0.08487108 0.9186307 0.19575667 -0.4335540
## 5  7      prio       NA  0.09149708 1.0958136 0.02864855  3.1937770
##       p_value significance
## 6 0.047416095            *
## 3 0.009031240           **
## 1 0.308117970             
## 2 0.480289695             
## 7 0.256064244             
## 4 0.664612364             
## 5 0.001404246           **