Based on the paper

“Cox Proportional-Hazards Regression for Survival Data in R”(John Fox & Sanford Weisberg; last revision: 23 February 2011) https://socserv.socsci.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Cox-Regression.pdf

- Let \(T\) be a random variable representing survival time with cumulative distribution function:

\[\begin{aligned} P(t) = Pr(T<=t) \end{aligned}\]

- Then
*survival function*is the compliment of the distribution function:

\[\begin{aligned} S(t) = Pr(T>t) = 1 - P(t) \end{aligned}\]

- Finally, the
*hazard function*(or*hazard rate*) assesses the**instantaneous**risk of demise at time \(t\), conditional on survival to that time:

\[\begin{aligned} h(t) = \lim_{\Delta t\to 0} \frac{Pr(t<=T<t+\Delta t|T>=t)}{\Delta t} \end{aligned}\]

For example, in acturial science , the

*hazard rate*(or*failure rate*or*force of mortality*) is the rate of death for lives aged \(x\).Survival data almost always includes

*censoring*: the most common form is*right-censoring*- the period of observation expires or subject is removed from study before the event occurs. Survival models account for*censoring*which complicates the estimation of survival models.Censoring must be independent of the future value of the hazard for the individual. Censoring that meets this requirement is

*noninformative*(good thing).

There are four major applications of survival analysis into analytics:

**Business Planning (Churn)**: Profiling customers who has a higher survival rate and make strategy accordingly.**Lifetime Value Prediction (LTV)**: Engage with customers according to their lifetime value**Active customers**: Predict when the customer will be active for the next time and take interventions accordingly.**Campaign evaluation**: Monitor effect of campaign on the survival rate of customers.

Following are some industrial specific applications of survival analysis :

- Banking – customer lifetime and LTV
- Insurance – time to lapsing on policy
- Mortgages – time to mortgage redemption
- Mail Order Catalogue – time to next purchase
- Retail – time till food customer starts purchasing non-food
- Manufacturing – lifetime of a machine component
- Public Sector – time intervals to critical events

Survival analysis typically examines the relationship of the survival distribution to covariates: \[\begin{aligned} h_i(t) = f(t, x_{i1}, x_{i2}, ..., x_{ik}) \end{aligned}\]

For example, a parametric model based on the exponential distribution may be written as \[\begin{aligned} h_i(t) = exp(\alpha + \beta_1*x_{i1} + \beta_2*x_{i2} + ... + \beta_k*x_{ik}) \end{aligned}\] Note, that a constant hazard implies an exponential distribution of survival times.

The *Cox model* leaves the baseline hazard function \(\alpha(t)=log(h_0(t))\) unspecified: \[\begin{aligned}
h_i(t) = h_0(t)*exp(\beta_1*x_{i1} + \beta_2*x_{i2} + ... + \beta_k*x_{ik})
\end{aligned}\] This model is *semi-parametric* because the baseline hazard function \(h_0(t)\) can take any form while covariates enter the model lineary. But any two observations with different \(x\)-values will have their hazard ratio independent of time \(t\) since non-parametric component that depends on time is the same for both.

The Cox proportional-hazards regression model is fit in R with the *coxph()* function (found in the *survival* package):

```
library(survival)
args(coxph)
```

```
## function (formula, data, weights, subset, na.action, init, control,
## ties = c("efron", "breslow", "exact"), singular.ok = TRUE,
## robust = FALSE, model = FALSE, x = FALSE, y = TRUE, tt, method = ties,
## ...)
## NULL
```

Most of the arguments to *coxph()*, including *gdata, weights, subset, na.action, singular.ok, model, x* and *y*, are familiar from *lm()*. The *formula* argument is a little different. The right-hand side of the formula for *coxph()* is the same as for a linear model. The left-hand side is a *survival* object, created by the *Surv()* function:

`Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio`

In the simple case of right-censored data, the call to *Surv()* takes the form *Surv(time, event)*, where *time* is either the event time or the censoring time, and *event* is a dummy variable coded **1** if the event is observed or **0** if the observation is censored.

For illustration consider data from an experimental study of recidivism of 432 male prisoners, who were observed for a year after being released from prison (Rossi et al., 1980). The following variables are included in the data:

**week**: week of first arrest after release, or censoring time.**arrest**: the event indicator, equal to 1 for those arrested during the period of the study and 0 for those who were not arrested.**fin**: a factor, with levels*yes*if the individual received financial aid after release from prison, and*no*if he did not; financial aid was a randomly assigned factor manipulated by the researchers.**age**: in years at the time of release.**race**: a factor with levels black and other.**wexp**: a factor with levels yes if the individual had full-time work experience prior to incarceration and no if he did not.**mar**: a factor with levels married if the individual was married at the time of release and not married if he was not.**paro**: a factor coded yes if the individual was released on parole and no if he was not.**prio**: number of prior convictions.**educ**: education, a categorical variable coded numerically, with codes 2 (grade 6 or less), 3 (grades 6 through 9), 4 (grades 10 and 11), 5 (grade 12), or 6 (some post-secondary).**emp1 – emp52**: factors coded yes if the individual was employed in the corresponding week of the study and no otherwise.

We read the data file into a data frame, and print the first few observations (omitting the variables **emp1 – emp52**, which are in columns **11–62** of the data frame):

```
url = "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
Rossi = read.table(url, header=TRUE)
Rossi[1:5, 1:10]
```

```
## week arrest fin age race wexp mar paro prio educ
## 1 20 1 no 27 black no not married yes 3 3
## 2 17 1 no 18 black no not married yes 8 4
## 3 25 1 no 19 other yes not married yes 13 3
## 4 52 0 yes 23 black yes married yes 1 5
## 5 52 0 no 19 other yes not married yes 3 3
```

Thus, for example, the first individual was arrested in week 20 of the study, while the fourth individual was never rearrested, and hence has a censoring time of 52.

Kaplan–Meier estimator is used to estimate the survival function from lifetime data. A plot of the Kaplan–Meier estimator is a series of declining horizontal steps which, with a large enough sample size, approaches the true survival function for that population.

A Cox regression of time to rearrest on the time-constant covariates is specified as follows:

```
mod.allison =
coxph(Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio,
data=Rossi)
summary(mod.allison)
```

```
## Call:
## coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp +
## mar + paro + prio, data = Rossi)
##
## n= 432, number of events= 114
##
## coef exp(coef) se(coef) z Pr(>|z|)
## finyes -0.37942 0.68426 0.19138 -1.983 0.04742 *
## age -0.05744 0.94418 0.02200 -2.611 0.00903 **
## raceother -0.31390 0.73059 0.30799 -1.019 0.30812
## wexpyes -0.14980 0.86088 0.21222 -0.706 0.48029
## marnot married 0.43370 1.54296 0.38187 1.136 0.25606
## paroyes -0.08487 0.91863 0.19576 -0.434 0.66461
## prio 0.09150 1.09581 0.02865 3.194 0.00140 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## finyes 0.6843 1.4614 0.4702 0.9957
## age 0.9442 1.0591 0.9043 0.9858
## raceother 0.7306 1.3688 0.3995 1.3361
## wexpyes 0.8609 1.1616 0.5679 1.3049
## marnot married 1.5430 0.6481 0.7300 3.2614
## paroyes 0.9186 1.0886 0.6259 1.3482
## prio 1.0958 0.9126 1.0360 1.1591
##
## Concordance= 0.64 (se = 0.027 )
## Rsquare= 0.074 (max possible= 0.956 )
## Likelihood ratio test= 33.27 on 7 df, p=2.362e-05
## Wald test = 32.11 on 7 df, p=3.871e-05
## Score (logrank) test = 33.53 on 7 df, p=2.11e-05
```

Several important observations from this report:

- The column marked \(z\) in the output records the ratio of each regression coefficient to its standard error, a Wald statistic which is asymptotically standard normal under the hypothesis that the corresponding β is 0. The covariates
**age**and**prio**(prior convictions) have highly statistically significant coefficients, while the coefficient for**fin**(financial aid—the focus of the study) is marginally significant. - The exponentiated coefficients in the second column of the first panel (and in the first column of the second panel) of the output are interpretable as multiplicative effects on the hazard. Thus, for example, holding the other covariates constant, an additional year of
**age**reduces the weekly hazard of rearrest by a factor of \(e^{b_2} = 0.944\) on average—that is, by 5.6 percent. Similarly, each prior conviction (**prio**) increases the hazard by a factor of 1.096, or 9.6 percent. - The likelihood-ratio, Wald, and score chi-square statistics at the bottom of the output are asymptotically equivalent tests of the omnibus null hypothesis that all of the βs are 0. In this instance, the test statistics are in close agreement, and the omnibus null hypothesis is soundly rejected.

Having fit a Cox model to the data, it is often of interest to examine the estimated distribution of survival times. The *survfit()* function estimates \(S(t)\), by default at **the mean values** of the covariates. The plot method for objects returned by *survfit()* graphs the estimated surivival function, along with a point-wise 95-percent confidence band. For example, for the model just fit to the recidivism data:

`surv.model = survfit(mod.allison)`

Even more cogently, we may wish to display how estimated survival depends upon the value of a covariate. Because the principal purpose of the recidivism study was to assess the impact of financial aid on rearrest, we focus on this covariate. We construct a new data frame with two rows, one for each value of **fin**; the other covariates are fixed to their average values:

Recieved Fin. Aid | Age | Race | FT Work Exp. | Married | Paroled | Prior Convictions |
---|---|---|---|---|---|---|

No | mean(age) | mean(Other) | mean(Yes) | mean(Not married) | mean(Yes) | mean(prio) |

Yes | mean(age) | mean(Other) | mean(Yes) | mean(Not married) | mean(Yes) | mean(prio) |

For a dummy covariate, such as the contrast associated with **race**, the average value is the proportion coded 1 in the data set—in the case of **race**, the proportion of non-blacks. This data frame is passed to *survfit()* via the *newdata* argument:

```
Rossi.fin = with(Rossi,
data.frame(fin=c(0, 1),
age=rep(mean(age), 2),
race=rep(mean(race == "other"), 2),
wexp=rep(mean(wexp == "yes"), 2),
mar=rep(mean(mar == "not married"), 2),
paro=rep(mean(paro == "yes"), 2),
prio=rep(mean(prio), 2)))
surv.fin = survfit(mod.allison, newdata=Rossi.fin)
```

`library(TeradataAsterR)`

`## Loading required package: RODBC`

`ta.connect("PreSalesCluster1-dallas", database = "baseball")`

```
## RODBC Connection 1
## Details:
## case=nochange
## DSN=PreSalesCluster1-dallas
## DATABASE=baseball
```

Replicating recidivism study data in Aster:

`ta.dropTable("rossi_original", schemaName = "public") `

`## [1] -2`

```
ta.create(data=Rossi[,1:10] %>% mutate(id = row_number()),
table="rossi_original", schemaName = "public", tableType = "dimension")
rossi.ta = ta.transform(ta.data.frame("rossi_original", "table", schemaName = "public"),
fin = if(fin == 'yes') 1 else 0,
race = if(race == 'other') 1 else 0,
wexp = if(wexp == 'yes') 1 else 0,
mar = if(mar == 'not married') 1 else 0,
paro = if(paro == 'yes') 1 else 0)
ta.dropTable("rossi", schemaName = "public")
```

`## [1] -2`

```
ta.create(data=rossi.ta, table="rossi", schemaName = "public",
tableType = "fact", partitionKey = "id")
ta.rm(rossi.ta)
rossi.ta = ta.data.frame("rossi", schemaName = "public")
```

Function *aa.coxph()* is analog of R *coxph()* but with slightly different argument convention:

```
mod.aa = aa.coxph(data = rossi.ta,
feature.columns = c('fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'),
# categorical.columns = c('fin', 'race', 'wexp', 'mar', 'paro'),
time.interval.column = 'week', event.column = 'arrest'
#,accumulate = "id" # see JIRA ANLY-1849
)
names(mod.aa)
```

```
## [1] "coefficient.table" "linear.predictor.table"
## [3] "output"
```

```
## id predictor category coefficient exp_coef std_error z_score
## 6 1 fin NA -0.37942217 0.6842567 0.19137948 -1.9825645
## 3 2 age NA -0.05743774 0.9441807 0.02199947 -2.6108693
## 1 3 race NA -0.31389979 0.7305922 0.30799278 -1.0191791
## 2 4 wexp NA -0.14979570 0.8608838 0.21222430 -0.7058367
## 7 5 mar NA 0.43370388 1.5429619 0.38186806 1.1357427
## 4 6 paro NA -0.08487108 0.9186307 0.19575667 -0.4335540
## 5 7 prio NA 0.09149708 1.0958136 0.02864855 3.1937770
## p_value significance
## 6 0.047416095 *
## 3 0.009031240 **
## 1 0.308117970
## 2 0.480289695
## 7 0.256064244
## 4 0.664612364
## 5 0.001404246 **
```