Based on the paper “Cox Proportional-Hazards Regression for Survival Data in R” (John Fox & Sanford Weisberg; last revision: 23 February 2011) https://socserv.socsci.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Cox-Regression.pdf
Paper Wordcloud
\[\begin{aligned} P(t) = Pr(T<=t) \end{aligned}\]
\[\begin{aligned} S(t) = Pr(T>t) = 1 - P(t) \end{aligned}\]
\[\begin{aligned} h(t) = \lim_{\Delta t\to 0} \frac{Pr(t<=T<t+\Delta t|T>=t)}{\Delta t} \end{aligned}\]
For example, in acturial science , the hazard rate (or failure rate or force of mortality) is the rate of death for lives aged \(x\).
Survival data almost always includes censoring: the most common form is right-censoring - the period of observation expires or subject is removed from study before the event occurs. Survival models account for censoring which complicates the estimation of survival models.
Censoring must be independent of the future value of the hazard for the individual. Censoring that meets this requirement is noninformative (good thing).
There are four major applications of survival analysis into analytics:
Following are some industrial specific applications of survival analysis :
Survival analysis typically examines the relationship of the survival distribution to covariates: \[\begin{aligned} h_i(t) = f(t, x_{i1}, x_{i2}, ..., x_{ik}) \end{aligned}\]
For example, a parametric model based on the exponential distribution may be written as \[\begin{aligned} h_i(t) = exp(\alpha + \beta_1*x_{i1} + \beta_2*x_{i2} + ... + \beta_k*x_{ik}) \end{aligned}\] Note, that a constant hazard implies an exponential distribution of survival times.
The Cox model leaves the baseline hazard function \(\alpha(t)=log(h_0(t))\) unspecified: \[\begin{aligned} h_i(t) = h_0(t)*exp(\beta_1*x_{i1} + \beta_2*x_{i2} + ... + \beta_k*x_{ik}) \end{aligned}\] This model is semi-parametric because the baseline hazard function \(h_0(t)\) can take any form while covariates enter the model lineary. But any two observations with different \(x\)-values will have their hazard ratio independent of time \(t\) since non-parametric component that depends on time is the same for both.
The Cox proportional-hazards regression model is fit in R with the coxph() function (found in the survival package):
library(survival)
args(coxph)
## function (formula, data, weights, subset, na.action, init, control,
## ties = c("efron", "breslow", "exact"), singular.ok = TRUE,
## robust = FALSE, model = FALSE, x = FALSE, y = TRUE, tt, method = ties,
## ...)
## NULL
Most of the arguments to coxph(), including gdata, weights, subset, na.action, singular.ok, model, x and y, are familiar from lm(). The formula argument is a little different. The right-hand side of the formula for coxph() is the same as for a linear model. The left-hand side is a survival object, created by the Surv() function:
Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio
In the simple case of right-censored data, the call to Surv() takes the form Surv(time, event), where time is either the event time or the censoring time, and event is a dummy variable coded 1 if the event is observed or 0 if the observation is censored.
For illustration consider data from an experimental study of recidivism of 432 male prisoners, who were observed for a year after being released from prison (Rossi et al., 1980). The following variables are included in the data:
We read the data file into a data frame, and print the first few observations (omitting the variables emp1 – emp52, which are in columns 11–62 of the data frame):
url = "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
Rossi = read.table(url, header=TRUE)
Rossi[1:5, 1:10]
## week arrest fin age race wexp mar paro prio educ
## 1 20 1 no 27 black no not married yes 3 3
## 2 17 1 no 18 black no not married yes 8 4
## 3 25 1 no 19 other yes not married yes 13 3
## 4 52 0 yes 23 black yes married yes 1 5
## 5 52 0 no 19 other yes not married yes 3 3
Thus, for example, the first individual was arrested in week 20 of the study, while the fourth individual was never rearrested, and hence has a censoring time of 52.
Kaplan–Meier estimator is used to estimate the survival function from lifetime data. A plot of the Kaplan–Meier estimator is a series of declining horizontal steps which, with a large enough sample size, approaches the true survival function for that population.
A Cox regression of time to rearrest on the time-constant covariates is specified as follows:
mod.allison =
coxph(Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio,
data=Rossi)
summary(mod.allison)
## Call:
## coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp +
## mar + paro + prio, data = Rossi)
##
## n= 432, number of events= 114
##
## coef exp(coef) se(coef) z Pr(>|z|)
## finyes -0.37942 0.68426 0.19138 -1.983 0.04742 *
## age -0.05744 0.94418 0.02200 -2.611 0.00903 **
## raceother -0.31390 0.73059 0.30799 -1.019 0.30812
## wexpyes -0.14980 0.86088 0.21222 -0.706 0.48029
## marnot married 0.43370 1.54296 0.38187 1.136 0.25606
## paroyes -0.08487 0.91863 0.19576 -0.434 0.66461
## prio 0.09150 1.09581 0.02865 3.194 0.00140 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## finyes 0.6843 1.4614 0.4702 0.9957
## age 0.9442 1.0591 0.9043 0.9858
## raceother 0.7306 1.3688 0.3995 1.3361
## wexpyes 0.8609 1.1616 0.5679 1.3049
## marnot married 1.5430 0.6481 0.7300 3.2614
## paroyes 0.9186 1.0886 0.6259 1.3482
## prio 1.0958 0.9126 1.0360 1.1591
##
## Concordance= 0.64 (se = 0.027 )
## Rsquare= 0.074 (max possible= 0.956 )
## Likelihood ratio test= 33.27 on 7 df, p=2.362e-05
## Wald test = 32.11 on 7 df, p=3.871e-05
## Score (logrank) test = 33.53 on 7 df, p=2.11e-05
Several important observations from this report:
Having fit a Cox model to the data, it is often of interest to examine the estimated distribution of survival times. The survfit() function estimates \(S(t)\), by default at the mean values of the covariates. The plot method for objects returned by survfit() graphs the estimated surivival function, along with a point-wise 95-percent confidence band. For example, for the model just fit to the recidivism data:
surv.model = survfit(mod.allison)
Even more cogently, we may wish to display how estimated survival depends upon the value of a covariate. Because the principal purpose of the recidivism study was to assess the impact of financial aid on rearrest, we focus on this covariate. We construct a new data frame with two rows, one for each value of fin; the other covariates are fixed to their average values:
Recieved Fin. Aid | Age | Race | FT Work Exp. | Married | Paroled | Prior Convictions |
---|---|---|---|---|---|---|
No | mean(age) | mean(Other) | mean(Yes) | mean(Not married) | mean(Yes) | mean(prio) |
Yes | mean(age) | mean(Other) | mean(Yes) | mean(Not married) | mean(Yes) | mean(prio) |
For a dummy covariate, such as the contrast associated with race, the average value is the proportion coded 1 in the data set—in the case of race, the proportion of non-blacks. This data frame is passed to survfit() via the newdata argument:
Rossi.fin = with(Rossi,
data.frame(fin=c(0, 1),
age=rep(mean(age), 2),
race=rep(mean(race == "other"), 2),
wexp=rep(mean(wexp == "yes"), 2),
mar=rep(mean(mar == "not married"), 2),
paro=rep(mean(paro == "yes"), 2),
prio=rep(mean(prio), 2)))
surv.fin = survfit(mod.allison, newdata=Rossi.fin)
library(TeradataAsterR)
## Loading required package: RODBC
ta.connect("PreSalesCluster1-dallas", database = "baseball")
## RODBC Connection 1
## Details:
## case=nochange
## DSN=PreSalesCluster1-dallas
## DATABASE=baseball
Replicating recidivism study data in Aster:
ta.dropTable("rossi_original", schemaName = "public")
## [1] -2
ta.create(data=Rossi[,1:10] %>% mutate(id = row_number()),
table="rossi_original", schemaName = "public", tableType = "dimension")
rossi.ta = ta.transform(ta.data.frame("rossi_original", "table", schemaName = "public"),
fin = if(fin == 'yes') 1 else 0,
race = if(race == 'other') 1 else 0,
wexp = if(wexp == 'yes') 1 else 0,
mar = if(mar == 'not married') 1 else 0,
paro = if(paro == 'yes') 1 else 0)
ta.dropTable("rossi", schemaName = "public")
## [1] -2
ta.create(data=rossi.ta, table="rossi", schemaName = "public",
tableType = "fact", partitionKey = "id")
ta.rm(rossi.ta)
rossi.ta = ta.data.frame("rossi", schemaName = "public")
Function aa.coxph() is analog of R coxph() but with slightly different argument convention:
mod.aa = aa.coxph(data = rossi.ta,
feature.columns = c('fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'),
# categorical.columns = c('fin', 'race', 'wexp', 'mar', 'paro'),
time.interval.column = 'week', event.column = 'arrest'
#,accumulate = "id" # see JIRA ANLY-1849
)
names(mod.aa)
## [1] "coefficient.table" "linear.predictor.table"
## [3] "output"
## id predictor category coefficient exp_coef std_error z_score
## 6 1 fin NA -0.37942217 0.6842567 0.19137948 -1.9825645
## 3 2 age NA -0.05743774 0.9441807 0.02199947 -2.6108693
## 1 3 race NA -0.31389979 0.7305922 0.30799278 -1.0191791
## 2 4 wexp NA -0.14979570 0.8608838 0.21222430 -0.7058367
## 7 5 mar NA 0.43370388 1.5429619 0.38186806 1.1357427
## 4 6 paro NA -0.08487108 0.9186307 0.19575667 -0.4335540
## 5 7 prio NA 0.09149708 1.0958136 0.02864855 3.1937770
## p_value significance
## 6 0.047416095 *
## 3 0.009031240 **
## 1 0.308117970
## 2 0.480289695
## 7 0.256064244
## 4 0.664612364
## 5 0.001404246 **
As before with survfit() using aa.cox.survfit() we can generate survival probabilities for selected subjects (based on their features). And as before we construct a subject with mean covariate values and show the resulting survival distribution:
# mod.allison$means
all_means = ta.colMeans(rossi.ta)[c('fin','age','race','wexp','mar','paro','prio')]
ta.dropTable("rossi_means", schemaName = "public")
## [1] -2
ta.create(data=as.data.frame(t(all_means)),
table="rossi_means", schemaName = "public", tableType = "dimension")
rossi.means.ta = ta.data.frame("rossi_means", schemaName = "public")
surv.fin.aa = aa.cox.survfit(
object = mod.aa$coefficient.table,
cox.model.table = mod.aa$linear.predictor.table,
predict.table = rossi.means.ta,
# survival.probability = "rossi_survival_prob",
predict.feature.columns = c('fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'),
predict.feature.names = c('fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'))
Next, we’d like to generate hazard ratios between predictive features and reference mean features (mean covariate subject created in the previous section):
predicted.ratios.aa = aa.cox.predict(
object = mod.aa,
predicts = rossi.ta,
refs = rossi.means.ta,
predicts.partition.column = "1",
predict.feature.columns = c('fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'),
predict.feature.names = c('fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'),
ref.feature.columns = c('fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'),
refs.partition.column = c('1'),
accumulate = c("id")
)
predictions.aa = predicted.ratios.aa[[1]]
Noticably, financial aid is missing in just 3 of 20 lowest risk subjects while it is present in just 4 of the 20 highest risk subjects. You can observe this trend on the whole dataset just by tracking how colors shift from left to right:
But what about highly significant prio and age covariates? Let’s observe their distribution across subjects with similar visuals:
Princton hosts a collection of f small datasets used in the statistical course by Germán Rodríguez, classified by the type of statistical technique that may be used to analyze them. In particular, for survival data they have Marriage Dissolution in the U.S. that tracks married couples by husband education, husband race, couple mixed or not ethnicity. Event of interest (failure) is divorce and time is in years:
# http://data.princeton.edu/wws509/datasets/#divorce
divorce.df = read.csv(file="../data/divorce_raw.txt", sep="", header = FALSE)
names(divorce.df) = c('id','heduc','heblack','mixed','years','div')
head(divorce.df)
## id heduc heblack mixed years div
## 1 9 1 0 0 10.546 0
## 2 11 0 0 0 34.943 0
## 3 13 0 0 0 2.834 1
## 4 15 0 0 0 17.532 1
## 5 33 1 0 0 1.418 0
## 6 36 0 0 0 48.033 0
mod.divorce =
coxph(Surv(years, div) ~ heduc + heblack + mixed,
data=divorce.df)
summary(mod.divorce)
## Call:
## coxph(formula = Surv(years, div) ~ heduc + heblack + mixed, data = divorce.df)
##
## n= 3371, number of events= 1032
##
## coef exp(coef) se(coef) z Pr(>|z|)
## heduc 0.09426 1.09885 0.04718 1.998 0.04571 *
## heblack 0.18367 1.20162 0.07974 2.303 0.02126 *
## mixed 0.22936 1.25779 0.07929 2.893 0.00382 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## heduc 1.099 0.9100 1.002 1.205
## heblack 1.202 0.8322 1.028 1.405
## mixed 1.258 0.7950 1.077 1.469
##
## Concordance= 0.524 (se = 0.009 )
## Rsquare= 0.006 (max possible= 0.99 )
## Likelihood ratio test= 18.81 on 3 df, p=0.0002987
## Wald test = 19.6 on 3 df, p=0.0002051
## Score (logrank) test = 19.68 on 3 df, p=0.0001976
surv.model.divorce = survfit(mod.divorce)
divorce.mixed = with(divorce.df,
data.frame(mixed=c(0, 1),
heduc=rep(mean(heduc), 2),
heblack=rep(mean(heblack), 2)))
surv.mixed = survfit(mod.divorce, newdata=divorce.mixed)