The Cox Proportional Hazards (CPH) Model (Cox 1972) is one of the most popular models in survival analysis. The CPH model represents a particular type of hazard-based regression models (See Rubio et al. (2019) for more details on this, and this link for some examples of hazard regression models in R). The main assumption behind the CPH model is that the individual hazard function has the following structure: \[ h_{PH}(t\mid x_i) = h_0(t)\exp\left\{x_i^{\top}\beta \right\}, \] where \(h_0(\cdot)\) is a baseline hazard, \(\beta=(\beta_1,\ldots,\beta_p)^\top \in \mathbb{R}^p\), and \(x_{i}\) is a vector of covariates. This is, the covariates affect the baseline hazard, by either increasing or decreasing it. As discussed, the CPH model is a particular case of the GH structure (Rubio et al. 2019). The baseline hazard \(h_0\) can be specified to be the hazard function associated to a parametric distribution, or modelled using splines. However, in order to avoid misspecification of the baseline hazard, it is often preferred to estimate it non-parametrically, while the coefficients \(\beta\) are estimated using the log partial likelihood function (Cox 1972): \[ \ell_p(\beta) = \sum_{d_i=1} x_i^{\top}\beta - \sum_{d_i=1} \log\left( \sum_{k \in {\mathcal R}(t_i)} \exp\left\{x_k^{\top}\beta\right\}\right), \] where \(t_i\), \(i=1,\dots,n\), are the survival times, \({\mathcal R}(t) = \{i : t_i \geq t\}\) denotes the risk set at time \(t\), and \(d_i\) denotes the vital status (\(1-\) dead, \(0-\) alive). We refer the reader to Fan and Jiang (2009) and Rossell and Rubio (2019) for a general overview of the Cox model. The maximum partial likelihood estimator (MPLE) is the value of \(\beta\), \(\bar{\beta}\), that maximises the log partial likelihood function. The MPLEs are consistent and asymptotically normal (see Rossell and Rubio (2019) for a general overview of the properties of the MPLE).
The following R code shows an (probably suboptimal) implementation of the log partial likelihood function as well as three real-data examples illustrating the calculation of the MPLE (using the command optim()
). A comparison with the results obtained with the survival
R package is also presented.
Note. Although the CPH model is perhaps the most popular hazard structure, Prof. David Cox himself has acknowledged the richness of hazard models (Rubio et al. 2019), and the potential power of using parametric formulations in (Reid 1994):
Reid: The paper has had an enormous impact, as you know, in many different directions. What do you think are the most positive benefits of the work?
Cox: Handling in-study covariates, that is, time-dependent covariates, I think is rather important and the fact that it’s readily adapted to multiple events, what the sociologists call event history analysis, for instance. It’s the basis for really lots of further things in a fairly immediate way. Of course, another issue is the physical or substantive basis for the proportional hazards model. I think that’s one of its weaknesses, that accelerated life models are in many ways more appealing because of their quite direct physical interpretation, particularly in an engineering context.
Reid: How do you feel about the cottage industry that’s grown up around it?
Cox: Don’t know, really. In the light of some of the further results one knows since, I think I would normally want to tackle problems parametrically, so I would take the underlying hazard to be a Weibull or something. I’m not keen on nonparametric formulations usually.
rm(list=ls())
# Required packages
library(survival)
library(knitr)
########################################################################################################3
# Log partial likelihood for the Cox proportional hazards model
########################################################################################################
# X : design matrix
# status : vital status (1 - dead, 0 - alive)
# times : survival times
# n.obs : number of observed events
# Risk set function
risk.set <- function(t) which(times >= t)
# log partial likelihood function
log.parlik <- function(beta){
status <- as.vector(as.logical(status))
Xbeta <- as.vector(X%*%beta)
lpl1 <- sum(Xbeta[status])
temp <- vector( )
for(i in 1:n.obs) temp[i] <- log(sum(exp(Xbeta[rs[[i]]])))
lpl2 <- sum(temp)
return(-lpl1 + lpl2)
}
###########################################################################################################
# Example 1 : Ovarian Cancer Survival Data
# ?ovarian
###########################################################################################################
# Required variables
X <- as.matrix(cbind(ovarian$age, ovarian$ecog.ps))
status <- as.vector(ovarian$fustat)
times <- as.vector(ovarian$futime)
n.obs <- sum(status)
# Risk set
rs <- apply(as.matrix(times[as.logical(status)]), 1, risk.set)
# Optimisation step
OPT <- optim(c(0,0),log.parlik, control = list(maxit = 1000))
# Using the survival R package
fit <- coxph(Surv(futime, fustat) ~ age + ecog.ps, data=ovarian)
# Comparison
MAT <- cbind( fit$coefficients, OPT$par)
colnames(MAT) <- c("survival package", "MPLE")
kable(MAT, digits = 4)
survival package | MPLE | |
---|---|---|
age | 0.1615 | 0.1615 |
ecog.ps | 0.0187 | 0.0188 |
###########################################################################################################
# Example 2 : NCCTG Lung Cancer Data
# A missing observation is removed
# ?lung
###########################################################################################################
# Required variables
X <- as.matrix(cbind(lung$age, lung$sex, lung$ph.ecog))[-14,]
status <- as.vector(lung$status)[-14] - 1
times <- as.vector(lung$time)[-14]
n.obs <- sum(status)
# Risk set
rs <- apply(as.matrix(times[as.logical(status)]), 1, risk.set)
# Optimisation step
OPT <- optim(c(0,0,0),log.parlik, control = list(maxit = 1000))
# Using the survival R package
fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data=lung)
# Comparison
MAT <- cbind(fit$coefficients, OPT$par)
colnames(MAT) <- c("survival package", "MPLE")
kable(MAT, digits = 4)
survival package | MPLE | |
---|---|---|
age | 0.0111 | 0.0111 |
sex | -0.5526 | -0.5527 |
ph.ecog | 0.4637 | 0.4627 |
###########################################################################################################
# Example 3 : Veterans' Administration Lung Cancer study
# ?veteran
###########################################################################################################
# Required variables
X <- as.matrix(cbind(veteran$age, veteran$karno, veteran$trt))
status <- as.vector(veteran$status)
times <- as.vector(veteran$time)
n.obs <- sum(status)
# Risk set
rs <- apply(as.matrix(times[as.logical(status)]), 1, risk.set)
# Optimisation step
OPT <- optim(c(0,0,0),log.parlik, control = list(maxit = 1000))
# Using the survival R package
fit <- coxph(Surv(time, status) ~ age + karno + trt, data=veteran)
# Comparison
MAT <- cbind( fit$coefficients, OPT$par)
colnames(MAT) <- c("survival package", "MPLE")
kable(MAT, digits = 4)
survival package | MPLE | |
---|---|---|
age | -0.0039 | -0.0038 |
karno | -0.0344 | -0.0342 |
trt | 0.1895 | 0.1857 |
Cox, D.R. 1972. “Regression Models and Life-Tables.” Journal of the Royal Statistical Society: Series B (Methodological) 34 (2): 187–202.
Fan, J., and J. Jiang. 2009. “Non-and Semi-Parametric Modeling in Survival Analysis.” In New Developments in Biostatistics and Bioinformatics, 3–33. World Scientific.
Reid, Nancy. 1994. “A Conversation with Sir David Cox.” Statistical Science 9 (3): 439–55.
Rossell, D., and F.J. Rubio. 2019. “Additive Bayesian Variable Selection Under Censoring and Misspecification.” arXiv Preprint arXiv:1907.13563.
Rubio, F.J., L. Remontet, N.P. Jewell, and A. Belot. 2019. “On a General Structure for Hazard-Based Regression Models: An Application to Population-Based Cancer Research.” Statistical Methods in Medical Research 28: 2404–17.