This R markdown file is meant to be an illusrative example of applying Cox-proportional hazards regression to RCT data. We will use the “ovarian” dataset which comes pre-loaded as part of the survival package in R. This is an extremely simple dataset so note that there are many other methods for analyzing data on survival or generally data from a study where we follow participants foward in time. Before we begin let’s load some packages.
We can find out more information about the ovarian dataset by using the help function in R. Some important things to know are:
head(ovarian)
## futime fustat age resid.ds rx ecog.ps
## 1 59 1 72.3315 2 1 1
## 2 115 1 74.4932 2 1 1
## 3 156 1 66.4658 2 1 2
## 4 421 0 53.3644 2 2 1
## 5 431 1 50.3397 2 1 1
## 6 448 0 56.4301 1 1 2
First let’s create a useable outcome variable called death. This variable will equal 1 if the patient was not censored and 0 if they were.
ovarian$death<-if_else(ovarian$fustat==1,0,1)
ovarian$death<-factor(ovarian$death,levels = c(1,0),ordered = TRUE,labels = c("Died","Censored"))#Convert it to a reable factor
table(ovarian$death)
##
## Died Censored
## 14 12
Let’s calculate the crude RR and 95% CI for dying comparing treatment group 2 to treatment group 1 as the reference.
epi.2by2(table(ovarian$rx,ovarian$death))
## Outcome + Outcome - Total Inc risk * Odds
## Exposed + 6 7 13 46.2 0.857
## Exposed - 8 5 13 61.5 1.600
## Total 14 12 26 53.8 1.167
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio 0.75 (0.36, 1.55)
## Odds ratio 0.54 (0.11, 2.55)
## Attrib risk * -15.38 (-53.25, 22.48)
## Attrib risk in population * -7.69 (-40.35, 24.97)
## Attrib fraction in exposed (%) -33.33 (-176.02, 35.59)
## Attrib fraction in population (%) -14.29 (-56.86, 16.73)
## -------------------------------------------------------------------
## Test that OR = 1: chi2(1) = 0.619 Pr>chi2 = 0.43
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 population units
It appears that those patients in treatment group 2 had a lower cumulative incidence of death when compared to treatment group 1. There are problems with this:
Generally the cumulative incidence does not take into account the shrinking denominator. As more patients die the denominator gets smaller. So this approach underestimates the risk in each group.
Patients are lossed to follow-up in this study. We need some way to account for this because the time period that we might ascribe to the risk in each group might be different.
We need some way to account for time. One way this is typically done is by constructing a Kaplan-Meier curve. These curves estimate the probability of survival conditional on survival up to time \(t\). We can easily construct a K-M curve using R.
ggsurvplot(survfit(Surv(futime,fustat)~rx,data=ovarian))
Another approach is using Cox-regression which can be thought of as a form of logistic regression where the time at which the event occurs is also taken into account.
The Cox proportional hazards model can be written as: \[h(t)=h_{0}(t)e^{\beta_{k}x_{k}}\]
Where \(h_{0}(t)\) is the baseline hazard at time t. In other words, the risk of death for an individual that survived until time \(t\) and has a value of 0 for all covariates \(x_{k}\).
An advantage of this method is that we can condition on multiple covariates. Let’s fit this model to our ovarian dataset using treatment group and age as independent variables.
ovarian$rx<-factor(ovarian$rx,levels =c(1,2),labels=c("Group 1","Group 2"))#Creating a readable factor out of treatmetn variable
ovarian$rx<-relevel(ovarian$rx,ref = "Group 1")#Defining treatment group 2 as reference
#The coxph function needs a numeric variable so we'll create another death variable
ovarian$death_num<-if_else(ovarian$fustat==1,0,1)
coxfit<-coxph(Surv(ovarian$futime,event=ovarian$death_num)~rx+age,data=ovarian)
summary(coxfit)
## Call:
## coxph(formula = Surv(ovarian$futime, event = ovarian$death_num) ~
## rx + age, data = ovarian)
##
## n= 26, number of events= 14
##
## coef exp(coef) se(coef) z Pr(>|z|)
## rxGroup 2 -0.86227 0.42220 0.73753 -1.169 0.242
## age 0.04580 1.04686 0.04454 1.028 0.304
##
## exp(coef) exp(-coef) lower .95 upper .95
## rxGroup 2 0.4222 2.3685 0.09948 1.792
## age 1.0469 0.9552 0.95934 1.142
##
## Concordance= 0.542 (se = 0.106 )
## Likelihood ratio test= 1.56 on 2 df, p=0.5
## Wald test = 1.58 on 2 df, p=0.5
## Score (logrank) test = 1.6 on 2 df, p=0.4
We can re-write this model as: \[ln(h(t)/h_{0}(t))=\beta_{1}treatment+\beta_{2}age\] As such, the interpretation of \(e^{\beta_{1}}\) is the ratio of the hazards (which is the instantaneous risk at time \(t\)) comparing those in treatment group 2 to treatment group 1.
Based on the above results we could say that “Those in treatment group 2 had 0.42 (\(e^{-0.86227}\)) times the risk of death when compared to those in treatment group 2 conditional on survival up until time \(t\)”.
An interpretation of \(e^{\beta_{2}}\) would be: “All else being equal, for each one year increase in age at baseline, the hazard of death at time \(t\) increases by 5% conditional on survival up to time \(t\)”.
In order for our interpretation of the harzard ratios to be valid, two key assumptions need to hold:
Since \(\beta_{k}\) is fixed for all time points we assume that the hazards are “proportional” over the entire study period. This is almost never the case for actual data.
We assume that loss to follow-up is random. In other words, there is no selection bias as a result of censoring prior to the end of the study.
We can check the proportion hazards assumption by seeing if the beta estimates change with time.
residuals<-cox.zph(coxfit)
resplot<-ggcoxzph(residuals)
resplot