We import the required data set.
kidneyTrans <- read.csv("kidneyTrans.csv")
kidneyTrans %>% head() %>% kable()
| obs | time | death | gender | race | age |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 | 46 |
| 2 | 5 | 0 | 1 | 1 | 51 |
| 3 | 7 | 1 | 1 | 1 | 55 |
| 4 | 9 | 0 | 1 | 1 | 57 |
| 5 | 13 | 0 | 1 | 1 | 45 |
| 6 | 13 | 0 | 1 | 1 | 43 |
According to the question, we know there are 863 observations. The meanings of each variable are as follows:
time means the survival times after transplant measured
in days;
death means the state of patient, where 0 represents
censoring and 1 represents observation;
gender means the gender of patient, where 1 represents
male and 2 represents female;
race means the race of patient, where 1 represents white
and 2 represents black.
We fit two Cox proportional models as follows:
Model I: \(\lambda(t)=\lambda_0(t)\mbox{exp}\big\{\beta_1I(\text{race = "B"})\big\}\);
Model II: \(\lambda(t)=\lambda_0(t)\mbox{exp}\big\{\beta_1I(\text{race = "B"})+\beta_2I(\text{gender = "M"})+\beta_3I(\text{race = "B" & gender = "M"})\big\}\).
The corresponding R code and output are as follows:
for (i in 1:863) {
if (kidneyTrans$gender[i] == 1) {
kidneyTrans$gender[i] <- 2
} else {
kidneyTrans$gender[i] <- 1
}
}
model1 <- coxph(Surv(time, death)~factor(race), data = kidneyTrans)
summary(model1)
## Call:
## coxph(formula = Surv(time, death) ~ factor(race), data = kidneyTrans)
##
## n= 863, number of events= 140
##
## coef exp(coef) se(coef) z Pr(>|z|)
## factor(race)2 0.2224 1.2491 0.2114 1.052 0.293
##
## exp(coef) exp(-coef) lower .95 upper .95
## factor(race)2 1.249 0.8006 0.8253 1.89
##
## Concordance= 0.508 (se = 0.017 )
## Likelihood ratio test= 1.06 on 1 df, p=0.3
## Wald test = 1.11 on 1 df, p=0.3
## Score (logrank) test = 1.11 on 1 df, p=0.3
model2 <- coxph(Surv(time, death)~factor(race)*factor(gender), data = kidneyTrans)
summary(model2)
## Call:
## coxph(formula = Surv(time, death) ~ factor(race) * factor(gender),
## data = kidneyTrans)
##
## n= 863, number of events= 140
##
## coef exp(coef) se(coef) z Pr(>|z|)
## factor(race)2 0.6567 1.9284 0.3120 2.105 0.0353 *
## factor(gender)2 0.2485 1.2821 0.1985 1.252 0.2107
## factor(race)2:factor(gender)2 -0.7455 0.4745 0.4271 -1.745 0.0809 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## factor(race)2 1.9284 0.5186 1.0463 3.554
## factor(gender)2 1.2821 0.7800 0.8688 1.892
## factor(race)2:factor(gender)2 0.4745 2.1075 0.2054 1.096
##
## Concordance= 0.54 (se = 0.024 )
## Likelihood ratio test= 4.37 on 3 df, p=0.2
## Wald test = 4.64 on 3 df, p=0.2
## Score (logrank) test = 4.74 on 3 df, p=0.2
According to the output above, we find that \(\beta_1\) is not significant in Model I but significant in Model II, which means that race is a variable which significant affect the survival after considering the gender of patients.
Denote \(WF\): white female patients and \(BF\): black female patients. In model II, we have \[\lambda_{WF}(t)=\lambda_0(t) \ \text{ and } \ \lambda_{BF}(t)=\lambda_0(t)\mbox{exp}(\beta_1).\] Therefore, we obtain \[\beta_1=\mbox{log}\Big\{\frac{\lambda_{BF}(t)}{\lambda_{WF}(t)}\Big\}.\] That is, \(\beta_1\) can be expressed as the logarithmic of the ratio of hazard functions of black female patients and white female patients. And the estimate of \(\beta_1\) is \(0.6567>0\), which means black female patients have a worse survival than white female patients.
We estimate the hazard function via Breslow estimate: \[\hat{\lambda}_0(t_{(j)})=\frac{d_j}{\sum_{l\in\mathcal{R}(t_{(j)})}\text{exp}\big\{\beta_1I(\text{race = "B"})+\beta_2I(\text{gender = "M"})+\beta_3I(\text{race = "B" & gender = "M"})\big\}}.\] And the estimate of survival function: \[ \begin{aligned} \hat{S}(t|\boldsymbol{x})&=\big[S_0(t|\boldsymbol{x})\big]^{\text{exp}\big\{\beta_1I(\text{race = "B"})+\beta_2I(\text{gender = "M"})+\beta_3I(\text{race = "B" & gender = "M"})\big\}} \\ &=\mbox{e}^{-\hat{\Lambda}_0(t|\boldsymbol{x})\text{exp}\big\{\beta_1I(\text{race = "B"})+\beta_2I(\text{gender = "M"})+\beta_3I(\text{race = "B" & gender = "M"})\big\}}. \end{aligned} \] The corresponding R code and output are as follows:
model2 <- coxph(Surv(time, death != 0)~factor(race)*factor(gender), data = kidneyTrans)
cumhaz0 <- basehaz(model2, centered = FALSE)
survprob <- data.frame(time = cumhaz0$time, surv_wf = exp(-cumhaz0$hazard),
surv_bf = exp(-exp(model2$coefficients[1]) * cumhaz0$hazard),
surv_wm = exp(-exp(model2$coefficients[2]) * cumhaz0$hazard),
surv_bm = exp(-exp(model2$coefficients[1] + model2$coefficients[2] +
model2$coefficients[3]) * cumhaz0$hazard))
ggplot(survprob, aes(time)) +
geom_step(aes(y = surv_wf, colour = "White Female")) +
geom_step(aes(y = surv_bf, colour = "Black Female")) +
geom_step(aes(y = surv_wm, colour = "White Male")) +
geom_step(aes(y = surv_bm, colour = "Black Male")) +
labs(title = "Cox PH model esimate for Model II", x = "Time (days)", y = "Estimated survival function") +
theme_minimal() +
theme(panel.grid =element_blank()) +
theme(axis.line = element_line(size=0.5, colour = "black")) +
theme(plot.title = element_text(hjust = 0.5, size = 13), axis.title = element_text(size = 14),
axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 12))
We use the penalized splines to check whether age should be added in the Cox model proportional hazards model as a linear term. The corresponding R code and output are as follows:
model <- coxph(Surv(time, death)~factor(race)*factor(gender)+pspline(age), data = kidneyTrans)
summary(model)
## Call:
## coxph(formula = Surv(time, death) ~ factor(race) * factor(gender) +
## pspline(age), data = kidneyTrans)
##
## n= 863, number of events= 140
##
## coef se(coef) se2 Chisq DF p
## factor(race)2 0.44045 0.315085 0.314786 1.95 1.00 1.6e-01
## factor(gender)2 0.09948 0.199479 0.199275 0.25 1.00 6.2e-01
## pspline(age), linear 0.05027 0.007538 0.007538 44.48 1.00 2.6e-11
## pspline(age), nonlin 3.75 3.03 2.9e-01
## factor(race)2:factor(gend -0.58188 0.427803 0.427542 1.85 1.00 1.7e-01
##
## exp(coef) exp(-coef) lower .95 upper .95
## factor(race)2 1.5534 0.643745 0.83769 2.881e+00
## factor(gender)2 1.1046 0.905310 0.74715 1.633e+00
## ps(age)3 1.9824 0.504434 0.15770 2.492e+01
## ps(age)4 3.9206 0.255062 0.04850 3.169e+02
## ps(age)5 7.5000 0.133334 0.02818 1.996e+03
## ps(age)6 12.4137 0.080556 0.02694 5.721e+03
## ps(age)7 17.1172 0.058421 0.03420 8.568e+03
## ps(age)8 32.9979 0.030305 0.07077 1.539e+04
## ps(age)9 60.5622 0.016512 0.13372 2.743e+04
## ps(age)10 62.7202 0.015944 0.13778 2.855e+04
## ps(age)11 77.1365 0.012964 0.16853 3.531e+04
## ps(age)12 151.1207 0.006617 0.32603 7.005e+04
## ps(age)13 276.8249 0.003612 0.51795 1.480e+05
## ps(age)14 488.2104 0.002048 0.53594 4.447e+05
## factor(race)2:factor(gend 0.5588 1.789400 0.24163 1.293e+00
##
## Iterations: 3 outer, 9 Newton-Raphson
## Theta= 0.6480586
## Degrees of freedom for terms= 1 1 4 1
## Concordance= 0.676 (se = 0.022 )
## Likelihood ratio test= 64.05 on 7.02 df, p=2e-11
According to the output above, we find the \(p\)-value of non-linear term of age is \(0.29>0.05\). Therefore, a linear term of age is enough, we can add age as a linear term in Model II.
We conduct the variable selection to the Model II via the stepwise selection procedure and come up with the final model as follows:
model_linear <- coxph(Surv(time, death)~factor(race)*factor(gender)+age, data = kidneyTrans)
final_model <- step(model_linear, scope = list(upper = ~factor(race)*factor(gender)+age, lower = ~1))
## Start: AIC=1707.4
## Surv(time, death) ~ factor(race) * factor(gender) + age
##
## Df AIC
## - factor(race):factor(gender) 1 1707.3
## <none> 1707.4
## - age 1 1760.0
##
## Step: AIC=1707.28
## Surv(time, death) ~ factor(race) + factor(gender) + age
##
## Df AIC
## - factor(gender) 1 1705.3
## - factor(race) 1 1705.6
## <none> 1707.3
## + factor(race):factor(gender) 1 1707.4
## - age 1 1761.0
##
## Step: AIC=1705.3
## Surv(time, death) ~ factor(race) + age
##
## Df AIC
## - factor(race) 1 1703.6
## <none> 1705.3
## + factor(gender) 1 1707.3
## - age 1 1759.3
##
## Step: AIC=1703.6
## Surv(time, death) ~ age
##
## Df AIC
## <none> 1703.6
## + factor(race) 1 1705.3
## + factor(gender) 1 1705.6
## - age 1 1758.3
summary(final_model)
## Call:
## coxph(formula = Surv(time, death) ~ age, data = kidneyTrans)
##
## n= 863, number of events= 140
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.051068 1.052394 0.007136 7.156 8.3e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.052 0.9502 1.038 1.067
##
## Concordance= 0.671 (se = 0.022 )
## Likelihood ratio test= 56.74 on 1 df, p=5e-14
## Wald test = 51.21 on 1 df, p=8e-13
## Score (logrank) test = 53.44 on 1 df, p=3e-13
According to the output above, we find that the final model only have the linear term of age. That is, the final model is given by \[\lambda(t)=\lambda_0(t)\mbox{exp}(\beta\times\text{age}).\] We find that the logarithmic of hazard ratio of death after kidney transplant increases \(0.051068\) when the age at transplant is increased by 1 year.