Question 5

We import the required data set.

kidneyTrans <- read.csv("kidneyTrans.csv")
kidneyTrans %>% head() %>% kable()
obs time death gender race age
1 1 0 1 1 46
2 5 0 1 1 51
3 7 1 1 1 55
4 9 0 1 1 57
5 13 0 1 1 45
6 13 0 1 1 43

According to the question, we know there are 863 observations. The meanings of each variable are as follows:

time means the survival times after transplant measured in days;

death means the state of patient, where 0 represents censoring and 1 represents observation;

gender means the gender of patient, where 1 represents male and 2 represents female;

race means the race of patient, where 1 represents white and 2 represents black.

Question 5.1

We fit two Cox proportional models as follows:

Model I: \(\lambda(t)=\lambda_0(t)\mbox{exp}\big\{\beta_1I(\text{race = "B"})\big\}\);

Model II: \(\lambda(t)=\lambda_0(t)\mbox{exp}\big\{\beta_1I(\text{race = "B"})+\beta_2I(\text{gender = "M"})+\beta_3I(\text{race = "B" & gender = "M"})\big\}\).

The corresponding R code and output are as follows:

for (i in 1:863) {
  if (kidneyTrans$gender[i] == 1) {
    kidneyTrans$gender[i] <- 2
  } else {
    kidneyTrans$gender[i] <- 1
  }
}
model1 <- coxph(Surv(time, death)~factor(race), data = kidneyTrans)
summary(model1)
## Call:
## coxph(formula = Surv(time, death) ~ factor(race), data = kidneyTrans)
## 
##   n= 863, number of events= 140 
## 
##                 coef exp(coef) se(coef)     z Pr(>|z|)
## factor(race)2 0.2224    1.2491   0.2114 1.052    0.293
## 
##               exp(coef) exp(-coef) lower .95 upper .95
## factor(race)2     1.249     0.8006    0.8253      1.89
## 
## Concordance= 0.508  (se = 0.017 )
## Likelihood ratio test= 1.06  on 1 df,   p=0.3
## Wald test            = 1.11  on 1 df,   p=0.3
## Score (logrank) test = 1.11  on 1 df,   p=0.3
model2 <- coxph(Surv(time, death)~factor(race)*factor(gender), data = kidneyTrans)
summary(model2)
## Call:
## coxph(formula = Surv(time, death) ~ factor(race) * factor(gender), 
##     data = kidneyTrans)
## 
##   n= 863, number of events= 140 
## 
##                                  coef exp(coef) se(coef)      z Pr(>|z|)  
## factor(race)2                  0.6567    1.9284   0.3120  2.105   0.0353 *
## factor(gender)2                0.2485    1.2821   0.1985  1.252   0.2107  
## factor(race)2:factor(gender)2 -0.7455    0.4745   0.4271 -1.745   0.0809 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                               exp(coef) exp(-coef) lower .95 upper .95
## factor(race)2                    1.9284     0.5186    1.0463     3.554
## factor(gender)2                  1.2821     0.7800    0.8688     1.892
## factor(race)2:factor(gender)2    0.4745     2.1075    0.2054     1.096
## 
## Concordance= 0.54  (se = 0.024 )
## Likelihood ratio test= 4.37  on 3 df,   p=0.2
## Wald test            = 4.64  on 3 df,   p=0.2
## Score (logrank) test = 4.74  on 3 df,   p=0.2

According to the output above, we find that \(\beta_1\) is not significant in Model I but significant in Model II, which means that race is a variable which significant affect the survival after considering the gender of patients.

Denote \(WF\): white female patients and \(BF\): black female patients. In model II, we have \[\lambda_{WF}(t)=\lambda_0(t) \ \text{ and } \ \lambda_{BF}(t)=\lambda_0(t)\mbox{exp}(\beta_1).\] Therefore, we obtain \[\beta_1=\mbox{log}\Big\{\frac{\lambda_{BF}(t)}{\lambda_{WF}(t)}\Big\}.\] That is, \(\beta_1\) can be expressed as the logarithmic of the ratio of hazard functions of black female patients and white female patients. And the estimate of \(\beta_1\) is \(0.6567>0\), which means black female patients have a worse survival than white female patients.

Question 5.2

We estimate the hazard function via Breslow estimate: \[\hat{\lambda}_0(t_{(j)})=\frac{d_j}{\sum_{l\in\mathcal{R}(t_{(j)})}\text{exp}\big\{\beta_1I(\text{race = "B"})+\beta_2I(\text{gender = "M"})+\beta_3I(\text{race = "B" & gender = "M"})\big\}}.\] And the estimate of survival function: \[ \begin{aligned} \hat{S}(t|\boldsymbol{x})&=\big[S_0(t|\boldsymbol{x})\big]^{\text{exp}\big\{\beta_1I(\text{race = "B"})+\beta_2I(\text{gender = "M"})+\beta_3I(\text{race = "B" & gender = "M"})\big\}} \\ &=\mbox{e}^{-\hat{\Lambda}_0(t|\boldsymbol{x})\text{exp}\big\{\beta_1I(\text{race = "B"})+\beta_2I(\text{gender = "M"})+\beta_3I(\text{race = "B" & gender = "M"})\big\}}. \end{aligned} \] The corresponding R code and output are as follows:

model2 <- coxph(Surv(time, death != 0)~factor(race)*factor(gender), data = kidneyTrans)
cumhaz0 <- basehaz(model2, centered = FALSE)
survprob <- data.frame(time = cumhaz0$time, surv_wf = exp(-cumhaz0$hazard), 
                       surv_bf = exp(-exp(model2$coefficients[1]) * cumhaz0$hazard), 
                       surv_wm = exp(-exp(model2$coefficients[2]) * cumhaz0$hazard), 
                       surv_bm = exp(-exp(model2$coefficients[1] + model2$coefficients[2] +
                                            model2$coefficients[3]) * cumhaz0$hazard))
ggplot(survprob, aes(time)) +
  geom_step(aes(y = surv_wf, colour = "White Female")) + 
  geom_step(aes(y = surv_bf, colour = "Black Female")) +
  geom_step(aes(y = surv_wm, colour = "White Male")) +
  geom_step(aes(y = surv_bm, colour = "Black Male")) +
  labs(title = "Cox PH model esimate for Model II", x = "Time (days)", y = "Estimated survival function") + 
  theme_minimal() + 
  theme(panel.grid =element_blank()) + 
  theme(axis.line = element_line(size=0.5, colour = "black")) + 
  theme(plot.title = element_text(hjust = 0.5, size = 13), axis.title = element_text(size = 14), 
        axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 12))

Question 5.3

We use the penalized splines to check whether age should be added in the Cox model proportional hazards model as a linear term. The corresponding R code and output are as follows:

model <- coxph(Surv(time, death)~factor(race)*factor(gender)+pspline(age), data = kidneyTrans)
summary(model)
## Call:
## coxph(formula = Surv(time, death) ~ factor(race) * factor(gender) + 
##     pspline(age), data = kidneyTrans)
## 
##   n= 863, number of events= 140 
## 
##                           coef     se(coef) se2      Chisq DF   p      
## factor(race)2              0.44045 0.315085 0.314786  1.95 1.00 1.6e-01
## factor(gender)2            0.09948 0.199479 0.199275  0.25 1.00 6.2e-01
## pspline(age), linear       0.05027 0.007538 0.007538 44.48 1.00 2.6e-11
## pspline(age), nonlin                                  3.75 3.03 2.9e-01
## factor(race)2:factor(gend -0.58188 0.427803 0.427542  1.85 1.00 1.7e-01
## 
##                           exp(coef) exp(-coef) lower .95 upper .95
## factor(race)2                1.5534   0.643745   0.83769 2.881e+00
## factor(gender)2              1.1046   0.905310   0.74715 1.633e+00
## ps(age)3                     1.9824   0.504434   0.15770 2.492e+01
## ps(age)4                     3.9206   0.255062   0.04850 3.169e+02
## ps(age)5                     7.5000   0.133334   0.02818 1.996e+03
## ps(age)6                    12.4137   0.080556   0.02694 5.721e+03
## ps(age)7                    17.1172   0.058421   0.03420 8.568e+03
## ps(age)8                    32.9979   0.030305   0.07077 1.539e+04
## ps(age)9                    60.5622   0.016512   0.13372 2.743e+04
## ps(age)10                   62.7202   0.015944   0.13778 2.855e+04
## ps(age)11                   77.1365   0.012964   0.16853 3.531e+04
## ps(age)12                  151.1207   0.006617   0.32603 7.005e+04
## ps(age)13                  276.8249   0.003612   0.51795 1.480e+05
## ps(age)14                  488.2104   0.002048   0.53594 4.447e+05
## factor(race)2:factor(gend    0.5588   1.789400   0.24163 1.293e+00
## 
## Iterations: 3 outer, 9 Newton-Raphson
##      Theta= 0.6480586 
## Degrees of freedom for terms= 1 1 4 1 
## Concordance= 0.676  (se = 0.022 )
## Likelihood ratio test= 64.05  on 7.02 df,   p=2e-11

According to the output above, we find the \(p\)-value of non-linear term of age is \(0.29>0.05\). Therefore, a linear term of age is enough, we can add age as a linear term in Model II.

Question 5.4

We conduct the variable selection to the Model II via the stepwise selection procedure and come up with the final model as follows:

model_linear <- coxph(Surv(time, death)~factor(race)*factor(gender)+age, data = kidneyTrans)
final_model <- step(model_linear, scope = list(upper = ~factor(race)*factor(gender)+age, lower = ~1))
## Start:  AIC=1707.4
## Surv(time, death) ~ factor(race) * factor(gender) + age
## 
##                               Df    AIC
## - factor(race):factor(gender)  1 1707.3
## <none>                           1707.4
## - age                          1 1760.0
## 
## Step:  AIC=1707.28
## Surv(time, death) ~ factor(race) + factor(gender) + age
## 
##                               Df    AIC
## - factor(gender)               1 1705.3
## - factor(race)                 1 1705.6
## <none>                           1707.3
## + factor(race):factor(gender)  1 1707.4
## - age                          1 1761.0
## 
## Step:  AIC=1705.3
## Surv(time, death) ~ factor(race) + age
## 
##                  Df    AIC
## - factor(race)    1 1703.6
## <none>              1705.3
## + factor(gender)  1 1707.3
## - age             1 1759.3
## 
## Step:  AIC=1703.6
## Surv(time, death) ~ age
## 
##                  Df    AIC
## <none>              1703.6
## + factor(race)    1 1705.3
## + factor(gender)  1 1705.6
## - age             1 1758.3
summary(final_model)
## Call:
## coxph(formula = Surv(time, death) ~ age, data = kidneyTrans)
## 
##   n= 863, number of events= 140 
## 
##         coef exp(coef) se(coef)     z Pr(>|z|)    
## age 0.051068  1.052394 0.007136 7.156  8.3e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## age     1.052     0.9502     1.038     1.067
## 
## Concordance= 0.671  (se = 0.022 )
## Likelihood ratio test= 56.74  on 1 df,   p=5e-14
## Wald test            = 51.21  on 1 df,   p=8e-13
## Score (logrank) test = 53.44  on 1 df,   p=3e-13

According to the output above, we find that the final model only have the linear term of age. That is, the final model is given by \[\lambda(t)=\lambda_0(t)\mbox{exp}(\beta\times\text{age}).\] We find that the logarithmic of hazard ratio of death after kidney transplant increases \(0.051068\) when the age at transplant is increased by 1 year.