Question 3

We import the required data set.

prostateCancerSub <- read.csv("prostateCancerSub.csv")
prostateCancerSub %>% head() %>% kable()
Time Status Trt Stage hx hg sz Age
72 alive 0.2 mg estrogen 3 0 13.79883 2 75
1 dead - other ca 0.2 mg estrogen 3 0 14.59961 42 54
40 dead - cerebrovascular 5.0 mg estrogen 3 1 13.39844 3 69
20 dead - cerebrovascular 0.2 mg estrogen 3 1 17.59766 4 75
65 alive placebo 3 0 13.39844 34 67
24 dead - prostatic ca 0.2 mg estrogen 3 0 15.09961 10 71

According to the question, we know there are 502 observations. The meanings of each variable are as follows:

Time means the time to death (months) subject to censoring;

Status means the status at end of study. One category is ‘alive’, all other categories correspond to dead, by cause of death;

Trt means the treatment assigned (0.2/1.0/5.0 mg estrogen, placebo);

hx means the history of cardiovascular disease (1=yes, 0=no);

Stage means the tumor stage;

hg means the serum hemoglobin (g/100ml);

sz means the size of primary tumor (\(\text{cm}^2\));

Age means the age in years.

Firstly, we fit the Cox model \[ \begin{aligned} \text{Model 0: } \ \lambda(t|\boldsymbol{x}_i)=&\lambda_0(t)\mbox{exp}\big\{\beta_1I(\text{Trt}_i=\text{'0.2mg estrogen'})+\beta_2I(\text{Trt}_i=\text{'1.0mg estrogen'}) \\ &+\beta_3I(\text{Trt}_i=\text{'5.0mg estrogen'})+\beta_4I(\text{Stage}_i=4)+\beta_5I(\text{hx}_i=1)\big\}. \end{aligned} \]

death <- numeric(length = nrow(prostateCancerSub))
for (i in 1:length(death)) {
  if (prostateCancerSub$Status[i] == "alive") {
    death[i] <- 0
  } else {
    death[i] <- 1
  }
}
prostateCancerSub$Trt <- factor(prostateCancerSub$Trt, levels = c("placebo", "0.2 mg estrogen", "1.0 mg estrogen", "5.0 mg estrogen"))
prostateCancerSub <- cbind(prostateCancerSub, death)
model0 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1), data = prostateCancerSub)

Question 3.1

For the Cox model above, we obtain the martingale residuals against serum hemoglobin (hg) as follows:

r_M0 <- residuals(model0)
ggplot(data.frame(x = prostateCancerSub$hg, y = r_M0), aes(x = x, y = y)) + 
  geom_point() + geom_smooth() + 
  labs(title = "Martingale Residuals", x = "serum hemoglobin", y = "martingale residual") + 
  theme(plot.title = element_text(hjust = 0.5, size = 13), axis.title = element_text(size = 14), 
        axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 12))

According to the martingale residual plot, we find that adding the variable hg as a linear term to start model is not appropriate. Therefore, we consider the following three functional forms of hg and construct the Cox models: \[ \begin{aligned} &\text{Model 1: Model 0 + Linear hg;} \\ &\text{Model 2: Model 0 + Linear hg + Quadratic hg;} \\ &\text{Model 3: Model 0 + Piecewise Linear hg with changepoint at 15.} \end{aligned} \]

model1 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1) + hg, data = prostateCancerSub)
model2 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1) + poly(hg, 2), data = prostateCancerSub)
fun_hg <- function(x) {cbind(x, (x - 15) * (x - 15 > 0))}
model3 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1) + fun_hg(hg), data = prostateCancerSub)

The AIC of each model is given by

result <- data.frame(Model = c("model 0", "model 1", "model 2", "model 3"), 
                     AIC = c(AIC(model0), AIC(model1), AIC(model2), AIC(model3)))
result %>% kable()
Model AIC
model 0 3995.847
model 1 3984.773
model 2 3980.339
model 3 3980.160

According to the output above, we find that model 3 has the smallest AIC. Therefore, we choose the model 3 to fit the variable hg.

Question 3.2

For the Cox model 3, we obtain the martingale residuals against tumor size (sz) as follows:

r_M0 <- residuals(model3)
ggplot(data.frame(x = prostateCancerSub$sz, y = r_M0), aes(x = x, y = y)) + 
  geom_point() + geom_smooth() + 
  labs(title = "Martingale Residuals", x = "tumor size", y = "martingale residual") + 
  theme(plot.title = element_text(hjust = 0.5, size = 13), axis.title = element_text(size = 14), 
        axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 12))

According to the martingale residual plot, we find that adding the variable sz as a linear term to start model is appropriate. Therefore, we add the variable sz as a linear term. \[ \text{Model 4: Model 3 + Linear sz.} \]

model4 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1) + fun_hg(hg) + sz, data = prostateCancerSub)

Question 3.3

For the Cox model 4, we obtain the martingale residuals against age (Age) as follows:

r_M0 <- residuals(model4)
ggplot(data.frame(x = prostateCancerSub$Age, y = r_M0), aes(x = x, y = y)) + 
  geom_point() + geom_smooth() + 
  labs(title = "Martingale Residuals", x = "age", y = "martingale residual") + 
  theme(plot.title = element_text(hjust = 0.5, size = 13), axis.title = element_text(size = 14), 
        axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 12))

According to the martingale residual plot, we find that adding the variable Age as a linear term to start model is not appropriate. Therefore, we consider the following three functional forms of Age and construct the Cox models: \[ \begin{aligned} &\text{Model 5: Model 4 + Exponential Age;} \\ &\text{Model 6: Model 4 + Linear Age + Quadratic Age;} \\ &\text{Model 7: Model 4 + Piecewise Linear hg with changepoint at 70 and no effect before 70.} \end{aligned} \]

fun_exp <- function(x) {exp(x)}
model5 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1) + fun_hg(hg) + sz + fun_exp(Age), data = prostateCancerSub)
model6 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1) + fun_hg(hg) + sz + poly(Age, 2), data = prostateCancerSub)
fun_age <- function(x) {pmax(x, 70)}
model7 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1) + fun_hg(hg) + sz + fun_age(Age), data = prostateCancerSub)

The AIC of each model is given by

result <- data.frame(Model = c("model 4", "model 5", "model 6", "model 7"), 
                     AIC = c(AIC(model4), AIC(model5), AIC(model6), AIC(model7)))
result %>% kable()
Model AIC
model 4 3964.242
model 5 3963.333
model 6 3957.152
model 7 3954.960

According to the output above, we find that model 7 has the smallest AIC. Therefore, we choose the model 7 to fit the variable Age.

In conclusion, we obtain the final model as follows: \[ \begin{aligned} \text{Model 7: } \ \lambda(t|\boldsymbol{x}_i)=&\lambda_0(t)\mbox{exp}\big\{\beta_1I(\text{Trt}_i=\text{'0.2mg estrogen'})+\beta_2I(\text{Trt}_i=\text{'1.0mg estrogen'}) \\ &+\beta_3I(\text{Trt}_i=\text{'5.0mg estrogen'})+\beta_4I(\text{Stage}_i=4)+\beta_5I(\text{hx}_i=1) \\ &+\beta_6\text{hg}_i+\beta_7(\text{hg}_i-15)_++\beta_8\text{sz}_i+\beta_9\mbox{max}(\text{Age}_i,70)\big\}. \end{aligned} \]

summary(model7)
## Call:
## coxph(formula = Surv(Time, death != 0) ~ Trt + I(Stage == 4) + 
##     I(hx == 1) + fun_hg(hg) + sz + fun_age(Age), data = prostateCancerSub)
## 
##   n= 502, number of events= 354 
## 
##                         coef exp(coef)  se(coef)      z Pr(>|z|)    
## Trt0.2 mg estrogen  0.016839  1.016982  0.146116  0.115  0.90825    
## Trt1.0 mg estrogen -0.388490  0.678080  0.157910 -2.460  0.01389 *  
## Trt5.0 mg estrogen -0.004901  0.995111  0.147224 -0.033  0.97344    
## I(Stage == 4)TRUE   0.221958  1.248519  0.112303  1.976  0.04811 *  
## I(hx == 1)TRUE      0.428227  1.534534  0.109874  3.897 9.72e-05 ***
## fun_hg(hg)x        -0.138289  0.870847  0.034795 -3.974 7.06e-05 ***
## fun_hg(hg)          0.338444  1.402763  0.131669  2.570  0.01016 *  
## sz                  0.019005  1.019187  0.004271  4.450 8.59e-06 ***
## fun_age(Age)        0.051999  1.053374  0.015071  3.450  0.00056 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                    exp(coef) exp(-coef) lower .95 upper .95
## Trt0.2 mg estrogen    1.0170     0.9833    0.7637    1.3542
## Trt1.0 mg estrogen    0.6781     1.4748    0.4976    0.9240
## Trt5.0 mg estrogen    0.9951     1.0049    0.7457    1.3280
## I(Stage == 4)TRUE     1.2485     0.8009    1.0019    1.5559
## I(hx == 1)TRUE        1.5345     0.6517    1.2372    1.9033
## fun_hg(hg)x           0.8708     1.1483    0.8134    0.9323
## fun_hg(hg)            1.4028     0.7129    1.0837    1.8158
## sz                    1.0192     0.9812    1.0107    1.0278
## fun_age(Age)          1.0534     0.9493    1.0227    1.0850
## 
## Concordance= 0.64  (se = 0.016 )
## Likelihood ratio test= 84.55  on 9 df,   p=2e-14
## Wald test            = 84.93  on 9 df,   p=2e-14
## Score (logrank) test = 87.37  on 9 df,   p=5e-15

The estimate \(\hat{\beta}_6=-0.138289\) and \(\hat{\beta}_7=0.338444\), which indicates that

  1. When serum hemoglobin is below 15, the hazard ratio of death for an individual relative to one with other covariates being equal but the serum hemoglobin (hg) being one unit less is \(\mbox{exp}(\hat{\beta}_6)=0.870847\), which means that the greater serum hemoglobin below 15, the lower hazard of death;

  2. When serum hemoglobin is beyond 15, the hazard ratio of death for an individual relative to one with other covariates being equal but the serum hemoglobin (hg) being one unit less is \(\mbox{exp}(\hat{\beta}_6+\hat{\beta}_7)=1.221592\), which means that the greater serum hemoglobin beyond 15, the larger hazard of death.

The estimate \(\hat{\beta}_8=0.019005\), which indicates that the hazard ratio of death for an individual relative to one with other covariates being equal but the tumor size (sz) being one unit less is \(\mbox{exp}(\hat{\beta}_8)=1.019187\), which means that the greater tumor size, the larger hazard of death.

The estimate \(\hat{\beta}_9=0.051999\), which indicates that

  1. The hazard of death is not affected by increase of age under 70;

  2. the hazard ratio of death for an individual relative to one with other covariates being equal but the age (Age) being one unit less is \(\mbox{exp}(\hat{\beta}_9)=1.053374\), which means that the greater age, the larger hazard of death.