We import the required data set.
prostateCancerSub <- read.csv("prostateCancerSub.csv")
prostateCancerSub %>% head() %>% kable()
| Time | Status | Trt | Stage | hx | hg | sz | Age |
|---|---|---|---|---|---|---|---|
| 72 | alive | 0.2 mg estrogen | 3 | 0 | 13.79883 | 2 | 75 |
| 1 | dead - other ca | 0.2 mg estrogen | 3 | 0 | 14.59961 | 42 | 54 |
| 40 | dead - cerebrovascular | 5.0 mg estrogen | 3 | 1 | 13.39844 | 3 | 69 |
| 20 | dead - cerebrovascular | 0.2 mg estrogen | 3 | 1 | 17.59766 | 4 | 75 |
| 65 | alive | placebo | 3 | 0 | 13.39844 | 34 | 67 |
| 24 | dead - prostatic ca | 0.2 mg estrogen | 3 | 0 | 15.09961 | 10 | 71 |
According to the question, we know there are 502 observations. The meanings of each variable are as follows:
Time means the time to death (months) subject to
censoring;
Status means the status at end of study. One category is
‘alive’, all other categories correspond to dead, by cause of death;
Trt means the treatment assigned (0.2/1.0/5.0 mg
estrogen, placebo);
hx means the history of cardiovascular disease (1=yes,
0=no);
Stage means the tumor stage;
hg means the serum hemoglobin (g/100ml);
sz means the size of primary tumor (\(\text{cm}^2\));
Age means the age in years.
Firstly, we fit the Cox model \[ \begin{aligned} \text{Model 0: } \ \lambda(t|\boldsymbol{x}_i)=&\lambda_0(t)\mbox{exp}\big\{\beta_1I(\text{Trt}_i=\text{'0.2mg estrogen'})+\beta_2I(\text{Trt}_i=\text{'1.0mg estrogen'}) \\ &+\beta_3I(\text{Trt}_i=\text{'5.0mg estrogen'})+\beta_4I(\text{Stage}_i=4)+\beta_5I(\text{hx}_i=1)\big\}. \end{aligned} \]
death <- numeric(length = nrow(prostateCancerSub))
for (i in 1:length(death)) {
if (prostateCancerSub$Status[i] == "alive") {
death[i] <- 0
} else {
death[i] <- 1
}
}
prostateCancerSub$Trt <- factor(prostateCancerSub$Trt, levels = c("placebo", "0.2 mg estrogen", "1.0 mg estrogen", "5.0 mg estrogen"))
prostateCancerSub <- cbind(prostateCancerSub, death)
model0 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1), data = prostateCancerSub)
For the Cox model above, we obtain the martingale residuals against
serum hemoglobin (hg) as follows:
r_M0 <- residuals(model0)
ggplot(data.frame(x = prostateCancerSub$hg, y = r_M0), aes(x = x, y = y)) +
geom_point() + geom_smooth() +
labs(title = "Martingale Residuals", x = "serum hemoglobin", y = "martingale residual") +
theme(plot.title = element_text(hjust = 0.5, size = 13), axis.title = element_text(size = 14),
axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 12))
According to the martingale residual plot, we find that adding the
variable hg as a linear term to start model is not
appropriate. Therefore, we consider the following three functional forms
of hg and construct the Cox models: \[
\begin{aligned}
&\text{Model 1: Model 0 + Linear hg;} \\
&\text{Model 2: Model 0 + Linear hg + Quadratic hg;} \\
&\text{Model 3: Model 0 + Piecewise Linear hg with changepoint at
15.}
\end{aligned}
\]
model1 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1) + hg, data = prostateCancerSub)
model2 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1) + poly(hg, 2), data = prostateCancerSub)
fun_hg <- function(x) {cbind(x, (x - 15) * (x - 15 > 0))}
model3 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1) + fun_hg(hg), data = prostateCancerSub)
The AIC of each model is given by
result <- data.frame(Model = c("model 0", "model 1", "model 2", "model 3"),
AIC = c(AIC(model0), AIC(model1), AIC(model2), AIC(model3)))
result %>% kable()
| Model | AIC |
|---|---|
| model 0 | 3995.847 |
| model 1 | 3984.773 |
| model 2 | 3980.339 |
| model 3 | 3980.160 |
According to the output above, we find that model 3 has the smallest
AIC. Therefore, we choose the model 3 to fit the variable
hg.
For the Cox model 3, we obtain the martingale residuals against tumor
size (sz) as follows:
r_M0 <- residuals(model3)
ggplot(data.frame(x = prostateCancerSub$sz, y = r_M0), aes(x = x, y = y)) +
geom_point() + geom_smooth() +
labs(title = "Martingale Residuals", x = "tumor size", y = "martingale residual") +
theme(plot.title = element_text(hjust = 0.5, size = 13), axis.title = element_text(size = 14),
axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 12))
According to the martingale residual plot, we find that adding the
variable sz as a linear term to start model is appropriate.
Therefore, we add the variable sz as a linear term. \[
\text{Model 4: Model 3 + Linear sz.}
\]
model4 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1) + fun_hg(hg) + sz, data = prostateCancerSub)
For the Cox model 4, we obtain the martingale residuals against age
(Age) as follows:
r_M0 <- residuals(model4)
ggplot(data.frame(x = prostateCancerSub$Age, y = r_M0), aes(x = x, y = y)) +
geom_point() + geom_smooth() +
labs(title = "Martingale Residuals", x = "age", y = "martingale residual") +
theme(plot.title = element_text(hjust = 0.5, size = 13), axis.title = element_text(size = 14),
axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 12))
According to the martingale residual plot, we find that adding the
variable Age as a linear term to start model is not
appropriate. Therefore, we consider the following three functional forms
of Age and construct the Cox models: \[
\begin{aligned}
&\text{Model 5: Model 4 + Exponential Age;} \\
&\text{Model 6: Model 4 + Linear Age + Quadratic Age;} \\
&\text{Model 7: Model 4 + Piecewise Linear hg with changepoint at 70
and no effect before 70.}
\end{aligned}
\]
fun_exp <- function(x) {exp(x)}
model5 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1) + fun_hg(hg) + sz + fun_exp(Age), data = prostateCancerSub)
model6 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1) + fun_hg(hg) + sz + poly(Age, 2), data = prostateCancerSub)
fun_age <- function(x) {pmax(x, 70)}
model7 <- coxph(Surv(Time, death != 0) ~ Trt + I(Stage == 4) + I(hx == 1) + fun_hg(hg) + sz + fun_age(Age), data = prostateCancerSub)
The AIC of each model is given by
result <- data.frame(Model = c("model 4", "model 5", "model 6", "model 7"),
AIC = c(AIC(model4), AIC(model5), AIC(model6), AIC(model7)))
result %>% kable()
| Model | AIC |
|---|---|
| model 4 | 3964.242 |
| model 5 | 3963.333 |
| model 6 | 3957.152 |
| model 7 | 3954.960 |
According to the output above, we find that model 7 has the smallest
AIC. Therefore, we choose the model 7 to fit the variable
Age.
In conclusion, we obtain the final model as follows: \[ \begin{aligned} \text{Model 7: } \ \lambda(t|\boldsymbol{x}_i)=&\lambda_0(t)\mbox{exp}\big\{\beta_1I(\text{Trt}_i=\text{'0.2mg estrogen'})+\beta_2I(\text{Trt}_i=\text{'1.0mg estrogen'}) \\ &+\beta_3I(\text{Trt}_i=\text{'5.0mg estrogen'})+\beta_4I(\text{Stage}_i=4)+\beta_5I(\text{hx}_i=1) \\ &+\beta_6\text{hg}_i+\beta_7(\text{hg}_i-15)_++\beta_8\text{sz}_i+\beta_9\mbox{max}(\text{Age}_i,70)\big\}. \end{aligned} \]
summary(model7)
## Call:
## coxph(formula = Surv(Time, death != 0) ~ Trt + I(Stage == 4) +
## I(hx == 1) + fun_hg(hg) + sz + fun_age(Age), data = prostateCancerSub)
##
## n= 502, number of events= 354
##
## coef exp(coef) se(coef) z Pr(>|z|)
## Trt0.2 mg estrogen 0.016839 1.016982 0.146116 0.115 0.90825
## Trt1.0 mg estrogen -0.388490 0.678080 0.157910 -2.460 0.01389 *
## Trt5.0 mg estrogen -0.004901 0.995111 0.147224 -0.033 0.97344
## I(Stage == 4)TRUE 0.221958 1.248519 0.112303 1.976 0.04811 *
## I(hx == 1)TRUE 0.428227 1.534534 0.109874 3.897 9.72e-05 ***
## fun_hg(hg)x -0.138289 0.870847 0.034795 -3.974 7.06e-05 ***
## fun_hg(hg) 0.338444 1.402763 0.131669 2.570 0.01016 *
## sz 0.019005 1.019187 0.004271 4.450 8.59e-06 ***
## fun_age(Age) 0.051999 1.053374 0.015071 3.450 0.00056 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## Trt0.2 mg estrogen 1.0170 0.9833 0.7637 1.3542
## Trt1.0 mg estrogen 0.6781 1.4748 0.4976 0.9240
## Trt5.0 mg estrogen 0.9951 1.0049 0.7457 1.3280
## I(Stage == 4)TRUE 1.2485 0.8009 1.0019 1.5559
## I(hx == 1)TRUE 1.5345 0.6517 1.2372 1.9033
## fun_hg(hg)x 0.8708 1.1483 0.8134 0.9323
## fun_hg(hg) 1.4028 0.7129 1.0837 1.8158
## sz 1.0192 0.9812 1.0107 1.0278
## fun_age(Age) 1.0534 0.9493 1.0227 1.0850
##
## Concordance= 0.64 (se = 0.016 )
## Likelihood ratio test= 84.55 on 9 df, p=2e-14
## Wald test = 84.93 on 9 df, p=2e-14
## Score (logrank) test = 87.37 on 9 df, p=5e-15
The estimate \(\hat{\beta}_6=-0.138289\) and \(\hat{\beta}_7=0.338444\), which indicates that
When serum hemoglobin is below 15, the hazard ratio of death for
an individual relative to one with other covariates being equal but the
serum hemoglobin (hg) being one unit less is \(\mbox{exp}(\hat{\beta}_6)=0.870847\), which
means that the greater serum hemoglobin below 15, the lower hazard of
death;
When serum hemoglobin is beyond 15, the hazard ratio of death for
an individual relative to one with other covariates being equal but the
serum hemoglobin (hg) being one unit less is \(\mbox{exp}(\hat{\beta}_6+\hat{\beta}_7)=1.221592\),
which means that the greater serum hemoglobin beyond 15, the larger
hazard of death.
The estimate \(\hat{\beta}_8=0.019005\), which indicates
that the hazard ratio of death for an individual relative to one with
other covariates being equal but the tumor size (sz) being
one unit less is \(\mbox{exp}(\hat{\beta}_8)=1.019187\), which
means that the greater tumor size, the larger hazard of death.
The estimate \(\hat{\beta}_9=0.051999\), which indicates that
The hazard of death is not affected by increase of age under 70;
the hazard ratio of death for an individual relative to one with
other covariates being equal but the age (Age) being one
unit less is \(\mbox{exp}(\hat{\beta}_9)=1.053374\), which
means that the greater age, the larger hazard of death.