Question 3

We import the required data set.

prostateCancerSub <- read.csv("prostateCancerSub.csv")
prostateCancerSub %>% head() %>% kable()
Time Status Trt Stage hx hg sz Age
72 alive 0.2 mg estrogen 3 0 13.79883 2 75
1 dead - other ca 0.2 mg estrogen 3 0 14.59961 42 54
40 dead - cerebrovascular 5.0 mg estrogen 3 1 13.39844 3 69
20 dead - cerebrovascular 0.2 mg estrogen 3 1 17.59766 4 75
65 alive placebo 3 0 13.39844 34 67
24 dead - prostatic ca 0.2 mg estrogen 3 0 15.09961 10 71

According to the question, we know there are 502 observations. The meanings of each variable are as follows:

Time means the time to death (months) subject to censoring;

Status means the status at end of study. One category is ‘alive’, all other categories correspond to dead, by cause of death;

Trt means the treatment assigned (0.2/1.0/5.0 mg estrogen, placebo);

hx means the history of cardiovascular disease (1=yes, 0=no);

Stage means the tumor stage;

hg means the serum hemoglobin (g/100ml);

sz means the size of primary tumor (\(\text{cm}^2\));

Age means the age in years.

Question 3.1

We combine causes of death in the following way:

Status1: Combine death due to prostatic cancer and other cancer as death due to cancer;

Status2: Combine death due to cerebrovascular, heart or vascular as death due to vascular disease;

Status3: Combine death due to causes other than those mentioned above as death due to other causes.

And we plot the corresponding cumulative incidence functions as follows:

prostateCancerSub$status1 <- ifelse(prostateCancerSub$Status == "dead - prostatic ca" | prostateCancerSub$Status == "dead - other ca", 1, 0)
prostateCancerSub$status2 <- ifelse(prostateCancerSub$Status == "dead - cerebrovascular" | prostateCancerSub$Status == "dead - heart or vascular", 1, 0)
prostateCancerSub$status3 <- ifelse(prostateCancerSub$Status != "dead - prostatic ca" & prostateCancerSub$Status != "dead - other ca" & prostateCancerSub$Status != "dead - cerebrovascular" & prostateCancerSub$Status != "dead - heart or vascular" & prostateCancerSub$Status != "alive", 1, 0)

fit1 <- survfit(Surv(Time, status1) ~ 1, prostateCancerSub)
fit2 <- survfit(Surv(Time, status2) ~ 1, prostateCancerSub)
fit3 <- survfit(Surv(Time, status3) ~ 1, prostateCancerSub)

df1 <- data.frame(x = fit1$time, y = 1 - fit1$surv)
df2 <- data.frame(x = fit2$time, y = 1 - fit2$surv)
df3 <- data.frame(x = fit3$time, y = 1 - fit3$surv)

ggplot() + 
  geom_step(data = df1, aes(x = x, y = y, color = "Death due to cancer")) + 
  geom_step(data = df2, aes(x = x, y = y, color = "Death due to vscular disease")) + 
  geom_step(data = df3, aes(x = x, y = y, color = "Death due to other causes")) + 
  labs(title = "Estimated Cumulative Incidence Functions for Different Causes of Death", x = "Time (months)", y = "Estimated F(t)") + 
  theme_minimal() + 
  theme(panel.grid =element_blank()) + 
  theme(axis.line = element_line(size=0.5, colour = "black")) + 
  theme(plot.title = element_text(hjust = 0.5, size = 14), axis.title = element_text(size = 14), 
        axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 12), legend.position = 'bottom')
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

According to the plot above, we find that the probability of dying due to cancer is \(\approx0.504\), the probability of dying due to vascular disease is \(\approx0.362\) and the probability of dying due to other causes is \(\approx0.241\) at 76 months. And the sum of three probabilities is \(\approx1.107>1\), which indicates that the estimates of probability exist biases.

Question 3.2

Consider the competing risks, the estimated cumulative incidence functions are given by

state.cp <- rep(0, nrow(prostateCancerSub))
state.cp[prostateCancerSub$Status == "dead - prostatic ca" | prostateCancerSub$Status == "dead - other ca"] <- 1
state.cp[prostateCancerSub$Status == "dead - cerebrovascular" | prostateCancerSub$Status == "dead - heart or vascular"] <- 2
state.cp[!(prostateCancerSub$Status %in% c("dead - prostatic ca", "dead - other ca", "dead - cerebrovascular", "dead - heart or vascular", "alive"))] <- 3
prostateCancerSub$state.cp <- state.cp

fit_cp <- survfit(Surv(Time, factor(state.cp)) ~ 1, prostateCancerSub)

ggplot(data.frame(x = fit_cp$time, y1 = fit_cp$pstate[, 2], y2 = fit_cp$pstate[, 3], y3 = fit_cp$pstate[, 4]), aes(x)) +
  geom_step(aes(y = y1, colour = "Death due to cancer")) + 
  geom_step(aes(y = y2, colour = "Death due to vscular disease")) +
  geom_step(aes(y = y3, colour = "Death due to other causes")) + 
  labs(title = "Estimated Cumulative Incidence Functions for Different Causes of Death", x = "Time (months)", y = "Estimated F(t)") + 
  theme_minimal() + 
  theme(panel.grid =element_blank()) + 
  theme(axis.line = element_line(size=0.5, colour = "black")) + 
  theme(plot.title = element_text(hjust = 0.5, size = 14), axis.title = element_text(size = 14), 
        axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 12), legend.position = 'bottom')

According to the plot above, we find that the probability of dying due to cancer is \(\approx0.352\), the probability of dying due to vascular disease is \(\approx0.258\) and the probability of dying due to other causes is \(\approx0.152\) at 76 months. Compared the result with question 3.1, we find that the probabilities of all three types of death are smaller. We correct the biases in question 3.1.