Question 4

We import the required data set.

data <- read.csv("kidneyTrans.csv")
data %>% head() %>% kable()
obs time death gender race age
1 1 0 1 1 46
2 5 0 1 1 51
3 7 1 1 1 55
4 9 0 1 1 57
5 13 0 1 1 45
6 13 0 1 1 43

According to the question, we know there are 863 observations. The meanings of each variable are as follows:

time means the survival times after transplant measured in days;

death means the state of patient, where 0 represents censoring and 1 represents observation;

gender means the gender of patient, where 1 represents male and 2 represents female;

race means the race of patient, where 1 represents white and 2 represents black.

Question 4.1

First, we should screen 59 black female kidney transplant patients.

bf <- data %>% filter(gender == 2 & race == 2)

Then, we estimate the survival function of this data set via the Kaplan-Meier estimate.

timesbf <- bf[, 2]
statusbf <- bf[, 3]
est_km_bf <- survfit(Surv(timesbf, statusbf) ~ 1, conf.type = "plain", conf.int = 0.9)
summary(est_km_bf)
## Call: survfit(formula = Surv(timesbf, statusbf) ~ 1, conf.type = "plain", 
##     conf.int = 0.9)
## 
##  time n.risk n.event survival std.err lower 90% CI upper 90% CI
##    40     58       1    0.983  0.0171        0.955        1.000
##    45     57       1    0.966  0.0240        0.926        1.000
##   106     55       1    0.948  0.0293        0.900        0.996
##   121     52       1    0.930  0.0339        0.874        0.985
##   229     51       1    0.912  0.0378        0.849        0.974
##   344     45       1    0.891  0.0421        0.822        0.960
##   864     37       1    0.867  0.0473        0.789        0.945
##   929     30       1    0.838  0.0539        0.750        0.927
##   943     29       1    0.809  0.0592        0.712        0.907
##  1016     26       1    0.778  0.0646        0.672        0.885
##  1196     24       1    0.746  0.0696        0.631        0.860
##  2171     13       1    0.688  0.0846        0.549        0.828
##  2276     11       1    0.626  0.0974        0.466        0.786
##  2650      7       1    0.536  0.1176        0.343        0.730
plot(est_km_bf, main = "90% Confidence interval without transformation", xlab = "Time (Days)", ylab = "Estimated survival function")

Question 4.2

According to the results in (1), we know the probability a black female will survival at least 12 months (365 days) after transplantation is given by \[\hat{S}_{KM}(365)=0.891.\] And the standard error of Kaplan–Meier estimate is \(\mbox{SE}\big\{\hat{S}_{KM}(365)\big\}=0.0421\). Therefore, the \(90\%\) confidence interval of it is given by \[\Big[\hat{S}_{KM}(365)-z_{0.05}\times\mbox{SE}\big\{\hat{S}_{KM}(365)\big\},\hat{S}_{KM}(365)+z_{0.05}\times\mbox{SE}\big\{\hat{S}_{KM}(365)\big\}\Big]=[0.822,0.960].\]

Question 4.3

According to the results in (1), we know the estimated 0.25 quantile of the survival time of black female patients is given by \[\hat{t}^{0.25}=\text{inf}\{t:\hat{S}_{KM}(t)\leq0.75\}=1196.\] And the \(90\%\) confidence interval of it is given by \[[\hat{t}^{0.25}_l,\hat{t}^{0.25}_u]=[929,2650].\]

Question 4.4

We estimate the hazard function of black female patients via kernel–smoothed method with bandwidth equals 365 days and the biquadratic kernel without boundary correction. The relevant R codes and output are as follows:

res_bf <- muhaz(timesbf, statusbf, bw.grid = 365, kern = "biquadratic", bw.method = "global", b.cor = "none")

plot(res_bf, main = "Kernel-smoothed estimate", col = "red", lwd = 2)

The hazard function shows the dependence of the instantaneous risk of death on time, which is widely used to express the risk or hazard of death occurring at some time \(t\). According to the figure above, we can observe the law of change in the risk of patients death with the time after surgery. Patients are at higher risk of death in the immediate aftermath of surgery, three and six years after surgery.

Question 4.5

We conduct the left boundary correction to the estimation of hazard function in (4). The relevant R codes and output are as follows:

res_bfl <- muhaz(timesbf, statusbf, bw.grid = 365, kern = "biquadratic", bw.method = "global", b.cor = "left")

plot(res_bf, main = "Kernel-smoothed estimate", col = "red", lwd = 2)
lines(res_bfl, col = "blue", lwd = 2, lty = 2)
legend("topright", legend = c("without boundary correction", "left boundary correction"),
       col=c("red", "blue"), lty = c(1, 2), cex=0.8)

After the left boundary correction, we can find the hazard rate at the beginning time has improved significantly compared to that in (4). The reasons are as follows:

A problem with the kernel-smoothed estimation is that the kernel may put mass at negative times. Since the minimum time is 14 and the bandwidth is 365, the actual area under the kernel at the beginning is less than 1. So the estimation of hazard function at beginning will less than the true value. To correct for this problem, we conduct the left boundary correction to the estimation of hazard function and obtain a higher hazard rate at the beginning time.

Question 4.6

First, we should screen 280 white female kidney transplant patients.

wf <- data %>% filter(gender == 2 & race == 1)

Then, we estimate the survival function of this data set via the Kaplan-Meier estimate.

timeswf <- wf[, 2]
statuswf <- wf[, 3]
est_km_wf <- survfit(Surv(timeswf, statuswf) ~ 1, conf.type = "plain", conf.int = 0.9)
summary(est_km_wf)
## Call: survfit(formula = Surv(timeswf, statuswf) ~ 1, conf.type = "plain", 
##     conf.int = 0.9)
## 
##  time n.risk n.event survival std.err lower 90% CI upper 90% CI
##     2    279       1    0.996 0.00358        0.991        1.000
##     3    278       1    0.993 0.00505        0.985        1.000
##     7    276       1    0.989 0.00618        0.979        0.999
##    10    274       2    0.982 0.00797        0.969        0.995
##    21    270       1    0.978 0.00873        0.964        0.993
##    50    264       1    0.975 0.00945        0.959        0.990
##    52    262       1    0.971 0.01012        0.954        0.988
##    62    261       1    0.967 0.01075        0.950        0.985
##    68    259       1    0.963 0.01133        0.945        0.982
##    78    258       1    0.960 0.01189        0.940        0.979
##    97    255       1    0.956 0.01242        0.936        0.976
##   104    254       1    0.952 0.01293        0.931        0.974
##   143    247       1    0.948 0.01344        0.926        0.970
##   154    245       1    0.945 0.01393        0.922        0.967
##   209    238       1    0.941 0.01443        0.917        0.964
##   273    227       1    0.936 0.01495        0.912        0.961
##   297    225       1    0.932 0.01545        0.907        0.958
##   366    218       1    0.928 0.01596        0.902        0.954
##   470    206       1    0.923 0.01651        0.896        0.951
##   490    205       1    0.919 0.01703        0.891        0.947
##   614    199       1    0.914 0.01756        0.885        0.943
##   793    179       1    0.909 0.01819        0.879        0.939
##   840    178       1    0.904 0.01879        0.873        0.935
##   852    177       1    0.899 0.01937        0.867        0.931
##  1013    167       1    0.894 0.01998        0.861        0.926
##  1164    161       1    0.888 0.02062        0.854        0.922
##  1326    147       1    0.882 0.02134        0.847        0.917
##  1331    146       1    0.876 0.02204        0.840        0.912
##  1473    138       1    0.870 0.02277        0.832        0.907
##  1777    120       1    0.862 0.02371        0.823        0.901
##  1835    112       1    0.855 0.02471        0.814        0.895
##  1877    110       1    0.847 0.02568        0.805        0.889
##  1940    106       1    0.839 0.02665        0.795        0.883
##  2034     96       1    0.830 0.02777        0.785        0.876
##  2108     87       1    0.821 0.02905        0.773        0.868
##  2301     76       1    0.810 0.03060        0.760        0.860
##  2567     58       1    0.796 0.03311        0.741        0.850
##  2795     41       1    0.776 0.03756        0.715        0.838
plot(est_km_wf, main = "90% Confidence interval without transformation", xlab = "Time (Days)", ylab = "Estimated survival function")

We estimate the hazard function of black female patients via kernel–smoothed method with bandwidth equals 365 days and the biquadratic kernel without boundary correction. The relevant R codes and output are as follows:

res_wf <- muhaz(timeswf, statuswf, bw.grid = 365, kern = "biquadratic", bw.method = "global", b.cor = "none")

plot(res_wf, main = "Kernel-smoothed estimate", col = "red", lwd = 2)

Compare the results obtained in (6) with the previous, we can find that white female have a lower risk of death than black female. And white female have a higher survival rate than black female. Visually, there is a significant difference in the estimation of survival functions between the two. To investigate whether there is indeed a difference, we will conduct the log-rank test next.

Question 4.7

Denote \(S_B(t)\) is the survival function of black patients and \(S_W(t)\) is the survival function of white patients. We are interested in testing the hypothesis: \[H_0:S_B(t)=S_W(t) \ \text{ against } \ H_1:S_B(t)\neq S_W(t).\] We conduct the log-rank test on the survival function of black and white patients. The relevant R codes and output are as follows:

b <- data %>% filter(race == 2)
w <- data %>% filter(race == 1)
timesb <- b[, 2]
statusb <- b[, 3]
timesw <- w[, 2]
statusw <- w[, 3]
times_tot <- c(timesb, timesw)
status_tot <- c(statusb, statusw)
stain <- rep(c("black", "white"), times = c(length(timesb), length(timesw)))
dataset <- data.frame(times = times_tot, status = status_tot, stain = stain)
survdiff(Surv(times, status) ~ stain, data = dataset)
## Call:
## survdiff(formula = Surv(times, status) ~ stain, data = dataset)
## 
##               N Observed Expected (O-E)^2/E (O-E)^2/V
## stain=black 151       28     23.4     0.925      1.11
## stain=white 712      112    116.6     0.185      1.11
## 
##  Chisq= 1.1  on 1 degrees of freedom, p= 0.3

According to the output above, we know the \(p\)-value is \(0.3>0.05\). Therefore, we can not reject the null hypothesis. That is, we think there is no significant difference between the survival function of black and white patients.

Question 4.8

Denote \(S_{BM}(t)\) is the survival function of black male patients, \(S_{WM}(t)\) is the survival function of white male patients, \(S_{BF}(t)\) is the survival function of black female patients and \(S_{WF}(t)\) is the survival function of white female patients. We are interested in the null hypothesis \[H_0:S_{BM}(t)=S_{WM}(t) \ \text{ and } \ S_{BF}(t)=S_{WF}(t);\] Against the alternative hypothesis \[H_1:\text{ At least one of the above equations does not hold on.}\] By taking account of gender, we use stratified log-rank test to test whether the survival functions of black and white patients as follows:

dataset <- data.frame(times = data$time, status = data$death, gender = data$gender, race = data$race)
survdiff(Surv(times, status) ~ race + strata(gender), data = dataset)
## Call:
## survdiff(formula = Surv(times, status) ~ race + strata(gender), 
##     data = dataset)
## 
##          N Observed Expected (O-E)^2/E (O-E)^2/V
## race=1 712      112    116.7     0.188      1.13
## race=2 151       28     23.3     0.942      1.13
## 
##  Chisq= 1.1  on 1 degrees of freedom, p= 0.3

According to the output above, \(p\)-value is \(0.3>0.05\). Therefore, we can not reject the null hypothesis. That is, we think there is no significant difference between the survival function of black and white patients after taking account of gender.

Question 4.9

We are interested in testing the hypothesis: \[H_0:S_{BM}(t)=S_{WM}(t) \ \text{ against } \ H_1:S_{BM}(t)\neq S_{WM}(t).\] We conduct the log-rank test on the survival function of black male and white male patients. The relevant R codes and output are as follows:

bm <- data %>% filter(gender == 1 & race == 2)
wm <- data %>% filter(gender == 1 & race == 1)
timesbm <- bm[, 2]
statusbm <- bm[, 3]
timeswm <- wm[, 2]
statuswm <- wm[, 3]
times_tot <- c(timesbm, timeswm)
status_tot <- c(statusbm, statuswm)
stain <- rep(c("black", "white"), times = c(length(timesbm), length(timeswm)))
dataset <- data.frame(times = times_tot, status = status_tot, stain = stain)
survdiff(Surv(times, status) ~ stain, data = dataset)
## Call:
## survdiff(formula = Surv(times, status) ~ stain, data = dataset)
## 
##               N Observed Expected (O-E)^2/E (O-E)^2/V
## stain=black  92       14     15.1    0.0801     0.097
## stain=white 432       73     71.9    0.0168     0.097
## 
##  Chisq= 0.1  on 1 degrees of freedom, p= 0.8

According to the output above, we know the \(p\)-value is \(0.8>0.05\). Therefore, we can not reject the null hypothesis. That is, we think there is no significant difference between the survival function of black male and white male patients.

Similar to the hypothesis test above, we conduct the log-rank test on the survival function of black female and white female patients. The relevant R codes and output are as follows:

times_tot <- c(timesbf, timeswf)
status_tot <- c(statusbf, statuswf)
stain <- rep(c("black", "white"), times = c(length(timesbf), length(timeswf)))
dataset <- data.frame(times = times_tot, status = status_tot, stain = stain)
survdiff(Surv(times, status) ~ stain, data = dataset)
## Call:
## survdiff(formula = Surv(times, status) ~ stain, data = dataset)
## 
##               N Observed Expected (O-E)^2/E (O-E)^2/V
## stain=black  59       14     8.21     4.076      4.85
## stain=white 280       39    44.79     0.748      4.85
## 
##  Chisq= 4.8  on 1 degrees of freedom, p= 0.03

According to the output above, we know the \(p\)-value is \(0.03<0.05\). Therefore, we must reject the null hypothesis. That is, we think there is significant difference between the survival function of black female and white female patients.

According to (9), we know the impact of race on survival time depending on gender, which is different from the result of (8). Because in (8), the whole data can be effected by both gender and race. This may result in a less significant impact of gender on survival time. Therefore, we divide the whole data into two genders in (9), and separately use the log-rank test to test whether the survival functions of black and white patients are significantly different.