We import the required data set.
data <- read.csv("kidneyTrans.csv")
data %>% head() %>% kable()
| obs | time | death | gender | race | age |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 | 46 |
| 2 | 5 | 0 | 1 | 1 | 51 |
| 3 | 7 | 1 | 1 | 1 | 55 |
| 4 | 9 | 0 | 1 | 1 | 57 |
| 5 | 13 | 0 | 1 | 1 | 45 |
| 6 | 13 | 0 | 1 | 1 | 43 |
According to the question, we know there are 863 observations. The meanings of each variable are as follows:
time means the survival times after transplant measured
in days;
death means the state of patient, where 0 represents
censoring and 1 represents observation;
gender means the gender of patient, where 1 represents
male and 2 represents female;
race means the race of patient, where 1 represents white
and 2 represents black.
First, we should screen 59 black female kidney transplant patients.
bf <- data %>% filter(gender == 2 & race == 2)
Then, we estimate the survival function of this data set via the Kaplan-Meier estimate.
timesbf <- bf[, 2]
statusbf <- bf[, 3]
est_km_bf <- survfit(Surv(timesbf, statusbf) ~ 1, conf.type = "plain", conf.int = 0.9)
summary(est_km_bf)
## Call: survfit(formula = Surv(timesbf, statusbf) ~ 1, conf.type = "plain",
## conf.int = 0.9)
##
## time n.risk n.event survival std.err lower 90% CI upper 90% CI
## 40 58 1 0.983 0.0171 0.955 1.000
## 45 57 1 0.966 0.0240 0.926 1.000
## 106 55 1 0.948 0.0293 0.900 0.996
## 121 52 1 0.930 0.0339 0.874 0.985
## 229 51 1 0.912 0.0378 0.849 0.974
## 344 45 1 0.891 0.0421 0.822 0.960
## 864 37 1 0.867 0.0473 0.789 0.945
## 929 30 1 0.838 0.0539 0.750 0.927
## 943 29 1 0.809 0.0592 0.712 0.907
## 1016 26 1 0.778 0.0646 0.672 0.885
## 1196 24 1 0.746 0.0696 0.631 0.860
## 2171 13 1 0.688 0.0846 0.549 0.828
## 2276 11 1 0.626 0.0974 0.466 0.786
## 2650 7 1 0.536 0.1176 0.343 0.730
plot(est_km_bf, main = "90% Confidence interval without transformation", xlab = "Time (Days)", ylab = "Estimated survival function")
According to the results in (1), we know the probability a black female will survival at least 12 months (365 days) after transplantation is given by \[\hat{S}_{KM}(365)=0.891.\] And the standard error of Kaplan–Meier estimate is \(\mbox{SE}\big\{\hat{S}_{KM}(365)\big\}=0.0421\). Therefore, the \(90\%\) confidence interval of it is given by \[\Big[\hat{S}_{KM}(365)-z_{0.05}\times\mbox{SE}\big\{\hat{S}_{KM}(365)\big\},\hat{S}_{KM}(365)+z_{0.05}\times\mbox{SE}\big\{\hat{S}_{KM}(365)\big\}\Big]=[0.822,0.960].\]
According to the results in (1), we know the estimated 0.25 quantile of the survival time of black female patients is given by \[\hat{t}^{0.25}=\text{inf}\{t:\hat{S}_{KM}(t)\leq0.75\}=1196.\] And the \(90\%\) confidence interval of it is given by \[[\hat{t}^{0.25}_l,\hat{t}^{0.25}_u]=[929,2650].\]
We estimate the hazard function of black female patients via kernel–smoothed method with bandwidth equals 365 days and the biquadratic kernel without boundary correction. The relevant R codes and output are as follows:
res_bf <- muhaz(timesbf, statusbf, bw.grid = 365, kern = "biquadratic", bw.method = "global", b.cor = "none")
plot(res_bf, main = "Kernel-smoothed estimate", col = "red", lwd = 2)
The hazard function shows the dependence of the instantaneous risk of death on time, which is widely used to express the risk or hazard of death occurring at some time \(t\). According to the figure above, we can observe the law of change in the risk of patients death with the time after surgery. Patients are at higher risk of death in the immediate aftermath of surgery, three and six years after surgery.
We conduct the left boundary correction to the estimation of hazard function in (4). The relevant R codes and output are as follows:
res_bfl <- muhaz(timesbf, statusbf, bw.grid = 365, kern = "biquadratic", bw.method = "global", b.cor = "left")
plot(res_bf, main = "Kernel-smoothed estimate", col = "red", lwd = 2)
lines(res_bfl, col = "blue", lwd = 2, lty = 2)
legend("topright", legend = c("without boundary correction", "left boundary correction"),
col=c("red", "blue"), lty = c(1, 2), cex=0.8)
After the left boundary correction, we can find the hazard rate at the beginning time has improved significantly compared to that in (4). The reasons are as follows:
A problem with the kernel-smoothed estimation is that the kernel may put mass at negative times. Since the minimum time is 14 and the bandwidth is 365, the actual area under the kernel at the beginning is less than 1. So the estimation of hazard function at beginning will less than the true value. To correct for this problem, we conduct the left boundary correction to the estimation of hazard function and obtain a higher hazard rate at the beginning time.
First, we should screen 280 white female kidney transplant patients.
wf <- data %>% filter(gender == 2 & race == 1)
Then, we estimate the survival function of this data set via the Kaplan-Meier estimate.
timeswf <- wf[, 2]
statuswf <- wf[, 3]
est_km_wf <- survfit(Surv(timeswf, statuswf) ~ 1, conf.type = "plain", conf.int = 0.9)
summary(est_km_wf)
## Call: survfit(formula = Surv(timeswf, statuswf) ~ 1, conf.type = "plain",
## conf.int = 0.9)
##
## time n.risk n.event survival std.err lower 90% CI upper 90% CI
## 2 279 1 0.996 0.00358 0.991 1.000
## 3 278 1 0.993 0.00505 0.985 1.000
## 7 276 1 0.989 0.00618 0.979 0.999
## 10 274 2 0.982 0.00797 0.969 0.995
## 21 270 1 0.978 0.00873 0.964 0.993
## 50 264 1 0.975 0.00945 0.959 0.990
## 52 262 1 0.971 0.01012 0.954 0.988
## 62 261 1 0.967 0.01075 0.950 0.985
## 68 259 1 0.963 0.01133 0.945 0.982
## 78 258 1 0.960 0.01189 0.940 0.979
## 97 255 1 0.956 0.01242 0.936 0.976
## 104 254 1 0.952 0.01293 0.931 0.974
## 143 247 1 0.948 0.01344 0.926 0.970
## 154 245 1 0.945 0.01393 0.922 0.967
## 209 238 1 0.941 0.01443 0.917 0.964
## 273 227 1 0.936 0.01495 0.912 0.961
## 297 225 1 0.932 0.01545 0.907 0.958
## 366 218 1 0.928 0.01596 0.902 0.954
## 470 206 1 0.923 0.01651 0.896 0.951
## 490 205 1 0.919 0.01703 0.891 0.947
## 614 199 1 0.914 0.01756 0.885 0.943
## 793 179 1 0.909 0.01819 0.879 0.939
## 840 178 1 0.904 0.01879 0.873 0.935
## 852 177 1 0.899 0.01937 0.867 0.931
## 1013 167 1 0.894 0.01998 0.861 0.926
## 1164 161 1 0.888 0.02062 0.854 0.922
## 1326 147 1 0.882 0.02134 0.847 0.917
## 1331 146 1 0.876 0.02204 0.840 0.912
## 1473 138 1 0.870 0.02277 0.832 0.907
## 1777 120 1 0.862 0.02371 0.823 0.901
## 1835 112 1 0.855 0.02471 0.814 0.895
## 1877 110 1 0.847 0.02568 0.805 0.889
## 1940 106 1 0.839 0.02665 0.795 0.883
## 2034 96 1 0.830 0.02777 0.785 0.876
## 2108 87 1 0.821 0.02905 0.773 0.868
## 2301 76 1 0.810 0.03060 0.760 0.860
## 2567 58 1 0.796 0.03311 0.741 0.850
## 2795 41 1 0.776 0.03756 0.715 0.838
plot(est_km_wf, main = "90% Confidence interval without transformation", xlab = "Time (Days)", ylab = "Estimated survival function")
We estimate the hazard function of black female patients via kernel–smoothed method with bandwidth equals 365 days and the biquadratic kernel without boundary correction. The relevant R codes and output are as follows:
res_wf <- muhaz(timeswf, statuswf, bw.grid = 365, kern = "biquadratic", bw.method = "global", b.cor = "none")
plot(res_wf, main = "Kernel-smoothed estimate", col = "red", lwd = 2)
Compare the results obtained in (6) with the previous, we can find that white female have a lower risk of death than black female. And white female have a higher survival rate than black female. Visually, there is a significant difference in the estimation of survival functions between the two. To investigate whether there is indeed a difference, we will conduct the log-rank test next.
Denote \(S_B(t)\) is the survival function of black patients and \(S_W(t)\) is the survival function of white patients. We are interested in testing the hypothesis: \[H_0:S_B(t)=S_W(t) \ \text{ against } \ H_1:S_B(t)\neq S_W(t).\] We conduct the log-rank test on the survival function of black and white patients. The relevant R codes and output are as follows:
b <- data %>% filter(race == 2)
w <- data %>% filter(race == 1)
timesb <- b[, 2]
statusb <- b[, 3]
timesw <- w[, 2]
statusw <- w[, 3]
times_tot <- c(timesb, timesw)
status_tot <- c(statusb, statusw)
stain <- rep(c("black", "white"), times = c(length(timesb), length(timesw)))
dataset <- data.frame(times = times_tot, status = status_tot, stain = stain)
survdiff(Surv(times, status) ~ stain, data = dataset)
## Call:
## survdiff(formula = Surv(times, status) ~ stain, data = dataset)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## stain=black 151 28 23.4 0.925 1.11
## stain=white 712 112 116.6 0.185 1.11
##
## Chisq= 1.1 on 1 degrees of freedom, p= 0.3
According to the output above, we know the \(p\)-value is \(0.3>0.05\). Therefore, we can not reject the null hypothesis. That is, we think there is no significant difference between the survival function of black and white patients.
Denote \(S_{BM}(t)\) is the survival function of black male patients, \(S_{WM}(t)\) is the survival function of white male patients, \(S_{BF}(t)\) is the survival function of black female patients and \(S_{WF}(t)\) is the survival function of white female patients. We are interested in the null hypothesis \[H_0:S_{BM}(t)=S_{WM}(t) \ \text{ and } \ S_{BF}(t)=S_{WF}(t);\] Against the alternative hypothesis \[H_1:\text{ At least one of the above equations does not hold on.}\] By taking account of gender, we use stratified log-rank test to test whether the survival functions of black and white patients as follows:
dataset <- data.frame(times = data$time, status = data$death, gender = data$gender, race = data$race)
survdiff(Surv(times, status) ~ race + strata(gender), data = dataset)
## Call:
## survdiff(formula = Surv(times, status) ~ race + strata(gender),
## data = dataset)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## race=1 712 112 116.7 0.188 1.13
## race=2 151 28 23.3 0.942 1.13
##
## Chisq= 1.1 on 1 degrees of freedom, p= 0.3
According to the output above, \(p\)-value is \(0.3>0.05\). Therefore, we can not reject the null hypothesis. That is, we think there is no significant difference between the survival function of black and white patients after taking account of gender.
We are interested in testing the hypothesis: \[H_0:S_{BM}(t)=S_{WM}(t) \ \text{ against } \ H_1:S_{BM}(t)\neq S_{WM}(t).\] We conduct the log-rank test on the survival function of black male and white male patients. The relevant R codes and output are as follows:
bm <- data %>% filter(gender == 1 & race == 2)
wm <- data %>% filter(gender == 1 & race == 1)
timesbm <- bm[, 2]
statusbm <- bm[, 3]
timeswm <- wm[, 2]
statuswm <- wm[, 3]
times_tot <- c(timesbm, timeswm)
status_tot <- c(statusbm, statuswm)
stain <- rep(c("black", "white"), times = c(length(timesbm), length(timeswm)))
dataset <- data.frame(times = times_tot, status = status_tot, stain = stain)
survdiff(Surv(times, status) ~ stain, data = dataset)
## Call:
## survdiff(formula = Surv(times, status) ~ stain, data = dataset)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## stain=black 92 14 15.1 0.0801 0.097
## stain=white 432 73 71.9 0.0168 0.097
##
## Chisq= 0.1 on 1 degrees of freedom, p= 0.8
According to the output above, we know the \(p\)-value is \(0.8>0.05\). Therefore, we can not reject the null hypothesis. That is, we think there is no significant difference between the survival function of black male and white male patients.
Similar to the hypothesis test above, we conduct the log-rank test on the survival function of black female and white female patients. The relevant R codes and output are as follows:
times_tot <- c(timesbf, timeswf)
status_tot <- c(statusbf, statuswf)
stain <- rep(c("black", "white"), times = c(length(timesbf), length(timeswf)))
dataset <- data.frame(times = times_tot, status = status_tot, stain = stain)
survdiff(Surv(times, status) ~ stain, data = dataset)
## Call:
## survdiff(formula = Surv(times, status) ~ stain, data = dataset)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## stain=black 59 14 8.21 4.076 4.85
## stain=white 280 39 44.79 0.748 4.85
##
## Chisq= 4.8 on 1 degrees of freedom, p= 0.03
According to the output above, we know the \(p\)-value is \(0.03<0.05\). Therefore, we must reject the null hypothesis. That is, we think there is significant difference between the survival function of black female and white female patients.
According to (9), we know the impact of race on survival time depending on gender, which is different from the result of (8). Because in (8), the whole data can be effected by both gender and race. This may result in a less significant impact of gender on survival time. Therefore, we divide the whole data into two genders in (9), and separately use the log-rank test to test whether the survival functions of black and white patients are significantly different.