We import the required data set.
kidneyTrans <- read.csv("kidneyTrans.csv")
kidneyTrans <- kidneyTrans %>% filter(age >= 35)
kidneyTrans %>% head() %>% kable()
| obs | time | death | gender | race | age |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 | 46 |
| 2 | 5 | 0 | 1 | 1 | 51 |
| 3 | 7 | 1 | 1 | 1 | 55 |
| 4 | 9 | 0 | 1 | 1 | 57 |
| 5 | 13 | 0 | 1 | 1 | 45 |
| 6 | 13 | 0 | 1 | 1 | 43 |
We find that there are 863 observations. The meanings of each variable are as follows:
time means the survival times after transplant measured
in days;
death means the state of patient, where 0 represents
censoring and 1 represents observation;
gender means the gender of patient, where 1 represents
male and 2 represents female;
race means the race of patient, where 1 represents white
and 2 represents black.
For kdiney transplant, it is more appropriate to assume that the hazard function obeys the log-normal distribution. Because the patients have rejection reaction to the newly transplanted organ so the hazard of death increases initially. And after the patients adapt to the new organ, the hazard of death decreases. The hazard function of Weibull distribution is monotonic, and the hazard function of log-normal is unimodal. Therefore, we choose the log-normal distribution for the survival times from kidney transplant to death.
We filter the patients with \(35\leq\text{age}<55\).
data1 <- kidneyTrans %>% filter(age < 55)
Denote \(\hat{S}_1(t)\) be the Kaplan–Meier estimate of the survival function for patients with \(35\leq\text{age}<55\).
est_km1 <- survfit(Surv(time, death) ~ 1, conf.type = "plain", conf.int = 0.9, data = data1)
summary(est_km1)
## Call: survfit(formula = Surv(time, death) ~ 1, data = data1, conf.type = "plain",
## conf.int = 0.9)
##
## time n.risk n.event survival std.err lower 90% CI upper 90% CI
## 7 422 1 0.998 0.00237 0.994 1.000
## 10 421 1 0.995 0.00334 0.990 1.000
## 17 417 1 0.993 0.00410 0.986 1.000
## 26 416 1 0.990 0.00473 0.983 0.998
## 28 415 1 0.988 0.00529 0.979 0.997
## 37 412 1 0.986 0.00580 0.976 0.995
## 40 411 1 0.983 0.00626 0.973 0.994
## 44 410 1 0.981 0.00669 0.970 0.992
## 57 407 2 0.976 0.00747 0.964 0.988
## 59 405 1 0.974 0.00783 0.961 0.987
## 62 404 1 0.971 0.00818 0.958 0.985
## 68 400 2 0.966 0.00883 0.952 0.981
## 88 396 1 0.964 0.00914 0.949 0.979
## 91 395 1 0.962 0.00943 0.946 0.977
## 98 393 1 0.959 0.00972 0.943 0.975
## 104 392 1 0.957 0.01000 0.940 0.973
## 106 389 1 0.954 0.01027 0.937 0.971
## 119 381 1 0.952 0.01055 0.934 0.969
## 121 380 1 0.949 0.01081 0.931 0.967
## 150 377 1 0.947 0.01107 0.928 0.965
## 154 374 1 0.944 0.01133 0.925 0.963
## 190 371 1 0.942 0.01158 0.923 0.961
## 228 363 1 0.939 0.01183 0.920 0.958
## 242 357 1 0.936 0.01209 0.916 0.956
## 248 354 1 0.934 0.01234 0.913 0.954
## 249 353 1 0.931 0.01259 0.910 0.952
## 252 351 1 0.928 0.01283 0.907 0.950
## 273 345 1 0.926 0.01307 0.904 0.947
## 297 340 1 0.923 0.01331 0.901 0.945
## 311 338 1 0.920 0.01355 0.898 0.943
## 340 331 1 0.917 0.01379 0.895 0.940
## 344 327 1 0.915 0.01403 0.892 0.938
## 346 326 1 0.912 0.01427 0.888 0.935
## 354 323 1 0.909 0.01450 0.885 0.933
## 470 310 1 0.906 0.01475 0.882 0.930
## 478 309 1 0.903 0.01499 0.879 0.928
## 495 307 1 0.900 0.01522 0.875 0.925
## 614 295 1 0.897 0.01547 0.872 0.923
## 615 294 1 0.894 0.01572 0.868 0.920
## 697 282 1 0.891 0.01598 0.865 0.917
## 776 271 1 0.888 0.01626 0.861 0.914
## 793 269 1 0.884 0.01653 0.857 0.912
## 840 268 1 0.881 0.01679 0.853 0.909
## 875 262 1 0.878 0.01706 0.850 0.906
## 929 254 1 0.874 0.01734 0.846 0.903
## 939 252 1 0.871 0.01762 0.842 0.900
## 943 251 1 0.867 0.01788 0.838 0.897
## 945 250 2 0.860 0.01840 0.830 0.891
## 1016 241 1 0.857 0.01867 0.826 0.888
## 1164 220 1 0.853 0.01899 0.822 0.884
## 1186 218 1 0.849 0.01930 0.817 0.881
## 1191 217 1 0.845 0.01960 0.813 0.877
## 1275 203 1 0.841 0.01994 0.808 0.874
## 1326 198 1 0.837 0.02029 0.803 0.870
## 1331 197 1 0.832 0.02062 0.799 0.866
## 1388 191 1 0.828 0.02097 0.794 0.863
## 1473 185 1 0.824 0.02133 0.789 0.859
## 1777 151 1 0.818 0.02188 0.782 0.854
## 1820 141 1 0.812 0.02248 0.775 0.849
## 1835 140 1 0.807 0.02305 0.769 0.844
## 1877 137 1 0.801 0.02362 0.762 0.840
## 1940 132 1 0.795 0.02421 0.755 0.834
## 2034 122 1 0.788 0.02487 0.747 0.829
## 2276 95 1 0.780 0.02596 0.737 0.822
## 2301 93 1 0.771 0.02700 0.727 0.816
## 2369 88 1 0.763 0.02808 0.716 0.809
## 2414 85 1 0.754 0.02915 0.706 0.802
## 2421 84 1 0.745 0.03015 0.695 0.794
## 2557 74 1 0.735 0.03138 0.683 0.786
## 2567 73 1 0.725 0.03252 0.671 0.778
## 2650 65 1 0.713 0.03388 0.658 0.769
## 2795 44 1 0.697 0.03678 0.637 0.758
## 3146 19 1 0.661 0.04990 0.578 0.743
plot(est_km1, main = "90% Confidence interval without transformation", xlab = "Time (Days)", ylab = "Estimated survival function")
Then we use the plot of \(\mbox{log}\big\{-\mbox{log}\,\hat{S}_1(t)\big\}\) against \(\mbox{log}\,t\) to check whether the Weibull distribution is suitable for the survival times as follows:
y_axis <- log(-log(est_km1$surv))
x_axis <- log(est_km1$time)
ggplot(data.frame(x = x_axis, y = y_axis), mapping = aes(x = x, y = y)) +
geom_step() +
geom_smooth(method = 'lm', formula = y ~ x) +
labs(title = "Suitability of Weibull Distribution", x = expression(log(t)), y = "log{-log S(t)}") +
theme(plot.title = element_text(hjust = 0.5, size = 13), axis.title = element_text(size = 14),
axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 12))
We use the plot of \(\mbox{log}\Big\{\hat{S}_1(t)/\big[1-\hat{S}_1(t)\big]\Big\}\) against \(\mbox{log}\,t\) to check whether the log-logistic distribution is suitable for the survival times as follows:
y_axis <- log(est_km1$surv / (1 - est_km1$surv))
x_axis <- log(est_km1$time)
ggplot(data.frame(x = x_axis, y = y_axis), mapping = aes(x = x, y = y)) +
geom_step() +
geom_smooth(method = 'lm', formula = y ~ x) +
labs(title = "Suitability of Log-logistic Distribution", x = expression(log(t)), y = "log{S(t) / [1-S(t)]}") +
theme(plot.title = element_text(hjust = 0.5, size = 13), axis.title = element_text(size = 14),
axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 12))
We use the plot of \(\Phi^{-1}\big\{1-\hat{S}_1(t)\big\}\) against \(\mbox{log}\,t\) to check whether the log-logistic distribution is suitable for the survival times as follows:
y_axis <- qnorm(1 - est_km1$surv)
x_axis <- log(est_km1$time)
ggplot(data.frame(x = x_axis, y = y_axis), mapping = aes(x = x, y = y)) +
geom_step() +
geom_smooth(method = 'lm', formula = y ~ x) +
labs(title = "Suitability of Log-normal Distribution", x = expression(log(t)), y = "log{S(t) / [1-S(t)]}") +
theme(plot.title = element_text(hjust = 0.5, size = 13), axis.title = element_text(size = 14),
axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 12))
According to the three plots above, we find that the plot of \(\Phi^{-1}\big\{1-\hat{S}_1(t)\big\}\) against \(\mbox{log}\,t\) is better approximated as a straight line, which indicates that the log-normal distribution is most suitable for the survival times. This conclusion is consistent with the guess in question 4.1.
We filter the patients with \(\text{age}\geq55\).
data2 <- kidneyTrans %>% filter(age >= 55)
Denote \(\hat{S}_2(t)\) be the Kaplan–Meier estimate of the survival function for patients with \(\text{age}\geq55\).
est_km2 <- survfit(Surv(time, death) ~ 1, conf.type = "plain", conf.int = 0.9, data = data2)
summary(est_km2)
## Call: survfit(formula = Surv(time, death) ~ 1, data = data2, conf.type = "plain",
## conf.int = 0.9)
##
## time n.risk n.event survival std.err lower 90% CI upper 90% CI
## 2 199 1 0.995 0.00501 0.987 1.000
## 7 198 1 0.990 0.00707 0.978 1.000
## 10 195 1 0.985 0.00867 0.971 0.999
## 21 192 1 0.980 0.01003 0.963 0.996
## 26 191 1 0.975 0.01121 0.956 0.993
## 43 190 2 0.964 0.01323 0.943 0.986
## 45 186 1 0.959 0.01414 0.936 0.982
## 50 185 1 0.954 0.01498 0.929 0.979
## 52 183 1 0.949 0.01578 0.923 0.975
## 56 182 1 0.944 0.01653 0.916 0.971
## 62 181 1 0.938 0.01725 0.910 0.967
## 69 179 1 0.933 0.01793 0.904 0.963
## 78 178 1 0.928 0.01858 0.897 0.958
## 79 177 1 0.923 0.01920 0.891 0.954
## 97 172 1 0.917 0.01982 0.885 0.950
## 98 171 1 0.912 0.02042 0.878 0.945
## 106 170 1 0.907 0.02099 0.872 0.941
## 158 162 1 0.901 0.02160 0.865 0.936
## 162 161 1 0.895 0.02217 0.859 0.932
## 206 155 1 0.890 0.02277 0.852 0.927
## 291 144 1 0.883 0.02344 0.845 0.922
## 311 142 1 0.877 0.02408 0.838 0.917
## 334 140 1 0.871 0.02471 0.830 0.912
## 391 134 1 0.864 0.02537 0.823 0.906
## 402 132 1 0.858 0.02601 0.815 0.901
## 421 129 1 0.851 0.02664 0.807 0.895
## 439 127 1 0.844 0.02726 0.800 0.889
## 481 121 1 0.838 0.02792 0.792 0.883
## 490 119 1 0.830 0.02855 0.784 0.877
## 583 111 1 0.823 0.02926 0.775 0.871
## 652 104 1 0.815 0.03003 0.766 0.864
## 730 97 1 0.807 0.03087 0.756 0.857
## 790 92 1 0.798 0.03176 0.746 0.850
## 806 91 1 0.789 0.03260 0.736 0.843
## 864 88 1 0.780 0.03344 0.725 0.835
## 946 82 1 0.771 0.03436 0.714 0.827
## 1013 80 1 0.761 0.03525 0.703 0.819
## 1105 76 1 0.751 0.03618 0.691 0.811
## 1210 73 1 0.741 0.03712 0.680 0.802
## 1357 65 1 0.729 0.03826 0.666 0.792
## 1384 62 1 0.718 0.03941 0.653 0.782
## 1418 60 1 0.706 0.04053 0.639 0.772
## 1509 55 1 0.693 0.04177 0.624 0.761
## 1734 48 1 0.678 0.04332 0.607 0.750
## 2056 31 1 0.656 0.04713 0.579 0.734
## 2171 29 1 0.634 0.05065 0.551 0.717
## 2291 26 1 0.609 0.05425 0.520 0.699
## 2313 24 1 0.584 0.05763 0.489 0.679
## 2489 18 1 0.552 0.06290 0.448 0.655
plot(est_km2, main = "90% Confidence interval without transformation", xlab = "Time (Days)", ylab = "Estimated survival function")
According to the Kaplan–Meier estimate of the survival function \(\hat{S}_1(t)\) and \(\hat{S}_2(t)\). We choose \(p\) to range from \(0.01\) to \(0.33\) with increment \(0.01\). We obtain \(p\)-th quantile of the survival times for the two groups as follows:
km_quantile <- function(time, survival, p) {
if (p > 1 | p < 0) {
stop("p should range between 0 and 1.")
}
n <- length(time)
index <- 0
result <- numeric(length = 1)
for (i in 1:n) {
if (survival[i] == 1 - p) {
result <- (time[i] + time[i + 1]) / 2
index <- 1
break
}
}
if (index == 0) {
for (i in 1:n) {
if (survival[i] < 1 - p) {
result <- time[i]
index <- 1
break
}
}
}
if (index == 0) {
stop("The quantile can not be estimated.")
}
return(result)
}
p <- seq(0.01, 0.33, by = 0.01)
index1 <- est_km1$n.censor
time1 <- est_km1$time[index1 == 0]
surv1 <- est_km1$surv[index1 == 0]
t1 <- numeric(length = length(p))
for (i in 1:length(p)) {
t1[i] <- km_quantile(time1, surv1, p[i])
}
index2 <- est_km2$n.censor
time2 <- est_km2$time[index2 == 0]
surv2 <- est_km2$surv[index2 == 0]
t2 <- numeric(length = length(p))
for (i in 1:length(p)) {
t2[i] <- km_quantile(time2, surv2, p[i])
}
df <- data.frame(p = p, t1 = t1, t2 = t2)
df %>% kable()
| p | t1 | t2 |
|---|---|---|
| 0.01 | 28 | 7 |
| 0.02 | 57 | 21 |
| 0.03 | 68 | 45 |
| 0.04 | 98 | 45 |
| 0.05 | 154 | 52 |
| 0.06 | 228 | 62 |
| 0.07 | 252 | 78 |
| 0.08 | 344 | 97 |
| 0.09 | 354 | 106 |
| 0.10 | 614 | 206 |
| 0.11 | 776 | 206 |
| 0.12 | 875 | 311 |
| 0.13 | 943 | 391 |
| 0.14 | 1164 | 402 |
| 0.15 | 1186 | 439 |
| 0.16 | 1326 | 481 |
| 0.17 | 1388 | 583 |
| 0.18 | 1777 | 652 |
| 0.19 | 1835 | 730 |
| 0.20 | 1940 | 790 |
| 0.21 | 2276 | 806 |
| 0.22 | 2276 | 946 |
| 0.23 | 2414 | 1013 |
| 0.24 | 2414 | 1105 |
| 0.25 | 2421 | 1210 |
| 0.26 | 2557 | 1357 |
| 0.27 | 2567 | 1357 |
| 0.28 | 2650 | 1384 |
| 0.29 | 2795 | 1418 |
| 0.30 | 2795 | 1509 |
| 0.31 | 3146 | 1734 |
| 0.32 | 3146 | 1734 |
| 0.33 | 3146 | 2056 |
ggplot(data.frame(x = t1, y = t2), aes(x = x, y = y)) +
geom_point() + geom_smooth(span = 1) +
labs(title = "Q-Q Plot", x = bquote('quantile for' ~ 35 <= ~ 'age' < 55 ~ 'group' ), y = bquote('quantile for age' >= 55 ~ 'group')) +
theme(plot.title = element_text(hjust = 0.5, size = 13), axis.title = element_text(size = 14),
axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 12))
We find that the Q-Q plot is not a straight line, which indicates that the AFT model may not valid.
We use the plot of \(\mbox{log}\big\{-\mbox{log}\,S(t)\big\}\) against \(t\) to check the proportional hazards assumptions as follows:
y1 <- log(-log(est_km1$surv))
y2 <- log(-log(est_km2$surv))
t1 <- est_km1$time
t2 <- est_km2$time
ggplot() +
geom_step(aes(x = t1, y = y1, color = "35<=age<55"), size = 1) +
geom_step(aes(x = t2, y = y2, color = "age>=55"), size = 1) +
labs(title = "Kaplan-Meier estimate in the log{log[S(t)]}", x = "Time", y = "log{-log[S(t)]}") +
theme(plot.title = element_text(hjust = 0.5, size = 13), axis.title = element_text(size = 14),
axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 12))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
According to the plot above, we find that the two curves demonstrate constant vertical shift, which indicates the proportional hazards assumption is valid. Therefore, the Cox model is appropriate to compare the survival times of the two groups of patients.