Introduction
The results here accompany the paper “Effective Ways to Build and Evaluate Individual Survival Distributions” by Haider et al. In total, 33 additional datasets were tested to validate the findings given by Table 7 in Section 5 of the paper. The (Low-Censoring, Low-Dimensional), (High-Censoring, Low-Dimensional), (Low-Censoring, High-Dimensional), and (High-Censoring, High-Dimensional) categories contain 16, 12, 4, and 1 additional datasets, respectively, which we use to check how well the findings generalize. Each of the following sections gives further details about the corresponding datasets. Here we define “High-Censoring” as a censoring rate of \(\geq\) 70%. Additionally, all “High-Dimensional” datasets have at least 274 features.
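As a minimal sketch of how a dataset falls into one of the four categories, the snippet below applies the 70% censoring threshold stated above. The text only says that every High-Dimensional dataset has at least 274 features, so using 274 as a cutoff is an assumption made for illustration, and the function name `categorize` is hypothetical. The example values are taken from the summary tables in this document.

```python
HIGH_CENSORING_PCT = 70.0    # stated above: "High-Censoring" means >= 70% censored
HIGH_DIM_MIN_FEATURES = 274  # ASSUMPTION: smallest High-Dimensional feature count, used as a cutoff


def categorize(pct_censored, n_features):
    """Return the (censoring, dimensionality) category of a dataset (hypothetical helper)."""
    censoring = "High-Censoring" if pct_censored >= HIGH_CENSORING_PCT else "Low-Censoring"
    dimension = "High-Dimensional" if n_features >= HIGH_DIM_MIN_FEATURES else "Low-Dimensional"
    return (censoring, dimension)


# (% censored, number of features) as reported in the summary tables.
examples = {"ACC": (63.04, 9), "PRAD": (98.0, 6), "NSBCD": (66.96, 274), "Lung": (72.09, 1121)}
for name, (pct, n_features) in examples.items():
    print(name, categorize(pct, n_features))
```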
Low-Censoring, Low-Dimensional
All datasets in this category were collected from the TCGA website (details given in the paper). Specifically, here we have 16 datasets corresponding to survival time from different cancer types: Adrenocortical carcinoma (ACC), Bladder Urothelial Carcinoma (BLCA), Cholangiocarcinoma (CHOL), Esophageal carcinoma (ESCA), Head and Neck squamous cell carcinoma (HNSC), Kidney renal clear cell carcinoma (KIRC), Liver hepatocellular carcinoma (LIHC), Lung adenocarcinoma (LUAD), Lung squamous cell carcinoma (LUSC), Ovarian serous cystadenocarcinoma (OV), Pancreatic adenocarcinoma (PAAD), Sarcoma (SARC), Skin Cutaneous Melanoma (SKCM), Stomach adenocarcinoma (STAD), Stomach and Esophageal carcinoma (STES), and Uterine Carcinosarcoma (UCS). Table 1 gives summary statistics for each of the 16 datasets.
Table 1: Summary statistics for the 16 (Low-Censoring, Low-Dimensional) datasets. From top to bottom we have (1) the total number of patients, (2) the % of censored patients, and (3) the total number of features used (after feature selection).
| | ACC | BLCA | CHOL | ESCA | HNSC | KIRC | LIHC | LUAD | LUSC | OV | PAAD | SARC | SKCM | STAD | STES | UCS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N | 92 | 409 | 45 | 185 | 526 | 537 | 376 | 513 | 498 | 576 | 185 | 261 | 460 | 436 | 621 | 57 |
| % Censored | 63.04 | 55.99 | 51.11 | 58.38 | 57.6 | 67.04 | 64.89 | 64.13 | 56.83 | 40.45 | 45.95 | 62.07 | 51.96 | 61.01 | 60.23 | 38.6 |
| Number of Features | 9 | 13 | 5 | 13 | 20 | 14 | 12 | 14 | 12 | 2 | 10 | 5 | 18 | 17 | 18 | 4 |
Calibration
Concerning calibration for the (Low-Censoring, Low-Dimensional) datasets, our findings were that MTLR was superior to other models. The D-Calibration results (Table 2) support this, as MTLR is D-Calibrated for all 16 datasets (though AFT and Cox-KP are also D-Calibrated for all datasets). The 1-Calibration results (Table 3) are more discriminative, showing that MTLR and CoxEN-KP outperform all other models. Although CoxEN-KP is competitive with MTLR on these datasets, MTLR performs better overall once the results from the paper are taken into account, which leads us to recommend MTLR when measuring calibration on (Low-Censoring, Low-Dimensional) datasets.
Note that we only include the accumulated 1-Calibration results here (the exact \(p\)-values are available upon request).
Table 2: D-Calibration results for the 16 (Low-Censoring, Low-Dimensional) datasets. The Total column gives the total number of datasets for which each model was D-Calibrated.
| | ACC | BLCA | CHOL | ESCA | HNSC | KIRC | LIHC | LUAD | LUSC | OV | PAAD | SARC | SKCM | STAD | STES | UCS | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.998 | 16 |
| AFT | 0.985 | 0.471 | 0.456 | 0.948 | 0.136 | 0.934 | 0.959 | 0.970 | 0.947 | 0.134 | 0.641 | 0.939 | 0.412 | 0.726 | 0.826 | 0.817 | 16 |
| Cox-KP | 0.929 | 0.984 | 0.610 | 0.973 | 0.745 | 0.927 | 0.986 | 0.999 | 0.989 | 0.939 | 0.635 | 0.956 | 0.128 | 0.969 | 0.958 | 0.518 | 16 |
| CoxEN-KP | 0.664 | 0.995 | 0.971 | 1.000 | 0.935 | 0.598 | 0.981 | 0.999 | 1.000 | 0.985 | 0.413 | 1.000 | 0.018 | 0.967 | 0.999 | 0.819 | 15 |
| RSF-KM | 0.982 | 0.562 | 0.859 | 0.661 | 0.253 | 0.508 | 0.312 | 0.996 | 0.000 | 0.000 | 0.490 | 0.000 | 0.661 | 0.063 | 0.000 | 0.607 | 12 |
| MTLR | 0.974 | 1.000 | 0.994 | 0.984 | 0.583 | 0.994 | 0.986 | 0.997 | 0.869 | 0.886 | 0.702 | 0.963 | 0.106 | 0.991 | 0.978 | 0.852 | 16 |
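The Total column of Table 2 can be reproduced from the per-dataset values with a simple count. The sketch below assumes the entries are \(p\)-values and that a model counts as D-Calibrated when its \(p\)-value is at least 0.05; that threshold is our assumption here (it is consistent with the reported totals, but it is not stated in this document). The helper name `count_d_calibrated` is hypothetical.

```python
ALPHA = 0.05  # ASSUMPTION: significance level used to declare a model D-Calibrated

# Values copied from the CoxEN-KP and RSF-KM rows of Table 2 (ACC ... UCS).
p_values = {
    "CoxEN-KP": [0.664, 0.995, 0.971, 1.000, 0.935, 0.598, 0.981, 0.999,
                 1.000, 0.985, 0.413, 1.000, 0.018, 0.967, 0.999, 0.819],
    "RSF-KM": [0.982, 0.562, 0.859, 0.661, 0.253, 0.508, 0.312, 0.996,
               0.000, 0.000, 0.490, 0.000, 0.661, 0.063, 0.000, 0.607],
}


def count_d_calibrated(pvals, alpha=ALPHA):
    """Count the datasets on which the calibration test is not rejected."""
    return sum(1 for p in pvals if p >= alpha)


totals = {model: count_d_calibrated(pvals) for model, pvals in p_values.items()}
print(totals)
```

Under this assumption the counts agree with the Total column for both rows (15 for CoxEN-KP and 12 for RSF-KM).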
Table 3: The 1-Calibration results for the 16 (Low-Censoring, Low-Dimensional) datasets. Each column represents the 10th, 25th, 50th, 75th, and 90th percentiles of event times. Each row represents the total number of datasets for which each model was 1-Calibrated (the maximum being 16).
| | 10th | 25th | 50th | 75th | 90th |
|---|---|---|---|---|---|
| AFT | 10 | 10 | 5 | 5 | 3 |
| Cox-KP | 12 | 11 | 6 | 4 | 6 |
| CoxEN-KP | 16 | 14 | 12 | 7 | 5 |
| RSF-KM | 0 | 0 | 0 | 0 | 0 |
| MTLR | 16 | 15 | 11 | 10 | 2 |
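For readers unfamiliar with 1-Calibration, the following is a simplified sketch of a Hosmer-Lemeshow-style statistic at a single time point: patients are sorted by their predicted probability of an event by that time, split into groups, and the observed number of events in each group is compared with the expected number. This sketch ignores censoring, which the evaluation in the paper has to account for, so it should be read as an illustration of the idea rather than the procedure used to produce Table 3. The function name and arguments are hypothetical.

```python
def hosmer_lemeshow_statistic(pred_event_prob, event_by_t, n_bins=10):
    """Hosmer-Lemeshow statistic for uncensored data at one time point.

    pred_event_prob[i] is the predicted probability that patient i has the
    event by time t; event_by_t[i] is True if the event was observed by t.
    """
    pairs = sorted(zip(pred_event_prob, event_by_t))
    n = len(pairs)
    statistic = 0.0
    for b in range(n_bins):
        # Split the sorted patients into n_bins groups of near-equal size.
        group = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not group:
            continue
        n_b = len(group)
        mean_prob = sum(p for p, _ in group) / n_b
        observed = sum(1 for _, e in group if e)
        variance = n_b * mean_prob * (1.0 - mean_prob)
        if variance > 0:
            statistic += (observed - n_b * mean_prob) ** 2 / variance
    return statistic


# Well-calibrated toy data: 1 of 4 events at p = 0.25 and 3 of 4 at p = 0.75.
probs = [0.25] * 4 + [0.75] * 4
events = [True, False, False, False, True, True, True, False]
print(hosmer_lemeshow_statistic(probs, events, n_bins=2))
```

A small statistic means observed and expected event counts agree; in a full test the statistic would be compared against a chi-squared distribution to obtain the \(p\)-value.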
Discrimination
Since discrimination is primarily evaluated with concordance, we present only this result and omit the Integrated Brier score and the L1-loss (these results are available from the authors upon request). Table 4 presents the concordance results, which are consistent with the findings of the paper: although other models (here MTLR and CoxEN-KP) generally outperform AFT and Cox-KP, the performance gain is not significant. Further, we again see the poor performance of RSF-KM on these Low-Dimensional datasets. Given the simplicity of AFT and Cox-KP, we suggest using these models for discrimination on (Low-Censoring, Low-Dimensional) datasets.
Table 4: Concordance results for the 16 (Low-Censoring, Low-Dimensional) datasets. Here we present the mean and standard deviation of the Concordance across the 5 cross-validation folds. Bolded values represent the best (highest Concordance) performing models.
| | ACC | BLCA | CHOL | ESCA | HNSC | KIRC | LIHC | LUAD | LUSC | OV | PAAD | SARC | SKCM | STAD | STES | UCS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KM | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) |
| AFT | 0.793 (0.07) | 0.642 (0.032) | 0.568 (0.259) | 0.582 (0.087) | 0.659 (0.02) | 0.742 (0.032) | 0.621 (0.038) | 0.676 (0.021) | 0.578 (0.019) | 0.601 (0.022) | 0.634 (0.052) | 0.646 (0.07) | 0.749 (0.023) | 0.697 (0.039) | 0.646 (0.034) | 0.683 (0.088) |
| Cox-KP | 0.796 (0.084) | 0.648 (0.042) | 0.568 (0.259) | 0.575 (0.085) | 0.662 (0.025) | 0.74 (0.033) | 0.622 (0.035) | 0.675 (0.021) | 0.578 (0.02) | 0.602 (0.023) | 0.635 (0.048) | 0.646 (0.071) | 0.754 (0.021) | 0.697 (0.032) | 0.644 (0.032) | 0.692 (0.084) |
| CoxEN-KP | 0.749 (0.096) | 0.66 (0.045) | 0.69 (0.16) | 0.539 (0.089) | 0.647 (0.029) | 0.752 (0.021) | 0.592 (0.053) | 0.687 (0.027) | 0.508 (0.047) | 0.6 (0.022) | 0.631 (0.056) | 0.623 (0.08) | 0.757 (0.022) | 0.698 (0.033) | 0.623 (0.034) | 0.697 (0.088) |
| RSF-KM | 0.762 (0.131) | 0.589 (0.032) | 0.691 (0.073) | 0.628 (0.113) | 0.641 (0.051) | 0.741 (0.021) | 0.628 (0.044) | 0.642 (0.052) | 0.541 (0.034) | 0.517 (0.016) | 0.619 (0.057) | 0.604 (0.065) | 0.622 (0.011) | 0.634 (0.028) | 0.616 (0.052) | 0.604 (0.086) |
| MTLR | 0.803 (0.097) | 0.661 (0.041) | 0.764 (0.132) | 0.634 (0.083) | 0.676 (0.017) | 0.756 (0.007) | 0.623 (0.029) | 0.694 (0.029) | 0.595 (0.009) | 0.601 (0.021) | 0.618 (0.071) | 0.646 (0.073) | 0.758 (0.027) | 0.709 (0.032) | 0.656 (0.031) | 0.697 (0.089) |
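As a reminder of what the concordance values in Table 4 measure, here is a minimal sketch of a Harrell-style concordance index for right-censored data. It scores each comparable pair (a patient with an observed event and any patient with a later observed time) by whether the model assigned the higher risk to the patient who died first, with ties in risk counted as one half. The function name and the use of a scalar risk score are assumptions for illustration; the exact variant used in the paper may differ.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs that the risk scores order correctly."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk get half credit
    return concordant / comparable if comparable else float("nan")


times = [2, 4, 6, 8]
events = [True, True, False, True]
print(concordance_index(times, events, [4, 3, 2, 1]))  # risks ordered correctly
print(concordance_index(times, events, [1, 1, 1, 1]))  # identical prediction for everyone
```

The second call illustrates why KM scores 0.5 (0) on every dataset: it makes the same prediction for every patient, so every comparable pair is a tie.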
High-Censoring, Low-Dimensional
All datasets in this category were collected from the TCGA website (details given in the paper). Specifically, here we have 12 datasets corresponding to survival time from different cancer types: Cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), Colon adenocarcinoma (COAD), Colorectal adenocarcinoma (COADREAD), FFPE Pilot Phase II (FPPP), Kidney Chromophobe (KICH), Pan-kidney cohort (KIPAN), Kidney renal papillary cell carcinoma (KIRP), Lower Grade Glioma (LGG), Prostate adenocarcinoma (PRAD), Thyroid carcinoma (THCA), Uterine Corpus Endometrial Carcinoma (UCEC), and Uveal Melanoma (UVM). Table 5 gives summary statistics for each of the 12 datasets.
Table 5: Summary statistics for the 12 (High-Censoring, Low-Dimensional) datasets. From top to bottom we have (1) the total number of patients, (2) the % of censored patients, and (3) the total number of features used (after feature selection).
| | CESC | COAD | COADREAD | FPPP | KICH | KIPAN | KIRP | LGG | PRAD | THCA | UCEC | UVM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N | 307 | 456 | 626 | 38 | 112 | 939 | 290 | 513 | 499 | 503 | 546 | 80 |
| % Censored | 76.55 | 77.63 | 79.39 | 81.58 | 89.29 | 75.19 | 84.83 | 75.63 | 98 | 96.82 | 83.33 | 71.25 |
| Number of Features | 15 | 12 | 13 | 3 | 7 | 20 | 12 | 4 | 6 | 14 | 6 | 6 |
Calibration
For calibration on (High-Censoring, Low-Dimensional) datasets, our findings were that MTLR and CoxEN-KP outperformed other models. The D-Calibration (Table 6) results are inconclusive as nearly all models are D-Calibrated for all datasets (recall D-Calibration is optimistic when data is highly censored). The 1-Calibration results (Table 7) show MTLR and CoxEN-KP to outperform all other models with CoxEN-KP slightly outperforming MTLR. Given MTLR’s better performance on the datasets given in the paper, these results support the finding that MTLR and CoxEN-KP are the best (and generally equivalent) models for calibration on (High-Censoring, Low-Dimensional) datasets.
Table 6: D-Calibration results for the 12 (High-Censoring, Low-Dimensional) datasets. The Total column gives the total number of datasets for which each model was D-Calibrated.
| | CESC | COAD | COADREAD | FPPP | KICH | KIPAN | KIRP | LGG | PRAD | THCA | UCEC | UVM | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1 | 1 | 1.000 | 1.000 | 12 |
| AFT | 0.999 | 0.978 | 1.000 | 0.899 | 0.997 | 0.899 | 0.957 | 0.955 | 1 | 1 | 0.966 | 0.988 | 12 |
| Cox-KP | 0.995 | 0.980 | 1.000 | 0.930 | 0.996 | 0.844 | 0.988 | 0.938 | 1 | 0 | 0.998 | 0.969 | 11 |
| CoxEN-KP | 0.979 | 0.904 | 0.987 | 1.000 | 0.999 | 0.593 | 1.000 | 0.819 | 1 | 1 | 0.998 | 0.998 | 12 |
| RSF-KM | 0.934 | 0.800 | 0.897 | 0.999 | 1.000 | 0.441 | 0.940 | 0.556 | 1 | 1 | 0.366 | 0.940 | 12 |
| MTLR | 0.947 | 0.696 | 0.967 | 0.999 | 1.000 | 0.783 | 0.999 | 0.993 | 1 | 1 | 0.965 | 0.994 | 12 |
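One way to see why D-Calibration tends to be optimistic under heavy censoring is to look at how patients are assigned to probability bins. The sketch below follows our understanding of the treatment in the main paper: a patient with an observed event contributes a full count to the bin containing the predicted survival probability at the event time, while a censored patient's unit of mass is spread over that bin and all bins below it. The function names are hypothetical, and the sketch stops at the chi-squared statistic rather than computing a \(p\)-value.

```python
def d_calibration_bins(surv_at_time, is_event, n_bins=10):
    """Return the (possibly fractional) patient count in each probability bin.

    surv_at_time[i] is the predicted survival probability S_i(t_i) at patient
    i's observed (event or censoring) time.
    """
    width = 1.0 / n_bins
    counts = [0.0] * n_bins
    for s, event in zip(surv_at_time, is_event):
        k = min(int(s * n_bins), n_bins - 1)  # bin containing s; s == 1.0 goes to the top bin
        if event:
            counts[k] += 1.0
        elif s <= 0.0:
            counts[0] += 1.0
        else:
            # Censored: the true S_i(event time) lies somewhere in [0, s], so
            # spread one unit of mass over that interval in proportion to bin width.
            counts[k] += max(s - k * width, 0.0) / s
            for j in range(k):
                counts[j] += width / s
    return counts


def chi_squared_statistic(counts):
    """Pearson statistic against a uniform distribution over the bins."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# Ten patients censored while their predicted survival probability is still 1.0.
early_censored = d_calibration_bins([1.0] * 10, [False] * 10)
print(early_censored, chi_squared_statistic(early_censored))
```

In this toy case every bin receives the same mass and the statistic is essentially zero regardless of the model, which is the sense in which censored patients carry little evidence against calibration.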
Table 7: The 1-Calibration results for the 12 (High-Censoring, Low-Dimensional) datasets. Each column represents the 10th, 25th, 50th, 75th, and 90th percentiles of event times. Each row represents the total number of datasets for which each model was 1-Calibrated (the maximum being 12).
| | 10th | 25th | 50th | 75th | 90th |
|---|---|---|---|---|---|
| AFT | 4 | 7 | 5 | 2 | 1 |
| Cox-KP | 5 | 5 | 6 | 2 | 1 |
| CoxEN-KP | 8 | 10 | 10 | 5 | 3 |
| RSF-KM | 3 | 2 | 1 | 0 | 0 |
| MTLR | 9 | 10 | 6 | 3 | 2 |
Discrimination
Table 8 presents concordance results for (High-Censoring, Low-Dimensional) datasets and we see results consistent with the findings of the paper – (1) CoxEN-KP and MTLR are superior to AFT and Cox and (2) RSF-KM typically sees noticeably worse performance. This supports our recommendation to use MTLR or CoxEN-KP when evaluating discrimination on (High-Censoring, Low-Dimensional) datasets.
Table 8: Concordance results for the 12 (High-Censoring, Low-Dimensional) datasets. Here we present the mean and standard deviation of the Concordance across the 5 cross-validation folds. Bolded values represent the best (highest Concordance) performing models.
| | CESC | COAD | COADREAD | FPPP | KICH | KIPAN | KIRP | LGG | PRAD | THCA | UCEC | UVM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KM | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) |
| AFT | 0.816 (0.015) | 0.73 (0.042) | 0.731 (0.036) | 0.531 (0.386) | 0.752 (0.422) | 0.782 (0.058) | 0.689 (0.169) | 0.781 (0.08) | 0.778 (0.273) | 0.705 (0.285) | 0.696 (0.079) | 0.736 (0.157) |
| Cox-KP | 0.817 (0.018) | 0.731 (0.04) | 0.732 (0.037) | 0.531 (0.386) | 0.836 (0.242) | 0.78 (0.06) | 0.683 (0.165) | 0.781 (0.078) | 0.774 (0.277) | 0.675 (0.3) | 0.695 (0.08) | 0.711 (0.134) |
| CoxEN-KP | 0.826 (0.032) | 0.741 (0.044) | 0.743 (0.038) | 0.641 (0.267) | 0.958 (0.052) | 0.794 (0.051) | 0.787 (0.104) | 0.779 (0.078) | 0.755 (0.282) | 0.955 (0.032) | 0.694 (0.081) | 0.733 (0.145) |
| RSF-KM | 0.829 (0.017) | 0.682 (0.041) | 0.709 (0.062) | 0.539 (0.376) | 0.923 (0.073) | 0.772 (0.058) | 0.801 (0.091) | 0.709 (0.066) | 0.778 (0.291) | 0.950 (0.027) | 0.614 (0.042) | 0.712 (0.077) |
| MTLR | 0.823 (0.036) | 0.746 (0.043) | 0.747 (0.037) | 0.721 (0.287) | 0.952 (0.047) | 0.795 (0.05) | 0.826 (0.069) | 0.781 (0.073) | 0.672 (0.4) | 0.913 (0.08) | 0.693 (0.078) | 0.737 (0.218) |
Low-Censoring, High-Dimensional
Here we examine four (Low-Censoring, High-Dimensional) datasets containing gene-expression data related to the survival of cancer patients; see “A Multi-Task Learning Formulation for Survival Analysis” by Li et al. for details regarding dataset access. Specifically, we use (1) the Norway/Stanford breast cancer data (NSBCD), which contains gene-expression data for 115 female breast cancer patients, (2) Van de Vijver’s microarray breast cancer data (VDV), which contains gene-expression data for 78 breast cancer patients, (3) gene-expression data for 116 adult myeloid leukemia (AML) patients, and (4) gene-expression data for 92 mantle cell lymphoma (MCL) patients. Table 9 gives the summary statistics for each of these 4 datasets.
Table 9: Summary statistics for the 4 (Low-Censoring, High-Dimensional) datasets. From top to bottom we have (1) the total number of patients, (2) the % of censored patients, and (3) the total number of features used (after feature selection).
| | AML | MCL | NSBCD | VDV |
|---|---|---|---|---|
| N | 116 | 92 | 115 | 78 |
| % Censored | 42.24 | 30.43 | 66.96 | 56.41 |
| Number of Features | 1616 | 2176 | 274 | 1319 |
Calibration
For calibration on (Low-Censoring, High-Dimensional) datasets, our findings were that MTLR, CoxEN-KP, and RSF-KM were all equivalent. The D-Calibration results (Table 10) show that AFT fails D-Calibration on all datasets (and Cox fails to finish on these datasets). The 1-Calibration results (Table 11) show that MTLR, CoxEN-KP, and RSF-KM are roughly equivalent across the different percentiles, supporting our initial finding that MTLR, CoxEN-KP, and RSF-KM are equivalent for calibration on (Low-Censoring, High-Dimensional) datasets.
Table 10: D-Calibration results for the 4 (Low-Censoring, High-Dimensional) datasets. The Total column gives the total number of datasets for which each model was D-Calibrated. Note there are no results for Cox as it failed to run on all datasets.
| | AML | MCL | NSBCD | VDV | Total |
|---|---|---|---|---|---|
| KM | 1.000 | 1.000 | 1.000 | 1.000 | 4 |
| AFT | 0.000 | 0.000 | 0.000 | 0.000 | 0 |
| CoxEN-KP | 0.897 | 0.228 | 0.992 | 0.998 | 4 |
| RSF-KM | 0.759 | 0.577 | 0.977 | 0.000 | 3 |
| MTLR | 0.529 | 0.391 | 0.981 | 0.996 | 4 |
Table 11: The 1-Calibration results for the 4 (Low-Censoring, High-Dimensional) datasets. Each column represents the 10th, 25th, 50th, 75th, and 90th percentiles of event times. Each row represents the total number of datasets for which each model was 1-Calibrated (the maximum being 4). Note there are no results for Cox as it failed to run on all datasets.
| | 10th | 25th | 50th | 75th | 90th |
|---|---|---|---|---|---|
| AFT | 0 | 0 | 0 | 0 | 0 |
| CoxEN-KP | 4 | 4 | 4 | 2 | 1 |
| RSF-KM | 4 | 4 | 3 | 3 | 1 |
| MTLR | 2 | 3 | 4 | 4 | 3 |
Discrimination
Table 12 presents the concordance results for (Low-Censoring, High-Dimensional) datasets. The findings in the paper (that MTLR and CoxEN-KP are superior to AFT and RSF-KM) are generally supported, and the superiority over AFT holds on every dataset. CoxEN-KP outperforms RSF-KM on every dataset except VDV, and MTLR essentially matches or outperforms RSF-KM on all four datasets.
Table 12: Concordance results for the 4 (Low-Censoring, High-Dimensional) datasets. Here we present the mean and standard deviation of the Concordance across the 5 cross-validation folds. Bolded values represent the best (highest Concordance) performing models. Note there are no results for Cox as it failed to run on all datasets.
| | AML | MCL | NSBCD | VDV |
|---|---|---|---|---|
| KM | 0.5 (0) | 0.5 (0) | 0.5 (0) | 0.5 (0) |
| AFT | 0.531 (0.074) | 0.496 (0.063) | 0.568 (0.129) | 0.469 (0.084) |
| CoxEN-KP | 0.656 (0.124) | 0.788 (0.042) | 0.784 (0.086) | 0.679 (0.187) |
| RSF-KM | 0.652 (0.089) | 0.764 (0.058) | 0.736 (0.101) | 0.725 (0.13) |
| MTLR | 0.651 (0.092) | 0.805 (0.034) | 0.776 (0.071) | 0.733 (0.157) |
High-Censoring, High-Dimensional
Here we examine 1 additional (High-Censoring, High-Dimensional) dataset, namely gene-expression data for 86 early lung adenocarcinoma (Lung) patients; see “A Multi-Task Learning Formulation for Survival Analysis” by Li et al. for details regarding dataset access. Table 12 gives the summary statistics for this dataset.
Table 12: Summary statistics for the (High-Censoring, High-Dimensional) dataset. From top to bottom we have (1) the total number of patients, (2) the % of censored patients, and (3) the total number of features used (after feature selection).
| | Lung |
|---|---|
| N | 86 |
| % Censored | 72.09 |
| Number of Features | 1121 |
Calibration
For calibration on (High-Censoring, High-Dimensional) datasets, our findings were that MTLR, CoxEN-KP, and RSF-KM were all equivalent. The D-Calibration results (Table 13) show that AFT is the only model to fail D-Calibration (note Cox failed to finish). Similarly, the 1-Calibration results (Table 14) show that MTLR, CoxEN-KP and RSF-KM are equivalent across the different percentiles suggesting our finding is consistent for this dataset as well.
Table 13: D-Calibration results for the (High-Censoring, High-Dimensional) dataset. Note the Cox model does not appear as it failed to finish on this dataset.
| | Lung |
|---|---|
| KM | 1.000 |
| AFT | 0.000 |
| CoxEN-KP | 0.922 |
| RSF-KM | 0.887 |
| MTLR | 0.986 |
Table 14: The 1-Calibration results for the (High-Censoring, High-Dimensional) dataset. Each column represents the 10th, 25th, 50th, 75th, and 90th percentiles of event times. Each row represents the total number of datasets for which each model was 1-Calibrated (the maximum being 1, since there is a single dataset). Note the Cox model does not appear as it failed to finish on this dataset.
| | 10th | 25th | 50th | 75th | 90th |
|---|---|---|---|---|---|
| AFT | 0 | 0 | 0 | 0 | 0 |
| CoxEN-KP | 1 | 1 | 1 | 1 | 0 |
| RSF-KM | 1 | 1 | 1 | 1 | 0 |
| MTLR | 1 | 1 | 1 | 1 | 0 |
Discrimination
Table 15 presents the concordance results for the (High-Censoring, High-Dimensional) dataset. The findings in the paper (that MTLR, CoxEN-KP, and RSF-KM are equivalent and also superior to AFT) are supported (although RSF-KM has much higher variance).
Table 15: Concordance results for the (High-Censoring, High-Dimensional) dataset. Here we present the mean and standard deviation of the Concordance across the 5 cross-validation folds. Bolded values represent the best (highest Concordance) performing models. Note the Cox model does not appear as it failed to finish on this dataset.
| | Lung |
|---|---|
| KM | 0.5 (0) |
| AFT | 0.668 (0.172) |
| CoxEN-KP | 0.932 (0.039) |
| RSF-KM | 0.816 (0.173) |
| MTLR | 0.934 (0.029) |
