1 Introduction

The results here accompany the paper “Effective Ways to Build and Evaluate Individual Survival Distributions” by Haider et al. In total, 33 additional datasets were tested to validate the findings given by Table 7 in Section 5 of the paper. The (Low-Censoring, Low-Dimensional), (High-Censoring, Low-Dimensional), (Low-Censoring, High-Dimensional), and (High-Censoring, High-Dimensional) categories contain 16, 12, 4, and 1 additional datasets, respectively, which we use to check how well the findings generalize. Each of the following sections gives further details about the corresponding datasets. Here we define “High-Censoring” to mean that \(\geq\) 70% of patients are censored. Additionally, the “High-Dimensional” datasets all have at least 274 features.
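
The category assignment can be illustrated with a short sketch. The function name `categorize` is hypothetical, and the 274-feature cut-off is an assumption: the text only states that every High-Dimensional dataset has at least 274 features, not that 274 is a formal threshold.

```python
from typing import Tuple

# ">= 70% censored" is the definition of High-Censoring given in the text.
HIGH_CENSORING_PCT = 70.0
# Assumed cut-off: the smallest feature count among the High-Dimensional
# datasets, not a threshold stated by the authors.
HIGH_DIM_MIN_FEATURES = 274


def categorize(pct_censored: float, n_features: int) -> Tuple[str, str]:
    """Return the (censoring, dimensionality) category of a dataset."""
    censoring = "High-Censoring" if pct_censored >= HIGH_CENSORING_PCT else "Low-Censoring"
    dimension = "High-Dimensional" if n_features >= HIGH_DIM_MIN_FEATURES else "Low-Dimensional"
    return censoring, dimension


# Values taken from the summary tables in this supplement.
print(categorize(63.04, 9))     # ACC
print(categorize(71.25, 6))     # UVM
print(categorize(66.96, 274))   # NSBCD
print(categorize(72.09, 1121))  # Lung
```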


2 Low-Censoring, Low-Dimensional

All datasets in this category were collected from the TCGA website (details given in the paper). Specifically, here we have 16 datasets corresponding to survival time from different cancer types: Adrenocortical carcinoma (ACC), Bladder Urothelial Carcinoma (BLCA), Cholangiocarcinoma (CHOL), Esophageal carcinoma (ESCA), Head and Neck squamous cell carcinoma (HNSC), Kidney renal clear cell carcinoma (KIRC), Liver hepatocellular carcinoma (LIHC), Lung adenocarcinoma (LUAD), Lung squamous cell carcinoma (LUSC), Ovarian serous cystadenocarcinoma (OV), Pancreatic adenocarcinoma (PAAD), Sarcoma (SARC), Skin Cutaneous Melanoma (SKCM), Stomach adenocarcinoma (STAD), Stomach and Esophageal carcinoma (STES), and Uterine Carcinosarcoma (UCS). Table 1 gives summary statistics for each of the 16 datasets.

Table 1: Summary statistics for the 16 (Low-Censoring, Low-Dimensional) datasets. From top to bottom we have (1) the total number of patients, (2) the % of censored patients, and (3) the total number of features used (after feature selection).
Dataset             ACC    BLCA   CHOL   ESCA   HNSC   KIRC   LIHC   LUAD   LUSC   OV     PAAD   SARC   SKCM   STAD   STES   UCS
N                   92     409    45     185    526    537    376    513    498    576    185    261    460    436    621    57
% Censored          63.04  55.99  51.11  58.38  57.6   67.04  64.89  64.13  56.83  40.45  45.95  62.07  51.96  61.01  60.23  38.6
Number of Features  9      13     5      13     20     14     12     14     12     2      10     5      18     17     18     4

2.1 Calibration

Concerning calibration for the (Low-Censoring, Low-Dimensional) datasets, our finding was that MTLR was superior to the other models. The D-Calibration results (Table 2) support this, as MTLR is D-Calibrated for all 16 datasets (though AFT and Cox-KP are also D-Calibrated for all datasets). The 1-Calibration results (Table 3) are more discriminative, showing that MTLR and CoxEN-KP outperform all other models. Although CoxEN-KP is competitive with MTLR on these datasets, MTLR performs better overall once the results from the paper are taken into account, leading us to recommend MTLR for calibration on (Low-Censoring, Low-Dimensional) datasets.

Note that we only include the aggregated 1-Calibration results here (the exact \(p\)-values are available upon request).
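
As a rough illustration of where a D-Calibration \(p\)-value such as those in Table 2 comes from, the following is a minimal sketch. It assumes the procedure as described in the paper: each patient's predicted survival probability at their own event time is placed in one of 10 equal-width bins, a censored patient's unit of mass is spread over the bins at or below their predicted probability, and the bin totals are compared to a uniform distribution with a Pearson chi-square test. The function names are hypothetical, and the closed-form tail probability only covers an odd number of degrees of freedom (9 for 10 bins).

```python
import math


def chi2_sf_odd_df(x, df):
    """Upper-tail probability of a chi-square variable with odd `df`."""
    if df % 2 == 0:
        raise ValueError("this closed form only covers odd degrees of freedom")
    if x <= 0:
        return 1.0
    total = 0.0
    term = math.sqrt(x)  # term = x**(j - 1/2) / (1 * 3 * ... * (2j - 1))
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    tail = math.sqrt(2 / math.pi) * math.exp(-x / 2) * total
    return math.erfc(math.sqrt(x / 2)) + tail


def d_calibration(surv_at_time, is_event, n_bins=10):
    """Return (bin_counts, statistic, p_value).

    surv_at_time[i] is the predicted survival probability of patient i at
    that patient's event or censoring time; is_event[i] is True for events.
    """
    width = 1.0 / n_bins
    counts = [0.0] * n_bins
    for s, event in zip(surv_at_time, is_event):
        k = min(int(s * n_bins), n_bins - 1)  # bin containing s
        if event or s <= 0:
            counts[k] += 1.0
        else:
            # Censored: the unseen event probability lies in [0, s], so one
            # unit of mass is spread over that range in proportion to width.
            counts[k] += (s - k * width) / s
            for lower in range(k):
                counts[lower] += width / s
    expected = len(surv_at_time) / n_bins
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    return counts, statistic, chi2_sf_odd_df(statistic, n_bins - 1)


# Ten uncensored patients whose probabilities fall one per bin.
probs = [0.05 + 0.1 * i for i in range(10)]
print(d_calibration(probs, [True] * 10))
```

Under this sketch a model would be called D-Calibrated when the returned \(p\)-value is at least the chosen significance level.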

Table 2: D-Calibration results for the 16 (Low-Censoring, Low-Dimensional) datasets. The Total column gives the total number of datasets for which each model was D-Calibrated.
Model     ACC    BLCA   CHOL   ESCA   HNSC   KIRC   LIHC   LUAD   LUSC   OV     PAAD   SARC   SKCM   STAD   STES   UCS    Total
KM        1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  0.998  16
AFT       0.985  0.471  0.456  0.948  0.136  0.934  0.959  0.970  0.947  0.134  0.641  0.939  0.412  0.726  0.826  0.817  16
Cox-KP    0.929  0.984  0.610  0.973  0.745  0.927  0.986  0.999  0.989  0.939  0.635  0.956  0.128  0.969  0.958  0.518  16
CoxEN-KP  0.664  0.995  0.971  1.000  0.935  0.598  0.981  0.999  1.000  0.985  0.413  1.000  0.018  0.967  0.999  0.819  15
RSF-KM    0.982  0.562  0.859  0.661  0.253  0.508  0.312  0.996  0.000  0.000  0.490  0.000  0.661  0.063  0.000  0.607  12
MTLR      0.974  1.000  0.994  0.984  0.583  0.994  0.986  0.997  0.869  0.886  0.702  0.963  0.106  0.991  0.978  0.852  16

Table 3: The 1-Calibration results for the 16 (Low-Censoring, Low-Dimensional) datasets. Each column represents the 10th, 25th, 50th, 75th, and 90th percentiles of event times. Each row represents the total number of datasets for which each model was 1-Calibrated (the maximum being 16).
Model     10th  25th  50th  75th  90th
AFT       10    10    5     5     3
Cox-KP    12    11    6     4     6
CoxEN-KP  16    14    12    7     5
RSF-KM    0     0     0     0     0
MTLR      16    15    11    10    2
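
The Total column of Table 2 can be reproduced by counting the datasets whose D-Calibration \(p\)-value reaches the significance level. The sketch below assumes a level of 0.05, which is not stated in this supplement but is consistent with the table (0.018 counts as a failure, 0.063 as a pass); the helper name `n_calibrated` is hypothetical.

```python
ALPHA = 0.05  # assumed significance level; consistent with Table 2's Total column

# D-Calibration p-values copied from Table 2 (dataset order: ACC ... UCS).
p_values = {
    "CoxEN-KP": [0.664, 0.995, 0.971, 1.000, 0.935, 0.598, 0.981, 0.999,
                 1.000, 0.985, 0.413, 1.000, 0.018, 0.967, 0.999, 0.819],
    "RSF-KM":   [0.982, 0.562, 0.859, 0.661, 0.253, 0.508, 0.312, 0.996,
                 0.000, 0.000, 0.490, 0.000, 0.661, 0.063, 0.000, 0.607],
}


def n_calibrated(ps, alpha=ALPHA):
    """Number of datasets on which the model is not rejected at level alpha."""
    return sum(p >= alpha for p in ps)


totals = {model: n_calibrated(ps) for model, ps in p_values.items()}
print(totals)  # {'CoxEN-KP': 15, 'RSF-KM': 12}
```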

2.2 Discrimination

Since discrimination is primarily evaluated with concordance, we present only this result and omit the Integrated Brier score and the L1-loss (though these results are available from the authors upon request). Table 4 presents the concordance results, which are consistent with the findings of the paper: although other models (here MTLR and CoxEN-KP) generally outperform AFT and Cox-KP, the performance gain is not significant. Further, we again see the poor performance of RSF-KM on these Low-Dimensional datasets. Given the simplicity of AFT and Cox-KP, we suggest using these models for discrimination on (Low-Censoring, Low-Dimensional) datasets.
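
For readers unfamiliar with concordance, the following is a minimal sketch of a Harrell-style concordance index for censored data. It is not the authors' implementation: the function name and the toy data are hypothetical, and pairs with tied observed times are simply skipped. A pair is comparable when the patient with the shorter observed time had an event; tied risk scores count as half-concordant.

```python
def concordance_index(times, is_event, risk):
    """Fraction of comparable pairs ordered correctly by `risk`.

    A higher risk score should correspond to an earlier event.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not is_event[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four patients, the third is censored.
times = [2, 4, 6, 8]
is_event = [True, True, False, True]
print(concordance_index(times, is_event, [0.9, 0.7, 0.5, 0.1]))  # 1.0
print(concordance_index(times, is_event, [0.5, 0.5, 0.5, 0.5]))  # 0.5
```

The second call shows why a model that gives every patient the same prediction, as KM does, scores 0.5 on every fold.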

Table 4: Concordance results for the 16 (Low-Censoring, Low-Dimensional) datasets. Here we present the mean and standard deviation of the Concordance across the 5 cross-validation folds. Bolded values represent the best (highest Concordance) performing models.
Model     ACC            BLCA           CHOL           ESCA           HNSC           KIRC           LIHC           LUAD           LUSC           OV             PAAD           SARC           SKCM           STAD           STES           UCS
KM        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)
AFT       0.793 (0.07)   0.642 (0.032)  0.568 (0.259)  0.582 (0.087)  0.659 (0.02)   0.742 (0.032)  0.621 (0.038)  0.676 (0.021)  0.578 (0.019)  0.601 (0.022)  0.634 (0.052)  0.646 (0.07)   0.749 (0.023)  0.697 (0.039)  0.646 (0.034)  0.683 (0.088)
Cox-KP    0.796 (0.084)  0.648 (0.042)  0.568 (0.259)  0.575 (0.085)  0.662 (0.025)  0.74 (0.033)   0.622 (0.035)  0.675 (0.021)  0.578 (0.02)   0.602 (0.023)  0.635 (0.048)  0.646 (0.071)  0.754 (0.021)  0.697 (0.032)  0.644 (0.032)  0.692 (0.084)
CoxEN-KP  0.749 (0.096)  0.66 (0.045)   0.69 (0.16)    0.539 (0.089)  0.647 (0.029)  0.752 (0.021)  0.592 (0.053)  0.687 (0.027)  0.508 (0.047)  0.6 (0.022)    0.631 (0.056)  0.623 (0.08)   0.757 (0.022)  0.698 (0.033)  0.623 (0.034)  0.697 (0.088)
RSF-KM    0.762 (0.131)  0.589 (0.032)  0.691 (0.073)  0.628 (0.113)  0.641 (0.051)  0.741 (0.021)  0.628 (0.044)  0.642 (0.052)  0.541 (0.034)  0.517 (0.016)  0.619 (0.057)  0.604 (0.065)  0.622 (0.011)  0.634 (0.028)  0.616 (0.052)  0.604 (0.086)
MTLR      0.803 (0.097)  0.661 (0.041)  0.764 (0.132)  0.634 (0.083)  0.676 (0.017)  0.756 (0.007)  0.623 (0.029)  0.694 (0.029)  0.595 (0.009)  0.601 (0.021)  0.618 (0.071)  0.646 (0.073)  0.758 (0.027)  0.709 (0.032)  0.656 (0.031)  0.697 (0.089)

3 High-Censoring, Low-Dimensional

All datasets in this category were collected from the TCGA website (details given in the paper). Specifically, here we have 12 datasets corresponding to survival time from different cancer types: Cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), Colon adenocarcinoma (COAD), Colorectal adenocarcinoma (COADREAD), FFPE Pilot Phase II (FPPP), Kidney Chromophobe (KICH), Pan-kidney cohort (KIPAN), Kidney renal papillary cell carcinoma (KIRP), Lower Grade Glioma (LGG), Prostate adenocarcinoma (PRAD), Thyroid carcinoma (THCA), Uterine Corpus Endometrial Carcinoma (UCEC), and Uveal Melanoma (UVM). Table 5 gives summary statistics for each of the 12 datasets.

Table 5: Summary statistics for the 12 (High-Censoring, Low-Dimensional) datasets. From top to bottom we have (1) the total number of patients, (2) the % of censored patients, and (3) the total number of features used (after feature selection).
Dataset             CESC      COAD      COADREAD  FPPP      KICH      KIPAN     KIRP      LGG       PRAD      THCA      UCEC      UVM
N                   307       456       626       38        112       939       290       513       499       503       546       80
% Censored          76.55     77.63     79.39     81.58     89.29     75.19     84.83     75.63     98        96.82     83.33     71.25
Number of Features  15        12        13        3         7         20        12        4         6         14        6         6

3.1 Calibration

For calibration on (High-Censoring, Low-Dimensional) datasets, our finding was that MTLR and CoxEN-KP outperformed the other models. The D-Calibration results (Table 6) are inconclusive, as nearly all models are D-Calibrated for all datasets (recall that D-Calibration is optimistic when data are highly censored). The 1-Calibration results (Table 7) show MTLR and CoxEN-KP outperforming all other models, with CoxEN-KP slightly ahead of MTLR. Given MTLR’s better performance on the datasets given in the paper, these results support the finding that MTLR and CoxEN-KP are the best (and generally equivalent) models for calibration on (High-Censoring, Low-Dimensional) datasets.
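
One way to see why D-Calibration is optimistic under heavy censoring is to look at what a single censored patient contributes to the probability bins. The sketch below assumes the censoring treatment described in the paper, in which a patient censored at a predicted survival probability s spreads one unit of mass over the bins covering [0, s]; the function name is hypothetical and s is assumed to be positive.

```python
def censored_contribution(s, n_bins=10):
    """Mass a censored patient adds to each bin, given predicted S(c) = s > 0."""
    width = 1.0 / n_bins
    k = min(int(s * n_bins), n_bins - 1)  # bin containing s
    mass = [0.0] * n_bins
    mass[k] = (s - k * width) / s         # partial coverage of the top bin
    for lower in range(k):
        mass[lower] = width / s           # full coverage of each lower bin
    return mass


# A patient censored while the model still predicts survival close to 1
# contributes almost the same mass to every bin, whatever the model is.
print([round(m, 3) for m in censored_contribution(1.0)])
print([round(m, 3) for m in censored_contribution(0.95)])
```

When most patients are censored early, as in PRAD (98% censored), the bin totals are therefore close to uniform by construction, and the test has little power to reject any model.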

Table 6: D-Calibration results for the 12 (High-Censoring, Low-Dimensional) datasets. The Total column gives the total number of datasets for which each model was D-Calibrated.
Model     CESC      COAD      COADREAD  FPPP      KICH      KIPAN     KIRP      LGG       PRAD      THCA      UCEC      UVM       Total
KM        1.000     1.000     1.000     1.000     1.000     1.000     1.000     1.000     1         1         1.000     1.000     12
AFT       0.999     0.978     1.000     0.899     0.997     0.899     0.957     0.955     1         1         0.966     0.988     12
Cox-KP    0.995     0.980     1.000     0.930     0.996     0.844     0.988     0.938     1         0         0.998     0.969     11
CoxEN-KP  0.979     0.904     0.987     1.000     0.999     0.593     1.000     0.819     1         1         0.998     0.998     12
RSF-KM    0.934     0.800     0.897     0.999     1.000     0.441     0.940     0.556     1         1         0.366     0.940     12
MTLR      0.947     0.696     0.967     0.999     1.000     0.783     0.999     0.993     1         1         0.965     0.994     12

Table 7: The 1-Calibration results for the 12 (High-Censoring, Low-Dimensional) datasets. Each column represents the 10th, 25th, 50th, 75th, and 90th percentiles of event times. Each row represents the total number of datasets for which each model was 1-Calibrated (the maximum being 12).
Model     10th  25th  50th  75th  90th
AFT       4     7     5     2     1
Cox-KP    5     5     6     2     1
CoxEN-KP  8     10    10    5     3
RSF-KM    3     2     1     0     0
MTLR      9     10    6     3     2

3.2 Discrimination

Table 8 presents concordance results for (High-Censoring, Low-Dimensional) datasets, which are consistent with the findings of the paper: (1) CoxEN-KP and MTLR are superior to AFT and Cox-KP, and (2) RSF-KM typically performs noticeably worse. This supports our recommendation to use MTLR or CoxEN-KP for discrimination on (High-Censoring, Low-Dimensional) datasets.

Table 8: Concordance results for the 12 (High-Censoring, Low-Dimensional) datasets. Here we present the mean and standard deviation of the Concordance across the 5 cross-validation folds. Bolded values represent the best (highest Concordance) performing models.
Model     CESC           COAD           COADREAD       FPPP           KICH           KIPAN          KIRP           LGG            PRAD           THCA           UCEC           UVM
KM        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)
AFT       0.816 (0.015)  0.73 (0.042)   0.731 (0.036)  0.531 (0.386)  0.752 (0.422)  0.782 (0.058)  0.689 (0.169)  0.781 (0.08)   0.778 (0.273)  0.705 (0.285)  0.696 (0.079)  0.736 (0.157)
Cox-KP    0.817 (0.018)  0.731 (0.04)   0.732 (0.037)  0.531 (0.386)  0.836 (0.242)  0.78 (0.06)    0.683 (0.165)  0.781 (0.078)  0.774 (0.277)  0.675 (0.3)    0.695 (0.08)   0.711 (0.134)
CoxEN-KP  0.826 (0.032)  0.741 (0.044)  0.743 (0.038)  0.641 (0.267)  0.958 (0.052)  0.794 (0.051)  0.787 (0.104)  0.779 (0.078)  0.755 (0.282)  0.955 (0.032)  0.694 (0.081)  0.733 (0.145)
RSF-KM    0.829 (0.017)  0.682 (0.041)  0.709 (0.062)  0.539 (0.376)  0.923 (0.073)  0.772 (0.058)  0.801 (0.091)  0.709 (0.066)  0.778 (0.291)  0.950 (0.027)  0.614 (0.042)  0.712 (0.077)
MTLR      0.823 (0.036)  0.746 (0.043)  0.747 (0.037)  0.721 (0.287)  0.952 (0.047)  0.795 (0.05)   0.826 (0.069)  0.781 (0.073)  0.672 (0.4)    0.913 (0.08)   0.693 (0.078)  0.737 (0.218)

4 Low-Censoring, High-Dimensional

Here we examine a number of (Low-Censoring, High-Dimensional) datasets which contain gene-expression data relating to the survival of cancer patients; see “A Multi-Task Learning Formulation for Survival Analysis” by Li et al. for details regarding dataset access. Specifically, we use (1) the Norway/Stanford breast cancer data (NSBCD), which contains gene-expression data for 115 female breast cancer patients, (2) the Van de Vijver’s microarray breast cancer data (VDV), which contains gene-expression data for 78 breast cancer patients, (3) gene-expression data for 116 adult myeloid leukemia (AML) patients, and (4) gene-expression data for 92 mantle cell lymphoma (MCL) patients. Table 9 gives the summary statistics for each of these 4 datasets.

Table 9: Summary statistics for the 4 (Low-Censoring, High-Dimensional) datasets. From top to bottom we have (1) the total number of patients, (2) the % of censored patients, and (3) the total number of features used (after feature selection).
Dataset             AML    MCL    NSBCD  VDV
N                   116    92     115    78
% Censored          42.24  30.43  66.96  56.41
Number of Features  1616   2176   274    1319

4.1 Calibration

For calibration on (Low-Censoring, High-Dimensional) datasets, our finding was that MTLR, CoxEN-KP, and RSF-KM were all equivalent. The D-Calibration results (Table 10) show that AFT fails D-Calibration on all datasets (and Cox-KP failed to finish on these datasets). The 1-Calibration results (Table 11) show that MTLR, CoxEN-KP, and RSF-KM are roughly equivalent across the different percentiles, supporting our initial finding: MTLR, CoxEN-KP, and RSF-KM are equivalent for calibration on (Low-Censoring, High-Dimensional) datasets.

Table 10: D-Calibration results for the 4 (Low-Censoring, High-Dimensional) datasets. The Total column gives the total number of datasets for which each model was D-Calibrated. Note there are no results for Cox as it failed to run on all datasets.
Model     AML    MCL    NSBCD  VDV    Total
KM        1.000  1.000  1.000  1.000  4
AFT       0.000  0.000  0.000  0.000  0
CoxEN-KP  0.897  0.228  0.992  0.998  4
RSF-KM    0.759  0.577  0.977  0.000  3
MTLR      0.529  0.391  0.981  0.996  4

Table 11: The 1-Calibration results for the 4 (Low-Censoring, High-Dimensional) datasets. Each column represents the 10th, 25th, 50th, 75th, and 90th percentiles of event times. Each row represents the total number of datasets for which each model was 1-Calibrated (the maximum being 4). Note there are no results for Cox as it failed to run on all datasets.
Model     10th  25th  50th  75th  90th
AFT       0     0     0     0     0
CoxEN-KP  4     4     4     2     1
RSF-KM    4     4     3     3     1
MTLR      2     3     4     4     3

4.2 Discrimination

Table 12 presents the concordance results for (Low-Censoring, High-Dimensional) datasets. The findings in the paper (that MTLR and CoxEN-KP are superior to AFT and RSF-KM) are generally supported, and always supported with respect to AFT. CoxEN-KP does better than RSF-KM on every dataset except VDV, and MTLR either matches or outperforms RSF-KM on all datasets.
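
These comparisons can be checked directly against the mean concordance values in Table 12. The sketch below is illustrative only: the helper `wins` is hypothetical, and treating a difference of 0.001 as a "match" is an assumption made here to reflect the rounding of the reported values.

```python
datasets = ["AML", "MCL", "NSBCD", "VDV"]

# Mean concordance copied from Table 12 (standard deviations omitted).
mean_concordance = {
    "AFT":      [0.531, 0.496, 0.568, 0.469],
    "CoxEN-KP": [0.656, 0.788, 0.784, 0.679],
    "RSF-KM":   [0.652, 0.764, 0.736, 0.725],
    "MTLR":     [0.651, 0.805, 0.776, 0.733],
}


def wins(model_a, model_b, tol=0.0):
    """Datasets where model_a's mean is at least model_b's mean minus tol."""
    a, b = mean_concordance[model_a], mean_concordance[model_b]
    # Differences are rounded to the 3 decimals reported in the table.
    return [d for d, x, y in zip(datasets, a, b) if round(x - y, 3) >= -tol]


print(wins("CoxEN-KP", "AFT"))             # all four datasets
print(wins("MTLR", "AFT"))                 # all four datasets
print(wins("CoxEN-KP", "RSF-KM"))          # ['AML', 'MCL', 'NSBCD']
print(wins("MTLR", "RSF-KM", tol=0.001))   # all four datasets
```

Note that these are comparisons of fold means only; with standard deviations of 0.03 to 0.19 across folds, most of the gaps between CoxEN-KP, RSF-KM, and MTLR are small relative to the fold-to-fold variation.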

Table 12: Concordance results for the 4 (Low-Censoring, High-Dimensional) datasets. Here we present the mean and standard deviation of the Concordance across the 5 cross-validation folds. Bolded values represent the best (highest Concordance) performing models. Note there are no results for Cox as it failed to run on all datasets.
Model     AML            MCL            NSBCD          VDV
KM        0.5 (0)        0.5 (0)        0.5 (0)        0.5 (0)
AFT       0.531 (0.074)  0.496 (0.063)  0.568 (0.129)  0.469 (0.084)
CoxEN-KP  0.656 (0.124)  0.788 (0.042)  0.784 (0.086)  0.679 (0.187)
RSF-KM    0.652 (0.089)  0.764 (0.058)  0.736 (0.101)  0.725 (0.13)
MTLR      0.651 (0.092)  0.805 (0.034)  0.776 (0.071)  0.733 (0.157)

5 High-Censoring, High-Dimensional

Here we examine 1 additional (High-Censoring, High-Dimensional) dataset, namely gene-expression data for 86 early lung adenocarcinoma (Lung) patients; see “A Multi-Task Learning Formulation for Survival Analysis” by Li et al. for details regarding dataset access. Table 12 gives the summary statistics for this dataset.

Table 12: Summary statistics for the (High-Censoring, High-Dimensional) dataset. From top to bottom we have (1) the total number of patients, (2) the % of censored patients, and (3) the total number of features used (after feature selection).
Dataset             Lung
N                   86
% Censored          72.09
Number of Features  1121

5.1 Calibration

For calibration on (High-Censoring, High-Dimensional) datasets, our finding was that MTLR, CoxEN-KP, and RSF-KM were all equivalent. The D-Calibration results (Table 13) show that AFT is the only model to fail D-Calibration (note Cox failed to finish). Similarly, the 1-Calibration results (Table 14) show that MTLR, CoxEN-KP, and RSF-KM are equivalent across the different percentiles, suggesting that our finding holds for this dataset as well.

Table 13: D-Calibration results for the (High-Censoring, High-Dimensional) dataset. Note the Cox model does not appear as it failed to finish on this dataset.
Model     Lung
KM        1.000
AFT       0.000
CoxEN-KP  0.922
RSF-KM    0.887
MTLR      0.986

Table 14: The 1-Calibration results for the (High-Censoring, High-Dimensional) dataset. Each column represents the 10th, 25th, 50th, 75th, and 90th percentiles of event times. Each row represents the total number of datasets for which each model was 1-Calibrated (the maximum being 1, as there is only one dataset). Note the Cox model does not appear as it failed to finish on this dataset.
Model     10th  25th  50th  75th  90th
AFT       0     0     0     0     0
CoxEN-KP  1     1     1     1     0
RSF-KM    1     1     1     1     0
MTLR      1     1     1     1     0

5.2 Discrimination

Table 15 presents the concordance results for the (High-Censoring, High-Dimensional) dataset. The findings in the paper (that MTLR, CoxEN-KP, and RSF-KM are equivalent and also superior to AFT) are supported (although RSF-KM has much higher variance).

Table 15: Concordance results for the (High-Censoring, High-Dimensional) dataset. Here we present the mean and standard deviation of the Concordance across the 5 cross-validation folds. Bolded values represent the best (highest Concordance) performing models. Note the Cox model does not appear as it failed to finish on this dataset.
Model     Lung
KM        0.5 (0)
AFT       0.668 (0.172)
CoxEN-KP  0.932 (0.039)
RSF-KM    0.816 (0.173)
MTLR      0.934 (0.029)
