Of 6399 cases, 2817 (44.0%) had missing data, with a total of 5015 missing data points (5015/270143, 1.9%). Variables with >5% missing data were outcome of neoadjuvant treatment (22.3%), return to theatre (15.8%), differentiation grade (7.0%) and surgical approach (open/minimally invasive/hybrid, 5.7%). Circumferential resection margin, clinical and pathological T/N/M stage, longitudinal resection margin, length of stay and survival time each had missing data in 1-5% of cases. All other variables had <1% missing data.
In this study, missing data were assumed to be missing at random (MAR) and were handled using multiple imputation by chained equations (MICE)(1), with 10 iterations across 10 imputed datasets. MICE imputes variables iteratively, one at a time, using a method appropriate to the class of each variable: in general, a regression model is fitted with the variable containing missing values as the dependent variable and the observed variables as the independent variables. Extension to the survival setting has been described previously(2,3). The optimal number of imputed datasets has been suggested to be 100 multiplied by the fraction of incomplete cases(4); here, the number of imputed datasets was limited to 10 for computational reasons, and 10 iterations were performed as suggested(1). All 41 candidate covariates (Table 1) were included in the imputation model, which should improve its specification. We elected to include cases with missing survival data (i.e. the outcome), as this has been shown to improve the imputation model(5) and is generally recommended(6). For the Cox proportional hazards model, final pooled models were generated. For the RSF, pooled survival probabilities were created as described below.
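For illustration, a minimal sketch of how this imputation step might be specified with the mice package is shown below. The dataset and variable names (`dat`, `survival_time`, `death`) are assumptions, and the Nelson-Aalen cumulative hazard is used as one common way of including the survival outcome in the imputation model; this is not the study code verbatim.

```r
# Minimal sketch of the imputation step (illustrative names, not the
# study code). The survival outcome is included via the event indicator
# and the Nelson-Aalen cumulative hazard estimate.
library(mice)

dat$cumhaz <- nelsonaalen(dat, survival_time, death)  # cumulative hazard

imp <- mice(dat,
            m     = 10,    # 10 imputed datasets
            maxit = 10,    # 10 iterations per dataset
            seed  = 2020)  # arbitrary seed for reproducibility

# Extract the completed datasets for downstream model fitting
dat_list <- lapply(seq_len(imp$m), function(i) complete(imp, i))
```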
To derive the RSF, hyperparameters were optimised to minimise prediction error (1 – C-index) in the random forest out-of-bag (OOB) samples. Each tree in the random forest is trained on a bootstrap sample drawn with replacement, containing approximately two-thirds of the cases (the in-bag samples), and tested on the remaining third (the OOB samples); the OOB error averaged across all trees is therefore equivalent to a simple bootstrap with the number of repetitions equal to the number of trees in the forest. The hyperparameters optimised were the number of trees, the number of variables considered in each decision tree and the minimum node size.
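A sketch of this tuning loop using the randomForestSRC package is shown below; the grid values and data/variable names are illustrative assumptions. The OOB error rate reported by `rfsrc` for survival forests is 1 minus Harrell's C-index.

```r
# Sketch of hyperparameter tuning by OOB error (1 - C-index) with
# randomForestSRC; grid values and data/variable names are illustrative.
library(randomForestSRC)

grid <- expand.grid(ntree    = c(200, 300, 400),
                    mtry     = c(3, 6, 9),
                    nodesize = c(5, 15, 30))

grid$oob_error <- apply(grid, 1, function(g) {
  fit <- rfsrc(Surv(survival_time, death) ~ .,
               data     = dat_list[[1]],   # one completed dataset from above
               ntree    = g[["ntree"]],
               mtry     = g[["mtry"]],
               nodesize = g[["nodesize"]])
  tail(na.omit(fit$err.rate), 1)           # final OOB error = 1 - Harrell's C
})

best <- grid[which.min(grid$oob_error), ]  # selected hyperparameters
```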
This process was repeated for each imputed dataset to yield 10 separate models. Unlike Cox or logistic models, RSF models generated on multiply imputed data cannot be combined directly, because the technique produces no coefficients. To address this, the predictions themselves were combined(7,8). The standard error of prediction at each unique death time was calculated using the predictions from the individual decision trees in the forest (i.e. 200-400 samples), allowing the predictions from each dataset to be combined using Rubin's rules after a complementary log-log transformation(9).
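The sketch below shows how Rubin's rules can be applied on the complementary log-log scale. The input matrices (per-imputation predictions for a single patient and their tree-based standard errors) are illustrative, and the delta-method variance transformation is an assumption about implementation detail rather than a verbatim reproduction of the study code.

```r
# Sketch of pooling RSF survival predictions across imputed datasets
# using Rubin's rules on the complementary log-log scale.
# pred_mat: m x k matrix of predicted survival probabilities for one
#           patient (rows = imputations, columns = unique death times)
# se_mat:   matching standard errors estimated from the individual trees
pool_cloglog <- function(pred_mat, se_mat) {
  m <- nrow(pred_mat)
  pred_mat <- pmin(pmax(pred_mat, 1e-6), 1 - 1e-6)  # avoid log(0)/log(1)

  q <- log(-log(pred_mat))                          # cloglog transform
  u <- (se_mat / (pred_mat * log(pred_mat)))^2      # delta-method variance

  qbar <- colMeans(q)                 # pooled point estimate
  ubar <- colMeans(u)                 # within-imputation variance
  b    <- apply(q, 2, var)            # between-imputation variance
  tvar <- ubar + (1 + 1 / m) * b      # total variance (Rubin's rules)

  list(surv = exp(-exp(qbar)),        # back-transform to survival scale
       se_cloglog = sqrt(tvar))
}
```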
In contrast to the binary outcome setting, there is no clear consensus on how to assess discrimination in survival models. Harrell's C-statistic in a binary outcome model is equivalent to the area under the receiver operating characteristic (ROC) curve, or AUC. It can be defined as the proportion of case pairs, in which one case has the outcome and the other does not, for which the model correctly orders the predictions (i.e. the case with the outcome scores higher than the case without). In a survival setting, confusingly, Harrell's C-statistic is not equivalent to this; it instead represents the proportion of random pairs of cases in which the case that lives longer is given a higher probability of survival at a specified time point.
Although a reasonable diagnostic measure of a model, it differs crucially from the binary outcome setting with which most users are familiar and may be misinterpreted accordingly. It has also been criticised for considering only 'usable pairs' (i.e. uncensored pairs of cases) and is hence heavily dependent on the censoring distribution of the study dataset. An alternative measure that has gained popularity because of its ready interpretability and better use of censored data is the time-dependent area under the receiver operating characteristic curve (tAUC).
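For reference, Harrell's C for a survival risk score can be computed as in the following sketch using the survival package; the data and variable names are assumptions, with higher risk values indicating worse prognosis.

```r
# Illustrative computation of Harrell's C-statistic for a risk score.
library(survival)

cfit <- concordance(Surv(survival_time, death) ~ risk,
                    data    = dat,
                    reverse = TRUE)  # higher risk should predict shorter survival
cfit$concordance                     # Harrell's C
```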
To calculate the tAUC, we must first consider censoring. If censored patients are treated as missing (i.e. censoring is ignored), the comparison of cases and controls is biased. There are multiple ways of addressing this issue; here we used inverse probability of censoring weighting (IPCW), a standard technique in survival analysis for appropriately weighting the cases, as described by Blanche et al(10). In short, observed cases are weighted by the inverse of their probability of remaining uncensored at each time point, so that the observed cases proportionally stand in for those lost to censoring in the final comparison.
Secondly, we must decide how to account for time. In this study we used the cumulative sensitivity/dynamic specificity (C/D) definition (the two alternatives are incident sensitivity with dynamic specificity, I/D, and incident sensitivity with static specificity, I/S). Here, at each measured time point (i.e. death time), all patients are classified as either cases (i.e. dead at that time point) or controls (i.e. alive at that time point). The cumulative sensitivity at time 't' is the proportion of patients dead at time 't' who are given a probability of dying above a set cut-off 'c'; the dynamic specificity is the proportion of patients alive at time 't' who are given a probability of dying below 'c'(11). Patients are weighted according to IPCW to yield the final sensitivity and specificity, which are calculated across a range of cut-offs to derive the time-dependent receiver operating characteristic curve (tROC), from which the tAUC is calculated. Performance across all time points (the integrated tAUC, iAUC) can also be calculated, and was 82.4% for the RSF. Performance was preserved across time points (Figure S.5).
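A sketch of this calculation using the timeROC package, which implements the IPCW-weighted C/D estimator of Blanche et al., is shown below; the marker name and time horizons are illustrative assumptions rather than the exact study code.

```r
# Sketch of the IPCW-weighted, cumulative/dynamic time-dependent AUC.
library(timeROC)

troc <- timeROC(T         = dat$survival_time,   # follow-up time
                delta     = dat$death,           # event indicator
                marker    = dat$risk,            # predicted risk of death
                cause     = 1,
                weighting = "marginal",          # Kaplan-Meier IPCW weights
                times     = 365.25 * 1:5,        # 1 to 5 years
                iid       = TRUE)                # enables confidence intervals

troc$AUC          # tAUC at each horizon
confint(troc)     # pointwise confidence intervals
```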
Predicted probabilities for each model were compared using decision curves. These were first developed in 2006 by Vickers et al.(12) and are designed to provide more information on the clinical utility of a model than summary statistics such as the C-index. Decision curve analysis (DCA) calculates the 'net benefit' of decisions made across a range of probability thresholds by applying validation data to the formula:
\[\text{Net Benefit} = \frac{TP}{n} - \frac{FP}{n}\left(\frac{p_t}{1 - p_t}\right)\]
where TP is the number of true positives, FP is the number of false positives, n is the number of patients and $p_t$ is the chosen probability threshold. Plotting net benefit across all probability thresholds yields the decision curve. The principle is to assess the cost/benefit trade-off of basing a decision to intervene on a defined probability (i.e. $p_t$) of an event (e.g. disease present/absent), and the approach has previously been extended to censored data(13). Having generated the curve, it is possible to see the range of probabilities of disease over which the model has greater utility for basing treatment decisions than treating all patients or treating none.
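The formula translates directly into code. The sketch below computes net benefit for a binary outcome at a given threshold and sweeps it across thresholds to trace a decision curve; variable names (`risk`, `dead_5yr`) are illustrative, and the censored-data extension replaces the raw counts with IPCW-weighted estimates.

```r
# Net benefit at threshold p_t, as defined in the formula above.
net_benefit <- function(pred, outcome, p_t) {
  test_pos <- pred >= p_t
  tp <- sum(test_pos & outcome == 1)   # true positives
  fp <- sum(test_pos & outcome == 0)   # false positives
  n  <- length(outcome)
  tp / n - (fp / n) * (p_t / (1 - p_t))
}

# Decision curve: net benefit across a range of thresholds
thresholds <- seq(0.05, 0.95, by = 0.01)
nb <- sapply(thresholds, function(p) net_benefit(dat$risk, dat$dead_5yr, p))
plot(thresholds, nb, type = "l",
     xlab = "Threshold probability", ylab = "Net benefit")
```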
Furthermore, different clinical prediction models can be compared using the same technique. DCA is important in that it can show that a model which both discriminates and calibrates well may still be useless in clinical practice, and should be discarded, because of the distribution of its predicted probabilities(14). DCA can also be used to show whether a particular model would be beneficial given individual patients' expectations of acceptable risk(15). In a survival setting, decision curves are generated for individual survival times, which we did annually between 1 and 5 years (Figure S.6). The net benefit of the RSF model was greater than that of the Cox regression predictions across all threshold probabilities, and this pattern was preserved at all time points. Importantly, the RSF model also had a net benefit over the use of mortality as a function of pathological TNM stage alone (Figure S.7).
Factor levels in clinical T stage were collapsed so that T0, Tis and T1 formed a single category, because of the low number of patients in these categories and their similar observed survival (as shown below in Figure S.2). TX was treated as missing.
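A sketch of this recoding step, assuming a factor column named `cT` with the level labels below, might look as follows.

```r
# Collapse sparse clinical T stage levels and treat TX as missing
# (column name and level labels are assumptions).
dat$cT <- as.character(dat$cT)
dat$cT[dat$cT %in% c("T0", "Tis", "T1")] <- "T0/1/is"
dat$cT[dat$cT == "TX"] <- NA
dat$cT <- factor(dat$cT)
```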
| Preoperative | Operative | Post-Operative |
|---|---|---|
| Gender | Any Surgical Complication | Involved Longitudinal Margin |
| Age | Anastomotic Leak | Involved Circumferential Margin |
| Site of Tumour | Chyle Leak | Number of Positive Lymph nodes |
| Histopathology | Bleeding Complication | pT/ypT |
| cT | Cardiac Complication | Differentiation Grade |
| cN | Acute Kidney Injury | |
| cM | Pneumonia | |
| Neoadjuvant Treatment Modality | ARDS | |
| Surgical Approach | Pulmonary Embolism | |
| Type of Operation | Pleural Effusion | |
| Field of Nodal Dissection | Any Respiratory Complication | |
| Cardiovascular Comorbidity | Infective Complication | |
| COPD | Return to Theatre | |
| Chronic Kidney Disease | Number of Operations | |
| Chronic Liver Disease | ||
| Diabetes Mellitus | ||
| Psychiatric Illness | ||
| Cerebrovascular Disease | ||
| Barrett's Oesophagus | | |
| Number of Comorbidities | ||
| Performance Status | ||
| ASA Grade |
| Variable | VIMP | Lower CI | Upper CI |
|---|---|---|---|
| Total Positive Lymph Nodes* | 10.88 | 9.76 | 12.00 |
| pT/ypT* | 3.54 | 2.90 | 4.19 |
| Involved Circumferential Margin* | 1.70 | 1.41 | 2.00 |
| Age* | 0.39 | 0.28 | 0.49 |
| cT* | 0.32 | 0.19 | 0.45 |
| cN* | 0.22 | 0.11 | 0.34 |
| Neoadjuvant Treatment* | 0.15 | 0.08 | 0.22 |
| Completion of Neoadjuvant Treatment* | 0.14 | 0.06 | 0.22 |
| Differentiation Grade (Worst)* | 0.09 | 0.02 | 0.15 |
| Involved Longitudinal Margin* | 0.07 | 0.03 | 0.10 |
| Gender* | 0.06 | 0.01 | 0.12 |
| Site of Tumour* | 0.06 | 0.02 | 0.10 |
| Any Complication* | 0.05 | 0.00 | 0.11 |
| Approach | 0.03 | -0.02 | 0.08 |
| ASA | 0.02 | -0.04 | 0.08 |
| Return to Theatre | 0.02 | -0.02 | 0.06 |
| Procedure | 0.02 | -0.01 | 0.05 |
| Histopathology* | 0.02 | -0.01 | 0.04 |
| Anastomotic Leak | 0.01 | 0.00 | 0.03 |
| Hospital Volume | 0.01 | -0.02 | 0.05 |
| COPD | 0.01 | -0.02 | 0.03 |
| Barrett’s | 0.01 | -0.01 | 0.02 |
| Respiratory Complication | 0.01 | -0.02 | 0.03 |
| ARDS | 0.01 | 0.00 | 0.01 |
| DM | 0.00 | -0.01 | 0.02 |
| Chyle Leak | 0.00 | -0.01 | 0.02 |
| IHD | 0.00 | -0.03 | 0.04 |
| Pneumonia | 0.00 | -0.01 | 0.02 |
| Cardiac Complication | 0.00 | -0.02 | 0.02 |
| cM | 0.00 | 0.00 | 0.00 |
| CVD | 0.00 | -0.01 | 0.01 |
| Number of Procedures | 0.00 | -0.02 | 0.02 |
| Renal Complication | 0.00 | 0.00 | 0.00 |
| Pulmonary Embolism | 0.00 | 0.00 | 0.00 |
| Chronic Liver disease | 0.00 | 0.00 | 0.00 |
| Bleeding Complication | 0.00 | 0.00 | 0.00 |
| Pleural Effusion | 0.00 | -0.01 | 0.01 |
| Psychiatric Illness | 0.00 | -0.01 | 0.00 |
| CKD | 0.00 | 0.00 | 0.00 |
| Infective Complication | 0.00 | -0.01 | 0.00 |
| Performance Status | 0.00 | -0.05 | 0.05 |
| Inclusion Criteria | Checklist |
|---|---|
| The probability of overall survival, disease-specific survival (DSS), or disease-specific mortality (DSM) must be the outcome predicted | Predicts Overall Survival |
| The model should address a clinically relevant question | Yes |
| At face value, the model should include the relevant predictors, or explain why something relevant was not included | Yes |
| The model validation study should specify precisely which patients were used to evaluate the model and the validation dataset’s inclusion/exclusion criteria | See Patient Flow Diagram |
| The model should be assessed for generalizability and external validation | Internal validation conducted |
| The model should have a well-defined prognostic time zero | Time of Surgery |
| All predictors must be known at time zero and sufficiently defined for someone else to use | Yes |
| Sufficient detail must be available to implement the model OR the author must allow free access to the model. | Yes |
| A measure of discrimination must have been reported | Yes |
| Calibration in the small must be assessed (from the external validation data set) and provided | Yes |
| The model should be validated over a time frame and practice setting that is relevant to contemporary patients with disease | Yes |
| It should be clear which initial treatment(s), if any, were applied, and with what frequency | Yes |
| The development and/or the validation of the prediction model must appear as a peer-reviewed journal article | Yes |
| Exclusion Criteria | Checklist |
|---|---|
| A substantial proportion of patients had essentially no follow up, either missing entirely or very short censored follow up, in the validation dataset | No |
| No information on number of missing values in validation dataset | Provided |
| The number of events in the validation dataset is small (<100) | No |
Partial dependence plots can be used to visualise the relative importance of different variables. The partial dependence function at a specified variable value (e.g. total positive lymph nodes = 5, or neoadjuvant treatment = chemoradiotherapy) is calculated as the average prediction of the RSF if all cases in the training set were set to that value. It therefore allows visualisation of the average marginal effect of individual variables and of how changing them affects overall predictions, although it does not assess how variables interact, and such interactions can be difficult to represent effectively. Figure S.3 and Figure S.4 show this function for the variables included in the final model, ordered by importance. The effect on predicted survival is clearly far greater for the first four variables than for the others.
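Partial dependence values can be obtained directly from a fitted forest with randomForestSRC, as in the sketch below; `rsf_fit`, the variable names and the 3-year horizon are assumptions for illustration.

```r
# Sketch of extracting partial dependence from a fitted RSF.
library(randomForestSRC)

pd <- plot.variable(rsf_fit,
                    xvar.names = c("total_pos_nodes", "pT", "crm", "age"),
                    partial    = TRUE,        # partial dependence, not marginal
                    surv.type  = "surv",      # predicted survival probability
                    time       = 365.25 * 3)  # evaluated at 3 years

plot.variable(pd)   # re-plot the stored object without re-computation
```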
A Cox proportional hazards (CPH) model was fitted to the variables used for the RSF model, initially with no interactions or transformations. The proportional hazards assumption was assessed using Schoenfeld residuals. To assess non-linearity, the martingale residuals of the continuous variables (age and number of positive lymph nodes), each entered separately into a null Cox model, were calculated and a loess smoother applied. Both variables showed a clearly non-linear form. A square root transformation improved the fit of positive lymph nodes, but standard transformations (power or logarithmic) did not result in an improvement for age (more details available on request). A restricted cubic spline with 4 knots (at ages 49, 62, 69 and 78) was therefore applied to age.
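A sketch of these diagnostics, assuming the survival and rms packages and illustrative variable names (the spline knots match those reported above), is shown below.

```r
# Sketch of the Cox model diagnostics and final specification
# (illustrative variable names; knots as reported in the text).
library(survival)
library(rms)

# Non-linearity: martingale residuals from a null Cox model vs. age
null_fit <- coxph(Surv(survival_time, death) ~ 1, data = dat_list[[1]])
mres     <- residuals(null_fit, type = "martingale")
scatter.smooth(dat_list[[1]]$age, mres)   # loess smoother over residuals

# Final model: sqrt-transformed node count, restricted cubic spline for age
cph_fit <- coxph(Surv(survival_time, death) ~ site + gender + histology +
                   cT + cN + neoadjuvant + any_complication +
                   sqrt(total_pos_nodes) + pT + crm + lrm + grade +
                   rcs(age, c(49, 62, 69, 78)),
                 data = dat_list[[1]])

# Proportional hazards: Schoenfeld residual tests per covariate
cox.zph(cph_fit)
```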
| Factor | Level | Beta | Hazard Ratio (95% CI) | P Value | Proportional Hazards Chi Square | Proportional Hazards P Value |
|---|---|---|---|---|---|---|
| Site of Tumour | Mid/Upper Oesophagus | | 1 | | | |
| | Lower Oesophagus | 0.001 | 1 (0.87 - 1.15) | <0.001 | 1.49 | 0.222 |
| | GOJ | -0.139 | 0.87 (0.75 - 1.01) | <0.001* | 0.24 | 0.623 |
| Gender | Male | | 1 | | | |
| | Female | -0.212 | 0.81 (0.73 - 0.9) | <0.001* | 0.12 | 0.733 |
| Histopathology | Adenocarcinoma | | 1 | | | |
| | SCC | 0.216 | 1.24 (1.08 - 1.42) | 0.002* | 3.85 | 0.05 |
| cT | T0/1/is | | 1 | | | |
| | T2 | 0.198 | 1.22 (0.95 - 1.56) | <0.001* | 0.34 | 0.559 |
| | T3 | 0.213 | 1.24 (0.97 - 1.58) | <0.001* | 0.47 | 0.491 |
| | T4 | 0.176 | 1.19 (0.89 - 1.61) | 0.246 | 0.23 | 0.628 |
| cN | N0 | | 1 | | | |
| | N1 | 0.127 | 1.14 (1.03 - 1.25) | 0.011* | 0.76 | 0.384 |
| | N2 | 0.135 | 1.14 (1.01 - 1.3) | <0.001* | 0.2 | 0.656 |
| | N3 | -0.027 | 0.97 (0.77 - 1.23) | 0.824 | 0.24 | 0.624 |
| Neoadjuvant Treatment | None | | 1 | | | |
| | CRT - Completed | 0.19 | 1.21 (0.94 - 1.56) | 0.148 | 0.8 | 0.372 |
| | CRT - Not Completed | 0.813 | 2.25 (1.03 - 4.92) | 0.041* | 0.09 | 0.765 |
| | CT - Completed | 0.009 | 1.01 (0.89 - 1.14) | 0.882 | 0 | 0.967 |
| | CT - Not Completed | 0.158 | 1.17 (0.95 - 1.44) | 0.134 | 0.01 | 0.934 |
| Any Complication | No | | 1 | | | |
| | Yes | 0.116 | 1.12 (1.04 - 1.21) | <0.001* | 0.07 | 0.784 |
| Total LN Positive | | 0.455 | 1.58 (1.52 - 1.63) | <0.001* | 0.12 | 0.728 |
| pT/ypT | T0 | | 1 | | | |
| | T1 | 0.201 | 1.22 (0.91 - 1.63) | 0.175 | 1.11 | 0.291 |
| | T2 | 0.507 | 1.66 (1.26 - 2.19) | <0.001* | 0.83 | 0.363 |
| | T3 | 0.925 | 2.52 (1.95 - 3.25) | <0.001* | 1.2 | 0.272 |
| | T4 | 1.445 | 4.24 (3.18 - 5.66) | <0.001* | 2.16 | 0.141 |
| Involved CRM | No | | 1 | | | |
| | Yes | 0.303 | 1.35 (1.24 - 1.48) | <0.001* | 1.32 | 0.251 |
| Involved LRM | No | | 1 | | | |
| | Yes | 0.284 | 1.33 (1.11 - 1.59) | 0.002* | 0.42 | 0.515 |
| Differentiation Grade | Well | | 1 | | | |
| | Moderate | 0.304 | 1.36 (1.05 - 1.76) | 0.022* | 0.3 | 0.584 |
| | Poor/Anaplastic | 0.441 | 1.55 (1.2 - 2.02) | 0.001* | 5.41 | 0.02 |
| | GX | 0.277 | 1.32 (0.99 - 1.75) | 0.057 | 2.32 | 0.128 |
To assess how the modelling process performed in different situations, we first chose the variables that were available preoperatively. A preoperative prediction model has obvious uses, both in tailoring treatments (e.g. neoadjuvant chemotherapy vs neoadjuvant chemoradiotherapy) and in avoiding futile treatment (e.g. where the risk of death in the early post-operative period is extremely high). Twenty-two variables were included, shown in the preoperative column of Table 1 in the main text.
This model was then compared to the calibrated full RSF model. The tAUC of the RSF model using these variables was 73.2% (95% CI 72.1-74.5%), compared with 84.8% (95% CI 83.1-86.9%) for the full model. Calibration was also poorer, with a Brier score of 0.171 (versus 0.133 for the full model) and visually worse calibration than the full model.
To illustrate how the model provides discrete predictions, two example cases are described below:
Case 1: 60-year-old male patient with adenocarcinoma of the oesophagus, who undergoes neoadjuvant chemotherapy before an oesophagectomy. His post-operative pathology reveals a well differentiated T3N1 (1/30) M0 tumour with no circumferential or longitudinal margin involvement. The ypTNM Stage is 3b.
Case 2: 60-year-old male patient with adenocarcinoma of the oesophagus, who undergoes neoadjuvant chemoradiotherapy which he fails to complete before an oesophagectomy. His post-operative pathology reveals a poorly differentiated T3N2 (6/30) M0 tumour with circumferential margin involvement. He also suffers from a post-operative complication. The ypTNM Stage is also 3b.
A basic knowledge of R is required to conduct the external validation. As the model does not generate coefficients, access to the model itself is required.
First, download the file packet from the web application in the ‘Model Details’ tab. This contains the models themselves and the manner in which dummy coding was conducted.
An example blank dataframe is also included, showing the structure in which data must be presented to the model. Care should be taken to match the variable names and factor levels in this file; if the model fails to generate predictions, a discrepancy here is the most likely cause.
Then access and download the R script from GitHub:
https://github.com/saqibrahmanUGI/AUGIS-Surv/blob/master/externalvalid.r
Running this script will first install and load the required R packages, then batch-generate predictions, calculate the tAUC and C-index, plot annual calibration curves and plot quintiles of predictions against observed Kaplan-Meier (KM) estimates.
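For orientation, the prediction step inside that script broadly amounts to loading a saved forest and calling predict on correctly formatted new data, along the lines of the hedged sketch below. The file and object names here are placeholders rather than the actual contents of the downloaded packet, so the script above remains the definitive workflow.

```r
# Hedged sketch only - placeholder file/object names, not the real packet.
library(randomForestSRC)

rsf_model <- readRDS("rsf_model.rds")        # saved forest from the packet
new_data  <- readRDS("blank_dataframe.rds")  # template populated with your cohort

preds <- predict(rsf_model, newdata = new_data)
head(preds$survival)      # predicted survival probabilities
preds$time.interest       # time points the probabilities refer to
```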