Supplementary Methods

Missing Data

2817 cases out of 6399 had missing data (44.0%), with a total of 5015 missing data points (5015/270143, 1.9%). Variables considered that had >5% missing data were outcome of neoadjuvant treatment (22.3%), return to theatre (15.8%), differentiation grade (7.0%) and surgical approach (open/minimally invasive/hybrid, 5.7%). Circumferential resection margin, Clinical/pathological T/N/M stage, longitudinal resection margin, length of stay and survival time all had missing data in 1-5% of cases. All other variables had <1% missing data.

In this study missing data was assumed to be at random (MAR) and was handled using multiple imputation by chained equations (MICE),(1) with 10 iterations across 10 imputed datasets. MICE imputes iteratively, across variables, one at a time using a variety of methods according to the class of the data. In general, a type of regression model is generated with the missing value as the dependent variable and the known variables as the independent variables. Extension to the survival setting has been described previously(2,3). The optimal number of imputed datasets has been suggested to be 100 multiplied by the fraction of incomplete cases(4). Here, the number of imputed datasets was limited to 10 for computational reasons and 10 iterations were performed as suggested(1). All 41 candidate covariates (Table 1) were included in the imputation model, which should increase the quality of its specification. We elected to include cases with missing survival data (i.e. the outcome) as this have been shown to improve the imputation model(5) and is generally recommended(6). For the Cox-Proportional hazards method, final pooled models were generated. For the RSF, pooled survival probabilities were created as described below.

Deriving the RSF

To derive the RSF, hyperparameters were optimised to mimimise prediction error (1 – c index) in the random forest Out-of-Bag (OOB) samples. Each tree in the random forest is trained on 2/3 of the cases with replacement (the in-bag samples) and then tested on the remaining 1/3 (the OOB samples), and hence the OOB error averaged across all trees is equivalent to the simple bootstrap with repetitions equal to the number of trees in the forest. Hyperparameters optimised were number of trees, number of variables in each decision tree and minimum node size.

This process was repeated for each imputed dataset to yield 10 separate models. It is not possible to directly combine RSF models generated on multiply imputed data as would be the case for Cox or Logistic Models, due to the absence of coefficients with this technique. To address this, predictions were directly combined(7,8). The standard error of prediction for each unique death time was calculated using the predictions from individual decision trees in the forest (i.e. 200-400 samples), allowing the predictions from each dataset to be combined using Rubin’s rules after a complementary log-log transformation(9).

Assessment of Discrimination

In comparison to the binary outcome setting, there is no clear consensus on the assessment of discrimination in survival models. The Harell’s C-statistic in a binary outcome model is equivalent to the area under the receiver operator characteristic (ROC) curve or AUC. It can be defined as the proportion of pairs of cases where one has an outcome and the other does not, that the model correctly orders their predictions (i.e. the case with an outcome scores higher than the case without). In a survival setting, confusingly, Harrell’s C-statistic is not equivalent to this, instead representing the proportion of random pairs of cases where the case which lives longer is given a higher probability of survival at a specified time point.

Although a reasonable diagnostic measure of a model, it differs crucially from the binary outcome setting with which most users are familiar and may be misinterpreted in that way. It has also been criticised for considering only ‘usable pairs’ (i.e. uncensored pairs of cases) and is hence heavily dependent on the censoring distribution of the study dataset. An alternative method of assessment that has increasing popularity due to its ready interpretability and better utilisation of censored data is the time-dependent area under receiver operating characteristic curve (tAUC).

To calculate the tAUC, firstly we must consider censoring. If we treat the censored patients as missing (i.e. ignore censoring), then a biased comparison of cases and controls is conducted. There are multiple ways of addressing this issue, and here we have used the Inverse Probability of Censoring Weighting (IPCW) , a standard technique for survival analysis to appropriately weight the cases, as described by Blanche et al(10). In short, surviving cases are weighted according to the inverse of the probability of being censored at each time point so that surviving cases are proportionally over-represented in the final comparison.

Secondly, we must decide how to account for time. In this study we used the Cumulative sensitivity/dynamic specificity (C/D) definitions (there are two alternative definitions, Incident sensitivity and dynamic specificity – I/D and Incident sensitivity and static specificity – I/S). Here, for each measured time point (i.e. death time), all patients are classified into being either cases (i.e. dead at that time point) or controls (i.e. alive at that time point). The cumulative sensitivity at time point ‘t’ is the proportion of cases who are dead at time ‘t’ who are given a probability of dying above a set cut-off point ‘c’. The dynamic specificity is the proportion of cases who are alive at time ‘t’ who are given a probability of dying below ‘c’(11). Cases are weighted according to IPCW to yield a final sensitivity and specificity and calculated across a range of cut-points to derive the time dependent receiver operator curve (tROC), from which the tAUC can be calculated. The performance across all time points (integrated tAUC, iAUC) can also be calculated, and was 82.4% in the RSF. There was preservation of performance across time points (Figure S.5).

Decision Curve Analysis

Predicted probabilities for each model were compared using decision curves. These were first developed in 2006 by Vickers et al.(12) and designed to provide more information on the clinical utility of a model than arbitrary statistics, such as the C-index. Decision curve analysis (DCA) calculates the ‘net benefit’ of decisions made across a range of probability thresholds by applying validation data to the formula:

\[Net Benefit = \frac{TP}{n}-\frac{FP}{n}\left(\frac{p_t}{1-p_t}\right)\]

Where TP is the number of true positives, FP is the number of false positives, n is the number of patients and pt is the chosen probability threshold. Plotting net benefit across all probability thresholds yields the decision curve. The principle is to assess the cost/benefit of basing a decision to intervene at a defined probability (i.e. pt) of an event (e.g. disease present/absent), and has been previously extended to censored data(13). Having generated the curve, it is possible to see the range of probabilities of disease in which the model has utility for basing treatment decisions above that of treating all patients or none.

Furthermore, different clinical prediction models can be compared by the same technique. The DCA is important in that it can show that a model that both discriminates and calibrates well, is still useless in clinical practice and should be discarded due to the distribution of predicted probabilities(14). DCA can also be used to show if a particular model would be beneficial based on individual patients expectations of acceptable risk(15). In a survival setting, decision curves are generated for individual survival times, which we conducted annually between 1 and 5 years (Figure S.6). The net-benefit of the RSF model was greater than for the Cox regression predictions across all threshold probabilities. This pattern is preserved at all time periods. Importantly, the RSF model also has a net benefit over the use of mortality as a function of pathological TNM stage alone (Figure S.7).

Case and Variable Selection

Figure S.1 Study Flow Diagram



Factor levels in Clinical T Stage were collapsed so that T0,Tis and T1 were a single category due to low number of patients in these categories and similar observed survival (as shown below in Figure S.2). TX was treated as missing.

Figure S.2 Observed survival for 6399 patients after oesophagectomy, stratified by cT Stage





Table S.1 Variables considered for inclusion

Preoperative Operative Post-Operative
Gender Any Surgical Complication Involved Longitudinal Margin
Age Anastomotic Leak Invovled Circumferential Margin
Site of Tumour Chyle Leak Number of Positive Lymph nodes
Histopathology Bleeding Complication pT/ypT
cT Cardiac Complication Differentiation Grade
cN Acute Kidney Injury
cM Pneumonia
Neoadjuvant Treatment Modality ARDS
Surgical Approach Pulmonary Embolism
Type of Operation Pleural Effusion
Field of Nodal Dissection Any Respiratory Complication
Cardiovascular Comorbidity Infective Complication
COPD Return to Theatre
Chronic Kidney Disease Number of Operations
Chronic Liver Disease
Diabetes Mellitus
Psychiatric Illness
Cerebrovascular Disease
Barretts Oesophagus
Number of Comorbidities
Performance Status
ASA Grade





Table S.2 Full Model Variable Importance by permutation (VIMP),ordered by magnitude. Variables marked with * were included in the final model

Variable VIMP LCI UCI
Total Positive Lymph Nodes* 10.88 9.76 12.00
pT/ypT* 3.54 2.90 4.19
Involved Circumferential Margin* 1.70 1.41 2.00
Age* 0.39 0.28 0.49
cT* 0.32 0.19 0.45
cN* 0.22 0.11 0.34
Neoadjuvant Treatment* 0.15 0.08 0.22
Completion of Neoadjuvant Treatment* 0.14 0.06 0.22
Differentiation Grade (Worst)* 0.09 0.02 0.15
Involved Longitudinal Margin* 0.07 0.03 0.10
Gender* 0.06 0.01 0.12
Site of Tumour* 0.06 0.02 0.10
Any Complication* 0.05 0.00 0.11
Approach 0.03 -0.02 0.08
ASA 0.02 -0.04 0.08
Return to Theatre 0.02 -0.02 0.06
Procedure 0.02 -0.01 0.05
Histopathology* 0.02 -0.01 0.04
Anastomotic Leak 0.01 0.00 0.03
Hospital Volume 0.01 -0.02 0.05
COPD 0.01 -0.02 0.03
Barrett’s 0.01 -0.01 0.02
Respiratory Complication 0.01 -0.02 0.03
ARDS 0.01 0.00 0.01
DM 0.00 -0.01 0.02
Chyle Leak 0.00 -0.01 0.02
IHD 0.00 -0.03 0.04
Pneumonia 0.00 -0.01 0.02
Cardiac Complication 0.00 -0.02 0.02
cM 0.00 0.00 0.00
CVD 0.00 -0.01 0.01
Number of Procedures 0.00 -0.02 0.02
Renal Complication 0.00 0.00 0.00
Pulmonary Embolism 0.00 0.00 0.00
Chronic Liver disease 0.00 0.00 0.00
Bleeding Complication 0.00 0.00 0.00
Pleural Effusion 0.00 -0.01 0.01
Psychiatric Illness 0.00 -0.01 0.00
CKD 0.00 0.00 0.00
Infective Complication 0.00 -0.01 0.00
Performance Status 0.00 -0.05 0.05





Table S.3 AJCC Prognostic Model Criteria’

Inclusion Criteria Checklist
The probability of overall survival, disease-specific survival (DSS), or disease-specific mortality (DSM) must be the outcome predicted Predicts Overall Survival
The model should address a clinically relevant question Yes
At face value, the model should include the relevant predictors, or explain why something relevant was not included Yes
The model validation study should specify precisely which patients were used to evaluate the model and the validation dataset’s inclusion/exclusion criteria See Patient Flow Diagram
The model should be assessed for generalizability and external validation Internal validation conducted
The model should have a well-defined prognostic time zero Time of Surgery
All predictors must be known at time zero and sufficiently defined for someone else to use Yes
Sufficient detail must be available to implement the model OR the author must allow free access to the model. Yes
A measure of discrimination must have been reported Yes
Calibration in the small must be assessed (from the external validation data set) and provided Yes
The model should be validated over a time frame and practice setting that is relevant to contemporary patients with disease Yes
It should be clear which initial treatment(s), if any, were applied, and with what frequency Yes
The development and/or the validation of the prediction model must appear as a peer-reviewed journal article Yes
Exclusion Criteria Checklist
A substantial proportion of patients had essentially no follow up, either missing entirely or very short censored follow up, in the validation dataset No
No information on number of missing values in validation dataset Provided
The number of events in the validation dataset is small (<100) No



Variable Importance Measures

Partial dependence plots can be used to visualise the relative importance of different variables. The partial dependence function at specified variable values (e.g. Total Positive Lymph nodes = 5 or Neoadjuvant Treatment = Chemoradiotherapy) is calculated as the average prediction of the RSF if all cases in the training set were changed to have this value. It therefore allows visualisation of the average marginal effect of individual variables and how changing them effects overall predictions, although does not assess how individual variables interact. Figure S.3 and Figure S.4 show this function for the variables included in the final model, ordered by importance. The differential survival can be seen to be far more important for the first 4 variables compared to the others. Variable interactions are not however included in this assessment and can be difficult to effectively represent.

Figure S.3 Partial Dependence Plots (1). (A) pN/ypN, (B)pT/ypT, (C) Circumferential Margin, (D) Age at diagnosis, (E) cT, (F) cN



Figure S.4 Partial Dependence Plots(2). (A) Neoadjuvant Treatment Modality, (B) Completion of Neoadjuvant Treatment, (C) Grade of Differentiation, (D) Longitudinal Resection Margin,(E) Gender,(F) Site of Tumour,(G) Surgical Complications, (H) Histological Diagnosis



Cox Proportional Hazards Model

A CPH was fitted to the variables used for the RSF model with no interactions or transformations. The proportional hazards assumption was assessed using Schoenfeld residuals. To assess for non-linearity of variables, the Martingale residuals of continuous variables separately entered into a null Cox model (Age and Number of Positive Lymph nodes) were calculated and a loess smoother applied. Both of these variables showed a clear non-linear form. A square root transformation improved the fit of positive lymph nodes but standard transformation (power or logarithmic) did not results in an improvement for age (more details available on request). A restricted cubic spline was therefore applied to age with 4 knots (ages 49, 62, 69 and 78).

Table S.4 Cox Proportional Hazards Model Coeffficients

Factor Beta Hazard Ratio (95% CI) P Value Proportional Hazards Chi Square Proportional Hazards P Value
Site of Tumour Mid/Upper Oesophagus 1
Lower Oesophagus 0.001 1 (0.87 - 1.15) <0.001 1.49 0.222
GOJ -0.139 0.87 (0.75 - 1.01) <0.001* 0.24 0.623
Gender Male 1
Female -0.212 0.81 (0.73 - 0.9) <0.001* 0.12 0.733
Histopathology Adenocarcinoma 1
SCC 0.216 1.24 (1.08 - 1.42) 0.002* 3.85 0.05
cT T0/1/is 1
T2 0.198 1.22 (0.95 - 1.56) <0.001* 0.34 0.559
T3 0.213 1.24 (0.97 - 1.58) <0.001* 0.47 0.491
T4 0.176 1.19 (0.89 - 1.61) 0.246 0.23 0.628
cN N0 1
N1 0.127 1.14 (1.03 - 1.25) 0.011* 0.76 0.384
N2 0.135 1.14 (1.01 - 1.3) <0.001* 0.2 0.656
N3 -0.027 0.97 (0.77 - 1.23) 0.824 0.24 0.624
Neoadjuvant Treatment None 1
CRT - Completed 0.19 1.21 (0.94 - 1.56) 0.148 0.8 0.372
CRT - Not Completed 0.813 2.25 (1.03 - 4.92) 0.041* 0.09 0.765
CT - Completed 0.009 1.01 (0.89 - 1.14) 0.882 0 0.967
CT - Not Completed 0.158 1.17 (0.95 - 1.44) 0.134 0.01 0.934
Any Complication No 1
Yes 0.116 1.12 (1.04 - 1.21) <0.001* 0.07 0.784
Total LN Positive 0.455 1.58 (1.52 - 1.63) <0.001* 0.12 0.728
pT/ypT T0 1
T1 0.201 1.22 (0.91 - 1.63) 0.175 1.11 0.291
T2 0.507 1.66 (1.26 - 2.19) <0.001* 0.83 0.363
T3 0.925 2.52 (1.95 - 3.25) <0.001* 1.2 0.272
T4 1.445 4.24 (3.18 - 5.66) <0.001* 2.16 0.141
Involved CRM No 1
Yes 0.303 1.35 (1.24 - 1.48) <0.001* 1.32 0.251
Involved LRM No 1
Yes 0.284 1.33 (1.11 - 1.59) 0.002* 0.42 0.515
Differentiation Grade Well 1
Moderate 0.304 1.36 (1.05 - 1.76) 0.022* 0.3 0.584
Poor/Anaplastic 0.441 1.55 (1.2 - 2.02) 0.001* 5.41 0.02
GX 0.277 1.32 (0.99 - 1.75) 0.057 2.32 0.128



Integrated Area under the Curve



Figure S.5 Random Forest Model AUC across time points. Dashed lines represent 95% confidence intervals



Decision Curve Analysis



Figure S.6 Net benefit of the RSF model compared with the CPH model. The Black line represents the RSF model and the red line the CPH Model





Figures S.7 Net benefit of the RSF model compared with p/ypTNM stage. The Black line represents the RSF model and the red line TNM Stage



Sensitivity Analysis

To assess how the modelling process performed in different situations we first chose a list of variables that were available preoperatively. A preoperative prediction model has obvious uses, first in terms of tailoring treatments (e.g. Neoadjuvant Chemotherapy vs Neoadjuvant Chemoradiotherapy) and avoiding futile treatment (e.g. if the risk of death is extremely high in the early post-operative period. 22 variables were included, shown in the preoperative column of Main text Table 1.

This model was then compared to the calibrated full RSF model. The tAUC of the RSF model using these variables was 73.2 (95% CI 72.1-74.5%), compared to 84.8% (95% CI 83.1-86.9%), for the full model. Calibration was also poor, with a brier score of 0.171 (0.133 for full model), and visually worse calibration compared to the full model.



Examples of use

To illustrate how the model provides discrete predictions, two example cases are described below:

Case 1: 60-year-old male patient with adenocarcinoma of the oesophagus, who undergoes neoadjuvant chemotherapy before an oesophagectomy. His post-operative pathology reveals a well differentiated T3N1 (1/30) M0 tumour with no circumferential or longitudinal margin involvement. The ypTNM Stage is 3b.

Case 2: 60-year-old male patient with adenocarcinoma of the oesophagus, who undergoes neoadjuvant chemoradiotherapy which he fails to complete before an oesophagectomy. His post-operative pathology reveals a poorly differentiated T3N2 (6/30) M0 tumour with circumferential margin involvement. He also suffers from a post-operative complication. The ypTNM Stage is also 3b.

Figures S.8 Example Cases Predicted Survival



A marked difference can easily be seen between these two cases, with a predicted 3-year survival of 71.3% and 24.6% respectively for two cases with identical ypTNM staging. The mean survival observed for TNM3b patients is 38.6%.





External Validation Instructions

A basic knowledge of R is required to conduct the external validation. As the model does not generate coefficients, access to the model itself is required.

First, download the file packet from the web application in the ‘Model Details’ tab. This contains the models themselves and the manner in which dummy coding was conducted.

An example blank dataframe is also included showing the structure in which data must be presented to the model. Care should be taken to match the variables/names/factor-levels in this file. If the model fails to generate predictions, it is probably due to a discrepancy here.

Then access and download the R script from github:

https://github.com/saqibrahmanUGI/AUGIS-Surv/blob/master/externalvalid.r

Running this script will firstly install and load the needed R packages, then batch generate predictions, calculate the tAUC and C-index, plot annual calibration curves and plot quintiles of prediction against observed KM estimates.





Supplementary References

  1. van Buuren S, Groothuis-Oudshoorn K. MICE: Multivariate Imputation by Chained Equations in R. J Stat Softw. 2011;45(3):1–67.
  2. van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med. 1999 Mar 30;18(6):681–694.
  3. White IR, Royston P. Imputing missing covariate values for the Cox model. Stat Med. 2009;28:1982–1998.
  4. White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Stat Med. 2011;30(4):377–399.
  5. Moons KGM, Donders RART, Stijnen T, Harrell FE. Using the outcome for imputation of missing predictor values was preferred. J Clin Epidemiol. 2006;59(10):1092–1101.
  6. Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. Multiple imputation for missing data in epidemiological and clinical research: Potential and pitfalls. BMJ. 2009;339(7713):157–160.
  7. Marshall A, Altman DG, Holder RL, Royston P. Combining estimates of interest in prognostic modelling studies after multiple imputation: Current practice and guidelines. BMC Med Res Methodol. 2009;9(1):1–8.
  8. Wood AM, Royston P, White IR. The estimation and use of predictions for the assessment of model performance using large samples with multiply imputed data. Biometrical J. 2015 Jul;57(4):614–632.
  9. Hosmer D, Lemeshow S. Applied Survival Analysis: Regression Modeling of Time to Event Data. 1st ed. New York: John Wiley & Sons, Inc.; 1999.
  10. Blanche P, Dartigues J-F, Jacqmin-Gadda H. Estimating and comparing time-dependent areas under receiver operating characteristic curves for censored event times with competing risks. Stat Med. 2013 Dec 30;32(30):5381–5397.
  11. Kamarudin AN, Cox T, Kolamunnage-Dona R. Time-dependent ROC curve analysis in medical research: Current methods and applications. BMC Med Res Methodol. 2017;17(1):1–19.
  12. Vickers AJ, Elkin EB. Decision curve analysis: A novel method for evaluating prediction models. Med Decis Mak. 2006;26(6):565–574.
  13. Vickers AJ, Cronin AM, Elkin EB, Gonen M. Extensions to decision curve analysis, a novel method for evaluating diagnostic tests, prediction models and molecular markers. BMC Med Inform Decis Mak. 2008;8:1–17.
  14. Steyerberg EW, Vickers AJ. Decision curve analysis: a discussion. Med Decis Mak. 2008;28(1):146–149.
  15. Steyerberg EW, Vergouwe Y. Towards better clinical prediction models: Seven steps for development and an ABCD for validation. Vol. 35, European Heart Journal. Oxford University Press; 2014. p. 1925–1931.