1 Introduction

This article should serve as a tutorial on using the publicly available code which accompanies the paper “Effective Ways to Build and Evaluate Individual Survival Distributions”. There are two primary components to this article, (1) a quickstart option which gives an overview of using the master file, “analysisMaster.R”, and (2) the entire code overview which details all the functions and files found in the code. While using the “analysisMaster.R” file gives users access to most of the functionality of the codebase, no specialized modifications can be made to the models/evaluations directly so we give a detailed explanation of how to start with a dataset and end with evaluations of different models.

2 Quickstart Guide (analysisMaster.R)

Before you begin using the code, make sure all required packages are downloaded: caret, dataPreparation, ggplot2, reshape2,randomForestSRC, Rcpp, prodlim, survival, fastcox, plyr,and dplyr. To save you some time I have supplied a code snippet you can use to download any of the packages you may be missing:

#Taken from https://gist.github.com/stevenworthington/3178163
ipak <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}
packages <- c("caret", "dataPreparation", "ggplot2", "reshape2","randomForestSRC", "Rcpp", "prodlim", "survival", "fastcox", "plyr", "dplyr")
ipak(packages)

Next we will load the example data (which can be found on the PSSP website). Details about this dataset can be found in the Introduction to the Example Data.

exampleDat = read.csv(file = "http://pssp.srv.ualberta.ca/system/predictors/datasets/000/000/076/original/All_Data_updated_may2011_Stage4_Stomach.csv?1354708504")

To use this codebase, the time to event feature must be named time, and the indicator of death (uncensored) should be named delta. In the example data these are the first two columns and are currently named SURVIVAL and CENSORED, so we make this change. Additionally, this data has CENSORED = 1 mean that a patient was censored. We need 1 to indicate a death.

names(exampleDat)[c(1,2)] = c("time", "delta")
exampleDat$delta = 1 - exampleDat$delta 

Now that the data is formatted correctly and packages are loaded, go ahead and load “analysisMaster.R”

source("analysisMaster.R")

This file contains the function analysisMaster() which takes many arguments, each of which controls some aspect of either the different models or evaluations.

Arguments:

For this example we will simply use all the default arguments (with the exception of verbose, which we set to FALSE for brevity). Note that this command may take a few minutes to run (excluding Random Survival Forests will speed up computation time).

#RSF and CoxEN are non-deterministic, so set the seed for reproducible results. (Additional stochasticity is included in 1-Calibration).
set.seed(42)
res = analysisMaster(exampleDat, verbose = FALSE)
The variable(s) SITE_BRUNCHUS_LUNG, SITE_COLORECTAL, SITE_HEAD_AND_NECK, SITE_ESOPHAGUS, SITE_PANCREAS, SITE_STOMACH, SITE_OTHER_DIGESTIVE, MISC, STAGE_1, STAGE_2, STAGE_3, STAGE_4, STAGE_NUMERICAL contained only 1 unique value. They have been removed.

You will notice some variables were removed – See Section 4.1 for more details.

res is now a list of 3 items:

Looking at each of these in more detail, note the difference in size between res$datasetUsed and exampleDat.

dim(exampleDat)
[1] 128  53
dim(res$datasetUsed)
[1] 128  22


Additionally, when we sourced “analysisMaster.R” we brought in plotSurvivalCurves() which can be used to examine the survival curves of each model. Here we show examples of Cox Elastic-Net, Random survival forests, and Multi-task Logistic regression, however, survival curves for all models are available.

#CoxEN
plotSurvivalCurves(survivalCurves = res$survivalCurves$CoxEN, indexToPlot = 1:10)

#RSFKM
plotSurvivalCurves(survivalCurves = res$survivalCurves$RSF, indexToPlot = 1:10)

#MTLR
plotSurvivalCurves(survivalCurves = res$survivalCurves$MTLR, indexToPlot = 1:10)

Note that each index relates to the corresponding row in res$datasetUsed (e.g. the survival curve for Index 1 is for the for the patient in the first row of res$datasetUsed).

The last component of res is the results dataframe, which includes the evaluations for every fold (Concordance, L1-loss, Brier score) and the calibration evaluated on the entire dataset (1-Calibration, D-Calibration). Below is a view of all the results and an example of aggregating the result. (Note that we average the p-values of 1-Calibration and D-Calibration because they are equal across the folds because they were evaluated on all the test data as opposed to per fold – see Section 6.3 and Section 6.6)

res$results
res$results %>%
  group_by(Model)  %>%
  summarise(N = mean(N),
            NumFeatures = mean(NumFeatures),
            PercentCensored = mean(PercentCensored),
            DCalibration = mean(DCalibration),
            OneCalibration10th = mean(OneCalibration_1),
            OneCalibration25th = mean(OneCalibration_2),
            OneCalibration50th = mean(OneCalibration_3),
            OneCalibration75th = mean(OneCalibration_4),
            OneCalibration90th = mean(OneCalibration_5),
            ConcordanceM = mean(Concordance),
            ConcordanceSD = sd(Concordance),
            SingleBrierM = mean(BrierSingle),
            SingleBrierSD = sd(BrierSingle),
            IntegratedBrierM = mean(BrierInt),
            IntegratedBrierSD = sd(BrierInt),
            L1M = mean(L1Loss),
            L1SD = sd(L1Loss))

Here we have given a fairly brief introduction to using the codebase to run the 6 survival models across the 5 different evaluation metrics. For a much more detailed introduction to the code (and a walk through of all the functions) continue on to the next sections or jump to a specific section using the table of contents.

3 Code Overview and Introduction to Example Data

The code here has been written for survival datasets, i.e. datasets are comprised of (1) a variable which indicates time of event, (2) an indicator variable for RIGHT censoring, and (3) features for each patient. Note that the code only allows for right censored patients, left censored or interval censored data will not work with the current implementation. To follow along with the examples please make sure that you have downloaded all the required packages (see the first paragraph of the Quickstart Guide), have your working directory in ~/ISDEvaluation, and sourced all the files (available by running `source(“analysisMaster.R”)).

As a case study we will use the Northern Alberta Cancer Dataset (NACD) which was also included in the original paper. However, to simplify things we will use a new subset of the dataset, namely we will only consider patients with stage 4 stomach cancers. This data contains a total of 128 patients (100 uncensored) and 53 features, and is also available on the github repository under the name “exampleData.csv”. Below we give the summary statistics of the first five variables:

NACD = read.csv(file = "http://pssp.srv.ualberta.ca/system/predictors/datasets/000/000/076/original/All_Data_updated_may2011_Stage4_Stomach.csv?1354708504")
head(NACD)
summary(NACD[,1:5])
    SURVIVAL          CENSORED          GENDER         BOX1_SCORE      BOX2_SCORE 
 Min.   : 0.2333   Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :0.0  
 1st Qu.: 4.3447   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.0  
 Median : 9.9877   Median :0.0000   Median :1.0000   Median :2.000   Median :1.0  
 Mean   :14.0799   Mean   :0.2188   Mean   :0.6797   Mean   :2.109   Mean   :1.5  
 3rd Qu.:17.2667   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:2.0  
 Max.   :62.7667   Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :4.0  

Here, we have SURVIVAL as the feature indicating the time until the event occurred and CENSORED representing the indicator for death, that is, a 1 indicates the patient was censored and a 0 indicates the patient died – these two types of variables are required for all datasets in this codebase. Feel free to explore the other variables if you are interested.

A quick note on notation – In general features and packages will be given as verbatim, e.g. package1 and feature1, functions will also be verbatim but also have parentheses e.g. function1(), arguments to functions will be italicized, variable assignments will be bolded, and files will be quoted (“file1.R”). However, there are exceptions when functions/file names are contained in section headers.

4 Pre-processing the Data

4.1 Validating and Cleaning (validateAndClean.R)

File Purpose: This file is used to make sure the survival dataset is in the correct format for analysis. For example, the variable used for survival time must be named exactly time. Additionally, a minimal amount of data cleansing is completed in this file.

File Path: “ISDEvaluation/ValidateCleanCV/validateAndClean.R”.

Functions:

  1. validateAndClean(survivalDataset, imputeZero=T)
  2. validate(survivalDataset, imputeZero)
  3. clean(survivalDataset)

Note: You will notice this file structure throughout the codebase, there will often be a master function (here validateAndClean()) which runs the other functions included in the file, e.g. validate() and clean().

Arguments:

  • survivalDataset: a (non-empty) dataframe representing the survival dataset – e.g. NACD
  • imputeZero: a Boolean indicating if zero valued survival times should be imputed (see below for details).

Packages Required:

  1. caret – The nearZeroVar() function to identify variables with only 1 unique value and remove them in the clean() function.

4.1.1 validate(survivalDataset, imputeZero)

The function validate() checks a number of things:

  1. A survival dataset was nonempty.
  2. The survival time variable MUST be named time.
  3. Similarly, the indicator of death variable must be named delta. The naming choice of this variable is the common notation in survival analysis.
  4. If time includes NA values these rows will be removed.
  5. If time includes Inf values these rows will be removed.
  6. If time includes negative values these rows will be removed.
  7. If time includes zero values there are two options depending on the value of the argument imputeZero. If imputeZero = TRUE, then zero valued times will be imputed to half the minimum non zero valued time. For example in the NACD dataset we are using here the minimum time is 0.2333. If there was another row where the event time was 0, this would be imputed to \(\frac{0.2333}{2} = 0.1166\). We include this option as some survival models (e.g. Accelerated Failure Time with the Weibull distribution) will fail if given a zero valued time. If imputeZero = FALSE then nothing will happen to these zero valued survival times. Currently there is no option to remove zero valued times other than removing them from the dataset prior to running the code available here.
  8. If delta includes NA values these rows will be removed.
  9. If delta includes Inf values these rows will be removed.
  10. delta must only have two unique values.
  11. The unique values of delta must be 0 and 1 – 0 for censored data and 1 for uncensored data.

If we try to use validate() on our NACD dataset we will get an error since currently our survival time variable is named Time and our death indicator is named Status. Note that below we include imputeZero = TRUE, but since there are no zero-valued times the same result would occur if we included imputeZero=FALSE.

validatedNACD = validate(NACD, imputeZero = TRUE)
Error in validate(NACD, imputeZero = TRUE) : 
  The variable 'time' in not included in the given dataset.
names(NACD)[1] = "time"
validatedNACD = validate(NACD, imputeZero = TRUE)
Error in validate(NACD, imputeZero = TRUE) : 
  The variable 'delta' in not included in the given dataset.
#Additionally, we need to swap the values for delta since CENSORED = 1 meant a patient was censored. We need 1 to mean a death.
names(NACD)[2] = "delta"
NACD$delta = 1 - NACD$delta
validatedNACD = validate(NACD, imputeZero = TRUE)

4.1.2 clean(survivalDataset)

Here, clean() does a very minimal amount of cleaning on the survival dataset. Specifically, the goal is to remove variables which will be not be meaningful to analysis. There are two cases where variables are removed:

  1. A variable contains all unique values. For example, in our NACD dataset there is a STAGE_4 variable which is an indicator (1 or 0) whether or not the cancer is in stage 4. Since we only have stage 4 stomach cancer patients this variable will be be all 1s and will be removed from the dataset.
  2. Variables such that 25% of rows (patients) have a missing value will be removed. However, the coding of missing data must be such that is.na() would identify the data as missing. For example is.na("") returns TRUE even though an empty string likely indicates a missing value.

Whenever a variable is removed or turned into a factor, clean() will return a warning indicating the name of the variable(s) which were removed/factored. For example, on our validated NACD dataset we can now call clean():

cleanedNACD = clean(validatedNACD)
The variable(s) SITE_BRUNCHUS_LUNG, SITE_COLORECTAL, SITE_HEAD_AND_NECK, SITE_ESOPHAGUS, SITE_PANCREAS, SITE_STOMACH, SITE_OTHER_DIGESTIVE, MISC, STAGE_1, STAGE_2, STAGE_3, STAGE_4, STAGE_NUMERICAL contained only 1 unique value. They have been removed.

As noted above a number of variables were removed due to having only 1 unique value.

4.1.3 validateAndClean(survivalDataset, imputeZero=T)

This function simply runs both validate() and clean() on a survival dataset. Note that here we use default the default (TRUE) for imputeZero. Below we verify that validateAndClean(NACD) is equivalent to cleanedNACD.

identical(validateAndClean(NACD),cleanedNACD)
[1] TRUE

4.2 Creating Cross-Validation Folds and Normalizing Features (createFoldsAndNormalize.R)

File Purpose: This file is used to create folds from the survival dataset for cross validation and normalize features. As part of normalization, mean/mode imputation is employed to handle missing data.

File Path: “ISDEvaluation/ValidateCleanCV/createFoldsAndNormalize.R”.

Functions:

  1. createFoldsAndNormalize(survivalDataset, numberOfFolds)
  2. createFoldsOfData(survivalDataset, numberOfFolds)
  3. meanImputation(listOfDatasets)
  4. normalizeVariables(listOfImputedDatasets)

Arguments:

  • survivalDataset: the survivalDataset AFTER going through validation and cleaning – e.g. cleanedNACD
  • numberOfFolds: the desired number of cross-validation folds, must be greater than 1.
  • listOfDatasets: a list containing a list of Training datasets and a list of Testing datasets - i.e. the output of createFoldsOfData()
  • listOfImputedDatasets: same as listOfDatasets but missing values have been imputed (the output of meanImputation())

Packages Required:

  1. caret – The dummyVars() function is used to create a one hot encoding of factor (nominal) variables.
  2. dataPreparation – The build_scales() and fastScale() function are used to normalize variables.

4.2.1 createFoldsOfData(survivalDataset, numberOfFolds)

This code base requires that data be in cross-validation format with a minimum of 2 folds. The cross-validation used here is specific to survival analysis in that cross-validation is deterministic – rows are first stratified by their censoring status (delta) and then sorted by time and sequentially placed into folds (see Figure 1). The reasoning behind this procedure is two-fold (pun-intended):

  1. The number of censored patients should be (roughly) equal across all folds.
  2. The range of event times should also be (roughly) equal across all folds.

By creating folds in this way models will be tested on data which will be mostly representative of what they saw in the training data.

Figure 1: Depiction of 3-fold cross-validation scheme used for survival data. The dataset is first stratefied by delta and then sorted by time. Rows are then sequentially ordered into the three folds.

Figure 1: Depiction of 3-fold cross-validation scheme used for survival data. The dataset is first stratefied by delta and then sorted by time. Rows are then sequentially ordered into the three folds.

In addition to creating cross validation folds, createFoldsOfData() also returns the indexes of the rows that each test set belongs. See the below example for details. Here we put our cleaned NACD dataset (cleanedNACD) into 5 folds:

folds = createFoldsOfData(cleanedNACD, 5)
#folds comprises a list of (2) items. (1) is the indexs of the test set. For example, examine the first test set fold:
testSetIndexs = folds[[1]]
#Test fold 1:
testSetIndexs[[1]]
 [1] 123 101  62  18  96  21 110 125  17  34 122 124  97  44  10  95  51  91 112  72 106  74  79 105  14  77

Here the first test fold is comprised of the 123rd, 101st, 62nd, …, and the 77th rows of the original data. We will later use these indexes for plotting in Section 7.1.

The second item returned by createFoldsOfData() is a list of (1) the training sets, and (2) the testing sets. Below we simply give the dimension of each:

trainingAndTestingSets = folds[[2]]
#First Training Set:
dim(trainingAndTestingSets$Training[[1]])
[1] 102  40
#First Testing Set:
dim(trainingAndTestingSets$Testing[[1]])
[1] 26 40

4.2.2 meanImputation(listOfDatasets)

Following the splitting of folds we perform imputation for missing data. Since the primary focus of this work was not in any way geared towards imputation methods we do a very basic mean imputation for missing values within fold. The means for all variables for each training fold are computed and the corresponding test fold is imputed with the mean value from the training fold.

In order to impute factor (nominal) variables, we use the one-hot encoding of these variables and then impute missing values with the mean of the one-hot encoded vector. Note that time and delta are not subject to imputation – if either of these variables contained missing values the entire row was removed in validate().

Below we would impute missing data, however, the NACD dataset we are using has no missing data so the following step is not necessary.

imputedData = meanImputation(trainingAndTestingSets)

4.2.3 normalizeVariables(listOfImputedDatasets)

Following variable imputation, all the variables are normalized (except time and delta) by taking the scales (mean and standard deviation) of the training set variables and normalizing both the training set and test set with those scales. Here normalizing follows the standard convention of subtracting the mean and dividing by the standard deviation. This function also does some cleaning of variable names – if a variable name has a comma in it functions later of will break so all commas and other special characters are removed.

Below we normalize the data and show the difference between the data before and after it was normalized.

head(imputedData$Training[[1]])
normalizedData = normalizeVariables(imputedData)
head(normalizedData$Training[[1]])

4.2.4 createFoldsAndNormalize(survivalDataset, numberOfFolds)

Similar to validateAndClean() above, createFoldsAndNormalize() serves as a master function which will run createFoldsOfData(), meanImputation(), and normalizeVariables() and will return a list of (2) items, (1) the indexes of the test set as given in Section 2.2.1, and (2) a list containing: (1) a list of normalized training folds and (2) a list of normalized testing folds.

Here we show the equivalence between testSetIndexs from above and the first item of createFoldsAndNormalize() and equivalence of normalizedData with the second item of createFoldsAndNormalize():

output = createFoldsAndNormalize(cleanedNACD,5)
all.equal(testSetIndexs, output[[1]])
[1] TRUE
all.equal(normalizedData, output[[2]])
[1] TRUE

5 Creating Individual Survival Distribution Models

The intention of this work is to examine different survival analysis models which can produce individualized survival distributions for different patients. In total we explored five models which produced individual survival distributions and one model which gave a population based survival curve (Kaplan-Meier). Below we give a minimal description for each model and describe how to implement each in the current codebase.

5.1 Kaplan-Meier (KaplanMeier.R)

The Kaplan-Meier (KM) estimator was devised as a non-parametric approach to estimating a survival curve even in the presence of censored data (see the original paper here). To give the KM estimator we first make the following definitions:

  • \(\tau_j\) – the \(j^\textrm{th}\) distinct death time,
  • \(d_j\) – the number of deaths at time \(\tau_j\),
  • \(r_j\) – the number of patients who were at risk at time \(\tau_j\). Note this also includes those who died at \(\tau_j\).

With these definitions, we say the probability of survival at time \(t\) is given by the KM estimator as \[\hat{S}_{KM}(t) = \prod_{j: \tau_j < t} \left(1 - \frac{d_j}{r_j}\right)\]. Note that here KM does not take into account any features and thus is considered to a non-parametric model. To give an example of the Kaplan-Meier approach consider the following data:

Example data for the Kaplan-Meier model. Note the data is transposed from the normal format.
Patient 1 Patient 2 Patient 3 Patient 4 Patient 5
time 1 3 5 7 9
delta 1 0 1 1 1

Given the data above, we have \(\hat{S}_{KM}(1) = \left(1 - \frac{1}{5}\right) = 0.8\). Note that no change occurs at \(t= 3\) since this patient was censored and \(\tau_j\) only refers to unique death times. Below we give the full KM curve for this example.

Figure 2: Kaplan-Meier plot using example data. Censored patients indicated by vertical dash on survival curve.

Since the KM model does not account for individual features, the survival distribution generated is supposed to represent the population’s overall survival curve, i.e. the average survival curve. So given \(K\)-fold cross-validation, there will only be \(K\) different survival curves as opposed to the \(N\) different survival curves the other models generate, given \(N\) unique patients. Next we give the details for running KM on a dataset after it has prepossessed by validateAndClean() and createFoldsAndNormalize().

5.1.1 Kaplan-Meier Code (KaplanMeier.R)

File Purpose: Use the Kaplan-Meier model that has been processed and split into folds by createFoldsAndNormalize().

File Path: “ISDEvaluation/Models/KaplanMeier.R”.

Functions:

  1. KM(training, testing)

Arguments:

  • training: The training set.
  • testing: The testing set.

Packages Required:

  1. survival – We require the Surv() function to change the data into a survival format.
  2. prodlim – prodlim contains a Kaplan-Meier implementation that includes a predict() function.

File Notes:

Given the list of datasets generated by createFoldsAndNormalize(), KM() will take a single training and testing set and build the Kaplan-Meier curve for the training set using the prodlim() function from the prodlim package. The output (along with the output of all following models) is a list of four objects:

  1. TestCurves: The survival curves for test patients (identical for all patients for KM).
  2. TestData: The values of time and delta for the test patients (used later when evaluating survival curves).
  3. TrainData: The values of time and delta for the training patients (used later when evaluating survival curves).
  4. TrainCurves: The survival curves for the training patients.

Note the output is the same for all following models.

Example Usage

Here, we give an example of using KM on a single training/testing fold – extending the results to all folds can be accomplished by using a for loop.

validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
#Recall the first argument of folds is the indexs of the test set on validatedAndCleaned and the second argument is the list of training and testing sets.
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
kmMod = KM(training1stFold, testing1stFold)
testCurves = kmMod$TestCurves
testDat = kmMod$TestData
trainDat = kmMod$TrainData
testCurves

Above you will see that the survival curves for all patients are the exact same – the desired effect for the Kaplan-Meier model. Note that time will correspond to all the event times (both death and censored event times) of the training data. Additionally we include a time = 0 point where the survival probability will be 1 – this is consistent across all models.

For a more picturesque approach, we give the survival curve below:

One should notice that the survival curve does not drop to zero and instead ends at \(\approx 0.05\) – this is because the last patient in the training data was censored. Additionally, the above plot does not follow the classical Kaplan-Meier step function and is smoother using a spline fit – the details can be found in Section 7.1 which describes the plotting function.

So far we have shown the output of curves but ignored trainDat and testDat which are just the training and testing values for time and delta. Across all the following models, trainDat and testDat will be the same thing, but for clarity we will include them below and not in following models.

trainDat
time delta
15.8333333 1
10.3000000 1
12.5666667 1
13.6333333 1
55.1000000 1
17.5666667 1
10.5409840 1
7.0666667 0
18.1393440 1
15.6147540 1
30.8333333 0
11.4000000 1
11.5000000 1
5.9333333 1
3.1475410 1
7.8114750 1
5.1639340 1
7.1333333 1
19.0333333 1
0.6333333 1
5.8666667 0
16.4000000 1
2.8114750 1
15.0737700 1
12.5491800 1
12.3360660 1
7.0666667 1
35.3196720 1
2.6393440 1
19.2666667 1
17.1666667 1
50.7666667 1
0.7540980 1
4.4000000 1
19.0819670 1
10.6803280 1
1.6000000 1
10.6803280 1
44.7333333 1
35.9333333 1
0.9672130 1
13.9754100 1
31.9666667 0
48.7666667 0
1.6000000 1
6.6393440 1
10.3852460 1
5.5573770 1
20.7000000 1
1.7704920 1
13.2704920 1
4.3666667 1
2.9333333 1
1.2049180 1
0.9333333 0
9.2000000 1
2.4000000 1
6.5409840 1
6.3000000 0
32.1000000 1
6.8770490 1
2.1666667 1
36.4333333 1
7.3333333 0
15.3666667 1
9.8032790 1
62.7666667 0
3.5901640 1
31.3666667 0
36.7333333 0
5.1967210 1
43.6333333 0
1.2622950 1
13.9098360 1
10.1721310 1
13.8666667 1
25.1000000 1
11.5573770 1
5.1666667 0
4.2786890 1
19.8000000 1
7.1967210 1
3.1967210 1
7.9098360 1
42.0333333 0
8.5819670 1
1.3333333 0
9.0333333 0
6.1311480 1
4.1639340 1
3.3770490 1
10.6333333 1
16.6666667 1
5.3666667 1
4.2000000 0
1.2295080 1
12.9666667 1
2.6666667 0
37.5333333 0
56.8000000 0
0.4666667 0
4.5666667 0



testDat
time delta
0.2333333 0
3.7000000 0
6.1000000 0
23.5666667 0
36.8000000 0
50.1333333 0
0.7704920 1
1.2666667 1
2.2622950 1
3.0409840 1
4.1475410 1
4.9333333 1
5.5737700 1
6.8000000 1
7.2704920 1
9.5901640 1
10.4016390 1
10.7704920 1
12.5000000 1
13.5000000 1
14.7666667 1
16.3333333 1
18.0737700 1
19.5000000 1
35.2666667 1
46.8333333 1

5.2 Accelerated Failure Time

The Accelerated Failure Time (AFT) has been a commonly used survival analysis technique for finding significant features in a survival setting. The model is motivated by assuming survival time, \(T\) follows some distribution with density function \(f(t)\) and survival function \(S(t)\). Common AFT distributions are log-logistic, Weibull, log-normal, exponential and more. While we make assumptions on the distribution of \(T\), inference is instead made on \(\log(T)\), that is, for a patient, \(\mathbf{x_i}\) where \(i = 1,2,\ldots N\), AFT models follow the equation: \[ log(T_i) = \boldsymbol{w}^T\boldsymbol{x_i} + \sigma\epsilon_i, \] where \(\boldsymbol{w}\) are feature weights, \(\sigma\) is called a scale parameter and \(\epsilon_i\) is an independent, identically distributed noise parameter dependent on the assumed distribution of \(T\). This is essentially a normal linear regression but on the log scale of \(T\), i.e. features are assumed to have a linear relationship with the log of survival time. Here, \(\sigma\) is assumed to be constant across all patients and can be learned by maximum likelihood along with the weights \(\boldsymbol{w}\).

5.2.1 Accelerated Failure Time Code (AcceleratedFailureTime.R)

File Purpose: Apply the Accelerated Failure Time model (for a number of distributions) to data that has been processed and split into folds by createFoldsAndNormalize().

File Path: “ISDEvaluation/Models/AcceleratedFailureTime.R”.

Functions:

  1. AFT(training, testing, AFTDistribution)
  2. survfunc(object,t,newdata, name) - Used to extract survival cures from an AFT model. Altered from a function previously written by Timothy Johnson.

Arguments:

  • training: The training set.
  • testing: The testing set.
  • AFTDistribution: The distribution for the AFT model. Possible distributions: exponential, weibull, lognormal, gaussian, loglogistic, logistic.
  • object: A survreg object from the survreg() function.
  • t: The times for which to evaluate the AFT model.
  • newdata: The testing data to predict with using the AFT model.
  • name: The name of the variable containing the predicted times (t).

Packages Required:

  1. survival – We use the survreg() function to perform analysis for AFT models.

File Notes:

The primary function for this file is AFT(); survfunc() is a helper function AFT uses to build survival curves for test data after the model has been learned.

Given the list of datasets generated by createFoldsAndNormalize(), AFT() will take a single training set, testing set, and a specified distribution and learn the AFT model using the training set. Only one distribution can be tested at a time. In addition, the code contains a tryCatch command the catch the event where the AFT model fails to run, however, this only showed to occur for high dimensional datasets (when no feature selection was applied).

As with the Kaplan-Meier model, the output is a list of four items:

  1. TestCurves: The survival curves for test patients (identical for all patients for KM).
  2. TestData: The values of time and delta for the test patients (used later when evaluating survival curves).
  3. TrainData: The values of time and delta for the training patients (used later when evaluating survival curves).
  4. TrainCurves: The survival curves for the training patients.

Example Usage:

validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
AFTMod = AFT(training1stFold, testing1stFold, "weibull")
curves = AFTMod$TestCurves
curves


5.3 Cox-Proportional Hazards with Kalbfleisch-Prentice Extension

The Cox proportional-hazards (Cox-PH) model is extremely common in the survival analysis literature. Previously, we mentioned the Kaplan-Meier model which modeled the survival function, \(S_{KM}(t)\) of a group of patients. Next, the Accelerated Failure Time model assumed a distribution on the survival time, \(T\), which allowed us to include features and build curves for individual patients. Instead, Cox-PH models the hazard function of patients, where hazard is interpreted as the risk of failure (dying) at any given time, specifically for a time \(t\) and covariates \(\boldsymbol{x}\) the hazard function is defined as \[ h(t\,|\,\boldsymbol{x}) = \lim_{\Delta t \rightarrow \,0} \frac{P(t \leq T < t + \Delta t\,\,\, |\,\,\, T \geq t, \boldsymbol{x} )}{\Delta t}. \]

Cox-PH models this hazard function in terms of a baseline hazard, \(h_0(t)\), equal for all patients and scaled by a learned, individual risk dependening on features, that is, \(h(t|\boldsymbol{x}) = h_0(t)\,e^{\boldsymbol{w}^T\boldsymbol{x}}\). This is why the model is termed the proportional hazards model, the hazard for one patient is proportional to the hazard of all other patients, by a scale depending on individual features.

To learn the values for \(\boldsymbol{w}^T\) one does not need to know the baseline hazard function. Suppose a single patient died at time \(t_j\). The probability that patient \(i\) was the one who died is given by \[\begin{align*} \frac{h(t_j\,|\,\boldsymbol{x_i})}{\sum_{k \in R(t_j)} h(t_j\,|\,\boldsymbol{x_k})} \quad &= \quad \frac{h_0(t_j)\,e^{\boldsymbol{w}^T\,\boldsymbol{x_i}}}{\sum_{k \in R(t_j)} h_0(t_j)\,e^{\boldsymbol{w}^T\,\boldsymbol{x_k}}}, \\ &= \quad \frac{e^{\boldsymbol{w}^T\,\boldsymbol{x_i}}}{\sum_{k \in R(t_j)} e^{\boldsymbol{w}^T\,\boldsymbol{x_k}}}, \end{align*}\]

where \(R(t_j)\) is the set of patients still alive (at risk) at time \(t_j\). Thus the likelihood for \(\boldsymbol{w}^T\) is defined as \[ L\left(\boldsymbol{w}^T\right) = \prod_j \frac{e^{\boldsymbol{w}^T\,\boldsymbol{x_{i\,(j\,)}}}}{\sum_{k \in R(t_j)} e^{\boldsymbol{w}^T\,\boldsymbol{x_k}}}, \] where subscript \(i (j)\) is interpreted as it is patient \(i\) who died at time \(j\).

By this formulation of the likelihood equation, we see that feature weights are independent of time, e.g in a study of survival post surgery, lab test results immediately following surgery have the same weight one day after surgery as they do a year from surgery.

While we are able to estimate feature weights without specifying the baseline hazard function, we cannot determine the survival distribution. A number of methods exist for calculating the the baseline hazard and therefore the survival distribution including the Kalbfleisch-Prentice estimator which we use in this work. The thesis “Baseline Survival Function Estimator under Proporitional Hazards Assumption” by Weng provides a nice discussion of the Breslow and Kalbfleisch-Prentice estimators.

5.3.1 Cox-Proportional Hazards with Kalbfleisch-Prentice Extension Code (CoxPH_KP.R)

File Purpose: Apply the Cox Proportional-Hazards model to data that has been processed and split into folds by createFoldsAndNormalize().

File Path: “ISDEvaluation/Models/CoxPH_KP.R”.

Functions:

  1. CoxPH_KP(training, testing, ElasticNet)
  2. KPEstimator(lp, lpTime, censorStatus) - Used to estimate the Kalbfleisch-Prentice estimator when doing elastic-net Cox.

Arguments:

  • training: The training set.
  • testing: The testing set.
  • ElasticNet: A Boolean specifying whether or not to
  • lp: The linear predictions (\(\boldsymbol{w}^T\boldsymbol{x}\)) from the Cox-PH model for the training data.
  • lpTime: The training dataset event times, i.e. time.
  • censorStatus: The training dataset censor status, i.e delta.

Packages Required:

  1. survival – Needed for coxph() and survfit().
  2. fastcox – See Elastic-Net Cox.
  3. prodlim – See Elastic-Net Cox.

File Notes:

The primary function for this file is CoxPH_KP(); KPEstimator() is only used for the Elastic-Net Cox model.

Given the list of datasets generated by createFoldsAndNormalize(), CoxPH_KP() will take a single training set, testing set, and a Boolean specifying whether or not to use the elastic-net model. For this section we assume ElasticNet = FALSE, that is we are NOT using the elastic net formulation. The code uses a tryCatch function to avoid models which are failing to converge (this happened for high dimensional datasets). The Cox-PH model is fit using coxph() from the survival package and then survival curves are fit using survfit() and specifying the type to be the Kalbfleisch-Prentice estimator. Ties are accounted for using the default of coxph(), i.e. the Efron approximation.

The output is the same as the four previous models.

Example Usage:

validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)
curves = CoxMod$TestCurves
curves


Note the preceding Figure depicts proportional hazards – all curves are the exact same shape but at different heights, i.e. they are proportional to one another.

5.4 Cox-Proportional Hazards with Kalbfleisch-Prentice Extension – Elastic Net

We include the elastic net version of Cox-PH to include a regularized version of Cox-PH to compete with other models such as Random Survival Forests and Multi-Task Logistic Regression which have built in regularization. The notation and implementations of elastic net come from Yang and Zou’s paper entitiled “A Cocktail Algorithm for Solving The Elastic Net Penalized Cox’s Regression in High Dimensions”“. The elastic net penalty adds the following term to the likelihood equation, \[ P_{\lambda, \alpha}(\boldsymbol{w}) = \sum_{j=1}^p \lambda\left(\alpha\,|w_j| + \frac{1}{2}\,(1-\alpha)\, w_j^2 \right), \] for \(p\) features where \(\lambda > 0\) and \(0 < \alpha \leq 1\). The first penalty term, \(|w_j|\) corresponds to the LASSO loss and is responsible for feature selection whereas the quadratic term is to control the size of feature weights. For example, \(\alpha = 1\) would correspond to just LASSO-Cox, whereas \(\alpha \approx 0\) would correspond to Ridge-Cox.

For more information regarding elastic net Cox see the Yang and Zou’s paper referenced above.

5.4.1 Cox-Proportional Hazards with Kalbfleisch-Prentice Extension – Elastic Net Code (CoxPH_KP.R)

File Purpose: Apply the Cox Proportional-Hazards model to data that has been processed and split into folds by createFoldsAndNormalize().

File Path: “ISDEvaluation/Models/CoxPH_KP.R”.

Functions:

  1. CoxPH_KP(training, testing, ElasticNet)
  2. KPEstimator(lp, lpTime, censorStatus) - Used to estimate the Kalbfleisch-Prentice estimator when doing elastic net Cox.

Arguments:

  • training: The training set.
  • testing: The testing set.
  • ElasticNet: A Boolean specifying whether or not to
  • lp: The linear predictions (\(\boldsymbol{w}^T\boldsymbol{x}\)) from the Cox-PH model for the training data.
  • lpTime: The training dataset event times, i.e. time.
  • censorStatus: The training dataset censor status, i.e delta.

Packages Required:

  1. survival – Needed for the normal Cox implementation.
  2. fastcox – Used for the implementation of elastic net cox (cocktail())
  3. prodlim – Used for the `sindex() function – sindex returns the indexes of positions, ideal for evaluating step functions.

File Notes:

Given the list of datasets generated by createFoldsAndNormalize(), CoxPH_KP() will take a single training set, testing set, and a Boolean specifying whether or not to use the elastic-net model. For this section we assume ElasticNet = TRUE, that is we are using the elastic net formulation. There are two hyperparameters involved for elastic net Cox, the strength of the penalty, \(\lambda\), and the type of penalty (more LASSO like or more Ridge like), \(\alpha\). Six values are tested for \(\alpha\), 0.01, 0.2, 0.4, 0.6, 0.8, and 1.0. For each of these values the function cv.cocktail() tests 100 values of \(\lambda\) using 5-fold cross validation and returns \(\lambda_{min}\), the optimal value of \(\lambda\) to minimize the cross validation error (see the `cocktail() function of the package documentation).The optimal value for \(\alpha\) and \(\lambda\) from internal (10) repeated 5-fold cross validation are then used for evaluating the test fold. Note the optimal values for \(\alpha\) and \(\lambda\) will (likely) differ across folds.

One the feature weights have been estimated, linear predictors are passed to KPEstimator() to estimate the survival curves. As of now, ties are not accounted for and the resulting survival distribution are using the estimate derived in the presence of no ties. Future iterations of this codebase should account for the correct estimate when handling ties.

The output is the same as previous models.

Example Usage:

validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=T)
curves = CoxMod$TestCurves
curves


5.5 Random Survival Forest - Kaplan-Meier Extension

Random Survival Forests (RSFs) were developed by Ishwaran et al. in 2008 and have been growing in popularity in recent years. While we do not give vast detail regarding the formation of these forests, the original paper and R package details provide an excellent overview.

The primary focus for RSFs were to build a risk score using an averaged cumulative hazard from the terminal nodes which in which patients landed. As previously mentioned here and explained further in our paper, risk scores are useful for discriminating between patients, but do not provide individual survival distributions.While survival distributions are not mentioned in the original RSF paper, the R package details briefly describe how RSFs build individual survival distributions in Section 8.1. Specifically, they state survival estimates are computed using Kaplan-Meier estimators. However, given a total of \(M\) trees in a forest, there will be \(M\) Kaplan-Meier estimators for each patient and the details of how these Kaplan-Meier estimators were combined are not given. After some backwards engineering I was able to come up with the conclusion that Kaplan-Meier curves are “averaged” across all the Kaplan-Meier curves - see Figure below.

Figure: Depiction of averaging two Kaplan-Meier curves together to produce one individual survival distribution for a test patient. Blue Kaplan-Meier curves represent the Kaplan-Meier estimators in two terminal nodes whereas the red curve represents the combination - giving form to the individual survival distribution.

5.5.1 Random Survival Forest - Kaplan-Meier Extension Code (RandomSurvivalForests.R)

File Purpose: Apply the RSF model to data that has been processed and split into folds by createFoldsAndNormalize().

File Path: “ISDEvaluation/Models/RandomSurvivalForests.R”.

Functions:

  1. RSF(training, testing)
  2. internalCV_RSF(training, numFolds) – Helper function for RSF

Arguments:

  • training: The training set.
  • testing: The testing set.
  • numFolds: The number of folds for internal cross validation.

Packages Required:

  1. survival – Needed for the Surv() function
  2. randomForestSRC – Used for the RSF implementation.

File Notes:

Given the list of datasets generated by createFoldsAndNormalize(), RSF() will take a single training set, testing set, and the number of folds to use for internal cross validation to select hyperparameters (namely ntree and mtry) and build the RSF model using rfsrc() from the randomForestSRC package. Internal cross validation is performed for ntree \(\in \{500,1000, 2000\}\) and mtry (number of patients in each terminal node) ${1,2,3} $ Then the predict() function for rfsrc is used to build survival curves for test patients.

The output is the same as previous models.

Example Usage:

validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
rsfMod = RSF(training1stFold, testing1stFold)
curves = rsfMod$TestCurves
names(curves) = c("time",1:26)
curves

5.6 Multi-task Logistic Regression (MTLR.R)

Of the models tested, Multi-task Logistic Regression (MTLR) is the least well known so we give a brief description here. Consider modeling the probability of survival of patients at each of a vector of time points \(\tau = [t_1, t_2, \ldots, t_m]\), e.g. \(\tau\) could be the 60 monthly intervals from 1 month up to 60 months. We can set up a series of logistic regression models: For each patient, represented as \(\vec{x}\), \[ P(T \geq t_i \,|\, \vec{x}) = \left(1 + \exp(\vec{\theta_{i}}\cdot \vec{x} )\right)^{-1} \] where \(\vec{\theta_i}\) are the time-specific feature vectors. While the input features \(\vec{x}\) stay the same for all these classification tasks,
the binary labels \(y_i = [T\geq t_i]\) can change depending on the threshold \(t_i\).%

We encode the survival time \(d\) of a patient as a binary sequence \(y = y(d) = (y_1, y_2, \ldots, y_m)\), where \(y_i = y_i(d) \in \{0,1\}\) denotes the survival status of the patient at time \(t_i\), so that \(y_i = 0\) (no death event yet) for all \(i\) with \(t_i <d\), and \(y_i = 1\) (death) for all \(i\) with \(t_i \geq d\).

Here there are \(m+1\) possible legal sequences of the form \((0,0,\ldots,1,1,\ldots,1)\), including the sequence of all 0’s and the sequence of all 1’s. The probability of observing the survival status sequence \(y = (y_1, y_2, \ldots, y_m)\) can be represented as: \[ P_\Theta(Y\!\! =\!\! (y_1, y_2, \ldots, y_m) \,\,|\,\, \vec{x}) = \frac{\exp(\sum_{i=1}^m y_i \times \theta_{i}\cdot\vec{x} )} {\sum_{k=0}^m \exp(f_{\Theta}(\vec{x}, k))}, \] where \(\Theta = (\theta_{1}, \ldots, \theta_{m})\), and \(f_{\Theta}(\vec{x}, k) = \sum_{i=k+1}^m (\theta_{i}\cdot\vec{x})\) for \(0\leq k\leq m\) is the score of the sequence with the event occurring in the interval \([t_k, t_{k+1})\) before taking the logistic transform, with the boundary case \(f_{\Theta}(\vec{x}, k)= 0\) being the score for the sequence of all `0’s. Given a dataset of \(n\) patients \(\{ \vec{x_r}\}\) with associated time of deaths \(\{ d_r \}\), we find the optimal parameters (for the MTLR model) \(\Theta^*\) as \[ \Theta^*\ =\ \arg\max_{\Theta} \sum_{r=1}^n \left[\sum_{i=1}^m y_j(d_r)(\theta{i}\!\cdot\!\vec{x_r})\! - \log \sum_{k=0}^m \exp f_{\Theta}(\vec{x_r},k) \right] - \frac{C}{2}\!\sum_{j=1}^m\! \|\theta_{j}\|^2\! % +\! \frac{C_2}{2}\sum_{j=1}^{m-1}\! \|\theta_{j+1}\!-\!\theta_j\|^2\! \] where the \(C\) (for the regularizer) is found by an internal cross-validation process. }

There are many details here – to insure that the survival function starts at 1.0, and decreases monotonically and smoothly until reaching 0.0 for the final time point; to deal appropriately with censored patients; to decide how many time points to consider (\(m\)); and to minimize the risk of overfitting (by regularizing), and by selecting the relevant features. The paper by Yu et al. provides the details.

5.6.1 Multi-task Logistic Regression Code (MTLR.R)

File Purpose: Apply the MTLR model to data that has been processed and split into folds by createFoldsAndNormalize().

File Paths:

  1. “ISDEvaluation/Models/MTLR.R”.

Functions:

  1. MTLR(training, testing,C1 = NULL, numFolds = 5)
  2. internalCV_MTLR(training, numFolds) – Helper function for MTLR().
  3. avgLogLikLoss(params, dat, timePoints) – Helper function for internalCV_MTLR().

Arguments:

  • training: The training set.
  • testing: The testing set.
  • C1: The regularization parameter.
  • numFolds: The number of folds for internal cross validation.
  • params: The feature weights.
  • dat: The testing dataset (from internal CV).
  • timePoints: The time points used for evaluating MTLR().

Packages Required:

  1. Rcpp – Used for fast implementation of gradient and objective function calculation.

File Notes:

Given the list of datasets generated by createFoldsAndNormalize(), MTLR() will take a single training set, testing set, optionally a value for the regularization parameter, and the number of folds for internal cross-validation if C1 is left as NULL. Internal cross-validation is performed to select an the optimal regularization parameter (the average log-likelihood loss is optimized). Features are optimized via gradient descent and the optim() function. Since the objective function is non-convex with censored data, the weights are trained once treating all patients as uncensored and then trained again with the true censor values (this is in hopes of finding good starting weights).

The output is the same as previous models.

Example Usage:

validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
mtlrMod = MTLR(training1stFold, testing1stFold)
curves = mtlrMod$TestCurves
names(curves) = c("time",1:26)
curves

Figure: MTLR Survival curves without last time point added.

6 Evaluting ISD Models

Once the survival curves have been generated it is likely beneficial to evaluate them using some type of metric. We tested (5) different metrics, Concordance (a discriminatory measure), 1-Calibration, Brier score, L1-Loss and novel D-Calibration. Below we only give the details for the code – for explanations of each metric please see the paper associated with this tutorial (Link). In addition to the evaluation metrics we also describe the function used to calculate the mean/median survival from each survival curve. We will show all evaluations using the unregularized Cox model as this is one of the most common models used in survival analysis.

6.1 Calculating ``Average’’ Survival (EvaluationHelperFunctions.R)

File Purpose: Calculate the mean/median survival time given a single survival curve.

File Paths:

  1. “ISDEvaluation/Evaluations/EvaluationHelperFunctions.R”

Functions:

  1. predictProbabilityFromCurve(survivalCurve,predictedTimes, timeToPredict)
  2. predictMeanSurvivalTimeSpline(survivalCurve, predictedTimes)
  3. predictMedianSurvivalTimeSpline(survivalCurve, predictedTimes)

Arguments:

  1. survivalCurve: The survival probabilities generated for a individual survival curve.
  2. predictedTimes: The times corresponding to survival probabilities, e.g. time = 0 corresponds to a survival probability of 1.
  3. timeToPredict: The time for which to extract a survival probability given a survival curve. Note this can either be a single time or a vector of times.

Packages Required: None.

File Notes:

This files purpose is to provide support for the evaluation metrics. Some metrics (concordance, L1-loss) require a single risk/survival measure and for those we use the mean (or median) survival times, predictMeanSurvivalTimeSpline() and predictMedianSurvivalTimeSpline() accordingly. To predict the mean time a monotonic spline fit is used to make the survival probabilities continuous. Specifically, we use the Hyman filtering of the cubic Hermite spline (see the Wikipedia page and the page’s references for an introduction). However, if the survival curves do not go to 0 probability we use a linear fit from the time = 0, survival probability = 1 point to the last time point we have and extend this linear fit to zero, see the Figure below. Once the survival curves have been extended to zero the integral of the survival curve is taken to be the mean survival time.

Figure: Linear fitting to extend survival curves to 0 survival probability. Left: Survival probabilities with no linear fit, Right: Survival probabilityies with linear fitting.

Figure: Linear fitting to extend survival curves to 0 survival probability. Left: Survival probabilities with no linear fit, Right: Survival probabilityies with linear fitting.

For calculating the median survival time the same spline fit is used and then an inverse spline is fit to the original spline fitting. The time associated with a survival probability of 0.5 is used as the median survival time.

Additionally, some evaluation metrics (specifically calibration metrics) require survival probabilities for a given time. This is implemented by predictProbabilityFromCurve(). The same spline fit (with linear extension to zero) from before is fitted to the survival probabilities and the given time(s) to predict (timeToPredict) is calculated using the internal spline() function.

Example Usage:

#Set up:
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)
curves = CoxMod[[1]]
survivalTimes = curves[,1]
#We use the second patient because the first had a very shallow survival curve - see Cox curves in Seciton 3.3.1.
secondSurvivalCurve = curves[,3]
#Getting the survival probability for 5 months:
predictProbabilityFromCurve(survivalCurve = secondSurvivalCurve, predictedTimes = survivalTimes, timeToPredict = 5)
[1] 0.9571609
#Getting multiple survival probabilities:
predictProbabilityFromCurve(survivalCurve = secondSurvivalCurve, predictedTimes = survivalTimes, timeToPredict = c(5,10,20,40,80,160))
[1] 0.9571609 0.8646244 0.3927712 0.2527547 0.0000000 0.0000000
#Getting the mean survival time:
predictMeanSurvivalTimeSpline(survivalCurve = secondSurvivalCurve, predictedTimes = survivalTimes)
[1] 25.72988
#Getting the median survival time:
predictMedianSurvivalTimeSpline(survivalCurve = secondSurvivalCurve, predictedTimes = survivalTimes)
[1] 17.34427

6.2 Concordance (Concordance.R)

File Purpose: Calculate the concordance of a trained model on test data.

File Path: “ISDEvaluation/Evaluations/Concordance.R”

Functions:

  1. Concordance(survMod, ties = "None", method = "Median")

Arguments:

  1. survMod: The output of any of the survival models. Specifically this must be a list of (1) the survival curves of the test patients, (2) the test patients true time of death and censor status, and (3) the training patients true time of death and censor status. Technically the third component is not required for the concordance calculation.

  2. ties: A string indicating the method in which to handle ties. There are four options: (1) “None” which will exclude all ties, (2) “Risk” which will include ties who were tied in their risk (mean/median survival time), (3) “Time” which will include ties whos survival time was equal, and (4) “All” which includes both risk and time ties.

  3. method: A string indicating to use the “Mean” or the “Median” for the risk score evaluation.

Packages Required:

  1. survival - This package is used for the survConcordance() function – an time-efficient function for calculating concordance.

File Notes:

As discussed in the paper, concordance is a discriminative measure typically used to evaluated models which produce risk scores for patients. Here the risk score is given by the negative of the mean/median of an individuals survival distribution. Once the average survival times are calculated survConcordance() calculated the concordance information and the concordance score is returned depending on the values of the ties argument.

Example Usage:

#Set up:
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)
#Evaluation:
Concordance(CoxMod, ties = "None", method = "Mean")
concordant 
 0.6317829 

6.3 1-Calibration (OneCalibration.R) {#1Cal}

File Purpose: Perform the 1-Calibration test on survival data.

File Path: “ISDEvaluation/Evaluations/OneCalibration.R”

Functions:

  1. OneCalibration(survMod, timeOfInterest = c(), type = "DN", numBuckets = 10)
  2. OneCalibrationCumulative(listOfSurvivalModels, timeOfInterest = c(), type = "DN", numBuckets = 10)
  3. binItUp(trueDeathTimes,censorStatus, predictions, type, numBuckets,timeOfInterest)

Arguments:

  1. survMod: The output of any of the survival models. Specifically this must be a list of (1) the survival curves of the test patients, (2) the test patients true time of death and censor status, and (3) the training patients true time of death and censor status. Technically the third component is not required for the concordance calculation.

  2. timeOfInterest: The time at which to evaluate the model for 1-Calibration. If nothing is passed in the quantiles (.1, .25,.5,.75,.9) of the true event times will be used.

  3. type: This is the type of 1-Calibration to be used, either “Uncensored” to use 1-Calibration for only the uncensored patients and discard all censored patients or “DN” for the D’Agostino-Nam translation.

  4. numBuckets: The number of buckets/bins to use for 1-Calibration.

  5. listOfSurvivalModels: A list of the output of survival models, e.g. a list of different survMod like objects.

  6. trueDeathTimes: The true event times of the subjects.

  7. censorStatus: The bit identifying if a patient died or was censored, i.e delta.

  8. predictions: The predicted survival probability for patients at their true event times.

Packages Required:

  1. survival – This package is used for the Surv() function.

  2. dplyr – This is used for group_by(), summarise(), and mutate()`.

File Notes:

There are two ways to perform 1-Calibration here – one can either pass the output from a survival model and have 1-Calibration evaluated on the test fold or one can pass a list of multiple survival models and have 1-Calibration evaluated on all patients. Note by doing the first way, OneCalibration(), there will end up being the same number of p-values as there are cross-validation folds and there will be no straightforward way of combining them (DO NOT AVERAGE THEM). OneCalibrationCumulative() mitigates this by taking in the model results from all folds and performing a single 1-Calibration test using all the test data. binItUp() is a helper function that takes probability predictions and performs the actual specified 1-Calibration test.

Example Usage:

#Set up for OneCalibration():
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)
#Evaluation:
#By not specifying timeOfInterest we will get the 10th, 25th, 50th, 75th, and 90th quantiles of the true event times in the testing fold.
OneCalibration(CoxMod, timeOfInterest = c(), type = "DN",numBuckets = 10)
[1] 0.1524982 0.0000000 0.0000000 0.0000000 0.0000000
#One Calibration for uncensored patients.
OneCalibration(CoxMod, timeOfInterest = c(), type = "Uncensored",numBuckets = 10)
[1] 0.3689704 0.0000000 0.0000000 0.0000000 0.0000000
#Set up for OneCalibrationCumulative():
coxList = list()
for(i in 1:5){
  trainingFold = trainAndTest$Training[[i]]
  testingFold = trainAndTest$Testing[[i]]
  CoxMod = CoxPH_KP(trainingFold, testingFold, ElasticNet=F)
  coxList[[i]] = CoxMod
}
#Cumulative 1-Calibration for default time of Interest
OneCalibrationCumulative(listOfSurvivalModels = coxList, timeOfInterest = c(), type = "DN", numBuckets = 10)
[1] 0.118622 0.000000 0.000000 0.000000 0.000000
#Cumulative 1-Calibration for a specific time (2 months)
OneCalibrationCumulative(listOfSurvivalModels = coxList, timeOfInterest = 2, type = "DN", numBuckets = 10)
[1] 0.002524264

6.4 Brier Score (BrierScore.R)

File Purpose: Evaluate the Brier score on test data.

File Path: “ISDEvaluation/Evaluations/BrierScore.R”

Functions:

  1. BrierScore(survMod. type = "Interated", singleTime = NULL, integratedBrierTimes = NULL, numPoints = NULL)
  2. singleBrier(survMod, BrierTime = NULL)
  3. integratedBrier(survMod, integratedBrierTimes, numPoints)
  4. sinleBrierMultiplePoints(survMod, BrierTimes)

Arguments:

  1. survMod: The output of any of the survival models. Specifically this must be a list of (1) the survival curves of the test patients, (2) the test patients true time of death and censor status, and (3) the training patients true time of death and censor status. Technically the third component is not required for the concordance calculation.

  2. type: A string of either “Integrated” or “Single”. This will indicate whether to return the Brier score for a point or the integrated Brier score.

  3. singleBrierTime: The time point to evaluate the (non-integrated) Brier score. If left as NULL, the 50th percentile of the training and testing data’s event times combined will be used.

  4. integratedBrierTimes: A vector of length 2 indicating the limits of integration for the integrated Brier score. If left as NULL the range will be from zero to the max event time of the entire dataset (training and testing set combined).

  5. numPoints: The number of points to evaluate the integral, i.e. the higher the number of points the less error in the integration. If left as NULL the number of points will correspond to the number of events between the limits of integration.

  6. BrierTimes: A vector of times to evaluate a single Brier score – used in an internal function (singleBrierMultiplePoints()) of integratedBrier().

Packages Required:

  1. prodlim – Used for the prodlim() function that calculates the Kaplan-Meier curve. This is then used for the inverse probability of censoring weights (IPCW).

File Notes:

In this file BrierScore() is the only function that needs to be called – all the other functions simply work within BrierScore(). Note that there is no option for removing censored patients and the Brier score returned will always be the score corrected by the IPCWs. For the integrated Brier score the integral is evaluated using the simple trapezoidal rule. If numPoints is specified then the points where the Brier score is evaluated will be evenly spread across the limits of integration. However, if numPoints is not specified then the points evaluated will be the event times in the testing data. These rules follow from the sbrier() function in the ipred library written by Torsten Hothorn.

Example Usage:

#Set up:
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)
#Evaluation:
#Integrated with specified times:
BrierScore(CoxMod, type = "Integrated", integratedBrierTimes = c(0,36))
[1] 0.1460349
#Integrated with unspecified times:
BrierScore(CoxMod, type = "Integrated")
[1] 0.1223641
#Single with specified times:
BrierScore(CoxMod, type = "Single", singleBrierTime = 12)
[1] 0.1820266
#Single with unspecified times:
BrierScore(CoxMod, type = "Single")
[1] 0.1662345

6.5 L1-Loss (L1Measures.R)

File Purpose: Evaluate the L1 loss on test data.

File Path: “ISDEvaluation/Evaluations/L1Measures.R”

Functions:

  1. L1(survMod. type = "Uncensored", logScale = FALSE, method = "Median")

Arguments:

  1. survMod: The output of any of the survival models. Specifically this must be a list of (1) the survival curves of the test patients, (2) the test patients true time of death and censor status, and (3) the training patients true time of death and censor status. Technically the third component is not required for the concordance calculation.

  2. type: A string indicating the type of L1-loss, options are one of “Uncensored”, “Hinge”, or “Margin”. See the paper for details regarding the mathematical definition of each option.

  3. logScale: A Boolean indicating whether to use the log scaled version of the L1-loss.

  4. method: A string indicating whether to use the “Mean” or “Median” as survival predictions.

Packages Required:

  1. prodlim – Used for the prodlim() function that calculates the Kaplan-Meier curve. See File Notes for more details.

  2. survival – Used for the Surv() function.

File Notes:

L1() returns the average L1 loss based on the specifications given, e.g the Margin L1-loss using the mean survival to compute the loss. There is one important specification to note – the Kaplan-Meier curve is generated (from the training data) and is completed using the linear extension as specified in Section 4.1 above. The mean of this Kaplan-Meier curve is computed and the survival estimate returned is the minimum of the mean survival time for individual survival curves and the mean of this Kaplan-Meier curve. See “L1Measures.R”" for the implementation details.

Example Usage:

#Set up:
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)
#Evaluation:
#Uncensored Loss with Mean:
L1(CoxMod, type = "Uncensored", logScale = FALSE, method = "Mean")
[1] 8.222021
#Hinge Loss with Mean:
L1(CoxMod, type = "Hinge", logScale = FALSE, method = "Mean")
[1] 6.324632
#Margin Loss with Mean:
L1(CoxMod, type = "Margin", logScale = FALSE, method = "Mean")
[1] 46.98087
#Margin Loss with Median:
L1(CoxMod, type = "Margin", logScale = FALSE, method = "Median")
[1] 47.33277
#Margin Loss with Median on Log Scale:
L1(CoxMod, type = "Margin", logScale = TRUE, method = "Median")
[1] 2.612281

6.6 D-Calibration (DCalibration.R)

File Purpose: Evaluate test data using D-Calibration.

File Path: “ISDEvaluation/Evaluations/DCalibration.R”

Functions:

  1. DCalibration(survMod, numBins)

  2. DCalibrationCumulative(listOfSurivalModels, numBins)

  3. getBinned(survMod, numBins)

Arguments:

  1. survMod: The output of any of the survival models. Specifically this must be a list of (1) the survival curves of the test patients, (2) the test patients true time of death and censor status, and (3) the training patients true time of death and censor status. Technically the third component is not required for the concordance calculation.

  2. numBins: The number of bins to use for D-calibration.

  3. listOfSurvivalModels: A list of the output of survival models, e.g. a list of different survMod like objects.

Packages Required:

  1. prodlim – Used for the sindex() function.

  2. plyr – Used for the ldply() function to combine lists.

File Notes:

The format of the DCalibration.R file follows the same as OneCalibration.R. That is, DCalibration() is used for a single fold and returns a p-value. However, since p-values should not be averaged there is also a DCalibrationCumulative() function which will take all the folds test data and perform a single D-Calibration test. Lastly, getBinned() is a helper function for DCalibration() and DCalibrationCumulative() and does not need to be called directly.

Example Usage:

Here we do two models, Cox and MTLR. Below you will see that the Cox model fails whereas MTLR performs well. Additionally we show the corresponding cumulative D-Calibration histograms for each model.

#Set up:
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)
MTLRMod = MTLR(training1stFold, testing1stFold)
#D-Calibration for Cox
DCalibration(CoxMod, numBins = 10)
[1] 0.0005961342
#D-Calibration for MTLR
DCalibration(MTLRMod, numBins = 10)
[1] 0.2756404
#Set up for DCalibrationCumulative():
coxList = list()
mtlrList = list()
for(i in 1:5){
  trainingFold = trainAndTest$Training[[i]]
  testingFold = trainAndTest$Testing[[i]]
  CoxMod = CoxPH_KP(trainingFold, testingFold, ElasticNet=F)
  coxList[[i]] = CoxMod
  
  MTLRMod = MTLR(trainingFold, testingFold)
  mtlrList[[i]] = MTLRMod
}
#Cumulative D-Calibration for cox
DCalibrationCumulative(listOfSurvivalModels = coxList, numBins = 10)
[1] 2.671878e-10

Figure: Histogram for D-Calibration corresponding to Cox model. Bin 1 refers to the [0,.1) bin and Bin 10 refers to the [0.9, 1] bin. Here we see the cox model puts too many of the patients in the first and last bin – leading to a failed D-Calibration test.
DCalibrationCumulative(listOfSurvivalModels = mtlrList, numBins = 10)
[1] 0.1750337

Figure: Histogram for D-Calibration corresponding to MTLR model. Bin 1 refers to the [0,.1) bin and Bin 10 refers to the [0.9, 1] bin. The bins are much more even here than those given by the Cox model, resulting in a passed D-Calibration test.

7 Miscellanious

7.1 Plotting Individual Survival Curves (plotSurvivalCurves.R)

File Purpose: Plot individual survival curves.

File Path: “ISDEvaluation/Plotting/plotSurvivalCurves.R”

Functions:

  1. plotSurvivalCurves(survivalCurves, indexToPlot, color = c(), xlim=c())

Arguments:

  1. survivalCurves: A dataframe where the first column corresponds to survival times and all other columns are survival probabilities of individual patients. This corresponds to the first item of the list returned by all models, e.g. CoxPH_KP().

  2. indexToPlot: The patient number you wish to plot. From survivalCurves patient 1 would be the second column. This can either be an integer specifying which patient or a vector of integers, specifying multiple patients.

  3. color: The color of the survival curves. The length much match that of indexToPlot.

  4. xlim: You can specify the range of the x-axis. This should be a vector of length two.

Packages Required:

  1. ggplot2 – Used for plotting.

  2. reshape2 – Used to reshape the data to long form.

File Notes:

The plotting code will select the columns specified by indexToPlot and fit a spline using the time column and the survival probabilities columns. This spline is then used to interpolate many points between the probabilities such that we have a smooth fit to plot. See below for a few plotting examples. Note that indexToPlot will match up testing data depending on the fold. Additionally, if over 15 indexes are specified, there will be no legend (the legend fills up the plot space).

Example Usage:

#Without feature selection.
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)
survivalCurves = CoxMod[[1]]
plotSurvivalCurves(survivalCurves, indexToPlot = 1)

plotSurvivalCurves(survivalCurves, indexToPlot = 2)

plotSurvivalCurves(survivalCurves, indexToPlot = 2, color = "blue")

plotSurvivalCurves(survivalCurves, indexToPlot = 1:10)

plotSurvivalCurves(survivalCurves, indexToPlot = 1:10, xlim = c(0,25))

Note here that the index corresponds to the row number in the original dataset (they are all out of order here since we used an arbitrary cross-validation fold).

7.2 Feature Selection (FeatureSelection.R)

File Purpose: Perform feature selection on survival data.

File Path: “ISDEvaluation/FeatureSelection/FeatureSelection.R”

Functions:

  1. FeatureSelection(dataset, type = "UniCox", obs_thresh = 0, pThresh = 0.1)

  2. uniCox(dataset, obsthresh = 0, pThresh=0.1)

Arguments:

  1. dataset: This is a dataset which has been validated for use – i.e. gone through validateAndClean()

  2. type: The type of feature selection to use. Currently (August 28, 2018) the only option is “UniCox”. See File Notes for details.

  3. obs_thresh: The number of observations that a variable must have to qualify for selection.

  4. pThresh: The p-value threshold to be included in the analysis.

Packages Required:

  1. caret – the dummyVars() function is used for one-hot encoding.

File Notes:

The code here is written so custom feature selection methods can be easily added – they just have to added to a switch command. Here we use uni-variate cox for our feature selection – the details of this method are given in the paper. For this feature selection we first check that that each variable has a obs_thresh number of events (deaths) occur. The default for this is 0 events. Next the p-value for the feature must be below 0.1 to be included in the selected dataset. Note that variables are one-hot encoded before selection so all the values of a categorical variable may not be present post feature selection.

Example Usage:

Below you will see that unselected data has 40 features whereas the data post feature selection has only 22 features.

#Without feature selection.
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
dim(validatedAndCleaned)
[1] 128  40
#After we have the selected data we would then pass it to createFoldsAndNormalize().
selectedData = FeatureSelection(validatedAndCleaned)
dim(selectedData)
[1] 128  22
---
title: "Individual Survival Distributions - Case Study and Code Tutorial"
author: "Humza Haider"
date: "`r format(Sys.time(), '%B %e, %Y')`"
output:
  html_notebook: 
    toc: TRUE
    number_sections: true
    fig_caption: yes
    fig_align: "center"
---

#Introduction
This article should serve as a tutorial on using the [publicly available code](https://github.com/haiderstats/ISDEvaluation) which accompanies the paper "Effective Ways to Build and Evaluate Individual Survival Distributions". There are two primary components to this article, (1) a [quickstart option](#quickstart) which gives an overview of using the master file, "analysisMaster.R", and (2) [the entire code overview](#theDetails) which details all the functions and files found in the code. While using the "analysisMaster.R" file gives users access to most of the functionality of the codebase, no specialized modifications can be made to the models/evaluations directly so we give a detailed explanation of how to start with a dataset and end with evaluations of different models.

#Quickstart Guide (analysisMaster.R) {#quickstart}

Before you begin using the code, make sure all required packages are downloaded:  `caret`, `dataPreparation`, `ggplot2`, `reshape2`,`randomForestSRC`, `Rcpp`, `prodlim`, `survival`, `fastcox`, `plyr`,and `dplyr`. To save you some time I have supplied a code snippet you can use to download any of the packages you may be missing:

```{r,results='hide'}
#Taken from https://gist.github.com/stevenworthington/3178163
ipak <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}

packages <- c("caret", "dataPreparation", "ggplot2", "reshape2","randomForestSRC", "Rcpp", "prodlim", "survival", "fastcox", "plyr", "dplyr")
ipak(packages)
```


Next we will load the example data (which can be found on the [PSSP website](pssp.srv.ualberta.ca/system/predictors/datasets/000/000/076/original/All_Data_updated_may2011_Stage4_Stomach.csv?1354708504)). Details about this dataset can be found in the [Introduction to the Example Data](#theDetails).

```{r}
exampleDat = read.csv(file = "http://pssp.srv.ualberta.ca/system/predictors/datasets/000/000/076/original/All_Data_updated_may2011_Stage4_Stomach.csv?1354708504")
```

To use this codebase, the time to event feature must be named `time`, and the indicator of death (uncensored) should be named `delta`. In the example data these are the first two columns and are currently named `SURVIVAL` and `CENSORED`, so we make this change. Additionally, this data has CENSORED = 1 mean that a patient was censored. We need 1 to indicate a death.

```{r}
names(exampleDat)[c(1,2)] = c("time", "delta")
exampleDat$delta = 1 - exampleDat$delta 
```

Now that the data is formatted correctly and packages are loaded, go ahead and load "analysisMaster.R"

```{r}
source("analysisMaster.R")
```

This file contains the function `analysisMaster()` which takes many arguments, each of which controls some aspect of either the different models or evaluations.

**Arguments**:

* *survivalDataset*: a (non-empty) dataframe representing the survival dataset.
* *numberOfFolds*: The number of desired cross-validation folds. Default is 5.
* *CoxKP*, *CoxKPEN*, *KaplanMeier*, *RSFModel*, *AFTModel*, *MTLRModel*: All Boolean specifying whether to include each model (Cox proportional-hazards, Cox- Elastic Net, Kaplan-Meier, Random Survival Forests, Accelerated Failure Time, Multi-task Logistic Regression). All default to TRUE.
* *DCal* *OneCal*, *Concor*, *L1Measure*, *BrierSingle*, *BrierInt*: Boolean specifying whether or not to include each evaluation metric. (D-Calibration, 1-Calibration, Concordance, L1-loss, Brier score (single time), Integrated Brier Score). All default to TRUE.
* *DCalBins*: Number of bins for D-Calibration. Default is 10.
* *OneCalTime*: A numeric specifying the time to evaluate 1-Calibration. If left as NULL but OneCal = TRUE, then the 10th, 25th, 50th, 75th, and 90th percentiles of all event times are used. Default is NULL.
* *concordanceTies*: A string ("None", "Time", "Risk","All") indicating how to handle ties in concordance. Default is "Risk".
* *SingleBrierTime*: The time to evaluate the Brier Score. If left as NULL, the 50th percentile of all event times is used. Default is NULL.
* *IntegratedBrierTimes*: A 2 length vector (e.g. c(0,100)) specifying the lower and upper bounds on the integrated Brier score. If NULL then the default is 0 as a lower bound and the max event time of the entire dataset is used as an upper bound. Default is NULL.
* *numBrierPoints*: The number of points to evaluate the integrated Brier score. A simple trapezoidal numerical approximation is used. Default is 1000 points.
* *Ltype*: The type of L1-loss. Must be one of "Uncensored","Hinge", or "Margin". Default is "Margin".
* *Llog*: A Boolean specifying whether or not to use log-L1 metric. Default is FALSE.
* *typeOneCal*: A string indicating the type of 1-Calibration to use. Must be one of "DN" or "Uncensored". Default is "DN".
* *oneCalBuckets*: An integer specifying number of bins/buckets for 1-Calibration. Default is 10.
* *survivalPredictionMethod*: The way in which to estimate average survival times. Must be one of "Mean" or "Median". Default is "Median".
* *AFTDistribution*: The distribution to use for AFT, default is "weibull". Must be one of "weibull","exponential","lognormal","gaussian", "loglogistic","logistic".
* *FS*: A Boolean specifying whether or not to use feature selection. Default is TRUE.
* *imputeZero*: A Boolean specifying whether 0 valued times should be imputed (AFT breaks for 0 valued times). If TRUE then 0 valued times are imputed to half the minimum non-zero time. Default is TRUE. 
* *verbose*: A Boolean specifying whether or not to return results and progress information.


For this example we will simply use all the default arguments (with the exception of verbose, which we set to FALSE for brevity). Note that this command may take a few minutes to run (excluding Random Survival Forests will speed up computation time).

```{r}
#RSF and CoxEN are non-deterministic, so set the seed for reproducible results. (Additional stochasticity is included in 1-Calibration).
set.seed(42)
res = analysisMaster(exampleDat, verbose = FALSE)
```
You will notice some variables were removed -- See [Section 4.1](#validateAndClean) for more details.


`res` is now a list of 3 items:

* `datasetUsed` -- This is the dataset post validation and feature selection. Some rows may have had missing `time` or `delta` and been removed and features which were removed via feature selection or were deemed as unusable [(Section 4.1)](#validateAndClean) will no longer be present. 
* `survivalCurves` --  This is a list containing the survival curves for all patients for each model that was tested. These can be plotted using `plotSurvivalCurves()`.
* `results` -- This is a dataframe containing all the evaluation results with specified model and fold number. Additionally the sample size, number of features (post feature selection and one-hot encoding, excluding `time` and `delta`), and censoring percentage are returned.

Looking at each of these in more detail, note the difference in size between `res$datasetUsed` and `exampleDat`.

```{r}
dim(exampleDat)
dim(res$datasetUsed)
```
<br>
Additionally, when we sourced "analysisMaster.R" we brought in `plotSurvivalCurves()` which can be used to examine the survival curves of each model. Here we show examples of Cox Elastic-Net, Random survival forests, and Multi-task Logistic regression, however, survival curves for all models are available.

```{r}
#CoxEN
plotSurvivalCurves(survivalCurves = res$survivalCurves$CoxEN, indexToPlot = 1:10)
#RSFKM
plotSurvivalCurves(survivalCurves = res$survivalCurves$RSF, indexToPlot = 1:10)
#MTLR
plotSurvivalCurves(survivalCurves = res$survivalCurves$MTLR, indexToPlot = 1:10)
```

Note that each index relates to the corresponding row in `res$datasetUsed` (e.g. the survival curve for Index 1 is for the for the patient in the first row of `res$datasetUsed`).

The last component of `res` is the `results` dataframe, which includes the evaluations for every fold (Concordance, L1-loss, Brier score) and the calibration evaluated on the entire dataset (1-Calibration, D-Calibration). Below is a view of all the results and an example of aggregating the result. (Note that we average the p-values of 1-Calibration and D-Calibration because they are equal across the folds because they were evaluated on all the test data as opposed to per fold -- see [Section 6.3](#1Cal) and [Section 6.6](#DCal))

```{r}
res$results
res$results %>%
  group_by(Model)  %>%
  summarise(N = mean(N),
            NumFeatures = mean(NumFeatures),
            PercentCensored = mean(PercentCensored),
            DCalibration = mean(DCalibration),
            OneCalibration10th = mean(OneCalibration_1),
            OneCalibration25th = mean(OneCalibration_2),
            OneCalibration50th = mean(OneCalibration_3),
            OneCalibration75th = mean(OneCalibration_4),
            OneCalibration90th = mean(OneCalibration_5),
            ConcordanceM = mean(Concordance),
            ConcordanceSD = sd(Concordance),
            SingleBrierM = mean(BrierSingle),
            SingleBrierSD = sd(BrierSingle),
            IntegratedBrierM = mean(BrierInt),
            IntegratedBrierSD = sd(BrierInt),
            L1M = mean(L1Loss),
            L1SD = sd(L1Loss))
```

Here we have given a fairly brief introduction to using the codebase to run the 6 survival models across the 5 different evaluation metrics. For a much more detailed introduction to the code (and a walk through of all the functions) continue on to the next sections or jump to a specific section using the table of contents.

#Code Overview and Introduction to Example Data {#theDetails}

The code here has been written for *survival* datasets, *i.e.* datasets are comprised of (1) a variable which indicates time of event, (2) an indicator variable for RIGHT censoring, and (3) features for each patient. Note that the code only allows for right censored patients, left censored or interval censored data will not work with the current implementation. To follow along with the examples please make sure that you have downloaded all the required packages (see the first paragraph of the [Quickstart Guide](#quickstart)), have your working directory in `~/ISDEvaluation`, and sourced all the files (available by running `source("analysisMaster.R")).

As a case study we will use the Northern Alberta Cancer Dataset (NACD) which was also included in the original paper. However, to simplify things we will use a new subset of the dataset, namely we will only consider patients with stage 4 stomach cancers. This data contains a total of 128 patients (100 uncensored) and 53 features, and is also available on the [github repository](https://github.com/haiderstats/ISDEvaluation) under the name "exampleData.csv". Below we give the summary statistics of the first five variables:

```{r}
NACD = read.csv(file = "http://pssp.srv.ualberta.ca/system/predictors/datasets/000/000/076/original/All_Data_updated_may2011_Stage4_Stomach.csv?1354708504")
head(NACD)
summary(NACD[,1:5])
```


Here, we have `SURVIVAL` as the feature indicating the time until the event occurred and `CENSORED` representing the indicator for death, that is, a 1 indicates the patient was censored and a 0 indicates the patient died -- these two types of variables are required for all datasets in this codebase. Feel free to explore the other variables if you are interested.

A quick note on notation -- In general features and packages will be given as verbatim, e.g. `package1` and `feature1`, functions will also be verbatim but also have parentheses e.g. `function1()`, arguments to functions will be *italicized*, variable assignments will be **bolded**, and files will be quoted ("file1.R"). However, there are exceptions when functions/file names are contained in section headers.

#Pre-processing the Data

##Validating and Cleaning  (validateAndClean.R) {#validateAndClean}

**File Purpose**: This file is used to make sure the survival dataset is in the correct format for analysis. For example, the variable used for survival time must be named exactly `time`. Additionally, a minimal amount of data cleansing is completed in this file.

**File Path**: "ISDEvaluation/ValidateCleanCV/validateAndClean.R".

**Functions**:

1. `validateAndClean(survivalDataset, imputeZero=T)`
2. `validate(survivalDataset, imputeZero)`
3. `clean(survivalDataset)`

Note: You will notice this file structure throughout the codebase, there will often be a master function (here `validateAndClean()`) which runs the other functions included in the file, *e.g.* `validate()` and `clean()`. 

**Arguments**:

* *survivalDataset*: a (non-empty) dataframe representing the survival dataset -- *e.g.* **NACD**
* *imputeZero*: a Boolean indicating if zero valued survival times should be imputed (see below for details).

**Packages Required**:

1. `caret` -- The `nearZeroVar()` function to identify variables with only 1 unique value and remove them in the `clean()` function.


###*validate(survivalDataset, imputeZero)* 
The function `validate()` checks a number of things:

1. A survival dataset was nonempty.
2. The survival time variable MUST be named `time`.
3. Similarly, the indicator of death variable must be named `delta`. The naming choice of this variable is the common notation in survival analysis.
4. If `time` includes NA values these rows will be removed.
5. If `time` includes Inf values these rows will be removed.
6. If `time` includes negative values these rows will be removed.
7. If `time` includes zero values there are two options depending on the value of the argument imputeZero. If imputeZero = TRUE, then zero valued times will be imputed to half the minimum non zero valued time. For example in the NACD dataset we are using here the minimum time is 0.2333. If there was another row where the event time was 0, this would be imputed to $\frac{0.2333}{2} = 0.1166$. We include this option as some survival models (*e.g.* Accelerated Failure Time with the Weibull distribution) will fail if given a zero valued time. If imputeZero = FALSE then nothing will happen to these zero valued survival times. Currently there is no option to remove zero valued times other than removing them from the dataset prior to running the code available here.
8. If `delta` includes NA values these rows will be removed.
9. If `delta` includes Inf values these rows will be removed.
10. `delta` must only have two unique values.
11. The unique values of `delta` must be 0 and 1 -- 0 for censored data and 1 for uncensored data.

If we try to use `validate()` on our NACD dataset we will get an error since currently our survival time variable is named `Time` and our death indicator is named `Status`. Note that below we include *imputeZero = TRUE*, but since there are no zero-valued times the same result would occur if we included *imputeZero=FALSE*.

```{r, error= TRUE}
validatedNACD = validate(NACD, imputeZero = TRUE)
names(NACD)[1] = "time"
validatedNACD = validate(NACD, imputeZero = TRUE)

#Additionally, we need to swap the values for delta since CENSORED = 1 meant a patient was censored. We need 1 to mean a death.
names(NACD)[2] = "delta"
NACD$delta = 1 - NACD$delta
validatedNACD = validate(NACD, imputeZero = TRUE)
```

###*clean(survivalDataset)*

Here, `clean()` does a very minimal amount of cleaning on the survival dataset. Specifically, the goal is to remove variables which will be not be meaningful to analysis. There are two cases where variables are removed:

1. A variable contains all unique values. For example, in our NACD dataset there is a `STAGE_4` variable which is an indicator (1 or 0) whether or not the cancer is in stage 4. Since we only have stage 4 stomach cancer patients this variable will be be all 1s and will be removed from the dataset.
2. Variables such that 25% of rows (patients) have a missing value will be removed. However, the coding of missing data must be such that `is.na()` would identify the data as missing. For example `is.na("")` returns TRUE even though an empty string likely indicates a missing value.

Whenever a variable is removed or turned into a factor, `clean()` will return a warning indicating the name of the variable(s) which were removed/factored. For example, on our validated NACD dataset we can now call `clean()`:


```{r}
cleanedNACD = clean(validatedNACD)
```
As noted above a number of variables were removed due to having only 1 unique value. 


###*validateAndClean(survivalDataset, imputeZero=T)*

This function simply runs both `validate()` and `clean()` on a survival dataset. Note that here we use default the default (TRUE) for *imputeZero*. Below we verify that `validateAndClean(NACD)` is equivalent to **cleanedNACD**.

```{r, warning=FALSE}
identical(validateAndClean(NACD),cleanedNACD)
```


##Creating Cross-Validation Folds and Normalizing Features (createFoldsAndNormalize.R)

**File Purpose**: This file is used to create folds from the survival dataset for cross validation and normalize features. As part of normalization, mean/mode imputation is employed to handle missing data.

**File Path**: "ISDEvaluation/ValidateCleanCV/createFoldsAndNormalize.R".

**Functions**:

1. `createFoldsAndNormalize(survivalDataset, numberOfFolds)`
2. `createFoldsOfData(survivalDataset, numberOfFolds)`
3. `meanImputation(listOfDatasets)`
4. `normalizeVariables(listOfImputedDatasets)`


**Arguments**:

* *survivalDataset*: the survivalDataset AFTER going through validation and cleaning -- *e.g.* **cleanedNACD**
* *numberOfFolds*: the desired number of cross-validation folds, must be greater than 1.
* *listOfDatasets*: a list containing a list of Training datasets and a list of Testing datasets - *i.e.* the output of `createFoldsOfData()`
* *listOfImputedDatasets*: same as *listOfDatasets* but missing values have been imputed (the output of `meanImputation()`)

**Packages Required**:

1. `caret` -- The `dummyVars()` function is used to create a one hot encoding of factor (nominal) variables.
2. `dataPreparation` -- The `build_scales()` and `fastScale()` function are used to normalize variables.

###*createFoldsOfData(survivalDataset, numberOfFolds)*{#folds}

This code base requires that data be in cross-validation format with a minimum of 2 folds. The cross-validation used here is specific to survival analysis in that cross-validation is deterministic -- rows are first stratified by their censoring status (`delta`) and then sorted by `time` and sequentially placed into folds (see Figure 1). The reasoning behind this procedure is two-fold (pun-intended):

1. The number of censored patients should be (roughly) equal across all folds.
2. The range of event times should also be (roughly) equal across all folds.

By creating folds in this way models will be tested on data which will be mostly representative of what they saw in the training data.

![Figure 1: Depiction of 3-fold cross-validation scheme used for survival data. The dataset is first stratefied by delta and then sorted by time. Rows are then sequentially ordered into the three folds.](./Figures/SurvivalCV.png)

In addition to creating cross validation folds, `createFoldsOfData()` also returns the indexes of the rows that each test set belongs. See the below example for details. Here we put our cleaned NACD dataset (**cleanedNACD**) into 5 folds:

```{r}
folds = createFoldsOfData(cleanedNACD, 5)
#folds comprises a list of (2) items. (1) is the indexs of the test set. For example, examine the first test set fold:
testSetIndexs = folds[[1]]
#Test fold 1:
testSetIndexs[[1]]
```
Here the first test fold is comprised of the 123rd, 101st, 62nd, ..., and the 77th rows of the original data. We will later use these indexes for plotting in [Section 7.1](#plotting).

The second item returned by `createFoldsOfData()` is a list of (1) the training sets, and (2) the testing sets. Below we simply give the dimension of each:
```{r}
trainingAndTestingSets = folds[[2]]
#First Training Set:
dim(trainingAndTestingSets$Training[[1]])
#First Testing Set:
dim(trainingAndTestingSets$Testing[[1]])
```

###*meanImputation(listOfDatasets)*

Following the splitting of folds we perform imputation for missing data. Since the primary focus of this work was not in any way geared towards imputation methods we do a very basic mean imputation for missing values *within fold*. The means for all variables for each training fold are computed and the corresponding test fold is imputed with the mean value from the training fold.

In order to impute factor (nominal) variables, we use the one-hot encoding of these variables and then impute missing values with the mean of the one-hot encoded vector. Note that `time` and `delta` are not subject to imputation -- if either of these variables contained missing values the entire row was removed in `validate()`.

Below we would impute missing data, however, the NACD dataset we are using has no missing data so the following step is not necessary.
```{r, echo=TRUE,include=TRUE}
imputedData = meanImputation(trainingAndTestingSets)
```

###*normalizeVariables(listOfImputedDatasets)*

Following variable imputation, all the variables are normalized (except `time` and `delta`) by taking the scales (mean and standard deviation) of the training set variables and normalizing both the training set and test set with those scales. Here normalizing follows the standard convention of subtracting the mean and dividing by the standard deviation. This function also does some cleaning of variable names -- if a variable name has a comma in it functions later of will break so all commas and other special characters are removed.

Below we normalize the data and show the difference between the data before and after it was normalized.
```{r}
head(imputedData$Training[[1]])
normalizedData = normalizeVariables(imputedData)
head(normalizedData$Training[[1]])
```

###*createFoldsAndNormalize(survivalDataset, numberOfFolds)*

Similar to `validateAndClean()` above, `createFoldsAndNormalize()` serves as a master function which will run `createFoldsOfData()`, `meanImputation()`, and `normalizeVariables()` and will return a list of (2) items, (1) the indexes of the test set as given in Section 2.2.1, and (2) a list containing: (1) a list of normalized training folds and (2) a list of normalized testing folds. 

Here we show the equivalence between **testSetIndexs** from above and the first item of `createFoldsAndNormalize()`  and  equivalence of **normalizedData** with the second item of `createFoldsAndNormalize()`:

```{r}
output = createFoldsAndNormalize(cleanedNACD,5)
all.equal(testSetIndexs, output[[1]])
all.equal(normalizedData, output[[2]])
```

#Creating Individual Survival Distribution  Models

The intention of this work is to examine different survival analysis models which can produce individualized survival distributions for different patients. In total we explored five models which produced individual survival distributions and one model which gave a population based survival curve (Kaplan-Meier). Below we give a minimal description for each model and describe how to implement each in the current codebase.


##Kaplan-Meier (KaplanMeier.R)

The Kaplan-Meier (KM) estimator was devised as a non-parametric approach to estimating a survival curve even in the presence of censored data (see the original paper [here](https://web.stanford.edu/~lutian/coursepdf/KMpaper.pdf)). To give the KM estimator we first make the following definitions:

* \(\tau_j\) -- the \(j^\textrm{th}\) distinct death time,
* \(d_j\) -- the number of deaths at time \(\tau_j\),
* \(r_j\) -- the number of patients who were at risk at time \(\tau_j\). Note this also includes those who died at \(\tau_j\).

With these definitions, we say the probability of survival at time \(t\) is given by the KM estimator as
\[\hat{S}_{KM}(t) = \prod_{j: \tau_j < t} \left(1 - \frac{d_j}{r_j}\right)\].
Note that here KM does not take into account any features and thus is considered to a non-parametric model. To give an example of the Kaplan-Meier approach consider the following data:

```{r, echo = FALSE}
library(knitr)
library(kableExtra)
time = c(1,3,5,7,9)
delta = c(1,0,1,1,1)
dat = cbind.data.frame(time,delta)
row.names(dat) = c("Patient 1","Patient 2", "Patient 3","Patient 4","Patient 5")
kable(t(dat), caption = "Example data for the Kaplan-Meier model. Note the data is transposed from the normal format.", format = "html", align = c('c','c','c')) %>%
    kable_styling(bootstrap_options = "striped", full_width = F )
```

Given the data above, we have \(\hat{S}_{KM}(1) = \left(1 - \frac{1}{5}\right) = 0.8\). Note that no change occurs at \(t= 3\) since this patient was censored and \(\tau_j\) only refers to unique *death* times. Below we give the full KM curve for this example.

<center>
```{r, echo = FALSE,message=FALSE, fig.height=3}
library(survival)
library(ggplot2)
library(survminer)
fit = survfit(Surv(time,delta)~1, data = dat)
p = ggsurvplot(fit, data = dat, conf.int = F, legend = "none", palette = "black", censor.size = 10)
p$plot 
```
<figcaption>Figure 2: Kaplan-Meier plot using example data. Censored patients indicated by vertical dash on survival curve.</figcaption>
</center>

Since the KM model does not account for individual features, the survival distribution generated is supposed to represent the population's overall survival curve, *i.e.* the average survival curve. So given \(K\)-fold cross-validation, there will only be \(K\) different survival curves as opposed to the \(N\) different survival curves the other models generate, given \(N\) unique patients. Next we give the details for running KM on a dataset after it has prepossessed by `validateAndClean()` and `createFoldsAndNormalize()`.

###Kaplan-Meier Code (KaplanMeier.R)

**File Purpose**: Use the Kaplan-Meier model that has been processed and split into folds by `createFoldsAndNormalize()`.

**File Path**: "ISDEvaluation/Models/KaplanMeier.R".

**Functions**:

1. `KM(training, testing)`

**Arguments**:

* *training*: The training set.
* *testing*: The testing set.

**Packages Required**:

1. `survival` -- We require the `Surv()` function to change the data into a survival format.
2. `prodlim` -- prodlim contains a Kaplan-Meier implementation that includes a `predict()` function.

**File Notes**:

Given the list of datasets generated by `createFoldsAndNormalize()`, `KM()` will take a single training and testing set and build the Kaplan-Meier curve for the training set using the `prodlim()` function from the `prodlim` package. The output (along with the output of all following models) is a list of four objects:

1. TestCurves: The survival curves for test patients (identical for all patients for KM).
2. TestData: The values of `time` and `delta` for the test patients (used later when evaluating survival curves).
3. TrainData: The values of `time` and `delta` for the training patients (used later when evaluating survival curves).
4. TrainCurves: The survival curves for the training patients.

 Note the output is the same for **all** following models.

**Example Usage**

Here, we give an example of using KM on a single training/testing fold -- extending the results to all folds can be accomplished by using a for loop. 


```{r, warning=FALSE}
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
#Recall the first argument of folds is the indexs of the test set on validatedAndCleaned and the second argument is the list of training and testing sets.
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
kmMod = KM(training1stFold, testing1stFold)
testCurves = kmMod$TestCurves
testDat = kmMod$TestData
trainDat = kmMod$TrainData
testCurves
```

Above you will see that the survival curves for all patients are the exact same -- the desired effect for the Kaplan-Meier model. Note that `time` will correspond to all the event times (both death and censored event times) of the training data. Additionally we include a `time` = 0 point where the survival probability will be 1 -- this is consistent across all models. 

For a more picturesque approach, we give the survival curve below:

<center>
```{r, echo = FALSE}
plotSurvivalCurves(testCurves, 1:10)
```
</center>

One should notice that the survival curve does not drop to zero and instead ends at \(\approx 0.05\) -- this is because the last patient in the training data was censored. Additionally, the above plot does not follow the classical Kaplan-Meier step function and is smoother using a spline fit -- the details can be found in [Section 7.1](#plotting) which describes the plotting function.

So far we have shown the output of **curves** but ignored **trainDat** and **testDat** which are just the training and testing values for `time` and `delta`. Across all the following models, **trainDat** and **testDat** will be the same thing, but for clarity we will include them below and not in following models.

<center>**trainDat**</center>
```{r,echo=FALSE}
kable(trainDat, format = "html", align = c('c','c','c')) %>%
    kable_styling(bootstrap_options = "striped", position = "center") %>% 
      scroll_box(width = "65%", height = "200px",extra_css = "margin: auto;") 
```

<br><br>

<center>**testDat**</center>

```{r, echo=FALSE}
kable(testDat, format = "html", align = c('c','c','c')) %>%
    kable_styling(bootstrap_options = "striped", position = "center") %>% 
      scroll_box(width = "65%", height = "200px",extra_css = "margin: auto;") 
```

##Accelerated Failure Time 

The Accelerated Failure Time (AFT) has been a commonly used survival analysis technique for finding significant features in a survival setting. The model is motivated by assuming survival time, \(T\) follows some distribution with density function \(f(t)\) and survival function \(S(t)\). Common AFT distributions are log-logistic, Weibull, log-normal, exponential and more. While we make assumptions on the distribution of \(T\), inference is instead made on \(\log(T)\), that is, for a patient, \(\mathbf{x_i}\) where \(i = 1,2,\ldots N\), AFT models follow the equation:
\[
log(T_i) = \boldsymbol{w}^T\boldsymbol{x_i} + \sigma\epsilon_i,
\]
where \(\boldsymbol{w}\) are feature weights, \(\sigma\) is called a scale parameter and \(\epsilon_i\) is an independent, identically distributed noise parameter dependent on the assumed distribution of \(T\). This is essentially a normal linear regression but on the log scale of \(T\), *i.e.* features are assumed to have a linear relationship with the log of survival time. Here, \(\sigma\) is assumed to be constant across all patients and can be learned by maximum likelihood along with the weights \(\boldsymbol{w}\). 

###Accelerated Failure Time Code (AcceleratedFailureTime.R)

**File Purpose**: Apply the Accelerated Failure Time model (for a number of distributions) to data that has been processed and split into folds by `createFoldsAndNormalize()`.

**File Path**: "ISDEvaluation/Models/AcceleratedFailureTime.R".

**Functions**:

1. `AFT(training, testing, AFTDistribution)`
2. `survfunc(object,t,newdata, name)` - Used to extract survival cures from an AFT model. Altered from [a function previously written by Timothy Johnson.](http://rstudio-pubs-static.s3.amazonaws.com/161203_6ee743eb28df4cd68089a519aa705123.html)


**Arguments**:

* *training*: The training set.
* *testing*: The testing set.
* *AFTDistribution*: The distribution for the AFT model. Possible distributions: *exponential, weibull, lognormal, gaussian, loglogistic, logistic*.
* *object*: A survreg object from the `survreg()` function.
* *t*: The times for which to evaluate the AFT model.
* *newdata*: The testing data to predict with using the AFT model.
* *name*: The name of the variable containing the predicted times (*t*).

**Packages Required**:

1. `survival` -- We use the `survreg()` function to perform analysis for AFT models.

**File Notes**:

The primary function for this file is `AFT()`; `survfunc()` is a helper function AFT uses to build survival curves for test data after the model has been learned. 

Given the list of datasets generated by `createFoldsAndNormalize()`, `AFT()` will take a single training set, testing set, and a specified distribution and learn the AFT model using the training set. Only one distribution can be tested at a time. In addition, the code contains a tryCatch command the catch the event where the AFT model fails to run, however, this only showed to occur for high dimensional datasets (when no feature selection was applied). 

As with the Kaplan-Meier model, the output is a list of four items:

1. TestCurves: The survival curves for test patients (identical for all patients for KM).
2. TestData: The values of `time` and `delta` for the test patients (used later when evaluating survival curves).
3. TrainData: The values of `time` and `delta` for the training patients (used later when evaluating survival curves).
4. TrainCurves: The survival curves for the training patients.

**Example Usage:**

```{r, warning=FALSE}
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
AFTMod = AFT(training1stFold, testing1stFold, "weibull")
curves = AFTMod$TestCurves
curves
```

<br>

<center>
```{r, echo=FALSE}
plotSurvivalCurves(curves, 1:10)
```
</center>


##Cox-Proportional Hazards with Kalbfleisch-Prentice Extension

The Cox proportional-hazards (Cox-PH) model is extremely common in the survival analysis literature. Previously, we mentioned the Kaplan-Meier model which modeled the survival function, \(S_{KM}(t)\) of a group of patients. Next, the Accelerated Failure Time model assumed a distribution on the survival time, \(T\), which allowed us to include features and build curves for individual patients. Instead, Cox-PH models the *hazard function* of patients, where hazard is interpreted as the risk of failure (dying) at any given time, specifically for a time \(t\) and covariates \(\boldsymbol{x}\) the hazard function is defined as
\[
h(t\,|\,\boldsymbol{x}) = \lim_{\Delta t \rightarrow \,0} \frac{P(t \leq T < t + \Delta t\,\,\, |\,\,\, T \geq t, \boldsymbol{x} )}{\Delta t}. 
\]

Cox-PH models this hazard function in terms of a baseline hazard, \(h_0(t)\), equal for all patients and scaled by a learned, individual risk dependening on features, that is, \(h(t|\boldsymbol{x}) = h_0(t)\,e^{\boldsymbol{w}^T\boldsymbol{x}}\). This is why the model is termed the *proportional hazards* model, the hazard for one patient is proportional to the hazard of all other patients, by a scale depending on individual features.

To learn the values for \(\boldsymbol{w}^T\) one does not need to know the baseline hazard function. Suppose a single patient died at time \(t_j\). The probability that patient \(i\) was the one who died is given by
\begin{align*}
\frac{h(t_j\,|\,\boldsymbol{x_i})}{\sum_{k \in R(t_j)} h(t_j\,|\,\boldsymbol{x_k})} \quad  &= \quad \frac{h_0(t_j)\,e^{\boldsymbol{w}^T\,\boldsymbol{x_i}}}{\sum_{k \in R(t_j)} h_0(t_j)\,e^{\boldsymbol{w}^T\,\boldsymbol{x_k}}}, \\
&= \quad \frac{e^{\boldsymbol{w}^T\,\boldsymbol{x_i}}}{\sum_{k \in R(t_j)} e^{\boldsymbol{w}^T\,\boldsymbol{x_k}}},
\end{align*}
where \(R(t_j)\) is the set of patients still alive (at risk) at time \(t_j\). Thus the likelihood for \(\boldsymbol{w}^T\) is defined as
\[
L\left(\boldsymbol{w}^T\right) = \prod_j \frac{e^{\boldsymbol{w}^T\,\boldsymbol{x_{i\,(j\,)}}}}{\sum_{k \in R(t_j)} e^{\boldsymbol{w}^T\,\boldsymbol{x_k}}},
\]
where subscript \(i (j)\) is interpreted as it is patient \(i\) who died at time \(j\).

By this formulation of the likelihood equation, we see that feature weights are independent of time, *e.g* in a study of survival post surgery, lab test results immediately following surgery have the same weight one day after surgery as they do a year from surgery. 

While we are able to estimate feature weights without specifying the baseline hazard function, we cannot determine the survival distribution. A number of methods exist for calculating the the baseline hazard and therefore the survival distribution including the Kalbfleisch-Prentice estimator which we use in this work. The thesis ["Baseline Survival Function Estimator under Proporitional Hazards Assumption"](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=2ahUKEwjploaonvTcAhWKJ3wKHblTAXoQFjAAegQIBBAC&url=http%3A%2F%2Fstat.nuk.edu.tw%2Fgraduate%2F96_paper%2FM0944411.pdf&usg=AOvVaw1LAbWJ2e1MQL2KIXBPqIZA) by Weng provides a nice discussion of the Breslow and Kalbfleisch-Prentice estimators.


###Cox-Proportional Hazards with Kalbfleisch-Prentice Extension Code (CoxPH_KP.R)

**File Purpose**: Apply the Cox Proportional-Hazards model to data that has been processed and split into folds by *createFoldsAndNormalize()*.

**File Path**: "ISDEvaluation/Models/CoxPH_KP.R".

**Functions**:

1. `CoxPH_KP(training, testing, ElasticNet)`
2. `KPEstimator(lp, lpTime, censorStatus)` - Used to estimate the Kalbfleisch-Prentice estimator when doing elastic-net Cox.

**Arguments**:

* *training*: The training set.
* *testing*: The testing set.
* *ElasticNet*: A Boolean specifying whether or not to
* *lp*: The linear predictions (\(\boldsymbol{w}^T\boldsymbol{x}\)) from the Cox-PH model for the training data.
* *lpTime*: The training dataset event times, *i.e.* `time`.
* *censorStatus*: The training dataset censor status, *i.e* `delta`.

**Packages Required**:

1. `survival` -- Needed for `coxph()` and `survfit()`.
2. `fastcox` -- See Elastic-Net Cox.
3. `prodlim` -- See Elastic-Net Cox.

**File Notes**:

The primary function for this file is `CoxPH_KP()`; `KPEstimator()` is only used for the Elastic-Net Cox model.

Given the list of datasets generated by `createFoldsAndNormalize()`, `CoxPH_KP()` will take a single training set, testing set, and a Boolean specifying whether or not to use the elastic-net model. For this section we assume ElasticNet = FALSE, that is we are NOT using the elastic net formulation. The code uses a tryCatch function to avoid models which are failing to converge (this happened for high dimensional datasets). The Cox-PH model is fit using `coxph()` from the survival package and then survival curves are fit using `survfit()` and specifying the type to be the Kalbfleisch-Prentice estimator. Ties are accounted for using the default of `coxph()`, *i.e.* the Efron approximation.

The output is the same as the four previous models.

**Example Usage:**

```{r, warning=FALSE}
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)
curves = CoxMod$TestCurves
curves
```
<br>

<center>
```{r, echo=FALSE}
plotSurvivalCurves(curves, 1:10)
```
</center>

Note the preceding Figure depicts proportional hazards -- all curves are the exact same shape but at different heights, *i.e.* they are proportional to one another. 

##Cox-Proportional Hazards with Kalbfleisch-Prentice Extension -- Elastic Net

We include the elastic net version of Cox-PH to include a regularized version of Cox-PH to compete with other models such as Random Survival Forests and Multi-Task Logistic Regression which have built in regularization. The notation and implementations of elastic net come from Yang and Zou's paper entitiled ["A Cocktail Algorithm for Solving The Elastic Net Penalized Cox's Regression in High Dimensions""](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=2ahUKEwiG3rLYuvTcAhUBN30KHfIQDYUQFjAAegQIABAC&url=http%3A%2F%2Fusers.stat.umn.edu%2F~zouxx019%2FPapers%2Ffastcox.pdf&usg=AOvVaw21oG6xeyfTb7jaKFBkYJct). The elastic net penalty adds the following term to the likelihood equation,
\[
P_{\lambda, \alpha}(\boldsymbol{w}) = \sum_{j=1}^p \lambda\left(\alpha\,|w_j| + \frac{1}{2}\,(1-\alpha)\, w_j^2 \right),
\]
for \(p\) features where \(\lambda > 0\) and \(0 < \alpha \leq 1\). The first penalty term, \(|w_j|\) corresponds to the LASSO loss and is responsible for feature selection whereas the quadratic term is to control the size of feature weights. For example, \(\alpha = 1\) would correspond to just LASSO-Cox, whereas \(\alpha \approx 0\) would correspond to Ridge-Cox. 

For more information regarding elastic net Cox see the Yang and Zou's paper referenced above.

###Cox-Proportional Hazards with Kalbfleisch-Prentice Extension -- Elastic Net Code (CoxPH_KP.R)

**File Purpose**: Apply the Cox Proportional-Hazards model to data that has been processed and split into folds by `createFoldsAndNormalize()`.

**File Path**: "ISDEvaluation/Models/CoxPH_KP.R".

**Functions**:

1. `CoxPH_KP(training, testing, ElasticNet)`
2. `KPEstimator(lp, lpTime, censorStatus)` - Used to estimate the Kalbfleisch-Prentice estimator when doing elastic net Cox.

**Arguments**:

* *training*: The training set.
* *testing*: The testing set.
* *ElasticNet*: A Boolean specifying whether or not to
* *lp*: The linear predictions (\(\boldsymbol{w}^T\boldsymbol{x}\)) from the Cox-PH model for the training data.
* *lpTime*: The training dataset event times, *i.e.* `time`.
* *censorStatus*: The training dataset censor status, *i.e* `delta`.

**Packages Required**:

1. `survival` -- Needed for the normal Cox implementation.
2. `fastcox` -- Used for the implementation of elastic net cox (`cocktail()`)
3. `prodlim` -- Used for the `sindex() function -- sindex returns the indexes of positions, ideal for evaluating step functions.

**File Notes**:

Given the list of datasets generated by `createFoldsAndNormalize()`, `CoxPH_KP()` will take a single training set, testing set, and a Boolean specifying whether or not to use the elastic-net model. For this section we assume ElasticNet = TRUE, that is we are using the elastic net formulation. There are two hyperparameters involved for elastic net Cox, the strength of the penalty, \(\lambda\), and the type of penalty (more LASSO like or more Ridge like), \(\alpha\). Six values are tested for \(\alpha\), 0.01, 0.2, 0.4, 0.6, 0.8, and 1.0. For each of these values the function `cv.cocktail()` tests 100 values of \(\lambda\) using 5-fold cross validation and returns \(\lambda_{min}\), the optimal value of \(\lambda\) to minimize the cross validation error (see the `cocktail() function of the [package documentation](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=2ahUKEwjCmMChxvTcAhUMIjQIHTumDoEQFjABegQICRAC&url=https%3A%2F%2Fcran.r-project.org%2Fweb%2Fpackages%2Ffastcox%2Ffastcox.pdf&usg=AOvVaw18r4av9gbxAKwh-LuEP3j5)).The optimal value for \(\alpha\) and \(\lambda\) from internal (10) repeated 5-fold cross validation are then used for evaluating the test fold. Note the optimal values for \(\alpha\) and \(\lambda\) will (likely) differ across folds.

One the feature weights have been estimated, linear predictors are passed to `KPEstimator()` to estimate the survival curves. As of now, ties are not accounted for and the resulting survival distribution are using the estimate derived in the presence of no ties. Future iterations of this codebase should account for the correct estimate when handling ties.

The output is the same as previous models.

**Example Usage:**

```{r, warning=FALSE}
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=T)
curves = CoxMod$TestCurves
curves
```
<br>

<center>
```{r, echo=FALSE}
plotSurvivalCurves(curves, 1:10)
```
</center>



##Random Survival Forest - Kaplan-Meier Extension 

[Random Survival Forests](https://arxiv.org/pdf/0811.1645) (RSFs) were developed by Ishwaran et al. in 2008 and have been growing in popularity in recent years. While we do not give vast detail regarding the formation of these forests, the [original paper](https://arxiv.org/pdf/0811.1645) and [R package details](https://kogalur.github.io/randomForestSRC/theory.html) provide an excellent overview. 

The primary focus for RSFs were to build a risk score using an averaged cumulative hazard from the terminal nodes which in which patients landed. As previously mentioned here and explained further in our paper, risk scores are useful for discriminating between patients, but do not provide individual survival distributions.While survival distributions are not mentioned in the original RSF paper, the R package details briefly describe how RSFs build individual survival distributions in Section 8.1. Specifically, they state survival estimates are computed using Kaplan-Meier estimators. However, given a total of \(M\) trees in a forest, there will be \(M\) Kaplan-Meier estimators for each patient and the details of how these Kaplan-Meier estimators were combined are not given. After some backwards engineering I was able to come up with the conclusion that Kaplan-Meier curves are "averaged" across all the Kaplan-Meier curves - see Figure below.


<center>
```{r, echo=FALSE}
time1 = c(3,5,8,10,15)
cens1 = c(1,1,1,1,1)
time2 = c(6,9,10,17,20)
cens2 = c(1,1,1,1,1)
dat = cbind.data.frame(time1,cens1,time2,cens2)


p1 = prodlim(Surv(time1,cens1)~1,data = dat)
p2 = prodlim(Surv(time2,cens2)~1, data = dat)
KM1 = predict(p1, seq(0,20,by = 0.01)) 
KM2 = predict(p2, seq(0,20,by = 0.01))
KM1 = ifelse(is.na(KM1),0,KM1)
KM2 = ifelse(is.na(KM2),0,KM2)
RKM = (KM1 + KM2)/2

dat2 = cbind.data.frame(time = seq(0,20,by=0.01), RKM = RKM, KM1 = KM1, KM2 = KM2)

ggplot(data = dat2)+
  geom_line(aes(x = time, y = KM1), color = "dodgerblue2",size = .8)+
  geom_line(aes(x = time, y = KM2), color = "dodgerblue2", size = .8)+
  geom_line(aes(x = time, y = RKM), color = "orangered2",size = 1) +
  theme_bw()+
  labs(x = "Time",y = "Survival Probability")+
  theme(text = element_text(size = 18, face = "bold"))

```
</center>

<figcaption>Figure: Depiction of averaging two Kaplan-Meier curves together to produce one individual survival distribution for a test patient. Blue Kaplan-Meier curves represent the Kaplan-Meier estimators in two terminal nodes whereas the red curve represents the combination - giving form to the individual survival distribution.</figcaption>

###Random Survival Forest - Kaplan-Meier Extension Code (RandomSurvivalForests.R)

**File Purpose**: Apply the RSF model to data that has been processed and split into folds by *createFoldsAndNormalize()*.

**File Path**: "ISDEvaluation/Models/RandomSurvivalForests.R".

**Functions**:

1. `RSF(training, testing)`
2. `internalCV_RSF(training, numFolds)` -- Helper function for RSF

**Arguments**:

* *training*: The training set.
* *testing*: The testing set.
* *numFolds*: The number of folds for internal cross validation.

**Packages Required**:

1. `survival` -- Needed for the `Surv()` function
2. `randomForestSRC` -- Used for the RSF implementation.

**File Notes**:

Given the list of datasets generated by `createFoldsAndNormalize()`, `RSF()` will take a single training set, testing set, and the number of folds to use for internal cross validation to select hyperparameters (namely ntree and mtry) and build the RSF model using `rfsrc()` from the randomForestSRC package. Internal cross validation is performed for ntree $\in \{500,1000, 2000\}$ and mtry (number of patients in each terminal node) $\in \{1,2,3\} $ Then the `predict()` function for rfsrc is used to build survival curves for test patients.

The output is the same as previous models.

**Example Usage:**

```{r, warning=FALSE}
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
rsfMod = RSF(training1stFold, testing1stFold)
curves = rsfMod$TestCurves
names(curves) = c("time",1:26)
curves
```
<center>
```{r, echo=FALSE}
plotSurvivalCurves(curves, 1:10)
```
</center>


##Multi-task Logistic Regression (MTLR.R)

Of the models tested, Multi-task Logistic Regression (MTLR) is the least well known so we give a brief description here. Consider modeling the probability of survival of patients at each of a vector of time points
$\tau = [t_1, t_2, \ldots, t_m]$, *e.g.* $\tau$ could be the 60 monthly intervals from 1 month up to 60 months. 
We can set up a series of logistic regression models:
For each patient, represented as $\vec{x}$,
\[
P(T \geq t_i \,|\, \vec{x}) =  \left(1 + \exp(\vec{\theta_{i}}\cdot \vec{x} )\right)^{-1}
\]
where $\vec{\theta_i}$  are the time-specific feature vectors.
While the input features $\vec{x}$ stay the same for all these classification tasks,  
the binary labels $y_i = [T\geq t_i]$ can change depending on the threshold $t_i$.%

We encode the survival time $d$ of a patient as a binary sequence 
$y = y(d) = (y_1, y_2, \ldots, y_m)$, where $y_i = y_i(d) \in \{0,1\}$ denotes the survival status of the patient at time $t_i$, 
so that $y_i = 0$ (no death event yet) for all $i$ with $t_i <d$, and $y_i = 1$ (death) for all $i$ with $t_i \geq d$.



Here there are $m+1$ possible legal sequences of the form $(0,0,\ldots,1,1,\ldots,1)$, including the sequence of all 0's and the sequence of all 1's. 
The probability of observing the survival status sequence $y = (y_1, y_2, \ldots, y_m)$ can be represented as:
\[
  P_\Theta(Y\!\! =\!\! (y_1, y_2, \ldots, y_m) \,\,|\,\, \vec{x}) = \frac{\exp(\sum_{i=1}^m y_i \times \theta_{i}\cdot\vec{x} )}
     {\sum_{k=0}^m \exp(f_{\Theta}(\vec{x}, k))}, 
\]
where $\Theta = (\theta_{1}, \ldots, \theta_{m})$, and
$f_{\Theta}(\vec{x}, k) = \sum_{i=k+1}^m (\theta_{i}\cdot\vec{x})$ 
for $0\leq k\leq m$ is the score of the sequence with the event occurring in the interval $[t_k, t_{k+1})$ before taking the logistic transform, with the boundary case $f_{\Theta}(\vec{x}, k)= 0$ being the score for the sequence of all `0's. 
Given a dataset of $n$ patients $\{ \vec{x_r}\}$ with associated time of deaths $\{ d_r \}$, 
we find the optimal parameters (for the MTLR model) $\Theta^*$ as
\[
\Theta^*\ =\
\arg\max_{\Theta} 
\sum_{r=1}^n \left[\sum_{i=1}^m y_j(d_r)(\theta{i}\!\cdot\!\vec{x_r})\! - \log \sum_{k=0}^m \exp f_{\Theta}(\vec{x_r},k) \right]
  -  
\frac{C}{2}\!\sum_{j=1}^m\! \|\theta_{j}\|^2\! 
% +\! \frac{C_2}{2}\sum_{j=1}^{m-1}\! \|\theta_{j+1}\!-\!\theta_j\|^2\!
\]
where the $C$ (for the regularizer) is found by an internal cross-validation process.
}


There are many details here -- \eg 
to insure that the survival function starts at 1.0, and decreases monotonically and smoothly until reaching 0.0 for the final time point; 
to deal appropriately with censored patients; 
to decide how many time points to consider ($m$); and 
to minimize the risk of overfitting (by regularizing),
and by selecting the relevant features.
The [paper by Yu et al.](https://papers.nips.cc/paper/4210-learning-patient-specific-cancer-survival-distributions-as-a-sequence-of-dependent-regressors) provides the details.


###Multi-task Logistic Regression Code (MTLR.R)

**File Purpose**: Apply the MTLR model to data that has been processed and split into folds by *createFoldsAndNormalize()*.

**File Paths**: 

1. "ISDEvaluation/Models/MTLR.R".

**Functions**:

1. `MTLR(training, testing,C1 = NULL, numFolds = 5)`
2. `internalCV_MTLR(training, numFolds)` -- Helper function for `MTLR()`.
3. `avgLogLikLoss(params, dat, timePoints)` -- Helper function for `internalCV_MTLR()`.

**Arguments**:

* *training*: The training set.
* *testing*: The testing set.
* *C1*: The regularization parameter.
* *numFolds*: The number of folds for internal cross validation.
* *params*: The feature weights.
* *dat*: The testing dataset (from internal CV).
* *timePoints*: The time points used for evaluating `MTLR()`.

**Packages Required**:  

1. `Rcpp` -- Used for fast implementation of gradient and objective function calculation.

**File Notes**:

Given the list of datasets generated by `createFoldsAndNormalize()`, `MTLR()` will take a single training set, testing set, optionally a value for the regularization parameter, and the number of folds for internal cross-validation if *C1* is left as NULL. Internal cross-validation is performed to select an the optimal regularization parameter (the average log-likelihood loss is optimized). Features are optimized via gradient descent and the `optim()` function. Since the objective function is non-convex with censored data, the weights are trained once treating all patients as uncensored and then trained again with the true censor values (this is in hopes of finding good starting weights). 

The output is the same as previous models. 

**Example Usage:**

```{r, warning=FALSE}
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
mtlrMod = MTLR(training1stFold, testing1stFold)
curves = mtlrMod$TestCurves
names(curves) = c("time",1:26)
curves
```
<center>
```{r, echo=FALSE}
plotSurvivalCurves(curves, 1:10)
```
<figcaption>Figure: MTLR Survival curves **without** last time point added.</figcaption>
</center>


#Evaluting ISD Models

Once the survival curves have been generated it is likely beneficial to evaluate them using some type of metric. We tested (5) different metrics, Concordance (a discriminatory measure), 1-Calibration, Brier score, L1-Loss and novel D-Calibration. Below we only give the details for the code -- for explanations of each metric please see the paper associated with this tutorial (Link). In addition to the evaluation metrics we also describe the function used to calculate the mean/median survival from each survival curve. We will show all evaluations using the unregularized Cox model as this is one of the most common models used in survival analysis.

##Calculating ``Average'' Survival (EvaluationHelperFunctions.R)

**File Purpose**: Calculate the mean/median survival time given a single survival curve.

**File Paths**: 

1. "ISDEvaluation/Evaluations/EvaluationHelperFunctions.R"

**Functions**:

1. `predictProbabilityFromCurve(survivalCurve,predictedTimes, timeToPredict)`
2. `predictMeanSurvivalTimeSpline(survivalCurve, predictedTimes)`
3. `predictMedianSurvivalTimeSpline(survivalCurve, predictedTimes)`

**Arguments**:

1. *survivalCurve*: The survival probabilities generated for a individual survival curve.
2. *predictedTimes*: The times corresponding to survival probabilities, *e.g.* time = 0 corresponds to a survival probability of 1.
3. *timeToPredict*: The time for which to extract a survival probability given a survival curve. Note this can either be a single time or a vector of times.

**Packages Required**: None.

**File Notes**: 

This files purpose is to provide support for the evaluation metrics. Some metrics (concordance, L1-loss) require a single risk/survival measure and for those we use the mean (or median) survival times, `predictMeanSurvivalTimeSpline()` and `predictMedianSurvivalTimeSpline()` accordingly. To predict the mean time a monotonic spline fit is used to make the survival probabilities continuous. Specifically, we use the Hyman filtering of the cubic Hermite spline (see the [Wikipedia page](https://en.wikipedia.org/wiki/Monotone_cubic_interpolation) and the page's references for an introduction). However, if the survival curves do not go to 0 probability we use a linear fit from the time = 0, survival probability = 1 point to the last time point we have and extend this linear fit to zero, see the Figure below. Once the survival curves have been extended to zero the integral of the survival curve is taken to be the mean survival time.

![Figure: Linear fitting to extend survival curves to 0 survival probability. Left: Survival probabilities with no linear fit, Right: Survival probabilityies with linear fitting.](./Figures/NoLinearVsLinear.png)

For calculating the median survival time the same spline fit is used and then an inverse spline is fit to the original spline fitting. The time associated with a survival probability of 0.5 is used as the median survival time.

Additionally, some evaluation metrics (specifically calibration metrics) require survival probabilities for a given time. This is implemented by `predictProbabilityFromCurve()`. The same spline fit (with linear extension to zero) from before is fitted to the survival probabilities and the given time(s) to predict (*timeToPredict*) is calculated using the internal `spline()` function. 

**Example Usage**:
```{r, include = FALSE}
plotSurvivalCurvesWithLinear = function(survivalCurves, indexToPlot = 1, color = c(), xlim = c()){
  colorOK = T
  if(length(color) == 0)
    colorOK = F
  else if(length(color) != length(indexToPlot)){
    warning("If you would like to select custom colors please make sure the number of colors
            matches the number of curves.")
    colorOK =F 
  }
  time = survivalCurves$time
  curves = survivalCurves[,indexToPlot +1,drop=F]
  plotTimes = seq(min(time),max(time), length.out = length(time)*20)
  plotProbs = as.data.frame(sapply(curves,
                                   function(curve){
                                     curve = ifelse(curve < 1e-20,0,curve)
                                     survivialSpline = splinefun(time, curve, method = "hyman")
                                     return(pmax(survivialSpline(plotTimes),0))
                                   }
  ))
  maxProb = max(plotProbs[nrow(plotProbs),])
  maxSlope = (1 - maxProb)/(0 - max(plotTimes))
  maxLinearTime = max(plotTimes) + (0 - maxProb)/maxSlope
  additionalPlotTimes = seq(max(plotTimes), maxLinearTime, length.out = ceiling((maxLinearTime - max(plotTimes)))*20)
  plotLinearProbs = as.data.frame(sapply(plotProbs,
                                   function(probs){
                                     slope = (1 - min(probs))/(0 - max(plotTimes))
                                     return(pmax(min(probs) + (additionalPlotTimes - max(plotTimes))*slope,0))
                                   }
  ))
  plotTimes = c(plotTimes, additionalPlotTimes)
  plotProbs = rbind.data.frame(plotProbs, plotLinearProbs)
  data = cbind.data.frame(plotTimes,plotProbs)
  longFormData = melt(data,measure.vars = names(data)[-1], variable.name = "Patient")
  plot = ggplot(data = longFormData, aes(x = plotTimes,y = value, colour = Patient))+
    geom_line(size = 1.5)
  if(colorOK)
    plot = plot + scale_color_manual(values = color) 
  if(length(xlim)==2){
    plot = plot+ xlim(c(xlim[1],xlim[2]))
  }
  plot = plot +scale_y_continuous( limits= c(0,1),breaks = c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1))+
    theme_bw() +
    theme(text = element_text(size=18, face=  "bold"),
          axis.title = element_text(size = 20),
          axis.title.x = element_text(margin = margin(t = 15)),
          axis.title.y = element_text(margin = margin(r = 15))) + 
    labs(y = "Survival Probaility",x = "Time" )
  
  return(plot)
}
```

```{r, warning=FALSE}
#Set up:
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)
curves = CoxMod[[1]]


survivalTimes = curves[,1]
#We use the second patient because the first had a very shallow survival curve - see Cox curves in Seciton 3.3.1.
secondSurvivalCurve = curves[,3]

#Getting the survival probability for 5 months:
predictProbabilityFromCurve(survivalCurve = secondSurvivalCurve, predictedTimes = survivalTimes, timeToPredict = 5)
#Getting multiple survival probabilities:
predictProbabilityFromCurve(survivalCurve = secondSurvivalCurve, predictedTimes = survivalTimes, timeToPredict = c(5,10,20,40,80,160))

#Getting the mean survival time:
predictMeanSurvivalTimeSpline(survivalCurve = secondSurvivalCurve, predictedTimes = survivalTimes)

#Getting the median survival time:
predictMedianSurvivalTimeSpline(survivalCurve = secondSurvivalCurve, predictedTimes = survivalTimes)
```
<center>
```{r, echo=FALSE,}
plotSurvivalCurvesWithLinear(curves,2)
```
</center>
##Concordance (Concordance.R)

**File Purpose**: Calculate the concordance of a trained model on test data.

**File Path**: "ISDEvaluation/Evaluations/Concordance.R"

**Functions**:

1. `Concordance(survMod, ties = "None", method = "Median")`

**Arguments**:

1. *survMod*: The output of any of the survival models. Specifically this must be a list of (1) the survival curves of the test patients, (2) the test patients true time of death and censor status, and (3) the training patients true time of death and censor status. Technically the third component is not required for the concordance calculation.

2. *ties*: A string indicating the method in which to handle ties. There are four options: (1) "None" which will exclude all ties, (2) "Risk" which will include ties who were tied in their risk (mean/median survival time), (3) "Time" which will include ties whos survival time was equal, and (4) "All" which includes both risk and time ties.

3. *method*: A string indicating to use the "Mean" or the "Median" for the risk score evaluation. 

**Packages Required**: 

1. `survival` - This package is used for the `survConcordance()` function -- an time-efficient function for calculating concordance.

**File Notes**: 

As discussed in the paper, concordance is a discriminative measure typically used to evaluated models which produce risk scores for patients. Here the risk score is given by the negative of the mean/median of an individuals survival distribution. Once the average survival times are calculated `survConcordance()` calculated the concordance information and the concordance score is returned depending on the values of the *ties* argument.

**Example Usage**:

```{r, warning=FALSE}
#Set up:
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)

#Evaluation:
Concordance(CoxMod, ties = "None", method = "Mean")

```

##1-Calibration (OneCalibration.R) {#1Cal}

**File Purpose**: Perform the 1-Calibration test on survival data.

**File Path**: "ISDEvaluation/Evaluations/OneCalibration.R"

**Functions**:

1. `OneCalibration(survMod, timeOfInterest = c(), type = "DN", numBuckets = 10)`
2. `OneCalibrationCumulative(listOfSurvivalModels, timeOfInterest = c(), type = "DN", numBuckets = 10)`
3. `binItUp(trueDeathTimes,censorStatus, predictions, type, numBuckets,timeOfInterest)`

**Arguments**:

1. *survMod*: The output of any of the survival models. Specifically this must be a list of (1) the survival curves of the test patients, (2) the test patients true time of death and censor status, and (3) the training patients true time of death and censor status. Technically the third component is not required for the concordance calculation.

2. *timeOfInterest*: The time at which to evaluate the model for 1-Calibration. If nothing is passed in the quantiles (.1, .25,.5,.75,.9) of the true event times will be used.

3. *type*: This is the type of 1-Calibration to be used, either "Uncensored" to use 1-Calibration for only the uncensored patients and discard all censored patients or "DN" for the D'Agostino-Nam translation.

4. *numBuckets*: The number of buckets/bins to use for 1-Calibration.

5. *listOfSurvivalModels*: A list of the output of survival models, *e.g.* a list of different *survMod* like objects.

6. *trueDeathTimes*: The true event times of the subjects.

7. *censorStatus*: The bit identifying if a patient died or was censored, *i.e* `delta`.

8. *predictions*: The predicted survival probability for patients at their true event times.

**Packages Required**: 

1. `survival` -- This package is used for the `Surv()` function.

2. `dplyr` -- This is used for `group_by()`, `summarise()`, and mutate()`.

**File Notes**: 

There are two ways to perform 1-Calibration here -- one can either pass the output from a survival model and have 1-Calibration evaluated on the test fold or one can pass a list of multiple survival models and have 1-Calibration evaluated on all patients. Note by doing the first way, `OneCalibration()`, there will end up being the same number of p-values as there are cross-validation folds and there will be no straightforward way of combining them (DO NOT AVERAGE THEM). `OneCalibrationCumulative()` mitigates this by taking in the model results from all folds and performing a single 1-Calibration test using all the test data. `binItUp()` is a helper function that takes probability predictions and performs the actual specified 1-Calibration test.

**Example Usage**:

```{r, warning=FALSE}
#Set up for OneCalibration():
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)

#Evaluation:
#By not specifying timeOfInterest we will get the 10th, 25th, 50th, 75th, and 90th quantiles of the true event times in the testing fold.
OneCalibration(CoxMod, timeOfInterest = c(), type = "DN",numBuckets = 10)

#One Calibration for uncensored patients.
OneCalibration(CoxMod, timeOfInterest = c(), type = "Uncensored",numBuckets = 10)

#Set up for OneCalibrationCumulative():
coxList = list()
for(i in 1:5){
  trainingFold = trainAndTest$Training[[i]]
  testingFold = trainAndTest$Testing[[i]]
  CoxMod = CoxPH_KP(trainingFold, testingFold, ElasticNet=F)
  coxList[[i]] = CoxMod
}
#Cumulative 1-Calibration for default time of Interest
OneCalibrationCumulative(listOfSurvivalModels = coxList, timeOfInterest = c(), type = "DN", numBuckets = 10)
#Cumulative 1-Calibration for a specific time (2 months)
OneCalibrationCumulative(listOfSurvivalModels = coxList, timeOfInterest = 2, type = "DN", numBuckets = 10)
```


##Brier Score (BrierScore.R)

**File Purpose**: Evaluate the Brier score on test data.

**File Path**: "ISDEvaluation/Evaluations/BrierScore.R"

**Functions**:

1. `BrierScore(survMod. type = "Interated", singleTime = NULL, integratedBrierTimes = NULL, numPoints = NULL)`
2. `singleBrier(survMod, BrierTime = NULL)`
3. `integratedBrier(survMod, integratedBrierTimes, numPoints)`
4. `sinleBrierMultiplePoints(survMod, BrierTimes)`

**Arguments**:

1. *survMod*: The output of any of the survival models. Specifically this must be a list of (1) the survival curves of the test patients, (2) the test patients true time of death and censor status, and (3) the training patients true time of death and censor status. Technically the third component is not required for the concordance calculation.

2. *type*: A string of either "Integrated" or "Single". This will indicate whether to return the Brier score for a point or the integrated Brier score.

3. *singleBrierTime*: The time point to evaluate the (non-integrated) Brier score. If left as NULL, the 50th percentile of the training and testing data's event times combined will be used.

4. *integratedBrierTimes*: A vector of length 2 indicating the limits of integration for the integrated Brier score. If left as NULL the range will be from zero to the max event time of the entire dataset (training and testing set combined).

5. *numPoints*: The number of points to evaluate the integral, *i.e.* the higher the number of points the less error in the integration. If left as NULL the number of points will correspond to the number of events between the limits of integration.

6. *BrierTimes*: A vector of times to evaluate a single Brier score -- used in an internal function (`singleBrierMultiplePoints()`) of `integratedBrier()`.

**Packages Required**: 

1. `prodlim` -- Used for the `prodlim()` function that calculates the Kaplan-Meier curve. This is then used for the inverse probability of censoring weights (IPCW).

**File Notes**: 

In this file `BrierScore()` is the only function that needs to be called -- all the other functions simply work within `BrierScore()`. Note that there is no option for removing censored patients and the Brier score returned will always be the score corrected by the IPCWs. For the integrated Brier score the integral is evaluated using the simple [trapezoidal rule](https://en.wikipedia.org/wiki/Trapezoidal_rule). If numPoints is specified then the points where the Brier score is evaluated will be evenly spread across the limits of integration. However, if numPoints is not specified then the points evaluated will be the event times in the testing data. These rules follow from the `sbrier()` function in the ipred library written by Torsten Hothorn.

**Example Usage**:


```{r, warning=FALSE}
#Set up:
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)

#Evaluation:
#Integrated with specified times:
BrierScore(CoxMod, type = "Integrated", integratedBrierTimes = c(0,36))

#Integrated with unspecified times:
BrierScore(CoxMod, type = "Integrated")

#Single with specified times:
BrierScore(CoxMod, type = "Single", singleBrierTime = 12)

#Single with unspecified times:
BrierScore(CoxMod, type = "Single")
```

##L1-Loss (L1Measures.R) 

**File Purpose**: Evaluate the L1 loss on test data.

**File Path**: "ISDEvaluation/Evaluations/L1Measures.R"

**Functions**:

1. `L1(survMod. type = "Uncensored", logScale = FALSE, method = "Median")`

**Arguments**:

1. *survMod*: The output of any of the survival models. Specifically this must be a list of (1) the survival curves of the test patients, (2) the test patients true time of death and censor status, and (3) the training patients true time of death and censor status. Technically the third component is not required for the concordance calculation.

2. *type*: A string indicating the type of L1-loss, options are one of "Uncensored", "Hinge", or "Margin". See the paper for details regarding the mathematical definition of each option.

3. *logScale*: A Boolean indicating whether to use the log scaled version of the L1-loss.

4. *method*: A string indicating whether to use the "Mean" or "Median" as survival predictions.

**Packages Required**: 

1. `prodlim` -- Used for the `prodlim()` function that calculates the Kaplan-Meier curve. See File Notes for more details.

2. `survival` -- Used for the `Surv()` function.

**File Notes**: 

`L1()` returns the average L1 loss based on the specifications given, *e.g* the Margin L1-loss using the mean survival to compute the loss. There is one important specification to note -- the Kaplan-Meier curve is generated (from the training data) and is completed using the linear extension as specified in Section 4.1 above. The mean of this Kaplan-Meier curve is computed and the survival estimate returned is the minimum of the mean survival time for individual survival curves and the mean of this Kaplan-Meier curve. See "L1Measures.R"" for the implementation details.

**Example Usage**:


```{r, warning=FALSE}
#Set up:
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)

#Evaluation:
#Uncensored Loss with Mean:
L1(CoxMod, type = "Uncensored", logScale = FALSE, method = "Mean")

#Hinge Loss with Mean:
L1(CoxMod, type = "Hinge", logScale = FALSE, method = "Mean")

#Margin Loss with Mean:
L1(CoxMod, type = "Margin", logScale = FALSE, method = "Mean")

#Margin Loss with Median:
L1(CoxMod, type = "Margin", logScale = FALSE, method = "Median")

#Margin Loss with Median on Log Scale:
L1(CoxMod, type = "Margin", logScale = TRUE, method = "Median")
```

##D-Calibration (DCalibration.R) {#DCal}

**File Purpose**: Evaluate test data using D-Calibration.

**File Path**: "ISDEvaluation/Evaluations/DCalibration.R"

**Functions**:

1. `DCalibration(survMod, numBins)`

2. `DCalibrationCumulative(listOfSurivalModels, numBins)`

3. `getBinned(survMod, numBins)`

**Arguments**:

1. *survMod*: The output of any of the survival models. Specifically this must be a list of (1) the survival curves of the test patients, (2) the test patients true time of death and censor status, and (3) the training patients true time of death and censor status. Technically the third component is not required for the concordance calculation.

2. *numBins*: The number of bins to use for D-calibration.

3. *listOfSurvivalModels*: A list of the output of survival models, *e.g.* a list of different *survMod* like objects.

**Packages Required**: 

1. `prodlim` -- Used for the `sindex()` function.

2. `plyr` -- Used for the `ldply()` function to combine lists.

**File Notes**: 

The format of the DCalibration.R file follows the same as OneCalibration.R. That is, `DCalibration()` is used for a single fold and returns a *p*-value. However, since *p*-values should not be averaged there is also a `DCalibrationCumulative()` function which will take all the folds test data and perform a single D-Calibration test. Lastly, `getBinned()` is a helper function for `DCalibration()` and `DCalibrationCumulative()` and does not need to be called directly. 

**Example Usage**:

Here we do two models, Cox and MTLR. Below you will see that the Cox model fails whereas MTLR performs well. Additionally we show the corresponding cumulative D-Calibration histograms for each model.

```{r, warning=FALSE}
#Set up:
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)

MTLRMod = MTLR(training1stFold, testing1stFold)

#D-Calibration for Cox
DCalibration(CoxMod, numBins = 10)

#D-Calibration for MTLR
DCalibration(MTLRMod, numBins = 10)


#Set up for DCalibrationCumulative():
coxList = list()
mtlrList = list()
for(i in 1:5){
  trainingFold = trainAndTest$Training[[i]]
  testingFold = trainAndTest$Testing[[i]]
  CoxMod = CoxPH_KP(trainingFold, testingFold, ElasticNet=F)
  coxList[[i]] = CoxMod
  
  MTLRMod = MTLR(trainingFold, testingFold)
  mtlrList[[i]] = MTLRMod
}
#Cumulative D-Calibration for cox
DCalibrationCumulative(listOfSurvivalModels = coxList, numBins = 10)
```

<center>
```{r, echo = FALSE}
forHist = ldply(lapply(seq_along(coxList), function(x) getBinnedForHistorgram(coxList[[x]], 10)), rbind) %>% 
  group_by(Bin) %>% 
  summarise(Uncen = sum(Uncensored)/128, Cen = sum(Censored)/128)
library(lattice)
barchart(Bin~Uncen+Cen,data=forHist,
         stack=T,
         auto.key = list(space='right',cex=.8,text=c('Uncensored','Censored')),
         xlab = "Proportion of Patients",
         scales=list(y=list(tick.number = 160)))
```
</center>
<figcaption>Figure: Histogram for D-Calibration corresponding to Cox model. Bin 1 refers to the [0,.1) bin and Bin 10 refers to the [0.9, 1] bin. Here we see the cox model puts too many of the patients in the first and last bin -- leading to a failed D-Calibration test. </figcaption>

```{r}
DCalibrationCumulative(listOfSurvivalModels = mtlrList, numBins = 10)
```

<center>
```{r, echo = FALSE}
forHist = ldply(lapply(seq_along(coxList), function(x) getBinnedForHistorgram(mtlrList[[x]], 10)), rbind) %>% 
  group_by(Bin) %>% 
  summarise(Uncen = sum(Uncensored)/128, Cen = sum(Censored)/128)
library(lattice)
barchart(Bin~Uncen+Cen,data=forHist,
         stack=T,
         auto.key = list(space='right',cex=.8,text=c('Uncensored','Censored')),
         xlab = "Proportion of Patients",
         scales=list(y=list(tick.number = 160)))
```
</center>

<figcaption>Figure: Histogram for D-Calibration corresponding to MTLR model. Bin 1 refers to the [0,.1) bin and Bin 10 refers to the [0.9, 1] bin. The bins are much more even here than those given by the Cox model, resulting in a passed D-Calibration test. </figcaption>

#Miscellanious

##Plotting Individual Survival Curves (plotSurvivalCurves.R) {#plotting}

**File Purpose**: Plot individual survival curves.

**File Path**: "ISDEvaluation/Plotting/plotSurvivalCurves.R"

**Functions**:

1. `plotSurvivalCurves(survivalCurves, indexToPlot, color = c(), xlim=c())`

**Arguments**:

1. *survivalCurves*: A dataframe where the first column corresponds to survival times and all other columns are survival probabilities of individual patients. This corresponds to the first item of the list returned by all models, *e.g.* `CoxPH_KP()`.

2. *indexToPlot*: The patient number you wish to plot. From *survivalCurves* patient 1 would be the second column. This can either be an integer specifying which patient or a vector of integers, specifying multiple patients.

3. *color*: The color of the survival curves. The length much match that of *indexToPlot*.

4. *xlim*: You can specify the range of the x-axis. This should be a vector of length two.

**Packages Required**: 

1. `ggplot2` -- Used for plotting.

2. `reshape2` -- Used to reshape the data to long form.

**File Notes**: 

The plotting code will select the columns specified by *indexToPlot* and fit a spline using the time column and the survival probabilities columns. This spline is then used to interpolate many points between the probabilities such that we have a smooth fit to plot. See below for a few plotting examples. Note that *indexToPlot* will match up testing data depending on the fold. Additionally, if over 15 indexes are specified, there will be no legend (the legend fills up the plot space).

**Example Usage**:


```{r, warning=FALSE}
#Without feature selection.
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
folds = createFoldsAndNormalize(survivalDataset = validatedAndCleaned, numberOfFolds = 5)
trainAndTest = folds[[2]]
training1stFold = trainAndTest$Training[[1]]
testing1stFold = trainAndTest$Testing[[1]]
CoxMod = CoxPH_KP(training1stFold, testing1stFold, ElasticNet=F)
survivalCurves = CoxMod[[1]]
plotSurvivalCurves(survivalCurves, indexToPlot = 1)

plotSurvivalCurves(survivalCurves, indexToPlot = 2)

plotSurvivalCurves(survivalCurves, indexToPlot = 2, color = "blue")

plotSurvivalCurves(survivalCurves, indexToPlot = 1:10)

plotSurvivalCurves(survivalCurves, indexToPlot = 1:10, xlim = c(0,25))

```

Note here that the index corresponds to the row number in the original dataset (they are all out of order here since we used an arbitrary cross-validation fold).

##Feature Selection (FeatureSelection.R)

**File Purpose**: Perform feature selection on survival data.

**File Path**: "ISDEvaluation/FeatureSelection/FeatureSelection.R"

**Functions**:

1. `FeatureSelection(dataset, type = "UniCox", obs_thresh = 0, pThresh = 0.1)`

2. `uniCox(dataset, obsthresh = 0, pThresh=0.1)`

**Arguments**:

1. *dataset*: This is a dataset which has been validated for use -- i.e. gone through `validateAndClean()`

2. *type*: The type of feature selection to use. Currently (August 28, 2018) the only option is "UniCox". See File Notes for details.

3. *obs_thresh*: The number of observations that a variable must have to qualify for selection.

4. *pThresh*: The p-value threshold to be included in the analysis.

**Packages Required**: 

1. `caret` -- the `dummyVars()` function is used for one-hot encoding.

**File Notes**: 

The code here is written so custom feature selection methods can be easily added -- they just have to added to a switch command. Here we use uni-variate cox for our feature selection -- the details of this method are given in the paper. For this feature selection we first check that that each variable has a *obs_thresh* number of events (deaths) occur. The default for this is 0 events. Next the p-value for the feature must be below 0.1 to be included in the selected dataset. Note that variables are one-hot encoded before selection so all the values of a categorical variable may not be present post feature selection.

**Example Usage**:

Below you will see that unselected data has 40 features whereas the data post feature selection has only 22 features.

```{r, warning=FALSE}
#Without feature selection.
validatedAndCleaned = validateAndClean(survivalDataset = NACD)
dim(validatedAndCleaned)

#After we have the selected data we would then pass it to createFoldsAndNormalize().
selectedData = FeatureSelection(validatedAndCleaned)
dim(selectedData)
```




