Introduction

Cancer prognosis is a critical clinical indicator that is widely used in cancer research and clinical practice. For example, in a clinical trial of a potential drug, scientists might evaluate if the drug prolongs cancer patients’ survival upon the treatment, compared with placebo or a standard of care treatment regimen. In this project, we are interested to explore the performance of multiple linear regression in modeling breast cancer prognosis (survival time in months) using patients’ epidemiological and clinical characteristics.

The breast cancer dataset we use here is from the Surveillance, Epidemiology, and End Results (SEER) program of the National Cancer Institute (NCI) in the United States. The dataset was downloaded from the IEEE DataPort here.

The data file contains clinical information of 4024 female patients with infiltrating duct and lobular carcinoma breast cancer diagnosed in 2006-2010. There are 15 variables describing patients’ epidemiological and clinical characteristics such as age, race, marital status, stage, grade, tumor size, lymph nodes, survival time (months) and status (alive or dead). download dataset here

Here is a brief description of the variables in this dataset:

Table 1 - Description of All Variables in the Dataset
Variable Type Brief Description Additional Information
Age integer the age (in years) of the patient at cancer diagnosis
Race categorical race recode Categories: “White”, “Black”, “Other (American Indian/AK Native, Asian/Pacific Islander)”
Marital.Status categorical patient’s marital status at the time of diagnosis Categories: “Married (including common law)”, “Single (never married)”, “Divorced”, “Separated”, “Widowed”
T.Stage categorical T Stage of tumor according to Adjusted AJCC 6th T (1988+), primarily based on tumor size Categories: “T1”, “T2”, “T3”, “T4”
N.Stage categorical N Stage of tumor according to Adjusted AJCC 6th T (1988+), primarily based on lymph node metastasis information Categories: “N1”, “N2”, “N3”
A.Stage categorical historic stage from Collaborative Stage (CS) for 2004+ and Extent of Disease (EOD) from 1973-2003 Categories: “Regional”, “Distant”
6th.Stage categorical overall cancer stage according to Adjusted AJCC 6th Stage (1988+), based on tumor size, lymph node and distant metastasis, and other clinical information Categories: “IIA”, “IIB”, “IIIA”, “IIIB”, “IIIC”
Grade categorical tumor grading based on differentiation level of tumor cells Categories: “Well differentiated; Grade I”, “Moderately differentiated; Grade II”, “Poorly differentiated; Grade III”, “Undifferentiated; anaplastic; Grade IV”
Tumor.Size integer exact tumor size in millimeters
Estrogen.Status categorical Estrogne Receptor status recode Categories: “Negative”, “Positive”
Progesterone.Status categorical Progesterone Receptor status recode Categories: “Negative”, “Positive”
Regional.Nodes.Examined integer the total number of regional lymph nodes that were removed and examined by the pathologist
Regional.Nodes.Positive integer the exact number of regional lymph nodes examined by the pathologist that were found to contain metastases
Survival.Months integer survival duration in months between cancer diagnosis and death or last follow-up (if death didn’t occur)
Status categorical vital status as of the study cut-off date Categories: “Alive”, “Dead”


Methods

1. Data loading and wrangling

# install uncommon packages if not already installed
if(!require(survival)) install.packages("survival")
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(sjmisc)) install.packages("sjmisc")

# load libraries
library(readr)
library(survival)
library(tidyverse)
library(faraway)
library(lmtest)
library(MASS)
library(purrr)
library(gtools)
library(stringi)
library(sjmisc)
library(stringr)

First we load the raw data and convert the character variables into factors:

raw_data = read.csv("SEER Breast Cancer Dataset.csv")
raw_data = as_tibble(raw_data) %>%
  mutate_if(is.character, as.factor)

The fourth column X4 of the original dataset is just a blank column, so we remove it here:

data = raw_data[, -4]

We check if there’s any missing data. There isn’t any:

data[!complete.cases(data),]
## # A tibble: 0 × 15
## # … with 15 variables: Age <int>, Race <fct>, Marital.Status <fct>,
## #   T.Stage <fct>, N.Stage <fct>, X6th.Stage <fct>, Grade <fct>, A.Stage <fct>,
## #   Tumor.Size <int>, Estrogen.Status <fct>, Progesterone.Status <fct>,
## #   Regional.Node.Examined <int>, Reginol.Node.Positive <int>,
## #   Survival.Months <int>, Status <fct>

Note that some of the levels of the categorical variables have long names. For example, Race has 3 categories: Blace, Other (American Indian/AK Native, Asian/Pacific Islander), and White. We change the middle one to be Other to have cleaner output. Similarly, we also rename some of the levels for Marital.Status and Grade:

levels(data$Race) = c("Black", "Other", "White")
levels(data$Marital.Status) = c("Divorced", "Married", "Separated", "Single", "Widowed")
levels(data$Grade) = c("Grade II", "Grade III", "Grade IV", "Grade I")

Then we create a new composite variable - Regional.Node.Ratio - which is a ratio of positive nodes over examined nodes. We do this because we know the two inputs are related and might cause collinearity issues, but we want to retain information from both in our model. There is also a chance that this ratio could provide extra information in addition to the original variables.

data$Regional.Node.Ratio = data$Reginol.Node.Positive/data$Regional.Node.Examined


2. Brief exploration

Let’s take a look at the structure and variable names of our data and make sure everything looks good after our wrangling.

head(data)
## # A tibble: 6 × 16
##     Age Race  Marital.Status T.Stage N.Stage X6th.Stage Grade A.Stage Tumor.Size
##   <int> <fct> <fct>          <fct>   <fct>   <fct>      <fct> <fct>        <int>
## 1    43 Other Married        T2      N3      IIIC       Grad… Region…         40
## 2    47 Other Married        T2      N2      IIIA       Grad… Region…         45
## 3    67 White Married        T2      N1      IIB        Grad… Region…         25
## 4    46 White Divorced       T1      N1      IIA        Grad… Region…         19
## 5    63 White Married        T2      N2      IIIA       Grad… Region…         35
## 6    49 White Married        T2      N3      IIIC       Grad… Region…         32
## # … with 7 more variables: Estrogen.Status <fct>, Progesterone.Status <fct>,
## #   Regional.Node.Examined <int>, Reginol.Node.Positive <int>,
## #   Survival.Months <int>, Status <fct>, Regional.Node.Ratio <dbl>
(variables = names(data))
##  [1] "Age"                    "Race"                   "Marital.Status"        
##  [4] "T.Stage"                "N.Stage"                "X6th.Stage"            
##  [7] "Grade"                  "A.Stage"                "Tumor.Size"            
## [10] "Estrogen.Status"        "Progesterone.Status"    "Regional.Node.Examined"
## [13] "Reginol.Node.Positive"  "Survival.Months"        "Status"                
## [16] "Regional.Node.Ratio"

Note that for the variable 6th.Stage, R added a X in front of the variable name because it starts with a digit instead of a character.

Using the following quick_explore function, we quickly explore these variables to better understand the data we have and make sure it looks reasonable.

quick_explore = function(i){
    print(variables[i])
  # check if the variable is numeric or factor
  if ((class(data[[variables[i]]]) == "integer") || (class(data[[variables[i]]]) == "numeric")) {
    print(summary(data[[variables[i]]]))
  }
  if (class(data[[variables[i]]]) == "factor"){
    print(table(data[[variables[i]]]))
  }
}

for (i in 1:length(variables)){
  quick_explore(i)
}
## [1] "Age"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      30      47      54      54      61      69 
## [1] "Race"
## 
## Black Other White 
##   291   320  3413 
## [1] "Marital.Status"
## 
##  Divorced   Married Separated    Single   Widowed 
##       486      2643        45       615       235 
## [1] "T.Stage"
## 
##   T1   T2   T3   T4 
## 1603 1786  533  102 
## [1] "N.Stage"
## 
##   N1   N2   N3 
## 2732  820  472 
## [1] "X6th.Stage"
## 
##  IIA  IIB IIIA IIIB IIIC 
## 1305 1130 1050   67  472 
## [1] "Grade"
## 
##  Grade II Grade III  Grade IV   Grade I 
##      2351      1111        19       543 
## [1] "A.Stage"
## 
##  Distant Regional 
##       92     3932 
## [1] "Tumor.Size"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      16      25      30      38     140 
## [1] "Estrogen.Status"
## 
## Negative Positive 
##      269     3755 
## [1] "Progesterone.Status"
## 
## Negative Positive 
##      698     3326 
## [1] "Regional.Node.Examined"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       9      14      14      19      61 
## [1] "Reginol.Node.Positive"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       2       4       5      46 
## [1] "Survival.Months"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      56      73      71      90     107 
## [1] "Status"
## 
## Alive  Dead 
##  3408   616 
## [1] "Regional.Node.Ratio"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.02    0.10    0.21    0.33    0.50    1.00

Next, we build a pairs plot with all continuous variables to see what types of correlations there are that we need to consider while building our models.

par(oma = c(0, 0, 6, 0), xpd = TRUE)
pairs(data[sapply(data, is.numeric)], col = "dodgerblue")
mtext("Figure 1 - Scatter Plots Between Pairs of Continuous Variables", line = 5, outer = TRUE)

We then plot the distributions of survival months for patients that are either Alive or Dead. This clearly demonstrates the difficulty in analyzing survival data. The survival months distribution for Alive patients are highly skewed because we don’t know what the final survival months values will be for patients who are still alive. In contrast, the distribution for Dead patients is much more normal.

par(mfrow=c(1,3), oma = c(0, 0, 3, 0))
par(mai=c(0.6,0.6,0.2,0))
hist(data$"Survival.Months", breaks = 20, ylim =c(0, 350), main = "All Patients", xlab = "Survival Months")
par(mai=c(0.6,0.3,0.2,0))
hist(filter(data, Status == "Alive")$"Survival.Months", breaks = 20, 
     ylim =c(0, 350), main = "Status = Alive", xlab = "Survival Months")
par(mai=c(0.6,0.3,0.2,0))
hist(filter(data, Status == "Dead")$"Survival.Months", breaks = 20, 
     ylim =c(0, 350), main = "Status = Dead", xlab = "Survival Months")
mtext("Figure 2 - Distributions of Survival Months for Patients of Different Status", line = 1, outer = TRUE)


3. Training set and testing set

In this step, we split our data into a training set and test set to make sure that we are able to test our predictions on unseen data. We use an 80/20 split, which is a pretty standard split for an analysis such as this.

# split data into training and testing - 80%/20%
set.seed(42)
train_idx  = sample(nrow(data), size = trunc(0.80 * nrow(data)))
train_data = data[train_idx, ]
test_data = data[-train_idx, ]


4. Model building I: looking for a small and meaningful model

In section 4, we aim to build a model that is both small (few parameters), making it more easily interpretable, and thus more meaningful to cancer researchers who may be interested in understanding how specific variables play into the number of months a cancer patient will survive. This is what we consider our “theory-based” approach, which requires some clinical knowledge of the variables.

4.1 Full additive model

We start off by fitting an additive model with all the variables in the dataset.

fit_additive = lm(Survival.Months ~ ., data = train_data)
summary(fit_additive)
## 
## Call:
## lm(formula = Survival.Months ~ ., data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -74.32 -15.00   0.29  16.17  57.24 
## 
## Coefficients: (1 not defined because of singularities)
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  67.97057    4.25642   15.97   <2e-16 ***
## Age                           0.04077    0.04197    0.97    0.331    
## RaceOther                     1.45094    1.83970    0.79    0.430    
## RaceWhite                     1.18058    1.39392    0.85    0.397    
## Marital.StatusMarried         0.65803    1.12276    0.59    0.558    
## Marital.StatusSeparated      -4.75935    3.79838   -1.25    0.210    
## Marital.StatusSingle         -0.07363    1.39635   -0.05    0.958    
## Marital.StatusWidowed        -0.26905    1.81128   -0.15    0.882    
## T.StageT2                     1.97207    1.71313    1.15    0.250    
## T.StageT3                     5.93262    2.77631    2.14    0.033 *  
## T.StageT4                     4.10601    4.59336    0.89    0.371    
## N.StageN2                     1.99497    2.00340    1.00    0.319    
## N.StageN3                    -1.05953    2.70710   -0.39    0.696    
## X6th.StageIIB                -1.43855    1.84296   -0.78    0.435    
## X6th.StageIIIA               -2.61062    2.36792   -1.10    0.270    
## X6th.StageIIIB                1.71946    5.17442    0.33    0.740    
## X6th.StageIIIC                     NA         NA      NA       NA    
## GradeGrade III                0.25770    0.85500    0.30    0.763    
## GradeGrade IV                 4.65260    5.28814    0.88    0.379    
## GradeGrade I                 -1.25408    1.07676   -1.16    0.244    
## A.StageRegional               3.20023    2.65673    1.20    0.228    
## Tumor.Size                   -0.08254    0.03499   -2.36    0.018 *  
## Estrogen.StatusPositive       3.78001    1.70500    2.22    0.027 *  
## Progesterone.StatusPositive  -0.65786    1.10053   -0.60    0.550    
## Regional.Node.Examined       -0.00541    0.07013   -0.08    0.939    
## Reginol.Node.Positive         0.02350    0.17525    0.13    0.893    
## StatusDead                  -31.07003    1.07982  -28.77   <2e-16 ***
## Regional.Node.Ratio          -1.03195    2.39015   -0.43    0.666    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20 on 3192 degrees of freedom
## Multiple R-squared:  0.244,  Adjusted R-squared:  0.238 
## F-statistic: 39.7 on 26 and 3192 DF,  p-value: <2e-16

Then we examine the performance of this initial additive model. There are a number of problems with this model. 1. It’s far too complex to understand how these variables really relate to survival 2. We know based on what these variables are that there is a collinearity problem (which may be what’s preventing us from fitting a coefficient), and 3. We know that by including several collinear variables, we are very likely overfitting.

performance_evaluation(fit_additive)
##                rmse          loocv_rmse                  r2              adj_r2 
##  "20.1450939173051"  "20.3227384493345" "0.244297524155645" "0.238142052861173" 
##         sw_decision         bp_decision          num_params 
##            "Reject"            "Reject"                "28"

We also note that the equal variance and normality assumptions are not perfect. Both the SW and BP tests reject the null hypothesis. Furthermore, there is a bit of a fat tail on the right side of the Q-Q plot and the fitted-residuals plot is a bit lopsided.

par(mfrow=c(1,2), oma = c(0, 0, 3, 0))
diagnostics(fit_additive)
mtext("Figure 3 - Diagnostics of Model Assumptions", line = 1, outer = TRUE)

4.2 Reduced additive model

To fix some of the issues from the full additive model, we used a step function to perform an AIC forward search. This search gives us Status + Estrogen, suggesting that that they are important variables to include. The performance of this reduced model is ok, but we think we can do better based on our theoretical knowledge of the dataset by including interactions and some additional variables we know are related to survival.

# reduced additive model
fit_baseline = step(fit_additive, trace = 0)
summary(fit_baseline)
## 
## Call:
## lm(formula = Survival.Months ~ Estrogen.Status + Status, data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -75.07 -15.07   0.93  16.26  57.45 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                72.74       1.44   50.55   <2e-16 ***
## Estrogen.StatusPositive     3.33       1.45    2.29    0.022 *  
## StatusDead                -31.20       1.01  -30.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20 on 3216 degrees of freedom
## Multiple R-squared:  0.24,   Adjusted R-squared:  0.239 
## F-statistic:  508 on 2 and 3216 DF,  p-value: <2e-16
performance_evaluation(fit_baseline)
##                rmse          loocv_rmse                  r2              adj_r2 
##  "20.2035552267011"  "20.2250892722368" "0.239905044185093" "0.239432348317049" 
##         sw_decision         bp_decision          num_params 
##            "Reject"            "Reject"                 "3"

4.3 Adding interaction

To further improve our model, we next try adding interactions between some of our predictors. We don’t show the summary of the full 2-way interaction model here because it has too many parameters, but we try to reduce this model by an AIC forward search, starting with the reduced additive model.

We also tried a full 3-way interaction model, but decided not to include it here because, like the 2-way interaction model, it’s too large, and the 3-way interaction model very clearly overfits to our data.

fit_int_full = lm(Survival.Months ~ (.)^2, data = train_data)
fit_int_forw = step(fit_baseline, formula(fit_int_full), direction = "forward", trace = 0)
summary(fit_int_forw)
## 
## Call:
## lm(formula = Survival.Months ~ Estrogen.Status + Status + Estrogen.Status:Status, 
##     data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -74.74 -14.74   0.26  16.26  67.48 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           79.35       1.77   44.86  < 2e-16 ***
## Estrogen.StatusPositive               -3.61       1.81   -1.99    0.047 *  
## StatusDead                           -47.83       2.81  -17.04  < 2e-16 ***
## Estrogen.StatusPositive:StatusDead    19.09       3.01    6.35  2.5e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20 on 3215 degrees of freedom
## Multiple R-squared:  0.249,  Adjusted R-squared:  0.249 
## F-statistic:  356 on 3 and 3215 DF,  p-value: <2e-16

This search gives us an interaction Estrogen.Status and Status, suggesting that we may want to include that interaction in our model as well.

performance_evaluation(fit_int_full)
##                rmse          loocv_rmse                  r2              adj_r2 
##   "19.123863917568"               "Inf" "0.318974236878706" "0.249215174469228" 
##         sw_decision         bp_decision          num_params 
##            "Reject"    "Fail to Reject"               "359"
performance_evaluation(fit_int_forw)
##                rmse          loocv_rmse                  r2              adj_r2 
##  "20.0780826827026"  "20.1051858595402" "0.249316744218511" "0.248616262175791" 
##         sw_decision         bp_decision          num_params 
##            "Reject"            "Reject"                 "4"

With the full interaction model the BP test fails to reject the null, suggesting that we have avoided a heteroskedacicity problem with this model, but the reduced model still fails the BP test and rejects the null hypothesis.

4.4 Adding continuous variables

We only have 4 continuous variables and so far our searches have not included them, but for theoretical reasons we decide to manually add them back into the model.

fit_int_cont = lm(Survival.Months ~ (Status + Estrogen.Status + Age + Tumor.Size + Regional.Node.Examined + Reginol.Node.Positive)^2, data = train_data)
summary(fit_int_cont)
## 
## Call:
## lm(formula = Survival.Months ~ (Status + Estrogen.Status + Age + 
##     Tumor.Size + Regional.Node.Examined + Reginol.Node.Positive)^2, 
##     data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -74.03 -14.82   0.63  16.26  66.92 
## 
## Coefficients:
##                                                  Estimate Std. Error t value
## (Intercept)                                     71.044961  10.548831    6.73
## StatusDead                                     -35.797261   7.364677   -4.86
## Estrogen.StatusPositive                          2.651268   9.382374    0.28
## Age                                              0.163399   0.185128    0.88
## Tumor.Size                                       0.023953   0.121907    0.20
## Regional.Node.Examined                          -0.254563   0.365379   -0.70
## Reginol.Node.Positive                            0.415138   0.594430    0.70
## StatusDead:Estrogen.StatusPositive              20.257768   3.460620    5.85
## StatusDead:Age                                  -0.256614   0.112734   -2.28
## StatusDead:Tumor.Size                            0.015138   0.049996    0.30
## StatusDead:Regional.Node.Examined                0.148167   0.153698    0.96
## StatusDead:Reginol.Node.Positive                -0.227010   0.190605   -1.19
## Estrogen.StatusPositive:Age                     -0.134935   0.159302   -0.85
## Estrogen.StatusPositive:Tumor.Size               0.016114   0.066443    0.24
## Estrogen.StatusPositive:Regional.Node.Examined   0.059982   0.201225    0.30
## Estrogen.StatusPositive:Reginol.Node.Positive   -0.165721   0.271749   -0.61
## Age:Tumor.Size                                  -0.000493   0.001969   -0.25
## Age:Regional.Node.Examined                       0.004242   0.005573    0.76
## Age:Reginol.Node.Positive                       -0.002479   0.009033   -0.27
## Tumor.Size:Regional.Node.Examined               -0.000982   0.002490   -0.39
## Tumor.Size:Reginol.Node.Positive                -0.002633   0.003544   -0.74
## Regional.Node.Examined:Reginol.Node.Positive    -0.000378   0.007631   -0.05
##                                                Pr(>|t|)    
## (Intercept)                                     1.9e-11 ***
## StatusDead                                      1.2e-06 ***
## Estrogen.StatusPositive                           0.778    
## Age                                               0.378    
## Tumor.Size                                        0.844    
## Regional.Node.Examined                            0.486    
## Reginol.Node.Positive                             0.485    
## StatusDead:Estrogen.StatusPositive              5.3e-09 ***
## StatusDead:Age                                    0.023 *  
## StatusDead:Tumor.Size                             0.762    
## StatusDead:Regional.Node.Examined                 0.335    
## StatusDead:Reginol.Node.Positive                  0.234    
## Estrogen.StatusPositive:Age                       0.397    
## Estrogen.StatusPositive:Tumor.Size                0.808    
## Estrogen.StatusPositive:Regional.Node.Examined    0.766    
## Estrogen.StatusPositive:Reginol.Node.Positive     0.542    
## Age:Tumor.Size                                    0.802    
## Age:Regional.Node.Examined                        0.447    
## Age:Reginol.Node.Positive                         0.784    
## Tumor.Size:Regional.Node.Examined                 0.693    
## Tumor.Size:Reginol.Node.Positive                  0.458    
## Regional.Node.Examined:Reginol.Node.Positive      0.961    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20 on 3197 degrees of freedom
## Multiple R-squared:  0.252,  Adjusted R-squared:  0.247 
## F-statistic: 51.3 on 21 and 3197 DF,  p-value: <2e-16

Interestingly, we see that Age now interacts with Status, so we remove the other 3 continuous variables, keep Age in the model, and run it again to see whether the interaction with Age remains significant.

fit_int_age = lm(Survival.Months ~ (Status + Estrogen.Status + Age)^2, data = train_data)
summary(fit_int_age)
## 
## Call:
## lm(formula = Survival.Months ~ (Status + Estrogen.Status + Age)^2, 
##     data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -74.02 -14.82   0.58  16.24  67.23 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          68.547      8.377    8.18    4e-16 ***
## StatusDead                          -34.042      6.228   -5.47    5e-08 ***
## Estrogen.StatusPositive               3.614      8.444    0.43    0.669    
## Age                                   0.206      0.156    1.32    0.187    
## StatusDead:Estrogen.StatusPositive   19.927      3.035    6.57    6e-11 ***
## StatusDead:Age                       -0.264      0.107   -2.47    0.014 *  
## Estrogen.StatusPositive:Age          -0.140      0.157   -0.89    0.374    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20 on 3212 degrees of freedom
## Multiple R-squared:  0.251,  Adjusted R-squared:  0.249 
## F-statistic:  179 on 6 and 3212 DF,  p-value: <2e-16

We still observe the interaction between Age and Status. There is not a significant interaction between Age and Estrogen.Status so we do remove that term from our model.

fit_int_age_small = lm(Survival.Months ~ Status + Age + Estrogen.Status + Status:Age +  Status:Estrogen.Status, data = train_data)
summary(fit_int_age_small)
## 
## Call:
## lm(formula = Survival.Months ~ Status + Age + Estrogen.Status + 
##     Status:Age + Status:Estrogen.Status, data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -73.95 -14.78   0.56  16.30  66.73 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         75.5308     2.8956   26.08  < 2e-16 ***
## StatusDead                         -35.2686     6.0731   -5.81  7.0e-09 ***
## Age                                  0.0729     0.0438    1.66    0.096 .  
## Estrogen.StatusPositive             -3.7128     1.8121   -2.05    0.041 *  
## StatusDead:Age                      -0.2431     0.1045   -2.33    0.020 *  
## StatusDead:Estrogen.StatusPositive  19.9460     3.0345    6.57  5.7e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20 on 3213 degrees of freedom
## Multiple R-squared:  0.251,  Adjusted R-squared:  0.25 
## F-statistic:  215 on 5 and 3213 DF,  p-value: <2e-16
performance_evaluation(fit_int_age_small)
##                rmse          loocv_rmse                  r2              adj_r2 
##  "20.0593910688371"  "20.1002830568187"  "0.25071378501141" "0.249547762267886" 
##         sw_decision         bp_decision          num_params 
##            "Reject"            "Reject"                 "6"
par(mfrow=c(1,2), oma = c(0, 0, 3, 0))
diagnostics(fit_int_age_small)
mtext("Figure 4 - Diagnostics of Assumptions for Small and Meaningful Model", line = 1, outer = TRUE)

Now that we have included Age in our model, we see a slightly better performance compared with our previous fit_int_forw model that we discovered without any theory just using a simple forward search. We use an ANOVA to test whether the new variables we added are actually helping to describe significant additional variance.

anova(fit_int_forw, fit_int_age_small)$"Pr(>F)"[2]
## [1] 0.05

The p-value is 0.05. It’s not a super low p-value, but it does indicate that including Age in the model is meaningful, as we theorized. So we decide to keep Age in our final model.

4.5 Checking collinearity

We check variance inflation factor of the model fit_int_age_small. There are three variables with variance inflation factor over 5, which suggest there is a collinearity in the model.

vif(fit_int_age_small)[vif(fit_int_age_small) > 5]
##                         StatusDead                     StatusDead:Age 
##                                 38                                 35 
## StatusDead:Estrogen.StatusPositive 
##                                  8


5. Model building II: trying to meet the model assumptions

In this section, we attempt to “engineer” our model using some custom search algorithms. Our goal here was to improve our model or find a new one that will meet all model assumptions without using any clinical knowledge or theory and see whether that model performs better than our more simple, theoretically-founded model.

5.1 Making model fit_int_age_small pass Breusch-Pagan test

First we define some utility cost functions for a customized search procedure.

# Cost function based on AIC score.
cost_fun_aic = function(model) {
  aic = extractAIC(model)[2]
  aic
}

# Cost function based on reversed bp test p.value, because we want large p.value.
cost_fun_bp = function(model) {
  model_size = length(coef(model))
  if (model_size <= 1) {
    p_value = 1.0e-30
  } else {
    p_value = bptest(model)$p.value
  }
  1.0/p_value
}

# Cost function based on reversed sw test p.value, because we want large p.value.
cost_fun_sw = function(model) {
  model_size = length(coef(model))
  if (model_size <= 1) {
    p_value = 1.0e-30
  } else {
    p_value = shapiro.test(resid(model))$p.value
  }
  1.0/p_value
}

Based on fit_int_age_small model we found, we utilize our customized search procedure to make the model pass Breusch-Pagan test at \(\alpha\)=0.01.

predictors_int_age_small = attr(fit_int_age_small$terms , "term.labels")
fit_int_age_small_bp = step_search(
  used_predictors = predictors_int_age_small, 
  data = train_data, 
  response = "Survival.Months",
  cost_fun = cost_fun_bp, 
  max_rest_params = 9)
length(coef(fit_int_age_small_bp))
## [1] 45

We found that we needed to add at least extra 9 variables in order to pass Breusch-Pagan test at the given level. The additional variables expand to multiple coefficients and the model we now has 45 coefficients.

bptest(fit_int_age_small_bp)$p.value
##    BP 
## 0.012

5.2 Searching for a smaller model that passes Breusch-Pagan test

We search for a modified AIC Model(with k=1.2) to help find out useful continuous predictors, which will be used next in a customized model search procedure.

response = "Survival.Months"
simple_predictors = predictors(data = data, response = response)
numeric_predictors = predictors(data = data, response = response, predicate_fun = is_numeric_column)
log_predictors = predictors_fun(predictors = numeric_predictors, fun_prefix = "log(", fun_suffix = ")")
exp_predictors = predictors_fun(predictors = numeric_predictors, fun_prefix = "exp(", fun_suffix = ")")

poly_predictors = predictors_fun(predictors = numeric_predictors, fun_prefix = "poly(", 
                                 fun_suffix = ", 4, raw = TRUE)")

addictive_formula_str = addictive_formula(predictors = c(
  simple_predictors, log_predictors, exp_predictors, poly_predictors))

formula = paste(
  response, 
  predictors_fun(addictive_formula_str, fun_prefix = "(", fun_suffix = ")^2"),
  sep = " ~ ")
fit_int_modified = lm(formula = formula, data = train_data)
fit_null_modified = lm(formula = Survival.Months ~ 1, data = train_data)
fit_aic_modified = step(object = fit_null_modified, scope = formula(fit_int_modified), 
                        direction = "both", trace = FALSE, k=1.2)
summary(fit_aic_modified)
## 
## Call:
## lm(formula = Survival.Months ~ Status + Estrogen.Status + exp(Tumor.Size) + 
##     log(Regional.Node.Ratio) + Status:Estrogen.Status, data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -74.28 -14.78   0.47  16.16  67.78 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         7.86e+01   1.89e+00   41.67  < 2e-16 ***
## StatusDead                         -4.75e+01   2.82e+00  -16.82  < 2e-16 ***
## Estrogen.StatusPositive            -3.56e+00   1.81e+00   -1.96     0.05 *  
## exp(Tumor.Size)                    -3.68e-60   2.25e-60   -1.64     0.10    
## log(Regional.Node.Ratio)           -4.46e-01   3.89e-01   -1.15     0.25    
## StatusDead:Estrogen.StatusPositive  1.90e+01   3.01e+00    6.31  3.2e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20 on 3213 degrees of freedom
## Multiple R-squared:  0.25,   Adjusted R-squared:  0.249 
## F-statistic:  214 on 5 and 3213 DF,  p-value: <2e-16

We fix two continuous predictors that we found through the above modified AIC model and then utilize a customized procedure to search for a good model that has both low AIC value and follows model hierarchy rules. The search stops at 3 additional parameters, as it can’t get a lower AIC score. The model found happened to be the same as the modified AIC model above. The implementation of customized search procedure used can be found in the Appendix section. It searches through various functions and polynomials up to degree 6.

predictors_aic_numeric = c("exp(Tumor.Size)", "log(Regional.Node.Ratio)")
fit_aic_custom = step_search(
  used_predictors = predictors_aic_numeric, 
  data = train_data, 
  response = "Survival.Months",
  cost_fun = cost_fun_aic, 
  max_rest_params = 4)
summary(fit_aic_custom)
## 
## Call:
## lm(formula = formula, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -74.28 -14.78   0.47  16.16  67.78 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         7.86e+01   1.89e+00   41.67  < 2e-16 ***
## exp(Tumor.Size)                    -3.68e-60   2.25e-60   -1.64     0.10    
## log(Regional.Node.Ratio)           -4.46e-01   3.89e-01   -1.15     0.25    
## StatusDead                         -4.75e+01   2.82e+00  -16.82  < 2e-16 ***
## Estrogen.StatusPositive            -3.56e+00   1.81e+00   -1.96     0.05 *  
## StatusDead:Estrogen.StatusPositive  1.90e+01   3.01e+00    6.31  3.2e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20 on 3213 degrees of freedom
## Multiple R-squared:  0.25,   Adjusted R-squared:  0.249 
## F-statistic:  214 on 5 and 3213 DF,  p-value: <2e-16

Based on the customized AIC model found, we utilize the customized search to add some predictors so that it can pass Breusch-Pagan test at \(\alpha\)=0.01.

predictors_aic_custom = attr(fit_aic_custom$terms , "term.labels")
fit_aic_custom_bp = step_search(
  used_predictors = predictors_aic_custom, 
  data = train_data, 
  response = "Survival.Months",
  cost_fun = cost_fun_bp, 
  max_rest_params = 4)
summary(fit_aic_custom_bp)
## 
## Call:
## lm(formula = formula, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -75.22 -14.44   0.54  15.99  68.03 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              7.91e+01   2.26e+00   35.05  < 2e-16
## exp(Tumor.Size)                         -2.91e-60   3.18e-60   -0.92    0.359
## log(Regional.Node.Ratio)                 1.06e-01   5.09e-01    0.21    0.835
## StatusDead                              -4.72e+01   2.86e+00  -16.47  < 2e-16
## Estrogen.StatusPositive                 -3.55e+00   1.85e+00   -1.92    0.054
## Marital.StatusMarried                    7.00e-01   1.11e+00    0.63    0.528
## Marital.StatusSeparated                 -4.31e+00   3.81e+00   -1.13    0.259
## Marital.StatusSingle                    -1.03e-01   1.37e+00   -0.08    0.940
## Marital.StatusWidowed                   -4.47e-02   1.78e+00   -0.03    0.980
## GradeGrade III                          -2.00e+00   1.54e+00   -1.30    0.194
## GradeGrade IV                           -8.62e+00   8.68e+00   -0.99    0.321
## GradeGrade I                            -2.17e+00   2.15e+00   -1.01    0.313
## StatusDead:Estrogen.StatusPositive       1.87e+01   3.04e+00    6.15  8.4e-10
## exp(Tumor.Size):Marital.StatusMarried   -1.53e-60   4.49e-60   -0.34    0.734
## exp(Tumor.Size):Marital.StatusSeparated -2.17e-38   1.68e-38   -1.29    0.196
## exp(Tumor.Size):Marital.StatusSingle    -6.97e-56   7.01e-56   -0.99    0.321
## exp(Tumor.Size):Marital.StatusWidowed    2.14e-51   1.55e-51    1.39    0.166
## log(Regional.Node.Ratio):GradeGrade III -1.43e+00   8.72e-01   -1.64    0.101
## log(Regional.Node.Ratio):GradeGrade IV  -1.28e+01   6.23e+00   -2.05    0.040
## log(Regional.Node.Ratio):GradeGrade I   -7.32e-01   1.17e+00   -0.63    0.530
##                                            
## (Intercept)                             ***
## exp(Tumor.Size)                            
## log(Regional.Node.Ratio)                   
## StatusDead                              ***
## Estrogen.StatusPositive                 .  
## Marital.StatusMarried                      
## Marital.StatusSeparated                    
## Marital.StatusSingle                       
## Marital.StatusWidowed                      
## GradeGrade III                             
## GradeGrade IV                              
## GradeGrade I                               
## StatusDead:Estrogen.StatusPositive      ***
## exp(Tumor.Size):Marital.StatusMarried      
## exp(Tumor.Size):Marital.StatusSeparated    
## exp(Tumor.Size):Marital.StatusSingle       
## exp(Tumor.Size):Marital.StatusWidowed      
## log(Regional.Node.Ratio):GradeGrade III    
## log(Regional.Node.Ratio):GradeGrade IV  *  
## log(Regional.Node.Ratio):GradeGrade I      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20 on 3199 degrees of freedom
## Multiple R-squared:  0.254,  Adjusted R-squared:  0.25 
## F-statistic: 57.4 on 19 and 3199 DF,  p-value: <2e-16

We need 4 additional variables, which expand to multiple coefficients and the model that passes Breusch-Pagan test contains 20 coefficients.

bptest(fit_aic_custom_bp)$p.value
##   BP 
## 0.02

5.3 Making the model also pass Shapiro-Wilk normality test

We also try to make customized AIC model to pass Shapiro-Wilk normality test, but we fail to accomplish it with the data set.

predictors_aic_custom_bp = attr(fit_aic_custom_bp$terms , "term.labels")
fit_aic_custom_sw = step_search(
  used_predictors = predictors_aic_custom_bp, 
  data = train_data, 
  response = "Survival.Months",
  cost_fun = cost_fun_sw, 
  max_rest_params = 5)
shapiro.test(resid(fit_aic_custom_sw))$p.value
## [1] 9.3e-18

We try response transformation using Box-Cox method to see if it helps with normality, but it is still skewed.

par(mfrow=c(1,1), oma = c(0, 0, 1, 0))
refit_aic_custom_sw = lm(formula = deparse(formula(fit_aic_custom_sw)), data = train_data)
boxcox(refit_aic_custom_sw, plotit = TRUE, lambda = seq(1, 1.5, by = 0.1))
mtext("Figure 5 - Box-Cox method", line = 0, outer = TRUE)

par(mfrow=c(1,1), oma = c(0, 0, 1, 0))
hist(((train_data$Survival.Months^1.2-1)/1.2),
     xlab = "(Survival.Months^1.2-1)/1.2",
     main = "")
mtext("Figure 6 - Histogram of Transformed Survival Months", line = 0, outer = TRUE)

formula_box_cox = stri_replace_first_fixed(deparse(formula(refit_aic_custom_sw)), 
                                           "Survival.Months", "((Survival.Months^1.2-1)/1.2)")
fit_box_cox = lm(formula = formula_box_cox, data = train_data)
shapiro.test(resid(fit_box_cox))$p.value
## [1] 1.2e-16

With Box-Cox method, there is some improvement to the p-value of Shapiro-Wilk normality test. However, it is still far from passing the test.

5.4 Checking collinearity

We check variance inflation factor of the customized AIC model that passed Breusch-Pagan test at \(\alpha\)=0.01. There are two variables with variance inflation factor over 5, which suggest there is a collinearity in the model.

length(coef(fit_aic_custom_bp))
## [1] 20
vif(fit_aic_custom_bp)[vif(fit_aic_custom_bp) > 5]
##                         StatusDead StatusDead:Estrogen.StatusPositive 
##                                8.4                                8.0

6. Unusual observations

Next we create some functions to assess:

  1. High leverage (leverage > 2 * mean leverage)

  2. Outliers (based on standardized residuals)

  3. Influential points (Cook’s D > 4 / n)

We will use these functions to identify unusual observations in our final regression models.

# === Leverage === #
# Returns a logical (True/False) array indicating high leverage = TRUE
# High leverage is any observation with leverage larger than 2 * mean
get_high_lev_obs = function(mod) {
  hatvalues(mod) > 2 * mean(hatvalues(mod))
}

# === Outliers === #
# Returns a logical (True/False) array indicating outlier = TRUE
# Outliers are any observations having standardized residuals with magnitude > than 2
get_outlier_obs = function(mod) {
  abs(rstandard(mod)) > 2
}

# === Influential Point (Cook's D) === #
# Returns a logical (True/False) array indicating influential point = TRUE
# Influential points are any observations with Cook's D > 4 / n
get_influential_obs = function(mod) {
  cooks.distance(mod) > 4 / length(cooks.distance(mod))
}
sum(get_high_lev_obs(fit_int_age_small))
## [1] 421
sum(get_high_lev_obs(fit_aic_custom_bp))
## [1] 152

From 3219 total observations, our reduced theoretical model with age has 421 high leverage observations and our “engineered” model with a good bp score has 152 high leverage observations.

sum(get_outlier_obs(fit_int_age_small))
## [1] 101
sum(get_outlier_obs(fit_aic_custom_bp), na.rm=T)
## [1] 108

From 3219 total observations, our reduced theoretical model with age has 101 outliers and our “engineered” model with a good bp score has NA outliers.

sum(get_influential_obs(fit_int_age_small))
## [1] 215
sum(get_influential_obs(fit_aic_custom_bp), na.rm=T)
## [1] 88

Finally, from 3219 total observations, our reduced theoretical model with age has 215 influential observations and our “engineered” model with a good bp score has NA influential observations.

Based on the comparatively small number of high leverage, outlier, and influential observations in our second model, we begin to suspect that that model may be overfitted. We will examine this more closely later on with LOOCV and by using our test dataset.

Let’s remove the influential points from our first model and refit it.

train_data_clean = train_data[!get_influential_obs(fit_int_age_small),]
fit_int_age_small_clean = lm(
  formula = Survival.Months ~ Status + Age + Estrogen.Status + Status:Age + Status:Estrogen.Status, 
  data = train_data_clean)
summary(fit_int_age_small_clean)
## 
## Call:
## lm(formula = Survival.Months ~ Status + Age + Estrogen.Status + 
##     Status:Age + Status:Estrogen.Status, data = train_data_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -73.14 -14.09   0.83  14.86  33.57 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         78.75152    2.96062   26.60  < 2e-16 ***
## StatusDead                         -42.97676    6.92621   -6.20  6.2e-10 ***
## Age                                  0.00588    0.04035    0.15     0.88    
## Estrogen.StatusPositive             -2.93843    2.07744   -1.41     0.16    
## StatusDead:Age                      -0.15039    0.11697   -1.29     0.20    
## StatusDead:Estrogen.StatusPositive  21.25200    3.50039    6.07  1.4e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18 on 2998 degrees of freedom
## Multiple R-squared:  0.253,  Adjusted R-squared:  0.252 
## F-statistic:  203 on 5 and 2998 DF,  p-value: <2e-16

7. More exploration

Survival data is specific because of the existence of censored data. For example here in our cancer patient survival dataset, the survival time is the time period between patients’ diagnosis and an endpoint. Ideally, this endpoint should be patients’ death. However, in real-world situation, this event (death due to cancer) may not occur. For example, patients might still be alive at the end of the study, or some patients might die because of other reasons unrelated with cancer itself (such as a car accident). This issue is called censoring (right censoring in our case). The censoring problem makes analysis of survival data complex. Specific statistical tools have been developed to analyze survival data, such as Kaplan-Meier survival probability curve and Cox Proportional Hazards Regression. While these methods are out of the scope of this course, we determine to include very brief analyses using these methods so we can have a comparison with multiple linear regression.

7.1 Kaplan-Meier Survival probability

Kaplan-Meier curves plot the survival probability against survival time. We show the overall survival curve of the whole dataset here:

# Survival Analysis: Kaplan-Meier Survival probability
par(oma = c(0, 0, 1, 0), xpd = TRUE)
kp_fit = survfit(Surv(Survival.Months, as.numeric(Status)) ~ 1, data=data)
plot(kp_fit, conf.int = F, mark.time = T, xlab = "Survival Months", ylab = "Survival Probability") 
mtext("Figure 7 - Kaplan-Meier Survival Probability Curve of All Patients", line = 0, outer = TRUE)

Note that all of the censored data are indicated by a vertical line.

Importantly, we can observe the survival curves of different subgroups of patients, statified by clinical factors of interest. For example, we can see that patients with Positive or Negative Estrogen Receptor status, or of different Grade, have distinct survival curves:

# A help function to plot Kaplan-Meier and log-rank P
kp_analysis = function(variable){
  size = length(levels(variable))
  kp_fit = survfit(Surv(Survival.Months, as.numeric(Status)) ~ variable, data=data)
  plot(kp_fit, conf.int = F, mark.time = T,col=2:(size+1), 
       xlab = "Survival Months", ylab = "Survival Probability",
       main = paste("stratification by", unlist(str_split(deparse(substitute(variable)), "\\$"))[2])) 
  legend("bottom", legend=levels(variable), lty = 1, col=2:(size+1), bty = 'n')
  surv_diff = survdiff(Surv(Survival.Months, as.numeric(Status)) ~ variable, data=data)
  p_value = 1 - pchisq(surv_diff$chisq, length(surv_diff$n) - 1)
  legend("center", legend = paste('log-rank P = ', round(p_value,4)), bty = 'n')
}
par(mfcol=c(1,2), oma = c(0, 0, 1, 0))
kp_analysis(data$Estrogen.Status)
kp_analysis(data$Grade)
mtext("Figure 8 - Effects of Clinical Factors on Patient's Survival Probability", line = 0, outer = TRUE)

7.2 Cox Proportional Hazards Regression

While Kaplan-Meier curves are great tools for visualizing the effects of single categorical variable, Cox Proportional Hazards Regression can be used to analyze the effects of single or multiple variables, either numerical or categorical:

# Survival Analysis: Cox Proportional Hazards Regression
fit_cox_ful = coxph(Surv(Survival.Months, as.numeric(Status))~., data=data)
summary(fit_cox_ful)
## Call:
## coxph(formula = Surv(Survival.Months, as.numeric(Status)) ~ ., 
##     data = data)
## 
##   n= 4024, number of events= 616 
## 
##                                 coef exp(coef) se(coef)     z Pr(>|z|)    
## Age                          0.02014   1.02035  0.00487  4.13  3.6e-05 ***
## RaceOther                   -0.74335   0.47552  0.21414 -3.47  0.00052 ***
## RaceWhite                   -0.37138   0.68978  0.12983 -2.86  0.00423 ** 
## Marital.StatusMarried       -0.20720   0.81286  0.11927 -1.74  0.08233 .  
## Marital.StatusSeparated      0.37852   1.46012  0.28979  1.31  0.19149    
## Marital.StatusSingle        -0.03685   0.96383  0.14615 -0.25  0.80097    
## Marital.StatusWidowed       -0.04724   0.95386  0.18190 -0.26  0.79508    
## T.StageT2                    0.19497   1.21527  0.16058  1.21  0.22469    
## T.StageT3                    0.35710   1.42918  0.25280  1.41  0.15779    
## T.StageT4                    0.63032   1.87821  0.31292  2.01  0.04398 *  
## N.StageN2                    0.50399   1.65531  0.20536  2.45  0.01412 *  
## N.StageN3                    0.62776   1.87340  0.24747  2.54  0.01119 *  
## X6th.StageIIB                0.27427   1.31556  0.20177  1.36  0.17405    
## X6th.StageIIIA               0.01466   1.01477  0.25501  0.06  0.95414    
## X6th.StageIIIB               0.17574   1.19213  0.39558  0.44  0.65686    
## X6th.StageIIIC                    NA        NA  0.00000    NA       NA    
## GradeGrade III               0.35226   1.42227  0.08935  3.94  8.1e-05 ***
## GradeGrade IV                1.10984   3.03388  0.35013  3.17  0.00153 ** 
## GradeGrade I                -0.44103   0.64337  0.17123 -2.58  0.01001 *  
## A.StageRegional             -0.16862   0.84483  0.19413 -0.87  0.38508    
## Tumor.Size                   0.00139   1.00139  0.00315  0.44  0.65984    
## Estrogen.StatusPositive     -0.64538   0.52446  0.13568 -4.76  2.0e-06 ***
## Progesterone.StatusPositive -0.49232   0.61121  0.10706 -4.60  4.3e-06 ***
## Regional.Node.Examined      -0.01729   0.98286  0.00982 -1.76  0.07838 .  
## Reginol.Node.Positive        0.03610   1.03676  0.01567  2.30  0.02122 *  
## Regional.Node.Ratio          0.57526   1.77759  0.27617  2.08  0.03725 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                             exp(coef) exp(-coef) lower .95 upper .95
## Age                             1.020      0.980     1.011     1.030
## RaceOther                       0.476      2.103     0.313     0.724
## RaceWhite                       0.690      1.450     0.535     0.890
## Marital.StatusMarried           0.813      1.230     0.643     1.027
## Marital.StatusSeparated         1.460      0.685     0.827     2.577
## Marital.StatusSingle            0.964      1.038     0.724     1.284
## Marital.StatusWidowed           0.954      1.048     0.668     1.362
## T.StageT2                       1.215      0.823     0.887     1.665
## T.StageT3                       1.429      0.700     0.871     2.346
## T.StageT4                       1.878      0.532     1.017     3.468
## N.StageN2                       1.655      0.604     1.107     2.476
## N.StageN3                       1.873      0.534     1.153     3.043
## X6th.StageIIB                   1.316      0.760     0.886     1.954
## X6th.StageIIIA                  1.015      0.985     0.616     1.673
## X6th.StageIIIB                  1.192      0.839     0.549     2.588
## X6th.StageIIIC                     NA         NA        NA        NA
## GradeGrade III                  1.422      0.703     1.194     1.694
## GradeGrade IV                   3.034      0.330     1.527     6.026
## GradeGrade I                    0.643      1.554     0.460     0.900
## A.StageRegional                 0.845      1.184     0.577     1.236
## Tumor.Size                      1.001      0.999     0.995     1.008
## Estrogen.StatusPositive         0.524      1.907     0.402     0.684
## Progesterone.StatusPositive     0.611      1.636     0.496     0.754
## Regional.Node.Examined          0.983      1.017     0.964     1.002
## Reginol.Node.Positive           1.037      0.965     1.005     1.069
## Regional.Node.Ratio             1.778      0.563     1.035     3.054
## 
## Concordance= 0.745  (se = 0.011 )
## Likelihood ratio test= 506  on 25 df,   p=<2e-16
## Wald test            = 575  on 25 df,   p=<2e-16
## Score (logrank) test = 671  on 25 df,   p=<2e-16

Results

1. Performance evaluation of candidate models

Table 2 - The Performance of Candidate Models
Models Train RMSE LOOCV RMSE Adjusted R^2 Number of Coefficients Pass Breusch-Pagan test?
Baseline 20.2 20.23 0.24 3 No
Full Additive 20.15 20.32 0.24 28 No
Small and Meaningful 20.06 20.1 0.25 6 No
Custom Search 20.07 20.1 0.25 6 No
Custom Search (Passed BP Test) 20.01 Inf 0.25 20 Yes

Based on the table, we decide to choose ‘Small and Meaningful’ fit_int_age_small as our final model. Through the whole process, we know that we cannot find a good linear model for prediction, as the continuous variables have little correlation with the response. Therefore, we would like to choose a model that is small and easier for interpretation. We choose the ‘Small and Meaningful’ model over ‘Custom Search’ for easier interpretation. Although we would like to choose a model that conforms to the model assumptions, the ‘Custom Search(Passed BP Test)’ has a leave-one-out cross-validated RMSE of ‘Inf’.

2. Performance of final model on the test set

Table 3 - The Performance of Final Model
Models Test RMSE Formula
Baseline (for comparison) 19.85 Survival.Months ~ Estrogen.Status + Status
Small and Meaningful (final model) 19.59 Survival.Months ~ Status + Age + Estrogen.Status + Status:Age + , Status:Estrogen.Status
Small and Meaningful (with influential points removed) 19.67 Survival.Months ~ Status + Age + Estrogen.Status + Status:Age + , Status:Estrogen.Status

We can see that the test RMSE of the final model ‘Small and Meaningful’ only perform slightly better than the ‘Baseline’ model, which only contains two factor predictors. This is because of the restrictions of the given data set and linear models.

3. Distribution of residuals

We check visually the model assumptions, comparing our final small and meaningful model fit_int_age_small with the model ‘Custom Search(Passed BP Test)’ fit_aic_custom_bp that passed Breusch-Pagan test. We display Figure 4 here again for easier visual comparison.

par(mfrow=c(1,2), oma = c(0, 0, 1, 0))
diagnostics(fit_int_age_small)
mtext("Figure 4 - Diagnostics of Assumptions for Small and Meaningful Model", line = 0, outer = TRUE)

par(mfrow=c(1,2), oma = c(0, 0, 1, 0))
diagnostics(fit_aic_custom_bp)
mtext("Figure 9 - Diagnostics of Assumptions for Custom Search(Passed BP Test)", line = 0, outer = TRUE)

From the “fitted vs residuals” plots, we don’t observe dramatic difference between the two models.


Discussion

1. Interpretation of the final model

In the fit_int_age_small model, we have 6 parameters: Intercept, Status, Age, EstrogenStatus, the interaction between Status and Age, and the interaction between Status and EstrogenStatus. This is not the only one model with similar performance in the cross-validation within the training set, or the validation on the test set. However, we found this model interesting because it is simple enough, and include a continuous variable Age in it. In our dataset, there are many categorical variables but much less continuous variables. In our initial attempts, the model selected by standard model searching process excluded all continuous variables. We manually added continuous variables back to the model and then removed the ones that still didn’t seem to contribute much to the response. Interestingly, Age seems to be predictive in the final model so we determine to keep it.

Status is the most significant predictor of our response Survival.Months, which is consistent to the initial observation that patients with different values of Status do have distinct distribution of Survival.Months (Figure 2). The coefficient of StatusDead is a negative value, meaning that the model will predict a less Survival.Months for patients with dead Status compared with that with alive Status, given other parameters fixed (Note that Status also has some interactions with other predictors we’ll discuss soon).

Age alone doesn’t have a significant effect on the response (when Status is alive). The interaction between Status and Age has a negative coefficient (and is still negative after adding the baseline coefficient of Age alone), suggesting that when the Status is dead, Age has a negative slope in predicting the Survival.Months. The mean Survival.Months decreases around 0.2 months for an one-year increase of Age.

Positive EstrogenStatus has a negative coefficient by itself, indicating that positive Estrogen Receptor status will decrease Survival.Months when other parameters are fixed (when Status is alive). Interestingly, the interaction between Status and EstrogenStatus has a very significant positive effect on the response. That is, only when Status is dead and EstrogenStatus is Positive, given other parameters fixed, the model will add around 20 months to the Survival.Months. Thus, for patients with dead Status, positive Estrogen Receptor status will significantly increase Survival.Months. This is a very interesting observation that the effect of EstrogenStatus on survival is opposite for patients with alive or dead Status.

In addition, from our understanding of this dataset, as well as what we observed during the model building process, many of our variables in this dataset is internally related. For example, the X6th.Stage (the stage of cancer according to the 6th version of AJCC staging system guideline; R added a X in front of the variable name because it starts with a digit instead of a character), actually considers other clinical information such as A.Stage, T.Stage, N.Stage, Grade, EstrogenStatus, and ProgesteroneStatus. Thus, the lack of X6th.Stage in the model doesn’t necessarily mean the stage of cancer is not predictive of survival.


2. Explanation versus prediction

In this project, we explored the application of multiple linear regression models in cancer patients’ survival data. We found that the model can go really large (for example full 2-way or 3-way interaction) to have better R squared. In this case, the prediction ability may increase (although in this case it doesn’t seem to increase too much). One thing to note is, in our particular situation, prediction is not the major purpose. For example, it doesn’t even make any sense to make a prediction of Survival.Months for a patient with the dead Status. If a patient already died, then the survival time will be already known, what’s the point of predicting it? Prediction is useful if the prediction doesn’t rely on the Status. However, in our dataset, Status is the most significant contributor to the response.

Thus, the major goal in this case will be the explanation. That is, although the prediction itself is not useful, but we can learn from the model what variables are - and how they are - related to patients’ survival. For the purpose of explanation, the model can’t be too complex, otherwise, it will just be too hard to understand the relationship between the predictors and the outcome. Also, ideally, we would like to have a model that the residuals fulfill the equal variation and normality assumptions. Although they might fail the statistical test, checking the plots is another way to estimate how well the assumptions are met (or how bad they are violated). According to Figure 4, our model is acceptable in terms of these assumptions, suggesting that the interperetation of the model is meaningful.


3. Predictability of the response

It is worth noting that our performance of the model is not ideal although we have tried our best by taking different approaches (some of them are not even mentioned in this report). We think this just reflects the fact that the information of the predictors are just not enough to make good prediction on the response, the patients’ survival here. There are at least too aspects.

One is that patients’ survival is just so hard to predict, especally just based on the very limited information in this dataset. For example, it is well known that there are subtypes of cancer, determined by molecular biomarkers of expression or variation of cancer related genes. Although the expression of Estrogen Receptor and Progesterone Receptor are known to be related to breast cancer subtypes, there are many many more. Cancer patients have huge heterogeneity in terms of the genetic and non-genetic drivers of cancer, as well as response to different treatment. We tried to remove influential points to improve predictability, but it did not seem to improve the performance. The predictability of the survival time is just so limited, which is further limited by the information from this particular dataset.

The other aspect is the nature of the task. Survival data is complicated due to the censoring issue. Survival data have two component: the duration of time, and the status of the event (death here). When the event doesn’t occur yet (for example the patient are still alive at the end of the study), or under situation that we will never know when it would occur (for example when a patient died of a car accident), the data is called censored data. Thus, the Survival.Months is not the true survival time for each patient, because we have patients who are still in the alive status (actually many of them, as shown in Figure 2). We don’t know their real survival time, but we only know the real survival time will be equal to or greater than the survival time we have. For the purpose of this project using linear regression, we treat the duration time as response, and treat the status as one of the predictors. This is not the best way to perform the analysis.


4. Survival analysis in practice

Because of the reason mention in the last section, in practice, there are a set of specific statistic tools designed for survival analysis. This includes Kaplan-Meier survival probability curves (mainly for categorical variables), log-rank statistical tests, and the Cox Proportional Hazard Regression model (works for both categorical and numerical variables, and both uni-variable and multi-variable modeling). We performed simple analysis using these tools in the “7. More exploration” section of “Methods”. For example, in Figure 8 (left), there is very clear difference for patients with positive or negative Estrogen Receptor status. Patients with positive Estrogen Receptor have a significantly better survival probability curve (shown in green). We can also see that patients with Negative Estrogen Receptor have less proportion of censored data (indicated by vertical lines on the curve). These observations are actually consistent with our interpretation form our multiple linear regression model. On the other hand, we also see the relationship between Grade and Patients’ survival from Figure 8 (right), which is not captured by our model.

In summary, in this project, we applied linear regression techniques in the cancer patient survival problem. Our resulted model is simple enough to interpretate, and provides some interesting insights and explanation to survival related variables. However, we also realize the difficulties of suvival analysis and acknowledge the limitations of linear regression on this complex problem.


Appendix

1. Utility functions used in the analysis

# utility functions used earlier:

diagnostics = function(model, pcol="grey", lcol="dodgerblue", alpha=0.05, plotit=TRUE, testit=TRUE){
  if (plotit){
    par(mfrow = c(1, 2))
    plot(fitted(model), resid(model), col = pcol, pch = 20,
      xlab = "Fitted", ylab = "Residuals", main = paste("Data from", deparse(substitute(model))))
    abline(h = 0, col = lcol, lwd = 2)
    qqnorm(resid(model), 
           main = paste("Normal Q-Q Plot for", deparse(substitute(model))), col = pcol)
    qqline(resid(model), col = lcol, lwd = 2)
  }
  
  if (testit){
    p_val = shapiro.test(resid(model))$p.value
    data.frame(p_val = p_val,
               decision = ifelse(p_val < alpha, "Reject", "Fail to Reject"))
  }
}

get_bp_decision = function(model, alpha) {
  decide = unname(bptest(model)$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

get_sw_decision = function(model, alpha) {
  decide = unname(shapiro.test(resid(model))$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

get_num_params = function(model) {
  length(coef(model))
}

get_rmse = function(model) {
  sqrt(mean((resid(model) ^ 2)))
}

get_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

get_r2 = function(model) {
  summary(model)$r.squared
}

get_adj_r2 = function(model) {
  summary(model)$adj.r.squared
}

performance_evaluation = function(model){
  c(rmse = get_rmse(model),
  loocv_rmse = get_loocv_rmse(model),
  r2 = get_r2(model),
  adj_r2 = get_adj_r2(model),
  sw_decision = get_sw_decision(model, alpha = 0.01),
  bp_decision = get_bp_decision(model, alpha = 0.01),
  num_params = get_num_params(model))  
}

get_test_rmse = function(model, testset) {
  y = testset$Survival.Months
  y_hat = predict(model, newdata = testset)
  n = nrow(testset)
  sqrt(mean((y - y_hat)^2))
}

We define some utility functions for generating various predictor functions and predictor interactions.

lm2 = function(response, predictors, data) {
  if (length(predictors) == 0) {
    formula = paste(response, "1", sep = " ~ ")
    lm(formula = formula, data = data)
  } else {
    formula = paste(response, 
          paste(predictors, collapse = " + "), 
          sep = " ~ ")
    lm(formula = formula, data = data)
  }
}

predictors = function(data, response, predicate_fun = function(column, name) {TRUE}) {
  names = colnames(data)
  idx = which(colnames(data) %in% response)
  result = map2(.x = data[,-idx], .y = names[-idx], .f = predicate_fun)
  names[-idx][unlist(result)]
}

is_numeric_column = function(column, name) {
  is.numeric(column)
}

is_factor_column = function(column, name) {
  is.factor(column)
}

predictors_fun = function(predictors, fun_prefix, fun_suffix) {
  sapply(predictors, function(predictor) {
    paste(c(fun_prefix, predictor, fun_suffix), collapse = "", sep="")
  }, simplify = TRUE)
}

predictors_interaction = function(predictors, ways = 2) {
  if(length(predictors) >= ways) {
    comb_matrix = combinations(n=length(predictors), r=ways, v=predictors, repeats.allowed=FALSE)
    apply(X = comb_matrix, MARGIN = 1, function(row) { paste(row, collapse = ":") })
  } else {
    c()
  }
}

We define a customized step search procedure to help find top predictors based on cost function given cost_fun. The procedure is used in the project for searching top predictors based on AIC cost, a customized Breusch-Pagan test cost and a customized Shapiro-Wilk normality test cost.

forward_search = function(response, used_predictors, remaining_predictors, data, cost_fun) {
  model = lm2(response = response, predictors = used_predictors, data = data)
  best_cost = cost_fun(model)
  best_predictor = NULL
  
  for (new_predictor in remaining_predictors) {
    predictors = c(used_predictors, new_predictor)
    new_model = lm2(response = response, predictors = predictors, data = data)
    cost = cost_fun(new_model)

    if(cost < best_cost) {
      best_cost = cost
      best_predictor = new_predictor
    }
  }

  if (is.null(best_predictor)) {
    used_predictors
  } else {
    c(used_predictors, best_predictor)
  }
}

generate_remaining_predictors = function(used_predictors, response, data) {
  simple_predictors = predictors(data = data, response = response)

  numeric_predictors = predictors(data = data, response = response, predicate_fun = is_numeric_column)
  log_predictors = predictors_fun(predictors = numeric_predictors, fun_prefix = "log(", fun_suffix = ")")
  exp_predictors = predictors_fun(predictors = numeric_predictors, fun_prefix = "exp(", fun_suffix = ")")

  poly_dict = c("^2)" = "^3)", "^3)" = "^4)", "^4" = "^5", "^5" = "^6")
  next_poly_predictor = function(predictor) {
    if(predictor %in% numeric_predictors) {
      predictors_fun(predictors = c(predictor), fun_prefix = "I(", fun_suffix = "^2)")
    } else {
      poly_str = str_sub(predictor, - 3, - 1)
      if(poly_str %in% names(poly_dict)) {
        replacement = poly_dict[[poly_str]]
        predictor_poly = stri_replace_last_fixed(predictor, poly_str, replacement)
        c(predictor_poly)
      } else {
        c()
      }
    }
  }

  poly_predictors = unlist(sapply(used_predictors, next_poly_predictor))

  is_single_predictor = function(predictor) {
    !str_contains(predictor, ":")
  }
  used_single_predictors = Filter(is_single_predictor, x = used_predictors)
  int2_predictors = predictors_interaction(predictors = used_single_predictors, ways = 2)

  c(simple_predictors, log_predictors, exp_predictors, poly_predictors, int2_predictors)
}

step_search = function(used_predictors=c(), data, response, 
                       cost_fun, max_rest_params) {
  remaining_predictors = generate_remaining_predictors(
        used_predictors = used_predictors, response = response, data = data)
  max_num_params = max_rest_params + length(used_predictors)
  
  while (length(used_predictors) < max_num_params) {
    result_predictors = forward_search(
      response = response,
      used_predictors = used_predictors, 
      remaining_predictors = remaining_predictors, 
      data = data,
      cost_fun = cost_fun)
    
    if (identical(used_predictors, result_predictors)) {
      break
    } else if(length(result_predictors) <= max_num_params) {
      used_predictors = result_predictors
      
      remaining_predictors = generate_remaining_predictors(
        used_predictors = used_predictors, response = response, data = data)
    } else {
      break
    }
  }
  
  lm2(response = response, predictors = used_predictors, data = data)
}

2. Authors

Delong Meng, Ted Imhoff-Smith, Cheng Wei