project overview

This dataset aggregates clinical breast cancer data from 105 patients into 105 observations across 17 variables. Our research project was guided by 2 classification tasks, each paired with a decision tree package:

  1. using CART, could we build a binary classifier that identifies a patient's PR (progesterone receptor, a biomarker relevant to breast cancer diagnosis) status as either positive or negative; and,
  2. using C5.0, could we build a multi-class model for classifying tumors?

In building both models, we intend to conduct comprehensive end-to-end machine learning analyses, and hope to return recommendations on model implementation and data utility.


model 1: PR status

Our first model seeks to build a binary classifier for the progesterone receptor biomarker using the CART package. Because a positive biomarker status is indicative of breast cancer, our model should work to minimize the false negative rate and maximize the true positive rate (on balance, it is better to misclassify a patient without breast cancer than to miss one who has it).


data prep

To prep the data for a decision tree, we factored categorical data while preserving numerical variables. The target variable, PR.Status, was verified and re-coded with 0 for negative and 1 for positive. After the data was cleaned and factored, we created an 80-20 train-test split.

#1 load the data and ensure the column names don't have spaces, hint check.names.  
cancer_df<-tibble(import("clinical_breast_cleaned.csv", check.names= TRUE))
cancer_df<-cancer_df%>%select(-c("ER.Status"))

#2 ensure all the variables are classified correctly and ensure the target variable for "PR.Status" is 0 for negative and 1 for positive
cancer_df<-cancer_df%>%
  mutate(PR.Status = ifelse(PR.Status == "1", 1, 0))


#4 creating test and train splits using caret

# factor non-numerical and categorical data
cancer_df<-cancer_df%>%
  mutate_if(is.character, as.factor)%>%
  mutate(OS.event = as.factor(OS.event))

# create test train split
set.seed(1981)
split<-createDataPartition(cancer_df$PR.Status,times=1,p = 0.8,list=FALSE)
train_df<-cancer_df[split,]
test_df<-cancer_df[-split,]

exploratory analysis

After cleaning and formatting the data for the model, we performed some initial analysis to isolate the base rate of the target variable:

#6 evaluate the base rate 
br<-table(cancer_df$PR.Status)[2]/sum(table(cancer_df$PR.Status))
paste0("target base rate = ", br)
## [1] "target base rate = 0.514285714285714"

This base rate sets a minimum benchmark: with no observational knowledge, always guessing the positive class would be correct 51.43% of the time. Our model should use this figure as its point of comparison and, ideally, contribute additional power to the classification task.
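
As a quick sanity check, the same benchmark can be expressed as a majority-class predictor (a minimal sketch; the variable names follow the code above):

# a minimal no-information benchmark: always predict the majority class
counts<-table(cancer_df$PR.Status)
majority<-names(which.max(counts))
paste0("always predicting class ", majority,
       " yields accuracy = ", round(max(counts)/sum(counts), 4))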


model creation

With our base rate in mind, we can create a binary decision tree with default parameters.

#7 Build your model using the default settings
set.seed(1981)
cancer_tree_gini <- rpart(
  PR.Status ~ ., # model formula
  method = "class", # tree method
  parms = list(split = "gini"), # split method
  data = train_df, # data used
  control = rpart.control(cp = .01)
  )

model analysis

CART provides an analysis of the most important variables for the model:

#8 View the results, what is the most important variable for the tree? 
cancer_tree_gini$variable.importance
## Age.at.Initial.Pathologic.Diagnosis        Days.to.Date.of.Last.Contact 
##                           6.1304575                           4.5642540 
##                             OS.Time                          AJCC.Stage 
##                           4.3078238                           3.2731982 
##                   HER2.Final.Status                     Converted.Stage 
##                           2.4576609                           2.2268446 
##                               Tumor                          Node.Coded 
##                           1.5744506                           1.0187187 
##                              Gender 
##                           0.2264878

The variables contributing most to error reduction are Age at Initial Pathologic Diagnosis, Days to Date of Last Contact, OS Time, and AJCC Stage, with the remaining variables contributing substantially less.

The tree can be graphically visualized along its split decisions:

#9 Plot the tree using the rpart.plot package (CART only).
rpart.plot(cancer_tree_gini, type =4, extra = 101)#package rpart.plot

The graph illustrates that the model is composed of the decision splits of 4 variables: Days to Date of Last Contact, Age at Initial Pathologic Diagnosis, AJCC Stage, and HER2 Final Status.
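
As a cross-check of the plot (a small sketch using the fitted rpart object's frame), the variables actually used in split decisions can be listed directly:

# list the variables that appear in split decisions (leaves are excluded)
used_vars<-unique(as.character(cancer_tree_gini$frame$var))
setdiff(used_vars, "<leaf>")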

The model’s complexity parameters can be further visualized to isolate the optimal tree size:

#10 plot the cp chart and note the optimal size of the tree (CART only).
plotcp(cancer_tree_gini)

A common rule of thumb for pruning is to take the leftmost complexity parameter whose cross-validated relative error falls below the horizontal dashed line (the minimum cross-validated error plus one standard error). That the error holds roughly constant as model complexity increases suggests the model's relative lack of power: additional decision splits neither add to nor subtract from its performance.
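
One possible follow-up, sketched below rather than applied in our analysis, is to prune at the complexity parameter suggested by the one-standard-error rule and re-plot the smaller tree:

# prune at the cp chosen by the one-standard-error rule (a sketch; results
# depend on the cross-validation folds behind the cp table)
cp_tab<-cancer_tree_gini$cptable
err_cap<-min(cp_tab[, "xerror"]) + cp_tab[which.min(cp_tab[, "xerror"]), "xstd"]
best_cp<-cp_tab[which(cp_tab[, "xerror"] <= err_cap)[1], "CP"]
pruned_tree<-prune(cancer_tree_gini, cp = best_cp)
rpart.plot(pruned_tree, type = 4, extra = 101)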


test predictions

Using the model, we can make predictions on the test set.

#11 Use the predict function and your models to predict the target variable using
#the test set. 
set.seed(1981)
cancer_pred<-predict(cancer_tree_gini, test_df, type= "class")

prediction analysis

The model can be evaluated across several metrics: hit rate, detection rate, confusion matrix, and ROC/AUC.

hit & detection rates

The hit rate is the true error rate of the model ((false positives+false negatives)/all data points):

#12 Generate, "by-hand", the hit rate and detection rate and compare the 
#detection rate to your original baseline rate. How did your models work?
con_mat<-table(cancer_pred, test_df$PR.Status)

par_error_rate<-sum(con_mat[row(con_mat) != col(con_mat)]) / sum(con_mat)

paste0("hit rate/true error rate = ", par_error_rate * 100, "%")
## [1] "hit rate/true error rate = 47.6190476190476%"

The detection rate is the rate at which the model correctly identifies the positive class relative to all test observations (true positives/all data points):

par_detection_rate<-con_mat[2,2]/sum(con_mat)*100

paste0("detection rate = ", par_detection_rate, "%")
## [1] "detection rate = 23.8095238095238%"

From these metrics, we can see that the model misclassifies nearly half of the test instances and detects positives at a rate (23.81%) well below the prevalence of the positive class. We don't love this!
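
Given our stated goal of minimizing false negatives, one hedged adjustment worth exploring is rpart's loss matrix, which lets misclassifying a true positive cost more than misclassifying a true negative. The sketch below uses an arbitrary 3:1 cost ratio and assumes the class ordering (0, then 1) implied by the data prep above:

# penalize false negatives via rpart's loss matrix (a sketch, not part of the
# original analysis): rows index the true class (0, 1), columns the prediction,
# so the entry [2, 1] (true 1 predicted 0) carries the higher cost of 3
set.seed(1981)
cancer_tree_cost<-rpart(
  PR.Status ~ .,
  method = "class",
  data = train_df,
  parms = list(loss = matrix(c(0, 3, 1, 0), nrow = 2)),
  control = rpart.control(cp = .01)
  )
table(predict(cancer_tree_cost, test_df, type = "class"), test_df$PR.Status)

Whether this actually shifts errors away from false negatives on such a small test set would need to be verified.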


confusion matrix

The confusionMatrix function provides a range of evaluation metrics:

#13 Use the confusion matrix function in caret to 
#check a variety of metrics and comment on the metric that might be best for 
#each type of analysis.  
conf_mat<-confusionMatrix(as.factor(cancer_pred), as.factor(test_df$PR.Status), positive = "1", dnn=c("Prediction", "Actual"), mode = "sens_spec")
conf_mat
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction 0 1
##          0 6 5
##          1 5 5
##                                           
##                Accuracy : 0.5238          
##                  95% CI : (0.2978, 0.7429)
##     No Information Rate : 0.5238          
##     P-Value [Acc > NIR] : 0.5874          
##                                           
##                   Kappa : 0.0455          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 0.5455          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 0.5455          
##              Prevalence : 0.4762          
##          Detection Rate : 0.2381          
##    Detection Prevalence : 0.4762          
##       Balanced Accuracy : 0.5227          
##                                           
##        'Positive' Class : 1               
## 

Model accuracy is 52.38%; sensitivity is 50.0% and specificity is 54.55%. Compared to the base rate of 51.43%, our model contributes negligible power to the classification task. Yikes. While this can be partially attributed to the small size of the test set (21 observations), the volatility of the model and the training set (performance and class distribution depend heavily on the seed) also contributes to the poor performance.
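
To quantify that seed dependence (a sketch with arbitrary seed values, not part of the original analysis), the split-fit-evaluate cycle can be repeated across several seeds and the spread of test accuracies inspected:

# repeat the 80-20 split, tree fit, and test evaluation across a few seeds
accs<-sapply(c(1, 42, 1981, 2024, 7), function(s) {
  set.seed(s)
  idx<-createDataPartition(cancer_df$PR.Status, p = 0.8, list = FALSE)
  fit<-rpart(PR.Status ~ ., method = "class", data = cancer_df[idx, ],
             control = rpart.control(cp = .01))
  pred<-predict(fit, cancer_df[-idx, ], type = "class")
  mean(pred == cancer_df[-idx, ]$PR.Status)
})
summary(accs)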


ROC/AUC

The ROC curve illustrates the diagnostic power of the model across all classification thresholds; the AUC score provides an aggregate measure of the model's performance.

#14 Generate a ROC and AUC output, interpret the results
par_roc<-roc(test_df$PR.Status, as.numeric(cancer_pred), plot = TRUE)

The ROC curve sits only slightly above the diagonal of a random (50-50) classifier; the model is thus barely able to distinguish between the positive and negative classes.

paste0("area under the curve = ", as.numeric(par_roc$auc))
## [1] "area under the curve = 0.522727272727273"

The sub-par AUC of roughly 0.52 confirms the picture painted by the ROC curve: the model contributes essentially no power to the classification task.
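
Because the ROC above is computed from hard class labels (a single threshold), a hedged alternative is to pass the predicted probability of the positive class to pROC instead, so the curve reflects every threshold:

# ROC/AUC from predicted class probabilities rather than 0/1 labels (a sketch)
cancer_prob<-predict(cancer_tree_gini, test_df, type = "prob")[, 2]
par_roc_prob<-roc(test_df$PR.Status, cancer_prob, plot = TRUE)
paste0("probability-based AUC = ", round(as.numeric(par_roc_prob$auc), 4))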


model 2: tumor classification

Our second model seeks to build a multi-class classifier of tumors. Because different classifications of tumors require different medical responses, our model should work to maximize accuracy and true positives (false negatives are less expensive than for our first task).


data prep

To prep the data for a decision tree, we factored categorical data while preserving numerical variables. The target variable, Tumor, was re-factored along its four classes (T1, T2, T3, T4). After the data was cleaned and factored, we created an 80-20 train-test split.

# read in the data and drop the unused 15th column
tumor_cf<-read.csv("clinical_breast_cleaned.csv")
tumor_cf<-tumor_cf[ ,-c(15)]

# ensure the four tumor classes are explicit factor levels
tumor_cf$Tumor<-fct_collapse(tumor_cf$Tumor, 
                             T1 = "T1", 
                             T2 = "T2",
                             T3 = "T3",
                             T4 = "T4")

# factor the remaining character variables
tumor_cf<-tumor_cf%>%
  mutate_if(is.character, as.factor)

# create data splits 
split_t<-createDataPartition(tumor_cf$Tumor,times=1,p = 0.8,list=FALSE)
train_t<-tumor_cf[split_t,]
test_t<-tumor_cf[-split_t,]


exploratory analyses

After cleaning and formatting the data for the model, we performed some initial analysis to isolate the base rates of the target variable. Its class distribution:

table(tumor_cf$Tumor)
## 
## T1 T2 T3 T4 
## 15 65 19  6

and its classification proportions:

# isolate classifier base rates
round(table(tumor_cf$Tumor)/sum(table(tumor_cf$Tumor))*100, 2)
## 
##    T1    T2    T3    T4 
## 14.29 61.90 18.10  5.71

show an uneven distribution across the 4 classes, with T2 containing 61.9% of observations (far outweighing T4, which comprises only 5.71%).
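
One hedged way to soften this imbalance before training (sketched below, not applied in our final model) is caret's upSample, which resamples the minority tumor classes in the training split until each matches the majority class count:

# up-sample minority tumor classes in the training split (a sketch); column 6
# is the Tumor target, and yname keeps the outcome's original name
train_up<-upSample(x = train_t[, -c(6)], y = train_t$Tumor, yname = "Tumor")
table(train_up$Tumor)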


model creation

From our training set, we can define our resampling process (repeated 10-fold cross-validation) and a hyper-parameter tuning grid. Using these, we can train our multi-class decision tree model!

# cross-validating
fitControl_t<-trainControl(method = "repeatedcv",
  number = 10,
  repeats = 5, returnResamp="all") #setting up our cross validation

features_t<-train_t[, -c(6)]
target_t<-train_t$Tumor

grid_t<-expand.grid(.winnow = c(TRUE, FALSE), .trials = c(1, 5, 10, 15, 20), .model = "tree")

# train model
tumor_mdl<-train(x=features_t, y=target_t, tuneGrid=grid_t, 
                 trControl=fitControl_t, method="C5.0", verbose=TRUE)
tumor_mdl
## C5.0 
## 
## 85 samples
## 15 predictors
##  4 classes: 'T1', 'T2', 'T3', 'T4' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 77, 77, 75, 77, 75, 78, ... 
## Resampling results across tuning parameters:
## 
##   winnow  trials  Accuracy   Kappa    
##   FALSE    1      0.7893095  0.6120263
##   FALSE    5      0.8409603  0.6878501
##   FALSE   10      0.8516746  0.7109338
##   FALSE   15      0.8477302  0.7074586
##   FALSE   20      0.8426746  0.6942645
##    TRUE    1      0.7960476  0.5968560
##    TRUE    5      0.7924603  0.5944924
##    TRUE   10      0.7984048  0.6053508
##    TRUE   15      0.7985159  0.6030135
##    TRUE   20      0.8035159  0.6155481
## 
## Tuning parameter 'model' was held constant at a value of tree
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 10, model = tree and winnow
##  = FALSE.

Our finalized model (whose hyperparameters were selected during the grid search) uses 10 trials and no winnowing to maximize accuracy.
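
The tuning profile behind this choice can be plotted directly (a one-line sketch using caret's plot method for train objects), showing accuracy across trial counts for each winnowing setting:

# plot accuracy against boosting trials, grouped by the winnow setting
plot(tumor_mdl)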


model evaluation

Our model allows us to visualize the re-sample distributions:

# visualize the re-sample distributions
xyplot(tumor_mdl,type = c("g", "p", "smooth"))

The resampling lattice illustrates that resampling hardly affects model accuracy, with the metric remaining relatively stable above 0.82.

# view the variable importance of the model
varImp(tumor_mdl)
## C5.0 variable importance
## 
##                                     Overall
## AJCC.Stage                           100.00
## Converted.Stage                       94.12
## HER2.Final.Status                     84.71
## Node.Coded                            83.53
## Survival.Data.Form                    61.18
## Days.to.Date.of.Last.Contact          41.18
## OS.Time                               25.88
## Age.at.Initial.Pathologic.Diagnosis   10.59
## ER.Status                              9.41
## PR.Status                              5.88
## Metastasis                             0.00
## Metastasis.Coded                       0.00
## OS.event                               0.00
## Vital.Status                           0.00
## Gender                                 0.00

C5.0 also returns the importance of variables; 5 variables (AJCC.Stage, Converted.Stage, HER2.Final.Status, Node.Coded, Survival.Data.Form) contribute most significantly to the model's split decisions.
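
These importances can also be visualized (a one-line sketch using caret's plot method for varImp objects):

# dot plot of the ten most important variables
plot(varImp(tumor_mdl), top = 10)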


prediction evaluation

Our tuned model can be used to make predictions on the test set; the function confusionMatrix returns a range of performance metrics:

# predict and view confusion matrix
tumor_predict<-predict(tumor_mdl, test_t, type= "raw")

confusionMatrix(as.factor(tumor_predict), as.factor(test_t$Tumor), 
                dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction T1 T2 T3 T4
##         T1  1  0  0  0
##         T2  2 12  1  0
##         T3  0  1  2  0
##         T4  0  0  0  1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.5634, 0.9427)
##     No Information Rate : 0.65            
##     P-Value [Acc > NIR] : 0.1182          
##                                           
##                   Kappa : 0.5833          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: T1 Class: T2 Class: T3 Class: T4
## Sensitivity             0.3333    0.9231    0.6667      1.00
## Specificity             1.0000    0.5714    0.9412      1.00
## Pos Pred Value          1.0000    0.8000    0.6667      1.00
## Neg Pred Value          0.8947    0.8000    0.9412      1.00
## Prevalence              0.1500    0.6500    0.1500      0.05
## Detection Rate          0.0500    0.6000    0.1000      0.05
## Detection Prevalence    0.0500    0.7500    0.1500      0.05
## Balanced Accuracy       0.6667    0.7473    0.8039      1.00

The relatively high accuracy is tempered by a more modest Kappa value; the imbalanced class distribution (dominated by T2) leads the model to identify the classes at very different rates. For example, the model has a T2 sensitivity (true positive rate) of 92.31%, meaning it correctly classifies a T2 tumor 92.31% of the time, but this is paired with a low T2 specificity (57.14%): the model tends to over-predict the T2 class. As with our first model, this can be partially attributed to the small size of the test set (20 observations) and to the volatility of the model and the training set (performance and class distribution depend heavily on the seed).


conclusion

In conclusion, a decision tree is a poor fit for this dataset. The small sample size, exacerbated by unequal class distributions, constrains the models; their overall volatility, in which performance and class distribution depend heavily on the particular seed, dominates the results.

Furthermore, the decision tree models contribute very little power to their classification tasks. Accuracy metrics remain largely flat under hyper-parameter tuning, with changes to resampling processes and complexity parameters yielding no meaningful improvement.

We therefore recommend that a larger and more balanced dataset be used before a decision tree model is adopted. The models in their current state could not be deployed for medical use; their misclassification costs are too high to accept casually. Beyond a more comprehensive dataset, the human cost of this classification task demands models that are carefully tuned and computationally efficient.