Using CART (from rpart, with Gini splits)

Train/Test Split

## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
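The warning above is produced when a tibble is indexed with the matrix that caret's createDataPartition() returns. A minimal sketch of the split under that assumption (the data-frame name and seed are hypothetical):

```r
# Stratified split on the target; createDataPartition() returns a matrix,
# so convert it to a vector before indexing a tibble (per the warning above).
library(caret)

set.seed(1980)                                         # hypothetical seed
part_index <- createDataPartition(clean_df$PR.Status,  # clean_df is hypothetical
                                  p = 0.8, list = FALSE)
train <- clean_df[as.vector(part_index), ]
test  <- clean_df[-as.vector(part_index), ]
```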

Base Rate for Classifier

This table suggests that ER Status will be among the most important variables to the decision tree classifier, with a correlation of approximately 79%, followed by Converted Stage at approximately 75% and AJCC Stage at 71%.
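A sketch of how such a correlation table can be generated, assuming the predictors have been numerically encoded (the data-frame name is hypothetical):

```r
# Correlation of each numerically encoded column with the target, sorted;
# data.matrix() coerces factors to their integer codes.
cors <- cor(data.matrix(clean_df), use = "pairwise.complete.obs")[, "PR.Status"]
sort(abs(cors[names(cors) != "PR.Status"]), decreasing = TRUE)
# ER.Status, Converted.Stage, and AJCC.Stage should sit near the top
```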

## n= 82 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 82 41 0 (0.5000000 0.5000000)  
##   2) ER.Status=Indeterminate,Negative 29  0 0 (1.0000000 0.0000000) *
##   3) ER.Status=Positive 53 12 1 (0.2264151 0.7735849)  
##     6) Days.to.Date.of.Last.Contact< 12 7  2 0 (0.7142857 0.2857143) *
##     7) Days.to.Date.of.Last.Contact>=12 46  7 1 (0.1521739 0.8478261) *

According to the above output, the variables contributing most to the Gini splits are ER.Status and Days.to.Date.of.Last.Contact.
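A minimal sketch of the rpart call behind this output; the Gini index is rpart's default splitting criterion for classification, and the object name is an assumption:

```r
library(rpart)

set.seed(1980)
tree_example <- rpart(PR.Status ~ .,                 # target vs. all predictors
                      data = train,
                      method = "class",              # classification tree
                      parms = list(split = "gini"))  # explicit Gini splits
tree_example   # prints the node/split summary shown above
```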

As indicated by the output above and the plot of the tree below, ER Status is an important variable; excluding it would leave Days to Date of Last Contact as the most important variable for the tree's Gini splits. Other important variables, as found earlier, are AJCC Stage and Converted Stage.

The optimal size of the tree is approximately 2 terminal nodes, as can be observed from the elbow chart for the complexity parameter (cp) above. The plotted tree has 3 terminal nodes, but in the context of this model the difference between the two sizes is small.
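A sketch of how the cp elbow chart is produced and how the tree could be pruned at the chosen complexity, assuming the fitted object is tree_example:

```r
printcp(tree_example)   # cp table: cross-validated error (xerror) by split count
plotcp(tree_example)    # the elbow chart referenced above

# Prune at the cp value that minimizes cross-validated error
cp_best <- tree_example$cptable[which.min(tree_example$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_example, cp = cp_best)
```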

Testing the accuracy (and comparing it to the base rates)

##                            var  n wt dev yval complexity ncompete nsurrogate
## 1                    ER.Status 82 82  41    1 0.70731707        4          4
## 2                       <leaf> 29 29   0    1 0.01000000        0          0
## 3 Days.to.Date.of.Last.Contact 53 53  12    2 0.07317073        4          3
## 6                       <leaf>  7  7   2    1 0.01000000        0          0
## 7                       <leaf> 46 46   7    2 0.00000000        0          0
##      yval2.V1    yval2.V2    yval2.V3    yval2.V4    yval2.V5 yval2.nodeprob
## 1  1.00000000 41.00000000 41.00000000  0.50000000  0.50000000     1.00000000
## 2  1.00000000 29.00000000  0.00000000  1.00000000  0.00000000     0.35365854
## 3  2.00000000 12.00000000 41.00000000  0.22641509  0.77358491     0.64634146
## 6  1.00000000  5.00000000  2.00000000  0.71428571  0.28571429     0.08536585
## 7  2.00000000  7.00000000 39.00000000  0.15217391  0.84782609     0.56097561
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0  6  1
##          1  2 11
##                                           
##                Accuracy : 0.85            
##                  95% CI : (0.6211, 0.9679)
##     No Information Rate : 0.6             
##     P-Value [Acc > NIR] : 0.01596         
##                                           
##                   Kappa : 0.6809          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.9167          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.8462          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.6000          
##          Detection Rate : 0.5500          
##    Detection Prevalence : 0.6500          
##       Balanced Accuracy : 0.8333          
##                                           
##        'Positive' Class : 1               
## 
## [1] 0.5196078

Whereas the base rate calculated for the percentage of 1 (disease) classifications was about 52%, the accuracy calculated from the matrix is 85%. The sensitivity was about 92% and the specificity was 75%. The confusion matrix also shows that many more correct classifications of PR.Status, whether 0s or 1s, were made than incorrect ones (17 of the 20 test cases).

Calculated from the confusion matrix, the error (miss) rate was 15% (3 of the 20 test cases were misclassified), while the detection rate was 55%. The detection rate (true positives over all cases) is necessarily no higher than the sensitivity (the true positive rate), which was about 92%. In the realm of disease data, it is best to maximize the sensitivity in order to maximize the percentage of actual positives that are correctly identified: it is vital to pick up on the "1s" in the data set in order to catch diagnoses of disease. While the detection rate can acceptably be lower, the error rate should be minimized. The model can be further pruned and fine-tuned in the future, perhaps by dropping variables that contribute little to the splits and re-running the tree.
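Recomputing these rates by hand, with the counts read directly off the confusion matrix above:

```r
tp <- 11; tn <- 6; fp <- 2; fn <- 1   # cells of the confusion matrix above
n  <- tp + tn + fp + fn               # 20 test cases

(tp + tn) / n    # accuracy       = 0.85
(fp + fn) / n    # error rate     = 0.15
tp / (tp + fn)   # sensitivity    = 0.9167
tn / (tn + fp)   # specificity    = 0.75
tp / n           # detection rate = 0.55
```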

ROC and AUC

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = test$PR.Status, predictor = as.numeric(tree_predict),     plot = TRUE)
## 
## Data: as.numeric(tree_predict) in 8 controls (test$PR.Status 0) < 12 cases (test$PR.Status 1).
## Area under the curve: 0.8333

Above, the ROC curve shows the classifier's single operating point at a specificity of 0.75 and a sensitivity of about 0.92, consistent with the values from the confusion matrix; the AUC of 0.8333 matches the balanced accuracy. Moreover, "tree_example_prob" shows the probability of being in the positive class as well as in the negative class.
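The roc() call above used the hard class labels; a sketch of how the curve and AUC are obtained, and how the class probabilities in tree_example_prob could be substituted for a smoother curve:

```r
library(pROC)

# As in the output above: hard 0/1 predictions yield a single-elbow curve
tree_roc <- roc(response = test$PR.Status,
                predictor = as.numeric(tree_predict),
                plot = TRUE)
tree_roc$auc   # 0.8333

# Substituting P(class == 1) would trace the curve over every probability
# threshold instead of just one:
# roc(test$PR.Status, tree_example_prob[, "1"], plot = TRUE)
```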

C5.0 Decision Tree Analysis

The above code notes the relative variable importance in making the splits. After going to office hours and speaking with Professor Wright, I replaced the Days to Date of Death column (which was mostly NAs) with 0s and trained on the resulting set to avoid the errors caused by the NAs. Even so, when using the train() function I was unable to get an output for the variable importance. Going back to the method used at the beginning of the lab to determine the relative importance of the variables in making the splits, however, ER Status will most likely also have a large influence in the multiclass example.
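A sketch of the NA fix and the C5.0 fit described above; the object names (train_w, tumor_mdl) follow the commented code below and are assumptions:

```r
library(caret)
library(C50)

# Replace the mostly-NA Days.to.Date.of.Death column with 0s
train_w$Days.to.Date.of.Death[is.na(train_w$Days.to.Date.of.Death)] <- 0

set.seed(1980)
tumor_mdl <- train(Tumor ~ ., data = train_w,
                   method = "C5.0",   # caret's wrapper around the C50 package
                   trControl = trainControl(method = "cv", number = 5))
# varImp(tumor_mdl) should then report the relative variable importance
```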

Evaluating Model Performance

# Evaluate the C5.0 model on the held-out set (left commented out; see the note above)
# tumor_predict <- predict(tumor_mdl, test_w, type = "raw")
# View(as_tibble(tumor_predict))
# Build the confusion matrix (for EACH class of the target variable):
# confusionMatrix(as.factor(tumor_predict), as.factor(test_w$Tumor),
#                 dnn = c("Prediction", "Actual"), mode = "sens_spec")
# table(test_w$Tumor)
# tumor_predict_p <- predict(tumor_mdl, test_w, type = "prob")
##   base_rate1 base_rate2 base_rate3 base_rate4
## 1  0.1428571  0.6095238  0.1619048 0.05714286
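
A sketch of the per-class base-rate calculation shown above, assuming the multiclass target column is Tumor (the data-frame name is an assumption):

```r
base_rates <- prop.table(table(test_w$Tumor))   # frequency of each tumor class
round(as.numeric(base_rates), 4)                # e.g., 0.1429 0.6095 0.1619 0.0571
```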

Conclusions:

The base rates for the C5.0 multiclass example are displayed in the data frame created above. When the confusion matrix is printed, these values can be compared against the metrics for each individual class. As before, the sensitivity is the true positive rate for each class, while the detection rate can deviate farther from it. Since this multiclass example is also a health-related scenario, it would be most vital to raise the sensitivity in order to correctly identify the cases in which one of the tumor classes is present.

While the binary example uses the CART method with Gini splits to prioritize the splits in the tree, the C5.0 example is more complex and works with a target variable that has multiple levels. C5.0 uses the information gain ratio, handles missing data natively (as was found in the Days to Date of Death column), and can weight how much each variable contributes. In the context of these two models, C5.0 is the better fit, especially when the variable in question has multiple classes, like Tumor. With it, the researchers would be able to better understand how to maximize sensitivity, and which variables have the most effect in doing so.