Cleaning the Data

In order to prepare the data for decision tree training, we first removed ER.Status because it contains essentially the same information as PR.Status. Next, we removed Days.to.date.of.Death because, as the missing variable plot shows, it was missing for a large proportion of patients. Finally, we dropped any remaining cases that were not complete.
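These first steps are straightforward in R. A minimal sketch, assuming the raw data frame is called clinical (a hypothetical name):

# Drop the redundant and mostly-missing columns, then keep only complete cases
clinical$ER.Status <- NULL
clinical$Days.to.date.of.Death <- NULL
clinical <- clinical[complete.cases(clinical), ]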

We thought that this would be enough, but we soon learned that some columns had an imbalance of information that prevented our models from building correctly. We had two solutions for this problem. First, we dropped the Gender, Metastasis, and Metastasis.Coded columns because they carried almost no information: only a couple of patients had values different from the rest. Second, we collapsed the stage columns (AJCC.Stage and Converted.Stage) so that the information in them was more balanced while still remaining factors; for example, all Stage I, Stage IA, and Stage IB values went into a single Stage I category. Finally, we dropped any patients carrying values that were unique in their column, because if such a value ended up in the test set, the decision tree would be confronted with a level it had never seen in training.
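A sketch of the collapsing and filtering steps, under the same assumed clinical data frame; here gsub() strips the A/B/C sub-stage suffix, which is one way to fold, e.g., Stage IA and Stage IB into Stage I:

# Collapse sub-stages (e.g. Stage IA, Stage IB) into the parent stage
for (col in c("AJCC.Stage", "Converted.Stage")) {
  clinical[[col]] <- factor(gsub("[ABC]$", "", as.character(clinical[[col]])))
}

# Drop patients carrying a factor level that occurs only once, so the
# test set can never contain a level unseen during training
for (col in names(Filter(is.factor, clinical))) {
  rare <- names(which(table(clinical[[col]]) == 1))
  clinical <- clinical[!clinical[[col]] %in% rare, ]
}
clinical <- droplevels(clinical)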

After that, we were ready to start creating our trees! We split the remaining 102 patients into training and test sets.
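The split can be done with caret's createDataPartition(); the 80/20 proportion below is an assumption that matches the 82 training and 20 test patients reported later:

library(caret)
set.seed(1)  # arbitrary seed, for reproducibility of the sketch
train_idx <- createDataPartition(clinical$PR.Status, p = 0.8, list = FALSE)
train <- clinical[train_idx, ]
test  <- clinical[-train_idx, ]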

Analysis for PR.Status and Tumor

After cleaning our data, we began our analysis on both the PR.Status variable and the Tumor variable. Since PR.Status is a binary variable, we chose to run a CART-style analysis to create a decision tree. On the other hand, Tumor is a multi-class variable, so a C5.0 analysis makes more sense.

PR.Status

RPart (CART style)

CART analysis tends to be more useful for binary predictions, so we think that this type of tree model will work best for the PR.Status variable.

Base Rate

The base rate, i.e. the proportion of patients with a positive PR.Status, is 51.96%.

## [1] -0.5196078

Building the Model

To build the model, we used rpart to create a tree. From this tree, we identified the most important variables (the relative importance of each variable can be seen in the table below the tree graph). In this case, OS.Time was the most important variable, contributing the most improvement across all nodes in which it acts as a splitter. However, at the top of the tree sits Age.at.Initial.Pathologic.Diagnosis, which means that it is the single variable that best partitions the data set into positive and negative PR.Status values.

Additionally, we looked at the complexity parameter (cp) table. From it, we gathered that we want a tree with 2 splits to reduce the xerror (cross-validated error) as much as possible. This can be seen in the graph below, where 2 is the leftmost size below the dotted line, which marks the minimum cross-validated error plus one standard error (the 1-SE rule).
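For reference, a minimal sketch of the fit and its diagnostics; pr_tree is our own hypothetical object name:

library(rpart)
pr_tree <- rpart(PR.Status ~ ., data = train, method = "class")
printcp(pr_tree)             # cp table with cross-validated error (xerror)
plotcp(pr_tree)              # dotted line marks min(xerror) + 1 SE
pr_tree                      # printed tree, as shown below
pr_tree$variable.importance  # importance values, as shown below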

## n= 82 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 82 39 1 (0.4756098 0.5243902)  
##    2) Age.at.Initial.Pathologic.Diagnosis< 57.5 42 17 0 (0.5952381 0.4047619)  
##      4) OS.Time>=672 19  5 0 (0.7368421 0.2631579) *
##      5) OS.Time< 672 23 11 1 (0.4782609 0.5217391)  
##       10) Days.to.Date.of.Last.Contact< 335.5 10  2 0 (0.8000000 0.2000000) *
##       11) Days.to.Date.of.Last.Contact>=335.5 13  3 1 (0.2307692 0.7692308) *
##    3) Age.at.Initial.Pathologic.Diagnosis>=57.5 40 14 1 (0.3500000 0.6500000)  
##      6) Days.to.Date.of.Last.Contact< 955.5 20  9 0 (0.5500000 0.4500000)  
##       12) Tumor=T2 8  2 0 (0.7500000 0.2500000) *
##       13) Tumor=T1,T3,T4 12  5 1 (0.4166667 0.5833333) *
##      7) Days.to.Date.of.Last.Contact>=955.5 20  3 1 (0.1500000 0.8500000) *

##                             OS.Time        Days.to.Date.of.Last.Contact 
##                           8.6855497                           8.6739260 
## Age.at.Initial.Pathologic.Diagnosis                               Tumor 
##                           4.8495702                           3.7148780 
##                     Converted.Stage                  Survival.Data.Form 
##                           1.3995354                           1.2836887 
##                          Node.Coded                   HER2.Final.Status 
##                           0.7466667                           0.7325753 
##                        Vital.Status                            OS.event 
##                           0.4994947                           0.3661614 
##                          AJCC.Stage 
##                           0.2666667

Prediction and Error Analysis

In this section, we used the tree that we created to predict the target variable, PR.Status, for each patient in the test set.

After running the model on our test set, we generated the confusion matrix to compare the predictions against the actual data. We have an accuracy of 55%, which is only slightly better than our base rate. Our ROC plot below confirms this, as the curve sits only slightly above the y = x line. Our error rate was 9 out of 20 test cases, which is certainly not ideal. Our sensitivity and specificity are also poor, meaning we get more false positives and false negatives than we would like. Our kappa is extremely low too, indicating agreement with the true labels that is barely better than chance.
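A sketch of this step, reusing the hypothetical pr_tree; tree_example_prob is the object name visible in the ROC call below:

library(pROC)
tree_example_prob <- predict(pr_tree, test, type = "prob")
pred_class <- factor(ifelse(tree_example_prob[, "1"] >= 0.5, 1, 0))
confusionMatrix(pred_class, factor(test$PR.Status), positive = "1",
                dnn = c("Prediction", "Actual"))
roc(response = test$PR.Status,
    predictor = ifelse(tree_example_prob[, "0"] >= 0.5, 0, 1), plot = TRUE)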

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction 0 1
##          0 6 5
##          1 4 5
##                                           
##                Accuracy : 0.55            
##                  95% CI : (0.3153, 0.7694)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.4119          
##                                           
##                   Kappa : 0.1             
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 0.6000          
##          Pos Pred Value : 0.5556          
##          Neg Pred Value : 0.5455          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.4500          
##       Balanced Accuracy : 0.5500          
##                                           
##        'Positive' Class : 1               
## 

## 
## Call:
## roc.default(response = test$PR.Status, predictor = ifelse(tree_example_prob[,     "0"] >= 0.5, 0, 1), plot = TRUE)
## 
## Data: ifelse(tree_example_prob[, "0"] >= 0.5, 0, 1) in 10 controls (test$PR.Status 0) < 10 cases (test$PR.Status 1).
## Area under the curve: 0.55

Caret (C5.0 style)

C5.0 models (fit here through the caret framework) tend to be better suited for multi-class or continuous-featured classification, so we expect this type of model to perform worse for identifying the binary PR.Status variable.

Creating Tree

Using cross-validation and tuning, we found the best combination of hyperparameters for our tree: a single trial with winnowing enabled. However, most models had very similar accuracy regardless of the number of trials, as can be seen in the accuracy-versus-trials plots below. Winnowing, though, did improve accuracy noticeably.

We ended up having to remove more columns from our data because they were causing trouble in the algorithm: some factor levels appearing in the test set were undefined during training. After this, we were down to only four variables.
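A sketch of the tuning setup, with hypothetical object names; the grid and the repeated 10-fold cross-validation mirror the resampling summary below (the C50 package is required by caret for this method):

grid <- expand.grid(trials = c(1, 5, 10, 15, 20),
                    model  = "tree",
                    winnow = c(FALSE, TRUE))
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
pr_c50 <- train(PR.Status ~ ., data = train, method = "C5.0",
                trControl = ctrl, tuneGrid = grid)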

## C5.0 
## 
## 82 samples
##  4 predictor
##  2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 74, 74, 73, 74, 73, 74, ... 
## Resampling results across tuning parameters:
## 
##   winnow  trials  Accuracy   Kappa       
##   FALSE    1      0.4692857  -0.081874046
##   FALSE    5      0.4692857  -0.081874046
##   FALSE   10      0.4692857  -0.081874046
##   FALSE   15      0.4692857  -0.081874046
##   FALSE   20      0.4692857  -0.081874046
##    TRUE    1      0.5162302  -0.008823529
##    TRUE    5      0.5162302  -0.008823529
##    TRUE   10      0.5162302  -0.008823529
##    TRUE   15      0.5162302  -0.008823529
##    TRUE   20      0.5162302  -0.008823529
## 
## Tuning parameter 'model' was held constant at a value of tree
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 1, model = tree and winnow
##  = TRUE.

Analyzing Results

As can be seen in the confusion matrix below, our results look promising at first glance: we never had any false positives, which is reflected in our perfect specificity score. However, this is only because the model predicts class 1 for every patient. Our accuracy was 50%, which is slightly lower than our base rate. Further, our kappa value is 0, indicating zero agreement beyond chance. Our sensitivity is also 0, which means every case of the positive class ('0') was missed, so we have a huge issue with false negatives.

The overall error rate was 50%, which is bad. The area under the curve of our ROC plot is 0.5, which indicates that we are performing exactly as well as random guessing, making this model essentially useless.
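A sketch of the evaluation, assuming pr_c50 from above; PRC50_predict is the object name visible in the ROC call below:

PRC50_predict <- predict(pr_c50, test)
confusionMatrix(PRC50_predict, factor(test$PR.Status),
                dnn = c("Prediction", "Actual"))
roc(response = test$PR.Status, predictor = as.numeric(PRC50_predict),
    plot = TRUE)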

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0  0  0
##          1 10 10
##                                         
##                Accuracy : 0.5           
##                  95% CI : (0.272, 0.728)
##     No Information Rate : 0.5           
##     P-Value [Acc > NIR] : 0.588099      
##                                         
##                   Kappa : 0             
##                                         
##  Mcnemar's Test P-Value : 0.004427      
##                                         
##             Sensitivity : 0.0           
##             Specificity : 1.0           
##          Pos Pred Value : NaN           
##          Neg Pred Value : 0.5           
##              Prevalence : 0.5           
##          Detection Rate : 0.0           
##    Detection Prevalence : 0.0           
##       Balanced Accuracy : 0.5           
##                                         
##        'Positive' Class : 0             
## 

## 
## Call:
## roc.default(response = test$PR.Status, predictor = as.numeric(PRC50_predict),     plot = TRUE)
## 
## Data: as.numeric(PRC50_predict) in 10 controls (test$PR.Status 0) < 10 cases (test$PR.Status 1).
## Area under the curve: 0.5

Multi-Class Prediction: Tumors

Next, we will build another decision tree to predict the type of tumor a patient has. There are four classes of tumors (T1, T2, T3, T4).

RPART (CART style)

Base Rates

The base rates for the four tumor classes are:

  • T1 = 14.29%

  • T2 = 61.90%

  • T3 = 18.10%

  • T4 = 5.71%

Building the Multi-Class CART Model

We first built a tree using the Classification and Regression Tree (CART) method.
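A sketch of the fit and the diagnostics discussed in the next subsections, assuming a second split into train2/test2 (test2 is the name visible in the ROC output further below; train2 is our hypothetical counterpart, and rpart.plot is one common plotting choice):

tumor_tree <- rpart(Tumor ~ ., data = train2, method = "class")
rpart.plot::rpart.plot(tumor_tree)  # decision tree plot
plotcp(tumor_tree)                  # CP chart
tumor_tree$variable.importance      # importance values, as shown below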

Decision Tree Plot

As we can see in the plot below, the tree never predicts T4. This is unsurprising, since only 6 individuals in the dataset have T4 tumors.

CP Chart

The CP (complexity parameter) chart below helps us determine the optimal size of the tree. Based on our plot, the optimal size is 2.

Variable Importance

Based on the model with default settings, the most important variable is AJCC.Stage, which describes the amount and spread of cancer in a patient’s body.

##                          AJCC.Stage                     Converted.Stage 
##                          20.4963598                          11.8360514 
##        Days.to.Date.of.Last.Contact                          Node.Coded 
##                           4.1272279                           3.2024324 
##                             OS.Time                  Survival.Data.Form 
##                           2.9263158                           1.8013682 
## Age.at.Initial.Pathologic.Diagnosis                           PR.Status 
##                           1.7150818                           0.5486842

Prediction

We use the tree trained above to predict the tumor classes of the test dataset.
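A sketch of this step; tree_predict2 and tree_example_prob2 are the object names visible in the output that follows:

tree_predict2 <- predict(tumor_tree, test2, type = "class")
tree_example_prob2 <- predict(tumor_tree, test2, type = "prob")
table(tree_predict2, test2$Tumor)
multiclass.roc(response = test2$Tumor,
               predictor = ifelse(tree_example_prob2[, "T1"] >= 0.05, 0, 1),
               plot = TRUE)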

Evaluate Performance

Overall, the model doesn't do a great job at prediction. Based on the ROC curves below (starting with T1), the multi-class AUC is below 0.5, indicating a model that performs no better than random guessing.

T1 - The error rate for T1 is 100% and the detection rate is 0%. There are 2 T1 tumors in the test set and they are being incorrectly identified as T2 tumors.

T2 - The error rate for T2 is 27.27% and the detection rate is 42.86%. Compared to the base rate of 61.90%, the model does worse at predicting a T2 tumor.

T3 - The error rate for T3 is 33.33% and the detection rate is 14.29%. Compared to the base rate of 18.10%, this is not a huge improvement.

T4 - This type of tumor has a very low prevalence in the dataset; therefore, we are not surprised that 3 T4 tumors are being incorrectly identified as T2 and T3 tumors when we use the model.

##              
## tree_predict2 T1 T2 T3 T4
##            T1  0  1  1  0
##            T2  2  9  0  2
##            T3  0  2  2  1
##            T4  0  0  0  0

## 
## Call:
## multiclass.roc.default(response = test2$Tumor, predictor = ifelse(tree_example_prob2[,     "T1"] >= 0.05, 0, 1), plot = TRUE)
## 
## Data: ifelse(tree_example_prob2[, "T1"] >= 0.05, 0, 1) with 4 levels of test2$Tumor: T1, T2, T3, T4.
## Multi-class area under the curve: 0.4808

Caret (C5.0 Style)

Building the Multi-Class C5.0 Model

Now, we will build a multi-class decision tree for tumor detection using the C5.0 algorithm.
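A sketch, reusing the tuning grid and control from the PR.Status C5.0 model (hypothetical object names):

tumor_c50 <- train(Tumor ~ ., data = train2, method = "C5.0",
                   trControl = ctrl, tuneGrid = grid)
tumor_c50  # printed summary, as shown below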

## C5.0 
## 
## 84 samples
##  5 predictor
##  4 classes: 'T1', 'T2', 'T3', 'T4' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 75, 75, 75, 75, 75, 75, ... 
## Resampling results across tuning parameters:
## 
##   winnow  trials  Accuracy   Kappa       
##   FALSE    1      0.6000519  -0.009705882
##   FALSE    5      0.6000519  -0.009705882
##   FALSE   10      0.6000519  -0.009705882
##   FALSE   15      0.6000519  -0.009705882
##   FALSE   20      0.6000519  -0.009705882
##    TRUE    1      0.6089408  -0.005000000
##    TRUE    5      0.6089408  -0.005000000
##    TRUE   10      0.6089408  -0.005000000
##    TRUE   15      0.6089408  -0.005000000
##    TRUE   20      0.6089408  -0.005000000
## 
## Tuning parameter 'model' was held constant at a value of tree
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 1, model = tree and winnow
##  = TRUE.

Creating Tree

The XY plot below visualizes the accuracy for different numbers of trials; clearly, the number of trials has no significant impact on accuracy. When creating the tree, we ran into issues predicting on the test data: if we use all columns in the dataset, we can train a tree with up to 83% accuracy, but undefined (unseen) cases in the data make that model unusable for prediction. We therefore removed 10 columns in order to obtain a tree that can actually be used for prediction.

Evaluate Performance

Due to the complications with prediction, we can only evaluate a tree trained on 5 columns. As is evident in our confusion matrix, the model does not perform well: although all the T2 tumors were correctly predicted, every other tumor was also classified as T2 (visible in the specificity values of 1 for T1, T3, and T4, classes the model simply never predicts). The accuracy is 66.67%, which sounds acceptable, but knowing that only T2 tumors are being classified correctly, we are not impressed with this number, and the kappa value of 0 confirms the problem. Overall, the C5.0 model we generated to predict the tumor class is not successful, and is much worse than the CART model.
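A sketch of this evaluation step, assuming tumor_c50 from above (c50_predict2 is a hypothetical name):

c50_predict2 <- predict(tumor_c50, test2)
confusionMatrix(c50_predict2, test2$Tumor, dnn = c("Prediction", "Actual"))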

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction T1 T2 T3 T4
##         T1  0  0  0  0
##         T2  1 14  3  3
##         T3  0  0  0  0
##         T4  0  0  0  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6667          
##                  95% CI : (0.4303, 0.8541)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : 0.6008          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: T1 Class: T2 Class: T3 Class: T4
## Sensitivity            0.00000    1.0000    0.0000    0.0000
## Specificity            1.00000    0.0000    1.0000    1.0000
## Pos Pred Value             NaN    0.6667       NaN       NaN
## Neg Pred Value         0.95238       NaN    0.8571    0.8571
## Prevalence             0.04762    0.6667    0.1429    0.1429
## Detection Rate         0.00000    0.6667    0.0000    0.0000
## Detection Prevalence   0.00000    1.0000    0.0000    0.0000
## Balanced Accuracy      0.50000    0.5000    0.5000    0.5000

Conclusion

Overall, this lab was challenging, and our models were largely unsuccessful.

PR.Status

We performed the CART and C5.0 analyses on PR.Status (a binary variable) first. We expected the CART model to be more successful, and it was. However, both the CART and C5.0 models performed poorly and need improvement. There are several ways to improve them, but we would first suggest gathering more data, specifically more consistent and balanced data; this would make the models much more accurate for prediction. Moving forward, the models we built could be helpful in predicting whether or not someone will be PR.Status-positive, but they should definitely not be used in practice. We would need to significantly improve the models before deploying them, especially since the stakes are so high when a model is used to detect the presence of cancer.

Tumor

We used the same two methods (CART and C5.0) to build trees for our multi-class variable, Tumor. There were four classes of tumors with varying prevalence in the dataset. We expected the C5.0 analysis to be better, since that algorithm is better suited for multi-class prediction. However, the CART model proved to be a bit more successful, although not good by any means. The T4 tumor had a very low prevalence in the data (only 6 individuals), so the tree never learned to predict it. The CART model did an okay job at identifying T1, T2, and T3 tumors, but could definitely be improved. The C5.0 model was surprisingly bad: due to undefined cases in the data we had to remove 10 columns, leaving only 5 variables for training, and the resulting model identified every tumor as T2, definitely not the result we were hoping for. As previously mentioned, the models could likely be improved with more data, ideally balanced so that all tumor classes have near-equal prevalence. We definitely would not want to use this model in practice unless it is significantly improved; if we can improve it, it could be useful in predicting whether or not a patient has a malignant tumor.