Progesterone Receptors

The variable of interest, “PR.Status,” must be a binary response variable. “PR.Status” is the indicator in this dataset of whether or not progesterone receptors are present in the tumor. The table of values shows that “PR.Status” values are binary.

PR.Status Freq
pr_absent 51
pr_present 54

Base Rate

The base rate for those with progesterone receptors is about 51 percent (54 of 105 observations). This means that, in order for a model to be better than always predicting the majority class, it must accurately classify more than 51 percent of the observations.
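As a quick check of the figure above, the base rate can be recomputed from the frequency table. This is a minimal sketch in Python (the report's own output appears to come from R; Python is used here only to re-check the arithmetic), and the variable names are illustrative.

```python
# Counts copied from the PR.Status frequency table above.
counts = {"pr_absent": 51, "pr_present": 54}

total = sum(counts.values())

# Share of each class; the largest share is the majority-class baseline.
shares = {label: n / total for label, n in counts.items()}
majority_label = max(shares, key=shares.get)
base_rate = shares[majority_label]

print(f"{majority_label}: {base_rate:.2%}")  # pr_present: 51.43%
```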

Decision Tree

## n= 85 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 85 41 pr_present (0.48235294 0.51764706)  
##    2) Converted.Stage=No_Conversion,Stage IIA,Stage IIIA 63 27 pr_absent (0.57142857 0.42857143)  
##      4) Age.at.Initial.Pathologic.Diagnosis< 56.5 31  9 pr_absent (0.70967742 0.29032258)  
##        8) AJCC.Stage=Stage IB,Stage II,Stage III,Stage IIIB,Stage IIIC 9  0 pr_absent (1.00000000 0.00000000) *
##        9) AJCC.Stage=Stage IA,Stage IIA,Stage IIB,Stage IIIA 22  9 pr_absent (0.59090909 0.40909091)  
##         18) Age.at.Initial.Pathologic.Diagnosis>=44.5 12  3 pr_absent (0.75000000 0.25000000) *
##         19) Age.at.Initial.Pathologic.Diagnosis< 44.5 10  4 pr_present (0.40000000 0.60000000) *
##      5) Age.at.Initial.Pathologic.Diagnosis>=56.5 32 14 pr_present (0.43750000 0.56250000)  
##       10) OS.Time< 1061.5 20  8 pr_absent (0.60000000 0.40000000)  
##         20) AJCC.Stage=Stage IIA,Stage III,Stage IIIA,Stage IV 11  1 pr_absent (0.90909091 0.09090909) *
##         21) AJCC.Stage=Stage II,Stage IIB,Stage IIIB,Stage IIIC 9  2 pr_present (0.22222222 0.77777778) *
##       11) OS.Time>=1061.5 12  2 pr_present (0.16666667 0.83333333) *
##    3) Converted.Stage=Stage I,Stage IIB,Stage IIIC 22  5 pr_present (0.22727273 0.77272727) *

According to the decision tree above, the converted stage of the cancer is the leading predictor of whether or not progesterone receptors are present.

CP Chart

This complexity parameter elbow chart shows that a CP of 0.051 will build the optimally sized tree.

Decision Tree with CP = 0.051

As with the first decision tree, the converted stage of the cancer is the primary predictor of whether or not progesterone receptors are present. This is expected because pruning at this complexity parameter removes the splits that contribute the least to the fit and offer little additional, relevant information, while the first split at the root is retained.

Hit Rate and Detection Rate

## [1] "Hit Rate/True Error Rate: 60%"
## [1] "Detection Rate: 20%"
## [1] "However, the maximum detection rate for this model is 50%"

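This model is worse than random chance at predicting the positive class. The first line of output is labeled “Hit Rate/True Error Rate,” but the 60 percent it reports is the error rate; the hit rate (accuracy) is 40 percent. Of the twenty observations in the testing set, twelve were incorrectly classified. Because the test set is evenly split between the two classes, a random guess would be correct about 50 percent of the time, so an error rate of 60 percent is unacceptable. The detection rate of 20 percent means that only 4 of the 20 test observations were true positives, against a maximum possible 50 percent.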

Confusion Matrix

## Confusion Matrix and Statistics
## 
##             Actual
## Prediction   pr_absent pr_present
##   pr_absent          4          6
##   pr_present         6          4
##                                           
##                Accuracy : 0.4             
##                  95% CI : (0.1912, 0.6395)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.8684          
##                                           
##                   Kappa : -0.2            
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.4             
##             Specificity : 0.4             
##          Pos Pred Value : 0.4             
##          Neg Pred Value : 0.4             
##              Prevalence : 0.5             
##          Detection Rate : 0.2             
##    Detection Prevalence : 0.5             
##       Balanced Accuracy : 0.4             
##                                           
##        'Positive' Class : pr_present      
## 

From the metrics displayed alongside the confusion matrix above, it is clear that this model is not effective at predicting whether or not a patient will have progesterone receptors. The sensitivity is 40 percent, meaning the model correctly identifies only 4 of the 10 patients who have progesterone receptors and mislabels the other 6 as not having them. Similarly, the specificity of 40 percent means that 6 of the 10 patients without progesterone receptors are misidentified as having them. A kappa value of -0.2 indicates that agreement between the predictions and the actual values is worse than would be expected by chance, so the model is not actionable. It does not outperform a classifier that simply guesses at random whether or not a patient will have progesterone receptors.
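The headline statistics can be re-derived by hand from the four cells of the confusion matrix. The sketch below is a plain-Python recomputation, not the code that produced the report; the cell names `tp`, `fp`, `fn`, and `tn` are introduced here for illustration, with pr_present as the positive class.

```python
# Test-set confusion matrix copied from the output above.
# Rows are predictions, columns are actual classes.
tp, fp = 4, 6  # predicted pr_present: actually pr_present, actually pr_absent
fn, tn = 6, 4  # predicted pr_absent:  actually pr_present, actually pr_absent

n = tp + fp + fn + tn

accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)  # share of actual pr_present that was found
specificity = tn / (tn + fp)  # share of actual pr_absent that was found

# Cohen's kappa: observed agreement relative to the agreement expected
# by chance, given the row and column totals.
expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
kappa = (accuracy - expected) / (1 - expected)

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} "
      f"specificity={specificity:.2f} kappa={kappa:.2f}")
```

Because the expected chance agreement is 0.5 and the observed agreement is only 0.4, kappa comes out negative.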

ROC and AUC

## Setting levels: control = pr_absent, case = pr_present
## Setting direction: controls > cases

## Area under the curve: 0.6

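The area under the receiver operating characteristic (ROC) curve is 0.6, which indicates that the model has little capacity to separate the two classes; an AUC of 0.5 corresponds to random ranking. In other words, as the other metrics indicated, the model has little ability to correctly identify whether or not a patient has progesterone receptors. The message “Setting direction: controls > cases” also suggests that the direction of comparison was chosen automatically. If the predictor supplied was the model’s score for pr_present, the AUC in the intended direction would be 0.4 rather than 0.6, which would agree with the balanced accuracy of 0.4 reported above.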
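One way to see how a reported AUC of 0.6 can coexist with a balanced accuracy of 0.4 is to compute the AUC directly in both directions. The sketch below assumes, purely for illustration, that the predictor passed to the ROC routine was the hard predicted class (1 for pr_present, 0 for pr_absent); the report does not say which predictor was used, and the helper `auc` is a hypothetical name, not part of any library.

```python
def auc(case_scores, control_scores):
    """Probability that a random case outscores a random control.

    Ties count as half a win, as in the Mann-Whitney formulation.
    """
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Assumption: scores are hard class predictions taken from the confusion matrix.
case_scores = [1] * 4 + [0] * 6     # actual pr_present: 4 predicted present
control_scores = [1] * 6 + [0] * 4  # actual pr_absent:  6 predicted present

auc_intended = auc(case_scores, control_scores)  # cases expected to score higher
auc_flipped = auc(control_scores, case_scores)   # direction "controls > cases"

print(f"intended={auc_intended:.2f} flipped={auc_flipped:.2f}")
```

Under that assumption the intended direction gives 0.4 and the reversed direction gives 0.6, so the reported value would reflect the flipped comparison rather than genuine discriminatory ability.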

Tumor

From the table of possible values for the type of tumor present, it is apparent that the values are relatively unbalanced. There are many more type 2 tumors than 1, 3, or 4. There are also very few type 4 tumors relative to type 1 or type 2. The class imbalance in the target variable may create issues when building the decision tree.

## 
## T1 T2 T3 T4 
## 15 65 19  6

Base Rate

Tumor  Base rate (%)
T1     14.29
T2     61.90
T3     18.10
T4      5.71

The tumor categories have different base rates. The relevant benchmark is the largest of them: a model that always predicts T2 would be correct about 62 percent of the time, so the decision tree must exceed that accuracy to be useful.
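The percentages in the base rate table follow directly from the tumor counts. The following is a small Python sketch that recomputes them; the names are illustrative and the counts are copied from the frequency table above.

```python
# Tumor counts copied from the frequency table above.
counts = {"T1": 15, "T2": 65, "T3": 19, "T4": 6}

total = sum(counts.values())

# Base rate of each category as a percentage, rounded to two decimals.
base_rates = {label: round(100 * n / total, 2) for label, n in counts.items()}

# The majority class sets the accuracy a useful model has to beat.
majority_label = max(counts, key=counts.get)

for label, rate in base_rates.items():
    print(f"{label} {rate:.2f}")
print(f"baseline to beat: {majority_label} at {base_rates[majority_label]:.2f}%")
```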

Decision Tree

## n= 85 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 85 33 T2 (0.14117647 0.61176471 0.18823529 0.05882353)  
##    2) AJCC.Stage=Stage IB,Stage II,Stage IIA,Stage IIB,Stage III,Stage IIIA,Stage IIIC,Stage IV 73 21 T2 (0.06849315 0.71232877 0.21917808 0.00000000)  
##      4) AJCC.Stage=Stage IB,Stage II,Stage IIA,Stage III 36  4 T2 (0.11111111 0.88888889 0.00000000 0.00000000) *
##      5) AJCC.Stage=Stage IIB,Stage IIIA,Stage IIIC,Stage IV 37 17 T2 (0.02702703 0.54054054 0.43243243 0.00000000)  
##       10) Node.Coded=Positive 30 10 T2 (0.03333333 0.66666667 0.30000000 0.00000000)  
##         20) AJCC.Stage=Stage IIB 12  0 T2 (0.00000000 1.00000000 0.00000000 0.00000000) *
##         21) AJCC.Stage=Stage IIIA,Stage IIIC,Stage IV 18  9 T3 (0.05555556 0.44444444 0.50000000 0.00000000) *
##       11) Node.Coded=Negative 7  0 T3 (0.00000000 0.00000000 1.00000000 0.00000000) *
##    3) AJCC.Stage=Stage I,Stage IA,Stage IIIB 12  5 T1 (0.58333333 0.00000000 0.00000000 0.41666667) *

According to the decision tree, the American Joint Committee on Cancer (AJCC) stage assigned to a patient’s cancer is the most influential variable in predicting the tumor’s stage. Notably, it is also used in the second split and again deeper in the tree, each time with different split criteria.

CP Chart

Based on the complexity parameter chart, there is no incentive to change the complexity parameter to a different value. The left-most point below the dotted line is the very first point.

## [1] "Hit Rate/True Error Rate: 15%"

Detection Rate for Each Tumor Stage

                T1    T2    T3    T4
Detection Rate  0.15  0.55  0.15  0.00
Maximum Rate    0.15  0.65  0.15  0.05

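The label on the first output line is misleading: 15 percent is the error rate, so the hit rate is 85 percent. The detection rates for T1, T2, and T3 are at or close to their maximum possible values, which suggests the model predicts those stages adequately. One glaring concern, however, is the detection rate of 0 percent for stage four tumors. The decision tree has no terminal node that predicts T4: the T4 tumors in the training set fall into node 3, where they are outnumbered by T1 tumors and are therefore labeled T1. The single T4 tumor in the test set was misclassified as T1 in the same way. This may be because the sample of stage four tumors is too small or because the values of the available variables are indistinguishable between stage four and another stage.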
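The detection rate and its maximum can both be read off the test-set confusion matrix: the detection rate of a class is its diagonal cell divided by the number of test observations, and the maximum is the class prevalence. The sketch below recomputes both in plain Python with illustrative names; the matrix values are copied from the report's confusion matrix output.

```python
labels = ["T1", "T2", "T3", "T4"]

# Test-set confusion matrix: rows are predictions, columns are actual classes.
matrix = [
    [3, 0, 0, 1],   # predicted T1
    [0, 11, 0, 0],  # predicted T2
    [0, 2, 3, 0],   # predicted T3
    [0, 0, 0, 0],   # predicted T4
]

n = sum(sum(row) for row in matrix)

# Detection rate: correctly predicted members of a class over all observations.
detection_rate = {labels[i]: matrix[i][i] / n for i in range(len(labels))}

# Maximum rate: share of observations that actually belong to the class.
max_rate = {
    labels[j]: sum(row[j] for row in matrix) / n for j in range(len(labels))
}

for label in labels:
    print(f"{label} detection={detection_rate[label]:.2f} max={max_rate[label]:.2f}")
```

The T4 row of the matrix is entirely zero, which is why its detection rate cannot rise above 0 regardless of how many T4 tumors the test set contains.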

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction T1 T2 T3 T4
##         T1  3  0  0  1
##         T2  0 11  0  0
##         T3  0  2  3  0
##         T4  0  0  0  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.85            
##                  95% CI : (0.6211, 0.9679)
##     No Information Rate : 0.65            
##     P-Value [Acc > NIR] : 0.04438         
##                                           
##                   Kappa : 0.7391          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: T1 Class: T2 Class: T3 Class: T4
## Sensitivity             1.0000    0.8462    1.0000      0.00
## Specificity             0.9412    1.0000    0.8824      1.00
## Pos Pred Value          0.7500    1.0000    0.6000       NaN
## Neg Pred Value          1.0000    0.7778    1.0000      0.95
## Prevalence              0.1500    0.6500    0.1500      0.05
## Detection Rate          0.1500    0.5500    0.1500      0.00
## Detection Prevalence    0.2000    0.5500    0.2500      0.00
## Balanced Accuracy       0.9706    0.9231    0.9412      0.50

In general, the model has high values for sensitivity and specificity; the exception is T4, whose sensitivity is 0. For T1, T2, and T3, this indicates the model identifies most true positives and produces few false positives. Additionally, the model has a relatively high kappa value of 0.7391, which indicates substantial agreement between the predicted and actual classes beyond what would be expected by chance.
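The kappa value and the per-class sensitivity and specificity can be reproduced from the confusion matrix with a few sums. This is a hedged re-derivation in plain Python rather than the routine that generated the report; all names are illustrative.

```python
labels = ["T1", "T2", "T3", "T4"]

# Test-set confusion matrix: rows are predictions, columns are actual classes.
matrix = [
    [3, 0, 0, 1],
    [0, 11, 0, 0],
    [0, 2, 3, 0],
    [0, 0, 0, 0],
]

k = len(labels)
n = sum(sum(row) for row in matrix)
row_sums = [sum(matrix[i]) for i in range(k)]                       # predicted totals
col_sums = [sum(matrix[i][j] for i in range(k)) for j in range(k)]  # actual totals

# Cohen's kappa: observed agreement versus agreement expected by chance.
observed = sum(matrix[i][i] for i in range(k)) / n
expected = sum(row_sums[i] * col_sums[i] for i in range(k)) / n**2
kappa = (observed - expected) / (1 - expected)

sensitivity, specificity = {}, {}
for i, label in enumerate(labels):
    tp = matrix[i][i]
    fp = row_sums[i] - tp
    fn = col_sums[i] - tp
    tn = n - tp - fp - fn
    sensitivity[label] = tp / (tp + fn)
    specificity[label] = tn / (tn + fp)

print(f"accuracy={observed:.2f} kappa={kappa:.4f}")
for label in labels:
    print(f"{label} sensitivity={sensitivity[label]:.4f} "
          f"specificity={specificity[label]:.4f}")
```

Kappa is noticeably lower than the raw accuracy because a classifier that leaned on the dominant T2 class would already agree with the actual labels fairly often by chance.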

Model AUROC

The area under the receiver operating characteristic curve is 0.9038. This means the model has very good discriminatory ability. An AUC is not the share of correct predictions (the accuracy here is 85 percent); for a pair of classes, it is the probability that the model ranks a randomly chosen observation from the positive class above a randomly chosen observation from the other class, with 0.5 corresponding to random ranking. Unfortunately, given the nature of multiclass models, the ROC plot does not offer any substantial additional information, as it is a plot of one class versus the rest.

Summary

The two decision trees, one modeling progesterone receptor status and the other predicting the stage of breast cancer, differ widely in quality. The progesterone receptor tree is very poor at predicting whether or not progesterone receptors are present in the tumor. In this instance, the model’s inadequacy may be due to overfitting. The tree itself shows that approximately 17 percent of observations predicted to have progesterone receptors do not, and 25 percent of tumors predicted to not have progesterone receptors do have them. This is in stark contrast to the test set, where 60 percent of observations, across both classes, were incorrectly predicted. This model should not be implemented as a method of determining whether or not progesterone receptors are present in a tumor.
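The gap between training and test performance that points to overfitting can be made concrete from the terminal nodes of the first progesterone receptor tree. The sketch below uses the unpruned tree as printed; the pruned tree (CP = 0.051) that was presumably evaluated on the test set is not shown in the report, so its training error may differ from the figure computed here. The tuple layout and names are illustrative.

```python
# Terminal nodes of the unpruned progesterone receptor tree:
# (node id, observations in node, misclassified observations in node).
leaves = [
    (8, 9, 0),
    (18, 12, 3),
    (19, 10, 4),
    (20, 11, 1),
    (21, 9, 2),
    (11, 12, 2),
    (3, 22, 5),
]

n_train = sum(n for _, n, _ in leaves)
train_errors = sum(loss for _, _, loss in leaves)
train_error_rate = train_errors / n_train

# Test-set errors from the confusion matrix: 6 + 6 of 20 observations.
test_error_rate = (6 + 6) / 20

print(f"training error={train_error_rate:.2f} test error={test_error_rate:.2f}")
```

A training error of about 20 percent against a test error of 60 percent is the pattern expected when a tree has fit noise in a small training sample.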

The second decision tree, which predicts the current stage of a cancerous tumor, is considerably more accurate than the first. The main concern is the model’s inability to classify stage four tumors. There were likely not enough stage four tumors in the dataset for the decision tree to learn which values signify stage four. In addition, the model relies heavily on AJCC stage to classify the tumors. The American Joint Committee on Cancer (AJCC) is an organization known for defining cancer staging standards, and its stage groups are themselves assigned from the tumor (T), node (N), and metastasis (M) categories. This means that the AJCC classification might be sufficient on its own to predict a tumor’s stage, in essence making this decision tree redundant. Furthermore, this model’s accuracy would be acceptable for more trivial predictions. In the medical field, however, one would hope for more certainty in the decision tree’s predictions, especially considering that an entire stage, the final stage, is ignored by the model. This model would likely not be adequate to implement in practice. In order to increase its accuracy, more data should be collected, for stage four tumors in particular.