Congrats! You just graduated from medical school and earned a PhD in Data Science at the same time, wow, impressive. Because of these incredible accomplishments, the world now believes you will be able to cure cancer…no pressure. To start, you figured you had better create some way to detect cancer when it is present. Luckily, because you are now an MD and DS PhD, or MDSDPhD, you have access to data sets and know your way around an ML classifier. So, on the way to fulfilling your destiny to rid the world of cancer, you start by building several classifiers that can aid in determining whether patients have cancer and what type of tumor they have.
The included dataset (clinical_data_breast_cancer_modified.csv) has information on 105 patients across 17 variables. Your goal is to build two classifiers: one for PR.Status (progesterone receptor), a biomarker that routinely leads to a cancer diagnosis, indicating whether the outcome was positive or negative, and one for Tumor, a multi-class variable. You would like to be able to explain the model to the mere mortals around you but need a fairly robust and flexible approach, so you've chosen to use decision trees to get started. In building both models, use CART and C5.0 and compare the differences.
In doing so, similar to great data scientists of the past, you remembered the excellent education provided to you at UVA in an undergraduate data science course and have outlined the steps that will need to be undertaken to complete this task (you can add more or combine them if needed).
As always, you will need to make sure to #comment your work heavily and render the results in a clear (knitted) report, as the non-MDSDPhDs of the world will someday need to understand the wonder and spectacle that will be your R code. Good luck, and the world thanks you.
Footnotes:
- Some of the steps will not need to be repeated for the second model; use your judgment.
- You can add or combine steps if needed.
- Remember to try several methods during evaluation and always be mindful of how the model will be used in practice.
- Do not include ER.Status in your first tree; it's basically the same as PR.Status.
Libraries
First, we load in the data and drop the ER.Status column as specified in the instructions. We also make sure every variable has the correct type and recode the target variable, PR.Status, to a numeric 0 (negative) / 1 (positive) scheme.
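A minimal sketch of that prep step, assuming the tidyverse for reading and cleaning (the object name and the exact recoding are our assumptions, not necessarily what the original code chunk did):

```r
# Load the data, drop ER.Status (per the footnote), and recode the targets.
library(tidyverse)

clinical <- read_csv("clinical_data_breast_cancer_modified.csv") %>%
  select(-ER.Status) %>%                                  # basically duplicates PR.Status
  mutate(
    PR.Status = if_else(PR.Status == "Positive", 1, 0),   # numeric 0/1 target
    Tumor     = as.factor(Tumor)                          # multi-class target for model 2
  )
```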
Next, we split the data into a training set and a testing set.
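An 80/20 split is consistent with the 84/21 row counts shown below; the seed here is an arbitrary stand-in, not necessarily the one the report used:

```r
# Reproducible 80/20 train/test split (84 and 21 patients).
set.seed(1982)                                            # arbitrary seed (assumption)
idx   <- sample(nrow(clinical), size = floor(0.8 * nrow(clinical)))
train <- clinical[idx, ]
test  <- clinical[-idx, ]
```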
Training Data
## # A tibble: 6 x 16
## Gender Age.at.Initial.… PR.Status HER2.Final.Stat… Tumor Node.Coded Metastasis
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 FEMALE 66 0 Negative T3 Positive M1
## 2 FEMALE 40 0 Negative T2 Negative M0
## 3 FEMALE 56 0 Negative T2 Positive M0
## 4 FEMALE 38 0 Negative T3 Positive M0
## 5 FEMALE 74 0 Negative T3 Negative M0
## 6 FEMALE 60 0 Negative T2 Negative M0
## # … with 9 more variables: Metastasis.Coded <chr>, AJCC.Stage <chr>,
## # Converted.Stage <chr>, Survival.Data.Form <chr>, Vital.Status <chr>,
## # Days.to.Date.of.Last.Contact <dbl>, Days.to.date.of.Death <dbl>,
## # OS.event <int>, OS.Time <dbl>
Testing Data
## # A tibble: 6 x 16
## Gender Age.at.Initial.… PR.Status HER2.Final.Stat… Tumor Node.Coded Metastasis
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 FEMALE 48 0 Negative T2 Positive M0
## 2 FEMALE 57 0 Negative T2 Negative M0
## 3 FEMALE 36 0 Negative T2 Negative M0
## 4 FEMALE 41 0 Negative T4 Negative M0
## 5 FEMALE 78 0 Positive T1 Positive M0
## 6 FEMALE 88 1 Negative T2 Positive M0
## # … with 9 more variables: Metastasis.Coded <chr>, AJCC.Stage <chr>,
## # Converted.Stage <chr>, Survival.Data.Form <chr>, Vital.Status <chr>,
## # Days.to.Date.of.Last.Contact <dbl>, Days.to.date.of.Death <dbl>,
## # OS.event <int>, OS.Time <dbl>
## [1] 0.4857143
The base rate is the probability of each class unconditioned on any featural evidence, frequently also known as the prior probability. For the multi-class Tumor target later on, we will use the individual percentage for each class.
We first group the data by the predictor variables (e.g. gender, metastasis, etc.), then by the values of those predictors, and attach the probability that PR Status is positive vs. negative within each group.
The overall base rate was found by taking the average of the values in the PR.Status column. Thus we found that 51.4% of patients had a positive PR status and 48.6% (the value printed above) had a negative one.
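A one-line sketch of that computation, assuming the 0/1 coding from the prep step:

```r
# Mean of a 0/1 column = share of positives; its complement = share of
# negatives (the 0.4857 printed above).
mean(clinical$PR.Status)       # ~0.514, positive PR status
1 - mean(clinical$PR.Status)   # ~0.486, negative PR status
```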
We build our model with the default settings using the rpart function. PR.Status is the response variable in the formula. We fit on our previously split training dataset and keep cp at its default of 0.01.
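A sketch of that call (rpart wants a factor response for a classification tree, so we coerce PR.Status on the way in; `pr_tree` is our name for the fitted object):

```r
library(rpart)

# CART model for PR.Status with default settings (cp = 0.01).
pr_tree <- rpart(PR.Status ~ .,
                 data    = mutate(train, PR.Status = factor(PR.Status)),
                 method  = "class",                     # classification tree
                 control = rpart.control(cp = 0.01))
pr_tree   # prints the node-by-node summary shown below
```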
## n= 84
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 84 39 0 (0.53571429 0.46428571)
## 2) AJCC.Stage=Stage IB,Stage IIA,Stage III,Stage IIIC,Stage IV 35 10 0 (0.71428571 0.28571429)
## 4) Converted.Stage=No_Conversion,Stage IIIA 13 1 0 (0.92307692 0.07692308) *
## 5) Converted.Stage=Stage IIA,Stage IIIC 22 9 0 (0.59090909 0.40909091)
## 10) Days.to.Date.of.Last.Contact< 869.5 15 5 0 (0.66666667 0.33333333) *
## 11) Days.to.Date.of.Last.Contact>=869.5 7 3 1 (0.42857143 0.57142857) *
## 3) AJCC.Stage=Stage I,Stage IA,Stage II,Stage IIB,Stage IIIA,Stage IIIB 49 20 1 (0.40816327 0.59183673)
## 6) Days.to.Date.of.Last.Contact< 12 7 2 0 (0.71428571 0.28571429) *
## 7) Days.to.Date.of.Last.Contact>=12 42 15 1 (0.35714286 0.64285714)
## 14) Survival.Data.Form=followup 19 9 0 (0.52631579 0.47368421) *
## 15) Survival.Data.Form=enrollment 23 5 1 (0.21739130 0.78260870) *
Now we can see the probabilities of being positive or negative based on certain “decisions” or factors.
The most important variable is shown by the first split in our decision tree, the AJCC Stage (the first split sends Stages IB, IIA, III, IIIC, and IV down one branch and the remaining stages down the other). In descending order, the next three most important factors are the Converted Stage, the Days to Date of Last Contact, and the OS Time.
This method of visualization isn't the easiest to read, though, so let's move on to a more visually pleasing tree!
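One common way to draw the cleaner tree is rpart.plot — an assumption about which plotting package the report used:

```r
library(rpart.plot)

# Draw the fitted tree with class probabilities and counts in each node.
rpart.plot(pr_tree, type = 4, extra = 101, roundint = FALSE)
```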
Much better! Here we can see the order of the splits and the prediction at each terminal node. For example, a patient whose converted stage sends them down the rightmost branch lands immediately at a terminal node.
## # A tibble: 5 x 5
## CP nsplit `rel error` xerror xstd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.231 0 1 1 0.117
## 2 0.0769 1 0.769 1.18 0.117
## 3 0.0256 2 0.692 1.10 0.117
## 4 0.0128 3 0.667 1.08 0.118
## 5 0.01 5 0.641 1.05 0.117
Next, we produce an "elbow chart" of the cross-validated relative error at each cp value. The dashed line is drawn at the minimum cross-validated error plus one standard error of that minimum (the one-standard-error rule). A reasonable cp for pruning is the leftmost value at which the curve falls below that line.
We prune to 3 splits, as shown in cptable_ex above: this is where the curve dips back below the reference line after the jump at one split, and additional splits buy only a marginal change in cross-validated error. That row of the cptable pairs a low relative error (0.667) with a moderate xerror (1.08) and a reasonable xstd (0.118).
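plotcp draws exactly this chart, and prune snips the tree back at the chosen cp; the row index below points at the 3-split row of the printed cptable (object names are ours):

```r
# Cross-validated error vs. cp, with the one-standard-error reference line.
plotcp(pr_tree)

# Prune at the cp of the 3-split row (row 4 of the cptable above).
pr_tree_pruned <- prune(pr_tree, cp = pr_tree$cptable[4, "CP"])
```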
## AJCC.Stage Converted.Stage
## 5.8181441 4.0991152
## Days.to.Date.of.Last.Contact OS.Time
## 3.5475117 3.3240467
## Survival.Data.Form Tumor
## 1.9859431 1.3808629
## Age.at.Initial.Pathologic.Diagnosis Node.Coded
## 1.2467354 0.9709521
## HER2.Final.Status
## 0.5466472
We can also visualize the importance of each variable, as shown above. This reiterates what we found before, but now displays the relative importance of every variable. Most important is the AJCC Stage with a top score of 5.82, followed by the Converted Stage at 4.10. At the bottom, the least important are Node.Coded at 0.97 and HER2 Final Status at 0.55.
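The scores above live in the fitted object's variable.importance slot; a simple bar chart is one way to visualize them:

```r
# Bar chart of rpart's variable importance scores, largest first.
barplot(sort(pr_tree$variable.importance, decreasing = TRUE),
        las = 2, cex.names = 0.6,
        main = "Variable importance: PR.Status tree")
```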
Next, we use the predict function to generate test-set predictions of the target variable.
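A sketch of the prediction and evaluation steps; caret::confusionMatrix is an assumption about the package behind the output further below (the Prediction/Actual labels and positive class "1" match that output):

```r
library(caret)

# Class predictions on the held-out test set.
tree_predict <- predict(pr_tree, newdata = test, type = "class")

# Confusion matrix with PR-positive ("1") as the positive class.
confusionMatrix(data      = tree_predict,
                reference = factor(test$PR.Status),
                positive  = "1",
                dnn       = c("Prediction", "Actual"))
```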
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
## 1 0 1 1 0 0 0 0 0 0 0 1 1 0 0 1 0 1 1 1 0
## Levels: 0 1
Our test set of 21 patients yields the predicted PR Status values shown above.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 2 10
## 1 4 5
##
## Accuracy : 0.3333
## 95% CI : (0.1459, 0.5697)
## No Information Rate : 0.7143
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : -0.2564
##
## Mcnemar's Test P-Value : 0.1814
##
## Sensitivity : 0.3333
## Specificity : 0.3333
## Pos Pred Value : 0.5556
## Neg Pred Value : 0.1667
## Prevalence : 0.7143
## Detection Rate : 0.2381
## Detection Prevalence : 0.4286
## Balanced Accuracy : 0.3333
##
## 'Positive' Class : 1
##
The detection rate is 23.81% with a prevalence of 71.43%. Our model currently has an accuracy of only 33.33%.
The hit rate can be derived from the confusion matrix by taking the number of true positives (5) over the total actual positive cases (5 hits + 10 misses = 15), for a hit rate of 33.3%.
## [1] 0.6666667
The value printed above is the true error rate: the 14 misclassified cases (10 + 4 off the diagonal) out of 21 give an error rate of 66.7%, the complement of the 33.3% accuracy.
## [1] 0.2380952
The detection rate, true positives over all test cases (5/21), is 23.81%, as printed above.
The true positive rate could be improved upon: the ideal confusion matrix would have a 0 in the predicted-negative/actual-positive cell, meaning a hit rate of 100%. We would want to focus on a higher hit rate in this specific case because it is better for a patient to be examined more carefully after a false positive than to have cancer stay hidden behind a false negative. If keeping false positives low were the priority instead, we would be aiming for a 0 in the predicted-positive/actual-negative cell, i.e. a high specificity (true negative rate).
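The curve summarized below comes from pROC; the call mirrors the one echoed in the printed output:

```r
library(pROC)

# ROC curve for the binary PR.Status predictions.
roc(response  = test$PR.Status,
    predictor = as.numeric(tree_predict),
    plot      = TRUE)
```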
##
## Call:
## roc.default(response = test$PR.Status, predictor = as.numeric(tree_predict), plot = TRUE)
##
## Data: as.numeric(tree_predict) in 6 controls (test$PR.Status 0) > 15 cases (test$PR.Status 1).
## Area under the curve: 0.6667
The ROC curve and its AUC measure classification performance across threshold settings. The ROC is a probability curve, and the AUC represents the degree of separability: how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
The area under the curve of this graph is 0.6667. As specificity decreases from 1.0 toward 0.5, sensitivity rises quickly; the incremental gains then drop off as specificity falls from 0.5 to 0.
As shown in the graph, the curve deviates only slightly from the diagonal baseline; a larger deviation would indicate better predictive power.
Next, we split the data into a training set and a testing set for the Tumor model.
The overall base rates for each tumor type were found by taking the proportion of each class in the Tumor column.
Tumor T1 accounts for 14.3% of patients, T2 for 61.9%, T3 for 18.1%, and T4 for 5.7%.
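These shares can be read off with a one-liner; `clinical` is our assumed data object from the prep sketch:

```r
# Class shares for the multi-class Tumor target (matches the percentages
# quoted above).
prop.table(table(clinical$Tumor))
```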
We build our model with the default settings using the rpart function, now with Tumor as the response variable in the formula. We again use the previously split training dataset and keep the default cp of 0.01.
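The multi-class fit is the same call with Tumor as the response (a sketch; whether any columns were dropped first is not shown in the report):

```r
# CART model for the multi-class Tumor target, default cp = 0.01.
tumor_tree <- rpart(Tumor ~ .,
                    data    = train,
                    method  = "class",
                    control = rpart.control(cp = 0.01))
tumor_tree   # prints the node-by-node summary shown below
```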
## n= 84
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 84 31 T2 (0.15476190 0.63095238 0.17857143 0.03571429)
## 2) AJCC.Stage=Stage I,Stage IA,Stage IIIB 12 3 T1 (0.75000000 0.00000000 0.00000000 0.25000000) *
## 3) AJCC.Stage=Stage IB,Stage II,Stage IIA,Stage IIB,Stage III,Stage IIIA,Stage IIIC,Stage IV 72 19 T2 (0.05555556 0.73611111 0.20833333 0.00000000)
## 6) AJCC.Stage=Stage IB,Stage II,Stage IIA,Stage III 40 4 T2 (0.10000000 0.90000000 0.00000000 0.00000000) *
## 7) AJCC.Stage=Stage IIB,Stage IIIA,Stage IIIC,Stage IV 32 15 T2 (0.00000000 0.53125000 0.46875000 0.00000000)
## 14) Days.to.Date.of.Last.Contact>=966 11 1 T2 (0.00000000 0.90909091 0.09090909 0.00000000) *
## 15) Days.to.Date.of.Last.Contact< 966 21 7 T3 (0.00000000 0.33333333 0.66666667 0.00000000) *
Now we can see the probability of each tumor class based on certain "decisions" or factors.
The most important variable is shown by the first split in our decision tree, the AJCC Stage (the first split separates Stages I, IA, and IIIB from the rest). In descending order, the next most important factors are the Converted Stage, the Days to Date of Last Contact, and the OS Time.
This method of visualization isn't the easiest to read, though, so let's move on to a more visually pleasing tree!
Much better! Here we can see the order of the splits and the prediction at each terminal node. For example, if we knew the patient were in AJCC Stage I, we would say they likely have a T1 tumor. Interestingly, T4 never appears as a prediction because it is never the plurality class in any node; it comes closest in the T1 leaf, where T1 outnumbers it 9 patients to 3.
## # A tibble: 3 x 5
## CP nsplit `rel error` xerror xstd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.290 0 1 1 0.143
## 2 0.113 1 0.710 0.710 0.130
## 3 0.01 3 0.484 0.645 0.126
Next, we again produce an "elbow chart" of the cross-validated relative error at each cp value. As before, the dashed line is drawn at the minimum cross-validated error plus one standard error of that minimum (the one-standard-error rule), and a reasonable cp for pruning is the leftmost value at which the curve falls below that line.
By the one-standard-error rule, the optimal tree has a single split (tree size 2), as shown in cptable_ex: its cross-validated error (0.710) already falls below the reference line (0.645 + 0.126 = 0.771). That row of the cptable pairs a low relative error with a low xerror and a good xstd.
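Pruning to that size-2 tree then looks like this, with the row index pointing at the 1-split row of the cptable above (object names are ours):

```r
# Elbow chart and prune for the Tumor tree; row 2 is the 1-split row.
plotcp(tumor_tree)
tumor_tree_pruned <- prune(tumor_tree, cp = tumor_tree$cptable[2, "CP"])
```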
## AJCC.Stage Converted.Stage
## 18.1244048 12.2286302
## Days.to.Date.of.Last.Contact OS.Time
## 5.3954901 4.7859848
## Node.Coded Age.at.Initial.Pathologic.Diagnosis
## 2.2348524 1.0158420
## Survival.Data.Form OS.event
## 1.0158420 0.8701791
## Vital.Status Gender
## 0.8701791 0.4350895
We can also visualize the importance of each variable, as shown above. This reiterates what we found before, but displays the relative importance of all the variables. Most important is the AJCC Stage with a top score of 18.12, followed by the Converted Stage at 12.23. At the bottom, the least important are Gender at 0.44 and Vital Status at 0.87.
Next, we use the predict function to generate test-set predictions of the target variable.
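Prediction and evaluation mirror the first model, though there is no single positive class this time:

```r
# Multi-class predictions and confusion matrix for the Tumor model.
tumor_predict <- predict(tumor_tree, newdata = test, type = "class")
confusionMatrix(data      = tumor_predict,
                reference = test$Tumor,
                dnn       = c("Prediction", "Actual"))
```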
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
## T2 T2 T2 T1 T3 T2 T3 T1 T2 T2 T2 T2 T2 T3 T2 T2 T2 T1 T1 T2 T2
## Levels: T1 T2 T3 T4
Our test set of 21 patients yields the predicted tumor types shown above.
## Confusion Matrix and Statistics
##
## Actual
## Prediction T1 T2 T3 T4
## T1 1 0 0 3
## T2 0 11 3 0
## T3 1 1 1 0
## T4 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.619
## 95% CI : (0.3844, 0.8189)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 0.4171
##
## Kappa : 0.336
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: T1 Class: T2 Class: T3 Class: T4
## Sensitivity 0.50000 0.9167 0.25000 0.0000
## Specificity 0.84211 0.6667 0.88235 1.0000
## Pos Pred Value 0.25000 0.7857 0.33333 NaN
## Neg Pred Value 0.94118 0.8571 0.83333 0.8571
## Prevalence 0.09524 0.5714 0.19048 0.1429
## Detection Rate 0.04762 0.5238 0.04762 0.0000
## Detection Prevalence 0.19048 0.6667 0.14286 0.0000
## Balanced Accuracy 0.67105 0.7917 0.56618 0.5000
The detection rates range from 0% for T4 to 52.4% for T2.
The hit rate can be derived from the confusion matrix by taking the number of true positives for each type of tumor over total actual positive cases.
Tumor 1: Correct Hits = 1, Actual = 2 -> Hit Rate = 50%
Tumor 2: Correct Hits = 11, Actual = 12 -> Hit Rate = 91.7%
Tumor 3: Correct Hits = 1, Actual = 4 -> Hit Rate = 25%
Tumor 4: Correct Hits = 0, Actual = 3 -> Hit Rate = 0%
We have an excellent hit rate for tumor 2, a moderate rate for tumor 1, a poor rate for tumor 3, and no hits at all for tumor 4. More research money should be invested in improving detection for tumor 4.
This rate can be found by summing the off-diagonal cells of the confusion matrix, which are the model's errors, and dividing by the total number of instances.
This gives us a true error rate of 8/21, which is 38.1% (the complement of the 61.9% accuracy).
The overall accuracy can be found by summing the values along the diagonal of the confusion matrix and dividing by the total number of instances, which gives us 13/21, or 61.9%.
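As a sanity check, both quantities fall straight out of the confusion matrix; `tumor_predict` is the prediction vector from the sketch above:

```r
# Error rate and accuracy computed directly from the confusion matrix.
cm <- table(Prediction = tumor_predict, Actual = test$Tumor)
accuracy   <- sum(diag(cm)) / sum(cm)   # 13/21 ~ 0.619
error_rate <- 1 - accuracy              #  8/21 ~ 0.381
```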
Across the two models, we were able to take a patient's clinical information and produce an evaluation of their PR status (first model) and of the type of tumor they have (second model).
Within the first model, we learned that the most important factors for determining PR status are the patient's AJCC Stage, Converted Stage, Days to Date of Last Contact, and OS Time.
Similarly, within the second model the most important factors for determining tumor type are the AJCC Stage (with the first split separating Stages I, IA, and IIIB from the rest), the Converted Stage, the Days to Date of Last Contact, and the OS Time.
AJCC Stage ranks highly in both models, so it is something doctors may want to pay extra attention to during diagnosis.
We recommend that data scientists iterating on these models work to decrease the false negative rate. Given how the models would be used, it is more detrimental for a patient to receive a false negative and go undiagnosed than to receive a false positive and begin treatment for something they do not have.
Because of the less-than-ideal error rates and true positive rates, we would not recommend these models for professional use, but we do see value in gathering more data to continuously improve them.
The dataset was rather small, which led to several issues; within the tumor model, for instance, T4 was underrepresented and never appeared as a prediction in our decision tree. We would want more data on patients with T4 tumors so that the class is well enough represented to show up as a predicted outcome.