Objective
The included dataset (clinical_data_breast_cancer_modified.csv) contains information on 105 patients across 16 variables. Your goal is to build two classifiers: one for PR.Status (progesterone receptor), a biomarker that routinely leads to a cancer diagnosis, indicating whether the outcome was positive or negative, and one for the multi-class Tumor variable. You would like to be able to explain the model to non-experts but need a fairly robust and flexible approach, so you’ve chosen decision trees to get started and will possibly move to an ensemble model if needed.
Breast Cancer Diagnosis Classifier
To attempt to predict a cancer diagnosis, we can use a biomarker, the progesterone receptor, that routinely leads to a cancer diagnosis (being progesterone positive tends to lead to the diagnosis, so that is our positive case). Our plan of action is to use a decision tree model to assess which variables in the dataset are most important in driving a progesterone-positive or -negative result, and ultimately to be able to take in new data and classify whether a patient is PR (progesterone receptor) positive or negative.
Base Rate
To begin, the base rate in the data for being PR positive is 51.43%. Roughly half of the patients are PR positive (a balanced set), so guessing the majority class every time would be correct about 51% of the time; our model needs to beat that. Now we’ll build our initial attempt at a decision tree.
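For reference, here is a minimal sketch of that base-rate calculation; the object name clinical is an assumption about the original code.

# Load the dataset and compute the PR.Status base rate
# (object name and read options are assumptions, not the original script)
clinical <- read.csv("clinical_data_breast_cancer_modified.csv",
                     stringsAsFactors = TRUE)
round(prop.table(table(clinical$PR.Status)) * 100, 2)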
Building the Model
Below is the output of the model, as well as a CP (complexity parameter) plot to help us gauge the optimal number of splits for our tree. We can also view the variable importance to assess which variables the tree found “most important” in terms of where to split. Then, using the CP table and plot, we can determine the optimal CP level and number of splits.
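The fit itself might look like the following minimal sketch; the formula, seed, and object names are assumptions, while the printed output below comes from the original analysis.

library(rpart)
# Fit a classification tree for PR.Status on all remaining variables
set.seed(123)  # assumed seed; rpart's internal cross-validation is random
pr_tree <- rpart(PR.Status ~ ., data = clinical, method = "class")
printcp(pr_tree)             # CP table
plotcp(pr_tree)              # CP plot
pr_tree$variable.importance  # variable importance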
Model Output
## n= 105
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 105 51 PR_yes (0.4857143 0.5142857)
## 2) Days.to.Date.of.Last.Contact< 12 17 4 PR_no (0.7647059 0.2352941) *
## 3) Days.to.Date.of.Last.Contact>=12 88 38 PR_yes (0.4318182 0.5681818)
## 6) Converted.Stage=No_Conversion,Stage I,Stage IIA 69 34 PR_no (0.5072464 0.4927536)
## 12) Age.at.Initial.Pathologic.Diagnosis< 62.5 43 16 PR_no (0.6279070 0.3720930)
## 24) AJCC.Stage=Stage I,Stage IB,Stage II,Stage IIA,Stage III,Stage IIIB,Stage IIIC 26 6 PR_no (0.7692308 0.2307692) *
## 25) AJCC.Stage=Stage IA,Stage IIB,Stage IIIA 17 7 PR_yes (0.4117647 0.5882353) *
## 13) Age.at.Initial.Pathologic.Diagnosis>=62.5 26 8 PR_yes (0.3076923 0.6923077)
## 26) Survival.Data.Form=followup 15 7 PR_no (0.5333333 0.4666667) *
## 27) Survival.Data.Form=enrollment 11 0 PR_yes (0.0000000 1.0000000) *
## 7) Converted.Stage=Stage IIB,Stage IIIA,Stage IIIC 19 3 PR_yes (0.1578947 0.8421053) *
Variable Importance
## AJCC.Stage Converted.Stage
## 5.6213427 4.7271062
## Days.to.Date.of.Last.Contact Age.at.Initial.Pathologic.Diagnosis
## 4.5524206 4.3074227
## OS.Time Survival.Data.Form
## 4.1422930 3.6102564
## Node.Coded Tumor
## 0.7726353 0.5741736
## HER2.Final.Status Gender
## 0.5139509 0.3827824
## Metastasis Metastasis.Coded
## 0.2556006 0.2556006
Tree
CP Plot
To determine the number of splits, one common rule is to choose the smallest tree whose cross-validated error falls below the minimum cross-validated error plus one standard error; here, however, that condition holds at every split. Instead, we can choose the CP level and number of splits from the CP plot, taking the lowest level at which the relative error drops below the error threshold (around 1.09 in this case). This gives a CP of 0.034 and 5 splits for the tree. Next we can use the R predict function to output a confusion matrix for our model.
Predictions and Confusion Matrix
Before assessing the confusion matrix we can calculate the hit rate/true error rate and the detection rate/prevalence. The error rate for our first model is 25.71% (1 - accuracy), which is high for a diagnostic setting. The detection rate is 35.24%, which, compared to our prevalence of 51.43%, tells us the model correctly flags only about two-thirds of the PR-positive cases. Let’s dig into the confusion matrix to get a deeper perspective and see where the errors are occurring.
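A sketch of how this matrix might be produced with caret follows. Note that the printed output labels the classes 1 and 2 (2 = PR positive), which is an assumption about how the factors were recoded; with PR_no/PR_yes labels the call would use positive = "PR_yes".

library(caret)
# Predict classes with the initial tree and tabulate against the actuals
pr_pred <- predict(pr_tree, type = "class")
confusionMatrix(pr_pred, clinical$PR.Status, positive = "PR_yes")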
## Confusion Matrix and Statistics
##
## Actual
## Prediction 1 2
## 1 41 17
## 2 10 37
##
## Accuracy : 0.7429
## 95% CI : (0.6483, 0.8232)
## No Information Rate : 0.5143
## P-Value [Acc > NIR] : 1.374e-06
##
## Kappa : 0.4872
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 0.6852
## Specificity : 0.8039
## Pos Pred Value : 0.7872
## Neg Pred Value : 0.7069
## Prevalence : 0.5143
## Detection Rate : 0.3524
## Detection Prevalence : 0.4476
## Balanced Accuracy : 0.7446
##
## 'Positive' Class : 2
##
Confusion Matrix Outputs:
Base rate = 51.43%
Accuracy = 74.29%
Kappa = 0.4872
Sensitivity = 68.52%
Specificity = 80.39%
Balanced Accuracy = 74.46%
For a classifier of this nature (detecting cancer), we want an extremely low false negative rate, meaning a high sensitivity (true positive rate). The accuracy and sensitivity are fair, so the tree is okay at predicting positive outcomes, but the false negative rate is 31.48%, which is far too high for this use case. The model does have a high positive predictive value (precision) of 78.72%, meaning that when it flags a patient as positive it is usually right. We care less about the false positive rate (1 - specificity), because it’s better to err on the side of classifying the positive case than to miss cancer that is actually present (better to be safe than sorry). We can also look at the ROC curve to get more information about the quality of our initial model.
ROC Curve
Our ROC curve is somewhat balanced and gives an AUC (area under the curve) of 0.7446, which is fair. This rating is driven largely by the low false positive rate (high specificity). We can also change our probability threshold to try to optimize the ROC trade-off for the metrics we want to improve.
We want to choose a threshold that gives us the highest sensitivity (it’s okay to sacrifice some specificity, because the false positive rate is of less concern than the sensitivity here). Changing the threshold to 0.54 increases the sensitivity while reducing the AUC to 0.7309, which isn’t too bad. Below is the output:
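For reproducibility, a minimal sketch of how the curve and the threshold adjustment might be run with pROC; the probability column label "PR_yes" and the object names are assumptions.

library(pROC)
# ROC curve from the tree's class probabilities
pr_prob <- predict(pr_tree, type = "prob")[, "PR_yes"]
pr_roc  <- roc(clinical$PR.Status, pr_prob)
plot(pr_roc)
auc(pr_roc)
# Re-classify using the adjusted probability threshold from the text
pr_pred_adj <- factor(ifelse(pr_prob >= 0.54, "PR_yes", "PR_no"))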
New Models
Now, based on our assessment and optimal CP value, we can create a new tree.
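A hedged sketch of this step, pruning the earlier fit at the chosen CP:

library(rpart.plot)
# Prune the initial tree at the CP level chosen from the plot and re-plot it
pr_tree_pruned <- prune(pr_tree, cp = 0.034)
rpart.plot(pr_tree_pruned)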
Model Output
## n= 105
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 105 51 PR_yes (0.4857143 0.5142857)
## 2) Days.to.Date.of.Last.Contact< 12 17 4 PR_no (0.7647059 0.2352941) *
## 3) Days.to.Date.of.Last.Contact>=12 88 38 PR_yes (0.4318182 0.5681818)
## 6) Converted.Stage=No_Conversion,Stage I,Stage IIA 69 34 PR_no (0.5072464 0.4927536)
## 12) Age.at.Initial.Pathologic.Diagnosis< 62.5 43 16 PR_no (0.6279070 0.3720930)
## 24) AJCC.Stage=Stage I,Stage IB,Stage II,Stage IIA,Stage III,Stage IIIB,Stage IIIC 26 6 PR_no (0.7692308 0.2307692) *
## 25) AJCC.Stage=Stage IA,Stage IIB,Stage IIIA 17 7 PR_yes (0.4117647 0.5882353) *
## 13) Age.at.Initial.Pathologic.Diagnosis>=62.5 26 8 PR_yes (0.3076923 0.6923077) *
## 7) Converted.Stage=Stage IIB,Stage IIIA,Stage IIIC 19 3 PR_yes (0.1578947 0.8421053) *
Variable Importance
## Converted.Stage Days.to.Date.of.Last.Contact
## 4.7271062 3.5678052
## AJCC.Stage Age.at.Initial.Pathologic.Diagnosis
## 3.3239068 3.3228073
## OS.Time Node.Coded
## 3.1576776 0.7726353
## Tumor Gender
## 0.5741736 0.3827824
## Metastasis Metastasis.Coded
## 0.2556006 0.2556006
## HER2.Final.Status
## 0.1857457
Tree
CP Plot
Confusion Matrix
## Confusion Matrix and Statistics
##
## Actual
## Prediction 1 2
## 1 33 10
## 2 18 44
##
## Accuracy : 0.7333
## 95% CI : (0.6381, 0.8149)
## No Information Rate : 0.5143
## P-Value [Acc > NIR] : 3.715e-06
##
## Kappa : 0.4639
##
## Mcnemar's Test P-Value : 0.1859
##
## Sensitivity : 0.8148
## Specificity : 0.6471
## Pos Pred Value : 0.7097
## Neg Pred Value : 0.7674
## Prevalence : 0.5143
## Detection Rate : 0.4190
## Detection Prevalence : 0.5905
## Balanced Accuracy : 0.7309
##
## 'Positive' Class : 2
##
As we can see, the new tree has only 5 leaves as opposed to the 6 on the original tree. We can dig into the metrics of our new tree by assessing the confusion matrix:
“Optimal 1” Confusion Matrix Outputs (initial tree values in parentheses):
Accuracy = 73.33% (74.29%)
Kappa = 0.4639 (0.4872)
Sensitivity = 81.48% (68.52%)
Specificity = 64.71% (80.39%)
Balanced Accuracy = 73.09% (74.46%)
As desired, our sensitivity increased substantially to 81.48%, which is good but still not excellent. Aside from the specificity, which drops a fair amount, the other metrics are minimally affected. While this model is “good” and could theoretically be used in practice, I would only recommend it as a supplement to a doctor’s expertise. The sensitivity is good, but there is still a false negative rate (i.e. cancer goes undetected) of 18.52%, which is too high for a model meant to detect cancer.
Hyperparameter Tuning
Model Output
## n= 105
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 105 51 PR_yes (0.4857143 0.5142857)
## 2) Days.to.Date.of.Last.Contact< 12 17 4 PR_no (0.7647059 0.2352941)
## 4) Age.at.Initial.Pathologic.Diagnosis>=53 11 0 PR_no (1.0000000 0.0000000) *
## 5) Age.at.Initial.Pathologic.Diagnosis< 53 6 2 PR_yes (0.3333333 0.6666667) *
## 3) Days.to.Date.of.Last.Contact>=12 88 38 PR_yes (0.4318182 0.5681818)
## 6) Converted.Stage=No_Conversion,Stage I,Stage IIA 69 34 PR_no (0.5072464 0.4927536)
## 12) Age.at.Initial.Pathologic.Diagnosis< 62.5 43 16 PR_no (0.6279070 0.3720930)
## 24) AJCC.Stage=Stage I,Stage IB,Stage II,Stage IIA,Stage III,Stage IIIB,Stage IIIC 26 6 PR_no (0.7692308 0.2307692) *
## 25) AJCC.Stage=Stage IA,Stage IIB,Stage IIIA 17 7 PR_yes (0.4117647 0.5882353) *
## 13) Age.at.Initial.Pathologic.Diagnosis>=62.5 26 8 PR_yes (0.3076923 0.6923077) *
## 7) Converted.Stage=Stage IIB,Stage IIIA,Stage IIIC 19 3 PR_yes (0.1578947 0.8421053) *
Variable Importance
## Age.at.Initial.Pathologic.Diagnosis Converted.Stage
## 6.7737877 5.3022696
## AJCC.Stage Days.to.Date.of.Last.Contact
## 3.8990702 3.5678052
## OS.Time Node.Coded
## 3.1576776 0.7726353
## HER2.Final.Status Tumor
## 0.7609091 0.5741736
## Gender Metastasis
## 0.3827824 0.2556006
## Metastasis.Coded
## 0.2556006
Tree
CP Plot
Confusion Matrix
## Confusion Matrix and Statistics
##
## Actual
## Prediction 1 2
## 1 31 6
## 2 20 48
##
## Accuracy : 0.7524
## 95% CI : (0.6586, 0.8314)
## No Information Rate : 0.5143
## P-Value [Acc > NIR] : 4.847e-07
##
## Kappa : 0.5005
##
## Mcnemar's Test P-Value : 0.01079
##
## Sensitivity : 0.8889
## Specificity : 0.6078
## Pos Pred Value : 0.7059
## Neg Pred Value : 0.8378
## Prevalence : 0.5143
## Detection Rate : 0.4571
## Detection Prevalence : 0.6476
## Balanced Accuracy : 0.7484
##
## 'Positive' Class : 2
##
By adjusting some other hyperparameters (I set the cp to 0.034 again, minbucket = 5, and maxdepth = 4), another optimal tree is produced. The confusion matrix output is as follows (“optimal 1” tree values in parentheses):
Accuracy = 75.24% (73.33%)
Kappa = 0.5005 (0.4639)
Sensitivity = 88.89% (81.48%)
Specificity = 60.78% (64.71%)
Balanced Accuracy = 74.84% (73.09%)
These metrics are strong in terms of our model goals. The sensitivity is even higher, now at 88.89%, which corresponds to a false negative rate of roughly 11%. The specificity is lower still at 60.78%, but that is an acceptable trade-off in this case. While the model performs well on this particular dataset, it will certainly be sensitive to any changes in the data, and it may be overfitting given how small the dataset is. I would recommend some deeper evaluation, possibly utilizing a random forest to create a more robust model, especially for an application in cancer detection.
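For reference, a hedged sketch of the tuned fit (the control values come from the text) and the suggested ensemble follow-up; everything beyond those values is an assumption.

library(rpart)
library(randomForest)
# Refit the PR tree with the tuned hyperparameters described above
pr_tree_tuned <- rpart(PR.Status ~ ., data = clinical, method = "class",
                       control = rpart.control(cp = 0.034, minbucket = 5,
                                               maxdepth = 4))
# Possible next step: a random forest as a more robust ensemble
# (assumes complete cases; randomForest does not accept missing values)
set.seed(123)
pr_rf <- randomForest(PR.Status ~ ., data = clinical, importance = TRUE)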
Tumor Classifier
Using the same dataset, we can create another tree with the goal of classifying the type of tumor a patient has (T1, T2, T3, or T4). We’ll follow the same framework as the first model.
Base Rates
The base rates for the four classes are as follows:
T1 = 14.29%
T2 = 61.90%
T3 = 18.10%
T4 = 5.71%
Building the Model
We’ll build a multi-class tree model using similar steps as above; a minimal sketch of the fit is below, followed by the model output.
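In this sketch, the formula and object names are assumptions; Tumor is assumed to be a factor with levels T1 through T4.

library(rpart)
# Fit a four-class tree for Tumor on the remaining variables
tumor_tree <- rpart(Tumor ~ ., data = clinical, method = "class")
printcp(tumor_tree)
head(predict(tumor_tree, type = "prob"))  # one probability column per class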
Model Output
## n= 105
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 105 40 T2 (0.14285714 0.61904762 0.18095238 0.05714286)
## 2) AJCC.Stage=Stage I,Stage IA,Stage IIIB 16 6 T1 (0.62500000 0.00000000 0.00000000 0.37500000) *
## 3) AJCC.Stage=Stage IB,Stage II,Stage IIA,Stage IIB,Stage III,Stage IIIA,Stage IIIC,Stage IV 89 24 T2 (0.05617978 0.73033708 0.21348315 0.00000000)
## 6) AJCC.Stage=Stage IB,Stage II,Stage IIA,Stage III 46 4 T2 (0.08695652 0.91304348 0.00000000 0.00000000) *
## 7) AJCC.Stage=Stage IIB,Stage IIIA,Stage IIIC,Stage IV 43 20 T2 (0.02325581 0.53488372 0.44186047 0.00000000)
## 14) Node.Coded=Positive 33 11 T2 (0.03030303 0.66666667 0.30303030 0.00000000)
## 28) AJCC.Stage=Stage IIB 13 0 T2 (0.00000000 1.00000000 0.00000000 0.00000000) *
## 29) AJCC.Stage=Stage IIIA,Stage IIIC,Stage IV 20 10 T3 (0.05000000 0.45000000 0.50000000 0.00000000)
## 58) Days.to.Date.of.Last.Contact>=473.5 10 3 T2 (0.00000000 0.70000000 0.30000000 0.00000000) *
## 59) Days.to.Date.of.Last.Contact< 473.5 10 3 T3 (0.10000000 0.20000000 0.70000000 0.00000000) *
## 15) Node.Coded=Negative 10 1 T3 (0.00000000 0.10000000 0.90000000 0.00000000) *
Variable Importance
## AJCC.Stage Converted.Stage
## 26.5474049 16.8670811
## Node.Coded Days.to.Date.of.Last.Contact
## 8.3909166 3.7818182
## OS.Time Age.at.Initial.Pathologic.Diagnosis
## 3.4454545 2.2378883
## Survival.Data.Form PR.Status
## 1.4153663 1.3046039
## HER2.Final.Status Gender
## 0.4200000 0.3363636
Tree
CP Plot
After running the model, we see the output tree has 6 leaves and leaves out the T4 class entirely. This makes sense, as the T4 class has only 6 cases in the dataset. We can also view the CP plot to find the optimal number of splits. Our optimal level is a CP of 0.16, which corresponds to 2 splits. That is something of a red flag, since a tree with only 2 splits is extremely coarse. Let’s look at the confusion matrix to get a better picture.
Confusion Matrix
## Confusion Matrix and Statistics
##
## Actual
## Prediction T1 T2 T3 T4
## T1 10 0 0 6
## T2 4 62 3 0
## T3 1 3 16 0
## T4 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.8381
## 95% CI : (0.7535, 0.9028)
## No Information Rate : 0.619
## P-Value [Acc > NIR] : 8.26e-07
##
## Kappa : 0.6985
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: T1 Class: T2 Class: T3 Class: T4
## Sensitivity 0.66667 0.9538 0.8421 0.00000
## Specificity 0.93333 0.8250 0.9535 1.00000
## Pos Pred Value 0.62500 0.8986 0.8000 NaN
## Neg Pred Value 0.94382 0.9167 0.9647 0.94286
## Prevalence 0.14286 0.6190 0.1810 0.05714
## Detection Rate 0.09524 0.5905 0.1524 0.00000
## Detection Prevalence 0.15238 0.6571 0.1905 0.00000
## Balanced Accuracy 0.80000 0.8894 0.8978 0.50000
Our overall accuracy is 83.81%, which is fair, but a single overall number hides a lot in a four-class tree. A better comparison metric is the base rate, or prevalence, of each class. The T2 class, for example, has the highest base rate (61.9%) and thus the highest sensitivity (95%), while the T1 and T3 classes (base rates of roughly 14% and 18%) have lower sensitivities but high specificities, because the model is excellent at predicting when a tumor is not T1 or T3. To dig deeper we’ll look at the ROC curves.
ROC Curves
The output for the ROC curves is as follows (in order starting with T1):
As we can see, the ROC curves are very skewed (some appear to be “perfect models,” which is not the case) and suggest our model is probably overfitting in some cases. We can try changing the probability threshold to output better ROC curves:
T1: A threshold above 0.62 gives an ROC with a slope of 1 (AUC = 0.5), and a threshold below 0.62 gives an ROC in the shape of a right angle (AUC = 0.75) that looks like a “perfect” model; the ROC curve for T1 therefore isn’t computing correctly and is not of use.
T2: Threshold of 0.2 increases the specificity and gives an AUC of 0.7622.
T3: Threshold of 0.5 gives the optimal ROC curve with an AUC of 0.7122.
T4: Because T4 is not used in the tree, the ROC curve is not of concern and changing the threshold results in no change.
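A hedged sketch of how these one-vs-rest curves might be generated, treating each class in turn as the positive case; the object names are assumptions, and a class the tree never predicts (such as T4) may yield a degenerate curve, as noted above.

library(pROC)
# One-vs-rest ROC curve and AUC for each tumor class
tumor_prob <- predict(tumor_tree, type = "prob")
for (cls in colnames(tumor_prob)) {
  r <- roc(as.numeric(clinical$Tumor == cls), tumor_prob[, cls])
  plot(r, main = paste("ROC:", cls))
  print(auc(r))
}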
Optimal Model #1
Tree:
Confusion Matrix
## Confusion Matrix and Statistics
##
## Actual
## Prediction T1 T2 T3 T4
## T1 10 0 0 6
## T2 5 65 19 0
## T3 0 0 0 0
## T4 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.7143
## 95% CI : (0.6179, 0.7982)
## No Information Rate : 0.619
## P-Value [Acc > NIR] : 0.02645
##
## Kappa : 0.37
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: T1 Class: T2 Class: T3 Class: T4
## Sensitivity 0.66667 1.0000 0.000 0.00000
## Specificity 0.93333 0.4000 1.000 1.00000
## Pos Pred Value 0.62500 0.7303 NaN NaN
## Neg Pred Value 0.94382 1.0000 0.819 0.94286
## Prevalence 0.14286 0.6190 0.181 0.05714
## Detection Rate 0.09524 0.6190 0.000 0.00000
## Detection Prevalence 0.15238 0.8476 0.000 0.00000
## Balanced Accuracy 0.80000 0.7000 0.500 0.50000
Using the optimal cp actually lowered the overall accuracy, and the tree now appears too simple: with only 1 split, T3 and T4 are never predicted, and the sensitivity for T2 is 100%. Let’s try adjusting some hyperparameters to produce a more accurate tree.
Hyperparameter Tuning
Tree
Confusion Matrix
## Confusion Matrix and Statistics
##
## Actual
## Prediction T1 T2 T3 T4
## T1 10 0 0 0
## T2 5 64 10 0
## T3 0 1 9 0
## T4 0 0 0 6
##
## Overall Statistics
##
## Accuracy : 0.8476
## 95% CI : (0.7644, 0.9103)
## No Information Rate : 0.619
## P-Value [Acc > NIR] : 2.491e-07
##
## Kappa : 0.6953
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: T1 Class: T2 Class: T3 Class: T4
## Sensitivity 0.66667 0.9846 0.47368 1.00000
## Specificity 1.00000 0.6250 0.98837 1.00000
## Pos Pred Value 1.00000 0.8101 0.90000 1.00000
## Neg Pred Value 0.94737 0.9615 0.89474 1.00000
## Prevalence 0.14286 0.6190 0.18095 0.05714
## Detection Rate 0.09524 0.6095 0.08571 0.05714
## Detection Prevalence 0.09524 0.7524 0.09524 0.05714
## Balanced Accuracy 0.83333 0.8048 0.73103 1.00000
By using rpart.control and setting the cp to 0.01 (the rpart default), the minbucket to 2, and the maxdepth to 3, a new tree is produced that may be the optimal tree in this case. The max depth is capped at 3 to limit overfitting, and the minbucket of 2 lets the tree form leaves for every class (T1, T2, T3, T4). The overall accuracy is 84.76%, which is good (about 1 point better than our original tree), and the sensitivities for each class are fair. The per-class specificities are a little high (100% for T1 and T4), which could mean our tree is still overfitting even though it has only 4 splits. Overfitting is especially concerning here because the dataset has only 105 observations split across four tumor classes: the T4 class, for example, has a base rate of just 5.71% (6 observations), so it would be easy for the tree to “memorize” those cases. Therefore, I would not recommend this model for use in practice (except perhaps for detecting T2 tumors, as this dataset is rich in those observations), unless it is trained on a more robust dataset.
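For reference, a hedged sketch of this final tuned fit; the control values are from the text, and the rest is assumed.

library(rpart)
library(caret)
# Refit the multi-class tree with the tuned hyperparameters
tumor_tree_tuned <- rpart(Tumor ~ ., data = clinical, method = "class",
                          control = rpart.control(cp = 0.01, minbucket = 2,
                                                  maxdepth = 3))
confusionMatrix(predict(tumor_tree_tuned, type = "class"), clinical$Tumor)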