Congrats! You just graduated from medical school and got a PhD in Data Science at the same time. Wow, impressive. Because of these incredible accomplishments, the world now believes you will be able to cure cancer... no pressure. To start, you figured you had better create some way to detect cancer when it is present. Luckily, because you are now an MD and DS PhD, or MDSDPhD, you have access to data sets and know your way around an ML classifier. So, on the way to fulfilling your destiny to rid the world of cancer, you start by building several classifiers that can be used to aid in determining whether patients have cancer and the type of tumor.
The included dataset (clinical_data_breast_cancer_modified.csv) has information on 105 patients across 17 variables. Your goal is to build two classifiers: one for PR.Status (progesterone receptor), a biomarker that routinely leads to a cancer diagnosis, indicating whether there was a positive or negative outcome, and one for Tumor, a multi-class variable. You would like to be able to explain the model to the mere mortals around you but need a fairly robust and flexible approach, so you've chosen to use decision trees to get started. In building both models, use CART and C5.0 and compare the differences.
In doing so, similar to great data scientists of the past, you remembered the excellent education provided to you at UVA in an undergrad data science course and have outlined the steps that will need to be undertaken to complete this task (you can add more or combine them if needed).
As always, you will need to make sure to #comment your work heavily and render the results in a clear (knitted) report, as the non-MDSDPhDs of the world will someday need to understand the wonder and spectacle that will be your R code. Good luck, and the world thanks you.
Footnotes:
- Some of the steps will not need to be repeated for the second model; use your judgment.
- You can add or combine steps if needed.
- Remember to try several methods during evaluation and always be mindful of how the model will be used in practice.
- Do not include ER.Status in your first tree; it's basically the same as PR.Status.
After cleaning the data a bit, I calculated the base rate for both classifiers.
#6 OK, now determine the base rate for each classifier. What does this number mean?
# For the multi-class target this is the individual percentage for each class.
# nrow(cancer) is 103 after cleaning, so dividing by it gives the class proportions.
baserate_PR <- sum(cancer$PR.Status) / nrow(cancer)     # 51.5%
# 51.5% of patients are PR-positive, so always guessing positive would be correct about 51.5% of the time
baserate_T1 <- sum(cancer$Tumor == 'T1') / nrow(cancer) # 14.6%
baserate_T2 <- sum(cancer$Tumor == 'T2') / nrow(cancer) # 62.1%
baserate_T3 <- sum(cancer$Tumor == 'T3') / nrow(cancer) # 17.5%
baserate_T4 <- sum(cancer$Tumor == 'T4') / nrow(cancer) #  5.8%
Then I built a model for the PR.Status variable using CART.
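A minimal sketch of how that CART model might be fit is below, assuming the cleaned data frame is called cancer and using hypothetical object names (cancer_PR, split_index, training_PR, test_PR, tree_PR) and a hypothetical seed:

library(rpart)
library(rpart.plot)
library(caret)

set.seed(1981)                                   # hypothetical seed for reproducibility

# Drop ER.Status per the footnote (it is essentially the same as PR.Status)
cancer_PR <- cancer[, names(cancer) != "ER.Status"]
cancer_PR$PR.Status <- as.factor(cancer_PR$PR.Status)

# Stratified ~80/20 train/test split on the target
split_index <- createDataPartition(cancer_PR$PR.Status, p = 0.8, list = FALSE)
training_PR <- cancer_PR[split_index, ]
test_PR     <- cancer_PR[-split_index, ]

# Fit the CART classification tree and plot it
tree_PR <- rpart(PR.Status ~ ., data = training_PR, method = "class")
rpart.plot(tree_PR, type = 4, extra = 101)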
As can be seen in the tree, the converted stage variable is the most important split. Other key variables are age at initial diagnosis and days to date of last contact.
The optimal tree size is either 6 or 2; it changes when I rerun the code because the train/test split is random. Since a tree can't have a single branch, a size of 2 or 6 is closest to the maximum relative error tolerance.
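The optimal size comes from rpart's cross-validated complexity-parameter (cp) table; a minimal sketch of how it might be inspected and the tree pruned, assuming the fitted tree_PR object from the sketch above:

# Cross-validated relative error at each cp value / tree size
plotcp(tree_PR)
printcp(tree_PR)

# Prune back to the cp with the lowest cross-validated error (xerror)
best_cp <- tree_PR$cptable[which.min(tree_PR$cptable[, "xerror"]), "CP"]
tree_PR <- prune(tree_PR, cp = best_cp)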
## [1] "Hit Rate/True Error Rate:38.0952380952381%"
## [1] "Detection Rate:42.8571428571429%"
This model is worse than simply guessing the majority class: its accuracy (62%) is below the no-information rate (71%), so it predicts PR status no better than always guessing the most common outcome in the test set. The error rate is about 38% and the detection rate is about 43%.
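The confusion matrix and statistics below come from caret's confusionMatrix(); a minimal sketch of how they might be produced, assuming the tree_PR model and test_PR test set sketched above:

# Predict PR.Status classes on the held-out test set
tree_predict <- predict(tree_PR, newdata = test_PR, type = "class")

# Compare predictions against the actual labels, treating PR-positive (1) as the positive class
confusionMatrix(
  data      = tree_predict,
  reference = as.factor(test_PR$PR.Status),
  positive  = "1",
  dnn       = c("Prediction", "Actual")
)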
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 4 6
## 1 2 9
##
## Accuracy : 0.619
## 95% CI : (0.3844, 0.8189)
## No Information Rate : 0.7143
## P-Value [Acc > NIR] : 0.8843
##
## Kappa : 0.2222
##
## Mcnemar's Test P-Value : 0.2888
##
## Sensitivity : 0.6000
## Specificity : 0.6667
## Pos Pred Value : 0.8182
## Neg Pred Value : 0.4000
## Prevalence : 0.7143
## Detection Rate : 0.4286
## Detection Prevalence : 0.5238
## Balanced Accuracy : 0.6333
##
## 'Positive' Class : 1
##
Accuracy is low in this model; ideally we don't want a model below 70%, and in this case the accuracy is only about 62%. The false negative rate (1 - sensitivity) is also too high at 0.40.
I would not trust the results of this model. Especially with something as serious as cancer.
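The ROC output below was produced with the pROC package; a minimal sketch, again assuming the tree_PR model and test_PR test set from the earlier sketches:

library(pROC)

# Predicted probability of the positive class (PR.Status = 1)
tree_prob <- predict(tree_PR, newdata = test_PR, type = "prob")[, "1"]

# ROC curve and area under the curve against the actual labels
roc_PR <- roc(response = test_PR$PR.Status, predictor = tree_prob)
plot(roc_PR)
auc(roc_PR)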
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6333
This ROC curve sits close to the diagonal reference line, indicating that the model is only slightly better than a random guess (AUC = 0.63).
Then I built a model for the Tumor variable using C5.0, since it is a multi-class variable.
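The model summary printed below comes from the C50 package; a minimal sketch of how the model might be fit, using the training_t name shown in the call and hypothetical names (tumor_index, test_t, tumor_tree) and a hypothetical seed for everything else:

library(C50)
library(caret)

set.seed(2001)                                  # hypothetical seed for reproducibility

# Make sure the multi-class target is a factor, then do a stratified ~80/20 split
cancer$Tumor <- as.factor(cancer$Tumor)
tumor_index  <- createDataPartition(cancer$Tumor, p = 0.8, list = FALSE)
training_t   <- cancer[tumor_index, ]
test_t       <- cancer[-tumor_index, ]

# Fit the C5.0 classification tree and print its summary
tumor_tree <- C5.0(Tumor ~ ., data = training_t)
tumor_tree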
##
## Call:
## C5.0.formula(formula = Tumor ~ ., data = training_t)
##
## Classification Tree
## Number of samples: 85
## Number of predictors: 15
##
## Tree size: 7
##
## Non-standard options: attempt to group attributes
## [1] "Hit Rate/True Error Rate:30%"
## [1] "Detection Rate:45%"
Here the detection rate (45%) is higher and the error rate (30%) is lower than for the PR.Status model. However, the accuracy (70%) is only slightly above the no-information rate (65%), so it is not much of an improvement over always guessing the most common tumor class.
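The multi-class confusion matrix below was again generated with caret; a minimal sketch, assuming the tumor_tree model and test_t test set from the previous sketch:

# Predict the tumor class on the held-out test set
tumor_predict <- predict(tumor_tree, newdata = test_t, type = "class")

# Multi-class confusion matrix and per-class statistics
confusionMatrix(
  data      = tumor_predict,
  reference = test_t$Tumor,
  dnn       = c("Prediction", "Actual")
)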
## Confusion Matrix and Statistics
##
## Actual
## Prediction T1 T2 T3 T4
## T1 2 0 0 0
## T2 1 9 1 0
## T3 0 4 2 0
## T4 0 0 0 1
##
## Overall Statistics
##
## Accuracy : 0.7
## 95% CI : (0.4572, 0.8811)
## No Information Rate : 0.65
## P-Value [Acc > NIR] : 0.4166
##
## Kappa : 0.4828
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: T1 Class: T2 Class: T3 Class: T4
## Sensitivity 0.6667 0.6923 0.6667 1.00
## Specificity 1.0000 0.7143 0.7647 1.00
## Pos Pred Value 1.0000 0.8182 0.3333 1.00
## Neg Pred Value 0.9444 0.5556 0.9286 1.00
## Prevalence 0.1500 0.6500 0.1500 0.05
## Detection Rate 0.1000 0.4500 0.1000 0.05
## Detection Prevalence 0.1000 0.5500 0.3000 0.05
## Balanced Accuracy 0.8333 0.7033 0.7157 1.00
For the Tumor multi-class model, the accuracy is okay (70%) and the kappa value (0.48) indicates only moderate agreement. Each class also has a fairly low false negative rate, which is promising. However, looking more closely at the details, it's clear that the test set is too small to yield reliable metrics: out of the whole test subgroup there is only one patient with a T4 tumor, so of course the positive predictive value for that class is 1 (or I'd hope so)!
Also, since this was an unstable model, the metrics kept changing when I reran the code, so take these numbers with a grain of salt.
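One way to reduce (and at least quantify) that run-to-run variation is to fix the random seed and evaluate with repeated cross-validation instead of a single ~20-patient hold-out set; a minimal sketch with caret, assuming the cleaned cancer data frame and a hypothetical seed:

library(caret)

set.seed(42)                                   # hypothetical seed so resampling is reproducible

# 10-fold cross-validation repeated 5 times gives a more stable performance
# estimate than one small test split
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

cv_tree <- train(
  Tumor ~ .,
  data       = cancer,
  method     = "rpart",                        # CART via rpart; method = "C5.0" would also work
  trControl  = ctrl,
  tuneLength = 10                              # try 10 complexity-parameter values
)
cv_tree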
Recommendations:
These models are not good: the dataset was too small to yield reliable results, so sadly I won't be curing cancer today. The models were either no better than the base rate or better only by luck, thanks to the tiny test subset.
In the future, I should use a much larger dataset and potentially change the train/test split to include a larger test set.