Congrats! You just graduated from medical school and earned a PhD in Data Science at the same time, wow, impressive. Because of these incredible accomplishments, the world now believes you will be able to cure cancer…no pressure. To start, you figured you had better create some way to detect cancer when it is present. Luckily, because you are now an MD and DS PhD, or MDSDPhD, you have access to data sets and know your way around an ML classifier. So, on the way to fulfilling your destiny to rid the world of cancer, you start by building several classifiers that can aid in determining whether patients have cancer and what type of tumor they have.
The included dataset (clinical_data_breast_cancer_modified.csv) has information on 105 patients across 17 variables. Your goal is to build two classifiers: one for PR.Status (progesterone receptor), a biomarker that routinely leads to a cancer diagnosis, indicating whether the outcome was positive or negative, and one for Tumor, a multi-class variable. You would like to be able to explain the model to the mere mortals around you but need a fairly robust and flexible approach, so you've chosen to use decision trees to get started. In building both models, use CART and C5.0 and compare the differences.
In doing so, similar to great data scientists of the past, you remembered the excellent education provided to you at UVA in an undergraduate data science course and have outlined the steps that will need to be undertaken to complete this task (you can add more or combine them if needed).
As always, you will need to make sure to #comment your work heavily and render the results in a clear (knitted) report, as the non-MDSDPhDs of the world will someday need to understand the wonder and spectacle that will be your R code. Good luck, and the world thanks you.
Footnotes:
- Some of the steps will not need to be repeated for the second model; use your judgment.
- You can add or combine steps if needed.
- Remember to try several methods during evaluation and always be mindful of how the model will be used in practice.
- Do not include ER.Status in your first tree; it's basically the same as PR.Status.
Libraries
First, we load in the data and drop the ER.Status column as specified in the instructions. We also make sure every variable has the correct type and recode the target variable, PR.Status, to a numeric 0 (negative) / 1 (positive) scheme.
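A minimal sketch of that prep step, assuming the tidyverse for reading and cleaning (the object name and the exact recoding are our assumptions, not necessarily what the original code chunk did):

```r
# Load the data, drop ER.Status (per the footnote), and recode the targets.
library(tidyverse)

clinical <- read_csv("clinical_data_breast_cancer_modified.csv") %>%
  select(-ER.Status) %>%                                  # basically duplicates PR.Status
  mutate(
    PR.Status = if_else(PR.Status == "Positive", 1, 0),   # numeric 0/1 target
    Tumor     = as.factor(Tumor)                          # multi-class target for model 2
  )
```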
Next, we split the data into a training set and a testing set.
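An 80/20 split is consistent with the 84/21 row counts shown below; the seed here is an arbitrary stand-in, not necessarily the one the report used:

```r
# Reproducible 80/20 train/test split (84 and 21 patients).
set.seed(1982)                                            # arbitrary seed (assumption)
idx   <- sample(nrow(clinical), size = floor(0.8 * nrow(clinical)))
train <- clinical[idx, ]
test  <- clinical[-idx, ]
```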
Training Data
## # A tibble: 6 x 16
## Gender Age.at.Initial.… PR.Status HER2.Final.Stat… Tumor Node.Coded Metastasis
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 FEMALE 66 0 Negative T3 Positive M1
## 2 FEMALE 40 0 Negative T2 Negative M0
## 3 FEMALE 56 0 Negative T2 Positive M0
## 4 FEMALE 38 0 Negative T3 Positive M0
## 5 FEMALE 74 0 Negative T3 Negative M0
## 6 FEMALE 60 0 Negative T2 Negative M0
## # … with 9 more variables: Metastasis.Coded <chr>, AJCC.Stage <chr>,
## # Converted.Stage <chr>, Survival.Data.Form <chr>, Vital.Status <chr>,
## # Days.to.Date.of.Last.Contact <dbl>, Days.to.date.of.Death <dbl>,
## # OS.event <int>, OS.Time <dbl>
Testing Data
## # A tibble: 6 x 16
## Gender Age.at.Initial.… PR.Status HER2.Final.Stat… Tumor Node.Coded Metastasis
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 FEMALE 48 0 Negative T2 Positive M0
## 2 FEMALE 57 0 Negative T2 Negative M0
## 3 FEMALE 36 0 Negative T2 Negative M0
## 4 FEMALE 41 0 Negative T4 Negative M0
## 5 FEMALE 78 0 Positive T1 Positive M0
## 6 FEMALE 88 1 Negative T2 Positive M0
## # … with 9 more variables: Metastasis.Coded <chr>, AJCC.Stage <chr>,
## # Converted.Stage <chr>, Survival.Data.Form <chr>, Vital.Status <chr>,
## # Days.to.Date.of.Last.Contact <dbl>, Days.to.date.of.Death <dbl>,
## # OS.event <int>, OS.Time <dbl>
## [1] 0.4857143
The base rate is the probability of each class unconditioned on any featural evidence, frequently also known as the prior probability. For the multi-class Tumor target later on, we will use the individual percentage for each class.
We first group the data by the predictor variables (e.g. gender, metastasis, etc.), then by the values of those predictors, and attach the probability that PR Status is positive vs. negative within each group.
The overall base rate was found by taking the average of the values in the PR.Status column. Thus we found that 51.4% of patients had a positive PR status and 48.6% (the value printed above) had a negative one.
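A one-line sketch of that computation, assuming the 0/1 coding from the prep step:

```r
# Mean of a 0/1 column = share of positives; its complement = share of
# negatives (the 0.4857 printed above).
mean(clinical$PR.Status)       # ~0.514, positive PR status
1 - mean(clinical$PR.Status)   # ~0.486, negative PR status
```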
We build our model with the default settings using the rpart function. PR.Status is the response variable in the formula. We fit on our previously split training dataset and keep cp at its default of 0.01.
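A sketch of that call (rpart wants a factor response for a classification tree, so we coerce PR.Status on the way in; `pr_tree` is our name for the fitted object):

```r
library(rpart)

# CART model for PR.Status with default settings (cp = 0.01).
pr_tree <- rpart(PR.Status ~ .,
                 data    = mutate(train, PR.Status = factor(PR.Status)),
                 method  = "class",                     # classification tree
                 control = rpart.control(cp = 0.01))
pr_tree   # prints the node-by-node summary shown below
```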
## n= 84
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 84 39 0 (0.53571429 0.46428571)
## 2) AJCC.Stage=Stage IB,Stage IIA,Stage III,Stage IIIC,Stage IV 35 10 0 (0.71428571 0.28571429)
## 4) Converted.Stage=No_Conversion,Stage IIIA 13 1 0 (0.92307692 0.07692308) *
## 5) Converted.Stage=Stage IIA,Stage IIIC 22 9 0 (0.59090909 0.40909091)
## 10) Days.to.Date.of.Last.Contact< 869.5 15 5 0 (0.66666667 0.33333333) *
## 11) Days.to.Date.of.Last.Contact>=869.5 7 3 1 (0.42857143 0.57142857) *
## 3) AJCC.Stage=Stage I,Stage IA,Stage II,Stage IIB,Stage IIIA,Stage IIIB 49 20 1 (0.40816327 0.59183673)
## 6) Days.to.Date.of.Last.Contact< 12 7 2 0 (0.71428571 0.28571429) *
## 7) Days.to.Date.of.Last.Contact>=12 42 15 1 (0.35714286 0.64285714)
## 14) Survival.Data.Form=followup 19 9 0 (0.52631579 0.47368421) *
## 15) Survival.Data.Form=enrollment 23 5 1 (0.21739130 0.78260870) *
Now we can see the probabilities of being positive or negative based on certain “decisions” or factors.
The most important variable is shown by the first split in our decision tree, the AJCC Stage (the first split sends Stages IB, IIA, III, IIIC, and IV down one branch and the remaining stages down the other). In descending order, the next three most important factors are the Converted Stage, the Days to Date of Last Contact, and the OS Time.
This method of visualization isn't the easiest to read, though, so let's move on to a more visually pleasing tree!
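One common way to draw the cleaner tree is rpart.plot — an assumption about which plotting package the report used:

```r
library(rpart.plot)

# Draw the fitted tree with class probabilities and counts in each node.
rpart.plot(pr_tree, type = 4, extra = 101, roundint = FALSE)
```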
Much better! Here we can see the order of the splits and the prediction at each terminal node. For example, a patient whose converted stage sends them down the rightmost branch lands immediately at a terminal node.
## # A tibble: 5 x 5
## CP nsplit `rel error` xerror xstd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.231 0 1 1 0.117
## 2 0.0769 1 0.769 1.18 0.117
## 3 0.0256 2 0.692 1.10 0.117
## 4 0.0128 3 0.667 1.08 0.118
## 5 0.01 5 0.641 1.05 0.117
Next, we produce an "elbow chart" of the cross-validated relative error at each cp value. The dashed line is drawn at the minimum cross-validated error plus one standard error of that minimum (the one-standard-error rule). A reasonable cp for pruning is the leftmost value at which the curve falls below that line.
We prune to 3 splits, as shown in cptable_ex above: this is where the curve dips back below the reference line after the jump at one split, and additional splits buy only a marginal change in cross-validated error. That row of the cptable pairs a low relative error (0.667) with a moderate xerror (1.08) and a reasonable xstd (0.118).
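plotcp draws exactly this chart, and prune snips the tree back at the chosen cp; the row index below points at the 3-split row of the printed cptable (object names are ours):

```r
# Cross-validated error vs. cp, with the one-standard-error reference line.
plotcp(pr_tree)

# Prune at the cp of the 3-split row (row 4 of the cptable above).
pr_tree_pruned <- prune(pr_tree, cp = pr_tree$cptable[4, "CP"])
```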
## AJCC.Stage Converted.Stage
## 5.8181441 4.0991152
## Days.to.Date.of.Last.Contact OS.Time
## 3.5475117 3.3240467
## Survival.Data.Form Tumor
## 1.9859431 1.3808629
## Age.at.Initial.Pathologic.Diagnosis Node.Coded
## 1.2467354 0.9709521
## HER2.Final.Status
## 0.5466472
We can also visualize the importance of each variable, as shown above. This reiterates what we found before, but now displays the relative importance of every variable. Most important is the AJCC Stage with a top score of 5.82, followed by the Converted Stage at 4.10. At the bottom, the least important are Node.Coded at 0.97 and HER2 Final Status at 0.55.
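The scores above live in the fitted object's variable.importance slot; a simple bar chart is one way to visualize them:

```r
# Bar chart of rpart's variable importance scores, largest first.
barplot(sort(pr_tree$variable.importance, decreasing = TRUE),
        las = 2, cex.names = 0.6,
        main = "Variable importance: PR.Status tree")
```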
Next, we use the predict function to generate test-set predictions of the target variable.
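A sketch of the prediction and evaluation steps; caret::confusionMatrix is an assumption about the package behind the output further below (the Prediction/Actual labels and positive class "1" match that output):

```r
library(caret)

# Class predictions on the held-out test set.
tree_predict <- predict(pr_tree, newdata = test, type = "class")

# Confusion matrix with PR-positive ("1") as the positive class.
confusionMatrix(data      = tree_predict,
                reference = factor(test$PR.Status),
                positive  = "1",
                dnn       = c("Prediction", "Actual"))
```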
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
## 1 0 1 1 0 0 0 0 0 0 0 1 1 0 0 1 0 1 1 1 0
## Levels: 0 1
Our test set of 21 patients yields the predicted PR Status values shown above.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 2 10
## 1 4 5
##
## Accuracy : 0.3333
## 95% CI : (0.1459, 0.5697)
## No Information Rate : 0.7143
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : -0.2564
##
## Mcnemar's Test P-Value : 0.1814
##
## Sensitivity : 0.3333
## Specificity : 0.3333
## Pos Pred Value : 0.5556
## Neg Pred Value : 0.1667
## Prevalence : 0.7143
## Detection Rate : 0.2381
## Detection Prevalence : 0.4286
## Balanced Accuracy : 0.3333
##
## 'Positive' Class : 1
##
The detection rate is 23.81% with a prevalence of 71.43%. Our model currently has an accuracy of only 33.33%.
The hit rate can be derived from the confusion matrix by taking the number of true positives (5) over the total actual positive cases (5 hits + 10 misses = 15), for a hit rate of 33.3%.
## [1] 0.6666667
The value printed above is the true error rate: the 14 misclassified cases (10 + 4 off the diagonal) out of 21 give an error rate of 66.7%, the complement of the 33.3% accuracy.
## [1] 0.2380952
The detection rate, true positives over all test cases (5/21), is 23.81%, as printed above.
The true positive rate could be improved upon: the ideal confusion matrix would have a 0 in the predicted-negative/actual-positive cell, meaning a hit rate of 100%. We would want to focus on a higher hit rate in this specific case because it is better for a patient to be examined more carefully after a false positive than to have cancer stay hidden behind a false negative. If keeping false positives low were the priority instead, we would be aiming for a 0 in the predicted-positive/actual-negative cell, i.e. a high specificity (true negative rate).
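The curve summarized below comes from pROC; the call mirrors the one echoed in the printed output:

```r
library(pROC)

# ROC curve for the binary PR.Status predictions.
roc(response  = test$PR.Status,
    predictor = as.numeric(tree_predict),
    plot      = TRUE)
```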
##
## Call:
## roc.default(response = test$PR.Status, predictor = as.numeric(tree_predict), plot = TRUE)
##
## Data: as.numeric(tree_predict) in 6 controls (test$PR.Status 0) > 15 cases (test$PR.Status 1).
## Area under the curve: 0.6667
The ROC curve and its AUC measure classification performance across threshold settings. The ROC is a probability curve, and the AUC represents the degree of separability: how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
The area under the curve of this graph is 0.6667. As specificity decreases from 1.0 toward 0.5, sensitivity rises quickly; the incremental gains then drop off as specificity falls from 0.5 to 0.
As shown in the graph, the curve deviates only slightly from the diagonal baseline; a larger deviation would indicate better predictive power.
Next, we split the data into a training set and a testing set for the Tumor model.
The overall base rates for each tumor type were found by taking the proportion of each class in the Tumor column.
Tumor T1 accounts for 14.3% of patients, T2 for 61.9%, T3 for 18.1%, and T4 for 5.7%.
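These shares can be read off with a one-liner; `clinical` is our assumed data object from the prep sketch:

```r
# Class shares for the multi-class Tumor target (matches the percentages
# quoted above).
prop.table(table(clinical$Tumor))
```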
We build our model with the default settings using the rpart function, now with Tumor as the response variable in the formula. We again use the previously split training dataset and keep the default cp of 0.01.
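The multi-class fit is the same call with Tumor as the response (a sketch; whether any columns were dropped first is not shown in the report):

```r
# CART model for the multi-class Tumor target, default cp = 0.01.
tumor_tree <- rpart(Tumor ~ .,
                    data    = train,
                    method  = "class",
                    control = rpart.control(cp = 0.01))
tumor_tree   # prints the node-by-node summary shown below
```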
## n= 84
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 84 31 T2 (0.15476190 0.63095238 0.17857143 0.03571429)
## 2) AJCC.Stage=Stage I,Stage IA,Stage IIIB 12 3 T1 (0.75000000 0.00000000 0.00000000 0.25000000) *
## 3) AJCC.Stage=Stage IB,Stage II,Stage IIA,Stage IIB,Stage III,Stage IIIA,Stage IIIC,Stage IV 72 19 T2 (0.05555556 0.73611111 0.20833333 0.00000000)
## 6) AJCC.Stage=Stage IB,Stage II,Stage IIA,Stage III 40 4 T2 (0.10000000 0.90000000 0.00000000 0.00000000) *
## 7) AJCC.Stage=Stage IIB,Stage IIIA,Stage IIIC,Stage IV 32 15 T2 (0.00000000 0.53125000 0.46875000 0.00000000)
## 14) Days.to.Date.of.Last.Contact>=966 11 1 T2 (0.00000000 0.90909091 0.09090909 0.00000000) *
## 15) Days.to.Date.of.Last.Contact< 966 21 7 T3 (0.00000000 0.33333333 0.66666667 0.00000000) *
Now we can see the probability of each tumor class based on certain "decisions" or factors.
The most important variable is shown by the first split in our decision tree, the AJCC Stage (the first split separates Stages I, IA, and IIIB from the rest). In descending order, the next most important factors are the Converted Stage, the Days to Date of Last Contact, and the OS Time.
This method of visualization isn't the easiest to read, though, so let's move on to a more visually pleasing tree!
Much better! Here we can see the order of the splits and the prediction at each terminal node. For example, if we knew the patient were in AJCC Stage I, we would say they likely have a T1 tumor. Interestingly, T4 never appears as a prediction because it is never the plurality class in any node; it comes closest in the T1 leaf, where T1 outnumbers it 9 patients to 3.
## # A tibble: 3 x 5
## CP nsplit `rel error` xerror xstd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.290 0 1 1 0.143
## 2 0.113 1 0.710 0.710 0.130
## 3 0.01 3 0.484 0.645 0.126
Next, we again produce an "elbow chart" of the cross-validated relative error at each cp value. As before, the dashed line is drawn at the minimum cross-validated error plus one standard error of that minimum (the one-standard-error rule), and a reasonable cp for pruning is the leftmost value at which the curve falls below that line.
By the one-standard-error rule, the optimal tree has a single split (tree size 2), as shown in cptable_ex: its cross-validated error (0.710) already falls below the reference line (0.645 + 0.126 = 0.771). That row of the cptable pairs a low relative error with a low xerror and a good xstd.
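Pruning to that size-2 tree then looks like this, with the row index pointing at the 1-split row of the cptable above (object names are ours):

```r
# Elbow chart and prune for the Tumor tree; row 2 is the 1-split row.
plotcp(tumor_tree)
tumor_tree_pruned <- prune(tumor_tree, cp = tumor_tree$cptable[2, "CP"])
```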
## AJCC.Stage Converted.Stage
## 18.1244048 12.2286302
## Days.to.Date.of.Last.Contact OS.Time
## 5.3954901 4.7859848
## Node.Coded Age.at.Initial.Pathologic.Diagnosis
## 2.2348524 1.0158420
## Survival.Data.Form OS.event
## 1.0158420 0.8701791
## Vital.Status Gender
## 0.8701791 0.4350895
We can also visualize the importance of each variable, as shown above. This reiterates what we found before, but displays the relative importance of all the variables. Most important is the AJCC Stage with a top score of 18.12, followed by the Converted Stage at 12.23. At the bottom, the least important are Gender at 0.44 and Vital Status at 0.87.
Next, we use the predict function to generate test-set predictions of the target variable.
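Prediction and evaluation mirror the first model, though there is no single positive class this time:

```r
# Multi-class predictions and confusion matrix for the Tumor model.
tumor_predict <- predict(tumor_tree, newdata = test, type = "class")
confusionMatrix(data      = tumor_predict,
                reference = test$Tumor,
                dnn       = c("Prediction", "Actual"))
```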
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
## T2 T2 T2 T1 T3 T2 T3 T1 T2 T2 T2 T2 T2 T3 T2 T2 T2 T1 T1 T2 T2
## Levels: T1 T2 T3 T4
Our test set of 21 patients yields the predicted tumor types shown above.
## Confusion Matrix and Statistics
##
## Actual
## Prediction T1 T2 T3 T4
## T1 1 0 0 3
## T2 0 11 3 0
## T3 1 1 1 0
## T4 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.619
## 95% CI : (0.3844, 0.8189)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 0.4171
##
## Kappa : 0.336
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: T1 Class: T2 Class: T3 Class: T4
## Sensitivity 0.50000 0.9167 0.25000 0.0000
## Specificity 0.84211 0.6667 0.88235 1.0000
## Pos Pred Value 0.25000 0.7857 0.33333 NaN
## Neg Pred Value 0.94118 0.8571 0.83333 0.8571
## Prevalence 0.09524 0.5714 0.19048 0.1429
## Detection Rate 0.04762 0.5238 0.04762 0.0000
## Detection Prevalence 0.19048 0.6667 0.14286 0.0000
## Balanced Accuracy 0.67105 0.7917 0.56618 0.5000
The detection rates range from 0% for T4 to 52.4% for T2.
The hit rate can be derived from the confusion matrix by taking the number of true positives for each type of tumor over total actual positive cases.
Tumor 1: Correct Hits = 1, Actual = 2 -> Hit Rate = 50%
Tumor 2: Correct Hits = 11, Actual = 12 -> Hit Rate = 91.7%
Tumor 3: Correct Hits = 1, Actual = 4 -> Hit Rate = 25%
Tumor 4: Correct Hits = 0, Actual = 3 -> Hit Rate = 0%
We have an excellent hit rate for tumor 2, a moderate rate for tumor 1, a poor rate for tumor 3, and no hits at all for tumor 4. More research money should be invested in improving detection for tumor 4.
This rate can be found by summing the off-diagonal cells of the confusion matrix, which are the model's errors, and dividing by the total number of instances.
This gives us a true error rate of 8/21, which is 38.1% (the complement of the 61.9% accuracy).
The overall accuracy can be found by summing the values along the diagonal of the confusion matrix and dividing by the total number of instances, which gives us 13/21, or 61.9%.
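As a sanity check, both quantities fall straight out of the confusion matrix; `tumor_predict` is the prediction vector from the sketch above:

```r
# Error rate and accuracy computed directly from the confusion matrix.
cm <- table(Prediction = tumor_predict, Actual = test$Tumor)
accuracy   <- sum(diag(cm)) / sum(cm)   # 13/21 ~ 0.619
error_rate <- 1 - accuracy              #  8/21 ~ 0.381
```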
Across the two models, we were able to take a patient's clinical information and produce an evaluation of their PR status (first model) and of the type of tumor they have (second model).
Within the first model, we learned that the most important factors for determining PR status are the patient's AJCC Stage, Converted Stage, Days to Date of Last Contact, and OS Time.
Similarly, within the second model the most important factors for determining tumor type are the AJCC Stage (with the first split separating Stages I, IA, and IIIB from the rest), the Converted Stage, the Days to Date of Last Contact, and the OS Time.
AJCC Stage ranks highly in both models, so it is something doctors may want to pay extra attention to during diagnosis.
We recommend that data scientists iterating on these models work to decrease the false negative rate. Given how the models would be used, it is more detrimental for a patient to receive a false negative and go undiagnosed than to receive a false positive and begin treatment for something they do not have.
Because of the less-than-ideal error rates and true positive rates, we would not recommend these models for professional use, but we do see value in gathering more data to continuously improve them.
The dataset was rather small, which led to several issues; within the tumor model, for instance, T4 was underrepresented and never appeared as a prediction in our decision tree. We would want more data on patients with T4 tumors so that the class is well enough represented to show up as a predicted outcome.