In this lab, we will use decision trees to help assess breast cancer diagnoses and to identify the variables most strongly associated with patient status and tumor characteristics. This matters in the real world, as it lets us reveal statistically important factors associated with different tumor types and with a patient's breast cancer status. Though this is a very simple machine learning model relative to what is used in practice, it gives us a first glimpse of where machine learning can take us.
To prepare our data, one row had to be removed. One patient had a status of “Indeterminate” rather than positive or negative, which threw off results and even produced errors, so that observation was dropped. Additionally, a second data set was created without the variable ER.Status; this reduced data set is used for our first decision tree.
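The report does not show the preparation code, but a minimal sketch of this step might look like the following; the file name, the data frame name clinical, and the assumption that the “Indeterminate” value appears in PR.Status are all illustrative rather than taken from the lab.

clinical <- read.csv("clinical_data.csv", stringsAsFactors = TRUE)  # file name assumed

# Drop the single patient whose status is "Indeterminate" (assumed here to be in PR.Status)
clinical <- clinical[clinical$PR.Status != "Indeterminate", ]

# Recode the remaining statuses so that Negative = 0 and Positive = 1 (original labels assumed)
clinical$PR.Status <- factor(ifelse(clinical$PR.Status == "Positive", 1, 0))

# Second data set without ER.Status, used for the first decision tree
clinical_no_er <- clinical[, names(clinical) != "ER.Status"]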
Unlike some other machine learning methods, such as kNN, we do not have to check for correlated variables or scale our features, which saves time and effort. Like other models, however, decision trees still require training and testing splits. This was done using the standard 80% training / 20% testing ratio.
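One common way to produce such a split is caret's createDataPartition; the sketch below is illustrative, and the object names train_1 and test_1 are assumptions rather than the lab's actual names.

library(caret)

set.seed(1)  # seed value assumed, for reproducibility
train_idx <- createDataPartition(clinical_no_er$PR.Status, p = 0.8, list = FALSE)
train_1 <- clinical_no_er[train_idx, ]
test_1  <- clinical_no_er[-train_idx, ]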
After splitting the data for the classifier, we can look at the base rate for PR.Status, which is the target variable for our first decision tree. In the table below, the negative status observations are encoded as 0, while the positive statuses are encoded as 1.
##
## 0 1
## 51 53
## [1] "Base Rate: 0.509615384615385"
As we can see above, the base rate is roughly 51%, meaning 51% of the data has a positive PR status. With this information, we can say that the data is very balanced, as the number of positives is very close to the number of negatives. The difference between the two is only two observations.
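The counts and base rate above can be reproduced with a couple of lines; this assumes the cleaned data set from the earlier sketch, since the counts sum to all 104 remaining observations.

counts <- table(clinical_no_er$PR.Status)   # negative (0) vs. positive (1) PR statuses
counts

# Base rate = proportion of positive cases
paste("Base Rate:", counts["1"] / sum(counts))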
Next, we build the decision tree using the default settings. Once the tree is built, we can get a glimpse of the most important variables. Listed below are the most important variables according to the decision tree; the numbers are Gini-based importance scores, so the higher the value, the more important the variable.
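The report does not show the fitting call, but the cp chart and Gini-based importance suggest an rpart-style classification tree; a minimal sketch under that assumption is below.

library(rpart)

# Classification tree with default settings; PR.Status is the target
set.seed(1)  # seed assumed, since rpart's internal cross-validation is random
tree_1 <- rpart(PR.Status ~ ., data = train_1, method = "class")

# Importance = total improvement (Gini decrease) contributed by each variable's splits
tree_1$variable.importance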
## Days.to.Date.of.Last.Contact OS.Time
## 3.7222567 3.7222567
## Converted.Stage Age.at.Initial.Pathologic.Diagnosis
## 3.6991304 2.6118012
## AJCC.Stage Tumor
## 2.1495535 0.9924386
## Metastasis Metastasis.Coded
## 0.2902001 0.2902001
## Gender
## 0.2835539
As we can see in the table above, there is a tie for the most important variable: Days.to.Date.of.Last.Contact and OS.Time both have the highest Gini importance, at roughly 3.72. The variable Converted.Stage is not far behind, at roughly 3.69. These appear to be the most important variables for this particular decision tree.
We can also view the decision tree itself. Shown below is the tree we created.
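One way to draw a tree like the one shown is with the rpart.plot package; the call below is a sketch, not necessarily the plotting function used in the lab.

library(rpart.plot)

# Plot the fitted tree with split labels, class probabilities, and node percentages
rpart.plot(tree_1, type = 4, extra = 104)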
Complexity is another important aspect of the decision tree. The complexity parameter (cp) plays directly into how the tree is grown, so it is important to find an optimal value, as it leads to an optimally sized tree. Shown below is the cp chart.
Our interpretation of this graph is as follows: “A reasonable choice of cp for pruning is often the leftmost value where the mean is less than the horizontal line”. From this, we can note that the optimal choice of cp according to this graph is 0.035, as it is the first value to appear below the dashed horizontal line.
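A sketch of how the chart can be produced and acted upon is shown below, assuming the rpart-style tree from the earlier sketch; the pruning step simply uses the cp value read off the chart.

# Cross-validated error versus candidate cp values; the dashed horizontal line
# marks the minimum error plus one standard error
plotcp(tree_1)
printcp(tree_1)

# Prune at the cp value chosen from the chart
tree_1_pruned <- prune(tree_1, cp = 0.035)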
Now we can begin to assess the model using different evaluation metrics. Our first evaluations are the error rate (labeled as the hit/true error rate in the output below) and the detection rate. Both values were derived from the decision tree's confusion matrix on the test set.
## [1] "Hit Rate/True Error Rate: 40%"
## [1] "Detection Rate: 25%"
As can be seen above, we have an error rate of 40%, meaning 40% of the testing data was misclassified by our model. This is not a good error rate; we want to keep our errors as low as possible, so this is something that can be improved. We also see a detection rate of 25%. The detection rate is the proportion of all test observations that are positive for the target and correctly identified, which here means a positive PR status that the model labels as positive. Given that more than half of the test data is positive, a 25% detection rate is mediocre and could surely be improved.
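A sketch of how these two numbers can be derived from the test-set confusion counts follows, again using the illustrative object names from the earlier sketches.

# Predicted classes on the held-out test set
pr_predict <- predict(tree_1, test_1, type = "class")

# Confusion counts: rows are predictions, columns are actual values
conf <- table(Prediction = pr_predict, Actual = test_1$PR.Status)

# Error rate = share of test observations that were misclassified
paste0("Hit Rate/True Error Rate: ", 100 * (1 - sum(diag(conf)) / sum(conf)), "%")

# Detection rate = true positives as a share of all test observations
paste0("Detection Rate: ", 100 * conf["1", "1"] / sum(conf), "%")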
Next, we will look at the full confusion matrix to get several more metrics for evaluating our model.
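The statistics below appear to come from caret's confusionMatrix; a call along these lines (with the positive class set to 1) would produce them, mirroring the call shown later for the multi-class model.

library(caret)

confusionMatrix(as.factor(pr_predict), as.factor(test_1$PR.Status),
                positive = "1", dnn = c("Prediction", "Actual"))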
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 7 6
## 1 2 5
##
## Accuracy : 0.6
## 95% CI : (0.3605, 0.8088)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 0.4143
##
## Kappa : 0.2233
##
## Mcnemar's Test P-Value : 0.2888
##
## Sensitivity : 0.4545
## Specificity : 0.7778
## Pos Pred Value : 0.7143
## Neg Pred Value : 0.5385
## Prevalence : 0.5500
## Detection Rate : 0.2500
## Detection Prevalence : 0.3500
## Balanced Accuracy : 0.6162
##
## 'Positive' Class : 1
##
The confusion matrix reports many different metrics related to our model's performance on the data. The accuracy is 0.6, or 60%, which is sub-par but serviceable as a mildly reliable baseline. Our Kappa value is low, at 0.22, which indicates little to no agreement beyond chance in how values are classified. Our false positive rate, which is 1 - specificity, is also about 0.22, meaning around 22% of the actual negative observations were incorrectly predicted as positive.
Lastly, we can check the ROC and AUC as another metric of evaluation. The graph below shows the ROC curve.
As can be seen in the graph, the ROC curve covers somewhat more than half of the plot area, but not by a substantial margin. This is reflected in the AUC value of 0.6162, which means the area under the curve is 61.62% of the graph's total area. These metrics are only fair.
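A sketch of how the curve and the AUC can be computed with the pROC package is below; the package choice and the object names are assumptions.

library(pROC)

# Predicted probability of the positive class (column "1") on the test set
pr_prob <- predict(tree_1, test_1, type = "prob")[, "1"]

# ROC curve and area under the curve
roc_1 <- roc(response = test_1$PR.Status, predictor = pr_prob)
plot(roc_1)
auc(roc_1)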
Now we will repeat the process with a multi-class target. In this case, our target variable is Tumor, which has four types: T1, T2, T3, and T4. For this process, we re-include ER.Status in the model. The steps are mostly the same as for the binary-target tree, but a few of them differ. The same split ratio as before is used: 80% training, 20% testing.
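A sketch of this second split is below; it assumes the full cleaned data set (with ER.Status retained) and the illustrative names from the earlier sketches, although test_2 does appear later in the report.

# Second split: Tumor is the target and ER.Status is kept in the data
set.seed(1)  # seed assumed
train_idx_2 <- createDataPartition(clinical$Tumor, p = 0.8, list = FALSE)
train_2 <- clinical[train_idx_2, ]
test_2  <- clinical[-train_idx_2, ]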
After splitting the data for the classifier, we can look at the base rates for the different tumor types. Below are two tables: the first shows the number of occurrences of each tumor type, and the second gives the base rate, i.e., the percentage of observations belonging to each type.
##
## T1 T2 T3 T4
## 15 65 18 6
## T1 T2 T3 T4
## 1 14.42 62.5 17.31 5.77
As we can see, tumor type T2 has the highest base rate, at 62.5%, while T4 has the lowest base rate, at 5.77%, with types T3 and T1 falling in between the two. Given the vastly different base rates, we would say that the data is not very balanced in this scenario, which may be a drawback to our model’s effectiveness.
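These tables can be reproduced along the following lines, using the cleaned data set from the earlier sketches (the counts again sum to 104).

tumor_counts <- table(clinical$Tumor)
tumor_counts

# Base rate for each tumor type, as a percentage of all observations
round(100 * tumor_counts / sum(tumor_counts), 2)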
Next, we build the decision tree for this target, again using the default settings. Once the tree is built, we can get a glimpse of the most important variables. Listed below are the variables ranked by the C5.0 model's importance scores; the higher the value, the more important the variable.
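The exact fitting call is not shown in the report; one plausible sketch, assuming the C50 package with default settings and caret's varImp for the importance listing, is below.

library(C50)
library(caret)

# Multi-class C5.0 tree with default settings; Tumor is the target
tumor_tree <- C5.0(Tumor ~ ., data = train_2)

# Importance: by default, the percentage of training samples covered by
# splits on each variable (variables never used in a split score zero)
varImp(tumor_tree)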
## C5.0 variable importance
##
## Overall
## AJCC.Stage 100.00
## Node.Coded 41.67
## OS.Time 0.00
## Metastasis 0.00
## Metastasis.Coded 0.00
## Survival.Data.Form 0.00
## Days.to.date.of.Death 0.00
## Days.to.Date.of.Last.Contact 0.00
## OS.event 0.00
## HER2.Final.Status 0.00
## Gender 0.00
## PR.Status 0.00
## Converted.Stage 0.00
## Age.at.Initial.Pathologic.Diagnosis 0.00
## ER.Status 0.00
## Vital.Status 0.00
The list of most important variables is rather interesting: only two variables, AJCC.Stage and Node.Coded, have nonzero importance. All of the other variables are assigned zero importance for predicting tumor type, which is peculiar and somewhat hard to believe, although C5.0 reports zero importance for any variable that is never used in a split, so a short tree will produce many zeros. Given what we have, we will treat these two as the most important variables.
Unfortunately, we cannot view the decision tree or the cp chart here, given the constraints of the different packages being used to create the multi-class model.
Now we can begin to assess the model using different evaluation metrics. Our first evaluations are the overall error rate and the per-class detection rates, again derived from the decision tree's confusion matrix on the test set.
## [1] "Hit Rate/True Error Rate:30%"
## [1] "Detection Rate :5%" "Detection Rate :60%" "Detection Rate :0%"
## [4] "Detection Rate :5%"
As can be seen above, we have an error rate of 30%, meaning 30% of the testing data was misclassified by our model. This is an acceptable error rate, though we want to keep our errors as low as possible, so it can still be improved. We can also see the detection rate for each tumor type, i.e., the percentage of all test observations in which that tumor type was present and correctly classified.
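These rates can be derived from the test-set confusion counts much as before; the sketch below assumes the C5.0 model from the earlier sketch, and tumor_predict matches the object used in the confusionMatrix call that follows.

# Predicted tumor classes on the held-out test set
tumor_predict <- predict(tumor_tree, test_2)

conf_2 <- table(Prediction = tumor_predict, Actual = test_2$Tumor)

# Overall error rate across all four tumor types
paste0("Hit Rate/True Error Rate:", round(100 * (1 - sum(diag(conf_2)) / sum(conf_2))), "%")

# Per-class detection rates: correctly identified cases of each tumor type,
# as a share of all test observations
paste0("Detection Rate :", round(100 * diag(conf_2) / sum(conf_2)), "%")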
Next, we will look at the full confusion matrix to get several more metrics for evaluating our model.
confusionMatrix(as.factor(tumor_predict), as.factor(test_2$Tumor),
                dnn = c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction T1 T2 T3 T4
## T1 1 0 0 0
## T2 2 12 3 0
## T3 0 1 0 0
## T4 0 0 0 1
##
## Overall Statistics
##
## Accuracy : 0.7
## 95% CI : (0.4572, 0.8811)
## No Information Rate : 0.65
## P-Value [Acc > NIR] : 0.4166
##
## Kappa : 0.3023
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: T1 Class: T2 Class: T3 Class: T4
## Sensitivity 0.3333 0.9231 0.0000 1.00
## Specificity 1.0000 0.2857 0.9412 1.00
## Pos Pred Value 1.0000 0.7059 0.0000 1.00
## Neg Pred Value 0.8947 0.6667 0.8421 1.00
## Prevalence 0.1500 0.6500 0.1500 0.05
## Detection Rate 0.0500 0.6000 0.0000 0.05
## Detection Prevalence 0.0500 0.8500 0.0500 0.05
## Balanced Accuracy 0.6667 0.6044 0.4706 1.00
The confusion matrix reports many different metrics related to our model's performance on the data. The accuracy is 0.7, or 70%, which is decent but certainly not great. Our Kappa value is 0.30, which represents only fair agreement beyond chance in how values are classified. Our false positive rate, which is 1 - specificity, differs for each tumor class. The only class with a high false positive rate is T2, at roughly 71%, which we can also see in the confusion matrix, where five observations are misclassified as T2.
Lastly, we can check the ROC and AUC as another evaluation metric. The graphs below show the ROC curves. For the multi-class model, we have to generate a curve for each class, unlike the binary model, where there is just one curve.
As we can see, most of the classes have good ROC curves that take up a large share of the plot area. This is confirmed by the combined multi-class AUC of 0.8034, meaning the area under the curve is 80.34% of the total area. These metrics are fairly good.
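A sketch of how the per-class curves and the combined multi-class AUC can be computed with pROC is below; the package choice and object names are assumptions.

library(pROC)

# Class probabilities for each tumor type on the held-out test set
tumor_prob <- predict(tumor_tree, test_2, type = "prob")

# One-vs-all ROC curve for each tumor class
for (cls in colnames(tumor_prob)) {
  roc_cls <- roc(response = as.integer(test_2$Tumor == cls),
                 predictor = tumor_prob[, cls])
  plot(roc_cls, main = paste("ROC:", cls))
  print(auc(roc_cls))
}

# Combined multi-class AUC
multi_roc <- multiclass.roc(response = test_2$Tumor, predictor = tumor_prob)
auc(multi_roc)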
Given all of the information from these models and evaluations, we would say that we have developed decent starting models that could be built upon in the future. With better methods and improved circumstances, these models could surely be much improved. That is not to say we learned nothing from them: we were able to identify the most important variables associated with PR status and tumor type. For PR status, the variables Days.to.Date.of.Last.Contact, OS.Time, and Converted.Stage had the largest impact on the classification model. For tumor type, AJCC.Stage and Node.Coded had the largest impact. All of this information is interesting and could be valuable in the real world.
To improve upon these models, we can make a few recommendations. First, a larger data set would be valuable. This data set has only 105 observations (104 after the one exception was removed), which can be representative, but the models could only improve with more data points to learn from. More variables could also help; there are likely many more genetic, behavioral, and environmental factors that could help classify different cancer attributes. Next, more balanced data would be helpful moving forward. The Tumor variable was very unbalanced, as we saw in this lab, and unbalanced target variables make our results less reliable. Though it would be a challenge to balance every variable in the data, improving the balance of the target classes would be a clear improvement. Going forward, models like these can be used for great good. Detecting cancer and its attributes is a life-or-death matter, so the more information we gather, the better our models become, and the better we become at detecting and treating cancer properly. There is surely a long way to go, but the power of data science could get us there.