In this report, multiple supervised machine learning algorithms are trained and tested on a dataset for classifying breast cancer types. First, correlation-based feature selection and principal component analysis are used to reduce the dimensionality of the dataset, after which four algorithms are trained on each reduced dataset. The best performing model, a random forest, is then evaluated on a test set, resulting in an out-of-sample accuracy of 81.58%.
There are many different kinds of breast cancer, which can affect patients in many ways. Hence, each type of breast cancer requires its own approach to treatment. Through biopsies and the analysis of genetic data, crucial information is obtained regarding a tumor, which is subsequently used to choose the best treatment for a patient.
Several crucial elements in classifying breast tumors are the HR, HER2 and TN statuses. HR is short for hormone receptor. Breast tumors are tested for both progesterone and estrogen receptors. If either or both of these tests come out positive, it means that hormones fuel the breast cancer. Hence, for a tumor that tests positive, treatment can include drugs designed to affect hormone production.
HER2 is short for human epidermal growth factor receptor 2. This is a gene that produces HER2 receptors. These, in turn, play a role in how healthy cells in the breasts reproduce and repair themselves. If the HER2 gene is malfunctioning, it produces too many copies, leading to an overexpression, which in turn can lead to the formation of tumors. As these tumors tend to be rather aggressive, identifying HER2-positive breast cancer is crucial, as targeted treatments can be used against the HER2 protein.
TN stands for triple negative. A tumor is classified as such if the tests for estrogen receptors, progesterone receptors and HER2 all come out negative. These tumors are resistant to hormonal and HER2 treatments and therefore require other treatments.
Given the statuses above, correctly classifying tumors is crucial, a task for which machine learning might prove to be useful. Using a dataset containing genetic data, this study evaluates whether machine learning algorithms can be used to assist in the classification of tumors. To do so, several classification algorithms are used in combination with dimensionality reduction, as the dataset contains far more features than observations. Additional tuning of hyperparameters and other strategies to improve the models’ fit are not performed, as the main goal of this report is to establish the usefulness of the machine learning algorithms at their defaults.
To first get an idea of the data, the dataset is read into R using the foreign package. After doing so, a table is generated containing several general aspects of the data, as can be seen below.
| Rows | Columns | Factor features | Numeric features | Class of target | Features with missing values | Missing values in target |
|---|---|---|---|---|---|---|
| 157 | 12181 | 3045 | 9135 | factor | 0 | 0 |
As can be seen, the data consists of 157 rows and 12181 columns: 12180 features and the target variable. 75% of the features are numeric, whereas the remaining 25% are factors (a 3 to 1 ratio). The target variable is also a factor, meaning that a classification algorithm must be used for this machine learning problem. Lastly, the table shows that neither the features nor the target variable have missing values. Hence, imputation will not be necessary.
To get an idea of the features in the dataset, the first 15 observations of the first 8 features are shown below.
| 1_chrom1_reg2927-43870_probloss | 2_chrom1_reg2927-43870_probnorm | 3_chrom1_reg2927-43870_probgain | 4_chrom1_reg2927-43870_call | 5_chrom1_reg85022-216735_probloss | 6_chrom1_reg85022-216735_probnorm | 7_chrom1_reg85022-216735_probgain | 8_chrom1_reg85022-216735_call |
|---|---|---|---|---|---|---|---|
| 0.008 | 0.977 | 0.015 | 0 | 0.008 | 0.977 | 0.015 | 0 |
| 0.000 | 0.000 | 0.912 | 1 | 0.000 | 0.000 | 0.912 | 1 |
| 0.005 | 0.968 | 0.027 | 0 | 0.005 | 0.968 | 0.027 | 0 |
| 0.942 | 0.058 | 0.000 | -1 | 0.942 | 0.058 | 0.000 | -1 |
| 0.000 | 0.002 | 0.994 | 1 | 0.000 | 0.002 | 0.994 | 1 |
| 0.011 | 0.978 | 0.011 | 0 | 0.011 | 0.978 | 0.011 | 0 |
| 0.005 | 0.972 | 0.023 | 0 | 0.005 | 0.972 | 0.023 | 0 |
| 0.008 | 0.954 | 0.038 | 0 | 0.008 | 0.954 | 0.038 | 0 |
| 0.004 | 0.956 | 0.041 | 0 | 0.004 | 0.956 | 0.041 | 0 |
| 0.207 | 0.792 | 0.002 | 0 | 0.207 | 0.792 | 0.002 | 0 |
| 0.014 | 0.978 | 0.009 | 0 | 0.014 | 0.978 | 0.009 | 0 |
| 0.022 | 0.972 | 0.006 | 0 | 0.022 | 0.972 | 0.006 | 0 |
| 0.743 | 0.257 | 0.000 | -1 | 0.743 | 0.257 | 0.000 | -1 |
| 0.003 | 0.910 | 0.087 | 0 | 0.003 | 0.910 | 0.087 | 0 |
| 0.842 | 0.157 | 0.000 | -1 | 0.842 | 0.157 | 0.000 | -1 |
Based on the names of the features, the dataset seems to contain information regarding the genetic profile of the cancer and/or patient: each chromosomal region has a probloss, probnorm, probgain and call feature (the first three numeric, the last a factor, consistent with the earlier observation of the 3 to 1 ratio).
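The 3 to 1 ratio can be cross-checked against the counts in the overview table. The short sketch below uses Python purely for illustration (the analysis in this report was run in R) and assumes, based only on the feature names, that every region contributes exactly four features.

```python
# Counts taken from the overview table of the full dataset.
n_columns = 12181
n_factor_features = 3045
n_numeric_features = 9135

# All columns except the target are features.
n_features = n_columns - 1

# Assumption (inferred from the feature names): each region contributes
# three numeric probabilities (probloss, probnorm, probgain) and one call.
features_per_region = 4
n_regions = n_features // features_per_region

# Share of numeric features among all features.
numeric_share = n_numeric_features / n_features

print(n_features, n_regions, numeric_share)
```

Under this assumption, the number of regions equals the number of factor features, which is what one call feature per region would imply.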
In order to get a better understanding of the target variable, a histogram is plotted to evaluate its distribution over the several classes of breast cancer.
As can be seen in the histogram, the target variable consists of three classes, which are fairly evenly distributed over the data. Accordingly, the data is well balanced, meaning that accuracy can be used to evaluate the fit of the models and that the data can be used as is.
The dataset does, however, suffer from the so-called curse of dimensionality, as it consists of many more features than observations. Hence, dimensionality reduction must be applied to the dataset in order to pick the most relevant features for the classification. Two techniques are used to do so: Correlation-based Feature Selection (CFS) and Principal Component Analysis (PCA). With each technique, the same classification algorithms are trained on the reduced data and evaluated in terms of accuracy. The best performing model is then chosen and evaluated on a test set, in order to estimate the model’s out-of-sample error.
Correlation-based feature selection is a dimensionality reduction technique which searches for a subset of features that correlate strongly with the target but only weakly with each other. As the dataset contains a lot of features, my own computer was unable to perform the CFS. Luckily, I was able to use a different computer to run the CFS and save its output, which was then used to reduce the dataset. This reduced dataset is read into R, after which the general overview table is generated again.
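The trade-off between relevance and redundancy can be made concrete with the merit heuristic commonly associated with CFS (Hall's formulation): for a subset of k features, merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean feature-target correlation and r_ff the mean feature-feature correlation. The Python sketch below is illustrative only; the function name and the toy correlations are assumptions, and the CFS implementation actually used for this report may differ in its correlation measure and search strategy.

```python
from math import sqrt


def cfs_merit(feature_target_corrs, feature_feature_corrs):
    """Merit of a feature subset under the CFS heuristic.

    feature_target_corrs: one correlation per feature in the subset.
    feature_feature_corrs: one correlation per pair of features in the subset.
    """
    k = len(feature_target_corrs)
    if k == 0:
        raise ValueError("the subset must contain at least one feature")
    mean_target = sum(abs(r) for r in feature_target_corrs) / k
    if feature_feature_corrs:
        mean_inter = sum(abs(r) for r in feature_feature_corrs) / len(feature_feature_corrs)
    else:
        mean_inter = 0.0  # a single feature has no pairs
    return k * mean_target / sqrt(k + k * (k - 1) * mean_inter)


# Toy values (hypothetical): two equally relevant features that are either
# uncorrelated with each other or perfectly redundant.
independent = cfs_merit([0.5, 0.5], [0.0])
redundant = cfs_merit([0.5, 0.5], [1.0])
print(independent, redundant)
```

In this toy case the redundant pair scores no higher than a single feature with the same relevance, which is why the heuristic favours features that add information beyond what is already selected.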
| Rows | Columns | Numeric features | Class of target | Features with missing values | Missing values in target |
|---|---|---|---|---|---|
| 157 | 60 | 59 | factor | 0 | 0 |
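As can be seen in the table, the dataset has been reduced to 59 numeric features and the target variable. Using these features, the following 4 algorithms will be trained on the data: a support vector machine (SVM), naive Bayes (NB), a random forest (RF) and a gradient boosting machine (GBM).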
The data will first be split into a training set and a test set, after which the algorithms are trained on the training set.
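The report does not state how the split was made. The test-set confusion matrix reported later contains 38 samples, which leaves 119 of the 157 rows for training, or roughly a 75/25 split. A minimal Python sketch of a class-stratified split under that assumption is given below; the function name, the seed and the toy labels are illustrative and not taken from the original analysis.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=42):
    """Split row indices into train and test sets, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * train_fraction)
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical class sizes, not the real dataset).
toy_labels = ["HER2+"] * 40 + ["HR+"] * 60 + ["TN"] * 20
train_rows, test_rows = stratified_split(toy_labels)
print(len(train_rows), len(test_rows))
```

Splitting within each class keeps the class proportions of the training and test sets close to those of the full data, which matters when accuracy is the evaluation metric.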
Using the training set, the performance of the resulting models will be evaluated using 10-fold cross-validation with accuracy as the metric. The results are then plotted and shown in a table below. Before writing this report, the models were also tested with centered and scaled features. As this preprocessing did not improve the models, it is skipped in this report.
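As a rough illustration of the resampling scheme, the Python sketch below assigns rows to 10 folds and yields one train/validation pair per fold. It is a plain, non-stratified version written for this explanation; the function name and seed are assumptions, and the fold assignment used for the results in this report may differ (for example by stratifying on the target).

```python
import random


def k_fold_indices(n_samples, k=10, seed=1):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


# 119 training rows is inferred from 157 rows minus the 38 test samples.
fold_sizes = [len(held_out) for _, held_out in k_fold_indices(119)]
print(fold_sizes)
```

Each model is fitted on the nine training folds and scored on the held-out fold, so every training row is used for validation exactly once. With only 11 or 12 rows per held-out fold, a single misclassification shifts that fold's accuracy by eight to nine percentage points, which helps explain the wide spread between the minimum and maximum in the table.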
| Model | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s |
|---|---|---|---|---|---|---|---|
| CFS.SVM | 0.538 | 0.667 | 0.750 | 0.726 | 0.818 | 0.833 | 0 |
| CFS.NB | 0.500 | 0.628 | 0.739 | 0.708 | 0.750 | 0.917 | 0 |
| CFS.RF | 0.667 | 0.755 | 0.826 | 0.825 | 0.896 | 1.000 | 0 |
| CFS.GBM | 0.636 | 0.750 | 0.750 | 0.808 | 0.896 | 1.000 | 0 |
As can be seen in the table, the random forest model has the highest mean accuracy. To see how the models performed on the individual folds, a parallel plot is shown below.
As is apparent from the parallel plot, it is either the GBM or the RF model that performs best on each fold. As the random forest model performs best overall, it will be compared with the best model obtained using principal component analysis for dimensionality reduction, after which the better of the two is run on the test set.
Lastly, the confusion matrix of the random forest model is shown to evaluate its predictions.
Cross-Validated (10 fold) Confusion Matrix (entries are percentual average cell counts across resamples)

| Prediction (rows) vs. Reference (columns) | HER2+ | HR+ | TN |
|---|---|---|---|
| HER2+ | 29.4 | 2.5 | 1.7 |
| HR+ | 0.8 | 27.7 | 6.7 |
| TN | 1.7 | 4.2 | 25.2 |

Accuracy (average): 0.8235
As the confusion matrix shows, the model is somewhat imbalanced in its predictions, performing worst on the TN class. Further tuning of the model’s hyperparameters, along with possible bootstrapping strategies and the like, may improve its fit. This goes beyond the scope of this report, however, and the hyperparameters are therefore kept at their default values.
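The imbalance can be quantified by computing, for each reference class, the share of its samples that is predicted correctly (recall). The Python sketch below does this from the percentages in the cross-validated confusion matrix of the random forest; the variable and function names are illustrative.

```python
classes = ["HER2+", "HR+", "TN"]

# Rows are predictions, columns are reference classes.
# Entries are percentages of all resampled predictions.
cv_matrix = [
    [29.4, 2.5, 1.7],
    [0.8, 27.7, 6.7],
    [1.7, 4.2, 25.2],
]


def per_class_recall(matrix):
    """Diagonal entry divided by its column total, for each class."""
    recalls = []
    for j in range(len(matrix)):
        column_total = sum(row[j] for row in matrix)
        recalls.append(matrix[j][j] / column_total)
    return recalls


recall = dict(zip(classes, per_class_recall(cv_matrix)))
accuracy = sum(cv_matrix[i][i] for i in range(len(classes))) / sum(map(sum, cv_matrix))

for name, value in recall.items():
    print(f"{name}: {value:.2f}")
print(f"accuracy: {accuracy:.3f}")
```

Under this measure, HER2+ tumors are recovered most reliably and TN tumors least, with most TN errors being predictions of HR+. Because the percentages are rounded, the accuracy recomputed here can differ slightly from the reported 0.8235.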
Principal component analysis is a dimensionality reduction technique which uses linear algebra to convert the set of input features into an uncorrelated set of principal components, each of which explains a decreasing share of the variance in the dataset. The interpretability of the features, however, is lost. As the main goal of this study is to evaluate the effectiveness of machine learning in classifying breast cancer, not so much the interpretability of the models, this does not pose a problem. The question then remains, however, how many of the principal components should be used. To first get an idea of the principal components and the variance which they explain, the factor features are converted to numeric features, after which the PCA is performed and its results are shown.
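The share of variance explained by a component is its eigenvalue (the variance along that component) divided by the sum of all eigenvalues. The Python sketch below illustrates the calculation; the eigenvalues are made-up numbers chosen for readability and are not taken from this dataset.

```python
from itertools import accumulate


def explained_variance(eigenvalues):
    """Return per-component and cumulative shares of explained variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    ratios = [value / total for value in ordered]
    return ratios, list(accumulate(ratios))


# Hypothetical eigenvalues, for illustration only.
toy_eigenvalues = [4.0, 2.0, 1.0, 1.0]
ratios, cumulative = explained_variance(toy_eigenvalues)
print(ratios)
print(cumulative)
```

A plot of the per-component shares is what is usually called a scree plot; a steep initial drop means that the first few components capture a disproportionate part of the variance.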
As can be seen from the graph, the variance explained per principal component drops quickly after the first few components. To determine how many principal components give the best results, a loop will be used to train the same 4 models used in the CFS section (SVM, NB, RF and GBM) with differing numbers of principal components.
The results will then be evaluated, in order to obtain a better estimate of how many principal components can best be used and with which model.
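The search over models and numbers of components can be sketched as a simple double loop. The Python sketch below is schematic: `cv_accuracy` is a hypothetical, caller-supplied function standing in for the actual training and cross-validation, and the toy scorer and the range of component counts are placeholders whose numbers are unrelated to the results in this report.

```python
def best_configuration(component_counts, model_names, cv_accuracy):
    """Return (accuracy, model, n_components) with the highest CV accuracy.

    cv_accuracy(model, n_components) is assumed to train `model` on the first
    `n_components` principal components and return its mean CV accuracy.
    """
    results = []
    for n_components in component_counts:
        for model in model_names:
            results.append((cv_accuracy(model, n_components), model, n_components))
    # max() returns the first maximum, so with ascending component counts
    # ties are resolved in favour of fewer components.
    return max(results, key=lambda result: result[0])


# Toy scorer for demonstration only; it does not train anything.
def toy_scorer(model, n_components):
    return 0.5 + 0.01 * min(n_components, 10) + (0.02 if model == "B" else 0.0)


best = best_configuration(range(1, 31), ["A", "B"], toy_scorer)
print(best)
```

Resolving ties in favour of fewer components reflects the preference for simpler models when accuracy is similar.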
Based on the table, it appears that the support vector machine performs best using the first 27 principal components, although a case could also be made for the naive Bayes model using the first 7 principal components, as its accuracy is fairly similar while using far fewer features.
To see how the models perform over the individual folds, a parallel plot is shown below.
The support vector machine model performs best or is tied for best model on 9 out of the 10 folds, which is a good sign for its robustness in comparison to the other models. Lastly, the confusion matrix is shown to evaluate how well the model predicts the multiple classes.
Cross-Validated (10 fold) Confusion Matrix (entries are percentual average cell counts across resamples)

| Prediction (rows) vs. Reference (columns) | HER2+ | HR+ | TN |
|---|---|---|---|
| HER2+ | 22.7 | 5.0 | 5.0 |
| HR+ | 5.0 | 24.4 | 5.9 |
| TN | 4.2 | 5.0 | 22.7 |

Accuracy (average): 0.6975
From the confusion matrix of the support vector machine using 27 principal components, it appears that the model’s predictions are fairly balanced over the classes, with an average accuracy of 69.75% over the 10-fold cross-validation.
From the previous section, it is clear that training models using principal components of this dataset yields worse results than using correlation-based feature selection. This is most probably due to the high dimensionality of the dataset, which makes it unlikely that a small set of principal components can explain enough variance to produce accurate models. A contributing factor may be that PCA selects directions of high variance without using the class labels, whereas CFS selects features based on their correlation with the target.
Based on 10-fold cross-validation with accuracy as the metric, the random forest model trained on the CFS dataset performs best. The last step is to evaluate this model on the test set to estimate its out-of-sample error. This is done below.
Confusion Matrix and Statistics

| Prediction (rows) vs. Reference (columns) | HER2+ | HR+ | TN |
|---|---|---|---|
| HER2+ | 12 | 2 | 0 |
| HR+ | 0 | 10 | 4 |
| TN | 0 | 1 | 9 |

Overall Statistics

| Statistic | Value |
|---|---|
| Accuracy | 0.8158 |
| 95% CI | (0.6567, 0.9226) |
| No Information Rate | 0.3421 |
| P-Value [Acc > NIR] | 2.744e-09 |
| Kappa | 0.7241 |
| Mcnemar's Test P-Value | NA |

Statistics by Class

| Statistic | Class: HER2+ | Class: HR+ | Class: TN |
|---|---|---|---|
| Sensitivity | 1.0000 | 0.7692 | 0.6923 |
| Specificity | 0.9231 | 0.8400 | 0.9600 |
| Pos Pred Value | 0.8571 | 0.7143 | 0.9000 |
| Neg Pred Value | 1.0000 | 0.8750 | 0.8571 |
| Prevalence | 0.3158 | 0.3421 | 0.3421 |
| Detection Rate | 0.3158 | 0.2632 | 0.2368 |
| Detection Prevalence | 0.3684 | 0.3684 | 0.2632 |
| Balanced Accuracy | 0.9615 | 0.8046 | 0.8262 |
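From the results, it is clear that the random forest’s out-of-sample accuracy, 81.58%, is very close to its cross-validated accuracy of 82.35% on the training set. Accordingly, the model seems to generalize quite well, although the 95% confidence interval (65.67% to 92.26%) shows that an estimate based on 38 test samples remains uncertain.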
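The headline statistics can be reproduced directly from the test-set confusion matrix. The Python sketch below recomputes the accuracy, the no-information rate, Cohen's kappa and the per-class sensitivity from the reported counts; the variable names are illustrative, and the confidence interval and p-value are not recomputed here.

```python
classes = ["HER2+", "HR+", "TN"]

# Rows are predictions, columns are reference classes (counts).
test_matrix = [
    [12, 2, 0],
    [0, 10, 4],
    [0, 1, 9],
]

n = sum(map(sum, test_matrix))
accuracy = sum(test_matrix[i][i] for i in range(len(classes))) / n

row_totals = [sum(row) for row in test_matrix]
col_totals = [sum(row[j] for row in test_matrix) for j in range(len(classes))]

# No-information rate: accuracy of always predicting the largest class.
no_information_rate = max(col_totals) / n

# Cohen's kappa: agreement beyond what the marginal totals alone would give.
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
kappa = (accuracy - expected) / (1 - expected)

# Sensitivity: correctly predicted share of each reference class.
sensitivity = {
    name: test_matrix[j][j] / col_totals[j] for j, name in enumerate(classes)
}

print(round(accuracy, 4), round(no_information_rate, 4), round(kappa, 4))
print({name: round(value, 4) for name, value in sensitivity.items()})
```

The accuracy corresponds to 31 correct predictions out of 38, and the lowest sensitivity is again found for the TN class, in line with the cross-validated results.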
The question remains, however, whether the model’s accuracy is sufficient. As the model’s predictions would be used to decide which treatment a patient receives, its error rate may well be too high, as a misclassification can lead to an ineffective treatment. Accordingly, further improvements to the model should be attempted, but this is outside the scope of this report.