I went back and forth about whether I wanted to learn neural networks or support vector machines (SVMs). Both seemed really cool, and I liked the idea that neural networks work similarly to the neural pathways in the brain. However, as I was researching neural networks, everything kept bringing me back to SVM; even the book talked about SVM on the first page of the neural network chapter. That felt like a sign to me. I also figured that the data set I wanted to look into (breast cancer data from the UCI Machine Learning Repository) might be better suited to SVM because of SVM's strong ability to classify. Furthermore, SVM is considered one of the best out-of-the-box classifiers. SVM is intended for binary classification, where there are exactly two classes, although it can be extended to handle more than two. The goal of this project is to learn SVM using a data set from breast cancer research so that I can predict whether a tumor is benign or malignant. I chose this data set because I was interested in working with cancer data, and I found one that was created at a hospital in Wisconsin (where I will be doing my internship this summer!). I first found the data set through the UCI Machine Learning Repository and later found it on Kaggle.
# My research question:
How can we accurately differentiate between benign and malignant tumors using SVM?
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ readr 2.1.5
✔ ggplot2 3.5.2 ✔ stringr 1.5.1
✔ lubridate 1.9.4 ✔ tibble 3.2.1
✔ purrr 1.0.4 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
These are libraries that I tend to use, so I am loading them here as a base.
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 568 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): diagnosis
dbl (31): id, radius_mean, texture_mean, perimeter_mean, area_mean, smoothne...
lgl (1): ...33
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Before we begin to classify our data using SVM, let's get a better understanding of what our data set looks like by examining some graphs in a Shiny app and performing some hierarchical clustering.
library(shiny)

# Three drop-downs choose the x, y, and color variables for a scatter plot
ui <- fluidPage(
  selectInput(inputId = "XVariable",
              label = "Choose a variable for the x-axis!",
              choices = colnames(Wisconsin_Breast_cancer)),
  selectInput(inputId = "YVariable",
              label = "Choose a variable for the y-axis",
              choices = colnames(Wisconsin_Breast_cancer)),
  selectInput(inputId = "ColorVariable",
              label = "Choose a variable for the color!",
              choices = colnames(Wisconsin_Breast_cancer)),
  plotOutput(outputId = "BC_Plot")
)

server <- function(input, output, session) {
  output$BC_Plot <- renderPlot({
    ggplot(Wisconsin_Breast_cancer) +
      geom_point(aes(x = !!sym(input$XVariable),
                     y = !!sym(input$YVariable),
                     color = !!sym(input$ColorVariable)))
  })
}

shinyApp(ui, server)
Shiny applications not supported in static R Markdown documents
This Shiny app shows us what our data looks like and lets us see the relationships between variables and how they are associated with diagnosis type. It appears that malignant and benign tumors already have a few characteristics that are distinct from each other. For example, the bigger the tumor is (in terms of radius, perimeter, and area), the more likely it is to be malignant. Furthermore, the more compact and concave a tumor is, the more likely it is to be malignant. Factors such as smoothness and symmetry, on the other hand, do not seem to have a large effect on whether a tumor is malignant or benign.
However, something interesting that I noticed was a weird column labeled ...33 that doesn't appear to have any information in it. I am going to check whether there are any data points present in this column (and whether any data points are missing in other columns) using mice. Depending on how many data points are missing, I may remove the column or use mice to impute the missing values.
library(mice)
Attaching package: 'mice'
The following object is masked from 'package:stats':
filter
The following objects are masked from 'package:base':
cbind, rbind
It appears there was a column that was entirely NA values. The column name was "...33". I am not quite sure what this column was supposed to be, and Kaggle does not have any information about what it might represent, so I will remove it before moving forward. There are no other columns with NA values.
Now that the data set is a bit cleaner, I am interested in looking at the patterns in it using hierarchical clustering to gain a stronger understanding of what this data looks like.
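The chunk that performed this check is not shown in the rendered output, so here is a minimal sketch of the idea in base R. The data frame `toy` and its column `empty_col` are made up for illustration (`empty_col` plays the role of "...33"); with mice loaded, `md.pattern()` gives a similar per-column summary of missingness.

```r
# Toy stand-in for the breast cancer data (hypothetical values);
# `empty_col` plays the role of the all-NA column "...33".
toy <- data.frame(
  id          = 1:5,
  radius_mean = c(17.9, 20.6, 19.7, 11.4, 20.3),
  empty_col   = NA
)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Drop every column that is entirely NA and keep the rest.
all_na  <- na_counts == nrow(toy)
cleaned <- toy[, !all_na, drop = FALSE]
print(names(cleaned))
```

Dropping by "entirely NA" rather than by name avoids hard-coding the odd column name, but selecting the column by name would work just as well here.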
library(dendextend)
---------------------
Welcome to dendextend version 1.19.0
Type citation('dendextend') for how to cite the package.
Type browseVignettes(package = 'dendextend') for the package vignette.
The github page is: https://github.com/talgalili/dendextend/
Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
You may ask questions at stackoverflow, use the r and dendextend tags:
https://stackoverflow.com/questions/tagged/dendextend
To suppress this message use: suppressPackageStartupMessages(library(dendextend))
---------------------
Attaching package: 'dendextend'
The following object is masked from 'package:stats':
cutree
This seems very promising, especially considering that the Shiny app already showed some clear differentiation between the two classes. Given this information, I think it is time to start moving into some SVM.
First, we need to split the data set into training and testing sets. We also need to make sure our classifying variable (diagnosis) is a factor, so we will check that and convert it if needed.
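The clustering chunk itself is hidden in the rendered output, so the following is only a sketch of what hierarchical clustering on this kind of data could look like. It uses synthetic data rather than the real data set, and the Ward linkage is my assumption, not necessarily what was used above.

```r
set.seed(1)

# Synthetic stand-in: two groups of "tumors" that differ in size and texture.
toy <- data.frame(
  radius_mean  = c(rnorm(20, mean = 12, sd = 1), rnorm(20, mean = 18, sd = 1)),
  texture_mean = c(rnorm(20, mean = 18, sd = 2), rnorm(20, mean = 22, sd = 2)),
  diagnosis    = rep(c("B", "M"), each = 20)
)

# Scale the features so no single variable dominates the distances.
features <- scale(toy[, c("radius_mean", "texture_mean")])

# Agglomerative clustering on Euclidean distances (linkage is an assumption).
tree <- hclust(dist(features), method = "ward.D2")

# Cut the tree into two clusters and compare them with the known diagnosis.
clusters <- cutree(tree, k = 2)
print(table(Cluster = clusters, Diagnosis = toy$diagnosis))
```

If the two clusters line up closely with the two diagnoses, that is a hint that the classes are separable before any SVM is fitted. `plot(tree)` draws the dendrogram.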
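The split chunk is not shown, so this is a sketch under assumptions: the seed and the toy values are invented, and the 50/50 proportion is inferred from the confusion tables later on, where the training and testing sets each contain 284 rows.

```r
set.seed(42)  # assumed seed, only so the split is reproducible

# Toy stand-in for the cleaned data set (hypothetical values).
toy <- data.frame(
  id          = 1:10,
  radius_mean = c(17.9, 20.6, 19.7, 11.4, 20.3, 12.4, 18.2, 13.7, 13.0, 12.5),
  diagnosis   = c("M", "M", "M", "B", "M", "B", "M", "B", "B", "B")
)

# The response must be a factor for svm() to do classification.
if (!is.factor(toy$diagnosis)) {
  toy$diagnosis <- factor(toy$diagnosis, levels = c("B", "M"))
}

# Randomly pick half of the rows for training; the rest are for testing.
train_rows <- sample(nrow(toy), size = floor(0.5 * nrow(toy)))
training   <- toy[train_rows, ]
testing    <- toy[-train_rows, ]

print(table(training$diagnosis))
print(table(testing$diagnosis))
```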
I think the most important thing to discuss is the hyperplane. The goal of the hyperplane is to create a divider between the different classes in your data set. If a data point lies on one side of this divider, it is classified as class A, and if it falls on the other side, it is classified as class B. In two dimensions, a hyperplane is a straight line, while in three dimensions it is a flat plane. With more than three dimensions it is harder to visualize, but its role as a divider of classes stays the same. The best hyperplane has the largest margin (i.e. the most space) between itself and the closest data points on either side; this is called the maximal margin hyperplane. Those closest data points are called support vectors, and they determine the margins, which in turn determine the best hyperplane for classifying the data. If there is no way to perfectly separate the classes using a hyperplane, we can still separate them almost perfectly using something called soft margins. In this case, we allow some data points to fall on the wrong side of the margin (and even the wrong side of the hyperplane) so that the SVM can still make accurate predictions on new data.
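To make the idea of a hyperplane and its support vectors concrete, here is a small sketch on made-up, linearly separable 2-D data (not the breast cancer data). It assumes the e1071 package is installed and uses `scale = FALSE` so the hyperplane stays in the original units; the recipe `w = t(coefs) %*% SV`, `b = -rho` is the usual way to recover a linear hyperplane from an e1071 fit.

```r
library(e1071)

# Toy, linearly separable data with two classes (hypothetical values).
toy <- data.frame(
  x1    = c(1, 2, 2, 3, 6, 7, 7, 8),
  x2    = c(1, 2, 0, 1, 6, 7, 5, 6),
  class = factor(rep(c("A", "B"), each = 4))
)

# Fit a linear SVM without internal scaling.
fit <- svm(class ~ x1 + x2, data = toy, kernel = "linear",
           cost = 10, scale = FALSE)

# Row indices of the support vectors: the points that define the margin.
print(fit$index)

# Recover the hyperplane w . x + b = 0 from the fitted model.
w <- t(fit$coefs) %*% fit$SV
b <- -fit$rho

# Signed score for every point; the sign tells us which side it falls on.
X      <- as.matrix(toy[, c("x1", "x2")])
scores <- as.vector(X %*% t(w) + b)

# These scores should match the decision values reported by predict().
pred <- predict(fit, toy, decision.values = TRUE)
dv   <- as.vector(attr(pred, "decision.values"))
print(all.equal(scores, dv))
```

Only a few of the eight points end up as support vectors; moving any of the other points (without crossing the margin) would leave the hyperplane unchanged.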
One of the many benefits of SVM is that it can still be highly accurate when the data we are attempting to classify cannot be separated linearly. When the classes are not linearly separable, a kernel function lets the SVM fit other shapes of decision boundary. The kernels that are most often used are linear, polynomial, radial, and sigmoid. Our first step before moving on is to find the kernel type that gives the most accurate boundary for separating benign and malignant tumors in our data.
Linear SVMs use straight hyperplanes, polynomial SVMs use curved boundaries, radial SVMs tend to produce rounded, roughly circular boundaries, and sigmoid SVMs produce more of an S-shaped boundary.
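For reference, these are the four kernel functions as defined in the e1071 documentation, where u and v are two observations, d is `degree`, and c_0 is `coef.0` (the symbols are my notation). Note that only the linear kernel has no gamma term.

```latex
\begin{aligned}
\text{linear:}     \quad & K(u, v) = u^\top v \\
\text{polynomial:} \quad & K(u, v) = \left(\gamma\, u^\top v + c_0\right)^{d} \\
\text{radial:}     \quad & K(u, v) = \exp\!\left(-\gamma\, \lVert u - v \rVert^{2}\right) \\
\text{sigmoid:}    \quad & K(u, v) = \tanh\!\left(\gamma\, u^\top v + c_0\right)
\end{aligned}
```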
Another important part of SVM is the support vector classifier, which determines how to classify a data point based on where it falls relative to the hyperplane. The support vector classifier can be tuned further using gamma and cost after a kernel type has been selected.
The first SVM we will look at is a linear SVM!
library(e1071)
linear_cancer_SVM <- svm(diagnosis ~ . - id, data = Breast_cancer_training_data, kernel = 'linear')
linear_cancer_SVM
Call:
svm(formula = diagnosis ~ . - id, data = Breast_cancer_training_data,
kernel = "linear")
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 1
Number of Support Vectors: 26
table(Breast_cancer_training_data$diagnosis)
B M
190 94
Perfect! Now let's check the accuracy on the training data.
table(Prediction =predict(linear_cancer_SVM, Breast_cancer_training_data), Truth = Breast_cancer_training_data$diagnosis)
Truth
Prediction B M
B 189 2
M 1 92
So this is great! Out of 284 training samples, it only mislabeled 3 of them. I think this will most likely be the model we fine-tune, but I want to look at the three other kernel types to make sure that this is the best model before we move forward.
Now let's create a polynomial SVM and see what the accuracy is!
polynomial_cancer_SVM <- svm(diagnosis ~ . - id, data = Breast_cancer_training_data, kernel = 'polynomial')
polynomial_cancer_SVM
Call:
svm(formula = diagnosis ~ . - id, data = Breast_cancer_training_data,
kernel = "polynomial")
Parameters:
SVM-Type: C-classification
SVM-Kernel: polynomial
cost: 1
degree: 3
coef.0: 0
Number of Support Vectors: 82
table(Prediction =predict(polynomial_cancer_SVM, Breast_cancer_training_data), Truth = Breast_cancer_training_data$diagnosis)
Truth
Prediction B M
B 190 20
M 0 74
This is significantly worse at correctly identifying malignant tumors: it misidentifies 20 malignant tumors as benign. We will most likely not be using this model.
Let’s look at a radial kernel now!
radial_cancer_SVM <- svm(diagnosis ~ . - id, data = Breast_cancer_training_data, kernel = 'radial')
radial_cancer_SVM
Call:
svm(formula = diagnosis ~ . - id, data = Breast_cancer_training_data,
kernel = "radial")
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 76
table(Prediction =predict(radial_cancer_SVM, Breast_cancer_training_data), Truth = Breast_cancer_training_data$diagnosis)
Truth
Prediction B M
B 190 3
M 0 91
While this technically has the same accuracy as the linear SVM model (3 errors out of 284), it misclassifies malignant tumors as benign more often than the linear model does (3 versus 2), which I think is an issue. Misdiagnosing a malignant tumor as benign is costlier than misdiagnosing a benign tumor as malignant. Therefore, I still believe that the linear model is the best for our goals. However, let's look at the sigmoid SVM before fully coming to this decision.
sigmoid_cancer_SVM <- svm(diagnosis ~ . - id, data = Breast_cancer_training_data, kernel = 'sigmoid')
sigmoid_cancer_SVM
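The comparison between the linear and radial models can be checked directly from the two training-set confusion tables. This sketch just re-enters the counts printed above; the helper names (`make_cm`, `accuracy`, `missed_malignant`) are my own.

```r
# Build a 2x2 confusion matrix (rows = prediction, columns = truth).
make_cm <- function(values) {
  matrix(values, nrow = 2, byrow = TRUE,
         dimnames = list(Prediction = c("B", "M"), Truth = c("B", "M")))
}

# Counts copied from the linear and radial training-set tables.
linear_cm <- make_cm(c(189, 2,
                         1, 92))
radial_cm <- make_cm(c(190, 3,
                         0, 91))

# Overall accuracy: correct predictions divided by all predictions.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Malignant tumors that were predicted to be benign (false negatives).
missed_malignant <- function(cm) unname(cm["B", "M"])

print(c(linear = accuracy(linear_cm), radial = accuracy(radial_cm)))
print(c(linear = missed_malignant(linear_cm), radial = missed_malignant(radial_cm)))
```

Both models get 281 of 284 right, so accuracy alone cannot separate them; the false-negative count is what tips the choice toward the linear kernel.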
Call:
svm(formula = diagnosis ~ . - id, data = Breast_cancer_training_data,
kernel = "sigmoid")
Parameters:
SVM-Type: C-classification
SVM-Kernel: sigmoid
cost: 1
coef.0: 0
Number of Support Vectors: 48
table(Prediction =predict(sigmoid_cancer_SVM, Breast_cancer_training_data), Truth = Breast_cancer_training_data$diagnosis)
Truth
Prediction B M
B 187 10
M 3 84
This is also noticeably less accurate than the linear model (13 errors instead of 3), so we will be moving forward with the linear SVM model. Because we are using a linear kernel, the only tuning parameter we will be looking at is cost. With sigmoid, radial, and polynomial kernels you would tune both gamma and cost.
Although we are not tuning gamma, I do want to take a moment to explain what it does. Gamma adjusts the shape of the decision boundary. Because a linear kernel always produces a straight boundary, there is no shape to adjust, so gamma does not apply. For non-linear kernels, however, gamma is a powerful tuning parameter. Low gamma tends to produce smooth boundaries that separate the data in general but can be prone to a few incorrect classifications. High gamma produces boundaries that wrap tightly around the support vectors and minimize the number of incorrect classifications, but this can overfit the model to the training data set. Gamma is critical for accurate radial, sigmoid, and polynomial kernels.
Cost controls how much the model cares about misclassifying data points. When cost is high, there is a high penalty for misclassification, which can result in overfitting the model to the training data set. This can create hyperplanes that have small margins and tend to be a bit more complex. If the cost is small, the model is allowed to make a few misclassifications and tends to be simpler. Unfortunately, if the cost is too low, the model can underfit the data set and be less accurate overall.
Fortunately for us, there is a tuning function that lets us try a range of cost values and find the best one for our model.
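The tuning chunk is hidden in the rendered output, so this is only a sketch of how `tune()` from e1071 can search over cost. The data here is a two-class subset of the built-in iris data as a stand-in, and the grid of powers of two is a guess (the cost values mentioned later, 0.0625 and 0.25, are both powers of two).

```r
library(e1071)

set.seed(123)  # tune() cross-validates on random folds, so fix the seed

# Stand-in two-class data: iris without the setosa species.
two_class <- droplevels(subset(iris, Species != "setosa"))

# Candidate cost values (assumed grid).
cost_grid <- 2^(-4:4)

# Cross-validated search over cost for a linear-kernel SVM.
tuned <- tune(svm, Species ~ ., data = two_class,
              kernel = "linear",
              ranges = list(cost = cost_grid))

print(tuned$best.parameters)   # cost with the lowest cross-validation error
best_model <- tuned$best.model # SVM refitted with that cost
```

For a radial kernel, the same call would take `ranges = list(cost = ..., gamma = ...)` to search both parameters at once. Because the folds are random, the recommended cost can change between runs unless the seed is fixed.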
table(Prediction =predict(cancer_linear_cost_svm, Breast_cancer_training_data), Truth = Breast_cancer_training_data$diagnosis)
Truth
Prediction B M
B 189 2
M 1 92
Okay, so this is what I expected for the linear SVM. I am going to experiment for a minute and see whether we can get better accuracy using the radial kernel. Radial, sigmoid, and polynomial kernels often improve noticeably once they are tuned, so I am hoping that will be the case once the radial SVM is tuned.
table(Prediction =predict(radial_cost_gamma_svm, Breast_cancer_training_data), Truth = Breast_cancer_training_data$diagnosis)
Truth
Prediction B M
B 189 3
M 1 91
Okay, so this is relatively good, but I think we can make it even better by weighting the classes so that misclassifying malignant tumors as benign is penalized more than misclassifying benign tumors as malignant. I would love to get our predictions to a place where all malignant tumors are properly classified. Currently, the data has about twice as many benign tumors as malignant tumors, which can skew the model toward predicting benign. By adding weights and putting a higher emphasis on the malignant tumors, we may be able to improve this model.
weights1 <- c(B = 1, M = 2)
weights2 <- c(B = 1, M = 3)
weighted_radial_cost_gamma_svm <- svm(diagnosis ~ . - id, data = Breast_cancer_training_data, kernel = "radial", gamma = 0.01, cost = 4, class.weights = weights2)
table(Prediction =predict(weighted_radial_cost_gamma_svm, Breast_cancer_training_data), Truth = Breast_cancer_training_data$diagnosis)
Truth
Prediction B M
B 188 2
M 2 92
Okay, this is even better. I am going to see whether increasing the weights improves the predictions even further. Yes!!! 100% accuracy!! I am now going to run this on the testing data as a final check to see whether this is simply overfitting or whether the model is truly accurate.
table(Prediction =predict(weighted_radial_cost_gamma_svm, Breast_cancer_testing_data), Truth = Breast_cancer_testing_data$diagnosis)
Truth
Prediction B M
B 163 9
M 3 109
Okay, so I potentially overfit the model. I am going to go back to weights1 and see if the accuracy is better.
Truth
Prediction B M
B 164 10
M 2 108
Okay, so this is worse, but not drastically different from using weights2. I am going to try the linear SVM and see if that is slightly better. If it is more accurate than the weighted radial SVM, then I will see whether adding weights to the linear SVM makes it more accurate.
table(Prediction =predict(cancer_linear_cost_svm, Breast_cancer_testing_data), Truth = Breast_cancer_testing_data$diagnosis)
Truth
Prediction B M
B 163 8
M 3 110
The linear SVM does appear to be more accurate than the weighted radial model, so I am going to add weights to the linear SVM to try to make it better.
I am using a cost of 0.0625 instead of 0.25. Although the tuning function recommends 0.25 in the rendered document, it recommended 0.0625 in an earlier run, and with weights, 0.0625 is more accurate than 0.25.
weights1 <- c(B = 1, M = 2)
weights2 <- c(B = 1, M = 3)
cancer_linear_cost_weight_svm <- svm(diagnosis ~ . - id, data = Breast_cancer_training_data, kernel = 'linear', cost = 0.0625, class.weights = weights2)
table(Prediction =predict(cancer_linear_cost_weight_svm, Breast_cancer_testing_data), Truth = Breast_cancer_testing_data$diagnosis)
Truth
Prediction B M
B 161 3
M 5 115
Truth
Prediction B M
B 162 6
M 4 112
With weights1, the number of malignant tumors misdiagnosed as benign does decrease (from 8 to 6). I am going to try weights2 to see whether putting even more weight on the malignant class reduces the number of mislabeled malignant tumors further.
Truth
Prediction B M
B 161 3
M 5 115
Yes! This is exactly what I was hoping for, and I am happy with this accuracy. We now have a relatively accurate SVM model that can differentiate between benign and malignant tumors! About 63% of the errors (5 of 8) are misclassifications of a benign tumor as malignant, which in this situation is less harmful than misclassifying a malignant tumor as benign. Furthermore, the model only makes an error 2.8% of the time (8 of 284 test samples), which means it is accurate 97.2% of the time!
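As a check on those final numbers, here is a short sketch that recomputes them from the last confusion table (weighted linear SVM on the testing data). The sensitivity line is an extra metric I added, not one reported above.

```r
# Final test-set confusion matrix (rows = prediction, columns = truth).
final_cm <- matrix(c(161,   3,
                       5, 115),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Prediction = c("B", "M"),
                                   Truth      = c("B", "M")))

false_negatives <- final_cm["B", "M"]  # malignant predicted as benign
false_positives <- final_cm["M", "B"]  # benign predicted as malignant
errors          <- false_negatives + false_positives

# Overall error rate and the share of errors that are false positives.
error_rate <- errors / sum(final_cm)
fp_share   <- false_positives / errors

# Sensitivity: fraction of truly malignant tumors that were caught.
sensitivity <- final_cm["M", "M"] / sum(final_cm[, "M"])

print(round(100 * c(error_rate = error_rate,
                    fp_share = fp_share,
                    sensitivity = sensitivity), 1))
```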