The following packages are required to rerun this .Rmd file:
library(tidyverse)   # data manipulation (select(), etc.)
library(rpart)       # decision trees
library(rpart.plot)  # decision tree plotting
library(knitr)       # kable() tables
library(gridExtra)   # arranging multiple plots
library(ranger)      # random forests
library(fastDummies) # dummy-variable encoding
library(e1071)       # svm() and tune.svm()
seed_num <- 42 # common seed value for randomized operations
set.seed(seed_num)
The purpose of the work presented here is to evaluate the performance of Support Vector Machine (SVM) algorithms when used to solve classification problems. Furthermore, the results of these algorithms are compared against various tree-based models, specifically those reported in my previously written RPubs article.
The dataset used here was sourced from Kaggle, and contains information on various mushroom species. Information on how the data was originally collected can be found in this article.
The cell below imports the data from the
mushroom_cleaned.csv file and stores it as a dataframe,
mush:
mush <- read.csv("data/mushroom_cleaned.csv")
kable(mush[1:5,], caption='First 5 rows of mush dataframe:')
| cap.diameter | cap.shape | gill.attachment | gill.color | stem.height | stem.width | stem.color | season | class |
|---|---|---|---|---|---|---|---|---|
| 1372 | 2 | 2 | 10 | 3.807467 | 1545 | 11 | 1.8042727 | 1 |
| 1461 | 2 | 2 | 10 | 3.807467 | 1557 | 11 | 1.8042727 | 1 |
| 1371 | 2 | 2 | 10 | 3.612496 | 1566 | 11 | 1.8042727 | 1 |
| 1261 | 6 | 2 | 10 | 3.787572 | 1566 | 11 | 1.8042727 | 1 |
| 1305 | 6 | 2 | 10 | 3.711971 | 1464 | 11 | 0.9431946 | 1 |
The rest of this section is copied directly from Section 2 of the aforementioned RPubs article:
Each row in the mush dataframe represents measurements
taken for a single mushroom, with the class field in
mush referring to whether or not the mushroom is edible.
This is a binary field with each value representing the following:

- 0: edible
- 1: poisonous
The remaining eight fields are all feature variables that quantitatively/qualitatively describe the mushroom, and are defined as follows:
| Field Name | Variable Type/Data Type | Description |
|---|---|---|
| cap.diameter | Continuous/Float | The diameter of the mushroom’s cap in \(10^{-4}\) m. |
| cap.shape | Categorical/Integer | A value from 0-6 that represents the shape of the mushroom cap. These include “bell”, “conical”, “convex”, “flat”, “sunken”, “spherical”, and “others”, respectively. |
| gill.attachment | Categorical/Integer | A value from 0-6 that represents how the mushroom gills are connected to its cap. These include “adnate”, “adnexed”, “decurrent”, “free”, “sinuate”, “pores”, and “unknown”, respectively. |
| gill.color | Categorical/Integer | A value from 0-11 that represents the color of the mushroom’s gills. These include “brown”, “buff”, “gray”, “green”, “pink”, “purple”, “red”, “white”, “yellow”, “blue”, “orange”, and “black”, respectively. |
| stem.height | Continuous/Float | The height of the mushroom’s stem (base to cap) in cm. |
| stem.width | Continuous/Float | The width of the mushroom’s stem (at its base) in \(10^{-4}\) m. |
| stem.color | Categorical/Integer | A value from 0-11 that represents the color of the mushroom’s stem. See gill.color for the list of colors. |
| season | Continuous/Float | A value that represents the time of year the measurements for the mushroom were taken. Larger values represent measurements that were taken further along in the year. |
Based on the output below, the mush dataframe contains measurements of 54,035 mushrooms:
nrow(mush)
## [1] 54035
Of these observations, the following output summarizes the quantity of each that were classified as edible/non-edible:
table(mush$class)
##
## 0 1
## 24360 29675
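Expressed as proportions, the classes are moderately imbalanced (roughly 45/55). A quick check, added here as a minimal sketch:
round(prop.table(table(mush$class)), 3)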
SVM models, unlike their tree-based counterparts, do not scale well to large amounts of data and can be difficult to train with too many observations. As such, the cell below randomly samples 10,000 observations from mush (reducing its size by more than 80%):
set.seed(seed_num)
n <- 10000
sample_choices <- sample(1:nrow(mush), n, replace=FALSE)
mush <- mush[sample_choices,]
The cell below replicates the rest of the data cleaning steps carried out in Section 3 of the aforementioned RPubs article.
# standardize length measurements
mush$cap.diameter <- mush$cap.diameter / 10
mush$stem.height <- mush$stem.height * 10
mush$stem.width <- mush$stem.width / 10
# balance classes by undersampling
set.seed(seed_num)
class_0 <- which(mush$class == 0)
class_1 <- which(mush$class == 1)
class_1 <- sample(class_1, size=length(class_0), replace=FALSE)
selected_rows <- c(class_0, class_1)
mush <- mush[selected_rows,]
# cast class field as a factor
mush$class <- as.factor(mush$class)
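As a quick sanity check (a minimal addition, not part of the original article), we can confirm that the undersampling left the two classes perfectly balanced:
table(mush$class)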
Before building and evaluating any SVM models, we need to split the
data in mush into training and testing sets that can be
used to properly evaluate their performance:
set.seed(seed_num)
# define training size
train_size <- 0.75
# create separate dataframes by class
class_0_df <- mush[which(mush$class == 0),]
class_1_df <- mush[which(mush$class == 1),]
# determine row numbers of samples
n <- floor(nrow(class_0_df) * train_size)
sample_vec <- 1:nrow(class_0_df)
train_samples <- sample(sample_vec, size=n, replace=FALSE)
# sample the class dataframes
train_0_df <- class_0_df[train_samples,]
test_0_df <- class_0_df[-train_samples,]
train_1_df <- class_1_df[train_samples,]
test_1_df <- class_1_df[-train_samples,]
# recombine dataframes
train <- rbind(train_0_df, train_1_df)
test <- rbind(test_0_df, test_1_df)
# shuffle rows
train <- train[sample(1:nrow(train)), ]
test <- test[sample(1:nrow(test)), ]
# split into x and y
train_x <- select(train, -class)
train_y <- select(train, class)
test_x <- select(test, -class)
test_y <- select(test, class)
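Before moving on, a short sanity check (again a minimal sketch added here) confirms the split sizes and that each set retains the 50/50 class balance:
# confirm split sizes and per-class balance
nrow(train)
nrow(test)
table(train$class)
table(test$class)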
Simply put, SVM models determine what’s known as a “decision boundary” in \(n\)-dimensional space, where \(n\) is the number of explanatory fields. Predictions can be made using these models by determining where new observations are located in this space relative to the decision boundary.
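For a linear kernel, this boundary is simply a hyperplane. Writing the fitted weight vector as \(w\) and the intercept as \(b\), a new observation \(x\) is classified according to which side of the hyperplane it falls on:

\[\hat{y} = \operatorname{sign}(w \cdot x + b)\]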
Because this decision boundary can only be easily visualized in two
or three dimensions, we will first build a model using just two
explanatory fields. While this model is not expected to have exceedingly
high accuracy, it will provide insight into how an SVM model makes
predictions. The two features chosen in this case are those deemed to be
the two “most important” continuous features of the random forest built
in Section 6 of the
previous RPubs article: stem.width and
cap.diameter.
set.seed(seed_num)
svm_data <- train[c("stem.width", "cap.diameter", "class")]
simple_svm <- svm(class ~ ., data = svm_data, kernel = "linear")
print(simple_svm)
##
## Call:
## svm(formula = class ~ ., data = svm_data, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 5874
Next, we can use the model to make predictions on the test set:
preds <- unname(predict(simple_svm, test[c("stem.width", "cap.diameter")]))
confusion_matrix <- table(preds, test$class)
print(confusion_matrix)
##
## preds 0 1
## 0 609 416
## 1 520 713
As we can see in the above confusion matrix, the model did not perform very well: it correctly predicted only 609 of the 1,129 edible mushrooms (~54%) and 713 of the 1,129 poisonous mushrooms (~63%), resulting in a total accuracy score of:
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy, 2)))
## [1] "Accuracy: 0.59"
While this result evidences a lack of promising utility for this model (it performs only 9 percentage points better than chance), we can see clearly how it makes its decisions:
set.seed(seed_num)
s_rows <- sample(1:nrow(test), 50, replace=FALSE)
s <- test[s_rows,c('stem.width', 'cap.diameter', 'class')]
plot(simple_svm, data = s)
In the above plot we can see the linear decision boundary drawn by the SVM model, in which an observation that falls below the boundary is classified as belonging to class 0 (edible). Based on the decision boundary, mushrooms with larger stem widths and cap diameters are more likely to be edible, with the opposite being true for poisonous mushrooms. The points on the graph represent predictions made using 50 randomly chosen observations from the test set, and they make it clear that this model is under-performing: red X’s in the top section and black X’s in the bottom section indicate incorrect predictions.
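For the linear model above, the boundary can also be recovered numerically from the coefs, SV, and rho components that e1071 stores on the fitted object. The sketch below assumes the default settings, under which svm() centers and scales the features, so these values live in the scaled feature space rather than the raw units:
# recover the linear boundary w.x - rho = 0 from the fitted model
w <- t(simple_svm$coefs) %*% simple_svm$SV  # weight vector (1 x 2)
rho <- simple_svm$rho                       # offset; decision value is w.x - rho
print(w)
print(rho)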
While the previously created SVM model was useful in visualizing how
SVM models operate, it was not an effective classifier. Using only two
features resulted in a decision boundary that was not complex enough to
capture many of the underlying connections/intricacies of our dataset.
As such, building a more complex model should provide better results.
The cell below creates a new SVM model using all the explanatory fields
in mush:
set.seed(seed_num)
complex_svm <- svm(class ~ ., data = train)
print(complex_svm)
##
## Call:
## svm(formula = class ~ ., data = train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 3570
We can now use this model to make predictions on the test set:
preds <- unname(predict(complex_svm, test))
confusion_matrix <- table(preds, test$class)
print(confusion_matrix)
##
## preds 0 1
## 0 964 209
## 1 165 920
The results above show a marked improvement compared to the previously created two-feature model: it correctly predicted 964 of the 1,129 edible mushrooms (~85%) and 920 of the 1,129 poisonous mushrooms (~81%), resulting in a total accuracy score of:
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy, 2)))
## [1] "Accuracy: 0.83"
It is clear that this SVM model produces far better predictions, though it still falls short of the best tree-based model produced previously (the random forest from Section 6 of the earlier RPubs article, which had an accuracy score of almost 99%).
To see if an SVM model can perform as well as the random forest from the previous analysis, we can attempt to tune the hyperparameters that the SVM model takes as inputs.
All SVM models require specifying a “kernel”, the function used to determine the support vectors that draw the decision boundary. One of the most common choices is the “radial basis function” (RBF) kernel, which is controlled by a hyperparameter, \(\gamma\). In short, \(\gamma\) dictates how much influence a single training example has over the decision boundary, with higher values resulting in more complex boundaries.
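Concretely, the RBF kernel measures the similarity between two observations \(x\) and \(x'\) as

\[K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right),\]

so larger values of \(\gamma\) shrink the region of influence of each training point.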
The cost \(C\) is another (kernel-independent) hyperparameter that controls how heavily the model penalizes incorrectly classified observations. Like models with large \(\gamma\), models with high \(C\) tend to have more complex decision boundaries.
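In the standard soft-margin formulation, \(C\) appears as the weight on the slack penalties:

\[\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i, \qquad \xi_i \ge 0,\]

where each \(\xi_i\) measures how far observation \(i\) sits on the wrong side of its margin.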
The cell below tests the RBF kernel with four different values of each of these two hyperparameters to determine the combination that produces the best results:
set.seed(seed_num)
gammas <- 10^(-1:2) # 0.1, 1, 10, 100
costs <- 10^(-1:2)  # 0.1, 1, 10, 100
tuned_model_rbf <-
  tune.svm(class~., data = train, kernel="radial",
           cost = costs, gamma = gammas)
tuned_model_rbf$best.parameters
## gamma cost
## 10 1 10
Of the \(4 \cdot 4 = 16\) different SVM models tested, the one using \(\gamma=1\) and \(C=10\) was found to produce the best results. We can now input these parameters into a new model and make predictions on the test set:
tuned_svm <- svm(class ~ ., data = train, kernel = "radial", cost = 10, gamma = 1)
preds <- unname(predict(tuned_svm, test))
confusion_matrix <- table(preds, test$class)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy, 2)))
## [1] "Accuracy: 0.83"
Despite the search, the model using these parameters performed just as well as the “out-of-the-box” SVM model.
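If further gains were worth the training time, a natural next step would be a finer grid search centered on the best values found above. The sketch below is illustrative only; these specific values were not tested in the original analysis:
# hypothetical finer grid around gamma = 1, cost = 10
set.seed(seed_num)
fine_gammas <- c(0.5, 1, 2, 5)
fine_costs <- c(5, 10, 20, 50)
tuned_fine <- tune.svm(class~., data = train, kernel="radial",
                       cost = fine_costs, gamma = fine_gammas)
tuned_fine$best.parameters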
As the work presented here has shown, SVM models can be an effective tool for solving classification problems and are capable of achieving high levels of accuracy. However, for the use case presented here, the tree-based models produced previously yielded better results. Not only this, but the previously created random forest was able to ingest and train on all observations in an extremely short period of time. With the SVM models, training times on the original mush dataframe were far too long to be feasible, and the number of observations had to be reduced by more than 80%. Even after this reduction, training times were still an issue during hyperparameter tuning: though 16 different SVM input-sets were tested, the search did not produce a model that exceeded the quality of the “out-of-the-box” predictions. This is not to say that no further improvements can be made to the model, but finding them would take a significant amount of time. As a general guideline, these results suggest that random forests work best for large datasets that contain a multitude of categorical and continuous fields, while SVMs work best on smaller datasets containing complex relationships between continuous fields.
However, the main point of the work presented both here and in the previous article is that the choice of model should be evaluated against the use case. In this instance random forests appear to be the best choice, but this is by no means true for all classification problems. Clear evidence can be found by performing a literature search on how random forests compare against SVMs (see, for example, the Statnikov & Aliferis reference below):
It is clear from such a search that the question of random forests vs. SVM models has been evaluated for numerous classification problems across a plethora of academic fields.
To conclude, both tree-based and SVM models can be top choices when attempting to build a superior classifier, but it is the responsibility of the data practitioner building these models to determine which is most pertinent to their use case.
The work presented here incorporates knowledge obtained from the following sources.
Tsang, I. W., Kwok, J. T., & Cheung, P.-M., “Core Vector Machines: Fast SVM Training on Very Large Data Sets”, Journal of Machine Learning Research, vol. 6, April 2005, pp. 363-392 - Helpful for understanding how dataset size impacts the training time and performance of SVM models.
Guido, R., Groccia, M. C., & Conforti, D., “A Hyper-Parameter Tuning Approach for Cost-Sensitive Support Vector Machine Classifiers”, Soft Computing, vol. 27, February 2022, pp. 12863-12881 - Helpful for understanding how different hyperparameters affect the SVM model.
Statnikov, A., & Aliferis, C. F., “Are Random Forests Better Than Support Vector Machines for Microarray-Based Cancer Classification?”, AMIA Annu Symp Proc, 2007, pp. 686-690 - Helpful for understanding when to use random forests vs. SVM models.