In this document we will be using a Support Vector Machine (SVM) to classify flowers in the classic Iris dataset. The Iris dataset ships with base R (in the datasets package) and is a common starting point for classification problems.
For this document we will use the tidyverse package, which contains ggplot2 for plotting, as well as the e1071 package, which allows us to build SVMs.
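If these packages are not already installed, they can be grabbed from CRAN first (a one-time step):
install.packages(c('tidyverse', 'e1071')) # only needs to be run once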
library(tidyverse)
library(e1071)
set.seed(42) # To make our document reproducible
Next we need to load our data.
data(iris)
head(iris, 20)
The Iris dataset records each flower's sepal length/width and petal length/width, and indicates which species of Iris that observation corresponds to. Before creating our model, I first want to quickly visualize this data.
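As a quick sanity check before plotting, note that the dataset is perfectly balanced across the three species:
table(iris$Species)
##
##     setosa versicolor  virginica
##         50         50         50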
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point() +
  labs(title = 'Sepal Length vs Sepal Width')
We can see here that there is some separation between the species of Iris. Setosa flowers tend to have shorter and wider sepals than their counterparts, while the virginica species tends to have slightly longer sepals than versicolor, though there is still a non-trivial amount of overlap between all three species.
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point() +
  labs(title = 'Petal Length vs Petal Width')
As with the sepal measurements, the setosa species has both shorter and narrower petals than the other two species. Virginica petals tend to be larger overall than versicolor petals, though there is still some amount of overlap.
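We can back these visual impressions up with per-species averages; here is a quick dplyr sketch (dplyr is loaded as part of the tidyverse):
iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean)) # mean of each measurement, by species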
This general separation with some amount of overlap is a strong indication that an SVM is a good choice for classifying a test set of these flowers. The general idea of an SVM is to find a linear separator that can be drawn between two classes in order to assign every data point to one class or the other. For complex data, SVMs attempt to accomplish this by dealing with the following three problems:
Class Separation: we attempt to find an optimal separating hyperplane between two or more classes by maximizing the margin between the classes' closest points. The points lying on the margin boundaries are called support vectors, and the middle of the margin is the optimal separating hyperplane.
Overlapping Classes: data points on the ‘wrong’ side of the discriminant margin are weighted down to reduce their influence.
Nonlinearity: when we cannot find a linear separator, data points are projected into a higher-dimensional space where they effectively become linearly separable.
An SVM then solves the whole task by formulating it as a quadratic optimization problem, which can be solved with known techniques; the standard form is sketched below.
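For reference, the quadratic program in question, in its standard soft-margin form as solved by libsvm (the engine underneath e1071), can be written as

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\big(w^\top \phi(x_i) + b\big) \ge 1 - \xi_i,\quad \xi_i \ge 0,$$

where $C$ is the cost parameter we will tune later and $\phi$ is the feature map implied by the chosen kernel.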
Now we will begin fitting our SVM to the iris dataset. First we randomly split the data into a training set and a testing set.
index <- 1:nrow(iris)
test.index <- sample(index, size = length(index) / 3) # hold out a third (50 rows) for testing
train <- iris[-test.index, ]
test <- iris[test.index, ]
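It is worth a quick check that each species remains reasonably represented in the training split:
table(train$Species) # with this seed: 37 setosa, 36 versicolor, 27 virginica (matching the confusion matrices below)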
Next we can use this training set to create our model.
svm.model.linear <- svm(Species ~ ., data = train, kernel = 'linear')
svm.model.linear
##
## Call:
## svm(formula = Species ~ ., data = train, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 0.25
##
## Number of Support Vectors: 24
table(Prediction = predict(svm.model.linear, train), Truth = train$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 37 0 0
## versicolor 0 35 0
## virginica 0 1 27
We see here that on the training dataset our SVM correctly identified 99 of the 100 observations. This is excellent, but maybe we can be perfect. With SVM models there are generally three parameters that are tuned to optimize the model: the kernel, gamma, and cost. In our original model we used a linear kernel and the defaults for both cost and gamma.
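To avoid eyeballing confusion matrices as we compare kernels, we can define a small helper; note this is just a convenience sketch, not part of e1071:
# Proportion of observations in `data` that `model` classifies correctly
accuracy <- function(model, data) {
  mean(predict(model, data) == data$Species)
}
accuracy(svm.model.linear, train) # 0.99, matching the table above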
The kernel determines the style of decision boundary the SVM uses to classify data, so we start by adjusting the kernel to find a more accurate model.
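For reference, e1071 defines its kernels as follows, where u and v are two observations and gamma, degree, and coef0 are constants (these show up in the parameter printouts below as gamma, degree, and coef.0):
linear: u'v
polynomial: (gamma*u'v + coef0)^degree
radial: exp(-gamma*|u - v|^2)
sigmoid: tanh(gamma*u'v + coef0)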
We want to test the linear SVM created previously against a few other popular kernels. We will limit this to the polynomial, radial, and sigmoid kernels.
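It can also help to see what a model's decision regions look like. e1071 ships a plot method for svm objects; since iris has four features, we plot two of them and hold the other two fixed via slice (the slice values here are arbitrary choices near the middle of each feature's range):
plot(svm.model.linear, train, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 6))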
svm.model.poly <- svm(Species ~ ., data = train, kernel = 'polynomial')
svm.model.poly
##
## Call:
## svm(formula = Species ~ ., data = train, kernel = "polynomial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## gamma: 0.25
## coef.0: 0
##
## Number of Support Vectors: 35
table(Prediction = predict(svm.model.poly, train), Truth = train$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 37 0 0
## versicolor 0 36 3
## virginica 0 0 24
The polynomial kernel does not seem to have performed as well; accuracy dropped to 97%.
svm.model.radial <- svm(Species ~ ., data = train, kernel = 'radial')
svm.model.radial
##
## Call:
## svm(formula = Species ~ ., data = train, kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.25
##
## Number of Support Vectors: 43
table(Prediction = predict(svm.model.radial, train), Truth = train$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 37 0 0
## versicolor 0 35 1
## virginica 0 1 26
The radial SVM is still pretty good, misclassifying only 2 observations; the radial kernel produced 98% accuracy.
svm.model.sig <- svm(Species ~ ., data = train, kernel = 'sigmoid')
svm.model.sig
##
## Call:
## svm(formula = Species ~ ., data = train, kernel = "sigmoid")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 1
## gamma: 0.25
## coef.0: 0
##
## Number of Support Vectors: 47
table(Prediction = predict(svm.model.sig, train), Truth = train$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 37 0 0
## versicolor 0 27 2
## virginica 0 9 25
The sigmoid model does not quite seem to be up to snuff; it produced an 89% accuracy rate.
Now that we have decided on a kernel, we can tune the cost and gamma parameters. To do this we will use the tune.svm() function, passing it a sequence of candidate values for each parameter.
svm.tune <- tune.svm(Species ~ ., data = train, kernel = 'linear',
                     gamma = seq(1/2^nrow(iris), 1, .01), cost = 2^seq(-6, 4, 2))
svm.tune
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 7.006492e-46 0.25
##
## - best performance: 0.01
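As an aside, tune.svm() already stores the model refit on the full training set with the winning parameters in its best.model component, so we could pull it out directly; below we instead refit by hand so the chosen parameters are visible in the code:
tuned.svm.alt <- svm.tune$best.model # equivalent to the manual refit below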
tuned.svm <- svm(Species ~ ., data = train, kernel = 'linear', gamma = 7.006492e-46, cost = 0.25)
table(Prediction = predict(tuned.svm, train), Truth = train$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 37 0 0
## versicolor 0 36 1
## virginica 0 0 26
Our tuning run returned that, out of the sequences of gammas and costs provided, the best configuration was a gamma of 7.006492e-46 and a cost of 0.25. Using 10-fold cross-validation we get a model that achieves 99% accuracy on the training set. (Note that gamma has no effect on a linear kernel, so the "best" gamma reported here is essentially arbitrary; cost is the parameter doing the work.) So, given the options that we allowed the tuning function to run through, we might still be able to do better.
99% accuracy matches the best model we have been able to come up with so far, but is it truly the best we can do? Luckily for us, the e1071 package also comes with a function, best.svm(), that tunes over the supplied parameters and simply returns the best model it finds. So let's give that a whirl!
best.svm.model <- best.svm(Species ~ ., data = train, kernel = 'linear') # named so we don't mask the best.svm() function itself
best.svm.model
##
## Call:
## best.svm(x = Species ~ ., data = train, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 0.25
##
## Number of Support Vectors: 24
table(Prediction = predict(best.svm.model, train), Truth = train$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 37 0 0
## versicolor 0 35 0
## virginica 0 1 27
Using best.svm() we find that the best set of parameters is a linear kernel, a cost of 1, and a gamma of 0.25. This matches our original linear SVM model exactly, which is no coincidence: these are simply svm()'s default values. This model is good for a 99% accuracy rating.
Now that we have settled on our best SVM model, let's see how it performs on some out-of-sample data.
best.svm.pred <- predict(best.svm.model, test)
table(Prediction = best.svm.pred, Truth = test$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 14 1
## virginica 0 0 22
sum(test$Species == best.svm.pred) / nrow(test)
## [1] 0.98
Our best SVM model accurately predicts 49 of the 50 observations in the testing dataset, good for 98% accuracy. This is pretty good! But is there a chance that one of our other models performs better on the testing set than it did on the training set? Let's quickly find out.
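Before stepping through each model one at a time, here is a compact sketch that scores all of our fitted models on the test set in one shot (the results match the individual tables that follow):
models <- list(linear = svm.model.linear, poly = svm.model.poly,
               radial = svm.model.radial, sigmoid = svm.model.sig,
               tuned = tuned.svm, best = best.svm.model)
sapply(models, function(m) mean(predict(m, test) == test$Species))
##  linear    poly  radial sigmoid   tuned    best
##    0.98    0.86    0.94    0.90    0.90    0.98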
linear.svm.pred <- predict(svm.model.linear, test)
table(Prediction = linear.svm.pred, Truth = test$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 14 1
## virginica 0 0 22
sum(test$Species == linear.svm.pred) / nrow(test)
## [1] 0.98
poly.svm.pred <- predict(svm.model.poly, test)
table(Prediction = poly.svm.pred, Truth = test$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 14 7
## virginica 0 0 16
sum(test$Species == poly.svm.pred) / nrow(test)
## [1] 0.86
radial.svm.pred <- predict(svm.model.radial, test)
table(Prediction = radial.svm.pred, Truth = test$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 14 3
## virginica 0 0 20
sum(test$Species == radial.svm.pred) / nrow(test)
## [1] 0.94
sig.svm.pred <- predict(svm.model.sig, test)
table(Prediction = sig.svm.pred, Truth = test$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 13 4
## virginica 0 1 19
sum(test$Species == sig.svm.pred) / nrow(test)
## [1] 0.9
tuned.svm.pred <- predict(tuned.svm, test)
table(Prediction = tuned.svm.pred, Truth = test$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 14 5
## virginica 0 0 18
sum(test$Species == tuned.svm.pred) / nrow(test)
## [1] 0.9
So, out of all the models we created in this document, the original linear SVM, which best.svm() reproduced, is the best classifier on the test set as well as the training set. This is excellent, as it suggests the model is not overfit to the training set, a problem that SVMs often run into.