In this document we will be using a Support Vector Machine (SVM) to classify flowers in the classic Iris dataset. The Iris dataset ships with base R (in the datasets package) and is a common starting point for classification problems.
For this document we will use the tidyverse package, which contains ggplot2 for plotting, as well as the e1071 package, which allows us to build SVMs.
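If these packages are not already installed, they can be grabbed from CRAN first (a one-time step):
install.packages(c('tidyverse', 'e1071')) # only needs to be run once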
library(tidyverse)
library(e1071)
set.seed(42) # To make our document reproducible
Next we need to load our data.
data(iris)
head(iris, 20)
The Iris dataset records each flower's sepal length/width and petal length/width, and indicates which species of Iris that observation corresponds to. Before creating our model, I first want to quickly visualize this data.
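As a quick sanity check before plotting, note that the dataset is perfectly balanced across the three species:
table(iris$Species)
##
##     setosa versicolor  virginica
##         50         50         50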
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point() +
  labs(title = 'Sepal Length vs Sepal Width')
We can see here that there is some separation between the species of Iris. Setosa flowers tend to have shorter and wider sepals than their counterparts, while the virginica species tends to have slightly longer sepals than versicolor, though there is still a non-trivial amount of overlap between all three species.
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point() +
  labs(title = 'Petal Length vs Petal Width')
As with the sepal measurements, the setosa species has both shorter and narrower petals than the other two species. Virginica petals tend to be larger overall than versicolor petals, though there is still some amount of overlap.
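We can back these visual impressions up with per-species averages; here is a quick dplyr sketch (dplyr is loaded as part of the tidyverse):
iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean)) # mean of each measurement, by species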
This general separation with some amount of overlap is a strong indication that an SVM is a good choice for classifying a test set of these flowers. The general idea of an SVM is to find a linear separator that can be drawn between two classes in order to assign every data point to one class or the other. For complex data, SVMs attempt to accomplish this by dealing with the following three problems:
Class Separation: we attempt to find an optimal separating hyperplane between two or more classes by maximizing the margin between the classes' closest points. The points lying on the margin boundaries are called support vectors, and the middle of the margin is the optimal separating hyperplane.
Overlapping Classes: data points on the ‘wrong’ side of the discriminant margin are weighted down to reduce their influence.
Nonlinearity: when we cannot find a linear separator, data points are projected into a higher-dimensional space where they effectively become linearly separable.
An SVM then solves the whole task by formulating it as a quadratic optimization problem, which can be solved with known techniques; the standard form is sketched below.
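For reference, the quadratic program in question, in its standard soft-margin form as solved by libsvm (the engine underneath e1071), can be written as

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\big(w^\top \phi(x_i) + b\big) \ge 1 - \xi_i,\quad \xi_i \ge 0,$$

where $C$ is the cost parameter we will tune later and $\phi$ is the feature map implied by the chosen kernel.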
Now we will begin fitting our SVM to the iris dataset. First we randomly split the data into a training set and a testing set.
index <- 1:nrow(iris)
test.index <- sample(index, size = length(index) / 3) # hold out a third (50 rows) for testing
train <- iris[-test.index, ]
test <- iris[test.index, ]
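It is worth a quick check that each species remains reasonably represented in the training split:
table(train$Species) # with this seed: 37 setosa, 36 versicolor, 27 virginica (matching the confusion matrices below)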
Next we can use this training set to create our model.
svm.model.linear <- svm(Species ~ ., data = train, kernel = 'linear')
svm.model.linear
##
## Call:
## svm(formula = Species ~ ., data = train, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 0.25
##
## Number of Support Vectors: 24
table(Prediction = predict(svm.model.linear, train), Truth = train$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 37 0 0
## versicolor 0 35 0
## virginica 0 1 27
We see here that on the training dataset our SVM correctly identified 99 of the 100 observations. This is excellent, but maybe we can be perfect. With SVM models there are generally three parameters that are tuned to optimize the model: the kernel, gamma, and cost. In our original model we used a linear kernel and the defaults for both cost and gamma.
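To avoid eyeballing confusion matrices as we compare kernels, we can define a small helper; note this is just a convenience sketch, not part of e1071:
# Proportion of observations in `data` that `model` classifies correctly
accuracy <- function(model, data) {
  mean(predict(model, data) == data$Species)
}
accuracy(svm.model.linear, train) # 0.99, matching the table above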
The kernel determines the style of decision boundary the SVM uses to classify data, so we start by adjusting the kernel to find a more accurate model.
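For reference, e1071 defines its kernels as follows, where u and v are two observations and gamma, degree, and coef0 are constants (these show up in the parameter printouts below as gamma, degree, and coef.0):
linear: u'v
polynomial: (gamma*u'v + coef0)^degree
radial: exp(-gamma*|u - v|^2)
sigmoid: tanh(gamma*u'v + coef0)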
We want to test the linear SVM created previously against a few other popular kernels. We will limit this to the polynomial, radial, and sigmoid kernels.
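It can also help to see what a model's decision regions look like. e1071 ships a plot method for svm objects; since iris has four features, we plot two of them and hold the other two fixed via slice (the slice values here are arbitrary choices near the middle of each feature's range):
plot(svm.model.linear, train, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 6))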
svm.model.poly <- svm(Species ~ ., data = train, kernel = 'polynomial')
svm.model.poly
##
## Call:
## svm(formula = Species ~ ., data = train, kernel = "polynomial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## gamma: 0.25
## coef.0: 0
##
## Number of Support Vectors: 35
table(Prediction = predict(svm.model.poly, train), Truth = train$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 37 0 0
## versicolor 0 36 3
## virginica 0 0 24
The polynomial kernel does not seem to have performed as well; accuracy dropped to 97%.
svm.model.radial <- svm(Species ~ ., data = train, kernel = 'radial')
svm.model.radial
##
## Call:
## svm(formula = Species ~ ., data = train, kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.25
##
## Number of Support Vectors: 43
table(Prediction = predict(svm.model.radial, train), Truth = train$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 37 0 0
## versicolor 0 35 1
## virginica 0 1 26
The radial SVM is still pretty good, misclassifying only 2 observations; the radial kernel produced 98% accuracy.
svm.model.sig <- svm(Species ~ ., data = train, kernel = 'sigmoid')
svm.model.sig
##
## Call:
## svm(formula = Species ~ ., data = train, kernel = "sigmoid")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 1
## gamma: 0.25
## coef.0: 0
##
## Number of Support Vectors: 47
table(Prediction = predict(svm.model.sig, train), Truth = train$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 37 0 0
## versicolor 0 27 2
## virginica 0 9 25
The sigmoid model does not quite seem to be up to snuff; it produced an 89% accuracy rate.
Now that we have decided on a kernel, we can tune the cost and gamma parameters. To do this we will use the tune.svm() function, passing it a sequence of candidate values for each parameter.
svm.tune <- tune.svm(Species ~ ., data = train, kernel = 'linear',
                     gamma = seq(1/2^nrow(iris), 1, .01), cost = 2^seq(-6, 4, 2))
svm.tune
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 7.006492e-46 0.25
##
## - best performance: 0.01
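As an aside, tune.svm() already stores the model refit on the full training set with the winning parameters in its best.model component, so we could pull it out directly; below we instead refit by hand so the chosen parameters are visible in the code:
tuned.svm.alt <- svm.tune$best.model # equivalent to the manual refit below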
tuned.svm <- svm(Species ~ ., data = train, kernel = 'linear', gamma = 7.006492e-46, cost = 0.25)
table(Prediction = predict(tuned.svm, train), Truth = train$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 37 0 0
## versicolor 0 36 1
## virginica 0 0 26
Our tuning run returned that, out of the sequences of gammas and costs provided, the best configuration was a gamma of 7.006492e-46 and a cost of 0.25. Using 10-fold cross-validation we get a model that achieves 99% accuracy on the training set. (Note that gamma has no effect on a linear kernel, so the "best" gamma reported here is essentially arbitrary; cost is the parameter doing the work.) So, given the options that we allowed the tuning function to run through, we might still be able to do better.
99% accuracy matches the best model we have been able to come up with so far, but is it truly the best we can do? Luckily for us, the e1071 package also comes with a function, best.svm(), that tunes over the supplied parameters and simply returns the best model it finds. So let's give that a whirl!
best.svm.model <- best.svm(Species ~ ., data = train, kernel = 'linear') # named so we don't mask the best.svm() function itself
best.svm.model
##
## Call:
## best.svm(x = Species ~ ., data = train, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 0.25
##
## Number of Support Vectors: 24
table(Prediction = predict(best.svm.model, train), Truth = train$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 37 0 0
## versicolor 0 35 0
## virginica 0 1 27
Using best.svm() we find that the best set of parameters is a linear kernel, a cost of 1, and a gamma of 0.25. This matches our original linear SVM model exactly, which is no coincidence: these are simply svm()'s default values. This model is good for a 99% accuracy rating.
Now that we have settled on our best SVM model, let's see how it performs on some out-of-sample data.
best.svm.pred <- predict(best.svm.model, test)
table(Prediction = best.svm.pred, Truth = test$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 14 1
## virginica 0 0 22
sum(test$Species == best.svm.pred) / nrow(test)
## [1] 0.98
Our best SVM model accurately predicts 49 of the 50 observations in the testing dataset, good for 98% accuracy. This is pretty good! But is there a chance that one of our other models performs better on the testing set than it did on the training set? Let's quickly find out.
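Before stepping through each model one at a time, here is a compact sketch that scores all of our fitted models on the test set in one shot (the results match the individual tables that follow):
models <- list(linear = svm.model.linear, poly = svm.model.poly,
               radial = svm.model.radial, sigmoid = svm.model.sig,
               tuned = tuned.svm, best = best.svm.model)
sapply(models, function(m) mean(predict(m, test) == test$Species))
##  linear    poly  radial sigmoid   tuned    best
##    0.98    0.86    0.94    0.90    0.90    0.98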
linear.svm.pred <- predict(svm.model.linear, test)
table(Prediction = linear.svm.pred, Truth = test$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 14 1
## virginica 0 0 22
sum(test$Species == linear.svm.pred) / nrow(test)
## [1] 0.98
poly.svm.pred <- predict(svm.model.poly, test)
table(Prediction = poly.svm.pred, Truth = test$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 14 7
## virginica 0 0 16
sum(test$Species == poly.svm.pred) / nrow(test)
## [1] 0.86
radial.svm.pred <- predict(svm.model.radial, test)
table(Prediction = radial.svm.pred, Truth = test$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 14 3
## virginica 0 0 20
sum(test$Species == radial.svm.pred) / nrow(test)
## [1] 0.94
sig.svm.pred <- predict(svm.model.sig, test)
table(Prediction = sig.svm.pred, Truth = test$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 13 4
## virginica 0 1 19
sum(test$Species == sig.svm.pred) / nrow(test)
## [1] 0.9
tuned.svm.pred <- predict(tuned.svm, test)
table(Prediction = tuned.svm.pred, Truth = test$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 14 5
## virginica 0 0 18
sum(test$Species == tuned.svm.pred) / nrow(test)
## [1] 0.9
So, out of all the models we created in this document, the original linear SVM, which best.svm() reproduced, is the best classifier on the test set as well as the training set. This is excellent, as it suggests the model is not overfit to the training set, a problem that SVMs often run into.