A Very Short Guide to Support Vector Machines in R

Introduction

Suppose you have data on petals and sepals length and width of flowers of gender Iris. Would you be able to tell what is the species? Perhaps a machine learning model can help you.

The following data set contains 50 observations on those morphometric variables for three species of Iris. Check the means:

data(iris)
aggregate( .~ Species, data = iris, FUN = mean)

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa          5.0         3.4          1.5        0.25
2 versicolor          5.9         2.8          4.3        1.33
3  virginica          6.6         3.0          5.6        2.03

There are several types of machine learning models, for regression and classification. We shall focus here on a just a couple of models suitable for multi-classification problems based on supervised learning, in which the labels of classes (Species) are already given.

Preparing the data

To evaluate the accuracy of models, we should fit them using part of the data, say 70% or 80%, and use the rest for testing, so that we can compute a confusion matrix containing the count (or proportions) of correct (diagonal) and misclassifications.

id <- sample(nrow(iris), size = nrow(iris) * 0.7) 
train_data <- iris[id, ]
nrow(train_data)

[1] 105

test_data <- iris[-id, ]
nrow(test_data)

[1] 45

Linear Discriminant Analysis

Just like principal components, LDA consists of finding linear combinations of the original variables X to obtain latent variables, Z, the linear discriminant variables.

\[{\bf Z}_j = {\bf Xa}_j\] But, differently from PCA, whose objective is to explain variance, with LDA the objective is to maximize the ratio of the separation of the class means to the within-class variance. This is also done though singular value decomposition.

To fit a LDA model,

library(MASS)
model_lda <- lda(Species ~., data = train_data)
model_lda

Call:
lda(Species ~ ., data = train_data)

Prior probabilities of groups:
    setosa versicolor  virginica 
      0.35       0.30       0.34 

Group means:
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa              5.0         3.4          1.5        0.26
versicolor          5.9         2.7          4.3        1.33
virginica           6.5         2.9          5.5        1.98

Coefficients of linear discriminants:
              LD1   LD2
Sepal.Length  1.1  0.19
Sepal.Width   1.2  2.02
Petal.Length -2.8 -0.97
Petal.Width  -2.0  2.83

Proportion of trace:
   LD1    LD2 
0.9934 0.0066

Most of the separation (98.9%) is explaind by LD1. The number of discriminant variables is given by \(\min(p, k-1)\), where \(p\) is the number of variables in \({\bf X}\) and \(k\) is the number of classes.

The classification rule is to determine decision boundaries that maximizes distances between the class means on LD1 and LD2, as shown in the plot below. Because of the underlying linearity, the boundaries consist of straight lines. In higher dimensions, it is a hyperplane.

To get predictions for the testing data,

pred_lda <- predict(model_lda, newdata = test_data)$class
pred_lda[1:5]  # check first 5 predictions

[1] setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

Now we can compute the confusion matrix

table(obs = test_data$Species, pred = pred_lda)

            pred
obs          setosa versicolor virginica
  setosa         13          0         0
  versicolor      0         17         1
  virginica       0          0        14

SVM

The classification rule adopted by SVMs are similar to that used in LDA, but more flexible, using kernel functions for that, such as polynomial, radial, sigmoid or simply linear. The objective is to fit a ‘hyperplane’ (which can be nonlinear, depending on the kernel, and that is why SVMs are powerful) in the ‘gap’ between the classes, with maximal margins (distance from hyperplane to the nearest point). The points exactly in the margins are called support vectors. A program that finds such vectors are called support vector machines.

Source: Hastie et al. (2009) The elements of statistical learning: data mining, inference, and prediction (Vol. 2). New York: Springer.

However, there may be no such hyperplane well fitted enough to prevent misclassifications (left side ofthe figure above). The strategy then is to allow points to be on the wrong size of their margin (right side of the figure). This is determined by a hyperparameter cost. Another hyperparameter, gamma, might be needed if the kernel is not the linear one.

The choice of the hyperparameters impacts the accuracy of the fitted model. A cross-validation technique can be employed to find estimates that produce higher accuracy - this is done by fitting several models.

library(e1071)
model_svm <- svm(Species ~ Sepal.Length + Petal.Length, 
                 data = train_data, scale = FALSE,
                 kernel = 'linear', gamma = 1/2, cost = 50,
                 cross = 10)
summary(model_svm)


Call:
svm(formula = Species ~ Sepal.Length + Petal.Length, data = train_data, 
    kernel = "linear", gamma = 1/2, cost = 50, cross = 10, scale = FALSE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  50 

Number of Support Vectors:  15

 ( 2 6 7 )


Number of Classes:  3 

Levels: 
 setosa versicolor virginica

10-fold cross-validation on training data:

Total Accuracy: 97 
Single Accuracies:
 100 100 100 91 100 100 90 91 100 100

The support vectors found can be seen (+) in the following plot:

plot(Petal.Length ~ Sepal.Length, data = train_data,
      col = train_data$Species)
points(model_svm$SV, pch = 3)

Now we can compute the confusion matrix

pred_svm <- predict(model_svm, newdata = test_data)
table(obs = test_data$Species, pred = pred_svm)

            pred
obs          setosa versicolor virginica
  setosa         13          0         0
  versicolor      0         17         1
  virginica       0          1        13

Miscellanea

The package caret offers a broad list of machine learning models that can be fitted using a common framework. There are functions to split and pre-process the data, to evaluate the accuracy of the models and the importance of variables. Also, the process of tuning the hyperparameters is easier.

library(caret)
model <- train(Species ~., data = train_data,
             method = "svmLinear2",
             preProcess = c("center", "scale"),
             tuneLength = 12,
             trControl = trainControl(method = "cv", number = 7, p = 0.7),
             metric = "Accuracy")
model

Support Vector Machines with Linear Kernel 

105 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

Pre-processing: centered (4), scaled (4) 
Resampling: Cross-Validated (7 fold) 
Summary of sample sizes: 90, 89, 89, 90, 91, 90, ... 
Resampling results across tuning parameters:

  cost    Accuracy  Kappa
    0.25  0.94      0.91 
    0.50  0.96      0.94 
    1.00  0.96      0.94 
    2.00  0.96      0.94 
    4.00  0.95      0.93 
    8.00  0.95      0.93 
   16.00  0.96      0.94 
   32.00  0.95      0.93 
   64.00  0.95      0.93 
  128.00  0.96      0.94 
  256.00  0.96      0.94 
  512.00  0.97      0.96 

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cost = 512.

Run also confusionMatrix(model) and varImp(model)

A Very Short Guide to Support Vector Machines in `R`

Anderson Rodrigo da Silva

17/03/2022

Introduction

Preparing the data

Linear Discriminant Analysis

SVM

Miscellanea