Visit my website for more like this!
Heavily borrowed from:
Textbook: Introduction to statistical learning
Textbook: Elements of statistical learning
require(knitr)
## Loading required package: knitr
SVMs are a group of classifiers considered to be one of the best “out of the box” methods. There are several types of SVMs. The simplest is the _maximal margin classifier_ (MMC). Though the MMC is elegant and simple, it cannot be applied to most data sets, since it requires that the classes be separable by a linear boundary. The support vector classifier is an extension of the maximal margin classifier and can be applied in a broader range of cases. The support vector machine is a further extension still, accommodating non-linear class boundaries.
Here we define a hyperplane and introduce the idea of an optimal separating hyperplane.
In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p - 1. For example, in two dimensions a hyperplane is a flat one-dimensional subspace (a line). In three dimensions, a hyperplane is a flat two-dimensional subspace – a plane. Hyperplanes in p > 3 dimensions are hard to visualize, but they are still flat (p - 1)-dimensional subspaces.
A hyperplane essentially divides p-dimensional space into two halves. Using the equation of a p-dimensional hyperplane, we can determine which side a given point lies on by checking the sign of the equation evaluated at that point.
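For concreteness, in ISLR's notation the hyperplane is the set of points \(X = (X_1, \dots, X_p)^T\) satisfying

\[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p = 0. \]

A test observation \(x^*\) is assigned to class 1 if \(f(x^*) = \beta_0 + \beta_1 x^*_1 + \dots + \beta_p x^*_p > 0\), and to class -1 if \(f(x^*) < 0\).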
Suppose we have an n x p data matrix X that consists of n training observations in p-dimensional space, which fall into two classes (-1, 1). The goal is to develop a classifier based on the training data that will correctly classify test observations using the available features. This setup is common to any type of classifier.
Suppose that it is possible to construct a hyperplane that separates the training observations perfectly according to their class labels. If such a hyperplane exists, we can use it to construct a very natural classifier: a test observation is assigned a class depending on which side of the hyperplane it is located. We can also make use of the magnitude, that is, how far an observation is from the hyperplane, and so be more (or less) confident about our class assignment. This type of separating hyperplane leads to a linear decision boundary.
If our data can be perfectly separated using a hyperplane, then there will be an infinite number of such hyperplanes. You can imagine that the hyperplane can usually be shifted up or down a little bit without coming into contact with the observations. So in order to construct a classifier, we need a reasonable way to choose which of the infinitely many possible separating hyperplanes to use.
The simplest choice is the maximal margin hyperplane (also known as the optimal separating hyperplane), which is the separating hyperplane that is farthest from the training observations. The margin is the distance from the hyperplane to the nearest points on either side. We hope that a classifier with a large margin on the training data will also have a large margin on the test data, though this approach can lead to overfitting when p is large.
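Concretely, ISLR writes the maximal margin hyperplane as the solution to the optimization problem

\[ \underset{\beta_0, \beta_1, \dots, \beta_p,\, M}{\text{maximize}} \; M \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \qquad y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \ge M \;\; \text{for } i = 1, \dots, n. \]

The constraints ensure that every training observation falls on the correct side of the hyperplane and at least a distance \(M\) from it, and we make that cushion \(M\) as large as possible.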
The points that determine this boundary are the closest observations, which lie equidistant from the hyperplane (at perpendicular distance equal to the margin); they are called the support vectors. The maximal margin hyperplane depends directly on these support vectors, and not on the other observations. The fact that the MMC boundary is determined by only a small number of observations is an important property that will come up shortly.
The maximal margin classifier is a very natural way to perform classification, but only if a separating hyperplane exists. In many cases it does not, and the optimization problem has no solution that exactly separates the two classes.
The next section shows how to extend this model in order to develop a hyperplane that almost separates the classes, a so-called soft margin. This is called the support vector classifier.
Even when a separating hyperplane does exist between two classes, there are instances where fitting one may not be desirable. The maximal margin classifier fits the training data perfectly and is highly sensitive to just a few of the training observations. In other words, we have a high potential to overfit the data.
In this case we might be willing to consider a classifier based on a hyperplane that does not perfectly separate the two classes. By doing this we can attain:
Greater robustness to individual observations, and
Better classification of most of the training observations.
Basically, it could be worthwhile to misclassify a few training observations in order to do a better job in classifying the remaining observations.
The support vector classifier, sometimes called a soft margin classifier, does exactly this. An observation can not only be on the wrong side of the margin, but also on the wrong side of the hyperplane. This situation is inevitable when there is no perfect separation in the data.
The solution to the optimization problem for the support vector classifier involves several quantities: C, a non-negative tuning parameter; M, the margin width, which we seek to make as large as possible; and slack variables \(\epsilon_1, \dots, \epsilon_n\) that allow individual observations to be on the wrong side of the margin or of the hyperplane.
As before, we classify any given test observation based on its sign relative to the hyperplane. The slack variable \(\epsilon_i\) tells us where the ith observation is located relative to the hyperplane and relative to the margin. If \(\epsilon_i = 0\), the ith observation is on the correct side of the margin. If \(\epsilon_i > 0\), the ith observation has violated the margin. If \(\epsilon_i > 1\), it is on the wrong side of the hyperplane.
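Putting these pieces together, ISLR's formulation of the support vector classifier is

\[ \underset{\beta_0, \dots, \beta_p,\, \epsilon_1, \dots, \epsilon_n,\, M}{\text{maximize}} \; M \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \qquad y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \ge M(1 - \epsilon_i), \qquad \epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C. \]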
C bounds the sum of the \(\epsilon_i\)'s, and so it determines the number and severity of the violations to the margin and hyperplane that we will tolerate. If C = 0, there is no budget for margin violations, and the problem reduces to the maximal margin optimization problem. For C > 0, no more than C observations can be on the wrong side of the hyperplane. As C increases, we become more tolerant of violations to the margin, and so the margin will widen. C is generally chosen by cross-validation and controls the bias-variance trade-off: when C is small, we seek narrow margins that are rarely violated (low bias, high variance). An illustration from the ISLR textbook is shown below.
It turns out that only observations that either lie on the margin or violate the margin affect the hyperplane. Thus, an observation that lies strictly on the correct side of the margin does not affect the support vector classifier. The observations that lie on the margin or on the wrong side of the margin are called the support vectors, and they are the only points that drive the classifier.
This is distinctly different from, say, an LDA classifier, which uses the means of all the observations within each class, as well as the within-class covariance matrix. Logistic regression, in contrast, is similar to the support vector classifier in that it has very low sensitivity to observations far from the decision boundary.
Thus when C is large, the margin is wide, and there are many support vectors. The figure below illustrates these trade-offs.
First we will go through converting a linear classifier into a one that can handle non-linear boundaries. Then the support vector machine, which does this automatically, is introduced.
The support vector classifier (above) is a natural approach for classification in the two-class setting when the boundary between the classes is linear. Just as we extend the linear model to account for non-linear relationships, we can transform the predictors using quadratic, cubic, and even higher-order polynomial functions. Unfortunately, we quickly start to accumulate features (e.g., \(2p\) features if we add quadratic terms), and the computations can become unmanageable.
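As a quick illustration of that idea (a minimal sketch on simulated data, not from the text), we can append squared terms as extra columns by hand and fit an ordinary linear support vector classifier on the enlarged feature space:

# Minimal sketch: enlarge the feature space with quadratic terms by hand,
# then fit a *linear* support vector classifier on the enlarged space.
library(e1071)
set.seed(1)
x <- matrix(rnorm(100*2), ncol=2)
y <- as.factor(ifelse(x[,1]^2 + x[,2]^2 > 1.5, 1, -1))  # circular (non-linear) boundary
dat.exp <- data.frame(x1=x[,1], x2=x[,2],
                      x1.sq=x[,1]^2, x2.sq=x[,2]^2,      # the added quadratic features
                      y=y)
fit.exp <- svm(y ~ ., data=dat.exp, kernel='linear', cost=1)
mean(fit.exp$fitted == dat.exp$y)  # training accuracy on the enlarged space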
The support vector machine is an extension of the support vector classifier that results from enlarging the feature space using kernels, where a kernel K is some function that quantifies the similarity of two observations. Essentially, using a polynomial kernel of degree d amounts to fitting a support vector classifier in a higher-dimensional space, and this is what we call a support vector machine.
The polynomial kernel is one example of a possible non-linear kernel. Another popular choice is the radial kernel. The main difference between the two is that the radial kernel has very local behaviour compared to the polynomial kernel. The radial kernel also has a tuning parameter \(\gamma\); as \(\gamma\) increases, the fit becomes more and more non-linear. A comparison of the two kernels is presented below. In this case, both classifiers fit the data sufficiently well. Kernels allow us to avoid explicitly computing in the enlarged feature space, keeping the optimization problem at a manageable size.
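For reference, the kernels discussed here take the following forms in ISLR, and the resulting classifier can be written as \(f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i K(x, x_i)\), where \(\mathcal{S}\) is the set of support vectors:

\[ K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j} \quad \text{(linear)}, \qquad K(x_i, x_{i'}) = \Bigl(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\Bigr)^d \quad \text{(polynomial of degree } d\text{)}, \]

\[ K(x_i, x_{i'}) = \exp\Bigl(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\Bigr) \quad \text{(radial, with } \gamma > 0\text{)}. \]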
The concept of separating hyperplanes only really lends itself to the binary classification setting. However, we now discuss the two most popular techniques for extending SVMs to K classes:
one-versus-one
one-versus-all
This approach constructs a classifier for each of the K(K - 1)/2 possible pairs of classes, comparing each pair in a two-class setting. For each of these classifiers we classify a test observation and tally the number of times the test observation is assigned to each of the K classes. The final classification assigns the test observation to the class to which it was most frequently assigned in these pairwise comparisons (a small sketch of this voting scheme is shown below).
Here we fit K SVMs, each time comparing one of the K classes to the remaining K - 1 classes. We then assign a test observation to the class whose classifier expresses the highest confidence that the observation belongs to that class rather than to any of the others.
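To make the one-versus-one voting concrete, here is a rough sketch (my own illustration, not from the text) on R's built-in iris data. Note that svm() already performs this internally for factor responses with more than two levels; the loop below just makes the tallying explicit.

# Sketch of one-versus-one voting: fit an SVM for every pair of classes,
# let each pairwise classifier vote, and pick the class with the most votes.
library(e1071)
classes <- levels(iris$Species)
class.pairs <- combn(classes, 2, simplify=FALSE)
votes <- matrix(0, nrow=nrow(iris), ncol=length(classes),
                dimnames=list(NULL, classes))
for (p in class.pairs) {
  sub <- droplevels(subset(iris, Species %in% p))   # two-class subset
  fit <- svm(Species ~ ., data=sub, kernel='linear')
  pred <- as.character(predict(fit, iris))          # pairwise prediction for every row
  for (cls in p) votes[pred == cls, cls] <- votes[pred == cls, cls] + 1
}
ovo.pred <- classes[max.col(votes)]  # class receiving the most pairwise votes
table(ovo.pred, iris$Species)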
Though not covered here, there is also an extension of the SVM for regression.
The e1071 library implements several statistical learning methods. The svm() function can fit a support vector classifier when the argument kernel='linear' is used. The cost argument allows us to specify the cost of a violation to the margin; when the cost is small, the margins will be wide and there will be many support vectors.
set.seed(1)
#Create our own test data
x <- matrix(rnorm(20*2), ncol=2)
y <- c(rep(-1,10), rep(1,10))
x[y==1,] <- x[y==1,] + 1  # shift the class 1 observations; the classes still overlap
Let’s take a look at these data to see whether the classes are linearly separable.
plot(x, col=(3-y))
They are not linearly separable. Now we can fit the support vector classifier. To perform classification we have to specify the response as a factor. The argument scale = FALSE tells the function not to scale each feature; depending on the application, we may prefer to let it do so.
library(e1071)
dat <- data.frame(x=x, y=as.factor(y))
svm.fit <- svm(y ~., data=dat, kernel='linear', cost=10, scale=FALSE)
# Plot the SVC obtained
plot(svm.fit, dat)
summary(svm.fit)
##
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
## gamma: 0.5
##
## Number of Support Vectors: 7
##
## ( 4 3 )
##
##
## Number of Classes: 2
##
## Levels:
## -1 1
The summary lets us know there were 7 support vectors, four in the first class and three in the second. What if we used a smaller cost parameter instead?
svm.fit2 <- svm(y ~., data=dat, kernel = 'linear', cost=0.1, scale=FALSE)
plot(svm.fit2, dat)
summary(svm.fit2)
##
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 0.1,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.1
## gamma: 0.5
##
## Number of Support Vectors: 16
##
## ( 8 8 )
##
##
## Number of Classes: 2
##
## Levels:
## -1 1
With a smaller value of cost we obtain a larger number of support vectors, because the margin is now wider. The e1071 library has a built-in tune() function that performs 10-fold cross-validation over a set of candidate models.
set.seed(1)
tune.out <- tune(svm, y ~., data=dat, kernel='linear',
ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.1
##
## - best performance: 0.1
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.70 0.4216
## 2 1e-02 0.70 0.4216
## 3 1e-01 0.10 0.2108
## 4 1e+00 0.15 0.2415
## 5 5e+00 0.15 0.2415
## 6 1e+01 0.15 0.2415
## 7 1e+02 0.15 0.2415
Here we see that cost = 0.1 results in the lowest cross-validation error rate. tune() also stores the best model obtained, accessible through tune.out$best.model, so we can use it to predict on test data. Here we create a simulated test set.
xtest <- matrix(rnorm(20*2), ncol=2)
ytest <- sample(c(-1,1), 20, rep=TRUE)
xtest[ytest==1,] <- xtest[ytest==1,] + 1
testdat <- data.frame(x=xtest, y=as.factor(ytest))
Then we predict the class labels of the test observations using the cross-validated model.
yhat <- predict(tune.out$best.model, testdat)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
confusionMatrix(yhat, testdat$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction -1 1
## -1 11 1
## 1 0 8
##
## Accuracy : 0.95
## 95% CI : (0.751, 0.999)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 0.000111
##
## Kappa : 0.898
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 1.000
## Specificity : 0.889
## Pos Pred Value : 0.917
## Neg Pred Value : 1.000
## Prevalence : 0.550
## Detection Rate : 0.550
## Detection Prevalence : 0.600
## Balanced Accuracy : 0.944
##
## 'Positive' Class : -1
##
Again we use the svm() function, but now we can experiment with non-linear kernels. For polynomial kernels we use the degree argument to set the polynomial order, and for radial kernels we use the gamma argument to set the value of \(\gamma\).
# Generate some test data
set.seed(1)
x <- matrix(rnorm(200*2), ncol=2)
x[1:100,] <- x[1:100,] + 2
x[101:150,] <- x[101:150,] - 2
y <- c(rep(1,150), rep(2,50))
dat <- data.frame(x=x, y=as.factor(y))
plot(x, col=y)
Randomly split the data into training and testing groups and fit a radial kernel.
train <- sample(200, 100)
svm.fit <- svm(y ~., data=dat[train,], kernel='radial', gamma=1, cost=1)
plot(svm.fit, dat[train,])
summary(svm.fit)
##
## Call:
## svm(formula = y ~ ., data = dat[train, ], kernel = "radial",
## gamma = 1, cost = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 1
##
## Number of Support Vectors: 37
##
## ( 17 20 )
##
##
## Number of Classes: 2
##
## Levels:
## 1 2
yhat <- predict(svm.fit, dat[-train,])
confusionMatrix(yhat, dat[-train,'y'])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 72 7
## 2 5 16
##
## Accuracy : 0.88
## 95% CI : (0.8, 0.936)
## No Information Rate : 0.77
## P-Value [Acc > NIR] : 0.0041
##
## Kappa : 0.651
## Mcnemar's Test P-Value : 0.7728
##
## Sensitivity : 0.935
## Specificity : 0.696
## Pos Pred Value : 0.911
## Neg Pred Value : 0.762
## Prevalence : 0.770
## Detection Rate : 0.720
## Detection Prevalence : 0.790
## Balanced Accuracy : 0.815
##
## 'Positive' Class : 1
##
We can see from the figure that there are a fair number of training errors for this fit. If we increase the value of cost, we can reduce the training errors, but we risk overfitting the data.
svm.fit <- svm(y ~., dat[train,], kernel='radial', gamma=1, cost=1e5)
plot(svm.fit, dat[train,])
This fit certainly is very irregular and would probably overfit the test data. Let’s try cross-validating these parameters instead.
set.seed(1)
tune.out <- tune(svm, y ~., data=dat[train,],
kernel='radial',
ranges = list(cost=c(0.1,1,10,100,1000),
gamma=c(0.5, 1,2,3,4)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 2
##
## - best performance: 0.12
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-01 0.5 0.27 0.11595
## 2 1e+00 0.5 0.13 0.08233
## 3 1e+01 0.5 0.15 0.07071
## 4 1e+02 0.5 0.17 0.08233
## 5 1e+03 0.5 0.21 0.09944
## 6 1e-01 1.0 0.25 0.13540
## 7 1e+00 1.0 0.13 0.08233
## 8 1e+01 1.0 0.16 0.06992
## 9 1e+02 1.0 0.20 0.09428
## 10 1e+03 1.0 0.20 0.08165
## 11 1e-01 2.0 0.25 0.12693
## 12 1e+00 2.0 0.12 0.09189
## 13 1e+01 2.0 0.17 0.09487
## 14 1e+02 2.0 0.19 0.09944
## 15 1e+03 2.0 0.20 0.09428
## 16 1e-01 3.0 0.27 0.11595
## 17 1e+00 3.0 0.13 0.09487
## 18 1e+01 3.0 0.18 0.10328
## 19 1e+02 3.0 0.21 0.08756
## 20 1e+03 3.0 0.22 0.10328
## 21 1e-01 4.0 0.27 0.11595
## 22 1e+00 4.0 0.15 0.10801
## 23 1e+01 4.0 0.18 0.11353
## 24 1e+02 4.0 0.21 0.08756
## 25 1e+03 4.0 0.24 0.10750
yhat <- predict(tune.out$best.model, dat[-train,])
confusionMatrix(yhat, dat[-train, 'y'])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 74 7
## 2 3 16
##
## Accuracy : 0.9
## 95% CI : (0.824, 0.951)
## No Information Rate : 0.77
## P-Value [Acc > NIR] : 0.000669
##
## Kappa : 0.699
## Mcnemar's Test P-Value : 0.342782
##
## Sensitivity : 0.961
## Specificity : 0.696
## Pos Pred Value : 0.914
## Neg Pred Value : 0.842
## Prevalence : 0.770
## Detection Rate : 0.740
## Detection Prevalence : 0.810
## Balanced Accuracy : 0.828
##
## 'Positive' Class : 1
##
Using the cross-validated tuning parameters we achieve a stronger fit.
Another way to choose between models is to use an ROC curve. We can do this with the ROCR package.
library(ROCR)
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.1.1
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:stats':
##
## lowess
# function to handle the different models
rocplot <- function(pred, truth, ...){
  predob <- prediction(pred, truth)
  perf <- performance(predob, 'tpr', 'fpr')
  plot(perf, ...)
}
Now, when we rebuild the SVM, we set decision.values=TRUE to obtain the fitted decision values.
svm.opt <- svm(y ~., data=dat[train,], kernel='radial',
gamma=2, cost=1, decision.values=T)
fitted <- attributes(predict(svm.opt, dat[train,], decision.values=T))$decision.values
rocplot(fitted, dat[train,'y'], main='Training Data')
Next we use caret to train the model, selecting the tuning parameters that yield the largest area under the ROC curve.
ctr <- trainControl(method='cv',
number=10,
classProbs=TRUE,
summaryFunction=twoClassSummary)
svm.c <- train(y ~., dat[train,],
method='svmRadial',
trControl=ctr,
metric="ROC")
## Loading required package: kernlab
## Warning: At least one of the class levels are not valid R variables names;
## This may cause errors if class probabilities are generated because the
## variables names will be converted to: X1, X2
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
svm.c
## Support Vector Machines with Radial Basis Function Kernel
##
## 100 samples
## 2 predictors
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 90, 90, 90, 89, 90, 91, ...
##
## Resampling results across tuning parameters:
##
## C ROC Sens Spec ROC SD Sens SD Spec SD
## 0.2 0.9 0.9 0.8 0.07 0.1 0.2
## 0.5 0.9 0.9 0.8 0.08 0.1 0.2
## 1 0.9 0.9 0.7 0.09 0.1 0.3
##
## Tuning parameter 'sigma' was held constant at a value of 1.809
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 2 and C = 0.2.
plot(svm.c$finalModel)
yhat.c <- predict(svm.c, dat[-train,])
confusionMatrix(yhat.c, dat[-train,'y'])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 74 8
## 2 3 15
##
## Accuracy : 0.89
## 95% CI : (0.812, 0.944)
## No Information Rate : 0.77
## P-Value [Acc > NIR] : 0.00174
##
## Kappa : 0.664
## Mcnemar's Test P-Value : 0.22780
##
## Sensitivity : 0.961
## Specificity : 0.652
## Pos Pred Value : 0.902
## Neg Pred Value : 0.833
## Prevalence : 0.770
## Detection Rate : 0.740
## Detection Prevalence : 0.820
## Balanced Accuracy : 0.807
##
## 'Positive' Class : 1
##
It looks like both packages produce very similar values! That’s good. However, I find caret much easier to work with.
The svm() function is able to perform the one-versus-one approach.
#clean memory space
rm(xtest)
rm(ytest)
rm(y)
library(ISLR)
attach(Khan)
names(Khan)
## [1] "xtrain" "xtest" "ytrain" "ytest"
The dataset consists of expression measurements for 2,308 genes and is already split into training and test sets. We will use a support vector approach to predict the cancer subtype from the gene expression measurements. This dataset has a very large number of features relative to the number of observations, which suggests a linear kernel is likely sufficient.
dat = data.frame(xtrain, y=as.factor(ytrain))
svm.fit <- svm(y~., dat, kernel='linear', cost=10)
summary(svm.fit)
##
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
## gamma: 0.0004333
##
## Number of Support Vectors: 58
##
## ( 20 20 11 7 )
##
##
## Number of Classes: 4
##
## Levels:
## 1 2 3 4
table(svm.fit$fitted, dat$y)
##
## 1 2 3 4
## 1 8 0 0 0
## 2 0 23 0 0
## 3 0 0 12 0
## 4 0 0 0 20
Here we see that there are actually no training errors. This is not surprising due to the large number of predictors and few observations. How does it work on the test data?
testing <- data.frame(xtest, y=as.factor(ytest))
yhat <- predict(svm.fit, newdata=testing)
table(yhat, testing$y)
##
## yhat 1 2 3 4
## 1 3 0 0 0
## 2 0 6 2 0
## 3 0 0 4 0
## 4 0 0 0 5
Here we only make 2 errors on the test data. We might as well continue and tune the parameters to see if we can make any improvements.
# library(doMC)
# registerDoMC(6)
# Recall our classification training control function
ctr <- trainControl(method='cv',
number=10)
grid <- data.frame(C=c(0.0001, 0.001, 0.01, 1))
svm.tune <- train(y ~., dat,
method='svmLinear',
trControl=ctr,
tuneGrid=grid)
svm.tune
## Support Vector Machines with Linear Kernel
##
## 63 samples
## 2308 predictors
## 4 classes: '1', '2', '3', '4'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 57, 56, 57, 57, 56, 57, ...
##
## Resampling results across tuning parameters:
##
## C Accuracy Kappa Accuracy SD Kappa SD
## 1e-04 0.7 0.5 0.2 0.3
## 0.001 1 1 0.05 0.06
## 0.01 1 1 0.05 0.06
## 1 1 1 0.05 0.06
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.001.
yhat.tune <- predict(svm.tune, testing)
confusionMatrix(yhat.tune, testing$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 3 0 0 0
## 2 0 6 2 0
## 3 0 0 4 0
## 4 0 0 0 5
##
## Overall Statistics
##
## Accuracy : 0.9
## 95% CI : (0.683, 0.988)
## No Information Rate : 0.3
## P-Value [Acc > NIR] : 3.77e-08
##
## Kappa : 0.864
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 1.00 1.000 0.667 1.00
## Specificity 1.00 0.857 1.000 1.00
## Pos Pred Value 1.00 0.750 1.000 1.00
## Neg Pred Value 1.00 1.000 0.875 1.00
## Prevalence 0.15 0.300 0.300 0.25
## Detection Rate 0.15 0.300 0.200 0.25
## Detection Prevalence 0.15 0.400 0.200 0.25
## Balanced Accuracy 1.00 0.929 0.833 1.00
Turns out we get the same result. There isn’t much to tune with a linear SVM.
Here we predict whether a given car gets high or low gas mileage based on the Auto dataset.
attach(Auto)
## The following object is masked from package:ggplot2:
##     mpg
dim(Auto)
## [1] 392 9
kable(head(Auto))
mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
---|---|---|---|---|---|---|---|---|
18 | 8 | 307 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
18 | 8 | 318 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
16 | 8 | 304 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
15 | 8 | 429 | 198 | 4341 | 10.0 | 70 | 1 | ford galaxie 500 |
# Create a binary variable that takes on 1 for cars with gas mileage > median
Auto$y <- NA
Auto$y[Auto$mpg > median(Auto$mpg)] <- 1
Auto$y[Auto$mpg <= median(Auto$mpg)] <- 0
Auto$y <- as.factor(Auto$y)
sum(is.na(Auto$y)) # make sure there are no NA's
## [1] 0
Try a linear SVM
set.seed(123)
split <- createDataPartition(y=Auto$y, p=0.7, list=FALSE)
train <- Auto[split,]
test <- Auto[-split,]
# Remove mpg / name features
train <- train[-c(1,9)]
test <- test[-c(1,9)]
# 10 fold cross validation
ctr <- trainControl(method='repeatedcv',
number=10,
repeats=3)
# Recall: here C is the cost of a margin violation, so as C increases the margin tends to get narrower
grid <- data.frame(C=seq(0.01,5,0.5))
svm.fit <- train(y ~., train,
method='svmLinear',
preProc=c('center','scale'),
trControl=ctr,
tuneGrid=grid)
svm.fit
## Support Vector Machines with Linear Kernel
##
## 276 samples
## 7 predictors
## 2 classes: '0', '1'
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (10 fold, repeated 3 times)
##
## Summary of sample sizes: 248, 248, 249, 250, 248, 248, ...
##
## Resampling results across tuning parameters:
##
## C Accuracy Kappa Accuracy SD Kappa SD
## 0.01 0.9 0.8 0.06 0.1
## 0.5 0.9 0.8 0.05 0.1
## 1 0.9 0.8 0.05 0.1
## 2 0.9 0.8 0.05 0.1
## 2 0.9 0.8 0.05 0.1
## 3 0.9 0.8 0.05 0.1
## 3 0.9 0.8 0.05 0.1
## 4 0.9 0.8 0.05 0.1
## 4 0.9 0.8 0.05 0.1
## 5 0.9 0.8 0.05 0.1
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 3.
# Training error rate
confusionMatrix(predict(svm.fit, train), train$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 119 8
## 1 19 130
##
## Accuracy : 0.902
## 95% CI : (0.861, 0.935)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.804
## Mcnemar's Test P-Value : 0.0543
##
## Sensitivity : 0.862
## Specificity : 0.942
## Pos Pred Value : 0.937
## Neg Pred Value : 0.872
## Prevalence : 0.500
## Detection Rate : 0.431
## Detection Prevalence : 0.460
## Balanced Accuracy : 0.902
##
## 'Positive' Class : 0
##
# Testing error rate
yhat <- predict(svm.fit, test)
confusionMatrix(yhat, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 55 3
## 1 3 55
##
## Accuracy : 0.948
## 95% CI : (0.891, 0.981)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.897
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.948
## Specificity : 0.948
## Pos Pred Value : 0.948
## Neg Pred Value : 0.948
## Prevalence : 0.500
## Detection Rate : 0.474
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.948
##
## 'Positive' Class : 0
##
You can’t really ask for much more than ~95% accuracy, but let’s try a few other models to compare. It would also be nice to plot the SVM boundary for two variables at a time; one possible approach is sketched below.
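A sketch of that idea (an assumed approach, not verified in this post): e1071’s plot() method for svm objects accepts a formula selecting two predictors and a slice argument fixing the remaining ones. Since the caret fit is not a plain svm object, we refit with e1071::svm() using the cost chosen above; the slice values below are arbitrary choices for illustration.

# Sketch: view the fitted boundary for weight vs. horsepower, holding the
# other predictors at fixed (arbitrarily chosen) values via 'slice'.
library(e1071)
svm.lin <- svm(y ~ ., data=train, kernel='linear', cost=3)
plot(svm.lin, train, weight ~ horsepower,
     slice=list(cylinders=4, displacement=median(train$displacement),
                acceleration=median(train$acceleration), year=76, origin=1))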
set.seed(123)
# Try a polynomial function
svm.fit <- train(y ~., train,
method='svmPoly',
trControl=ctr)
svm.fit
## Support Vector Machines with Polynomial Kernel
##
## 276 samples
## 7 predictors
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
##
## Summary of sample sizes: 248, 249, 248, 249, 250, 248, ...
##
## Resampling results across tuning parameters:
##
## degree scale C Accuracy Kappa Accuracy SD Kappa SD
## 1 0.001 0.2 0.7 0.4 0.1 0.3
## 1 0.001 0.5 0.8 0.6 0.07 0.1
## 1 0.001 1 0.8 0.6 0.05 0.1
## 1 0.01 0.2 0.9 0.7 0.06 0.1
## 1 0.01 0.5 0.9 0.8 0.06 0.1
## 1 0.01 1 0.9 0.8 0.05 0.1
## 1 0.1 0.2 0.9 0.8 0.05 0.1
## 1 0.1 0.5 0.9 0.8 0.05 0.1
## 1 0.1 1 0.9 0.8 0.05 0.1
## 2 0.001 0.2 0.8 0.6 0.07 0.1
## 2 0.001 0.5 0.8 0.6 0.05 0.1
## 2 0.001 1 0.9 0.7 0.06 0.1
## 2 0.01 0.2 0.9 0.8 0.06 0.1
## 2 0.01 0.5 0.9 0.8 0.05 0.1
## 2 0.01 1 0.9 0.8 0.05 0.1
## 2 0.1 0.2 0.9 0.8 0.05 0.1
## 2 0.1 0.5 0.9 0.8 0.05 0.1
## 2 0.1 1 0.9 0.8 0.05 0.1
## 3 0.001 0.2 0.8 0.6 0.05 0.1
## 3 0.001 0.5 0.9 0.7 0.06 0.1
## 3 0.001 1 0.9 0.7 0.06 0.1
## 3 0.01 0.2 0.9 0.8 0.05 0.1
## 3 0.01 0.5 0.9 0.8 0.05 0.1
## 3 0.01 1 0.9 0.8 0.05 0.1
## 3 0.1 0.2 0.9 0.8 0.05 0.1
## 3 0.1 0.5 0.9 0.8 0.05 0.1
## 3 0.1 1 0.9 0.8 0.05 0.1
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.1 and C = 1.
plot(svm.fit)
# Training error rate
confusionMatrix(predict(svm.fit, train), train$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 122 3
## 1 16 135
##
## Accuracy : 0.931
## 95% CI : (0.895, 0.958)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.862
## Mcnemar's Test P-Value : 0.00591
##
## Sensitivity : 0.884
## Specificity : 0.978
## Pos Pred Value : 0.976
## Neg Pred Value : 0.894
## Prevalence : 0.500
## Detection Rate : 0.442
## Detection Prevalence : 0.453
## Balanced Accuracy : 0.931
##
## 'Positive' Class : 0
##
# Testing error rate
yhat <- predict(svm.fit, test)
confusionMatrix(yhat, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 56 2
## 1 2 56
##
## Accuracy : 0.966
## 95% CI : (0.914, 0.991)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.931
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.966
## Specificity : 0.966
## Pos Pred Value : 0.966
## Neg Pred Value : 0.966
## Prevalence : 0.500
## Detection Rate : 0.483
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.966
##
## 'Positive' Class : 0
##
And there you have it: it looks like a third-order polynomial can slightly improve the fit.