Support Vector Machines

Today we will look at some of the uses of Support Vector Machines. We will use the e1071 library for the necessary functions. The first dataset to be utilized will be the previously used Head Size and Brain Weight data from http://www.stat.ufl.edu/~winner/datasets.html. This dataset compares the measurements found for both males and females in the 1905 study by R. J. Gladstone entitled “A Study of the Relations of the Brain to the Size of the Head” in Biometrika, Volume 4, pages 105 to 123.

To begin, we will load the library, set up the data in a usable fashion, and set the seed.

library(e1071)
brain <- read.table('brainhead.dat',header=F,sep='')
names(brain) <- c('gender','age_range','size','weight')
set.seed(15)

Next, we will divide the dataset into sampled training and test sets, and then observe a plot of the training data to get some idea of how seperable the data are.

train <- sample(237,187)
# The training data will contain 187 records, while the test data will have fifty.
brain.train <- brain[train,]
brain.test <- brain[-train,]
# Here, we will plot the brain size and head size of the training data, using gender to seperate the results.
plot(brain.train$size,brain.train$weight,col=(brain.train$gender +2))

Next, we will build training and test matricies of the data with the attributes we wish to examine. The variable, y, will represent the response (gender) while x will be a matrix of brain weight and head size. Afterward, we will fit a model using the svm() function using a linear kernel and a cost of .1 and plot this model against the training data.

set1 <- data.frame(y=as.factor(brain.train$gender),x=brain.train[,c(3:4)])
set2 <- data.frame(y=as.factor(brain.test$gender),x=brain.test[,c(3:4)])
brain.fit1 <- svm(y~.,data=set1,kernel='linear',cost=.1,scale=FALSE)
plot(brain.fit1,set1)

Using the summary() function, we can get an idea of how many data points are needed to discern a linear seperation between the genders, using a cost of .1. We can also use the tune() function, also from e1071, that will allow us to submit a range of cost values to determine the best linear model with regard to cost.

summary(brain.fit1)

## 
## Call:
## svm(formula = y ~ ., data = set1, kernel = "linear", cost = 0.1, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  113
## 
##  ( 56 57 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  1 2

# Here, we use the tune() function with a range of costs.
brain.tune <- tune(svm,y~.,data=set1,kernel='linear',ranges=list(cost=c(.0001,.001,.01,.1,1,10,100,1000)))
# Use summary() to help determine the best model.
summary(brain.tune)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.274269 
## 
## - Detailed performance results:
##    cost     error dispersion
## 1 1e-04 0.4763158 0.16463630
## 2 1e-03 0.4763158 0.16463630
## 3 1e-02 0.2789474 0.08419609
## 4 1e-01 0.2847953 0.11930079
## 5 1e+00 0.2742690 0.11832214
## 6 1e+01 0.2798246 0.12539470
## 7 1e+02 0.2798246 0.12539470
## 8 1e+03 0.2798246 0.12539470

# Now the best model is identified.
bestmod <- brain.tune$best.model
summary(bestmod)

## 
## Call:
## best.tune(method = svm, train.x = y ~ ., data = set1, ranges = list(cost = c(1e-04, 
##     0.001, 0.01, 0.1, 1, 10, 100, 1000)), kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  114
## 
##  ( 57 57 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  1 2

In this example, predictions are made using both the original model and the ‘best’ model on the test data. The accuracy of those predictions are then measured.

pred.orig <- predict(brain.fit1,set2)
pred.tune <- predict(bestmod,set2)
table(predicted=pred.orig,truth=set2$y)

##          truth
## predicted  1  2
##         1 28  9
##         2  8  5

33/50

## [1] 0.66

table(predicted=pred.tune,truth=set2$y)

##          truth
## predicted  1  2
##         1 29  9
##         2  7  5

34/50

## [1] 0.68

From the looks of things in this case, tuning a model did not result in a great deal of improvement in discerning between the two genders.

Would it be possible that a radial kernel would provide better results than a linear one? Below, we will carry out making a model using svm() but include a radial kernel and provide a value of 1 for gamma. We will also attempt to tune a model using ranges for both cost and gamma. Predicted results will then be calculated as before.

# Using a radial kernel and an initial gamma of 1.
brain.fit2 <- svm(y~.,data=set1,kernel='radial',cost=1,gamma=1,scale=FALSE)
# Plotting of the model on the training set.
plot(brain.fit2,set1)

# Tuning of a model with ranges for cost and gamma.
brain.tune2 <- tune(svm,y~.,data=set1,kernel='radial',ranges=list(cost=c(.0001,.001,.01,.1,10,100,1000),gamma=c(.01,.1,.5,1,2,3,4)))
summary(brain.tune2)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##   100   0.5
## 
## - best performance: 0.2517544 
## 
## - Detailed performance results:
##     cost gamma     error dispersion
## 1  1e-04  0.01 0.4809942 0.12268512
## 2  1e-03  0.01 0.4809942 0.12268512
## 3  1e-02  0.01 0.4809942 0.12268512
## 4  1e-01  0.01 0.4757310 0.13138062
## 5  1e+01  0.01 0.2944444 0.09463782
## 6  1e+02  0.01 0.2780702 0.09554093
## 7  1e+03  0.01 0.2839181 0.10457913
## 8  1e-04  0.10 0.4809942 0.12268512
## 9  1e-03  0.10 0.4809942 0.12268512
## 10 1e-02  0.10 0.4809942 0.12268512
## 11 1e-01  0.10 0.2780702 0.07778303
## 12 1e+01  0.10 0.2786550 0.10412030
## 13 1e+02  0.10 0.3049708 0.08373052
## 14 1e+03  0.10 0.2786550 0.08810884
## 15 1e-04  0.50 0.4809942 0.12268512
## 16 1e-03  0.50 0.4809942 0.12268512
## 17 1e-02  0.50 0.4809942 0.12268512
## 18 1e-01  0.50 0.2783626 0.08652863
## 19 1e+01  0.50 0.2730994 0.08559481
## 20 1e+02  0.50 0.2517544 0.10735813
## 21 1e+03  0.50 0.2938596 0.12713415
## 22 1e-04  1.00 0.4809942 0.12268512
## 23 1e-03  1.00 0.4809942 0.12268512
## 24 1e-02  1.00 0.4809942 0.12268512
## 25 1e-01  1.00 0.2733918 0.10336210
## 26 1e+01  1.00 0.2622807 0.10513719
## 27 1e+02  1.00 0.2935673 0.12139258
## 28 1e+03  1.00 0.3415205 0.10377989
## 29 1e-04  2.00 0.4757310 0.13138062
## 30 1e-03  2.00 0.4757310 0.13138062
## 31 1e-02  2.00 0.4757310 0.13138062
## 32 1e-01  2.00 0.2789474 0.09554839
## 33 1e+01  2.00 0.2777778 0.11294274
## 34 1e+02  2.00 0.3412281 0.10944035
## 35 1e+03  2.00 0.4061404 0.12817005
## 36 1e-04  3.00 0.4757310 0.13138062
## 37 1e-03  3.00 0.4757310 0.13138062
## 38 1e-02  3.00 0.4757310 0.13138062
## 39 1e-01  3.00 0.3005848 0.10399021
## 40 1e+01  3.00 0.3149123 0.09648680
## 41 1e+02  3.00 0.3950292 0.11075530
## 42 1e+03  3.00 0.4102339 0.11998245
## 43 1e-04  4.00 0.4757310 0.13138062
## 44 1e-03  4.00 0.4757310 0.13138062
## 45 1e-02  4.00 0.4757310 0.13138062
## 46 1e-01  4.00 0.3058480 0.09384300
## 47 1e+01  4.00 0.3359649 0.10639820
## 48 1e+02  4.00 0.4154971 0.14782365
## 49 1e+03  4.00 0.3730994 0.10747089

bestmod2 <- brain.tune2$best.model
pred.orig2 <- predict(brain.fit2,set2)
pred.tune2 <- predict(bestmod2,set2)
table(predicted=pred.orig2,truth=set2$y)

##          truth
## predicted  1  2
##         1 36 14
##         2  0  0

36/50

## [1] 0.72

table(predicted=pred.tune2,truth=set2$y)

##          truth
## predicted  1  2
##         1 27 10
##         2  9  4

31/50

## [1] 0.62

Our first attempt at using a radial kernel failed to identify both classes. Tuning helped improve the situation, but the results are still not as good as when we used a linear kernel.

Finally, we can make an attempt to see if predictive results improve using a polynomial kernel with a degree of 2. An initial model will be fit, and then we will see if tuning a model can provide increased predictive results.

# SVM model with a polynomial kernel of degree 2.
brain.fit3 <- svm(y~.,data=set1,kernel='polynomial',gamma=.5,cost=1,degree=2,scale=FALSE)
plot(brain.fit3,set1)

# Predicted results on test data.
pred.orig3 <- predict(brain.fit3,set2)
table(predicted=pred.orig3,truth=set2$y)

##          truth
## predicted  1  2
##         1 27  9
##         2  9  5

32/50

## [1] 0.64

summary(brain.fit3)

## 
## Call:
## svm(formula = y ~ ., data = set1, kernel = "polynomial", gamma = 0.5, 
##     cost = 1, degree = 2, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  2 
##       gamma:  0.5 
##      coef.0:  0 
## 
## Number of Support Vectors:  81
## 
##  ( 41 40 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  1 2

brain.tune3 <- tune(svm,y~.,data=set1,ranges=list(cost=c(.1,1,10),gamma=c(.1,.5,1),degree=2))
summary(brain.tune3)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma degree
##     1     1      2
## 
## - best performance: 0.2505848 
## 
## - Detailed performance results:
##   cost gamma degree     error dispersion
## 1  0.1   0.1      2 0.2672515 0.08384559
## 2  1.0   0.1      2 0.2777778 0.10578529
## 3 10.0   0.1      2 0.2885965 0.10048965
## 4  0.1   0.5      2 0.2616959 0.10107593
## 5  1.0   0.5      2 0.2561404 0.09093311
## 6 10.0   0.5      2 0.2508772 0.10223052
## 7  0.1   1.0      2 0.2666667 0.09115848
## 8  1.0   1.0      2 0.2505848 0.10513719
## 9 10.0   1.0      2 0.2567251 0.07858918

pred.tune3 <- predict(brain.tune3$best.model,set2)
table(predicted=pred.tune3,truth=set2$y)

##          truth
## predicted  1  2
##         1 30 10
##         2  6  4

34/50

## [1] 0.68

Overall, for this example we cannot find suitable improvement over using a linear kernel.

The next example wil demonstrate how to plot the results from a Support Vector Machine model withouth relying on some of the default features of svm.plot(). We will do this using an artifical dataset.

To begin, we will generate the data using the rnorm() function for two sets of closely related points. We will then plot the data, differentiating between the two sets using the response variable.

set.seed(30)
# Matrix of random values from the normal distribution, with mean = 0, standard deviation = 1.
xa <- matrix(rnorm(50),25,2)
# Another matrix of values from the same distribution, but with a mean of 1 with same s.d.
xb <- matrix(rnorm(50,1,1),25,2)
x <- matrix(rbind(xa,xb),50,2)
# Response variable.
y <- rep(c(1,-1),c(25,25))
plot(x,col=y+3,pch=10)

We will now incorporate the data into a data.frame object with y as the response variable and x containing the predictors. Afterward, we will tune a model using the tune() function and use the summary() function to determine the best cost value for the fit. Once that is done, a new model will be created using the best cost value.

dat <- data.frame(x,y=as.factor(y))
art.tune <- tune(svm,y~.,data=dat,kernel='linear',ranges=list(cost=c(.01,.1,1,10,100)))
summary(art.tune)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##   0.1
## 
## - best performance: 0.2 
## 
## - Detailed performance results:
##    cost error dispersion
## 1 1e-02  0.42  0.1475730
## 2 1e-01  0.20  0.1333333
## 3 1e+00  0.20  0.1333333
## 4 1e+01  0.20  0.1632993
## 5 1e+02  0.20  0.1632993

art.fit.best <- svm(y~.,data=dat,kernel='linear',cost=.1,scale=F)

Next, we can define the make.grid() function, which we will use to help plot the data. The make.grid() function is taken from an example demonstrated in the Statistical Learning course via Stanford Online during the Spring of 2015.

make.grid <- function(x,n=100){
    grange <- apply(x,2,range)
    x1 <- seq(from=grange[1,1],to=grange[2,1],length=n)
    x2 <- seq(from=grange[1,2],to=grange[2,2],length=n)
    expand.grid(X1=x1,X2=x2)
    }

We will now generate points for both axes, as well as the data we will use to differentiate between the classes on the grid.

xgrid <- make.grid(x)
ygrid <- predict(art.fit.best,xgrid)

We can now plot the data, taking care to mark the points used to help decide on a barrier between the two classes.

plot(xgrid,col=c('red','blue')[as.numeric(ygrid)],pch=20,cex=.2)
points(x,col=y+3,pch=10)
points(x[art.fit.best$index,],pch=5,cex=2)

We can also plot the data and include the boundaries and margins between the two sets of data, extracting out the coefficients.

beta <- drop(t(art.fit.best$coefs)%*%x[art.fit.best$index,])
beta0 <- art.fit.best$rho
plot(xgrid,col=c('red','blue')[as.numeric(ygrid)],pch=20,cex=.2)
points(x[art.fit.best$index,],pch=10,cex=2,col=y+3)
abline(beta0/beta[2],-beta[1]/beta[2])
abline((beta0-1)/beta[2],-beta[1]/beta[2],lty=2)
abline((beta0+1)/beta[2],-beta[1]/beta[2],lty=2)

Support Vector Machines

Pierre Phanor

December 21, 2015