9.6.1 Support Vector Classifier

e1071

1.Generating the observations, which belong to two classes:

set.seed(1)
x = matrix(rnorm(20*2),ncol=2)
y = c(rep(-1,10),rep(1,10))
x[y==1,] = x[y==1,] + 1

2.Checking whether the classes are linearly separable:

plot(x,col=(3-y))

They are not.

3.Fit the support vector classifier:

dat = data.frame(x=x,y=as.factor(y)) #分类:response should be a factor variable
library(e1071)
svmfit = svm(y~.,data=dat,kernel="linear",cost=10,scale=FALSE)

Notes:

  1. Plot the support vector classifier obtained:
plot(svmfit,dat)

svmfit$index # support vectors
## [1]  1  2  5  7 14 16 17
summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  7
## 
##  ( 4 3 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1

This tells us, for instance, that a kernel was used with cost=10, and that there were 7 support vectors, 4 in one class and 3 in the other.

Then we use a smaller value of the cost parameter:

svmfit2 = svm(y~.,data=dat,kernel="linear",cost=0.1,scale=FALSE)
plot(svmfit2,dat)

summary(svmfit2)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 0.1, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
## 
## Number of Support Vectors:  16
## 
##  ( 8 8 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1

Now that a smaller value of the cost parameter is being used, we obtain a larger number of support vectors, because the margin is now wider. Unfortunately, the svm() function does not explicitly output the coefficients of the linear decision boundary obtained when the support vectors classifier is fit, nor does it output the width of the margin.

The e1071 library includes a built-in function,tune(), to perform cross-validation.

set.seed(1)
tune.out = tune(svm,y~.,data=dat,kernel="linear",ranges=list(cost=c(0.001,0.01,0.1,1.5,10,100))) #默认是ten-fold
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##   0.1
## 
## - best performance: 0.05 
## 
## - Detailed performance results:
##      cost error dispersion
## 1   0.001  0.55  0.4377975
## 2   0.010  0.55  0.4377975
## 3   0.100  0.05  0.1581139
## 4   1.500  0.15  0.2415229
## 5  10.000  0.15  0.2415229
## 6 100.000  0.15  0.2415229

We see that cost=0.1 results is the lowest cross-validation error rate. The tune()function stores the best model obtained, which can be accessed as follows:

bestmod = tune.out$best.model
summary(bestmod)
## 
## Call:
## best.tune(method = svm, train.x = y ~ ., data = dat, ranges = list(cost = c(0.001, 
##     0.01, 0.1, 1.5, 10, 100)), kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
## 
## Number of Support Vectors:  16
## 
##  ( 8 8 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1

The predict() function can be used to predict the class label on a set of test observations, at any given value of the cost parameter. We begin by generating a test data set:

xtest = matrix(rnorm(20*2),ncol=2)
ytest = sample(c(-1,1),20,rep=TRUE)
xtest[ytest==1,] = xtest[ytest==1,] + 1
testdat = data.frame(x=xtest,y=as.factor(ytest))

Now we predict the class labels of these test observations:

ypred = predict(bestmod,testdat)
table(predict=ypred,truth=testdat$y)
##        truth
## predict -1 1
##      -1  9 1
##      1   2 8

With the value of cost=0.1,17 of the test observations are correctly classified. Then, we use cost=0.01:

svmfit = svm(y~.,data=dat,kernel="linear",cost=0.01,sclae=FALSE)
ypred = predict(svmfit,testdat)
table(predict=ypred,truth=testdat$y)
##        truth
## predict -1  1
##      -1 11  6
##      1   0  3

In this case, 3 additional observations is misclassified.

Now consider a situation in which the two classes are linearly separable. Then we can find a separating hyperplane using the svm() function.

x[y==1,] = x[y==1,] + 0.5
plot(x,col=(y+5)/2,pch=19)

Now the observations are just barely linearly separable. We fit the support vector classifier and plot the resulting hyperplane, using a very large value of cost so that no observations are misclassified.

dat = data.frame(x=x,y=as.factor(y))
svmfit = svm(y~.,data=dat,kernel="linear",cost=1e5)
summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 1e+05)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1e+05 
## 
## Number of Support Vectors:  3
## 
##  ( 1 2 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1
plot(svmfit,dat)

No training errors were made and only three support vectors were used. However, we can see from the figure that the margin is very narrow(because the observations that not support vectors, indicated as circles, are very close to the decision boundary). It seems likely that this model will perform poorly on the test data. We now try a smaller value of cost:

svmfit = svm(y~.,data=dat,kernel="linear",cost=1)
summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  7
## 
##  ( 4 3 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1
plot(svmfit,dat)

We misclassify a training observation, but we also obtain a much wider margin and make use of seven vectors. It seems likely that this model will perform better on test data than the model with cost=1e5.

9.6.2 Support Vector Machine

We first generate some data with a non-linear class boundary:

set.seed(1)
x = matrix(rnorm(200*2),ncol=2)
x[1:100,] = x[1:100,]+2
x[101:150,] = x[101:150,]-2
y = c(rep(1,150),rep(2,50))
dat = data.frame(x=x,y=as.factor(y))

Plotting the data makes it clear that the class boundary is indeed non-linear:

plot(x,col=y)

The data is randomly split into training and testing groups.We then fit the training data using the svm() function with a radial kernel and \(\gamma=1\):

train = sample(200,100)
svmfit = svm(y~.,data=dat[train,],kernel="radial",gamma=1,cost=1)
plot(svmfit,dat[train,])

We can perform cross-validation using tune() to select the best choice of \(\gamma\) and cost for an SVM with a radial kernel:

set.seed(1)
tune.out = tune(svm,y~.,data=dat[train,],kernel="radial",ranges=list(cost=c(0.1,1,10,100,1000),gamma=c(0.5,1,2,3,4)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##     1   0.5
## 
## - best performance: 0.07 
## 
## - Detailed performance results:
##     cost gamma error dispersion
## 1  1e-01   0.5  0.26 0.15776213
## 2  1e+00   0.5  0.07 0.08232726
## 3  1e+01   0.5  0.07 0.08232726
## 4  1e+02   0.5  0.14 0.15055453
## 5  1e+03   0.5  0.11 0.07378648
## 6  1e-01   1.0  0.22 0.16193277
## 7  1e+00   1.0  0.07 0.08232726
## 8  1e+01   1.0  0.09 0.07378648
## 9  1e+02   1.0  0.12 0.12292726
## 10 1e+03   1.0  0.11 0.11005049
## 11 1e-01   2.0  0.27 0.15670212
## 12 1e+00   2.0  0.07 0.08232726
## 13 1e+01   2.0  0.11 0.07378648
## 14 1e+02   2.0  0.12 0.13165612
## 15 1e+03   2.0  0.16 0.13498971
## 16 1e-01   3.0  0.27 0.15670212
## 17 1e+00   3.0  0.07 0.08232726
## 18 1e+01   3.0  0.08 0.07888106
## 19 1e+02   3.0  0.13 0.14181365
## 20 1e+03   3.0  0.15 0.13540064
## 21 1e-01   4.0  0.27 0.15670212
## 22 1e+00   4.0  0.07 0.08232726
## 23 1e+01   4.0  0.09 0.07378648
## 24 1e+02   4.0  0.13 0.14181365
## 25 1e+03   4.0  0.15 0.13540064

Therefore, the best choice of parameters involves cost=1 and gamma=2. We can view the test set predictions for this model by applying the predict() function to the data. Notice that to do this we subset the dataframe dat using -train as an index set.

table(true=dat[-train,"y"],pred=predict(tune.out$best.model,newx=dat[-train,]))
##     pred
## true  1  2
##    1 54 23
##    2 17  6
(59+6)/(59+6+20+15)
## [1] 0.65

35% of test observations are misclassified by this SVM.

9.6.3 ROC Curves

The ROCR packages can be used to produce ROC curves.

We first write a short function to plot an ROC curve given a vector containing a numerical score for each observation(pred) and a vector containing the class label for each observation(truth)

library(ROCR)
rocplot = function(pred,truth,...){
  predob = prediction(pred,truth)
  perf = performance(predob,"tpr","fpr")
  plot(perf,...)
}

Fitted values: the numerical scores used to obtain the class labels. In essence, the sign of the fitted value determines on which side of the decision boundary the observation lies. In order to obtain the fitted values for a given SVM model fit, we use decision.values=TRUE when fitting svm(). Then the predict() function will output the fitted values.

svmfit.opt = svm(y~.,data=dat[train,],kernel="radial",gamma=2,cost=1,decision.values=T)
fitted = attributes(predict(svmfit.opt,dat[train,],decision.values=TRUE))$decision.values

Now we produce the ROC plot:

par(mfrow=c(1,2))
rocplot(fitted,dat[train,"y"],main="Training Data")

svmfit.flex = svm(y~.,data=dat[train,],kernel="radial",gamma=50,cost=1,decision.values=T)
fitted = attributes(predict(svmfit.flex,dat[train,],decision.values=T))$decision.values
rocplot(fitted,dat[train,"y"],add=T,col="red")

By increasing \(\gamma\) we can produce a more flexible fit and generate further improvements in accuracy. However, these ROC curves are all on the training data.We are really more interested in the level of prediction accuracy on the test data.

fitted = attributes(predict(svmfit.opt,dat[-train,],decision.values=T))$decision.values
rocplot(fitted,dat[-train,"y"],main="Test Data")
fitted = attributes(predict(svmfit.flex,dat[-train,],decision.values=T))$decision.values
rocplot(fitted,dat[-train,"y"],add=T,col="red")

The model with \(\gamma=2\) appears to provide the most accurate results.

9.6.4 SVM with Multiple Classes

The svm() function will perform multi-class classification using the one-versus-oneapproach.

set.seed(1)
x = rbind(x,matrix(rnorm(50*2),ncol=2))
y = c(y,rep(0,50))
dat = data.frame(x=x,y=as.factor(y))
par(mfrow=c(1,1))
plot(x,col=(y+1))

We now fit an SVM to the data:

svmfit = svm(y~.,data=dat,kernel="radial",cost=10,gamma=1)
plot(svmfit,dat)

9.6.5 Appliaction to Gene Expression Data

Khan data set: consists of a number of tissue samples corresponding to four distinct types of small round blue cell tumors.

library(ISLR)
names(Khan)
## [1] "xtrain" "xtest"  "ytrain" "ytest"
dim(Khan$xtrain)
## [1]   63 2308
dim(Khan$xtest)
## [1]   20 2308
length(Khan$ytrain)
## [1] 63
length(Khan$ytest)
## [1] 20

This data set consists of expression measurements for 2308 genes. The training and test sets consist of 63 and 20 observations respectively.

table(Khan$ytrain)
## 
##  1  2  3  4 
##  8 23 12 20
table(Khan$ytest)
## 
## 1 2 3 4 
## 3 6 6 5

In this data set, there are a very large number of features relative to the number of observations. This suggests that we should use a linear kernel, because the additional flexibility that will result from using a polynomial or radial kernel is unnecessary.

dat = data.frame(x=Khan$xtrain,y=as.factor(Khan$ytrain))
out = svm(y~.,data=dat,kernel="linear",cost=10)
summary(out)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  58
## 
##  ( 20 20 11 7 )
## 
## 
## Number of Classes:  4 
## 
## Levels: 
##  1 2 3 4
table(out$fitted,dat$y)
##    
##      1  2  3  4
##   1  8  0  0  0
##   2  0 23  0  0
##   3  0  0 12  0
##   4  0  0  0 20

We see that there are no training errors. Large number of variables relative to the number of observations implies that it is easy to find hyperplanes that fully separate the classes.

The performance on the test observations:

dat.te = data.frame(x=Khan$xtest,y=as.factor(Khan$ytest))
pred.te = predict(out,newdata=dat.te)
table(pred.te,dat.te$y)
##        
## pred.te 1 2 3 4
##       1 3 0 0 0
##       2 0 6 2 0
##       3 0 0 4 0
##       4 0 0 0 5

We see that using cost=10 yields two test set errors on this data.