We will use the svm() function in the e1071 package. A classification model is fitted when the y variable is a factor; otherwise, svm() performs regression. If y is omitted altogether, svm() performs novelty detection.
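As a quick sketch of the three modes on made-up toy data (the objects w and z are ours, purely for illustration):
library(e1071)
w = matrix(rnorm(100), ncol = 2)   # toy predictors
z = rnorm(50)                      # toy numeric response
m1 = svm(w, z)               # numeric y: SVM-Type is eps-regression
m2 = svm(w, factor(z > 0))   # factor y: SVM-Type is C-classification
m3 = svm(w)                  # y omitted: SVM-Type is one-classification (novelty detection)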
library(e1071)
set.seed(1)
x = matrix(rnorm(200*2), ncol = 2)   # 200 observations on 2 variables
x[1:100, ] = x[1:100, ] + 2          # shift the first 100 observations
x[101:150, ] = x[101:150, ] - 2      # shift the next 50 the other way
y = c(rep(1, 150), rep(2, 50))       # class labels: 150 ones, 50 twos
dat = data.frame(x = x, y = as.factor(y))
plot(x, col = y)
The plot shows that the two classes are not linearly separable.
train = sample(200, 100)
svmfit = svm(y ~ ., data = dat[train, ], kernel = "radial", gamma = 1, cost = 1)
plot(svmfit, dat[train, ])
plot(svmfit, dat)
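Increasing cost reduces the number of training errors, at the price of a more irregular decision boundary and a greater risk of overfitting. A sketch (the object name svmfit2 is ours):
svmfit2 = svm(y ~ ., data = dat[train, ], kernel = "radial", gamma = 1, cost = 1e5)
plot(svmfit2, dat[train, ])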
We can determine the indices of the support vectors with:
svmfit$index
## [1] 8 11 13 14 17 23 26 39 45 47 50 66 67 88 92 94 96 4 5 12 35 37 43
## [24] 44 48 52 54 55 64 68 72 73 78 82 84 86 99
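These indices refer to rows of the data used to fit the model, so the support-vector observations themselves can be inspected directly:
head(dat[train, ][svmfit$index, ])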
svmfit = svm(y ~ ., data = dat, kernel = "linear", cost = 0.1, scale = FALSE)
plot(svmfit , dat)
Again, we can determine the indices of the support vectors with:
svmfit$index
## [1] 2 3 4 5 6 11 15 16 19 26 31 32 39 50 55 56 61
## [18] 65 68 70 74 83 92 93 95 102 104 109 112 113 114 117 118 119
## [35] 124 127 129 131 132 133 134 136 138 139 140 141 142 143 144 145 146
## [52] 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165
## [69] 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182
## [86] 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199
## [103] 200
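The small cost gives a wide margin, so a large share of the observations become support vectors; summary() reports the total number and the split per class:
summary(svmfit)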
The e1071 library includes a built-in function, tune(), to perform cross-validation. By default, tune() performs ten-fold cross-validation on a set of models of interest. To use this function, we pass in relevant information about the set of models under consideration. The following command indicates that we want to compare SVMs with a linear kernel, using a range of values of the cost parameter.
set.seed(1)
tune.out = tune(svm, y ~ ., data = dat, kernel = "linear",
                ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.001
##
## - best performance: 0.25
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.25 0.120185
## 2 1e-02 0.25 0.120185
## 3 1e-01 0.25 0.120185
## 4 1e+00 0.25 0.120185
## 5 5e+00 0.25 0.120185
## 6 1e+01 0.25 0.120185
## 7 1e+02 0.25 0.120185
Here every value of cost yields the same cross-validation error rate (0.25), so tune() simply reports the smallest value tried, cost=0.001, as the best parameter. The tune() function stores the best model obtained, which can be accessed as follows:
bestmod = tune.out$best.model
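The best model behaves like any other svm fit. For instance, a quick confusion table (in-sample, since tune() above was run on all of dat):
table(pred = predict(bestmod, dat), true = dat$y)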
data(iris)
iris2cat = iris[1:100, ]   # keep the first 100 rows: only setosa and versicolor
head(iris2cat)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
The task is to predict the species of an iris flower from its sepal length and width. The y variable should therefore be categorical:
class(iris2cat$Species)
## [1] "factor"
We can first look at the scatter plot:
library(ggplot2)
f = ggplot(iris2cat, aes(Sepal.Width, Sepal.Length, colour = Species))
f + geom_jitter() + scale_colour_hue() + theme(legend.position = "bottom")
For comparison, we can also run k-means clustering with two clusters on the two sepal measurements:
km.res <- kmeans(iris2cat[,1:2], 2, nstart = 25)
library(factoextra)
fviz_cluster(km.res, data = iris2cat[,1:2], ellipse.type = "convex") +
  theme_minimal()
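To see how well the unsupervised clusters line up with the true species, we can cross-tabulate them (droplevels() drops the unused virginica level):
table(cluster = km.res$cluster, species = droplevels(iris2cat$Species))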
For the SVM, we fit the model and predict the species; note that here we predict on the same data used for fitting:
svm.model <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris2cat, cost = 100, gamma = 1)
svm.pred <- predict(svm.model, iris2cat[,1:2])
plot(svm.model, iris2cat[, c(1:2, 5)])
We can determine the indices of the support vectors:
svm.model$index
## [1] 14 15 16 19 23 26 32 33 37 42 51 58 61 69 77 85 86 88
A cross-tabulation of the true versus the predicted values yields:
table(pred = svm.pred, true = iris2cat$Species)
## true
## pred setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 50 0
## virginica 0 0 0
correctRate = sum(svm.pred == iris2cat$Species) / length(iris2cat$Species)
misRate = 1 - correctRate
To see how the cost parameter affects the fit, we can loop over a grid of cost values and record the training misclassification rate for each (the result vectors are preallocated before the loop):
costs = seq(from = 0.005, to = 1, by = 0.005)
correctRate = double(length(costs))   # preallocate the result vectors
misRate = double(length(costs))
for (i in 1:length(costs)) {
  epsilon.svr = svm(Species ~ Sepal.Length + Sepal.Width, data = iris2cat, cost = costs[i], gamma = 1)
  svm.pred <- predict(epsilon.svr, iris2cat[,1:2])
  correctRate[i] = sum(svm.pred == iris2cat$Species) / length(iris2cat$Species)
  misRate[i] = 1 - correctRate[i]
}
plot(costs, misRate, type = "l")
print(min(misRate))
## [1] 0
k=which.min(misRate)
print(costs[k])
## [1] 0.92
Note that this is the training misclassification rate, which is optimistic. Rather than selecting cost by training error, we can use tune.svm(), a convenience wrapper around tune(), to choose it by ten-fold cross-validation; here the default radial kernel with gamma = 1 is used:
obj = tune.svm(Species ~ Sepal.Length + Sepal.Width, data = iris2cat,
               cost = seq(from = 0.005, to = 1, by = 0.005), gamma = 1)
print(obj)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 1 0.045
##
## - best performance: 0.01
# summary(obj) would additionally list the cross-validation error for every cost value
bestmod = obj$best.model
bestmod
##
## Call:
## best.svm(x = Species ~ Sepal.Length + Sepal.Width, data = iris2cat,
## gamma = 1, cost = seq(from = 0.005, to = 1, by = 0.005))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.045
## gamma: 1
##
## Number of Support Vectors: 96
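As before, the cross-validated model can be used for prediction; a quick in-sample check:
table(pred = predict(bestmod, iris2cat), true = droplevels(iris2cat$Species))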
In this example, we use the Glass data from the UCI Repository of Machine Learning Databases for classification.
data(Glass, package="mlbench")
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
The task is to predict the type of glass on the basis of its chemical analysis. The y variable should therefore be categorical:
class(Glass$Type)
## [1] "factor"
We can first look at the scatter plots. (For colours in R, have a look at http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf):
pairs(Glass[1:9], main = "Glass Data", pch = 21,
      bg = c("red", "green3", "blue", "yellow", "purple", "deeppink")[unclass(Glass$Type)])
We start by splitting the data into a train and a test set. No random seed is set here, so the exact split, and hence the confusion table below, will vary from run to run.
index <- 1:nrow(Glass)
testindex <- sample(index, trunc(length(index)/3))
testset <- Glass[testindex,]
trainset <- Glass[-testindex,]
For both the SVM and the partitioning tree (via rpart()), we fit the model on the training set and predict the test set values. First the SVM:
svm.model <- svm(Type ~ ., data = trainset, cost = 100, gamma = 1)
svm.pred <- predict(svm.model, testset[,-10])
A cross-tabulation of the true versus the predicted values yields:
table(pred = svm.pred, true = testset[,10])
## true
## pred 1 2 3 5 6 7
## 1 15 7 4 0 0 0
## 2 8 16 2 3 1 2
## 3 0 1 2 0 0 0
## 5 0 0 0 1 0 0
## 6 0 0 0 0 2 0
## 7 0 1 0 0 0 6
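From the same predictions we can compute the overall test-set accuracy (the exact value depends on the random split above):
mean(svm.pred == testset$Type)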