We will use the svm() function in package e1071. A Classification model is fitted when type of y variable is a factor, and otherwise, it behaves as a regression analysis. If y is omitted, it is a novelty detection task.
n = 25
a1 = rnorm(n)
a2 = 1 - a1 + 2* runif(n)
b1 = rnorm(n)
b2 = -1 - b1 - 2*runif(n)
x = rbind(matrix(cbind(a1,a2),,2),matrix(cbind(b1,b2),,2))
y <- matrix(c(rep(1,n),rep(-1,n)))
plot(x,col=ifelse(y>0,2,3),pch=".",cex=8)
kfunction <- function(linear =0, quadratic=0)
{
k <- function (x,y)
{
linear*sum((x)*(y)) + quadratic*sum((x^2)*(y^2))
}
class(k) <- "kernel"
k
}
svp <- ksvm(x,y,type="C-svc",C = 100, kernel=kfunction(1,0),scaled=c())
plot(c(min(x[,1]), max(x[,1])),c(min(x[,2]), max(x[,2])),type='n',xlab='x1',ylab='x2')
title(main='Linear Separable Features')
ymat <- ymatrix(svp)
points(x[-SVindex(svp),1], x[-SVindex(svp),2], pch = ifelse(ymat[-SVindex(svp)] < 0, 2, 1))
points(x[SVindex(svp),1], x[SVindex(svp),2], pch = ifelse(ymat[SVindex(svp)] < 0, 17, 16))
# Extract w and b from the model
w <- colSums(coef(svp)[[1]] * x[SVindex(svp),])
b <- b(svp)
# Draw the lines
abline(b/w[2],-w[1]/w[2])
abline((b+1)/w[2],-w[1]/w[2],lty=2)
abline((b-1)/w[2],-w[1]/w[2],lty=2)
data(iris)
iris2cat = iris[1:100,]
head(iris2cat)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
The task is to predict the type of the iris flower based on sepal length and width. Therefore the y variable should be a categorical variable.
class(iris2cat$Species)
## [1] "factor"
We can first look at the scatter plot:
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:kernlab':
##
## alpha
f = ggplot(iris2cat, aes(Sepal.Width, Sepal.Length, colour = Species))
f + geom_jitter() + scale_colour_hue() + theme(legend.position = "bottom")
km.res <- kmeans(iris2cat[,1:2], 2, nstart = 25)
library(factoextra)
## Warning: package 'factoextra' was built under R version 3.3.3
fviz_cluster(km.res, data = iris2cat[,1:2], frame.type = "convex")+
theme_minimal()
## Warning: argument frame is deprecated; please use ellipse instead.
## Warning: argument frame.type is deprecated; please use ellipse.type
## instead.
For the SVM we fit the model and try to predict the test set values:
svm.model <- svm(Species ~ Sepal.Length+Sepal.Width, data = iris2cat, cost = 100, gamma = 1)
svm.pred <- predict(svm.model, iris2cat[,1:2])
plot(svm.model,iris2cat[,c(1:2,5)])
Determine the identity of the support vectors:
svm.model$index
## [1] 14 15 16 19 23 26 32 33 37 42 51 58 61 69 77 85 86 88
A cross-tabulation of the true versus the predicted values yields:
table(pred = svm.pred, true = iris2cat$Species)
## true
## pred setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 50 0
## virginica 0 0 0
correctRate = sum(svm.pred==iris2cat$Species)/length(iris2cat$Species)
misRate=1-correctRate
```
costs = seq(from=0.005, to=1,by=0.005)
pseudor2 = double(length(costs))
for (c in 1:length(costs)){
epsilon.svr = svm(Species ~ Sepal.Length+Sepal.Width, data=iris2cat,cost=costs[c], gamma = 1)
svm.pred <- predict(epsilon.svr, iris2cat[,1:2])
classificationTable=table(pred = svm.pred, true = iris2cat$Species)
correctRate[c] = sum(svm.pred==iris2cat$Species)/length(iris2cat$Species)
misRate[c]=1-correctRate[c]
}
plot(costs,misRate, type="l")
print(min(misRate))
## [1] 0
k=which.min(misRate)
print(costs[k])
## [1] 0.92
The e1071 library includes a built-in function, tune(), to perform crossvalidation. By default, tune() performs ten-fold cross-validation on a set of models of interest. In order to use this function, we pass in relevant information about the set of models that are under consideration. The following command indicates that we want to compare SVMs with a linear kernel, using a range of values of the cost parameter.
obj = tune.svm(Species ~ Sepal.Length+Sepal.Width, data=iris2cat,cost=seq(from=0.005, to=1,by=0.005), gamma = 1)
print(obj)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 1 0.045
##
## - best performance: 0.01
#summary(obj)
bestmod = obj$best.model
bestmod
##
## Call:
## best.svm(x = Species ~ Sepal.Length + Sepal.Width, data = iris2cat,
## gamma = 1, cost = seq(from = 0.005, to = 1, by = 0.005))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.045
## gamma: 1
##
## Number of Support Vectors: 96
In this example, we use the glass data from the UCI Repository of Machine Learning Databases for classification.
data(Glass, package="mlbench")
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
The task is to predict the type of a glass on basis of its chemical analysis. Therefore the y variable should be a categorical variable.
class(Glass$Type)
## [1] "factor"
We can first look at the scatter plots. (For colours in R, have a look http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf):
pairs(Glass[1:9], main = "Glass Data", pch = 21, bg = c("red", "green3", "blue","yellow","purple","deeppink")[unclass(Glass$Type)])
We start by splitting the data into a train and test set.
index <- 1:nrow(Glass)
testindex <- sample(index, trunc(length(index)/3))
testset <- Glass[testindex,]
trainset <- Glass[-testindex,]
Both for the SVM and the partitioning tree (via rpart()), we fit the model and try to predict the test set values:
svm.model <- svm(Type ~ ., data = trainset, cost = 100, gamma = 1)
svm.pred <- predict(svm.model, testset[,-10])
A cross-tabulation of the true versus the predicted values yields:
table(pred = svm.pred, true = testset[,10])
## true
## pred 1 2 3 5 6 7
## 1 13 2 2 0 1 0
## 2 7 23 3 2 2 5
## 3 1 1 1 0 0 0
## 5 0 0 0 1 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 7