This article explains how to implement Support Vector Machines in R and how to interpret the results in depth.

SVM does not use a probability model like many other classifiers do; instead it directly looks for a hyperplane which divides and segments the data into classes.

The general form of a hyperplane is:

\[\beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p = 0\] where \(p\) is the number of dimensions.

  1. For \(p=2\), i.e. a 2-D space, the hyperplane is a line.

  2. The vector \((\beta_1,\beta_2,\dots,\beta_p)\) is just the normal vector to the hyperplane. A vector, in simple terms, is just a 1-dimensional tensor or a 1-D array.
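As a toy illustration (with made-up numbers, not from any fitted model), the class of a point is determined by which side of the hyperplane it falls on, i.e. by the sign of \(\beta_0 + \beta_1X_1 + \dots + \beta_pX_p\):

beta0 <- 0.5
beta  <- c(1, -2)                  #hypothetical normal vector (beta1, beta2) for p = 2
point <- c(0.3, 0.8)               #a point in the 2-D feature space
sign(beta0 + sum(beta * point))    #+1 or -1 tells us which side of the hyperplane the point lies on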

Support Vector Classifiers are mostly used for solving binary classification problems, where we have only 2 class labels, say \(Y = \{-1,1\}\), and a bunch of predictors \(X_i\). What SVM does is generate hyperplanes, which in simple terms are just straight lines, planes or non-linear curves, and these boundaries are used to separate or segment the data into 2 or more categories depending on the type of classification problem.

We try to find a plane which separates the classes in the feature space spanned by the predictors \(X_i\).

Another concept in SVM is that of Maximal Margin Classifiers. Among a set of separating hyperplanes, SVM aims at finding the one which maximizes the margin \(M\). This simply means that we want to maximize the gap, i.e. the distance of the 2 classes from the decision boundary (the separating plane).
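As a small aside (a sketch with hypothetical numbers): if the hyperplane is scaled so that the margin boundaries sit at \(f(x)=\pm1\), the distance from the decision boundary to either margin line is \(1/\lVert\beta\rVert\), so maximizing the margin amounts to minimizing \(\lVert\beta\rVert\).

beta <- c(1, -2)              #hypothetical coefficients of a separating hyperplane
M <- 1 / sqrt(sum(beta^2))    #distance from the boundary to either margin line
2 * M                         #total gap between the two margin lines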

This idea of separating the data into 2 different classes using a linear separator (a straight line) is called linear separability.

The term Support Vectors in SVM refers to the data points or training examples which are used to define and maximize the margin. The support vectors are the points which lie closest to the decision boundary or on the wrong side of their margin.


Linear SVM Classifier in R

set.seed(10023)
#generating data
#a matrix with 20 rows and 2 columns
x=matrix(rnorm(40),20,2) #predictors
x
##              [,1]        [,2]
##  [1,] -0.09551772 -0.71292841
##  [2,] -0.88478229 -0.67113080
##  [3,] -0.29167012 -0.86501651
##  [4,] -1.62441408 -0.23668324
##  [5,] -0.91787482  0.02041167
##  [6,]  1.32568784 -0.38768476
##  [7,]  0.31388917  0.45673048
##  [8,]  1.01797164  1.53178430
##  [9,] -1.16672514  0.93732626
## [10,] -0.68743749 -0.41605185
## [11,]  0.99254322 -0.55050643
## [12,]  0.18749460  0.80216742
## [13,] -0.46108212  1.88537026
## [14,] -1.40624867 -1.03890008
## [15,]  1.93876012 -0.63060819
## [16,] -0.77541675 -0.40216218
## [17,] -0.20556829  0.29768270
## [18,]  0.64617858 -0.85522734
## [19,]  0.26055481 -0.69133624
## [20,] -0.25535361 -0.72415561
y=rep(c(-1,1),c(10,10)) #binary response with the 2 class labels -1 and 1
x[y==1,]=x[y==1,]+1 #shift the class-1 points so that the two classes separate

#plotting the points
plot(x,col=y+2,pch=19)


Using the ‘e1071’ package to fit an SVM classifier

require(e1071)
#converting to a data frame
data<-data.frame(x,y=as.factor(y))
head(data)
##            X1          X2  y
## 1 -0.09551772 -0.71292841 -1
## 2 -0.88478229 -0.67113080 -1
## 3 -0.29167012 -0.86501651 -1
## 4 -1.62441408 -0.23668324 -1
## 5 -0.91787482  0.02041167 -1
## 6  1.32568784 -0.38768476 -1
svm<-svm(y ~ .,data=data,kernel="linear",cost=10,scale = F)
#here cost 'C' is a tuning parameter: the larger it is, the more heavily margin violations are penalised and the narrower the margin becomes; it acts like an (inverse) regularization parameter
svm
## 
## Call:
## svm(formula = y ~ ., data = data, kernel = "linear", cost = 10, 
##     scale = F)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
##       gamma:  0.5 
## 
## Number of Support Vectors:  10
svm$index #gives us the index of the Support Vectors
##  [1]  1  6  7  8  9 14 16 17 19 20
#so we have 10 support vectors
svm$fitted #to find the fitted values
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
## -1 -1 -1 -1 -1  1  1  1 -1 -1  1  1  1 -1  1  1  1  1  1  1 
## Levels: -1 1
#Confusion Matrix of Fitted values and Actual Response values
table(Predicted=svm$fitted,Actual=y)
##          Actual
## Predicted -1 1
##        -1  7 1
##        1   3 9
#accuracy on Training Set
mean(svm$fitted==y)*100 #has 80 % accuracy on Training Set
## [1] 80
#plotting
plot(svm,data)
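As a side note, the cost parameter does not have to be guessed: e1071 also provides a tune() helper that cross-validates over a grid of values. A minimal sketch (the grid of cost values below is just an example; e1071::svm is passed explicitly because the name svm is already used for the fitted model above):

set.seed(1)
tuned <- tune(e1071::svm, y ~ ., data = data, kernel = "linear", scale = FALSE,
              ranges = list(cost = c(0.01, 0.1, 1, 10, 100)))
summary(tuned)          #cross-validation error for each value of cost
tuned$best.parameters   #the cost value with the lowest CV error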

We can also create our own plot.

#First Making Grids using a function
make.grid<-function(x,n=75) {
  grange=apply(x,2,range)
  x1=seq(from=grange[1,1],to=grange[2,1],length=n)
  x2=seq(from=grange[1,2],to=grange[2,2],length=n)
  expand.grid(X1=x1,X2=x2) #it makes a Lattice for us
}
xgrid=make.grid(x) #a data frame of 75*75 = 5625 grid points covering the range of x

#now predicting on this new Test Set
ygrid=predict(svm,xgrid)

#plotting the Linear Separator
plot(xgrid,col=c("red","blue")[as.numeric(ygrid)],pch=19,cex=.2)
#creates 2 regions
points(x,col=y+3,pch=19) #adding the points on Plot
points(x[svm$index,],pch=5,cex=2) #Highlighting the Support Vectors

In the above Plot the Highlighted points are the Support Vectors which were used in determining the Decision Boundary.


Extracting the Coefficient values of the Linear SVM equation

The \(\beta\)s here are the coefficient values of the SVM model. As this is a linear SVM classifier, the decision function is linear in the predictors \(X_1\) and \(X_2\).

\(f(x) = \beta_0 + \beta_1X_1 + \beta_2X_2\) is the mathematical equation of the linear SVM classifier's decision function; the predicted class is given by the sign of \(f(x)\).

beta = drop(t(svm$coefs)%*%x[svm$index,])
beta
## [1] -0.9108283 -0.5915662
beta0=svm$rho
beta0 #the intercept term (e1071 stores it as rho); the decision boundary is beta1*X1 + beta2*X2 = beta0
## [1] -0.4914382
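As a quick sanity check (a sketch, not part of the original workflow): with a linear kernel and scale = F, these recovered coefficients should reproduce, up to numerical precision, the decision values that predict() returns.

dec_manual <- x %*% beta - beta0    #manual decision function f(x) = beta.x - beta0
dec_e1071  <- attr(predict(svm, data, decision.values = TRUE), "decision.values")
head(cbind(dec_manual, dec_e1071))  #the two columns should agree
table(sign(dec_manual), svm$fitted) #which class sits on the positive side depends on e1071's internal label ordering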
#again Plotting
plot(xgrid,col=c("red","blue")[as.numeric(ygrid)],pch=19,cex=.2)
#creates 2 regions
points(x,col=y+3,pch=19) #adding the points on Plot
points(x[svm$index,],pch=5,cex=2)
abline(beta0/beta[2],-beta[1]/beta[2],lty=1)#is the Decision boundary or Plane
#the two dashed lines below mark the soft margins on either side of the boundary
abline((beta0-1)/beta[2],-beta[1]/beta[2],lty=2)
abline((beta0+1)/beta[2],-beta[1]/beta[2],lty=2)

In the above plot the dashed lines are the soft margins, i.e. the margin lines on or within which the support vectors lie. How narrow or wide these soft margins become depends on the value of our tuning parameter \(C\) (the cost), which we set to 10 in the SVM above.
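To see the effect of \(C\) directly, one could refit with a smaller cost and compare (a sketch; cost = 0.1 is an arbitrary choice):

svm_soft <- e1071::svm(y ~ ., data = data, kernel = "linear", cost = 0.1, scale = FALSE)
length(svm_soft$index)   #typically more support vectors than the 10 we got with cost = 10
2 / sqrt(sum(beta^2))    #width of the margin band for the cost = 10 fit above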


Conclusion

Support Vector Machines are a very strong and accurate technique for classification. SVMs are preferable when the classes are well separated, as in the example above, where the 10 points of class 1 were shifted away from the 10 points of class -1. One unique thing about SVMs is that they don't actually follow or use a conditional probability model \(Pr(Y \mid X_i)\) like other classifiers do.

A linear SVM is not always useful: it can only be used when the data is linearly separable.

When the data is non-linearly separable, i.e. has non-linearities in it, we need to do feature expansion: apply a non-linear transform to the features to map them into higher dimensions, and use a function \(f(x,\beta)\) which is non-linear in the predictors \(X_i\). This gives a non-linear decision boundary which separates the data in the enlarged feature space. An example is the radial SVM, which uses a radial kernel.
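As a pointer only (a sketch; the default gamma and the same cost as before are assumptions, not tuned values), a radial-kernel SVM can be fit on the same data with a one-line change:

svm_radial <- e1071::svm(y ~ ., data = data, kernel = "radial", cost = 10, scale = FALSE)
plot(svm_radial, data)   #the two class regions are no longer separated by a straight line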