Radial Kernel Support Vector Machine

This article is all about separating data that cannot be split by a simple linear separator, using a non-linear decision boundary.

Linear separators and boundaries often fail because of non-linear interactions in the data and non-linear dependence between the features in feature space.

The trick here is feature expansion.

We solve this problem by applying a non-linear transformation to the features (\(X_i\)), mapping them into a higher dimensional space called a feature space. With this transformation we are able to separate non-linear data using a non-linear decision boundary.

Non-linearities can be added simply by using higher order terms such as squared and cubic polynomial terms.

\[\beta_0 + \beta_1X_{1} + \beta_2X_{1}^2 + \beta_3X_2 + \beta_4X_2^2 + \beta_5X_2^3 + \dots = 0 \] is the equation of the non-linear decision boundary (a hyperplane in the enlarged feature space) that we get if we fit higher degree polynomial terms to the data.

What we are actually doing is fitting an SVM in an enlarged space: we enlarge the space of features by applying non-linear transformations to them.
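To make the feature expansion idea concrete, here is a minimal sketch on made-up data (the toy data frame, its circular class boundary and the squared-term columns are purely illustrative and not part of the example analysed below): we add squared terms by hand and fit a linear SVM in the enlarged space.

library(e1071) #provides svm()
set.seed(1)
toy <- data.frame(X1 = rnorm(100), X2 = rnorm(100))
toy$y <- factor(ifelse(toy$X1^2 + toy$X2^2 > 1, 1, 0)) #a circular, non-linear class boundary
toy_expanded <- transform(toy, X1sq = X1^2, X2sq = X2^2) #explicit feature expansion
linsvm <- svm(y ~ ., data = toy_expanded, kernel = "linear", cost = 5)
mean(linsvm$fitted != toy_expanded$y) #training misclassification rate in the enlarged space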

But the problem with polynomials is that in higher dimensions, i.e. when there are many predictors, they become unwieldy and tend to overfit at higher degrees.

Hence there is a more elegant way of adding non-linearities to an SVM: the kernel trick.


Kernel Function

One commonly used kernel function is the polynomial kernel \[K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij}\, x_{i'j}\right)^d\] where \(d\) is the degree of the polynomial.
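For reference, e1071's svm() can fit this polynomial kernel directly through its kernel, degree, gamma and coef0 arguments (a sketch; mydata is a placeholder data frame with a factor response y):

#Sketch: polynomial kernel of degree 3; gamma = 1 and coef0 = 1 reproduce the (1 + sum x*x')^d form above
polysvm <- svm(y ~ ., data = mydata, kernel = "polynomial", degree = 3, gamma = 1, coef0 = 1, cost = 5)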

Now the type of Kernel function we are going to use here is a Radial kernel.

The radial kernel is of the form: \[K(x_i, x_{i'}) = \exp\left(-\,\gamma \sum_{j=1}^{p}(x_{ij} - x_{i'j})^2\right)\] Here \(\gamma\) is a hyperparameter (tuning parameter) that controls the smoothness of the decision boundary and hence the variance of the model.
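As a quick check of the formula, the kernel value between two observations can be computed by hand (a sketch with made-up vectors and a helper function written directly from the formula):

#Sketch: radial kernel between two observations
radial_kernel <- function(x, x_prime, gamma) exp(-gamma * sum((x - x_prime)^2))
radial_kernel(c(1, 2), c(2, 0), gamma = 0.5) #exp(-0.5 * 5), roughly 0.082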

If \(\gamma\) is very large, we get quite fluctuating and wiggly decision boundaries, which means high variance and overfitting.

If \(\gamma\) is small, the decision boundary is smoother and has lower variance.
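In e1071, \(\gamma\) is set through the gamma argument of svm(); a sketch on a placeholder data frame mydata shows how the two regimes would be fitted:

#Sketch: small gamma -> smoother boundary, large gamma -> wigglier, higher-variance boundary
smooth_fit <- svm(y ~ ., data = mydata, kernel = "radial", gamma = 0.1, cost = 5)
wiggly_fit <- svm(y ~ ., data = mydata, kernel = "radial", gamma = 10, cost = 5)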


Implementation in R

require(e1071)
require(ElemStatLearn)#package containing the dataset

#Loading the data
attach(mixture.example) #simulated mixture data with 200 observations and 2 classes
names(mixture.example)
## [1] "x"        "y"        "xnew"     "prob"     "marginal" "px1"     
## [7] "px2"      "means"

The data is 2-dimensional, so let's plot it.

plot(x,col=y+3)

#converting data to a data frame
data<-data.frame(y=factor(y),x)
head(data)
##   y          X1         X2
## 1 0  2.52609297  0.3210504
## 2 0  0.36695447  0.0314621
## 3 0  0.76821908  0.7174862
## 4 0  0.69343568  0.7771940
## 5 0 -0.01983662  0.8672537
## 6 0  2.19654493 -1.0230141

Now let’s fit a radial kernel SVM using the svm() function.

Radialsvm<-svm(factor(y) ~ .,data=data,kernel="radial",cost=5,scale=F)
Radialsvm
## 
## Call:
## svm(formula = factor(y) ~ ., data = data, kernel = "radial", 
##     cost = 5, scale = F)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  5 
##       gamma:  0.5 
## 
## Number of Support Vectors:  103
#the number of support vectors is 103

#Confusion matrix to check the accuracy
table(predicted=Radialsvm$fitted,actual=data$y)
##          actual
## predicted  0  1
##         0 76 10
##         1 24 90
#misclassification Rate
mean(Radialsvm$fitted!=data$y)*100 #17% wrong predictions
## [1] 17
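Equivalently, the training accuracy can be read off the confusion matrix:

#Training accuracy = 1 - misclassification rate
conf <- table(predicted=Radialsvm$fitted,actual=data$y)
sum(diag(conf))/sum(conf) #(76 + 90) / 200 = 0.83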

Now let’s create a grid and make predictions at the grid values.

xgrid=expand.grid(X1=px1,X2=px2) #generating grid points

ygrid=predict(Radialsvm,newdata = xgrid) #ygrid contains the predicted response values

#lets plot the non linear decision boundary
plot(xgrid,col=as.numeric(ygrid),pch=20,cex=0.3)
points(x,col=y+1,pch=19) #we can see that the decision boundary is non linear

We can also improve the plot by drawing the actual decision boundary using the contour() function.

func = predict(Radialsvm,xgrid,decision.values = TRUE)
func=attributes(func)$decision #extract the decision values attribute
plot(xgrid,col=as.numeric(ygrid),pch=20,cex=0.3)
points(x,col=y+1,pch=19)
contour(px1,px2,matrix(func,69,99),level=0,add=TRUE,lwd=3) #adds the non linear decision boundary
contour(px1,px2,matrix(prob,69,99),level=0.5,add=T,col="blue",lwd=3)#this is the true decision boundary i.e Bayes decision boundary
legend("topright",c("True Decision Boundary","Fitted Decision Boundary"),lwd=3,col=c("blue","black"))

The above plot compares the true Bayes decision boundary with the fitted decision boundary that the radial kernel SVM learned from the data. The two look quite similar, and it seems that the SVM has produced a good approximation of the true function.


Conclusion

Radial kernel support vector machines are a good approach when the data is not linearly separable. The idea behind generating non-linear decision boundaries is to apply non-linear transformations to the features \(X_i\), mapping them into a higher dimensional space, and the kernel trick lets us do this implicitly. The SVM then has two hyperparameters: the regularization (cost) parameter and \(\gamma\). We can use cross validation to find the values of these tuning parameters that give the best performance of our classifier \(C(X)\). Another way of choosing these hyperparameters is to use optimization techniques such as Bayesian optimization.
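As a sketch of the cross validation approach (the grid of candidate values below is illustrative, not a recommendation), e1071's tune() runs 10-fold cross validation over a grid of cost and gamma values:

set.seed(1)
tuned <- tune(svm, factor(y) ~ ., data = data, kernel = "radial",
              ranges = list(cost = c(0.1, 1, 5, 10), gamma = c(0.1, 0.5, 1, 2)))
summary(tuned) #cross validation error for every (cost, gamma) combination
tuned$best.parameters #the combination with the lowest CV error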