Support Vector Machines

The Idea

The aim of this method is to identify a hyperplane which separates the samples into their classes. It combines ideas from nearest-neighbour and regression models, and is very powerful. The technique can be used for both classification and numerical prediction, though it is traditionally applied to binary classification. Benefits include high scalability; drawbacks, as with neural nets, include a model which is very difficult to interpret.

Some details

But not too much - the mathematics gets complicated quickly. A hyperplane is a flat surface in many dimensions, but it's easiest to start with two linearly separable groups of examples in two dimensions and generalise from there. In that situation, the hyperplane is simply a straight line; in three dimensions it would be a plane. Being mathematicians, we don't stop at three dimensions, so the same idea applies in n-dimensional space, where the separating surface has n - 1 dimensions.

Further, when the samples are linearly separable, there are usually many separating hyperplanes available. Thinking back to the 2-D example, one can easily visualise more than one line separating the samples into two homogeneous groups. Typically we want the maximum margin hyperplane (MMH): the line which leaves the greatest possible distance between itself and the nearest points of each class. This is the choice which should generalise best to future data.

(it's an interesting beginner's R challenge to try to draw a plot reflecting this)

# two toy classes (z = 0 and z = 1) that a straight line can separate
basic_frame = data.frame(
  x=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  y=c(5, 6, 4, 3, 2, 9, 8, 7, 1, 10),
  z=c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1))
library(ggplot2)
# map the class to the point shape, and overlay three candidate separating lines
ggplot(data=basic_frame, aes(x=x, y=y)) + geom_point(aes(shape=factor(z))) +
  geom_line(data=data.frame(x=c(0, 9), y=c(9, 0)), aes(x=x, y=y), color='red') +
  geom_line(data=data.frame(x=c(2, 8), y=c(9, 0)), aes(x=x, y=y), color='blue') +
  geom_line(data=data.frame(x=c(3, 7), y=c(9, 0)), aes(x=x, y=y), color='green') +
  ggtitle("Linearly separable data and a few possible hyperplanes")

It can't be said which of these is best by mere visual inspection, but that's what computers and optimisers are for. We define the support vectors as those points from each class that lie closest to the MMH - each class must have at least one. The support vectors alone completely define the MMH, i.e. the model's representation is extremely compact, regardless of the number of training samples. This also means that the SVM technique scales very well indeed.
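
As a minimal sketch (jumping ahead to the kernlab package used for the main example below, and reusing the toy basic_frame from the plot above), we can fit a linear SVM to the toy data and ask which of the ten points ended up as support vectors:

library(kernlab)
# fit a linear (vanilladot) classifier on the toy data above
svm_toy = ksvm(factor(z) ~ x + y, data=basic_frame, kernel="vanilladot", C=1)
SVindex(svm_toy)  # row numbers of the support vectors - only these define the boundary
nSV(svm_toy)      # how many support vectors were needed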

Some more details

So the next generalisation comes from answering the next question: what if the data aren't linearly separable? Well, we can add a slack variable for each training point, which measures how far that point is allowed to fall on the wrong side of its margin. These variables are typically denoted \(\xi_i\) (Greek 'xi'). A cost value (denoted \(C\)) is then applied to every point that violates the constraints, and rather than simply maximising the margin, the algorithm trades margin width against the total penalty \(C \sum_i \xi_i\). So with a high \(C\), the algorithm tries harder to achieve 100% separation, which shrinks the overall margin. The balance between \(C\) and the margin must be struck carefully, else beware overfitting.
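
For reference, the standard soft-margin objective (with the class labels coded as \(y_i \in \{-1, +1\}\)) is:

\[
\min_{w,\,b,\,\xi} \; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
\quad \text{subject to} \quad y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.
\]

The first term rewards a wide margin (a wide margin corresponds to a small \(\lVert w \rVert\)), while the second is the total cost of the margin violations; \(C\) decides how the two are traded off.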

The above is one way to handle non-linearities. Another is called the kernel trick. This is a clever way of letting a non-linear relationship appear linear, by implicitly working in a higher-dimensional feature space - though how well it works depends on the data itself. If, say, two existing features in our data can be combined into a useful third feature, then adding that feature might present the data in a way that becomes linearly separable (this idea can of course also be used with other techniques). For example, think about weather stations on a mountain, where the dataset only records latitude and longitude. Since each weather station sits at a unique location, the latitude-longitude pair implies an altitude - adding altitude as a feature may well make the data linearly separable (a small sketch after the kernel list below illustrates this derived-feature idea). Many different kernels are possible - amongst them we have:

  • the linear kernel which does not transform the data at all
  • the polynomial kernel of degree \(d\) which adds a simple non-linear transform of the data
  • the sigmoid kernel which results in an SVM analogous to an ANN using a sigmoid activation function
  • the Gaussian RBF kernel which performs well on many kinds of data and is a good default
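
Here is a minimal, purely illustrative sketch of the derived-feature idea mentioned above (the data are invented for the example): two classes arranged in concentric rings cannot be split by a straight line in \((x, y)\), but adding the radius as a third feature makes a simple threshold work - which is roughly what a non-linear kernel does for us implicitly.

set.seed(42)
theta = runif(40, 0, 2 * pi)
rings = data.frame(
  x = c(1 * cos(theta[1:20]), 3 * cos(theta[21:40])),
  y = c(1 * sin(theta[1:20]), 3 * sin(theta[21:40])),
  class = rep(c(0, 1), each=20))
# no straight line in (x, y) separates the two rings, but the derived
# feature r = sqrt(x^2 + y^2) separates them perfectly with a threshold at 2
rings$r = sqrt(rings$x^2 + rings$y^2)
table(rings$class, rings$r > 2)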

There are no reliable rules for which kernel to use with any given learning task - the fit depends heavily on the concept to be learned, the amount of data, and the relationships between features. Trial and error is our friend in these uncharted waters.

Additions such as these only make the model's already opaque representation even harder to interpret.

An example: Strength of concrete (revisited)

We saw this in the neural net tutorial. Let’s see if SVMs do better here.

concrete = read.csv("D://dev//R//mlwr//chap7-nn-svm//concrete.csv")
str(concrete)
## 'data.frame':    1030 obs. of  9 variables:
##  $ cement      : num  141 169 250 266 155 ...
##  $ slag        : num  212 42.2 0 114 183.4 ...
##  $ ash         : num  0 124.3 95.7 0 0 ...
##  $ water       : num  204 158 187 228 193 ...
##  $ superplastic: num  0 10.8 5.5 0 9.1 0 0 6.4 0 9 ...
##  $ coarseagg   : num  972 1081 957 932 1047 ...
##  $ fineagg     : num  748 796 861 670 697 ...
##  $ age         : int  28 14 28 28 28 90 7 56 28 28 ...
##  $ strength    : num  29.9 23.5 29.2 45.9 18.3 ...
summary(concrete)
##      cement           slag            ash             water      
##  Min.   :102.0   Min.   :  0.0   Min.   :  0.00   Min.   :121.8  
##  1st Qu.:192.4   1st Qu.:  0.0   1st Qu.:  0.00   1st Qu.:164.9  
##  Median :272.9   Median : 22.0   Median :  0.00   Median :185.0  
##  Mean   :281.2   Mean   : 73.9   Mean   : 54.19   Mean   :181.6  
##  3rd Qu.:350.0   3rd Qu.:142.9   3rd Qu.:118.30   3rd Qu.:192.0  
##  Max.   :540.0   Max.   :359.4   Max.   :200.10   Max.   :247.0  
##   superplastic      coarseagg         fineagg           age        
##  Min.   : 0.000   Min.   : 801.0   Min.   :594.0   Min.   :  1.00  
##  1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:731.0   1st Qu.:  7.00  
##  Median : 6.400   Median : 968.0   Median :779.5   Median : 28.00  
##  Mean   : 6.205   Mean   : 972.9   Mean   :773.6   Mean   : 45.66  
##  3rd Qu.:10.200   3rd Qu.:1029.4   3rd Qu.:824.0   3rd Qu.: 56.00  
##  Max.   :32.200   Max.   :1145.0   Max.   :992.6   Max.   :365.00  
##     strength    
##  Min.   : 2.33  
##  1st Qu.:23.71  
##  Median :34.45  
##  Mean   :35.82  
##  3rd Qu.:46.13  
##  Max.   :82.60

Though the features span very different ranges, most SVM packages normalise / standardise the data for us behind the scenes (kernlab's ksvm does so by default). As with other learners, multiple implementations exist for SVMs. We'll use kernlab (others include e1071, which wraps the award-winning LIBSVM library).

c.train = concrete[1:773, ]    # first ~75% of rows for training
c.test  = concrete[774:1030, ] # remaining ~25% for testing
#install.packages("kernlab")
library(kernlab)
## 
## Attaching package: 'kernlab'
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha
model = ksvm(strength ~ cement + slag + ash + water + superplastic + coarseagg + fineagg + age,
             data=c.train,
             kernel="vanilladot") # <-- this is a linear kernel
##  Setting default kernel parameters
model
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 666 
## 
## Objective Function Value : -292.3885 
## Training error : 0.390531

The above unfortunately tells us little about how this will perform in the real world. Running this on our testing set:

pred.1 = predict(model, c.test)
cor(pred.1, c.test$strength)
##           [,1]
## [1,] 0.7484654

So this gives us a correlation of about 0.75 between predicted and actual strength - not too shabby for (literally) a few lines of code. Let's also compute the mean absolute error (MAE):

MAE = function(actual, predicted) {
  mean(abs(actual - predicted))
}
MAE(pred.1, c.test$strength)
## [1] 8.480691
summary(c.test$strength)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.78   24.07   35.30   36.31   45.08   82.60

So our MAE is about 8.5, on an outcome that ranges from roughly 2 to 83 - an error of around 10% of the range. Not at all bad. Can we improve on it? Does a bear cr@p in the woods?

Let's try the Gaussian RBF kernel:

model.rbf = ksvm(strength ~ cement + slag + ash + water + superplastic + coarseagg + fineagg + age,
                 data=c.train,
                 kernel="rbfdot") # <-- this is the Gaussian RBF kernel
pred.rbf = predict(model.rbf, c.test)
cor(pred.rbf, c.test$strength)
##          [,1]
## [1,] 0.908632
MAE(pred.rbf, c.test$strength)
## [1] 5.070756

So we get a better correlation and reduced MAE - awesome - all with a one-word change. Let’s tabulate these and a few others:

# fit one SVM per kernel type and record test-set MAE and correlation with the true strengths
kernelNames = c("vanilla", "rbf", "laplace", "tanh", "bessel", "anova")
kernelMAEs = c()
kernelCors = c()
for(k in kernelNames) {
  model.k = ksvm(strength ~ cement + slag + ash + water + superplastic + coarseagg + fineagg + age,
                 data=c.train,
                 kernel=k)
  pred.k = predict(model.k, c.test)
  mae = MAE(pred.k, c.test$strength)
  kernelMAEs = c(kernelMAEs, round(mae, digits=3))
  kernelCors = c(kernelCors, round(cor(pred.k, c.test$strength), digits=3))
}
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters
data.frame(Kernel=kernelNames, MAE=kernelMAEs, Cor=kernelCors)
##    Kernel      MAE   Cor
## 1 vanilla    8.481 0.748
## 2     rbf    5.050 0.909
## 3 laplace    5.384 0.898
## 4    tanh 1048.655 0.039
## 5  bessel    7.047 0.821
## 6   anova  792.574 0.087

And there you have it: the RBF and Laplacian kernels are the clear winners here, while the tanh and ANOVA kernels fall apart with their default parameters. Gauss FTW.
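
As a final aside, ksvm also lets us set the cost \(C\) and the RBF kernel's width explicitly, rather than accepting the defaults (the model output above shows cost C = 1, with sigma estimated automatically). A hedged sketch - the values below are arbitrary starting points for experimentation, not recommendations:

model.tuned = ksvm(strength ~ cement + slag + ash + water + superplastic + coarseagg + fineagg + age,
                   data=c.train,
                   kernel="rbfdot",
                   kpar=list(sigma=0.05),  # kernel width - arbitrary example value
                   C=5)                    # cost of margin violations - arbitrary example value
pred.tuned = predict(model.tuned, c.test)
MAE(pred.tuned, c.test$strength)
cor(pred.tuned, c.test$strength)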