Week 11

Maximal margin classifier

Example: Iris flowers

Data: built-in dataset “iris”

https://en.wikipedia.org/wiki/Iris_flower_data_set

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 25           4.8         3.4          1.9         0.2     setosa
## 125          6.7         3.3          5.7         2.1  virginica
## 69           6.2         2.2          4.5         1.5 versicolor
## 17           5.4         3.9          1.3         0.4     setosa
## 119          7.7         2.6          6.9         2.3  virginica
## 61           5.0         2.0          3.5         1.0 versicolor

We will extract species “setosa” and “virginica”:

We will assign \(-1\) to “setosa” and \(1\) to “virginica”; \(x_1\) is “Petal.Length” and \(x_2\) is “Petal.Width”.
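A minimal R sketch of this preprocessing (the names df, x1, x2, y are our own choices):

```r
# Keep only "setosa" and "virginica"; recode the response as -1 / +1
df <- subset(iris, Species %in% c("setosa", "virginica"))
df <- data.frame(x1 = df$Petal.Length,
                 x2 = df$Petal.Width,
                 y  = ifelse(df$Species == "setosa", -1, 1))
head(df)
```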

Quiz (not graded): tinyurl.com/mh4510-w8-q1

Separating hyperplane

Generally, a hyperplane in \(p\)-dimensional space is given by the equation \[ \beta_0+\beta_1x_1+\cdots+\beta_p x_p = 0 \]

For \(p=2\), a hyperplane is just a line.

A separating hyperplane is a hyperplane that separates two classes, i.e., all observations with \(y^{(i)}=1\) are on one side of the hyperplane and all observations with \(y^{(i)}=-1\) are on the other side.

Some separating hyperplane

There are many separating hyperplanes

Given a separating hyperplane, we construct a classifier that assigns a class to a test observation depending on which side of the hyperplane the observation lies.

A separating hyperplane is given by \[ \beta_0+\beta_1x_1+\cdots+\beta_p x_p = 0 \]

Separating means that \[ \beta_0+\beta_1x^{(i)}_1+\cdots+\beta_p x^{(i)}_p > 0 \mbox{ whenever } y^{(i)}=1 \]

and \[ \beta_0+\beta_1x^{(i)}_1+\cdots+\beta_p x^{(i)}_p < 0 \mbox{ whenever } y^{(i)}=-1 \]

Putting it together, we get \[ y^{(i)}\times(\beta_0+\beta_1x^{(i)}_1+\cdots+\beta_p x^{(i)}_p) > 0 \]
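As a small illustration (our own helper function, not part of the lecture code), the resulting classification rule is simply the sign of the linear function:

```r
# Classify a point by the side of the hyperplane it falls on
classify <- function(x, beta0, beta) {
  sign(beta0 + sum(beta * x))  # +1 or -1 (0 only exactly on the hyperplane)
}

# Made-up hyperplane -5 + x1 + x2 = 0 and a test point (3, 1)
classify(c(3, 1), beta0 = -5, beta = c(1, 1))  # 3 + 1 - 5 < 0, so class -1
```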

Quiz (not graded): tinyurl.com/mh4510-w8-q2

Maximal margin classifier

A separating hyperplane comes with margins

We look for the separating hyperplane that lies exactly midway between its margins, with the margin as wide as possible.

A separating hyperplane is given by \[ \beta_0+\beta_1x_1+\cdots+\beta_p x_p = 0 \]

Since multiplying an equation by a constant does not change the hyperplane, we can normalize it so that \[ \beta_1^{2}+\cdots+\beta_p^2=1 \]

Recall that a separating hyperplane satisfies \[ y^{(i)}\times(\beta_0+\beta_1x^{(i)}_1+\cdots+\beta_p x^{(i)}_p) > 0 \]

The maximal margin classifier is then the solution of the optimization problem:

  • Maximize \(M\) subject to \(\beta_1^{2}+\cdots+\beta_p^2=1\)
  • such that \(y^{(i)}\times(\beta_0+\beta_1x^{(i)}_1+\cdots+\beta_p x^{(i)}_p) \ge M, \quad \forall i=1,\cdots,N\)

This can be transformed into a quadratic optimization problem. It is not covered in the textbook; see the linked reference if you are interested.
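In practice we do not solve this problem by hand. One common shortcut (an approximation, not an exact maximal-margin solver) is to fit a linear SVM with a very large cost using the e1071 package, so that margin violations are essentially forbidden. Here df is the setosa/virginica data frame constructed above.

```r
library(e1071)

# A huge cost makes margin violations prohibitively expensive, so the fitted
# linear SVM approximates the maximal margin classifier on separable data.
fit_mm <- svm(factor(y) ~ x1 + x2, data = df,
              kernel = "linear", cost = 1e5, scale = FALSE)

fit_mm$index  # row numbers of the support vectors (the points on the margins)
fit_mm$SV     # their coordinates
```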

Maximal margin:

Points lying exactly on the margins are called support vectors.

Support vector classifier

Non-separable case

We will extract species “versicolor” and “virginica” now.

We will assign \(-1\) to “versicolor” and \(1\) to “virginica”; \(x_1\) is “Petal.Length” and \(x_2\) is “Petal.Width”. Classes are not separable.

Soft margin

Instead of making sure that all observations are on the correct side of the separating hyperplane, i.e., solving the optimization problem

  • Maximize \(M\) subject to \(\beta_1^{2}+\cdots+\beta_p^2=1\)
  • such that \(y^{(i)}\times(\beta_0+\beta_1x^{(i)}_1+\cdots+\beta_p x^{(i)}_p) \ge M, \quad \forall i=1,\cdots,N\)

We will tolerate a few observations on the wrong side of the separating hyperplane. We will do it by solving a different optimization problem:

  • Maximize \(M\) over \(\beta_0,\beta_1,\dots,\beta_p\) and \(\varepsilon_1,\varepsilon_2,\dots,\varepsilon_N\).
  • subject to \(\beta_1^{2}+\cdots+\beta_p^2=1\)
  • such that \(y^{(i)}\times(\beta_0+\beta_1x^{(i)}_1+\cdots+\beta_p x^{(i)}_p) \ge M(1-\varepsilon_i), \quad \forall i=1,\cdots,N\)
  • and \(\varepsilon_i\ge 0\), \(\displaystyle\sum_{i=1}^{N}\varepsilon_i\le C\), where \(C\) is a nonnegative tuning parameter.

Here, \(\varepsilon_1,\varepsilon_2,\dots,\varepsilon_N\) are called slack variables.
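A sketch of fitting such a soft-margin classifier with e1071's svm(). Note that the cost argument penalizes margin violations, so it plays the opposite role of the budget \(C\) above: a large cost corresponds to a small budget.

```r
library(e1071)

# Versicolor vs virginica, coded as -1 / +1
df2 <- subset(iris, Species %in% c("versicolor", "virginica"))
df2 <- data.frame(x1 = df2$Petal.Length,
                  x2 = df2$Petal.Width,
                  y  = factor(ifelse(df2$Species == "versicolor", -1, 1)))

# scale = FALSE keeps the coefficients on the original scale of x1 and x2
fit_soft <- svm(y ~ x1 + x2, data = df2, kernel = "linear",
                cost = 1, scale = FALSE)
summary(fit_soft)  # reports the number of support vectors
```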

Support vectors

A maximal margin classifier makes sure that all training observations are not only on the right side of a separating hyperplane, but also on the right side of the margin.

In contrast, a soft margin classifier allows some observations to be on the wrong side of the margin and even on the wrong side of the hyperplane; those observations are misclassified.

In either case, training observations that lie exactly on the margins or on the wrong side of their margin are called support vectors.

For a maximal margin classifier, there are no training observations between margins, so all support vectors are exactly on margins and there are usually 2 or 3 of them. For a soft margin classifier (also called support vector classifier), there may be a lot of support vectors.

In either case, the classifier depends only on the support vectors. If we move, add, or delete observation points that lie strictly outside the margin (on the correct side), the support vector classifier does not change (this is convenient because we may not need to re-train the model when new data arrives).
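A rough check of this fact in R, using the fit_soft object from the sketch above: dropping all non-support observations and refitting should give essentially the same hyperplane.

```r
# Primal weights and intercept recovered from the dual solution (linear kernel)
w <- t(fit_soft$coefs) %*% fit_soft$SV  # beta_1, beta_2 (up to sign convention)
b <- -fit_soft$rho                      # beta_0
c(b, w)

# Refit using only the support vectors; the hyperplane should barely change
fit_sv <- svm(y ~ x1 + x2, data = df2[fit_soft$index, ],
              kernel = "linear", cost = 1, scale = FALSE)
c(-fit_sv$rho, t(fit_sv$coefs) %*% fit_sv$SV)
```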

Large cost (low budget \(C\)):

## 12 support vectors

Smaller cost (larger value of \(C\)):

## 15 support vectors

Even smaller cost (even larger value of \(C\)):

## 24 support vectors

Very small cost (very large value of \(C\)):

## 50 support vectors
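The counts above come from the lecture's fits. A sketch of how such a sweep could be produced (the exact counts depend on the cost values used, which are our own guesses):

```r
# Number of support vectors for a decreasing sequence of cost values
# (recall: smaller cost in svm() corresponds to a larger budget C)
for (cost in c(100, 1, 0.1, 0.01)) {
  fit <- svm(y ~ x1 + x2, data = df2, kernel = "linear",
             cost = cost, scale = FALSE)
  cat("cost =", cost, "-> number of support vectors:", length(fit$index), "\n")
}
```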

In any case, the decision boundary is given by \[ \beta_0+\beta_1x_1+\cdots+\beta_px_p=0 \]

The linear support vector classifier can be represented as \[ f(x)=\beta_0+\sum_{i=1}^{N}\alpha_i \langle x,x^{(i)}\rangle, \] where \(\langle *,*\rangle\) is the dot (inner) product.

It turns out that \(\alpha_i\neq 0\) only for the support vectors (the observations on the margins or violating them). For the rest of the observations \(\alpha_i=0\), and thus the classifier is \[ f(x)=\beta_0+\sum_{i\in \mathcal{S}}\alpha_i \langle x,x^{(i)}\rangle, \] where \(\mathcal{S}\) is the set of support vectors.
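With e1071 this representation can be read off the fitted object: coefs stores the (signed) coefficients \(\alpha_i\) of the support vectors and rho is \(-\beta_0\). A sketch using the linear fit from above (the overall sign depends on e1071's internal ordering of the factor levels):

```r
# Evaluate f(x) = beta_0 + sum over support vectors of alpha_i * <x, x^(i)>
x_new <- c(5.0, 1.7)  # a new (x1, x2) observation
f_x <- sum(fit_soft$coefs * (fit_soft$SV %*% x_new)) - fit_soft$rho
f_x                   # the sign of f_x determines the predicted class
```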

Support vector machine

Non-linear decision boundary

We will now look again at the three species

Let \(1\) represent “versicolor” and \(-1\) either “setosa” or “virginica”

There is no linear boundary, but it is possible to separate the classes reasonably well by a plane in 3D space if we introduce the 3rd coordinate \(z\) as \[ z=\sqrt{(x_1-4)^2 + (x_2-1.25)^2} \]

Let us see how support vector classifiers perform.

Raw data:

##           Reference
## Prediction  -1   1
##         -1 100  50
##         1    0   0
##  Accuracy 
## 0.6666667

Data with the extra variable \(z\):

##           Reference
## Prediction -1  1
##         -1 98  3
##         1   2 47
##  Accuracy 
## 0.9666667
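A sketch of this comparison (the lecture output appears to use caret::confusionMatrix; plain tables are used here, and the exact counts may differ slightly depending on the cost value, which is our own choice):

```r
# Versicolor (+1) against the other two species (-1)
d <- data.frame(x1 = iris$Petal.Length,
                x2 = iris$Petal.Width,
                y  = factor(ifelse(iris$Species == "versicolor", 1, -1)))

# Linear support vector classifier on the raw features
fit_raw <- svm(y ~ x1 + x2, data = d, kernel = "linear", cost = 1)
table(Prediction = predict(fit_raw, d), Reference = d$y)

# Add the hand-crafted feature z and refit
d$z <- sqrt((d$x1 - 4)^2 + (d$x2 - 1.25)^2)
fit_z <- svm(y ~ x1 + x2 + z, data = d, kernel = "linear", cost = 1)
tab <- table(Prediction = predict(fit_z, d), Reference = d$y)
tab
sum(diag(tab)) / sum(tab)  # training-set accuracy
```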

Enlarging the feature space

In this example, we essentially introduced the function \(F:\mathbb{R}^2\to\mathbb{R}^3\) and then trained a new classifier in \(\mathbb{R}^3\). The function \(F\) is given by \[ F(x_1,x_2)=\left(x_1,x_2,\sqrt{(x_1-4)^2 + (x_2-1.25)^2}\right) \]

A support vector classifier in \(x_1, x_2\) is the solution of the following optimization problem:

  • Maximize \(M\) over \(\beta_0,\beta_1,\beta_2\) and \(\varepsilon_1,\varepsilon_2,\dots,\varepsilon_N\).

  • subject to \(\beta_1^{2}+\beta_2^2=1\)

  • such that \(y^{(i)}\times(\beta_0+\beta_1x^{(i)}_1+\beta_2 x^{(i)}_2) \ge M(1-\varepsilon_i), \quad \forall i=1,\cdots,N\)

Enlarging the feature space, we introduce the new feature \[ z=\sqrt{(x_1-4)^2 + (x_2-1.25)^2} \] and get the new optimization problem

  • Maximize \(M\) over \(\beta_0,\beta_1,\beta_2,\beta_3\) and \(\varepsilon_1,\varepsilon_2,\dots,\varepsilon_N\).
  • subject to \(\beta_1^{2}+\beta_2^2+\beta_3^2=1\)
  • \(y^{(i)}\times\left(\beta_0+\beta_1x^{(i)}_1+\beta_2 x^{(i)}_2 +\beta_3\sqrt{(x_1^{(i)}-4)^2 + (x_2^{(i)}-1.25)^2} \right) \ge M(1-\varepsilon_i), \quad \forall i=1,\cdots,N\)

Kernel trick

In fact, to construct a support vector classifier, one does not need the observations themselves. One only needs the pairwise distances \[ \|x^{(i)}-x^{(j)}\| \]

or inner (dot) products \[ x^{(i)} \cdot x^{(j)} = \langle x^{(i)}, x^{(j)} \rangle = \sum_{k=1}^p x^{(i)}_kx^{(j)}_k \]

Let \(F:\mathbb{R}^p\to\mathbb{R}^L\) be the function enlarging our feature space. Then, to train the support vector classifier in the enlarged space, we need inner products \[ K( x^{(i)}, x^{(j)}) = \langle F(x^{(i)}), F(x^{(j)}) \rangle \]

This function \(K\) is a kernel. The trick is that in order to compute these inner products, we do not actually need to know \(F\). We can work with the kernel itself.

A linear classifier is given by
\[ f(x)=\beta_0+\sum_{i\in \mathcal{S}}\alpha_i \langle x,x^{(i)}\rangle, \]

A support vector machine is the classifier defined by \[ f(x)=\beta_0+\sum_{i\in \mathcal{S}}\alpha_i K(x, x^{(i)}) \]

Essentially, a support vector machine is a linear soft margin classifier in a higher-dimensional space \(\mathbb{R}^L\) to which our feature space \(\mathbb{R}^p\) is mapped. However, we never explicitly construct the map \(F:\mathbb{R}^p\to\mathbb{R}^{L}\).

Examples

The map from \(\mathbb{R}^2\) to \(\mathbb{R}^3\) that helped to separate iris classes: \[ F(x_1,x_2)=\left(x_1,x_2,\sqrt{(x_1-4)^2 + (x_2-1.25)^2}\right) \]

The kernel is \[ K(x, y)=x_1y_1+x_2y_2+\sqrt{(x_1-4)^2 + (x_2-1.25)^2}\cdot \sqrt{(y_1-4)^2 + (y_2-1.25)^2} \]

And the nonlinear classifier is given by \[ f(x)=\beta_0+\sum_{i\in \mathcal{S}}\alpha_i K(x, x^{(i)}), \] with \(K\) as above.

Example: \(p=2\), the kernel is \[ K\left( \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}, \begin{bmatrix} x_1^{(j)} \\ x_2^{(j)} \end{bmatrix} \right)= \left(x_1^{(i)}\right)^2\left(x_1^{(j)}\right)^2+ \left(x_2^{(i)}\right)^2\left(x_2^{(j)}\right)^2+ 2x_1^{(i)}x_2^{(i)}x_1^{(j)} x_2^{(j)} \]

Consider the function \(F:\mathbb{R}^2\to\mathbb{R}^3\) given by \[ z_1 = x_1^2,\quad z_2 = x_2^2,\quad z_3 = \sqrt{2}x_1x_2 \]

Then the kernel is \[ K(x^{(i)}, x^{(j)}) = \langle F(x^{(i)}), F(x^{(j)}) \rangle \]
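A quick numerical check of this identity (our own illustration with arbitrary points):

```r
# F maps (x1, x2) to (x1^2, x2^2, sqrt(2) * x1 * x2)
F_map <- function(x) c(x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])

x <- c(1.5, -0.7)
y <- c(2.0,  0.3)

sum(F_map(x) * F_map(y))                                           # <F(x), F(y)>
x[1]^2 * y[1]^2 + x[2]^2 * y[2]^2 + 2 * x[1] * x[2] * y[1] * y[2]  # kernel formula
# both give the same value
```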

Example: \(p=2\), we want a decision boundary of the form \[ \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1^2+\beta_4 x_2^2 + \beta_5 x_1x_2=0 \]

This is the same as enlarging the feature space using the function \(F:\mathbb{R}^2\to\mathbb{R}^6\) defined by \[ F(x_1,x_2)=(1,\sqrt{2}x_1,\sqrt{2}x_2,x_1^2,x_2^2,\sqrt{2}x_1x_2) \]

This is the same as considering the kernel \[ K(x^{(i)}, x^{(j)})=\left(1+\langle x^{(i)}, x^{(j)} \rangle\right)^2 =\left(1+x_1^{(i)}x_1^{(j)} + x_2^{(i)}x_2^{(j)} \right)^2 \]

Computation that uses the kernel is more efficient than explicitly constructing the enlarged feature space.

Example: \[ K(x^{(i)}, x^{(j)})=e^{-\frac{\| x^{(i)} - x^{(j)} \|^2}{2}} \]

For this kernel, the feature space is enlarged to an infinite-dimensional space: the map \(F\) cannot be written down explicitly, but the kernel itself is easy to compute.

Some commonly used nonlinear kernels

These are the kernels implemented in the “svm” function of the library “e1071”. A comprehensive review of kernels can be found here:

http://crsouza.com/2010/03/17/kernel-functions-for-machine-learning-applications/

Polynomial kernel \[ K(x,x')=(c_0+\gamma\langle x, x' \rangle)^d \]

Radial basis kernel \[ K(x,x')=e^{-\gamma \|x-x'\|^2} \]

Sigmoid kernel \[ K(x,x')=\tanh (c_0+ \gamma \langle x, x'\rangle ) \]

Parameters of these kernels can be tuned by cross-validation.

Linear kernel

Polynomial kernel of degree 2 with \(\gamma=2\):

Polynomial kernel of degree 2 with \(\gamma=0.1\)

Polynomial kernel of degree 3 with \(\gamma=1\):

Radial basis kernel, \(\gamma = 1\):

Sigmoid, \(\gamma = 0.5\) and \(c_0=0\):

Choosing a kernel is somewhat like choosing an artificial neural network architecture. Sometimes the choice is guided by the problem type (for example, convolutional neural networks work well for image recognition and recurrent networks for speech recognition); often you simply experiment and see what works best.

Parameter values are chosen by cross-validation: remove part of the data, train the SVM on the remaining observations, and measure the error on the observations that were held out. Repeat \(k\) times.

This is the same tuning approach we used for LASSO regression, ridge regression, and decision trees.
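A sketch with e1071's built-in tuning helper (the parameter grid is illustrative; by default tune() performs 10-fold cross-validation):

```r
library(e1071)

set.seed(1)
tuned <- tune(svm, Species ~ ., data = iris,
              kernel = "radial",
              ranges = list(gamma = c(0.1, 0.5, 1, 2),
                            cost  = c(0.1, 1, 10, 100)))
tuned$best.parameters  # gamma and cost with the lowest cross-validation error
tuned$best.model       # SVM refitted on the full data with those parameters
```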

Quiz (not graded): tinyurl.com/mh4510-w8-q4

Extensions of SVM

Multiclass classification: if there are \(K>2\) classes, two approaches are possible:

One-vs-one approach: fit \(\binom{K}{2}\) support vector machines, one for each pair of classes. Feed a new observation to all of them and pick the class that is assigned most frequently.

One-vs-all approach: fit \(K\) support vector machines, each classifying one class against all the others. Feed a new observation to all of them and pick the class assigned with the highest level of confidence.
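In e1071, multiclass classification needs no extra work: when the response factor has more than two levels, svm() applies the one-vs-one scheme internally. A minimal sketch on all three iris species:

```r
library(e1071)

fit_multi <- svm(Species ~ Petal.Length + Petal.Width, data = iris,
                 kernel = "radial")
table(Prediction = predict(fit_multi, iris), Reference = iris$Species)
```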

There are also support vector regression and support vector clustering.

Checklist

  • Optimization problem for a maximal margin classifier

  • Optimization problem for a soft margin classifier

  • Using kernel trick to obtain a non-linear decision boundary

  • Parameters of SVM that can be tuned via cross-validation