Data 622 Discussion Week 10 - Is Math Necessary for SVM?

Alexander Ng

04/08/2021

Support Vector Machines

While this course, DATA 622, is mostly practical, we should remember that the algorithms and code work because a deep theory was developed in the 1990s. In some cases, the theory is what makes the algorithms practical. Support vector machines rest on a deep and beautiful mathematical theory, and this discussion post will touch on it.

But is math necessary to use support vector machines? After spending a few days down the mathematical rabbit hole, I can share a few thoughts on the journey. I'll outline three mathematical aspects and make some remarks on practical issues with SVM. Lastly, I will describe alternative resources for learning the mathematical side of SVM.

Three Mathematical Pillars

We assume the data set has observations \((x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\) in which \(x_i \in R^p\) has \(p\) features and \(y_i \in \{ 1, -1 \}\) is a binary category.

Geometric and Functional Margin

The first problem in SVM is how to choose a separating hyperplane that partitions the observations by binary category in a principled way. SVM defines a clever criterion for choosing the hyperplane that classifies the \(x_i\)s in \(R^p\): the maximal margin in the separable case and the soft margin in the non-separable case. If we specify the separating hyperplane as a function \(f(x)\) of the form:

\[ f(x) = \mathbf{\lambda^T} \mathbf{x} + \lambda_0 \] and require the functional margin to be at least 1, then the hyperplane should be chosen to solve the constrained optimization problem below.

\[ \max_{\lambda, \lambda_0} \frac{1}{\lVert \lambda \rVert_{2}} \text{ where } y_i(\lambda^T \mathbf{x_i} + \lambda_0) \geq 1 , i=1, \cdots , n\] This optimization maximizes the geometric margin (i.e. the distance from the hyperplane to the nearest training point) while forcing the functional margin to be at least 1; equivalently, it minimizes the norm of \(\lambda\) subject to the same constraint.
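To make the two margins concrete, here is a minimal NumPy sketch (a toy example of my own, not taken from any lecture) that computes the functional margin \(y_i(\lambda^T x_i + \lambda_0)\) and the geometric margin for an illustrative, non-optimal hyperplane:

```python
import numpy as np

# Toy 2-D data: two points per class, chosen only for illustration.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

# A candidate hyperplane f(x) = lam^T x + lam0 (illustrative values, not the SVM optimum).
lam = np.array([1.0, 1.0])
lam0 = 0.0

# Functional margin of each point: y_i * (lam^T x_i + lam0).
functional = y * (X @ lam + lam0)

# Geometric margin: functional margin divided by ||lam||_2,
# i.e. the signed Euclidean distance from each point to the hyperplane.
geometric = functional / np.linalg.norm(lam)

print("functional margins:", functional)              # [4. 6. 2. 4.]
print("geometric margins :", geometric)               # approx [2.83 4.24 1.41 2.83]
print("margin of this hyperplane:", geometric.min())  # distance to the nearest point
```

The SVM chooses \(\lambda, \lambda_0\) so that the last quantity, the distance to the nearest point, is as large as possible.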

But the maximization is equivalent to a minimization once we invert (and square) the objective function:

\[ \min_{\lambda, \lambda_0} \frac{1}{2} \lVert \lambda \rVert^{2}_{2} \text{ where } 1 - y_i(\lambda^T\mathbf{x_i}+\lambda_0) \leq 0 , i = 1, \cdots , n\]

In the non-separable case, the latter minimization form can be extended with a penalty term for points that violate the margin.

\[\min_{\lambda, \lambda_0} \frac{1}{2} \lVert \lambda \rVert^{2}_{2} + C \sum_{i=1}^{n} \xi_i \text{ where } y_i(\lambda^T\mathbf{x_i}+\lambda_0) \geq 1 - \xi_i , \text{ and } i = 1, \cdots , n \text{ and } \xi_i \geq 0 \] SVM uses the hinge-loss function \(\xi_i = \max( 0, 1 - y_if(x_i))\) to define the non-separable SVM.
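A short sketch of the soft-margin objective, again with made-up data and an illustrative (non-optimal) hyperplane, shows how the hinge loss plays the role of the slack \(\xi_i\):

```python
import numpy as np

def soft_margin_objective(lam, lam0, X, y, C):
    """Soft-margin objective 0.5*||lam||^2 + C * sum(xi), where the slack
    xi_i = max(0, 1 - y_i * f(x_i)) is exactly the hinge loss."""
    f = X @ lam + lam0
    xi = np.maximum(0.0, 1.0 - y * f)   # hinge loss per observation
    return 0.5 * lam @ lam + C * xi.sum()

# Toy data: some points of each class sit inside the margin of the chosen hyperplane.
X = np.array([[2.0, 0.0], [0.5, 0.5], [-2.0, 0.0], [-0.25, 0.0]])
y = np.array([1, 1, -1, -1])

lam, lam0 = np.array([1.0, 0.0]), 0.0   # illustrative values, not the optimizer's solution
print(soft_margin_objective(lam, lam0, X, y, C=1.0))   # 0.5 + 1.0 * (0.5 + 0.75) = 1.75
```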

Convex Optimization

Convex optimization refers to solving optimization problems whose objective function and constraints satisfy certain requirements. When those requirements are met, the theory guarantees the existence of a global optimal solution along with important properties of that solution. The key properties are given by the so-called KKT (Karush-Kuhn-Tucker) conditions. The KKT theorem is, as one author observes, the ion cannon of convex optimization. It gives first-order conditions for when a point is the global optimum, expressed as four conditions the solution must satisfy. One of those conditions involves the Lagrangian of the optimization problem.

If we denote the Lagrangian of the non-separable case objective function as:

\[L(\lambda, \lambda_0, \xi, \alpha, r) = \frac{1}{2}\lVert\lambda\rVert_{2}^{2} + C \sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n} \alpha_i[ y_i(\lambda^T x_i +\lambda_0)-1 + \xi_i] - \sum_{i=1}^{n} r_i \xi_i \] where \(\alpha_i \geq 0\) and \(r_i \geq 0\) are the Lagrange multipliers for the margin constraints and the non-negativity constraints on \(\xi_i\), respectively, then the KKT conditions give an equivalent dual optimization problem to be solved. This problem is:

\[ \max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,k=1}^{n} \alpha_i \alpha_k y_i y_k x_i^{T} x_k \text{ where } 0 \leq \alpha_i \leq C , \text{ } i=1, \cdots , n \text{ and } \sum_{i=1}^{n}\alpha_i y_i = 0 \]
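As a sanity check on the dual, the sketch below fits sklearn's SVC with a linear kernel on a small synthetic problem of my own and reads back the dual variables; sklearn stores \(y_i \alpha_i\) for the support vectors in dual_coef_, so we can verify \(0 \leq \alpha_i \leq C\) and \(\sum_i \alpha_i y_i = 0\):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Small separable 2-D problem (synthetic, for illustration only).
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.8, random_state=0)
y = np.where(y == 1, 1, -1)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors only;
# all other observations have alpha_i = 0.
alpha_times_y = clf.dual_coef_.ravel()
alpha = np.abs(alpha_times_y)                      # recover alpha_i since |y_i| = 1

print("0 <= alpha_i <= C :", np.all((alpha >= 0) & (alpha <= C + 1e-8)))
print("sum_i alpha_i y_i :", alpha_times_y.sum())  # should be ~0
print("support vector indices:", clf.support_)     # only these have alpha_i > 0
```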

Faster Solution with Sequential Minimal Optimization

Sequential minimal optimization (SMO) was developed in 1998 by John Platt to train SVMs more quickly. It solves the convex dual problem faster than a traditional general-purpose quadratic programming solver. SMO is an iterative algorithm that boils down to solving a series of small subproblems. Each subproblem selects a pair of Lagrange multipliers, say \(\alpha_1, \alpha_2\); because the equality constraint ties the pair together, the subproblem reduces to a 1-dimensional quadratic optimization that can be solved analytically. Each update moves the approximation closer to the global optimum, and the iterates converge to the optimal solution.
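The heart of SMO is the analytic update of one pair of multipliers. The sketch below is a simplified illustration of that single-pair step (no pair-selection heuristics and no update of the bias term), not Platt's full algorithm:

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, b, C):
    """One SMO-style step: optimize (alpha_i, alpha_j) analytically while all other
    multipliers stay fixed. The equality constraint sum_k alpha_k y_k = 0 ties the
    pair together, so the subproblem is a 1-D quadratic in alpha_j."""
    # Prediction errors E_k = f(x_k) - y_k under the current multipliers.
    f = (alpha * y) @ K + b
    E_i, E_j = f[i] - y[i], f[j] - y[j]

    # Feasible interval [L, H] for alpha_j implied by 0 <= alpha <= C and the equality constraint.
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])

    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # curvature of the 1-D quadratic
    if eta <= 0 or L == H:
        return alpha                          # skip degenerate pairs in this sketch

    new = alpha.copy()
    new[j] = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)
    new[i] = alpha[i] + y[i] * y[j] * (alpha[j] - new[j])  # keeps sum_k alpha_k y_k unchanged
    return new
```

Iterating this update over well-chosen pairs (plus the bias update omitted here) is what drives the dual objective to its maximum.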

Some evidence suggests that SMO can be 1000 times faster than other convex optimization algorithms.

Rudin warns in several lectures that some SVM optimizers are very slow, and she cites sklearn as an example. It would be useful to see whether this defect is still present and can be reproduced in software.
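A minimal timing sketch along these lines, with an arbitrary synthetic data set of my own choosing, would compare sklearn's kernelized SVC (built on libsvm's SMO-style solver) with the liblinear-based LinearSVC:

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Synthetic binary problem; size and parameters are arbitrary for a rough timing check.
X, y = make_classification(n_samples=10000, n_features=50, random_state=0)

for name, model in [("SVC (libsvm, SMO-style)", SVC(kernel="linear", C=1.0)),
                    ("LinearSVC (liblinear)  ", LinearSVC(C=1.0, max_iter=10000))]:
    start = time.perf_counter()
    model.fit(X, y)
    print(f"{name} fit time: {time.perf_counter() - start:.2f} s")
```

A large gap between the two would be consistent with Rudin's warning, at least for linear problems where LinearSVC is applicable.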

Kernel Trick and Hilbert Space Theory

The last key idea I would like to call out is the so-called kernel trick: if a data set is not linearly separable, you can map it to a higher-dimensional space in which the points can be separated.
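A tiny worked example of the mapping idea (my own, independent of any SVM solver): 1-D points labeled by whether \(|x| > 1\) cannot be split by a single threshold on \(x\), but the feature map \(\phi(x) = (x, x^2)\) makes them linearly separable in the plane.

```python
import numpy as np

# 1-D data: class +1 when |x| > 1, class -1 otherwise -- no single threshold on x separates them.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.where(np.abs(x) > 1, 1, -1)

# Lift to 2-D with the feature map phi(x) = (x, x^2).
phi = np.column_stack([x, x ** 2])

# In the lifted space the hyperplane with w = (0, 1), b = -1 (i.e. x^2 = 1) separates the classes.
w, b = np.array([0.0, 1.0]), -1.0
print(np.sign(phi @ w + b) == y)   # all True: linearly separable after the mapping
```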

Much of the theoretical machinery goes into proving the existence of a higher-dimensional (possibly infinite-dimensional) space into which the data can be mapped and into constructing the feature mapping implicitly through the kernel. Perhaps the most useful result of the kernel trick is to justify the use of the Gaussian kernel for support vector machines. The Gaussian kernel can handle some quite challenging and interwoven data sets. Some examples of Gaussian kernel performance from Rudin's online Lecture 61 are shown below.
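One way to see why such a space exists for the Gaussian kernel is Mercer's condition: the kernel matrix must be symmetric and positive semidefinite. The sketch below checks this numerically on random data of my own choosing (up to floating-point error):

```python
import numpy as np

def gaussian_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix K[i, k] = exp(-gamma * ||x_i - x_k||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-gamma * np.maximum(d2, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
K = gaussian_kernel(X, gamma=0.5)

# A symmetric positive semidefinite kernel matrix is what guarantees an implicit
# (here infinite-dimensional) feature space exists for the kernel trick.
eigvals = np.linalg.eigvalsh(K)
print("symmetric:", np.allclose(K, K.T), "  min eigenvalue:", eigvals.min())  # >= 0 up to rounding
```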

The screenshot below shows a 2-dimensional data set of observations. It contains two interlocking spirals of points from two classes. Clearly, no straight line can separate them.

Gaussian Kernel Example 1 - Raw Data

The next screenshot shows the solution from the Gaussian kernel SVM. A Gaussian kernel can be used to map the data points to an infinite-dimensional Hilbert space. The decision boundary tracks the points of each class quite closely and separates them: the fitted function converges to +1 on one class (red) and to -1 on the other (blue).

Gaussian Spiral Kernel Trick

A second example using data in the shape of a crescent moon is shown below in the same 2-dimensional setup.

Crescent Moon - Raw Data

The Gaussian kernel SVM again delivers excellent separation below:

Crescent Moon Kernel Trick
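The crescent-moon example is easy to approximate with sklearn's make_moons generator (a stand-in for Rudin's data, which I do not have); an RBF, i.e. Gaussian, kernel SVC separates the two crescents cleanly:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving crescents, similar in spirit to the moon-shaped data above.
X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
y = np.where(y == 1, 1, -1)

clf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)
print("training accuracy:", clf.score(X, y))   # close to 1.0 on this easy case

# The decision function is positive on one crescent and negative on the other,
# mirroring the +1 (red) / -1 (blue) regions in the screenshots.
print(np.sign(clf.decision_function(X[:5])), y[:5])
```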

Getting through the Math

My brain hit a brick wall when I read about SVM in the assigned textbook, Elements of Statistical Learning, Section 12.3 on Support Vector Machines and Kernels. ESL suggests that we also read Section 5.8 on reproducing kernel Hilbert spaces (RKHS). While ISLR is more approachable, it avoids much of the math, so that context is absent.

I decided to read Cynthia Rudin's well-written notes on kernels to get the key intuition, context, and ideas. A link to the online lectures is here. The discussion of reproducing kernel Hilbert spaces and kernels in ESL is dry but authoritative. The video lectures of interest are in the playlist for her machine learning ebook, Lectures 40-61; each is about 3-15 minutes long.

Is it necessary to learn the math to use SVM? Maybe not. But it can be an uncomfortable experience to apply SVM to a problem without understanding what is happening behind the API. The math also tells us which algorithms are faster.