If the data can be perfectly separated by a hyperplane, then there will in fact exist an infinite number of such hyperplanes. The maximal margin hyperplane (also known as the optimal separating hyperplane) is the separating hyperplane that is farthest from the training observations in terms of perpendicular distance. The minimal distance from the observations to the hyperplane is known as the margin. Classifying a test observation based on which side of the maximal margin hyperplane it lies on is known as the maximal margin classifier. The maximal margin classifier is often successful but can overfit when \(p\) is large.
The maximal margin classifier classifies a test observation \(x^∗\) based on the sign of
\[f(x^*) = \beta_0 + \beta_1 x^*_1 + ... + \beta_p x^*_p \]
where \(\beta_0, \beta_1, ..., \beta_p\) are the coefficients of the maximal margin hyperplane. The maximal margin hyperplane represents the mid-line of the widest gap between the two classes.
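As a minimal illustration (the coefficients and test point below are made up for the example), classification reduces to checking the sign of \(f(x^*)\):

# hypothetical maximal margin hyperplane in p = 2 dimensions
beta0 <- -1
beta  <- c(2, 3)
x_star <- c(0.5, 0.25)                # a test observation x*
f_star <- beta0 + sum(beta * x_star)  # f(x*) = beta0 + beta1*x1 + beta2*x2
ifelse(f_star > 0, 1, -1)             # classify x* by the sign of f(x*)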
Relatively few training observations affect the maximal margin hyperplane. These observations, which are \(p\)-dimensional vectors, are the ones that would cause the maximal margin hyperplane to move if they were shifted slightly in some dimension. They are known as support vectors, since they support the maximal margin hyperplane and give it its shape. The maximal margin hyperplane is the solution to the optimization problem of choosing \(\beta_0, ..., \beta_p\) to maximize \(M\) subject to
\[ \sum_{j = 1}^{p} \beta_j^2 = 1 \] \[ y_i(\beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip}) \geq M \quad \text{for all } i = 1, ..., n\]
The constraint involving \(M\) (assuming it is positive) guarantees that each observation is on the correct side of the hyperplane. Altogether, the optimization problem chooses \(\beta_0, ..., \beta_p\) to maximize \(M\) while ensuring that each observation is on the right side of the hyperplane and at least a distance \(M\) from it, yielding the maximal margin hyperplane.
If no separating hyperplane exists, no maximal margin hyperplane exists either. However, a soft margin can be used to construct a hyperplane that almost separates the classes. This generalization of the maximal margin classifier is known as the support vector classifier. In general, a perfectly separating hyperplane can be undesirable because it can be very sensitive to individual observations, leading to overfitting. A classifier based on a hyperplane that doesn’t perfectly separate the two classes can offer greater robustness to variations in individual observations and better classification of most training observations at the cost of misclassifying a few training observations.
The support vector classifier's hyperplane solves the optimization problem \(\max_{\beta_0, ..., \beta_p,\, \epsilon_1, ..., \epsilon_n} M\) subject to
\[ \sum_{j = 1}^{p} \beta_j^2 = 1 \] \[ y_i(\beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip}) \geq M(1-\epsilon_i) \quad \text{for all } i = 1, ..., n\] \[ \epsilon_i \geq 0 \] \[ \sum_{i = 1}^{n} \epsilon_i \leq C \]
where \(C\) is a non-negative tuning parameter and the \(\epsilon_i\) are slack variables that allow individual observations to fall on the wrong side of the margin and/or the hyperplane. When \(\epsilon_i = 0\), the \(i^{th}\) observation is on the correct side of the margin. When \(\epsilon_i\) is greater than zero, the \(i^{th}\) observation is on the wrong side of the margin and is said to have violated the margin. When \(\epsilon_i\) is greater than 1, the observation is on the wrong side of the hyperplane.
The tuning parameter \(C\) limits the sum of the slack variables \(\epsilon_1, ..., \epsilon_n\), and so determines the number and severity of the violations to the margin and hyperplane that will be tolerated. When \(C\) is zero, no budget is available for violations, which forces \(\epsilon_1 = 0, ..., \epsilon_n = 0\); in that case the solution, if one exists, is the same as the maximal margin classifier. When \(C\) is greater than zero, no more than \(C\) observations may be on the wrong side of the hyperplane, since each such observation has \(\epsilon_i > 1\) and \(\sum_{i = 1}^{n} \epsilon_i \leq C\). As \(C\) increases, the margin gets wider and more tolerant of violations; as \(C\) decreases, the margin gets narrower. \(C\) is generally selected using cross-validation.
\(C\) also controls the bias-variance trade-off of the statistical learning model. When \(C\) is small, the margin is narrow and rarely violated, which amounts to a classifier that is highly fit to the data, with low bias but high variance. Conversely, when \(C\) is large, the margin is wider and more frequently violated, which amounts to a classifier that fits the data more loosely, with potentially higher bias but potentially lower variance. Only the observations that lie on the margin or violate the margin affect the hyperplane and the classifier. Observations directly on the margin or on the wrong side of the margin for their class are known as support vectors, since these observations determine the shape of the support vector classifier. When \(C\) is large, the margin is wide and more frequently violated, so many support vectors contribute to shaping the hyperplane.
Because the support vector classifier is based on only a subset of the training observations (the support vectors), it is robust to the behavior of observations that are far from the hyperplane.
When class boundaries are non-linear, the feature space can be enlarged using functions of the predictors such as quadratic, cubic, or interaction terms to address the non-linearity. However, computations can become unmanageable in the support vector classifier case. The support vector machine is an extension of the support vector classifier that results from enlarging the feature space using kernels to accommodate a non-linear boundary between classes.
The solution to the support vector classifier problem involves only the inner products of the observations, not the observations themselves. The inner product of two \(p\)-vectors \(a\) and \(b\) is defined as \(\langle a,b \rangle = \sum_{j = 1}^{p} a_j b_j\). The linear support vector classifier can be represented as:
\[ f(x) = \beta_0 + \sum_{i = 1}^{n} \alpha_i \langle x, x_i \rangle \]
where there are \(n\) parameters, \(\alpha_1, ..., \alpha_n\), one per training observation. Estimating the parameters \(\alpha_1, ..., \alpha_n\) and \(\beta_0\) requires the \(n \choose 2\) inner products \(\langle x_i, x_{i'} \rangle\) between all pairs of training observations. It turns out that \(\alpha_i\) is zero for all observations except the support vectors, so only the inner products involving the support vectors need to be calculated. The summation can therefore be taken over \(S\), the collection of indices of the support vectors, which greatly reduces the computation.
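Restricting the sum to the support set gives the equivalent representation
\[ f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle \]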
One can generalize the support vector classifier by replacing the inner product with a generalization of the form \(K(x_i, x_{i'})\), where \(K\) is some function known as a kernel. A kernel is a function that quantifies the similarity of two observations. For example, the kernel \(K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}\) would give back the support vector classifier. This kernel is called a linear kernel because it is linear in the features; it essentially quantifies the similarity of a pair of observations using Pearson (standard) correlation.
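For instance, a small sketch (with arbitrary vectors) confirms that the linear kernel is just the inner product of the two observations:

linear_kernel <- function(x1, x2) sum(x1 * x2)  # K(x_i, x_i') = sum_j x_ij * x_i'j
x_i  <- c(1.0, -2.0, 0.5)
x_ip <- c(0.3,  1.5, -1.0)
linear_kernel(x_i, x_ip)                        # equals the inner product <x_i, x_i'>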
Another popular kernel function is the polynomial kernel of degree \(d\). It is used to fit a support vector classifier in a higher dimensional space involving polynomials of degree \(d\), instead of the original feature space.
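Concretely, the polynomial kernel of degree \(d\) (a positive integer) takes the form
\[ K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\right)^d \]
and using \(d > 1\) leads to a much more flexible decision boundary than the linear kernel.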
When the support vector classifier is combined with a non-linear kernel, the resulting classifier is called a support vector machine. The non-linear function underlying the support vector machine has the form
\[f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i) \] Another popular non-linear kernel is the radial kernel given by
\[ K(x_i, x_{i'}) = \exp\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\right) \]
where \(\gamma\) is a positive constant. The radial kernel has very local behavior; only nearby training observations affect the classification of a test observation. If a given test observation \(x^* = (x^*_1, ..., x^*_p)^T\) is far from a training observation \(x_i\) in terms of Euclidean distance, then \(\sum_{j = 1}^{p} (x^*_j - x_{ij})^2\) will be large, so \(K(x^*, x_i)\) will be very small and \(x_i\) will have virtually no effect on \(f(x^*)\).
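A short sketch (with an arbitrary choice of \(\gamma\)) makes this local behavior concrete:

# radial kernel between two observations; gamma chosen arbitrarily for illustration
radial_kernel <- function(x1, x2, gamma = 1) exp(-gamma * sum((x1 - x2)^2))
radial_kernel(c(0, 0), c(0.1, 0.2))  # nearby observations: kernel close to 1
radial_kernel(c(0, 0), c(5, 5))      # distant observations: kernel essentially 0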
Support vector machines don’t extend well to more than two classes, but two popular approaches for extending SVMs to the \(K\)-class case are one-versus-one and one-versus-all. Assuming \(K > 2\), one-versus-one, or all-pairs, constructs \(K \choose 2\) SVMs, each of which compares a pair of classes. A test observation would be classified using each of the \(K \choose 2\) classifiers, with the final classification given by the class most frequently predicted by the \(K \choose 2\) classifiers.
For one-versus-all, we fit \(K\) SVMs, each time comparing one of the \(K\) classes to the remaining \(K−1\) classes. Assuming a test observation \(x^∗\) and coefficients \(\beta_{0k}, \beta_{1k}, ..., \beta_{pk}\), resulting from fitting an SVM comparing the \(k^{th}\) class (coded as +1) to the others (coded as −1 ), the test observation is assigned to the class for which \(\beta_{0k} + \beta_{1k}X_1 + ... + \beta_{pk}X_p\) is largest, as this amounts to the highest level of confidence.
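A toy sketch of this assignment rule (all coefficients below are made up for illustration) with \(K = 3\) classes and \(p = 2\) predictors:

# one vector of (beta0k, beta1k, beta2k) per class k
betas <- list(c( 0.2,  1.0, -0.5),
              c(-0.1,  0.3,  0.8),
              c( 0.5, -0.6,  0.4))
x_star <- c(1.2, -0.7)
scores <- sapply(betas, function(b) b[1] + sum(b[-1] * x_star))
which.max(scores)  # assign x* to the class with the largest fitted value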
The criterion for fitting the support vector classifier can be rewritten to include a non-negative tuning parameter, \(\lambda\). When \(\lambda\) is large, more violations to the margin are tolerated, resulting in a low-variance but high-bias classifier. The rewritten criterion resembles those of ridge regression and the lasso in that it takes a “Loss + Penalty” form. The loss function below is known as the hinge loss, and it is closely related to the loss function used in logistic regression.
\[ L(\mathbf{X}, \mathbf{y}, \beta) = \sum_{i=1}^{n} \max\left[0,\, 1 - y_i(\beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip})\right] \]
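Written out, the “Loss + Penalty” criterion pairs this hinge loss with a ridge-style penalty on the coefficients:
\[ \underset{\beta_0, \beta_1, ..., \beta_p}{\text{minimize}} \left\{ \sum_{i=1}^{n} \max\left[0,\, 1 - y_i(\beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip})\right] + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} \]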
This loss function is exactly zero for observations for which \(y_i(\beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip}) \geq 1\); these correspond to observations that are on the correct side of the margin. In contrast, the loss function for logistic regression is never exactly zero, but it is very small for observations that are far from the decision boundary. The similarity between their loss functions allows logistic regression and the support vector classifier to give very similar results. When the classes are well separated, SVMs tend to behave better than logistic regression; in more overlapping regimes, logistic regression is often preferred.
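As a quick visual check of this relationship, the two losses can be plotted against \(y_i f(x_i)\); the logistic loss below is written in the \(\log(1 + e^{-yf})\) form:

# hinge loss vs. logistic loss as a function of y * f(x)
yf <- seq(-3, 3, length.out = 200)
hinge <- pmax(0, 1 - yf)
logistic <- log(1 + exp(-yf))
matplot(yf, cbind(hinge, logistic), type = "l", lty = 1,
        xlab = "y * f(x)", ylab = "loss")
legend("topright", c("hinge", "logistic"), col = 1:2, lty = 1)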
Support vector regression seeks coefficients that minimize a different type of loss than the sum of squared residuals: only residuals larger in absolute value than some positive constant contribute to the loss function. This is an extension of the margin used in support vector classifiers to the regression setting.
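As a small sketch of that loss (the constant below is an arbitrary illustrative choice), only residuals whose absolute value exceeds the constant incur any loss:

# residuals within +/- epsilon of zero incur no loss; epsilon = 0.5 is illustrative
eps_loss <- function(resid, epsilon = 0.5) pmax(0, abs(resid) - epsilon)
eps_loss(c(-1.2, -0.3, 0.1, 0.8))  # only |residual| > 0.5 contributes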
library(e1071)
## Warning: package 'e1071' was built under R version 3.4.4
set.seed (1)
# generating random data
x=matrix(rnorm(20*2), ncol=2)
y=c(rep(-1,10), rep(1,10))
x[y==1,]=x[y==1,] + 1
plot(x, col=(3-y))
# Must encode this variable as a factor so that it can perform classification
dat=data.frame(x=x, y=as.factor(y))
# This scale = FALSE tells the svm() function not to scale each feature to have
# mean zero or standard deviation one
svmfit=svm(y~., data=dat, kernel="linear", cost=10,
scale=FALSE)
plot(svmfit , dat)
svmfit$index
## [1] 1 2 5 7 14 16 17
summary(svmfit)
##
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
## gamma: 0.5
##
## Number of Support Vectors: 7
##
## ( 4 3 )
##
##
## Number of Classes: 2
##
## Levels:
## -1 1
Now, using a smaller value of the cost parameter, we obtain more support vectors because the margin is wider. Although the svm() function does not report the width of the margin, the wider margin is implied by the larger number of support vectors.
svmfit=svm(y~., data=dat, kernel="linear", cost=0.1, scale=FALSE)
plot(svmfit , dat)
svmfit$index
## [1] 1 2 3 4 5 7 9 10 12 13 14 15 16 17 18 20
The tune() function performs 10-fold cross-validation to choose the best value from a given range of the cost parameter.
set.seed (1)
tune.out=tune(svm,y~.,data=dat,kernel="linear",
ranges=list(cost=c(0.001, 0.01, 0.1, 1,5,10,100)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.1
##
## - best performance: 0.1
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.70 0.4216370
## 2 1e-02 0.70 0.4216370
## 3 1e-01 0.10 0.2108185
## 4 1e+00 0.15 0.2415229
## 5 5e+00 0.15 0.2415229
## 6 1e+01 0.15 0.2415229
## 7 1e+02 0.15 0.2415229
The output shows that cost = 0.1 gives the lowest cross-validation error, so we choose the corresponding model:
bestmod=tune.out$best.model
summary(bestmod)
##
## Call:
## best.tune(method = svm, train.x = y ~ ., data = dat, ranges = list(cost = c(0.001,
## 0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.1
## gamma: 0.5
##
## Number of Support Vectors: 16
##
## ( 8 8 )
##
##
## Number of Classes: 2
##
## Levels:
## -1 1
First we will generate data with a non-linear class boundary:
set.seed (1)
x=matrix(rnorm(200*2), ncol=2)
x[1:100,]=x[1:100,]+2
x[101:150,]=x[101:150,]-2
y=c(rep(1,150),rep(2,50))
dat=data.frame(x=x,y=as.factor(y))
plot(x, col = y)
Randomly split into training and testing sets:
train=sample(200,100)
svmfit=svm(y~., data=dat[train,], kernel="radial", gamma=1,cost =1)
plot(svmfit , dat[train ,])
summary(svmfit)
##
## Call:
## svm(formula = y ~ ., data = dat[train, ], kernel = "radial",
## gamma = 1, cost = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 1
##
## Number of Support Vectors: 37
##
## ( 17 20 )
##
##
## Number of Classes: 2
##
## Levels:
## 1 2
Since the above model makes a fair number of training errors, we can increase the cost; this reduces the training errors but produces a more irregularly shaped decision boundary. Based on this decision boundary, we should be worried that we are overfitting the data.
svmfit=svm(y~., data=dat[train,], kernel="radial",gamma=1, cost=1e5)
plot(svmfit ,dat[train ,])
Instead, we can again use tune() to perform cross-validation and select the values of cost and gamma that give the best model:
set.seed (1)
tune.out=tune(svm, y~., data=dat[train,], kernel="radial",
ranges=list(cost=c(0.1,1,10,100,1000),
gamma=c(0.5,1,2,3,4) ))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 2
##
## - best performance: 0.12
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-01 0.5 0.27 0.11595018
## 2 1e+00 0.5 0.13 0.08232726
## 3 1e+01 0.5 0.15 0.07071068
## 4 1e+02 0.5 0.17 0.08232726
## 5 1e+03 0.5 0.21 0.09944289
## 6 1e-01 1.0 0.25 0.13540064
## 7 1e+00 1.0 0.13 0.08232726
## 8 1e+01 1.0 0.16 0.06992059
## 9 1e+02 1.0 0.20 0.09428090
## 10 1e+03 1.0 0.20 0.08164966
## 11 1e-01 2.0 0.25 0.12692955
## 12 1e+00 2.0 0.12 0.09189366
## 13 1e+01 2.0 0.17 0.09486833
## 14 1e+02 2.0 0.19 0.09944289
## 15 1e+03 2.0 0.20 0.09428090
## 16 1e-01 3.0 0.27 0.11595018
## 17 1e+00 3.0 0.13 0.09486833
## 18 1e+01 3.0 0.18 0.10327956
## 19 1e+02 3.0 0.21 0.08755950
## 20 1e+03 3.0 0.22 0.10327956
## 21 1e-01 4.0 0.27 0.11595018
## 22 1e+00 4.0 0.15 0.10801234
## 23 1e+01 4.0 0.18 0.11352924
## 24 1e+02 4.0 0.21 0.08755950
## 25 1e+03 4.0 0.24 0.10749677
The confusion matrix for the best model shows that 10% of test observations are misclassified by this model:
table(true=dat[-train,"y"], pred=predict(tune.out$best.model, newdata=dat[-train ,]))
## pred
## true 1 2
## 1 74 3
## 2 7 16
library(ROCR)
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.4.4
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
# helper to draw an ROC curve from predicted scores and true labels using ROCR
rocplot=function(pred, truth, ...){
  predob = prediction(pred, truth)
  perf = performance(predob, "tpr", "fpr")
  plot(perf, ...)}
svmfit.opt=svm(y~., data=dat[train,], kernel="radial", gamma=2, cost=1,decision.values=T)
fitted=attributes(predict(svmfit.opt,dat[train,],decision.values=TRUE))$decision.values
par(mfrow=c(1,2))
rocplot(fitted ,dat[train ,"y"],main="Training Data")
Now to see how well it does on the test data:
fitted=attributes(predict(svmfit.opt,dat[-train,],decision.values=T))$decision.values
rocplot(fitted,dat[-train,"y"],main="Test Data")
The default behavior of the svm() function when the response has more than two levels is the “one-versus-one” approach. To show this with our randomly generated data, we will create a third class.
set.seed (1)
x=rbind(x, matrix(rnorm(50*2), ncol=2))
y=c(y, rep(0,50))
x[y==0,2]=x[y==0,2]+2
dat=data.frame(x=x, y=as.factor(y))
par(mfrow=c(1,1))
plot(x,col=(y+1))
svmfit=svm(y~., data=dat, kernel="radial", cost=10, gamma=1)
plot(svmfit , dat)