Exploration of Support Vector Machines

Hyperplanes

SVMs were developed in the 1990s as classifiers built on the idea of the maximal margin, and they are applicable to data sets in which the classes are not linearly separable. To begin, an explanation of an optimal separating hyperplane is required.

A hyperplane is a flat affine subspace of dimension \(p-1\) in a \(p\)-dimensional space. In five-dimensional space, a hyperplane is a flat four-dimensional subspace. In two dimensions, a hyperplane is defined by \[\beta_0+\beta_1X_1+\beta_2X_2=0\]

for parameters \(\beta_0, \beta_1, \beta_2\). Any \(X=(X_1, X_2)^T\) for which the equation holds is a point on the hyperplane above.

Extending this to the \(p\)-dimensional space we have the equation \[\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_pX_p=0\]

which defines the hyperplane such that if a point \(X=(X_1, X_2,...,X_p)^T\) in \(p\)-dimensional space satisfies the equation, then \(X\) lies on the hyperplane.

Now, if \(X\) is not on the hyperplane but either below or above the hyperplane as in \[\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_pX_p<0\]

or

\[\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_pX_p>0\]

then we can think of the hyperplane as dividing the \(p\)-dimensional space into two halves.

This idea extends to a classifier that separates points in our data set by which side of a dividing hyperplane they are on, if such a hyperplane exists.

If such a separating hyperplane exists, we can use a simple classifier to classify a test observation \(x^*\) based on the sign of \(f(x^*)=\beta_0+\beta_1x_1^*+\beta_2x_2^*+...+\beta_px_p^*\). If \(f(x^*)\) is positive we assign \(x^*\) to class 1; if it is negative, to class -1. How far \(x^*\) lies from the hyperplane determines our confidence in the assignment.
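
As a quick illustration of this decision rule, the sketch below classifies a single point using a made-up hyperplane; the coefficients and the test point are arbitrary and are not taken from any data set used later.

beta0 <- -1                       # hypothetical intercept
beta <- c(2, 3)                   # hypothetical coefficients beta_1, beta_2
xstar <- c(0.5, 0.25)             # a hypothetical test observation x*
f <- beta0 + sum(beta * xstar)    # f(x*) = beta_0 + beta_1 x_1* + beta_2 x_2*
ifelse(f > 0, 1, -1)              # assign the class by the sign of f(x*)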

Maximal Margin Classifier

This idea can further be extended to the maximal margin classifier which selects the optimal hyperplane out of the infinite number of choices of hyperplanes that exist.

We compute the orthogonal distance from a given separating hyperplane to each of the training observations; the smallest such distance is the margin. The maximal margin hyperplane is then the separating hyperplane whose margin is largest, that is, the one with the greatest minimal distance to the training observations. If the margin of the classifier is large on the training data, we have some reason to think it may also be large for the test data.

To build the maximal margin classifier based on a set of \(n\) training observations \(x_1, x_2,...,x_n\in \mathbb{R}^p\) and associated class labels \(y_1,...,y_n \in \{-1, 1\}\), we need to solve the optimization problem

\[\begin{aligned} &\underset{\beta_0, \beta_1,...,\beta_p,\, M}{\text{maximize}}\quad M\\ &\text{subject to } \sum_{j=1}^p\beta_j^2=1,\\ &y_i(\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+...+\beta_px_{ip})\geq M \quad \text{for all } i=1,...,n. \end{aligned}\]

The constraints guarantee that each observation is on the correct side of the hyperplane and at least a distance \(M\) away; with the normalization \(\sum_{j=1}^p\beta_j^2=1\), the quantity \(y_i(\beta_0+\beta_1x_{i1}+...+\beta_px_{ip})\) is exactly the signed distance from the \(i\)th observation to the hyperplane. Here \(M\) is the margin of the hyperplane, and the optimization chooses the coefficients \(\beta_0, \beta_1,...,\beta_p\) that maximize \(M\).

This all assumes that a separating hyperplane exists. If it does not, there is no maximal margin classifier. However, the idea can be generalized to the non-separable case, giving the support vector classifier.

Support Vector Classifiers

Using a soft margin, we can almost separate the classes even when no separating hyperplane exists. A soft margin may also be desirable when a separating hyperplane does exist but has a very small margin, which makes the resulting classifier overly sensitive to individual observations. Allowing the classifier to misclassify a small number of training observations in exchange for greater robustness is often the preferred trade-off.

The support vector classifier is the solution to the optimization problem

\[\begin{aligned} &\underset{\beta_0, \beta_1,...,\beta_p,\, \epsilon_1,...,\epsilon_n,\, M}{\text{maximize}}\quad M\\ &\text{subject to } \sum_{j=1}^p\beta_j^2=1,\\ &y_i(\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+...+\beta_px_{ip})\geq M(1-\epsilon_i),\\ &\epsilon_i\geq0, \quad \sum_{i=1}^n\epsilon_i\leq C, \end{aligned}\]

where \(C\) is a nonnegative tuning parameter and \(M\) is the width of the margin, which we want to make as large as possible. The slack variables \(\epsilon_1,...,\epsilon_n\) allow individual observations to be on the wrong side of the margin or the hyperplane. If \(\epsilon_i=0\), the \(i\)th observation is on the correct side of the margin; if \(\epsilon_i>0\), it is on the wrong side of the margin; and if \(\epsilon_i>1\), it is on the wrong side of the hyperplane.

Again, the support vector classifier classifies a test observation \(x^*\) based on the sign of \(f(x^*)=\beta_0+\beta_1x_1^*+\beta_2x_2^*+...+\beta_px_p^*\). If \(f(x^*)\) is positive we assign \(x^*\) to class 1; if it is negative, to class -1. How far \(x^*\) lies from the hyperplane determines our confidence in the assignment.

The tuning parameter \(C\) bounds \(\sum_{i=1}^n\epsilon_i\) and therefore determines the number and severity of margin violations that will be tolerated; in practice it is typically chosen by cross-validation.

Support Vector Machines

SVM is an extension of the support vector classifier that expands the feature space using kernels to deal with non-linear separability issues between classes.

The solution to the support vector classifier involves only the inner products of the observations rather than the observations themselves. The inner product of two \(r\)-vectors \(a\) and \(b\) is defined as \(\langle a, b\rangle=\sum_{i=1}^ra_ib_i\), so the inner product of two observations \(x_i, x_{i^\prime}\) is given by \[\langle x_i, x_{i^\prime}\rangle=\sum_{j=1}^px_{ij}x_{i^\prime j}.\] It can then be shown that the linear support vector classifier can be represented as \[f(x)=\beta_0+\sum_{i=1}^n\alpha_i\langle x, x_i\rangle,\] where there are \(n\) parameters \(\alpha_i, i=1,...,n\), one per training observation.

  • To estimate the parameters \(\alpha_1,...,\alpha_n\) and \(\beta_0\), we need the \(\binom{n}{2}\) inner products \(\langle x_i, x_{i^\prime}\rangle\) between all pairs of training observations, as sketched below.
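
As a small illustration of this bookkeeping (with toy numbers, not drawn from any data used later), the pairwise inner products can be collected into a matrix:

n <- 5
X <- matrix(rnorm(n * 3), nrow = n)   # five toy observations in R^3
K <- X %*% t(X)                       # K[i, j] = <x_i, x_j>, every pairwise inner product
choose(n, 2)                          # number of distinct pairs required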

To evaluate \(f(x)\) we need the inner product between the new point \(x\) and each of the training points \(x_i\); however, \(\alpha_i\) is nonzero only for the support vectors. Writing \(S\) for the collection of indices of the support points, the solution function has the form \[f(x)=\beta_0+\sum_{i\in S}\alpha_i\langle x, x_i\rangle.\] Each time the inner product appears in the support vector classifier representation above, we can replace it with a generalization of the inner product, \[K(x_i, x_{i^\prime}),\] where \(K\) is some function that we will refer to as a kernel. A kernel is a function that quantifies the similarity of two observations, and it can be linear or non-linear. In the non-linear case, where the support vector classifier is combined with a non-linear kernel, the function has the form \[f(x)=\beta_0+\sum_{i\in S}\alpha_iK(x, x_i).\]

One popular choice of non-linear kernel is the radial kernel \[K(x_i, x_{i^\prime})=\exp\left(-\gamma\sum_{j=1}^p(x_{ij}-x_{i^\prime j})^2\right),\]

where \(\gamma\) is a positive constant.
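
As a minimal sketch of what this kernel computes (the value of \(\gamma\) and the two vectors below are arbitrary), it can be written as a plain R function:

# radial kernel: similarity decays with the squared Euclidean distance between observations
radial_kernel <- function(xi, xj, gamma = 1) {
  exp(-gamma * sum((xi - xj)^2))
}
radial_kernel(c(1, 2), c(2, 0), gamma = 0.5)   # near 1 for nearby points, near 0 for distant points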

The advantages of using kernels instead of explicitly enlarging the feature space include computational efficiency and the fact that, for some kernels, the implied feature space is so large that explicit computations without a kernel would be intractable.

We will use the e1071 package to access the svm() function, which takes a kernel argument and a cost argument specifying the cost of a violation to the margin.

To demonstrate the svm() function fitting a support vector classifier for a given cost, we begin by generating observations.

set.seed(1)
x <- matrix(rnorm(20 * 2), ncol = 2)   # 20 observations in two dimensions
y <- c(rep(-1, 10), rep(1, 10))        # two classes of 10 observations each
x[y == 1, ] <- x[y == 1, ] + 1         # shift the +1 class so the classes are close but not linearly separable
plot(x, col = (3 - y))                 # color the points by class

We can see the observations are not linearly separable so we will fit the support vector classifier.

Fitting Models

The project will now sequentially develop each of the following topics.

  1. Fitting Support Vector Classifiers
  2. Fitting Support Vector Machines
  3. Utilizing the ROC Curve
  4. SVM with Multiple Classes
  5. Applying these tools to Gene Expression Data

Fitting the Support Vector Classifier

Next, we fit the support vector classifier. In order for the svm() function to perform classification we need to encode the response as a factor variable.

data <- data.frame(x = x, y = as.factor(y))
library(e1071)
svmfit <- svm(y ~ ., data = data, kernel = "linear", 
              cost = 10, scale = FALSE)

The argument scale = FALSE tells the svm() function not to scale the features to have mean zero and standard deviation one. Sometimes scale = TRUE might be preferred.
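
For comparison only (this fit is not used elsewhere in the analysis), scaling can be requested by flipping that argument:

# same model as above, but with each feature scaled to mean zero and unit variance
svmfit.scaled <- svm(y ~ ., data = data, kernel = "linear",
                     cost = 10, scale = TRUE)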

We can plot the SVM using the plot() function

plot(svmfit, data)

The lighter region contains the points assigned to the \(-1\) class and the darker region contains the points assigned to the \(+1\) class.

Because we used the kernel = "linear" argument, we have a linear decision boundary. The support vectors are plotted as crosses and the remaining observations as circles; the indices of the support vectors can be obtained as follows

svmfit$index
## [1]  1  2  5  7 14 16 17
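
If we want to inspect the support vectors themselves rather than just their indices, we can subset the training data (a small illustrative step, not part of the original output):

data[svmfit$index, ]   # the observations acting as support vectors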

The summary() function can be used to generate some additional information about the SVM

summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = data, kernel = "linear", cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  7
## 
##  ( 4 3 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1

We see that we have seven support vectors, four in one class and three in the other, with a cost of \(10\). What if we used a different cost?

data <- data.frame(x = x, y = as.factor(y))

svmfit <- svm(y ~ ., data = data, kernel = "linear", 
              cost = 0.1, scale = FALSE)
plot(svmfit, data)

svmfit$index
##  [1]  1  2  3  4  5  7  9 10 12 13 14 15 16 17 18 20

With a smaller cost we have more support vectors because the margin is wider.

The e1071 library includes a built-in function tune() to perform cross-validation. By default, tune() performs 10-fold CV on a set of models of interest.

set.seed(1)
tune.out <- tune(svm, y ~ ., data = data, kernel = "linear",
                 ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))

We can then use the summary() function to generate the cross-validation details, including the error for each model at each cost.

summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##   0.1
## 
## - best performance: 0.05 
## 
## - Detailed performance results:
##    cost error dispersion
## 1 1e-03  0.55  0.4377975
## 2 1e-02  0.55  0.4377975
## 3 1e-01  0.05  0.1581139
## 4 1e+00  0.15  0.2415229
## 5 5e+00  0.15  0.2415229
## 6 1e+01  0.15  0.2415229
## 7 1e+02  0.15  0.2415229

The tune() function stores the best model obtained, and we see that the best performance is achieved with cost = \(0.1\).

We can access the details of the best model via the best.model component.

bestmod <- tune.out$best.model
summary(bestmod)
## 
## Call:
## best.tune(method = svm, train.x = y ~ ., data = data, ranges = list(cost = c(0.001, 
##     0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
## 
## Number of Support Vectors:  16
## 
##  ( 8 8 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1

The predict() function can be used to predict the class label on a set of test observations for any given cost parameter value.

To illustrate, we build a test data set.

xtest <- matrix(rnorm(20 * 2), ncol = 2)
ytest <- sample(c(-1, 1), 20, rep = TRUE)
xtest[ytest == 1, ] = xtest[ytest == 1, ] + 1
testdata = data.frame(x = xtest, y = as.factor(ytest))

Now we can predict the class label of the test observations we generated using cross-validation to identify the best model.

ypred <- predict(bestmod, testdata)
table(predict = ypred, truth = testdata$y)
##        truth
## predict -1 1
##      -1  9 1
##      1   2 8

The sum of the main diagonal tells us we have 17 test observations correctly classified.
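
Equivalently, the test accuracy can be computed directly from the predictions; this is just a restatement of the table above.

mean(ypred == testdata$y)   # proportion correctly classified: 17/20 = 0.85 per the table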

If the cost is reduced to \(0.01\), model performance drops, with three additional test observations misclassified.

svmfit <- svm(y ~ ., data = data, kernel = "linear", cost = 0.01,
              scale = FALSE)
ypred <- predict(svmfit, testdata)
table(predict = ypred, truth = testdata$y)
##        truth
## predict -1  1
##      -1 11  6
##      1   0  3

Now consider a situation where the two classes are linearly separable, so that a separating hyperplane can be found with the svm() function. To begin, we push the classes in the simulated data further apart so that they become linearly separable.

x[y == 1, ] = x[y == 1,] + 0.5
plot(x, col = (y + 5) / 2, pch = 19)

We fit the support vector classifier and plot the resulting hyperplane, using a very large value of cost so that no observations are misclassified.

data = data.frame(x = x, y = as.factor(y))
svmfit <- svm(y ~ ., data = data, kernel = "linear", cost = 100000)
summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = data, kernel = "linear", cost = 1e+05)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1e+05 
## 
## Number of Support Vectors:  3
## 
##  ( 1 2 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1
plot(svmfit, data)

There were no training errors made and only three support vectors were used.

What if we use a smaller cost?

svmfit <- svm(y ~ ., data = data, kernel = "linear", cost = 1)
summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = data, kernel = "linear", cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  7
## 
##  ( 4 3 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1
plot(svmfit, data)

With a cost of \(1\) we misclassify a training observation, but we obtain a much wider margin and make use of seven support vectors.

Fitting Support Vector Machines

To fit an SVM we need to use a non-linear kernel, such as “polynomial” or “radial”. For the former, the degree argument specifies the degree \(d\) of the polynomial kernel; for the radial kernel, the gamma argument specifies the value of \(\gamma\). Both call forms are sketched below.
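
For reference, the two kernels are requested with calls of the following form; this is only a sketch of the argument names, with placeholder values for degree, gamma, and cost rather than choices tuned for the data.

# polynomial kernel of degree d (here d = 3, an arbitrary placeholder)
svm(y ~ ., data = data, kernel = "polynomial", degree = 3, cost = 1)
# radial kernel with a chosen gamma (also a placeholder value)
svm(y ~ ., data = data, kernel = "radial", gamma = 1, cost = 1)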

To begin, we generate some data with a non-linear class boundary.

set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)
x[1:100, ] = x[1:100, ] + 2
x[101:150, ] = x[101:150, ] - 2
y <- c(rep(1, 150), rep(2, 50))
data <- data.frame(x = x, y = as.factor(y))

To verify that the class boundary is non-linear, we can plot the data.

plot(x, col = y)

Randomly splitting the data into testing and training groups we can then fit the training data using the svm() function with the “radial” kernel and \(\gamma=1\).

train <- sample(200, 100)
svmfit <- svm(y ~ ., data = data[train, ], kernel = "radial", gamma = 1,
              cost = 1)
plot(svmfit, data[train, ])

The plot indicates the SVM has a clearly non-linear class boundary. The summary() function generates some details about the model

summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = data[train, ], kernel = "radial", gamma = 1, 
##     cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  31
## 
##  ( 16 15 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  1 2

There are a number of training errors that we can reduce by increasing the cost.

svmfit <- svm(y ~ ., data = data[train, ], kernel = "radial", gamma = 1,
              cost = 10000)
plot(svmfit, data[train, ])

The decision boundary is now much more irregular, which raises overfitting as a concern.

Again, we can use tune() to apply cross-validation to select the best choice of \(\gamma\) and cost for an SVM with a “radial” kernel.

set.seed(1)
tune.out <- tune(svm, y ~ ., data = data[train, ], kernel ="radial", 
                 ranges = list(cost = c(0.1, 1, 10, 100, 1000),
                 gamma = c(0.5, 1, 2, 3, 4)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##     1   0.5
## 
## - best performance: 0.07 
## 
## - Detailed performance results:
##     cost gamma error dispersion
## 1  1e-01   0.5  0.26 0.15776213
## 2  1e+00   0.5  0.07 0.08232726
## 3  1e+01   0.5  0.07 0.08232726
## 4  1e+02   0.5  0.14 0.15055453
## 5  1e+03   0.5  0.11 0.07378648
## 6  1e-01   1.0  0.22 0.16193277
## 7  1e+00   1.0  0.07 0.08232726
## 8  1e+01   1.0  0.09 0.07378648
## 9  1e+02   1.0  0.12 0.12292726
## 10 1e+03   1.0  0.11 0.11005049
## 11 1e-01   2.0  0.27 0.15670212
## 12 1e+00   2.0  0.07 0.08232726
## 13 1e+01   2.0  0.11 0.07378648
## 14 1e+02   2.0  0.12 0.13165612
## 15 1e+03   2.0  0.16 0.13498971
## 16 1e-01   3.0  0.27 0.15670212
## 17 1e+00   3.0  0.07 0.08232726
## 18 1e+01   3.0  0.08 0.07888106
## 19 1e+02   3.0  0.13 0.14181365
## 20 1e+03   3.0  0.15 0.13540064
## 21 1e-01   4.0  0.27 0.15670212
## 22 1e+00   4.0  0.07 0.08232726
## 23 1e+01   4.0  0.09 0.07378648
## 24 1e+02   4.0  0.13 0.14181365
## 25 1e+03   4.0  0.15 0.13540064

The best choice of parameters uses cost = 1 and gamma = 0.5.

We can also view the test set predictions by applying the predict() function to the data; we subset the data frame data with -train to form the test set.

table(true = data[-train, "y"], pred = predict(tune.out$best.model,
                                               newdata = data[-train, ]))
##     pred
## true  1  2
##    1 67 10
##    2  2 21

We can see that \(12\)% of the test observations are misclassified by this particular SVM.
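
The same error rate can be computed directly rather than read off the table; this uses the model already tuned above.

pred.best <- predict(tune.out$best.model, newdata = data[-train, ])
mean(pred.best != data[-train, "y"])   # test error rate: 12/100 per the table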

Utilizing the ROC Curve

The ROCR package generates the receiver operating characteristic (ROC) curve, which illustrates the overall performance of a classifier.

library(ROCR)
rocplot <- function(pred, truth, ...) {
  # build an ROCR prediction object from fitted values and true labels,
  # then plot the true positive rate against the false positive rate
  predob <- prediction(pred, truth)
  perf <- performance(predob, "tpr", "fpr")
  plot(perf, ...)
}

SVMs and support vector classifiers output class labels for each observation but we can get fitted values for each observation as well.

In the case of a support vector classifier, the fitted value for an observation \(X=(X_1, X_2,..,X_p)^T\) takes the form \[\hat{\beta_0}+\hat{\beta_1}X_1+\hat{\beta_2}X_2+...+\hat{\beta_p}X_p\]

For an SVM with a non-linear kernel, the fitted value is \[f(x)=\beta_0+\sum_{i\in S}\alpha_iK(x, x_i).\] The sign of the fitted value determines which side of the decision boundary the observation falls on.

To get fitted values, use decision.values = TRUE when fitting svm(). The predict() function will output the fitted values.

svmfit.opt <- svm(y ~ ., data = data[train, ], kernel ="radial", 
                  gamma = 2, cost = 1, decision.values = TRUE)
fitted <- attributes(predict(svmfit.opt, data[train, ], decision.values = TRUE))$decision.values

Now we can draw the ROC plot for the training data.

par(mfrow = c(1, 2))
# The earlier error came from calling rocplot() with add = TRUE before any plot was open;
# drawing the first curve without add = TRUE avoids it.
rocplot(fitted, data[train, "y"], main = "Training Data")

By increasing \(\gamma\) we can obtain a more flexible fit and better accuracy on the training data.

svmfit.flex <- svm(y ~ ., data = data[train, ], kernel = "radial", 
                   gamma = 50, cost = 1, decision.values = TRUE)
fitted <- attributes(predict(svmfit.flex, data[train, ], 
                             decision.values = T))$decision.values
rocplot(fitted, data[train, "y"], add = TRUE, col = "red")   # overlay the more flexible fit on the training ROC plot
fitted <- attributes(predict(svmfit.opt, data[-train, ], 
                             decision.values = T))$decision.values
rocplot(fitted, data[-train, "y"], main = "Test Data")

fitted <- attributes(predict(svmfit.flex, data[-train, ], 
                             decision.values = T))$decision.values
rocplot(fitted, data[-train, "y"], add = TRUE, col = "red")

SVM with Multiple Classes

If we have a response variable \(Y\) with more than two levels, the svm() function performs multi-class classification using the one-versus-one approach: a binary classifier is fit for each pair of classes, and a test observation is assigned to the class to which it is most frequently assigned across those pairwise comparisons.

To illustrate how this works, create a third class of observations.

set.seed(1)
x <- rbind(x, matrix(rnorm(50 * 2), ncol = 2))
y <- c(y, rep(0, 50))
x[y == 0, 2] = x[y == 0, 2] + 2
data = data.frame(x = x, y = as.factor(y))
par(mfrow = c(1, 1))
plot(x, col = (y + 1))

Fit an SVM to the data

svmfit <- svm(y ~ ., data = data, kernel = "radial", cost = 10, gamma = 1)
plot(svmfit, data)

We see that we have three bounded regions to separate the classes.

summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = data, kernel = "radial", cost = 10, gamma = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  105
## 
##  ( 38 37 30 )
## 
## 
## Number of Classes:  3 
## 
## Levels: 
##  0 1 2


Application to Gene Expression Data

Using the Khan data set in the ISLR package, we explore classification with SVMs. The data consist of tissue samples from four types of small round blue cell tumors, with gene expression measurements for each tissue sample. The data are already split into training and test sets.

To begin we examine the data

library(ISLR)
names(Khan)
## [1] "xtrain" "xtest"  "ytrain" "ytest"
dim(Khan$xtrain)
## [1]   63 2308
dim(Khan$xtest)
## [1]   20 2308
length(Khan$ytrain)
## [1] 63
length(Khan$ytest)
## [1] 20
table(Khan$ytrain)
## 
##  1  2  3  4 
##  8 23 12 20
table(Khan$ytest)
## 
## 1 2 3 4 
## 3 6 6 5

We have \(2,308\) genes, with \(63\) training observations and \(20\) test observations.

We use a support vector approach to predict the cancer subtype from the gene expression measurements. Because there is a very large number of features relative to the number of observations, a linear kernel is appropriate; the additional flexibility of a polynomial or radial kernel is unnecessary.

data <- data.frame(x = Khan$xtrain, y = as.factor(Khan$ytrain))
out <- svm(y ~ ., data = data, kernel = "linear", cost = 10)
summary(out)
## 
## Call:
## svm(formula = y ~ ., data = data, kernel = "linear", cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  58
## 
##  ( 20 20 11 7 )
## 
## 
## Number of Classes:  4 
## 
## Levels: 
##  1 2 3 4
table(out$fitted, data$y)
##    
##      1  2  3  4
##   1  8  0  0  0
##   2  0 23  0  0
##   3  0  0 12  0
##   4  0  0  0 20

Notice that there are zero training errors. This is because the large number of variables relative to the number of observations makes it easy to find hyperplanes that fully separate the classes.

Ultimately, though, we care most about the model's performance on the test data.

data.test <- data.frame(x = Khan$xtest, y = as.factor(Khan$ytest))
predict.test <- predict(out, newdata = data.test)
table(predict.test, data.test$y)
##             
## predict.test 1 2 3 4
##            1 3 0 0 0
##            2 0 6 2 0
##            3 0 0 4 0
##            4 0 0 0 5

We see that on the test data we have two errors.
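
As before, the test error rate can be computed directly from the predictions; this simply restates the table above.

mean(predict.test != data.test$y)   # test error rate: 2/20 = 0.1 per the table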