Visit my website for more like this!
Heavily borrowed from:
Textbook: Introduction to statistical learning
Textbook: Elements of statistical learning
require(knitr)
## Loading required package: knitr
SVMs are a group of classifiers considered to be one of the best “out of the box” methods. There are several types of SVMs. The simplest is the _maximal margin classifier_ (MMC). Though the MMC is elegant and simple, it cannot be applied to most data sets, since it requires that the classes be separable by a linear boundary. The support vector classifier is an extension of the maximal margin classifier and can be applied in a broader range of cases. The support vector machine is a further extension still, accommodating non-linear class boundaries.
Here we define a hyperplane and introduce the idea of an optimal separating hyperplane.
In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p - 1. For example, in two dimensions a hyperplane is a flat one-dimensional subspace (a line). In three dimensions, a hyperplane is a flat two-dimensional subspace – a plane. Hyperplanes in p > 3 dimensions are hard to visualize, but they are still flat (p - 1)-dimensional subspaces.
A hyperplane essentially divides p-dimensional space into two halves. Using the equation of a p-dimensional hyperplane, we can determine which side a given point lies on by checking the sign of the equation evaluated at that point.
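For concreteness, in ISLR's notation the hyperplane is the set of points \(X = (X_1, \dots, X_p)^T\) satisfying

\[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p = 0. \]

A test observation \(x^*\) is assigned to class 1 if \(f(x^*) = \beta_0 + \beta_1 x^*_1 + \dots + \beta_p x^*_p > 0\), and to class -1 if \(f(x^*) < 0\).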
Suppose we have an n x p data matrix X that consists of n training observations in p-dimensional space, which fall into two classes (-1, 1). The goal is to develop a classifier based on the training data that will correctly classify test observations using the available features. This setup is common to any type of classifier.
Suppose that it is possible to construct a hyperplane that separates the training observations perfectly according to their class labels. If such a hyperplane exists, we can use it to construct a very natural classifier: a test observation is assigned a class depending on which side of the hyperplane it is located. We can also make use of the magnitude, that is, how far an observation is from the hyperplane, and so be more (or less) confident about our class assignment. This type of separating hyperplane leads to a linear decision boundary.
If our data can be perfectly separated using a hyperplane, then there will be an infinite number of such hyperplanes. You can imagine that the hyperplane can usually be shifted up or down a little bit without coming into contact with the observations. So in order to construct a classifier, we need a reasonable way to choose which of the infinitely many possible separating hyperplanes to use.
The simplest choice is the maximal margin hyperplane (also known as the optimal separating hyperplane), which is the separating hyperplane that is farthest from the training observations. The margin is the distance from the hyperplane to the nearest points on either side. We hope that a classifier with a large margin on the training data will also have a large margin on the test data, though this approach can lead to overfitting when p is large.
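Concretely, ISLR writes the maximal margin hyperplane as the solution to the optimization problem

\[ \underset{\beta_0, \beta_1, \dots, \beta_p,\, M}{\text{maximize}} \; M \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \qquad y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \ge M \;\; \text{for } i = 1, \dots, n. \]

The constraints ensure that every training observation falls on the correct side of the hyperplane and at least a distance \(M\) from it, and we make that cushion \(M\) as large as possible.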
The points that determine this boundary are the closest observations, which lie equidistant from the hyperplane (at perpendicular distance equal to the margin); they are called the support vectors. The maximal margin hyperplane depends directly on these support vectors, and not on the other observations. The fact that the MMC boundary is determined by only a small number of observations is an important property that will come up shortly.
The maximal margin classifier is a very natural way to perform classification, but only if a separating hyperplane exists. In many cases it does not, and the optimization problem has no solution that exactly separates the two classes.
The next section shows how to extend this model in order to develop a hyperplane that almost separates the classes, a so-called soft margin. This is called the support vector classifier.
Even when a separating hyperplane does exist between two classes, there are instances where fitting one may not be desirable. The maximal margin classifier fits the training data perfectly and is highly sensitive to just a few of the training observations. In other words, we have a high potential to overfit the data.
In this case we might be willing to consider a classifier based on a hyperplane that does not perfectly separate the two classes. By doing this we can attain:
Greater robustness to individual observations, and
Better classification of most of the training observations.
Basically, it could be worthwhile to misclassify a few training observations in order to do a better job in classifying the remaining observations.
The support vector classifier, sometimes called a soft margin classifier, does exactly this. An observation can not only be on the wrong side of the margin, but also on the wrong side of the hyperplane. This situation is inevitable when there is no perfect separation in the data.
The solution to the optimization problem for the support vector classifier involves several quantities: C, a non-negative tuning parameter; M, the margin width, which we seek to make as large as possible; and slack variables \(\epsilon_1, \dots, \epsilon_n\) that allow individual observations to be on the wrong side of the margin or of the hyperplane.
As before, we classify any given test observation based on its sign relative to the hyperplane. The slack variable \(\epsilon_i\) tells us where the ith observation is located relative to the hyperplane and relative to the margin. If \(\epsilon_i = 0\), the ith observation is on the correct side of the margin. If \(\epsilon_i > 0\), the ith observation has violated the margin. If \(\epsilon_i > 1\), it is on the wrong side of the hyperplane.
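Putting these pieces together, ISLR's formulation of the support vector classifier is

\[ \underset{\beta_0, \dots, \beta_p,\, \epsilon_1, \dots, \epsilon_n,\, M}{\text{maximize}} \; M \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \qquad y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \ge M(1 - \epsilon_i), \qquad \epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C. \]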
C bounds the sum of the \(\epsilon_i\)'s, and so it determines the number and severity of the violations to the margin and hyperplane that we will tolerate. If C = 0, there is no budget for margin violations, and the problem reduces to the maximal margin optimization problem. For C > 0, no more than C observations can be on the wrong side of the hyperplane. As C increases, we become more tolerant of violations to the margin, and so the margin will widen. C is generally chosen by cross-validation and controls the bias-variance trade-off: when C is small, we seek narrow margins that are rarely violated (low bias, high variance). An illustration from the ISLR textbook is shown below.
It turns out that only observations that either lie on the margin or violate the margin affect the hyperplane. Thus, an observation that lies strictly on the correct side of the margin does not affect the support vector classifier. The observations that lie on the margin or on the wrong side of the margin are called the support vectors, and they are the only points that drive the classifier.
This is distinctly different from, say, an LDA classifier, which uses the means of all the observations within each class, as well as the within-class covariance matrix. Logistic regression, in contrast, is similar to the support vector classifier in that it has very low sensitivity to observations far from the decision boundary.
Thus when C is large, the margin is wide, and there are many support vectors. The figure below illustrates these trade-offs.
First we will go through converting a linear classifier into a one that can handle non-linear boundaries. Then the support vector machine, which does this automatically, is introduced.
The support vector classifier (above) is a natural approach for classification in the two-class setting when the boundary between the classes is linear. Just as we extend the linear model to account for non-linear relationships, we can transform the predictors using quadratic, cubic, and even higher-order polynomial functions. Unfortunately, we quickly start to accumulate features (e.g., \(2p\) features if we add quadratic terms), and the computations can become unmanageable.
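As a quick illustration of that idea (a minimal sketch on simulated data, not from the text), we can append squared terms as extra columns by hand and fit an ordinary linear support vector classifier on the enlarged feature space:

# Minimal sketch: enlarge the feature space with quadratic terms by hand,
# then fit a *linear* support vector classifier on the enlarged space.
library(e1071)
set.seed(1)
x <- matrix(rnorm(100*2), ncol=2)
y <- as.factor(ifelse(x[,1]^2 + x[,2]^2 > 1.5, 1, -1))  # circular (non-linear) boundary
dat.exp <- data.frame(x1=x[,1], x2=x[,2],
                      x1.sq=x[,1]^2, x2.sq=x[,2]^2,      # the added quadratic features
                      y=y)
fit.exp <- svm(y ~ ., data=dat.exp, kernel='linear', cost=1)
mean(fit.exp$fitted == dat.exp$y)  # training accuracy on the enlarged space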
The support vector machine is an extension of the support vector classifier that results from enlarging the feature space using kernels, where a kernel K is some function that quantifies the similarity of two observations. Essentially, using a polynomial kernel of degree d amounts to fitting a support vector classifier in a higher-dimensional space, and this is what we call a support vector machine.
The polynomial kernel is one example of a possible non-linear kernel. Another popular choice is the radial kernel. The main difference between the two is that the radial kernel has very local behaviour compared to the polynomial kernel. The radial kernel also has a tuning parameter \(\gamma\); as \(\gamma\) increases, the fit becomes more and more non-linear. A comparison of the two kernels is presented below. In this case, both classifiers fit the data sufficiently well. Kernels allow us to avoid explicitly computing in the enlarged feature space, keeping the optimization problem at a manageable size.
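For reference, the kernels discussed here take the following forms in ISLR, and the resulting classifier can be written as \(f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i K(x, x_i)\), where \(\mathcal{S}\) is the set of support vectors:

\[ K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j} \quad \text{(linear)}, \qquad K(x_i, x_{i'}) = \Bigl(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\Bigr)^d \quad \text{(polynomial of degree } d\text{)}, \]

\[ K(x_i, x_{i'}) = \exp\Bigl(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\Bigr) \quad \text{(radial, with } \gamma > 0\text{)}. \]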
The concept of separating hyperplanes only really lends itself to the binary classification setting. However, we now discuss the two most popular techniques for extending SVMs to K classes:
one-versus-one
one-versus-all
This approach constructs a classifier for each of the K(K - 1)/2 possible pairs of classes, comparing each pair in a two-class setting. For each of these classifiers we classify a test observation and tally the number of times the test observation is assigned to each of the K classes. The final classification assigns the test observation to the class to which it was most frequently assigned in these pairwise comparisons (a small sketch of this voting scheme is shown below).
Here we fit K SVMs, each time comparing one of the K classes to the remaining K - 1 classes. We then assign a test observation to the class whose classifier expresses the highest confidence that the observation belongs to that class rather than to any of the others.
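To make the one-versus-one voting concrete, here is a rough sketch (my own illustration, not from the text) on R's built-in iris data. Note that svm() already performs this internally for factor responses with more than two levels; the loop below just makes the tallying explicit.

# Sketch of one-versus-one voting: fit an SVM for every pair of classes,
# let each pairwise classifier vote, and pick the class with the most votes.
library(e1071)
classes <- levels(iris$Species)
class.pairs <- combn(classes, 2, simplify=FALSE)
votes <- matrix(0, nrow=nrow(iris), ncol=length(classes),
                dimnames=list(NULL, classes))
for (p in class.pairs) {
  sub <- droplevels(subset(iris, Species %in% p))   # two-class subset
  fit <- svm(Species ~ ., data=sub, kernel='linear')
  pred <- as.character(predict(fit, iris))          # pairwise prediction for every row
  for (cls in p) votes[pred == cls, cls] <- votes[pred == cls, cls] + 1
}
ovo.pred <- classes[max.col(votes)]  # class receiving the most pairwise votes
table(ovo.pred, iris$Species)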
Though not covered here, there is also an extension of the SVM for regression.
The e1071 library implements several statistical learning methods. The svm() function can fit a support vector classifier when the argument kernel='linear' is used. The cost argument allows us to specify the cost of a violation to the margin; when the cost is small, the margins will be wide and there will be many support vectors.
set.seed(1)
#Create our own test data
x <- matrix(rnorm(20*2), ncol=2)
y <- c(rep(-1,10), rep(1,10))
x[y==1,] <- x[y==1,] + 1  # shift the class 1 observations; the classes still overlap
Let’s take a look at these data to see whether the classes are linearly separable.
plot(x, col=(3-y))
They are not linearly separable. Now we can fit the support vector classifier. To perform classification we have to specify the response as a factor. The argument scale = FALSE tells the function not to scale each feature; depending on the application, we may prefer to let it do so.
library(e1071)
dat <- data.frame(x=x, y=as.factor(y))
svm.fit <- svm(y ~., data=dat, kernel='linear', cost=10, scale=FALSE)
# Plot the SVC obtained
plot(svm.fit, dat)
summary(svm.fit)
##
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
## gamma: 0.5
##
## Number of Support Vectors: 7
##
## ( 4 3 )
##
##
## Number of Classes: 2
##
## Levels:
## -1 1
The summary lets us know there were 7 support vectors, four in the first class and three in the second. What if we used a smaller cost parameter instead?
svm.fit2 <- svm(y ~., data=dat, kernel = 'linear', cost=0.1, scale=FALSE)
plot(svm.fit2, dat)
summary(svm.fit2)
##
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 0.1,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.1
## gamma: 0.5
##
## Number of Support Vectors: 16
##
## ( 8 8 )
##
##
## Number of Classes: 2
##
## Levels:
## -1 1
With a smaller value of cost we obtain a larger number of support vectors, because the margin is now wider. The e1071 library has a built-in tune() function that performs 10-fold cross-validation over a set of candidate models.
set.seed(1)
tune.out <- tune(svm, y ~., data=dat, kernel='linear',
ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.1
##
## - best performance: 0.1
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.70 0.4216
## 2 1e-02 0.70 0.4216
## 3 1e-01 0.10 0.2108
## 4 1e+00 0.15 0.2415
## 5 5e+00 0.15 0.2415
## 6 1e+01 0.15 0.2415
## 7 1e+02 0.15 0.2415
Here we see that cost = 0.1 results in the lowest cross-validation error rate. tune() also stores the best model obtained, accessible through tune.out$best.model, so we can use it to predict on test data. Here we create a simulated test set.
xtest <- matrix(rnorm(20*2), ncol=2)
ytest <- sample(c(-1,1), 20, rep=TRUE)
xtest[ytest==1,] <- xtest[ytest==1,] + 1
testdat <- data.frame(x=xtest, y=as.factor(ytest))
Then we predict the class labels of the test observations using the cross-validated model.
yhat <- predict(tune.out$best.model, testdat)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
confusionMatrix(yhat, testdat$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction -1 1
## -1 11 1
## 1 0 8
##
## Accuracy : 0.95
## 95% CI : (0.751, 0.999)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 0.000111
##
## Kappa : 0.898
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 1.000
## Specificity : 0.889
## Pos Pred Value : 0.917
## Neg Pred Value : 1.000
## Prevalence : 0.550
## Detection Rate : 0.550
## Detection Prevalence : 0.600
## Balanced Accuracy : 0.944
##
## 'Positive' Class : -1
##
Again we use the svm() function, but now we can experiment with non-linear kernels. For polynomial kernels we use the degree argument to set the polynomial order, and for radial kernels we use the gamma argument to set the value of \(\gamma\).
# Generate some test data
set.seed(1)
x <- matrix(rnorm(200*2), ncol=2)
x[1:100,] <- x[1:100,] + 2
x[101:150,] <- x[101:150,] - 2
y <- c(rep(1,150), rep(2,50))
dat <- data.frame(x=x, y=as.factor(y))
plot(x, col=y)
Randomly split the data into training and testing groups and fit a radial kernel.
train <- sample(200, 100)
svm.fit <- svm(y ~., data=dat[train,], kernel='radial', gamma=1, cost=1)
plot(svm.fit, dat[train,])
summary(svm.fit)
##
## Call:
## svm(formula = y ~ ., data = dat[train, ], kernel = "radial",
## gamma = 1, cost = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 1
##
## Number of Support Vectors: 37
##
## ( 17 20 )
##
##
## Number of Classes: 2
##
## Levels:
## 1 2
yhat <- predict(svm.fit, dat[-train,])
confusionMatrix(yhat, dat[-train,'y'])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 72 7
## 2 5 16
##
## Accuracy : 0.88
## 95% CI : (0.8, 0.936)
## No Information Rate : 0.77
## P-Value [Acc > NIR] : 0.0041
##
## Kappa : 0.651
## Mcnemar's Test P-Value : 0.7728
##
## Sensitivity : 0.935
## Specificity : 0.696
## Pos Pred Value : 0.911
## Neg Pred Value : 0.762
## Prevalence : 0.770
## Detection Rate : 0.720
## Detection Prevalence : 0.790
## Balanced Accuracy : 0.815
##
## 'Positive' Class : 1
##
We can see from the figure that there are a fair number of training errors for this fit. If we increase the value of cost, we can reduce the training errors, but we risk overfitting the data.
svm.fit <- svm(y ~., dat[train,], kernel='radial', gamma=1, cost=1e5)
plot(svm.fit, dat[train,])
This fit certainly is very irregular and would probably overfit the test data. Let’s try cross-validating these parameters instead.
set.seed(1)
tune.out <- tune(svm, y ~., data=dat[train,],
kernel='radial',
ranges = list(cost=c(0.1,1,10,100,1000),
gamma=c(0.5, 1,2,3,4)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 2
##
## - best performance: 0.12
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-01 0.5 0.27 0.11595
## 2 1e+00 0.5 0.13 0.08233
## 3 1e+01 0.5 0.15 0.07071
## 4 1e+02 0.5 0.17 0.08233
## 5 1e+03 0.5 0.21 0.09944
## 6 1e-01 1.0 0.25 0.13540
## 7 1e+00 1.0 0.13 0.08233
## 8 1e+01 1.0 0.16 0.06992
## 9 1e+02 1.0 0.20 0.09428
## 10 1e+03 1.0 0.20 0.08165
## 11 1e-01 2.0 0.25 0.12693
## 12 1e+00 2.0 0.12 0.09189
## 13 1e+01 2.0 0.17 0.09487
## 14 1e+02 2.0 0.19 0.09944
## 15 1e+03 2.0 0.20 0.09428
## 16 1e-01 3.0 0.27 0.11595
## 17 1e+00 3.0 0.13 0.09487
## 18 1e+01 3.0 0.18 0.10328
## 19 1e+02 3.0 0.21 0.08756
## 20 1e+03 3.0 0.22 0.10328
## 21 1e-01 4.0 0.27 0.11595
## 22 1e+00 4.0 0.15 0.10801
## 23 1e+01 4.0 0.18 0.11353
## 24 1e+02 4.0 0.21 0.08756
## 25 1e+03 4.0 0.24 0.10750
yhat <- predict(tune.out$best.model, dat[-train,])
confusionMatrix(yhat, dat[-train, 'y'])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 74 7
## 2 3 16
##
## Accuracy : 0.9
## 95% CI : (0.824, 0.951)
## No Information Rate : 0.77
## P-Value [Acc > NIR] : 0.000669
##
## Kappa : 0.699
## Mcnemar's Test P-Value : 0.342782
##
## Sensitivity : 0.961
## Specificity : 0.696
## Pos Pred Value : 0.914
## Neg Pred Value : 0.842
## Prevalence : 0.770
## Detection Rate : 0.740
## Detection Prevalence : 0.810
## Balanced Accuracy : 0.828
##
## 'Positive' Class : 1
##
Using the cross-validated tuning parameters we achieve a stronger fit.
Another way to choose between models is to use an ROC curve. We can do this with the ROCR package.
library(ROCR)
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.1.1
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:stats':
##
## lowess
# function to handle the different models
rocplot <- function(pred, truth, ...){
  predob <- prediction(pred, truth)
  perf <- performance(predob, 'tpr', 'fpr')
  plot(perf, ...)
}
Now, when we rebuild the SVM, we set decision.values=TRUE to obtain the fitted decision values.
svm.opt <- svm(y ~., data=dat[train,], kernel='radial',
gamma=2, cost=1, decision.values=T)
fitted <- attributes(predict(svm.opt, dat[train,], decision.values=T))$decision.values
rocplot(fitted, dat[train,'y'], main='Training Data')
Next we use caret to train the model, selecting the tuning parameters that yield the largest area under the ROC curve.
ctr <- trainControl(method='cv',
number=10,
classProbs=TRUE,
summaryFunction=twoClassSummary)
svm.c <- train(y ~., dat[train,],
method='svmRadial',
trControl=ctr,
metric="ROC")
## Loading required package: kernlab
## Warning: At least one of the class levels are not valid R variables names;
## This may cause errors if class probabilities are generated because the
## variables names will be converted to: X1, X2
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
svm.c
## Support Vector Machines with Radial Basis Function Kernel
##
## 100 samples
## 2 predictors
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 90, 90, 90, 89, 90, 91, ...
##
## Resampling results across tuning parameters:
##
## C ROC Sens Spec ROC SD Sens SD Spec SD
## 0.2 0.9 0.9 0.8 0.07 0.1 0.2
## 0.5 0.9 0.9 0.8 0.08 0.1 0.2
## 1 0.9 0.9 0.7 0.09 0.1 0.3
##
## Tuning parameter 'sigma' was held constant at a value of 1.809
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 2 and C = 0.2.
plot(svm.c$finalModel)
yhat.c <- predict(svm.c, dat[-train,])
confusionMatrix(yhat.c, dat[-train,'y'])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 74 8
## 2 3 15
##
## Accuracy : 0.89
## 95% CI : (0.812, 0.944)
## No Information Rate : 0.77
## P-Value [Acc > NIR] : 0.00174
##
## Kappa : 0.664
## Mcnemar's Test P-Value : 0.22780
##
## Sensitivity : 0.961
## Specificity : 0.652
## Pos Pred Value : 0.902
## Neg Pred Value : 0.833
## Prevalence : 0.770
## Detection Rate : 0.740
## Detection Prevalence : 0.820
## Balanced Accuracy : 0.807
##
## 'Positive' Class : 1
##
It looks like both packages produce very similar values! That’s good. However, I find caret much easier to work with.
The svm() function is able to perform the one-versus-one approach.
#clean memory space
rm(xtest)
rm(ytest)
rm(y)
library(ISLR)
attach(Khan)
names(Khan)
## [1] "xtrain" "xtest" "ytrain" "ytest"
The dataset consists of expression measurements for 2,308 genes and is already split into training and test sets. We will use a support vector approach to predict the cancer subtype from the gene expression measurements. This dataset has a very large number of features relative to the number of observations, which suggests a linear kernel is likely sufficient.
dat = data.frame(xtrain, y=as.factor(ytrain))
svm.fit <- svm(y~., dat, kernel='linear', cost=10)
summary(svm.fit)
##
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
## gamma: 0.0004333
##
## Number of Support Vectors: 58
##
## ( 20 20 11 7 )
##
##
## Number of Classes: 4
##
## Levels:
## 1 2 3 4
table(svm.fit$fitted, dat$y)
##
## 1 2 3 4
## 1 8 0 0 0
## 2 0 23 0 0
## 3 0 0 12 0
## 4 0 0 0 20
Here we see that there are actually no training errors. This is not surprising due to the large number of predictors and few observations. How does it work on the test data?
testing <- data.frame(xtest, y=as.factor(ytest))
yhat <- predict(svm.fit, newdata=testing)
table(yhat, testing$y)
##
## yhat 1 2 3 4
## 1 3 0 0 0
## 2 0 6 2 0
## 3 0 0 4 0
## 4 0 0 0 5
Here we only make 2 errors on the test data. We might as well continue and tune the parameters to see if we can make any improvements.
# library(doMC)
# registerDoMC(6)
# Recall our classification training control function
ctr <- trainControl(method='cv',
number=10)
grid <- data.frame(C=c(0.0001, 0.001, 0.01, 1))
svm.tune <- train(y ~., dat,
method='svmLinear',
trControl=ctr,
tuneGrid=grid)
svm.tune
## Support Vector Machines with Linear Kernel
##
## 63 samples
## 2308 predictors
## 4 classes: '1', '2', '3', '4'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 57, 56, 57, 57, 56, 57, ...
##
## Resampling results across tuning parameters:
##
## C Accuracy Kappa Accuracy SD Kappa SD
## 1e-04 0.7 0.5 0.2 0.3
## 0.001 1 1 0.05 0.06
## 0.01 1 1 0.05 0.06
## 1 1 1 0.05 0.06
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.001.
yhat.tune <- predict(svm.tune, testing)
confusionMatrix(yhat.tune, testing$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 3 0 0 0
## 2 0 6 2 0
## 3 0 0 4 0
## 4 0 0 0 5
##
## Overall Statistics
##
## Accuracy : 0.9
## 95% CI : (0.683, 0.988)
## No Information Rate : 0.3
## P-Value [Acc > NIR] : 3.77e-08
##
## Kappa : 0.864
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 1.00 1.000 0.667 1.00
## Specificity 1.00 0.857 1.000 1.00
## Pos Pred Value 1.00 0.750 1.000 1.00
## Neg Pred Value 1.00 1.000 0.875 1.00
## Prevalence 0.15 0.300 0.300 0.25
## Detection Rate 0.15 0.300 0.200 0.25
## Detection Prevalence 0.15 0.400 0.200 0.25
## Balanced Accuracy 1.00 0.929 0.833 1.00
Turns out we get the same result. There isn’t much to tune with a linear SVM.
Here we predict whether a given car gets high or low gas mileage based on the Auto dataset.
attach(Auto)
## The following object is masked from package:ggplot2:
##     mpg
dim(Auto)
## [1] 392 9
kable(head(Auto))
mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
---|---|---|---|---|---|---|---|---|
18 | 8 | 307 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
18 | 8 | 318 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
16 | 8 | 304 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
15 | 8 | 429 | 198 | 4341 | 10.0 | 70 | 1 | ford galaxie 500 |
# Create a binary variable that takes on 1 for cars with gas mileage > median
Auto$y <- NA
Auto$y[Auto$mpg > median(Auto$mpg)] <- 1
Auto$y[Auto$mpg <= median(Auto$mpg)] <- 0
Auto$y <- as.factor(Auto$y)
sum(is.na(Auto$y)) # make sure there are no NA's
## [1] 0
Try a linear SVM
set.seed(123)
split <- createDataPartition(y=Auto$y, p=0.7, list=FALSE)
train <- Auto[split,]
test <- Auto[-split,]
# Remove mpg / name features
train <- train[-c(1,9)]
test <- test[-c(1,9)]
# 10 fold cross validation
ctr <- trainControl(method='repeatedcv',
number=10,
repeats=3)
# Recall: here C is the cost of a margin violation, so as C increases the margin tends to get narrower
grid <- data.frame(C=seq(0.01,5,0.5))
svm.fit <- train(y ~., train,
method='svmLinear',
preProc=c('center','scale'),
trControl=ctr,
tuneGrid=grid)
svm.fit
## Support Vector Machines with Linear Kernel
##
## 276 samples
## 7 predictors
## 2 classes: '0', '1'
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (10 fold, repeated 3 times)
##
## Summary of sample sizes: 248, 248, 249, 250, 248, 248, ...
##
## Resampling results across tuning parameters:
##
## C Accuracy Kappa Accuracy SD Kappa SD
## 0.01 0.9 0.8 0.06 0.1
## 0.5 0.9 0.8 0.05 0.1
## 1 0.9 0.8 0.05 0.1
## 2 0.9 0.8 0.05 0.1
## 2 0.9 0.8 0.05 0.1
## 3 0.9 0.8 0.05 0.1
## 3 0.9 0.8 0.05 0.1
## 4 0.9 0.8 0.05 0.1
## 4 0.9 0.8 0.05 0.1
## 5 0.9 0.8 0.05 0.1
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 3.
# Training error rate
confusionMatrix(predict(svm.fit, train), train$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 119 8
## 1 19 130
##
## Accuracy : 0.902
## 95% CI : (0.861, 0.935)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.804
## Mcnemar's Test P-Value : 0.0543
##
## Sensitivity : 0.862
## Specificity : 0.942
## Pos Pred Value : 0.937
## Neg Pred Value : 0.872
## Prevalence : 0.500
## Detection Rate : 0.431
## Detection Prevalence : 0.460
## Balanced Accuracy : 0.902
##
## 'Positive' Class : 0
##
# Testing error rate
yhat <- predict(svm.fit, test)
confusionMatrix(yhat, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 55 3
## 1 3 55
##
## Accuracy : 0.948
## 95% CI : (0.891, 0.981)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.897
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.948
## Specificity : 0.948
## Pos Pred Value : 0.948
## Neg Pred Value : 0.948
## Prevalence : 0.500
## Detection Rate : 0.474
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.948
##
## 'Positive' Class : 0
##
You can’t really ask for much more than ~95% accuracy, but let’s try a few other models to compare. It would also be nice to plot the SVM boundary for two variables at a time; one possible approach is sketched below.
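A sketch of that idea (an assumed approach, not verified in this post): e1071’s plot() method for svm objects accepts a formula selecting two predictors and a slice argument fixing the remaining ones. Since the caret fit is not a plain svm object, we refit with e1071::svm() using the cost chosen above; the slice values below are arbitrary choices for illustration.

# Sketch: view the fitted boundary for weight vs. horsepower, holding the
# other predictors at fixed (arbitrarily chosen) values via 'slice'.
library(e1071)
svm.lin <- svm(y ~ ., data=train, kernel='linear', cost=3)
plot(svm.lin, train, weight ~ horsepower,
     slice=list(cylinders=4, displacement=median(train$displacement),
                acceleration=median(train$acceleration), year=76, origin=1))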
set.seed(123)
# Try a polynomial function
svm.fit <- train(y ~., train,
method='svmPoly',
trControl=ctr)
svm.fit
## Support Vector Machines with Polynomial Kernel
##
## 276 samples
## 7 predictors
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
##
## Summary of sample sizes: 248, 249, 248, 249, 250, 248, ...
##
## Resampling results across tuning parameters:
##
## degree scale C Accuracy Kappa Accuracy SD Kappa SD
## 1 0.001 0.2 0.7 0.4 0.1 0.3
## 1 0.001 0.5 0.8 0.6 0.07 0.1
## 1 0.001 1 0.8 0.6 0.05 0.1
## 1 0.01 0.2 0.9 0.7 0.06 0.1
## 1 0.01 0.5 0.9 0.8 0.06 0.1
## 1 0.01 1 0.9 0.8 0.05 0.1
## 1 0.1 0.2 0.9 0.8 0.05 0.1
## 1 0.1 0.5 0.9 0.8 0.05 0.1
## 1 0.1 1 0.9 0.8 0.05 0.1
## 2 0.001 0.2 0.8 0.6 0.07 0.1
## 2 0.001 0.5 0.8 0.6 0.05 0.1
## 2 0.001 1 0.9 0.7 0.06 0.1
## 2 0.01 0.2 0.9 0.8 0.06 0.1
## 2 0.01 0.5 0.9 0.8 0.05 0.1
## 2 0.01 1 0.9 0.8 0.05 0.1
## 2 0.1 0.2 0.9 0.8 0.05 0.1
## 2 0.1 0.5 0.9 0.8 0.05 0.1
## 2 0.1 1 0.9 0.8 0.05 0.1
## 3 0.001 0.2 0.8 0.6 0.05 0.1
## 3 0.001 0.5 0.9 0.7 0.06 0.1
## 3 0.001 1 0.9 0.7 0.06 0.1
## 3 0.01 0.2 0.9 0.8 0.05 0.1
## 3 0.01 0.5 0.9 0.8 0.05 0.1
## 3 0.01 1 0.9 0.8 0.05 0.1
## 3 0.1 0.2 0.9 0.8 0.05 0.1
## 3 0.1 0.5 0.9 0.8 0.05 0.1
## 3 0.1 1 0.9 0.8 0.05 0.1
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.1 and C = 1.
plot(svm.fit)
# Training error rate
confusionMatrix(predict(svm.fit, train), train$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 122 3
## 1 16 135
##
## Accuracy : 0.931
## 95% CI : (0.895, 0.958)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.862
## Mcnemar's Test P-Value : 0.00591
##
## Sensitivity : 0.884
## Specificity : 0.978
## Pos Pred Value : 0.976
## Neg Pred Value : 0.894
## Prevalence : 0.500
## Detection Rate : 0.442
## Detection Prevalence : 0.453
## Balanced Accuracy : 0.931
##
## 'Positive' Class : 0
##
# Testing error rate
yhat <- predict(svm.fit, test)
confusionMatrix(yhat, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 56 2
## 1 2 56
##
## Accuracy : 0.966
## 95% CI : (0.914, 0.991)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.931
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.966
## Specificity : 0.966
## Pos Pred Value : 0.966
## Neg Pred Value : 0.966
## Prevalence : 0.500
## Detection Rate : 0.483
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.966
##
## 'Positive' Class : 0
##
And there you have it: it looks like a third-order polynomial can slightly improve the fit.