Q1. This problem involves hyperplanes in two dimensions.
Sketch the hyperplane \(1 + 3X_1 - X_2 = 0\). Indicate the set of points for which \(1 + 3X_1 - X_2 > 0\), as well as the set of points for which \(1 + 3X_1 - X_2 < 0\).
On the same plot, sketch the hyperplane \(-2 + X_1 + 2X_2 = 0\). Indicate the set of points for which \(-2 + X_1 + 2X_2 > 0\), as well as the set of points for which \(-2 + X_1 + 2X_2 < 0\).
# blue hyperplane: 1 + 3*X1 - X2 = 0, i.e. X2 = 1 + 3*X1
x1 <- -10:10
x2 <- 1 + 3 * x1
plot(x1, x2, type = "l", col = "blue")
# points below the blue line satisfy 1 + 3*X1 - X2 > 0, points above satisfy < 0
text(c(0), c(-20), "Greater than 0", col = "blue")
text(c(0), c(20), "Less than 0", col = "blue")
# red hyperplane: -2 + X1 + 2*X2 = 0, i.e. X2 = 1 - X1/2
lines(x1, 1 - x1/2, col = "red")
# points below the red line satisfy -2 + X1 + 2*X2 < 0, points above satisfy > 0
text(c(0), c(-15), "Less than 0", col = "red")
text(c(0), c(15), "Greater than 0", col = "red")
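As a quick sanity check, we can evaluate each expression at a convenient point such as the origin to see which side of each line is which:
# 1 + 3*0 - 0 = 1 > 0, so the region below the blue line is its "greater than 0" side;
# -2 + 0 + 2*0 = -2 < 0, so the region below the red line is its "less than 0" side
c(blue = 1 + 3 * 0 - 0, red = -2 + 0 + 2 * 0)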
Q2. We have seen that in \(p = 2\) dimensions, a linear boundary takes the form \(\beta_0 + \beta_1X_1 + \beta_2X_2 = 0\). We now investigate a non-linear decision boundary.
plot(NA, NA, type = "n", xlim = c(-4, 2), ylim = c(-1, 5), asp = 1, xlab = "X1", ylab = "X2")
symbols(c(-1), c(2), circles = c(2), add = TRUE, inches = FALSE)
plot(NA, NA, type = "n", xlim = c(-4, 2), ylim = c(-1, 5), asp = 1, xlab = "X1", ylab = "X2")
symbols(c(-1), c(2), circles = c(2), add = TRUE, inches = FALSE)
text(c(-1), c(2), "< 4")
text(c(-4), c(2), "> 4")
It suffices to replace \(X_1\) and \(X_2\) by the coordinates of each point in the equation and to check whether the result is less than or greater than 4. For \((0,0)\), we get \(5 > 4\) (blue class); for \((-1,1)\), \(1 < 4\) (red class); for \((2,2)\), \(9 > 4\) (blue class); for \((3,8)\), \(52 > 4\) (blue class).
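A quick computation in R confirms these values (using throwaway vectors p1 and p2 for the coordinates):
# evaluate (1 + X1)^2 + (2 - X2)^2 for the four points
p1 <- c(0, -1, 2, 3)
p2 <- c(0, 1, 2, 8)
(1 + p1)^2 + (2 - p2)^2
## [1]  5  1  9 52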
plot(c(0, -1, 2, 3), c(0, 1, 2, 8), col = c("blue", "red", "blue", "blue"),
type = "p", asp = 1, xlab = "X1", ylab = "X2")
symbols(c(-1), c(2), circles = c(2), add = TRUE, inches = FALSE)
This is clear once we expand the equation of the decision boundary \[(1 + X_1)^2 + (2 - X_2)^2 = 4\] into \[X_1^2 + X_2^2 + 2X_1 - 4X_2 + 1 = 0,\] which is linear in terms of \(X_1\), \(X_1^2\), \(X_2\) and \(X_2^2\).
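Indeed, expanding the left-hand side gives \[(1 + X_1)^2 + (2 - X_2)^2 - 4 = (1 + 2X_1 + X_1^2) + (4 - 4X_2 + X_2^2) - 4 = X_1^2 + X_2^2 + 2X_1 - 4X_2 + 1.\]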
Q3. Here we explore the maximal margin classifier on a toy data set.
Sketch the observations.
x1 <- c(3, 2, 4, 1, 2, 4, 4)
x2 <- c(4, 2, 4, 4, 1, 3, 1)
colors <- c("red", "red", "red", "red", "blue", "blue", "blue")
plot(x1, x2, col = colors, xlim = c(0, 5), ylim = c(0, 5))
As shown in the plot, the optimal separating hyperplane has to pass between the observations \((2,1)\) and \((2,2)\), and between the observations \((4,3)\) and \((4,4)\). So it is the line through the midpoints \((2,1.5)\) and \((4,3.5)\), whose equation is \[X_1 - X_2 - 0.5 = 0.\]
plot(x1, x2, col = colors, xlim = c(0, 5), ylim = c(0, 5))
abline(-0.5, 1)
The classification rule is “Classify to Red if \(X_1 - X_2 -0.5 < 0\), and classify to Blue otherwise.”
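A one-line check shows that this rule reproduces the colors assigned above:
# applying the rule to the seven observations gives the colors vector defined above
ifelse(x1 - x2 - 0.5 < 0, "red", "blue")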
plot(x1, x2, col = colors, xlim = c(0, 5), ylim = c(0, 5))
abline(-0.5, 1)
abline(-1, 1, lty = 2)
abline(0, 1, lty = 2)
The margin here is the perpendicular distance from the hyperplane to the nearest observations, which equals \(\sqrt{2}/4 \approx 0.35\).
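Indeed, the distance from a point \((a, b)\) to the line \(X_1 - X_2 - 0.5 = 0\) is \(|a - b - 0.5| / \sqrt{1^2 + (-1)^2}\); for the nearest observation \((2,2)\) this gives \(0.5/\sqrt{2} = \sqrt{2}/4\).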
The support vectors are the points \((2,1)\), \((2,2)\), \((4,3)\) and \((4,4)\).
By examining the plot, it is clear that moving the seventh observation, \((4,1)\), would not change the maximal margin hyperplane (as long as it is not moved across the margin), because it is not a support vector.
For example, the hyperplane whose equation is \(X_1 - X_2 - 0.3 = 0\) separates the two classes but is not the optimal separating hyperplane.
plot(x1, x2, col = colors, xlim = c(0, 5), ylim = c(0, 5))
abline(-0.3, 1)
plot(x1, x2, col = colors, xlim = c(0, 5), ylim = c(0, 5))
points(c(3), c(1), col = c("red"))
When the red point \((3,1)\) is added to the plot, the two classes are clearly no longer separable by a hyperplane.
Q4. Generate a simulated two-class data set with 100 observations and two features in which there is a visible but non-linear separation between the two classes. Show that in this setting, a support vector machine with a polynomial kernel (with degree greater than 1) or a radial kernel will outperform a support vector classifier on the training data. Which technique performs best on test data? Make plots and report training and test error rates in order to back up your assertions.
We begin by creating a data set with non-linear separation between the two classes.
library(e1071)
set.seed(1)
x <- rnorm(100)
y <- 4 * x^2 + 1 + rnorm(100)  # quadratic relationship between x and y
class <- sample(100, 50)       # indices of the first class
# shift the two classes apart to create a visible non-linear separation
y[class] <- y[class] + 3
y[-class] <- y[-class] - 3
plot(x[class], y[class], col = "red", xlab = "X", ylab = "Y", ylim = c(-6, 30))
points(x[-class], y[-class], col = "blue")
Now, we fit a support vector classifier on the training data.
z <- rep(-1, 100)
z[class] <- 1
data <- data.frame(x = x, y = y, z = as.factor(z))
train <- sample(100, 50)
data.train <- data[train, ]
data.test <- data[-train, ]
svm.linear <- svm(z ~ ., data = data.train, kernel = "linear", cost = 10)
plot(svm.linear, data.train)
table(predict = predict(svm.linear, data.train), truth = data.train$z)
## truth
## predict -1 1
## -1 22 0
## 1 6 22
The support vector classifier makes 6 errors on the training data.
Next, we fit a support vector machine with a polynomial kernel.
svm.poly <- svm(z ~ ., data = data.train, kernel = "polynomial", cost = 10)
plot(svm.poly, data.train)
table(predict = predict(svm.poly, data.train), truth = data.train$z)
## truth
## predict -1 1
## -1 19 0
## 1 9 22
The support vector machine with a polynomial kernel of degree 3 makes 9 errors on the training data.
Finally, we fit a support vector machine with a radial kernel and a gamma of 1.
svm.radial <- svm(z ~ ., data = data.train, kernel = "radial", gamma = 1, cost = 10)
plot(svm.radial, data.train)
table(predict = predict(svm.radial, data.train), truth = data.train$z)
## truth
## predict -1 1
## -1 28 0
## 1 0 22
The support vector machine with a radial kernel makes no errors on the training data.
Now, we check how these models fare when applied to the test data.
plot(svm.linear, data.test)
table(predict = predict(svm.linear, data.test), truth = data.test$z)
## truth
## predict -1 1
## -1 18 2
## 1 4 26
plot(svm.poly, data.test)
table(predict = predict(svm.poly, data.test), truth = data.test$z)
## truth
## predict -1 1
## -1 14 1
## 1 8 27
plot(svm.radial, data.test)
table(predict = predict(svm.radial, data.test), truth = data.test$z)
## truth
## predict -1 1
## -1 22 1
## 1 0 27
We may see that on the test data the linear, polynomial and radial support vector machines misclassify respectively 6, 9 and 1 observations. So the radial kernel gives the best model in this setting.
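These error counts can be collected in one place with a short sketch that reuses the three fits above:
# test misclassification rates for the three kernels
# (should give 0.12, 0.18 and 0.02 on this split, i.e. 6, 9 and 1 errors out of 50)
sapply(list(linear = svm.linear, polynomial = svm.poly, radial = svm.radial),
       function(fit) mean(predict(fit, data.test) != data.test$z))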
Q5. We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.
set.seed(1)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- 1 * (x1^2 - x2^2 > 0)
plot(x1, x2, xlab = "X1", ylab = "X2", col = (4 - y), pch = (3 - y))
logit.fit <- glm(y ~ x1 + x2, family = "binomial")
summary(logit.fit)
##
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.179 -1.139 -1.112 1.206 1.257
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.087260 0.089579 -0.974 0.330
## x1 0.196199 0.316864 0.619 0.536
## x2 -0.002854 0.305712 -0.009 0.993
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 692.18 on 499 degrees of freedom
## Residual deviance: 691.79 on 497 degrees of freedom
## AIC: 697.79
##
## Number of Fisher Scoring iterations: 3
None of the variables are statistically significant.
data <- data.frame(x1 = x1, x2 = x2, y = y)
probs <- predict(logit.fit, data, type = "response")
preds <- rep(0, 500)
preds[probs > 0.47] <- 1
plot(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = (4 - 1), pch = (3 - 1), xlab = "X1", ylab = "X2")
points(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = (4 - 0), pch = (3 - 0))
The decision boundary is obviously linear.
logitnl.fit <- glm(y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2), family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logitnl.fit)
##
## Call:
## glm(formula = y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2), family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.240e-04 -2.000e-08 -2.000e-08 2.000e-08 1.163e-03
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -102.2 4302.0 -0.024 0.981
## poly(x1, 2)1 2715.3 141109.5 0.019 0.985
## poly(x1, 2)2 27218.5 842987.2 0.032 0.974
## poly(x2, 2)1 -279.7 97160.4 -0.003 0.998
## poly(x2, 2)2 -28693.0 875451.3 -0.033 0.974
## I(x1 * x2) -206.4 41802.8 -0.005 0.996
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.9218e+02 on 499 degrees of freedom
## Residual deviance: 3.5810e-06 on 494 degrees of freedom
## AIC: 12
##
## Number of Fisher Scoring iterations: 25
Here again, none of the variables are statistically significant.
probs <- predict(logitnl.fit, data, type = "response")
preds <- rep(0, 500)
preds[probs > 0.47] <- 1
plot(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = (4 - 1), pch = (3 - 1), xlab = "X1", ylab = "X2")
points(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = (4 - 0), pch = (3 - 0))
The non-linear decision boundary is remarkably similar to the true decision boundary.
data$y <- as.factor(data$y)
svm.fit <- svm(y ~ x1 + x2, data, kernel = "linear", cost = 0.01)
preds <- predict(svm.fit, data)
plot(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = (4 - 0), pch = (3 - 0), xlab = "X1", ylab = "X2")
points(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = (4 - 1), pch = (3 - 1))
This support vector classifier (even with a low cost) assigns all of the points to a single class.
data$y <- as.factor(data$y)
svmnl.fit <- svm(y ~ x1 + x2, data, kernel = "radial", gamma = 1)
preds <- predict(svmnl.fit, data)
plot(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = (4 - 0), pch = (3 - 0), xlab = "X1", ylab = "X2")
points(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = (4 - 1), pch = (3 - 1))
Here again, the non-linear decision boundary is remarkably similar to the true decision boundary.
We may conclude that both an SVM with a non-linear kernel and logistic regression with non-linear transformations of the features are very powerful for finding non-linear decision boundaries, whereas an SVM with a linear kernel and logistic regression without any transformation fail to do so. One argument in favor of the SVM is that logistic regression requires manually choosing the right transformations and interaction terms, while with the SVM we only need to tune the kernel parameter gamma.
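As a rough numerical summary (a sketch reusing the objects above, with the predicted probabilities thresholded at 0.5 rather than 0.47), we could compare the training error rates of the four fits:
# training error rates: linear and non-linear logistic regression,
# linear and radial SVM (values not shown)
c(logit.linear = mean((predict(logit.fit, data, type = "response") > 0.5) != (y == 1)),
  logit.nonlinear = mean((predict(logitnl.fit, data, type = "response") > 0.5) != (y == 1)),
  svm.linear = mean(predict(svm.fit, data) != data$y),
  svm.radial = mean(predict(svmnl.fit, data) != data$y))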
Q6. At the end of Section 9.6.1, it is claimed that in the case of data that is just barely linearly separable, a support vector classifier with a small value of “cost” that misclassifies a couple of training observations may perform better on test data than one with a huge value of “cost” that does not misclassify any training observations. You will now investigate that claim.
We randomly generate 1000 points scattered around the line \(x = y\) with a wide margin, plus 100 noisy points close to the line \(5x - 4y - 50 = 0\). These noisy points make the classes barely separable and shift the maximal margin classifier.
set.seed(1)
x.one <- runif(500, 0, 90)
y.one <- runif(500, x.one + 10, 100)
x.one.noise <- runif(50, 20, 80)
y.one.noise <- 5/4 * (x.one.noise - 10) + 0.1
x.zero <- runif(500, 10, 100)
y.zero <- runif(500, 0, x.zero - 10)
x.zero.noise <- runif(50, 20, 80)
y.zero.noise <- 5/4 * (x.zero.noise - 10) - 0.1
class.one <- seq(1, 550)
x <- c(x.one, x.one.noise, x.zero, x.zero.noise)
y <- c(y.one, y.one.noise, y.zero, y.zero.noise)
plot(x[class.one], y[class.one], col = "blue", pch = "+", ylim = c(0, 100))
points(x[-class.one], y[-class.one], col = "red", pch = 4)
set.seed(2)
z <- rep(0, 1100)
z[class.one] <- 1
data <- data.frame(x = x, y = y, z = as.factor(z))
tune.out <- tune(svm, z ~ ., data = data, kernel = "linear", ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100, 1000, 10000)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10000
##
## - best performance: 0
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-02 0.06272727 0.03043909
## 2 1e-01 0.04818182 0.02876395
## 3 1e+00 0.04818182 0.02876395
## 4 5e+00 0.04818182 0.02778971
## 5 1e+01 0.04727273 0.02770698
## 6 1e+02 0.04727273 0.02564145
## 7 1e+03 0.04090909 0.02580210
## 8 1e+04 0.00000000 0.00000000
A cost of 10000 seems the best choice of parameter.
data.frame(cost = tune.out$performances$cost, misclass = tune.out$performances$error * 1100)
## cost misclass
## 1 1e-02 69
## 2 1e-01 53
## 3 1e+00 53
## 4 5e+00 53
## 5 1e+01 52
## 6 1e+02 52
## 7 1e+03 45
## 8 1e+04 0
With a cost of 10000, no points are misclassified under cross-validation.
x.test <- runif(1000, 0, 100)
class.one <- sample(1000, 500)
y.test <- rep(NA, 1000)
# Set y > x for class.one
for (i in class.one) {
y.test[i] <- runif(1, x.test[i], 100)
}
# set y < x for class.zero
for (i in setdiff(1:1000, class.one)) {
y.test[i] <- runif(1, 0, x.test[i])
}
plot(x.test[class.one], y.test[class.one], col = "blue", pch = "+")
points(x.test[-class.one], y.test[-class.one], col = "red", pch = 4)
set.seed(3)
z.test <- rep(0, 1000)
z.test[class.one] <- 1
data.test <- data.frame(x = x.test, y = y.test, z = as.factor(z.test))
costs <- c(0.01, 0.1, 1, 5, 10, 100, 1000, 10000)
test.err <- rep(NA, length(costs))
for (i in 1:length(costs)) {
svm.fit <- svm(z ~ ., data = data, kernel = "linear", cost = costs[i])
pred <- predict(svm.fit, data.test)
test.err[i] <- sum(pred != data.test$z)
}
data.frame(cost = costs, misclass = test.err)
## cost misclass
## 1 1e-02 50
## 2 1e-01 10
## 3 1e+00 0
## 4 5e+00 0
## 5 1e+01 0
## 6 1e+02 172
## 7 1e+03 188
## 8 1e+04 188
Costs of 1, 5 or 10 perform best on the test observations; these values are much smaller than the cost of 10000 selected by cross-validation on the training data.
We again see an overfitting phenomenon with the linear kernel: a large cost tries to classify the noisy points correctly and hence overfits the training data, whereas a small cost makes a few errors on those noisy training points and performs better on the test data.
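To see the overfitting directly, one could also compute the number of training misclassifications for each cost (a sketch; results not shown):
# refit the classifier at each cost and count training errors
train.err <- sapply(costs, function(cost.val) {
  fit <- svm(z ~ ., data = data, kernel = "linear", cost = cost.val)
  sum(predict(fit, data) != data$z)
})
data.frame(cost = costs, train.misclass = train.err)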
Q7. In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the “Auto” data set.
library(ISLR)
# binary response: 1 if the car's gas mileage is above the median, 0 otherwise
var <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)
Auto$mpglevel <- as.factor(var)
set.seed(1)
tune.out <- tune(svm, mpglevel ~ ., data = Auto, kernel = "linear", ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100, 1000)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.01275641
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-02 0.07403846 0.05471525
## 2 1e-01 0.03826923 0.05148114
## 3 1e+00 0.01275641 0.01344780
## 4 5e+00 0.01782051 0.01229997
## 5 1e+01 0.02038462 0.01074682
## 6 1e+02 0.03820513 0.01773427
## 7 1e+03 0.03820513 0.01773427
A cost of 1 seems to perform best.
set.seed(1)
tune.out <- tune(svm, mpglevel ~ ., data = Auto, kernel = "polynomial", ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100), degree = c(2, 3, 4)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost degree
## 100 2
##
## - best performance: 0.3013462
##
## - Detailed performance results:
## cost degree error dispersion
## 1 1e-02 2 0.5611538 0.04344202
## 2 1e-01 2 0.5611538 0.04344202
## 3 1e+00 2 0.5611538 0.04344202
## 4 5e+00 2 0.5611538 0.04344202
## 5 1e+01 2 0.5382051 0.05829238
## 6 1e+02 2 0.3013462 0.09040277
## 7 1e-02 3 0.5611538 0.04344202
## 8 1e-01 3 0.5611538 0.04344202
## 9 1e+00 3 0.5611538 0.04344202
## 10 5e+00 3 0.5611538 0.04344202
## 11 1e+01 3 0.5611538 0.04344202
## 12 1e+02 3 0.3322436 0.11140578
## 13 1e-02 4 0.5611538 0.04344202
## 14 1e-01 4 0.5611538 0.04344202
## 15 1e+00 4 0.5611538 0.04344202
## 16 5e+00 4 0.5611538 0.04344202
## 17 1e+01 4 0.5611538 0.04344202
## 18 1e+02 4 0.5611538 0.04344202
For a polynomial kernel, the lowest cross-validation error is obtained for a degree of 2 and a cost of 100.
set.seed(1)
tune.out <- tune(svm, mpglevel ~ ., data = Auto, kernel = "radial", ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100), gamma = c(0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 100 0.01
##
## - best performance: 0.01532051
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-02 1e-02 0.56115385 0.04344202
## 2 1e-01 1e-02 0.09185897 0.03862507
## 3 1e+00 1e-02 0.07147436 0.05103685
## 4 5e+00 1e-02 0.04326923 0.04975032
## 5 1e+01 1e-02 0.02551282 0.03812986
## 6 1e+02 1e-02 0.01532051 0.01788871
## 7 1e-02 1e-01 0.19153846 0.07612945
## 8 1e-01 1e-01 0.07916667 0.05201159
## 9 1e+00 1e-01 0.05608974 0.05092939
## 10 5e+00 1e-01 0.03064103 0.02637448
## 11 1e+01 1e-01 0.02551282 0.02076457
## 12 1e+02 1e-01 0.02807692 0.01458261
## 13 1e-02 1e+00 0.56115385 0.04344202
## 14 1e-01 1e+00 0.56115385 0.04344202
## 15 1e+00 1e+00 0.06634615 0.06187383
## 16 5e+00 1e+00 0.06128205 0.06186124
## 17 1e+01 1e+00 0.06128205 0.06186124
## 18 1e+02 1e+00 0.06128205 0.06186124
## 19 1e-02 5e+00 0.56115385 0.04344202
## 20 1e-01 5e+00 0.56115385 0.04344202
## 21 1e+00 5e+00 0.49224359 0.03806832
## 22 5e+00 5e+00 0.48967949 0.03738577
## 23 1e+01 5e+00 0.48967949 0.03738577
## 24 1e+02 5e+00 0.48967949 0.03738577
## 25 1e-02 1e+01 0.56115385 0.04344202
## 26 1e-01 1e+01 0.56115385 0.04344202
## 27 1e+00 1e+01 0.51775641 0.04471079
## 28 5e+00 1e+01 0.51012821 0.03817175
## 29 1e+01 1e+01 0.51012821 0.03817175
## 30 1e+02 1e+01 0.51012821 0.03817175
## 31 1e-02 1e+02 0.56115385 0.04344202
## 32 1e-01 1e+02 0.56115385 0.04344202
## 33 1e+00 1e+02 0.56115385 0.04344202
## 34 5e+00 1e+02 0.56115385 0.04344202
## 35 1e+01 1e+02 0.56115385 0.04344202
## 36 1e+02 1e+02 0.56115385 0.04344202
For a radial kernel, the lowest cross-validation error is obtained for a gamma of 0.01 and a cost of 100.
svm.linear <- svm(mpglevel ~ ., data = Auto, kernel = "linear", cost = 1)
svm.poly <- svm(mpglevel ~ ., data = Auto, kernel = "polynomial", cost = 100, degree = 2)
svm.radial <- svm(mpglevel ~ ., data = Auto, kernel = "radial", cost = 100, gamma = 0.01)
plotpairs <- function(fit) {
  # plot the fitted SVM against mpg and each of the remaining predictors
  for (name in names(Auto)[!(names(Auto) %in% c("mpg", "mpglevel", "name"))]) {
    plot(fit, Auto, as.formula(paste("mpg~", name, sep = "")))
  }
}
plotpairs(svm.linear)
plotpairs(svm.poly)
plotpairs(svm.radial)
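To complement the plots, we could also compute the training error rates of the three tuned fits (a quick sketch; values not shown):
# training error rates on the full Auto data for the three kernels
sapply(list(linear = svm.linear, polynomial = svm.poly, radial = svm.radial),
       function(fit) mean(predict(fit, Auto) != Auto$mpglevel))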
Q8. This problem involves the “OJ” data set which is part of the ISLR package.
set.seed(1)
train <- sample(nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test <- OJ[-train, ]
svm.linear <- svm(Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.01)
summary(svm.linear)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "linear",
## cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
## gamma: 0.05555556
##
## Number of Support Vectors: 432
##
## ( 215 217 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
The support vector classifier creates 432 support vectors out of the 800 training points: 215 belong to level CH and the remaining 217 to level MM.
train.pred <- predict(svm.linear, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 439 55
## MM 78 228
(78 + 55) / (439 + 228 + 78 + 55)
## [1] 0.16625
test.pred <- predict(svm.linear, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 141 18
## MM 31 80
(31 + 18) / (141 + 80 + 31 + 18)
## [1] 0.1814815
The training error rate is about 16.6% and the test error rate is about 18.1%.
set.seed(2)
tune.out <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "linear", ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.1
##
## - best performance: 0.1625
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01000000 0.17125 0.05172376
## 2 0.01778279 0.16500 0.05197489
## 3 0.03162278 0.16625 0.04604120
## 4 0.05623413 0.16500 0.04594683
## 5 0.10000000 0.16250 0.04787136
## 6 0.17782794 0.16250 0.04249183
## 7 0.31622777 0.16875 0.04379958
## 8 0.56234133 0.16625 0.03998698
## 9 1.00000000 0.16500 0.03670453
## 10 1.77827941 0.16625 0.03682259
## 11 3.16227766 0.16500 0.03717451
## 12 5.62341325 0.16500 0.03525699
## 13 10.00000000 0.16750 0.03917553
We may see that the optimal cost is 0.1.
svm.linear <- svm(Purchase ~ ., kernel = "linear", data = OJ.train, cost = tune.out$best.parameter$cost)
train.pred <- predict(svm.linear, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 438 56
## MM 71 235
(71 + 56) / (438 + 235 + 71 + 56)
## [1] 0.15875
test.pred <- predict(svm.linear, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 140 19
## MM 32 79
(32 + 19) / (140 + 79 + 32 + 19)
## [1] 0.1888889
We may see that, with the best cost, the training error rate is now about 15.9% and the test error rate is about 18.9%.
svm.radial <- svm(Purchase ~ ., kernel = "radial", data = OJ.train)
summary(svm.radial)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.05555556
##
## Number of Support Vectors: 379
##
## ( 188 191 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
train.pred <- predict(svm.radial, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 455 39
## MM 77 229
(77 + 39) / (455 + 229 + 77 + 39)
## [1] 0.145
test.pred <- predict(svm.radial, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 141 18
## MM 28 83
(28 + 18) / (141 + 83 + 28 + 18)
## [1] 0.1703704
The radial kernel with the default gamma creates 379 support vectors, of which 188 belong to level CH and the remaining 191 to level MM. The classifier has a training error of 14.5% and a test error of about 17%, which is a slight improvement over the linear kernel. We now use cross-validation to find the optimal cost.
set.seed(2)
tune.out <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "radial", ranges = list(cost = 10^seq(-2,
1, by = 0.25)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.16625
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01000000 0.38250 0.04533824
## 2 0.01778279 0.38250 0.04533824
## 3 0.03162278 0.37500 0.04894725
## 4 0.05623413 0.21500 0.05886661
## 5 0.10000000 0.17875 0.04860913
## 6 0.17782794 0.17875 0.05497790
## 7 0.31622777 0.17875 0.05981743
## 8 0.56234133 0.17250 0.05458174
## 9 1.00000000 0.16625 0.05001736
## 10 1.77827941 0.16875 0.05008673
## 11 3.16227766 0.17500 0.04787136
## 12 5.62341325 0.18000 0.05244044
## 13 10.00000000 0.18250 0.05596378
svm.radial <- svm(Purchase ~ ., kernel = "radial", data = OJ.train, cost = tune.out$best.parameter$cost)
summary(svm.radial)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "radial",
## cost = tune.out$best.parameter$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.05555556
##
## Number of Support Vectors: 379
##
## ( 188 191 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
train.pred <- predict(svm.radial, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 455 39
## MM 77 229
(77 + 39) / (455 + 229 + 77 + 39)
## [1] 0.145
test.pred <- predict(svm.radial, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 141 18
## MM 28 83
(28 + 18) / (141 + 83 + 28 + 18)
## [1] 0.1703704
Tuning does not reduce the training and test error rates, since the default cost of 1 was already optimal.
svm.poly <- svm(Purchase ~ ., kernel = "polynomial", data = OJ.train, degree = 2)
summary(svm.poly)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "polynomial",
## degree = 2)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 2
## gamma: 0.05555556
## coef.0: 0
##
## Number of Support Vectors: 454
##
## ( 224 230 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
train.pred <- predict(svm.poly, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 461 33
## MM 105 201
(105 + 33) / (461 + 201 + 105 + 33)
## [1] 0.1725
test.pred <- predict(svm.poly, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 149 10
## MM 41 70
(41 + 10) / (149 + 70 + 41 + 10)
## [1] 0.1888889
The polynomial kernel with the default gamma creates 454 support vectors, of which 224 belong to level CH and the remaining 230 to level MM. The classifier has a training error of 17.2% and a test error of about 18.9%, which is no improvement over the linear kernel. We now use cross-validation to find the optimal cost.
set.seed(2)
tune.out <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "polynomial", degree = 2, ranges = list(cost = 10^seq(-2,
1, by = 0.25)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10
##
## - best performance: 0.18125
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01000000 0.38250 0.04533824
## 2 0.01778279 0.36750 0.04972145
## 3 0.03162278 0.36500 0.05458174
## 4 0.05623413 0.33375 0.05070681
## 5 0.10000000 0.32500 0.04677072
## 6 0.17782794 0.25875 0.05952649
## 7 0.31622777 0.21250 0.06123724
## 8 0.56234133 0.21250 0.05743354
## 9 1.00000000 0.19750 0.06687468
## 10 1.77827941 0.19375 0.05376453
## 11 3.16227766 0.19625 0.05653477
## 12 5.62341325 0.18375 0.05434266
## 13 10.00000000 0.18125 0.05245699
svm.poly <- svm(Purchase ~ ., kernel = "polynomial", degree = 2, data = OJ.train, cost = tune.out$best.parameter$cost)
summary(svm.poly)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "polynomial",
## degree = 2, cost = tune.out$best.parameter$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 10
## degree: 2
## gamma: 0.05555556
## coef.0: 0
##
## Number of Support Vectors: 342
##
## ( 170 172 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
train.pred <- predict(svm.poly, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 450 44
## MM 72 234
(72 + 44) / (450 + 234 + 72 + 44)
## [1] 0.145
test.pred <- predict(svm.poly, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 140 19
## MM 31 80
(31 + 19) / (140 + 80 + 31 + 19)
## [1] 0.1851852
Tuning reduces both the training and test error rates, to 14.5% and about 18.5% respectively.
Overall, the radial kernel seems to produce the lowest misclassification error on both the training and test data.
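As a final check, the three test error rates can be gathered together (a small sketch reusing the tuned fits above; it should reproduce the rates reported earlier, about 18.9%, 17.0% and 18.5%):
# test error rates on OJ.test for the tuned linear, radial and polynomial fits
sapply(list(linear = svm.linear, radial = svm.radial, polynomial = svm.poly),
       function(fit) mean(predict(fit, OJ.test) != OJ.test$Purchase))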