library(e1071)
set.seed(1)
x = rnorm(100)
y = 4 * x^2 + 1 + rnorm(100)
class = sample(100, 50)
y[class] = y[class] + 3
y[-class] = y[-class] - 3
plot(x[class], y[class], col = "red", xlab = "X", ylab = "Y", ylim = c(-6, 30))
points(x[-class], y[-class], col = "blue")
z = rep(-1, 100)
z[class] = 1
data = data.frame(x = x, y = y, z = as.factor(z))
train = sample(100, 50)
data.train = data[train, ]
data.test = data[-train, ]
svm.linear = svm(z ~ ., data = data.train, kernel = "linear", cost = 10)
plot(svm.linear, data.train)
table(predict = predict(svm.linear, data.train), truth = data.train$z)
## truth
## predict -1 1
## -1 22 0
## 1 6 22
The support vector classifier makes 6 errors on the training data.
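The six errors are the off-diagonal entries of the table above; the count can also be obtained directly by comparing predictions with the true labels:
# Number of misclassified training observations
sum(predict(svm.linear, data.train) != data.train$z)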
Next, we fit a support vector machine with a polynomial kernel.
svm.poly = svm(z ~ ., data = data.train, kernel = "polynomial", cost = 10)
plot(svm.poly, data.train)
table(predict = predict(svm.poly, data.train), truth = data.train$z)
## truth
## predict -1 1
## -1 19 0
## 1 9 22
The support vector machine with a polynomial kernel of degree 3 makes 9 errors on the training data.
Finally, we fit a support vector machine with a radial kernel and a gamma of 1.
svm.radial = svm(z ~ ., data = data.train, kernel = "radial", gamma = 1, cost = 10)
plot(svm.radial, data.train)
table(predict = predict(svm.radial, data.train), truth = data.train$z)
## truth
## predict -1 1
## -1 28 0
## 1 0 22
The support vector machine with a radial kernel makes no errors on the training data.
Now, we check how these models fare when applied to the test data.
plot(svm.linear, data.test)
table(predict = predict(svm.linear, data.test), truth = data.test$z)
## truth
## predict -1 1
## -1 18 2
## 1 4 26
plot(svm.poly, data.test)
table(predict = predict(svm.poly, data.test), truth = data.test$z)
## truth
## predict -1 1
## -1 14 1
## 1 8 27
plot(svm.radial, data.test)
table(predict = predict(svm.radial, data.test), truth = data.test$z)
## truth
## predict -1 1
## -1 22 1
## 1 0 27
We see that the linear, polynomial, and radial support vector machines misclassify 6, 9, and 1 test observations, respectively. The radial kernel is therefore the best model in this setting.
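The three test misclassification counts can also be collected in one pass from the fitted objects above; a quick sketch:
# Test misclassification counts for the three kernels fit above
fits = list(linear = svm.linear, polynomial = svm.poly, radial = svm.radial)
sapply(fits, function(fit) sum(predict(fit, data.test) != data.test$z))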
set.seed(1)
x1 = runif(500) - 0.5
x2 = runif(500) - 0.5
y = 1 * (x1^2 - x2^2 > 0)
plot(x1, x2, col = (4 - y))
logit.fit = glm(y ~ x1 + x2, family = "binomial")
summary(logit.fit)
##
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.179 -1.139 -1.112 1.206 1.257
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.087260 0.089579 -0.974 0.330
## x1 0.196199 0.316864 0.619 0.536
## x2 -0.002854 0.305712 -0.009 0.993
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 692.18 on 499 degrees of freedom
## Residual deviance: 691.79 on 497 degrees of freedom
## AIC: 697.79
##
## Number of Fisher Scoring iterations: 3
None of the variables are statistically significant.
data = data.frame(x1 = x1, x2 = x2, y = y)
probs = predict(logit.fit, data, type = "response")
preds = rep(0, 500)
preds[probs > 0.47] = 1
plot(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = (4 - 1), pch = (3 - 1), xlab = "X1", ylab = "X2")
points(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = (4 - 0), pch = (3 - 0))
As expected from a logistic model with only linear terms, the decision boundary is linear.
logitnl.fit <- glm(y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2), family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logitnl.fit)
##
## Call:
## glm(formula = y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2), family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.240e-04 -2.000e-08 -2.000e-08 2.000e-08 1.163e-03
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -102.2 4302.0 -0.024 0.981
## poly(x1, 2)1 2715.3 141109.5 0.019 0.985
## poly(x1, 2)2 27218.5 842987.2 0.032 0.974
## poly(x2, 2)1 -279.7 97160.4 -0.003 0.998
## poly(x2, 2)2 -28693.0 875451.3 -0.033 0.974
## I(x1 * x2) -206.4 41802.8 -0.005 0.996
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.9218e+02 on 499 degrees of freedom
## Residual deviance: 3.5810e-06 on 494 degrees of freedom
## AIC: 12
##
## Number of Fisher Scoring iterations: 25
probs <- predict(logitnl.fit, data, type = "response")
preds <- rep(0, 500)
preds[probs > 0.47] <- 1
plot(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = (4 - 1), pch = (3 - 1), xlab = "X1", ylab = "X2")
points(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = (4 - 0), pch = (3 - 0))
The non-linear decision boundary is very similar to the true decision boundary.
data$y <- as.factor(data$y)
svm.fit <- svm(y ~ x1 + x2, data, kernel = "linear", cost = 0.01)
preds <- predict(svm.fit, data)
plot(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = (4 - 0), pch = (3 - 0), xlab = "X1", ylab = "X2")
points(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = (4 - 1), pch = (3 - 1))
This support vector classifier assigns every point to a single class.
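Tabulating the predictions confirms this:
# Tabulate predicted classes to verify the single-class prediction noted above
table(preds)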
svmnl.fit <- svm(y ~ x1 + x2, data, kernel = "radial", gamma = 1)
preds <- predict(svmnl.fit, data)
plot(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = (4 - 0), pch = (3 - 0), xlab = "X1", ylab = "X2")
points(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = (4 - 1), pch = (3 - 1))
Again, the non-linear decision boundary is very similar to the true decision boundary.
We may conclude that an SVM with a non-linear kernel and logistic regression with non-linear (polynomial and interaction) terms are both very powerful for finding non-linear decision boundaries, while an SVM with a linear kernel and logistic regression without such terms perform poorly on this task. One argument in favor of the SVM is that logistic regression requires some manual work to find the right non-linear terms, whereas for the SVM we only need to tune gamma.
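As a rough numerical summary of this comparison, the training error rates of the four fits can be computed directly (a sketch; a 0.5 cutoff is used here for the logistic models instead of the 0.47 above):
# Training error rates of the four approaches (sketch, objects as fit above)
err <- function(pred) mean(as.character(pred) != as.character(data$y))
c(logit.linear = err(ifelse(predict(logit.fit, data, type = "response") > 0.5, 1, 0)),
  logit.nonlinear = err(ifelse(predict(logitnl.fit, data, type = "response") > 0.5, 1, 0)),
  svm.linear = err(predict(svm.fit, data)),
  svm.radial = err(predict(svmnl.fit, data)))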
set.seed(1)
x.one <- runif(500, 0, 90)
y.one <- runif(500, x.one + 10, 100)
x.one.noise <- runif(50, 20, 80)
y.one.noise <- 5/4 * (x.one.noise - 10) + 0.1
x.zero <- runif(500, 10, 100)
y.zero <- runif(500, 0, x.zero - 10)
x.zero.noise <- runif(50, 20, 80)
y.zero.noise <- 5/4 * (x.zero.noise - 10) - 0.1
class.one <- seq(1, 550)
x <- c(x.one, x.one.noise, x.zero, x.zero.noise)
y <- c(y.one, y.one.noise, y.zero, y.zero.noise)
plot(x[class.one], y[class.one], col = "blue", pch = "+", ylim = c(0, 100))
points(x[-class.one], y[-class.one], col = "red", pch = 4)
set.seed(2)
z <- rep(0, 1100)
z[class.one] <- 1
data <- data.frame(x = x, y = y, z = as.factor(z))
tune.out <- tune(svm, z ~ ., data = data, kernel = "linear", ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100, 1000, 10000)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10000
##
## - best performance: 0
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-02 0.06272727 0.03043909
## 2 1e-01 0.04818182 0.02876395
## 3 1e+00 0.04818182 0.02876395
## 4 5e+00 0.04818182 0.02778971
## 5 1e+01 0.04727273 0.02770698
## 6 1e+02 0.04727273 0.02564145
## 7 1e+03 0.04090909 0.02580210
## 8 1e+04 0.00000000 0.00000000
data.frame(cost = tune.out$performance$cost, misclass = tune.out$performance$error * 1100)
## cost misclass
## 1 1e-02 69
## 2 1e-01 53
## 3 1e+00 53
## 4 5e+00 53
## 5 1e+01 52
## 6 1e+02 52
## 7 1e+03 45
## 8 1e+04 0
x.test <- runif(1000, 0, 100)
class.one <- sample(1000, 500)
y.test <- rep(NA, 1000)
# Set y > x for class.one
for (i in class.one) {
    y.test[i] <- runif(1, x.test[i], 100)
}
# Set y < x for class.zero
for (i in setdiff(1:1000, class.one)) {
    y.test[i] <- runif(1, 0, x.test[i])
}
plot(x.test[class.one], y.test[class.one], col = "blue", pch = "+")
points(x.test[-class.one], y.test[-class.one], col = "red", pch = 4)
set.seed(3)
z.test <- rep(0, 1000)
z.test[class.one] <- 1
data.test <- data.frame(x = x.test, y = y.test, z = as.factor(z.test))
costs <- c(0.01, 0.1, 1, 5, 10, 100, 1000, 10000)
test.err <- rep(NA, length(costs))
for (i in 1:length(costs)) {
    svm.fit <- svm(z ~ ., data = data, kernel = "linear", cost = costs[i])
    pred <- predict(svm.fit, data.test)
    test.err[i] <- sum(pred != data.test$z)
}
data.frame(cost = costs, misclass = test.err)
## cost misclass
## 1 1e-02 50
## 2 1e-01 10
## 3 1e+00 0
## 4 5e+00 0
## 5 1e+01 0
## 6 1e+02 172
## 7 1e+03 188
## 8 1e+04 188
We again see overfitting with the linear kernel. A large cost tries to classify the noisy points correctly and hence overfits the training data, while a smaller cost misclassifies a few of the noisy training points and performs better on the test data.
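To make the overfitting explicit, the corresponding training misclassification counts can be computed with the same costs (a quick sketch reusing costs and the training frame data from above):
# Training misclassification counts at the same cost values (sketch)
train.err <- sapply(costs, function(cst) {
    fit <- svm(z ~ ., data = data, kernel = "linear", cost = cst)
    sum(predict(fit, data) != data$z)
})
data.frame(cost = costs, train.misclass = train.err)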
library(ISLR)
attach(Auto)
Auto$mpglevel <- as.factor(ifelse(Auto$mpg > median(Auto$mpg), 1, 0))
set.seed(1)
tune.out <- tune(svm, mpglevel ~ ., data = Auto, kernel = "linear", ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100, 1000)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.01275641
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-02 0.07403846 0.05471525
## 2 1e-01 0.03826923 0.05148114
## 3 1e+00 0.01275641 0.01344780
## 4 5e+00 0.01782051 0.01229997
## 5 1e+01 0.02038462 0.01074682
## 6 1e+02 0.03820513 0.01773427
## 7 1e+03 0.03820513 0.01773427
A cost of 1 seems to perform best.
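The cross-validation winner can also be extracted directly from the tuning object instead of refitting by hand:
# The tune object stores the model refit at the best cost
bestmod <- tune.out$best.model
summary(bestmod)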
set.seed(1)
tune.out <- tune(svm, mpglevel ~ ., data = Auto, kernel = "polynomial", ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100), degree = c(2, 3, 4)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost degree
## 100 2
##
## - best performance: 0.3013462
##
## - Detailed performance results:
## cost degree error dispersion
## 1 1e-02 2 0.5611538 0.04344202
## 2 1e-01 2 0.5611538 0.04344202
## 3 1e+00 2 0.5611538 0.04344202
## 4 5e+00 2 0.5611538 0.04344202
## 5 1e+01 2 0.5382051 0.05829238
## 6 1e+02 2 0.3013462 0.09040277
## 7 1e-02 3 0.5611538 0.04344202
## 8 1e-01 3 0.5611538 0.04344202
## 9 1e+00 3 0.5611538 0.04344202
## 10 5e+00 3 0.5611538 0.04344202
## 11 1e+01 3 0.5611538 0.04344202
## 12 1e+02 3 0.3322436 0.11140578
## 13 1e-02 4 0.5611538 0.04344202
## 14 1e-01 4 0.5611538 0.04344202
## 15 1e+00 4 0.5611538 0.04344202
## 16 5e+00 4 0.5611538 0.04344202
## 17 1e+01 4 0.5611538 0.04344202
## 18 1e+02 4 0.5611538 0.04344202
set.seed(1)
tune.out <- tune(svm, mpglevel ~ ., data = Auto, kernel = "radial", ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100), gamma = c(0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 100 0.01
##
## - best performance: 0.01532051
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-02 1e-02 0.56115385 0.04344202
## 2 1e-01 1e-02 0.09185897 0.03862507
## 3 1e+00 1e-02 0.07147436 0.05103685
## 4 5e+00 1e-02 0.04326923 0.04975032
## 5 1e+01 1e-02 0.02551282 0.03812986
## 6 1e+02 1e-02 0.01532051 0.01788871
## 7 1e-02 1e-01 0.19153846 0.07612945
## 8 1e-01 1e-01 0.07916667 0.05201159
## 9 1e+00 1e-01 0.05608974 0.05092939
## 10 5e+00 1e-01 0.03064103 0.02637448
## 11 1e+01 1e-01 0.02551282 0.02076457
## 12 1e+02 1e-01 0.02807692 0.01458261
## 13 1e-02 1e+00 0.56115385 0.04344202
## 14 1e-01 1e+00 0.56115385 0.04344202
## 15 1e+00 1e+00 0.06634615 0.06187383
## 16 5e+00 1e+00 0.06128205 0.06186124
## 17 1e+01 1e+00 0.06128205 0.06186124
## 18 1e+02 1e+00 0.06128205 0.06186124
## 19 1e-02 5e+00 0.56115385 0.04344202
## 20 1e-01 5e+00 0.56115385 0.04344202
## 21 1e+00 5e+00 0.49224359 0.03806832
## 22 5e+00 5e+00 0.48967949 0.03738577
## 23 1e+01 5e+00 0.48967949 0.03738577
## 24 1e+02 5e+00 0.48967949 0.03738577
## 25 1e-02 1e+01 0.56115385 0.04344202
## 26 1e-01 1e+01 0.56115385 0.04344202
## 27 1e+00 1e+01 0.51775641 0.04471079
## 28 5e+00 1e+01 0.51012821 0.03817175
## 29 1e+01 1e+01 0.51012821 0.03817175
## 30 1e+02 1e+01 0.51012821 0.03817175
## 31 1e-02 1e+02 0.56115385 0.04344202
## 32 1e-01 1e+02 0.56115385 0.04344202
## 33 1e+00 1e+02 0.56115385 0.04344202
## 34 5e+00 1e+02 0.56115385 0.04344202
## 35 1e+01 1e+02 0.56115385 0.04344202
## 36 1e+02 1e+02 0.56115385 0.04344202
For a polynomial kernel, the lowest cross-validation error is obtained for a degree of 2 and a cost of 100.
For a radial kernel, the lowest cross-validation error is obtained for a gamma of 0.01 and a cost of 100.
svm.linear <- svm(mpglevel ~ ., data = Auto, kernel = "linear", cost = 1)
svm.poly <- svm(mpglevel ~ ., data = Auto, kernel = "polynomial", cost = 100, degree = 2)
svm.radial <- svm(mpglevel ~ ., data = Auto, kernel = "radial", cost = 100, gamma = 0.01)
# Plot the SVM classification regions of mpg against every other predictor
# (the response mpglevel and the name variable are skipped).
plotpairs = function(fit) {
    for (name in names(Auto)[!(names(Auto) %in% c("mpg", "mpglevel", "name"))]) {
        plot(fit, Auto, as.formula(paste("mpg~", name, sep = "")))
    }
}
plotpairs(svm.linear)
plotpairs(svm.poly)
plotpairs(svm.radial)
set.seed(1)
train <- sample(nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test <- OJ[-train, ]
svm.linear <- svm(Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.01)
summary(svm.linear)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "linear",
## cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
## gamma: 0.05555556
##
## Number of Support Vectors: 432
##
## ( 215 217 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
The support vector classifier creates 432 support vectors out of 800 training points: 215 belong to level CH and the remaining 217 to level MM.
train.pred <- predict(svm.linear, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 439 55
## MM 78 228
(78 + 55) / (439 + 228 + 78 + 55)
## [1] 0.16625
test.pred <- predict(svm.linear, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 141 18
## MM 31 80
(31 + 18) / (141 + 80 + 31 + 18)
## [1] 0.1814815
The training error rate is about 16.6% and the test error rate is about 18.1%.
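The same rates can be computed directly rather than by hand from the tables; err.rate below is just an illustrative helper, not part of the original code:
# Illustrative helper: fraction of misclassified observations
err.rate <- function(fit, dat) mean(predict(fit, dat) != dat$Purchase)
c(train = err.rate(svm.linear, OJ.train), test = err.rate(svm.linear, OJ.test))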
set.seed(2)
tune.out <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "linear", ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.1
##
## - best performance: 0.1625
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01000000 0.17125 0.05172376
## 2 0.01778279 0.16500 0.05197489
## 3 0.03162278 0.16625 0.04604120
## 4 0.05623413 0.16500 0.04594683
## 5 0.10000000 0.16250 0.04787136
## 6 0.17782794 0.16250 0.04249183
## 7 0.31622777 0.16875 0.04379958
## 8 0.56234133 0.16625 0.03998698
## 9 1.00000000 0.16500 0.03670453
## 10 1.77827941 0.16625 0.03682259
## 11 3.16227766 0.16500 0.03717451
## 12 5.62341325 0.16500 0.03525699
## 13 10.00000000 0.16750 0.03917553
The optimal cost is 0.1.
svm.linear <- svm(Purchase ~ ., kernel = "linear", data = OJ.train, cost = tune.out$best.parameter$cost)
train.pred <- predict(svm.linear, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 438 56
## MM 71 235
(71 + 56) / (438 + 235 + 71 + 56)
## [1] 0.15875
test.pred <- predict(svm.linear, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 140 19
## MM 32 79
(32 + 19) / (140 + 79 + 32 + 19)
## [1] 0.1888889
With the best cost, the training error rate is now about 15.9% and the test error rate is about 18.9%.
svm.poly <- svm(Purchase ~ ., kernel = "polynomial", data = OJ.train, degree = 2)
summary(svm.poly)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "polynomial",
## degree = 2)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 2
## gamma: 0.05555556
## coef.0: 0
##
## Number of Support Vectors: 454
##
## ( 224 230 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
train.pred <- predict(svm.poly, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 461 33
## MM 105 201
(105 + 33) / (461 + 201 + 105 + 33)
## [1] 0.1725
test.pred <- predict(svm.poly, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 149 10
## MM 41 70
(41 + 10) / (149 + 70 + 41 + 10)
## [1] 0.1888889
The polynomial kernel with the default gamma creates 454 support vectors, of which 224 belong to level CH and the remaining 230 to level MM. The classifier has a training error of about 17.2% and a test error of about 18.9%, which is no improvement over the linear kernel. We now use cross-validation to find the optimal cost.
set.seed(2)
tune.out <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "polynomial", degree = 2,
    ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10
##
## - best performance: 0.18125
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01000000 0.38250 0.04533824
## 2 0.01778279 0.36750 0.04972145
## 3 0.03162278 0.36500 0.05458174
## 4 0.05623413 0.33375 0.05070681
## 5 0.10000000 0.32500 0.04677072
## 6 0.17782794 0.25875 0.05952649
## 7 0.31622777 0.21250 0.06123724
## 8 0.56234133 0.21250 0.05743354
## 9 1.00000000 0.19750 0.06687468
## 10 1.77827941 0.19375 0.05376453
## 11 3.16227766 0.19625 0.05653477
## 12 5.62341325 0.18375 0.05434266
## 13 10.00000000 0.18125 0.05245699
svm.poly <- svm(Purchase ~ ., kernel = "polynomial", degree = 2, data = OJ.train, cost = tune.out$best.parameter$cost)
summary(svm.poly)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "polynomial",
## degree = 2, cost = tune.out$best.parameter$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 10
## degree: 2
## gamma: 0.05555556
## coef.0: 0
##
## Number of Support Vectors: 342
##
## ( 170 172 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
train.pred <- predict(svm.poly, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 450 44
## MM 72 234
(72 + 44) / (450 + 234 + 72 + 44)
## [1] 0.145
test.pred <- predict(svm.poly, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 140 19
## MM 31 80
(31 + 19) / (140 + 80 + 31 + 19)
## [1] 0.1851852
Tuning reduces both the training and test error rates.
Overall, the radial basis kernel seems to produce the lowest misclassification error on both the training and test data.
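The radial-kernel comparison referenced in this conclusion can be run analogously to the linear and polynomial fits above; a sketch (output not shown):
# Sketch: tune and evaluate a radial-kernel SVM on the same OJ split
set.seed(2)
tune.rad <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "radial",
    ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
svm.radial <- svm(Purchase ~ ., kernel = "radial", data = OJ.train,
    cost = tune.rad$best.parameters$cost)
mean(predict(svm.radial, OJ.train) != OJ.train$Purchase)  # training error rate
mean(predict(svm.radial, OJ.test) != OJ.test$Purchase)  # test error rate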