library(e1071)
set.seed(2020)
x = rnorm(100)
y = 4 * x^2 + 1 + rnorm(100)
class = sample(100, 50)
y[class] = y[class] + 3
y[-class] = y[-class] - 3
plot(x[class], y[class], col = "red", xlab = "X", ylab = "Y", ylim = c(-6, 30))
points(x[-class], y[-class], col = "blue")
z = rep(-1, 100)
z[class] = 1
data = data.frame(x = x, y = y, z = as.factor(z))
train = sample(100, 50)
data.train = data[train, ]
data.test = data[-train, ]
svm.linear = svm(z ~ ., data = data.train, kernel = "linear", cost = 10)
plot(svm.linear, data.train)
table(predict = predict(svm.linear, data.train), truth = data.train$z)
## truth
## predict -1 1
## -1 14 4
## 1 8 24
The support vector classifier makes 12 errors on the training data.
Next, we fit a support vector machine with a polynomial kernel.
svm.poly = svm(z ~ ., data = data.train, kernel = "polynomial", cost = 10)
plot(svm.poly, data.train)
table(predict = predict(svm.poly, data.train), truth = data.train$z)
## truth
## predict -1 1
## -1 6 0
## 1 16 28
The support vector machine with a polynomial kernel of degree 3 makes 16 errors on the training data.
Finally, we fit a support vector machine with a radial kernel and a gamma of 1.
svm.radial = svm(z ~ ., data = data.train, kernel = "radial", gamma = 1, cost = 10)
plot(svm.radial, data.train)
table(predict = predict(svm.radial, data.train), truth = data.train$z)
## truth
## predict -1 1
## -1 22 0
## 1 0 28
The support vector machine with a radial kernel makes 0 error on the training data.
Now, we check how these models fare when applied to the test data.
plot(svm.linear, data.test)
table(predict = predict(svm.linear, data.test), truth = data.test$z)
## truth
## predict -1 1
## -1 17 4
## 1 11 18
plot(svm.poly, data.test)
table(predict = predict(svm.poly, data.test), truth = data.test$z)
## truth
## predict -1 1
## -1 4 0
## 1 24 22
plot(svm.radial, data.test)
table(predict = predict(svm.radial, data.test), truth = data.test$z)
## truth
## predict -1 1
## -1 26 0
## 1 2 22
We see that the linear, polynomial and radial support vector machines classify respectively 15, 24 and 2 observations incorrectly. So, radial kernel is the best model in this setting.
##5. We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.
set.seed(1)
x1=runif(500)-0.5
x2=runif(500)-0.5
y=1*(x1^2-x2^2 > 0)
plot(x1,x2, col = (4 - y))
c. Fit a logistic regression model to the data, using X1 and X2 as predictors.
logit.fit = glm(y ~ x1 + x2, family = "binomial")
summary(logit.fit)
##
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.179 -1.139 -1.112 1.206 1.257
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.087260 0.089579 -0.974 0.330
## x1 0.196199 0.316864 0.619 0.536
## x2 -0.002854 0.305712 -0.009 0.993
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 692.18 on 499 degrees of freedom
## Residual deviance: 691.79 on 497 degrees of freedom
## AIC: 697.79
##
## Number of Fisher Scoring iterations: 3
None of the variables are statistically significants.
data = data.frame(x1 = x1, x2 = x2, y = y)
probs = predict(logit.fit, data, type = "response")
preds = rep(0, 500)
preds[probs > 0.47] = 1
plot(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = (4 - 1), pch = (3 - 1), xlab = "X1", ylab = "X2")
points(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = (4 - 0), pch = (3 - 0))
The decision boundary is obviously linear.
logitnl.fit <- glm(y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2), family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logitnl.fit)
##
## Call:
## glm(formula = y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2), family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.240e-04 -2.000e-08 -2.000e-08 2.000e-08 1.163e-03
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -102.2 4302.0 -0.024 0.981
## poly(x1, 2)1 2715.3 141109.5 0.019 0.985
## poly(x1, 2)2 27218.5 842987.2 0.032 0.974
## poly(x2, 2)1 -279.7 97160.4 -0.003 0.998
## poly(x2, 2)2 -28693.0 875451.3 -0.033 0.974
## I(x1 * x2) -206.4 41802.8 -0.005 0.996
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.9218e+02 on 499 degrees of freedom
## Residual deviance: 3.5810e-06 on 494 degrees of freedom
## AIC: 12
##
## Number of Fisher Scoring iterations: 25
probs <- predict(logitnl.fit, data, type = "response")
preds <- rep(0, 500)
preds[probs > 0.47] <- 1
plot(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = (4 - 1), pch = (3 - 1), xlab = "X1", ylab = "X2")
points(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = (4 - 0), pch = (3 - 0))
The non-linear decision boundary is very similar to the true decision boundary.
data$y <- as.factor(data$y)
svm.fit <- svm(y ~ x1 + x2, data, kernel = "linear", cost = 0.01)
preds <- predict(svm.fit, data)
plot(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = (4 - 0), pch = (3 - 0), xlab = "X1", ylab = "X2")
points(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = (4 - 1), pch = (3 - 1))
This support vector classifier classifies all points to a single class.
data$y <- as.factor(data$y)
svmnl.fit <- svm(y ~ x1 + x2, data, kernel = "radial", gamma = 1)
preds <- predict(svmnl.fit, data)
plot(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = (4 - 0), pch = (3 - 0), xlab = "X1", ylab = "X2")
points(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = (4 - 1), pch = (3 - 1))
The non-linear decision boundary is very similar to the true decision boundary.
We may conclude that SVM with non-linear kernel and logistic regression with interaction terms are equally very powerful for finding non-linear decision boundaries. Also, SVM with linear kernel and logistic regression without any interaction term are very bad when it comes to finding non-linear decision boundaries. However, one argument in favor of SVM is that it requires some manual tuning to find the right interaction terms when using logistic regression, although when using SVM we only need to tune gamma.
set.seed(1)
x.one <- runif(500, 0, 90)
y.one <- runif(500, x.one + 10, 100)
x.one.noise <- runif(50, 20, 80)
y.one.noise <- 5/4 * (x.one.noise - 10) + 0.1
x.zero <- runif(500, 10, 100)
y.zero <- runif(500, 0, x.zero - 10)
x.zero.noise <- runif(50, 20, 80)
y.zero.noise <- 5/4 * (x.zero.noise - 10) - 0.1
class.one <- seq(1, 550)
x <- c(x.one, x.one.noise, x.zero, x.zero.noise)
y <- c(y.one, y.one.noise, y.zero, y.zero.noise)
plot(x[class.one], y[class.one], col = "blue", pch = "+", ylim = c(0, 100))
points(x[-class.one], y[-class.one], col = "red", pch = 4)
set.seed(2)
z <- rep(0, 1100)
z[class.one] <- 1
data <- data.frame(x = x, y = y, z = as.factor(z))
tune.out <- tune(svm, z ~ ., data = data, kernel = "linear", ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100, 1000, 10000)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10000
##
## - best performance: 0
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-02 0.06181818 0.03201239
## 2 1e-01 0.04818182 0.02101641
## 3 1e+00 0.04818182 0.02101641
## 4 5e+00 0.05000000 0.02153435
## 5 1e+01 0.05000000 0.02153435
## 6 1e+02 0.05272727 0.02259558
## 7 1e+03 0.04636364 0.03467016
## 8 1e+04 0.00000000 0.00000000
data.frame(cost = tune.out$performance$cost, misclass = tune.out$performance$error * 1100)
## cost misclass
## 1 1e-02 68
## 2 1e-01 53
## 3 1e+00 53
## 4 5e+00 55
## 5 1e+01 55
## 6 1e+02 58
## 7 1e+03 51
## 8 1e+04 0
x.test <- runif(1000, 0, 100)
class.one <- sample(1000, 500)
y.test <- rep(NA, 1000)
# Set y > x for class.one
for (i in class.one) {
y.test[i] <- runif(1, x.test[i], 100)
}
# set y < x for class.zero
for (i in setdiff(1:1000, class.one)) {
y.test[i] <- runif(1, 0, x.test[i])
}
plot(x.test[class.one], y.test[class.one], col = "blue", pch = "+")
points(x.test[-class.one], y.test[-class.one], col = "red", pch = 4)
set.seed(3)
z.test <- rep(0, 1000)
z.test[class.one] <- 1
data.test <- data.frame(x = x.test, y = y.test, z = as.factor(z.test))
costs <- c(0.01, 0.1, 1, 5, 10, 100, 1000, 10000)
test.err <- rep(NA, length(costs))
for (i in 1:length(costs)) {
svm.fit <- svm(z ~ ., data = data, kernel = "linear", cost = costs[i])
pred <- predict(svm.fit, data.test)
test.err[i] <- sum(pred != data.test$z)
}
data.frame(cost = costs, misclass = test.err)
## cost misclass
## 1 1e-02 61
## 2 1e-01 20
## 3 1e+00 2
## 4 5e+00 0
## 5 1e+01 0
## 6 1e+02 191
## 7 1e+03 209
## 8 1e+04 212
We again see an overfitting phenomenon for linear kernel. A large cost tries to correctly classify noisy-points and hence overfits the train data. A small cost, however, makes a few errors on the noisy test points and performs better on test data.
##7. In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.
library(ISLR)
attach(Auto)
Auto$mpglevel <- as.factor(ifelse(Auto$mpg > median(Auto$mpg), 1, 0))
set.seed(1)
tune.out <- tune(svm, mpglevel ~ ., data = Auto, kernel = "linear", ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100, 1000)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.01025641
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-02 0.07653846 0.03617137
## 2 1e-01 0.04596154 0.03378238
## 3 1e+00 0.01025641 0.01792836
## 4 5e+00 0.02051282 0.02648194
## 5 1e+01 0.02051282 0.02648194
## 6 1e+02 0.03076923 0.03151981
## 7 1e+03 0.03076923 0.03151981
A cost of 1 seems to perform best.
set.seed(1)
tune.out <- tune(svm, mpglevel ~ ., data = Auto, kernel = "polynomial", ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100), degree = c(2, 3, 4)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost degree
## 100 2
##
## - best performance: 0.3013462
##
## - Detailed performance results:
## cost degree error dispersion
## 1 1e-02 2 0.5511538 0.04366593
## 2 1e-01 2 0.5511538 0.04366593
## 3 1e+00 2 0.5511538 0.04366593
## 4 5e+00 2 0.5511538 0.04366593
## 5 1e+01 2 0.5130128 0.08963366
## 6 1e+02 2 0.3013462 0.09961961
## 7 1e-02 3 0.5511538 0.04366593
## 8 1e-01 3 0.5511538 0.04366593
## 9 1e+00 3 0.5511538 0.04366593
## 10 5e+00 3 0.5511538 0.04366593
## 11 1e+01 3 0.5511538 0.04366593
## 12 1e+02 3 0.3446154 0.09821588
## 13 1e-02 4 0.5511538 0.04366593
## 14 1e-01 4 0.5511538 0.04366593
## 15 1e+00 4 0.5511538 0.04366593
## 16 5e+00 4 0.5511538 0.04366593
## 17 1e+01 4 0.5511538 0.04366593
## 18 1e+02 4 0.5511538 0.04366593
set.seed(1)
tune.out <- tune(svm, mpglevel ~ ., data = Auto, kernel = "radial", ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100), gamma = c(0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 100 0.01
##
## - best performance: 0.01282051
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-02 1e-02 0.55115385 0.04366593
## 2 1e-01 1e-02 0.08929487 0.04382379
## 3 1e+00 1e-02 0.07403846 0.03522110
## 4 5e+00 1e-02 0.04852564 0.03303346
## 5 1e+01 1e-02 0.02557692 0.02093679
## 6 1e+02 1e-02 0.01282051 0.01813094
## 7 1e-02 1e-01 0.21711538 0.09865227
## 8 1e-01 1e-01 0.07903846 0.03874545
## 9 1e+00 1e-01 0.05371795 0.03525162
## 10 5e+00 1e-01 0.02820513 0.03299190
## 11 1e+01 1e-01 0.03076923 0.03375798
## 12 1e+02 1e-01 0.03583333 0.02759051
## 13 1e-02 1e+00 0.55115385 0.04366593
## 14 1e-01 1e+00 0.55115385 0.04366593
## 15 1e+00 1e+00 0.06384615 0.04375618
## 16 5e+00 1e+00 0.05884615 0.04020934
## 17 1e+01 1e+00 0.05884615 0.04020934
## 18 1e+02 1e+00 0.05884615 0.04020934
## 19 1e-02 5e+00 0.55115385 0.04366593
## 20 1e-01 5e+00 0.55115385 0.04366593
## 21 1e+00 5e+00 0.49493590 0.04724924
## 22 5e+00 5e+00 0.48217949 0.05470903
## 23 1e+01 5e+00 0.48217949 0.05470903
## 24 1e+02 5e+00 0.48217949 0.05470903
## 25 1e-02 1e+01 0.55115385 0.04366593
## 26 1e-01 1e+01 0.55115385 0.04366593
## 27 1e+00 1e+01 0.51794872 0.05063697
## 28 5e+00 1e+01 0.51794872 0.04917316
## 29 1e+01 1e+01 0.51794872 0.04917316
## 30 1e+02 1e+01 0.51794872 0.04917316
## 31 1e-02 1e+02 0.55115385 0.04366593
## 32 1e-01 1e+02 0.55115385 0.04366593
## 33 1e+00 1e+02 0.55115385 0.04366593
## 34 5e+00 1e+02 0.55115385 0.04366593
## 35 1e+01 1e+02 0.55115385 0.04366593
## 36 1e+02 1e+02 0.55115385 0.04366593
For a polynomial kernel, the lowest cross-validation error is obtained for a degree of 2 and a cost of 100.
For a radial kernel, the lowest cross-validation error is obtained for a gamma of 0.01 and a cost of 100.
svm.linear <- svm(mpglevel ~ ., data = Auto, kernel = "linear", cost = 1)
svm.poly <- svm(mpglevel ~ ., data = Auto, kernel = "polynomial", cost = 100, degree = 2)
svm.radial <- svm(mpglevel ~ ., data = Auto, kernel = "radial", cost = 100, gamma = 0.01)
plotpairs = function(fit) {
for (name in names(Auto)[!(names(Auto) %in% c("mpg", "mpglevel", "name"))]) {
plot(fit, Auto, as.formula(paste("mpg~", name, sep = "")))
}
}
plotpairs(svm.linear)
plotpairs(svm.poly)
plotpairs(svm.radial)
##8. This problem involves the OJ data set which is part of the ISLR package.
set.seed(1)
train <- sample(nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test <- OJ[-train, ]
svm.linear <- svm(Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.01)
summary(svm.linear)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 435
##
## ( 219 216 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
Support vector classifier creates 432 support vectors out of 800 training points. Out of these, 217 belong to level MM and remaining 215 belong to level CH.
train.pred <- predict(svm.linear, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 420 65
## MM 75 240
(78 + 55) / (439 + 228 + 78 + 55)
## [1] 0.16625
test.pred <- predict(svm.linear, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 153 15
## MM 33 69
(31 + 18) / (141 + 80 + 31 + 18)
## [1] 0.1814815
The training error rate is 16.6% and test error rate is about 18.1%.
set.seed(2)
tune.out <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "linear", ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1.778279
##
## - best performance: 0.1675
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01000000 0.17625 0.04059026
## 2 0.01778279 0.17625 0.04348132
## 3 0.03162278 0.17125 0.04604120
## 4 0.05623413 0.17000 0.04005205
## 5 0.10000000 0.17125 0.04168749
## 6 0.17782794 0.17000 0.04090979
## 7 0.31622777 0.17125 0.04411554
## 8 0.56234133 0.17125 0.04084609
## 9 1.00000000 0.17000 0.04090979
## 10 1.77827941 0.16750 0.03782269
## 11 3.16227766 0.16750 0.03782269
## 12 5.62341325 0.16750 0.03545341
## 13 10.00000000 0.17000 0.03736085
The optimal cost is 0.1
svm.linear <- svm(Purchase ~ ., kernel = "linear", data = OJ.train, cost = tune.out$best.parameter$cost)
train.pred <- predict(svm.linear, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 423 62
## MM 69 246
(71 + 56) / (438 + 235 + 71 + 56)
## [1] 0.15875
test.pred <- predict(svm.linear, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 156 12
## MM 29 73
(32 + 19) / (140 + 79 + 32 + 19)
## [1] 0.1888889
With the best cost, the training error rate is now 15.8% and the test error rate is 18.8%.
svm.poly <- svm(Purchase ~ ., kernel = "polynomial", data = OJ.train, degree = 2)
summary(svm.poly)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "polynomial",
## degree = 2)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 2
## coef.0: 0
##
## Number of Support Vectors: 447
##
## ( 225 222 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
train.pred <- predict(svm.poly, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 449 36
## MM 110 205
(105 + 33) / (461 + 201 + 105 + 33)
## [1] 0.1725
test.pred <- predict(svm.poly, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 153 15
## MM 45 57
(41 + 10) / (149 + 70 + 41 + 10)
## [1] 0.1888889
Polynomial kernel with default gamma creates 454 support vectors, out of which, 224 belong to level CH and remaining 230 belong to level MM. The classifier has a training error of 17.2% and a test error of 18.8% which is no improvement over linear kernel. We now use cross validation to find optimal cost.
set.seed(2)
tune.out <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "polynomial", degree = 2, ranges = list(cost = 10^seq(-2,
1, by = 0.25)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 3.162278
##
## - best performance: 0.18
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01000000 0.39000 0.03670453
## 2 0.01778279 0.37000 0.03395258
## 3 0.03162278 0.36375 0.03197764
## 4 0.05623413 0.34500 0.03291403
## 5 0.10000000 0.32125 0.03866254
## 6 0.17782794 0.24750 0.03322900
## 7 0.31622777 0.20250 0.04073969
## 8 0.56234133 0.20250 0.03670453
## 9 1.00000000 0.19625 0.03910900
## 10 1.77827941 0.19125 0.03586723
## 11 3.16227766 0.18000 0.04005205
## 12 5.62341325 0.18000 0.04133199
## 13 10.00000000 0.18125 0.03830162
svm.poly <- svm(Purchase ~ ., kernel = "polynomial", degree = 2, data = OJ.train, cost = tune.out$best.parameter$cost)
summary(svm.poly)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "polynomial",
## degree = 2, cost = tune.out$best.parameter$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 3.162278
## degree: 2
## coef.0: 0
##
## Number of Support Vectors: 385
##
## ( 197 188 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
train.pred <- predict(svm.poly, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 451 34
## MM 90 225
(72 + 44) / (450 + 234 + 72 + 44)
## [1] 0.145
test.pred <- predict(svm.poly, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 154 14
## MM 41 61
(31 + 19) / (140 + 80 + 31 + 19)
## [1] 0.1851852
0
## [1] 0
Tuning reduce train and test error rates.