Exercise 5: We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.
- (a) Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them.
set.seed(421)
x1 = runif(500) - 0.5              # n = 500 observations on [-0.5, 0.5]
x2 = runif(500) - 0.5              # p = 2 features
y = 1 * (x1^2 - x2^2 > 0)          # quadratic boundary: class 1 where x1^2 > x2^2
- (b) Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y-axis.
plot(x1[y == 0], x2[y == 0], col = "navy", xlab = "X1", ylab = "X2", pch = "+")
points(x1[y == 1], x2[y == 1], col = "grey", pch = 4)

- (c) Fit a logistic regression model to the data, using X1 and X2 as predictors.
lm_fit = glm(y ~ x1 + x2, family = binomial)
summary(lm_fit)
##
## Call:
## glm(formula = y ~ x1 + x2, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.278 -1.227 1.089 1.135 1.175
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.11999 0.08971 1.338 0.181
## x1 -0.16881 0.30854 -0.547 0.584
## x2 -0.08198 0.31476 -0.260 0.795
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 691.35 on 499 degrees of freedom
## Residual deviance: 690.99 on 497 degrees of freedom
## AIC: 696.99
##
## Number of Fisher Scoring iterations: 3
- (d) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.
data = data.frame(x1 = x1, x2 = x2, y = y)
lm_prob = predict(lm_fit, data, type = "response")
lm_pred = ifelse(lm_prob > 0.52, 1, 0)  # cutoff nudged to 0.52 so both predicted classes appear (the fitted probabilities all lie close to 0.5)
data_pos = data[lm_pred == 1, ]
data_neg = data[lm_pred == 0, ]
plot(data_pos$x1, data_pos$x2, col = "black", xlab = "X1", ylab = "X2", pch = "+")
points(data_neg$x1, data_neg$x2, col = "red", pch = 4)

- (e) Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors (e.g. X1^2, X1 × X2, log(X2), and so forth).
# the quadratic terms can represent the true boundary x1^2 = x2^2 exactly, so the
# classes become (almost) perfectly separable -- hence the warnings below
lm_fit = glm(y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2), data = data, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lm_fit)
##
## Call:
## glm(formula = y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2), family = binomial,
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.003575 0.000000 0.000000 0.000000 0.003720
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 236.09 34920.61 0.007 0.995
## poly(x1, 2)1 3608.97 246381.97 0.015 0.988
## poly(x1, 2)2 88150.22 1333540.93 0.066 0.947
## poly(x2, 2)1 3256.75 177352.91 0.018 0.985
## poly(x2, 2)2 -87128.37 1164195.57 -0.075 0.940
## I(x1 * x2) -33.23 446735.64 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.9135e+02 on 499 degrees of freedom
## Residual deviance: 3.3069e-05 on 494 degrees of freedom
## AIC: 12
##
## Number of Fisher Scoring iterations: 25
- (f) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.
lm_prob = predict(lm_fit, data, type = "response")
lm_pred = ifelse(lm_prob > 0.5, 1, 0)
data_pos = data[lm_pred == 1, ]
data_neg = data[lm_pred == 0, ]
plot(data_pos$x1, data_pos$x2, col = "black", xlab = "X1", ylab = "X2", pch = "+")
points(data_neg$x1, data_neg$x2, col = "red", pch = 4)

- (g) Fit a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
library(e1071)  # provides svm() and tune()
svm_fit = svm(as.factor(y) ~ x1 + x2, data, kernel = "linear", cost = 0.1)
svm_pred = predict(svm_fit, data)
data_pos = data[svm_pred == 1, ]
data_neg = data[svm_pred == 0, ]
plot(data_pos$x1, data_pos$x2, col = "black", xlab = "X1", ylab = "X2", pch = "+")
points(data_neg$x1, data_neg$x2, col = "red", pch = 4)

- (h) Fit an SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
svm_fit = svm(as.factor(y) ~ x1 + x2, data, gamma = 1)  # radial kernel (the svm() default)
svm_pred = predict(svm_fit, data)
data_pos = data[svm_pred == 1, ]
data_neg = data[svm_pred == 0, ]
plot(data_pos$x1, data_pos$x2, col = "black", xlab = "X1", ylab = "X2", pch = "+")
points(data_neg$x1, data_neg$x2, col = "red", pch = 4)

- (i) Comment on your results.
- We conclude that both an SVM with a non-linear kernel and logistic regression with non-linear transformations of the features (here, quadratic terms) recover the quadratic decision boundary well, whereas a support vector classifier with a linear kernel and logistic regression on the raw features can only produce linear boundaries and therefore cannot capture it. The sketch below makes the comparison concrete by computing each model's training error rate.
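- A minimal sketch of that comparison, assuming the e1071 library is loaded and that x1, x2, y, and data from parts (a)-(h) are still in the workspace (the names log_lin, log_quad, svm_lin, svm_rad are illustrative):
# Refit the four models and compare their training error rates.
err = function(pred) mean(as.numeric(as.character(pred)) != y)
log_lin = glm(y ~ x1 + x2, data = data, family = binomial)
log_quad = glm(y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2), data = data, family = binomial)  # will warn about separation again
svm_lin = svm(as.factor(y) ~ x1 + x2, data, kernel = "linear", cost = 0.1)
svm_rad = svm(as.factor(y) ~ x1 + x2, data, kernel = "radial", gamma = 1)
c(logistic_linear = err(ifelse(predict(log_lin, data, type = "response") > 0.5, 1, 0)),
  logistic_quadratic = err(ifelse(predict(log_quad, data, type = "response") > 0.5, 1, 0)),
  svm_linear = err(predict(svm_lin, data)),
  svm_radial = err(predict(svm_rad, data)))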
Exercise 7: In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.
- (a) Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.
library(ISLR)  # provides the Auto data set (and the OJ data used in Exercise 8)
gas.med = median(Auto$mpg)
new.var = ifelse(Auto$mpg > gas.med, 1, 0)
Auto$mpglevel = as.factor(new.var)
head(Auto["mpglevel"])
## mpglevel
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
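- Because the split is at the median of mpg, the two classes are essentially balanced; a quick check:
table(Auto$mpglevel)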
- (b) Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter.
set.seed(3255)
tune.out = tune(svm, mpglevel ~ ., data = Auto, kernel = "linear", ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.01269231
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-02 0.07397436 0.06863413
## 2 1e-01 0.05102564 0.06923024
## 3 1e+00 0.01269231 0.02154160
## 4 5e+00 0.01519231 0.01760469
## 5 1e+01 0.02025641 0.02303772
## 6 1e+02 0.03294872 0.02898463
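- The lowest cross-validation error (about 1.3%) is attained at cost = 1. If needed, the model refit at that cost can be pulled straight from the tuning object; a brief sketch:
# The refit on the full training data at the best cost is stored in the tune object.
best.linear = tune.out$best.model
summary(best.linear)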
- (c) Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost.
set.seed(21)
tune.out = tune(svm, mpglevel ~ ., data = Auto, kernel = "polynomial",
    ranges = list(cost = c(0.1, 1, 5, 10), degree = c(2, 3, 4)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost degree
## 10 2
##
## - best performance: 0.5435897
##
## - Detailed performance results:
## cost degree error dispersion
## 1 0.1 2 0.5587821 0.04538579
## 2 1.0 2 0.5587821 0.04538579
## 3 5.0 2 0.5587821 0.04538579
## 4 10.0 2 0.5435897 0.05611162
## 5 0.1 3 0.5587821 0.04538579
## 6 1.0 3 0.5587821 0.04538579
## 7 5.0 3 0.5587821 0.04538579
## 8 10.0 3 0.5587821 0.04538579
## 9 0.1 4 0.5587821 0.04538579
## 10 1.0 4 0.5587821 0.04538579
## 11 5.0 4 0.5587821 0.04538579
## 12 10.0 4 0.5587821 0.04538579
set.seed(463)
tune.out = tune(svm, mpglevel ~ ., data = Auto, kernel = "radial",
    ranges = list(cost = c(0.1, 1, 5, 10), gamma = c(0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 10 0.01
##
## - best performance: 0.02551282
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 0.1 1e-02 0.09429487 0.04814900
## 2 1.0 1e-02 0.07897436 0.03875105
## 3 5.0 1e-02 0.05352564 0.02532795
## 4 10.0 1e-02 0.02551282 0.02417610
## 5 0.1 1e-01 0.07891026 0.03847631
## 6 1.0 1e-01 0.05602564 0.02881876
## 7 5.0 1e-01 0.03826923 0.03252085
## 8 10.0 1e-01 0.03320513 0.02964746
## 9 0.1 1e+00 0.57660256 0.05479863
## 10 1.0 1e+00 0.06628205 0.02996211
## 11 5.0 1e+00 0.06115385 0.02733573
## 12 10.0 1e+00 0.06115385 0.02733573
## 13 0.1 5e+00 0.57660256 0.05479863
## 14 1.0 5e+00 0.51538462 0.06642516
## 15 5.0 5e+00 0.50775641 0.07152757
## 16 10.0 5e+00 0.50775641 0.07152757
## 17 0.1 1e+01 0.57660256 0.05479863
## 18 1.0 1e+01 0.53833333 0.05640443
## 19 5.0 1e+01 0.53070513 0.05708644
## 20 10.0 1e+01 0.53070513 0.05708644
## 21 0.1 1e+02 0.57660256 0.05479863
## 22 1.0 1e+02 0.57660256 0.05479863
## 23 5.0 1e+02 0.57660256 0.05479863
## 24 10.0 1e+02 0.57660256 0.05479863
- With the polynomial kernel the best cross-validation error (about 54%, at cost = 10 and degree = 2) is no better than guessing, whereas the radial kernel reaches about 2.6% CV error at cost = 10 and gamma = 0.01.
- (d) Make some plots to back up your assertions in (b) and (c).
svm.linear = svm(mpglevel ~ ., data = Auto, kernel = "linear", cost = 1)
svm.poly = svm(mpglevel ~ ., data = Auto, kernel = "polynomial", cost = 10, degree = 2)
svm.radial = svm(mpglevel ~ ., data = Auto, kernel = "radial", cost = 10, gamma = 0.01)
plotpairs = function(fit) {
    # plot the fitted SVM against mpg and each remaining predictor in turn
    for (name in names(Auto)[!(names(Auto) %in% c("mpg", "mpglevel", "name"))]) {
        plot(fit, Auto, as.formula(paste("mpg~", name, sep = "")))
    }
}
plotpairs(svm.linear)
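- The same pairwise plots can be produced for the tuned polynomial and radial fits, which backs up the comparison in (c) visually:
plotpairs(svm.poly)
plotpairs(svm.radial)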
Exercise 8: This problem involves the OJ data set which is part of the ISLR package.
- (a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
library(modelr)  # resample_partition()
library(dplyr)   # %>%, as_tibble(), mutate(), summarize()
set.seed(1)
oj_train_samples = OJ %>% resample_partition(c(train = .8, test = .2))
summary(oj_train_samples)
## Length Class Mode
## train 2 resample list
## test 2 resample list
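- Note that resample_partition() draws an approximately 80/20 split of the 1,070 rows rather than exactly 800 training observations. If an exact 800-row training set is wanted, a base-R sketch such as the following could be used instead (the object names are illustrative):
# Sketch: an exact 800-observation training set, as the question literally asks.
set.seed(1)
train_idx = sample(seq_len(nrow(OJ)), 800)
oj_train = OJ[train_idx, ]
oj_test = OJ[-train_idx, ]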
- (b) Fit a support vector classifier to the training data using cost = 0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.
oj_linr_svc = svm(Purchase ~ ., data = oj_train_samples$train, kernel = 'linear', cost = 0.01)
summary(oj_linr_svc)
##
## Call:
## svm(formula = Purchase ~ ., data = oj_train_samples$train, kernel = "linear",
## cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 450
##
## ( 226 224 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
- With cost = 0.01 the margin is very wide, so a large share of the training observations (450 of them, split almost evenly as 226 and 224 between the CH and MM classes) end up as support vectors.
- (c) What are the training and test error rates?
oj_train_samples$train %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_linr_svc, newdata = .)) %>% summarize('Train Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
## `Train Error Rate`
## <dbl>
## 1 0.157
oj_train_samples$test %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_linr_svc, newdata = .)) %>% summarize('Test Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
## `Test Error Rate`
## <dbl>
## 1 0.205
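- Beyond the overall error rate, the test-set confusion matrix shows how the misclassifications split between the two brands; a brief sketch using the same fit:
# Sketch: confusion matrix of predicted versus observed Purchase on the test set.
oj_test_tbl = as_tibble(oj_train_samples$test)
table(predicted = predict(oj_linr_svc, newdata = oj_test_tbl), observed = oj_test_tbl$Purchase)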
- (d) Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.
set.seed(1)
tune(svm, Purchase ~ ., data = as_tibble(oj_train_samples$train), kernel = 'linear',
    ranges = list(cost = 2^seq(-8, 4))) -> oj_svc_tune
summary(oj_svc_tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.015625
##
## - best performance: 0.1603557
##
## - Detailed performance results:
## cost error dispersion
## 1 0.00390625 0.1638577 0.03795702
## 2 0.00781250 0.1626813 0.03729402
## 3 0.01562500 0.1603557 0.03840728
## 4 0.03125000 0.1650205 0.03522022
## 5 0.06250000 0.1638988 0.03695773
## 6 0.12500000 0.1615458 0.03716560
## 7 0.25000000 0.1615595 0.03601742
## 8 0.50000000 0.1603967 0.03777655
## 9 1.00000000 0.1650479 0.03445874
## 10 2.00000000 0.1662244 0.03721419
## 11 4.00000000 0.1627086 0.03446296
## 12 8.00000000 0.1627223 0.03666065
## 13 16.00000000 0.1638851 0.03643699
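- The cross-validation errors are all quite close; plotting the tuning object makes the shallow minimum near cost = 0.016 easier to see:
# Sketch: CV error as a function of cost for the linear support vector classifier.
plot(oj_svc_tune)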
- (e) Compute the training and test error rates using this new value for cost.
oj_linr_svc = svm(Purchase ~ ., data = oj_train_samples$train, kernel = 'linear',
    cost = oj_svc_tune$best.parameters$cost)
oj_train_samples$train %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_linr_svc)) %>% summarize('Train Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
## `Train Error Rate`
## <dbl>
## 1 0.152
oj_train_samples$test %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_linr_svc, newdata = .)) %>% summarize('Test Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
## `Test Error Rate`
## <dbl>
## 1 0.2
- (f) Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.
oj_radl_svc = svm(Purchase ~ ., data = oj_train_samples$train, kernel = 'radial')
summary(oj_radl_svc)
##
## Call:
## svm(formula = Purchase ~ ., data = oj_train_samples$train, kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 390
##
## ( 201 189 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
oj_train_samples$train %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_radl_svc, newdata = .)) %>% summarize('Train Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
## `Train Error Rate`
## <dbl>
## 1 0.145
oj_train_samples$test %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_radl_svc, newdata = .)) %>% summarize('Test Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
## `Test Error Rate`
## <dbl>
## 1 0.186
set.seed(1)
tune(svm, Purchase ~ ., data = as_tibble(oj_train_samples$train), kernel = 'radial',
    ranges = list(cost = 2^seq(-8, 4))) -> oj_radl_tune
summary(oj_radl_tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.1673461
##
## - Detailed performance results:
## cost error dispersion
## 1 0.00390625 0.3764843 0.06228157
## 2 0.00781250 0.3764843 0.06228157
## 3 0.01562500 0.3764843 0.06228157
## 4 0.03125000 0.3366211 0.07569987
## 5 0.06250000 0.1918605 0.04001821
## 6 0.12500000 0.1755267 0.03617482
## 7 0.25000000 0.1755404 0.03986210
## 8 0.50000000 0.1720246 0.04204614
## 9 1.00000000 0.1673461 0.03585901
## 10 2.00000000 0.1720520 0.04109274
## 11 4.00000000 0.1731874 0.04013555
## 12 8.00000000 0.1825581 0.03589686
## 13 16.00000000 0.1837346 0.03185391
oj_radl_svc <- svm(Purchase ~ ., data = oj_train_samples$train, kernel = 'radial',
    cost = oj_radl_tune$best.parameters$cost)
oj_train_samples$train %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_radl_svc)) %>% summarize('Train Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
## `Train Error Rate`
## <dbl>
## 1 0.145
oj_train_samples$test %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_radl_svc, newdata = .)) %>% summarize('Test Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
## `Test Error Rate`
## <dbl>
## 1 0.186
- (g) Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree=2.
oj_poly_svc = svm(Purchase ~ ., data = oj_train_samples$train, kernel = 'polynomial', degree = 2)
summary(oj_poly_svc)
##
## Call:
## svm(formula = Purchase ~ ., data = oj_train_samples$train, kernel = "polynomial",
## degree = 2)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 2
## coef.0: 0
##
## Number of Support Vectors: 467
##
## ( 238 229 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
oj_train_samples$train %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_poly_svc, newdata = .)) %>% summarize('Train Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
## `Train Error Rate`
## <dbl>
## 1 0.174
oj_train_samples$test %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_poly_svc, newdata = .)) %>% summarize('Test Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
## `Test Error Rate`
## <dbl>
## 1 0.214
set.seed(1)
tune(svm, Purchase ~ ., data = as_tibble(oj_train_samples$train), kernel = 'polynomial',
    ranges = list(cost = 2^seq(-8, 4)), degree = 2) -> oj_poly_tune
summary(oj_poly_tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 8
##
## - best performance: 0.1708208
##
## - Detailed performance results:
## cost error dispersion
## 1 0.00390625 0.3764843 0.06228157
## 2 0.00781250 0.3776471 0.06301831
## 3 0.01562500 0.3495075 0.07242409
## 4 0.03125000 0.3389740 0.06929994
## 5 0.06250000 0.3179617 0.06929113
## 6 0.12500000 0.3016005 0.07149584
## 7 0.25000000 0.2268673 0.03547776
## 8 0.50000000 0.2012312 0.04937234
## 9 1.00000000 0.1918194 0.04481021
## 10 2.00000000 0.1848290 0.04773197
## 11 4.00000000 0.1801505 0.04354956
## 12 8.00000000 0.1708208 0.04408733
## 13 16.00000000 0.1732011 0.04348912
oj_poly_svc <- svm(Purchase ~ ., data = oj_train_samples$train, kernel = 'polynomial',
    degree = 2, cost = oj_poly_tune$best.parameters$cost)
oj_train_samples$train %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_poly_svc)) %>% summarize('Train Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
## `Train Error Rate`
## <dbl>
## 1 0.137
oj_train_samples$test %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_poly_svc, newdata = .)) %>% summarize('Test Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
## `Test Error Rate`
## <dbl>
## 1 0.219
- (h) Overall, which approach seems to give the best results on this data?
- The radial kernel gives the best results on this data set: it attains the lowest test error rate (0.186, versus 0.200 for the tuned linear classifier and 0.214 for the degree-2 polynomial SVM), with a training error comparable to or lower than the alternatives. The test error rates are collected in the small comparison table below.
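- For reference, a sketch that tabulates the test error rates copied from the outputs above:
data.frame(kernel = c("linear (cost tuned)", "radial (cost = 1, default gamma)", "polynomial (degree 2, cost = 1)"),
    test_error = c(0.200, 0.186, 0.214))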