Libraries

library(ISLR)
library(tidyverse)
library(caret)
library(e1071)
library(modelr)

Exercises

Exercise 5: We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.

  • (a) Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them.
set.seed(421)
x1 = runif(500) - 0.5
x2 = runif(500) - 0.5
y = 1 * (x1^2 - x2^2 > 0)
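
A quick check of the class balance (a minimal sketch; the exact counts depend on the seed):

table(y)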


  • (b) Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y-axis.
plot(x1[y == 0], x2[y == 0], col = "navy", xlab = "X1", ylab = "X2", pch = "+")
points(x1[y == 1], x2[y == 1], col = "grey", pch = 4)
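
To make the quadratic structure explicit, the true boundary x1^2 = x2^2, i.e. the lines X2 = X1 and X2 = -X1, can be overlaid immediately after the plot above (a minimal sketch):

abline(0, 1, lty = 2)   # X2 = X1
abline(0, -1, lty = 2)  # X2 = -X1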


  • (c) Fit a logistic regression model to the data, using X1 and X2 as predictors.
lm_fit = glm(y ~ x1 + x2, family = binomial)
summary(lm_fit)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = binomial)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.278  -1.227   1.089   1.135   1.175  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.11999    0.08971   1.338    0.181
## x1          -0.16881    0.30854  -0.547    0.584
## x2          -0.08198    0.31476  -0.260    0.795
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 691.35  on 499  degrees of freedom
## Residual deviance: 690.99  on 497  degrees of freedom
## AIC: 696.99
## 
## Number of Fisher Scoring iterations: 3


  • (d) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.
data = data.frame(x1 = x1, x2 = x2, y = y)

lm_prob = predict(lm_fit, data, type = "response")
# a cutoff of 0.52 (rather than 0.50) is used because the fitted probabilities all sit near 0.5;
# this makes both predicted classes, and hence the linear boundary, visible in the plot
lm_pred = ifelse(lm_prob > 0.52, 1, 0)

data_pos = data[lm_pred == 1, ]
data_neg = data[lm_pred == 0, ]

plot(data_pos$x1, data_pos$x2, col = "black", xlab = "X1", ylab = "X2", pch = "+")
points(data_neg$x1, data_neg$x2, col = "red", pch = 4)


  • (e) Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors (e.g. X1^2, X1 × X2, log(X2), and so forth).
lm_fit = glm(y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2), data = data, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lm_fit)
## 
## Call:
## glm(formula = y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2), family = binomial, 
##     data = data)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## -0.003575   0.000000   0.000000   0.000000   0.003720  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)      236.09   34920.61   0.007    0.995
## poly(x1, 2)1    3608.97  246381.97   0.015    0.988
## poly(x1, 2)2   88150.22 1333540.93   0.066    0.947
## poly(x2, 2)1    3256.75  177352.91   0.018    0.985
## poly(x2, 2)2  -87128.37 1164195.57  -0.075    0.940
## I(x1 * x2)       -33.23  446735.64   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6.9135e+02  on 499  degrees of freedom
## Residual deviance: 3.3069e-05  on 494  degrees of freedom
## AIC: 12
## 
## Number of Fisher Scoring iterations: 25


  • (f) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.
lm_prob = predict(lm_fit, data, type = "response")
lm_pred = ifelse(lm_prob > 0.5, 1, 0)

data_pos = data[lm_pred == 1, ]
data_neg = data[lm_pred == 0, ]

plot(data_pos$x1, data_pos$x2, col = "black", xlab = "X1", ylab = "X2", pch = "+")
points(data_neg$x1, data_neg$x2, col = "red", pch = 4)


  • (g) Fit a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
svm_fit = svm(as.factor(y) ~ x1 + x2, data, kernel = "linear", cost = 0.1)
svm_pred = predict(svm_fit, data)

data_pos = data[svm_pred == 1, ]
data_neg = data[svm_pred == 0, ]

plot(data_pos$x1, data_pos$x2, col = "black", xlab = "X1", ylab = "X2", pch = "+")
points(data_neg$x1, data_neg$x2, col = "red", pch = 4)


  • (h) Fit an SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
svm_fit = svm(as.factor(y) ~ x1 + x2, data, gamma = 1)
svm_pred = predict(svm_fit, data)

data_pos = data[svm_pred == 1, ]
data_neg = data[svm_pred == 0, ]

plot(data_pos$x1, data_pos$x2, col = "black", xlab = "X1", ylab = "X2", pch = "+")
points(data_neg$x1, data_neg$x2, col = "red", pch = 4)


  • (i) Comment on your results.
    • We conclude that an SVM with a non-linear kernel and logistic regression with non-linear transformations of the features both recover the non-linear (here quadratic) decision boundary, whereas the support vector classifier with a linear kernel and logistic regression with only linear terms cannot. A quick way to quantify this is the training-error comparison sketched below.
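
A minimal sketch to quantify this: refit the four models under hypothetical names (so the objects above are not overwritten) and compare their training error rates. The non-linear logistic fit emits the same convergence warnings as before.

models = list(
    logit_linear = glm(y ~ x1 + x2, data = data, family = binomial),
    logit_nonlin = glm(y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2), data = data, family = binomial),
    svm_linear   = svm(as.factor(y) ~ x1 + x2, data, kernel = "linear", cost = 0.1),
    svm_radial   = svm(as.factor(y) ~ x1 + x2, data, gamma = 1)
)

# training error rate of each fit: predicted class vs. true class
sapply(models, function(fit) {
    pred = if (inherits(fit, "glm")) {
        as.numeric(predict(fit, data, type = "response") > 0.5)
    } else {
        as.numeric(as.character(predict(fit, data)))
    }
    mean(pred != data$y)
})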


Exercise 7: In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.

  • (a) Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.
gas.med = median(Auto$mpg)

new.var = ifelse(Auto$mpg > gas.med, 1, 0)

Auto$mpglevel = as.factor(new.var)
head(Auto["mpglevel"])
##   mpglevel
## 1        0
## 2        0
## 3        0
## 4        0
## 5        0
## 6        0
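
As a quick sanity check (a minimal sketch), the split should be roughly 50/50, with cars exactly at the median mpg falling into class 0:

table(Auto$mpglevel)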


  • (b) Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter.
set.seed(3255)

tune.out = tune(svm, mpglevel ~ ., data = Auto, kernel = "linear", ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.01269231 
## 
## - Detailed performance results:
##    cost      error dispersion
## 1 1e-02 0.07397436 0.06863413
## 2 1e-01 0.05102564 0.06923024
## 3 1e+00 0.01269231 0.02154160
## 4 5e+00 0.01519231 0.01760469
## 5 1e+01 0.02025641 0.02303772
## 6 1e+02 0.03294872 0.02898463
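
The cross-validation error as a function of cost can also be inspected directly from the tune object; a minimal sketch using its performances component:

plot(tune.out$performances$cost, tune.out$performances$error, log = "x", type = "b",
    xlab = "cost", ylab = "CV error")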


  • (c) Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost.
set.seed(21)
tune.out = tune(svm, mpglevel ~ ., data = Auto, kernel = "polynomial", ranges = list(cost = c(0.1, 
    1, 5, 10), degree = c(2, 3, 4)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost degree
##    10      2
## 
## - best performance: 0.5435897 
## 
## - Detailed performance results:
##    cost degree     error dispersion
## 1   0.1      2 0.5587821 0.04538579
## 2   1.0      2 0.5587821 0.04538579
## 3   5.0      2 0.5587821 0.04538579
## 4  10.0      2 0.5435897 0.05611162
## 5   0.1      3 0.5587821 0.04538579
## 6   1.0      3 0.5587821 0.04538579
## 7   5.0      3 0.5587821 0.04538579
## 8  10.0      3 0.5587821 0.04538579
## 9   0.1      4 0.5587821 0.04538579
## 10  1.0      4 0.5587821 0.04538579
## 11  5.0      4 0.5587821 0.04538579
## 12 10.0      4 0.5587821 0.04538579
set.seed(463)
tune.out = tune(svm, mpglevel ~ ., data = Auto, kernel = "radial", ranges = list(cost = c(0.1, 
    1, 5, 10), gamma = c(0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##    10  0.01
## 
## - best performance: 0.02551282 
## 
## - Detailed performance results:
##    cost gamma      error dispersion
## 1   0.1 1e-02 0.09429487 0.04814900
## 2   1.0 1e-02 0.07897436 0.03875105
## 3   5.0 1e-02 0.05352564 0.02532795
## 4  10.0 1e-02 0.02551282 0.02417610
## 5   0.1 1e-01 0.07891026 0.03847631
## 6   1.0 1e-01 0.05602564 0.02881876
## 7   5.0 1e-01 0.03826923 0.03252085
## 8  10.0 1e-01 0.03320513 0.02964746
## 9   0.1 1e+00 0.57660256 0.05479863
## 10  1.0 1e+00 0.06628205 0.02996211
## 11  5.0 1e+00 0.06115385 0.02733573
## 12 10.0 1e+00 0.06115385 0.02733573
## 13  0.1 5e+00 0.57660256 0.05479863
## 14  1.0 5e+00 0.51538462 0.06642516
## 15  5.0 5e+00 0.50775641 0.07152757
## 16 10.0 5e+00 0.50775641 0.07152757
## 17  0.1 1e+01 0.57660256 0.05479863
## 18  1.0 1e+01 0.53833333 0.05640443
## 19  5.0 1e+01 0.53070513 0.05708644
## 20 10.0 1e+01 0.53070513 0.05708644
## 21  0.1 1e+02 0.57660256 0.05479863
## 22  1.0 1e+02 0.57660256 0.05479863
## 23  5.0 1e+02 0.57660256 0.05479863
## 24 10.0 1e+02 0.57660256 0.05479863
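
Collecting the best cross-validated error of each kernel from the summaries above (a minimal sketch; the values are copied from this run and would change with a different seed):

data.frame(kernel = c("linear", "polynomial", "radial"),
    best_cv_error = c(0.01269231, 0.5435897, 0.02551282))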


  • (d) Make some plots to back up your assertions in (b) and (c).
svm.linear = svm(mpglevel ~ ., data = Auto, kernel = "linear", cost = 1)

svm.poly = svm(mpglevel ~ ., data = Auto, kernel = "polynomial", cost = 10, 
    degree = 2)

svm.radial = svm(mpglevel ~ ., data = Auto, kernel = "radial", cost = 10, gamma = 0.01)

# plot the fitted SVM's classification regions for mpg against each remaining predictor in turn
plotpairs = function(fit) {
    for (name in names(Auto)[!(names(Auto) %in% c("mpg", "mpglevel", "name"))]) {
        plot(fit, Auto, as.formula(paste("mpg~", name, sep = "")))
    }
}

plotpairs(svm.linear)

plotpairs(svm.poly)

plotpairs(svm.radial)


Exercise 8: This problem involves the OJ data set which is part of the ISLR package.

  • (a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
set.seed(1)

oj_train_samples = OJ %>% resample_partition(c(train = .8, test = .2))
summary(oj_train_samples)
##       Length Class    Mode
## train 2      resample list
## test  2      resample list


  • (b) Fit a support vector classifier to the training data using cost = 0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.
oj_linr_svc = svm(Purchase ~ ., data = oj_train_samples$train, kernel = 'linear', cost = 0.01)
summary(oj_linr_svc)
## 
## Call:
## svm(formula = Purchase ~ ., data = oj_train_samples$train, kernel = "linear", 
##     cost = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  450
## 
##  ( 226 224 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM


  • (c) What are the training and test error rates?
oj_train_samples$train %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_linr_svc, newdata = .)) %>% summarize('Train Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
##   `Train Error Rate`
##                <dbl>
## 1              0.157
oj_train_samples$test %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_linr_svc, newdata = .)) %>% summarize('Test Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
##   `Test Error Rate`
##               <dbl>
## 1             0.205
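
The same error-rate pipeline is reused several times below; a small helper (error_rate is a hypothetical name, not part of the original code) would keep those calls shorter:

# misclassification rate of a fitted SVM on one resample partition
error_rate = function(fit, resample) {
    df = as_tibble(resample)
    mean(df$Purchase != predict(fit, newdata = df))
}

error_rate(oj_linr_svc, oj_train_samples$train)
error_rate(oj_linr_svc, oj_train_samples$test)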


  • (d) Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.
set.seed(1)

tune(svm, Purchase ~ ., data = as_tibble(oj_train_samples$train), kernel = 'linear', ranges = list(cost = 2^seq(-8, 4))) -> oj_svc_tune

summary(oj_svc_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##      cost
##  0.015625
## 
## - best performance: 0.1603557 
## 
## - Detailed performance results:
##           cost     error dispersion
## 1   0.00390625 0.1638577 0.03795702
## 2   0.00781250 0.1626813 0.03729402
## 3   0.01562500 0.1603557 0.03840728
## 4   0.03125000 0.1650205 0.03522022
## 5   0.06250000 0.1638988 0.03695773
## 6   0.12500000 0.1615458 0.03716560
## 7   0.25000000 0.1615595 0.03601742
## 8   0.50000000 0.1603967 0.03777655
## 9   1.00000000 0.1650479 0.03445874
## 10  2.00000000 0.1662244 0.03721419
## 11  4.00000000 0.1627086 0.03446296
## 12  8.00000000 0.1627223 0.03666065
## 13 16.00000000 0.1638851 0.03643699


  • (e) Compute the training and test error rates using this new value for cost.
oj_linr_svc = svm(Purchase ~ ., data = oj_train_samples$train, kernel = 'linear', cost = oj_svc_tune$best.parameters$cost)

oj_train_samples$train %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_linr_svc)) %>% summarize('Train Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
##   `Train Error Rate`
##                <dbl>
## 1              0.152
oj_train_samples$test %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_linr_svc, newdata = .)) %>% summarize('Test Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
##   `Test Error Rate`
##               <dbl>
## 1               0.2


  • (f) Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.
oj_radl_svc = svm(Purchase ~ ., data = oj_train_samples$train, kernel = 'radial')
summary(oj_radl_svc)
## 
## Call:
## svm(formula = Purchase ~ ., data = oj_train_samples$train, kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  390
## 
##  ( 201 189 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
oj_train_samples$train %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_radl_svc, newdata = .)) %>% summarize('Train Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
##   `Train Error Rate`
##                <dbl>
## 1              0.145
oj_train_samples$test %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_radl_svc, newdata = .)) %>% summarize('Test Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
##   `Test Error Rate`
##               <dbl>
## 1             0.186
set.seed(1)

tune(svm, Purchase ~ ., data = as_tibble(oj_train_samples$train), kernel = 'radial', ranges = list(cost = 2^seq(-8, 4))) -> oj_radl_tune
summary(oj_radl_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.1673461 
## 
## - Detailed performance results:
##           cost     error dispersion
## 1   0.00390625 0.3764843 0.06228157
## 2   0.00781250 0.3764843 0.06228157
## 3   0.01562500 0.3764843 0.06228157
## 4   0.03125000 0.3366211 0.07569987
## 5   0.06250000 0.1918605 0.04001821
## 6   0.12500000 0.1755267 0.03617482
## 7   0.25000000 0.1755404 0.03986210
## 8   0.50000000 0.1720246 0.04204614
## 9   1.00000000 0.1673461 0.03585901
## 10  2.00000000 0.1720520 0.04109274
## 11  4.00000000 0.1731874 0.04013555
## 12  8.00000000 0.1825581 0.03589686
## 13 16.00000000 0.1837346 0.03185391
oj_radl_svc <- svm(Purchase ~ ., data = oj_train_samples$train, kernel = 'radial', cost = oj_radl_tune$best.parameters$cost)

oj_train_samples$train %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_radl_svc)) %>% summarize('Train Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
##   `Train Error Rate`
##                <dbl>
## 1              0.145
oj_train_samples$test %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_radl_svc, newdata = .)) %>% summarize('Test Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
##   `Test Error Rate`
##               <dbl>
## 1             0.186
  • (g) Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree=2.
oj_poly_svc = svm(Purchase ~ ., data = oj_train_samples$train, kernel = 'polynomial', degree = 2)
summary(oj_poly_svc)
## 
## Call:
## svm(formula = Purchase ~ ., data = oj_train_samples$train, kernel = "polynomial", 
##     degree = 2)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  2 
##      coef.0:  0 
## 
## Number of Support Vectors:  467
## 
##  ( 238 229 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
oj_train_samples$train %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_poly_svc, newdata = .)) %>% summarize('Train Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
##   `Train Error Rate`
##                <dbl>
## 1              0.174
oj_train_samples$test %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_poly_svc, newdata = .)) %>% summarize('Test Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
##   `Test Error Rate`
##               <dbl>
## 1             0.214
set.seed(1)
tune(svm, Purchase ~ ., data = as_tibble(oj_train_samples$train), kernel = 'polynomial', ranges = list(cost = 2^seq(-8, 4)), degree = 2) -> oj_poly_tune
summary(oj_poly_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     8
## 
## - best performance: 0.1708208 
## 
## - Detailed performance results:
##           cost     error dispersion
## 1   0.00390625 0.3764843 0.06228157
## 2   0.00781250 0.3776471 0.06301831
## 3   0.01562500 0.3495075 0.07242409
## 4   0.03125000 0.3389740 0.06929994
## 5   0.06250000 0.3179617 0.06929113
## 6   0.12500000 0.3016005 0.07149584
## 7   0.25000000 0.2268673 0.03547776
## 8   0.50000000 0.2012312 0.04937234
## 9   1.00000000 0.1918194 0.04481021
## 10  2.00000000 0.1848290 0.04773197
## 11  4.00000000 0.1801505 0.04354956
## 12  8.00000000 0.1708208 0.04408733
## 13 16.00000000 0.1732011 0.04348912
oj_poly_svc <- svm(Purchase ~ ., data = oj_train_samples$train, kernel = 'polynomial', cost = oj_poly_tune$best.parameters$cost)

oj_train_samples$train %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_poly_svc)) %>% summarize('Train Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
##   `Train Error Rate`
##                <dbl>
## 1              0.137
oj_train_samples$test %>% as_tibble() %>% mutate(Purchase_prime = predict(oj_poly_svc, newdata = .)) %>% summarize('Test Error Rate' = mean(Purchase != Purchase_prime))
## # A tibble: 1 x 1
##   `Test Error Rate`
##               <dbl>
## 1             0.219
  • (h) Overall, which approach seems to give the best results on this data?
    • We conclude that the radial kernel gives the best results on this data: it achieves the lowest test error rate (about 18.6%), while the tuned polynomial kernel attains the lowest training error but the highest test error, a sign of overfitting, and the linear kernel sits in between, as collected in the sketch below.
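
A minimal sketch collecting the tuned train/test error rates reported above (hard-coded from this run; a different seed or partition would change them):

data.frame(kernel = c("linear", "radial", "polynomial"),
    train_error = c(0.152, 0.145, 0.137),
    test_error = c(0.200, 0.186, 0.219))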