5. We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.

a) Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them.

library(knitr)
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(ggthemes)
library(caret)
library(e1071)
theme_set(theme_tufte(base_size = 14))

df <- data.frame(replicate(2, rnorm(500)))
df <- as_tibble(df)
# True class rule: the boundary between the classes is the parabola x^2 + y = 1
class_func <- function(x, y) {
    x^2 + y < 1
}

df <- df %>%
    rename(Var1 = X1, Var2 = X2) %>%
    mutate(Class = ifelse(class_func(Var1, Var2),
                          'Class A',
                          'Class B'),
           Class = factor(Class))
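
A quick sanity check of the resulting class balance (not required by the prompt, just a hypothetical check):

# Number of observations on each side of the quadratic boundary
table(df$Class)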

b) Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y-axis.

# 60/40 train/test split, reused for all of the model fits below
inTrain <- sample(nrow(df), nrow(df)*0.6, replace = FALSE)

training <- df[inTrain,]
testing <- df[-inTrain,]

ggplot(df, aes(Var1, Var2, col = Class)) +
    geom_point(size = 2)

c) Fit a logistic regression model to the data, using X1 and X2 as predictors.

logreg_fit <- glm(Class ~ ., data = training, family = 'binomial')
summary(logreg_fit)
## 
## Call:
## glm(formula = Class ~ ., family = "binomial", data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8249  -0.7919  -0.2596   0.6825   2.4252  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.4213     0.1505  -2.800  0.00512 ** 
## Var1         -0.1080     0.1429  -0.756  0.44992    
## Var2          2.0213     0.2400   8.421  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 412.47  on 299  degrees of freedom
## Residual deviance: 279.31  on 297  degrees of freedom
## AIC: 285.31
## 
## Number of Fisher Scoring iterations: 5

d) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.

pred <- predict(logreg_fit, testing, type = 'response')
# glm() models the probability of the second factor level, 'Class B'
pred <- ifelse(pred >= 0.5, 'Class B', 'Class A')
mean(pred == testing$Class)
## [1] 0.73
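
The prompt also asks for a plot of the observations colored by the predicted class labels; a minimal sketch (reusing `pred` and `testing` from above) would be:

# Sketch: test observations colored by predicted class; with only linear
# terms in the model, the boundary between the predicted regions is a line.
testing %>%
    mutate(Predicted = factor(pred)) %>%
    ggplot(aes(Var1, Var2, col = Predicted)) +
    geom_point(size = 2)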

e) Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors.

logreg_fit <- glm(Class ~ poly(Var1, 2) + Var2, data = training, family = 'binomial')
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logreg_fit)
## 
## Call:
## glm(formula = Class ~ poly(Var1, 2) + Var2, family = "binomial", 
##     data = training)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -8.848e-04  -2.000e-08  -2.000e-08   2.000e-08   9.595e-04  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)         64.51    5701.25   0.011    0.991
## poly(Var1, 2)1   -2363.93  127037.62  -0.019    0.985
## poly(Var1, 2)2   25924.46 1078037.74   0.024    0.981
## Var2               912.27   37674.57   0.024    0.981
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.1247e+02  on 299  degrees of freedom
## Residual deviance: 1.9387e-06  on 296  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 25
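
The convergence warnings arise because the quadratic term reproduces the generating rule exactly, so the two classes are perfectly separated and the coefficient estimates diverge. A quick check of this (a hypothetical diagnostic, not part of the original answer):

# By construction, Class A satisfies Var1^2 + Var2 < 1 and Class B does not,
# so a model containing these terms can separate the classes perfectly.
training %>%
    mutate(score = Var1^2 + Var2) %>%
    group_by(Class) %>%
    summarise(min_score = min(score), max_score = max(score))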

f) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.

pred <- predict(logreg_fit, testing, type = 'response')
pred <- ifelse(pred >= 0.5, 'Class B', 'Class A')
mean(pred == testing$Class)
## [1] 0.995
tibble(pred = pred, class = testing$Class) %>%
    ggplot(aes(pred, class)) +
    geom_jitter()
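
As in (d), a sketch of the plot the prompt asks for, coloring the test observations by their predicted class (the boundary between the predicted regions should now look parabolic):

# Sketch: predicted classes from the logistic regression with a quadratic term
testing %>%
    mutate(Predicted = factor(pred)) %>%
    ggplot(aes(Var1, Var2, col = Predicted)) +
    geom_point(size = 2)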

g) Fit a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

svm_fit <- svm(Class ~ ., data = training,
               kernel = 'linear',
               scale = FALSE)
plot(svm_fit, testing) 

mean(predict(svm_fit, testing) == testing$Class)
## [1] 0.745

h) Fit an SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

svm_poly <- svm(Class ~ ., data = training,
                kernel = 'polynomial', degree = 2,
                scale = FALSE)
plot(svm_poly, testing)

mean(predict(svm_poly, testing) == testing$Class)
## [1] 0.79
svm_radial <- svm(Class ~ ., data = training,
                  kernel = 'radial',
                  scale = FALSE)
plot(svm_radial, testing)

mean(predict(svm_radial, testing) == testing$Class)
## [1] 0.96

i) Comment on your results: Support vector machines and logistic regression give very similar results here. The purely linear fits (logistic regression on the raw features and the linear-kernel support vector classifier) both reach roughly 73-75% test accuracy, while the non-linear fits do much better: the radial-kernel SVM reaches 96% and logistic regression with a quadratic term reaches 99.5%. With the right non-linear transformations of the features, logistic regression was the best model here.

7. In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.

a) Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.

library(ISLR)
theme_set(theme_tufte(base_size = 14))
set.seed(1)

data(Auto)
Auto <- as_tibble(Auto)

Auto <- Auto %>%
    mutate(highmpg = as.integer(mpg > median(mpg))) %>%
    mutate(highmpg = factor(highmpg),
           cylinders = factor(cylinders))

Auto %>%
    sample_n(5) %>%
    select(mpg, highmpg)
## # A tibble: 5 x 2
##     mpg highmpg
##   <dbl> <fct>  
## 1  44.3 1      
## 2  23   1      
## 3  26   1      
## 4  23.9 1      
## 5  23.2 1
Auto <- Auto %>%
    select(-mpg, -name)

dummy_trans <- dummyVars(highmpg ~ ., data = Auto)
Auto_dummy <- predict(dummy_trans, Auto)
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev
## = object$lvls): variable 'highmpg' is not a factor
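
One hypothetical way to avoid the warning about `highmpg` is to build the dummy encoding on the predictor columns only:

# Encode only the predictors, leaving the response out of the dummyVars() call
dummy_trans <- dummyVars(~ ., data = select(Auto, -highmpg))
Auto_dummy <- predict(dummy_trans, select(Auto, -highmpg))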

b) Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results.

svm_linear <- train(x = Auto_dummy, y = Auto$highmpg,
                    method = 'svmLinear2',
                    trControl = trainControl(method = 'cv', number = 10, allowParallel = TRUE),
                    preProcess = c('center', 'scale'),
                    tuneGrid = expand.grid(cost = seq(1, 20, by = 1)))
svm_linear$finalModel
## 
## Call:
## svm.default(x = as.matrix(x), y = y, kernel = "linear", cost = param$cost, 
##     probability = classProbs)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  2 
##       gamma:  0.09090909 
## 
## Number of Support Vectors:  75
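
Part (b) asks for the cross-validation error at each value of cost, not just the selected model; the full resampling results are stored on the `train` object and can be inspected, for example:

# Cross-validated accuracy (1 - error) and Kappa for every candidate cost
svm_linear$results[, c('cost', 'Accuracy', 'Kappa')]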

c) Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost. Comment on your results.

svm_poly <- train(x = Auto_dummy, y = Auto$highmpg,
                  method = 'svmPoly',
                  trControl = trainControl(method = 'cv', number = 10, allowParallel = TRUE),
                  preProcess = c('center', 'scale'),
                  tuneGrid = expand.grid(degree = seq(1, 8, by = 1),
                                         C = seq(1, 5, by = 1),
                                         scale = TRUE))
svm_poly$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Polynomial kernel function. 
##  Hyperparameters : degree =  2  scale =  TRUE  offset =  1 
## 
## Number of Support Vectors : 71 
## 
## Objective Function Value : -45.587 
## Training error : 0.045918
svm_radial <- train(x = Auto_dummy, y = Auto$highmpg,
                  method = 'svmRadial',
                  trControl = trainControl(method = 'cv', number = 10, allowParallel = TRUE),
                  preProcess = c('center', 'scale'),
                  tuneGrid = expand.grid(C = seq(0.001, 3, length.out = 10),
                                         sigma = seq(0.2, 2, length.out = 5)))
svm_radial$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1.00066666666667 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  1.55 
## 
## Number of Support Vectors : 230 
## 
## Objective Function Value : -73.7206 
## Training error : 0.02551
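
To see which hyperparameter combinations the cross-validation selected for each kernel, one can look at `bestTune` (a quick inspection, not part of the original write-up):

# Selected tuning parameters for the polynomial and radial fits
svm_poly$bestTune
svm_radial$bestTune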

d) Make some plots to back up your assertions in (b) and (c).

plot(svm_linear)

plot(svm_poly)

plot(svm_radial)

postResample(predict(svm_linear), Auto$highmpg)
##  Accuracy     Kappa 
## 0.9285714 0.8571429
postResample(predict(svm_poly), Auto$highmpg)
##  Accuracy     Kappa 
## 0.9540816 0.9081633
postResample(predict(svm_radial), Auto$highmpg)
##  Accuracy     Kappa 
## 0.9744898 0.9489796
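
Note that `postResample(predict(model), ...)` without `newdata` scores each model on the data it was trained on. A cross-validated comparison can be made with caret's `resamples()` (a sketch, assuming all three fits used 10-fold CV as above):

# Compare cross-validated accuracy across the three kernels
summary(resamples(list(linear = svm_linear,
                       poly   = svm_poly,
                       radial = svm_radial)))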

8. This problem involves the OJ data set which is part of the ISLR package.

a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

set.seed(1)

data('OJ')

inTrain <- sample(nrow(OJ), 800, replace = FALSE)

training <- OJ[inTrain,]
testing <- OJ[-inTrain,]

b) Fit a support vector classifier to the training data using cost=0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.

svm_linear <- svm(Purchase ~ ., data = training,
                  kernel = 'linear',
                  cost = 0.01)
summary(svm_linear)
## 
## Call:
## svm(formula = Purchase ~ ., data = training, kernel = "linear", 
##     cost = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
##       gamma:  0.05555556 
## 
## Number of Support Vectors:  435
## 
##  ( 219 216 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

c) What are the training and test error rates?

postResample(predict(svm_linear, training), training$Purchase)
##  Accuracy     Kappa 
## 0.8250000 0.6313971
postResample(predict(svm_linear, testing), testing$Purchase)
##  Accuracy     Kappa 
## 0.8222222 0.6082699
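
The prompt asks for error rates; these are simply one minus the accuracies above, or can be computed directly (a small sketch):

# Training and test error rates for the cost = 0.01 linear support vector classifier
mean(predict(svm_linear, training) != training$Purchase)
mean(predict(svm_linear, testing) != testing$Purchase)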

d) Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.

svm_linear_tune <- train(Purchase ~ ., data = training,
                         method = 'svmLinear2',
                         trControl = trainControl(method = 'cv', number = 10),
                         preProcess = c('center', 'scale'),
                         tuneGrid = expand.grid(cost = seq(0.01, 10, length.out = 20)))
svm_linear_tune
## Support Vector Machines with Linear Kernel 
## 
## 800 samples
##  17 predictor
##   2 classes: 'CH', 'MM' 
## 
## Pre-processing: centered (17), scaled (17) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 721, 720, 720, 720, 721, 719, ... 
## Resampling results across tuning parameters:
## 
##   cost        Accuracy   Kappa    
##    0.0100000  0.8199215  0.6202565
##    0.5357895  0.8273760  0.6360834
##    1.0615789  0.8236101  0.6284665
##    1.5873684  0.8261105  0.6333280
##    2.1131579  0.8261105  0.6333280
##    2.6389474  0.8273605  0.6362121
##    3.1647368  0.8261105  0.6338114
##    3.6905263  0.8248605  0.6309732
##    4.2163158  0.8248605  0.6309732
##    4.7421053  0.8261105  0.6338114
##    5.2678947  0.8273605  0.6361662
##    5.7936842  0.8273605  0.6361662
##    6.3194737  0.8260947  0.6331693
##    6.8452632  0.8260947  0.6331693
##    7.3710526  0.8260947  0.6331693
##    7.8968421  0.8273605  0.6361662
##    8.4226316  0.8273605  0.6361662
##    8.9484211  0.8273605  0.6361662
##    9.4742105  0.8248447  0.6308145
##   10.0000000  0.8248447  0.6308145
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cost = 0.5357895.
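
The prompt suggests e1071's `tune()`; an equivalent sketch (hypothetical, with a coarse grid spanning 0.01 to 10) would be:

# Cross-validated selection of cost with e1071::tune(), as in the ISLR labs
tune_out <- tune(svm, Purchase ~ ., data = training,
                 kernel = 'linear',
                 ranges = list(cost = c(0.01, 0.1, 1, 5, 10)))
summary(tune_out)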

e) Compute the training and test error rates using this new value for cost.

postResample(predict(svm_linear_tune, training), training$Purchase)
##  Accuracy     Kappa 
## 0.8350000 0.6524601
postResample(predict(svm_linear_tune, testing), testing$Purchase)
##  Accuracy     Kappa 
## 0.8444444 0.6585983

f) Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.

svm_radial <- svm(Purchase ~ ., data = training,
                  kernel = 'radial',
                  cost = 0.01)
summary(svm_radial)
## 
## Call:
## svm(formula = Purchase ~ ., data = training, kernel = "radial", 
##     cost = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.01 
##       gamma:  0.05555556 
## 
## Number of Support Vectors:  634
## 
##  ( 319 315 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
postResample(predict(svm_radial, training), training$Purchase)
## Accuracy    Kappa 
##  0.60625  0.00000
postResample(predict(svm_radial, testing), testing$Purchase)
##  Accuracy     Kappa 
## 0.6222222 0.0000000
svm_radial_tune <- train(Purchase ~ ., data = training,
                         method = 'svmRadial',
                         trControl = trainControl(method = 'cv', number = 10),
                         preProcess = c('center', 'scale'),
                         tuneGrid = expand.grid(C = seq(0.01, 10, length.out = 20),
                                                sigma = 0.05691))
svm_radial_tune
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 800 samples
##  17 predictor
##   2 classes: 'CH', 'MM' 
## 
## Pre-processing: centered (17), scaled (17) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 720, 719, 719, 721, 719, 720, ... 
## Resampling results across tuning parameters:
## 
##   C           Accuracy   Kappa    
##    0.0100000  0.6062600  0.0000000
##    0.5357895  0.8274369  0.6315267
##    1.0615789  0.8249527  0.6267051
##    1.5873684  0.8199986  0.6165675
##    2.1131579  0.8174982  0.6105624
##    2.6389474  0.8149824  0.6041027
##    3.1647368  0.8112166  0.5964807
##    3.6905263  0.8112166  0.5964807
##    4.2163158  0.8124512  0.5993391
##    4.7421053  0.8137170  0.6021336
##    5.2678947  0.8137174  0.6017074
##    5.7936842  0.8137174  0.6017074
##    6.3194737  0.8124828  0.5988491
##    6.8452632  0.8124828  0.5988491
##    7.3710526  0.8137641  0.6020343
##    7.8968421  0.8112324  0.5967764
##    8.4226316  0.8112324  0.5967764
##    8.9484211  0.8099666  0.5939493
##    9.4742105  0.8124982  0.5992398
##   10.0000000  0.8124982  0.5992398
## 
## Tuning parameter 'sigma' was held constant at a value of 0.05691
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05691 and C = 0.5357895.
postResample(predict(svm_radial_tune, training), training$Purchase)
## Accuracy    Kappa 
## 0.851250 0.684392
postResample(predict(svm_radial_tune, testing), testing$Purchase)
##  Accuracy     Kappa 
## 0.8185185 0.6040582

g) Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree=2.

svm_poly <- svm(Purchase ~ ., data = training,
                kernel = 'polynomial', degree = 2,
                cost = 0.01)
summary(svm_poly)
## 
## Call:
## svm(formula = Purchase ~ ., data = training, method = "polynomial", 
##     degree = 2, cost = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.01 
##       gamma:  0.05555556 
## 
## Number of Support Vectors:  634
## 
##  ( 319 315 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
postResample(predict(svm_poly, training), training$Purchase)
## Accuracy    Kappa 
##  0.60625  0.00000
postResample(predict(svm_poly, testing), testing$Purchase)
##  Accuracy     Kappa 
## 0.6222222 0.0000000
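
Parts (d) and (e) were not repeated for the polynomial kernel; a tuning sketch mirroring the radial case (hypothetical, results not shown) would be:

# Tune the degree-2 polynomial SVM over cost, as done for the other kernels
svm_poly_tune <- train(Purchase ~ ., data = training,
                       method = 'svmPoly',
                       trControl = trainControl(method = 'cv', number = 10),
                       preProcess = c('center', 'scale'),
                       tuneGrid = expand.grid(degree = 2,
                                              C = seq(0.01, 10, length.out = 20),
                                              scale = TRUE))
postResample(predict(svm_poly_tune, training), training$Purchase)
postResample(predict(svm_poly_tune, testing), testing$Purchase)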

h) Overall, which approach seems to give the best results on this data?

Overall the tuned linear and radial kernels perform very similarly in cross-validation (both around 83% accuracy), but on the held-out test set the tuned linear classifier does slightly better (about 84% versus 82% for the tuned radial kernel). The polynomial kernel was only fit at cost = 0.01, where it degenerates to predicting the majority class, so it would need tuning before a fair comparison.