We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.
(a) Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them.
x1 = runif(500) - 0.5
x2 = runif(500) - 0.5
y = 1*(x1^2 - x2^2 > 0)   # class 1 when x1^2 > x2^2, giving a quadratic decision boundary
(b) Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y-axis.
library(tidyverse)
df = data.frame(x1,x2,y)
plot(x1[y ==0], x2[y == 0], col = 'red', xlab = 'x1',ylab = 'x2')
points(x1[y==1],x2[y==1],col = 'blue')
(c) Fit a logistic regression model to the data, using X1 and X2 as predictors.
(d) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.
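The code for parts (c) and (d) is not shown in the write-up; a minimal sketch, using the df created above (the names glm.fit1, lin.prob, and lin.class are hypothetical):
glm.fit1 = glm(y ~ x1 + x2, data = df, family = "binomial")   # (c) logistic regression, linear in x1 and x2
summary(glm.fit1)
df$lin.prob = predict(glm.fit1, type = "response")    # (d) fitted probabilities on the training data
df$lin.class = ifelse(df$lin.prob > 0.5, 1, 0)        # predicted class labels
plot(df$x1, df$x2, col = ifelse(df$lin.class == 1, "blue", "red"), xlab = "x1", ylab = "x2")
Because this model is linear in x1 and x2, the resulting decision boundary is a straight line.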
(e) Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors (e.g. X1^2, X1 × X2, log(X2), and so forth).
(f) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.
glm.fit2 = glm(y ~ x1 * x2, data = df, family = "binomial")
df$prob = predict(glm.fit2, type = "response")            # fitted probabilities
df$pred.class = as.factor(ifelse(df$prob > 0.5, 1, 0))    # predicted class labels
df |>
  ggplot(aes(x1, x2)) +
  geom_point(aes(colour = pred.class)) +
  labs(title = "Logistic Regression with a Non-Linear Decision Boundary", subtitle = "Y ~ X1 * X2") +
  theme_minimal()
(g) Fit a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
library(caret)
set.seed(12)
control.cv= trainControl(method = "cv", number = 10)
svm = train(as.factor(y) ~ x1 + x2, data = df, method = "svmLinear", trControl = control.cv)   # use only x1 and x2 as predictors, per the question
svm
Support Vector Machines with Linear Kernel
500 samples
6 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 450, 450, 450, 450, 450, 451, ...
Resampling results:
Accuracy Kappa
0.9621961 0.9243392
Tuning parameter 'C' was held constant at a value of 1
df$svm_pred = predict(svm, newdata = df)
df |>
  ggplot(aes(x = x1, y = x2, color = svm_pred)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Support Vector Machine", subtitle = "Color by Predicted Class")
(h) Fit a SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
set.seed(12)
nonlin.svm = train(as.factor(y) ~ x1 + x2, data = df, method = "svmRadial", trControl = control.cv)
nonlin.svm
Support Vector Machines with Radial Basis Function Kernel
500 samples
6 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 450, 450, 450, 450, 450, 451, ...
Resampling results across tuning parameters:
C Accuracy Kappa
0.25 0.9621961 0.9243392
0.50 0.9621961 0.9243392
1.00 0.9621961 0.9243392
Tuning parameter 'sigma' was held constant at a value of 0.3029571
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were sigma = 0.3029571 and C
= 0.25.
df$svm_pred_nonlin = predict(nonlin.svm,newdata = df)
df |>
  ggplot(aes(x1, x2, color = svm_pred_nonlin)) +
  geom_point() +
  labs(title = "Support Vector Machine with Radial Kernel", subtitle = "Colored by Predicted Class") +
  theme_minimal()
(i) Comment on your results. caret tunes the radial-kernel SVM by cross-validation, and the tuned radial fit reaches essentially the same accuracy as the linear kernel here, so its predicted class labels look much like those from the linear kernel.
7. In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.
library(ISLR2)
library(e1071)
data(Auto)
(a) Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.
# recode mpg as a binary factor: 1 = above-median gas mileage, 0 = below
Auto$mpg = as.factor(ifelse(Auto$mpg > median(Auto$mpg), 1, 0))
(b) Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results. Note you will need to fit the classifier without the gas mileage variable to produce sensible results.
cost.grid = expand.grid(C = c(0.01,.1,1,10,100))
mpg.svm = train(mpg ~ . - name, data = Auto, method = "svmLinear", trControl = control.cv, tuneGrid = cost.grid)
mpg.svm
Support Vector Machines with Linear Kernel
392 samples
8 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 352, 354, 353, 354, 354, 352, ...
Resampling results across tuning parameters:
C Accuracy Kappa
1e-02 0.9107928 0.8215789
1e-01 0.9056613 0.8113158
1e+00 0.9132928 0.8265789
1e+01 0.9080331 0.8160256
1e+02 0.9080331 0.8160256
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was C = 1.
The best cost parameter is C = 1, with a cross-validated accuracy of about 0.913.
(c) Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost. Comment on your results.
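The calls that fit the radial and polynomial models are not shown; a minimal sketch with caret (the object names mpg.rad and mpg.poly match the later plots, and the tuning grid is an assumption chosen to mirror the printed results):
set.seed(12)
mpg.rad = train(mpg ~ . - name, data = Auto, method = "svmRadial", trControl = control.cv, tuneLength = 3)
set.seed(12)
poly.grid = expand.grid(degree = 1:3, scale = c(0.001, 0.01, 0.1), C = c(0.25, 0.5, 1))
mpg.poly = train(mpg ~ . - name, data = Auto, method = "svmPoly", trControl = control.cv, tuneGrid = poly.grid)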
mpg.poly
Support Vector Machines with Polynomial Kernel
392 samples
8 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 353, 353, 354, 353, 353, 352, ...
Resampling results across tuning parameters:
degree scale C Accuracy Kappa
1 0.001 0.25 0.6679858 0.3451886
1 0.001 0.50 0.8013866 0.6030865
1 0.001 1.00 0.8699798 0.7397750
1 0.010 0.25 0.8906309 0.7810256
1 0.010 0.50 0.9032591 0.8064520
1 0.010 1.00 0.9108941 0.8216622
1 0.100 0.25 0.9083300 0.8165374
1 0.100 0.50 0.9083300 0.8164163
1 0.100 1.00 0.9056984 0.8111532
2 0.001 0.25 0.8013866 0.6030865
2 0.001 0.50 0.8699798 0.7397750
2 0.001 1.00 0.8830027 0.7657072
2 0.010 0.25 0.9032591 0.8064520
2 0.010 0.50 0.9108941 0.8216622
2 0.010 1.00 0.9083300 0.8165374
2 0.100 0.25 0.9108941 0.8215411
2 0.100 0.50 0.9081950 0.8161531
2 0.100 1.00 0.9107591 0.8213320
3 0.001 0.25 0.8674798 0.7347750
3 0.001 0.50 0.8778745 0.7555645
3 0.001 1.00 0.8931950 0.7862048
3 0.010 0.25 0.9134582 0.8268403
3 0.010 0.50 0.9108941 0.8216622
3 0.010 1.00 0.9057659 0.8113858
3 0.100 0.25 0.9030668 0.8059034
3 0.100 0.50 0.9159514 0.8317065
3 0.100 1.00 0.9210796 0.8420096
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were degree = 3, scale =
0.1 and C = 1.
For the radial kernel, the best cost parameter is C = 0.5, with an accuracy of about 0.91. For the polynomial kernel, the best combination is degree = 3, scale = 0.1, and C = 1, with an accuracy of about 0.921.
(d) Make some plots to back up your assertions in (b) and (c).
plot(mpg.svm)
The highest accuracy is around 0.91, reached at a cost of 1.
plot(mpg.rad)
Accuracy is around .91 with a Cost parameter of .5.
plot(mpg.poly)
The best values are C = 1, scale = 0.1, and degree = 3, which matches what was stated previously.
8. This problem involves the OJ data set, which is part of the ISLR2 package.
(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
data(OJ)
oj.index = sample(1:nrow(OJ), 800)
oj.train = OJ[oj.index,]
oj.test = OJ[-oj.index,]
(b) Fit a support vector classifier to the training data using cost = 0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.
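The call that produced oj.svm is not shown; a minimal sketch with caret, holding the cost at 0.01 as the question asks:
oj.svm = train(Purchase ~ ., data = oj.train, method = "svmLinear", trControl = control.cv, tuneGrid = expand.grid(C = 0.01))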
oj.svm
Support Vector Machines with Linear Kernel
800 samples
17 predictor
2 classes: 'CH', 'MM'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 721, 720, 720, 719, ...
Resampling results:
Accuracy Kappa
0.8337758 0.6440631
Tuning parameter 'C' was held constant at a value of 0.01
The cross-validated accuracy of this support vector classifier with a linear kernel, fit on all 17 predictors, is about 0.83.
(c) What are the training and test error rates?
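Only the test-set confusion matrix is shown below; the training error rate could be computed the same way, e.g. (a sketch using the objects above, with oj.train.pred a hypothetical name):
oj.train.pred = predict(oj.svm, newdata = oj.train)   # class predictions on the training set
mean(oj.train.pred != oj.train$Purchase)              # training error rate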
oj.svm.pred = predict(oj.svm, newdata = oj.test)   # test-set class predictions
confusionMatrix(oj.svm.pred, oj.test$Purchase)
Confusion Matrix and Statistics
Reference
Prediction CH MM
CH 143 32
MM 19 76
Accuracy : 0.8111
95% CI : (0.7592, 0.856)
No Information Rate : 0.6
P-Value [Acc > NIR] : 8.311e-14
Kappa : 0.5984
Mcnemar's Test P-Value : 0.09289
Sensitivity : 0.8827
Specificity : 0.7037
Pos Pred Value : 0.8171
Neg Pred Value : 0.8000
Prevalence : 0.6000
Detection Rate : 0.5296
Detection Prevalence : 0.6481
Balanced Accuracy : 0.7932
'Positive' Class : CH
The test accuracy is about 0.81 (a test error rate of roughly 0.19), compared with a cross-validated accuracy of about 0.83 on the training data, which could suggest mild overfitting in the svmLinear model.
(d) Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.
oj.svm = train(Purchase ~., data = oj.train, method = "svmLinear", trControl = control.cv, tuneGrid = cost.grid)
oj.svm
Support Vector Machines with Linear Kernel
800 samples
17 predictor
2 classes: 'CH', 'MM'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 720, 721, 720, 720, ...
Resampling results across tuning parameters:
C Accuracy Kappa
1e-02 0.8312299 0.6370552
1e-01 0.8337457 0.6463782
1e+00 0.8387619 0.6561486
1e+01 0.8412148 0.6605159
1e+02 0.8411990 0.6604185
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was C = 10.
(e) Compute the training and test error rates using this new value for cost.
oj.svm.pred = predict(oj.svm,newdata = oj.test)
confusionMatrix(oj.svm.pred,oj.test$Purchase)
Confusion Matrix and Statistics
Reference
Prediction CH MM
CH 142 30
MM 20 78
Accuracy : 0.8148
95% CI : (0.7633, 0.8593)
No Information Rate : 0.6
P-Value [Acc > NIR] : 2.854e-14
Kappa : 0.6082
Mcnemar's Test P-Value : 0.2031
Sensitivity : 0.8765
Specificity : 0.7222
Pos Pred Value : 0.8256
Neg Pred Value : 0.7959
Prevalence : 0.6000
Detection Rate : 0.5259
Detection Prevalence : 0.6370
Balanced Accuracy : 0.7994
'Positive' Class : CH
With the cost tuned over the grid (C = 10 selected), the test accuracy rises slightly to about 0.815, i.e. a test error rate of roughly 0.185.
(f) Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.
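The call that produced oj.svm.rad is not shown; a minimal sketch with caret, using the default sigma estimate and a small cost grid (an assumption consistent with the printout below):
set.seed(12)
oj.svm.rad = train(Purchase ~ ., data = oj.train, method = "svmRadial", trControl = control.cv, tuneLength = 3)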
oj.svm.rad
Support Vector Machines with Radial Basis Function Kernel
800 samples
17 predictor
2 classes: 'CH', 'MM'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 720, 720, 720, 721, ...
Resampling results across tuning parameters:
C Accuracy Kappa
0.25 0.8226033 0.6160577
0.50 0.8238070 0.6182171
1.00 0.8250725 0.6210410
Tuning parameter 'sigma' was held constant at a value of 0.05870129
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were sigma = 0.05870129 and
C = 1.
The best cross-validated accuracy is about 0.825, at C = 1.
oj.svm.rad.pred = predict(oj.svm.rad,newdata = oj.test)
confusionMatrix(oj.svm.rad.pred,oj.test$Purchase)
Confusion Matrix and Statistics
Reference
Prediction CH MM
CH 148 31
MM 14 77
Accuracy : 0.8333
95% CI : (0.7834, 0.8758)
No Information Rate : 0.6
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6434
Mcnemar's Test P-Value : 0.01707
Sensitivity : 0.9136
Specificity : 0.7130
Pos Pred Value : 0.8268
Neg Pred Value : 0.8462
Prevalence : 0.6000
Detection Rate : 0.5481
Detection Prevalence : 0.6630
Balanced Accuracy : 0.8133
'Positive' Class : CH
The test accuracy is about 0.833 (a test error rate of roughly 0.167).
confusionMatrix(oj.svm.rad.pred,oj.test$Purchase)
Confusion Matrix and Statistics
Reference
Prediction CH MM
CH 162 108
MM 0 0
Accuracy : 0.6
95% CI : (0.5389, 0.6589)
No Information Rate : 0.6
P-Value [Acc > NIR] : 0.5264
Kappa : 0
Mcnemar's Test P-Value : <2e-16
Sensitivity : 1.0
Specificity : 0.0
Pos Pred Value : 0.6
Neg Pred Value : NaN
Prevalence : 0.6
Detection Rate : 0.6
Detection Prevalence : 1.0
Balanced Accuracy : 0.5
'Positive' Class : CH
The accuracy dropped down to .6 on the test set.
(g) Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree = 2.
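The call that produced oj.svm.poly is not shown; a minimal sketch with caret (the grid is an assumption chosen to match the printed results, which tune over degrees 1 to 3 rather than fixing degree = 2):
set.seed(12)
oj.poly.grid = expand.grid(degree = 1:3, scale = c(0.001, 0.01, 0.1, 1), C = c(0.25, 0.5, 1, 2))
oj.svm.poly = train(Purchase ~ ., data = oj.train, method = "svmPoly", trControl = control.cv, tuneGrid = oj.poly.grid)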
oj.svm.poly
Support Vector Machines with Polynomial Kernel
800 samples
17 predictor
2 classes: 'CH', 'MM'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results across tuning parameters:
degree scale C Accuracy Kappa
1 0.001 0.25 0.6137537 0.00000000
1 0.001 0.50 0.6137537 0.00000000
1 0.001 1.00 0.6862248 0.23152111
1 0.001 2.00 0.8150250 0.59234779
1 0.010 0.25 0.8212600 0.61138927
1 0.010 0.50 0.8362912 0.64999247
1 0.010 1.00 0.8325566 0.64085386
1 0.010 2.00 0.8325412 0.64192112
1 0.100 0.25 0.8350258 0.64680442
1 0.100 0.50 0.8387912 0.65551745
1 0.100 1.00 0.8387758 0.65503882
1 0.100 2.00 0.8400570 0.65911582
1 1.000 0.25 0.8388070 0.65590840
1 1.000 0.50 0.8400725 0.65816588
1 1.000 1.00 0.8400879 0.65825974
1 1.000 2.00 0.8413225 0.66062428
2 0.001 0.25 0.6137537 0.00000000
2 0.001 0.50 0.6874748 0.23511741
2 0.001 1.00 0.8150250 0.59234779
2 0.001 2.00 0.8312908 0.63879437
2 0.010 0.25 0.8362754 0.64832893
2 0.010 0.50 0.8387600 0.65415544
2 0.010 1.00 0.8400258 0.65767211
2 0.010 2.00 0.8438229 0.66715525
2 0.100 0.25 0.8325879 0.63972766
2 0.100 0.50 0.8238221 0.62019796
2 0.100 1.00 0.8151029 0.60139473
2 0.100 2.00 0.8188217 0.60926107
2 1.000 0.25 0.8063526 0.58327905
2 1.000 0.50 0.8075867 0.58672156
2 1.000 1.00 0.8088526 0.58858804
2 1.000 2.00 0.8088529 0.58823031
3 0.001 0.25 0.6400049 0.08356011
3 0.001 0.50 0.7862584 0.51150175
3 0.001 1.00 0.8300100 0.63268277
3 0.001 2.00 0.8338066 0.64365701
3 0.010 0.25 0.8362754 0.64849317
3 0.010 0.50 0.8375408 0.65273253
3 0.010 1.00 0.8425570 0.66396035
3 0.010 2.00 0.8438537 0.66554860
3 0.100 0.25 0.8263375 0.62528448
3 0.100 0.50 0.8238846 0.62064853
3 0.100 1.00 0.8226192 0.61875712
3 0.100 2.00 0.8213217 0.61532808
3 1.000 0.25 0.7901180 0.54909107
3 1.000 0.50 0.8013996 0.57404855
3 1.000 1.00 0.7901797 0.55225155
3 1.000 2.00 0.7901489 0.55239281
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were degree = 3, scale =
0.01 and C = 2.
Restricting attention to degree = 2, as the question specifies, the best combination is scale = 0.01 and C = 2, with a cross-validated accuracy of about 0.844.
(h) Overall, which approach seems to give the best results on this data? The radial kernel gives the best results, with a test accuracy of about 0.833; the final tuning values were sigma = 0.05870129 and C = 1.