We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- as.factor(1 * (x1^2 - x2^2 > 0))
library(ggplot2)
# data
df <- data.frame(
  x1 = x1,
  x2 = x2,
  y = y
)
# Plot
ggplot(df, aes(x = x1, y = x2, color = y)) +
  geom_point(size = 2) +
  labs(title = "x1 vs x2 by class",
       x = "x1",
       y = "x2") +
  theme_minimal()
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# logistic regression with linear terms only
log <- glm(y ~ x1 + x2, family = 'binomial')
We can repeat the above by using a polynomial of degree 2 for both x1 and x2:
poly <- glm(y ~ poly(x1, 2) + poly(x2, 2), family = 'binomial')
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
poly_preds <- predict(poly, newdata = df[,-3], type='response')
poly_preds <- as.factor(ifelse(poly_preds >= 0.5, 1, 0))
We can plot the predicted classes from each model as follows:
# predicted data
df <- data.frame(
  x1 = x1,
  x2 = x2,
  y = poly_preds
)
# Plot
ggplot(df, aes(x = x1, y = x2, color = y)) +
  geom_point(size = 2) +
  labs(title = "x1 vs x2 by predicted class for polynomial of degree 2",
       x = "x1",
       y = "x2") +
  theme_minimal()
library(e1071)
# note: y, x1, x2 are taken from the global environment (y holds the true classes),
# and since no kernel is specified, svm() uses its default radial kernel
svm <- svm(y ~ x1 + x2)
svm_preds <- predict(svm, newdata = df[,-3], type = 'response')
# predicted data
df <- data.frame(
  x1 = x1,
  x2 = x2,
  y = svm_preds
)
# Plot
ggplot(df, aes(x = x1, y = x2, color = y)) +
  geom_point(size = 2) +
  labs(title = "x1 vs x2 by predicted class for SVM",
       x = "x1",
       y = "x2") +
  theme_minimal()
svm_radial <- svm(y ~ x1 + x2, kernel ='radial')
svm_radial_preds <- predict(svm_radial, newdata = df[,-3], type = 'response')
# predicted data
df <- data.frame(
  x1 = x1,
  x2 = x2,
  y = svm_radial_preds
)
# Plot
ggplot(df, aes(x = x1, y = x2, color = y)) +
  geom_point(size = 2) +
  labs(title = "x1 vs x2 by predicted class for SVM with radial kernel",
       x = "x1",
       y = "x2") +
  theme_minimal()
We can now compare the training accuracy of each method:
## SVM (linear kernel) Accuracy: 0.98
## SVM (radial kernel) Accuracy: 0.98
## Logistic Regression Accuracy: 0.394
## Polynomial Logistic Regression Accuracy: 1
We see that the best performance comes from logistic regression with polynomial terms, which gives a perfect fit. It is important to note, however, that these are training accuracies, so the perfect fit is likely a sign of overfitting. The SVM with either a linear or a radial kernel gives near-perfect performance without any tuning, showing that they are strong classifiers out of the box. In practical terms, the SVM is the likely winner here despite the perfect training accuracy of the polynomial logistic regression.
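The chunk that produced these accuracies is not echoed above. Below is a minimal sketch of how they could be computed from the fitted objects; the helper train_accuracy and the explicit linear-kernel refit svm_linear_fit are illustrative names and assumptions, not the original code.
# Sketch (assumed, not the original chunk): training accuracy for each model.
# y, x1, x2 still hold the simulated data; log, poly, svm_radial are the fits above.
train_accuracy <- function(pred) mean(as.character(pred) == as.character(y))
# class predictions from the two logistic regressions
log_preds2 <- as.factor(ifelse(fitted(log) >= 0.5, 1, 0))
poly_preds2 <- as.factor(ifelse(fitted(poly) >= 0.5, 1, 0))
# an explicit linear-kernel SVM, refit here since svm() above used its default kernel
svm_linear_fit <- svm(y ~ x1 + x2, kernel = "linear")
train_accuracy(fitted(svm_linear_fit)) # SVM (linear kernel)
train_accuracy(fitted(svm_radial)) # SVM (radial kernel)
train_accuracy(log_preds2) # logistic regression
train_accuracy(poly_preds2) # polynomial logistic regression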
In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.
library(ISLR2)
auto <- Auto
auto$higher_mileage <- as.factor(ifelse(Auto$mpg > median(Auto$mpg), 1, 0))
auto <- auto[,-1] # removing mpg
table(auto$higher_mileage)
##
## 0 1
## 196 196
We can perform 10-fold cross-validation for a linear-kernel SVM over a range of cost values, and see the results below:
set.seed(123)
tuning_linear <- tune(
  "svm",
  higher_mileage ~ .,
  data = auto,
  kernel = "linear",
  ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)) # cost parameters
)
summary(tuning_linear)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.08910256
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.13250000 0.06630411
## 2 1e-02 0.08910256 0.03791275
## 3 1e-01 0.09403846 0.04842472
## 4 1e+00 0.09147436 0.05817937
## 5 5e+00 0.10423077 0.06425850
## 6 1e+01 0.10673077 0.06562804
## 7 1e+02 0.12724359 0.06371052
We find the best performance using a cost of 1e-02, with an error of 0.0891 and a dispersion of 0.0379.
We can tune the radial kernel by adding gamma as a tuning parameter:
set.seed(123)
tuning_radial <- tune(
  "svm",
  higher_mileage ~ .,
  data = auto,
  kernel = "radial",
  ranges = list(
    cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100),
    gamma = c(0.01, 0.1, 1)
  )
)
summary(tuning_radial)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 5 0.1
##
## - best performance: 0.07365385
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-03 0.01 0.58173077 0.04740051
## 2 1e-02 0.01 0.58173077 0.04740051
## 3 1e-01 0.01 0.11211538 0.05486624
## 4 1e+00 0.01 0.08910256 0.03791275
## 5 5e+00 0.01 0.08397436 0.03933133
## 6 1e+01 0.01 0.09403846 0.04842472
## 7 1e+02 0.01 0.09147436 0.05690990
## 8 1e-03 0.10 0.58173077 0.04740051
## 9 1e-02 0.10 0.26000000 0.09153493
## 10 1e-01 0.10 0.08910256 0.03979296
## 11 1e+00 0.10 0.08653846 0.03578151
## 12 5e+00 0.10 0.07365385 0.05975763
## 13 1e+01 0.10 0.07365385 0.05852240
## 14 1e+02 0.10 0.10442308 0.05095556
## 15 1e-03 1.00 0.58173077 0.04740051
## 16 1e-02 1.00 0.58173077 0.04740051
## 17 1e-01 1.00 0.58173077 0.04740051
## 18 1e+00 1.00 0.08660256 0.04466420
## 19 5e+00 1.00 0.08916667 0.05328055
## 20 1e+01 1.00 0.08916667 0.05328055
## 21 1e+02 1.00 0.08916667 0.05328055
Using the radial kernel, the lowest cross-validation error of 0.0737 occurs at gamma = 0.10 for both cost 5 and cost 10 (tune() reports cost 5 as the best parameters); the cost = 10 fit has a slightly smaller dispersion of 0.0585. So far this is the best performing model.
We can then do the same with the polynomial kernel, but using degree instead of gamma:
set.seed(123)
tuning_polynomial <- tune(
  "svm",
  higher_mileage ~ .,
  data = auto,
  kernel = "polynomial",
  ranges = list(
    cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100),
    degree = c(2, 3, 4)
  )
)
summary(tuning_polynomial)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost degree
## 100 2
##
## - best performance: 0.3086538
##
## - Detailed performance results:
## cost degree error dispersion
## 1 1e-03 2 0.5817308 0.04740051
## 2 1e-02 2 0.5817308 0.04740051
## 3 1e-01 2 0.5817308 0.04740051
## 4 1e+00 2 0.5817308 0.04740051
## 5 5e+00 2 0.5817308 0.04740051
## 6 1e+01 2 0.5714744 0.04575370
## 7 1e+02 2 0.3086538 0.10382736
## 8 1e-03 3 0.5817308 0.04740051
## 9 1e-02 3 0.5817308 0.04740051
## 10 1e-01 3 0.5817308 0.04740051
## 11 1e+00 3 0.5817308 0.04740051
## 12 5e+00 3 0.5817308 0.04740051
## 13 1e+01 3 0.5817308 0.04740051
## 14 1e+02 3 0.4159615 0.12008716
## 15 1e-03 4 0.5817308 0.04740051
## 16 1e-02 4 0.5817308 0.04740051
## 17 1e-01 4 0.5817308 0.04740051
## 18 1e+00 4 0.5817308 0.04740051
## 19 5e+00 4 0.5817308 0.04740051
## 20 1e+01 4 0.5817308 0.04740051
## 21 1e+02 4 0.5817308 0.04740051
Interestingly, we see almost identical performance across the board, with most models having an error of 0.5817 and a dispersion of 0.0474. The best performing model uses a cost of 100 and a degree of 2, with an error of 0.3086 and a dispersion of 0.1038, still well behind the linear and radial kernels.
Hint: In the lab, we used the plot() function for svm objects only in cases with p = 2. When p > 2, you can use the plot() function to create plots displaying pairs of variables at a time. Essentially, instead of typing
plot(svmfit, dat)
where svmfit contains your fitted model and dat is a data frame containing your data, you can type
plot(svmfit, dat, x1 ~ x4)
in order to plot just the first and fourth variables. However, you must replace x1 and x4 with the correct variable names. To find out more, type ?plot.svm.
We can compare the cross-validation errors of each kernel in a single plot:
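The comparison plot comes from a chunk that is not echoed; the sketch below shows one way it could be built from the three tuning objects. The use of dplyr, and taking the minimum error over gamma/degree at each cost, are assumptions.
library(dplyr)
# Sketch (assumed): combine the detailed CV results and plot error against cost per kernel
cv_results <- bind_rows(
  tuning_linear$performances %>% mutate(kernel = "linear"),
  tuning_radial$performances %>% mutate(kernel = "radial"),
  tuning_polynomial$performances %>% mutate(kernel = "polynomial")
) %>%
  group_by(kernel, cost) %>%
  summarise(error = min(error), .groups = "drop") # best error over gamma/degree at each cost
ggplot(cv_results, aes(x = cost, y = error, color = kernel)) +
  geom_line() +
  geom_point() +
  scale_x_log10() +
  labs(title = "10-fold CV error by cost for each kernel",
       x = "cost (log scale)",
       y = "CV error") +
  theme_minimal()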
We see that, as stated above, the best model was the radial kernel with gamma = 0.1 and cost = 10. We will refit this best model and then make a few different plots:
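The refit and the plots are not echoed in the output above; a minimal sketch follows, in which the particular variable pairs plotted are illustrative choices.
# Sketch (assumed): refit the radial-kernel SVM with the best CV parameters
best_radial <- svm(higher_mileage ~ ., data = auto, kernel = "radial", cost = 10, gamma = 0.1)
# plot.svm() shows the fitted decision regions for one pair of predictors at a time,
# holding the remaining predictors constant (see the slice argument in ?plot.svm)
plot(best_radial, auto, weight ~ horsepower)
plot(best_radial, auto, acceleration ~ displacement)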
We see that no single pair of variables provides an easily visible decision boundary, due to the high-dimensional nature of the data.
This problem involves the OJ data set which is part of the ISLR2 package.
set.seed(123)
n <- nrow(OJ)
train_idx <- sample(1:n, size = 800)
# Create training and test sets
train <- OJ[train_idx, ]
test <- OJ[-train_idx, ]
# note: no kernel is specified, so svm() uses its default radial kernel
# (confirmed by SVM-Kernel: radial in the summary below)
svm_linear <- svm(Purchase ~ ., data = train, cost = 0.01)
summary(svm_linear)
##
## Call:
## svm(formula = Purchase ~ ., data = train, cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.01
##
## Number of Support Vectors: 629
##
## ( 313 316 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
We see 629 support vectors in total: 313 belonging to class CH and 316 to class MM.
## Training Accuracy: 0.60875
## Test Accuracy: 0.6148148
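The accuracy computation itself is not shown; a minimal sketch, with an assumed helper named accuracy, is:
# Sketch (assumed): share of correct class predictions on a given data set
accuracy <- function(model, data) mean(predict(model, newdata = data) == data$Purchase)
accuracy(svm_linear, train) # training accuracy
accuracy(svm_linear, test) # test accuracy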
set.seed(123)
tuning <- tune('svm', Purchase ~ ., data = train, kernel = 'linear', ranges = list(cost = seq(0.01, 0.1, by = 0.01)))
summary(tuning)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.04
##
## - best performance: 0.17
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.17375 0.04910660
## 2 0.02 0.17250 0.04816061
## 3 0.03 0.17250 0.04887626
## 4 0.04 0.17000 0.04758034
## 5 0.05 0.17125 0.04678927
## 6 0.06 0.17125 0.04678927
## 7 0.07 0.17125 0.04678927
## 8 0.08 0.17250 0.04632314
## 9 0.09 0.17250 0.04669642
## 10 0.10 0.17500 0.04823265
We see the best model is at cost 0.04, with a cross-validation error of 0.17.
best_model <- tuning$best.model
## Training Accuracy: 0.60875
## Test Accuracy: 0.6148148
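As before, the computation is not echoed; reusing the accuracy helper sketched above:
accuracy(best_model, train) # training accuracy of the tuned linear-kernel model
accuracy(best_model, test) # test accuracy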
set.seed(123)
tuning_radial <- tune( 'svm', Purchase ~ ., data = train, kernel = 'radial', ranges = list(cost = seq(0.01, 0.1, by = 0.01)))
summary(tuning_radial)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.1
##
## - best performance: 0.17625
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.39125 0.04411554
## 2 0.02 0.39125 0.04411554
## 3 0.03 0.38625 0.04466309
## 4 0.04 0.26375 0.06079622
## 5 0.05 0.22000 0.06851602
## 6 0.06 0.19500 0.06351946
## 7 0.07 0.18625 0.06108112
## 8 0.08 0.17750 0.06003471
## 9 0.09 0.17750 0.05945353
## 10 0.10 0.17625 0.06108112
Here we see the best model at cost 0.10, which we will continue with and calculate the accuracy for:
## Training Accuracy: 0.84
## Test Accuracy: 0.8037037
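Again using the accuracy helper sketched above (the object name best_radial_oj is an assumption):
best_radial_oj <- tuning_radial$best.model
accuracy(best_radial_oj, train) # training accuracy
accuracy(best_radial_oj, test) # test accuracy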
set.seed(123)
tuning_polynomial <- tune( 'svm', Purchase ~ ., data = train, kernel = 'polynomial', degree = 2, ranges = list(cost = seq(0.01, 0.1, by = 0.01)))
summary(tuning_polynomial)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.1
##
## - best performance: 0.31
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.39000 0.04281744
## 2 0.02 0.36250 0.04930066
## 3 0.03 0.35750 0.05210833
## 4 0.04 0.35375 0.05272110
## 5 0.05 0.33500 0.04816061
## 6 0.06 0.31875 0.05077524
## 7 0.07 0.31500 0.05296750
## 8 0.08 0.31375 0.05573063
## 9 0.09 0.31375 0.05573063
## 10 0.10 0.31000 0.05361903
Here we see the best model at cost 0.10, which we will continue with and calculate the accuracy for:
## Training Accuracy: 0.71125
## Test Accuracy: 0.6962963
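And for the polynomial kernel, with the same accuracy helper (best_poly_oj is an assumed name):
best_poly_oj <- tuning_polynomial$best.model
accuracy(best_poly_oj, train) # training accuracy
accuracy(best_poly_oj, test) # test accuracy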
We can compare the test accuracy for all the models:
## Linear Accuracy: 0.6148148
## Radial Accuracy: 0.8037037
## Polynomial Accuracy: 0.6962963
We see that the best performing model is the SVM with a radial kernel, with a test accuracy of about 80.4%.