library(ggplot2)
library(e1071)
library(ISLR)
library(ISLR2)
library(tidyverse)Predictive Modeling Homework 8
Load libraries
Exercise 5
We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features. (a) Generate a data set with n = 500 and p = 2, such that the obser- vations belong to two classes with a quadratic decision boundary between them.
set.seed(1)
n <- 500
x1 <- runif(n) - 0.5
x2 <- runif(n) - 0.5
y <- factor(1 * (x1^2 - x2^2 > 0), levels = c(0,1))
df <- data.frame(x1, x2, y)- Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y- axis.
ggplot(df, aes(x = x1, y = x2, color = y)) +
geom_point(alpha = 0.7) +
labs( x = "X1", y = "X2") +
theme_classic()- Fit a logistic regression model to the data, using X1 and X2 as predictors.
glm_lin <- glm(y ~ x1 + x2, family = binomial, data = df)
summary(glm_lin)
Call:
glm(formula = y ~ x1 + x2, family = binomial, data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.087260 0.089579 -0.974 0.330
x1 0.196199 0.316864 0.619 0.536
x2 -0.002854 0.305712 -0.009 0.993
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 692.18 on 499 degrees of freedom
Residual deviance: 691.79 on 497 degrees of freedom
AIC: 697.79
Number of Fisher Scoring iterations: 3
- Apply this model to the training data in order to obtain a pre- dicted class label for each training observation. Plot the ob- servations, colored according to the predicted class labels. The decision boundary should be linear.
df$pred_lin <- factor(ifelse(predict(glm_lin, type = "response") > 0.5, 1, 0))
grid <- expand.grid(
x1 = seq(min(df$x1), max(df$x1), length = 200),
x2 = seq(min(df$x2), max(df$x2), length = 200)
)
grid$prob_lin <- predict(glm_lin, newdata = grid, type = "response")
ggplot(df, aes(x = x1, y = x2, color = pred_lin)) +
geom_point(alpha = 0.7) +
geom_contour(
data = grid,
aes(z = prob_lin),
breaks = 0.5,
color = "black"
) +
labs(title = "Logistic fit: straight boundary",
color = "Predicted\nclass") +
theme_classic()- Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors (e.g. X12, X1 ×X2, log(X2), and so forth).
glm_nl <- glm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2),
family = binomial, data = df)Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm_nl)
Call:
glm(formula = y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2), family = binomial,
data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.16 713.54 -0.014 0.989
x1 42.10 15492.58 0.003 0.998
x2 -66.81 14788.95 -0.005 0.996
I(x1^2) 16757.98 519013.02 0.032 0.974
I(x2^2) -16671.65 508668.89 -0.033 0.974
I(x1 * x2) -206.38 41802.81 -0.005 0.996
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6.9218e+02 on 499 degrees of freedom
Residual deviance: 3.5810e-06 on 494 degrees of freedom
AIC: 12
Number of Fisher Scoring iterations: 25
- Apply this model to the training data in order to obtain a pre- dicted class label for each training observation. Plot the ob- servations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.
df$pred_nl <- factor(ifelse(predict(glm_nl, type = "response") > 0.5, 1, 0))
grid$prob_nl <- predict(glm_nl, newdata = grid, type = "response")
ggplot(df, aes(x = x1, y = x2, color = pred_nl)) +
geom_point(alpha = 0.7) +
geom_contour(
data = grid,
aes(z = prob_nl),
breaks = 0.5,
color = "black"
) +
labs(title = "Logistic (nonlinear) fit: curved boundary",
color = "Predicted\nclass") +
theme_classic()- Fit a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observa- tion. Plot the observations, colored according to the predicted class labels.
svm_lin <- svm(
y ~ x1 + x2, data = df, kernel = "linear", cost = 0.01, scale = FALSE)
plot(
svm_lin, df, formula = x1 ~ x2,
main = "SVM (linear) Decision Boundary")- Fit a SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
svm_rbf <- svm(
y ~ x1 + x2,
data = df,
kernel = "radial",
gamma = 1,
scale = FALSE
)
plot(
svm_rbf,
df,
formula = x1 ~ x2,
main = "SVM (radial) Decision Boundary"
)- Comment on your results.
🔴 Answer: The linear model using only X1 and X2 fails to capture the true quadratic boundary. . In contrast, including quadratic terms and interactions in the non-linear model significantly lower deviance and AIC and improves the fit.
Exercise 7
In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.
- Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.
data(Auto)
Auto$mpg01 <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)- Fit a support vector classifier to the data with various values of
cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with dif- ferent values of this parameter. Comment on your results. Note you will need to fit the classifier without the gas mileage variable to produce sensible results.
tune.out.lin <- tune(svm, mpg01 ~ . - mpg, data = Auto, kernel = "linear",
ranges = list(cost = c(0.01, 0.1, 1, 10, 100)))
print(tune.out.lin)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
cost
1
- best performance: 0.09622359
🔴 Answer: Using 10-fold cross validation, the tuning process yielded a best performance (CV error) of 0.09572396.
- Now repeat (b), this time using SVMs with radial and polyno- mial basis kernels, with different values of
gammaanddegreeand cost. Comment on your results.
tune.out.radial <- tune(svm, mpg01 ~ . - mpg, data = Auto, kernel = "radial",
ranges = list(cost = c(0.01, 0.1, 1, 10, 100),
gamma = c(0.5, 1, 2, 3)))
print(tune.out.radial)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
cost gamma
1 0.5
- best performance: 0.06262426
tune.out.poly <- tune(svm, mpg01 ~ . - mpg, data = Auto, kernel = "polynomial",
ranges = list(cost = c(0.01, 0.1, 1, 10, 100),
degree = c(2, 3, 4)))
print(tune.out.poly)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
cost degree
100 3
- best performance: 0.244183
🔴 Answer: The first tuning result shows a much lower CV error (≈ 6.43%) compared to the second (≈ 24.44%), indicating that the SVM configuration in the first run is significantly better at generalizing to unseen data.
- Make some plots to back up your assertions in (b) and (c).
Exercise 8
This problem involves the OJ data set which is part of the ISLR2 package.
- Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
set.seed(1)
train_idx <- sample(1:nrow(OJ), 800)
OJ.train <- OJ[train_idx, ]
OJ.test <- OJ[-train_idx, ]- Fit a support vector classifier to the training data using
cost = 0.01, withPurchaseas the response and the other vari- ables as predictors. Use thesummary()function to produce sum- mary statistics, and describe the results obtained.
svm.linear <- svm(Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.01)
summary(svm.linear)
Call:
svm(formula = Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.01)
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 0.01
Number of Support Vectors: 435
( 219 216 )
Number of Classes: 2
Levels:
CH MM
🔴 Answer: This output indicates that with a low cost value (0.01), the linear SVM uses 435 support vectors out of 800 training observations (roughly 54%), with nearly equal numbers from each class.
- What are the training and test error rates?
pred.train.lin <- predict(svm.linear, OJ.train)
train.error.lin <- mean(pred.train.lin != OJ.train$Purchase)
pred.test.lin <- predict(svm.linear, OJ.test)
test.error.lin <- mean(pred.test.lin != OJ.test$Purchase)
train.error.lin[1] 0.175
test.error.lin[1] 0.1777778
- Use the
tune()function to select an optimal cost. Consider val-ues in the range 0.01 to 10.
set.seed(1)
tune.out.lin <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "linear",
ranges = list(cost = c(0.01, 0.1, 1, 10)))
summary(tune.out.lin)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
cost
0.1
- best performance: 0.1725
- Detailed performance results:
cost error dispersion
1 0.01 0.17625 0.02853482
2 0.10 0.17250 0.03162278
3 1.00 0.17500 0.02946278
4 10.00 0.17375 0.03197764
- Compute the training and test error rates using this new value for
cost.
best.lin <- tune.out.lin$best.model
pred.train.best <- predict(best.lin, OJ.train)
train.error.best <- mean(pred.train.best != OJ.train$Purchase)
pred.test.best <- predict(best.lin, OJ.test)
test.error.best <- mean(pred.test.best != OJ.test$Purchase)
train.error.best[1] 0.165
test.error.best[1] 0.162963
- Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for
gamma.
svm.radial <- svm(Purchase ~ ., data = OJ.train, kernel = "radial", cost = 0.01)
summary(svm.radial)
Call:
svm(formula = Purchase ~ ., data = OJ.train, kernel = "radial", cost = 0.01)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 0.01
Number of Support Vectors: 634
( 319 315 )
Number of Classes: 2
Levels:
CH MM
pred.train.rad <- predict(svm.radial, OJ.train)
train.error.rad <- mean(pred.train.rad != OJ.train$Purchase)
pred.test.rad <- predict(svm.radial, OJ.test)
test.error.rad <- mean(pred.test.rad != OJ.test$Purchase)
train.error.rad[1] 0.39375
test.error.rad[1] 0.3777778
set.seed(1)
tune.out.rad <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "radial",
ranges = list(cost = c(0.01, 0.1, 1, 10),
gamma = c(0.1, 0.5, 1, 2)))
summary(tune.out.rad)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
cost gamma
1 0.1
- best performance: 0.17875
- Detailed performance results:
cost gamma error dispersion
1 0.01 0.1 0.39375 0.04007372
2 0.10 0.1 0.19750 0.03525699
3 1.00 0.1 0.17875 0.02829041
4 10.00 0.1 0.19875 0.03087272
5 0.01 0.5 0.39375 0.04007372
6 0.10 0.5 0.28250 0.05502525
7 1.00 0.5 0.21375 0.03701070
8 10.00 0.5 0.21250 0.03632416
9 0.01 1.0 0.39375 0.04007372
10 0.10 1.0 0.34500 0.04937104
11 1.00 1.0 0.22625 0.04466309
12 10.00 1.0 0.23000 0.04684490
13 0.01 2.0 0.39375 0.04007372
14 0.10 2.0 0.38625 0.04348132
15 1.00 2.0 0.22750 0.04281744
16 10.00 2.0 0.24000 0.04158325
best.rad <- tune.out.rad$best.model
pred.train.rad.best <- predict(best.rad, OJ.train)
train.error.rad.best <- mean(pred.train.rad.best != OJ.train$Purchase)
pred.test.rad.best <- predict(best.rad, OJ.test)
test.error.rad.best <- mean(pred.test.rad.best != OJ.test$Purchase)
train.error.rad.best[1] 0.15
test.error.rad.best[1] 0.1851852
- Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set
degree = 2.
svm.poly <- svm(Purchase ~ ., data = OJ.train, kernel = "polynomial", degree = 2, cost = 0.01)
summary(svm.poly)
Call:
svm(formula = Purchase ~ ., data = OJ.train, kernel = "polynomial",
degree = 2, cost = 0.01)
Parameters:
SVM-Type: C-classification
SVM-Kernel: polynomial
cost: 0.01
degree: 2
coef.0: 0
Number of Support Vectors: 636
( 321 315 )
Number of Classes: 2
Levels:
CH MM
pred.train.poly <- predict(svm.poly, OJ.train)
train.error.poly <- mean(pred.train.poly != OJ.train$Purchase)
pred.test.poly <- predict(svm.poly, OJ.test)
test.error.poly <- mean(pred.test.poly != OJ.test$Purchase)
train.error.poly[1] 0.3725
test.error.poly[1] 0.3666667
set.seed(1)
tune.out.poly <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "polynomial",
ranges = list(cost = c(0.01, 0.1, 1, 10),
degree = 2))
summary(tune.out.poly)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
cost degree
10 2
- best performance: 0.18125
- Detailed performance results:
cost degree error dispersion
1 0.01 2 0.39125 0.04210189
2 0.10 2 0.32125 0.05001736
3 1.00 2 0.20250 0.04116363
4 10.00 2 0.18125 0.02779513
best.poly <- tune.out.poly$best.model
pred.train.poly.best <- predict(best.poly, OJ.train)
train.error.poly.best <- mean(pred.train.poly.best != OJ.train$Purchase)
pred.test.poly.best <- predict(best.poly, OJ.test)
test.error.poly.best <- mean(pred.test.poly.best != OJ.test$Purchase)
train.error.poly.best[1] 0.15
test.error.poly.best[1] 0.1888889
- Overall, which approach seems to give the best results on this data?
🔴 Answer: Overall, the tuned linear SVM yields the best results with a test error rate of about 16.3%, compared to 18.5% for the radial kernel and 18.9% for the polynomial kernel.