Predictive Modeling Homework 8

Author

Cheryl Chiu (wky301)

Published

April 6, 2025

Load libraries

library(ggplot2)
library(e1071)
library(ISLR)
library(ISLR2)
library(tidyverse)

Exercise 5

We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features. (a) Generate a data set with n = 500 and p = 2, such that the obser- vations belong to two classes with a quadratic decision boundary between them.

set.seed(1)
n  <- 500
x1 <- runif(n) - 0.5
x2 <- runif(n) - 0.5
y  <- factor(1 * (x1^2 - x2^2 > 0), levels = c(0,1))
df <- data.frame(x1, x2, y)

Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y- axis.

ggplot(df, aes(x = x1, y = x2, color = y)) +
  geom_point(alpha = 0.7) +
  labs( x = "X1", y = "X2") +
  theme_classic()

Fit a logistic regression model to the data, using X1 and X2 as predictors.

glm_lin <- glm(y ~ x1 + x2, family = binomial, data = df)
summary(glm_lin)


Call:
glm(formula = y ~ x1 + x2, family = binomial, data = df)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.087260   0.089579  -0.974    0.330
x1           0.196199   0.316864   0.619    0.536
x2          -0.002854   0.305712  -0.009    0.993

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 692.18  on 499  degrees of freedom
Residual deviance: 691.79  on 497  degrees of freedom
AIC: 697.79

Number of Fisher Scoring iterations: 3

Apply this model to the training data in order to obtain a pre- dicted class label for each training observation. Plot the ob- servations, colored according to the predicted class labels. The decision boundary should be linear.

df$pred_lin <- factor(ifelse(predict(glm_lin, type = "response") > 0.5, 1, 0))

grid <- expand.grid(
  x1 = seq(min(df$x1), max(df$x1), length = 200),
  x2 = seq(min(df$x2), max(df$x2), length = 200)
)
grid$prob_lin <- predict(glm_lin, newdata = grid, type = "response")

ggplot(df, aes(x = x1, y = x2, color = pred_lin)) +
  geom_point(alpha = 0.7) +
  geom_contour(
    data    = grid,
    aes(z = prob_lin),
    breaks  = 0.5,
    color   = "black"
  ) +
  labs(title = "Logistic fit: straight boundary",
       color = "Predicted\nclass") +
  theme_classic()

Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors (e.g. X12, X1 ×X2, log(X2), and so forth).

glm_nl <- glm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2),
              family = binomial, data = df)

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(glm_nl)


Call:
glm(formula = y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2), family = binomial, 
    data = df)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    -10.16     713.54  -0.014    0.989
x1              42.10   15492.58   0.003    0.998
x2             -66.81   14788.95  -0.005    0.996
I(x1^2)      16757.98  519013.02   0.032    0.974
I(x2^2)     -16671.65  508668.89  -0.033    0.974
I(x1 * x2)    -206.38   41802.81  -0.005    0.996

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.9218e+02  on 499  degrees of freedom
Residual deviance: 3.5810e-06  on 494  degrees of freedom
AIC: 12

Number of Fisher Scoring iterations: 25

Apply this model to the training data in order to obtain a pre- dicted class label for each training observation. Plot the ob- servations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.

df$pred_nl <- factor(ifelse(predict(glm_nl, type = "response") > 0.5, 1, 0))

grid$prob_nl <- predict(glm_nl, newdata = grid, type = "response")

ggplot(df, aes(x = x1, y = x2, color = pred_nl)) +
  geom_point(alpha = 0.7) +
  geom_contour(
    data   = grid,
    aes(z = prob_nl),
    breaks = 0.5,
    color  = "black"
  ) +
  labs(title = "Logistic (nonlinear) fit: curved boundary",
       color = "Predicted\nclass") +
 
  theme_classic()

Fit a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observa- tion. Plot the observations, colored according to the predicted class labels.

svm_lin <- svm(
  y ~ x1 + x2, data   = df, kernel = "linear", cost   = 0.01, scale  = FALSE)
plot(
  svm_lin, df, formula = x1 ~ x2,
  main    = "SVM (linear) Decision Boundary")

Fit a SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

svm_rbf <- svm(
  y      ~ x1 + x2,
  data   = df,
  kernel = "radial",
  gamma  = 1,
  scale  = FALSE
)

plot(
  svm_rbf, 
  df, 
  formula = x1 ~ x2,
  main    = "SVM (radial) Decision Boundary"
)

Comment on your results.

🔴 Answer: The linear model using only X1 and X2 fails to capture the true quadratic boundary. . In contrast, including quadratic terms and interactions in the non-linear model significantly lower deviance and AIC and improves the fit.

Exercise 7

In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.

Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.

data(Auto)
Auto$mpg01 <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)

Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with dif- ferent values of this parameter. Comment on your results. Note you will need to fit the classifier without the gas mileage variable to produce sensible results.

tune.out.lin <- tune(svm, mpg01 ~ . - mpg, data = Auto, kernel = "linear",
                     ranges = list(cost = c(0.01, 0.1, 1, 10, 100)))
print(tune.out.lin)


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost
    1

- best performance: 0.09622359

🔴 Answer: Using 10-fold cross validation, the tuning process yielded a best performance (CV error) of 0.09572396.

Now repeat (b), this time using SVMs with radial and polyno- mial basis kernels, with different values of gamma and degree and cost. Comment on your results.

tune.out.radial <- tune(svm, mpg01 ~ . - mpg, data = Auto, kernel = "radial",
                        ranges = list(cost = c(0.01, 0.1, 1, 10, 100),
                                      gamma = c(0.5, 1, 2, 3)))
print(tune.out.radial)


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost gamma
    1   0.5

- best performance: 0.06262426

tune.out.poly <- tune(svm, mpg01 ~ . - mpg, data = Auto, kernel = "polynomial",
                      ranges = list(cost = c(0.01, 0.1, 1, 10, 100),
                                    degree = c(2, 3, 4)))
print(tune.out.poly)


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost degree
  100      3

- best performance: 0.244183

🔴 Answer: The first tuning result shows a much lower CV error (≈ 6.43%) compared to the second (≈ 24.44%), indicating that the SVM configuration in the first run is significantly better at generalizing to unseen data.

Make some plots to back up your assertions in (b) and (c).

Exercise 8

This problem involves the OJ data set which is part of the ISLR2 package.

Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

set.seed(1)
train_idx <- sample(1:nrow(OJ), 800)
OJ.train <- OJ[train_idx, ]
OJ.test <- OJ[-train_idx, ]

Fit a support vector classifier to the training data using cost = 0.01, with Purchase as the response and the other vari- ables as predictors. Use the summary() function to produce sum- mary statistics, and describe the results obtained.

svm.linear <- svm(Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.01)
summary(svm.linear)


Call:
svm(formula = Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.01)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  0.01 

Number of Support Vectors:  435

 ( 219 216 )


Number of Classes:  2 

Levels: 
 CH MM

🔴 Answer: This output indicates that with a low cost value (0.01), the linear SVM uses 435 support vectors out of 800 training observations (roughly 54%), with nearly equal numbers from each class.

What are the training and test error rates?

pred.train.lin <- predict(svm.linear, OJ.train)
train.error.lin <- mean(pred.train.lin != OJ.train$Purchase)

pred.test.lin <- predict(svm.linear, OJ.test)
test.error.lin <- mean(pred.test.lin != OJ.test$Purchase)

train.error.lin

[1] 0.175

test.error.lin

[1] 0.1777778

Use the tune() function to select an optimal cost. Consider val-ues in the range 0.01 to 10.

set.seed(1)
tune.out.lin <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "linear",
                     ranges = list(cost = c(0.01, 0.1, 1, 10)))
summary(tune.out.lin)


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost
  0.1

- best performance: 0.1725 

- Detailed performance results:
   cost   error dispersion
1  0.01 0.17625 0.02853482
2  0.10 0.17250 0.03162278
3  1.00 0.17500 0.02946278
4 10.00 0.17375 0.03197764

Compute the training and test error rates using this new value for cost.

best.lin <- tune.out.lin$best.model

pred.train.best <- predict(best.lin, OJ.train)
train.error.best <- mean(pred.train.best != OJ.train$Purchase)

pred.test.best <- predict(best.lin, OJ.test)
test.error.best <- mean(pred.test.best != OJ.test$Purchase)

train.error.best

[1] 0.165

test.error.best

[1] 0.162963

Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.

svm.radial <- svm(Purchase ~ ., data = OJ.train, kernel = "radial", cost = 0.01)
summary(svm.radial)


Call:
svm(formula = Purchase ~ ., data = OJ.train, kernel = "radial", cost = 0.01)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  0.01 

Number of Support Vectors:  634

 ( 319 315 )


Number of Classes:  2 

Levels: 
 CH MM

pred.train.rad <- predict(svm.radial, OJ.train)
train.error.rad <- mean(pred.train.rad != OJ.train$Purchase)

pred.test.rad <- predict(svm.radial, OJ.test)
test.error.rad <- mean(pred.test.rad != OJ.test$Purchase)

train.error.rad

[1] 0.39375

test.error.rad

[1] 0.3777778

set.seed(1)
tune.out.rad <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "radial",
                     ranges = list(cost = c(0.01, 0.1, 1, 10),
                                   gamma = c(0.1, 0.5, 1, 2)))
summary(tune.out.rad)


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost gamma
    1   0.1

- best performance: 0.17875 

- Detailed performance results:
    cost gamma   error dispersion
1   0.01   0.1 0.39375 0.04007372
2   0.10   0.1 0.19750 0.03525699
3   1.00   0.1 0.17875 0.02829041
4  10.00   0.1 0.19875 0.03087272
5   0.01   0.5 0.39375 0.04007372
6   0.10   0.5 0.28250 0.05502525
7   1.00   0.5 0.21375 0.03701070
8  10.00   0.5 0.21250 0.03632416
9   0.01   1.0 0.39375 0.04007372
10  0.10   1.0 0.34500 0.04937104
11  1.00   1.0 0.22625 0.04466309
12 10.00   1.0 0.23000 0.04684490
13  0.01   2.0 0.39375 0.04007372
14  0.10   2.0 0.38625 0.04348132
15  1.00   2.0 0.22750 0.04281744
16 10.00   2.0 0.24000 0.04158325

best.rad <- tune.out.rad$best.model

pred.train.rad.best <- predict(best.rad, OJ.train)
train.error.rad.best <- mean(pred.train.rad.best != OJ.train$Purchase)

pred.test.rad.best <- predict(best.rad, OJ.test)
test.error.rad.best <- mean(pred.test.rad.best != OJ.test$Purchase)

train.error.rad.best

[1] 0.15

test.error.rad.best

[1] 0.1851852

Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree = 2.

svm.poly <- svm(Purchase ~ ., data = OJ.train, kernel = "polynomial", degree = 2, cost = 0.01)
summary(svm.poly)


Call:
svm(formula = Purchase ~ ., data = OJ.train, kernel = "polynomial", 
    degree = 2, cost = 0.01)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  polynomial 
       cost:  0.01 
     degree:  2 
     coef.0:  0 

Number of Support Vectors:  636

 ( 321 315 )


Number of Classes:  2 

Levels: 
 CH MM

pred.train.poly <- predict(svm.poly, OJ.train)
train.error.poly <- mean(pred.train.poly != OJ.train$Purchase)

pred.test.poly <- predict(svm.poly, OJ.test)
test.error.poly <- mean(pred.test.poly != OJ.test$Purchase)

train.error.poly

[1] 0.3725

test.error.poly

[1] 0.3666667

set.seed(1)
tune.out.poly <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "polynomial",
                      ranges = list(cost = c(0.01, 0.1, 1, 10),
                                    degree = 2))
summary(tune.out.poly)


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost degree
   10      2

- best performance: 0.18125 

- Detailed performance results:
   cost degree   error dispersion
1  0.01      2 0.39125 0.04210189
2  0.10      2 0.32125 0.05001736
3  1.00      2 0.20250 0.04116363
4 10.00      2 0.18125 0.02779513

best.poly <- tune.out.poly$best.model

pred.train.poly.best <- predict(best.poly, OJ.train)
train.error.poly.best <- mean(pred.train.poly.best != OJ.train$Purchase)

pred.test.poly.best <- predict(best.poly, OJ.test)
test.error.poly.best <- mean(pred.test.poly.best != OJ.test$Purchase)

train.error.poly.best

[1] 0.15

test.error.poly.best

[1] 0.1888889

Overall, which approach seems to give the best results on this data?

🔴 Answer: Overall, the tuned linear SVM yields the best results with a test error rate of about 16.3%, compared to 18.5% for the radial kernel and 18.9% for the polynomial kernel.