4. Generate a simulated two-class data set with 100 observations and two features in which there is a visible but non-linear separation between the two classes. Show that in this setting, a support vector machine with a polynomial kernel (with degree greater than 1) or a radial kernel will outperform a support vector classifier on the training data. Which technique performs best on the test data? Make plots and report training and test error rates in order to back up your assertions.
# Load necessary libraries
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.3
library(ggplot2)
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.3
set.seed(123)
# Generate nonlinear circular boundary data
x1 <- runif(100, -2, 2)
x2 <- runif(100, -2, 2)
y <- ifelse(x1^2 + x2^2 > 1.5, 1, 0)
df <- data.frame(x1, x2, y = as.factor(y))
# Split into train/test
train_idx <- sample(1:100, 70)
train <- df[train_idx, ]
test <- df[-train_idx, ]
# Fit linear SVM
svm_linear <- svm(y ~ ., data = train, kernel = "linear", cost = 1)
# Fit polynomial SVM (degree 3)
svm_poly <- svm(y ~ ., data = train, kernel = "polynomial", degree = 3, cost = 1)
# Fit radial SVM
svm_radial <- svm(y ~ ., data = train, kernel = "radial", cost = 1)
# Predict and compare errors
mean(predict(svm_linear, test) != test$y)
## [1] 0.3
mean(predict(svm_poly, test) != test$y)
## [1] 0.3
mean(predict(svm_radial, test) != test$y)
## [1] 0
# Visualize decision boundary
plot(svm_radial, df)
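Explanation: This generates synthetic 2D data with a circular (non-linear) class boundary. On the 30 held-out observations, the radial SVM has a test error of 0, while the linear classifier and the degree-3 polynomial SVM both misclassify 30%. The radial kernel therefore clearly outperforms the linear classifier, but the degree-3 polynomial does not. A likely reason is that e1071's polynomial kernel uses coef0 = 0 by default, so a degree-3 fit contains only cubic terms and cannot represent a circular boundary; an even degree such as 2 would be a more natural choice.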
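The exercise also asks for training error rates, which the code above does not report. A minimal sketch, assuming the same seed and data-generating steps as above, that tabulates training and test error for each kernel. The helper `err_rate` and the extra degree-2 fit are illustrative additions, not part of the original solution:

```r
library(e1071)

# Recreate the simulated data and the 70/30 split used above
set.seed(123)
x1 <- runif(100, -2, 2)
x2 <- runif(100, -2, 2)
df <- data.frame(x1, x2, y = as.factor(ifelse(x1^2 + x2^2 > 1.5, 1, 0)))
train_idx <- sample(1:100, 70)
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Hypothetical helper: misclassification rate of a fitted model on a data set
err_rate <- function(fit, data) mean(predict(fit, data) != data$y)

fits <- list(
  linear = svm(y ~ ., data = train, kernel = "linear", cost = 1),
  poly2  = svm(y ~ ., data = train, kernel = "polynomial", degree = 2, cost = 1),
  poly3  = svm(y ~ ., data = train, kernel = "polynomial", degree = 3, cost = 1),
  radial = svm(y ~ ., data = train, kernel = "radial", cost = 1)
)

# One row per kernel, one column per data set
errors <- data.frame(
  train = sapply(fits, err_rate, data = train),
  test  = sapply(fits, err_rate, data = test)
)
errors
```

The degree-2 fit is included because a circle is a quadratic boundary; whether it actually beats the radial kernel on this sample has to be read off the table.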
7. In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.
(a) Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.
# Removing missing values
Auto <- na.omit(Auto)
# Create binary variable: 1 if mpg > median, else 0
Auto$mpg01 <- as.factor(ifelse(Auto$mpg > median(Auto$mpg), 1, 0))
# View summary to confirm new variable
summary(Auto$mpg01)
## 0 1
## 196 196
Explanation: This code adds a new column mpg01 to the Auto dataset, which is 1 for high mileage cars (above median mpg) and 0 for low mileage.
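Because the comparison uses a strict `>`, any car whose mpg equals the median is coded 0. A small sketch on made-up values (the vector `mpg` below is hypothetical, not taken from Auto) shows the effect of a tie:

```r
# Hypothetical mileage values with a tie at the median
mpg <- c(10, 20, 20, 30)
median(mpg)

# Both values equal to the median fall into class 0
mpg01 <- as.factor(ifelse(mpg > median(mpg), 1, 0))
table(mpg01)
```

In the Auto data the summary above shows an even 196/196 split, so ties did not unbalance the classes there.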
(b) Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results. Note you will need to fit the classifier without the gas mileage variable to produce sensible results.
# Drop mpg and name columns for model input
Auto_red <- Auto[, !(names(Auto) %in% c("mpg", "name"))]
# Tune linear SVM across different cost values
set.seed(123)
tune_linear <- tune(
  svm,
  mpg01 ~ .,
  data = Auto_red,
  kernel = "linear",
  ranges = list(cost = c(0.01, 0.1, 1, 10, 100))
)
# Print cross-validation results
summary(tune_linear)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.08397436
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-02 0.09166667 0.03588344
## 2 1e-01 0.09660256 0.04809549
## 3 1e+00 0.08397436 0.05047803
## 4 1e+01 0.08903846 0.05486624
## 5 1e+02 0.08903846 0.05486624
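Explanation: Here, we tune a linear-kernel SVM (a support vector classifier) to classify high vs. low mpg, using 10-fold cross-validation over five cost values. The CV error is lowest at cost = 1 (about 0.084) and stays between roughly 0.084 and 0.097 across the whole grid. That spread is well within the fold-to-fold dispersion (0.04 to 0.05), so the classifier is not very sensitive to cost on this data.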
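The object returned by `tune()` can also be inspected directly rather than through `summary()`. A minimal sketch, assuming the same preprocessing as above (the lowercase names `auto`, `auto_red` and `tuned` are introduced here so the block runs on its own):

```r
library(e1071)
library(ISLR2)

# Rebuild the reduced Auto data with the binary response
auto <- na.omit(Auto)
auto$mpg01 <- as.factor(ifelse(auto$mpg > median(auto$mpg), 1, 0))
auto_red <- auto[, !(names(auto) %in% c("mpg", "name"))]

set.seed(123)
tuned <- tune(svm, mpg01 ~ ., data = auto_red, kernel = "linear",
              ranges = list(cost = c(0.01, 0.1, 1, 10, 100)))

# Per-cost CV error and its dispersion (sd across folds by default)
perf <- tuned$performances
perf[order(perf$error), ]

# Costs whose CV error lies within one dispersion of the best
# (an illustrative rule of thumb, not part of the exercise)
best <- perf[which.min(perf$error), ]
perf$cost[perf$error <= best$error + best$dispersion]
```

If most or all costs pass this check, the data give little reason to prefer one cost over another.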
(c) Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost. Comment on your results.
# Radial kernel SVM tuning with different gamma and cost values
tune_radial <- tune(
  svm,
  mpg01 ~ .,
  data = Auto_red,
  kernel = "radial",
  ranges = list(cost = c(0.1, 1, 10), gamma = c(0.5, 1, 2))
)
# Print best parameters and CV error
summary(tune_radial)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 10 1
##
## - best performance: 0.07141026
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 0.1 0.5 0.08698718 0.04396941
## 2 1.0 0.5 0.07416667 0.04106796
## 3 10.0 0.5 0.07916667 0.03085406
## 4 0.1 1.0 0.09205128 0.04572869
## 5 1.0 1.0 0.07160256 0.03394418
## 6 10.0 1.0 0.07141026 0.02348329
## 7 0.1 2.0 0.12269231 0.07840407
## 8 1.0 2.0 0.07929487 0.04283541
## 9 10.0 2.0 0.08173077 0.03968525
# Result: best model has gamma = 1 and cost = 10
# Best cross-validation error (performance): ~0.071
# Lower than the best linear CV error (~0.084), suggesting some non-linearity in the boundary
# Polynomial kernel SVM tuning with degree and cost values
tune_poly <- tune(
  svm,
  mpg01 ~ .,
  data = Auto_red,
  kernel = "polynomial",
  ranges = list(cost = c(0.1, 1, 10), degree = c(2, 3))
)
# Print best parameters and CV error
summary(tune_poly)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost degree
## 10 3
##
## - best performance: 0.08410256
##
## - Detailed performance results:
## cost degree error dispersion
## 1 0.1 2 0.28044872 0.06840761
## 2 1.0 2 0.24455128 0.11035544
## 3 10.0 2 0.18621795 0.05267452
## 4 0.1 3 0.20403846 0.07908757
## 5 1.0 3 0.09442308 0.03818520
## 6 10.0 3 0.08410256 0.04637569
# Result: best model has degree = 3 and cost = 10
# Cross-validation error: ~0.084 (higher than radial, about equal to linear)
# Degree 2 does much worse (0.19 to 0.28); error falls as cost increases for both degrees
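Explanation: This code tunes SVMs with non-linear kernels (radial and polynomial) by 10-fold cross-validation. The radial kernel achieves the lowest CV error (about 0.071 at cost = 10, gamma = 1). The best polynomial fit (degree 3, cost = 10) reaches about 0.084, essentially the same as the best linear classifier in (b). Given dispersions of roughly 0.02 to 0.05, these differences are small.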
(d) Make some plots to back up your assertions in (b) and (c).
Hint: In the lab, we used the plot() function for svm objects only in cases with p =2. When p>2, you can use the plot() function to create plots displaying pairs of variables at a time. Essentially, instead of typing > plot(svmfit, dat) where svmfit contains your fitted model and dat is a data frame containing your data, you can type > plot(svmfit, dat, x1 ∼ x4) in order to plot just the first and fourth variables. However, you must replace x1 and x4 with the correct variable names. To find out more, type ? plot.svm.
best_radial_model <- tune_radial$best.model
# Only works if you choose TWO variables to plot
# For example: horsepower vs weight
plot(best_radial_model, Auto_red, horsepower ~ weight)
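Explanation: You can visualize decision boundaries by calling plot() on the fitted model with a formula naming any two predictors, such as horsepower ~ weight. Each plot is a 2D slice: plot.svm holds the remaining predictors at fixed values (0 by default for numeric variables, adjustable through its slice argument).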
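The claims in (b) about sensitivity to cost can also be backed up with a plot of CV error against cost. A minimal sketch using base graphics, with the numbers copied from the summary(tune_linear) output in part (b); the data frame name `cv_linear` is introduced here for illustration:

```r
# CV results copied from the summary(tune_linear) output in part (b)
cv_linear <- data.frame(
  cost = c(0.01, 0.1, 1, 10, 100),
  error = c(0.09166667, 0.09660256, 0.08397436, 0.08903846, 0.08903846),
  dispersion = c(0.03588344, 0.04809549, 0.05047803, 0.05486624, 0.05486624)
)

# CV error against cost on a log scale
plot(cv_linear$cost, cv_linear$error, log = "x", type = "b",
     ylim = range(cv_linear$error - cv_linear$dispersion,
                  cv_linear$error + cv_linear$dispersion),
     xlab = "cost (log scale)", ylab = "10-fold CV error",
     main = "Linear SVM: CV error by cost")

# Vertical bars show +/- one dispersion around each CV error
arrows(cv_linear$cost, cv_linear$error - cv_linear$dispersion,
       cv_linear$cost, cv_linear$error + cv_linear$dispersion,
       angle = 90, code = 3, length = 0.05)
```

The same pattern should work for the radial and polynomial results by plotting one line per gamma or degree.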
8. This problem involves the OJ data set which is part of the ISLR2 package.
(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
# Load the data
data(OJ)
# Set seed for reproducibility
set.seed(1)
# Split into training (800 obs) and test set (remaining)
train_idx <- sample(1:nrow(OJ), 800)
train <- OJ[train_idx, ]
test <- OJ[-train_idx, ]
Explanation: Randomly selects 800 observations for training; the rest go into the test set.
(b) Fit a support vector classifier to the training data using cost = 0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.
# Fit linear kernel SVM with cost = 0.01
svc_linear <- svm(Purchase ~ ., data = train, kernel = "linear", cost = 0.01)
# View model summary
summary(svc_linear)
##
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "linear", cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 435
##
## ( 219 216 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
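Explanation: A support vector classifier is fit with a linear kernel and a low cost of 0.01. The summary reports 435 support vectors out of 800 training observations (219 in one class and 216 in the other), so more than half of the training points lie on or inside the margin, as expected with such a small cost.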
(c) What are the training and test error rates?
# Predictions on train and test sets
train_pred <- predict(svc_linear, train)
test_pred <- predict(svc_linear, test)
# Compute error rates
train_error <- mean(train_pred != train$Purchase)
test_error <- mean(test_pred != test$Purchase)
train_error
## [1] 0.175
test_error
## [1] 0.1777778
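A confusion matrix shows which class the errors fall in, which a single error rate hides. A minimal sketch on made-up labels (the vectors `truth` and `pred` are hypothetical, not OJ results):

```r
# Hypothetical labels, for illustration only
truth <- factor(c("CH", "CH", "CH", "MM", "MM", "CH"), levels = c("CH", "MM"))
pred  <- factor(c("CH", "MM", "CH", "MM", "CH", "CH"), levels = c("CH", "MM"))

# Rows are predictions, columns are true classes
conf_mat <- table(predicted = pred, actual = truth)
conf_mat

# Error rate = off-diagonal share; agrees with mean(pred != truth)
1 - sum(diag(conf_mat)) / sum(conf_mat)
```

With the fitted objects above, the equivalent call would be table(predicted = test_pred, actual = test$Purchase).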
(d) Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.
# Tune linear SVM across a range of cost values
set.seed(1)
tune_out <- tune(svm, Purchase ~ ., data = train, kernel = "linear",
                 ranges = list(cost = 10^seq(-2, 1, length = 10)))
# Best model
summary(tune_out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.4641589
##
## - best performance: 0.16875
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01000000 0.17625 0.02853482
## 2 0.02154435 0.17625 0.02972676
## 3 0.04641589 0.17500 0.02568506
## 4 0.10000000 0.17250 0.03162278
## 5 0.21544347 0.17250 0.02751262
## 6 0.46415888 0.16875 0.02651650
## 7 1.00000000 0.17500 0.02946278
## 8 2.15443469 0.17125 0.03064696
## 9 4.64158883 0.17125 0.03175973
## 10 10.00000000 0.17375 0.03197764
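Explanation: 10-fold CV selects cost ≈ 0.46, with a CV error of 0.16875. The CV errors across the grid range only from about 0.169 to 0.176, so the choice of cost matters little here and the selected value may change with a different fold split.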
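The cost grid is log-spaced, which is why the selected value looks irregular. A short sketch of what the two grid expressions used in this problem evaluate to (the names `grid_fine` and `grid_coarse` are introduced here for illustration):

```r
# Ten values equally spaced on the log10 scale between 0.01 and 10
grid_fine <- 10^seq(-2, 1, length = 10)
signif(grid_fine, 3)

# Without a length argument, seq() steps by 1, giving four values
grid_coarse <- 10^seq(-2, 1)
grid_coarse
```

The sixth value of the fine grid is 10^(-1/3), about 0.464, which is the cost selected above.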
(e) Compute the training and test error rates using this new value for cost.
# Extract best model
best_svc <- tune_out$best.model
# Predictions with best model
train_pred_best <- predict(best_svc, train)
test_pred_best <- predict(best_svc, test)
# Error rates
mean(train_pred_best != train$Purchase)
## [1] 0.165
mean(test_pred_best != test$Purchase)
## [1] 0.1555556
(f) Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.
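# Tune radial kernel over cost (0.01, 0.1, 1, 10)
# Note: gamma is tuned here as well, whereas the exercise asks for the default gamma
set.seed(1)
tune_radial <- tune(svm, Purchase ~ ., data = train, kernel = "radial",
                    ranges = list(cost = 10^seq(-2, 1), gamma = c(0.01, 0.1, 1)))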
# Best model
best_radial <- tune_radial$best.model
# Train/test error
mean(predict(best_radial, train) != train$Purchase)
## [1] 0.15375
mean(predict(best_radial, test) != test$Purchase)
## [1] 0.1777778
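Observation: Here the tuned radial kernel has a lower training error (0.154) than the tuned linear classifier (0.165) but a higher test error (0.178 vs. 0.156), so the extra flexibility does not pay off on the test set.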
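A minimal sketch of part (f) with gamma left at svm()'s default, as the exercise requests. It assumes the same seed and split as part (a); the names `oj_train`, `oj_test`, `svm_rad`, `tune_rad` and `best_rad` are introduced here so the block runs on its own, and its results are not reported in this document:

```r
library(e1071)
library(ISLR2)

# Recreate the 800-observation training split from part (a)
set.seed(1)
train_idx <- sample(1:nrow(OJ), 800)
oj_train <- OJ[train_idx, ]
oj_test <- OJ[-train_idx, ]

# Radial SVM at cost = 0.01 with gamma left at its default, mirroring part (b)
svm_rad <- svm(Purchase ~ ., data = oj_train, kernel = "radial", cost = 0.01)
svm_rad$gamma  # the default gamma that svm() chose

# Tune cost only; gamma is not in `ranges`, so the default is used throughout
set.seed(1)
tune_rad <- tune(svm, Purchase ~ ., data = oj_train, kernel = "radial",
                 ranges = list(cost = 10^seq(-2, 1, length = 10)))
best_rad <- tune_rad$best.model

# Training and test error of the tuned model
c(train = mean(predict(best_rad, oj_train) != oj_train$Purchase),
  test  = mean(predict(best_rad, oj_test) != oj_test$Purchase))
```

Using the same 10-value cost grid as part (d) keeps the comparison with the linear classifier like for like.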
(g) Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree = 2.
# Tune polynomial kernel (degree = 2)
set.seed(1)
tune_poly <- tune(svm, Purchase ~ ., data = train, kernel = "polynomial",
                  ranges = list(cost = 10^seq(-2, 1), degree = 2))
# Best model
best_poly <- tune_poly$best.model
# Train/test error
mean(predict(best_poly, train) != train$Purchase)
## [1] 0.15
mean(predict(best_poly, test) != test$Purchase)
## [1] 0.1888889
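Explanation: The degree-2 polynomial SVM has the lowest training error of the three models (0.15) but the highest test error (0.189), which is consistent with mild overfitting. In general, a polynomial kernel may overfit if cost is high or underfit if the degree is too low, so both need tuning.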
(h) Overall, which approach seems to give the best results on this data?
# Compare final test errors for all 3 kernels
linear_test <- mean(predict(best_svc, test) != test$Purchase)
radial_test <- mean(predict(best_radial, test) != test$Purchase)
poly_test <- mean(predict(best_poly, test) != test$Purchase)
linear_test
## [1] 0.1555556
radial_test
## [1] 0.1777778
poly_test
## [1] 0.1888889
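Conclusion:
The tuned linear kernel gave the lowest test error (0.156).
The radial kernel (0.178) and the degree-2 polynomial kernel (0.189) had lower training errors but higher test errors, which suggests mild overfitting.
Overall, the support vector classifier seems to give the best results on this data, although the differences amount to only a handful of the 270 test observations.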
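The reported error rates can be collected into one table for a side-by-side comparison. A minimal sketch, with the numbers copied from the outputs in parts (e), (f) and (g); the data frame name `results` is introduced here for illustration:

```r
# Error rates copied from the outputs in parts (e), (f) and (g)
results <- data.frame(
  kernel = c("linear", "radial", "polynomial (degree 2)"),
  train_error = c(0.165, 0.15375, 0.15),
  test_error = c(0.1555556, 0.1777778, 0.1888889)
)

# Sort so the kernel with the lowest test error comes first
results[order(results$test_error), ]
```

Sorting by test error puts the linear kernel first, while sorting by training error would reverse the order.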