Exercise 5

We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.

(a) Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them.

set.seed(13)
# Two uniform predictors centered at 0 and a binary class label defined by
# the quadratic boundary X1^2 - X2^2 = 0
X1 <- runif(500) - 0.5
X2 <- runif(500) - 0.5
Y <- 1 * (X1^2 - X2^2 > 0)

(b) Plot the observations, colored according to their class labels. Your plot should display \(X_1\) on the x-axis, and \(X_2\) on the y-axis.

plot(x = X1, y = X2, col = as.factor(Y))

(c) Fit a logistic regression model to the data, using \(X_1\) and \(X_2\) as predictors.

glm.fit = glm(Y ~ X1 + X2, family = binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Y ~ X1 + X2, family = binomial)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.01963    0.08974   0.219    0.827
## X1          -0.46453    0.30375  -1.529    0.126
## X2           0.03207    0.30589   0.105    0.917
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 693.08  on 499  degrees of freedom
## Residual deviance: 690.71  on 497  degrees of freedom
## AIC: 696.71
## 
## Number of Fisher Scoring iterations: 3

(d) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.

df = data.frame(X1 = X1, X2 = X2, Y = as.factor(Y))

# Predicted probabilities and 0/1 class labels from the linear logistic fit
glm.probs = predict(glm.fit, df, type = "response")
glm.preds = ifelse(glm.probs > 0.5, 1, 0)
ones = df[glm.preds == 1, ]
zeros = df[glm.preds == 0, ]

plot(ones$X1, ones$X2, col = "mediumblue", xlab = "X1", ylab = "X2")
points(zeros$X1, zeros$X2, col = "red3")
legend("topleft", legend = c("Predicted 0", "Predicted 1"), col = c("red3", "mediumblue"), pch = 1, y.intersp = 0.8, cex = .8)

(e) Now fit a logistic regression model to the data using non-linear functions of \(X_1\) and \(X_2\) as predictors.

# log(X2) is undefined for the negative values of X2 generated in (a),
# so use squared terms of both predictors plus an interaction instead
glm2.fit = glm(Y ~ poly(X1, 2) + poly(X2, 2) + I(X1 * X2), data = df, family = binomial)

(f) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.

glm2.probs = predict(glm2.fit, df, type = "response")
glm2.preds = ifelse(glm2.probs > 0.5, 1, 0)
ones = df[glm2.preds == 1, ]
zeros = df[glm2.preds == 0, ]
plot(ones$X1, ones$X2, col = "mediumblue", xlab = "X1", ylab = "X2")
points(zeros$X1, zeros$X2, col = "red3")

(g) Fit a support vector classifier to the data with \(X_1\) and \(X_2\) as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

svm.lin <- svm(Y~., data = df, kernel = "linear")
svm.lin.preds = predict(svm.lin, df)
svm.lin.1s = df[svm.lin.preds == 1, ]
svm.lin.0s = df[svm.lin.preds == 0, ]
plot(svm.lin.1s$X1, svm.lin.1s$X2, col = "mediumblue", xlab = "X1", ylab = "X2")
points(svm.lin.0s$X1, svm.lin.0s$X2, col = "red3")

(h) Fit a SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

svm.rbf <- svm(Y~., data = df, kernel = "radial")
svm.rbf.preds = predict(svm.rbf, df)
svm.rbf.1s = df[svm.rbf.preds == 1, ]
svm.rbf.0s = df[svm.rbf.preds == 0, ]
plot(svm.rbf.1s$X1, svm.rbf.1s$X2, col = "mediumblue", xlab = "X1", ylab = "X2")
points(svm.rbf.0s$X1, svm.rbf.0s$X2, col = "red3")

(i) Comment on your results.

The plots in (d) and (g) illustrate the limitation of linear methods, here logistic regression with linear terms and the support vector classifier: both can only produce linear decision boundaries. Conversely, the plots in (f) and (h) show that non-linear methods, namely logistic regression with non-linear transformations of the predictors and SVMs with non-linear kernels (e.g., the radial basis function), can capture more complex structure in the data and yield non-linear decision boundaries.
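
As a sanity check on these comments, the true boundary implied by the data-generating process in (a) is \(X_1^2 = X_2^2\), i.e. the two diagonal lines \(X_2 = \pm X_1\). The following minimal sketch, assuming df and glm2.preds from (f) are still in the workspace, overlays these lines on the predicted labels from the non-linear logistic regression:

# Predicted classes from (f) with the true quadratic boundary overlaid
plot(df$X1, df$X2, col = ifelse(glm2.preds == 1, "mediumblue", "red3"), xlab = "X1", ylab = "X2")
abline(a = 0, b = 1, lty = 2)   # X2 =  X1
abline(a = 0, b = -1, lty = 2)  # X2 = -X1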

Exercise 7

In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.

(a) Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.

# Binary response: 1 if mpg is above the median, 0 otherwise; drop mpg and name
median.mpg = median(Auto$mpg)
Auto$mpg01 = as.factor(ifelse(Auto$mpg > median.mpg, 1, 0))
auto <- subset(Auto, select = -c(mpg, name))

(b) Fit a support vector classifier to the data with various values of \(cost\), in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results. Note you will need to fit the classifier without the gas mileage variable to produce sensible results.

set.seed(13)
tune.out1 <- tune(svm, mpg01~., data = auto, kernel = "linear", ranges = list(cost = c(0.5, 1, 2, 5, 10, 20)))

# Getting best parameters and the associated cross-validated error  
tune.out1$best.parameters
##   cost
## 5   10
tune.out1$best.performance
## [1] 0.08435897

We observe from the output that cost = 10 yields the lowest cross-validated error (0.0844) among the values of cost that were evaluated.
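
Beyond the best entry, the full cross-validation table can be inspected via the performances component of the tune object (a minimal sketch; output omitted):

# Full grid of cost values with their cross-validated errors, sorted
tune.out1$performances[order(tune.out1$performances$error), ]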

(c) Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of \(gamma\) and \(degree\) and \(cost\). Comment on your results.

# Radial kernel
set.seed(13)
tune.out2 <- tune(svm, mpg01 ~ ., data = auto, kernel = "radial", ranges = list(cost = c(0.5, 1, 2, 5, 10, 20), gamma = c(0.5, 1, 2, 3, 4)))

# Getting best parameters and the associated cross-validated error  
tune.out2$best.parameters
##   cost gamma
## 7  0.5     1
tune.out2$best.performance
## [1] 0.06903846

Using a radial kernel, the combination of gamma and cost that minimizes the error (0.0690) is gamma = 1 and cost = 0.5.

# Polynomial kernel
set.seed(13)
tune.out3 <- tune(svm, mpg01 ~ ., data = auto, kernel = "polynomial", ranges = list(degree = c(2, 3, 4), cost = c(0.5, 1, 2, 5, 10, 20), gamma = c(0.5, 1, 2, 3, 4)))

# Getting best parameters and the associated cross-validated error  
tune.out3$best.parameters
##   degree cost gamma
## 2      3  0.5   0.5
tune.out3$best.performance
## [1] 0.08948718

Using a polynomial kernel, the parameters minimizing the cross-validated error (0.0895) are degree = 3, cost = 0.5, and gamma = 0.5.
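
Because the polynomial grid covers 3 × 6 × 5 = 90 combinations, it is worth checking whether other settings come close to the optimum. A quick sketch, sorting the performances table (output omitted):

# Best few parameter combinations for the polynomial kernel by CV error
perf3 <- tune.out3$performances
head(perf3[order(perf3$error), ])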

(d) Make some plots to back up your assertions in (b) and (c).

plot(tune.out1)

The plot above shows that cost = 10 results in the lowest cross-validated error for the linear support vector classifier.

plot(tune.out2)

Because the radial SVM involves an extra tuning parameter, this plot encodes the error as color: darker regions indicate combinations of cost and gamma that result in lower cross-validated error.

Since we cannot plot all three parameters of the polynomial SVM at once, I refit the model testing only combinations of cost and gamma for the 3rd-degree polynomial selected by cross-validation earlier. The resulting plot can be interpreted in the same manner as the one above, illustrating how cost and gamma affect the cross-validated error for the 3rd-degree polynomial SVM.

set.seed(13)
tune.out3b <- tune(svm, mpg01 ~ ., data = auto, kernel = "polynomial", degree = 3, ranges = list(cost = c(0.5, 1, 2, 5, 10, 20), gamma = c(0.5, 1, 2, 3, 4)))
plot(tune.out3b)

Exercise 8

This problem involves the OJ data set, which is part of the ISLR2 package.

(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

# Splitting into train and test according to the instructions
set.seed(13)
index = sample(1:nrow(OJ), 800)
oj.train = OJ[index, ]
oj.test = OJ[-index, ]

(b) Fit a support vector classifier to the training data using cost = 0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.

set.seed(13)
svm1 <- svm(Purchase~., oj.train, kernel = "linear", cost = 0.01)
summary(svm1)
## 
## Call:
## svm(formula = Purchase ~ ., data = oj.train, kernel = "linear", cost = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  439
## 
##  ( 219 220 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

From the summary output, we observe that the linear support vector classifier with cost = 0.01 uses 439 support vectors (219 from class CH, 220 from class MM) to classify Purchase, i.e. more than half of the 800 training observations, as expected for such a small cost.
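
The per-class counts shown in parentheses can also be extracted directly from the fitted object; a brief sketch using the nSV component of the e1071 svm object:

# Support vectors per class (CH, MM) and their share of the training set
svm1$nSV
sum(svm1$nSV) / nrow(oj.train)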

(c) What are the training and test error rates?

# Training error
preds.train1 <- predict(svm1, oj.train)
round(mean(preds.train1 != oj.train$Purchase), 4)
## [1] 0.1688
# Test error
preds.test1 <- predict(svm1, oj.test)
round(mean(preds.test1 != oj.test$Purchase), 4)
## [1] 0.1519

The training error is 16.88% and the test error is 15.19%. Since the test error is slightly lower than the training error, the model does not appear to be overfitting the training set and generalizes well to unseen data.
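
A single error rate hides which class is being misclassified; confusion matrices for the training and test sets make this explicit (a minimal sketch; output omitted):

# Confusion matrices for the linear SVM with cost = 0.01
table(predicted = preds.train1, actual = oj.train$Purchase)
table(predicted = preds.test1, actual = oj.test$Purchase)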

(d) Use the tune() function to select an optimal \(cost\). Consider values in the range 0.01 to 10.

set.seed(13)
tune2 <- tune(svm, Purchase~., data = oj.train, kernel = "linear", ranges = list(cost = c(0.01, 0.1, 1, 2, 4, 6, 8, 10)))
summary(tune2)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     2
## 
## - best performance: 0.16875 
## 
## - Detailed performance results:
##    cost   error dispersion
## 1  0.01 0.17500 0.04370037
## 2  0.10 0.17000 0.05309844
## 3  1.00 0.17625 0.04581439
## 4  2.00 0.16875 0.04759858
## 5  4.00 0.17375 0.04839436
## 6  6.00 0.17375 0.05118390
## 7  8.00 0.17375 0.04839436
## 8 10.00 0.17375 0.05084358

Cost = 2 results in the lowest cross-validated error (16.88%).
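
As in Exercise 7, the cross-validated error can also be plotted against cost to see how flat the curve is around the optimum (sketch; output omitted):

# CV error as a function of cost for the linear SVM, plus the numeric table
plot(tune2)
tune2$performances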

(e) Compute the training and test error rates using this new value for cost.

# Applying cost = 2
svm2 <- svm(Purchase~., data = oj.train, kernel = "linear", cost = 2)

# Training error
preds.train2 <- predict(svm2, oj.train)
round(mean(preds.train2 != oj.train$Purchase), 4)
## [1] 0.1688
# Test error
preds.test2 <- predict(svm2, oj.test)
round(mean(preds.test2 != oj.test$Purchase), 4)
## [1] 0.1667

The linear SVM using cost = 2 yields a training error of 16.88% and a test error of 16.67%. Interestingly, tuning the cost parameter left the training error essentially unchanged and slightly increased the test error compared with cost = 0.01.

(f) Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for \(gamma\).

set.seed(13)
svm3 <- svm(Purchase~., oj.train, kernel = "radial", cost = 0.01)
summary(svm3)
## 
## Call:
## svm(formula = Purchase ~ ., data = oj.train, kernel = "radial", cost = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.01 
## 
## Number of Support Vectors:  607
## 
##  ( 302 305 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

The radial SVM utilizes 607 support vectors to classify Purchase.

# Training error
preds.train3 <- predict(svm3, oj.train)
round(mean(preds.train3 != oj.train$Purchase), 4)
## [1] 0.3775
# Test error
preds.test3 <- predict(svm3, oj.test)
round(mean(preds.test3 != oj.test$Purchase), 4)
## [1] 0.4259

The training error (37.75%) and test error (42.59%) of this model are substantially worse than the error rates of the linear SVM.
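
A plausible explanation, worth verifying rather than asserting, is that with such a small cost the radial SVM hardly moves away from predicting a single class; the distribution of its predictions can be checked directly (sketch; output omitted):

# How the low-cost radial SVM distributes its predictions across classes
table(preds.train3)
table(predicted = preds.test3, actual = oj.test$Purchase)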

# Finding optimal cost
set.seed(13)
tune4 <- tune(svm, Purchase~., data = oj.train, kernel = "radial", ranges = list(cost = c(0.01, 0.1, 1, 2, 4, 6, 8, 10)))
summary(tune4)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     2
## 
## - best performance: 0.1825 
## 
## - Detailed performance results:
##    cost   error dispersion
## 1  0.01 0.37750 0.06313566
## 2  0.10 0.18625 0.04767147
## 3  1.00 0.18500 0.04632314
## 4  2.00 0.18250 0.04972145
## 5  4.00 0.18500 0.05489890
## 6  6.00 0.18500 0.05361903
## 7  8.00 0.18750 0.04823265
## 8 10.00 0.18750 0.04409586

The value of cost that minimizes the cross-validated error is 2.

# Applying cost = 2
svm4 <- svm(Purchase~., data = oj.train, kernel = "radial", cost = 2)

# Training error
preds.train4 <- predict(svm4, oj.train)
round(mean(preds.train4 != oj.train$Purchase), 4)
## [1] 0.1588
# Test error
preds.test4 <- predict(svm4, oj.test)
round(mean(preds.test4 != oj.test$Purchase), 4)
## [1] 0.1407

By optimizing the value of cost, the performance of the model improves significantly. With a training error of 15.88% and a test error of 14.07%, this tuned SVM with a radial kernel exhibits the lowest error rates among the models employed so far.

(g) Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree = 2.

set.seed(13)
svm5 <- svm(Purchase~., oj.train, kernel = "polynomial", degree = 2, cost = 0.01)
summary(svm5)

Like the radial SVM fitted with the same small cost, the polynomial SVM relies on a large number of support vectors to classify whether the customer purchased Citrus Hill or Minute Maid.

# Training error
preds.train5 <- predict(svm5, oj.train)
round(mean(preds.train5 != oj.train$Purchase), 4)
## [1] 0.3762
# Test error
preds.test5 <- predict(svm5, oj.test)
round(mean(preds.test5 != oj.test$Purchase), 4)
## [1] 0.4259

With cost = 0.01, the polynomial SVM's training error (37.62%) is similar to that of the radial SVM, and its test error (42.59%) is identical.

# Finding optimal cost
set.seed(13)
tune6 <- tune(svm, Purchase~., data = oj.train, kernel = "polynomial", degree = 2, ranges = list(cost = c(0.01, 0.1, 1, 2, 4, 6, 8, 10)))
summary(tune6)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     8
## 
## - best performance: 0.1825 
## 
## - Detailed performance results:
##    cost   error dispersion
## 1  0.01 0.37750 0.06313566
## 2  0.10 0.31000 0.05614960
## 3  1.00 0.21375 0.05285265
## 4  2.00 0.19125 0.04715886
## 5  4.00 0.19250 0.05309844
## 6  6.00 0.18500 0.05130248
## 7  8.00 0.18250 0.04866267
## 8 10.00 0.18625 0.05015601

The optimal cost for the 2nd degree polynomial SVM is 8.

# Applying cost = 8
svm6 <- svm(Purchase~., data = oj.train, kernel = "polynomial", degree = 2, cost = 8)

# Training error
preds.train6 <- predict(svm6, oj.train)
round(mean(preds.train6 != oj.train$Purchase), 4)
## [1] 0.1638
# Test error
preds.test6 <- predict(svm6, oj.test)
round(mean(preds.test6 != oj.test$Purchase), 4)
## [1] 0.1593

Once again, we see a substantial improvement from tuning the cost parameter. Nonetheless, the SVM with a 2nd-degree polynomial kernel performs slightly worse than the SVM with a radial kernel.

(h) Overall, which approach seems to give the best results on this data?

In summary, the tuned radial SVM exhibited the best performance of the 6 models tested, yielding a training error of 15.88% and a test error of 14.07%.
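
For reference, the reported test errors can be collected into a single comparison table, reusing the prediction vectors computed in (c) through (g) (a minimal sketch):

# Test error of each model, reusing the prediction vectors from above
test.errors <- c(
  linear_cost0.01 = mean(preds.test1 != oj.test$Purchase),
  linear_cost2    = mean(preds.test2 != oj.test$Purchase),
  radial_cost0.01 = mean(preds.test3 != oj.test$Purchase),
  radial_cost2    = mean(preds.test4 != oj.test$Purchase),
  poly2_cost0.01  = mean(preds.test5 != oj.test$Purchase),
  poly2_cost8     = mean(preds.test6 != oj.test$Purchase)
)
round(sort(test.errors), 4)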