Support Vector Machines

Author

Justin Pons

5

We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.

library(tidyverse)
library(caret)
library(e1071)
library(ISLR2)

5A

Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them.

x1 <- runif (500) - 0.5
x2 <- runif (500) - 0.5
y <- 1 * (x1^2 - x2^2 > 0)

5B

Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y-axis.

# cbind() coerces the factor to its underlying numeric codes (1 and 2),
# so y is converted back to a factor after the data frame is built
df <- cbind(x1, x2, as.factor(y)) |> 
  as.data.frame()
colnames(df) <- c("x1","x2","y")
df$y <- as.factor(df$y)
df |> ggplot(aes(x = x1, y = x2, color = y)) + 
  geom_point()

5C

Fit a logistic regression model to the data, using X1 and X2 as predictors.

mod.log <- glm(y ~ ., data = df, family = "binomial")

5D

Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.

df$preds_log <- ifelse(predict(mod.log, type = "response") > 0.5, 2, 1) |> 
  as.factor()
df |> ggplot(aes(x = x1, y = x2, color = preds_log)) + 
  geom_point()

confusionMatrix(df$y, df$preds_log)
Confusion Matrix and Statistics

          Reference
Prediction   1   2
         1 145 105
         2 105 145
                                          
               Accuracy : 0.58            
                 95% CI : (0.5354, 0.6237)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : 0.0002002       
                                          
                  Kappa : 0.16            
                                          
 Mcnemar's Test P-Value : 1.0000000       
                                          
            Sensitivity : 0.58            
            Specificity : 0.58            
         Pos Pred Value : 0.58            
         Neg Pred Value : 0.58            
             Prevalence : 0.50            
         Detection Rate : 0.29            
   Detection Prevalence : 0.50            
      Balanced Accuracy : 0.58            
                                          
       'Positive' Class : 1               
                                          

We can see that the logistic model with only linear terms produces a linear decision boundary and performs poorly on the training data (58% accuracy).
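
To see why, we can overlay the decision boundary implied by the fitted coefficients (the set of points where the linear predictor is zero) on the scatterplot. This is a minimal sketch using the mod.log fit from above; b is just a helper name introduced here for the coefficient vector.

b <- coef(mod.log)
# boundary of the linear model: b0 + b1*x1 + b2*x2 = 0, i.e. x2 = -(b0 + b1*x1)/b2
df |> ggplot(aes(x = x1, y = x2, color = preds_log)) + 
  geom_point() +
  geom_abline(intercept = -b["(Intercept)"] / b["x2"],
              slope = -b["x1"] / b["x2"])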

5E

Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors.

Since we created the data, we know that there is a quadratic relationship.

mod.log2 <- glm(formula = y ~ x1 + x2 + I(x1^2) + I(x2^2), family = "binomial", data = df)
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
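
These warnings are expected here: with the squared terms included, the two classes are perfectly separable, so the maximum-likelihood coefficients diverge and the fitted probabilities are pushed to numerically 0 or 1. A quick sanity check, as a sketch using the mod.log2 fit above:

# with perfect separation the fitted probabilities pile up at the extremes
summary(predict(mod.log2, type = "response"))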

5F

Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.

df$preds_log2 <- ifelse(predict(mod.log2, type = "response") > 0.5, 2, 1) |> 
  as.factor()
df |> ggplot(aes(x = x1, y = x2, color = preds_log2)) + 
  geom_point()

The quadratic terms recover the true boundary: every training observation is classified correctly.
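
To see the boundary itself rather than just the labelled points, one option is to predict over a fine grid and draw the 0.5-probability contour. A sketch reusing mod.log2; grid is a helper data frame introduced here, and the 200-point resolution is arbitrary.

grid <- expand.grid(x1 = seq(-0.5, 0.5, length.out = 200),
                    x2 = seq(-0.5, 0.5, length.out = 200))
grid$p <- predict(mod.log2, newdata = grid, type = "response")
ggplot() +
  geom_point(data = df, aes(x = x1, y = x2, color = preds_log2)) +
  geom_contour(data = grid, aes(x = x1, y = x2, z = p), breaks = 0.5)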

confusionMatrix(df$y, df$preds_log2)
Confusion Matrix and Statistics

          Reference
Prediction   1   2
         1 250   0
         2   0 250
                                     
               Accuracy : 1          
                 95% CI : (0.9926, 1)
    No Information Rate : 0.5        
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0        
            Specificity : 1.0        
         Pos Pred Value : 1.0        
         Neg Pred Value : 1.0        
             Prevalence : 0.5        
         Detection Rate : 0.5        
   Detection Prevalence : 0.5        
      Balanced Accuracy : 1.0        
                                     
       'Positive' Class : 1          
                                     

5G

Fit a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

mod.svm <- svm(y ~ x1 + x2, data = df, kernel = "linear", cost = 10, scale = FALSE)
df$preds_svm <- predict(mod.svm, type = "response") |> 
  as.factor()
df |> ggplot(aes(x = x1, y = x2, color = preds_svm)) + 
  geom_point()

The linear kernel cannot represent the quadratic boundary, so the support vector classifier misclassifies a large share of the training observations.
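
e1071 also provides a plot() method for svm objects that shades the predicted regions and marks the support vectors; since df now carries extra prediction columns, the formula argument is used here to pick the two plotting dimensions. A minimal sketch:

# shaded class regions, data points, and support vectors for the linear fit
plot(mod.svm, data = df, formula = x2 ~ x1)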

confusionMatrix(df$y, df$preds_svm)
Confusion Matrix and Statistics

          Reference
Prediction   1   2
         1 166  84
         2 111 139
                                         
               Accuracy : 0.61           
                 95% CI : (0.5657, 0.653)
    No Information Rate : 0.554          
    P-Value [Acc > NIR] : 0.006483       
                                         
                  Kappa : 0.22           
                                         
 Mcnemar's Test P-Value : 0.062617       
                                         
            Sensitivity : 0.5993         
            Specificity : 0.6233         
         Pos Pred Value : 0.6640         
         Neg Pred Value : 0.5560         
             Prevalence : 0.5540         
         Detection Rate : 0.3320         
   Detection Prevalence : 0.5000         
      Balanced Accuracy : 0.6113         
                                         
       'Positive' Class : 1              
                                         

5H

Fit a SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

set.seed(1111)
idx <- sample(500, 400, replace = F)
tune.out <- tune(svm , y ~ x1 + x2, data = df[idx, ],
                 kernel = "radial",
                 ranges = list(
                   cost = c(0.1, 1, 10, 100, 1000),
                   gamma = c(0.5, 1, 2, 3, 4)
                   )
                 )
table(true = df[-idx, "y"],
      pred = predict(tune.out$best.model , newdata = df[-idx, ]
                     )
      )
    pred
true  1  2
   1 58  0
   2  3 39

This radial-kernel SVM achieves a test error rate of 3% (3 of the 100 held-out observations are misclassified).
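
The same figure can be computed directly from the held-out predictions instead of being read off the table; a minimal sketch using tune.out and idx from above:

# misclassification rate on the 100 held-out observations
mean(predict(tune.out$best.model, newdata = df[-idx, ]) != df[-idx, "y"])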

5I

It is clear that, regardless of the model used, understanding the relationship between the predictors and the response variable is crucial to constructing a high-performing model. Here, only the approaches that can express a quadratic boundary (logistic regression with squared terms and the radial-kernel SVM) fit the data well.

7

In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.

data("Auto")
df <- Auto
df <- df |> 
  select(-name) |> 
  mutate(origin = as.factor(origin))

7A

df <- df |> 
  mutate(response = as.factor(ifelse(mpg > median(mpg),1,0))) |> 
  select(-mpg)

7B

idx <- sample(392, 300, replace = F)
svm.linear <- tune(svm,
                  response ~ .,
                  data = df[idx,],
                  kernel = "linear",
                  ranges = list(
                   cost = c(0.1, 1, 10, 100, 1000)
                   )
)
svm.linear$performances |>
  mutate(cost = as.factor(cost)) |> 
  ggplot(aes(x=cost,y=error)) +
  geom_col()

The lowest cross-validation error occurs at cost = 10, which matches the best parameters reported below.

svm.linear$best.parameters
  cost
3   10
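
The full cross-validation table behind the plot can also be inspected directly; a sketch using the svm.linear tuning object from above:

# cross-validation error (and its dispersion) for each candidate cost
svm.linear$performances |> 
  arrange(error)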

7C

svm.radial <- tune(svm,
                   response ~ .,
                   data = df[idx,],
                   kernel = "radial",
                   ranges = list(
                     cost = c(0.1, 1, 10, 100, 1000),
                     gamma = c(0.5, 1, 2, 3, 4)
                     )
                   )
svm.poly <- tune(svm,
                   response ~ .,
                   data = df[idx,],
                   kernel = "polynomial",
                   ranges = list(
                     cost = c(0.1, 1, 10, 100, 1000),
                     gamma = c(0.5, 1, 2, 3, 4),
                     degree = c(1,2,3,4)
                     )
                   )

7D

svm.radial$performances |>
  mutate(cost = as.factor(cost)) |> 
  ggplot(aes(x=cost,y=error)) +
  geom_col()

svm.radial$best.parameters
  cost gamma
7    1     1
svm.poly$performances |>
  mutate(cost = as.factor(cost)) |> 
  ggplot(aes(x=cost,y=error)) +
  geom_col()

svm.poly$best.parameters
  cost gamma degree
4  100   0.5      1
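
Because the radial and polynomial searches tune more than one parameter, a single bar per cost value stacks the errors across gamma, which can overstate the error for a given cost. An alternative sketch for the radial results that keeps cost and gamma separate:

svm.radial$performances |>
  ggplot(aes(x = factor(cost), y = error, fill = factor(gamma))) +
  geom_col(position = "dodge") +
  labs(x = "cost", fill = "gamma")

Per the best.parameters output above, the radial kernel does best at cost = 1 with gamma = 1, while the best polynomial fit uses degree = 1 (an essentially linear boundary) with cost = 100.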

8

This problem involves the OJ data set which is part of the ISLR2 package.

df <- data("OJ")

8A

Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

idx <- sample(nrow(OJ),800,replace=F)
train <- OJ[idx,]
test <- OJ[-idx,]

8B

Fit a support vector classifier to the training data using cost = 0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.

svm.linear <- svm(Purchase ~ .,
                  data = train,
                  cost = .01,
                  kernel = "linear")
summary(svm.linear)

Call:
svm(formula = Purchase ~ ., data = train, cost = 0.01, kernel = "linear")


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  0.01 

Number of Support Vectors:  442

 ( 221 221 )


Number of Classes:  2 

Levels: 
 CH MM
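
With such a small cost the margin is wide: 442 of the 800 training observations (just over half) are support vectors, split almost evenly between the CH and MM classes. The count can also be pulled from the fitted object; a sketch that assumes the index component of e1071 svm fits:

# indices of the support vectors; the count matches the summary above
length(svm.linear$index)
length(svm.linear$index) / nrow(train)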

8C

What are the training and test error rates?

data.frame(train_error = mean(predict(svm.linear, train) != train$Purchase), 
           test_error = mean(predict(svm.linear, test) != test$Purchase))
  train_error test_error
1      0.1725  0.1666667

The training error is about 17.3% and the test error is about 16.7%.

8D

Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.

svm.linear <- tune(svm,
                  Purchase ~ .,
                  data = train,
                  kernel = "linear",
                  ranges = list(
                   cost = c(.01, .1, 1, 10)
                   )
)
svm.linear$best.parameters
  cost
4   10

Cost = 10 gives the lowest cross-validation error.
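
Mirroring the bar plots from exercise 7, the cross-validation errors behind this choice can be plotted as well; a sketch using the svm.linear tuning object:

svm.linear$performances |>
  ggplot(aes(x = factor(cost), y = error)) +
  geom_col() +
  labs(x = "cost")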

8E

Compute the training and test error rates using this new value for cost.

data.frame(train_error = mean(predict(svm.linear$best.model, train) != train$Purchase), 
           test_error = mean(predict(svm.linear$best.model, test) != test$Purchase))
  train_error test_error
1     0.16625  0.1481481

Relative to cost = 0.01, the training error decreases to about 16.6% and the test error to about 14.8%.

8F

Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.

svm.radial <- svm(Purchase ~ .,
                  data = train,
                  cost = .01,
                  kernel = "radial")
summary(svm.radial)

Call:
svm(formula = Purchase ~ ., data = train, cost = 0.01, kernel = "radial")


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  0.01 

Number of Support Vectors:  616

 ( 309 307 )


Number of Classes:  2 

Levels: 
 CH MM
data.frame(train_error = mean(predict(svm.radial, train) != train$Purchase), 
           test_error = mean(predict(svm.radial, test) != test$Purchase))
  train_error test_error
1     0.38375  0.4074074
svm.radial <- tune(svm,
                  Purchase ~ .,
                  data = train,
                  kernel = "radial",
                  ranges = list(
                   cost = c(.01, .1, 1, 10)
                   )
)
svm.radial$best.parameters
  cost
4   10
data.frame(train_error = mean(predict(svm.radial$best.model, train) != train$Purchase), 
           test_error = mean(predict(svm.radial$best.model, test) != test$Purchase))
  train_error test_error
1     0.14125  0.1777778

8G

Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree = 2.

svm.poly <- svm(Purchase ~ .,
                data = train,
                cost = .01,
                kernel = "polynomial",
                degree = 2)
summary(svm.poly)

Call:
svm(formula = Purchase ~ ., data = train, cost = 0.01, kernel = "polynomial", 
    degree = 2)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  polynomial 
       cost:  0.01 
     degree:  2 
     coef.0:  0 

Number of Support Vectors:  621

 ( 314 307 )


Number of Classes:  2 

Levels: 
 CH MM
data.frame(train_error = mean(predict(svm.poly, train) != train$Purchase), 
           test_error = mean(predict(svm.poly, test) != test$Purchase))
  train_error test_error
1     0.38375  0.4074074
svm.poly <- tune(svm,
                  Purchase ~ .,
                  data = train,
                  kernel = "polynomial",
                  degree = 2,
                  ranges = list(
                   cost = c(.01, .1, 1, 10)
                   )
)
svm.poly$best.parameters
  cost
4   10
data.frame(train_error = mean(predict(svm.poly$best.model, train) != train$Purchase), 
           test_error = mean(predict(svm.poly$best.model, test) != test$Purchase))
  train_error test_error
1     0.15625  0.1740741

8H

Overall, which approach seems to give the best results on this data?

After tuning, the linear kernel achieves the lowest test error on this split (about 14.8%), while the tuned polynomial (about 17.4%) and radial (about 17.8%) kernels perform similarly but somewhat worse. Overall, the support vector classifier with a well-chosen cost seems to give the best results on this data.
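
A compact way to back up this comparison is to compute the held-out test error for each tuned model side by side; a minimal sketch using the three tuning objects and the test set defined above:

# test misclassification rate for the best model from each kernel's tuning run
sapply(list(linear = svm.linear, radial = svm.radial, polynomial = svm.poly),
       function(fit) mean(predict(fit$best.model, newdata = test) != test$Purchase))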