Introduction

In this project, I evaluate the OJ (orange juice) data set using support vector machines and compare the models' predictions on held-out test data.

Support vector machines (SVMs) are powerful supervised learning models, with associated learning algorithms, that analyze data for classification and regression.

The goal of an SVM is to take groups of observations and construct boundaries that predict which group future observations belong to, based on their measurements. The groups to be separated are called “classes”. SVMs can handle any number of classes and observations of any dimension. Their decision boundaries can take almost any shape (linear, radial, and polynomial, among others), which makes them flexible enough for most classification tasks.

Maximal Margin Classifier: If the classes are separable by a linear boundary, we can use a Maximal Margin Classifier to find the classification boundary.
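
As a minimal sketch of the separable case, the toy data below (hypothetical, not taken from the OJ data) is linearly separable by construction. svm() exposes a cost parameter rather than a separate hard-margin mode, so a very large cost is assumed here as an approximation to the maximal margin classifier.

```r
library(e1071)

# Hypothetical toy data: two well-separated clusters (not part of the OJ data)
toy <- data.frame(
  x1 = c(1, 2, 1, 2, 5, 6, 5, 6),
  x2 = c(1, 1, 2, 2, 5, 5, 6, 6),
  y  = factor(rep(c("A", "B"), each = 4))
)

# A very large cost makes margin violations prohibitively expensive,
# which approximates a hard-margin (maximal margin) fit
mmc.fit <- svm(y ~ ., data = toy, kernel = "linear", cost = 1e5, scale = FALSE)

# Share of training points that land on the correct side of the boundary
toy.pred <- predict(mmc.fit, toy)
toy.acc  <- mean(toy.pred == toy$y)
toy.acc
```

Because the two clusters do not overlap, the fitted boundary should classify every training point correctly.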

Support Vector Classifiers: Most real data sets are not fully separable by a linear boundary, so a modified method is needed. Whether or not the data are separable, the svm() call has the same syntax. For data that are not linearly separable, however, the cost argument takes on real importance: it quantifies the penalty for an observation falling on the wrong side of the classification boundary. We can plot the fit in the same way as in the completely separable case.
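
To illustrate the role of cost, the sketch below fits the same linear classifier twice on simulated, overlapping classes. The data and the two cost values are assumptions chosen for illustration. A small cost tolerates many margin violations, so more observations end up as support vectors.

```r
library(e1071)

set.seed(1)
# Hypothetical overlapping classes: two Gaussian clouds whose means differ by 1.5
n <- 100
sim <- data.frame(
  x1 = c(rnorm(n), rnorm(n, mean = 1.5)),
  x2 = c(rnorm(n), rnorm(n, mean = 1.5)),
  y  = factor(rep(c("A", "B"), each = n))
)

# Same model, very small versus large penalty for margin violations
fit.lowcost  <- svm(y ~ ., data = sim, kernel = "linear", cost = 0.001)
fit.highcost <- svm(y ~ ., data = sim, kernel = "linear", cost = 10)

# tot.nSV is the total number of support vectors in each fitted model
n.sv <- c(low = fit.lowcost$tot.nSV, high = fit.highcost$tot.nSV)
n.sv
```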

But how do we decide how costly these misclassifications actually are? Instead of specifying a cost up front, we can use the tune() function from e1071 to try a range of costs by cross-validation and identify the value that produces the best-fitting model.

Support Vector Machines: Support Vector Classifiers are a subset of the group of classification structures known as Support Vector Machines. Support Vector Machines can construct classification boundaries that are nonlinear in shape. The kernel options in the svm() command from the e1071 package are linear, polynomial, radial, and sigmoid. With these, svm() can construct a classification boundary, whether linear or nonlinear, for data that may or may not be separable.

Kernel:
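
As a rough sketch, the four kernel options can be written as plain functions of two observation vectors u and v. The formulas follow the parameterization documented for svm() in e1071 (gamma, coef0, degree). The function names and the example vectors are illustrative assumptions.

```r
# Kernel functions in the e1071 parameterization; the names are illustrative
k.linear     <- function(u, v) sum(u * v)
k.polynomial <- function(u, v, gamma, coef0, degree) (gamma * sum(u * v) + coef0)^degree
k.radial     <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))
k.sigmoid    <- function(u, v, gamma, coef0) tanh(gamma * sum(u * v) + coef0)

# Two hypothetical observations
u <- c(1, 2)
v <- c(3, 4)

kernel.values <- c(
  linear     = k.linear(u, v),
  polynomial = k.polynomial(u, v, gamma = 0.5, coef0 = 1, degree = 2),
  radial     = k.radial(u, v, gamma = 0.5),
  sigmoid    = k.sigmoid(u, v, gamma = 0.5, coef0 = 0)
)
kernel.values
```

The radial kernel depends only on the distance between u and v, which is why its boundaries can curve around clusters, while the linear and polynomial kernels depend on the inner product.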

SVMs for Multiple Classes: SVM techniques also extend to more than two classes of observations.
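
A minimal sketch of the multi-class case, using R's built-in iris data (three species) as a stand-in, since OJ has only two classes. According to the e1071 documentation, svm() handles more than two classes with a one-against-one scheme, so no extra argument is assumed here.

```r
library(e1071)

# iris ships with R and has three classes in Species
multi.fit  <- svm(Species ~ ., data = iris, kernel = "radial")
multi.pred <- predict(multi.fit, iris)

# Rows are the true species, columns the predicted species
multi.tab <- table(iris$Species, multi.pred)
multi.tab
```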

Data:

The Orange Juice (OJ) data frame from the ISLR package contains 1070 observations on the following 18 variables.

Purchase: A factor with levels CH and MM indicating whether the customer purchased Citrus Hill or Minute Maid Orange Juice

WeekofPurchase: Week of purchase

StoreID: Store ID

PriceCH: Price charged for CH

PriceMM: Price charged for MM

DiscCH: Discount offered for CH

DiscMM: Discount offered for MM

SpecialCH: Indicator of special on CH

SpecialMM: Indicator of special on MM

LoyalCH: Customer brand loyalty for CH

SalePriceMM: Sale price for MM

SalePriceCH: Sale price for CH

PriceDiff: Sale price of MM less sale price of CH

Store7: A factor with levels No and Yes indicating whether the sale is at Store 7

PctDiscMM: Percentage discount for MM

PctDiscCH: Percentage discount for CH

ListPriceDiff: List price of MM less list price of CH

STORE: Which of 5 possible stores the sale occurred at

Objective:

  1. Fit SVM with cost value of 0.01

  2. Use tune() function to find optimal cost

  3. Fit SVM with radial kernel

  4. Fit SVM with polynomial kernel

  5. Compare the models



Loading Libraries

#Loading necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#library(tidyverse)
library(caret)  
## Loading required package: lattice
## Loading required package: ggplot2
library(formattable)
library(ISLR)

# Load the remaining packages; e1071 provides svm() and tune()
packages <- c('caret', 'formattable', 'ISLR', 'e1071')

sapply(packages, require, character.only = TRUE)
## Loading required package: e1071
##       caret formattable        ISLR       e1071 
##        TRUE        TRUE        TRUE        TRUE
data(OJ)

Data Exploration

head(OJ)
##   Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 1       CH            237       1    1.75    1.99   0.00    0.0         0
## 2       CH            239       1    1.75    1.99   0.00    0.3         0
## 3       CH            245       1    1.86    2.09   0.17    0.0         0
## 4       MM            227       1    1.69    1.69   0.00    0.0         0
## 5       CH            228       7    1.69    1.69   0.00    0.0         0
## 6       CH            230       7    1.69    1.99   0.00    0.0         0
##   SpecialMM  LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 1         0 0.500000        1.99        1.75      0.24     No  0.000000
## 2         1 0.600000        1.69        1.75     -0.06     No  0.150754
## 3         0 0.680000        2.09        1.69      0.40     No  0.000000
## 4         0 0.400000        1.69        1.69      0.00     No  0.000000
## 5         0 0.956535        1.69        1.69      0.00    Yes  0.000000
## 6         1 0.965228        1.99        1.69      0.30    Yes  0.000000
##   PctDiscCH ListPriceDiff STORE
## 1  0.000000          0.24     1
## 2  0.000000          0.24     1
## 3  0.091398          0.23     1
## 4  0.000000          0.00     1
## 5  0.000000          0.00     0
## 6  0.000000          0.30     0
str(OJ)
## 'data.frame':    1070 obs. of  18 variables:
##  $ Purchase      : Factor w/ 2 levels "CH","MM": 1 1 1 2 1 1 1 1 1 1 ...
##  $ WeekofPurchase: num  237 239 245 227 228 230 232 234 235 238 ...
##  $ StoreID       : num  1 1 1 1 7 7 7 7 7 7 ...
##  $ PriceCH       : num  1.75 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceMM       : num  1.99 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 ...
##  $ DiscCH        : num  0 0 0.17 0 0 0 0 0 0 0 ...
##  $ DiscMM        : num  0 0.3 0 0 0 0 0.4 0.4 0.4 0.4 ...
##  $ SpecialCH     : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ SpecialMM     : num  0 1 0 0 0 1 1 0 0 0 ...
##  $ LoyalCH       : num  0.5 0.6 0.68 0.4 0.957 ...
##  $ SalePriceMM   : num  1.99 1.69 2.09 1.69 1.69 1.99 1.59 1.59 1.59 1.59 ...
##  $ SalePriceCH   : num  1.75 1.75 1.69 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceDiff     : num  0.24 -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 ...
##  $ Store7        : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 2 2 ...
##  $ PctDiscMM     : num  0 0.151 0 0 0 ...
##  $ PctDiscCH     : num  0 0 0.0914 0 0 ...
##  $ ListPriceDiff : num  0.24 0.24 0.23 0 0 0.3 0.3 0.24 0.24 0.24 ...
##  $ STORE         : num  1 1 1 1 0 0 0 0 0 0 ...

Data Partition

set.seed(111)
samp <- sample(nrow(OJ), 800)

train <- OJ[samp,]
test <- OJ[-samp,]
1. Fit SVM with cost value of 0.01
svm.linear1 <- svm(Purchase ~ .,kernel = "linear", cost = 0.01, data = train)
summary(svm.linear1)
## 
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "linear", cost = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  446
## 
##  ( 225 221 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
# Train Prediction
train.pred <- predict(svm.linear1, train)
# Train Confusion Matrix
err.tr1 <- table(train$Purchase, train.pred)
err.tr1
##     train.pred
##       CH  MM
##   CH 431  58
##   MM  79 232
# Train error Rate
Tr.err1 <- (1 - (sum(diag(err.tr1))/sum(err.tr1)))
Tr.err1
## [1] 0.17125
# Test Prediction
test.pred <- predict(svm.linear1, test)
# Test Confusion Matrix
err.te1 <- table(test$Purchase, test.pred)
err.te1
##     test.pred
##       CH  MM
##   CH 147  17
##   MM  24  82
# Test error Rate
Te.err1 <- (1- (sum(diag(err.te1))/sum(err.te1)))
Te.err1
## [1] 0.1518519
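
The error-rate calculation above is repeated for every model, so it can be wrapped in a small helper. The function name error_rate is a hypothetical choice, and the matrix below re-enters the training confusion matrix of the linear fit by hand so that the sketch runs on its own.

```r
# Hypothetical helper: overall misclassification rate from a confusion matrix
error_rate <- function(conf.mat) {
  1 - sum(diag(conf.mat)) / sum(conf.mat)
}

# Training confusion matrix of the linear fit, entered by hand
# (rows = actual class, columns = predicted class)
cm.train <- matrix(c(431, 79, 58, 232), nrow = 2,
                   dimnames = list(actual = c("CH", "MM"),
                                   predicted = c("CH", "MM")))

overall.err <- error_rate(cm.train)   # 137 of 800 observations misclassified

# Per-class error: share of each actual class that was predicted wrongly
class.err <- 1 - diag(cm.train) / rowSums(cm.train)

overall.err
class.err
```

The per-class view shows that MM purchases are misclassified more often than CH purchases, which the overall rate alone hides.
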
2. Use tune() function to find optimal cost

Tune Model

tune.out <- tune(svm, Purchase ~ ., data = train, kernel = "linear",
                 ranges = list(cost = seq(0.01, 10, length.out = 20)))

summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##      cost
##  2.113158
## 
## - best performance: 0.1725 
## 
## - Detailed performance results:
##          cost   error dispersion
## 1   0.0100000 0.18000 0.04758034
## 2   0.5357895 0.17750 0.04669642
## 3   1.0615789 0.17500 0.04787136
## 4   1.5873684 0.17375 0.05015601
## 5   2.1131579 0.17250 0.05130248
## 6   2.6389474 0.17375 0.05015601
## 7   3.1647368 0.17500 0.04965156
## 8   3.6905263 0.17500 0.04965156
## 9   4.2163158 0.17500 0.05270463
## 10  4.7421053 0.17500 0.05270463
## 11  5.2678947 0.17500 0.05368374
## 12  5.7936842 0.17625 0.05382908
## 13  6.3194737 0.17625 0.05382908
## 14  6.8452632 0.17500 0.05103104
## 15  7.3710526 0.17625 0.05152197
## 16  7.8968421 0.17625 0.05152197
## 17  8.4226316 0.17625 0.05152197
## 18  8.9484211 0.17625 0.05152197
## 19  9.4742105 0.17750 0.05361903
## 20 10.0000000 0.17750 0.05361903
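
When reading a tuning table like this one, it can help to compare the gap between the best and worst cost with the reported dispersion (by default in tune.control(), the standard deviation of the error across folds). The sketch below copies the error column from the summary above by hand; the variable names are illustrative.

```r
# Cross-validated errors copied from the tuning summary above
cv.error <- c(0.18000, 0.17750, 0.17500, 0.17375, 0.17250, 0.17375, 0.17500,
              0.17500, 0.17500, 0.17500, 0.17500, 0.17625, 0.17625, 0.17500,
              0.17625, 0.17625, 0.17625, 0.17625, 0.17750, 0.17750)
cost.grid <- seq(0.01, 10, length.out = 20)

best            <- which.min(cv.error)
best.dispersion <- 0.05130248   # dispersion reported for the best cost

# Gap between the worst and the best cost on the grid
error.spread <- max(cv.error) - min(cv.error)

c(best.cost = cost.grid[best], spread = error.spread, dispersion = best.dispersion)
```

Here the spread across the whole grid is much smaller than the dispersion at the best cost, so the linear model appears fairly insensitive to cost over this range.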

Fit Model

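# Refit with the cost selected by cross-validation
# (spelling out best.parameters avoids relying on partial matching of `$`)
svm.linear2 <- svm(Purchase ~ ., kernel = "linear", data = train, 
                  cost = tune.out$best.parameters$cost)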
# Train Prediction
train.pred <- predict(svm.linear2, train)
# Train Confusion Matrix
err.tr2<-table(train$Purchase, train.pred)
err.tr2
##     train.pred
##       CH  MM
##   CH 429  60
##   MM  79 232
# Train error Rate
Tr.err2 <- 1- (sum(diag(err.tr2))/sum(err.tr2))
Tr.err2
## [1] 0.17375
# Test Prediction
test.pred <- predict(svm.linear2, test)
# Test Confusion Matrix
err.te2<-table(test$Purchase, test.pred)
err.te2
##     test.pred
##       CH  MM
##   CH 146  18
##   MM  23  83
# Test error Rate
Te.err2 <- 1- (sum(diag(err.te2))/sum(err.te2))
Te.err2
## [1] 0.1518519
3. Fit SVM with radial kernel
svm.radial1 <- svm(Purchase ~ ., kernel = "radial", data = train)
summary(svm.radial1)
## 
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  385
## 
##  ( 197 188 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
# Train Prediction
train.pred <- predict(svm.radial1, train)
# Train Confusion Matrix
err.tr3<-table(train$Purchase, train.pred)
err.tr3
##     train.pred
##       CH  MM
##   CH 447  42
##   MM  81 230
# Train error Rate
Tr.err3 <- 1- (sum(diag(err.tr3))/sum(err.tr3))
Tr.err3
## [1] 0.15375
# Test Prediction
test.pred <- predict(svm.radial1, test)
# Test Confusion Matrix
err.te3<-table(test$Purchase, test.pred)
err.te3
##     test.pred
##       CH  MM
##   CH 146  18
##   MM  25  81
# Test error Rate
Te.err3 <- 1- (sum(diag(err.te3))/sum(err.te3))
Te.err3
## [1] 0.1592593
Fit SVM with radial kernel and tuned cost

Tune Model

tune.out <- tune(svm, Purchase ~ ., data = train, kernel = "radial",
                 ranges = list(cost = seq(0.01, 10, length.out = 20)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##      cost
##  1.061579
## 
## - best performance: 0.1775 
## 
## - Detailed performance results:
##          cost   error dispersion
## 1   0.0100000 0.38875 0.05015601
## 2   0.5357895 0.18250 0.03917553
## 3   1.0615789 0.17750 0.04241004
## 4   1.5873684 0.18375 0.04168749
## 5   2.1131579 0.18875 0.03793727
## 6   2.6389474 0.18375 0.03729108
## 7   3.1647368 0.18750 0.03952847
## 8   3.6905263 0.18875 0.04016027
## 9   4.2163158 0.19000 0.04116363
## 10  4.7421053 0.18875 0.04143687
## 11  5.2678947 0.18750 0.04039733
## 12  5.7936842 0.18500 0.04073969
## 13  6.3194737 0.18750 0.04039733
## 14  6.8452632 0.19125 0.03866254
## 15  7.3710526 0.19000 0.03944053
## 16  7.8968421 0.19000 0.03622844
## 17  8.4226316 0.18875 0.03606033
## 18  8.9484211 0.18750 0.03486083
## 19  9.4742105 0.18750 0.03385016
## 20 10.0000000 0.18750 0.03535534
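
In the table above, the error drops from 0.389 at cost 0.01 to 0.183 at the next grid point, 0.536, so the linearly spaced grid leaves the interesting low-cost range almost unexplored. As a sketch, the comparison below contrasts that grid with a log-spaced one over the same range, the same spacing the polynomial tuning call in this document uses.

```r
# Linearly spaced grid used in the tuning call above
grid.linear <- seq(0.01, 10, length.out = 20)

# Log-spaced alternative over the same range, shown for contrast
grid.log <- 10^seq(-2, 1, by = 0.25)

# How many candidate costs fall below 0.5 in each grid?
n.small <- c(linear = sum(grid.linear < 0.5), log = sum(grid.log < 0.5))
n.small
```

The log-spaced grid has fewer points overall but places more of them where the error changes fastest.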

Fit Model

svm.radial2 <- svm(Purchase ~ ., kernel = "radial", data = train, 
                  cost = tune.out$best.parameters$cost)
summary(svm.radial2)
## 
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "radial", cost = tune.out$best.parameters$cost)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1.061579 
## 
## Number of Support Vectors:  381
## 
##  ( 195 186 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
# Train Prediction
train.pred <- predict(svm.radial2, train)
# Train Confusion Matrix
err.tr4<-table(train$Purchase, train.pred)
err.tr4
##     train.pred
##       CH  MM
##   CH 447  42
##   MM  81 230
# Train error Rate
Tr.err4 <- 1- (sum(diag(err.tr4))/sum(err.tr4))
Tr.err4
## [1] 0.15375
# Test Prediction
test.pred <- predict(svm.radial2, test)
# Test Confusion Matrix
err.te4<-table(test$Purchase, test.pred)
err.te4
##     test.pred
##       CH  MM
##   CH 147  17
##   MM  25  81
# Test error Rate
Te.err4 <- 1- (sum(diag(err.te4))/sum(err.te4))
Te.err4
## [1] 0.1555556
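
The tuning above varies only cost and leaves gamma, the width parameter of the radial kernel, at its default. As a sketch, tune() accepts several parameters in ranges and cross-validates every combination. The simulated data, the grid values, and the 5-fold setting below are assumptions chosen to keep the example small.

```r
library(e1071)

set.seed(1)
# Hypothetical two-class data with a circular boundary (not the OJ data)
n <- 120
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(ifelse(sim$x1^2 + sim$x2^2 > 1.5, "outer", "inner"))

# Search a small grid over both cost and gamma; the grid values are illustrative
tune.rad <- tune(svm, y ~ ., data = sim, kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10),
                               gamma = c(0.1, 1, 5)),
                 tunecontrol = tune.control(cross = 5))

tune.rad$best.parameters    # one row with the selected cost and gamma
tune.rad$best.performance   # cross-validated error at that combination
```

The same pattern could be applied to the OJ training data by swapping in Purchase ~ . and data = train, at the price of a longer run time.
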
4. Fit SVM with polynomial kernel
svm.poly1 <- svm(Purchase ~ ., kernel = "polynomial", data = train, degree = 2)
summary(svm.poly1)
## 
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "polynomial", 
##     degree = 2)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  2 
##      coef.0:  0 
## 
## Number of Support Vectors:  459
## 
##  ( 232 227 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
# Train Prediction
train.pred <- predict(svm.poly1, train)
# Train Confusion Matrix
err.tr5<-table(train$Purchase, train.pred)
err.tr5
##     train.pred
##       CH  MM
##   CH 455  34
##   MM 118 193
# Train error Rate
Tr.err5 <- 1- (sum(diag(err.tr5))/sum(err.tr5))
Tr.err5
## [1] 0.19
# Test Prediction
test.pred <- predict(svm.poly1, test)
# Test Confusion Matrix
err.te5<-table(test$Purchase, test.pred)
err.te5
##     test.pred
##       CH  MM
##   CH 152  12
##   MM  35  71
# Test error Rate
Te.err5 <- 1- (sum(diag(err.te5))/sum(err.te5))
Te.err5
## [1] 0.1740741
Fit SVM with polynomial kernel and tuned cost
tune.out <- tune(svm, Purchase ~ ., data = train, kernel = "polynomial", degree = 2, 
                 ranges =list(cost = 10^seq(-2,1, by = 0.25)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##      cost
##  3.162278
## 
## - best performance: 0.19375 
## 
## - Detailed performance results:
##           cost   error dispersion
## 1   0.01000000 0.38875 0.05604128
## 2   0.01778279 0.37125 0.04966904
## 3   0.03162278 0.36375 0.04427267
## 4   0.05623413 0.34625 0.04931827
## 5   0.10000000 0.32750 0.03717451
## 6   0.17782794 0.26000 0.02813657
## 7   0.31622777 0.22250 0.04669642
## 8   0.56234133 0.21500 0.04281744
## 9   1.00000000 0.20875 0.05036326
## 10  1.77827941 0.20125 0.06136469
## 11  3.16227766 0.19375 0.06073908
## 12  5.62341325 0.19500 0.05839283
## 13 10.00000000 0.19500 0.06129392
svm.poly2 <- svm(Purchase ~ ., kernel = "polynomial", degree = 2, data = train,
                cost = tune.out$best.parameters$cost)
summary(svm.poly2)
## 
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "polynomial", 
##     degree = 2, cost = tune.out$best.parameters$cost)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  3.162278 
##      degree:  2 
##      coef.0:  0 
## 
## Number of Support Vectors:  391
## 
##  ( 201 190 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
# Train Prediction
train.pred <- predict(svm.poly2, train)
# Train Confusion Matrix
err.tr6<-table(train$Purchase, train.pred)
err.tr6
##     train.pred
##       CH  MM
##   CH 452  37
##   MM 104 207
# Train error Rate
Tr.err6 <- 1- (sum(diag(err.tr6))/sum(err.tr6))
Tr.err6
## [1] 0.17625
# Test Prediction
test.pred <- predict(svm.poly2, test)
# Test Confusion Matrix
err.te6<-table(test$Purchase, test.pred)
err.te6
##     test.pred
##       CH  MM
##   CH 149  15
##   MM  28  78
# Test error Rate
Te.err6 <- 1- (sum(diag(err.te6))/sum(err.te6))
Te.err6
## [1] 0.1592593
5. Model Comparison
# Collect the misclassification error rates of all six models
# (tibble() replaces the deprecated data_frame(); the values are error rates, not MSEs)
df <- tibble(
  id = 1:6,
  Model = c('SVM', 'SVM Tuned', 'SVM Radial', 'SVM Radial Tuned', 'SVM Polynomial', 'SVM Polynomial tuned'),
  Train_Error = c(Tr.err1, Tr.err2, Tr.err3, Tr.err4, Tr.err5, Tr.err6),
  Test_Error  = c(Te.err1, Te.err2, Te.err3, Te.err4, Te.err5, Te.err6)
)

# Highlight in green the model with the lowest combined train and test error
formattable(df, list(
  Model = formatter("span",
                    style = ~ style(color = ifelse((Test_Error + Train_Error) == min(Test_Error + Train_Error),
                                                   "green", "red"))),
  area(col = c(Train_Error)) ~ normalize_bar("pink", 0.2),
  area(col = c(Test_Error)) ~ normalize_bar("pink", 0.2)
))
id  Model                 Train_Error  Test_Error
1   SVM                   0.17125      0.1518519
2   SVM Tuned             0.17375      0.1518519
3   SVM Radial            0.15375      0.1592593
4   SVM Radial Tuned      0.15375      0.1555556
5   SVM Polynomial        0.19000      0.1740741
6   SVM Polynomial tuned  0.17625      0.1592593

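Overall, tuning the cost lowers the test error rate for the radial and polynomial kernels and leaves it unchanged for the linear kernel.

On average across train and test data, the tuned radial model produces the lowest misclassification error, although the two linear models have the lowest test error (0.152).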
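
The comparison can be checked with a few lines of base R. The sketch below copies the error rates from the table above by hand; the column and variable names are illustrative.

```r
# Error rates copied from the comparison table above
results <- data.frame(
  model = c("SVM", "SVM Tuned", "SVM Radial", "SVM Radial Tuned",
            "SVM Polynomial", "SVM Polynomial tuned"),
  train = c(0.17125, 0.17375, 0.15375, 0.15375, 0.19000, 0.17625),
  test  = c(0.1518519, 0.1518519, 0.1592593, 0.1555556, 0.1740741, 0.1592593),
  stringsAsFactors = FALSE
)

# Average of train and test error, equivalent to the sum used for highlighting
results$mean.err <- (results$train + results$test) / 2

# Model with the lowest average error, and model(s) with the lowest test error
best.mean <- results$model[which.min(results$mean.err)]
best.test <- results$model[results$test == min(results$test)]

best.mean
best.test
```

The two criteria pick different models, so the choice depends on whether training error should count at all; test error alone is the usual guide to performance on new data.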





