In this project, I fit support vector machines (SVMs) to the Orange Juice (OJ) data set and evaluate their predictions on held-out data.
Support vector machines are supervised learning models, with associated learning algorithms, that are used for classification and regression analysis.
The goal of an SVM is to take groups of observations and construct boundaries that predict which group future observations belong to, based on their measurements. The groups to be separated are called “classes”. SVMs can handle any number of classes, as well as observations of any dimension. The boundaries can take many shapes (linear, radial, and polynomial, among others), which makes SVMs flexible enough for most classification tasks.
Maximal Margin Classifier: If the classes are separable by a linear boundary, we can use a Maximal Margin Classifier to find the classification boundary.
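As a rough illustration of what "margin" means, the base R sketch below takes a hand-picked separating line and computes each point's distance to it. The weights `w`, the intercept `b`, and the four data points are assumptions chosen for this toy example, not fitted values. The smallest distance is the margin, and the points that attain it play the role of support vectors.

```r
# Toy, linearly separable data: two points per class (illustrative values only)
X <- rbind(c(0, 1), c(1, -1), c(3, 2), c(4, 0))
y <- c(-1, -1, 1, 1)

# Assumed separating line x1 = 2, written as w . x + b = 0
w <- c(1, 0)
b <- -2

# Signed distance of each point to the line; positive means correct side
dist_to_boundary <- y * (as.vector(X %*% w) + b) / sqrt(sum(w^2))

margin      <- min(dist_to_boundary)               # 1
support_idx <- which(dist_to_boundary == margin)   # points 2 and 3

margin
support_idx
```

A maximal margin classifier searches over all `w` and `b` for the line that makes this smallest distance as large as possible.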
Support Vector Classifiers: Most real data sets are not fully separable by a linear boundary, so we need a modified method that tolerates some violations. Whether or not the data are separable, the svm() command syntax is the same. For data that are not linearly separable, however, the cost argument takes on real importance: it sets the penalty for an observation falling on the wrong side of the margin or the classification boundary. We can plot the fit in the same way as in the completely separable case.
But how do we decide how costly these misclassifications actually are? Instead of specifying a cost up front, we can use the tune() function from e1071 to test various costs and identify which value produces the best-fitting model.
Support Vector Machines: Support Vector Classifiers are a subset of the group of classification structures known as Support Vector Machines. Support Vector Machines can construct classification boundaries that are nonlinear in shape. The kernel options for the svm() command from the e1071 package are linear, polynomial, radial, and sigmoid. The same call constructs a classification boundary, linear or nonlinear, whether or not the data are separable.
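To make the role of the cost concrete, here is a minimal base R sketch of the usual soft-margin objective: half the squared norm of the weights plus the cost times the total margin violation (hinge loss). The function name `svc_objective`, the line `w`, `b`, and the toy points are assumptions for illustration. e1071 solves an equivalent problem internally, so this is a conceptual sketch rather than its implementation.

```r
# Soft-margin objective: 0.5 * ||w||^2 + cost * sum of hinge losses
svc_objective <- function(w, b, X, y, cost) {
  f     <- as.vector(X %*% w) + b
  slack <- pmax(0, 1 - y * f)  # 0 for points on the correct side, outside the margin
  0.5 * sum(w^2) + cost * sum(slack)
}

# Four well-separated points plus one class +1 point on the wrong side of x1 = 2
X <- rbind(c(0, 1), c(1, -1), c(3, 2), c(4, 0), c(1.5, 0))
y <- c(-1, -1, 1, 1, 1)
w <- c(1, 0)
b <- -2

low_cost  <- svc_objective(w, b, X, y, cost = 0.01)  # violation barely matters
high_cost <- svc_objective(w, b, X, y, cost = 10)    # violation dominates

c(low_cost = low_cost, high_cost = high_cost)
```

With a small cost the single violating point adds almost nothing to the objective, so a wide margin is preferred. With a large cost the same violation dominates, which pushes the fit toward boundaries that misclassify fewer training points.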
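Conceptually, tuning by cross-validation fits the model on most of the data for each candidate value, measures the error on the held-out fold, and keeps the value with the lowest average error. The sketch below shows that loop in base R only. To avoid extra packages it uses a logistic regression with a polynomial degree as the stand-in hyperparameter, so the data, the classifier, and the `degrees` grid are all assumptions; with an SVM the inner fit would be `svm()` and the grid would hold cost values.

```r
set.seed(42)
n   <- 200
toy <- data.frame(x = runif(n, -2, 2))
toy$y <- factor(ifelse(toy$x^2 + rnorm(n, sd = 0.5) > 1, "B", "A"))

k       <- 10
folds   <- sample(rep(seq_len(k), length.out = n))  # fold label for every row
degrees <- 1:4                                      # candidate hyperparameter values

cv_error <- sapply(degrees, function(d) {
  fold_err <- sapply(seq_len(k), function(f) {
    # Fit on all folds except f, then score on fold f
    fit  <- glm(y ~ poly(x, d), family = binomial, data = toy[folds != f, ])
    prob <- predict(fit, newdata = toy[folds == f, ], type = "response")
    pred <- ifelse(prob > 0.5, "B", "A")
    mean(pred != toy$y[folds == f])
  })
  mean(fold_err)  # average misclassification rate over the k folds
})

best_degree <- degrees[which.min(cv_error)]
data.frame(degree = degrees, cv_error = cv_error)
```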
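The four kernel choices can be written out as plain functions of two observation vectors `u` and `v`. The sketch below follows the formulas given in the e1071 `svm()` documentation; the function names and the example values of `gamma`, `coef0`, and `degree` are illustrative assumptions, not the defaults used later in this analysis.

```r
# Kernel functions in the parameterization documented for e1071::svm()
k_linear  <- function(u, v) sum(u * v)
k_poly    <- function(u, v, gamma, coef0, degree) (gamma * sum(u * v) + coef0)^degree
k_radial  <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))
k_sigmoid <- function(u, v, gamma, coef0) tanh(gamma * sum(u * v) + coef0)

u <- c(1, 2)
v <- c(3, 4)

k_linear(u, v)                                    # 11
k_poly(u, v, gamma = 1, coef0 = 0, degree = 2)    # 121
k_radial(u, v, gamma = 0.5)                       # exp(-4)
k_sigmoid(u, v, gamma = 0.1, coef0 = 0)           # tanh(1.1)
```

Each kernel measures similarity between two observations; swapping the kernel changes the shape of the boundary the SVM can draw without changing the fitting procedure.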
Kernel:
SVMs for Multiple Classes: SVM techniques also extend to more than two classes of observations.
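For more than two classes, e1071 uses the one-against-one approach: one binary classifier per pair of classes, with the final class chosen by a vote. The base R sketch below only illustrates the bookkeeping, using the three species in the built-in `iris` data. The `pairwise_winners` vector is a made-up set of votes for a single observation, not the output of a fitted model.

```r
classes <- levels(iris$Species)

# One binary classifier per unordered pair of classes
pairs         <- combn(classes, 2)
n_classifiers <- ncol(pairs)   # choose(3, 2) = 3

# Hypothetical winners of the three pairwise contests for one observation:
# setosa vs versicolor, setosa vs virginica, versicolor vs virginica
pairwise_winners <- c("versicolor", "setosa", "versicolor")

# Majority vote decides the predicted class
votes     <- table(factor(pairwise_winners, levels = classes))
predicted <- names(which.max(votes))

n_classifiers
predicted
```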
Data:
The Orange Juice (OJ) data frame has 1070 observations on the following 18 variables.
Purchase: A factor with levels CH and MM indicating whether the customer purchased Citrus Hill or Minute Maid Orange Juice
WeekofPurchase: Week of purchase
StoreID: Store ID
PriceCH: Price charged for CH
PriceMM: Price charged for MM
DiscCH: Discount offered for CH
DiscMM: Discount offered for MM
SpecialCH: Indicator of special on CH
SpecialMM: Indicator of special on MM
LoyalCH: Customer brand loyalty for CH
SalePriceMM: Sale price for MM
SalePriceCH: Sale price for CH
PriceDiff: Sale price of MM less sale price of CH
Store7: A factor with levels No and Yes indicating whether the sale is at Store 7
PctDiscMM: Percentage discount for MM
PctDiscCH: Percentage discount for CH
ListPriceDiff: List price of MM less list price of CH
STORE: Which of 5 possible stores the sale occurred at
Objective:
Fit SVM with cost value of 0.01
Use the tune() function to find the optimal cost
Fit SVM with radial kernel
Fit SVM with polynomial kernel
Compare the models
Loading Libraries
#Loading necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#library(tidyverse)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(formattable)
library(ISLR)
# Load the remaining packages (e1071 provides svm() and tune())
packages <- c('caret', 'formattable', 'ISLR', 'e1071')
sapply(packages, require, character.only = TRUE)
## Loading required package: e1071
## caret formattable ISLR e1071
## TRUE TRUE TRUE TRUE
data(OJ)
Data Exploration
head(OJ)
## Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 1 CH 237 1 1.75 1.99 0.00 0.0 0
## 2 CH 239 1 1.75 1.99 0.00 0.3 0
## 3 CH 245 1 1.86 2.09 0.17 0.0 0
## 4 MM 227 1 1.69 1.69 0.00 0.0 0
## 5 CH 228 7 1.69 1.69 0.00 0.0 0
## 6 CH 230 7 1.69 1.99 0.00 0.0 0
## SpecialMM LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 1 0 0.500000 1.99 1.75 0.24 No 0.000000
## 2 1 0.600000 1.69 1.75 -0.06 No 0.150754
## 3 0 0.680000 2.09 1.69 0.40 No 0.000000
## 4 0 0.400000 1.69 1.69 0.00 No 0.000000
## 5 0 0.956535 1.69 1.69 0.00 Yes 0.000000
## 6 1 0.965228 1.99 1.69 0.30 Yes 0.000000
## PctDiscCH ListPriceDiff STORE
## 1 0.000000 0.24 1
## 2 0.000000 0.24 1
## 3 0.091398 0.23 1
## 4 0.000000 0.00 1
## 5 0.000000 0.00 0
## 6 0.000000 0.30 0
str(OJ)
## 'data.frame': 1070 obs. of 18 variables:
## $ Purchase : Factor w/ 2 levels "CH","MM": 1 1 1 2 1 1 1 1 1 1 ...
## $ WeekofPurchase: num 237 239 245 227 228 230 232 234 235 238 ...
## $ StoreID : num 1 1 1 1 7 7 7 7 7 7 ...
## $ PriceCH : num 1.75 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
## $ PriceMM : num 1.99 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 ...
## $ DiscCH : num 0 0 0.17 0 0 0 0 0 0 0 ...
## $ DiscMM : num 0 0.3 0 0 0 0 0.4 0.4 0.4 0.4 ...
## $ SpecialCH : num 0 0 0 0 0 0 1 1 0 0 ...
## $ SpecialMM : num 0 1 0 0 0 1 1 0 0 0 ...
## $ LoyalCH : num 0.5 0.6 0.68 0.4 0.957 ...
## $ SalePriceMM : num 1.99 1.69 2.09 1.69 1.69 1.99 1.59 1.59 1.59 1.59 ...
## $ SalePriceCH : num 1.75 1.75 1.69 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
## $ PriceDiff : num 0.24 -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 ...
## $ Store7 : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 2 2 ...
## $ PctDiscMM : num 0 0.151 0 0 0 ...
## $ PctDiscCH : num 0 0 0.0914 0 0 ...
## $ ListPriceDiff : num 0.24 0.24 0.23 0 0 0.3 0.3 0.24 0.24 0.24 ...
## $ STORE : num 1 1 1 1 0 0 0 0 0 0 ...
Data Partition
set.seed(111)
samp <- sample(nrow(OJ), 800)
train <- OJ[samp,]
test <- OJ[-samp,]
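After a random split like this it is worth confirming the sizes, that no row landed in both sets, and that the class mix is similar in each. The sketch below runs those checks on a simulated stand-in with the same number of rows as OJ, so it needs no extra packages. The `toy` data frame and its values are assumptions, and the same checks apply unchanged to `train` and `test`.

```r
set.seed(111)

# Simulated stand-in for OJ: 1070 rows and a two-level response
toy <- data.frame(
  Purchase = factor(sample(c("CH", "MM"), 1070, replace = TRUE, prob = c(0.6, 0.4))),
  LoyalCH  = runif(1070)
)

samp      <- sample(nrow(toy), 800)
toy_train <- toy[samp, ]
toy_test  <- toy[-samp, ]

# Sizes: 800 training rows, 270 test rows
c(train = nrow(toy_train), test = nrow(toy_test))

# No observation should appear in both sets
overlap <- length(intersect(rownames(toy_train), rownames(toy_test)))
overlap

# Class proportions should be roughly comparable across the two sets
round(prop.table(table(toy_train$Purchase)), 2)
round(prop.table(table(toy_test$Purchase)), 2)
```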
svm.linear1 <- svm(Purchase ~ .,kernel = "linear", cost = 0.01, data = train)
summary(svm.linear1)
##
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "linear", cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 446
##
## ( 225 221 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
# Train Prediction
train.pred <- predict(svm.linear1, train)
# Train Confusion Matrix
err.tr1 <- table(train$Purchase, train.pred)
err.tr1
## train.pred
## CH MM
## CH 431 58
## MM 79 232
# Train error Rate
Tr.err1 <- (1 - (sum(diag(err.tr1))/sum(err.tr1)))
Tr.err1
## [1] 0.17125
# Test Prediction
test.pred <- predict(svm.linear1, test)
# Test Confusion Matrix
err.te1 <- table(test$Purchase, test.pred)
err.te1
## test.pred
## CH MM
## CH 147 17
## MM 24 82
# Test error Rate
Te.err1 <- (1- (sum(diag(err.te1))/sum(err.te1)))
Te.err1
## [1] 0.1518519
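The same error-rate calculation is repeated for every model, so it can be wrapped in a small helper. The sketch below rebuilds the test confusion matrix reported above by hand and also breaks the error down by actual class. The helper name `misclass_rate` is an assumption for illustration.

```r
# Overall misclassification rate from a confusion matrix (rows = actual)
misclass_rate <- function(conf) 1 - sum(diag(conf)) / sum(conf)

# Test confusion matrix for the linear SVM with cost = 0.01
conf_test <- matrix(c(147, 24, 17, 82), nrow = 2,
                    dimnames = list(actual = c("CH", "MM"),
                                    predicted = c("CH", "MM")))

overall_error   <- misclass_rate(conf_test)                  # 41 / 270
per_class_error <- 1 - diag(conf_test) / rowSums(conf_test)  # error within each actual class

overall_error
round(per_class_error, 3)
```

The per-class view shows that MM purchases are misclassified more often than CH purchases, which the single overall rate hides.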
Tune Model
tune.out <- tune(svm, Purchase ~ ., data = train, kernel = "linear",
ranges = list(cost = seq(0.01, 10, length.out = 20)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 2.113158
##
## - best performance: 0.1725
##
## - Detailed performance results:
## cost error dispersion
## 1 0.0100000 0.18000 0.04758034
## 2 0.5357895 0.17750 0.04669642
## 3 1.0615789 0.17500 0.04787136
## 4 1.5873684 0.17375 0.05015601
## 5 2.1131579 0.17250 0.05130248
## 6 2.6389474 0.17375 0.05015601
## 7 3.1647368 0.17500 0.04965156
## 8 3.6905263 0.17500 0.04965156
## 9 4.2163158 0.17500 0.05270463
## 10 4.7421053 0.17500 0.05270463
## 11 5.2678947 0.17500 0.05368374
## 12 5.7936842 0.17625 0.05382908
## 13 6.3194737 0.17625 0.05382908
## 14 6.8452632 0.17500 0.05103104
## 15 7.3710526 0.17625 0.05152197
## 16 7.8968421 0.17625 0.05152197
## 17 8.4226316 0.17625 0.05152197
## 18 8.9484211 0.17625 0.05152197
## 19 9.4742105 0.17750 0.05361903
## 20 10.0000000 0.17750 0.05361903
Fit Model
svm.linear2 <- svm(Purchase ~ ., kernel = "linear", data = train,
cost = tune.out$best.parameter$cost)
# Train Prediction
train.pred <- predict(svm.linear2, train)
# Train Confusion Matrix
err.tr2<-table(train$Purchase, train.pred)
err.tr2
## train.pred
## CH MM
## CH 429 60
## MM 79 232
# Train error Rate
Tr.err2 <- 1- (sum(diag(err.tr2))/sum(err.tr2))
Tr.err2
## [1] 0.17375
# Test Prediction
test.pred <- predict(svm.linear2, test)
# Test Confusion Matrix
err.te2<-table(test$Purchase, test.pred)
err.te2
## test.pred
## CH MM
## CH 146 18
## MM 23 83
# Test error Rate
Te.err2 <- 1- (sum(diag(err.te2))/sum(err.te2))
Te.err2
## [1] 0.1518519
svm.radial1 <- svm(Purchase ~ ., kernel = "radial", data = train)
summary(svm.radial1)
##
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 385
##
## ( 197 188 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
# Train Prediction
train.pred <- predict(svm.radial1, train)
# Train Confusion Matrix
err.tr3<-table(train$Purchase, train.pred)
err.tr3
## train.pred
## CH MM
## CH 447 42
## MM 81 230
# Train error Rate
Tr.err3 <- 1- (sum(diag(err.tr3))/sum(err.tr3))
Tr.err3
## [1] 0.15375
# Test Prediction
test.pred <- predict(svm.radial1, test)
# Test Confusion Matrix
err.te3<-table(test$Purchase, test.pred)
err.te3
## test.pred
## CH MM
## CH 146 18
## MM 25 81
# Test error Rate
Te.err3 <- 1- (sum(diag(err.te3))/sum(err.te3))
Te.err3
## [1] 0.1592593
Tune Model
tune.out <- tune(svm, Purchase ~ ., data = train, kernel = "radial",
ranges = list(cost = seq(0.01, 10, length.out = 20)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1.061579
##
## - best performance: 0.1775
##
## - Detailed performance results:
## cost error dispersion
## 1 0.0100000 0.38875 0.05015601
## 2 0.5357895 0.18250 0.03917553
## 3 1.0615789 0.17750 0.04241004
## 4 1.5873684 0.18375 0.04168749
## 5 2.1131579 0.18875 0.03793727
## 6 2.6389474 0.18375 0.03729108
## 7 3.1647368 0.18750 0.03952847
## 8 3.6905263 0.18875 0.04016027
## 9 4.2163158 0.19000 0.04116363
## 10 4.7421053 0.18875 0.04143687
## 11 5.2678947 0.18750 0.04039733
## 12 5.7936842 0.18500 0.04073969
## 13 6.3194737 0.18750 0.04039733
## 14 6.8452632 0.19125 0.03866254
## 15 7.3710526 0.19000 0.03944053
## 16 7.8968421 0.19000 0.03622844
## 17 8.4226316 0.18875 0.03606033
## 18 8.9484211 0.18750 0.03486083
## 19 9.4742105 0.18750 0.03385016
## 20 10.0000000 0.18750 0.03535534
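The radial kernel has a second parameter, gamma, which the tuning above leaves at its default. As a hedged sketch of how both could be searched together, the example below passes a cost grid and a gamma grid to `tune()` on a small simulated data set. The data, the grid values, and the object names are assumptions for illustration, and results on OJ may differ.

```r
library(e1071)

set.seed(1)
n   <- 200
x   <- matrix(rnorm(n * 2), ncol = 2)
# Simulated classes separated by a circle, so a nonlinear boundary is needed
toy <- data.frame(x1 = x[, 1], x2 = x[, 2],
                  y  = factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, "out", "in")))

# tune() evaluates every combination of the listed values by 10-fold CV
tuned <- tune(svm, y ~ ., data = toy, kernel = "radial",
              ranges = list(cost  = 10^(-1:2),
                            gamma = c(0.1, 0.5, 1, 2)))

tuned$best.parameters     # best (cost, gamma) pair
nrow(tuned$performances)  # one row per combination: 4 x 4 = 16
```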
Fit Model
svm.radial2 <- svm(Purchase ~ ., kernel = "radial", data = train,
cost = tune.out$best.parameter$cost)
summary(svm.radial2)
##
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "radial", cost = tune.out$best.parameter$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1.061579
##
## Number of Support Vectors: 381
##
## ( 195 186 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
# Train Prediction
train.pred <- predict(svm.radial2, train)
# Train Confusion Matrix
err.tr4<-table(train$Purchase, train.pred)
err.tr4
## train.pred
## CH MM
## CH 447 42
## MM 81 230
# Train error Rate
Tr.err4 <- 1- (sum(diag(err.tr4))/sum(err.tr4))
Tr.err4
## [1] 0.15375
# Test Prediction
test.pred <- predict(svm.radial2, test)
# Test Confusion Matrix
err.te4<-table(test$Purchase, test.pred)
err.te4
## test.pred
## CH MM
## CH 147 17
## MM 25 81
# Test error Rate
Te.err4 <- 1- (sum(diag(err.te4))/sum(err.te4))
Te.err4
## [1] 0.1555556
4. Fit SVM with polynomial kernel
svm.poly1 <- svm(Purchase ~ ., kernel = "polynomial", data = train, degree = 2)
summary(svm.poly1)
##
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "polynomial",
## degree = 2)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 2
## coef.0: 0
##
## Number of Support Vectors: 459
##
## ( 232 227 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
# Train Prediction
train.pred <- predict(svm.poly1, train)
# Train Confusion Matrix
err.tr5<-table(train$Purchase, train.pred)
err.tr5
## train.pred
## CH MM
## CH 455 34
## MM 118 193
# Train error Rate
Tr.err5 <- 1- (sum(diag(err.tr5))/sum(err.tr5))
Tr.err5
## [1] 0.19
# Test Prediction
test.pred <- predict(svm.poly1, test)
# Test Confusion Matrix
err.te5<-table(test$Purchase, test.pred)
err.te5
## test.pred
## CH MM
## CH 152 12
## MM 35 71
# Test error Rate
Te.err5 <- 1- (sum(diag(err.te5))/sum(err.te5))
Te.err5
## [1] 0.1740741
5. Fit SVM with polynomial kernel and tuned cost
tune.out <- tune(svm, Purchase ~ ., data = train, kernel = "polynomial", degree = 2,
ranges =list(cost = 10^seq(-2,1, by = 0.25)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 3.162278
##
## - best performance: 0.19375
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01000000 0.38875 0.05604128
## 2 0.01778279 0.37125 0.04966904
## 3 0.03162278 0.36375 0.04427267
## 4 0.05623413 0.34625 0.04931827
## 5 0.10000000 0.32750 0.03717451
## 6 0.17782794 0.26000 0.02813657
## 7 0.31622777 0.22250 0.04669642
## 8 0.56234133 0.21500 0.04281744
## 9 1.00000000 0.20875 0.05036326
## 10 1.77827941 0.20125 0.06136469
## 11 3.16227766 0.19375 0.06073908
## 12 5.62341325 0.19500 0.05839283
## 13 10.00000000 0.19500 0.06129392
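The polynomial tuning uses a logarithmic cost grid, while the linear and radial tuning use an evenly spaced one. The short base R comparison below shows how differently the two grids cover small cost values. The variable names are assumptions; the grids themselves are the ones used in the calls above.

```r
linear_grid <- seq(0.01, 10, length.out = 20)  # evenly spaced costs
log_grid    <- 10^seq(-2, 1, by = 0.25)        # evenly spaced on the log10 scale

length(linear_grid)   # 20 candidate costs
length(log_grid)      # 13 candidate costs

# How many candidates fall below a cost of 1?
sum(linear_grid < 1)  # 2
sum(log_grid < 1)     # 8
```

Because the cross-validated error changes fastest at small costs (compare 0.01 with 0.54 in the radial results), a log-spaced grid spends more of its candidates where the error is most sensitive.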
svm.poly2 <- svm(Purchase ~ ., kernel = "polynomial", degree = 2, data = train,
cost = tune.out$best.parameter$cost)
summary(svm.poly2)
##
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "polynomial",
## degree = 2, cost = tune.out$best.parameter$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 3.162278
## degree: 2
## coef.0: 0
##
## Number of Support Vectors: 391
##
## ( 201 190 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
# Train Prediction
train.pred <- predict(svm.poly2, train)
# Train Confusion Matrix
err.tr6<-table(train$Purchase, train.pred)
err.tr6
## train.pred
## CH MM
## CH 452 37
## MM 104 207
# Train error Rate
Tr.err6 <- 1- (sum(diag(err.tr6))/sum(err.tr6))
Tr.err6
## [1] 0.17625
# Test Prediction
test.pred <- predict(svm.poly2, test)
# Test Confusion Matrix
err.te6<-table(test$Purchase, test.pred)
err.te6
## test.pred
## CH MM
## CH 149 15
## MM 28 78
# Test error Rate
Te.err6 <- 1- (sum(diag(err.te6))/sum(err.te6))
Te.err6
## [1] 0.1592593
6. Model Comparison
# Collect the misclassification error rates of all six models
df <- data.frame(
id = 1:6,
Model = c('SVM', 'SVM Tuned', 'SVM Radial', 'SVM Radial Tuned', 'SVM Polynomial', 'SVM Polynomial tuned'),
Train_Error = c(Tr.err1, Tr.err2, Tr.err3, Tr.err4, Tr.err5, Tr.err6),
Test_Error = c(Te.err1, Te.err2, Te.err3, Te.err4, Te.err5, Te.err6)
)
# Highlight the model with the lowest combined train and test error in green
formattable(df, list(
Model = formatter("span",
style = ~ style(color = ifelse((Test_Error+Train_Error) == min(Test_Error+Train_Error),
"green", "red"))),
area(col = c(Train_Error)) ~ normalize_bar("pink", 0.2),
area(col = c(Test_Error)) ~ normalize_bar("pink", 0.2)
))
| id | Model | Train_Error | Test_Error |
|---|---|---|---|
| 1 | SVM | 0.17125 | 0.1518519 |
| 2 | SVM Tuned | 0.17375 | 0.1518519 |
| 3 | SVM Radial | 0.15375 | 0.1592593 |
| 4 | SVM Radial Tuned | 0.15375 | 0.1555556 |
| 5 | SVM Polynomial | 0.19000 | 0.1740741 |
| 6 | SVM Polynomial tuned | 0.17625 | 0.1592593 |
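Overall, tuning the cost reduced the test error for the radial and polynomial kernels and left it unchanged for the linear kernel.
The two linear models have the lowest test error (0.152), while the tuned radial model has the lowest combined misclassification error across the train and test data.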
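The ranking can be reproduced directly from the reported error rates. The sketch below re-enters the six pairs of numbers by hand and adds a rough binomial standard error for each test error, assuming the 270 test observations are independent. The column names and the standard-error check are additions for illustration, not part of the original analysis.

```r
results <- data.frame(
  model       = c("SVM", "SVM Tuned", "SVM Radial", "SVM Radial Tuned",
                  "SVM Polynomial", "SVM Polynomial tuned"),
  train_error = c(0.17125, 0.17375, 0.15375, 0.15375, 0.19000, 0.17625),
  test_error  = c(0.1518519, 0.1518519, 0.1592593, 0.1555556, 0.1740741, 0.1592593)
)

# Lowest combined train + test error
best_combined <- results$model[which.min(results$train_error + results$test_error)]

# Lowest test error (the two linear models tie; which.min returns the first)
best_test <- results$model[which.min(results$test_error)]

# Approximate standard error of each test error rate with n = 270
n_test <- 270
results$test_se <- sqrt(results$test_error * (1 - results$test_error) / n_test)

best_combined
best_test
round(results$test_se, 3)
```

Each test error carries a standard error of roughly 0.02, which is about as large as the gap between the best and worst models, so the ranking should be read as suggestive rather than conclusive.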
References:
1. http://uc-r.github.io/svm
2. An Introduction to Statistical Learning. Springer; 2013.
3. The Elements of Statistical Learning. Springer; 2001.
4. https://www.datacamp.com/community/tutorials/support-vector-machines-r