A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane: given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane that categorizes new examples.
The key question, then, is how the optimal hyperplane is chosen. Intuitively, the SVM picks the hyperplane that maximizes the margin, the distance from the hyperplane to the nearest training points of either class.
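For linearly separable data with class labels coded as -1 and +1, this maximum-margin criterion is the textbook hard-margin optimization problem (stated here for background; it is not specific to this post):

\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1, \quad i = 1, \dots, n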
setwd("D:/Class Materials & Work/Summer 2020 practice/SVM")
getwd()
## [1] "D:/Class Materials & Work/Summer 2020 practice/SVM"
library(tidymodels)
library(tidyverse)
# Importing the dataset
dataset <- read.csv('social.csv') %>%
  select(Age, EstimatedSalary, Purchased)
# Encoding the target feature as a factor
dataset$Purchased <- factor(dataset$Purchased, levels = c(0, 1))
str(dataset)
## 'data.frame': 400 obs. of 3 variables:
## $ Age : int 19 35 26 27 19 27 27 32 25 35 ...
## $ EstimatedSalary: int 19000 20000 43000 57000 76000 58000 84000 150000 33000 65000 ...
## $ Purchased : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
Next, we will split the dataset into training and testing sets with the tidymodels package.
# Check the outcome variable
dataset %>%
  count(Purchased) %>%
  mutate(prop = n / sum(n))
## # A tibble: 2 x 3
## Purchased n prop
## <fct> <int> <dbl>
## 1 0 257 0.642
## 2 1 143 0.358
# Split the dataset
set.seed(123)
splits <- initial_split(dataset, prop = 0.75)
training_set <- training(splits)
test_set <- testing(splits)
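Since the outcome is imbalanced (roughly 64% non-purchasers), it can be worth stratifying the split so that both partitions keep the same class proportions. A minimal sketch using rsample's strata argument; splits_strat is a hypothetical name and is not used below:
# Hypothetical stratified variant: preserves the 0/1 proportions in both partitions
splits_strat <- initial_split(dataset, prop = 0.75, strata = Purchased)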
Next, we will scale the features of both the training and the testing sets. The index [-3] excludes the third column (the outcome, Purchased) so that only Age and EstimatedSalary are standardized.
# Feature Scaling
training_set[-3] <- scale(training_set[-3])
test_set[-3] <- scale(test_set[-3])
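One caveat: scale() above standardizes the test set with its own means and standard deviations. To avoid information leakage, the test set is usually transformed with statistics computed on the training set instead; a minimal sketch of that variant (train_means, train_sds, and test_scaled are names introduced here for illustration):
# Compute centring/scaling statistics on the unscaled training features only
train_means <- colMeans(training(splits)[-3])
train_sds <- sapply(training(splits)[-3], sd)
# Apply the training-set statistics to the unscaled test features
test_scaled <- scale(testing(splits)[-3], center = train_means, scale = train_sds)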
Next, we will fit the SVM model to the training set. At the time of writing, the parsnip package did not yet offer a linear support vector machine specification, so we fit the model directly with the e1071 package.
library(e1071)
# Fitting the SVM classifier to the training set
classifier <- svm(formula = Purchased ~ .,
                  data = training_set,
                  type = 'C-classification',
                  kernel = 'linear')
classifier
##
## Call:
## svm(formula = Purchased ~ ., data = training_set, type = "C-classification",
## kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 121
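The printout shows that the cost penalty was left at its default of 1. This hyperparameter can be tuned with e1071's built-in cross-validated grid search; a sketch with a hypothetical grid of cost values:
# Hypothetical grid search over cost via 10-fold cross-validation
tuned <- tune.svm(Purchased ~ ., data = training_set,
                  kernel = 'linear', cost = 2^(-2:4))
summary(tuned)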
For further reference, the svm() function supports four kernels: linear, polynomial, radial basis function (RBF), and sigmoid.
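Swapping kernels only requires changing the kernel argument; for example, a hypothetical RBF variant of the same model (classifier_rbf is not used in the rest of this post):
# Hypothetical RBF fit; gamma defaults to 1/(number of features) in e1071
classifier_rbf <- svm(formula = Purchased ~ .,
                      data = training_set,
                      type = 'C-classification',
                      kernel = 'radial')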
With our model fitted, we can predict our test set result.
# Predicting the Test set results
y_pred <- predict(classifier, newdata = test_set[-3])
y_pred
## 4 7 13 14 16 23 25 26 32 34 39 41 43 51 63 64 67 69 72 74
## 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 76 78 81 85 86 89 90 91 94 98 109 110 116 118 121 127 135 136 137 141
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 143 151 153 158 159 165 166 174 178 179 195 197 199 209 210 211 212 214 217 223
## 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1
## 224 229 235 236 240 243 244 254 256 262 263 273 277 278 286 290 291 294 299 306
## 1 0 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 0
## 308 309 316 328 330 332 336 347 348 355 357 359 364 374 383 388 389 390 397 399
## 1 1 0 0 1 1 0 1 1 0 1 0 0 1 1 0 0 0 1 0
## Levels: 0 1
# Making the confusion matrix (rows: actual class, columns: predicted class)
cm <- table(test_set[, 3], y_pred)
cm
## y_pred
## 0 1
## 0 56 3
## 1 14 27
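From the confusion matrix, the model classifies 56 + 27 = 83 of the 100 test observations correctly (83% accuracy), with 3 false positives and 14 false negatives. The same figure can be recomputed with yardstick, which loads with tidymodels; the tibble and its column names below are introduced here for illustration:
# Recomputing accuracy with yardstick
results <- tibble(truth = test_set$Purchased, prediction = y_pred)
accuracy(results, truth = truth, estimate = prediction)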
Lastly, we will plot the decision regions for both the training and the testing datasets.
# Plotting the training set results
set <- training_set
# Build a fine grid over the (scaled) feature space
X1 <- seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 <- seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set <- expand.grid(X1, X2)
colnames(grid_set) <- c('Age', 'EstimatedSalary')
# Classify every grid point to reveal the decision regions
y_grid <- predict(classifier, newdata = grid_set)
plot(set[, -3],
     main = 'SVM (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
# Draw the decision boundary and shade the two regions
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'coral1', 'aquamarine'))
# Overlay the observations, coloured by their true class
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Plotting the test set results
set <- test_set
X1 <- seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 <- seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set <- expand.grid(X1, X2)
colnames(grid_set) <- c('Age', 'EstimatedSalary')
y_grid <- predict(classifier, newdata = grid_set)
plot(set[, -3], main = 'SVM (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'coral1', 'aquamarine'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
The straight line shown in both plots is the decision boundary: the optimal separating hyperplane found by the model, drawn in the two-dimensional space of the scaled features.
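Because the kernel is linear, the hyperplane's coefficients can also be recovered from the fitted object; a minimal sketch using the support vectors and coefficients that e1071 stores on the model:
# Weight vector: a linear combination of the support vectors (linear kernel only)
# (svm() rescales its inputs by default; here the features were standardized already)
w <- t(classifier$coefs) %*% classifier$SV
b <- -classifier$rho
# The plotted boundary satisfies w[1]*Age + w[2]*EstimatedSalary + b = 0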