Previously on STAT412: - Neural Networks
Support Vector Machine (SVM) is a powerful and versatile machine learning algorithm primarily used for classification and regression tasks. Originally developed for binary classification, it effectively separates data points into distinct classes by finding the optimal hyperplane that has the maximum margin between two classes.
Key Features of SVM:
Hyperplane: A hyperplane is a decision boundary that separates different classes in the feature space. The optimal hyperplane is the one that maximizes the margin between the closest points of the different classes, which are called support vectors.
Support Vectors: These are the data points that lie closest to the decision surface (or hyperplane). They are pivotal in defining the hyperplane because the orientation and position of the hyperplane depend entirely on these points.
Margin: This is the distance between the nearest data point of each class and the hyperplane. A larger margin is generally associated with a lower generalization error of the classifier.
Kernel Trick: The kernel trick allows SVM to solve non-linear problems by using a linear classifier. It transforms the original non-linear observations into a higher-dimensional space in which they become linearly separable.
Advantages:
Effective in high-dimensional spaces.
Still effective when the number of dimensions exceeds the number of samples.
Memory efficient, as it uses a subset of training points.
Versatile: different kernel functions can be specified for the decision function.
SVM classification is robust to outliers.
Disadvantages:
If the number of features is much greater than the number of samples, overfitting might occur.
SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
It is sensitive to the tuning of parameters and the choice of the kernel.
NOTE: To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find a plane that has the maximum margin, i.e., the maximum distance between data points of both classes.
Maximal Margin Classifier
The main motivation is to draw a decision boundary that maximizes the distance to the support vector points. If the decision boundary is too close to a support vector, it will be highly sensitive to noise and will not generalize well.
Support Vector Classifier
It is also known as Soft Margin Classifier.
In practice, real data is messy and usually cannot be separated perfectly with a hyperplane. The Maximal Margin Classifier tries to separate all positive and negative examples (i.e., the two classes) and does not allow any points to be misclassified. This results in an overfit model or, in some cases, no decision boundary can be found with a standard SVM. An overfit SVM achieves high accuracy on the training set but will not perform well on new, previously unseen examples. To overcome this issue, the "soft margin" SVM was introduced, which allows some examples to be misclassified or to fall on the wrong side of the decision boundary. The soft margin SVM often results in a better generalized model.
Non-linear SVM:
Nonlinear SVM is necessary when the data cannot be effectively separated by a linear decision boundary in the original feature space. Nonlinear SVM addresses this limitation by utilizing kernel functions to map the data into a higher-dimensional space where linear separation becomes possible.
How does SVM work?
Step 1: Choose an optimal hyperplane which maximizes the margin.
Step 2: Apply a penalty for misclassification (the cost 'C' tuning parameter).
Step 3: If the data points are not linearly separable, transform the data to a higher-dimensional space where it is easier to classify them with linear decision surfaces (the kernel trick).
Create our own data:
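The code that generated these points is not shown; a minimal sketch that would reproduce output of this shape (the seed, dimensions, and class shift are assumptions inferred from the printed values):
set.seed(1)
x <- matrix(rnorm(20 * 2), ncol = 2)   # 20 observations, 2 features
y <- c(rep(-1, 10), rep(1, 10))        # two classes coded -1 / 1
x[y == 1, ] <- x[y == 1, ] + 1         # shift one class so the groups are (nearly) separable
head(x)
head(y)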
## [,1] [,2]
## [1,] -0.6264538 0.91897737
## [2,] 0.1836433 0.78213630
## [3,] -0.8356286 0.07456498
## [4,] 1.5952808 -1.98935170
## [5,] 0.3295078 0.61982575
## [6,] -0.8204684 -0.05612874
## [1] -1 -1 -1 -1 -1 -1
Let’s examine the data points on the plot:
To be able to perform the classification task, we have to convert the response variable y to a factor.
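A minimal sketch of how the modeling data frame might be assembled, assuming the objects x and y from the sketch above (the name dat matches the one used in the svm() call below):
dat <- data.frame(x = x, y = as.factor(y))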
Then, the data is ready for the support vector classifier.
svmfit <- svm(y ~., data=dat, kernel='linear', cost=10, scale=FALSE)
plot(svmfit, dat, col = c("lightblue", "darkblue"))
The C and gamma parameters are important hyperparameters used to control the behavior of the Support Vector Machine (SVM) model.
C Parameter:
C determines how flexible the model will be during training. A large C value means the model will tolerate fewer errors in the training data and will attempt to classify more data points correctly. This can lead to models that overfit the training data.
A small C value allows the model to tolerate more errors and create a more general decision boundary. This can lead to models that fit less tightly to the training data.
The C parameter affects the performance of SVM on both linearly separable and non-linearly separable datasets. For linearly separable datasets, a low C value may be used, while for non-linearly separable datasets, a higher C value is often preferred.
Generally,
0.1 < C < 100
Gamma Parameter:
Gamma controls the width of the Radial Basis Function (RBF) kernel. The RBF kernel is used to transform non-linearly separable datasets into separable ones, allowing SVM to classify them.
A small gamma value means the RBF kernel will have a wider radius of influence on data points, resulting in a smoother decision boundary. This can help the model generalize better.
A large gamma value means the RBF kernel will have a narrower radius of influence, resulting in a more complex decision boundary. This can lead to models that fit tightly to the training data and may overfit.
Selecting optimal values for these parameters is an important step in training an SVM model. Often, techniques such as cross-validation or hyperparameter optimization are used to determine these values.
Generally,
0.0001 < gamma < 10
When gamma is very small (0.008 or 0.01), the model cannot capture the complexity or “shape” of the data. The region of influence of any selected support vector would include the whole training set.
For intermediate values of gamma (0.05, 0.1, 0.5), good models can be found.
For larger values of gamma (3.0, 7.0, 11.0), the radius of the area of influence of the support vectors includes only the support vector itself, and no amount of regularization with C will be able to prevent overfitting.
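Returning to our linear example, the fitted classifier can be inspected with summary(); a sketch of the call that presumably produced the output below:
summary(svmfit)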
##
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 7
##
## ( 4 3 )
##
##
## Number of Classes: 2
##
## Levels:
## -1 1
The summary lets us know there are 7 support vectors, four in the first class and three in the second. To see the index of the resulting support vectors in the data matrix:
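The indices are stored in the index component of the fitted svm object:
svmfit$index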
## [1] 1 2 5 7 14 16 17
Let's see what happens if we decrease the cost parameter:
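A sketch of the refit (cost = 0.1 is an assumption, chosen because it is the value selected by the tuning below and reproduces the support vectors listed; the object name svmfit_small is ours):
svmfit_small <- svm(y ~ ., data = dat, kernel = "linear", cost = 0.1, scale = FALSE)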
Now, using a smaller cost parameter, we get more support vectors because the margin is wider. Let’s examine the support vectors:
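Again via the index component of the refitted object:
svmfit_small$index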
## [1] 1 2 3 4 5 7 9 10 12 13 14 15 16 17 18 20
What is the optimal cost value? The tune() function performs 10-fold cross-validation in order to choose it optimally over a given range of the cost parameter.
set.seed(1)
tune.out <- tune(svm, y ~ ., data = dat, kernel = "linear",
                 ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.1
##
## - best performance: 0.05
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.55 0.4377975
## 2 1e-02 0.55 0.4377975
## 3 1e-01 0.05 0.1581139
## 4 1e+00 0.15 0.2415229
## 5 5e+00 0.15 0.2415229
## 6 1e+01 0.15 0.2415229
## 7 1e+02 0.15 0.2415229
This output clearly shows us what the best value for the cost parameter is and so we can choose that as our model:
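The best model can be extracted directly from the tuning object; a sketch of the call that presumably produced the summary below:
bestmod <- tune.out$best.model
summary(bestmod)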
##
## Call:
## best.tune(METHOD = svm, train.x = y ~ ., data = dat, ranges = list(cost = c(0.001,
## 0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.1
##
## Number of Support Vectors: 16
##
## ( 8 8 )
##
##
## Number of Classes: 2
##
## Levels:
## -1 1
The data set contains information on different clients who received a loan at least 10 years ago. The variables income (yearly), age, and loan (size in euros) are available. Our goal is to build a model which predicts, based on the input variables income, age, and loan, whether or not a default will occur within 10 years.
LTI is the loan to yearly income ratio, which is an aggregated (derived) value. Therefore, we will not include this variable in the analysis.
| clientid | income | age | loan | LTI | default10yr |
|---|---|---|---|---|---|
| 1 | 66155.93 | 59.01702 | 8106.5321 | 0.1225368 | 0 |
| 2 | 34415.15 | 48.11715 | 6564.7450 | 0.1907516 | 0 |
| 3 | 57317.17 | 63.10805 | 8020.9533 | 0.1399398 | 0 |
| 4 | 42709.53 | 45.75197 | 6103.6423 | 0.1429105 | 0 |
| 5 | 66952.69 | 18.58434 | 8770.0992 | 0.1309895 | 1 |
| 6 | 24904.06 | 57.47161 | 15.4986 | 0.0006223 | 0 |
## [1] "data.frame"
## [1] 2000 6
Let’s see the type of each variable:
## 'data.frame': 2000 obs. of 6 variables:
## $ clientid : int 1 2 3 4 5 6 7 8 9 10 ...
## $ income : num 66156 34415 57317 42710 66953 ...
## $ age : num 59 48.1 63.1 45.8 18.6 ...
## $ loan : num 8107 6565 8021 6104 8770 ...
## $ LTI : num 0.123 0.191 0.14 0.143 0.131 ...
## $ default10yr: int 0 0 0 0 1 0 0 1 0 0 ...
The variable default10yr must be converted to factor.
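A minimal sketch of the conversion, assuming the data frame is named credit (the name used in the partitioning code below):
credit$default10yr <- as.factor(credit$default10yr)
str(credit)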
## 'data.frame': 2000 obs. of 6 variables:
## $ clientid : int 1 2 3 4 5 6 7 8 9 10 ...
## $ income : num 66156 34415 57317 42710 66953 ...
## $ age : num 59 48.1 63.1 45.8 18.6 ...
## $ loan : num 8107 6565 8021 6104 8770 ...
## $ LTI : num 0.123 0.191 0.14 0.143 0.131 ...
## $ default10yr: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
To check the model's validity, the validation set approach is employed.
set.seed(123)
training.samples <- createDataPartition(credit$default10yr, p = 0.8, list = FALSE) # createDataPartition returns the row indices for the training set
train.data <- credit[training.samples, ]
test.data <- credit[-training.samples, ]
Now, we can apply the svm() function to the data.
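A sketch of the default fit whose output is shown below; the object name svm_default matches the one used in the prediction code later on:
svm_default <- svm(default10yr ~ income + age + loan, data = train.data)
svm_default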
##
## Call:
## svm(formula = default10yr ~ income + age + loan, data = train.data)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 191
Let's look at the hyperparameters of the initial model:
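These can be read off the fitted object; the two values printed below are presumably the default gamma and cost, respectively:
svm_default$gamma
svm_default$cost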
## [1] 0.3333333
## [1] 1
Let’s tune the hyperparameters:
tuned_parameter <- tune.svm(train.data[, 2:4], train.data[, 6], gamma = 10^(-5:-1), cost = 10^(-3:1))
tuned_parameter$best.parameters

|    | gamma | cost |
|---|---|---|
| 25 | 0.1 | 10 |
After tuning the hyperparameters, the best model has parameters of gamma 0.1 and cost 10.
Then, let's see the best model:
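The fitted best model is stored in the tuning object; a minimal sketch of the call that presumably produced the output below:
tuned_parameter$best.model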
##
## Call:
## best.svm(x = train.data[, 2:4], y = train.data[, 6], gamma = 10^(-5:-1),
## cost = 10^(-3:1))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 147
#OR
svm_tuned <- svm(default10yr ~ income + age + loan, data = train.data, kernel = "radial",
                 cost = tuned_parameter$best.parameters$cost,
                 gamma = tuned_parameter$best.parameters$gamma,
                 type = "C-classification")
svm_tuned
##
## Call:
## svm(formula = default10yr ~ income + age + loan, data = train.data,
## kernel = "radial", cost = tuned_parameter$best.parameters$cost,
## gamma = tuned_parameter$best.parameters$gamma, type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 147
Performance of the default model:
#test set predictions
pred_test_default <-predict(svm_default,test.data[,2:4])
## Test Data performance for default model
test_tab_default = table(predicted = pred_test_default, actual = test.data[,6])
library(caret)
test_con_mat_default = confusionMatrix(test_tab_default)
metrics_test_default=c(test_con_mat_default$overall["Accuracy"],
test_con_mat_default$byClass["Sensitivity"],
test_con_mat_default$byClass["Specificity"])
library(Metrics)
library(MLmetrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
##     precision, recall
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
##     MAE, RMSE
## The following object is masked from 'package:base':
##
##     Recall
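The classification error and F1 score of the default model, which appear in the comparison tables below, are presumably computed in the same way as for the tuned model; a minimal sketch, assuming the object pred_test_default from above and the ce()/F1_Score() helpers loaded here:
ce_default_test <- ce(actual = test.data[,6], predicted = pred_test_default)
f1_default <- F1_Score(pred_test_default, test.data[,6])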
Performance of the tuned model:
#test set predictions
pred_test_tuned <-predict(svm_tuned,test.data[,2:4])
## Test performance for tuned model
test_tab_tuned = table(predicted = pred_test_tuned, actual = test.data[,6])
library(caret)
test_con_mat_tuned = confusionMatrix(test_tab_tuned)
metrics_test_tuned=c(test_con_mat_tuned$overall["Accuracy"],
test_con_mat_tuned$byClass["Sensitivity"],
test_con_mat_tuned$byClass["Specificity"])
ce_tuned_test <- ce(actual = test.data[,6], predicted = pred_test_tuned)
Additionally, let's calculate the F1 score.
REMEMBER:
f1_tuned <- F1_Score(pred_test_tuned, test.data[,6]) # to get the F1 score from the confusion matrix instead, call confusionMatrix() with mode = "everything"
COMPARISON:
| metrics | ces | f1s |
|---|---|---|
| default | 0.0275689 | 0.9839884 |
| tuned | 0.0200501 | 0.9883382 |
|   | metrics_test_default | metrics_test_tuned |
|---|---|---|
| Accuracy | 0.9724311 | 0.9799499 |
| Sensitivity | 0.9854227 | 0.9883382 |
| Specificity | 0.8928571 | 0.9285714 |
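For comparison, a logistic regression model is also fitted on the same training data. A sketch of the call that presumably produced the summary below (the object name logis matches the prediction code that follows):
logis <- glm(default10yr ~ income + age + loan, family = "binomial", data = train.data)
summary(logis)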
##
## Call:
## glm(formula = default10yr ~ income + age + loan, family = "binomial",
## data = train.data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.6010507 0.9307066 10.32 <2e-16 ***
## income -0.0002384 0.0000235 -10.14 <2e-16 ***
## age -0.3472719 0.0287010 -12.10 <2e-16 ***
## loan 0.0017351 0.0001473 11.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1307.03 on 1600 degrees of freedom
## Residual deviance: 348.07 on 1597 degrees of freedom
## AIC: 356.07
##
## Number of Fisher Scoring iterations: 9
pred_log_test=predict(logis,test.data,type="response")
pred_log=ifelse(pred_log_test>=0.5,1,0)
test_tab_log = table(predicted = pred_log, actual = test.data[,6])
library(caret)
test_con_mat_log = confusionMatrix(test_tab_log)
metrics_test_log=c(test_con_mat_log$overall["Accuracy"],
test_con_mat_log$byClass["Sensitivity"],
test_con_mat_log$byClass["Specificity"])
ce_log_test=ce(actual = test.data[,6],predicted = pred_log)
f1_log <- F1_Score(pred_log, test.data[,6])

|   | metrics_test_default | metrics_test_tuned | metrics_test_log |
|---|---|---|---|
| Accuracy | 0.9724311 | 0.9799499 | 0.9448622 |
| Sensitivity | 0.9854227 | 0.9883382 | 0.9766764 |
| Specificity | 0.8928571 | 0.9285714 | 0.7500000 |
ces <- data.frame(metrics = c("default", "tuned", "log"),
                  ces = c(ce_default_test, ce_tuned_test, ce_log_test),
                  f1s = c(f1_default, f1_tuned, f1_log))
ces

| metrics | ces | f1s |
|---|---|---|
| default | 0.0275689 | 0.9839884 |
| tuned | 0.0200501 | 0.9883382 |
| log | 0.0551378 | 0.9682081 |
According to these performance measures, the tuned SVM outperforms the other models.
This data set contains variables for the following information related to ice cream consumption.
CONSUME: Ice cream consumption in pints per capita
PRICE: Per pint price of ice cream in dollars
INC: Weekly family income in dollars
TEMP: Mean temperature in degrees F
| CONSUME | PRICE | INC | TEMP |
|---|---|---|---|
| 0.386 | 0.270 | 78 | 41 |
| 0.374 | 0.282 | 79 | 56 |
| 0.393 | 0.277 | 81 | 63 |
| 0.425 | 0.280 | 80 | 68 |
| 0.406 | 0.272 | 76 | 69 |
| 0.344 | 0.262 | 78 | 65 |
## [1] 29 4
## 'data.frame': 29 obs. of 4 variables:
## $ CONSUME: num 0.386 0.374 0.393 0.425 0.406 0.344 0.327 0.288 0.269 0.256 ...
## $ PRICE : num 0.27 0.282 0.277 0.28 0.272 0.262 0.275 0.267 0.265 0.277 ...
## $ INC : int 78 79 81 80 76 78 82 79 76 79 ...
## $ TEMP : int 41 56 63 68 69 65 61 47 32 24 ...
To fit a basic support vector regression in R, we use the svm() function from the e1071 package.
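A sketch of the basic fit whose output is printed below (the lowercasing of the column names and the object name svm_price are assumptions; the formula consume ~ . in the printed call suggests the names were converted to lowercase at some point):
names(pricedata) <- tolower(names(pricedata))  # assumption: lowercase names to match the printed formula
svm_price <- svm(consume ~ ., data = pricedata)
svm_price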
##
## Call:
## svm(formula = consume ~ ., data = pricedata)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.3333333
## epsilon: 0.1
##
##
## Number of Support Vectors: 24
We will now see how to fit the same regression using the caret package. We use this library because it provides many convenient features for real-life modeling.
To do this, we use the train() method. We pass the same formula and data as above, and in addition method = 'svmRadial' to tell caret to fit an SVM with a radial basis function kernel. caret also provides the following SVM options: "svmRadial", "svmLinear", and "svmPoly".
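For reference, a sketch of the train() call that could produce the resampling output below (the object name model1 and the seed are assumptions, since only the printed results are shown):
set.seed(1)
model1 <- train(consume ~ ., data = pricedata, method = 'svmRadial')
model1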
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
## Support Vector Machines with Radial Basis Function Kernel
##
## 29 samples
## 3 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 29, 29, 29, 29, 29, 29, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.04742293 0.5017193 0.03961522
## 0.50 0.04387186 0.5038271 0.03589080
## 1.00 0.04164332 0.5124484 0.03296355
##
## Tuning parameter 'sigma' was held constant at a value of 0.4468348
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.4468348 and C = 1.
Preprocessing with Caret
One caret feature that we use is preprocessing. Often in real-life data science we want to run some preprocessing before modeling. We will center and scale our data by passing preProcess = c("center", "scale") to the train() method.
set.seed(1)
model2 <- train(consume ~ ., data = pricedata, method = 'svmRadial', preProcess = c("center", "scale"))
model2
## Support Vector Machines with Radial Basis Function Kernel
##
## 29 samples
## 3 predictor
##
## Pre-processing: centered (3), scaled (3)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 29, 29, 29, 29, 29, 29, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.04742293 0.5017193 0.03961522
## 0.50 0.04387186 0.5038271 0.03589080
## 1.00 0.04164332 0.5124484 0.03296355
##
## Tuning parameter 'sigma' was held constant at a value of 0.4468348
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.4468348 and C = 1.
Cross Validation
Let's use a data partitioning strategy like k-fold cross-validation, which resamples and splits the data many times. caret makes this easy with the trainControl() method. We will use 10-fold cross-validation in this tutorial. But first, let's partition our data into train and test sets.
set.seed(1)
inTraining <- createDataPartition(pricedata$consume, p = .80, list = FALSE)
training <- pricedata[inTraining,]
testing <- pricedata[-inTraining,]
set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)
set.seed(1)
model4 <- train(consume ~ ., data = training, method = 'svmRadial',
                preProcess = c("center", "scale"), trCtrl = ctrl)
model4
## Support Vector Machines with Radial Basis Function Kernel
##
## 25 samples
## 3 predictor
##
## Pre-processing: centered (3), scaled (3)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 25, 25, 25, 25, 25, 25, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.04618492 0.5228372 0.03849172
## 0.50 0.04404582 0.5278148 0.03630516
## 1.00 0.04367235 0.5257588 0.03554236
##
## Tuning parameter 'sigma' was held constant at a value of 0.2921907
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.2921907 and C = 1.
Note that the resampling summary above still shows bootstrapping rather than 10-fold CV: the control object must be passed to train() as trControl = ctrl, and the misspelled trCtrl argument does not reach it, so the default bootstrap resampling was used. Even so, the results seem slightly improved on the training data. Let's check the model on the test data to see the results.
Finally, we calculate the RMSE to compare to the model above.
test.features = subset(testing, select=-c(consume))
test.target = subset(testing, select=consume)[,1]
predictions = predict(model4, newdata = test.features)
# RMSE
sqrt(mean((test.target - predictions)^2))
## [1] 0.016125
Tuning Hyperparameters:
To tune an SVM model, we can give train() a grid of values for C (the cost) and sigma (the width of the RBF kernel). caret will retrain the model for each combination and select the best version.
set.seed(1)
tuneGrid <- expand.grid(
C = c(0.25, .5, 1),
sigma = 0.1
)
model5 <- train(
consume ~ .,
data = training,
method = 'svmRadial',
preProcess = c("center", "scale"),
trControl = ctrl,
tuneGrid = tuneGrid
)
model5
## Support Vector Machines with Radial Basis Function Kernel
##
## 25 samples
## 3 predictor
##
## Pre-processing: centered (3), scaled (3)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 23, 23, 23, 22, 22, 23, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.04611165 0.7687854 0.04249271
## 0.50 0.04005488 0.7846281 0.03640338
## 1.00 0.03822919 0.7813825 0.03450333
##
## Tuning parameter 'sigma' was held constant at a value of 0.1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.1 and C = 1.