Data622 - HW3

Assignment Prompt

Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Answer questions such as:

Which algorithm is recommended to get more accurate results?

Answer: SVM uses the kernel trick to solve non-linear problems, whereas decision trees partition the input space into hyper-rectangles to solve the problem. Decision trees are better suited to categorical data, and they deal with collinearity better than SVM.

Is it better for classification or regression scenarios?

Answer: Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for either classification or regression problems. However, it is mostly used for classification. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyper-plane that best separates the classes.

Do you agree with the recommendations? Why?

Answer: For the same dataset, I didn't find any meaningful difference in accuracy between the random forest model from my HW#2 assignment and this SVM model. They both performed about the same.

#prevent conflict with skimr and dlookr
options(kableExtra.auto_format = FALSE)

library(skimr)

## Warning: package 'skimr' was built under R version 4.0.5

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6      v purrr   0.3.4 
## v tibble  3.1.8      v dplyr   1.0.10
## v tidyr   1.2.0      v stringr 1.4.0 
## v readr   2.1.2      v forcats 0.5.2

## Warning: package 'tidyr' was built under R version 4.0.5

## Warning: package 'readr' was built under R version 4.0.5

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(lubridate)

## Warning: package 'lubridate' was built under R version 4.0.5

## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(rpart) #decision tree package rec'd by Practical ML in R textbook
library(rpart.plot) #decision tree display package rec'd by Practical ML in R textbook

## Warning: package 'rpart.plot' was built under R version 4.0.4

library(randomForest) #for random forest modeling

## Warning: package 'randomForest' was built under R version 4.0.4

## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library(caret) #for confusionMatrix()

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(e1071)

## Warning: package 'e1071' was built under R version 4.0.4

library(caTools)

## Warning: package 'caTools' was built under R version 4.0.5

library(ggplot2)

Data Overview

Choosing to use a dataset showing the relationship between physiological parameters and stress level of an individual from Kaggle. The direct link to the dataset is below:

https://www.kaggle.com/datasets/laavanya/stress-level-detection?select=Stress-Lysis.csv

This is a dataset with 2001 cases and 4 variables. There are 3 variables I've imported as numeric: humidity, temp, and steps. These are the physiological measurements recorded for each individual. There are no missing values. I've imported my target variable, stress_lvl, as a factor; it has three levels that correspond to low (0), medium (1), and high (2).

All appear to be reasonably distributed, judging by the percentiles. I note that the average humidity value is 20, the average temp is 89, and the average steps value is about 100.

stress_dataset <- read.csv("Stress-Lysis.csv", 
                 col.names = c("humidity", "temp", "steps", "stress_lvl"),
                 colClasses = c("numeric", "numeric", "numeric", "factor"))

#let's look at the overall summary statistics for our data
skim(stress_dataset)

Data summary
Name	stress_dataset
Number of rows	2001
Number of columns	4
_______________________
Column type frequency:
factor	1
numeric	3
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
stress_lvl	0	1	FALSE	3	1: 790, 2: 710, 0: 501

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
humidity	1	20.00	5.78	10	15	20	25	30	▇▇▇▇▇
temp	1	89.00	5.78	79	84	89	94	99	▇▇▇▇▇
steps	1	100.14	58.18	0	50	101	150	200	▇▇▇▇▇

#We saw in HW#2 that steps was not an important feature, so we remove it from this analysis
stress_dataset <- select(stress_dataset, -steps)

#let's plot the data and see what it looks like
ggplot(stress_dataset, aes(x = humidity, y = temp, colour = stress_lvl)) +
  geom_point() +
  labs(title = 'Humidity vs Temp')

SVM Overview

This general separation, with only a very small amount of overlap, is a good indication that we should use an SVM to classify a test set of these stress levels. The general idea of SVMs is to find a linear separator that can be drawn in between two classes so that every data point falls into one class or the other. For complex data, SVMs attempt to accomplish this by dealing with the following three problems:

Class Separation: We attempt to find an optimal separating hyperplane between two or more classes by maximizing the margin between the classes' closest points. The points lying on the boundaries are called support vectors, and the middle of the margin is the optimal separating hyperplane.

Overlapping Classes: Data points on the 'wrong' side of the discriminant margin are weighted down to reduce their influence.

Nonlinearity: When we cannot find a linear separator, data points are projected into a higher-dimensional space where they effectively become linearly separable.

SVMs then find a solution by formulating the whole task as a quadratic optimization problem, which can be solved by known techniques.
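For reference, the quadratic optimization problem mentioned above is usually written, for the two-class case, as the soft-margin problem below. The notation is standard textbook notation rather than anything taken from this assignment: I assume labels y_i in {-1, +1}, a weight vector w, an intercept b, slack variables xi_i for points that violate the margin, and a penalty C, which corresponds to the cost argument of svm().

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i \ge 0,\quad i=1,\dots,n .
```

The first term maximizes the margin (class separation), the slack terms handle overlapping classes, and a larger C penalizes points on the wrong side of the margin more heavily. Because stress_lvl has three levels, e1071 fits several two-class problems of this form using a one-against-one scheme.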

Model Creation

The output below shows that, on the test dataset, our linear SVM correctly classified 665 of the 667 observations (99.7% accuracy). This is excellent, but maybe we can do better. With SVM models, there are generally three parameters that are tuned to optimize the model: kernel, gamma, and cost. In our original model we used a linear kernel and the default cost (the linear kernel does not use gamma).

The kernel determines the type of decision boundary the SVM uses to classify data. So we start by adjusting the kernel to look for a more accurate SVM model.

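#Splitting the dataset into the Training set and Test set
#(no seed is set, so the split and the results below change from run to run)
index <- seq_len(nrow(stress_dataset))
test.index <- sample(index, size = length(index) / 3) #one third of the rows (667) held out for testing
train <- stress_dataset[-test.index, ]
test <- stress_dataset[test.index, ]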
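Because the split above is random, the confusion matrices in this document will differ slightly on every run. A minimal sketch of a reproducible version of the same one-third split is shown here. The seed value and the demo_* names are my own choices for illustration, and the data frame is a randomly generated stand-in with the same number of rows as Stress-Lysis.csv, not the real data.

```r
# Stand-in data frame with 2001 rows (hypothetical values, not the real dataset)
set.seed(622)  # arbitrary seed, used only so the split can be repeated
demo_data <- data.frame(
  humidity   = runif(2001, min = 10, max = 30),
  stress_lvl = factor(sample(0:2, 2001, replace = TRUE))
)

# Hold out one third of the rows for testing, as in the split above
n_test    <- floor(nrow(demo_data) / 3)
test_rows <- sample(seq_len(nrow(demo_data)), size = n_test)

demo_train <- demo_data[-test_rows, ]
demo_test  <- demo_data[test_rows, ]

# Row counts of the two partitions
c(train = nrow(demo_train), test = nrow(demo_test))
```

Calling set.seed() once before sample() is enough; rerunning the whole script then yields the same train and test rows each time.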




#Next we can use this training set to create our model
svm.model.linear <- svm(stress_lvl ~ ., data = train, kernel = 'linear')
svm.model.linear

## 
## Call:
## svm(formula = stress_lvl ~ ., data = train, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  168

pred <- predict(svm.model.linear, newdata=test)
confusionMatrix(pred, test$stress_lvl)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2
##          0 181   0   0
##          1   2 259   0
##          2   0   0 225
## 
## Overall Statistics
##                                           
##                Accuracy : 0.997           
##                  95% CI : (0.9892, 0.9996)
##     No Information Rate : 0.3883          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9955          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.9891   1.0000   1.0000
## Specificity            1.0000   0.9951   1.0000
## Pos Pred Value         1.0000   0.9923   1.0000
## Neg Pred Value         0.9959   1.0000   1.0000
## Prevalence             0.2744   0.3883   0.3373
## Detection Rate         0.2714   0.3883   0.3373
## Detection Prevalence   0.2714   0.3913   0.3373
## Balanced Accuracy      0.9945   0.9975   1.0000
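To make the link between the confusion matrix and the reported statistics explicit, the sketch below rebuilds the matrix above by hand and recomputes a few of the numbers with base R. The counts are copied from the output above; the variable names are my own.

```r
# Confusion matrix for the linear SVM, filled column by column
# (rows = predicted class, columns = reference class)
cm <- matrix(c(181,   2,   0,
                 0, 259,   0,
                 0,   0, 225),
             nrow = 3,
             dimnames = list(Prediction = c("0", "1", "2"),
                             Reference  = c("0", "1", "2")))

# Accuracy: correctly classified observations (the diagonal) over all observations
accuracy <- sum(diag(cm)) / sum(cm)

# No Information Rate: share of the largest reference class
nir <- max(colSums(cm)) / sum(cm)

# Sensitivity per class: correct predictions over the reference class total
sensitivity <- diag(cm) / colSums(cm)

round(c(accuracy = accuracy, nir = nir), 4)
round(sensitivity, 4)
```

The two off-diagonal observations are low-stress cases predicted as medium, which is why only the class 0 sensitivity falls below 1.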

Kernel tuning

We want to test the linear SVM created previously against a few other popular kernels. We will limit this to the polynomial, radial, and sigmoid kernels.

The polynomial kernel did not perform as well. Test accuracy dropped from 99.7% to 97.9%, with 12 high-stress (2) observations misclassified as medium (1).

The radial kernel performed about the same as our original linear model (99.85% versus 99.7%, a difference of one observation), so it is better than the polynomial kernel.

The sigmoid kernel performed very badly, and accuracy dropped sharply to 63.7%.

# poly
svm.model.poly <- svm(stress_lvl ~ ., data = train, kernel = 'polynomial')
svm.model.poly

## 
## Call:
## svm(formula = stress_lvl ~ ., data = train, kernel = "polynomial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##      coef.0:  0 
## 
## Number of Support Vectors:  192

pred <- predict(svm.model.poly, newdata=test)
confusionMatrix(pred, test$stress_lvl)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2
##          0 181   0   0
##          1   2 259  12
##          2   0   0 213
## 
## Overall Statistics
##                                          
##                Accuracy : 0.979          
##                  95% CI : (0.965, 0.9885)
##     No Information Rate : 0.3883         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9681         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.9891   1.0000   0.9467
## Specificity            1.0000   0.9657   1.0000
## Pos Pred Value         1.0000   0.9487   1.0000
## Neg Pred Value         0.9959   1.0000   0.9736
## Prevalence             0.2744   0.3883   0.3373
## Detection Rate         0.2714   0.3883   0.3193
## Detection Prevalence   0.2714   0.4093   0.3193
## Balanced Accuracy      0.9945   0.9828   0.9733

#radial
svm.model.radial <- svm(stress_lvl ~ ., data = train, kernel = 'radial')

svm.model.radial

## 
## Call:
## svm(formula = stress_lvl ~ ., data = train, kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  172

pred <- predict(svm.model.radial, newdata=test)
confusionMatrix(pred, test$stress_lvl)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2
##          0 182   0   0
##          1   1 259   0
##          2   0   0 225
## 
## Overall Statistics
##                                      
##                Accuracy : 0.9985     
##                  95% CI : (0.9917, 1)
##     No Information Rate : 0.3883     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 0.9977     
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.9945   1.0000   1.0000
## Specificity            1.0000   0.9975   1.0000
## Pos Pred Value         1.0000   0.9962   1.0000
## Neg Pred Value         0.9979   1.0000   1.0000
## Prevalence             0.2744   0.3883   0.3373
## Detection Rate         0.2729   0.3883   0.3373
## Detection Prevalence   0.2729   0.3898   0.3373
## Balanced Accuracy      0.9973   0.9988   1.0000

#sigmoid
svm.model.sig <- svm(stress_lvl ~ ., data = train, kernel = 'sigmoid')
svm.model.sig

## 
## Call:
## svm(formula = stress_lvl ~ ., data = train, kernel = "sigmoid")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  sigmoid 
##        cost:  1 
##      coef.0:  0 
## 
## Number of Support Vectors:  490

pred <- predict(svm.model.sig, newdata=test)
confusionMatrix(pred, test$stress_lvl)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2
##          0  77 100   0
##          1 106 142  19
##          2   0  17 206
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6372          
##                  95% CI : (0.5994, 0.6737)
##     No Information Rate : 0.3883          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4494          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.4208   0.5483   0.9156
## Specificity            0.7934   0.6936   0.9615
## Pos Pred Value         0.4350   0.5318   0.9238
## Neg Pred Value         0.7837   0.7075   0.9572
## Prevalence             0.2744   0.3883   0.3373
## Detection Rate         0.1154   0.2129   0.3088
## Detection Prevalence   0.2654   0.4003   0.3343
## Balanced Accuracy      0.6071   0.6209   0.9385
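The four kernel fits above repeat the same three steps, so they can also be written as a loop. The sketch below shows the idea on R's built-in iris data, which I use only as a stand-in so the example runs without Stress-Lysis.csv; the seed, the 50-row test set, and the demo names are assumptions for illustration.

```r
library(e1071)

# Stand-in data: built-in iris, split into train and test rows
set.seed(622)  # arbitrary seed for a repeatable split
test_rows  <- sample(seq_len(nrow(iris)), size = 50)
train_demo <- iris[-test_rows, ]
test_demo  <- iris[test_rows, ]

kernels <- c("linear", "polynomial", "radial", "sigmoid")

# Fit one SVM per kernel with default settings and record test accuracy
kernel_accuracy <- sapply(kernels, function(k) {
  fit <- svm(Species ~ ., data = train_demo, kernel = k)
  mean(predict(fit, newdata = test_demo) == test_demo$Species)
})

round(kernel_accuracy, 3)
```

With the stress data, the same loop would use stress_lvl ~ . with the train and test objects created earlier. Collecting the accuracies in one named vector makes the kernel comparison easier to read than four separate confusion matrices.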

Gamma and Cost Optimization

Now that we have decided on a kernel, we can tune the cost and gamma parameters. To do this we will use the tune.svm() function passing sequences of each parameter.

tune.svm()

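Our tuned SVM returned that, out of the sequences of gammas and costs provided, the best configuration was a gamma of 7.006492e-46 and a cost of 1. The linear kernel does not use gamma, so every gamma value gives the same cross-validation error, and the reported gamma is presumably just the first of many tied values. Note that the model refit below uses a cost of 0.16 rather than the tuned cost of 1. Its test accuracy is 99.85%, compared with 99.7% for the original linear model, a difference of a single observation.

#gamma has no effect with kernel = 'linear'; nrow(iris) (150) only sets the smallest gamma in the grid
tuned.svm <- tune.svm(stress_lvl ~ ., data = train, kernel = 'linear',
        gamma = seq(1/2^nrow(iris), 1, .01), cost = 2^seq(-6, 4, 2))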
tuned.svm

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##         gamma cost
##  7.006492e-46    1
## 
## - best performance: 0.003748176

#refit with cost = 0.16 (the tuning output above reports cost = 1 as the best value)
tuned.svm <- svm(stress_lvl ~ ., data = train, kernel = 'linear', gamma = 7.006492e-46, cost = 0.16)
pred <- predict(tuned.svm, newdata=test)
confusionMatrix(pred, test$stress_lvl)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2
##          0 183   0   0
##          1   0 259   1
##          2   0   0 224
## 
## Overall Statistics
##                                      
##                Accuracy : 0.9985     
##                  95% CI : (0.9917, 1)
##     No Information Rate : 0.3883     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 0.9977     
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            1.0000   1.0000   0.9956
## Specificity            1.0000   0.9975   1.0000
## Pos Pred Value         1.0000   0.9962   1.0000
## Neg Pred Value         1.0000   1.0000   0.9977
## Prevalence             0.2744   0.3883   0.3373
## Detection Rate         0.2744   0.3883   0.3358
## Detection Prevalence   0.2744   0.3898   0.3358
## Balanced Accuracy      1.0000   0.9988   0.9978
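Gamma only matters for the non-linear kernels, so a grid search over gamma and cost is more informative with, for example, the radial kernel. The sketch below shows one way to run such a search and reuse the winning model directly instead of retyping the parameters. It uses the built-in iris data as a stand-in, and the grid values and seed are arbitrary choices of mine.

```r
library(e1071)

set.seed(622)  # cross-validation folds are random, so fix the seed

# Grid search over gamma and cost with 10-fold cross-validation (the default)
tuned_demo <- tune.svm(Species ~ ., data = iris, kernel = "radial",
                       gamma = 2^(-4:0), cost = 2^(-2:4))

# Best parameter combination and its cross-validated error rate
tuned_demo$best.parameters
tuned_demo$best.performance

# The model already refit with the best parameters
final_demo <- tuned_demo$best.model
```

Taking best.model from the tuning result avoids copying the parameters by hand, which is how a value other than the tuned cost can end up in the refit.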

Let's try best.svm() from the e1071 package. Using best.svm() we find that the best set of parameters is a linear kernel with a cost value of 1. This is exactly the same as our previous linear SVM model, most likely because no parameter ranges were passed to best.svm(), so only the default settings were evaluated. This model reaches 99.7% accuracy on the test set.

best.svm <- best.svm(stress_lvl ~ ., data = train, kernel = 'linear')
best.svm

## 
## Call:
## best.svm(x = stress_lvl ~ ., data = train, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  168

pred <- predict(best.svm, newdata=test)
confusionMatrix(pred, test$stress_lvl)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2
##          0 181   0   0
##          1   2 259   0
##          2   0   0 225
## 
## Overall Statistics
##                                           
##                Accuracy : 0.997           
##                  95% CI : (0.9892, 0.9996)
##     No Information Rate : 0.3883          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9955          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.9891   1.0000   1.0000
## Specificity            1.0000   0.9951   1.0000
## Pos Pred Value         1.0000   0.9923   1.0000
## Neg Pred Value         0.9959   1.0000   1.0000
## Prevalence             0.2744   0.3883   0.3373
## Detection Rate         0.2714   0.3883   0.3373
## Detection Prevalence   0.2714   0.3913   0.3373
## Balanced Accuracy      0.9945   0.9975   1.0000

Final Test

Now that we have settled on an SVM model, let's see how it performs on the held-out test data. Our best SVM model correctly predicts 665 of the 667 observations in the testing dataset, which is good for 99.7% accuracy. This looks pretty good!

best.svm.pred <- predict(best.svm, test)
table(Prediction = best.svm.pred, Truth = test$stress_lvl)

##           Truth
## Prediction   0   1   2
##          0 181   0   0
##          1   2 259   0
##          2   0   0 225

#share of test observations predicted correctly
mean(test$stress_lvl == best.svm.pred)

## [1] 0.9970015

Conclusion

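Out of all the models we have created in this document, the original linear SVM, which was reproduced by best.svm(), classifies 665 of the 667 test observations correctly (99.7%). The radial kernel and the linear model with a cost of 0.16 each got one more observation right (99.85%), a difference well inside the reported confidence intervals, so the simpler default linear model remains a reasonable choice. The high accuracy on held-out data suggests that this model is not overfit to the training set, which is a problem SVMs can run into.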
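One way to check the overfitting claim directly is to compare accuracy on the training rows with accuracy on the test rows for the same model. The sketch below does this on the built-in iris data as a stand-in; the seed, split size, and variable names are assumptions for illustration.

```r
library(e1071)

# Stand-in data: built-in iris, split into train and test rows
set.seed(622)  # arbitrary seed for a repeatable split
test_rows  <- sample(seq_len(nrow(iris)), size = 50)
train_demo <- iris[-test_rows, ]
test_demo  <- iris[test_rows, ]

fit <- svm(Species ~ ., data = train_demo, kernel = "linear")

# Accuracy on the data the model was fit to versus data it has not seen
train_acc <- mean(predict(fit, newdata = train_demo) == train_demo$Species)
test_acc  <- mean(predict(fit, newdata = test_demo) == test_demo$Species)

round(c(train = train_acc, test = test_acc), 3)
```

A training accuracy far above the test accuracy would point to overfitting, while similar values support the conclusion above.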