For this homework assignment, we apply the Support Vector Machine algorithm to the same dataset I used for homework #2, which concerns red-wine quality classification. That dataset was tailored for algorithms such as decision trees or random forests, but I am curious to see how SVM fares on it. The dataset contains 12 columns and 1599 observations.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
dataset <- read_csv("C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 622\\Data622Hw2\\winequality-red.csv")
## Rows: 1599 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): fixed acidity, volatile acidity, citric acid, residual sugar, chlo...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dataset)
## # A tibble: 6 × 12
## `fixed acidity` `volatile acidity` `citric acid` `residual sugar` chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## # ℹ 7 more variables: `free sulfur dioxide` <dbl>,
## # `total sulfur dioxide` <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
According to Kaggle, the classes are ordered and there are far more normal wines than excellent or poor ones, so we will have to stratify the split on wine quality.
The dataset contains no missing (NA) values.
library(visdat)
vis_dat(dataset)
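As a quick numeric cross-check of what vis_dat() shows, the per-column NA counts can be printed directly; a minimal sketch using base R:
## Count missing values in each column; every count is zero for this dataset
colSums(is.na(dataset))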
ggplot(dataset,aes(x = quality)) +
geom_bar() +
ggtitle("Wine Quality Proportion") +
xlab("Wine Quality") +
ylab("Number of Wines") + theme_bw()
Below is a histogram of every column in the dataset: the data are pivoted into long format and facet-wrapped by variable name so all of the distributions can be seen at once.
data_long <- dataset %>%
pivot_longer(colnames(dataset)) %>%
as.data.frame()
## We can now see the distribution of each variable
ggp1 <- ggplot(data_long, aes(x = value)) + # Draw each column as histogram
geom_histogram() +
facet_wrap(~ name, scales = "free")
ggp1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s plot the correlation matrix again; since all of the variables are numeric, we can see which ones are correlated with which.
library(corrplot)
## corrplot 0.92 loaded
corr_matrix <- dataset %>% select_if(is.numeric) %>%
cor(method="pearson", use="pairwise.complete.obs")
corrplot(corr_matrix,type="upper",method = "number")
Some of the variables are positively correlated with one another, while others are negatively correlated.
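To read the plot more precisely, here is a small sketch that lists the strongest pairwise correlations from corr_matrix; the cutoff of |r| > 0.5 is an arbitrary choice for illustration.
## List variable pairs whose absolute Pearson correlation exceeds 0.5
strong <- which(abs(corr_matrix) > 0.5 & upper.tri(corr_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(corr_matrix)[strong[, 1]],
           var2 = colnames(corr_matrix)[strong[, 2]],
           r = round(corr_matrix[strong], 2))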
We split the dataset into a training set and a validation set.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(1235)
## use 75% of the data for training and the remaining 25% for validation
validation_index <- createDataPartition(dataset$quality,p = 0.75, list = FALSE)
## the remaining 25% of the rows are held out as the validation set
validation <- dataset[-validation_index,]
Training <- dataset[validation_index,]
A quick check to confirm that both sets are properly stratified by quality.
validation %>%
tabyl(quality)
## quality n percent
## 3 4 0.01002506
## 4 11 0.02756892
## 5 171 0.42857143
## 6 159 0.39849624
## 7 45 0.11278195
## 8 9 0.02255639
Training %>%
tabyl(quality)
## quality n percent
## 3 6 0.0050000
## 4 42 0.0350000
## 5 510 0.4250000
## 6 479 0.3991667
## 7 154 0.1283333
## 8 9 0.0075000
First, I am going to fit an SVM regression model with a linear kernel.
library(e1071)
# Use eps-regression, since it is the default regression type according to the e1071 documentation.
# trainControl() sets up 5-fold cross-validation, but it is a caret::train() argument and is ignored
# by e1071::svm() (an e1071-native alternative is sketched after the evaluation below).
set.seed(1234)
trCtrl = trainControl(method = 'cv',number = 5)
svmfitreg = svm(quality~.,data = Training, type = "eps-regression",kernel = "linear",trControl = trCtrl)
print(svmfitreg)
##
## Call:
## svm(formula = quality ~ ., data = Training, type = "eps-regression",
## kernel = "linear", trControl = trCtrl)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: linear
## cost: 1
## gamma: 0.09090909
## epsilon: 0.1
##
##
## Number of Support Vectors: 1038
## predict values for the validation set..
y_pred <- predict(svmfitreg,newdata = validation)
head(y_pred)
## 1 2 3 4 5 6
## 5.131795 4.954623 6.100861 5.374676 4.983006 5.193420
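Since quality is an integer score, one optional sanity check (not part of the metrics below, and not in the original analysis) is to round the continuous predictions and see how often they match the recorded quality exactly:
## Fraction of validation wines whose rounded prediction equals the true quality
mean(round(y_pred) == validation$quality)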
Compared with the decision tree from homework 2, the RMSE is lower, but the R-squared is lower as well.
eval_results <- function(true, predicted, df) {
SSE <- sum((predicted - true)^2)
SST <- sum((true - mean(true))^2)
R_square <- 1 - SSE / SST
RMSE = sqrt(SSE/nrow(df))
# Model performance metrics
data.frame(
RMSE = RMSE,
Rsquare = R_square
)
}
eval_results(validation$quality,y_pred,validation)
## RMSE Rsquare
## 1 0.6599542 0.3825836
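As noted above, trControl is a caret::train() setting that e1071::svm() silently ignores. If 5-fold cross-validation of the linear SVM is wanted within e1071 itself, its cross argument provides it; a minimal sketch under that assumption:
## e1071's built-in k-fold CV: summary() then also reports the cross-validated total MSE
svmfit_cv <- svm(quality ~ ., data = Training, type = "eps-regression",
                 kernel = "linear", cross = 5)
summary(svmfit_cv)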
Let’s see how the radial kernel performs.
library(e1071)
# Use eps-regression, since it is the default regression type according to the e1071 documentation.
svmfitreg2 = svm(quality~.,data = Training, type = "eps-regression",kernel = "radial")
print(svmfitreg2)
##
## Call:
## svm(formula = quality ~ ., data = Training, type = "eps-regression",
## kernel = "radial")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.09090909
## epsilon: 0.1
##
##
## Number of Support Vectors: 972
y_pred2 <- predict(svmfitreg2, newdata = validation)
head(y_pred2)
## 1 2 3 4 5 6
## 5.023523 5.247067 5.463963 5.023523 5.084557 5.367275
Note the warning attached to the evaluation below: the prediction vector and the validation labels have different lengths, so the RMSE and R-squared reported here are not meaningful and should not be read as the radial kernel performing worse. The tuned radial model further down gives a fair comparison.
eval_results(validation$quality,y_pred2,validation)
## Warning in predicted - true: longer object length is not a multiple of shorter
## object length
## RMSE Rsquare
## 1 1.744033 -3.311813
In order to improve the performance of the support vector regression, we need to select better hyperparameters. Here we tune the epsilon and cost parameters; cost controls the bias-variance trade-off, so tuning it helps avoid overfitting.
## Tuning: if the cost value is too high the model may overfit; if it is too small it may underfit. The tuned results are stored in tmodel.
set.seed(1234)
tmodel <- tune(svm,quality~.,data = Training,ranges = list(epsilon = seq(0,1,0.1),cost = 2^(2:7)))
## Plot the tuning results: in plot.tune, darker regions correspond to lower cross-validated error (better performance)
## Lower epsilon and cost values seem to give the best error here
plot(tmodel)
## Store the best model in mymodel
bettermodel <- tmodel$best.model
summary(bettermodel)
##
## Call:
## best.tune(METHOD = svm, train.x = quality ~ ., data = Training, ranges = list(epsilion = seq(0,
## 1, 0.1), cost = 2^(2:7)))
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 4
## gamma: 0.09090909
## epsilon: 0.1
##
##
## Number of Support Vectors: 992
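Besides best.model, the tune object keeps the full grid of results, so the chosen parameters and their cross-validated error can be inspected directly; a short sketch:
## Best parameter combination, its cross-validated error, and the first rows of the full grid
tmodel$best.parameters
tmodel$best.performance
head(tmodel$performances)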
## Use bettermodel since it holds the best parameter combination found by tune()
predi <- predict(bettermodel,validation)
## The RMSE decreased slightly and the R-squared improved a bit
eval_results(validation$quality,predi,validation)
## RMSE Rsquare
## 1 0.621776 0.4519521
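The grid above varied only epsilon and cost. For the radial kernel, gamma is the other main tuning knob; a hypothetical wider search (the gamma values below are arbitrary illustrations, and the grid is kept coarse to limit runtime) could look like this:
## Sketch: tune epsilon, cost, and gamma together for the radial kernel
set.seed(1234)
tmodel_wide <- tune(svm, quality ~ ., data = Training,
                    ranges = list(epsilon = seq(0, 1, 0.2),
                                  cost = 2^(2:5),
                                  gamma = c(0.01, 0.1, 1)))
tmodel_wide$best.parameters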
We now convert the target variable quality into a categorical column and apply an SVM classifier to it.
## Wines with quality above 6.5 (i.e., 7 or 8) are labeled Good; everything else is Bad
dataset$Quality <- ifelse(dataset$quality > 6.5, "Good","Bad")
Dataset <- select(dataset,-quality)
Dataset$Quality <- as.factor(Dataset$Quality)
library(caret)
set.seed(1237)
## use 75% of the data for training and the remaining 25% for validation
validation_index2 <- createDataPartition(Dataset$Quality,p = 0.75, list = FALSE)
## the remaining 25% of the rows are held out as the validation set
validation2 <- Dataset[-validation_index2,]
Training2 <- Dataset[validation_index2,]
Check that the class proportions are similar in both samples.
Training2 %>%
tabyl(Quality)
## Quality n percent
## Bad 1037 0.8641667
## Good 163 0.1358333
validation2%>%
tabyl(Quality)
## Quality n percent
## Bad 345 0.8646617
## Good 54 0.1353383
library(e1071)
# trainControl() defines 5-fold CV for caret::train(); note that e1071::svm() below does not use it
set.seed(1237)
trCtrl = trainControl(method = 'cv',number = 5)
svmfitreg2 = svm(Quality~.,data = Training2,type = "C-classification",kernel = "linear",trControl = trCtrl)
summary(svmfitreg2)
##
## Call:
## svm(formula = Quality ~ ., data = Training2, type = "C-classification",
## kernel = "linear", trControl = trCtrl)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 370
##
## ( 207 163 )
##
##
## Number of Classes: 2
##
## Levels:
## Bad Good
y_pred3 <- predict(svmfitreg2,validation2)
head(y_pred3)
## 1 2 3 4 5 6
## Bad Bad Bad Bad Bad Bad
## Levels: Bad Good
Tab <- table(Predicted = y_pred3, Actual = validation2$Quality)
confusionMatrix(Tab)
## Confusion Matrix and Statistics
##
## Actual
## Predicted Bad Good
## Bad 345 54
## Good 0 0
##
## Accuracy : 0.8647
## 95% CI : (0.8271, 0.8967)
## No Information Rate : 0.8647
## P-Value [Acc > NIR] : 0.5362
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 5.498e-13
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.8647
## Neg Pred Value : NaN
## Prevalence : 0.8647
## Detection Rate : 0.8647
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Bad
##
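The linear classifier never predicts Good: accuracy equals the no-information rate and Kappa is 0, which is what an 86/14 class split tends to produce. One common remedy is to weight the minority class more heavily; e1071::svm() accepts a class.weights argument. A minimal sketch (the weight of 5 for Good is an arbitrary illustration, not a tuned value):
## Up-weight the minority "Good" class to counter the class imbalance
svmfit_weighted <- svm(Quality ~ ., data = Training2, type = "C-classification",
                       kernel = "linear", class.weights = c(Bad = 1, Good = 5))
confusionMatrix(table(Predicted = predict(svmfit_weighted, validation2),
                      Actual = validation2$Quality))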
Let’s try a radial kernel and see if it does better.
svmfitreg3 <- svm(Quality~.,data = Training2,type = "C-classification",kernel = "radial",trControl = trCtrl)
summary(svmfitreg3)
##
## Call:
## svm(formula = Quality ~ ., data = Training2, type = "C-classification",
## kernel = "radial", trControl = trCtrl)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 386
##
## ( 229 157 )
##
##
## Number of Classes: 2
##
## Levels:
## Bad Good
y_pred4 <- predict(svmfitreg3,validation2)
head(y_pred4)
## 1 2 3 4 5 6
## Bad Bad Bad Bad Bad Bad
## Levels: Bad Good
Tab1 <- table(y_pred4,validation2$Quality)
confusionMatrix(Tab1)
## Confusion Matrix and Statistics
##
##
## y_pred4 Bad Good
## Bad 339 41
## Good 6 13
##
## Accuracy : 0.8822
## 95% CI : (0.8465, 0.9122)
## No Information Rate : 0.8647
## P-Value [Acc > NIR] : 0.1711
##
## Kappa : 0.3074
##
## Mcnemar's Test P-Value : 7.071e-07
##
## Sensitivity : 0.9826
## Specificity : 0.2407
## Pos Pred Value : 0.8921
## Neg Pred Value : 0.6842
## Prevalence : 0.8647
## Detection Rate : 0.8496
## Detection Prevalence : 0.9524
## Balanced Accuracy : 0.6117
##
## 'Positive' Class : Bad
##
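Note that confusionMatrix() treats Bad as the positive class by default (it takes the first factor level), so the high sensitivity mainly reflects how well the common class is identified. To read the same table from the perspective of detecting good wines, the positive class can be set explicitly:
## Same confusion matrix, but with "Good" as the positive class
confusionMatrix(Tab1, positive = "Good")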
## Tuning: if the cost value is too high the model may overfit; if it is too small it may underfit. The tuned results are stored in tmodel2.
## (For C-classification the epsilon parameter is ignored, so this search effectively varies cost only.)
tmodel2 <- tune(svm,Quality~.,data = Training2,ranges = list(epsilon = seq(0,1,0.1),cost = 2^(2:7)))
plot(tmodel2)
bettermodel2 <- tmodel2$best.model
summary(bettermodel2)
##
## Call:
## best.tune(METHOD = svm, train.x = Quality ~ ., data = Training2,
## ranges = list(epsilion = seq(0, 1, 0.1), cost = 2^(2:7)))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 8
##
## Number of Support Vectors: 365
##
## ( 223 142 )
##
##
## Number of Classes: 2
##
## Levels:
## Bad Good
## Evaluate the tuned model on the validation set
predi2 <- predict(bettermodel2,validation2)
head(predi2)
## 1 2 3 4 5 6
## Bad Bad Bad Bad Bad Bad
## Levels: Bad Good
## The accuracy improves to about 89%
Tab2 <- table(predi2,validation2$Quality)
confusionMatrix(Tab2)
## Confusion Matrix and Statistics
##
##
## predi2 Bad Good
## Bad 332 29
## Good 13 25
##
## Accuracy : 0.8947
## 95% CI : (0.8604, 0.9231)
## No Information Rate : 0.8647
## P-Value [Acc > NIR] : 0.04272
##
## Kappa : 0.486
##
## Mcnemar's Test P-Value : 0.02064
##
## Sensitivity : 0.9623
## Specificity : 0.4630
## Pos Pred Value : 0.9197
## Neg Pred Value : 0.6579
## Prevalence : 0.8647
## Detection Rate : 0.8321
## Detection Prevalence : 0.9048
## Balanced Accuracy : 0.7126
##
## 'Positive' Class : Bad
##
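For reference, the validation-set figures reported in the confusion matrices above can be collected into one small table (numbers copied from the output, not recomputed):
## Validation accuracy and Kappa of the three classifiers, as reported above
data.frame(model = c("linear kernel", "radial kernel", "tuned radial kernel"),
           accuracy = c(0.8647, 0.8822, 0.8947),
           kappa = c(0.000, 0.3074, 0.486))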
In both the regression and the classification settings, the tuned radial-kernel SVM outperformed the linear kernel on the validation data.