For this homework assignment, we apply the Support Vector Machine (SVM) algorithm to the dataset I used for homework #2, a red wine quality dataset. The dataset was tailored for algorithms such as decision trees or random forests, but I am curious to see how SVM fares on it. It contains 12 columns and 1,599 observations.

0.1 Importing the Dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
dataset <- read_csv("C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 622\\Data622Hw2\\winequality-red.csv")
## Rows: 1599 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): fixed acidity, volatile acidity, citric acid, residual sugar, chlo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dataset)
## # A tibble: 6 Ă— 12
##   `fixed acidity` `volatile acidity` `citric acid` `residual sugar` chlorides
##             <dbl>              <dbl>         <dbl>            <dbl>     <dbl>
## 1             7.4               0.7           0                 1.9     0.076
## 2             7.8               0.88          0                 2.6     0.098
## 3             7.8               0.76          0.04              2.3     0.092
## 4            11.2               0.28          0.56              1.9     0.075
## 5             7.4               0.7           0                 1.9     0.076
## 6             7.4               0.66          0                 1.8     0.075
## # ℹ 7 more variables: `free sulfur dioxide` <dbl>,
## #   `total sulfur dioxide` <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <dbl>

According to Kaggle, the classes are ordered and there are far more normal wines than excellent or poor ones, so we will have to stratify the train/test split on wine quality.

0.2 Initial Data Exploration:

The dataset contains no missing (NA) values.

library(visdat)

vis_dat(dataset)
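
As a quick numeric check alongside the vis_dat() plot (a small sketch, not part of the original run), the missing values can also be counted directly:

## Count NA values per column; every entry should be zero
colSums(is.na(dataset))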

ggplot(dataset,aes(x = quality)) +
  geom_bar() +
  ggtitle("Wine Quality Proportion") +
  xlab("Wine Quality") +
  ylab("Number of Wines") + theme_bw() 
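
The bar chart makes the imbalance visible; the exact counts and proportions, which motivate the stratified split later on, can also be tabulated (a minimal sketch using dplyr, which is already loaded):

## Tabulate the quality classes and their share of the data
dataset %>%
  count(quality) %>%
  mutate(prop = n / sum(n))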

0.2.1 Histograms of the Data

I have made a histogram of every column in the dataset: the data are pivoted into long format and facet-wrapped by column name, so all of the distributions can be seen at once.

data_long <- dataset %>%
  pivot_longer(colnames(dataset)) %>%
  as.data.frame()
## Plot every column's distribution in one faceted figure
ggp1 <- ggplot(data_long, aes(x = value)) +    # Draw each column as histogram
  geom_histogram() + 
  facet_wrap(~ name, scales = "free")
ggp1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

0.3 Correlation Plot:

Let’s plot the correlation matrix again; since all of the variables are numeric, we can see which variables are correlated with which.

library(corrplot)
## corrplot 0.92 loaded
corr_matrix <- dataset %>% select_if(is.numeric) %>% 
  cor(method="pearson", use="pairwise.complete.obs")

corrplot(corr_matrix,type="upper",method = "number")

Some variables are strongly positively correlated with one another (for example, fixed acidity with citric acid and density), while others, such as fixed acidity and pH, are negatively correlated.
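
To make that statement more concrete, the correlation matrix can be reshaped into a ranked list of variable pairs (a sketch, reusing the corr_matrix object from above and introducing a helper object corr_pairs):

## Flatten the correlation matrix and sort the pairs by strength
corr_pairs <- as.data.frame(as.table(corr_matrix)) %>%
  rename(var1 = Var1, var2 = Var2, corr = Freq) %>%
  filter(as.integer(var1) < as.integer(var2)) %>%   # keep each pair once
  arrange(desc(abs(corr)))
head(corr_pairs)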

0.4 Split The Data into Testing and Training Set

We split the dataset into a training set and a testing (validation) set.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(1235)

## use 75% of the data for training and the remaining 25% for validation
validation_index <- createDataPartition(dataset$quality,p = 0.75, list = FALSE)

## the held-out 25% of the dataset becomes the validation set
validation <- dataset[-validation_index,]

Training <- dataset[validation_index,]

A quick check to see whether the two sets are properly stratified by quality.

validation %>%
  tabyl(quality)
##  quality   n    percent
##        3   4 0.01002506
##        4  11 0.02756892
##        5 171 0.42857143
##        6 159 0.39849624
##        7  45 0.11278195
##        8   9 0.02255639
Training %>%
  tabyl(quality)
##  quality   n   percent
##        3   6 0.0050000
##        4  42 0.0350000
##        5 510 0.4250000
##        6 479 0.3991667
##        7 154 0.1283333
##        8   9 0.0075000

0.5 Implementing the SVM Algorithm

0.5.1 SVM Regression with a Linear Kernel

First, I fit an SVM regression model with a linear kernel.

library(e1071)
# Use eps-regression since it is the default regression type according to the documentation

# Set the cross-validation using kfolds = 5
set.seed(1234)
trCtrl = trainControl(method = 'cv',number = 5)

svmfitreg = svm(quality~.,data = Training, type = "eps-regression",kernel = "linear",trControl = trCtrl)
print(svmfitreg)
## 
## Call:
## svm(formula = quality ~ ., data = Training, type = "eps-regression", 
##     kernel = "linear", trControl = trCtrl)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.09090909 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  1038
## predict values for the validation set.. 
y_pred <- predict(svmfitreg,newdata = validation)
head(y_pred)
##        1        2        3        4        5        6 
## 5.131795 4.954623 6.100861 5.374676 4.983006 5.193420
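
Note that trainControl() belongs to caret: e1071::svm() simply ignores the trControl argument passed above, so no cross-validation actually happens in that call. If cross-validation is wanted from e1071 itself, its cross argument does this directly (a sketch with a new object name, svmfit_cv, not part of the original run):

## Ask e1071 for its own 5-fold cross-validation; summary() then
## reports the cross-validated error estimates
svmfit_cv <- svm(quality ~ ., data = Training, type = "eps-regression",
                 kernel = "linear", cross = 5)
summary(svmfit_cv)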

0.5.1.1 Regression Metrics for the Linear Kernel

The RMSE is lower than the decision tree from homework #2, but the R-squared is lower as well.

eval_results <- function(true, predicted, df) {
  SSE <- sum((predicted - true)^2)
  SST <- sum((true - mean(true))^2)
  R_square <- 1 - SSE / SST
  RMSE = sqrt(SSE/nrow(df))
  
# Model performance metrics
  data.frame(
    RMSE = RMSE,
    Rsquare = R_square
  )
  }

  
eval_results(validation$quality,y_pred,validation)
##        RMSE   Rsquare
## 1 0.6599542 0.3825836
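
caret's postResample() provides an independent cross-check of these numbers (a sketch; note that caret computes R-squared as the squared correlation between predictions and observations, so it can differ slightly from the 1 - SSE/SST definition used above):

## Cross-check RMSE, R-squared and MAE with caret
postResample(pred = y_pred, obs = validation$quality)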

0.5.2 SVM Regression with a Radial Kernel

Now let's see how the radial kernel performs.

library(e1071)
# Use eps-regression, the default regression type according to the documentation
svmfitreg2 = svm(quality~.,data = Training, type = "eps-regression",kernel = "radial")
print(svmfitreg2)
## 
## Call:
## svm(formula = quality ~ ., data = Training, type = "eps-regression", 
##     kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.09090909 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  972
y_pred2 <- predict(svmfitreg2,newata = validation)
head(y_pred2)
##        1        2        3        4        5        6 
## 5.023523 5.247067 5.463963 5.023523 5.084557 5.367275

0.5.2.1 Regression Metrics for The Radial Kernel

At first glance the radial kernel looks far worse, but the warning below gives the real reason away: the predict() call above misspells newdata as newata, so predict() falls back to returning the fitted values for the 1,200 training rows rather than predictions for the 399 validation rows. The RMSE and R-squared below therefore compare vectors of different lengths and should not be trusted.

eval_results(validation$quality,y_pred2,validation)
## Warning in predicted - true: longer object length is not a multiple of shorter
## object length
##       RMSE   Rsquare
## 1 1.744033 -3.311813
eval_results
## function(true, predicted, df) {
##   SSE <- sum((predicted - true)^2)
##   SST <- sum((true - mean(true))^2)
##   R_square <- 1 - SSE / SST
##   RMSE = sqrt(SSE/nrow(df))
##   
## # Model performance metrics
##   data.frame(
##     RMSE = RMSE,
##     Rsquare = R_square
##   )
##   }
## <bytecode: 0x000001795c94b738>
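
For reference, the corrected call would predict on the validation set and could then be scored with the same helper (a sketch with a new object name, y_pred2_fixed; not run here, so no output is shown):

## Correct argument name: newdata, not newata
y_pred2_fixed <- predict(svmfitreg2, newdata = validation)
eval_results(validation$quality, y_pred2_fixed, validation)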

0.5.3 Model Tuning

To improve the performance of the support vector regression we need to select better hyperparameters; the intention here is to tune the epsilon and cost parameters, which control the width of the error-insensitive margin and the penalty for errors, and so help avoid over- or under-fitting. (Note that the ranges list below misspells epsilon as epsilion, an argument svm() simply ignores, so in practice only cost is searched and epsilon stays at its 0.1 default, as the tuned model summary confirms.)

## Tune the model: a cost that is too high may overfit, while one that is too small may underfit. The result is stored in tmodel.
set.seed(1234)
tmodel <- tune(svm,quality~.,data = Training,ranges = list(epsilion = seq(0,1,0.1),cost = 2^(2:7)))
## Plot the tuning results: darker regions correspond to lower cross-validated error (RMSE)
## The smallest cost in the grid (4) does best; the epsilion axis has no real effect (see the note above)
plot(tmodel)

## Store the best model in bettermodel
bettermodel <- tmodel$best.model

summary(bettermodel)
## 
## Call:
## best.tune(METHOD = svm, train.x = quality ~ ., data = Training, ranges = list(epsilion = seq(0, 
##     1, 0.1), cost = 2^(2:7)))
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  4 
##       gamma:  0.09090909 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  992
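
With the parameter spelled correctly, the grid search would actually vary both epsilon and cost (a sketch with a new object name, tmodel_fixed; not run here):

## Corrected ranges: epsilon (not "epsilion") is the argument svm() recognises
tmodel_fixed <- tune(svm, quality ~ ., data = Training,
                     ranges = list(epsilon = seq(0, 1, 0.1), cost = 2^(2:7)))
summary(tmodel_fixed)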

0.5.3.1 Test the Tuned Model on the Validation Data

## Use bettermodel since it holds the best parameter combination
predi <- predict(bettermodel,validation)
## The RMSE is slightly lower and the R-squared a little higher than with the linear kernel
eval_results(validation$quality,predi,validation)
##       RMSE   Rsquare
## 1 0.621776 0.4519521
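
The linear and tuned radial metrics can also be laid side by side in one small table (a sketch using dplyr::bind_rows, reusing the y_pred and predi objects from above):

## Combine the per-model metrics into one table, labelled by model
bind_rows(
  linear       = eval_results(validation$quality, y_pred, validation),
  tuned_radial = eval_results(validation$quality, predi, validation),
  .id = "model"
)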

0.6 Classification with SVM

We now convert the target variable quality into a categorical column and then apply SVM classification to it.

## Wines with quality above 6.5 (i.e. 7 or 8) are labelled Good; the rest are Bad
dataset$Quality <- ifelse(dataset$quality > 6.5, "Good","Bad")

Dataset <- select(dataset,-quality)

Dataset$Quality <- as.factor(Dataset$Quality)

0.6.1 Create Training and Testing Sets

library(caret)
set.seed(1237)

## use 75% of the data for training and the remaining 25% for validation
validation_index2 <- createDataPartition(Dataset$Quality,p = 0.75, list = FALSE)

## the held-out 25% of the dataset becomes the validation set
validation2 <- Dataset[-validation_index2,]

Training2 <- Dataset[validation_index2,]

A quick check that the class proportions are similar in both samples.

Training2 %>%
  tabyl(Quality)
##  Quality    n   percent
##      Bad 1037 0.8641667
##     Good  163 0.1358333
validation2%>%
  tabyl(Quality)
##  Quality   n   percent
##      Bad 345 0.8646617
##     Good  54 0.1353383

0.6.2 Fitting the SVM Classifier

library(e1071)

# Set the cross-validation using kfolds = 5
set.seed(1237)
trCtrl = trainControl(method = 'cv',number = 5)

svmfitreg2 = svm(Quality~.,data = Training2,type = "C-classification",kernel = "linear",trControl = trCtrl)
summary(svmfitreg2)
## 
## Call:
## svm(formula = Quality ~ ., data = Training2, type = "C-classification", 
##     kernel = "linear", trControl = trCtrl)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  370
## 
##  ( 207 163 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Bad Good
y_pred3 <- predict(svmfitreg2,validation2)
head(y_pred3)
##   1   2   3   4   5   6 
## Bad Bad Bad Bad Bad Bad 
## Levels: Bad Good
Tab <- table(Predicted = y_pred3, Actual = validation2$Quality)

confusionMatrix(Tab)
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted Bad Good
##      Bad  345   54
##      Good   0    0
##                                           
##                Accuracy : 0.8647          
##                  95% CI : (0.8271, 0.8967)
##     No Information Rate : 0.8647          
##     P-Value [Acc > NIR] : 0.5362          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 5.498e-13       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8647          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8647          
##          Detection Rate : 0.8647          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Bad             
## 
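
The linear classifier never predicts Good: with roughly 86% of wines labelled Bad, always answering Bad already matches the no-information rate, which is why specificity and kappa are both 0. One way to counteract the imbalance (not explored further in this assignment) is to up-weight the minority class through svm's class.weights argument; a rough sketch with a hypothetical object name and an assumed weight of about 6 (the approximate Bad-to-Good ratio):

## Hypothetical example: give the rare "Good" class more weight,
## roughly in proportion to the class imbalance
svmfit_weighted <- svm(Quality ~ ., data = Training2, type = "C-classification",
                       kernel = "linear",
                       class.weights = c(Bad = 1, Good = 6))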

Let’s try a radial kernel and see whether it does better.

svmfitreg3 <- svm(Quality~.,data = Training2,type = "C-classification",kernel = "radial",trControl = trCtrl)
summary(svmfitreg3)
## 
## Call:
## svm(formula = Quality ~ ., data = Training2, type = "C-classification", 
##     kernel = "radial", trControl = trCtrl)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  386
## 
##  ( 229 157 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Bad Good
y_pred4 <- predict(svmfitreg3,validation2)
head(y_pred4)
##   1   2   3   4   5   6 
## Bad Bad Bad Bad Bad Bad 
## Levels: Bad Good
Tab1 <- table(y_pred4,validation2$Quality)
confusionMatrix(Tab1)
## Confusion Matrix and Statistics
## 
##        
## y_pred4 Bad Good
##    Bad  339   41
##    Good   6   13
##                                           
##                Accuracy : 0.8822          
##                  95% CI : (0.8465, 0.9122)
##     No Information Rate : 0.8647          
##     P-Value [Acc > NIR] : 0.1711          
##                                           
##                   Kappa : 0.3074          
##                                           
##  Mcnemar's Test P-Value : 7.071e-07       
##                                           
##             Sensitivity : 0.9826          
##             Specificity : 0.2407          
##          Pos Pred Value : 0.8921          
##          Neg Pred Value : 0.6842          
##              Prevalence : 0.8647          
##          Detection Rate : 0.8496          
##    Detection Prevalence : 0.9524          
##       Balanced Accuracy : 0.6117          
##                                           
##        'Positive' Class : Bad             
## 

0.6.2.1 Model Tuning

## Tune the classifier: again, a cost that is too high may overfit and one that is too small may underfit. (epsilon only applies to regression, and it is misspelled as epsilion here anyway, so effectively only cost is searched.)
tmodel2 <- tune(svm,Quality~.,data = Training2,ranges = list(epsilion = seq(0,1,0.1),cost = 2^(2:7)))
plot(tmodel2)

bettermodel2 <- tmodel2$best.model
summary(bettermodel2)
## 
## Call:
## best.tune(METHOD = svm, train.x = Quality ~ ., data = Training2, 
##     ranges = list(epsilion = seq(0, 1, 0.1), cost = 2^(2:7)))
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  8 
## 
## Number of Support Vectors:  365
## 
##  ( 223 142 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Bad Good
## Evaluate the tuned model on the validation set
predi2 <- predict(bettermodel2,validation2)
head(predi2)
##   1   2   3   4   5   6 
## Bad Bad Bad Bad Bad Bad 
## Levels: Bad Good
## The accuracy improves to about 89.5%
Tab2 <- table(predi2,validation2$Quality)

confusionMatrix(Tab2)
## Confusion Matrix and Statistics
## 
##       
## predi2 Bad Good
##   Bad  332   29
##   Good  13   25
##                                           
##                Accuracy : 0.8947          
##                  95% CI : (0.8604, 0.9231)
##     No Information Rate : 0.8647          
##     P-Value [Acc > NIR] : 0.04272         
##                                           
##                   Kappa : 0.486           
##                                           
##  Mcnemar's Test P-Value : 0.02064         
##                                           
##             Sensitivity : 0.9623          
##             Specificity : 0.4630          
##          Pos Pred Value : 0.9197          
##          Neg Pred Value : 0.6579          
##              Prevalence : 0.8647          
##          Detection Rate : 0.8321          
##    Detection Prevalence : 0.9048          
##       Balanced Accuracy : 0.7126          
##                                           
##        'Positive' Class : Bad             
## 
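
Since confusionMatrix() treated Bad as the positive class here, the 0.46 specificity is really the recall on the Good wines, arguably the class we care most about. Setting positive = "Good" reports the metrics from that point of view (a sketch, reusing Tab2 from above):

## Report sensitivity/precision with "Good" as the positive class
confusionMatrix(Tab2, positive = "Good")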

0.7 Conclusion:

For both regression and classification, the tuned radial SVM performed better on the validation data than the linear kernel: in regression it achieved a lower RMSE (0.62 vs. 0.66) and a higher R-squared (0.45 vs. 0.38), and in classification it reached roughly 89% accuracy with far better specificity and kappa than the linear model, which simply predicted every wine as Bad.