Read the following articles: https://www.hindawi.com/journals/complexity/2021/5550344/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/

library(tidyverse)
library(cowplot)
library(lattice)
library(reshape2)
library(corrplot)
library(caTools)
library(caret)
library(Hmisc)
library(e1071)

Data

Data Content and Structure

For Homework 2 I chose to work with the Wine Quality data set, which can be accessed from http://archive.ics.uci.edu/ml/datasets/Wine+Quality. It contains two subsets, for red and white wine respectively. For the purposes of this analysis, I will be working with the red wine subset. The goal of the data set is to model wine quality based on physicochemical tests. It contains 12 attributes, listed below.

1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)

For this analysis, I will attempt to model the quality of the wine based on different combinations of attributes.

Data Setup

wine <- as.data.frame(read.csv('winequality-red.csv'))
wine[1,]
## [1] "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5"
colnames(wine) <- c('col')

wine[c('Fixed_Acidity', 'Volatile_Acidity', 'Citric_Acid', 'Residual_Sugar', 'Chlorides', 'Free_Sulfer_Dioxide', 'Total_Sulfur_Dioxide', 'Density', 'pH', 'Sulphates', 'Alchohol', 'Quality')] <- str_split_fixed(wine$col, ';', 12)

wine <- wine %>% select(!col)

head(wine)
##   Fixed_Acidity Volatile_Acidity Citric_Acid Residual_Sugar Chlorides
## 1           7.4              0.7           0            1.9     0.076
## 2           7.8             0.88           0            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4              0.7           0            1.9     0.076
## 6           7.4             0.66           0            1.8     0.075
##   Free_Sulfer_Dioxide Total_Sulfur_Dioxide Density   pH Sulphates Alchohol
## 1                  11                   34  0.9978 3.51      0.56      9.4
## 2                  25                   67  0.9968  3.2      0.68      9.8
## 3                  15                   54   0.997 3.26      0.65      9.8
## 4                  17                   60   0.998 3.16      0.58      9.8
## 5                  11                   34  0.9978 3.51      0.56      9.4
## 6                  13                   40  0.9978 3.51      0.56      9.4
##   Quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
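
The file is semicolon-delimited, which is why read.csv returned a single column that then had to be split. As a sketch of an alternative, the delimiter can be passed to read.csv directly. The example below uses three rows copied from the output above as inline text so that it runs without the file; the header line is an assumption based on the attribute list, and the column names are the ones defined earlier in this report.

```r
# Column names as used throughout this report (spelling kept as-is).
wine_cols <- c('Fixed_Acidity', 'Volatile_Acidity', 'Citric_Acid',
               'Residual_Sugar', 'Chlorides', 'Free_Sulfer_Dioxide',
               'Total_Sulfur_Dioxide', 'Density', 'pH', 'Sulphates',
               'Alchohol', 'Quality')

# Inline sample standing in for 'winequality-red.csv' (assumption:
# the real file has a header row followed by rows like these).
sample_text <- paste(
  'fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality',
  '7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5',
  '7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5',
  '11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6',
  sep = '\n')

# sep = ';' splits the fields on read; col.names replaces the header names.
wine_alt <- read.csv(text = sample_text, sep = ';', col.names = wine_cols)
str(wine_alt)
```

With the real file, the same call would take the file name in place of the text argument. Read this way, the columns would already be numeric, so the later type conversion would not be needed.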

Exploratory Analysis

Glimpse

As the first step of exploratory analysis we will take a glimpse at the data.

head(wine,3)
##   Fixed_Acidity Volatile_Acidity Citric_Acid Residual_Sugar Chlorides
## 1           7.4              0.7           0            1.9     0.076
## 2           7.8             0.88           0            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
##   Free_Sulfer_Dioxide Total_Sulfur_Dioxide Density   pH Sulphates Alchohol
## 1                  11                   34  0.9978 3.51      0.56      9.4
## 2                  25                   67  0.9968  3.2      0.68      9.8
## 3                  15                   54   0.997 3.26      0.65      9.8
##   Quality
## 1       5
## 2       5
## 3       5

Convert Column Datatype

Because the columns were created by splitting strings, they are all of type character. They will be converted to type numeric.

wine$Fixed_Acidity <- as.numeric(wine$Fixed_Acidity)
wine$Volatile_Acidity <- as.numeric(wine$Volatile_Acidity)
wine$Citric_Acid <- as.numeric(wine$Citric_Acid)
wine$Residual_Sugar <- as.numeric(wine$Residual_Sugar)
wine$Chlorides <- as.numeric(wine$Chlorides)
wine$Free_Sulfer_Dioxide <- as.numeric(wine$Free_Sulfer_Dioxide)
wine$Total_Sulfur_Dioxide <- as.numeric(wine$Total_Sulfur_Dioxide)
wine$Density <- as.numeric(wine$Density)
wine$pH <- as.numeric(wine$pH)
wine$Sulphates <- as.numeric(wine$Sulphates)
wine$Alchohol <- as.numeric(wine$Alchohol)
wine$Quality <- as.numeric(wine$Quality)
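
The twelve assignments above can also be written as a single step. The sketch below shows one base R way to do it on a small toy data frame (wine_chr is a hypothetical stand-in for wine before conversion); it is an alternative form, not the code used to produce the results in this report.

```r
# Toy character data frame standing in for `wine` before conversion.
wine_chr <- data.frame(pH = c('3.51', '3.2'),
                       Alchohol = c('9.4', '9.8'),
                       Quality = c('5', '5'),
                       stringsAsFactors = FALSE)

# Assigning into wine_chr[] keeps the data frame structure while
# replacing every column with its numeric version.
wine_chr[] <- lapply(wine_chr, as.numeric)
sapply(wine_chr, class)
```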

Missing Values (NA)

In order to make sure we have complete data we will do a check for any missing values.

sum(is.na(wine))
## [1] 0
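
The check above gives one total for the whole data frame. If that total were ever non-zero, a per-column count would show where the gaps are; this also matters after as.numeric, which turns strings it cannot parse into NA. A minimal sketch on a toy data frame (hypothetical values, not the wine data):

```r
# Toy data frame with one deliberately missing value.
toy <- data.frame(pH = c(3.51, NA, 3.26), Quality = c(5, 5, 6))

# is.na() gives a logical matrix; colSums() counts TRUE values per column.
na_per_column <- colSums(is.na(toy))
na_per_column
```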

Now that we have confirmed that no missing values exist in the data and have converted the columns to numeric, we can start with the statistical and visual summaries.

Summary

Numerical

summary(wine)
##  Fixed_Acidity   Volatile_Acidity  Citric_Acid    Residual_Sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    Chlorides       Free_Sulfer_Dioxide Total_Sulfur_Dioxide    Density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          Sulphates         Alchohol        Quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

Boxplot

#create boxplot
meltD <- melt(wine)
p <- ggplot(meltD, aes(factor(variable), value)) 
p + geom_boxplot() + facet_wrap(~variable, scales="free")

Histogram

hist.data.frame(wine)

Correlation Plot

corrplot(cor(wine),
         method = 'circle', order = 'alphabet', type = 'lower', diag = FALSE, number.cex = 0.75, tl.cex = 0.5,  col=colorRampPalette(c("blue","white","red"))(200))

Models

For the purposes of this analysis, I will be creating an SVM model with ‘Quality’ as the dependent variable and ‘Alcohol’, ‘pH’, ‘Density’, and ‘Citric_Acid’ as the independent variables.

Tracker

The tracker data frame will contain the name, RMSE, and R-squared for each model.

tracker <- data.frame(matrix(vector(), 0, 3,
                dimnames=list(c(), c("Name", "RMSE", "R-Squared"))),
                stringsAsFactors=F)
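
The tracker starts out empty. As a sketch of how a row might be appended once a model has been scored, the example below repeats the definition and adds the SVM figures reported later in this report. The name new_row is hypothetical.

```r
# Empty tracker, defined as in the block above.
tracker <- data.frame(matrix(vector(), 0, 3,
                dimnames=list(c(), c("Name", "RMSE", "R-Squared"))),
                stringsAsFactors=F)

# data.frame() sanitises "R-Squared" to "R.Squared", so the new row
# uses that name. The values are the SVM results reported in this report.
new_row <- data.frame(Name = "SVM", RMSE = 0.6779834, R.Squared = 0.1725858)
tracker <- rbind(tracker, new_row)
tracker
```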

Data Split

The dataset is split into training and testing sets at an 85:15 ratio.

set.seed(123)
train_ind <- sample(seq_len(nrow(wine)), size = floor(0.85 * nrow(wine)))

train <- wine[train_ind, ]
test <- wine[-train_ind, ]
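
A quick sanity check on a split like this is to confirm the subset sizes and that no row appears in both sets. The sketch below does so on a toy data frame of 100 rows (an assumption for illustration; toy, train_toy, and test_toy are hypothetical names), using the same sampling approach as above.

```r
# Toy data frame standing in for `wine` (assumption: 100 rows).
toy <- data.frame(id = 1:100, Quality = rep(3:7, times = 20))

set.seed(123)
train_ind <- sample(seq_len(nrow(toy)), size = floor(0.85 * nrow(toy)))
train_toy <- toy[train_ind, ]
test_toy <- toy[-train_ind, ]

# The two subsets should be disjoint and together cover every row.
c(train = nrow(train_toy), test = nrow(test_toy))
length(intersect(train_toy$id, test_toy$id))
```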

SVM Model

svm <- svm(Quality ~ Alchohol + pH + Density + Citric_Acid,
                 data=train,
                 kernel="polynomial",
                 scale=FALSE)

svm
## 
## Call:
## svm(formula = Quality ~ Alchohol + pH + Density + Citric_Acid, data = train, 
##     kernel = "polynomial", scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##       gamma:  0.25 
##      coef.0:  0 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  1141
pred <- predict(svm, newdata=test)

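# RMSE: square root of the mean squared prediction error on the test set
rmse <- sqrt(mean((test$Quality - pred)^2))
# R-squared: squared Pearson correlation between observed and predicted values
rs <- (cor(test$Quality, pred))^2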

data.frame('RMSE' = rmse, 'R-Squared' = rs)
##        RMSE R.Squared
## 1 0.6779834 0.1725858
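
The two metrics can be wrapped in small helper functions so that every model is scored the same way. The sketch below uses hypothetical function names and toy values, not the wine data. It also includes a second common definition of R-squared, 1 - SS_res / SS_tot, which is not what this report computes and can differ from the squared correlation on held-out data.

```r
# Root mean squared error between observed and predicted values.
rmse_fn <- function(obs, pred) sqrt(mean((obs - pred)^2))

# R-squared as computed in this report: squared Pearson correlation.
r2_cor <- function(obs, pred) cor(obs, pred)^2

# Alternative definition: 1 - residual sum of squares / total sum of squares.
r2_ss <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Toy values (hypothetical, not from the wine data).
obs <- c(5, 6, 5, 7)
pred <- c(5.5, 5.5, 5, 6)
c(RMSE = rmse_fn(obs, pred), R2_cor = r2_cor(obs, pred), R2_ss = r2_ss(obs, pred))
```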

Conclusion

Which algorithm is recommended to get more accurate results?
- Is it better for classification or regression scenarios?
- Do you agree with the recommendations?
- Why?

This analysis works with the Wine dataset to understand the relationship between the target variable Quality and the independent variables Alcohol, pH, Density, and Citric Acid using the SVM algorithm. The data set was split into training and testing sets at a ratio of 85:15. The RMSE value for the SVM model is 0.678.

In the previous homework we looked at the Decision Tree model and Random Forest model with the same features. The model RMSEs were 0.637 and 0.573 respectively.

Of all the models, the Random Forest model had the most accurate results with the lowest RMSE.

This conclusion seems consistent with what is known of the data. The target variable ‘quality’ takes a small set of discrete integer scores (3 to 8 in this sample), so it behaves more like an ordered category than a continuous measurement. Random Forest’s tree-based, nonlinear structure appears to give it an advantage over the SVM here. It is also an ensemble model, meaning that it combines the predictions of many base trees rather than relying on a single model. Thus, the results seem appropriate for the problem at hand.