Read the following articles: https://www.hindawi.com/journals/complexity/2021/5550344/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/
library(tidyverse)
library(cowplot)
library(lattice)
library(reshape2)
library(corrplot)
library(caTools)
library(caret)
library(Hmisc)
library(e1071)
For Homework 2 I chose to work with the Wine Quality data set, which can be accessed from http://archive.ics.uci.edu/ml/datasets/Wine+Quality. It contains two subsets, one for red wine and one for white wine. For the purposes of this analysis, I will be working with the red wine subset. The goal of the data set is to model wine quality based on physicochemical tests. It contains the 12 attributes listed below.
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
For this analysis, I will attempt to model the quality of the wine based on different combinations of attributes.
wine <- as.data.frame(read.csv('winequality-red.csv'))
wine[1,]
## [1] "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5"
colnames(wine) <- c('col')
wine[c('Fixed_Acidity', 'Volatile_Acidity', 'Citric_Acid', 'Residual_Sugar', 'Chlorides', 'Free_Sulfer_Dioxide', 'Total_Sulfur_Dioxide', 'Density', 'pH', 'Sulphates', 'Alchohol', 'Quality')] <- str_split_fixed(wine$col, ';', 12)
wine <- wine %>% select(!col)
head(wine)
## Fixed_Acidity Volatile_Acidity Citric_Acid Residual_Sugar Chlorides
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## Free_Sulfer_Dioxide Total_Sulfur_Dioxide Density pH Sulphates Alchohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.2 0.68 9.8
## 3 15 54 0.997 3.26 0.65 9.8
## 4 17 60 0.998 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## Quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
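The split-and-rename steps above are needed because the file is semicolon-delimited and `read.csv()` assumes commas by default. As a hedged alternative, `read.csv()` accepts a `sep` argument, so the rows can be parsed straight into numeric columns. The sketch below passes two data rows copied from the output above through `text =` so that it runs without the file. The header line is an assumption about what the first line of the CSV looks like, and `wine_direct` is an illustrative name.

```r
# Inline stand-in for the file: an ASSUMED header line plus two rows taken
# from the head(wine) output shown above.
raw_text <- paste(
  "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality",
  "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5",
  "7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5",
  sep = "\n"
)

# sep = ";" tells read.csv() to split on semicolons; columns are parsed as
# numbers directly, so no string splitting or type conversion is needed.
wine_direct <- read.csv(text = raw_text, sep = ";")

str(wine_direct)
```

With the real file, the same idea would be `read.csv('winequality-red.csv', sep = ';')`. Note that `read.csv()` replaces spaces in header names with dots, so the column names would differ from the ones assigned by hand above.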
As the first step of exploratory analysis we will take a glimpse at the data.
head(wine,3)
## Fixed_Acidity Volatile_Acidity Citric_Acid Residual_Sugar Chlorides
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## Free_Sulfer_Dioxide Total_Sulfur_Dioxide Density pH Sulphates Alchohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.2 0.68 9.8
## 3 15 54 0.997 3.26 0.65 9.8
## Quality
## 1 5
## 2 5
## 3 5
All of the columns are currently of type character, so they are converted to numeric.
wine$Fixed_Acidity <- as.numeric(wine$Fixed_Acidity)
wine$Volatile_Acidity <- as.numeric(wine$Volatile_Acidity)
wine$Citric_Acid <- as.numeric(wine$Citric_Acid)
wine$Residual_Sugar <- as.numeric(wine$Residual_Sugar)
wine$Chlorides <- as.numeric(wine$Chlorides)
wine$Free_Sulfer_Dioxide <- as.numeric(wine$Free_Sulfer_Dioxide)
wine$Total_Sulfur_Dioxide <- as.numeric(wine$Total_Sulfur_Dioxide)
wine$Density <- as.numeric(wine$Density)
wine$pH <- as.numeric(wine$pH)
wine$Sulphates <- as.numeric(wine$Sulphates)
wine$Alchohol <- as.numeric(wine$Alchohol)
wine$Quality <- as.numeric(wine$Quality)
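Because every column gets the same treatment, the conversion can also be written once and applied to all columns. This is a minimal base R sketch on a small stand-in data frame (`wine_chr` is illustrative, with values copied from the first rows above), not a replacement for the author's code.

```r
# Stand-in character data frame with a few of the wine columns.
wine_chr <- data.frame(
  Density = c("0.9978", "0.9968"),
  pH      = c("3.51", "3.2"),
  Quality = c("5", "5"),
  stringsAsFactors = FALSE
)

# Assigning into wine_num[] keeps the data.frame structure while replacing
# every column with its numeric version.
wine_num <- wine_chr
wine_num[] <- lapply(wine_num, as.numeric)

sapply(wine_num, class)
```

Since tidyverse is already loaded in this report, `wine %>% mutate(across(everything(), as.numeric))` would be an equivalent dplyr spelling.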
To make sure the data is complete, we check for missing values.
sum(is.na(wine))
## [1] 0
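A single total is enough when it is zero, but when it is not, a per-column count shows where the gaps are. This check also catches parsing problems, because `as.numeric()` turns strings it cannot parse into `NA`. The sketch below uses a small stand-in data frame (`demo`) with missing values inserted on purpose for illustration.

```r
# Stand-in data with two deliberately missing values.
demo <- data.frame(
  pH        = c(3.51, NA, 3.26),
  Sulphates = c(0.56, 0.68, NA),
  Quality   = c(5, 5, 5)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(demo))

# Number of rows with no missing value in any column.
complete_rows <- sum(complete.cases(demo))

na_per_column
complete_rows
```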
Now that we have confirmed that no missing values exist in the data and have converted the columns to the proper type, we can start with the statistical and visual summaries.
### Summary
summary(wine)
## Fixed_Acidity Volatile_Acidity Citric_Acid Residual_Sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## Chlorides Free_Sulfer_Dioxide Total_Sulfur_Dioxide Density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH Sulphates Alchohol Quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
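# Reshape to long format: one row per (variable, value) pair.
# All columns are numeric, so melt() treats every column as a measure variable.
meltD <- melt(wine)
# Box plot of each variable in its own panel, with independent axes per panel
p <- ggplot(meltD, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scales = "free")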
hist.data.frame(wine)
corrplot(cor(wine),
method = 'circle', order = 'alphabet', type = 'lower', diag = FALSE, number.cex = 0.75, tl.cex = 0.5, col=colorRampPalette(c("blue","white","red"))(200))
For the purposes of this analysis, I will be creating an SVM Model with ‘quality’ as the dependent variable and ‘Alcohol’, ‘pH’, ‘Density’, and ‘Citric_Acid’ as the independent variables.
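The tracker data frame will contain the name, RMSE, and R-squared for each model.
# Empty data frame with three named columns.
# data.frame() converts the name "R-Squared" to the syntactic name "R.Squared".
tracker <- data.frame(matrix(vector(), 0, 3,
                             dimnames = list(c(), c("Name", "RMSE", "R-Squared"))),
                      stringsAsFactors = FALSE)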
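The tracker is only useful once results are appended to it. Below is a minimal sketch of one way to do that. The helper name `add_result` is hypothetical, the empty frame is built with typed columns rather than from a matrix, and the RMSE and R-squared values are the SVM results reported later in this analysis.

```r
# Empty tracker with typed columns (an alternative construction).
tracker <- data.frame(
  Name      = character(),
  RMSE      = numeric(),
  R.Squared = numeric(),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the tracker with one more row appended.
add_result <- function(tracker, name, rmse, r_squared) {
  rbind(
    tracker,
    data.frame(Name = name, RMSE = rmse, R.Squared = r_squared,
               stringsAsFactors = FALSE)
  )
}

tracker <- add_result(tracker, "SVM (polynomial)", 0.6779834, 0.1725858)
tracker
```

Returning a new data frame instead of modifying a global one keeps the helper easy to reuse for each additional model.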
The dataset is split into training and testing sets at an 85:15 ratio.
set.seed(123)
train_ind <- sample(seq_len(nrow(wine)), size = floor(0.85 * nrow(wine)))
train <- wine[train_ind, ]
test <- wine[-train_ind, ]
svm <- svm(Quality ~ Alchohol + pH + Density + Citric_Acid,
data=train,
kernel="polynomial",
scale=FALSE)
svm
##
## Call:
## svm(formula = Quality ~ Alchohol + pH + Density + Citric_Acid, data = train,
## kernel = "polynomial", scale = FALSE)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## gamma: 0.25
## coef.0: 0
## epsilon: 0.1
##
##
## Number of Support Vectors: 1141
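The call above sets `scale=FALSE`, so the predictors enter the polynomial kernel on their original ranges (Density is close to 1, while Alcohol runs from roughly 8 to 15). `svm()` in e1071 standardizes the variables by default (`scale = TRUE`), and SVMs are generally sensitive to feature scale, so it may be worth comparing against a scaled fit. The sketch below shows only the mechanics on synthetic stand-in data. The data frame `toy`, its columns, and the coefficients used to generate `y` are made up for illustration, and no claim is made about how scaling would change the wine results.

```r
library(e1071)

set.seed(123)
n <- 200

# Synthetic stand-in data (NOT the wine data): two predictors on very
# different scales and a noisy response centred near 5.
toy <- data.frame(
  big   = rnorm(n, mean = 10, sd = 1),
  small = rnorm(n, mean = 0.997, sd = 0.002)
)
toy$y <- 5 + 0.4 * (toy$big - 10) + 50 * (toy$small - 0.997) + rnorm(n, sd = 0.3)

# Same 85:15 split logic as in the analysis.
train_ind <- sample(seq_len(n), size = floor(0.85 * n))
toy_train <- toy[train_ind, ]
toy_test  <- toy[-train_ind, ]

# scale = TRUE is the default; it is written out here to contrast with the
# scale = FALSE call used above.
fit_scaled <- svm(y ~ big + small, data = toy_train,
                  kernel = "polynomial", scale = TRUE)

pred_scaled <- predict(fit_scaled, newdata = toy_test)
rmse_scaled <- sqrt(mean((toy_test$y - pred_scaled)^2))
rmse_scaled
```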
pred <- predict(svm, newdata=test)
# RMSE on the test set
rmse <- sqrt(mean((test$Quality - pred)^2))
# R-squared, computed as the squared correlation between observed and predicted values
rs <- (cor(test$Quality, pred))^2
data.frame('RMSE' = rmse, 'R-Squared' = rs)
## RMSE R.Squared
## 1 0.6779834 0.1725858
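The R-squared above is the squared Pearson correlation between observed and predicted values. That measures how linearly related the two are, but it ignores systematic bias, so it can differ from the coefficient of determination defined as 1 - SSE/SST. The sketch below puts both definitions next to the RMSE on small made-up vectors. The function names `rmse_fn`, `r2_cor`, and `r2_sse` and the toy values are illustrative only.

```r
# Root mean squared error.
rmse_fn <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R-squared as the squared correlation (the definition used above).
r2_cor <- function(actual, predicted) {
  cor(actual, predicted)^2
}

# R-squared as 1 - SSE/SST; this version penalizes biased predictions
# and can be negative for a poor model.
r2_sse <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - mean(actual))^2)
  1 - sse / sst
}

# Made-up quality scores and a deliberately biased set of predictions.
actual <- c(5, 6, 5, 7, 6)
biased <- 2 * actual + 1

data.frame(
  RMSE   = rmse_fn(actual, biased),
  R2_cor = r2_cor(actual, biased),
  R2_sse = r2_sse(actual, biased)
)
```

In this toy case the predictions are perfectly correlated with the observations but far from them, so the two R-squared definitions disagree sharply. Reporting both alongside the RMSE makes that kind of gap visible.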
Which algorithm is recommended to get more accurate results? - Is it better for classification or regression scenarios? - Do you agree with the recommendations? - Why?
This analysis uses the Wine Quality data set to model the relationship between the target variable Quality and the independent variables Alcohol, pH, Density, and Citric Acid using the SVM algorithm. The data set was split at a ratio of 85:15. The RMSE of the SVM model is 0.678.
In the previous homework we looked at the Decision Tree model and Random Forest model with the same features. The model RMSEs were 0.637 and 0.573 respectively.
Of all the models, the Random Forest model had the most accurate results with the lowest RMSE.
This conclusion seems aligned with what is known of the data. The target variable ‘quality’ takes only a few integer scores (3 to 8 in this data), so it behaves more like an ordered category than a continuous measurement. Random Forest’s nonlinear, tree-based structure appears to give it an advantage over the SVM here. It is also an ensemble model, meaning that it combines the predictions of many base models (decision trees) rather than relying on a single one. Thus, the results seem appropriate for the problem at hand.