Last Updated on Mon Jul 31 19:56:31 2017
Published at http://rpubs.com/enghanwenalvin/293012
Our task is to obtain a regression model for a wine quality data set. The data set contains records on the analysis of wine samples. Each record describes some chemical properties of the wine sample, together with a quality score (between 0 and 10) assigned by some wine experts. Our goal is to use this data set to obtain a regression model that can be used to assign a quality score to a new sample of wine based on its chemical properties.
Let us first import the data.
load("redWine.Rdata")
We take a quick look at the data.
str(redWine)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
summary(redWine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
There are 1,599 instances and 12 variables. The first 11 columns are the feature variables: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol.
The last column, quality, is the target variable of interest.
There is no missing data. All feature variables are numeric, while the target variable is an integer. The charts below show the correlations among the variables and the histogram of the target variable. Alcohol has the strongest positive correlation with quality, while volatile acidity has the strongest negative correlation. We see that most of the quality scores are either a 5 or a 6.
library(corrplot)
library(ggplot2)
corMatrix <- cor(redWine) # avoid masking the base function cor()
corrplot(corMatrix, method = "number")
ggplot(redWine, aes(quality)) +
  geom_histogram(binwidth = 1, color = 'black', fill = 'gray') +
  coord_cartesian(xlim = c(0, 10)) +
  geom_vline(xintercept = mean(redWine$quality), linetype = 'longdash', alpha = .5) +
  xlab("Wine Quality") +
  ylab("Number of Instances")
table(redWine$quality)
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
Since alcohol is the most strongly correlated with quality, let us take a closer look at its distribution across quality scores.
ggplot(redWine, aes(x = as.factor(quality), y = alcohol)) +
  geom_boxplot() +
  xlab("Wine Quality") +
  ylab("Alcohol")
ggplot(data = redWine, aes(x = as.numeric(quality), y = alcohol)) +
  geom_jitter(alpha = 1/3) +
  geom_smooth(method = 'lm') +
  xlab("Wine Quality") +
  ylab("Alcohol")
From the boxplot and the simple linear regression line, we can see that wines of higher quality contain higher levels of alcohol. Notably, there appear to be many more outliers among wines of average quality (score = 5).
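As a quick check on this observation, we can count how many alcohol values the 1.5 * IQR rule (the same rule geom_boxplot uses to flag points) marks as outliers within each quality group; a minimal sketch using base R:
# Number of alcohol outliers (1.5 * IQR rule) within each quality score
sapply(split(redWine$alcohol, redWine$quality),
       function(x) length(boxplot.stats(x)$out))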
To build the models, we first randomly split the data set into two subsets (70%/30%) for training and validation purposes.
set.seed(1234)
rndSample <- sample(1:nrow(redWine), 0.7*nrow(redWine), replace = FALSE)
trainingData <- redWine[rndSample, ]
validationData <- redWine[-rndSample, ]
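As an optional sanity check, we can verify that the random split leaves both subsets with broadly similar quality distributions:
# Class proportions should look similar in the two subsets
round(prop.table(table(trainingData$quality)), 3)
round(prop.table(table(validationData$quality)), 3)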
Since quality takes only a handful of integer values, we treat it as a factor and frame the task as classification. We first run the SVM model using the default radial kernel.
library(e1071)
modelSVM <- svm(as.factor(quality) ~ ., trainingData) # default radial kernel
predictSVM <- predict(modelSVM, validationData)
(cmSVM <- table(predictSVM, validationData$quality)) # confusion matrix
##
## predictSVM 3 4 5 6 7 8
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 1 9 160 54 6 0
## 6 0 9 45 118 39 8
## 7 0 0 1 8 21 1
## 8 0 0 0 0 0 0
100*(1-sum(diag(cmSVM))/sum(cmSVM)) # the error rate
## [1] 37.70833
The error rate appears high at 37.7%.
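Since the same calculation recurs for every model below, a small helper makes quick comparisons easier (a convenience sketch; errorRate is our own name, not a function from e1071):
# Misclassification rate (in percent) from a confusion matrix
errorRate <- function(cm) 100 * (1 - sum(diag(cm)) / sum(cm))
errorRate(cmSVM) # same 37.7% as computed above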
We also run the SVM model using a polynomial kernel of degree 3 to see whether it yields better results.
modelSVM2 <- svm(as.factor(quality) ~ ., trainingData, cost=10, kernel="polynomial", degree=3)
predictSVM2 <- predict(modelSVM2, validationData)
(cmSVM2 <- table(predictSVM2, validationData$quality)) # confusion matrix
##
## predictSVM2 3 4 5 6 7 8
## 3 0 1 3 0 0 0
## 4 1 0 4 3 0 0
## 5 0 10 142 43 3 0
## 6 0 6 53 123 36 4
## 7 0 1 4 11 27 4
## 8 0 0 0 0 0 1
100*(1-sum(diag(cmSVM2))/sum(cmSVM2)) # the error rate
## [1] 38.95833
The error is even higher at 39.0%.
Let us also try the linear kernel. Note that the linear kernel is typically used when the number of features is larger than the number of observations, which is not the case here, but we run it for completeness.
modelSVM3 <- svm(as.factor(quality) ~ ., trainingData, kernel="linear")
predictSVM3 <- predict(modelSVM3, validationData)
(cmSVM3 <- table(predictSVM3, validationData$quality))
##
## predictSVM3 3 4 5 6 7 8
## 3 0 0 2 0 0 0
## 4 0 0 0 0 0 0
## 5 1 12 158 64 5 0
## 6 0 6 46 116 61 9
## 7 0 0 0 0 0 0
## 8 0 0 0 0 0 0
100*(1-sum(diag(cmSVM3))/sum(cmSVM3))
## [1] 42.91667
The error is much higher at 42.9%. This is expected, since the linear kernel is a limiting (degenerate) case of the RBF kernel and hence unlikely to be more accurate than a properly tuned RBF kernel.
Let us try to tune the (radial) model’s cost and gamma parameters using grid search to see if we can improve results.
tuned_svm <- tune.svm(as.factor(quality) ~ ., data = trainingData, kernel = 'radial',
                      cost = 10^(-1:2), gamma = c(0.5, 1, 2))
summary(tuned_svm)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.5 1
##
## - best performance: 0.3547378
##
## - Detailed performance results:
## gamma cost error dispersion
## 1 0.5 0.1 0.4620495 0.06692070
## 2 1.0 0.1 0.5880550 0.04373341
## 3 2.0 0.1 0.5925193 0.04290306
## 4 0.5 1.0 0.3547378 0.05229062
## 5 1.0 1.0 0.3619128 0.03245191
## 6 2.0 1.0 0.3967503 0.03850064
## 7 0.5 10.0 0.3547539 0.03338042
## 8 1.0 10.0 0.3753057 0.02875788
## 9 2.0 10.0 0.3949727 0.03588240
## 10 0.5 100.0 0.3735360 0.02798562
## 11 1.0 100.0 0.3744128 0.02921146
## 12 2.0 100.0 0.3976432 0.03850386
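e1071 also provides a plot method for tuning objects, which visualizes the cross-validation error across the gamma/cost grid:
# Contour plot of the cross-validation error over the parameter grid
plot(tuned_svm)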
The tuning selects gamma = 0.5 and cost = 1 as the best parameters, although cost = 10 performs almost identically (0.35474 vs. 0.35475). Let us repeat the SVM with gamma = 0.5 and cost = 10.
modelSVM_tuned <- svm(as.factor(quality) ~ ., trainingData, cost=10, gamma=0.5)
predictSVM_tuned <- predict(modelSVM_tuned, validationData)
(cmSVM_tuned <- table(predictSVM_tuned, validationData$quality))
##
## predictSVM_tuned 3 4 5 6 7 8
## 3 0 0 1 0 0 0
## 4 0 0 4 1 0 0
## 5 1 12 147 50 5 2
## 6 0 6 52 115 26 4
## 7 0 0 2 14 35 2
## 8 0 0 0 0 0 1
100*(1-sum(diag(cmSVM_tuned))/sum(cmSVM_tuned))
## [1] 37.91667
We do not see any meaningful change in the error rate (37.9% vs. 37.7% for the untuned radial kernel).
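Beyond the overall error rate, the per-class recall (computed from the confusion matrix above) shows where the model struggles, namely the rare quality scores:
# Per-class recall: correct predictions divided by true instances of each class
round(diag(cmSVM_tuned) / colSums(cmSVM_tuned), 2)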
Let us try a random forest instead, using the same training and validation sets as above.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
modelRF <- randomForest(as.factor(quality) ~ ., trainingData, ntree=750, importance = TRUE)
predictRF <- predict(modelRF, validationData)
(cmRF <- table(predictRF, validationData$quality))
##
## predictRF 3 4 5 6 7 8
## 3 0 0 0 0 0 0
## 4 0 0 0 1 0 0
## 5 1 11 163 39 2 0
## 6 0 7 43 132 29 5
## 7 0 0 0 8 35 3
## 8 0 0 0 0 0 1
100*(1-sum(diag(cmRF))/sum(cmRF))
## [1] 31.04167
The error rate is lower at 31.0%. This is not surprising: a random forest grows many trees, not only on different bootstrap samples of the observations but also using different random subsets of the predictors, thus increasing the diversity of the individual models.
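Recall that the goal is to assign a quality score to a new wine sample. With the fitted forest, this is a single predict call; the chemical values below are invented purely for illustration:
# Score a hypothetical new wine sample (all values are made up for illustration)
newWine <- data.frame(fixed.acidity = 7.9, volatile.acidity = 0.35,
                      citric.acid = 0.46, residual.sugar = 3.6,
                      chlorides = 0.078, free.sulfur.dioxide = 15,
                      total.sulfur.dioxide = 37, density = 0.9973,
                      pH = 3.35, sulphates = 0.86, alcohol = 12.8)
predict(modelRF, newWine)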
Let us also take a look at the variable importance.
round(importance(modelRF), 2)
## 3 4 5 6 7 8
## fixed.acidity 0.13 -1.31 25.15 19.06 23.90 -3.45
## volatile.acidity 4.19 8.82 34.69 18.03 38.19 -0.27
## citric.acid 2.75 5.41 18.25 17.13 24.97 -1.41
## residual.sugar -0.31 -2.06 25.04 22.58 17.93 1.09
## chlorides -0.72 -2.61 31.96 17.06 18.82 0.03
## free.sulfur.dioxide 2.14 3.13 24.04 24.26 14.30 -0.35
## total.sulfur.dioxide 1.40 5.06 40.56 31.70 25.34 -0.01
## density -2.13 -2.78 27.55 31.27 25.10 -3.33
## pH -0.25 3.39 25.96 18.89 16.45 -1.15
## sulphates -2.65 6.00 39.20 30.57 41.79 2.69
## alcohol 1.49 2.75 61.63 35.43 48.68 2.08
## MeanDecreaseAccuracy MeanDecreaseGini
## fixed.acidity 37.61 52.80
## volatile.acidity 46.51 71.59
## citric.acid 32.88 52.16
## residual.sugar 37.38 51.66
## chlorides 37.37 58.68
## free.sulfur.dioxide 36.98 48.81
## total.sulfur.dioxide 54.71 73.23
## density 44.95 65.77
## pH 35.87 54.40
## sulphates 56.79 77.20
## alcohol 75.07 105.44
op <- par(mfrow = c(1, 2)) # show both importance plots side by side
varImpPlot(modelRF, type = 1) # type 1: mean decrease in accuracy
varImpPlot(modelRF, type = 2) # type 2: mean decrease in Gini
par(op) # restore the previous plotting parameters
We see that alcohol is the most important variable affecting quality, with the highest mean decrease in accuracy (75.1) and mean decrease in Gini (105.4). The level of alcohol in wines is what matters! So much for wine tasting!
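To see how alcohol drives the prediction, randomForest's partialPlot function can sketch its marginal effect on a single class; here we pick the most common score (quality = 5) purely for illustration:
# Partial dependence of the predicted class "5" on alcohol
partialPlot(modelRF, trainingData, "alcohol", which.class = "5")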