Last Updated on Mon Jul 31 19:56:31 2017
Published at http://rpubs.com/enghanwenalvin/293012


Building Predictive Models in R

Our task is to obtain a regression model for a wine quality data set. The data set contains records on the analysis of wine samples. Each record describes some chemical properties of the wine sample, together with a quality score (between 0 and 10) assigned by some wine experts. Our goal is to use this data set to obtain a regression model that can be used to assign a quality score to a new sample of wine based on its chemical properties.

Let us first import the data.

load("redWine.Rdata")

We take a quick look at the data.

str(redWine)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
summary(redWine)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

There are 1,599 instances and 12 variables. The first 11 columns show the feature variables:

The last column shows the target variable of interest:

There is no missing data. All feature variables are numeric while the target variable is an integer. The charts below show the correlations among the variables and the histogram for the target variable. Alcohol is the most highly correlated with quality while volatile acidity is the least correlated. We see that most of the quality scores are either a 5 or a 6.

library(corrplot)
library(ggplot2)
cor <- cor(redWine)
corrplot(cor, method = c("number"))

ggplot(redWine, aes(quality)) +  
  geom_histogram(binwidth = 1, color='black', fill = 'gray') +
  coord_cartesian(xlim=c(0,10)) +
  geom_vline(xintercept = mean(redWine$quality), linetype='longdash', alpha=.5) +
  xlab("Wine Quality") +
  ylab("NUmber of Instances")

table(redWine$quality)
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Since alcohol is the most correlated to quality, let us take a closer look at its distribution.

ggplot(redWine, aes(x=as.factor(quality), y=alcohol)) +
  geom_boxplot() +
  xlab("Wine Quality") +
  ylab("Alcohol")

ggplot(data=redWine, aes(x=as.numeric(quality), y=alcohol)) +
  geom_jitter(alpha=1/3) +
  geom_smooth(method='lm')+
  xlab("Wine Quality") +
  ylab("Alcohol")

From the boxplot and simple linear regression line, we can see that Wines of higher quality contain higher level of alcohol. Notably, there appears to be much more outliers in wine of average quality (score = 5).

Splitting Data

To build the model, we first randomly split the data set in two sub-sets (70%-30%) for training and validation purposes.

set.seed(1234)
rndSample <- sample(1:nrow(redWine), 0.7*nrow(redWine), replace = FALSE)
trainingData <- redWine[rndSample, ]
validationData <- redWine[-rndSample, ]

We run the SVM model using the default radial kernel.

library(e1071)
modelSVM <- svm(as.factor(quality) ~ ., trainingData) # default radial kernel
predictSVM <- predict(modelSVM, validationData)
(cmSVM <- table(predictSVM, validationData$quality)) # confusion matrix
##           
## predictSVM   3   4   5   6   7   8
##          3   0   0   0   0   0   0
##          4   0   0   0   0   0   0
##          5   1   9 160  54   6   0
##          6   0   9  45 118  39   8
##          7   0   0   1   8  21   1
##          8   0   0   0   0   0   0
100*(1-sum(diag(cmSVM))/sum(cmSVM)) # the error rate
## [1] 37.70833

The error rate appears high at 37.7%.

We also run the SVM model using the polynomial kernel of degree 3 to see if we are able to yield better results.

modelSVM2 <- svm(as.factor(quality) ~ ., trainingData, cost=10, kernel="polynomial", degree=3) 
predictSVM2 <- predict(modelSVM2, validationData)
(cmSVM2 <- table(predictSVM2, validationData$quality)) # confusion matrix
##            
## predictSVM2   3   4   5   6   7   8
##           3   0   1   3   0   0   0
##           4   1   0   4   3   0   0
##           5   0  10 142  43   3   0
##           6   0   6  53 123  36   4
##           7   0   1   4  11  27   4
##           8   0   0   0   0   0   1
100*(1-sum(diag(cmSVM2))/sum(cmSVM2)) # the error rate
## [1] 38.95833

The error is even higher at 39.0%.

Let us also try the linear kernel. Note that the linear kernel is typically used when number of features is larger than number of observations, which is not the case here, but we run it for completeness.

modelSVM3 <- svm(as.factor(quality) ~ ., trainingData, kernel="linear") 
predictSVM3 <- predict(modelSVM3, validationData)
(cmSVM3 <- table(predictSVM3, validationData$quality)) 
##            
## predictSVM3   3   4   5   6   7   8
##           3   0   0   2   0   0   0
##           4   0   0   0   0   0   0
##           5   1  12 158  64   5   0
##           6   0   6  46 116  61   9
##           7   0   0   0   0   0   0
##           8   0   0   0   0   0   0
100*(1-sum(diag(cmSVM3))/sum(cmSVM3)) 
## [1] 42.91667

The error is much higehr at 42.9%. This is expected, since the linear kernel is a degenerate version of RBF hence it is unlikely to be more accurate than a properly tuned RBF kernel.

Let us try to tune the (radial) model’s cost and gamma parameters using grid search to see if we can improve results.

tuned_svm = tune.svm(as.factor(quality) ~ .,data = trainingData, kernel = 'radial', cost = 10^(-1:2), gamma=c(0.5,1,2))
summary(tuned_svm)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma cost
##    0.5    1
## 
## - best performance: 0.3547378 
## 
## - Detailed performance results:
##    gamma  cost     error dispersion
## 1    0.5   0.1 0.4620495 0.06692070
## 2    1.0   0.1 0.5880550 0.04373341
## 3    2.0   0.1 0.5925193 0.04290306
## 4    0.5   1.0 0.3547378 0.05229062
## 5    1.0   1.0 0.3619128 0.03245191
## 6    2.0   1.0 0.3967503 0.03850064
## 7    0.5  10.0 0.3547539 0.03338042
## 8    1.0  10.0 0.3753057 0.02875788
## 9    2.0  10.0 0.3949727 0.03588240
## 10   0.5 100.0 0.3735360 0.02798562
## 11   1.0 100.0 0.3744128 0.02921146
## 12   2.0 100.0 0.3976432 0.03850386

The best parameters are gamma = 0.5 and cost = 10. Let us repeat the SVM with these parameters.

modelSVM_tuned <- svm(as.factor(quality) ~ ., trainingData, cost=10, gamma=0.5) 
predictSVM_tuned <- predict(modelSVM_tuned, validationData)
(cmSVM_tuned <- table(predictSVM_tuned, validationData$quality)) 
##                 
## predictSVM_tuned   3   4   5   6   7   8
##                3   0   0   1   0   0   0
##                4   0   0   4   1   0   0
##                5   1  12 147  50   5   2
##                6   0   6  52 115  26   4
##                7   0   0   2  14  35   2
##                8   0   0   0   0   0   1
100*(1-sum(diag(cmSVM_tuned))/sum(cmSVM_tuned)) 
## [1] 37.91667

We do not see any meaningful change in the error rate.

Let us try a random forest instead. We use the same training and validation from above.

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
modelRF <- randomForest(as.factor(quality) ~ ., trainingData, ntree=750, importance = TRUE)
predictRF <- predict(modelRF, validationData)
(cmRF <- table(predictRF, validationData$quality))
##          
## predictRF   3   4   5   6   7   8
##         3   0   0   0   0   0   0
##         4   0   0   0   1   0   0
##         5   1  11 163  39   2   0
##         6   0   7  43 132  29   5
##         7   0   0   0   8  35   3
##         8   0   0   0   0   0   1
100*(1-sum(diag(cmRF))/sum(cmRF)) 
## [1] 31.04167

The error rate is lower at 31.2%, which is not surprising since the random forest samples multiple trees grown not only with different observations but also using different predictors, thus increasing the diversity of the individual models.

Let us also take a look at the variable importance.

round(importance(modelRF), 2)
##                          3     4     5     6     7     8
## fixed.acidity         0.13 -1.31 25.15 19.06 23.90 -3.45
## volatile.acidity      4.19  8.82 34.69 18.03 38.19 -0.27
## citric.acid           2.75  5.41 18.25 17.13 24.97 -1.41
## residual.sugar       -0.31 -2.06 25.04 22.58 17.93  1.09
## chlorides            -0.72 -2.61 31.96 17.06 18.82  0.03
## free.sulfur.dioxide   2.14  3.13 24.04 24.26 14.30 -0.35
## total.sulfur.dioxide  1.40  5.06 40.56 31.70 25.34 -0.01
## density              -2.13 -2.78 27.55 31.27 25.10 -3.33
## pH                   -0.25  3.39 25.96 18.89 16.45 -1.15
## sulphates            -2.65  6.00 39.20 30.57 41.79  2.69
## alcohol               1.49  2.75 61.63 35.43 48.68  2.08
##                      MeanDecreaseAccuracy MeanDecreaseGini
## fixed.acidity                       37.61            52.80
## volatile.acidity                    46.51            71.59
## citric.acid                         32.88            52.16
## residual.sugar                      37.38            51.66
## chlorides                           37.37            58.68
## free.sulfur.dioxide                 36.98            48.81
## total.sulfur.dioxide                54.71            73.23
## density                             44.95            65.77
## pH                                  35.87            54.40
## sulphates                           56.79            77.20
## alcohol                             75.07           105.44
subplot <- par(mfrow = c(1,2))
varImpPlot(modelRF,type=1)
varImpPlot(modelRF,type=2)

We see that alcohol is the most important variable affecting the quality, with the highest mean decrease in accuracy and gini of 74.3 and 107.8 respectively. The level of alcohol in wines is what matters! So much for wine tasting!