Wine quality estimation using regression trees and model trees

To develop a red wine rating model, we use data donated to the UCI Machine Learning Data Repository by Cortez et al. We are interested in predicting the quality of a red wine taste based on its quantified chemical characteristics.

## [1] "C:/Users/mammykins/Google Drive/R/wine"

Objectives

The data

We read the data in and assign it to mydata.

glimpse(mydata)
## Observations: 1,599
## Variables: 12
## $ fixed.acidity        (dbl) 6.5, 9.1, 6.9, 7.3, 12.5, 5.4, 10.4, 7.9,...
## $ volatile.acidity     (dbl) 0.90, 0.22, 0.52, 0.59, 0.28, 0.74, 0.28,...
## $ citric.acid          (dbl) 0.00, 0.24, 0.25, 0.26, 0.54, 0.09, 0.54,...
## $ residual.sugar       (dbl) 1.6, 2.1, 2.6, 2.0, 2.3, 1.7, 2.7, 1.8, 2...
## $ chlorides            (dbl) 0.052, 0.078, 0.081, 0.080, 0.082, 0.089,...
## $ free.sulfur.dioxide  (dbl) 9, 1, 10, 17, 12, 16, 5, 2, 9, 12, 21, 10...
## $ total.sulfur.dioxide (dbl) 17, 28, 37, 104, 29, 26, 19, 45, 46, 51, ...
## $ density              (dbl) 0.99467, 0.99900, 0.99685, 0.99584, 0.999...
## $ pH                   (dbl) 3.50, 3.41, 3.46, 3.28, 3.11, 3.67, 3.25,...
## $ sulphates            (dbl) 0.63, 0.87, 0.50, 0.52, 1.36, 0.56, 0.63,...
## $ alcohol              (dbl) 10.9, 10.3, 11.0, 9.9, 9.8, 11.6, 9.5, 9....
## $ quality              (int) 6, 6, 5, 5, 7, 6, 5, 6, 6, 6, 5, 5, 5, 6,...

The wines were subject to lab analysis and then rated in a blind tasting by panels of no less than three judges on a quality scale ranging from zero (gross) to 10 (excellent). If judges disagreed the median values was used. Thus we intend to model the numeric response variable quality by the eleven other explanatory variables. Compared to other machine learning models, trees do not need as much data preprocessing, thus we can just get on with it! However, it is important to consider the type of variation in the respone variable, checking for extremes and outliers can also be prudent at this stage.

hist(mydata$quality)

summary(mydata)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Residual sugar has a max outlier but we’ll continue as if the data were OK. Red wine goes straight to my head.

Splitting the data into train and test

We use a similar method to Cortez, with a 75:25 split.

set.seed(1337)
train <- sample(x = nrow(mydata), size = 1199, replace = FALSE)  

Method

We’ll fit a regression tree using the default parameters. Using the R formula interface, we specify quality as the response and include all other variables as predictors using the dot notation.

m.rpart <- rpart(quality ~ ., data = mydata[train, ])  # fit with training set
m.rpart
## n= 1199 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 1199 810.335300 5.620517  
##    2) alcohol< 10.525 745 321.591900 5.359732  
##      4) sulphates< 0.575 292  93.791100 5.133562 *
##      5) sulphates>=0.575 453 203.236200 5.505519  
##       10) volatile.acidity>=0.335 402 159.184100 5.442786  
##         20) total.sulfur.dioxide>=83 78  11.448720 5.141026 *
##         21) total.sulfur.dioxide< 83 324 138.922800 5.515432 *
##       11) volatile.acidity< 0.335 51  30.000000 6.000000 *
##    3) alcohol>=10.525 454 354.933900 6.048458  
##      6) sulphates< 0.585 123 108.699200 5.520325  
##       12) volatile.acidity>=0.99 7   3.428571 3.714286 *
##       13) volatile.acidity< 0.99 116  81.060340 5.629310  
##         26) volatile.acidity>=0.385 84  50.416670 5.416667 *
##         27) volatile.acidity< 0.385 32  16.875000 6.187500 *
##      7) sulphates>=0.585 331 199.178200 6.244713  
##       14) alcohol< 11.55 198 109.818200 6.030303  
##         28) chlorides>=0.0945 35  20.685710 5.542857 *
##         29) chlorides< 0.0945 163  79.030670 6.134969 *
##       15) alcohol>=11.55 133  66.706770 6.563910 *

For each node in the tree, the number of examples reaching the decision point is listed. For instance all 1199 examples begin at the root node, of which 745 have alcohol less than 10.525. Because alcohol appears first in the tree it is the single most important predictor of wine quality. We can use summary() for extra detail.

summary(m.rpart)
## Call:
## rpart(formula = quality ~ ., data = mydata[train, ])
##   n= 1199 
## 
##            CP nsplit rel error    xerror       xstd
## 1  0.16512845      0 1.0000000 1.0012005 0.04370552
## 2  0.05807039      1 0.8348715 0.8546403 0.04278983
## 3  0.03031418      2 0.7768012 0.8160529 0.03891994
## 4  0.02987686      3 0.7464870 0.7911828 0.03735243
## 5  0.02795546      4 0.7166101 0.7813175 0.03703364
## 6  0.01734112      5 0.6886547 0.7488583 0.03544028
## 7  0.01699133      6 0.6713135 0.7432333 0.03500668
## 8  0.01246619      7 0.6543222 0.7350026 0.03427549
## 9  0.01087516      8 0.6418560 0.7442025 0.03493094
## 10 0.01000000      9 0.6309809 0.7314232 0.03436914
## 
## Variable importance
##              alcohol     volatile.acidity            sulphates 
##                   33                   15                   15 
##              density            chlorides total.sulfur.dioxide 
##                   14                    9                    5 
##        fixed.acidity          citric.acid                   pH 
##                    4                    2                    1 
##       residual.sugar  free.sulfur.dioxide 
##                    1                    1 
## 
## Node number 1: 1199 observations,    complexity param=0.1651285
##   mean=5.620517, MSE=0.6758426 
##   left son=2 (745 obs) right son=3 (454 obs)
##   Primary splits:
##       alcohol          < 10.525   to the left,  improve=0.16512850, (0 missing)
##       sulphates        < 0.645    to the left,  improve=0.12153840, (0 missing)
##       volatile.acidity < 0.425    to the right, improve=0.10239570, (0 missing)
##       citric.acid      < 0.295    to the left,  improve=0.06585085, (0 missing)
##       density          < 0.995565 to the right, improve=0.05332424, (0 missing)
##   Surrogate splits:
##       density              < 0.995575 to the right, agree=0.757, adj=0.359, (0 split)
##       chlorides            < 0.0685   to the right, agree=0.698, adj=0.203, (0 split)
##       volatile.acidity     < 0.3675   to the right, agree=0.660, adj=0.101, (0 split)
##       fixed.acidity        < 6.35     to the right, agree=0.654, adj=0.086, (0 split)
##       total.sulfur.dioxide < 17.5     to the right, agree=0.651, adj=0.077, (0 split)
## 
## Node number 2: 745 observations,    complexity param=0.03031418
##   mean=5.359732, MSE=0.431667 
##   left son=4 (292 obs) right son=5 (453 obs)
##   Primary splits:
##       sulphates            < 0.575    to the left,  improve=0.07638452, (0 missing)
##       volatile.acidity     < 0.335    to the right, improve=0.06502699, (0 missing)
##       alcohol              < 9.975    to the left,  improve=0.05203031, (0 missing)
##       fixed.acidity        < 12.55    to the left,  improve=0.03121456, (0 missing)
##       total.sulfur.dioxide < 83.5     to the right, improve=0.02718394, (0 missing)
##   Surrogate splits:
##       density             < 0.996285 to the left,  agree=0.664, adj=0.144, (0 split)
##       volatile.acidity    < 0.6525   to the right, agree=0.646, adj=0.096, (0 split)
##       fixed.acidity       < 6.15     to the left,  agree=0.619, adj=0.027, (0 split)
##       citric.acid         < 0.115    to the left,  agree=0.617, adj=0.024, (0 split)
##       free.sulfur.dioxide < 56       to the right, agree=0.615, adj=0.017, (0 split)
## 
## Node number 3: 454 observations,    complexity param=0.05807039
##   mean=6.048458, MSE=0.7817928 
##   left son=6 (123 obs) right son=7 (331 obs)
##   Primary splits:
##       sulphates        < 0.585    to the left,  improve=0.13257820, (0 missing)
##       volatile.acidity < 0.8675   to the right, improve=0.12988970, (0 missing)
##       citric.acid      < 0.275    to the left,  improve=0.10195960, (0 missing)
##       alcohol          < 11.55    to the left,  improve=0.09956824, (0 missing)
##       pH               < 3.355    to the right, improve=0.08725303, (0 missing)
##   Surrogate splits:
##       volatile.acidity     < 0.8375   to the right, agree=0.751, adj=0.081, (0 split)
##       density              < 0.99351  to the left,  agree=0.749, adj=0.073, (0 split)
##       total.sulfur.dioxide < 9.5      to the left,  agree=0.744, adj=0.057, (0 split)
##       residual.sugar       < 7.65     to the right, agree=0.740, adj=0.041, (0 split)
##       citric.acid          < 0.015    to the left,  agree=0.736, adj=0.024, (0 split)
## 
## Node number 4: 292 observations
##   mean=5.133562, MSE=0.3212024 
## 
## Node number 5: 453 observations,    complexity param=0.01734112
##   mean=5.505519, MSE=0.448645 
##   left son=10 (402 obs) right son=11 (51 obs)
##   Primary splits:
##       volatile.acidity     < 0.335    to the right, improve=0.06914183, (0 missing)
##       total.sulfur.dioxide < 81.5     to the right, improve=0.05979987, (0 missing)
##       alcohol              < 9.85     to the left,  improve=0.05963577, (0 missing)
##       fixed.acidity        < 10.95    to the left,  improve=0.04607798, (0 missing)
##       free.sulfur.dioxide  < 14.5     to the right, improve=0.02879088, (0 missing)
##   Surrogate splits:
##       fixed.acidity < 13.35    to the left,  agree=0.896, adj=0.078, (0 split)
##       chlorides     < 0.0565   to the right, agree=0.896, adj=0.078, (0 split)
## 
## Node number 6: 123 observations,    complexity param=0.02987686
##   mean=5.520325, MSE=0.8837332 
##   left son=12 (7 obs) right son=13 (116 obs)
##   Primary splits:
##       volatile.acidity    < 0.99     to the right, improve=0.2227272, (0 missing)
##       citric.acid         < 0.255    to the left,  improve=0.1756120, (0 missing)
##       pH                  < 3.365    to the right, improve=0.1618761, (0 missing)
##       alcohol             < 11.45    to the left,  improve=0.1429198, (0 missing)
##       free.sulfur.dioxide < 30       to the left,  improve=0.0960594, (0 missing)
## 
## Node number 7: 331 observations,    complexity param=0.02795546
##   mean=6.244713, MSE=0.601747 
##   left son=14 (198 obs) right son=15 (133 obs)
##   Primary splits:
##       alcohol          < 11.55    to the left,  improve=0.11373380, (0 missing)
##       volatile.acidity < 0.8575   to the right, improve=0.05562597, (0 missing)
##       pH               < 3.355    to the right, improve=0.04544472, (0 missing)
##       citric.acid      < 0.295    to the left,  improve=0.04092179, (0 missing)
##       chlorides        < 0.0945   to the right, improve=0.04058732, (0 missing)
##   Surrogate splits:
##       density        < 0.99477  to the right, agree=0.719, adj=0.301, (0 split)
##       fixed.acidity  < 5.7      to the right, agree=0.656, adj=0.143, (0 split)
##       chlorides      < 0.0525   to the right, agree=0.647, adj=0.120, (0 split)
##       pH             < 3.545    to the left,  agree=0.637, adj=0.098, (0 split)
##       residual.sugar < 4.25     to the left,  agree=0.616, adj=0.045, (0 split)
## 
## Node number 10: 402 observations,    complexity param=0.01087516
##   mean=5.442786, MSE=0.3959803 
##   left son=20 (78 obs) right son=21 (324 obs)
##   Primary splits:
##       total.sulfur.dioxide < 83       to the right, improve=0.05536057, (0 missing)
##       alcohol              < 9.975    to the left,  improve=0.04825427, (0 missing)
##       fixed.acidity        < 12.55    to the left,  improve=0.03179864, (0 missing)
##       volatile.acidity     < 0.605    to the right, improve=0.02846832, (0 missing)
##       pH                   < 3.54     to the right, improve=0.02716902, (0 missing)
##   Surrogate splits:
##       free.sulfur.dioxide < 36.5     to the right, agree=0.821, adj=0.077, (0 split)
##       residual.sugar      < 5.85     to the right, agree=0.813, adj=0.038, (0 split)
##       density             < 1.002655 to the right, agree=0.813, adj=0.038, (0 split)
##       sulphates           < 1.78     to the right, agree=0.808, adj=0.013, (0 split)
## 
## Node number 11: 51 observations
##   mean=6, MSE=0.5882353 
## 
## Node number 12: 7 observations
##   mean=3.714286, MSE=0.4897959 
## 
## Node number 13: 116 observations,    complexity param=0.01699133
##   mean=5.62931, MSE=0.6987961 
##   left son=26 (84 obs) right son=27 (32 obs)
##   Primary splits:
##       volatile.acidity    < 0.385    to the right, improve=0.1698571, (0 missing)
##       citric.acid         < 0.255    to the left,  improve=0.1497599, (0 missing)
##       pH                  < 3.355    to the right, improve=0.1237777, (0 missing)
##       free.sulfur.dioxide < 30       to the left,  improve=0.1050906, (0 missing)
##       alcohol             < 11.45    to the left,  improve=0.1022014, (0 missing)
##   Surrogate splits:
##       citric.acid          < 0.255    to the left,  agree=0.853, adj=0.469, (0 split)
##       pH                   < 3.295    to the right, agree=0.784, adj=0.219, (0 split)
##       free.sulfur.dioxide  < 34       to the left,  agree=0.759, adj=0.125, (0 split)
##       total.sulfur.dioxide < 95.5     to the left,  agree=0.759, adj=0.125, (0 split)
##       density              < 0.991705 to the right, agree=0.759, adj=0.125, (0 split)
## 
## Node number 14: 198 observations,    complexity param=0.01246619
##   mean=6.030303, MSE=0.5546373 
##   left son=28 (35 obs) right son=29 (163 obs)
##   Primary splits:
##       chlorides            < 0.0945   to the right, improve=0.09198652, (0 missing)
##       volatile.acidity     < 0.395    to the right, improve=0.08891009, (0 missing)
##       pH                   < 3.49     to the right, improve=0.05937785, (0 missing)
##       total.sulfur.dioxide < 51.5     to the right, improve=0.05133368, (0 missing)
##       citric.acid          < 0.275    to the left,  improve=0.04553784, (0 missing)
##   Surrogate splits:
##       density       < 1.0007   to the right, agree=0.854, adj=0.171, (0 split)
##       fixed.acidity < 14.75    to the right, agree=0.843, adj=0.114, (0 split)
##       pH            < 2.98     to the left,  agree=0.843, adj=0.114, (0 split)
##       citric.acid   < 0.675    to the right, agree=0.838, adj=0.086, (0 split)
## 
## Node number 15: 133 observations
##   mean=6.56391, MSE=0.5015546 
## 
## Node number 20: 78 observations
##   mean=5.141026, MSE=0.1467784 
## 
## Node number 21: 324 observations
##   mean=5.515432, MSE=0.4287742 
## 
## Node number 26: 84 observations
##   mean=5.416667, MSE=0.6001984 
## 
## Node number 27: 32 observations
##   mean=6.1875, MSE=0.5273438 
## 
## Node number 28: 35 observations
##   mean=5.542857, MSE=0.5910204 
## 
## Node number 29: 163 observations
##   mean=6.134969, MSE=0.4848508

Visualising the tree

Using the rpart.plot package we call the function of the same name. Red red wine.

rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE)

Evaluating model performance

Let’s see how well the model does when predicting the test data.

p.rpart <- predict(m.rpart, mydata[-train, ])  # note the negative, drops train leaving test

If we examine the summary statistics of the predicted versus the real qualities we notice that our predictions may be covering the tail or extreme qualities less well. We can use the correlation to compare how well the predicted does against the real.

summary(p.rpart)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.714   5.139   5.515   5.653   6.135   6.564
summary(mydata[-train, ][["quality"]])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.682   6.000   8.000
cor(p.rpart, mydata[-train, ][["quality"]])
## [1] 0.5821109

For a complex biological problem a correlation of 0.58 is acceptable. However, we can measure how far off the predictions were using the mean absolute error. We can create a simple MAE() function as follows.

MAE <- function(actual, predicted)  {
  mean(abs(actual - predicted))
}

MAE(predicted = p.rpart,actual = mydata[-train, ][["quality"]])
## [1] 0.4915993

Not bad, but if just guessed the mean wine quality everytime we would get a mean absolute error of 0.640125. Our regression tree is better on average but not by much than the imputed mean.

Improving model performance

Let’s improve things by using a model tree; replacing the leaf nodes with regression models. We will use the M5’ algorithm by Wang and Witten using the M5P() function from the RWeka package. The package has some Java dependencies so I’ll update this step at a later date.


References