To develop a red wine rating model, we use data donated to the UCI Machine Learning Data Repository by Cortez et al. We are interested in predicting the quality of a red wine taste based on its quantified chemical characteristics.
## [1] "C:/Users/mammykins/Google Drive/R/wine"
We read the data in and assign it to mydata.
glimpse(mydata)
## Observations: 1,599
## Variables: 12
## $ fixed.acidity (dbl) 6.5, 9.1, 6.9, 7.3, 12.5, 5.4, 10.4, 7.9,...
## $ volatile.acidity (dbl) 0.90, 0.22, 0.52, 0.59, 0.28, 0.74, 0.28,...
## $ citric.acid (dbl) 0.00, 0.24, 0.25, 0.26, 0.54, 0.09, 0.54,...
## $ residual.sugar (dbl) 1.6, 2.1, 2.6, 2.0, 2.3, 1.7, 2.7, 1.8, 2...
## $ chlorides (dbl) 0.052, 0.078, 0.081, 0.080, 0.082, 0.089,...
## $ free.sulfur.dioxide (dbl) 9, 1, 10, 17, 12, 16, 5, 2, 9, 12, 21, 10...
## $ total.sulfur.dioxide (dbl) 17, 28, 37, 104, 29, 26, 19, 45, 46, 51, ...
## $ density (dbl) 0.99467, 0.99900, 0.99685, 0.99584, 0.999...
## $ pH (dbl) 3.50, 3.41, 3.46, 3.28, 3.11, 3.67, 3.25,...
## $ sulphates (dbl) 0.63, 0.87, 0.50, 0.52, 1.36, 0.56, 0.63,...
## $ alcohol (dbl) 10.9, 10.3, 11.0, 9.9, 9.8, 11.6, 9.5, 9....
## $ quality (int) 6, 6, 5, 5, 7, 6, 5, 6, 6, 6, 5, 5, 5, 6,...
The wines were subject to lab analysis and then rated in a blind tasting by panels of no less than three judges on a quality scale ranging from zero (gross) to 10 (excellent). If judges disagreed the median values was used. Thus we intend to model the numeric response variable quality by the eleven other explanatory variables. Compared to other machine learning models, trees do not need as much data preprocessing, thus we can just get on with it! However, it is important to consider the type of variation in the respone variable, checking for extremes and outliers can also be prudent at this stage.
hist(mydata$quality)
summary(mydata)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Residual sugar has a max outlier but we’ll continue as if the data were OK. Red wine goes straight to my head.
We use a similar method to Cortez, with a 75:25 split.
set.seed(1337)
train <- sample(x = nrow(mydata), size = 1199, replace = FALSE)
We’ll fit a regression tree using the default parameters. Using the R formula interface, we specify quality as the response and include all other variables as predictors using the dot notation.
m.rpart <- rpart(quality ~ ., data = mydata[train, ]) # fit with training set
m.rpart
## n= 1199
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1199 810.335300 5.620517
## 2) alcohol< 10.525 745 321.591900 5.359732
## 4) sulphates< 0.575 292 93.791100 5.133562 *
## 5) sulphates>=0.575 453 203.236200 5.505519
## 10) volatile.acidity>=0.335 402 159.184100 5.442786
## 20) total.sulfur.dioxide>=83 78 11.448720 5.141026 *
## 21) total.sulfur.dioxide< 83 324 138.922800 5.515432 *
## 11) volatile.acidity< 0.335 51 30.000000 6.000000 *
## 3) alcohol>=10.525 454 354.933900 6.048458
## 6) sulphates< 0.585 123 108.699200 5.520325
## 12) volatile.acidity>=0.99 7 3.428571 3.714286 *
## 13) volatile.acidity< 0.99 116 81.060340 5.629310
## 26) volatile.acidity>=0.385 84 50.416670 5.416667 *
## 27) volatile.acidity< 0.385 32 16.875000 6.187500 *
## 7) sulphates>=0.585 331 199.178200 6.244713
## 14) alcohol< 11.55 198 109.818200 6.030303
## 28) chlorides>=0.0945 35 20.685710 5.542857 *
## 29) chlorides< 0.0945 163 79.030670 6.134969 *
## 15) alcohol>=11.55 133 66.706770 6.563910 *
For each node in the tree, the number of examples reaching the decision point is listed. For instance all 1199 examples begin at the root node, of which 745 have alcohol less than 10.525. Because alcohol appears first in the tree it is the single most important predictor of wine quality. We can use summary() for extra detail.
summary(m.rpart)
## Call:
## rpart(formula = quality ~ ., data = mydata[train, ])
## n= 1199
##
## CP nsplit rel error xerror xstd
## 1 0.16512845 0 1.0000000 1.0012005 0.04370552
## 2 0.05807039 1 0.8348715 0.8546403 0.04278983
## 3 0.03031418 2 0.7768012 0.8160529 0.03891994
## 4 0.02987686 3 0.7464870 0.7911828 0.03735243
## 5 0.02795546 4 0.7166101 0.7813175 0.03703364
## 6 0.01734112 5 0.6886547 0.7488583 0.03544028
## 7 0.01699133 6 0.6713135 0.7432333 0.03500668
## 8 0.01246619 7 0.6543222 0.7350026 0.03427549
## 9 0.01087516 8 0.6418560 0.7442025 0.03493094
## 10 0.01000000 9 0.6309809 0.7314232 0.03436914
##
## Variable importance
## alcohol volatile.acidity sulphates
## 33 15 15
## density chlorides total.sulfur.dioxide
## 14 9 5
## fixed.acidity citric.acid pH
## 4 2 1
## residual.sugar free.sulfur.dioxide
## 1 1
##
## Node number 1: 1199 observations, complexity param=0.1651285
## mean=5.620517, MSE=0.6758426
## left son=2 (745 obs) right son=3 (454 obs)
## Primary splits:
## alcohol < 10.525 to the left, improve=0.16512850, (0 missing)
## sulphates < 0.645 to the left, improve=0.12153840, (0 missing)
## volatile.acidity < 0.425 to the right, improve=0.10239570, (0 missing)
## citric.acid < 0.295 to the left, improve=0.06585085, (0 missing)
## density < 0.995565 to the right, improve=0.05332424, (0 missing)
## Surrogate splits:
## density < 0.995575 to the right, agree=0.757, adj=0.359, (0 split)
## chlorides < 0.0685 to the right, agree=0.698, adj=0.203, (0 split)
## volatile.acidity < 0.3675 to the right, agree=0.660, adj=0.101, (0 split)
## fixed.acidity < 6.35 to the right, agree=0.654, adj=0.086, (0 split)
## total.sulfur.dioxide < 17.5 to the right, agree=0.651, adj=0.077, (0 split)
##
## Node number 2: 745 observations, complexity param=0.03031418
## mean=5.359732, MSE=0.431667
## left son=4 (292 obs) right son=5 (453 obs)
## Primary splits:
## sulphates < 0.575 to the left, improve=0.07638452, (0 missing)
## volatile.acidity < 0.335 to the right, improve=0.06502699, (0 missing)
## alcohol < 9.975 to the left, improve=0.05203031, (0 missing)
## fixed.acidity < 12.55 to the left, improve=0.03121456, (0 missing)
## total.sulfur.dioxide < 83.5 to the right, improve=0.02718394, (0 missing)
## Surrogate splits:
## density < 0.996285 to the left, agree=0.664, adj=0.144, (0 split)
## volatile.acidity < 0.6525 to the right, agree=0.646, adj=0.096, (0 split)
## fixed.acidity < 6.15 to the left, agree=0.619, adj=0.027, (0 split)
## citric.acid < 0.115 to the left, agree=0.617, adj=0.024, (0 split)
## free.sulfur.dioxide < 56 to the right, agree=0.615, adj=0.017, (0 split)
##
## Node number 3: 454 observations, complexity param=0.05807039
## mean=6.048458, MSE=0.7817928
## left son=6 (123 obs) right son=7 (331 obs)
## Primary splits:
## sulphates < 0.585 to the left, improve=0.13257820, (0 missing)
## volatile.acidity < 0.8675 to the right, improve=0.12988970, (0 missing)
## citric.acid < 0.275 to the left, improve=0.10195960, (0 missing)
## alcohol < 11.55 to the left, improve=0.09956824, (0 missing)
## pH < 3.355 to the right, improve=0.08725303, (0 missing)
## Surrogate splits:
## volatile.acidity < 0.8375 to the right, agree=0.751, adj=0.081, (0 split)
## density < 0.99351 to the left, agree=0.749, adj=0.073, (0 split)
## total.sulfur.dioxide < 9.5 to the left, agree=0.744, adj=0.057, (0 split)
## residual.sugar < 7.65 to the right, agree=0.740, adj=0.041, (0 split)
## citric.acid < 0.015 to the left, agree=0.736, adj=0.024, (0 split)
##
## Node number 4: 292 observations
## mean=5.133562, MSE=0.3212024
##
## Node number 5: 453 observations, complexity param=0.01734112
## mean=5.505519, MSE=0.448645
## left son=10 (402 obs) right son=11 (51 obs)
## Primary splits:
## volatile.acidity < 0.335 to the right, improve=0.06914183, (0 missing)
## total.sulfur.dioxide < 81.5 to the right, improve=0.05979987, (0 missing)
## alcohol < 9.85 to the left, improve=0.05963577, (0 missing)
## fixed.acidity < 10.95 to the left, improve=0.04607798, (0 missing)
## free.sulfur.dioxide < 14.5 to the right, improve=0.02879088, (0 missing)
## Surrogate splits:
## fixed.acidity < 13.35 to the left, agree=0.896, adj=0.078, (0 split)
## chlorides < 0.0565 to the right, agree=0.896, adj=0.078, (0 split)
##
## Node number 6: 123 observations, complexity param=0.02987686
## mean=5.520325, MSE=0.8837332
## left son=12 (7 obs) right son=13 (116 obs)
## Primary splits:
## volatile.acidity < 0.99 to the right, improve=0.2227272, (0 missing)
## citric.acid < 0.255 to the left, improve=0.1756120, (0 missing)
## pH < 3.365 to the right, improve=0.1618761, (0 missing)
## alcohol < 11.45 to the left, improve=0.1429198, (0 missing)
## free.sulfur.dioxide < 30 to the left, improve=0.0960594, (0 missing)
##
## Node number 7: 331 observations, complexity param=0.02795546
## mean=6.244713, MSE=0.601747
## left son=14 (198 obs) right son=15 (133 obs)
## Primary splits:
## alcohol < 11.55 to the left, improve=0.11373380, (0 missing)
## volatile.acidity < 0.8575 to the right, improve=0.05562597, (0 missing)
## pH < 3.355 to the right, improve=0.04544472, (0 missing)
## citric.acid < 0.295 to the left, improve=0.04092179, (0 missing)
## chlorides < 0.0945 to the right, improve=0.04058732, (0 missing)
## Surrogate splits:
## density < 0.99477 to the right, agree=0.719, adj=0.301, (0 split)
## fixed.acidity < 5.7 to the right, agree=0.656, adj=0.143, (0 split)
## chlorides < 0.0525 to the right, agree=0.647, adj=0.120, (0 split)
## pH < 3.545 to the left, agree=0.637, adj=0.098, (0 split)
## residual.sugar < 4.25 to the left, agree=0.616, adj=0.045, (0 split)
##
## Node number 10: 402 observations, complexity param=0.01087516
## mean=5.442786, MSE=0.3959803
## left son=20 (78 obs) right son=21 (324 obs)
## Primary splits:
## total.sulfur.dioxide < 83 to the right, improve=0.05536057, (0 missing)
## alcohol < 9.975 to the left, improve=0.04825427, (0 missing)
## fixed.acidity < 12.55 to the left, improve=0.03179864, (0 missing)
## volatile.acidity < 0.605 to the right, improve=0.02846832, (0 missing)
## pH < 3.54 to the right, improve=0.02716902, (0 missing)
## Surrogate splits:
## free.sulfur.dioxide < 36.5 to the right, agree=0.821, adj=0.077, (0 split)
## residual.sugar < 5.85 to the right, agree=0.813, adj=0.038, (0 split)
## density < 1.002655 to the right, agree=0.813, adj=0.038, (0 split)
## sulphates < 1.78 to the right, agree=0.808, adj=0.013, (0 split)
##
## Node number 11: 51 observations
## mean=6, MSE=0.5882353
##
## Node number 12: 7 observations
## mean=3.714286, MSE=0.4897959
##
## Node number 13: 116 observations, complexity param=0.01699133
## mean=5.62931, MSE=0.6987961
## left son=26 (84 obs) right son=27 (32 obs)
## Primary splits:
## volatile.acidity < 0.385 to the right, improve=0.1698571, (0 missing)
## citric.acid < 0.255 to the left, improve=0.1497599, (0 missing)
## pH < 3.355 to the right, improve=0.1237777, (0 missing)
## free.sulfur.dioxide < 30 to the left, improve=0.1050906, (0 missing)
## alcohol < 11.45 to the left, improve=0.1022014, (0 missing)
## Surrogate splits:
## citric.acid < 0.255 to the left, agree=0.853, adj=0.469, (0 split)
## pH < 3.295 to the right, agree=0.784, adj=0.219, (0 split)
## free.sulfur.dioxide < 34 to the left, agree=0.759, adj=0.125, (0 split)
## total.sulfur.dioxide < 95.5 to the left, agree=0.759, adj=0.125, (0 split)
## density < 0.991705 to the right, agree=0.759, adj=0.125, (0 split)
##
## Node number 14: 198 observations, complexity param=0.01246619
## mean=6.030303, MSE=0.5546373
## left son=28 (35 obs) right son=29 (163 obs)
## Primary splits:
## chlorides < 0.0945 to the right, improve=0.09198652, (0 missing)
## volatile.acidity < 0.395 to the right, improve=0.08891009, (0 missing)
## pH < 3.49 to the right, improve=0.05937785, (0 missing)
## total.sulfur.dioxide < 51.5 to the right, improve=0.05133368, (0 missing)
## citric.acid < 0.275 to the left, improve=0.04553784, (0 missing)
## Surrogate splits:
## density < 1.0007 to the right, agree=0.854, adj=0.171, (0 split)
## fixed.acidity < 14.75 to the right, agree=0.843, adj=0.114, (0 split)
## pH < 2.98 to the left, agree=0.843, adj=0.114, (0 split)
## citric.acid < 0.675 to the right, agree=0.838, adj=0.086, (0 split)
##
## Node number 15: 133 observations
## mean=6.56391, MSE=0.5015546
##
## Node number 20: 78 observations
## mean=5.141026, MSE=0.1467784
##
## Node number 21: 324 observations
## mean=5.515432, MSE=0.4287742
##
## Node number 26: 84 observations
## mean=5.416667, MSE=0.6001984
##
## Node number 27: 32 observations
## mean=6.1875, MSE=0.5273438
##
## Node number 28: 35 observations
## mean=5.542857, MSE=0.5910204
##
## Node number 29: 163 observations
## mean=6.134969, MSE=0.4848508
Using the rpart.plot package we call the function of the same name. Red red wine.
rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE)
Let’s see how well the model does when predicting the test data.
p.rpart <- predict(m.rpart, mydata[-train, ]) # note the negative, drops train leaving test
If we examine the summary statistics of the predicted versus the real qualities we notice that our predictions may be covering the tail or extreme qualities less well. We can use the correlation to compare how well the predicted does against the real.
summary(p.rpart)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.714 5.139 5.515 5.653 6.135 6.564
summary(mydata[-train, ][["quality"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.682 6.000 8.000
cor(p.rpart, mydata[-train, ][["quality"]])
## [1] 0.5821109
For a complex biological problem a correlation of 0.58 is acceptable. However, we can measure how far off the predictions were using the mean absolute error. We can create a simple MAE() function as follows.
MAE <- function(actual, predicted) {
mean(abs(actual - predicted))
}
MAE(predicted = p.rpart,actual = mydata[-train, ][["quality"]])
## [1] 0.4915993
Not bad, but if just guessed the mean wine quality everytime we would get a mean absolute error of 0.640125. Our regression tree is better on average but not by much than the imputed mean.
Let’s improve things by using a model tree; replacing the leaf nodes with regression models. We will use the M5’ algorithm by Wang and Witten using the M5P() function from the RWeka package. The package has some Java dependencies so I’ll update this step at a later date.