The dataset redwines.csv comes from the website of Professor Eric Suess's Machine Learning course (Spring 2017) at http://www.sci.csueastbay.edu/~esuess/stat6620/#week-6. It is used here as an exercise in applying a regression tree and a model tree to red wine data in order to predict wine quality.
The red wine dataset contains 1599 observations of 12 variables: 11 numeric measurements of wine characteristics (acidity, sugar, chlorides, density, pH, and so on) plus the integer quality score to be predicted. A histogram of quality shows a roughly bell-shaped distribution, with most wines rated 5 or 6.
red.wine <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml10/redwines.csv")
str(red.wine)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 6.5 9.1 6.9 7.3 12.5 5.4 10.4 7.9 7.3 9.5 ...
## $ volatile.acidity : num 0.9 0.22 0.52 0.59 0.28 0.74 0.28 0.4 0.39 0.37 ...
## $ citric.acid : num 0 0.24 0.25 0.26 0.54 0.09 0.54 0.3 0.31 0.52 ...
## $ residual.sugar : num 1.6 2.1 2.6 2 2.3 1.7 2.7 1.8 2.4 2 ...
## $ chlorides : num 0.052 0.078 0.081 0.08 0.082 0.089 0.105 0.157 0.074 0.088 ...
## $ free.sulfur.dioxide : num 9 1 10 17 12 16 5 2 9 12 ...
## $ total.sulfur.dioxide: num 17 28 37 104 29 26 19 45 46 51 ...
## $ density : num 0.995 0.999 0.997 0.996 1 ...
## $ pH : num 3.5 3.41 3.46 3.28 3.11 3.67 3.25 3.31 3.41 3.29 ...
## $ sulphates : num 0.63 0.87 0.5 0.52 1.36 0.56 0.63 0.91 0.54 0.58 ...
## $ alcohol : num 10.9 10.3 11 9.9 9.8 11.6 9.5 9.5 9.4 11.1 ...
## $ quality : int 6 6 5 5 7 6 5 6 6 6 ...
hist(red.wine$quality)
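To put numbers on the shape of the quality distribution, the sketch below rebuilds the quality column from the counts that table() reports for the training and test sets further down in this report (adding the two tables gives all 1599 wines) and tabulates the share of each rating. The names quality_counts and quality_all are made up for this illustration; with the data loaded, prop.table(table(red.wine$quality)) should give the same shares.

```r
# Counts per rating, obtained by adding the train and test tables in this report
# (3: 9+1, 4: 38+15, 5: 513+168, 6: 474+164, 7: 149+50, 8: 16+2).
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild the quality column (hypothetical name, for illustration only).
quality_all <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Share of wines at each rating, in percent.
quality_share <- round(100 * prop.table(table(quality_all)), 1)
print(quality_share)

# Fraction of wines rated 5 or 6.
share_5_or_6 <- mean(quality_all %in% c(5, 6))
print(share_5_or_6)
```

The distribution is discrete and concentrated on two ratings, so "normal" is best read loosely here as "single-peaked and roughly symmetric".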
The summary statistics below (minimum, quartiles, median, mean and maximum) give a feel for each feature in the red wine dataset. The features sit on very different scales, but regression trees and model trees split on thresholds of individual features, so no normalization is needed before fitting.
summary(red.wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
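The claim that trees do not need normalized inputs can be checked on a toy example. The sketch below is not part of the original analysis: it simulates two features on very different scales (the data frame toy and its columns are invented for illustration), fits rpart() once on the raw values and once on standardized values, and compares the fitted values.

```r
library(rpart)

set.seed(1)  # arbitrary seed, only to make the toy example repeatable
n <- 300
toy <- data.frame(
  small = runif(n, 0, 1),    # feature on a 0-1 scale
  large = runif(n, 0, 300)   # feature on a 0-300 scale
)
toy$y <- 5 + 2 * (toy$small > 0.5) - 1.5 * (toy$large > 150) + rnorm(n, sd = 0.3)

# Same data with both features standardized to mean 0 and sd 1.
toy_scaled <- toy
toy_scaled$small <- as.numeric(scale(toy$small))
toy_scaled$large <- as.numeric(scale(toy$large))

fit_raw    <- rpart(y ~ small + large, data = toy)
fit_scaled <- rpart(y ~ small + large, data = toy_scaled)

# Standardizing keeps the ordering of each feature, so the candidate splits
# group the rows in the same way and the fitted values should agree.
same_predictions <- isTRUE(all.equal(unname(predict(fit_raw)),
                                     unname(predict(fit_scaled))))
print(same_predictions)
```

Only the split thresholds change units; the rows that end up in each leaf do not.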
The sample() function draws a random 75% of the row indices for the training set; the remaining 25% form the test set. This gives 1199 training observations and 400 test observations. A few checks confirm that the split is correct and that the two sets add up to the original 1599 rows. table() and hist() then show, numerically and visually, how the quality levels are distributed in each set. The proportions differ somewhat between the two sets (the test set, for example, has only one wine rated 3 and two rated 8), but both distributions are roughly bell-shaped and comparable to each other. No random seed is set, so rerunning the code produces a different split and slightly different numbers.
tsample = sample(nrow(red.wine), 0.75*nrow(red.wine))
head(tsample, n = 10)
## [1] 781 1058 348 99 435 1453 1558 1392 1316 726
length(tsample)
## [1] 1199
red.wine.train <- red.wine[tsample, ]
red.wine.test <- red.wine[-tsample, ]
nrow(red.wine.train)
## [1] 1199
nrow(red.wine.test)
## [1] 400
sum(nrow(red.wine.train)+nrow(red.wine.test))
## [1] 1599
table(red.wine.train$quality)
##
## 3 4 5 6 7 8
## 9 38 513 474 149 16
table(red.wine.test$quality)
##
## 3 4 5 6 7 8
## 1 15 168 164 50 2
hist(red.wine.train$quality)
hist(red.wine.test$quality)
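A repeatable version of the same split might look like the sketch below. The seed value and the stand-in data frame wine_like are assumptions made for illustration; the original analysis sets no seed, so this will not reproduce the exact rows shown above.

```r
set.seed(2017)  # assumed seed; the original analysis does not set one

# Stand-in data frame with the same number of rows as red.wine.
wine_like <- data.frame(id = seq_len(1599))

n_train   <- floor(0.75 * nrow(wine_like))       # 75% of the rows, rounded down
train_idx <- sample(nrow(wine_like), n_train)    # row indices, without replacement

train_set <- wine_like[train_idx, , drop = FALSE]
test_set  <- wine_like[-train_idx, , drop = FALSE]

# Sanity checks: the sizes add up and no row appears in both sets.
print(c(train = nrow(train_set), test = nrow(test_set)))
print(length(intersect(train_set$id, test_set$id)))
```

Rounding the training size down explicitly makes the intended number of rows visible instead of leaving a fractional size for sample() to handle.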
Using the rpart() function from the rpart package, a regression tree is fitted with quality as the response and all remaining features as predictors in the training set. Alcohol is the most important feature, since it is the first one to be split on; sulphates comes next (both branches below the root split on sulphates at 0.635), followed by volatile acidity, fixed acidity and pH further down the tree.
Although the model is called a “regression tree”, it does not fit a linear regression. In the formulation used here, splits are chosen by Standard Deviation Reduction (SDR), which plays the same role that entropy reduction plays in a classification decision tree. SDR = sd(T) - sum(|Ti|/|T| * sd(Ti)), where T is the set of quality values in a node before splitting, Ti is the subset that falls into the i-th branch after splitting, and |T| and |Ti| are the numbers of observations in each. The split with the largest SDR is chosen, because a larger SDR means the size-weighted standard deviation of the branches is smaller, that is, the branches are more homogeneous than the parent. (rpart's default "anova" method applies the same idea using the reduction in the sum of squared deviations from the mean, which is what the deviance column in the output below reports.)
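The SDR formula above can be tried on a small example. The sketch below defines a helper sdr() (a hypothetical name, not an rpart function) for a two-way split and compares two candidate splits of a toy response vector.

```r
# Standard deviation reduction for a binary split.
# y:       numeric response values in the parent node
# in_left: logical vector, TRUE for rows sent to the left branch
sdr <- function(y, in_left) {
  parts <- list(y[in_left], y[!in_left])
  weighted_sd <- sum(sapply(parts, function(p) length(p) / length(y) * sd(p)))
  sd(y) - weighted_sd
}

# Toy response values, already sorted for readability.
y <- c(1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7, 7, 7, 7)

# Two candidate splits: first 9 values vs the rest, and first 7 vs the rest.
split_a <- seq_along(y) <= 9
split_b <- seq_along(y) <= 7

sdr_a <- sdr(y, split_a)
sdr_b <- sdr(y, split_b)
print(c(split_a = sdr_a, split_b = sdr_b))
```

The split with the larger value leaves more homogeneous branches and would be the one chosen under this criterion.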
library(rpart)
## Warning: package 'rpart' was built under R version 3.3.3
reg.tree <- rpart(quality ~ ., data = red.wine.train)
reg.tree
## n= 1199
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1199 801.18100 5.637198
## 2) alcohol< 10.525 740 317.21490 5.363514
## 4) sulphates< 0.635 457 162.87530 5.229759 *
## 5) sulphates>=0.635 283 132.96110 5.579505
## 10) alcohol< 9.85 168 52.47619 5.404762
## 20) fixed.acidity< 10.75 152 37.55263 5.328947 *
## 21) fixed.acidity>=10.75 16 5.75000 6.125000 *
## 11) alcohol>=9.85 115 67.86087 5.834783
## 22) pH>=3.375 35 20.57143 5.428571 *
## 23) pH< 3.375 80 38.98750 6.012500 *
## 3) alcohol>=10.525 459 339.17650 6.078431
## 6) sulphates< 0.635 201 147.26370 5.711443
## 12) volatile.acidity>=0.99 8 3.50000 3.750000 *
## 13) volatile.acidity< 0.99 193 111.70980 5.792746
## 26) volatile.acidity>=0.495 109 54.97248 5.550459 *
## 27) volatile.acidity< 0.495 84 42.03571 6.107143 *
## 7) sulphates>=0.635 258 143.75190 6.364341
## 14) alcohol< 11.55 155 78.15484 6.135484 *
## 15) alcohol>=11.55 103 45.26214 6.708738 *
Calling summary() on the model object shows in detail how the regression tree splits the training set. The tree has 19 nodes in total: 9 splits (the last row of the CP table shows nsplit = 9) and 10 terminal leaves.
summary(reg.tree)
## Call:
## rpart(formula = quality ~ ., data = red.wine.train)
## n= 1199
##
## CP nsplit rel error xerror xstd
## 1 0.18072028 0 1.0000000 1.0012091 0.04467913
## 2 0.06011232 1 0.8192797 0.8344001 0.04201625
## 3 0.04000823 2 0.7591674 0.7994619 0.03918898
## 4 0.02668368 3 0.7191592 0.7906666 0.03824026
## 5 0.02538124 4 0.6924755 0.7764970 0.03784911
## 6 0.01834998 5 0.6670942 0.7494608 0.03648726
## 7 0.01575683 6 0.6487443 0.7464720 0.03651462
## 8 0.01145005 7 0.6329874 0.7337141 0.03619204
## 9 0.01036213 8 0.6215374 0.7198029 0.03565632
## 10 0.01000000 9 0.6111753 0.7104966 0.03513372
##
## Variable importance
## alcohol volatile.acidity density
## 32 14 14
## sulphates fixed.acidity citric.acid
## 13 8 6
## chlorides pH total.sulfur.dioxide
## 5 5 3
## residual.sugar
## 1
##
## Node number 1: 1199 observations, complexity param=0.1807203
## mean=5.637198, MSE=0.6682077
## left son=2 (740 obs) right son=3 (459 obs)
## Primary splits:
## alcohol < 10.525 to the left, improve=0.18072030, (0 missing)
## sulphates < 0.645 to the left, improve=0.12517630, (0 missing)
## volatile.acidity < 0.425 to the right, improve=0.11094080, (0 missing)
## density < 0.99537 to the right, improve=0.07452618, (0 missing)
## citric.acid < 0.295 to the left, improve=0.06587047, (0 missing)
## Surrogate splits:
## density < 0.99567 to the right, agree=0.771, adj=0.403, (0 split)
## chlorides < 0.0685 to the right, agree=0.680, adj=0.163, (0 split)
## volatile.acidity < 0.3675 to the right, agree=0.662, adj=0.118, (0 split)
## fixed.acidity < 6.75 to the right, agree=0.654, adj=0.096, (0 split)
## total.sulfur.dioxide < 17.5 to the right, agree=0.651, adj=0.089, (0 split)
##
## Node number 2: 740 observations, complexity param=0.02668368
## mean=5.363514, MSE=0.4286687
## left son=4 (457 obs) right son=5 (283 obs)
## Primary splits:
## sulphates < 0.635 to the left, improve=0.06739426, (0 missing)
## alcohol < 9.975 to the left, improve=0.06010847, (0 missing)
## volatile.acidity < 0.5475 to the right, improve=0.05536277, (0 missing)
## total.sulfur.dioxide < 46.5 to the right, improve=0.03207202, (0 missing)
## fixed.acidity < 12.55 to the left, improve=0.03119502, (0 missing)
## Surrogate splits:
## chlorides < 0.1005 to the left, agree=0.661, adj=0.113, (0 split)
## citric.acid < 0.395 to the left, agree=0.657, adj=0.102, (0 split)
## pH < 3.075 to the right, agree=0.654, adj=0.095, (0 split)
## volatile.acidity < 0.335 to the right, agree=0.651, adj=0.088, (0 split)
## density < 0.99822 to the left, agree=0.645, adj=0.071, (0 split)
##
## Node number 3: 459 observations, complexity param=0.06011232
## mean=6.078431, MSE=0.7389466
## left son=6 (201 obs) right son=7 (258 obs)
## Primary splits:
## sulphates < 0.635 to the left, improve=0.14199350, (0 missing)
## volatile.acidity < 0.9125 to the right, improve=0.13277910, (0 missing)
## citric.acid < 0.295 to the left, improve=0.12171660, (0 missing)
## pH < 3.355 to the right, improve=0.09761722, (0 missing)
## alcohol < 11.55 to the left, improve=0.09170903, (0 missing)
## Surrogate splits:
## citric.acid < 0.205 to the left, agree=0.675, adj=0.259, (0 split)
## fixed.acidity < 7.85 to the left, agree=0.662, adj=0.229, (0 split)
## volatile.acidity < 0.5875 to the right, agree=0.647, adj=0.194, (0 split)
## density < 0.994255 to the left, agree=0.647, adj=0.194, (0 split)
## pH < 3.425 to the right, agree=0.643, adj=0.184, (0 split)
##
## Node number 4: 457 observations
## mean=5.229759, MSE=0.356401
##
## Node number 5: 283 observations, complexity param=0.01575683
## mean=5.579505, MSE=0.4698273
## left son=10 (168 obs) right son=11 (115 obs)
## Primary splits:
## alcohol < 9.85 to the left, improve=0.09494557, (0 missing)
## total.sulfur.dioxide < 29.5 to the right, improve=0.09341644, (0 missing)
## volatile.acidity < 0.355 to the right, improve=0.09264380, (0 missing)
## fixed.acidity < 11.45 to the left, improve=0.08862331, (0 missing)
## free.sulfur.dioxide < 7.5 to the right, improve=0.07069268, (0 missing)
## Surrogate splits:
## fixed.acidity < 9.35 to the left, agree=0.657, adj=0.157, (0 split)
## chlorides < 0.0655 to the right, agree=0.640, adj=0.113, (0 split)
## residual.sugar < 2.65 to the left, agree=0.636, adj=0.104, (0 split)
## volatile.acidity < 0.305 to the right, agree=0.629, adj=0.087, (0 split)
## density < 0.99538 to the right, agree=0.608, adj=0.035, (0 split)
##
## Node number 6: 201 observations, complexity param=0.04000823
## mean=5.711443, MSE=0.7326551
## left son=12 (8 obs) right son=13 (193 obs)
## Primary splits:
## volatile.acidity < 0.99 to the right, improve=0.21766290, (0 missing)
## pH < 3.365 to the right, improve=0.11510710, (0 missing)
## alcohol < 11.45 to the left, improve=0.11154450, (0 missing)
## citric.acid < 0.255 to the left, improve=0.10702790, (0 missing)
## density < 0.995205 to the right, improve=0.05313951, (0 missing)
##
## Node number 7: 258 observations, complexity param=0.02538124
## mean=6.364341, MSE=0.5571781
## left son=14 (155 obs) right son=15 (103 obs)
## Primary splits:
## alcohol < 11.55 to the left, improve=0.14145870, (0 missing)
## density < 1.0009 to the right, improve=0.06895615, (0 missing)
## chlorides < 0.0945 to the right, improve=0.06373035, (0 missing)
## total.sulfur.dioxide < 53.5 to the right, improve=0.05826726, (0 missing)
## fixed.acidity < 13.4 to the right, improve=0.05823348, (0 missing)
## Surrogate splits:
## density < 0.994875 to the right, agree=0.694, adj=0.233, (0 split)
## chlorides < 0.053 to the right, agree=0.647, adj=0.117, (0 split)
## fixed.acidity < 5.7 to the right, agree=0.640, adj=0.097, (0 split)
## citric.acid < 0.635 to the left, agree=0.636, adj=0.087, (0 split)
## residual.sugar < 4.25 to the left, agree=0.628, adj=0.068, (0 split)
##
## Node number 10: 168 observations, complexity param=0.01145005
## mean=5.404762, MSE=0.3123583
## left son=20 (152 obs) right son=21 (16 obs)
## Primary splits:
## fixed.acidity < 10.75 to the left, improve=0.17481370, (0 missing)
## total.sulfur.dioxide < 29.5 to the right, improve=0.16173710, (0 missing)
## volatile.acidity < 0.315 to the right, improve=0.11120600, (0 missing)
## free.sulfur.dioxide < 14.5 to the right, improve=0.09169596, (0 missing)
## density < 0.997605 to the left, improve=0.08175771, (0 missing)
## Surrogate splits:
## pH < 2.89 to the right, agree=0.917, adj=0.125, (0 split)
## citric.acid < 0.71 to the left, agree=0.911, adj=0.062, (0 split)
##
## Node number 11: 115 observations, complexity param=0.01036213
## mean=5.834783, MSE=0.5900945
## left son=22 (35 obs) right son=23 (80 obs)
## Primary splits:
## pH < 3.375 to the right, improve=0.12233770, (0 missing)
## chlorides < 0.0625 to the right, improve=0.09949257, (0 missing)
## free.sulfur.dioxide < 31.5 to the right, improve=0.08829405, (0 missing)
## volatile.acidity < 0.385 to the right, improve=0.08213467, (0 missing)
## total.sulfur.dioxide < 13.5 to the right, improve=0.07911845, (0 missing)
## Surrogate splits:
## fixed.acidity < 7.15 to the left, agree=0.843, adj=0.486, (0 split)
## citric.acid < 0.195 to the left, agree=0.826, adj=0.429, (0 split)
## free.sulfur.dioxide < 25.5 to the right, agree=0.739, adj=0.143, (0 split)
## density < 0.99637 to the left, agree=0.739, adj=0.143, (0 split)
## residual.sugar < 7.45 to the right, agree=0.713, adj=0.057, (0 split)
##
## Node number 12: 8 observations
## mean=3.75, MSE=0.4375
##
## Node number 13: 193 observations, complexity param=0.01834998
## mean=5.792746, MSE=0.5788075
## left son=26 (109 obs) right son=27 (84 obs)
## Primary splits:
## volatile.acidity < 0.495 to the right, improve=0.13160570, (0 missing)
## pH < 3.365 to the right, improve=0.08567356, (0 missing)
## citric.acid < 0.255 to the left, improve=0.08366682, (0 missing)
## alcohol < 11.45 to the left, improve=0.08318212, (0 missing)
## free.sulfur.dioxide < 31.5 to the left, improve=0.06759142, (0 missing)
## Surrogate splits:
## citric.acid < 0.165 to the left, agree=0.870, adj=0.702, (0 split)
## pH < 3.295 to the right, agree=0.746, adj=0.417, (0 split)
## fixed.acidity < 7.85 to the left, agree=0.705, adj=0.321, (0 split)
## total.sulfur.dioxide < 11.5 to the right, agree=0.642, adj=0.179, (0 split)
## alcohol < 11.65 to the left, agree=0.627, adj=0.143, (0 split)
##
## Node number 14: 155 observations
## mean=6.135484, MSE=0.5042248
##
## Node number 15: 103 observations
## mean=6.708738, MSE=0.4394382
##
## Node number 20: 152 observations
## mean=5.328947, MSE=0.2470568
##
## Node number 21: 16 observations
## mean=6.125, MSE=0.359375
##
## Node number 22: 35 observations
## mean=5.428571, MSE=0.5877551
##
## Node number 23: 80 observations
## mean=6.0125, MSE=0.4873438
##
## Node number 26: 109 observations
## mean=5.550459, MSE=0.5043347
##
## Node number 27: 84 observations
## mean=6.107143, MSE=0.5004252
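Node counts do not have to be read off the printout by hand. The sketch below counts them from the frame component of an rpart object; it uses the built-in mtcars data only so that it runs on its own, and applying the same three lines to reg.tree is assumed to work the same way.

```r
library(rpart)

# Small built-in dataset, used only so the example is self-contained.
toy_tree <- rpart(mpg ~ ., data = mtcars)

n_nodes  <- nrow(toy_tree$frame)                   # one row per node
n_leaves <- sum(toy_tree$frame$var == "<leaf>")    # terminal nodes
n_splits <- n_nodes - n_leaves                     # internal (split) nodes

print(c(nodes = n_nodes, splits = n_splits, leaves = n_leaves))
```

Because every split is binary, the number of leaves is always one more than the number of splits.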
The rpart.plot() function draws the model as an easily read tree diagram. The visualization shows the root, the internal nodes and the leaves of the regression tree; each node is labelled with its split criterion, its predicted quality, the number of observations it contains and its percentage of the data. This model is then used to predict the quality of the wines in the test set. A prediction is made by following the splits until the test observation reaches a leaf; the observation is then assigned the mean quality of the training observations that fell into that same leaf.
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.3.3
rpart.plot(reg.tree, digits = 4, fallen.leaves = TRUE, type = 4, extra = 101)
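The statement that a regression tree predicts the leaf mean can be verified directly. The sketch below again uses mtcars as a stand-in dataset (an assumption made so the code is self-contained) and compares predict() on the training rows with the per-leaf means of the response.

```r
library(rpart)

toy_tree <- rpart(mpg ~ ., data = mtcars)

# toy_tree$where gives, for each training row, the row of toy_tree$frame
# (that is, the leaf) in which the observation ended up.
leaf_id <- toy_tree$where

# Mean of the response within each leaf, repeated for every row in that leaf.
leaf_mean <- ave(mtcars$mpg, leaf_id)

# predict() on the training data should return those same leaf means.
matches_leaf_mean <- isTRUE(all.equal(unname(predict(toy_tree)), leaf_mean))
print(matches_leaf_mean)
```

This is also why the predictions can take only as many distinct values as the tree has leaves.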
The regression tree is applied to the test set, producing a vector of predicted quality values. summary() allows the predicted and actual quality to be compared. The actual quality spans a wider range, from 3 to 8, while the predictions range only from 3.75 to 6.71. Because every prediction is a leaf mean, the model cannot do well on wines at the extremes, particularly the best ones, but it is adequate for the middle ratings of 5 and 6, where most observations lie.
rt.predict <- predict(reg.tree, red.wine.test)
summary(rt.predict)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.750 5.230 5.429 5.658 6.125 6.709
summary(red.wine.test$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.632 6.000 8.000
head(rt.predict, n = 15)
## 1 2 5 9 16 27 29 30
## 5.550459 5.428571 6.125000 5.229759 6.012500 5.550459 6.107143 6.708738
## 32 46 53 58 61 67 70
## 5.229759 5.229759 6.135484 5.229759 6.708738 5.229759 5.328947
Several measures can be used to evaluate model performance, including the correlation and the mean absolute error (MAE). The correlation between the actual quality and the regression tree's predictions is 0.49, and the MAE, which tells how far on average the predicted values are from the actual values, is 0.55. For comparison, predicting a single constant near the mean quality for every test wine gives an MAE of about 0.67, so the tree improves on that baseline only modestly.
cor(rt.predict, red.wine.test$quality)
## [1] 0.4902011
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}
MAE(red.wine.test$quality, rt.predict)
## [1] 0.5465131
mean(red.wine.train$quality)
## [1] 5.637198
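# Baseline: predict one constant for every test wine. The value 5.620517 is
# hard-coded and differs slightly from the training mean printed above (5.637198).
MAE(5.620517, red.wine.test$quality)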
## [1] 0.6678586
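The baseline figure can be reproduced without the CSV file, because the MAE of a constant prediction depends only on how many test wines have each rating. The sketch below rebuilds the two quality vectors from the table() counts printed earlier (train_quality and test_quality are names invented here) and evaluates both the hard-coded constant and the actual training mean.

```r
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Quality values rebuilt from the table() counts printed for each set.
train_quality <- rep(3:8, times = c(9, 38, 513, 474, 149, 16))
test_quality  <- rep(3:8, times = c(1, 15, 168, 164, 50, 2))

# Baseline 1: the hard-coded constant used in the analysis.
mae_constant <- MAE(test_quality, 5.620517)

# Baseline 2: the mean quality of the training set.
mae_train_mean <- MAE(test_quality, mean(train_quality))

print(c(constant = mae_constant, train_mean = mae_train_mean))
```

The two baselines are close to each other, so the choice of constant does not change the comparison with the tree.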
An improvement can be made by using a model tree on the red wine dataset. The data are still partitioned as in the regression tree above, but in addition a linear regression model is fitted to the training observations that reach each leaf. A test observation that follows the same path is then predicted with that leaf's regression equation rather than with the leaf mean.
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.3.3
model.tree <- M5P(quality ~ ., data = red.wine.train)
model.tree
## M5 pruned model tree:
## (using smoothed linear models)
##
## alcohol <= 10.525 :
## | total.sulfur.dioxide <= 55.5 :
## | | sulphates <= 0.635 :
## | | | volatile.acidity <= 0.575 : LM1 (139/64.281%)
## | | | volatile.acidity > 0.575 :
## | | | | volatile.acidity <= 0.653 : LM2 (59/80.806%)
## | | | | volatile.acidity > 0.653 :
## | | | | | volatile.acidity <= 0.742 : LM3 (44/47.878%)
## | | | | | volatile.acidity > 0.742 :
## | | | | | | residual.sugar <= 2.15 : LM4 (12/77.026%)
## | | | | | | residual.sugar > 2.15 : LM5 (18/55.492%)
## | | sulphates > 0.635 :
## | | | alcohol <= 9.85 : LM6 (111/64.415%)
## | | | alcohol > 9.85 : LM7 (88/76.677%)
## | total.sulfur.dioxide > 55.5 : LM8 (269/52.938%)
## alcohol > 10.525 : LM9 (459/85.738%)
##
## LM num: 1
## quality =
## -0.0027 * fixed.acidity
## - 0.2221 * volatile.acidity
## - 0.6249 * citric.acid
## + 0.0011 * residual.sugar
## - 0.1585 * chlorides
## + 0.0001 * free.sulfur.dioxide
## - 0.0053 * total.sulfur.dioxide
## + 3.6894 * density
## - 0.1531 * pH
## + 2.2974 * sulphates
## + 0.0562 * alcohol
## + 0.8359
##
## LM num: 2
## quality =
## 0.0285 * fixed.acidity
## - 0.5379 * volatile.acidity
## - 2.6445 * citric.acid
## + 0.0011 * residual.sugar
## - 0.1585 * chlorides
## + 0.0001 * free.sulfur.dioxide
## - 0.0002 * total.sulfur.dioxide
## + 3.6894 * density
## - 0.1554 * pH
## + 0.3199 * sulphates
## + 0.5567 * alcohol
## - 2.9259
##
## LM num: 3
## quality =
## 0.0654 * fixed.acidity
## - 0.6637 * volatile.acidity
## - 0.5546 * citric.acid
## + 0.0445 * residual.sugar
## - 0.1585 * chlorides
## + 0.0001 * free.sulfur.dioxide
## - 0.0002 * total.sulfur.dioxide
## + 30.7912 * density
## - 0.1554 * pH
## + 0.6586 * sulphates
## + 0.3652 * alcohol
## - 29.0311
##
## LM num: 4
## quality =
## 0.2734 * fixed.acidity
## - 0.7191 * volatile.acidity
## - 0.5546 * citric.acid
## + 0.058 * residual.sugar
## - 0.1585 * chlorides
## - 0.0104 * free.sulfur.dioxide
## + 0.0048 * total.sulfur.dioxide
## - 22.4176 * density
## - 0.1554 * pH
## + 0.764 * sulphates
## + 0.1484 * alcohol
## + 24.1592
##
## LM num: 5
## quality =
## 0.141 * fixed.acidity
## - 0.7191 * volatile.acidity
## - 0.5546 * citric.acid
## + 0.058 * residual.sugar
## - 0.1585 * chlorides
## - 0.0085 * free.sulfur.dioxide
## + 0.0071 * total.sulfur.dioxide
## - 22.4176 * density
## - 0.1554 * pH
## + 1.9832 * sulphates
## + 0.1484 * alcohol
## + 24.6412
##
## LM num: 6
## quality =
## 0.0851 * fixed.acidity
## - 0.2241 * volatile.acidity
## - 0.0684 * citric.acid
## + 0.0011 * residual.sugar
## - 0.3906 * chlorides
## + 0.0001 * free.sulfur.dioxide
## - 0.0009 * total.sulfur.dioxide
## + 15.3794 * density
## - 0.3418 * pH
## + 0.137 * sulphates
## + 0.0807 * alcohol
## - 10.0702
##
## LM num: 7
## quality =
## -0.1915 * fixed.acidity
## - 0.9201 * volatile.acidity
## - 0.0684 * citric.acid
## + 0.0899 * residual.sugar
## - 2.3144 * chlorides
## - 0.0168 * free.sulfur.dioxide
## - 0.001 * total.sulfur.dioxide
## + 17.7087 * density
## - 2.8617 * pH
## + 0.137 * sulphates
## + 0.0899 * alcohol
## - 0.88
##
## LM num: 8
## quality =
## -0.4696 * volatile.acidity
## - 0.0187 * citric.acid
## + 0.0231 * residual.sugar
## - 0.095 * chlorides
## + 0.0001 * free.sulfur.dioxide
## - 0.0033 * total.sulfur.dioxide
## - 0.0413 * pH
## + 0.218 * sulphates
## + 0.2326 * alcohol
## + 3.489
##
## LM num: 9
## quality =
## -1.1556 * volatile.acidity
## - 0.0088 * citric.acid
## - 0.0544 * chlorides
## + 0.0103 * free.sulfur.dioxide
## - 0.0033 * total.sulfur.dioxide
## - 57.4832 * density
## - 1.0156 * pH
## + 1.6696 * sulphates
## + 0.2536 * alcohol
## + 63.186
##
## Number of Rules : 9
summary(model.tree)
##
## === Summary ===
##
## Correlation coefficient 0.6848
## Mean absolute error 0.4712
## Root mean squared error 0.596
## Relative absolute error 68.4244 %
## Root relative squared error 72.9144 %
## Total Number of Instances 1199
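To see how a leaf equation turns into a prediction, the sketch below evaluates LM9 by hand for the first wine in the dataset, whose alcohol value of 10.9 sends it down the alcohol > 10.525 branch. It assumes that the printed LM9 equation is what gets evaluated at that leaf, and it uses the feature values as displayed by str(), which are rounded. Density in particular is shown to only three digits and has a large coefficient, so the result is indicative rather than exact.

```r
# Feature values of the first wine, as displayed by str(red.wine).
# str() rounds what it prints, so treat these as approximate.
wine1 <- c(volatile.acidity = 0.9, citric.acid = 0, chlorides = 0.052,
           free.sulfur.dioxide = 9, total.sulfur.dioxide = 17,
           density = 0.995, pH = 3.5, sulphates = 0.63, alcohol = 10.9)

# Coefficients of LM9, copied from the printed model tree.
lm9_coef <- c(volatile.acidity = -1.1556, citric.acid = -0.0088,
              chlorides = -0.0544, free.sulfur.dioxide = 0.0103,
              total.sulfur.dioxide = -0.0033, density = -57.4832,
              pH = -1.0156, sulphates = 1.6696, alcohol = 0.2536)
lm9_intercept <- 63.186

# alcohol > 10.525, so the tree routes this wine to LM9.
stopifnot(wine1["alcohol"] > 10.525)

# Intercept plus the sum of coefficient * feature value.
lm9_prediction <- lm9_intercept + sum(lm9_coef * wine1[names(lm9_coef)])
print(unname(lm9_prediction))
```

Unlike a regression tree leaf, which returns the same mean for every wine that reaches it, this leaf returns a value that moves with the wine's own measurements.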
The model tree is then used to predict red wine quality in the test set, and summary statistics of the predictions are shown. The predictions range from 3.93 to 7.28, slightly wider than the regression tree's 3.75 to 6.71 but still short of the actual range of 3 to 8. Overall performance is better than the regression tree's: the correlation between actual and predicted quality is 0.58, higher than the regression tree's 0.49, and the MAE is 0.51, lower than the regression tree's 0.55, so the model tree's average error is smaller. Note that summary(model.tree) above reports performance on the training data (correlation 0.68, MAE 0.47), which is more optimistic than these test-set figures.
mt.predict <- predict(model.tree, red.wine.test)
summary(mt.predict)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.927 5.183 5.497 5.613 6.013 7.282
cor(mt.predict, red.wine.test$quality)
## [1] 0.5767396
MAE(red.wine.test$quality, mt.predict)
## [1] 0.5050358
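One way to put the three MAE values side by side is to express each model's error as a reduction relative to the constant-prediction baseline. The sketch below does that arithmetic with the values copied from the output in this report; the vector names are illustrative.

```r
# MAE values copied from the output in this report.
mae <- c(baseline = 0.6678586, regression_tree = 0.5465131, model_tree = 0.5050358)

# Fraction of the baseline error removed by each model.
reduction <- 1 - mae[c("regression_tree", "model_tree")] / mae["baseline"]
print(round(100 * reduction, 1))
```

These figures come from a single random split, so the size of the gap between the two models should be read with that in mind.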
Conclusion: A regression tree was first used to analyze the red wine dataset and predict quality. The algorithm is similar to a classification decision tree, except that splits are chosen by SDR, the reduction in standard deviation achieved by the split. The mean quality of the training observations in a leaf is assigned to any future observation that reaches the same leaf. A model tree was then introduced as an improvement on the same dataset: it fits a linear regression to the observations in each leaf and uses that equation to predict future observations that reach the leaf. On this train/test split the model tree was the more accurate of the two, with a smaller MAE (0.51 versus 0.55) and a higher correlation with the actual values (0.58 versus 0.49), since it in effect builds a “customized regression” for each leaf. Both regression trees and model trees are suitable for numeric prediction.