Regression Tree & Model Tree for Analyzing Red Wine Quality

Part 1: Collecting Data

The dataset redwines.csv was obtained from the website of Professor Eric Suess's Machine Learning course (Spring 2017) at http://www.sci.csueastbay.edu/~esuess/stat6620/#week-6. It is used here as an exercise in applying a regression tree and a model tree to red wine data in order to predict wine quality.

Part 2: Regression Trees and Model Trees

The red wine dataset contains 1599 observations of 12 variables: 11 numeric measurements of wine characteristics such as acidity, sugar, chlorides, density, and pH, plus an integer quality rating. A histogram of the quality rating shows a roughly bell-shaped distribution centered on ratings of 5 and 6.

red.wine <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml10/redwines.csv")
str(red.wine)

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  6.5 9.1 6.9 7.3 12.5 5.4 10.4 7.9 7.3 9.5 ...
##  $ volatile.acidity    : num  0.9 0.22 0.52 0.59 0.28 0.74 0.28 0.4 0.39 0.37 ...
##  $ citric.acid         : num  0 0.24 0.25 0.26 0.54 0.09 0.54 0.3 0.31 0.52 ...
##  $ residual.sugar      : num  1.6 2.1 2.6 2 2.3 1.7 2.7 1.8 2.4 2 ...
##  $ chlorides           : num  0.052 0.078 0.081 0.08 0.082 0.089 0.105 0.157 0.074 0.088 ...
##  $ free.sulfur.dioxide : num  9 1 10 17 12 16 5 2 9 12 ...
##  $ total.sulfur.dioxide: num  17 28 37 104 29 26 19 45 46 51 ...
##  $ density             : num  0.995 0.999 0.997 0.996 1 ...
##  $ pH                  : num  3.5 3.41 3.46 3.28 3.11 3.67 3.25 3.31 3.41 3.29 ...
##  $ sulphates           : num  0.63 0.87 0.5 0.52 1.36 0.56 0.63 0.91 0.54 0.58 ...
##  $ alcohol             : num  10.9 10.3 11 9.9 9.8 11.6 9.5 9.5 9.4 11.1 ...
##  $ quality             : int  6 6 5 5 7 6 5 6 6 6 ...

hist(red.wine$quality)

The summary statistics (minimum, quartiles, mean, and maximum) give a feel for each feature in the red wine dataset. Regression trees and model trees split on one feature at a time, so features measured on different scales do not need to be normalized.

summary(red.wine)

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The sample() function draws a random 75% of the row indices for the training set; the remaining 25% form the test set. This gives 1199 training observations and 400 test observations. No random seed is set, so the exact split, and every number that depends on it, will differ from run to run. A few checks confirm that the two sets add up to the original 1599 rows. The table() and hist() functions are then used to compare the distribution of quality in the two sets: the counts differ, but both sets are concentrated on ratings of 5 and 6 and have similar shapes.

tsample = sample(nrow(red.wine), 0.75*nrow(red.wine))
head(tsample, n = 10)

##  [1]  781 1058  348   99  435 1453 1558 1392 1316  726

length(tsample)

## [1] 1199

red.wine.train <- red.wine[tsample, ]
red.wine.test <- red.wine[-tsample, ]
nrow(red.wine.train)

## [1] 1199

nrow(red.wine.test)

## [1] 400

sum(nrow(red.wine.train)+nrow(red.wine.test))

## [1] 1599

table(red.wine.train$quality)

## 
##   3   4   5   6   7   8 
##   9  38 513 474 149  16

table(red.wine.test$quality)

## 
##   3   4   5   6   7   8 
##   1  15 168 164  50   2

hist(red.wine.train$quality)

hist(red.wine.test$quality)
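
Because the split above depends on the state of the random number generator, a fixed seed makes it repeatable. The following is a minimal sketch on a synthetic stand-in for red.wine; the seed value 123, the data frame toy, and the column id are assumptions made only for illustration.

```r
# Synthetic stand-in for red.wine (values are made up for illustration)
set.seed(123)                      # assumed seed; any fixed integer works
toy <- data.frame(id = 1:1599,
                  quality = sample(3:8, 1599, replace = TRUE))

# Draw 75% of the row indices for training; the rest become the test set
train_idx <- sample(nrow(toy), floor(0.75 * nrow(toy)))
toy.train <- toy[train_idx, ]
toy.test  <- toy[-train_idx, ]

# The two parts are disjoint and together cover every row
stopifnot(nrow(toy.train) + nrow(toy.test) == nrow(toy))
stopifnot(length(intersect(toy.train$id, toy.test$id)) == 0)
```

Running the same lines again with the same seed gives the same training and test rows, so reported figures can be reproduced.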

Step 3: Training a model on the data

Using the rpart() function from the rpart package, a regression tree is fitted with quality as the response and all remaining features as predictors in the training set. Alcohol is the first feature to be split on, which marks it as the most important predictor; the next splits use sulphates, and alcohol, volatile acidity, fixed acidity, and pH appear further down the tree.

Although the model is called a “regression tree,” it does not fit a regression equation. Instead, each split is chosen to make the resulting groups as homogeneous as possible in the response, similar to the way a classification decision tree chooses splits by reduction in entropy. One common criterion is the standard deviation reduction, SDR = sd(T) - sum(|Ti|/|T| * sd(Ti)), where T is the set of quality values before the split, Ti are the subsets after the split, and |T| and |Ti| are their sizes. The split with the largest SDR is preferred, because a large SDR means that the size-weighted standard deviations of the subsets are small compared with the standard deviation before the split. (For a numeric response, rpart uses its “anova” method, which applies the closely related reduction in the sum of squared deviations; this is the deviance reported in the output.)
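
The SDR formula can be tried on a small example. This is a sketch only: the helper sdr() and the quality values are made up for illustration and are not part of rpart.

```r
# Standard deviation reduction for one candidate split.
# parent: numeric vector of target values before the split
# parts:  list of numeric vectors, one per branch after the split
sdr <- function(parent, parts) {
  weights <- sapply(parts, length) / length(parent)
  sd(parent) - sum(weights * sapply(parts, sd))
}

quality <- c(5, 5, 5, 6, 6, 6, 7, 7)           # made-up values

split_a <- list(quality[1:3], quality[4:8])    # candidate split A
split_b <- list(quality[1:6], quality[7:8])    # candidate split B

sdr(quality, split_a)
sdr(quality, split_b)   # the candidate with the larger SDR is preferred
```

With these values split A gives the larger reduction, so a tree grown with this criterion would choose it over split B.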

library(rpart)

reg.tree <- rpart(quality ~ ., data = red.wine.train)
reg.tree

## n= 1199 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 1199 801.18100 5.637198  
##    2) alcohol< 10.525 740 317.21490 5.363514  
##      4) sulphates< 0.635 457 162.87530 5.229759 *
##      5) sulphates>=0.635 283 132.96110 5.579505  
##       10) alcohol< 9.85 168  52.47619 5.404762  
##         20) fixed.acidity< 10.75 152  37.55263 5.328947 *
##         21) fixed.acidity>=10.75 16   5.75000 6.125000 *
##       11) alcohol>=9.85 115  67.86087 5.834783  
##         22) pH>=3.375 35  20.57143 5.428571 *
##         23) pH< 3.375 80  38.98750 6.012500 *
##    3) alcohol>=10.525 459 339.17650 6.078431  
##      6) sulphates< 0.635 201 147.26370 5.711443  
##       12) volatile.acidity>=0.99 8   3.50000 3.750000 *
##       13) volatile.acidity< 0.99 193 111.70980 5.792746  
##         26) volatile.acidity>=0.495 109  54.97248 5.550459 *
##         27) volatile.acidity< 0.495 84  42.03571 6.107143 *
##      7) sulphates>=0.635 258 143.75190 6.364341  
##       14) alcohol< 11.55 155  78.15484 6.135484 *
##       15) alcohol>=11.55 103  45.26214 6.708738 *

Calling summary() on the model object shows in detail how the regression tree splits the training set. The tree has 19 nodes in total: 9 internal splits and 10 terminal leaves.

summary(reg.tree)

## Call:
## rpart(formula = quality ~ ., data = red.wine.train)
##   n= 1199 
## 
##            CP nsplit rel error    xerror       xstd
## 1  0.18072028      0 1.0000000 1.0012091 0.04467913
## 2  0.06011232      1 0.8192797 0.8344001 0.04201625
## 3  0.04000823      2 0.7591674 0.7994619 0.03918898
## 4  0.02668368      3 0.7191592 0.7906666 0.03824026
## 5  0.02538124      4 0.6924755 0.7764970 0.03784911
## 6  0.01834998      5 0.6670942 0.7494608 0.03648726
## 7  0.01575683      6 0.6487443 0.7464720 0.03651462
## 8  0.01145005      7 0.6329874 0.7337141 0.03619204
## 9  0.01036213      8 0.6215374 0.7198029 0.03565632
## 10 0.01000000      9 0.6111753 0.7104966 0.03513372
## 
## Variable importance
##              alcohol     volatile.acidity              density 
##                   32                   14                   14 
##            sulphates        fixed.acidity          citric.acid 
##                   13                    8                    6 
##            chlorides                   pH total.sulfur.dioxide 
##                    5                    5                    3 
##       residual.sugar 
##                    1 
## 
## Node number 1: 1199 observations,    complexity param=0.1807203
##   mean=5.637198, MSE=0.6682077 
##   left son=2 (740 obs) right son=3 (459 obs)
##   Primary splits:
##       alcohol          < 10.525   to the left,  improve=0.18072030, (0 missing)
##       sulphates        < 0.645    to the left,  improve=0.12517630, (0 missing)
##       volatile.acidity < 0.425    to the right, improve=0.11094080, (0 missing)
##       density          < 0.99537  to the right, improve=0.07452618, (0 missing)
##       citric.acid      < 0.295    to the left,  improve=0.06587047, (0 missing)
##   Surrogate splits:
##       density              < 0.99567  to the right, agree=0.771, adj=0.403, (0 split)
##       chlorides            < 0.0685   to the right, agree=0.680, adj=0.163, (0 split)
##       volatile.acidity     < 0.3675   to the right, agree=0.662, adj=0.118, (0 split)
##       fixed.acidity        < 6.75     to the right, agree=0.654, adj=0.096, (0 split)
##       total.sulfur.dioxide < 17.5     to the right, agree=0.651, adj=0.089, (0 split)
## 
## Node number 2: 740 observations,    complexity param=0.02668368
##   mean=5.363514, MSE=0.4286687 
##   left son=4 (457 obs) right son=5 (283 obs)
##   Primary splits:
##       sulphates            < 0.635    to the left,  improve=0.06739426, (0 missing)
##       alcohol              < 9.975    to the left,  improve=0.06010847, (0 missing)
##       volatile.acidity     < 0.5475   to the right, improve=0.05536277, (0 missing)
##       total.sulfur.dioxide < 46.5     to the right, improve=0.03207202, (0 missing)
##       fixed.acidity        < 12.55    to the left,  improve=0.03119502, (0 missing)
##   Surrogate splits:
##       chlorides        < 0.1005   to the left,  agree=0.661, adj=0.113, (0 split)
##       citric.acid      < 0.395    to the left,  agree=0.657, adj=0.102, (0 split)
##       pH               < 3.075    to the right, agree=0.654, adj=0.095, (0 split)
##       volatile.acidity < 0.335    to the right, agree=0.651, adj=0.088, (0 split)
##       density          < 0.99822  to the left,  agree=0.645, adj=0.071, (0 split)
## 
## Node number 3: 459 observations,    complexity param=0.06011232
##   mean=6.078431, MSE=0.7389466 
##   left son=6 (201 obs) right son=7 (258 obs)
##   Primary splits:
##       sulphates        < 0.635    to the left,  improve=0.14199350, (0 missing)
##       volatile.acidity < 0.9125   to the right, improve=0.13277910, (0 missing)
##       citric.acid      < 0.295    to the left,  improve=0.12171660, (0 missing)
##       pH               < 3.355    to the right, improve=0.09761722, (0 missing)
##       alcohol          < 11.55    to the left,  improve=0.09170903, (0 missing)
##   Surrogate splits:
##       citric.acid      < 0.205    to the left,  agree=0.675, adj=0.259, (0 split)
##       fixed.acidity    < 7.85     to the left,  agree=0.662, adj=0.229, (0 split)
##       volatile.acidity < 0.5875   to the right, agree=0.647, adj=0.194, (0 split)
##       density          < 0.994255 to the left,  agree=0.647, adj=0.194, (0 split)
##       pH               < 3.425    to the right, agree=0.643, adj=0.184, (0 split)
## 
## Node number 4: 457 observations
##   mean=5.229759, MSE=0.356401 
## 
## Node number 5: 283 observations,    complexity param=0.01575683
##   mean=5.579505, MSE=0.4698273 
##   left son=10 (168 obs) right son=11 (115 obs)
##   Primary splits:
##       alcohol              < 9.85     to the left,  improve=0.09494557, (0 missing)
##       total.sulfur.dioxide < 29.5     to the right, improve=0.09341644, (0 missing)
##       volatile.acidity     < 0.355    to the right, improve=0.09264380, (0 missing)
##       fixed.acidity        < 11.45    to the left,  improve=0.08862331, (0 missing)
##       free.sulfur.dioxide  < 7.5      to the right, improve=0.07069268, (0 missing)
##   Surrogate splits:
##       fixed.acidity    < 9.35     to the left,  agree=0.657, adj=0.157, (0 split)
##       chlorides        < 0.0655   to the right, agree=0.640, adj=0.113, (0 split)
##       residual.sugar   < 2.65     to the left,  agree=0.636, adj=0.104, (0 split)
##       volatile.acidity < 0.305    to the right, agree=0.629, adj=0.087, (0 split)
##       density          < 0.99538  to the right, agree=0.608, adj=0.035, (0 split)
## 
## Node number 6: 201 observations,    complexity param=0.04000823
##   mean=5.711443, MSE=0.7326551 
##   left son=12 (8 obs) right son=13 (193 obs)
##   Primary splits:
##       volatile.acidity < 0.99     to the right, improve=0.21766290, (0 missing)
##       pH               < 3.365    to the right, improve=0.11510710, (0 missing)
##       alcohol          < 11.45    to the left,  improve=0.11154450, (0 missing)
##       citric.acid      < 0.255    to the left,  improve=0.10702790, (0 missing)
##       density          < 0.995205 to the right, improve=0.05313951, (0 missing)
## 
## Node number 7: 258 observations,    complexity param=0.02538124
##   mean=6.364341, MSE=0.5571781 
##   left son=14 (155 obs) right son=15 (103 obs)
##   Primary splits:
##       alcohol              < 11.55    to the left,  improve=0.14145870, (0 missing)
##       density              < 1.0009   to the right, improve=0.06895615, (0 missing)
##       chlorides            < 0.0945   to the right, improve=0.06373035, (0 missing)
##       total.sulfur.dioxide < 53.5     to the right, improve=0.05826726, (0 missing)
##       fixed.acidity        < 13.4     to the right, improve=0.05823348, (0 missing)
##   Surrogate splits:
##       density        < 0.994875 to the right, agree=0.694, adj=0.233, (0 split)
##       chlorides      < 0.053    to the right, agree=0.647, adj=0.117, (0 split)
##       fixed.acidity  < 5.7      to the right, agree=0.640, adj=0.097, (0 split)
##       citric.acid    < 0.635    to the left,  agree=0.636, adj=0.087, (0 split)
##       residual.sugar < 4.25     to the left,  agree=0.628, adj=0.068, (0 split)
## 
## Node number 10: 168 observations,    complexity param=0.01145005
##   mean=5.404762, MSE=0.3123583 
##   left son=20 (152 obs) right son=21 (16 obs)
##   Primary splits:
##       fixed.acidity        < 10.75    to the left,  improve=0.17481370, (0 missing)
##       total.sulfur.dioxide < 29.5     to the right, improve=0.16173710, (0 missing)
##       volatile.acidity     < 0.315    to the right, improve=0.11120600, (0 missing)
##       free.sulfur.dioxide  < 14.5     to the right, improve=0.09169596, (0 missing)
##       density              < 0.997605 to the left,  improve=0.08175771, (0 missing)
##   Surrogate splits:
##       pH          < 2.89     to the right, agree=0.917, adj=0.125, (0 split)
##       citric.acid < 0.71     to the left,  agree=0.911, adj=0.062, (0 split)
## 
## Node number 11: 115 observations,    complexity param=0.01036213
##   mean=5.834783, MSE=0.5900945 
##   left son=22 (35 obs) right son=23 (80 obs)
##   Primary splits:
##       pH                   < 3.375    to the right, improve=0.12233770, (0 missing)
##       chlorides            < 0.0625   to the right, improve=0.09949257, (0 missing)
##       free.sulfur.dioxide  < 31.5     to the right, improve=0.08829405, (0 missing)
##       volatile.acidity     < 0.385    to the right, improve=0.08213467, (0 missing)
##       total.sulfur.dioxide < 13.5     to the right, improve=0.07911845, (0 missing)
##   Surrogate splits:
##       fixed.acidity       < 7.15     to the left,  agree=0.843, adj=0.486, (0 split)
##       citric.acid         < 0.195    to the left,  agree=0.826, adj=0.429, (0 split)
##       free.sulfur.dioxide < 25.5     to the right, agree=0.739, adj=0.143, (0 split)
##       density             < 0.99637  to the left,  agree=0.739, adj=0.143, (0 split)
##       residual.sugar      < 7.45     to the right, agree=0.713, adj=0.057, (0 split)
## 
## Node number 12: 8 observations
##   mean=3.75, MSE=0.4375 
## 
## Node number 13: 193 observations,    complexity param=0.01834998
##   mean=5.792746, MSE=0.5788075 
##   left son=26 (109 obs) right son=27 (84 obs)
##   Primary splits:
##       volatile.acidity    < 0.495    to the right, improve=0.13160570, (0 missing)
##       pH                  < 3.365    to the right, improve=0.08567356, (0 missing)
##       citric.acid         < 0.255    to the left,  improve=0.08366682, (0 missing)
##       alcohol             < 11.45    to the left,  improve=0.08318212, (0 missing)
##       free.sulfur.dioxide < 31.5     to the left,  improve=0.06759142, (0 missing)
##   Surrogate splits:
##       citric.acid          < 0.165    to the left,  agree=0.870, adj=0.702, (0 split)
##       pH                   < 3.295    to the right, agree=0.746, adj=0.417, (0 split)
##       fixed.acidity        < 7.85     to the left,  agree=0.705, adj=0.321, (0 split)
##       total.sulfur.dioxide < 11.5     to the right, agree=0.642, adj=0.179, (0 split)
##       alcohol              < 11.65    to the left,  agree=0.627, adj=0.143, (0 split)
## 
## Node number 14: 155 observations
##   mean=6.135484, MSE=0.5042248 
## 
## Node number 15: 103 observations
##   mean=6.708738, MSE=0.4394382 
## 
## Node number 20: 152 observations
##   mean=5.328947, MSE=0.2470568 
## 
## Node number 21: 16 observations
##   mean=6.125, MSE=0.359375 
## 
## Node number 22: 35 observations
##   mean=5.428571, MSE=0.5877551 
## 
## Node number 23: 80 observations
##   mean=6.0125, MSE=0.4873438 
## 
## Node number 26: 109 observations
##   mean=5.550459, MSE=0.5043347 
## 
## Node number 27: 84 observations
##   mean=6.107143, MSE=0.5004252

The rpart.plot() function draws an easily read diagram of the model. It shows the root, the internal nodes, and the leaves of the regression tree; the branches are labeled with the split criteria, and each node shows its mean quality, its number of observations, and its percentage of the data. This model is used to predict the quality of the wines in the test set. To make a prediction, an observation is passed down the tree according to the split criteria until it reaches a leaf, and it is assigned the mean quality of the training observations in that leaf.

library(rpart.plot)

rpart.plot(reg.tree, digits = 4, fallen.leaves = TRUE, type = 4, extra = 101)
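
The claim that a leaf's prediction is the mean response of its training observations can be checked directly. The sketch below uses the built-in mtcars data rather than the wine data, purely for illustration; the object names fit, leaf_means, and manual are assumptions.

```r
library(rpart)   # recommended package shipped with standard R installations

# mtcars is a built-in dataset, used here only for illustration
fit <- rpart(mpg ~ ., data = mtcars)

# fit$where gives, for each training row, the row of fit$frame for its leaf
leaf_means <- tapply(mtcars$mpg, fit$where, mean)

# Look up each row's leaf mean and compare with the fitted values
manual <- as.numeric(leaf_means[as.character(fit$where)])
all.equal(manual, as.numeric(predict(fit)))
```

If the comparison returns TRUE, every fitted value matches the mean of the training responses in the corresponding leaf.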

Step 4: Evaluate model performance

The regression tree is applied to the test set to produce a vector of predicted quality values. Comparing summary() of the predictions with summary() of the actual values shows that the actual quality spans a wider range, from 3 to 8, while the predictions range only from 3.75 to 6.71. The model does not do well on wines at the extremes, but it is adequate for ratings of 5 and 6, where most of the observations lie.

rt.predict <- predict(reg.tree, red.wine.test)
summary(rt.predict)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.750   5.230   5.429   5.658   6.125   6.709

summary(red.wine.test$quality)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.632   6.000   8.000

head(rt.predict, n = 15)

##        1        2        5        9       16       27       29       30 
## 5.550459 5.428571 6.125000 5.229759 6.012500 5.550459 6.107143 6.708738 
##       32       46       53       58       61       67       70 
## 5.229759 5.229759 6.135484 5.229759 6.708738 5.229759 5.328947

Model performance can be evaluated with the correlation and the mean absolute error (MAE). The correlation between the actual quality and the regression tree predictions is 0.49. The MAE, which measures how far the predictions are from the actual values on average, is 0.55. For comparison, predicting a single constant close to the mean training quality for every test wine gives an MAE of 0.67, so the tree improves on this baseline only modestly.

cor(rt.predict, red.wine.test$quality)

## [1] 0.4902011

MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))  
}

MAE(rt.predict, red.wine.test$quality)

## [1] 0.5465131

mean(red.wine.train$quality)

## [1] 5.637198

MAE(5.620517, red.wine.test$quality)

## [1] 0.6678586
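
The baseline above uses a typed-in constant, which can drift away from the training mean when the random split changes. A minimal sketch of computing the baseline directly from the training responses follows; the vectors train_quality and test_quality hold made-up values for illustration.

```r
# Same definition of MAE as above
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

train_quality <- c(5, 6, 5, 7, 5)   # made-up training responses
test_quality  <- c(5, 6, 6, 4)      # made-up test responses

# Constant prediction for every test row, taken from the training data
baseline <- mean(train_quality)
MAE(test_quality, baseline)
```

On the wine data the same pattern would be MAE(red.wine.test$quality, mean(red.wine.train$quality)), which always uses the mean of the current training set.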

Step 5: Improving model performance

An improvement can be made by fitting a model tree to the red wine data. The data are still partitioned as in the regression tree above, but a linear regression model is fitted to the training observations in each leaf, and a test observation that reaches that leaf is predicted from the leaf's regression model rather than from the leaf's mean.
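
The idea can be sketched by hand with base R. This is not the M5P algorithm: the split point (wt <= 3), the predictor hp, and the built-in mtcars data are all assumptions chosen only to show one linear model per leaf.

```r
# Built-in mtcars data, used only to illustrate the idea
left  <- subset(mtcars, wt <= 3)    # hand-picked split, not chosen by M5P
right <- subset(mtcars, wt >  3)

# One linear model per leaf
lm_left  <- lm(mpg ~ hp, data = left)
lm_right <- lm(mpg ~ hp, data = right)

# Route a single new observation down the tree, then use that leaf's model
predict_leaf <- function(newdata) {
  model <- if (newdata$wt <= 3) lm_left else lm_right
  unname(predict(model, newdata))
}

predict_leaf(data.frame(wt = 2.5, hp = 100))
```

A regression tree would return the same value for every observation in the left leaf; here the prediction still varies with hp inside the leaf.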

library(RWeka)

model.tree <- M5P(quality ~ ., data = red.wine.train)
model.tree

## M5 pruned model tree:
## (using smoothed linear models)
## 
## alcohol <= 10.525 : 
## |   total.sulfur.dioxide <= 55.5 : 
## |   |   sulphates <= 0.635 : 
## |   |   |   volatile.acidity <= 0.575 : LM1 (139/64.281%)
## |   |   |   volatile.acidity >  0.575 : 
## |   |   |   |   volatile.acidity <= 0.653 : LM2 (59/80.806%)
## |   |   |   |   volatile.acidity >  0.653 : 
## |   |   |   |   |   volatile.acidity <= 0.742 : LM3 (44/47.878%)
## |   |   |   |   |   volatile.acidity >  0.742 : 
## |   |   |   |   |   |   residual.sugar <= 2.15 : LM4 (12/77.026%)
## |   |   |   |   |   |   residual.sugar >  2.15 : LM5 (18/55.492%)
## |   |   sulphates >  0.635 : 
## |   |   |   alcohol <= 9.85 : LM6 (111/64.415%)
## |   |   |   alcohol >  9.85 : LM7 (88/76.677%)
## |   total.sulfur.dioxide >  55.5 : LM8 (269/52.938%)
## alcohol >  10.525 : LM9 (459/85.738%)
## 
## LM num: 1
## quality = 
##  -0.0027 * fixed.acidity 
##  - 0.2221 * volatile.acidity 
##  - 0.6249 * citric.acid 
##  + 0.0011 * residual.sugar 
##  - 0.1585 * chlorides 
##  + 0.0001 * free.sulfur.dioxide 
##  - 0.0053 * total.sulfur.dioxide 
##  + 3.6894 * density 
##  - 0.1531 * pH 
##  + 2.2974 * sulphates 
##  + 0.0562 * alcohol 
##  + 0.8359
## 
## LM num: 2
## quality = 
##  0.0285 * fixed.acidity 
##  - 0.5379 * volatile.acidity 
##  - 2.6445 * citric.acid 
##  + 0.0011 * residual.sugar 
##  - 0.1585 * chlorides 
##  + 0.0001 * free.sulfur.dioxide 
##  - 0.0002 * total.sulfur.dioxide 
##  + 3.6894 * density 
##  - 0.1554 * pH 
##  + 0.3199 * sulphates 
##  + 0.5567 * alcohol 
##  - 2.9259
## 
## LM num: 3
## quality = 
##  0.0654 * fixed.acidity 
##  - 0.6637 * volatile.acidity 
##  - 0.5546 * citric.acid 
##  + 0.0445 * residual.sugar 
##  - 0.1585 * chlorides 
##  + 0.0001 * free.sulfur.dioxide 
##  - 0.0002 * total.sulfur.dioxide 
##  + 30.7912 * density 
##  - 0.1554 * pH 
##  + 0.6586 * sulphates 
##  + 0.3652 * alcohol 
##  - 29.0311
## 
## LM num: 4
## quality = 
##  0.2734 * fixed.acidity 
##  - 0.7191 * volatile.acidity 
##  - 0.5546 * citric.acid 
##  + 0.058 * residual.sugar 
##  - 0.1585 * chlorides 
##  - 0.0104 * free.sulfur.dioxide 
##  + 0.0048 * total.sulfur.dioxide 
##  - 22.4176 * density 
##  - 0.1554 * pH 
##  + 0.764 * sulphates 
##  + 0.1484 * alcohol 
##  + 24.1592
## 
## LM num: 5
## quality = 
##  0.141 * fixed.acidity 
##  - 0.7191 * volatile.acidity 
##  - 0.5546 * citric.acid 
##  + 0.058 * residual.sugar 
##  - 0.1585 * chlorides 
##  - 0.0085 * free.sulfur.dioxide 
##  + 0.0071 * total.sulfur.dioxide 
##  - 22.4176 * density 
##  - 0.1554 * pH 
##  + 1.9832 * sulphates 
##  + 0.1484 * alcohol 
##  + 24.6412
## 
## LM num: 6
## quality = 
##  0.0851 * fixed.acidity 
##  - 0.2241 * volatile.acidity 
##  - 0.0684 * citric.acid 
##  + 0.0011 * residual.sugar 
##  - 0.3906 * chlorides 
##  + 0.0001 * free.sulfur.dioxide 
##  - 0.0009 * total.sulfur.dioxide 
##  + 15.3794 * density 
##  - 0.3418 * pH 
##  + 0.137 * sulphates 
##  + 0.0807 * alcohol 
##  - 10.0702
## 
## LM num: 7
## quality = 
##  -0.1915 * fixed.acidity 
##  - 0.9201 * volatile.acidity 
##  - 0.0684 * citric.acid 
##  + 0.0899 * residual.sugar 
##  - 2.3144 * chlorides 
##  - 0.0168 * free.sulfur.dioxide 
##  - 0.001 * total.sulfur.dioxide 
##  + 17.7087 * density 
##  - 2.8617 * pH 
##  + 0.137 * sulphates 
##  + 0.0899 * alcohol 
##  - 0.88
## 
## LM num: 8
## quality = 
##  -0.4696 * volatile.acidity 
##  - 0.0187 * citric.acid 
##  + 0.0231 * residual.sugar 
##  - 0.095 * chlorides 
##  + 0.0001 * free.sulfur.dioxide 
##  - 0.0033 * total.sulfur.dioxide 
##  - 0.0413 * pH 
##  + 0.218 * sulphates 
##  + 0.2326 * alcohol 
##  + 3.489
## 
## LM num: 9
## quality = 
##  -1.1556 * volatile.acidity 
##  - 0.0088 * citric.acid 
##  - 0.0544 * chlorides 
##  + 0.0103 * free.sulfur.dioxide 
##  - 0.0033 * total.sulfur.dioxide 
##  - 57.4832 * density 
##  - 1.0156 * pH 
##  + 1.6696 * sulphates 
##  + 0.2536 * alcohol 
##  + 63.186
## 
## Number of Rules : 9

summary(model.tree)

## 
## === Summary ===
## 
## Correlation coefficient                  0.6848
## Mean absolute error                      0.4712
## Root mean squared error                  0.596 
## Relative absolute error                 68.4244 %
## Root relative squared error             72.9144 %
## Total Number of Instances             1199

The model tree is then used to predict red wine quality in the test set, and summary statistics of the predictions are computed. The predictions range from 3.93 to 7.28, somewhat wider than the regression tree's 3.75 to 6.71 but still narrower than the actual range of 3 to 8. Overall performance is better than the regression tree's: the correlation between actual and predicted quality is 0.58, compared with 0.49 for the regression tree, and the MAE is 0.51, compared with 0.55. The figures reported by summary(model.tree) above (correlation 0.68, MAE 0.47) are computed on the 1199 training observations and are therefore more optimistic than these test-set results.

mt.predict <- predict(model.tree, red.wine.test)
summary(mt.predict)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.927   5.183   5.497   5.613   6.013   7.282

cor(mt.predict, red.wine.test$quality)

## [1] 0.5767396

MAE(red.wine.test$quality, mt.predict)

## [1] 0.5050358

Conclusion: A regression tree was first used to predict red wine quality. The algorithm resembles a classification decision tree, except that splits are chosen to reduce the spread of the response, for example by the standard deviation reduction (SDR). A future observation is assigned the mean quality of the training observations in the leaf it falls into. A model tree was then introduced as an improvement: it fits a regression model to the observations in each leaf and uses that model to predict any future observation that reaches the same leaf. On this test set the model tree performed better than the regression tree, with a smaller MAE (0.51 versus 0.55) and a higher correlation with the actual values (0.58 versus 0.49), because each leaf makes a “customized regression.” Both regression trees and model trees are suited to numeric prediction.