If you recall from Divide and Conquer - Classification Using Decision Trees and Rules, a decision tree builds a model much like a flowchart in which decision nodes, leaf nodes, and branches define a series of decisions that are used to classify examples. Such trees can also be used for numeric prediction by making only small adjustments to the tree-growing algorithm. In this section, we will consider only the ways in which trees for numeric prediction differ from trees used for classification.
Trees for numeric prediction fall into two categories. The first, known as regression trees, were introduced in the 1980s as part of the seminal Classification and Regression Tree (CART) algorithm. Despite the name, regression trees do not use linear regression methods as described earlier in this chapter; rather, they make predictions based on the average value of the examples that reach a leaf.
The second type of trees for numeric prediction are known as model trees. Introduced several years later than regression trees, they are lesser-known, but perhaps more powerful. Model trees are grown in much the same way as regression trees, but at each leaf, a multiple linear regression model is built from the examples reaching that node. Depending on the number of leaf nodes, a model tree may build tens or even hundreds of such models. This may make model trees more difficult to understand than the equivalent regression tree, with the benefit that they may result in a more accurate model.
Trees that can perform numeric prediction offer a compelling yet often overlooked alternative to regression modeling.
Though traditional regression methods are typically the first choice for numeric prediction tasks, in some cases numeric decision trees offer distinct advantages. For instance, decision trees may be better suited for tasks with many features or with complex, non-linear relationships between the features and the outcome, situations that present challenges for regression. Regression modeling also makes assumptions about how the numeric data is distributed that are often violated in real-world data; trees make no such assumptions.
Trees for numeric prediction are built in much the same way as they are for classification. Beginning at the root node, the data is partitioned using a divide-and-conquer strategy according to the feature that will result in the greatest increase in homogeneity in the outcome after a split is performed. In classification trees, you will recall that homogeneity is measured by entropy, which is undefined for numeric data. Instead, for numeric decision trees, homogeneity is measured by statistics such as variance, standard deviation, or absolute deviation from the mean.
One common splitting criterion is called the Standard Deviation Reduction (SDR).
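For a candidate split that partitions a set of examples T into groups T_1, T_2, ..., T_n, the SDR measures how much the standard deviation of the outcome falls as a result of the split:

$$\mathrm{SDR} = \mathrm{sd}(T) - \sum_{i=1}^{n}\frac{|T_i|}{|T|}\times\mathrm{sd}(T_i)$$

The tree-growing algorithm evaluates candidate splits and chooses the feature and cut point with the largest reduction. The computation can be sketched in a few lines of R (the sdr() helper here is for illustration only, not a function from any package):

sdr <- function(outcome, condition) {
  parts <- split(outcome, condition)          # partition the outcome values by the candidate split
  weighted <- sum(sapply(parts, function(p)
    (length(p) / length(outcome)) * sd(p)))   # size-weighted standard deviation after the split
  sd(outcome) - weighted                      # the reduction achieved by the split
}

For a numeric outcome y and a candidate split such as x < 10, sdr(y, x < 10) returns the reduction; larger values indicate splits that produce more homogeneous partitions.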
Winemaking is a challenging and competitive business that offers the potential for great profit. However, there are numerous factors that contribute to the profitability of a winery. As an agricultural product, variables as diverse as the weather and the growing environment impact the quality of a varietal. The bottling and manufacturing can also affect the flavor for better or worse. Even the way the product is marketed, from the bottle design to the price point, can affect the customer’s perception of taste.
As a consequence, the winemaking industry has heavily invested in data collection and machine learning methods that may assist with the decision science of winemaking. For example, machine learning has been used to discover key differences in the chemical composition of wines from different regions, or to identify the chemical factors that lead a wine to taste sweeter.
More recently, machine learning has been employed to assist with rating the quality of wine, a notoriously difficult task. A review written by a renowned wine critic often determines whether the product ends up on the top or bottom shelf, despite the fact that even expert judges are inconsistent when rating a wine in a blinded test.
In this case study, we will use regression trees and model trees to create a system capable of mimicking expert ratings of wine. Because trees result in a model that is readily understood, this can allow the winemakers to identify the key factors that contribute to better-rated wines. Perhaps more importantly, the system does not suffer from the human elements of tasting, such as the rater’s mood or palate fatigue. Computer-aided wine testing may therefore result in a better product as well as more objective, consistent, and fair ratings.
To develop the wine rating model, we will use data donated to the UCI Machine Learning Data Repository (http://archive.ics.uci.edu/ml) by P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. The data include examples of red and white Vinho Verde wines from Portugal, one of the world's leading wine-producing countries. Because the factors that contribute to a highly rated wine may differ between the red and white varieties, for this analysis we will examine only the more popular white wines.
The white wine data includes information on 11 chemical properties of 4,898 wine samples. For each wine, a laboratory analysis measured characteristics such as acidity, sugar content, chlorides, sulfur, alcohol, pH, and density. The samples were then rated in a blind tasting by panels of no fewer than three judges on a quality scale ranging from zero (very bad) to 10 (excellent). When the judges disagreed on the rating, the median value was used.
The study by Cortez evaluated the ability of three machine learning approaches to model the wine data: multiple regression, artificial neural networks, and support vector machines. We covered multiple regression earlier in this chapter, and we will learn about neural networks and support vector machines in Chapter 7, Black Box Methods - Neural Networks and Support Vector Machines. The study found that the support vector machine offered significantly better results than the linear regression model. However, unlike regression, the support vector machine model is difficult to interpret. Using regression trees and model trees, we may be able to improve the regression results while still having a model that is easy to understand.
As usual, we will use the read.csv() function to load the data into R. Since all of the features are numeric, we can safely ignore the stringsAsFactors parameter.
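The following command loads the file (the filename whitewines.csv is an assumption; substitute the name of your local copy of the UCI data):

wine <- read.csv("whitewines.csv")

The wine data includes 11 features and the quality outcome, as follows: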
str(wine)
'data.frame': 4898 obs. of 12 variables:
$ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
$ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
$ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
$ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
$ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
$ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
$ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
$ density : num 1.001 0.994 0.995 0.996 0.996 ...
$ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
$ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
$ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
$ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Compared with other types of machine learning models, one of the advantages of trees is that they can handle many types of data without preprocessing. This means we do not need to normalize or standardize the features.
However, a bit of effort to examine the distribution of the outcome variable is needed to inform our evaluation of the model's performance. For instance, suppose that there was very little variation in quality from wine to wine, or that wines fell into a bimodal distribution: either very good or very bad. To check for such extremes, we can examine the distribution of quality using a histogram.
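A single call to the base R hist() function produces the figure that follows:

hist(wine$quality)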
The wine quality values appear to follow a fairly normal, bell-shaped distribution, centered around a value of six. This makes sense intuitively because most wines are of average quality; few are particularly bad or good. Although the results are not shown here, it is also useful to examine the summary(wine) output for outliers or other potential data problems. Even though trees are fairly robust with messy data, it is always prudent to check for severe problems. For now, we’ll assume that the data is reliable.
Our last step, then, is to divide the data into training and testing datasets. Since the wine dataset was already sorted in random order, we can partition it into two sets of contiguous rows, as follows:
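wine_train <- wine[1:3750, ]    # first 3,750 rows (75 percent) for training
wine_test <- wine[3751:4898, ]  # remaining 1,148 rows (25 percent) for testing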
In order to mirror the conditions used by Cortez, we used sets of 75 percent and 25 percent for training and testing, respectively. We’ll evaluate the performance of our tree-based models on the testing data to see if we can obtain results comparable to the prior research study.
We will begin by training a regression tree model. Although almost any implementation of decision trees can be used to perform regression tree modeling, the rpart (recursive partitioning) package offers the most faithful implementation of regression trees as they were described by the CART team. As the classic R implementation of CART, the rpart package is also well-documented and supported with functions for visualizing and evaluating the rpart models.
Install the rpart package using the install.packages("rpart") command. It can then be loaded into your R session using the library(rpart) command. The following syntax will train a tree using the default settings, which typically work fairly well. If you need more finely tuned settings, refer to the documentation for the control parameters using the ?rpart.control command.
Using the R formula interface, we can specify quality as the outcome variable and use the dot notation to allow all the other columns in the wine_train data frame to be used as predictors. The resulting regression tree model object is named m.rpart to distinguish it from the model tree that we will train later:
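m.rpart <- rpart(quality ~ ., data = wine_train)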
For basic information about the tree, simply type the name of the model object:
m.rpart
n= 3750
node), split, n, deviance, yval
* denotes terminal node
1) root 3750 3140.05973300 5.886933333
2) alcohol< 10.85 2473 1510.66235300 5.609381318
4) volatile.acidity>=0.2425 1406 740.15078240 5.402560455
8) volatile.acidity>=0.4225 182 92.99450549 4.994505495 *
9) volatile.acidity< 0.4225 1224 612.34558820 5.463235294 *
5) volatile.acidity< 0.2425 1067 631.12089970 5.881911903 *
3) alcohol>=10.85 1277 1069.95771300 6.424432263
6) free.sulfur.dioxide< 11.5 93 99.18279570 5.473118280 *
7) free.sulfur.dioxide>=11.5 1184 879.99915540 6.499155405
14) alcohol< 11.85 611 447.38134210 6.296235679 *
15) alcohol>=11.85 573 380.63176270 6.715532286 *
For each node in the tree, the number of examples reaching the decision point is listed. For instance, all 3,750 examples begin at the root node, of which 2,473 have alcohol < 10.85 and 1,277 have alcohol >= 10.85. Because alcohol was used first in the tree, it is the single most important predictor of wine quality.
Nodes indicated by * are terminal or leaf nodes, which means that they result in a prediction (listed here as yval). For example, node 5 has a yval of 5.881912. When the tree is used for predictions, any wine samples with alcohol < 10.85 and volatile.acidity < 0.2425 would therefore be predicted to have a quality value of about 5.88.
A more detailed summary of the tree’s fit, including the mean squared error for each of the nodes and an overall measure of feature importance, can be obtained using the summary(m.rpart) command.
summary(m.rpart)
Call:
rpart(formula = quality ~ ., data = wine_train)
n= 3750
CP nsplit rel error xerror xstd
1 0.17816210965 0 1.0000000000 1.0002753805 0.02387817325
2 0.04439108908 1 0.8218378903 0.8227999313 0.02238889931
3 0.02890892849 2 0.7774468013 0.7842799405 0.02209716924
4 0.01655575215 3 0.7485378728 0.7564096499 0.02095876478
5 0.01108599568 4 0.7319821206 0.7434064599 0.02048695058
6 0.01000000000 5 0.7208961249 0.7377110332 0.02035059628
Variable importance
alcohol density chlorides volatile.acidity total.sulfur.dioxide
38 23 12 12 7
free.sulfur.dioxide sulphates pH residual.sugar
6 1 1 1
Node number 1: 3750 observations, complexity param=0.1781621097
mean=5.886933333, MSE=0.8373492622
left son=2 (2473 obs) right son=3 (1277 obs)
Primary splits:
alcohol < 10.85 to the left, improve=0.17816210970, (0 missing)
density < 0.992385 to the right, improve=0.11980972700, (0 missing)
chlorides < 0.0395 to the right, improve=0.08199994615, (0 missing)
total.sulfur.dioxide < 153.5 to the right, improve=0.03875439978, (0 missing)
free.sulfur.dioxide < 11.75 to the left, improve=0.03632118851, (0 missing)
Surrogate splits:
density < 0.99201 to the right, agree=0.869, adj=0.614, (0 split)
chlorides < 0.0375 to the right, agree=0.773, adj=0.334, (0 split)
total.sulfur.dioxide < 102.5 to the right, agree=0.705, adj=0.132, (0 split)
sulphates < 0.345 to the right, agree=0.670, adj=0.031, (0 split)
fixed.acidity < 5.25 to the right, agree=0.662, adj=0.009, (0 split)
Node number 2: 2473 observations, complexity param=0.04439108908
mean=5.609381318, MSE=0.6108622537
left son=4 (1406 obs) right son=5 (1067 obs)
Primary splits:
volatile.acidity < 0.2425 to the right, improve=0.09227122859, (0 missing)
free.sulfur.dioxide < 13.5 to the left, improve=0.04177239677, (0 missing)
alcohol < 10.15 to the left, improve=0.03313802355, (0 missing)
citric.acid < 0.205 to the left, improve=0.02721200452, (0 missing)
pH < 3.325 to the left, improve=0.01860334722, (0 missing)
Surrogate splits:
total.sulfur.dioxide < 111.5 to the right, agree=0.610, adj=0.097, (0 split)
pH < 3.295 to the left, agree=0.598, adj=0.067, (0 split)
alcohol < 10.05 to the left, agree=0.590, adj=0.049, (0 split)
sulphates < 0.715 to the left, agree=0.584, adj=0.037, (0 split)
residual.sugar < 1.85 to the right, agree=0.581, adj=0.029, (0 split)
Node number 3: 1277 observations, complexity param=0.02890892849
mean=6.424432263, MSE=0.8378682172
left son=6 (93 obs) right son=7 (1184 obs)
Primary splits:
free.sulfur.dioxide < 11.5 to the left, improve=0.08484051393, (0 missing)
alcohol < 11.85 to the left, improve=0.06149940937, (0 missing)
fixed.acidity < 7.35 to the right, improve=0.04259694781, (0 missing)
residual.sugar < 1.275 to the left, improve=0.02795661549, (0 missing)
total.sulfur.dioxide < 67.5 to the left, improve=0.02541718729, (0 missing)
Surrogate splits:
total.sulfur.dioxide < 48.5 to the left, agree=0.937, adj=0.14, (0 split)
Node number 4: 1406 observations, complexity param=0.01108599568
mean=5.402560455, MSE=0.5264230316
left son=8 (182 obs) right son=9 (1224 obs)
Primary splits:
volatile.acidity < 0.4225 to the right, improve=0.04703188791, (0 missing)
free.sulfur.dioxide < 17.5 to the left, improve=0.04607770066, (0 missing)
total.sulfur.dioxide < 86.5 to the left, improve=0.02894310145, (0 missing)
alcohol < 10.25 to the left, improve=0.02890076870, (0 missing)
chlorides < 0.0455 to the right, improve=0.02096634946, (0 missing)
Surrogate splits:
density < 0.99107 to the left, agree=0.874, adj=0.027, (0 split)
citric.acid < 0.11 to the left, agree=0.873, adj=0.022, (0 split)
fixed.acidity < 9.85 to the right, agree=0.873, adj=0.016, (0 split)
chlorides < 0.206 to the right, agree=0.871, adj=0.005, (0 split)
Node number 5: 1067 observations
mean=5.881911903, MSE=0.5914910025
Node number 6: 93 observations
mean=5.47311828, MSE=1.066481674
Node number 7: 1184 observations, complexity param=0.01655575215
mean=6.499155405, MSE=0.7432425299
left son=14 (611 obs) right son=15 (573 obs)
Primary splits:
alcohol < 11.85 to the left, improve=0.05907511430, (0 missing)
fixed.acidity < 7.35 to the right, improve=0.04400659867, (0 missing)
density < 0.991395 to the right, improve=0.02522410211, (0 missing)
residual.sugar < 1.225 to the left, improve=0.02503935718, (0 missing)
pH < 3.245 to the left, improve=0.02417935981, (0 missing)
Surrogate splits:
density < 0.991115 to the right, agree=0.710, adj=0.401, (0 split)
volatile.acidity < 0.2675 to the left, agree=0.665, adj=0.307, (0 split)
chlorides < 0.0365 to the right, agree=0.631, adj=0.237, (0 split)
total.sulfur.dioxide < 126.5 to the right, agree=0.566, adj=0.103, (0 split)
residual.sugar < 1.525 to the left, agree=0.560, adj=0.091, (0 split)
Node number 8: 182 observations
mean=4.994505495, MSE=0.5109588214
Node number 9: 1224 observations
mean=5.463235294, MSE=0.5002823433
Node number 14: 611 observations
mean=6.296235679, MSE=0.7322116891
Node number 15: 573 observations
mean=6.715532286, MSE=0.6642788179
Although the tree can be understood using only the preceding output, it is often more readily understood using visualization. The rpart.plot package by Stephen Milborrow provides an easy-to-use function that produces publication-quality decision trees.
After installing the package using the install.packages("rpart.plot") command, the rpart.plot() function produces a tree diagram from any rpart model object. The following command plots the regression tree we built earlier (digits = 3 is simply one reasonable choice for the displayed precision):
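library(rpart.plot)
rpart.plot(m.rpart, digits = 3)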
In addition to the digits parameter that controls the number of numeric digits to include in the diagram, many other aspects of the visualization can be adjusted. The following command shows just a few of the useful options (the specific values are illustrative): the fallen.leaves parameter forces the leaf nodes to be aligned at the bottom of the plot, while the type and extra parameters affect the way the decisions and nodes are labeled:
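rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)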
Visualizations like these may assist with the dissemination of regression tree results, as they are readily understood even without a mathematics background. In both cases, the numbers shown in the leaf nodes are the predicted values for the examples reaching that node. Showing the diagram to the wine producers may thus help to identify the key factors that predict higher-rated wines.
To use the regression tree model to make predictions on the test data, we use the predict() function. By default, this returns the estimated numeric value for the outcome variable, which we’ll save in a vector named p.rpart:
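p.rpart <- predict(m.rpart, wine_test)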
A quick look at the summary statistics of our predictions suggests a potential problem; the predictions fall on a much narrower range than the true values:
summary(p.rpart)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.994505 5.463235 5.881912 5.999010 6.296236 6.715532
summary(wine_test$quality)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.000000 5.000000 6.000000 5.848432 6.000000 8.000000
This finding suggests that the model is not correctly identifying the extreme cases, in particular the best and worst wines. On the other hand, between the first and third quartile, we may be doing well.
The correlation between the predicted and actual quality values provides a simple way to gauge the model’s performance. Recall that the cor() function can be used to measure the relationship between two equal-length vectors. We’ll use this to compare how well the predicted values correspond to the true values:
cor(p.rpart, wine_test$quality)
[1] 0.4931608263
A correlation of 0.49 is certainly acceptable. However, the correlation only measures how strongly the predictions are related to the true value; it is not a measure of how far off the predictions were from the true values.
Another way to think about the model's performance is to consider how far, on average, its prediction was from the true value. This measurement is called the mean absolute error (MAE). The equation for MAE is as follows, where n indicates the number of predictions and e_i indicates the error for prediction i:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|e_i\right|$$
As the name implies, this equation takes the mean of the absolute value of the errors. Since the error is just the difference between the predicted and actual values, we can create a simple MAE() function as follows:
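MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))   # mean of the absolute differences
}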
The MAE for our predictions is then:
MAE(p.rpart, wine_test$quality)
[1] 0.5732104426
This implies that, on average, the difference between our model's predictions and the true quality score was about 0.57. On a quality scale from zero to 10, this seems to suggest that our model is doing fairly well.
On the other hand, recall that most wines were neither very good nor very bad; the typical quality score was around five to six. Therefore, a model that did nothing but predict the mean value may still do fairly well according to this metric.
The mean quality rating in the training data is as follows:
mean(wine_train$quality)
[1] 5.886933333
If we predicted the value 5.87 for every wine sample, we would have a mean absolute error of only about 0.58:
MAE(5.87, wine_test$quality)
[1] 0.5815679443
Our regression tree (MAE = 0.57) comes closer on average to the true quality score than the imputed mean (MAE = 0.58), but not by much. In comparison, Cortez reported an MAE of 0.58 for the neural network model and an MAE of 0.45 for the support vector machine. This suggests that there is room for improvement.
To improve the performance of our learner, let’s try to build a model tree. Recall that a model tree improves on regression trees by replacing the leaf nodes with regression models. This often results in more accurate results than regression trees, which use only a single value for prediction at the leaf nodes.
The current state-of-the-art in model trees is the M5’ algorithm (M5-prime) by Y. Wang and I.H. Witten, which is a variant of the original M5 model tree algorithm proposed by J.R. Quinlan in 1992.
The M5' algorithm is available in R via the M5P() function in the RWeka package. Be sure to install the RWeka package if you haven't already; because of its dependence on Java, the installation instructions are included in Chapter 1, Introducing Machine Learning.
We’ll fit the model tree using essentially the same syntax as we used for the regression tree:
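library(RWeka)
m.m5p <- M5P(quality ~ ., data = wine_train)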
NOTE: If you encounter the following error, you may need to either (a) update and locate the RWeka directory path, or (b) update the installed Weka packages using the WPM("list-packages", "installed") command:
Error in .jcall(o, “Ljava/lang/Class;”, “getClass”) : java.lang.NoClassDefFoundError: no/uib/cipr/matrix/Matrix
WPM("list-packages", "installed")
Caching repository metadata, please wait...
Refresh in progress. Please wait...
[DefaultPackageManager] downloaded 2 KB
...
[DefaultPackageManager] downloaded 1874 KB
Installed Repository Loaded Package
========= ========== ====== =======
The tree itself can be examined by typing its name. In this case, the tree is very large and only the first few lines of output are shown:
m.m5p
M5 pruned model tree:
(using smoothed linear models)
alcohol <= 10.85 : LM1 (2473/77.476%)
alcohol > 10.85 :
| free.sulfur.dioxide <= 20.5 :
| | free.sulfur.dioxide <= 10.5 : LM2 (81/104.574%)
| | free.sulfur.dioxide > 10.5 : LM3 (224/87.002%)
| free.sulfur.dioxide > 20.5 : LM4 (972/84.073%)
LM num: 1
quality =
0.0777 * fixed.acidity
- 2.3087 * volatile.acidity
+ 0.0732 * residual.sugar
+ 0.0022 * free.sulfur.dioxide
- 155.0175 * density
+ 0.6462 * pH
+ 0.7923 * sulphates
+ 0.0758 * alcohol
+ 156.2102
LM num: 2
quality =
-0.0314 * fixed.acidity
- 0.3415 * volatile.acidity
+ 1.7929 * citric.acid
+ 0.1316 * residual.sugar
- 0.2456 * chlorides
+ 0.1212 * free.sulfur.dioxide
- 178.6281 * density
+ 0.054 * pH
+ 0.1392 * sulphates
+ 0.0108 * alcohol
+ 180.6069
LM num: 3
quality =
-0.2019 * fixed.acidity
- 2.3804 * volatile.acidity
- 1.0851 * citric.acid
+ 0.0905 * residual.sugar
- 0.2456 * chlorides
+ 0.0041 * free.sulfur.dioxide
- 177.078 * density
+ 0.054 * pH
+ 0.0868 * sulphates
+ 0.0108 * alcohol
+ 183.5076
LM num: 4
quality =
0.0004 * fixed.acidity
- 0.0325 * volatile.acidity
+ 0.0957 * residual.sugar
- 5.9702 * chlorides
+ 0.0002 * free.sulfur.dioxide
- 172.3931 * density
+ 1.0123 * pH
+ 1.1653 * sulphates
+ 0.1542 * alcohol
+ 171.6842
Number of Rules : 4
You will note that the first split is the same as in the regression tree that we built earlier: alcohol is the most important variable, followed here by free sulfur dioxide. A key difference, however, is that the nodes terminate not in a numeric prediction, but in a linear model (shown here as LM1 through LM4).
The linear models themselves are shown in the preceding output. For instance, consider the model for LM1. The values can be interpreted exactly the same as the multiple regression models we built earlier in this chapter: each number is the net effect of the associated feature on the predicted wine quality. For example, the coefficient of 0.0777 for fixed acidity in LM1 implies that for an increase of one unit of fixed acidity, the predicted wine quality is expected to increase by about 0.08.
It is important to note that the effects estimated by LM1 apply only to wine samples reaching this node; a total of four linear models were built in this model tree, each with different estimates of the impact of fixed acidity and the other 10 features. For statistics on how well the model fits the training data, the summary() function can be applied to the M5P model. However, note that since these statistics are based on the training data, they should be used only as a rough diagnostic:
summary(m.m5p)
=== Summary ===
Correlation coefficient 0.5932
Mean absolute error 0.5804
Root mean squared error 0.7367
Relative absolute error 83.3671 %
Root relative squared error 80.507 %
Total Number of Instances 3750
Instead, we’ll look at how well the model performs on the unseen test data. The predict() function gets us a vector of predicted values:
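p.m5p <- predict(m.m5p, wine_test)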
The model tree appears to be predicting a wider range of values than the regression tree:
summary(p.m5p)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.169639 5.645572 6.031654 6.079428 6.501455 7.913091
The correlation also seems to be noticeably higher:
cor(p.m5p, wine_test$quality)
[1] 0.5317229714
Furthermore, the model has slightly reduced the mean absolute error:
MAE(wine_test$quality, p.m5p)
[1] 0.5660352338
Although we did not improve a great deal beyond the regression tree, we surpassed the performance of the neural network model published by Cortez, and we are getting closer to the published mean absolute error of 0.45 for the support vector machine model, all while using a much simpler learning method.
In this chapter, we studied two methods for modeling numeric data. The first method, linear regression, involves fitting straight lines to data. The second method uses decision trees for numeric prediction. The latter comes in two forms: regression trees, which use the average value of examples at leaf nodes to make numeric predictions; and model trees, which build a regression model at each leaf node in a hybrid approach that is, in some ways, the best of both worlds.
We used linear regression modeling to calculate the expected medical costs for various segments of the population. Because the relationship between the features and the target variable is well described by the estimated regression model, we were able to identify certain demographics, such as smokers and the obese, who may need to be charged higher insurance rates to cover their higher-than-average medical expenses. Regression trees and model trees were used to model the subjective quality of wines from measurable characteristics. In doing so, we learned how regression trees offer a simple way to explain the relationship between features and a numeric outcome, but the more complex model trees may be more accurate. Along the way, we learned several methods for evaluating the performance of numeric models.
In stark contrast to this chapter, which covered machine learning methods that result in a clear understanding of the relationships between the input and the output, the next chapter covers methods that result in nearly incomprehensible models. The upside is that they are extremely powerful techniques, among the most powerful stock classifiers, that can be applied to both classification and numeric prediction problems.
Lantz, Brett. Machine Learning with R. 2nd ed. Birmingham: Packt Publishing, 2015. Print.