This lesson is largely taken from the Carpentries Incubator Beta Lesson Machine Learning for Tabular Data in R
Slides shown to demonstrate some concepts
Broadly speaking, machine learning encompasses a range of techniques and algorithms for gaining insights from large data sets. In this lesson, we will focus on supervised learning for tabular data.
Tabular data takes the form of a data frame. The methods we consider can apply to a variety of data frames, from large to very large (e.g., up to 1,000’s of columns/variables and 100,000’s of rows/observations, or more).
Supervised learning methods build models that predict output values of a function, given some example input and output values. In our context, the output of this function will typically have the form of one of the columns of our data frame, while the input has the form of the remaining columns.
More about training and evaluating models in Machine Learning to come soon…
This lesson will focus on two machine learning methods that apply very well to tabular data in different forms:
Let’s simulate a data set of exam scores, along with letter grades.
library(tidyverse)
set.seed(456)
exam <- tibble(score = sample(80:100, 200, replace=TRUE)) %>%
mutate(grade = as_factor(ifelse(score < 90, "B", "A")))
head(exam)
## # A tibble: 6 × 2
## score grade
## <int> <fct>
## 1 92 A
## 2 84 B
## 3 82 B
## 4 85 B
## 5 100 A
## 6 93 A
summary(exam)
## score grade
## Min. : 80.0 A: 98
## 1st Qu.: 84.0 B:102
## Median : 89.0
## Mean : 89.6
## 3rd Qu.: 95.0
## Max. :100.0
Given only this data frame, can we recover the grading scale that was used to assign the letter grades? In other words, can we partition this data set into A’s and B’s, based on the exam score? The rpart command can form this partition for us.
In the following code block, the formula grade ~ score specifies that grade is the explanatory (independent) variable, and score is the response (dependent) variable. A helpful mnemonic is to read the ~ symbol as “as explained by.” This is the same formula notation used throughout R and in linear and logistic regression as well.
library(rpart)
library(rpart.plot)
examTree <- rpart(grade ~ score, data = exam)
rpart.plot(examTree)
The rpart function searches for the best way to split the data set into predicted values of the response variables, based on the explanatory variables. This Introduction to Rpart has details on how the split is chosen. In this simple case, the rpart function was able to perfectly partition the data after only one split. We can tell rpart.plot to report the number of correctly-classified cases in each node by including the option extra = 2.
rpart.plot(examTree, extra = 2)
Notice that in the top node, only the outcome with the most cases is shown (in this case B) – it’s a bit confusing, but that top (or Root) node is all of the data. After the split, both child nodes were classified perfectly.
In more complex situations, the algorithm will continue to create further splits to improve its classification. This process is called recursive partitioning (i.e., rpart).
Let’s now get to some real data. Consider the kyphosis data set, which is included in the rpart package.
library(rpart)
str(kyphosis)
## 'data.frame': 81 obs. of 4 variables:
## $ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
## $ Age : int 71 158 128 2 1 1 61 37 113 59 ...
## $ Number : int 3 3 4 5 4 2 2 3 2 6 ...
## $ Start : int 5 14 5 1 15 16 17 16 16 12 ...
For a description of this data set, you can view the help menu for kyphosis.
?kyphosis
For example, we will build a model that will predict whether a post-op kyphosis will be present (the output), given the age of the patient, the number of vertebrae involved, and the number of the first vertebra operated on (the input).
Let’s spend a few minutes exploring this data set.
summary(kyphosis)
## Kyphosis Age Number Start
## absent :64 Min. : 1.00 Min. : 2.000 Min. : 1.00
## present:17 1st Qu.: 26.00 1st Qu.: 3.000 1st Qu.: 9.00
## Median : 87.00 Median : 4.000 Median :13.00
## Mean : 83.65 Mean : 4.049 Mean :11.49
## 3rd Qu.:130.00 3rd Qu.: 5.000 3rd Qu.:16.00
## Max. :206.00 Max. :10.000 Max. :18.00
Notice that only 17 of the 81 cases in our data set indicate the presence of a kyphosis. Try making a scatterplot of two of the quantitative variables.
library(tidyverse)
ggplot(kyphosis, aes(x = Number, y = Start)) + geom_point()
In the jargon of machine learning, a model that predicts a categorical output variable is called a classification model, while one that predicts a quantitative (numeric) output is called a regression model. Note: this terminology conflicts slightly with the common use of the term “regression” in statistics.
Use the rpart function to create a decision tree using the kyphosis data set. The response variable is Kyphosis, and the explanatory varables are the remaining columns Age, Number, and Start.
ktree <- rpart(Kyphosis ~ ., data = kyphosis)
rpart.plot(ktree, extra = 2)
To make a prediction using this tree, start at the top node. Since Start is 12, and 12 >= 9, we follow the left (yes) edge. Since Start is not >= 15, we then follow the right (no) edge. Since Age is 59 and 59 is not < 55, we follow the right edge. Finally, since Age is not >= 111 we follow the right edge to the leaf and obtain the prediction present.
In the two leftmost leaves, all of the cases are classified correctly. However, in the three remaining leaves, there are 2, 3, and 8 incorrectly classified cases, for a total of 13 misclassifications.
In Machine Learning, we typcially don’t use the whole data set. Given a data frame, we will build machine learning models as follows.
Divide the data set into a training set and a testing set. Typically the training set will contain about 60% to 80% of the rows, while the testing set comprises the remaining rows. This train/test split is selected randomly.
Train the model on the training set. Part of this process may involve tuning: tweaking various model settings (i.e., hyperparameters) for optimal performance.
Test the performance of the model using the testing set. Since the testing set was not used in the training of the model, the testing performance will be a good indication of how well our model will perform on future (unknown) input values.
Once our model is built, we can use it to predict output values from new cases of input, and we can also examine the structure of the model to infer the nature of the relationship between the input and the output.
The first step in the process is to create a random train/test split of our data set. We will use the training set to build our model, without looking at the testing set. After our model is built, we will measure the accuracy of its predictions using the testing set.
There are various R packages that automate common tasks in machine learning, but it is instructive to use base R for now. The following commands will randomly select the row indexes of the training set (and therefore also of the testing set).
trainSize <- round(0.75 * nrow(kyphosis))
set.seed(6789) # so we all get the same random sets
trainIndex <- sample(nrow(kyphosis), trainSize)
Take a look at the trainIndex variable in the Environment tab of RStudio. Since we set a particular value for the random seed, we should all see the same sample of random numbers.
Next we form two data frames using these indexes. Recall that the selection of -trainIndex will select all rows whose indexes are not in the trainIndex vector.
trainDF <- kyphosis[trainIndex, ]
testDF <- kyphosis[-trainIndex, ]
We can View the train and test sets in RStudio to check that they form a random partition of the kyphosis data.
View(trainDF)
View(testDF)
Let’s train the model on the training set and test in on the testing set.
treeModel <- rpart(Kyphosis ~ Age + Number + Start, data = trainDF)
rpart.plot(treeModel, extra = 2)
Notice that we obtain a simpler tree using the training set instead of the full data set.
What proportion of cases in the training set were classified correctly?
Reading the leaves from left to right, there were 34, 11, and 6 correctly-classified cases:
(34+11+6)/nrow(trainDF)
## [1] 0.8360656
The predict function works on rpart models similarly to how it works on lm and glm models, but the output is a matrix of predicted probabilities.
predict(treeModel, testDF)
## absent present
## 12 0.9444444 0.05555556
## 21 0.9444444 0.05555556
## 32 0.6111111 0.38888889
## 36 0.9444444 0.05555556
## 37 0.6111111 0.38888889
## 38 0.6111111 0.38888889
## 39 0.1428571 0.85714286
## 43 0.1428571 0.85714286
## 44 0.6111111 0.38888889
## 46 0.6111111 0.38888889
## 47 0.9444444 0.05555556
## 51 0.6111111 0.38888889
## 52 0.9444444 0.05555556
## 57 0.9444444 0.05555556
## 60 0.9444444 0.05555556
## 66 0.6111111 0.38888889
## 68 0.9444444 0.05555556
## 69 0.6111111 0.38888889
## 70 0.9444444 0.05555556
## 76 0.9444444 0.05555556
To investigate the behavior of this model, we bind the columns of the predicted probabilities to the testing set data frame to create a new data frame called predDF.
predMatrix <- predict(treeModel, testDF)
predDF <- testDF %>%
bind_cols(predMatrix)
Compare the results in the predDF data frame with the plot of treeModel. Can you explain how the model is calculating the predicted probabilites?
Solution
Consider the first row of predDF.
predDF[1,]
## Kyphosis Age Number Start absent present
## 1 absent 148 3 16 0.9444444 0.05555556
Since the value of Start is greater than 13, we follow the “yes” branch of the decision tree and land at the leftmost leaf, labeled “absent”, with a probability of 34/36, which is approximately 0.9444. Similarly, consider row 8.
predDF[8,]
## Kyphosis Age Number Start absent present
## 8 absent 143 9 3 0.1428571 0.8571429
Since the value of Start is less than 13, we follow the “no” branch. Then since the value of Number is greater than 6, we follow the right branch to land on the leaf labeled “present”, with a probability of 6/7, which is 0.8571.
Let’s add a new column called Prediction to the predDF data frame that gives the model prediction (absent or present) for each row, based on the probability in the absent column of predDF.
predDF <- predDF %>%
mutate(Prediction = ifelse(predDF$absent > 0.5, "absent", "present"))
Recall that in supervised learning, we use the testing set to measure how our model performs on data it was not trained on. Since this model is a classification model, we compute accuracy as the proportion of correct predictions.
accuracy <- sum(predDF$Kyphosis == predDF$Prediction)/nrow(predDF)
cat("Proportion of correct predictions: ", accuracy, "\n")
## Proportion of correct predictions: 0.8
In general, the accuracy on the testing set will be less than the accuracy on the training set.
Repeat the construction of the decision tree model for the kyphosis data, but experiment with different values of the random seed to obtain different testing and training sets. Does the shape of the tree change? Does the testing set accuracy change?
set.seed(314159) # try a different seed
trainIndex <- sample(nrow(kyphosis), trainSize) # use the same size training set
trainDF <- kyphosis %>% slice(trainIndex)
testDF <- kyphosis %>% slice(-trainIndex)
treeModel <- rpart(Kyphosis ~ Age + Number + Start, data = trainDF)
rpart.plot(treeModel, extra = 2)
Changing the seed for the train/test split resulted in a different decision tree.
predMatrix <- predict(treeModel, testDF)
predictedKyphosis <- ifelse(predMatrix[,1] > 0.5, "absent", "present")
accuracy <- sum(testDF$Kyphosis == predictedKyphosis)/nrow(testDF)
cat("Proportion of correct predictions: ", accuracy, "\n")
## Proportion of correct predictions: 0.85
The testing set accuracy also changed.
In this episode we have seen that the structure of the decision tree can vary quite a bit when we make small changes to the training data. Training the model on the whole data set resulted in a very different tree than what we obtained by training on a slightly smaller testing set. And changing the choice of testing set, even while maintaining its size, also altered the tree structure. When the structure of a model changes significantly for a small change in the training data, we say that the model is not robust. Non-robustness can be a problem, because one or two unusual observations can make a big difference in the conclusions we draw.
Since their predictions can vary wildly depending on the randomly selected training set, decision trees are called weak learners. In the upcoming episodes, we will explore two methods that use ensembles of weak learners to create a models with more predictive strength.
We saw in the previous episode that decision tree models can be sensitive to small changes in the training data. Random Forests mitigate this issue by forming an ensemble (i.e., set) of decision trees, and using them all together to make a prediction.
For this episode, we will use a data set described in the article Modeling wine preferences by data mining from physicochemical properties, in Decision Support Systems, 47(4):547-553, by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. (The data can be downloaded online) The data set contains quality ratings and measurements from 6497 samples of wine; rows 1:1599 are red wine samples, and rows 1600:6497 are white wine.
download.file("https://www.openml.org/data/get_csv/49817/wine_quality.arff", "data/wine.csv")
wine <- read_csv("data/wine.csv")
## Rows: 6497 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlo...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(wine)
## Rows: 6,497
## Columns: 12
## $ fixed.acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
## $ volatile.acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
## $ citric.acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
## $ residual.sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
## $ free.sulfur.dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
## $ quality <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…
ggplot(wine, aes(x=quality)) +
geom_histogram(binwidth = 1)
The goal of the models that follow will be to predict the quality rating of a wine sample from its chemical properties.
To illustrate classification models with this data set, let’s create a categorical variable grade that will serve as a response variable.
redwineClass <- wine %>%
slice(1:1599) %>% # just the red wine samples
mutate(grade = as_factor(if_else(quality < 5.5, "bad", "good"))) %>%
select(-quality)
summary(redwineClass$grade)
## bad good
## 744 855
Create training and test sets using 80/20 splits
trainSize <- round (0.80 * nrow(redwineClass))
set.seed(1234)
trainIndex <- sample(nrow(redwineClass), trainSize)
trainDF <- redwineClass %>% slice(trainIndex)
testDF <- redwineClass %>% slice(-trainIndex)
Use the rpart function to create a decision tree model for predicting the grade variable from the remaining columns in the redwineClass data frame. Use the training and testing sets defined above. Compute the testing set accuracy.
rwtree <- rpart(grade ~ ., data = trainDF)
rpart.plot(rwtree, extra = 2)
rwp <- predict(rwtree, testDF, type = "class")
sum(testDF$grade == rwp)/nrow(testDF)
## [1] 0.696875
We can also prune a decision tree to possibly get a better test error by creating less variance. It’s a little confusing and doesn’t do too much with this example, but rpart automatically includes a cross validation of the Complexity Parameter (CP) that be used to prune to the best tree. Finding the minimum of the cross validation error (xerror) and then using that complexity paramter to prune the tree might lead to a tree that overfits less (and is more generalizable to new data).
prwtree <- prune(rwtree,
cp = rwtree$cptable[which.min(rwtree$cptable[,"xerror"]), "CP"])
rpart.plot(prwtree)
Let’s see if that helped the test error…
prwp <- predict(prwtree, testDF, type = "class")
sum(testDF$grade == prwp)/nrow(testDF)
## [1] 0.703125
It does, but not too much on this particular data, so it’s not a big improvement.
A random forest model combines several decision tree models as follows.
More details can be found in Breiman’s 2002 paper, pages 8-9.
The randomForest package is an R implementation of Breiman and Cutler’s original Fortran code. The syntax for building a model is similar to lm, glm, and rpart. Because of the randomness used in the algorithm, it is good practice to set the random seed so that the results will be reproducible.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(4567)
redwineForest <- randomForest(grade ~ ., data = trainDF)
The predict function works on random forests. Observe that it returns a vector of levels of the grade variable.
rwpred2 <- predict(redwineForest, testDF)
head(rwpred2)
## 1 2 3 4 5 6
## bad bad bad bad bad bad
## Levels: bad good
Compute the testing set accuracy of the random forest model above. Compare this accuracy with the decision tree model in the previous challenge.
sum(testDF$grade == rwpred2)/nrow(testDF)
## [1] 0.821875
The random forest model can correctly predict the grade about 82% of the time, which is an improvement of about 12 percentage points over the decision tree model. (Even with pruning)
Since each tree in the random forest is built on a bootstrap sample of the rows in the training set, some rows will be left out of each sample. Each tree can then make a prediction on each row that was excluded, i.e., “out of the bag” (OOB). Each row will be excluded by several trees, which form a subforest that produces an aggregate prediction for each row. The out-of-bag estimate of the error rate is the error rate of these predictions. The OOB error rate gives a convenient estimate of the error rate of the model (error = 1 - accuracy). The print function reports the OOB error.
print(redwineForest)
##
## Call:
## randomForest(formula = grade ~ ., data = trainDF)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 19.31%
## Confusion matrix:
## bad good class.error
## bad 467 122 0.2071307
## good 125 565 0.1811594
The confusion matrix gives further details on how the OOB error was calculated: Out of 470+119 bad wines, 119 were classified as good, and out of 126+564 good wines, 126 were classified as bad. Check that (119+126)/(470+119+126+564) = 0.1916.
Given that OOB error gives us a good way to estimate model accuracy, we can train random forest models on the entire data set, without using a train/test split (Breiman, 2001, p. 8).
set.seed(4567)
rwforFull <- randomForest(grade ~ ., data = redwineClass)
print(rwforFull)
##
## Call:
## randomForest(formula = grade ~ ., data = redwineClass)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.82%
## Confusion matrix:
## bad good class.error
## bad 615 129 0.1733871
## good 140 715 0.1637427
Caveat: If you plan to “tune” your model by adjusting the optional parameters in randomForest, it is still good practice to set aside a testing set for assessing model accuracy after the model has been tuned using only the training data. A train/test split is also helpful if you want to compare random forest models with other types of models (and have it be the same test set)
The out-of-bag errors can also be used to assess the importance of each explanatory variable in contributing to the accuracy of the model. If we set importance = TRUE in a call to randomForest, the function will keep track of how much the OOB accuracy decreases when each variable is randomly permuted. Scrambling a variable in this way effectively removes its predictive power from the model, so we would expect the most important variables to have the greatest decreases in accuracy.
set.seed(2345)
rwforFull <- randomForest(grade ~ ., data = redwineClass, importance = TRUE)
importance(rwforFull, type = 1)
## MeanDecreaseAccuracy
## fixed.acidity 33.60865
## volatile.acidity 50.37700
## citric.acid 30.12944
## residual.sugar 31.40780
## chlorides 34.39463
## free.sulfur.dioxide 32.60753
## total.sulfur.dioxide 50.95265
## density 42.35901
## pH 36.46366
## sulphates 58.41187
## alcohol 75.06260
Based on the Mean Decrease in Accuracy, Which two variables are the most important, and which two are the least important? Are these results consistent with the decision tree model you constructed in the first challenge of this episode?
For convenience, let’s sort the rows of the importance matrix.
importance(rwforFull, type = 1) %>%
as_tibble(rownames = "Variable") %>%
arrange(desc(MeanDecreaseAccuracy))
## # A tibble: 11 × 2
## Variable MeanDecreaseAccuracy
## <chr> <dbl>
## 1 alcohol 75.1
## 2 sulphates 58.4
## 3 total.sulfur.dioxide 51.0
## 4 volatile.acidity 50.4
## 5 density 42.4
## 6 pH 36.5
## 7 chlorides 34.4
## 8 fixed.acidity 33.6
## 9 free.sulfur.dioxide 32.6
## 10 residual.sugar 31.4
## 11 citric.acid 30.1
# or
varImpPlot(rwforFull, type = 1)
The top two variables are alcohol and sulphates, which both occur near the root of the decision tree. The least important variables are residual.sugar and citric.acid, which do not occur in the decision tree. So it seems consistent that the important variables play important roles in the decision tree, while the unimportant variables do not.
So far, we have applied decision trees and random forests to classification problems, where the dependent variable is categorical. These techniques also apply to regression problems, where we are predicting a quantitative variable.
Recall that in the original wine data set, there is a variable called quality that assigns each sample a numerical quality score. For the examples that follow, we will treat quality as a quantitative variable. As before, we select the rows that contain observations of red wines, and we form training and testing sets.
redwine <- wine %>% slice(1:1599)
trainSize <- round(0.80 * nrow(redwine))
set.seed(1234)
trainIndex <- sample(nrow(redwine), trainSize)
trainDF <- redwine %>% slice(trainIndex)
testDF <- redwine %>% slice(-trainIndex)
When the dependent variable is quantitative, we use the method = “anova” option in rpart to construct a decision tree. In this situation, the predicted value corresponding to each node is the mean value of the observations assigned to the node, and the algorithm searches for a split that minimizes the sum of the squared errors of these predictions.
rwtree <- rpart(quality ~ ., data = trainDF, method = "anova")
rpart.plot(rwtree)
The rpart.plot command rounds off numerical values to save space. To see more digits of accuracy, print the tree in text format.
rwtree
## n= 1279
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1279 827.94210 5.636435
## 2) alcohol< 10.525 787 332.33800 5.364676
## 4) sulphates< 0.575 310 100.56770 5.154839 *
## 5) sulphates>=0.575 477 209.24950 5.501048
## 10) volatile.acidity>=0.405 364 149.63460 5.403846
## 20) alcohol< 9.85 223 66.76233 5.278027 *
## 21) alcohol>=9.85 141 73.75887 5.602837 *
## 11) volatile.acidity< 0.405 113 45.09735 5.814159 *
## 3) alcohol>=10.525 492 344.51020 6.071138
## 6) volatile.acidity>=0.8725 26 23.11538 4.730769
## 12) volatile.acidity>=1.015 10 6.00000 4.000000 *
## 13) volatile.acidity< 1.015 16 8.43750 5.187500 *
## 7) volatile.acidity< 0.8725 466 272.07730 6.145923
## 14) sulphates< 0.635 184 95.77717 5.831522
## 28) alcohol< 11.65 102 53.84314 5.627451 *
## 29) alcohol>=11.65 82 32.40244 6.085366 *
## 15) sulphates>=0.635 282 146.24470 6.351064
## 30) alcohol< 11.55 166 79.81325 6.138554 *
## 31) alcohol>=11.55 116 48.20690 6.655172 *
For example, the first split tests whether alcohol < 10.525, not 11 as shown in the plot of the tree.
In the above decision tree, consider the root (5.6), its left child (5.4), and the leftmost leaf (5.2). Check that these numbers are in fact the average quality values (rounded to one decimal place) of the observations in trainDF that are assigned to each node
Solution
The root node contains all the observations, so compute its mean quality value.
mean(trainDF$quality)
## [1] 5.636435
The left child of the root contains observations where alcohol is less than 10.525.
trainDF %>%
filter(alcohol < 10.525) %>%
summarize(nodeValue = mean(quality))
## # A tibble: 1 × 1
## nodeValue
## <dbl>
## 1 5.36
The leftmost leaf contains observations where alcohol is less than 10.525 and sulphates is less than 0.575.
trainDF %>%
filter(alcohol < 10.525, sulphates < 0.575) %>%
summarize(nodeValue = mean(quality))
## # A tibble: 1 × 1
## nodeValue
## <dbl>
## 1 5.15
For regression trees, each leaf is assigned a predicted value, and the predict function selects the appropriate leaf/prediction based on the values of the explanatory variables.
predictedQuality <- predict(rwtree, testDF)
head(predictedQuality)
## 1 2 3 4 5 6
## 5.154839 5.278027 5.278027 5.154839 5.154839 5.602837
Since this is a regression model, we assess its performance using the root mean squared error (RMSE).
errors <- predictedQuality - testDF$quality
decTreeRMSE <- sqrt(mean(errors^2))
decTreeRMSE
## [1] 0.6862169
Constructing a random forest regression model uses randomForest with the same syntax as we used for classification models. The only difference is that the response variable quality in our formula is a quantitative variable. The predicted values given by predict are the average predictions of all the trees in the forest.
set.seed(4567)
rwfor <- randomForest(quality ~ ., data = trainDF)
predQualRF <- predict(rwfor, testDF)
rfErrors <- predQualRF - testDF$quality
rfRMSE <- sqrt(mean(rfErrors^2))
rfRMSE
## [1] 0.5913485
The random forest RMSE is better (smaller) than the decision tree RMSE.
print(rwfor)
##
## Call:
## randomForest(formula = quality ~ ., data = trainDF)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.3269279
## % Var explained: 49.5
The Mean of squared residuals is the MSE of the out-of-bag errors. The % Var explained term is a “pseudo R-squared”, computed as 1 - MSE/Var(y). The OOB MSE should be close to the MSE on the testing set. So again, you don’t always need a train/test split when working with random forests.
We conclude this episode with a series of challenges.
As with classification models, setting importance = TRUE in a call to randomForest will use the OOB errors to measure variable importance. In the regression case, the decrease in performance when a variable is permuted is measured by the percentage increase in MSE (%IncMSE). The larger the %IncMSE, the more important the variable.
Compute the %IncMSE variable importance for the red wine regression model, and sort the variables by their importance. Re-train the model on the entire redwine data set instead of on the testing set. Are the most important variables the same as in the classification model?
Solution
The syntax is almost identical to the classification case. Since the identifier %IncMSE contains special characters, it must be enclosed in backticks. VarImpPlot will work as well!
set.seed(2345)
rwFull <- randomForest(quality ~ ., data = redwine, importance = TRUE)
importance(rwFull, type = 1) %>%
as_tibble(rownames = "Variable") %>%
arrange(desc(`%IncMSE`))
## # A tibble: 11 × 2
## Variable `%IncMSE`
## <chr> <dbl>
## 1 alcohol 59.2
## 2 sulphates 55.6
## 3 volatile.acidity 39.7
## 4 total.sulfur.dioxide 38.3
## 5 density 34.5
## 6 chlorides 32.7
## 7 free.sulfur.dioxide 30.0
## 8 citric.acid 28.9
## 9 fixed.acidity 28.4
## 10 pH 25.1
## 11 residual.sugar 24.2
varImpPlot(rwFull, type = 1)
The top five most important variables are the same as in the classification model, but their order is slightly different. The variable pH seems to be relatively more important in the classification model.
Earler, we used the lm function to fit a linear model to the training set, and then we computed the RMSE on the testing set. Repeat this process for the redwine data set. How does the linear regression RMSE compare to the random forest RMSE? Compare the summary of your linear model with the variable importance rankings from the previous challenge.
redwine.lm <- lm(quality ~ ., data = trainDF)
lmRMSE <- sqrt(mean((predict(redwine.lm, testDF) - testDF$quality)^2))
lmRMSE
## [1] 0.6880957
The random forest RMSE is less than the linear model RMSE.
summary(redwine.lm)
##
## Call:
## lm(formula = quality ~ ., data = trainDF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68737 -0.36035 -0.03507 0.43456 1.95898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3242378 23.4596015 0.397 0.6911
## fixed.acidity 0.0108114 0.0284758 0.380 0.7043
## volatile.acidity -1.2144573 0.1320387 -9.198 < 2e-16 ***
## citric.acid -0.2957551 0.1608745 -1.838 0.0662 .
## residual.sugar 0.0193480 0.0168430 1.149 0.2509
## chlorides -1.8858347 0.4583915 -4.114 4.14e-05 ***
## free.sulfur.dioxide 0.0054883 0.0023783 2.308 0.0212 *
## total.sulfur.dioxide -0.0034664 0.0007974 -4.347 1.49e-05 ***
## density -5.0636470 23.9244350 -0.212 0.8324
## pH -0.4331191 0.2065623 -2.097 0.0362 *
## sulphates 0.9244109 0.1306583 7.075 2.47e-12 ***
## alcohol 0.2886439 0.0293633 9.830 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6385 on 1267 degrees of freedom
## Multiple R-squared: 0.3762, Adjusted R-squared: 0.3708
## F-statistic: 69.46 on 11 and 1267 DF, p-value: < 2.2e-16
The variables the smallest p-values (Pr(>|t|)) tend to correspond to the most important variables in the random forest model.
Rows 1600-6497 of the wine data frame correspond to white wine observations. Use these rows to train and test a random forest model, as we did with the red wine observations. Compute a testing set RMSE, and assess variable importance. Are the important variables for predicting white wine quality the same as the important variables for predicting red wine quality?
whitewine <- wine %>% slice(1600:6497)
trainSize <- round(0.80 * nrow(whitewine))
set.seed(1234)
trainIndex <- sample(nrow(whitewine), trainSize)
trainDF <- whitewine %>% dplyr::slice(trainIndex)
testDF <- whitewine %>% dplyr::slice(-trainIndex)
wwfor <- randomForest(quality ~ ., data = trainDF)
predQualww <- predict(wwfor, testDF)
errors <- predQualww - testDF$quality
wwRMSE <- sqrt(mean(errors^2))
The RMSE on the white wine testing set is about 0.63.
set.seed(4567)
wwforFull <- randomForest(quality ~ ., data = whitewine, importance = TRUE)
importance(wwforFull, type = 1) %>%
as_tibble(rownames = "Variable") %>%
arrange(desc(`%IncMSE`))
## # A tibble: 11 × 2
## Variable `%IncMSE`
## <chr> <dbl>
## 1 volatile.acidity 89.1
## 2 free.sulfur.dioxide 84.1
## 3 alcohol 79.4
## 4 pH 64.6
## 5 citric.acid 59.9
## 6 sulphates 54.9
## 7 fixed.acidity 54.2
## 8 residual.sugar 53.0
## 9 chlorides 52.6
## 10 total.sulfur.dioxide 50.0
## 11 density 43.6
varImpPlot(wwforFull, type = 1)
Relative to red wine, free.sulfur.dioxide is much more important for predicting white wine quality, and sulphates is less important.