Machine Learning For Tabular Data In R

This lesson is largely taken from the Carpentries Incubator Beta Lesson Machine Learning for Tabular Data in R

Slides shown to demonstrate some concepts

A Brief Introduction to Machine Learning

What is Machine Learning?

Broadly speaking, machine learning encompasses a range of techniques and algorithms for gaining insights from large data sets. In this lesson, we will focus on supervised learning for tabular data.

Tabular data takes the form of a data frame. The methods we consider can apply to a variety of data frames, from large to very large (e.g., up to 1,000’s of columns/variables and 100,000’s of rows/observations, or more).

Supervised learning methods build models that predict output values of a function, given some example input and output values. In our context, the output of this function will typically have the form of one of the columns of our data frame, while the input has the form of the remaining columns.

More about training and evaluating models in Machine Learning to come soon…

Our Focus

This lesson will focus on two machine learning methods that apply very well to tabular data in different forms:

Decision Trees
Random Forests

Decision Trees

Let’s simulate a data set of exam scores, along with letter grades.

library(tidyverse)

set.seed(456)

exam <- tibble(score = sample(80:100, 200, replace=TRUE)) %>%
  mutate(grade = as_factor(ifelse(score < 90, "B", "A")))

head(exam)

## # A tibble: 6 × 2
##   score grade
##   <int> <fct>
## 1    92 A    
## 2    84 B    
## 3    82 B    
## 4    85 B    
## 5   100 A    
## 6    93 A

summary(exam)

##      score       grade  
##  Min.   : 80.0   A: 98  
##  1st Qu.: 84.0   B:102  
##  Median : 89.0          
##  Mean   : 89.6          
##  3rd Qu.: 95.0          
##  Max.   :100.0

Given only this data frame, can we recover the grading scale that was used to assign the letter grades? In other words, can we partition this data set into A’s and B’s, based on the exam score? The rpart command can form this partition for us.

In the following code block, the formula grade ~ score specifies that grade is the explanatory (independent) variable, and score is the response (dependent) variable. A helpful mnemonic is to read the ~ symbol as “as explained by.” This is the same formula notation used throughout R and in linear and logistic regression as well.

library(rpart)
library(rpart.plot)

examTree <- rpart(grade ~ score, data = exam)

rpart.plot(examTree)

The rpart function searches for the best way to split the data set into predicted values of the response variables, based on the explanatory variables. This Introduction to Rpart has details on how the split is chosen. In this simple case, the rpart function was able to perfectly partition the data after only one split. We can tell rpart.plot to report the number of correctly-classified cases in each node by including the option extra = 2.

rpart.plot(examTree, extra = 2)

Notice that in the top node, only the outcome with the most cases is shown (in this case B) – it’s a bit confusing, but that top (or Root) node is all of the data. After the split, both child nodes were classified perfectly.

In more complex situations, the algorithm will continue to create further splits to improve its classification. This process is called recursive partitioning (i.e., rpart).

Example: Kyphosis Data Set

Let’s now get to some real data. Consider the kyphosis data set, which is included in the rpart package.

library(rpart)
str(kyphosis)

## 'data.frame':    81 obs. of  4 variables:
##  $ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
##  $ Age     : int  71 158 128 2 1 1 61 37 113 59 ...
##  $ Number  : int  3 3 4 5 4 2 2 3 2 6 ...
##  $ Start   : int  5 14 5 1 15 16 17 16 16 12 ...

For a description of this data set, you can view the help menu for kyphosis.

?kyphosis

For example, we will build a model that will predict whether a post-op kyphosis will be present (the output), given the age of the patient, the number of vertebrae involved, and the number of the first vertebra operated on (the input).

Let’s spend a few minutes exploring this data set.

summary(kyphosis)

##     Kyphosis       Age             Number           Start      
##  absent :64   Min.   :  1.00   Min.   : 2.000   Min.   : 1.00  
##  present:17   1st Qu.: 26.00   1st Qu.: 3.000   1st Qu.: 9.00  
##               Median : 87.00   Median : 4.000   Median :13.00  
##               Mean   : 83.65   Mean   : 4.049   Mean   :11.49  
##               3rd Qu.:130.00   3rd Qu.: 5.000   3rd Qu.:16.00  
##               Max.   :206.00   Max.   :10.000   Max.   :18.00

Notice that only 17 of the 81 cases in our data set indicate the presence of a kyphosis. Try making a scatterplot of two of the quantitative variables.

library(tidyverse)
ggplot(kyphosis, aes(x = Number, y = Start)) + geom_point()

Classification vs. Regression

In the jargon of machine learning, a model that predicts a categorical output variable is called a classification model, while one that predicts a quantitative (numeric) output is called a regression model. Note: this terminology conflicts slightly with the common use of the term “regression” in statistics.

Challenge: Use rpart on the kyphosis data

Use the rpart function to create a decision tree using the kyphosis data set. The response variable is Kyphosis, and the explanatory varables are the remaining columns Age, Number, and Start.

Use rpart.plot to plot your tree model.
Use this tree to predict the value of Kyphosis when Start is 12, Age is 59, and Number is 6.
How many of the 81 cases in the data set does this tree classify incorrectly?

ktree <- rpart(Kyphosis ~ ., data = kyphosis)
rpart.plot(ktree, extra = 2)

To make a prediction using this tree, start at the top node. Since Start is 12, and 12 >= 9, we follow the left (yes) edge. Since Start is not >= 15, we then follow the right (no) edge. Since Age is 59 and 59 is not < 55, we follow the right edge. Finally, since Age is not >= 111 we follow the right edge to the leaf and obtain the prediction present.

In the two leftmost leaves, all of the cases are classified correctly. However, in the three remaining leaves, there are 2, 3, and 8 incorrectly classified cases, for a total of 13 misclassifications.

Training/Testing Data

In Machine Learning, we typcially don’t use the whole data set. Given a data frame, we will build machine learning models as follows.

Divide the data set into a training set and a testing set. Typically the training set will contain about 60% to 80% of the rows, while the testing set comprises the remaining rows. This train/test split is selected randomly.

Train the model on the training set. Part of this process may involve tuning: tweaking various model settings (i.e., hyperparameters) for optimal performance.

Test the performance of the model using the testing set. Since the testing set was not used in the training of the model, the testing performance will be a good indication of how well our model will perform on future (unknown) input values.

Once our model is built, we can use it to predict output values from new cases of input, and we can also examine the structure of the model to infer the nature of the relationship between the input and the output.

The first step in the process is to create a random train/test split of our data set. We will use the training set to build our model, without looking at the testing set. After our model is built, we will measure the accuracy of its predictions using the testing set.

There are various R packages that automate common tasks in machine learning, but it is instructive to use base R for now. The following commands will randomly select the row indexes of the training set (and therefore also of the testing set).

trainSize <- round(0.75 * nrow(kyphosis))
set.seed(6789) # so we all get the same random sets
trainIndex <- sample(nrow(kyphosis), trainSize)

Take a look at the trainIndex variable in the Environment tab of RStudio. Since we set a particular value for the random seed, we should all see the same sample of random numbers.

Next we form two data frames using these indexes. Recall that the selection of -trainIndex will select all rows whose indexes are not in the trainIndex vector.

trainDF <- kyphosis[trainIndex, ]
testDF <- kyphosis[-trainIndex, ]

We can View the train and test sets in RStudio to check that they form a random partition of the kyphosis data.

View(trainDF)
View(testDF)

Using Decision Trees for Supervised Learning

Let’s train the model on the training set and test in on the testing set.

treeModel <- rpart(Kyphosis ~ Age + Number + Start, data = trainDF)
rpart.plot(treeModel, extra = 2)

Notice that we obtain a simpler tree using the training set instead of the full data set.

Challenge: Training Set Accuracy

What proportion of cases in the training set were classified correctly?

Reading the leaves from left to right, there were 34, 11, and 6 correctly-classified cases:

(34+11+6)/nrow(trainDF)

## [1] 0.8360656

Model Predictions on the Test Set

The predict function works on rpart models similarly to how it works on lm and glm models, but the output is a matrix of predicted probabilities.

predict(treeModel, testDF)

##       absent    present
## 12 0.9444444 0.05555556
## 21 0.9444444 0.05555556
## 32 0.6111111 0.38888889
## 36 0.9444444 0.05555556
## 37 0.6111111 0.38888889
## 38 0.6111111 0.38888889
## 39 0.1428571 0.85714286
## 43 0.1428571 0.85714286
## 44 0.6111111 0.38888889
## 46 0.6111111 0.38888889
## 47 0.9444444 0.05555556
## 51 0.6111111 0.38888889
## 52 0.9444444 0.05555556
## 57 0.9444444 0.05555556
## 60 0.9444444 0.05555556
## 66 0.6111111 0.38888889
## 68 0.9444444 0.05555556
## 69 0.6111111 0.38888889
## 70 0.9444444 0.05555556
## 76 0.9444444 0.05555556

To investigate the behavior of this model, we bind the columns of the predicted probabilities to the testing set data frame to create a new data frame called predDF.

predMatrix <- predict(treeModel, testDF)
predDF <- testDF %>%
  bind_cols(predMatrix)

Challenge: Predicted Probabilities

Compare the results in the predDF data frame with the plot of treeModel. Can you explain how the model is calculating the predicted probabilites?

Solution

Consider the first row of predDF.

predDF[1,]

##   Kyphosis Age Number Start    absent    present
## 1   absent 148      3    16 0.9444444 0.05555556

Since the value of Start is greater than 13, we follow the “yes” branch of the decision tree and land at the leftmost leaf, labeled “absent”, with a probability of 34/36, which is approximately 0.9444. Similarly, consider row 8.

predDF[8,]

##   Kyphosis Age Number Start    absent   present
## 8   absent 143      9     3 0.1428571 0.8571429

Since the value of Start is less than 13, we follow the “no” branch. Then since the value of Number is greater than 6, we follow the right branch to land on the leaf labeled “present”, with a probability of 6/7, which is 0.8571.

Testing Set Accuracy

Let’s add a new column called Prediction to the predDF data frame that gives the model prediction (absent or present) for each row, based on the probability in the absent column of predDF.

predDF <- predDF %>%
  mutate(Prediction = ifelse(predDF$absent > 0.5, "absent", "present"))

Recall that in supervised learning, we use the testing set to measure how our model performs on data it was not trained on. Since this model is a classification model, we compute accuracy as the proportion of correct predictions.

accuracy <- sum(predDF$Kyphosis == predDF$Prediction)/nrow(predDF)
cat("Proportion of correct predictions: ", accuracy, "\n")

## Proportion of correct predictions:  0.8

In general, the accuracy on the testing set will be less than the accuracy on the training set.

Challenge: Change the Training Set

Repeat the construction of the decision tree model for the kyphosis data, but experiment with different values of the random seed to obtain different testing and training sets. Does the shape of the tree change? Does the testing set accuracy change?

set.seed(314159) # try a different seed
trainIndex <- sample(nrow(kyphosis), trainSize) # use the same size training set
trainDF <- kyphosis %>% slice(trainIndex)
testDF <- kyphosis %>% slice(-trainIndex)
treeModel <- rpart(Kyphosis ~ Age + Number + Start, data = trainDF)
rpart.plot(treeModel, extra = 2)

Changing the seed for the train/test split resulted in a different decision tree.

predMatrix <- predict(treeModel, testDF)
predictedKyphosis <- ifelse(predMatrix[,1] > 0.5, "absent", "present")
accuracy <- sum(testDF$Kyphosis == predictedKyphosis)/nrow(testDF)
cat("Proportion of correct predictions: ", accuracy, "\n")

## Proportion of correct predictions:  0.85

The testing set accuracy also changed.

Robustness, or lack thereof

In this episode we have seen that the structure of the decision tree can vary quite a bit when we make small changes to the training data. Training the model on the whole data set resulted in a very different tree than what we obtained by training on a slightly smaller testing set. And changing the choice of testing set, even while maintaining its size, also altered the tree structure. When the structure of a model changes significantly for a small change in the training data, we say that the model is not robust. Non-robustness can be a problem, because one or two unusual observations can make a big difference in the conclusions we draw.

Since their predictions can vary wildly depending on the randomly selected training set, decision trees are called weak learners. In the upcoming episodes, we will explore two methods that use ensembles of weak learners to create a models with more predictive strength.

Random Forest

We saw in the previous episode that decision tree models can be sensitive to small changes in the training data. Random Forests mitigate this issue by forming an ensemble (i.e., set) of decision trees, and using them all together to make a prediction.

Wine Dataset

For this episode, we will use a data set described in the article Modeling wine preferences by data mining from physicochemical properties, in Decision Support Systems, 47(4):547-553, by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. (The data can be downloaded online) The data set contains quality ratings and measurements from 6497 samples of wine; rows 1:1599 are red wine samples, and rows 1600:6497 are white wine.

download.file("https://www.openml.org/data/get_csv/49817/wine_quality.arff", "data/wine.csv")
wine <- read_csv("data/wine.csv")

## Rows: 6497 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(wine)

## Rows: 6,497
## Columns: 12
## $ fixed.acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
## $ volatile.acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
## $ citric.acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
## $ residual.sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
## $ free.sulfur.dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
## $ quality              <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…

ggplot(wine, aes(x=quality)) +
  geom_histogram(binwidth = 1)

The goal of the models that follow will be to predict the quality rating of a wine sample from its chemical properties.

Red Wine Classification Model

To illustrate classification models with this data set, let’s create a categorical variable grade that will serve as a response variable.

redwineClass <- wine %>%
  slice(1:1599) %>% # just the red wine samples
  mutate(grade = as_factor(if_else(quality < 5.5, "bad", "good"))) %>%
  select(-quality)
summary(redwineClass$grade)

##  bad good 
##  744  855

Create Training and Test Sets

Create training and test sets using 80/20 splits

trainSize <- round (0.80 * nrow(redwineClass))
set.seed(1234)
trainIndex <- sample(nrow(redwineClass), trainSize)
trainDF <- redwineClass %>% slice(trainIndex)
testDF <- redwineClass %>% slice(-trainIndex)

Challenge: Create a decision tree model

Use the rpart function to create a decision tree model for predicting the grade variable from the remaining columns in the redwineClass data frame. Use the training and testing sets defined above. Compute the testing set accuracy.

rwtree <- rpart(grade ~ ., data = trainDF)
rpart.plot(rwtree, extra = 2)

rwp <- predict(rwtree, testDF, type = "class")
sum(testDF$grade == rwp)/nrow(testDF)

## [1] 0.696875

We can also prune a decision tree to possibly get a better test error by creating less variance. It’s a little confusing and doesn’t do too much with this example, but rpart automatically includes a cross validation of the Complexity Parameter (CP) that be used to prune to the best tree. Finding the minimum of the cross validation error (xerror) and then using that complexity paramter to prune the tree might lead to a tree that overfits less (and is more generalizable to new data).

prwtree <- prune(rwtree,
                 cp = rwtree$cptable[which.min(rwtree$cptable[,"xerror"]), "CP"])
rpart.plot(prwtree)

Let’s see if that helped the test error…

prwp <- predict(prwtree, testDF, type = "class")
sum(testDF$grade == prwp)/nrow(testDF)

## [1] 0.703125

It does, but not too much on this particular data, so it’s not a big improvement.

Random Forest Classification Model

A random forest model combines several decision tree models as follows.

Several different decision trees are built, each from a random bootstrap sample of the same size as the original data. This process is also known as bagging (bootstrap aggregation).
For each tree model, a randomly-chosen subset of variables is chosen to be available for each split.
Each decision tree model makes a prediction, and the category with the most “votes” is selected as the prediction of the random forest.

More details can be found in Breiman’s 2002 paper, pages 8-9.

The randomForest package is an R implementation of Breiman and Cutler’s original Fortran code. The syntax for building a model is similar to lm, glm, and rpart. Because of the randomness used in the algorithm, it is good practice to set the random seed so that the results will be reproducible.

library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

set.seed(4567)
redwineForest <- randomForest(grade ~ ., data = trainDF)

The predict function works on random forests. Observe that it returns a vector of levels of the grade variable.

rwpred2 <- predict(redwineForest, testDF)
head(rwpred2)

##   1   2   3   4   5   6 
## bad bad bad bad bad bad 
## Levels: bad good

Challenge: Accuracy of the Random Forest Model

Compute the testing set accuracy of the random forest model above. Compare this accuracy with the decision tree model in the previous challenge.

sum(testDF$grade == rwpred2)/nrow(testDF)

## [1] 0.821875

The random forest model can correctly predict the grade about 82% of the time, which is an improvement of about 12 percentage points over the decision tree model. (Even with pruning)

Out-of-bag Error

Since each tree in the random forest is built on a bootstrap sample of the rows in the training set, some rows will be left out of each sample. Each tree can then make a prediction on each row that was excluded, i.e., “out of the bag” (OOB). Each row will be excluded by several trees, which form a subforest that produces an aggregate prediction for each row. The out-of-bag estimate of the error rate is the error rate of these predictions. The OOB error rate gives a convenient estimate of the error rate of the model (error = 1 - accuracy). The print function reports the OOB error.

print(redwineForest)

## 
## Call:
##  randomForest(formula = grade ~ ., data = trainDF) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 19.31%
## Confusion matrix:
##      bad good class.error
## bad  467  122   0.2071307
## good 125  565   0.1811594

The confusion matrix gives further details on how the OOB error was calculated: Out of 470+119 bad wines, 119 were classified as good, and out of 126+564 good wines, 126 were classified as bad. Check that (119+126)/(470+119+126+564) = 0.1916.

Train on the Whole Data Set

Given that OOB error gives us a good way to estimate model accuracy, we can train random forest models on the entire data set, without using a train/test split (Breiman, 2001, p. 8).

set.seed(4567)
rwforFull <- randomForest(grade ~ ., data = redwineClass)
print(rwforFull)

## 
## Call:
##  randomForest(formula = grade ~ ., data = redwineClass) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 16.82%
## Confusion matrix:
##      bad good class.error
## bad  615  129   0.1733871
## good 140  715   0.1637427

Caveat: If you plan to “tune” your model by adjusting the optional parameters in randomForest, it is still good practice to set aside a testing set for assessing model accuracy after the model has been tuned using only the training data. A train/test split is also helpful if you want to compare random forest models with other types of models (and have it be the same test set)

Variable Importance

The out-of-bag errors can also be used to assess the importance of each explanatory variable in contributing to the accuracy of the model. If we set importance = TRUE in a call to randomForest, the function will keep track of how much the OOB accuracy decreases when each variable is randomly permuted. Scrambling a variable in this way effectively removes its predictive power from the model, so we would expect the most important variables to have the greatest decreases in accuracy.

set.seed(2345)
rwforFull <- randomForest(grade ~ ., data = redwineClass, importance = TRUE)
importance(rwforFull, type = 1)

##                      MeanDecreaseAccuracy
## fixed.acidity                    33.60865
## volatile.acidity                 50.37700
## citric.acid                      30.12944
## residual.sugar                   31.40780
## chlorides                        34.39463
## free.sulfur.dioxide              32.60753
## total.sulfur.dioxide             50.95265
## density                          42.35901
## pH                               36.46366
## sulphates                        58.41187
## alcohol                          75.06260

Challenge: Most and Least Important

Based on the Mean Decrease in Accuracy, Which two variables are the most important, and which two are the least important? Are these results consistent with the decision tree model you constructed in the first challenge of this episode?

For convenience, let’s sort the rows of the importance matrix.

importance(rwforFull, type = 1) %>%
  as_tibble(rownames = "Variable") %>%
  arrange(desc(MeanDecreaseAccuracy))

## # A tibble: 11 × 2
##    Variable             MeanDecreaseAccuracy
##    <chr>                               <dbl>
##  1 alcohol                              75.1
##  2 sulphates                            58.4
##  3 total.sulfur.dioxide                 51.0
##  4 volatile.acidity                     50.4
##  5 density                              42.4
##  6 pH                                   36.5
##  7 chlorides                            34.4
##  8 fixed.acidity                        33.6
##  9 free.sulfur.dioxide                  32.6
## 10 residual.sugar                       31.4
## 11 citric.acid                          30.1

# or

varImpPlot(rwforFull, type = 1)

The top two variables are alcohol and sulphates, which both occur near the root of the decision tree. The least important variables are residual.sugar and citric.acid, which do not occur in the decision tree. So it seems consistent that the important variables play important roles in the decision tree, while the unimportant variables do not.

Red Wine Regression Model

So far, we have applied decision trees and random forests to classification problems, where the dependent variable is categorical. These techniques also apply to regression problems, where we are predicting a quantitative variable.

Recall that in the original wine data set, there is a variable called quality that assigns each sample a numerical quality score. For the examples that follow, we will treat quality as a quantitative variable. As before, we select the rows that contain observations of red wines, and we form training and testing sets.

redwine <- wine %>% slice(1:1599) 
trainSize <- round(0.80 * nrow(redwine))
set.seed(1234) 
trainIndex <- sample(nrow(redwine), trainSize)
trainDF <- redwine %>% slice(trainIndex)
testDF <- redwine %>% slice(-trainIndex)

Fit a Decision Tree

When the dependent variable is quantitative, we use the method = “anova” option in rpart to construct a decision tree. In this situation, the predicted value corresponding to each node is the mean value of the observations assigned to the node, and the algorithm searches for a split that minimizes the sum of the squared errors of these predictions.

rwtree <- rpart(quality ~ ., data = trainDF, method = "anova")
rpart.plot(rwtree)

The rpart.plot command rounds off numerical values to save space. To see more digits of accuracy, print the tree in text format.

rwtree

## n= 1279 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 1279 827.94210 5.636435  
##    2) alcohol< 10.525 787 332.33800 5.364676  
##      4) sulphates< 0.575 310 100.56770 5.154839 *
##      5) sulphates>=0.575 477 209.24950 5.501048  
##       10) volatile.acidity>=0.405 364 149.63460 5.403846  
##         20) alcohol< 9.85 223  66.76233 5.278027 *
##         21) alcohol>=9.85 141  73.75887 5.602837 *
##       11) volatile.acidity< 0.405 113  45.09735 5.814159 *
##    3) alcohol>=10.525 492 344.51020 6.071138  
##      6) volatile.acidity>=0.8725 26  23.11538 4.730769  
##       12) volatile.acidity>=1.015 10   6.00000 4.000000 *
##       13) volatile.acidity< 1.015 16   8.43750 5.187500 *
##      7) volatile.acidity< 0.8725 466 272.07730 6.145923  
##       14) sulphates< 0.635 184  95.77717 5.831522  
##         28) alcohol< 11.65 102  53.84314 5.627451 *
##         29) alcohol>=11.65 82  32.40244 6.085366 *
##       15) sulphates>=0.635 282 146.24470 6.351064  
##         30) alcohol< 11.55 166  79.81325 6.138554 *
##         31) alcohol>=11.55 116  48.20690 6.655172 *

For example, the first split tests whether alcohol < 10.525, not 11 as shown in the plot of the tree.

Challenge: Check the Splits

In the above decision tree, consider the root (5.6), its left child (5.4), and the leftmost leaf (5.2). Check that these numbers are in fact the average quality values (rounded to one decimal place) of the observations in trainDF that are assigned to each node

Solution

The root node contains all the observations, so compute its mean quality value.

mean(trainDF$quality)

## [1] 5.636435

The left child of the root contains observations where alcohol is less than 10.525.

trainDF %>% 
  filter(alcohol < 10.525) %>% 
  summarize(nodeValue = mean(quality))

## # A tibble: 1 × 1
##   nodeValue
##       <dbl>
## 1      5.36

The leftmost leaf contains observations where alcohol is less than 10.525 and sulphates is less than 0.575.

trainDF %>% 
  filter(alcohol < 10.525, sulphates < 0.575) %>% 
  summarize(nodeValue = mean(quality))

## # A tibble: 1 × 1
##   nodeValue
##       <dbl>
## 1      5.15

Decision Tree RMSE

For regression trees, each leaf is assigned a predicted value, and the predict function selects the appropriate leaf/prediction based on the values of the explanatory variables.

predictedQuality <- predict(rwtree, testDF)
head(predictedQuality)

##        1        2        3        4        5        6 
## 5.154839 5.278027 5.278027 5.154839 5.154839 5.602837

Since this is a regression model, we assess its performance using the root mean squared error (RMSE).

errors <- predictedQuality - testDF$quality
decTreeRMSE <- sqrt(mean(errors^2))
decTreeRMSE

## [1] 0.6862169

Random Forest Regression Model

Constructing a random forest regression model uses randomForest with the same syntax as we used for classification models. The only difference is that the response variable quality in our formula is a quantitative variable. The predicted values given by predict are the average predictions of all the trees in the forest.

set.seed(4567)
rwfor <- randomForest(quality ~ ., data = trainDF)
predQualRF <- predict(rwfor, testDF) 
rfErrors <- predQualRF - testDF$quality
rfRMSE <- sqrt(mean(rfErrors^2))
rfRMSE

## [1] 0.5913485

The random forest RMSE is better (smaller) than the decision tree RMSE.

print(rwfor)

## 
## Call:
##  randomForest(formula = quality ~ ., data = trainDF) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.3269279
##                     % Var explained: 49.5

The Mean of squared residuals is the MSE of the out-of-bag errors. The % Var explained term is a “pseudo R-squared”, computed as 1 - MSE/Var(y). The OOB MSE should be close to the MSE on the testing set. So again, you don’t always need a train/test split when working with random forests.

We conclude this episode with a series of challenges.

Challenge: Variable Importance

As with classification models, setting importance = TRUE in a call to randomForest will use the OOB errors to measure variable importance. In the regression case, the decrease in performance when a variable is permuted is measured by the percentage increase in MSE (%IncMSE). The larger the %IncMSE, the more important the variable.

Compute the %IncMSE variable importance for the red wine regression model, and sort the variables by their importance. Re-train the model on the entire redwine data set instead of on the testing set. Are the most important variables the same as in the classification model?

Solution

The syntax is almost identical to the classification case. Since the identifier %IncMSE contains special characters, it must be enclosed in backticks. VarImpPlot will work as well!

set.seed(2345)
rwFull <- randomForest(quality ~ ., data = redwine, importance = TRUE)
importance(rwFull, type = 1) %>%
  as_tibble(rownames = "Variable") %>% 
  arrange(desc(`%IncMSE`))

## # A tibble: 11 × 2
##    Variable             `%IncMSE`
##    <chr>                    <dbl>
##  1 alcohol                   59.2
##  2 sulphates                 55.6
##  3 volatile.acidity          39.7
##  4 total.sulfur.dioxide      38.3
##  5 density                   34.5
##  6 chlorides                 32.7
##  7 free.sulfur.dioxide       30.0
##  8 citric.acid               28.9
##  9 fixed.acidity             28.4
## 10 pH                        25.1
## 11 residual.sugar            24.2

varImpPlot(rwFull, type = 1)

The top five most important variables are the same as in the classification model, but their order is slightly different. The variable pH seems to be relatively more important in the classification model.

Challenge: Linear Regression Model

Earler, we used the lm function to fit a linear model to the training set, and then we computed the RMSE on the testing set. Repeat this process for the redwine data set. How does the linear regression RMSE compare to the random forest RMSE? Compare the summary of your linear model with the variable importance rankings from the previous challenge.

redwine.lm <- lm(quality ~ ., data = trainDF)
lmRMSE <- sqrt(mean((predict(redwine.lm, testDF) - testDF$quality)^2))
lmRMSE

## [1] 0.6880957

The random forest RMSE is less than the linear model RMSE.

summary(redwine.lm)

## 
## Call:
## lm(formula = quality ~ ., data = trainDF)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68737 -0.36035 -0.03507  0.43456  1.95898 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           9.3242378 23.4596015   0.397   0.6911    
## fixed.acidity         0.0108114  0.0284758   0.380   0.7043    
## volatile.acidity     -1.2144573  0.1320387  -9.198  < 2e-16 ***
## citric.acid          -0.2957551  0.1608745  -1.838   0.0662 .  
## residual.sugar        0.0193480  0.0168430   1.149   0.2509    
## chlorides            -1.8858347  0.4583915  -4.114 4.14e-05 ***
## free.sulfur.dioxide   0.0054883  0.0023783   2.308   0.0212 *  
## total.sulfur.dioxide -0.0034664  0.0007974  -4.347 1.49e-05 ***
## density              -5.0636470 23.9244350  -0.212   0.8324    
## pH                   -0.4331191  0.2065623  -2.097   0.0362 *  
## sulphates             0.9244109  0.1306583   7.075 2.47e-12 ***
## alcohol               0.2886439  0.0293633   9.830  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6385 on 1267 degrees of freedom
## Multiple R-squared:  0.3762, Adjusted R-squared:  0.3708 
## F-statistic: 69.46 on 11 and 1267 DF,  p-value: < 2.2e-16

The variables the smallest p-values (Pr(>|t|)) tend to correspond to the most important variables in the random forest model.

Challenge: White Wine

Rows 1600-6497 of the wine data frame correspond to white wine observations. Use these rows to train and test a random forest model, as we did with the red wine observations. Compute a testing set RMSE, and assess variable importance. Are the important variables for predicting white wine quality the same as the important variables for predicting red wine quality?

whitewine <- wine %>% slice(1600:6497) 
trainSize <- round(0.80 * nrow(whitewine))
set.seed(1234) 
trainIndex <- sample(nrow(whitewine), trainSize)
trainDF <- whitewine %>% dplyr::slice(trainIndex)
testDF <- whitewine %>% dplyr::slice(-trainIndex)
wwfor <- randomForest(quality ~ ., data = trainDF)
predQualww <- predict(wwfor, testDF) 
errors <- predQualww - testDF$quality
wwRMSE <- sqrt(mean(errors^2))

The RMSE on the white wine testing set is about 0.63.

set.seed(4567)
wwforFull <- randomForest(quality ~ ., data = whitewine, importance = TRUE)
importance(wwforFull, type = 1) %>% 
  as_tibble(rownames = "Variable") %>% 
  arrange(desc(`%IncMSE`))

## # A tibble: 11 × 2
##    Variable             `%IncMSE`
##    <chr>                    <dbl>
##  1 volatile.acidity          89.1
##  2 free.sulfur.dioxide       84.1
##  3 alcohol                   79.4
##  4 pH                        64.6
##  5 citric.acid               59.9
##  6 sulphates                 54.9
##  7 fixed.acidity             54.2
##  8 residual.sugar            53.0
##  9 chlorides                 52.6
## 10 total.sulfur.dioxide      50.0
## 11 density                   43.6

varImpPlot(wwforFull, type = 1)

Relative to red wine, free.sulfur.dioxide is much more important for predicting white wine quality, and sulphates is less important.

Machine Learning For Tabular Data In R

James Capobianco, from Beta Carpentries Lesson

2023-04-12

A Brief Introduction to Machine Learning

What is Machine Learning?

Our Focus

Decision Trees

Decision Trees

Example: Kyphosis Data Set

Classification vs. Regression

Challenge: Use rpart on the kyphosis data

Training/Testing Data

Using Decision Trees for Supervised Learning

Challenge: Training Set Accuracy

Model Predictions on the Test Set

Challenge: Predicted Probabilities

Testing Set Accuracy

Challenge: Change the Training Set

Robustness, or lack thereof

Random Forest

Wine Dataset

Red Wine Classification Model

Create Training and Test Sets

Challenge: Create a decision tree model

Random Forest Classification Model

Challenge: Accuracy of the Random Forest Model

Out-of-bag Error

Train on the Whole Data Set

Variable Importance

Challenge: Most and Least Important

Red Wine Regression Model

Fit a Decision Tree

Challenge: Check the Splits

Decision Tree RMSE

Random Forest Regression Model

Challenge: Variable Importance

Challenge: Linear Regression Model

Challenge: White Wine