Welcome to the first script of Module 5! Here, we will use a data set about wine to build our first Random Forest models. You will learn how to:

  • create and interpret a Random Forest model
  • assess the accuracy of Random Forest predictions
  • optimize hyper-parameters of Random Forests
  • visualize model predictions and variable importance

The data and code used in this script are taken from https://www.simplilearn.com/tutorials/data-science-tutorial/random-forest-in-r

Please read the following code instructions carefully, try to understand the code that follows each instruction, execute it and see what happens. Do not hesitate to change the code or write your own code, and execute it as well!

Please consult the following websites for more explanations about Random Forests: https://www.listendata.com/2014/11/random-forest-with-r.html and https://www.edureka.co/blog/random-forest-classifier


1. Data inspection

As usual, we start by loading the packages we need, and by defining our working directory:

library(randomForest)
library(ggplot2)

dir <- "C:/Users/Max/Desktop/IAMO_neu/eLearning/Module_5_Random_Forests/M05_Part01_Wine/"

Import the data set about wine. It contains several chemical parameters and a taste variable for 1599 wine samples. “Taste” is a categorical (=factor) variable; all other variables are continuous (=numeric). For 63 samples, the taste of the wine was “bad”; for 217 it was “good”; and for 1319 it was “normal”:

data <- read.csv(paste0(dir, "wine_taste.csv"))
data$taste <- as.factor(data$taste)

head(data)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol  taste
## 1           7.4             0.70        0.00            1.9     0.076                  11                   34  0.9978 3.51      0.56     9.4 normal
## 2           7.8             0.88        0.00            2.6     0.098                  25                   67  0.9968 3.20      0.68     9.8 normal
## 3           7.8             0.76        0.04            2.3     0.092                  15                   54  0.9970 3.26      0.65     9.8 normal
## 4          11.2             0.28        0.56            1.9     0.075                  17                   60  0.9980 3.16      0.58     9.8 normal
## 5           7.4             0.70        0.00            1.9     0.076                  11                   34  0.9978 3.51      0.56     9.4 normal
## 6           7.4             0.66        0.00            1.8     0.075                  13                   40  0.9978 3.51      0.56     9.4 normal
str(data)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ taste               : Factor w/ 3 levels "bad","good","normal": 3 3 3 3 3 3 3 2 2 3 ...
table(data$taste)
## 
##    bad   good normal 
##     63    217   1319
barplot(table(data$taste), main="Taste of Red Wine", xlab="taste", ylab="number of samples")
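
The classes are strongly imbalanced. We can make this explicit by expressing the counts as proportions - a quick sketch using prop.table():

round(prop.table(table(data$taste)), 3)   # share of each taste class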

We can visually examine the relationships between some of the parameters, e.g. with ggplot. You will notice that the three “taste” groups cannot easily be distinguished by looking at just two chemical parameters - that is why we need a random forest!

ggplot(data,aes(fixed.acidity, volatile.acidity)) + 
  geom_point(aes(color=taste))

ggplot(data,aes(alcohol, pH)) + 
  geom_point(aes(color=taste))

2. Random Forest - Classification

We will first use a random forest model for classification, i.e. we will predict the taste of wine with the chemical parameters. That means that “taste” is the response variable, and all other variables are predictor variables.

For the model, we have to split our data set into two parts: a training and a testing data set. We randomly select 80% of all 1599 samples and assign them to the training data set. This is the data that trains the model, i.e. based on these observations, the random forest learns which combination of chemical parameters leads to which taste. We use the remaining 20% of all 1599 samples as the testing data set. We then apply the trained model to the testing data, i.e. we use it to predict the taste of the samples in the testing data set, based on the rules that the model has previously identified in the training data set. Finally, we compare the predicted taste in the testing data set to the actual taste - the higher the agreement between predicted and true taste, the more reliable our model is.

Let us now divide our data set into training and testing, using the function sample(). Each time we execute this function, it draws a different random selection of our data, so if we re-ran our script, the results would change, and it would be difficult to compare results when other parameters are varied. To avoid that, we can define the random initialization point with set.seed(), which makes sure that the same samples are assigned to “training” and “testing” the next time.

set.seed(123)
subsample <- sample(nrow(data), 0.8 * nrow(data))

training <- data[subsample, ]
testing  <- data[-subsample, ]

NROW(training)
## [1] 1279
NROW(testing)
## [1] 320
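
Because the split is random, it is worth checking that all three taste classes appear in both subsets - for example with table(), as before:

table(training$taste)   # class counts in the training data
table(testing$taste)    # class counts in the testing data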

Now we can already use the function randomForest(). In this function, we define the data set we use - the object “training” - and the response variable that the model is supposed to predict - “taste”. The expression “~.” simply means that “taste” should be predicted by all other variables contained in “training”. The hyper-parameters “ntree” and “mtry” represent the number of decision trees to be grown, and the number of variables that are randomly drawn from all predictor variables to be available for prediction at each node in a tree. The resulting object “model1” gives us a summary of the formula and the hyper-parameters we defined. It also contains an estimate of the out-of-bag (OOB) error, i.e. the rate of misclassifications across all taste classes, and the confusion matrix:

model1 <- randomForest(taste~., data = training, ntree = 300, mtry = 5, importance = TRUE)

model1
## 
## Call:
##  randomForest(formula = taste ~ ., data = training, ntree = 300,      mtry = 5, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 13.6%
## Confusion matrix:
##        bad good normal class.error
## bad      0    1     51  1.00000000
## good     0  105     83  0.44148936
## normal   3   36   1000  0.03753609

Let’s have a closer look at the confusion matrix. The row names represent the true/observed taste, and the columns represent the taste classes that the model assigns. For example, 51 samples that actually have a bad taste are classified as having a normal taste, and 36 samples that have a normal taste are classified as having a good taste. The classification error is the fraction of misclassifications for each taste class - 100% of all samples with a bad taste were misclassified, but only about 3.75% of the samples with a normal taste were misclassified:

model1$confusion
##        bad good normal class.error
## bad      0    1     51  1.00000000
## good     0  105     83  0.44148936
## normal   3   36   1000  0.03753609
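
To see where the class.error column comes from, we can recompute it from the raw counts in the confusion matrix - a small sketch, assuming the matrix layout shown above:

cm <- model1$confusion[, 1:3]    # drop the class.error column
1 - diag(cm) / rowSums(cm)       # fraction of misclassified samples per true class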

How do we know we chose a good number of trees? The function plot() automatically plots the error trajectory of each taste class and the out-of-bag error as a function of the number of trees. We see that the curves level off somewhere between 50 and 100 trees, so we could safely reduce the number of trees to 100:

plot(model1)
legend("top", colnames(model1$err.rate), col=1:4, cex=0.8, fill=1:4)

To find out if “mtry” has an effect on the model performance, let’s iterate over a range of possible values for mtry (2 to 11; with 11 predictor variables, 11 is the maximum). We see that the OOB error hardly changes at all:

OOB_vec <- NA                 # will grow into a vector of OOB errors, one per mtry value
for (i in 2:11) {
  model_temp <- randomForest(taste~., data = training, ntree = 100, mtry = i)
  OOB_vec[i] <- model_temp$err.rate[100, 1]   # overall OOB error after 100 trees
}

OOB_vec
##  [1]        NA 0.1391712 0.1383894 0.1368256 0.1391712 0.1383894 0.1446443
##  [8] 0.1383894 0.1493354 0.1469898 0.1383894
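
Plotting the vector we just filled makes it easy to see how flat this curve is - a minimal sketch with base plot():

plot(2:11, OOB_vec[2:11], type = "b",
     xlab = "mtry", ylab = "OOB error (ntree = 100)",
     main = "Effect of mtry on the OOB error")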

We can plot the importance of all variables in the model with the function varImpPlot(). We focus on the mean decrease in accuracy here: it measures how much the model’s prediction accuracy drops when the values of a variable are randomly permuted, scaled by the standard deviation of that drop. According to this measure, the most important variable for taste is the alcohol content - its value of about 36 is by far the largest, meaning that shuffling “alcohol” degrades the predictions the most:

importance(model1)
##                            bad     good    normal MeanDecreaseAccuracy
## fixed.acidity        -3.645110 12.09771 14.642308             18.61179
## volatile.acidity      8.648063 28.41414  8.590038             22.46140
## citric.acid           1.565737 17.30854 12.051312             20.39269
## residual.sugar        2.271970 16.22289 16.734898             21.70459
## chlorides            -2.269836 14.60409 17.142164             21.32082
## free.sulfur.dioxide   2.461333 11.80994 15.788476             19.50181
## total.sulfur.dioxide  2.434719 22.15200 16.829435             24.54720
## density              -1.828830 14.19200 17.018732             21.51060
## pH                    1.560722 11.41297 15.100471             19.81527
## sulphates             5.772700 33.33643 12.125685             29.98570
## alcohol               1.567490 40.78602 20.701783             36.64469
##                      MeanDecreaseGini
## fixed.acidity                26.60980
## volatile.acidity             47.72406
## citric.acid                  33.39882
## residual.sugar               32.41572
## chlorides                    31.48853
## free.sulfur.dioxide          25.74612
## total.sulfur.dioxide         36.60599
## density                      33.91535
## pH                           27.70876
## sulphates                    45.53134
## alcohol                      62.88105
varImpPlot(model1, type = 1, sort = TRUE, n.var = 11, main = "Variable Importance")   # type = 1: mean decrease in accuracy
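
If you prefer an importance measure that can be read directly as a drop in the fraction of correct predictions, importance() can return the raw, unscaled accuracy decrease - a one-line sketch that simply switches off the default scaling:

importance(model1, type = 1, scale = FALSE)   # raw mean decrease in accuracy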

We now apply our model to the testing data set using the function predict(). The result is a vector called “prediction1”, whose names correspond to the original row numbers in “data”. We can convert the predictions into a confusion matrix using the function table(), where “testing$taste” is the true/observed taste of all samples in “testing”:

prediction1 <- predict(model1, newdata = testing)
head(prediction1)
##      3      7     15     22     23     27 
## normal normal normal normal normal normal 
## Levels: bad good normal
table(prediction1, testing$taste)
##            
## prediction1 bad good normal
##      bad      0    0      2
##      good     0   18     10
##      normal  11   11    268

To calculate the accuracy of “prediction1”, we can simply divide the number of correct classifications of the samples in “testing” by the total number of samples in “testing”. The result means that the model trained on “training” correctly classified about 89% of all samples in “testing”, which is not too bad!

sum(prediction1==testing$taste) / nrow(testing)
## [1] 0.89375

3. Regression

Let us now change the response variable and predict, for example, “pH” instead of “taste”. We change the model formula accordingly:

model2 <- randomForest(pH~., data = training, ntree = 300, mtry = 2, importance = TRUE)

This model automatically performs a regression instead of a classification because it detects that the response variable is not categorical but continuous. The mean of squared residuals and the percentage of variance explained by the model are two indicators of the model performance and describe how well the model fits the data. Here, the mean squared residual is about 0.0065, which corresponds to a typical prediction error of about 0.08 pH units (the square root of the mean squared residual), and the model explains about 73% of the variance in the pH data, which is not too bad.

model2
## 
## Call:
##  randomForest(formula = pH ~ ., data = training, ntree = 300,      mtry = 2, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 300
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 0.006497377
##                     % Var explained: 73.52
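
The per-tree mean squared residuals are stored in model2$mse, so we can translate the final value into a typical prediction error in pH units by taking its square root - a quick sketch:

sqrt(model2$mse[300])   # root mean squared error after all 300 trees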

The output of importance() also looks different now. Let’s have a look at “%IncMSE”, which indicates how much the mean squared error increases when the values of a variable are randomly permuted (like the mean decrease in accuracy above, it is scaled by its standard deviation, despite the percent sign in its name). “Fixed acidity” is by far the most important variable for predicting pH:

importance(model2)
##                       %IncMSE IncNodePurity
## fixed.acidity        42.78064     9.4233419
## volatile.acidity     22.78430     2.0061933
## citric.acid          26.32449     4.6587501
## residual.sugar       16.65268     1.3479318
## chlorides            25.05561     2.4352478
## free.sulfur.dioxide  18.96410     1.3812376
## total.sulfur.dioxide 24.03107     1.7119907
## density              29.67010     2.6809792
## sulphates            25.18505     2.1780416
## alcohol              27.80584     2.0459861
## taste                 7.61224     0.3425787
varImpPlot(model2, type = 1, sort = TRUE, n.var = 11, main = "Variable Importance")

Let us now predict pH in the testing data set using the function predict(). The result is a vector of predicted pH values. Let’s plot them against the observed values:

prediction2 <- predict(model2, newdata = testing)
plot(prediction2 ~ testing$pH, xlab = "observed", ylab = "predicted")
abline(0,1)

We can no longer compare observed and predicted values with a confusion matrix and misclassification counts, because the response variable pH is continuous. Instead, to assess the model quality we can, for example, calculate the R²-value, which indicates the proportion of the observed variation that the model reproduces. Here, it is about 78%, which is quite good!

rsquare <- cor(prediction2, testing$pH)^2
rsquare
## [1] 0.7829907
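
As a complement to R², we can compute the root mean squared error of the test predictions, which is expressed in pH units - a small sketch using the objects from above:

sqrt(mean((prediction2 - testing$pH)^2))   # typical prediction error on the testing data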

Finally, we want to examine the marginal effect of the predictor variables on the response variable. We can do so by creating a partial dependence plot with the function partialPlot(). Let’s examine “fixed acidity”, which was the most important variable. We see that low values of fixed acidity are associated with high pH, and vice versa:

partialPlot(model2, pred.data = training, x.var = fixed.acidity)
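
You can produce the same kind of plot for any other predictor to compare marginal effects - for example for “density”, the second most important variable in model2 (a short sketch; partialPlot() also accepts the variable name as a character string):

partialPlot(model2, pred.data = training, x.var = "density")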

Congratulations, you made it to the end of the first script! You can now go on with script 02, or solve the exercises 01 to 05.