Welcome to the first script of Module 5! Here, we will use a data set about wine to run some first Random Forest models. You will learn how to split a data set into training and testing subsets, fit Random Forest models for classification and regression, evaluate model performance with confusion matrices, error rates and R², and interpret variable importance measures and partial dependence plots.
The data and code used in this script are taken from https://www.simplilearn.com/tutorials/data-science-tutorial/random-forest-in-r
Please read the following code instructions carefully, try to understand the code that follows each instruction, execute it, and see what happens. Do not hesitate to change the code or to write your own code, and execute that as well!
Please consult the following websites for more explanations about Random Forests: https://www.listendata.com/2014/11/random-forest-with-r.html and https://www.edureka.co/blog/random-forest-classifier
As usual, we start by loading the packages we need, and by defining our working directory:
library(randomForest)
library(ggplot2)
dir <- "C:/Users/Max/Desktop/IAMO_neu/eLearning/Module_5_Random_Forests/M05_Part01_Wine/"
Import the data set about wine. It contains several chemical parameters and a taste rating for 1599 red wine samples. “Taste” is a categorical (= factor) variable; all other variables are continuous (= numeric). For 63 samples the taste of the wine was rated “bad”, for 217 “good”, and for 1319 “normal”:
data <- read.csv(paste0(dir, "wine_taste.csv"))
data$taste <- as.factor(data$taste)
head(data)
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | taste |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | normal |
| 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | normal |
| 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | normal |
| 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | normal |
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | normal |
| 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | normal |
str(data)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ taste : Factor w/ 3 levels "bad","good","normal": 3 3 3 3 3 3 3 2 2 3 ...
table(data$taste)
##
## bad good normal
## 63 217 1319
barplot(table(data$taste), main="Taste of Red Wine", xlab="taste", ylab="number of samples")
We can visually examine the relationships between some parameters, e.g. with ggplot. You will notice that the three “taste” groups cannot be easily distinguished by looking at just two chemical parameters - that is why we need a random forest!
ggplot(data, aes(fixed.acidity, volatile.acidity)) +
  geom_point(aes(color = taste))
ggplot(data, aes(alcohol, pH)) +
  geom_point(aes(color = taste))
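If you would like to scan several parameter pairs at once, a scatterplot matrix in base R is a quick alternative (a minimal sketch; the four variables chosen here are just an example):
# Scatterplot matrix of four chemical parameters, points colored by taste class
# (the factor levels "bad", "good", "normal" are mapped to colors 1 to 3)
pairs(data[, c("alcohol", "pH", "fixed.acidity", "volatile.acidity")],
      col = as.integer(data$taste), pch = 16)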
We will first use a random forest model for classification, i.e. we will predict the taste of wine with the chemical parameters. That means that “taste” is the response variable, and all other variables are predictor variables.
For the model, we have to split our data set into two parts: a training and a testing data set. We randomly select 80% of all 1599 samples and assign them to the training data set. This is the data that trains the model, i.e. based on these observations the random forest learns which combinations of chemical parameters lead to which taste. We use the remaining 20% of the samples as the testing data set: we apply the trained model to it, i.e. we predict the taste of its samples based on the rules that the model previously identified in the training data. We then compare the predicted taste to the actual taste in the testing data set - the higher the agreement between predicted and true taste, the more reliable our model is.
Let us now divide our data set into training and testing data, using the function sample(). Each time this function is executed, it draws a different random selection of our data, so re-running the script would produce different results, making it difficult to compare outcomes when other parameters are changed. To avoid that, we define the random initialization point with set.seed(), which makes sure that the same samples are assigned to “training” and “testing” the next time:
set.seed(123)
subsample <- sample(nrow(data), 0.8 * nrow(data))
training <- data[subsample, ]
testing <- data[-subsample, ]
NROW(training)
## [1] 1279
NROW(testing)
## [1] 320
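Since the split is random, it is a good idea to check that all three taste classes are present in both subsets (the exact counts depend on the seed):
# Class counts in the training and testing subsets
table(training$taste)
table(testing$taste)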
Now we can already use the function randomForest(). In this function, we define the data set we use - this is the object “training” - and the response variable that the model is supposed to predict - this is “taste”. The formula notation “~ .” simply means that “taste” should be predicted by all other variables contained in “training”. The hyper-parameters “ntree” and “mtry” represent the number of decision trees to be grown and the number of variables that is randomly drawn from all predictor variables to be available for prediction at each node in a tree. The resulting object “model1” gives us a summary of the formula and the hyper-parameters we defined. It also contains an estimate of the out-of-bag (OOB) error, i.e. the rate of misclassifications across all taste classes, and the confusion matrix:
model1 <- randomForest(taste~., data = training, ntree = 300, mtry = 5, importance = TRUE)
model1
##
## Call:
## randomForest(formula = taste ~ ., data = training, ntree = 300, mtry = 5, importance = TRUE)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 13.6%
## Confusion matrix:
## bad good normal class.error
## bad 0 1 51 1.00000000
## good 0 105 83 0.44148936
## normal 3 36 1000 0.03753609
Let’s have a closer look at the confusion matrix. The row names represent the true/observed taste, and the columns represent the taste classes that the model assigns. For example, 51 samples that actually have a bad taste are classified as having a normal taste, and 36 samples that have a normal taste were classified as having a good taste. The classification error is the share of misclassifications within each taste class - 100% of all samples with bad taste were misclassified, but only about 3.75% of the samples with a normal taste were misclassified:
model1$confusion
## bad good normal class.error
## bad 0 1 51 1.00000000
## good 0 105 83 0.44148936
## normal 3 36 1000 0.03753609
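As a sanity check, we can recompute the OOB error rate from the confusion matrix ourselves: the diagonal holds the correctly classified samples, so one minus the diagonal's share of the total reproduces the 13.6% reported above:
# Columns 1-3 of model1$confusion hold the counts; column 4 is class.error
cm <- model1$confusion[, 1:3]
1 - sum(diag(cm)) / sum(cm)  # about 0.136, the OOB error rate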
How do we know whether we chose a good number of trees? The function plot() automatically plots the error trajectory of each taste class and the out-of-bag error as a function of the number of trees. We see that the curves level off somewhere between 50 and 100 trees, so we could safely reduce the number of trees to 100:
plot(model1)
legend("top", colnames(model1$err.rate), col=1:4, cex=0.8, fill=1:4)
To find out whether “mtry” has an effect on the model performance, let's iterate over a range of values that mtry can take (here 2 to 11; with 11 predictor variables, 11 is the maximum). We see that the OOB error hardly changes at all:
OOB_vec <- rep(NA, 11)
for (i in 2:11) {
  model_temp <- randomForest(taste ~ ., data = training, ntree = 100, mtry = i)
  OOB_vec[i] <- model_temp$err.rate[100, 1]  # column 1 holds the OOB error after 100 trees
}
OOB_vec
## [1] NA 0.1391712 0.1383894 0.1368256 0.1391712 0.1383894 0.1446443
## [8] 0.1383894 0.1493354 0.1469898 0.1383894
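Plotting the stored OOB errors against mtry makes this easier to see than the raw vector (a small sketch; the exact values vary between runs because no new seed is set inside the loop):
# OOB error as a function of mtry; index 1 stays NA because the loop starts at 2
plot(2:11, OOB_vec[2:11], type = "b",
     xlab = "mtry", ylab = "OOB error", main = "OOB error vs. mtry")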
We can plot the importance of all variables in the model with the function varImpPlot(). We focus on the mean decrease in accuracy here, according to which the most important variable for taste is the alcohol content. Its value of about 36 means that randomly permuting “alcohol” degrades the model's predictions more than permuting any other variable (note that importance() scales this measure by its standard deviation by default, so it is not a raw count of additional misclassifications):
importance(model1)
## bad good normal MeanDecreaseAccuracy
## fixed.acidity -3.645110 12.09771 14.642308 18.61179
## volatile.acidity 8.648063 28.41414 8.590038 22.46140
## citric.acid 1.565737 17.30854 12.051312 20.39269
## residual.sugar 2.271970 16.22289 16.734898 21.70459
## chlorides -2.269836 14.60409 17.142164 21.32082
## free.sulfur.dioxide 2.461333 11.80994 15.788476 19.50181
## total.sulfur.dioxide 2.434719 22.15200 16.829435 24.54720
## density -1.828830 14.19200 17.018732 21.51060
## pH 1.560722 11.41297 15.100471 19.81527
## sulphates 5.772700 33.33643 12.125685 29.98570
## alcohol 1.567490 40.78602 20.701783 36.64469
## MeanDecreaseGini
## fixed.acidity 26.60980
## volatile.acidity 47.72406
## citric.acid 33.39882
## residual.sugar 32.41572
## chlorides 31.48853
## free.sulfur.dioxide 25.74612
## total.sulfur.dioxide 36.60599
## density 33.91535
## pH 27.70876
## sulphates 45.53134
## alcohol 62.88105
varImpPlot(model1, type = 1, sort = TRUE, n.var = 11, main = "Variable Importance")
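If you prefer a sorted table over the plot, you can also extract the mean decrease in accuracy directly from the output of importance():
# Mean decrease in accuracy, sorted from most to least important variable
sort(importance(model1)[, "MeanDecreaseAccuracy"], decreasing = TRUE)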
We now apply our model to the testing data set using the function predict(). The result is a vector called “prediction1”, whose names correspond to the original row numbers in “data”. We can convert the predictions into a confusion matrix using the function table(), where “testing$taste” is the true/observed taste of all samples in “testing”:
prediction1 <- predict(model1, newdata = testing)
head(prediction1)
## 3 7 15 22 23 27
## normal normal normal normal normal normal
## Levels: bad good normal
table(prediction1, testing$taste)
##
## prediction1 bad good normal
## bad 0 0 2
## good 0 18 10
## normal 11 11 268
To calculate the share of correct classifications in “prediction1”, we simply divide the number of correctly classified samples in “testing” by the total number of samples in “testing”. The result means that the model trained on “training” correctly classified about 89% of all samples in “testing”, which is not too bad!
sum(prediction1==testing$taste) / nrow(testing)
## [1] 0.89375
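Equivalently, the same number can be derived from the confusion matrix we just built, because its diagonal contains the correct classifications:
# Overall accuracy in the testing data, derived from the confusion matrix
cm_test <- table(prediction1, testing$taste)
sum(diag(cm_test)) / sum(cm_test)  # 0.89375, as above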
Let us now change the response variable and predict, for example, “pH” instead of “taste”. We change the model formula accordingly:
model2 <- randomForest(pH~., data = training, ntree = 300, mtry = 2, importance = TRUE)
This model automatically performs a regression instead of a classification, because it detects that the response variable is continuous rather than categorical. The mean of squared residuals and the percentage of variance explained by the model are two indicators of model performance and describe how well the model fits the data. Here, the mean of squared residuals is about 0.0065; taking the square root, the model's predicted pH values typically deviate from the observed values by about 0.08. The model explains about 73% of the variance in the pH data, which is not too bad.
model2
##
## Call:
## randomForest(formula = pH ~ ., data = training, ntree = 300, mtry = 2, importance = TRUE)
## Type of random forest: regression
## Number of trees: 300
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.006497377
## % Var explained: 73.52
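Both quantities are also stored per tree in the model object, so we can verify the square-root relationship between the mean of squared residuals and the typical prediction error mentioned above:
# Performance measures after all 300 trees
model2$mse[300]        # mean of squared residuals, about 0.0065
sqrt(model2$mse[300])  # typical deviation on the pH scale, about 0.08
model2$rsq[300]        # share of explained variance, about 0.735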
The output of importance() also looks different now. Let's have a look at “%IncMSE”, which indicates how much the mean squared error increases when the values of a variable are randomly permuted (like the mean decrease in accuracy above, it is scaled by its standard deviation by default, so it is not a literal percentage). “Fixed acidity” is by far the most important variable for predicting pH:
importance(model2)
## %IncMSE IncNodePurity
## fixed.acidity 42.78064 9.4233419
## volatile.acidity 22.78430 2.0061933
## citric.acid 26.32449 4.6587501
## residual.sugar 16.65268 1.3479318
## chlorides 25.05561 2.4352478
## free.sulfur.dioxide 18.96410 1.3812376
## total.sulfur.dioxide 24.03107 1.7119907
## density 29.67010 2.6809792
## sulphates 25.18505 2.1780416
## alcohol 27.80584 2.0459861
## taste 7.61224 0.3425787
varImpPlot(model2, type = 1, sort = TRUE, n.var = 11, main = "Variable Importance")
Let us now predict pH in the testing data set using the function predict(). The result is a vector of predicted pH values. Let's plot them against the observed values:
prediction2 <- predict(model2, newdata = testing)
plot(prediction2 ~ testing$pH, xlab = "observed", ylab = "predicted")
abline(0,1)
We can no longer compare observed and predicted values with a confusion matrix and misclassification counts, because the response variable pH is continuous. Instead, to assess the model quality we can, for example, calculate the R² value, which indicates the share of observed variation that the model reproduces. Here, it is about 78%, which is quite good!
rsquare <- cor(prediction2, testing$pH)^2
rsquare
## [1] 0.7829907
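Besides R², the root mean squared error (RMSE) on the testing data is another common summary; unlike R², it is expressed on the same scale as pH itself:
# Root mean squared error between predicted and observed pH in the testing set
sqrt(mean((prediction2 - testing$pH)^2))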
Finally, we want to examine the marginal effect of the predictor variables on the response variable. We can do so by creating a partial dependence plot with the function partialPlot(). Let's examine “fixed acidity”, which was the most important variable. We see that low values of fixed acidity are associated with high pH, and vice versa:
partialPlot(model2, pred.data = training, x.var = fixed.acidity)
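partialPlot() also works for classification models: the argument which.class selects the taste class whose predicted probability (on a logit-like scale) is shown. As a sketch, the marginal effect of alcohol on the class “good” in our first model could be examined like this:
# Marginal effect of alcohol on the probability of tasting "good" (model1)
partialPlot(model1, pred.data = training, x.var = alcohol, which.class = "good")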
Congratulations, you made it to the end of the first script! You can now go on with script 02, or solve the exercises 01 to 05.