This project is based on the book Machine Learning with R by Brett Lantz, Chapter 6.
A link to the book: https://bit.ly/3gsf2e0
This project is for educational purposes only.
The aim is to develop a wine rating model.
We will use rpart (recursive partitioning) to build regression trees, rpart.plot to plot them, and later the Cubist package for model trees.
library(rpart)
library(rpart.plot)
library(Cubist)
## Loading required package: lattice
The data was donated to the UCI Machine Learning Repository; we will use the whitewines dataset.
wine <- read.csv("whitewines.csv")
# Exploring the structure of the data frame
str(wine)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 6.7 5.7 5.9 5.3 6.4 7 7.9 6.6 7 6.5 ...
## $ volatile.acidity : num 0.62 0.22 0.19 0.47 0.29 0.14 0.12 0.38 0.16 0.37 ...
## $ citric.acid : num 0.24 0.2 0.26 0.1 0.21 0.41 0.49 0.28 0.3 0.33 ...
## $ residual.sugar : num 1.1 16 7.4 1.3 9.65 0.9 5.2 2.8 2.6 3.9 ...
## $ chlorides : num 0.039 0.044 0.034 0.036 0.041 0.037 0.049 0.043 0.043 0.027 ...
## $ free.sulfur.dioxide : num 6 41 33 11 36 22 33 17 34 40 ...
## $ total.sulfur.dioxide: num 62 113 123 74 119 95 152 67 90 130 ...
## $ density : num 0.993 0.999 0.995 0.991 0.993 ...
## $ pH : num 3.41 3.22 3.49 3.48 2.99 3.25 3.18 3.21 2.88 3.28 ...
## $ sulphates : num 0.32 0.46 0.42 0.54 0.34 0.43 0.47 0.47 0.47 0.39 ...
## $ alcohol : num 10.4 8.9 10.1 11.2 10.9 ...
## $ quality : int 5 6 6 4 6 6 6 6 6 7 ...
Examine the distribution of the wine quality:
hist(wine$quality)
The wine quality appears to follow a roughly normal distribution.
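To check this beyond the histogram, we can also tabulate the exact counts per rating:

table(wine$quality)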
# Dividing the dataset into training and testing sets: 75% training and 25% testing
wine_train <- wine[1:3750, ]
wine_test <- wine[3751:4898, ]
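A sequential split like this is only sound because the rows of this dataset are already in random order. If they were not, a randomized split would be safer; a minimal sketch (the seed value is arbitrary):

set.seed(123)  # arbitrary seed, for reproducibility
train_idx <- sample(nrow(wine), floor(0.75 * nrow(wine)))
wine_train <- wine[train_idx, ]
wine_test <- wine[-train_idx, ]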
We will build the model using the rpart() function:
m.rpart <- rpart(quality ~ . , data = wine_train)
m.rpart
## n= 3750
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3750 2945.53200 5.870933
## 2) alcohol< 10.85 2372 1418.86100 5.604975
## 4) volatile.acidity>=0.2275 1611 821.30730 5.432030
## 8) volatile.acidity>=0.3025 688 278.97670 5.255814 *
## 9) volatile.acidity< 0.3025 923 505.04230 5.563380 *
## 5) volatile.acidity< 0.2275 761 447.36400 5.971091 *
## 3) alcohol>=10.85 1378 1070.08200 6.328737
## 6) free.sulfur.dioxide< 10.5 84 95.55952 5.369048 *
## 7) free.sulfur.dioxide>=10.5 1294 892.13600 6.391036
## 14) alcohol< 11.76667 629 430.11130 6.173291
## 28) volatile.acidity>=0.465 11 10.72727 4.545455 *
## 29) volatile.acidity< 0.465 618 389.71680 6.202265 *
## 15) alcohol>=11.76667 665 403.99400 6.596992 *
We can see that the root node contains all 3750 training examples, which are then split into nodes 2 and 3 based on alcohol percentage. Consider node 5, where alcohol < 10.85 and volatile.acidity < 0.2275: the predicted quality for wines reaching this node is its yval, about 5.97.
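Because an rpart regression leaf predicts the mean outcome of the training cases that reach it, we can verify node 5's yval by hand (a quick sanity check):

# node 5: alcohol < 10.85 and volatile.acidity < 0.2275
with(wine_train, mean(quality[alcohol < 10.85 & volatile.acidity < 0.2275]))
# expect roughly 5.971091, the yval reported for node 5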
rpart.plot(m.rpart, digits = 3)
We can enhance the plot by adjusting a few parameters: fallen.leaves = TRUE places the leaf nodes at the bottom of the diagram, while type and extra control how the nodes are labeled.
rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)
## Step 4 - Evaluating model performance
p.rpart <- predict(m.rpart, wine_test)
summary(p.rpart)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.545 5.563 5.971 5.893 6.202 6.597
summary(wine_test$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.901 6.000 9.000
We can see that the extremes are not predicted well, but between the first and third quartiles (the middle 50% of the data) the predictions are close to the actual values.
We will examine the correlation between the actual quality values in the test set and the predicted values:
cor(p.rpart, wine_test$quality)
## [1] 0.5369525
A correlation of 0.54 is acceptable. Correlation measures how strongly the predictions are related to the true values; it is not a measure of how far off the predictions were from them. To measure that, we will create a mean absolute error (MAE) function, which averages the absolute differences between actual and predicted values:
MAE <- function(actual, predicted) {
  # mean absolute error: average absolute difference between actual and predicted
  mean(abs(actual - predicted))
}
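A quick sanity check of the helper also illustrates the point above, that correlation and error measure different things (illustrative values):

truth <- c(4, 5, 6, 7)
preds <- truth + 1  # perfectly correlated, yet always one point too high
cor(preds, truth)   # 1
MAE(truth, preds)   # 1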
Then we will calculate the MAE of the predictions:
MAE(p.rpart, wine_test$quality)
## [1] 0.5872652
This suggests that, on average, the difference between our model's predictions and the true quality score was about 0.59. On the 0-to-10 quality scale, the model seems to perform reasonably well.
For a baseline comparison, the mean quality rating in the training data is:
mean(wine_train$quality)
## [1] 5.870933
Comparing a naive predictor that always guesses the training-set mean against the actual test-set quality (R recycles the single mean value across the whole vector):
MAE(mean(wine_train$quality), wine_test$quality)
## [1] 0.6719238
We can see that the regression tree's MAE (0.59) beats the MAE of this mean-only baseline (0.67).
Next, we will apply the model tree algorithm.
Model trees extend regression trees by replacing the leaf nodes with regression models, which often results in more accurate predictions.
We will use the cubist() function from the Cubist package.
As quality is column 12 in the dataset, we exclude it from the x argument of cubist():
m.cubist <- cubist(x = wine_train[-12], y = wine_train$quality)
# Examine the model
m.cubist
##
## Call:
## cubist.default(x = wine_train[-12], y = wine_train$quality)
##
## Number of samples: 3750
## Number of predictors: 11
##
## Number of committees: 1
## Number of rules: 25
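One option worth knowing about: cubist() can build a boosting-style ensemble of model trees via its committees argument, which often improves accuracy at the cost of interpretability. We keep the default single committee here, but a sketch of the alternative (10 is an arbitrary choice):

m.cubist10 <- cubist(x = wine_train[-12], y = wine_train$quality, committees = 10)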
The summary() function displays all 25 rules; only the first rule of the output is shown here:
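summary(m.cubist)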
Call: cubist.default(x = wine_train[-12], y = wine_train$quality)
Target attribute `outcome'
Read 3750 cases (12 attributes) from undefined.data
Model:
Rule 1: [21 cases, mean 5.0, range 4 to 6, est err 0.5]
if
free.sulfur.dioxide > 30
total.sulfur.dioxide > 195
total.sulfur.dioxide <= 235
sulphates > 0.64
alcohol > 9.1
then
outcome = 573.6 + 0.0478 total.sulfur.dioxide - 573 density
- 0.788 alcohol + 0.186 residual.sugar - 4.73 volatile.acidity
We can see the if conditions, followed by the linear model that applies after the then statement: wines matching the conditions get a prediction from this regression equation rather than from a single leaf value.
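To make the rule concrete, here is a quick sketch plugging made-up predictor values (chosen to satisfy Rule 1's conditions) into its linear model:

# hypothetical wine satisfying Rule 1's if-conditions
tsd <- 200    # total.sulfur.dioxide, between 195 and 235
dens <- 0.994 # density
alc <- 9.5    # alcohol, above 9.1
rs <- 5.0     # residual.sugar
va <- 0.30    # volatile.acidity
573.6 + 0.0478 * tsd - 573 * dens - 0.788 * alc + 0.186 * rs - 4.73 * va
# about 5.6, inside the rule's observed range of 4 to 6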
# Examine the performance of the model
p.cubist <- predict(m.cubist, wine_test)
summary(p.cubist)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.677 5.416 5.906 5.848 6.238 7.393
Examine the correlation between predicted and actual values
cor(p.cubist, wine_test$quality)
## [1] 0.6201015
The correlation of 0.62 is noticeably better than the 0.54 achieved by the regression tree.
Measure the MAE of the enhanced model:
MAE(wine_test$quality, p.cubist)
## [1] 0.5339725
The MAE of about 0.53 is also lower than the regression tree's 0.59.
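Collecting the numbers from the outputs above makes the comparison easy to read at a glance:

# recap of both models' test-set performance (values taken from the outputs above)
data.frame(model = c("regression tree (rpart)", "model tree (cubist)"),
           correlation = c(0.537, 0.620),
           MAE = c(0.587, 0.534))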
Regression trees and model trees are powerful tools to have in your belt for solving regression problems.