Introduction

This project is based on the book Machine Learning with R by Brett Lantz, Chapter 6.

A link to the book: https://bit.ly/3gsf2e0

This project is for educational purposes only.

The aim is to develop a wine quality rating model.

Required packages

We will use rpart (recursive partitioning) to build regression trees, rpart.plot to visualize them, and Cubist to build model trees.
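
If the packages are not already installed, a one-time setup is needed:

# Install the required packages (one-time setup)
install.packages(c("rpart", "rpart.plot", "Cubist"))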

library(rpart)
library(rpart.plot)
library(Cubist)
## Loading required package: lattice

Step 1 - collecting data

The data was donated to the UCI Machine Learning Repository; we will use the whitewines dataset.

Step 2 - exploring and preparing the data

wine <- read.csv("whitewines.csv")

# Exploring the structure of the data frame
str(wine)
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  6.7 5.7 5.9 5.3 6.4 7 7.9 6.6 7 6.5 ...
##  $ volatile.acidity    : num  0.62 0.22 0.19 0.47 0.29 0.14 0.12 0.38 0.16 0.37 ...
##  $ citric.acid         : num  0.24 0.2 0.26 0.1 0.21 0.41 0.49 0.28 0.3 0.33 ...
##  $ residual.sugar      : num  1.1 16 7.4 1.3 9.65 0.9 5.2 2.8 2.6 3.9 ...
##  $ chlorides           : num  0.039 0.044 0.034 0.036 0.041 0.037 0.049 0.043 0.043 0.027 ...
##  $ free.sulfur.dioxide : num  6 41 33 11 36 22 33 17 34 40 ...
##  $ total.sulfur.dioxide: num  62 113 123 74 119 95 152 67 90 130 ...
##  $ density             : num  0.993 0.999 0.995 0.991 0.993 ...
##  $ pH                  : num  3.41 3.22 3.49 3.48 2.99 3.25 3.18 3.21 2.88 3.28 ...
##  $ sulphates           : num  0.32 0.46 0.42 0.54 0.34 0.43 0.47 0.47 0.47 0.39 ...
##  $ alcohol             : num  10.4 8.9 10.1 11.2 10.9 ...
##  $ quality             : int  5 6 6 4 6 6 6 6 6 7 ...

Examine the distribution of the wine quality

hist(wine$quality)

The wine quality appears to follow a roughly normal distribution, centered around a score of 6.
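
As a quick numeric check (a small addition, not part of the book's walkthrough), we can tabulate the scores:

# Count the number of wines at each quality score
table(wine$quality)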

# Dividing the dataset into training and testing sets: roughly 75% training and 25% testing
wine_train <- wine[1:3750, ]
wine_test <- wine[3751:4898, ]
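
This sequential split is safe here because the rows of whitewines are already in random order. If they were not, a random split would be more appropriate (a sketch, with an arbitrary seed):

# Random 75/25 split, in case the rows were ordered
set.seed(123)  # arbitrary seed, for reproducibility
train_idx <- sample(nrow(wine), floor(0.75 * nrow(wine)))
wine_train <- wine[train_idx, ]
wine_test  <- wine[-train_idx, ]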

Step 3 - training a model on the data

We will build the model using the rpart() function:

m.rpart <- rpart(quality ~ . , data = wine_train)

m.rpart
## n= 3750 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 3750 2945.53200 5.870933  
##    2) alcohol< 10.85 2372 1418.86100 5.604975  
##      4) volatile.acidity>=0.2275 1611  821.30730 5.432030  
##        8) volatile.acidity>=0.3025 688  278.97670 5.255814 *
##        9) volatile.acidity< 0.3025 923  505.04230 5.563380 *
##      5) volatile.acidity< 0.2275 761  447.36400 5.971091 *
##    3) alcohol>=10.85 1378 1070.08200 6.328737  
##      6) free.sulfur.dioxide< 10.5 84   95.55952 5.369048 *
##      7) free.sulfur.dioxide>=10.5 1294  892.13600 6.391036  
##       14) alcohol< 11.76667 629  430.11130 6.173291  
##         28) volatile.acidity>=0.465 11   10.72727 4.545455 *
##         29) volatile.acidity< 0.465 618  389.71680 6.202265 *
##       15) alcohol>=11.76667 665  403.99400 6.596992 *

We can see that the root node contains all 3,750 training examples, which are then split into nodes 2 and 3 based on the alcohol percentage. Node 5, for example, covers wines where alcohol < 10.85 and volatile.acidity < 0.2275; it is a terminal node (marked with *), so its yval of about 5.97 is the predicted quality.
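
We can verify this path with a quick sketch (the feature values below are made up to satisfy node 5's conditions):

# A hypothetical wine that should land in node 5
new_wine <- wine_train[1, ]        # copy a real row so all columns exist
new_wine$alcohol <- 10.0           # < 10.85  -> node 2
new_wine$volatile.acidity <- 0.20  # < 0.2275 -> node 5 (terminal)
predict(m.rpart, new_wine)         # expected: node 5's yval, about 5.97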

Visualizing decision trees

rpart.plot(m.rpart, digits = 3)

We can enhance the plot with a few more parameters: fallen.leaves = TRUE forces the leaf nodes to the bottom of the plot, type = 3 labels the branches with the split conditions, and extra = 101 displays the number and percentage of observations in each node.

rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)

Step 4 - evaluating model performance

p.rpart <- predict(m.rpart, wine_test)

summary(p.rpart)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.545   5.563   5.971   5.893   6.202   6.597
summary(wine_test$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.901   6.000   9.000

We can see that the extremes are not predicted well: the predictions range only from 4.5 to 6.6, while the true scores range from 3 to 9. Between the 1st and 3rd quartiles (the middle 50% of the data), however, the predictions are close to the actual values.

We will examine the correlation between the predicted values and the actual values in the test set:

cor(p.rpart, wine_test$quality)
## [1] 0.5369525

A correlation of 0.54 is acceptable. Keep in mind that correlation measures how strongly the predictions are related to the true values; it is not a measure of how far off the predictions are from the true values.

Measuring performance with the mean absolute error

The MAE is the average of the absolute differences between the predicted and actual values. We will define it as a function:

# Mean absolute error: the average absolute difference between two vectors
MAE <- function(actual, predicted) {
    mean(abs(actual - predicted))
}

Then we calculate the MAE of the predictions:

MAE(p.rpart, wine_test$quality)
## [1] 0.5872652

This suggests that, on average, the difference between our model's predictions and the true quality score was about 0.59. On a quality scale from zero to 10, this indicates the model is doing fairly well.

As a baseline, the mean quality rating in the training data is:

mean(wine_train$quality)
## [1] 5.870933

If we predicted this mean for every wine in the test set, the MAE would be:

MAE(mean(wine_train$quality), wine_test$quality)
## [1] 0.6719238

We can see that the decision tree's MAE (0.59) is better than the MAE of the imputed mean (0.67).

Step 5 - improving model performance

We will apply the model tree algorithm.

Model trees extend regression trees by replacing the leaf nodes with regression models, which often yields more accurate predictions.

We will use the cubist() function from the Cubist package.

As quality is column 12 in the dataset, we exclude it from the x argument and pass it separately as y:

m.cubist <- cubist(x = wine_train[-12], y = wine_train$quality)

# Examine the model

m.cubist
## 
## Call:
## cubist.default(x = wine_train[-12], y = wine_train$quality)
## 
## Number of samples: 3750 
## Number of predictors: 11 
## 
## Number of committees: 1 
## Number of rules: 25

The summary() output displays all 25 rules; only the first rule is shown here:

summary(m.cubist)
## 
## Call:
## cubist.default(x = wine_train[-12], y = wine_train$quality)
## 
## Cubist [Release 2.07 GPL Edition]  Thu Aug 06 01:52:53 2020
## 
## Target attribute `outcome'
## 
## Read 3750 cases (12 attributes) from undefined.data
## 
## Model:
## 
##   Rule 1: [21 cases, mean 5.0, range 4 to 6, est err 0.5]
## 
##     if
##         free.sulfur.dioxide > 30
##         total.sulfur.dioxide > 195
##         total.sulfur.dioxide <= 235
##         sulphates > 0.64
##         alcohol > 9.1
##     then
##         outcome = 573.6 + 0.0478 total.sulfur.dioxide - 573 density
##                   - 0.788 alcohol + 0.186 residual.sugar
##                   - 4.73 volatile.acidity

We can see the if conditions that route a wine to this rule, followed by the linear regression model after the then statement, which is used to compute the predicted quality.
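
To make the rule concrete, here is a sketch that plugs made-up feature values (chosen to satisfy Rule 1's if conditions) into the rule's linear model:

# Hypothetical wine satisfying Rule 1's conditions (all values made up)
fsd   <- 35     # free.sulfur.dioxide  > 30
tsd   <- 200    # total.sulfur.dioxide > 195 and <= 235
sulph <- 0.70   # sulphates            > 0.64
alc   <- 9.5    # alcohol              > 9.1
dens  <- 0.995
sugar <- 5.0
va    <- 0.30
# Rule 1's linear model
573.6 + 0.0478 * tsd - 573 * dens - 0.788 * alc + 0.186 * sugar - 4.73 * va

For these values the rule's model returns about 5.05, consistent with the rule's stated range of 4 to 6.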

# Examine the performance of the model

p.cubist <- predict(m.cubist, wine_test)

summary(p.cubist)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.677   5.416   5.906   5.848   6.238   7.393

Examine the correlation between predicted and actual values

cor(p.cubist, wine_test$quality)
## [1] 0.6201015

This correlation (0.62) is better than that of the earlier regression tree model (0.54).

Measure the MAE for the enhanced model:

MAE(wine_test$quality, p.cubist)
## [1] 0.5339725

The error for this model is lower than the error for the earlier model.
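
For convenience, we can gather the results into a small side-by-side comparison (reusing the objects created above):

# Compare the two models on the test set
data.frame(
    model       = c("regression tree", "model tree"),
    correlation = c(cor(p.rpart, wine_test$quality),
                    cor(p.cubist, wine_test$quality)),
    MAE         = c(MAE(wine_test$quality, p.rpart),
                    MAE(wine_test$quality, p.cubist))
)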

Summary

Regression trees and model trees are powerful tools to have in your belt for solving regression problems. Here, the Cubist model tree improved on the regression tree in both correlation (0.62 vs. 0.54) and MAE (0.53 vs. 0.59).