For a fully functional html version, please visit http://www.rpubs.com/jasonchanhku/wine.
Libraries Used
library(rpart) #recursive partitioning for regression trees
library(plotly) #interactive data visualization
library(rpart.plot) #plotting rpart trees
library(RColorBrewer) #color palettes for plots
library(rattle) #fancyRpartPlot() for prettier trees
library(RWeka) #M5P model trees
Objective
This project aims to use regression trees and model trees to create a system capable of mimicking expert ratings of wine.
Perhaps more importantly, the system will not suffer from the human elements of tasting, such as the rater’s mood or palate fatigue. Computer-aided wine testing may therefore result in a better product as well as more objective, consistent, and fair ratings.
Step 1: Data Exploration
The data that will be analysed consists only of white wines produced in Portugal. The dataset was obtained from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml
Data Preview
The dataset consists of 4,898 observations with 11 features and one target variable, quality. Note that the table below shows only the first few rows.
#load the dataset
wine <- read.csv(file = "Machine-Learning-with-R-datasets-master/whitewines.csv")
#table preview
knitr::kable(head(wine))
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45 | 170 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
| 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14 | 132 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
| 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
The structure of the dataset, obtained using str(), is shown below:
str(wine)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Features
From the dataset, the following target variable and features are identified:
Target Variable
- Quality: This is the target variable the project aims to mimic, given the remaining features.
Features
Bear in mind that because decision trees are used for this analysis, feature selection is done automatically, so there is no need for manual feature selection. The features are simply the measured characteristics of the wine, such as:
- acidity
- sugar content
- chlorides
- sulfur
- alcohol
- pH
- density
Data Visualization
The distribution of the target variable needs to be examined so that any extreme values or skew can be taken into account when modeling.
Quality Histogram
plot_ly(data = wine, x =~quality, type = "histogram")
Histogram Insights
- Wine quality appears to follow a fairly normal, bell-shaped distribution
- This implies most wines are of average quality and few are very good or very bad; the exact counts can be tabulated as sketched below
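To quantify the histogram, the frequency of each rating can be printed. This is a minimal base-R sketch using the wine data frame loaded earlier:
#frequency count of each quality rating
table(wine$quality)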
Boxplots
Below are the boxplots that revealed significant patterns with respect to quality:
Alcohol Content
wine2 <- wine
#the letter prefixes force the quality categories to sort in order on the plot axis
wine2$qualitychar <- ifelse(wine2$quality == 3, "a_Three",
                     ifelse(wine2$quality == 4, "b_Four",
                     ifelse(wine2$quality == 5, "c_Five",
                     ifelse(wine2$quality == 6, "d_Six",
                     ifelse(wine2$quality == 7, "e_Seven",
                     ifelse(wine2$quality == 8, "f_Eight", "g_Nine"))))))
plot_ly(data = wine2, x = ~qualitychar, y = ~alcohol, color = ~qualitychar, type = "box", colors = "Dark2")
Density
plot_ly(data = wine2, x = ~qualitychar, y = ~density, color = ~qualitychar, type = "box", colors = "Set1")
Boxplots Insights
Referring to the boxplots above, it is clear what distinguishes above-average wines:
- Higher alcohol content
- Lower density
Step 2 : Data Preparation
As a decision tree is the model used, there is no need for preprocessing such as normalization. Since the rows of the dataset are already sorted in random order, the training and test sets can be prepared by simple partitioning. A 75%/25% split is chosen.
The partition into training and test datasets is done as below:
#training set
wine_train <- wine[1:3750, ]
#test set
wine_test <- wine[3751:4898, ]
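Had the rows not already been in random order, a random split could be drawn instead. The sketch below is illustrative only (the seed value and the _rnd object names are arbitrary); the sorted split above is what the rest of the analysis uses:
#alternative: reproducible random 75%/25% split
set.seed(123)
train_idx <- sample(nrow(wine), floor(0.75 * nrow(wine)))
wine_train_rnd <- wine[train_idx, ]
wine_test_rnd <- wine[-train_idx, ]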
Step 3: Model Training (Regression Tree)
Although almost any implementation of decision trees can be used to perform regression tree modeling, the rpart (recursive partitioning) package offers the most faithful implementation of regression trees as they were described by the CART team.
The training is carried out as follows:
#building the model on training set
m.rpart <- rpart(quality ~. , data = wine_train)
m.rpart
## n= 3750
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3750 3140.06000 5.886933
## 2) alcohol< 10.85 2473 1510.66200 5.609381
## 4) volatile.acidity>=0.2425 1406 740.15080 5.402560
## 8) volatile.acidity>=0.4225 182 92.99451 4.994505 *
## 9) volatile.acidity< 0.4225 1224 612.34560 5.463235 *
## 5) volatile.acidity< 0.2425 1067 631.12090 5.881912 *
## 3) alcohol>=10.85 1277 1069.95800 6.424432
## 6) free.sulfur.dioxide< 11.5 93 99.18280 5.473118 *
## 7) free.sulfur.dioxide>=11.5 1184 879.99920 6.499155
## 14) alcohol< 11.85 611 447.38130 6.296236 *
## 15) alcohol>=11.85 573 380.63180 6.715532 *
Visualizing Decision Trees
The decision tree built by rpart() can be visualized clearly using the fancyRpartPlot() function from the rattle package.
fancyRpartPlot(m.rpart)
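For finer control over the rendering, rpart.plot() from the rpart.plot package (loaded earlier) can be used instead; the parameter values below are just one reasonable configuration:
#digits rounds the numbers, type and extra control the node labels
rpart.plot(m.rpart, digits = 3, fallen.leaves = TRUE, type = 3, extra = 101)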
Step 4: Model Evaluation
Now it is time to put the decision tree model to the test by making predictions on the test data. The summaries of the predicted and actual values are below:
Summary of Predicted Values
p.rpart <- predict(m.rpart,wine_test)
summary(p.rpart)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.995 5.463 5.882 5.999 6.296 6.716
Summary of Actual
summary(wine_test$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.848 6.000 8.000
The summaries above show that the predictions fall in a far narrower range (roughly 5.0 to 6.7) than the actual scores (3 to 8): the model fails to identify the very worst and the very best wines, though it performs reasonably in the middle of the scale.
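A quick complementary diagnostic is the correlation between the predicted and actual quality scores, which measures how well the predictions track the true ratings:
#correlation between predicted and actual quality
cor(p.rpart, wine_test$quality)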
Mean Absolute Error (MAE)
Another way to think about the model’s performance is to consider how far, on average, its prediction was from the true value. This measurement is called the mean absolute error (MAE).
The MAE function is constructed and applied as below:
MAE <- function(actual, predicted){
mean(abs(actual - predicted))
}
MAE(wine_test$quality, p.rpart)
## [1] 0.5732104
An MAE of about 0.57 means that, on average, the model's prediction differs from the true quality score by roughly 0.57 points on the 0-to-10 scale, which is acceptable; whether it is good enough can be judged against the naive benchmark sketched below.
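As a sanity check, the tree's MAE can be compared against a naive benchmark that predicts the mean training-set quality for every wine; the tree should comfortably beat this:
#baseline MAE: predict the training mean for every test wine
MAE(wine_test$quality, mean(wine_train$quality))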
Step 5: Model Improvement
To improve the model, a model tree is implemented. A model tree differs from a regression tree in that it fits a multiple linear regression model at each leaf node rather than predicting a single value, which should make it more accurate.
The model tree is implemented using the M5P algorithm from the RWeka package. The summary of the M5 model is as follows:
#building the model
m.m5p <- M5P(quality ~. , data = wine_train)
# building the predictor
p.m5p <- predict(m.m5p, wine_test)
m.m5p
## M5 pruned model tree:
## (using smoothed linear models)
##
## alcohol <= 10.85 : LM1 (2473/77.476%)
## alcohol > 10.85 :
## | free.sulfur.dioxide <= 20.5 :
## | | free.sulfur.dioxide <= 10.5 : LM2 (81/104.574%)
## | | free.sulfur.dioxide > 10.5 : LM3 (224/87.002%)
## | free.sulfur.dioxide > 20.5 : LM4 (972/84.073%)
##
## LM num: 1
## quality =
## 0.0777 * fixed.acidity
## - 2.3087 * volatile.acidity
## + 0.0732 * residual.sugar
## + 0.0022 * free.sulfur.dioxide
## - 155.0175 * density
## + 0.6462 * pH
## + 0.7923 * sulphates
## + 0.0758 * alcohol
## + 156.2102
##
## LM num: 2
## quality =
## -0.0314 * fixed.acidity
## - 0.3415 * volatile.acidity
## + 1.7929 * citric.acid
## + 0.1316 * residual.sugar
## - 0.2456 * chlorides
## + 0.1212 * free.sulfur.dioxide
## - 178.6281 * density
## + 0.054 * pH
## + 0.1392 * sulphates
## + 0.0108 * alcohol
## + 180.6069
##
## LM num: 3
## quality =
## -0.2019 * fixed.acidity
## - 2.3804 * volatile.acidity
## - 1.0851 * citric.acid
## + 0.0905 * residual.sugar
## - 0.2456 * chlorides
## + 0.0041 * free.sulfur.dioxide
## - 177.078 * density
## + 0.054 * pH
## + 0.0868 * sulphates
## + 0.0108 * alcohol
## + 183.5076
##
## LM num: 4
## quality =
## 0.0004 * fixed.acidity
## - 0.0325 * volatile.acidity
## + 0.0957 * residual.sugar
## - 5.9702 * chlorides
## + 0.0002 * free.sulfur.dioxide
## - 172.3931 * density
## + 1.0123 * pH
## + 1.1653 * sulphates
## + 0.1542 * alcohol
## + 171.6842
##
## Number of Rules : 4
Improved MAE
The MAE has improved to the following:
MAE(wine_test$quality, p.m5p)
## [1] 0.5660352
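The same diagnostics used for the regression tree can be re-run on the model tree's predictions, for example:
#spread of predictions and correlation with the true ratings
summary(p.m5p)
cor(p.m5p, wine_test$quality)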
Testing
The quality of a new white wine with the following characteristics will be estimated:
test <- data.frame(fixed.acidity = 8.5, volatile.acidity = 0.33,
                   citric.acid = 0.42, residual.sugar = 10.5,
                   chlorides = 0.065, free.sulfur.dioxide = 47,
                   total.sulfur.dioxide = 186, density = 0.9955,
                   pH = 3.10, sulphates = 0.40, alcohol = 9.9)
test
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 8.5 0.33 0.42 10.5 0.065
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 47 186 0.9955 3.1 0.4 9.9
The test data will have an estimated quality rating of:
test_pred <- predict(m.m5p, test)
test_pred
## [1] 5.730941
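Since the expert ratings are whole numbers, the estimate can be rounded to the nearest integer rating:
#round to the nearest whole rating (5.73 rounds to 6)
round(test_pred)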