For a fully functional html version, please visit http://www.rpubs.com/jasonchanhku/wine.
Libraries Used
library(rpart) #recursive partitioning for regression trees
library(plotly) #interactive data visualization
library(rpart.plot) #plotting rpart trees
library(RColorBrewer) #color palettes for plots
library(rattle) #fancyRpartPlot() for prettier trees
library(RWeka) #M5P model trees
Objective
This project aims to use regression trees and model trees to create a system capable of mimicking expert ratings of wine.
Perhaps more importantly, the system will not suffer from the human elements of tasting, such as the rater’s mood or palate fatigue. Computer-aided wine testing may therefore result in a better product as well as more objective, consistent, and fair ratings.
Step 1: Data Exploration
The data that will be analysed consists only of white wines produced in Portugal. The dataset was obtained from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml
Data Preview
The dataset consists of 4,898 observations with 11 features and one target variable, quality. Note that the table below shows only the first few rows.
#load the dataset
wine <- read.csv(file = "Machine-Learning-with-R-datasets-master/whitewines.csv")
#table preview
knitr::kable(head(wine))
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45 | 170 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
| 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14 | 132 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
| 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
The structure of the dataset, obtained using str(), is shown below:
str(wine)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Features
From the dataset, the following target variable and features are identified:
Target Variable
- Quality: This is the target variable the project aims to mimic, given the remaining features.
Features
Bear in mind that because decision trees are used for this analysis, feature selection is done automatically, so there is no need for manual feature selection. The features are simply the measured characteristics of the wine, such as:
- acidity
- sugar content
- chlorides
- sulfur
- alcohol
- pH
- density
Data Visualization
The distribution of the target variable needs to be examined so that any extreme values or skew can be taken into account when modeling.
Quality Histogram
plot_ly(data = wine, x =~quality, type = "histogram")
Histogram Insights
- Wine quality appears to follow a fairly normal, bell-shaped distribution
- This implies most wines are of average quality and few are very good or very bad; the exact counts can be tabulated as sketched below
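To quantify the histogram, the frequency of each rating can be printed. This is a minimal base-R sketch using the wine data frame loaded earlier:
#frequency count of each quality rating
table(wine$quality)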
Boxplots
Below are the boxplots that revealed significant patterns with respect to quality:
Alcohol Content
wine2 <- wine
#the letter prefixes force the quality categories to sort in order on the plot axis
wine2$qualitychar <- ifelse(wine2$quality == 3, "a_Three",
                     ifelse(wine2$quality == 4, "b_Four",
                     ifelse(wine2$quality == 5, "c_Five",
                     ifelse(wine2$quality == 6, "d_Six",
                     ifelse(wine2$quality == 7, "e_Seven",
                     ifelse(wine2$quality == 8, "f_Eight", "g_Nine"))))))
plot_ly(data = wine2, x = ~qualitychar, y = ~alcohol, color = ~qualitychar, type = "box", colors = "Dark2")
Density
plot_ly(data = wine2, x = ~qualitychar, y = ~density, color = ~qualitychar, type = "box", colors = "Set1")
Boxplots Insights
Referring to the boxplots above, it is clear what distinguishes above-average wines:
- Higher alcohol content
- Lower density
Step 2 : Data Preparation
As a decision tree is the model used, there is no need for preprocessing such as normalization. Since the rows of the dataset are already sorted in random order, the training and test sets can be prepared by simple partitioning. A 75%/25% split is chosen.
The partition into training and test datasets is done as below:
#training set
wine_train <- wine[1:3750, ]
#test set
wine_test <- wine[3751:4898, ]
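Had the rows not already been in random order, a random split could be drawn instead. The sketch below is illustrative only (the seed value and the _rnd object names are arbitrary); the sorted split above is what the rest of the analysis uses:
#alternative: reproducible random 75%/25% split
set.seed(123)
train_idx <- sample(nrow(wine), floor(0.75 * nrow(wine)))
wine_train_rnd <- wine[train_idx, ]
wine_test_rnd <- wine[-train_idx, ]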
Step 3: Model Training (Regression Tree)
Although almost any implementation of decision trees can be used to perform regression tree modeling, the rpart (recursive partitioning) package offers the most faithful implementation of regression trees as they were described by the CART team.
The training is carried out as follows:
#building the model on training set
m.rpart <- rpart(quality ~. , data = wine_train)
m.rpart
## n= 3750
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3750 3140.06000 5.886933
## 2) alcohol< 10.85 2473 1510.66200 5.609381
## 4) volatile.acidity>=0.2425 1406 740.15080 5.402560
## 8) volatile.acidity>=0.4225 182 92.99451 4.994505 *
## 9) volatile.acidity< 0.4225 1224 612.34560 5.463235 *
## 5) volatile.acidity< 0.2425 1067 631.12090 5.881912 *
## 3) alcohol>=10.85 1277 1069.95800 6.424432
## 6) free.sulfur.dioxide< 11.5 93 99.18280 5.473118 *
## 7) free.sulfur.dioxide>=11.5 1184 879.99920 6.499155
## 14) alcohol< 11.85 611 447.38130 6.296236 *
## 15) alcohol>=11.85 573 380.63180 6.715532 *
Visualizing Decision Trees
The decision tree built by rpart() can be visualized clearly using the fancyRpartPlot() function from the rattle package.
fancyRpartPlot(m.rpart)
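For finer control over the rendering, rpart.plot() from the rpart.plot package (loaded earlier) can be used instead; the parameter values below are just one reasonable configuration:
#digits rounds the numbers, type and extra control the node labels
rpart.plot(m.rpart, digits = 3, fallen.leaves = TRUE, type = 3, extra = 101)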
Step 4: Model Evaluation
Now it is time to put the decision tree model to the test by making predictions on the test data. The summaries of the predicted and actual values are below:
Summary of Predicted Values
p.rpart <- predict(m.rpart,wine_test)
summary(p.rpart)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.995 5.463 5.882 5.999 6.296 6.716
Summary of Actual
summary(wine_test$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.848 6.000 8.000
The summaries above show that the predictions fall in a far narrower range (roughly 5.0 to 6.7) than the actual scores (3 to 8): the model fails to identify the very worst and the very best wines, though it performs reasonably in the middle of the scale.
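A quick complementary diagnostic is the correlation between the predicted and actual quality scores, which measures how well the predictions track the true ratings:
#correlation between predicted and actual quality
cor(p.rpart, wine_test$quality)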
Mean Absolute Error (MAE)
Another way to think about the model’s performance is to consider how far, on average, its prediction was from the true value. This measurement is called the mean absolute error (MAE).
The MAE function is constructed and applied as below:
MAE <- function(actual, predicted){
mean(abs(actual - predicted))
}
MAE(wine_test$quality, p.rpart)
## [1] 0.5732104
An MAE of about 0.57 means that, on average, the model's prediction differs from the true quality score by roughly 0.57 points on the 0-to-10 scale, which is acceptable; whether it is good enough can be judged against the naive benchmark sketched below.
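As a sanity check, the tree's MAE can be compared against a naive benchmark that predicts the mean training-set quality for every wine; the tree should comfortably beat this:
#baseline MAE: predict the training mean for every test wine
MAE(wine_test$quality, mean(wine_train$quality))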
Step 5: Model Improvement
To improve the model, a model tree is implemented. A model tree differs from a regression tree in that it fits a multiple linear regression model at each leaf node rather than predicting a single value, which should make it more accurate.
The model tree is implemented using the M5P algorithm from the RWeka package. The summary of the M5 model is as follows:
#building the model
m.m5p <- M5P(quality ~. , data = wine_train)
# building the predictor
p.m5p <- predict(m.m5p, wine_test)
m.m5p
## M5 pruned model tree:
## (using smoothed linear models)
##
## alcohol <= 10.85 : LM1 (2473/77.476%)
## alcohol > 10.85 :
## | free.sulfur.dioxide <= 20.5 :
## | | free.sulfur.dioxide <= 10.5 : LM2 (81/104.574%)
## | | free.sulfur.dioxide > 10.5 : LM3 (224/87.002%)
## | free.sulfur.dioxide > 20.5 : LM4 (972/84.073%)
##
## LM num: 1
## quality =
## 0.0777 * fixed.acidity
## - 2.3087 * volatile.acidity
## + 0.0732 * residual.sugar
## + 0.0022 * free.sulfur.dioxide
## - 155.0175 * density
## + 0.6462 * pH
## + 0.7923 * sulphates
## + 0.0758 * alcohol
## + 156.2102
##
## LM num: 2
## quality =
## -0.0314 * fixed.acidity
## - 0.3415 * volatile.acidity
## + 1.7929 * citric.acid
## + 0.1316 * residual.sugar
## - 0.2456 * chlorides
## + 0.1212 * free.sulfur.dioxide
## - 178.6281 * density
## + 0.054 * pH
## + 0.1392 * sulphates
## + 0.0108 * alcohol
## + 180.6069
##
## LM num: 3
## quality =
## -0.2019 * fixed.acidity
## - 2.3804 * volatile.acidity
## - 1.0851 * citric.acid
## + 0.0905 * residual.sugar
## - 0.2456 * chlorides
## + 0.0041 * free.sulfur.dioxide
## - 177.078 * density
## + 0.054 * pH
## + 0.0868 * sulphates
## + 0.0108 * alcohol
## + 183.5076
##
## LM num: 4
## quality =
## 0.0004 * fixed.acidity
## - 0.0325 * volatile.acidity
## + 0.0957 * residual.sugar
## - 5.9702 * chlorides
## + 0.0002 * free.sulfur.dioxide
## - 172.3931 * density
## + 1.0123 * pH
## + 1.1653 * sulphates
## + 0.1542 * alcohol
## + 171.6842
##
## Number of Rules : 4
Improved MAE
The MAE has improved to the following:
MAE(wine_test$quality, p.m5p)
## [1] 0.5660352
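The same diagnostics used for the regression tree can be re-run on the model tree's predictions, for example:
#spread of predictions and correlation with the true ratings
summary(p.m5p)
cor(p.m5p, wine_test$quality)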
Testing
The quality of a new white wine with the following characteristics will be estimated:
test <- data.frame(fixed.acidity = 8.5, volatile.acidity = 0.33,
                   citric.acid = 0.42, residual.sugar = 10.5,
                   chlorides = 0.065, free.sulfur.dioxide = 47,
                   total.sulfur.dioxide = 186, density = 0.9955,
                   pH = 3.10, sulphates = 0.40, alcohol = 9.9)
test
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 8.5 0.33 0.42 10.5 0.065
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 47 186 0.9955 3.1 0.4 9.9
The test data will have an estimated quality rating of:
test_pred <- predict(m.m5p, test)
test_pred
## [1] 5.730941
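Since the expert ratings are whole numbers, the estimate can be rounded to the nearest integer rating:
#round to the nearest whole rating (5.73 rounds to 6)
round(test_pred)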