Case

Wine is an alcoholic drink made from fermented grapes, famous for both its intoxicating effect and its delicacy. Different grape varieties and yeast strains produce different styles of wine. The biochemical processes that occur during fermentation produce alcohol along with substances that enrich the flavor of the wine and determine its quality.

In industrial-scale production, it is important for wine producers to be able to classify quickly and correctly whether their wine is of high quality. This routine is performed for product categorization and quality control purposes. The traditional organoleptic approach is impractical at such a massive scale, so predicting a wine’s quality from its chemical properties is an appealing alternative. Using today’s tools for chemical analysis and machine learning, a predictive model can be built to help wine producers classify wine more efficiently and improve their productivity.



Previously, we documented the steps to build a predictive model for wine quality using two classification algorithms: Logistic Regression and k-NN. Unfortunately, the highest accuracy achieved was only 76% (using k-NN).

In this article, I am going to re-analyze our wine data and build predictive models using other classification algorithms, in the hope of finding room for improvement. These algorithms are Naive Bayes, Decision Tree, and Random Forest. We will also discuss the main concepts and important considerations related to each model.

The dataset that will be used was obtained from the UCI Machine Learning Repository and contains the physicochemical characteristics of red and white variants of the Portuguese “Vinho Verde” wine [1].

Solutions

Read Data

The wine dataset was obtained from the UCI Machine Learning Repository and contains the physicochemical characteristics of red and white variants of the Portuguese “Vinho Verde” wine.

Column description:

  • type : wine variants (red/white)
  • fixed acidity: most acids involved with wine (fixed or nonvolatile acids that do not evaporate readily)
  • volatile acidity: amount of acetic acid in wine
  • citric acid: found in small quantities
  • residual sugar: amount of sugar remaining after wine fermentation/production
  • chlorides: amount of salt in the wine
  • free sulfur dioxide: free forms of SO2; prevents microbial growth and the oxidation of wine
  • total sulfur dioxide: amount of free and bound forms of SO2
  • density: the density of the wine, which is close to that of water and depends on the percent alcohol and sugar content
  • pH: describes how acidic or basic a wine is, on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale
  • sulphates: an antimicrobial and antioxidant
  • alcohol: the percent alcohol content of the wine
  • quality: target variable (based on sensory data, scored between 0 and 10)

Data Wrangling

As in the previous article, we are going to build a predictive model to classify white wine quality, where a quality score of 7-10 is considered “Excellent”. Therefore, I subsetted the white wine data and removed the type column for a cleaner analysis.
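
Below is a minimal sketch of this wrangling step, assuming the raw data frame is named wine (the object names and the dplyr approach are my own and not necessarily those of the original analysis):

library(dplyr)

wine_white <- wine %>%
  filter(type == "white") %>%                      # keep only the white variant
  select(-type) %>%                                # drop `type`, now a constant column
  mutate(quality = factor(ifelse(quality >= 7,
                                 "Excellent", "Poor-Normal")))

# proportion of missing cells across the whole data frame
prop.table(table(is.na(wine_white)))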

## 
## FALSE  TRUE 
## 0.999 0.001

There are still some missing values in our data, and I will impute them with median values because our data are not normally distributed.
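
A possible implementation of the median imputation (wine_white is the subsetted data frame from the sketch above; column names follow the dataset description):

for (col in setdiff(names(wine_white), "quality")) {
  missing <- is.na(wine_white[[col]])
  wine_white[[col]][missing] <- median(wine_white[[col]], na.rm = TRUE)
}

# confirm that no missing values remain
colSums(is.na(wine_white))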

##        fixed.acidity     volatile.acidity          citric.acid 
##                    0                    0                    0 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    0                    0                    0 
## total.sulfur.dioxide              density                   pH 
##                    0                    0                    0 
##            sulphates              alcohol              quality 
##                    0                    0                    0

Exploratory Data Analysis

Target Variable Proportion

Based on the plot, there is an imbalanced proportion between the levels of our target variable. To avoid discarding observations (and the information they carry), I will use upsampling (rather than downsampling) to balance the proportions.
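
One way to do this is caret’s upSample(), applied to the training data only so that the test data keeps its natural class proportion (the split ratio, seed, and object names below are illustrative assumptions):

library(caret)
set.seed(100)                                      # arbitrary seed, for reproducibility

# hold out a test set before balancing
idx        <- sample(nrow(wine_white), size = 0.8 * nrow(wine_white))
wine_train <- wine_white[idx, ]
wine_test  <- wine_white[-idx, ]

# upsample the minority class in the training data
wine_train <- upSample(x = wine_train[, setdiff(names(wine_train), "quality")],
                       y = wine_train$quality,
                       yname = "quality")
table(wine_train$quality)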

From the correlation plot of the predictors, we can see that some predictor variables have a high correlation with one another. These variables are free.sulfur.dioxide, total.sulfur.dioxide, density, alcohol, and residual.sugar. This is an early warning that the data might not be appropriate for some models, such as Naive Bayes.
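
A correlation plot like the one referenced above could be produced, for example, with GGally’s ggcorr() (the package choice is my assumption; a plain cor() matrix works just as well):

library(GGally)

# pairwise correlations between the numeric predictors
ggcorr(wine_white[, setdiff(names(wine_white), "quality")],
       label = TRUE, label_size = 3, hjust = 0.9)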

Modeling: Naive Bayes

There are certain characteristics of Naive Bayes that should be considered:

  • assumes that all features of the dataset are equally important and independent. This allows Naive Bayes to perform fast computation (the algorithm is quite simple).
  • is prone to bias due to data scarcity. In some cases, our data may have a distribution where scarce observations lead to probabilities approximating 0 or 1, which introduces a heavy bias into our model and could lead to poor performance on unseen data.
  • is more appropriate for data with categorical predictors, because Naive Bayes is sensitive to data scarcity, and a continuous variable might have very few, or even only one, observation for a certain value.
  • applies a Laplace estimator/smoothing to handle the data-scarcity problem. The Laplace estimator adds a small number (usually 1) to each count in the frequency table, which ensures that every class-feature combination has a non-zero probability of occurring.

Based on these characteristics, our wine data is actually not very appropriate for Naive Bayes. As the data description shows, some of our predictors have a high correlation with one another, and all predictors are continuous variables. Even so, we are still going to try Naive Bayes, and the result will be compared with the other models. While building our Naive Bayes model, we will also apply a Laplace estimator.
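
A minimal sketch of such a model with the e1071 package (object names are assumptions; laplace = 1 applies the Laplace estimator discussed above):

library(e1071)

model_nb <- naiveBayes(quality ~ ., data = wine_train, laplace = 1)

# class predictions and overall accuracy on the test set
pred_nb <- predict(model_nb, newdata = wine_test)
mean(pred_nb == wine_test$quality)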

## [1] 0.7825157

For your information, these are the metrics that we are going to evaluate for the model:

  • Accuracy: the ability to correctly predict both classes from the total observation.
  • Precision: the ability to correctly predict the positive class from the total predicted-positive class (false positive is low).
  • Recall: the ability to correctly predict the positive class from the total actual-positive class (false negative is low).
  • Specificity: the ability to correctly predict the negative class from the total actual-negative class.

Additionally, there are also the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) to evaluate our model. The ROC curve plots the true positive rate (TPR, or Sensitivity) against the false positive rate (FPR, or 1-Specificity) across classification thresholds, and the AUC represents the degree or measure of separability: it tells how well a model can distinguish between classes. The closer the curve gets to the upper-left of the plot (the true positive rate is high while the false positive rate is low), the better our model is, and the higher the AUC score, the better our model separates the target classes. To ease your understanding, you can see the illustration below.



Based on the result, our model’s accuracy, sensitivity, and specificity are acceptable. Unfortunately, the model’s precision is still very low (38%). Our ROC curve also shows quite good separation, with an AUC score of 0.7532463, and it may be improved further.
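
For reference, here is one way the ROC curve and AUC could be computed, using the ROCR package and the model_nb and wine_test objects from the earlier sketches (this is a sketch, not the exact code behind the numbers above):

library(ROCR)

# posterior probability of the positive ("Excellent") class
prob_nb <- predict(model_nb, newdata = wine_test, type = "raw")[, "Excellent"]

pred_obj <- prediction(prob_nb, labels = ifelse(wine_test$quality == "Excellent", 1, 0))
perf_roc <- performance(pred_obj, measure = "tpr", x.measure = "fpr")
plot(perf_roc)                                           # ROC curve

performance(pred_obj, measure = "auc")@y.values[[1]]     # AUC score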

In our case, wine is classified for product recommendation and product categorization purposes. Misclassification of either class is equally damaging for the company: placing an excellent-quality wine in the poor-normal class (with a lower selling price) inflicts a financial loss, while letting a poor-normal wine end up in the excellent-quality class damages the company’s reputation in the eyes of its customers.

From the explanation above, we know that both classes of the target variable are equally important. This means that our model’s accuracy should be high, while the recall and specificity should not be very low either. Several things we can do include finding the recall-specificity break-even point (the cutoff where recall and specificity are equal) or finding a cutoff where both scores are not very low (>= 70%) while accuracy remains high. Precision might not be the most important metric here, especially because our test dataset has a class imbalance (22% excellent, 78% poor-normal).

Below, I will demonstrate some model tuning by changing the classification threshold.
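
A sketch of one way to do this: scan a range of cutoffs over the predicted probabilities (prob_nb from the sketch above) and record accuracy, recall, and specificity for each cutoff.

cutoffs <- seq(0.05, 0.95, by = 0.01)

metrics <- sapply(cutoffs, function(ct) {
  pred <- factor(ifelse(prob_nb >= ct, "Excellent", "Poor-Normal"),
                 levels = levels(wine_test$quality))
  cm   <- table(pred, wine_test$quality)               # rows = predicted, cols = actual
  c(accuracy    = sum(diag(cm)) / sum(cm),
    recall      = cm["Excellent", "Excellent"] / sum(cm[, "Excellent"]),
    specificity = cm["Poor-Normal", "Poor-Normal"] / sum(cm[, "Poor-Normal"]))
})

# plot the three metrics against the threshold
matplot(cutoffs, t(metrics), type = "l", lty = 1, col = 1:3,
        xlab = "classification threshold", ylab = "score")
legend("bottomright", legend = rownames(metrics), lty = 1, col = 1:3)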

From the plot above, we know that the recall-specificity break-even point is at 0.77. However, if we set the threshold to 0.8118, we can still get recall and specificity above 70% while also reaching 76% accuracy. Therefore I will pick that threshold. We may also try another algorithm for this data.

Modeling: Decision Tree

Tree-based models are powerful, incredibly versatile, and among the most popular choices for machine learning tasks.

The decision tree is a tree-based model with the major benefit of being interpretable. The algorithm produces a set of rules that can be visualized in a diagram resembling a tree.

To really understand what it looks like, let’s build a decision tree model on our wine_train data. We can use the rpart package to build the tree and the rattle and rpart.plot packages to visualize it.
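
A sketch of fitting and plotting the tree with default control parameters (the object name is an assumption):

library(rpart)
library(rattle)            # fancyRpartPlot()
library(rpart.plot)

model_dt <- rpart(quality ~ ., data = wine_train, method = "class")
fancyRpartPlot(model_dt, sub = "")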

A decision tree consists of several parts. The first box at the top of the plot is the root node. The root splits into branches according to certain rules, and each branch ends with a node. Nodes that split again into other nodes are called internal nodes. Nodes that do not split anymore and appear at the end of the tree are called terminal nodes or leaf nodes, just like leaves on a tree.

Each node shows:

  • the predicted class (Excellent/Poor-Normal).
  • the probability of Excellent/Poor-Normal class.
  • the percentage of observations in the node.
  • the root and internal nodes also show the rules (variables with threshold/value) that will partition each observation.

How does a tree choose its root, split its branches, and end up with its nodes?

A tree will choose to split the data in such a way that the resulting nodes contain data points from as few different classes as possible (homogeneous). One measure of homogeneity/purity within groups is entropy, a measure of disorder. Entropy near 0 means most of the observations fall within the same class (homogeneous); entropy near 1 means the opposite (heterogeneous).

A decision tree is built in a top-down fashion. The root node is chosen from the variable and splitting rule that yield the largest reduction in entropy. The root is then partitioned (split) into nodes, with each partition having a different entropy. For more reference about the entropy calculation, you can read this article. The reduction in entropy from before to after splitting is called the information gain. The tree prefers to split using variables and rules that result in a higher information gain (from a high entropy to a lower entropy).
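
To make the idea concrete, here is a small, self-contained illustration of entropy and information gain (the split rule on alcohol is chosen arbitrarily for illustration):

# entropy of a vector of class labels: H = -sum(p * log2(p))
entropy <- function(y) {
  p <- prop.table(table(y))
  p <- p[p > 0]                        # drop empty classes to avoid 0 * log2(0)
  -sum(p * log2(p))
}

# information gain of splitting labels `y` by a logical condition `split`
info_gain <- function(y, split) {
  h_parent   <- entropy(y)
  h_children <- sum(prop.table(table(split)) * tapply(y, split, entropy))
  h_parent - h_children
}

info_gain(wine_train$quality, wine_train$alcohol > 10.8)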

There are certain characteristics of decision tree model:

  • perform well on both numerical and categorical variables.

  • all predictors are assumed to interact.

  • quite robust to the problem of multicollinearity. A decision tree will choose a variable that has the highest information gain in one split, whereas a method such as logistic regression would have used both.

  • robust and insensitive to outliers. Splitting will happen at a condition where it maximizes the homogeneity within resulting groups. Outliers will have little influence on the splitting process.

Some drawbacks of the decision tree model are described below:

  • prone to overfitting. Trees should know when to stop growing, or they will partition every single observation possible and result in an overfitting model. To overcome this problem, we can perform pre-pruning (limiting tree depth before model building) or post-pruning (letting the tree grow and pruning it later, after important classification patterns have been discovered). We can set several parameters (see the sketch after this list):
    • mincriterion: the value of the test statistic (1 - p-value) that must be exceeded in order to implement a split. For example, when mincriterion is 0.95, the p-value must be smaller than 0.05 in order for a node to split. This also acts as a “regulator” for the depth of the tree: the higher the mincriterion, the harder it is to perform a split, which generates a smaller tree.
    • minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted. Defaults to 20.
    • minbucket: the minimum number of observations in any leaf node. Defaults to round(minsplit/3).
    • maxdepth: the maximum depth of any node of the final tree, with the root node counted as depth 0. Defaults to 30.
  • is a greedy algorithm that picks the attribute resulting in the highest information gain at each step, a strategy that converges to a local optimum but does not guarantee global optimality. Some efforts to overcome this problem are:
    • k-steps look-ahead trees
    • dual information distance (DID) as the splitting criterion
    • etc.
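
As referenced in the list above, a sketch of pre- and post-pruning with rpart (minsplit, minbucket, and maxdepth map directly to rpart.control(); mincriterion belongs to the ctree() implementation in party/partykit and is not set here; all values are illustrative):

# pre-pruning: constrain the tree while it grows
model_dt_small <- rpart(quality ~ ., data = wine_train, method = "class",
                        control = rpart.control(minsplit = 20,
                                                minbucket = 7,
                                                maxdepth  = 5))

# post-pruning: grow a large tree, then prune it back using the complexity table
model_dt_full <- rpart(quality ~ ., data = wine_train, method = "class",
                       control = rpart.control(cp = 0))
best_cp       <- model_dt_full$cptable[which.min(model_dt_full$cptable[, "xerror"]), "CP"]
model_dt_post <- prune(model_dt_full, cp = best_cp)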

## [1] 0.7755241

From the decision tree model result, I will not tune the model’s parameters, because using the default parameters we already got a reasonably sized tree (no need to perform pruning). Instead, I will try to tune the model by changing the classification threshold.

Modeling: Random Forest

Random Forest is an example of an ensemble-based algorithm. It is built on the decision tree method and known for its versatility and performance. An ensemble-based algorithm combines several machine learning models into one predictive model, built to reduce error and bias and to improve predictions. Building a Random Forest model consists of several steps:

  • it performs bagging (bootstrap aggregation): creating subsets of the training data through random sampling with replacement and training one decision tree on each subset.

  • at every split, each tree only considers a random subset of the predictor variables. Together with bagging, this decorrelates the individual trees and counters the overfitting that a single decision tree is prone to.

  • it classifies through a voting mechanism: after the multiple decision trees have been trained and each tree has given its prediction (fitted values), the final prediction is generated through majority voting (for classification) or by averaging the outputs (for regression).

Speaking of the random selection of observations and variables, let’s get acquainted with a model evaluation technique called K-fold Cross-Validation. This technique splits our data into k equal-sized groups (bins) and uses one bin as the test data while the rest of the data become the train data. The process is repeated k times (the folds), so every observation gets the chance to be used as both training and test data, which gives a more reliable performance estimate and helps guard against the overfitting seen with a single decision tree. Below is an illustration of a 5-bin, 5-fold cross-validation.



To really understand how random forest works, let’s apply the algorithm to our wine data.
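
A sketch of the workflow with caret, combining the random forest method with the repeated 4-fold cross-validation shown in the summary below (the seed and object names are assumptions):

library(caret)
set.seed(417)                                      # arbitrary seed

# repeated k-fold cross-validation: 4 folds, repeated 4 times
ctrl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)

# train() tunes mtry, the number of variables tried at each split
rf_model <- train(quality ~ ., data = wine_train,
                  method    = "rf",
                  trControl = ctrl)
rf_model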

## Random Forest 
## 
## 6142 samples
##   11 predictor
##    2 classes: 'Excellent', 'Poor-Normal' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 4 times) 
## Summary of sample sizes: 4606, 4607, 4606, 4607, 4606, 4606, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9498940  0.8997884
##    6    0.9461899  0.8923801
##   11    0.9427302  0.8854606
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

From the model summary, we know that the optimal number of variables considered for splitting at each tree node (mtry) is 2. We can also inspect the importance of each variable used in our random forest using varImp().
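
Assuming the caret object above is named rf_model:

# variable importance, scaled to 0-100 by default
varImp(rf_model)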

## rf variable importance
## 
##                      Overall
## alcohol              100.000
## density               71.823
## chlorides             42.695
## total.sulfur.dioxide  26.445
## volatile.acidity      26.408
## residual.sugar        24.650
## free.sulfur.dioxide   24.589
## pH                    20.683
## citric.acid           15.234
## sulphates              1.523
## fixed.acidity          0.000

When using random forest, we are not strictly required to split our dataset into train and test sets, because random forest already provides out-of-bag (OOB) estimates, which act as a reliable estimate of the accuracy on unseen examples. Nevertheless, it is still possible to hold out a regular train-test split. For reference, the OOB error shown in the summary below was generated from our wine_train dataset.
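
The summary below can be printed from the underlying randomForest object stored inside the caret model (again assuming the name rf_model):

rf_model$finalModel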

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 3.24%
## Confusion matrix:
##             Excellent Poor-Normal class.error
## Excellent        3041          30 0.009768805
## Poor-Normal       169        2902 0.055030935

Let’s test our random forest model on our wine_test dataset:
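
A sketch of the evaluation step (caret’s confusionMatrix() would give the full set of metrics; here only the overall accuracy is shown):

pred_rf <- predict(rf_model, newdata = wine_test)

# overall accuracy on the hold-out data
mean(pred_rf == wine_test$quality)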

## [1] 0.9344402

Based on the above analysis and results, the random forest model gave us great accuracy, recall, specificity, and precision (all metrics > 65%). Furthermore, the AUC of the model is also very high, reaching 93%. I will try to tune the threshold to reach a higher score on each metric.

Conclusion

Based on the metrics table above, the predictive model built with the Random Forest algorithm gave the best result. The model gave the highest accuracy (89%) while maintaining sensitivity, specificity, and precision above 70%. It also gave the highest AUC, at 93%. Therefore, the best model to predict wine quality from chemical properties is the Random Forest model.



  1. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.