Wine is an alcoholic drink made from fermented grapes, famous for its intoxicating effect and its delicacy. Different varieties of grapes and strains of yeast produce different styles of wine. The biochemical processes that occur during fermentation produce alcohol along with substances that enrich the flavor of the wine and determine its quality.
In industrial-scale production, it is important for wine producers to classify correctly and swiftly whether a wine is of high quality. This routine is performed for product categorization and quality control purposes. The traditional organoleptic (sensory) approach is impractical at such a massive scale, so predicting a wine's quality from its chemical properties is an attractive alternative. Using today's tools for chemical analysis and machine learning, a predictive model can help wine producers classify wine more efficiently and improve their productivity.
Previously, we documented the steps to build a predictive model for wine quality using two classification algorithms: Logistic Regression and k-NN. Unfortunately, the highest accuracy only reached 76% (using k-NN).
In this article, I am going to re-analyze our wine data and build other predictive models using different classification algorithms, in the hope of finding room for improvement. These algorithms are Naive Bayes, Decision Tree, and Random Forest. We will also discuss the main concepts and important considerations for each model.
The dataset was obtained from the UCI Machine Learning Repository and contains the physicochemical characteristics of red and white variants of the Portuguese “Vinho Verde” wine [1].
Column description:
type: wine variant (red/white)
fixed acidity: most acids involved with wine
volatile acidity: amount of acetic acid in wine
citric acid: found in small quantities
residual sugar: amount of sugar remaining after fermentation
chlorides: amount of salt in the wine
free sulfur dioxide: free forms of SO2; prevents microbial growth and the oxidation of wine
total sulfur dioxide: amount of free and bound forms of SO2
density: the density of the wine, which depends on the percent alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale of 0-14 (very acidic: 0, very basic: 14); most wines are between 3 and 4 on the pH scale
sulphates: an antimicrobial and antioxidant additive
alcohol: the percent alcohol content of the wine
quality: target variable (based on sensory data, scored between 0 and 10)

As in the previous article, we are going to build a predictive model to classify white wine quality, where a quality score of 7-10 is considered “Excellent”. Therefore, I subsetted the white wine data and removed type for a cleaner analysis.
# filter only white wine data
wine <- wine %>%
filter(type == "white") %>%
select(-type) %>%
mutate(quality = as.factor(ifelse(quality > 6, "Excellent", "Poor-Normal")))
wine
# check the overall proportion of missing values
prop.table(table(is.na(wine)))
##
## FALSE TRUE
## 0.999 0.001
# inspect the data distribution of numerical variables
inspect_num(wine[,-c(6:8,10:11)], show_plot = T)

There are still some missing values in our data; I will impute them with median values because our data are not normally distributed.
# NA median impute
prevalues <- preProcess(wine, method=c("medianImpute"))
wine <- predict(prevalues, wine)
colSums(is.na(wine))
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
Target Variable Proportion
Based on the plot, there is an imbalanced proportion between the levels of our target variable. To avoid losing variance from the data, I will use upsampling (rather than downsampling) to balance the proportions in the training set.
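The splitting step is not shown explicitly here, so below is a minimal sketch of how the wine_train and wine_test objects used in the rest of this article could be produced; the 80/20 proportion and the seed are assumptions, not taken from the original analysis.

# assumed 80/20 train-test split; proportion and seed are illustrative
library(caret)
set.seed(100)
idx <- sample(nrow(wine), 0.8 * nrow(wine))
wine_train <- wine[idx, ]
wine_test  <- wine[-idx, ]
# upsample the minority class ("Excellent") so both classes are balanced
wine_train <- upSample(x = wine_train[, -12], y = wine_train$quality, yname = "quality")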
# inspect correlation between predictors
GGally::ggcorr(wine[,-12], hjust = 1, layout.exp = 2, label = T, label_size = 2.9)

Based on the plot above, some predictor variables have a high correlation with one another: free.sulfur.dioxide, total.sulfur.dioxide, density, alcohol, and residual.sugar. This gives us an early warning that the data might not be appropriate for some models, such as Naive Bayes.
There are certain characteristics of Naive Bayes that should be considered:
it assumes that all predictors are mutually independent given the class (the "naive" assumption), so the posterior is estimated as the class prior multiplied by each predictor's conditional probability;
it works most naturally with categorical predictors; continuous predictors are typically handled by assuming a Gaussian distribution within each class;
a predictor value that never co-occurs with a class in the training data yields a zero conditional probability that nullifies the whole posterior, which is usually handled with a Laplace estimator.
Based on these characteristics, our wine data is actually not very appropriate for Naive Bayes: some of our predictors are highly correlated with one another, and all of them are continuous variables. Even so, we will still try Naive Bayes and compare the result with the other models. While building the model, we will also apply a Laplace estimator (laplace = 1) so that no single zero probability nullifies the posterior.
library(e1071)
# model building
naive <- naiveBayes(wine_train[-12], wine_train$quality, laplace = 1)

# model fitting
naive_pred <- predict(naive, wine_test, type = "class") # for the class prediction
naive_prob <- predict(naive, wine_test, type = "raw") # for the probability
# result
naive_table <- select(wine_test, quality) %>%
bind_cols(quality_pred = naive_pred) %>%
bind_cols(quality_eprob = round(naive_prob[,1],4)) %>%
bind_cols(quality_pprob = round(naive_prob[,2],4))
# performance evaluation - confusion matrix
naive_table %>%
conf_mat(quality, quality_pred) %>%
autoplot(type = "heatmap")naive_table %>%
summarise(
accuracy = accuracy_vec(quality, quality_pred),
sensitivity = sens_vec(quality, quality_pred),
specificity = spec_vec(quality, quality_pred),
precision = precision_vec(quality, quality_pred)
)

# ROC
naive_roc <- data.frame(prediction=round(naive_prob[,1],4),
trueclass=as.numeric(naive_table$quality=="Excellent"))
head(naive_roc)

library(ROCR)
naive_roc <- ROCR::prediction(naive_roc$prediction, naive_roc$trueclass)
# ROC curve
plot(performance(naive_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)

# AUC
auc_ROCR_n <- performance(naive_roc, measure = "auc")
auc_ROCR_n <- auc_ROCR_n@y.values[[1]]
auc_ROCR_n
## [1] 0.7825157
For your information, these are the metrics that we are going to evaluate from the model:
accuracy: the proportion of all predictions that are correct
sensitivity/recall: the proportion of actual positives ("Excellent") that are correctly identified
specificity: the proportion of actual negatives ("Poor-Normal") that are correctly identified
precision: the proportion of positive predictions that are actually positive
Additionally, there are the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) to evaluate our model. ROC plots the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1 - specificity) across classification thresholds. ROC is a probability curve, and AUC represents the degree of separability: it tells how capable the model is of distinguishing between classes. The closer the curve reaches the upper-left of the plot (a high true positive rate with a low false positive rate), the better our model is, and the higher the AUC score, the better our model separates the target classes.
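To make this concrete, here is a toy illustration (not part of the original analysis) of computing TPR and FPR by hand at a single, arbitrary cutoff; the ROC curve simply sweeps this cutoff from 0 to 1 and plots the resulting (FPR, TPR) pairs.

# TPR and FPR at one illustrative cutoff (0.5, chosen arbitrarily)
pred_pos   <- naive_prob[, 1] >= 0.5
actual_pos <- wine_test$quality == "Excellent"
tpr <- sum(pred_pos & actual_pos) / sum(actual_pos)    # sensitivity
fpr <- sum(pred_pos & !actual_pos) / sum(!actual_pos)  # 1 - specificity
c(TPR = tpr, FPR = fpr)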
Based on the result, our model’s accuracy, sensitivity, and specificity are acceptable. Unfortunately, the model’s precision is still very low (38%). Our ROC curve also shows quite good separation, with an AUC score of 0.7825, and it may be improved further.
In our case, wine is classified for product recommendation and categorization purposes. Misclassification of either class is equally destructive for the company: an excellent wine placed in the poor-normal class (with a lower selling price) inflicts a financial loss, while a poor-normal wine stranded in the excellent class damages the company’s reputation in the eyes of its customers.
From the explanation above, we know that both target classes are equally important. This means our model’s accuracy should be high, while the recall and specificity should not be very low either. Several things we can do include finding the recall-specificity break-even point (the cutoff where recall and specificity are equal) or finding a point where both scores are not very low (>= 70%) while accuracy stays high. Precision might not be the most important metric here, especially because our test dataset has a class imbalance (22% excellent, 78% poor-normal).
Below, I will demonstrate model tuning by changing the classification threshold.
library(plotly)
# model tuning - metrics function
metrics <- function(cutoff, prob, ref, postarget, negtarget)
{
predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
conf <- caret::confusionMatrix(predict , ref, positive = postarget)
acc <- conf$overall[1]
rec <- conf$byClass[1]
prec <- conf$byClass[3]
spec <- conf$byClass[2]
mat <- t(as.matrix(c(rec , acc , prec, spec)))
colnames(mat) <- c("recall", "accuracy", "precision", "specificity")
return(mat)
}
co <- seq(0.01,0.99,length=100)
result <- matrix(0,100,4)
# apply function metrics
for(i in 1:100){
result[i,] = metrics(cutoff = co[i],
prob = naive_table$quality_eprob,
ref = as.factor(ifelse(naive_table$quality == "Excellent", 1, 0)),
postarget = "1",
negtarget = "0")
}
# visualize
ggplotly(tibble("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "Metrics", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff Model Performance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()))

From the plot above, we know that the recall-specificity break-even point is around 0.77. But if we set the threshold to 0.8118, we can still get recall and specificity above 70% while also reaching an accuracy of 76%. Therefore I will pick that threshold. We may also try another algorithm for this data.
# tuning final model
naive_table <- naive_table %>%
mutate(tuning_pred = as.factor(ifelse(quality_eprob >= 0.8118, "Excellent", "Poor-Normal")))
# metrics result
final_n <- naive_table %>%
summarise(
accuracy = accuracy_vec(quality, tuning_pred),
sensitivity = sens_vec(quality, tuning_pred),
specificity = spec_vec(quality, tuning_pred),
precision = precision_vec(quality, tuning_pred)
) %>%
cbind(AUC = auc_ROCR_n)
final_n

Tree-based models are powerful, incredibly versatile, and arguably the most popular choice for machine learning tasks.
The decision tree is a tree-based model with the major benefit of being interpretable: the algorithm produces a set of rules that can be visualized in a diagram resembling a tree.
To really understand what it looks like, let’s build a decision tree model from our wine_train data. We can use the rpart package to build the model and the rattle and rpart.plot packages to visualize it.
library(rpart)
library(rattle)
library(rpart.plot)
# model building
dtree <- rpart(formula = quality ~ ., data = wine_train, method = "class")
fancyRpartPlot(dtree, sub = NULL)

A decision tree consists of several parts. The first box in the upper part of the plot is the root node. The root splits into branches according to certain rules, and each branch ends in a node. Nodes that split again into other nodes are called internal nodes. Nodes that do not split anymore and appear at the end of the tree are called terminal nodes or leaf nodes, just like leaves on a tree.
Each node shows:
the predicted class of the node ("Excellent" or "Poor-Normal");
the proportion of each class among the observations in the node;
the percentage of all observations that fall into the node.
How does a tree choose its root, split its branches, and end up with its leaf nodes?
A tree will choose to split the data in such a way that the resulting nodes contain as many observations of the same class as possible (homogeneous). One measurement of homogeneity/purity within groups is entropy, the measure of disorder. Entropy near 0 means most of the observations fall within the same class (homogeneous); entropy near 1 means the opposite (heterogeneous).
A decision tree is built in a top-down fashion. The root node is chosen from the variable and conditional rule that most reduce entropy; the root is then partitioned (split) into nodes, each partition having its own entropy. For more detail about the entropy calculation, you can read this article. The reduction of entropy from before to after a split is called the information gain, and the tree prefers to split on the variables and rules that yield the highest information gain (from a high entropy into a lower one).
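To make these two quantities concrete, here is a small illustration of computing entropy and the information gain of one candidate split in R; the split rule on alcohol at 10.8 is hypothetical, not taken from the fitted tree.

# entropy of a vector of class labels
entropy <- function(labels) {
  p <- prop.table(table(labels))
  p <- p[p > 0]
  -sum(p * log2(p))
}

split_rule <- wine_train$alcohol >= 10.8   # hypothetical splitting rule
parent <- wine_train$quality
left   <- parent[split_rule]
right  <- parent[!split_rule]

# information gain = parent entropy - weighted average of child entropies
gain <- entropy(parent) -
  (length(left)  / length(parent) * entropy(left) +
   length(right) / length(parent) * entropy(right))
gain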
There are certain characteristics of the decision tree model:
it performs well on both numerical and categorical variables.
all predictors are assumed to interact.
it is quite robust to multicollinearity: a decision tree will choose the variable with the highest information gain at a given split, whereas a method such as logistic regression would have used both.
it is robust and insensitive to outliers: splitting happens at the condition that maximizes the homogeneity of the resulting groups, so outliers have little influence on the splitting process.
Some drawbacks of the decision tree model are described below:
it is prone to overfitting: a tree grown too deep memorizes the training data and generalizes poorly;
it is unstable: a small change in the training data can produce a very different tree.

The growth of the tree can be regulated through several parameters:
mincriterion: the value of the test statistic (1 - p-value) that must be exceeded in order to implement a split. For example, when mincriterion is 0.95, the p-value must be smaller than 0.05 for a node to split. This can act as a “regulator” for the depth of the tree: the higher the mincriterion, the harder it is to perform a split, thus generating a smaller tree.
minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted. Defaults to 20.
minbucket: the minimum number of observations in any leaf node. Defaults to round(minsplit/3).
maxdepth: the maximum depth of any node of the final tree, with the root node counted as depth 0. Defaults to 30.
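As a sketch with hypothetical values (this article itself keeps the defaults, as noted below), these controls can be passed to rpart through rpart.control. Note that mincriterion belongs to the conditional-inference trees of party::ctree; rpart instead uses the complexity parameter cp in a similar role.

# hypothetical example: grow a shallower, more conservative tree
dtree_small <- rpart(quality ~ ., data = wine_train, method = "class",
                     control = rpart.control(minsplit = 40, minbucket = 15, maxdepth = 5))

# model fitting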
dtree_pred <- predict(dtree, wine_test, type = "class")
dtree_prob <- predict(dtree, wine_test, type = "prob")
# result
dtree_table <- select(wine_test, quality) %>%
bind_cols(quality_pred = dtree_pred) %>%
bind_cols(quality_eprob = round(dtree_prob[,1],4)) %>%
bind_cols(quality_pprob = round(dtree_prob[,2],4))
# performance evaluation - confusion matrix
dtree_table %>%
conf_mat(quality, quality_pred) %>%
autoplot(type = "heatmap")dtree_table %>%
summarise(
accuracy = accuracy_vec(quality, quality_pred),
sensitivity = sens_vec(quality, quality_pred),
specificity = spec_vec(quality, quality_pred),
precision = precision_vec(quality, quality_pred)
)

# ROC
dtree_roc <- data.frame(prediction=round(dtree_prob[,1],4),
trueclass=as.numeric(dtree_table$quality=="Excellent"))
head(dtree_roc)

dtree_roc <- ROCR::prediction(dtree_roc$prediction, dtree_roc$trueclass)
# ROC curve
plot(performance(dtree_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)

# AUC
auc_ROCR_d <- performance(dtree_roc, measure = "auc")
auc_ROCR_d <- auc_ROCR_d@y.values[[1]]
auc_ROCR_d
## [1] 0.7755241
From the decision tree result, I will not tune the model’s parameters: with the defaults we already obtained a reasonably sized tree (no need for pruning). Instead, I will again tune the model by changing the classification threshold.
for(i in 1:100){
result[i,] = metrics(cutoff = co[i],
prob = dtree_table$quality_eprob,
ref = as.factor(ifelse(dtree_table$quality == "Excellent", 1, 0)),
postarget = "1",
negtarget = "0")
}
ggplotly(tibble("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "Metrics", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff Model Performance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()))

# tuning final model
dtree_table <- dtree_table %>%
mutate(tuning_pred = as.factor(ifelse(quality_eprob >= 0.70, "Excellent", "Poor-Normal")))
# metrics result
final_d <- dtree_table %>%
summarise(
accuracy = accuracy_vec(quality, tuning_pred),
sensitivity = sens_vec(quality, tuning_pred),
specificity = spec_vec(quality, tuning_pred),
precision = precision_vec(quality, tuning_pred)
) %>%
cbind(AUC = auc_ROCR_d)
final_d

Random Forest is an ensemble-based algorithm built on the decision tree method and known for its versatility and performance. An ensemble-based algorithm is a hybrid of several machine learning techniques combined into one predictive model, built to reduce error and bias and to improve predictions. Building a Random Forest model consists of several steps:
it performs bagging (bootstrap aggregation): creating subsets of the training data through random sampling with replacement, each used to train one of many decision trees.
it performs random feature selection: each split within each tree considers only a random subset of the predictors. Because every tree is trained on random observations and random variables, the combined forest overcomes the overfitting problem of a single decision tree.
it classifies by a voting mechanism: after the trees have been trained and each tree has given its prediction (fitted values), the final prediction is generated through majority voting (in classification) or by averaging the outputs (in regression).
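To make the voting mechanism concrete, here is a small sketch (assuming the randomForest package with its default of 500 trees) of fitting a forest directly and inspecting the vote fractions behind each final prediction:

library(randomForest)
set.seed(417)
rf <- randomForest(quality ~ ., data = wine_train, ntree = 500)
# each row: the fraction of trees voting for each class for that observation
head(rf$votes)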
Speaking of random selection of observations and variables, let’s get acquainted with a model evaluation technique called K-fold Cross-Validation. This technique splits our data into k equal-sized groups (bins) and uses one bin as the test data while the rest become the training data. The process is repeated k times (the folds), so that every observation gets the chance to be used as both training and test data; this also helps guard against the overfitting problem of the decision tree.
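As a minimal illustration (using caret’s createFolds; not shown in the original analysis), the bins can be constructed explicitly:

library(caret)
set.seed(417)
folds <- createFolds(wine_train$quality, k = 5)
# each element holds the row indices of one bin; in fold i, bin i serves as
# the test set while the remaining bins form the training set
str(folds, max.level = 1)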
To really understand how random forest works, let’s apply the random forest algorithm to our wine data.
# model building
set.seed(417)
ctrl <- trainControl(method="repeatedcv", number=4, repeats=4) # k-fold cross validation
forest <- train(quality ~ ., data=wine_train, method="rf", trControl = ctrl)
forest
## Random Forest
##
## 6142 samples
## 11 predictor
## 2 classes: 'Excellent', 'Poor-Normal'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 4 times)
## Summary of sample sizes: 4606, 4607, 4606, 4607, 4606, 4606, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9498940 0.8997884
## 6 0.9461899 0.8923801
## 11 0.9427302 0.8854606
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
From the model summary, we know that the optimal number of variables considered for splitting at each tree node (mtry) is 2. We can also inspect the importance of each variable used in our random forest with varImp().
varImp(forest)
## rf variable importance
##
## Overall
## alcohol 100.000
## density 71.823
## chlorides 42.695
## total.sulfur.dioxide 26.445
## volatile.acidity 26.408
## residual.sugar 24.650
## free.sulfur.dioxide 24.589
## pH 20.683
## citric.acid 15.234
## sulphates 1.523
## fixed.acidity 0.000
When using random forest, we are not strictly required to split our dataset into train and test sets, because random forest already provides out-of-bag (OOB) estimates, which act as a reliable estimate of the accuracy on unseen examples. It is nevertheless also possible to hold out a regular train-test split, as we did. The OOB estimate in the summary below was generated from our wine_train dataset.
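The OOB error can also be read directly from the fitted model: randomForest stores a per-tree error matrix in err.rate (the same matrix used for the legend in the plot below), and its last row gives the final OOB and per-class error rates.

# final row = OOB error (and per-class errors) after all 500 trees
tail(forest$finalModel$err.rate, 1)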
plot(forest$finalModel)
legend("topright", colnames(forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 3.24%
## Confusion matrix:
## Excellent Poor-Normal class.error
## Excellent 3041 30 0.009768805
## Poor-Normal 169 2902 0.055030935
Let’s test our random forest model to our wine_test dataset:
# model fitting
forest_pred <- predict(forest, wine_test, type = "raw")
forest_prob <- predict(forest, wine_test, type = "prob")
# result
forest_table <- select(wine_test, quality) %>%
bind_cols(quality_pred = forest_pred) %>%
bind_cols(quality_eprob = round(forest_prob[,1],4)) %>%
bind_cols(quality_pprob = round(forest_prob[,2],4))
# performance evaluation - confusion matrix
forest_table %>%
conf_mat(quality, quality_pred) %>%
autoplot(type = "heatmap")forest_table %>%
summarise(
accuracy = accuracy_vec(quality, quality_pred),
sensitivity = sens_vec(quality, quality_pred),
specificity = spec_vec(quality, quality_pred),
precision = precision_vec(quality, quality_pred)
)

# ROC
forest_roc <- data.frame(prediction=round(forest_prob[,1],4),
trueclass=as.numeric(forest_table$quality=="Excellent"))
head(forest_roc)

forest_roc <- ROCR::prediction(forest_roc$prediction, forest_roc$trueclass)
# ROC curve
plot(performance(forest_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)

# AUC
auc_ROCR_f <- performance(forest_roc, measure = "auc")
auc_ROCR_f <- auc_ROCR_f@y.values[[1]]
auc_ROCR_f
## [1] 0.9344402
Based on the above results, the random forest model gives us high accuracy, recall, specificity, and precision (all metrics above 65%). Furthermore, the AUC of the model is very high, reaching 93%. I will still try to tune the threshold to reach higher scores on each metric.
for(i in 1:100){
result[i,] = metrics(cutoff = co[i],
prob = forest_table$quality_eprob,
ref = as.factor(ifelse(forest_table$quality == "Excellent", 1, 0)),
postarget = "1",
negtarget = "0")
}
ggplotly(tibble("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "Metrics", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff Model Performance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()))

# tuning final model
forest_table <- forest_table %>%
mutate(tuning_pred = as.factor(ifelse(quality_eprob >= 0.4257, "Excellent", "Poor-Normal")))
# metrics result
final_f <- forest_table %>%
summarise(
accuracy = accuracy_vec(quality, tuning_pred),
sensitivity = sens_vec(quality, tuning_pred),
specificity = spec_vec(quality, tuning_pred),
precision = precision_vec(quality, tuning_pred)
) %>%
cbind(AUC = auc_ROCR_f)
final_f

Comparing the metrics of the three models, the predictive model built with the Random Forest algorithm gives the best result: the highest accuracy (89%) while maintaining sensitivity, specificity, and precision above 70%, plus the highest AUC at 93%. Therefore, the best model to predict wine quality from chemical properties is the Random Forest model.
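For a side-by-side view, the three per-model metric tables computed above can be stacked into one comparison table (a convenience sketch; the row labels are added here for readability):

# combine the tuned results of all three models
comparison <- rbind(final_n, final_d, final_f)
rownames(comparison) <- c("Naive Bayes", "Decision Tree", "Random Forest")
comparison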
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.