Background

This dataset describes the characteristics of wines, such as chemical content and a quality score.

My purpose in using this data is to analyze wine quality based on its chemical content.

Data Description:

  • fixed acidity: most acids involved with wine

  • volatile acidity: amount of acetic acid in wine

  • citric acid: found in small quantities

  • residual sugar: amount of sugar remaining after wine fermentation/production

  • chlorides: amount of salt in the wine

  • free sulfur dioxide: the free form of SO2; it prevents microbial growth and the oxidation of wine

  • total sulfur dioxide: amount of free and bound forms of SO2

  • density: the density of the wine, which depends on the percent alcohol and sugar content

  • pH: describes how acidic or basic a wine is on a scale 0-14 (very acidic: 0, very basic: 14); most wines are between 3-4 on the pH scale

  • sulphates: an antimicrobial and antioxidant

  • alcohol: the percent alcohol content of the wine

I obtained the data from Kaggle at the following link:

https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009

Set Up

Load Libraries

library(dplyr) #data wrangling
library(tidyverse) #plotting
library(caret) #confusion matrix
library(rsample) #data sampling
library(e1071) #naive bayes
library(partykit) #decision tree
library(randomForest) #random forest
library(ROCR) #check ROC

options(scipen = 999) #turn off scientific notation in printed output

Import Data

wine <- read.csv("winequality-red.csv")
wine

Exploratory Data Analysis

Check Data Type

glimpse(wine)
## Rows: 1,599
## Columns: 12
## $ fixed.acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5~
## $ volatile.acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, ~
## $ citric.acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0~
## $ residual.sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,~
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, ~
## $ free.sulfur.dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16~
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,~
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0~
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3~
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0~
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.~
## $ quality              <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7~

All variables have data types appropriate to their content.

Classify the Target Variable

Set a quality standard: quality >= 7 is “Good” and quality < 7 is “Bad”. Store the resulting class label in a new column.

wine$class <- as.factor(ifelse(wine$quality >= 7 , "Good" , "Bad"))
wine

Check Missing Values

colSums(is.na(wine))
##        fixed.acidity     volatile.acidity          citric.acid 
##                    0                    0                    0 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    0                    0                    0 
## total.sulfur.dioxide              density                   pH 
##                    0                    0                    0 
##            sulphates              alcohol              quality 
##                    0                    0                    0 
##                class 
##                    0

The data has no missing values.

Modelling

Cross Validation

Remove the quality variable so that the model learns from the class label rather than the original quality score.

wine <- 
  wine %>% 
  select(-quality)

Create data train for training the models (80% of the original data) and data test for testing them (the remaining 20%).

RNGkind(sample.kind = "Rounding")
set.seed(1616)

index <- sample(nrow(wine), 
                nrow(wine)*0.8)

wine.train <- wine[index, ]
wine.test <- wine[-index, ]
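
A quick sanity check on the split sizes: with 1,599 rows in total, the 80/20 split gives 1,279 train rows and 320 test rows.

nrow(wine.train) #1279
nrow(wine.test)  #320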

Check the class proportion of data train

Check the proportion because model training tends to be more effective when the classes are balanced.

prop.table(table(wine.train$class))
## 
##       Bad      Good 
## 0.8639562 0.1360438

The training data is imbalanced, so we try to balance it with the downsampling method. Afterwards, we compare performance against a model trained without the imbalance treatment.

wine_train_down <- downSample(x = wine.train %>%  select(-class),
                              y = wine.train$class,
                              yname = "class")

prop.table(table(wine_train_down$class))
## 
##  Bad Good 
##  0.5  0.5
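
For comparison, caret also provides an upsampling counterpart, which duplicates minority-class rows instead of discarding majority-class rows. A minimal sketch, using the same wine.train split (the upsampled frame is not used further in this analysis):

#upSample duplicates "Good" rows until both classes are the same size
wine_train_up <- upSample(x = wine.train %>% select(-class),
                          y = wine.train$class,
                          yname = "class")

prop.table(table(wine_train_up$class)) #should show 0.5 / 0.5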

Naive Bayes

Using Tuning

Make Naive Bayes Model

model_naive_tun <- naiveBayes(x = wine_train_down %>% select(-class), 
                          y = wine_train_down$class, 
                          laplace = 1) 
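
Under the hood, naiveBayes stores the class priors and, for each numeric predictor, the per-class mean and standard deviation used in its Gaussian likelihood. A quick way to inspect the fitted model (a sketch; the exact numbers depend on the downsampled data):

model_naive_tun$apriori        #class counts used for the prior
model_naive_tun$tables$alcohol #per-class mean and sd of alcohol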

Make Predictions and Evaluate the Model

pred_naive_tun <- predict(object= model_naive_tun,
                           newdata = wine.test,
                           type="class")

confusionMatrix(data= pred_naive_tun,
                reference= wine.test$class,
                positive="Good")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bad Good
##       Bad  196    5
##       Good  81   38
##                                                
##                Accuracy : 0.7312               
##                  95% CI : (0.6791, 0.779)      
##     No Information Rate : 0.8656               
##     P-Value [Acc > NIR] : 1                    
##                                                
##                   Kappa : 0.3386               
##                                                
##  Mcnemar's Test P-Value : 0.0000000000000006092
##                                                
##             Sensitivity : 0.8837               
##             Specificity : 0.7076               
##          Pos Pred Value : 0.3193               
##          Neg Pred Value : 0.9751               
##              Prevalence : 0.1344               
##          Detection Rate : 0.1187               
##    Detection Prevalence : 0.3719               
##       Balanced Accuracy : 0.7957               
##                                                
##        'Positive' Class : Good                 
## 

Check Model Performance

  1. Receiver Operating Characteristic (ROC) Curve

The ROC curve plots the True Positive Rate (Sensitivity, or Recall) against the False Positive Rate (1 - Specificity). A good model ideally has a high TP rate and a low FP rate.

wine_predProb_tun <- predict(model_naive_tun, newdata = wine.test,type = "raw")
head(wine_predProb_tun)
##            Bad       Good
## [1,] 0.9448873 0.05511266
## [2,] 0.4763166 0.52368344
## [3,] 0.2364779 0.76352211
## [4,] 0.9327039 0.06729614
## [5,] 0.8936715 0.10632846
## [6,] 0.7281391 0.27186094

Check the ROC curve with a plot

#create prediction object
wine_roc_tun <- prediction(predictions = wine_predProb_tun[, 2],
                       labels = as.numeric(wine.test$class == "Good"))

#create performance with prediction object
perf_tun <- performance(prediction.obj = wine_roc_tun,
                    measure = "tpr", # tpr = true positive rate
                    x.measure = "fpr") #fpr = false positive rate
                    
#create plot
plot(perf_tun)
abline(0,1, lty = 2)

Based on the plot, the curve arcs toward the top-left corner (high True Positive rate and low False Positive rate), which indicates a good model.

  2. Area Under the ROC Curve (AUC)

AUC measures the area under the ROC curve; the closer the value is to 1, the better the model.

auc_tun <- performance(prediction.obj = wine_roc_tun, 
                   measure = "auc")
auc_tun@y.values
## [[1]]
## [1] 0.864411

The AUC value is 0.864411, which is close to 1, so the model is good.

Without Tuning

Make Naive Bayes Model

model_naive <- naiveBayes(x = wine.train %>% select(-class), 
                          y = wine.train$class, 
                          laplace = 1)  

Make Predictions and Evaluate the Model

pred_naive <- predict(object= model_naive,
                           newdata = wine.test,
                           type="class")

confusionMatrix(data= pred_naive,
                reference= wine.test$class,
                positive="Good")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bad Good
##       Bad  246   18
##       Good  31   25
##                                           
##                Accuracy : 0.8469          
##                  95% CI : (0.8027, 0.8845)
##     No Information Rate : 0.8656          
##     P-Value [Acc > NIR] : 0.85622         
##                                           
##                   Kappa : 0.4163          
##                                           
##  Mcnemar's Test P-Value : 0.08648         
##                                           
##             Sensitivity : 0.58140         
##             Specificity : 0.88809         
##          Pos Pred Value : 0.44643         
##          Neg Pred Value : 0.93182         
##              Prevalence : 0.13437         
##          Detection Rate : 0.07812         
##    Detection Prevalence : 0.17500         
##       Balanced Accuracy : 0.73474         
##                                           
##        'Positive' Class : Good            
## 

Check Model Performance

  1. Receiver Operating Characteristic (ROC) Curve

The ROC curve plots the True Positive Rate (Sensitivity, or Recall) against the False Positive Rate (1 - Specificity). A good model ideally has a high TP rate and a low FP rate.

wine_predProb <- predict(model_naive, newdata = wine.test,type = "raw")
head(wine_predProb)
##            Bad        Good
## [1,] 0.9949913 0.005008735
## [2,] 0.8762924 0.123707645
## [3,] 0.7924593 0.207540685
## [4,] 0.9942464 0.005753589
## [5,] 0.9877532 0.012246844
## [6,] 0.9649074 0.035092576

Check the ROC curve with a plot

#create prediction object
wine_roc <- prediction(predictions = wine_predProb[, 2],
                       labels = as.numeric(wine.test$class == "Good"))

#create performance with prediction object
perf <- performance(prediction.obj = wine_roc,
                    measure = "tpr", # tpr = true positive rate
                    x.measure = "fpr") #fpr = false positive rate
                    
#create plot
plot(perf)
abline(0,1, lty = 2)

Based on the plot, the curve arcs toward the top-left corner (high True Positive rate and low False Positive rate), which indicates a good model.

  2. Area Under the ROC Curve (AUC)

AUC measures the area under the ROC curve; the closer the value is to 1, the better the model.

auc<- performance(prediction.obj = wine_roc, 
                   measure = "auc")
auc@y.values
## [[1]]
## [1] 0.8764168

The AUC value is 0.8764168, which is close to 1, so the model is good.

Interpretation of the Naive Bayes Model

  • After comparing the models with and without the imbalance treatment, the tuned model is better on Sensitivity (Recall), which matters most here, even though its Accuracy is lower. So we use the tuned model.

  • Accuracy : 0.7312 –> the model predicts the target (Good/Bad) correctly 73.1% of the time.

  • Sensitivity (Recall) : 0.8837 –> of all the actual positive data, the model identifies 88.3% correctly.

  • Specificity : 0.7076 –> of all the actual negative data, the model identifies 70.7% correctly.

  • Pos Pred Value (Precision) : 0.3193 –> of all the positive predictions, 31.9% are correct.

Based on the confusion matrix of the Naive Bayes model, Accuracy is 0.7312 (73.1%) and Recall is 0.8837 (88.3%). This means the model predicts wine quality (Good or Bad) correctly 73.1% of the time and identifies 88.3% of the Good wines.
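
These metrics can be verified by hand from the confusion matrix counts above (TP = 38, FN = 5, FP = 81, TN = 196):

TP <- 38; FN <- 5; FP <- 81; TN <- 196

(TP + TN) / (TP + TN + FP + FN) #Accuracy    = 0.7312
TP / (TP + FN)                  #Recall      = 0.8837
TN / (TN + FP)                  #Specificity = 0.7076
TP / (TP + FP)                  #Precision   = 0.3193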

Decision Tree

Without Tuning

Make Decision Tree Model

set.seed(1616)
#note: this model is fitted on the full dataset (including the test rows), so the
#evaluation on data test below is optimistic; fitting on wine.train would avoid this leakage
model_dt <- ctree(class ~ ., wine)

Make Predictions and Evaluate the Model

Predict and evaluate the model using data test

pred_dt <- predict(model_dt, newdata = wine.test, type = "response")

confusionMatrix(pred_dt, wine.test$class, positive = "Good")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bad Good
##       Bad  267   23
##       Good  10   20
##                                           
##                Accuracy : 0.8969          
##                  95% CI : (0.8582, 0.9279)
##     No Information Rate : 0.8656          
##     P-Value [Acc > NIR] : 0.05594         
##                                           
##                   Kappa : 0.4918          
##                                           
##  Mcnemar's Test P-Value : 0.03671         
##                                           
##             Sensitivity : 0.46512         
##             Specificity : 0.96390         
##          Pos Pred Value : 0.66667         
##          Neg Pred Value : 0.92069         
##              Prevalence : 0.13437         
##          Detection Rate : 0.06250         
##    Detection Prevalence : 0.09375         
##       Balanced Accuracy : 0.71451         
##                                           
##        'Positive' Class : Good            
## 

Predict and evaluate the model using data train

pred_dt_train <- predict(model_dt, newdata = wine.train, type = "response")
confusionMatrix(pred_dt_train, wine.train$class, positive = "Good")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Bad Good
##       Bad  1058   87
##       Good   47   87
##                                           
##                Accuracy : 0.8952          
##                  95% CI : (0.8771, 0.9115)
##     No Information Rate : 0.864           
##     P-Value [Acc > NIR] : 0.0004409       
##                                           
##                   Kappa : 0.5065          
##                                           
##  Mcnemar's Test P-Value : 0.0007542       
##                                           
##             Sensitivity : 0.50000         
##             Specificity : 0.95747         
##          Pos Pred Value : 0.64925         
##          Neg Pred Value : 0.92402         
##              Prevalence : 0.13604         
##          Detection Rate : 0.06802         
##    Detection Prevalence : 0.10477         
##       Balanced Accuracy : 0.72873         
##                                           
##        'Positive' Class : Good            
## 

Summary of Prediction and Evaluation

model_dt_recap <- c("wine.test", "wine.train")
Accuracy <- c(0.8969,0.8952)
Recall <- c(0.4651,0.5000)

tabelmodelrecap <- data.frame(model_dt_recap,Accuracy,Recall)

print(tabelmodelrecap)
##   model_dt_recap Accuracy Recall
## 1      wine.test   0.8969 0.4651
## 2     wine.train   0.8952 0.5000

Because the model's performance on data test and data train is unbalanced (a sign of overfitting) and its Recall is low, we try pruning the model and compare the performance.

Using Tuning

Tuning Model

Create a new model with a pruning treatment.

set.seed(1616)

model_dt_tuned <- ctree(class ~ ., wine_train_down,
                        control = ctree_control(mincriterion = 0.5, #1 - p-value that must be exceeded to split
                                                minsplit = 35, #minimum observations in a node to attempt a split
                                                minbucket = 20)) #minimum observations in a terminal node

Make Predictions and Evaluate the Model

Predict and evaluate the model using data test

pred_dt_test_tunes <- predict(model_dt_tuned, newdata = wine.test, type = "response")
confusionMatrix(pred_dt_test_tunes, wine.test$class, positive = "Good")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bad Good
##       Bad  219   10
##       Good  58   33
##                                          
##                Accuracy : 0.7875         
##                  95% CI : (0.7385, 0.831)
##     No Information Rate : 0.8656         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3792         
##                                          
##  Mcnemar's Test P-Value : 0.00000001201  
##                                          
##             Sensitivity : 0.7674         
##             Specificity : 0.7906         
##          Pos Pred Value : 0.3626         
##          Neg Pred Value : 0.9563         
##              Prevalence : 0.1344         
##          Detection Rate : 0.1031         
##    Detection Prevalence : 0.2844         
##       Balanced Accuracy : 0.7790         
##                                          
##        'Positive' Class : Good           
## 

Predict and evaluate the model using data train

pred_dt_train_tunes <- predict(model_dt_tuned, newdata = wine_train_down, type = "response")
confusionMatrix(pred_dt_train_tunes, wine_train_down$class, positive = "Good")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bad Good
##       Bad  142   36
##       Good  32  138
##                                              
##                Accuracy : 0.8046             
##                  95% CI : (0.759, 0.8449)    
##     No Information Rate : 0.5                
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.6092             
##                                              
##  Mcnemar's Test P-Value : 0.716              
##                                              
##             Sensitivity : 0.7931             
##             Specificity : 0.8161             
##          Pos Pred Value : 0.8118             
##          Neg Pred Value : 0.7978             
##              Prevalence : 0.5000             
##          Detection Rate : 0.3966             
##    Detection Prevalence : 0.4885             
##       Balanced Accuracy : 0.8046             
##                                              
##        'Positive' Class : Good               
## 

Summary of Prediction and Evaluation

model_dt_recap_tuning <- c("wine.test.tuning", "wine.train.tuning")
Accuracy_tuning <- c(0.7875,0.8046)
Recall_tuning <- c(0.7674,0.7931)

tabelmodelrecap2 <- data.frame(model_dt_recap_tuning,Accuracy_tuning,Recall_tuning)

print(tabelmodelrecap2)
##   model_dt_recap_tuning Accuracy_tuning Recall_tuning
## 1      wine.test.tuning          0.7875        0.7674
## 2     wine.train.tuning          0.8046        0.7931

Compare the models with and without tuning

model_dt_recap_all <- c("wine.test", "wine.train", "wine.test.tuning", "wine.train.tuning")
Accuracy <- c(0.8969,0.8952,0.7875,0.8046)
Recall <- c(0.4651,0.5000,0.7674,0.7931)

tabelmodelrecapall <- data.frame(model_dt_recap_all,Accuracy,Recall)

print(tabelmodelrecapall)
##   model_dt_recap_all Accuracy Recall
## 1          wine.test   0.8969 0.4651
## 2         wine.train   0.8952 0.5000
## 3   wine.test.tuning   0.7875 0.7674
## 4  wine.train.tuning   0.8046 0.7931

Comparing the models with and without tuning, the tuned model has much higher Recall and more consistent performance between data train and data test, even though its Accuracy is lower, so we use the tuned model.

Plot the Decision Tree

model_dt_tuned
## 
## Model formula:
## class ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol
## 
## Fitted party:
## [1] root
## |   [2] alcohol <= 10.4
## |   |   [3] fixed.acidity <= 10.1
## |   |   |   [4] alcohol <= 9.9
## |   |   |   |   [5] volatile.acidity <= 0.52: Bad (n = 27, err = 11.1%)
## |   |   |   |   [6] volatile.acidity > 0.52: Bad (n = 48, err = 0.0%)
## |   |   |   [7] alcohol > 9.9: Bad (n = 31, err = 29.0%)
## |   |   [8] fixed.acidity > 10.1: Bad (n = 21, err = 47.6%)
## |   [9] alcohol > 10.4
## |   |   [10] volatile.acidity <= 0.66
## |   |   |   [11] sulphates <= 0.61
## |   |   |   |   [12] pH <= 3.27: Good (n = 20, err = 25.0%)
## |   |   |   |   [13] pH > 3.27: Bad (n = 23, err = 34.8%)
## |   |   |   [14] sulphates > 0.61: Good (n = 150, err = 18.0%)
## |   |   [15] volatile.acidity > 0.66: Bad (n = 28, err = 21.4%)
## 
## Number of inner nodes:    7
## Number of terminal nodes: 8
plot(model_dt_tuned,type="simple")

Interpretation of the Decision Tree Model

  • Node 1 is the root node (the highest node in the tree structure; it has no parent)

  • Nodes 2, 3, 4, 9, 10, and 11 are inner nodes (nodes that have child nodes)

  • Nodes 5, 6, 7, 8, 12, 13, 14, and 15 are terminal nodes (nodes that have no child nodes)
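
To trace observations to these nodes, partykit's predict() also accepts type = "node", which returns the ID of the terminal node each row falls into; a small sketch on the first test rows:

predict(model_dt_tuned, newdata = head(wine.test), type = "node") #terminal node IDs per observation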

Random Forest

Using Tuning

K-Fold Cross Validation

Split the data into \(k\) folds, where each fold is used once as the testing data.

Build the random forest model using 5-fold cross-validation repeated 3 times, then save it as an RDS file.

set.seed(1616)

ctrl <- trainControl(method = "repeatedcv",
                      number = 5,
                      repeats = 3) 
 
model_forest_tun <- train(class ~ .,
                  data = wine_train_down,
                  method = "rf", 
                  trControl = ctrl)
## Warning in (function (kind = NULL, normal.kind = NULL, sample.kind = NULL) :
## non-uniform 'Rounding' sampler used
saveRDS(model_forest_tun, "model_forest_update_tun.RDS")

Make Random Forest Model

Read the RDS file and store it as model_rf_tun.

model_rf_tun <- readRDS("model_forest_update_tun.RDS")
model_rf_tun
## Random Forest 
## 
## 348 samples
##  11 predictor
##   2 classes: 'Bad', 'Good' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 278, 279, 279, 278, 278, 278, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8151228  0.6304040
##    6    0.8075046  0.6152018
##   11    0.8035696  0.6073220
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Use the model with mtry = 2 (the number of predictors sampled at each split), because its Accuracy is higher than for the other mtry values.

Model Evaluation

Check the model's OOB (Out-Of-Bag) error estimate with finalModel.

model_rf_tun$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 16.95%
## Confusion matrix:
##      Bad Good class.error
## Bad  132   42  0.24137931
## Good  17  157  0.09770115

The OOB (Out-Of-Bag) error estimate is 16.95%, which means the model has an OOB accuracy of 83.05%.
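
This can be recomputed from the OOB confusion matrix above: the misclassified counts (42 + 17) divided by the 348 downsampled training rows.

(42 + 17) / 348     #OOB error    ~ 0.1695
1 - (42 + 17) / 348 #OOB accuracy ~ 0.8305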

Check Variable Importance

varImp(model_rf_tun)
## rf variable importance
## 
##                      Overall
## alcohol              100.000
## sulphates             80.404
## volatile.acidity      65.344
## citric.acid           44.923
## total.sulfur.dioxide  27.964
## density               20.919
## chlorides             20.045
## fixed.acidity         19.735
## residual.sugar         4.417
## pH                     4.370
## free.sulfur.dioxide    0.000
plot(varImp(model_rf_tun))

Based on the plot above, the 3 most important variables are alcohol, sulphates, and volatile.acidity.

Make Predictions and Evaluate the Model

Make predictions and evaluate the model on data test with “Good” as the positive class.

pred_rf_tun <- predict(model_rf_tun, wine.test, type = "raw")
confusionMatrix(pred_rf_tun, wine.test$class, positive = "Good")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bad Good
##       Bad  210    4
##       Good  67   39
##                                             
##                Accuracy : 0.7781            
##                  95% CI : (0.7286, 0.8225)  
##     No Information Rate : 0.8656            
##     P-Value [Acc > NIR] : 1                 
##                                             
##                   Kappa : 0.4108            
##                                             
##  Mcnemar's Test P-Value : 0.0000000000001866
##                                             
##             Sensitivity : 0.9070            
##             Specificity : 0.7581            
##          Pos Pred Value : 0.3679            
##          Neg Pred Value : 0.9813            
##              Prevalence : 0.1344            
##          Detection Rate : 0.1219            
##    Detection Prevalence : 0.3312            
##       Balanced Accuracy : 0.8325            
##                                             
##        'Positive' Class : Good              
## 

Now we compare this performance with a model built without the imbalance treatment.

Without Tuning

K-Fold Cross Validation

set.seed(1616)

ctrl <- trainControl(method = "repeatedcv",
                      number = 5,
                      repeats = 3) 
 
model_forest <- train(class ~ .,
                  data = wine.train,
                  method = "rf", 
                  trControl = ctrl)
## Warning in (function (kind = NULL, normal.kind = NULL, sample.kind = NULL) :
## non-uniform 'Rounding' sampler used
saveRDS(model_forest, "model_forest_update.RDS")

Make Random Forest Model

Read the RDS file and store it as model_rf.

model_rf <- readRDS("model_forest_update.RDS")
model_rf
## Random Forest 
## 
## 1279 samples
##   11 predictor
##    2 classes: 'Bad', 'Good' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 1023, 1023, 1023, 1023, 1024, 1023, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9030545  0.4926345
##    6    0.9077431  0.5469259
##   11    0.9007098  0.5274840
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.

Use the model with mtry = 6 (the number of predictors sampled at each split), because its Accuracy is higher than for the other mtry values.

Model Evaluation

Check the model's OOB (Out-Of-Bag) error estimate with finalModel.

model_rf$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 9.15%
## Confusion matrix:
##       Bad Good class.error
## Bad  1069   36  0.03257919
## Good   81   93  0.46551724

The OOB (Out-Of-Bag) error estimate is 9.15%, which means the model has an OOB accuracy of 90.85%.

Check Variable Importance

varImp(model_rf)
## rf variable importance
## 
##                      Overall
## alcohol              100.000
## sulphates             48.895
## volatile.acidity      35.432
## total.sulfur.dioxide  11.788
## fixed.acidity          9.770
## density                7.978
## citric.acid            7.711
## residual.sugar         6.909
## chlorides              3.857
## free.sulfur.dioxide    3.753
## pH                     0.000
plot(varImp(model_rf))

Based on the plot above, the 3 most important variables are alcohol, sulphates, and volatile.acidity.

Make Predictions and Evaluate the Model

Make predictions and evaluate the model on data test with “Good” as the positive class.

pred_rf <- predict(model_rf, wine.test, type = "raw")
confusionMatrix(pred_rf, wine.test$class, positive = "Good")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bad Good
##       Bad  269   18
##       Good   8   25
##                                           
##                Accuracy : 0.9188          
##                  95% CI : (0.8832, 0.9462)
##     No Information Rate : 0.8656          
##     P-Value [Acc > NIR] : 0.002116        
##                                           
##                   Kappa : 0.6127          
##                                           
##  Mcnemar's Test P-Value : 0.077556        
##                                           
##             Sensitivity : 0.58140         
##             Specificity : 0.97112         
##          Pos Pred Value : 0.75758         
##          Neg Pred Value : 0.93728         
##              Prevalence : 0.13437         
##          Detection Rate : 0.07812         
##    Detection Prevalence : 0.10312         
##       Balanced Accuracy : 0.77626         
##                                           
##        'Positive' Class : Good            
## 

Compare Performance

model_rf_recap_all <- c("with tuning", "without tuning")
Accuracy <- c(0.7781,0.9188)
Recall <- c(0.9070,0.58140)

tabelmodelrecaprf <- data.frame(model_rf_recap_all,Accuracy,Recall)

print(tabelmodelrecaprf)
##   model_rf_recap_all Accuracy Recall
## 1        with tuning   0.7781 0.9070
## 2     without tuning   0.9188 0.5814

Interpretation of the Random Forest Model

  • After comparing the models with and without the imbalance treatment, the tuned model (trained on the downsampled data) is better on Sensitivity (Recall). So we use the tuned model.

  • Accuracy : 0.7781 –> the model predicts the target (Good/Bad) correctly 77.8% of the time.

  • Sensitivity (Recall) : 0.9070 –> of all the actual positive data, the model identifies 90.7% correctly.

  • Specificity : 0.7581 –> of all the actual negative data, the model identifies 75.8% correctly.

  • Pos Pred Value (Precision) : 0.3679 –> of all the positive predictions, 36.8% are correct.

Based on the confusion matrix of the Random Forest model, Accuracy is 0.7781 (77.8%) and Recall is 0.9070 (90.7%). This means the model predicts wine quality (Good or Bad) correctly 77.8% of the time and identifies 90.7% of the Good wines.

Conclusion

Model_Name <- c("Naive Bayes", "Decision Tree", "Random Forest")
Accuracy <- c(0.7312,0.8046,0.7781)
Recall <- c(0.8837,0.7931,0.9070)
Specificity <- c(0.7076,0.8161,0.7581)
Precision <- c(0.3193,0.8118,0.3679)

modelrecapall <- data.frame(Model_Name,Accuracy,Recall,Specificity,Precision)

print(modelrecapall)
##       Model_Name Accuracy Recall Specificity Precision
## 1    Naive Bayes   0.7312 0.8837      0.7076    0.3193
## 2  Decision Tree   0.8046 0.7931      0.8161    0.8118
## 3  Random Forest   0.7781 0.9070      0.7581    0.3679

After building the 3 models, we compare their Accuracy, Recall, Specificity, and Precision. In this case we choose the Decision Tree model: it predicts wine quality (Good or Bad) with 80.4% accuracy and identifies 79.3% of the Good wines, and its high Precision (81.2%) means that Bad wines are rarely mixed in with the Good ones.