This dataset describes red wine characteristics: chemical measurements together with a quality rating.
The purpose of this analysis is to classify the quality of wine based on its chemical content.
Data Description:
fixed acidity: the fixed (nonvolatile) acids in the wine
volatile acidity: amount of acetic acid in wine
citric acid: found in small quantities
residual sugar: amount of sugar remaining after wine fermentation/production
chlorides: amount of salt in the wine
free sulfur dioxide: the free form of SO2; prevents microbial growth and the oxidation of wine
total sulfur dioxide: amount of free and bound forms of SO2
density: the density of the wine, which depends on the alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale 0-14 (very acidic: 0, very basic: 14); most wines are between 3-4 on the pH scale
sulphates: an antimicrobial and antioxidant
alcohol: the percent alcohol content of the wine
The data was obtained from Kaggle via the following link:
Load Libraries
library(dplyr) #data wrangling
library(tidyverse) #plotting
library(caret) #confusion matrix, downsampling, cross validation
library(rsample) #data sampling
library(e1071) #naive bayes
library(partykit) #decision tree
library(randomForest) #random forest
library(ROCR) #ROC curves
options(scipen = 999) #avoid scientific notation in output
Import Data
wine <- read.csv("winequality-red.csv")
wine
Check Data Type
glimpse(wine)
## Rows: 1,599
## Columns: 12
## $ fixed.acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5~
## $ volatile.acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, ~
## $ citric.acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0~
## $ residual.sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,~
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, ~
## $ free.sulfur.dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16~
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,~
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0~
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3~
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0~
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.~
## $ quality <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7~
All variables have appropriate data types.
Classification target variable
Define the target label: wines with quality >= 7 are “Good” and wines with quality < 7 are “Bad”. Store the label in a new column.
wine$class <- as.factor(ifelse(wine$quality >= 7 , "Good" , "Bad"))
wine
Check missing values
colSums(is.na(wine))
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
## class
## 0
The data has no missing values.
Cross Validation
Drop the quality variable, because class (which was derived from quality) is the label the model should learn to predict.
wine <-
wine %>%
select(-quality)
Split the data into a training set (80% of the data) for fitting the models and a test set (20%) for evaluating them.
RNGkind(sample.kind = "Rounding")
set.seed(1616)
index <- sample(nrow(wine),
nrow(wine)*0.8)
wine.train <- wine[index, ]
wine.test <- wine[-index, ]
Check the class proportions of the training data
Check the proportions because model training tends to be more optimal when the classes are balanced.
prop.table(table(wine.train$class))
##
## Bad Good
## 0.8639562 0.1360438
The training data is imbalanced, so we balance it with the downsampling method. Afterwards we compare performance against a model trained without the imbalance treatment.
wine_train_down <- downSample(x = wine.train %>% select(-class),
y = wine.train$class,
yname = "class")
prop.table(table(wine_train_down$class))
##
## Bad Good
## 0.5 0.5
Make Naive Bayes Model (with downsampling)
model_naive_tun <- naiveBayes(x = wine_train_down %>% select(-class),
                              y = wine_train_down$class,
                              laplace = 1) # Laplace smoothing avoids zero probabilities
Make Predictions and Evaluate the Model
pred_naive_tun <- predict(object= model_naive_tun,
newdata = wine.test,
type="class")
confusionMatrix(data= pred_naive_tun,
reference= wine.test$class,
positive="Good")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 196 5
## Good 81 38
##
## Accuracy : 0.7312
## 95% CI : (0.6791, 0.779)
## No Information Rate : 0.8656
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3386
##
## Mcnemar's Test P-Value : 0.0000000000000006092
##
## Sensitivity : 0.8837
## Specificity : 0.7076
## Pos Pred Value : 0.3193
## Neg Pred Value : 0.9751
## Prevalence : 0.1344
## Detection Rate : 0.1187
## Detection Prevalence : 0.3719
## Balanced Accuracy : 0.7957
##
## 'Positive' Class : Good
##
Check Model Performance
The ROC curve plots the True Positive Rate (Sensitivity, or Recall) against the False Positive Rate (1 - Specificity). A good model ideally achieves a high TPR at a low FPR.
wine_predProb_tun <- predict(model_naive_tun, newdata = wine.test,type = "raw")
head(wine_predProb_tun)
## Bad Good
## [1,] 0.9448873 0.05511266
## [2,] 0.4763166 0.52368344
## [3,] 0.2364779 0.76352211
## [4,] 0.9327039 0.06729614
## [5,] 0.8936715 0.10632846
## [6,] 0.7281391 0.27186094
Check ROC with plot
#create prediction object
wine_roc_tun <- prediction(predictions = wine_predProb_tun[, 2],
labels = as.numeric(wine.test$class == "Good"))
#create performance with prediction object
perf_tun <- performance(prediction.obj = wine_roc_tun,
measure = "tpr", # tpr = true positive rate
x.measure = "fpr") #fpr = false positive rate
#create plot
plot(perf_tun)
abline(0,1, lty = 2)
The curve arcs toward the top-left corner (high true positive rate at a low false positive rate), which indicates a good model.
AUC measures the area under the ROC curve; the closer the value is to 1, the better the model.
auc_tun <- performance(prediction.obj = wine_roc_tun,
measure = "auc")
auc_tun@y.values
## [[1]]
## [1] 0.864411
The AUC is 0.864411, close to 1, which indicates a good model.
Make Naive Bayes Model (without downsampling)
model_naive <- naiveBayes(x = wine.train %>% select(-class),
y = wine.train$class,
laplace = 1)
Make Predictions and Evaluate the Model
pred_naive <- predict(object= model_naive,
newdata = wine.test,
type="class")
confusionMatrix(data= pred_naive,
reference= wine.test$class,
positive="Good")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 246 18
## Good 31 25
##
## Accuracy : 0.8469
## 95% CI : (0.8027, 0.8845)
## No Information Rate : 0.8656
## P-Value [Acc > NIR] : 0.85622
##
## Kappa : 0.4163
##
## Mcnemar's Test P-Value : 0.08648
##
## Sensitivity : 0.58140
## Specificity : 0.88809
## Pos Pred Value : 0.44643
## Neg Pred Value : 0.93182
## Prevalence : 0.13437
## Detection Rate : 0.07812
## Detection Prevalence : 0.17500
## Balanced Accuracy : 0.73474
##
## 'Positive' Class : Good
##
Check Model Performance
The ROC curve plots the True Positive Rate (Sensitivity, or Recall) against the False Positive Rate (1 - Specificity). A good model ideally achieves a high TPR at a low FPR.
wine_predProb <- predict(model_naive, newdata = wine.test,type = "raw")
head(wine_predProb)
## Bad Good
## [1,] 0.9949913 0.005008735
## [2,] 0.8762924 0.123707645
## [3,] 0.7924593 0.207540685
## [4,] 0.9942464 0.005753589
## [5,] 0.9877532 0.012246844
## [6,] 0.9649074 0.035092576
Check ROC with plot
#create prediction object
wine_roc <- prediction(predictions = wine_predProb[, 2],
labels = as.numeric(wine.test$class == "Good"))
#create performance with prediction object
perf <- performance(prediction.obj = wine_roc,
measure = "tpr", # tpr = true positive rate
x.measure = "fpr") #fpr = false positive rate
#create plot
plot(perf)
abline(0,1, lty = 2)
The curve arcs toward the top-left corner (high true positive rate at a low false positive rate), which indicates a good model.
AUC measures the area under the ROC curve; the closer the value is to 1, the better the model.
auc<- performance(prediction.obj = wine_roc,
measure = "auc")
auc@y.values
## [[1]]
## [1] 0.8764168
The AUC is 0.8764168, close to 1, which indicates a good model.
Interpretation of the Naive Bayes Model
Comparing the models trained with and without the imbalance treatment, the downsampled model is better in terms of Sensitivity (Recall), which matters most for catching Good wines, so we keep the downsampled model.
Accuracy: 0.7312 -> the model correctly classifies the target (Good/Bad) 73.1% of the time.
Sensitivity (Recall): 0.8837 -> of all actual positive cases, the model identifies 88.4% correctly.
Specificity: 0.7076 -> of all actual negative cases, the model identifies 70.8% correctly.
Pos Pred Value (Precision): 0.3193 -> of all positive predictions, 31.9% are actually positive.
Based on the confusion matrix of the Naive Bayes model, the accuracy is 0.7312 (73.1%) and the recall is 0.8837 (88.4%): the model classifies wines as Good or Bad correctly 73.1% of the time, and it catches 88.4% of the truly Good wines.
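As a sanity check, these metrics can be recomputed by hand from the confusion matrix above (positive class “Good”: TP = 38, FN = 5, FP = 81, TN = 196):
TP <- 38; FN <- 5; FP <- 81; TN <- 196
(TP + TN) / (TP + TN + FP + FN) # accuracy: 0.7312
TP / (TP + FN)                  # sensitivity / recall: 0.8837
TN / (TN + FP)                  # specificity: 0.7076
TP / (TP + FP)                  # precision / pos pred value: 0.3193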
Make Decision Tree Model
set.seed(1616)
model_dt <- ctree(class ~ ., wine) # note: fitted on the full dataset, so test rows are seen during training (possible leakage)
Make Predictions and Evaluate the Model
Predict and evaluate using the test data
pred_dt <- predict(model_dt, newdata = wine.test, type = "response")
confusionMatrix(pred_dt, wine.test$class, positive = "Good")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 267 23
## Good 10 20
##
## Accuracy : 0.8969
## 95% CI : (0.8582, 0.9279)
## No Information Rate : 0.8656
## P-Value [Acc > NIR] : 0.05594
##
## Kappa : 0.4918
##
## Mcnemar's Test P-Value : 0.03671
##
## Sensitivity : 0.46512
## Specificity : 0.96390
## Pos Pred Value : 0.66667
## Neg Pred Value : 0.92069
## Prevalence : 0.13437
## Detection Rate : 0.06250
## Detection Prevalence : 0.09375
## Balanced Accuracy : 0.71451
##
## 'Positive' Class : Good
##
Predict and evaluate using the training data
pred_dt_train <- predict(model_dt, newdata = wine.train, type = "response")
confusionMatrix(pred_dt_train, wine.train$class, positive = "Good")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 1058 87
## Good 47 87
##
## Accuracy : 0.8952
## 95% CI : (0.8771, 0.9115)
## No Information Rate : 0.864
## P-Value [Acc > NIR] : 0.0004409
##
## Kappa : 0.5065
##
## Mcnemar's Test P-Value : 0.0007542
##
## Sensitivity : 0.50000
## Specificity : 0.95747
## Pos Pred Value : 0.64925
## Neg Pred Value : 0.92402
## Prevalence : 0.13604
## Detection Rate : 0.06802
## Detection Prevalence : 0.10477
## Balanced Accuracy : 0.72873
##
## 'Positive' Class : Good
##
Summary Prediction and Evaluation
model_dt_recap <- c("wine.test", "wine.train")
Accuracy <- c(0.8969,0.8952)
Recall <- c(0.4651,0.5000)
tabelmodelrecap <- data.frame(model_dt_recap,Accuracy,Recall)
print(tabelmodelrecap)
## model_dt_recap Accuracy Recall
## 1 wine.test 0.8969 0.4651
## 2 wine.train 0.8952 0.5000
Because the model's recall is low on both the test and training data and it may be overfitting (it was also fit on the full dataset), we prune the tree, refit it on the balanced training data, and compare the results.
Tuning Model
Create a new model with pruning applied
set.seed(1616)
model_dt_tuned <- ctree(class ~ ., wine_train_down,
                        control = ctree_control(mincriterion = 0.5, # 1 - p-value required to split
                                                minsplit = 35,      # min obs in a node to attempt a split
                                                minbucket = 20))    # min obs in a terminal node
Make Predictions and Evaluate the Model
Predict and evaluate using the test data
pred_dt_test_tunes <- predict(model_dt_tuned, newdata = wine.test, type = "response")
confusionMatrix(pred_dt_test_tunes, wine.test$class, positive = "Good")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 219 10
## Good 58 33
##
## Accuracy : 0.7875
## 95% CI : (0.7385, 0.831)
## No Information Rate : 0.8656
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3792
##
## Mcnemar's Test P-Value : 0.00000001201
##
## Sensitivity : 0.7674
## Specificity : 0.7906
## Pos Pred Value : 0.3626
## Neg Pred Value : 0.9563
## Prevalence : 0.1344
## Detection Rate : 0.1031
## Detection Prevalence : 0.2844
## Balanced Accuracy : 0.7790
##
## 'Positive' Class : Good
##
Predict and evaluate using the (downsampled) training data
pred_dt_train_tunes <- predict(model_dt_tuned, newdata = wine_train_down, type = "response")
confusionMatrix(pred_dt_train_tunes, wine_train_down$class, positive = "Good")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 142 36
## Good 32 138
##
## Accuracy : 0.8046
## 95% CI : (0.759, 0.8449)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.6092
##
## Mcnemar's Test P-Value : 0.716
##
## Sensitivity : 0.7931
## Specificity : 0.8161
## Pos Pred Value : 0.8118
## Neg Pred Value : 0.7978
## Prevalence : 0.5000
## Detection Rate : 0.3966
## Detection Prevalence : 0.4885
## Balanced Accuracy : 0.8046
##
## 'Positive' Class : Good
##
Summary Prediction and Evaluation
model_dt_recap_tuning <- c("wine.test.tuning", "wine.train.tuning")
Accuracy_tuning <- c(0.7875,0.8046)
Recall_tuning <- c(0.7674,0.7931)
tabelmodelrecap2 <- data.frame(model_dt_recap_tuning,Accuracy_tuning,Recall_tuning)
print(tabelmodelrecap2)
## model_dt_recap_tuning Accuracy_tuning Recall_tuning
## 1 wine.test.tuning 0.7875 0.7674
## 2 wine.train.tuning 0.8046 0.7931
model_dt_recap_all <- c("wine.test", "wine.train", "wine.test.tuning", "wine.train.tuning")
Accuracy <- c(0.8969,0.8952,0.7875,0.8046)
Recall <- c(0.4651,0.5000,0.7674,0.7931)
tabelmodelrecapall <- data.frame(model_dt_recap_all,Accuracy,Recall)
print(tabelmodelrecapall)
## model_dt_recap_all Accuracy Recall
## 1 wine.test 0.8969 0.4651
## 2 wine.train 0.8952 0.5000
## 3 wine.test.tuning 0.7875 0.7674
## 4 wine.train.tuning 0.8046 0.7931
Comparing the tuned and untuned models: the tuned model's Recall is much better and its train/test gap is small, even though its Accuracy is lower, so we use the tuned model.
Create Decision Tree Plot
model_dt_tuned
##
## Model formula:
## class ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol
##
## Fitted party:
## [1] root
## | [2] alcohol <= 10.4
## | | [3] fixed.acidity <= 10.1
## | | | [4] alcohol <= 9.9
## | | | | [5] volatile.acidity <= 0.52: Bad (n = 27, err = 11.1%)
## | | | | [6] volatile.acidity > 0.52: Bad (n = 48, err = 0.0%)
## | | | [7] alcohol > 9.9: Bad (n = 31, err = 29.0%)
## | | [8] fixed.acidity > 10.1: Bad (n = 21, err = 47.6%)
## | [9] alcohol > 10.4
## | | [10] volatile.acidity <= 0.66
## | | | [11] sulphates <= 0.61
## | | | | [12] pH <= 3.27: Good (n = 20, err = 25.0%)
## | | | | [13] pH > 3.27: Bad (n = 23, err = 34.8%)
## | | | [14] sulphates > 0.61: Good (n = 150, err = 18.0%)
## | | [15] volatile.acidity > 0.66: Bad (n = 28, err = 21.4%)
##
## Number of inner nodes: 7
## Number of terminal nodes: 8
plot(model_dt_tuned,type="simple")
Interpretation of the Decision Tree Model
Node 1 is the root node (the highest node in the tree structure; it has no parent).
Nodes 2, 3, 4, 9, 10, and 11 are inner nodes (nodes that have child nodes).
Nodes 5, 6, 7, 8, 12, 13, 14, and 15 are terminal nodes (nodes that have no child nodes).
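These node ids can also be listed programmatically with partykit's nodeids(); a small sketch (setdiff() returns the root together with the inner nodes):
nodeids(model_dt_tuned, terminal = TRUE) # terminal nodes: 5 6 7 8 12 13 14 15
setdiff(nodeids(model_dt_tuned),
        nodeids(model_dt_tuned, terminal = TRUE)) # root and inner nodes: 1 2 3 4 9 10 11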
K-Fold Cross Validation
Split the data into \(k\) parts, where each part is used once as the test set.
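The train() call below handles this partitioning internally; as a minimal illustration, caret's createFolds() makes the \(k\) folds visible explicitly:
folds <- createFolds(wine_train_down$class, k = 5) # list of 5 held-out index sets
sapply(folds, length) # each observation appears in exactly one fold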
Build a random forest model using 5-fold cross validation, repeating the process 3 times, then save the model as an RDS file.
set.seed(1616)
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 3)
model_forest_tun <- train(class ~ .,
data = wine_train_down,
method = "rf",
trControl = ctrl)
## Warning in (function (kind = NULL, normal.kind = NULL, sample.kind = NULL) :
## non-uniform 'Rounding' sampler used
saveRDS(model_forest_tun, "model_forest_update_tun.RDS")
Make Random Forest Model
Read the saved RDS file into model_rf_tun
model_rf_tun <- readRDS("model_forest_update_tun.RDS")
model_rf_tun
## Random Forest
##
## 348 samples
## 11 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 278, 279, 279, 278, 278, 278, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8151228 0.6304040
## 6 0.8075046 0.6152018
## 11 0.8035696 0.6073220
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
Use the model with mtry = 2 (number of predictors sampled at each split), because its accuracy is higher than for the other mtry values.
Model Evaluation
Check the model's Out-of-Bag (OOB) error with finalModel
model_rf_tun$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 16.95%
## Confusion matrix:
## Bad Good class.error
## Bad 132 42 0.24137931
## Good 17 157 0.09770115
The OOB (Out-of-Bag) error estimate is 16.95%, which means the model's OOB accuracy is 1 - 0.1695 = 83.05%.
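The same figure can be recovered from the final model's OOB confusion matrix shown above; a quick sanity check:
oob_conf <- model_rf_tun$finalModel$confusion[, c("Bad", "Good")]
sum(diag(oob_conf)) / sum(oob_conf) # (132 + 157) / 348 = 0.8305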
Check Variable Importance
varImp(model_rf_tun)
## rf variable importance
##
## Overall
## alcohol 100.000
## sulphates 80.404
## volatile.acidity 65.344
## citric.acid 44.923
## total.sulfur.dioxide 27.964
## density 20.919
## chlorides 20.045
## fixed.acidity 19.735
## residual.sugar 4.417
## pH 4.370
## free.sulfur.dioxide 0.000
plot(varImp(model_rf_tun))
Based on the plot above, the three most important variables are alcohol, sulphates, and volatile.acidity.
Make Predictions and Evaluate the Model
Make predictions on the test data and evaluate with “Good” as the positive class.
pred_rf_tun <- predict(model_rf_tun, wine.test, type = "raw")
confusionMatrix(pred_rf_tun, wine.test$class, positive = "Good")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 210 4
## Good 67 39
##
## Accuracy : 0.7781
## 95% CI : (0.7286, 0.8225)
## No Information Rate : 0.8656
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4108
##
## Mcnemar's Test P-Value : 0.0000000000001866
##
## Sensitivity : 0.9070
## Specificity : 0.7581
## Pos Pred Value : 0.3679
## Neg Pred Value : 0.9813
## Prevalence : 0.1344
## Detection Rate : 0.1219
## Detection Prevalence : 0.3312
## Balanced Accuracy : 0.8325
##
## 'Positive' Class : Good
##
Now compare performance against a model trained without the imbalance treatment.
K-Fold Cross Validation
set.seed(1616)
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 3)
model_forest <- train(class ~ .,
data = wine.train,
method = "rf",
trControl = ctrl)
## Warning in (function (kind = NULL, normal.kind = NULL, sample.kind = NULL) :
## non-uniform 'Rounding' sampler used
saveRDS(model_forest, "model_forest_update.RDS")
Make Random Forest Model
Read the saved RDS file into model_rf
model_rf <- readRDS("model_forest_update.RDS")
model_rf
## Random Forest
##
## 1279 samples
## 11 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 1023, 1023, 1023, 1023, 1024, 1023, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9030545 0.4926345
## 6 0.9077431 0.5469259
## 11 0.9007098 0.5274840
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
Use the model with mtry = 6, because its accuracy is higher than for the other mtry values.
Model Evaluation
Check the model's Out-of-Bag (OOB) error with finalModel
model_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 9.15%
## Confusion matrix:
## Bad Good class.error
## Bad 1069 36 0.03257919
## Good 81 93 0.46551724
The OOB (Out-of-Bag) error estimate is 9.15%, which means the model's OOB accuracy is 1 - 0.0915 = 90.85%. Note, however, the high class error for “Good” (46.6%), which foreshadows low recall on the test set.
Check Variable Importance
varImp(model_rf)
## rf variable importance
##
## Overall
## alcohol 100.000
## sulphates 48.895
## volatile.acidity 35.432
## total.sulfur.dioxide 11.788
## fixed.acidity 9.770
## density 7.978
## citric.acid 7.711
## residual.sugar 6.909
## chlorides 3.857
## free.sulfur.dioxide 3.753
## pH 0.000
plot(varImp(model_rf))
Based on the plot above, the three most important variables are alcohol, sulphates, and volatile.acidity.
Make Predictions and Evaluate the Model
Make predictions on the test data and evaluate with “Good” as the positive class.
pred_rf <- predict(model_rf, wine.test, type = "raw")
confusionMatrix(pred_rf, wine.test$class, positive = "Good")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 269 18
## Good 8 25
##
## Accuracy : 0.9188
## 95% CI : (0.8832, 0.9462)
## No Information Rate : 0.8656
## P-Value [Acc > NIR] : 0.002116
##
## Kappa : 0.6127
##
## Mcnemar's Test P-Value : 0.077556
##
## Sensitivity : 0.58140
## Specificity : 0.97112
## Pos Pred Value : 0.75758
## Neg Pred Value : 0.93728
## Prevalence : 0.13437
## Detection Rate : 0.07812
## Detection Prevalence : 0.10312
## Balanced Accuracy : 0.77626
##
## 'Positive' Class : Good
##
model_rf_recap_all <- c("with tuning", "without tuning")
Accuracy <- c(0.7781,0.9188)
Recall <- c(0.9070,0.58140)
tabelmodelrecaprf <- data.frame(model_rf_recap_all,Accuracy,Recall)
print(tabelmodelrecaprf)
## model_rf_recap_all Accuracy Recall
## 1 with tuning 0.7781 0.9070
## 2 without tuning 0.9188 0.5814
Interpretation of the Random Forest Model
Comparing the models trained with and without the imbalance treatment, the model trained on the downsampled data achieves far better Sensitivity (Recall), so we use the downsampled (tuned) model.
Accuracy: 0.7781 -> the model correctly classifies the target (Good/Bad) 77.8% of the time.
Sensitivity (Recall): 0.9070 -> of all actual positive cases, the model identifies 90.7% correctly.
Specificity: 0.7581 -> of all actual negative cases, the model identifies 75.8% correctly.
Pos Pred Value (Precision): 0.3679 -> of all positive predictions, 36.8% are actually positive.
Based on the confusion matrix of the Random Forest model, the accuracy is 0.7781 (77.8%) and the recall is 0.9070 (90.7%): the model classifies wines as Good or Bad correctly 77.8% of the time, and it catches 90.7% of the truly Good wines.
Model_Name <- c("Naive Bayes", "Decision Tree", "Random Forest")
Accuracy <- c(0.7312,0.8046,0.7781)
Recall <- c(0.8837,0.7931,0.9070)
Specificity <- c(0.7076,0.8161,0.7581)
Precision <- c(0.3193,0.8118,0.3679)
modelrecapall <- data.frame(Model_Name,Accuracy,Recall,Specificity,Precision)
print(modelrecapall)
## Model_Name Accuracy Recall Specificity Precision
## 1 Naive Bayes 0.7312 0.8837 0.7076 0.3193
## 2 Decision Tree 0.8046 0.7931 0.8161 0.8118
## 3 Random Forest 0.7781 0.9070 0.7581 0.3679
After building the three models and comparing their Accuracy, Recall, Specificity, and Precision, we choose the Decision Tree model: it classifies wines as “Good” or “Bad” with 80.4% accuracy and 79.3% recall, and it has by far the highest precision (81.2%), which matters because we do not want Bad wines mixed in with wines predicted as Good.
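As a final usage sketch, the chosen decision tree can score a new wine sample; the predictor values below are illustrative only (copied from the first row of the dataset):
new_wine <- data.frame(fixed.acidity = 7.4, volatile.acidity = 0.70,
                       citric.acid = 0.00, residual.sugar = 1.9, chlorides = 0.076,
                       free.sulfur.dioxide = 11, total.sulfur.dioxide = 34,
                       density = 0.9978, pH = 3.51, sulphates = 0.56, alcohol = 9.4)
predict(model_dt_tuned, newdata = new_wine, type = "response")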