BackGround

About

This is data characteristic about Red-Wine quality, like chemical content and quality standard.

My purpose use this data is to analysis quality of Red wine based on chemical content.

Description

Description Data:

-fixed acidity: most acids involved with wine

-volatile acidity: amount of acetic acid in wine

-citric acid: found in small quantities

-residual sugar: amount of sugar remaining after wine fermentation/production

-chlorides: amount of salt in the wine

-free sulfur dioxide: free forms of S02, prevents microbial growth and the oxidation of wine

-total sulfur dioxide: amount of free and bound forms of S02

-density: the density of water depending on the percent alcohol and sugar content

-pH: describes how acidic or basic a wine is on a scale 0-14 (very acidic: 0, very basic: 14); most wines are between 3-4 on the pH scale

-sulphates: an antimicrobial and antioxidant

-alcohol: the percent alcohol content of the wine

Kaggle

Set-Up

Library

library(dplyr) # Wrangling Data
library(caret) # Confussion Matrix
library(FactoMineR) #
library(e1071)# Naive Bayes
library(ROCR) # ROC
library(randomForest) # Random Forest
library(partykit) # Decision Tree
library(rsample)

Data

red_wine <- read.csv("winequality-red.csv")

Explanatory Data Analysis

summary(red_wine)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

We are going to build a predictive model to classify red wine quality, whereas the quality score of 7-10 is considered “Excellent”. Therefore, I subsetted the white wine data and removed type for a cleaner data analysis.

unique(red_wine$quality)
## [1] 5 6 7 4 8 3

Check Missing Value

anyNA(red_wine)
## [1] FALSE

Change Data Type

red_wine <- red_wine %>% 
  mutate(quality=as.factor(ifelse(quality>6,"Excellent","Poor-Normal")))
red_wine

Check Proportion Data Train

prop.table(table(red_wine$quality))
## 
##   Excellent Poor-Normal 
##   0.1357098   0.8642902

You see an imbalance in the proportions of the data, so we will downsample to balance the proportions of the data

set.seed(211)
wine_dsamp <- downSample(x= red_wine %>% select(-quality),
                        y=red_wine$quality,
                        yname = "quality")
prop.table(table(wine_dsamp$quality))
## 
##   Excellent Poor-Normal 
##         0.5         0.5
set.seed(417)

split <- initial_split(data = red_wine, prop = 0.8, strata = "quality")

train <- training(split)
test <- testing(split)
prop.table(table(train$quality))
## 
##   Excellent Poor-Normal 
##   0.1359375   0.8640625

Naive Bayes

Naive Bayes Model

# model building
naive <- naiveBayes(wine_dsamp %>% select(-quality), wine_dsamp$quality, laplace = 1)
# model fitting
naive_pred <- predict(naive, test, type = "class") # for the class prediction

EValuation of Naive Bayes Model

# result
confusionMatrix(naive_pred,test$quality,positive = "Excellent")
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Excellent Poor-Normal
##   Excellent          35          92
##   Poor-Normal         8         184
##                                           
##                Accuracy : 0.6865          
##                  95% CI : (0.6325, 0.7371)
##     No Information Rate : 0.8652          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2634          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8140          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.2756          
##          Neg Pred Value : 0.9583          
##              Prevalence : 0.1348          
##          Detection Rate : 0.1097          
##    Detection Prevalence : 0.3981          
##       Balanced Accuracy : 0.7403          
##                                           
##        'Positive' Class : Excellent       
## 

Check Performace Model

ROC (Receiver Operating Curve)

ROC is a curve are plots correlation between True Positive Rate (Sensitivity or Recall) and False Positive Rate (Specificity). Good model ideally “High TP and Low FP”

naive_prob <- predict(naive, newdata = test,type = "raw")
# membuat objeck prediction
wine_roc <- prediction(predictions = naive_prob[,1],# prob kelas positif
                       labels = as.numeric(test$quality =="Excellent"))

# performa dari object prediction
perf <- performance(prediction.obj = wine_roc,
                    measure = "tpr",
                    x.measure = "fpr")

plot(perf)
abline(0,1, lty = 2)

Based on plot, line make a curve arc (High True Positive and Low False Positive) its mean good model

AUC (Area Under ROC Curve)

AUC show large are under ROC curve, parameter AUC if value close to 1, model good.

auc <- performance(prediction.obj = wine_roc, 
                   measure = "auc")
auc@y.values
## [[1]]
## [1] 0.8258342

Value AUC 0.8384732, close to 1 its means good model

Interpretation

Accuracy : 0.6865 –> 68,65% model to correctly guess the target (Excellent / Poor-Normal).

Sensitivity (Recall) : 0.8140 –> 81.4% from all the positive actual data, capable proportion of model to guess right.

Specificity : 0.6667 –> 66,57% from all the negative actual data, capable proportion of model to guess right.

Pos Pred (Precision) : 0.2756 –> 27.56% from all the prediction result, capable model to correctly guess the positive class.

Based on Confussion Matrix model Naive Bayes, value Accuracy (68,65%) and (Recall 81.14% model). Its means Accuracy model can predict quality wine Excellent or Poor-Normal 68,65 % and model can predict quality wine Good is 80.1%.

Decision Tree

Model

model_dt <- ctree(quality~.,red_wine)

Prediction and Evaluation Model

Prediction And Evaluation Using Data Test

dtree_pred <- predict(model_dt, test, type = "response")
confusionMatrix(dtree_pred,reference = test$quality, positive = "Excellent")
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Excellent Poor-Normal
##   Excellent          18          15
##   Poor-Normal        25         261
##                                           
##                Accuracy : 0.8746          
##                  95% CI : (0.8332, 0.9089)
##     No Information Rate : 0.8652          
##     P-Value [Acc > NIR] : 0.3472          
##                                           
##                   Kappa : 0.4039          
##                                           
##  Mcnemar's Test P-Value : 0.1547          
##                                           
##             Sensitivity : 0.41860         
##             Specificity : 0.94565         
##          Pos Pred Value : 0.54545         
##          Neg Pred Value : 0.91259         
##              Prevalence : 0.13480         
##          Detection Rate : 0.05643         
##    Detection Prevalence : 0.10345         
##       Balanced Accuracy : 0.68213         
##                                           
##        'Positive' Class : Excellent       
## 

Prediksi dan evaluasi model menggunakan data train

pred_dt_train <- predict(model_dt, newdata = wine_dsamp, type = "response")
confusionMatrix(pred_dt_train, wine_dsamp$quality, positive = "Excellent")
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Excellent Poor-Normal
##   Excellent         107          12
##   Poor-Normal       110         205
##                                           
##                Accuracy : 0.7189          
##                  95% CI : (0.6741, 0.7607)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4378          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4931          
##             Specificity : 0.9447          
##          Pos Pred Value : 0.8992          
##          Neg Pred Value : 0.6508          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2465          
##    Detection Prevalence : 0.2742          
##       Balanced Accuracy : 0.7189          
##                                           
##        'Positive' Class : Excellent       
## 

summary prediction and evaluation

model_dt_recap <- c("test", "wine_train_dsample")
Accuracy <- c(0.8746,0.8046)
Recall <- c(0.4186,0.4931)

tabelmodelrecap <- data.frame(model_dt_recap,Accuracy,Recall)

print(tabelmodelrecap)
##       model_dt_recap Accuracy Recall
## 1               test   0.8746 0.4186
## 2 wine_train_dsample   0.8046 0.4931

Cause value Accuracy with data_test and data_train imbalance (overfitting), model must to prunning to make right fitting.

Pruning

model_dt_tun <- ctree(quality ~ ., wine_dsamp,
                               control = ctree_control(mincriterion = 0.5,
                                            minsplit = 35, #40
                                            minbucket = 20)) #12
pred_dt_test_tun <- predict(model_dt_tun, newdata = test, type = "response")
confusionMatrix(pred_dt_test_tun, test$quality, positive = "Excellent")
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Excellent Poor-Normal
##   Excellent          34          74
##   Poor-Normal         9         202
##                                          
##                Accuracy : 0.7398         
##                  95% CI : (0.688, 0.7871)
##     No Information Rate : 0.8652         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.319          
##                                          
##  Mcnemar's Test P-Value : 2.142e-12      
##                                          
##             Sensitivity : 0.7907         
##             Specificity : 0.7319         
##          Pos Pred Value : 0.3148         
##          Neg Pred Value : 0.9573         
##              Prevalence : 0.1348         
##          Detection Rate : 0.1066         
##    Detection Prevalence : 0.3386         
##       Balanced Accuracy : 0.7613         
##                                          
##        'Positive' Class : Excellent      
## 
pred_dt_train_tun <- predict(model_dt_tun, newdata = wine_dsamp, type = "response")
confusionMatrix(pred_dt_train_tun, wine_dsamp$quality, positive = "Excellent")
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Excellent Poor-Normal
##   Excellent         185          37
##   Poor-Normal        32         180
##                                           
##                Accuracy : 0.841           
##                  95% CI : (0.8032, 0.8741)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.682           
##                                           
##  Mcnemar's Test P-Value : 0.6301          
##                                           
##             Sensitivity : 0.8525          
##             Specificity : 0.8295          
##          Pos Pred Value : 0.8333          
##          Neg Pred Value : 0.8491          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4263          
##    Detection Prevalence : 0.5115          
##       Balanced Accuracy : 0.8410          
##                                           
##        'Positive' Class : Excellent       
## 

#Summary Prediction and Evaluation

model_dt_recap_prun <- c("wine.test", "wine_train_down")
Accuracy_prun <- c(0.8119,0.8046)
Recall_prun <- c(0.9535,0.9673)

tabelmodelrecap2 <- data.frame(model_dt_recap_prun,Accuracy_prun,Recall_prun)

print(tabelmodelrecap2)
##   model_dt_recap_prun Accuracy_prun Recall_prun
## 1           wine.test        0.8119      0.9535
## 2     wine_train_down        0.8046      0.9673

Create Plot Decision Tree

#model_dt_tun
plot(model_dt_tun,type = "simple")

Nodes 1 is Root Nodes (Highest node in the tree structure, and has no parent)

Nodes 2,3,4,9,10,and 11 is Inner Nodes (Node of a tree that has child nodes)

Nodes 5,6,7,8,12,13,14,and 15 is Terminal Nodes (Node that does not have child nodes)

random Forest

K-Fold Cross Validation

Split data by [Math Processing Error] part, where each part is used to testing data.

Make model random forest using 5-fold cross validation and repeat process 3 times, after that save on RDS

set.seed(417)

ctrl <- trainControl(method = "repeatedcv",
                     number = 5, # k-fold
                     repeats = 3) # repetisi

fb_forest <- train(quality ~ .,
                   data = wine_dsamp,
                   method = "rf", # random forest
                   trControl = ctrl)

saveRDS(fb_forest, "fb_forest_updates.RDS") # simpan model

Model Random Forest Model

Read RDS

forestt <- readRDS("fb_forest_updates.RDS")
varImp(forestt)
## rf variable importance
## 
##                      Overall
## alcohol              100.000
## sulphates             62.124
## volatile.acidity      56.590
## citric.acid           42.322
## total.sulfur.dioxide  25.323
## density               22.415
## fixed.acidity         17.393
## chlorides             14.430
## residual.sugar         7.583
## pH                     5.050
## free.sulfur.dioxide    0.000

Model Evaluation

forestt$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 14.29%
## Confusion matrix:
##             Excellent Poor-Normal class.error
## Excellent         194          23   0.1059908
## Poor-Normal        39         178   0.1797235

oob error data sebesar 14.29 persen, its mean this model has 86,71 % of accuracy

Check Importance Variable

plot(varImp(forestt))

# Make Prediction adn Evaluation Model Make prediction and check model evaluation with positive class “Good” using data_test

pred_rfs <- predict(forestt, test,type = "raw")
confusionMatrix(pred_rfs,reference = test$quality,positive = "Excellent")
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Excellent Poor-Normal
##   Excellent          43          62
##   Poor-Normal         0         214
##                                           
##                Accuracy : 0.8056          
##                  95% CI : (0.7579, 0.8476)
##     No Information Rate : 0.8652          
##     P-Value [Acc > NIR] : 0.9988          
##                                           
##                   Kappa : 0.482           
##                                           
##  Mcnemar's Test P-Value : 9.408e-15       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.7754          
##          Pos Pred Value : 0.4095          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.1348          
##          Detection Rate : 0.1348          
##    Detection Prevalence : 0.3292          
##       Balanced Accuracy : 0.8877          
##                                           
##        'Positive' Class : Excellent       
## 

Interpretation Random Forest

Accuracy : 0.8056 –> 80.56% model to correctly guess the target (Excellent/Poor-Normal).

-Sensitivity (Recall) : 1 –> 100% from all the positive actual data, capable proportion of model to guess right.

-Specificity : 0.7717 –> 77,17% from all the negative actual data, capable proportion of model to guess right.

-Pos Pred (Precision) : 0.4057 –> 40.57% from all the prediction result, capable model to correctly guess the positive class.

Based on Confussion Matrix model Random Forest, value Accuracy (80.56%) and (Recall 100% ). Its means Accuracy model can predict quality wine Excellent or Poor-Normal 78.1% and model can predict quality wine Good is 100%.

Conculsion

Model_Name <- c("Naive Bayes", "Decission Tree", "Random Forest")
Accuracy <- c(0.6865,0.7398,0.8056)
Recall <- c(0.8140,0.7907,1.000)
Specificity <- c(0.6667,0.7319,0.7717)
Precision <- c(0.2756,0.3148,0.4057)

modelrecapall <- data.frame(Model_Name,Accuracy,Recall,Specificity,Precision)

print(modelrecapall)
##       Model_Name Accuracy Recall Specificity Precision
## 1    Naive Bayes   0.6865 0.8140      0.6667    0.2756
## 2 Decission Tree   0.7398 0.7907      0.7319    0.3148
## 3  Random Forest   0.8056 1.0000      0.7717    0.4057

After make 3 model we get result Accuracy, Recall, Specificity, and Precision. In this case we will choose Random Forest Model, because model can predict quality wine Excellent and Poor-Normal with accuracy 80.8% and model can predict quality “Excellent” 100%. So we want all wine quality Poor-Normalnot mix with all wine quality Excellent.