This is data characteristic about Red-Wine quality, like chemical content and quality standard.
My purpose use this data is to analysis quality of Red wine based on chemical content.
Description Data:
-fixed acidity: most acids involved with wine
-volatile acidity: amount of acetic acid in wine
-citric acid: found in small quantities
-residual sugar: amount of sugar remaining after wine fermentation/production
-chlorides: amount of salt in the wine
-free sulfur dioxide: free forms of S02, prevents microbial growth and the oxidation of wine
-total sulfur dioxide: amount of free and bound forms of S02
-density: the density of water depending on the percent alcohol and sugar content
-pH: describes how acidic or basic a wine is on a scale 0-14 (very acidic: 0, very basic: 14); most wines are between 3-4 on the pH scale
-sulphates: an antimicrobial and antioxidant
-alcohol: the percent alcohol content of the wine
library(dplyr) # Wrangling Data
library(caret) # Confussion Matrix
library(FactoMineR) #
library(e1071)# Naive Bayes
library(ROCR) # ROC
library(randomForest) # Random Forest
library(partykit) # Decision Tree
library(rsample)
red_wine <- read.csv("winequality-red.csv")
summary(red_wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
We are going to build a predictive model to classify red wine quality, whereas the quality score of 7-10 is considered “Excellent”. Therefore, I subsetted the white wine data and removed type for a cleaner data analysis.
unique(red_wine$quality)
## [1] 5 6 7 4 8 3
anyNA(red_wine)
## [1] FALSE
red_wine <- red_wine %>%
mutate(quality=as.factor(ifelse(quality>6,"Excellent","Poor-Normal")))
red_wine
prop.table(table(red_wine$quality))
##
## Excellent Poor-Normal
## 0.1357098 0.8642902
You see an imbalance in the proportions of the data, so we will downsample to balance the proportions of the data
set.seed(211)
wine_dsamp <- downSample(x= red_wine %>% select(-quality),
y=red_wine$quality,
yname = "quality")
prop.table(table(wine_dsamp$quality))
##
## Excellent Poor-Normal
## 0.5 0.5
set.seed(417)
split <- initial_split(data = red_wine, prop = 0.8, strata = "quality")
train <- training(split)
test <- testing(split)
prop.table(table(train$quality))
##
## Excellent Poor-Normal
## 0.1359375 0.8640625
# model building
naive <- naiveBayes(wine_dsamp %>% select(-quality), wine_dsamp$quality, laplace = 1)
# model fitting
naive_pred <- predict(naive, test, type = "class") # for the class prediction
# result
confusionMatrix(naive_pred,test$quality,positive = "Excellent")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Excellent Poor-Normal
## Excellent 35 92
## Poor-Normal 8 184
##
## Accuracy : 0.6865
## 95% CI : (0.6325, 0.7371)
## No Information Rate : 0.8652
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2634
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8140
## Specificity : 0.6667
## Pos Pred Value : 0.2756
## Neg Pred Value : 0.9583
## Prevalence : 0.1348
## Detection Rate : 0.1097
## Detection Prevalence : 0.3981
## Balanced Accuracy : 0.7403
##
## 'Positive' Class : Excellent
##
ROC is a curve are plots correlation between True Positive Rate (Sensitivity or Recall) and False Positive Rate (Specificity). Good model ideally “High TP and Low FP”
naive_prob <- predict(naive, newdata = test,type = "raw")
# membuat objeck prediction
wine_roc <- prediction(predictions = naive_prob[,1],# prob kelas positif
labels = as.numeric(test$quality =="Excellent"))
# performa dari object prediction
perf <- performance(prediction.obj = wine_roc,
measure = "tpr",
x.measure = "fpr")
plot(perf)
abline(0,1, lty = 2)
Based on plot, line make a curve arc (High True Positive and Low False Positive) its mean good model
AUC show large are under ROC curve, parameter AUC if value close to 1, model good.
auc <- performance(prediction.obj = wine_roc,
measure = "auc")
auc@y.values
## [[1]]
## [1] 0.8258342
Value AUC 0.8384732, close to 1 its means good model
Accuracy : 0.6865 –> 68,65% model to correctly guess the target (Excellent / Poor-Normal).
Sensitivity (Recall) : 0.8140 –> 81.4% from all the positive actual data, capable proportion of model to guess right.
Specificity : 0.6667 –> 66,57% from all the negative actual data, capable proportion of model to guess right.
Pos Pred (Precision) : 0.2756 –> 27.56% from all the prediction result, capable model to correctly guess the positive class.
Based on Confussion Matrix model Naive Bayes, value Accuracy (68,65%) and (Recall 81.14% model). Its means Accuracy model can predict quality wine Excellent or Poor-Normal 68,65 % and model can predict quality wine Good is 80.1%.
model_dt <- ctree(quality~.,red_wine)
Testdtree_pred <- predict(model_dt, test, type = "response")
confusionMatrix(dtree_pred,reference = test$quality, positive = "Excellent")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Excellent Poor-Normal
## Excellent 18 15
## Poor-Normal 25 261
##
## Accuracy : 0.8746
## 95% CI : (0.8332, 0.9089)
## No Information Rate : 0.8652
## P-Value [Acc > NIR] : 0.3472
##
## Kappa : 0.4039
##
## Mcnemar's Test P-Value : 0.1547
##
## Sensitivity : 0.41860
## Specificity : 0.94565
## Pos Pred Value : 0.54545
## Neg Pred Value : 0.91259
## Prevalence : 0.13480
## Detection Rate : 0.05643
## Detection Prevalence : 0.10345
## Balanced Accuracy : 0.68213
##
## 'Positive' Class : Excellent
##
pred_dt_train <- predict(model_dt, newdata = wine_dsamp, type = "response")
confusionMatrix(pred_dt_train, wine_dsamp$quality, positive = "Excellent")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Excellent Poor-Normal
## Excellent 107 12
## Poor-Normal 110 205
##
## Accuracy : 0.7189
## 95% CI : (0.6741, 0.7607)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4378
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4931
## Specificity : 0.9447
## Pos Pred Value : 0.8992
## Neg Pred Value : 0.6508
## Prevalence : 0.5000
## Detection Rate : 0.2465
## Detection Prevalence : 0.2742
## Balanced Accuracy : 0.7189
##
## 'Positive' Class : Excellent
##
model_dt_recap <- c("test", "wine_train_dsample")
Accuracy <- c(0.8746,0.8046)
Recall <- c(0.4186,0.4931)
tabelmodelrecap <- data.frame(model_dt_recap,Accuracy,Recall)
print(tabelmodelrecap)
## model_dt_recap Accuracy Recall
## 1 test 0.8746 0.4186
## 2 wine_train_dsample 0.8046 0.4931
Cause value Accuracy with data_test and data_train imbalance (overfitting), model must to prunning to make right fitting.
model_dt_tun <- ctree(quality ~ ., wine_dsamp,
control = ctree_control(mincriterion = 0.5,
minsplit = 35, #40
minbucket = 20)) #12
pred_dt_test_tun <- predict(model_dt_tun, newdata = test, type = "response")
confusionMatrix(pred_dt_test_tun, test$quality, positive = "Excellent")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Excellent Poor-Normal
## Excellent 34 74
## Poor-Normal 9 202
##
## Accuracy : 0.7398
## 95% CI : (0.688, 0.7871)
## No Information Rate : 0.8652
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.319
##
## Mcnemar's Test P-Value : 2.142e-12
##
## Sensitivity : 0.7907
## Specificity : 0.7319
## Pos Pred Value : 0.3148
## Neg Pred Value : 0.9573
## Prevalence : 0.1348
## Detection Rate : 0.1066
## Detection Prevalence : 0.3386
## Balanced Accuracy : 0.7613
##
## 'Positive' Class : Excellent
##
pred_dt_train_tun <- predict(model_dt_tun, newdata = wine_dsamp, type = "response")
confusionMatrix(pred_dt_train_tun, wine_dsamp$quality, positive = "Excellent")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Excellent Poor-Normal
## Excellent 185 37
## Poor-Normal 32 180
##
## Accuracy : 0.841
## 95% CI : (0.8032, 0.8741)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.682
##
## Mcnemar's Test P-Value : 0.6301
##
## Sensitivity : 0.8525
## Specificity : 0.8295
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.8491
## Prevalence : 0.5000
## Detection Rate : 0.4263
## Detection Prevalence : 0.5115
## Balanced Accuracy : 0.8410
##
## 'Positive' Class : Excellent
##
#Summary Prediction and Evaluation
model_dt_recap_prun <- c("wine.test", "wine_train_down")
Accuracy_prun <- c(0.8119,0.8046)
Recall_prun <- c(0.9535,0.9673)
tabelmodelrecap2 <- data.frame(model_dt_recap_prun,Accuracy_prun,Recall_prun)
print(tabelmodelrecap2)
## model_dt_recap_prun Accuracy_prun Recall_prun
## 1 wine.test 0.8119 0.9535
## 2 wine_train_down 0.8046 0.9673
#model_dt_tun
plot(model_dt_tun,type = "simple")
Nodes 1 is Root Nodes (Highest node in the tree structure, and has no parent)
Nodes 2,3,4,9,10,and 11 is Inner Nodes (Node of a tree that has child nodes)
Nodes 5,6,7,8,12,13,14,and 15 is Terminal Nodes (Node that does not have child nodes)
Split data by [Math Processing Error] part, where each part is used to testing data.
Make model random forest using 5-fold cross validation and repeat process 3 times, after that save on RDS
set.seed(417)
ctrl <- trainControl(method = "repeatedcv",
number = 5, # k-fold
repeats = 3) # repetisi
fb_forest <- train(quality ~ .,
data = wine_dsamp,
method = "rf", # random forest
trControl = ctrl)
saveRDS(fb_forest, "fb_forest_updates.RDS") # simpan model
Read RDS
forestt <- readRDS("fb_forest_updates.RDS")
varImp(forestt)
## rf variable importance
##
## Overall
## alcohol 100.000
## sulphates 62.124
## volatile.acidity 56.590
## citric.acid 42.322
## total.sulfur.dioxide 25.323
## density 22.415
## fixed.acidity 17.393
## chlorides 14.430
## residual.sugar 7.583
## pH 5.050
## free.sulfur.dioxide 0.000
forestt$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 14.29%
## Confusion matrix:
## Excellent Poor-Normal class.error
## Excellent 194 23 0.1059908
## Poor-Normal 39 178 0.1797235
oob error data sebesar 14.29 persen, its mean this model has 86,71 % of accuracy
plot(varImp(forestt))
# Make Prediction adn Evaluation Model Make prediction and check model evaluation with positive class “Good” using data_test
pred_rfs <- predict(forestt, test,type = "raw")
confusionMatrix(pred_rfs,reference = test$quality,positive = "Excellent")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Excellent Poor-Normal
## Excellent 43 62
## Poor-Normal 0 214
##
## Accuracy : 0.8056
## 95% CI : (0.7579, 0.8476)
## No Information Rate : 0.8652
## P-Value [Acc > NIR] : 0.9988
##
## Kappa : 0.482
##
## Mcnemar's Test P-Value : 9.408e-15
##
## Sensitivity : 1.0000
## Specificity : 0.7754
## Pos Pred Value : 0.4095
## Neg Pred Value : 1.0000
## Prevalence : 0.1348
## Detection Rate : 0.1348
## Detection Prevalence : 0.3292
## Balanced Accuracy : 0.8877
##
## 'Positive' Class : Excellent
##
Accuracy : 0.8056 –> 80.56% model to correctly guess the target (Excellent/Poor-Normal).
-Sensitivity (Recall) : 1 –> 100% from all the positive actual data, capable proportion of model to guess right.
-Specificity : 0.7717 –> 77,17% from all the negative actual data, capable proportion of model to guess right.
-Pos Pred (Precision) : 0.4057 –> 40.57% from all the prediction result, capable model to correctly guess the positive class.
Based on Confussion Matrix model Random Forest, value Accuracy (80.56%) and (Recall 100% ). Its means Accuracy model can predict quality wine Excellent or Poor-Normal 78.1% and model can predict quality wine Good is 100%.
Model_Name <- c("Naive Bayes", "Decission Tree", "Random Forest")
Accuracy <- c(0.6865,0.7398,0.8056)
Recall <- c(0.8140,0.7907,1.000)
Specificity <- c(0.6667,0.7319,0.7717)
Precision <- c(0.2756,0.3148,0.4057)
modelrecapall <- data.frame(Model_Name,Accuracy,Recall,Specificity,Precision)
print(modelrecapall)
## Model_Name Accuracy Recall Specificity Precision
## 1 Naive Bayes 0.6865 0.8140 0.6667 0.2756
## 2 Decission Tree 0.7398 0.7907 0.7319 0.3148
## 3 Random Forest 0.8056 1.0000 0.7717 0.4057
After make 3 model we get result Accuracy, Recall, Specificity, and Precision. In this case we will choose Random Forest Model, because model can predict quality wine Excellent and Poor-Normal with accuracy 80.8% and model can predict quality “Excellent” 100%. So we want all wine quality Poor-Normalnot mix with all wine quality Excellent.