LBB ini akan membahas tentang penggunaan metode Supervised Learning untuk memprediksi kualitas dari sebuah wine. Metode yang digunakan untuk melakukan klasifikasi ini adalah Naive Bayes, Descision Tree, dan Random Forest
library(tidyverse)
library(e1071)
library(caret)
library(partykit)
library(rsample)
library(randomForest)
theme_set(theme_minimal())
wine <- read.csv("data/winequalityN.csv")
head(wine)
## type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 white 7.0 0.27 0.36 20.7 0.045
## 2 white 6.3 0.30 0.34 1.6 0.049
## 3 white 8.1 0.28 0.40 6.9 0.050
## 4 white 7.2 0.23 0.32 8.5 0.058
## 5 white 7.2 0.23 0.32 8.5 0.058
## 6 white 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
glimpse(wine)
## Rows: 6,497
## Columns: 13
## $ type <chr> "white", "white", "white", "white", "white", "whi~
## $ fixed.acidity <dbl> 7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0, 6.3, 8.1,~
## $ volatile.acidity <dbl> 0.27, 0.30, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0~
## $ citric.acid <dbl> 0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16, 0.36, 0~
## $ residual.sugar <dbl> 20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.00, 20.70,~
## $ chlorides <dbl> 0.045, 0.049, 0.050, 0.058, 0.058, 0.050, 0.045, ~
## $ free.sulfur.dioxide <dbl> 45, 14, 30, 47, 47, 30, 30, 45, 14, 28, 11, 17, 1~
## $ total.sulfur.dioxide <dbl> 170, 132, 97, 186, 186, 97, 136, 170, 132, 129, 6~
## $ density <dbl> 1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951, 0~
## $ pH <dbl> 3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3~
## $ sulphates <dbl> 0.45, 0.49, 0.44, 0.40, 0.40, 0.44, 0.47, 0.45, 0~
## $ alcohol <dbl> 8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11~
## $ quality <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 7, 5, 7, 6~
colSums(is.na(wine))
## type fixed.acidity volatile.acidity
## 0 10 8
## citric.acid residual.sugar chlorides
## 3 2 2
## free.sulfur.dioxide total.sulfur.dioxide density
## 0 0 0
## pH sulphates alcohol
## 9 4 0
## quality
## 0
Karena tidak terlalu banyak missing data, dapat dihapus menggunakan na.omit().
wine <- wine %>%
na.omit()
prop.table(table(wine$type))
##
## red white
## 0.24648 0.75352
summary(wine)
## type fixed.acidity volatile.acidity citric.acid
## Length:6463 Min. : 3.800 Min. :0.0800 Min. :0.0000
## Class :character 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500
## Mode :character Median : 7.000 Median :0.2900 Median :0.3100
## Mean : 7.218 Mean :0.3396 Mean :0.3188
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900
## Max. :15.900 Max. :1.5800 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 1.00 Min. : 6.0
## 1st Qu.: 1.800 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0
## Median : 3.000 Median :0.04700 Median : 29.00 Median :118.0
## Mean : 5.444 Mean :0.05606 Mean : 30.52 Mean :115.7
## 3rd Qu.: 8.100 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0
## Max. :65.800 Max. :0.61100 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50
## Median :0.9949 Median :3.210 Median :0.5100 Median :10.30
## Mean :0.9947 Mean :3.218 Mean :0.5311 Mean :10.49
## 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30
## Max. :1.0390 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.819
## 3rd Qu.:6.000
## Max. :9.000
Insight: Dari hasil EDA diatas, data yang digunakan ada 6463 baris dan 13 kolom. Kolom yang menjadi target variabel adalah type.
wine <- wine %>%
mutate(type = as.factor(type))
RNGkind(sample.kind = "Rounding")
set.seed(100)
splitter <- initial_split(wine, prop = 0.8, strata = "type")
data_train <- training(splitter)
data_test <- testing(splitter)
test <- data_test %>%
select(-type)
prop.table(table(data_train$type))
##
## red white
## 0.2465197 0.7534803
Karena data train tidak balance, maka dilakukan downsampling agar proporsi data menjadi balance menggunakan fungsi downSample()
set.seed(100)
data_train <- downSample(x = data_train %>%
select(-type),
y = data_train$type,
yname = "type")
prop.table(table(data_train$type))
##
## red white
## 0.5 0.5
Data train sudah balance, selanjutnya membuat modelling
Membuat model Naive Bayes dari data yang sudah diproses
naive_model <- naiveBayes(type ~ ., data_train)
Membuat model Decision Tree dari data yang sudah diproses
dt_model <- ctree(type ~ ., data_train)
plot(dt_model, type = "simple")
Menggunakan 5 fold cross validation dan 3 kali perulangan
set.seed(100)
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
rf_model <- train(type ~ ., data = data_train, method = "rf", trControl = control)
saveRDS(rf_model, file = "rf_model.rds")
rf_model$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.67%
## Confusion matrix:
## red white class.error
## red 1264 11 0.008627451
## white 6 1269 0.004705882
Prediksi model Naive Bayes
naive_pred <- predict(naive_model, newdata = test)
confusionMatrix(naive_pred, reference = data_test$type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction red white
## red 312 43
## white 6 930
##
## Accuracy : 0.962
## 95% CI : (0.9501, 0.9718)
## No Information Rate : 0.7537
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9016
##
## Mcnemar's Test P-Value : 2.706e-07
##
## Sensitivity : 0.9811
## Specificity : 0.9558
## Pos Pred Value : 0.8789
## Neg Pred Value : 0.9936
## Prevalence : 0.2463
## Detection Rate : 0.2417
## Detection Prevalence : 0.2750
## Balanced Accuracy : 0.9685
##
## 'Positive' Class : red
##
Insight: Dari hasil predict dan dilakukan confusion matrix pada model Naive Bayes didapatkan akurasi sebesar 96,2%
Prediksi model Decision Tree
tree_pred <- predict(object = dt_model, newdata = test, type = "response")
confusionMatrix(tree_pred, reference = data_test$type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction red white
## red 303 14
## white 15 959
##
## Accuracy : 0.9775
## 95% CI : (0.9679, 0.9849)
## No Information Rate : 0.7537
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9394
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9528
## Specificity : 0.9856
## Pos Pred Value : 0.9558
## Neg Pred Value : 0.9846
## Prevalence : 0.2463
## Detection Rate : 0.2347
## Detection Prevalence : 0.2455
## Balanced Accuracy : 0.9692
##
## 'Positive' Class : red
##
Insight: Dari hasil predict dan dilakukan confusion matrix pada model Decision Tree didapatkan akurasi sebesar 97,75%
rf_predict <- predict(rf_model, test)
confusionMatrix(rf_predict, reference = data_test$type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction red white
## red 315 4
## white 3 969
##
## Accuracy : 0.9946
## 95% CI : (0.9889, 0.9978)
## No Information Rate : 0.7537
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9854
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9906
## Specificity : 0.9959
## Pos Pred Value : 0.9875
## Neg Pred Value : 0.9969
## Prevalence : 0.2463
## Detection Rate : 0.2440
## Detection Prevalence : 0.2471
## Balanced Accuracy : 0.9932
##
## 'Positive' Class : red
##
Insight: Dari hasil predict dan dilakukan confusion matrix pada model Random Forest didapatkan akurasi sebesar 99,46%
plot(varImp(rf_model))
Prediksi yang dihasilkan didapatkan bahwa model Random Forest lebih baik dari pada model Naive Bayes dan Decision tree untuk melakukan prediksi dengan nilai akurasi masing-masing yaitu 99,46% untuk Random Forest, 96,2% untuk Naive Bayes, dan 97,75% untuk Decision Tree.
Variabel yang cukup berpengaruh untuk menentukan tipe dari wine dapat dilihat dari interpretion menggunakan fungsi varImp() yaitu chlorides, total.sulfur.dioxide, dan volatile.acidity.