Pokemon C2

Load and Preview Data
Data Preprocessing
Cross Validation
- Simple
- k-Fold Cross Validation
Model
Conclusion

library(rsample)
library(caret)
library(e1071)
library(ROCR)
library(partykit)

Hello!
After using Logistic Regression in the previous LBB to build a model that predicts whether a Pokemon is legendary, this time we will try three additional models: Naive Bayes, Decision Tree, and Random Forest. Let’s do it!

Load and Preview Data

pokemon <- read.csv("pokemonfull.csv")
head(pokemon)

Data Preprocessing

The target variable is is_legendary, which states whether a Pokemon is legendary or not.
The predictor variables we will use are:
1. attack
2. defense
3. height_m
4. hp
5. experience_growth
6. sp_attack
7. sp_defense
8. speed
9. weight_kg

pokemon <- pokemon[,c("is_legendary","attack","defense","height_m","hp","experience_growth","sp_attack","sp_defense","speed","weight_kg")]
pokemon <- na.omit(pokemon)
pokemon$is_legendary <- as.factor(pokemon$is_legendary)
str(pokemon)

## 'data.frame':    781 obs. of  10 variables:
##  $ is_legendary     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ attack           : int  49 62 100 52 64 104 48 63 103 30 ...
##  $ defense          : int  49 63 123 43 58 78 65 80 120 35 ...
##  $ height_m         : num  0.7 1 2 0.6 1.1 1.7 0.5 1 1.6 0.3 ...
##  $ hp               : int  45 60 80 39 58 78 44 59 79 45 ...
##  $ experience_growth: int  1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1000000 ...
##  $ sp_attack        : int  65 80 122 60 80 159 50 65 135 20 ...
##  $ sp_defense       : int  65 80 120 50 65 115 64 80 115 20 ...
##  $ speed            : int  45 60 80 65 80 100 43 58 78 45 ...
##  $ weight_kg        : num  6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
##  - attr(*, "na.action")= 'omit' Named int  19 20 26 27 28 37 38 50 51 52 ...
##   ..- attr(*, "names")= chr  "19" "20" "26" "27" ...
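
The `na.action` attribute at the end of the `str()` output records which rows `na.omit()` dropped. As a minimal sketch of that behaviour, here is the same preprocessing applied to a small toy data frame (the values are hypothetical and are not taken from pokemonfull.csv):

```r
# Toy data frame (hypothetical values) with missing entries in rows 2 and 3
toy <- data.frame(
  is_legendary = c(0, 1, 0, 0),
  height_m     = c(0.7, NA, 1.0, 2.0),
  weight_kg    = c(6.9, 100, NA, 13)
)

# Drop every row that contains at least one NA
toy_clean <- na.omit(toy)

# Convert the 0/1 target into a factor, as done for the Pokemon data
toy_clean$is_legendary <- as.factor(toy_clean$is_legendary)

nrow(toy_clean)                # number of complete rows that remain
attr(toy_clean, "na.action")   # indices of the rows that were removed
```

Only complete rows survive, and the indices of the removed rows stay attached to the result, which is what the last two lines of the `str()` output show.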

Cross Validation

Before building the models, the first step is Cross Validation: randomly splitting the data, by a chosen proportion, into training data and test data.

Cross Validation here comes in two forms: Simple Cross Validation (a single hold-out split) and k-Fold Cross Validation. We use Simple Cross Validation for the Naive Bayes and Decision Tree models, and k-Fold Cross Validation for the Random Forest model. The Random Forest is still fitted on pokemon_train and evaluated on pokemon_test; the k-fold resampling happens inside the training data.

Simple

set.seed(2001)
idx <- initial_split(data = pokemon, prop = 0.8, strata = "is_legendary")
pokemon_train <- training(idx)
pokemon_test <- testing(idx)

prop.table(table(pokemon$is_legendary))

## 
##          0          1 
## 0.91165173 0.08834827

prop.table(table(pokemon_train$is_legendary))

## 
##      0      1 
## 0.9104 0.0896

prop.table(table(pokemon_test$is_legendary))

## 
##          0          1 
## 0.91666667 0.08333333

The class proportions in the original data, the training data, and the test data are nearly identical, so we can move on to the next step.
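
The proportions match because the split is stratified on is_legendary. The idea can be sketched in base R by sampling 80% of the rows within each class separately. This is only an illustration on hypothetical labels; it is not a claim about how `initial_split()` is implemented internally.

```r
set.seed(2001)

# Hypothetical labels with roughly the same imbalance as the Pokemon data
y <- factor(c(rep("0", 91), rep("1", 9)))

# Sample 80% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_along(y), y), function(rows) {
  rows[sample.int(length(rows), size = round(0.8 * length(rows)))]
}))

# Class proportions in the full, training, and test portions
prop.table(table(y))
prop.table(table(y[train_idx]))
prop.table(table(y[-train_idx]))
```

Because each class is sampled on its own, the minority class cannot end up over- or under-represented in either portion by chance.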

k-Fold Cross Validation

set.seed(2001)
ctrl <- trainControl(method="repeatedcv", number=5, repeats=3)
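
With `number=5`, each resampling round splits the training data into 5 folds; with `repeats=3`, that whole procedure is run 3 times with different random fold assignments. The following base R sketch shows one such round on a hypothetical set of 20 rows. caret does this internally and may differ in detail (for example, by balancing the classes across folds), so treat it as an illustration only.

```r
set.seed(2001)

n <- 20   # hypothetical number of training rows
k <- 5    # number of folds, matching number=5 above

# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  holdout <- which(fold_id == fold)   # rows held out for validation in this round
  fit_on  <- which(fold_id != fold)   # rows used to fit the model in this round
  cat("Fold", fold, ": fit on", length(fit_on),
      "rows, validate on", length(holdout), "rows\n")
}
```

Every row is held out exactly once per repeat, so each candidate model is scored on data it was not fitted on.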

Model

Naive Bayes

Model

pokemon_bayes <- naiveBayes(is_legendary~.,data = pokemon_train, laplace = 1)

Predict

bayes_pred <- predict(pokemon_bayes,pokemon_test)
predict(pokemon_bayes,pokemon_test,"class")

##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [36] 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0
##  [71] 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
## [141] 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 1
## Levels: 0 1

These are the classes our model predicts for pokemon_test.

Confusion Matrix

confusionMatrix(bayes_pred,pokemon_test$is_legendary,positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 137   2
##          1   6  11
##                                           
##                Accuracy : 0.9487          
##                  95% CI : (0.9015, 0.9776)
##     No Information Rate : 0.9167          
##     P-Value [Acc > NIR] : 0.09006         
##                                           
##                   Kappa : 0.7055          
##                                           
##  Mcnemar's Test P-Value : 0.28884         
##                                           
##             Sensitivity : 0.84615         
##             Specificity : 0.95804         
##          Pos Pred Value : 0.64706         
##          Neg Pred Value : 0.98561         
##              Prevalence : 0.08333         
##          Detection Rate : 0.07051         
##    Detection Prevalence : 0.10897         
##       Balanced Accuracy : 0.90210         
##                                           
##        'Positive' Class : 1               
##

# Store the confusion matrix once, then extract the metrics by name
cm_bayes <- confusionMatrix(bayes_pred,pokemon_test$is_legendary,positive = "1")
acc_bayes <- cm_bayes$overall["Accuracy"]
recall_bayes <- cm_bayes$byClass["Sensitivity"]

Sensitivity is the recall, which is 0.84615 for this model.
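
To make the link between the confusion matrix and these statistics explicit, the main metrics can be recomputed by hand from the four counts in the Naive Bayes table above. The variable names below are ours, introduced for illustration.

```r
# Counts copied from the Naive Bayes confusion matrix above
# (rows = prediction, columns = reference, positive class = "1")
TP <- 11   # predicted 1, actually 1
FN <- 2    # predicted 0, actually 1
FP <- 6    # predicted 1, actually 0
TN <- 137  # predicted 0, actually 0

recall      <- TP / (TP + FN)                    # Sensitivity
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)                    # Pos Pred Value
accuracy    <- (TP + TN) / (TP + TN + FP + FN)

round(c(recall = recall, specificity = specificity,
        precision = precision, accuracy = accuracy), 5)
```

Recall is 11 / 13: of the 13 legendary Pokemon in the test set, the model finds 11.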

ROC and AUC

bayes_prob <- predict(pokemon_bayes,pokemon_test,"raw")
ROC_Bayes <- prediction(predictions = bayes_prob[,2],labels = pokemon_test$is_legendary)
perf_Bayes <- performance(ROC_Bayes, "tpr", "fpr")
plot(perf_Bayes)

AUC_bayes <- performance(ROC_Bayes,"auc")
AUC_bayes@y.values[[1]]

## [1] 0.957773

AUC_bayes_value <- AUC_bayes@y.values[[1]]

The ROC curve stays well away from the diagonal chance line, which passes through (0.5, 0.5), and bends toward the top-left corner (0, 1). This indicates that our Naive Bayes model performs well. The AUC of 0.957773 supports this, since it is much closer to 1 than to 0.5.
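
One way to read the AUC is as the probability that a randomly chosen legendary Pokemon receives a higher score than a randomly chosen non-legendary one. The sketch below computes it from ranks in base R. The helper `auc_rank` and the scores are hypothetical, written for illustration; this is not the code path ROCR uses.

```r
# AUC as the share of positive/negative pairs that are ordered correctly
# (ties count as one half, which average ranks take care of)
auc_rank <- function(scores, labels, positive = "1") {
  pos <- scores[labels == positive]
  neg <- scores[labels != positive]
  r <- rank(c(pos, neg))
  (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
}

# Hypothetical scores, not taken from the Pokemon model
scores <- c(0.9, 0.8, 0.35, 0.6, 0.3, 0.2, 0.1)
labels <- c("1", "1", "1", "0", "0", "0", "0")

auc_rank(scores, labels)
```

In this toy example 11 of the 12 positive/negative pairs are ordered correctly, so the result is 11/12. A model that scores at random would land near 0.5.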

Decision Tree

Model

pokemon_tree <- ctree(is_legendary~.,data = pokemon_train)
plot(pokemon_tree, type = "simple")

Predict

tree_pred <- predict(pokemon_tree,pokemon_test)
predict(pokemon_tree,pokemon_test)

##  14  15  33  36  44  57  72  82  85  92  97  98 100 102 106 109 111 112 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
## 132 133 134 137 141 152 153 159 160 161 162 169 171 176 183 186 192 201 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
## 202 207 210 222 238 242 249 250 260 262 264 265 271 272 274 282 288 290 
##   0   0   0   0   0   0   1   1   0   0   0   0   0   0   0   1   0   0 
## 294 295 305 309 314 318 321 326 331 348 360 370 376 379 386 389 399 401 
##   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0 
## 407 409 410 424 430 431 440 444 445 447 452 460 461 466 471 482 483 485 
##   0   0   0   0   0   0   0   0   1   0   0   1   0   0   0   1   1   1 
## 495 498 500 501 502 504 505 519 520 522 528 535 543 545 548 551 555 558 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0 
## 560 562 564 565 567 569 573 580 581 582 588 590 593 598 604 605 612 614 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0 
## 620 627 630 633 644 647 661 671 683 686 687 688 689 694 700 712 716 717 
##   0   0   0   0   1   1   0   0   0   0   0   0   0   0   0   0   1   1 
## 724 732 735 760 761 771 774 777 778 784 789 800 
##   0   0   0   0   0   0   0   0   0   1   0   1 
## Levels: 0 1

Confusion Matrix

confusionMatrix(tree_pred,pokemon_test$is_legendary,positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 136   3
##          1   7  10
##                                           
##                Accuracy : 0.9359          
##                  95% CI : (0.8853, 0.9688)
##     No Information Rate : 0.9167          
##     P-Value [Acc > NIR] : 0.2405          
##                                           
##                   Kappa : 0.6319          
##                                           
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.76923         
##             Specificity : 0.95105         
##          Pos Pred Value : 0.58824         
##          Neg Pred Value : 0.97842         
##              Prevalence : 0.08333         
##          Detection Rate : 0.06410         
##    Detection Prevalence : 0.10897         
##       Balanced Accuracy : 0.86014         
##                                           
##        'Positive' Class : 1               
##

# Store the confusion matrix once, then extract the metrics by name
cm_tree <- confusionMatrix(tree_pred,pokemon_test$is_legendary,positive = "1")
acc_tree <- cm_tree$overall["Accuracy"]
recall_tree <- cm_tree$byClass["Sensitivity"]

Sensitivity is the recall, which is 0.76923 for this model.

ROC and AUC

tree_prob <- predict(pokemon_tree,pokemon_test,type = "prob")
ROC_Tree <- prediction(predictions = tree_prob[,2],labels = pokemon_test$is_legendary)
perf_Tree <- performance(ROC_Tree, "tpr", "fpr")
plot(perf_Tree)

AUC_tree <- performance(ROC_Tree,"auc")
AUC_tree@y.values[[1]]

## [1] 0.9185046

AUC_tree_value <- AUC_tree@y.values[[1]]

The ROC curve stays well away from the diagonal chance line, which passes through (0.5, 0.5), and bends toward the top-left corner (0, 1). This indicates that our Decision Tree model performs well. The AUC of 0.9185046 supports this, since it is much closer to 1 than to 0.5.

Random Forest

Model

pokemon_forest <- train(is_legendary ~ ., data=pokemon_train, method="rf", trControl = ctrl, ntree = 200)
plot(pokemon_forest)

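Based on the plot of the Random Forest model above, the cross-validated accuracy increases as more predictors are tried at each split (mtry). The best value is mtry = 9, that is, all nine predictors, which is the value used by the final model printed below.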

varImp(pokemon_forest)

## rf variable importance
## 
##                   Overall
## experience_growth 100.000
## speed              64.856
## sp_attack          49.308
## weight_kg          30.321
## defense            18.850
## sp_defense         14.651
## hp                  5.119
## attack              3.383
## height_m            0.000
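
The values above are relative scores between 0 and 100, not percentages of influence. As a rough sketch of how such a scale arises, the snippet below applies min-max scaling to a few hypothetical raw importance values. It assumes the default `scale = TRUE` behaviour of `varImp()`, and the raw numbers are invented for illustration.

```r
# Hypothetical raw importance values; NOT taken from the model above
raw <- c(experience_growth = 40, speed = 27, height_m = 2)

# Shift so the smallest value is 0, then rescale so the largest is 100
scaled <- (raw - min(raw)) / (max(raw) - min(raw)) * 100

round(scaled, 2)
```

Under this scaling the least important variable always gets 0, even when its raw importance is greater than zero.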

pokemon_forest$finalModel

## 
## Call:
##  randomForest(x = x, y = y, ntree = 200, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 9
## 
##         OOB estimate of  error rate: 5.12%
## Confusion matrix:
##     0  1 class.error
## 0 553 16  0.02811951
## 1  16 40  0.28571429

plot(pokemon_forest$finalModel)
legend("topright", colnames(pokemon_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

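* The OOB (out-of-bag) error estimate of the Random Forest model is 5.12%: of the 625 training observations, 32 (16 + 16) are misclassified by the trees that did not see them during fitting.
* The variable importance values are scaled so that the most important variable scores 100 and the least important scores 0. experience_growth is the most important variable (100), followed by speed (64.86) and sp_attack (49.31). height_m scores 0 because it is the least important variable, not necessarily because it has no influence at all.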
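
The OOB error rate and the per-class errors in the printout can be checked directly from the OOB confusion matrix. The sketch below copies the four counts from the `finalModel` output above; the object names are ours.

```r
# OOB confusion matrix copied from the finalModel printout above
# (rows = actual class, columns = OOB prediction)
oob <- matrix(c(553, 16,
                 16, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

oob_error   <- 1 - sum(diag(oob)) / sum(oob)   # overall OOB error rate
class_error <- 1 - diag(oob) / rowSums(oob)    # error rate within each actual class

oob_error
class_error
```

The overall error is low mainly because class 0 dominates. Within class 1 the error is about 28.6%, so legendary Pokemon are noticeably harder for the forest to identify.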

Predict

forest_pred <- predict(pokemon_forest,pokemon_test)
predict(pokemon_forest,pokemon_test)

##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [36] 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0
##  [71] 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
## [141] 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 1
## Levels: 0 1

Confusion Matrix

confusionMatrix(forest_pred,pokemon_test$is_legendary,positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 138   2
##          1   5  11
##                                           
##                Accuracy : 0.9551          
##                  95% CI : (0.9097, 0.9818)
##     No Information Rate : 0.9167          
##     P-Value [Acc > NIR] : 0.0470          
##                                           
##                   Kappa : 0.7342          
##                                           
##  Mcnemar's Test P-Value : 0.4497          
##                                           
##             Sensitivity : 0.84615         
##             Specificity : 0.96503         
##          Pos Pred Value : 0.68750         
##          Neg Pred Value : 0.98571         
##              Prevalence : 0.08333         
##          Detection Rate : 0.07051         
##    Detection Prevalence : 0.10256         
##       Balanced Accuracy : 0.90559         
##                                           
##        'Positive' Class : 1               
##

# Store the confusion matrix once, then extract the metrics by name
cm_forest <- confusionMatrix(forest_pred,pokemon_test$is_legendary,positive = "1")
acc_forest <- cm_forest$overall["Accuracy"]
recall_forest <- cm_forest$byClass["Sensitivity"]

Sensitivity is the recall, which is 0.84615 for this model.

ROC and AUC

forest_prob <- predict(pokemon_forest,pokemon_test,type = "prob")
ROC_Forest <- prediction(predictions = forest_prob[,2],labels = pokemon_test$is_legendary)
perf_Forest <- performance(ROC_Forest, "tpr", "fpr")
plot(perf_Forest)

AUC_forest <- performance(ROC_Forest,"auc")
AUC_forest@y.values[[1]]

## [1] 0.9800968

AUC_forest_value <- AUC_forest@y.values[[1]]

The ROC curve stays well away from the diagonal chance line, which passes through (0.5, 0.5), and bends toward the top-left corner (0, 1). This indicates that our Random Forest model performs well. The AUC of 0.9800968 supports this, since it is much closer to 1 than to 0.5.

Conclusion

# Collect the three metrics for each model into one comparison table
Accuracy <- c(acc_bayes,acc_tree,acc_forest)
AUC_val <- c(AUC_bayes_value,AUC_tree_value,AUC_forest_value)
Recall <- c(recall_bayes,recall_tree,recall_forest)
Comparison <- as.data.frame(rbind(Accuracy,AUC_val,Recall))
names(Comparison) <- c("Naive Bayes", "Decision Tree", "Random Forest")
Comparison
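
The printed table is not shown in this write-up. As a sketch, it can be rebuilt from the rounded values reported in the sections above, which also makes it easy to see which model leads on each metric. The object names `reported` and `best` are ours.

```r
# Values copied from the outputs reported earlier in this document
reported <- data.frame(
  "Naive Bayes"   = c(0.9487, 0.957773, 0.84615),
  "Decision Tree" = c(0.9359, 0.9185046, 0.76923),
  "Random Forest" = c(0.9551, 0.9800968, 0.84615),
  row.names = c("Accuracy", "AUC_val", "Recall"),
  check.names = FALSE
)
reported

# For each metric, list the model(s) that reach the maximum value
best <- apply(reported, 1, function(x) paste(names(x)[x == max(x)], collapse = ", "))
best
```

Random Forest leads on Accuracy and AUC_val, while Recall is tied between Naive Bayes and Random Forest.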

Between the two kinds of error, we would rather have a false positive (actually not legendary, predicted legendary) than a false negative, because we do not want to miss a Pokemon that really is legendary. The metric we prioritize is therefore recall.

Conclusions:
1. The highest Accuracy among the three models belongs to Random Forest
2. The highest AUC_val among the three models belongs to Random Forest
3. The highest Recall among the three models is shared by Naive Bayes and Random Forest
4. The classes are imbalanced, and we did not upsample or downsample because the dataset is small
5. The best model for predicting whether a Pokemon is legendary is Random Forest