Mushroom Classification: Safe to Eat or Deadly Poison?

Introduction
Data Preparation
- Load the Required Package
- Load the Dataset
Exploratory Data Analysis
- Explore Data Variables
- Pre-processing Data
Modeling
Conclusion

Introduction

We will classify the Mushroom dataset, predicting whether a given mushroom is safe to eat or poisonous. The Mushroom data can be downloaded here.

Data Preparation

Load the Required Package

library(DT)           # interactive data tables
library(dplyr)        # data preprocessing
library(caret)        # confusion matrix, model training
library(e1071)        # Naive Bayes classifier
library(rsample)      # data splitting
library(partykit)     # decision tree
library(randomForest) # random forest
library(readr)        # read/write RDS files
library(data.table)   # data.table

# image
library(jpeg)
library(grid)

# plot
library(ggplot2)
library(hrbrthemes)
library(tidyr)
library(viridis)

Load the Dataset

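To build the models we will use the mushrooms.csv data, so the first step is to import mushrooms.csv:

# stringsAsFactors = TRUE keeps every column as a factor (as in the str() output below);
# since R 4.0 read.csv() returns character columns by default.
mushroom <- read.csv("data/mushrooms.csv", stringsAsFactors = TRUE)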

datatable(
  mushroom,
  extensions = 'FixedColumns',
  options = list(
    scrollY = "400px",
    scrollX = TRUE,
    fixedColumns = TRUE
  )
)

Check the data types:

str(mushroom)

## 'data.frame':    8124 obs. of  23 variables:
##  $ class                   : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap.shape               : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
##  $ cap.surface             : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
##  $ cap.color               : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
##  $ gill.attachment         : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill.spacing            : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill.size               : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill.color              : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
##  $ stalk.shape             : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk.root              : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
##  $ stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk.color.above.ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk.color.below.ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil.type               : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
##  $ veil.color              : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring.number             : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring.type               : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore.print.color       : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
##  $ population              : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...

Our data consists of 8,124 rows and 23 columns. The target variable is class, and we will use the remaining variables as predictors for our models.

Exploratory Data Analysis

Exploratory data analysis (EDA) is the stage where we explore the variables in our data and decide whether each one is suitable as a predictor for our models.

Explore Data Variables

The variables in the Mushroom dataset describe characteristics of the mushroom:

A. CAP : top of the mushroom

cap.shape : bell=b, conical=c, flat=f, knobbed=k, sunken=s, convex=x
cap.surface : fibrous=f, grooves=g, smooth=s, scaly=y
cap.color : buff=b, cinnamon=c, red=e, gray=g, brown=n, pink=p, green=r, purple=u, white=w, yellow=y
bruises : no=f, bruises=t

B. odor : almond=a, creosote=c, foul=f, anise=l, musty=m, none=n, pungent=p, spicy=s, fishy=y

C. Gills: each of the gills forming the head.

gill.attachment : attached=a, descending=d, free=f, notched=n
gill.spacing : close=c, distant=d, crowded=w
gill.size : broad=b, narrow=n
gill.color : buff=b, red=e, gray=g, chocolate=h, black=k, brown=n, orange=o, pink=p, green=r, purple=u, white=w, yellow=y

D. Stalk: part of the mushroom between the cap and the soil.

stalk.shape : enlarging=e, tapering=t
stalk.root : missing=?, bulbous=b, club=c, equal=e, rooted=r, cup=u, rhizomorphs=z
stalk.surface.above.ring : fibrous=f, silky=k, smooth=s, scaly=y
stalk.surface.below.ring : fibrous=f, silky=k, smooth=s, scaly=y
stalk.color.above.ring : buff=b, cinnamon=c, red=e, gray=g, brown=n, orange=o, pink=p, white=w, yellow=y
stalk.color.below.ring : buff=b, cinnamon=c, red=e, gray=g, brown=n, orange=o, pink=p, white=w, yellow=y

E. Ring: membrane that envelops part of the stem.

veil.type : partial=p, universal=u
veil.color : brown=n, orange=o, white=w, yellow=y
ring.number : none=n, one=o, two=t
ring.type : cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z

F. spore.print.color : buff=b, chocolate=h, black=k, brown=n, orange=o, green=r, purple=u, white=w, yellow=y

G. population : abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y

H. habitat : woods=d, grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w
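
The single-letter codes are compact but hard to read in tables and plots. Below is a minimal sketch of decoding them with a named lookup vector. The helper name decode_levels and the toy codes vector are hypothetical and only for illustration; the cap.shape mapping is copied from the list above.

```r
# Lookup table for cap.shape, copied from the attribute list above.
cap_shape_labels <- c(b = "bell", c = "conical", f = "flat",
                      k = "knobbed", s = "sunken", x = "convex")

# Hypothetical helper: relabel the levels of a coded factor.
decode_levels <- function(x, labels) {
  x <- factor(x)
  levels(x) <- unname(labels[levels(x)])
  x
}

codes <- factor(c("x", "x", "b", "f"))   # toy values, not real rows
decoded <- decode_levels(codes, cap_shape_labels)
print(table(decoded))
```

The sketch assumes every observed code has an entry in the lookup vector; a code without one would turn into a missing level.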

Pre-processing Data

Because veil.type has the same value in every row, it carries no information for our models. We will remove this variable from the dataset:

mushroom <- mushroom %>% 
  dplyr::select(-veil.type)

str(mushroom)

## 'data.frame':    8124 obs. of  22 variables:
##  $ class                   : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap.shape               : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
##  $ cap.surface             : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
##  $ cap.color               : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
##  $ gill.attachment         : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill.spacing            : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill.size               : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill.color              : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
##  $ stalk.shape             : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk.root              : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
##  $ stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk.color.above.ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk.color.below.ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil.color              : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring.number             : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring.type               : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore.print.color       : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
##  $ population              : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
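
Constant columns such as veil.type can also be found programmatically rather than by reading the str() output. The following is a small sketch on a made-up data frame (the toy object and its values are assumptions, not rows from mushrooms.csv):

```r
# Toy data frame (made up for illustration); 'veil' has one value only.
toy <- data.frame(
  class = factor(c("e", "p", "e", "p")),
  veil  = factor(c("p", "p", "p", "p")),
  ring  = factor(c("o", "t", "o", "o"))
)

# Number of distinct observed values per column.
n_unique <- sapply(toy, function(col) length(unique(col)))

# Columns with a single value cannot help any classifier.
constant_cols <- names(n_unique)[n_unique < 2]
print(constant_cols)

# Drop them while keeping the result a data frame.
toy_clean <- toy[, setdiff(names(toy), constant_cols), drop = FALSE]
```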

We will also remove a perfect separator. The odor variable can separate the data perfectly on its own: when its value is c, f, m, p, s, or y (creosote, foul, musty, pungent, spicy, or fishy), the mushroom is certainly poisonous. We will therefore leave this predictor out.

mushroom <- mushroom %>% 
  dplyr::select(-odor)

str(mushroom)

## 'data.frame':    8124 obs. of  21 variables:
##  $ class                   : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap.shape               : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
##  $ cap.surface             : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
##  $ cap.color               : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
##  $ gill.attachment         : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill.spacing            : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill.size               : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill.color              : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
##  $ stalk.shape             : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk.root              : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
##  $ stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk.color.above.ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk.color.below.ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil.color              : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring.number             : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring.type               : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore.print.color       : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
##  $ population              : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
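
A cross-tabulation of a predictor against the target is a quick way to spot levels that fall entirely into one class. The sketch below uses made-up toy data; on the real data the same idea would be applied to the odor and class columns before odor is dropped.

```r
# Toy data (made up); the real check would use the odor and class columns.
toy <- data.frame(
  odor  = factor(c("a", "a", "f", "f", "n", "n", "n")),
  class = factor(c("e", "e", "p", "p", "e", "e", "p"))
)

# Cross-tabulate predictor levels against the target.
tab <- table(toy$odor, toy$class)
print(tab)

# A level is "pure" when all of its rows fall in a single class.
pure_levels <- rownames(tab)[rowSums(tab > 0) == 1]
print(pure_levels)
```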

Modeling

Cross Validation

Before building the models, we split our data into a training set and a test set (a simple hold-out validation). We will use the training set to fit the models and the test set to validate them. We use 50% of the data as the training set and the rest as the test set.

# Alternative (not used here): stratified split with rsample
# library(rsample)
# split <- initial_split(data = mushroom, prop = 0.5, strata = "class")
# 
# data_train <- training(split)
# data_test <- testing(split)

set.seed(100)
idx <- sample(nrow(mushroom), nrow(mushroom)*0.5)

data_train <- mushroom[idx, ]
data_test <- mushroom[-idx, ]
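
A simple random split does not guarantee that both halves keep the same class proportions. As a hedged illustration, here is a base-R sketch of a stratified split on a made-up target vector; the helper stratified_idx is hypothetical and is not part of the workflow above.

```r
set.seed(100)

# Toy target (made up): 60 edible and 40 poisonous labels.
y <- factor(rep(c("e", "p"), times = c(60, 40)))

# Hypothetical helper: sample a fixed proportion within each class.
stratified_idx <- function(y, prop) {
  by_class <- split(seq_along(y), y)
  picked <- lapply(by_class, function(i) {
    i[sample.int(length(i), floor(length(i) * prop))]
  })
  sort(unlist(picked, use.names = FALSE))
}

idx <- stratified_idx(y, prop = 0.5)

# Class proportions in train and test now match the full data.
print(prop.table(table(y[idx])))
print(prop.table(table(y[-idx])))
```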

Modeling with the Naive Bayes Classifier

We will use the Naive Bayes Classifier method with all predictors:

model_nb <- naiveBayes(formula = class~., data = data_train, laplace = 1)
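
The laplace = 1 argument guards against zero conditional probabilities when a predictor level never occurs together with a class in the training data. The sketch below works through standard additive (Laplace) smoothing by hand on made-up data; that e1071 applies this same formula to categorical predictors is our assumption, and the variable names are illustrative only.

```r
# Toy training data (made up for illustration).
class_lab <- factor(c("e", "e", "e", "p", "p"))
gill_size <- factor(c("b", "b", "b", "n", "b"), levels = c("b", "n"))

laplace <- 1
counts <- table(class_lab, gill_size)

# Unsmoothed estimate of P(gill_size | class); note the zero for (e, n).
raw <- counts / rowSums(counts)

# Additive smoothing: add `laplace` to every cell and
# `laplace * number_of_levels` to each class total.
smoothed <- (counts + laplace) / (rowSums(counts) + laplace * nlevels(gill_size))

print(round(raw, 3))
print(round(smoothed, 3))
```

Without smoothing, a single zero would wipe out the whole product of conditional probabilities for that class; with it, each row still sums to one but no cell is exactly zero.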

Predict with model_nb on new data, namely data_test:

model_nb_pred <- predict(model_nb, newdata = data_test)

Evaluate the model with confusionMatrix(), focusing on mushrooms in the poisonous category (positive = "p"):

cm_nb <- confusionMatrix(model_nb_pred, data_test$class, positive = "p")
cm_nb

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    e    p
##          e 2006  385
##          p   65 1606
##                                                
##                Accuracy : 0.8892               
##                  95% CI : (0.8792, 0.8987)     
##     No Information Rate : 0.5098               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.7777               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.8066               
##             Specificity : 0.9686               
##          Pos Pred Value : 0.9611               
##          Neg Pred Value : 0.8390               
##              Prevalence : 0.4902               
##          Detection Rate : 0.3954               
##    Detection Prevalence : 0.4114               
##       Balanced Accuracy : 0.8876               
##                                                
##        'Positive' Class : p                    
##
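
To make the reported statistics concrete, the headline metrics can be recomputed by hand from the four counts in the confusion matrix above. This is only a worked check; the variable names are our own.

```r
# Counts copied from the confusion matrix above (positive class = "p").
tp <- 1606  # poisonous predicted poisonous
fn <- 385   # poisonous predicted edible
fp <- 65    # edible predicted poisonous
tn <- 2006  # edible predicted edible

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # recall for the poisonous class
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # reported as Pos Pred Value

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```

For this problem sensitivity is the metric to watch: each false negative is a poisonous mushroom labeled as edible.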

Modeling with a Decision Tree

We will use the Decision Tree method with all predictors:

model_tree <- ctree(formula = class~., data = data_train)

Predict with model_tree on new data, namely data_test:

model_tree_pred <- predict(model_tree, newdata = data_test)

Visualize the model:

plot(model_tree, type="simple")

Evaluate the model with confusionMatrix(), focusing on mushrooms in the poisonous category (positive = "p"):

cm_tree <- confusionMatrix(model_tree_pred, data_test$class, positive = "p")
cm_tree

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    e    p
##          e 2071   12
##          p    0 1979
##                                                
##                Accuracy : 0.997                
##                  95% CI : (0.9948, 0.9985)     
##     No Information Rate : 0.5098               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9941               
##                                                
##  Mcnemar's Test P-Value : 0.001496             
##                                                
##             Sensitivity : 0.9940               
##             Specificity : 1.0000               
##          Pos Pred Value : 1.0000               
##          Neg Pred Value : 0.9942               
##              Prevalence : 0.4902               
##          Detection Rate : 0.4872               
##    Detection Prevalence : 0.4872               
##       Balanced Accuracy : 0.9970               
##                                                
##        'Positive' Class : p                    
##

Modeling with Random Forest

We will use the Random Forest method with all predictors:

set.seed(100)
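# `repeats` only takes effect with method = "repeatedcv"; with method = "cv" it is ignored.
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)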

model_rf <- train(class ~ ., data = data_train, method = "rf", trControl = ctrl)
write_rds(model_rf, "model_rf.rds")

Predict with model_rf on new data, namely data_test:

model_rf <- readRDS("model_rf.rds")
model_rf_pred <- predict(model_rf, newdata = data_test)
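
Because training with resampling is slow, saving the fitted model and reading it back avoids refitting on every run. One possible way to wrap that pattern is sketched below; the helper cached_fit is hypothetical, and a cheap lm() fit on the built-in cars data stands in for the expensive train() call.

```r
# Hypothetical helper: reuse a saved model if present, otherwise fit and save.
cached_fit <- function(path, fit_fun) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  model <- fit_fun()
  saveRDS(model, path)
  model
}

# Stand-in for an expensive train() call, using a cheap base-R model.
path <- tempfile(fileext = ".rds")
fit_calls <- 0
fit_fun <- function() {
  fit_calls <<- fit_calls + 1
  lm(dist ~ speed, data = cars)
}

m1 <- cached_fit(path, fit_fun)  # fits and writes the file
m2 <- cached_fit(path, fit_fun)  # reads the file, no refit
print(fit_calls)
```

A cached file goes stale if the data or the training settings change, so it should be deleted whenever either is modified.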

The 20 predictors considered most important by our Random Forest model are:

varImp(model_rf)

## rf variable importance
## 
##   only 20 most important variables shown (out of 87)
## 
##                           Overall
## gill.sizen                100.000
## stalk.surface.above.ringk  66.151
## stalk.surface.below.ringk  52.249
## spore.print.colorh         47.838
## gill.spacingw              15.812
## ring.typep                 14.174
## spore.print.colorw         12.014
## bruisest                    9.942
## stalk.shapet                9.837
## ring.numbert                9.374
## habitatu                    8.749
## populationv                 8.696
## spore.print.colorr          8.153
## populationy                 6.958
## stalk.rootb                 6.487
## ring.typel                  6.228
## stalk.roote                 5.460
## ring.numbero                5.333
## ring.typef                  2.302
## cap.surfaces                2.294
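
Names such as gill.sizen are a factor name followed by one of its levels, which suggests that the formula interface expanded each factor into dummy columns. Assuming treatment contrasts (one reference level dropped per factor), the count of 87 can be checked from the level counts in the str() output; the toy data frame at the end is made up.

```r
# Level counts of the 20 predictors, copied from the str() output above.
n_levels <- c(cap.shape = 6, cap.surface = 4, cap.color = 10, bruises = 2,
              gill.attachment = 2, gill.spacing = 2, gill.size = 2,
              gill.color = 12, stalk.shape = 2, stalk.root = 5,
              stalk.surface.above.ring = 4, stalk.surface.below.ring = 4,
              stalk.color.above.ring = 9, stalk.color.below.ring = 9,
              veil.color = 4, ring.number = 3, ring.type = 5,
              spore.print.color = 9, population = 6, habitat = 7)

# Treatment contrasts drop one reference level per factor.
n_dummies <- sum(n_levels - 1)
print(n_dummies)

# Toy illustration of how the dummy column names are formed.
toy <- data.frame(gill.size = factor(c("b", "n", "n")))
dummy_names <- colnames(model.matrix(~ gill.size, data = toy))
print(dummy_names)
```

Read this way, gill.sizen is the indicator for gill.size = n (narrow), with b (broad) as the reference level.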

Evaluate the model with confusionMatrix(), focusing on mushrooms in the poisonous category (positive = "p"):

cm_rf <- confusionMatrix(model_rf_pred, data_test$class, positive = "p")
cm_rf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    e    p
##          e 2071    0
##          p    0 1991
##                                                
##                Accuracy : 1                    
##                  95% CI : (0.9991, 1)          
##     No Information Rate : 0.5098               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 1                    
##                                                
##  Mcnemar's Test P-Value : NA                   
##                                                
##             Sensitivity : 1.0000               
##             Specificity : 1.0000               
##          Pos Pred Value : 1.0000               
##          Neg Pred Value : 1.0000               
##              Prevalence : 0.4902               
##          Detection Rate : 0.4902               
##    Detection Prevalence : 0.4902               
##       Balanced Accuracy : 1.0000               
##                                                
##        'Positive' Class : p                    
##

Conclusion

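Based on the results above, the three models performed as follows on the test set:

- Naive Bayes: accuracy 0.8892, sensitivity 0.8066, specificity 0.9686
- Decision Tree: accuracy 0.9970, sensitivity 0.9940, specificity 1.0000
- Random Forest: accuracy 1.0000, sensitivity 1.0000, specificity 1.0000

Of the three models, Random Forest gives the best predictions: it classified every test mushroom correctly, while Naive Bayes labeled 385 poisonous mushrooms as edible and the Decision Tree labeled 12.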
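
The comparison can be recomputed directly from the confusion-matrix counts reported in the Modeling section. The sketch below collects them in one data frame and ranks the models by sensitivity; the column names are our own choice.

```r
# Confusion-matrix counts copied from the outputs above (positive = "p").
results <- data.frame(
  model = c("Naive Bayes", "Decision Tree", "Random Forest"),
  tp = c(1606, 1979, 1991),
  fn = c(385, 12, 0),
  fp = c(65, 0, 0),
  tn = c(2006, 2071, 2071)
)

results$accuracy    <- with(results, (tp + tn) / (tp + tn + fp + fn))
results$sensitivity <- with(results, tp / (tp + fn))
results$specificity <- with(results, tn / (tn + fp))

# Rank by sensitivity: missed poisonous mushrooms are the costliest error.
results <- results[order(-results$sensitivity), ]
print(format(results, digits = 4))
```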