A mushroom or toadstool is the fleshy, spore-bearing fruiting body of a fungus, typically produced above ground, on soil, or on its food source. The terms “mushroom” and “toadstool” go back centuries and were never precisely defined, nor was there consensus on application. During the 15th and 16th centuries, the terms mushrom, mushrum, muscheron, mousheroms, mussheron, or musserouns were used. There are edible mushrooms and poisonous mushrooms.
Mushrooms are used extensively in cooking, in many cuisines (notably Chinese, Korean, European, and Japanese). Most mushrooms sold in supermarkets have been commercially grown on mushroom farms. The most popular of these, Agaricus bisporus, is considered safe for most people to eat because it is grown in controlled, sterilized environments.
A number of species of mushrooms are poisonous; although some resemble certain edible species, consuming them could be fatal. Eating mushrooms gathered in the wild is risky and should only be undertaken by individuals knowledgeable in mushroom identification. Common best practice is for wild mushroom pickers to focus on collecting a small number of visually distinctive, edible mushroom species that cannot be easily confused with poisonous varieties.
Separating edible from poisonous species requires meticulous attention to detail; there is no single trait by which all toxic mushrooms can be identified, nor one by which all edible mushrooms can be identified. Identifying mushrooms requires a basic understanding of their macroscopic structure. These days, identification may also require microscopic analysis. In this case, however, we need to decide whether a mushroom is edible or not from macroscopic features only.
For this case, the dataset comes from Kaggle. It contains macroscopic characteristics of edible and poisonous mushrooms. From this dataset, we will build machine learning models with 3 methods: Naive Bayes, Decision Tree, and Random Forest. We will see which one performs best at predicting whether a mushroom is edible or poisonous based on its macroscopic characteristics.
Load the required libraries.
library(tidyverse)
library(e1071)
library(partykit)
library(randomForest)
library(caret)
library(ROCR)
This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981).
mushroom <- read.csv("mushrooms.csv")
head(mushroom)
Let’s see the data structure.
str(mushroom)
## 'data.frame': 8124 obs. of 23 variables:
## $ class : chr "p" "e" "e" "p" ...
## $ cap.shape : chr "x" "x" "b" "x" ...
## $ cap.surface : chr "s" "s" "s" "y" ...
## $ cap.color : chr "n" "y" "w" "w" ...
## $ bruises : chr "t" "t" "t" "t" ...
## $ odor : chr "p" "a" "l" "p" ...
## $ gill.attachment : chr "f" "f" "f" "f" ...
## $ gill.spacing : chr "c" "c" "c" "c" ...
## $ gill.size : chr "n" "b" "b" "n" ...
## $ gill.color : chr "k" "k" "n" "n" ...
## $ stalk.shape : chr "e" "e" "e" "e" ...
## $ stalk.root : chr "e" "c" "c" "e" ...
## $ stalk.surface.above.ring: chr "s" "s" "s" "s" ...
## $ stalk.surface.below.ring: chr "s" "s" "s" "s" ...
## $ stalk.color.above.ring : chr "w" "w" "w" "w" ...
## $ stalk.color.below.ring : chr "w" "w" "w" "w" ...
## $ veil.type : chr "p" "p" "p" "p" ...
## $ veil.color : chr "w" "w" "w" "w" ...
## $ ring.number : chr "o" "o" "o" "o" ...
## $ ring.type : chr "p" "p" "p" "p" ...
## $ spore.print.color : chr "k" "n" "n" "k" ...
## $ population : chr "s" "n" "n" "s" ...
## $ habitat : chr "u" "g" "m" "u" ...
We can see from the structure that all the columns are of type character, meaning we are dealing with a purely categorical dataset. We need to convert all of them to factors.
mushroom <- mushroom %>%
  mutate_if(is.character, as.factor)
str(mushroom)
## 'data.frame': 8124 obs. of 23 variables:
## $ class : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap.shape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ cap.surface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
## $ cap.color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
## $ bruises : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ gill.attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill.spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill.size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill.color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
## $ stalk.shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk.root : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
## $ stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk.color.above.ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ stalk.color.below.ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ veil.type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
## $ veil.color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ ring.number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring.type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ spore.print.color : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
## $ population : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
From the result, we can drop veil.type because it has only 1 level and therefore cannot help separate the two classes.
mushroom <- mushroom %>%
  select(-veil.type)
str(mushroom)
## 'data.frame': 8124 obs. of 22 variables:
## $ class : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap.shape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ cap.surface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
## $ cap.color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
## $ bruises : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ gill.attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill.spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill.size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill.color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
## $ stalk.shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk.root : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
## $ stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk.color.above.ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ stalk.color.below.ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ veil.color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ ring.number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring.type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ spore.print.color : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
## $ population : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
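Constant columns such as veil.type can also be found programmatically rather than by reading the `str()` output. The following is a minimal sketch on a small toy data frame (the values and the names `toy`, `constant_cols`, and `toy_clean` are hypothetical, not part of the original analysis); it counts the levels of each factor and drops any column with fewer than two.

```r
# Toy data frame standing in for the mushroom data (hypothetical values).
toy <- data.frame(
  class     = factor(c("e", "p", "e", "p")),
  veil.type = factor(c("p", "p", "p", "p")),
  odor      = factor(c("n", "f", "a", "f"))
)

# Count the levels of every column; a factor with one level is constant.
n_levels <- sapply(toy, nlevels)
constant_cols <- names(n_levels)[n_levels < 2]

# Drop the constant columns and keep the rest.
toy_clean <- toy[, !(names(toy) %in% constant_cols), drop = FALSE]

constant_cols
names(toy_clean)
```

The same two lines (`sapply(..., nlevels)` and the filter) should work on the real `mushroom` data frame once its columns are factors.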
Let’s check if there is any NA in the dataset.
colSums(is.na(mushroom))
## class cap.shape cap.surface
## 0 0 0
## cap.color bruises odor
## 0 0 0
## gill.attachment gill.spacing gill.size
## 0 0 0
## gill.color stalk.shape stalk.root
## 0 0 0
## stalk.surface.above.ring stalk.surface.below.ring stalk.color.above.ring
## 0 0 0
## stalk.color.below.ring veil.color ring.number
## 0 0 0
## ring.type spore.print.color population
## 0 0 0
## habitat
## 0
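Fortunately, there are no NA values, so we can continue to the next step. Note that stalk.root has a "?" level, which is.na() does not count as missing; in this analysis it is simply treated as one more category.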
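If the "?" level of stalk.root should be treated as missing instead, it can be recoded before the NA check. This is only a sketch on a toy vector (the values and the names `stalk_root` and `stalk_root_na` are hypothetical); the original analysis keeps "?" as a regular level.

```r
# Toy vector standing in for mushroom$stalk.root (hypothetical values).
stalk_root <- factor(c("e", "c", "?", "b", "?", "e"))

# is.na() finds nothing because "?" is an ordinary factor level.
n_na_before <- sum(is.na(stalk_root))

# Recode "?" as NA, then drop the now-unused level.
stalk_root_na <- stalk_root
stalk_root_na[stalk_root_na == "?"] <- NA
stalk_root_na <- droplevels(stalk_root_na)
n_na_after <- sum(is.na(stalk_root_na))

c(before = n_na_before, after = n_na_after)
```

Whether to recode depends on the model: recoding turns those rows into missing data that must then be imputed or dropped, while keeping "?" lets the models use "unknown root" as information.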
Before we build the prediction models, we need to split the data into training data and test data. Although random forest does not strictly need a separate validation set because it already provides an out-of-bag (OOB) error estimate, we still hold out test data so that all models are evaluated in the same way. For this case, we use 80% of the data for training and 20% for testing.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
index <- sample(x = nrow(mushroom), nrow(mushroom) * 0.80)
mushroom_train <- mushroom[index,]
mushroom_test <- mushroom[-index,]
After we split the data, we need to check whether the target classes are imbalanced.
prop.table(table(mushroom_train$class))
##
## e p
## 0.5183874 0.4816126
As we can see, the training data is not imbalanced. Since there is no problem, we continue to build the prediction models. For this case, we will use 3 models (Naive Bayes, Decision Tree, and Random Forest).
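The split above uses simple random sampling, which happened to keep the classes balanced. As an alternative, a stratified split samples 80% within each class so the class proportions are preserved by construction. The sketch below is a different technique from the one used in this analysis, shown on toy labels (the 60/40 label vector and all variable names are hypothetical).

```r
set.seed(123)

# Toy labels standing in for mushroom$class (hypothetical 60/40 mix).
class_toy <- factor(rep(c("e", "p"), times = c(60, 40)))

# Sample 80% of the row indices within each class separately.
rows_by_class <- split(seq_along(class_toy), class_toy)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.8 * length(rows)))]
}))

# Class proportions are identical in both partitions.
train_prop <- prop.table(table(class_toy[train_idx]))
test_prop  <- prop.table(table(class_toy[-train_idx]))
train_prop
test_prop
```

With nearly balanced classes, as here, simple random sampling is usually sufficient; stratification matters more when one class is rare.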
This is how we create the model for Naive Bayes.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
naive_model <- naiveBayes(x = mushroom_train %>% select(-class),
                          y = mushroom_train$class,
                          laplace = 1)
naive_model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = mushroom_train %>% select(-class), y = mushroom_train$class,
## laplace = 1)
##
## A-priori probabilities:
## mushroom_train$class
## e p
## 0.5183874 0.4816126
##
## Conditional probabilities:
## cap.shape
## mushroom_train$class b c f k
## e 0.0915555556 0.0002962963 0.3760000000 0.0521481481
## p 0.0137117347 0.0012755102 0.3963647959 0.1517857143
## cap.shape
## mushroom_train$class s x
## e 0.0085925926 0.4714074074
## p 0.0003188776 0.4365433673
##
## cap.surface
## mushroom_train$class f g s y
## e 0.3702935073 0.0002964720 0.2680106730 0.3613993478
## p 0.1968730057 0.0009572431 0.3596043395 0.4425654116
##
## cap.color
## mushroom_train$class b c e g
## e 0.0103580941 0.0071026931 0.1497484463 0.2441550755
## p 0.0312101911 0.0031847134 0.2156050955 0.2079617834
## cap.color
## mushroom_train$class n p r u
## e 0.2989050015 0.0147972773 0.0044391832 0.0044391832
## p 0.2595541401 0.0235668790 0.0003184713 0.0003184713
## cap.color
## mushroom_train$class w y
## e 0.1701686890 0.0958863569
## p 0.0837579618 0.1745222930
##
## bruises
## mushroom_train$class f t
## e 0.3444082 0.6555918
## p 0.8384419 0.1615581
##
## odor
## mushroom_train$class a c f l
## e 0.0976909414 0.0002960332 0.0002960332 0.0941385435
## p 0.0003185728 0.0503345014 0.5517680790 0.0003185728
## odor
## mushroom_train$class m n p s
## e 0.0002960332 0.8063943162 0.0002960332 0.0002960332
## p 0.0089200382 0.0324944250 0.0669002867 0.1462249124
## odor
## mushroom_train$class y
## e 0.0002960332
## p 0.1427206117
##
## gill.attachment
## mushroom_train$class a f
## e 0.043903886 0.956096114
## p 0.005427842 0.994572158
##
## gill.spacing
## mushroom_train$class c w
## e 0.71610798 0.28389202
## p 0.97126437 0.02873563
##
## gill.size
## mushroom_train$class b n
## e 0.93177099 0.06822901
## p 0.43773946 0.56226054
##
## gill.color
## mushroom_train$class b e g h
## e 0.0002957705 0.0230700976 0.0603371783 0.0499852115
## p 0.4315722470 0.0003182686 0.1333545512 0.1346276257
## gill.color
## mushroom_train$class k n o p
## e 0.0792664892 0.2212363206 0.0168589175 0.2023070098
## p 0.0162316996 0.0286441757 0.0003182686 0.1651814131
## gill.color
## mushroom_train$class r u w y
## e 0.0002957705 0.1064773736 0.2253771074 0.0144927536
## p 0.0057288351 0.0127307447 0.0652450668 0.0060471038
##
## stalk.shape
## mushroom_train$class e t
## e 0.3835657 0.6164343
## p 0.4939336 0.5060664
##
## stalk.root
## mushroom_train$class ? b c e
## e 0.1680497925 0.4555423829 0.1218138708 0.2065797273
## p 0.4414673046 0.4803827751 0.0108452951 0.0669856459
## stalk.root
## mushroom_train$class r
## e 0.0480142264
## p 0.0003189793
##
## stalk.surface.above.ring
## mushroom_train$class f k s y
## e 0.097242811 0.035280166 0.863029944 0.004447080
## p 0.037332482 0.567326101 0.393107849 0.002233567
##
## stalk.surface.below.ring
## mushroom_train$class f k s y
## e 0.10999111 0.03172250 0.80610732 0.05217907
## p 0.03605616 0.55264837 0.39151244 0.01978302
##
## stalk.color.above.ring
## mushroom_train$class b c e g
## e 0.0002960332 0.0002960332 0.0224985198 0.1358792185
## p 0.1137304874 0.0089200382 0.0003185728 0.0003185728
## stalk.color.above.ring
## mushroom_train$class n o p w
## e 0.0044404973 0.0438129070 0.1361752516 0.6563055062
## p 0.1134119146 0.0003185728 0.3265371137 0.4342147181
## stalk.color.above.ring
## mushroom_train$class y
## e 0.0002960332
## p 0.0022300096
##
## stalk.color.below.ring
## mushroom_train$class b c e g
## e 0.0002960332 0.0002960332 0.0219064535 0.1367673179
## p 0.1130933418 0.0089200382 0.0003185728 0.0003185728
## stalk.color.below.ring
## mushroom_train$class n o p w
## e 0.0150976909 0.0438129070 0.1343990527 0.6471284784
## p 0.1140490602 0.0003185728 0.3313157056 0.4252946798
## stalk.color.below.ring
## mushroom_train$class y
## e 0.0002960332
## p 0.0063714559
##
## veil.color
## mushroom_train$class n o w y
## e 0.021049511 0.023124815 0.955529202 0.000296472
## p 0.000319081 0.000319081 0.997128271 0.002233567
##
## ring.number
## mushroom_train$class n o t
## e 0.0002965599 0.8748517200 0.1248517200
## p 0.0089371210 0.9709543568 0.0201085222
##
## ring.type
## mushroom_train$class e f l n
## e 0.2391819798 0.0121517487 0.0002963841 0.0002963841
## p 0.4433811802 0.0003189793 0.3358851675 0.0089314195
## ring.type
## mushroom_train$class p
## e 0.7480735033
## p 0.2114832536
##
## spore.print.color
## mushroom_train$class b h k n
## e 0.0109532268 0.0121373594 0.3975725281 0.4100059207
## p 0.0003185728 0.4084103218 0.0602102580 0.0570245301
## spore.print.color
## mushroom_train$class o r u w
## e 0.0127294257 0.0002960332 0.0109532268 0.1352871522
## p 0.0003185728 0.0200700860 0.0003185728 0.4530105129
## spore.print.color
## mushroom_train$class y
## e 0.0100651273
## p 0.0003185728
##
## population
## mushroom_train$class a c n s
## e 0.0909629630 0.0675555556 0.0957037037 0.2097777778
## p 0.0003188776 0.0124362245 0.0003188776 0.0937500000
## population
## mushroom_train$class v y
## e 0.2838518519 0.2521481481
## p 0.7190688776 0.1741071429
##
## habitat
## mushroom_train$class d g l m
## e 0.4466824645 0.3335308057 0.0545023697 0.0619075829
## p 0.3222824354 0.1985973860 0.1517373287 0.0098820529
## habitat
## mushroom_train$class p u w
## e 0.0346563981 0.0234004739 0.0453199052
## p 0.2496015301 0.0675804909 0.0003187759
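The very small probabilities in the tables above (for example 0.0002962963 for cap.shape "c" among edible mushrooms) come from Laplace smoothing: with `laplace = 1`, one pseudo-count is added to every level so that no conditional probability is exactly zero. The sketch below reproduces that one value by hand. It assumes 3369 edible training rows (the count that appears in the random forest OOB confusion matrix later in this analysis) and assumes that no edible training mushroom has cap.shape "c"; both assumptions are consistent with the printed output but are not stated in it directly.

```r
n_e      <- 3369  # edible rows in the training data (assumed, see lead-in)
n_levels <- 6     # number of cap.shape levels (b, c, f, k, s, x)
laplace  <- 1     # smoothing value passed to naiveBayes()
count_c  <- 0     # assumed number of edible rows with cap.shape == "c"

# Smoothed estimate of P(cap.shape = "c" | class = "e").
p_c_given_e <- (count_c + laplace) / (n_e + laplace * n_levels)
p_c_given_e
```

Without smoothing the estimate would be 0, and a single zero would force the whole product of conditional probabilities for that class to 0 regardless of the other features.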
This is how we create the model for Decision Tree.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
dtree_model <- ctree(formula = class ~ .,
                     data = mushroom_train)
plot(dtree_model, type = "simple")
This is how we create the model for Random Forest.
#RNGkind(sample.kind = "Rounding")
#set.seed(123)
#ctrl <- trainControl(method = "repeatedcv",
# number = 5,
# repeats = 3)
#rf_model <- train(class ~ .,
# data = mushroom_train,
# method = "rf",
# trControl = ctrl)
#saveRDS(rf_model, "mushroom_randomforest.RDS")
For knitting purposes, we load the model that was already saved to an RDS file.
rf_model <- readRDS("mushroom_randomforest.RDS")
Let’s evaluate the Naive Bayes model on the test data.
prediction_naive_class <- predict(naive_model,
                                  mushroom_test,
                                  type = "class")
confusionMatrix(prediction_naive_class,
                mushroom_test$class,
                positive = "e")
## Confusion Matrix and Statistics
##
## Reference
## Prediction e p
## e 830 61
## p 9 725
##
## Accuracy : 0.9569
## 95% CI : (0.9459, 0.9663)
## No Information Rate : 0.5163
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9136
##
## Mcnemar's Test P-Value : 1.09e-09
##
## Sensitivity : 0.9893
## Specificity : 0.9224
## Pos Pred Value : 0.9315
## Neg Pred Value : 0.9877
## Prevalence : 0.5163
## Detection Rate : 0.5108
## Detection Prevalence : 0.5483
## Balanced Accuracy : 0.9558
##
## 'Positive' Class : e
##
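As you can see from the confusion matrix, the Naive Bayes model already has high accuracy, sensitivity, specificity, and precision. Note, however, that 61 poisonous mushrooms were predicted as edible, which is the more dangerous kind of error here.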
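The headline metrics can be recomputed by hand from the four cells of the Naive Bayes confusion matrix, with "e" as the positive class. This is a small worked sketch using the counts printed above; the variable names are illustrative only.

```r
tp <- 830  # edible, predicted edible
fp <- 61   # poisonous, predicted edible
fn <- 9    # edible, predicted poisonous
tn <- 725  # poisonous, predicted poisonous

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)   # share of edible mushrooms found
specificity <- tn / (tn + fp)   # share of poisonous mushrooms found
precision   <- tp / (tp + fp)   # "Pos Pred Value" in the caret output

nb_metrics <- round(c(accuracy = accuracy, sensitivity = sensitivity,
                      specificity = specificity, precision = precision), 4)
nb_metrics
```

The rounded values agree with the Accuracy, Sensitivity, Specificity, and Pos Pred Value lines reported by `confusionMatrix()`.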
prediction_naive_raw <- predict(naive_model,
                                mushroom_test,
                                type = "raw")
data_roc <- data.frame(pred_prob = prediction_naive_raw[,"e"],
                       actual = ifelse(mushroom_test$class == "e", 1, 0))
prediction_roc <- prediction(predictions = data_roc$pred_prob,
                             labels = data_roc$actual)
auc_number <- performance(prediction_roc, measure = "auc")
plot(performance(prediction_roc, "tpr", "fpr"))
abline(0, 1, lty = 2)
# Round the AUC to 2 decimal places before printing it on the plot
text(0.4, 0.6, paste("AUC = ", round(auc_number@y.values[[1]], 2)))
If we look at the AUC, the model is good because the value is close to 1.
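To make the AUC less of a black box: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it from ranks in base R on toy scores (the scores, labels, and the name `auc_rank` are hypothetical and unrelated to the mushroom predictions); it is an illustration of the idea, not the ROCR implementation.

```r
# Toy predicted probabilities and true labels (1 = positive class).
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4)
labels <- c(1, 1, 0, 1, 0, 0)

# Rank all scores, then use the rank sum of the positives
# (the Mann-Whitney U statistic scaled to the 0..1 range).
r     <- rank(scores)
n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)
auc_rank <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc_rank
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9. A value of 0.5 corresponds to the dashed diagonal in the ROC plot.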
Let’s evaluate the Decision Tree model on the test data.
prediction_dtree <- predict(dtree_model,
                            mushroom_test)
confusionMatrix(prediction_dtree,
                mushroom_test$class,
                positive = "e")
## Confusion Matrix and Statistics
##
## Reference
## Prediction e p
## e 839 3
## p 0 783
##
## Accuracy : 0.9982
## 95% CI : (0.9946, 0.9996)
## No Information Rate : 0.5163
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9963
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 0.9962
## Pos Pred Value : 0.9964
## Neg Pred Value : 1.0000
## Prevalence : 0.5163
## Detection Rate : 0.5163
## Detection Prevalence : 0.5182
## Balanced Accuracy : 0.9981
##
## 'Positive' Class : e
##
As we can see, this model performs better than Naive Bayes. Even though decision trees are prone to overfitting, the performance on the test data is still very good, so we don’t need to check the confusion matrix for the training data.
Before we evaluate the Random Forest model, let’s look at which variables are most important in it.
varImp(rf_model)
## rf variable importance
##
## only 20 most important variables shown (out of 95)
##
## Overall
## odorn 100.000
## odorf 31.304
## gill.sizen 31.231
## stalk.rootc 16.278
## stalk.surface.above.ringk 10.503
## bruisest 9.301
## spore.print.colorr 7.876
## stalk.surface.below.ringk 6.337
## stalk.surface.below.ringy 5.651
## ring.typep 4.598
## odorl 4.380
## stalk.rootr 4.234
## spore.print.colorh 4.037
## gill.spacingw 3.792
## odorp 2.298
## cap.colory 2.193
## spore.print.colorw 2.098
## stalk.roote 2.096
## odorc 1.753
## ring.numbert 1.571
plot(varImp(rf_model))
Based on the result, odor is the most important variable in this model. If we want to improve the other models, we can do variable selection based on this result.
rf_model$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 48
##
## OOB estimate of error rate: 0%
## Confusion matrix:
## e p class.error
## e 3369 0 0
## p 0 3130 0
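The OOB estimate in the output above exists because each tree is grown on a bootstrap sample, which leaves out roughly a third of the training rows; those left-out rows act as a built-in test set for that tree. The sketch below simulates one bootstrap sample of the same size as the training data (3369 + 3130 = 6499 rows) to show the left-out fraction. It is an illustration of the idea, not the randomForest internals, and the variable names are hypothetical.

```r
set.seed(123)

n <- 6499  # number of training rows (3369 edible + 3130 poisonous)

# One bootstrap sample: draw n row indices with replacement.
boot_idx <- sample.int(n, size = n, replace = TRUE)

# Rows never drawn are "out of bag" for the tree grown on this sample.
oob_fraction <- 1 - length(unique(boot_idx)) / n
oob_fraction
```

The fraction is close to exp(-1), about 0.368, which is why every row is out of bag for a sizeable share of the 500 trees and can be scored without a separate hold-out set.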
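Based on the random forest output, the OOB error estimate is 0%. This suggests that the model should classify new data drawn from the same dataset very accurately, although it is not a guarantee for arbitrary new mushrooms. Let’s check it on the test data.
prediction_rf <- predict(rf_model, mushroom_test, type = "raw")
confusionMatrix(data = prediction_rf,
                reference = mushroom_test$class,
                positive = "e")
## Confusion Matrix and Statistics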
##
## Reference
## Prediction e p
## e 839 0
## p 0 786
##
## Accuracy : 1
## 95% CI : (0.9977, 1)
## No Information Rate : 0.5163
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5163
## Detection Rate : 0.5163
## Detection Prevalence : 0.5163
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : e
##
Based on the result, the model classifies every mushroom in the test set correctly.
Random Forest has the best performance of the three, but the model is hard to interpret. To interpret which variables make a mushroom poisonous or not, we can use the Naive Bayes or Decision Tree model. In this case, the Decision Tree model performs better than Naive Bayes.
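The comparison can be summarised in one small table built from the three test-set confusion matrices reported earlier. This is a sketch that only re-uses the printed counts; the data frame `results` and its column names are illustrative.

```r
# Counts taken from the three confusion matrices on the test data.
results <- data.frame(
  model               = c("Naive Bayes", "Decision Tree", "Random Forest"),
  correct             = c(830 + 725, 839 + 783, 839 + 786),
  poisonous_as_edible = c(61, 3, 0),
  total               = 1625
)

results$accuracy <- round(results$correct / results$total, 4)

# Sort from best to worst accuracy.
results <- results[order(-results$accuracy), ]
results
```

The `poisonous_as_edible` column is included because, for this problem, predicting a poisonous mushroom as edible is the costlier mistake, so it is worth reading alongside accuracy.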