Introduction

Given a dataset of mushrooms, answer the following questions.

  1. Which characteristics make a mushroom edible or poisonous?

  2. How good or reliable is the model? Would the model help you decide whether or not to eat a mushroom you find?

Dataset

# DATASET is found in http://archive.ics.uci.edu/ml/datasets/Mushroom

#theUrl = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
#mushrooms = read.table(file = theUrl, header = FALSE, sep = ",")

# thanks to https://raw.githubusercontent.com/stoltzmaniac/Mushroom-Classification/master/helper_functions.R
#source('helper_functions.R')

#Import Data via Custom Function
#data = fetchAndCleanData()
#head(data)

#save dataset to a file
#library(rio)
#export(data, "mushrooms.csv")

# Once the data has been exported to mushrooms.csv, that file is used as the dataset.
## 'data.frame':    8124 obs. of  23 variables:
##  $ Edible               : Factor w/ 2 levels "Edible","Poisonous": 2 1 1 2 1 1 1 1 2 1 ...
##  $ CapShape             : Factor w/ 6 levels "Bell","Conical",..: 3 3 1 3 3 3 1 1 3 1 ...
##  $ CapSurface           : Factor w/ 4 levels "Fibrous","Grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
##  $ CapColor             : Factor w/ 10 levels "Brown","Buff",..: 1 10 9 9 4 10 9 9 9 10 ...
##  $ Bruises              : Factor w/ 2 levels "False","True": 2 2 2 2 1 2 2 2 2 2 ...
##  $ Odor                 : Factor w/ 9 levels "Almond","Anise",..: 8 1 2 8 7 1 1 2 8 1 ...
##  $ GillAttachment       : Factor w/ 2 levels "Attached","Free": 2 2 2 2 2 2 2 2 2 2 ...
##  $ GillSpacing          : Factor w/ 2 levels "Close","Crowded": 1 1 1 1 2 1 1 1 1 1 ...
##  $ GillSize             : Factor w/ 2 levels "Broad","Narrow": 2 1 1 2 1 1 1 1 2 1 ...
##  $ GillColor            : Factor w/ 12 levels "Black","Brown",..: 1 1 2 2 1 2 5 2 8 5 ...
##  $ StalkShape           : Factor w/ 2 levels "Enlarging","Tapering": 1 1 1 1 2 1 1 1 1 1 ...
##  $ StalkRoot            : Factor w/ 5 levels "Bulbous","Club",..: 3 2 2 3 3 2 2 2 3 2 ...
##  $ StalkSurfaceAboveRing: Factor w/ 4 levels "Fibrous","Scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ StalkSurfaceBelowRing: Factor w/ 4 levels "Fibrous","Scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ StalkColorAboveRing  : Factor w/ 9 levels "Brown","Buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ StalkColorBelowRing  : Factor w/ 9 levels "Brown","Buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ VeilType             : Factor w/ 1 level "Partial": 1 1 1 1 1 1 1 1 1 1 ...
##  $ VeilColor            : Factor w/ 4 levels "Brown","Orange",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ RingNumber           : Factor w/ 3 levels "None","One","Two": 2 2 2 2 2 2 2 2 2 2 ...
##  $ RingType             : Factor w/ 5 levels "Evanescent","Flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ SporePrintColor      : Factor w/ 9 levels "Black","Brown",..: 1 2 2 1 2 1 1 2 1 1 ...
##  $ Population           : Factor w/ 6 levels "Abundnant","Clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ Habitat              : Factor w/ 7 levels "Grasses","Leaves",..: 5 1 3 5 1 1 3 3 1 3 ...

Feature Selection before Visualization

Since there are 23 variables (22 predictors plus the Edible label), comparing every variable and its interactions would be laborious. Let’s use the Boruta algorithm to see which variables have the most impact.

## Warning in TentativeRoughFix(boruta.train): There are no Tentative
## attributes! Returning original object.
##                         meanImp medianImp    minImp    maxImp normHits
## CapShape              14.615470 14.428007 13.493945 15.715606        1
## CapSurface            17.310904 17.257331 16.021209 18.408430        1
## CapColor              16.490086 16.577253 15.575090 17.051839        1
## Bruises               16.475434 16.546610 15.493117 17.444819        1
## Odor                  23.649886 23.577426 22.676890 24.874218        1
## GillAttachment         7.362143  7.306709  6.445281  8.596157        1
## GillSpacing           18.276094 18.479204 17.089957 19.841007        1
## GillSize              22.584415 22.439868 21.812846 23.826869        1
## GillColor             14.807397 14.914603 13.037513 15.758207        1
## StalkShape            18.839615 18.825226 17.832902 19.887296        1
## StalkRoot             20.623533 20.513350 18.816178 22.126707        1
## StalkSurfaceAboveRing 13.082532 13.141184 12.306700 13.656548        1
## StalkSurfaceBelowRing 12.712848 12.757862 12.214016 13.250489        1
## StalkColorAboveRing   13.835534 14.014371 12.749120 14.927994        1
## StalkColorBelowRing   13.835676 13.800782 12.716384 14.669823        1
## VeilType               0.000000  0.000000  0.000000  0.000000        0
## VeilColor              8.237312  8.168256  7.677358  8.892883        1
## RingNumber            14.177253 14.100786 13.638326 15.068294        1
## RingType              17.514824 17.591988 16.594334 18.254843        1
## SporePrintColor       24.162917 24.081073 23.355758 24.906849        1
## Population            17.243272 17.460937 16.043717 17.620443        1
## Habitat               19.589782 19.369477 18.918767 20.622484        1
##                        decision
## CapShape              Confirmed
## CapSurface            Confirmed
## CapColor              Confirmed
## Bruises               Confirmed
## Odor                  Confirmed
## GillAttachment        Confirmed
## GillSpacing           Confirmed
## GillSize              Confirmed
## GillColor             Confirmed
## StalkShape            Confirmed
## StalkRoot             Confirmed
## StalkSurfaceAboveRing Confirmed
## StalkSurfaceBelowRing Confirmed
## StalkColorAboveRing   Confirmed
## StalkColorBelowRing   Confirmed
## VeilType               Rejected
## VeilColor             Confirmed
## RingNumber            Confirmed
## RingType              Confirmed
## SporePrintColor       Confirmed
## Population            Confirmed
## Habitat               Confirmed
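
The code for this step is not shown, so the following is only a minimal sketch of how such a table could be produced. It assumes the Boruta package is installed and uses a small made-up data frame (`toy`, with a hypothetical `Noise` column) instead of the mushroom data; the calls `Boruta()`, `TentativeRoughFix()` and `attStats()` are the package's own.

```r
# Toy stand-in for the mushroom data (hypothetical values, not the real data set)
set.seed(42)
n <- 200
toy <- data.frame(
  Odor  = factor(sample(c("Foul", "None", "Almond"), n, replace = TRUE)),
  Noise = factor(sample(c("A", "B"), n, replace = TRUE))
)
# The label depends on Odor, with about 10% of labels flipped at random
flip <- runif(n) < 0.1
toy$Edible <- factor(ifelse(xor(toy$Odor == "Foul", flip), "Poisonous", "Edible"))

stats <- NULL
if (requireNamespace("Boruta", quietly = TRUE)) {
  boruta.train <- Boruta::Boruta(Edible ~ ., data = toy, doTrace = 0)
  # Resolve any attributes that are still marked Tentative
  boruta.final <- Boruta::TentativeRoughFix(boruta.train)
  # Importance statistics and the final decision per attribute
  stats <- Boruta::attStats(boruta.final)
  print(stats)
}
```

On the real data, the formula would presumably be run against the full `mushrooms` data frame rather than `toy`.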

Since Odor and SporePrintColor have the highest mean importance (about 23.6 and 24.2), let’s plot them against each other.

Graphical relationship: red is poisonous, green is edible.
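
The relationship between a feature and edibility can also be inspected numerically. The sketch below cross-tabulates odor against the label using base R only. The rows in `toy` are hypothetical and serve purely as an illustration; on the real data one would pass `mushrooms$Odor` and `mushrooms$Edible` instead.

```r
# Hypothetical toy rows, NOT taken from the mushroom data set
toy <- data.frame(
  Odor   = c("Foul", "Foul", "None", "None", "Almond", "None"),
  Edible = c("Poisonous", "Poisonous", "Edible", "Edible", "Edible", "Poisonous")
)
# Count edible vs. poisonous mushrooms for each odor
counts <- table(toy$Odor, toy$Edible)
# Share of each class within an odor (each row sums to 1)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

An odor whose row is close to 0 or 1 separates the two classes well, which is what a high importance score suggests.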

Model Reliability

In this step, we’ll encode the dependent variable as two numeric levels, 0 and 1, so that the algorithm can treat it as a binary criterion. The encoding is:

“Edible” - This will be converted to 0.

“Poisonous” - This will be converted to 1.

Note that the column keeps the name Edible, so a value of 1 means poisonous.

## 'data.frame':    8124 obs. of  23 variables:
##  $ Edible               : num  1 0 0 1 0 0 0 0 1 0 ...
##  $ CapShape             : Factor w/ 6 levels "Bell","Conical",..: 3 3 1 3 3 3 1 1 3 1 ...
##  $ CapSurface           : Factor w/ 4 levels "Fibrous","Grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
##  $ CapColor             : Factor w/ 10 levels "Brown","Buff",..: 1 10 9 9 4 10 9 9 9 10 ...
##  $ Bruises              : Factor w/ 2 levels "False","True": 2 2 2 2 1 2 2 2 2 2 ...
##  $ Odor                 : Factor w/ 9 levels "Almond","Anise",..: 8 1 2 8 7 1 1 2 8 1 ...
##  $ GillAttachment       : Factor w/ 2 levels "Attached","Free": 2 2 2 2 2 2 2 2 2 2 ...
##  $ GillSpacing          : Factor w/ 2 levels "Close","Crowded": 1 1 1 1 2 1 1 1 1 1 ...
##  $ GillSize             : Factor w/ 2 levels "Broad","Narrow": 2 1 1 2 1 1 1 1 2 1 ...
##  $ GillColor            : Factor w/ 12 levels "Black","Brown",..: 1 1 2 2 1 2 5 2 8 5 ...
##  $ StalkShape           : Factor w/ 2 levels "Enlarging","Tapering": 1 1 1 1 2 1 1 1 1 1 ...
##  $ StalkRoot            : Factor w/ 5 levels "Bulbous","Club",..: 3 2 2 3 3 2 2 2 3 2 ...
##  $ StalkSurfaceAboveRing: Factor w/ 4 levels "Fibrous","Scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ StalkSurfaceBelowRing: Factor w/ 4 levels "Fibrous","Scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ StalkColorAboveRing  : Factor w/ 9 levels "Brown","Buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ StalkColorBelowRing  : Factor w/ 9 levels "Brown","Buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ VeilType             : Factor w/ 1 level "Partial": 1 1 1 1 1 1 1 1 1 1 ...
##  $ VeilColor            : Factor w/ 4 levels "Brown","Orange",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ RingNumber           : Factor w/ 3 levels "None","One","Two": 2 2 2 2 2 2 2 2 2 2 ...
##  $ RingType             : Factor w/ 5 levels "Evanescent","Flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ SporePrintColor      : Factor w/ 9 levels "Black","Brown",..: 1 2 2 1 2 1 1 2 1 1 ...
##  $ Population           : Factor w/ 6 levels "Abundnant","Clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ Habitat              : Factor w/ 7 levels "Grasses","Leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
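
The encoding code itself is not shown; a minimal base R sketch of one way to do it is given below. The vector `edible` is a toy stand-in for `mushrooms$Edible`, and `ifelse()` is just one possible choice.

```r
# Toy factor with the same two levels as the Edible column
edible <- factor(c("Poisonous", "Edible", "Edible", "Poisonous"),
                 levels = c("Edible", "Poisonous"))
# Edible -> 0, Poisonous -> 1
encoded <- ifelse(edible == "Poisonous", 1, 0)
print(encoded)
```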

Classification tree using the FFTrees algorithm: let the algorithm pick the best tree, without splitting the data

## [1] "An FFTrees object containing 6 trees using 4 predictors {Odor,SporePrintColor,GillColor,RingType}"
## [1] "FFTrees AUC: (Train = 0.99, Test = --)"
## [1] "My favorite training tree is #2, here is how it performed:"
##                          train
## n                      8124.00
## p(Correct)                0.93
## Hit Rate (HR)             0.85
## False Alarm Rate (FAR)    0.00
## d-prime                    Inf

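Without splitting the data, the initial result is an FFTrees object that uses 4 predictors {Odor, SporePrintColor, GillColor, RingType}, but the tree itself shows that Odor and SporePrintColor are the most relevant variables. The hit rate is about 85% with a false alarm rate of 0.00: given the encoding above (Poisonous = 1), the tree never flags an edible mushroom as poisonous, but it misses roughly 15% of the poisonous ones.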
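
To make the reported statistics concrete, here is a small worked example using the usual signal-detection definitions. The counts are hypothetical and chosen only to reproduce a hit rate of 0.85 and a false alarm rate of 0; that FFTrees computes d-prime in exactly this way is an assumption.

```r
# Hypothetical confusion counts (signal = poisonous, coded 1); NOT the real counts
hits <- 85
misses <- 15
false_alarms <- 0
correct_rejections <- 100

hit_rate <- hits / (hits + misses)
far <- false_alarms / (false_alarms + correct_rejections)
total <- hits + misses + false_alarms + correct_rejections
p_correct <- (hits + correct_rejections) / total
# d-prime = z(HR) - z(FAR); qnorm(0) is -Inf, so FAR = 0 yields Inf
d_prime <- qnorm(hit_rate) - qnorm(far)
print(c(hit_rate = hit_rate, far = far, p_correct = p_correct, d_prime = d_prime))
```

This is why d-prime is reported as Inf: it is driven entirely by the zero false alarm rate and says nothing about the 15% of misses.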

Splitting the data

## [1] "An FFTrees object containing 2 trees using 2 predictors {Odor,SporePrintColor}"
## [1] "FFTrees AUC: (Train = 0.99, Test = 0.99)"
## [1] "My favorite training tree is #1, here is how it performed:"
##                          train   test
## n                      7717.00 407.00
## p(Correct)                0.93   0.92
## Hit Rate (HR)             0.86   0.84
## False Alarm Rate (FAR)    0.00   0.00
## d-prime                    Inf    Inf

Similarly, splitting the data into 95% training and 5% testing yields a tree based on Odor and SporePrintColor, with comparable hit rates of 86% on the training set and 84% on the test set.
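
How the data was split for FFTrees is not shown. As an illustration only, the sketch below performs a 95% / 5% split in base R that keeps the class proportions the same in both parts; `labels` is a toy stand-in for `mushrooms$Edible`.

```r
set.seed(123)
# Toy labels standing in for mushrooms$Edible
labels <- factor(rep(c("Edible", "Poisonous"), times = c(520, 480)))
# Within each class, draw 5% of the row indices for the test set
by_class <- split(seq_along(labels), labels)
test_idx <- unlist(lapply(by_class, function(i) {
  sample(i, size = round(0.05 * length(i)))
}))
train_idx <- setdiff(seq_along(labels), test_idx)
print(table(labels[test_idx]))
```

With only 5% held out, the test set is small, so test metrics carry wider uncertainty than training metrics.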

Let’s try the Random Forest algorithm

suppressWarnings(suppressMessages(library(randomForest)))
suppressWarnings(suppressMessages(library(e1071)))
suppressWarnings(suppressMessages(library(caret)))

set.seed(123)  

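# get original data
# stringsAsFactors = TRUE is needed on R >= 4.0 so that columns are read as factors
mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)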
str(mushrooms)
## 'data.frame':    8124 obs. of  23 variables:
##  $ Edible               : Factor w/ 2 levels "Edible","Poisonous": 2 1 1 2 1 1 1 1 2 1 ...
##  $ CapShape             : Factor w/ 6 levels "Bell","Conical",..: 3 3 1 3 3 3 1 1 3 1 ...
##  $ CapSurface           : Factor w/ 4 levels "Fibrous","Grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
##  $ CapColor             : Factor w/ 10 levels "Brown","Buff",..: 1 10 9 9 4 10 9 9 9 10 ...
##  $ Bruises              : Factor w/ 2 levels "False","True": 2 2 2 2 1 2 2 2 2 2 ...
##  $ Odor                 : Factor w/ 9 levels "Almond","Anise",..: 8 1 2 8 7 1 1 2 8 1 ...
##  $ GillAttachment       : Factor w/ 2 levels "Attached","Free": 2 2 2 2 2 2 2 2 2 2 ...
##  $ GillSpacing          : Factor w/ 2 levels "Close","Crowded": 1 1 1 1 2 1 1 1 1 1 ...
##  $ GillSize             : Factor w/ 2 levels "Broad","Narrow": 2 1 1 2 1 1 1 1 2 1 ...
##  $ GillColor            : Factor w/ 12 levels "Black","Brown",..: 1 1 2 2 1 2 5 2 8 5 ...
##  $ StalkShape           : Factor w/ 2 levels "Enlarging","Tapering": 1 1 1 1 2 1 1 1 1 1 ...
##  $ StalkRoot            : Factor w/ 5 levels "Bulbous","Club",..: 3 2 2 3 3 2 2 2 3 2 ...
##  $ StalkSurfaceAboveRing: Factor w/ 4 levels "Fibrous","Scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ StalkSurfaceBelowRing: Factor w/ 4 levels "Fibrous","Scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ StalkColorAboveRing  : Factor w/ 9 levels "Brown","Buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ StalkColorBelowRing  : Factor w/ 9 levels "Brown","Buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ VeilType             : Factor w/ 1 level "Partial": 1 1 1 1 1 1 1 1 1 1 ...
##  $ VeilColor            : Factor w/ 4 levels "Brown","Orange",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ RingNumber           : Factor w/ 3 levels "None","One","Two": 2 2 2 2 2 2 2 2 2 2 ...
##  $ RingType             : Factor w/ 5 levels "Evanescent","Flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ SporePrintColor      : Factor w/ 9 levels "Black","Brown",..: 1 2 2 1 2 1 1 2 1 1 ...
##  $ Population           : Factor w/ 6 levels "Abundnant","Clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ Habitat              : Factor w/ 7 levels "Grasses","Leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
library(caTools)
# 95% / 5% split that preserves the ratio of Edible to Poisonous
split <- sample.split(mushrooms$Edible, SplitRatio = 0.95)
Train <- mushrooms[split == TRUE, ]
Test <- mushrooms[split == FALSE, ]
# Fit Random Forest Model

set.seed(456)

rf = randomForest(Edible ~ .,  ntree = 100, data = Train)
print(rf)
## 
## Call:
##  randomForest(formula = Edible ~ ., data = Train, ntree = 100) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 0%
## Confusion matrix:
##           Edible Poisonous class.error
## Edible      3998         0           0
## Poisonous      0      3720           0
# Variable Importance
varImpPlot(rf, sort = TRUE, n.var = 10, main = "Top 10 - Variable Importance")

#Variable Importance
var.imp = data.frame(importance(rf, type=2))
# make row names as columns
var.imp$Variables = row.names(var.imp)  
print(var.imp[order(var.imp$MeanDecreaseGini, decreasing = TRUE), ])
##                       MeanDecreaseGini             Variables
## Odor                      1287.8437626                  Odor
## SporePrintColor            621.6112057       SporePrintColor
## GillColor                  290.7797879             GillColor
## StalkSurfaceBelowRing      245.0962193 StalkSurfaceBelowRing
## GillSize                   241.7000906              GillSize
## StalkSurfaceAboveRing      173.7856704 StalkSurfaceAboveRing
## RingType                   165.9132664              RingType
## Population                 117.1940828            Population
## GillSpacing                106.7254730           GillSpacing
## StalkRoot                   95.3099746             StalkRoot
## Habitat                     95.2761674               Habitat
## Bruises                     92.2182474               Bruises
## CapColor                    54.7694177              CapColor
## StalkColorBelowRing         53.4285156   StalkColorBelowRing
## StalkColorAboveRing         52.1297701   StalkColorAboveRing
## StalkShape                  48.8283721            StalkShape
## RingNumber                  40.8255329            RingNumber
## CapSurface                  33.6491282            CapSurface
## CapShape                    18.5942114              CapShape
## VeilColor                    4.1434287             VeilColor
## GillAttachment               0.1383254        GillAttachment
## VeilType                     0.0000000              VeilType

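“Mean Decrease Gini” is the total decrease in node impurity (measured by the Gini index) from splits on a variable, averaged over all trees. It plays a role similar to information gain.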
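
As a worked illustration of the idea, the sketch below computes the Gini decrease for a single hypothetical split. The helper `gini()` and the node counts are made up for this example; randomForest accumulates such decreases over every split on a variable, and its exact scaling is not reproduced here.

```r
# Gini impurity of a node, computed from its class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}
# Hypothetical parent node: 50 edible, 50 poisonous
parent <- c(50, 50)
# Hypothetical split into two child nodes
left <- c(45, 5)
right <- c(5, 45)
n <- sum(parent)
# Impurity of the children, weighted by their share of the rows
weighted_child <- sum(left) / n * gini(left) + sum(right) / n * gini(right)
decrease <- gini(parent) - weighted_child
print(decrease)
```

A variable such as Odor scores highly because its splits produce child nodes that are much purer than their parents.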

Predicted response

#using TRAIN dataset
# Predicting response variable
Train$predicted.response = predict(rf , Train)

# Create Confusion Matrix
print(confusionMatrix(data = Train$predicted.response,  reference = Train$Edible,  positive = 'Edible'))
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Edible Poisonous
##   Edible      3998         0
##   Poisonous      0      3720
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9995, 1)
##     No Information Rate : 0.518      
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.000      
##             Specificity : 1.000      
##          Pos Pred Value : 1.000      
##          Neg Pred Value : 1.000      
##              Prevalence : 0.518      
##          Detection Rate : 0.518      
##    Detection Prevalence : 0.518      
##       Balanced Accuracy : 1.000      
##                                      
##        'Positive' Class : Edible     
## 
# using TEST dataset
# Predicting response variable
Test$predicted.response = predict(rf , Test)

# Create Confusion Matrix
print(  confusionMatrix(data= Test$predicted.response,  reference=Test$Edible, positive='Edible'))
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Edible Poisonous
##   Edible       210         0
##   Poisonous      0       196
##                                     
##                Accuracy : 1         
##                  95% CI : (0.991, 1)
##     No Information Rate : 0.5172    
##     P-Value [Acc > NIR] : < 2.2e-16 
##                                     
##                   Kappa : 1         
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.0000    
##             Specificity : 1.0000    
##          Pos Pred Value : 1.0000    
##          Neg Pred Value : 1.0000    
##              Prevalence : 0.5172    
##          Detection Rate : 0.5172    
##    Detection Prevalence : 0.5172    
##       Balanced Accuracy : 1.0000    
##                                     
##        'Positive' Class : Edible    
## 

Conclusions

Random Forest gave the best model: it differentiates between poisonous and edible mushrooms with no misses on either the training set or the test set. In real life, however, one cannot rely entirely on a “perfect” model that seems too good to be true. In that case, I would prefer to use human common sense or a model that reports some probability of a miss. Perhaps more data should be made available to train the model.