Given a dataset of mushrooms, answer the following questions.
Which characteristics make a mushroom edible or poisonous?
How good or reliable is the model? Would it help you decide whether or not to eat a mushroom you find?
# DATASET is found in http://archive.ics.uci.edu/ml/datasets/Mushroom
#theUrl = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
#mushrooms = read.table(file = theUrl, header = FALSE, sep = ",")
# thanks to https://raw.githubusercontent.com/stoltzmaniac/Mushroom-Classification/master/helper_functions.R
#source('helper_functions.R')
#Import Data via Custom Function
#data = fetchAndCleanData()
#head(data)
#save dataset to a file
#library(rio)
#export(data, "mushrooms.csv")
# Once the data is exported to mushrooms.csv, that file is used as the dataset.
## 'data.frame': 8124 obs. of 23 variables:
## $ Edible : Factor w/ 2 levels "Edible","Poisonous": 2 1 1 2 1 1 1 1 2 1 ...
## $ CapShape : Factor w/ 6 levels "Bell","Conical",..: 3 3 1 3 3 3 1 1 3 1 ...
## $ CapSurface : Factor w/ 4 levels "Fibrous","Grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
## $ CapColor : Factor w/ 10 levels "Brown","Buff",..: 1 10 9 9 4 10 9 9 9 10 ...
## $ Bruises : Factor w/ 2 levels "False","True": 2 2 2 2 1 2 2 2 2 2 ...
## $ Odor : Factor w/ 9 levels "Almond","Anise",..: 8 1 2 8 7 1 1 2 8 1 ...
## $ GillAttachment : Factor w/ 2 levels "Attached","Free": 2 2 2 2 2 2 2 2 2 2 ...
## $ GillSpacing : Factor w/ 2 levels "Close","Crowded": 1 1 1 1 2 1 1 1 1 1 ...
## $ GillSize : Factor w/ 2 levels "Broad","Narrow": 2 1 1 2 1 1 1 1 2 1 ...
## $ GillColor : Factor w/ 12 levels "Black","Brown",..: 1 1 2 2 1 2 5 2 8 5 ...
## $ StalkShape : Factor w/ 2 levels "Enlarging","Tapering": 1 1 1 1 2 1 1 1 1 1 ...
## $ StalkRoot : Factor w/ 5 levels "Bulbous","Club",..: 3 2 2 3 3 2 2 2 3 2 ...
## $ StalkSurfaceAboveRing: Factor w/ 4 levels "Fibrous","Scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ StalkSurfaceBelowRing: Factor w/ 4 levels "Fibrous","Scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ StalkColorAboveRing : Factor w/ 9 levels "Brown","Buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ StalkColorBelowRing : Factor w/ 9 levels "Brown","Buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ VeilType : Factor w/ 1 level "Partial": 1 1 1 1 1 1 1 1 1 1 ...
## $ VeilColor : Factor w/ 4 levels "Brown","Orange",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ RingNumber : Factor w/ 3 levels "None","One","Two": 2 2 2 2 2 2 2 2 2 2 ...
## $ RingType : Factor w/ 5 levels "Evanescent","Flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ SporePrintColor : Factor w/ 9 levels "Black","Brown",..: 1 2 2 1 2 1 1 2 1 1 ...
## $ Population : Factor w/ 6 levels "Abundnant","Clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ Habitat : Factor w/ 7 levels "Grasses","Leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
Since there are 23 variables (the Edible label plus 22 predictors), comparing every variable and its interactions would be computationally intensive. Let’s see which variables have the most impact using the Boruta algorithm.
## Warning in TentativeRoughFix(boruta.train): There are no Tentative
## attributes! Returning original object.
## meanImp medianImp minImp maxImp normHits
## CapShape 14.615470 14.428007 13.493945 15.715606 1
## CapSurface 17.310904 17.257331 16.021209 18.408430 1
## CapColor 16.490086 16.577253 15.575090 17.051839 1
## Bruises 16.475434 16.546610 15.493117 17.444819 1
## Odor 23.649886 23.577426 22.676890 24.874218 1
## GillAttachment 7.362143 7.306709 6.445281 8.596157 1
## GillSpacing 18.276094 18.479204 17.089957 19.841007 1
## GillSize 22.584415 22.439868 21.812846 23.826869 1
## GillColor 14.807397 14.914603 13.037513 15.758207 1
## StalkShape 18.839615 18.825226 17.832902 19.887296 1
## StalkRoot 20.623533 20.513350 18.816178 22.126707 1
## StalkSurfaceAboveRing 13.082532 13.141184 12.306700 13.656548 1
## StalkSurfaceBelowRing 12.712848 12.757862 12.214016 13.250489 1
## StalkColorAboveRing 13.835534 14.014371 12.749120 14.927994 1
## StalkColorBelowRing 13.835676 13.800782 12.716384 14.669823 1
## VeilType 0.000000 0.000000 0.000000 0.000000 0
## VeilColor 8.237312 8.168256 7.677358 8.892883 1
## RingNumber 14.177253 14.100786 13.638326 15.068294 1
## RingType 17.514824 17.591988 16.594334 18.254843 1
## SporePrintColor 24.162917 24.081073 23.355758 24.906849 1
## Population 17.243272 17.460937 16.043717 17.620443 1
## Habitat 19.589782 19.369477 18.918767 20.622484 1
## decision
## CapShape Confirmed
## CapSurface Confirmed
## CapColor Confirmed
## Bruises Confirmed
## Odor Confirmed
## GillAttachment Confirmed
## GillSpacing Confirmed
## GillSize Confirmed
## GillColor Confirmed
## StalkShape Confirmed
## StalkRoot Confirmed
## StalkSurfaceAboveRing Confirmed
## StalkSurfaceBelowRing Confirmed
## StalkColorAboveRing Confirmed
## StalkColorBelowRing Confirmed
## VeilType Rejected
## VeilColor Confirmed
## RingNumber Confirmed
## RingType Confirmed
## SporePrintColor Confirmed
## Population Confirmed
## Habitat Confirmed
Since Odor and SporePrintColor have the highest mean importance (about 23.6 and 24.2 respectively), let’s plot them.
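The plot itself is not reproduced in this text. As a rough sketch of the kind of summary such a plot is usually built from, the snippet below cross-tabulates one predictor against the label. The `toy` data frame and its rows are made up for illustration and are not taken from mushrooms.csv.

```r
# Toy data: these rows are invented for illustration only.
toy <- data.frame(
  Odor   = c("None", "None", "Foul", "Foul", "Almond", "None"),
  Edible = c("Edible", "Edible", "Poisonous", "Poisonous", "Edible", "Poisonous")
)

# Counts of each class per odor level.
odor_counts <- table(toy$Odor, toy$Edible)
print(odor_counts)

# Row-wise proportions: the share of each odor level that falls in each class.
odor_share <- prop.table(odor_counts, margin = 1)
print(round(odor_share, 2))
```

On the real data, replacing `toy` with the `mushrooms` data frame should give the per-level class shares for Odor or SporePrintColor.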
In this step, we’ll encode the dependent variable as two numeric levels, 0 and 1, so that the algorithm receives an unambiguous binary target. The encoding is:
“Edible” - converted to 0.
“Poisonous” - converted to 1.
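A minimal sketch of this encoding, using a small stand-in vector rather than the real Edible column (the original encoding code is not shown in this report, so this is one possible way to do it):

```r
# Toy label vector standing in for the Edible column (values are illustrative).
edible <- factor(c("Poisonous", "Edible", "Edible", "Poisonous"),
                 levels = c("Edible", "Poisonous"))

# Compare against the level name instead of calling as.numeric() on the factor,
# which would return the internal codes 1 and 2 rather than 0 and 1.
edible_num <- as.numeric(edible == "Poisonous")
print(edible_num)
```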
## 'data.frame': 8124 obs. of 23 variables:
## $ Edible : num 1 0 0 1 0 0 0 0 1 0 ...
## $ CapShape : Factor w/ 6 levels "Bell","Conical",..: 3 3 1 3 3 3 1 1 3 1 ...
## $ CapSurface : Factor w/ 4 levels "Fibrous","Grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
## $ CapColor : Factor w/ 10 levels "Brown","Buff",..: 1 10 9 9 4 10 9 9 9 10 ...
## $ Bruises : Factor w/ 2 levels "False","True": 2 2 2 2 1 2 2 2 2 2 ...
## $ Odor : Factor w/ 9 levels "Almond","Anise",..: 8 1 2 8 7 1 1 2 8 1 ...
## $ GillAttachment : Factor w/ 2 levels "Attached","Free": 2 2 2 2 2 2 2 2 2 2 ...
## $ GillSpacing : Factor w/ 2 levels "Close","Crowded": 1 1 1 1 2 1 1 1 1 1 ...
## $ GillSize : Factor w/ 2 levels "Broad","Narrow": 2 1 1 2 1 1 1 1 2 1 ...
## $ GillColor : Factor w/ 12 levels "Black","Brown",..: 1 1 2 2 1 2 5 2 8 5 ...
## $ StalkShape : Factor w/ 2 levels "Enlarging","Tapering": 1 1 1 1 2 1 1 1 1 1 ...
## $ StalkRoot : Factor w/ 5 levels "Bulbous","Club",..: 3 2 2 3 3 2 2 2 3 2 ...
## $ StalkSurfaceAboveRing: Factor w/ 4 levels "Fibrous","Scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ StalkSurfaceBelowRing: Factor w/ 4 levels "Fibrous","Scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ StalkColorAboveRing : Factor w/ 9 levels "Brown","Buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ StalkColorBelowRing : Factor w/ 9 levels "Brown","Buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ VeilType : Factor w/ 1 level "Partial": 1 1 1 1 1 1 1 1 1 1 ...
## $ VeilColor : Factor w/ 4 levels "Brown","Orange",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ RingNumber : Factor w/ 3 levels "None","One","Two": 2 2 2 2 2 2 2 2 2 2 ...
## $ RingType : Factor w/ 5 levels "Evanescent","Flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ SporePrintColor : Factor w/ 9 levels "Black","Brown",..: 1 2 2 1 2 1 1 2 1 1 ...
## $ Population : Factor w/ 6 levels "Abundnant","Clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ Habitat : Factor w/ 7 levels "Grasses","Leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
## [1] "An FFTrees object containing 6 trees using 4 predictors {Odor,SporePrintColor,GillColor,RingType}"
## [1] "FFTrees AUC: (Train = 0.99, Test = --)"
## [1] "My favorite training tree is #2, here is how it performed:"
## train
## n 8124.00
## p(Correct) 0.93
## Hit Rate (HR) 0.85
## False Alarm Rate (FAR) 0.00
## d-prime Inf
Without splitting the data, the initial FFTrees run uses 4 predictors {Odor, SporePrintColor, GillColor, RingType}, but the selected tree shows that Odor and SporePrintColor are the most relevant variables. The hit rate is about 85%.
## [1] "An FFTrees object containing 2 trees using 2 predictors {Odor,SporePrintColor}"
## [1] "FFTrees AUC: (Train = 0.99, Test = 0.99)"
## [1] "My favorite training tree is #1, here is how it performed:"
## train test
## n 7717.00 407.00
## p(Correct) 0.93 0.92
## Hit Rate (HR) 0.86 0.84
## False Alarm Rate (FAR) 0.00 0.00
## d-prime Inf Inf
Similarly, splitting the data into 95% training and 5% testing and using only Odor and SporePrintColor gives comparable hit rates of 86% (train) and 84% (test).
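To relate the reported statistics to each other, the sketch below recomputes accuracy and d-prime from the hit rate and false alarm rate of the unsplit run. It assumes that Poisonous (coded 1) is the class FFTrees treats as the signal, and that d-prime follows the standard signal detection definition with no correction for rates of exactly 0 or 1. The class counts are the train and test totals from the confusion matrices later in this report.

```r
# Class counts for the full data set (train + test rows of the later confusion matrices).
n_edible    <- 3998 + 210
n_poisonous <- 3720 + 196

hit_rate         <- 0.85  # share of poisonous mushrooms flagged as poisonous
false_alarm_rate <- 0.00  # share of edible mushrooms flagged as poisonous

# Overall accuracy is the prevalence-weighted mix of the two rates.
prevalence <- n_poisonous / (n_poisonous + n_edible)
accuracy   <- prevalence * hit_rate + (1 - prevalence) * (1 - false_alarm_rate)
print(round(accuracy, 2))

# d-prime = z(hit rate) - z(false alarm rate); qnorm(0) is -Inf, so d-prime is Inf.
d_prime <- qnorm(hit_rate) - qnorm(false_alarm_rate)
print(d_prime)

# Share of poisonous mushrooms that the tree would label as edible.
miss_rate <- 1 - hit_rate
print(miss_rate)
```

Under these assumptions, a hit rate of 85% means roughly 15% of poisonous mushrooms are classified as edible, which is the error that matters most when deciding whether to eat one.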
suppressWarnings(suppressMessages(library(randomForest)))
suppressWarnings(suppressMessages(library(e1071)))
suppressWarnings(suppressMessages(library(caret)))
set.seed(123)
# get original data
mushrooms <- read.csv("mushrooms.csv")
str(mushrooms)
## 'data.frame': 8124 obs. of 23 variables:
## $ Edible : Factor w/ 2 levels "Edible","Poisonous": 2 1 1 2 1 1 1 1 2 1 ...
## $ CapShape : Factor w/ 6 levels "Bell","Conical",..: 3 3 1 3 3 3 1 1 3 1 ...
## $ CapSurface : Factor w/ 4 levels "Fibrous","Grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
## $ CapColor : Factor w/ 10 levels "Brown","Buff",..: 1 10 9 9 4 10 9 9 9 10 ...
## $ Bruises : Factor w/ 2 levels "False","True": 2 2 2 2 1 2 2 2 2 2 ...
## $ Odor : Factor w/ 9 levels "Almond","Anise",..: 8 1 2 8 7 1 1 2 8 1 ...
## $ GillAttachment : Factor w/ 2 levels "Attached","Free": 2 2 2 2 2 2 2 2 2 2 ...
## $ GillSpacing : Factor w/ 2 levels "Close","Crowded": 1 1 1 1 2 1 1 1 1 1 ...
## $ GillSize : Factor w/ 2 levels "Broad","Narrow": 2 1 1 2 1 1 1 1 2 1 ...
## $ GillColor : Factor w/ 12 levels "Black","Brown",..: 1 1 2 2 1 2 5 2 8 5 ...
## $ StalkShape : Factor w/ 2 levels "Enlarging","Tapering": 1 1 1 1 2 1 1 1 1 1 ...
## $ StalkRoot : Factor w/ 5 levels "Bulbous","Club",..: 3 2 2 3 3 2 2 2 3 2 ...
## $ StalkSurfaceAboveRing: Factor w/ 4 levels "Fibrous","Scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ StalkSurfaceBelowRing: Factor w/ 4 levels "Fibrous","Scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ StalkColorAboveRing : Factor w/ 9 levels "Brown","Buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ StalkColorBelowRing : Factor w/ 9 levels "Brown","Buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ VeilType : Factor w/ 1 level "Partial": 1 1 1 1 1 1 1 1 1 1 ...
## $ VeilColor : Factor w/ 4 levels "Brown","Orange",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ RingNumber : Factor w/ 3 levels "None","One","Two": 2 2 2 2 2 2 2 2 2 2 ...
## $ RingType : Factor w/ 5 levels "Evanescent","Flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ SporePrintColor : Factor w/ 9 levels "Black","Brown",..: 1 2 2 1 2 1 1 2 1 1 ...
## $ Population : Factor w/ 6 levels "Abundnant","Clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ Habitat : Factor w/ 7 levels "Grasses","Leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
library(caTools)
split=sample.split(mushrooms$Edible, SplitRatio=0.95)
Train=mushrooms[split==TRUE,]
Test=mushrooms[split==FALSE,]
#Fit Random Forest Model
set.seed(456)
rf = randomForest(Edible ~ ., ntree = 100, data = Train)
print(rf)
##
## Call:
## randomForest(formula = Edible ~ ., data = Train, ntree = 100)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 0%
## Confusion matrix:
## Edible Poisonous class.error
## Edible 3998 0 0
## Poisonous 0 3720 0
# Variable Importance
varImpPlot(rf, sort = T, n.var=10, main="Top 10 - Variable Importance")
#Variable Importance
var.imp = data.frame(importance(rf, type=2))
# make row names as columns
var.imp$Variables = row.names(var.imp)
print(var.imp[order(var.imp$MeanDecreaseGini,decreasing = T),])
## MeanDecreaseGini Variables
## Odor 1287.8437626 Odor
## SporePrintColor 621.6112057 SporePrintColor
## GillColor 290.7797879 GillColor
## StalkSurfaceBelowRing 245.0962193 StalkSurfaceBelowRing
## GillSize 241.7000906 GillSize
## StalkSurfaceAboveRing 173.7856704 StalkSurfaceAboveRing
## RingType 165.9132664 RingType
## Population 117.1940828 Population
## GillSpacing 106.7254730 GillSpacing
## StalkRoot 95.3099746 StalkRoot
## Habitat 95.2761674 Habitat
## Bruises 92.2182474 Bruises
## CapColor 54.7694177 CapColor
## StalkColorBelowRing 53.4285156 StalkColorBelowRing
## StalkColorAboveRing 52.1297701 StalkColorAboveRing
## StalkShape 48.8283721 StalkShape
## RingNumber 40.8255329 RingNumber
## CapSurface 33.6491282 CapSurface
## CapShape 18.5942114 CapShape
## VeilColor 4.1434287 VeilColor
## GillAttachment 0.1383254 GillAttachment
## VeilType 0.0000000 VeilType
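“MeanDecreaseGini” is the total decrease in Gini node impurity from splits on a variable, averaged over all trees. It plays a role similar to information gain: the larger the value, the more that variable helps separate the classes.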
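As an illustration of what a single Gini decrease looks like, the sketch below computes it by hand for one split. The labels and the binary feature are made up, and the `gini` helper is defined here for the example. It is not a function from the randomForest package, whose internal scaling may differ.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2).
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Toy node: invented labels (E = edible, P = poisonous) and an invented binary feature.
labels  <- c("E", "E", "E", "P", "P", "P", "P", "E")
feature <- c("a", "a", "a", "b", "b", "b", "b", "b")

parent <- gini(labels)
left   <- labels[feature == "a"]
right  <- labels[feature == "b"]

# Size-weighted impurity of the two child nodes after the split.
children <- (length(left) * gini(left) + length(right) * gini(right)) / length(labels)

# Impurity decrease credited to the feature for this one split.
decrease <- parent - children
print(c(parent = parent, children = children, decrease = decrease))
```

A variable such as Odor ranks first in the table above because its splits produce large decreases of this kind across many trees.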
#using TRAIN dataset
# Predicting response variable
Train$predicted.response = predict(rf , Train)
# Create Confusion Matrix
print(confusionMatrix(data = Train$predicted.response, reference = Train$Edible, positive = 'Edible'))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Edible Poisonous
## Edible 3998 0
## Poisonous 0 3720
##
## Accuracy : 1
## 95% CI : (0.9995, 1)
## No Information Rate : 0.518
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.518
## Detection Rate : 0.518
## Detection Prevalence : 0.518
## Balanced Accuracy : 1.000
##
## 'Positive' Class : Edible
##
# using TEST dataset
# Predicting response variable
Test$predicted.response = predict(rf , Test)
# Create Confusion Matrix
print( confusionMatrix(data= Test$predicted.response, reference=Test$Edible, positive='Edible'))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Edible Poisonous
## Edible 210 0
## Poisonous 0 196
##
## Accuracy : 1
## 95% CI : (0.991, 1)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5172
## Detection Rate : 0.5172
## Detection Prevalence : 0.5172
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : Edible
##
Random Forest gave the best model: it separates poisonous from edible mushrooms with no misclassifications on either the training set or the test set. In real life, however, one cannot rely entirely on a “perfect” model that looks too good to be true. In that case, I would prefer to use common sense, or a model that reports some probability of a miss. Perhaps more data should be made available to train the model.
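The caution above can be made concrete with the confidence interval already printed for the test set. The sketch below reproduces the lower bound of about 0.991 with an exact binomial interval, assuming that this is the kind of interval reported by `confusionMatrix`. Only base R is used.

```r
# Exact (Clopper-Pearson) binomial interval for 406 correct out of 406 test rows.
ci_test <- binom.test(x = 406, n = 406)$conf.int
print(round(ci_test, 4))

# With zero errors, the two-sided lower bound has a closed form: (alpha / 2)^(1 / n).
lower_closed_form <- 0.025^(1 / 406)
print(round(lower_closed_form, 4))

# Largest error rate still consistent with a clean test set of this size.
max_error_rate <- 1 - ci_test[1]
print(round(max_error_rate, 4))
```

Even with no observed mistakes, a test set of 406 rows only bounds the true error rate below roughly 0.9%, so a perfect score here is not a guarantee for mushrooms found in the wild.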