Unit 6: Classification Rules

Identifying Poisonous Mushrooms

Steven Ferguson

2022-09-29

Step 1 – Collecting the Data

Eating mushrooms found in the wild can be dangerous. Many poisonous wild mushrooms look similar to edible ones, and there are no clear, simple, or consistent rules for identifying wild mushrooms in the field. Consequently, identifying safe, edible mushrooms in the wild is difficult, even for the most experienced foragers. Perhaps a rule-learning algorithm could generate easy-to-understand classification rules for determining whether a mushroom is poisonous.

From lecture slides:

“To identify rules for distinguishing poisonous mushrooms, we will utilize the Mushroom dataset by Jeff Schlimmer of Carnegie Mellon University. The raw dataset is available freely at the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). The dataset includes information on 8,124 mushroom samples from 23 species of gilled mushrooms listed in Audubon Society Field Guide to North American Mushrooms (1981). In the Field Guide, each of the mushroom species is identified “definitely edible,” “definitely poisonous,” or “likely poisonous, and not recommended to be eaten.” For the purposes of this dataset, the latter group was combined with the “definitely poisonous” group to make two classes: poisonous and edible (nonpoisonous).”

Step 2 – Exploring and Preparing the Data

mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)

# examine the structure of the data frame
str(mushrooms)
## 'data.frame':    8124 obs. of  23 variables:
##  $ type                    : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap_shape               : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
##  $ cap_surface             : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
##  $ cap_color               : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
##  $ gill_attachment         : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill_spacing            : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill_size               : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill_color              : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
##  $ stalk_shape             : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk_root              : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
##  $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_color_above_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk_color_below_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil_type               : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
##  $ veil_color              : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring_number             : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring_type               : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore_print_color       : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
##  $ population              : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
# drop the veil_type feature
mushrooms$veil_type <- NULL

# examine the class distribution
table(mushrooms$type)
## 
##    edible poisonous 
##      4208      3916
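
The veil_type feature dropped above has a single level ("partial"), so it cannot help distinguish one class from the other. As a minimal sketch of how such constant features could be found programmatically, the example below uses a small hypothetical data frame (`toy` and its values are made up for illustration, not taken from mushrooms.csv):

```r
# Hypothetical toy data: veil_type is constant, odor is not
toy <- data.frame(
  veil_type = factor(c("partial", "partial", "partial")),
  odor      = factor(c("foul", "none", "almond"))
)

# Count the levels of each factor column
n_levels <- sapply(toy, nlevels)

# Columns with fewer than two levels carry no information for classification
constant_cols <- names(n_levels[n_levels < 2])
constant_cols

# Drop them, mirroring mushrooms$veil_type <- NULL
toy[constant_cols] <- NULL
names(toy)
```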

“We will consider the 8,124 samples in the mushroom data to be an exhaustive set of all the possible wild mushrooms.” This is an important assumption. Because the data are treated as a whole population rather than a sample, we don’t need to split them into training and test datasets. We are not trying to develop rules that cover unforeseen types of mushrooms.

Step 3 – Training a Model on the Data

We need to pick an algorithm for our model. A ZeroR algorithm would not be helpful. ZeroR ignores all features and simply predicts the mode of the target. It would see that 4,208 mushrooms were labeled “edible” and 3,916 were labeled “poisonous”. A zero-rule algorithm would therefore predict that every unknown mushroom was “edible”, simply because there are more edible mushrooms in the training dataset. That’s an accuracy of only about 52%! We can do much better than that.
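
The ZeroR baseline can be checked with a few lines of base R. This is a minimal sketch that types in the class counts from the table() output by hand instead of reading mushrooms.csv; the variable names are chosen for illustration only:

```r
# Class counts copied from table(mushrooms$type)
class_counts <- c(edible = 4208, poisonous = 3916)

# ZeroR always predicts the most frequent class
zero_r_prediction <- names(which.max(class_counts))
zero_r_prediction

# Its accuracy is simply the share of that class
zero_r_accuracy <- max(class_counts) / sum(class_counts)
round(100 * zero_r_accuracy, 1)  # about 51.8
```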

A one-rule (OneR) algorithm analyzes the data and generates a single rule based on one feature. It works by building a candidate rule for each feature and keeping the one that makes the fewest errors. Despite its simplicity, it is often very accurate.
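
To make the idea concrete, here is a rough base-R sketch of the one-rule selection step. It is not the OneR package's implementation; the data frame `toy` and the helper `one_rule_errors()` are hypothetical and exist only for this illustration. For each feature, every value predicts its majority class, and the feature with the fewest total errors wins:

```r
# Hypothetical toy data, not the real mushroom data
toy <- data.frame(
  odor = c("foul", "foul", "none", "none", "almond", "none"),
  cap  = c("flat", "bell", "flat", "bell", "flat", "bell"),
  type = c("poisonous", "poisonous", "edible", "edible", "edible", "poisonous"),
  stringsAsFactors = TRUE
)

# Errors made by a rule that predicts the majority class for each feature value
one_rule_errors <- function(feature, target) {
  counts <- table(feature, target)
  sum(rowSums(counts) - apply(counts, 1, max))
}

# Score every feature except the target and keep the best one
features <- setdiff(names(toy), "type")
errors <- sapply(features, function(f) one_rule_errors(toy[[f]], toy$type))
best <- names(which.min(errors))

errors
best
```

In this toy example odor makes one error and cap makes two, so odor is selected.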

Rule Learner Using the OneR() algorithm

#install.packages("OneR")
library(OneR)

# train OneR() on the data
mushroom_1R <- OneR(type ~ ., data = mushrooms)

print(mushroom_1R)
## 
## Call:
## OneR.formula(formula = type ~ ., data = mushrooms)
## 
## Rules:
## If odor = almond   then type = edible
## If odor = anise    then type = edible
## If odor = creosote then type = poisonous
## If odor = fishy    then type = poisonous
## If odor = foul     then type = poisonous
## If odor = musty    then type = poisonous
## If odor = none     then type = edible
## If odor = pungent  then type = poisonous
## If odor = spicy    then type = poisonous
## 
## Accuracy:
## 8004 of 8124 instances classified correctly (98.52%)

The OneR algorithm is much more accurate: 98.52%! These rules are very easy to understand. The algorithm found a strong pattern in the “odor” feature. If the odor was almond, anise, or “none”, it predicted the mushroom would be edible; otherwise it predicted poisonous. This single-feature rule correctly classified 98.52% of the mushrooms. Just one rule! I’m surprised at how well this worked.

Step 4 – Evaluating Model Performance

mushroom_1R_pred <- predict(mushroom_1R, mushrooms)
table(actual = mushrooms$type, predicted = mushroom_1R_pred)
##            predicted
## actual      edible poisonous
##   edible      4208         0
##   poisonous    120      3796
# DANGEROUS RULE. TOO MANY FALSE POSITIVES. 120 predicted edible, but actually poisonous

However, the OneR rules incorrectly predicted that a poisonous mushroom was edible on 120 occasions. (These are false positives if “edible” is treated as the positive class.) The risk may seem low: 120 mushrooms is about 1.5% of the dataset and about 3.1% of the poisonous mushrooms. It is still unacceptably high given the consequence of such a mistake: potentially fatal poisoning. We can do better at limiting these dangerous errors.
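
The counts in the confusion matrix can be turned into rates directly. The sketch below types the matrix in by hand from the output above rather than recomputing it from the data; the object names are illustrative:

```r
# Confusion matrix copied from the table() output (rows = actual, columns = predicted)
cm <- matrix(
  c(4208, 120,   # predicted edible:    actual edible, actual poisonous
    0,    3796), # predicted poisonous: actual edible, actual poisonous
  nrow = 2,
  dimnames = list(actual    = c("edible", "poisonous"),
                  predicted = c("edible", "poisonous"))
)

# Overall accuracy: correct predictions over all mushrooms
accuracy <- sum(diag(cm)) / sum(cm)

# Share of poisonous mushrooms that were wrongly called edible
missed_poisonous <- cm["poisonous", "edible"] / sum(cm["poisonous", ])

# Share of "edible" predictions that were actually poisonous
unsafe_edible_calls <- cm["poisonous", "edible"] / sum(cm[, "edible"])

round(100 * c(accuracy = accuracy,
              missed_poisonous = missed_poisonous,
              unsafe_edible_calls = unsafe_edible_calls), 2)
```

The last two rates are the ones that matter for a forager: roughly 3 in every 100 poisonous mushrooms slip through as “edible”.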

Step 5 – Improving Model Performance

RIPPER Algorithm using the JRip() Classifier

Another common rule learner is the RIPPER algorithm.

The RIPPER algorithm can create more complex rules than the OneR algorithm because it considers more than one feature of the dataset. It works in three steps. First, it grows a rule by adding conditions until the rule perfectly classifies a subset of the data. Second, it prunes rules that no longer reduce entropy. Third, it optimizes the entire set of rules, repeating the grow and prune steps until it reaches a stopping criterion. These three steps can be summarized as “grow, prune, optimize”.

detach("package:OneR", unload = TRUE)
#install.packages("RWeka")
library(RWeka) #GOOGLE WEKA AND LEARN ABOUT IT
mushroom_JRip <- JRip(type ~ ., data = mushrooms)
mushroom_JRip
## JRIP rules:
## ===========
## 
## (odor = foul) => type=poisonous (2160.0/0.0)
## (gill_size = narrow) and (gill_color = buff) => type=poisonous (1152.0/0.0)
## (gill_size = narrow) and (odor = pungent) => type=poisonous (256.0/0.0)
## (odor = creosote) => type=poisonous (192.0/0.0)
## (spore_print_color = green) => type=poisonous (72.0/0.0)
## (stalk_surface_below_ring = scaly) and (stalk_surface_above_ring = silky) => type=poisonous (68.0/0.0)
## (habitat = leaves) and (cap_color = white) => type=poisonous (8.0/0.0)
## (stalk_color_above_ring = yellow) => type=poisonous (8.0/0.0)
##  => type=edible (4208.0/0.0)
## 
## Number of Rules : 9
summary(mushroom_JRip)
## 
## === Summary ===
## 
## Correctly Classified Instances        8124              100      %
## Incorrectly Classified Instances         0                0      %
## Kappa statistic                          1     
## Mean absolute error                      0     
## Root mean squared error                  0     
## Relative absolute error                  0      %
## Root relative squared error              0      %
## Total Number of Instances             8124     
## 
## === Confusion Matrix ===
## 
##     a    b   <-- classified as
##  4208    0 |    a = edible
##     0 3916 |    b = poisonous

Indeed, the RIPPER algorithm classified every instance correctly (100% accuracy). Fantastic! However, the rules are now a bit more complicated to understand because some of them combine conditions with “and”. Nine rules were used. The result is a little less straightforward, but still accessible.

This is how to read the rules: If the odor is “foul”, then the mushroom is poisonous. If the stalk_surface_below_ring is scaly and the stalk_surface_above_ring is silky, then the mushroom is poisonous. The first eight rules identify all of the poisonous mushrooms. The ninth is a default rule: any mushroom not covered by the first eight rules is classified as edible.
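
To show how the rule list reads as logic, the sketch below hand-translates the printed JRip rules into a base-R function. The function `classify_jrip()` and the two sample mushrooms are hypothetical and only illustrate the reading of the rules; this is not how RWeka applies its model internally. Because the first eight rules all predict “poisonous”, they can be combined with “or”, and the default rule covers everything else:

```r
# Two hypothetical mushrooms, described only by the features the rules use
samples <- data.frame(
  odor                     = c("none",   "pungent"),
  gill_size                = c("broad",  "narrow"),
  gill_color               = c("brown",  "black"),
  spore_print_color        = c("brown",  "black"),
  stalk_surface_below_ring = c("smooth", "smooth"),
  stalk_surface_above_ring = c("smooth", "smooth"),
  habitat                  = c("woods",  "urban"),
  cap_color                = c("brown",  "white"),
  stalk_color_above_ring   = c("white",  "white")
)

classify_jrip <- function(m) {
  # Rules 1 to 8: any match means poisonous
  poisonous <-
    m$odor == "foul" |
    (m$gill_size == "narrow" & m$gill_color == "buff") |
    (m$gill_size == "narrow" & m$odor == "pungent") |
    m$odor == "creosote" |
    m$spore_print_color == "green" |
    (m$stalk_surface_below_ring == "scaly" &
       m$stalk_surface_above_ring == "silky") |
    (m$habitat == "leaves" & m$cap_color == "white") |
    m$stalk_color_above_ring == "yellow"

  # Rule 9: the default, everything else is edible
  ifelse(poisonous, "poisonous", "edible")
}

classify_jrip(samples)
```

The first sample matches none of the eight rules and falls through to the default; the second matches the narrow-gill, pungent-odor rule.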

Rule Learner Using C5.0 Decision Trees

detach("package:RWeka", unload = TRUE) #OneR will not print if RWeka is attached
library(C50)
mushroom_c5rules <- C5.0(type ~ odor + gill_size, data = mushrooms, rules = TRUE)
summary(mushroom_c5rules)
## 
## Call:
## C5.0.formula(formula = type ~ odor + gill_size, data = mushrooms, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu Sep 29 23:25:04 2022
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 8124 cases (3 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (4328/120, lift 1.9)
##  odor in {almond, anise, none}
##  ->  class edible  [0.972]
## 
## Rule 2: (3796, lift 2.1)
##  odor in {creosote, fishy, foul, musty, pungent, spicy}
##  ->  class poisonous  [1.000]
## 
## Default class: edible
## 
## 
## Evaluation on training data (8124 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       2  120( 1.5%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    4208          (a): class edible
##     120  3796    (b): class poisonous
## 
## 
##  Attribute usage:
## 
##  100.00% odor
## 
## 
## Time: 0.0 secs

The C5.0 classifier identified two if/then rules that classify the mushrooms with a 1.5% error rate. 120 false positives occurred. This is the same result as the OneR algorithm. Although gill_size was offered as a predictor, C5.0 used only odor, so its rules are the same as the OneR rules, just formatted differently.
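
The numbers printed next to each C5.0 rule can be reproduced from the counts in the output. The sketch below assumes, following my understanding of the C5.0/See5 documentation, that the bracketed confidence is a Laplace estimate, (n - m + 1) / (n + 2), where n is the number of cases the rule covers and m is the number it gets wrong, and that lift is this estimate divided by the overall frequency of the predicted class. The helper names are illustrative:

```r
# Class frequencies in the full dataset
n_total     <- 8124
p_edible    <- 4208 / n_total
p_poisonous <- 3916 / n_total

# Assumed Laplace estimate of rule accuracy: (n - m + 1) / (n + 2)
laplace <- function(n, m = 0) (n - m + 1) / (n + 2)

# Rule 1 covers 4328 cases with 120 errors and predicts edible
conf_rule1 <- laplace(4328, 120)
lift_rule1 <- conf_rule1 / p_edible

# Rule 2 covers 3796 cases with no errors and predicts poisonous
conf_rule2 <- laplace(3796)
lift_rule2 <- conf_rule2 / p_poisonous

round(c(conf_rule1 = conf_rule1, conf_rule2 = conf_rule2), 3)
round(c(lift_rule1 = lift_rule1, lift_rule2 = lift_rule2), 1)
```

Under these assumptions the results agree with the printed values: confidences of 0.972 and 1.000, and lifts of 1.9 and 2.1.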