Part 1: Collecting Data

The mushroom dataset is available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). It is used here to classify mushrooms as edible or poisonous with rule learners. Rule learners follow the “separate and conquer” technique: they greedily add rules, each covering a subset of the data, until every example is covered or no features remain for splitting.

Part 2: Rule Learners

The dataset contains 8124 observations (rows) of mushrooms and 23 variables (columns): the class label type plus 22 features. stringsAsFactors is set to TRUE this time because every variable is nominal, and factors are the form the rule learners in this exercise expect.

setwd("C:/Users/Emily/Desktop/GRADUATE PROGRAM COURSES/STAT6620 Machine Learning with R/Machine Learning with R, Second Edition_Code/Chapter 05")
mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)
str(mushrooms)
## 'data.frame':    8124 obs. of  23 variables:
##  $ type                    : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap_shape               : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
##  $ cap_surface             : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
##  $ cap_color               : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
##  $ gill_attachment         : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill_spacing            : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill_size               : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill_color              : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
##  $ stalk_shape             : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk_root              : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
##  $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_color_above_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk_color_below_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil_type               : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
##  $ veil_color              : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring_number             : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring_type               : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore_print_color       : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
##  $ population              : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...

Nominal features with several categories are useful for classification, because the class of a new observation can be predicted from the feature levels it shares with known observations. However, one feature in the dataset, veil_type, has only one level. A feature that takes the same value for every mushroom provides no information for telling the classes apart, so it is dropped before proceeding to the next step.

mushrooms$veil_type <- NULL
str(mushrooms)
## 'data.frame':    8124 obs. of  22 variables:
##  $ type                    : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap_shape               : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
##  $ cap_surface             : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
##  $ cap_color               : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
##  $ gill_attachment         : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill_spacing            : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill_size               : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill_color              : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
##  $ stalk_shape             : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk_root              : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
##  $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_color_above_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk_color_below_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil_color              : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring_number             : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring_type               : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore_print_color       : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
##  $ population              : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
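
The same check can be automated rather than read off the str() output. The sketch below is only an illustration on a small made-up data frame (the toy values are hypothetical, not rows from mushrooms.csv): it counts the levels of each factor and drops any column with fewer than two.

```r
# Toy data frame standing in for the mushroom data (hypothetical values).
toy <- data.frame(
  type      = factor(c("edible", "poisonous", "edible", "poisonous")),
  odor      = factor(c("none", "foul", "almond", "foul")),
  veil_type = factor(c("partial", "partial", "partial", "partial"))
)

# Count the levels of every factor column.
level_counts <- sapply(toy, nlevels)

# Columns with a single level cannot separate the classes.
constant_cols <- names(level_counts)[level_counts < 2]
constant_cols

# Drop them all in one step.
toy_clean <- toy[, setdiff(names(toy), constant_cols), drop = FALSE]
names(toy_clean)
```

On the real data, the same sapply(mushrooms, nlevels) call should flag only veil_type, in line with the str() output above.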

The table() function shows the distribution of the target label type: 4208 observations are edible and 3916 are poisonous.

table(mushrooms$type)
## 
##    edible poisonous 
##      4208      3916
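
The counts are easier to compare as percentages. As a small sketch, the class column is rebuilt here from the two counts above so that the snippet runs on its own; with the data loaded, prop.table(table(mushrooms$type)) would do the same job.

```r
# Rebuild the class column from the counts reported by table() above.
type <- factor(rep(c("edible", "poisonous"), times = c(4208, 3916)))

# Convert counts to percentages, rounded to one decimal place.
class_pct <- round(prop.table(table(type)) * 100, 1)
class_pct
```

The classes are close to balanced (roughly 52% edible and 48% poisonous), so plain accuracy is a reasonable first summary of model quality.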

7000 observations are randomly selected as the training set, and the remaining 1124 observations form the test set.

set.seed(123)
train_sample <- sample(8124, 7000)

str(train_sample)
##  int [1:7000] 2337 6404 3322 7171 7637 370 4288 7244 4476 3706 ...
mushrooms_train <- mushrooms[train_sample, ]
mushrooms_test  <- mushrooms[-train_sample, ]
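
A quick sanity check on a split like this is to confirm the set sizes and that no row ends up in both sets. The sketch below uses a stand-in data frame with the same number of rows (the id column is hypothetical and exists only to track rows), so it runs without mushrooms.csv.

```r
# Stand-in data frame with the same number of rows as the mushroom data.
stand_in <- data.frame(id = seq_len(8124))

set.seed(123)
train_idx <- sample(nrow(stand_in), 7000)   # sampling without replacement

train <- stand_in[train_idx, , drop = FALSE]
test  <- stand_in[-train_idx, , drop = FALSE]

# The two sets should have 7000 and 1124 rows and share no rows.
c(train = nrow(train), test = nrow(test))
length(intersect(train$id, test$id))
```

With the real data it is also worth comparing prop.table(table(mushrooms_train$type)) with the test-set equivalent, to confirm that the random split kept the class balance similar in both sets.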

The RWeka package is loaded for its classification rule learning algorithms.

library(RWeka) 
## Warning: package 'RWeka' was built under R version 3.3.3

Step 3: Training a model on the data

The OneR() function is given all 21 features as candidate predictors of mushroom type (edible or poisonous) in the training set.

Typing the name of the classifier object prints the learned rule. OneR() compares the error rate of a one-feature rule for each of the 21 features and selects odor, which has the lowest error rate. Each level of odor is then assigned to edible or poisonous according to the majority class among the training mushrooms with that level.

mushroom_1R <- OneR(type ~ ., data = mushrooms_train)

mushroom_1R
## odor:
##  almond  -> edible
##  anise   -> edible
##  creosote    -> poisonous
##  fishy   -> poisonous
##  foul    -> poisonous
##  musty   -> poisonous
##  none    -> edible
##  pungent -> poisonous
##  spicy   -> poisonous
## (6895/7000 instances correct)
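
The selection logic described above can be sketched in a few lines of base R. This is a simplified illustration on made-up data, not RWeka's implementation: the toy values and the helper one_rule_errors() are hypothetical, and ties between features or classes are not handled specially.

```r
# Toy data (hypothetical values, not taken from mushrooms.csv).
toy <- data.frame(
  type      = c("edible", "edible", "poisonous", "poisonous", "edible", "poisonous"),
  odor      = c("none",   "almond", "foul",      "foul",      "none",   "none"),
  gill_size = c("broad",  "broad",  "narrow",    "broad",     "narrow", "narrow"),
  stringsAsFactors = TRUE
)

# Error count of a one-feature rule: within each level of the feature,
# predict the majority class and count the rows that disagree.
one_rule_errors <- function(feature, class) {
  counts <- table(feature, class)
  sum(counts) - sum(apply(counts, 1, max))
}

features <- setdiff(names(toy), "type")
errors <- sapply(features, function(f) one_rule_errors(toy[[f]], toy$type))
errors

# The feature with the fewest errors becomes the single rule.
best <- names(which.min(errors))
best

# The rule itself: majority class for each level of the chosen feature.
counts <- table(toy[[best]], toy$type)
rule <- colnames(counts)[apply(counts, 1, which.max)]
names(rule) <- rownames(counts)
rule
```

In this toy example odor makes one error and gill_size makes two, so odor is chosen, and its "none" level maps to edible by majority vote even though one "none" mushroom is poisonous. The 105 training errors of the real model arise the same way.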

The summary() function gives more detail on the OneR() classifier. Of the 7000 observations in the training set, 105 are misclassified. Even an error rate of 1.5% is unacceptable here, because a mistake could make people sick or even kill them.

The confusion matrix at the bottom of the summary shows that all 105 errors are poisonous mushrooms classified as edible. Overall, the model reaches 98.5% accuracy and 1.5% error.

summary(mushroom_1R)
## 
## === Summary ===
## 
## Correctly Classified Instances        6895               98.5    %
## Incorrectly Classified Instances       105                1.5    %
## Kappa statistic                          0.9699
## Mean absolute error                      0.015 
## Root mean squared error                  0.1225
## Relative absolute error                  3.0039 %
## Root relative squared error             24.5108 %
## Total Number of Instances             7000     
## 
## === Confusion Matrix ===
## 
##     a    b   <-- classified as
##  3626    0 |    a = edible
##   105 3269 |    b = poisonous
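
The headline figures in the summary can be recomputed from the confusion matrix alone. The sketch below copies the four counts from the output above and derives accuracy, error and Cohen's kappa (agreement corrected for chance); the variable names are illustrative only.

```r
# Training confusion matrix from summary(mushroom_1R):
# rows = actual class, columns = predicted class.
cm <- matrix(c(3626,    0,
                105, 3269),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("edible", "poisonous"),
                             predicted = c("edible", "poisonous")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n          # observed agreement
error    <- 1 - accuracy

# Agreement expected by chance, from the row and column totals.
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, error = error, kappa = kappa_stat), 4)
```

The results agree with the summary: 98.5% accuracy, 1.5% error and a kappa of about 0.9699.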

Step 4: Evaluating model performance

The OneR classifier is now used to predict mushroom type in the test set. After loading the gmodels package, the CrossTable() function produces a confusion matrix for the test set.

Of the 1124 mushrooms in the test set, 15 were classified as edible although they are actually poisonous. That is 98.7% accuracy and 1.3% error.

mushroom_pred <- predict(mushroom_1R, mushrooms_test)
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.3.3
CrossTable(mushrooms_test$type, mushroom_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1124 
## 
##  
##                | predicted default 
## actual default |    edible | poisonous | Row Total | 
## ---------------|-----------|-----------|-----------|
##         edible |       582 |         0 |       582 | 
##                |     0.518 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|
##      poisonous |        15 |       527 |       542 | 
##                |     0.013 |     0.469 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |       597 |       527 |      1124 | 
## ---------------|-----------|-----------|-----------|
## 
## 
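
Because every error falls on the poisonous class, the overall error rate understates the risk. A short calculation from the table above, with the counts copied by hand, makes this concrete.

```r
# Test-set counts from the CrossTable() output above.
poisonous_total     <- 542
poisonous_as_edible <- 15
n_test              <- 1124

overall_error <- poisonous_as_edible / n_test           # share of all test rows
miss_rate     <- poisonous_as_edible / poisonous_total  # share of poisonous rows

round(100 * c(overall_error = overall_error, miss_rate = miss_rate), 2)
```

About 1.33% of all test mushrooms are misclassified, but about 2.77% of the poisonous ones are labeled edible, which is the figure that matters for safety.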

Step 5: Improving model performance

It is too risky to bet human lives on a single characteristic of the mushroom. A more sophisticated learner is Repeated Incremental Pruning to Produce Error Reduction (RIPPER), which uses a set of rules, applied in if-else fashion, to identify the type of each observation. Because it can consider more than one feature, it can build far more complex rules than OneR.

The JRip() function implements the RIPPER algorithm. Because OneR() and JRip() both come from the RWeka package, their syntax is nearly identical, which makes the models easy to compare.

On the same training data, RIPPER learned nine rules instead of OneR's one. The first eight rules each identify poisonous mushrooms; the numbers in parentheses give the number of training instances covered by the rule and the number of those it misclassifies. The last rule is the default: anything not matched by the first eight rules is classified as “edible.”

With this larger set of rules, the confusion matrix for the RIPPER model shows 100% accuracy and 0% error on the training set, a clear improvement over OneR. This matters for mushroom classification, where a single poisonous mushroom mistaken for edible can cost a life.

mushroom_JRip <- JRip(type ~ ., data = mushrooms_train)
mushroom_JRip
## JRIP rules:
## ===========
## 
## (odor = foul) => type=poisonous (1860.0/0.0)
## (gill_size = narrow) and (gill_color = buff) => type=poisonous (986.0/0.0)
## (gill_size = narrow) and (odor = pungent) => type=poisonous (222.0/0.0)
## (odor = creosote) => type=poisonous (171.0/0.0)
## (spore_print_color = green) => type=poisonous (65.0/0.0)
## (stalk_surface_below_ring = scaly) and (stalk_surface_above_ring = silky) => type=poisonous (58.0/0.0)
## (habitat = leaves) and (cap_surface = scaly) and (population = clustered) => type=poisonous (10.0/0.0)
## (cap_surface = grooves) => type=poisonous (2.0/0.0)
##  => type=edible (3626.0/0.0)
## 
## Number of Rules : 9
summary(mushroom_JRip)  
## 
## === Summary ===
## 
## Correctly Classified Instances        7000              100      %
## Incorrectly Classified Instances         0                0      %
## Kappa statistic                          1     
## Mean absolute error                      0     
## Root mean squared error                  0     
## Relative absolute error                  0      %
## Root relative squared error              0      %
## Total Number of Instances             7000     
## 
## === Confusion Matrix ===
## 
##     a    b   <-- classified as
##  3626    0 |    a = edible
##     0 3374 |    b = poisonous
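
To make the if-else reading of the rule list concrete, the printed rules can be transcribed by hand into a plain R function. This is only a sketch: apply_jrip_rules() is a hypothetical helper, not part of RWeka, and the two example mushrooms are made up. Because all eight non-default rules predict poisonous, checking them in order is equivalent to combining them with a logical OR.

```r
# JRip rules transcribed from the printed model above.
# Returns "poisonous" if any rule fires, otherwise the default "edible".
apply_jrip_rules <- function(d) {
  poisonous <-
    d$odor == "foul" |
    (d$gill_size == "narrow" & d$gill_color == "buff") |
    (d$gill_size == "narrow" & d$odor == "pungent") |
    d$odor == "creosote" |
    d$spore_print_color == "green" |
    (d$stalk_surface_below_ring == "scaly" &
       d$stalk_surface_above_ring == "silky") |
    (d$habitat == "leaves" & d$cap_surface == "scaly" &
       d$population == "clustered") |
    d$cap_surface == "grooves"
  ifelse(poisonous, "poisonous", "edible")
}

# Two made-up mushrooms (hypothetical feature values).
toy <- data.frame(
  odor                     = c("none", "none"),
  gill_size                = c("broad", "narrow"),
  gill_color               = c("brown", "buff"),
  spore_print_color        = c("brown", "white"),
  stalk_surface_above_ring = c("smooth", "smooth"),
  stalk_surface_below_ring = c("smooth", "smooth"),
  habitat                  = c("woods", "grasses"),
  cap_surface              = c("smooth", "smooth"),
  population               = c("several", "several")
)

apply_jrip_rules(toy)
```

The second mushroom has no odor, so OneR would call it edible, but the narrow gill size combined with buff gill color triggers the second RIPPER rule. Cases of this kind are how RIPPER recovers the errors OneR makes.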

The RIPPER model is then used to predict mushroom types in the test set. The confusion matrix shows 100% accuracy and 0% error for all 1124 test observations.

mushroom_pred <- predict(mushroom_JRip, mushrooms_test)

library(gmodels)
CrossTable(mushrooms_test$type, mushroom_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1124 
## 
##  
##                | predicted default 
## actual default |    edible | poisonous | Row Total | 
## ---------------|-----------|-----------|-----------|
##         edible |       582 |         0 |       582 | 
##                |     0.518 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|
##      poisonous |         0 |       542 |       542 | 
##                |     0.000 |     0.482 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |       582 |       542 |      1124 | 
## ---------------|-----------|-----------|-----------|
## 
## 

Side Note: What about the decision tree algorithm?

For comparison with the RWeka rule learners, the C5.0 algorithm from the C50 package was also applied to the mushroom data. Only two predictors, odor and gill_size, were supplied, and rules = TRUE was set so that C5.0 returned a rule-based model rather than a tree. The fitted model uses only odor. Its confusion matrix shows that 105 poisonous mushrooms were misclassified as edible, giving 98.5% accuracy and 1.5% error on the training set, the same result as the OneR rule learner.

library(C50)
## Warning: package 'C50' was built under R version 3.3.3
mushroom_c5rules <- C5.0(type ~ odor + gill_size, 
                         data = mushrooms_train, rules = TRUE)
mushroom_c5rules
## 
## Call:
## C5.0.formula(formula = type ~ odor + gill_size, data =
##  mushrooms_train, rules = TRUE)
## 
## Rule-Based Model
## Number of samples: 7000 
## Number of predictors: 2 
## 
## Number of Rules: 2 
## 
## Non-standard options: attempt to group attributes
summary(mushroom_c5rules)
## 
## Call:
## C5.0.formula(formula = type ~ odor + gill_size, data =
##  mushrooms_train, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon May 01 00:13:15 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 7000 cases (3 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (3731/105, lift 1.9)
##  odor in {almond, anise, none}
##  ->  class edible  [0.972]
## 
## Rule 2: (3269, lift 2.1)
##  odor in {creosote, fishy, foul, musty, pungent, spicy}
##  ->  class poisonous  [1.000]
## 
## Default class: edible
## 
## 
## Evaluation on training data (7000 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       2  105( 1.5%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    3626          (a): class edible
##     105  3269    (b): class poisonous
## 
## 
##  Attribute usage:
## 
##  100.00% odor
## 
## 
## Time: 0.0 secs
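
The bracketed confidence and the lift printed for each rule can be reproduced by hand. The sketch below assumes, based on the C5.0 documentation as I understand it, that the confidence is a Laplace ratio (n - m + 1) / (n + 2), with n the cases a rule covers and m the covered cases it gets wrong, and that lift is this confidence divided by the predicted class's share of the training data. The helper laplace() is illustrative, not a C50 function.

```r
# Laplace-corrected rule confidence: (n - m + 1) / (n + 2),
# where n = cases covered by the rule and m = covered cases it gets wrong.
laplace <- function(n, m = 0) (n - m + 1) / (n + 2)

rule1_conf <- laplace(3731, 105)   # Rule 1: (3731/105) -> edible
rule2_conf <- laplace(3269)        # Rule 2: (3269)     -> poisonous

# Lift: rule confidence divided by the class's share of the training data
# (3626 edible and 3374 poisonous out of 7000, from the confusion matrix).
rule1_lift <- rule1_conf / (3626 / 7000)
rule2_lift <- rule2_conf / (3374 / 7000)

round(c(rule1_conf, rule2_conf), 3)
round(c(rule1_lift, rule2_lift), 1)
```

Under these assumptions the values come out as 0.972 and 1.000 for confidence and 1.9 and 2.1 for lift, matching the printed rules.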

The C5.0 model is then used to predict mushroom type in the test set, and the predictions are compared with the actual class labels. The confusion matrix shows 15 poisonous mushrooms misclassified as edible (false negatives, if poisonous is treated as the positive class), which again matches the OneR result.

mushroom_pred <- predict(mushroom_c5rules, mushrooms_test)
library(gmodels)
CrossTable(mushrooms_test$type, mushroom_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1124 
## 
##  
##                | predicted default 
## actual default |    edible | poisonous | Row Total | 
## ---------------|-----------|-----------|-----------|
##         edible |       582 |         0 |       582 | 
##                |     0.518 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|
##      poisonous |        15 |       527 |       542 | 
##                |     0.013 |     0.469 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |       597 |       527 |      1124 | 
## ---------------|-----------|-----------|-----------|
## 
## 

Conclusion: OneR and RIPPER are both rule-based classification algorithms. They use the “separate and conquer” technique, greedily adding rules, each covering a subset of the data, until all examples are covered or no features remain for splitting. Both work well for predicting a nominal outcome from nominal predictors, while decision trees can handle both numeric and nominal features. Rule learners can also re-examine cases that were considered but not covered by earlier rules, whereas a decision tree cannot revisit or modify partitions it has already made; for this reason rule learners are often said to be more parsimonious than decision trees. Overall, RIPPER can handle more complex data and produces multiple rules, which may lead to higher classification accuracy, as it did here.