Understanding classification rules

Unlike a tree, which must be applied from top-to-bottom through a series of decisions, rules are propositions that can be read much like a statement of fact.

Separate and conquer

The process involves identifying a rule that covers a subset of examples in the training data, and then separating this partition from the remaining data. As the rules are added, additional subsets of the data are separated until the entire dataset has been covered and no more examples remain.

  • One way to imagine the rule learning process is to think about drilling down into the data by creating increasingly specific rules to identify class values
  • A rule learner begins by using the available features to find homogeneous groups.
  • More specific: drill down further by suggesting that mammals must walk on land and have a tail: all animals

  • An additional rule can be defined to separate out the bats, the only remaining mammal. A potential feature distinguishing bats from the other remaining animals would be the presence of fur.

all animals2

We learned a total of three rules:

  1. Animals that walk on land and have tails are mammals
  2. If the animal does not have fur, it is not a mammal
  3. Otherwise, the animal is a mammal

The 1R algorithm

The 1R algorithm (One Rule or OneR), improves over ZeroR by selecting a single rule

For each feature, 1R divides the data into groups based on similar values of the feature. Then, for each segment, the algorithm predicts the majority class. The error rate for the rule based on each feature is calculated and the rule with the fewest errors is chosen as the one rule.

1R

For the Travels By feature, the dataset was divided into three groups: Air, Land, and Sea. Animals in the Air and Sea groups were predicted to be non-mammal, while animals in the Land group were predicted to be mammals. This resulted in two errors: bats and frogs. The Has Fur feature divided animals into two groups. Those with fur were predicted to be mammals, while those without fur were not predicted to be mammals. Three errors were counted: pigs, elephants, and rhinos. As the Travels By feature results in fewer errors, the 1R algorithm will return the following “one rule” based on Travels By:

  • If the animal travels by air, it is not a mammal
  • If the animal travels by land, it is a mammal
  • If the animal travels by sea, it is not a mammal

The RIPPER algorithm

Repeated Incremental Pruning to Produce Error Reduction algorithm

  • The growing phase uses the separate and conquer technique to greedily add conditions to a rule until it perfectly classifies a subset of data or runs out of attributes for splitting.
  • When increasing a rule’s specificity no longer reduces entropy, the rule is immediately pruned.
  • Steps one and two are repeated until it reaches a stopping criterion, at which point the entire set of rules is optimized using a variety of heuristics.

Rules from decision trees

What makes trees and rules greedy?

Both the divide and conquer heuristic used by decision trees and the separate and conquer heuristic used by rule learners attempt to make partitions one at a time, finding the most homogeneous partition first, followed by the next best, and so on, until all examples have been classified.

  • Once divide and conquer splits on a feature, the partitions created by the split may not be re-conquered, only further subdivided. In this way, a tree is permanently limited by its history of past decisions.
  • In contrast, once separate and conquer finds a rule, any examples not covered by all of the rule’s conditions may be re-conquered.

the rule learner allows frogs to be re-conquered by the “does not have fur” decision, the decision tree cannot modify the existing partitions, and therefore must place the frog into its own rule.

difference

Rule learners can reexamine cases that were considered but ultimately not covered as part of prior rules, rule learners often find a more parsimonious set of rules than those generated from decision trees. On the other hand, this reuse of data means that the computational cost of rule learners may be somewhat higher than for decision trees.

Example - identifying poisonous mushrooms with rule learners

Step 1 - collecting data

The dataset includes information on 8,124 mushroom samples from 23 species of gilled mushrooms

Step 2 - exploring and preparing the data

Since all the 22 features and the target class are nominal, in this case, we will set stringsAsFactors = TRUE and take advantage of the automatic factor conversion:

mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)
str(mushrooms)
## 'data.frame':    8124 obs. of  23 variables:
##  $ type                    : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap_shape               : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
##  $ cap_surface             : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
##  $ cap_color               : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
##  $ gill_attachment         : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill_spacing            : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill_size               : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill_color              : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
##  $ stalk_shape             : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk_root              : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
##  $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_color_above_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk_color_below_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil_type               : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
##  $ veil_color              : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring_number             : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring_type               : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore_print_color       : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
##  $ population              : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
  • veil_type has only one level. The data dictionary lists two levels for this feature: partial and universal. It is likely that this data element was somehow coded incorrectly.

We will drop this variable from our analysis

mushrooms$veil_type <- NULL

About 52 percent of the mushroom samples (N = 4,208) are edible, while 48 percent (N = 3,916) are poisonous.

table(mushrooms$type)
## 
##    edible poisonous 
##      4208      3916

We will consider the 8,214 samples in the mushroom data to be an exhaustive set of all the possible wild mushrooms. This is an important assumption, because it means that we do not need to hold some samples out of the training data for testing purposes. We are not trying to develop rules that cover unforeseen types of mushrooms; we are merely trying to find rules that accurately depict the complete set of known mushroom types. Therefore, we can build and test the model on the same data.

Step 3 - training a model on the data

If we trained a hypothetical ZeroR classifier on this data, what would it predict? Since ZeroR ignores all of the features and simply predicts the target’s mode, in plain language, its rule would state that all the mushrooms are edible.

library(RWeka)
## Warning: package 'RWeka' was built under R version 3.2.5

The class variable to be learned goes to the left of the tilde, and the predictor features are written on the right, separated by + operators.

  • Model the relationship between the y class and predictors x1 and x2, you could write the formula as y ~ x1 + x2.
  • If you like to include all the variables in the model, the special term . can be used. For example, y ~ .

Allow our first OneR() rule learner to consider all the possible features in the mushroom data while constructing its rules to predict type:

mushrooms_1R <- OneR(type ~ ., data = mushrooms)
mushrooms_1R
## odor:
##  almond  -> edible
##  anise   -> edible
##  creosote    -> poisonous
##  fishy   -> poisonous
##  foul    -> poisonous
##  musty   -> poisonous
##  none    -> edible
##  pungent -> poisonous
##  spicy   -> poisonous
## (8004/8124 instances correct)
  • We see that the odor feature was selected for rule generation. For the purposes of a field guide for mushroom gathering, these rules could be summarized in a simple rule of thumb: “if the mushroom smells unappetizing, then it is likely to be poisonous.”

Step 4 - evaluating model performance

summary(mushrooms_1R)
## 
## === Summary ===
## 
## Correctly Classified Instances        8004               98.5229 %
## Incorrectly Classified Instances       120                1.4771 %
## Kappa statistic                          0.9704
## Mean absolute error                      0.0148
## Root mean squared error                  0.1215
## Relative absolute error                  2.958  %
## Root relative squared error             24.323  %
## Total Number of Instances             8124     
## 
## === Confusion Matrix ===
## 
##     a    b   <-- classified as
##  4208    0 |    a = edible
##   120 3796 |    b = poisonous

Step 5 - improving model performance

The process of training a JRip() model is very similar to how we previously trained a OneR() model.

mushroom_JRip <- JRip(type ~ ., data = mushrooms)
mushroom_JRip
## JRIP rules:
## ===========
## 
## (odor = foul) => type=poisonous (2160.0/0.0)
## (gill_size = narrow) and (gill_color = buff) => type=poisonous (1152.0/0.0)
## (gill_size = narrow) and (odor = pungent) => type=poisonous (256.0/0.0)
## (odor = creosote) => type=poisonous (192.0/0.0)
## (spore_print_color = green) => type=poisonous (72.0/0.0)
## (stalk_surface_below_ring = scaly) and (stalk_surface_above_ring = silky) => type=poisonous (68.0/0.0)
## (habitat = leaves) and (cap_color = white) => type=poisonous (8.0/0.0)
## (stalk_color_above_ring = yellow) => type=poisonous (8.0/0.0)
##  => type=edible (4208.0/0.0)
## 
## Number of Rules : 9

The JRip() classifier learned a total of nine rules from the mushroom data. An easy way to read these rules is to think of them as a list of if-else statements, similar to programming logic. The first three rules could be expressed as:

  • If the odor is foul, then the mushroom type is poisonous
  • If the gill size is narrow and the gill color is buff, then the mushroom type is poisonous
  • If the gill size is narrow and the odor is pungent, then the mushroom type is poisonous

Finally, the ninth rule implies that any mushroom sample that was not covered by the preceding eight rules is edible.

  • Else, the mushroom is edible

The numbers next to each rule indicate the number of instances covered by the rule and a count of misclassified instances.

The following figure provides a rough illustration of how the rules are applied to the mushroom data

mushroom

Summary:

Two classification methods that use so-called “greedy” algorithms to partition the data according to feature values.

  • Decision trees use a divide and conquer strategy to create flowchart-like structures C5.0
  • Rule learners separate and conquer data to identify logical if-else rules. 1R, RIPPER