Decision Trees

The idea

Predictably, decision tree learners store their model in the form of a decision tree, quite similar to a flowchart.

Data that’s to be classified begins at the root node, where it is passed through the various decisions in the tree according to feature values, until a leaf node is reached - which is when it’s assigned a predicted class.

Decision Trees offer benefits of transparency (useful for when analysis needs to be explainable) and speed. They can be used to model almost any type of data, too. Sadly, they don’t do too well when there are large numbers of nominal or numeric features.

They’re built using divide-and-conquer type algorithms (aka recursive partitioning). Beginning at the root node, the algorithm splits the data on the feature that results in the greatest information gain (or entropy). Iteratively, we can repeat this splitting at each child node until the leaves are pure - i.e. that samples at each node all belong to the same class.

The practice

We’ll be using the C5.0 implementation in R. We’ll cover a bit about the splitting and pruning processes.

Splitting

The first thing the algo has to do is decide which feature to split upon (since we may well have many). As described earlier, this is done so as to increase the information gain. The gain is defined as the difference between a parent’s impurity and the sum of its children’s impurities. There are a number of different impurity measures, entropy being the one favoured in the implementation C5.0. The formula for entropy is given by:

\(Entropy(S) = \sum_{i=1}^{c} -p_i log_2(p_i)\)

where S is a segment of data (i.e. a bunch of records), c is the number of different class levels, and \(p_i\) refers to the proportion of values falling into class level \(i\). For example, if we have a partition of data with two classes: red (60%) and blue (40%), \(E = -0.6log_2(0.6) + -0.4log_2(0.4)\) = 0.97. If it were 50/50 instead of 60/40, we’d have E = 1. If it’s 100/0 we’d have E = 0. We can see what this plot looks like using R’s curve() function!

curve(-x * log2(x) - (1-x) * log2(1-x), col='red', xlab='x', ylab='E', lwd=4, main="Entropy curve for a binary pair")

The higher the information gain, the better a feature is at creating homogeneous groups after a split on that feature.

Pruning

Leaving the algo to run ad nausea can result in very deep trees, which can easily lead to overfitting, so we typically end up pruning them. We can either pre-prune (i.e. stop it at a certain size early), which has the benefit of doing less work and thus is faster, but this comes at the expense of possibly missing subtle features had it run longer; or we can post-prune using error-rates as a condition, which is more computational work. The latter is the preferred choice, as it’s difficult to know the optimal tree-size ahead of time, making pre-pruning less attractive. C5.0 post-prunes with sensible defaults - in some cases, entire branches are moved further up the tree (subtree-raising) or replaced by simpler decisions (subtree-replacement)

The balancing act of over/under fitting is best done by playing with pruning options, which, luckily for us, C5.0 makes very easy.

Example - Risky Bank Loans

As we’ll see, we can use Decision Trees to help banks spot riskier customers and mitigate financial losses.

The data

We’ll use data on a large number (1000) of past bank loans, augmented with:

  • information indicating whether the loans went into default
  • additional information about the applicants themselves

Crucially the data includes whether the customer defaulted on the loan.

The data can be found here

Exploratory Data Analysis

library(ggplot2)
credit = read.csv("D://dev//R//mlwr/chap5-decisiontrees/credit.csv")
str(credit)
## 'data.frame':    1000 obs. of  21 variables:
##  $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
##  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history      : Factor w/ 5 levels "critical","delayed",..: 1 5 1 5 2 5 5 5 5 1 ...
##  $ purpose             : Factor w/ 10 levels "business","car (new)",..: 8 8 5 6 2 5 6 3 8 2 ...
##  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
##  $ employment_length   : Factor w/ 5 levels "> 7 yrs","0 - 1 yrs",..: 1 3 4 4 3 3 1 3 4 5 ...
##  $ installment_rate    : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ personal_status     : Factor w/ 4 levels "divorced male",..: 4 2 4 4 4 4 4 4 1 3 ...
##  $ other_debtors       : Factor w/ 3 levels "co-applicant",..: 3 3 3 2 3 3 3 3 3 3 ...
##  $ residence_history   : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property            : Factor w/ 4 levels "building society savings",..: 3 3 3 1 4 4 1 2 3 2 ...
##  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ installment_plan    : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ housing             : Factor w/ 3 levels "for free","own",..: 2 2 2 1 1 1 2 3 2 2 ...
##  $ existing_credits    : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job                 : Factor w/ 4 levels "mangement self-employed",..: 2 2 4 2 2 4 2 1 4 1 ...
##  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone           : Factor w/ 2 levels "none","yes": 2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign_worker      : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ default             : int  1 2 1 1 2 1 1 1 1 2 ...

So we’ve got 1000 loans and 21 features. Since we’re primarily interested in default rates, we should see what the split is:

round(prop.table(table(credit$default))*100)
## 
##  1  2 
## 70 30
# apparently 1 means no and 2 means yes, so we'll refactor
credit$default = factor(credit$default, levels=c(1, 2), labels=c("No", "Yes"))
round(prop.table(table(credit$default))*100)
## 
##  No Yes 
##  70  30

Let’s have a peek at some of the ones which are likely to predict a default.

table(credit$checking_balance)
## 
##     < 0 DM   > 200 DM 1 - 200 DM    unknown 
##        274         63        269        394
ggplot(data=credit,
       aes(x=checking_balance)) +
      geom_bar() +
  ggtitle("Histogram of checking balances faceted by default") +
  facet_wrap(~default)

table(credit$savings_balance)
## 
##      < 100 DM     > 1000 DM  101 - 500 DM 501 - 1000 DM       unknown 
##           603            48           103            63           183
ggplot(data=credit,
       aes(x=savings_balance)) +
      geom_bar() +
  ggtitle("Histogram of savings balances faceted by default") +
  facet_wrap(~default)

Those two were categorical - it would have been nice to have them as pure numbers, but c’est la vie. Let’s look at some of the numerics (duration and amount):

summary(credit$months_loan_duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    12.0    18.0    20.9    24.0    72.0
ggplot(data=credit,
       aes(x=months_loan_duration)) +
  geom_histogram(binwidth=3) +
  ggtitle("Histogram of loan durations faceted by default") +
  facet_wrap(~default) +
  scale_x_continuous(breaks=seq(0, 75, 3), limits=c(0, 75))
## Warning: Removed 4 rows containing missing values (geom_bar).

summary(credit$amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3271    3972   18420
ggplot(data=credit,
       aes(x=amount)) +
  geom_histogram() +
  scale_x_log10() +
  ggtitle("Histogram of loan amounts faceted by default") +
  facet_wrap(~default)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Data prep - create training and test datasets

As before, we’ll split our data into a training set (90%) and a test set (10%). Unlike before, we can’t assume that the data is randomly sorted, so we’ll need to randomise its order. There’s an idiomatic trick in R to randomise a frame’s order, and it involves 2 function calls in unison:

  • order() a function used to re-arrange a list of items in ascending or descending order
  • runif() a function which chooses random numbers from a normal distribution between [0, 1]

And here’s how it’s done:

set.seed(1235) # make our results deterministic (useful for reproducing identical results over multiple runs)
credit_randomised = credit[order(runif(nrow(credit))), ]
# assert that we've not messed up the data:
summary(credit_randomised$amount) # compare this result against earlier
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3271    3972   18420
head(credit$amount)
## [1] 1169 5951 2096 7882 4870 9055
head(credit_randomised$amount)
## [1] 1092 4110 1582 3965 1282 2767
# same summary, different head() results -> we've not cocked this up :)

So let’s split as we’ve done before:

credit.train = credit_randomised[1:900, ]
credit.test  = credit_randomised[901:1000, ]

# if all's good we should still have 30% default rates in both sets:
round(prop.table(table(credit.train$default))*100)
## 
##  No Yes 
##  70  30
round(prop.table(table(credit.test$default))*100)
## 
##  No Yes 
##  71  29
# that'll do

Training a model

We’ll use C5.0:

#install.packages("C50")
library(C50)
# it's as easy as calling C5.0:
credit_model = C5.0(credit.train[-21],     # field 21 is the default, and we don't want that as part of the tree
                    credit.train$default)  # and we tell it what the defaults were (using credit[21] errors)
# We can inspect some high level data in the model:
credit_model
## 
## Call:
## C5.0.default(x = credit.train[-21], y = credit.train$default)
## 
## Classification Tree
## Number of samples: 900 
## Number of predictors: 20 
## 
## Tree size: 57 
## 
## Non-standard options: attempt to group attributes
# and some VERY good detail:
summary(credit_model)
## 
## Call:
## C5.0.default(x = credit.train[-21], y = credit.train$default)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sat Jan 16 04:19:09 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 900 cases (21 attributes) from undefined.data
## 
## Decision tree:
## 
## checking_balance in {> 200 DM,unknown}: No (407/50)
## checking_balance in {< 0 DM,1 - 200 DM}:
## :...other_debtors = guarantor: No (39/7)
##     other_debtors in {co-applicant,none}:
##     :...credit_history in {fully repaid,fully repaid this bank}:
##         :...housing = rent: Yes (14)
##         :   housing in {for free,own}:
##         :   :...other_debtors = co-applicant: No (2)
##         :       other_debtors = none:
##         :       :...housing = for free: Yes (13/1)
##         :           housing = own:
##         :           :...savings_balance = < 100 DM: Yes (18/6)
##         :               savings_balance in {> 1000 DM,
##         :               :                   501 - 1000 DM}: No (3)
##         :               savings_balance = 101 - 500 DM:
##         :               :...months_loan_duration <= 18: Yes (3)
##         :               :   months_loan_duration > 18: No (2)
##         :               savings_balance = unknown:
##         :               :...installment_plan in {bank,stores}: No (3)
##         :                   installment_plan = none: Yes (1)
##         credit_history in {critical,delayed,repaid}:
##         :...months_loan_duration > 27:
##             :...residence_history <= 1:
##             :   :...housing = for free: No (1)
##             :   :   housing = rent: Yes (2)
##             :   :   housing = own:
##             :   :   :...checking_balance = < 0 DM: Yes (3/1)
##             :   :       checking_balance = 1 - 200 DM: No (6)
##             :   residence_history > 1:
##             :   :...job = unemployed non-resident: No (2)
##             :       job = unskilled resident: Yes (5)
##             :       job in {mangement self-employed,skilled employee}:
##             :       :...installment_plan in {bank,stores}: Yes (11/1)
##             :           installment_plan = none:
##             :           :...credit_history = delayed:
##             :               :...existing_credits <= 1: No (7/1)
##             :               :   existing_credits > 1: Yes (2)
##             :               credit_history = critical:
##             :               :...telephone = none: [S1]
##             :               :   telephone = yes: [S2]
##             :               credit_history = repaid:
##             :               :...savings_balance = > 1000 DM: No (1)
##             :                   savings_balance in {101 - 500 DM,
##             :                   :                   501 - 1000 DM}: Yes (6)
##             :                   savings_balance = unknown:
##             :                   :...age <= 28: No (3)
##             :                   :   age > 28: Yes (5)
##             :                   savings_balance = < 100 DM:
##             :                   :...job = skilled employee: Yes (20/3)
##             :                       job = mangement self-employed: [S3]
##             months_loan_duration <= 27:
##             :...credit_history = critical:
##                 :...other_debtors = co-applicant: Yes (4/1)
##                 :   other_debtors = none: No (76/16)
##                 credit_history = delayed:
##                 :...checking_balance = 1 - 200 DM: No (19/3)
##                 :   checking_balance = < 0 DM:
##                 :   :...months_loan_duration <= 18: No (2)
##                 :       months_loan_duration > 18: Yes (5)
##                 credit_history = repaid:
##                 :...savings_balance = > 1000 DM: No (9)
##                     savings_balance in {< 100 DM,101 - 500 DM,501 - 1000 DM,
##                     :                   unknown}:
##                     :...months_loan_duration <= 11:
##                         :...amount <= 10875: No (35/7)
##                         :   amount > 10875: Yes (2)
##                         months_loan_duration > 11:
##                         :...amount <= 1386: Yes (50/14)
##                             amount > 1386:
##                             :...existing_credits > 1: [S4]
##                                 existing_credits <= 1:
##                                 :...employment_length = 4 - 7 yrs: No (10)
##                                     employment_length in {> 7 yrs,0 - 1 yrs,
##                                     :                     1 - 4 yrs,unemployed}:
##                                     :...telephone = yes: [S5]
##                                         telephone = none:
##                                         :...dependents > 1: [S6]
##                                             dependents <= 1: [S7]
## 
## SubTree [S1]
## 
## purpose in {business,car (new),domestic appliances,education,furniture,others,
## :           radio/tv,repairs,retraining}: Yes (5)
## purpose = car (used): No (2)
## 
## SubTree [S2]
## 
## personal_status = female: Yes (2)
## personal_status in {divorced male,married male}: No (1)
## personal_status = single male:
## :...installment_rate <= 3: No (2)
##     installment_rate > 3: Yes (3/1)
## 
## SubTree [S3]
## 
## checking_balance = < 0 DM: No (3)
## checking_balance = 1 - 200 DM: Yes (4/1)
## 
## SubTree [S4]
## 
## personal_status in {divorced male,female}: Yes (6)
## personal_status in {married male,single male}: No (4)
## 
## SubTree [S5]
## 
## other_debtors = co-applicant: No (1)
## other_debtors = none:
## :...savings_balance = < 100 DM: Yes (16/5)
##     savings_balance in {101 - 500 DM,501 - 1000 DM}: No (3/1)
##     savings_balance = unknown:
##     :...months_loan_duration <= 18: Yes (4)
##         months_loan_duration > 18: No (5)
## 
## SubTree [S6]
## 
## other_debtors = co-applicant: No (1)
## other_debtors = none: Yes (3)
## 
## SubTree [S7]
## 
## installment_plan in {bank,stores}: No (6/1)
## installment_plan = none:
## :...months_loan_duration <= 16: No (16/2)
##     months_loan_duration > 16:
##     :...residence_history > 3: No (10/2)
##         residence_history <= 3:
##         :...job in {mangement self-employed,unskilled resident}: No (3)
##             job in {skilled employee,unemployed non-resident}: Yes (9/1)
## 
## 
## Evaluation on training data (900 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      57  125(13.9%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     594    35    (a): class No
##      90   181    (b): class Yes
## 
## 
##  Attribute usage:
## 
##  100.00% checking_balance
##   54.78% other_debtors
##   50.44% credit_history
##   44.44% months_loan_duration
##   29.44% savings_balance
##   20.44% amount
##   13.89% installment_plan
##   13.11% residence_history
##   11.78% existing_credits
##   10.67% job
##   10.22% telephone
##    9.67% employment_length
##    7.89% housing
##    5.33% dependents
##    2.00% personal_status
##    0.89% age
##    0.78% purpose
##    0.56% installment_rate
## 
## 
## Time: 0.0 secs

We can see amid the summary data that our model gives 125 errors out of 900 - an accuracy of 86.1%. Note that there is a tendency for trees to overfit data.

Evaluating Model Performance

As ever, we’ll use predict()

credit_pred = predict(credit_model, credit.test)
#install.packages("gmodels")
library(gmodels)
CrossTable(credit.test$default,
           credit_pred,
           prop.chisq = FALSE,
           prop.c     = FALSE,
           prop.r     = FALSE,
           dnn = c('actual default', 'predicted default'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | predicted default 
## actual default |        No |       Yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             No |        62 |         9 |        71 | 
##                |     0.620 |     0.090 |           | 
## ---------------|-----------|-----------|-----------|
##            Yes |        19 |        10 |        29 | 
##                |     0.190 |     0.100 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        81 |        19 |       100 | 
## ---------------|-----------|-----------|-----------|
## 
## 

This isn’t great:

  • Of the 71 good loans, we predicted 62 were good and 9 were bad
  • Of the 29 bad loans, we predicted 10 were bad but 19 were good (oh dear)

So that’s 72% accurate (62 + 10). Worse than our training, but that’s normal. Not much better than just saying “No” without all this modelling faff (a policy which would be right 68% of the time!). A key failure here is only spotting 10 / 29 bad loans.

Improving the model

C5.0 has a cool thing called adaptive boosting. Much is written on this, and the overall idea is that many trees are biult, and the trees vote on the best class for each example. All ML can benefit from such boosting: it’s basically saying that from a bunch of weak learners you can build a strong team. This has an obvious performance penalty!

Here’s how we do it in C5.0:

credit_boost10_model = C5.0(credit.train[-21], credit.train$default,
                            trials = 20) # this is an UPPER limit and the generally accepted best value
# This displays way too much data:
# summary(credit_boost10_model)
# So here'a copy/paste of the important bit:
#
## Evaluation on training data (900 cases):
## 
## Trial        Decision Tree   
## -----      ----------------  
##    Size      Errors  
## 
##    0     57  125(13.9%)
##    1     46  166(18.4%)
##    2     49  185(20.6%)
##    3     51  161(17.9%)
##    4     48  170(18.9%)
##    5     59  155(17.2%)
##    6     58  165(18.3%)
##    7     66  160(17.8%)
##    8     61  183(20.3%)
##    9     43  174(19.3%)
##   10     67  166(18.4%)
##   11     45  197(21.9%)
##   12     58  172(19.1%)
##   13     62  180(20.0%)
##   14     56  184(20.4%)
##   15     52  177(19.7%)
##   16     47  162(18.0%)
##   17     44  188(20.9%)
##   18     63  174(19.3%)
##   19     44  178(19.8%)
## boost             10( 1.1%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     629          (a): class No
##      10   261    (b): class Yes
## 
## 
##  Attribute usage:
## 
##  100.00% checking_balance
##  100.00% months_loan_duration
##  100.00% credit_history
##  100.00% foreign_worker
##   99.56% purpose
##   99.44% installment_plan
##   99.22% savings_balance
##   99.22% other_debtors
##   99.22% residence_history
##   98.44% personal_status
##   98.11% employment_length
##   97.44% age
##   95.44% amount
##   95.44% job
##   92.44% property
##   91.78% dependents
##   89.67% installment_rate
##   78.56% telephone
##   76.00% housing
##   67.11% existing_credits
## 
## 
## Time: 0.2 secs

This yields an error rate of 1.1% - that’s a huge improvement over 13.9%! But what about on our test data?

credit_boost10_pred = predict(credit_boost10_model, credit.test)
CrossTable(credit.test$default,
           credit_boost10_pred,
           prop.chisq = FALSE,
           prop.c     = FALSE,
           prop.r     = FALSE,
           dnn = c('actual default', 'predicted default'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | predicted default 
## actual default |        No |       Yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             No |        61 |        10 |        71 | 
##                |     0.610 |     0.100 |           | 
## ---------------|-----------|-----------|-----------|
##            Yes |        17 |        12 |        29 | 
##                |     0.170 |     0.120 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        78 |        22 |       100 | 
## ---------------|-----------|-----------|-----------|
## 
## 

Hmm, well, this has improved the costly mistakes (12/29 bad loans identified) but annoyingly reduced the accuracy of good loans , and I’ve no idea why?! This isn’t how it appeared in the book LOL

Either way, the bad loans are way more costly than the good loans, so can we get C5.0 to assign a penalty for different types of errors? Of course we can! We need to setup a cost matrix. Suppose we determine that bad loans are 5 times more costly than a missed opportunity, we’d define the cost matrix as such:

error_cost = matrix(c(0, 5, 1, 0), nrow = 2) # this could be a bug in the book
error_cost
##      [,1] [,2]
## [1,]    0    1
## [2,]    5    0

The above says that there’s no cost for getting results right, but a false negative has a cost of 5 vs a false positive which has a cost of 1. Here we go again:

credit_cost_model = C5.0(credit.train[-21], credit.train$default, trials=20, costs = error_cost)
## Warning in C5.0.default(credit.train[-21], credit.train$default, trials = 20, : 
## no dimnames were given for the cost matrix; the factor levels will be used
credit_cost_pred = predict(credit_cost_model, credit.test)
CrossTable(credit.test$default,
           credit_cost_pred,
           prop.chisq = FALSE,
           prop.c     = FALSE,
           prop.r     = FALSE,
           dnn = c('actual default', 'predicted default'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | predicted default 
## actual default |        No |       Yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             No |        50 |        21 |        71 | 
##                |     0.500 |     0.210 |           | 
## ---------------|-----------|-----------|-----------|
##            Yes |        11 |        18 |        29 | 
##                |     0.110 |     0.180 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        61 |        39 |       100 | 
## ---------------|-----------|-----------|-----------|
## 
## 

So, as expected, we’ve done better now on the bad loans, but lost ground on the good ones. Life is all about tradeoffs, eh…

Rules

The idea

These are very similar to decision trees, however, they are more akin to if-else chains without the nesting that comes with trees. They share many of the pros and cons, even theory, such as information gain. A key benefit is that they are very easy to understand.

We’ll get tuck right in to an example

Example - Poisonous Mushrooms

The data

The mushroom data is available here. It consists of 8124 samples from 23 species classified as either poisonous or edible. The original classifications were definitely edible, definitely poisonous, and “likely poisonous, and not recommended to be eaten” - the 22 features are explained at this link.

EDA and Data prep

Let’s load it up:

mushrooms = read.csv("D://dev/R/mlwr//chap5-decisiontrees/mushrooms.csv", stringsAsFactors=TRUE)
str(mushrooms)
## 'data.frame':    8124 obs. of  23 variables:
##  $ type                    : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap_shape               : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
##  $ cap_surface             : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
##  $ cap_color               : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
##  $ gill_attachment         : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill_spacing            : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill_size               : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill_color              : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
##  $ stalk_shape             : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk_root              : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
##  $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_color_above_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk_color_below_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil_type               : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
##  $ veil_color              : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring_number             : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring_type               : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore_print_color       : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
##  $ population              : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...

So here’s an EDA nugget - see veil_type above? It’s a factor with only 1 level, which is interesting. The data-spec says that there are 2 levels to this feature, so either we don’t have a truly random sample, or there’s an error. Either way, it provides no value, so we should eliminate it:

mushrooms$veil_type = NULL

Quick peek at our type (poisonous/edible) distribution:

round(prop.table(table(mushrooms$type))*100)
## 
##    edible poisonous 
##        52        48

Alright, fairly nicely split.

This time round we’ll be doing things fairly differently: we’ll assume that our dataset to be an exhaustive set of all wild mushrooms. This is central to what our aim here is: we’re only trying to find rules that accurately depict the complete set of known mushroom types - therefore we can build and test the model on the same data. So there’s no training set and test set.

Modelling

Were we to setup a ZeroR classifier, it would simply tell us what the most likely type is - edible (given that’s the mode). Not massively useful. Not useful if you’re starving as this will just end up with you dying 48% of the time. We’ll instead use a 1R implementation from RWeka - one rule is simple, and can often be extremely predictive.

#install.packages("RWeka")
library(RWeka)

# OneR() uses the R formula syntax:
mushroom_1R = OneR(data = mushrooms,
                   type ~ .) # this means that type is a function of ALL features in mushrooms
mushroom_1R
## odor:
##  almond  -> edible
##  anise   -> edible
##  creosote    -> poisonous
##  fishy   -> poisonous
##  foul    -> poisonous
##  musty   -> poisonous
##  none    -> edible
##  pungent -> poisonous
##  spicy   -> poisonous
## (8004/8124 instances correct)

And the above was our 1-rule model! It’s based purely on odor: if it smells pleasant, odds are it’s safe - if not, avoid it.

Evaluating Model Performance

Per the output above, 8004 / 8124 is almost 99%. But where were the errors?

#reminder:
table(mushrooms$type)
## 
##    edible poisonous 
##      4208      3916
summary(mushroom_1R)
## 
## === Summary ===
## 
## Correctly Classified Instances        8004               98.5229 %
## Incorrectly Classified Instances       120                1.4771 %
## Kappa statistic                          0.9704
## Mean absolute error                      0.0148
## Root mean squared error                  0.1215
## Relative absolute error                  2.958  %
## Root relative squared error             24.323  %
## Coverage of cases (0.95 level)          98.5229 %
## Mean rel. region size (0.95 level)      50      %
## Total Number of Instances             8124     
## 
## === Confusion Matrix ===
## 
##     a    b   <-- classified as
##  4208    0 |    a = edible
##   120 3796 |    b = poisonous

This tell us that all 4208 edibles were correctly labelled as edible; of the 3916 poisonous, though, only 3796 were correctly classified, leaving us with 120 labelled as edible; so you’ve a 1.5% chance of ending up in hospital (or worse). Given how simple this is though, the accuracy is remarkable.

Improving the model

There is an improved rule algo called RIPPER (Repeated Incremental Pruning to Produce Error Reduction), we’ll use its java-based JRip() implementation (in RWeka)

mushroom_JRip = JRip(data = mushrooms,
                     type ~ .)
mushroom_JRip
## JRIP rules:
## ===========
## 
## (odor = foul) => type=poisonous (2160.0/0.0)
## (gill_size = narrow) and (gill_color = buff) => type=poisonous (1152.0/0.0)
## (gill_size = narrow) and (odor = pungent) => type=poisonous (256.0/0.0)
## (odor = creosote) => type=poisonous (192.0/0.0)
## (spore_print_color = green) => type=poisonous (72.0/0.0)
## (stalk_surface_below_ring = scaly) and (stalk_surface_above_ring = silky) => type=poisonous (68.0/0.0)
## (habitat = leaves) and (cap_color = white) => type=poisonous (8.0/0.0)
## (stalk_color_above_ring = yellow) => type=poisonous (8.0/0.0)
##  => type=edible (4208.0/0.0)
## 
## Number of Rules : 9

These can just be read as a serious of boolean tests. Fairly impressively, the last line, (else) type=edible (4208) tells us that the rules are 100% accurate. Wow