Predictably, decision tree learners store their model in the form of a decision tree, quite similar to a flowchart.
Data that’s to be classified begins at the root node, where it is passed through the various decisions in the tree according to feature values, until a leaf node is reached - which is when it’s assigned a predicted class.
Decision Trees offer benefits of transparency (useful for when analysis needs to be explainable) and speed. They can be used to model almost any type of data, too. Sadly, they don’t do too well when there are large numbers of nominal or numeric features.
They’re built using divide-and-conquer type algorithms (aka recursive partitioning). Beginning at the root node, the algorithm splits the data on the feature that results in the greatest information gain (or entropy). Iteratively, we can repeat this splitting at each child node until the leaves are pure - i.e. that samples at each node all belong to the same class.
We’ll be using the C5.0 implementation in R. We’ll cover a bit about the splitting and pruning processes.
The first thing the algo has to do is decide which feature to split upon (since we may well have many). As described earlier, this is done so as to increase the information gain. The gain is defined as the difference between a parent’s impurity and the sum of its children’s impurities. There are a number of different impurity measures, entropy being the one favoured in the implementation C5.0. The formula for entropy is given by:
\(Entropy(S) = \sum_{i=1}^{c} -p_i log_2(p_i)\)
where S is a segment of data (i.e. a bunch of records), c is the number of different class levels, and \(p_i\) refers to the proportion of values falling into class level \(i\). For example, if we have a partition of data with two classes: red (60%) and blue (40%), \(E = -0.6log_2(0.6) + -0.4log_2(0.4)\) = 0.97. If it were 50/50 instead of 60/40, we’d have E = 1. If it’s 100/0 we’d have E = 0. We can see what this plot looks like using R’s curve() function!
curve(-x * log2(x) - (1-x) * log2(1-x), col='red', xlab='x', ylab='E', lwd=4, main="Entropy curve for a binary pair")
The higher the information gain, the better a feature is at creating homogeneous groups after a split on that feature.
Leaving the algo to run ad nausea can result in very deep trees, which can easily lead to overfitting, so we typically end up pruning them. We can either pre-prune (i.e. stop it at a certain size early), which has the benefit of doing less work and thus is faster, but this comes at the expense of possibly missing subtle features had it run longer; or we can post-prune using error-rates as a condition, which is more computational work. The latter is the preferred choice, as it’s difficult to know the optimal tree-size ahead of time, making pre-pruning less attractive. C5.0 post-prunes with sensible defaults - in some cases, entire branches are moved further up the tree (subtree-raising) or replaced by simpler decisions (subtree-replacement)
The balancing act of over/under fitting is best done by playing with pruning options, which, luckily for us, C5.0 makes very easy.
As we’ll see, we can use Decision Trees to help banks spot riskier customers and mitigate financial losses.
We’ll use data on a large number (1000) of past bank loans, augmented with:
Crucially the data includes whether the customer defaulted on the loan.
The data can be found here
library(ggplot2)
credit = read.csv("D://dev//R//mlwr/chap5-decisiontrees/credit.csv")
str(credit)
## 'data.frame': 1000 obs. of 21 variables:
## $ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
## $ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_history : Factor w/ 5 levels "critical","delayed",..: 1 5 1 5 2 5 5 5 5 1 ...
## $ purpose : Factor w/ 10 levels "business","car (new)",..: 8 8 5 6 2 5 6 3 8 2 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ savings_balance : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
## $ employment_length : Factor w/ 5 levels "> 7 yrs","0 - 1 yrs",..: 1 3 4 4 3 3 1 3 4 5 ...
## $ installment_rate : int 4 2 2 2 3 2 3 2 2 4 ...
## $ personal_status : Factor w/ 4 levels "divorced male",..: 4 2 4 4 4 4 4 4 1 3 ...
## $ other_debtors : Factor w/ 3 levels "co-applicant",..: 3 3 3 2 3 3 3 3 3 3 ...
## $ residence_history : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property : Factor w/ 4 levels "building society savings",..: 3 3 3 1 4 4 1 2 3 2 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ installment_plan : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ housing : Factor w/ 3 levels "for free","own",..: 2 2 2 1 1 1 2 3 2 2 ...
## $ existing_credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : Factor w/ 4 levels "mangement self-employed",..: 2 2 4 2 2 4 2 1 4 1 ...
## $ dependents : int 1 1 2 2 2 2 1 1 1 1 ...
## $ telephone : Factor w/ 2 levels "none","yes": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign_worker : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ default : int 1 2 1 1 2 1 1 1 1 2 ...
So we’ve got 1000 loans and 21 features. Since we’re primarily interested in default rates, we should see what the split is:
round(prop.table(table(credit$default))*100)
##
## 1 2
## 70 30
# apparently 1 means no and 2 means yes, so we'll refactor
credit$default = factor(credit$default, levels=c(1, 2), labels=c("No", "Yes"))
round(prop.table(table(credit$default))*100)
##
## No Yes
## 70 30
Let’s have a peek at some of the ones which are likely to predict a default.
table(credit$checking_balance)
##
## < 0 DM > 200 DM 1 - 200 DM unknown
## 274 63 269 394
ggplot(data=credit,
aes(x=checking_balance)) +
geom_bar() +
ggtitle("Histogram of checking balances faceted by default") +
facet_wrap(~default)
table(credit$savings_balance)
##
## < 100 DM > 1000 DM 101 - 500 DM 501 - 1000 DM unknown
## 603 48 103 63 183
ggplot(data=credit,
aes(x=savings_balance)) +
geom_bar() +
ggtitle("Histogram of savings balances faceted by default") +
facet_wrap(~default)
Those two were categorical - it would have been nice to have them as pure numbers, but c’est la vie. Let’s look at some of the numerics (duration and amount):
summary(credit$months_loan_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 12.0 18.0 20.9 24.0 72.0
ggplot(data=credit,
aes(x=months_loan_duration)) +
geom_histogram(binwidth=3) +
ggtitle("Histogram of loan durations faceted by default") +
facet_wrap(~default) +
scale_x_continuous(breaks=seq(0, 75, 3), limits=c(0, 75))
## Warning: Removed 4 rows containing missing values (geom_bar).
summary(credit$amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3271 3972 18420
ggplot(data=credit,
aes(x=amount)) +
geom_histogram() +
scale_x_log10() +
ggtitle("Histogram of loan amounts faceted by default") +
facet_wrap(~default)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As before, we’ll split our data into a training set (90%) and a test set (10%). Unlike before, we can’t assume that the data is randomly sorted, so we’ll need to randomise its order. There’s an idiomatic trick in R to randomise a frame’s order, and it involves 2 function calls in unison:
And here’s how it’s done:
set.seed(1235) # make our results deterministic (useful for reproducing identical results over multiple runs)
credit_randomised = credit[order(runif(nrow(credit))), ]
# assert that we've not messed up the data:
summary(credit_randomised$amount) # compare this result against earlier
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3271 3972 18420
head(credit$amount)
## [1] 1169 5951 2096 7882 4870 9055
head(credit_randomised$amount)
## [1] 1092 4110 1582 3965 1282 2767
# same summary, different head() results -> we've not cocked this up :)
So let’s split as we’ve done before:
credit.train = credit_randomised[1:900, ]
credit.test = credit_randomised[901:1000, ]
# if all's good we should still have 30% default rates in both sets:
round(prop.table(table(credit.train$default))*100)
##
## No Yes
## 70 30
round(prop.table(table(credit.test$default))*100)
##
## No Yes
## 71 29
# that'll do
We’ll use C5.0:
#install.packages("C50")
library(C50)
# it's as easy as calling C5.0:
credit_model = C5.0(credit.train[-21], # field 21 is the default, and we don't want that as part of the tree
credit.train$default) # and we tell it what the defaults were (using credit[21] errors)
# We can inspect some high level data in the model:
credit_model
##
## Call:
## C5.0.default(x = credit.train[-21], y = credit.train$default)
##
## Classification Tree
## Number of samples: 900
## Number of predictors: 20
##
## Tree size: 57
##
## Non-standard options: attempt to group attributes
# and some VERY good detail:
summary(credit_model)
##
## Call:
## C5.0.default(x = credit.train[-21], y = credit.train$default)
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Jan 16 04:19:09 2016
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 900 cases (21 attributes) from undefined.data
##
## Decision tree:
##
## checking_balance in {> 200 DM,unknown}: No (407/50)
## checking_balance in {< 0 DM,1 - 200 DM}:
## :...other_debtors = guarantor: No (39/7)
## other_debtors in {co-applicant,none}:
## :...credit_history in {fully repaid,fully repaid this bank}:
## :...housing = rent: Yes (14)
## : housing in {for free,own}:
## : :...other_debtors = co-applicant: No (2)
## : other_debtors = none:
## : :...housing = for free: Yes (13/1)
## : housing = own:
## : :...savings_balance = < 100 DM: Yes (18/6)
## : savings_balance in {> 1000 DM,
## : : 501 - 1000 DM}: No (3)
## : savings_balance = 101 - 500 DM:
## : :...months_loan_duration <= 18: Yes (3)
## : : months_loan_duration > 18: No (2)
## : savings_balance = unknown:
## : :...installment_plan in {bank,stores}: No (3)
## : installment_plan = none: Yes (1)
## credit_history in {critical,delayed,repaid}:
## :...months_loan_duration > 27:
## :...residence_history <= 1:
## : :...housing = for free: No (1)
## : : housing = rent: Yes (2)
## : : housing = own:
## : : :...checking_balance = < 0 DM: Yes (3/1)
## : : checking_balance = 1 - 200 DM: No (6)
## : residence_history > 1:
## : :...job = unemployed non-resident: No (2)
## : job = unskilled resident: Yes (5)
## : job in {mangement self-employed,skilled employee}:
## : :...installment_plan in {bank,stores}: Yes (11/1)
## : installment_plan = none:
## : :...credit_history = delayed:
## : :...existing_credits <= 1: No (7/1)
## : : existing_credits > 1: Yes (2)
## : credit_history = critical:
## : :...telephone = none: [S1]
## : : telephone = yes: [S2]
## : credit_history = repaid:
## : :...savings_balance = > 1000 DM: No (1)
## : savings_balance in {101 - 500 DM,
## : : 501 - 1000 DM}: Yes (6)
## : savings_balance = unknown:
## : :...age <= 28: No (3)
## : : age > 28: Yes (5)
## : savings_balance = < 100 DM:
## : :...job = skilled employee: Yes (20/3)
## : job = mangement self-employed: [S3]
## months_loan_duration <= 27:
## :...credit_history = critical:
## :...other_debtors = co-applicant: Yes (4/1)
## : other_debtors = none: No (76/16)
## credit_history = delayed:
## :...checking_balance = 1 - 200 DM: No (19/3)
## : checking_balance = < 0 DM:
## : :...months_loan_duration <= 18: No (2)
## : months_loan_duration > 18: Yes (5)
## credit_history = repaid:
## :...savings_balance = > 1000 DM: No (9)
## savings_balance in {< 100 DM,101 - 500 DM,501 - 1000 DM,
## : unknown}:
## :...months_loan_duration <= 11:
## :...amount <= 10875: No (35/7)
## : amount > 10875: Yes (2)
## months_loan_duration > 11:
## :...amount <= 1386: Yes (50/14)
## amount > 1386:
## :...existing_credits > 1: [S4]
## existing_credits <= 1:
## :...employment_length = 4 - 7 yrs: No (10)
## employment_length in {> 7 yrs,0 - 1 yrs,
## : 1 - 4 yrs,unemployed}:
## :...telephone = yes: [S5]
## telephone = none:
## :...dependents > 1: [S6]
## dependents <= 1: [S7]
##
## SubTree [S1]
##
## purpose in {business,car (new),domestic appliances,education,furniture,others,
## : radio/tv,repairs,retraining}: Yes (5)
## purpose = car (used): No (2)
##
## SubTree [S2]
##
## personal_status = female: Yes (2)
## personal_status in {divorced male,married male}: No (1)
## personal_status = single male:
## :...installment_rate <= 3: No (2)
## installment_rate > 3: Yes (3/1)
##
## SubTree [S3]
##
## checking_balance = < 0 DM: No (3)
## checking_balance = 1 - 200 DM: Yes (4/1)
##
## SubTree [S4]
##
## personal_status in {divorced male,female}: Yes (6)
## personal_status in {married male,single male}: No (4)
##
## SubTree [S5]
##
## other_debtors = co-applicant: No (1)
## other_debtors = none:
## :...savings_balance = < 100 DM: Yes (16/5)
## savings_balance in {101 - 500 DM,501 - 1000 DM}: No (3/1)
## savings_balance = unknown:
## :...months_loan_duration <= 18: Yes (4)
## months_loan_duration > 18: No (5)
##
## SubTree [S6]
##
## other_debtors = co-applicant: No (1)
## other_debtors = none: Yes (3)
##
## SubTree [S7]
##
## installment_plan in {bank,stores}: No (6/1)
## installment_plan = none:
## :...months_loan_duration <= 16: No (16/2)
## months_loan_duration > 16:
## :...residence_history > 3: No (10/2)
## residence_history <= 3:
## :...job in {mangement self-employed,unskilled resident}: No (3)
## job in {skilled employee,unemployed non-resident}: Yes (9/1)
##
##
## Evaluation on training data (900 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 57 125(13.9%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 594 35 (a): class No
## 90 181 (b): class Yes
##
##
## Attribute usage:
##
## 100.00% checking_balance
## 54.78% other_debtors
## 50.44% credit_history
## 44.44% months_loan_duration
## 29.44% savings_balance
## 20.44% amount
## 13.89% installment_plan
## 13.11% residence_history
## 11.78% existing_credits
## 10.67% job
## 10.22% telephone
## 9.67% employment_length
## 7.89% housing
## 5.33% dependents
## 2.00% personal_status
## 0.89% age
## 0.78% purpose
## 0.56% installment_rate
##
##
## Time: 0.0 secs
We can see amid the summary data that our model gives 125 errors out of 900 - an accuracy of 86.1%. Note that there is a tendency for trees to overfit data.
As ever, we’ll use predict()
credit_pred = predict(credit_model, credit.test)
#install.packages("gmodels")
library(gmodels)
CrossTable(credit.test$default,
credit_pred,
prop.chisq = FALSE,
prop.c = FALSE,
prop.r = FALSE,
dnn = c('actual default', 'predicted default'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted default
## actual default | No | Yes | Row Total |
## ---------------|-----------|-----------|-----------|
## No | 62 | 9 | 71 |
## | 0.620 | 0.090 | |
## ---------------|-----------|-----------|-----------|
## Yes | 19 | 10 | 29 |
## | 0.190 | 0.100 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 81 | 19 | 100 |
## ---------------|-----------|-----------|-----------|
##
##
This isn’t great:
So that’s 72% accurate (62 + 10). Worse than our training, but that’s normal. Not much better than just saying “No” without all this modelling faff (a policy which would be right 68% of the time!). A key failure here is only spotting 10 / 29 bad loans.
C5.0 has a cool thing called adaptive boosting. Much is written on this, and the overall idea is that many trees are biult, and the trees vote on the best class for each example. All ML can benefit from such boosting: it’s basically saying that from a bunch of weak learners you can build a strong team. This has an obvious performance penalty!
Here’s how we do it in C5.0:
credit_boost10_model = C5.0(credit.train[-21], credit.train$default,
trials = 20) # this is an UPPER limit and the generally accepted best value
# This displays way too much data:
# summary(credit_boost10_model)
# So here'a copy/paste of the important bit:
#
## Evaluation on training data (900 cases):
##
## Trial Decision Tree
## ----- ----------------
## Size Errors
##
## 0 57 125(13.9%)
## 1 46 166(18.4%)
## 2 49 185(20.6%)
## 3 51 161(17.9%)
## 4 48 170(18.9%)
## 5 59 155(17.2%)
## 6 58 165(18.3%)
## 7 66 160(17.8%)
## 8 61 183(20.3%)
## 9 43 174(19.3%)
## 10 67 166(18.4%)
## 11 45 197(21.9%)
## 12 58 172(19.1%)
## 13 62 180(20.0%)
## 14 56 184(20.4%)
## 15 52 177(19.7%)
## 16 47 162(18.0%)
## 17 44 188(20.9%)
## 18 63 174(19.3%)
## 19 44 178(19.8%)
## boost 10( 1.1%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 629 (a): class No
## 10 261 (b): class Yes
##
##
## Attribute usage:
##
## 100.00% checking_balance
## 100.00% months_loan_duration
## 100.00% credit_history
## 100.00% foreign_worker
## 99.56% purpose
## 99.44% installment_plan
## 99.22% savings_balance
## 99.22% other_debtors
## 99.22% residence_history
## 98.44% personal_status
## 98.11% employment_length
## 97.44% age
## 95.44% amount
## 95.44% job
## 92.44% property
## 91.78% dependents
## 89.67% installment_rate
## 78.56% telephone
## 76.00% housing
## 67.11% existing_credits
##
##
## Time: 0.2 secs
This yields an error rate of 1.1% - that’s a huge improvement over 13.9%! But what about on our test data?
credit_boost10_pred = predict(credit_boost10_model, credit.test)
CrossTable(credit.test$default,
credit_boost10_pred,
prop.chisq = FALSE,
prop.c = FALSE,
prop.r = FALSE,
dnn = c('actual default', 'predicted default'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted default
## actual default | No | Yes | Row Total |
## ---------------|-----------|-----------|-----------|
## No | 61 | 10 | 71 |
## | 0.610 | 0.100 | |
## ---------------|-----------|-----------|-----------|
## Yes | 17 | 12 | 29 |
## | 0.170 | 0.120 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 78 | 22 | 100 |
## ---------------|-----------|-----------|-----------|
##
##
Hmm, well, this has improved the costly mistakes (12/29 bad loans identified) but annoyingly reduced the accuracy of good loans , and I’ve no idea why?! This isn’t how it appeared in the book LOL
Either way, the bad loans are way more costly than the good loans, so can we get C5.0 to assign a penalty for different types of errors? Of course we can! We need to setup a cost matrix. Suppose we determine that bad loans are 5 times more costly than a missed opportunity, we’d define the cost matrix as such:
error_cost = matrix(c(0, 5, 1, 0), nrow = 2) # this could be a bug in the book
error_cost
## [,1] [,2]
## [1,] 0 1
## [2,] 5 0
The above says that there’s no cost for getting results right, but a false negative has a cost of 5 vs a false positive which has a cost of 1. Here we go again:
credit_cost_model = C5.0(credit.train[-21], credit.train$default, trials=20, costs = error_cost)
## Warning in C5.0.default(credit.train[-21], credit.train$default, trials = 20, :
## no dimnames were given for the cost matrix; the factor levels will be used
credit_cost_pred = predict(credit_cost_model, credit.test)
CrossTable(credit.test$default,
credit_cost_pred,
prop.chisq = FALSE,
prop.c = FALSE,
prop.r = FALSE,
dnn = c('actual default', 'predicted default'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted default
## actual default | No | Yes | Row Total |
## ---------------|-----------|-----------|-----------|
## No | 50 | 21 | 71 |
## | 0.500 | 0.210 | |
## ---------------|-----------|-----------|-----------|
## Yes | 11 | 18 | 29 |
## | 0.110 | 0.180 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 61 | 39 | 100 |
## ---------------|-----------|-----------|-----------|
##
##
So, as expected, we’ve done better now on the bad loans, but lost ground on the good ones. Life is all about tradeoffs, eh…
These are very similar to decision trees, however, they are more akin to if-else chains without the nesting that comes with trees. They share many of the pros and cons, even theory, such as information gain. A key benefit is that they are very easy to understand.
We’ll get tuck right in to an example
The mushroom data is available here. It consists of 8124 samples from 23 species classified as either poisonous or edible. The original classifications were definitely edible, definitely poisonous, and “likely poisonous, and not recommended to be eaten” - the 22 features are explained at this link.
Let’s load it up:
mushrooms = read.csv("D://dev/R/mlwr//chap5-decisiontrees/mushrooms.csv", stringsAsFactors=TRUE)
str(mushrooms)
## 'data.frame': 8124 obs. of 23 variables:
## $ type : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap_shape : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
## $ cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
## $ cap_color : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
## $ bruises : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
## $ gill_attachment : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill_size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill_color : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
## $ stalk_shape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk_root : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
## $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ stalk_color_above_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ stalk_color_below_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ veil_type : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
## $ veil_color : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring_type : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ spore_print_color : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
## $ population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
So here’s an EDA nugget - see veil_type above? It’s a factor with only 1 level, which is interesting. The data-spec says that there are 2 levels to this feature, so either we don’t have a truly random sample, or there’s an error. Either way, it provides no value, so we should eliminate it:
mushrooms$veil_type = NULL
Quick peek at our type (poisonous/edible) distribution:
round(prop.table(table(mushrooms$type))*100)
##
## edible poisonous
## 52 48
Alright, fairly nicely split.
This time round we’ll be doing things fairly differently: we’ll assume that our dataset to be an exhaustive set of all wild mushrooms. This is central to what our aim here is: we’re only trying to find rules that accurately depict the complete set of known mushroom types - therefore we can build and test the model on the same data. So there’s no training set and test set.
Were we to setup a ZeroR classifier, it would simply tell us what the most likely type is - edible (given that’s the mode). Not massively useful. Not useful if you’re starving as this will just end up with you dying 48% of the time. We’ll instead use a 1R implementation from RWeka - one rule is simple, and can often be extremely predictive.
#install.packages("RWeka")
library(RWeka)
# OneR() uses the R formula syntax:
mushroom_1R = OneR(data = mushrooms,
type ~ .) # this means that type is a function of ALL features in mushrooms
mushroom_1R
## odor:
## almond -> edible
## anise -> edible
## creosote -> poisonous
## fishy -> poisonous
## foul -> poisonous
## musty -> poisonous
## none -> edible
## pungent -> poisonous
## spicy -> poisonous
## (8004/8124 instances correct)
And the above was our 1-rule model! It’s based purely on odor: if it smells pleasant, odds are it’s safe - if not, avoid it.
Per the output above, 8004 / 8124 is almost 99%. But where were the errors?
#reminder:
table(mushrooms$type)
##
## edible poisonous
## 4208 3916
summary(mushroom_1R)
##
## === Summary ===
##
## Correctly Classified Instances 8004 98.5229 %
## Incorrectly Classified Instances 120 1.4771 %
## Kappa statistic 0.9704
## Mean absolute error 0.0148
## Root mean squared error 0.1215
## Relative absolute error 2.958 %
## Root relative squared error 24.323 %
## Coverage of cases (0.95 level) 98.5229 %
## Mean rel. region size (0.95 level) 50 %
## Total Number of Instances 8124
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 4208 0 | a = edible
## 120 3796 | b = poisonous
This tell us that all 4208 edibles were correctly labelled as edible; of the 3916 poisonous, though, only 3796 were correctly classified, leaving us with 120 labelled as edible; so you’ve a 1.5% chance of ending up in hospital (or worse). Given how simple this is though, the accuracy is remarkable.
There is an improved rule algo called RIPPER (Repeated Incremental Pruning to Produce Error Reduction), we’ll use its java-based JRip() implementation (in RWeka)
mushroom_JRip = JRip(data = mushrooms,
type ~ .)
mushroom_JRip
## JRIP rules:
## ===========
##
## (odor = foul) => type=poisonous (2160.0/0.0)
## (gill_size = narrow) and (gill_color = buff) => type=poisonous (1152.0/0.0)
## (gill_size = narrow) and (odor = pungent) => type=poisonous (256.0/0.0)
## (odor = creosote) => type=poisonous (192.0/0.0)
## (spore_print_color = green) => type=poisonous (72.0/0.0)
## (stalk_surface_below_ring = scaly) and (stalk_surface_above_ring = silky) => type=poisonous (68.0/0.0)
## (habitat = leaves) and (cap_color = white) => type=poisonous (8.0/0.0)
## (stalk_color_above_ring = yellow) => type=poisonous (8.0/0.0)
## => type=edible (4208.0/0.0)
##
## Number of Rules : 9
These can just be read as a serious of boolean tests. Fairly impressively, the last line, (else) type=edible (4208) tells us that the rules are 100% accurate. Wow