Introduction

This project follows Chapter 5 of the book Machine Learning with R by Brett Lantz.

A link to the book: https://bit.ly/3gsf2e0

This project is for educational purposes only.

The aim is to identify poisonous mushrooms using two rule learners: the 1R algorithm (OneR) and the RIPPER algorithm (JRip).

Required packages

The OneR and RWeka packages are required.

library(OneR)
# RWeka is loaded in a later chunk, after the OneR() call, because it masks OneR::OneR().

Step 1 - collecting data

The data is from the UCI Machine Learning Repository. The dataset has been modified slightly.

The dataset includes information on 8,124 mushroom samples from 23 species of gilled mushrooms. For the purposes of this project, each mushroom belongs to one of two classes: "definitely edible" or "definitely poisonous".

Step 2 - exploring and preparing the data

# Read the CSV file; stringsAsFactors = TRUE converts the character columns to factors
mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)

#Explore structure of the dataset
str(mushrooms)
## 'data.frame':    8124 obs. of  23 variables:
##  $ type                    : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap_shape               : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
##  $ cap_surface             : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
##  $ cap_color               : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
##  $ gill_attachment         : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill_spacing            : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill_size               : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill_color              : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
##  $ stalk_shape             : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk_root              : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
##  $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_color_above_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk_color_below_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil_type               : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
##  $ veil_color              : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring_number             : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring_type               : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore_print_color       : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
##  $ population              : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...

Since veil_type has a single level ("partial") and so doesn’t vary across samples, it provides no information for prediction, and we will drop this variable.

# Eliminate the feature from the data frame
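Constant columns such as veil_type can also be found programmatically rather than by reading the str() output. The following is a minimal sketch on a small made-up data frame (the `toy` values are hypothetical and are not taken from mushrooms.csv); it flags every column with fewer than two distinct values.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  type      = factor(c("edible", "poisonous", "edible")),
  veil_type = factor(c("partial", "partial", "partial")),
  odor      = factor(c("none", "foul", "almond"))
)

# A column with fewer than two distinct values cannot help separate the classes
is_constant <- sapply(toy, function(col) length(unique(col)) < 2)
constant_cols <- names(toy)[is_constant]
print(constant_cols)

# Drop the constant columns
toy_reduced <- toy[, !is_constant, drop = FALSE]
```

On the real data, the same check applied to `mushrooms` should flag only veil_type, given the structure shown above.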
mushrooms$veil_type <- NULL
table(mushrooms$type)
## 
##    edible poisonous 
##      4208      3916

About 52% of the samples are edible and 48% are poisonous, so the classes are roughly balanced. For this project we will treat the dataset as an exhaustive set of all possible wild mushrooms, so we don’t need to hold some samples out of the training data for testing purposes. We are not trying to predict unseen mushroom types; we are trying to find rules that accurately describe the complete set of known ones.
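The class shares quoted above follow directly from the table() counts. As a quick check, this sketch re-enters the two counts by hand (so it runs without mushrooms.csv) and converts them to percentages with prop.table().

```r
# Counts copied from the table(mushrooms$type) output
class_counts <- c(edible = 4208, poisonous = 3916)

# Share of each class, in percent, rounded to one decimal place
class_share <- round(100 * prop.table(class_counts), 1)
print(class_share)
```

With the data frame loaded, `round(100 * prop.table(table(mushrooms$type)), 1)` should give the same figures.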

Step 3 - training a model on the data

mushroom_1R <- OneR(type ~ . , data = mushrooms)

#To examine the rules

mushroom_1R
## 
## Call:
## OneR.formula(formula = type ~ ., data = mushrooms)
## 
## Rules:
## If odor = almond   then type = edible
## If odor = anise    then type = edible
## If odor = creosote then type = poisonous
## If odor = fishy    then type = poisonous
## If odor = foul     then type = poisonous
## If odor = musty    then type = poisonous
## If odor = none     then type = edible
## If odor = pungent  then type = poisonous
## If odor = spicy    then type = poisonous
## 
## Accuracy:
## 8004 of 8124 instances classified correctly (98.52%)
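The output shows that 1R settled on odor as its single predictor. The idea behind 1R is simple: for each feature, predict the majority class within each feature level, count the resulting errors, and keep the feature with the fewest errors. The sketch below illustrates that idea in base R on a small made-up data frame; the `toy` data and the helper `one_rule_errors()` are hypothetical, and this is not the OneR package's actual implementation (which also handles ties, numeric features, and missing values).

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  type  = factor(c("edible", "edible", "poisonous", "poisonous", "edible", "poisonous")),
  odor  = factor(c("none", "almond", "foul", "foul", "none", "none")),
  color = factor(c("brown", "white", "brown", "white", "white", "brown"))
)

# For one feature: predict the majority class within each level,
# then count how many rows that set of rules gets wrong
one_rule_errors <- function(feature, target) {
  counts <- table(feature, target)
  sum(counts) - sum(apply(counts, 1, max))
}

features <- setdiff(names(toy), "type")
errors <- sapply(features, function(f) one_rule_errors(toy[[f]], toy$type))

# The feature with the fewest errors becomes the single rule set
best <- names(which.min(errors))
print(errors)
print(best)
```

In this toy example odor makes one error and color makes two, so odor is chosen, mirroring what OneR() selected on the full dataset.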

Step 4 - evaluating model performance

library(RWeka)
## 
## Attaching package: 'RWeka'
## The following object is masked from 'package:OneR':
## 
##     OneR
mushroom_1R_pred <- predict(mushroom_1R, mushrooms)

# Examine the confusion matrix for the model
table(actual = mushrooms$type, predicted = mushroom_1R_pred)
##            predicted
## actual      edible poisonous
##   edible      4208         0
##   poisonous    120      3796

The model never classified an edible mushroom as poisonous, but it classified 120 poisonous mushrooms as edible, which is a dangerous mistake for this task.
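To put the 120 errors in context, the sketch below rebuilds the confusion matrix from the printed counts (entered by hand, so it runs without the model objects) and derives the overall accuracy and the share of poisonous mushrooms that were missed.

```r
# Confusion matrix copied from the table() output above
# (matrix() fills column by column: predicted edible first, then predicted poisonous)
cm <- matrix(
  c(4208, 120, 0, 3796),
  nrow = 2,
  dimnames = list(actual = c("edible", "poisonous"),
                  predicted = c("edible", "poisonous"))
)

# Overall accuracy: correct predictions on the diagonal over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Share of truly poisonous mushrooms that the model labelled edible
missed_poisonous <- cm["poisonous", "edible"] / sum(cm["poisonous", ])

print(round(100 * accuracy, 2))
print(round(100 * missed_poisonous, 2))
```

The accuracy matches the 98.52% reported by OneR, while roughly 3% of the poisonous samples slip through as edible, which is the figure that matters most here.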

Step 5 - improving model performance

We will use JRip(), the RWeka interface to a more sophisticated rule learner based on the RIPPER algorithm.

mushroom_JRip <- JRip(type ~ . , data = mushrooms)

mushroom_JRip
## JRIP rules:
## ===========
## 
## (odor = foul) => type=poisonous (2160.0/0.0)
## (gill_size = narrow) and (gill_color = buff) => type=poisonous (1152.0/0.0)
## (gill_size = narrow) and (odor = pungent) => type=poisonous (256.0/0.0)
## (odor = creosote) => type=poisonous (192.0/0.0)
## (spore_print_color = green) => type=poisonous (72.0/0.0)
## (stalk_surface_below_ring = scaly) and (stalk_surface_above_ring = silky) => type=poisonous (68.0/0.0)
## (habitat = leaves) and (cap_color = white) => type=poisonous (8.0/0.0)
## (stalk_color_above_ring = yellow) => type=poisonous (8.0/0.0)
##  => type=edible (4208.0/0.0)
## 
## Number of Rules : 9
mushroom_JRip_pred <- predict(mushroom_JRip, mushrooms)

# Examine the confusion matrix for the model
table(actual = mushrooms$type, predicted = mushroom_JRip_pred)
##            predicted
## actual      edible poisonous
##   edible      4208         0
##   poisonous      0      3916
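In the JRip output, each rule ends with a pair such as (2160.0/0.0): the first number is how many instances the rule covers and the second is how many of those it misclassifies. As a sanity check, this sketch re-enters the coverage counts by hand and confirms that the eight poisonous rules plus the default edible rule account for every sample, which agrees with the error-free confusion matrix.

```r
# Coverage counts copied from the JRip rules, in the order they are listed
poisonous_rule_cover <- c(2160, 1152, 256, 192, 72, 68, 8, 8)

# The final default rule (=> type=edible) covers everything left over
default_rule_cover <- 4208

total_poisonous <- sum(poisonous_rule_cover)
total_samples <- total_poisonous + default_rule_cover

print(total_poisonous)  # 3916, the number of poisonous samples
print(total_samples)    # 8124, the size of the dataset
```

Because the rules are applied in order and the first matching rule decides the class, a mushroom is labelled edible only if none of the eight poisonous rules matches it.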

Summary

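We worked with two different rule learners to perform binary classification. The 1R model used odor alone and classified 98.52% of the samples correctly, but it labelled 120 poisonous mushrooms as edible. The RIPPER model built with JRip() learned nine rules and classified all 8,124 samples correctly.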