This project is from the book Machine Learning with R by Brett Lantz, Chapter 5.
A link to the book: https://bit.ly/3gsf2e0
This project is for educational purposes only.
The aim is to identify poisonous mushrooms using rule learners (1R and RIPPER).
The OneR and RWeka packages are required.
library(OneR)
# RWeka is loaded in a later chunk, after the call to OneR(), because RWeka masks the OneR() function from the OneR package.
The data is from the UCI Machine Learning Repository; the dataset has been modified slightly.
The dataset includes information on 8,124 mushroom samples from 23 species of gilled mushrooms. For the purpose of the project, each mushroom belongs to one of two classes: "definitely edible" or "definitely poisonous".
# Read the CSV file; stringsAsFactors = TRUE converts the character columns to factors
mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)
# Explore the structure of the dataset
str(mushrooms)
## 'data.frame': 8124 obs. of 23 variables:
## $ type : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap_shape : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
## $ cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
## $ cap_color : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
## $ bruises : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
## $ gill_attachment : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill_size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill_color : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
## $ stalk_shape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk_root : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
## $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ stalk_color_above_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ stalk_color_below_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ veil_type : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
## $ veil_color : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring_type : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ spore_print_color : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
## $ population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
Since veil_type has a single level ("partial") and doesn’t vary across samples, it carries no information for prediction, so we will drop this variable.
# Eliminate the feature from the data frame
mushrooms$veil_type <- NULL
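A constant column such as veil_type can also be found programmatically instead of by reading the str() output. The sketch below is only an illustration: it uses a small made-up data frame named toy (not the mushroom data) and assumes every column is a factor.

```r
# Toy data frame standing in for the real data (values are made up)
toy <- data.frame(
  type      = factor(c("edible", "poisonous", "edible")),
  veil_type = factor(c("partial", "partial", "partial")),
  odor      = factor(c("none", "foul", "almond"))
)

# Count the levels of each factor column
n_levels <- sapply(toy, nlevels)

# Names of the columns that never vary
constant_cols <- names(n_levels)[n_levels == 1]
print(constant_cols)

# Keep only the columns that have more than one level
toy_clean <- toy[, n_levels > 1, drop = FALSE]
print(names(toy_clean))
```

The same two lines (sapply() followed by the logical subset) should carry over to the mushrooms data frame, since all of its columns were read as factors.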
table(mushrooms$type)
##
## edible poisonous
## 4208 3916
About 52% of the samples are edible and 48% are poisonous, so the classes are roughly balanced. For this project we will consider the dataset to be an exhaustive set of all the possible wild mushrooms, so we don’t need to hold some samples out of the training data for testing purposes. We are not trying to generalize to unseen mushrooms; we are trying to find the rules that accurately depict the complete set of known mushroom types.
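The class split can be checked from the counts printed by table(). This is a minimal sketch that copies the two counts by hand rather than reading the CSV file, so it runs on its own.

```r
# Class counts copied from the table() output above
counts <- c(edible = 4208, poisonous = 3916)

# Convert the counts to percentages of all 8,124 samples
pct <- round(100 * prop.table(counts), 1)
print(pct)
```

With the data loaded, prop.table(table(mushrooms$type)) would give the same proportions directly.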
mushroom_1R <- OneR(type ~ . , data = mushrooms)
#To examine the rules
mushroom_1R
##
## Call:
## OneR.formula(formula = type ~ ., data = mushrooms)
##
## Rules:
## If odor = almond then type = edible
## If odor = anise then type = edible
## If odor = creosote then type = poisonous
## If odor = fishy then type = poisonous
## If odor = foul then type = poisonous
## If odor = musty then type = poisonous
## If odor = none then type = edible
## If odor = pungent then type = poisonous
## If odor = spicy then type = poisonous
##
## Accuracy:
## 8004 of 8124 instances classified correctly (98.52%)
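For intuition, the idea behind 1R can be sketched in a few lines of base R: for each feature, predict the majority class within each level, count the resulting errors, and keep the feature with the fewest errors. This is a simplified illustration on a made-up toy data frame, not the OneR package's actual implementation, and the helper name one_r_errors is hypothetical.

```r
# Toy data (made up for illustration, not the mushroom dataset)
toy <- data.frame(
  type  = c("edible", "edible", "poisonous", "poisonous", "edible", "poisonous"),
  odor  = c("none", "almond", "foul", "foul", "none", "none"),
  color = c("brown", "white", "brown", "white", "white", "brown"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: number of training errors made when predicting
# the majority class within each level of a single feature
one_r_errors <- function(feature, target) {
  counts <- table(feature, target)
  sum(counts) - sum(apply(counts, 1, max))
}

# Score every feature except the target
features <- setdiff(names(toy), "type")
errors <- sapply(features, function(f) one_r_errors(toy[[f]], toy$type))
print(errors)

# The feature with the fewest errors supplies the rules
best <- names(which.min(errors))
print(best)
```

In the mushroom data, odor plays this role: it is the single feature whose per-level majority vote makes the fewest mistakes (120 of 8,124).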
library(RWeka)
##
## Attaching package: 'RWeka'
## The following object is masked from 'package:OneR':
##
## OneR
mushroom_1R_pred <- predict(mushroom_1R, mushrooms)
# Examine the confusion matrix for the model
table(actual = mushrooms$type, predicted = mushroom_1R_pred)
## predicted
## actual edible poisonous
## edible 4208 0
## poisonous 120 3796
We notice that the model never classified an edible mushroom as poisonous, but it classified 120 poisonous mushrooms as edible, which is a dangerous mistake!
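The size of this error can be put in numbers. The sketch below rebuilds the confusion matrix by hand from the printed values (so it runs without the data or the model) and computes the overall accuracy and the share of poisonous mushrooms that were missed; the variable names are chosen here for illustration.

```r
# Confusion matrix values copied from the output above
cm <- matrix(c(4208,    0,
                120, 3796),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("edible", "poisonous"),
                             predicted = c("edible", "poisonous")))

# Overall accuracy: correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Share of poisonous mushrooms wrongly predicted as edible
missed_poisonous <- cm["poisonous", "edible"] / sum(cm["poisonous", ])

print(round(100 * accuracy, 2))
print(round(100 * missed_poisonous, 2))
```

The accuracy matches the 98.52% reported by OneR(), while about 3% of the poisonous mushrooms slip through as edible.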
To improve on this, we will use JRip(), a more sophisticated rule learner from the RWeka package that implements the RIPPER algorithm.
mushroom_JRip <- JRip(type ~ . , data = mushrooms)
mushroom_JRip
## JRIP rules:
## ===========
##
## (odor = foul) => type=poisonous (2160.0/0.0)
## (gill_size = narrow) and (gill_color = buff) => type=poisonous (1152.0/0.0)
## (gill_size = narrow) and (odor = pungent) => type=poisonous (256.0/0.0)
## (odor = creosote) => type=poisonous (192.0/0.0)
## (spore_print_color = green) => type=poisonous (72.0/0.0)
## (stalk_surface_below_ring = scaly) and (stalk_surface_above_ring = silky) => type=poisonous (68.0/0.0)
## (habitat = leaves) and (cap_color = white) => type=poisonous (8.0/0.0)
## (stalk_color_above_ring = yellow) => type=poisonous (8.0/0.0)
## => type=edible (4208.0/0.0)
##
## Number of Rules : 9
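The rules are read top to bottom like an if/else chain, with the last rule acting as the default. Assuming the pair of numbers after each rule is read as (instances covered / instances misclassified), the coverage counts can be checked against the class totals. The sketch copies the numbers by hand from the output.

```r
# Coverage counts (first number in each parenthesis) of the eight
# poisonous rules, copied from the JRip output above
poisonous_covered <- c(2160, 1152, 256, 192, 72, 68, 8, 8)

# Coverage of the default rule, which predicts edible
edible_covered <- 4208

# The poisonous rules together should cover every poisonous sample
print(sum(poisonous_covered))

# All rules together should cover the whole dataset
print(sum(poisonous_covered) + edible_covered)
```

The eight poisonous rules cover 3,916 samples, which equals the number of poisonous mushrooms, and the default rule covers the remaining 4,208 edible ones.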
mushroom_JRip_pred <- predict(mushroom_JRip, mushrooms)
# Examine the confusion matrix for the model
table(actual = mushrooms$type, predicted = mushroom_JRip_pred)
## predicted
## actual edible poisonous
## edible 4208 0
## poisonous 0 3916
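We worked with two different rule learners to perform binary classification. The 1R model reached 98.52% accuracy using odor alone but classified 120 poisonous mushrooms as edible, while the nine rules learned by RIPPER classified all 8,124 samples correctly.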