library(rpart,quietly = TRUE)
library(caret,quietly = TRUE)
library(rpart.plot,quietly = TRUE)
A dataset containing a total of 8124 observations of different
mushrooms was used to build a classification tree using the
rpart package. The goal is to predict whether a given
mushroom is edible or poisonous to humans. Each observation has 23
features including the mushroom class.
The original dataset can be obtained from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Mushroom.
The dataset used to perform this analysis was obtained from the Kaggle site, and it is available here:
https://www.kaggle.com/uciml/mushroom-classification
A loss matrix is used to penalize misclassifications: classifying a poisonous mushroom as edible costs 10 times more than classifying an edible mushroom as poisonous.
The final model achieves 100% accuracy on the test set, with a 95% confidence interval of (0.9977, 1).
# reading the dataset as a data frame
mushrooms <- read.csv("mushrooms.csv")
# structure of the data
str(mushrooms)
## 'data.frame': 8124 obs. of 23 variables:
## $ class : chr "p" "e" "e" "p" ...
## $ cap.shape : chr "x" "x" "b" "x" ...
## $ cap.surface : chr "s" "s" "s" "y" ...
## $ cap.color : chr "n" "y" "w" "w" ...
## $ bruises : chr "t" "t" "t" "t" ...
## $ odor : chr "p" "a" "l" "p" ...
## $ gill.attachment : chr "f" "f" "f" "f" ...
## $ gill.spacing : chr "c" "c" "c" "c" ...
## $ gill.size : chr "n" "b" "b" "n" ...
## $ gill.color : chr "k" "k" "n" "n" ...
## $ stalk.shape : chr "e" "e" "e" "e" ...
## $ stalk.root : chr "e" "c" "c" "e" ...
## $ stalk.surface.above.ring: chr "s" "s" "s" "s" ...
## $ stalk.surface.below.ring: chr "s" "s" "s" "s" ...
## $ stalk.color.above.ring : chr "w" "w" "w" "w" ...
## $ stalk.color.below.ring : chr "w" "w" "w" "w" ...
## $ veil.type : chr "p" "p" "p" "p" ...
## $ veil.color : chr "w" "w" "w" "w" ...
## $ ring.number : chr "o" "o" "o" "o" ...
## $ ring.type : chr "p" "p" "p" "p" ...
## $ spore.print.color : chr "k" "n" "n" "k" ...
## $ population : chr "s" "n" "n" "s" ...
## $ habitat : chr "u" "g" "m" "u" ...
# number of rows with missing values
nrow(mushrooms) - sum(complete.cases(mushrooms))
## [1] 0
There are no rows with missing values in the dataset. However, the
variable veil.type takes only one value ("p") and therefore
cannot help separate the classes, so we remove it from the analysis.
# deleting useless variable `veil.type`
mushrooms$veil.type <- NULL
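The same check can be done for all columns at once instead of spotting `veil.type` by eye. The following is a minimal sketch on a hypothetical toy data frame (the values and the names `toy`, `n_levels`, `constant_cols` and `toy_clean` are illustrative, not part of the original analysis):

```r
# Toy data frame (hypothetical values) standing in for the mushroom data
toy <- data.frame(
  class     = c("p", "e", "e", "p"),
  odor      = c("p", "a", "l", "p"),
  veil.type = c("p", "p", "p", "p"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each column
n_levels <- vapply(toy, function(x) length(unique(x)), integer(1))

# Columns with a single distinct value cannot be used for a split
constant_cols <- names(n_levels)[n_levels == 1]
constant_cols

# Drop all constant columns in one step
toy_clean <- toy[, setdiff(names(toy), constant_cols), drop = FALSE]
names(toy_clean)
```

Applied to `mushrooms`, the same two steps should flag `veil.type`, assuming no other column is constant.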
In order to understand the importance of the 21 different features in identifying the right category (edible or poisonous), we are going to:
Let’s explain the value of this algorithm using class against odor:
table(mushrooms$class,mushrooms$odor)
##
## a c f l m n p s y
## e 400 0 0 400 0 3408 0 0 0
## p 0 192 2160 0 36 120 256 576 576
We can instantly see that all mushrooms with odor “c”, “f”, “m”, “p”, “s” or “y” are poisonous. On the other hand, all 400 mushrooms with almond odor (“a”) are edible in this dataset.
This insight will help us evaluate the correctness of the final results.
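To make this reading of the table concrete, the counts printed above can be entered by hand and summarized. This is only a sketch: `odor_tab`, `share_poisonous`, `mixed` and `rule_accuracy` are illustrative names, and the "rule" is a plain majority vote per odor level that ignores the loss matrix used later.

```r
# Counts copied from the class-by-odor table above
odor_tab <- rbind(
  e = c(a = 400, c = 0,   f = 0,    l = 400, m = 0,  n = 3408, p = 0,   s = 0,   y = 0),
  p = c(a = 0,   c = 192, f = 2160, l = 0,   m = 36, n = 120,  p = 256, s = 576, y = 576)
)

# Share of poisonous mushrooms within each odor level
share_poisonous <- odor_tab["p", ] / colSums(odor_tab)
round(share_poisonous, 3)

# Odor levels in which both classes occur
mixed <- names(which(odor_tab["e", ] > 0 & odor_tab["p", ] > 0))
mixed

# Accuracy of predicting the majority class within each odor level
rule_accuracy <- sum(apply(odor_tab, 2, max)) / sum(odor_tab)
rule_accuracy
```

Only odor "n" (none) contains both classes, so a rule based on odor alone would get the 120 poisonous mushrooms without odor wrong.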
From the plot we see that the odor variable indeed
should play a role. Note that the order of the variables shown in the
plot above is not necessarily the order in which the final model will
choose variables to build the tree.
We will split the data into a training set (80% of the rows, sampled at random without replacement) and a test set (the remaining rows):
set.seed(12345) # for reproducibility
train <- sample(1:nrow(mushrooms),size = ceiling(0.80*nrow(mushrooms)),replace = FALSE)
# training set
mushrooms_train <- mushrooms[train,]
# test set
mushrooms_test <- mushrooms[-train,]
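A quick sanity check of the split sizes is possible without loading the data, since only the number of rows matters. The sketch below assumes the 8124 rows reported by `str()`; the name `test` is illustrative and holds the row numbers that `mushrooms[-train,]` keeps.

```r
n <- 8124  # number of rows reported by str() above

set.seed(12345)  # same seed as in the analysis
train <- sample(seq_len(n), size = ceiling(0.80 * n), replace = FALSE)

# Row numbers left for the test set (what negative indexing keeps)
test <- setdiff(seq_len(n), train)

length(train)  # 6500 training rows
length(test)   # 1624 test rows

# The two sets are disjoint and together cover every row
stopifnot(anyDuplicated(train) == 0)
stopifnot(length(intersect(train, test)) == 0)
stopifnot(length(train) + length(test) == n)
```

The 1624 test rows agree with the total of the confusion matrix reported at the end (829 + 795).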
The classification tree will be built using the training dataset. To minimize the number of poisonous mushrooms classified as edible, we penalize this misclassification 10 times more heavily than classifying an edible mushroom as poisonous.
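The effect of such a penalty can be illustrated with a small expected-loss calculation. This is a sketch of the general idea, not a description of rpart internals: it assumes rows of the loss matrix are the actual class and columns the predicted class, and the node probabilities and the helper `expected_loss` are hypothetical.

```r
# Loss matrix with labels: rows = actual class, columns = predicted class
loss <- matrix(c(0, 1, 10, 0), byrow = TRUE, nrow = 2,
               dimnames = list(actual = c("e", "p"), predicted = c("e", "p")))

# Expected loss of each possible prediction, given class probabilities in a node
expected_loss <- function(probs, loss) {
  # probs[i] multiplies row i of the loss matrix; columns are then summed
  colSums(probs * loss)
}

# Two hypothetical nodes: 8% and 10% poisonous
node_a <- c(e = 0.92, p = 0.08)
node_b <- c(e = 0.90, p = 0.10)

expected_loss(node_a, loss)
expected_loss(node_b, loss)

# Prediction with the smallest expected loss in each node
best_a <- names(which.min(expected_loss(node_a, loss)))
best_b <- names(which.min(expected_loss(node_b, loss)))
c(node_a = best_a, node_b = best_b)
```

With a 10:1 penalty, a node is labeled poisonous as soon as more than 1/11 (about 9.1%) of its mushrooms are poisonous, which is why `node_b` is labeled "p" even though 90% of it is edible.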
# penalty matrix: rows are the actual class (e, p), columns the predicted class.
# The penalty for classifying a poisonous mushroom as edible is 10 times bigger
# than the penalty for classifying an edible mushroom as poisonous
penalty.matrix <- matrix(c(0,1,10,0), byrow=TRUE, nrow=2)
# building the classification tree with rpart
tree <- rpart(class~.,
              data=mushrooms_train,
              parms = list(loss = penalty.matrix),
              method = "class")
# choosing the best complexity parameter "cp" to prune the tree
cp.optim <- tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"]
# pruning the tree using the best complexity parameter
tree <- prune(tree, cp=cp.optim)
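To see what the cp table looks like before pruning, the same steps can be run on a dataset that ships with rpart. The sketch below uses `kyphosis` as a stand-in for the mushroom data (an assumption made only so the example runs without `mushrooms.csv`); `fit`, `best_row`, `cp_best` and `pruned` are illustrative names.

```r
library(rpart)

# Same 10:1 loss matrix; for kyphosis the classes are "absent" and "present"
loss <- matrix(c(0, 1, 10, 0), byrow = TRUE, nrow = 2)

set.seed(12345)  # the cross-validated errors in the cp table depend on the seed
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis,
             method = "class",
             parms = list(loss = loss))

# One row per candidate subtree: CP, nsplit, rel error, xerror, xstd
print(fit$cptable)

# Pick the cp with the smallest cross-validated error and prune to it
best_row <- which.min(fit$cptable[, "xerror"])
cp_best  <- fit$cptable[best_row, "CP"]
pruned   <- prune(fit, cp = cp_best)

# The pruned tree never has more candidate subtrees than the full one
nrow(pruned$cptable) <= nrow(fit$cptable)
```

If several rows tie on `xerror`, `which.min` returns the first one, which is the row with the largest cp and therefore the smallest tree.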
Finally, we evaluate the model on the test dataset.
pred <- predict(object=tree,mushrooms_test[-1],type="class")
t <- table(mushrooms_test$class,pred)
confusionMatrix(t)
## Confusion Matrix and Statistics
##
## pred
## e p
## e 829 0
## p 0 795
##
## Accuracy : 1
## 95% CI : (0.9977, 1)
## No Information Rate : 0.5105
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5105
## Detection Rate : 0.5105
## Detection Prevalence : 0.5105
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : e
##
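The accuracy interval and the no information rate in the output above can be reproduced from the four cell counts alone. This is a sketch that assumes the interval is an exact binomial (Clopper-Pearson) interval as returned by `binom.test`, which is how the caret documentation describes it; the variable names are illustrative.

```r
# Cell counts from the confusion matrix above
n_edible    <- 829
n_poisonous <- 795
n_test      <- n_edible + n_poisonous
n_correct   <- 829 + 795  # diagonal of the confusion matrix

# Accuracy and its exact 95% binomial confidence interval
accuracy <- n_correct / n_test
ci <- binom.test(n_correct, n_test, conf.level = 0.95)$conf.int
round(c(accuracy = accuracy, lower = ci[1], upper = ci[2]), 4)

# No information rate: share of the largest class in the test set
nir <- max(n_edible, n_poisonous) / n_test
round(nir, 4)
```

Even with zero errors, the lower bound stays below 1 because 1624 test cases cannot rule out a small true error rate.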
All the samples in the test dataset have been correctly classified. Therefore we obtained: