
library(rpart,quietly = TRUE)
library(caret,quietly = TRUE)
library(rpart.plot,quietly = TRUE)

Summary

A dataset of 8124 mushroom observations was used to build a classification tree with the rpart package. The goal is to predict whether a given mushroom is edible or poisonous to humans. Each observation has 23 variables: the mushroom class and 22 descriptive features.

The original dataset can be obtained from the UCI Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/Mushroom

The dataset used to perform this analysis was obtained from the Kaggle site, and it is available here:

https://www.kaggle.com/uciml/mushroom-classification

A loss matrix has been used to penalize misclassifications: classifying a poisonous mushroom as edible costs 10 times more than classifying an edible mushroom as poisonous.

The final model has a 100% accuracy on the test set with a 95% confidence interval in the range (0.9977, 1).

Exploratory Data Analysis

# reading the dataset as a data frame
mushrooms <- read.csv("mushrooms.csv")
# structure of the data
str(mushrooms)
## 'data.frame':    8124 obs. of  23 variables:
##  $ class                   : chr  "p" "e" "e" "p" ...
##  $ cap.shape               : chr  "x" "x" "b" "x" ...
##  $ cap.surface             : chr  "s" "s" "s" "y" ...
##  $ cap.color               : chr  "n" "y" "w" "w" ...
##  $ bruises                 : chr  "t" "t" "t" "t" ...
##  $ odor                    : chr  "p" "a" "l" "p" ...
##  $ gill.attachment         : chr  "f" "f" "f" "f" ...
##  $ gill.spacing            : chr  "c" "c" "c" "c" ...
##  $ gill.size               : chr  "n" "b" "b" "n" ...
##  $ gill.color              : chr  "k" "k" "n" "n" ...
##  $ stalk.shape             : chr  "e" "e" "e" "e" ...
##  $ stalk.root              : chr  "e" "c" "c" "e" ...
##  $ stalk.surface.above.ring: chr  "s" "s" "s" "s" ...
##  $ stalk.surface.below.ring: chr  "s" "s" "s" "s" ...
##  $ stalk.color.above.ring  : chr  "w" "w" "w" "w" ...
##  $ stalk.color.below.ring  : chr  "w" "w" "w" "w" ...
##  $ veil.type               : chr  "p" "p" "p" "p" ...
##  $ veil.color              : chr  "w" "w" "w" "w" ...
##  $ ring.number             : chr  "o" "o" "o" "o" ...
##  $ ring.type               : chr  "p" "p" "p" "p" ...
##  $ spore.print.color       : chr  "k" "n" "n" "k" ...
##  $ population              : chr  "s" "n" "n" "s" ...
##  $ habitat                 : chr  "u" "g" "m" "u" ...
# number of rows with missing values
nrow(mushrooms) - sum(complete.cases(mushrooms))
## [1] 0

There are no rows with missing values in the dataset. However, the variable veil.type takes a single value in every observation and therefore cannot help separate the classes, so we remove it from the analysis.

# deleting useless variable `veil.type`
mushrooms$veil.type <- NULL

Simple algorithm to gauge variable importance

To get an understanding of how important each of the 21 remaining features is for identifying the right category (edible or poisonous), we are going to:

  1. Create a table for each feature versus class type (edible, poisonous)
  2. Count the columns (feature levels) that contain a zero in the table created in step 1
  3. Sort the features by the number of zeroes counted in step 2
  4. Plot the sorted list from step 3
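
The four steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: `toy` is a small made-up data frame standing in for `mushrooms`, and the helper name `count_zero_levels` is hypothetical. The plotting code used for the original figure is not shown in the text, so the `barplot` call is one possible choice.

```r
# Toy stand-in for the mushrooms data frame (made-up values, illustration only)
toy <- data.frame(
  class = c("e", "e", "p", "p", "e", "p"),
  odor  = c("a", "n", "f", "n", "a", "f"),
  shape = c("x", "b", "x", "b", "x", "b"),
  stringsAsFactors = FALSE
)

# Steps 1 and 2: cross-tabulate one feature against the class and count the
# feature levels (table columns) that contain at least one zero
count_zero_levels <- function(feature, class) {
  tab <- table(class, feature)
  sum(colSums(tab == 0) > 0)
}

features <- setdiff(names(toy), "class")
zero_counts <- sapply(features, function(f) count_zero_levels(toy[[f]], toy$class))

# Step 3: sort the features by the number of zero-containing levels
zero_counts <- sort(zero_counts, decreasing = TRUE)

# Step 4: plot the sorted counts
barplot(zero_counts, las = 2, ylab = "Levels seen in only one class")
```

A level with a zero in its column occurs in only one class, so a feature with many such levels separates the classes well on its own. In the toy data `odor` has two such levels and `shape` has none.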

Let’s explain the value of this algorithm using class against odor:

table(mushrooms$class,mushrooms$odor)
##    
##        a    c    f    l    m    n    p    s    y
##   e  400    0    0  400    0 3408    0    0    0
##   p    0  192 2160    0   36  120  256  576  576

We can immediately see that mushrooms with odor “c”, “f”, “m”, “p”, “s” or “y” are all poisonous. On the other hand, all mushrooms with almond odor (“a”, 400 observations) are edible in this dataset.

This insight will help us evaluate the correctness of the final results.
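
To make the reading of the odor table concrete, the sketch below re-enters the counts printed above and works out which odor levels occur in a single class, and how far a rule based on odor alone would get. The object names (`odor_tab`, `pure_levels`, `correct`) are illustrative and not part of the original analysis.

```r
# Counts copied from the class-versus-odor table above
odor_tab <- rbind(
  e = c(a = 400, c = 0,   f = 0,    l = 400, m = 0,  n = 3408, p = 0,   s = 0,   y = 0),
  p = c(a = 0,   c = 192, f = 2160, l = 0,   m = 36, n = 120,  p = 256, s = 576, y = 576)
)

# Odor levels that appear in only one class (a zero somewhere in the column)
pure_levels <- colnames(odor_tab)[colSums(odor_tab == 0) > 0]

# Predicting the majority class within each odor level gets this many right
correct <- sum(apply(odor_tab, 2, max))
odor_only_accuracy <- correct / sum(odor_tab)
```

Only odor “n” is mixed: its 120 poisonous mushrooms are the ones an odor-only rule would miss, which is why the tree needs further variables.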

From the plot we see that the odor variable should indeed play a role. Note that the order of the variables in the plot is not necessarily the order in which the final model will choose variables to build the tree.

Splitting the data into training and testing sets

We will split the data by drawing one random sample of rows without replacement and using the remaining rows as the test set:

  • training dataset (80%) for model building
  • test dataset (20%) for testing the model with unseen data
set.seed(12345) # for reproducibility
train <- sample(1:nrow(mushrooms),size = ceiling(0.80*nrow(mushrooms)),replace = FALSE)
# training set
mushrooms_train <- mushrooms[train,]
# test set
mushrooms_test <- mushrooms[-train,]
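
As a quick sanity check on the split sizes, the sketch below repeats the same sampling call on row indices only, so no data file is needed. The names `n` and `test_idx` are introduced here for illustration.

```r
n <- 8124                      # number of observations reported by str()
set.seed(12345)                # same seed as above
train <- sample(1:n, size = ceiling(0.80 * n), replace = FALSE)
test_idx <- setdiff(1:n, train)

length(train)     # 6500 rows for training
length(test_idx)  # 1624 rows for testing
```

The 1624 test rows agree with the total of the confusion matrix reported later (829 + 795).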

Classification Tree

The classification tree is built using the training dataset. To minimize the number of poisonous mushrooms classified as edible, we make the penalty for this misclassification 10 times larger than the penalty for classifying an edible mushroom as poisonous.

# penalty matrix: sets a penalty that is 10 times larger for classifying a
# poisonous mushroom as edible than for classifying an edible mushroom as
# poisonous
penalty.matrix <- matrix(c(0,1,10,0), byrow=TRUE, nrow=2)
# building the classification tree with rpart
tree <- rpart(class~.,
              data=mushrooms_train,
              parms = list(loss = penalty.matrix),
              method = "class")
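
To see what the loss matrix does, it helps to label its rows and columns. rpart reads the rows as the true class and the columns as the predicted class, in the order of the class levels (here "e", then "p"). The sketch below is an illustration only: the node counts are made up, and the names `loss`, `node_counts` and `expected_loss` are not part of the original code.

```r
# Same values as penalty.matrix, with labels added for readability
loss <- matrix(c(0, 1, 10, 0), byrow = TRUE, nrow = 2,
               dimnames = list(actual = c("e", "p"), predicted = c("e", "p")))

# Hypothetical node containing 90 edible and 10 poisonous mushrooms
node_counts <- c(e = 90, p = 10)

# Total loss if the whole node is labeled "e" or labeled "p":
# each row of the loss matrix is weighted by the count of that true class
expected_loss <- colSums(loss * node_counts)

# The label with the smallest total loss wins
names(which.min(expected_loss))
```

Labeling the node "e" costs 10 * 10 = 100, while labeling it "p" costs 90 * 1 = 90, so the node is labeled poisonous even though 90% of its mushrooms are edible. This is the intended bias toward the safer error.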

Pruning the tree using the best complexity parameter

# choosing the best complexity parameter "cp" to prune the tree
cp.optim <- tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"]
# tree pruning using the best complexity parameter. For more in
tree <- prune(tree, cp=cp.optim)
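
The `cptable` of an rpart fit has one row per candidate subtree, with its complexity parameter (`CP`) and cross-validated error (`xerror`). A minimal sketch of the same select-and-prune step follows. It uses the `kyphosis` data that ships with rpart as a stand-in, an assumption made so the example runs without the mushroom file, and the names `fit`, `cp_best` and `pruned` are illustrative.

```r
library(rpart)

set.seed(1)  # xerror comes from random cross-validation folds
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# One row per candidate subtree: CP, nsplit, rel error, xerror, xstd
cp_table <- fit$cptable

# Pick the CP whose subtree has the lowest cross-validated error
best_row <- which.min(cp_table[, "xerror"])
cp_best <- cp_table[best_row, "CP"]

# Prune the full tree back to that subtree
pruned <- prune(fit, cp = cp_best)
```

Because `xerror` depends on the random folds, the selected `CP` can change with the seed.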

Visualizing the final tree
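
The figure and the code that produced it are not included in this text. A minimal sketch of how a fitted classification tree might be drawn with the rpart.plot package loaded at the top is given below. It again uses the `kyphosis` data from rpart as a stand-in, and the `extra = 104` setting is one possible choice, not necessarily the one used for the original figure.

```r
library(rpart)
library(rpart.plot)

# Stand-in classification tree (kyphosis ships with rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# extra = 104 shows the class probabilities in each node
# together with the percentage of observations it contains
rpart.plot(fit, extra = 104)
```

For the mushroom model the call would take the pruned `tree` object in place of `fit`.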

Predictions on test set

Finally we proceed to test the model on the test dataset.

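# predicted classes for the test set (the first column, class, is dropped)
pred <- predict(object=tree,mushrooms_test[-1],type="class")
# cross-tabulate the true class (rows) against the predicted class (columns)
conf.tab <- table(mushrooms_test$class,pred)
confusionMatrix(conf.tab)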
## Confusion Matrix and Statistics
## 
##    pred
##       e   p
##   e 829   0
##   p   0 795
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9977, 1)
##     No Information Rate : 0.5105     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.5105     
##          Detection Rate : 0.5105     
##    Detection Prevalence : 0.5105     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : e          
## 

All the samples in the test dataset have been correctly classified. Therefore we obtained:

  • 100% accuracy on test set
  • 95% confidence interval (0.9977, 1) for the model accuracy
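
The reported interval is consistent with an exact binomial (Clopper-Pearson) interval for 1624 correct predictions out of 1624. The sketch below reproduces it with base R's `binom.test`. The names `n_test` and `ci` are illustrative.

```r
# Test-set size taken from the confusion matrix: 829 edible + 795 poisonous
n_test <- 829 + 795

# Exact 95% confidence interval for the accuracy when every prediction is correct
ci <- as.numeric(binom.test(x = n_test, n = n_test)$conf.int)

round(ci, 4)  # 0.9977 1.0000
```

With no errors observed, the lower bound depends only on the test-set size: a larger test set would push it closer to 1.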