The mushroom classification problem is to determine whether a mushroom is edible or poisonous based on its observable features.
Objective
- What types of machine learning models perform best on this dataset?
- Which features are most indicative of a poisonous mushroom?
Analysis Details
- Part 1: Import the data, clean it, perform exploratory analysis and test which model fits best.
- Part 2: Take a deeper look at the important variables and examine which values of each feature help identify whether a mushroom is edible or not.
Environment Setup
# Load the required packages (if packages are not available, install them first)
for (package in c('caret','readr','ggplot2','magrittr','ggthemes','dplyr','corrplot','caTools')) {
if (!require(package, character.only=T, quietly=T)) {
install.packages(package)
library(package,character.only=T)
}
}
# We will be using the H2O package
# Load the H2O library into the R environment
library(h2o)
# Make a connection to the h2o server
h2o.init(nthreads = -1, #Number of threads -1 means use all cores on your machine
max_mem_size = "8G") #max mem size is the maximum memory to allocate to H2O
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 3 hours 39 minutes
## H2O cluster version: 3.10.5.3
## H2O cluster version age: 22 days
## H2O cluster name: H2O_started_from_R_nkhan_xta951
## H2O cluster total nodes: 1
## H2O cluster total memory: 6.79 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 3.3.0 (2016-05-03)
Import and read the data, using Sys.time() to keep an eye on the parsing time
start <- Sys.time()
mushrooms_csv <- "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
mushrooms.hex <- h2o.importFile(path = mushrooms_csv,destination_frame = "mushrooms_data.hex")
parseTime <- Sys.time() - start
print(paste("Took",round(parseTime, digits = 2),"seconds to parse", nrow(mushrooms.hex), "rows and", ncol(mushrooms.hex),"columns."))
## [1] "Took 2.84 seconds to parse 8124 rows and 23 columns."
head(mushrooms.hex)
## C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 C19 C20
## 1 p x s n t p f c n k e e s s w w p w o p
## 2 e x s y t a f c b k e c s s w w p w o p
## 3 e b s w t l f c b n e c s s w w p w o p
## 4 p x y w t p f c n n e e s s w w p w o p
## 5 e x s g f n f w b k t e s s w w p w o e
## 6 e x y y t a f c b n e c s s w w p w o p
## C21 C22 C23
## 1 k s u
## 2 n n g
## 3 n n m
## 4 k s u
## 5 n a g
## 6 k n g
# H2O imports the columns as C1, C2, C3, ..., so we replace these labels with the original column names.
names(mushrooms.hex) <- c("class","cap.shape","cap.surface","cap.color","bruises", "odor","gill.attachment","gill.spacing","gill.size","gill.color","stalk.shape","stalk.root","stalk.surface.above.ring","stalk.surface.below.ring","stalk.color.above.ring","stalk.color.below.ring", "veil.type","veil.color","ring.number","ring.type","spore.print.color","population", "habitat")
head(mushrooms.hex)
## class cap.shape cap.surface cap.color bruises odor gill.attachment
## 1 p x s n t p f
## 2 e x s y t a f
## 3 e b s w t l f
## 4 p x y w t p f
## 5 e x s g f n f
## 6 e x y y t a f
## gill.spacing gill.size gill.color stalk.shape stalk.root
## 1 c n k e e
## 2 c b k e c
## 3 c b n e c
## 4 c n n e e
## 5 w b k t e
## 6 c b n e c
## stalk.surface.above.ring stalk.surface.below.ring stalk.color.above.ring
## 1 s s w
## 2 s s w
## 3 s s w
## 4 s s w
## 5 s s w
## 6 s s w
## stalk.color.below.ring veil.type veil.color ring.number ring.type
## 1 w p w o p
## 2 w p w o p
## 3 w p w o p
## 4 w p w o p
## 5 w p w o e
## 6 w p w o p
## spore.print.color population habitat
## 1 k s u
## 2 n n g
## 3 n n m
## 4 k s u
## 5 n a g
## 6 k n g
# The mushroom data we imported can now be inspected on the H2O server at http://localhost:54321/flow/index.html
Exploratory Analysis
# Check for the dimensions of the data
dim(mushrooms.hex)
## [1] 8124 23
# Study the structure of the data
str(mushrooms.hex)
## Class 'H2OFrame' <environment: 0x000000001c5c4660>
## - attr(*, "op")= chr "colnames="
## - attr(*, "eval")= logi TRUE
## - attr(*, "id")= chr "RTMP_sid_b121_1"
## - attr(*, "nrow")= int 8124
## - attr(*, "ncol")= int 23
## - attr(*, "types")=List of 23
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## - attr(*, "data")='data.frame': 10 obs. of 23 variables:
## ..$ class : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1
## ..$ cap.shape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1
## ..$ cap.surface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3
## ..$ cap.color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10
## ..$ bruises : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2
## ..$ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1
## ..$ gill.attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2
## ..$ gill.spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1
## ..$ gill.size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1
## ..$ gill.color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3
## ..$ stalk.shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1
## ..$ stalk.root : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3
## ..$ stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3
## ..$ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3
## ..$ stalk.color.above.ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8
## ..$ stalk.color.below.ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8
## ..$ veil.type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1
## ..$ veil.color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3
## ..$ ring.number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2
## ..$ ring.type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5
## ..$ spore.print.color : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3
## ..$ population : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4
## ..$ habitat : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4
# From the structure we can see that every variable is a factor with a different number of levels.
# Factors are R objects created from a vector; they store the vector together with the distinct values of its elements as levels. The levels are always stored as characters, regardless of whether the input vector is numeric, character or Boolean.
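For instance, here is a minimal sketch of how a factor behaves in base R (a toy vector, not taken from the mushroom data):

# Toy example: a factor stores character levels plus integer codes
x <- factor(c("p", "e", "e", "p"), levels = c("e", "p"))
levels(x)      # "e" "p"
as.integer(x)  # underlying codes: 2 1 1 2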
## Let us see how many levels each variable has
h2o.levels(mushrooms.hex)
## [[1]]
## [1] "e" "p"
##
## [[2]]
## [1] "b" "c" "f" "k" "s" "x"
##
## [[3]]
## [1] "f" "g" "s" "y"
##
## [[4]]
## [1] "b" "c" "e" "g" "n" "p" "r" "u" "w" "y"
##
## [[5]]
## [1] "f" "t"
##
## [[6]]
## [1] "a" "c" "f" "l" "m" "n" "p" "s" "y"
##
## [[7]]
## [1] "a" "f"
##
## [[8]]
## [1] "c" "w"
##
## [[9]]
## [1] "b" "n"
##
## [[10]]
## [1] "b" "e" "g" "h" "k" "n" "o" "p" "r" "u" "w" "y"
##
## [[11]]
## [1] "e" "t"
##
## [[12]]
## [1] "?" "b" "c" "e" "r"
##
## [[13]]
## [1] "f" "k" "s" "y"
##
## [[14]]
## [1] "f" "k" "s" "y"
##
## [[15]]
## [1] "b" "c" "e" "g" "n" "o" "p" "w" "y"
##
## [[16]]
## [1] "b" "c" "e" "g" "n" "o" "p" "w" "y"
##
## [[17]]
## [1] "p"
##
## [[18]]
## [1] "n" "o" "w" "y"
##
## [[19]]
## [1] "n" "o" "t"
##
## [[20]]
## [1] "e" "f" "l" "n" "p"
##
## [[21]]
## [1] "b" "h" "k" "n" "o" "r" "u" "w" "y"
##
## [[22]]
## [1] "a" "c" "n" "s" "v" "y"
##
## [[23]]
## [1] "d" "g" "l" "m" "p" "u" "w"
#h2o.unique(mushrooms.hex)
#h2o.ddply(mushrooms_hex, 2, function(x) length(unique(x)))
# Check for missing values (NAs)
any(is.na(mushrooms.hex))
## [1] 0
# [1] 0 means that no NAs were found. Because the data set has no missing values:
# 1. The data is fully structured, with nothing missing.
# 2. We do not have to omit rows or columns because of missing values.
# 3. We do not have to replace missing data with predicted or average values, so the data stays accurate.
# 4. Preparation takes less time.
# Note, however, that stalk.root encodes unknown values as the literal level "?" (see the levels above); H2O treats "?" as an ordinary category rather than as NA.
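A quick way to see how often the "?" level occurs in stalk.root is a frequency table; a sketch using h2o.table (output omitted):

# Count the rows in each stalk.root level, including "?" (sketch)
h2o.table(mushrooms.hex$stalk.root)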
BUILDING MODELS
# Below we split the data into train/validation/test sets and build Random Forest, GBM and Deep Learning models.
# First, we will create three splits for train/valid/test independent data sets.
# We will train on one set and use the others to test the validity.
# The second set will be used for validation most of the time. The third set will
# be withheld until the end, to ensure that our validation accuracy is consistent
# with data we have never seen during the iterative process.
splits <- h2o.splitFrame(
mushrooms.hex, ## splitting the H2O frame we read above
ratios = c(0.6,0.2), ## create splits of 60% and 20%;
# H2O will create one more split of 1-(sum of these parameters)
# so we will get 0.6 / 0.2 / 1 - (0.6+0.2) = 0.6/0.2/0.2
seed=1) ## setting a seed will ensure reproducible results (not R's seed)
train <- h2o.assign(splits[[1]], "train.hex")
# assign the first result the R variable train
# and the H2O name train.hex
valid <- h2o.assign(splits[[2]], "valid.hex") ## R valid, H2O valid.hex
test <- h2o.assign(splits[[3]], "test.hex") ## R test, H2O test.hex
x_train = train[,2:23]
y_train = train[,1]
x_test = test[,2:23]
y_test = test[,1]
print(paste("Training data has", ncol(train),"columns and", nrow(train), "rows, whereas test data has", nrow(test), "rows, and validation data has rows", nrow(valid))
)
## [1] "Training data has 23 columns and 4905 rows, whereas test data has 1600 rows, and validation data has rows 1619"
# Take a look at the first few rows of the data set
train[1:5,] ## rows 1-5, all columns
## class cap.shape cap.surface cap.color bruises odor gill.attachment
## 1 p x s n t p f
## 2 p x y w t p f
## 3 e x y y t a f
## 4 e b y w t l f
## 5 p x y w t p f
## gill.spacing gill.size gill.color stalk.shape stalk.root
## 1 c n k e e
## 2 c n n e e
## 3 c b n e c
## 4 c b n e c
## 5 c n p e e
## stalk.surface.above.ring stalk.surface.below.ring stalk.color.above.ring
## 1 s s w
## 2 s s w
## 3 s s w
## 4 s s w
## 5 s s w
## stalk.color.below.ring veil.type veil.color ring.number ring.type
## 1 w p w o p
## 2 w p w o p
## 3 w p w o p
## 4 w p w o p
## 5 w p w o p
## spore.print.color population habitat
## 1 k s u
## 2 k s u
## 3 k n g
## 4 n s m
## 5 k v g
##
## [5 rows x 23 columns]
# Assign X and Y values
myY <- "class"
myX <- setdiff(names(train), myY)
## Run our first predictive model (Random Forest Model)
mush_rf_model <- h2o.randomForest(x = myX,
y = myY,
training_frame = train,
validation_frame = test,
model_id = "mush_rf_model",
ntrees = 250,
max_depth = 30,
seed = 100)
print(mush_rf_model)
## Model Details:
## ==============
##
## H2OBinomialModel: drf
## Model ID: mush_rf_model
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 250 250 146323 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 14 8.18800 7 36 17.32000
##
##
## H2OBinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
##
## MSE: 0.0001047809
## RMSE: 0.01023626
## LogLoss: 0.001884697
## Mean Per-Class Error: 0
## AUC: 1
## Gini: 1
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## e p Error Rate
## e 2560 0 0.000000 =0/2560
## p 0 2345 0.000000 =0/2345
## Totals 2560 2345 0.000000 =0/4905
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.577329 1.000000 189
## 2 max f2 0.577329 1.000000 189
## 3 max f0point5 0.577329 1.000000 189
## 4 max accuracy 0.577329 1.000000 189
## 5 max precision 1.000000 1.000000 0
## 6 max recall 0.577329 1.000000 189
## 7 max specificity 1.000000 1.000000 0
## 8 max absolute_mcc 0.577329 1.000000 189
## 9 max min_per_class_accuracy 0.577329 1.000000 189
## 10 max mean_per_class_accuracy 0.577329 1.000000 189
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: drf
## ** Reported on validation data. **
##
## MSE: 0.0001423202
## RMSE: 0.0119298
## LogLoss: 0.002222982
## Mean Per-Class Error: 0
## AUC: 1
## Gini: 1
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## e p Error Rate
## e 805 0 0.000000 =0/805
## p 0 795 0.000000 =0/795
## Totals 805 795 0.000000 =0/1600
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.716906 1.000000 62
## 2 max f2 0.716906 1.000000 62
## 3 max f0point5 0.716906 1.000000 62
## 4 max accuracy 0.716906 1.000000 62
## 5 max precision 1.000000 1.000000 0
## 6 max recall 0.716906 1.000000 62
## 7 max specificity 1.000000 1.000000 0
## 8 max absolute_mcc 0.716906 1.000000 62
## 9 max min_per_class_accuracy 0.716906 1.000000 62
## 10 max mean_per_class_accuracy 0.716906 1.000000 62
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
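If cross-validated metrics are wanted in addition to the train/validation split, h2o.randomForest also accepts an nfolds argument. A minimal sketch (not run in this report; the mush_rf_cv name is illustrative):

# Sketch: the same random forest with 5-fold cross-validation
mush_rf_cv <- h2o.randomForest(x = myX, y = myY,
                               training_frame = train,
                               nfolds = 5,
                               ntrees = 250, max_depth = 30, seed = 100)
h2o.auc(mush_rf_cv, xval = TRUE)  # cross-validated AUC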
# Let us see which variables are important in this model.
h2o.varimp_plot(mush_rf_model, num_of_features = NULL)

h2o.confusionMatrix(mush_rf_model,train)
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.795358690619469:
## e p Error Rate
## e 2560 0 0.000000 =0/2560
## p 0 2345 0.000000 =0/2345
## Totals 2560 2345 0.000000 =0/4905
# Predict on the test set and see how well the model classifies unseen data.
h2o.predict(mush_rf_model,test)
## predict e p
## 1 e 1.0000000 0.0000000000
## 2 e 0.9997938 0.0002061856
## 3 e 0.9997938 0.0002061856
## 4 p 0.0064781 0.9935219000
## 5 e 0.9997938 0.0002061856
## 6 e 1.0000000 0.0000000000
##
## [1600 rows x 3 columns]
h2o.confusionMatrix(mush_rf_model,test)
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.716906309485436:
## e p Error Rate
## e 805 0 0.000000 =0/805
## p 0 795 0.000000 =0/795
## Totals 805 795 0.000000 =0/1600
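For the full set of test-set metrics in one place, h2o.performance can be applied to the held-out frame; a sketch (output omitted; the rf_perf name is illustrative):

# Sketch: full performance object for the random forest on the test frame
rf_perf <- h2o.performance(mush_rf_model, newdata = test)
h2o.auc(rf_perf)  # test-set AUC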
# Run GBM
mush_gbm_model <- h2o.gbm(x=myX,build_tree_one_node = T,
y = myY,
training_frame = train,
validation_frame = test,
model_id = "mush_gbm_model",
ntrees = 500,
max_depth = 6,
learn_rate = 0.1)
# Print the model details and performance (training and validation metrics)
print(mush_gbm_model)
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: mush_gbm_model
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 500 500 266928 1
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 6 5.20800 2 18 13.15200
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 1.128308e-33
## RMSE: 3.35903e-17
## LogLoss: 5.002228e-18
## Mean Per-Class Error: 0
## AUC: 1
## Gini: 1
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## e p Error Rate
## e 2560 0 0.000000 =0/2560
## p 0 2345 0.000000 =0/2345
## Totals 2560 2345 0.000000 =0/4905
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 1.000000 1.000000 0
## 2 max f2 1.000000 1.000000 0
## 3 max f0point5 1.000000 1.000000 0
## 4 max accuracy 1.000000 1.000000 0
## 5 max precision 1.000000 1.000000 0
## 6 max recall 1.000000 1.000000 0
## 7 max specificity 1.000000 1.000000 0
## 8 max absolute_mcc 1.000000 1.000000 0
## 9 max min_per_class_accuracy 1.000000 1.000000 0
## 10 max mean_per_class_accuracy 1.000000 1.000000 0
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
##
## MSE: 1.405236e-31
## RMSE: 3.748647e-16
## LogLoss: 2.019218e-17
## Mean Per-Class Error: 0
## AUC: 1
## Gini: 1
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## e p Error Rate
## e 805 0 0.000000 =0/805
## p 36 759 0.045283 =36/795
## Totals 841 759 0.022500 =36/1600
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 1.000000 1.000000 4
## 2 max f2 1.000000 1.000000 4
## 3 max f0point5 1.000000 1.000000 4
## 4 max accuracy 1.000000 1.000000 4
## 5 max precision 1.000000 1.000000 0
## 6 max recall 1.000000 1.000000 4
## 7 max specificity 1.000000 1.000000 0
## 8 max absolute_mcc 1.000000 1.000000 4
## 9 max min_per_class_accuracy 1.000000 1.000000 4
## 10 max mean_per_class_accuracy 1.000000 1.000000 4
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.varimp_plot(mush_gbm_model, num_of_features = NULL)

## Run DeepLearning
mush_dl_model <- h2o.deeplearning(x = myX,
y = myY,
training_frame = train,
validation_frame = test,
activation = "TanhWithDropout",
input_dropout_ratio = 0.2,
hidden_dropout_ratios = c(0.5,0.5,0.5),
hidden = c(50,50,50),
epochs = 100,
seed = 123456)
print(mush_dl_model)
## Model Details:
## ==============
##
## H2OBinomialModel: deeplearning
## Model ID: DeepLearning_model_R_1500745200343_3310
## Status of Neuron Layers: predicting class, 2-class classification, bernoulli distribution, CrossEntropy loss, 12,102 weights/biases, 152.9 KB, 147,150 training samples, mini-batch size 1
## layer units type dropout l1 l2 mean_rate rate_rms
## 1 1 137 Input 20.00 %
## 2 2 50 TanhDropout 50.00 % 0.000000 0.000000 0.162737 0.378240
## 3 3 50 TanhDropout 50.00 % 0.000000 0.000000 0.005833 0.007445
## 4 4 50 TanhDropout 50.00 % 0.000000 0.000000 0.020526 0.033357
## 5 5 2 Softmax 0.000000 0.000000 0.005173 0.003195
## momentum mean_weight weight_rms mean_bias bias_rms
## 1
## 2 0.000000 -0.001426 0.190487 0.004456 0.135172
## 3 0.000000 0.002671 0.208880 0.030208 0.149506
## 4 0.000000 -0.002842 0.156746 0.000417 0.093283
## 5 0.000000 0.040579 0.743097 0.011230 0.002301
##
##
## H2OBinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
##
## MSE: 0.0003953765
## RMSE: 0.01988408
## LogLoss: 0.001390508
## Mean Per-Class Error: 0
## AUC: 1
## Gini: 1
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## e p Error Rate
## e 2560 0 0.000000 =0/2560
## p 0 2345 0.000000 =0/2345
## Totals 2560 2345 0.000000 =0/4905
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.052458 1.000000 214
## 2 max f2 0.052458 1.000000 214
## 3 max f0point5 0.052458 1.000000 214
## 4 max accuracy 0.052458 1.000000 214
## 5 max precision 0.999985 1.000000 0
## 6 max recall 0.052458 1.000000 214
## 7 max specificity 0.999985 1.000000 0
## 8 max absolute_mcc 0.052458 1.000000 214
## 9 max min_per_class_accuracy 0.052458 1.000000 214
## 10 max mean_per_class_accuracy 0.052458 1.000000 214
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on full validation frame **
##
## MSE: 0.001574598
## RMSE: 0.03968121
## LogLoss: 0.005658701
## Mean Per-Class Error: 0.000621118
## AUC: 0.9999969
## Gini: 0.9999937
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## e p Error Rate
## e 804 1 0.001242 =1/805
## p 0 795 0.000000 =0/795
## Totals 804 796 0.000625 =1/1600
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.025403 0.999371 157
## 2 max f2 0.025403 0.999748 157
## 3 max f0point5 0.309938 0.999496 154
## 4 max accuracy 0.025403 0.999375 157
## 5 max precision 0.999985 1.000000 0
## 6 max recall 0.025403 1.000000 157
## 7 max specificity 0.999985 1.000000 0
## 8 max absolute_mcc 0.025403 0.998751 157
## 9 max min_per_class_accuracy 0.025403 0.998758 157
## 10 max mean_per_class_accuracy 0.025403 0.999379 157
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
plot(mush_dl_model)

## Performance of the deep learning model (by default the confusion matrix below is reported on the training frame)
h2o.confusionMatrix(mush_dl_model)
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.0524584492892366:
## e p Error Rate
## e 2560 0 0.000000 =0/2560
## p 0 2345 0.000000 =0/2345
## Totals 2560 2345 0.000000 =0/4905
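To get the confusion matrix on the validation frame instead, pass valid = TRUE; a sketch (output omitted):

# Sketch: confusion matrix computed on the validation frame
h2o.confusionMatrix(mush_dl_model, valid = TRUE)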
# Interesting thing to note
# Warning message:
# In .h2o.startModelJob(algo, params, h2oRestApiVersion) :
# Dropping bad and constant columns: [veil.type].
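H2O drops the constant veil.type column automatically, but single-level columns can also be removed manually before training; a sketch (the mushrooms.reduced name is illustrative):

# Sketch: drop factor columns that have only one level (e.g. veil.type)
n_levels <- sapply(h2o.levels(mushrooms.hex), length)
constant_cols <- names(mushrooms.hex)[n_levels <= 1]
mushrooms.reduced <- mushrooms.hex[, setdiff(names(mushrooms.hex), constant_cols)]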
Using Random Forest turns out to give more accurate classification than a single decision tree or bagging.
Odor, spore.print.color and gill.color appear to be the features most indicative of a poisonous mushroom.
In general, Random Forest has the best classification accuracy, and odor, spore.print.color and stalk.color.below.ring rank among its most important variables.
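One way to back this conclusion up numerically (a sketch; output omitted) is to compare the test-set AUC of the three fitted models:

# Sketch: compare test-set AUC across the three models
sapply(list(rf = mush_rf_model, gbm = mush_gbm_model, dl = mush_dl_model),
       function(m) h2o.auc(h2o.performance(m, newdata = test)))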
************ PART 2 OF THE MUSHROOM DATASET ANALYSIS ***************************************************
Since we now know the top features that help us classify mushrooms as "Edible" or "Poisonous",
we will take a closer look at the individual values of those features to see which ones most clearly separate the two classes.
# Subset the data: the first 10 rows as a small sample, keeping class plus the top features identified above.
mushrooms.hex1 <- mushrooms.hex[1:10, c(1,6,9,10,12,13,20,21,22,23)]
mushrooms.hex1
## class odor gill.size gill.color stalk.root stalk.surface.above.ring
## 1 p p n k e s
## 2 e a b k c s
## 3 e l b n c s
## 4 p p n n e s
## 5 e n b k e s
## 6 e a b n c s
## ring.type spore.print.color population habitat
## 1 p k s u
## 2 p n n g
## 3 p n n m
## 4 p k s u
## 5 e n a g
## 6 p k n g
##
## [10 rows x 10 columns]
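Note that mushrooms.hex1 keeps only the first 10 rows, so the breakdown below is an illustration on a small sample. To run the same analysis on all 8,124 rows, one could drop the row index (a sketch; the mushrooms.full name is illustrative):

# Sketch: keep every row, class plus the same top-feature columns
mushrooms.full <- mushrooms.hex[, c(1, 6, 9, 10, 12, 13, 20, 21, 22, 23)]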
names(mushrooms.hex1) <- c("class","odor","gill.size","gill.color","stalk.root","stalk.surface.above.ring","ring.type","spore.print.color","population","habitat")
head(mushrooms.hex)
## class cap.shape cap.surface cap.color bruises odor gill.attachment
## 1 p x s n t p f
## 2 e x s y t a f
## 3 e b s w t l f
## 4 p x y w t p f
## 5 e x s g f n f
## 6 e x y y t a f
## gill.spacing gill.size gill.color stalk.shape stalk.root
## 1 c n k e e
## 2 c b k e c
## 3 c b n e c
## 4 c n n e e
## 5 w b k t e
## 6 c b n e c
## stalk.surface.above.ring stalk.surface.below.ring stalk.color.above.ring
## 1 s s w
## 2 s s w
## 3 s s w
## 4 s s w
## 5 s s w
## 6 s s w
## stalk.color.below.ring veil.type veil.color ring.number ring.type
## 1 w p w o p
## 2 w p w o p
## 3 w p w o p
## 4 w p w o p
## 5 w p w o e
## 6 w p w o p
## spore.print.color population habitat
## 1 k s u
## 2 n n g
## 3 n n m
## 4 k s u
## 5 n a g
## 6 k n g
mushrooms.hex1 <- as.data.frame(mushrooms.hex1)
plot(mushrooms.hex1)

# Data Transformation
# Transforms the Class column
class_trans <- function(key){
switch (key,
'p' = 'poisonous',
'e' = 'edible'
)
}
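The switch-based helpers in this section map single codes to labels one element at a time. An equivalent, more vectorized alternative (a sketch, shown only for class and not run here) is a named lookup vector:

# Sketch: named lookup vector instead of switch (equivalent result for class)
class_map <- c(p = "poisonous", e = "edible")
# mushrooms.hex1$class <- unname(class_map[as.character(mushrooms.hex1$class)])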
#Transforms the odor column
odor_trans <- function(key){
switch(key,
'a' = 'almond',
'l' = 'anise',
'c'= 'creosote',
'y'= 'fishy',
'f'= 'foul',
'm'= 'musty',
'n'= 'none',
'p'= 'pungent',
's'= 'spicy'
)
}
# Transforms the gill.size column
gill.size_trans <- function(key){
switch(key,
'b'= 'broad',
'n'= 'narrow')}
#Transforms the gill.color column
gill.color_trans <- function(key){
switch(key,
'k'= 'black',
'n'= 'brown',
'b'= 'buff',
'h'= 'chocolate',
'g'= 'gray')}
#Transforms the stalk.root column
stalk.root_trans <- function(key){
switch(key,
'b'= 'bulbous',
'c'= 'club',
'u'= 'cup',
'e'= 'equal',
'z'= 'rhizomorphs',
'r'= 'rooted',
'?'= 'missing')}
#Transforms the stalk.surface.above.ring column
stalk.surface.above.ring_trans <- function(key){
switch(key,
'f'= 'fibrous',
'y'= 'scaly',
'k'= 'silky',
's'= 'smooth')}
# Transforms the ring.type column
ring.type_trans <- function(key){
switch(key,
'c' = 'cobwebby',
'e' = 'evanescent',
'f' = 'flaring',
'l' = 'large',
'n' = 'none',
'p' = 'pendant',
's' = 'sheathing',
'z' = 'zone')}
#Transforms the spore.print.color column
spore.print.color_trans <- function(key){
switch(key,
'k'= 'black',
'n'= 'brown',
'b' = 'buff',
'h'= 'chocolate',
'r' = 'green',
'o' = 'orange',
'u' = 'purple',
'w' = 'white',
'y' = 'yellow')}
#Transforms the population column
population_trans <- function(key){
switch(key,
'a' = 'abundant',
'c' = 'clustered',
'n' = 'numerous',
's' = 'scattered',
'v' = 'several',
'y' = 'solitary')}
# Transforms the habitat column
habitat_trans <- function(key){
switch(key,
'g' = 'grasses',
'l' = 'leaves',
'm' = 'meadows',
'p' = 'paths',
'u' = 'urban',
'w' = 'waste',
'd' = 'woods')}
# Applying the data transformations to the mushroom dataset
# (the columns are factors, so convert them to character before applying the switch-based helpers)
mushrooms.hex1$class <- sapply(as.character(mushrooms.hex1$class), class_trans)
mushrooms.hex1$`spore.print.color` <- sapply(as.character(mushrooms.hex1$`spore.print.color`), spore.print.color_trans)
mushrooms.hex1$`gill.color` <- sapply(as.character(mushrooms.hex1$`gill.color`), gill.color_trans)
mushrooms.hex1$`stalk.surface.above.ring` <- sapply(as.character(mushrooms.hex1$`stalk.surface.above.ring`), stalk.surface.above.ring_trans)
mushrooms.hex1$`gill.size` <- sapply(as.character(mushrooms.hex1$`gill.size`), gill.size_trans)
mushrooms.hex1$`stalk.root` <- sapply(as.character(mushrooms.hex1$`stalk.root`), stalk.root_trans)
mushrooms.hex1$`ring.type` <- sapply(as.character(mushrooms.hex1$`ring.type`), ring.type_trans)
mushrooms.hex1$`odor` <- sapply(as.character(mushrooms.hex1$`odor`), odor_trans)
mushrooms.hex1$`population` <- sapply(as.character(mushrooms.hex1$`population`), population_trans)
mushrooms.hex1$`habitat` <- sapply(as.character(mushrooms.hex1$`habitat`), habitat_trans)
head(mushrooms.hex1)
## class odor gill.size gill.color stalk.root
## 1 edible fishy brown brown club
## 2 poisonous almond black brown bulbous
## 3 poisonous anise black buff bulbous
## 4 edible fishy brown buff club
## 5 poisonous creosote black brown club
## 6 poisonous almond black buff bulbous
## stalk.surface.above.ring ring.type spore.print.color population habitat
## 1 fibrous evanescent black numerous meadows
## 2 fibrous evanescent brown clustered grasses
## 3 fibrous evanescent brown clustered leaves
## 4 fibrous evanescent black numerous meadows
## 5 fibrous cobwebby brown abundant grasses
## 6 fibrous evanescent black clustered grasses
mushroom_features <- lapply(seq(from=2, to=ncol(mushrooms.hex1)),
function(x) {table(mushrooms.hex1$class, mushrooms.hex1[,x])})
names(mushroom_features) <- colnames(mushrooms.hex1)[2:ncol(mushrooms.hex1)]
for(i in 1:length(mushroom_features)) {
print("Deep Look at the Features")
print(names(mushroom_features)[i])
print(mushroom_features[[i]])
}
## [1] "Deep Look at the Features"
## [1] "odor"
##
## almond anise creosote fishy
## edible 0 0 0 3
## poisonous 4 2 1 0
## [1] "Deep Look at the Features"
## [1] "gill.size"
##
## black brown
## edible 0 3
## poisonous 7 0
## [1] "Deep Look at the Features"
## [1] "gill.color"
##
## black brown buff chocolate
## edible 0 1 1 1
## poisonous 2 2 3 0
## [1] "Deep Look at the Features"
## [1] "stalk.root"
##
## bulbous club
## edible 0 3
## poisonous 6 1
## [1] "Deep Look at the Features"
## [1] "stalk.surface.above.ring"
##
## fibrous
## edible 3
## poisonous 7
## [1] "Deep Look at the Features"
## [1] "ring.type"
##
## cobwebby evanescent
## edible 0 3
## poisonous 1 6
## [1] "Deep Look at the Features"
## [1] "spore.print.color"
##
## black brown
## edible 3 0
## poisonous 3 4
## [1] "Deep Look at the Features"
## [1] "population"
##
## abundant clustered numerous scattered
## edible 0 0 2 1
## poisonous 1 4 2 0
## [1] "Deep Look at the Features"
## [1] "habitat"
##
## grasses leaves meadows
## edible 1 0 2
## poisonous 3 4 0
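To make the raw counts easier to compare across levels, each table can also be expressed as within-level proportions; a sketch (output omitted):

# Sketch: share of edible vs. poisonous mushrooms within each feature level
lapply(mushroom_features, function(tab) round(prop.table(tab, margin = 2), 2))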
# Shut down the H2O instance
h2o.shutdown(prompt=FALSE)