Data607 Week1 Assignment

Introduction: Mushroom Dataset

This report analyzes the Mushroom Data set and provides the data transformation logic to convert it to a usable format for data analysis based on 4 selected variables.

The Mushroom Data Set is provided in a flat file and data dictionary on the University of California Irvine Machine Learning Repository.

The required tasks are to:

1. Explore the data set and its description in the data dictionary.
2. Provide relevant column names.
3. Create an R data frame with a subset of its columns.

Explore the Data Set and Its Description

We first access the raw data set and use str() to assess the data frame's gross characteristics. There are two versions of the data file where V1 contains the original examples and V2 contains discretized numeric properties.

Unfortunately, the UCI data file was unavailable at the time of this writing, although it had been working previously. The URL below is included as URLUCI for reference purposes but is not currently active. The Kaggle file was in a different format (including the headers) but otherwise similar. However, I was unable to get the Kaggle URL working either, so the code below reads a local copy of the UCI file.

URLKaggle = "https://www.kaggle.com/uciml/mushroom-classification#mushrooms.csv"
URLUCI = "https://archive.ics.uci.edu/ml/datasets/Mushroom/agaricus-lepiota.data.txt"
URL="./agaricus-lepiota.data.txt"

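# stringsAsFactors=TRUE keeps the columns as factors (no longer the default as of R 4.0)
mushroomRaw = read.csv(URL , header=FALSE, stringsAsFactors=TRUE)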
str(mushroomRaw)

## 'data.frame':    8124 obs. of  23 variables:
##  $ V1 : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
##  $ V2 : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
##  $ V3 : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
##  $ V4 : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
##  $ V5 : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
##  $ V6 : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
##  $ V7 : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
##  $ V8 : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
##  $ V9 : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
##  $ V10: Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
##  $ V11: Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
##  $ V12: Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
##  $ V13: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ V14: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ V15: Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ V16: Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ V17: Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
##  $ V18: Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ V19: Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ V20: Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ V21: Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
##  $ V22: Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ V23: Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...

I use the output of str() for basic data validation. I confirm consistency between the values in the data file and the permitted values in the data dictionary for each attribute. By manually comparing the observed levels of each of the 23 factors in str() to the data dictionary file, I know the data frame has no unexplained values for any data attribute. For example, column V3 has 4 levels in the data frame (‘f’, ‘g’, ‘s’, ‘y’). These correspond exactly to the permitted values for attribute 2: cap-surface. The data value mapping is fibrous=f, grooves=g, scaly=y, smooth=s.
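
The manual comparison described above could also be scripted. The following is a minimal sketch for a single attribute, assuming the permitted codes are typed in by hand from the data dictionary; `permittedCapSurface`, `capSurface`, and `unexplained` are illustrative names, and the toy factor stands in for the real column.

```r
# Permitted codes for cap-surface, copied by hand from the data dictionary
permittedCapSurface <- c("f", "g", "y", "s")

# Toy column standing in for the cap-surface column (hypothetical values)
capSurface <- factor(c("s", "s", "y", "f", "g", "s"))

# Codes present in the data but absent from the dictionary;
# an empty result means every observed level is explained
unexplained <- setdiff(levels(capSurface), permittedCapSurface)
length(unexplained) == 0
```

Repeating the same check for each of the 23 columns would replace the visual inspection with a reproducible test.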

Renaming Data Columns

We use the attribute names from the data dictionary as the new column names. The data dictionary provides a good description alongside each attribute name. Keeping the data frame column names consistent with the data dictionary will allow easy mapping in the future.

rawColumns=c("class", "cap-shape", "cap-surface", "cap-color", "bruises",
             "odor", "gill-attachment", "gill-spacing", "gill-size",
             "gill-color", "stalk-shape", "stalk-root",
             "stalk-surface-above-ring",
             "stalk-surface-below-ring",
             "stalk-color-above-ring",
             "stalk-color-below-ring",
             "veil-type", "veil-color", "ring-number", "ring-type",
             "spore-print-color", "population", "habitat" )


colnames(mushroomRaw) <- rawColumns
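
One consequence of keeping the hyphenated dictionary names is that they are not syntactic R names, so they must be quoted when used. The sketch below illustrates this on a toy data frame (the data values and the names `toy`, `viaDollar`, `viaBracket`, and `syntactic` are made up for illustration).

```r
# Toy data frame with hyphenated names, as in the renamed mushroomRaw
toy <- data.frame(a = c("s", "y"), b = c("n", "k"))
colnames(toy) <- c("cap-surface", "spore-print-color")

# Hyphenated names need backticks with `$` ...
viaDollar <- toy$`cap-surface`
# ... or a quoted string with [[ ]]
viaBracket <- toy[["cap-surface"]]

# make.names() shows the syntactic alternative, with dots replacing hyphens
syntactic <- make.names(colnames(toy))
```

Either quoting style works; the trade-off is between exact agreement with the data dictionary and names that can be typed without quoting.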

I now validate the renamed data set by examining sample observations (using head()), the dimensions and observed levels of the factors (using str()), and summary statistics (using summary()).

head(mushroomRaw)

##   class cap-shape cap-surface cap-color bruises odor gill-attachment
## 1     p         x           s         n       t    p               f
## 2     e         x           s         y       t    a               f
## 3     e         b           s         w       t    l               f
## 4     p         x           y         w       t    p               f
## 5     e         x           s         g       f    n               f
## 6     e         x           y         y       t    a               f
##   gill-spacing gill-size gill-color stalk-shape stalk-root
## 1            c         n          k           e          e
## 2            c         b          k           e          c
## 3            c         b          n           e          c
## 4            c         n          n           e          e
## 5            w         b          k           t          e
## 6            c         b          n           e          c
##   stalk-surface-above-ring stalk-surface-below-ring stalk-color-above-ring
## 1                        s                        s                      w
## 2                        s                        s                      w
## 3                        s                        s                      w
## 4                        s                        s                      w
## 5                        s                        s                      w
## 6                        s                        s                      w
##   stalk-color-below-ring veil-type veil-color ring-number ring-type
## 1                      w         p          w           o         p
## 2                      w         p          w           o         p
## 3                      w         p          w           o         p
## 4                      w         p          w           o         p
## 5                      w         p          w           o         e
## 6                      w         p          w           o         p
##   spore-print-color population habitat
## 1                 k          s       u
## 2                 n          n       g
## 3                 n          n       m
## 4                 k          s       u
## 5                 n          a       g
## 6                 k          n       g

str(mushroomRaw)

## 'data.frame':    8124 obs. of  23 variables:
##  $ class                   : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap-shape               : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
##  $ cap-surface             : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
##  $ cap-color               : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
##  $ gill-attachment         : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill-spacing            : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill-size               : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill-color              : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
##  $ stalk-shape             : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk-root              : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
##  $ stalk-surface-above-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk-surface-below-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk-color-above-ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk-color-below-ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil-type               : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
##  $ veil-color              : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring-number             : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring-type               : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore-print-color       : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
##  $ population              : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...

summary(mushroomRaw)

##  class    cap-shape cap-surface   cap-color    bruises       odor     
##  e:4208   b: 452    f:2320      n      :2284   f:4748   n      :3528  
##  p:3916   c:   4    g:   4      g      :1840   t:3376   f      :2160  
##           f:3152    s:2556      e      :1500            s      : 576  
##           k: 828    y:3244      y      :1072            y      : 576  
##           s:  32                w      :1040            a      : 400  
##           x:3656                b      : 168            l      : 400  
##                                 (Other): 220            (Other): 484  
##  gill-attachment gill-spacing gill-size   gill-color   stalk-shape
##  a: 210          c:6812       b:5612    b      :1728   e:3516     
##  f:7914          w:1312       n:2512    p      :1492   t:4608     
##                                         w      :1202              
##                                         n      :1048              
##                                         g      : 752              
##                                         h      : 732              
##                                         (Other):1170              
##  stalk-root stalk-surface-above-ring stalk-surface-below-ring
##  ?:2480     f: 552                   f: 600                  
##  b:3776     k:2372                   k:2304                  
##  c: 556     s:5176                   s:4936                  
##  e:1120     y:  24                   y: 284                  
##  r: 192                                                      
##                                                              
##                                                              
##  stalk-color-above-ring stalk-color-below-ring veil-type veil-color
##  w      :4464           w      :4384           p:8124    n:  96    
##  p      :1872           p      :1872                     o:  96    
##  g      : 576           g      : 576                     w:7924    
##  n      : 448           n      : 512                     y:   8    
##  b      : 432           b      : 432                               
##  o      : 192           o      : 192                               
##  (Other): 140           (Other): 156                               
##  ring-number ring-type spore-print-color population habitat 
##  n:  36      e:2776    w      :2388      a: 384     d:3148  
##  o:7488      f:  48    n      :1968      c: 340     g:2148  
##  t: 600      l:1296    k      :1872      n: 400     l: 832  
##              n:  36    h      :1632      s:1248     m: 292  
##              p:3968    r      :  72      v:4040     p:1144  
##                        b      :  48      y:1712     u: 368  
##                        (Other): 144                 w: 192

I conclude that the data frame and intermediate data transformations are valid.
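
One detail in the summary above is worth noting: stalk-root has 2480 observations coded `?`, which the data dictionary uses to denote a missing value. This report keeps `?` as an ordinary factor level. If one preferred to treat it as NA instead, a possible sketch, shown on a made-up three-row sample rather than the real file, is:

```r
# Toy stand-in for a few rows of a headerless file (not real records)
csvText <- "p,x,e\ne,b,?\ne,x,c"

# header = FALSE yields the default names V1, V2, ...;
# na.strings = "?" converts the placeholder to NA;
# stringsAsFactors = TRUE is required for factor columns in R >= 4.0
toy <- read.csv(text = csvText, header = FALSE,
                na.strings = "?", stringsAsFactors = TRUE)

naCount <- sum(is.na(toy$V3))   # number of missing values in column V3
toyLevels <- levels(toy$V3)     # "?" is no longer among the levels
```

None of the four attributes selected below is stalk-root, so this choice does not affect the rest of the analysis.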

Subsetting the Raw Data

According to the Wlodzislaw Duch email of Feb 17, 1997, quoted in the data dictionary file, highly accurate deterministic rules for identifying poisonous and edible mushrooms are known.

Rules P_2 and P_3 are used to identify poisonous mushrooms and use the attributes spore-print-color, odor, stalk-surface-below-ring, and stalk-color-above-ring.

I select these four attributes, together with the class column, from the data frame.

subsetRaw = subset(mushroomRaw, , 
                            select = c("class", "odor", "spore-print-color",
                                       "stalk-surface-below-ring",
                                       "stalk-color-above-ring") )
str(subsetRaw)

## 'data.frame':    8124 obs. of  5 variables:
##  $ class                   : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
##  $ odor                    : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
##  $ spore-print-color       : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
##  $ stalk-surface-below-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk-color-above-ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...

Transforming Factor Values to Readable Text

The R Cookbook provides a recipe to transform factor levels using the mapvalues() function from the plyr package.

library(plyr)  # provides mapvalues()

classT = factor( mapvalues(subsetRaw$class, 
                           from=c("e","p") , to=c("edible", "poisonous")) )

odorT = factor(mapvalues(subsetRaw$odor,
                         from=c("a", "l", "c", "y", "f", "m", "n", "p", "s"),
                         to=c("almond", "anise", "creosote", "fishy", "foul",
                              "musty", "none", "pungent", "spicy")))

sporePrintColorT = factor(mapvalues(subsetRaw$`spore-print-color`,
                                    from=c('k', 'n', 'b', 'h', 'r', 'o', 'u', 'w', 'y'  ),
                                    to=c("black" , "brown", "buff" , "chocolate" ,
                                         "green" , "orange" , "purple" , 
                                         "white", "yellow") ))
                                    
stalkSurfaceBelowRingT = factor(mapvalues(subsetRaw$`stalk-surface-below-ring`,
                                          from = c( "f", "y", "k", "s"),
                                          to = c("fibrous", "scaly", "silky", "smooth" )))


stalkColorAboveRingT = factor(mapvalues(subsetRaw$`stalk-color-above-ring`,
                                        from = c('n', 'b', 'c', 'g', 'o', 'p', 'e', 'w', 'y' ) ,
                                        to = c('brown', 'buff', 'cinnamon', 'gray',
                                               'orange', 'pink' , 'red', 'white', 'yellow')))
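
For readers who prefer not to depend on plyr, the same kind of recoding can be sketched in base R with a named lookup vector. This is an alternative illustration, not the method used in this report; `stalkSurfaceLabels`, `codes`, and `recoded` are hypothetical names, and the toy factor stands in for the real column.

```r
# Lookup table: names are the single-letter codes, values are the labels
stalkSurfaceLabels <- c(f = "fibrous", y = "scaly", k = "silky", s = "smooth")

# Toy factor standing in for the stalk-surface-below-ring column
codes <- factor(c("s", "s", "k", "f", "y"))

# Rename the levels in place by indexing the lookup with the current levels;
# unname() drops the code names so only the labels become levels
recoded <- codes
levels(recoded) <- unname(stalkSurfaceLabels[levels(recoded)])
```

Because the lookup is indexed by the existing levels, the order in which the codes are listed in the lookup vector does not matter.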

I now verify that the transformed factors have the same length as the original columns of the raw dataframe.

# Number of observations in the raw subset data frame
N = dim(subsetRaw)[1]

# Must match the number of observations of each transformed factor vector
(length(classT)  == N )

## [1] TRUE

(length(stalkSurfaceBelowRingT) == N )

## [1] TRUE

(length(stalkColorAboveRingT) == N)

## [1] TRUE

(length(sporePrintColorT) == N)

## [1] TRUE

(length(odorT) == N)

## [1] TRUE

All the validations above return TRUE, so I conclude that we can stitch together all the columns into a data frame.

mushroomClean = data.frame( poisonous = classT, 
                             stalkSurfaceBelowRing = stalkSurfaceBelowRingT, 
                             stalkColorAboveRing = stalkColorAboveRingT ,
                             sporePrintColor = sporePrintColorT ,
                             odor = odorT )

str(mushroomClean)

## 'data.frame':    8124 obs. of  5 variables:
##  $ poisonous            : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
##  $ stalkSurfaceBelowRing: Factor w/ 4 levels "fibrous","silky",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalkColorAboveRing  : Factor w/ 9 levels "buff","cinnamon",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ sporePrintColor      : Factor w/ 9 levels "buff","chocolate",..: 3 4 4 3 4 3 3 4 3 3 ...
##  $ odor                 : Factor w/ 9 levels "almond","creosote",..: 7 1 4 7 6 1 1 4 7 1 ...
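
Matching lengths show that no observations were lost, but not that every code was mapped to the intended label. A stricter check, sketched here on toy data with illustrative names (`rawClass`, `cleanClass`, `crossTab`, `oneToOne`), is to cross-tabulate the raw codes against the recoded values.

```r
# Toy raw codes and a recoded copy (stand-ins for the raw and transformed class)
rawClass <- factor(c("p", "e", "e", "p", "e"))
cleanClass <- rawClass
levels(cleanClass) <- c("edible", "poisonous")   # levels are "e", "p" in that order

# Cross-tabulate raw against recoded values; a one-to-one recode leaves
# exactly one non-zero count in every row and every column
crossTab <- table(rawClass, cleanClass)
oneToOne <- all(rowSums(crossTab > 0) == 1) && all(colSums(crossTab > 0) == 1)
```

Reading the non-zero cells of the table also confirms which label each code received.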

Evaluating Accuracy of Deterministic Rules

# These mushrooms are predicted to be edible
edibleTestSet = subset(mushroomClean, ( (odor=="almond") | (odor=="anise") | (odor =="none") ) &
                       (sporePrintColor != "green") )

falseEdibles = edibleTestSet[edibleTestSet$poisonous=="poisonous",]

# These mushrooms are predicted to be poisonous
inedibleTestSet = subset(mushroomClean,  ! ( ( (odor=="almond") | (odor=="anise") | (odor =="none") ) & 
                                               (sporePrintColor != "green") ) )

falsePoisonous = inedibleTestSet[inedibleTestSet$poisonous=="edible",]

numFalselyEdible = dim(falseEdibles)[1]
numFalselyPoisonous = dim(falsePoisonous)[1]

accuracy = (dim(mushroomClean)[1] - numFalselyEdible - numFalselyPoisonous) / (dim(mushroomClean)[1])

We conclude that the attributes are useful: the rule produced 48 falsely edible predictions and 0 falsely poisonous predictions. This confirms that the accuracy of the rule is high, at 99.4091581 percent.
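
The same evaluation can be expressed more compactly as a confusion matrix. The sketch below applies the odor and spore-print-color rule to a five-row toy data frame (invented rows, using the same column names and labels as mushroomClean), and then reproduces the reported percentage from the counts given above.

```r
# Toy data frame with the same column names and labels as mushroomClean
toy <- data.frame(
  poisonous = c("edible", "poisonous", "edible", "poisonous", "poisonous"),
  odor = c("none", "foul", "almond", "none", "none"),
  sporePrintColor = c("brown", "chocolate", "black", "green", "white")
)

# Predict edible when the odor is almond, anise, or none
# AND the spore print color is not green
predictedEdible <- toy$odor %in% c("almond", "anise", "none") &
  toy$sporePrintColor != "green"
predicted <- ifelse(predictedEdible, "edible", "poisonous")

# Rows are the true class, columns the predicted class
confusion <- table(actual = toy$poisonous, predicted = predicted)
toyAccuracy <- sum(diag(confusion)) / sum(confusion)

# The reported figure follows from 8124 observations, 48 falsely edible
# predictions, and 0 falsely poisonous predictions
reportedAccuracy <- 100 * (8124 - 48 - 0) / 8124
```

In the toy data, one poisonous mushroom with no odor and a white spore print is predicted edible, so the toy accuracy is 4 out of 5. The off-diagonal cells of the table correspond to the falsely edible and falsely poisonous counts computed above.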