Overview

Your task is to study the Mushrooms dataset and its associated description (the “data dictionary”). This well-known dataset is hosted in the UCI Machine Learning Repository. Your deliverable is the R code that performs the transformation tasks. A typical problem (which is beyond the scope of this assignment) is to answer the question, “Which other attribute or attributes are the best predictors of whether a particular mushroom is poisonous or edible?” The data can also be found in this GitHub repository: https://github.com/KatherineEvers/Mushroom-Data/blob/master/agaricus-lepiota.data

Data Tidying and Transformation

Import Mushrooms data and load library:

library(ggplot2)
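# stringsAsFactors=TRUE keeps every column a factor so that levels() can be relabeled below
# (read.csv() has defaulted to character columns since R 4.0.0)
mushroom <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"), header=FALSE, stringsAsFactors=TRUE)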
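
The relabeling step further down assigns to levels(), which only makes sense for factor columns. Since R 4.0.0, read.csv() returns character columns unless stringsAsFactors = TRUE is passed. A minimal sketch of the difference on a made-up three-row sample (the rows and the names sample_csv, as_char, and as_factor are illustrative and not taken from the file):

```r
# Illustrative inline sample: two columns of single-letter codes (not real rows)
sample_csv <- "p,x\ne,b\ne,x"

as_char   <- read.csv(text = sample_csv, header = FALSE, stringsAsFactors = FALSE)
as_factor <- read.csv(text = sample_csv, header = FALSE, stringsAsFactors = TRUE)

class(as_char$V1)     # character: there are no levels to rename
class(as_factor$V1)   # factor
levels(as_factor$V1)  # codes in alphabetical order: "e" "p"
```

Running class() on a column of the imported data frame is a quick way to confirm which case applies before relabeling.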

Add meaningful column names:

names(mushroom)=c("class", "capShape", "capSurface", "capColor", "bruises", "odor", "gillAttachment", "gillSpacing", "gillSize","gillColor", "stalkShape", "stalkRoot", "stalkSurfaceAboveRing", "stalkSurfaceBelowRing", "stalkColorAboveRing", "stalkColorBelowRing", "veilType", "veilColor", "ringNumber", "ringType", "sporePrintColor", "population", "habitat")

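Replace the single-letter abbreviations used in the data. Each label vector is listed in the alphabetical order of the codes, which is the order R assigns to factor levels: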

levels(mushroom$class) <- c("edible", "poisonous")
levels(mushroom$capShape) <- c("bell", "conical", "flat", "knobbed", "sunken", "convex")
levels(mushroom$capSurface) <- c("fibrous", "grooves", "smooth", "scaly")
levels(mushroom$capColor) <- c("buff", "cinnamon", "red", "gray", "brown", "pink", "green", "purple", "white", "yellow")
levels(mushroom$bruises) <- c("false", "true")
levels(mushroom$odor) <- c("almond", "creosote", "foul", "anise", "musty", "none", "pungent", "spicy", "fishy")
levels(mushroom$gillAttachment) <- c("attached", "free")
levels(mushroom$gillSpacing) <- c("close", "crowded")
levels(mushroom$gillSize) <- c("broad", "narrow")
levels(mushroom$gillColor) <- c("buff", "red", "gray", "chocolate", "black", "brown", "orange", "pink", "green", "purple", "white", "yellow")
levels(mushroom$stalkShape) <- c("enlarging","tapering")     
levels(mushroom$stalkRoot) <- c("missing", "bulbous", "club", "equal", "rooted")
levels(mushroom$stalkSurfaceAboveRing) <- c("fibrous", "silky", "smooth", "scaly")
levels(mushroom$stalkSurfaceBelowRing) <- c("fibrous", "silky", "smooth", "scaly")
levels(mushroom$stalkColorAboveRing) <- c("buff", "cinnamon", "red", "gray", "brown", "orange", "pink", "white", "yellow")
levels(mushroom$stalkColorBelowRing) <- c("buff", "cinnamon", "red", "gray", "brown", "orange", "pink", "white", "yellow")
levels(mushroom$veilType) <- c("partial")
levels(mushroom$veilColor) <- c("brown", "orange", "white", "yellow")
levels(mushroom$ringNumber) <- c("none", "one", "two")
levels(mushroom$ringType) <- c("evanescent", "flaring", "large", "none", "pendant")
levels(mushroom$sporePrintColor) <- c("buff", "chocolate", "black", "brown", "orange", "green", "purple", "white", "yellow")
levels(mushroom$population) <- c("abundant", "clustered", "numerous", "scattered", "several", "solitary")
levels(mushroom$habitat) <- c("woods", "grasses", "leaves", "meadows","paths", "urban", "waste")
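
The assignments above depend on R sorting the single-letter codes alphabetically, so each label vector has to follow code order rather than label order (for odor the codes sort as a, c, f, l, m, n, p, s, y). A sketch of an alternative that spells out each code-to-label pair, shown on a small made-up vector; the names odorCodes and raw and the sample values are assumptions for illustration:

```r
# Code-to-label pairs for odor, as given in the UCI data dictionary
odorCodes <- c(a = "almond", l = "anise", c = "creosote", y = "fishy",
               f = "foul", m = "musty", n = "none", p = "pungent", s = "spicy")

# Made-up sample of raw codes, standing in for the odor column
raw <- c("p", "a", "l", "n", "f")

# factor() pairs each level with its label explicitly, so the result
# does not depend on how the codes happen to sort
odor <- factor(raw, levels = names(odorCodes), labels = unname(odorCodes))
as.character(odor)
```

This form is longer, but a mislabeled level is easier to spot because every code sits next to its label.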

Investigation

Let’s investigate whether color is a good indicator of whether a mushroom is edible or poisonous. Create a data frame with a subset of the columns in the dataset:

mushroomSubset <- subset(mushroom, select=c(class, gillColor, stalkColorAboveRing, stalkColorBelowRing, sporePrintColor))

summary(mushroomSubset)     
##        class          gillColor    stalkColorAboveRing stalkColorBelowRing
##  edible   :4208   buff     :1728   white  :4464        white  :4384       
##  poisonous:3916   pink     :1492   pink   :1872        pink   :1872       
##                   white    :1202   gray   : 576        gray   : 576       
##                   brown    :1048   brown  : 448        brown  : 512       
##                   gray     : 752   buff   : 432        buff   : 432       
##                   chocolate: 732   orange : 192        orange : 192       
##                   (Other)  :1170   (Other): 140        (Other): 156       
##   sporePrintColor
##  white    :2388  
##  brown    :1968  
##  black    :1872  
##  chocolate:1632  
##  green    :  72  
##  buff     :  48  
##  (Other)  : 144

Let’s further separate the subset by class:

mushroomColorP <- subset(mushroom, class=="poisonous", select=c(gillColor, stalkColorAboveRing, stalkColorBelowRing, sporePrintColor))

summary(mushroomColorP)
##      gillColor    stalkColorAboveRing stalkColorBelowRing  sporePrintColor
##  buff     :1728   white   :1712       white   :1680       white    :1812  
##  pink     : 640   pink    :1296       pink    :1296       chocolate:1584  
##  chocolate: 528   buff    : 432       brown   : 448       black    : 224  
##  gray     : 504   brown   : 432       buff    : 432       brown    : 224  
##  white    : 246   cinnamon:  36       cinnamon:  36       green    :  72  
##  brown    : 112   yellow  :   8       yellow  :  24       buff     :   0  
##  (Other)  : 158   (Other) :   0       (Other) :   0       (Other)  :   0

mushroomColorE <- subset(mushroom, class=="edible", select=c(gillColor, stalkColorAboveRing, stalkColorBelowRing, sporePrintColor))

summary(mushroomColorE)
##    gillColor   stalkColorAboveRing stalkColorBelowRing  sporePrintColor
##  white  :956   white  :2752        white  :2704        brown    :1744  
##  brown  :936   gray   : 576        gray   : 576        black    :1648  
##  pink   :852   pink   : 576        pink   : 576        white    : 576  
##  purple :444   orange : 192        orange : 192        buff     :  48  
##  black  :344   red    :  96        red    :  96        chocolate:  48  
##  gray   :248   brown  :  16        brown  :  64        orange   :  48  
##  (Other):428   (Other):   0        (Other):   0        (Other)  :  96

Let’s visualize the subsets with stacked bar graphs:

plot1 <- ggplot(mushroom, aes(x=class,fill = sporePrintColor)) + 
  geom_bar(position='stack')+
  geom_text(stat='count', aes(label=after_stat(count)), position=position_stack(vjust=0.1)) +
  ggtitle("Edible vs Poisonous: Spore Print Color")

plot1

plot2 <- ggplot(mushroom, aes(x=class,fill = stalkColorBelowRing)) + 
  geom_bar(position='stack')+
  geom_text(stat='count', aes(label=after_stat(count)), position=position_stack(vjust=0.1)) +
  ggtitle("Edible vs Poisonous: Stalk Color Below Ring")

plot2

plot3 <- ggplot(mushroom, aes(x=class,fill = stalkColorAboveRing)) + 
  geom_bar(position='stack')+
  geom_text(stat='count', aes(label=after_stat(count)), position=position_stack(vjust=0.1)) +
  ggtitle("Edible vs Poisonous: Stalk Color Above Ring")

plot3

plot4 <- ggplot(mushroom, aes(x=class,fill = gillColor)) + 
  geom_bar(position='stack')+
  geom_text(stat='count', aes(label=after_stat(count)), position=position_stack(vjust=0.1)) +
  ggtitle("Edible vs Poisonous: Gill Color")

plot4

After comparing the four bar graphs, it appears that most edible mushrooms are white above and below the ring. They also tend to have white or brown gills and brown or black spore prints. Poisonous mushrooms, on the other hand, tend to be white or pink above and below the ring; buff is their most common gill color, and most have white or chocolate spore prints. However, these tendencies are not a rule. A few colors occur in only one class (every buff-gilled mushroom in the data is poisonous, for example), but the most common colors, such as white stalks, appear in both classes.
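
To put numbers on these tendencies, the counts printed by summary() above can be turned into a share of poisonous mushrooms per gill color. A rough sketch, limited to the six gill colors that are listed in both the full summary and the poisonous summary; the vector names gillTotal and gillPoisonous are illustrative:

```r
# Gill-color counts copied from the summary() output above
gillTotal     <- c(buff = 1728, pink = 1492, white = 1202,
                   brown = 1048, gray = 752, chocolate = 732)
gillPoisonous <- c(buff = 1728, pink = 640, white = 246,
                   brown = 112, gray = 504, chocolate = 528)

# Share of each gill color that is poisonous
gillShare <- round(gillPoisonous / gillTotal, 3)
gillShare
```

In these counts buff gills appear only in the poisonous class, an exception to the general overlap; the other five colors occur in both classes.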

Is there a better characteristic that determines whether a mushroom is edible or poisonous besides color?

Let’s investigate odor. Create another subset of the original mushroom data with only class and odor:

odorSubset <- subset(mushroom, select=c(class, odor))

summary(odorSubset)
##        class           odor     
##  edible   :4208   none   :3528  
##  poisonous:3916   foul   :2160  
##                   spicy  : 576  
##                   fishy  : 576  
##                   almond : 400  
##                   anise  : 400  
##                   (Other): 484

plot5 <- ggplot(mushroom, aes(x=class,fill = odor)) + 
  geom_bar(position='stack')+
  geom_text(stat='count', aes(label=after_stat(count)), position=position_stack(vjust=0.1)) +
  ggtitle("Edible vs Poisonous: Odor")

plot5

odor <- table(mushroom$class, mushroom$odor)

odor
##            
##             almond creosote foul anise musty none pungent spicy fishy
##   edible       400        0    0   400     0 3408       0     0     0
##   poisonous      0      192 2160     0    36  120     256   576   576

As seen in the table and bar graph, every mushroom with a creosote, foul, musty, pungent, spicy, or fishy odor is poisonous, and every mushroom with an almond or anise odor is edible. The only odor category in which the two classes overlap is “none”. However, only 120 mushrooms with no odor are poisonous, compared to 3408 edible mushrooms with no odor. Therefore, odor seems to be a better indicator of whether a mushroom is poisonous than color.
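
The overlap in the “none” column is easier to judge as a proportion. A minimal sketch that rebuilds the table from the printed counts and normalizes each odor column with prop.table(); the object names odorCounts and odorShare are assumptions:

```r
# Counts copied from the class-by-odor table above
odorCounts <- rbind(
  edible    = c(400,   0,    0, 400,  0, 3408,   0,   0,   0),
  poisonous = c(  0, 192, 2160,   0, 36,  120, 256, 576, 576)
)
colnames(odorCounts) <- c("almond", "creosote", "foul", "anise", "musty",
                          "none", "pungent", "spicy", "fishy")

# margin = 2 normalizes each column, giving the class split within each odor
odorShare <- prop.table(odorCounts, margin = 2)
round(odorShare, 3)
```

With the full data frame loaded, the same call can be applied directly to table(mushroom$class, mushroom$odor).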

Adding the counts of mushrooms with unpleasant odors (creosote, foul, musty, pungent, spicy, or fishy) and dividing that sum by the total number of poisonous mushrooms in the data set, we have:

(192+2160+36+256+576+576)/3916
## [1] 0.9693565

Doing the same for the mushrooms with pleasant odors (almond and anise), this time dividing by the total number of edible mushrooms, we have:

(400+400)/4208
## [1] 0.1901141

To calculate the chance that a mushroom with no odor is poisonous, we divide the number of poisonous odorless mushrooms by the number of all odorless mushrooms (120 + 3408 = 3528):

120/3528
## [1] 0.03401361

Thus, an unpleasant odor identifies about 96.94% of the poisonous mushrooms in the data set. However, a pleasant odor identifies only about 19.01% of the edible mushrooms. There is about a 3.40% chance that a mushroom with no odor is poisonous. Overall, odor seems to be a good starting place to determine whether a mushroom is poisonous.
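
These figures can be combined into a single number by treating odor as a simple rule: call a mushroom poisonous if it has one of the six unpleasant odors, and edible otherwise. A sketch of that calculation from the counts in the odor table; the rule and the variable names are illustrative and not a fitted model:

```r
# Counts from the class-by-odor table
poisonousUnpleasant <- 192 + 2160 + 36 + 256 + 576 + 576  # flagged correctly
poisonousNoOdor     <- 120                                # missed by the rule
edibleTotal         <- 4208                               # none has an unpleasant odor

total    <- poisonousUnpleasant + poisonousNoOdor + edibleTotal
accuracy <- (poisonousUnpleasant + edibleTotal) / total
round(accuracy, 4)
```

Under this rule the only errors are the 120 odorless poisonous mushrooms, so about 98.5% of the 8124 mushrooms in the data set are labeled correctly.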