Load Dataset & view structure

library(RCurl)
## Loading required package: bitops
library(plyr)
path <- getURL("https://raw.githubusercontent.com/jreznyc/datasets/master/Mushrooms/mushrooms.csv")
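# stringsAsFactors=TRUE keeps the columns as factors (the default is FALSE since R 4.0)
df <- read.csv(text=path, header=TRUE, stringsAsFactors=TRUE)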
str(df)
## 'data.frame':    8124 obs. of  23 variables:
##  $ class                   : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap.shape               : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
##  $ cap.surface             : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
##  $ cap.color               : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
##  $ gill.attachment         : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill.spacing            : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill.size               : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill.color              : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
##  $ stalk.shape             : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk.root              : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
##  $ stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk.color.above.ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk.color.below.ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil.type               : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
##  $ veil.color              : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring.number             : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring.type               : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore.print.color       : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
##  $ population              : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...

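The data set contains 8,124 observations of 23 variables, all of them factors coded as single letters. The class variable records whether each mushroom is edible (e) or poisonous (p), which is the distinction the rest of this exploration focuses on.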

Exploration

Let’s use some basic visualizations to see how individual variables relate to whether a mushroom is edible or poisonous.

Overall Edible/Poisonous proportion

df$class <- revalue(df$class, c("e"="edible", "p"="poisonous"))
barplot(prop.table(table(df$class)), main="Proportion of Edible/Poisonous Mushrooms", col=c('blue','red'))

Cap shape

Below, I generate a side-by-side bar plot from a proportional contingency table. The vast majority of mushrooms, both poisonous and edible, have either flat or convex caps. Knobbed caps, however, make up a slightly larger share of the poisonous mushrooms than of the edible ones. Let’s continue looking at the other variables…

df$cap.shape <- revalue(df$cap.shape,c("b"="bell","c"="conical","x"="convex","f"="flat","k"="knobbed","s"="sunken"))
barplot(prop.table(table(df$class, df$cap.shape)), beside=TRUE, main="Cap Shape Proportions", legend=TRUE, col=c('blue','red'), args.legend=list(x='topleft'))
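
A note on the proportional contingency table: called without a margin argument, prop.table() divides every cell by the grand total, so each bar is a share of all mushrooms. To compare cap shapes within each class instead, margin = 1 normalizes each row to sum to 1. The sketch below illustrates the difference on a small made-up table; the toy data frame and its counts are hypothetical and not taken from the mushroom data.

```r
# Hypothetical counts, for illustration only (not from the mushroom data)
toy <- data.frame(
  class = c(rep("edible", 6), rep("poisonous", 4)),
  cap   = c("flat", "flat", "convex", "convex", "convex", "knobbed",
            "flat", "convex", "knobbed", "knobbed")
)
counts <- table(toy$class, toy$cap)

overall  <- prop.table(counts)              # each cell divided by the grand total
by_class <- prop.table(counts, margin = 1)  # each cell divided by its row (class) total

print(round(overall, 2))
print(round(by_class, 2))
```

With margin = 1, the two classes can be compared directly even when one class has more observations than the other.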

Variable selection

A for loop can generate the same type of bar plot for each of the remaining variables, which gives a quick way to identify variables worth isolating for later analysis.
If the edible and poisonous bars differ starkly for any level of a variable, that variable is a good candidate.

Note: The plots are arranged in a grid to save space.

#how to loop through df columns: https://stackoverflow.com/questions/18462736/loop-through-columns-and-add-string-lengths-as-new-columns
#https://www.statmethods.net/advgraphs/layout.html

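# One bar plot per remaining variable (columns 3 to 23), laid out in a 6 x 4 grid
old_par <- par(mfrow=c(6,4), mar=c(1,1,1,1))
for(i in names(df)[3:23]) {
    barplot(prop.table(table(df$class, df[[i]])), beside=TRUE, main=i, col=c('blue','red'))
}
par(old_par)  # restore the previous graphics settings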
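
A rough numeric check could complement the visual scan. The sketch below is one possible approach, not the method used in this analysis: for each variable it computes the share of poisonous mushrooms within every level and reports the gap between the highest and lowest share. The helper poison_gap and the toy data frame are hypothetical names and values introduced only for illustration.

```r
# Hypothetical toy data standing in for df (values made up for illustration)
toy <- data.frame(
  class = c("edible", "edible", "edible", "poisonous", "poisonous", "poisonous"),
  odor  = c("none", "none", "almond", "foul", "foul", "none"),
  veil  = c("white", "white", "white", "white", "white", "white"),
  stringsAsFactors = TRUE
)

# poison_gap is an illustrative helper, not a library function:
# spread between the highest and lowest poisonous share across a variable's levels
poison_gap <- function(x, class) {
  share <- prop.table(table(x, class), margin = 1)[, "poisonous"]
  max(share, na.rm = TRUE) - min(share, na.rm = TRUE)
}

gaps <- sapply(toy[c("odor", "veil")], poison_gap, class = toy$class)
print(sort(gaps, decreasing = TRUE))
```

A gap near 1 means some level is almost entirely poisonous while another is almost entirely edible; a gap of 0, as for a single-level variable, means the variable carries no information about class.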

Variable Selection and Data Frame Transformations

After reviewing the plots above, the variables I’d choose for further consideration are odor, gill color, gill size, and spore print color. Now I’ll create a new data frame with those variables and replace their single-letter codes with descriptive labels.

df2 <- df[c('class', 'odor', 'gill.color', 'gill.size','spore.print.color')]
df2$odor <- revalue(df$odor, c("a"="almond","l"="anise","c"="creosote","y"="fishy","f"="foul","m"="musty","n"="none","p"="pungent","s"="spicy"))
df2$gill.color <- revalue(df$gill.color, c("k"="black","n"="brown","b"="buff","h"="chocolate","g"="gray", "r"="green","o"="orange","p"="pink","u"="purple","e"="red","w"="white","y"="yellow"))
df2$gill.size <- revalue(df$gill.size, c("b"="broad","n"="narrow"))
# Only the nine codes that occur in spore.print.color are mapped, so revalue() emits no warning
df2$spore.print.color <- revalue(df$spore.print.color, c("k"="black","n"="brown","b"="buff","h"="chocolate","r"="green","o"="orange","u"="purple","w"="white","y"="yellow"))
colnames(df2) <- c('Mushroom Class', 'Odor', 'Gill Color', 'Gill Size', 'Spore Print Color')
head(df2)
##   Mushroom Class    Odor Gill Color Gill Size Spore Print Color
## 1      poisonous pungent      black    narrow             black
## 2         edible  almond      black     broad             brown
## 3         edible   anise      brown     broad             brown
## 4      poisonous pungent      brown    narrow             black
## 5         edible    none      black     broad             brown
## 6         edible  almond      brown     broad             black
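
Because the new column names contain spaces, they can no longer be reached with a plain $name; they need [[ ]] or backticks. The sketch below shows both forms on a small made-up data frame that mimics the layout of df2 (the toy rows are hypothetical, not copied from the data).

```r
# Hypothetical rows mimicking the layout of df2 (values made up for illustration)
toy <- data.frame(
  class = c("poisonous", "edible", "edible", "poisonous"),
  odor  = c("pungent", "almond", "none", "foul")
)
colnames(toy) <- c("Mushroom Class", "Odor")

# Column names with spaces need [[ ]] or backticks
by_odor <- table(toy[["Odor"]], toy$`Mushroom Class`)
print(by_odor)
```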