library(RCurl)
## Loading required package: bitops
library(plyr)
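# getURL() returns the file contents as a single string, which read.csv() parses via text=
csv_text <- getURL("https://raw.githubusercontent.com/jreznyc/datasets/master/Mushrooms/mushrooms.csv")
# stringsAsFactors=TRUE keeps the columns as factors on R >= 4.0, where the default changed to FALSE
df <- read.csv(text=csv_text, header=TRUE, stringsAsFactors=TRUE)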
str(df)
## 'data.frame': 8124 obs. of 23 variables:
## $ class : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap.shape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ cap.surface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
## $ cap.color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
## $ bruises : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ gill.attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill.spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill.size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill.color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
## $ stalk.shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk.root : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
## $ stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk.color.above.ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ stalk.color.below.ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ veil.type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
## $ veil.color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ ring.number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring.type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ spore.print.color : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
## $ population : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
With the variables listed, let’s explore how they relate to whether a mushroom is edible or poisonous, starting with some basic visualizations.
df$class <- revalue(df$class, c("e"="edible", "p"="poisonous"))
barplot(prop.table(table(df$class)), main="Proportion of Edible/Poisonous Mushrooms", col=c('blue','red'))
Below, I generate a side-by-side barplot from a proportional contingency table, where each bar’s height is that group’s share of all mushrooms. We can see that the vast majority of mushrooms, both poisonous and edible, have either flat or convex caps. However, knobbed caps make up a slightly larger proportion of the poisonous mushrooms. Let’s continue looking at the other variables.
df$cap.shape <- revalue(df$cap.shape,c("b"="bell","c"="conical","x"="convex","f"="flat","k"="knobbed","s"="sunken"))
barplot(prop.table(table(df$class, df$cap.shape)), beside=TRUE, main="Cap Shape Proportions", legend=TRUE, col=c('blue','red'), args.legend=c(x='topleft'))
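The barplot above calls prop.table() without a margin, so every bar is a share of all mushrooms. To compare shapes within each class, one option is to normalize each row instead. The following is a minimal sketch on made-up data; the `toy` data frame and its values are hypothetical and only stand in for `df$class` and `df$cap.shape`.

```r
# Hypothetical toy data standing in for df$class and df$cap.shape
toy <- data.frame(
  class = c("edible", "edible", "edible", "edible", "poisonous", "poisonous"),
  cap.shape = c("convex", "convex", "flat", "bell", "convex", "knobbed")
)

counts <- table(toy$class, toy$cap.shape)

# Joint proportions: all cells together sum to 1
joint <- prop.table(counts)

# Row-wise proportions: each class (row) sums to 1, so the bars show
# the distribution of cap shapes within each class
within_class <- prop.table(counts, margin = 1)

barplot(within_class, beside = TRUE, legend = TRUE,
        main = "Cap Shape Proportions Within Each Class",
        col = c("blue", "red"), args.legend = list(x = "topleft"))
```

With row-wise proportions, the bar heights for the two classes are directly comparable even when one class has more observations than the other.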
We can draw the same type of bar graph for each remaining variable with a for loop. This gives a quick way to scan through them and identify variables to isolate for later analysis.
If, within a variable, any individual characteristic differs starkly between edible and poisonous mushrooms, that variable is potentially a good one to analyze later.
Note: The plots are arranged in a grid to save space.
#how to loop through df columns: https://stackoverflow.com/questions/18462736/loop-through-columns-and-add-string-lengths-as-new-columns
#https://www.statmethods.net/advgraphs/layout.html
par(mfrow=c(6,4), mar=c(1,1,1,1))
# columns 3 to 23 are the variables not yet plotted (class and cap.shape were shown above)
for(i in names(df)[3:23]) {
  barplot(prop.table(table(df$class, df[[i]])), beside=TRUE, main=i, legend=FALSE, col=c('blue','red'))
}
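As a rough numeric companion to eyeballing the grid, one could rank variables by the largest gap between the edible and poisonous within-class proportions. This is a sketch and not part of the original analysis: the `max_gap` helper and the `toy` data frame are hypothetical, and the measure itself is only one of several reasonable choices.

```r
# Hypothetical toy data; in the analysis above this role is played by df
toy <- data.frame(
  class = c("edible", "edible", "edible", "edible",
            "poisonous", "poisonous", "poisonous", "poisonous"),
  odor = c("none", "none", "almond", "none",
           "foul", "foul", "foul", "none"),
  veil.type = c("p", "p", "p", "p", "p", "p", "p", "p")
)

# Hypothetical helper: largest absolute gap between the two classes'
# within-class proportions across the levels of one variable
max_gap <- function(x, class) {
  p <- prop.table(table(class, x), margin = 1)
  max(abs(p[1, ] - p[2, ]))
}

predictors <- setdiff(names(toy), "class")
gaps <- sapply(predictors, function(v) max_gap(toy[[v]], toy$class))

# Variables with the starkest class differences come first
sort(gaps, decreasing = TRUE)
```

A variable with a single level, such as veil.type in the data above, scores 0 under this measure because it cannot separate the classes.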
After reviewing the graphs above, the variables I’d choose for further consideration are odor, gill color, gill size, and spore print color. Now I’ll create a new data frame with those variables and give their levels readable labels.
df2 <- df[c('class', 'odor', 'gill.color', 'gill.size','spore.print.color')]
df2$odor <- revalue(df$odor, c("a"="almond","l"="anise","c"="creosote","y"="fishy","f"="foul","m"="musty","n"="none","p"="pungent","s"="spicy"))
df2$gill.color <- revalue(df$gill.color, c("k"="black","n"="brown","b"="buff","h"="chocolate","g"="gray", "r"="green","o"="orange","p"="pink","u"="purple","e"="red","w"="white","y"="yellow"))
df2$gill.size <- revalue(df$gill.size, c("b"="broad","n"="narrow"))
df2$spore.print.color <- revalue(df$spore.print.color, c("k"="black","n"="brown","b"="buff","h"="chocolate","g"="gray", "r"="green","o"="orange","p"="pink","u"="purple","e"="red","w"="white","y"="yellow"))
## The following `from` values were not present in `x`: g, p, e
colnames(df2) <- c('Mushroom Class', 'Odor', 'Gill Color', 'Gill Size', 'Spore Print Color')
head(df2)
## Mushroom Class Odor Gill Color Gill Size Spore Print Color
## 1 poisonous pungent black narrow black
## 2 edible almond black broad brown
## 3 edible anise brown broad brown
## 4 poisonous pungent brown narrow black
## 5 edible none black broad brown
## 6 edible almond brown broad black
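The message from revalue() above appears because the shared color mapping includes codes (g, p, e) that never occur in spore.print.color. One way to reuse a single lookup without that message is to relabel only the levels that are present, using base R. This is a minimal sketch: the `relabel` helper and the `spore` toy vector are hypothetical, and the lookup is copied from the revalue() calls above.

```r
# Shared lookup of color codes, copied from the revalue() calls above
color_labels <- c(k = "black", n = "brown", b = "buff", h = "chocolate",
                  g = "gray", r = "green", o = "orange", p = "pink",
                  u = "purple", e = "red", w = "white", y = "yellow")

# Hypothetical helper: relabel only the levels actually present in x
relabel <- function(x, labels) {
  x <- factor(x)
  levels(x) <- unname(labels[levels(x)])
  x
}

# Toy values that, like spore.print.color, contain no g, p, or e codes
spore <- c("k", "n", "n", "w", "h")
relabel(spore, color_labels)
```

Note that a code missing from the lookup would become NA under this approach, so it is worth checking the result with table(..., useNA = "ifany") before relying on it.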