Load and transform data

Load the data from the UCI URL. The file has no header row, so set header = FALSE; otherwise the first observation would be read as column names:

mushroomsOriginal <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"),header = FALSE,stringsAsFactors = TRUE)
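
To see why header = FALSE matters, here is a minimal sketch on a three-line inline string. The rows are made up in the same comma-separated style (they are not taken from the real file), so the example runs without a network connection; `raw`, `withHeader` and `noHeader` are names chosen for illustration:

```r
# Three made-up rows in the same headerless, comma-separated style as the data file.
raw <- "p,x,n\ne,x,b\ne,b,b"

# With header = TRUE the first observation is consumed as column names.
withHeader <- read.csv(text = raw, header = TRUE, stringsAsFactors = TRUE)

# With header = FALSE every line is kept as data and columns are named V1, V2, ...
noHeader <- read.csv(text = raw, header = FALSE, stringsAsFactors = TRUE)

print(nrow(withHeader))  # 2
print(nrow(noHeader))    # 3
print(names(noHeader))   # "V1" "V2" "V3"
```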

Create a data frame with only the columns of interest (class, cap shape, gill size, veil color):

mushroomsNew <- mushroomsOriginal[, c(1, 2, 9, 18)]

Rename columns to readable names:

colnames(mushroomsNew) <- c("class","capShape","gillSize","veilColor")

Load stringr for string matching (required for the next step), then rename the factor levels. Method #1, for factors with only two levels, as demonstrated in the ADCR book: easy, but ifelse() returns a character vector, so the column has to be converted back to a factor afterwards:

library(stringr)
mushroomsNew$class <- ifelse(str_detect(mushroomsNew$class, "e"), "edible", "poisonous")
mushroomsNew$class <- as.factor(mushroomsNew$class)
mushroomsNew$gillSize <- ifelse(str_detect(mushroomsNew$gillSize, "b"), "broad", "narrow")
mushroomsNew$gillSize <- as.factor(mushroomsNew$gillSize)

Rename factor levels, method #2: works, but very repetitive:

levels(mushroomsNew$capShape)[levels(mushroomsNew$capShape) == "b"] <- "bell"
levels(mushroomsNew$capShape)[levels(mushroomsNew$capShape) == "c"] <- "conical"
levels(mushroomsNew$capShape)[levels(mushroomsNew$capShape) == "x"] <- "convex"
levels(mushroomsNew$capShape)[levels(mushroomsNew$capShape) == "f"] <- "flat"
levels(mushroomsNew$capShape)[levels(mushroomsNew$capShape) == "k"] <- "knobbed"
levels(mushroomsNew$capShape)[levels(mushroomsNew$capShape) == "s"] <- "sunken"

Rename factor levels, method #3: I find this the most efficient:

levels(mushroomsNew$veilColor) <- list(brown="n",orange="o",white="w",yellow="y")
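
A fourth option, not used in the steps above, is to recode through factor() with a lookup vector. The sketch below runs on a toy factor rather than the real column; `codes`, `capShapeLabels` and `readable` are names made up for illustration:

```r
# Toy factor standing in for a column of single-letter codes (made-up values).
codes <- factor(c("x", "b", "x", "f", "s"))

# Hypothetical lookup: names are the original codes, values are the readable labels.
capShapeLabels <- c(b = "bell", c = "conical", x = "convex",
                    f = "flat", k = "knobbed", s = "sunken")

# factor() matches each value against `levels` and relabels it with `labels`.
# Codes that do not occur in the data (here "c" and "k") still become levels.
readable <- factor(codes,
                   levels = names(capShapeLabels),
                   labels = unname(capShapeLabels))

print(levels(readable))
print(as.character(readable))
```

Unlike method #2, the whole mapping lives in one vector, and the level order follows the order of the lookup rather than the alphabetical order of the codes.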

Display the structure and the first three rows to check that the steps above worked:

str(mushroomsNew)
## 'data.frame':    8124 obs. of  4 variables:
##  $ class    : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
##  $ capShape : Factor w/ 6 levels "bell","conical",..: 6 6 1 6 6 6 1 1 6 1 ...
##  $ gillSize : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
##  $ veilColor: Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
mushroomsNew[1:3,]
##       class capShape gillSize veilColor
## 1 poisonous   convex   narrow     white
## 2    edible   convex    broad     white
## 3    edible     bell    broad     white

Examine data, look for insights

Now that the data are organized and readable, I want to see how edible mushrooms compare to poisonous ones on my selected attributes:

p <- subset(mushroomsNew,class == "poisonous")
poisonous <- subset(p,select = c(2,3,4))
e <- subset(mushroomsNew, class == "edible")
edible <- subset(e, select = c(2,3,4))
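
As an alternative sketch, the same per-class data frames can be produced in one call with split(). The example uses a toy stand-in with made-up rows so that it runs on its own; `toy` and `byClass` are illustrative names:

```r
# Toy stand-in for mushroomsNew (made-up rows, same style of columns).
toy <- data.frame(
  class    = factor(c("poisonous", "edible", "edible", "poisonous")),
  capShape = factor(c("convex", "convex", "bell", "flat")),
  gillSize = factor(c("narrow", "broad", "broad", "narrow"))
)

# Drop the class column, then split the remaining columns by class.
# The result is a named list with one data frame per factor level.
byClass <- split(toy[, names(toy) != "class"], toy$class)

print(names(byClass))
print(byClass$poisonous)
```

Selecting columns by name rather than by position also keeps the code working if the column order changes.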

Looking at the first few rows to see if it worked…

head(poisonous)
##    capShape gillSize veilColor
## 1    convex   narrow     white
## 4    convex   narrow     white
## 9    convex   narrow     white
## 14   convex   narrow     white
## 18   convex   narrow     white
## 19   convex   narrow     white
head(edible)
##   capShape gillSize veilColor
## 2   convex    broad     white
## 3     bell    broad     white
## 5   convex    broad     white
## 6   convex    broad     white
## 7     bell    broad     white
## 8     bell    broad     white

…I notice that the first six poisonous mushrooms all have narrow gills and the first six edible mushrooms all have broad gills, which makes me wonder whether there is a relationship between class and gill size.
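
Six rows per class are too few to judge, so one way to check the hunch on every row is a cross-tabulation with row proportions. This is a minimal sketch on a toy data frame with made-up values; replacing `toy` with mushroomsNew would run it on the actual data. The names `toy`, `counts` and `shares` are illustrative:

```r
# Toy data (made-up rows) with the same two columns of interest.
toy <- data.frame(
  class    = factor(c("poisonous", "poisonous", "poisonous", "edible",
                      "edible", "edible", "edible", "poisonous")),
  gillSize = factor(c("narrow", "narrow", "broad", "broad",
                      "broad", "narrow", "broad", "broad"))
)

# Counts of each class / gill size combination.
counts <- table(toy$class, toy$gillSize)

# Share of each gill size within a class (margin = 1 makes every row sum to 1).
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

The row proportions answer the question "of all mushrooms in this class, what fraction has broad gills?", which is the comparison the first six rows only hint at.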

Making a plot to get a better visual:

mosaicplot(table(mushroomsNew$class,mushroomsNew$gillSize),main = "Edible vs. Poisonous", ylab = "Gill Size", las = 1,color = 1:2)

My conclusion: although it’s hard to say anything definitive about poisonous mushrooms, most edible mushrooms in the sample seem to have broad gills.