Load the data from the URL. Set header = FALSE so that the first row is read as data instead of being used as column names:
mushroomsOriginal <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"),header = FALSE,stringsAsFactors = TRUE)
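Create a data frame with only the columns of interest (class, cap shape, gill size and veil color). Subsetting a data frame already returns a data frame, so as.data.frame() and header = FALSE are not needed here:
mushroomsNew <- mushroomsOriginal[, c(1, 2, 9, 18)]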
Rename columns to readable names:
colnames(mushroomsNew) <- c("class","capShape","gillSize","veilColor")
Load stringr for the string operations in the next step. Rename factor levels, method #1: for variables with two possible levels, as demonstrated in the ADCR book. This is easy, but ifelse() returns a character vector, so it requires an extra step to convert the variable back to a factor:
library(stringr)
mushroomsNew$class <- ifelse(str_detect(mushroomsNew$class, "e"), "edible", "poisonous")
mushroomsNew$class <- as.factor(mushroomsNew$class)
mushroomsNew$gillSize <- ifelse(str_detect(mushroomsNew$gillSize, "b"), "broad", "narrow")
mushroomsNew$gillSize <- as.factor(mushroomsNew$gillSize)
Rename factor levels, method #2: reassign each level one at a time. This works, but it is very repetitive:
levels(mushroomsNew$capShape)[levels(mushroomsNew$capShape) == "b"] <- "bell"
levels(mushroomsNew$capShape)[levels(mushroomsNew$capShape) == "c"] <- "conical"
levels(mushroomsNew$capShape)[levels(mushroomsNew$capShape) == "x"] <- "convex"
levels(mushroomsNew$capShape)[levels(mushroomsNew$capShape) == "f"] <- "flat"
levels(mushroomsNew$capShape)[levels(mushroomsNew$capShape) == "k"] <- "knobbed"
levels(mushroomsNew$capShape)[levels(mushroomsNew$capShape) == "s"] <- "sunken"
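The six near-identical assignments above could probably be collapsed with a named lookup vector. This is only a minimal sketch on a toy factor, not on the full dataset; `toyCapShape` and `capLookup` are hypothetical names introduced for illustration.

```r
# Toy factor using the same single-letter codes as capShape (illustrative data only)
toyCapShape <- factor(c("x", "b", "x", "f", "s", "k", "c"))

# Hypothetical lookup: names are the old codes, values are the readable labels
capLookup <- c(b = "bell", c = "conical", x = "convex",
               f = "flat", k = "knobbed", s = "sunken")

# Index the lookup by the current levels so every level is renamed in one step
levels(toyCapShape) <- unname(capLookup[levels(toyCapShape)])

levels(toyCapShape)
as.character(toyCapShape)
```

Because the lookup is indexed by the existing levels, the order of entries in `capLookup` does not matter, and the factor keeps its original level order.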
Rename factor levels, method #3: assign a named list in which each name is the new level and each value is the old code. I find this the most efficient:
levels(mushroomsNew$veilColor) <- list(brown="n",orange="o",white="w",yellow="y")
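One thing to watch for with the list form: any old code that is left out of the list is turned into NA. The sketch below shows this on a toy factor; `toyVeil` is a made-up name and the values are not taken from the dataset.

```r
# Toy factor with four veil-color codes (illustrative data only)
toyVeil <- factor(c("w", "w", "n", "y", "o", "w"))

# new_name = "old_code"; "y" is deliberately left out to show the effect
levels(toyVeil) <- list(brown = "n", orange = "o", white = "w")

levels(toyVeil)        # only the levels named in the list remain
as.character(toyVeil)  # the value that was "y" is now NA
```

Listing every code that occurs in the data, as the veilColor assignment does with its four codes, avoids this.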
Display the structure and the first three rows to check that the steps above worked:
str(mushroomsNew)
## 'data.frame': 8124 obs. of 4 variables:
## $ class : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
## $ capShape : Factor w/ 6 levels "bell","conical",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ gillSize : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
## $ veilColor: Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
mushroomsNew[1:3,]
##       class capShape gillSize veilColor
## 1 poisonous   convex   narrow     white
## 2    edible   convex    broad     white
## 3    edible     bell    broad     white
Now that the data are organized and readable, I want to see how edible mushrooms compare to poisonous ones on my selected attributes. First split the data by class and keep only the three attribute columns:
p <- subset(mushroomsNew,class == "poisonous")
poisonous <- subset(p,select = c(2,3,4))
e <- subset(mushroomsNew, class == "edible")
edible <- subset(e, select = c(2,3,4))
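The two subset() pairs above could also be written as a single split() call, which returns a list with one data frame per class. This is a sketch on a small made-up data frame; `toyMushrooms`, `byClass` and the rows themselves are illustrative and not taken from the dataset.

```r
# Small made-up data frame with the same kind of columns (illustrative only)
toyMushrooms <- data.frame(
  class    = factor(c("poisonous", "edible", "edible", "poisonous")),
  capShape = factor(c("convex", "convex", "bell", "flat")),
  gillSize = factor(c("narrow", "broad", "broad", "narrow"))
)

# Drop the class column, then split the remaining columns by class
byClass <- split(toyMushrooms[, -1], toyMushrooms$class)

names(byClass)      # one list element per class level
byClass$poisonous   # original row names are preserved
```

With the real data, the same idea would presumably be `split(mushroomsNew[, -1], mushroomsNew$class)`.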
Look at the first few rows of each subset to check that this worked:
head(poisonous)
##    capShape gillSize veilColor
## 1    convex   narrow     white
## 4    convex   narrow     white
## 9    convex   narrow     white
## 14   convex   narrow     white
## 18   convex   narrow     white
## 19   convex   narrow     white
head(edible)
##   capShape gillSize veilColor
## 2   convex    broad     white
## 3     bell    broad     white
## 5   convex    broad     white
## 6   convex    broad     white
## 7     bell    broad     white
## 8     bell    broad     white
I notice that all six poisonous mushrooms shown have narrow gills and all six edible ones have broad gills, which makes me think there may be a relationship between class and gill size. Six rows from each subset are far too few to judge, so the full data need to be checked.
Make a mosaic plot of class against gill size for all observations to get a better view:
mosaicplot(table(mushroomsNew$class,mushroomsNew$gillSize),main = "Edible vs. Poisonous", ylab = "Gill Size", las = 1,color = 1:2)
My conclusion is that, although it’s hard to say anything definite about poisonous mushrooms, most edible mushrooms in the sample seem to have broad gills.
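Row proportions would make a statement like this easier to check than the plot alone. The sketch below uses a small made-up sample; `toyClass`, `toyGill`, `counts` and `rowShares` are hypothetical names, and the counts are not from the mushroom data. With the real data, the same two calls would presumably be applied to mushroomsNew$class and mushroomsNew$gillSize.

```r
# Made-up sample of eight mushrooms (illustrative only)
toyClass <- factor(c("edible", "edible", "edible", "edible",
                     "poisonous", "poisonous", "poisonous", "poisonous"))
toyGill  <- factor(c("broad", "broad", "broad", "narrow",
                     "narrow", "narrow", "broad", "narrow"))

# Cross-tabulate class against gill size
counts <- table(class = toyClass, gillSize = toyGill)
counts

# margin = 1 divides each row by its row total,
# giving the share of broad and narrow gills within each class
rowShares <- prop.table(counts, margin = 1)
rowShares
```

Each row of `rowShares` sums to 1, so the two classes can be compared directly even when they have different numbers of observations.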