Mushrooms Dataset
A famous (if slightly moldy) dataset about mushrooms can be found in the UCI repository here: https://archive.ics.uci.edu/ml/datasets/Mushroom. Because it is so well known in the data science community, it is a good dataset for comparative benchmarking. For example, if someone were working to build a better decision tree algorithm (or other predictive classifier) to analyze categorical data, this dataset could be useful. A typical problem (which is beyond the scope of this assignment!) is to answer the question, “Which other attribute or attributes are the best predictors of whether a particular mushroom is poisonous or edible?”
Your task is to study the dataset and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there! You should take the data, and create a data frame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous and three or four other columns. You should also add meaningful column names and replace the abbreviations used in the data—for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.
library(stringr)
The dataset has no header row, and read.csv defaults to header = TRUE, so “header=FALSE” is passed explicitly. “dim” shows how many rows and columns the dataset has.
mushroom <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", header=FALSE)
mushroom = as.data.frame(mushroom)
dim(mushroom)
## [1] 8124 23
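To see why the header argument matters, here is a minimal sketch using a two-line inline sample instead of the real file. The sample string and the variable names are made up for illustration.

```r
# Two data rows and no header line, mimicking the layout of the mushroom file
raw <- "p,x,s,n\ne,x,s,y"

with_default <- read.csv(text = raw)                  # header = TRUE by default
no_header    <- read.csv(text = raw, header = FALSE)  # every line is treated as data

dim(with_default)   # the first record was consumed as column names
dim(no_header)      # both records are kept
names(no_header)    # default names V1, V2, ...
```

With the default setting the first mushroom would silently disappear into the column names, so the row count would be off by one.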
The data dictionary lists only 22 attributes, but the data has 23 columns. The first column contains only “e” and “p”, so it must be the class.
Checking:
unique(mushroom$V1)
## [1] p e
## Levels: e p
summary(data.frame(mushroom == "NULL"))
## V1 V2 V3 V4
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:8124 FALSE:8124 FALSE:8124 FALSE:8124
## V5 V6 V7 V8
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:8124 FALSE:8124 FALSE:8124 FALSE:8124
## V9 V10 V11 V12
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:8124 FALSE:8124 FALSE:8124 FALSE:8124
## V13 V14 V15 V16
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:8124 FALSE:8124 FALSE:8124 FALSE:8124
## V17 V18 V19 V20
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:8124 FALSE:8124 FALSE:8124 FALSE:8124
## V21 V22 V23
## Mode :logical Mode :logical Mode :logical
## FALSE:8124 FALSE:8124 FALSE:8124
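According to the result, no cell contains the literal string “NULL”, so we can move on. (The data dictionary codes missing values as “?” rather than “NULL”; they occur only in the stalk-root attribute, which is not used below.)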
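A check for a specific missing-value marker could look like the sketch below. It assumes the marker is “?”, as described in the data dictionary; the toy data frame and its values are invented for illustration and are not the real data.

```r
# Toy stand-in for two columns of the data; "?" plays the missing-value marker
toy <- data.frame(V1  = c("p", "e", "e"),
                  V12 = c("e", "?", "c"),
                  stringsAsFactors = FALSE)

# Count "?" entries per column
missing_per_column <- colSums(toy == "?")
missing_per_column

# Optionally turn the marker into a real NA so is.na() and friends work
toy[toy == "?"] <- NA
na_per_column <- colSums(is.na(toy))
na_per_column
```

The same conversion can be done at load time with the na.strings argument of read.csv.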
I chose the Class column and 6 other columns related to color, and put them in a data frame named m_colortable.
m_colortable <- mushroom[, c(1, 4, 10, 15, 16, 18, 21)]
The 7 columns are Class, Cap_color, Gill_color, Stalk_color_above_ring, Stalk_color_below_ring, Veil_color, and SporePrint_color.
colnames(m_colortable) <- c("Class", "Cap_color", "Gill_color", "Stalk_color_above_ring", "Stalk_color_below_ring", "Veil_color", "SporePrint_color")
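Selecting by position and renaming in a separate step works, but the two lists have to be kept in sync by hand. One possible variation, sketched here on a toy data frame with made-up values, keeps each new name next to its column position in a single named vector (the names wanted and subset_df are illustrative).

```r
# Toy stand-in with default V1..V5 names; values are illustrative only
toy <- data.frame(V1 = c("p", "e"), V2 = c("x", "b"), V3 = c("s", "y"),
                  V4 = c("n", "y"), V5 = c("t", "f"),
                  stringsAsFactors = FALSE)

# New name = column position, so each name sits next to the index it labels
wanted <- c(Class = 1, Cap_color = 4)

# Select the columns, then apply the new names in one step
subset_df <- setNames(toy[, wanted], names(wanted))
subset_df
```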
Replace “e” with “edible” and “p” with “poisonous” in column 1, and replace the single-letter codes in the other 6 columns with their corresponding colors.
library(plyr)
m_colortable$Class <- revalue(m_colortable$Class, c("p" = "poisonous", "e" = "edible"))
m_colortable <- sapply(m_colortable, function(x) revalue(x,c("n"="brown", "b"="buff", "c"="cinnamon", "g"="gray", "r"="green", "p"="pink",
"u"="purple","e"="red","w"="white", "y"="yellow","k"="black", "h"="chocolate","o"="orange")))
## The following `from` values were not present in `x`: n, b, c, g, r, p, u, e, w, y, k, h, o
## The following `from` values were not present in `x`: k, h, o
## The following `from` values were not present in `x`: c
## The following `from` values were not present in `x`: r, u, k, h
## The following `from` values were not present in `x`: r, u, k, h
## The following `from` values were not present in `x`: b, c, g, r, p, u, e, k, h
## The following `from` values were not present in `x`: c, g, p, e
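Each message above corresponds to one column. The first one lists every code because Class had already been recoded before the color mapping was applied to all 7 columns. The other columns simply do not use every color. A possible alternative that skips Class and produces no messages is a named-vector lookup in base R, sketched here on a toy data frame (the names color_codes, toy and color_cols are illustrative, and the rows are made up).

```r
# Lookup table: single-letter code -> color name, same mapping as used above
color_codes <- c(n = "brown", b = "buff", c = "cinnamon", g = "gray",
                 r = "green", p = "pink", u = "purple", e = "red",
                 w = "white", y = "yellow", k = "black", h = "chocolate",
                 o = "orange")

# Toy stand-in for m_colortable; values are illustrative only
toy <- data.frame(Class      = c("poisonous", "edible"),
                  Cap_color  = c("n", "y"),
                  Gill_color = c("k", "n"),
                  stringsAsFactors = FALSE)

# Recode every column except Class by indexing the lookup table with the codes
color_cols <- setdiff(names(toy), "Class")
toy[color_cols] <- lapply(toy[color_cols],
                          function(x) unname(color_codes[as.character(x)]))
toy
```

A code that is missing from the lookup table becomes NA with this approach, which makes unexpected values easy to spot.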
m_colortable <- data.frame(m_colortable)
head(m_colortable, 20)
## Class Cap_color Gill_color Stalk_color_above_ring
## 1 poisonous brown black white
## 2 edible yellow black white
## 3 edible white brown white
## 4 poisonous white brown white
## 5 edible gray black white
## 6 edible yellow brown white
## 7 edible white gray white
## 8 edible white brown white
## 9 poisonous white pink white
## 10 edible yellow gray white
## 11 edible yellow gray white
## 12 edible yellow brown white
## 13 edible yellow white white
## 14 poisonous white black white
## 15 edible brown brown white
## 16 edible gray black white
## 17 edible white black white
## 18 poisonous brown brown white
## 19 poisonous white brown white
## 20 poisonous brown black white
## Stalk_color_below_ring Veil_color SporePrint_color
## 1 white white black
## 2 white white brown
## 3 white white brown
## 4 white white black
## 5 white white brown
## 6 white white black
## 7 white white black
## 8 white white brown
## 9 white white black
## 10 white white black
## 11 white white brown
## 12 white white black
## 13 white white brown
## 14 white white brown
## 15 white white black
## 16 white white brown
## 17 white white brown
## 18 white white black
## 19 white white brown
## 20 white white brown
Finally, reshape the table to long format (one row per mushroom and part) and explore the counts in an interactive pivot table.
library(reshape2)
test <- melt(m_colortable, id.vars = c("Class"), variable.name = "Parts", value.name = "Color")
## Warning: attributes are not identical across measure variables; they will
## be dropped
library(rpivotTable)
rpivotTable(test, rows = c("Parts","Color"), cols = "Class", rendererName = "Table Barchart", width = "100%", height = "100%")
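The pivot table above is interactive and only renders in a browser. For a static view of the same kind of counts, a base-R cross-tabulation is one option. The sketch below uses a small made-up long-format data frame; the name long, its column names and its rows are assumptions for illustration.

```r
# Toy long-format data: one row per (mushroom, part) pair; rows are made up
long <- data.frame(
  Class = c("edible", "edible", "poisonous", "poisonous", "edible"),
  Parts = c("Cap_color", "Cap_color", "Cap_color", "Gill_color", "Gill_color"),
  Color = c("white", "yellow", "white", "black", "black"),
  stringsAsFactors = FALSE
)

# Count rows for every combination of part, color and class
counts <- xtabs(~ Parts + Color + Class, data = long)

# Flatten to a two-way layout: part and color as rows, class as columns
ftable(counts, row.vars = c("Parts", "Color"))
```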