Mushrooms Dataset. A famous (if slightly moldy) dataset about mushrooms can be found in the UCI repository here: https://archive.ics.uci.edu/ml/datasets/Mushroom. Because this dataset is so well known in the data science community, it is a good choice for comparative benchmarking. For example, someone working to build a better decision tree algorithm (or other predictive classifier) for categorical data could find it useful. A typical problem (which is beyond the scope of this assignment!) is to answer the question, “Which other attribute or attributes are the best predictors of whether a particular mushroom is poisonous or edible?”

Your task is to study the dataset and the associated description of the data (i.e., the “data dictionary”). You may need to look around a bit, but it’s there! Take the data and create a data frame with a subset of the columns in the dataset. Include the column that indicates edible or poisonous and three or four other columns. Also add meaningful column names and replace the abbreviations used in the data; for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.
url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'
mushrooms <- read.table(url, sep=",", header=FALSE, stringsAsFactors = FALSE)
head(mushrooms)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1 p x s n t p f c n k e e s s w w p w o p
## 2 e x s y t a f c b k e c s s w w p w o p
## 3 e b s w t l f c b n e c s s w w p w o p
## 4 p x y w t p f c n n e e s s w w p w o p
## 5 e x s g f n f w b k t e s s w w p w o e
## 6 e x y y t a f c b n e c s s w w p w o p
## V21 V22 V23
## 1 k s u
## 2 n n g
## 3 n n m
## 4 k s u
## 5 n a g
## 6 k n g
Rename Subset of Columns
This step requires the plyr package.
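library(plyr)
# Give the five columns of interest the names used in the data dictionary
mushrooms2 <- plyr::rename(mushrooms, c("V1"="class", "V2"="cap-shape", "V3"="cap-surface", "V4"="cap-color", "V9"="gill-size"))
# Keep only the renamed columns
mushrooms3 <- mushrooms2[c("class", "cap-shape", "cap-surface", "cap-color", "gill-size")]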
head(mushrooms3)
## class cap-shape cap-surface cap-color gill-size
## 1 p x s n n
## 2 e x s y b
## 3 e b s w b
## 4 p x y w n
## 5 e x s g b
## 6 e x y y b
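The same rename-and-subset step can also be done without plyr. The following is a minimal base R sketch, not the approach used in this report. To stay self-contained it rebuilds the first six rows from the head() output above instead of downloading the file; the names sample_rows, new_names, and subset_rows are illustrative and do not appear in the original code.

```r
# Stand-in for the downloaded data: the first six rows of the five
# columns used above, copied from the head() output.
sample_rows <- data.frame(
  V1 = c("p", "e", "e", "p", "e", "e"),
  V2 = c("x", "x", "b", "x", "x", "x"),
  V3 = c("s", "s", "s", "y", "s", "y"),
  V4 = c("n", "y", "w", "w", "g", "y"),
  V9 = c("n", "b", "b", "n", "b", "b"),
  stringsAsFactors = FALSE
)

# Lookup from original column names to descriptive names.
new_names <- c(V1 = "class", V2 = "cap-shape", V3 = "cap-surface",
               V4 = "cap-color", V9 = "gill-size")

# Subset to the columns of interest, then rename them via the lookup.
subset_rows <- sample_rows[names(new_names)]
names(subset_rows) <- unname(new_names[names(subset_rows)])

print(subset_rows)
```

Keeping the mapping in one named vector means the subset and the rename are driven by the same list, so the two steps cannot drift apart.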
Replace the abbreviations
# One revalue() call per column; codes come from the data dictionary.
# cap-shape and cap-color are left as single-letter codes here.
mushrooms3$class <- revalue(mushrooms3$class, c("p"="poisonous", "e"="edible"))
mushrooms3$`gill-size` <- revalue(mushrooms3$`gill-size`, c("b"="broad", "n"="narrow"))
mushrooms3$`cap-surface` <- revalue(mushrooms3$`cap-surface`, c("f"="fibrous", "g"="grooves", "y"="scaly", "s"="smooth"))
head(mushrooms3)
## class cap-shape cap-surface cap-color gill-size
## 1 poisonous x smooth n narrow
## 2 edible x smooth y broad
## 3 edible b smooth w broad
## 4 poisonous x scaly w narrow
## 5 edible x smooth g broad
## 6 edible x scaly y broad
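The cap-shape and cap-color columns still show single-letter codes in the output above. One possible way to expand them is sketched below in base R with named lookup vectors. This is not part of the original submission. The label mappings are taken from the UCI data dictionary as best understood and should be checked against it; the names caps, shape_labels, and color_labels are illustrative. The sample rows are copied from the head() output so the sketch runs without a download.

```r
# Stand-in for mushrooms3: the first six rows of the two columns that
# are still abbreviated, copied from the head() output.
caps <- data.frame(
  `cap-shape` = c("x", "x", "b", "x", "x", "x"),
  `cap-color` = c("n", "y", "w", "w", "g", "y"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Code-to-label lookups (assumed from the data dictionary; verify there).
shape_labels <- c(b = "bell", c = "conical", x = "convex",
                  f = "flat", k = "knobbed", s = "sunken")
color_labels <- c(n = "brown", b = "buff", c = "cinnamon", g = "gray",
                  r = "green", p = "pink", u = "purple", e = "red",
                  w = "white", y = "yellow")

# Index each lookup by the column's codes; a code missing from the
# lookup would become NA rather than being left unchanged.
caps$`cap-shape` <- unname(shape_labels[caps$`cap-shape`])
caps$`cap-color` <- unname(color_labels[caps$`cap-color`])

print(caps)
```

Unlike revalue(), this indexing approach turns unmapped codes into NA, which makes a gap in the lookup easy to spot with anyNA().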
The Rmd file is here:
https://github.com/johnsuh23/DATA-607/blob/master/DATA%20607%20HW%201-%20John%20Suh.Rmd