Here, you are asked to use R; you may use base functions or packages as you like.

Mushrooms Dataset

A famous (if slightly moldy) data set about mushrooms can be found in the UCI repository here: https://archive.ics.uci.edu/ml/datasets/Mushroom.

The fact that this is such a well-known dataset in the data science community makes it a good dataset to use for comparative benchmarking. For example, if someone was working to build a better decision tree algorithm (or other predictive classifier) to analyze categorical data, this dataset could be useful. A typical problem (which is beyond the scope of this assignment!) is to answer the question, “Which other attribute or attributes are the best predictors of whether a particular mushroom is poisonous or edible?”

Your task is to study the dataset and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there! You should take the data, and create a data frame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous and three or four other columns. You should also add meaningful column names and replace the abbreviations used in the data; for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.

R Environment

options(repos = c(CRAN = "http://cran.rstudio.com"))
library('dplyr')
## Warning: package 'dplyr' was built under R version 3.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library ('ggplot2')
## Warning: package 'ggplot2' was built under R version 3.3.3

Load Mushroom Dataset

Mushroom<-read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", header= FALSE, sep=",")
dim(Mushroom)
## [1] 8124   23
summary(Mushroom)
##  V1       V2       V3             V4       V5             V6      
##  e:4208   b: 452   f:2320   n      :2284   f:4748   n      :3528  
##  p:3916   c:   4   g:   4   g      :1840   t:3376   f      :2160  
##           f:3152   s:2556   e      :1500            s      : 576  
##           k: 828   y:3244   y      :1072            y      : 576  
##           s:  32            w      :1040            a      : 400  
##           x:3656            b      : 168            l      : 400  
##                             (Other): 220            (Other): 484  
##  V7       V8       V9            V10       V11      V12      V13     
##  a: 210   c:6812   b:5612   b      :1728   e:3516   ?:2480   f: 552  
##  f:7914   w:1312   n:2512   p      :1492   t:4608   b:3776   k:2372  
##                             w      :1202            c: 556   s:5176  
##                             n      :1048            e:1120   y:  24  
##                             g      : 752            r: 192           
##                             h      : 732                             
##                             (Other):1170                             
##  V14           V15            V16       V17      V18      V19     
##  f: 600   w      :4464   w      :4384   p:8124   n:  96   n:  36  
##  k:2304   p      :1872   p      :1872            o:  96   o:7488  
##  s:4936   g      : 576   g      : 576            w:7924   t: 600  
##  y: 284   n      : 448   n      : 512            y:   8           
##           b      : 432   b      : 432                             
##           o      : 192   o      : 192                             
##           (Other): 140   (Other): 156                             
##  V20           V21       V22      V23     
##  e:2776   w      :2388   a: 384   d:3148  
##  f:  48   n      :1968   c: 340   g:2148  
##  l:1296   k      :1872   n: 400   l: 832  
##  n:  36   h      :1632   s:1248   m: 292  
##  p:3968   r      :  72   v:4040   p:1144  
##           b      :  48   y:1712   u: 368  
##           (Other): 144            w: 192
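
Note that since R 4.0.0, read.csv() defaults to stringsAsFactors = FALSE, so the factor columns shown by summary() above (and the levels() recoding used later) depend on the columns being read as factors. The following is a minimal sketch of requesting factors explicitly. It uses three made-up inline rows in place of the downloaded file, and the names sample_rows and mushroom_sample are hypothetical.

```r
# Three illustrative rows in the same comma-separated layout
# (only 3 of the 23 columns are shown; these are not real records).
sample_rows <- "p,x,s\ne,x,s\ne,b,y"

# stringsAsFactors = TRUE reproduces the factor columns that
# R versions before 4.0.0 created by default.
mushroom_sample <- read.csv(text = sample_rows, header = FALSE,
                            stringsAsFactors = TRUE)

str(mushroom_sample)
levels(mushroom_sample$V1)
```

With the real file, the same argument can be added to the read.csv() call used above.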

Data Wrangling

Part One: Look for edible and poisonous mushrooms

Attribute Information: https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names

1. Add column names

names(Mushroom)<-c("classes","cap_shape","cap_surface","cap_color","bruises?","odor", "gill_attachment","gill_spacing","gill_size","gill_color","stalk_shape","stalk_root","stalk_surface_above_ring","stalk_surface_below_ring","stalk_color_above_ring","stalk_color_below_ring","veil_type","veil_color","ring_number","ring_type","spore_print_color","population","habitat")

str(Mushroom)
## 'data.frame':    8124 obs. of  23 variables:
##  $ classes                 : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap_shape               : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
##  $ cap_surface             : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
##  $ cap_color               : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
##  $ bruises?                : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
##  $ gill_attachment         : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill_spacing            : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill_size               : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill_color              : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
##  $ stalk_shape             : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk_root              : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
##  $ stalk_surface_above_ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk_surface_below_ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk_color_above_ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk_color_below_ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil_type               : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
##  $ veil_color              : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring_number             : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring_type               : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore_print_color       : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
##  $ population              : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...

2. Disjunctive rules for poisonous mushrooms

P_1) odor=NOT(almond.OR.anise.OR.none)
P_2) spore-print-color=green
P_3) odor=none.AND.stalk-surface-below-ring=scaly.AND.(stalk-color-above-ring=NOT.brown) 
P_4) habitat=leaves.AND.cap-color=white
or P_4') population=clustered.AND.cap_color=white
# Below, I apply the rules in sequence to find the number of edible mushrooms
#P_1)
#odor:almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
e1<-subset(Mushroom,odor=="a"|odor=="l"|odor=="n")
nrow(e1)
## [1] 4328
#P_1) -> P_2) 
#spore-print-color:black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
e2<-subset(e1,e1$spore_print_color!="r")
nrow(e2)
## [1] 4256
#P_1) -> P_2) -> P_3) 
#stalk-surface-below-ring:fibrous=f,scaly=y,silky=k,smooth=s
#stalk-color-above-ring:brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
e3<-subset(e2, !(e2$odor=='n' & e2$stalk_surface_below_ring=="y" & e2$stalk_color_above_ring!="n"))
nrow(e3)
## [1] 4216
#P_1) -> P_2) -> P_3) -> P_4)
#habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
#cap-color:brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
e4<-subset(e3,!(e3$habitat=="l" & e3$cap_color=="w"))
nrow(e4)
## [1] 4208
#P_1) -> P_2) -> P_3) -> P_4')
#population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
e4_1<-subset(e3,!(e3$population=="c" & e3$cap_color=="w"))
nrow(e4_1)
## [1] 4208
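
The four rules can also be written as one logical vector rather than a chain of subset() calls. The sketch below assumes the same single-letter codes as above and runs on a small made-up data frame (toy and predicted_poisonous are hypothetical names). A row is flagged as poisonous when any of P_1 to P_4 fires, which should select the complement of the rows kept by the chained subsets.

```r
# Toy data frame with the columns the rules use
# (values are illustrative, not real rows from the dataset).
toy <- data.frame(
  odor                     = c("a", "f", "n", "n", "n", "n"),
  spore_print_color        = c("k", "k", "r", "w", "w", "n"),
  stalk_surface_below_ring = c("s", "s", "s", "y", "s", "s"),
  stalk_color_above_ring   = c("w", "w", "w", "w", "w", "w"),
  habitat                  = c("g", "g", "g", "g", "l", "d"),
  cap_color                = c("n", "n", "n", "n", "w", "w"),
  stringsAsFactors = FALSE
)

# Each rule yields TRUE for rows it flags as poisonous.
p1 <- !(toy$odor %in% c("a", "l", "n"))
p2 <- toy$spore_print_color == "r"
p3 <- toy$odor == "n" &
  toy$stalk_surface_below_ring == "y" &
  toy$stalk_color_above_ring != "n"
p4 <- toy$habitat == "l" & toy$cap_color == "w"

# Disjunction: poisonous if any rule fires.
predicted_poisonous <- p1 | p2 | p3 | p4
predicted_poisonous
```

On the real data, sum(!predicted_poisonous) would play the role of nrow(e4).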

3. Check result:

Class distribution: edible: 4208 / poisonous: 3916 / total: 8124

e<-nrow(e4)
e
## [1] 4208
p<-nrow(Mushroom)-nrow(e4)
p
## [1] 3916
nrow(Mushroom)
## [1] 8124
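
Matching totals show that the rules keep the right number of rows, but not that they keep the right rows. A row-by-row comparison against the class column is a stricter check. The sketch below uses made-up label vectors (actual and predicted are hypothetical names, with one deliberate mismatch) to show the idea with table().

```r
# Illustrative labels only: true classes and rule-based predictions for six rows.
actual    <- c("p", "e", "e", "p", "e", "p")
predicted <- c("p", "e", "e", "p", "e", "e")  # last row deliberately misclassified

# Rows are predictions, columns are true classes; off-diagonal cells are errors.
confusion <- table(predicted = predicted, actual = actual)
confusion

# Share of rows where the prediction agrees with the true class.
accuracy <- mean(predicted == actual)
accuracy
```

With the real data, the predictions would come from the rules and the true labels from Mushroom$classes.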

Part Two: Attribute Analysis

1. Include “odor,population,habitat,cap_color” in dataset

df<-data.frame(Mushroom$classes,Mushroom$odor, Mushroom$cap_color,Mushroom$population, Mushroom$habitat)

#classes:edible=e,poisonous=p
levels(df$Mushroom.classes) [levels(df$Mushroom.classes)=="p"]  <- "poisonous"
levels(df$Mushroom.classes) [levels(df$Mushroom.classes)=="e"]  <- "edible"

#odor:almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
levels(df$Mushroom.odor) [levels(df$Mushroom.odor)=="a"]  <- "almond"
levels(df$Mushroom.odor) [levels(df$Mushroom.odor)=="l"]  <- "anise"
levels(df$Mushroom.odor) [levels(df$Mushroom.odor)=="c"]  <- "creosote"
levels(df$Mushroom.odor) [levels(df$Mushroom.odor)=="y"]  <- "fishy"
levels(df$Mushroom.odor) [levels(df$Mushroom.odor)=="f"]  <- "foul"
levels(df$Mushroom.odor) [levels(df$Mushroom.odor)=="m"]  <- "musty"
levels(df$Mushroom.odor) [levels(df$Mushroom.odor)=="n"]  <- "none"
levels(df$Mushroom.odor) [levels(df$Mushroom.odor)=="p"]  <- "pungent"
levels(df$Mushroom.odor) [levels(df$Mushroom.odor)=="s"]  <- "spicy"

#cap-color:brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y 
levels(df$Mushroom.cap_color) [levels(df$Mushroom.cap_color)=="n"]  <- "brown"
levels(df$Mushroom.cap_color) [levels(df$Mushroom.cap_color)=="b"]  <- "buff"
levels(df$Mushroom.cap_color) [levels(df$Mushroom.cap_color)=="c"]  <- "cinnamon"
levels(df$Mushroom.cap_color) [levels(df$Mushroom.cap_color)=="g"]  <- "gray"
levels(df$Mushroom.cap_color) [levels(df$Mushroom.cap_color)=="r"]  <- "green"
levels(df$Mushroom.cap_color) [levels(df$Mushroom.cap_color)=="p"]  <- "pink"
levels(df$Mushroom.cap_color) [levels(df$Mushroom.cap_color)=="u"]  <- "purple"
levels(df$Mushroom.cap_color) [levels(df$Mushroom.cap_color)=="e"]  <- "red"
levels(df$Mushroom.cap_color) [levels(df$Mushroom.cap_color)=="w"]  <- "white"
levels(df$Mushroom.cap_color) [levels(df$Mushroom.cap_color)=="y"]  <- "yellow"

#population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
levels(df$Mushroom.population) [levels(df$Mushroom.population)=="a"]  <- "abundant"
levels(df$Mushroom.population) [levels(df$Mushroom.population)=="c"]  <- "clustered"
levels(df$Mushroom.population) [levels(df$Mushroom.population)=="n"]  <- "numerous"
levels(df$Mushroom.population) [levels(df$Mushroom.population)=="s"]  <- "scattered"
levels(df$Mushroom.population) [levels(df$Mushroom.population)=="v"]  <- "several"
levels(df$Mushroom.population) [levels(df$Mushroom.population)=="y"]  <- "solitary"

#habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
levels(df$Mushroom.habitat) [levels(df$Mushroom.habitat)=="g"]  <- "grasses"
levels(df$Mushroom.habitat) [levels(df$Mushroom.habitat)=="l"]  <- "leaves"
levels(df$Mushroom.habitat) [levels(df$Mushroom.habitat)=="m"]  <- "meadows"
levels(df$Mushroom.habitat) [levels(df$Mushroom.habitat)=="p"]  <- "paths"
levels(df$Mushroom.habitat) [levels(df$Mushroom.habitat)=="u"]  <- "urban"
levels(df$Mushroom.habitat) [levels(df$Mushroom.habitat)=="w"]  <- "waste"
levels(df$Mushroom.habitat) [levels(df$Mushroom.habitat)=="d"]  <- "woods"

head(df)
##   Mushroom.classes Mushroom.odor Mushroom.cap_color Mushroom.population
## 1        poisonous       pungent              brown           scattered
## 2           edible        almond             yellow            numerous
## 3           edible         anise              white            numerous
## 4        poisonous       pungent              white           scattered
## 5           edible          none               gray            abundant
## 6           edible        almond             yellow            numerous
##   Mushroom.habitat
## 1            urban
## 2          grasses
## 3          meadows
## 4            urban
## 5          grasses
## 6          grasses
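
The one-level-at-a-time recoding can also be driven by a lookup vector, which keeps every code from the data dictionary in one place and makes an omitted code easier to spot. This is a minimal sketch for the habitat codes only. The names habitat_labels and habitat_codes are hypothetical, and the codes vector is a short illustrative stand-in for the real column.

```r
# Lookup taken from the data dictionary: code -> label.
habitat_labels <- c(g = "grasses", l = "leaves", m = "meadows", p = "paths",
                    u = "urban", w = "waste", d = "woods")

# A few illustrative codes standing in for the habitat column.
habitat_codes <- c("u", "g", "m", "w", "d")

# factor() maps each code in `levels` to the label at the same position in `labels`.
habitat <- factor(habitat_codes,
                  levels = names(habitat_labels),
                  labels = unname(habitat_labels))
as.character(habitat)
```

The same pattern could be repeated for odor, cap_color, and population with their own lookup vectors.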

2. “odor” distribution

qplot(Mushroom.odor, data = df, fill= Mushroom.classes)

# The odor attribute almost completely separates poisonous from edible mushrooms; only mushrooms with odor "none" fall into both classes, and few of those are poisonous.
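
The reading of the bar chart can be backed by numbers with a cross-tabulation. The sketch below uses made-up vectors (odor and classes here are illustrative stand-ins, not the real columns) and prop.table() to get the class share within each odor.

```r
# Illustrative rows only; with the real data these would be
# df$Mushroom.odor and df$Mushroom.classes.
odor    <- c("none", "none", "none", "none", "foul", "foul", "almond", "almond")
classes <- c("edible", "edible", "edible", "poisonous",
             "poisonous", "poisonous", "edible", "edible")

counts <- table(odor, classes)

# margin = 1 scales each row (one odor) to sum to 1.
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

An odor whose row is close to 0 or 1 in one column separates the classes well; a row near 0.5 does not.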

3. “odor vs cap_color”

qplot(Mushroom.odor, Mushroom.cap_color, data=df, color=Mushroom.classes, main="odor vs cap_color")

# In the cap_color distribution, the colors are spread across both edible and poisonous mushrooms. Compared with odor, cap color does not make it possible to separate edible from poisonous.

4. “odor vs population”

qplot(Mushroom.odor,data = df, fill=Mushroom.population,facets = .~Mushroom.classes)

# In the population distribution, abundant and numerous occur only among edible mushrooms; all other values occur among both edible and poisonous mushrooms. Compared with odor, population does not make it possible to separate edible from poisonous.

5. “odor vs habitat”

qplot(Mushroom.odor,data = df, fill=Mushroom.habitat,facets = .~Mushroom.classes)

# In the habitat distribution, every type of habitat has both edible and poisonous mushrooms. Compared with odor, habitat makes it very difficult to separate edible from poisonous.

6. Conclusion:

Among the four attributes (odor, population, habitat, cap_color), odor is the best predictor of whether a particular mushroom is poisonous or edible.