Very often, we’re tasked with taking data in one form and transforming it for easier downstream analysis. We will spend several weeks in this course on tidying and transformation operations. Some of this work could be done in SQL or R (or Python or…). Here, you are asked to use R—you may use base functions or packages as you like.

Mushrooms Dataset. A famous—if slightly moldy—dataset about mushrooms can be found in the UCI repository here: https://archive.ics.uci.edu/ml/datasets/Mushroom. The fact that this is such a well-known dataset in the data science community makes it a good dataset to use for comparative benchmarking. For example, if someone was working to build a better decision tree algorithm (or other predictive classifier) to analyze categorical data, this dataset could be useful. A typical problem (which is beyond the scope of this assignment!) is to answer the question, “Which other attribute or attributes are the best predictors of whether a particular mushroom is poisonous or edible?”

Your task is to study the dataset and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there! You should take the data, and create a data frame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous and three or four other columns. You should also add meaningful column names and replace the abbreviations used in the data—for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.

Data preparation

Read the data from my GitHub repo into an R data frame and display the first few rows.
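
The chunk that reads the file is not shown in this report. The following is a minimal sketch of that step, not the original code. To stay runnable offline it parses two inline sample records instead of a URL: the first record is reconstructed from the column names printed in the output further down ("p", "x", "s", ...), and the second is purely illustrative. Those column names suggest the first record was consumed as a header, so the sketch compares the default with `header = FALSE`.

```r
# Two sample records in the file's comma-separated, header-less format.
# Record 1 is reconstructed from the column names in the report's output;
# record 2 is illustrative only.
sample_lines <- c(
  "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u",
  "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g"
)

# Default (header = TRUE): the first record becomes the column names.
with_header <- read.csv(text = sample_lines)

# header = FALSE keeps every record; columns are named V1 to V23.
mushroom_sample <- read.csv(text = sample_lines, header = FALSE,
                            stringsAsFactors = TRUE)

dim(with_header)      # one record is lost to the header
dim(mushroom_sample)  # both records are kept
head(mushroom_sample)
```

With the real file, the `text =` argument would be replaced by the file path or URL as the first argument.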

Exploratory Analysis

## [1] "data.frame"
## [1] 8123   23
##  [1] "p"   "x"   "s"   "n"   "t"   "p.1" "f"   "c"   "n.1" "k"   "e"  
## [12] "e.1" "s.1" "s.2" "w"   "w.1" "p.2" "w.2" "o"   "p.3" "k.1" "s.3"
## [23] "u"
## [1] FALSE

No NULL values were found in the data frame.

## [1] FALSE

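No missing (NA) values were found in the data frame. Note that the dataset marks missing Stalk-Root values with "?" rather than NA (2480 rows in the summary further down), so this check does not count them.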
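
The calls behind the output above are not shown. As a hedged sketch, checks of this kind could look like the following, run here on a small toy data frame (the values and the name `toy` are illustrative, not taken from the dataset). It also counts "?" placeholders, which an NA check treats as ordinary values.

```r
# Toy data frame; the values are illustrative only.
toy <- data.frame(
  Class = c("p", "e", "e"),
  StalkRoot = c("e", "?", "c"),
  stringsAsFactors = TRUE
)

class(toy)    # "data.frame"
dim(toy)      # number of rows, number of columns
names(toy)
is.null(toy)  # FALSE for any existing data frame

# "?" is an ordinary factor level, so the NA check finds nothing.
has_na_before <- any(is.na(toy))

# Count "?" placeholders per column.
question_marks <- sapply(toy, function(column) sum(column == "?"))
question_marks

# Turn the placeholder into a real NA so that the NA check can see it.
toy_na <- toy
toy_na$StalkRoot[toy_na$StalkRoot == "?"] <- NA
toy_na <- droplevels(toy_na)
has_na_after <- any(is.na(toy_na))
```

When reading the file, `read.csv(..., na.strings = "?")` is another way to get real NA values from the start.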

Rename Columns
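
The renaming code itself is not shown here. A minimal sketch, assuming the 23 columns are in file order and using the column names that appear in the recoding step and summary later in this report, might be the following. The data frame `mushroom_demo` and its single record are stand-ins for illustration.

```r
# Column names as used later in this report, in file order.
column_names <- c(
  "Class", "Cap-Shape", "Cap-Surface", "Cap-Color", "Bruises", "Odor",
  "Gill-Attachment", "Gill-Spacing", "Gill-Size", "Gill-Color",
  "Stalk-Shape", "Stalk-Root", "Stalk-Surface-Above-Ring",
  "Stalk-Surface-Below-Ring", "Stalk-Color-Above-Ring",
  "Stalk-Color-Below-Ring", "Veil-Type", "Veil-Color", "Ring-Number",
  "Ring-Type", "Spore-Print-Color", "Population", "Habitat"
)

# One-record stand-in for the real data frame.
first_record <- "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u"
mushroom_demo <- read.csv(text = first_record, header = FALSE,
                          colClasses = "character")

# Fail early if the number of names does not match the number of columns.
stopifnot(length(column_names) == ncol(mushroom_demo))
names(mushroom_demo) <- column_names

names(mushroom_demo)
```

Because the names contain hyphens, columns have to be referenced with backticks, as in `` mushroom$`Cap-Shape` ``.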

Rename variables

Replace the abbreviated values in each column with descriptive labels:

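# recode() comes from dplyr; load it here if an earlier chunk has not already done so.
library(dplyr)

mushroom$Class <- recode(mushroom$Class, e = "edible", p = "poisonous")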

mushroom$`Cap-Shape` <- recode(mushroom$`Cap-Shape`, 
                               b = "bell", c = "conical", x = "convex", f = "flat", k = "knobbed", s = "sunken")

mushroom$`Cap-Surface` <- recode(mushroom$`Cap-Surface`, f="fibrous", g="grooves", y="scaly", s="smooth")

mushroom$`Cap-Color` <- recode(mushroom$`Cap-Color`, n = "brown", b = "buff", c = "cinnamon", g = "gray", r = "green", p = "pink", u = "purple", e = "red", w = "white", y = "yellow")

mushroom$Bruises <- recode(mushroom$Bruises, t = "bruises", f = "no-bruises")

mushroom$Odor <- recode(mushroom$Odor, a = 'almond', l = 'anise', c = 'creosote', y = 'fishy', f = 'foul', m = 'musty', n = 'none', p = 'pungent', s = 'spicy')

mushroom$`Gill-Attachment` <- recode(mushroom$`Gill-Attachment`, a = 'attached', d = 'descending', f = 'free', n = 'notched')

mushroom$`Gill-Spacing` <- recode(mushroom$`Gill-Spacing`, c = 'close', w = 'crowded', d = 'distant')

mushroom$`Gill-Size` <- recode(mushroom$`Gill-Size`, b = 'broad', n = 'narrow' )

mushroom$`Gill-Color` <- recode(mushroom$`Gill-Color`, k = 'black', n = 'brown', b = 'buff', h = 'chocolate', g = 'gray', r = 'green', o = 'orange', p = 'pink', u = 'purple', e = 'red', w = 'white', y = 'yellow')

mushroom$`Stalk-Shape` <- recode(mushroom$`Stalk-Shape`, e = 'enlarging', t = 'tapering')

mushroom$`Stalk-Root` <- recode(mushroom$`Stalk-Root`, b = 'bulbous', c = 'club', u = 'cup', e = 'equal', z = 'rhizomorphs', r = 'rooted')

mushroom$`Stalk-Surface-Above-Ring` <- recode(mushroom$`Stalk-Surface-Above-Ring`, f = 'fibrous', y = 'scaly', k = 'silky', s = 'smooth')

mushroom$`Stalk-Surface-Below-Ring` <- recode(mushroom$`Stalk-Surface-Below-Ring`, f = 'fibrous', y = 'scaly', k = 'silky', s = 'smooth')

mushroom$`Stalk-Color-Above-Ring` <- recode(mushroom$`Stalk-Color-Above-Ring`, n = 'brown', b = 'buff', c = 'cinnamon', g = 'gray', o = 'orange', p = 'pink', e = 'red', w = 'white', y = 'yellow')

mushroom$`Stalk-Color-Below-Ring` <- recode(mushroom$`Stalk-Color-Below-Ring`, n = 'brown', b = 'buff', c = 'cinnamon', g = 'gray', o = 'orange', p = 'pink', e = 'red', w = 'white', y = 'yellow')

mushroom$`Veil-Type` <- recode(mushroom$`Veil-Type`, p = 'partial', u = 'universal')

mushroom$`Veil-Color` <- recode(mushroom$`Veil-Color`, n = 'brown', o = 'orange', w = 'white', y = 'yellow')

mushroom$`Ring-Number` <- recode(mushroom$`Ring-Number`, n = 'none', o = 'one', t = 'two')

mushroom$`Ring-Type` <- recode(mushroom$`Ring-Type`, c = 'cobwebby', e = 'evanescent', f = 'flaring', l = 'large', n = 'none', p = 'pendant', s = 'sheathing', z = 'zone')

mushroom$`Spore-Print-Color` <- recode(mushroom$`Spore-Print-Color`, k = 'black', n = 'brown', b = 'buff', h = 'chocolate', r = 'green', o = 'orange', u = 'purple', w = 'white', y = 'yellow')

mushroom$Population <- recode(mushroom$Population, a = 'abundant', c = 'clustered', n = 'numerous', s = 'scattered', v = 'several', y = 'solitary')

mushroom$Habitat <- recode(mushroom$Habitat, g = 'grasses', l = 'leaves', m = 'meadows', p = 'paths', u = 'urban', w = 'waste', d = 'woods')
# The output below is a per-column summary, which summary() produces.
summary(mushroom)
##        Class        Cap-Shape     Cap-Surface     Cap-Color   
##  edible   :4208   bell   : 452   fibrous:2320   brown  :2283  
##  poisonous:3915   conical:   4   grooves:   4   gray   :1840  
##                   flat   :3152   smooth :2555   red    :1500  
##                   knobbed: 828   scaly  :3244   yellow :1072  
##                   sunken :  32                  white  :1040  
##                   convex :3655                  buff   : 168  
##                                                 (Other): 220  
##        Bruises          Odor      Gill-Attachment  Gill-Spacing 
##  no-bruises:4748   none   :3528   attached: 210   close  :6811  
##  bruises   :3375   foul   :2160   free    :7913   crowded:1312  
##                    spicy  : 576                                 
##                    fishy  : 576                                 
##                    almond : 400                                 
##                    anise  : 400                                 
##                    (Other): 483                                 
##   Gill-Size        Gill-Color      Stalk-Shape     Stalk-Root  
##  broad :5612   buff     :1728   enlarging:3515   ?      :2480  
##  narrow:2511   pink     :1492   tapering :4608   bulbous:3776  
##                white    :1202                    club   : 556  
##                brown    :1048                    equal  :1119  
##                gray     : 752                    rooted : 192  
##                chocolate: 732                                  
##                (Other)  :1169                                  
##  Stalk-Surface-Above-Ring Stalk-Surface-Below-Ring Stalk-Color-Above-Ring
##  fibrous: 552             fibrous: 600             white  :4463          
##  silky  :2372             silky  :2304             pink   :1872          
##  smooth :5175             smooth :4935             gray   : 576          
##  scaly  :  24             scaly  : 284             brown  : 448          
##                                                    buff   : 432          
##                                                    orange : 192          
##                                                    (Other): 140          
##  Stalk-Color-Below-Ring   Veil-Type     Veil-Color   Ring-Number
##  white  :4383           partial:8123   brown :  96   none:  36  
##  pink   :1872                          orange:  96   one :7487  
##  gray   : 576                          white :7923   two : 600  
##  brown  : 512                          yellow:   8              
##  buff   : 432                                                   
##  orange : 192                                                   
##  (Other): 156                                                   
##       Ring-Type    Spore-Print-Color     Population      Habitat    
##  evanescent:2776   white    :2388    abundant : 384   woods  :3148  
##  flaring   :  48   brown    :1968    clustered: 340   grasses:2148  
##  large     :1296   black    :1871    numerous : 400   leaves : 832  
##  none      :  36   chocolate:1632    scattered:1247   meadows: 292  
##  pendant   :3967   green    :  72    several  :4040   paths  :1144  
##                    buff     :  48    solitary :1712   urban  : 367  
##                    (Other)  : 144                     waste  : 192
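
The summary still lists `?` under Stalk-Root because no replacement was supplied for that code, and `recode()` leaves unmatched values unchanged. The same lookup idea can be sketched in base R with a hypothetical helper, `recode_levels()`, which is not part of this report or of dplyr. It keeps unmatched codes as they are, and an extra `"?"` entry in the lookup gives the placeholder an explicit label.

```r
# Hypothetical helper: map factor levels through a named lookup vector,
# leaving any level without an entry unchanged.
recode_levels <- function(x, lookup) {
  x <- factor(x)
  old <- levels(x)
  matched <- old %in% names(lookup)
  new <- old
  new[matched] <- unname(lookup[old[matched]])
  levels(x) <- new
  x
}

stalk_root_labels <- c(b = "bulbous", c = "club", u = "cup", e = "equal",
                       z = "rhizomorphs", r = "rooted")

codes <- c("b", "?", "e", "b")  # illustrative codes only

# Without an entry for "?", the placeholder passes through untouched.
kept <- as.character(recode_levels(codes, stalk_root_labels))
kept

# With an explicit entry, the placeholder gets a readable label.
labelled <- as.character(
  recode_levels(codes, c(stalk_root_labels, "?" = "missing"))
)
labelled
```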

Subset

Create a data frame with a subset of the columns:

According to the documentation, a small number of columns predict very well which mushrooms are poisonous: odor, spore-print-color, stalk-surface-below-ring, and stalk-color-above-ring. I have also included cap-color and habitat.
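
As a sketch of this step, the columns can be selected by name with base R indexing. The two-row data frame `mushroom_demo` below is an illustrative stand-in for the recoded data, and `keep` and `mushroom_subset` are names chosen for this example.

```r
# Illustrative stand-in for the recoded data, with one extra column
# (Veil-Type) that the subset leaves out.
mushroom_demo <- data.frame(
  Class = c("poisonous", "edible"),
  Odor = c("pungent", "almond"),
  `Spore-Print-Color` = c("black", "brown"),
  `Stalk-Surface-Below-Ring` = c("smooth", "smooth"),
  `Stalk-Color-Above-Ring` = c("white", "white"),
  `Cap-Color` = c("brown", "yellow"),
  Habitat = c("urban", "grasses"),
  `Veil-Type` = c("partial", "partial"),
  check.names = FALSE  # keep the hyphenated names as they are
)

# The class column plus the six predictors named in the text.
keep <- c("Class", "Odor", "Spore-Print-Color", "Stalk-Surface-Below-Ring",
          "Stalk-Color-Above-Ring", "Cap-Color", "Habitat")

mushroom_subset <- mushroom_demo[, keep]

names(mushroom_subset)
dim(mushroom_subset)
```

Selecting by name rather than by position keeps the subset correct even if the column order changes.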

Visualization

The majority of poisonous mushrooms smell foul, while the majority of edible mushrooms don’t smell at all.
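
The plot behind this observation is not reproduced here. One way to check a statement like this is a contingency table of Class against Odor, converted to within-class shares. The sketch below uses a six-row toy data frame with made-up values, so its numbers only illustrate the mechanics and say nothing about the real dataset.

```r
# Made-up toy data; not the real counts.
toy <- data.frame(
  Class = c("poisonous", "poisonous", "poisonous",
            "edible", "edible", "edible"),
  Odor = c("foul", "foul", "pungent", "none", "none", "almond")
)

# Counts of each odor within each class.
counts <- table(toy$Class, toy$Odor)

# Row-wise shares: each class sums to 1.
shares <- prop.table(counts, margin = 1)
round(shares, 2)

# Grouped bar chart: one group per class, one bar per odor.
barplot(t(shares), beside = TRUE, legend.text = TRUE,
        xlab = "Class", ylab = "Share of mushrooms within class")
```

On the real subset, `toy$Class` and `toy$Odor` would be replaced by the corresponding columns of the subset data frame.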