Mushroom Dataset - Assignment 1

Very often, we’re tasked with taking data in one form and transforming it for easier downstream analysis. We will spend several weeks in this course on tidying and transformation operations. Some of this work could be done in SQL or R (or Python or…). Here, you are asked to use R; you may use base functions or packages as you like.

Mushrooms Dataset

A famous (if slightly moldy) dataset about mushrooms can be found in the UCI repository here: https://archive.ics.uci.edu/ml/datasets/Mushroom. The fact that this is such a well-known dataset in the data science community makes it a good dataset to use for comparative benchmarking. For example, if someone was working to build a better decision tree algorithm (or other predictive classifier) to analyze categorical data, this dataset could be useful. A typical problem (which is beyond the scope of this assignment!) is to answer the question, “Which other attribute or attributes are the best predictors of whether a particular mushroom is poisonous or edible?”

Your task is to study the dataset and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there! You should take the data, and create a data frame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous and three or four other columns. You should also add meaningful column names and replace the abbreviations used in the data; for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.

## Loading required package: bitops

Load the dataset and convert it into a data frame

p x s n t p.1 f c n.1 k e e.1 s.1 s.2 w w.1 p.2 w.2 o p.3 k.1 s.3 u
e x s y t a f c b k e c s s w w p w o p n n g
e b s w t l f c b n e c s s w w p w o p n n m
p x y w t p f c n n e e s s w w p w o p k s u
e x s g f n f w b k t e s s w w p w o e n a g
e x y y t a f c b n e c s s w w p w o p k n g
e b s w t a f c b g e c s s w w p w o p k n m
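
A preview like the one above can be produced with `read.csv`. The sketch below is a minimal, self-contained version that reads a few records from an inline string instead of the UCI server. The object name `mushrooms` and the `stringsAsFactors = TRUE` argument are assumptions, not taken from the original code. With the default `header = TRUE`, the first record is used as column names, which would explain the `p x s n t p.1 ...` header shown above; passing `header = FALSE` would keep that record as data.

```r
# A few raw records in the same comma-separated layout as the UCI file
# (inlined here so the sketch runs without network access).
raw_text <- "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u"

# Default header = TRUE: the first record becomes the column names,
# and duplicated names are made unique (p, p.1, p.2, ...).
mushrooms <- read.csv(text = raw_text, stringsAsFactors = TRUE)

head(mushrooms)
```
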
# Let’s check out the summary of the mushroom dataset
##  p        x        s              n        t             p.1      
##  e:4208   b: 452   f:2320   n      :2283   f:4748   n      :3528  
##  p:3915   c:   4   g:   4   g      :1840   t:3375   f      :2160  
##           f:3152   s:2555   e      :1500            s      : 576  
##           k: 828   y:3244   y      :1072            y      : 576  
##           s:  32            w      :1040            a      : 400  
##           x:3655            b      : 168            l      : 400  
##                             (Other): 220            (Other): 483  
##  f        c        n.1            k        e        e.1      s.1     
##  a: 210   c:6811   b:5612   b      :1728   e:3515   ?:2480   f: 552  
##  f:7913   w:1312   n:2511   p      :1492   t:4608   b:3776   k:2372  
##                             w      :1202            c: 556   s:5175  
##                             n      :1048            e:1119   y:  24  
##                             g      : 752            r: 192           
##                             h      : 732                             
##                             (Other):1169                             
##  s.2            w             w.1       p.2      w.2      o       
##  f: 600   w      :4463   w      :4383   p:8123   n:  96   n:  36  
##  k:2304   p      :1872   p      :1872            o:  96   o:7487  
##  s:4935   g      : 576   g      : 576            w:7923   t: 600  
##  y: 284   n      : 448   n      : 512            y:   8           
##           b      : 432   b      : 432                             
##           o      : 192   o      : 192                             
##           (Other): 140   (Other): 156                             
##  p.3           k.1       s.3      u       
##  e:2776   w      :2388   a: 384   d:3148  
##  f:  48   n      :1968   c: 340   g:2148  
##  l:1296   k      :1871   n: 400   l: 832  
##  n:  36   h      :1632   s:1247   m: 292  
##  p:3967   r      :  72   v:4040   p:1144  
##           b      :  48   y:1712   u: 367  
##           (Other): 144            w: 192

The summary shows that every variable is categorical, so the number next to each category above is the count of rows that fall into that category.
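
The observation about categorical variables can be checked directly. The sketch below uses a tiny stand-in data frame with made-up values; the column names `p` and `x` mirror the first two columns of the summary, and the variable names `all_factors`, `n_levels`, and `class_counts` are illustrative.

```r
# Tiny stand-in for the full data frame (hypothetical values).
mushrooms <- data.frame(
  p = c("e", "p", "e", "e"),
  x = c("x", "b", "x", "f"),
  stringsAsFactors = TRUE
)

# TRUE when every column is a factor, i.e. categorical.
all_factors <- all(sapply(mushrooms, is.factor))

# Number of distinct categories in each column.
n_levels <- sapply(mushrooms, nlevels)

# Row counts per category for one column, as summary() reports them.
class_counts <- table(mushrooms$p)

all_factors
n_levels
class_counts
```
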

Now let’s check the data dictionary to see what’s there!

To read the data dictionary, I converted it into a text file, saved it in my root folder, and fetched the file from there.

Index Attribute Information
0 class edible=e,poisonous=p
1 cap-shape bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
2 cap-surface fibrous=f,grooves=g,scaly=y,smooth=s
3 cap-color brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
4 bruises? bruises=t,no=f
5 odor almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
6 gill-attachment attached=a,descending=d,free=f,notched=n
7 gill-spacing close=c,crowded=w,distant=d
8 gill-size broad=b,narrow=n
9 gill-color black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
10 stalk-shape enlarging=e,tapering=t
11 stalk-root bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
12 stalk-surface-above-ring fibrous=f,scaly=y,silky=k,smooth=s
13 stalk-surface-below-ring fibrous=f,scaly=y,silky=k,smooth=s
14 stalk-color-above-ring brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
15 stalk-color-below-ring brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
16 veil-type partial=p,universal=u
17 veil-color brown=n,orange=o,white=w,yellow=y
18 ring-number none=n,one=o,two=t
19 ring-type cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
20 spore-print-color black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
21 population abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
22 habitat grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
##   p x s n t p.1 f c n.1 k e e.1 s.1 s.2 w w.1 p.2 w.2 o p.3 k.1 s.3 u
## 1 e x s y t   a f c   b k e   c   s   s w   w   p   w o   p   n   n g
## 2 e b s w t   l f c   b n e   c   s   s w   w   p   w o   p   n   n m
## 3 p x y w t   p f c   n n e   e   s   s w   w   p   w o   p   k   s u
## 4 e x s g f   n f w   b k t   e   s   s w   w   p   w o   e   n   a g
## 5 e x y y t   a f c   b n e   c   s   s w   w   p   w o   p   k   n g
## 6 e b s w t   a f c   b g e   c   s   s w   w   p   w o   p   k   n m
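
The `label=code` layout of the data dictionary lends itself to building lookup tables programmatically. The following is only a sketch under assumptions: the dictionary lines are inlined without the Index column, and `parse_entry` and `lookup` are hypothetical names, not part of the original code.

```r
# Data dictionary lines in the layout shown above, without the Index column.
dict_lines <- c(
  "class edible=e,poisonous=p",
  "bruises? bruises=t,no=f",
  "ring-number none=n,one=o,two=t"
)

# Hypothetical helper: turn "label=code,label=code" into a named character
# vector whose names are the codes and whose values are the labels.
parse_entry <- function(line) {
  fields <- strsplit(line, " ", fixed = TRUE)[[1]]
  pairs <- strsplit(strsplit(fields[2], ",", fixed = TRUE)[[1]], "=", fixed = TRUE)
  labels <- vapply(pairs, function(p) p[1], character(1))
  codes <- vapply(pairs, function(p) p[2], character(1))
  setNames(labels, codes)
}

# One lookup vector per attribute, keyed by attribute name.
attribute_names <- vapply(
  strsplit(dict_lines, " ", fixed = TRUE),
  function(f) f[1],
  character(1)
)
lookup <- setNames(lapply(dict_lines, parse_entry), attribute_names)

lookup[["ring-number"]]
```
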

Rename the columns

Now let’s rename the columns. The first column is the class (e = edible, p = poisonous); the other 22 columns need to be renamed according to the attribute names in the data dictionary above.

Now let’s check the column names to see whether they have changed.

##  [1] "class"                 "capshape"             
##  [3] "capsurface"            "capcolor"             
##  [5] "bruises"               "odor"                 
##  [7] "gillattachment"        "gillspacing"          
##  [9] "gillsize"              "gillcolor"            
## [11] "stalkshape"            "stalkroot"            
## [13] "stalksurfaceabovering" "stalksurfacebelowring"
## [15] "stalkcolorabovering"   "stalkcolorbelowring"  
## [17] "veiltype"              "veilcolor"            
## [19] "ringnumber"            "ringtype"             
## [21] "sporeprintcolor"       "population"           
## [23] "habitat"
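
One way to perform the renaming and the check is to assign a character vector to `names()`. This is a minimal sketch, assuming the data frame is called `mushrooms`; a single inline record stands in for the real data, and `new_names` is an illustrative variable name.

```r
# Two inline records: the first is consumed as the header, as in the
# output shown earlier.
mushrooms <- read.csv(
  text = paste(
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u",
    "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g",
    sep = "\n"
  ),
  stringsAsFactors = TRUE
)

# Attribute names from the data dictionary, with hyphens and "?" removed.
new_names <- c(
  "class", "capshape", "capsurface", "capcolor", "bruises", "odor",
  "gillattachment", "gillspacing", "gillsize", "gillcolor",
  "stalkshape", "stalkroot", "stalksurfaceabovering",
  "stalksurfacebelowring", "stalkcolorabovering", "stalkcolorbelowring",
  "veiltype", "veilcolor", "ringnumber", "ringtype",
  "sporeprintcolor", "population", "habitat"
)

# Guard against a mismatch between the name vector and the column count.
stopifnot(length(new_names) == ncol(mushrooms))
names(mushrooms) <- new_names

names(mushrooms)
```
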

Let’s first create a subset of the columns whose category names we will change

##  class    ringtype ringnumber capsurface bruises 
##  e:4208   e:2776   n:  36     f:2320     f:4748  
##  p:3915   f:  48   o:7487     g:   4     t:3375  
##           l:1296   t: 600     s:2555             
##           n:  36              y:3244             
##           p:3967
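
Selecting the five columns summarised above can be done by indexing with a vector of column names. The sketch below uses a small stand-in data frame with made-up rows; `keep` and `mushroom_subset` are assumed names.

```r
# Small stand-in for the renamed data frame (hypothetical rows).
mushrooms <- data.frame(
  class      = c("e", "p", "e"),
  capshape   = c("x", "x", "b"),
  capsurface = c("s", "y", "f"),
  bruises    = c("t", "f", "t"),
  ringnumber = c("o", "o", "t"),
  ringtype   = c("p", "e", "p"),
  stringsAsFactors = TRUE
)

# Columns to keep: the class plus four descriptive attributes.
keep <- c("class", "ringtype", "ringnumber", "capsurface", "bruises")
mushroom_subset <- mushrooms[, keep]

summary(mushroom_subset)
```
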

Now that the columns are renamed, let’s replace the category codes with meaningful labels, because the single-letter codes do not tell us what each value represents. We will recode the class column and the four other variables in the subset with clear, descriptive names.

##        class            ringtype    ringnumber    capsurface  
##  Edible   :4208   evanescent:2776   none:  36   fibrous:2320  
##  Poisonous:3915   flaring   :  48   one :7487   grooves:   4  
##                   large     :1296   two : 600   scaly  :3244  
##                   none      :  36               smooth :2555  
##                   pendant   :3967                             
##     bruises    
##  bruises:3375  
##  no     :4748  
##                
##                
## 
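
One possible way to do this recoding is to index a named lookup vector with the codes and rebuild the factor, which yields alphabetically ordered levels like those in the summary above. This is a sketch, not necessarily the method used originally; `recode_codes` is a hypothetical helper and the data rows are made up.

```r
# Small stand-in for the subset (hypothetical rows).
mushroom_subset <- data.frame(
  class      = c("e", "p", "e", "p"),
  ringnumber = c("o", "o", "t", "n"),
  bruises    = c("t", "f", "f", "t"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: map each code to its label through a named vector,
# then rebuild the factor. Codes missing from the lookup would become NA.
recode_codes <- function(x, lookup) {
  factor(unname(lookup[as.character(x)]))
}

mushroom_subset$class <- recode_codes(
  mushroom_subset$class, c(e = "Edible", p = "Poisonous")
)
mushroom_subset$ringnumber <- recode_codes(
  mushroom_subset$ringnumber, c(n = "none", o = "one", t = "two")
)
mushroom_subset$bruises <- recode_codes(
  mushroom_subset$bruises, c(t = "bruises", f = "no")
)

summary(mushroom_subset)
```
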

To check that the changes were made, let’s look at a random sample of rows

##          class   ringtype ringnumber capsurface bruises
## 6620 Poisonous evanescent        one      scaly      no
## 3148 Poisonous    pendant        one    fibrous      no
## 6238 Poisonous evanescent        one     smooth      no
## 7479    Edible    pendant        two    fibrous      no
## 3077    Edible    pendant        one    fibrous bruises
## 5958 Poisonous      large        one      scaly      no
## 1386    Edible evanescent        one    fibrous      no
## 5976    Edible evanescent        two     smooth bruises
## 2929    Edible    pendant        one    fibrous bruises
## 5452 Poisonous      large        one      scaly      no
## 7350    Edible    pendant        two     smooth      no
## 7579 Poisonous evanescent        one      scaly      no
## 3653 Poisonous      large        one    fibrous      no
## 8040    Edible    pendant        one     smooth      no
## 5651 Poisonous      large        one      scaly      no
## 851     Edible    pendant        one      scaly bruises
## 6061 Poisonous evanescent        one      scaly      no
## 797     Edible evanescent        one     smooth      no
## 2605 Poisonous      large        one    fibrous      no
## 7388    Edible    pendant        one     smooth      no
## 3888    Edible    pendant        one      scaly bruises
## 7360    Edible    pendant        two     smooth bruises
## 6141 Poisonous evanescent        one      scaly      no
## 7102 Poisonous evanescent        one      scaly      no
## 5740 Poisonous    pendant        one     smooth bruises
## 603     Edible    pendant        one      scaly bruises
## 6414 Poisonous evanescent        one     smooth      no
## 7636    Edible    pendant        one     smooth      no
## 1874    Edible    pendant        one      scaly bruises
## 2224    Edible    pendant        one    fibrous bruises
## 4008    Edible    pendant        one    fibrous bruises
## 1751    Edible    pendant        one      scaly bruises
## 1298    Edible    pendant        one      scaly bruises
## 2653    Edible    pendant        one      scaly bruises
## 3073 Poisonous      large        one    fibrous      no
## 2621    Edible    pendant        one      scaly bruises
## 4555 Poisonous      large        one    fibrous      no
## 2209    Edible    pendant        one      scaly bruises
## 4040 Poisonous      large        one    fibrous      no
## 4406 Poisonous      large        one      scaly      no
## 2029    Edible    pendant        one      scaly bruises
## 2926 Poisonous      large        one    fibrous      no
## 8049 Poisonous evanescent        one      scaly      no
## 5107 Poisonous    pendant        one    grooves bruises
## 7896 Poisonous evanescent        one      scaly      no
## 3219    Edible    pendant        one      scaly bruises
## 7442 Poisonous evanescent        one     smooth      no
## 5571 Poisonous    pendant        two      scaly bruises
## 5872 Poisonous evanescent        one      scaly      no
## 1466    Edible    pendant        one    fibrous bruises
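
A random sample of rows like the one above can be drawn with `sample()`. The sketch below is an assumption about how this might be done: it draws 5 rows from a small made-up data frame, where the output above shows 50, and the seed value is arbitrary.

```r
set.seed(42)  # arbitrary seed so the sample is reproducible

# Small stand-in for the recoded subset (hypothetical rows).
mushroom_subset <- data.frame(
  class   = factor(rep(c("Edible", "Poisonous"), times = 10)),
  bruises = factor(rep(c("bruises", "no"), each = 10))
)

# Pick 5 distinct row indices at random and show those rows.
sampled <- mushroom_subset[sample(nrow(mushroom_subset), 5), ]

sampled
```
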