Assignment

Very often, we’re tasked with taking data in one form and transforming it for easier downstream analysis. We will spend several weeks in this course on tidying and transformation operations. Some of this work could be done in SQL or R (or Python or…). Here, you are asked to use R; you may use base functions or packages as you like.

Mushrooms Dataset

A famous (if slightly moldy) dataset about mushrooms can be found in the UCI repository here: https://archive.ics.uci.edu/ml/datasets/Mushroom.

Because this dataset is so well known in the data science community, it is a good choice for comparative benchmarking. For example, if someone were working to build a better decision tree algorithm (or other predictive classifier) to analyze categorical data, this dataset could be useful. A typical problem (which is beyond the scope of this assignment!) is to answer the question, “Which other attribute or attributes are the best predictors of whether a particular mushroom is poisonous or edible?”

Your task is to study the dataset and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there! You should take the data, and create a data frame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous and three or four other columns. You should also add meaningful column names and replace the abbreviations used in the data—for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.

Import Data Dictionary & Dataset

Import mushrooms description and dataset from GitHub

path <- 'https://raw.githubusercontent.com/dhairavc/DATA607/master/Assignment%201%20agaricus-lepiota.data'
path2 <- 'https://raw.githubusercontent.com/dhairavc/DATA607/master/Assignment1%20-%20agaricus-lepiota.names'

readLines(path2)
##   [1] "1. Title: Mushroom Database"                                                  
##   [2] ""                                                                             
##   [3] "2. Sources: "                                                                 
##   [4] "    (a) Mushroom records drawn from The Audubon Society Field Guide to North" 
##   [5] "        American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred"   
##   [6] "        A. Knopf"                                                             
##   [7] "    (b) Donor: Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)"            
##   [8] "    (c) Date: 27 April 1987"                                                  
##   [9] ""                                                                             
##  [10] "3. Past Usage:"                                                               
##  [11] "    1. Schlimmer,J.S. (1987). Concept Acquisition Through Representational"   
##  [12] "       Adjustment (Technical Report 87-19).  Doctoral disseration, Department"
##  [13] "       of Information and Computer Science, University of California, Irvine."
##  [14] "       --- STAGGER: asymptoted to 95% classification accuracy after reviewing"
##  [15] "           1000 instances."                                                   
##  [16] "    2. Iba,W., Wogulis,J., & Langley,P. (1988).  Trading off Simplicity"      
##  [17] "       and Coverage in Incremental Concept Learning. In Proceedings of "      
##  [18] "       the 5th International Conference on Machine Learning, 73-79."          
##  [19] "       Ann Arbor, Michigan: Morgan Kaufmann.  "                               
##  [20] "       -- approximately the same results with their HILLARY algorithm    "    
##  [21] "    3. In the following references a set of rules (given below) were"         
##  [22] "\tlearned for this data set which may serve as a point of"                     
##  [23] "\tcomparison for other researchers."                                           
##  [24] ""                                                                             
##  [25] "\tDuch W, Adamczak R, Grabczewski K (1996) Extraction of logical rules"        
##  [26] "\tfrom training data using backpropagation networks, in: Proc. of the"         
##  [27] "\tThe 1st Online Workshop on Soft Computing, 19-30.Aug.1996, pp. 25-30,"       
##  [28] "\tavailable on-line at: http://www.bioele.nuee.nagoya-u.ac.jp/wsc1/"           
##  [29] ""                                                                             
##  [30] "\tDuch W, Adamczak R, Grabczewski K, Ishikawa M, Ueda H, Extraction of"        
##  [31] "\tcrisp logical rules using constrained backpropagation networks -"            
##  [32] "\tcomparison of two new approaches, in: Proc. of the European Symposium"       
##  [33] "\ton Artificial Neural Networks (ESANN'97), Bruge, Belgium 16-18.4.1997,"      
##  [34] "\tpp. xx-xx"                                                                   
##  [35] ""                                                                             
##  [36] "\tWlodzislaw Duch, Department of Computer Methods, Nicholas Copernicus"        
##  [37] "\tUniversity, 87-100 Torun, Grudziadzka 5, Poland"                             
##  [38] "\te-mail: duch@phys.uni.torun.pl"                                              
##  [39] "\tWWW     http://www.phys.uni.torun.pl/kmk/"                                   
##  [40] "\t"                                                                            
##  [41] "\tDate: Mon, 17 Feb 1997 13:47:40 +0100"                                       
##  [42] "\tFrom: Wlodzislaw Duch <duch@phys.uni.torun.pl>"                              
##  [43] "\tOrganization: Dept. of Computer Methods, UMK"                                
##  [44] ""                                                                             
##  [45] "\tI have attached a file containing logical rules for mushrooms."              
##  [46] "\tIt should be helpful for other people since only in the last year I"         
##  [47] "\thave seen about 10 papers analyzing this dataset and obtaining quite"        
##  [48] "\tcomplex rules. We will try to contribute other results later."               
##  [49] ""                                                                             
##  [50] "\tWith best regards, Wlodek Duch"                                              
##  [51] "\t________________________________________________________________"            
##  [52] ""                                                                             
##  [53] "\tLogical rules for the mushroom data sets."                                   
##  [54] ""                                                                             
##  [55] "\tLogical rules given below seem to be the simplest possible for the"          
##  [56] "\tmushroom dataset and therefore should be treated as benchmark results."      
##  [57] ""                                                                             
##  [58] "\tDisjunctive rules for poisonous mushrooms, from most general"                
##  [59] "\tto most specific:"                                                           
##  [60] ""                                                                             
##  [61] "\tP_1) odor=NOT(almond.OR.anise.OR.none)"                                      
##  [62] "\t     120 poisonous cases missed, 98.52% accuracy"                            
##  [63] ""                                                                             
##  [64] "\tP_2) spore-print-color=green"                                                
##  [65] "\t     48 cases missed, 99.41% accuracy"                                       
##  [66] "         "                                                                    
##  [67] "\tP_3) odor=none.AND.stalk-surface-below-ring=scaly.AND."                      
##  [68] "\t          (stalk-color-above-ring=NOT.brown) "                               
##  [69] "\t     8 cases missed, 99.90% accuracy"                                        
##  [70] "         "                                                                    
##  [71] "\tP_4) habitat=leaves.AND.cap-color=white"                                     
##  [72] "\t         100% accuracy     "                                                 
##  [73] ""                                                                             
##  [74] "\tRule P_4) may also be"                                                       
##  [75] ""                                                                             
##  [76] "\tP_4') population=clustered.AND.cap_color=white"                              
##  [77] ""                                                                             
##  [78] "\tThese rule involve 6 attributes (out of 22). Rules for edible"               
##  [79] "\tmushrooms are obtained as negation of the rules given above, for"            
##  [80] "\texample the rule:"                                                           
##  [81] ""                                                                             
##  [82] "\todor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green"              
##  [83] ""                                                                             
##  [84] "\tgives 48 errors, or 99.41% accuracy on the whole dataset."                   
##  [85] ""                                                                             
##  [86] "\tSeveral slightly more complex variations on these rules exist,"              
##  [87] "\tinvolving other attributes, such as gill_size, gill_spacing,"                
##  [88] "\tstalk_surface_above_ring, but the rules given above are the simplest"        
##  [89] "\twe have found."                                                              
##  [90] ""                                                                             
##  [91] ""                                                                             
##  [92] "4. Relevant Information:"                                                     
##  [93] "    This data set includes descriptions of hypothetical samples"              
##  [94] "    corresponding to 23 species of gilled mushrooms in the Agaricus and"      
##  [95] "    Lepiota Family (pp. 500-525).  Each species is identified as"             
##  [96] "    definitely edible, definitely poisonous, or of unknown edibility and"     
##  [97] "    not recommended.  This latter class was combined with the poisonous"      
##  [98] "    one.  The Guide clearly states that there is no simple rule for"          
##  [99] "    determining the edibility of a mushroom; no rule like ``leaflets"         
## [100] "    three, let it be'' for Poisonous Oak and Ivy."                            
## [101] ""                                                                             
## [102] "5. Number of Instances: 8124"                                                 
## [103] ""                                                                             
## [104] "6. Number of Attributes: 22 (all nominally valued)"                           
## [105] ""                                                                             
## [106] "7. Attribute Information: (classes: edible=e, poisonous=p)"                   
## [107] "     1. cap-shape:                bell=b,conical=c,convex=x,flat=f,"          
## [108] "                                  knobbed=k,sunken=s"                         
## [109] "     2. cap-surface:              fibrous=f,grooves=g,scaly=y,smooth=s"       
## [110] "     3. cap-color:                brown=n,buff=b,cinnamon=c,gray=g,green=r,"  
## [111] "                                  pink=p,purple=u,red=e,white=w,yellow=y"     
## [112] "     4. bruises?:                 bruises=t,no=f"                             
## [113] "     5. odor:                     almond=a,anise=l,creosote=c,fishy=y,foul=f,"
## [114] "                                  musty=m,none=n,pungent=p,spicy=s"           
## [115] "     6. gill-attachment:          attached=a,descending=d,free=f,notched=n"   
## [116] "     7. gill-spacing:             close=c,crowded=w,distant=d"                
## [117] "     8. gill-size:                broad=b,narrow=n"                           
## [118] "     9. gill-color:               black=k,brown=n,buff=b,chocolate=h,gray=g," 
## [119] "                                  green=r,orange=o,pink=p,purple=u,red=e,"    
## [120] "                                  white=w,yellow=y"                           
## [121] "    10. stalk-shape:              enlarging=e,tapering=t"                     
## [122] "    11. stalk-root:               bulbous=b,club=c,cup=u,equal=e,"            
## [123] "                                  rhizomorphs=z,rooted=r,missing=?"           
## [124] "    12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s"         
## [125] "    13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s"         
## [126] "    14. stalk-color-above-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o," 
## [127] "                                  pink=p,red=e,white=w,yellow=y"              
## [128] "    15. stalk-color-below-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o," 
## [129] "                                  pink=p,red=e,white=w,yellow=y"              
## [130] "    16. veil-type:                partial=p,universal=u"                      
## [131] "    17. veil-color:               brown=n,orange=o,white=w,yellow=y"          
## [132] "    18. ring-number:              none=n,one=o,two=t"                         
## [133] "    19. ring-type:                cobwebby=c,evanescent=e,flaring=f,large=l," 
## [134] "                                  none=n,pendant=p,sheathing=s,zone=z"        
## [135] "    20. spore-print-color:        black=k,brown=n,buff=b,chocolate=h,green=r,"
## [136] "                                  orange=o,purple=u,white=w,yellow=y"         
## [137] "    21. population:               abundant=a,clustered=c,numerous=n,"         
## [138] "                                  scattered=s,several=v,solitary=y"           
## [139] "    22. habitat:                  grasses=g,leaves=l,meadows=m,paths=p,"      
## [140] "                                  urban=u,waste=w,woods=d"                    
## [141] ""                                                                             
## [142] "8. Missing Attribute Values: 2480 of them (denoted by \"?\"), all for"        
## [143] "   attribute #11."                                                            
## [144] ""                                                                             
## [145] "9. Class Distribution: "                                                      
## [146] "    --    edible: 4208 (51.8%)"                                               
## [147] "    -- poisonous: 3916 (48.2%)"                                               
## [148] "    --     total: 8124 instances"
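
The attribute entries in the data dictionary above follow a `label=code` pattern, so a code-to-label lookup can be derived from the text rather than typed by hand. The following is a minimal sketch under that assumption; the `entry` string is the odor entry joined onto one line, and the names `entry`, `pairs`, and `odor_map` are illustrative only.

```r
# Odor entry from the data dictionary above, joined onto a single line
entry <- "almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s"

# Split into "label=code" items, then split each item at "="
pairs  <- strsplit(strsplit(entry, ",", fixed = TRUE)[[1]], "=", fixed = TRUE)
labels <- vapply(pairs, `[`, character(1), 1)
codes  <- vapply(pairs, `[`, character(1), 2)

# Named vector: names are the one-letter codes, values are the full labels
odor_map <- setNames(labels, codes)
odor_map[c("a", "l", "p")]
```

Indexing `odor_map` by code returns the matching label, which is one way to drive the recoding step later on.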
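# Caution: the file has no header line, so the default header = TRUE turns the
# first record into column names (8123 of the 8124 records remain as rows)
mushrooms <- read.csv(path)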
head(mushrooms)
##   p x s n t p.1 f c n.1 k e e.1 s.1 s.2 w w.1 p.2 w.2 o p.3 k.1 s.3 u
## 1 e x s y t   a f c   b k e   c   s   s w   w   p   w o   p   n   n g
## 2 e b s w t   l f c   b n e   c   s   s w   w   p   w o   p   n   n m
## 3 p x y w t   p f c   n n e   e   s   s w   w   p   w o   p   k   s u
## 4 e x s g f   n f w   b k t   e   s   s w   w   p   w o   e   n   a g
## 5 e x y y t   a f c   b n e   c   s   s w   w   p   w o   p   k   n g
## 6 e b s w t   a f c   b g e   c   s   s w   w   p   w o   p   k   n m
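
The header row in the output above (`p x s n t p.1 ...`) is actually the first record of the file: the data file has no header line, so the default `header = TRUE` consumes one observation. A minimal sketch of reading with `header = FALSE` follows. It uses the first two records, copied from the output above, as inline text in place of the URL; the names `raw` and `demo` are illustrative only.

```r
# First two records of the file, copied from the output above
# (inline text stands in for the URL so the sketch runs offline)
raw <- c(
  "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u",
  "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g"
)

# header = FALSE keeps the first record as data; columns get default names V1..V23.
# colClasses = "character" reads every one-letter code as plain text.
demo <- read.csv(text = raw, header = FALSE, colClasses = "character")

dim(demo)   # rows and columns read
demo$V1     # class codes of the two records
```

Applied to the full file, the same arguments would be expected to keep all 8124 records listed in the data dictionary.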

Update Column Names

Give more meaningful header names to the dataset

# Column names follow the attribute order in the data dictionary, with the class column first
names(mushrooms) <- c(
  "Class", "Cap-Shape", "Cap-Surface", "Cap-Color", "Bruises", "Odor",
  "Gill-Attachment", "Gill-Spacing", "Gill-Size", "Gill-Color",
  "Stalk-Shape", "Stalk-Root",
  "Stalk-Surface-Above-Ring", "Stalk-Surface-Below-Ring",
  "Stalk-Color-Above-Ring", "Stalk-Color-Below-Ring",
  "Veil-Type", "Veil-Color", "Ring-Number", "Ring-Type",
  "Spore-Print-Color", "Population", "Habitat"
)

head(mushrooms)
##   Class Cap-Shape Cap-Surface Cap-Color Bruises Odor Gill-Attachment
## 1     e         x           s         y       t    a               f
## 2     e         b           s         w       t    l               f
## 3     p         x           y         w       t    p               f
## 4     e         x           s         g       f    n               f
## 5     e         x           y         y       t    a               f
## 6     e         b           s         w       t    a               f
##   Gill-Spacing Gill-Size Gill-Color Stalk-Shape Stalk-Root
## 1            c         b          k           e          c
## 2            c         b          n           e          c
## 3            c         n          n           e          e
## 4            w         b          k           t          e
## 5            c         b          n           e          c
## 6            c         b          g           e          c
##   Stalk-Surface-Above-Ring Stalk-Surface-Below-Ring Stalk-Color-Above-Ring
## 1                        s                        s                      w
## 2                        s                        s                      w
## 3                        s                        s                      w
## 4                        s                        s                      w
## 5                        s                        s                      w
## 6                        s                        s                      w
##   Stalk-Color-Below-Ring Veil-Type Veil-Color Ring-Number Ring-Type
## 1                      w         p          w           o         p
## 2                      w         p          w           o         p
## 3                      w         p          w           o         p
## 4                      w         p          w           o         e
## 5                      w         p          w           o         p
## 6                      w         p          w           o         p
##   Spore-Print-Color Population Habitat
## 1                 n          n       g
## 2                 n          n       m
## 3                 k          s       u
## 4                 n          a       g
## 5                 k          n       g
## 6                 k          n       m
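
Names that contain hyphens, such as Cap-Shape, are not syntactic in R, so they must be wrapped in backticks when used with `$`. One possible alternative, sketched below, is to replace the hyphens with underscores. The `toy` data frame and its values are hypothetical and only illustrate the difference in access.

```r
# Illustrative subset of the header names used above
hyphenated <- c("Class", "Cap-Shape", "Stalk-Surface-Above-Ring")

# Replace hyphens with underscores to obtain syntactic names
syntactic <- gsub("-", "_", hyphenated, fixed = TRUE)
syntactic

# Toy one-row data frame (hypothetical values) to compare the two naming styles
toy <- data.frame(a = "e", b = "x", c = "s", stringsAsFactors = FALSE)

names(toy) <- hyphenated
toy$`Cap-Shape`   # backticks are required with the hyphenated name

names(toy) <- syntactic
toy$Cap_Shape     # plain access works with the underscore name
```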

Subset Dataframe

Create a subset of the original dataset

subshrooms <- mushrooms[c("Class", "Odor", "Population", "Habitat")]
head(subshrooms)
##   Class Odor Population Habitat
## 1     e    a          n       g
## 2     e    l          n       m
## 3     p    p          s       u
## 4     e    n          a       g
## 5     e    a          n       g
## 6     e    a          n       m
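
Selecting columns by name, as above, fails with an "undefined columns selected" error if a name is misspelled. A small sketch of checking the requested names first is shown below, using a toy data frame built from three records in the output above; `toy`, `keep`, and `sub_toy` are illustrative names.

```r
# Toy stand-in for the renamed data frame (three records from the output above)
toy <- data.frame(
  Class       = c("e", "e", "p"),
  `Cap-Shape` = c("x", "b", "x"),
  Odor        = c("a", "l", "p"),
  Population  = c("n", "n", "s"),
  Habitat     = c("g", "m", "u"),
  check.names = FALSE,          # keep the hyphenated name as written
  stringsAsFactors = FALSE
)

keep <- c("Class", "Odor", "Population", "Habitat")

# Stop with an error if any requested column is missing
stopifnot(all(keep %in% names(toy)))

# Same list-style selection as in the text
sub_toy <- toy[keep]
names(sub_toy)
```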

Name Observations

Give useful names to observation values

str(subshrooms)
## 'data.frame':    8123 obs. of  4 variables:
##  $ Class     : Factor w/ 2 levels "e","p": 1 1 2 1 1 1 1 2 1 1 ...
##  $ Odor      : Factor w/ 9 levels "a","c","f","l",..: 1 4 7 6 1 1 4 7 1 4 ...
##  $ Population: Factor w/ 6 levels "a","c","n","s",..: 3 3 4 1 3 3 4 5 4 3 ...
##  $ Habitat   : Factor w/ 7 levels "d","g","l","m",..: 2 4 6 2 2 4 4 2 4 2 ...
levels(subshrooms[,1]) <- c("edible", "poisonous")
levels(subshrooms[,2]) <- c("almond","creosote", "foul", "anise", "musty", "none", "pungent", "spicy", "fishy")
levels(subshrooms[,3]) <- c("abundant", "clustered", "numerous", "scattered", "several", "solitary")
levels(subshrooms[,4]) <- c("woods", "grasses", "leaves", "meadows", "paths", "urban", "waste")
str(subshrooms)
## 'data.frame':    8123 obs. of  4 variables:
##  $ Class     : Factor w/ 2 levels "edible","poisonous": 1 1 2 1 1 1 1 2 1 1 ...
##  $ Odor      : Factor w/ 9 levels "almond","creosote",..: 1 4 7 6 1 1 4 7 1 4 ...
##  $ Population: Factor w/ 6 levels "abundant","clustered",..: 3 3 4 1 3 3 4 5 4 3 ...
##  $ Habitat   : Factor w/ 7 levels "woods","grasses",..: 2 4 6 2 2 4 4 2 4 2 ...
head(subshrooms)
##       Class    Odor Population Habitat
## 1    edible  almond   numerous grasses
## 2    edible   anise   numerous meadows
## 3 poisonous pungent  scattered   urban
## 4    edible    none   abundant grasses
## 5    edible  almond   numerous grasses
## 6    edible  almond   numerous meadows
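
Assigning `levels()` by position relies on the alphabetical order of the one-letter codes and on the columns being factors; since R 4.0, `read.csv` returns character columns by default, in which case the approach above would need an explicit conversion. A minimal sketch of an alternative that matches on the codes themselves is shown below. The lookup values are copied from the data dictionary; `class_map`, `odor_map`, and `toy` are illustrative names.

```r
# Code -> label lookups, copied from the data dictionary
class_map <- c(e = "edible", p = "poisonous")
odor_map  <- c(a = "almond", l = "anise", c = "creosote", y = "fishy", f = "foul",
               m = "musty", n = "none", p = "pungent", s = "spicy")

# Toy records taken from the first rows shown above
toy <- data.frame(Class = c("e", "e", "p", "e"),
                  Odor  = c("a", "l", "p", "n"),
                  stringsAsFactors = FALSE)

# Index each lookup by code; as.character() lets this work for factor columns too
toy$Class <- unname(class_map[as.character(toy$Class)])
toy$Odor  <- unname(odor_map[as.character(toy$Odor)])
toy
```

Because each code is matched by name, the result does not depend on the order in which the labels are listed.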