R-Week03-Assignment

Your task is to study the Mushrooms Dataset in the UCI repository and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there! You should take the data and create a data frame with a subset of the columns (and, if you like, rows) in the dataset. You should include the column that indicates edible or poisonous and three or four other columns. You should also add meaningful column names and replace the abbreviations used in the data. For example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.

I started by reading the data file into a data frame. The header argument needs to be set to FALSE, because this file does not have a header row. Setting na.strings to “?” turns the question marks in column 12 into NA values.

In most cases I would want a complete copy of any data I was examining, in the format most useful to me. I am still learning the advantages of having factors in a data frame versus plain character vectors. It seems to me that having the levels for these encoded columns would be useful over the long haul, because they would serve as a built-in data dictionary. This is why I did not use stringsAsFactors, as.is, or colClasses during the initial read.

df <- read.csv("c:\\Users\\Robert\\Documents\\CUNY\\Bridge Classes\\R Programming\\Week3\\agaricus-lepiota.data.txt", header = FALSE, na.strings = "?")

summary(df)

##  V1       V2       V3             V4       V5             V6      
##  e:4208   b: 452   f:2320   n      :2284   f:4748   n      :3528  
##  p:3916   c:   4   g:   4   g      :1840   t:3376   f      :2160  
##           f:3152   s:2556   e      :1500            s      : 576  
##           k: 828   y:3244   y      :1072            y      : 576  
##           s:  32            w      :1040            a      : 400  
##           x:3656            b      : 168            l      : 400  
##                             (Other): 220            (Other): 484  
##  V7       V8       V9            V10       V11        V12       V13     
##  a: 210   c:6812   b:5612   b      :1728   e:3516   b   :3776   f: 552  
##  f:7914   w:1312   n:2512   p      :1492   t:4608   c   : 556   k:2372  
##                             w      :1202            e   :1120   s:5176  
##                             n      :1048            r   : 192   y:  24  
##                             g      : 752            NA's:2480           
##                             h      : 732                                
##                             (Other):1170                                
##  V14           V15            V16       V17      V18      V19     
##  f: 600   w      :4464   w      :4384   p:8124   n:  96   n:  36  
##  k:2304   p      :1872   p      :1872            o:  96   o:7488  
##  s:4936   g      : 576   g      : 576            w:7924   t: 600  
##  y: 284   n      : 448   n      : 512            y:   8           
##           b      : 432   b      : 432                             
##           o      : 192   o      : 192                             
##           (Other): 140   (Other): 156                             
##  V20           V21       V22      V23     
##  e:2776   w      :2388   a: 384   d:3148  
##  f:  48   n      :1968   c: 340   g:2148  
##  l:1296   k      :1872   n: 400   l: 832  
##  n:  36   h      :1632   s:1248   m: 292  
##  p:3968   r      :  72   v:4040   p:1144  
##           b      :  48   y:1712   u: 368  
##           (Other): 144            w: 192
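The effect of the two read arguments can be checked on a tiny inline sample. This is only a sketch: the three rows below are made up, and sample_text and sample_df are names introduced here for illustration. stringsAsFactors = TRUE is passed explicitly because, from R 4.0.0 on, read.csv no longer converts strings to factors by default.

```r
# Three made-up rows in the same comma-separated, header-less layout
sample_text <- "p,x,s,e\ne,b,s,?\ne,x,y,c"

# header = FALSE gives default names V1, V2, ...; "?" is read as NA
sample_df <- read.csv(text = sample_text, header = FALSE,
                      na.strings = "?", stringsAsFactors = TRUE)

str(sample_df)              # every column is a factor
sum(is.na(sample_df$V4))    # counts the "?" entries that became NA
levels(sample_df$V1)        # the codes that actually occur in the column
```

The levels() call is what makes a factor act like a small data dictionary: it lists every code present in the column.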

For this assignment I could subset the data frame first and name only the columns I plan to keep. However, my normal course of action would be to complete the dataset. At work, I might find out later that I need more columns and not want to go back to configure them. Here are all the column names assigned to their columns.

fields <- c("edible-poison", "cap-shape", "cap-surface", "cap-color", "bruises", "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape", "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color", "ring-number", "ring-type", "spore-print-color", "population", "habitat")

colnames(df) <- fields

head(df)

##   edible-poison cap-shape cap-surface cap-color bruises odor
## 1             p         x           s         n       t    p
## 2             e         x           s         y       t    a
## 3             e         b           s         w       t    l
## 4             p         x           y         w       t    p
## 5             e         x           s         g       f    n
## 6             e         x           y         y       t    a
##   gill-attachment gill-spacing gill-size gill-color stalk-shape stalk-root
## 1               f            c         n          k           e          e
## 2               f            c         b          k           e          c
## 3               f            c         b          n           e          c
## 4               f            c         n          n           e          e
## 5               f            w         b          k           t          e
## 6               f            c         b          n           e          c
##   stalk-surface-above-ring stalk-surface-below-ring stalk-color-above-ring
## 1                        s                        s                      w
## 2                        s                        s                      w
## 3                        s                        s                      w
## 4                        s                        s                      w
## 5                        s                        s                      w
## 6                        s                        s                      w
##   stalk-color-below-ring veil-type veil-color ring-number ring-type
## 1                      w         p          w           o         p
## 2                      w         p          w           o         p
## 3                      w         p          w           o         p
## 4                      w         p          w           o         p
## 5                      w         p          w           o         e
## 6                      w         p          w           o         p
##   spore-print-color population habitat
## 1                 k          s       u
## 2                 n          n       g
## 3                 n          n       m
## 4                 k          s       u
## 5                 n          a       g
## 6                 k          n       g
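One side effect of hyphenated column names is that they are not syntactic R names, so they have to be quoted whenever they follow `$`. The sketch below uses a made-up two-column data frame (demo is a hypothetical name) to show backtick quoting and, as one possible alternative that is not part of the assignment, swapping hyphens for underscores.

```r
# Made-up data frame standing in for the first two mushroom columns
demo <- data.frame(V1 = c("p", "e"), V2 = c("x", "b"))
colnames(demo) <- c("edible-poison", "cap-shape")

# A hyphenated name must be quoted; backticks are the usual way
demo$`edible-poison`

# Hypothetical alternative: replace hyphens with underscores
names(demo) <- gsub("-", "_", names(demo), fixed = TRUE)
demo$edible_poison          # no quoting needed now
```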

I could apply the same thinking to the field contents, and in most cases that is probably what I would do to complete the dataset. In this case there are 23 fields, with some having as many as 10 different possible values. For the assignment I decided that discretion is the better part of valor and subsetted first, before replacing the coded values.

small.df <- subset(df, select = c("edible-poison", "cap-shape", "cap-surface", "bruises", "odor"))
head(small.df)

##   edible-poison cap-shape cap-surface bruises odor
## 1             p         x           s       t    p
## 2             e         x           s       t    a
## 3             e         b           s       t    l
## 4             p         x           y       t    p
## 5             e         x           s       f    n
## 6             e         x           y       t    a
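Selecting columns by name can also be written with bracket indexing. The following sketch, on a made-up data frame (demo, by_subset, and by_bracket are illustrative names), shows the two forms side by side.

```r
# Made-up data frame with three columns
demo <- data.frame(a = 1:3, b = c("x", "y", "z"), c = c(TRUE, FALSE, TRUE))

# subset() with select, as used above
by_subset <- subset(demo, select = c("a", "c"))

# Equivalent selection with bracket indexing
by_bracket <- demo[, c("a", "c")]

names(by_subset)
names(by_bracket)
```

Both forms keep every row and only the named columns. subset() is convenient interactively, while bracket indexing is the form usually recommended inside functions.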

Replacing the “code” letters with words in each column requires two steps. First we add the words to the factor’s levels, and then we can replace each letter with the more meaningful full word. The levels step could be avoided if we made the columns character vectors rather than factors. I still need to figure out whether the extra work is usually worth it.

# Column 1 field values: add the new levels to the factor before assigning them
levels(small.df$'edible-poison') <- c(levels(small.df$'edible-poison'), "edible", "poison")
small.df$'edible-poison'[small.df$'edible-poison' == "e"] <- "edible"
small.df$'edible-poison'[small.df$'edible-poison' == "p"] <- "poison"

# Column 2 field values
levels(small.df$'cap-shape') <- c(levels(small.df$'cap-shape'), "bell", "conical", "convex", "flat", "knobbed", "sunken")
small.df$'cap-shape'[small.df$'cap-shape' == "b"] <- "bell"
small.df$'cap-shape'[small.df$'cap-shape' == "c"] <- "conical"
small.df$'cap-shape'[small.df$'cap-shape' == "x"] <- "convex"
small.df$'cap-shape'[small.df$'cap-shape' == "f"] <- "flat"
small.df$'cap-shape'[small.df$'cap-shape' == "k"] <- "knobbed"
small.df$'cap-shape'[small.df$'cap-shape' == "s"] <- "sunken"

# Column 3
levels(small.df$'cap-surface') <- c(levels(small.df$'cap-surface'), "fibrous", "grooves", "scaly", "smooth")
small.df$'cap-surface'[small.df$'cap-surface' == "f"] <- "fibrous"
small.df$'cap-surface'[small.df$'cap-surface' == "g"] <- "grooves"
small.df$'cap-surface'[small.df$'cap-surface' == "y"] <- "scaly"
small.df$'cap-surface'[small.df$'cap-surface' == "s"] <- "smooth"

# Column 6 of the original data (odor)
levels(small.df$'odor') <- c(levels(small.df$'odor'), "almond", "anise", "creosote", "fishy", "foul", "musty", "none", "pungent", "spicy")
small.df$'odor'[small.df$'odor' == "a"] <- "almond"
small.df$'odor'[small.df$'odor' == "l"] <- "anise"
small.df$'odor'[small.df$'odor' == "c"] <- "creosote"
small.df$'odor'[small.df$'odor' == "y"] <- "fishy"
small.df$'odor'[small.df$'odor' == "f"] <- "foul"
small.df$'odor'[small.df$'odor' == "m"] <- "musty"
small.df$'odor'[small.df$'odor' == "n"] <- "none"
small.df$'odor'[small.df$'odor' == "p"] <- "pungent"
small.df$'odor'[small.df$'odor' == "s"] <- "spicy"

tail(small.df)

##      edible-poison cap-shape cap-surface bruises  odor
## 8119        poison   knobbed       scaly       f  foul
## 8120        edible   knobbed      smooth       f  none
## 8121        edible    convex      smooth       f  none
## 8122        edible      flat      smooth       f  none
## 8123        poison   knobbed       scaly       f fishy
## 8124        edible    convex      smooth       f  none
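The two-step replacement can also be collapsed into a single step by renaming the factor levels directly. This is a sketch of that alternative on a made-up vector, not the method used above; codes, by_name, and by_position are illustrative names.

```r
# Made-up factor using the same codes as the edible-poison column
codes <- factor(c("p", "e", "e", "p", "e"))

# Named-list form: each new label is mapped to the old code it replaces
by_name <- codes
levels(by_name) <- list(edible = "e", poison = "p")

# Positional form: the new labels must follow the existing level order,
# which is alphabetical here ("e", then "p")
by_position <- codes
levels(by_position) <- c("edible", "poison")

as.character(by_name)
levels(by_name)
```

Renaming levels this way changes every matching value at once and leaves none of the original letter codes behind as levels. The named-list form is less error-prone than the positional form because it does not rely on knowing the level order.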

This leaves bruises (column 5 of the original data, column 4 of small.df), which is a true-or-false field. I could leave t and f, which are easy to understand, spell out true and false, or convert them to logical values. Here is how I made this field logical.

test <- small.df[[4]] == 't'
small.df[[4]] <- test
tail(small.df)

##      edible-poison cap-shape cap-surface bruises  odor
## 8119        poison   knobbed       scaly   FALSE  foul
## 8120        edible   knobbed      smooth   FALSE  none
## 8121        edible    convex      smooth   FALSE  none
## 8122        edible      flat      smooth   FALSE  none
## 8123        poison   knobbed       scaly   FALSE fishy
## 8124        edible    convex      smooth   FALSE  none
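The conversion works because comparing a factor with a string yields a logical vector of the same length. A minimal sketch on made-up values (bruises and is_bruised are illustrative names) also shows what happens to a missing value, which the real bruises column does not contain.

```r
# Made-up factor of t/f codes, with one missing value for illustration
bruises <- factor(c("t", "f", "f", "t", NA))

# Element-wise comparison: TRUE where the code is "t", NA stays NA
is_bruised <- bruises == "t"
is_bruised

# Logical columns can be counted directly
sum(is_bruised, na.rm = TRUE)
```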

This concludes my data transformation of the Mushroom Dataset for this assignment. Here is a summary of my new data frame.

summary(small.df)

##  edible-poison   cap-shape     cap-surface    bruises       
##  e     :   0   convex :3656   scaly  :3244   Mode :logical  
##  p     :   0   flat   :3152   smooth :2556   FALSE:4748     
##  edible:4208   knobbed: 828   fibrous:2320   TRUE :3376     
##  poison:3916   bell   : 452   grooves:   4   NA's :0        
##                sunken :  32   f      :   0                  
##                conical:   4   g      :   0                  
##                (Other):   0   (Other):   0                  
##       odor     
##  none   :3528  
##  foul   :2160  
##  fishy  : 576  
##  spicy  : 576  
##  almond : 400  
##  anise  : 400  
##  (Other): 484
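The zero-count entries in the summary (for example e and p in the first column) are the original letter codes, which remain as unused factor levels after the replacement. One way to remove them, sketched here on a made-up vector x rather than on small.df, is droplevels().

```r
# Made-up factor recoded with the same two-step approach used above
x <- factor(c("e", "p", "e"))
levels(x) <- c(levels(x), "edible", "poison")
x[x == "e"] <- "edible"
x[x == "p"] <- "poison"

table(x)            # the old codes still appear, with a count of zero

# Drop the levels that no longer occur in the data
x <- droplevels(x)
levels(x)
```

droplevels() also accepts a whole data frame, in which case it drops the unused levels from every factor column.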

Robert Godbey

July 14, 2015