Week 1 task

Your task (should you choose to accept it – sorry, had the Mission Impossible theme playing as soon as I started reading) is to study the dataset and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there! You should take the data and create a data frame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous and three or four other columns. You should also add meaningful column names and replace the abbreviations used in the data; for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.

Setup

The data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. Data is courtesy of UCI, Jeff Schlimmer and The Audubon Society.

# Load packages
library(RCurl)

# Load data file
shrooms <- read.csv(text=getURL("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"), header = FALSE, sep = ",")
# Quick look at the data
head(shrooms)
##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1  p  x  s  n  t  p  f  c  n   k   e   e   s   s   w   w   p   w   o   p
## 2  e  x  s  y  t  a  f  c  b   k   e   c   s   s   w   w   p   w   o   p
## 3  e  b  s  w  t  l  f  c  b   n   e   c   s   s   w   w   p   w   o   p
## 4  p  x  y  w  t  p  f  c  n   n   e   e   s   s   w   w   p   w   o   p
## 5  e  x  s  g  f  n  f  w  b   k   t   e   s   s   w   w   p   w   o   e
## 6  e  x  y  y  t  a  f  c  b   n   e   c   s   s   w   w   p   w   o   p
##   V21 V22 V23
## 1   k   s   u
## 2   n   n   g
## 3   n   n   m
## 4   k   s   u
## 5   n   a   g
## 6   k   n   g
summary(shrooms)
##  V1       V2       V3             V4       V5             V6      
##  e:4208   b: 452   f:2320   n      :2284   f:4748   n      :3528  
##  p:3916   c:   4   g:   4   g      :1840   t:3376   f      :2160  
##           f:3152   s:2556   e      :1500            s      : 576  
##           k: 828   y:3244   y      :1072            y      : 576  
##           s:  32            w      :1040            a      : 400  
##           x:3656            b      : 168            l      : 400  
##                             (Other): 220            (Other): 484  
##  V7       V8       V9            V10       V11      V12      V13     
##  a: 210   c:6812   b:5612   b      :1728   e:3516   ?:2480   f: 552  
##  f:7914   w:1312   n:2512   p      :1492   t:4608   b:3776   k:2372  
##                             w      :1202            c: 556   s:5176  
##                             n      :1048            e:1120   y:  24  
##                             g      : 752            r: 192           
##                             h      : 732                             
##                             (Other):1170                             
##  V14           V15            V16       V17      V18      V19     
##  f: 600   w      :4464   w      :4384   p:8124   n:  96   n:  36  
##  k:2304   p      :1872   p      :1872            o:  96   o:7488  
##  s:4936   g      : 576   g      : 576            w:7924   t: 600  
##  y: 284   n      : 448   n      : 512            y:   8           
##           b      : 432   b      : 432                             
##           o      : 192   o      : 192                             
##           (Other): 140   (Other): 156                             
##  V20           V21       V22      V23     
##  e:2776   w      :2388   a: 384   d:3148  
##  f:  48   n      :1968   c: 340   g:2148  
##  l:1296   k      :1872   n: 400   l: 832  
##  n:  36   h      :1632   s:1248   m: 292  
##  p:3968   r      :  72   v:4040   p:1144  
##           b      :  48   y:1712   u: 368  
##           (Other): 144            w: 192
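
One caveat when rerunning this: since R 4.0.0, read.csv() defaults to stringsAsFactors = FALSE, so the columns load as character and summary() no longer prints the per-level counts shown above. A minimal sketch of asking for factors explicitly, using three rows copied from the head() output in place of the download (the names sample_text and demo are just for illustration):

```r
# Three rows copied from the head() output above, used instead of fetching the file
sample_text <- "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m"

# stringsAsFactors = TRUE requests factor columns whatever the R version's default is
demo <- read.csv(text = sample_text, header = FALSE, sep = ",",
                 stringsAsFactors = TRUE)

dim(demo)        # 3 rows, 23 columns
class(demo$V1)   # "factor"
levels(demo$V1)  # "e" "p"
```

The same argument can be passed to the read.csv() call on the full file.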

Data dictionary

The first column gives the class (edible=e, poisonous=p); the remaining 22 columns hold the following attributes:

  1. cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
  2. cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
  3. cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
  4. bruises?: bruises=t, no=f
  5. odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
  6. gill-attachment: attached=a, descending=d, free=f, notched=n
  7. gill-spacing: close=c, crowded=w, distant=d
  8. gill-size: broad=b, narrow=n
  9. gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
  10. stalk-shape: enlarging=e, tapering=t
  11. stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
  12. stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
  13. stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
  14. stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
  15. stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
  16. veil-type: partial=p, universal=u
  17. veil-color: brown=n, orange=o, white=w, yellow=y
  18. ring-number: none=n, one=o, two=t
  19. ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
  20. spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
  21. population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
  22. habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
# Rename columns 
names(shrooms) <- c('class','capshape','capsurface','capcolor','bruises','odor','gillattachment','gillspacing','gillsize','gillcolor','stalkshape','stalkroot','stalksurfaceabovering','stalksurfacebelowring','stalkcolorabovering','stalkcolorbelowring','veiltype','veilcolor','ringnumber','ringtype','sporeprintcolor','population','habitat')
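
The column names follow the data dictionary attribute names with the hyphens and the question mark removed. As a rough sketch, they could also be derived rather than typed out; only the first five attributes are shown here, and dict_names and col_names are illustrative names, not part of the original code:

```r
# First five attribute names as written in the data dictionary
dict_names <- c("cap-shape", "cap-surface", "cap-color", "bruises?", "odor")

# Drop everything that is not a lowercase letter, then prepend the class column
col_names <- c("class", gsub("[^a-z]", "", dict_names))

col_names
# "class" "capshape" "capsurface" "capcolor" "bruises" "odor"
```

With all 22 attribute names in dict_names, the result would have 23 entries, one per column of the data.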

Selecting a subset

Select class, odor, ringnumber, ringtype and population for all mushrooms that grow in the woods (habitat code d), to transform and review further.

sub.shrooms <- subset(shrooms, habitat == 'd', select = c(class, odor, ringnumber, ringtype, population))
summary(sub.shrooms)
##  class         odor      ringnumber ringtype population
##  e:1880   n      :1816   n:  36     e: 608   a:   0    
##  p:1268   f      : 624   o:3104     f:  48   c:  36    
##           c      : 192   t:   8     l: 432   n:   0    
##           s      : 192              n:  36   s:  96    
##           y      : 192              p:2024   v:1904    
##           a      :  48                       y:1112    
##           (Other):  84
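
Note the zero counts for population levels a and n: subsetting a factor keeps all of its original levels, even ones that no longer occur. A small sketch of the same kind of subset written with bracket indexing, followed by droplevels(), on a toy data frame with made-up values (toy and woods are hypothetical names):

```r
# Toy data frame with hypothetical values, standing in for the renamed shrooms
toy <- data.frame(
  class      = factor(c("e", "p", "e", "p")),
  population = factor(c("a", "v", "y", "v")),
  habitat    = factor(c("g", "d", "d", "d"))
)

# Same filter and column selection as subset(), using bracket indexing
woods <- toy[toy$habitat == "d", c("class", "population")]

table(woods$population)   # level "a" is still listed, with a count of 0

# droplevels() removes factor levels that no longer occur in the subset
woods <- droplevels(woods)
levels(woods$population)  # "v" "y"
```

The as.character() / as.factor() round trip used later in this write-up has the same effect of discarding unused levels.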

Transforming the subset

Update variables with more meaningful values.

# Update class
sub.shrooms$class <- as.character(sub.shrooms$class)
sub.shrooms$class[sub.shrooms$class == 'e'] <- 'Edible'
sub.shrooms$class[sub.shrooms$class == 'p'] <- 'Poisonous'
sub.shrooms$class <- as.factor(sub.shrooms$class)

# Update odor
sub.shrooms$odor <- as.character(sub.shrooms$odor)
sub.shrooms$odor[sub.shrooms$odor == 'a'] <- 'Almond'
sub.shrooms$odor[sub.shrooms$odor == 'l'] <- 'Anise'
sub.shrooms$odor[sub.shrooms$odor == 'c'] <- 'Creosote'
sub.shrooms$odor[sub.shrooms$odor == 'y'] <- 'Fishy'
sub.shrooms$odor[sub.shrooms$odor == 'f'] <- 'Foul'
sub.shrooms$odor[sub.shrooms$odor == 'm'] <- 'Musty'
sub.shrooms$odor[sub.shrooms$odor == 'n'] <- 'None'
sub.shrooms$odor[sub.shrooms$odor == 'p'] <- 'Pungent'
sub.shrooms$odor[sub.shrooms$odor == 's'] <- 'Spicy'
sub.shrooms$odor <- as.factor(sub.shrooms$odor)

# Update ring type; Replace with NA if no rings exist
sub.shrooms$ringtype <- as.character(sub.shrooms$ringtype)
sub.shrooms$ringtype[sub.shrooms$ringtype == 'p'] <- 'Pendant'
sub.shrooms$ringtype[sub.shrooms$ringtype == 'c'] <- 'Cobwebby'
sub.shrooms$ringtype[sub.shrooms$ringtype == 'e'] <- 'Evanescent'
sub.shrooms$ringtype[sub.shrooms$ringtype == 'f'] <- 'Flaring'
sub.shrooms$ringtype[sub.shrooms$ringtype == 'l'] <- 'Large'
sub.shrooms$ringtype[sub.shrooms$ringtype == 's'] <- 'Sheathing'
sub.shrooms$ringtype[sub.shrooms$ringtype == 'z'] <- 'Zone'
# Assign NA last so the comparisons above never have to deal with missing values
sub.shrooms$ringtype[sub.shrooms$ringtype == 'n'] <- NA
sub.shrooms$ringtype <- as.factor(sub.shrooms$ringtype)

# Update population
sub.shrooms$population <- as.character(sub.shrooms$population)
sub.shrooms$population[sub.shrooms$population == 'a'] <- 'Abundant'
sub.shrooms$population[sub.shrooms$population == 'c'] <- 'Clustered'
sub.shrooms$population[sub.shrooms$population == 'n'] <- 'Numerous'
sub.shrooms$population[sub.shrooms$population == 's'] <- 'Scattered'
sub.shrooms$population[sub.shrooms$population == 'v'] <- 'Several'
sub.shrooms$population[sub.shrooms$population == 'y'] <- 'Solitary'
sub.shrooms$population <- as.factor(sub.shrooms$population)

# Update ring number and convert to numeric
sub.shrooms$ringnumber <- as.character(sub.shrooms$ringnumber)
sub.shrooms$ringnumber[sub.shrooms$ringnumber == 'n'] <- 0
sub.shrooms$ringnumber[sub.shrooms$ringnumber == 'o'] <- 1
sub.shrooms$ringnumber[sub.shrooms$ringnumber == 't'] <- 2
sub.shrooms$ringnumber <- as.numeric(sub.shrooms$ringnumber)

# Display 25 random rows to check the transformations
set.seed(125)
sub.shrooms[sample(1:nrow(sub.shrooms), 25), ]
##          class     odor ringnumber   ringtype population
## 6035 Poisonous    Spicy          1 Evanescent    Several
## 2152    Edible     None          1    Pendant   Solitary
## 2770    Edible     None          1    Pendant   Solitary
## 2957    Edible     None          1    Pendant   Solitary
## 7595 Poisonous    Fishy          1 Evanescent    Several
## 7636 Poisonous    Musty          0       <NA>  Clustered
## 3587    Edible     None          1    Pendant   Solitary
## 2874    Edible     None          1    Pendant   Solitary
## 4093    Edible     None          1    Pendant   Solitary
## 3880 Poisonous Creosote          1    Pendant  Scattered
## 479     Edible   Almond          1    Pendant    Several
## 5019 Poisonous     Foul          1      Large    Several
## 2962    Edible     None          1    Pendant   Solitary
## 2088    Edible     None          1    Pendant   Solitary
## 3674    Edible     None          1    Pendant   Solitary
## 3196    Edible     None          1    Pendant    Several
## 3177    Edible     None          1    Pendant   Solitary
## 7113 Poisonous    Fishy          1 Evanescent    Several
## 4191 Poisonous     Foul          1      Large    Several
## 4958 Poisonous     Foul          1      Large    Several
## 3007    Edible     None          1    Pendant    Several
## 4664    Edible     None          1    Flaring   Solitary
## 4287 Poisonous     Foul          1      Large   Solitary
## 2749    Edible     None          1    Pendant   Solitary
## 3743 Poisonous Creosote          1    Pendant    Several
summary(sub.shrooms)
##        class            odor        ringnumber           ringtype   
##  Edible   :1880   None    :1816   Min.   :0.0000   Evanescent: 608  
##  Poisonous:1268   Foul    : 624   1st Qu.:1.0000   Flaring   :  48  
##                   Creosote: 192   Median :1.0000   Large     : 432  
##                   Fishy   : 192   Mean   :0.9911   Pendant   :2024  
##                   Spicy   : 192   3rd Qu.:1.0000   NA's      :  36  
##                   Almond  :  48   Max.   :2.0000                    
##                   (Other) :  84                                     
##      population  
##  Clustered:  36  
##  Scattered:  96  
##  Several  :1904  
##  Solitary :1112  
##                  
##                  
## 
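
The recoding above repeats one assignment per code. A more compact alternative, sketched here under the assumption that a named lookup vector is acceptable, maps codes to labels in one step. The helper recode_codes() and the sample codes are hypothetical; the code/label pairs come from the data dictionary:

```r
# Codes and labels from the data dictionary entries for odor and ring-type
odor_labels <- c(a = "Almond", l = "Anise", c = "Creosote", y = "Fishy",
                 f = "Foul", m = "Musty", n = "None", p = "Pungent", s = "Spicy")
# "n" (no ring) is left out on purpose so that it becomes NA
ring_labels <- c(c = "Cobwebby", e = "Evanescent", f = "Flaring", l = "Large",
                 p = "Pendant", s = "Sheathing", z = "Zone")

# Hypothetical helper: look up each code by name; unknown codes become NA
recode_codes <- function(x, lookup) {
  labels <- unname(lookup[as.character(x)])
  droplevels(factor(labels, levels = unname(lookup)))
}

odor <- recode_codes(factor(c("n", "f", "a", "n")), odor_labels)
as.character(odor)  # "None" "Foul" "Almond" "None"
levels(odor)        # "Almond" "Foul" "None"

ring <- recode_codes(c("p", "n", "e"), ring_labels)
as.character(ring)  # "Pendant" NA "Evanescent"
```

Applied to the real data, this would look like sub.shrooms$odor <- recode_codes(sub.shrooms$odor, odor_labels), replacing the block of per-code assignments.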

Adding a few rudimentary graphs just for a bit of practice.

plot(sub.shrooms$class ~ sub.shrooms$ringtype, xlab = "Ring Type", ylab = "Class", main = "Class by Ring Type (Woods Only)")

# One color per odor level (one per bar), rather than one per row
plot(sub.shrooms$odor, xlab = "Odor", ylab = "Frequency", col = seq_len(nlevels(sub.shrooms$odor)))
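
To sanity-check what the first plot shows, the counts behind it can be tabulated directly. A minimal sketch on a toy data frame with made-up values (toy and counts are illustrative names; on the real data the two columns of sub.shrooms would be used instead):

```r
# Hypothetical toy values, standing in for sub.shrooms
toy <- data.frame(
  class    = factor(c("Edible", "Edible", "Poisonous", "Poisonous", "Edible")),
  ringtype = factor(c("Pendant", "Pendant", "Large", "Evanescent", NA))
)

# Cross-tabulate class by ring type; useNA = "ifany" keeps the NA ring types visible
counts <- table(toy$class, toy$ringtype, useNA = "ifany")
counts

# Share of each class within a ring type (each column sums to 1)
prop.table(counts, margin = 2)
```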