DATA 607 Week 1

Reading in the data and naming the columns

Read in the data from the URL https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data

Description of the data here: https://archive.ics.uci.edu/ml/datasets/Mushroom

The data does not contain any headers, so set header=FALSE. Details here: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
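The read step itself is not shown here. The following is a minimal sketch, assuming the data.table package and its fread function (the troubleshooting note later in this document mentions fread). To stay runnable offline it parses the first two records of the file instead of the URL; sample_rows and df.sample are illustrative names, not part of the original analysis.

```r
library(data.table)

# With network access, the full file could be read straight from the URL:
# df.mushroom <- fread(
#   "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data",
#   header = FALSE
# )

# Offline stand-in: the first two records of the file (hypothetical name)
sample_rows <- c(
  "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u",
  "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g"
)

# header = FALSE stops fread from treating the first record as column names,
# so the columns receive the default names V1, V2, ..., V23
df.sample <- fread(text = sample_rows, header = FALSE)

dim(df.sample)
names(df.sample)[1:3]
```

Without header = FALSE there is a risk that the first mushroom record is consumed as a header row, which would silently drop one observation.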

Check the data

head(df.mushroom)

##    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1:  p  x  s  n  t  p  f  c  n   k   e   e   s   s   w   w   p   w   o   p
## 2:  e  x  s  y  t  a  f  c  b   k   e   c   s   s   w   w   p   w   o   p
## 3:  e  b  s  w  t  l  f  c  b   n   e   c   s   s   w   w   p   w   o   p
## 4:  p  x  y  w  t  p  f  c  n   n   e   e   s   s   w   w   p   w   o   p
## 5:  e  x  s  g  f  n  f  w  b   k   t   e   s   s   w   w   p   w   o   e
## 6:  e  x  y  y  t  a  f  c  b   n   e   c   s   s   w   w   p   w   o   p
##    V21 V22 V23
## 1:   k   s   u
## 2:   n   n   g
## 3:   n   n   m
## 4:   k   s   u
## 5:   n   a   g
## 6:   k   n   g

Now we need to use the provided documentation to give the columns their correct names. Let's create a vector that holds all the column names. We can then assign this vector to the data frame's column names. This handy tutorial details the process I used: http://rprogramming.net/rename-columns-in-r/
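The vector itself is not shown here. Below is a sketch of what it could look like, following the attribute order in the dataset description. Only class, capshape, odor and spore_print_color are confirmed by the subset call later in this document; the spellings of the other 19 names are assumptions.

```r
# One name per column, in the order the attributes are documented.
# Confirmed by later code: class, capshape, odor, spore_print_color.
# All other spellings are assumed for illustration.
mushroom.col <- c(
  "class", "capshape", "capsurface", "capcolor", "bruises", "odor",
  "gill_attachment", "gill_spacing", "gill_size", "gill_color",
  "stalk_shape", "stalk_root",
  "stalk_surface_above_ring", "stalk_surface_below_ring",
  "stalk_color_above_ring", "stalk_color_below_ring",
  "veil_type", "veil_color", "ring_number", "ring_type",
  "spore_print_color", "population", "habitat"
)

# The data has 23 columns, so the vector must have 23 unique entries
length(mushroom.col)
anyDuplicated(mushroom.col)
```

Checking the length before assigning is worthwhile, because a vector that is too short or too long would not line up with the 23 columns.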

Assign the column names to the data frame

colnames(df.mushroom) <- mushroom.col

Select the column called class and an additional 3 or 4 columns; we need to create a subset of the data. Why did I pick these columns? Let's say I wanted to find a relationship between class and some physical attributes. I could then answer the question: can we predict whether a mushroom is edible or poisonous from some of its physical attributes? Here we just use very basic commands to make the subset. We are not using any additional criteria for further filtering.

df.mushroomb<-subset(df.mushroom, select=c('class', 'capshape', 'odor', 'spore_print_color'))
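For comparison, subset can also filter rows through its subset argument. The sketch below uses a toy data frame whose values are copied from the first six records shown earlier; toy and edible_only are illustrative names, and this filter is not part of the analysis that follows.

```r
# Toy data: the four chosen columns for the first six records
toy <- data.frame(
  class             = c("p", "e", "e", "p", "e", "e"),
  capshape          = c("x", "x", "b", "x", "x", "x"),
  odor              = c("p", "a", "l", "p", "n", "a"),
  spore_print_color = c("k", "n", "n", "k", "n", "k"),
  stringsAsFactors  = FALSE
)

# select picks columns; the subset argument keeps only rows meeting a condition
edible_only <- subset(toy, subset = class == "e", select = c("class", "odor"))

nrow(edible_only)
ncol(edible_only)
```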

We need to transform the columns we picked from abbreviated names to full names based on the data documentation. Transform the class variable: map p to poisonous and e to edible, as given in the documentation.

We could do this manually, but a loop that iterates over each entry might be easier. Let's convert the other entries in the selected columns to their full names as well. I used this handy tutorial to write the following loops: https://www.datamentor.io/r-programming/if-else-statement

# Recode class: p -> poisonous, e -> edible
i <- 1
for (x in df.mushroomb$class) {
  if (x == 'p') {
    df.mushroomb$class[i] <- "poisonous"
  } else if (x == 'e') {
    df.mushroomb$class[i] <- "edible"
  }
  i <- i + 1
}

# Recode capshape to the full cap shape names
i <- 1
for (x in df.mushroomb$capshape) {
  if (x == 'b') {
    df.mushroomb$capshape[i] <- "bell"
  } else if (x == 'c') {
    df.mushroomb$capshape[i] <- "conical"
  } else if (x == 'x') {
    df.mushroomb$capshape[i] <- "convex"
  } else if (x == 'f') {
    df.mushroomb$capshape[i] <- "flat"
  } else if (x == 'k') {
    df.mushroomb$capshape[i] <- "knobbed"
  } else if (x == 's') {
    df.mushroomb$capshape[i] <- "sunken"
  }
  i <- i + 1
}

# Recode odor to the full odor names
i <- 1
for (x in df.mushroomb$odor) {
  if (x == 'a') {
    df.mushroomb$odor[i] <- "almond"
  } else if (x == 'l') {
    df.mushroomb$odor[i] <- "anise"
  } else if (x == 'c') {
    df.mushroomb$odor[i] <- "creosote"
  } else if (x == 'y') {
    df.mushroomb$odor[i] <- "fishy"
  } else if (x == 'f') {
    df.mushroomb$odor[i] <- "foul"
  } else if (x == 'm') {
    df.mushroomb$odor[i] <- "musty"
  } else if (x == 'n') {
    df.mushroomb$odor[i] <- "none"
  } else if (x == 'p') {
    df.mushroomb$odor[i] <- "pungent"
  } else if (x == 's') {
    df.mushroomb$odor[i] <- "spicy"
  }
  i <- i + 1
}

# Recode spore_print_color to the full color names
i <- 1
for (x in df.mushroomb$spore_print_color) {
  if (x == 'k') {
    df.mushroomb$spore_print_color[i] <- "black"
  } else if (x == 'n') {
    df.mushroomb$spore_print_color[i] <- "brown"
  } else if (x == 'b') {
    df.mushroomb$spore_print_color[i] <- "buff"
  } else if (x == 'h') {
    df.mushroomb$spore_print_color[i] <- "chocolate"
  } else if (x == 'r') {
    df.mushroomb$spore_print_color[i] <- "green"
  } else if (x == 'o') {
    df.mushroomb$spore_print_color[i] <- "orange"
  } else if (x == 'u') {
    df.mushroomb$spore_print_color[i] <- "purple"
  } else if (x == 'w') {
    df.mushroomb$spore_print_color[i] <- "white"
  } else if (x == 'y') {
    df.mushroomb$spore_print_color[i] <- "yellow"
  }
  i <- i + 1
}
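As an alternative to the loops, the same recoding can be sketched with a named character vector used as a lookup table. This is a swapped-in technique, not the method used above; class_lookup, odor_lookup and the toy code vectors are illustrative names, with values taken from the first six records.

```r
# Lookup tables: names are the abbreviated codes, values are the full names
class_lookup <- c(p = "poisonous", e = "edible")
odor_lookup  <- c(a = "almond", l = "anise", c = "creosote", y = "fishy",
                  f = "foul", m = "musty", n = "none", p = "pungent",
                  s = "spicy")

# Toy codes from the first six records
class_codes <- c("p", "e", "e", "p", "e", "e")
odor_codes  <- c("p", "a", "l", "p", "n", "a")

# Indexing by name recodes every entry in one step
class_full <- unname(class_lookup[class_codes])
odor_full  <- unname(odor_lookup[odor_codes])

class_full
odor_full
```

One difference in behavior: a code with no entry in the lookup vector becomes NA here, whereas the if/else loops leave an unmatched code unchanged.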

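Troubleshooting: the loops assume the columns are character vectors. If the columns are loaded as factors, assigning a full name that is not an existing level will not work, so you may need to set stringsAsFactors = FALSE when you load the data. This depends on how the data is loaded in; using fread seemed to fix it.

How is our data so far?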
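The factor issue in the troubleshooting note can be reproduced in isolation. This is a small base R sketch with made-up vectors (codes_factor and codes_char are illustrative names); suppressWarnings is used only to keep the output quiet.

```r
# A factor only accepts values that are already among its levels
codes_factor <- factor(c("p", "e", "e"))
suppressWarnings(codes_factor[1] <- "poisonous")  # "poisonous" is not a level
is.na(codes_factor[1])                            # the entry became NA

# A character vector accepts any string
codes_char <- c("p", "e", "e")
codes_char[1] <- "poisonous"
codes_char
```

With character columns, which is what fread produces by default, the recoded subset can be inspected as usual: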

head(df.mushroomb)

##        class capshape    odor spore_print_color
## 1: poisonous   convex pungent             black
## 2:    edible   convex  almond             brown
## 3:    edible     bell   anise             brown
## 4: poisonous   convex pungent             black
## 5:    edible   convex    none             brown
## 6:    edible   convex  almond             black

The data looks good, so let's explore it. What are the relative frequencies for the class variable? Let's try a few things to answer this question.

Get the counts of edible or poisonous

table(df.mushroomb$class)

## 
##    edible poisonous 
##      4208      3916

Convert the counts to proportions instead

class_counts <- table(df.mushroomb$class)
class_counts / sum(class_counts)

## 
##    edible poisonous 
## 0.5179714 0.4820286
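The same proportions can also be obtained with base R's prop.table. The sketch below rebuilds the counts from the table output above rather than from the data, so counts and props are illustrative names.

```r
# Counts copied from the table() output above
counts <- c(edible = 4208, poisonous = 3916)

# prop.table divides each entry by the total, like counts / sum(counts)
props <- prop.table(counts)

round(props, 7)
```

Applied to the real data, prop.table(table(df.mushroomb$class)) would be the one-line equivalent.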

Let's visually summarize the data using a standard bar plot. We will use ggplot2, then call summary to see the binning and plotting details.

library(ggplot2)
bar_plt <- ggplot(df.mushroomb, aes(x = class)) 
bar_plt <- bar_plt + geom_bar()
summary(bar_plt)

## data: class, capshape, odor, spore_print_color [8124x4]
## mapping:  x = class
## faceting: <ggproto object: Class FacetNull, Facet>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map: function
##     map_data: function
##     params: list
##     render_back: function
##     render_front: function
##     render_panels: function
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train: function
##     train_positions: function
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetNull, Facet>
## -----------------------------------
## geom_bar: width = NULL, na.rm = FALSE
## stat_count: width = NULL, na.rm = FALSE
## position_stack
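Note that summary() describes the plot's data, mapping and layers as text; printing the plot object is what draws it. Since the question was about relative frequencies, here is a sketch of a bar plot that shows proportions instead of counts. It assumes a ggplot2 version that provides after_stat (3.3.0 or later); toy and prop_plt are illustrative names, and the toy data stands in for df.mushroomb.

```r
library(ggplot2)

# Toy stand-in for the class column (hypothetical data)
toy <- data.frame(
  class = c("poisonous", "edible", "edible", "poisonous", "edible", "edible")
)

# after_stat() rescales the computed counts so bar heights are proportions
prop_plt <- ggplot(toy, aes(x = class)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  labs(x = "Class", y = "Proportion of mushrooms")

# print() draws the plot; summary() only describes its structure
print(prop_plt)
```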

There are many more ways to explore the data, but for the task at hand the data has been subset and renamed for downstream analysis.

One last check:

head(df.mushroomb)

##        class capshape    odor spore_print_color
## 1: poisonous   convex pungent             black
## 2:    edible   convex  almond             brown
## 3:    edible     bell   anise             brown
## 4: poisonous   convex pungent             black
## 5:    edible   convex    none             brown
## 6:    edible   convex  almond             black

tail(df.mushroomb)

##        class capshape  odor spore_print_color
## 1: poisonous  knobbed  foul             white
## 2:    edible  knobbed  none              buff
## 3:    edible   convex  none              buff
## 4:    edible     flat  none              buff
## 5: poisonous  knobbed fishy             white
## 6:    edible   convex  none            orange

Vinicio, Haro

February 2, 2018