DATA 607: Week 1 Assignment, Mushroom Data

Introduction

This assignment calls for loading a musty dataset, transforming variable names, cleaning up obscure coding in the variables to make it understandable in English, and subsetting the data into analyzable parts. We close with examples of different subsetting methods in R, including one from the dplyr package.

Prepare the R workspace

We’ll need to be certain the required packages are installed locally.

# check for required packages, install if not available

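# require() returns FALSE when a package is missing, so load it again after installing
if (!require('stringr')) { install.packages('stringr'); library(stringr) }
if (!require('dplyr')) { install.packages('dplyr'); library(dplyr) }
if (!require('ggplot2')) { install.packages('ggplot2'); library(ggplot2) }
if (!require('knitr')) { install.packages('knitr'); library(knitr) }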

Download the data

Start with the original, not-so-friendly mushroom database.

# get the data

file = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"

download.file(file, destfile = "shrooms.csv", method = "curl")

Add descriptive column names

We’ll use names from the documentation.

# prep column names

columns <- c("toxicity", "cap_shape", "cap_surface", "cap_color", "bruises", "odor", "gill_attach", "gill_spacing", "gill_size", "gill_color", "stalk_shape", "stalk_root", "stalk_surface_above", "stalk_surface_below", "stalk_color_above", "stalk_color_below", "veil_type", "veil_color", "ring_number", "ring_type", "spore_print_color", "population", "habitat")

# check length
# length(columns)

kable(columns, caption = "New Variable Names")

New Variable Names
toxicity
cap_shape
cap_surface
cap_color
bruises
odor
gill_attach
gill_spacing
gill_size
gill_color
stalk_shape
stalk_root
stalk_surface_above
stalk_surface_below
stalk_color_above
stalk_color_below
veil_type
veil_color
ring_number
ring_type
spore_print_color
population
habitat

Convert to a data frame, fill in missing data

The variable ‘stalk_root’ has missing data signified by “?”, so we want to replace that with R-readable NA. One option is to pass na.strings = "?" to the read.csv function, which matches the text literally. Here we instead use the sub function and a regular expression in a second step; because “?” is a regex metacharacter, we wrap it in a character class.

mush <- data.frame(read.csv("shrooms.csv", col.names=columns, strip.white=TRUE, header=FALSE))

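# a replacement of NA turns every matched element into a true missing value,
# rather than the two-character string "NA"
mush$stalk_root <- sub("[?]", NA_character_, mush$stalk_root, fixed=FALSE)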

head(mush[, 1:7])

##   toxicity cap_shape cap_surface cap_color bruises odor gill_attach
## 1        p         x           s         n       t    p           f
## 2        e         x           s         y       t    a           f
## 3        e         b           s         w       t    l           f
## 4        p         x           y         w       t    p           f
## 5        e         x           s         g       f    n           f
## 6        e         x           y         y       t    a           f
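
As a cross-check on the missing-data step, here is a minimal sketch of handling “?” at read time instead. The three-row sample and the name demo are made up for illustration and are not rows from the mushroom file; the sketch relies on na.strings comparing field text literally rather than as a regular expression.

```r
# Made-up sample standing in for a few columns of the downloaded file
sample_csv <- "p,x,e\ne,b,?\ne,x,c"

demo <- read.csv(text = sample_csv, header = FALSE,
                 col.names = c("toxicity", "cap_shape", "stalk_root"),
                 na.strings = "?", strip.white = TRUE,
                 stringsAsFactors = FALSE)

# Count true missing values; a literal "NA" string would not be counted here
missing_count <- sum(is.na(demo$stalk_root))
missing_count
```

Counting with is.na is a quick way to confirm that the missing entries are real NA values and not text.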

Make readable category names

The categories coded in each variable aren’t understandable, so we need to decode them into plain English terms found in the documentation. This is a bit tiresome, so we’re not going to do it for all the variables.

The approach we use here could be extended to all the columns if necessary. Since this is an exercise in subsetting, we’ll transform five columns and then subset them, showing only the first three rows.

# toxicity -------------
mush <- transform(mush, toxicity = ifelse(toxicity == 'p', "poison", "edible"))

# cap_surface ----------
mush <- transform(mush, cap_surface = ifelse(cap_surface == 'f', "fibrous", ifelse(cap_surface == 'g', "grooves", ifelse(cap_surface == 'y', "scaly", "smooth")))) 

# cap_color ------------
mush <- transform(mush, cap_color = ifelse(cap_color == 'n', "brown", ifelse(cap_color == 'b', "buff", ifelse(cap_color == 'c', "cinnamon", ifelse(cap_color == 'g', "gray", ifelse(cap_color == 'r', "green", ifelse(cap_color == 'p', "pink", ifelse(cap_color == 'u', "purple", ifelse(cap_color == 'e', "red", ifelse(cap_color == 'w', "white", "yellow"))))))))))

# population ------------
mush <- transform(mush, population = ifelse(population == 'a', "abundant",
                                     ifelse(population == 'c', "clustered",
                                     ifelse(population == 'n', "numerous",
                                     ifelse(population == 's', "scattered",
                                     ifelse(population == 'v', "several",
                                     "solitary"))))))

# habitat ------------
mush <- transform(mush, habitat = ifelse(habitat == 'g', "grasses",
                                  ifelse(habitat == 'l', "leaves",
                                  ifelse(habitat == 'm', "meadows",
                                  ifelse(habitat == 'p', "paths",
                                  ifelse(habitat == 'u', "urban",
                                  ifelse(habitat == 'w', "waste",
                                  "woods")))))))

kable(head(mush[, c(1,3,4, 22, 23)], 3))

toxicity	cap_surface	cap_color	population	habitat
poison	smooth	brown	scattered	urban
edible	smooth	yellow	numerous	grasses
edible	smooth	white	numerous	meadows
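
The nested ifelse calls get long quickly. As a sketch of one alternative, a named character vector can act as a lookup table: the names are the single-letter codes and the values are the English labels. The names habitat_labels and codes are hypothetical, and the sample codes are made up; the letter-to-label pairs follow the habitat decoding used above, assuming 'd' is the code for woods.

```r
# Lookup table: names are codes, values are readable labels
habitat_labels <- c(g = "grasses", l = "leaves", m = "meadows", p = "paths",
                    u = "urban", w = "waste", d = "woods")

# Made-up sample of coded values
codes <- c("u", "g", "m", "d")

# Index by name; as.character() guards against factor input,
# which would otherwise be indexed by its integer codes
decoded <- unname(habitat_labels[as.character(codes)])
decoded
```

One difference from the ifelse chain: a code missing from the table comes back as NA instead of falling through to the last label, which makes unexpected codes easier to spot.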

Other Ways to Subset

The table above is subset by indexing in the standard R way. There are other ways to select rows and columns. Here are examples:

Use variable names

vars = c("toxicity", "cap_surface", "cap_color", "population", "habitat")

kable(head(mush[, vars], 3))

toxicity	cap_surface	cap_color	population	habitat
poison	smooth	brown	scattered	urban
edible	smooth	yellow	numerous	grasses
edible	smooth	white	numerous	meadows

Select by position

# rows 1, 3 and 5; columns picked by position
kable(mush[c(1,3,5), c(1, 3, 4, 22, 23)])

	toxicity	cap_surface	cap_color	population	habitat
1	poison	smooth	brown	scattered	urban
3	edible	smooth	white	numerous	meadows
5	edible	smooth	gray	abundant	grasses

Use the subset function

kable(head(subset(mush, select = c(toxicity, cap_surface, cap_color, population, habitat)), 3))

toxicity	cap_surface	cap_color	population	habitat
poison	smooth	brown	scattered	urban
edible	smooth	yellow	numerous	grasses
edible	smooth	white	numerous	meadows
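
The subset function can also filter rows in the same call. Here is a small sketch on a made-up data frame (demo_shrooms is a hypothetical name, and its rows are illustrative, not taken from the dataset): the second argument is a row condition, and select picks the columns.

```r
# Made-up data frame with a few decoded columns
demo_shrooms <- data.frame(
  toxicity  = c("poison", "edible", "poison"),
  cap_color = c("brown", "yellow", "white"),
  habitat   = c("urban", "grasses", "urban"),
  stringsAsFactors = FALSE
)

# Keep only poisonous rows, and only two of the columns
poisonous <- subset(demo_shrooms, toxicity == "poison",
                    select = c(toxicity, habitat))
poisonous
```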

Use the dplyr package

Pick your poison.

# tbl_df() is deprecated in current dplyr; as_tibble() is the replacement
mush <- as_tibble(mush)

kable(filter(select(mush, c(toxicity, cap_surface, cap_color, population, habitat)), toxicity == 'poison')[1:3, ])

toxicity	cap_surface	cap_color	population	habitat
poison	smooth	brown	scattered	urban
poison	scaly	white	scattered	urban
poison	scaly	white	several	grasses
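
The nested filter(select(...)) call reads inside out. As a sketch of the same idea written with the pipe operator that dplyr provides, the steps can be listed in the order they run. The data frame demo_shrooms is a made-up stand-in for mush, and this assumes dplyr is installed.

```r
library(dplyr)

# Made-up stand-in for the decoded mushroom data
demo_shrooms <- data.frame(
  toxicity   = c("poison", "edible", "poison", "poison"),
  cap_color  = c("brown", "yellow", "white", "white"),
  population = c("scattered", "numerous", "scattered", "several"),
  stringsAsFactors = FALSE
)

# Filter rows first, then pick columns, then keep the first two rows
first_poisonous <- demo_shrooms %>%
  filter(toxicity == "poison") %>%
  select(toxicity, population) %>%
  head(2)
first_poisonous
```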