Introduction


This assignment calls for loading a musty dataset, transforming variable names, cleaning up obscure coding in the variables to make it understandable in English, and subsetting the data into analyzable parts. We close with examples of different subsetting methods in R, including one from the dplyr package.


Prepare the R workspace

We’ll need to be certain required packges are installed locally.

# check for required packages, install if not available

if (!require('stringr')) install.packages('stringr')
if (!require('dplyr')) install.packages('dplyr')
if (!require("ggplot2")) install.packages('ggplot2')
if (!require("knitr")) install.packages('knitr')

Download the data

Start with the original, not-so-friendly mushroom database.

# get the data

file = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"

download.file(file, destfile = "shrooms.csv", method = "curl")

Add descriptive column names

We’ll use names from the documentation.

# prep column names

columns <- c("toxicity", "cap_shape", "cap_surface", "cap_color", "bruises", "odor", "gill_attach", "gill_spacing", "gill_size", "gill_color", "stalk_shape", "stalk_root", "stalk_surface_above", "stalk_surface_below", "stalk_color_above", "stalk_color_below", "veil_type", "veil_color", "ring_number", "ring_type", "spore_print_color", "population", "habitat")

# check length
# length(columns)

kable(columns, caption = "New Variable Names")
New Variable Names
toxicity
cap_shape
cap_surface
cap_color
bruises
odor
gill_attach
gill_spacing
gill_size
gill_color
stalk_shape
stalk_root
stalk_surface_above
stalk_surface_below
stalk_color_above
stalk_color_below
veil_type
veil_color
ring_number
ring_type
spore_print_color
population
habitat

Convert to a data frame, fill in missing data

The variable ‘stalk_root" has missing data signified by “?”, so we want to replace that with R-readable ’NA’. Since “?” is a special character, this doesn’t work using na.strings in the read.csv function. So we use the sub function and a regular expression in a second step.

mush <- data.frame(read.csv("shrooms.csv", col.names=columns, strip.white=TRUE, header=FALSE))

mush$stalk_root <- sub("[?]", "NA", mush$stalk_root, fixed=FALSE)

head(mush[, 1:7])
##   toxicity cap_shape cap_surface cap_color bruises odor gill_attach
## 1        p         x           s         n       t    p           f
## 2        e         x           s         y       t    a           f
## 3        e         b           s         w       t    l           f
## 4        p         x           y         w       t    p           f
## 5        e         x           s         g       f    n           f
## 6        e         x           y         y       t    a           f

Make readable category names

The categories coded in each variable aren’t understandable, so we need to decode them into plain English terms found in the documentation. This is a bit tiresome, so we’re not going to do it for all the variables.

The approach we use here could be extended to all the columns if necessary. Since this is an excercise in subsetting, we’ll transform 5 columns and then subset them showing only the first three rows.

# toxicity -------------
mush <- transform(mush, toxicity = ifelse(toxicity == 'p', "poison", "edible"))

# cap_surface ----------
mush <- transform(mush, cap_surface = ifelse(cap_surface == 'f', "fibrous", ifelse(cap_surface == 'g', "grooves", ifelse(cap_surface == 'y', "scaly", "smooth")))) 

# cap_color ------------
mush <- transform(mush, cap_color = ifelse(cap_color == 'n', "brown", ifelse(cap_color == 'b', "buff", ifelse(cap_color == 'c', "cinnamon", ifelse(cap_color == 'g', "gray", ifelse(cap_color == 'r', "green", ifelse(cap_color == 'p', "pink", ifelse(cap_color == 'u', "purple", ifelse(cap_color == 'e', "red", ifelse(cap_color == 'w', "white", "yellow"))))))))))

# population ------------
mush <- transform(mush, population = ifelse(population == 'a', "abundant", ifelse(population == 'c', "clustered",                                      ifelse(population == 'n', "numerous",
ifelse(population == 's', "scattered",
ifelse(population == 'v', "several",
"solitary"))))))
                                                                                      # habitat ------------
mush <- transform(mush, habitat = ifelse(habitat == 'g', "grasses", ifelse(habitat == 'l', "leaves",                                      ifelse(habitat == 'm', "meadows",
ifelse(habitat == 'p', "paths",
ifelse(habitat == 'u', "urban",
ifelse(habitat == 'w', "waste",
"woods")))))))      

kable(head(mush[, c(1,3,4, 22, 23)], 3))
toxicity cap_surface cap_color population habitat
poison smooth brown scattered urban
edible smooth yellow numerous grasses
edible smooth white numerous meadows

Other Ways to Subset

The table above is subset by indexing in the standard R way. There are other ways to select rows and columns. Here are examples:

Use variable names

vars = c("toxicity", "cap_surface", "cap_color", "population", "habitat")

kable(head(mush[, vars], 3))
toxicity cap_surface cap_color population habitat
poison smooth brown scattered urban
edible smooth yellow numerous grasses
edible smooth white numerous meadows

Select by position

vars = c("toxicity", "cap_surface", "cap_color", "population", "habitat")

kable(head(mush[c(1,3,5), c(1, 3, 4, 22, 23)], 3))
toxicity cap_surface cap_color population habitat
1 poison smooth brown scattered urban
3 edible smooth white numerous meadows
5 edible smooth gray abundant grasses

Use the subset function

kable(head(subset(mush, select = c(toxicity, cap_surface, cap_color, population, habitat)), 3))
toxicity cap_surface cap_color population habitat
poison smooth brown scattered urban
edible smooth yellow numerous grasses
edible smooth white numerous meadows

Use the dplyr package

Pick your poison.

mush <- tbl_df(mush)

kable(filter(select(mush, c(toxicity, cap_surface, cap_color, population, habitat)), toxicity == 'poison')[1:3, ])
toxicity cap_surface cap_color population habitat
poison smooth brown scattered urban
poison scaly white scattered urban
poison scaly white several grasses