This assignment calls for loading a musty dataset, transforming variable names, cleaning up obscure coding in the variables to make it understandable in English, and subsetting the data into analyzable parts. We close with examples of different subsetting methods in R, including one from the dplyr package.
We’ll need to be certain required packges are installed locally.
# check for required packages, install if not available
if (!require('stringr')) install.packages('stringr')
if (!require('dplyr')) install.packages('dplyr')
if (!require("ggplot2")) install.packages('ggplot2')
if (!require("knitr")) install.packages('knitr')
Start with the original, not-so-friendly mushroom database.
# get the data
file = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
download.file(file, destfile = "shrooms.csv", method = "curl")
We’ll use names from the documentation.
# prep column names
columns <- c("toxicity", "cap_shape", "cap_surface", "cap_color", "bruises", "odor", "gill_attach", "gill_spacing", "gill_size", "gill_color", "stalk_shape", "stalk_root", "stalk_surface_above", "stalk_surface_below", "stalk_color_above", "stalk_color_below", "veil_type", "veil_color", "ring_number", "ring_type", "spore_print_color", "population", "habitat")
# check length
# length(columns)
kable(columns, caption = "New Variable Names")
| toxicity |
| cap_shape |
| cap_surface |
| cap_color |
| bruises |
| odor |
| gill_attach |
| gill_spacing |
| gill_size |
| gill_color |
| stalk_shape |
| stalk_root |
| stalk_surface_above |
| stalk_surface_below |
| stalk_color_above |
| stalk_color_below |
| veil_type |
| veil_color |
| ring_number |
| ring_type |
| spore_print_color |
| population |
| habitat |
The variable ‘stalk_root" has missing data signified by “?”, so we want to replace that with R-readable ’NA’. Since “?” is a special character, this doesn’t work using na.strings in the read.csv function. So we use the sub function and a regular expression in a second step.
mush <- data.frame(read.csv("shrooms.csv", col.names=columns, strip.white=TRUE, header=FALSE))
mush$stalk_root <- sub("[?]", "NA", mush$stalk_root, fixed=FALSE)
head(mush[, 1:7])
## toxicity cap_shape cap_surface cap_color bruises odor gill_attach
## 1 p x s n t p f
## 2 e x s y t a f
## 3 e b s w t l f
## 4 p x y w t p f
## 5 e x s g f n f
## 6 e x y y t a f
The categories coded in each variable aren’t understandable, so we need to decode them into plain English terms found in the documentation. This is a bit tiresome, so we’re not going to do it for all the variables.
The approach we use here could be extended to all the columns if necessary. Since this is an excercise in subsetting, we’ll transform 5 columns and then subset them showing only the first three rows.
# toxicity -------------
mush <- transform(mush, toxicity = ifelse(toxicity == 'p', "poison", "edible"))
# cap_surface ----------
mush <- transform(mush, cap_surface = ifelse(cap_surface == 'f', "fibrous", ifelse(cap_surface == 'g', "grooves", ifelse(cap_surface == 'y', "scaly", "smooth"))))
# cap_color ------------
mush <- transform(mush, cap_color = ifelse(cap_color == 'n', "brown", ifelse(cap_color == 'b', "buff", ifelse(cap_color == 'c', "cinnamon", ifelse(cap_color == 'g', "gray", ifelse(cap_color == 'r', "green", ifelse(cap_color == 'p', "pink", ifelse(cap_color == 'u', "purple", ifelse(cap_color == 'e', "red", ifelse(cap_color == 'w', "white", "yellow"))))))))))
# population ------------
mush <- transform(mush, population = ifelse(population == 'a', "abundant", ifelse(population == 'c', "clustered", ifelse(population == 'n', "numerous",
ifelse(population == 's', "scattered",
ifelse(population == 'v', "several",
"solitary"))))))
# habitat ------------
mush <- transform(mush, habitat = ifelse(habitat == 'g', "grasses", ifelse(habitat == 'l', "leaves", ifelse(habitat == 'm', "meadows",
ifelse(habitat == 'p', "paths",
ifelse(habitat == 'u', "urban",
ifelse(habitat == 'w', "waste",
"woods")))))))
kable(head(mush[, c(1,3,4, 22, 23)], 3))
| toxicity | cap_surface | cap_color | population | habitat |
|---|---|---|---|---|
| poison | smooth | brown | scattered | urban |
| edible | smooth | yellow | numerous | grasses |
| edible | smooth | white | numerous | meadows |
The table above is subset by indexing in the standard R way. There are other ways to select rows and columns. Here are examples:
vars = c("toxicity", "cap_surface", "cap_color", "population", "habitat")
kable(head(mush[, vars], 3))
| toxicity | cap_surface | cap_color | population | habitat |
|---|---|---|---|---|
| poison | smooth | brown | scattered | urban |
| edible | smooth | yellow | numerous | grasses |
| edible | smooth | white | numerous | meadows |
vars = c("toxicity", "cap_surface", "cap_color", "population", "habitat")
kable(head(mush[c(1,3,5), c(1, 3, 4, 22, 23)], 3))
| toxicity | cap_surface | cap_color | population | habitat | |
|---|---|---|---|---|---|
| 1 | poison | smooth | brown | scattered | urban |
| 3 | edible | smooth | white | numerous | meadows |
| 5 | edible | smooth | gray | abundant | grasses |
kable(head(subset(mush, select = c(toxicity, cap_surface, cap_color, population, habitat)), 3))
| toxicity | cap_surface | cap_color | population | habitat |
|---|---|---|---|---|
| poison | smooth | brown | scattered | urban |
| edible | smooth | yellow | numerous | grasses |
| edible | smooth | white | numerous | meadows |
Pick your poison.
mush <- tbl_df(mush)
kable(filter(select(mush, c(toxicity, cap_surface, cap_color, population, habitat)), toxicity == 'poison')[1:3, ])
| toxicity | cap_surface | cap_color | population | habitat |
|---|---|---|---|---|
| poison | smooth | brown | scattered | urban |
| poison | scaly | white | scattered | urban |
| poison | scaly | white | several | grasses |