Using mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms (1981), available from the UCI Machine Learning Repository, study the dataset, load it into a data frame, and create a subset of selected columns. The data and columns should be given meaningful, proper names.
The file is fetched with the getURL function from the RCurl package, and the returned text is parsed with the read.csv function. Inspecting the raw data shows that there is no header row and that the fields are delimited by commas.
library(RCurl)  # provides getURL
mush_URL <- getURL("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data")
mush_df <- read.csv(text = mush_URL, header = FALSE, sep = ",", stringsAsFactors = FALSE,
    strip.white = TRUE)
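To see what header = FALSE does without touching the network, here is a minimal sketch that parses a small string in the same comma-delimited, headerless layout. The names sample_text and sample_df and the three-field toy rows are assumptions made for illustration; they are not the full records from the file.

```r
# Toy input: two rows, three single-character fields each, no header row.
# These values are illustrative only, not complete records from the dataset.
sample_text <- "p,x,s\ne,x,s"

# Same arguments as the call used for the full file.
sample_df <- read.csv(text = sample_text, header = FALSE, sep = ",",
                      stringsAsFactors = FALSE, strip.white = TRUE)

names(sample_df)  # default column names assigned by read.csv
str(sample_df)    # every column is read as character
```

Because no header row is supplied, read.csv falls back to the default names V1, V2, and so on, which is why the full data frame needs renaming later.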
Again from inspection, the values are single characters and the file carries no header describing them. The descriptions, and more information about the dataset, are found in the mushroom dataset's data dictionary.
There are 23 columns in this dataset. For the sake of example, let us look at just five of them: class, cap color, odor, spore print color, and habitat. The column headers were given default names when the data frame was created:
names(mush_df)
##  [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11"
## [12] "V12" "V13" "V14" "V15" "V16" "V17" "V18" "V19" "V20" "V21" "V22"
## [23] "V23"
The five columns of interest are V1, V4, V6, V21, and V23. Let’s create a subset of these columns and then give them more meaningful names.
mush_sub1 <- subset(mush_df, select = c(V1, V4, V6, V21, V23))
names(mush_sub1) <- c("class", "cap_color", "odor", "spore_print_color", "habitat")
head(mush_sub1)
##   class cap_color odor spore_print_color habitat
## 1     p         n    p                 k       u
## 2     e         y    a                 n       g
## 3     e         w    l                 n       m
## 4     p         w    p                 k       u
## 5     e         g    n                 n       g
## 6     e         y    a                 k       g
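Rather than counting positions by hand, the columns can also be located by name. The sketch below assumes the attribute order in the data dictionary matches the column order in the file; the names mush_names, wanted, and mush_named are introduced here for illustration, and the underscore spelling of the attribute names is my own choice.

```r
# Attribute names in file order, following the data dictionary
# (hyphens replaced with underscores to make syntactic R names).
mush_names <- c("class", "cap_shape", "cap_surface", "cap_color", "bruises",
                "odor", "gill_attachment", "gill_spacing", "gill_size",
                "gill_color", "stalk_shape", "stalk_root",
                "stalk_surface_above_ring", "stalk_surface_below_ring",
                "stalk_color_above_ring", "stalk_color_below_ring",
                "veil_type", "veil_color", "ring_number", "ring_type",
                "spore_print_color", "population", "habitat")

# The five columns used in this example.
wanted <- c("class", "cap_color", "odor", "spore_print_color", "habitat")

# Positions of the wanted columns within the full set of 23.
match(wanted, mush_names)

# If the full data frame is in the session, name a copy and subset by name.
if (exists("mush_df") && ncol(mush_df) == length(mush_names)) {
  mush_named <- setNames(mush_df, mush_names)
  head(mush_named[, wanted])
}
```

Selecting by name keeps the subset step readable and avoids off-by-one mistakes when more columns are added later.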
Our data are still single-character codes for each characteristic. Referring to the data dictionary again, we can decode these seemingly cryptic values. There are many ways, some perhaps very elegant, to recode the data. The most intuitive method, I believe, is the mapvalues function from the plyr package. Applying mapvalues to each column of the subset and displaying the top of the dataset, we get:
library(plyr)  # provides mapvalues
mush_sub1$class <- mapvalues(mush_sub1$class, from = c("e", "p"), to = c("edible",
    "poisonous"))
mush_sub1$cap_color <- mapvalues(mush_sub1$cap_color, from = c("n", "b", "c",
    "g", "r", "p", "u", "e", "w", "y"), to = c("brown", "buff", "cinnamon",
    "gray", "green", "pink", "purple", "red", "white", "yellow"))
mush_sub1$odor <- mapvalues(mush_sub1$odor, from = c("a", "l", "c", "y", "f",
    "m", "n", "p", "s"), to = c("almond", "anise", "creosote", "fishy", "foul",
    "musty", "none", "pungent", "spicy"))
mush_sub1$spore_print_color <- mapvalues(mush_sub1$spore_print_color, from = c("k",
    "n", "b", "h", "r", "o", "u", "w", "y"), to = c("black", "brown", "buff",
    "chocolate", "green", "orange", "purple", "white", "yellow"))
mush_sub1$habitat <- mapvalues(mush_sub1$habitat, from = c("g", "l", "m", "p",
    "u", "w", "d"), to = c("grasses", "leaves", "meadows", "paths", "urban",
    "waste", "woods"))
head(mush_sub1)
##       class cap_color    odor spore_print_color habitat
## 1 poisonous     brown pungent             black   urban
## 2    edible    yellow  almond             brown grasses
## 3    edible     white   anise             brown meadows
## 4 poisonous     white pungent             black   urban
## 5    edible      gray    none             brown grasses
## 6    edible    yellow  almond             black grasses
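One of the other ways to recode the values is with named lookup vectors in base R, which avoids the plyr dependency. This is a different technique from mapvalues, shown here only as a sketch. It is run on a hand-built copy of the six coded rows printed earlier; the names toy and lookups are assumptions introduced for this example, and the labels follow the data dictionary.

```r
# The six coded rows shown by head(mush_sub1), rebuilt by hand.
toy <- data.frame(
  class             = c("p", "e", "e", "p", "e", "e"),
  cap_color         = c("n", "y", "w", "w", "g", "y"),
  odor              = c("p", "a", "l", "p", "n", "a"),
  spore_print_color = c("k", "n", "n", "k", "n", "k"),
  habitat           = c("u", "g", "m", "u", "g", "g"),
  stringsAsFactors  = FALSE
)

# One named vector per column: names are the codes, values are the labels.
lookups <- list(
  class = c(e = "edible", p = "poisonous"),
  cap_color = c(n = "brown", b = "buff", c = "cinnamon", g = "gray",
                r = "green", p = "pink", u = "purple", e = "red",
                w = "white", y = "yellow"),
  odor = c(a = "almond", l = "anise", c = "creosote", y = "fishy",
           f = "foul", m = "musty", n = "none", p = "pungent", s = "spicy"),
  spore_print_color = c(k = "black", n = "brown", b = "buff",
                        h = "chocolate", r = "green", o = "orange",
                        u = "purple", w = "white", y = "yellow"),
  habitat = c(g = "grasses", l = "leaves", m = "meadows", p = "paths",
              u = "urban", w = "waste", d = "woods")
)

# Index each lookup vector by the codes in the matching column.
for (col in names(lookups)) {
  toy[[col]] <- unname(lookups[[col]][toy[[col]]])
}

toy
```

One difference worth noting: a code missing from a lookup vector becomes NA here, whereas mapvalues leaves unmatched values unchanged. Which behavior is preferable depends on whether unexpected codes should be flagged or passed through.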