I begin by loading the car package, whose recode function I use below to replace the abbreviated values with full labels.
library(car)
## Loading required package: carData
After downloading the mushrooms file (https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data) and uploading it to GitHub, I read the raw file into R as a data frame called “mushrooms”. Because the file has no header row, I set header = FALSE. I then check the first six rows of the first six columns.
mushrooms <- read.csv("https://raw.githubusercontent.com/chrosemo/data607_fall19_week1/master/agaricus-lepiota.data", header = FALSE)
head(mushrooms[1:6])
##   V1 V2 V3 V4 V5 V6
## 1  p  x  s  n  t  p
## 2  e  x  s  y  t  a
## 3  e  b  s  w  t  l
## 4  p  x  y  w  t  p
## 5  e  x  s  g  f  n
## 6  e  x  y  y  t  a
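The effect of header = FALSE can be seen on a small inline sample. This is only a sketch: the two records are copied from the output above and cut to six fields, and the names sample_text and sample_rows are made up for illustration.

```r
# Two records in the same comma-separated layout as the mushrooms file,
# truncated to six fields for illustration (hypothetical sample).
sample_text <- "p,x,s,n,t,p\ne,x,s,y,t,a"

# header = FALSE stops read.csv from treating the first record as column
# names, so R assigns the default names V1, V2, ... instead.
# colClasses = "character" keeps every single-letter code as text.
sample_rows <- read.csv(text = sample_text, header = FALSE,
                        colClasses = "character")

print(names(sample_rows))
print(dim(sample_rows))
```

Without header = FALSE, the first record would be consumed as column names and the data frame would be one row short.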
I update the column names of “mushrooms” to reflect the attribute names listed in the agaricus-lepiota.names file (https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names), using underscores in place of the file’s hyphens, and again check the first six rows of the first six columns.
colnames(mushrooms) <- c("class", "cap_shape", "cap_surface", "cap_color", "bruises", "odor",
                         "gill_attachment", "gill_spacing", "gill_size", "gill_color",
                         "stalk_shape", "stalk_root", "stalk_surface_above_ring", "stalk_surface_below_ring",
                         "stalk_color_above_ring", "stalk_color_below_ring",
                         "veil_type", "veil_color", "ring_number", "ring_type",
                         "spore_print_color", "population", "habitat")
head(mushrooms[1:6])
##   class cap_shape cap_surface cap_color bruises odor
## 1     p         x           s         n       t    p
## 2     e         x           s         y       t    a
## 3     e         b           s         w       t    l
## 4     p         x           y         w       t    p
## 5     e         x           s         g       f    n
## 6     e         x           y         y       t    a
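The renaming step can also be sketched programmatically. The example below assumes the hyphenated spellings used in the names file (for example cap-shape) and works on a three-column toy data frame; raw_names, clean_names, and toy are hypothetical names, not part of the original analysis.

```r
# Three attribute names in the hyphenated spelling assumed from the
# names file; the list is shortened for illustration.
raw_names <- c("cap-shape", "cap-surface", "cap-color")

# Replace hyphens with underscores so each name is a syntactic R name
# and can be used with `$` without backticks.
clean_names <- gsub("-", "_", raw_names, fixed = TRUE)

# Toy data frame standing in for three columns of "mushrooms".
toy <- data.frame(V1 = "x", V2 = "s", V3 = "n")

# Check that there is exactly one name per column before assigning.
stopifnot(length(clean_names) == ncol(toy))
colnames(toy) <- clean_names

print(colnames(toy))
```

The length check is worth keeping with a 23-element name vector, where a dropped or duplicated entry is easy to miss by eye.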
I subset the data frame, creating a new one called “mushrooms_working” that keeps five columns: “class”, “cap_shape”, “cap_color”, “population”, and “habitat”. I then check the new data frame’s bottom six rows.
mushrooms_working <- mushrooms[c("class", "cap_shape", "cap_color", "population", "habitat")]
tail(mushrooms_working)
##      class cap_shape cap_color population habitat
## 8119     p         k         n          v       d
## 8120     e         k         n          c       l
## 8121     e         x         n          v       l
## 8122     e         f         n          c       l
## 8123     p         k         n          v       l
## 8124     e         x         n          c       l
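The subsetting above indexes the data frame with a character vector of column names. A minimal sketch of how that form behaves, on a toy data frame with made-up values (toy, keep, by_list, and by_matrix are hypothetical names):

```r
# Toy data frame with hypothetical values, standing in for "mushrooms".
toy <- data.frame(class = c("p", "e"), cap_shape = c("x", "b"),
                  odor = c("p", "l"), habitat = c("u", "m"))

keep <- c("class", "habitat")

# Single-bracket indexing with a character vector selects columns by
# name and returns a data frame.
by_list <- toy[keep]

# The matrix-style form selects the same columns when more than one
# column is kept.
by_matrix <- toy[, keep]

print(names(by_list))
print(names(by_matrix))

# With a single column, the matrix-style form drops to a plain vector
# unless drop = FALSE is given; the list-style form stays a data frame.
print(is.data.frame(toy["class"]))
print(is.data.frame(toy[, "class"]))
```

Selecting by name rather than by position keeps the subset correct even if the column order changes later.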
Finally, I recode each of the five columns using car’s recode function, replacing each single-letter abbreviation with the full value given in the agaricus-lepiota.names file. I then check the recoded data frame’s bottom six rows.
mushrooms_working$class <- recode(mushrooms_working$class, "'e' = 'edible'; 'p' = 'poisonous'")
mushrooms_working$cap_shape <- recode(mushrooms_working$cap_shape, "'b' = 'bell'; 'c' = 'conical'; 'x' = 'convex'; 'f' = 'flat'; 'k' = 'knobbed'; 's' = 'sunken'")
mushrooms_working$cap_color <- recode(mushrooms_working$cap_color, "'n' = 'brown'; 'b' = 'buff'; 'c' = 'cinnamon'; 'g' = 'gray'; 'r' = 'green'; 'p' = 'pink'; 'u' = 'purple'; 'e' = 'red'; 'w' = 'white'; 'y' = 'yellow'")
mushrooms_working$population <- recode(mushrooms_working$population, "'a' = 'abundant'; 'c' = 'clustered'; 'n' = 'numerous'; 's' = 'scattered'; 'v' = 'several'; 'y' = 'solitary'")
mushrooms_working$habitat <- recode(mushrooms_working$habitat, "'g' = 'grasses'; 'l' = 'leaves'; 'm' = 'meadows'; 'p' = 'paths'; 'u' = 'urban'; 'w' = 'waste'; 'd' = 'woods'")
tail(mushrooms_working)
##          class cap_shape cap_color population habitat
## 8119 poisonous   knobbed     brown    several   woods
## 8120    edible   knobbed     brown  clustered  leaves
## 8121    edible    convex     brown    several  leaves
## 8122    edible      flat     brown  clustered  leaves
## 8123 poisonous   knobbed     brown    several  leaves
## 8124    edible    convex     brown  clustered  leaves
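The same kind of recoding can be sketched in base R with a named lookup vector. This is an alternative technique, not the car approach used above; the mapping is copied from the habitat recode, while habitat_codes, habitat_labels, and habitat_recoded are hypothetical names and the sample codes are made up.

```r
# Hypothetical sample of single-letter habitat codes.
habitat_codes <- c("d", "l", "l", "g", "u")

# Named vector: the names are the codes, the values are the labels
# (same mapping as the habitat recode above).
habitat_labels <- c(g = "grasses", l = "leaves", m = "meadows",
                    p = "paths", u = "urban", w = "waste", d = "woods")

# Indexing the lookup by code returns the matching label for each
# element. as.character() guards against factor input, which would
# otherwise index by integer level; unname() drops the codes that
# would ride along as element names.
habitat_recoded <- unname(habitat_labels[as.character(habitat_codes)])
print(habitat_recoded)

# A code missing from the lookup comes back as NA, so unmapped values
# are easy to detect.
stopifnot(!anyNA(habitat_recoded))
```

One design difference to keep in mind: in this lookup, a code with no entry becomes NA, whereas car’s recode, as far as I am aware, leaves values that match no rule unchanged unless an else clause is given.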