The assignment is to load the Mushroom Dataset from the UCI repository into a data frame, give the columns meaningful names, replace the abbreviations used in the data, and subset the columns of the data frame. The subset must include the column that indicates whether the mushroom is edible or poisonous, plus three or four other columns.
The first step will be to download the data to the local environment:
download.file('https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data', 'agaricus-lepiota.data')
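Re-running the script repeats the download every time. A minimal sketch that skips it when a local copy already exists; the helper name `fetch_if_missing` is hypothetical and not part of any package:

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is absent.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, dest)
  }
  # Return the path invisibly so the call can feed straight into a reader.
  invisible(dest)
}

# Same URL as in the download.file() call above.
data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'
```

In this script the call would be `fetch_if_missing(data_url, 'agaricus-lepiota.data')`.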
Now that we have the dataset locally, we will create a data frame:
df <- read.table('agaricus-lepiota.data', sep=',', stringsAsFactors=FALSE)
head(df)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1 p x s n t p f c n k e e s s w w p w o p
## 2 e x s y t a f c b k e c s s w w p w o p
## 3 e b s w t l f c b n e c s s w w p w o p
## 4 p x y w t p f c n n e e s s w w p w o p
## 5 e x s g f n f w b k t e s s w w p w o e
## 6 e x y y t a f c b n e c s s w w p w o p
## V21 V22 V23
## 1 k s u
## 2 n n g
## 3 n n m
## 4 k s u
## 5 n a g
## 6 k n g
According to the documentation, we should have 8,124 rows and 23 columns of data.
dim(df)
## [1] 8124 23
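The comparison against the documented shape can also be written as an assertion, so the script stops if the file ever changes. A small sketch, assuming a stand-in data frame named `df_demo` (hypothetical) in place of the real `df`:

```r
# Expected shape, taken from the figures quoted from the documentation.
expected_dim <- c(8124L, 23L)

# Stand-in data frame so the snippet runs on its own; in the real script
# this would be the data frame returned by read.table().
df_demo <- as.data.frame(matrix("x", nrow = 8124, ncol = 23),
                         stringsAsFactors = FALSE)

# stopifnot() raises an error if the dimensions differ from the expected ones.
stopifnot(identical(dim(df_demo), expected_dim))
```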
Our data checks out, but the names are not very useful.
names(df)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11"
## [12] "V12" "V13" "V14" "V15" "V16" "V17" "V18" "V19" "V20" "V21" "V22"
## [23] "V23"
I will rename the columns by using the dplyr package:
library(dplyr)
df <- df %>%
  rename(type = V1,
         cap_shape = V2,
         cap_surface = V3,
         cap_color = V4,
         bruises = V5,
         odor = V6,
         gill_attachment = V7,
         gill_spacing = V8,
         gill_size = V9,
         gill_color = V10,
         stalk_shape = V11,
         stalk_root = V12,
         stalk_surface_above_ring = V13,
         stalk_surface_below_ring = V14,
         stalk_color_above_ring = V15,
         stalk_color_below_ring = V16,
         veil_type = V17,
         veil_color = V18,
         ring_number = V19,
         ring_type = V20,
         spore_print_color = V21,
         population = V22,
         habitat = V23)
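Because all 23 columns are renamed in file order, the same result can be reached in base R by assigning a character vector to `names()`. This is an alternative sketch, not the approach used in this write-up; `df_demo` is a hypothetical one-row stand-in for the freshly read data:

```r
# Column names in file order, matching the rename() call above.
col_names <- c("type", "cap_shape", "cap_surface", "cap_color", "bruises", "odor",
               "gill_attachment", "gill_spacing", "gill_size", "gill_color",
               "stalk_shape", "stalk_root", "stalk_surface_above_ring",
               "stalk_surface_below_ring", "stalk_color_above_ring",
               "stalk_color_below_ring", "veil_type", "veil_color",
               "ring_number", "ring_type", "spore_print_color",
               "population", "habitat")

# Stand-in for the raw data: one row with the default names V1..V23.
df_demo <- as.data.frame(matrix("x", nrow = 1, ncol = 23),
                         stringsAsFactors = FALSE)

# Guard against a miscounted vector before assigning the names.
stopifnot(length(col_names) == ncol(df_demo))
names(df_demo) <- col_names
```

The same vector could also be passed to `read.table()` through its `col.names` argument, which would name the columns at load time.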
Before recoding the values I want to subset the data frame, because I would like to use this data to predict whether a mushroom is poisonous. According to the documentation, odor, spore print color, stalk surface below ring, and stalk color above ring are four attributes that can predict whether a mushroom is poisonous with 99.9% accuracy across the 23 species in the dataset. Note that the selection below keeps stalk_color_below_ring rather than stalk_surface_below_ring, so it does not match that list exactly.
df <- df %>%
  select(type, odor, spore_print_color, stalk_color_below_ring, stalk_color_above_ring)
Now that we have this subset, let’s examine the values in each column:
lapply(df, function(x) table(x))
## $type
## x
## e p
## 4208 3916
##
## $odor
## x
## a c f l m n p s y
## 400 192 2160 400 36 3528 256 576 576
##
## $spore_print_color
## x
## b h k n o r u w y
## 48 1632 1872 1968 48 72 48 2388 48
##
## $stalk_color_below_ring
## x
## b c e g n o p w y
## 432 36 96 576 512 192 1872 4384 24
##
## $stalk_color_above_ring
## x
## b c e g n o p w y
## 432 36 96 576 448 192 1872 4464 8
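Since the goal is to predict type, a two-way table of type against a single predictor is a natural next look. A minimal sketch on a small made-up sample; the rows in `toy` are illustrative and are not drawn from the dataset:

```r
# Illustrative toy sample, not real rows from the mushroom data.
toy <- data.frame(type = c("p", "e", "e", "p", "e", "p"),
                  odor = c("p", "a", "l", "f", "n", "f"),
                  stringsAsFactors = FALSE)

# Rows are classes, columns are odor codes; each cell counts matching rows.
counts <- table(type = toy$type, odor = toy$odor)
print(counts)
```

On the full data the equivalent call would be `table(df$type, df$odor)`.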
Now that we have the subset of columns, we can recode the variables. I will do this with dplyr, calling recode() inside mutate().
df <- df %>%
  mutate(type = recode(type, p="Poisonous", e="Edible"),
         odor = recode(odor, a="Almond", c="Creosote", f="Foul", l="Anise",
                       m="Musty", n="None", p="Pungent", s="Spicy", y="Fishy"),
         spore_print_color = recode(spore_print_color, b="Buff", h="Chocolate",
                                    k="Black", n="Brown", o="Orange", r="Green",
                                    u="Purple", w="White", y="Yellow"),
         stalk_color_below_ring = recode(stalk_color_below_ring, b="Buff",
                                         c="Cinnamon", e="Red", g="Gray",
                                         n="Brown", o="Orange", p="Pink",
                                         w="White", y="Yellow"),
         stalk_color_above_ring = recode(stalk_color_above_ring, b="Buff",
                                         c="Cinnamon", e="Red", g="Gray",
                                         n="Brown", o="Orange", p="Pink",
                                         w="White", y="Yellow"))
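The same mapping can be expressed as a named lookup vector in base R, which makes it easy to check that every code was translated. This is an alternative sketch rather than the method used above; `odor_codes` is a hypothetical stand-in for the raw `odor` column:

```r
# Code-to-label map for odor, using the same pairs as the recode() call above.
odor_labels <- c(a = "Almond", c = "Creosote", f = "Foul", l = "Anise",
                 m = "Musty", n = "None", p = "Pungent", s = "Spicy", y = "Fishy")

# Toy vector of raw codes standing in for the odor column.
odor_codes <- c("p", "a", "l", "n", "f")

# Indexing a named vector by code returns the matching label;
# a code that is missing from the map comes back as NA.
odor_recoded <- unname(odor_labels[odor_codes])

# Fail loudly if any code had no label.
stopifnot(!anyNA(odor_recoded))
```

One difference worth knowing: with character input and no `.default`, `recode()` leaves unmatched values unchanged, whereas the lookup returns `NA` for them, so the final check catches a forgotten code.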
Now to examine the final product:
head(df)
## type odor spore_print_color stalk_color_below_ring
## 1 Poisonous Pungent Black White
## 2 Edible Almond Brown White
## 3 Edible Anise Brown White
## 4 Poisonous Pungent Black White
## 5 Edible None Brown White
## 6 Edible Almond Black White
## stalk_color_above_ring
## 1 White
## 2 White
## 3 White
## 4 White
## 5 White
## 6 White
tail(df)
## type odor spore_print_color stalk_color_below_ring
## 8119 Poisonous Foul White White
## 8120 Edible None Buff Orange
## 8121 Edible None Buff Orange
## 8122 Edible None Buff Orange
## 8123 Poisonous Fishy White White
## 8124 Edible None Orange Orange
## stalk_color_above_ring
## 8119 Pink
## 8120 Orange
## 8121 Orange
## 8122 Orange
## 8123 White
## 8124 Orange