Our task is to study the mushroom dataset from the UCI Machine Learning Repository together with the associated description of the data (i.e. the “data dictionary”).
Load the data and name the columns according to the data dictionary
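library(dplyr)    # select(), filter(), count() and the %>% pipe
library(ggplot2)  # ggplot() and friends

# The file has no header row, so header = FALSE keeps the first record as data.
# read.csv() converts the hyphens in these names to dots (e.g. cap.shape).
mushroom <- read.csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data",
  header = FALSE,
  col.names = c("classes", "cap-shape", "cap-surface", "cap-color",
                "bruises", "odor", "gill-attachment", "gill-spacing",
                "gill-size", "gill-color", "stalk-shape", "stalk-root",
                "stalk-surface-above-ring", "stalk-surface-below-ring",
                "stalk-color-above-ring", "stalk-color-below-ring",
                "veil-type", "veil-color", "ring-number", "ring-type",
                "spore-print-color", "population", "habitat"),
  na.strings = "?")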
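Because the raw file has no header row, it is worth checking how `read.csv()` handles that case. The sketch below is only an illustration: it uses two made-up records passed through the `text` argument (a stand-in for the real file, not actual data) and three of the column names, and compares the default `header = TRUE` with `header = FALSE`.

```r
# Two made-up records in the same comma-separated style as the mushroom file.
raw  <- "p,x,s\ne,b,y"
cols <- c("classes", "cap-shape", "cap-surface")

# Default header = TRUE: the first record is consumed as a header line.
with_header <- read.csv(text = raw, col.names = cols)

# header = FALSE: both records are kept as data.
no_header <- read.csv(text = raw, header = FALSE, col.names = cols)

nrow(with_header)
nrow(no_header)
names(no_header)  # hyphens are converted to dots because check.names = TRUE
```

Comparing the two row counts shows whether a record was silently lost during loading.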
Create a data frame with a subset of the columns in the dataset. I will disregard the physical attributes, i.e. cap, bruises, gill, stalk, veil, ring and spore, and keep only the class, population, habitat and odor.
mushroomdf <- mushroom %>%
  select(classes, population, habitat, odor)
head(mushroomdf, 5)
Subset by row
In which habitats are mushroom populations abundant?
mushroomdf %>%
  filter(population == "a") %>%
  count(habitat)
# Only the grass habitat has abundant mushroom populations.
Are foul-smelling mushrooms always poisonous?
mushroomdf %>%
  filter(odor == "f") %>%
  count(classes)
# In this dataset, foul-smelling mushrooms are always poisonous.
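The same question can be asked for every odor at once with a cross-tabulation. The following is a minimal sketch in base R; the `toy` data frame holds made-up rows purely for illustration (in the analysis it would be replaced by `mushroomdf`), and `always_poisonous` is a hypothetical helper name.

```r
# Toy rows for illustration only; these are not taken from the real dataset.
toy <- data.frame(
  classes = c("p", "p", "e", "e", "p"),
  odor    = c("f", "f", "n", "a", "n")
)

# Rows are odor codes, columns are classes (e = edible, p = poisonous).
odor_by_class <- table(toy$odor, toy$classes, dnn = c("odor", "classes"))
odor_by_class

# Odor codes that never occur among the edible rows.
always_poisonous <- rownames(odor_by_class)[odor_by_class[, "e"] == 0]
always_poisonous
```

A table like this makes it easy to see which odors occur in only one class without filtering on each code separately.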
We can learn a lot more from a visual representation of the mushroom populations
ggplot(mushroomdf, aes(x = classes, y = population)) +
  geom_jitter(aes(colour = population)) +
  labs(title = "Mushroom Population", x = "Edible & Poisonous", y = "Population") +
  scale_colour_discrete(name = "Population Density",
                        labels = c("Abundant", "Clustered", "Numerous",
                                   "Scattered", "Several", "Solitary")) +
  theme_minimal()
# The graph shows that mushrooms from abundant populations are edible.
# One possible explanation is that mushrooms which grow in abundance face few predators,
# so they have no natural reason to be poisonous.
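A jitter plot shows where the points fall, but a table of proportions puts numbers on the same comparison. The sketch below is an assumption about how one might check it, using base R and made-up `toy` rows rather than the real data; with the actual analysis, `mushroomdf` would take the place of `toy`.

```r
# Toy rows for illustration only; these are not taken from the real dataset.
toy <- data.frame(
  classes    = c("e", "e", "e", "p", "p", "e"),
  population = c("a", "a", "v", "v", "v", "y")
)

# Count mushrooms for each combination of population and class.
counts <- table(toy$population, toy$classes, dnn = c("population", "classes"))

# margin = 1 rescales each row, so the shares within a population sum to 1.
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

A population level whose edible share equals 1 contains no poisonous mushrooms in the data being summarised.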