library(data.table)
library(dplyr)
library(stringr)
library(readr)
I used read.table to load the data into R from my github repository. I stored the data in a data frame called mushroomData. I used the sep = “,” to separate the data by commons found in the raw file.
# pulling the data directly from the source website (commenting out since assignment suggests to pull
# directly from github repository)
# mushroomData <- fread('https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data', header = FALSE, stringsAsFactors = FALSE) #
# pulling data table from my github repository
mushroomData <- "https://raw.githubusercontent.com/zachalexander/data607_cunysps/master/Homework1/agaricus-lepiota.data";
mushroomData <- read.table(mushroomData, sep = ",", stringsAsFactors = FALSE, header = FALSE)
Once I read the data into R, I took a look at its dimensions:
dim(mushroomData)
## [1] 8124 23
I also noticed that the first column had values of “e” or “p”, which I believed to characterize mushrooms as either “edible” or “poisonous”.
table(mushroomData$V1)
##
## e p
## 4208 3916
After looking through the documentation, I located the “Attribute Information” section. Below, are the column names associated with the dataset. With the prior knowledge that the first column of the dataset classifies mushrooms as either edible or poisonous, I was able to add in the additional column names after this first column. I decided to store the headers in a vector for later use.
headers <- c('edib-or-poison', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor',
'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape',
'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring',
'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color',
'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat');
Created a data frame to hold all relabling values. This information was also taken directly from the “Attribute Information” section, referenced above. I created a data frame here to utilize later in a loop.
relabels <- rbind(c('edib-or-poison', "e", "edible"),
c('edib-or-poison', "p", "poisonous"),
c('odor', "a", "almond"),
c('odor', "l", "anise"),
c('odor', "c", "creosote"),
c('odor', "y", "fishy"),
c('odor', "f", "foul"),
c('odor', "m", "musty"),
c('odor', "n", "none"),
c('odor', "p", "pungent"),
c('odor', "s", "spicy"),
c('cap-color', "n", "brown"),
c('cap-color', "b", "buff"),
c('cap-color', "c", "cinnamon"),
c('cap-color', "g", "gray"),
c('cap-color', "r", "green"),
c('cap-color', "p", "pink"),
c('cap-color', "u", "purple"),
c('cap-color', "e", "red"),
c('cap-color', "w", "white"),
c('cap-color', "y", "yellow"),
c('population', "a", "abundant"),
c('population', "c", "clustered"),
c('population', "n", "numerous"),
c('population', "s", "scattered"),
c('population', "v", "several"),
c('population', "y", "solitary"),
c('habitat', "g", "grasses"),
c('habitat', "l", "leaves"),
c('habitat', "m", "meadows"),
c('habitat', "p", "paths"),
c('habitat', "u", "urban"),
c('habitat', "w", "waste"),
c('habitat', "d", "woods")
);
relabels <- data.frame(relabels, stringsAsFactors = FALSE)
for(i in 1:length(headers)) {
names(mushroomData)[i] <- headers[i]
}
I picked these columns for the subset because I found that odor can be very predictive as to whether or not a mushroom is ultimately edible or poisonous. Additionally, cap-color has been quite predictive as well. I also thought that adding population information and habitat information may be interesting for future analysis, given that we could look at trends based on prevelance and types of suitable habitats.
mushroomData <- select(mushroomData, 'edib-or-poison', 'odor', 'cap-color', 'population', 'habitat')
By setting up the relabels data frame above, I created a loop to update the data based on its corresponding values.
for(i in 1:length(relabels$X1)){
mushroomData[[relabels$X1[i]]] <- replace(mushroomData[[relabels$X1[i]]] , mushroomData[[relabels$X1[i]]] == relabels$X2[i], relabels$X3[i])
}
# just printing the first few rows since it's a long dataset.
head(mushroomData)
## edib-or-poison odor cap-color population habitat
## 1 poisonous pungent brown scattered urban
## 2 edible almond yellow numerous grasses
## 3 edible anise white numerous meadows
## 4 poisonous pungent white scattered urban
## 5 edible none gray abundant grasses
## 6 edible almond yellow numerous grasses
# uncomment below to see full dataset
# mushroomData