Loading libraries

library(data.table)
library(dplyr)
library(stringr)
library(readr)

Reading the data

I used read.table to load the data into R from my github repository. I stored the data in a data frame called mushroomData. I used the sep = “,” to separate the data by commons found in the raw file.

# pulling the data directly from the source website (commenting out since assignment suggests to pull
# directly from github repository)

# mushroomData <- fread('https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data', header = FALSE, stringsAsFactors = FALSE) #

# pulling data table from my github repository

mushroomData <- "https://raw.githubusercontent.com/zachalexander/data607_cunysps/master/Homework1/agaricus-lepiota.data";

mushroomData <- read.table(mushroomData, sep = ",", stringsAsFactors = FALSE, header = FALSE)

Dimensions of original data and locating edible/poisonous column

Once I read the data into R, I took a look at its dimensions:

dim(mushroomData)

## [1] 8124   23

I also noticed that the first column had values of “e” or “p”, which I believed to characterize mushrooms as either “edible” or “poisonous”.

table(mushroomData$V1)

## 
##    e    p 
## 4208 3916

After looking through the documentation, I located the “Attribute Information” section. Below, are the column names associated with the dataset. With the prior knowledge that the first column of the dataset classifies mushrooms as either edible or poisonous, I was able to add in the additional column names after this first column. I decided to store the headers in a vector for later use.

Created a vector to store all header information

headers <- c('edib-or-poison', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor',
             'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape',
             'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring',
             'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color',
             'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat');

Create data frame to hold all relabeling information

Created a data frame to hold all relabling values. This information was also taken directly from the “Attribute Information” section, referenced above. I created a data frame here to utilize later in a loop.

relabels <- rbind(c('edib-or-poison', "e", "edible"), 
                  c('edib-or-poison', "p", "poisonous"),
                  c('odor', "a", "almond"),
                  c('odor', "l", "anise"),
                  c('odor', "c", "creosote"),
                  c('odor', "y", "fishy"),
                  c('odor', "f", "foul"),
                  c('odor', "m", "musty"),
                  c('odor', "n", "none"),
                  c('odor', "p", "pungent"),
                  c('odor', "s", "spicy"),
                  c('cap-color', "n", "brown"),
                  c('cap-color', "b", "buff"),
                  c('cap-color', "c", "cinnamon"),
                  c('cap-color', "g", "gray"),
                  c('cap-color', "r", "green"),
                  c('cap-color', "p", "pink"),
                  c('cap-color', "u", "purple"),
                  c('cap-color', "e", "red"),
                  c('cap-color', "w", "white"),
                  c('cap-color', "y", "yellow"),
                  c('population', "a", "abundant"),
                  c('population', "c", "clustered"),
                  c('population', "n", "numerous"),
                  c('population', "s", "scattered"),
                  c('population', "v", "several"),
                  c('population', "y", "solitary"),
                  c('habitat', "g", "grasses"),
                  c('habitat', "l", "leaves"),
                  c('habitat', "m", "meadows"),
                  c('habitat', "p", "paths"),
                  c('habitat', "u", "urban"),
                  c('habitat', "w", "waste"),
                  c('habitat', "d", "woods")
                );

relabels <- data.frame(relabels, stringsAsFactors = FALSE)

Added headers to mushroom data

for(i in 1:length(headers)) {
  names(mushroomData)[i] <- headers[i]
}

Subset mushroom dataset to include 5 columns

I picked these columns for the subset because I found that odor can be very predictive as to whether or not a mushroom is ultimately edible or poisonous. Additionally, cap-color has been quite predictive as well. I also thought that adding population information and habitat information may be interesting for future analysis, given that we could look at trends based on prevelance and types of suitable habitats.

mushroomData <- select(mushroomData, 'edib-or-poison', 'odor', 'cap-color', 'population', 'habitat')

Utilize the relables data frame to re-label values in mushroom dataset

By setting up the relabels data frame above, I created a loop to update the data based on its corresponding values.

for(i in 1:length(relabels$X1)){
  mushroomData[[relabels$X1[i]]] <- replace(mushroomData[[relabels$X1[i]]] , mushroomData[[relabels$X1[i]]] == relabels$X2[i], relabels$X3[i])
}

Print out final, subsetted data frame

# just printing the first few rows since it's a long dataset.
head(mushroomData)

##   edib-or-poison    odor cap-color population habitat
## 1      poisonous pungent     brown  scattered   urban
## 2         edible  almond    yellow   numerous grasses
## 3         edible   anise     white   numerous meadows
## 4      poisonous pungent     white  scattered   urban
## 5         edible    none      gray   abundant grasses
## 6         edible  almond    yellow   numerous grasses

# uncomment below to see full dataset 
# mushroomData

Homework1

Zach Alexander

August 30, 2019