Assignment 1: Loading Data into a Data Frame

Mushrooms Dataset

A famous (if slightly moldy) dataset about mushrooms can be found in the UCI repository here: https://archive.ics.uci.edu/ml/datasets/Mushroom. The fact that this is such a well-known dataset in the data science community makes it a good dataset to use for comparative benchmarking. For example, if someone was working to build a better decision tree algorithm (or other predictive classifier) to analyze categorical data, this dataset could be useful. A typical problem (which is beyond the scope of this assignment!) is to answer the question, “Which other attribute or attributes are the best predictors of whether a particular mushroom is poisonous or edible?”

Your task is to study the dataset and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there! You should take the data, and create a data frame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous and three or four other columns. You should also add meaningful column names and replace the abbreviations used in the data—for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.

Import Library

library(stringr)

Read Mushrooms Dataset

The dataset has no header row, and read.csv defaults to header = TRUE, so “header=FALSE” is passed explicitly. “dim” shows how many rows and columns the dataset has.

mushroom <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", header=FALSE)
mushroom = as.data.frame(mushroom)
dim(mushroom)
## [1] 8124   23
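
The effect of header=FALSE can be seen on a small inline sample, without downloading anything. This is only a sketch: the three rows below are invented for illustration and trimmed to four columns, and the names sample_text, no_header and with_header are hypothetical.

```r
# Invented three-row sample in the same headerless, comma-separated layout
# (not the real data).
sample_text <- "p,x,s,n\ne,x,s,y\ne,b,s,w"

# With header = FALSE every line is data and columns are named V1, V2, ...
no_header <- read.csv(text = sample_text, header = FALSE)
dim(no_header)      # 3 rows, 4 columns
names(no_header)    # "V1" "V2" "V3" "V4"

# With the default header = TRUE the first record is consumed as column names.
with_header <- read.csv(text = sample_text)
nrow(with_header)   # 2 rows: one observation was lost
```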

The data dictionary lists only 22 attributes, but the data has 23 columns. Since the first column contains only “e” and “p”, I can tell that it is the class.

Checking:

unique(mushroom$V1)
## [1] p e
## Levels: e p
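
A slightly stricter version of this check confirms that no value other than the two documented codes occurs and counts how often each appears. The vector class_codes is a hypothetical stand-in for mushroom$V1.

```r
# Invented class column; the real one is mushroom$V1.
class_codes <- c("p", "e", "e", "p", "e")

# TRUE only if every value is one of the two documented class codes.
only_known_codes <- all(class_codes %in% c("e", "p"))
only_known_codes

# Frequency of each code.
code_counts <- table(class_codes)
code_counts
```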

Check whether any cell in the dataset contains the string “NULL”

summary (data.frame (mushroom == "NULL"))
##      V1              V2              V3              V4         
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:8124      FALSE:8124      FALSE:8124      FALSE:8124     
##      V5              V6              V7              V8         
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:8124      FALSE:8124      FALSE:8124      FALSE:8124     
##      V9             V10             V11             V12         
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:8124      FALSE:8124      FALSE:8124      FALSE:8124     
##     V13             V14             V15             V16         
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:8124      FALSE:8124      FALSE:8124      FALSE:8124     
##     V17             V18             V19             V20         
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:8124      FALSE:8124      FALSE:8124      FALSE:8124     
##     V21             V22             V23         
##  Mode :logical   Mode :logical   Mode :logical  
##  FALSE:8124      FALSE:8124      FALSE:8124

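According to the result, no cell contains the literal string “NULL”. Note that this check does not detect the missing values that the data dictionary encodes as “?” (all in the stalk-root attribute); none of the columns selected below is affected, so we can move on.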
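
The UCI data dictionary states that missing values in this dataset are written as “?”. A hedged sketch of counting such markers per column follows; the data frame toy and its contents are invented for illustration.

```r
# Invented toy frame with "?" standing in for missing values.
toy <- data.frame(
  V1 = c("e", "p", "e"),
  V2 = c("b", "?", "c"),
  V3 = c("?", "?", "n"),
  stringsAsFactors = FALSE
)

# Comparing a data frame with a string yields a logical matrix;
# colSums() then counts the TRUE cells in each column.
missing_per_column <- colSums(toy == "?")
missing_per_column

# Alternatively, the marker can be converted to NA while reading, e.g.
# read.csv(path, header = FALSE, na.strings = "?"), followed by is.na().
```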

Create a data frame named “m_colortable” with a subset of 7 columns from the dataset

By choosing the Class and 6 other columns related to colors, I created a data frame for them, named m_colortable.

m_colortable <- mushroom[, c(1, 4, 10, 15, 16, 18, 21)]

Apply meaningful column names

The 7 columns are Class, Cap_color, Gill_color, Stalk_color_above_ring, Stalk_color_below_ring, Veil_color, and SporePrint_color.

colnames(m_colortable) <- c("Class", "Cap_color", "Gill_color", "Stalk_color_above_ring", "Stalk_color_below_ring", "Veil_color", "SporePrint_color")
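
One way to keep the column positions and their names from drifting apart is to store them together in a named vector. This is a sketch on an invented five-column frame; raw, wanted and subset_df are hypothetical names, not part of the assignment code.

```r
# Invented stand-in for the full data: five unnamed columns V1..V5.
raw <- data.frame(V1 = c("e", "p"), V2 = c("x", "b"), V3 = c("s", "y"),
                  V4 = c("n", "w"), V5 = c("t", "f"),
                  stringsAsFactors = FALSE)

# Each new column name is paired with the position it is taken from.
wanted <- c(Class = 1, Cap_color = 4)

# Select by position, then rename using the names of the same vector.
subset_df <- setNames(raw[, wanted], names(wanted))
subset_df
```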

Replace the abbreviations used in the data with meaningful words

Replace “e” with “edible” and “p” with “poisonous” in column 1, and replace the single-letter codes in the other 6 columns with their corresponding colors.

library(plyr)
m_colortable$Class <- revalue(m_colortable$Class, c("p" = "poisonous", "e" = "edible"))
m_colortable <- sapply(m_colortable, function(x) revalue(x, c("n"="brown", "b"="buff", "c"="cinnamon", "g"="gray", "r"="green", "p"="pink",
                       "u"="purple", "e"="red", "w"="white", "y"="yellow", "k"="black", "h"="chocolate", "o"="orange")))
## The following `from` values were not present in `x`: n, b, c, g, r, p, u, e, w, y, k, h, o
## The following `from` values were not present in `x`: k, h, o
## The following `from` values were not present in `x`: c
## The following `from` values were not present in `x`: r, u, k, h
## The following `from` values were not present in `x`: r, u, k, h
## The following `from` values were not present in `x`: b, c, g, r, p, u, e, k, h
## The following `from` values were not present in `x`: c, g, p, e
m_colortable <- data.frame(m_colortable)
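
The first message above appears because the color mapping is also applied to the already-recoded Class column. As an alternative sketch in base R (not the method used in this assignment), a named lookup vector can be applied to the color columns only, which also keeps the result a data frame. The frame toy and the shortened lookup vectors are invented for illustration.

```r
toy <- data.frame(Class = c("p", "e"), Cap_color = c("n", "y"),
                  Gill_color = c("k", "w"), stringsAsFactors = FALSE)

class_names <- c(p = "poisonous", e = "edible")
color_names <- c(n = "brown", y = "yellow", k = "black", w = "white")

# Indexing a named vector by code performs the lookup; unname() drops the codes.
# as.character() guards against factor columns being indexed by integer level.
toy$Class <- unname(class_names[as.character(toy$Class)])

# Recode only the color columns, so "p" and "e" in Class are never
# confused with pink and red.
color_cols <- setdiff(names(toy), "Class")
toy[color_cols] <- lapply(toy[color_cols],
                          function(x) unname(color_names[as.character(x)]))
toy
```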

View the first 20 rows of the result for reference

head(m_colortable, 20)
##        Class Cap_color Gill_color Stalk_color_above_ring
## 1  poisonous     brown      black                  white
## 2     edible    yellow      black                  white
## 3     edible     white      brown                  white
## 4  poisonous     white      brown                  white
## 5     edible      gray      black                  white
## 6     edible    yellow      brown                  white
## 7     edible     white       gray                  white
## 8     edible     white      brown                  white
## 9  poisonous     white       pink                  white
## 10    edible    yellow       gray                  white
## 11    edible    yellow       gray                  white
## 12    edible    yellow      brown                  white
## 13    edible    yellow      white                  white
## 14 poisonous     white      black                  white
## 15    edible     brown      brown                  white
## 16    edible      gray      black                  white
## 17    edible     white      black                  white
## 18 poisonous     brown      brown                  white
## 19 poisonous     white      brown                  white
## 20 poisonous     brown      black                  white
##    Stalk_color_below_ring Veil_color SporePrint_color
## 1                   white      white            black
## 2                   white      white            brown
## 3                   white      white            brown
## 4                   white      white            black
## 5                   white      white            brown
## 6                   white      white            black
## 7                   white      white            black
## 8                   white      white            brown
## 9                   white      white            black
## 10                  white      white            black
## 11                  white      white            brown
## 12                  white      white            black
## 13                  white      white            brown
## 14                  white      white            brown
## 15                  white      white            black
## 16                  white      white            brown
## 17                  white      white            brown
## 18                  white      white            black
## 19                  white      white            brown
## 20                  white      white            brown

Let’s reshape the dataset into long format and generate a pivot table to make the result easier to read

library(reshape2)
test <- melt(m_colortable, id.vars = c("Class"), variable.name = "Parts", value.name = "Color")
## Warning: attributes are not identical across measure variables; they will
## be dropped
library(rpivotTable)
rpivotTable(test, rows = c("Parts","Color"), cols = "Class", rendererName = "Table Barchart", width = "100%", height = "100%")
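
The pivot table is an interactive widget, so it does not show up in plain console output. A static cross-tabulation with the same rows and columns can be sketched in base R as below; this is an illustrative alternative on an invented three-row frame, and toy, long and tab are hypothetical names.

```r
toy <- data.frame(
  Class = c("edible", "poisonous", "edible"),
  Cap_color = c("brown", "white", "brown"),
  Gill_color = c("black", "brown", "brown"),
  stringsAsFactors = FALSE
)

# stack() turns the measure columns into (values, ind) pairs, a base-R
# long format; Class is repeated once per stacked column.
measure_cols <- c("Cap_color", "Gill_color")
long <- cbind(Class = rep(toy$Class, times = length(measure_cols)),
              stack(toy[measure_cols]))
names(long) <- c("Class", "Color", "Parts")

# Count rows for every Parts x Color x Class combination.
tab <- xtabs(~ Parts + Color + Class, data = long)

# ftable() lays the three-way table out flat for printing.
ftable(tab)
```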