Data set: Mushroom Data Set
Origin: Mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf.
Donor: Jeff Schlimmer (Jeffrey.Schlimmer ‘@’ a.gp.cs.cmu.edu)
Information: This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like “leaflets three, let it be” for Poisonous Oak and Ivy.
Task: Your task is to study the dataset and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there! You should take the data, and create a data frame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous and three or four other columns. You should also add meaningful column names and replace the abbreviations used in the data—for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.
Read the data from GitHub.
Mushroom_data <- read.csv("https://raw.githubusercontent.com/oggyluky11/DATA607-Assignment-1/master/agaricus-lepiota.data")
head(Mushroom_data)
##   p x s n t p.1 f c n.1 k e e.1 s.1 s.2 w w.1 p.2 w.2 o p.3 k.1 s.3 u
## 1 e x s y t   a f c   b k e   c   s   s w   w   p   w o   p   n   n g
## 2 e b s w t   l f c   b n e   c   s   s w   w   p   w o   p   n   n m
## 3 p x y w t   p f c   n n e   e   s   s w   w   p   w o   p   k   s u
## 4 e x s g f   n f w   b k t   e   s   s w   w   p   w o   e   n   a g
## 5 e x y y t   a f c   b n e   c   s   s w   w   p   w o   p   k   n g
## 6 e b s w t   a f c   b g e   c   s   s w   w   p   w o   p   k   n m
dim(Mushroom_data)
## [1] 8123 23
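One thing worth checking at this point: the raw file appears to have no header row, and `read.csv()` treats the first line as column names by default. That would explain the column names `p`, `x`, `s`, ... shown above and the count of 8123 rows, one fewer than the 8124 instances listed in the UCI description. The sketch below illustrates the effect on a three-record inline sample instead of the real file; `sample_text` and the two result names are made up for illustration.

```r
# Three inline records standing in for the first lines of a headerless file
# (illustrative sample, truncated to four fields per record).
sample_text <- "p,x,s,n\ne,x,s,y\ne,b,s,w"

# Default behaviour: the first record is consumed as column names.
with_default <- read.csv(text = sample_text)
nrow(with_default)    # one record fewer than the input
names(with_default)   # the first record's values, used as names

# header = FALSE keeps every record and assigns placeholder names V1, V2, ...
with_all_rows <- read.csv(text = sample_text, header = FALSE)
nrow(with_all_rows)
names(with_all_rows)
```

Applying `header = FALSE` to the real file would also change the column names, so the column selections that follow would need to refer to the new names.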
The subset includes the column that indicates edible or poisonous and three surface-related columns: cap surface, stalk surface above the ring, and stalk surface below the ring.
task_data <- Mushroom_data[c("p", "s", "s.1", "s.2")]
head(task_data)
##   p s s.1 s.2
## 1 e s   s   s
## 2 e s   s   s
## 3 p y   s   s
## 4 e s   s   s
## 5 e y   s   s
## 6 e s   s   s
dim(task_data)
## [1] 8123 4
names(task_data) <- c("classification", "cap-surface", "stalk-surface-above-ring", "stalk-surface-below-ring")
Classification <- data.frame("Abbr" = c("e","p"),"Name" = c("edible","poisonous"))
Surface <- data.frame("Abbr" = c("f","g","y","s","k"),"Name" = c("fibrous","grooves","scaly","smooth","silky"))
task_data[1] <- Classification$Name[match(unlist(task_data[1]),Classification$Abbr)]
task_data[c(2,3,4)] <- Surface$Name[match(unlist(task_data[c(2,3,4)]),Surface$Abbr)]
task_data[c(2,3,4)] <- lapply(task_data[c(2,3,4)], factor)
head(task_data)
##   classification cap-surface stalk-surface-above-ring
## 1         edible      smooth                   smooth
## 2         edible      smooth                   smooth
## 3      poisonous       scaly                   smooth
## 4         edible      smooth                   smooth
## 5         edible       scaly                   smooth
## 6         edible      smooth                   smooth
##   stalk-surface-below-ring
## 1                   smooth
## 2                   smooth
## 3                   smooth
## 4                   smooth
## 5                   smooth
## 6                   smooth
dim(task_data)
## [1] 8123 4
str(task_data)
## 'data.frame': 8123 obs. of 4 variables:
## $ classification : Factor w/ 2 levels "edible","poisonous": 1 1 2 1 1 1 1 2 1 1 ...
## $ cap-surface : Factor w/ 4 levels "fibrous","grooves",..: 4 4 3 4 3 4 3 3 4 3 ...
## $ stalk-surface-above-ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ stalk-surface-below-ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
summary(task_data)
##    classification   cap-surface   stalk-surface-above-ring
##  edible   :4208    fibrous:2320   fibrous: 552
##  poisonous:3915    grooves:   4   scaly  :  24
##                    scaly  :3244   silky  :2372
##                    smooth :2555   smooth :5175
##  stalk-surface-below-ring
##  fibrous: 600
##  scaly  : 284
##  silky  :2304
##  smooth :4935
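The recoding step can also be written with `factor()`, which maps codes to labels in one call and makes the full set of levels explicit. This is a minimal sketch on a small made-up data frame, not the code used above; `toy`, `class_labels`, and `surface_labels` are illustrative names, and the code-to-label pairs are copied from the `Classification` and `Surface` lookup tables.

```r
# Made-up sample with the same single-letter codes as the mushroom data.
toy <- data.frame(
  classification = c("e", "p", "e", "p"),
  cap_surface    = c("s", "y", "f", "g"),
  stalk_surface  = c("s", "k", "f", "y"),
  stringsAsFactors = FALSE
)

# Code-to-label pairs copied from the Classification and Surface tables.
class_labels   <- c(e = "edible", p = "poisonous")
surface_labels <- c(f = "fibrous", g = "grooves", y = "scaly",
                    s = "smooth", k = "silky")

# factor() translates each code to its label; a code that is missing from
# `levels` becomes NA, which makes unexpected values easy to spot.
toy$classification <- factor(toy$classification,
                             levels = names(class_labels),
                             labels = unname(class_labels))

surface_cols <- c("cap_surface", "stalk_surface")
toy[surface_cols] <- lapply(toy[surface_cols], factor,
                            levels = names(surface_labels),
                            labels = unname(surface_labels))

toy
```

One difference from the approach above: every surface column ends up with the same five levels, so a level with no observations in a given column appears with a zero count in `summary()`.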
The pivot table suggests that surface texture alone is not an effective way to tell whether a mushroom is edible or poisonous, because most surface values do not clearly favor either class. However, the data hint that a mushroom with a silky surface is very likely to be poisonous.
library(rpivotTable)
library(reshape2)
Unpivot_data <- melt(task_data, id.vars = "classification", variable.name = "surface_type", value.name = "surface_value")
## Warning: attributes are not identical across measure variables; they will
## be dropped
rpivotTable(Unpivot_data, rows=c("surface_type","surface_value"), cols="classification", rendererName = "Table Barchart", width = "10px", height="300px")
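For a static view of the same comparison, the shares behind the pivot table can be computed with base R. The sketch below is only an illustration under stated assumptions: it uses a tiny made-up data frame (`toy`), so its numbers say nothing about the real data, and it builds the long table by hand instead of calling `melt()`. Converting the factors to character before stacking also sidesteps the mismatched factor levels that trigger the `melt()` warning above.

```r
# Made-up sample; column names mirror the renamed columns in task_data.
toy <- data.frame(
  classification = c("edible", "poisonous", "poisonous",
                     "edible", "poisonous", "edible"),
  `cap-surface` = c("smooth", "scaly", "fibrous",
                    "smooth", "scaly", "fibrous"),
  `stalk-surface-above-ring` = c("smooth", "silky", "silky",
                                 "smooth", "silky", "fibrous"),
  check.names = FALSE,
  stringsAsFactors = TRUE
)

surface_cols <- setdiff(names(toy), "classification")

# Stack the surface columns into long form: one row per (mushroom, column).
long <- data.frame(
  classification = rep(as.character(toy$classification),
                       times = length(surface_cols)),
  surface_type   = rep(surface_cols, each = nrow(toy)),
  surface_value  = unlist(lapply(toy[surface_cols], as.character),
                          use.names = FALSE),
  stringsAsFactors = FALSE
)

# Counts per surface value and class, then the share of each class per row.
counts <- table(long$surface_value, long$classification)
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

Replacing `toy` with `task_data` would give the share of poisonous mushrooms for each surface value in the real data, which is one way to check the observation about silky surfaces.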