DATA 607 Assignment 1

Assignment 1 - Loading Data into a Data Frame

Data set: Mushroom Data Set

Origin: Mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf.

Donor: Jeff Schlimmer (Jeffrey.Schlimmer ‘@’ a.gp.cs.cmu.edu)

Information: This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. The latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like “leaflets three, let it be” for Poisonous Oak and Ivy.

Task: Your task is to study the dataset and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there! You should take the data, and create a data frame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous and three or four other columns. You should also add meaningful column names and replace the abbreviations used in the data—for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.

1. Read Mushroom Data Set

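Read data from GitHub. The file has no header row, so read.csv() with its default header = TRUE uses the first record as the column names. That is why the columns below are named p, x, s, and so on, and why dim() reports 8123 rows, one fewer than the 8124 records in the file.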

Mushroom_data <- read.csv("https://raw.githubusercontent.com/oggyluky11/DATA607-Assignment-1/master/agaricus-lepiota.data")
head(Mushroom_data)

##   p x s n t p.1 f c n.1 k e e.1 s.1 s.2 w w.1 p.2 w.2 o p.3 k.1 s.3 u
## 1 e x s y t   a f c   b k e   c   s   s w   w   p   w o   p   n   n g
## 2 e b s w t   l f c   b n e   c   s   s w   w   p   w o   p   n   n m
## 3 p x y w t   p f c   n n e   e   s   s w   w   p   w o   p   k   s u
## 4 e x s g f   n f w   b k t   e   s   s w   w   p   w o   e   n   a g
## 5 e x y y t   a f c   b n e   c   s   s w   w   p   w o   p   k   n g
## 6 e b s w t   a f c   b g e   c   s   s w   w   p   w o   p   k   n m

dim(Mushroom_data)

## [1] 8123   23
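The effect of the missing header row can be reproduced on a small sample. The sketch below inlines the first three records of the file (copied from the output above) so it runs without network access; the names `sample_text`, `with_header`, and `no_header` are illustrative only. Passing `header = FALSE` is one way to keep the first record as data, at the cost of generic `V1`, `V2`, ... column names that would then need renaming.

```r
# First three records of the file, inlined so the sketch runs offline.
sample_text <- "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m"

# Default behaviour: the first record is consumed as column names.
with_header <- read.csv(text = sample_text)

# header = FALSE keeps every record as a data row.
no_header <- read.csv(text = sample_text, header = FALSE)

nrow(with_header)  # 2
nrow(no_header)    # 3
names(no_header)[1:4]
```

If the full file were read this way, the column selection in the next step would have to refer to the `V`-style names (or to positions) instead of `p`, `s`, `s.1`, and `s.2`.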

2. Create a data frame with a subset of the columns in the dataset.

The subset includes the column that indicates edible or poisonous and three other columns whose attributes describe surface texture: the cap surface and the stalk surface above and below the ring.

task_data <- data.frame(Mushroom_data["p"],Mushroom_data["s"],Mushroom_data["s.1"],Mushroom_data["s.2"])
head(task_data)

##   p s s.1 s.2
## 1 e s   s   s
## 2 e s   s   s
## 3 p y   s   s
## 4 e s   s   s
## 5 e y   s   s
## 6 e s   s   s

dim(task_data)

## [1] 8123    4
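An equivalent and somewhat shorter way to take the same subset is single-bracket indexing with a character vector of column names. The sketch below uses a hypothetical three-row miniature (`toy`) with the same auto-generated names rather than the real data.

```r
# Hypothetical miniature of Mushroom_data, using the same auto-generated names.
toy <- data.frame(
  p   = c("e", "e", "p"),
  x   = c("x", "b", "x"),
  s   = c("s", "s", "y"),
  s.1 = c("s", "s", "s"),
  s.2 = c("s", "s", "s")
)

# Single-bracket indexing with a character vector returns a data frame
# holding just those columns, in the requested order.
keep <- c("p", "s", "s.1", "s.2")
toy_subset <- toy[keep]

dim(toy_subset)
```

Keeping the wanted names in one vector makes it easy to add or drop a column later without touching the subsetting call.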

3. Rename the columns and replace abbreviations used in the data.

# Give the columns descriptive names taken from the data dictionary.
names(task_data) <- c("classification", "cap-surface", "stalk-surface-above-ring", "stalk-surface-below-ring")
# Lookup tables that map each abbreviation to its full label.
Classification <- data.frame("Abbr" = c("e","p"), "Name" = c("edible","poisonous"))
Surface <- data.frame("Abbr" = c("f","g","y","s","k"), "Name" = c("fibrous","grooves","scaly","smooth","silky"))
# Replace each abbreviation with the label at its matching position in the lookup table.
task_data[1] <- Classification$Name[match(unlist(task_data[1]), Classification$Abbr)]
task_data[c(2,3,4)] <- Surface$Name[match(unlist(task_data[c(2,3,4)]), Surface$Abbr)]
# Store the three surface columns as factors.
task_data[c(2,3,4)] <- lapply(task_data[c(2,3,4)], factor)
head(task_data)

##   classification cap-surface stalk-surface-above-ring
## 1         edible      smooth                   smooth
## 2         edible      smooth                   smooth
## 3      poisonous       scaly                   smooth
## 4         edible      smooth                   smooth
## 5         edible       scaly                   smooth
## 6         edible      smooth                   smooth
##   stalk-surface-below-ring
## 1                   smooth
## 2                   smooth
## 3                   smooth
## 4                   smooth
## 5                   smooth
## 6                   smooth

dim(task_data)

## [1] 8123    4
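The same recoding can also be written with named character vectors used as lookup tables, which avoids the intermediate data frames and `match()`. This is only a sketch on a hypothetical toy frame; the names `class_labels`, `surface_labels`, and the underscore-style column names are assumptions made for the example (underscores are used here because hyphenated names must be wrapped in backticks when accessed with `$`).

```r
# Hypothetical lookup vectors: names are the abbreviations, values the labels.
class_labels   <- c(e = "edible", p = "poisonous")
surface_labels <- c(f = "fibrous", g = "grooves", y = "scaly",
                    s = "smooth", k = "silky")

# Toy data in the abbreviated form used by the raw file.
toy <- data.frame(
  classification           = c("e", "p", "e"),
  cap_surface              = c("s", "y", "f"),
  stalk_surface_above_ring = c("s", "k", "s")
)

# Index each lookup vector by the abbreviation. as.character() guards against
# columns that were read in as factors (the default before R 4.0).
toy$classification <- factor(unname(class_labels[as.character(toy$classification)]))

surface_cols <- c("cap_surface", "stalk_surface_above_ring")
toy[surface_cols] <- lapply(toy[surface_cols], function(col) {
  factor(unname(surface_labels[as.character(col)]))
})

str(toy)
```

An abbreviation missing from the lookup vector would come back as `NA`, which makes unexpected codes easy to spot with `anyNA()`.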

Final Deliverable Data

str(task_data)

## 'data.frame':    8123 obs. of  4 variables:
##  $ classification          : Factor w/ 2 levels "edible","poisonous": 1 1 2 1 1 1 1 2 1 1 ...
##  $ cap-surface             : Factor w/ 4 levels "fibrous","grooves",..: 4 4 3 4 3 4 3 3 4 3 ...
##  $ stalk-surface-above-ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk-surface-below-ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...

summary(task_data)

##    classification  cap-surface   stalk-surface-above-ring
##  edible   :4208   fibrous:2320   fibrous: 552            
##  poisonous:3915   grooves:   4   scaly  :  24            
##                   scaly  :3244   silky  :2372            
##                   smooth :2555   smooth :5175            
##  stalk-surface-below-ring
##  fibrous: 600            
##  scaly  : 284            
##  silky  :2304            
##  smooth :4935

Observations on the Data

The pivot table shows that surface texture alone is not a reliable way to tell whether a mushroom is edible or poisonous, because for most surface types the split between the two classes is not decisive. However, the data suggest that a mushroom with a silky stalk surface is very likely to be poisonous.

library(rpivotTable)
library(reshape2)
Unpivot_data <- melt(task_data, id.vars = "classification", variable.name = "surface_type", value.name = "surface_value")

## Warning: attributes are not identical across measure variables; they will
## be dropped
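The warning appears to arise because the three surface columns are factors with different level sets (the summary above shows "grooves" only for the cap and "silky" only for the stalk), so the factor attributes are dropped when the columns are stacked. One way to sidestep it is to convert the columns to character before reshaping. The sketch below does the stacking with base R only, on a hypothetical two-row frame, so it does not depend on reshape2; the names `toy`, `measure_cols`, and `long` are illustrative.

```r
# Hypothetical frame whose factor columns have different level sets.
toy <- data.frame(
  classification = factor(c("edible", "poisonous")),
  cap_surface    = factor(c("grooves", "scaly")),
  stalk_surface  = factor(c("smooth", "silky"))
)

measure_cols <- setdiff(names(toy), "classification")

# Stack the measure columns into long form, converting each to character first
# so that no factor attributes need to be reconciled.
long <- data.frame(
  classification = rep(toy$classification, times = length(measure_cols)),
  surface_type   = rep(measure_cols, each = nrow(toy)),
  surface_value  = unlist(lapply(toy[measure_cols], as.character),
                          use.names = FALSE)
)

long
```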

rpivotTable(Unpivot_data, rows=c("surface_type","surface_value"), cols="classification", rendererName = "Table Barchart", width = "10px", height="300px")
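A static counterpart to the interactive pivot table can be produced with base R's `table()` and `prop.table()`, which makes the share of each class per surface value explicit. The rows below are made-up illustrative values, not counts from the mushroom data, and the names `toy_long`, `counts`, and `shares` are assumptions for this sketch.

```r
# Made-up long-format rows for illustration only (not real mushroom counts).
toy_long <- data.frame(
  classification = c("edible", "poisonous", "poisonous",
                     "edible", "poisonous", "edible"),
  surface_value  = c("smooth", "silky", "silky",
                     "smooth", "smooth", "fibrous")
)

# Cross-tabulate surface value (rows) against classification (columns).
counts <- table(toy_long$surface_value, toy_long$classification)

# Row-wise proportions: the share of each class within a surface value.
shares <- prop.table(counts, margin = 1)

round(shares, 2)
```

Applied to `Unpivot_data`, the same two calls would show how strongly each surface value leans toward one class, which is the comparison the observation above relies on.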