DATA 607 - Assignment 1

Read the data file from my GitHub.

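## stringsAsFactors=TRUE keeps every column a factor on R 4.0 and later, where the default changed to FALSE.
df <- read.csv("https://raw.githubusercontent.com/SieSiongWong/DATA-607/master/agaricus-lepiota.data", header=TRUE, sep=",", stringsAsFactors=TRUE)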

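First, let's look at the structure of the Mushroom Data Set. The data frame has 8,123 observations and 23 variables, and every column is a factor. The raw file has no header row, so with header=TRUE the first of its 8,124 records is used as the column names (p, x, s, ...) instead of being counted as an observation.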

str(df)

## 'data.frame':    8123 obs. of  23 variables:
##  $ p  : Factor w/ 2 levels "e","p": 1 1 2 1 1 1 1 2 1 1 ...
##  $ x  : Factor w/ 6 levels "b","c","f","k",..: 6 1 6 6 6 1 1 6 1 6 ...
##  $ s  : Factor w/ 4 levels "f","g","s","y": 3 3 4 3 4 3 4 4 3 4 ...
##  $ n  : Factor w/ 10 levels "b","c","e","g",..: 10 9 9 4 10 9 9 9 10 10 ...
##  $ t  : Factor w/ 2 levels "f","t": 2 2 2 1 2 2 2 2 2 2 ...
##  $ p.1: Factor w/ 9 levels "a","c","f","l",..: 1 4 7 6 1 1 4 7 1 4 ...
##  $ f  : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
##  $ c  : Factor w/ 2 levels "c","w": 1 1 1 2 1 1 1 1 1 1 ...
##  $ n.1: Factor w/ 2 levels "b","n": 1 1 2 1 1 1 1 2 1 1 ...
##  $ k  : Factor w/ 12 levels "b","e","g","h",..: 5 6 6 5 6 3 6 8 3 3 ...
##  $ e  : Factor w/ 2 levels "e","t": 1 1 1 2 1 1 1 1 1 1 ...
##  $ e.1: Factor w/ 5 levels "?","b","c","e",..: 3 3 4 4 3 3 3 4 3 3 ...
##  $ s.1: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ s.2: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ w  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ w.1: Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ p.2: Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
##  $ w.2: Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ o  : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ p.3: Factor w/ 5 levels "e","f","l","n",..: 5 5 5 1 5 5 5 5 5 5 ...
##  $ k.1: Factor w/ 9 levels "b","h","k","n",..: 4 4 3 4 3 3 4 3 3 4 ...
##  $ s.3: Factor w/ 6 levels "a","c","n","s",..: 3 3 4 1 3 3 4 5 4 3 ...
##  $ u  : Factor w/ 7 levels "d","g","l","m",..: 2 4 6 2 2 4 4 2 4 2 ...
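
The column names in the output above (p, x, s, ...) look like data values rather than headers. The following is a minimal sketch, using a tiny inline sample instead of the real file, of how read.csv treats a file that has no header row. The three sample records and the variable names are hypothetical stand-ins.

```r
# Hypothetical three-record sample in the same comma-separated format, with no header row.
sample_text <- "p,x,s\ne,x,s\ne,b,s"

# header = TRUE consumes the first record as column names.
with_header <- read.csv(text = sample_text, header = TRUE, stringsAsFactors = TRUE)

# header = FALSE keeps every record and assigns the default names V1, V2, V3.
without_header <- read.csv(text = sample_text, header = FALSE, stringsAsFactors = TRUE)

nrow(with_header)      # 2
names(with_header)     # "p" "x" "s"
nrow(without_header)   # 3
names(without_header)  # "V1" "V2" "V3"
```

Comparing the row count with the number of records expected from the data source is a cheap way to notice a consumed first record.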

Rename all columns so they are readable and meaningful.

colnames(df) <- c("Edible", "CapShape", "CapSurface", "CapColor", "Bruises", "Odor", "GillAttachment", "GillSpacing", "GillSize", "GillColor", "StalkShape", "StalkRoot", "StalkSurfaceAboveRing", "StalkSurfaceBelowRing", "StalkColorAboveRing", "StalkColorBelowRing", "VeilType", "VeilColor", "RingNumber", "RingType", "SporePrintColor", "Population", "Habitat")
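
A quick check after renaming can catch stray whitespace in a name or a wrong number of names. This is a minimal sketch on a hypothetical two-column data frame; the toy data and the variable `new_names` are illustrative and not part of the assignment.

```r
# Hypothetical toy data frame standing in for the mushroom data.
toy <- data.frame(p = c("e", "p"), x = c("x", "b"))

new_names <- c("Edible", "CapShape")

# The replacement vector must supply exactly one name per column.
stopifnot(length(new_names) == ncol(toy))

colnames(toy) <- new_names

# A name with leading or trailing spaces would differ from its trimmed form.
stopifnot(identical(names(toy), trimws(names(toy))))

names(toy)
```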

Replace the abbreviations used in the columns with meaningful, descriptive words.

## Replace abbreviations in the Edible column: e = yes (edible), p = no (not edible, poisonous).

levels(df$Edible)[levels(df$Edible) %in%  c("e","p")] <- c("yes","no")
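
The line above assigns the new labels by position. That works for Edible because levels() returns the codes in sorted order, and "e", "p" already line up with "yes", "no". When the replacement labels are listed in a different order from the sorted codes, positional assignment attaches labels to the wrong codes. The sketch below, on a hypothetical toy factor, shows the difference and one way to pair each code with its label using match(). The variable names are illustrative.

```r
# Hypothetical toy factor using three of the odor codes.
odor <- factor(c("l", "c", "a", "c"))
levels(odor)  # "a" "c" "l" (sorted codes)

codes      <- c("a", "l", "c")
new_labels <- c("almond", "anise", "creosote")

# Positional assignment: labels land on the sorted levels in order,
# so "c" becomes "anise" and "l" becomes "creosote".
by_position <- odor
levels(by_position)[levels(by_position) %in% codes] <- new_labels
as.character(by_position)  # "creosote" "anise" "almond" "anise"

# Matched assignment: find each level's position in `codes` first.
by_match <- odor
levels(by_match) <- new_labels[match(levels(by_match), codes)]
as.character(by_match)  # "anise" "creosote" "almond" "creosote"
```

With match(), the order in which the code and label pairs are written no longer matters, as long as the two vectors stay parallel.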

## levels() returns the codes in alphabetical order, so each code below is paired with its label through match() rather than by position.

## Replace abbreviations in the Odor column: a = almond, l = anise, c = creosote, y = fishy, f = foul, m = musty, n = none, p = pungent, s = spicy.

levels(df$Odor) <- c("almond","anise","creosote","fishy","foul","musty","none","pungent","spicy")[match(levels(df$Odor), c("a","l","c","y","f","m","n","p","s"))]

## Replace abbreviations in the SporePrintColor column: k = black, n = brown, b = buff, h = chocolate, r = green, o = orange, u = purple, w = white, y = yellow.

levels(df$SporePrintColor) <- c("black","brown","buff","chocolate","green","orange","purple","white","yellow")[match(levels(df$SporePrintColor), c("k","n","b","h","r","o","u","w","y"))]

## Replace abbreviations in the StalkSurfaceBelowRing column: f = fibrous, y = scaly, k = silky, s = smooth.

levels(df$StalkSurfaceBelowRing) <- c("fibrous","scaly","silky","smooth")[match(levels(df$StalkSurfaceBelowRing), c("f","y","k","s"))]

## Replace abbreviations in the StalkColorAboveRing column: n = brown, b = buff, c = cinnamon, g = gray, o = orange, p = pink, e = red, w = white, y = yellow.

levels(df$StalkColorAboveRing) <- c("brown","buff","cinnamon","gray","orange","pink","red","white","yellow")[match(levels(df$StalkColorAboveRing), c("n","b","c","g","o","p","e","w","y"))]

## Replace abbreviations in the CapColor column: n = brown, b = buff, c = cinnamon, g = gray, r = green, p = pink, u = purple, e = red, w = white, y = yellow.

levels(df$CapColor) <- c("brown","buff","cinnamon","gray","green","pink","purple","red","white","yellow")[match(levels(df$CapColor), c("n","b","c","g","r","p","u","e","w","y"))]

## Replace abbreviations in the Habitat column: g = grasses, l = leaves, m = meadows, p = paths, u = urban, w = waste, d = woods.

levels(df$Habitat) <- c("grasses","leaves","meadows","paths","urban","waste","woods")[match(levels(df$Habitat), c("g","l","m","p","u","w","d"))]

Subset the columns whose abbreviations were replaced.

library(dplyr)

df2 <- subset(df, select=c("Edible", "Odor","SporePrintColor","StalkSurfaceBelowRing", "StalkColorAboveRing","CapColor","Habitat"))

## Reorder the columns alphabetically by name.

df2 <- df2 %>% select(sort(names(.)))

## Sort the rows by the Edible column to group the edible and the poisonous mushrooms.

df2 <- df2[order(df2$Edible),]

## Display the first 15 rows, poisonous first, to check the transformations above.

head(arrange(df2, desc(Edible)),n=15)

##    CapColor Edible Habitat    Odor SporePrintColor StalkColorAboveRing
## 1     white     no   urban pungent           black               white
## 2     white     no grasses pungent           black               white
## 3     white     no   urban pungent           brown               white
## 4     brown     no grasses pungent           black               white
## 5     white     no   urban pungent           brown               white
## 6     brown     no   urban pungent           brown               white
## 7     brown     no grasses pungent           brown               white
## 8     white     no grasses pungent           brown               white
## 9     white     no   urban pungent           brown               white
## 10    brown     no   urban pungent           brown               white
## 11    white     no grasses pungent           brown               white
## 12    brown     no   urban pungent           brown               white
## 13    white     no   urban pungent           black               white
## 14    brown     no   urban pungent           brown               white
## 15    white     no grasses pungent           black               white
##    StalkSurfaceBelowRing
## 1                 smooth
## 2                 smooth
## 3                 smooth
## 4                 smooth
## 5                 smooth
## 6                 smooth
## 7                 smooth
## 8                 smooth
## 9                 smooth
## 10                smooth
## 11                smooth
## 12                smooth
## 13                smooth
## 14                smooth
## 15                smooth

## Note: I chose to transform the CapColor, Odor, SporePrintColor, StalkSurfaceBelowRing, StalkColorAboveRing, and Habitat columns because one of the researchers, Wlodzislaw Duch of Nicholas Copernicus University in Poland, came up with four disjunctive rules for poisonous mushrooms that use these attributes:

 ## 1.) odor=NOT(almond.OR.anise.OR.none): 120 poisonous cases missed, 98.52% accuracy
 ## 2.) spore-print-color=green: 48 cases missed, 99.41% accuracy
 ## 3.) odor=none.AND.stalk-surface-below-ring=scaly.AND.(stalk-color-above-ring=NOT.brown): 8 cases missed, 99.90% accuracy
 ## 4.) habitat=leaves.AND.cap-color=white : 100% accuracy
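
As a rough sketch of how the four rules could be checked against the recoded columns, the function below flags a row as poisonous when any one rule fires. It assumes the column names and descriptive labels used in this document. The function name `is_poisonous_by_rules` and the three toy rows are hypothetical.

```r
# Hypothetical helper: TRUE when at least one of the four rules matches a row.
is_poisonous_by_rules <- function(d) {
  rule1 <- !(d$Odor %in% c("almond", "anise", "none"))
  rule2 <- d$SporePrintColor == "green"
  rule3 <- d$Odor == "none" &
    d$StalkSurfaceBelowRing == "scaly" &
    d$StalkColorAboveRing != "brown"
  rule4 <- d$Habitat == "leaves" & d$CapColor == "white"
  rule1 | rule2 | rule3 | rule4
}

# Hypothetical toy rows, not taken from the data set.
toy <- data.frame(
  Odor = c("almond", "foul", "none"),
  SporePrintColor = c("brown", "white", "green"),
  StalkSurfaceBelowRing = c("smooth", "silky", "smooth"),
  StalkColorAboveRing = c("white", "pink", "white"),
  Habitat = c("grasses", "woods", "leaves"),
  CapColor = c("yellow", "gray", "brown"),
  stringsAsFactors = TRUE
)

is_poisonous_by_rules(toy)  # FALSE TRUE TRUE
```

Because the rules are joined with OR, the second toy row is flagged by rule 1 (foul odor) and the third by rule 2 (green spore print), while the first row matches none of them.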

Sie Siong Wong

8/29/2019