Justin Herman

Use the data.table package to load the mushroom dataset

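library(data.table)  # fread()
library(dplyr)       # mutate(), case_when()
library(sjmisc)      # flat_table() (assumed to be the source of flat_table used later)

# Note: the raw file has no header row. The 8123 rows summarised later suggest
# the first record was read as column names here; header = FALSE would keep it.
mydat <- fread("http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data")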
#head(mydat)
#dim(mydat)

Add appropriate names to the data columns

names(mydat) <- c("classes","cap_shape","cap_surface","cap_color","bruises","odor","gill_attachment","gill_spacing","gill_size","gill_color","stalk_shape","stalk_root","Surface_above","surface_below","color_above","color_below","veil_type","veil_color","ring_number","ring_type","spore_color","population","habitat")

Create a subset of the data.

  • The new data frame “my_shrooms” contains the columns classes, cap_shape, cap_color, habitat, and population
  • mutate() with case_when() is applied to each column to replace the single-letter codes with descriptive labels
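# keep only the five columns used in this analysis
my_shrooms <- subset(mydat,select=c(classes,cap_shape,cap_color,habitat,population))
# copy of the subset with the original single-letter codes, used for the experimentation at the end
exp_df <- my_shrooms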
my_shrooms <- mutate(my_shrooms,classes=case_when(classes=="e"~"edible",classes=="p"~"poisonous"))
my_shrooms <- mutate(my_shrooms,cap_shape=case_when(cap_shape=="b"~"bell",cap_shape=="c"~"conical",cap_shape=="x"~"convex",cap_shape=="f"~"flat",cap_shape=="k"~"knobbed",cap_shape=="s"~"sunken"))
my_shrooms <- mutate(my_shrooms,cap_color=case_when(cap_color=="n"~"brown",cap_color=="b"~"buff",cap_color=="c"~"cinnamon",cap_color=="g"~"gray",cap_color=="r"~"green",cap_color=="p"~"pink",cap_color=="u"~"purple",cap_color=="e"~"red",cap_color=="w"~"white",cap_color=="y"~"yellow"))
my_shrooms <- mutate(my_shrooms,habitat=case_when(habitat=="g"~"grasses",habitat=="l"~"leaves",habitat=="m"~"meadows",habitat=="u"~"urban",habitat=="w"~"waste",habitat=="d"~"woods",habitat=="p"~"paths"))
my_shrooms <- mutate(my_shrooms,population=case_when(population=="a"~"abundant",population=="c"~"clustered",population=="n"~"numerous",population=="s"~"scattered",population=="v"~"several",population=="y"~"solitary"))


head(my_shrooms,50)
##      classes cap_shape cap_color habitat population
## 1     edible    convex    yellow grasses   numerous
## 2     edible      bell     white meadows   numerous
## 3  poisonous    convex     white   urban  scattered
## 4     edible    convex      gray grasses   abundant
## 5     edible    convex    yellow grasses   numerous
## 6     edible      bell     white meadows   numerous
## 7     edible      bell     white meadows  scattered
## 8  poisonous    convex     white grasses    several
## 9     edible      bell    yellow meadows  scattered
## 10    edible    convex    yellow grasses   numerous
## 11    edible    convex    yellow meadows  scattered
## 12    edible      bell    yellow grasses  scattered
## 13 poisonous    convex     white   urban    several
## 14    edible    convex     brown grasses   abundant
## 15    edible    sunken      gray   urban   solitary
## 16    edible      flat     white grasses   abundant
## 17 poisonous    convex     brown grasses  scattered
## 18 poisonous    convex     white   urban  scattered
## 19 poisonous    convex     brown   urban  scattered
## 20    edible      bell    yellow meadows  scattered
## 21 poisonous    convex     brown grasses    several
## 22    edible      bell    yellow meadows  scattered
## 23    edible      bell     white meadows   numerous
## 24    edible      bell     white meadows  scattered
## 25 poisonous      flat     white grasses    several
## 26    edible    convex    yellow meadows   numerous
## 27    edible    convex     white meadows   numerous
## 28    edible      flat     brown   urban   solitary
## 29    edible    convex    yellow   woods    several
## 30    edible      bell    yellow meadows   numerous
## 31 poisonous    convex     white   urban  scattered
## 32    edible    convex    yellow meadows   numerous
## 33    edible    convex     brown   paths   solitary
## 34    edible      bell    yellow meadows  scattered
## 35    edible    convex    yellow   woods    several
## 36    edible    sunken      gray   urban    several
## 37 poisonous    convex     brown   urban  scattered
## 38    edible    convex    yellow   woods    several
## 39    edible      bell    yellow meadows  scattered
## 40    edible      bell    yellow grasses  scattered
## 41    edible    convex    yellow   paths   solitary
## 42    edible    convex     brown   urban   solitary
## 43 poisonous    convex     white grasses    several
## 44    edible    convex    yellow meadows   numerous
## 45    edible    convex     white grasses   numerous
## 46    edible    convex    yellow meadows  scattered
## 47    edible    convex     white meadows   numerous
## 48    edible    convex    yellow   paths  scattered
## 49    edible      flat    yellow   paths  scattered
## 50    edible    convex     brown grasses  scattered

A for loop converts the categorical string columns to factors

#summary(my_shrooms)
#head(subset(my_shrooms, select = 'cap_shape'))
#names(my_shrooms)
#count(my_shrooms, "cap_shape")
#names(my_shrooms)
#str(my_shrooms)
#my_shrooms<- as.factor(my_shrooms)
#str((my_shrooms$classes))

# [[x]] extracts the column as a plain vector, so unlist() is not needed
for (x in names(my_shrooms)) {
    my_shrooms[[x]] <- factor(my_shrooms[[x]])
}

summary(my_shrooms)
##       classes       cap_shape      cap_color       habitat    
##  edible   :4208   bell   : 452   brown  :2283   grasses:2148  
##  poisonous:3915   conical:   4   gray   :1840   leaves : 832  
##                   convex :3655   red    :1500   meadows: 292  
##                   flat   :3152   yellow :1072   paths  :1144  
##                   knobbed: 828   white  :1040   urban  : 367  
##                   sunken :  32   buff   : 168   waste  : 192  
##                                  (Other): 220   woods  :3148  
##      population  
##  abundant : 384  
##  clustered: 340  
##  numerous : 400  
##  scattered:1247  
##  several  :4040  
##  solitary :1712  
## 
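
The factor conversion above can also be written without an explicit loop. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical stand-ins, not the mushroom data), assuming a plain data frame rather than a data.table:

```r
# Toy stand-in for my_shrooms (hypothetical values, not the real data)
toy <- data.frame(
  classes   = c("edible", "poisonous", "edible"),
  cap_shape = c("bell", "convex", "convex"),
  stringsAsFactors = FALSE
)

# Convert every column to a factor; assigning into toy[] keeps the
# data frame structure and column names intact
toy[] <- lapply(toy, factor)

str(toy)
```

Assigning into `toy[]` rather than `toy` is what keeps the result a data frame instead of a bare list.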

Let’s see how each column breaks down between edible and poisonous mushrooms

for (x in names(my_shrooms)) {
    print(flat_table(my_shrooms, classes, x))
}
## x edible poisonous
##                   
##     4208      3915
##           cap_shape bell conical convex flat knobbed sunken
## classes                                                    
## edible               404       0   1948 1596     228     32
## poisonous             48       4   1707 1556     600      0
##           cap_color brown buff cinnamon gray green pink purple  red white yellow
## classes                                                                         
## edible               1264   48       32 1032    16   56     16  624   720    400
## poisonous            1019  120       12  808     0   88      0  876   320    672
##           habitat grasses leaves meadows paths urban waste woods
## classes                                                         
## edible               1408    240     256   136    96   192  1880
## poisonous             740    592      36  1008   271     0  1268
##           population abundant clustered numerous scattered several solitary
## classes                                                                    
## edible                    384       288      400       880    1192     1064
## poisonous                   0        52        0       367    2848      648
# store each table in its own variable named flat_table<column>, e.g. flat_tablecap_shape
for(i in seq_along(names(my_shrooms))){
    nam <- paste("flat_table",names(my_shrooms)[i],sep="") 
    assign(nam,flat_table(my_shrooms, classes,names(my_shrooms)[i]))
}
# alternative solution to avoid assign()
# create a named list of flat_tables, one per column
N <- ncol(my_shrooms)
x <- vector("list", N)
for (i in seq_len(N)) {
    x[[i]] <- flat_table(my_shrooms, classes, names(my_shrooms)[i])
}
names(x) <- paste0("my_flat_table_", seq_len(N))
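
The list-of-tables idea above can be sketched with base R alone. This is only an illustration: it swaps in base `table()` for `flat_table()`, uses a small made-up data frame (`toy` is hypothetical), and names each list element after its column so it can be looked up directly:

```r
# Toy stand-in for my_shrooms (hypothetical values, not the real data)
toy <- data.frame(
  classes   = c("edible", "poisonous", "edible", "poisonous"),
  cap_shape = c("bell", "convex", "convex", "convex"),
  habitat   = c("woods", "paths", "woods", "woods")
)

# Every column except the class label is treated as a predictor
predictors <- setdiff(names(toy), "classes")

# One classes-by-predictor table per column, named after the column
tabs <- lapply(predictors, function(col) table(classes = toy$classes, toy[[col]]))
names(tabs) <- predictors

tabs$habitat
```

Naming the elements after the columns avoids having to remember which position holds which table.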

Graphically displayed

counts <- table(my_shrooms$classes, my_shrooms$cap_shape)
barplot(counts, main="Edible VS Poisonous by shape",
  xlab="Mushroom shapes", col=c("darkblue","red"),
    legend = rownames(counts), beside=TRUE)

counts <-table(my_shrooms$classes, my_shrooms$cap_color)
barplot(counts, main="Edible VS Poisonous by color",
  xlab="Mushroom colors", col=c("darkblue","red"),
    legend = rownames(counts), beside=TRUE)

counts <-table(my_shrooms$classes, my_shrooms$population)
barplot(counts, main="Edible VS Poisonous by population",
  xlab="Mushroom population", col=c("darkblue","red"),
    legend = rownames(counts), beside=TRUE)

counts <-table(my_shrooms$classes, my_shrooms$habitat)
barplot(counts, main="Edible VS Poisonous by habitat",
  xlab="Mushroom habitat", col=c("darkblue","red"),
    legend = rownames(counts), beside=TRUE)

Observations from Data

  • Color doesn’t appear to be all that valuable in predicting whether a mushroom is poisonous
    • The scale of these bar graphs can be deceiving; further statistical analysis is needed
  • The shapes “knobbed” (228 edible vs. 600 poisonous) and “bell” (404 vs. 48) look like they may be good predictors
  • The populations “several” (1192 edible vs. 2848 poisonous) and “scattered” (880 vs. 367) appear to be somewhat useful predictors
  • Habitat appears to be valuable as well, with several categories (paths, leaves, grasses, woods) showing a large difference between the two classes
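
One possible first step toward the further statistical analysis mentioned above is sketched here. It re-enters the classes-by-habitat counts printed earlier, converts them to within-habitat shares so unequal category sizes no longer distort the comparison, and runs a chi-squared test of independence. The object names `habitat_counts` and `habitat_test` are introduced only for this illustration:

```r
# Counts copied from the classes-by-habitat table printed earlier
habitat_counts <- matrix(
  c(1408, 240, 256,  136,  96, 192, 1880,   # edible
     740, 592,  36, 1008, 271,   0, 1268),  # poisonous
  nrow = 2, byrow = TRUE,
  dimnames = list(
    classes = c("edible", "poisonous"),
    habitat = c("grasses", "leaves", "meadows", "paths",
                "urban", "waste", "woods")
  )
)

# Share of each class within a habitat (each column sums to 1)
round(prop.table(habitat_counts, margin = 2), 2)

# Pearson chi-squared test of independence between class and habitat
habitat_test <- chisq.test(habitat_counts)
habitat_test
```

With more than 8000 rows, even small differences tend to come out as statistically significant, so the within-habitat shares are probably the more informative part of this output.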

Everything that follows from here was code experimentation.

The homework implied we would discuss interesting ideas, so I tried to develop a workaround for the monotonous, drawn-out mutate code I used to recode the entries in the data frame. I attempted to automate the recoding with less code. I found an idea on Stack Overflow: recode the entries using a table with two columns (the df column values and their replacement strings). The ifelse statement was taken from a Stack Overflow answer linked below.

To me, writing out a table like this is much more efficient than copy-pasting code and manually inserting different variables over and over. Unfortunately, this idea failed before it could begin: several columns share the same single-letter codes (c = conical, c = cinnamon, c = clustered), so a single lookup table for the whole data frame is ambiguous.

  • This could be useful for data with longer strings that don’t repeat from column to column, but I couldn’t figure out a workaround here.

I thought about writing a double for loop: the outer loop would go through a list of tables, each table pairing one column of our data set with its replacement values, and the inner loop would go through the df column names.

  • The code below will not run; it’s hypothetical
#for datatable in list(datatables) for  x in names(exp_df){ifelse(grepl(paste(df$x, collapse = '&'), list of tables),
##table that returns true from list[match......]

Having to make separate data tables for each column is in itself already a lot of code. It seems like I am probably attempting to reinvent the wheel, and I would also need a different function than grepl to accomplish what I want. Still, the original code that built my exp_df_2 has some use, although it can be dangerous if applied without fully understanding the dataset.
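
One way the per-column lookup idea described above might work is sketched below, using a named character vector per column instead of grepl. Because each column has its own vector, the code “c” can map to “conical” in cap_shape and “clustered” in population. The names `lookups` and `toy` are hypothetical, the rows are made up, and only three of the columns are shown:

```r
# One named vector per column: names are the codes, values are the labels
lookups <- list(
  classes    = c(e = "edible", p = "poisonous"),
  cap_shape  = c(b = "bell", c = "conical", x = "convex",
                 f = "flat", k = "knobbed", s = "sunken"),
  population = c(a = "abundant", c = "clustered", n = "numerous",
                 s = "scattered", v = "several", y = "solitary")
)

# Toy stand-in for exp_df (single-letter codes, made-up rows)
toy <- data.frame(
  classes    = c("e", "p", "e"),
  cap_shape  = c("c", "x", "b"),
  population = c("c", "s", "y"),
  stringsAsFactors = FALSE
)

# Recode column by column; indexing a named vector by the codes returns
# the labels, and any code missing from the lookup becomes NA
for (col in names(lookups)) {
  toy[[col]] <- unname(lookups[[col]][toy[[col]]])
}

toy
```

The lookup vectors still have to be typed out once, but the recoding itself is a single loop, and a code that is missing from a lookup shows up as NA rather than being silently mislabelled.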