Mushrooms Dataset Cleanup

Understanding and importing the data

The first two steps for this assignment were to read through the dataset description to understand the variables, and then to read the data into a data frame. I read it into a data frame called ‘mushroom’ as shown below:

library(RCurl)
raw <- getURL("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data")
# read.csv() already returns a data frame, so no data.frame() wrapper is needed
mushroom <- read.csv(text=raw,header=FALSE,sep=",",stringsAsFactors=FALSE)

I noticed that the resulting data frame contained 8,124 rows, which matches the 8,124 instances stated in the description. However, it had 23 columns, while the dataset information listed only 22 attributes. Therefore, I used the head() function to take a closer look:

head(mushroom)

  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1  p  x  s  n  t  p  f  c  n   k   e   e   s   s   w   w   p   w   o   p
2  e  x  s  y  t  a  f  c  b   k   e   c   s   s   w   w   p   w   o   p
3  e  b  s  w  t  l  f  c  b   n   e   c   s   s   w   w   p   w   o   p
4  p  x  y  w  t  p  f  c  n   n   e   e   s   s   w   w   p   w   o   p
5  e  x  s  g  f  n  f  w  b   k   t   e   s   s   w   w   p   w   o   e
6  e  x  y  y  t  a  f  c  b   n   e   c   s   s   w   w   p   w   o   p
  V21 V22 V23
1   k   s   u
2   n   n   g
3   n   n   m
4   k   s   u
5   n   a   g
6   k   n   g

Based on the dataset description and the output of head(), the first column is the class label (poisonous or edible), and the 22 attributes make up the remaining columns of the data frame.

With a better understanding of the data and the data frame itself, I planned how I was going to tidy up this data.
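A quick way to confirm both observations is sketched below. So that it runs without a network connection, it rebuilds only the six rows printed by head() above as inline text; the names sample.text and sample.df are placeholders, not part of the original analysis.

```r
# The six rows printed by head(mushroom) above, as comma-separated text.
sample.text <- "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g"

# colClasses = "character" keeps every single-letter code as text.
sample.df <- read.csv(text = sample.text, header = FALSE,
                      colClasses = "character")

dim(sample.df)        # rows, then columns: 1 class label + 22 attributes
unique(sample.df$V1)  # the first column holds only the codes "p" and "e"
```

Running dim(mushroom) and unique(mushroom$V1) on the full data frame would perform the same check on all 8,124 rows.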

Tidying my Data Frame for Analysis

Before tidying up my data, I asked myself, “What do I want to get out of this data set?” I noticed that there are six attributes that each describe the color of a particular part of the mushroom. Maybe there is something interesting in how the combination of mushroom part and color relates to whether a mushroom is poisonous. Therefore, I selected the first column, which states whether the mushroom is poisonous or edible, along with the six columns that have to do with color. I used the subset() function to select these columns and assign them to a new data frame called “mushroom.color”:

mushroom.color <- subset(mushroom,select=c(V1,V4,V10,V15,V16,V18,V21))
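The column positions above can be cross-checked against the attribute list. The sketch below assumes the column order is the class label followed by the 22 attributes in the order given by the dataset description; the attribute names are typed in by hand here and should be verified against that description. The names attribute.names and color.idx are placeholders.

```r
# Assumed column order: class label first, then the 22 attributes in the
# order listed in the dataset description (verify against that file).
attribute.names <- c("class", "cap-shape", "cap-surface", "cap-color",
  "bruises", "odor", "gill-attachment", "gill-spacing", "gill-size",
  "gill-color", "stalk-shape", "stalk-root", "stalk-surface-above-ring",
  "stalk-surface-below-ring", "stalk-color-above-ring",
  "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
  "ring-type", "spore-print-color", "population", "habitat")

# Position of the class label plus every attribute whose name mentions color.
color.idx <- c(1, grep("color", attribute.names))

# Default column names assigned by read.csv(header = FALSE).
paste0("V", color.idx)
```

Under that assumption, the result lists V1, V4, V10, V15, V16, V18 and V21, the same columns passed to subset().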

Now that I had my new data frame, I needed to change the column names and data values to something more readable. I used colnames() to rename the columns, with StalkAR and StalkBR standing for the stalk color above and below the ring:

colnames(mushroom.color) <- c("Type","Cap","Gill","StalkAR","StalkBR","Veil","Spore")

Next, I used the gsub() function to replace all ‘p’ with ‘Poisonous’ and all ‘e’ with ‘Edible’:

# ^ and $ anchor the pattern so only a value that is exactly 'p' or 'e' is replaced
mushroom.color$Type <- gsub('^p$','Poisonous',mushroom.color$Type)
mushroom.color$Type <- gsub('^e$','Edible',mushroom.color$Type)
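Because gsub() matches a pattern anywhere inside a string, another option is to map whole values to labels. The sketch below uses factor() on the first six class codes shown by head() above; type.codes and type.labels are placeholder names.

```r
# First six values of the class column, as printed by head(mushroom) above.
type.codes <- c("p", "e", "e", "p", "e", "e")

# factor() maps each whole code to its label; as.character() turns the
# result back into plain text.
type.labels <- as.character(factor(type.codes,
                                   levels = c("e", "p"),
                                   labels = c("Edible", "Poisonous")))
type.labels
```

A code that is not listed in levels would become NA, which makes unexpected values easy to spot.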

Finally, I created a nested for loop that replaces each of the characters in the other columns with their respective colors:

for (i in 2:ncol(mushroom.color)) {
  for (j in 1:nrow(mushroom.color)) {
    if (mushroom.color[j,i]=='w') {mushroom.color[j,i] <- gsub('w','white',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='k') {mushroom.color[j,i] <- gsub('k','black',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='n') {mushroom.color[j,i] <- gsub('n','brown',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='b') {mushroom.color[j,i] <- gsub('b','buff',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='h') {mushroom.color[j,i] <- gsub('h','chocolate',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='c') {mushroom.color[j,i] <- gsub('c','cinnamon',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='g') {mushroom.color[j,i] <- gsub('g','gray',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='r') {mushroom.color[j,i] <- gsub('r','green',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='o') {mushroom.color[j,i] <- gsub('o','orange',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='p') {mushroom.color[j,i] <- gsub('p','pink',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='u') {mushroom.color[j,i] <- gsub('u','purple',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='e') {mushroom.color[j,i] <- gsub('e','red',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='y') {mushroom.color[j,i] <- gsub('y','yellow',mushroom.color[j,i])}
  }
}
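The same recoding can also be done without an explicit loop over every cell. The sketch below is one possible alternative, not the approach used above: it stores the thirteen code-to-color pairs from the loop in a named vector and indexes that vector with each column. The names color.map and toy are placeholders, and toy is a three-row stand-in for two of the coded columns.

```r
# Code-to-color pairs, identical to the ones handled in the loop above.
color.map <- c(w = "white", k = "black", n = "brown", b = "buff",
               h = "chocolate", c = "cinnamon", g = "gray", r = "green",
               o = "orange", p = "pink", u = "purple", e = "red",
               y = "yellow")

# Toy stand-in for two coded color columns.
toy <- data.frame(Cap = c("n", "y", "w"), Gill = c("k", "k", "n"),
                  stringsAsFactors = FALSE)

# Index the lookup vector by each column's codes; unname() drops the codes
# that would otherwise be carried along as element names.
toy[] <- lapply(toy, function(codes) unname(color.map[codes]))
toy
```

On the real data, the equivalent call would be mushroom.color[-1] <- lapply(mushroom.color[-1], function(codes) unname(color.map[codes])), applied before the loop has run. Unlike the loop, this turns any code missing from color.map into NA instead of leaving it unchanged.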

My new table is now formatted like this:

head(mushroom.color)

       Type    Cap  Gill StalkAR StalkBR  Veil Spore
1 Poisonous  brown black   white   white white black
2    Edible yellow black   white   white white brown
3    Edible  white brown   white   white white brown
4 Poisonous  white brown   white   white white black
5    Edible   gray black   white   white white brown
6    Edible yellow brown   white   white white black

Analysis of Poisonous Mushrooms as a Function of Color and Part of Mushroom

Now that I had cleaned up the data, I decided to do a short analysis of the poisonous mushrooms. I thought it might be interesting to see the percentage of poisonous mushrooms as a function of color and part of the mushroom. For example, to find the share of mushrooms with a white cap that are poisonous, I need the number of mushrooms that have a white cap and are poisonous, divided by the total number of mushrooms with a white cap. Here is an example of the syntax:

nrow(subset(mushroom.color,Type=='Poisonous' & Cap=='white'))/nrow(subset(mushroom.color,Cap=='white'))
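The same ratio can be written as the mean of a logical vector, since TRUE counts as 1 and FALSE as 0. The sketch below uses a made-up four-row data frame (toy) with the same column names as mushroom.color, so the numbers are illustrative only.

```r
# Made-up rows with the same column names as mushroom.color.
toy <- data.frame(Type = c("Poisonous", "Edible", "Edible", "Poisonous"),
                  Cap  = c("white", "white", "white", "brown"),
                  stringsAsFactors = FALSE)

# Keep only the white-capped rows, then take the share that is poisonous.
white.cap <- toy$Cap == "white"
share.white <- mean(toy$Type[white.cap] == "Poisonous")
share.white   # one of the three white-capped toy rows is poisonous
```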

Next, I created an empty matrix (pp.matrix) with one column per mushroom part and one row per color. Each element holds the proportion of mushrooms with that part and color combination that are poisonous. I used a nested for loop to calculate each element, then turned the matrix into a data frame (percent.poisonous) and named the rows and columns:

pp.matrix <- matrix(nrow=13,ncol=6)
color <- c('white','black','brown','buff','chocolate','cinnamon','gray','green','orange','pink','purple','red','yellow')
for (i in 1:ncol(pp.matrix)) {
  for (j in 1:nrow(pp.matrix)) {
    if (i==1) {pp.matrix[j,i] <- nrow(subset(mushroom.color,Type=='Poisonous' & Cap==color[j]))/nrow(subset(mushroom.color,Cap==color[j]))}
    if (i==2) {pp.matrix[j,i] <- nrow(subset(mushroom.color,Type=='Poisonous' & Gill==color[j]))/nrow(subset(mushroom.color,Gill==color[j]))}
    if (i==3) {pp.matrix[j,i] <- nrow(subset(mushroom.color,Type=='Poisonous' & StalkAR==color[j]))/nrow(subset(mushroom.color,StalkAR==color[j]))}
    if (i==4) {pp.matrix[j,i] <- nrow(subset(mushroom.color,Type=='Poisonous' & StalkBR==color[j]))/nrow(subset(mushroom.color,StalkBR==color[j]))}
    if (i==5) {pp.matrix[j,i] <- nrow(subset(mushroom.color,Type=='Poisonous' & Veil==color[j]))/nrow(subset(mushroom.color,Veil==color[j]))}
    if (i==6) {pp.matrix[j,i] <- nrow(subset(mushroom.color,Type=='Poisonous' & Spore==color[j]))/nrow(subset(mushroom.color,Spore==color[j]))}
  }
}
percent.poisonous <- as.data.frame(pp.matrix)
colnames(percent.poisonous) <- c("Cap","Gill","StalkAR","StalkBR","Veil","Spore")
rownames(percent.poisonous) <- c('white','black','brown','buff','chocolate','cinnamon','gray','green','orange','pink','purple','red','yellow')
percent.poisonous

                Cap       Gill   StalkAR   StalkBR      Veil     Spore
white     0.3076923 0.20465890 0.3835125 0.3832117 0.4931853 0.7587940
black           NaN 0.15686275       NaN       NaN       NaN 0.1196581
brown     0.4465849 0.10687023 0.9642857 0.8750000 0.0000000 0.1138211
buff      0.7142857 1.00000000 1.0000000 1.0000000       NaN 0.0000000
chocolate       NaN 0.72131148       NaN       NaN       NaN 0.9705882
cinnamon  0.2727273        NaN 1.0000000 1.0000000       NaN       NaN
gray      0.4391304 0.67021277 0.0000000 0.0000000       NaN       NaN
green     0.0000000 1.00000000       NaN       NaN       NaN 1.0000000
orange          NaN 0.00000000 0.0000000 0.0000000 0.0000000 0.0000000
pink      0.6111111 0.42895442 0.6923077 0.6923077       NaN       NaN
purple    0.0000000 0.09756098       NaN       NaN       NaN 0.0000000
red       0.5840000 0.00000000 0.0000000 0.0000000       NaN       NaN
yellow    0.6268657 0.25581395 1.0000000 1.0000000 1.0000000 0.0000000
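A more compact way to build a matrix of this shape is sketched below with nested sapply() calls, so that no per-column if statement is needed. It is an alternative sketch on a made-up four-row data frame (toy), not the code that produced the table above; parts, colors and pp are placeholder names.

```r
# Made-up rows with the same kind of columns as mushroom.color.
toy <- data.frame(
  Type = c("Poisonous", "Edible", "Poisonous", "Edible"),
  Cap  = c("white", "white", "brown", "brown"),
  Gill = c("buff", "white", "buff", "white"),
  stringsAsFactors = FALSE)

parts  <- c("Cap", "Gill")
colors <- c("white", "brown", "buff")

# Outer sapply: one column per part. Inner sapply: one row per color.
# mean() of an empty selection is NaN, like the 0/0 division above.
pp <- sapply(parts, function(part) {
  sapply(colors, function(shade) {
    mean(toy$Type[toy[[part]] == shade] == "Poisonous")
  })
})
pp
```

Because sapply() keeps the names of the character vectors it iterates over, the rows and columns of pp are labeled with the colors and parts without a separate rownames() or colnames() call.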

After looking at the table, I realized that I could clean it up a bit. First, there are more digits displayed than I need, so I converted the values to percentages rounded to at most two decimal places. I also saw NaN values sprinkled through the table, which mean that no mushroom in the dataset has that part in that color. Therefore, I replaced the NaN values with “DNE,” meaning “Does Not Exist.” Next, I saw that 0% and 100% appeared in the table. A value of 0% means that the particular part and color combination was never poisonous, while 100% means that it was always poisonous. Therefore, I replaced 0% values with “EDIBLE” and 100% values with “POISONOUS.” Since comparing NaN with == returns NA, the code first swaps each NaN for the placeholder 9.99, which becomes 999 after multiplying by 100.

percent.poisonous[is.na(percent.poisonous)] <- 9.99
percent.poisonous <- round(percent.poisonous*100, digits = 2)
percent.poisonous[percent.poisonous == 999] <- 'DNE'
percent.poisonous[percent.poisonous == 0] <- 'EDIBLE'
percent.poisonous[percent.poisonous == 100] <- 'POISONOUS'
percent.poisonous

             Cap      Gill   StalkAR   StalkBR      Veil     Spore
white      30.77     20.47     38.35     38.32     49.32     75.88
black        DNE     15.69       DNE       DNE       DNE     11.97
brown      44.66     10.69     96.43      87.5    EDIBLE     11.38
buff       71.43 POISONOUS POISONOUS POISONOUS       DNE    EDIBLE
chocolate    DNE     72.13       DNE       DNE       DNE     97.06
cinnamon   27.27       DNE POISONOUS POISONOUS       DNE       DNE
gray       43.91     67.02    EDIBLE    EDIBLE       DNE       DNE
green     EDIBLE POISONOUS       DNE       DNE       DNE POISONOUS
orange       DNE    EDIBLE    EDIBLE    EDIBLE    EDIBLE    EDIBLE
pink       61.11      42.9     69.23     69.23       DNE       DNE
purple    EDIBLE      9.76       DNE       DNE       DNE    EDIBLE
red         58.4    EDIBLE    EDIBLE    EDIBLE       DNE       DNE
yellow     62.69     25.58 POISONOUS POISONOUS POISONOUS    EDIBLE
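The placeholder value can be avoided by testing for NaN directly. The sketch below shows one way to do that with a small helper; label.share is a hypothetical function name introduced only for illustration, and the input values are taken from the table of proportions above.

```r
# Hypothetical helper: turn a vector of proportions into display labels.
label.share <- function(p) {
  pct <- round(p * 100, digits = 2)
  out <- as.character(pct)
  out[is.nan(p)] <- "DNE"                        # combination not in the data
  out[!is.nan(p) & pct == 0] <- "EDIBLE"         # never poisonous
  out[!is.nan(p) & pct == 100] <- "POISONOUS"    # always poisonous
  out
}

labels <- label.share(c(0.3076923, NaN, 0, 1, 0.875))
labels
```

Applied to the numeric version of the table, this would be percent.poisonous[] <- lapply(percent.poisonous, label.share), used in place of the five replacement lines above.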

This table can now help in judging the poison status of the mushrooms in this dataset. For example, every mushroom in the data with a green cap is edible, while every mushroom with green gills is poisonous. The table also gives the percentage of each part and color combination that is poisonous, and shows which combinations do not appear in the data.

Ryan Gordon

1/27/2019
