The first two steps for this assignment were to read through the dataset description to understand the variables, and then to read the data into a data frame. I read it into a data frame called ‘mushroom’ as shown below:
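library(RCurl)
# Download the raw comma-separated text; the file has no header row
raw <- getURL("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data")
# read.csv() already returns a data frame, so no data.frame() wrapper is needed
mushroom <- read.csv(text=raw,header=FALSE,sep=",",stringsAsFactors=FALSE)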
I noticed that the resulting data frame contained 8,124 rows, which matches the 8,124 instances given in the description. However, it had 23 columns, while the data information stated that there were only 22 attributes. Therefore, I used the head() function to take a closer look:
head(mushroom)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1 p x s n t p f c n k e e s s w w p w o p
2 e x s y t a f c b k e c s s w w p w o p
3 e b s w t l f c b n e c s s w w p w o p
4 p x y w t p f c n n e e s s w w p w o p
5 e x s g f n f w b k t e s s w w p w o e
6 e x y y t a f c b n e c s s w w p w o p
V21 V22 V23
1 k s u
2 n n g
3 n n m
4 k s u
5 n a g
6 k n g
Based on the data set description and the output of the head() function, it appears that the first column holds the class (poisonous or edible), and the 22 attributes make up the remaining columns of the data frame.
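One way to double-check this reading of the columns is to look at the dimensions and tabulate the first column. The sketch below is only illustrative: it rebuilds the six rows printed by head() from a hand-typed string (the names sample.text and sample.df are made up for this example) so that it runs without downloading the file.

```r
# The six rows shown by head(mushroom), typed in by hand (illustrative sample only)
sample.text <- "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g"

# colClasses="character" keeps every single-letter code as text
sample.df <- read.csv(text=sample.text, header=FALSE, colClasses="character")

dim(sample.df)        # rows, then columns: the class column plus 22 attributes
table(sample.df$V1)   # counts of 'e' and 'p' in the class column
```

On the full data frame, the same two calls on mushroom should report the 8,124 rows and 23 columns noted above.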
Now that I had a better understanding of the data and the data frame itself, I planned how to tidy it up.
Before tidying, I asked myself, “What do I want to get out of this data set?” I noticed that six of the attributes describe the color of a particular part of the mushroom (cap, gill, stalk above and below the ring, veil, and spore print). Maybe there is something interesting in how the part and color combination relates to whether the mushroom is poisonous. Therefore, I selected the first column, which states whether the mushroom is poisonous or edible, along with the six columns that have to do with color. I used the subset() function to select these columns and assigned them to a new data frame called “mushroom.color”:
mushroom.color <- subset(mushroom,select=c(V1,V4,V10,V15,V16,V18,V21))
Now that I have my new data frame, I need to change the column names and data values to names that are more readable. I used colnames() to rename the columns:
colnames(mushroom.color) <- c("Type","Cap","Gill","StalkAR","StalkBR","Veil","Spore")
Next, I used the gsub() function to replace all ‘p’ with ‘Poisonous’ and all ‘e’ with ‘Edible’:
mushroom.color$Type <- gsub('p','Poisonous',mushroom.color$Type)
mushroom.color$Type <- gsub('e','Edible',mushroom.color$Type)
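Because gsub() replaces every match anywhere in a string, chaining several calls only works as long as no replacement text contains a letter that a later call looks for. A lookup by name avoids that dependence on ordering. The following is a minimal sketch of that alternative, not the approach used above; type.codes, type.labels, and type.full are illustrative names.

```r
# Codes as they appear in the first column (first six rows of the data)
type.codes <- c("p","e","e","p","e","e")

# Named vector: the names are the codes, the values are the readable labels
type.labels <- c(p="Poisonous", e="Edible")

# Indexing by name translates each code in one step
type.full <- unname(type.labels[type.codes])
type.full
```

Applied to the real data, the same idea would index type.labels with the original single-letter Type column.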
Finally, I created a nested for loop that replaces each of the characters in the other columns with their respective colors:
for (i in 2:ncol(mushroom.color)) {
  for (j in 1:nrow(mushroom.color)) {
    if (mushroom.color[j,i]=='w') {mushroom.color[j,i] <- gsub('w','white',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='k') {mushroom.color[j,i] <- gsub('k','black',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='n') {mushroom.color[j,i] <- gsub('n','brown',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='b') {mushroom.color[j,i] <- gsub('b','buff',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='h') {mushroom.color[j,i] <- gsub('h','chocolate',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='c') {mushroom.color[j,i] <- gsub('c','cinnamon',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='g') {mushroom.color[j,i] <- gsub('g','gray',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='r') {mushroom.color[j,i] <- gsub('r','green',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='o') {mushroom.color[j,i] <- gsub('o','orange',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='p') {mushroom.color[j,i] <- gsub('p','pink',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='u') {mushroom.color[j,i] <- gsub('u','purple',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='e') {mushroom.color[j,i] <- gsub('e','red',mushroom.color[j,i])}
    else if (mushroom.color[j,i]=='y') {mushroom.color[j,i] <- gsub('y','yellow',mushroom.color[j,i])}
  }
}
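A cell-by-cell loop over roughly 8,000 rows and six columns can be slow in R. A vectorized lookup gives the same translation in one pass per column. The sketch below is one possible alternative rather than the code used above; it runs on a small hand-built data frame (toy, holding the first six rows of three of the columns), and color.map and color.cols are illustrative names.

```r
# Code-to-color lookup, using the same thirteen codes as the loop above
color.map <- c(w="white", k="black", n="brown", b="buff", h="chocolate",
               c="cinnamon", g="gray", r="green", o="orange", p="pink",
               u="purple", e="red", y="yellow")

# Toy data: codes from the first six rows (Type, Cap, Gill, Spore only)
toy <- data.frame(
  Type =c("Poisonous","Edible","Edible","Poisonous","Edible","Edible"),
  Cap  =c("n","y","w","w","g","y"),
  Gill =c("k","k","n","n","k","n"),
  Spore=c("k","n","n","k","n","k"),
  stringsAsFactors=FALSE)

# Translate every column except Type; a code missing from color.map becomes NA
color.cols <- setdiff(names(toy), "Type")
toy[color.cols] <- lapply(toy[color.cols], function(x) unname(color.map[x]))
toy
```

One design difference to keep in mind: the loop leaves an unrecognized code unchanged, while the lookup turns it into NA, which makes unexpected codes easier to spot.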
The mushroom.color data frame is now formatted like this:
head(mushroom.color)
       Type    Cap  Gill StalkAR StalkBR  Veil Spore
1 Poisonous  brown black   white   white white black
2    Edible yellow black   white   white white brown
3    Edible  white brown   white   white white brown
4 Poisonous  white brown   white   white white black
5    Edible   gray black   white   white white brown
6    Edible yellow brown   white   white white black
Now that I had cleaned up the data, I decided to do a short analysis of the poisonous mushrooms. I thought it might be interesting to see the percentage of poisonous mushrooms as a function of color and part of the mushroom. For example, to find the percentage of mushrooms with a white cap that are poisonous, I need to count the mushrooms that have a white cap and are poisonous, and divide that by the total number of mushrooms with a white cap. Here is an example of the syntax:
nrow(subset(mushroom.color,Type=='Poisonous' & Cap=='white'))/nrow(subset(mushroom.color,Cap=='white'))
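The same share can also be written as the mean of a logical vector, since TRUE counts as 1 and FALSE as 0. The sketch below checks this on a hand-built toy data frame (the first six rows of Type and Cap; toy, white.cap, share, and share.counts are illustrative names) and compares it with the counting form.

```r
# Toy data: Type and Cap for the first six rows of mushroom.color
toy <- data.frame(
  Type=c("Poisonous","Edible","Edible","Poisonous","Edible","Edible"),
  Cap =c("brown","yellow","white","white","gray","yellow"),
  stringsAsFactors=FALSE)

# Logical vector marking the white-capped rows
white.cap <- toy$Cap == "white"

# Mean of TRUE/FALSE values = share of white-capped rows that are poisonous
share <- mean(toy$Type[white.cap] == "Poisonous")

# Same quantity with the counting syntax
share.counts <- nrow(subset(toy,Type=='Poisonous' & Cap=='white'))/nrow(subset(toy,Cap=='white'))
```

In this toy sample one of the two white-capped rows is poisonous, so both forms give 0.5. When no row has the requested color, both forms return NaN.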
Next, I created an empty matrix (pp.matrix) whose columns are the parts of the mushroom and whose rows are the colors. Each element holds the proportion of mushrooms with that part and color combination that are poisonous. I used a nested for loop to calculate each element. Then I turned the matrix into a data frame (percent.poisonous) and named the rows and columns:
pp.matrix <- matrix(nrow=13,ncol=6)
color <- c('white','black','brown','buff','chocolate','cinnamon','gray','green','orange','pink','purple','red','yellow')
for (i in 1:ncol(pp.matrix)) {
  for (j in 1:nrow(pp.matrix)) {
    if (i==1) {pp.matrix[j,i] <- nrow(subset(mushroom.color,Type=='Poisonous' & Cap==color[j]))/nrow(subset(mushroom.color,Cap==color[j]))}
    if (i==2) {pp.matrix[j,i] <- nrow(subset(mushroom.color,Type=='Poisonous' & Gill==color[j]))/nrow(subset(mushroom.color,Gill==color[j]))}
    if (i==3) {pp.matrix[j,i] <- nrow(subset(mushroom.color,Type=='Poisonous' & StalkAR==color[j]))/nrow(subset(mushroom.color,StalkAR==color[j]))}
    if (i==4) {pp.matrix[j,i] <- nrow(subset(mushroom.color,Type=='Poisonous' & StalkBR==color[j]))/nrow(subset(mushroom.color,StalkBR==color[j]))}
    if (i==5) {pp.matrix[j,i] <- nrow(subset(mushroom.color,Type=='Poisonous' & Veil==color[j]))/nrow(subset(mushroom.color,Veil==color[j]))}
    if (i==6) {pp.matrix[j,i] <- nrow(subset(mushroom.color,Type=='Poisonous' & Spore==color[j]))/nrow(subset(mushroom.color,Spore==color[j]))}
  }
}
percent.poisonous <- as.data.frame(pp.matrix)
colnames(percent.poisonous) <- c("Cap","Gill","StalkAR","StalkBR","Veil","Spore")
rownames(percent.poisonous) <- c('white','black','brown','buff','chocolate','cinnamon','gray','green','orange','pink','purple','red','yellow')
percent.poisonous
                Cap       Gill   StalkAR   StalkBR      Veil     Spore
white     0.3076923 0.20465890 0.3835125 0.3832117 0.4931853 0.7587940
black           NaN 0.15686275       NaN       NaN       NaN 0.1196581
brown     0.4465849 0.10687023 0.9642857 0.8750000 0.0000000 0.1138211
buff      0.7142857 1.00000000 1.0000000 1.0000000       NaN 0.0000000
chocolate       NaN 0.72131148       NaN       NaN       NaN 0.9705882
cinnamon  0.2727273        NaN 1.0000000 1.0000000       NaN       NaN
gray      0.4391304 0.67021277 0.0000000 0.0000000       NaN       NaN
green     0.0000000 1.00000000       NaN       NaN       NaN 1.0000000
orange          NaN 0.00000000 0.0000000 0.0000000 0.0000000 0.0000000
pink      0.6111111 0.42895442 0.6923077 0.6923077       NaN       NaN
purple    0.0000000 0.09756098       NaN       NaN       NaN 0.0000000
red       0.5840000 0.00000000 0.0000000 0.0000000       NaN       NaN
yellow    0.6268657 0.25581395 1.0000000 1.0000000 1.0000000 0.0000000
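The six near-identical if statements can be collapsed by looping over the columns by name. The following is a rough sketch of that idea using sapply(), run on a hand-built toy data frame with two parts and three colors. The helper poison.share and the names toy, parts, colors, and pp.toy are hypothetical and not part of the code above.

```r
# Toy data: the first six rows of mushroom.color, two color columns only
toy <- data.frame(
  Type=c("Poisonous","Edible","Edible","Poisonous","Edible","Edible"),
  Cap =c("brown","yellow","white","white","gray","yellow"),
  Gill=c("black","black","brown","brown","black","brown"),
  stringsAsFactors=FALSE)

parts  <- c("Cap","Gill")
colors <- c("white","black","brown")

# Share of poisonous rows for each color within one column of part colors
poison.share <- function(part.colors, type, colors) {
  sapply(colors, function(col) mean(type[part.colors == col] == "Poisonous"))
}

# One column of shares per part; rows are named after the colors
pp.toy <- sapply(toy[parts], poison.share, type=toy$Type, colors=colors)
pp.toy
```

The result is a matrix with the colors as row names and the parts as column names, so the separate colnames() and rownames() calls are not needed. Colors that never occur for a part still come out as NaN.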
After analyzing the percent.poisonous table, I realized that I could clean it up a bit. First, there are more digits displayed than I need, so I converted the proportions to percentages rounded to at most two decimal places. I also saw NaN values sprinkled through the table. A NaN is the result of dividing zero by zero, which means that no mushroom in the data set has that part in that color. Therefore, I replaced the NaN values with “DNE,” meaning “Does Not Exist.” Next, I saw that 0% and 100% appeared in the table. A value of 0% means that the particular part and color combination was never poisonous, while 100% means that it was always poisonous. Therefore, I replaced 0% values with “EDIBLE” and 100% values with “POISONOUS.” Because is.na() is also TRUE for NaN, I first set those cells to a placeholder of 9.99, which becomes 999 after multiplying by 100 and can then be swapped for “DNE”:
percent.poisonous[is.na(percent.poisonous)] <- 9.99
percent.poisonous <- round(percent.poisonous*100, digits = 2)
percent.poisonous[percent.poisonous == 999] <- 'DNE'
percent.poisonous[percent.poisonous == 0] <- 'EDIBLE'
percent.poisonous[percent.poisonous == 100] <- 'POISONOUS'
percent.poisonous
             Cap      Gill   StalkAR   StalkBR      Veil     Spore
white      30.77     20.47     38.35     38.32     49.32     75.88
black        DNE     15.69       DNE       DNE       DNE     11.97
brown      44.66     10.69     96.43      87.5    EDIBLE     11.38
buff       71.43 POISONOUS POISONOUS POISONOUS       DNE    EDIBLE
chocolate    DNE     72.13       DNE       DNE       DNE     97.06
cinnamon   27.27       DNE POISONOUS POISONOUS       DNE       DNE
gray       43.91     67.02    EDIBLE    EDIBLE       DNE       DNE
green     EDIBLE POISONOUS       DNE       DNE       DNE POISONOUS
orange       DNE    EDIBLE    EDIBLE    EDIBLE    EDIBLE    EDIBLE
pink       61.11      42.9     69.23     69.23       DNE       DNE
purple    EDIBLE      9.76       DNE       DNE       DNE    EDIBLE
red         58.4    EDIBLE    EDIBLE    EDIBLE       DNE       DNE
yellow     62.69     25.58 POISONOUS POISONOUS POISONOUS    EDIBLE
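The placeholder step can be avoided by building the labels from the numeric values directly. This is a minimal sketch under assumptions: pct is a hand-typed excerpt of the proportions computed earlier (the white, black, and green rows of Cap and Gill), and the names pct and pct.labels are illustrative.

```r
# Excerpt of the proportions: white, black, green rows for Cap and Gill
pct <- matrix(c(0.3076923, NaN, 0,
                0.20465890, 0.15686275, 1),
              nrow=3,
              dimnames=list(c("white","black","green"), c("Cap","Gill")))

# Start from the rounded percentages as text, keeping the row and column names
pct.labels <- matrix(as.character(round(pct*100, digits=2)),
                     nrow=nrow(pct), dimnames=dimnames(pct))

# Overwrite the special cases; the !is.nan() guard keeps NA out of the index
pct.labels[is.nan(pct)] <- "DNE"
pct.labels[!is.nan(pct) & pct == 0] <- "EDIBLE"
pct.labels[!is.nan(pct) & pct == 1] <- "POISONOUS"
pct.labels
```

Testing the original proportions, rather than the converted text, means no real value can collide with a placeholder such as 999.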
The percent.poisonous table can now help in judging the poison status of mushrooms, at least for the specimens in this data set. For example, every mushroom in the data with a green cap was edible, while every mushroom with green gills was poisonous. The table also gives the percentage of mushrooms with a particular part and color combination that are poisonous, and shows which part and color combinations do not occur in the data.