Mycology

Part 1: Data Visualization

Is there a relationship between cap surface and cap shape of a mushroom?

#Is there a relationship between cap-surface and cap-shape of a mushroom?
library(ggplot2)
ggplot(data = mushroom.data, aes(x=cap.shape, y=cap.surface, color=class)) +
  geom_jitter(alpha=0.3) +
  scale_color_manual(breaks = c('edible','poisonous'), values=c('darkgreen','red'))

This graph shows clusters of points, each representing one mushroom, and reveals a few relationships between cap shape and cap surface. Red points are poisonous mushrooms, while green points are edible. Mushrooms with a flat or convex cap shape and a scaly, smooth, or fibrous cap surface tend to be poisonous. Mushrooms with a bell cap shape that are fibrous, smooth, or scaly tend to be edible. Mushrooms that are knobbed and fibrous tend to be edible, but those that are knobbed and smooth or scaly tend to be poisonous.
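The visual impression from a jittered plot can be backed up with counts. The sketch below is only an illustration: it uses a small invented data frame (`toy`, not the real `mushroom.data`) with the same column names, and computes the share of poisonous mushrooms for each combination of cap shape and cap surface.

```r
# Toy stand-in for mushroom.data; the values are invented for illustration only.
toy <- data.frame(
  cap.shape   = c("bell", "bell", "flat", "flat", "flat", "convex", "convex", "convex"),
  cap.surface = c("smooth", "smooth", "scaly", "scaly", "smooth", "scaly", "scaly", "smooth"),
  class       = c("edible", "edible", "poisonous", "poisonous", "edible",
                  "poisonous", "edible", "poisonous")
)

# Share of poisonous mushrooms in each shape/surface cell.
# Cells with no observations come back as NA.
poison.share <- tapply(toy$class == "poisonous",
                       list(toy$cap.shape, toy$cap.surface),
                       mean)
print(round(poison.share, 2))
```

A share near 1 marks a combination that is mostly poisonous, a share near 0 one that is mostly edible, and values in between mark the mixed cells that are hard to judge by eye.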

#Is there a relationship between habitat and population?
ggplot(data = mushroom.data, aes(x=population, y=habitat, color=class)) +
  geom_jitter(alpha=0.3) +  
  scale_color_manual(breaks = c('edible','poisonous'),values=c('darkgreen','red'))

#Mushrooms that are clustered or scattered in the woods are poisonous.
#Mushrooms that are abundant or numerous in grasses are edible.
#Mushrooms that are scattered in grasses, leaves, meadows, paths, and waste tend to be edible.
#Mushrooms that are several or solitary tend to be poisonous.

This graph shows the relationship between a mushroom's population type and its habitat. Mushrooms that grow in the woods tend to be poisonous: those that are clustered or scattered there are poisonous, and those that grow several together or solitarily are mostly poisonous. Mushrooms that are abundant, numerous, or scattered in grasses tend to be edible, while those that are several or solitary are generally poisonous. Overall, mushrooms tend to be edible when they are abundant and poisonous when fewer grow together.

ggplot(data = mushroom.data, aes(x=habitat, y=odor, color=class)) +
  geom_jitter(alpha=0.3) +
  scale_color_manual(breaks = c('edible','poisonous'),values=c('darkgreen','red'))

#Mushrooms with fishy, spicy, pungent, musty, foul, or creosote odors are poisonous, regardless of habitat. Our noses are good at what they do!
#Mushrooms which smell of anise or almond are edible, regardless of habitat.
#Mushrooms which have no smell are edible on paths and in urban and waste habitats. In meadows, they are poisonous, and in woods, grasses, and leaves, they are mostly edible.

This graph shows the relationship between a mushroom's habitat and its odor. Mushrooms with fishy, spicy, pungent, musty, foul, or creosote odors are poisonous regardless of habitat, while mushrooms that smell of anise or almond are edible regardless of habitat. Mushrooms with no smell are edible on paths and in urban and waste habitats, poisonous in meadows, and mostly edible in woods, grasses, and leaves. This suggests that our noses are good at what they do: odors we perceive as bad are exclusively poisonous, and more pleasant smells such as almond and anise belong to edible mushrooms. Mushrooms with no smell are generally edible, but not exclusively so.
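Claims such as "exclusively poisonous" can be checked directly from a contingency table instead of read off the plot. The following sketch assumes an invented data frame (`toy.odor`, not the real data) and lists the odor levels whose rows are all poisonous or all edible.

```r
# Toy stand-in for the odor and class columns; values are invented for illustration.
toy.odor <- data.frame(
  odor  = c("foul", "foul", "foul", "almond", "almond", "none", "none", "none"),
  class = c("poisonous", "poisonous", "poisonous", "edible", "edible",
            "edible", "edible", "poisonous")
)

# Row-wise proportions: for each odor, the share of each class.
odor.tab <- table(toy.odor$odor, toy.odor$class)
poison.share <- prop.table(odor.tab, margin = 1)[, "poisonous"]

# Odor levels that occur only with one class.
exclusively.poisonous <- names(poison.share)[poison.share == 1]
exclusively.edible    <- names(poison.share)[poison.share == 0]

print(exclusively.poisonous)
print(exclusively.edible)
```

Odors that appear in neither list, like the no-smell level in this toy example, are the mixed cases that need a second feature to separate.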

To test for association between categorical predictors, we will use a chi-squared test of independence. First, we will examine the association between mushroom cap shape and cap surface, which we plotted in the first graph.

#Pearson’s chi-squared test of independence (significance test)
chisq.test(mushroom.data$cap.shape, mushroom.data$cap.surface, correct = FALSE)

## Warning in chisq.test(mushroom.data$cap.shape, mushroom.data$cap.surface, : Chi-
## squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  mushroom.data$cap.shape and mushroom.data$cap.surface
## X-squared = 1011.5, df = 15, p-value < 2.2e-16

The p-value is less than 2.2e-16, which is effectively zero and far below 0.05. Thus we can reject the null hypothesis of independence and conclude that cap shape and cap surface are dependent on each other.
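The warning "Chi-squared approximation may be incorrect" is raised when some expected cell counts are small. As a hedged sketch, the code below uses an invented contingency table (not the mushroom counts) to show how the expected counts can be inspected and how a Monte Carlo p-value can be requested through the `simulate.p.value` argument of `chisq.test`.

```r
# Toy contingency table with a sparse column; counts are invented for illustration.
tab <- as.table(matrix(c(12, 3, 1,
                          2, 9, 1),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(shape   = c("bell", "flat"),
                                       surface = c("fibrous", "scaly", "grooves"))))

# Expected counts under independence; small values are what trigger the warning.
expected <- suppressWarnings(chisq.test(tab, correct = FALSE))$expected
print(round(expected, 2))
print(any(expected < 5))

# Monte Carlo p-value, which does not rely on the chi-squared approximation.
set.seed(1)  # arbitrary seed so the simulation is repeatable
sim <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
print(sim$p.value)
```

If the simulated p-value agrees with the asymptotic one, the conclusion of the test does not hinge on the sparse cells.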

Next we will test the dependency between habitat and odor, which we plotted in the third graph.

chisq.test(mushroom.data$habitat, mushroom.data$odor, correct = FALSE)

## Warning in chisq.test(mushroom.data$habitat, mushroom.data$odor, correct =
## FALSE): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  mushroom.data$habitat and mushroom.data$odor
## X-squared = 6675.1, df = 48, p-value < 2.2e-16

The p-value is less than 2.2e-16, which is far below 0.05. Thus we can reject the null hypothesis and conclude that habitat and odor are dependent. This makes sense because different habitats have different ecosystems and predators, which may react to odors differently.

Finally, we will examine association between categorical predictors using Goodman and Kruskal’s tau measure. The following is the association plot for the categorical predictors cap shape, cap surface, habitat, odor, and class.

library(GoodmanKruskal)
varset1 <- c("cap.shape","cap.surface","habitat","odor","class")
mushroomFrame1 <- subset(mushroom.data, select = varset1)
GKmatrix1 <- GKtauDataframe(mushroomFrame1)
plot(GKmatrix1, corrColors = "blue")

The tau value from odor to class is very high at 0.94. This means that class is almost perfectly predictable from odor, so the odor of a mushroom is very good for determining whether it is poisonous. The reverse direction is weaker, with tau = 0.34, meaning that it is harder to predict a mushroom's odor from knowing whether it is poisonous.

The associations between cap shape and cap surface are very weak, with tau = 0.03 for the forward direction and tau = 0.01 for the reverse. So although the chi-squared test found them significantly dependent, they are difficult to predict from one another.
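The asymmetry of the tau measure is easier to see on a tiny example. The base-R sketch below implements the standard Goodman and Kruskal tau formula, the proportional reduction in Gini variation of `y` once `x` is known. `gk_tau` is a hypothetical helper written for this illustration, not a function from the `GoodmanKruskal` package, and the data are invented.

```r
# Hypothetical helper: Goodman and Kruskal's tau for predicting y from x.
gk_tau <- function(x, y) {
  # factor() drops unused levels so no row of the table is empty.
  p  <- prop.table(table(factor(x), factor(y)))  # joint proportions
  px <- rowSums(p)                               # marginal proportions of x
  py <- colSums(p)                               # marginal proportions of y
  v.y         <- 1 - sum(py^2)                   # Gini variation of y
  v.y.given.x <- 1 - sum(p^2 / px)               # expected variation of y given x
  (v.y - v.y.given.x) / v.y
}

# Toy data: y is fully determined by x, but x is not determined by y.
x <- c("a", "a", "b", "b", "c", "c")
y <- c("p", "p", "p", "p", "q", "q")

tau.xy <- gk_tau(x, y)  # forward direction: predict y from x
tau.yx <- gk_tau(y, x)  # reverse direction: predict x from y
print(c(forward = tau.xy, reverse = tau.yx))
```

Here the forward value is 1 because knowing `x` fixes `y`, while the reverse value is 0.5 because knowing `y = "p"` still leaves two possible values of `x`. The same logic explains why the two directions between odor and class differ.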

Part 2: More Visualization

#cap color, cap shape
#cap color, population
#cap color, 

ggplot(data = mushroom.data, aes(x=cap.color, y=cap.shape, color=class)) +
  geom_jitter(alpha=0.3) +  
  scale_color_manual(breaks = c('edible','poisonous'),values=c('darkgreen','red'))

Next we compare the color of the mushroom cap and its shape. These are very obvious features, even to an inexperienced mycologist, and so might be the most helpful in identifying poisonous mushrooms. From the graph, we see that the majority of sunken or bell-shaped mushrooms are edible, while all other shapes (convex, knobbed, flat, and conical) are mostly poisonous. Mushrooms that are grey, brown, white, or yellow and sunken or bell-shaped tend to be edible, but not exclusively. Only mushrooms that are sunken in shape, or green or purple in color, are 100% edible. This shows that it is not so simple to identify poisonous mushrooms.

ggplot(data = mushroom.data, aes(x=cap.color, y=gill.color, color=class)) +
  geom_jitter(alpha=0.3) +  
  scale_color_manual(breaks = c('edible','poisonous'),values=c('darkgreen','red'))

Gill color is another obvious characteristic of a mushroom, and so it might be helpful to consider it together with the cap color. Here we see that gills with a buff color should be avoided, but mushrooms with red or orange gills can be eaten. We also see that purple gills with red, brown, green, or purple caps are edible. Otherwise, it is more complicated to identify wholly edible mushrooms using cap color and gill color.

ggplot(data = mushroom.data, aes(x=cap.color, y=population, color=class)) +
  geom_jitter(alpha=0.3) +  
  scale_color_manual(breaks = c('edible','poisonous'),values=c('darkgreen','red'))

Here we compare cap color with population, which is another obvious characteristic of mushrooms. We see that mushrooms of any color which are abundant or numerous are edible, but clustered mushrooms of any color include a few poisonous species. Solitary mushrooms which are yellow or grey should be avoided, but other colors are edible. Mushrooms in scattered or several populations are generally poisonous.

ggplot(data = mushroom.data, aes(x=cap.color, y=odor, color=class)) +
  geom_jitter(alpha=0.3) +  
  scale_color_manual(breaks = c('edible','poisonous'),values=c('darkgreen','red'))

Finally we compare cap color and the mushroom odor. Here we see a very distinct separation of edible and poisonous mushrooms, and can further break down the no-smell category. It appears that if there is no smell, cinnamon, red, grey, green, and purple mushrooms are safe to eat, while other colors are not.

Part 3: Classification Model

Now we will create a classification model for the mushroom data. First we will split the data into a training and a testing data set, with about 5% of the rows used for training and 95% for testing. You can see below that the ratio between edible and poisonous mushrooms in the original data set is approximately maintained in both the training and testing data sets.

#Create training data
sample.ind = sample(2, nrow(mushroom.data), replace = TRUE, prob = c(0.05,0.95))
data.dev = mushroom.data[sample.ind==1,]
data.val = mushroom.data[sample.ind==2,]

# Original Data
table(mushroom.data$class)/nrow(mushroom.data)

## 
##    edible poisonous 
## 0.5179714 0.4820286

# Training Data
table(data.dev$class)/nrow(data.dev)

## 
##    edible poisonous 
## 0.5321337 0.4678663

# Testing Data
table(data.val$class)/nrow(data.val)

## 
##    edible poisonous 
## 0.5172592 0.4827408
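The split above assigns each row to a group independently, so the group sizes and class ratios change from run to run. A repeatable variant is sketched below under stated assumptions: the seed value is arbitrary, and `toy` is an invented stand-in for `mushroom.data`, not the real data.

```r
set.seed(42)  # arbitrary seed, chosen only so that the split is repeatable

# Toy stand-in for mushroom.data with a two-level class column.
toy <- data.frame(id    = 1:200,
                  class = rep(c("edible", "poisonous"), times = c(104, 96)))

# Draw exactly 5% of the row indices for training; the rest is for testing.
n.train   <- round(0.05 * nrow(toy))
train.idx <- sample(nrow(toy), size = n.train)
toy.dev   <- toy[train.idx, ]
toy.val   <- toy[-train.idx, ]

# Class ratios in each part, for comparison with the full data.
print(table(toy$class) / nrow(toy))
print(table(toy.dev$class) / nrow(toy.dev))
print(table(toy.val$class) / nrow(toy.val))
```

Sampling row indices without replacement fixes the training size at exactly 5% and guarantees that no row lands in both sets.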

#Fit Random Forest Model
library(randomForest)
rf = randomForest(class ~ .,
                  ntree = 100,
                  data = data.dev)
plot(rf)

print(rf)

## 
## Call:
##  randomForest(formula = class ~ ., data = data.dev, ntree = 100) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 0.51%
## Confusion matrix:
##           edible poisonous class.error
## edible       207         0  0.00000000
## poisonous      2       180  0.01098901

Viewing the plot of error against the number of trees, we see that after 25-30 trees the error is fairly stable. The model tries 4 variables at each split and has an OOB estimate of error rate of 0.51%. The confusion matrix shows that 2 poisonous mushrooms were misclassified as edible and no edible mushrooms were misclassified as poisonous. Although mistaking an edible mushroom for a poisonous one is harmless, mistaking a poisonous mushroom for an edible one could be deadly!
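The error figures in the printed output follow directly from the confusion matrix. As a small worked check, the sketch below re-enters the four counts from that output by hand and recomputes the overall OOB error rate and the per-class errors.

```r
# OOB confusion matrix copied from the printed output above
# (rows = actual class, columns = predicted class).
oob <- matrix(c(207,   0,
                  2, 180),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual    = c("edible", "poisonous"),
                              predicted = c("edible", "poisonous")))

# Overall error: share of off-diagonal (misclassified) observations.
oob.error <- 1 - sum(diag(oob)) / sum(oob)

# Per-class error: share of each actual class that was misclassified.
class.error <- 1 - diag(oob) / rowSums(oob)

print(round(100 * oob.error, 2))  # OOB error rate in percent
print(class.error)
```

The overall rate is 2 errors out of 389 mushrooms, about 0.51%, and both errors fall in the poisonous row, which is the costly direction for a forager.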

# Predicting response variable
library(caret)
data.dev$predicted.response = predict(rf , data.dev)

# Create Confusion Matrix
print(confusionMatrix(data = data.dev$predicted.response, reference = data.dev$class, positive = 'edible'))

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  edible poisonous
##   edible       207         0
##   poisonous      0       182
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9906, 1)
##     No Information Rate : 0.5321     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.5321     
##          Detection Rate : 0.5321     
##    Detection Prevalence : 0.5321     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : edible     
##

Using the confusion matrix for the model, we see that it has 100% accuracy on data.dev, the training data it was fit to, rather than on a new data set. Consequently, the 95% confidence interval (0.9906, 1) is very narrow, and the p-value is less than 2.2e-16, or effectively zero. This suggests the model is effective for predicting whether a mushroom is poisonous, although the held-out data.val set was not used in this check. It should also be noted that when using the model for identification, it is important to use exactly the same criteria for categorizing the characteristics of the mushroom. Otherwise, the model will not produce reliable results.
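The accuracy interval and p-value in the confusion matrix output can be reproduced with base R. The sketch below assumes, as the `caret` documentation describes, that the interval is an exact binomial one from `binom.test` and that the p-value is a one-sided test against the no information rate. The counts are taken from the output above.

```r
# Counts from the confusion matrix output above.
correct <- 207 + 182  # correctly classified mushrooms
total   <- 389        # all mushrooms in data.dev

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy.
ci <- binom.test(correct, total)$conf.int
print(round(ci, 4))

# No information rate: share of the larger class (edible).
nir <- 207 / total
print(round(nir, 4))

# One-sided test of whether the accuracy exceeds the no information rate.
p.value <- binom.test(correct, total, p = nir, alternative = "greater")$p.value
print(p.value < 2.2e-16)
```

With every one of the 389 predictions correct, the upper limit is 1 and the lower limit depends only on the sample size, so the interval describes how many mushrooms were checked, not how varied they were.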

Johannes Griesser

5/12/2020
