Machine Learning - Logistic Regression & Decision Trees

PART II: Explore and Prepare the Data

1. Reduce the dimensionality of the dataset and keep the features with high information value

A graphical method is used to identify which features are worth keeping.

A statistical method is used to check the questionable variables.
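
A rough sketch of these two checks in R (the data frame name mushrooms is an assumption, and veil_color stands in for whichever variable is being inspected):

# Graphical check: bar plot of the level frequencies of one feature
barplot(table(mushrooms$veil_color), main = "veil_color")

# Statistical check: percentage of observations in each level
round(prop.table(table(mushrooms$veil_color)) * 100, 2)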


Veil type (%):
partial 
    100 

Veil color (%):
 brown orange  white yellow 
  1.18   1.18  97.54   0.10 

Ring number (%):
 none   one   two 
 0.44 92.17  7.39 

Ring type (%):
evanescent    flaring      large       none    pendant 
     34.17       0.59      15.95       0.44      48.84 

Gill attachment (%):
attached     free 
    2.58    97.42 

Both the statistical and graphical results indicate that the four features above that are dominated by a single value are not very informative. We remove these variables because they have practically no variation and therefore cannot serve as good predictors.
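
A sketch of this removal step (the column names below are guesses based on the level values shown above and may differ from the actual names in the data frame):

# Hypothetical names for the four near-constant features
drop_cols <- c("veil_type", "veil_color", "ring_number", "gill_attachment")
mushrooms <- mushrooms[, !(names(mushrooms) %in% drop_cols)]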

2. Using a stratified sampling approach, split the data into training (60%) and test (40%) datasets. Display the class distribution for the original dataset as well as for the training and test datasets.
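
A minimal sketch of one way to do this with the caret package (the data frame name mushrooms and target column edible are assumptions; only the name data_train is confirmed by the model output later in the report):

library(caret)

set.seed(123)                       # arbitrary seed for reproducibility
# createDataPartition samples within each class, preserving the class ratio
train_idx  <- createDataPartition(mushrooms$edible, p = 0.60, list = FALSE)
data_train <- mushrooms[train_idx, ]
data_test  <- mushrooms[-train_idx, ]

# Class distribution (in percent) for the original, training, and test sets
round(prop.table(table(mushrooms$edible))  * 100, 2)
round(prop.table(table(data_train$edible)) * 100, 2)
round(prop.table(table(data_test$edible))  * 100, 2)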


Original dataset (%):
 Yes   No 
51.8 48.2 

Training dataset (%):
  Yes    No 
51.85 48.15 

Test dataset (%):
  Yes    No 
51.72 48.28 

PART III: Train the Models

The two models used for this entirely nominal dataset are a decision tree and a logistic regression. KNN does not work well with nominal data, because the Euclidean distances the algorithm relies on are not meaningful for categorical features. Decision trees and logistic regression, on the other hand, handle nominal data well.

Model 1: Logistic Regression
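
The summary below comes from a binomial GLM; the call (with the formula and data frame name taken from the Call line in the output, and logit_model as a placeholder object name) looks like:

logit_model <- glm(edible ~ odor, family = binomial(link = "logit"),
                   data = data_train)
summary(logit_model)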


Call:
glm(formula = edible ~ odor, family = binomial(link = "logit"), 
    data = data_train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.26076  -0.26076  -0.00003   0.00003   2.60706  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.157e+01  1.838e+03  -0.012    0.991
odoranise     5.173e-12  2.692e+03   0.000    1.000
odorcreosote  4.313e+01  3.231e+03   0.013    0.989
odorfishy     4.313e+01  2.387e+03   0.018    0.986
odorfoul      4.313e+01  2.010e+03   0.021    0.983
odormusty     4.313e+01  6.498e+03   0.007    0.995
odornone      1.820e+01  1.838e+03   0.010    0.992
odorpungent   4.313e+01  3.000e+03   0.014    0.989
odorspicy     4.313e+01  2.453e+03   0.018    0.986

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6750.15  on 4873  degrees of freedom
Residual deviance:  622.17  on 4865  degrees of freedom
AIC: 640.17

Number of Fisher Scoring iterations: 20

PART IV: Evaluate the Performance of the Models

  1. Use each of the models to predict whether the mushroom samples in the test dataset are edible or not
  2. Create a confusion matrix of the predictions against the actual classes, and display the accuracy; a sketch of these steps follows the list
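
A sketch of these two steps (tree_model and logit_model are placeholder names for the fitted models; an rpart- or C5.0-style predict interface is assumed for the tree):

# Decision tree: predicted classes and class probabilities on the test set
tree_pred <- predict(tree_model, newdata = data_test, type = "class")
head(predict(tree_model, newdata = data_test, type = "prob"))

# Confusion matrix: actual classes in rows, predictions (tree_pred) in columns
table(data_test$edible, tree_pred)

# Overall accuracy of the tree
mean(tree_pred == data_test$edible)

# Logistic regression: predict() returns P(edible == "No"), since glm treats
# the first factor level ("Yes") as the reference, so probabilities above 0.5
# map to "No"
logit_prob <- predict(logit_model, newdata = data_test, type = "response")
logit_pred <- factor(ifelse(logit_prob > 0.5, "No", "Yes"), levels = c("Yes", "No"))
mean(logit_pred == data_test$edible)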

Decision Tree

  1   2   3   4   5   6 
 No Yes Yes Yes Yes Yes 
Levels: Yes No
        Yes          No
1 0.0000000 1.000000000
2 0.9979936 0.002006421
3 0.9979936 0.002006421
4 0.9979936 0.002006421
5 0.9979936 0.002006421
6 0.9979936 0.002006421
     tree_pred
       Yes   No
  Yes 1681    0
  No     3 1566
[1] 0.9990769

After evaluating the results of the decision tree model, the accuracy on the test set is 99.9%. The confusion matrix shows only 3 misclassifications, all in the "No" row: 3 mushrooms that were actually poisonous were predicted as edible. Taking "poisonous" as the condition being tested for, these are type-II errors (false negatives), which are the serious kind in this application: predicting a poisonous mushroom as edible could put someone's life in danger, whereas the reverse error would only mean discarding an edible mushroom.

Elena Wenpei Huang

Due 12/2/2019