Machine Learning - Logistic Regression & Decision Trees

PART I: Collect the Data
- 1. Import the dataset from the following location
- 2. Replace the type feature with a new boolean feature called edible
PART II: Explore and Prepare the Data
- 1. Reduce dimensionality of dataset and keep features with high information value
- 2. Using a stratified sampling approach, split data into training (60%) and test (40%) datasets. Display the class distribution for original dataset as well as training and test datasets.
PART III: Train the Models
- Model 1: Logistic Regression
- Model 2: Decision Trees
PART IV: Evaluate the Performance of the Models
- Logistic Regression
- Decision Tree

library(tidyverse)
library(rpart)
library(rpart.plot)
library(DMwR)
library(readr)

PART I: Collect the Data

1. Import the dataset from the following location

data <- read_csv("https://s3.amazonaws.com/notredame.analytics.data/mushrooms.csv")

2. Replace the type feature with a new boolean feature called edible

data <- data %>%
  mutate_all(as.factor) %>% 
  mutate(type = recode(type, "poisonous" = "No")) %>% 
  mutate(type = recode(type, "edible" = "Yes"))

colnames(data)[1] <- "edible"

PART II: Explore and Prepare the Data

1. Reduce dimensionality of dataset and keep features with high information value

Using graphical method to identify which features are necessary to keep.

data %>%
  keep(is.factor) %>%
  gather() %>%
  group_by(key,value) %>% 
  summarise(n = n()) %>% 
  ggplot() +
  geom_bar(mapping=aes(x = value, y = n, fill=key), color="black", stat='identity', alpha = 0.7) + 
  coord_flip() +
  facet_wrap(~ key, scales = "free") +
  theme_minimal()

Using statistical method to check questionable variables.

round(prop.table(table(data$veil_type)),4)*100


partial 
    100

round(prop.table(table(data$veil_color)),4)*100


 brown orange  white yellow 
  1.18   1.18  97.54   0.10

round(prop.table(table(data$ring_number)),4)*100


 none   one   two 
 0.44 92.17  7.39

round(prop.table(table(data$ring_type)),4)*100


evanescent    flaring      large       none    pendant 
     34.17       0.59      15.95       0.44      48.84

round(prop.table(table(data$gill_attachment)),4)*100


attached     free 
    2.58    97.42

Both the statistical and graphical results indicating that the below 4 features are not very informative. We remove these variables because they all have primarily one value, so there is practically no variation. Thus, they do not serve as good predictors.

data <- select(data, -veil_type, -gill_attachment, -veil_color, -ring_number)

2. Using a stratified sampling approach, split data into training (60%) and test (40%) datasets. Display the class distribution for original dataset as well as training and test datasets.

set.seed(1234)
sample_set <- sample(nrow(data), round(nrow(data)*.6), replace = FALSE)
data_train <- data[sample_set, ]
data_test <- data[-sample_set, ]

round(prop.table(table(select(data, edible), exclude = NULL)), 4) * 100


 Yes   No 
51.8 48.2

round(prop.table(table(select(data_train, edible), exclude = NULL)), 4) * 100


  Yes    No 
51.85 48.15

round(prop.table(table(select(data_test, edible), exclude = NULL)), 4) * 100


  Yes    No 
51.72 48.28

PART III: Train the Models

The two models to utilize for this entirely nominal dataset are a decision tree and a logistic regression. KNN models do not work well with nominal data as it is difficult to calculate the Euclidean distance necessary for the model. On the other hand, decision trees and logistic regressions handle nominal data well.

Model 1: Logistic Regression

logit_mod <-
  glm(edible ~ odor, family = binomial(link = 'logit'), data = data_train)

summary(logit_mod)


Call:
glm(formula = edible ~ odor, family = binomial(link = "logit"), 
    data = data_train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.26076  -0.26076  -0.00003   0.00003   2.60706  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.157e+01  1.838e+03  -0.012    0.991
odoranise     5.173e-12  2.692e+03   0.000    1.000
odorcreosote  4.313e+01  3.231e+03   0.013    0.989
odorfishy     4.313e+01  2.387e+03   0.018    0.986
odorfoul      4.313e+01  2.010e+03   0.021    0.983
odormusty     4.313e+01  6.498e+03   0.007    0.995
odornone      1.820e+01  1.838e+03   0.010    0.992
odorpungent   4.313e+01  3.000e+03   0.014    0.989
odorspicy     4.313e+01  2.453e+03   0.018    0.986

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6750.15  on 4873  degrees of freedom
Residual deviance:  622.17  on 4865  degrees of freedom
AIC: 640.17

Number of Fisher Scoring iterations: 20

Model 2: Decision Trees

tree_mod <-
  rpart(
    edible ~ ., 
    method = "class",
    data = data_train,
    control = rpart.control(cp = 0.001)
  )

rpart.plot(tree_mod)

PART IV: Evaluate the Performance of the Models

Using each of the models to predict whether the mushroom samples in your test dataset are edible or not
Create a confusion matrix of your predictions against actuals, and display the accuracy

Logistic Regression

logit_pred <- predict(logit_mod, data_test, type = 'response')
head(logit_pred)

           1            2            3            4            5 
1.000000e+00 4.305023e-10 4.305023e-10 4.305023e-10 4.305023e-10 
           6 
4.305023e-10

# determining optimal cutoff value

library(InformationValue)
ideal_cutoff <-
  optimalCutoff(
    actuals = data_test$edible,
    predictedScores = logit_pred,
    optimiseFor = "Both"
  )

# What is the decision boundary?
ideal_cutoff

[1] NA

logit_pred <- ifelse(logit_pred > 0.5, 1, 0)
head(logit_pred)

1 2 3 4 5 6 
1 0 0 0 0 0

logit_pred_table <- table(data_test$edible, logit_pred)
logit_pred_table

     logit_pred
         0    1
  Yes 1681    0
  No    49 1520

logistic_pred_accuracy <- sum(diag(logit_pred_table)) / nrow(data_test)
logistic_pred_accuracy

[1] 0.9849231

Decision Tree

tree_pred <- predict(tree_mod, data_test,  type = "class")
head(tree_pred)

  1   2   3   4   5   6 
 No Yes Yes Yes Yes Yes 
Levels: Yes No

tree_pred_prob <- predict(tree_mod, data_test)
head(tree_pred_prob)

        Yes          No
1 0.0000000 1.000000000
2 0.9979936 0.002006421
3 0.9979936 0.002006421
4 0.9979936 0.002006421
5 0.9979936 0.002006421
6 0.9979936 0.002006421

tree_pred_table <- table(data_test$edible, tree_pred)
tree_pred_table

     tree_pred
       Yes   No
  Yes 1681    0
  No     3 1566

tree_pred_accuracy <- sum(diag(tree_pred_table)) / nrow(data_test)
tree_pred_accuracy

[1] 0.9990769

After evaluating the result of the decision tree model, it is good to see that the accuracy is 99.9%. There were only 3 type-I errors, which means 3 mushrooms that were not poisonous but the model predicted them as poisonous. This is much better than type-II errors in this case because otherwise the result might put someone’s life in danger by predicting poinsonous mushrooms as edible ones.