Machine Learning - Logistic Regression & Decision Trees
PART I: Collect the Data
1. Import the dataset from the following location
2. Replace the type feature with a new boolean feature called edible
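A minimal sketch of step 2, assuming the original type column is coded "edible"/"poisonous" (the actual coding may differ):
library(dplyr)
# Hypothetical recode: turn 'type' into a Yes/No factor named 'edible',
# with "Yes" as the first level to match the output later in this report.
data <- data %>%
  mutate(edible = factor(ifelse(type == "edible", "Yes", "No"),
                         levels = c("Yes", "No"))) %>%
  select(-type)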
PART II: Explore and Prepare the Data
1. Reduce the dimensionality of the dataset and keep the features with high information value
Use a graphical method to identify which features are worth keeping.
data %>%
  keep(is.factor) %>%
  gather() %>%
  group_by(key, value) %>%
  summarise(n = n()) %>%
  ggplot() +
  geom_bar(mapping = aes(x = value, y = n, fill = key), color = "black", stat = 'identity', alpha = 0.7) +
  coord_flip() +
  facet_wrap(~ key, scales = "free") +
  theme_minimal()
Use a statistical method to check the questionable variables.
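The output below comes from checking the questionable variables one at a time; a sketch of such a check, with the column names assumed (they may be spelled differently in the actual data frame):
# Percentage breakdown of each questionable factor (column names assumed)
questionable <- c("veil_type", "veil_color", "ring_number",
                  "ring_type", "gill_attachment")
for (col in questionable) {
  cat("\n", col, "\n")
  print(round(prop.table(table(data[[col]])), 4) * 100)
}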
veil type:
 partial
     100
veil color:
 brown orange  white yellow
  1.18   1.18  97.54   0.10
ring number:
 none   one   two
 0.44 92.17  7.39
ring type:
evanescent    flaring      large       none    pendant
     34.17       0.59      15.95       0.44      48.84
gill attachment:
attached     free
    2.58    97.42
Both the statistical and graphical results indicate that four of these features (veil type, veil color, ring number, and gill attachment) are not very informative: each is dominated by a single value, so there is practically no variation and they cannot serve as good predictors. We remove these four variables; ring type shows enough variation to keep.
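A general, programmatic way to find such near-constant features is caret's nearZeroVar(); this is a sketch rather than the code used above, and the frequency-ratio threshold is a judgment call:
library(caret)
library(dplyr)
# Flag factors whose most common value dominates (ratio of the two most
# frequent values above 90/10) and drop them from the data frame.
low_info <- nearZeroVar(data, freqCut = 90/10, names = TRUE)
data <- data %>% select(-all_of(low_info))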
2. Using a stratified sampling approach, split the data into training (60%) and test (40%) datasets. Display the class distribution for the original dataset as well as for the training and test datasets.
set.seed(1234)
# sample 60% of the row indices without replacement (seed fixed for reproducibility)
sample_set <- sample(nrow(data), round(nrow(data) * .6), replace = FALSE)
data_train <- data[sample_set, ]
data_test <- data[-sample_set, ]
round(prop.table(table(select(data, edible), exclude = NULL)), 4) * 100
  Yes    No
 51.8  48.2
round(prop.table(table(select(data_train, edible), exclude = NULL)), 4) * 100
  Yes    No
51.85 48.15
round(prop.table(table(select(data_test, edible), exclude = NULL)), 4) * 100
  Yes    No
51.72 48.28
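The distributions above show that the simple random sample drawn with sample() happens to preserve the class balance closely. For a strictly stratified 60/40 split, caret's createDataPartition() samples within each class; a sketch of that alternative (not the code used in this report):
library(caret)
set.seed(1234)
# createDataPartition() keeps the Yes/No proportions of 'edible' in both splits
train_idx <- createDataPartition(data$edible, p = 0.6, list = FALSE)
data_train <- data[train_idx, ]
data_test  <- data[-train_idx, ]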
PART III: Train the Models
The two models used for this entirely nominal dataset are a decision tree and a logistic regression. A kNN model would not work well here because the Euclidean distance it relies on is not meaningful for unordered categorical features, whereas decision trees and logistic regression handle nominal data well.
Model 1: Logistic Regression
# logistic regression with odor as the sole predictor of edibility
logit_mod <-
  glm(edible ~ odor, family = binomial(link = 'logit'), data = data_train)
summary(logit_mod)
Call:
glm(formula = edible ~ odor, family = binomial(link = "logit"),
data = data_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.26076 -0.26076 -0.00003 0.00003 2.60706
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.157e+01 1.838e+03 -0.012 0.991
odoranise 5.173e-12 2.692e+03 0.000 1.000
odorcreosote 4.313e+01 3.231e+03 0.013 0.989
odorfishy 4.313e+01 2.387e+03 0.018 0.986
odorfoul 4.313e+01 2.010e+03 0.021 0.983
odormusty 4.313e+01 6.498e+03 0.007 0.995
odornone 1.820e+01 1.838e+03 0.010 0.992
odorpungent 4.313e+01 3.000e+03 0.014 0.989
odorspicy 4.313e+01 2.453e+03 0.018 0.986
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6750.15 on 4873 degrees of freedom
Residual deviance: 622.17 on 4865 degrees of freedom
AIC: 640.17
Number of Fisher Scoring iterations: 20
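The very large coefficients and standard errors, together with the 20 Fisher scoring iterations, are the signature of quasi-complete separation: for most odor levels, every training mushroom has the same class, so the maximum-likelihood estimates are pushed toward infinity. To read the fitted effects on the odds scale, one could exponentiate the coefficients (a quick sketch, not part of the original output):
# Baseline odds (intercept) and odds ratios for each odor level; with
# separated levels these collapse toward 0 or explode toward infinity.
exp(coef(logit_mod))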
Model 2: Decision Trees
# decision tree over all remaining features; cp = 0.001 grows a finer tree
# than the rpart default (cp = 0.01)
tree_mod <-
  rpart(
    edible ~ .,
    method = "class",
    data = data_train,
    control = rpart.control(cp = 0.001)
  )
rpart.plot(tree_mod)
PART IV: Evaluate the Performance of the Models
- Use each of the models to predict whether the mushroom samples in the test dataset are edible or not
- Create a confusion matrix of the predictions against the actuals, and display the accuracy
Logistic Regression
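The prediction call itself is not shown; a minimal sketch of the step that would produce the probabilities below, using the objects defined earlier:
# Predicted probability of the second factor level ("No", i.e. poisonous)
# for each test observation
logit_pred <- predict(logit_mod, newdata = data_test, type = "response")
head(logit_pred)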
           1            2            3            4            5            6
1.000000e+00 4.305023e-10 4.305023e-10 4.305023e-10 4.305023e-10 4.305023e-10
# determining optimal cutoff value
library(InformationValue)
ideal_cutoff <-
  optimalCutoff(
    actuals = data_test$edible,
    predictedScores = logit_pred,
    optimiseFor = "Both"
  )
# What is the decision boundary?
ideal_cutoff
[1] NA
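The thresholding step is also not shown. A sketch of how the probabilities would be turned into the 0/1 predictions below, using the cutoff (note that the NA above may indicate optimalCutoff() expects numeric 0/1 actuals rather than a Yes/No factor):
# 1 = predicted "No" (poisonous), 0 = predicted "Yes" (edible)
logit_pred <- ifelse(logit_pred >= ideal_cutoff, 1, 0)
head(logit_pred)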
1 2 3 4 5 6
1 0 0 0 0 0
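The confusion matrix and accuracy below compare the actual labels against these 0/1 predictions; a sketch of code that would produce them:
# Rows: actual edible label; columns: thresholded logistic prediction
logit_pred_table <- table(data_test$edible, logit_pred)
logit_pred_table
sum(diag(logit_pred_table)) / nrow(data_test)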
      logit_pred
          0    1
  Yes  1681    0
  No     49 1520
[1] 0.9849231
Decision Tree
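As with the logistic model, the prediction calls are not shown; a sketch that would yield the class predictions and class probabilities below:
# Predicted class for each test observation, then the class probabilities
tree_pred <- predict(tree_mod, newdata = data_test, type = "class")
head(tree_pred)
head(predict(tree_mod, newdata = data_test, type = "prob"))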
1 2 3 4 5 6
No Yes Yes Yes Yes Yes
Levels: Yes No
Yes No
1 0.0000000 1.000000000
2 0.9979936 0.002006421
3 0.9979936 0.002006421
4 0.9979936 0.002006421
5 0.9979936 0.002006421
6 0.9979936 0.002006421
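And a sketch of the confusion matrix and accuracy computation for the tree:
# Rows: actual edible label; columns: predicted class from the tree
tree_pred_table <- table(data_test$edible, tree_pred)
tree_pred_table
sum(diag(tree_pred_table)) / nrow(data_test)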
      tree_pred
        Yes   No
  Yes  1681    0
  No      3 1566
[1] 0.9990769
After evaluating the decision tree model, it is good to see that the accuracy is 99.9%. There were only 3 type I errors, that is, 3 mushrooms that were not poisonous but that the model predicted as poisonous. In this setting, such errors are far preferable to type II errors, since predicting a poisonous mushroom as edible could put someone's life in danger.