Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used.
Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results.
Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees, how can you change this perception when using the decision tree you created to solve a real problem?
library(ISLR)
library(tree)
library(rpart)
library(party)
library(tidyverse)
library(rpart.plot)
library(ROCR)
bank.churn <- read.csv("https://raw.githubusercontent.com/jconno/R-data/master/Churn_Modelling.csv")
head(bank.churn, 3)
## RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure
## 1 1 15634602 Hargrave 619 France Female 42 2
## 2 2 15647311 Hill 608 Spain Female 41 1
## 3 3 15619304 Onio 502 France Female 42 8
## Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
## 1 0.00 1 1 1 101348.9 1
## 2 83807.86 1 0 1 112542.6 0
## 3 159660.80 3 1 0 113931.6 1
# Checking for NA values within the columns
colSums(is.na(bank.churn))
## RowNumber CustomerId Surname CreditScore Geography
## 0 0 0 0 0
## Gender Age Tenure Balance NumOfProducts
## 0 0 0 0 0
## HasCrCard IsActiveMember EstimatedSalary Exited
## 0 0 0 0
bank.churn.exit <- bank.churn %>% select(Exited, Age, Tenure, Balance, NumOfProducts, EstimatedSalary, CreditScore, HasCrCard)
bank.churn.active <- bank.churn %>% select(IsActiveMember, Balance, EstimatedSalary, HasCrCard, CreditScore)
# 75/25 train/test split for each modeling data set
data <- sort(sample(nrow(bank.churn.exit), nrow(bank.churn.exit) * 0.75))
exit.train <- bank.churn.exit[data,]
exit.test <- bank.churn.exit[-data,]
data2 <- sort(sample(nrow(bank.churn.active), nrow(bank.churn.active) * 0.75))
active.train <- bank.churn.active[data2,]
active.test <- bank.churn.active[-data2,]
# Full logistic model for churn and a linear model for membership activity
logit.exit1 <- glm(Exited ~ ., data = exit.train, family = "binomial")
lm.active <- lm(IsActiveMember ~ ., data = active.train)
# Alias the logistic fit for the tree comparison below (same model as logit.exit1)
tree.fit1 <- logit.exit1
summary(tree.fit1)
##
## Call:
## glm(formula = Exited ~ ., family = "binomial", data = exit.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1738 -0.6574 -0.5184 -0.3689 2.5940
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.046e+00 2.733e-01 -14.805 <2e-16 ***
## Age 6.803e-02 2.806e-03 24.244 <2e-16 ***
## Tenure -7.261e-03 1.053e-02 -0.690 0.4903
## Balance 4.515e-06 5.228e-07 8.636 <2e-16 ***
## NumOfProducts -7.079e-02 5.306e-02 -1.334 0.1821
## EstimatedSalary 3.714e-07 5.319e-07 0.698 0.4851
## CreditScore -6.346e-04 3.146e-04 -2.017 0.0437 *
## HasCrCard 6.332e-02 6.668e-02 0.950 0.3423
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7552.9 on 7499 degrees of freedom
## Residual deviance: 6811.8 on 7492 degrees of freedom
## AIC: 6827.8
##
## Number of Fisher Scoring iterations: 4
tree.fit.lm <- lm.active  # alias the IsActiveMember linear model for later comparison
summary(tree.fit.lm)
##
## Call:
## lm(formula = IsActiveMember ~ ., data = active.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5685 -0.5104 0.4489 0.4880 0.5468
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.560e-01 4.248e-02 10.735 <2e-16 ***
## Balance -1.314e-07 9.226e-08 -1.425 0.1543
## EstimatedSalary -2.214e-08 1.003e-07 -0.221 0.8252
## HasCrCard -2.483e-02 1.264e-02 -1.964 0.0496 *
## CreditScore 1.325e-04 6.023e-05 2.200 0.0279 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4997 on 7495 degrees of freedom
## Multiple R-squared: 0.001424, Adjusted R-squared: 0.0008908
## F-statistic: 2.672 on 4 and 7495 DF, p-value: 0.03041
# Switch variables: drop Age, Tenure, and Balance to fit a second churn model
tree.fit2 <- glm(Exited ~ NumOfProducts + EstimatedSalary + CreditScore + HasCrCard, data = exit.train, family = "binomial")
summary(tree.fit2)
##
## Call:
## glm(formula = Exited ~ NumOfProducts + EstimatedSalary + CreditScore +
## HasCrCard, family = "binomial", data = exit.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7605 -0.6981 -0.6469 -0.6168 2.0985
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.956e-01 2.198e-01 -3.165 0.00155 **
## NumOfProducts -2.278e-01 5.126e-02 -4.444 8.85e-06 ***
## EstimatedSalary 3.512e-07 5.019e-07 0.700 0.48418
## CreditScore -5.821e-04 2.981e-04 -1.953 0.05085 .
## HasCrCard 1.494e-02 6.291e-02 0.237 0.81235
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7552.9 on 7499 degrees of freedom
## Residual deviance: 7528.2 on 7495 degrees of freedom
## AIC: 7538.2
##
## Number of Fisher Scoring iterations: 4
r2 <- pscl::pR2(tree.fit2)
## fitting null model for pseudo-r2
r2.lm <- pscl::pR2(tree.fit.lm)
## fitting null model for pseudo-r2
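The pseudo-R-squared objects are stored above but never displayed. If one wanted to report them, the McFadden values could be pulled out as below (a sketch, assuming pR2's usual named-vector return; output not shown here):
r2[["McFadden"]]     # reduced churn (logistic) model
r2.lm[["McFadden"]]  # IsActiveMember linear model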
caret::varImp(tree.fit2)
## Overall
## NumOfProducts 4.4435193
## EstimatedSalary 0.6995956
## CreditScore 1.9527102
## HasCrCard 0.2374005
caret::varImp(tree.fit.lm)
## Overall
## Balance 1.4245594
## EstimatedSalary 0.2208482
## HasCrCard 1.9640563
## CreditScore 2.1998112
# Hypothetical new customer: the two rows differ only in Exited, which is not a
# predictor in tree.fit2, so both rows get the same predicted probability
new <- data.frame(NumOfProducts = 2, EstimatedSalary = 80000, CreditScore = 688, Balance = 8000, HasCrCard = 1, Exited = c(1,0))
predict(tree.fit2, new, type = "response")
## 1 2
## 0.181141 0.181141
# Same idea for the linear model: IsActiveMember differs across rows but is not a predictor
new.lm <- data.frame(Balance = 7777, EstimatedSalary = 90000, HasCrCard = 0, CreditScore = 500, IsActiveMember = c(1,0))
predict(lm.active, new.lm, type = "response")
## 1 2
## 0.519214 0.519214
# Decision tree counterpart of the logistic (Exited) regression model
# head(active.train)
train.tree <- rpart(Exited ~ NumOfProducts + EstimatedSalary + CreditScore + HasCrCard, data = exit.train, method = "class")
# Regression tree for IsActiveMember (fit for comparison; not plotted below)
train.tree2 <- rpart(IsActiveMember ~ Balance + EstimatedSalary + HasCrCard + CreditScore, data = active.train)
rpart.plot(train.tree, type = 1, extra = 1, box.palette = c("darkorchid1", "red"), branch.lty = 3, shadow.col = "black")
# Decision tree counterpart of the linear (IsActiveMember) model
head(bank.churn.active)
## IsActiveMember Balance EstimatedSalary HasCrCard CreditScore
## 1 1 0.00 101348.88 1 619
## 2 1 83807.86 112542.58 0 608
## 3 0 159660.80 113931.57 1 502
## 4 0 0.00 93826.63 0 699
## 5 1 125510.82 79084.10 1 850
## 6 0 113755.78 149756.71 1 645
lm.tree <- rpart(IsActiveMember ~ Balance + EstimatedSalary + HasCrCard + CreditScore, data = active.train, method = "class")
rpart.plot(lm.tree, type = 1, extra = 1, box.palette = c("green", "maroon"), branch.lty = 4, shadow.col = "black")
# Training-set predicted probabilities for each model
train.predict.logit <- predict(tree.fit1, exit.train, type = "response")
train.predict.tree <- predict(train.tree, exit.train, type = "prob")
train.predict.lm <- predict(tree.fit.lm, active.train, type = "response")
train.predict.tree.lm <- predict(lm.tree, active.train, type = "prob")
# ROCR prediction objects (predicted probability vs. observed label)
train.logit.prediction <- prediction(train.predict.logit, exit.train$Exited)
train.tree.prediction <- prediction(train.predict.tree[,2], exit.train$Exited)
train.lm.prediction <- prediction(train.predict.lm, active.train$IsActiveMember)
train.lm.tree.prediction <- prediction(train.predict.tree.lm[,2], active.train$IsActiveMember)
# ROC performance objects: true positive rate vs. false positive rate
train.logit.performance <- performance(train.logit.prediction, "tpr", "fpr")
train.tree.performance <- performance(train.tree.prediction, "tpr", "fpr")
train.lm.performance <- performance(train.lm.prediction, "tpr", "fpr")
train.lm.tree.performance <- performance(train.lm.tree.prediction, "tpr", "fpr")
# ROC curves for the Exited models
# Logistic model
plot(train.logit.performance, col = "purple", lty =3, lwd = 3)
# Tree Predictions
plot(train.tree.performance, col = "green", lty = 3, lwd = 3, add = TRUE)
# ROC curves for the IsActiveMember models
# Linear model
plot(train.lm.performance, col = "darkslateblue", lty =3, lwd = 3)
# Tree Predictions
plot(train.lm.tree.performance, col = "coral1", lty = 3, lwd = 3, add = TRUE)
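Since the curves above are compared only visually, one way to put a number on the gap is AUC, and the held-out test sets created earlier (but not otherwise used) could provide an out-of-sample check. A minimal sketch using the ROCR objects above; the exact values depend on the random split and are not shown here:
# AUC on the training data for the Exited models
auc.logit <- performance(train.logit.prediction, "auc")@y.values[[1]]
auc.tree <- performance(train.tree.prediction, "auc")@y.values[[1]]
c(logistic = auc.logit, tree = auc.tree)
# Same comparison on the held-out test set
test.logit.pred <- prediction(predict(tree.fit1, exit.test, type = "response"), exit.test$Exited)
test.tree.pred <- prediction(predict(train.tree, exit.test, type = "prob")[, 2], exit.test$Exited)
c(logistic = performance(test.logit.pred, "auc")@y.values[[1]],
  tree = performance(test.tree.pred, "auc")@y.values[[1]])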
The article posted at the top of this document clearly outlines the benefits and pitfalls of using decision trees. The benefits include efficiency, low computational cost, and overall simplicity, which makes decision trees appealing and attractive to use, especially in today’s business world, where a quick decision is often better than a late, well-thought-out one.
For this particular problem, I used a churn dataset from Kaggle.com that follows a bank’s customers over their lifetime with the bank. The goal is to predict whether a customer will leave the bank based on a variety of variables. I chose a churn model because of the qualitative, binary nature of a customer exiting: exited or did not exit, which translates to 1 and 0, respectively.
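As a quick sanity check on that binary framing, the class balance of Exited could be inspected, for example (output not shown above; the counts depend on the full data set):
table(bank.churn$Exited)
prop.table(table(bank.churn$Exited))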
Thankfully, fitting the logistic and linear models together with their decision tree counterparts did not take much time to process, so the analysis was quick to assemble. That said, the performance does not seem to be the best: as the ROC charts above show, the decision tree’s predictions and the logistic model’s predictions are not close together. A possible solution would be to use a random forest instead of a single decision tree. While this is more computationally taxing, it generally renders a more accurate result and better overall model performance. For the next analysis, I believe it would be worth attempting a random forest.
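A minimal sketch of what that follow-up could look like, assuming the randomForest package is installed (not run here, so no results are reported): a classification forest for Exited mirroring tree.fit2, and a random forest for regression on IsActiveMember mirroring the linear model.
library(randomForest)

set.seed(123)
# Classification forest for churn; Exited is converted to a factor so randomForest
# performs classification rather than regression
rf.exit <- randomForest(as.factor(Exited) ~ NumOfProducts + EstimatedSalary + CreditScore + HasCrCard,
                        data = exit.train, ntree = 500, importance = TRUE)
rf.exit              # OOB error estimate and confusion matrix
varImpPlot(rf.exit)  # variable importance

# Random forest for regression on IsActiveMember
rf.active <- randomForest(IsActiveMember ~ Balance + EstimatedSalary + HasCrCard + CreditScore,
                          data = active.train, ntree = 500)
rf.active            # percent variance explained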