Task

Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used.

Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results.

Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees, how can you change this perception when using the decision tree you created to solve a real problem?

Libraries

library(ISLR)        # example data sets (loaded but not used directly below)
library(tree)        # tree models (loaded but not used directly below)
library(rpart)       # recursive partitioning trees
library(party)       # conditional inference trees (loaded but not used directly below)
library(tidyverse)   # data manipulation and piping
library(rpart.plot)  # plotting rpart trees
library(ROCR)        # ROC curves and performance measures

# pscl and caret are called later with :: and only need to be installed, not attached

Introducing the Data

bank.churn <- read.csv("https://raw.githubusercontent.com/jconno/R-data/master/Churn_Modelling.csv")
head(bank.churn, 3)
##   RowNumber CustomerId  Surname CreditScore Geography Gender Age Tenure
## 1         1   15634602 Hargrave         619    France Female  42      2
## 2         2   15647311     Hill         608     Spain Female  41      1
## 3         3   15619304     Onio         502    France Female  42      8
##     Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
## 1      0.00             1         1              1        101348.9      1
## 2  83807.86             1         0              1        112542.6      0
## 3 159660.80             3         1              0        113931.6      1
# Checking for NA values within the columns 

colSums(is.na(bank.churn))
##       RowNumber      CustomerId         Surname     CreditScore       Geography 
##               0               0               0               0               0 
##          Gender             Age          Tenure         Balance   NumOfProducts 
##               0               0               0               0               0 
##       HasCrCard  IsActiveMember EstimatedSalary          Exited 
##               0               0               0               0
bank.churn.exit <- bank.churn %>% select(Exited, Age, Tenure, Balance, NumOfProducts, EstimatedSalary, CreditScore, HasCrCard)

bank.churn.active <- bank.churn %>% select(IsActiveMember, Balance, EstimatedSalary, HasCrCard, CreditScore)
data <- sort(sample(nrow(bank.churn.exit), nrow(bank.churn.exit) * 0.75))

exit.train <- bank.churn.exit[data,]
exit.test <- bank.churn.exit[-data,]



data2 <- sort(sample(nrow(bank.churn.active), nrow(bank.churn.active) * 0.75))

active.train <- bank.churn.active[data2,]
active.test <- bank.churn.active[-data2,]
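
Note that the two splits above are drawn without a fixed seed, so the exact rows landing in each partition will differ between runs, and the numbers reported below will shift slightly with them. A minimal sketch of a reproducible version of the same 75/25 split is shown here (the seed value is an arbitrary choice, and the results in this document were generated without one):

# Illustrative only: fixing a seed before sampling makes the partition repeatable
set.seed(2023)
idx <- sort(sample(nrow(bank.churn.exit), nrow(bank.churn.exit) * 0.75))
exit.train.repro <- bank.churn.exit[idx, ]
exit.test.repro  <- bank.churn.exit[-idx, ]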

Modelling

logit.exit1 <- glm(Exited ~., data = exit.train, family = "binomial")

lm.active <- lm(IsActiveMember ~., data = active.train)

Regression Models

tree.fit1 <- logit.exit1  # the full logistic model already fit above; no need to refit

summary(tree.fit1)
## 
## Call:
## glm(formula = Exited ~ ., family = "binomial", data = exit.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1738  -0.6574  -0.5184  -0.3689   2.5940  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -4.046e+00  2.733e-01 -14.805   <2e-16 ***
## Age              6.803e-02  2.806e-03  24.244   <2e-16 ***
## Tenure          -7.261e-03  1.053e-02  -0.690   0.4903    
## Balance          4.515e-06  5.228e-07   8.636   <2e-16 ***
## NumOfProducts   -7.079e-02  5.306e-02  -1.334   0.1821    
## EstimatedSalary  3.714e-07  5.319e-07   0.698   0.4851    
## CreditScore     -6.346e-04  3.146e-04  -2.017   0.0437 *  
## HasCrCard        6.332e-02  6.668e-02   0.950   0.3423    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7552.9  on 7499  degrees of freedom
## Residual deviance: 6811.8  on 7492  degrees of freedom
## AIC: 6827.8
## 
## Number of Fisher Scoring iterations: 4
tree.fit.lm <- lm.active

summary(tree.fit.lm)
## 
## Call:
## lm(formula = IsActiveMember ~ ., data = active.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5685 -0.5104  0.4489  0.4880  0.5468 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.560e-01  4.248e-02  10.735   <2e-16 ***
## Balance         -1.314e-07  9.226e-08  -1.425   0.1543    
## EstimatedSalary -2.214e-08  1.003e-07  -0.221   0.8252    
## HasCrCard       -2.483e-02  1.264e-02  -1.964   0.0496 *  
## CreditScore      1.325e-04  6.023e-05   2.200   0.0279 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4997 on 7495 degrees of freedom
## Multiple R-squared:  0.001424,   Adjusted R-squared:  0.0008908 
## F-statistic: 2.672 on 4 and 7495 DF,  p-value: 0.03041
# Drop Age, Tenure, Balance

tree.fit2 <- glm(Exited ~ NumOfProducts + EstimatedSalary + CreditScore + HasCrCard, data = exit.train, family = "binomial")

summary(tree.fit2)
## 
## Call:
## glm(formula = Exited ~ NumOfProducts + EstimatedSalary + CreditScore + 
##     HasCrCard, family = "binomial", data = exit.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7605  -0.6981  -0.6469  -0.6168   2.0985  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -6.956e-01  2.198e-01  -3.165  0.00155 ** 
## NumOfProducts   -2.278e-01  5.126e-02  -4.444 8.85e-06 ***
## EstimatedSalary  3.512e-07  5.019e-07   0.700  0.48418    
## CreditScore     -5.821e-04  2.981e-04  -1.953  0.05085 .  
## HasCrCard        1.494e-02  6.291e-02   0.237  0.81235    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7552.9  on 7499  degrees of freedom
## Residual deviance: 7528.2  on 7495  degrees of freedom
## AIC: 7538.2
## 
## Number of Fisher Scoring iterations: 4

McFadden’s \(R^2\)

r2 <- pscl::pR2(tree.fit2)
## fitting null model for pseudo-r2
r2.lm <- pscl::pR2(tree.fit.lm)
## fitting null model for pseudo-r2
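
The pR2() calls above store the pseudo-R² statistics but do not print them. pR2() returns a named vector, so the McFadden component can be pulled out directly for a side-by-side look (a sketch; the values depend on the random split, so none are quoted here). McFadden's measure is designed for likelihood-based fits such as the logistic model; for the linear model, the adjusted R² from summary() above is the more natural summary.

# Extract the McFadden pseudo-R-squared from each pR2() result
r2["McFadden"]       # logistic model (tree.fit2)
r2.lm["McFadden"]    # linear model (tree.fit.lm)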

Variable Rank

caret::varImp(tree.fit2)
##                   Overall
## NumOfProducts   4.4435193
## EstimatedSalary 0.6995956
## CreditScore     1.9527102
## HasCrCard       0.2374005
caret::varImp(tree.fit.lm)
##                   Overall
## Balance         1.4245594
## EstimatedSalary 0.2208482
## HasCrCard       1.9640563
## CreditScore     2.1998112

Predictions

set.seed(123)

# Hypothetical customer profile; Balance and Exited are not predictors in tree.fit2, so predict() ignores them
new <- data.frame(NumOfProducts = 2, EstimatedSalary = 80000, CreditScore = 688, Balance = 8000, HasCrCard = 1, Exited = c(1,0))

predict(tree.fit2, new, type = "response")
##        1        2 
## 0.181141 0.181141
# Hypothetical profile for the linear model; IsActiveMember is the response and is ignored by predict()
new.lm <- data.frame(Balance = 7777, EstimatedSalary = 90000, HasCrCard = 0, CreditScore = 500, IsActiveMember = c(1,0))

predict(lm.active, new.lm, type = "response")
##        1        2 
## 0.519214 0.519214
  • Both hypothetical rows receive an identical prediction from each model because the outcome columns (Exited, IsActiveMember) do not enter the fitted values. Despite these shortcomings in the output, the linear regression model still appears to perform better on this query; a more systematic check against the held-out test sets is sketched below.
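
One such check is to score the held-out test sets: convert the logistic probabilities to classes with a conventional 0.5 cutoff for accuracy, and compute the mean squared error for the linear model (a sketch; the 0.5 cutoff is an assumption rather than a tuned value):

# Test-set accuracy for the reduced logistic model
test.prob  <- predict(tree.fit2, exit.test, type = "response")
test.class <- ifelse(test.prob > 0.5, 1, 0)
mean(test.class == exit.test$Exited)

# Test-set mean squared error for the linear model
mean((predict(lm.active, active.test) - active.test$IsActiveMember)^2)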

Decision Tree

# Classification tree for Exited (method = "class")
train.tree <- rpart(Exited ~ NumOfProducts + EstimatedSalary + CreditScore + HasCrCard, data = exit.train, method = "class")

# Regression tree for IsActiveMember (no method supplied, so rpart defaults to "anova" for the numeric 0/1 response)
train.tree2 <- rpart(IsActiveMember ~ Balance + EstimatedSalary + HasCrCard + CreditScore, data = active.train)

rpart.plot(train.tree, type = 1, extra = 1, box.palette = c("darkorchid1", "red"), branch.lty = 3, shadow.col = "black")

# Classification tree for the IsActiveMember outcome (the variable modelled with lm above)

head(bank.churn.active)
##   IsActiveMember   Balance EstimatedSalary HasCrCard CreditScore
## 1              1      0.00       101348.88         1         619
## 2              1  83807.86       112542.58         0         608
## 3              0 159660.80       113931.57         1         502
## 4              0      0.00        93826.63         0         699
## 5              1 125510.82        79084.10         1         850
## 6              0 113755.78       149756.71         1         645
lm.tree <- rpart(IsActiveMember ~ Balance + EstimatedSalary + HasCrCard + CreditScore, data = active.train, method = "class")

rpart.plot(lm.tree, type = 1, extra = 1, box.palette = c("green", "maroon"), branch.lty = 4, shadow.col = "black")

# Training-set predictions: probabilities from the regression models and from the trees
train.predict.logit <- predict(tree.fit1, exit.train, type = "response")
train.predict.tree <- predict(train.tree, exit.train, type = "prob")

train.predict.lm <- predict(tree.fit.lm, active.train, type = "response")
train.predict.tree.lm <- predict(lm.tree, active.train, type = "prob")

# ROCR prediction objects (the second column of the tree output is P(class = 1))
train.logit.prediction <- prediction(train.predict.logit, exit.train$Exited)
train.tree.prediction <- prediction(train.predict.tree[,2], exit.train$Exited)

train.lm.prediction <- prediction(train.predict.lm, active.train$IsActiveMember)
train.lm.tree.prediction <- prediction(train.predict.tree.lm[,2], active.train$IsActiveMember)

# ROC curves: true positive rate vs. false positive rate
train.logit.performance <- performance(train.logit.prediction, "tpr", "fpr")
train.tree.performance <- performance(train.tree.prediction, "tpr", "fpr")

train.lm.performance <- performance(train.lm.prediction, "tpr", "fpr")
train.lm.tree.performance <- performance(train.lm.tree.prediction, "tpr", "fpr")
set.seed(123)
# Logistic Model
plot(train.logit.performance, col = "purple", lty =3, lwd = 3)

# Tree Predictions
plot(train.tree.performance, col = "green", lty = 3, lwd = 3, add = TRUE)

set.seed(123)

# Linear Model
plot(train.lm.performance, col = "darkslateblue", lty =3, lwd = 3)

# Tree Predictions
plot(train.lm.tree.performance, col = "coral1", lty = 3, lwd = 3, add = TRUE)
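
The ROC curves above give a visual comparison, but the area under each curve can also be computed from the same ROCR prediction objects to put a single number on the gap between the regression models and the trees (a sketch; the values depend on the random split, so none are quoted here):

# Training-set AUC for each model via ROCR
performance(train.logit.prediction, measure = "auc")@y.values[[1]]    # logistic model
performance(train.tree.prediction, measure = "auc")@y.values[[1]]     # classification tree (Exited)
performance(train.lm.prediction, measure = "auc")@y.values[[1]]       # linear model
performance(train.lm.tree.prediction, measure = "auc")@y.values[[1]]  # classification tree (IsActiveMember)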

Result Discussion

The article posted atop this document clearly outlines the benefits and pitfalls of using decision trees. The benefits include efficiency, low computational cost, and overall simplicity, which makes decision trees quite appealing and attractive to use, especially in today's business world, where a quick decision is usually better than a late, well-thought-out one.

For this particular problem, I used a churn dataset from Kaggle.com that describes a bank's customers over their time with the bank. The goal is to predict whether a customer will leave the bank based on a variety of variables. I chose a churn problem because of the qualitative, binary nature of a customer exiting: exited or did not exit, which translates to 1 and 0, respectively.

Thankfully, building the two regression models and their companion decision trees did not take much time to process, so this was quick to assemble. I must admit, though, that the performance does not seem to be the best. As the ROC charts above show, the predictions of the decision tree and of the logistic model are not close together. A possible solution would be to use a random forest instead of a single decision tree. While this is more computationally taxing, it generally renders more accurate results and better overall model performance. For the next analysis, it would be worth attempting a random forest; a brief sketch of how that could be set up follows.
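
As a sketch of that next step, a regression random forest for the IsActiveMember outcome could look like the following. This uses the randomForest package, which is not loaded above, and the ntree value is an illustrative default rather than a tuned setting:

library(randomForest)

set.seed(123)
# Regression forest on the numeric 0/1 response, with variable importance recorded
rf.active <- randomForest(IsActiveMember ~ Balance + EstimatedSalary + HasCrCard + CreditScore,
                          data = active.train, ntree = 500, importance = TRUE)

rf.pred <- predict(rf.active, active.test)        # predicted probability of being an active member
mean((rf.pred - active.test$IsActiveMember)^2)    # test-set mean squared error
importance(rf.active)                             # compare with caret::varImp() above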