Classification Trees and Binary Logistic Regression

Q1 - Visual Exploration

  1. Import the data

2. Churn and Geography

Across all three countries, the number of customers who did not churn is considerably higher than the number who churned, which suggests that churn is relatively low in the dataset. The red bars (France) are the tallest in both the churn and non-churn categories, indicating that France has the highest number of customers in the dataset.

Churn and Balance

The thick horizontal line inside each box represents the median balance. Customers who churned have a slightly higher median balance than those who did not churn.

Churn and Gender

Men tend to churn at a lower rate than women. Many men are willing to stay with the company, while women leave at a much higher rate.

Churn and Tenure

Churn and Age

The chart shows that customers between 40 and 50 years old tend to have a higher churn rate than the younger market.

Churn and Credit Score

There is little correlation between credit score and churn rate.

Churn and Number of Products


Q2 Classification Trees

1. Classification tree models

3.

A

Customers over the age of 43 have an 80% chance of churning from the easy-saver product.

B

Customers over the age of 43 with fewer than 2.5 products have an 88% chance of not churning.

C

4.

D

Training dataset
      Predicted
Actual   No  Yes
   No  6102  264
   Yes  963  671

• Overall accuracy is (6102+671)/8000 = 0.85 or 85%.
• Of all customers the model predicted to churn, it got 671/935 = 0.72 or 72% correct.
• Of all customers the model predicted not to churn, it got 6102/7065 = 0.86 or 86% correct.
• Of all customers who did churn, the model correctly identified 671/1634 = 0.41 or 41%.
• Of all customers who did not churn, the model correctly identified 6102/6366 = 0.96 or 96%.
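The rates for the training data can be re-derived directly from the four cells of the training confusion matrix above. The following is a minimal Python sketch for checking the arithmetic only; it is not part of the original R workflow, and the function name `confusion_metrics` is a hypothetical helper. It assumes "Yes" (churn) is treated as the positive class.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Summary rates for a 2x2 confusion matrix, with "Yes" (churn) as the positive class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,                 # all correct predictions
        "precision": tp / (tp + fp),                   # predicted churners who did churn
        "negative_predictive_value": tn / (tn + fn),   # predicted non-churners who stayed
        "recall": tp / (tp + fn),                      # actual churners that were caught
        "specificity": tn / (tn + fp),                 # actual non-churners identified
    }


# Counts read from the training confusion matrix (rows = actual, columns = predicted).
train = confusion_metrics(tn=6102, fp=264, fn=963, tp=671)
for name, value in train.items():
    print(f"{name}: {value:.3f}")
```

The same helper can be called with the four counts from the testing matrix to check those figures as well.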

Testing Dataset

      Predicted
Actual   No  Yes
   No  1531   66
   Yes  225  178

• Overall accuracy is (1531+178)/2000 = 0.85 or 85%.
• Of all customers the model predicted to churn, it got 178/244 = 0.73 or 73% correct.
• Of all customers the model predicted not to churn, it got 1531/1756 = 0.87 or 87% correct.
• Of all customers who did churn, the model correctly identified 178/403 = 0.44 or 44%.
• Of all customers who did not churn, the model correctly identified 1531/1597 = 0.96 or 96%.

The classification tree does not appear to be overfitting. From the confusion matrices above, overall accuracy is 6773/8000 = 0.85 or 85% on the training data set and 1709/2000 = 0.85 or 85% on the test data set, so performance does not drop on unseen data.
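An informal overfitting check is to compare overall accuracy on the training and testing data: a model that has memorised the training set would score noticeably worse on the test set. The sketch below, in Python and outside the original R workflow, computes both accuracies from the confusion matrices above. The helper name `accuracy` and the `[[TN, FP], [FN, TP]]` layout are assumptions made for illustration.

```python
def accuracy(matrix):
    """Overall accuracy of a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    return (tn + tp) / (tn + fp + fn + tp)


# Counts read from the two classification tree confusion matrices.
train_acc = accuracy([[6102, 264], [963, 671]])
test_acc = accuracy([[1531, 66], [225, 178]])

# A large positive gap (training much better than testing) would hint at overfitting.
gap = train_acc - test_acc
print(f"train={train_acc:.4f} test={test_acc:.4f} gap={gap:+.4f}")
```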

Q3 Binary Logistic Regression

5.

[1] "Yes" "No" 
[1] "Yes" "No" 

Call:
glm(formula = churn ~ credit_score + geography + gender + age + 
    tenure + balance + num_products, family = binomial(link = "logit"), 
    data = bank_churn_train)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       3.337e+00  2.579e-01  12.942  < 2e-16 ***
credit_score      8.639e-04  3.070e-04   2.815  0.00488 ** 
geographyGermany -7.766e-01  7.397e-02 -10.500  < 2e-16 ***
geographySpain   -2.547e-02  7.761e-02  -0.328  0.74277    
genderMale        5.664e-01  5.965e-02   9.496  < 2e-16 ***
age              -6.426e-02  2.727e-03 -23.563  < 2e-16 ***
tenure            1.457e-02  1.023e-02   1.424  0.15441    
balance          -2.509e-06  5.658e-07  -4.434 9.24e-06 ***
num_products      1.127e-01  5.191e-02   2.171  0.02990 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8099.8  on 7999  degrees of freedom
Residual deviance: 7141.1  on 7991  degrees of freedom
AIC: 7159.1

Number of Fisher Scoring iterations: 4

      Predicted
Actual  Yes   No
   Yes  206 1428
   No   238 6128

• The overall model accuracy is (206+6128)/8000 = 0.79 or 79%.

6

A

genderFemale and geographyFrance are omitted because they are the reference levels against which the other categories are compared.

B

log-odds = 3.337 + (0.000864 x credit_score) + (-0.777 x geographyGermany) + (-0.0255 x geographySpain) + (0.566 x genderMale) + (-0.0643 x age) + (0.0146 x tenure) + (-0.00000251 x balance) + (0.113 x num_products)
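To turn the regression equation into a predicted probability, the linear predictor is passed through the inverse logit, 1 / (1 + exp(-log-odds)). The Python sketch below uses the estimates from the glm summary above. The customer values are hypothetical and not taken from the dataset, and the function names are illustrative. It also assumes that, with the factor levels printed as "Yes" then "No", the model describes the probability of the second level ("No", not churning).

```python
import math

# Estimates copied from the glm summary (scientific notation preserved).
COEFFICIENTS = {
    "credit_score": 8.639e-04,
    "geographyGermany": -7.766e-01,
    "geographySpain": -2.547e-02,
    "genderMale": 5.664e-01,
    "age": -6.426e-02,
    "tenure": 1.457e-02,
    "balance": -2.509e-06,
    "num_products": 1.127e-01,
}
INTERCEPT = 3.337


def linear_predictor(customer):
    """Log-odds: intercept plus each coefficient times the customer's value."""
    return INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in customer.items())


def predict_probability(customer):
    """Inverse logit of the linear predictor (assumed here to be P(churn = "No"))."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(customer)))


# Hypothetical customer: a 45-year-old woman in Germany (dummy variables coded 0/1).
customer = {
    "credit_score": 650, "geographyGermany": 1, "geographySpain": 0,
    "genderMale": 0, "age": 45, "tenure": 5, "balance": 100000, "num_products": 1,
}
print(round(linear_predictor(customer), 4), round(predict_probability(customer), 4))
```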

C

num_products, credit_score, geographyGermany, genderMale, age and balance are significant at the 5% level (p < 0.05).
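The list can be checked by comparing each p-value in the Pr(>|z|) column of the summary with a 0.05 threshold. A minimal Python sketch follows; the 5% cut-off is an assumed convention, and values printed as "< 2e-16" are entered as that upper bound.

```python
# p-values copied from the Pr(>|z|) column of the glm summary.
P_VALUES = {
    "credit_score": 0.00488,
    "geographyGermany": 2e-16,   # reported as "< 2e-16"; upper bound used
    "geographySpain": 0.74277,
    "genderMale": 2e-16,         # reported as "< 2e-16"; upper bound used
    "age": 2e-16,                # reported as "< 2e-16"; upper bound used
    "tenure": 0.15441,
    "balance": 9.24e-06,
    "num_products": 0.02990,
}

ALPHA = 0.05  # assumed significance threshold

significant = [name for name, p in P_VALUES.items() if p < ALPHA]
not_significant = [name for name, p in P_VALUES.items() if p >= ALPHA]
print("significant:", significant)
print("not significant:", not_significant)
```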

D

A positive coefficient makes it more likely that a customer will not churn.
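One way to read the direction and size of each effect is to exponentiate the coefficient, which gives an odds ratio: values above 1 raise the odds of the modelled outcome and values below 1 lower them. The Python sketch below applies this to three estimates from the summary above. It is illustrative only, and it assumes the modelled outcome is "No" (not churning).

```python
import math

# A few estimates copied from the glm summary.
estimates = {
    "genderMale": 5.664e-01,
    "age": -6.426e-02,
    "geographyGermany": -7.766e-01,
}

# exp(coefficient) is the multiplicative change in the odds of the modelled
# outcome for a one-unit increase in the predictor, other predictors held fixed.
odds_ratios = {name: math.exp(beta) for name, beta in estimates.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds of the modelled outcome)")
```

Under that assumption, being male multiplies the odds of not churning by roughly 1.76, while each additional year of age multiplies them by roughly 0.94.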

Model Comparison & Marketing Actions

8

On the training data, the binary logistic regression model has an overall accuracy of (206+6128)/8000 = 0.79 or 79%, compared with (6102+671)/8000 = 0.85 or 85% for the classification tree model. The classification tree also achieves (1531+178)/2000 = 0.85 or 85% on the test data, which makes it the superior model.

9

The classification tree model should be used by the company, as it does not suffer from overfitting and has an accuracy of (1531+178)/2000 = 0.85 or 85% on the test data.

10

A

The number of products has played a role in customers churning. Customers with more than 2.5 products (that is, three or more) churn at a high rate.

B

Marketeers can use this data to target customer segments that have recently taken their business to other companies. They can also offer incentives to current customers whom the models have flagged as likely to churn.