Classification Trees and Binary Logistic Regression
Q1 - Visual Exploration
- Import the data
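A minimal import sketch is shown below. The file name bank_churn.csv and the use of read.csv/dplyr are assumptions for illustration; only the column names are taken from the model output later in this report.

```r
library(dplyr)

# Read the raw file (file name assumed) and turn the categorical
# columns used later into factors; churn is ordered Yes, No to match
# the levels printed in the logistic regression section.
bank_churn <- read.csv("bank_churn.csv", stringsAsFactors = FALSE) %>%
  mutate(
    churn     = factor(churn, levels = c("Yes", "No")),
    geography = factor(geography),
    gender    = factor(gender)
  )

str(bank_churn)   # quick check of dimensions and column types
```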
Churn and Geography
The number of non-churned customers is significantly higher than the number of churned customers across all three countries, which suggests that overall churn in the dataset is relatively low. The red bars (France) are the tallest in both the churn and non-churn categories, indicating that France has the highest number of customers in the dataset.
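A grouped bar chart along these lines could be produced with ggplot2; the exact aesthetics of the original chart are not known, so this is only a sketch.

```r
library(ggplot2)

# Counts of churned vs. non-churned customers, split by country.
ggplot(bank_churn, aes(x = churn, fill = geography)) +
  geom_bar(position = "dodge") +
  labs(x = "Churned", y = "Number of customers", fill = "Country")
```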
Churn and Balance
The thick horizontal line inside each box represents the median balance. Customers who churned have a slightly higher median balance than those who did not churn.
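A box plot of balance by churn status, roughly as described, could be drawn as follows (a sketch, not the original code).

```r
library(ggplot2)

# Balance distribution for churned vs. non-churned customers;
# the thick line inside each box is the median balance.
ggplot(bank_churn, aes(x = churn, y = balance)) +
  geom_boxplot() +
  labs(x = "Churned", y = "Account balance")
```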
Churn and Gender
Men tend to churn at a lower rate than women; many men are willing to stick with the company, while women leave at a noticeably higher rate.
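The gender comparison can be checked numerically with a row-wise proportion table, for example:

```r
# Churn rate within each gender (each row sums to 1).
prop.table(table(bank_churn$gender, bank_churn$churn), margin = 1)
```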
Churn and Tenure
Churn and Age
The chart shows that customers between roughly 40 and 50 years old tend to have a higher churn rate than the younger market.
Churn and Credit Score
There is little correlation between credit score and churn rate.
Churn and Number of Products
Q2 Classification Trees
1. Classification tree models
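A minimal sketch of how such a tree could be fitted with rpart, assuming the data have already been split into bank_churn_train and a test set named bank_churn_test (the rpart/rpart.plot calls are illustrative, not the original code):

```r
library(rpart)
library(rpart.plot)

# Fit a classification tree for churn on the training data, using the
# same predictors as the logistic regression model later on.
tree_model <- rpart(
  churn ~ credit_score + geography + gender + age +
          tenure + balance + num_products,
  data   = bank_churn_train,
  method = "class"
)

# Plot the tree; each node is labelled with its predicted class and
# class probability, which is where statements such as those in
# parts A and B below can be read off.
rpart.plot(tree_model)
```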
3.
A
Customers over the age of 43 have an 80% chance of churning from the easy-saver product.
B
Customers over the age of 43 with fewer than 2.5 products have an 88% chance of not churning.
C
4.
D
Training dataset
            Predicted
Actual        No   Yes
  No        6102   264
  Yes        963   671

• Overall accuracy is (6102 + 671)/8000 = 0.85, or 85%.
• Of all customers the model predicted to churn, it got 671/935 = 0.72, or 72%, correct.
• Of all customers the model predicted not to churn, it got 6102/7065 = 0.86, or 86%, correct.
• Of all customers who did churn, the model correctly identified 671/1634 = 0.41, or 41%.
• Of all customers who did not churn, the model correctly identified 6102/6366 = 0.96, or 96%.
Testing Dataset
            Predicted
Actual        No   Yes
  No        1531    66
  Yes        225   178

• Overall accuracy is (1531 + 178)/2000 = 0.85, or 85%.
• Of all customers the model predicted to churn, it got 178/244 = 0.73, or 73%, correct.
• Of all customers the model predicted not to churn, it got 1531/1756 = 0.87, or 87%, correct.
• Of all customers who did churn, the model correctly identified 178/403 = 0.44, or 44%.
• Of all customers who did not churn, the model correctly identified 1531/1597 = 0.96, or 96%.
The classification tree is not overfitting: the training set's overall accuracy of about 85% is essentially the same as the test set's overall accuracy of about 85%, so performance does not drop on unseen data.
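The two confusion matrices and accuracy figures above could be reproduced with a sketch along these lines; tree_model and bank_churn_test are the assumed object names carried over from the earlier fitting sketch.

```r
# Class predictions from the fitted tree on each data set.
train_pred <- predict(tree_model, bank_churn_train, type = "class")
test_pred  <- predict(tree_model, bank_churn_test,  type = "class")

# Confusion matrices: rows = actual churn, columns = predicted churn.
train_cm <- table(Actual = bank_churn_train$churn, Predicted = train_pred)
test_cm  <- table(Actual = bank_churn_test$churn,  Predicted = test_pred)
train_cm
test_cm

# Overall accuracy = correctly classified / total.
sum(diag(train_cm)) / sum(train_cm)
sum(diag(test_cm))  / sum(test_cm)
```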
Q3 Binary Logistic Regression
5.
[1] "Yes" "No"
[1] "Yes" "No"
Call:
glm(formula = churn ~ credit_score + geography + gender + age +
tenure + balance + num_products, family = binomial(link = "logit"),
data = bank_churn_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.337e+00 2.579e-01 12.942 < 2e-16 ***
credit_score 8.639e-04 3.070e-04 2.815 0.00488 **
geographyGermany -7.766e-01 7.397e-02 -10.500 < 2e-16 ***
geographySpain -2.547e-02 7.761e-02 -0.328 0.74277
genderMale 5.664e-01 5.965e-02 9.496 < 2e-16 ***
age -6.426e-02 2.727e-03 -23.563 < 2e-16 ***
tenure 1.457e-02 1.023e-02 1.424 0.15441
balance -2.509e-06 5.658e-07 -4.434 9.24e-06 ***
num_products 1.127e-01 5.191e-02 2.171 0.02990 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8099.8 on 7999 degrees of freedom
Residual deviance: 7141.1 on 7991 degrees of freedom
AIC: 7159.1
Number of Fisher Scoring iterations: 4
            Predicted
Actual       Yes    No
  Yes        206  1428
  No         238  6128

• The overall model accuracy is (206 + 6128)/8000 = 0.79, or 79%.
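A sketch of how this matrix could be produced from the fitted model (churn_glm is an assumed object name). Because the response levels are ordered Yes, No, predict() returns the probability of "No", so a 0.5 cut-off assigns "Yes" when that probability is below 0.5.

```r
# Fitted probabilities on the training data; with levels c("Yes", "No")
# these are probabilities that churn = "No".
glm_prob <- predict(churn_glm, bank_churn_train, type = "response")

# Classify as "Yes" (churn) when the probability of "No" is below 0.5.
glm_pred <- factor(ifelse(glm_prob < 0.5, "Yes", "No"), levels = c("Yes", "No"))

glm_cm <- table(Actual = bank_churn_train$churn, Predicted = glm_pred)
glm_cm

# Overall accuracy.
sum(diag(glm_cm)) / sum(glm_cm)
```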
6
A
genderFemale and geographyFrance are omitted because they are the reference (baseline) levels of the dummy-coded gender and geography variables.
B
log-odds(churn = No) = 3.337 + (0.000864 x credit_score) + (-0.7766 x geographyGermany) + (-0.02547 x geographySpain) + (0.5664 x genderMale) + (-0.06426 x age) + (0.01457 x tenure) + (-0.00000251 x balance) + (0.1127 x num_products). A worked numerical example follows part D below.
C
num_products, credit_score, geographyGermany, genderMale, age and balance
D
A positive coefficient makes it more likely that a customer will not churn: because the churn factor levels are ordered "Yes" then "No", the model estimates the probability that churn = "No", so positive coefficients increase that probability.
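As a worked illustration of the equation in part B and the interpretation in part D, the log-odds for one hypothetical customer can be computed and converted to a probability with the logistic function (the customer's values are invented for illustration; the coefficients are copied from the glm output above).

```r
# Hypothetical customer: German male, age 45, credit score 650,
# 5 years' tenure, balance of 100,000 and 2 products.
log_odds <- 3.337 +
  (8.639e-04)  * 650    +   # credit_score
  (-7.766e-01) * 1      +   # geographyGermany = 1
  (-2.547e-02) * 0      +   # geographySpain   = 0
  (5.664e-01)  * 1      +   # genderMale       = 1
  (-6.426e-02) * 45     +   # age
  (1.457e-02)  * 5      +   # tenure
  (-2.509e-06) * 100000 +   # balance
  (1.127e-01)  * 2          # num_products

# Probability that churn = "No" for this customer (about 0.70).
plogis(log_odds)
```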
Model Comparison & Marketing Actions
8
The binary logistic regression model's overall accuracy is (206 + 6128)/8000 = 0.79, or 79%, compared with the classification tree's overall accuracy of (1531 + 178)/2000 = 0.85, or 85%, on the test set, which makes the classification tree the superior model.
9
The classification tree model should be used by the company, as it does not suffer from overfitting and achieves an overall accuracy of about 85% on the test data.
10
A
The number of products plays a role in churn: customers with more than 2.5 products (i.e. three or more) leave at a high rate.
B
Marketers can use this data to target customer segments that have recently taken their business to other companies. They can also offer incentives to current customers that the models flag as likely to churn.
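One way the flagging described above could work in practice is to score current customers with the chosen tree model and pass the highest-risk segment to the retention team. This is only a sketch: bank_churn_current (a table of current customers) and the 0.5 cut-off are assumptions.

```r
library(dplyr)

# Predicted probability of churn ("Yes") for each current customer.
churn_prob <- predict(tree_model, bank_churn_current, type = "prob")[, "Yes"]

# Flag customers above the cut-off for a retention offer,
# ranked from highest to lowest churn risk.
at_risk <- bank_churn_current %>%
  mutate(churn_probability = churn_prob) %>%
  filter(churn_probability >= 0.5) %>%
  arrange(desc(churn_probability))

head(at_risk)
```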