Warning: package 'rattle' was built under R version 4.4.2
Loading required package: tibble
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
library(rpart)library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.4.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)library(dplyr)
Question 1
(1)
Load Data
bank_sub_test <-read_csv("bank_sub_test.csv")
Rows: 1334 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): marital, loan, prev_outcome, subscribed
dbl (5): id, age, balance, num_prev_contacts, duration
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bank_sub_train <-read_csv("bank_sub_train.csv")
Rows: 3000 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): marital, loan, prev_outcome, subscribed
dbl (5): id, age, balance, duration, num_prev_contacts
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(summary_data, aes(x = marital, y = percentage, fill = subscribed)) +geom_bar(stat ="identity", position ="dodge") +labs(x ="Marital Status", y ="Percentage (%)", fill ="Subscribed") +scale_fill_manual(values =c("No"="magenta2", "Yes"="turquoise")) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1)) +ggtitle("Percentage of Subscribed vs Not Subscribed by Marital Status")
(b)
ggplot(bank_sub_train, aes(x = subscribed, y = age, fill = subscribed)) +geom_boxplot() +labs(x ="Subscribed", y ="Age", fill ="Subscribed") +scale_fill_manual(values =c("No"="magenta2", "Yes"="turquoise")) +theme_minimal() +ggtitle("Distribution of Age by Subscription Status")
If their duration is greater than 646 and they are not married they are predicted to subscribe and 64% of the people who fell into this category did subscribe and 36% did not. 4% of people in the bank_sub_train data set fell into this node.
If the duration was less than 646 and their previous outcome was failure, other and unknown they are predicted to not churn . #93% of the people who fell into theis category did not subscribe and 7% did. 90% of people in the bank_sub_train dataset fell into this node.
Call:
rpart(formula = subscribed ~ age + marital + balance + loan +
duration + num_prev_contacts + prev_outcome, data = bank_sub_train)
n= 3000
CP nsplit rel error xerror xstd
1 0.05849582 0 1.0000000 1.0000000 0.04951951
2 0.04456825 2 0.8830084 0.9721448 0.04891730
3 0.01253482 3 0.8384401 0.9080780 0.04748259
4 0.01000000 6 0.7966574 0.9247911 0.04786377
Variable importance
duration prev_outcome marital balance age
58 34 3 2 2
Node number 1: 3000 observations, complexity param=0.05849582
predicted class=No expected loss=0.1196667 P(node) =1
class counts: 2641 359
probabilities: 0.880 0.120
left son=2 (2758 obs) right son=3 (242 obs)
Primary splits:
duration < 645.5 to the left, improve=88.179720, (0 missing)
prev_outcome splits as LLRL, improve=49.703030, (0 missing)
num_prev_contacts < 0.5 to the left, improve=16.656260, (0 missing)
age < 60.5 to the left, improve=11.538120, (0 missing)
balance < 888 to the left, improve= 5.539762, (0 missing)
Node number 2: 2758 observations, complexity param=0.05849582
predicted class=No expected loss=0.08375635 P(node) =0.9193333
class counts: 2527 231
probabilities: 0.916 0.084
left son=4 (2690 obs) right son=5 (68 obs)
Primary splits:
prev_outcome splits as LLRL, improve=53.968160, (0 missing)
duration < 211.5 to the left, improve=23.274390, (0 missing)
num_prev_contacts < 0.5 to the left, improve=15.599510, (0 missing)
age < 60.5 to the left, improve=10.263590, (0 missing)
balance < 888 to the left, improve= 3.629022, (0 missing)
Node number 3: 242 observations, complexity param=0.04456825
predicted class=Yes expected loss=0.4710744 P(node) =0.08066667
class counts: 114 128
probabilities: 0.471 0.529
left son=6 (136 obs) right son=7 (106 obs)
Primary splits:
marital splits as RLR, improve=4.7815010, (0 missing)
balance < 8 to the left, improve=2.9993270, (0 missing)
duration < 771 to the left, improve=2.4103630, (0 missing)
num_prev_contacts < 1.5 to the left, improve=1.5679810, (0 missing)
prev_outcome splits as RRLL, improve=0.6950792, (0 missing)
Surrogate splits:
age < 37.5 to the right, agree=0.607, adj=0.104, (0 split)
duration < 669.5 to the right, agree=0.579, adj=0.038, (0 split)
balance < -53.5 to the right, agree=0.574, adj=0.028, (0 split)
Node number 4: 2690 observations
predicted class=No expected loss=0.06802974 P(node) =0.8966667
class counts: 2507 183
probabilities: 0.932 0.068
Node number 5: 68 observations
predicted class=Yes expected loss=0.2941176 P(node) =0.02266667
class counts: 20 48
probabilities: 0.294 0.706
Node number 6: 136 observations, complexity param=0.01253482
predicted class=No expected loss=0.4411765 P(node) =0.04533333
class counts: 76 60
probabilities: 0.559 0.441
left son=12 (21 obs) right son=13 (115 obs)
Primary splits:
balance < 7.5 to the left, improve=3.1217630, (0 missing)
duration < 758 to the left, improve=2.1980180, (0 missing)
age < 40.5 to the left, improve=1.6987540, (0 missing)
prev_outcome splits as LRRL, improve=0.9119144, (0 missing)
num_prev_contacts < 1.5 to the left, improve=0.8403361, (0 missing)
Node number 7: 106 observations
predicted class=Yes expected loss=0.3584906 P(node) =0.03533333
class counts: 38 68
probabilities: 0.358 0.642
Node number 12: 21 observations
predicted class=No expected loss=0.1904762 P(node) =0.007
class counts: 17 4
probabilities: 0.810 0.190
Node number 13: 115 observations, complexity param=0.01253482
predicted class=No expected loss=0.4869565 P(node) =0.03833333
class counts: 59 56
probabilities: 0.513 0.487
left son=26 (52 obs) right son=27 (63 obs)
Primary splits:
age < 40.5 to the left, improve=1.9883420, (0 missing)
num_prev_contacts < 1.5 to the left, improve=1.6475740, (0 missing)
balance < 166.5 to the right, improve=1.4049260, (0 missing)
duration < 743 to the left, improve=1.2862330, (0 missing)
prev_outcome splits as RLRL, improve=0.4408696, (0 missing)
Surrogate splits:
duration < 1016 to the right, agree=0.617, adj=0.154, (0 split)
balance < 41.5 to the left, agree=0.565, adj=0.038, (0 split)
loan splits as RL, agree=0.557, adj=0.019, (0 split)
Node number 26: 52 observations
predicted class=No expected loss=0.3846154 P(node) =0.01733333
class counts: 32 20
probabilities: 0.615 0.385
Node number 27: 63 observations, complexity param=0.01253482
predicted class=Yes expected loss=0.4285714 P(node) =0.021
class counts: 27 36
probabilities: 0.429 0.571
left son=54 (28 obs) right son=55 (35 obs)
Primary splits:
duration < 771 to the left, improve=3.2142860, (0 missing)
num_prev_contacts < 1.5 to the left, improve=1.2420490, (0 missing)
balance < 903.5 to the right, improve=0.8449725, (0 missing)
age < 62 to the right, improve=0.7071429, (0 missing)
loan splits as LR, improve=0.5844156, (0 missing)
Surrogate splits:
balance < 486.5 to the right, agree=0.619, adj=0.143, (0 split)
age < 44.5 to the left, agree=0.603, adj=0.107, (0 split)
num_prev_contacts < 0.5 to the left, agree=0.587, adj=0.071, (0 split)
prev_outcome splits as RRRL, agree=0.587, adj=0.071, (0 split)
Node number 54: 28 observations
predicted class=No expected loss=0.3928571 P(node) =0.009333333
class counts: 17 11
probabilities: 0.607 0.393
Node number 55: 35 observations
predicted class=Yes expected loss=0.2857143 P(node) =0.01166667
class counts: 10 25
probabilities: 0.286 0.714
(5)
Training Predictions
train_probs <-predict(churn_tree, newdata = bank_sub_train, type ='prob')train_preds <-predict(churn_tree, newdata = bank_sub_train, type ='class')sub_train_updated <-cbind(bank_sub_train, train_probs, train_preds)head(sub_train_updated)
id age marital balance loan duration num_prev_contacts prev_outcome
1 3786 51 single -55 no 119 0 unknown
2 503 43 divorced 738 no 585 4 failure
3 3430 53 married 719 no 230 0 unknown
4 3696 42 married 83 yes 184 0 unknown
5 4090 45 married 185 no 249 0 unknown
6 3052 36 married 1554 no 325 0 unknown
subscribed No Yes train_preds
1 No 0.9319703 0.06802974 No
2 No 0.9319703 0.06802974 No
3 No 0.9319703 0.06802974 No
4 No 0.9319703 0.06802974 No
5 No 0.9319703 0.06802974 No
6 No 0.9319703 0.06802974 No
From the above confusion matrix, we know the following:
• The overall model accuracy is 2714/3000 = 0.9 or 90%.
• Of all customers the model predicted would subscribed, they got 2575/2791 = 0.92 or 92% correct.
• Of all customers the model predicted would not subscribed, they got 141/209 = 0.67 or 67% correct.
• Of all customers who did actually subscribed, the model correctly identified 2573/2641 = 0.97 or 97%.
• Of all customers who did not actually subscribed, the model correctly identified 141/359 = 0.39 or 39%.
Testing Predictions
test_probs <-predict(churn_tree, newdata = bank_sub_test, type ='prob')test_preds <-predict(churn_tree, newdata = bank_sub_test, type ='class')sub_test_updated <-cbind(bank_sub_test, test_probs, test_preds)head(sub_test_updated)
id age marital balance loan num_prev_contacts duration prev_outcome
1 3961 32 single 473 no 1 72 other
2 2722 44 married 1058 no 0 188 unknown
3 3060 39 married 186 no 0 433 unknown
4 3705 35 married 0 no 0 146 unknown
5 3328 46 married 1291 no 17 142 failure
6 2403 35 married 280 yes 0 65 unknown
subscribed No Yes test_preds
1 No 0.9319703 0.06802974 No
2 Yes 0.9319703 0.06802974 No
3 No 0.9319703 0.06802974 No
4 No 0.9319703 0.06802974 No
5 No 0.9319703 0.06802974 No
6 No 0.9319703 0.06802974 No
From the above confusion matrix, we know the following:
• The overall model accuracy is 1203/1334 = 0.9 or 90%.
• Of all customers the model predicted would subscribe, they got 1151/249 = 0.92 or 92% correct.
• Of all customers the model predicted would not subscribe, they got 52/92 = 0.57 or 57% correct.
• Of all customers who did actually subscribed, the model correctly identified 1151/1191 = 0.96 or 96%.
• Of all customers who did not actually subscribe, the model correctly identified 52/143 = 0.36 or 36%.
(5)
(a)
The overall accuracy of the model is the same for both the training and testing datasets at 90%. Therefore, the classification tree does not appear to be overfitting and so we do not need to consider pruning the tree.
Question 3
(6) (7)
Logistic Regression Model
bank_sub_train$subscribed <-factor(bank_sub_train$subscribed, levels =c("Yes", "No"))bank_sub_test$subscribed <-factor(bank_sub_test$subscribed, levels =c("Yes", "No"))subscribed_lr <-glm(subscribed ~ age + marital + balance + duration + prev_outcome, data = bank_sub_train, family =binomial(link ='logit'))summary(subscribed_lr)
Call:
glm(formula = subscribed ~ age + marital + balance + duration +
prev_outcome, family = binomial(link = "logit"), data = bank_sub_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.373e+00 3.183e-01 10.598 < 2e-16 ***
age -1.323e-02 6.573e-03 -2.013 0.044118 *
maritalmarried 5.379e-01 1.622e-01 3.316 0.000913 ***
maritaldivorced 1.999e-02 2.210e-01 0.090 0.927948
balance -2.867e-05 2.168e-05 -1.322 0.186028
duration -3.649e-03 2.194e-04 -16.632 < 2e-16 ***
prev_outcomeother -5.604e-01 2.963e-01 -1.891 0.058558 .
prev_outcomesuccess -2.766e+00 3.164e-01 -8.742 < 2e-16 ***
prev_outcomeunknown 4.622e-01 1.948e-01 2.372 0.017690 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2197.6 on 2999 degrees of freedom
Residual deviance: 1660.4 on 2991 degrees of freedom
AIC: 1678.4
Number of Fisher Scoring iterations: 5
(a)
Prev_outcome_failure was omitted from prev_outcome and marital_single was omitted in marital.
maritalmarried: a value of 1 indicated the customers marital status is married, otherwise this dummy varible takes a value of 0
maritaldivorce : a value of 1 indicated the customers marital status is divorced, otherwise # this dummy varible takes a value of 0
prev_outcomeother: a value of 1 indicated the customers previous outcome was unknown, otherwise this dummy varible takes a value of 0
prev_outcomesuccess: a value of 1 indicated the customers previous outcome was success, otherwise this dummy varible takes a value of 0
prev_outcomeunknown: a value of 1 indicated the customers previous outcome was unknown, otherwise this dummy varible takes a value of 0
Age, maritalmarried, duration, prev_outcomesuccess and prev_outcomeunknown are all significant perdictor variables as they all have p - values of less than 0.05.
(d)
The co-efficient is positive for maritalmarried, maritaldivorced and prev_outcomeuknown due to this they are likely to impact if the customer subscribes.
The co-efficient is negative for age, balance, duration, prev_outcomeother and prev_outcomesuccess due to this they are unlikely to impact if the customer subscribes.
Prev_outcome_failure was omitted from prev_outcome and marital_single was omitted in marital.
3-c
Age, maritalmarried, duration, prev_outcomesuccess and prev_outcomeunknown are all significant perdictor variables as they all have p - values of less than 0.05.
3-d
The co-efficient is positive for maritalmarried, maritaldivorced and prev_outcomeuknown due to this they are likely to impact if the customer subscribes.
The co-efficient is negative for age, balance, duration, prev_outcomeother and prev_outcomesuccess due to this they are unlikely to impact if the customer subscribes.
Question 4
(9)
From the classification tree confusion matrix, we know the following:
• The overall model accuracy is 2714/3000 = 0.9 or 90%.
• Of all customers the model predicted would subscribe, they got 2575/2791 = 0.92 or 92% correct.
• Of all customers the model predicted would not subscribe, they got 141/209 = 0.67 or 67% correct.
• Of all customers who did actually subscribe, the model correctly identified 2573/2641 = 0.97 or 97%.
• Of all customers who did not actually subscribe, the model correctly identified 141/359 = 0.39 or 39%.
From the binary logistic regression model confusion matrix, we know the following:
• The overall model accuracy is (44 + 1157)/1334 = 0.9 or 90%.
• Of all customers the model predicted would susbcribe, they got 44/78 = 0.56 or 56% correct.
• Of all customers the model predicted would not subscribe, they got 1157/1256 = 0.92 or 92% correct.
• Of all customers who did actually subscribe, the model correctly identified 44/143 = 0.31 or 31%.
• Of all customers who did not actually subscribe, the model correctly identified 1157/1191 = 0.97 or 97%.
Both of the models have an overall accuracy score of 90%. Which is a high score.
However, they differ in their ability to correctly identify subscribers and non-subscribers:
The classification tree correctly predicted 92% of those who subscribed and 67% of those who did not.
The logistic regression model correctly predicted 56% of those who subscribed and 92% of those who did not.
The classification tree was more effective in correctly identifying actual subscribers (97% vs. 31%) but struggled with non-subscribers (39% vs. 97%).
The classification tree is better at identifying likely subscribers, while logistic regression is better at identifying non-subscribers.
(10)
I think the classification tree should be used by the company as The classification tree is better at identifying likely subscribers which is more important to focus on rather than non subscribers as their aim in to increase subscribers.
(11)
(a)
Using the classification tree, the strongest factor influencing subscription is duration - the duration in days since the customer last subsribed to a product or plan with the bank followed by marital status, balance, age, and past contact outcomes. A duration over 646 days is the biggest indicator of a higher likelihood to subscribe, particularly if the customer is married.
(b)
Create marketing strategys to target returning cutsomers as customers who have subscribed before more than 646 days ago are more likley to subscribe again.
For married customers market to them family plans or long term benefits.
Focus on younger customers below the age of 41 as they are likely to subscribe e.g. future financial secuirty, home improvements