Sem 2 - Assignment 1
Question 1 - Visual Exploration
(A)
(B)
Commentary
Married customers are less likely to subscribe to a term deposit plan. This could be the result of a combined household income and therefore a lesser need for a term deposit plan.
Customers with a loan are less likely to subscribe to a term deposit plan. This could be the result of the financial pressure of their current loans.
Customers with a longer duration are more likely to subscribe. This is likely the product of trust and customer nurturing that have been built up over time.
Customers whom the bank has previously contacted are more likely to subscribe.
Question 2 - Classification Trees
(3)
(4)
(A)
If a customer has been with the bank for a minimum of 765 days, has an existing loan and is 48 years old or older, it's more likely that they will subscribe.
The node is relatively pure, as the majority of outcomes are “Yes”.
(B)
If the customer has been with the bank for less than 377 days, it’s likely they won’t subscribe.
This node is very pure, as almost all outcomes are “No”.
(C)
| Leaf node (decision rule) | % of customers in node |
|---|---|
| Duration < 377, Previous Outcome = failure/other/unknown | 78% |
| Duration < 377, Previous Outcome not = failure/other/unknown, Duration < 251 | 1% |
| Duration < 377, Previous Outcome not = failure/other/unknown, Duration > 251 | 1% |
| Duration > 377, Duration < 765, Previous Outcome = failure/other/unknown, Age < 59 | 13% |
| Duration > 377, Duration < 765, Previous Outcome = failure/other/unknown, Age > 59 | 1% |
| Duration > 377, Duration < 765, Previous Outcome not = failure/other/unknown | 1% |
| Duration > 765, Age >= 48 | 1% |
| Duration > 765, Age < 48, Loan = yes | 1% |
| Duration > 765, Age < 48, Loan not = yes | 3% |
(5)
id age marital balance loan num_prev_contacts duration prev_outcome
1 3961 32 single 473 no 1 72 other
2 2722 44 married 1058 no 0 188 unknown
3 3060 39 married 186 no 0 433 unknown
4 3705 35 married 0 no 0 146 unknown
5 3328 46 married 1291 no 17 142 failure
6 2403 35 married 280 yes 0 65 unknown
subscribed No Yes train_preds
1 No 0.9586936 0.04130644 No
2 Yes 0.9586936 0.04130644 No
3 No 0.8156425 0.18435754 No
4 No 0.9586936 0.04130644 No
5 No 0.9586936 0.04130644 No
6 No 0.9586936 0.04130644 No
Predicted
Actual No Yes
No 1176 15
Yes 90 53
id age marital balance loan duration num_prev_contacts prev_outcome
1 3786 51 single -55 no 119 0 unknown
2 503 43 divorced 738 no 585 4 failure
3 3430 53 married 719 no 230 0 unknown
4 3696 42 married 83 yes 184 0 unknown
5 4090 45 married 185 no 249 0 unknown
6 3052 36 married 1554 no 325 0 unknown
subscribed No Yes test_preds
1 No 0.9586936 0.04130644 No
2 No 0.8156425 0.18435754 No
3 No 0.9586936 0.04130644 No
4 No 0.9586936 0.04130644 No
5 No 0.9586936 0.04130644 No
6 No 0.9586936 0.04130644 No
Predicted
Actual No Yes
No 2568 73
Yes 267 92
(A)
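The overall accuracy of this model on the test data set is (92+2568)/3000 = 89%.
The classification tree does not appear to be overfitting the training data set to any large degree. The overall accuracy on the training data set is (53+1176)/1334 = 92%, which is only about three percentage points higher than the accuracy on the test data set (89%).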
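As a check on the arithmetic, both accuracies can be recomputed directly from the two classification tree confusion matrices shown above. This is a minimal Python sketch and not part of the original R analysis; the helper name `accuracy` and the dictionary layout are illustrative assumptions.

```python
# Confusion matrices copied from the tree output above,
# keyed as (actual, predicted) -> count.
train = {("No", "No"): 1176, ("No", "Yes"): 15,
         ("Yes", "No"): 90, ("Yes", "Yes"): 53}
test = {("No", "No"): 2568, ("No", "Yes"): 73,
        ("Yes", "No"): 267, ("Yes", "Yes"): 92}


def accuracy(cm):
    """Share of cases where the predicted class equals the actual class."""
    correct = sum(n for (actual, predicted), n in cm.items() if actual == predicted)
    return correct / sum(cm.values())


train_acc = accuracy(train)
test_acc = accuracy(test)
gap = train_acc - test_acc  # a large positive gap would suggest overfitting

print(f"train accuracy: {train_acc:.1%}")
print(f"test accuracy:  {test_acc:.1%}")
print(f"gap:            {gap:.1%}")
```

The training matrix sums to 1334 cases and the test matrix to 3000, which matches the denominators used in the accuracy calculations.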
Question 3 - Binary Logistic Regression
(6)
[1] "Yes" "No"
[1] "Yes" "No"
(7)
Call:
glm(formula = subscribed ~ age + marital + balance + loan + num_prev_contacts +
duration + prev_outcome, family = binomial(link = "logit"),
data = bank_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.960e+00 7.123e-01 5.559 2.72e-08 ***
age -2.261e-02 1.047e-02 -2.159 0.03085 *
maritalmarried 2.145e-01 3.293e-01 0.651 0.51475
maritalsingle -2.616e-01 3.780e-01 -0.692 0.48900
balance -9.118e-07 3.112e-05 -0.029 0.97663
loanyes 1.051e+00 3.974e-01 2.646 0.00815 **
num_prev_contacts -1.664e-02 7.774e-02 -0.214 0.83052
duration -4.033e-03 3.559e-04 -11.332 < 2e-16 ***
prev_outcomeother -2.931e-01 4.998e-01 -0.587 0.55753
prev_outcomesuccess -2.084e+00 4.438e-01 -4.696 2.65e-06 ***
prev_outcomeunknown 6.768e-01 3.871e-01 1.748 0.08039 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 908.76 on 1333 degrees of freedom
Residual deviance: 664.86 on 1323 degrees of freedom
AIC: 686.86
Number of Fisher Scoring iterations: 6
(A)
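The reference levels marital = divorced, loan = no and previous outcome = failure are omitted from the coefficient table, as they act as the baseline for each categorical variable.
Log-odds = 3.960 + (-0.02261 x age) + (0.2145 x marital married) + (-0.2616 x marital single) + (-0.0000009118 x balance) + (1.051 x loan yes) + (-0.01664 x number of previous contacts) + (-0.004033 x duration) + (-0.2931 x previous outcome other) + (-2.084 x previous outcome success) + (0.6768 x previous outcome unknown).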
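To show how the fitted equation is applied, the sketch below plugs the first training customer shown in (5) (age 32, single, balance 473, no loan, 1 previous contact, duration 72, previous outcome "other") into the coefficients from the glm summary. It is written in Python purely as an arithmetic illustration; the dictionary names are assumptions, not part of the assignment. Note also that R's glm, given a factor response, treats the first level as "failure" and models the other level, so if the levels printed in (6) ("Yes", "No") belong to `subscribed`, the probability computed here would be the probability of "No".

```python
import math

# Coefficients copied from the glm summary above.
coef = {
    "(Intercept)": 3.960,
    "age": -0.02261,
    "maritalmarried": 0.2145,
    "maritalsingle": -0.2616,
    "balance": -9.118e-07,
    "loanyes": 1.051,
    "num_prev_contacts": -0.01664,
    "duration": -0.004033,
    "prev_outcomeother": -0.2931,
    "prev_outcomesuccess": -2.084,
    "prev_outcomeunknown": 0.6768,
}

# First training row: dummy variables are 1 when the level applies.
# Terms that are 0 for this customer (married, loan yes, ...) are left out.
customer = {
    "(Intercept)": 1,
    "age": 32,
    "maritalsingle": 1,
    "balance": 473,
    "num_prev_contacts": 1,
    "duration": 72,
    "prev_outcomeother": 1,
}

# Linear predictor (log-odds), then the inverse logit to get a probability.
log_odds = sum(coef[name] * value for name, value in customer.items())
prob = 1 / (1 + math.exp(-log_odds))

print(f"log-odds = {log_odds:.3f}, probability = {prob:.3f}")
```

Under the assumption about level order stated above, a probability above 0.5 corresponds to a predicted "No", which is also this customer's actual outcome in the training data.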
(C)
Age, loan (yes), duration and previous outcome (success) are statistically significant at the 5% level (p < 0.05).
(D)
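Married, loan (yes) and previous outcome (unknown) all have positive coefficients, so each increases the log-odds of the outcome being modelled. Because the factor levels printed in (6) are ordered "Yes", "No", R treats "Yes" as the reference level and models the probability of "No". On that reading, customers in these groups are less likely to subscribe, which agrees with the commentary in Question 1. Of the three, only loan (yes) is significant at the 5% level.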
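The size of an effect is easier to read as an odds ratio, which is the exponential of the coefficient: a value above 1 raises the odds of the modelled outcome and a value below 1 lowers them. The Python sketch below does this for the four significant coefficients from the glm summary; it is an illustrative calculation only, and the variable names are assumptions.

```python
import math

# Significant coefficients copied from the glm summary above.
coef = {
    "age": -0.02261,
    "loanyes": 1.051,
    "duration": -0.004033,
    "prev_outcomesuccess": -2.084,
}

# exp(coefficient) = multiplicative change in the odds of the modelled
# outcome for a one-unit increase (or for membership of the level).
odds_ratios = {name: math.exp(b) for name, b in coef.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio = {ratio:.3f} ({direction} the odds)")
```

Because duration is measured per unit, its odds ratio is very close to 1 even though it is the most significant predictor; the effect accumulates over hundreds of units.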
(8)
Predicted
Actual Yes No
Yes 42 101
No 23 1168
The overall accuracy of this model on the training data set is (42+1168)/1334 = 91%.
Predicted
Actual Yes No
Yes 94 265
No 50 2591
The overall accuracy of this model on the test data set is (94+2591)/3000 = 89.5%, or approximately 90%.
Question 4 - Model Comparison & Marketing Actions
(9)
On the test data set, the accuracy of the classification tree model is 89%, while the accuracy of the binary logistic regression model is 90%.
(10)
The binary logistic regression model should be used by the company, as its test accuracy is about one percentage point higher (90% vs 89%).
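The comparison can be reproduced from the two test-set confusion matrices reported above. The Python sketch below recomputes the accuracy of each model and, as additional context that goes beyond the original answer, reports two further figures derived from the same counts: the share of actual "Yes" customers each model identifies, and the accuracy obtained by always predicting "No". The function name `summarise` and the dictionary layout are illustrative assumptions.

```python
# Test-set confusion matrices from above, keyed as (actual, predicted) -> count.
tree = {("No", "No"): 2568, ("No", "Yes"): 73,
        ("Yes", "No"): 267, ("Yes", "Yes"): 92}
logistic = {("Yes", "Yes"): 94, ("Yes", "No"): 265,
            ("No", "Yes"): 50, ("No", "No"): 2591}


def summarise(cm):
    """Return accuracy, share of actual 'Yes' found, and always-'No' accuracy."""
    total = sum(cm.values())
    actual_yes = cm[("Yes", "Yes")] + cm[("Yes", "No")]
    accuracy = (cm[("Yes", "Yes")] + cm[("No", "No")]) / total
    yes_found = cm[("Yes", "Yes")] / actual_yes
    always_no = (total - actual_yes) / total
    return accuracy, yes_found, always_no


tree_acc, tree_yes, baseline = summarise(tree)
logit_acc, logit_yes, _ = summarise(logistic)

print(f"tree:     accuracy {tree_acc:.1%}, actual 'Yes' found {tree_yes:.1%}")
print(f"logistic: accuracy {logit_acc:.1%}, actual 'Yes' found {logit_yes:.1%}")
print(f"always predicting 'No': accuracy {baseline:.1%}")
```

Since most customers in the test set did not subscribe, overall accuracy on its own is a fairly coarse basis for choosing between the two models.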
(11)
(A)
Marital status, age, and whether the customer has an existing loan are the key factors the bank should look at when deciding which customers to target when advertising a new loan.