Data Analysis Report

Author

Rosemary Francis

0.1 Introduction

This report analyses data from a marketing campaign run by a Portuguese bank. I use two predictive models, a classification tree and a binary logistic regression model, to analyse the data.

0.2 Part A - Visual Exploration

1 2 (a)

In all three marital groups (divorced, married and single), far more customers did not subscribe to the term deposit plan than did subscribe. Comparing the groups, married customers appear less likely to subscribe, while divorced customers appear more likely to subscribe.

Customers without a loan show a higher subscription rate than customers who currently have a loan, suggesting that existing loan commitments may reduce the likelihood of subscribing to the term deposit plan. Loan status could therefore be a relevant factor in whether a customer subscribes.

The previous outcome variable has four levels: failure, other, success and unknown. Among customers who did not subscribe, ‘unknown’ is the largest category. Customers with a previous successful campaign outcome show a substantially higher subscription rate than all other outcome levels, indicating that past success is a strong predictor of future subscription behavior. This suggests that previous outcomes can show the bank what worked before and what could be repeated in future campaigns.

2 2(b)

The age distribution by subscription status shows a similar median and a similar age range for customers who did and did not subscribe. The outliers are older customers, and there are more of them among those who did not subscribe than among those who did. Customers who subscribed to the term deposit plan tend to be slightly older on average than those who did not subscribe, although there is an overlap in the age distributions of both groups.

The number of previous contacts is spread irregularly for customers who did not subscribe. Customers who subscribed to the term deposit plan generally had a higher number of previous contacts than those who did not subscribe, suggesting that repeated engagement may increase the likelihood of subscription.

Customers who subscribed to the term deposit plan have a higher median duration than those who did not subscribe. This indicates that a longer duration is associated with subscription and could be one of its drivers, while customers with short durations rarely subscribe.

2.1 Part B - Classification Tree

3 3

4 4(a)

The rule for predicting that a customer will subscribe is: if their duration was more than 646 and they are not married, the classification tree predicts that the customer will subscribe to the term deposit plan. (Duration measures the length of the last contact; assuming it is recorded in seconds, as in the public UCI Bank Marketing data, 646 corresponds to just under 11 minutes.) Further down the tree there are additional splits on balance (7.5), age (41) and duration (771): on that branch, customers with a duration below 771 are predicted not to subscribe, while those with a duration above 771 are predicted to subscribe. In the node for the subscribe outcome, 71% of customers did subscribe and 29% did not. This node is therefore not very pure, because a large share of the customers in it did not subscribe.

5 4(b)

The rule for predicting that a customer will not subscribe is: if their duration was less than 646 and their previous outcome was failure, other or unknown, the classification tree predicts that they will not subscribe. In this node, 93% of customers did not subscribe and 7% did, so the node is highly pure. These proportions also give the estimated probability that a customer in this node subscribes to the deposit plan, which is about 7%.
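Node purity can be put on a single scale with the Gini impurity, 1 - sum(p^2), where 0 means a perfectly pure node. The sketch below is only an illustration and assumes Gini is the purity measure of interest (it is the default splitting criterion in rpart, but the report does not say which was used). The function name gini_impurity is hypothetical; the proportions are the ones quoted in 4(a) and 4(b).

```r
# Gini impurity of a node from its class proportions: 1 - sum(p^2).
# 0 means perfectly pure; 0.5 is the maximum for two classes.
gini_impurity <- function(p) {
  stopifnot(abs(sum(p) - 1) < 1e-8)  # proportions must sum to 1
  1 - sum(p^2)
}

# Class proportions quoted in the report for the two nodes.
node_subscribe     <- c(no = 0.29, yes = 0.71)  # node from 4(a)
node_not_subscribe <- c(no = 0.93, yes = 0.07)  # node from 4(b)

gini_subscribe     <- gini_impurity(node_subscribe)
gini_not_subscribe <- gini_impurity(node_not_subscribe)

print(round(c(subscribe = gini_subscribe,
              not_subscribe = gini_not_subscribe), 3))
```

The 4(b) node has a much lower impurity than the 4(a) node, which matches the description of it as the purer of the two.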

6 4(c)

The most important predictor in the classification tree is duration, which accounts for 58% of the total predictive power of the model. This is followed by previous campaign outcome at 34%. Marital status contributes 3%, while balance and age each contribute about 2%. Together these five variables add up to about 99% to 100% after rounding. The other variables, such as loan and number of previous contacts, contribute less than 1% and therefore have minimal impact on the model’s predictive performance.

7 5 - Training Dataset

The confusion matrix for the training dataset gives an overall accuracy of 90% (2714/3000).

8 (a)

The overall accuracy on the testing dataset is also 90% (1203/1334). The classification tree performed well in identifying customers who did not subscribe, while also identifying some of the subscribers.

Because the overall accuracy is almost identical for the training and testing datasets, at about 90%, the classification tree does not appear to be overfitting the training dataset, and so we do not need to consider pruning the tree.
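The overfitting check above comes down to comparing two proportions. A minimal sketch, using only the counts reported above (the variable names are my own):

```r
# Accuracy = correctly classified customers / all customers.
train_accuracy <- 2714 / 3000
test_accuracy  <- 1203 / 1334

# A large positive gap (training much better than testing) would suggest overfitting.
accuracy_gap <- train_accuracy - test_accuracy

print(round(c(train = train_accuracy,
              test  = test_accuracy,
              gap   = accuracy_gap), 4))
```

The gap is well under one percentage point, which supports the conclusion that pruning is not needed.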

8.1 Part C - Binary Logistic Regression

9 6

[1] "Yes" "No" 
[1] "Yes" "No" 

10 7


Call:
glm(formula = subscribed ~ marital + prev_outcome, family = binomial(link = "logit"), 
    data = bank_train)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)           1.4923     0.2161   6.907 4.97e-12 ***
maritalmarried        0.5568     0.1728   3.222  0.00127 ** 
maritalsingle         0.1234     0.1849   0.667  0.50457    
prev_outcomeother    -0.5344     0.2650  -2.016  0.04376 *  
prev_outcomesuccess  -2.6427     0.2959  -8.931  < 2e-16 ***
prev_outcomeunknown   0.4096     0.1751   2.339  0.01931 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2197.6  on 2999  degrees of freedom
Residual deviance: 2023.5  on 2994  degrees of freedom
AIC: 2035.5

Number of Fisher Scoring iterations: 5

11 (a)

The dummy variables that were omitted are ‘maritaldivorced’ and ‘prev_outcomefailure’. These are the reference categories against which the other levels of each variable are compared.
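R omits the first level of each factor when it builds dummy variables, and by default the levels are in alphabetical order. The sketch below shows this on a small made-up factor (hypothetical values, not the bank data), using model.matrix to display the dummy columns that a model formula would create.

```r
# Hypothetical toy factor, for illustration only (not the bank data).
marital <- factor(c("married", "single", "divorced", "married"))

# Levels default to alphabetical order, so "divorced" comes first.
marital_levels <- levels(marital)
print(marital_levels)

# model.matrix builds the dummy columns; the first level gets no column
# of its own and becomes the reference category.
dummy_columns <- colnames(model.matrix(~ marital))
print(dummy_columns)
```

The same logic would explain why ‘failure’ is the reference category for the previous outcome variable, since it comes first alphabetically among failure, other, success and unknown.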

12 (b)

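The regression equation is: ln(pi/(1 - pi)) = 1.49 + 0.56 * maritalmarried + 0.12 * maritalsingle - 0.53 * prev_outcomeother - 2.64 * prev_outcomesuccess + 0.41 * prev_outcomeunknown, where pi is the probability of the outcome modelled by glm and each predictor is a 0/1 dummy variable.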
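To show how the equation is used, the sketch below converts log-odds into a probability with plogis, the inverse of the logit. The coefficients are copied from the glm output above; the helper log_odds and its argument names are hypothetical. The probability is for whichever outcome glm treats as the event, which for a factor response is the level that is not first.

```r
# Coefficients copied from the glm summary above.
coefs <- c(intercept           =  1.4923,
           maritalmarried      =  0.5568,
           maritalsingle       =  0.1234,
           prev_outcomeother   = -0.5344,
           prev_outcomesuccess = -2.6427,
           prev_outcomeunknown =  0.4096)

# Hypothetical helper: log-odds for a customer described by 0/1 dummies.
# All zeros means the reference customer (divorced, previous outcome failure).
log_odds <- function(married = 0, single = 0, other = 0, success = 0, unknown = 0) {
  unname(coefs["intercept"] +
           coefs["maritalmarried"]      * married +
           coefs["maritalsingle"]       * single +
           coefs["prev_outcomeother"]   * other +
           coefs["prev_outcomesuccess"] * success +
           coefs["prev_outcomeunknown"] * unknown)
}

# plogis(x) = 1 / (1 + exp(-x)) turns log-odds into a probability of the
# modelled outcome.
p_reference       <- plogis(log_odds())
p_married_success <- plogis(log_odds(married = 1, success = 1))

print(round(c(reference = p_reference,
              married_success = p_married_success), 3))
```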

13 (c)

Each p-value corresponds to a hypothesis test on its coefficient. For marital single, the test is:

H0: The coefficient for marital single equals 0, i.e. marital single is not important in predicting whether customers subscribe.

HA: The coefficient for marital single does not equal 0, i.e. marital single is important in predicting whether customers subscribe.

The p-value for marital single is 0.50457, which is much greater than 0.05, so we fail to reject H0 and conclude that there is no evidence that marital single is an important predictor of the subscription outcome.

The predictor variables that are significant are ‘maritalmarried’, ‘prev_outcomeother’, ‘prev_outcomesuccess’ and ‘prev_outcomeunknown’. Their p-values are all less than 0.05, so for each of them we reject H0 in favour of HA and conclude that the coefficient does not equal 0. These are our important variables.
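The z value and p-value in the glm output can be reproduced by hand from the estimate and its standard error. A short sketch for marital single, using the numbers from the coefficient table (variable names are my own):

```r
# Estimate and standard error for maritalsingle, from the glm summary.
estimate  <- 0.1234
std_error <- 0.1849

# Wald z statistic: how many standard errors the estimate is from 0.
z_value <- estimate / std_error

# Two-sided p-value from the standard normal distribution.
p_value <- 2 * pnorm(-abs(z_value))

print(round(c(z = z_value, p = p_value), 4))
```

The result agrees with the 0.667 and 0.50457 shown in the output, up to rounding of the inputs.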

14 (d)

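The impact of each predictor variable is determined by the sign and size of its regression coefficient, relative to the omitted reference category. With a factor response, glm models the log-odds of the second level. Assuming the levels printed in question 6 ("Yes", "No") are those of the subscribed variable, the modelled outcome is "No", so a positive coefficient increases the odds of not subscribing and a negative coefficient increases the odds of subscribing.

The ‘maritalmarried’ coefficient (0.56) is positive and significant. Compared to divorced customers, married customers have higher odds of not subscribing, so they are less likely to subscribe to the term deposit plan.

The ‘maritalsingle’ coefficient (0.12) is small and positive, but it is not significant, so there is no evidence that single customers differ from divorced customers.

The ‘prev_outcomeother’ coefficient (-0.53) is negative. Compared to customers whose previous outcome was failure, customers with an ‘other’ outcome have lower odds of not subscribing, so they are more likely to subscribe.

The ‘prev_outcomesuccess’ coefficient (-2.64) is negative and is the largest effect in the model. Customers with a previous successful outcome are much more likely to subscribe than customers with a previous failure, which agrees with the visual exploration in Part A.

The ‘prev_outcomeunknown’ coefficient (0.41) is positive. Customers with an unknown previous outcome are somewhat less likely to subscribe than customers with a previous failure.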
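Coefficients are easier to compare as odds ratios, obtained by exponentiating them. The sketch below uses the estimates from the glm output; the odds are those of the outcome modelled by glm, relative to the reference category of each variable. An odds ratio above 1 raises those odds and one below 1 lowers them.

```r
# Slope coefficients copied from the glm summary (intercept left out).
coefs <- c(maritalmarried      =  0.5568,
           maritalsingle       =  0.1234,
           prev_outcomeother   = -0.5344,
           prev_outcomesuccess = -2.6427,
           prev_outcomeunknown =  0.4096)

# exp(coefficient) is the multiplicative change in the odds of the
# modelled outcome compared to the reference category.
odds_ratios <- exp(coefs)

print(round(odds_ratios, 3))
```

For example, the odds ratio for prev_outcomesuccess is about 0.07, so the odds of the modelled outcome are roughly 14 times lower for these customers than for customers with a previous failure.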

15 8

The overall accuracy from the confusion matrix for the training data is 89% (2669/3000).

The overall accuracy from the confusion matrix for the testing data is 90% (1198/1334).

The regression model achieves high overall accuracy on both the training and testing datasets. It does a good job of identifying customers who did not subscribe. However, it performs poorly at identifying customers who did subscribe. Because identifying likely subscribers is the purpose of the campaign, the model is of limited practical use despite its high accuracy.
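High accuracy can hide poor detection of subscribers when most customers do not subscribe. The sketch below illustrates this with a made-up confusion matrix. The counts are hypothetical and are NOT the confusion matrices from this report; they only show how accuracy, sensitivity and specificity are calculated.

```r
# Hypothetical counts, for illustration only (NOT the report's results).
# Rows are predictions, columns are actual outcomes.
cm <- matrix(c(85, 3,    # actual "No":  85 predicted No, 3 predicted Yes
               9, 3),    # actual "Yes":  9 predicted No, 3 predicted Yes
             nrow = 2,
             dimnames = list(predicted = c("No", "Yes"),
                             actual    = c("No", "Yes")))

# Share of all customers classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual subscribers that the model finds.
sensitivity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])

# Share of actual non-subscribers that the model finds.
specificity <- cm["No", "No"] / sum(cm[, "No"])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```

In this made-up example the accuracy is 88% even though only a quarter of the subscribers are identified, which is the same pattern described above.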

15.1 Part D - Model Comparison & Marketing Actions

16 9

The classification tree performed better for this task, as it was more effective at identifying customers who would subscribe and provided clearer decision rules. In contrast, the logistic regression model achieved high overall accuracy, but this was mainly because it predicted non-subscribers very well, while it performed poorly at correctly identifying subscribers. This limits its usefulness for a bank marketing campaign that aims to target customers with a high probability of subscribing.

17 10

I find the classification tree simpler to interpret, as it produces clear rules that the bank can apply in practice when deciding who to contact. The company should therefore use the classification tree model. It also performs better for the bank’s objective, which is to identify customers who are likely to subscribe to the term deposit plan.

Its rules are easy for managers to understand, and this transparency allows the company to translate the model’s output directly into marketing decisions, such as prioritising customers with longer durations or previous successful outcomes.

The binary logistic regression model is still useful. Its p-values indicate which predictors are significant, and the sign and size of each coefficient show the direction and strength of each effect. The larger a positive coefficient is, the more strongly it increases the probability of the modelled outcome, while a negative coefficient decreases that probability.

Whether an effect is good or bad depends on the coefficient, on the variable it belongs to and on the business context.

Overall, the classification tree is more accurate for identifying potential subscribers, easier to interpret, and more suitable for marketing actions, making it the better choice for the bank.

18 11(a)

Based on the classification tree model, the most important driver of customer subscription is duration, which accounts for 58% of the model’s predictive power. Customers with a longer interaction duration are much more likely to subscribe to the term deposit plan. The second most important driver is the previous campaign outcome (failure, other, success or unknown), which accounts for around 34% of the model’s predictive power, with customers who previously responded successfully showing a much higher likelihood of subscribing.

Marital status also plays a role in predicting subscription, although its contribution is smaller, at 3%. The tree uses it only under certain conditions, in combination with other factors such as duration and previous outcome. In the rule from 4(a), for example, customers with a long duration who are not married are predicted to subscribe.

Overall, the classification tree shows that duration and previous outcome are the main drivers of subscription.

19 11(b)

The classification tree can be used to help the bank target customers who are most likely to subscribe to the term deposit plan. The bank can prioritise customers with characteristics associated with higher subscription rather than just randomly contacting all customers.

To improve subscriptions to the term deposit plan, the bank could prioritise in future campaigns the customers who have a longer interaction duration and a previous successful campaign outcome, as the tree shows that these customers are more likely to subscribe. Marital status can be used as a secondary criterion where the tree’s rules include it.

The bank could tailor marketing content to customers with a previous successful outcome, for example by designing personalised messages that build on their earlier engagement. For married customers, it could emphasise content on long-term financial security and savings options.

The bank could also develop a strategy for when and how to contact customers, and reduce contact with customers who have short durations or who fall into the nodes of the classification tree with low subscription rates.

The predictive model helps us analyse these important variables and factors to help us make better marketing decisions for the bank.