Predictive Analytics

Author

Aisling

Part A- Visual Exploration

library(rattle)
library(rpart)
library(tidyverse)
library(knitr)
library(kableExtra)

1

bank_train <- read_csv("bank_sub_train.csv")
bank_test <- read_csv("bank_sub_test.csv")

2

Subscription based on Age

ggplot(data = bank_train) +
  geom_boxplot(mapping = aes(x = subscribed, y = age)) + 
      labs(title = "Subscription Age Breakdown",
       x = "Subscription",
       y = "Age")

Based on the above box plot we can clearly see that the majority of the banks overall customer demographic sits within the 30-50 age group. There is a larger interquartile range for the people who did subscribe in comparison to those who did not, and those that did subscribe also have fewer outliers. However for both subscribers and non-subscribers the median age is around 40, with non-subscribers sitting just slightly below.

Subscription based on Marital Status

ggplot(data = bank_train) +
  geom_bar(mapping = aes(x = subscribed, fill = marital), position = "dodge") +
                  labs(title = "Subscription Based on Marital Status",
                  x = "Subscription",
                  y = NULL,
                  fill = "Marital Status") +
  scale_fill_discrete(labels = c("divorced" = "Divorced", "married" = "Married", "single" = "Single"))

From the above bar chart we can see that the majority of overall customers are married (regardless of subscription), and that married customers are highest, and divorced customers are lowest, in both the subscriber and non-subscriber categories.

Subscription based on Current Loan Status

ggplot(data = bank_train) +
  geom_bar(mapping = aes(x = subscribed, fill = loan), position = "dodge") +
       labs(title = "Subscription Based on Loan Status",
       x = "Subscription",
       y = NULL,
       fill = "Currently has Loan") +
  scale_fill_discrete(labels = c("no" = "No", "yes" = "Yes"))

The vast majority of customers for both subscribers and non-subscribers do not currently have a loan out with the bank.

Subscription based on Number of Days Since Customer Last Subscribed to Product/Plan with Bank

ggplot(data = bank_train) +
  geom_boxplot(mapping = aes(x = subscribed, y = duration)) +
       labs(title = "Subscription Based on No. Days Since Customer Last Subscribed",
       x = "Subscription",
       y = NULL)

We can see from the boxplot that those who subscribed to the new plan have a higher duration of days since they last subscribed to a product or plan with the bank, whereas non-subscribers have a much lower duration since their last subscription.

Subscription based on Average Number of Times Customer Has Been Contacted by the Bank

num_times_contact <- bank_train %>%
  group_by(subscribed) %>%
           summarise("Number of Times Contacted By Bank" = mean(num_prev_contacts, na.rm = TRUE))

kable(num_times_contact,
      col.names = c("Subscribed", "Average Number of Times Contacted by Bank"),
      align = "lr",
      format = "html",
      table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = TRUE) %>%
    row_spec(0, bold = TRUE, color = "white", background = "black", underline = TRUE)
Subscribed Average Number of Times Contacted by Bank
No 0.4744415
Yes 1.1225627

We can see from the above table that the customers who subscribed had, on average, received over double the amount of contacts from the bank.

Subscription based on Previous outcome from Bank Communication

ggplot(data = bank_train) +
  geom_bar(mapping = aes(x = subscribed, fill = prev_outcome), position = "dodge")  +
  labs(title = "Subscription Based on Previous oucome from Bank Communication",
       x = "Subscription",
       y = NULL,
       fill = "Previous Outcome") +
      scale_fill_discrete(labels = c("failure" = "Failure", "other" = "Other", "success" = "Success", "unknown" = "Unknown"))

Here we can see that the majority of calls do not know if the call was a success or failure. This is a problematic indicator as it suggests that the effectiveness of the bank calls are not being correctly tracked. The bank could more accurately measure this if they specified whether success or failure was limited to the allotted phone call time (if customers are suggesting that they may subscribe in the future but not over the phone, then this should be logged as a failure or a new specified category), or customers who do subscribe at a later date should be investigated to see if they had previously been contacted to assess the true impact of the bank contacts.

Part B- Classification Trees

3

bank_tree <- rpart(subscribed ~ age + marital + loan + duration + num_prev_contacts + prev_outcome,
             data = bank_train)
fancyRpartPlot(bank_tree)

4 a

If a customer has had a duration of less than 646 days since their last subscription, and the outcome of their last phone call from the bank was classified under failure, other, or unknown, then that customer is predicted to not subscribe. The final leaf in this path indicated that within this group of customers, 93% of them did not subscribe, while 7% of them did indicating a high purity as the percentage difference between them is very different.

4 b

If a customer has had a duration of more than 646 days since their last subscription and they are not classified as married, then that customer is predicted to subscribe. The final leaf in this path indicated that within this group of customers, only 64% of them subscribed with 36% not subscribing. This is still an acceptable level of difference between the two percentages, however this leaf is clearly not as pure as the first.

4 c

Base on the tree itself we can see that the most important variables for predicting subscription are the duration since their last subscription, the previous outcome of their last contact with the bank, their marital status and their age. All of these variables are featured in the classification tree. However, we can take it further and assess the exact variable importance.

bank_tree$variable.importance
         duration      prev_outcome           marital               age 
      94.89248212       54.24133524        4.78150081        2.93875173 
             loan num_prev_contacts 
       0.06373025        0.04226957 
summary(bank_tree)
Call:
rpart(formula = subscribed ~ age + marital + loan + duration + 
    num_prev_contacts + prev_outcome, data = bank_train)
  n= 3000 

          CP nsplit rel error    xerror       xstd
1 0.05849582      0 1.0000000 1.0000000 0.04951951
2 0.04456825      2 0.8830084 0.9247911 0.04786377
3 0.01532033      3 0.8384401 0.8690808 0.04657328
4 0.01114206      5 0.8077994 0.8579387 0.04630819
5 0.01000000      6 0.7966574 0.8412256 0.04590599

Variable importance
    duration prev_outcome      marital          age 
          60           35            3            2 

Node number 1: 3000 observations,    complexity param=0.05849582
  predicted class=No   expected loss=0.1196667  P(node) =1
    class counts:  2641   359
   probabilities: 0.880 0.120 
  left son=2 (2758 obs) right son=3 (242 obs)
  Primary splits:
      duration          < 645.5 to the left,  improve=88.179720, (0 missing)
      prev_outcome      splits as  LLRL,      improve=49.703030, (0 missing)
      num_prev_contacts < 0.5   to the left,  improve=16.656260, (0 missing)
      age               < 60.5  to the left,  improve=11.538120, (0 missing)
      marital           splits as  RLR,       improve= 3.431474, (0 missing)

Node number 2: 2758 observations,    complexity param=0.05849582
  predicted class=No   expected loss=0.08375635  P(node) =0.9193333
    class counts:  2527   231
   probabilities: 0.916 0.084 
  left son=4 (2690 obs) right son=5 (68 obs)
  Primary splits:
      prev_outcome      splits as  LLRL,      improve=53.968160, (0 missing)
      duration          < 211.5 to the left,  improve=23.274390, (0 missing)
      num_prev_contacts < 0.5   to the left,  improve=15.599510, (0 missing)
      age               < 60.5  to the left,  improve=10.263590, (0 missing)
      loan              splits as  RL,        improve= 2.582688, (0 missing)

Node number 3: 242 observations,    complexity param=0.04456825
  predicted class=Yes  expected loss=0.4710744  P(node) =0.08066667
    class counts:   114   128
   probabilities: 0.471 0.529 
  left son=6 (136 obs) right son=7 (106 obs)
  Primary splits:
      marital           splits as  RLR,       improve=4.7815010, (0 missing)
      duration          < 771   to the left,  improve=2.4103630, (0 missing)
      num_prev_contacts < 1.5   to the left,  improve=1.5679810, (0 missing)
      prev_outcome      splits as  RRLL,      improve=0.6950792, (0 missing)
      age               < 54.5  to the left,  improve=0.6227563, (0 missing)
  Surrogate splits:
      age      < 37.5  to the right, agree=0.607, adj=0.104, (0 split)
      duration < 669.5 to the right, agree=0.579, adj=0.038, (0 split)

Node number 4: 2690 observations
  predicted class=No   expected loss=0.06802974  P(node) =0.8966667
    class counts:  2507   183
   probabilities: 0.932 0.068 

Node number 5: 68 observations
  predicted class=Yes  expected loss=0.2941176  P(node) =0.02266667
    class counts:    20    48
   probabilities: 0.294 0.706 

Node number 6: 136 observations,    complexity param=0.01532033
  predicted class=No   expected loss=0.4411765  P(node) =0.04533333
    class counts:    76    60
   probabilities: 0.559 0.441 
  left son=12 (52 obs) right son=13 (84 obs)
  Primary splits:
      duration          < 758   to the left,  improve=2.19801800, (0 missing)
      age               < 40.5  to the left,  improve=1.69875400, (0 missing)
      prev_outcome      splits as  LRRL,      improve=0.91191440, (0 missing)
      num_prev_contacts < 1.5   to the left,  improve=0.84033610, (0 missing)
      loan              splits as  RL,        improve=0.07951318, (0 missing)
  Surrogate splits:
      age               < 26.5  to the left,  agree=0.632, adj=0.038, (0 split)
      num_prev_contacts < 9.5   to the right, agree=0.625, adj=0.019, (0 split)

Node number 7: 106 observations
  predicted class=Yes  expected loss=0.3584906  P(node) =0.03533333
    class counts:    38    68
   probabilities: 0.358 0.642 

Node number 12: 52 observations,    complexity param=0.01114206
  predicted class=No   expected loss=0.3269231  P(node) =0.01733333
    class counts:    35    17
   probabilities: 0.673 0.327 
  left son=24 (38 obs) right son=25 (14 obs)
  Primary splits:
      duration < 671.5 to the right, improve=3.8244650000, (0 missing)
      age      < 52    to the left,  improve=1.1378450000, (0 missing)
      loan     splits as  LR,        improve=0.0008944544, (0 missing)
  Surrogate splits:
      prev_outcome splits as  LRLL, agree=0.75, adj=0.071, (0 split)

Node number 13: 84 observations,    complexity param=0.01532033
  predicted class=Yes  expected loss=0.4880952  P(node) =0.028
    class counts:    41    43
   probabilities: 0.488 0.512 
  left son=26 (37 obs) right son=27 (47 obs)
  Primary splits:
      age               < 40.5  to the left,  improve=2.35801900, (0 missing)
      duration          < 858.5 to the right, improve=1.51730500, (0 missing)
      prev_outcome      splits as  LRRL,      improve=0.48285710, (0 missing)
      num_prev_contacts < 1.5   to the left,  improve=0.32938660, (0 missing)
      loan              splits as  RL,        improve=0.08328886, (0 missing)
  Surrogate splits:
      duration < 1016  to the right, agree=0.655, adj=0.216, (0 split)
      loan     splits as  RL,        agree=0.571, adj=0.027, (0 split)

Node number 24: 38 observations
  predicted class=No   expected loss=0.2105263  P(node) =0.01266667
    class counts:    30     8
   probabilities: 0.789 0.211 

Node number 25: 14 observations
  predicted class=Yes  expected loss=0.3571429  P(node) =0.004666667
    class counts:     5     9
   probabilities: 0.357 0.643 

Node number 26: 37 observations
  predicted class=No   expected loss=0.3783784  P(node) =0.01233333
    class counts:    23    14
   probabilities: 0.622 0.378 

Node number 27: 47 observations
  predicted class=Yes  expected loss=0.3829787  P(node) =0.01566667
    class counts:    18    29
   probabilities: 0.383 0.617 

We can see here that the duration of time between the last subscription accounts for 60% of improvements to the classification tree, making it the most important variable in the classification process. The previous outcome of the last call accounts for a 35% improvement and marital status and age account for 3% and 2% of improvements respectively.

5

train_probs <- predict(bank_tree, newdata = bank_train, type = 'prob')
train_preds <- predict(bank_tree, newdata = bank_train, type = 'class')
bank_train_updated <- cbind(bank_train, train_probs, train_preds)

train_con_mat <- table(bank_train_updated$subscribed, bank_train_updated$train_preds, dnn = c('Actual', 'Predicted'))

train_con_mat
      Predicted
Actual   No  Yes
   No  2560   81
   Yes  205  154

From the above confusion matrix we can see that:

  • The overall model accuracy is (2560+154)/5700 = 0.476 or 48%
  • Of all customers the model predicted to subscribe, they got 154/235 = 0.6553 or 66%
  • Of all customers the model predicted to not subscribe, they got 2560/2765 = 0.926 or 93%
  • Of all customers who did subscribe, the model correctly identified 154/359 = 0.429 or 43%
  • Of all customers who did not subscribe, the model correctly identified 2560/2641 = 0.969 or 97%
test_probs <- predict(bank_tree, newdata = bank_test, type = 'prob')
test_preds <- predict(bank_tree, newdata = bank_test, type = 'class')
bank_test_updated <- cbind(bank_test, test_probs, test_preds)

test_con_mat <- table(bank_test_updated$subscribed, bank_test_updated$test_preds, dnn = c('Actual', 'Predicted'))
test_con_mat
      Predicted
Actual   No  Yes
   No  1144   47
   Yes   84   59

From the above confusion matrix we can see that:

  • The overall model accuracy is (1144 + 59)/1334 = 0.902 or 90%
  • Of all customers the model predicted to subscribe, they got 59/106 = 0.557 or 56%
  • Of all customers the model predicted to not subscribe, they got 1144/1228 = 0.931 or 93%
  • Of all customers who did subscribe, the model correctly identified 59/143 = 0.413 or 41%
  • Of all customers who did not subscribe, the model correctly identified 1144/1191 = 0.960 or 96%

5 a

The model is actually more accurate for the testing data in comparison with the training data with a 90% success rate. Every other measurement has very close accuracy. Therefore the classification tree does not appear to be over fitting the training data set and so we do not need to consider pruning the tree.

Part C- Binary Logistic Regression

6

bank_train$subscribed <- factor(bank_train$subscribed, levels = c("No", "Yes"))
bank_test$subscribed <- factor(bank_test$subscribed, levels = c("No", "Yes"))

levels(bank_train$subscribed)
[1] "No"  "Yes"
levels(bank_test$subscribed)
[1] "No"  "Yes"

7

bank_lr <- glm(subscribed ~ age + marital + loan + duration + num_prev_contacts + prev_outcome, data = bank_train, family = binomial(link = 'logit'))

summary(bank_lr)

Call:
glm(formula = subscribed ~ age + marital + loan + duration + 
    num_prev_contacts + prev_outcome, family = binomial(link = "logit"), 
    data = bank_train)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -3.2625797  0.4154593  -7.853 4.06e-15 ***
age                  0.0131616  0.0065349   2.014  0.04400 *  
maritalmarried      -0.5336389  0.1924714  -2.773  0.00556 ** 
maritalsingle       -0.0082911  0.2211008  -0.037  0.97009    
loanyes             -0.7032756  0.2170179  -3.241  0.00119 ** 
duration             0.0036866  0.0002219  16.617  < 2e-16 ***
num_prev_contacts   -0.0004907  0.0418138  -0.012  0.99064    
prev_outcomeother    0.5742575  0.2980724   1.927  0.05403 .  
prev_outcomesuccess  2.7135397  0.3165944   8.571  < 2e-16 ***
prev_outcomeunknown -0.4488613  0.2305195  -1.947  0.05151 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2197.6  on 2999  degrees of freedom
Residual deviance: 1650.1  on 2990  degrees of freedom
AIC: 1670.1

Number of Fisher Scoring iterations: 5

7 a

Categorical variables omitted:

  • marital- divorced
  • loan- no
  • prev_outcome- failure

7 b

y = ln(pi/1-pi) = -3.27 + 0.01(age) - 0.53(maritalmarried) - 0.01(maritalsingle) - 0.7(loanyes) + 0.0037(duration) - 0.00049(num_prev_contacts) + 0.57(prev_outcomeother) + 2.71(prev_outcomesuccess) - 0.45(prev_outcomeunknown)

7 c

The importance of each predictor variable can be found by looking at the Pr(>|z|) column. These p-values are interpreted as follows:

H0: the coefficient for maritalmarried = 0, i.e. maritalmarried is not important in predicting customer subscription

HA: the coefficient for maritalmarried ≠ 0, i.e. maritalmarried is important in predicting customer subscription

If the p-value is less than or equal to 0.05, we accept the alternative hypothesis (HA).

Based on this, the significant variables are:

  • maritalmarried
  • loanyes
  • duration
  • prev_outcomesuccess

The significant variables are also indicated with an asterisk * beside the Pr(>|z|) column.

7 d

The sign of the coefficient tells us how each variable impacts the probability of “success”

The coefficients for prev_outcomeother and prev_outcomesuccess are both positive, meaning if a customer is marked as having either of these from their last call, there is a higher chance of them subscribing. If the coefficient is negative, as it is in prev_outcomeunknown, there is a higher chance that the customer will not subscribe.

This works the same for numerical variables, for example age has a positive variable, indicating that the older someone is, the more likely they are to subscribe.

  • Age: The older a person is the more likely they are to subscribe
  • Maritalmarried: Married people are less likely to subscribe
  • Maritalsingle: Single people are less likely to subscribe
  • Loanyes: Those with a loan currently out are less likely to subscribe
  • Duration: The longer a duration in days since the last subscription, the more likely a customer is to subscribe
  • Num_prev_contacts: The higher the number of previous contacts, the less likely a customer is to subscribe
  • Prev_outcomeother: Customers with a previous outcome of “other” are more likely to subscribe
  • Prev_outcomesuccess: Customers with a previous outcome of “success” are more likely to subscribe
  • Prev_outcomeunknown: Customers with a previous outcome of “unknown” are less likely to subscribe

8

train_pi <- predict(bank_lr, newdata = bank_train, type = 'response')

bank_train_updated <- bank_train %>%
  mutate(pi = train_pi) %>%
  mutate(prediction = case_when(pi > 0.5 ~ 'Yes', 
                                pi <= 0.5 ~ 'No'))

train_lr_con_mat <- table(bank_train_updated$subscribed, bank_train_updated$prediction, dnn = c('Actual', 'Predicted'))

train_lr_con_mat
      Predicted
Actual   No  Yes
   No  2588   53
   Yes  251  108

From the above confusion matrix we can see that:

  • The overall model accuracy is (2588 + 110)/3000= 0.899 or 90%
  • Of all customers the model predicted to subscribe, they got 110/163 = 0.67 or 67%
  • Of all customers the model predicted to not subscribe, they got 2588/2837 = 0.91 or 91%
  • Of all customers who did subscribe, the model correctly identified 110/359 = 0.306 or 31%
  • Of all customers who did not subscribe, the model correctly identified 2588/2641 = 0.9799 or 98%
test_pi <- predict(bank_lr, newdata = bank_test, type = 'response')

bank_test_updated <- bank_test %>%
  mutate(pi = test_pi) %>%
  mutate(prediction = case_when(pi > 0.5 ~ 'Yes', 
                                pi <= 0.5 ~ 'No'))

test_lr_con_mat <- table(bank_test_updated$subscribed, bank_test_updated$prediction, dnn = c('Actual', 'Predicted'))

test_lr_con_mat
      Predicted
Actual   No  Yes
   No  1159   32
   Yes   97   46

From the above confusion matrix we can see that:

  • The overall model accuracy is (1159 + 44)/1334= 0.90 or 90%
  • Of all customers the model predicted to subscribe, they got 44/76 = 0.5789 or 58%
  • Of all customers the model predicted to not subscribe, they got 1159/1258 = 0.92 or 92%
  • Of all customers who did subscribe, the model correctly identified 44/143 = 0.30769 or 31%
  • Of all customers who did not subscribe, the model correctly identified 1159/1191 = 0.97 or 97%

The training and testing data are very similar to each other so we can assume that the algorithm is not over fitting the training data.

Part D- Model Comparison and Marketing Actions

9

Logistical Regression Model

  • The overall model accuracy is (1159 + 44)/1334= 0.90 or 90%
  • Of all customers the model predicted to subscribe, they got 44/76 = 0.5789 or 58%
  • Of all customers the model predicted to not subscribe, they got 1159/1258 = 0.92 or 92%
  • Of all customers who did subscribe, the model correctly identified 44/143 = 0.30769 or 31%
  • Of all customers who did not subscribe, the model correctly identified 1159/1191 = 0.97 or 97%

Classification Tree Model

  • The overall model accuracy is (1144 + 59)/1334 = 0.902 or 90%
  • Of all customers the model predicted to subscribe, they got 59/106 = 0.557 or 56%
  • Of all customers the model predicted to not subscribe, they got 1144/1228 = 0.931 or 93%
  • Of all customers who did subscribe, the model correctly identified 59/143 = 0.413 or 41%
  • Of all customers who did not subscribe, the model correctly identified 1144/1191 = 0.960 or 96%

The accuracy model scores are very similar, with the logistic regression model coming in as slightly more accurate. However, in terms of use, the model needs the best score for predicting who will not subscribe so the bank can accurately target them in the future. The model is very accurate for both the logistic regression and the classification tree for identifying these customers, with 97% (logistical regression) and 96% (classification tree) correctly identified.

10

I believe that the classification tree model should be used in this situation. It will be easier to present and follow for higher ups in the company. As well as this, the classification tree model is only 1% less accurate at identifying which customers will not subscribe, meaning that results should be relatively similar regardless of which model the bank uses.

11 a

The main drivers for consumption are:

  • duration- This does appear significant in the visual exploration, as the box plot for those who did subscribe is placed higher than those who did not subscribe (longer duration since last subscription for those who did subscribe) . This matches with the classification tree where nodes over a certain duration more are more likely to lead to a subscription.

  • prev_outcome- This did not appear to be significant when looking at the visual exploration, due to the significant number of unknown phone call outcomes. Visually, it looks like a similar outcome from both subscribers and non-subscribers.

  • marital- This did not appear as significant when looking at the visual exploration because there was no indication that married customers would be more or less likely to subscribe from the bar chart.

  • age- This again did not appear significant in the initial visual exploration as the age breakdown of subscribers and non-subscribers was very similar. In the classification tree, we did see that the over 41 age group are the more likely they are to subscribe which would match with the visual exploration.

11 b

  • Firstly, we want to look at targeting those customers who have not subscribed to a bank plan for a long period of time. We can see from the first level of the classification tree that the customers who last subscribed to a plan over 646 days ago are more likely to subscribe to the new bank product. These customers should be targeted using personalised emails or phone calls. The bank could also use the previous outcome of their last phone call to decide whether or not a call will be the most effective form of communication.

  • The bank should also set up a new system for measuring success vs failure for the previous outcome of phone calls. Too many calls are being classified as unknown indicating that the system is either not correctly tracking the data or too many customers are being allowed fall through the cracks. Creating a new set of rules i.e. customer must agree over the phone or they are classified as a failure, customer account must be tracked after phone call to see if they are to be classified as a success or failure, etc. These new classifications should lead to more useful insights for the prediction models.

  • The bank should focus on customers whose last subscription to a bank plan was under 646 days ago, and who reacted positively to previous communication (prev_outcome success). Those who reacted positively seem to have a higher chance of subscribing again. Focusing on this segment with communications that promote the benefits of each new plan should increase subscriptons.

  • Customers who are married and aged over 41 are likely to subscribe and should receive specific, targeted material that focuses on lifetime stability, children in college and mortgage payments. These concepts would be familiar with this demographic and may increase engagement and subscriptions.

  • The bank should set up some form of automatic signal internally to target married customers who have a duration of 672-758 days since their last subscription. This gives the bank a window of 86 days to work with. They should consider setting up a specific marketing communications strategy for this window of time to increase their subscriptions within this segment, e.g. informational email on day 1, personalised email on day 5, phone call on day 10, etc.