Predictive Analytics Assignment 20%

Author

Joanna Zaremba

Setting Up

Warning: package 'rattle' was built under R version 4.4.2
Warning: package 'rpart' was built under R version 4.4.2
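The setup chunk that produced these messages is not shown; a minimal sketch of the packages it most likely loads (inferred from the functions used later: read_csv(), ggplot(), mutate(), rpart() and fancyRpartPlot()) would be:

library(tidyverse)   # read_csv(), ggplot2 and dplyr
library(rpart)       # classification trees
library(rattle)      # fancyRpartPlot()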

Q1

Q1, 1

Loading the necessary datasets

bank_sub_train <- read_csv("bank_sub_train.csv")
bank_sub_test <- read_csv("bank_sub_test.csv")

Q1, 2

Carrying out a visual exploration of the relationship between whether a customer subscribed or not, and each of the potential predictor variables

For this part of the analysis, I will use the ggplot2 package to create graphs displaying the relationship between whether a customer subscribed or not and the other variables in the dataset.

  • For categorical variables I will use clustered bar charts.

  • For numerical variables I will use boxplots.

Q1, 2 (a)

Categorical:

ggplot(data = bank_sub_train) +
  geom_bar(mapping = aes(x = subscribed, fill = marital),position = "dodge") +
  scale_fill_manual(values = c("red1", "green3", "turquoise2")) + 
  ggtitle("Subscription by marital status") +
  xlab("Subscribed") +
  ylab("Count of Customers") +
  theme_minimal() +
  theme(panel.grid.major = element_line(color = "white"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        plot.title = element_text(face = "bold", hjust = 0.5))

In both groups (subscribed and not subscribed), married customers make up the largest share, followed by single customers, with divorced customers the smallest group.

Hence, married is the most common marital status among both subscribers and non-subscribers of the term deposit plan.
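The theme settings used above are repeated verbatim in every plot that follows. Optionally, they could be stored once in a theme object and added to each plot; a sketch (not used in the code below):

plot_theme <- theme_minimal() +
  theme(panel.grid.major = element_line(color = "white"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        plot.title = element_text(face = "bold", hjust = 0.5))
# usage: ggplot(...) + geom_bar(...) + plot_theme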

ggplot(data = bank_sub_train) +
  geom_bar(mapping = aes(x = subscribed, fill = loan),position = "dodge") +
  scale_fill_manual(values = c("red1", "green3")) + 
  ggtitle("Subscription by status of loan") +
  xlab("Subscribed") +
  ylab("Count of Customers") +
  theme_minimal() +
  theme(panel.grid.major = element_line(color = "white"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        plot.title = element_text(face = "bold", hjust = 0.5))

The above graph displays the relationship between whether the customer currently has a loan with the bank and whether they are subscribed to the term deposit plan. In both groups the majority of customers do not have a loan, and this imbalance is greater among subscribers, suggesting that customers without an existing loan are more likely to subscribe to the term deposit plan.

ggplot(data = bank_sub_train) +
  geom_bar(mapping = aes(x = subscribed, fill = prev_outcome),position = "dodge") +
  scale_fill_manual(values = c("red1", "turquoise2","green3", "gold1" )) + 
  ggtitle("Subscription by previous outcome") +
  xlab("Subscribed") +
  ylab("Count of Customers") +
  theme_minimal() +
  theme(panel.grid.major = element_line(color = "white"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        plot.title = element_text(face = "bold", hjust = 0.5))

In terms of previous contact outcome, ‘unknown’ is the most common category in both groups. Among customers who subscribed to the term deposit plan, however, the second-largest category is ‘success’; hence, a customer whose previous contact outcome was a success appears more likely to subscribe.

Q1, 2 (b)

Numerical:

ggplot(data = bank_sub_train) +
  geom_boxplot(mapping = aes(x = subscribed, y = age, fill = subscribed)) +
  scale_fill_manual(values = c("red1", "green2")) + 
  ggtitle ("Subscription based on age") +
  xlab("Subscribed") +
  ylab("Age") +
  scale_y_continuous(limits = c(0, 90), breaks = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90), labels = c('0','10', '20', '30', '40', '50', '60', '70', '80', '90')) +
  theme_minimal() +
  theme(panel.grid.major = element_line(color = "white"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        plot.title = element_text(face = "bold", hjust = 0.5))

The middle 50% of customers who have subscribed to the term deposit plan are approximately between the ages of 33 and 51, while the middle 50% of customers who haven’t subscribed are approximately between the ages of 33 and 47. The median age is approximately 40.

ggplot(data = bank_sub_train) +
  geom_boxplot(mapping = aes(x = subscribed, y = balance, fill = subscribed)) +
  scale_fill_manual(values = c("red1", "green2")) + 
  ggtitle ("Subscription based on customers balance") +
  xlab("Subscribed") +
  ylab("Balance") +
  scale_y_continuous(limits = c(-10000, 30000), breaks = c(-10000, 0, 10000, 20000, 30000), labels = c('-10000', '0','10000', '20000', '30000')) +
  theme_minimal() +
  theme(panel.grid.major = element_line(color = "white"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        plot.title = element_text(face = "bold", hjust = 0.5))
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_boxplot()`).

Customers who have subscribed to the term deposit plan have a median balance of approximately 2,000. Customers who have subscribed also tend to have a higher bank balance than those who haven’t.

ggplot(data = bank_sub_train) +
  geom_boxplot(mapping = aes(x = subscribed, y = duration, fill = subscribed)) +
  scale_fill_manual(values = c("red1", "green2")) + 
  ggtitle ("Subscription based on duration since last subscription (in days)") +
  xlab("Subscribed") +
  ylab("Duration") +
  theme_minimal() +
  theme(panel.grid.major = element_line(color = "white"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        plot.title = element_text(face = "bold", hjust = 0.5))

Customers who have subscribed to the term deposit plan have a longer duration since their last subscription than those who haven’t. Hence, customers with a shorter duration since their last subscription appear less likely to subscribe. The median duration for those who have subscribed is approximately 400 days.

ggplot(data = bank_sub_train) +
  geom_boxplot(mapping = aes(x = subscribed, y = num_prev_contacts, fill = subscribed)) +
  scale_fill_manual(values = c("red1", "green2")) + 
  ggtitle ("Subscription based on number of times customer has been contacted by bank") +
  xlab("Subscribed") +
  ylab("Count of Previous Contacts") +
  theme_minimal() +
  theme(panel.grid.major = element_line(color = "white"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        plot.title = element_text(face = "bold", hjust = 0.5))

As can be seen, customers who have subscribed to the term deposit plan tend to have been contacted more times than those who haven’t.

Q2

Q2, 3

Creating Classification tree model

Made using training data

bank_sub_tree <- rpart(subscribed ~ age + marital + balance + loan + duration + num_prev_contacts + prev_outcome, data = bank_sub_train)
fancyRpartPlot(bank_sub_tree)

Q2, 4 (a)

Predicting if customer will subscribe

According to the tree, a customer is predicted to subscribe if the duration is greater than 646 days and they are married.

Q2, 4 (b)

Predicting if customer won’t subscribe

A customer is predicted not to subscribe if the duration is less than 646 days and the prev_outcome is failure, other or unknown.

Q2, 4 (c)

Highest contributing predictor variables

bank_sub_tree$variable.importance
summary(bank_sub_tree)

  • duration = 58%

  • prev_outcome = 34%

  • marital = 3%

  • balance = 2%

  • age = 2%
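The percentages listed above come from summary(), which reports the variable.importance scores rescaled to percentages; a sketch of the same calculation done directly on the importance scores:

round(100 * bank_sub_tree$variable.importance / sum(bank_sub_tree$variable.importance))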

Q2, 5

Assessing quality of classification tree model

We first need to assess the quality of the classification tree model on the training dataset;

train_probs <- predict(bank_sub_tree, newdata = bank_sub_train, type = 'prob')
train_preds <- predict(bank_sub_tree, newdata = bank_sub_train, type = 'class')
bank_train_updated <- cbind(bank_sub_train, train_probs, train_preds)
head(bank_train_updated)
tree_train_con_mat <- table(bank_train_updated$subscribed, bank_train_updated$train_preds, dnn=c('Actual', 'Predicted'))
tree_train_con_mat
      Predicted
Actual   No  Yes
   No  2573   68
   Yes  218  141

Now we can assess the quality of the classification tree model on the testing dataset;

test_probs <- predict(bank_sub_tree, newdata = bank_sub_test, type = 'prob')
test_preds <- predict(bank_sub_tree, newdata = bank_sub_test, type = 'class')
bank_test_updated <- cbind(bank_sub_test, test_probs, test_preds)
head(bank_test_updated)
tree_test_con_mat <- table(bank_test_updated$subscribed, bank_test_updated$test_preds, dnn=c('Actual', 'Predicted'))
tree_test_con_mat
      Predicted
Actual   No  Yes
   No  1151   40
   Yes   91   52

Q2, 5

Comparing the accuracy of the classification tree model on the training dataset and the testing dataset

First, we will assess the accuracy of the classification tree model on the training dataset:

tree_train_con_mat
      Predicted
Actual   No  Yes
   No  2573   68
   Yes  218  141
  • Overall model accuracy = (2573+141) / 3000 = 0.904 = 90.46%

  • Of all customers model predicted would subscribe, it got 141/209 = 0.674 = 67.46% correct

  • Of all customers model predicted would not subscribe, it got 2573/2791 = 0.921 = 92.18% correct

  • Of all customers who did actually subscribe, model correctly identified 141/359 = 0.393 = 39.28%

  • Of all customers who did not actually subscribe, model correctly identified 2573/2641 = 0.974 = 97.42%
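These figures can also be computed directly from the confusion matrix rather than by hand; a sketch (the same pattern applies to the testing matrix):

sum(diag(tree_train_con_mat)) / sum(tree_train_con_mat)              # overall accuracy
tree_train_con_mat["Yes", "Yes"] / sum(tree_train_con_mat[, "Yes"])  # correct among predicted 'Yes'
tree_train_con_mat["Yes", "Yes"] / sum(tree_train_con_mat["Yes", ])  # actual subscribers identified
tree_train_con_mat["No", "No"] / sum(tree_train_con_mat["No", ])     # actual non-subscribers identified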

Now we can assess the accuracy of the classification tree model on the testing dataset:

tree_test_con_mat
      Predicted
Actual   No  Yes
   No  1151   40
   Yes   91   52
  • Overall model accuracy =(1151+52)/1334 = 0.901 = 90.17%

  • Of all customers model predicted would subscribe, it got 52/92 = 0.565 = 56.52% correct

  • Of all customers model predicted would not subscribe, it got 1151/1242 = 0.926 = 92.67% correct

  • Of all customers who did actually subscribe, model correctly identified 52/143 = 0.36 = 36.36%

  • Of all customers who did not actually subscribe, model correctly identified 1151/1191 = 0.966 = 96.64%

The classification tree model doesn’t seem to be overfitting, because the accuracy percentages on the training and testing datasets are quite similar; hence no pruning is necessary.
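For completeness, if the comparison had suggested overfitting, rpart’s built-in cross-validation output could be used to choose a pruning point; a sketch (the cp value shown is illustrative only):

printcp(bank_sub_tree)   # cross-validated error (xerror) for each complexity parameter
plotcp(bank_sub_tree)
# bank_sub_tree_pruned <- prune(bank_sub_tree, cp = 0.02)   # prune at the cp with the lowest xerror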

Q3

Q3, 6

Setting up levels for Binary Logistic Regression (BLR)

For BLR, we need to determine levels based on a ‘failure’ and ‘success’ principle (as BLR predicts the probability of a success). The choice for success in the case of subscription is a ‘Yes’ (meaning customer subscribed), and the choice for failure is a ‘No’ (meaning customer didn’t subscribe).

As the levels are specified in a (“failure”, “success”) order, “No” goes first when setting up the levels (as in this case it is the failure), and “Yes” goes second (as it is the success).

These levels must be applied to both the training and the testing datasets.

bank_sub_train$subscribed <- factor(bank_sub_train$subscribed, 
                                   levels = c("No", "Yes"))
bank_sub_test$subscribed <- factor(bank_sub_test$subscribed,
                                   levels = c("No", "Yes"))

levels(bank_sub_train$subscribed)
[1] "No"  "Yes"
levels(bank_sub_test$subscribed)
[1] "No"  "Yes"

Q3, 7

Creating the BLR model

As the BLR model predicts the probability of a subscription, we will model this probability using the other variables in the dataset. We will do this using the glm() function (Generalized Linear Model), with a binomial family and a logit link.

bank_sub_lr <- glm(subscribed ~ age + marital + balance + loan + duration + num_prev_contacts + prev_outcome, data = bank_sub_train, family = binomial(link = 'logit'))
summary(bank_sub_lr)

Call:
glm(formula = subscribed ~ age + marital + balance + loan + duration + 
    num_prev_contacts + prev_outcome, family = binomial(link = "logit"), 
    data = bank_sub_train)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -3.265e+00  4.151e-01  -7.866 3.67e-15 ***
age                  1.243e-02  6.574e-03   1.891  0.05862 .  
maritalmarried      -5.435e-01  1.928e-01  -2.819  0.00481 ** 
maritalsingle       -2.499e-02  2.218e-01  -0.113  0.91030    
balance              2.362e-05  2.205e-05   1.071  0.28421    
loanyes             -6.902e-01  2.175e-01  -3.173  0.00151 ** 
duration             3.690e-03  2.218e-04  16.637  < 2e-16 ***
num_prev_contacts   -6.165e-04  4.177e-02  -0.015  0.98823    
prev_outcomeother    5.784e-01  2.983e-01   1.939  0.05250 .  
prev_outcomesuccess  2.719e+00  3.168e-01   8.582  < 2e-16 ***
prev_outcomeunknown -4.428e-01  2.304e-01  -1.922  0.05461 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2197.6  on 2999  degrees of freedom
Residual deviance: 1649.0  on 2989  degrees of freedom
AIC: 1671

Number of Fisher Scoring iterations: 5

Q3, 7 (a)

Exclusion of Dummy Variables

The categorical dummy variables omitted are;

  • maritaldivorced

  • loanno

  • prev_outcomefailure

Q3, 7 (b)

Regression Equation

The regression equation for this BLR model is as follows;

y = ln(π/(1-π)) = -3.265 + 0.0124(age) - 0.5435(maritalmarried) - 0.0250(maritalsingle) + 0.0000236(balance) - 0.6902(loanyes) + 0.0037(duration) - 0.0006(num_prev_contacts) + 0.5784(prev_outcomeother) + 2.719(prev_outcomesuccess) - 0.4428(prev_outcomeunknown)

Q3, 7 (c)

Significant predictor variables

If a variable has a p-value of less than 0.05 (< 0.05), it is considered a significant predictor variable.

Hence, the significant predictor variables for this BLR model are;

  • maritalmarried as 0.00481 < 0.05

  • loanyes as 0.00151 < 0.05

  • duration as 2e-16 < 0.05

  • prev_outcomesuccess as 2e-16 < 0.05

Q3, 7 (d)

How each predictor variable impacts the likelihood of a customer subscribing

The sign in front of the predictor variable shows how each variable impacts the probability of a success; if the predictor variable has a + sign in front, it increases the probability of success, and if the predictor variable has a - sign in front, it decreases the probability of success.
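A common way to quantify the size of these effects (not required by the question) is to convert the coefficients to odds ratios; a sketch:

exp(coef(bank_sub_lr))
# e.g. exp(2.719)  ≈ 15.2: a previous 'success' multiplies the odds of subscribing by roughly 15
# and  exp(-0.6902) ≈ 0.50: having a loan roughly halves the odds of subscribing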

Q3, 8

Assessing the quality of the BLR model

We first need to assess the quality of the BLR model on the training dataset;

bank_sub_train_pi <- predict(bank_sub_lr, newdata = bank_sub_train, type = 'response')
bank_sub_train_pi

After the BLR model calculates a probability score for each customer in the dataset, we need to set a threshold which the model will use to predict whether the customer will subscribe or not.

  • If a customer’s probability score is lower than 0.5 (< 50%), the model will predict that the customer will not subscribe.

  • If a customer’s probability score is 0.5 or higher (>= 50%), the model will predict that the customer will subscribe.

bank_sub_train_updated <- bank_sub_train %>%
  mutate(pi = bank_sub_train_pi) %>%
  mutate(prediction = case_when(pi < 0.5 ~ 'No', 
                                pi >= 0.5 ~ 'Yes'))
bank_sub_train_updated
bank_sub_train_updated$prediction <- factor(bank_sub_train_updated$prediction,
                                            levels = c("No", "Yes"))

Once the training dataset is updated with an additional variable containing the probability of each customer subscribing, as determined by the BLR model, we need to create a confusion matrix to fully assess its quality.

bank_sub_train_con_mat <- table(bank_sub_train_updated$subscribed,
                                bank_sub_train_updated$prediction,
                                dnn = c('Actual', 'Predicted'))
bank_sub_train_con_mat
      Predicted
Actual   No  Yes
   No  2588   53
   Yes  249  110

Now we can assess the quality of the BLR model on the testing dataset;

We need to run the same code as above, but this time for the testing dataset, so that we can assess the quality of the model on data it has not seen and check whether it over-fits.

bank_sub_test_pi <- predict(bank_sub_lr, newdata = bank_sub_test, type = 'response')
bank_sub_test_pi
bank_sub_test_updated <- bank_sub_test %>%
  mutate(pi = bank_sub_test_pi) %>%
  mutate(prediction = case_when(pi < 0.5 ~ 'No', 
                                pi >= 0.5 ~ 'Yes'))
bank_sub_test_updated
bank_sub_test_updated$prediction <- factor(bank_sub_test_updated$prediction,
                                           levels = c("No", "Yes"))
bank_sub_test_con_mat <- table(bank_sub_test_updated$subscribed,
                               bank_sub_test_updated$prediction,
                               dnn = c('Actual', 'Predicted'))
bank_sub_test_con_mat
      Predicted
Actual   No  Yes
   No  1159   32
   Yes   99   44

Q3, 8

Comparing the accuracy of the BLR model on the training dataset and the testing dataset

First, we will assess the accuracy of the BLR model on the training dataset:

bank_sub_train_con_mat
      Predicted
Actual   No  Yes
   No  2588   53
   Yes  249  110
  • Overall model accuracy = (2588+110) / 3000 = 0.899 = 89.93%

  • Of all customers model predicted would not subscribe, it got 2588 / 2837 = 0.912 = 91.22% correct

  • Of all customers model predicted would subscribe, it got 110 / 163 = 0.674 = 67.48% correct

  • Of all customers who did not actually subscribe, model correctly identified 2588 / 2641 = 0.979 = 97.99%

  • Of all customers who did actually subscribe, model correctly identified 110 / 359 = 0.306 = 30.64%

Now we can assess the accuracy of the BLR model on the testing dataset:

bank_sub_test_con_mat
      Predicted
Actual   No  Yes
   No  1159   32
   Yes   99   44
  • Overall model accuracy = (1159 + 44) / 1334 = 0.901 = 90.17%

  • Of all customers model predicted would not subscribe, it got 1159 / 1258 = 0.921 = 92.13% correct

  • Of all customers model predicted would subscribe, it got 44 / 76 = 0.578 = 57.89% correct

  • Of all customers who did not actually subscribe, model correctly identified 1159 / 1191 = 0.973 = 97.31%

  • Of all customers who did actually subscribe, model correctly identified 44 / 143 = 0.307 = 30.77%

Q4

Q4, 9

Comparing the accuracy of the classification tree model and the BLR model

a) Classification tree:

Training data:

tree_train_con_mat
      Predicted
Actual   No  Yes
   No  2573   68
   Yes  218  141
  • Overall model accuracy = (2573+141) / 3000 = 0.904 = 90.46%

  • Of all customers model predicted would subscribe, it got 141/209 = 0.674 = 67.46% correct

  • Of all customers model predicted would not subscribe, it got 2573/2791 = 0.921 = 92.18% correct

  • Of all customers who did actually subscribe, model correctly identified 141/359 = 0.393 = 39.28%

  • Of all customers who did not actually subscribe, model correctly identified 2573/2641 = 0.974 = 97.42%

Testing data:

tree_test_con_mat
      Predicted
Actual   No  Yes
   No  1151   40
   Yes   91   52
  • Overall model accuracy =(1151+52)/1334 = 0.901 = 90.17%

  • Of all customers model predicted would subscribe, it got 52/92 = 0.565 = 56.52% correct

  • Of all customers model predicted would not subscribe, it got 1151/1242 = 0.926 = 92.67% correct

  • Of all customers who did actually subscribe, model correctly identified 52/143 = 0.36 = 36.36%

  • Of all customers who did not actually subscribe, model correctly identified 1151/1191 = 0.966 = 96.64%

Average accuracy percentage of the classification tree model across the training and testing datasets:

(90.46 + 67.46 + 92.18 + 39.28 + 97.42 + 90.17 + 56.52 + 92.67 + 36.36 + 96.64)/10 = 75.92% average

b) Binary Logistic Regression model:

Training data:

bank_sub_train_con_mat
      Predicted
Actual   No  Yes
   No  2588   53
   Yes  249  110
  • Overall model accuracy = (2588+110) / 3000 = 0.899 = 89.93%

  • Of all customers model predicted would not subscribe, it got 2588 / 2837 = 0.912 = 91.22% correct

  • Of all customers model predicted would subscribe, it got 110 / 163 = 0.674 = 67.48% correct

  • Of all customers who did not actually subscribe, model correctly identified 2588 / 2641 = 0.979 = 97.99%

  • Of all customers who did actually subscribe, model correctly identified 110 / 359 = 0.306 = 30.64%

Testing data:

bank_sub_test_con_mat
      Predicted
Actual   No  Yes
   No  1159   32
   Yes   99   44
  • Overall model accuracy = (1159 + 44) / 1334 = 0.901 = 90.17%

  • Of all customers model predicted would not subscribe, it got 1159 / 1258 = 0.921 = 92.13% correct

  • Of all customers model predicted would subscribe, it got 44 / 76 = 0.578 = 57.89% correct

  • Of all customers who did not actually subscribe, model correctly identified 1159 / 1191 = 0.973 = 97.31%

  • Of all customers who did actually subscribe, model correctly identified 44 / 143 = 0.307 = 30.77%

Average accuracy percentage of the BLR model across the training and testing datasets:

(89.93 + 91.22 + 67.48 + 97.99 + 30.64 + 90.17 + 92.13 + 57.89 + 97.31 + 30.77)/10 = 74.55%

  • Average of classification tree model: 75.92%

  • Average of BLR model: 74.55%
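As a quick cross-check of the headline figures, the overall testing accuracy of each model can be computed directly from the stored confusion matrices; a sketch:

sum(diag(tree_test_con_mat)) / sum(tree_test_con_mat)            # classification tree, approx. 0.90
sum(diag(bank_sub_test_con_mat)) / sum(bank_sub_test_con_mat)    # BLR model, approx. 0.90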

Both models are very similar in accuracy; on these averaged figures the classification tree model is marginally more accurate, and both models achieve essentially the same overall accuracy (about 90%) on the testing dataset.

Q4, 10

I think the company should use the classification tree model for making subscription predictions, because it has a slightly higher average accuracy percentage and correctly identifies a somewhat larger share of actual subscribers in the testing dataset. However, both models could be used, as their average accuracy levels are very similar.

Q4, 11

The main drivers for customers subscribing to the term deposit plan are marital status, duration and previous outcome, based on the important predictor variables shared between the two models (refer to questions 2, 4 (c) and 3, 7 (c)).