Warning: package 'rattle' was built under R version 4.4.2
Warning: package 'rpart' was built under R version 4.4.2
Warning: package 'rattle' was built under R version 4.4.2
Warning: package 'rpart' was built under R version 4.4.2
Uploading necessary databases
bank_sub_train <- read_csv("bank_sub_train.csv")
bank_sub_test <- read_csv("bank_sub_test.csv")Carrying out a visual exploration of the relationship between whether a customer subscribed or not, and each of the potential predictor variables
For this part of the analysis, I will use the ggplot package to create graphs to display the relationship between whether a customer subscribed or not and the other variables in the dataset.
For categorical variables I will use clustered barcharts
For numerical variables I will use boxplots.
Categorical:
ggplot(data = bank_sub_train) +
geom_bar(mapping = aes(x = subscribed, fill = marital),position = "dodge") +
scale_fill_manual(values = c("red1", "green3", "turquoise2")) +
ggtitle("Subscription by marital status") +
xlab("Subscribed") +
ylab("Count of Customers") +
theme_minimal() +
theme(axis.line = element_line(colour = "grey50")) +
theme(panel.grid.major = element_line(color = "white"),
axis.text = element_text(colour = "#666666"),
axis.ticks = element_line(colour = "#666666"),
axis.line = element_line(colour = "#666666"),
plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))In both instances when it comes to the relationship between subscription and marital status, the highest numbers of customers who have and haven’t subscribed are married. The second highest are customers who are single, and the lowest place pertains to people who are divorced.
Hence, the marital demographic most likely to either be subscribed to or not be subscribed to the term deposited plan is married.
ggplot(data = bank_sub_train) +
geom_bar(mapping = aes(x = subscribed, fill = loan),position = "dodge") +
scale_fill_manual(values = c("red1", "green3")) +
ggtitle("Subscription by status of loan") +
xlab("Subscribed") +
ylab("Count of Customers") +
theme_minimal() +
theme(axis.line = element_line(colour = "grey50")) +
theme(panel.grid.major = element_line(color = "white"),
axis.text = element_text(colour = "#666666"),
axis.ticks = element_line(colour = "#666666"),
axis.line = element_line(colour = "#666666"),
plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))The above graphs displays the relationship between whether the customer currently has a loan with the bank or not, and whether they are subscribed to the term deposit plan or not. As can be seen, the probability of customers who don’t have a loan with the bank are more likely to subscribe to the term deposit plan. This is also true for customers who aren’t subscribed.
ggplot(data = bank_sub_train) +
geom_bar(mapping = aes(x = subscribed, fill = prev_outcome),position = "dodge") +
scale_fill_manual(values = c("red1", "turquoise2","green3", "gold1" )) +
ggtitle("Subscription by previous outcome") +
xlab("Subscribed") +
ylab("Count of Customers") +
theme_minimal() +
theme(axis.line = element_line(colour = "grey50")) +
theme(panel.grid.major = element_line(color = "white"),
axis.text = element_text(colour = "#666666"),
axis.ticks = element_line(colour = "#666666"),
axis.line = element_line(colour = "#666666"),
plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))In terms of previous contact outcome, the highest outcome of the call is described by ‘unknown’. However, when looking at customers who are subscribed to the term deposit plan, the second highest column pertains to a ‘success’ in previous outcome. Hence, a customer is more likely to subscribe to the term deposit plan if previous outcome is a success.
Numerical:
ggplot(data = bank_sub_train) +
geom_boxplot(mapping = aes(x = subscribed, y = age, fill = subscribed)) +
scale_fill_manual(values = c("red1", "green2")) +
ggtitle ("Subscription based on age") +
xlab("Subscribed") +
ylab("Age") +
scale_y_continuous(limits = c(0, 90), breaks = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90), labels = c('0','10', '20', '30', '40', '50', '60', '70', '80', '90')) +
theme_minimal() +
theme(panel.grid.major = element_line(color = "white"),
axis.text = element_text(colour = "#666666"),
axis.ticks = element_line(colour = "#666666"),
axis.line = element_line(colour = "#666666"),
plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))50% of customers who have subscribed to the loan deposit plan are approx. between the ages of 33 and 51. 50 % Customers who haven’t subscribed to the term deposit plan are approx. between the ages of 33 to 47. The median age is 40.
ggplot(data = bank_sub_train) +
geom_boxplot(mapping = aes(x = subscribed, y = balance, fill = subscribed)) +
scale_fill_manual(values = c("red1", "green2")) +
ggtitle ("Subscription based on customers balance") +
xlab("Subscribed") +
ylab("Balance") +
scale_y_continuous(limits = c(-10000, 30000), breaks = c(-10000, 0, 10000, 20000, 30000), labels = c('-10000', '0','10000', '20000', '30000')) +
theme_minimal() +
theme(panel.grid.major = element_line(color = "white"),
axis.text = element_text(colour = "#666666"),
axis.ticks = element_line(colour = "#666666"),
axis.line = element_line(colour = "#666666"),
plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))Warning: Removed 1 row containing non-finite outside the scale range
(`stat_boxplot()`).
Customer who have subscribed to the term deposit plan have a median balance of approx. 2’000. Customers who have subscribed to the term deposit plan also have a higher bank balance than those who haven’t.
ggplot(data = bank_sub_train) +
geom_boxplot(mapping = aes(x = subscribed, y = duration, fill = subscribed)) +
scale_fill_manual(values = c("red1", "green2")) +
ggtitle ("Subscription based on duration since last subscription (in days)") +
xlab("Subscribed") +
ylab("Duration") +
theme_minimal() +
theme(panel.grid.major = element_line(color = "white"),
axis.text = element_text(colour = "#666666"),
axis.ticks = element_line(colour = "#666666"),
axis.line = element_line(colour = "#666666"),
plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))Customers who have subscribed to the term deposit plan wait a longer time to subscribe than those that haven’t subscribed. Hence, it can be seen that customers aren’t likely to subscribe to the term deposit plan if they have a shorter duration time since their last subscription. Median duration times for those who have subscribed is approx. 400 days.
ggplot(data = bank_sub_train) +
geom_boxplot(mapping = aes(x = subscribed, y = num_prev_contacts, fill = subscribed)) +
scale_fill_manual(values = c("red1", "green2")) +
ggtitle ("Subscription based on number of times customer has been contacted by bank") +
xlab("Subscribed") +
ylab("Count of Previous Contacts") +
theme_minimal() +
theme(panel.grid.major = element_line(color = "white"),
axis.text = element_text(colour = "#666666"),
axis.ticks = element_line(colour = "#666666"),
axis.line = element_line(colour = "#666666"),
plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))As can be seen, customers who have subscribed to the term deposit plan are contacted more than those who haven’t.
Creating Classification tree model
Made using training data
bank_sub_tree <- rpart(subscribed ~ age + marital + balance + loan + duration + num_prev_contacts + prev_outcome, bank_sub_train)fancyRpartPlot(bank_sub_tree)Predicting if customer will subscribe
Customer will subscribe if duration is more than 646 days, and if they are married.
Predicting if customer won’t subscribe
Customer wont subscribe if duration is less than 646 days, and if prev_outcome is failure, other or unknown.
Highest contributing predictor variables
bank_sub_tree$variable.importancesummary(bank_sub_tree)duration = 58%
prev_outcome = 34%
marital = 3%
balance = 2%
age = 2%
Assessing quality of classification tree model
We first need to assess the quality of the BLR model on the training dataset;
train_probs <- predict(bank_sub_tree, newdata = bank_sub_train, type = 'prob')
train_preds <- predict(bank_sub_tree, newdata = bank_sub_train, type = 'class')bank_train_updated <- cbind(bank_sub_train, train_probs, train_preds)
head(bank_train_updated)tree_train_con_mat <- table(bank_train_updated$subscribed, bank_train_updated$train_preds, dnn=c('Actual', 'Predicted'))
tree_train_con_mat Predicted
Actual No Yes
No 2573 68
Yes 218 141
Now we can assess the quality of the BLR model on the testing dataset;
test_probs <- predict(bank_sub_tree, newdata = bank_sub_test, type = 'prob')
test_preds <- predict(bank_sub_tree, newdata = bank_sub_test, type = 'class')bank_test_updated <- cbind(bank_sub_test, test_probs, test_preds)
head(bank_test_updated)tree_test_con_mat <- table(bank_test_updated$subscribed, bank_test_updated$test_preds, dnn=c('Actual', 'Predicted'))
tree_test_con_mat Predicted
Actual No Yes
No 1151 40
Yes 91 52
Comparing the accuracy of the classification tree model model on the training datatset and the testing dataset
First, we will assess the accuracy of the classification tree model on the training dataset:
tree_train_con_mat Predicted
Actual No Yes
No 2573 68
Yes 218 141
Overall model accuracy = (2573+141) / 3000 = 0.904 = 90.46%
Of all customers model predicted would subscribe, it got 141/209 = 0.674 = 67.46% correct
Of all customers model predicted would not subscribe, it got 2573/2791 = 0.921 = 92.18% correct
Of all customers who did actually subscribe, model correctly identified 141/349 = 0.404 = 40.40%
Of all customers who did not actually subscribe, model correctly identified 2573/2641 = 0.974 = 97.42%
Now we can assess the accuracy of the classification tree model on the testing dataset:
tree_test_con_mat Predicted
Actual No Yes
No 1151 40
Yes 91 52
Overall model accuracy =(1151+52)/1334 = 0.901 = 90.17%
Of all customers model predicted would subscribe, it got 52/92 = 0.565 = 56.52% correct
Of all customers model predicted would not subscribe, it got 1151/1242 = 0.926 = 92.67% correct
Of all customers who did actually subscribe, model correctly identified 52/143 = 0.36 = 36.36%
Of all customers who did not actually subscribe, model correctly identified 1151/1191 = 0.966 = 96.64%
The model doesn’t seem to be overfitting because the percentages are quite similar, hence no pruning is necessary.
Setting up levels for Binary Logistic Regression (BLR)
For BLR, we need to determine levels based on a ‘failure’ and ‘success’ principle (as BLR predicts the probability of a success). The choice for success in the case of subscription is a ‘Yes’ (meaning customer subscribed), and the choice for failure is a ‘No’ (meaning customer didn’t subscribe).
As setting levels works on a “failure”, “success” parameter, “No” goes first when setting up levels (as in this case its a failure), and “yes” goes second (as in this case its a success).
These levels must be applied to both the training and the testing databases.
bank_sub_train$subscribed <- factor(bank_sub_train$subscribed,
levels = c("No", "Yes"))
bank_sub_test$subscribed <- factor(bank_sub_test$subscribed,
levels = c("No", "Yes"))
levels(bank_sub_train$subscribed)[1] "No" "Yes"
levels(bank_sub_test$subscribed)[1] "No" "Yes"
Creating the BLR model
As the BLR model is predicting the probability of a subscription, we will base the probability of a subscription based on the other variables found in the dataset. We will do this using the glm() function (General Linear Model function).
bank_sub_lr <- glm(subscribed ~ age + marital + balance + loan + duration + num_prev_contacts + prev_outcome, data = bank_sub_train, family = binomial(link = 'logit'))summary(bank_sub_lr)
Call:
glm(formula = subscribed ~ age + marital + balance + loan + duration +
num_prev_contacts + prev_outcome, family = binomial(link = "logit"),
data = bank_sub_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.265e+00 4.151e-01 -7.866 3.67e-15 ***
age 1.243e-02 6.574e-03 1.891 0.05862 .
maritalmarried -5.435e-01 1.928e-01 -2.819 0.00481 **
maritalsingle -2.499e-02 2.218e-01 -0.113 0.91030
balance 2.362e-05 2.205e-05 1.071 0.28421
loanyes -6.902e-01 2.175e-01 -3.173 0.00151 **
duration 3.690e-03 2.218e-04 16.637 < 2e-16 ***
num_prev_contacts -6.165e-04 4.177e-02 -0.015 0.98823
prev_outcomeother 5.784e-01 2.983e-01 1.939 0.05250 .
prev_outcomesuccess 2.719e+00 3.168e-01 8.582 < 2e-16 ***
prev_outcomeunknown -4.428e-01 2.304e-01 -1.922 0.05461 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2197.6 on 2999 degrees of freedom
Residual deviance: 1649.0 on 2989 degrees of freedom
AIC: 1671
Number of Fisher Scoring iterations: 5
Exclusion of Dummy Variables
The categorical dummy variables omitted are;
maritaldivorced
loanno
prev_outcomefailure
Regression Equation
The regression equation for this BLR model is as follows;
y = ln(π/(1-π)) = -3.265 + 1.243(age) -5.435(maritalmarried ) -2.499(maritalsingle) +2.362(balance) -6.902 (loanyes) +3.690 (duration) -6.165 (num_prev_contacts) +5.784 (prev_outcomeother) +2.719 (prev_outcomesuccess) -4.428 (prev_outcomeunknown)
Significant predictor variables
If a variable contains a P value of less than 0.05 (<0.05), it is seen as an important predictor variable.
Hence, the important predictor variables for this BLR Model are;
maritial married as 0.00481 < 0.05
loanyes as 0.00151 < 0.05
duration as 2e-16 < 0.05
prev_outcomesuccess as 2e-16 < 0.05
How each predictor variable impacts the likelihood of a customer subscribing
The sign in front of the predictor variable shows how each variable impacts the probability of a success; if the predictor variable has a + sign in front, it increases the probability of success, and if the predictor variable has a - sign in front, it decreases the probability of success.
Assessing the quality of the BLR model
We first need to assess the quality of the BLR model on the training dataset;
bank_sub_train_pi <- predict(bank_sub_lr, newdata = bank_sub_train, type = 'response')
bank_sub_train_piAfter the BLR model calculates the probability scores for each customer in the dataset, we need to set a paramater which the model will use to predict whether the customer will subscribe or not.
If a customers probability score is lower than 0.5 (<50%), the model will predict that the customer will not subscribe.
If a customers probability score is equal to or higher than 0.5 (>=50%), the model will predict that the customer will subscribe.
bank_sub_train_updated <- bank_sub_train %>%
mutate(pi = bank_sub_train_pi) %>%
mutate(prediction = case_when(pi < 0.5 ~ 'No',
pi >= 0.5 ~ 'Yes'))
bank_sub_train_updatedbank_sub_train_updated$prediction <- factor(bank_sub_train_updated$prediction,
levels = c("No", "Yes"))Once the training dataset is updated with and additional variable containing the probability of a customer subscribing determined by the BLR model, we need to create a confusion matrix to fully assess its quality.
bank_sub_train_con_mat <- table(bank_sub_train_updated$subscribed,
bank_sub_train_updated$prediction,
dnn = c('Actual', 'Predicted'))
bank_sub_train_con_mat Predicted
Actual No Yes
No 2588 53
Yes 249 110
Now we can assess the quality of the BLR model on the testing dataset;
We need to run all the above code, but this time it needs to pertain to the testing dataset so that we can assess the quality of the model on a different set of figures to check if the model will over-fit the data.
bank_sub_test_pi <- predict(bank_sub_lr, newdata = bank_sub_test, type = 'response')
bank_sub_test_pibank_sub_test_updated <- bank_sub_test %>%
mutate(pi = bank_sub_test_pi) %>%
mutate(prediction = case_when(pi < 0.5 ~ 'No',
pi >= 0.5 ~ 'Yes'))
bank_sub_test_updatedbank_sub_test_updated$prediction <- factor(bank_sub_test_updated$prediction,
levels = c("No", "Yes"))bank_sub_test_con_mat <- table(bank_sub_test_updated$subscribed,
bank_sub_test_updated$prediction,
dnn = c('Actual', 'Predicted'))
bank_sub_test_con_mat Predicted
Actual No Yes
No 1159 32
Yes 99 44
Comparing the accuracy of the BLR model on the training datatset and the testing dataset
First, we will assess the accuracy of the BLR model on the training dataset:
bank_sub_train_con_mat Predicted
Actual No Yes
No 2588 53
Yes 249 110
Overall model accuracy = (2588+110) / 3000 = 0.899 = 89.93%
Of all customers model predicted would not subscribe, it got 2588 / 2837 = 0.912 = 91.22% correct
Of all customers model predicted would subscribe, it got 110 / 163 = 0.674 = 67.48% correct
Of all customers who did not actually subscribe, model correctly identified 2588 / 2641 = 0.979 = 97.99%
Of all customers who did actually subscribe, model correctly identified 110 / 359 = 0.306 = 30.64%
Now we can assess the accuracy of the BLR model on the testing dataset:
bank_sub_test_con_mat Predicted
Actual No Yes
No 1159 32
Yes 99 44
Overall model accuracy = (1159 + 44) / 1334 = 0.901 = 90.17%
Of all customers model predicted would not subscribe, it got 1159 / 1258 = 0.921 = 92.13% correct
Of all customers model predicted would subscribe, it got 44 / 76 = 0.578 = 57.89% correct
Of all customers who did not actually subscribe, model correctly identified 1159 / 1191 = 0.973 = 97.31%
Of all customers who did actually subscribe, model correctly identified 99 / 143 = 0.692 = 69.23%
Comparing the accuracy of the classification tree model and the BLR model
a) Classification tree:
Training data:
tree_train_con_mat Predicted
Actual No Yes
No 2573 68
Yes 218 141
Overall model accuracy = (2573+141) / 3000 = 0.904 = 90.46%
Of all customers model predicted would subscribe, it got 141/209 = 0.674 = 67.46% correct
Of all customers model predicted would not subscribe, it got 2573/2791 = 0.921 = 92.18% correct
Of all customers who did actually subscribe, model correctly identified 141/349 = 0.404 = 40.40%
Of all customers who did not actually subscribe, model correctly identified 2573/2641 = 0.974 = 97.42%
tree_test_con_mat Predicted
Actual No Yes
No 1151 40
Yes 91 52
Overall model accuracy =(1151+52)/1334 = 0.901 = 90.17%
Of all customers model predicted would subscribe, it got 52/92 = 0.565 = 56.52% correct
Of all customers model predicted would not subscribe, it got 1151/1242 = 0.926 = 92.67% correct
Of all customers who did actually subscribe, model correctly identified 52/143 = 0.36 = 36.36%
Of all customers who did not actually subscribe, model correctly identified 1151/1191 = 0.966 = 96.64%
Average accuracy percentage of classification tree model across training and testing dataset.;
(90.46+ 67.46+ 92.18+40.40+ 97.42+ 90.17+ 56.52+ 92.67+ 36.36+ 96.64)/10 = 76.028% average
b) Binary Logistic Regression model:
Training data:
bank_sub_train_con_mat Predicted
Actual No Yes
No 2588 53
Yes 249 110
Overall model accuracy = (2588+110) / 3000 = 0.899 = 89.93%
Of all customers model predicted would not subscribe, it got 2588 / 2837 = 0.912 = 91.22% correct
Of all customers model predicted would subscribe, it got 110 / 163 = 0.674 = 67.48% correct
Of all customers who did not actually subscribe, model correctly identified 2588 / 2641 = 0.979 = 97.99%
Of all customers who did actually subscribe, model correctly identified 110 / 359 = 0.306 = 30.64%
bank_sub_test_con_mat Predicted
Actual No Yes
No 1159 32
Yes 99 44
Overall model accuracy = (1159 + 44) / 1334 = 0.901 = 90.17%
Of all customers model predicted would not subscribe, it got 1159 / 1258 = 0.921 = 92.13% correct
Of all customers model predicted would subscribe, it got 44 / 76 = 0.578 = 57.89% correct
Of all customers who did not actually subscribe, model correctly identified 1159 / 1191 = 0.973 = 97.31%
Of all customers who did actually subscribe, model correctly identified 99 / 143 = 0.692 = 69.23%
Average accuracy percentage of BLR model across training and testing dataset.;
(89.93+ 91.22+ 67.48+ 97.99+ 30.64+ 90.17+ 92.13+ 57.89+ 97.31+ 69.23)/ 10 = 78.399%
Average of classification tree model: 76.028%
Average of BLR model: 78.399%
Both models are very similar in accuracy, however the BLR model is on average more accurate in predicting whether a customer will subscribe or not.
I think the company should use the BLR model for making predictions for subscription because the model has a higher average accuracy percentage. However, both models could be used, as their average accuracy levels are very similar.
The main drivers for customers subscribing to the term deposit plan are marital status, duration and previous outcome (based on shared previous important predictor variables between the two models, refer to questions 2, 4 (c) and 3, 7 (c).