Sem 2 Assignment 1

Author

Zoe Keating

library(rattle)
Warning: package 'rattle' was built under R version 4.4.2
Loading required package: tibble
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
library(rpart)

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.0.2
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.1     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

library(dplyr)

Question 1

(1)

Load Data

bank_sub_test <- read_csv("bank_sub_test.csv")
Rows: 1334 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): marital, loan, prev_outcome, subscribed
dbl (5): id, age, balance, num_prev_contacts, duration

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bank_sub_train <- read_csv("bank_sub_train.csv")
Rows: 3000 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): marital, loan, prev_outcome, subscribed
dbl (5): id, age, balance, duration, num_prev_contacts

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bank_sub_train$subscribed <- factor(bank_sub_train$subscribed, levels = c("No", "Yes"))

bank_sub_train$marital <- factor(bank_sub_train$marital, levels = c("single", "married", "divorced"))

(2)

Data Summary

summary_data <- bank_sub_test %>%
  group_by(marital, subscribed) %>%
  summarise(count = n()) %>%
  ungroup()
`summarise()` has grouped output by 'marital'. You can override using the
`.groups` argument.
summary_data <- summary_data %>%
  group_by(marital) %>%
  mutate(total = sum(count),
         percentage = (count / total) * 100) %>%
  ungroup()

(a)

Visualisation

ggplot(summary_data, aes(x = marital, y = percentage, fill = subscribed)) +

geom_bar(stat = "identity", position = "dodge") +

labs(x = "Marital Status", y = "Percentage (%)", fill = "Subscribed") +

scale_fill_manual(values = c("No" = "magenta2", "Yes" = "turquoise")) +

theme_minimal() +

theme(axis.text.x = element_text(angle = 45, hjust = 1)) +

ggtitle("Percentage of Subscribed vs Not Subscribed by Marital Status")

(b)

ggplot(bank_sub_train, aes(x = subscribed, y = age, fill = subscribed)) +

geom_boxplot() +

labs(x = "Subscribed", y = "Age", fill = "Subscribed") +

scale_fill_manual(values = c("No" = "magenta2", "Yes" = "turquoise")) +

theme_minimal() +

ggtitle("Distribution of Age by Subscription Status")

Question 2

(3)

Decision Tree Model

churn_tree <- rpart(subscribed ~ age + marital + balance + loan + duration + num_prev_contacts + prev_outcome, bank_sub_train)

fancyRpartPlot(churn_tree)

(4)

(a) (b) (c)

  • If their duration is greater than 646 and they are not married they are predicted to subscribe and 64% of the people who fell into this category did subscribe and 36% did not. 4% of people in the bank_sub_train data set fell into this node.

  • If the duration was less than 646 and their previous outcome was failure, other and unknown they are predicted to not churn . #93% of the people who fell into theis category did not subscribe and 7% did. 90% of people in the bank_sub_train dataset fell into this node.

Decision Tree Insights

churn_tree$variable.importance
         duration      prev_outcome           marital           balance 
      91.88034197       54.19775101        4.78150081        3.79274735 
              age num_prev_contacts              loan 
       2.82892333        0.22959184        0.03823735 
summary(churn_tree)
Call:
rpart(formula = subscribed ~ age + marital + balance + loan + 
    duration + num_prev_contacts + prev_outcome, data = bank_sub_train)
  n= 3000 

          CP nsplit rel error    xerror       xstd
1 0.05849582      0 1.0000000 1.0000000 0.04951951
2 0.04456825      2 0.8830084 0.9721448 0.04891730
3 0.01253482      3 0.8384401 0.9080780 0.04748259
4 0.01000000      6 0.7966574 0.9247911 0.04786377

Variable importance
    duration prev_outcome      marital      balance          age 
          58           34            3            2            2 

Node number 1: 3000 observations,    complexity param=0.05849582
  predicted class=No   expected loss=0.1196667  P(node) =1
    class counts:  2641   359
   probabilities: 0.880 0.120 
  left son=2 (2758 obs) right son=3 (242 obs)
  Primary splits:
      duration          < 645.5 to the left,  improve=88.179720, (0 missing)
      prev_outcome      splits as  LLRL,      improve=49.703030, (0 missing)
      num_prev_contacts < 0.5   to the left,  improve=16.656260, (0 missing)
      age               < 60.5  to the left,  improve=11.538120, (0 missing)
      balance           < 888   to the left,  improve= 5.539762, (0 missing)

Node number 2: 2758 observations,    complexity param=0.05849582
  predicted class=No   expected loss=0.08375635  P(node) =0.9193333
    class counts:  2527   231
   probabilities: 0.916 0.084 
  left son=4 (2690 obs) right son=5 (68 obs)
  Primary splits:
      prev_outcome      splits as  LLRL,      improve=53.968160, (0 missing)
      duration          < 211.5 to the left,  improve=23.274390, (0 missing)
      num_prev_contacts < 0.5   to the left,  improve=15.599510, (0 missing)
      age               < 60.5  to the left,  improve=10.263590, (0 missing)
      balance           < 888   to the left,  improve= 3.629022, (0 missing)

Node number 3: 242 observations,    complexity param=0.04456825
  predicted class=Yes  expected loss=0.4710744  P(node) =0.08066667
    class counts:   114   128
   probabilities: 0.471 0.529 
  left son=6 (136 obs) right son=7 (106 obs)
  Primary splits:
      marital           splits as  RLR,       improve=4.7815010, (0 missing)
      balance           < 8     to the left,  improve=2.9993270, (0 missing)
      duration          < 771   to the left,  improve=2.4103630, (0 missing)
      num_prev_contacts < 1.5   to the left,  improve=1.5679810, (0 missing)
      prev_outcome      splits as  RRLL,      improve=0.6950792, (0 missing)
  Surrogate splits:
      age      < 37.5  to the right, agree=0.607, adj=0.104, (0 split)
      duration < 669.5 to the right, agree=0.579, adj=0.038, (0 split)
      balance  < -53.5 to the right, agree=0.574, adj=0.028, (0 split)

Node number 4: 2690 observations
  predicted class=No   expected loss=0.06802974  P(node) =0.8966667
    class counts:  2507   183
   probabilities: 0.932 0.068 

Node number 5: 68 observations
  predicted class=Yes  expected loss=0.2941176  P(node) =0.02266667
    class counts:    20    48
   probabilities: 0.294 0.706 

Node number 6: 136 observations,    complexity param=0.01253482
  predicted class=No   expected loss=0.4411765  P(node) =0.04533333
    class counts:    76    60
   probabilities: 0.559 0.441 
  left son=12 (21 obs) right son=13 (115 obs)
  Primary splits:
      balance           < 7.5   to the left,  improve=3.1217630, (0 missing)
      duration          < 758   to the left,  improve=2.1980180, (0 missing)
      age               < 40.5  to the left,  improve=1.6987540, (0 missing)
      prev_outcome      splits as  LRRL,      improve=0.9119144, (0 missing)
      num_prev_contacts < 1.5   to the left,  improve=0.8403361, (0 missing)

Node number 7: 106 observations
  predicted class=Yes  expected loss=0.3584906  P(node) =0.03533333
    class counts:    38    68
   probabilities: 0.358 0.642 

Node number 12: 21 observations
  predicted class=No   expected loss=0.1904762  P(node) =0.007
    class counts:    17     4
   probabilities: 0.810 0.190 

Node number 13: 115 observations,    complexity param=0.01253482
  predicted class=No   expected loss=0.4869565  P(node) =0.03833333
    class counts:    59    56
   probabilities: 0.513 0.487 
  left son=26 (52 obs) right son=27 (63 obs)
  Primary splits:
      age               < 40.5  to the left,  improve=1.9883420, (0 missing)
      num_prev_contacts < 1.5   to the left,  improve=1.6475740, (0 missing)
      balance           < 166.5 to the right, improve=1.4049260, (0 missing)
      duration          < 743   to the left,  improve=1.2862330, (0 missing)
      prev_outcome      splits as  RLRL,      improve=0.4408696, (0 missing)
  Surrogate splits:
      duration < 1016  to the right, agree=0.617, adj=0.154, (0 split)
      balance  < 41.5  to the left,  agree=0.565, adj=0.038, (0 split)
      loan     splits as  RL,        agree=0.557, adj=0.019, (0 split)

Node number 26: 52 observations
  predicted class=No   expected loss=0.3846154  P(node) =0.01733333
    class counts:    32    20
   probabilities: 0.615 0.385 

Node number 27: 63 observations,    complexity param=0.01253482
  predicted class=Yes  expected loss=0.4285714  P(node) =0.021
    class counts:    27    36
   probabilities: 0.429 0.571 
  left son=54 (28 obs) right son=55 (35 obs)
  Primary splits:
      duration          < 771   to the left,  improve=3.2142860, (0 missing)
      num_prev_contacts < 1.5   to the left,  improve=1.2420490, (0 missing)
      balance           < 903.5 to the right, improve=0.8449725, (0 missing)
      age               < 62    to the right, improve=0.7071429, (0 missing)
      loan              splits as  LR,        improve=0.5844156, (0 missing)
  Surrogate splits:
      balance           < 486.5 to the right, agree=0.619, adj=0.143, (0 split)
      age               < 44.5  to the left,  agree=0.603, adj=0.107, (0 split)
      num_prev_contacts < 0.5   to the left,  agree=0.587, adj=0.071, (0 split)
      prev_outcome      splits as  RRRL,      agree=0.587, adj=0.071, (0 split)

Node number 54: 28 observations
  predicted class=No   expected loss=0.3928571  P(node) =0.009333333
    class counts:    17    11
   probabilities: 0.607 0.393 

Node number 55: 35 observations
  predicted class=Yes  expected loss=0.2857143  P(node) =0.01166667
    class counts:    10    25
   probabilities: 0.286 0.714 

(5)

Training Predictions

train_probs <- predict(churn_tree, newdata = bank_sub_train, type = 'prob')

train_preds <- predict(churn_tree, newdata = bank_sub_train, type = 'class')

sub_train_updated <- cbind(bank_sub_train, train_probs, train_preds)

head(sub_train_updated)
    id age  marital balance loan duration num_prev_contacts prev_outcome
1 3786  51   single     -55   no      119                 0      unknown
2  503  43 divorced     738   no      585                 4      failure
3 3430  53  married     719   no      230                 0      unknown
4 3696  42  married      83  yes      184                 0      unknown
5 4090  45  married     185   no      249                 0      unknown
6 3052  36  married    1554   no      325                 0      unknown
  subscribed        No        Yes train_preds
1         No 0.9319703 0.06802974          No
2         No 0.9319703 0.06802974          No
3         No 0.9319703 0.06802974          No
4         No 0.9319703 0.06802974          No
5         No 0.9319703 0.06802974          No
6         No 0.9319703 0.06802974          No
train_con_mat <- table(sub_train_updated$subscribed, sub_train_updated$train_preds, dnn=c('Actual', 'Predicted'))

train_con_mat
      Predicted
Actual   No  Yes
   No  2573   68
   Yes  218  141

From the above confusion matrix, we know the following:

• The overall model accuracy is 2714/3000 = 0.9 or 90%.

• Of all customers the model predicted would subscribed, they got 2575/2791 = 0.92 or 92% correct.

• Of all customers the model predicted would not subscribed, they got 141/209 = 0.67 or 67% correct.

• Of all customers who did actually subscribed, the model correctly identified 2573/2641 = 0.97 or 97%.

• Of all customers who did not actually subscribed, the model correctly identified 141/359 = 0.39 or 39%.

Testing Predictions

test_probs <- predict(churn_tree, newdata = bank_sub_test, type = 'prob')

test_preds <- predict(churn_tree, newdata = bank_sub_test, type = 'class')

sub_test_updated <- cbind(bank_sub_test, test_probs, test_preds)

head(sub_test_updated)
    id age marital balance loan num_prev_contacts duration prev_outcome
1 3961  32  single     473   no                 1       72        other
2 2722  44 married    1058   no                 0      188      unknown
3 3060  39 married     186   no                 0      433      unknown
4 3705  35 married       0   no                 0      146      unknown
5 3328  46 married    1291   no                17      142      failure
6 2403  35 married     280  yes                 0       65      unknown
  subscribed        No        Yes test_preds
1         No 0.9319703 0.06802974         No
2        Yes 0.9319703 0.06802974         No
3         No 0.9319703 0.06802974         No
4         No 0.9319703 0.06802974         No
5         No 0.9319703 0.06802974         No
6         No 0.9319703 0.06802974         No
test_con_mat <- table(sub_test_updated$subscribed, sub_test_updated$test_preds, dnn=c('Actual', 'Predicted'))

test_con_mat
      Predicted
Actual   No  Yes
   No  1151   40
   Yes   91   52

From the above confusion matrix, we know the following:

• The overall model accuracy is 1203/1334 = 0.9 or 90%.

• Of all customers the model predicted would subscribe, they got 1151/249 = 0.92 or 92% correct.

• Of all customers the model predicted would not subscribe, they got 52/92 = 0.57 or 57% correct.

• Of all customers who did actually subscribed, the model correctly identified 1151/1191 = 0.96 or 96%.

• Of all customers who did not actually subscribe, the model correctly identified 52/143 = 0.36 or 36%.

(5)

(a)

The overall accuracy of the model is the same for both the training and testing datasets at 90%. Therefore, the classification tree does not appear to be overfitting and so we do not need to consider pruning the tree.

Question 3

(6) (7)

Logistic Regression Model

bank_sub_train$subscribed <- factor(bank_sub_train$subscribed, levels = c("Yes", "No"))

bank_sub_test$subscribed <- factor(bank_sub_test$subscribed, levels = c("Yes", "No"))

subscribed_lr <- glm(subscribed ~ age + marital + balance + duration + prev_outcome, data = bank_sub_train, family = binomial(link = 'logit'))

summary(subscribed_lr)

Call:
glm(formula = subscribed ~ age + marital + balance + duration + 
    prev_outcome, family = binomial(link = "logit"), data = bank_sub_train)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          3.373e+00  3.183e-01  10.598  < 2e-16 ***
age                 -1.323e-02  6.573e-03  -2.013 0.044118 *  
maritalmarried       5.379e-01  1.622e-01   3.316 0.000913 ***
maritaldivorced      1.999e-02  2.210e-01   0.090 0.927948    
balance             -2.867e-05  2.168e-05  -1.322 0.186028    
duration            -3.649e-03  2.194e-04 -16.632  < 2e-16 ***
prev_outcomeother   -5.604e-01  2.963e-01  -1.891 0.058558 .  
prev_outcomesuccess -2.766e+00  3.164e-01  -8.742  < 2e-16 ***
prev_outcomeunknown  4.622e-01  1.948e-01   2.372 0.017690 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2197.6  on 2999  degrees of freedom
Residual deviance: 1660.4  on 2991  degrees of freedom
AIC: 1678.4

Number of Fisher Scoring iterations: 5

(a)

Prev_outcome_failure was omitted from prev_outcome and marital_single was omitted in marital.

maritalmarried: a value of 1 indicated the customers marital status is married, otherwise this dummy varible takes a value of 0

maritaldivorce : a value of 1 indicated the customers marital status is divorced, otherwise # this dummy varible takes a value of 0

prev_outcomeother: a value of 1 indicated the customers previous outcome was unknown, otherwise this dummy varible takes a value of 0

prev_outcomesuccess: a value of 1 indicated the customers previous outcome was success, otherwise this dummy varible takes a value of 0

prev_outcomeunknown: a value of 1 indicated the customers previous outcome was unknown, otherwise this dummy varible takes a value of 0

(b)

y= Ln (pi / 1 - pi) = 3.373 - 0.01323 (age) + 0.5379 (maritalmarried) + 0.01999 (maritaldivorced) - 0.00002867 (balance) - 0.003649 (duration) - 0.560 (prev_outcomeother) - 2.766 (prev_outcomesuccess) + 0.462 (prev_outcomeunknown)

(c)

Age, maritalmarried, duration, prev_outcomesuccess and prev_outcomeunknown are all significant perdictor variables as they all have p - values of less than 0.05.

(d)

The co-efficient is positive for maritalmarried, maritaldivorced and prev_outcomeuknown due to this they are likely to impact if the customer subscribes.

The co-efficient is negative for age, balance, duration, prev_outcomeother and prev_outcomesuccess due to this they are unlikely to impact if the customer subscribes.

(8)

Logistic Regression Predictions - Training Data

train_pi <- predict(subscribed_lr, newdata = bank_sub_train, type = 'response')

subscribed_train_updated <- bank_sub_train %>%

mutate(pi = train_pi) %>%

mutate(prediction = case_when(pi > 0.5 ~ 'No', pi <= 0.5 ~ 'Yes'))

subscribed_train_updated$prediction <- factor(subscribed_train_updated$prediction, levels = c("Yes", "No"))

train_con_mat <- table(subscribed_train_updated$subscribed, subscribed_train_updated$prediction, dnn = c('Actual', 'Predicted'))

train_con_mat
      Predicted
Actual  Yes   No
   Yes  106  253
   No    55 2586

Logistic Regression Predictions - Testing Data

test_pi <- predict(subscribed_lr, newdata = bank_sub_test, type = 'response')

subscribed_test_updated <- bank_sub_test %>%

mutate(pi = test_pi) %>%

mutate(prediction = case_when(pi > 0.5 ~ 'No', pi <= 0.5 ~ 'Yes'))

subscribed_test_updated$prediction <- factor(subscribed_test_updated$prediction, levels = c("Yes", "No"))

test_con_mat <- table(subscribed_test_updated$subscribed, subscribed_test_updated$prediction, dnn = c('Actual', 'Predicted'))

test_con_mat
      Predicted
Actual  Yes   No
   Yes   44   99
   No    34 1157

Prev_outcome_failure was omitted from prev_outcome and marital_single was omitted in marital.

3-c

Age, maritalmarried, duration, prev_outcomesuccess and prev_outcomeunknown are all significant perdictor variables as they all have p - values of less than 0.05.

3-d

The co-efficient is positive for maritalmarried, maritaldivorced and prev_outcomeuknown due to this they are likely to impact if the customer subscribes.

The co-efficient is negative for age, balance, duration, prev_outcomeother and prev_outcomesuccess due to this they are unlikely to impact if the customer subscribes.

Question 4

(9)

From the classification tree confusion matrix, we know the following:

• The overall model accuracy is 2714/3000 = 0.9 or 90%.

• Of all customers the model predicted would subscribe, they got 2575/2791 = 0.92 or 92% correct.

• Of all customers the model predicted would not subscribe, they got 141/209 = 0.67 or 67% correct.

• Of all customers who did actually subscribe, the model correctly identified 2573/2641 = 0.97 or 97%.

• Of all customers who did not actually subscribe, the model correctly identified 141/359 = 0.39 or 39%.

From the binary logistic regression model confusion matrix, we know the following:

• The overall model accuracy is (44 + 1157)/1334 = 0.9 or 90%.

• Of all customers the model predicted would susbcribe, they got 44/78 = 0.56 or 56% correct.

• Of all customers the model predicted would not subscribe, they got 1157/1256 = 0.92 or 92% correct.

• Of all customers who did actually subscribe, the model correctly identified 44/143 = 0.31 or 31%.

• Of all customers who did not actually subscribe, the model correctly identified 1157/1191 = 0.97 or 97%.

Both of the models have an overall accuracy score of 90%. Which is a high score.

However, they differ in their ability to correctly identify subscribers and non-subscribers:

  • The classification tree correctly predicted 92% of those who subscribed and 67% of those who did not.
  • The logistic regression model correctly predicted 56% of those who subscribed and 92% of those who did not.
  • The classification tree was more effective in correctly identifying actual subscribers (97% vs. 31%) but struggled with non-subscribers (39% vs. 97%).

The classification tree is better at identifying likely subscribers, while logistic regression is better at identifying non-subscribers.

(10)

I think the classification tree should be used by the company as The classification tree is better at identifying likely subscribers which is more important to focus on rather than non subscribers as their aim in to increase subscribers.

(11)

(a)

Using the classification tree, the strongest factor influencing subscription is duration - the duration in days since the customer last subsribed to a product or plan with the bank followed by marital status, balance, age, and past contact outcomes. A duration over 646 days is the biggest indicator of a higher likelihood to subscribe, particularly if the customer is married.

(b)

Create marketing strategys to target returning cutsomers as customers who have subscribed before more than 646 days ago are more likley to subscribe again.

For married customers market to them family plans or long term benefits.

Focus on younger customers below the age of 41 as they are likely to subscribe e.g. future financial secuirty, home improvements