Data Assignment Ver2

Author

Chloe O Donovan

Question 1 Data Visualisation

Uploading libraries and datasets

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.4.2
Warning: package 'stringr' was built under R version 4.4.2
library(ggplot2)
library(dplyr)
library(rattle)
Warning: package 'rattle' was built under R version 4.4.2
library(rpart)

bank_training <- read_csv("bank_training.csv")
bank_testing <- read_csv("bank_testing.csv")

Examining the relationship between “churn” and “geography”:

`summarise()` has grouped output by 'geography'. You can override using the
`.groups` argument.

In the clustered bar chart above you can see the following: - On the right side in blue you can see over 80% of the customers in France did not churn, while on the left in red over 15% did churn.

  • In Germany on the right side in blue over 65% did not churn, while on the left hand side in red over 30% did churn.

  • In Spain on the right hand side in blue over 80% of customers did not churn, while on the left hand side in red over 15% did churn.

This overall shows us that majority of customers did not churn over all and also that Germany had the most customers that churned compared to Spain and France which were very similar.

Examining the relationship between “churn” and “balance”:

ggplot(bank_training, aes(x = churn, y = balance, fill = churn)) +
  geom_boxplot() +
  labs(x = "Churn Status", y = "Account Balance", fill = "Churn Status") +
  theme_minimal() +
  scale_fill_manual(values = c("Yes" = "red", "No" = "green"))

The above box plot shows customers that had between €0-€125000 did not churn while the customers that did churn had between €4000-€125000 in their account balance.

Question 2 - Classification Trees

Creating and visualising the classification tree model

churn_tree <- rpart(churn ~ credit_score + geography + gender + age  + tenure + balance + num_products, 
                      data = bank_training)
fancyRpartPlot(churn_tree)

Interpretating the classification tree:

  • If the condition is true and they are over the age of 43 and have less than two and a half number of products they are most likely to churn. This shows 90% of them did not churn and 10% did churn. These figures give how pure the leaf is and also provides the probability of a customer churning.

  • If the condition is flase and they are under the age of 43 and have less than two and a half number of products they are most likely not to churn. This shows 59% of them did not churn and 41% did churn. These figures give how pure the leaf is and also provides the probability of a customer churning.If they are under 43 years of age, and have less than two and a half products they are not likely to churn.

  • Based on your findings, do you think the classification tree is overfitting the training dataset? Explain your answer. Based on my findings the classification tree is not overfitting the training dataset as both accuracy models are the same at 78% for the bank training and bank testing.

churn_tree$variable.importance
num_products          age      balance    geography credit_score 
  302.273003   286.542433    54.863696    38.150482     0.921508 
Call:
rpart(formula = churn ~ credit_score + geography + gender + age + 
    tenure + balance + num_products, data = bank_training)
  n= 8000 

         CP nsplit rel error  xerror       xstd
1 0.0377397      0  1.000000 1.00000 0.02206797
2 0.0100000      6  0.750918 0.75459 0.01976435

Variable importance
num_products          age      balance    geography 
          44           42            8            6 

Node number 1: 8000 observations,    complexity param=0.0377397
  predicted class=No   expected loss=0.20425  P(node) =1
    class counts:  6366  1634
   probabilities: 0.796 0.204 
  left son=2 (5693 obs) right son=3 (2307 obs)
  Primary splits:
      age          < 42.5     to the left,  improve=286.31750, (0 missing)
      num_products < 2.5      to the left,  improve=228.42510, (0 missing)
      geography    splits as  LRL,          improve= 78.34265, (0 missing)
      balance      < 87554.41 to the left,  improve= 39.40310, (0 missing)
      gender       splits as  RL,           improve= 33.45343, (0 missing)
  Surrogate splits:
      num_products < 3.5      to the left,  agree=0.714, adj=0.007, (0 split)
      credit_score < 361      to the right, agree=0.712, adj=0.002, (0 split)

Node number 2: 5693 observations,    complexity param=0.0377397
  predicted class=No   expected loss=0.1190936  P(node) =0.711625
    class counts:  5015   678
   probabilities: 0.881 0.119 
  left son=4 (5566 obs) right son=5 (127 obs)
  Primary splits:
      num_products < 2.5      to the left,  improve=105.35470, (0 missing)
      age          < 38.5     to the left,  improve= 22.76541, (0 missing)
      geography    splits as  LRL,          improve= 19.53761, (0 missing)
      credit_score < 407.5    to the right, improve= 15.29919, (0 missing)
      balance      < 97664.38 to the left,  improve= 11.94169, (0 missing)

Node number 3: 2307 observations,    complexity param=0.0377397
  predicted class=No   expected loss=0.414391  P(node) =0.288375
    class counts:  1351   956
   probabilities: 0.586 0.414 
  left son=6 (2179 obs) right son=7 (128 obs)
  Primary splits:
      num_products < 2.5      to the left,  improve=83.29378, (0 missing)
      geography    splits as  LRL,          improve=55.33024, (0 missing)
      age          < 65.5     to the right, improve=35.43456, (0 missing)
      balance      < 87372.1  to the left,  improve=31.80820, (0 missing)
      gender       splits as  RL,           improve=25.40391, (0 missing)

Node number 4: 5566 observations
  predicted class=No   expected loss=0.1045634  P(node) =0.69575
    class counts:  4984   582
   probabilities: 0.895 0.105 

Node number 5: 127 observations
  predicted class=Yes  expected loss=0.2440945  P(node) =0.015875
    class counts:    31    96
   probabilities: 0.244 0.756 

Node number 6: 2179 observations,    complexity param=0.0377397
  predicted class=No   expected loss=0.3818265  P(node) =0.272375
    class counts:  1347   832
   probabilities: 0.618 0.382 
  left son=12 (849 obs) right son=13 (1330 obs)
  Primary splits:
      num_products < 1.5      to the right, improve=111.76290, (0 missing)
      geography    splits as  LRL,          improve= 49.04596, (0 missing)
      age          < 65.5     to the right, improve= 29.88819, (0 missing)
      balance      < 87460.34 to the left,  improve= 28.42016, (0 missing)
      gender       splits as  RL,           improve= 19.64109, (0 missing)
  Surrogate splits:
      balance < 6229.595 to the left,  agree=0.701, adj=0.233, (0 split)
      age     < 79.5     to the right, agree=0.611, adj=0.001, (0 split)

Node number 7: 128 observations
  predicted class=Yes  expected loss=0.03125  P(node) =0.016
    class counts:     4   124
   probabilities: 0.031 0.969 

Node number 12: 849 observations
  predicted class=No   expected loss=0.1813899  P(node) =0.106125
    class counts:   695   154
   probabilities: 0.819 0.181 

Node number 13: 1330 observations,    complexity param=0.0377397
  predicted class=Yes  expected loss=0.4902256  P(node) =0.16625
    class counts:   652   678
   probabilities: 0.490 0.510 
  left son=26 (921 obs) right son=27 (409 obs)
  Primary splits:
      geography    splits as  LRL,          improve=38.150480, (0 missing)
      age          < 66.5     to the right, improve=33.481090, (0 missing)
      gender       splits as  RL,           improve=11.784970, (0 missing)
      balance      < 46303.52 to the right, improve= 8.509361, (0 missing)
      credit_score < 407.5    to the right, improve= 4.842834, (0 missing)
  Surrogate splits:
      age < 81.5     to the left,  agree=0.693, adj=0.002, (0 split)

Node number 26: 921 observations,    complexity param=0.0377397
  predicted class=No   expected loss=0.4299674  P(node) =0.115125
    class counts:   525   396
   probabilities: 0.570 0.430 
  left son=52 (650 obs) right son=53 (271 obs)
  Primary splits:
      balance      < 46303.52 to the right, improve=28.798860, (0 missing)
      age          < 62.5     to the right, improve=20.755240, (0 missing)
      gender       splits as  RL,           improve= 6.980419, (0 missing)
      tenure       < 4.5      to the right, improve= 4.729138, (0 missing)
      credit_score < 421      to the right, improve= 3.828371, (0 missing)
  Surrogate splits:
      credit_score < 444.5    to the right, agree=0.71, adj=0.015, (0 split)

Node number 27: 409 observations
  predicted class=Yes  expected loss=0.3105134  P(node) =0.051125
    class counts:   127   282
   probabilities: 0.311 0.689 

Node number 52: 650 observations
  predicted class=No   expected loss=0.3492308  P(node) =0.08125
    class counts:   423   227
   probabilities: 0.651 0.349 

Node number 53: 271 observations
  predicted class=Yes  expected loss=0.3763838  P(node) =0.033875
    class counts:   102   169
   probabilities: 0.376 0.624 

Findings of importance are: - num_products: 44 - age: 42 - balance: 8 - geography: 6

Assessing the accuracy of the classification tree using both the training and the testing datasets.

train_probs <- predict(churn_tree, newdata = bank_training, type = 'prob')
train_preds <- predict(churn_tree, newdata = bank_training, type = 'class')

churn_train_updated <- cbind(bank_training, train_probs, train_preds)

train_con_mat <- table(churn_train_updated$churn,
                         churn_train_updated$train_preds, 
                         dnn=c('Actual', 'Predicted'))

train_con_mat
      Predicted
Actual   No  Yes
   No  6102  264
   Yes  963  671
  • The overall model accuracy is: 6102 + 671 / 8000 = 85%
  • Of all customers the model predicted to churn they got 671/671 + 264 = 72% correct.
  • Of all customers the model predicted to not churn they got 6102/ 6102 + 963 = 86% correct.
  • Of all customers who did churn the model correctly identified 671/ 671 + 963 = 41%
  • Of all customers who did not churn the model correctly identified 6102/ 6102 + 264 = 96%
test_probs <- predict(churn_tree, newdata = bank_testing, type = 'prob')
test_preds <- predict(churn_tree, newdata = bank_testing, type = 'class')

churn_test_updated <- cbind(bank_testing, test_probs, test_preds)

test_con_mat <- table(churn_test_updated$churn,
                         churn_test_updated$test_preds, 
                         dnn=c('Actual', 'Predicted'))

test_con_mat
      Predicted
Actual   No  Yes
   No  1531   66
   Yes  225  178
  • The overall model accuracy is: 178 + 1531 / 2000 = 85%
  • Of all customers the model predicted to churn they got 178/178 + 66 = 73% correct.
  • Of all customers the model predicted to not churn they got 1531/ 1531 + 225 = 87% correct.
  • Of all customers who did churn the model correctly identified 178/ 178 + 225 = 44% correctly.
  • Of all customers who did not churn the model correctly identified 1531/ 1531 + 225 = 87% correctly.

The model is not overfitting although some of the percentages are alike, the overall accuracy percent is the same therefore, there is no need for pruning the tree.

Question 3 - Binary Logistic Regression

Set up the levels of the target variable “churn”.

bank_training$churn <- factor(bank_training$churn, levels = c("Yes", "No"))
bank_testing$churn <- factor(bank_testing$churn, levels = c("Yes", "No"))

levels(bank_training$churn)
[1] "Yes" "No" 
levels(bank_testing$churn)
[1] "Yes" "No" 

Create the Binary Logistic Regression model.

churn_lr <- glm(churn ~ credit_score + age + tenure + geography + balance + num_products + gender, data = bank_training, family = binomial( link = 'logit'))
summary(churn_lr)

Call:
glm(formula = churn ~ credit_score + age + tenure + geography + 
    balance + num_products + gender, family = binomial(link = "logit"), 
    data = bank_training)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       3.337e+00  2.579e-01  12.942  < 2e-16 ***
credit_score      8.639e-04  3.070e-04   2.815  0.00488 ** 
age              -6.426e-02  2.727e-03 -23.563  < 2e-16 ***
tenure            1.457e-02  1.023e-02   1.424  0.15441    
geographyGermany -7.766e-01  7.397e-02 -10.500  < 2e-16 ***
geographySpain   -2.547e-02  7.761e-02  -0.328  0.74277    
balance          -2.509e-06  5.658e-07  -4.434 9.24e-06 ***
num_products      1.127e-01  5.191e-02   2.171  0.02990 *  
genderMale        5.664e-01  5.965e-02   9.496  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8099.8  on 7999  degrees of freedom
Residual deviance: 7141.1  on 7991  degrees of freedom
AIC: 7159.1

Number of Fisher Scoring iterations: 4

Notice the dummy variables created. For each categorical predictor variable, which dummy variables are omitted?

There were two dummy variables that were omitted were: - genderFemale - geographyFrance.

Write down the regression equation.

y = ln(π/(1-π)) = 3.337 + 0.000834(credit_score) - 0.006426(age) + 0.001457(tenure) - 0.07766(geographyGermany) - 0.002547 (geographySpain) - 0.0000002509 (balance) + 0.01127 (num_products) + 0.05664 (genderMale)

Which predictor variables are significant?

credit_score, age, geographyGermany, balance, num_products and genderMale are all significant as they are less than 0.05.

How does each predictor variable impact the likelihood of a customer churning from the easy-saver product?

credit_score, tenure, num_products and genderMale are all positive meaning they are less likely to churn from the easy saver product. While age, geographyGermany, geographySpain, and balance are all negative meaning they are more likely to churn from the easy saver product.

Training prediction

train_pi <- predict(churn_lr, newdata = bank_training, type = 'response')

churn_train_updated <- bank_training %>%
  mutate(pi = train_pi) %>%
  mutate(prediction = case_when(pi > 0.5 ~ 'Yes', 
                                pi <= 0.5 ~ 'No'))

churn_train_updated$prediction <- factor(churn_train_updated$prediction, 
                                         levels = c("No", "Yes"))

train_con_mat <- table(churn_train_updated$churn, 
                         churn_train_updated$prediction, 
                         dnn = c('Actual', 'Predicted'))
train_con_mat
      Predicted
Actual   No  Yes
   Yes  206 1428
   No   238 6128

From the matrix above we know: - The overall model accuracy is (102 + 6129)/8000 = 78% - Of all customers the model predicted would churn, they got 6129/6129 + 1532 = 80% correct. - Of all customers the model predicted would not churn, they got 102/102 + 237 = 30% correct. - Of all customers who did actually churn, the model correctly identified 6129/6129 + 237 = 96% - Of all customers who did not churn the model correctly identified 102/102 + 1532 = 62%

Testing prediction

test_pi <- predict(churn_lr, newdata = bank_testing, type = 'response')

churn_test_updated <- bank_testing %>%
  mutate(pi = test_pi) %>%
  mutate(prediction = case_when(pi > 0.5 ~ 'Yes', 
                                pi <= 0.5 ~ 'No'))

churn_test_updated$prediction <- factor(churn_test_updated$prediction, 
                                        levels = c("No", "Yes"))
test_con_mat <- table(churn_test_updated$churn, 
                        churn_test_updated$prediction, 
                        dnn = c('Actual', 'Predicted'))
test_con_mat
      Predicted
Actual   No  Yes
   Yes   54  349
   No    59 1538

From the matrix above we know: - The overall model accuracy is (29 + 1535)/2000 = 78% - Of all customers the model predicted would churn, they got 1535/1535 + 374 = 80% correct. - Of all customers the model predicted would not churn, they got 29/29 + 62 = 32% correct. - Of all customers who did actually churn, the model correctly identified 1535/1535 + 69 = 96% - Of all customers who did not churn the model correctly identified 29/29 + 374 = 72%

Question 4 – Model Comparison & Marketing Actions

Training dataset

train_probs <- predict(churn_tree, newdata = bank_training, type = 'prob')
train_preds <- predict(churn_tree, newdata = bank_training, type = 'class')

churn_train_updated <- cbind(bank_training, train_probs, train_preds)

train_con_mat <- table(churn_train_updated$churn,
                         churn_train_updated$train_preds, 
                         dnn=c('Actual', 'Predicted'))

train_con_mat
      Predicted
Actual   No  Yes
   Yes  963  671
   No  6102  264
  • The overall model accuracy is: 6102 + 671 / 8000 = 85%
  • Of all customers the model predicted to churn they got 671/671 + 264 = 72% correct.
  • Of all customers the model predicted to not churn they got 6102/ 6102 + 963 = 86% correct.
  • Of all customers who did churn the model correctly identified 671/ 671 + 963 = 41%
  • Of all customers who did not churn the model correctly identified 6102/ 6102 + 264 = 96%
train_pi <- predict(churn_lr, newdata = bank_training, type = 'response')

churn_train_updated <- bank_training %>%
  mutate(pi = train_pi) %>%
  mutate(prediction = case_when(pi > 0.5 ~ 'Yes', 
                                pi <= 0.5 ~ 'No'))

churn_train_updated$prediction <- factor(churn_train_updated$prediction, 
                                         levels = c("No", "Yes"))

train_con_mat <- table(churn_train_updated$churn, 
                         churn_train_updated$prediction, 
                         dnn = c('Actual', 'Predicted'))
train_con_mat
      Predicted
Actual   No  Yes
   Yes  206 1428
   No   238 6128

From the matrix above we know: - The overall model accuracy is (102 + 6129)/8000 = 78% - Of all customers the model predicted would churn, they got 6129/6129 + 1532 = 80% correct. - Of all customers the model predicted would not churn, they got 102/102 + 237 = 30% correct. - Of all customers who did actually churn, the model correctly identified 6129/6129 + 237 = 96% - Of all customers who did not churn the model correctly identified 102/102 + 1532 = 62%

Testing the accuracy of the training dataset: (85 +78 + 72 + 80 + 86 + 30 + 41 + 96 + 96 + 62)/10 = 72.6% accuracy

Testing dataset

test_probs <- predict(churn_tree, newdata = bank_testing, type = 'prob')
test_preds <- predict(churn_tree, newdata = bank_testing, type = 'class')

churn_test_updated <- cbind(bank_testing, test_probs, test_preds)

test_con_mat <- table(churn_test_updated$churn,
                         churn_test_updated$test_preds, 
                         dnn=c('Actual', 'Predicted'))

test_con_mat
      Predicted
Actual   No  Yes
   Yes  225  178
   No  1531   66
  • The overall model accuracy is: 178 + 1531 / 2000 = 85%
  • Of all customers the model predicted to churn they got 178/178 + 66 = 73% correct.
  • Of all customers the model predicted to not churn they got 1531/ 1531 + 225 = 87% correct.
  • Of all customers who did churn the model correctly identified 178/ 178 + 225 = 44% correctly.
  • Of all customers who did not churn the model correctly identified 1531/ 1531 + 225 = 87% correctly.
test_pi <- predict(churn_lr, newdata = bank_testing, type = 'response')

churn_test_updated <- bank_testing %>%
  mutate(pi = test_pi) %>%
  mutate(prediction = case_when(pi > 0.5 ~ 'Yes', 
                                pi <= 0.5 ~ 'No'))

churn_test_updated$prediction <- factor(churn_test_updated$prediction, 
                                        levels = c("No", "Yes"))
test_con_mat <- table(churn_test_updated$churn, 
                        churn_test_updated$prediction, 
                        dnn = c('Actual', 'Predicted'))
test_con_mat
      Predicted
Actual   No  Yes
   Yes   54  349
   No    59 1538

From the matrix above we know: - The overall model accuracy is (29 + 1535)/2000 = 78% - Of all customers the model predicted would churn, they got 1535/1535 + 374 = 80% correct. - Of all customers the model predicted would not churn, they got 29/29 + 62 = 32% correct. - Of all customers who did actually churn, the model correctly identified 1535/1535 + 69 = 96% - Of all customers who did not churn the model correctly identified 29/29 + 374 = 72%

Testing the accuracy of the testing dataset: (85 + 78 + 73 + 80 + 87 + 32 + 44 + 96 + 87 + 72)/10 = 73.4% accuracy.

Both models are alike and close in accuracy, however the binary logistic regression model is on average more accurate in predicting whether a customer will churn or not.

Which model do you think should be used by the company? Explain your answer.

Due to the accuracy rate being hire I suggest that the bank go with the binary logistic regression model as it would give a more accurate prediction on if customers will churn or will not churn.

What are some of the main reasons for customers to churn from the easy-saver product?

Age is a reason customers seem to churn, from the classification tree you can see customers over the age of 43 80% of them churned having more than two and a half products. num_products is a reason why customers would churn, from the classification tree above we could tell customers with over two and a half num_products even under the age of 43 also has a big percent rate in which they would churn.

How could your chosen predictive model be used for marketing purposes?

  • Set up a targeted email or SMS campaigns for at-risk customers with tailored messages encouraging engagement.
  • Send out a customer servey and feedback form to identify patterns in feedback or complaints that correlate with churn.
  • Prioritize these customers for retention efforts, such as personalized support.