Data analytics assignment 1 - 2026

Predictive analytics

#install.packages("rattle")
library(rattle)
library(rpart)
library(tidyverse)

A)

1.

churn_train <- read_csv("bank_training.csv")
churn_test <- read_csv("bank_testing.csv")
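
A quick structure check is useful before plotting (a minimal sketch, not part of the required answers; it assumes both CSV files sit in the working directory and that churn is read in as text with "Yes"/"No" values):

glimpse(churn_train)        # column types as parsed by read_csv()
count(churn_train, churn)   # class balance of the target variable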

2.

a)

churn_train %>%
  count(geography, churn) %>%
  group_by(geography) %>%
  mutate(percentage = n / sum(n)) %>%
  ggplot(aes(x = geography, y = percentage, fill = churn)) +
   geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Churn Rate by Geography",
    x = "Geography",
    y = "Percentage of Customers",
    fill = "Churn")

Customers from Germany are the most likely to churn, with over 25% churning, while customers from France and Spain both sit at just over a 13% churn rate.
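
The same percentages can also be printed as a table rather than read off the bars; a small sketch reusing the counts from the plotting code above:

# Churn rate within each geography (shares sum to 1 per country)
churn_train %>%
  count(geography, churn) %>%
  group_by(geography) %>%
  mutate(percentage = scales::percent(n / sum(n))) %>%
  ungroup()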

b)

ggplot(data = churn_train) +
  geom_boxplot(mapping = aes(x = churn, y = credit_score),
               outlier.shape = 8, outlier.colour = "red") +
  labs(
    title = "Churn by Credit Score",
    x = "Churn",
    y = "Credit score")

The median credit score is similar for churned and non-churned customers. Churned customers show some extremely low credit-score outliers, but overall there does not seem to be a strong connection between credit_score and churn, apart from churn becoming more likely once the credit score drops below roughly 400.
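
That threshold can be checked directly on the data; a sketch (the 400 cutoff is only the visual threshold read off the boxplot, not a value given in the assignment):

# Churn rate for customers below vs. at or above a credit score of 400
churn_train %>%
  mutate(low_score = credit_score < 400) %>%
  group_by(low_score) %>%
  summarise(churn_rate = mean(churn == "Yes"), n = n())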

ggplot(data = churn_train) +
  geom_boxplot(mapping = aes(x = churn, y = age),
               outlier.shape = 5, outlier.colour = "red") +
  labs(
    title = "Churn by Age",
    x = "Churn",
    y = "Age")

Customers in their mid-to-late 40s and early 50s appear more likely to churn than customers in their mid 30s to early 40s. Non-churners have many outliers ranging from the late 60s to the early 90s, while churners have noticeably fewer outliers. This boxplot suggests a connection between rising age and the likelihood of churning.

ggplot(data = churn_train) +
  geom_boxplot(mapping = aes(x = churn, y = tenure),
               outlier.shape = 8, outlier.colour = "red") +
  labs(
    title = "Churn by Tenure",
    x = "Churn",
    y = "Tenure")

Tenure does not appear to be strongly connected to churn. The two distributions overlap heavily: the whiskers and medians are the same, and only the box for churners is slightly larger.

ggplot(data = churn_train) +
  geom_boxplot(mapping = aes(x = churn, y = balance),
               outlier.shape = 8, outlier.colour = "red") +
  labs(
    title = "Churn by Balance ",
    x = "Churn",
    y = "Balance (EUR)")

This plot suggests that churners hold slightly higher balances. At least 25% of non-churners have a low or zero balance, so churners tend to keep more money with the bank overall.

ggplot(data = churn_train) +
  geom_boxplot(mapping = aes(x = churn, y = num_products),
               outlier.shape = 4, outlier.colour = "red") +
  labs(
    title = "Churn by # Products",
    x = "Churn",
    y = "# of Products")

Customers who churn typically hold fewer products than those who do not, suggesting that the easy-saver product works well alongside another product but not as a stand-alone.
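
Because num_products only takes a few whole-number values, a count table arguably shows this more clearly than a boxplot; a sketch, together with medians of the other numeric predictors by churn status:

# Churn rate by number of products held
churn_train %>%
  count(num_products, churn) %>%
  group_by(num_products) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# Median of each numeric predictor, split by churn status
churn_train %>%
  group_by(churn) %>%
  summarise(across(c(credit_score, age, tenure, balance, num_products), median))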

Summary

This analysis suggests that geography, age, balance and number of products are associated with customer churn. Germany shows the highest churn rate, while France and Spain sit at a similar, lower level. Older customers are more likely to churn, and churned customers typically hold higher balances and fewer products, which is consistent with the easy-saver product being targeted at younger, lower-balance customers. Credit score and tenure show only a limited connection to churn, apart from the increased churn associated with very low credit scores.

B)

3.

churn_tree <- rpart(churn ~ geography + credit_score + age + tenure + balance + num_products, data = churn_train)
fancyRpartPlot(churn_tree)
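
The tree above is grown with rpart's defaults (cp = 0.01, minsplit = 20, 10-fold cross-validation). A sketch of how the complexity could be inspected and, if needed, adjusted; the cp value of 0.005 is purely illustrative and the deeper tree is not used in the rest of the assignment:

# Cross-validated error for each candidate complexity parameter
printcp(churn_tree)
plotcp(churn_tree)

# Illustrative refit with a smaller cp, which grows a larger tree
churn_tree_deeper <- rpart(churn ~ geography + credit_score + age + tenure +
                             balance + num_products,
                           data = churn_train,
                           control = rpart.control(cp = 0.005))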

4.

a)

Customers aged 43 or older who hold three or more products (num_products ≥ 2.5, node 7) are predicted to churn. This node is very pure: approximately 97% of the customers in it churned, indicating a strong and reliable rule.

b)

Customers aged 42 or younger who hold fewer than three products (num_products < 2.5, node 4) are predicted not to churn. This node is also quite pure: approximately 90% of the customers in it did not churn, again a strong and reliable rule.
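
The rules read off the plot can also be printed directly, which makes it easier to quote thresholds and node purities; a sketch assuming the rpart.plot package (which rattle's fancyRpartPlot builds on) is installed:

library(rpart.plot)

# One row per leaf: the conditions leading to it, the predicted probability
# and (with cover = TRUE) the share of observations falling into that leaf
rpart.rules(churn_tree, cover = TRUE)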

c)

Number of products - 44%, Age - 42%, Balance - 8%, Geography - 6%

churn_tree$variable.importance
num_products          age      balance    geography credit_score 
  302.273003   286.542433    54.863696    38.150482     0.921508 
summary(churn_tree)
Call:
rpart(formula = churn ~ geography + credit_score + age + tenure + 
    balance + num_products, data = churn_train)
  n= 8000 

         CP nsplit rel error    xerror       xstd
1 0.0377397      0  1.000000 1.0000000 0.02206797
2 0.0100000      6  0.750918 0.7594859 0.01981664

Variable importance
num_products          age      balance    geography 
          44           42            8            6 

Node number 1: 8000 observations,    complexity param=0.0377397
  predicted class=No   expected loss=0.20425  P(node) =1
    class counts:  6366  1634
   probabilities: 0.796 0.204 
  left son=2 (5693 obs) right son=3 (2307 obs)
  Primary splits:
      age          < 42.5     to the left,  improve=286.31750, (0 missing)
      num_products < 2.5      to the left,  improve=228.42510, (0 missing)
      geography    splits as  LRL,          improve= 78.34265, (0 missing)
      balance      < 87554.41 to the left,  improve= 39.40310, (0 missing)
      credit_score < 407.5    to the right, improve= 26.10705, (0 missing)
  Surrogate splits:
      num_products < 3.5      to the left,  agree=0.714, adj=0.007, (0 split)
      credit_score < 361      to the right, agree=0.712, adj=0.002, (0 split)

Node number 2: 5693 observations,    complexity param=0.0377397
  predicted class=No   expected loss=0.1190936  P(node) =0.711625
    class counts:  5015   678
   probabilities: 0.881 0.119 
  left son=4 (5566 obs) right son=5 (127 obs)
  Primary splits:
      num_products < 2.5      to the left,  improve=105.35470, (0 missing)
      age          < 38.5     to the left,  improve= 22.76541, (0 missing)
      geography    splits as  LRL,          improve= 19.53761, (0 missing)
      credit_score < 407.5    to the right, improve= 15.29919, (0 missing)
      balance      < 97664.38 to the left,  improve= 11.94169, (0 missing)

Node number 3: 2307 observations,    complexity param=0.0377397
  predicted class=No   expected loss=0.414391  P(node) =0.288375
    class counts:  1351   956
   probabilities: 0.586 0.414 
  left son=6 (2179 obs) right son=7 (128 obs)
  Primary splits:
      num_products < 2.5      to the left,  improve=83.29378, (0 missing)
      geography    splits as  LRL,          improve=55.33024, (0 missing)
      age          < 65.5     to the right, improve=35.43456, (0 missing)
      balance      < 87372.1  to the left,  improve=31.80820, (0 missing)
      credit_score < 407.5    to the right, improve= 7.58078, (0 missing)

Node number 4: 5566 observations
  predicted class=No   expected loss=0.1045634  P(node) =0.69575
    class counts:  4984   582
   probabilities: 0.895 0.105 

Node number 5: 127 observations
  predicted class=Yes  expected loss=0.2440945  P(node) =0.015875
    class counts:    31    96
   probabilities: 0.244 0.756 

Node number 6: 2179 observations,    complexity param=0.0377397
  predicted class=No   expected loss=0.3818265  P(node) =0.272375
    class counts:  1347   832
   probabilities: 0.618 0.382 
  left son=12 (849 obs) right son=13 (1330 obs)
  Primary splits:
      num_products < 1.5      to the right, improve=111.762900, (0 missing)
      geography    splits as  LRL,          improve= 49.045960, (0 missing)
      age          < 65.5     to the right, improve= 29.888190, (0 missing)
      balance      < 87460.34 to the left,  improve= 28.420160, (0 missing)
      credit_score < 407.5    to the right, improve=  7.678005, (0 missing)
  Surrogate splits:
      balance < 6229.595 to the left,  agree=0.701, adj=0.233, (0 split)
      age     < 79.5     to the right, agree=0.611, adj=0.001, (0 split)

Node number 7: 128 observations
  predicted class=Yes  expected loss=0.03125  P(node) =0.016
    class counts:     4   124
   probabilities: 0.031 0.969 

Node number 12: 849 observations
  predicted class=No   expected loss=0.1813899  P(node) =0.106125
    class counts:   695   154
   probabilities: 0.819 0.181 

Node number 13: 1330 observations,    complexity param=0.0377397
  predicted class=Yes  expected loss=0.4902256  P(node) =0.16625
    class counts:   652   678
   probabilities: 0.490 0.510 
  left son=26 (921 obs) right son=27 (409 obs)
  Primary splits:
      geography    splits as  LRL,          improve=38.150480, (0 missing)
      age          < 66.5     to the right, improve=33.481090, (0 missing)
      balance      < 46303.52 to the right, improve= 8.509361, (0 missing)
      credit_score < 407.5    to the right, improve= 4.842834, (0 missing)
      tenure       < 4.5      to the right, improve= 4.081744, (0 missing)
  Surrogate splits:
      age < 81.5     to the left,  agree=0.693, adj=0.002, (0 split)

Node number 26: 921 observations,    complexity param=0.0377397
  predicted class=No   expected loss=0.4299674  P(node) =0.115125
    class counts:   525   396
   probabilities: 0.570 0.430 
  left son=52 (650 obs) right son=53 (271 obs)
  Primary splits:
      balance      < 46303.52 to the right, improve=28.798860, (0 missing)
      age          < 62.5     to the right, improve=20.755240, (0 missing)
      tenure       < 4.5      to the right, improve= 4.729138, (0 missing)
      credit_score < 421      to the right, improve= 3.828371, (0 missing)
      geography    splits as  R-L,          improve= 1.853454, (0 missing)
  Surrogate splits:
      credit_score < 444.5    to the right, agree=0.71, adj=0.015, (0 split)

Node number 27: 409 observations
  predicted class=Yes  expected loss=0.3105134  P(node) =0.051125
    class counts:   127   282
   probabilities: 0.311 0.689 

Node number 52: 650 observations
  predicted class=No   expected loss=0.3492308  P(node) =0.08125
    class counts:   423   227
   probabilities: 0.651 0.349 

Node number 53: 271 observations
  predicted class=Yes  expected loss=0.3763838  P(node) =0.033875
    class counts:   102   169
   probabilities: 0.376 0.624 
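
The percentages quoted in 4c are simply the raw importance scores rescaled to sum to 100 and rounded (which is why credit_score, at well under 1%, drops out of the printed table); a minimal sketch:

# Rescale the raw importance scores to percentages, as summary(churn_tree) does
imp <- churn_tree$variable.importance
round(100 * imp / sum(imp))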

5.

Training data

  • Overall model accuracy: (6102 + 671)/8000 = 85%
  • Of all customers the model predicted to churn, it was correct for 671/935 = 72%.
  • Of all customers the model predicted not to churn, it was correct for 6102/7065 = 86%.
  • Of all customers who did churn, the model correctly identified 671/1634 = 41%.
  • Of all customers who did not churn, the model correctly identified 6102/6366 = 96%.
train_probs <- predict(churn_tree, newdata = churn_train, type = 'prob')
train_preds <- predict(churn_tree, newdata = churn_train, type = 'class')

churn_train_updated <- cbind(churn_train, train_probs, train_preds)
head(churn_train_updated)
  customer_id credit_score geography gender age tenure   balance num_products
1    15704442          672    France Female  53      9 169406.33            4
2    15607993          625    France Female  52      2  79468.96            1
3    15635502          443    France   Male  44      2      0.00            1
4    15631912          840    France   Male  30      8 136291.71            1
5    15788539          501    France Female  34      3 107747.57            1
6    15714680          755    France Female  78      5 121206.96            1
  churn        No       Yes train_preds
1   Yes 0.0312500 0.9687500         Yes
2    No 0.6507692 0.3492308          No
3    No 0.3763838 0.6236162         Yes
4    No 0.8954366 0.1045634          No
5    No 0.8954366 0.1045634          No
6    No 0.6507692 0.3492308          No
train_con_mat <- table(churn_train_updated$churn, churn_train_updated$train_preds, dnn=c('Actual', 'Predicted'))
train_con_mat
      Predicted
Actual   No  Yes
   No  6102  264
   Yes  963  671
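
The five percentages above (and the test-set ones below) can be reproduced from the confusion matrix rather than by hand; a small helper sketch (the function name and layout are my own, not part of the assignment):

# Accuracy, correctness among predicted Yes/No, and share of actual Yes/No
# identified, assuming a table laid out as Actual (rows) x Predicted (columns)
con_mat_summary <- function(cm) {
  c(accuracy      = sum(diag(cm)) / sum(cm),
    precision_yes = cm["Yes", "Yes"] / sum(cm[, "Yes"]),
    precision_no  = cm["No",  "No"]  / sum(cm[, "No"]),
    recall_yes    = cm["Yes", "Yes"] / sum(cm["Yes", ]),
    recall_no     = cm["No",  "No"]  / sum(cm["No", ]))
}
round(con_mat_summary(train_con_mat), 2)

The same helper could then be applied to the test-set confusion matrix further down.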

Testing data

  • Overall model accuracy: (1531 + 178)/2000 = 85%
  • Of all customers the model predicted to churn, it was correct for 178/244 = 73%.
  • Of all customers the model predicted not to churn, it was correct for 1531/1756 = 87%.
  • Of all customers who did churn, the model correctly identified 178/403 = 44%.
  • Of all customers who did not churn, the model correctly identified 1531/1597 = 96%.
test_probs <- predict(churn_tree, newdata = churn_test, type = 'prob')
test_preds <- predict(churn_tree, newdata = churn_test, type = 'class')
churn_test_updated <- cbind(churn_test, test_probs, test_preds)
head(churn_test_updated)
  customer_id credit_score geography gender age tenure   balance num_products
1    15812422          637    France   Male  41      2      0.00            2
2    15725511          559    France Female  31      3 127070.73            1
3    15658306          693    France   Male  68      4  97705.99            1
4    15690332          647   Germany   Male  35      3 192407.97            1
5    15580701          712    France   Male  33      3 153819.58            1
6    15755978          606    France   Male  31     10      0.00            2
  churn        No       Yes test_preds
1    No 0.8954366 0.1045634         No
2    No 0.8954366 0.1045634         No
3    No 0.6507692 0.3492308         No
4    No 0.8954366 0.1045634         No
5   Yes 0.8954366 0.1045634         No
6    No 0.8954366 0.1045634         No
test_con_mat <- table(churn_test_updated$churn, churn_test_updated$test_preds, dnn = c('Actual', 'Predicted'))
test_con_mat
      Predicted
Actual   No  Yes
   No  1531   66
   Yes  225  178

a)

There are no signs of overfitting: the test-set accuracy is almost identical to the training-set accuracy, and churn recall even improves slightly, from 41% to 44%. The model remains much better at identifying non-churners than churners.

C)

6.

churn_train$churn <- factor(churn_train$churn, levels = c("Yes", "No"))
churn_test$churn <- factor(churn_test$churn, levels = c("Yes", "No"))
levels(churn_train$churn)
[1] "Yes" "No" 
levels(churn_test$churn)
[1] "Yes" "No" 

7.

churn_lr <- glm(churn ~ credit_score + geography + age + tenure + balance + num_products, data = churn_train, family = binomial(link = 'logit'))
summary(churn_lr)

Call:
glm(formula = churn ~ credit_score + geography + age + tenure + 
    balance + num_products, family = binomial(link = "logit"), 
    data = churn_train)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       3.632e+00  2.551e-01  14.236  < 2e-16 ***
credit_score      8.336e-04  3.047e-04   2.736  0.00623 ** 
geographyGermany -7.935e-01  7.345e-02 -10.803  < 2e-16 ***
geographySpain   -2.550e-02  7.719e-02  -0.330  0.74112    
age              -6.425e-02  2.711e-03 -23.700  < 2e-16 ***
tenure            1.701e-02  1.017e-02   1.673  0.09424 .  
balance          -2.376e-06  5.624e-07  -4.225 2.39e-05 ***
num_products      1.086e-01  5.181e-02   2.096  0.03607 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8099.8  on 7999  degrees of freedom
Residual deviance: 7232.1  on 7992  degrees of freedom
AIC: 7248.1

Number of Fisher Scoring iterations: 4

a)

The omitted (reference) category of the geography dummy variables is France.

b)

ln(pi / (1 - pi)) = 3.632 + 0.00083 × credit_score - 0.793 × geographyGermany - 0.025 × geographySpain - 0.064 × age + 0.017 × tenure - 0.00000237 × balance + 0.109 × num_products, where pi is the modelled probability of not churning (churn = "No").

c)

credit_score, geographyGermany, age, balance and num_products are statistically significant because their p-values are below 0.05.

d)

credit_score, tenure and num_products have positive coefficients, so an increase in these variables raises the modelled log-odds of not churning and therefore lowers the likelihood of churn. geographyGermany, geographySpain, age and balance have negative coefficients, so being located in Germany or Spain rather than France, or an increase in age or balance, raises the likelihood of churn.
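
Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to explain to non-technical stakeholders; a sketch using the fitted model (keeping in mind the odds here are for churn = "No"):

# Multiplicative change in the odds of NOT churning per one-unit increase
round(exp(coef(churn_lr)), 4)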

8.

train_pi <- predict(churn_lr, newdata = churn_train, type = 'response')

churn_train_updated <- churn_train %>%
                            mutate(pi = train_pi) %>%
                            mutate(prediction = case_when(pi > 0.5 ~ 'No', 
                                                          pi <= 0.5 ~ 'Yes'))
churn_train_updated
# A tibble: 8,000 × 11
   customer_id credit_score geography gender   age tenure balance num_products
         <dbl>        <dbl> <chr>     <chr>  <dbl>  <dbl>   <dbl>        <dbl>
 1    15704442          672 France    Female    53      9 169406.            4
 2    15607993          625 France    Female    52      2  79469.            1
 3    15635502          443 France    Male      44      2      0             1
 4    15631912          840 France    Male      30      8 136292.            1
 5    15788539          501 France    Female    34      3 107748.            1
 6    15714680          755 France    Female    78      5 121207.            1
 7    15749905          698 Spain     Female    47      6      0             1
 8    15713949          850 France    Male      40      1  76914.            1
 9    15786454          552 Spain     Male      55      3      0             1
10    15619699          558 France    Male      31      7      0             1
# ℹ 7,990 more rows
# ℹ 3 more variables: churn <fct>, pi <dbl>, prediction <chr>
churn_train_updated$prediction <- factor(churn_train_updated$prediction, levels = c("Yes", "No"))

train_con_mat <- table(churn_train_updated$churn, churn_train_updated$prediction, dnn = c('Actual', 'Predicted'))
train_con_mat
      Predicted
Actual  Yes   No
   Yes  165 1469
   No   229 6137

Training data

  • Overall model accuracy: (6137 + 165)/8000 = 78%
  • Of all customers the model predicted to churn, it was correct for 165/394 = 41%.
  • Of all customers the model predicted not to churn, it was correct for 6137/7606 = 80%.
  • Of all customers who did churn, the model correctly identified 165/1634 = 10%.
  • Of all customers who did not churn, the model correctly identified 6137/6366 = 96%.
test_pi <- predict(churn_lr, newdata = churn_test, type = 'response')

churn_test_updated <- churn_test %>%
  mutate(pi = test_pi) %>%
  mutate(prediction = case_when(pi > 0.5 ~ 'No', 
                                pi <= 0.5 ~ 'Yes'))

churn_test_updated
# A tibble: 2,000 × 11
   customer_id credit_score geography gender   age tenure balance num_products
         <dbl>        <dbl> <chr>     <chr>  <dbl>  <dbl>   <dbl>        <dbl>
 1    15812422          637 France    Male      41      2      0             2
 2    15725511          559 France    Female    31      3 127071.            1
 3    15658306          693 France    Male      68      4  97706.            1
 4    15690332          647 Germany   Male      35      3 192408.            1
 5    15580701          712 France    Male      33      3 153820.            1
 6    15755978          606 France    Male      31     10      0             2
 7    15806403          650 France    Male      37      9      0             2
 8    15672481          641 France    Male      37      6      0             2
 9    15571689          740 France    Female    37      5      0             2
10    15674398          642 France    Male      38      3      0             2
# ℹ 1,990 more rows
# ℹ 3 more variables: churn <fct>, pi <dbl>, prediction <chr>
churn_test_updated$prediction <- factor(churn_test_updated$prediction, levels = c("Yes", "No"))

test_con_mat <- table(churn_test_updated$churn, churn_test_updated$prediction, dnn = c('Actual', 'Predicted'))
test_con_mat
      Predicted
Actual  Yes   No
   Yes   46  357
   No    62 1535

Testing data

  • Overall model accuracy: (1535 + 46)/2000 = 79%
  • Of all customers the model predicted to churn, it was correct for 46/108 = 42%.
  • Of all customers the model predicted not to churn, it was correct for 1535/1892 = 81%.
  • Of all customers who did churn, the model correctly identified 46/403 = 11%.
  • Of all customers who did not churn, the model correctly identified 1535/1597 = 96%.

Both data sets produce almost identical results: the test set is at most one percentage point better on each measure, and the last measure is identical. The model is much better at identifying non-churners than churners; its churn predictions are unreliable, being correct for only about 41% of the customers it flags as churners and catching only about 10% of the customers who actually churn.
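
If catching more of the actual churners mattered most, one could also lower the probability threshold for flagging a churner instead of using 0.5; a purely illustrative sketch (the 0.3 cutoff is my own choice, and pi is the modelled probability of not churning):

# Flag a customer as a churner whenever the predicted churn probability
# (1 - pi) exceeds 0.3 rather than 0.5 - more churners caught, lower precision
churn_test_thresh <- churn_test_updated %>%
  mutate(prediction_30 = if_else(1 - pi > 0.3, "Yes", "No"))
table(churn_test_thresh$churn, churn_test_thresh$prediction_30,
      dnn = c('Actual', 'Predicted'))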

D)

9.

The classification tree is more accurate in terms of overall accuracy (~6 percentage points higher), correctness among predicted churners (~30 points higher), correctness among predicted non-churners (~6 points higher) and the share of actual churners identified (~30 points higher). Only the share of actual non-churners identified is the same in both models, at 96% on both the training and testing data.

10.

I would recommend the classification tree method, as it showed higher accuracy across the board, provides clear, rule-based insights that can drive important decisions, and offers a simpler and more effective visualisation of the findings, which is better suited to presenting to company representatives.

11.

a)

Both analyses indicate that age and number of products are the biggest drivers of customers' decisions to churn: the higher the age and the lower the number of products, the higher the probability of churning. This makes sense, as the easy-saver product is targeted at new customers, especially younger people with lower balances. It also aligns with the part A analysis, which found that older customers with higher balances and fewer products are more likely to churn. Both analyses further showed that the churn rate is highest in Germany.

b)

Data-driven marketing actions

  • Cross-selling - since customers with more products are less likely to churn, focus on complementary products that increase engagement with the company as well as revenue and overall customer retention.
  • Compare the actions taken in Germany with those in France and Spain to find out what works and what does not, and adapt marketing efforts to the local culture.
  • Keep easy-saver targeted at the younger, low-balance demographic, and offer older, higher-balance customers alternative products that fit their needs - premium features and long-term value - before they churn away from the easy-saver product.