#install.packages("rattle")
library(rattle)
library(rpart)
library(tidyverse)

Data analytics assignment 1 - 2026
Predictive analytics
A)
1.
churn_train <- read_csv("bank_training.csv")
churn_test <- read_csv("bank_testing.csv")
2.
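Before exploring, a quick structure check confirms both files loaded with the expected columns (a minimal sketch; glimpse() comes with the tidyverse loaded above):
# Inspect column types and row counts of both data sets
glimpse(churn_train)
glimpse(churn_test)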
a)
churn_train %>%
count(geography, churn) %>%
group_by(geography) %>%
mutate(percentage = n / sum(n)) %>%
ggplot(aes(x = geography, y = percentage, fill = churn)) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Churn Rate by Geography",
x = "Geography",
y = "Percentage of Customers",
fill = "Churn")Customers from Germany are most likely to churn with over 25% churning, while customers from France and Spain are on just over 13% churn rate.
b)
ggplot(data = churn_train) +
geom_boxplot(mapping = aes(x = churn, y = credit_score),
outlier.shape = 8, outlier.colour = "red") +
labs(
title = "Churn by Credit Score",
x = "Churn",
y = "Credit score")Median is similar across non/churned customers. Churned customers showcase extremely low credit score outliers, but there does not seem to be a big connection betweeen credit_score and churn overall apart from the fact that churn is more likely if the credit score drops under 400.
ggplot(data = churn_train) +
geom_boxplot(mapping = aes(x = churn, y = age),
outlier.shape = 5, outlier.colour = "red") +
labs(
title = "Churn by Age",
x = "Churn",
y = "Age")Customers in their mid to late 40s/early 50s seem to be more likely to churn than customers in their mid 30s/early 40s. Non-churners have a high number of outliers in late 60S to late early 90s. The number of outliers in churners is significantly smaller. This boxplot tells us that there is a connection between rising age and the likelihood of churning.
ggplot(data = churn_train) +
geom_boxplot(mapping = aes(x = churn, y = tenure),
outlier.shape = 8, outlier.colour = "red") +
labs(
title = "Churn by Tenure",
x = "Churn",
y = "Tenure")Tenure does not have a big connection to customer’s churns. The data overlaps heavily, whiskers are the same, median is the same and only a box for churners is slightly larger.
ggplot(data = churn_train) +
geom_boxplot(mapping = aes(x = churn, y = balance),
outlier.shape = 8, outlier.colour = "red") +
labs(
title = "Churn by Balance ",
x = "Churn",
y = "Balance (EUR)")This graph tells us, that churners have slightly higher balance. At least 25% of non-churners have low/0 balance. Churners are overall richer.
ggplot(data = churn_train) +
geom_boxplot(mapping = aes(x = churn, y = num_products),
outlier.shape = 4, outlier.colour = "red") +
labs(
title = "Churn by # Products",
x = "Churn",
y = "# of Products")Customers who churn typically have fewer products than the ones who do not, suggesting that easy-saver product together with another product works well, but not as a stand-alone.
Summary
This analysis tells us that geography, age, balance and number of products are correlated with churn. Germany displays the highest churn rate, while France and Spain sit at a similar, lower level. Older customers are more likely to churn, and churned customers usually have higher balances and fewer products, which is consistent with the easy-saver product being targeted at younger, lower-balance customers. Credit score and tenure show very limited connection to churn, apart from the increased churn associated with very low credit scores.
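These patterns can also be quantified numerically. A quick sketch using the dplyr verbs already loaded with the tidyverse:
# Median of each numeric driver by churn status
churn_train %>%
group_by(churn) %>%
summarise(across(c(credit_score, age, tenure, balance, num_products), median))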
B)
3.
churn_tree <- rpart(churn ~ geography + credit_score + age + tenure + balance + num_products, data = churn_train)
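Before plotting, the fitted tree can be sanity-checked from the console. A small optional step; both calls are standard rpart functions:
# Text view of the splits and the cross-validated complexity table
print(churn_tree)
printcp(churn_tree)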
fancyRpartPlot(churn_tree)
4.
a)
Customers aged 43 or older with 3 or more products (the num_products >= 2.5 branch) are predicted to churn. This node is very pure, with approximately 97% of customers in this segment churning, indicating a strong and reliable rule.
b)
Customers aged 42 or younger with fewer than 3 products (the num_products < 2.5 branch) are predicted not to churn. This node is also quite pure, with approximately 90% of customers in this segment not churning, again indicating a strong and reliable rule.
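Both rules can be sanity-checked by scoring hypothetical customers against the fitted tree. A small sketch; the two rows are made up to match rules a) and b), and the remaining values are arbitrary:
# Hypothetical customers: row 1 matches rule a), row 2 matches rule b)
rule_check <- tibble(
geography = c("France", "France"),
credit_score = c(650, 650),
age = c(50, 35), # 43 or older vs 42 or younger
tenure = c(5, 5),
balance = c(100000, 100000),
num_products = c(3, 2) # 3+ products vs fewer than 3
)
# Row 1 should land in the ~97% churn leaf, row 2 in the ~90% no-churn leaf
predict(churn_tree, newdata = rule_check, type = "prob")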
c)
Number of products - 44%, Age - 42%, Balance - 8%, Geography - 6%
churn_tree$variable.importance
num_products          age      balance    geography credit_score
  302.273003   286.542433    54.863696    38.150482     0.921508
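The percentages quoted above can be reproduced by normalising the raw importance scores stored on the fitted object (a one-line sketch; summary() reports the same values rounded and drops anything below 1%):
# Rescale raw importance scores to the percentages reported by summary()
round(100 * churn_tree$variable.importance / sum(churn_tree$variable.importance))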
summary(churn_tree)
Call:
rpart(formula = churn ~ geography + credit_score + age + tenure +
balance + num_products, data = churn_train)
n= 8000
CP nsplit rel error xerror xstd
1 0.0377397 0 1.000000 1.0000000 0.02206797
2 0.0100000 6 0.750918 0.7594859 0.01981664
Variable importance
num_products age balance geography
44 42 8 6
Node number 1: 8000 observations, complexity param=0.0377397
predicted class=No expected loss=0.20425 P(node) =1
class counts: 6366 1634
probabilities: 0.796 0.204
left son=2 (5693 obs) right son=3 (2307 obs)
Primary splits:
age < 42.5 to the left, improve=286.31750, (0 missing)
num_products < 2.5 to the left, improve=228.42510, (0 missing)
geography splits as LRL, improve= 78.34265, (0 missing)
balance < 87554.41 to the left, improve= 39.40310, (0 missing)
credit_score < 407.5 to the right, improve= 26.10705, (0 missing)
Surrogate splits:
num_products < 3.5 to the left, agree=0.714, adj=0.007, (0 split)
credit_score < 361 to the right, agree=0.712, adj=0.002, (0 split)
Node number 2: 5693 observations, complexity param=0.0377397
predicted class=No expected loss=0.1190936 P(node) =0.711625
class counts: 5015 678
probabilities: 0.881 0.119
left son=4 (5566 obs) right son=5 (127 obs)
Primary splits:
num_products < 2.5 to the left, improve=105.35470, (0 missing)
age < 38.5 to the left, improve= 22.76541, (0 missing)
geography splits as LRL, improve= 19.53761, (0 missing)
credit_score < 407.5 to the right, improve= 15.29919, (0 missing)
balance < 97664.38 to the left, improve= 11.94169, (0 missing)
Node number 3: 2307 observations, complexity param=0.0377397
predicted class=No expected loss=0.414391 P(node) =0.288375
class counts: 1351 956
probabilities: 0.586 0.414
left son=6 (2179 obs) right son=7 (128 obs)
Primary splits:
num_products < 2.5 to the left, improve=83.29378, (0 missing)
geography splits as LRL, improve=55.33024, (0 missing)
age < 65.5 to the right, improve=35.43456, (0 missing)
balance < 87372.1 to the left, improve=31.80820, (0 missing)
credit_score < 407.5 to the right, improve= 7.58078, (0 missing)
Node number 4: 5566 observations
predicted class=No expected loss=0.1045634 P(node) =0.69575
class counts: 4984 582
probabilities: 0.895 0.105
Node number 5: 127 observations
predicted class=Yes expected loss=0.2440945 P(node) =0.015875
class counts: 31 96
probabilities: 0.244 0.756
Node number 6: 2179 observations, complexity param=0.0377397
predicted class=No expected loss=0.3818265 P(node) =0.272375
class counts: 1347 832
probabilities: 0.618 0.382
left son=12 (849 obs) right son=13 (1330 obs)
Primary splits:
num_products < 1.5 to the right, improve=111.762900, (0 missing)
geography splits as LRL, improve= 49.045960, (0 missing)
age < 65.5 to the right, improve= 29.888190, (0 missing)
balance < 87460.34 to the left, improve= 28.420160, (0 missing)
credit_score < 407.5 to the right, improve= 7.678005, (0 missing)
Surrogate splits:
balance < 6229.595 to the left, agree=0.701, adj=0.233, (0 split)
age < 79.5 to the right, agree=0.611, adj=0.001, (0 split)
Node number 7: 128 observations
predicted class=Yes expected loss=0.03125 P(node) =0.016
class counts: 4 124
probabilities: 0.031 0.969
Node number 12: 849 observations
predicted class=No expected loss=0.1813899 P(node) =0.106125
class counts: 695 154
probabilities: 0.819 0.181
Node number 13: 1330 observations, complexity param=0.0377397
predicted class=Yes expected loss=0.4902256 P(node) =0.16625
class counts: 652 678
probabilities: 0.490 0.510
left son=26 (921 obs) right son=27 (409 obs)
Primary splits:
geography splits as LRL, improve=38.150480, (0 missing)
age < 66.5 to the right, improve=33.481090, (0 missing)
balance < 46303.52 to the right, improve= 8.509361, (0 missing)
credit_score < 407.5 to the right, improve= 4.842834, (0 missing)
tenure < 4.5 to the right, improve= 4.081744, (0 missing)
Surrogate splits:
age < 81.5 to the left, agree=0.693, adj=0.002, (0 split)
Node number 26: 921 observations, complexity param=0.0377397
predicted class=No expected loss=0.4299674 P(node) =0.115125
class counts: 525 396
probabilities: 0.570 0.430
left son=52 (650 obs) right son=53 (271 obs)
Primary splits:
balance < 46303.52 to the right, improve=28.798860, (0 missing)
age < 62.5 to the right, improve=20.755240, (0 missing)
tenure < 4.5 to the right, improve= 4.729138, (0 missing)
credit_score < 421 to the right, improve= 3.828371, (0 missing)
geography splits as R-L, improve= 1.853454, (0 missing)
Surrogate splits:
credit_score < 444.5 to the right, agree=0.71, adj=0.015, (0 split)
Node number 27: 409 observations
predicted class=Yes expected loss=0.3105134 P(node) =0.051125
class counts: 127 282
probabilities: 0.311 0.689
Node number 52: 650 observations
predicted class=No expected loss=0.3492308 P(node) =0.08125
class counts: 423 227
probabilities: 0.651 0.349
Node number 53: 271 observations
predicted class=Yes expected loss=0.3763838 P(node) =0.033875
class counts: 102 169
probabilities: 0.376 0.624
5.
Training data
- Overall model accuracy: (6102 + 671)/8000 = 85%
- Of all customers the model predicted to churn, it got 671/935 = 72% correct (precision).
- Of all customers the model predicted not to churn, it got 6102/7065 = 86% correct.
- Of all customers who did churn, the model correctly identified 671/1634 = 41% (recall).
- Of all customers who did not churn, the model correctly identified 6102/6366 = 96%.
train_probs <- predict(churn_tree, newdata = churn_train, type = 'prob')
train_preds <- predict(churn_tree, newdata = churn_train, type = 'class')
churn_train_updated <- cbind(churn_train, train_probs, train_preds)
head(churn_train_updated)
customer_id credit_score geography gender age tenure balance num_products
1 15704442 672 France Female 53 9 169406.33 4
2 15607993 625 France Female 52 2 79468.96 1
3 15635502 443 France Male 44 2 0.00 1
4 15631912 840 France Male 30 8 136291.71 1
5 15788539 501 France Female 34 3 107747.57 1
6 15714680 755 France Female 78 5 121206.96 1
churn No Yes train_preds
1 Yes 0.0312500 0.9687500 Yes
2 No 0.6507692 0.3492308 No
3 No 0.3763838 0.6236162 Yes
4 No 0.8954366 0.1045634 No
5 No 0.8954366 0.1045634 No
6 No 0.6507692 0.3492308 No
train_con_mat <- table(churn_train_updated$churn, churn_train_updated$train_preds, dnn=c('Actual', 'Predicted'))
train_con_mat
Predicted
Actual No Yes
No 6102 264
Yes 963 671
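The headline figures above can be derived directly from the confusion matrix rather than by hand. A small sketch, assuming the Actual/Predicted layout with the "No"/"Yes" dimnames exactly as printed:
# Accuracy, churn precision and churn recall from the confusion matrix
accuracy  <- sum(diag(train_con_mat)) / sum(train_con_mat)              # (6102 + 671)/8000
precision <- train_con_mat["Yes", "Yes"] / sum(train_con_mat[, "Yes"])  # 671/935
recall    <- train_con_mat["Yes", "Yes"] / sum(train_con_mat["Yes", ])  # 671/1634
c(accuracy = accuracy, precision = precision, recall = recall)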
Testing data
- Overall model accuracy: (1531 + 178)/2000 = 85%
- Of all customers the model predicted to churn, it got 178/244 = 73% correct (precision).
- Of all customers the model predicted not to churn, it got 1531/1756 = 87% correct.
- Of all customers who did churn, the model correctly identified 178/403 = 44% (recall).
- Of all customers who did not churn, the model correctly identified 1531/1597 = 96%.
test_probs <- predict(churn_tree, newdata = churn_test, type = 'prob')
test_preds <- predict(churn_tree, newdata = churn_test, type = 'class')
churn_test_updated <- cbind(churn_test, test_probs, test_preds)
head(churn_test_updated)
customer_id credit_score geography gender age tenure balance num_products
1 15812422 637 France Male 41 2 0.00 2
2 15725511 559 France Female 31 3 127070.73 1
3 15658306 693 France Male 68 4 97705.99 1
4 15690332 647 Germany Male 35 3 192407.97 1
5 15580701 712 France Male 33 3 153819.58 1
6 15755978 606 France Male 31 10 0.00 2
churn No Yes test_preds
1 No 0.8954366 0.1045634 No
2 No 0.8954366 0.1045634 No
3 No 0.6507692 0.3492308 No
4 No 0.8954366 0.1045634 No
5 Yes 0.8954366 0.1045634 No
6 No 0.8954366 0.1045634 No
test_con_mat <- table(churn_test_updated$churn, churn_test_updated$test_preds, dnn = c('Actual', 'Predicted'))
test_con_mat
Predicted
Actual No Yes
No 1531 66
Yes 225 178
a)
There are no signs of overfitting: the test accuracy is almost identical to the training accuracy. Churn recall even improves slightly, from 41% to 44%. The model is better at identifying non-churners than churners.
C)
6.
churn_train$churn <- factor(churn_train$churn, levels = c("Yes", "No"))
churn_test$churn <- factor(churn_test$churn, levels = c("Yes", "No"))
levels(churn_train$churn)
[1] "Yes" "No"
levels(churn_test$churn)
[1] "Yes" "No"
With this ordering, "Yes" is the first factor level, so the glm() below models the log-odds of the second level, "No". Positive coefficients therefore correspond to a lower probability of churning.
7.
churn_lr <- glm(churn ~ credit_score + geography + age + tenure + balance + num_products, data = churn_train, family = binomial(link = 'logit'))
summary(churn_lr)
Call:
glm(formula = churn ~ credit_score + geography + age + tenure +
balance + num_products, family = binomial(link = "logit"),
data = churn_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.632e+00 2.551e-01 14.236 < 2e-16 ***
credit_score 8.336e-04 3.047e-04 2.736 0.00623 **
geographyGermany -7.935e-01 7.345e-02 -10.803 < 2e-16 ***
geographySpain -2.550e-02 7.719e-02 -0.330 0.74112
age -6.425e-02 2.711e-03 -23.700 < 2e-16 ***
tenure 1.701e-02 1.017e-02 1.673 0.09424 .
balance -2.376e-06 5.624e-07 -4.225 2.39e-05 ***
num_products 1.086e-01 5.181e-02 2.096 0.03607 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8099.8 on 7999 degrees of freedom
Residual deviance: 7232.1 on 7992 degrees of freedom
AIC: 7248.1
Number of Fisher Scoring iterations: 4
a)
The omitted dummy category of the geography variable was France, which therefore serves as the reference level.
b)
ln(pi/(1 - pi)) = 3.632 + 0.000834*credit_score - 0.794*geographyGermany - 0.026*geographySpain - 0.064*age + 0.017*tenure - 0.00000238*balance + 0.109*num_products
where pi is the modelled probability that churn = "No" (the second factor level, as set up in question 6).
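The equation can be checked against predict() for a single customer. A sketch; the example row is hypothetical, and the unrounded coefficients come from the fitted model:
# Hypothetical customer for verifying the fitted equation
new_cust <- tibble(credit_score = 650, geography = "Germany", age = 45,
tenure = 4, balance = 80000, num_products = 2)
predict(churn_lr, newdata = new_cust, type = "link")     # log-odds that churn = "No"
predict(churn_lr, newdata = new_cust, type = "response") # pi = P(churn = "No")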
c)
credit_score, geographyGermany, age, balance and num_products are significant because their p-values are less than 0.05; geographySpain and tenure are not.
d)
credit_score, tenure and num_products have positive coefficients. Because the model predicts the probability of churn = "No", an increase in these variables decreases the likelihood of churning. Conversely, geographyGermany, geographySpain, age and balance have negative coefficients, so an increase in them (or membership in those geographies rather than France) increases the likelihood of churning.
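One way to make these directions concrete is to exponentiate the coefficients into odds ratios (a short sketch; because the modelled outcome is "No", values below 1 shift the odds towards churning):
# Multiplicative effect of a one-unit increase on the odds of churn = "No"
round(exp(coef(churn_lr)), 4)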
8.
train_pi <- predict(churn_lr, newdata = churn_train, type = 'response')
churn_train_updated <- churn_train %>%
mutate(pi = train_pi) %>%
mutate(prediction = case_when(pi > 0.5 ~ 'No', # pi is P(churn = "No"), the second factor level
pi <= 0.5 ~ 'Yes'))
churn_train_updated
# A tibble: 8,000 × 11
customer_id credit_score geography gender age tenure balance num_products
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 15704442 672 France Female 53 9 169406. 4
2 15607993 625 France Female 52 2 79469. 1
3 15635502 443 France Male 44 2 0 1
4 15631912 840 France Male 30 8 136292. 1
5 15788539 501 France Female 34 3 107748. 1
6 15714680 755 France Female 78 5 121207. 1
7 15749905 698 Spain Female 47 6 0 1
8 15713949 850 France Male 40 1 76914. 1
9 15786454 552 Spain Male 55 3 0 1
10 15619699 558 France Male 31 7 0 1
# ℹ 7,990 more rows
# ℹ 3 more variables: churn <fct>, pi <dbl>, prediction <chr>
churn_train_updated$prediction <- factor(churn_train_updated$prediction, levels = c("Yes", "No"))
train_con_mat <- table(churn_train_updated$churn, churn_train_updated$prediction, dnn = c('Actual', 'Predicted'))
train_con_mat
Predicted
Actual Yes No
Yes 165 1469
No 229 6137
Training data
- Overall model accuracy: (6137 + 165)/8000 = 79%
- Of all customers the model predicted to churn, it got 165/394 = 42% correct (precision).
- Of all customers the model predicted not to churn, it got 6137/7606 = 81% correct.
- Of all customers who did churn, the model correctly identified 165/1634 = 10% (recall).
- Of all customers who did not churn, the model correctly identified 6137/6366 = 96%.
test_pi <- predict(churn_lr, newdata = churn_test, type = 'response')
churn_test_updated <- churn_test %>%
mutate(pi = test_pi) %>%
mutate(prediction = case_when(pi > 0.5 ~ 'No',
pi <= 0.5 ~ 'Yes'))
churn_test_updated
# A tibble: 2,000 × 11
customer_id credit_score geography gender age tenure balance num_products
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 15812422 637 France Male 41 2 0 2
2 15725511 559 France Female 31 3 127071. 1
3 15658306 693 France Male 68 4 97706. 1
4 15690332 647 Germany Male 35 3 192408. 1
5 15580701 712 France Male 33 3 153820. 1
6 15755978 606 France Male 31 10 0 2
7 15806403 650 France Male 37 9 0 2
8 15672481 641 France Male 37 6 0 2
9 15571689 740 France Female 37 5 0 2
10 15674398 642 France Male 38 3 0 2
# ℹ 1,990 more rows
# ℹ 3 more variables: churn <fct>, pi <dbl>, prediction <chr>
churn_test_updated$prediction <- factor(churn_test_updated$prediction, levels = c("Yes", "No"))
test_con_mat <- table(churn_test_updated$churn, churn_test_updated$prediction, dnn = c('Actual', 'Predicted'))
test_con_mat
Predicted
Actual Yes No
Yes 46 357
No 62 1535
Testing data
- Overall model accuracy: (1535 + 46)/2000 = 79%
- Of all customers the model predicted to churn, it got 46/108 = 43% correct (precision).
- Of all customers the model predicted not to churn, it got 1535/1892 = 81% correct.
- Of all customers who did churn, the model correctly identified 46/403 = 11% (recall).
- Of all customers who did not churn, the model correctly identified 1535/1597 = 96%.
Both data sets produce almost identical results, so there is no sign of overfitting. However, the model is far better at identifying non-churners than churners: only 42-43% of its churn predictions are correct, and it finds just 10-11% of the customers who actually churn, which makes its churn predictions unreliable.
D)
9.
The classification tree is more accurate in terms of overall accuracy (~6 percentage points higher: 85% vs 79%), churn precision (~30 points higher: 72-73% vs 42-43%), non-churn precision (~6 points higher: 86-87% vs 81%) and churn recall (~30 points higher: 41-44% vs 10-11%). Only the rate of correctly identified actual non-churners is identical across both models, at 96% on both training and testing data.
10.
I would recommend using the classification tree method: it showed higher accuracy across the board, provides clear rule-based insights that can drive business decisions, and offers a simpler, more intuitive visualisation of the findings, which is better suited to presenting to company representatives.
11.
a)
Both analyses tell us that age and number of products are the biggest drivers of a customer's decision to churn: the higher the age and the lower the number of products, the higher the probability of churning. This makes sense, as the easy-saver product is an offer targeted at new customers, especially young people with lower balances. It aligns with the part A analysis, which found that older customers with higher balances and fewer products are more likely to churn. Both analyses also revealed that the churn rate is highest in Germany.
b)
Data-driven marketing actions
- Cross-selling: since customers with more products are less likely to churn, focus on complementary products that increase engagement with the company as well as revenue and overall retention.
- Compare actions taken in Germany with those in France and Spain to find out what works and what does not, and adapt marketing efforts to the local culture.
- Keep the easy-saver's target audience as the younger, low-balance demographic, and offer alternative products to older, higher-balance customers at risk of churning from easy-saver, with features that fit their needs such as premium services and long-term value.