Assignment on Data Analytics & Consumer Insights

Author

Md Saydur Rahman sayd

Part A – Exploratory Analysis of the Bank Product Dataset

Question-1

Import the bank_training.csv and bank_testing.csv data sets into R.

First, we import the two files bank_training.csv and bank_testing.csv.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
bank_train <- read.csv("bank_training.csv")
bank_test  <- read.csv("bank_testing.csv")

Question-2(a)

Categorical Variables vs Churn

ggplot(data = bank_train) +
  geom_bar(mapping = aes(x = geography, fill = churn),
           position = "dodge") +
  labs(title = "Customer Churn by Geography",
       x = "Geography",
       y = "Number of Customers")

Interpretation:

This chart compares the number of customers who churned (Yes) with those who did not (No) in each country. France has the highest customer count overall, for both churned and non-churned customers. Germany has far fewer customers than France, but its churn bar is long relative to its non-churn bar, meaning a larger percentage of German customers leave the bank. Spain has the shortest churn bar, meaning the fewest customers leave in that country. Geography therefore appears to matter: churn patterns differ across the three countries, with Germany appearing to be the most at risk.
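Because the bar chart shows counts, a country with many customers can have a tall churn bar even when its churn rate is low. The following is a minimal sketch of converting counts into within-country proportions. The data frame `toy` and its values are made up for illustration and are not taken from bank_training.csv.

```r
# Hypothetical toy data, NOT the bank data: 10 customers in two countries
toy <- data.frame(
  geography = c(rep("France", 6), rep("Germany", 4)),
  churn     = c("No", "No", "No", "No", "No", "Yes",
                "No", "No", "Yes", "Yes")
)

# Counts per country and churn status (what a dodged bar chart displays)
counts <- table(toy$geography, toy$churn)

# Row-wise proportions: share of churners within each country
rates <- prop.table(counts, margin = 1)
churn_rate <- rates[, "Yes"]
print(round(churn_rate, 3))
```

On the real data, a similar view could be obtained graphically by using position = "fill" in geom_bar(), which rescales each bar to proportions.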

Question-2(b)

Number of Products vs Churn

ggplot(data = bank_train) +
  geom_bar(mapping = aes(x = num_products, fill = churn),
           position = "dodge") +
  labs(title = "Churn by Number of Bank Products",
       x = "Number of Products",
       y = "Count of Customers")

Interpretation:

This chart shows how the number of bank products a customer holds relates to churn. Customers with 1 product form the largest group and appear to account for most of the churned customers. Customers with 2 products are less likely to churn, as shown by their much shorter churn bar. Few customers hold 3 or 4 products, so those bars are small and a count chart alone says little about their churn rate. Overall, this suggests that customers with 2 products are less likely to churn than single-product customers, so selling a second product could help reduce churn.

Question-2(c)

Credit Card Ownership vs Churn

ggplot(data = bank_train) +
  geom_bar(mapping = aes(x = has_credit_card, fill = churn),
           position = "dodge") +
  labs(title = "Churn by Credit Card Ownership",
       x = "Has Credit Card",
       y = "Number of Customers")

Interpretation:

This chart compares churn among customers who have a credit card with those who do not. Customers without a credit card appear more likely to churn, as seen in the relatively larger churn bar for that group, while customers who own a credit card show relatively fewer churned customers. This suggests that holding a credit card may raise the switching cost of moving to another bank.

Question-2(d)

Continuous Variables vs Churn

ggplot(data = bank_train) +
  geom_boxplot(mapping = aes(x = churn, y = age, fill = churn)) +
  labs(title = "Age Distribution by Churn Status",
       x = "Churn",
       y = "Age")

Interpretation:

The boxplot above indicates a clear age difference between churners and non-churners. The median age of non-churners is lower, around the mid-30s, and most of them fall in a younger age range. Customers who churned have a higher median age, roughly in the mid-40s, and their distribution is shifted upwards overall.

Question-2(e)

Balance vs Churn

ggplot(data = bank_train) +
  geom_boxplot(mapping = aes(x = churn, y = balance, fill = churn)) +
  labs(title = "Account Balance by Churn Status",
       x = "Churn",
       y = "Account Balance")

Interpretation:

The boxplot also shows that customers who churned tend to have higher account balances than non-churners. The median balance of churned customers is around the €110k mark, compared with roughly €90k for non-churned customers, and the spread is wider for the churned group. This means that customers with more cash in their accounts may be more likely to leave.

Question-2(f)

Estimated Salary vs Churn

ggplot(data = bank_train) +
  geom_histogram(mapping = aes(x = estimated_salary, fill = churn),
                 binwidth = 1190) +
  labs(title = "Estimated Salary Distribution by Churn Status",
       x = "Estimated Salary",
       y = "Frequency")

Interpretation:

The histogram shows that the distribution of estimated salary is very similar for customers who churned and those who did not. Across the salary range, the share of churners stays roughly constant, and the overall shape of the distribution is the same for both Yes and No. This tells us that estimated salary is a relatively weak factor in predicting churn, and that other variables, such as age, number of products, geography or balance, may be more important.

Question-2(g)

Tenure vs Churn

ggplot(data = bank_train) +
  geom_point(mapping = aes(x = tenure, y = balance, color = churn)) +
  labs(title = "Balance vs Tenure by Churn Status",
       x = "Tenure (Years)",
       y = "Account Balance")

Interpretation:

This scatter plot shows the relationship between tenure (years at the bank) and account balance, coloured by churn status. Customers who churn appear across the full range of tenure, meaning tenure is not a strong predictor of churn on its own. Churn is, however, more pronounced among higher-balance clients, particularly in the lower-to-mid part of the tenure scale, suggesting that a sizeable share of customers who accumulate large balances leave early, perhaps to take their money elsewhere.

Part B – Predicting Customers Who Will Churn from a Bank Product

Question-1

Import the bank_training.csv and bank_testing.csv datasets into R.

library(rattle)
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
library(rpart)
library(tidyverse)

bank_train <- read.csv("bank_training.csv")
bank_test  <- read.csv("bank_testing.csv")

Question-1(a)

Create and visualize a classification tree model

Bank_churn_tree <- rpart(churn ~ credit_score + geography + age + tenure + balance + num_products + has_credit_card + estimated_salary, data = bank_train)
fancyRpartPlot(Bank_churn_tree)

Bank_churn_tree$variable.importance
    num_products              age          balance        geography 
     302.2730027      286.5424332       54.8636955       38.1504823 
    credit_score estimated_salary 
       0.9215080        0.8118537 
summary(Bank_churn_tree)
Call:
rpart(formula = churn ~ credit_score + geography + age + tenure + 
    balance + num_products + has_credit_card + estimated_salary, 
    data = bank_train)
  n= 8000 

         CP nsplit rel error   xerror       xstd
1 0.0377397      0  1.000000 1.000000 0.02206797
2 0.0100000      6  0.750918 0.755202 0.01977090

Variable importance
num_products          age      balance    geography 
          44           42            8            6 

Node number 1: 8000 observations,    complexity param=0.0377397
  predicted class=No   expected loss=0.20425  P(node) =1
    class counts:  6366  1634
   probabilities: 0.796 0.204 
  left son=2 (5693 obs) right son=3 (2307 obs)
  Primary splits:
      age          < 42.5     to the left,  improve=286.31750, (0 missing)
      num_products < 2.5      to the left,  improve=228.42510, (0 missing)
      geography    splits as  LRL,          improve= 78.34265, (0 missing)
      balance      < 87554.41 to the left,  improve= 39.40310, (0 missing)
      credit_score < 407.5    to the right, improve= 26.10705, (0 missing)
  Surrogate splits:
      num_products     < 3.5      to the left,  agree=0.714, adj=0.007, (0 split)
      credit_score     < 361      to the right, agree=0.712, adj=0.002, (0 split)
      estimated_salary < 94.01    to the right, agree=0.712, adj=0.000, (0 split)

Node number 2: 5693 observations,    complexity param=0.0377397
  predicted class=No   expected loss=0.1190936  P(node) =0.711625
    class counts:  5015   678
   probabilities: 0.881 0.119 
  left son=4 (5566 obs) right son=5 (127 obs)
  Primary splits:
      num_products < 2.5      to the left,  improve=105.35470, (0 missing)
      age          < 38.5     to the left,  improve= 22.76541, (0 missing)
      geography    splits as  LRL,          improve= 19.53761, (0 missing)
      credit_score < 407.5    to the right, improve= 15.29919, (0 missing)
      balance      < 97664.38 to the left,  improve= 11.94169, (0 missing)

Node number 3: 2307 observations,    complexity param=0.0377397
  predicted class=No   expected loss=0.414391  P(node) =0.288375
    class counts:  1351   956
   probabilities: 0.586 0.414 
  left son=6 (2179 obs) right son=7 (128 obs)
  Primary splits:
      num_products < 2.5      to the left,  improve=83.29378, (0 missing)
      geography    splits as  LRL,          improve=55.33024, (0 missing)
      age          < 65.5     to the right, improve=35.43456, (0 missing)
      balance      < 87372.1  to the left,  improve=31.80820, (0 missing)
      credit_score < 407.5    to the right, improve= 7.58078, (0 missing)

Node number 4: 5566 observations
  predicted class=No   expected loss=0.1045634  P(node) =0.69575
    class counts:  4984   582
   probabilities: 0.895 0.105 

Node number 5: 127 observations
  predicted class=Yes  expected loss=0.2440945  P(node) =0.015875
    class counts:    31    96
   probabilities: 0.244 0.756 

Node number 6: 2179 observations,    complexity param=0.0377397
  predicted class=No   expected loss=0.3818265  P(node) =0.272375
    class counts:  1347   832
   probabilities: 0.618 0.382 
  left son=12 (849 obs) right son=13 (1330 obs)
  Primary splits:
      num_products < 1.5      to the right, improve=111.762900, (0 missing)
      geography    splits as  LRL,          improve= 49.045960, (0 missing)
      age          < 65.5     to the right, improve= 29.888190, (0 missing)
      balance      < 87460.34 to the left,  improve= 28.420160, (0 missing)
      credit_score < 407.5    to the right, improve=  7.678005, (0 missing)
  Surrogate splits:
      balance          < 6229.595 to the left,  agree=0.701, adj=0.233, (0 split)
      estimated_salary < 199442.8 to the right, agree=0.612, adj=0.004, (0 split)
      age              < 79.5     to the right, agree=0.611, adj=0.001, (0 split)

Node number 7: 128 observations
  predicted class=Yes  expected loss=0.03125  P(node) =0.016
    class counts:     4   124
   probabilities: 0.031 0.969 

Node number 12: 849 observations
  predicted class=No   expected loss=0.1813899  P(node) =0.106125
    class counts:   695   154
   probabilities: 0.819 0.181 

Node number 13: 1330 observations,    complexity param=0.0377397
  predicted class=Yes  expected loss=0.4902256  P(node) =0.16625
    class counts:   652   678
   probabilities: 0.490 0.510 
  left son=26 (921 obs) right son=27 (409 obs)
  Primary splits:
      geography        splits as  LRL,          improve=38.150480, (0 missing)
      age              < 66.5     to the right, improve=33.481090, (0 missing)
      balance          < 46303.52 to the right, improve= 8.509361, (0 missing)
      estimated_salary < 144513.9 to the left,  improve= 8.314878, (0 missing)
      credit_score     < 407.5    to the right, improve= 4.842834, (0 missing)
  Surrogate splits:
      estimated_salary < 505.15   to the right, agree=0.694, adj=0.005, (0 split)
      age              < 81.5     to the left,  agree=0.693, adj=0.002, (0 split)

Node number 26: 921 observations,    complexity param=0.0377397
  predicted class=No   expected loss=0.4299674  P(node) =0.115125
    class counts:   525   396
   probabilities: 0.570 0.430 
  left son=52 (650 obs) right son=53 (271 obs)
  Primary splits:
      balance          < 46303.52 to the right, improve=28.798860, (0 missing)
      age              < 62.5     to the right, improve=20.755240, (0 missing)
      estimated_salary < 140483.4 to the left,  improve= 9.660426, (0 missing)
      tenure           < 4.5      to the right, improve= 4.729138, (0 missing)
      credit_score     < 421      to the right, improve= 3.828371, (0 missing)
  Surrogate splits:
      credit_score     < 444.5    to the right, agree=0.710, adj=0.015, (0 split)
      estimated_salary < 722.69   to the right, agree=0.707, adj=0.004, (0 split)

Node number 27: 409 observations
  predicted class=Yes  expected loss=0.3105134  P(node) =0.051125
    class counts:   127   282
   probabilities: 0.311 0.689 

Node number 52: 650 observations
  predicted class=No   expected loss=0.3492308  P(node) =0.08125
    class counts:   423   227
   probabilities: 0.651 0.349 

Node number 53: 271 observations
  predicted class=Yes  expected loss=0.3763838  P(node) =0.033875
    class counts:   102   169
   probabilities: 0.376 0.624 
rpart(formula = churn ~ credit_score + geography + age + tenure + balance + num_products + has_credit_card + estimated_salary, data = bank_train)
n= 8000 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 8000 1634 No (0.7957500 0.2042500)  
   2) age< 42.5 5693  678 No (0.8809064 0.1190936)  
     4) num_products< 2.5 5566  582 No (0.8954366 0.1045634) *
     5) num_products>=2.5 127   31 Yes (0.2440945 0.7559055) *
   3) age>=42.5 2307  956 No (0.5856090 0.4143910)  
     6) num_products< 2.5 2179  832 No (0.6181735 0.3818265)  
      12) num_products>=1.5 849  154 No (0.8186101 0.1813899) *
      13) num_products< 1.5 1330  652 Yes (0.4902256 0.5097744)  
        26) geography=France,Spain 921  396 No (0.5700326 0.4299674)  
          52) balance>=46303.51 650  227 No (0.6507692 0.3492308) *
          53) balance< 46303.51 271  102 Yes (0.3763838 0.6236162) *
        27) geography=Germany 409  127 Yes (0.3105134 0.6894866) *
     7) num_products>=2.5 128    4 Yes (0.0312500 0.9687500) *

Interpretation:

a. Rule for predicting churn (Yes):
One clear rule from the classification tree for predicting churn is: if the customer is aged 42.5 years or older and has 2.5 or more bank products (in practice, 3 or 4 products), the model predicts Churn = Yes. This terminal node is very pure: roughly 96.9% of customers in this node are churners and only about 3.1% are non-churners, so customers in this group are very consistently classified as churners.

b. Rule for predicting non-churn (No):
One clear rule from the tree for predicting that a customer will not churn is: if the customer is younger than 42.5 years and has fewer than 2.5 products (in practice, 1 or 2 products), the model predicts Churn = No. This node is also quite pure: about 89.5% of customers in this group do not churn and around 10.5% churn, showing that the majority of customers meeting this condition are loyal and remain with the bank.

c. Important variables for predicting churn:
The most important variables for predicting churn are number of products (num_products) and age, followed by balance and geography. Age and num_products appear at the top of the tree in the earliest splits, meaning they provide the strongest separation between churners and non-churners, and the model output also shows that they have the highest variable importance values. Balance and geography appear in later splits, indicating that they refine predictions for certain customer sub-groups after the main separation has already been made.
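The full set of rules can be written out as a small function. This is a sketch that transcribes the printed tree by hand; the function name `predict_churn_rule` is a hypothetical helper, the thresholds are copied from the tree output, and it handles one customer at a time. It is not a replacement for predict() on the fitted model.

```r
# Hand-written transcription of the printed tree (hypothetical helper,
# thresholds copied from the tree output above)
predict_churn_rule <- function(age, num_products, geography, balance) {
  if (num_products >= 2.5) return("Yes")     # nodes 5 and 7
  if (age < 42.5) return("No")               # node 4
  if (num_products >= 1.5) return("No")      # node 12
  if (geography == "Germany") return("Yes")  # node 27
  if (balance < 46303.51) "Yes" else "No"    # nodes 53 and 52
}

# Example: a 44-year-old French customer with 1 product and zero balance
example_prediction <- predict_churn_rule(44, 1, "France", 0)
print(example_prediction)
```

The example customer corresponds to the third row of the training data shown later in the report, where the fitted tree also predicts Yes.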

Question-3

Comparison of the results of the visual exploration in Part A with the findings in Part B (2)

From the visual exploration, geography, number of products, age and balance appear to be important factors for predicting churn, because there are noticeable differences between customers who churned and those who did not. The classification tree supports this: these variables are used in the tree's splits and rank highest in Bank_churn_tree$variable.importance, especially num_products and age.

One predictor that looked marginally important in the visual inspection but did not come out as significant in the tree is has_credit_card. The bar chart suggested a slight difference, but the tree did not split on this variable. estimated_salary showed no clear difference in the histogram and also has near-zero importance in the tree. Tenure was also relatively weak in the visual examination and does not appear among the tree's splits.
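To make the importance ranking easier to compare, the raw scores can be rescaled to percentages. The sketch below copies the scores from the Bank_churn_tree$variable.importance output and rescales them to sum to 100, which appears to reproduce the rounded values shown in the summary() output.

```r
# Raw importance scores copied from Bank_churn_tree$variable.importance
importance <- c(num_products     = 302.2730027,
                age              = 286.5424332,
                balance          = 54.8636955,
                geography        = 38.1504823,
                credit_score     = 0.9215080,
                estimated_salary = 0.8118537)

# Rescale so that the scores sum to 100
importance_pct <- 100 * importance / sum(importance)
print(round(importance_pct))
```

After rounding, credit_score and estimated_salary both fall to 0, which is consistent with their absence from the "Variable importance" table in the summary.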

Question-4

Assess the accuracy of the classification tree

Training Data:

train_probs <- predict(Bank_churn_tree, newdata = bank_train, type = "prob")
train_preds <- predict(Bank_churn_tree, newdata = bank_train, type = "class")

Bank_churn_tree_updated <- cbind(bank_train, train_probs, train_preds)
head(Bank_churn_tree_updated)
  customer_id surname credit_score geography gender age tenure   balance
1    15704442 Fleming          672    France Female  53      9 169406.33
2    15607993   Milne          625    France Female  52      2  79468.96
3    15635502   Ch'iu          443    France   Male  44      2      0.00
4    15631912    T'ao          840    France   Male  30      8 136291.71
5    15788539  Foxall          501    France Female  34      3 107747.57
6    15714680 Bianchi          755    France Female  78      5 121206.96
  num_products has_credit_card estimated_salary churn        No       Yes
1            4             Yes        147311.47   Yes 0.0312500 0.9687500
2            1             Yes         84606.03    No 0.6507692 0.3492308
3            1             Yes        159165.70    No 0.3763838 0.6236162
4            1             Yes         54113.38    No 0.8954366 0.1045634
5            1             Yes          9249.36    No 0.8954366 0.1045634
6            1             Yes         76016.49    No 0.6507692 0.3492308
  train_preds
1         Yes
2          No
3         Yes
4          No
5          No
6          No
train_con_mat <- table(Bank_churn_tree_updated$churn,Bank_churn_tree_updated$train_preds, dnn = c("Actual", "Predicted"))
train_con_mat
      Predicted
Actual   No  Yes
   No  6102  264
   Yes  963  671

Interpretation:

The confusion matrix shows the model’s performance on the training data.

The overall accuracy is (6102 + 671) / (6102 + 264 + 963 + 671) = 6773/8000 = 0.8466, which is 84.66%.

Of all customers predicted to churn (Predicted = Yes), the model is correct for 671 / (671 + 264) = 671/935 = 0.7176, so 71.76% of predicted churners actually churned (precision for Yes).

Of all customers predicted not to churn (Predicted = No), the model is correct for 6102 / (6102 + 963) = 6102/7065 = 0.8637, so 86.37% of predicted non-churners actually did not churn.

Among customers who actually churned (Actual = Yes), the model correctly identifies 671 / (671 + 963) = 671/1634 = 0.4106, meaning it catches 41.06% of churners (sensitivity).

Among customers who actually did not churn (Actual = No), it correctly identifies 6102 / (6102 + 264) = 6102/6366 = 0.9585, meaning it correctly classifies 95.85% of non-churners (specificity).
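The hand calculations for the training data can be reproduced in R. This is a minimal sketch that re-enters the confusion matrix counts from the output above by hand; the metric names in `metrics` are labels chosen here for illustration.

```r
# Training confusion matrix, counts copied from the output above
# (matrix() fills column by column: Predicted = No first, then Yes)
train_con_mat <- matrix(c(6102, 963, 264, 671), nrow = 2,
                        dimnames = list(Actual    = c("No", "Yes"),
                                        Predicted = c("No", "Yes")))

tn <- train_con_mat["No", "No"]    # non-churners predicted No
fp <- train_con_mat["No", "Yes"]   # non-churners predicted Yes
fn <- train_con_mat["Yes", "No"]   # churners predicted No
tp <- train_con_mat["Yes", "Yes"]  # churners predicted Yes

metrics <- c(
  accuracy     = (tp + tn) / sum(train_con_mat),
  precision    = tp / (tp + fp),   # correct among Predicted = Yes
  neg_pred_val = tn / (tn + fn),   # correct among Predicted = No
  sensitivity  = tp / (tp + fn),   # churners caught
  specificity  = tn / (tn + fp)    # non-churners correctly kept
)
print(round(metrics, 4))
```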

Testing Data:

test_probs <- predict(Bank_churn_tree, newdata = bank_test, type = "prob")
test_preds <- predict(Bank_churn_tree, newdata = bank_test, type = "class")

Bank_churn_tree_updated_test <- cbind(bank_test, test_probs, test_preds)
head(Bank_churn_tree_updated_test)
  customer_id surname credit_score geography gender age tenure   balance
1    15812422  Ugorji          637    France   Male  41      2      0.00
2    15725511 Wallace          559    France Female  31      3 127070.73
3    15658306      Lo          693    France   Male  68      4  97705.99
4    15690332    Wang          647   Germany   Male  35      3 192407.97
5    15580701      Ma          712    France   Male  33      3 153819.58
6    15755978   Tseng          606    France   Male  31     10      0.00
  num_products has_credit_card estimated_salary churn        No       Yes
1            2              No        102515.42    No 0.8954366 0.1045634
2            1              No        160941.78    No 0.8954366 0.1045634
3            1             Yes         61569.07    No 0.6507692 0.3492308
4            1             Yes         40145.28    No 0.8954366 0.1045634
5            1             Yes         79176.09   Yes 0.8954366 0.1045634
6            2             Yes        195209.40    No 0.8954366 0.1045634
  test_preds
1         No
2         No
3         No
4         No
5         No
6         No
test_con_mat <- table(Bank_churn_tree_updated_test$churn, Bank_churn_tree_updated_test$test_preds, dnn = c("Actual", "Predicted"))
test_con_mat
      Predicted
Actual   No  Yes
   No  1531   66
   Yes  225  178

Interpretation:

The confusion matrix shows the model’s performance on the testing data.

The overall accuracy is (1531 + 178) / (1531 + 66 + 225 + 178) = 1709/2000 = 0.8545, which is 85.45%.

Of all customers predicted to churn (Predicted = Yes), the model is correct for 178 / (178 + 66) = 178/244 = 0.7295, so 72.95% of predicted churners actually churned (precision for Yes).

Of all customers predicted not to churn (Predicted = No), the model is correct for 1531 / (1531 + 225) = 1531/1756 = 0.8719, so 87.19% of predicted non-churners actually did not churn.

Among customers who actually churned (Actual = Yes), the model correctly identifies 178 / (178 + 225) = 178/403 = 0.4417, meaning it correctly flags 44.17% of churners (sensitivity).

Among customers who actually did not churn (Actual = No), it correctly identifies 1531 / (1531 + 66) = 1531/1597 = 0.9587, meaning it correctly identifies 95.87% of non-churners (specificity).

Question-4(a)

Is the tree overfitting?

The tree does not appear to be overfitting, because the testing accuracy (85.45%) is very close to the training accuracy (84.66%); the testing accuracy is even slightly higher. If the tree were overfitting, training accuracy would be much higher than testing accuracy. However, the tree is better at predicting No than Yes: it is not memorizing the training data, but it may be too conservative, and it misses many churners.
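The overfitting check comes down to comparing the same metric on both data sets. A minimal sketch, using the counts from the two confusion matrices reported in Question-4 (re-entered by hand here):

```r
# Accuracy and sensitivity on each data set, from the reported counts
train_acc  <- (6102 + 671) / 8000
test_acc   <- (1531 + 178) / 2000
train_sens <- 671 / (671 + 963)
test_sens  <- 178 / (178 + 225)

# A large positive gap (training much better than testing) would
# point towards overfitting
acc_gap  <- train_acc - test_acc
sens_gap <- train_sens - test_sens
print(round(c(accuracy_gap = acc_gap, sensitivity_gap = sens_gap), 4))
```

Both gaps are slightly negative, so the tree does no worse on unseen data than on the training data, while sensitivity stays below 50% on both sets.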

Question-5

Actions to improve churn + how to use the tree for marketing

Based on this analysis, the company should focus retention actions on the high-risk segments identified by the tree, especially the customer groups defined by age, number of products, geography, and balance. For example, because Germany shows higher churn, the bank could run Germany-specific retention offers. Since churn is related to age and balance, the bank could proactively contact older and high-balance customers with personalised benefits to stop them moving to competitors. Because num_products is the most important predictor, the bank can use the tree rules to identify customers with high-risk product patterns and offer suitable bundles or support; for example, customers with risky product combinations could be offered simplified packages or loyalty rewards. For marketing, the tree can act as a targeting tool: score customers by their path through the tree, send retention campaigns first to the nodes with the highest churn probability, and give lower-risk customers lighter, cheaper engagement.
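One way to turn the tree into a targeting list is to rank its terminal nodes by churn probability. The sketch below re-enters the leaf sizes and churner counts from the summary(Bank_churn_tree) output by hand; the column names are chosen here for illustration.

```r
# Terminal nodes of the tree: size and number of churners (class "Yes"),
# copied from the summary(Bank_churn_tree) output
leaves <- data.frame(
  node     = c(4, 5, 12, 52, 53, 27, 7),
  n        = c(5566, 127, 849, 650, 271, 409, 128),
  churners = c(582, 96, 154, 227, 169, 282, 124)
)

# Churn probability per leaf, then rank from highest to lowest risk
leaves$churn_prob <- leaves$churners / leaves$n
leaves <- leaves[order(-leaves$churn_prob), ]

# Cumulative customers contacted and share of all churners reached
leaves$cum_customers   <- cumsum(leaves$n)
leaves$cum_churn_share <- cumsum(leaves$churners) / sum(leaves$churners)
print(leaves, row.names = FALSE)
```

Contacting only the four highest-risk leaves would cover 935 training customers, the same 935 customers the tree predicts as Yes in the training confusion matrix.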

Part C – Segmenting Bank Customers

Question-1

Import the bank_personal_loan.csv file into R

library(tidyverse)
library(cluster)
  
bank <- read_csv("bank_personal_loan.csv")
Rows: 4948 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (6): id, age, experience, income, cc_avg, personal_loan

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(bank)

Question-2

Cluster the customers based on age, experience, income and cc_avg

bank_cluster <- select(bank, age, experience, income, cc_avg)

Question-3

Does the data need scaling before Euclidean distance?

Yes, the data needs scaling because the variables have different units and ranges. Without scaling, variables with larger values, such as income, would dominate the distance calculation and unfairly drive the clusters. Scaling gives every variable equal weight.
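The effect of scaling can be seen on a small example. This sketch uses a made-up data frame `toy` with two of the clustering variables; the four rows are hypothetical customers, not rows from bank_personal_loan.csv.

```r
# Hypothetical toy customers (NOT from the bank data)
toy <- data.frame(income = c(50, 52, 80, 150),
                  cc_avg = c(1, 9, 1.5, 6))

# Euclidean distances on the raw values: differences in income are
# numerically larger, so they weigh more heavily
raw_dist <- dist(toy)

# scale() centres each column to mean 0 and rescales it to sd 1
toy_scaled  <- scale(toy)
scaled_dist <- dist(toy_scaled)

print(round(raw_dist, 2))
print(round(scaled_dist, 2))
```

After scaling, a one-standard-deviation difference counts the same in either variable, regardless of the original units.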

Question-4

Create the Euclidean distance matrix

dist_matrix <- dist(scale(bank_cluster))

Question-5

Carry out hierarchical clustering

hc <- hclust(dist_matrix)
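
When no method is given, hclust() uses complete linkage. The sketch below runs the same scale, dist, hclust and cutree steps on a made-up data set with two obvious groups, so the result is easy to check by eye. The data frame `toy` is hypothetical and unrelated to the bank data.

```r
# Hypothetical toy data with two well-separated groups of three points
toy <- data.frame(x = c(0, 0.1, 0.2, 10, 10.1, 10.2),
                  y = c(0, 0.2, 0.1, 10, 10.2, 10.1))

# Same pipeline as in the report, with the default linkage made explicit
toy_hc <- hclust(dist(scale(toy)), method = "complete")

# Cut the dendrogram into two clusters
toy_clusters <- cutree(toy_hc, k = 2)
print(toy_clusters)
```

Other linkage choices, such as method = "average" or method = "ward.D2", can give different clusters on real data, so the choice is worth stating.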

Question-6

Create a 3-cluster solution and assess quality

clusters <- cutree(hc, k = 3)
sil <- silhouette(clusters, dist_matrix)
summary(sil)
Silhouette of 4948 units in 3 clusters from silhouette.default(x = clusters, dist = dist_matrix) :
 Cluster sizes and average silhouette widths:
     1970      2657       321 
0.3648958 0.3110975 0.4038487 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.6126  0.2307  0.4270  0.3385  0.5126  0.6083 

Assess the quality:

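The overall average silhouette width is about 0.34, which indicates a moderate clustering structure. Some observations have negative widths (the minimum is -0.61), so a few customers sit closer to a neighbouring cluster than to their own.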
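The overall mean silhouette width is the size-weighted average of the per-cluster widths. A short check, using the cluster sizes and average widths copied from the summary(sil) output:

```r
# Cluster sizes and average silhouette widths from summary(sil)
cluster_size <- c(1970, 2657, 321)
avg_width    <- c(0.3648958, 0.3110975, 0.4038487)

# Size-weighted average across the three clusters
overall_width <- sum(cluster_size * avg_width) / sum(cluster_size)
print(round(overall_width, 4))
```

The smallest cluster has the highest average width, but because it holds only 321 customers it contributes little to the overall figure.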

Question-7

Profile the clusters with tables and graphs

bank <- bank %>%
  mutate(cluster = factor(clusters,
                          levels = c(1, 2, 3),
                          labels = c("C1", "C2", "C3")))

cluster_profile <- bank %>%
  group_by(cluster) %>%
  summarise(
    Cluster_Size   = n(),
    Avg_Age        = mean(age),
    Avg_Experience = mean(experience),
    Avg_Income     = mean(income),
    Avg_CC_Spend   = mean(cc_avg)
  )

cluster_profile
# A tibble: 3 × 6
  cluster Cluster_Size Avg_Age Avg_Experience Avg_Income Avg_CC_Spend
  <fct>          <int>   <dbl>          <dbl>      <dbl>        <dbl>
1 C1              1970    35.2           9.94       71.2         1.59
2 C2              2657    54.3          29.0        65.8         1.71
3 C3               321    37.2          12.3       156.          5.93
# Bar chart of cluster sizes with descriptive segment labels
ggplot(data = cluster_profile) +
  geom_col(mapping = aes(x = cluster, y = Cluster_Size, fill = cluster),
           position = "dodge") +
  scale_x_discrete(labels = c("C1" = "Young_Low_Spend",
                              "C2" = "Older_Stable",
                              "C3" = "High_Income_High_Spend")) +
  scale_fill_discrete(labels = c("Young_Low_Spend",
                                 "Older_Stable",
                                 "High_Income_High_Spend")) +
  labs(title = "Number of Customers in Each Cluster",
       x = "Cluster", y = "Cluster Size")

Interpretation:

The clustering indicates three different customer segments.

Cluster 2 (Older_Stable) is the largest group (n = 2,657) and also the oldest on average (mean age 54.28), with the most years of experience (29.01). These appear to be long-standing, stable clients with a moderate annual income (€65.79k) and moderate monthly credit card spending (€1.71k).

Cluster 1 (Young_Low_Spend) has 1,970 customers who are much younger on average (35.16 years) and less experienced (9.94 years). Their income (€71.18k) is higher than that of Cluster 2, yet they spend the least on credit cards each month (€1.59k), which indicates lower spending behaviour.

The last group, Cluster 3 (High_Income_High_Spend), is the smallest cluster with only 321 customers, around 6.5% of the 4,948 customers. This segment can be considered a target group because it has the highest average income (€156.43k) and the highest average credit card spending (€5.93k) of the three clusters.

Question-8

Short paragraph summarizing each cluster

Cluster C1 Young_Low_Spend:

There are 1,970 clients in this group. They are relatively young (mean age 35.2) with low to moderate experience (mean 9.9 years). They have a moderate income of €71.2k and spend little on credit cards (€1.59k), indicating younger clients with medium incomes and low spending.

Cluster C2 Older_Stable:

This is the largest cluster, with 2,657 customers, and also the oldest group, with an average age of 54.3 years and the longest experience (29 years). Their income is slightly lower (€65.8k) and their credit card spending is moderate (€1.71k), so these are likely older, more conservative customers with consistent but not outstanding spending habits.

Cluster C3 High_Income_High_Spend:

This is the smallest group, with 321 customers, but it has by far the highest income (€156k) and credit card spend (€5.93k). These customers are around their late 30s (mean age 37.2) with average experience of 12.3 years, which indicates high-value users with significant purchasing power and spending potential, although the segment is small.
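The cluster sizes can be converted into shares of the 4,948 customers in the data set. A short sketch, with the sizes copied from the cluster_profile table:

```r
# Cluster sizes from the cluster_profile table
cluster_size <- c(C1 = 1970, C2 = 2657, C3 = 321)

# Share of all customers in each cluster, in percent
share_pct <- 100 * cluster_size / sum(cluster_size)
print(round(share_pct, 1))
```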

Question-9

Which segment is most likely to take a personal loan in the future?

loan_by_cluster <- bank %>%
  group_by(cluster) %>%
  summarise(
    Loan_Rate = mean(personal_loan, na.rm = TRUE)
  ) %>%
  mutate(Loan_Rate_Percent = Loan_Rate * 100)

loan_by_cluster
# A tibble: 3 × 3
  cluster Loan_Rate Loan_Rate_Percent
  <fct>       <dbl>             <dbl>
1 C1         0.0772              7.72
2 C2         0.0772              7.72
3 C3         0.383              38.3 

Interpretation:

Cluster C3 is the segment most likely to take out a personal loan in the future, because it has by far the highest proportion of customers who previously took a personal loan (38.3%, compared with 7.72% in each of the other two clusters). This makes C3 the best target for the marketing campaign.
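To put the loan rates into perspective, they can be combined with the cluster sizes. This sketch re-enters the rounded rates from the loan_by_cluster table, so the customer counts it produces are approximations rather than exact figures.

```r
# Cluster sizes and (rounded) loan rates copied from the tables above
loan <- data.frame(cluster   = c("C1", "C2", "C3"),
                   size      = c(1970, 2657, 321),
                   loan_rate = c(0.0772, 0.0772, 0.383))

# Approximate number of past loan customers per cluster
loan$approx_loan_customers <- round(loan$size * loan$loan_rate)

# How many times higher each rate is than the lowest rate
loan$ratio_to_lowest <- loan$loan_rate / min(loan$loan_rate)
print(loan, row.names = FALSE)
```

C3 has roughly five times the loan rate of the other clusters, but because it is small, the absolute number of likely loan customers is of a similar order to the larger clusters. A campaign aimed at C3 would therefore be more efficient per contact rather than larger in total reach.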