knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  comment = "#>"
)

Introduction

This dataset has information contains about bank customers and their churn status, which indicates whether they have exited the bank or not. This project will exploring and analyzing factors influencing customer churn in banking institutions and for building predictive models to identify customers at risk of churning. This dataset collected from Kaggle (https://www.kaggle.com/datasets/saurabhbadole/bank-customer-churn-prediction-dataset)

Import Library

source("setup.R")

Data Preparation

Load Data

df <- read.csv("data_input/Churn_Modelling.csv")
head(df)

#>   RowNumber CustomerId  Surname CreditScore Geography Gender Age Tenure
#> 1         1   15634602 Hargrave         619    France Female  42      2
#> 2         2   15647311     Hill         608     Spain Female  41      1
#> 3         3   15619304     Onio         502    France Female  42      8
#> 4         4   15701354     Boni         699    France Female  39      1
#> 5         5   15737888 Mitchell         850     Spain Female  43      2
#> 6         6   15574012      Chu         645     Spain   Male  44      8
#>     Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
#> 1      0.00             1         1              1       101348.88      1
#> 2  83807.86             1         0              1       112542.58      0
#> 3 159660.80             3         1              0       113931.57      1
#> 4      0.00             2         0              0        93826.63      0
#> 5 125510.82             1         1              1        79084.10      0
#> 6 113755.78             2         1              0       149756.71      1

Column Description

RowNumber : The sequential number assigned to each row in the dataset.
CustomerId : A unique identifier for each customer.
Surname : The surname of the customer.
CreditScore : The credit score of the customer.
Geography : The geographical location of the customer (e.g., country or region).
Gender : The gender of the customer.
Age : The age of the customer.
Tenure : The number of years the customer has been with the bank.
Balance : The account balance of the customer.
NumOfProducts : The number of bank products the customer has.
HasCrCard : Indicates whether the customer has a credit card (binary: yes/no).
IsActiveMember : Indicates whether the customer is an active member (binary: yes/no).
EstimatedSalary : The estimated salary of the customer.
Exited (Target) : Indicates whether the customer has exited the bank (binary: yes (1) exited /no (0) : not exited).

Data Cleansing

#Check General Data Information
glimpse(df)

#> Rows: 10,000
#> Columns: 14
#> $ RowNumber       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
#> $ CustomerId      <int> 15634602, 15647311, 15619304, 15701354, 15737888, 1557…
#> $ Surname         <chr> "Hargrave", "Hill", "Onio", "Boni", "Mitchell", "Chu",…
#> $ CreditScore     <int> 619, 608, 502, 699, 850, 645, 822, 376, 501, 684, 528,…
#> $ Geography       <chr> "France", "Spain", "France", "France", "Spain", "Spain…
#> $ Gender          <chr> "Female", "Female", "Female", "Female", "Female", "Mal…
#> $ Age             <int> 42, 41, 42, 39, 43, 44, 50, 29, 44, 27, 31, 24, 34, 25…
#> $ Tenure          <int> 2, 1, 8, 1, 2, 8, 7, 4, 4, 2, 6, 3, 10, 5, 7, 3, 1, 9,…
#> $ Balance         <dbl> 0.00, 83807.86, 159660.80, 0.00, 125510.82, 113755.78,…
#> $ NumOfProducts   <int> 1, 1, 3, 2, 1, 2, 2, 4, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, …
#> $ HasCrCard       <int> 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …
#> $ IsActiveMember  <int> 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, …
#> $ EstimatedSalary <dbl> 101348.88, 112542.58, 113931.57, 93826.63, 79084.10, 1…
#> $ Exited          <int> 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …

From the dataset above, the data has 14 columns, 10,000 rows and the data types for each column. Checking the data types is a crucial step due to the data types must be appropriate for analysis.

# Select column will be drop
df_clean <- df %>% 
  select(-c(RowNumber,CustomerId,Surname))
# change data type
df_clean <- df_clean %>% 
  mutate(Geography = as.factor(Geography),
         Gender = as.factor(Gender)) 
head(df_clean)

#>   CreditScore Geography Gender Age Tenure   Balance NumOfProducts HasCrCard
#> 1         619    France Female  42      2      0.00             1         1
#> 2         608     Spain Female  41      1  83807.86             1         0
#> 3         502    France Female  42      8 159660.80             3         1
#> 4         699    France Female  39      1      0.00             2         0
#> 5         850     Spain Female  43      2 125510.82             1         1
#> 6         645     Spain   Male  44      8 113755.78             2         1
#>   IsActiveMember EstimatedSalary Exited
#> 1              1       101348.88      1
#> 2              1       112542.58      0
#> 3              0       113931.57      1
#> 4              0        93826.63      0
#> 5              1        79084.10      0
#> 6              0       149756.71      1

# Check missing value
colSums(is.na(df_clean))

#>     CreditScore       Geography          Gender             Age          Tenure 
#>               0               0               0               0               0 
#>         Balance   NumOfProducts       HasCrCard  IsActiveMember EstimatedSalary 
#>               0               0               0               0               0 
#>          Exited 
#>               0

In the dataset above, has no missing value data in any columns.

# check data distribution
describe(df_clean)

#>            variable   class count missing_rate unique_count identical_rate
#>              <char>  <char> <int>        <num>        <int>          <num>
#>  1:     CreditScore integer 10000            0          460         0.0233
#>  2:       Geography  factor 10000            0            3         0.5014
#>  3:          Gender  factor 10000            0            2         0.5457
#>  4:             Age integer 10000            0           70         0.0478
#>  5:          Tenure integer 10000            0           11         0.1048
#>  6:         Balance numeric 10000            0         6382         0.3617
#>  7:   NumOfProducts integer 10000            0            4         0.5084
#>  8:       HasCrCard integer 10000            0            2         0.7055
#>  9:  IsActiveMember integer 10000            0            2         0.5151
#> 10: EstimatedSalary numeric 10000            0         9999         0.0002
#> 11:          Exited integer 10000            0            2         0.7963
#>        min      p25       p50      p75      max        mean         sd     cv
#>      <num>    <num>     <num>    <num>    <num>       <num>      <num>  <num>
#>  1: 350.00   584.00    652.00    718.0    850.0    650.5288    96.6533 0.1486
#>  2:     NA       NA        NA       NA       NA          NA         NA     NA
#>  3:     NA       NA        NA       NA       NA          NA         NA     NA
#>  4:  18.00    32.00     37.00     44.0     92.0     38.9218    10.4878 0.2695
#>  5:   0.00     3.00      5.00      7.0     10.0      5.0128     2.8922 0.5770
#>  6:   0.00     0.00  97198.54 127644.2 250898.1  76485.8893 62397.4052 0.8158
#>  7:   1.00     1.00      1.00      2.0      4.0      1.5302     0.5817 0.3801
#>  8:   0.00     0.00      1.00      1.0      1.0      0.7055     0.4558 0.6461
#>  9:   0.00     0.00      1.00      1.0      1.0      0.5151     0.4998 0.9703
#> 10:  11.58 51002.11 100193.91 149388.2 199992.5 100090.2399 57510.4928 0.5746
#> 11:   0.00     0.00      0.00      0.0      1.0      0.2037     0.4028 1.9773

#check class variable target
table(df_clean$Exited) %>% 
  prop.table()

#> 
#>      0      1 
#> 0.7963 0.2037

We can consider our target has balance data even though the class proportion is 80:20

Data Prepocessing

Cross Validation

RNGkind(sample.kind= "Rounding")
set.seed(123) # randomizing data

# To create a binary split of data into training and testing sets with a ratio of 75:25 (75% for training and 25% for testing)
index <- initial_split(data = df_clean,
                          prop = 0.75)

# splitting
train <- training(index)
test <- testing(index)

prop.table(table(train$Exited))

#> 
#>      0      1 
#> 0.7976 0.2024

Initial Characteristic Analysis

Weight of Evidence (WoE) or Fine Classing

The fuction of WOE is to binning/categorize all predictor variables in the data, then calculate the strength of each category’s result in separating positive and negative classes. This step will separate several outlier values, missing values, and categories with low frequencies. Additionally, it can help users understand the risk behavior of customers.

binning <- woebin(dt = train,
                  y = 'Exited',
                  positive = 0)

#> ✔ Binning on 7500 rows and 11 columns in 00:00:13

binning

#> $CreditScore
#>       variable        bin count count_distr   neg   pos   posprob         woe
#>         <char>     <char> <int>       <num> <int> <int>     <num>       <num>
#> 1: CreditScore [-Inf,490)   378  0.05040000    97   281 0.7433862 -0.30771759
#> 2: CreditScore  [490,570)  1196  0.15946667   253   943 0.7884615 -0.05568449
#> 3: CreditScore  [570,610)   944  0.12586667   182   762 0.8072034  0.06057859
#> 4: CreditScore  [610,630)   539  0.07186667   127   412 0.7643785 -0.19452502
#> 5: CreditScore [630, Inf)  4443  0.59240000   859  3584 0.8066622  0.05710457
#>          bin_iv   total_iv breaks is_special_values
#>           <num>      <num> <char>            <lgcl>
#> 1: 0.0052083086 0.01094046    490             FALSE
#> 2: 0.0005026684 0.01094046    570             FALSE
#> 3: 0.0004535851 0.01094046    610             FALSE
#> 4: 0.0028769137 0.01094046    630             FALSE
#> 5: 0.0018989881 0.01094046    Inf             FALSE
#> 
#> $Geography
#>     variable     bin count count_distr   neg   pos   posprob        woe
#>       <char>  <char> <int>       <num> <int> <int>     <num>      <num>
#> 1: Geography  France  3765   0.5020000   606  3159 0.8390438  0.2797695
#> 2: Geography Germany  1888   0.2517333   607  1281 0.6784958 -0.6244938
#> 3: Geography   Spain  1847   0.2462667   305  1542 0.8348674  0.2491625
#>        bin_iv total_iv  breaks is_special_values
#>         <num>    <num>  <char>            <lgcl>
#> 1: 0.03605523 0.166205  France             FALSE
#> 2: 0.11598462 0.166205 Germany             FALSE
#> 3: 0.01416515 0.166205   Spain             FALSE
#> 
#> $Gender
#>    variable    bin count count_distr   neg   pos   posprob        woe
#>      <char> <char> <int>       <num> <int> <int>     <num>      <num>
#> 1:   Gender Female  3423      0.4564   849  2574 0.7519720 -0.2622041
#> 2:   Gender   Male  4077      0.5436   669  3408 0.8359088  0.2567355
#>        bin_iv total_iv breaks is_special_values
#>         <num>    <num> <char>            <lgcl>
#> 1: 0.03382371 0.066942 Female             FALSE
#> 2: 0.03311829 0.066942   Male             FALSE
#> 
#> $Age
#>    variable       bin count count_distr   neg   pos   posprob        woe
#>      <char>    <char> <int>       <num> <int> <int>     <num>      <num>
#> 1:      Age [-Inf,35)  2737   0.3649333   227  2510 0.9170625  1.0317267
#> 2:      Age   [35,42)  2369   0.3158667   353  2016 0.8509920  0.3710413
#> 3:      Age   [42,46)   833   0.1110667   246   587 0.7046819 -0.5016680
#> 4:      Age [46, Inf)  1561   0.2081333   692   869 0.5566944 -1.1436041
#>        bin_iv total_iv breaks is_special_values
#>         <num>    <num> <char>            <lgcl>
#> 1: 0.27862115  0.70465     35             FALSE
#> 2: 0.03876202  0.70465     42             FALSE
#> 3: 0.03207044  0.70465     46             FALSE
#> 4: 0.35519639  0.70465    Inf             FALSE
#> 
#> $Tenure
#>    variable      bin count count_distr   neg   pos   posprob         woe
#>      <char>   <char> <int>       <num> <int> <int>     <num>       <num>
#> 1:   Tenure [-Inf,2)  1084  0.14453333   243   841 0.7758303 -0.12983106
#> 2:   Tenure    [2,3)   787  0.10493333   147   640 0.8132147  0.09967431
#> 3:   Tenure    [3,6)  2256  0.30080000   473  1783 0.7903369 -0.04440405
#> 4:   Tenure    [6,7)   722  0.09626667   132   590 0.8171745  0.12595933
#> 5:   Tenure    [7,8)   773  0.10306667   129   644 0.8331177  0.23652504
#> 6:   Tenure    [8,9)   771  0.10280000   150   621 0.8054475  0.04933451
#> 7:   Tenure [9, Inf)  1107  0.14760000   244   863 0.7795845 -0.10811482
#>          bin_iv  total_iv breaks is_special_values
#>           <num>     <num> <char>            <lgcl>
#> 1: 0.0025304879 0.0130042      2             FALSE
#> 2: 0.0010116628 0.0130042      3             FALSE
#> 3: 0.0006009357 0.0130042      6             FALSE
#> 4: 0.0014702854 0.0130042      7             FALSE
#> 5: 0.0053634572 0.0130042      8             FALSE
#> 6: 0.0002465346 0.0130042      9             FALSE
#> 7: 0.0017808328 0.0130042    Inf             FALSE
#> 
#> $Balance
#>    variable           bin count count_distr   neg   pos   posprob         woe
#>      <char>        <char> <int>       <num> <int> <int>     <num>       <num>
#> 1:  Balance   [-Inf,5000)  2704   0.3605333   362  2342 0.8661243  0.49575505
#> 2:  Balance [5000,105000)  1462   0.1949333   304  1158 0.7920657 -0.03393932
#> 3:  Balance [105000, Inf)  3334   0.4445333   852  2482 0.7444511 -0.30212784
#>          bin_iv  total_iv breaks is_special_values
#>           <num>     <num> <char>            <lgcl>
#> 1: 0.0758684610 0.1203127   5000             FALSE
#> 2: 0.0002268084 0.1203127 105000             FALSE
#> 3: 0.0442174434 0.1203127    Inf             FALSE
#> 
#> $NumOfProducts
#>         variable      bin count count_distr   neg   pos   posprob        woe
#>           <char>   <char> <int>       <num> <int> <int>     <num>      <num>
#> 1: NumOfProducts [-Inf,2)  3806   0.5074667  1043  2763 0.7259590 -0.3971454
#> 2: NumOfProducts [2, Inf)  3694   0.4925333   475  3219 0.8714131  0.5421499
#>        bin_iv  total_iv breaks is_special_values
#>         <num>     <num> <char>            <lgcl>
#> 1: 0.08943819 0.2115318      2             FALSE
#> 2: 0.12209359 0.2115318    Inf             FALSE
#> 
#> $HasCrCard
#>     variable      bin count count_distr   neg   pos   posprob          woe
#>       <char>   <char> <int>       <num> <int> <int>     <num>        <num>
#> 1: HasCrCard [-Inf,1)  2204   0.2938667   453  1751 0.7944646 -0.019311074
#> 2: HasCrCard [1, Inf)  5296   0.7061333  1065  4231 0.7989048  0.008102291
#>          bin_iv     total_iv breaks is_special_values
#>           <num>        <num> <char>            <lgcl>
#> 1: 1.102180e-04 0.0001564619      1             FALSE
#> 2: 4.624387e-05 0.0001564619    Inf             FALSE
#> 
#> $IsActiveMember
#>          variable      bin count count_distr   neg   pos   posprob        woe
#>            <char>   <char> <int>       <num> <int> <int>     <num>      <num>
#> 1: IsActiveMember [-Inf,1)  3657      0.4876   976  2681 0.7331146 -0.3608787
#> 2: IsActiveMember [1, Inf)  3843      0.5124   542  3301 0.8589644  0.4353534
#>        bin_iv  total_iv breaks is_special_values
#>         <num>     <num> <char>            <lgcl>
#> 1: 0.07028957 0.1550848      1             FALSE
#> 2: 0.08479526 0.1550848    Inf             FALSE
#> 
#> $EstimatedSalary
#>           variable            bin count count_distr   neg   pos   posprob
#>             <char>         <char> <int>       <num> <int> <int>     <num>
#> 1: EstimatedSalary   [-Inf,20000)   725  0.09666667   134   591 0.8151724
#> 2: EstimatedSalary  [20000,35000)   559  0.07453333   125   434 0.7763864
#> 3: EstimatedSalary  [35000,60000)   948  0.12640000   168   780 0.8227848
#> 4: EstimatedSalary  [60000,75000)   566  0.07546667   131   435 0.7685512
#> 5: EstimatedSalary  [75000,85000)   391  0.05213333    65   326 0.8337596
#> 6: EstimatedSalary [85000,165000)  3009  0.40120000   611  2398 0.7969425
#> 7: EstimatedSalary  [165000, Inf)  1302  0.17360000   284  1018 0.7818740
#>             woe       bin_iv   total_iv breaks is_special_values
#>           <num>        <num>      <num> <char>            <lgcl>
#> 1:  0.112614936 1.184973e-03 0.01240834  20000             FALSE
#> 2: -0.126630484 1.240245e-03 0.01240834  35000             FALSE
#> 3:  0.163968659 3.233337e-03 0.01240834  60000             FALSE
#> 4: -0.171212573 2.324999e-03 0.01240834  75000             FALSE
#> 5:  0.241148830 2.815973e-03 0.01240834  85000             FALSE
#> 6: -0.004067905 6.647035e-06 0.01240834 165000             FALSE
#> 7: -0.094740322 1.602161e-03 0.01240834    Inf             FALSE

💡 Insight:

Customers with EstimatedSalary between 20000- 35000 & 60000 - 75000 has negative value for WOE, the meaning they are potentially going to exit from bank.
Bank customers from the branch in Germany are potentially included in the Exited category.
Female bank customers are potentially included in the Exited category.
Bank customers aged over 40 are potentially included in the Exited category

Change dataframe in to WOE

# data train
train_woe <- woebin_ply(dt = train,
                        bins = binning)

#> ✔ Woe transformating on 7500 rows and 10 columns in 00:00:02

train_woe

# data test
test_woe <- woebin_ply(dt = test,
                       bins = binning)

#> ✔ Woe transformating on 2500 rows and 10 columns in 00:00:02

test_woe

Information Value (IV)

# your code here
iv(dt = train_woe,
   y = 'Exited',
   positive = 0)

#>                variable   info_value
#>                  <char>        <num>
#>  1:             Age_woe 0.7046500005
#>  2:   NumOfProducts_woe 0.2115317727
#>  3:       Geography_woe 0.1662050017
#>  4:  IsActiveMember_woe 0.1550848364
#>  5:         Balance_woe 0.1203127129
#>  6:          Gender_woe 0.0669420000
#>  7:          Tenure_woe 0.0130041965
#>  8: EstimatedSalary_woe 0.0124083350
#>  9:     CreditScore_woe 0.0109404638
#> 10:       HasCrCard_woe 0.0001564619

According to Siddiqi, Naeem, the IV score can be categorized into the following values:

IV below 0.02 -> unpredictive
IV between 0.02 - 0.1 -> weak
IV between 0.1 and 0.3 categorized as medium
IV above 0.3 categorized as strong

The IV values we have obtained can be used for feature elimination, and the variables to be eliminated are as follows: Tenure_woe, EstimatedSalary_woe, CreditScore_woe, and HasCrCard_woe

# hasil data setelah feature elimination
train_woe_final <- train_woe %>% 
  select(-c(Tenure_woe,EstimatedSalary_woe,CreditScore_woe,HasCrCard_woe))
  
test_woe_final <- test_woe %>%
   select(-c(Tenure_woe,EstimatedSalary_woe,CreditScore_woe,HasCrCard_woe))

Logical Trend and Business Consideration / Coarse Classing

After successfully generating and applying WOE binning to the train and test data, We need visualization to get onother perspective for easier to present.

# logical trend plot
plot <- woebin_plot(bins = binning)
plot

#> $CreditScore

#> 
#> $Geography

#> 
#> $Gender

#> 
#> $Age

#> 
#> $Tenure

#> 
#> $Balance

#> 
#> $NumOfProducts

#> 
#> $HasCrCard

#> 
#> $IsActiveMember

#> 
#> $EstimatedSalary

💡 Insight:

The blue line represents the relationship between each binning, indicating the positive probability.
The visualization of EstimatedSalary indicates that as the value increases, the positive probability decreases. This means more customers ‘Not Exited’.

Modeling

Logistic Regression Model

Logistic regression is a classification method where the concept involves calculating the probability of a binary outcome for each target class.

model <- glm(formula = Exited ~.,
             data = train_woe_final,
             family = "binomial")

summary(model)

#> 
#> Call:
#> glm(formula = Exited ~ ., family = "binomial", data = train_woe_final)
#> 
#> Coefficients:
#>                    Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)        -1.37120    0.03371 -40.682   <2e-16 ***
#> Geography_woe      -0.90771    0.08637 -10.510   <2e-16 ***
#> Gender_woe         -1.04798    0.12581  -8.330   <2e-16 ***
#> Age_woe            -1.07561    0.03937 -27.321   <2e-16 ***
#> Balance_woe        -0.19023    0.11330  -1.679   0.0932 .  
#> NumOfProducts_woe  -0.94774    0.07610 -12.454   <2e-16 ***
#> IsActiveMember_woe -1.29030    0.08488 -15.202   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 7555.7  on 7499  degrees of freedom
#> Residual deviance: 6009.5  on 7493  degrees of freedom
#> AIC: 6023.5
#> 
#> Number of Fisher Scoring iterations: 5

💡 Interpretation model:

Variables of Balance_woe is not significant to our model.

Model Assumptions - Multicollinearity

Multicollinearity: The predictors are not highly correlated with each other (greater than +- 0.9).

Use the vif() test from the car library. A VIF (Variance Inflation Factor) value < 10 indicates that the variable does not have multicollinearity with other variables.

vif(model)

#>      Geography_woe         Gender_woe            Age_woe        Balance_woe 
#>           1.249178           1.005763           1.052761           1.361607 
#>  NumOfProducts_woe IsActiveMember_woe 
#>           1.129329           1.047355

💡 Insight: There is no multicollinearity in this data.

Prediction

After obtaining the model, predictions can be made on the test data that has been transformed into WOE (Weight of Evidence).

# Performing predictions on the test data
test_woe_final$pred_risk <- predict(object = model,
                                    newdata = test_woe_final,
                                    type = "response")
test_woe_final$pred_risk %>% head()

#> [1] 0.13004477 0.13102348 0.08527977 0.06897270 0.05257366 0.17481595

Evaluation Model

# Creating a list for the predicted risk outcomes of each data point in both the training and test datasets
list_pred <- list(test = test_woe_final$pred_risk)
# Creating label lists for the training and test datasets
list_label <- list(test = test_woe_final$Exited)
# Using the function perf_eva to perform evaluation.
perf_eva(pred = list_pred,
         label = list_label,
         confusion_matrix = TRUE,
         threshold = 0.5,
         show_plot = c("ks", "roc"))

#> $binomial_metric
#> $binomial_metric$test
#>          MSE      RMSE   LogLoss       R2        KS       AUC      Gini
#>        <num>     <num>     <num>    <num>     <num>     <num>     <num>
#> 1: 0.1218619 0.3490873 0.3908244 0.259208 0.5024097 0.8165997 0.6331994
#> 
#> 
#> $confusion_matrix
#> $confusion_matrix$test
#>     label pred_0 pred_1      error
#>    <char>  <num>  <num>      <num>
#> 1:      0   1898     83 0.04189803
#> 2:      1    333    186 0.64161850
#> 3:  total   2231    269 0.16640000
#> 
#> 
#> $pic
#> TableGrob (1 x 2) "arrange": 2 grobs
#>   z     cells    name           grob
#> 1 1 (1-1,1-1) arrange gtable[layout]
#> 2 2 (1-1,2-2) arrange gtable[layout]

KS Statistics

KS (Kolmogorov-Smirnov) Statistics is a statistical method used to measure the difference between two empirical cumulative distributions. It is often used to compare score distributions between two groups, for example, in the analysis of bank customer behavior models (distinguishing between good and bad customers).

A high KS value indicates that the model is good to classify between the two groups (for example, good vs bad customers).

Typically, a KS value > 0.4 is considered good in credit scoring.

💡 Interpretation:

KS value in this model is 0,5024 > 0,4 thats mean the Logistic Regression model able to effectively classify between the Exited (1) and not Exited (0) classes.

AUC Score

AUC (Area Under Curve) is the area under the ROC curve. As the AUC value approaches 1, the model becomes better at separating positive and negative classes.

AUC = 1 indicates the model has ideal performance.
AUC = 0.5 indicates the model is no better than random guessing in predicting defaults.

💡 Interpretation:

The model has an AUC of 0.8166, indicating it can distinguish between Exited and Not Exited classes with an accuracy of 81.6% compared to actual Exited/not-Exited cases.

Scorecard Creation

In logistic regression models, the output is typically in the form of probabilities predicting whether a customer is an Exited Customer or not exited customers. To facilitate decision-making, the binning results can be transformed into scores for each bin, which are then accumulated into a scorecard.

Make a scorecard using universal odds0 1/10 and points0 600. Odds 1/10 means that we want to set in 10 people that positive (not Exited), 1 person of them is negative (Exited).

# membentuk scorecard
score_card <- scorecard(bins = binning, 
                        model = model,
                        odds0 = 1/10,
                        points0 = 600,
                        pdo = 20)
score_card

#> $basepoints
#>      variable    bin    woe points
#>        <char> <lgcl> <lgcl>  <num>
#> 1: basepoints     NA     NA    573
#> 
#> $Geography
#>     variable     bin count count_distr   neg   pos   posprob        woe
#>       <char>  <char> <int>       <num> <int> <int>     <num>      <num>
#> 1: Geography  France  3765   0.5020000   606  3159 0.8390438  0.2797695
#> 2: Geography Germany  1888   0.2517333   607  1281 0.6784958 -0.6244938
#> 3: Geography   Spain  1847   0.2462667   305  1542 0.8348674  0.2491625
#>        bin_iv total_iv  breaks is_special_values points
#>         <num>    <num>  <char>            <lgcl>  <num>
#> 1: 0.03605523 0.166205  France             FALSE      7
#> 2: 0.11598462 0.166205 Germany             FALSE    -16
#> 3: 0.01416515 0.166205   Spain             FALSE      7
#> 
#> $Gender
#>    variable    bin count count_distr   neg   pos   posprob        woe
#>      <char> <char> <int>       <num> <int> <int>     <num>      <num>
#> 1:   Gender Female  3423      0.4564   849  2574 0.7519720 -0.2622041
#> 2:   Gender   Male  4077      0.5436   669  3408 0.8359088  0.2567355
#>        bin_iv total_iv breaks is_special_values points
#>         <num>    <num> <char>            <lgcl>  <num>
#> 1: 0.03382371 0.066942 Female             FALSE     -8
#> 2: 0.03311829 0.066942   Male             FALSE      8
#> 
#> $Age
#>    variable       bin count count_distr   neg   pos   posprob        woe
#>      <char>    <char> <int>       <num> <int> <int>     <num>      <num>
#> 1:      Age [-Inf,35)  2737   0.3649333   227  2510 0.9170625  1.0317267
#> 2:      Age   [35,42)  2369   0.3158667   353  2016 0.8509920  0.3710413
#> 3:      Age   [42,46)   833   0.1110667   246   587 0.7046819 -0.5016680
#> 4:      Age [46, Inf)  1561   0.2081333   692   869 0.5566944 -1.1436041
#>        bin_iv total_iv breaks is_special_values points
#>         <num>    <num> <char>            <lgcl>  <num>
#> 1: 0.27862115  0.70465     35             FALSE     32
#> 2: 0.03876202  0.70465     42             FALSE     12
#> 3: 0.03207044  0.70465     46             FALSE    -16
#> 4: 0.35519639  0.70465    Inf             FALSE    -35
#> 
#> $Balance
#>    variable           bin count count_distr   neg   pos   posprob         woe
#>      <char>        <char> <int>       <num> <int> <int>     <num>       <num>
#> 1:  Balance   [-Inf,5000)  2704   0.3605333   362  2342 0.8661243  0.49575505
#> 2:  Balance [5000,105000)  1462   0.1949333   304  1158 0.7920657 -0.03393932
#> 3:  Balance [105000, Inf)  3334   0.4445333   852  2482 0.7444511 -0.30212784
#>          bin_iv  total_iv breaks is_special_values points
#>           <num>     <num> <char>            <lgcl>  <num>
#> 1: 0.0758684610 0.1203127   5000             FALSE      3
#> 2: 0.0002268084 0.1203127 105000             FALSE      0
#> 3: 0.0442174434 0.1203127    Inf             FALSE     -2
#> 
#> $NumOfProducts
#>         variable      bin count count_distr   neg   pos   posprob        woe
#>           <char>   <char> <int>       <num> <int> <int>     <num>      <num>
#> 1: NumOfProducts [-Inf,2)  3806   0.5074667  1043  2763 0.7259590 -0.3971454
#> 2: NumOfProducts [2, Inf)  3694   0.4925333   475  3219 0.8714131  0.5421499
#>        bin_iv  total_iv breaks is_special_values points
#>         <num>     <num> <char>            <lgcl>  <num>
#> 1: 0.08943819 0.2115318      2             FALSE    -11
#> 2: 0.12209359 0.2115318    Inf             FALSE     15
#> 
#> $IsActiveMember
#>          variable      bin count count_distr   neg   pos   posprob        woe
#>            <char>   <char> <int>       <num> <int> <int>     <num>      <num>
#> 1: IsActiveMember [-Inf,1)  3657      0.4876   976  2681 0.7331146 -0.3608787
#> 2: IsActiveMember [1, Inf)  3843      0.5124   542  3301 0.8589644  0.4353534
#>        bin_iv  total_iv breaks is_special_values points
#>         <num>     <num> <char>            <lgcl>  <num>
#> 1: 0.07028957 0.1550848      1             FALSE    -13
#> 2: 0.08479526 0.1550848    Inf             FALSE     16

💡 Interpretation Result:

The relationship between points0 value and odds0: at a score of 600 points, there is a risk of 1 customer exiting the bank out of every 10 customers who stay.
Note: The odds of 1/10 and points 600 are not ground truth values; they are adjusted based on the scorecard’s purpose and discussions with the relevant team.
Scorecard ranges are adjusted according to each company’s preferences. Typically, the standard used is the FICO standard, ranging from 300 to 850.
Interpretation of pdo (Points to Double the Odds): A smaller pdo value indicates a tighter scorecard and fewer customers predicted to exit.

Apply test to scorecard

# Data Train
score_train <- scorecard_ply(dt = train,
                       card = score_card,
                       only_total_score = F)
score_train %>% head()

#>    Geography_points Gender_points Age_points Balance_points
#>               <num>         <num>      <num>          <num>
#> 1:                7            -8         32             -2
#> 2:              -16            -8        -16             -2
#> 3:              -16             8         12             -2
#> 4:                7             8         12              3
#> 5:                7            -8         32              0
#> 6:                7             8         12              3
#>    NumOfProducts_points IsActiveMember_points score
#>                   <num>                 <num> <num>
#> 1:                  -11                   -13   578
#> 2:                  -11                   -13   507
#> 3:                   15                   -13   577
#> 4:                   15                   -13   605
#> 5:                  -11                   -13   580
#> 6:                   15                   -13   605

# Data Test
score_test <- scorecard_ply(dt = test,
                       card = score_card,
                       only_total_score = F)
score_test %>% head()

#>    Geography_points Gender_points Age_points Balance_points
#>               <num>         <num>      <num>          <num>
#> 1:                7            -8         12              0
#> 2:                7            -8         12              3
#> 3:                7             8        -16             -2
#> 4:                7            -8         32              3
#> 5:                7            -8         12              3
#> 6:              -16             8        -16             -2
#>    NumOfProducts_points IsActiveMember_points score
#>                   <num>                 <num> <num>
#> 1:                  -11                    16   589
#> 2:                   15                   -13   589
#> 3:                   15                    16   601
#> 4:                   15                   -13   609
#> 5:                   15                    16   618
#> 6:                   15                    16   578

Performance Evaluation Scorecard using Population Stability Index

We just got the score of each characteristics customer from our dataset. To see how our scorecard result stable for some population, we can use Population Stability Index.

Population Stability Index (PSI) is a metric used in credit analysis and credit risk to evaluate the stability or change in the distribution of credit points or scores within a population over time.

score_list <- list(train = score_train$score, 
                   test = score_test$score)

label_list <- list(train = train_woe_final$Exited, 
                   test = test_woe_final$Exited)

psi <- perf_psi(score = score_list, 
                label = label_list, 
                positive = 0)
psi

#> $pic
#> $pic$pred

#> 
#> 
#> $psi
#>    variable    dataset         psi
#>      <char>     <char>       <num>
#> 1:     pred train_test 0.005944322

According to Siddiqi Naeem, the PSI (Population Stability Index) has the following ranges:

PSI < 0.10: Typically considered a sign that there is no significant change in the distribution of credit scores, and the score population tends to be stable.

💡 Interpretation :

PSI value is 0.0059 which means there is no significant changes and our scorecard stable in population score.

Cutoff

Determining whether a customer falls into the “GOOD” or “BAD” category requires setting a cutoff value for the total score. To establish this cutoff value, we need information on the approval rate and bad rate. To obtain these rates, we use the ‘approval_rate()’ function. ‘approval_rate()’ will display a table showing various cutoff values, approval rates, negative rates (bad rates), and other details to facilitate the decision-making process.

# using score test
approval_rate(score = score_test$score,
              label = test_woe_final$Exited,
              positive = 0)

#> Key: <datset>
#>            bin approval_rate neg_rate count_approved neg_approved count   neg
#>         <fctr>         <num>    <num>          <int>        <int> <int> <int>
#>  1: [-Inf,533)        0.9044   0.1544           2261          349   239   170
#>  2:  [533,555)        0.8000   0.1165           2000          233   261   116
#>  3:  [555,564)        0.7032   0.0961           1758          169   242    64
#>  4:  [564,578)        0.6120   0.0830           1530          127   228    42
#>  5:  [578,589)        0.5020   0.0637           1255           80   275    47
#>  6:  [589,601)        0.4012   0.0528           1003           53   252    27
#>  7:  [601,607)        0.3136   0.0472            784           37   219    16
#>  8:  [607,620)        0.2048   0.0332            512           17   272    20
#>  9:  [620,633)        0.1020   0.0235            255            6   257    11
#> 10: [633, Inf)        0.0000   0.0000              0            0   255     6
#>       pos
#>     <int>
#>  1:    69
#>  2:   145
#>  3:   178
#>  4:   186
#>  5:   228
#>  6:   225
#>  7:   203
#>  8:   252
#>  9:   246
#> 10:   249

Notes:

bin: Result of binning the scores is [-inf,533): Cutoff at < 533
approval_rate: Approval rate when using a cutoff point at 533.

💡 Interpretation :

When the cutoff value = 533 (row 1):

Approval rate = 90%
Bad rate = 15%

Predict Score

Setting the cutoff value can be used as a reference to determine whether a customer is considered to have “GOOD” category when their total points on the scorecard are above the cutoff value. Similarly, they would be categorized as having “BAD” category if their total points fall below the cutoff value.

# predict new data
new_data <- data.frame(list(Age = 30,
                            NumOfProducts = 3,
                            Geography = "Spain",
                            IsActiveMember = 1,
                            Balance = 95000,
                            Gender = "Female"))
new_data

#>   Age NumOfProducts Geography IsActiveMember Balance Gender
#> 1  30             3     Spain              1   95000 Female

To transform the original characteristics into total points and determine behavior, the following steps are needed:

Transform the original data into points using scorecard_ply().
Determine behavior based on the following rules:
- “GOOD”: when score > cutoff
- “BAD”: when score < cutoff

# predict behaviour
result <- predict_behaviour(data = new_data, 
                  score_card = score_card, 
                  cutoff = 533)
result

#>    score recommendation
#>    <num>         <char>
#> 1:   635           GOOD

To combine the original customer characteristics with the score results and recommendations using the cbind() function,

cbind(new_data, result)

#>   Age NumOfProducts Geography IsActiveMember Balance Gender score
#> 1  30             3     Spain              1   95000 Female   635
#>   recommendation
#> 1           GOOD

cbind(test_woe_final$pred_risk, result)

#>               V1 score recommendation
#>            <num> <num>         <char>
#>    1: 0.13004477   635           GOOD
#>    2: 0.13102348   635           GOOD
#>    3: 0.08527977   635           GOOD
#>    4: 0.06897270   635           GOOD
#>    5: 0.05257366   635           GOOD
#>   ---                                
#> 2496: 0.29704663   635           GOOD
#> 2497: 0.20839334   635           GOOD
#> 2498: 0.67440913   635           GOOD
#> 2499: 0.08048480   635           GOOD
#> 2500: 0.11618039   635           GOOD

Conclusion

The result shows that the our predict data (test_woe_final$pred_risk) and predict new data (new_data) would positively impact approval.

Banking Customer Churn Prediction

Rastiti Ratnasari

2024-07-04

Introduction

Import Library

Data Preparation

Load Data

Column Description

Data Cleansing

Data Prepocessing

Cross Validation

Initial Characteristic Analysis

Weight of Evidence (WoE) or Fine Classing

Change dataframe in to WOE

Information Value (IV)

Logical Trend and Business Consideration / Coarse Classing

Modeling

Logistic Regression Model

Model Assumptions - Multicollinearity

Prediction

Evaluation Model

KS Statistics

AUC Score

Scorecard Creation

Apply test to scorecard

Performance Evaluation Scorecard using Population Stability Index

Cutoff

Predict Score

Conclusion