knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
fig.align = "center",
comment = "#>"
)
This dataset has information contains about bank customers and their churn status, which indicates whether they have exited the bank or not. This project will exploring and analyzing factors influencing customer churn in banking institutions and for building predictive models to identify customers at risk of churning. This dataset collected from Kaggle (https://www.kaggle.com/datasets/saurabhbadole/bank-customer-churn-prediction-dataset)
source("setup.R")
df <- read.csv("data_input/Churn_Modelling.csv")
head(df)
#> RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure
#> 1 1 15634602 Hargrave 619 France Female 42 2
#> 2 2 15647311 Hill 608 Spain Female 41 1
#> 3 3 15619304 Onio 502 France Female 42 8
#> 4 4 15701354 Boni 699 France Female 39 1
#> 5 5 15737888 Mitchell 850 Spain Female 43 2
#> 6 6 15574012 Chu 645 Spain Male 44 8
#> Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
#> 1 0.00 1 1 1 101348.88 1
#> 2 83807.86 1 0 1 112542.58 0
#> 3 159660.80 3 1 0 113931.57 1
#> 4 0.00 2 0 0 93826.63 0
#> 5 125510.82 1 1 1 79084.10 0
#> 6 113755.78 2 1 0 149756.71 1
RowNumber : The sequential number assigned to each row in the dataset.
CustomerId : A unique identifier for each customer.
Surname : The surname of the customer.
CreditScore : The credit score of the customer.
Geography : The geographical location of the customer (e.g., country or region).
Gender : The gender of the customer.
Age : The age of the customer.
Tenure : The number of years the customer has been with the bank.
Balance : The account balance of the customer.
NumOfProducts : The number of bank products the customer has.
HasCrCard : Indicates whether the customer has a credit card (binary: yes/no).
IsActiveMember : Indicates whether the customer is an active member (binary: yes/no).
EstimatedSalary : The estimated salary of the customer.
Exited (Target) : Indicates whether the customer has exited the bank (binary: yes (1) exited /no (0) : not exited).
#Check General Data Information
glimpse(df)
#> Rows: 10,000
#> Columns: 14
#> $ RowNumber <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,âŚ
#> $ CustomerId <int> 15634602, 15647311, 15619304, 15701354, 15737888, 1557âŚ
#> $ Surname <chr> "Hargrave", "Hill", "Onio", "Boni", "Mitchell", "Chu",âŚ
#> $ CreditScore <int> 619, 608, 502, 699, 850, 645, 822, 376, 501, 684, 528,âŚ
#> $ Geography <chr> "France", "Spain", "France", "France", "Spain", "SpainâŚ
#> $ Gender <chr> "Female", "Female", "Female", "Female", "Female", "MalâŚ
#> $ Age <int> 42, 41, 42, 39, 43, 44, 50, 29, 44, 27, 31, 24, 34, 25âŚ
#> $ Tenure <int> 2, 1, 8, 1, 2, 8, 7, 4, 4, 2, 6, 3, 10, 5, 7, 3, 1, 9,âŚ
#> $ Balance <dbl> 0.00, 83807.86, 159660.80, 0.00, 125510.82, 113755.78,âŚ
#> $ NumOfProducts <int> 1, 1, 3, 2, 1, 2, 2, 4, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, âŚ
#> $ HasCrCard <int> 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, âŚ
#> $ IsActiveMember <int> 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, âŚ
#> $ EstimatedSalary <dbl> 101348.88, 112542.58, 113931.57, 93826.63, 79084.10, 1âŚ
#> $ Exited <int> 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, âŚ
From the dataset above, the data has 14 columns, 10,000 rows and the data types for each column. Checking the data types is a crucial step due to the data types must be appropriate for analysis.
# Select column will be drop
df_clean <- df %>%
select(-c(RowNumber,CustomerId,Surname))
# change data type
df_clean <- df_clean %>%
mutate(Geography = as.factor(Geography),
Gender = as.factor(Gender))
head(df_clean)
#> CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard
#> 1 619 France Female 42 2 0.00 1 1
#> 2 608 Spain Female 41 1 83807.86 1 0
#> 3 502 France Female 42 8 159660.80 3 1
#> 4 699 France Female 39 1 0.00 2 0
#> 5 850 Spain Female 43 2 125510.82 1 1
#> 6 645 Spain Male 44 8 113755.78 2 1
#> IsActiveMember EstimatedSalary Exited
#> 1 1 101348.88 1
#> 2 1 112542.58 0
#> 3 0 113931.57 1
#> 4 0 93826.63 0
#> 5 1 79084.10 0
#> 6 0 149756.71 1
# Check missing value
colSums(is.na(df_clean))
#> CreditScore Geography Gender Age Tenure
#> 0 0 0 0 0
#> Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
#> 0 0 0 0 0
#> Exited
#> 0
In the dataset above, has no missing value data in any columns.
# check data distribution
describe(df_clean)
#> variable class count missing_rate unique_count identical_rate
#> <char> <char> <int> <num> <int> <num>
#> 1: CreditScore integer 10000 0 460 0.0233
#> 2: Geography factor 10000 0 3 0.5014
#> 3: Gender factor 10000 0 2 0.5457
#> 4: Age integer 10000 0 70 0.0478
#> 5: Tenure integer 10000 0 11 0.1048
#> 6: Balance numeric 10000 0 6382 0.3617
#> 7: NumOfProducts integer 10000 0 4 0.5084
#> 8: HasCrCard integer 10000 0 2 0.7055
#> 9: IsActiveMember integer 10000 0 2 0.5151
#> 10: EstimatedSalary numeric 10000 0 9999 0.0002
#> 11: Exited integer 10000 0 2 0.7963
#> min p25 p50 p75 max mean sd cv
#> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 350.00 584.00 652.00 718.0 850.0 650.5288 96.6533 0.1486
#> 2: NA NA NA NA NA NA NA NA
#> 3: NA NA NA NA NA NA NA NA
#> 4: 18.00 32.00 37.00 44.0 92.0 38.9218 10.4878 0.2695
#> 5: 0.00 3.00 5.00 7.0 10.0 5.0128 2.8922 0.5770
#> 6: 0.00 0.00 97198.54 127644.2 250898.1 76485.8893 62397.4052 0.8158
#> 7: 1.00 1.00 1.00 2.0 4.0 1.5302 0.5817 0.3801
#> 8: 0.00 0.00 1.00 1.0 1.0 0.7055 0.4558 0.6461
#> 9: 0.00 0.00 1.00 1.0 1.0 0.5151 0.4998 0.9703
#> 10: 11.58 51002.11 100193.91 149388.2 199992.5 100090.2399 57510.4928 0.5746
#> 11: 0.00 0.00 0.00 0.0 1.0 0.2037 0.4028 1.9773
#check class variable target
table(df_clean$Exited) %>%
prop.table()
#>
#> 0 1
#> 0.7963 0.2037
We can consider our target has balance data even though the class proportion is 80:20
RNGkind(sample.kind= "Rounding")
set.seed(123) # randomizing data
# To create a binary split of data into training and testing sets with a ratio of 75:25 (75% for training and 25% for testing)
index <- initial_split(data = df_clean,
prop = 0.75)
# splitting
train <- training(index)
test <- testing(index)
prop.table(table(train$Exited))
#>
#> 0 1
#> 0.7976 0.2024
The fuction of WOE is to binning/categorize all predictor variables in the data, then calculate the strength of each categoryâs result in separating positive and negative classes. This step will separate several outlier values, missing values, and categories with low frequencies. Additionally, it can help users understand the risk behavior of customers.
binning <- woebin(dt = train,
y = 'Exited',
positive = 0)
#> â Binning on 7500 rows and 11 columns in 00:00:13
binning
#> $CreditScore
#> variable bin count count_distr neg pos posprob woe
#> <char> <char> <int> <num> <int> <int> <num> <num>
#> 1: CreditScore [-Inf,490) 378 0.05040000 97 281 0.7433862 -0.30771759
#> 2: CreditScore [490,570) 1196 0.15946667 253 943 0.7884615 -0.05568449
#> 3: CreditScore [570,610) 944 0.12586667 182 762 0.8072034 0.06057859
#> 4: CreditScore [610,630) 539 0.07186667 127 412 0.7643785 -0.19452502
#> 5: CreditScore [630, Inf) 4443 0.59240000 859 3584 0.8066622 0.05710457
#> bin_iv total_iv breaks is_special_values
#> <num> <num> <char> <lgcl>
#> 1: 0.0052083086 0.01094046 490 FALSE
#> 2: 0.0005026684 0.01094046 570 FALSE
#> 3: 0.0004535851 0.01094046 610 FALSE
#> 4: 0.0028769137 0.01094046 630 FALSE
#> 5: 0.0018989881 0.01094046 Inf FALSE
#>
#> $Geography
#> variable bin count count_distr neg pos posprob woe
#> <char> <char> <int> <num> <int> <int> <num> <num>
#> 1: Geography France 3765 0.5020000 606 3159 0.8390438 0.2797695
#> 2: Geography Germany 1888 0.2517333 607 1281 0.6784958 -0.6244938
#> 3: Geography Spain 1847 0.2462667 305 1542 0.8348674 0.2491625
#> bin_iv total_iv breaks is_special_values
#> <num> <num> <char> <lgcl>
#> 1: 0.03605523 0.166205 France FALSE
#> 2: 0.11598462 0.166205 Germany FALSE
#> 3: 0.01416515 0.166205 Spain FALSE
#>
#> $Gender
#> variable bin count count_distr neg pos posprob woe
#> <char> <char> <int> <num> <int> <int> <num> <num>
#> 1: Gender Female 3423 0.4564 849 2574 0.7519720 -0.2622041
#> 2: Gender Male 4077 0.5436 669 3408 0.8359088 0.2567355
#> bin_iv total_iv breaks is_special_values
#> <num> <num> <char> <lgcl>
#> 1: 0.03382371 0.066942 Female FALSE
#> 2: 0.03311829 0.066942 Male FALSE
#>
#> $Age
#> variable bin count count_distr neg pos posprob woe
#> <char> <char> <int> <num> <int> <int> <num> <num>
#> 1: Age [-Inf,35) 2737 0.3649333 227 2510 0.9170625 1.0317267
#> 2: Age [35,42) 2369 0.3158667 353 2016 0.8509920 0.3710413
#> 3: Age [42,46) 833 0.1110667 246 587 0.7046819 -0.5016680
#> 4: Age [46, Inf) 1561 0.2081333 692 869 0.5566944 -1.1436041
#> bin_iv total_iv breaks is_special_values
#> <num> <num> <char> <lgcl>
#> 1: 0.27862115 0.70465 35 FALSE
#> 2: 0.03876202 0.70465 42 FALSE
#> 3: 0.03207044 0.70465 46 FALSE
#> 4: 0.35519639 0.70465 Inf FALSE
#>
#> $Tenure
#> variable bin count count_distr neg pos posprob woe
#> <char> <char> <int> <num> <int> <int> <num> <num>
#> 1: Tenure [-Inf,2) 1084 0.14453333 243 841 0.7758303 -0.12983106
#> 2: Tenure [2,3) 787 0.10493333 147 640 0.8132147 0.09967431
#> 3: Tenure [3,6) 2256 0.30080000 473 1783 0.7903369 -0.04440405
#> 4: Tenure [6,7) 722 0.09626667 132 590 0.8171745 0.12595933
#> 5: Tenure [7,8) 773 0.10306667 129 644 0.8331177 0.23652504
#> 6: Tenure [8,9) 771 0.10280000 150 621 0.8054475 0.04933451
#> 7: Tenure [9, Inf) 1107 0.14760000 244 863 0.7795845 -0.10811482
#> bin_iv total_iv breaks is_special_values
#> <num> <num> <char> <lgcl>
#> 1: 0.0025304879 0.0130042 2 FALSE
#> 2: 0.0010116628 0.0130042 3 FALSE
#> 3: 0.0006009357 0.0130042 6 FALSE
#> 4: 0.0014702854 0.0130042 7 FALSE
#> 5: 0.0053634572 0.0130042 8 FALSE
#> 6: 0.0002465346 0.0130042 9 FALSE
#> 7: 0.0017808328 0.0130042 Inf FALSE
#>
#> $Balance
#> variable bin count count_distr neg pos posprob woe
#> <char> <char> <int> <num> <int> <int> <num> <num>
#> 1: Balance [-Inf,5000) 2704 0.3605333 362 2342 0.8661243 0.49575505
#> 2: Balance [5000,105000) 1462 0.1949333 304 1158 0.7920657 -0.03393932
#> 3: Balance [105000, Inf) 3334 0.4445333 852 2482 0.7444511 -0.30212784
#> bin_iv total_iv breaks is_special_values
#> <num> <num> <char> <lgcl>
#> 1: 0.0758684610 0.1203127 5000 FALSE
#> 2: 0.0002268084 0.1203127 105000 FALSE
#> 3: 0.0442174434 0.1203127 Inf FALSE
#>
#> $NumOfProducts
#> variable bin count count_distr neg pos posprob woe
#> <char> <char> <int> <num> <int> <int> <num> <num>
#> 1: NumOfProducts [-Inf,2) 3806 0.5074667 1043 2763 0.7259590 -0.3971454
#> 2: NumOfProducts [2, Inf) 3694 0.4925333 475 3219 0.8714131 0.5421499
#> bin_iv total_iv breaks is_special_values
#> <num> <num> <char> <lgcl>
#> 1: 0.08943819 0.2115318 2 FALSE
#> 2: 0.12209359 0.2115318 Inf FALSE
#>
#> $HasCrCard
#> variable bin count count_distr neg pos posprob woe
#> <char> <char> <int> <num> <int> <int> <num> <num>
#> 1: HasCrCard [-Inf,1) 2204 0.2938667 453 1751 0.7944646 -0.019311074
#> 2: HasCrCard [1, Inf) 5296 0.7061333 1065 4231 0.7989048 0.008102291
#> bin_iv total_iv breaks is_special_values
#> <num> <num> <char> <lgcl>
#> 1: 1.102180e-04 0.0001564619 1 FALSE
#> 2: 4.624387e-05 0.0001564619 Inf FALSE
#>
#> $IsActiveMember
#> variable bin count count_distr neg pos posprob woe
#> <char> <char> <int> <num> <int> <int> <num> <num>
#> 1: IsActiveMember [-Inf,1) 3657 0.4876 976 2681 0.7331146 -0.3608787
#> 2: IsActiveMember [1, Inf) 3843 0.5124 542 3301 0.8589644 0.4353534
#> bin_iv total_iv breaks is_special_values
#> <num> <num> <char> <lgcl>
#> 1: 0.07028957 0.1550848 1 FALSE
#> 2: 0.08479526 0.1550848 Inf FALSE
#>
#> $EstimatedSalary
#> variable bin count count_distr neg pos posprob
#> <char> <char> <int> <num> <int> <int> <num>
#> 1: EstimatedSalary [-Inf,20000) 725 0.09666667 134 591 0.8151724
#> 2: EstimatedSalary [20000,35000) 559 0.07453333 125 434 0.7763864
#> 3: EstimatedSalary [35000,60000) 948 0.12640000 168 780 0.8227848
#> 4: EstimatedSalary [60000,75000) 566 0.07546667 131 435 0.7685512
#> 5: EstimatedSalary [75000,85000) 391 0.05213333 65 326 0.8337596
#> 6: EstimatedSalary [85000,165000) 3009 0.40120000 611 2398 0.7969425
#> 7: EstimatedSalary [165000, Inf) 1302 0.17360000 284 1018 0.7818740
#> woe bin_iv total_iv breaks is_special_values
#> <num> <num> <num> <char> <lgcl>
#> 1: 0.112614936 1.184973e-03 0.01240834 20000 FALSE
#> 2: -0.126630484 1.240245e-03 0.01240834 35000 FALSE
#> 3: 0.163968659 3.233337e-03 0.01240834 60000 FALSE
#> 4: -0.171212573 2.324999e-03 0.01240834 75000 FALSE
#> 5: 0.241148830 2.815973e-03 0.01240834 85000 FALSE
#> 6: -0.004067905 6.647035e-06 0.01240834 165000 FALSE
#> 7: -0.094740322 1.602161e-03 0.01240834 Inf FALSE
đĄ Insight:
Customers with EstimatedSalary between 20000- 35000 & 60000 - 75000 has negative value for WOE, the meaning they are potentially going to exit from bank.
Bank customers from the branch in Germany are potentially included in the Exited category.
Female bank customers are potentially included in the Exited category.
Bank customers aged over 40 are potentially included in the Exited category
# data train
train_woe <- woebin_ply(dt = train,
bins = binning)
#> â Woe transformating on 7500 rows and 10 columns in 00:00:02
train_woe
# data test
test_woe <- woebin_ply(dt = test,
bins = binning)
#> â Woe transformating on 2500 rows and 10 columns in 00:00:02
test_woe
# your code here
iv(dt = train_woe,
y = 'Exited',
positive = 0)
#> variable info_value
#> <char> <num>
#> 1: Age_woe 0.7046500005
#> 2: NumOfProducts_woe 0.2115317727
#> 3: Geography_woe 0.1662050017
#> 4: IsActiveMember_woe 0.1550848364
#> 5: Balance_woe 0.1203127129
#> 6: Gender_woe 0.0669420000
#> 7: Tenure_woe 0.0130041965
#> 8: EstimatedSalary_woe 0.0124083350
#> 9: CreditScore_woe 0.0109404638
#> 10: HasCrCard_woe 0.0001564619
According to Siddiqi, Naeem, the IV score can be categorized into the following values:
IV below 0.02 -> unpredictive
IV between 0.02 - 0.1 -> weak
IV between 0.1 and 0.3 categorized as medium
IV above 0.3 categorized as strong
The IV values we have obtained can be used for feature elimination, and the variables to be eliminated are as follows: Tenure_woe, EstimatedSalary_woe, CreditScore_woe, and HasCrCard_woe
# hasil data setelah feature elimination
train_woe_final <- train_woe %>%
select(-c(Tenure_woe,EstimatedSalary_woe,CreditScore_woe,HasCrCard_woe))
test_woe_final <- test_woe %>%
select(-c(Tenure_woe,EstimatedSalary_woe,CreditScore_woe,HasCrCard_woe))
After successfully generating and applying WOE binning to the train and test data, We need visualization to get onother perspective for easier to present.
# logical trend plot
plot <- woebin_plot(bins = binning)
plot
#> $CreditScore
#>
#> $Geography
#>
#> $Gender
#>
#> $Age
#>
#> $Tenure
#>
#> $Balance
#>
#> $NumOfProducts
#>
#> $HasCrCard
#>
#> $IsActiveMember
#>
#> $EstimatedSalary
đĄ Insight:
The blue line represents the relationship between each binning, indicating the positive probability.
The visualization of EstimatedSalary indicates that as the value increases, the positive probability decreases. This means more customers âNot Exitedâ.
Logistic regression is a classification method where the concept involves calculating the probability of a binary outcome for each target class.
model <- glm(formula = Exited ~.,
data = train_woe_final,
family = "binomial")
summary(model)
#>
#> Call:
#> glm(formula = Exited ~ ., family = "binomial", data = train_woe_final)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.37120 0.03371 -40.682 <2e-16 ***
#> Geography_woe -0.90771 0.08637 -10.510 <2e-16 ***
#> Gender_woe -1.04798 0.12581 -8.330 <2e-16 ***
#> Age_woe -1.07561 0.03937 -27.321 <2e-16 ***
#> Balance_woe -0.19023 0.11330 -1.679 0.0932 .
#> NumOfProducts_woe -0.94774 0.07610 -12.454 <2e-16 ***
#> IsActiveMember_woe -1.29030 0.08488 -15.202 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 7555.7 on 7499 degrees of freedom
#> Residual deviance: 6009.5 on 7493 degrees of freedom
#> AIC: 6023.5
#>
#> Number of Fisher Scoring iterations: 5
đĄ Interpretation model:
Variables of Balance_woe is not significant to our model.
Multicollinearity: The predictors are not highly correlated with each other (greater than +- 0.9).
Use the vif() test from the car library. A VIF (Variance Inflation Factor) value < 10 indicates that the variable does not have multicollinearity with other variables.
vif(model)
#> Geography_woe Gender_woe Age_woe Balance_woe
#> 1.249178 1.005763 1.052761 1.361607
#> NumOfProducts_woe IsActiveMember_woe
#> 1.129329 1.047355
đĄ Insight: There is no multicollinearity in this data.
After obtaining the model, predictions can be made on the test data that has been transformed into WOE (Weight of Evidence).
# Performing predictions on the test data
test_woe_final$pred_risk <- predict(object = model,
newdata = test_woe_final,
type = "response")
test_woe_final$pred_risk %>% head()
#> [1] 0.13004477 0.13102348 0.08527977 0.06897270 0.05257366 0.17481595
# Creating a list for the predicted risk outcomes of each data point in both the training and test datasets
list_pred <- list(test = test_woe_final$pred_risk)
# Creating label lists for the training and test datasets
list_label <- list(test = test_woe_final$Exited)
# Using the function perf_eva to perform evaluation.
perf_eva(pred = list_pred,
label = list_label,
confusion_matrix = TRUE,
threshold = 0.5,
show_plot = c("ks", "roc"))
#> $binomial_metric
#> $binomial_metric$test
#> MSE RMSE LogLoss R2 KS AUC Gini
#> <num> <num> <num> <num> <num> <num> <num>
#> 1: 0.1218619 0.3490873 0.3908244 0.259208 0.5024097 0.8165997 0.6331994
#>
#>
#> $confusion_matrix
#> $confusion_matrix$test
#> label pred_0 pred_1 error
#> <char> <num> <num> <num>
#> 1: 0 1898 83 0.04189803
#> 2: 1 333 186 0.64161850
#> 3: total 2231 269 0.16640000
#>
#>
#> $pic
#> TableGrob (1 x 2) "arrange": 2 grobs
#> z cells name grob
#> 1 1 (1-1,1-1) arrange gtable[layout]
#> 2 2 (1-1,2-2) arrange gtable[layout]
KS (Kolmogorov-Smirnov) Statistics is a statistical method used to measure the difference between two empirical cumulative distributions. It is often used to compare score distributions between two groups, for example, in the analysis of bank customer behavior models (distinguishing between good and bad customers).
A high KS value indicates that the model is good to classify between the two groups (for example, good vs bad customers).
Typically, a KS value > 0.4 is considered good in credit scoring.
đĄ Interpretation:
KS value in this model is 0,5024 > 0,4 thats mean the Logistic Regression model able to effectively classify between the Exited (1) and not Exited (0) classes.
AUC (Area Under Curve) is the area under the ROC curve. As the AUC value approaches 1, the model becomes better at separating positive and negative classes.
đĄ Interpretation:
The model has an AUC of 0.8166, indicating it can distinguish between Exited and Not Exited classes with an accuracy of 81.6% compared to actual Exited/not-Exited cases.
In logistic regression models, the output is typically in the form of probabilities predicting whether a customer is an Exited Customer or not exited customers. To facilitate decision-making, the binning results can be transformed into scores for each bin, which are then accumulated into a scorecard.
Make a scorecard using universal odds0 1/10 and points0 600. Odds 1/10 means that we want to set in 10 people that positive (not Exited), 1 person of them is negative (Exited).
# membentuk scorecard
score_card <- scorecard(bins = binning,
model = model,
odds0 = 1/10,
points0 = 600,
pdo = 20)
score_card
#> $basepoints
#> variable bin woe points
#> <char> <lgcl> <lgcl> <num>
#> 1: basepoints NA NA 573
#>
#> $Geography
#> variable bin count count_distr neg pos posprob woe
#> <char> <char> <int> <num> <int> <int> <num> <num>
#> 1: Geography France 3765 0.5020000 606 3159 0.8390438 0.2797695
#> 2: Geography Germany 1888 0.2517333 607 1281 0.6784958 -0.6244938
#> 3: Geography Spain 1847 0.2462667 305 1542 0.8348674 0.2491625
#> bin_iv total_iv breaks is_special_values points
#> <num> <num> <char> <lgcl> <num>
#> 1: 0.03605523 0.166205 France FALSE 7
#> 2: 0.11598462 0.166205 Germany FALSE -16
#> 3: 0.01416515 0.166205 Spain FALSE 7
#>
#> $Gender
#> variable bin count count_distr neg pos posprob woe
#> <char> <char> <int> <num> <int> <int> <num> <num>
#> 1: Gender Female 3423 0.4564 849 2574 0.7519720 -0.2622041
#> 2: Gender Male 4077 0.5436 669 3408 0.8359088 0.2567355
#> bin_iv total_iv breaks is_special_values points
#> <num> <num> <char> <lgcl> <num>
#> 1: 0.03382371 0.066942 Female FALSE -8
#> 2: 0.03311829 0.066942 Male FALSE 8
#>
#> $Age
#> variable bin count count_distr neg pos posprob woe
#> <char> <char> <int> <num> <int> <int> <num> <num>
#> 1: Age [-Inf,35) 2737 0.3649333 227 2510 0.9170625 1.0317267
#> 2: Age [35,42) 2369 0.3158667 353 2016 0.8509920 0.3710413
#> 3: Age [42,46) 833 0.1110667 246 587 0.7046819 -0.5016680
#> 4: Age [46, Inf) 1561 0.2081333 692 869 0.5566944 -1.1436041
#> bin_iv total_iv breaks is_special_values points
#> <num> <num> <char> <lgcl> <num>
#> 1: 0.27862115 0.70465 35 FALSE 32
#> 2: 0.03876202 0.70465 42 FALSE 12
#> 3: 0.03207044 0.70465 46 FALSE -16
#> 4: 0.35519639 0.70465 Inf FALSE -35
#>
#> $Balance
#> variable bin count count_distr neg pos posprob woe
#> <char> <char> <int> <num> <int> <int> <num> <num>
#> 1: Balance [-Inf,5000) 2704 0.3605333 362 2342 0.8661243 0.49575505
#> 2: Balance [5000,105000) 1462 0.1949333 304 1158 0.7920657 -0.03393932
#> 3: Balance [105000, Inf) 3334 0.4445333 852 2482 0.7444511 -0.30212784
#> bin_iv total_iv breaks is_special_values points
#> <num> <num> <char> <lgcl> <num>
#> 1: 0.0758684610 0.1203127 5000 FALSE 3
#> 2: 0.0002268084 0.1203127 105000 FALSE 0
#> 3: 0.0442174434 0.1203127 Inf FALSE -2
#>
#> $NumOfProducts
#> variable bin count count_distr neg pos posprob woe
#> <char> <char> <int> <num> <int> <int> <num> <num>
#> 1: NumOfProducts [-Inf,2) 3806 0.5074667 1043 2763 0.7259590 -0.3971454
#> 2: NumOfProducts [2, Inf) 3694 0.4925333 475 3219 0.8714131 0.5421499
#> bin_iv total_iv breaks is_special_values points
#> <num> <num> <char> <lgcl> <num>
#> 1: 0.08943819 0.2115318 2 FALSE -11
#> 2: 0.12209359 0.2115318 Inf FALSE 15
#>
#> $IsActiveMember
#> variable bin count count_distr neg pos posprob woe
#> <char> <char> <int> <num> <int> <int> <num> <num>
#> 1: IsActiveMember [-Inf,1) 3657 0.4876 976 2681 0.7331146 -0.3608787
#> 2: IsActiveMember [1, Inf) 3843 0.5124 542 3301 0.8589644 0.4353534
#> bin_iv total_iv breaks is_special_values points
#> <num> <num> <char> <lgcl> <num>
#> 1: 0.07028957 0.1550848 1 FALSE -13
#> 2: 0.08479526 0.1550848 Inf FALSE 16
đĄ Interpretation Result:
The relationship between points0 value and odds0: at a score of 600 points, there is a risk of 1 customer exiting the bank out of every 10 customers who stay.
Note: The odds of 1/10 and points 600 are not ground truth values; they are adjusted based on the scorecardâs purpose and discussions with the relevant team.
Scorecard ranges are adjusted according to each companyâs preferences. Typically, the standard used is the FICO standard, ranging from 300 to 850.
Interpretation of pdo (Points to Double the Odds): A smaller pdo value indicates a tighter scorecard and fewer customers predicted to exit.
# Data Train
score_train <- scorecard_ply(dt = train,
card = score_card,
only_total_score = F)
score_train %>% head()
#> Geography_points Gender_points Age_points Balance_points
#> <num> <num> <num> <num>
#> 1: 7 -8 32 -2
#> 2: -16 -8 -16 -2
#> 3: -16 8 12 -2
#> 4: 7 8 12 3
#> 5: 7 -8 32 0
#> 6: 7 8 12 3
#> NumOfProducts_points IsActiveMember_points score
#> <num> <num> <num>
#> 1: -11 -13 578
#> 2: -11 -13 507
#> 3: 15 -13 577
#> 4: 15 -13 605
#> 5: -11 -13 580
#> 6: 15 -13 605
# Data Test
score_test <- scorecard_ply(dt = test,
card = score_card,
only_total_score = F)
score_test %>% head()
#> Geography_points Gender_points Age_points Balance_points
#> <num> <num> <num> <num>
#> 1: 7 -8 12 0
#> 2: 7 -8 12 3
#> 3: 7 8 -16 -2
#> 4: 7 -8 32 3
#> 5: 7 -8 12 3
#> 6: -16 8 -16 -2
#> NumOfProducts_points IsActiveMember_points score
#> <num> <num> <num>
#> 1: -11 16 589
#> 2: 15 -13 589
#> 3: 15 16 601
#> 4: 15 -13 609
#> 5: 15 16 618
#> 6: 15 16 578
We just got the score of each characteristics customer from our dataset. To see how our scorecard result stable for some population, we can use Population Stability Index.
Population Stability Index (PSI) is a metric used in credit analysis and credit risk to evaluate the stability or change in the distribution of credit points or scores within a population over time.
score_list <- list(train = score_train$score,
test = score_test$score)
label_list <- list(train = train_woe_final$Exited,
test = test_woe_final$Exited)
psi <- perf_psi(score = score_list,
label = label_list,
positive = 0)
psi
#> $pic
#> $pic$pred
#>
#>
#> $psi
#> variable dataset psi
#> <char> <char> <num>
#> 1: pred train_test 0.005944322
According to Siddiqi Naeem, the PSI (Population Stability Index) has the following ranges:
đĄ Interpretation :
PSI value is 0.0059 which means there is no significant changes and our scorecard stable in population score.
Determining whether a customer falls into the âGOODâ or âBADâ category requires setting a cutoff value for the total score. To establish this cutoff value, we need information on the approval rate and bad rate. To obtain these rates, we use the âapproval_rate()â function. âapproval_rate()â will display a table showing various cutoff values, approval rates, negative rates (bad rates), and other details to facilitate the decision-making process.
# using score test
approval_rate(score = score_test$score,
label = test_woe_final$Exited,
positive = 0)
#> Key: <datset>
#> bin approval_rate neg_rate count_approved neg_approved count neg
#> <fctr> <num> <num> <int> <int> <int> <int>
#> 1: [-Inf,533) 0.9044 0.1544 2261 349 239 170
#> 2: [533,555) 0.8000 0.1165 2000 233 261 116
#> 3: [555,564) 0.7032 0.0961 1758 169 242 64
#> 4: [564,578) 0.6120 0.0830 1530 127 228 42
#> 5: [578,589) 0.5020 0.0637 1255 80 275 47
#> 6: [589,601) 0.4012 0.0528 1003 53 252 27
#> 7: [601,607) 0.3136 0.0472 784 37 219 16
#> 8: [607,620) 0.2048 0.0332 512 17 272 20
#> 9: [620,633) 0.1020 0.0235 255 6 257 11
#> 10: [633, Inf) 0.0000 0.0000 0 0 255 6
#> pos
#> <int>
#> 1: 69
#> 2: 145
#> 3: 178
#> 4: 186
#> 5: 228
#> 6: 225
#> 7: 203
#> 8: 252
#> 9: 246
#> 10: 249
Notes:
bin: Result of binning the scores is [-inf,533): Cutoff at < 533
approval_rate: Approval rate when using a cutoff point at 533.
đĄ Interpretation :
When the cutoff value = 533 (row 1):
Approval rate = 90%
Bad rate = 15%
Setting the cutoff value can be used as a reference to determine whether a customer is considered to have âGOODâ category when their total points on the scorecard are above the cutoff value. Similarly, they would be categorized as having âBADâ category if their total points fall below the cutoff value.
# predict new data
new_data <- data.frame(list(Age = 30,
NumOfProducts = 3,
Geography = "Spain",
IsActiveMember = 1,
Balance = 95000,
Gender = "Female"))
new_data
#> Age NumOfProducts Geography IsActiveMember Balance Gender
#> 1 30 3 Spain 1 95000 Female
To transform the original characteristics into total points and determine behavior, the following steps are needed:
scorecard_ply().# predict behaviour
result <- predict_behaviour(data = new_data,
score_card = score_card,
cutoff = 533)
result
#> score recommendation
#> <num> <char>
#> 1: 635 GOOD
To combine the original customer characteristics with the score results and recommendations using the cbind() function,
cbind(new_data, result)
#> Age NumOfProducts Geography IsActiveMember Balance Gender score
#> 1 30 3 Spain 1 95000 Female 635
#> recommendation
#> 1 GOOD
cbind(test_woe_final$pred_risk, result)
#> V1 score recommendation
#> <num> <num> <char>
#> 1: 0.13004477 635 GOOD
#> 2: 0.13102348 635 GOOD
#> 3: 0.08527977 635 GOOD
#> 4: 0.06897270 635 GOOD
#> 5: 0.05257366 635 GOOD
#> ---
#> 2496: 0.29704663 635 GOOD
#> 2497: 0.20839334 635 GOOD
#> 2498: 0.67440913 635 GOOD
#> 2499: 0.08048480 635 GOOD
#> 2500: 0.11618039 635 GOOD
The result shows that the our predict data (test_woe_final$pred_risk) and predict new data (new_data) would positively impact approval.