Eglė Vaitulevičiūtė – Junior Quantitative Analyst Task (Credit Risk Model Validation)
DATA VISUALISATION TASK
First, I check whether the columns truly contain no NaN values. The number of missing values per column is:
      age       job   marital education   default   balance   housing      loan
        0         0         0         0         0         0         0         0
  contact       day     month  duration  campaign     pdays  previous  poutcome
        0         0         0         0         0         0         0         0
        y
        0
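A check of this kind can be sketched as follows. This is a minimal Python illustration, not the code used for the report; the three sample rows and the helper names are hypothetical.

```python
import math

# Hypothetical sample rows; None and NaN both count as missing.
rows = [
    {"age": 58, "job": "management", "balance": 2143.0},
    {"age": 44, "job": None, "balance": 29.0},
    {"age": 33, "job": "entrepreneur", "balance": float("nan")},
]

def is_missing(value):
    """Return True for None or a floating-point NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def missing_counts(records):
    """Count missing entries per column across a list of dict records."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts[column] = counts.get(column, 0) + int(is_missing(value))
    return counts

print(missing_counts(rows))
```

A column is free of missing values when its count is zero, which is what the report finds for all 17 columns.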
With no missing values present, the analysis proceeded to the next step. A summary of the dataset columns was generated to provide an overview of the distributions and types of the variables.
      age            job              marital           education
 Min.   :18.00   Length:45211       Length:45211       Length:45211
 1st Qu.:33.00   Class :character   Class :character   Class :character
 Median :39.00   Mode  :character   Mode  :character   Mode  :character
 Mean   :40.94
 3rd Qu.:48.00
 Max.   :95.00
   default             balance         housing              loan
 Length:45211       Min.   : -8019   Length:45211       Length:45211
 Class :character   1st Qu.:    72   Class :character   Class :character
 Mode  :character   Median :   448   Mode  :character   Mode  :character
                    Mean   :  1362
                    3rd Qu.:  1428
                    Max.   :102127
   contact               day           month              duration
 Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0
 Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0
 Mode  :character   Median :16.00   Mode  :character   Median : 180.0
                    Mean   :15.81                      Mean   : 258.2
                    3rd Qu.:21.00                      3rd Qu.: 319.0
                    Max.   :31.00                      Max.   :4918.0
    campaign          pdays          previous          poutcome
 Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211
 1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character
 Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character
 Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803
 3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000
 Max.   :63.000   Max.   :871.0   Max.   :275.0000
      y
 Length:45211
 Class :character
 Mode  :character
The categorical columns are stored as character variables rather than factors, so I converted them to factor type for the analysis. The summary now shows the category counts:
      age                 job           marital          education
 Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851
 1st Qu.:33.00   management :9458   married :27214   secondary:23202
 Median :39.00   technician :7597   single  :12790   tertiary :13301
 Mean   :40.94   admin.     :5171                    unknown  : 1857
 3rd Qu.:48.00   services   :4154
 Max.   :95.00   retired    :2264
                 (Other)    :6835
 default        balance        housing       loan           contact
 yes:  815   Min.   : -8019   yes:25130   yes: 7244   cellular :29285
 no :44396   1st Qu.:    72   no :20081   no :37967   telephone: 2906
             Median :   448                           unknown  :13020
             Mean   :  1362
             3rd Qu.:  1428
             Max.   :102127
      day            month          duration         campaign
 Min.   : 1.00   may    :13766   Min.   :   0.0   Min.   : 1.000
 1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000
 Median :16.00   aug    : 6247   Median : 180.0   Median : 2.000
 Mean   :15.81   jun    : 5341   Mean   : 258.2   Mean   : 2.764
 3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000
 Max.   :31.00   apr    : 2932   Max.   :4918.0   Max.   :63.000
                 (Other): 6060
     pdays          previous           poutcome        y
 Min.   : -1.0   Min.   :  0.0000   failure: 4901   yes: 5289
 1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840   no :39922
 Median : -1.0   Median :  0.0000   success: 1511
 Mean   : 40.2   Mean   :  0.5803   unknown:36959
 3rd Qu.: -1.0   3rd Qu.:  0.0000
 Max.   :871.0   Max.   :275.0000
The categories in the qualitative variables have been identified. Now, I inspect the data to ensure it is clean and properly formatted.
Rows: 45,211
Columns: 17
$ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
$ job <fct> management, technician, entrepreneur, blue-collar, unknown, …
$ marital <fct> married, single, married, married, single, married, single, …
$ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
$ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
$ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
$ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
$ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
$ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
$ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
$ month <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
$ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
$ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
$ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
$ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
The data appears to be in good order, so I will now examine its distribution. I’ll begin with the numerical variables.
Now I will examine the distribution of numerical variables by the target variable y.
The ‘yes’ class in y is underrepresented, so it is hard to draw firm conclusions from these plots. Next, I will analyze the distribution of the categorical variables.
Certain variables, including default, loan, and poutcome, exhibit a large imbalance between their categories. It would be interesting to see if the distribution is similar when grouped by the variable y.
Similar patterns are observed between the groups of the target variable y. What is the percentage distribution of the target variable y?
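The class shares follow directly from the counts of y in the factor summary above. A short Python sketch of the arithmetic (the variable names are illustrative):

```python
# Counts of the target variable y, taken from the factor summary above.
counts = {"yes": 5289, "no": 39922}
total = sum(counts.values())

# Share of each class in percent.
shares = {label: 100 * n / total for label, n in counts.items()}
for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
```

This gives roughly 11.7% ‘yes’ and 88.3% ‘no’, so a classifier that always predicts ‘no’ would already be right about 88% of the time. Any accuracy figure has to be read against that baseline.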
It is clear that the target variable y is highly imbalanced, which could pose challenges during model development. Next, a correlation heatmap was created to visually assess relationships between numerical features, helping identify strongly correlated variables that might affect model performance.
The correlation heatmap shows generally weak relationships among numerical variables, with the only notable moderate positive correlation (≈0.45) between ‘previous’ and ‘pdays’, indicating low multicollinearity overall but warranting attention to these two variables when modeling.
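Each cell of such a heatmap is a pairwise correlation coefficient. Assuming the Pearson coefficient was used (the report does not say), it can be computed as in this minimal Python sketch; the toy values are hypothetical and not taken from the dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values for two numerical columns.
pdays = [-1, -1, 90, 180, -1, 350]
previous = [0, 0, 2, 1, 0, 4]
r = pearson(pdays, previous)
print(round(r, 2))
```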
MODELLING TASK
Before modeling, the target variable ‘y’ was converted to numeric (1 for ‘yes’, 0 for ‘no’), categorical features were encoded as numeric factors to make them usable by algorithms, and all other numerical variables were standardized to have mean 0 and standard deviation 1, ensuring consistent scaling and improving model performance.
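Standardization to mean 0 and standard deviation 1 can be sketched in Python as below. The function name is illustrative, and the use of the sample standard deviation (dividing by n − 1) is an assumption; the input values are the first ten entries of duration shown in the data preview above.

```python
import statistics

def standardize(values):
    """Return z-scores using the sample standard deviation (n - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# First ten values of `duration` from the data preview above.
duration = [261, 151, 76, 92, 198, 139, 217, 380, 50, 55]
z = standardize(duration)
print([round(v, 2) for v in z])
```

If the mean and standard deviation are estimated on the full dataset before the split, information from the test rows enters the preprocessing. Estimating them on the training rows only and reusing them for the test rows avoids this, although with a dataset of this size the practical difference is probably small.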
Then, the dataset was split into 80% training and 20% test sets. A logistic regression model was trained on the training data using all predictors, and predictions were made on the test set. Predicted probabilities were converted to binary outcomes using a 0.5 threshold. Model performance was evaluated using accuracy and a confusion matrix to assess classification results.
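The split and the thresholding step can be sketched in Python as follows. The seed, the function names, the use of simple random sampling, and the choice of `>=` at the threshold are all assumptions, since the report does not state how the split was drawn.

```python
import random

def train_test_split(n_rows, train_fraction=0.8, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * n_rows)
    return indices[:n_train], indices[n_train:]

def classify(probabilities, threshold=0.5):
    """Label a case 1 when its predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

train_idx, test_idx = train_test_split(45211)
print(len(train_idx), len(test_idx))
print(classify([0.08, 0.43, 0.50, 0.91]))
```

With 45,211 rows this yields 36,169 training rows and 9,042 test rows; the latter equals the total of the cells in the confusion matrix below.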
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0 7776  695
         1  196  375
Accuracy : 0.9015
95% CI : (0.8951, 0.9075)
No Information Rate : 0.8817
P-Value [Acc > NIR] : 1.273e-09
Kappa : 0.4083
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9754
Specificity : 0.3505
Pos Pred Value : 0.9180
Neg Pred Value : 0.6567
Prevalence : 0.8817
Detection Rate : 0.8600
Detection Prevalence : 0.9369
Balanced Accuracy : 0.6629
'Positive' Class : 0
The first logistic regression model achieved an overall accuracy of 90.15%. However, because of the class imbalance, it predicted customers who did not subscribe well (sensitivity = 97.5%, with class 0 treated as the positive class) but identified those who did subscribe poorly (specificity = 35.0%), resulting in a balanced accuracy of 66.3%.
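These figures can be recomputed from the four cells of the confusion matrix. The Python sketch below uses a hypothetical helper, `binary_metrics`, and treats class 0 as the positive class, as the output above does.

```python
def binary_metrics(tp, fn, fp, tn):
    """Confusion-matrix metrics, with counts given relative to the positive class."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Cells of the first confusion matrix, with class 0 (no subscription) as positive:
# 7776 true 0s predicted 0, 196 true 0s predicted 1,
# 695 true 1s predicted 0, 375 true 1s predicted 1.
first_model = binary_metrics(tp=7776, fn=196, fp=695, tn=375)
for name, value in first_model.items():
    print(f"{name}: {value:.4f}")
```

Because class 0 is the positive class here, sensitivity measures how well non-subscribers are recognised and specificity how well subscribers are recognised.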
To improve the model, a second logistic regression was trained using all predictors, and the classification threshold was then chosen with the Youden index (the cutoff that maximises sensitivity + specificity − 1) instead of being fixed at 0.5.
Youden threshold: 0.08696519
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0 6286  105
         1 1686  965
Accuracy : 0.8019
95% CI : (0.7936, 0.8101)
No Information Rate : 0.8817
P-Value [Acc > NIR] : 1
Kappa : 0.4211
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.7885
Specificity : 0.9019
Pos Pred Value : 0.9836
Neg Pred Value : 0.3640
Prevalence : 0.8817
Detection Rate : 0.6952
Detection Prevalence : 0.7068
Balanced Accuracy : 0.8452
'Positive' Class : 0
By adjusting the threshold based on the Youden index, the model became more balanced: specificity increased from 35.0% to 90.2%, sensitivity fell from 97.5% to 78.9%, and balanced accuracy improved from 66.3% to 84.5%. Overall accuracy decreased from 90.2% to 80.2%, which is below the no-information rate of 88.2%, but the Youden-adjusted threshold provides a better trade-off between correctly identifying subscribers and non-subscribers, making the model more suitable for imbalanced data.
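The threshold search can be sketched in Python as below. This is a simplified illustration under stated assumptions: class 1 is the event, candidate thresholds are the observed scores themselves, and the toy labels and probabilities are hypothetical. ROC packages may place candidate cutoffs differently, for example midway between neighbouring scores.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    Class 1 is treated as the event; a case is predicted 1 when score >= threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j

# Hypothetical toy labels and predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.02, 0.03, 0.05, 0.06, 0.08, 0.10, 0.12, 0.20, 0.35, 0.60]
threshold, j = youden_threshold(y_true, scores)
print(threshold, round(j, 3))
```

For the confusion matrix above, J itself is 0.7885 + 0.9019 − 1 ≈ 0.690. The printed value of about 0.087 is presumably the probability cutoff at which J is maximised. A cutoff that low is plausible when only about 12% of cases are subscribers, because predicted probabilities are then small for most customers.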
Next, a logistic regression model with stepwise variable selection based on AIC was fitted, and the optimal threshold was determined using the Youden index to balance sensitivity and specificity.
Youden threshold: 0.0876302
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0 6299  106
         1 1673  964
Accuracy : 0.8033
95% CI : (0.7949, 0.8114)
No Information Rate : 0.8817
P-Value [Acc > NIR] : 1
Kappa : 0.4229
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.7901
Specificity : 0.9009
Pos Pred Value : 0.9835
Neg Pred Value : 0.3656
Prevalence : 0.8817
Detection Rate : 0.6966
Detection Prevalence : 0.7084
Balanced Accuracy : 0.8455
'Positive' Class : 0
Applying the Youden index to the stepwise AIC model resulted in performance metrics very similar to the previous model. Next, I balance the target variable y by downsampling the majority class so that both classes are equally represented in the training data. Stepwise AIC is again used for variable selection and the Youden index for the classification threshold.
Youden threshold: 0.432272
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0 6523  135
         1 1449  935
Accuracy : 0.8248
95% CI : (0.8168, 0.8326)
No Information Rate : 0.8817
P-Value [Acc > NIR] : 1
Kappa : 0.4519
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8182
Specificity : 0.8738
Pos Pred Value : 0.9797
Neg Pred Value : 0.3922
Prevalence : 0.8817
Detection Rate : 0.7214
Detection Prevalence : 0.7363
Balanced Accuracy : 0.8460
'Positive' Class : 0
After balancing the target variable by downsampling the majority class and applying stepwise AIC with the Youden index threshold, sensitivity and specificity moved closer together (81.8% and 87.4%, compared with 79.0% and 90.1% for the previous model). Accuracy rose to 82.5% and balanced accuracy was 84.6%, so overall predictive ability improved slightly, at the cost of identifying a somewhat smaller share of subscribers.
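Downsampling the majority class can be sketched in Python as follows. The function name and the seed are hypothetical. The class counts are the training counts implied by the report (full-data counts of y minus the test-set counts in the confusion matrices), which is an inference rather than a reported figure.

```python
import random

def downsample_majority(labels, seed=42):
    """Return sorted row indices with every class randomly cut to the smallest class size."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = min(len(indices) for indices in by_class.values())
    rng = random.Random(seed)
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    return sorted(kept)

# Implied training counts: 5289 - 1070 = 4219 subscribers (1)
# and 39922 - 7972 = 31950 non-subscribers (0).
labels = [1] * 4219 + [0] * 31950
balanced = downsample_majority(labels)
print(len(balanced))
```

The balanced sample has 8,438 rows, which is consistent with the 8,437 residual degrees of freedom of the null model in the analysis of deviance table later in the report.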
Next, in a further attempt to improve the model, a logistic regression was fitted on data balanced with ROSE, with variable selection via stepwise AIC and the optimal classification threshold determined using the Youden index.
Youden threshold: 0.4399083
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0 6522  143
         1 1450  927
Accuracy : 0.8238
95% CI : (0.8158, 0.8316)
No Information Rate : 0.8817
P-Value [Acc > NIR] : 1
Kappa : 0.4477
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8664
Specificity : 0.8181
Pos Pred Value : 0.3900
Neg Pred Value : 0.9785
Prevalence : 0.1183
Detection Rate : 0.1025
Detection Prevalence : 0.2629
Balanced Accuracy : 0.8422
'Positive' Class : 1
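Note that this last table reports class 1 as the positive class, so sensitivity (86.6%) now refers to subscribers and specificity (81.8%) to non-subscribers, the reverse of the earlier tables. Compared on accuracy (82.5% vs 82.4%) and balanced accuracy (84.6% vs 84.2%), the logistic regression model using the downsampled majority class with stepwise AIC and the Youden index performed slightly better than the ROSE model, making it the preferred approach for estimating the probability that a customer has subscribed to a term deposit.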
MODEL TESTING TASK
Since the best results were obtained from the logistic regression model using the downsized majority class with stepwise AIC and the Youden index, I will proceed to evaluate this model’s goodness of fit, predictive ability, and other relevant performance aspects.
Sensitivity, specificity, and other metrics derived from the confusion matrix will not be re-evaluated, as they have already been analyzed earlier.
Analysis of Deviance Table
Model 1: y ~ 1
Model 2: y ~ job + marital + education + balance + housing + loan + contact +
    month + duration + campaign + previous + poutcome
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1      8437    11697.6
2      8399     6704.5 38     4993 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood ratio test indicates that the logistic regression model with selected predictors provides a significantly better fit than the null model (p < 0.001), confirming that the included variables collectively improve prediction of customer subscription.
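The test statistic is simply the drop in deviance between the two models, compared against a chi-square distribution whose degrees of freedom equal the number of added parameters. A small Python check of the table's arithmetic (the variable names are illustrative):

```python
# Deviances and residual degrees of freedom from the analysis of deviance table.
null_deviance, null_df = 11697.6, 8437
model_deviance, model_df = 6704.5, 8399

lr_statistic = null_deviance - model_deviance   # drop in deviance
lr_df = null_df - model_df                      # number of added parameters

# Under the null hypothesis the statistic is approximately chi-square with
# lr_df degrees of freedom, i.e. mean lr_df and variance 2 * lr_df.
z_like = (lr_statistic - lr_df) / (2 * lr_df) ** 0.5
print(round(lr_statistic, 1), lr_df, round(z_like))
```

A statistic of about 4993 on 38 degrees of freedom lies hundreds of standard deviations above what the null hypothesis would produce, which is why the reported p-value is effectively zero.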
AUC: 0.9126283
The ROC curve and an AUC of 0.91 indicate that the stepwise AIC logistic regression model has excellent classification performance.
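The AUC can be read as the probability that a randomly chosen subscriber receives a higher predicted probability than a randomly chosen non-subscriber. A minimal Python sketch of that definition, using hypothetical toy data and a simple pairwise comparison (fine for illustration, slow for large samples):

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical toy labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.60, 0.70, 0.90]
print(auc(y_true, scores))
```

Because the AUC depends only on the ranking of the predictions, it is unaffected by the choice of classification threshold.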
Hosmer-Lemeshow test:
Hosmer and Lemeshow goodness of fit (GOF) test
data: as.numeric(as.character(test_data$y)), y_pred_prob
X-squared = 3685.5, df = 8, p-value < 2.2e-16
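The Hosmer-Lemeshow test shows a highly significant result (p < 0.001), formally suggesting a lack of fit. However, with very large datasets even small deviations can produce significant results, so this outcome has to be interpreted with caution. In addition, the test was run on the test set, which keeps the original class distribution, whereas the model was fitted on downsampled data; the predicted probabilities are therefore likely to overstate the subscription rate, which would also contribute to the large statistic.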
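The statistic compares observed and expected event counts within groups of cases sorted by predicted probability. The Python sketch below is a simplified version that assumes ten equal-sized groups; the toy data and the `inflate` helper are hypothetical and only illustrate how the statistic reacts to miscalibration.

```python
def hosmer_lemeshow(y_true, probs, groups=10):
    """Hosmer-Lemeshow statistic over equal-sized groups of sorted predictions.

    Returns (statistic, degrees_of_freedom) with df = groups - 2.
    """
    pairs = sorted(zip(probs, y_true), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        observed = sum(y for _, y in chunk)
        expected = sum(p for p, _ in chunk)
        mean_prob = expected / size
        statistic += (observed - expected) ** 2 / (size * mean_prob * (1 - mean_prob))
    return statistic, groups - 2

# Hypothetical toy data: ten blocks of 100 cases whose event rate equals
# the predicted probability (0.05, 0.15, ..., 0.95).
probs, y_true = [], []
for k in range(10):
    events = 5 + 10 * k
    probs += [events / 100] * 100
    y_true += [1] * events + [0] * (100 - events)

def inflate(p, odds_factor=4.0):
    """Multiply the odds of p by odds_factor (an arbitrary illustrative shift)."""
    odds = odds_factor * p / (1 - p)
    return odds / (1 + odds)

print(hosmer_lemeshow(y_true, probs))
print(hosmer_lemeshow(y_true, [inflate(p) for p in probs]))
```

In this toy example, probabilities that match the observed rates give a statistic of essentially zero, while the same outcomes scored with systematically inflated probabilities give a very large one, even though the ranking of cases is unchanged.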
McFadden pseudo R²: 0.4268418
A McFadden pseudo R² of 0.427 indicates a relatively strong explanatory power for a logistic regression model, suggesting that the selected predictors explain a substantial portion of the variation in the outcome.
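McFadden's pseudo R² is one minus the ratio of the fitted model's log-likelihood to the null model's log-likelihood. It can be checked against the deviances in the analysis of deviance table with a few lines of Python (the variable names are illustrative; the table values are rounded, so the result agrees with the reported figure only to about three decimals):

```python
# Deviances from the analysis of deviance table. For ungrouped binary data
# the deviance is -2 times the log-likelihood, so the deviance ratio equals
# the log-likelihood ratio in McFadden's formula.
null_deviance = 11697.6      # intercept-only model
residual_deviance = 6704.5   # fitted model

mcfadden_r2 = 1 - residual_deviance / null_deviance
print(round(mcfadden_r2, 3))
```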
Residual deviance: 6704.548 on 8399 degrees of freedom
The residual deviance of 6704.55 on 8399 degrees of freedom is well below the null deviance of 11697.6 on 8437 degrees of freedom, indicating that the selected predictors fit the data considerably better than the null model.
[1] 0.007302574
The maximum Cook’s distance (0.0073) is far below the commonly used cutoff of 1, which suggests that no single observation is highly influential and that the fitted model is not strongly affected by individual data points.
Cross-validated AUC: 0.9076934
The 5-fold cross-validated AUC of 0.908 is close to the AUC of 0.913 reported above, indicating that the model's classification performance is both excellent and stable across data splits.
Overall, the logistic regression model with selected predictors demonstrates strong explanatory power (McFadden pseudo R² = 0.427) and fits the data considerably better than the null model (residual deviance = 6704.55 on 8399 df). Its predictive performance is excellent (ROC AUC = 0.91; 5-fold cross-validated AUC = 0.908). The Hosmer-Lemeshow test is significant (p < 0.001), which is common in large datasets, so the model's ranking of customers is more reliable than its predicted probabilities taken at face value.