Eglė Vaitulevičiūtė – Junior Quantitative Analyst Task (Credit Risk Model Validation)

DATA VISUALISATION TASK

First, I check whether the columns really contain no missing (NA) values.

      age       job   marital education   default   balance   housing      loan 
        0         0         0         0         0         0         0         0 
  contact       day     month  duration  campaign     pdays  previous  poutcome 
        0         0         0         0         0         0         0         0 
        y 
        0

With no missing values present, the analysis proceeded to the next step. A summary of the dataset columns was generated to provide an overview of the distributions and types of the variables.

      age            job              marital           education        
 Min.   :18.00   Length:45211       Length:45211       Length:45211      
 1st Qu.:33.00   Class :character   Class :character   Class :character  
 Median :39.00   Mode  :character   Mode  :character   Mode  :character  
 Mean   :40.94                                                           
 3rd Qu.:48.00                                                           
 Max.   :95.00                                                           
   default             balance         housing              loan          
 Length:45211       Min.   : -8019   Length:45211       Length:45211      
 Class :character   1st Qu.:    72   Class :character   Class :character  
 Mode  :character   Median :   448   Mode  :character   Mode  :character  
                    Mean   :  1362                                        
                    3rd Qu.:  1428                                        
                    Max.   :102127                                        
   contact               day           month              duration     
 Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
 Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
 Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
                    Mean   :15.81                      Mean   : 258.2  
                    3rd Qu.:21.00                      3rd Qu.: 319.0  
                    Max.   :31.00                      Max.   :4918.0  
    campaign          pdays          previous          poutcome        
 Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
 1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
 Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
 Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
 3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
 Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
      y            
 Length:45211      
 Class :character  
 Mode  :character

The ten categorical columns are stored as character variables rather than factors, so they were converted to factor type for analysis. The summary then shows the counts per category.

      age                 job           marital          education    
 Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
 1st Qu.:33.00   management :9458   married :27214   secondary:23202  
 Median :39.00   technician :7597   single  :12790   tertiary :13301  
 Mean   :40.94   admin.     :5171                    unknown  : 1857  
 3rd Qu.:48.00   services   :4154                                     
 Max.   :95.00   retired    :2264                                     
                 (Other)    :6835                                     
 default        balance       housing      loan            contact     
 yes:  815   Min.   : -8019   yes:25130   yes: 7244   cellular :29285  
 no :44396   1st Qu.:    72   no :20081   no :37967   telephone: 2906  
             Median :   448                           unknown  :13020  
             Mean   :  1362                                            
             3rd Qu.:  1428                                            
             Max.   :102127                                            
                                                                       
      day            month          duration         campaign     
 Min.   : 1.00   may    :13766   Min.   :   0.0   Min.   : 1.000  
 1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000  
 Median :16.00   aug    : 6247   Median : 180.0   Median : 2.000  
 Mean   :15.81   jun    : 5341   Mean   : 258.2   Mean   : 2.764  
 3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000  
 Max.   :31.00   apr    : 2932   Max.   :4918.0   Max.   :63.000  
                 (Other): 6060                                    
     pdays          previous           poutcome       y        
 Min.   : -1.0   Min.   :  0.0000   failure: 4901   yes: 5289  
 1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840   no :39922  
 Median : -1.0   Median :  0.0000   success: 1511              
 Mean   : 40.2   Mean   :  0.5803   unknown:36959              
 3rd Qu.: -1.0   3rd Qu.:  0.0000                              
 Max.   :871.0   Max.   :275.0000

The categories in the qualitative variables have been identified. Now, I inspect the data to ensure it is clean and properly formatted.

Rows: 45,211
Columns: 17
$ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
$ job       <fct> management, technician, entrepreneur, blue-collar, unknown, …
$ marital   <fct> married, single, married, married, single, married, single, …
$ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
$ default   <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
$ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
$ housing   <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
$ loan      <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
$ contact   <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
$ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
$ month     <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
$ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
$ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
$ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ poutcome  <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
$ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …

The data appears to be in good order, so I will now examine its distribution. I’ll begin with the numerical variables.

Now I will examine the distribution of numerical variables by the target variable y.

The ‘yes’ class in y is underrepresented, so it is hard to draw conclusions from these plots. Next, I will analyze the distribution of the categorical variables.

Certain variables, including default, loan, and poutcome, exhibit a large imbalance between their categories. It would be interesting to see if the distribution is similar when grouped by the variable y.

Similar patterns are observed between the groups of the target variable y. What is the percentage distribution of the target variable y?
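
The class shares can be read directly off the counts in the factor summary above (yes: 5289, no: 39922). The short Python sketch below only illustrates that arithmetic; the variable names are hypothetical and it is not the code used for the report.

```python
# Class counts of the target y, taken from the summary table above.
counts = {"yes": 5289, "no": 39922}

total = sum(counts.values())  # number of rows in the dataset
shares = {label: 100 * n / total for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
```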

The target variable y is highly imbalanced: 5,289 of 45,211 customers (11.7%) subscribed and 39,922 (88.3%) did not, which could pose challenges during model development. Next, a correlation heatmap was created to visually assess relationships between numerical features, helping identify strongly correlated variables that might affect model performance.

The correlation heatmap shows generally weak relationships among numerical variables, with the only notable moderate positive correlation (≈0.45) between ‘previous’ and ‘pdays’, indicating low multicollinearity overall but warranting attention to these two variables when modeling.
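
Each cell of such a heatmap is a pairwise correlation coefficient. As a minimal sketch, assuming Pearson correlation was used (the report does not say which coefficient), the calculation looks like this in Python. The toy values are hypothetical and are not taken from the dataset.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy values standing in for 'pdays' and 'previous'.
pdays_toy = [-1, -1, 90, 180, -1, 365]
previous_toy = [0, 0, 2, 1, 0, 4]
r = pearson(pdays_toy, previous_toy)
print(round(r, 3))
```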

MODELLING TASK

Before modeling, the target variable ‘y’ was converted to numeric (1 for ‘yes’, 0 for ‘no’), categorical features were encoded as numeric factors to make them usable by algorithms, and the numerical variables were standardized to have mean 0 and standard deviation 1, which puts all predictors on a common scale and makes their coefficients easier to compare.
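
The two transformations described above can be sketched as follows. This is an illustrative Python version with hypothetical function names. It assumes the sample standard deviation (n - 1 denominator) was used for scaling, and the balance values are the first eight shown in the data preview.

```python
from statistics import mean, stdev

def encode_target(labels):
    """Map 'yes' -> 1 and 'no' -> 0."""
    return [1 if label == "yes" else 0 for label in labels]

def standardize(values):
    """Centre to mean 0 and scale to sample standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# First eight 'balance' values from the data preview above.
balance_z = standardize([2143, 29, 2, 1506, 1, 231, 447, 2])

# Hypothetical labels, only to show the encoding.
y_num = encode_target(["no", "yes", "no"])
```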

Then, the dataset was split into 80% training and 20% test sets. A logistic regression model was trained on the training data using all predictors, and predictions were made on the test set. Predicted probabilities were converted to binary outcomes using a 0.5 threshold. Model performance was evaluated using accuracy and a confusion matrix to assess classification results.
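
A minimal sketch of the split and the 0.5 cut-off, in Python. The seed, the shuffling scheme, and the use of a strict `>` comparison are assumptions; the report does not state them. With 45,211 rows, a 20% share rounds to 9,042 test rows, which matches the totals of the confusion matrices below.

```python
import random

def train_test_split(n_rows, test_share=0.2, seed=42):
    """Return shuffled row indices split into (train, test) parts."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_test = round(n_rows * test_share)
    return idx[n_test:], idx[:n_test]

def classify(probs, threshold=0.5):
    """Label a case 1 when its predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]

train_idx, test_idx = train_test_split(45211)
print(len(train_idx), len(test_idx))
```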

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 7776  695
         1  196  375
                                          
               Accuracy : 0.9015          
                 95% CI : (0.8951, 0.9075)
    No Information Rate : 0.8817          
    P-Value [Acc > NIR] : 1.273e-09       
                                          
                  Kappa : 0.4083          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9754          
            Specificity : 0.3505          
         Pos Pred Value : 0.9180          
         Neg Pred Value : 0.6567          
             Prevalence : 0.8817          
         Detection Rate : 0.8600          
   Detection Prevalence : 0.9369          
      Balanced Accuracy : 0.6629          
                                          
       'Positive' Class : 0

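The first logistic regression model achieved an overall accuracy of 90.15%, only slightly above the no-information rate of 88.17%. Because the confusion matrix treats class 0 (no subscription) as the positive class, sensitivity (97.5%) measures how well non-subscribers are identified and specificity (35.1%) how well subscribers are identified. Due to class imbalance, the model performed well on the former but poorly on the latter, resulting in a balanced accuracy of 66.3%.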
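
The reported statistics follow directly from the four cells of the confusion matrix. The Python sketch below (with a hypothetical helper name) recomputes them with class 0 as the positive class, as in the output above.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Metrics for a 2x2 table; 'positive' is whichever class tp/fn refer to."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    balanced = (sensitivity + specificity) / 2
    return sensitivity, specificity, accuracy, balanced

# Positive class = 0 (non-subscribers):
#   7776 true 0s predicted as 0, 196 true 0s predicted as 1,
#   695 true 1s predicted as 0,  375 true 1s predicted as 1.
sens, spec, acc, bal = confusion_metrics(tp=7776, fn=196, fp=695, tn=375)
print(f"{sens:.4f} {spec:.4f} {acc:.4f} {bal:.4f}")
```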

To improve the model, a second logistic regression was trained using all predictors, and the optimal classification threshold was then determined using the Youden index.

Youden threshold: 0.08696519

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6286  105
         1 1686  965
                                          
               Accuracy : 0.8019          
                 95% CI : (0.7936, 0.8101)
    No Information Rate : 0.8817          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.4211          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.7885          
            Specificity : 0.9019          
         Pos Pred Value : 0.9836          
         Neg Pred Value : 0.3640          
             Prevalence : 0.8817          
         Detection Rate : 0.6952          
   Detection Prevalence : 0.7068          
      Balanced Accuracy : 0.8452          
                                          
       'Positive' Class : 0

By moving the threshold from 0.5 to the Youden-optimal value (about 0.087), the model became more balanced: specificity (subscribers correctly identified) rose from 35.1% to 90.2%, while sensitivity (non-subscribers correctly identified) fell from 97.5% to 78.9%, and balanced accuracy improved from 66.3% to 84.5%. Overall accuracy dropped from 90.2% to 80.2%, below the no-information rate of 88.2%, but the Youden-adjusted threshold provides a better trade-off between correctly identifying subscribers and non-subscribers, making the model more suitable for imbalanced data.
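
The Youden index is J = sensitivity + specificity - 1, and the Youden threshold is the cut-off that maximises J. The Python sketch below scans the observed probabilities as candidate cut-offs. This is a simplification, and the R routine used for the report may choose candidates differently. The toy scores are hypothetical; the last line recomputes J from the confusion matrix above.

```python
def youden_threshold(probs, labels):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    Class 1 is treated as positive here; J is symmetric in the two classes,
    so the chosen threshold does not depend on that convention.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < t and y == 0)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Hypothetical toy scores, not from the report.
probs = [0.02, 0.05, 0.08, 0.10, 0.30, 0.60]
labels = [0, 0, 0, 1, 0, 1]
t_star, j_star = youden_threshold(probs, labels)

# J implied by the confusion matrix above (6286, 105, 1686, 965).
j_reported = 6286 / (6286 + 1686) + 965 / (965 + 105) - 1
```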

Next, a logistic regression model with stepwise variable selection based on AIC was fitted, and the optimal threshold was determined using the Youden index to balance sensitivity and specificity.

Youden threshold: 0.0876302

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6299  106
         1 1673  964
                                          
               Accuracy : 0.8033          
                 95% CI : (0.7949, 0.8114)
    No Information Rate : 0.8817          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.4229          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.7901          
            Specificity : 0.9009          
         Pos Pred Value : 0.9835          
         Neg Pred Value : 0.3656          
             Prevalence : 0.8817          
         Detection Rate : 0.6966          
   Detection Prevalence : 0.7084          
      Balanced Accuracy : 0.8455          
                                          
       'Positive' Class : 0

Applying the Youden index to the stepwise AIC model resulted in performance metrics very similar to the previous model (balanced accuracy 84.55% versus 84.52%). Next, I will attempt to balance the target variable y by downsampling the majority class to create a more similar class distribution, while still applying stepwise AIC for variable selection and using the Youden index to determine the optimal classification threshold.

Youden threshold: 0.432272

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6523  135
         1 1449  935
                                          
               Accuracy : 0.8248          
                 95% CI : (0.8168, 0.8326)
    No Information Rate : 0.8817          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.4519          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8182          
            Specificity : 0.8738          
         Pos Pred Value : 0.9797          
         Neg Pred Value : 0.3922          
             Prevalence : 0.8817          
         Detection Rate : 0.7214          
   Detection Prevalence : 0.7363          
      Balanced Accuracy : 0.8460          
                                          
       'Positive' Class : 0

After balancing the target variable by downsampling the majority class and applying stepwise AIC with the Youden index threshold, sensitivity and specificity moved closer together (81.8% and 87.4%, compared with 79.0% and 90.1% for the previous model). Accuracy rose from 80.3% to 82.5%, while balanced accuracy was essentially unchanged (84.60% versus 84.55%), so the gain comes mainly from better classification of non-subscribers without substantially compromising overall predictive ability.
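
Downsampling can be sketched as below, in Python with a hypothetical helper and toy data. The sketch assumes that only the training set is resampled and that the classes are made exactly equal in size. The test-set prevalence of 0.8817 in the output above is consistent with the test set keeping its original distribution.

```python
import random

def downsample_majority(rows, label_key="y", seed=1):
    """Randomly drop majority-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy training set: 90 non-subscribers, 10 subscribers.
train = [{"id": i, "y": 0} for i in range(90)]
train += [{"id": i, "y": 1} for i in range(90, 100)]
balanced_train = downsample_majority(train)
print(len(balanced_train))
```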

Next, in a further attempt to improve the model, a logistic regression was fitted to data balanced with ROSE, with variable selection via stepwise AIC and the optimal classification threshold determined using the Youden index.

Youden threshold: 0.4399083

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6522  143
         1 1450  927
                                          
               Accuracy : 0.8238          
                 95% CI : (0.8158, 0.8316)
    No Information Rate : 0.8817          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.4477          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8664          
            Specificity : 0.8181          
         Pos Pred Value : 0.3900          
         Neg Pred Value : 0.9785          
             Prevalence : 0.1183          
         Detection Rate : 0.1025          
   Detection Prevalence : 0.2629          
      Balanced Accuracy : 0.8422          
                                          
       'Positive' Class : 1

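Note that the ROSE output above reports class 1 as the positive class, so its sensitivity (86.6%) refers to subscribers and its specificity (81.8%) to non-subscribers, the reverse of the earlier tables. Compared like for like, the logistic regression model trained on the downsampled majority class with stepwise AIC and the Youden index performed slightly better (accuracy 82.5% versus 82.4%, balanced accuracy 84.6% versus 84.2%), making it the preferred approach for estimating the probability that a customer has subscribed to a term deposit.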

MODEL TESTING TASK

Since the best results were obtained from the logistic regression model using the downsampled majority class with stepwise AIC and the Youden index, I will proceed to evaluate this model’s goodness of fit, predictive ability, and other relevant performance aspects.

Sensitivity, specificity, and other metrics derived from the confusion matrix will not be re-evaluated, as they have already been analyzed earlier.

Analysis of Deviance Table

Model 1: y ~ 1
Model 2: y ~ job + marital + education + balance + housing + loan + contact + 
    month + duration + campaign + previous + poutcome
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1      8437    11697.6                          
2      8399     6704.5 38     4993 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood ratio test indicates that the logistic regression model with selected predictors provides a significantly better fit than the null model (χ² = 4993 on 38 degrees of freedom, p < 0.001), confirming that the included variables collectively improve prediction of customer subscription.
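
The figures in the deviance table can be checked by hand. The Python sketch below recomputes the likelihood ratio statistic and compares it with an approximate 0.1% critical value. The Wilson-Hilferty approximation stands in for an exact chi-square quantile and is an assumption of this sketch, not the method used in the report. It also shows the AIC used in stepwise selection, which for ungrouped binary data equals the deviance plus twice the number of coefficients.

```python
from math import sqrt

null_dev, null_df = 11697.6, 8437    # Model 1: y ~ 1
resid_dev, resid_df = 6704.5, 8399   # Model 2: selected predictors

lr_stat = null_dev - resid_dev       # likelihood ratio statistic
lr_df = null_df - resid_df           # number of added coefficients

def chi2_critical_approx(df, z):
    """Wilson-Hilferty approximation to a chi-square quantile.

    z is the standard normal quantile for the desired level.
    """
    c = 2 / (9 * df)
    return df * (1 - c + z * sqrt(c)) ** 3

# z = 3.0902 is the 99.9% standard normal quantile.
crit_001 = chi2_critical_approx(lr_df, 3.0902)

# AIC = deviance + 2k, where k also counts the intercept.
aic_null = null_dev + 2 * 1
aic_model = resid_dev + 2 * (lr_df + 1)
```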

AUC: 0.9126283

The ROC curve and an AUC of 0.91 indicate that the stepwise AIC logistic regression model discriminates very well between subscribers and non-subscribers.
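
The AUC can be read as the probability that a randomly chosen subscriber receives a higher predicted probability than a randomly chosen non-subscriber. A minimal Python sketch of that pairwise definition, using hypothetical toy scores:

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half. This O(n_pos * n_neg) form is fine for small inputs.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy scores: three of the four pairs are ranked correctly.
toy_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(toy_auc)
```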

Hosmer-Lemeshow test:


    Hosmer and Lemeshow goodness of fit (GOF) test

data:  as.numeric(as.character(test_data$y)), y_pred_prob
X-squared = 3685.5, df = 8, p-value < 2.2e-16

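The Hosmer-Lemeshow test shows a highly significant result (X² = 3685.5, df = 8, p < 0.001), formally indicating a lack of fit. With very large samples even small deviations can produce significant results, so this outcome has to be interpreted with caution. A statistic of this size, however, probably also reflects the fact that the model was trained on downsampled data (roughly half subscribers) and tested at the original prevalence (11.8% subscribers), which tends to push predicted probabilities too high; the probabilities may therefore need recalibration before being read as subscription probabilities.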
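
The Hosmer-Lemeshow statistic compares observed and expected event counts within groups of cases sorted by predicted probability; with 10 groups it has 10 - 2 = 8 degrees of freedom, as in the output above. The Python sketch below uses equal-size groups, which may differ in detail from the grouping of the R routine, and hypothetical toy data.

```python
def hosmer_lemeshow(probs, labels, groups=10):
    """Return (statistic, df) using equal-size groups sorted by probability."""
    pairs = sorted(zip(probs, labels))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of 1s
        observed = sum(y for _, y in chunk)   # observed number of 1s
        # (O - E)^2 / E terms for the 1s and for the 0s of this group.
        stat += (observed - expected) ** 2 / expected
        stat += ((size - observed) - (size - expected)) ** 2 / (size - expected)
    return stat, groups - 2

# Toy data: in each block of 11 cases with probability k/11, exactly k are 1s.
probs, labels = [], []
for k in range(1, 11):
    probs += [k / 11] * 11
    labels += [1] * k + [0] * (11 - k)

stat_cal, df = hosmer_lemeshow(probs, labels)
# Shifting every probability upwards breaks calibration and inflates the statistic.
stat_shift, _ = hosmer_lemeshow([p + 0.05 for p in probs], labels)
```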

McFadden pseudo R²: 0.4268418

A McFadden pseudo R² of 0.427 indicates a relatively strong explanatory power for a logistic regression model, suggesting that the selected predictors explain a substantial portion of the variation in the outcome.
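
For ungrouped binary data, McFadden's pseudo R² equals 1 minus the ratio of residual to null deviance, so the value above can be reproduced from the deviance table. The Python sketch below also includes a consistency check that rests on an assumption not stated in the report, namely that the downsampled training set is exactly balanced.

```python
from math import log

null_deviance = 11697.6       # from the analysis of deviance table
residual_deviance = 6704.548  # residual deviance of the selected model

mcfadden_r2 = 1 - residual_deviance / null_deviance

# Assumption: exactly balanced training data. An intercept-only model then
# predicts 0.5 for every case, so the null deviance is 2 * n * ln(2),
# with n = residual df of the null model + 1.
n_train = 8437 + 1
null_deviance_balanced = 2 * n_train * log(2)

print(round(mcfadden_r2, 4), round(null_deviance_balanced, 1))
```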

Residual deviance: 6704.548 on 8399 degrees of freedom

The residual deviance of 6704.55 on 8399 degrees of freedom is well below the null deviance of 11697.6 on 8437 degrees of freedom, indicating that the selected predictors substantially improve the fit relative to the null model.

[1] 0.007302574

The maximum Cook’s distance (0.0073) is far below the commonly used cut-off of 1, indicating that no single observation is strongly influential and that the regression model is stable with respect to individual data points.

Cross - validated AUC: 0.9076934

The 5-fold cross-validated AUC of 0.908 is close to the AUC of 0.913 reported above, indicating that the excellent discrimination of the logistic regression model is stable across data splits.
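
In k-fold cross-validation the data are split into k parts, and each part serves once as the validation set while the model is refitted on the rest. The Python sketch below shows only the index bookkeeping and the averaging step. The function names, seed, row count, and per-fold scores are hypothetical, and the model fitting itself is omitted.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for k roughly equal folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, valid

def mean_cv_score(fold_scores):
    """Average a per-fold metric such as AUC."""
    return sum(fold_scores) / len(fold_scores)

splits = list(k_fold_indices(100, k=5))

# Hypothetical per-fold AUC values, only to show the averaging step.
cv_auc = mean_cv_score([0.90, 0.91, 0.92, 0.90, 0.92])
```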

Overall, the logistic regression model with selected predictors demonstrates strong explanatory power (McFadden pseudo R² = 0.427), fits the data reasonably well (residual deviance = 6704.55 on 8399 df), and shows excellent predictive performance (ROC AUC = 0.91; 5-fold cross-validated AUC = 0.908). The Hosmer-Lemeshow test is significant (p < 0.001); this is common in large datasets, but it means the predicted probabilities should be used with caution.