Group Members

  • Daniel Vaati
  • David Mauti
  • Angela Omweri
  • Emmanuel Bett
  • Kelvin Ngigi
  • Emmy Cheruiyot
  • Mike Mbumbu

Introduction

This project analyzes a large loan applicant dataset to build predictive models for both income estimation and loan default risk. We apply statistical and machine learning techniques, with a strong emphasis on linear regression. The analysis includes data exploration, model building, and diagnostic checks to verify that the model assumptions hold. Model performance is compared using appropriate evaluation metrics to select the best approach. Finally, we interpret the results to draw insights about credit risk and financial behavior.

Objectives

  • Analyze relationships between applicant features and loan outcomes
  • Build a linear regression model for income prediction
  • Develop classification models for loan default
  • Evaluate model performance using appropriate metrics

Methodology

We follow a structured approach:

  • Data loading and cleaning
  • Exploratory Data Analysis (EDA)
  • Model development
  • Model evaluation
  • Interpretation of results

Loading Required Libraries

In this section, we load all the essential libraries required for data manipulation, visualization, statistical analysis, and machine learning tasks throughout the project. This ensures a consistent and efficient workflow across all stages of the analysis, including data preprocessing, exploratory data analysis, regression modeling, and classification.

## All libraries loaded successfully.

Load data

  • In this section, we import the dataset into R for analysis.
##   person_age person_gender person_education person_income person_emp_exp
## 1         22        female           Master         71948              0
## 2         21        female      High School         12282              0
## 3         25        female      High School         12438              3
## 4         23        female         Bachelor         79753              0
## 5         24          male           Master         66135              1
## 6         21        female      High School         12951              0
##   person_home_ownership loan_amnt loan_intent loan_int_rate loan_percent_income
## 1                  RENT     35000    PERSONAL         16.02                0.49
## 2                   OWN      1000   EDUCATION         11.14                0.08
## 3              MORTGAGE      5500     MEDICAL         12.87                0.44
## 4                  RENT     35000     MEDICAL         15.23                0.44
## 5                  RENT     35000     MEDICAL         14.27                0.53
## 6                   OWN      2500     VENTURE          7.14                0.19
##   cb_person_cred_hist_length credit_score previous_loan_defaults_on_file
## 1                          3          561                             No
## 2                          2          504                            Yes
## 3                          3          635                             No
## 4                          2          675                             No
## 5                          4          586                             No
## 6                          2          532                             No
##   loan_status
## 1           1
## 2           0
## 3           1
## 4           1
## 5           1
## 6           1
##       person_age person_gender person_education person_income person_emp_exp
## 44995         24        female        Associate         31924              2
## 44996         27          male        Associate         47971              6
## 44997         37        female        Associate         65800             17
## 44998         33          male        Associate         56942              7
## 44999         29          male         Bachelor         33164              4
## 45000         24          male      High School         51609              1
##       person_home_ownership loan_amnt       loan_intent loan_int_rate
## 44995                  RENT     12229           MEDICAL         10.70
## 44996                  RENT     15000           MEDICAL         15.66
## 44997                  RENT      9000   HOMEIMPROVEMENT         14.07
## 44998                  RENT      2771 DEBTCONSOLIDATION         10.02
## 44999                  RENT     12000         EDUCATION         13.23
## 45000                  RENT      6665 DEBTCONSOLIDATION         17.05
##       loan_percent_income cb_person_cred_hist_length credit_score
## 44995                0.38                          4          678
## 44996                0.31                          3          645
## 44997                0.14                         11          621
## 44998                0.05                         10          668
## 44999                0.36                          6          604
## 45000                0.13                          3          628
##       previous_loan_defaults_on_file loan_status
## 44995                             No           1
## 44996                             No           1
## 44997                             No           1
## 44998                             No           1
## 44999                             No           1
## 45000                             No           1
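
The report does not show the import call or the file name, so the following is only a minimal sketch of how such a load and first/last-rows preview could be done in base R. The inline CSV text is a hypothetical stand-in for the real file, built from a few values visible in the preview above.

```r
# Hypothetical stand-in for the real CSV file (the file name is not given in the report).
csv_text <- "person_age,person_gender,person_income,loan_status
22,female,71948,1
21,female,12282,0
25,female,12438,1"

# With a real file this would be read.csv("<path to file>", stringsAsFactors = FALSE).
loans <- read.csv(text = csv_text, stringsAsFactors = FALSE)

head(loans, 2)  # first rows
tail(loans, 2)  # last rows
```

Keeping `stringsAsFactors = FALSE` leaves categorical columns as character vectors, which matches the `chr` types reported in the structure output further down.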

Preliminary Data Checks

Before proceeding with any analysis, we perform initial checks to understand the structure, quality, and integrity of the dataset. This includes reviewing variable types, detecting missing values, and obtaining a general statistical summary of the data.

## [1] 45000    14
## [1] FALSE
##                     person_age                  person_gender 
##                              0                              0 
##               person_education                  person_income 
##                              0                              0 
##                 person_emp_exp          person_home_ownership 
##                              0                              0 
##                      loan_amnt                    loan_intent 
##                              0                              0 
##                  loan_int_rate            loan_percent_income 
##                              0                              0 
##     cb_person_cred_hist_length                   credit_score 
##                              0                              0 
## previous_loan_defaults_on_file                    loan_status 
##                              0                              0
## 'data.frame':    45000 obs. of  14 variables:
##  $ person_age                    : num  22 21 25 23 24 21 26 24 24 21 ...
##  $ person_gender                 : chr  "female" "female" "female" "female" ...
##  $ person_education              : chr  "Master" "High School" "High School" "Bachelor" ...
##  $ person_income                 : num  71948 12282 12438 79753 66135 ...
##  $ person_emp_exp                : int  0 0 3 0 1 0 1 5 3 0 ...
##  $ person_home_ownership         : chr  "RENT" "OWN" "MORTGAGE" "RENT" ...
##  $ loan_amnt                     : num  35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
##  $ loan_intent                   : chr  "PERSONAL" "EDUCATION" "MEDICAL" "MEDICAL" ...
##  $ loan_int_rate                 : num  16 11.1 12.9 15.2 14.3 ...
##  $ loan_percent_income           : num  0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...
##  $ cb_person_cred_hist_length    : num  3 2 3 2 4 2 3 4 2 3 ...
##  $ credit_score                  : int  561 504 635 675 586 532 701 585 544 640 ...
##  $ previous_loan_defaults_on_file: chr  "No" "Yes" "No" "No" ...
##  $ loan_status                   : int  1 0 1 1 1 1 1 1 1 1 ...
##    person_age     person_gender      person_education   person_income    
##  Min.   : 20.00   Length:45000       Length:45000       Min.   :   8000  
##  1st Qu.: 24.00   Class :character   Class :character   1st Qu.:  47204  
##  Median : 26.00   Mode  :character   Mode  :character   Median :  67048  
##  Mean   : 27.76                                         Mean   :  80319  
##  3rd Qu.: 30.00                                         3rd Qu.:  95789  
##  Max.   :144.00                                         Max.   :7200766  
##  person_emp_exp   person_home_ownership   loan_amnt     loan_intent       
##  Min.   :  0.00   Length:45000          Min.   :  500   Length:45000      
##  1st Qu.:  1.00   Class :character      1st Qu.: 5000   Class :character  
##  Median :  4.00   Mode  :character      Median : 8000   Mode  :character  
##  Mean   :  5.41                         Mean   : 9583                     
##  3rd Qu.:  8.00                         3rd Qu.:12237                     
##  Max.   :125.00                         Max.   :35000                     
##  loan_int_rate   loan_percent_income cb_person_cred_hist_length  credit_score  
##  Min.   : 5.42   Min.   :0.0000      Min.   : 2.000             Min.   :390.0  
##  1st Qu.: 8.59   1st Qu.:0.0700      1st Qu.: 3.000             1st Qu.:601.0  
##  Median :11.01   Median :0.1200      Median : 4.000             Median :640.0  
##  Mean   :11.01   Mean   :0.1397      Mean   : 5.867             Mean   :632.6  
##  3rd Qu.:12.99   3rd Qu.:0.1900      3rd Qu.: 8.000             3rd Qu.:670.0  
##  Max.   :20.00   Max.   :0.6600      Max.   :30.000             Max.   :850.0  
##  previous_loan_defaults_on_file  loan_status    
##  Length:45000                   Min.   :0.0000  
##  Class :character               1st Qu.:0.0000  
##  Mode  :character               Median :0.0000  
##                                 Mean   :0.2222  
##                                 3rd Qu.:0.0000  
##                                 Max.   :1.0000
## [1] 0

The dataset contains 45,000 observations and 14 variables with no missing values or duplicate records, indicating a complete dataset. Variables include a mix of numerical and categorical features related to applicant demographics, financial status, and credit history. The summaries reveal extreme values, some of them implausible (a maximum age of 144, up to 125 years of employment experience, and a maximum income of about 7.2 million), and a class imbalance in the target variable: the mean of loan_status is 0.2222, so about 22% of records belong to class 1. Overall, the data is well-structured and ready for further preprocessing and analysis.
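
The checks summarized above (dimensions, missing values, structure, summary statistics, duplicates) can be sketched in base R as follows. This is an illustration on a toy data frame, not the report's own code; the object name `loans` and the six rows are assumptions chosen to mirror the preview shown earlier.

```r
# Toy stand-in for the loan data; the real data frame has 45,000 rows and 14 columns.
loans <- data.frame(
  person_age    = c(22, 21, 25, 23, 24, 21),
  person_income = c(71948, 12282, 12438, 79753, 66135, 12951),
  loan_intent   = c("PERSONAL", "EDUCATION", "MEDICAL", "MEDICAL", "MEDICAL", "VENTURE"),
  loan_status   = c(1L, 0L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

dims        <- dim(loans)               # number of rows and columns
has_missing <- anyNA(loans)             # TRUE if any cell is NA
na_per_col  <- colSums(is.na(loans))    # missing-value count per variable
n_dupes     <- sum(duplicated(loans))   # number of fully duplicated rows

str(loans)      # variable types
summary(loans)  # per-variable summaries
```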

Exploratory Data Analysis (EDA)

In this section, we explore the dataset by examining the distribution of the target variable, numerical features, and categorical variables to understand its overall structure. The analysis focuses on identifying relationships between predictors and loan_status to uncover early patterns useful for modeling. This is supported using distribution plots, boxplots for outlier detection, and correlation analysis among numerical variables.

Target Variable Analysis

## 
##        0        1 
## 77.77778 22.22222
  • The target is imbalanced: about 77.8% of records have loan_status = 0 and 22.2% have loan_status = 1, so the majority class outnumbers the minority class by roughly 3.5 to 1.
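
A class-balance table like the one above can be produced with `table()` and `prop.table()`. The sketch below uses a toy vector with seven zeros and two ones, which happens to reproduce the same 7:2 split; the vector itself is an assumption for illustration.

```r
# Toy target vector with the same 7:2 class split as the report's output.
loan_status <- c(rep(0L, 7), rep(1L, 2))

# Percentage of records in each class.
class_pct <- prop.table(table(loan_status)) * 100
round(class_pct, 2)

# Ratio of majority to minority class share.
imbalance_ratio <- max(class_pct) / min(class_pct)
```

A ratio well above 1 is a signal that accuracy alone may be a misleading metric for the later classification models.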

Numerical Variable Analysis

- Most numerical variables show varied distributions, with some exhibiting skewness.

- Several variables contain outliers, particularly income and age-related features.

Categorical Variable Analysis

- The categorical variables show clear group distributions, with some categories dominating others.

Relationship with Target Variable

- There are visible differences in distributions of key numerical variables across loan status groups, suggesting these features may be strong predictors of repayment behavior.

- Certain categorical groups show different proportions of default and repayment outcomes, indicating that demographic and behavioral factors influence loan performance.

Correlation Analysis

- Several variables exhibit moderate correlations, particularly financial attributes.

- Higher credit scores are generally associated with lower default rates, indicating credit score is a strong predictor of loan performance.

- There is a visible relationship between income and loan amount, suggesting lenders may align loan size with borrower income levels.

- Non-defaulting borrowers tend to face higher interest rates.
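
As a rough sketch of how the correlation analysis among numerical variables could be computed, the example below selects the numeric columns of a data frame and passes them to `cor()`. The six rows are taken from the data preview earlier in the report, so the resulting coefficients are illustrative only and say nothing about the full dataset.

```r
# Six rows from the preview; far too few for real conclusions.
loans <- data.frame(
  person_income = c(71948, 12282, 12438, 79753, 66135, 12951),
  loan_amnt     = c(35000, 1000, 5500, 35000, 35000, 2500),
  loan_int_rate = c(16.02, 11.14, 12.87, 15.23, 14.27, 7.14),
  credit_score  = c(561, 504, 635, 675, 586, 532),
  loan_intent   = c("PERSONAL", "EDUCATION", "MEDICAL", "MEDICAL", "MEDICAL", "VENTURE"),
  stringsAsFactors = FALSE
)

# Keep only numeric columns, then compute pairwise Pearson correlations.
num_vars <- loans[sapply(loans, is.numeric)]
cor_mat  <- cor(num_vars, method = "pearson")
round(cor_mat, 2)
```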

Data Cleaning & Preprocessing

To stabilize our models, we focused on two areas:

  • Winsorization: We capped Age and Experience at the 95th percentile. This neutralizes extreme outliers (e.g., 100+ years of age) that would otherwise disproportionately pull the regression line.
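  • One-Hot Encoding: Categorical data was converted into binary indicator columns, so the model treats categories like “Loan Intent” as distinct groups rather than points on an artificial numerical scale. The exception is person_education, which the structure below shows was encoded as a single ordinal score (for example, High School = 1 and Master = 4).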
## 'data.frame':    45000 obs. of  21 variables:
##  $ person_age                       : num  22 21 25 23 24 21 26 24 24 21 ...
##  $ person_income                    : num  71948 12282 12438 79753 66135 ...
##  $ person_emp_exp                   : num  0 0 3 0 1 0 1 5 3 0 ...
##  $ loan_amnt                        : num  35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
##  $ loan_int_rate                    : num  16 11.1 12.9 15.2 14.3 ...
##  $ loan_percent_income              : num  0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...
##  $ cb_person_cred_hist_length       : num  3 2 3 2 4 2 3 4 2 3 ...
##  $ credit_score                     : int  561 504 635 675 586 532 701 585 544 640 ...
##  $ loan_status                      : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
##  $ person_genderfemale              : num  1 1 1 1 0 1 1 1 1 1 ...
##  $ person_gendermale                : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ person_education                 : num  4 1 1 3 4 1 3 1 2 1 ...
##  $ person_home_ownershipOTHER       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ person_home_ownershipOWN         : num  0 1 0 0 0 1 0 0 0 1 ...
##  $ person_home_ownershipRENT        : num  1 0 0 1 1 0 1 1 1 0 ...
##  $ loan_intentEDUCATION             : num  0 1 0 0 0 0 1 0 0 0 ...
##  $ loan_intentHOMEIMPROVEMENT       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ loan_intentMEDICAL               : num  0 0 1 1 1 0 0 1 0 0 ...
##  $ loan_intentPERSONAL              : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ loan_intentVENTURE               : num  0 0 0 0 0 1 0 0 0 1 ...
##  $ previous_loan_defaults_on_fileYes: num  0 1 0 0 0 0 0 0 0 0 ...
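
The two preprocessing steps can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the report's code: the helper `cap_upper()` is hypothetical, the toy rows are invented, and the education ordering is inferred from the structure output above (High School = 1, Bachelor = 3, Master = 4), with Associate assumed to be 2.

```r
# Hypothetical helper: cap values above the p-th quantile (upper-tail winsorization).
cap_upper <- function(x, p = 0.95) {
  limit <- quantile(x, probs = p, na.rm = TRUE, names = FALSE)
  pmin(x, limit)
}

loans <- data.frame(
  person_age            = c(22, 21, 25, 23, 24, 144),
  person_home_ownership = c("RENT", "OWN", "MORTGAGE", "RENT", "RENT", "OWN"),
  person_education      = c("Master", "High School", "High School",
                            "Bachelor", "Master", "Associate"),
  stringsAsFactors = FALSE
)

# Cap the extreme age; values below the limit are left unchanged.
loans$person_age <- cap_upper(loans$person_age)

# Indicator columns; the intercept is dropped and the first level
# (MORTGAGE, alphabetically) acts as the reference category.
dummies <- model.matrix(~ person_home_ownership, data = loans)[, -1, drop = FALSE]

# Ordinal score for education (assumed ordering; unlisted levels would become NA).
edu_levels <- c("High School", "Associate", "Bachelor", "Master")
loans$person_education <- as.numeric(factor(loans$person_education, levels = edu_levels))
```

Capping keeps every row in the data, which is why the observation count is unchanged after cleaning.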

Post-Cleaning EDA

Our cleaned distributions show:

  • Income & Loan Amount: Both remain right-skewed, typical of financial data.
  • Credit Score: Displays a healthy, near-normal distribution.
  • Outliers: Because Winsorization caps extreme values rather than removing rows, the “After Winsorization” boxplots show much tighter, more representative ranges for Age and Experience while all 45,000 observations are retained.

Income Modeling

We are predicting Personal Income. To ensure a high-integrity model, we applied two filters:

  • No Leakage: Dropped loan_percent_income, because it is computed from the target (in the rows shown earlier it matches loan_amnt divided by person_income, e.g. 35000 / 71948 ≈ 0.49).
  • Bias Mitigation: Removed Gender variables to ensure the model focuses on financial merit and professional history rather than protected demographics.
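
Dropping these columns before modeling could look like the sketch below. The data frame name `loans_encoded` and its toy contents are assumptions; only the column names come from the encoded structure shown earlier.

```r
# Toy encoded data with the columns relevant to this step.
loans_encoded <- data.frame(
  person_income       = c(71948, 12282, 12438),
  loan_amnt           = c(35000, 1000, 5500),
  loan_percent_income = c(0.49, 0.08, 0.44),
  person_genderfemale = c(1, 1, 1),
  person_gendermale   = c(0, 0, 0)
)

# Columns excluded from the income model: the leakage variable and the gender indicators.
drop_cols   <- c("loan_percent_income", "person_genderfemale", "person_gendermale")
income_data <- loans_encoded[, !(names(loans_encoded) %in% drop_cols), drop = FALSE]

names(income_data)
```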

Train-Test Split and Model Building (Forward Selection Approach)

In this section, the cleaned dataset is split into training and testing sets using an 80–20 ratio to ensure proper model validation on unseen data. The training set is used to build multiple regression models, while the test set is reserved for final performance evaluation. To identify the most relevant predictors of personal income, a forward stepwise selection approach is applied. This method begins with an empty model and iteratively adds variables that improve model performance until no further improvement is achieved. The final selected model is then evaluated against alternative specifications to ensure optimal predictive accuracy and interpretability.
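
An 80–20 split of this kind can be sketched with `sample()`. The seed value, the simulated data, and the object name `model_data` are assumptions for illustration; the names `train_data` and `test_data` follow the `data = train_data` argument visible in the model output.

```r
set.seed(123)  # arbitrary seed, assumed here only for reproducibility

# Simulated stand-in for the cleaned modeling data.
n <- 100
model_data <- data.frame(
  id            = seq_len(n),
  person_income = rnorm(n, mean = 80000, sd = 20000)
)

# Draw 80% of the row indices for training; the rest form the test set.
train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- model_data[train_idx, ]
test_data  <- model_data[-train_idx, ]
```

Applied to 45,000 rows, the same proportion gives 36,000 training rows, which is consistent with the 35,990 residual degrees of freedom plus 10 estimated coefficients reported for the fitted model.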

Forward Stepwise Selection (The Logic)

Rather than using all predictors at once, we use an iterative Forward Selection approach:

  • Start Small: We begin with a “Null Model” (just the average income).
  • Iterate: The algorithm tests the remaining variables one by one and adds only the one that gives the largest reduction in AIC.
  • Optimize: It repeats this until no remaining variable lowers the AIC any further.
## Start:  AIC=815923.8
## person_income ~ 1
## 
##                                     Df  Sum of Sq        RSS    AIC
## + loan_amnt                          1 1.3463e+13 2.3736e+14 813940
## + person_home_ownershipRENT          1 9.3048e+12 2.4152e+14 814565
## + person_age                         1 3.9857e+12 2.4684e+14 815349
## + cb_person_cred_hist_length         1 3.8285e+12 2.4699e+14 815372
## + person_emp_exp                     1 3.4412e+12 2.4738e+14 815429
## + previous_loan_defaults_on_fileYes  1 7.6369e+11 2.5006e+14 815816
## + loan_intentMEDICAL                 1 5.3144e+11 2.5029e+14 815849
## + credit_score                       1 3.5532e+11 2.5047e+14 815875
## + person_home_ownershipOWN           1 3.4238e+11 2.5048e+14 815877
## + loan_intentHOMEIMPROVEMENT         1 3.4010e+11 2.5048e+14 815877
## + loan_intentPERSONAL                1 1.1228e+11 2.5071e+14 815910
## + loan_intentEDUCATION               1 6.9234e+10 2.5075e+14 815916
## + loan_intentVENTURE                 1 4.4168e+10 2.5078e+14 815920
## + person_home_ownershipOTHER         1 2.1847e+10 2.5080e+14 815923
## <none>                                            2.5082e+14 815924
## + loan_int_rate                      1 2.2173e+09 2.5082e+14 815926
## + person_education                   1 7.2213e+08 2.5082e+14 815926
## 
## Step:  AIC=813939.7
## person_income ~ loan_amnt
## 
##                                     Df  Sum of Sq        RSS    AIC
## + person_home_ownershipRENT          1 6.6200e+12 2.3074e+14 812923
## + cb_person_cred_hist_length         1 3.2604e+12 2.3410e+14 813444
## + person_age                         1 3.2065e+12 2.3415e+14 813452
## + person_emp_exp                     1 2.8196e+12 2.3454e+14 813511
## + previous_loan_defaults_on_fileYes  1 1.2004e+12 2.3616e+14 813759
## + loan_intentMEDICAL                 1 3.5766e+11 2.3700e+14 813887
## + credit_score                       1 3.1291e+11 2.3705e+14 813894
## + person_home_ownershipOWN           1 2.5374e+11 2.3711e+14 813903
## + loan_int_rate                      1 2.4677e+11 2.3711e+14 813904
## + loan_intentHOMEIMPROVEMENT         1 1.6255e+11 2.3720e+14 813917
## + loan_intentPERSONAL                1 9.7388e+10 2.3726e+14 813927
## + loan_intentEDUCATION               1 5.5882e+10 2.3730e+14 813933
## + loan_intentVENTURE                 1 3.8139e+10 2.3732e+14 813936
## <none>                                            2.3736e+14 813940
## + person_home_ownershipOTHER         1 9.6764e+09 2.3735e+14 813940
## + person_education                   1 5.4741e+08 2.3736e+14 813942
## 
## Step:  AIC=812923.4
## person_income ~ loan_amnt + person_home_ownershipRENT
## 
##                                     Df  Sum of Sq        RSS    AIC
## + cb_person_cred_hist_length         1 3.0348e+12 2.2771e+14 812449
## + person_age                         1 2.8560e+12 2.2788e+14 812477
## + person_emp_exp                     1 2.4936e+12 2.2825e+14 812534
## + person_home_ownershipOWN           1 1.6402e+12 2.2910e+14 812669
## + previous_loan_defaults_on_fileYes  1 5.2585e+11 2.3021e+14 812843
## + credit_score                       1 2.9614e+11 2.3044e+14 812879
## + loan_intentMEDICAL                 1 1.9135e+11 2.3055e+14 812896
## + loan_intentHOMEIMPROVEMENT         1 7.9732e+10 2.3066e+14 812913
## + loan_intentPERSONAL                1 6.5372e+10 2.3067e+14 812915
## + loan_intentEDUCATION               1 5.2119e+10 2.3069e+14 812917
## <none>                                            2.3074e+14 812923
## + loan_int_rate                      1 1.2378e+10 2.3073e+14 812923
## + loan_intentVENTURE                 1 9.8279e+09 2.3073e+14 812924
## + person_home_ownershipOTHER         1 1.1467e+09 2.3074e+14 812925
## + person_education                   1 1.8132e+08 2.3074e+14 812925
## 
## Step:  AIC=812448.8
## person_income ~ loan_amnt + person_home_ownershipRENT + cb_person_cred_hist_length
## 
##                                     Df  Sum of Sq        RSS    AIC
## + person_home_ownershipOWN           1 1.6085e+12 2.2610e+14 812196
## + previous_loan_defaults_on_fileYes  1 5.8956e+11 2.2712e+14 812357
## + loan_intentMEDICAL                 1 2.1485e+11 2.2749e+14 812417
## + person_age                         1 1.4653e+11 2.2756e+14 812428
## + credit_score                       1 7.7152e+10 2.2763e+14 812439
## + person_emp_exp                     1 7.6359e+10 2.2763e+14 812439
## + loan_intentPERSONAL                1 4.2323e+10 2.2766e+14 812444
## + loan_intentHOMEIMPROVEMENT         1 3.6294e+10 2.2767e+14 812445
## + loan_int_rate                      1 1.9955e+10 2.2769e+14 812448
## <none>                                            2.2771e+14 812449
## + loan_intentVENTURE                 1 1.2128e+10 2.2769e+14 812449
## + loan_intentEDUCATION               1 1.1879e+10 2.2769e+14 812449
## + person_education                   1 3.1195e+08 2.2770e+14 812451
## + person_home_ownershipOTHER         1 2.7749e+08 2.2770e+14 812451
## 
## Step:  AIC=812195.6
## person_income ~ loan_amnt + person_home_ownershipRENT + cb_person_cred_hist_length + 
##     person_home_ownershipOWN
## 
##                                     Df  Sum of Sq        RSS    AIC
## + previous_loan_defaults_on_fileYes  1 6.0542e+11 2.2549e+14 812101
## + loan_intentMEDICAL                 1 2.1396e+11 2.2588e+14 812163
## + person_age                         1 1.3385e+11 2.2596e+14 812176
## + credit_score                       1 7.5510e+10 2.2602e+14 812186
## + person_emp_exp                     1 7.2774e+10 2.2602e+14 812186
## + loan_intentVENTURE                 1 4.6953e+10 2.2605e+14 812190
## + loan_intentPERSONAL                1 4.3516e+10 2.2605e+14 812191
## + loan_intentHOMEIMPROVEMENT         1 3.5586e+10 2.2606e+14 812192
## <none>                                            2.2610e+14 812196
## + loan_intentEDUCATION               1 1.1643e+10 2.2609e+14 812196
## + loan_int_rate                      1 9.9251e+09 2.2609e+14 812196
## + person_home_ownershipOTHER         1 2.8407e+09 2.2609e+14 812197
## + person_education                   1 6.1285e+08 2.2610e+14 812197
## 
## Step:  AIC=812101
## person_income ~ loan_amnt + person_home_ownershipRENT + cb_person_cred_hist_length + 
##     person_home_ownershipOWN + previous_loan_defaults_on_fileYes
## 
##                              Df  Sum of Sq        RSS    AIC
## + loan_intentMEDICAL          1 1.9369e+11 2.2530e+14 812072
## + credit_score                1 1.7923e+11 2.2531e+14 812074
## + person_age                  1 1.3526e+11 2.2536e+14 812081
## + person_emp_exp              1 7.6325e+10 2.2541e+14 812091
## + loan_intentHOMEIMPROVEMENT  1 4.4131e+10 2.2545e+14 812096
## + loan_intentPERSONAL         1 4.1363e+10 2.2545e+14 812096
## + loan_intentVENTURE          1 3.2674e+10 2.2546e+14 812098
## + loan_intentEDUCATION        1 1.9501e+10 2.2547e+14 812100
## <none>                                     2.2549e+14 812101
## + person_education            1 3.1495e+09 2.2549e+14 812103
## + person_home_ownershipOTHER  1 1.4323e+09 2.2549e+14 812103
## + loan_int_rate               1 5.2129e+08 2.2549e+14 812103
## 
## Step:  AIC=812072.1
## person_income ~ loan_amnt + person_home_ownershipRENT + cb_person_cred_hist_length + 
##     person_home_ownershipOWN + previous_loan_defaults_on_fileYes + 
##     loan_intentMEDICAL
## 
##                              Df  Sum of Sq        RSS    AIC
## + credit_score                1 1.7604e+11 2.2512e+14 812046
## + person_age                  1 1.4104e+11 2.2516e+14 812052
## + person_emp_exp              1 8.0400e+10 2.2522e+14 812061
## + loan_intentEDUCATION        1 6.5007e+10 2.2523e+14 812064
## + loan_intentHOMEIMPROVEMENT  1 1.9323e+10 2.2528e+14 812071
## <none>                                     2.2530e+14 812072
## + loan_intentPERSONAL         1 1.2218e+10 2.2529e+14 812072
## + loan_intentVENTURE          1 7.4270e+09 2.2529e+14 812073
## + person_education            1 3.3148e+09 2.2529e+14 812074
## + person_home_ownershipOTHER  1 1.3936e+09 2.2530e+14 812074
## + loan_int_rate               1 6.5068e+08 2.2530e+14 812074
## 
## Step:  AIC=812045.9
## person_income ~ loan_amnt + person_home_ownershipRENT + cb_person_cred_hist_length + 
##     person_home_ownershipOWN + previous_loan_defaults_on_fileYes + 
##     loan_intentMEDICAL + credit_score
## 
##                              Df  Sum of Sq        RSS    AIC
## + person_age                  1 1.2089e+11 2.2500e+14 812029
## + loan_intentEDUCATION        1 6.4856e+10 2.2506e+14 812038
## + person_emp_exp              1 6.1144e+10 2.2506e+14 812038
## + loan_intentHOMEIMPROVEMENT  1 1.9678e+10 2.2510e+14 812045
## <none>                                     2.2512e+14 812046
## + loan_intentPERSONAL         1 1.2208e+10 2.2511e+14 812046
## + loan_intentVENTURE          1 6.0888e+09 2.2512e+14 812047
## + loan_int_rate               1 1.1578e+09 2.2512e+14 812048
## + person_home_ownershipOTHER  1 1.1279e+09 2.2512e+14 812048
## + person_education            1 9.3862e+08 2.2512e+14 812048
## 
## Step:  AIC=812028.6
## person_income ~ loan_amnt + person_home_ownershipRENT + cb_person_cred_hist_length + 
##     person_home_ownershipOWN + previous_loan_defaults_on_fileYes + 
##     loan_intentMEDICAL + credit_score + person_age
## 
##                              Df  Sum of Sq        RSS    AIC
## + loan_intentEDUCATION        1 5.5497e+10 2.2495e+14 812022
## + loan_intentPERSONAL         1 1.3462e+10 2.2499e+14 812028
## + loan_intentHOMEIMPROVEMENT  1 1.3351e+10 2.2499e+14 812028
## <none>                                     2.2500e+14 812029
## + loan_intentVENTURE          1 6.1260e+09 2.2499e+14 812030
## + person_emp_exp              1 2.4372e+09 2.2500e+14 812030
## + person_education            1 1.2740e+09 2.2500e+14 812030
## + loan_int_rate               1 1.0822e+09 2.2500e+14 812030
## + person_home_ownershipOTHER  1 8.9349e+08 2.2500e+14 812030
## 
## Step:  AIC=812021.7
## person_income ~ loan_amnt + person_home_ownershipRENT + cb_person_cred_hist_length + 
##     person_home_ownershipOWN + previous_loan_defaults_on_fileYes + 
##     loan_intentMEDICAL + credit_score + person_age + loan_intentEDUCATION
## 
##                              Df  Sum of Sq        RSS    AIC
## <none>                                     2.2495e+14 812022
## + loan_intentHOMEIMPROVEMENT  1 4340039311 2.2494e+14 812023
## + loan_intentPERSONAL         1 2364572744 2.2494e+14 812023
## + person_emp_exp              1 2006240893 2.2494e+14 812023
## + person_education            1 1467636058 2.2494e+14 812023
## + person_home_ownershipOTHER  1 1008515270 2.2494e+14 812024
## + loan_int_rate               1  969697438 2.2494e+14 812024
## + loan_intentVENTURE          1   43860510 2.2495e+14 812024
## 
## Call:
## lm(formula = person_income ~ loan_amnt + person_home_ownershipRENT + 
##     cb_person_cred_hist_length + person_home_ownershipOWN + previous_loan_defaults_on_fileYes + 
##     loan_intentMEDICAL + credit_score + person_age + loan_intentEDUCATION, 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -105572  -26233   -8919   13538 7071837 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        1.406e+04  6.360e+03   2.210  0.02708 *  
## loan_amnt                          2.676e+00  6.706e-02  39.901  < 2e-16 ***
## person_home_ownershipRENT         -2.908e+04  8.890e+02 -32.716  < 2e-16 ***
## cb_person_cred_hist_length         1.524e+03  2.079e+02   7.329 2.37e-13 ***
## person_home_ownershipOWN          -2.816e+04  1.755e+03 -16.051  < 2e-16 ***
## previous_loan_defaults_on_fileYes  9.052e+03  8.597e+02  10.530  < 2e-16 ***
## loan_intentMEDICAL                -6.760e+03  1.098e+03  -6.157 7.48e-10 ***
## credit_score                       4.265e+01  8.525e+00   5.003 5.66e-07 ***
## person_age                         6.960e+02  1.648e+02   4.224 2.40e-05 ***
## loan_intentEDUCATION              -3.200e+03  1.074e+03  -2.980  0.00289 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79060 on 35990 degrees of freedom
## Multiple R-squared:  0.1032, Adjusted R-squared:  0.1029 
## F-statistic:   460 on 9 and 35990 DF,  p-value: < 2.2e-16

The forward stepwise selection lowered the AIC from a baseline of 815,923.8 (the intercept-only model) to 812,021.7. At each step the algorithm added the candidate variable that gave the largest reduction in the Akaike Information Criterion (AIC): loan_amnt entered first, followed by the RENT housing indicator, credit history length, the OWN housing indicator, previous defaults on file, medical loan intent, credit score, age, and education loan intent. The procedure stopped after nine additions because none of the remaining candidates, including person_education, person_emp_exp, and loan_int_rate, lowered the AIC any further. The resulting nine-predictor model is the one preferred by AIC, which trades goodness of fit against model complexity. The total reduction in AIC is small relative to its starting value, which is consistent with the modest R² of about 0.10 reported in the summary above.
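
The forward selection procedure described above can be sketched with `step()` from base R's stats package. The simulated data and the variable `noise` are assumptions for illustration; the report's actual call is not shown, so the arguments here are one plausible configuration rather than the authors' exact setup.

```r
set.seed(42)
n <- 200

# Simulated training data: income depends on loan_amnt only.
train_data <- data.frame(
  loan_amnt    = runif(n, 500, 35000),
  credit_score = rnorm(n, 630, 50),
  noise        = rnorm(n)
)
train_data$person_income <- 20000 + 3 * train_data$loan_amnt + rnorm(n, sd = 5000)

# Null model (intercept only) and full model define the search range.
null_model <- lm(person_income ~ 1, data = train_data)
full_model <- lm(person_income ~ ., data = train_data)

# Forward search: add the variable with the lowest AIC until nothing improves.
best_model <- step(
  null_model,
  scope     = list(lower = formula(null_model), upper = formula(full_model)),
  direction = "forward",
  trace     = 0   # set to 1 to print a step-by-step table like the one in the report
)

selected <- attr(terms(best_model), "term.labels")
```

Because `step()` compares AIC values on the training data only, the selected model still needs to be checked on the held-out test set.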

Final Model

The final model is retrained using the predictors selected through the forward stepwise selection process.

## 
## Call:
## lm(formula = person_income ~ loan_amnt + person_home_ownershipRENT + 
##     cb_person_cred_hist_length + person_home_ownershipOWN + previous_loan_defaults_on_fileYes + 
##     loan_intentMEDICAL + credit_score + person_age + loan_intentEDUCATION, 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -105572  -26233   -8919   13538 7071837 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        1.406e+04  6.360e+03   2.210  0.02708 *  
## loan_amnt                          2.676e+00  6.706e-02  39.901  < 2e-16 ***
## person_home_ownershipRENT         -2.908e+04  8.890e+02 -32.716  < 2e-16 ***
## cb_person_cred_hist_length         1.524e+03  2.079e+02   7.329 2.37e-13 ***
## person_home_ownershipOWN          -2.816e+04  1.755e+03 -16.051  < 2e-16 ***
## previous_loan_defaults_on_fileYes  9.052e+03  8.597e+02  10.530  < 2e-16 ***
## loan_intentMEDICAL                -6.760e+03  1.098e+03  -6.157 7.48e-10 ***
## credit_score                       4.265e+01  8.525e+00   5.003 5.66e-07 ***
## person_age                         6.960e+02  1.648e+02   4.224 2.40e-05 ***
## loan_intentEDUCATION              -3.200e+03  1.074e+03  -2.980  0.00289 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79060 on 35990 degrees of freedom
## Multiple R-squared:  0.1032, Adjusted R-squared:  0.1029 
## F-statistic:   460 on 9 and 35990 DF,  p-value: < 2.2e-16
## [1] 61359.59
## [1] 30733.79
## [1] 0.1574616

The final linear regression model identifies several statistically significant predictors of income, including loan amount, credit score, age, and credit history length. Loan amount is the strongest single predictor (t ≈ 39.9): each additional unit of loan amount is associated with about 2.68 more units of income, holding the other predictors fixed. The RENT and OWN indicators are associated with roughly 28,000 to 29,000 lower income than the omitted home-ownership categories, and medical and education loans are associated with lower income than the omitted loan purposes. Despite these significant relationships, the model has low predictive power: it explains about 10% of the variance on the training data (R² = 0.103), and the test R² is approximately 0.16. This suggests that while the model captures some meaningful patterns, additional variables or more flexible models may be needed for improved accuracy.
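
For reference, common test-set metrics for a regression model can be computed as in the sketch below. The helper functions and the toy vectors are assumptions; the report does not show which formulas produced its printed values, and R² in particular has more than one definition in use (this version is one minus the ratio of residual to total sum of squares).

```r
# Hypothetical helper functions for regression metrics.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))
r2   <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Toy values standing in for test_data$person_income and predict(best_model, test_data).
actual    <- c(50000, 60000, 70000, 80000)
predicted <- c(52000, 58000, 73000, 77000)

test_rmse <- rmse(actual, predicted)
test_mae  <- mae(actual, predicted)
test_r2   <- r2(actual, predicted)
```

RMSE penalizes large errors more heavily than MAE, so a large gap between the two usually points to a small number of very poorly predicted observations.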

Model Diagnostics

Before relying on a linear regression model for inference or prediction, it is important to verify that several key statistical assumptions are satisfied. These assumptions ensure that the model estimates are reliable, unbiased, and interpretable. First, we check the assumption of linearity, which requires a linear relationship between the predictors and the response variable. Second, we assess normality of residuals to ensure that the error terms are approximately normally distributed. Third, we examine homoscedasticity, meaning that the variance of residuals remains constant across all fitted values. Finally, we check for influential points and outliers that may disproportionately affect the model estimates and overall performance.

## 
##  Shapiro-Wilk normality test
## 
## data:  sample(residuals(best_model), 5000)
## W = 0.64385, p-value < 2.2e-16

## 
##  studentized Breusch-Pagan test
## 
## data:  best_model
## BP = 150.23, df = 9, p-value < 2.2e-16

## 42521 31919 29133 32076 36352 20547 17955 15886 32385 29127 29169 27772 15656 
##    96   151   163   266   312   388   654   700   730   924  1042  1069  1072 
## 32864 27771 13217 18186 27837 29515 17988 40441 27877 27849 28786 17849 17542 
##  1076  1094  1267  1278  1302  1340  1388  1392  1551  1728  1745  1770  1782 
## 17940 32635 35273 27848 27711 17863 27864 38326 32328    57 27861 27860 32301 
##  1828  1866  2060  2069  2154  2169  2197  2216  2223  2338  2428  2445  2664 
## 29162 17888 27846 27950 27872 36279 32549 32343 31911 17965 27859 17941 36322 
##  2761  2857  2883  2901  2902  2927  3009  3098  3110  3217  3218  3262  3278 
##   100 43121    69 34611 28648 28089   218 17922 27798 27882 29346 27997 15888 
##  3315  3337  3389  3542  3641  3680  3713  4014  4081  4193  4300  4320  4394 
## 43916 28655 17906 28479    43 29149 27961 15893 33976 31900 30535 32547 35753 
##  4516  4523  4534  4590  4647  4665  4734  4903  4978  5001  5031  5100  5102 
## 43871 41676   184 18090 32557    16 29166 17900    81 38506 17872 17915 27870 
##  5144  5147  5258  5301  5351  5379  5511  5631  5687  5824  5879  5964  5970 
## 23430 37821 34190 34186 17972 32484 40721 17931 15838 31910 31867 31894    47 
##  6000  6028  6031  6161  6164  6387  6458  6481  6523  6534  6621  6676  6678 
## 31893 17858 17874 32306 27840 28797 27841 27884 32481 33935 17873 27855  2423 
##  6732  6748  6756  6761  7117  7214  7394  7503  7616  7646  7720  7734  7778 
## 27799 15909 15811 27839 31913 31916 28537 17914 29153 34995 17878 27865 31914 
##  7840  7937  7990  8069  8216  8321  8383  8549  8557  8600  8759  8783  8815 
## 27853 36424 28634 29161 27657 15751 28494 32300 31906 32118 32417 29122 29145 
##  8836  8864  8934  9034  9242  9261  9387  9479  9549  9647  9734  9785  9803 
##  1880 43242 28771 23429 27867 28420 32225 17895 37383 15905 17913 32504 35740 
##  9836  9837  9930  9978 10123 10197 10329 10383 10486 10509 10598 10701 10866 
## 20614 17903 32367 27858 27779 17887 34419 32544 32071 27851 28254 29129   210 
## 11076 11271 11329 11529 11548 11658 11789 11800 11825 11834 11863 11961 12043 
## 27630 36257 29189 41544 28975 33490 38221 15908   140 31869 43148 31889 27658 
## 12157 12257 12600 12881 12897 12948 12987 13194 13295 13316 13339 13390 13505 
## 18198 27815 17859 31908 15800 15831 36449 44948 16757 29154   103 17912 40431 
## 13659 13691 13930 14091 14092 14139 14436 14605 14617 14667 14751 14763 14810 
##    35 16133 17834 31886 18197 37176 31905 29157 32292 27838 27498 18024 27856 
## 14852 14886 14890 14990 15158 15166 15180 15207 15330 15423 15590 15662 15696 
## 27847 27821    56 31920 34824 38801 17894 29179 29138 41465 27627 23432 38114 
## 15726 15729 15827 15884 15930 15982 16543 16662 16689 16714 16728 16740 16752 
##   401   141 39857 19969 18058 15894 33223 28239 32309 29204 27801 15901 15891 
## 16847 16937 16946 16960 17078 17096 17100 17240 17311 17318 17326 17327 17354 
## 17847 32038 39825 29143   125 27770 40233 32136 32298 28519 18199 29130 17956 
## 17428 17508 17510 17553 17573 17637 17684 17691 17698 17777 17849 17852 17992 
## 39117 35090 37931   166 29123   222 32125 17885 40025 28136 29243 18625 32299 
## 18020 18026 18065 18228 18256 18293 18297 18327 18379 18460 18466 18565 18582 
## 27873 43096 33488 28901 33795 17229 43631 17925   114 31358    45 32384 17860 
## 18589 18691 18717 18734 18872 19065 19137 19142 19156 19177 19261 19278 19553 
## 42086   234 27843 36598 40383 15915 29156 17942    99 32548 29188 27830 29140 
## 19614 19641 19712 19714 19864 20228 20375 20459 20483 20491 20507 20537 20567 
## 25714 31915 31917  8445 29137 27854 39985 18106 27868 30050 27783 33532    58 
## 20615 20758 20770 20786 20807 20864 21037 21110 21115 21174 21191 21288 21326 
## 15830 38300 44923 32936    68 39419 31901 32434 40162 18017 38867    82 33240 
## 21334 21360 21405 21454 21477 21541 21691 21852 21966 22041 22046 22067 22513 
## 32480 17861 32595 27820 15902 15591 27831 27540 27552 17948 32543 28807 38913 
## 22666 22683 22695 22768 22791 22799 22860 22923 22968 23143 23232 23301 23443 
## 27883 39412 15895 27874 27828    91 44787 28736 15536 17870 32554   266   365 
## 23476 23503 23558 23590 23632 23664 23859 23895 23928 24143 24181 24213 24294 
## 17933  2197 15896 40105 32579 37313 38618 37488 42085 30537 28854 41949 27871 
## 24438 24444 24498 24502 24512 24520 24591 24661 24706 24707 24778 24809 24867 
## 42651 30026 31896 17949 29160   239 17862 17886 31899 15875 27826 15889 15405 
## 24918 24944 25400 25433 25584 25593 25599 25696 25820 25871 26061 26197 26253 
## 17904 17926 32308 32572 29146 15730 32416 27811   233 40964 32305 34680 15898 
## 26267 26314 26406 26433 26458 26460 26511 26528 26559 27068 27367 27383 27519 
## 29120 27878 28386 27885 36332 15906 33630   185 18047 18264 27857 27835 32552 
## 27728 27755 27827 27835 27877 28047 28187 28325 28374 28571 28595 28603 28604 
## 32312 21241 32404 29295 37835 21958 32310 17835 41762 42020 32349 29144 32577 
## 28654 28744 28798 28846 28867 28931 28944 28985 29009 29018 29175 29189 29320 
## 35708   331 29134 15856 17916 35659 27876 22082    34 32321 38400 18636 29671 
## 29330 29374 29575 29627 29800 29906 30036 30057 30068 30566 30594 30724 30861 
## 17897 39293 29152 17924    44 32006 32254 31923 32545    70 17869 38414    63 
## 30991 31006 31279 31307 31374 31778 31825 31976 32004 32168 32444 32472 32696 
## 32365 41167 16927 29514 35923 31885 15868 32764 17875 40021 29132   264 18918 
## 32790 32810 32871 33015 33187 33214 33217 33313 33354 33381 33386 33429 33526 
## 32422 33228 29128 32234 40828 15529 32311 32048 17893 38074 29026 39854 31925 
## 33558 33580 33649 33704 33753 33941 33965 34112 34227 34337 34933 35206 35324 
## 37003 27850 17866 32348 15834 39990 31903 33624 17871 27836 32382 13218 17908 
## 35351 35378 35399 35430 35437 35438 35502 35520 35579 35724 35735 35824 35898

The diagnostic analysis shows that the current model violates several classical linear regression assumptions, so standard OLS inference (standard errors, t-tests, and confidence intervals) is potentially unreliable. The Shapiro-Wilk test on a sample of 5,000 residuals rejects normality (W = 0.644, p < 2.2e-16), and the Normal Q-Q plot shows why: the sharp upward deviation from the theoretical line indicates a heavily right-skewed distribution of errors, meaning the model consistently underpredicts high-income observations. The studentized Breusch-Pagan test rejects constant variance (BP = 150.23, df = 9, p < 2.2e-16), in line with the Scale-Location plot, where the error variance increases alongside the fitted values. Finally, the Residuals vs Leverage plot identifies specific influential observations, notably case 32298, which combine high leverage with large residuals. These points are likely exerting a disproportionate pull on the regression coefficients. To improve model validity, we recommend a logarithmic transformation of the dependent variable to stabilize the variance and reduce skewness.
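As a rough sketch of the recommended log transformation, the example below fits a raw-scale and a log-scale model on simulated data and compares them on the original income scale. The data, the predictor set, and the helper `metrics()` are assumptions made for illustration; the project's own model uses more predictors.

```r
# Compare a raw-scale and a log-scale linear model on simulated income data.
set.seed(1)
n         <- 3000
age       <- round(runif(n, 20, 60))
loan_amnt <- round(runif(n, 1000, 35000))
income    <- exp(9.5 + 0.02 * age + 0.00002 * loan_amnt + rnorm(n, sd = 0.5))
dat       <- data.frame(income, age, loan_amnt)

# 80/20 train-test split
idx   <- sample(n, size = 0.8 * n)
train <- dat[idx, ]
test  <- dat[-idx, ]

raw_fit <- lm(income ~ age + loan_amnt, data = train)
log_fit <- lm(log(income) ~ age + loan_amnt, data = train)

# Back-transform log-scale predictions with exp() so that both models are
# scored in the same units. exp() of a log-scale prediction estimates the
# conditional median rather than the mean, so it tends to sit below the mean.
pred_raw <- predict(raw_fit, newdata = test)
pred_log <- exp(predict(log_fit, newdata = test))

metrics <- function(actual, pred) {
  c(RMSE = sqrt(mean((actual - pred)^2)),
    MAE  = mean(abs(actual - pred)),
    R2   = 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2))
}

results <- rbind(raw = metrics(test$income, pred_raw),
                 log = metrics(test$income, pred_log))
print(round(results, 3))
```

Scoring both models on the original scale matters: an R² computed on log(income) is not comparable with one computed on income.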

Model Improvement

Transform the target variable

## [1] 62110.12
## [1] 28955.59
## [1] 0.1367244

Diagnostic Checks for log_best_model

##                         loan_amnt         person_home_ownershipRENT 
##                          1.033331                          1.136144 
##        cb_person_cred_hist_length          person_home_ownershipOWN 
##                          3.772604                          1.087124 
## previous_loan_defaults_on_fileYes                loan_intentMEDICAL 
##                          1.063920                          1.070368 
##                      credit_score                        person_age 
##                          1.064828                          3.810142 
##              loan_intentEDUCATION 
##                          1.073787
## 
##  Durbin-Watson test
## 
## data:  log_best_model
## DW = 2.0012, p-value = 0.5471
## alternative hypothesis: true autocorrelation is greater than 0

The log-linear model shows acceptable performance in terms of multicollinearity and independence. All VIF values are below 5; the largest, about 3.8 for person_age and cb_person_cred_hist_length, reflect the overlap between these two age-related predictors. The Durbin-Watson test (DW = 2.0012, p = 0.5471) gives no evidence of autocorrelation in the residuals. However, the Scale-Location plot still reveals heteroscedasticity, meaning the residual variance is not constant. Overall, this suggests that the linear model assumptions are only partially satisfied and that the model may not fully capture the underlying patterns in the data.
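To make the two statistics concrete, the sketch below computes variance inflation factors and the Durbin-Watson statistic by hand in base R. The data are simulated, with `cred_hist` constructed to correlate at roughly 0.85 with `age`; the variable names and effect sizes are assumptions, and the hand-rolled formulas stand in for whatever package functions the analysis actually used.

```r
# Manual VIF and Durbin-Watson on simulated data with two correlated predictors.
set.seed(7)
n            <- 2000
age          <- rnorm(n, 30, 6)
cred_hist    <- 0.85 * (age - 30) / 6 + sqrt(1 - 0.85^2) * rnorm(n)
credit_score <- rnorm(n, 630, 50)
y            <- 10 + 0.02 * age + 0.05 * cred_hist + 0.001 * credit_score +
                rnorm(n, sd = 0.5)
dat          <- data.frame(y, age, cred_hist, credit_score)

fit        <- lm(y ~ ., data = dat)
predictors <- c("age", "cred_hist", "credit_score")

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all the other predictors.
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2     <- summary(lm(reformulate(others, response = v), data = dat))$r.squared
  1 / (1 - r2)
})

# Durbin-Watson statistic: values near 2 indicate no first-order autocorrelation.
e  <- residuals(fit)
dw <- sum(diff(e)^2) / sum(e^2)

print(round(vif_manual, 3))
cat("Durbin-Watson statistic:", round(dw, 4), "\n")
```

With a correlation of about 0.85 between two predictors, each has a VIF near 1 / (1 - 0.85²) ≈ 3.6, which is close to the values reported for person_age and cb_person_cred_hist_length.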

Choice of Target Variable Transformation

The target variable (person income) was highly right-skewed and exhibited non-constant variance, which violated key assumptions of linear regression. To address this, a logarithmic transformation was applied. The log is the most widely used transformation for income-type data because it compresses extreme values and tends to stabilize variance when the spread grows in proportion to the level. After applying the log transformation, diagnostic checks showed some improvement in model behavior, although heteroscedasticity was not fully eliminated.

Other transformations were considered less suitable. The square root is generally intended for mildly skewed data and would not adequately correct the strong skewness observed in income. Polynomial terms were not applied since they capture non-linear relationships with the predictors rather than correct the distribution of the response variable. A Box-Cox transformation was not implemented because, for heavily right-skewed financial variables, the estimated power parameter is often close to zero, which corresponds to the log transformation; we therefore expected it to add little here, although this was not verified on the data.

Given that key linear assumptions remained partially violated after transformation, particularly in terms of variance stability, the analysis next turns to regularized regression and, for the classification task, to nonlinear models, which are better suited to capture complex relationships and interactions in the data.
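The Box-Cox argument above can be checked directly with a profile log-likelihood over the power parameter λ. The sketch below implements that profile in base R on simulated log-normal income; the function `boxcox_loglik()`, the grid of λ values, and the data are all illustrative assumptions rather than part of the original analysis.

```r
# Box-Cox profile log-likelihood, implemented by hand on simulated data.
set.seed(3)
n      <- 2000
x      <- rnorm(n)
income <- exp(10 + 0.3 * x + rnorm(n, sd = 0.6))

boxcox_loglik <- function(lambda, y, x) {
  # Box-Cox transform; lambda = 0 is defined as the log transform.
  yt  <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(residuals(lm(yt ~ x))^2)
  # Gaussian profile log-likelihood including the Jacobian of the transform.
  -length(y) / 2 * log(rss / length(y)) + (lambda - 1) * sum(log(y))
}

lambdas     <- seq(-1, 1, by = 0.05)
ll          <- sapply(lambdas, boxcox_loglik, y = income, x = x)
best_lambda <- lambdas[which.max(ll)]

cat("Lambda maximizing the profile likelihood:", round(best_lambda, 2), "\n")
```

Running the same profile on the real income variable would show whether λ is in fact close to zero; a value far from zero would be a reason to revisit the choice of transformation.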

Regularized Regression Approach

We extend the modeling process beyond ordinary least squares by introducing regularized regression methods, specifically Ridge, Lasso, and Elastic Net. These models are used to address limitations observed in the linear regression model, such as weak predictive power and potential overfitting in the presence of multiple predictors.

Ridge regression applies L2 regularization, which shrinks coefficient values to reduce model complexity while retaining all predictors. Lasso regression applies L1 regularization, which can shrink some coefficients exactly to zero, effectively performing feature selection. Elastic Net combines both penalties, balancing shrinkage and variable selection for improved stability.

These methods are particularly useful in datasets with many correlated or weak predictors, as they improve generalization performance and reduce model variance. The optimal model is selected using cross-validated error (RMSE), ensuring fair comparison between Ridge, Lasso, and Elastic Net.

## [1] 41518.17
## [1] 41544.92
## [1] 41545.37

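Ridge regression achieves the lowest cross-validated RMSE (about 41,518), but the other two models are within roughly 27 of it (about 41,545), a difference of less than 0.1%. The three penalties therefore perform almost identically on this dataset. The slight edge for Ridge is consistent with predictive information being distributed across many weak predictors rather than concentrated in a small subset of variables.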
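The shrinkage behaviour of the L2 penalty can be illustrated with the closed-form Ridge estimator. The sketch below is a base R illustration on simulated, standardized predictors with a simple 5-fold cross-validation; `ridge_coef()` and `cv_rmse()` are hypothetical helpers, and the penalty grid is arbitrary. Lasso and Elastic Net have no closed form and are normally fitted with an iterative solver such as the glmnet package, which is not used here.

```r
# Closed-form ridge regression with 5-fold cross-validated RMSE.
set.seed(11)
n <- 500
p <- 8
Z <- matrix(rnorm(n * p), n, p)
Z[, 2] <- Z[, 1] + rnorm(n, sd = 0.3)   # two strongly correlated columns
X <- scale(Z)                           # standardize predictors
beta <- c(2, 0, 1, rep(0.2, p - 3))
y <- as.numeric(X %*% beta + rnorm(n, sd = 2))
y <- y - mean(y)                        # centre the response (no intercept)

# Ridge estimator: (X'X + lambda * I)^(-1) X'y; lambda = 0 gives OLS.
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# Mean RMSE over k folds; rows are i.i.d., so a cyclic fold assignment is fine.
cv_rmse <- function(lambda, k = 5) {
  folds <- rep_len(1:k, n)
  errs <- sapply(1:k, function(f) {
    tr <- folds != f
    b  <- ridge_coef(X[tr, ], y[tr], lambda)
    sqrt(mean((y[!tr] - X[!tr, ] %*% b)^2))
  })
  mean(errs)
}

lambdas <- c(0, 1, 10, 100, 1000)
cv      <- sapply(lambdas, cv_rmse)
print(data.frame(lambda = lambdas, cv_rmse = round(cv, 4)))

# The squared norm of the coefficients shrinks as lambda grows.
norm_ols   <- sum(ridge_coef(X, y, 0)^2)
norm_ridge <- sum(ridge_coef(X, y, 100)^2)
```

When the cross-validated RMSE barely changes across penalties, as in the results above, the choice between penalties has little practical effect on prediction.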

Loan Default Prediction

Loan Default Prediction: Build a classification model to predict loan repayment.

In this section, we build classification models to predict whether a loan applicant will successfully repay a loan (loan_status = 1) or default (loan_status = 0). This is a binary classification problem aimed at supporting credit risk assessment. Multiple machine learning models will be trained and evaluated, and the best-performing model will be selected based on classification performance metrics.
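The general workflow (split, fit, predict, evaluate) can be sketched in base R with a logistic regression. The example uses simulated data with three predictors named after columns in the dataset; the coefficients, the 0.5 threshold, and the rank-based AUC calculation are assumptions chosen for illustration and are not the project's actual code.

```r
# Train-test split, logistic regression, confusion matrix, and AUC in base R.
set.seed(123)
n                   <- 5000
loan_percent_income <- runif(n, 0.02, 0.6)
loan_int_rate       <- runif(n, 5, 20)
credit_score        <- round(rnorm(n, 630, 50))
eta <- -6 + 8 * loan_percent_income + 0.25 * loan_int_rate -
       0.002 * (credit_score - 630)
loan_status <- factor(rbinom(n, 1, plogis(eta)), levels = c(0, 1))
dat <- data.frame(loan_status, loan_percent_income, loan_int_rate, credit_score)

# 80/20 split
idx         <- sample(n, size = 0.8 * n)
train_class <- dat[idx, ]
test_class  <- dat[-idx, ]

# With a factor response, glm() models the probability of the second level ("1").
logit_fit <- glm(loan_status ~ ., family = binomial, data = train_class)
prob      <- predict(logit_fit, newdata = test_class, type = "response")
pred      <- factor(as.integer(prob > 0.5), levels = c(0, 1))

conf_mat <- table(Prediction = pred, Reference = test_class$loan_status)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# AUC from the rank-sum (Mann-Whitney) identity.
y_num <- as.integer(test_class$loan_status == "1")
r     <- rank(prob)
n1    <- sum(y_num == 1)
n0    <- sum(y_num == 0)
auc   <- (sum(r[y_num == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(conf_mat)
cat("Accuracy:", round(accuracy, 4), " AUC:", round(auc, 4), "\n")
```

Keeping `loan_status` as a factor with explicit levels makes it unambiguous which class the predicted probabilities refer to.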

Data preprocessing

## 'data.frame':    45000 obs. of  24 variables:
##  $ person_age                       : num  22 21 25 23 24 21 26 24 24 21 ...
##  $ person_income                    : num  71948 12282 12438 79753 66135 ...
##  $ person_emp_exp                   : num  0 0 3 0 1 0 1 5 3 0 ...
##  $ loan_amnt                        : num  35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
##  $ loan_int_rate                    : num  16 11.1 12.9 15.2 14.3 ...
##  $ loan_percent_income              : num  0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...
##  $ cb_person_cred_hist_length       : num  3 2 3 2 4 2 3 4 2 3 ...
##  $ credit_score                     : int  561 504 635 675 586 532 701 585 544 640 ...
##  $ loan_status                      : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
##  $ person_genderfemale              : num  1 1 1 1 0 1 1 1 1 1 ...
##  $ person_gendermale                : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ person_educationBachelor         : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ person_educationDoctorate        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ person_educationHigh School      : num  0 1 1 0 0 1 0 1 0 1 ...
##  $ person_educationMaster           : num  1 0 0 0 1 0 0 0 0 0 ...
##  $ person_home_ownershipOTHER       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ person_home_ownershipOWN         : num  0 1 0 0 0 1 0 0 0 1 ...
##  $ person_home_ownershipRENT        : num  1 0 0 1 1 0 1 1 1 0 ...
##  $ loan_intentEDUCATION             : num  0 1 0 0 0 0 1 0 0 0 ...
##  $ loan_intentHOMEIMPROVEMENT       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ loan_intentMEDICAL               : num  0 0 1 1 1 0 0 1 0 0 ...
##  $ loan_intentPERSONAL              : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ loan_intentVENTURE               : num  0 0 0 0 0 1 0 0 0 1 ...
##  $ previous_loan_defaults_on_fileYes: num  0 1 0 0 0 0 0 0 0 0 ...

Create Classification Dataset

Train-Test Split

Logistic Regression (Baseline Model)

## 
## Call:
## glm(formula = loan_status ~ ., family = binomial, data = train_class)
## 
## Coefficients: (1 not defined because of singularities)
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -2.400e-01  4.142e-01  -0.579  0.56232    
## person_age                         2.463e-02  1.281e-02   1.923  0.05447 .  
## person_income                      5.342e-07  2.067e-07   2.584  0.00977 ** 
## person_emp_exp                    -1.603e-02  1.136e-02  -1.411  0.15830    
## loan_amnt                         -1.017e-04  4.411e-06 -23.051  < 2e-16 ***
## loan_int_rate                      3.374e-01  7.378e-03  45.735  < 2e-16 ***
## loan_percent_income                1.593e+01  3.450e-01  46.173  < 2e-16 ***
## cb_person_cred_hist_length        -9.684e-03  9.747e-03  -0.994  0.32044    
## credit_score                      -9.069e-03  4.577e-04 -19.813  < 2e-16 ***
## person_genderfemale               -2.469e-02  3.969e-02  -0.622  0.53390    
## person_gendermale                         NA         NA      NA       NA    
## person_educationBachelor          -3.211e-02  5.266e-02  -0.610  0.54205    
## person_educationDoctorate         -7.556e-02  1.662e-01  -0.455  0.64941    
## person_educationHigh.School        1.761e-02  5.501e-02   0.320  0.74889    
## person_educationMaster             5.428e-02  6.287e-02   0.863  0.38789    
## person_home_ownershipOTHER         3.419e-01  3.578e-01   0.956  0.33927    
## person_home_ownershipOWN          -1.400e+00  1.125e-01 -12.443  < 2e-16 ***
## person_home_ownershipRENT          7.341e-01  4.496e-02  16.326  < 2e-16 ***
## loan_intentEDUCATION              -9.231e-01  6.542e-02 -14.112  < 2e-16 ***
## loan_intentHOMEIMPROVEMENT        -1.930e-02  7.316e-02  -0.264  0.79193    
## loan_intentMEDICAL                -3.139e-01  6.292e-02  -4.989 6.07e-07 ***
## loan_intentPERSONAL               -7.942e-01  6.737e-02 -11.788  < 2e-16 ***
## loan_intentVENTURE                -1.269e+00  7.120e-02 -17.820  < 2e-16 ***
## previous_loan_defaults_on_fileYes -2.038e+01  1.146e+02  -0.178  0.85885    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38136  on 35999  degrees of freedom
## Residual deviance: 15855  on 35977  degrees of freedom
## AIC: 15901
## 
## Number of Fisher Scoring iterations: 19

Logistic Regression Evaluation

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6560  521
##          1  439 1480
##                                           
##                Accuracy : 0.8933          
##                  95% CI : (0.8868, 0.8996)
##     No Information Rate : 0.7777          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.687           
##                                           
##  Mcnemar's Test P-Value : 0.008942        
##                                           
##             Sensitivity : 0.9373          
##             Specificity : 0.7396          
##          Pos Pred Value : 0.9264          
##          Neg Pred Value : 0.7712          
##              Prevalence : 0.7777          
##          Detection Rate : 0.7289          
##    Detection Prevalence : 0.7868          
##       Balanced Accuracy : 0.8385          
##                                           
##        'Positive' Class : 0               
## 

ROC-AUC

## Area under the curve: 0.9528

Random Forest Model

In this section, we train a Random Forest classifier to capture non-linear relationships and interactions between variables that logistic regression may fail to model. Random Forest is an ensemble learning method that builds multiple decision trees and aggregates their predictions to improve accuracy and reduce overfitting.

## 
## Call:
##  randomForest(formula = loan_status ~ ., data = train_class, ntree = 500,      mtry = sqrt(ncol(train_class) - 1), importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 7.11%
## Confusion matrix:
##       0    1 class.error
## 0 27291  710  0.02535624
## 1  1849 6150  0.23115389

Random Forest Evaluation

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6829  455
##          1  170 1546
##                                           
##                Accuracy : 0.9306          
##                  95% CI : (0.9251, 0.9357)
##     No Information Rate : 0.7777          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7884          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9757          
##             Specificity : 0.7726          
##          Pos Pred Value : 0.9375          
##          Neg Pred Value : 0.9009          
##              Prevalence : 0.7777          
##          Detection Rate : 0.7588          
##    Detection Prevalence : 0.8093          
##       Balanced Accuracy : 0.8742          
##                                           
##        'Positive' Class : 0               
## 

ROC-AUC

## Area under the curve: 0.9757

XGBoost Model

In this section, we train an XGBoost classifier, which is a powerful gradient boosting algorithm designed to improve predictive performance by sequentially correcting errors made by previous models. XGBoost is highly effective for structured/tabular data and often achieves superior accuracy compared to both logistic regression and random forest.

## ##### xgb.Booster
## call:
##   xgb.train(params = params, data = dtrain, nrounds = 100, verbose = 0)
## # of features: 23 
## # of rounds:  100

XGBoost Evaluation

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6803  445
##          1  196 1556
##                                          
##                Accuracy : 0.9288         
##                  95% CI : (0.9233, 0.934)
##     No Information Rate : 0.7777         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7845         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9720         
##             Specificity : 0.7776         
##          Pos Pred Value : 0.9386         
##          Neg Pred Value : 0.8881         
##              Prevalence : 0.7777         
##          Detection Rate : 0.7559         
##    Detection Prevalence : 0.8053         
##       Balanced Accuracy : 0.8748         
##                                          
##        'Positive' Class : 0              
## 

ROC-AUC

## Area under the curve: 0.9758

Model Performance Summary

The logistic regression model achieves an accuracy of 89.33% and an AUC of 0.9528, a strong baseline, although it assumes that the log-odds of the outcome are linear in the predictors. Random Forest raises accuracy to 93.06% (AUC 0.9757), which is consistent with a better ability to capture non-linear effects and feature interactions. XGBoost performs almost identically, with an accuracy of 92.88% and an AUC of 0.9758. In terms of Kappa, both Random Forest (0.788) and XGBoost (0.785) outperform logistic regression (0.687), indicating better agreement beyond chance. Random Forest achieves the highest sensitivity (0.976), meaning it is most effective at correctly identifying class 0, which the confusion matrices report as the 'Positive' class. Specificity, the share of class 1 cases correctly identified, is the weak point of all three models: 0.740 for logistic regression, 0.773 for Random Forest, and 0.778 for XGBoost. All models are significantly more accurate than the no-information rate of 77.77% (p < 2.2e-16), confirming real predictive power. Overall, Random Forest is the best-performing model on accuracy and Kappa, while XGBoost is marginally ahead on specificity, balanced accuracy (0.8748 versus 0.8742), and AUC. XGBoost therefore remains a very close second, and both tree-based models clearly improve on the logistic baseline.
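The summary statistics quoted above can be recomputed from the three test-set confusion matrices reported earlier. The sketch below does this in base R, treating class 0 as the positive class as in the printed output; the helper `summarise_cm()` is our own and is not a function from any package.

```r
# Recompute accuracy, Kappa, sensitivity, specificity, and balanced accuracy
# from the reported confusion matrices (rows = Prediction, columns = Reference).
make_cm <- function(values) {
  matrix(values, nrow = 2,
         dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
}

# Values are filled column by column: (pred 0, pred 1) for reference 0,
# then (pred 0, pred 1) for reference 1.
cms <- list(
  Logistic     = make_cm(c(6560, 439, 521, 1480)),
  RandomForest = make_cm(c(6829, 170, 455, 1546)),
  XGBoost      = make_cm(c(6803, 196, 445, 1556))
)

summarise_cm <- function(m) {
  n    <- sum(m)
  acc  <- sum(diag(m)) / n
  sens <- m["0", "0"] / sum(m[, "0"])   # positive class is "0"
  spec <- m["1", "1"] / sum(m[, "1"])
  pe   <- sum(rowSums(m) * colSums(m)) / n^2   # chance agreement
  c(Accuracy    = acc,
    Kappa       = (acc - pe) / (1 - pe),
    Sensitivity = sens,
    Specificity = spec,
    Balanced    = (sens + spec) / 2)
}

results <- t(sapply(cms, summarise_cm))
print(round(results, 4))
```

Laying the metrics out side by side makes it easier to see that the two tree-based models differ only in the third decimal place on most measures.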

Credit Risk Analysis: Analyze the relationship between income, credit score, and loan defaults.

Grouped Comparisons

In this section, we analyze the relationship between key financial indicators—annual income and credit score—and loan default behavior. The objective is to understand how these variables differ between borrowers who successfully repay their loans and those who default. This helps identify whether financial strength and creditworthiness are strong predictors of loan risk.

We begin by comparing the average income and credit score across default and non-default groups. We then visualize these differences using boxplots to better understand the distributional behavior of each variable. Finally, we explore the relationship between income and credit score to assess whether stronger financial profiles are associated with better credit ratings.

##   loan_status person_income
## 1           0      86157.04
## 2           1      59886.10
##   loan_status credit_score
## 1           0     632.8149
## 2           1     631.8872

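The aggregated results show how the financial indicators differ between the two loan_status groups. The loan_status = 0 group has a mean income of approximately 86,157, while the loan_status = 1 group has a mean income of approximately 59,886, a gap of roughly 26,000. Under the coding stated for the classification task (loan_status = 0 for default, 1 for repayment), this means that the default group has the higher average income. Because this runs against the usual expectation that defaulters earn less, the coding of loan_status should be confirmed against the data dictionary before the direction of the effect is interpreted. In either case, income clearly differs between the two groups.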

However, when examining credit scores, the difference between the two groups is minimal. Both defaulters and non-defaulters have nearly identical average credit scores (around 632). This indicates that credit score alone may not be a strong standalone predictor of default risk in this dataset.

Overall, income appears to play a more important role than credit score in distinguishing between high-risk and low-risk borrowers.
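A grouped comparison of this kind can be produced with `aggregate()`, and a Welch t-test is one way to check whether a difference in group means is larger than chance. The sketch below uses a six-row toy data frame with made-up values, so its numbers are purely illustrative and unrelated to the results above.

```r
# Group means by loan_status on a toy data frame, plus a Welch t-test.
toy <- data.frame(
  loan_status   = factor(c(0, 0, 0, 1, 1, 1)),
  person_income = c(90000, 80000, 85000, 60000, 55000, 65000),
  credit_score  = c(640, 620, 638, 630, 625, 640)
)

income_by_status <- aggregate(person_income ~ loan_status, data = toy, FUN = mean)
score_by_status  <- aggregate(credit_score ~ loan_status, data = toy, FUN = mean)

# Welch two-sample t-test (t.test() does not assume equal variances by default).
income_test <- t.test(person_income ~ loan_status, data = toy)

print(income_by_status)
print(score_by_status)
cat("p-value for the income difference:", signif(income_test$p.value, 3), "\n")
```

Labelling the groups by their loan_status value, as `aggregate()` does, avoids attaching the wrong interpretation to a group mean.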

Correlation Analysis

In this section, we examine the correlation structure between key numerical variables in the dataset. The goal is to understand how financial attributes such as income, credit score, loan amount, and interest rate are related to one another. Correlation analysis helps identify potential multicollinearity and provides insight into which variables move together, which is important for both regression modeling and feature interpretation.

##                             person_age person_income person_emp_exp   loan_amnt
## person_age                  1.00000000   0.193697781     0.95441216 0.050749541
## person_income               0.19369778   1.000000000     0.18598715 0.242290131
## person_emp_exp              0.95441216   0.185987147     1.00000000 0.044589394
## loan_amnt                   0.05074954   0.242290131     0.04458939 1.000000000
## loan_int_rate               0.01340164   0.001509828     0.01663134 0.146093082
## loan_percent_income        -0.04329864  -0.234176548    -0.03986153 0.593011449
## cb_person_cred_hist_length  0.86198456   0.124315644     0.82427154 0.042969328
## credit_score                0.17843247   0.035919225     0.18619613 0.009074282
##                            loan_int_rate loan_percent_income
## person_age                   0.013401640         -0.04329864
## person_income                0.001509828         -0.23417655
## person_emp_exp               0.016631344         -0.03986153
## loan_amnt                    0.146093082          0.59301145
## loan_int_rate                1.000000000          0.12520949
## loan_percent_income          0.125209488          1.00000000
## cb_person_cred_hist_length   0.018007997         -0.03186773
## credit_score                 0.011497752         -0.01148310
##                            cb_person_cred_hist_length credit_score
## person_age                                 0.86198456  0.178432470
## person_income                              0.12431564  0.035919225
## person_emp_exp                             0.82427154  0.186196134
## loan_amnt                                  0.04296933  0.009074282
## loan_int_rate                              0.01800800  0.011497752
## loan_percent_income                       -0.03186773 -0.011483096
## cb_person_cred_hist_length                 1.00000000  0.155204130
## credit_score                               0.15520413  1.000000000

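The correlation results show a very strong relationship between age, employment experience, and credit history length (r = 0.95 between age and employment experience, 0.86 between age and credit history length, and 0.82 between employment experience and credit history length), indicating that these variables carry overlapping information. Loan amount is moderately related to loan percent income (r = 0.59), suggesting that larger loans increase financial burden relative to income. Income itself is only weakly related to loan amount (r = 0.24) and to loan percent income (r = -0.23). Credit score shows very weak correlations with every other variable (at most 0.19), implying that it has almost no linear association with the rest of the numeric features. Overall, the data contains some multicollinearity among age-related features, which should be considered during modeling.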
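One way to turn a correlation matrix into a list of problematic pairs is to filter its upper triangle by a threshold. The sketch below does this on simulated data in which employment experience is built from age; the variable names follow the dataset, but the values and the 0.8 cutoff are assumptions for illustration.

```r
# Flag highly correlated predictor pairs from a correlation matrix.
set.seed(5)
n              <- 1000
person_age     <- round(rnorm(n, 28, 6))
# Employment experience tracks age closely and cannot be negative.
person_emp_exp <- pmax(person_age - 22 + round(rnorm(n, 0, 1.5)), 0)
credit_score   <- round(rnorm(n, 630, 50))
num_vars       <- data.frame(person_age, person_emp_exp, credit_score)

cor_mat <- cor(num_vars)

# Keep each pair once (upper triangle) when |r| exceeds the cutoff.
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)

print(round(cor_mat, 3))
print(high_pairs)
```

Pairs flagged this way are candidates for dropping one variable, combining them, or using a penalized model that tolerates correlated predictors.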

Distribution Analysis and Skewness Check

In this section, we examine the distribution of key numerical variables to better understand their shape, spread, and potential skewness. This step is important for identifying whether any variables violate assumptions required for linear regression, particularly normality and linearity. We focus on major financial and credit-related variables such as income, loan amount, credit score, and loan-to-income ratio. These variables are expected to exhibit skewed distributions due to the presence of high-income earners and large loan amounts in the dataset. Identifying skewness helps determine whether transformations (such as logarithmic scaling) may be necessary in later modeling stages.

Feature Engineering

In this section, we construct new variables to capture additional structure in the data. The aim is to improve model performance and interpretability by transforming existing features into more informative representations, particularly around employment history, home ownership, and loan burden.

Data Preparation

Create feature engineering dataset

## 'data.frame':    45000 obs. of  31 variables:
##  $ person_age                       : num  22 21 25 23 24 21 26 24 24 21 ...
##  $ person_income                    : num  71948 12282 12438 79753 66135 ...
##  $ person_emp_exp                   : num  0 0 3 0 1 0 1 5 3 0 ...
##  $ loan_amnt                        : num  35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
##  $ loan_int_rate                    : num  16 11.1 12.9 15.2 14.3 ...
##  $ loan_percent_income              : num  0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...
##  $ cb_person_cred_hist_length       : num  3 2 3 2 4 2 3 4 2 3 ...
##  $ credit_score                     : int  561 504 635 675 586 532 701 585 544 640 ...
##  $ loan_status                      : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
##  $ person_genderfemale              : num  1 1 1 1 0 1 1 1 1 1 ...
##  $ person_gendermale                : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ person_educationBachelor         : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ person_educationDoctorate        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ person_educationHigh School      : num  0 1 1 0 0 1 0 1 0 1 ...
##  $ person_educationMaster           : num  1 0 0 0 1 0 0 0 0 0 ...
##  $ person_home_ownershipOTHER       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ person_home_ownershipOWN         : num  0 1 0 0 0 1 0 0 0 1 ...
##  $ person_home_ownershipRENT        : num  1 0 0 1 1 0 1 1 1 0 ...
##  $ loan_intentEDUCATION             : num  0 1 0 0 0 0 1 0 0 0 ...
##  $ loan_intentHOMEIMPROVEMENT       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ loan_intentMEDICAL               : num  0 0 1 1 1 0 0 1 0 0 ...
##  $ loan_intentPERSONAL              : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ loan_intentVENTURE               : num  0 0 0 0 0 1 0 0 0 1 ...
##  $ previous_loan_defaults_on_fileYes: num  0 1 0 0 0 0 0 0 0 0 ...
##  $ emp_intensity                    : num  0 0 0.12 0 0.0417 ...
##  $ exp_gap                          : num  22 21 22 23 23 21 25 19 21 21 ...
##  $ log_income                       : num  11.18 9.42 9.43 11.29 11.1 ...
##  $ log_loan_amnt                    : num  10.46 6.91 8.61 10.46 10.46 ...
##  $ loan_burden                      : num  35255 983 5473 35091 35052 ...
##  $ debt_pressure                    : num  0.4865 0.0814 0.4422 0.4389 0.5292 ...
##  $ home_risk_score                  : num  1 1 0 1 1 1 1 1 1 1 ...

The engineered features listed above are meant to capture relationships that are not directly observable in the raw data, such as employment efficiency, financial pressure, and housing stability. The employment-based variables (emp_intensity and exp_gap) describe how much work experience an individual has accumulated relative to age, while the loan-based variables (loan_burden and debt_pressure) quantify borrowing intensity and repayment burden relative to income. Log transformations (log_income and log_loan_amnt) are applied to reduce skewness in the income and loan amount distributions, improving their suitability for linear modeling. Additionally, home_risk_score summarizes home ownership status as a single risk indicator. These features are intended to improve model performance and the interpretability of the credit risk and income prediction models; whether they do so has to be confirmed by refitting the models with them.
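The definitions of most engineered features can be inferred from the first rows of the structure output. The sketch below rebuilds them for the first five applicants shown there; the formulas are our reconstruction from the printed values and not the original code, and `home_risk_score` is left out because its rule cannot be recovered from the output alone.

```r
# Reconstruct the engineered features for the first five rows shown in str().
fe <- data.frame(
  person_age          = c(22, 21, 25, 23, 24),
  person_income       = c(71948, 12282, 12438, 79753, 66135),
  person_emp_exp      = c(0, 0, 3, 0, 1),
  loan_amnt           = c(35000, 1000, 5500, 35000, 35000),
  loan_percent_income = c(0.49, 0.08, 0.44, 0.44, 0.53)
)

# Inferred definitions (assumptions that reproduce the printed values).
fe$emp_intensity <- fe$person_emp_exp / fe$person_age          # experience per year of age
fe$exp_gap       <- fe$person_age - fe$person_emp_exp          # years not in employment
fe$log_income    <- log(fe$person_income)
fe$log_loan_amnt <- log(fe$loan_amnt)
fe$loan_burden   <- fe$loan_percent_income * fe$person_income  # loan share times income
fe$debt_pressure <- fe$loan_amnt / fe$person_income            # unrounded loan-to-income ratio

print(round(fe[, c("emp_intensity", "exp_gap", "log_income",
                   "log_loan_amnt", "loan_burden", "debt_pressure")], 4))
```

Under these definitions, loan_burden is close to loan_amnt and debt_pressure is close to loan_percent_income, so both new features largely duplicate columns that already exist and may add collinearity rather than new information.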