Group Members
This project analyzes a large loan applicant dataset to build predictive models for both income estimation and loan default risk. We apply statistical and machine learning techniques, with a strong emphasis on linear regression. The analysis includes data exploration, model building, and diagnostic checks to verify that the model assumptions hold. Model performance is compared using appropriate evaluation metrics to select the best approach. Finally, we interpret the results to draw insights about credit risk and financial behavior.
We follow a structured approach:
In this section, we load all the essential libraries required for data manipulation, visualization, statistical analysis, and machine learning tasks throughout the project. This ensures a consistent and efficient workflow across all stages of the analysis, including data preprocessing, exploratory data analysis, regression modeling, and classification.
## All libraries loaded successfully.
## person_age person_gender person_education person_income person_emp_exp
## 1 22 female Master 71948 0
## 2 21 female High School 12282 0
## 3 25 female High School 12438 3
## 4 23 female Bachelor 79753 0
## 5 24 male Master 66135 1
## 6 21 female High School 12951 0
## person_home_ownership loan_amnt loan_intent loan_int_rate loan_percent_income
## 1 RENT 35000 PERSONAL 16.02 0.49
## 2 OWN 1000 EDUCATION 11.14 0.08
## 3 MORTGAGE 5500 MEDICAL 12.87 0.44
## 4 RENT 35000 MEDICAL 15.23 0.44
## 5 RENT 35000 MEDICAL 14.27 0.53
## 6 OWN 2500 VENTURE 7.14 0.19
## cb_person_cred_hist_length credit_score previous_loan_defaults_on_file
## 1 3 561 No
## 2 2 504 Yes
## 3 3 635 No
## 4 2 675 No
## 5 4 586 No
## 6 2 532 No
## loan_status
## 1 1
## 2 0
## 3 1
## 4 1
## 5 1
## 6 1
## person_age person_gender person_education person_income person_emp_exp
## 44995 24 female Associate 31924 2
## 44996 27 male Associate 47971 6
## 44997 37 female Associate 65800 17
## 44998 33 male Associate 56942 7
## 44999 29 male Bachelor 33164 4
## 45000 24 male High School 51609 1
## person_home_ownership loan_amnt loan_intent loan_int_rate
## 44995 RENT 12229 MEDICAL 10.70
## 44996 RENT 15000 MEDICAL 15.66
## 44997 RENT 9000 HOMEIMPROVEMENT 14.07
## 44998 RENT 2771 DEBTCONSOLIDATION 10.02
## 44999 RENT 12000 EDUCATION 13.23
## 45000 RENT 6665 DEBTCONSOLIDATION 17.05
## loan_percent_income cb_person_cred_hist_length credit_score
## 44995 0.38 4 678
## 44996 0.31 3 645
## 44997 0.14 11 621
## 44998 0.05 10 668
## 44999 0.36 6 604
## 45000 0.13 3 628
## previous_loan_defaults_on_file loan_status
## 44995 No 1
## 44996 No 1
## 44997 No 1
## 44998 No 1
## 44999 No 1
## 45000 No 1
Before proceeding with any analysis, we perform initial checks to understand the structure, quality, and integrity of the dataset. This includes reviewing variable types, detecting missing values, and obtaining a general statistical summary of the data.
## [1] 45000 14
## [1] FALSE
## person_age person_gender
## 0 0
## person_education person_income
## 0 0
## person_emp_exp person_home_ownership
## 0 0
## loan_amnt loan_intent
## 0 0
## loan_int_rate loan_percent_income
## 0 0
## cb_person_cred_hist_length credit_score
## 0 0
## previous_loan_defaults_on_file loan_status
## 0 0
## 'data.frame': 45000 obs. of 14 variables:
## $ person_age : num 22 21 25 23 24 21 26 24 24 21 ...
## $ person_gender : chr "female" "female" "female" "female" ...
## $ person_education : chr "Master" "High School" "High School" "Bachelor" ...
## $ person_income : num 71948 12282 12438 79753 66135 ...
## $ person_emp_exp : int 0 0 3 0 1 0 1 5 3 0 ...
## $ person_home_ownership : chr "RENT" "OWN" "MORTGAGE" "RENT" ...
## $ loan_amnt : num 35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
## $ loan_intent : chr "PERSONAL" "EDUCATION" "MEDICAL" "MEDICAL" ...
## $ loan_int_rate : num 16 11.1 12.9 15.2 14.3 ...
## $ loan_percent_income : num 0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...
## $ cb_person_cred_hist_length : num 3 2 3 2 4 2 3 4 2 3 ...
## $ credit_score : int 561 504 635 675 586 532 701 585 544 640 ...
## $ previous_loan_defaults_on_file: chr "No" "Yes" "No" "No" ...
## $ loan_status : int 1 0 1 1 1 1 1 1 1 1 ...
## person_age person_gender person_education person_income
## Min. : 20.00 Length:45000 Length:45000 Min. : 8000
## 1st Qu.: 24.00 Class :character Class :character 1st Qu.: 47204
## Median : 26.00 Mode :character Mode :character Median : 67048
## Mean : 27.76 Mean : 80319
## 3rd Qu.: 30.00 3rd Qu.: 95789
## Max. :144.00 Max. :7200766
## person_emp_exp person_home_ownership loan_amnt loan_intent
## Min. : 0.00 Length:45000 Min. : 500 Length:45000
## 1st Qu.: 1.00 Class :character 1st Qu.: 5000 Class :character
## Median : 4.00 Mode :character Median : 8000 Mode :character
## Mean : 5.41 Mean : 9583
## 3rd Qu.: 8.00 3rd Qu.:12237
## Max. :125.00 Max. :35000
## loan_int_rate loan_percent_income cb_person_cred_hist_length credit_score
## Min. : 5.42 Min. :0.0000 Min. : 2.000 Min. :390.0
## 1st Qu.: 8.59 1st Qu.:0.0700 1st Qu.: 3.000 1st Qu.:601.0
## Median :11.01 Median :0.1200 Median : 4.000 Median :640.0
## Mean :11.01 Mean :0.1397 Mean : 5.867 Mean :632.6
## 3rd Qu.:12.99 3rd Qu.:0.1900 3rd Qu.: 8.000 3rd Qu.:670.0
## Max. :20.00 Max. :0.6600 Max. :30.000 Max. :850.0
## previous_loan_defaults_on_file loan_status
## Length:45000 Min. :0.0000
## Class :character 1st Qu.:0.0000
## Mode :character Median :0.0000
## Mean :0.2222
## 3rd Qu.:0.0000
## Max. :1.0000
## [1] 0
The dataset contains 45,000 observations and 14 variables with no missing values or duplicate records, so no imputation or deduplication is needed. Variables include a mix of numerical and categorical features related to applicant demographics, financial status, and credit history. The summaries reveal some implausible extreme values (a maximum person_age of 144, a maximum person_emp_exp of 125, and a maximum person_income of about 7.2 million) and a class imbalance in the target variable: only about 22% of records have loan_status = 1. Overall, the data is well-structured and ready for further preprocessing and analysis.
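The checks behind the output above can be sketched in base R as follows. This is only an illustration: the data frame name `loans` is hypothetical, and the toy values are copied from four columns of the first six rows printed earlier, since the report does not show its loading code.

```r
# Toy stand-in for the loan data (hypothetical object name `loans`);
# values copied from the first six rows shown above.
loans <- data.frame(
  person_age    = c(22, 21, 25, 23, 24, 21),
  person_income = c(71948, 12282, 12438, 79753, 66135, 12951),
  loan_amnt     = c(35000, 1000, 5500, 35000, 35000, 2500),
  loan_status   = c(1L, 0L, 1L, 1L, 1L, 1L)
)

dims        <- dim(loans)              # number of rows and columns
any_missing <- anyNA(loans)            # TRUE if any cell is NA
na_per_col  <- colSums(is.na(loans))   # missing values per variable
n_dupes     <- sum(duplicated(loans))  # fully duplicated rows

str(loans)      # variable types
summary(loans)  # per-variable summaries; extreme maxima show up here
print(dims)
print(any_missing)
print(na_per_col)
print(n_dupes)
```

On the full dataset, the same calls would be applied to the complete data frame rather than to this six-row excerpt.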
In this section, we explore the dataset by examining the distribution of the target variable, numerical features, and categorical variables to understand its overall structure. The analysis focuses on identifying relationships between predictors and loan_status to uncover early patterns useful for modeling. This is supported using distribution plots, boxplots for outlier detection, and correlation analysis among numerical variables.
##
## 0 1
## 77.77778 22.22222
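A class-percentage table like the one above can be produced with `table()` and `prop.table()`. The sketch below uses a made-up label vector with the same 7:2 ratio as the reported split; the variable names are assumptions, not the report's own code.

```r
# Hypothetical label vector with the same 7:2 class ratio as reported above.
loan_status <- c(rep(0L, 7), rep(1L, 2))

# Share of each class, in percent
class_pct <- prop.table(table(loan_status)) * 100
print(class_pct)
```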
- Most numerical variables show varied distributions, with some
exhibiting skewness.
- Several variables contain outliers, particularly income and
age-related features.
- The categorical variables show clear group distributions, with some
categories dominating others.
- There are visible differences in distributions of key numerical
variables across loan status groups, suggesting these features may be
strong predictors of repayment behavior.
- Certain categorical groups show different proportions of default and
repayment outcomes, indicating that demographic and behavioral factors
influence loan performance.
- Several variables exhibit moderate correlations, particularly
financial attributes.
- Higher credit scores are generally associated with lower default
rates, indicating credit score is a strong predictor of loan
performance.
- There is a visible relationship between income and loan amount,
suggesting lenders may align loan size with borrower income levels.
- Non-defaulting borrowers tend to face higher interest rates.
To stabilize our models, we focused on two areas:
## 'data.frame': 45000 obs. of 21 variables:
## $ person_age : num 22 21 25 23 24 21 26 24 24 21 ...
## $ person_income : num 71948 12282 12438 79753 66135 ...
## $ person_emp_exp : num 0 0 3 0 1 0 1 5 3 0 ...
## $ loan_amnt : num 35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
## $ loan_int_rate : num 16 11.1 12.9 15.2 14.3 ...
## $ loan_percent_income : num 0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...
## $ cb_person_cred_hist_length : num 3 2 3 2 4 2 3 4 2 3 ...
## $ credit_score : int 561 504 635 675 586 532 701 585 544 640 ...
## $ loan_status : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
## $ person_genderfemale : num 1 1 1 1 0 1 1 1 1 1 ...
## $ person_gendermale : num 0 0 0 0 1 0 0 0 0 0 ...
## $ person_education : num 4 1 1 3 4 1 3 1 2 1 ...
## $ person_home_ownershipOTHER : num 0 0 0 0 0 0 0 0 0 0 ...
## $ person_home_ownershipOWN : num 0 1 0 0 0 1 0 0 0 1 ...
## $ person_home_ownershipRENT : num 1 0 0 1 1 0 1 1 1 0 ...
## $ loan_intentEDUCATION : num 0 1 0 0 0 0 1 0 0 0 ...
## $ loan_intentHOMEIMPROVEMENT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ loan_intentMEDICAL : num 0 0 1 1 1 0 0 1 0 0 ...
## $ loan_intentPERSONAL : num 1 0 0 0 0 0 0 0 1 0 ...
## $ loan_intentVENTURE : num 0 0 0 0 0 1 0 0 0 1 ...
## $ previous_loan_defaults_on_fileYes: num 0 1 0 0 0 0 0 0 0 0 ...
Our cleaned distributions show:
We are predicting Personal Income. To ensure a high-integrity model, we applied two filters:
In this section, the cleaned dataset is split into training and testing sets using an 80–20 ratio to ensure proper model validation on unseen data. The training set is used to build multiple regression models, while the test set is reserved for final performance evaluation. To identify the most relevant predictors of personal income, a forward stepwise selection approach is applied. This method begins with an empty model and iteratively adds variables that improve model performance until no further improvement is achieved. The final selected model is then evaluated against alternative specifications to ensure optimal predictive accuracy and interpretability.
Rather than using all predictors at once, we use an iterative Forward Selection approach:
## Start: AIC=815923.8
## person_income ~ 1
##
## Df Sum of Sq RSS AIC
## + loan_amnt 1 1.3463e+13 2.3736e+14 813940
## + person_home_ownershipRENT 1 9.3048e+12 2.4152e+14 814565
## + person_age 1 3.9857e+12 2.4684e+14 815349
## + cb_person_cred_hist_length 1 3.8285e+12 2.4699e+14 815372
## + person_emp_exp 1 3.4412e+12 2.4738e+14 815429
## + previous_loan_defaults_on_fileYes 1 7.6369e+11 2.5006e+14 815816
## + loan_intentMEDICAL 1 5.3144e+11 2.5029e+14 815849
## + credit_score 1 3.5532e+11 2.5047e+14 815875
## + person_home_ownershipOWN 1 3.4238e+11 2.5048e+14 815877
## + loan_intentHOMEIMPROVEMENT 1 3.4010e+11 2.5048e+14 815877
## + loan_intentPERSONAL 1 1.1228e+11 2.5071e+14 815910
## + loan_intentEDUCATION 1 6.9234e+10 2.5075e+14 815916
## + loan_intentVENTURE 1 4.4168e+10 2.5078e+14 815920
## + person_home_ownershipOTHER 1 2.1847e+10 2.5080e+14 815923
## <none> 2.5082e+14 815924
## + loan_int_rate 1 2.2173e+09 2.5082e+14 815926
## + person_education 1 7.2213e+08 2.5082e+14 815926
##
## Step: AIC=813939.7
## person_income ~ loan_amnt
##
## Df Sum of Sq RSS AIC
## + person_home_ownershipRENT 1 6.6200e+12 2.3074e+14 812923
## + cb_person_cred_hist_length 1 3.2604e+12 2.3410e+14 813444
## + person_age 1 3.2065e+12 2.3415e+14 813452
## + person_emp_exp 1 2.8196e+12 2.3454e+14 813511
## + previous_loan_defaults_on_fileYes 1 1.2004e+12 2.3616e+14 813759
## + loan_intentMEDICAL 1 3.5766e+11 2.3700e+14 813887
## + credit_score 1 3.1291e+11 2.3705e+14 813894
## + person_home_ownershipOWN 1 2.5374e+11 2.3711e+14 813903
## + loan_int_rate 1 2.4677e+11 2.3711e+14 813904
## + loan_intentHOMEIMPROVEMENT 1 1.6255e+11 2.3720e+14 813917
## + loan_intentPERSONAL 1 9.7388e+10 2.3726e+14 813927
## + loan_intentEDUCATION 1 5.5882e+10 2.3730e+14 813933
## + loan_intentVENTURE 1 3.8139e+10 2.3732e+14 813936
## <none> 2.3736e+14 813940
## + person_home_ownershipOTHER 1 9.6764e+09 2.3735e+14 813940
## + person_education 1 5.4741e+08 2.3736e+14 813942
##
## Step: AIC=812923.4
## person_income ~ loan_amnt + person_home_ownershipRENT
##
## Df Sum of Sq RSS AIC
## + cb_person_cred_hist_length 1 3.0348e+12 2.2771e+14 812449
## + person_age 1 2.8560e+12 2.2788e+14 812477
## + person_emp_exp 1 2.4936e+12 2.2825e+14 812534
## + person_home_ownershipOWN 1 1.6402e+12 2.2910e+14 812669
## + previous_loan_defaults_on_fileYes 1 5.2585e+11 2.3021e+14 812843
## + credit_score 1 2.9614e+11 2.3044e+14 812879
## + loan_intentMEDICAL 1 1.9135e+11 2.3055e+14 812896
## + loan_intentHOMEIMPROVEMENT 1 7.9732e+10 2.3066e+14 812913
## + loan_intentPERSONAL 1 6.5372e+10 2.3067e+14 812915
## + loan_intentEDUCATION 1 5.2119e+10 2.3069e+14 812917
## <none> 2.3074e+14 812923
## + loan_int_rate 1 1.2378e+10 2.3073e+14 812923
## + loan_intentVENTURE 1 9.8279e+09 2.3073e+14 812924
## + person_home_ownershipOTHER 1 1.1467e+09 2.3074e+14 812925
## + person_education 1 1.8132e+08 2.3074e+14 812925
##
## Step: AIC=812448.8
## person_income ~ loan_amnt + person_home_ownershipRENT + cb_person_cred_hist_length
##
## Df Sum of Sq RSS AIC
## + person_home_ownershipOWN 1 1.6085e+12 2.2610e+14 812196
## + previous_loan_defaults_on_fileYes 1 5.8956e+11 2.2712e+14 812357
## + loan_intentMEDICAL 1 2.1485e+11 2.2749e+14 812417
## + person_age 1 1.4653e+11 2.2756e+14 812428
## + credit_score 1 7.7152e+10 2.2763e+14 812439
## + person_emp_exp 1 7.6359e+10 2.2763e+14 812439
## + loan_intentPERSONAL 1 4.2323e+10 2.2766e+14 812444
## + loan_intentHOMEIMPROVEMENT 1 3.6294e+10 2.2767e+14 812445
## + loan_int_rate 1 1.9955e+10 2.2769e+14 812448
## <none> 2.2771e+14 812449
## + loan_intentVENTURE 1 1.2128e+10 2.2769e+14 812449
## + loan_intentEDUCATION 1 1.1879e+10 2.2769e+14 812449
## + person_education 1 3.1195e+08 2.2770e+14 812451
## + person_home_ownershipOTHER 1 2.7749e+08 2.2770e+14 812451
##
## Step: AIC=812195.6
## person_income ~ loan_amnt + person_home_ownershipRENT + cb_person_cred_hist_length +
## person_home_ownershipOWN
##
## Df Sum of Sq RSS AIC
## + previous_loan_defaults_on_fileYes 1 6.0542e+11 2.2549e+14 812101
## + loan_intentMEDICAL 1 2.1396e+11 2.2588e+14 812163
## + person_age 1 1.3385e+11 2.2596e+14 812176
## + credit_score 1 7.5510e+10 2.2602e+14 812186
## + person_emp_exp 1 7.2774e+10 2.2602e+14 812186
## + loan_intentVENTURE 1 4.6953e+10 2.2605e+14 812190
## + loan_intentPERSONAL 1 4.3516e+10 2.2605e+14 812191
## + loan_intentHOMEIMPROVEMENT 1 3.5586e+10 2.2606e+14 812192
## <none> 2.2610e+14 812196
## + loan_intentEDUCATION 1 1.1643e+10 2.2609e+14 812196
## + loan_int_rate 1 9.9251e+09 2.2609e+14 812196
## + person_home_ownershipOTHER 1 2.8407e+09 2.2609e+14 812197
## + person_education 1 6.1285e+08 2.2610e+14 812197
##
## Step: AIC=812101
## person_income ~ loan_amnt + person_home_ownershipRENT + cb_person_cred_hist_length +
## person_home_ownershipOWN + previous_loan_defaults_on_fileYes
##
## Df Sum of Sq RSS AIC
## + loan_intentMEDICAL 1 1.9369e+11 2.2530e+14 812072
## + credit_score 1 1.7923e+11 2.2531e+14 812074
## + person_age 1 1.3526e+11 2.2536e+14 812081
## + person_emp_exp 1 7.6325e+10 2.2541e+14 812091
## + loan_intentHOMEIMPROVEMENT 1 4.4131e+10 2.2545e+14 812096
## + loan_intentPERSONAL 1 4.1363e+10 2.2545e+14 812096
## + loan_intentVENTURE 1 3.2674e+10 2.2546e+14 812098
## + loan_intentEDUCATION 1 1.9501e+10 2.2547e+14 812100
## <none> 2.2549e+14 812101
## + person_education 1 3.1495e+09 2.2549e+14 812103
## + person_home_ownershipOTHER 1 1.4323e+09 2.2549e+14 812103
## + loan_int_rate 1 5.2129e+08 2.2549e+14 812103
##
## Step: AIC=812072.1
## person_income ~ loan_amnt + person_home_ownershipRENT + cb_person_cred_hist_length +
## person_home_ownershipOWN + previous_loan_defaults_on_fileYes +
## loan_intentMEDICAL
##
## Df Sum of Sq RSS AIC
## + credit_score 1 1.7604e+11 2.2512e+14 812046
## + person_age 1 1.4104e+11 2.2516e+14 812052
## + person_emp_exp 1 8.0400e+10 2.2522e+14 812061
## + loan_intentEDUCATION 1 6.5007e+10 2.2523e+14 812064
## + loan_intentHOMEIMPROVEMENT 1 1.9323e+10 2.2528e+14 812071
## <none> 2.2530e+14 812072
## + loan_intentPERSONAL 1 1.2218e+10 2.2529e+14 812072
## + loan_intentVENTURE 1 7.4270e+09 2.2529e+14 812073
## + person_education 1 3.3148e+09 2.2529e+14 812074
## + person_home_ownershipOTHER 1 1.3936e+09 2.2530e+14 812074
## + loan_int_rate 1 6.5068e+08 2.2530e+14 812074
##
## Step: AIC=812045.9
## person_income ~ loan_amnt + person_home_ownershipRENT + cb_person_cred_hist_length +
## person_home_ownershipOWN + previous_loan_defaults_on_fileYes +
## loan_intentMEDICAL + credit_score
##
## Df Sum of Sq RSS AIC
## + person_age 1 1.2089e+11 2.2500e+14 812029
## + loan_intentEDUCATION 1 6.4856e+10 2.2506e+14 812038
## + person_emp_exp 1 6.1144e+10 2.2506e+14 812038
## + loan_intentHOMEIMPROVEMENT 1 1.9678e+10 2.2510e+14 812045
## <none> 2.2512e+14 812046
## + loan_intentPERSONAL 1 1.2208e+10 2.2511e+14 812046
## + loan_intentVENTURE 1 6.0888e+09 2.2512e+14 812047
## + loan_int_rate 1 1.1578e+09 2.2512e+14 812048
## + person_home_ownershipOTHER 1 1.1279e+09 2.2512e+14 812048
## + person_education 1 9.3862e+08 2.2512e+14 812048
##
## Step: AIC=812028.6
## person_income ~ loan_amnt + person_home_ownershipRENT + cb_person_cred_hist_length +
## person_home_ownershipOWN + previous_loan_defaults_on_fileYes +
## loan_intentMEDICAL + credit_score + person_age
##
## Df Sum of Sq RSS AIC
## + loan_intentEDUCATION 1 5.5497e+10 2.2495e+14 812022
## + loan_intentPERSONAL 1 1.3462e+10 2.2499e+14 812028
## + loan_intentHOMEIMPROVEMENT 1 1.3351e+10 2.2499e+14 812028
## <none> 2.2500e+14 812029
## + loan_intentVENTURE 1 6.1260e+09 2.2499e+14 812030
## + person_emp_exp 1 2.4372e+09 2.2500e+14 812030
## + person_education 1 1.2740e+09 2.2500e+14 812030
## + loan_int_rate 1 1.0822e+09 2.2500e+14 812030
## + person_home_ownershipOTHER 1 8.9349e+08 2.2500e+14 812030
##
## Step: AIC=812021.7
## person_income ~ loan_amnt + person_home_ownershipRENT + cb_person_cred_hist_length +
## person_home_ownershipOWN + previous_loan_defaults_on_fileYes +
## loan_intentMEDICAL + credit_score + person_age + loan_intentEDUCATION
##
## Df Sum of Sq RSS AIC
## <none> 2.2495e+14 812022
## + loan_intentHOMEIMPROVEMENT 1 4340039311 2.2494e+14 812023
## + loan_intentPERSONAL 1 2364572744 2.2494e+14 812023
## + person_emp_exp 1 2006240893 2.2494e+14 812023
## + person_education 1 1467636058 2.2494e+14 812023
## + person_home_ownershipOTHER 1 1008515270 2.2494e+14 812024
## + loan_int_rate 1 969697438 2.2494e+14 812024
## + loan_intentVENTURE 1 43860510 2.2495e+14 812024
##
## Call:
## lm(formula = person_income ~ loan_amnt + person_home_ownershipRENT +
## cb_person_cred_hist_length + person_home_ownershipOWN + previous_loan_defaults_on_fileYes +
## loan_intentMEDICAL + credit_score + person_age + loan_intentEDUCATION,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105572 -26233 -8919 13538 7071837
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.406e+04 6.360e+03 2.210 0.02708 *
## loan_amnt 2.676e+00 6.706e-02 39.901 < 2e-16 ***
## person_home_ownershipRENT -2.908e+04 8.890e+02 -32.716 < 2e-16 ***
## cb_person_cred_hist_length 1.524e+03 2.079e+02 7.329 2.37e-13 ***
## person_home_ownershipOWN -2.816e+04 1.755e+03 -16.051 < 2e-16 ***
## previous_loan_defaults_on_fileYes 9.052e+03 8.597e+02 10.530 < 2e-16 ***
## loan_intentMEDICAL -6.760e+03 1.098e+03 -6.157 7.48e-10 ***
## credit_score 4.265e+01 8.525e+00 5.003 5.66e-07 ***
## person_age 6.960e+02 1.648e+02 4.224 2.40e-05 ***
## loan_intentEDUCATION -3.200e+03 1.074e+03 -2.980 0.00289 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79060 on 35990 degrees of freedom
## Multiple R-squared: 0.1032, Adjusted R-squared: 0.1029
## F-statistic: 460 on 9 and 35990 DF, p-value: < 2.2e-16
The forward stepwise selection lowered the AIC from 815,923.8 for the intercept-only model to 812,021.7 for the final model. At each step, the algorithm added the variable that produced the largest reduction in the Akaike Information Criterion (AIC), beginning with loan_amnt and followed by person_home_ownershipRENT, cb_person_cred_hist_length, person_home_ownershipOWN, previous_loan_defaults_on_fileYes, loan_intentMEDICAL, credit_score, person_age, and loan_intentEDUCATION. The procedure stopped after nine variables had been added, because none of the remaining candidates (for example person_education, person_emp_exp, or loan_int_rate) lowered the AIC any further. Because AIC penalizes each additional parameter, the selected model trades goodness of fit against complexity. Even so, its explanatory power is modest: the adjusted R² on the training data is only 0.1029.
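The split-and-select procedure described above can be sketched with base R's `step()`. The data here are simulated and the predictor names `x1`, `x2`, `x3` are placeholders; only the names `train_data` and `best_model` echo the report's output, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data: income depends on x1 and x2; x3 is pure noise.
n   <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$income <- 50000 + 8000 * sim$x1 + 3000 * sim$x2 + rnorm(n, sd = 5000)

# 80/20 train-test split
train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- sim[train_idx, ]
test_data  <- sim[-train_idx, ]

# Forward selection: start from the intercept-only model and, at each step,
# add the term that lowers AIC the most; stop when no addition helps.
null_model <- lm(income ~ 1, data = train_data)
best_model <- step(
  null_model,
  scope     = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
  direction = "forward",
  trace     = 0  # set to 1 to print the step-by-step AIC tables
)

selected <- attr(terms(best_model), "term.labels")
print(selected)
```

With `trace = 1`, `step()` prints one candidate table per step in the same layout as the AIC tables shown above.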
The final model is retrained using the predictors selected through the forward stepwise selection process.
##
## Call:
## lm(formula = person_income ~ loan_amnt + person_home_ownershipRENT +
## cb_person_cred_hist_length + person_home_ownershipOWN + previous_loan_defaults_on_fileYes +
## loan_intentMEDICAL + credit_score + person_age + loan_intentEDUCATION,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105572 -26233 -8919 13538 7071837
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.406e+04 6.360e+03 2.210 0.02708 *
## loan_amnt 2.676e+00 6.706e-02 39.901 < 2e-16 ***
## person_home_ownershipRENT -2.908e+04 8.890e+02 -32.716 < 2e-16 ***
## cb_person_cred_hist_length 1.524e+03 2.079e+02 7.329 2.37e-13 ***
## person_home_ownershipOWN -2.816e+04 1.755e+03 -16.051 < 2e-16 ***
## previous_loan_defaults_on_fileYes 9.052e+03 8.597e+02 10.530 < 2e-16 ***
## loan_intentMEDICAL -6.760e+03 1.098e+03 -6.157 7.48e-10 ***
## credit_score 4.265e+01 8.525e+00 5.003 5.66e-07 ***
## person_age 6.960e+02 1.648e+02 4.224 2.40e-05 ***
## loan_intentEDUCATION -3.200e+03 1.074e+03 -2.980 0.00289 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79060 on 35990 degrees of freedom
## Multiple R-squared: 0.1032, Adjusted R-squared: 0.1029
## F-statistic: 460 on 9 and 35990 DF, p-value: < 2.2e-16
## [1] 61359.59
## [1] 30733.79
## [1] 0.1574616
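The final linear regression model identifies several statistically significant predictors of income, including loan amount, credit score, age, and credit history length. Loan amount is the strongest single predictor (t = 39.9): each additional unit of loan amount is associated with about 2.68 additional units of income, holding the other predictors fixed. Credit score, age, and credit history length also have positive coefficients. Renters and outright owners have predicted incomes roughly 28,000 to 29,000 lower than the reference home ownership group, and medical and education loans are associated with lower income than the other loan purposes. Counterintuitively, a previous default on file is associated with higher income (about 9,050), so the coefficients should be read as associations rather than causal effects. Despite these significant relationships, the model has low predictive power, with a test R² of approximately 0.16. This suggests that while the model captures some meaningful patterns, additional variables or more flexible models may be needed for improved accuracy.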
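For reference, the three test-set metrics can be computed with a few lines of base R. The helper names and the numbers below are made up for illustration. The report does not show which definition of R² it used: the version here is 1 minus the ratio of residual to total sum of squares, whereas some packages report the squared correlation between predictions and observations instead.

```r
# Hypothetical helper functions for test-set evaluation.
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae  <- function(actual, pred) mean(abs(actual - pred))
r2   <- function(actual, pred) {
  1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
}

# Tiny worked example with made-up numbers
actual <- c(10, 20, 30, 40)
pred   <- c(12, 18, 33, 37)

print(c(RMSE = rmse(actual, pred),
        MAE  = mae(actual, pred),
        R2   = r2(actual, pred)))
```

RMSE is never smaller than MAE, and a large gap between the two, as in the report's output, points to a few very large errors.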
Before relying on a linear regression model for inference or
prediction, it is important to verify that several key statistical
assumptions are satisfied. These assumptions ensure that the model
estimates are reliable, unbiased, and interpretable. First, we check the
assumption of linearity, which requires a linear relationship between
the predictors and the response variable. Second, we assess normality of
residuals to ensure that the error terms are approximately normally
distributed. Third, we examine homoscedasticity, meaning that the
variance of residuals remains constant across all fitted values.
Finally, we check for influential points and outliers that may
disproportionately affect the model estimates and overall performance.
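A minimal base-R sketch of these checks follows, on simulated data with deliberately skewed, heteroscedastic errors. The "studentized Breusch-Pagan test" header in the output is the one printed by `lmtest::bptest()`; to stay dependency-free, the sketch computes the same studentized (Koenker) statistic by hand. The 4/n cutoff for Cook's distance is a common rule of thumb and an assumption here, since the report does not state its threshold.

```r
set.seed(1)  # arbitrary seed

# Simulated data with right-skewed errors whose spread grows with x.
n   <- 2000
x   <- runif(n, 1, 10)
y   <- 5 + 2 * x + x * (rexp(n) - 1)
fit <- lm(y ~ x)
res <- residuals(fit)

# 1. Normality: shapiro.test() accepts at most 5000 values, so large
#    models need a random subsample of the residuals.
sw <- shapiro.test(sample(res, min(length(res), 5000)))

# 2. Homoscedasticity: studentized Breusch-Pagan statistic, n * R^2 from
#    regressing the squared residuals on the predictors.
res_sq  <- res^2
aux     <- lm(res_sq ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)  # df = number of predictors

# 3. Influence: flag observations with Cook's distance above 4 / n.
cd          <- cooks.distance(fit)
influential <- which(cd > 4 / n)

# 4. Diagnostic plots (residuals vs fitted, Q-Q, scale-location, leverage):
# plot(fit, which = c(1, 2, 3, 5))

print(sw)
print(c(BP = bp_stat, p_value = bp_p))
print(length(influential))
```

A `which()` call of this kind returns a named integer vector (row names on top, positions underneath), which matches the layout of the long index listing in the output.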
##
## Shapiro-Wilk normality test
##
## data: sample(residuals(best_model), 5000)
## W = 0.64385, p-value < 2.2e-16
##
## studentized Breusch-Pagan test
##
## data: best_model
## BP = 150.23, df = 9, p-value < 2.2e-16
## 42521 31919 29133 32076 36352 20547 17955 15886 32385 29127 29169 27772 15656
## 96 151 163 266 312 388 654 700 730 924 1042 1069 1072
## 32864 27771 13217 18186 27837 29515 17988 40441 27877 27849 28786 17849 17542
## 1076 1094 1267 1278 1302 1340 1388 1392 1551 1728 1745 1770 1782
## 17940 32635 35273 27848 27711 17863 27864 38326 32328 57 27861 27860 32301
## 1828 1866 2060 2069 2154 2169 2197 2216 2223 2338 2428 2445 2664
## 29162 17888 27846 27950 27872 36279 32549 32343 31911 17965 27859 17941 36322
## 2761 2857 2883 2901 2902 2927 3009 3098 3110 3217 3218 3262 3278
## 100 43121 69 34611 28648 28089 218 17922 27798 27882 29346 27997 15888
## 3315 3337 3389 3542 3641 3680 3713 4014 4081 4193 4300 4320 4394
## 43916 28655 17906 28479 43 29149 27961 15893 33976 31900 30535 32547 35753
## 4516 4523 4534 4590 4647 4665 4734 4903 4978 5001 5031 5100 5102
## 43871 41676 184 18090 32557 16 29166 17900 81 38506 17872 17915 27870
## 5144 5147 5258 5301 5351 5379 5511 5631 5687 5824 5879 5964 5970
## 23430 37821 34190 34186 17972 32484 40721 17931 15838 31910 31867 31894 47
## 6000 6028 6031 6161 6164 6387 6458 6481 6523 6534 6621 6676 6678
## 31893 17858 17874 32306 27840 28797 27841 27884 32481 33935 17873 27855 2423
## 6732 6748 6756 6761 7117 7214 7394 7503 7616 7646 7720 7734 7778
## 27799 15909 15811 27839 31913 31916 28537 17914 29153 34995 17878 27865 31914
## 7840 7937 7990 8069 8216 8321 8383 8549 8557 8600 8759 8783 8815
## 27853 36424 28634 29161 27657 15751 28494 32300 31906 32118 32417 29122 29145
## 8836 8864 8934 9034 9242 9261 9387 9479 9549 9647 9734 9785 9803
## 1880 43242 28771 23429 27867 28420 32225 17895 37383 15905 17913 32504 35740
## 9836 9837 9930 9978 10123 10197 10329 10383 10486 10509 10598 10701 10866
## 20614 17903 32367 27858 27779 17887 34419 32544 32071 27851 28254 29129 210
## 11076 11271 11329 11529 11548 11658 11789 11800 11825 11834 11863 11961 12043
## 27630 36257 29189 41544 28975 33490 38221 15908 140 31869 43148 31889 27658
## 12157 12257 12600 12881 12897 12948 12987 13194 13295 13316 13339 13390 13505
## 18198 27815 17859 31908 15800 15831 36449 44948 16757 29154 103 17912 40431
## 13659 13691 13930 14091 14092 14139 14436 14605 14617 14667 14751 14763 14810
## 35 16133 17834 31886 18197 37176 31905 29157 32292 27838 27498 18024 27856
## 14852 14886 14890 14990 15158 15166 15180 15207 15330 15423 15590 15662 15696
## 27847 27821 56 31920 34824 38801 17894 29179 29138 41465 27627 23432 38114
## 15726 15729 15827 15884 15930 15982 16543 16662 16689 16714 16728 16740 16752
## 401 141 39857 19969 18058 15894 33223 28239 32309 29204 27801 15901 15891
## 16847 16937 16946 16960 17078 17096 17100 17240 17311 17318 17326 17327 17354
## 17847 32038 39825 29143 125 27770 40233 32136 32298 28519 18199 29130 17956
## 17428 17508 17510 17553 17573 17637 17684 17691 17698 17777 17849 17852 17992
## 39117 35090 37931 166 29123 222 32125 17885 40025 28136 29243 18625 32299
## 18020 18026 18065 18228 18256 18293 18297 18327 18379 18460 18466 18565 18582
## 27873 43096 33488 28901 33795 17229 43631 17925 114 31358 45 32384 17860
## 18589 18691 18717 18734 18872 19065 19137 19142 19156 19177 19261 19278 19553
## 42086 234 27843 36598 40383 15915 29156 17942 99 32548 29188 27830 29140
## 19614 19641 19712 19714 19864 20228 20375 20459 20483 20491 20507 20537 20567
## 25714 31915 31917 8445 29137 27854 39985 18106 27868 30050 27783 33532 58
## 20615 20758 20770 20786 20807 20864 21037 21110 21115 21174 21191 21288 21326
## 15830 38300 44923 32936 68 39419 31901 32434 40162 18017 38867 82 33240
## 21334 21360 21405 21454 21477 21541 21691 21852 21966 22041 22046 22067 22513
## 32480 17861 32595 27820 15902 15591 27831 27540 27552 17948 32543 28807 38913
## 22666 22683 22695 22768 22791 22799 22860 22923 22968 23143 23232 23301 23443
## 27883 39412 15895 27874 27828 91 44787 28736 15536 17870 32554 266 365
## 23476 23503 23558 23590 23632 23664 23859 23895 23928 24143 24181 24213 24294
## 17933 2197 15896 40105 32579 37313 38618 37488 42085 30537 28854 41949 27871
## 24438 24444 24498 24502 24512 24520 24591 24661 24706 24707 24778 24809 24867
## 42651 30026 31896 17949 29160 239 17862 17886 31899 15875 27826 15889 15405
## 24918 24944 25400 25433 25584 25593 25599 25696 25820 25871 26061 26197 26253
## 17904 17926 32308 32572 29146 15730 32416 27811 233 40964 32305 34680 15898
## 26267 26314 26406 26433 26458 26460 26511 26528 26559 27068 27367 27383 27519
## 29120 27878 28386 27885 36332 15906 33630 185 18047 18264 27857 27835 32552
## 27728 27755 27827 27835 27877 28047 28187 28325 28374 28571 28595 28603 28604
## 32312 21241 32404 29295 37835 21958 32310 17835 41762 42020 32349 29144 32577
## 28654 28744 28798 28846 28867 28931 28944 28985 29009 29018 29175 29189 29320
## 35708 331 29134 15856 17916 35659 27876 22082 34 32321 38400 18636 29671
## 29330 29374 29575 29627 29800 29906 30036 30057 30068 30566 30594 30724 30861
## 17897 39293 29152 17924 44 32006 32254 31923 32545 70 17869 38414 63
## 30991 31006 31279 31307 31374 31778 31825 31976 32004 32168 32444 32472 32696
## 32365 41167 16927 29514 35923 31885 15868 32764 17875 40021 29132 264 18918
## 32790 32810 32871 33015 33187 33214 33217 33313 33354 33381 33386 33429 33526
## 32422 33228 29128 32234 40828 15529 32311 32048 17893 38074 29026 39854 31925
## 33558 33580 33649 33704 33753 33941 33965 34112 34227 34337 34933 35206 35324
## 37003 27850 17866 32348 15834 39990 31903 33624 17871 27836 32382 13218 17908
## 35351 35378 35399 35430 35437 35438 35502 35520 35579 35724 35735 35824 35898
The diagnostic analysis shows that the current model violates several classical linear regression assumptions, so the usual OLS standard errors and p-values are potentially unreliable. The Shapiro-Wilk test on a sample of 5,000 residuals rejects normality (W = 0.644, p < 2.2e-16), and the Normal Q-Q plot shows a sharp upward deviation from the theoretical line. This indicates a heavily right-skewed error distribution: the model substantially underpredicts the highest incomes. The studentized Breusch-Pagan test rejects constant variance (BP = 150.23, df = 9, p < 2.2e-16), in line with the Scale-Location plot, where the error variance increases with the fitted values. Finally, the Residuals vs Leverage plot identifies influential observations, notably case 32298, which combine high leverage with large residuals and are likely exerting a disproportionate pull on the regression coefficients. To improve model validity, we explore a logarithmic transformation of the dependent variable to stabilize variance and reduce skewness.
## [1] 62110.12
## [1] 28955.59
## [1] 0.1367244
## loan_amnt person_home_ownershipRENT
## 1.033331 1.136144
## cb_person_cred_hist_length person_home_ownershipOWN
## 3.772604 1.087124
## previous_loan_defaults_on_fileYes loan_intentMEDICAL
## 1.063920 1.070368
## credit_score person_age
## 1.064828 3.810142
## loan_intentEDUCATION
## 1.073787
##
## Durbin-Watson test
##
## data: log_best_model
## DW = 2.0012, p-value = 0.5471
## alternative hypothesis: true autocorrelation is greater than 0
The log-linear model shows acceptable behavior in terms of multicollinearity and independence: all VIF values are below 5 (the largest, about 3.8, belong to person_age and cb_person_cred_hist_length), and the Durbin-Watson test gives no evidence of autocorrelation (DW = 2.0012, p = 0.5471). However, the Scale-Location plot still reveals heteroscedasticity, meaning the residual variance is not constant. The linear model assumptions are therefore only partially satisfied, and the model may not fully capture the underlying patterns in the data.
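Both quantities can be computed by hand, which makes their meaning explicit. The sketch below uses simulated predictors, two of them correlated at about 0.85, and placeholder names. It is an illustration under those assumptions, not the report's code, which presumably relied on package functions.

```r
set.seed(7)  # arbitrary seed

# Simulated predictors: x2 is correlated with x1, x3 is independent.
n   <- 1000
x1  <- rnorm(n)
x2  <- 0.85 * x1 + sqrt(1 - 0.85^2) * rnorm(n)
x3  <- rnorm(n)
y   <- 1 + x1 + 0.5 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)
fit <- lm(y ~ ., data = dat)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all the other predictors.
predictors <- c("x1", "x2", "x3")
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2_j   <- summary(lm(reformulate(others, response = v), data = dat))$r.squared
  1 / (1 - r2_j)
})

# Durbin-Watson statistic: values near 2 suggest no first-order
# autocorrelation in the residuals.
e  <- residuals(fit)
dw <- sum(diff(e)^2) / sum(e^2)

print(round(vif_manual, 3))
print(dw)
```

In this setup, the two correlated predictors show clearly elevated VIFs while the independent one stays near 1, a pattern similar to person_age and cb_person_cred_hist_length in the output above.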
The target variable (person_income) was highly right-skewed and exhibited non-constant variance, which violated key assumptions of linear regression. To address this, we applied a logarithmic transformation, a widely used choice for income-type data because it compresses extreme values and tends to stabilize variance. After the transformation, diagnostic checks showed some improvement in model behavior, although heteroscedasticity was not fully eliminated.
Other transformations were considered but not pursued. A square-root transformation is generally suited to mildly skewed data and would not adequately correct the strong skewness observed in income. Polynomial terms capture non-linear relationships between the predictors and the response rather than correcting the distribution of the response itself. A Box-Cox transformation was not implemented because, for heavily right-skewed financial variables, the estimated λ is often close to 0, which corresponds to the log transformation and would make the step redundant here.
Because key linear assumptions remained partially violated after the transformation, particularly variance stability, the analysis then moves beyond ordinary least squares, toward models that may be better suited to capture complex relationships and interactions in the data.
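The effect of the log transformation can be illustrated on simulated log-normal incomes. All names and numbers below are made up for the sketch. Predictions from the log model are converted back with `exp()` before computing errors in the original units, and for right-skewed data such back-transformed predictions tend to sit below the conditional mean.

```r
set.seed(3)  # arbitrary seed

# Simulated right-skewed (log-normal) income driven by one predictor.
n      <- 2000
x      <- runif(n, 0, 2)
income <- exp(10 + 0.5 * x + rnorm(n, sd = 0.6))

raw_fit <- lm(income ~ x)
log_fit <- lm(log(income) ~ x)

# Sample skewness of the residuals before and after the transformation
skew     <- function(v) mean((v - mean(v))^3) / sd(v)^3
skew_raw <- skew(residuals(raw_fit))
skew_log <- skew(residuals(log_fit))

# Back-transform predictions to the original scale before computing RMSE.
pred_income    <- exp(predict(log_fit, newdata = data.frame(x = x)))
rmse_log_model <- sqrt(mean((income - pred_income)^2))

print(c(skew_raw = skew_raw, skew_log = skew_log))
print(rmse_log_model)
```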
We extend the modeling process beyond ordinary least squares by introducing regularized regression methods, specifically Ridge, Lasso, and Elastic Net. These models are used to address limitations observed in the linear regression model, such as weak predictive power and potential overfitting in the presence of multiple predictors.
Ridge regression applies L2 regularization, which shrinks coefficient values to reduce model complexity while retaining all predictors. Lasso regression applies L1 regularization, which can shrink some coefficients exactly to zero, effectively performing feature selection. Elastic Net combines both penalties, balancing shrinkage and variable selection for improved stability.
These methods are particularly useful in datasets with many correlated or weak predictors, as they improve generalization performance and reduce model variance. The optimal model is selected using cross-validated error (RMSE), ensuring fair comparison between Ridge, Lasso, and Elastic Net.
## [1] 41518.17
## [1] 41544.92
## [1] 41545.37
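Ridge regression attains the lowest RMSE (41,518.17, against 41,544.92 and 41,545.37 for the other two models), but the differences are below 0.1%, so the three methods perform almost identically on this dataset. The slight edge for Ridge is consistent with predictive information being distributed across many weak predictors rather than concentrated in a small subset of variables.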
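The L2 shrinkage described above can be illustrated with the closed-form ridge estimate. This is a minimal sketch on simulated data, not the fitting code used in the analysis (which is not shown). The helper `ridge_coef`, the variables `x1` to `x3`, and the penalty value 50 are all assumptions made for the example. Dedicated packages may scale the penalty differently.

```r
set.seed(42)  # simulated data, not the loan dataset

n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1
x3 <- rnorm(n)
X  <- scale(cbind(x1, x2, x3))  # standardize predictors
y  <- 2 * x1 + 0.5 * x3 + rnorm(n)
y  <- y - mean(y)               # center the response so no intercept is needed

# Closed-form ridge estimate: (X'X + lambda * I)^-1 X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

ols   <- ridge_coef(X, y, lambda = 0)   # lambda = 0 reduces to ordinary least squares
ridge <- ridge_coef(X, y, lambda = 50)  # arbitrary penalty chosen for illustration

print(round(rbind(ols = ols, ridge = ridge), 3))
```

With correlated predictors such as `x1` and `x2`, the least-squares coefficients tend to be unstable. The ridge penalty pulls the coefficient vector toward zero without removing any predictor, which is the behavior the text attributes to L2 regularization.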
In this section, we build classification models to predict whether a loan applicant will successfully repay a loan (loan_status = 1) or default (loan_status = 0). This is a binary classification problem aimed at supporting credit risk assessment. Multiple machine learning models will be trained and evaluated, and the best-performing model will be selected based on classification performance metrics.
## 'data.frame': 45000 obs. of 24 variables:
## $ person_age : num 22 21 25 23 24 21 26 24 24 21 ...
## $ person_income : num 71948 12282 12438 79753 66135 ...
## $ person_emp_exp : num 0 0 3 0 1 0 1 5 3 0 ...
## $ loan_amnt : num 35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
## $ loan_int_rate : num 16 11.1 12.9 15.2 14.3 ...
## $ loan_percent_income : num 0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...
## $ cb_person_cred_hist_length : num 3 2 3 2 4 2 3 4 2 3 ...
## $ credit_score : int 561 504 635 675 586 532 701 585 544 640 ...
## $ loan_status : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
## $ person_genderfemale : num 1 1 1 1 0 1 1 1 1 1 ...
## $ person_gendermale : num 0 0 0 0 1 0 0 0 0 0 ...
## $ person_educationBachelor : num 0 0 0 1 0 0 1 0 0 0 ...
## $ person_educationDoctorate : num 0 0 0 0 0 0 0 0 0 0 ...
## $ person_educationHigh School : num 0 1 1 0 0 1 0 1 0 1 ...
## $ person_educationMaster : num 1 0 0 0 1 0 0 0 0 0 ...
## $ person_home_ownershipOTHER : num 0 0 0 0 0 0 0 0 0 0 ...
## $ person_home_ownershipOWN : num 0 1 0 0 0 1 0 0 0 1 ...
## $ person_home_ownershipRENT : num 1 0 0 1 1 0 1 1 1 0 ...
## $ loan_intentEDUCATION : num 0 1 0 0 0 0 1 0 0 0 ...
## $ loan_intentHOMEIMPROVEMENT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ loan_intentMEDICAL : num 0 0 1 1 1 0 0 1 0 0 ...
## $ loan_intentPERSONAL : num 1 0 0 0 0 0 0 0 1 0 ...
## $ loan_intentVENTURE : num 0 0 0 0 0 1 0 0 0 1 ...
## $ previous_loan_defaults_on_fileYes: num 0 1 0 0 0 0 0 0 0 0 ...
##
## Call:
## glm(formula = loan_status ~ ., family = binomial, data = train_class)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.400e-01 4.142e-01 -0.579 0.56232
## person_age 2.463e-02 1.281e-02 1.923 0.05447 .
## person_income 5.342e-07 2.067e-07 2.584 0.00977 **
## person_emp_exp -1.603e-02 1.136e-02 -1.411 0.15830
## loan_amnt -1.017e-04 4.411e-06 -23.051 < 2e-16 ***
## loan_int_rate 3.374e-01 7.378e-03 45.735 < 2e-16 ***
## loan_percent_income 1.593e+01 3.450e-01 46.173 < 2e-16 ***
## cb_person_cred_hist_length -9.684e-03 9.747e-03 -0.994 0.32044
## credit_score -9.069e-03 4.577e-04 -19.813 < 2e-16 ***
## person_genderfemale -2.469e-02 3.969e-02 -0.622 0.53390
## person_gendermale NA NA NA NA
## person_educationBachelor -3.211e-02 5.266e-02 -0.610 0.54205
## person_educationDoctorate -7.556e-02 1.662e-01 -0.455 0.64941
## person_educationHigh.School 1.761e-02 5.501e-02 0.320 0.74889
## person_educationMaster 5.428e-02 6.287e-02 0.863 0.38789
## person_home_ownershipOTHER 3.419e-01 3.578e-01 0.956 0.33927
## person_home_ownershipOWN -1.400e+00 1.125e-01 -12.443 < 2e-16 ***
## person_home_ownershipRENT 7.341e-01 4.496e-02 16.326 < 2e-16 ***
## loan_intentEDUCATION -9.231e-01 6.542e-02 -14.112 < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -1.930e-02 7.316e-02 -0.264 0.79193
## loan_intentMEDICAL -3.139e-01 6.292e-02 -4.989 6.07e-07 ***
## loan_intentPERSONAL -7.942e-01 6.737e-02 -11.788 < 2e-16 ***
## loan_intentVENTURE -1.269e+00 7.120e-02 -17.820 < 2e-16 ***
## previous_loan_defaults_on_fileYes -2.038e+01 1.146e+02 -0.178 0.85885
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38136 on 35999 degrees of freedom
## Residual deviance: 15855 on 35977 degrees of freedom
## AIC: 15901
##
## Number of Fisher Scoring iterations: 19
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6560 521
## 1 439 1480
##
## Accuracy : 0.8933
## 95% CI : (0.8868, 0.8996)
## No Information Rate : 0.7777
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.687
##
## Mcnemar's Test P-Value : 0.008942
##
## Sensitivity : 0.9373
## Specificity : 0.7396
## Pos Pred Value : 0.9264
## Neg Pred Value : 0.7712
## Prevalence : 0.7777
## Detection Rate : 0.7289
## Detection Prevalence : 0.7868
## Balanced Accuracy : 0.8385
##
## 'Positive' Class : 0
##
## Area under the curve: 0.9528
In this section, we train a Random Forest classifier to capture non-linear relationships and interactions between variables that logistic regression may fail to model. Random Forest is an ensemble learning method that builds multiple decision trees and aggregates their predictions to improve accuracy and reduce overfitting.
##
## Call:
## randomForest(formula = loan_status ~ ., data = train_class, ntree = 500, mtry = sqrt(ncol(train_class) - 1), importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 7.11%
## Confusion matrix:
## 0 1 class.error
## 0 27291 710 0.02535624
## 1 1849 6150 0.23115389
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6829 455
## 1 170 1546
##
## Accuracy : 0.9306
## 95% CI : (0.9251, 0.9357)
## No Information Rate : 0.7777
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7884
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9757
## Specificity : 0.7726
## Pos Pred Value : 0.9375
## Neg Pred Value : 0.9009
## Prevalence : 0.7777
## Detection Rate : 0.7588
## Detection Prevalence : 0.8093
## Balanced Accuracy : 0.8742
##
## 'Positive' Class : 0
##
## Area under the curve: 0.9757
In this section, we train an XGBoost classifier, a gradient boosting algorithm that builds trees sequentially, with each new tree correcting errors made by the previous ones. XGBoost is highly effective for structured/tabular data and often matches or exceeds the accuracy of both logistic regression and random forest.
## ##### xgb.Booster
## call:
## xgb.train(params = params, data = dtrain, nrounds = 100, verbose = 0)
## # of features: 23
## # of rounds: 100
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6803 445
## 1 196 1556
##
## Accuracy : 0.9288
## 95% CI : (0.9233, 0.934)
## No Information Rate : 0.7777
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7845
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9720
## Specificity : 0.7776
## Pos Pred Value : 0.9386
## Neg Pred Value : 0.8881
## Prevalence : 0.7777
## Detection Rate : 0.7559
## Detection Prevalence : 0.8053
## Balanced Accuracy : 0.8748
##
## 'Positive' Class : 0
##
## Area under the curve: 0.9758
The logistic regression model achieves an accuracy of 89.33%, a strong baseline, though it assumes a linear relationship between the predictors and the log-odds of the outcome. Random Forest raises accuracy to 93.06% and XGBoost to 92.88%, which suggests that both tree-based models capture non-linear effects and interactions that logistic regression misses. Both also improve on logistic regression in Kappa (0.788 and 0.785 versus 0.687) and in AUC (0.9757 and 0.9758 versus 0.9528), indicating better agreement beyond chance and better ranking of cases.
Because the confusion matrices treat class 0 as the positive class, sensitivity measures how well class 0 is identified and specificity how well class 1 is identified. Random Forest has the highest sensitivity (0.976) and XGBoost the highest specificity (0.778). Logistic regression has the lowest specificity (0.740), so it misclassifies the largest share of class 1 cases. All three models significantly outperform the no-information rate of 0.7777 (p < 2.2e-16).
Overall, Random Forest and XGBoost are nearly tied: Random Forest is marginally ahead on accuracy and Kappa, while XGBoost is marginally ahead on balanced accuracy (0.8748 versus 0.8742) and AUC. Random Forest is selected as the best-performing model on accuracy, with XGBoost a very close second.
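The reported metrics can be recomputed directly from the three confusion matrices printed above. The sketch below uses only the cell counts from those outputs. The helper name `cm_metrics` and its argument names are hypothetical and chosen for this example.

```r
# Metrics from a 2x2 confusion matrix with class 0 as the positive class.
# Arguments follow the printed layout: rows are predictions, columns are reference labels.
# `cm_metrics` is a hypothetical helper for illustration.
cm_metrics <- function(pred0_ref0, pred0_ref1, pred1_ref0, pred1_ref1) {
  total       <- pred0_ref0 + pred0_ref1 + pred1_ref0 + pred1_ref1
  sensitivity <- pred0_ref0 / (pred0_ref0 + pred1_ref0)  # share of true class 0 found
  specificity <- pred1_ref1 / (pred0_ref1 + pred1_ref1)  # share of true class 1 found
  c(accuracy          = (pred0_ref0 + pred1_ref1) / total,
    sensitivity       = sensitivity,
    specificity       = specificity,
    balanced_accuracy = (sensitivity + specificity) / 2)
}

# Cell counts copied from the three confusion matrices above.
results <- rbind(
  logistic      = cm_metrics(6560, 521, 439, 1480),
  random_forest = cm_metrics(6829, 455, 170, 1546),
  xgboost       = cm_metrics(6803, 445, 196, 1556)
)
print(round(results, 4))
```

Laying the models side by side this way makes the trade-off easier to see: the two tree-based models differ by only a few dozen test cases in each cell, so which one ranks first depends on the metric chosen.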
In this section, we analyze the relationship between key financial indicators—annual income and credit score—and loan default behavior. The objective is to understand how these variables differ between borrowers who successfully repay their loans and those who default. This helps identify whether financial strength and creditworthiness are strong predictors of loan risk.
We begin by comparing the average income and credit score across default and non-default groups. We then visualize these differences using boxplots to better understand the distributional behavior of each variable. Finally, we explore the relationship between income and credit score to assess whether stronger financial profiles are associated with better credit ratings.
## loan_status person_income
## 1 0 86157.04
## 2 1 59886.10
## loan_status credit_score
## 1 0 632.8149
## 2 1 631.8872
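The aggregated results show how the two financial indicators differ by loan status. The loan_status = 0 group has the higher mean income (approximately 86,157), while the loan_status = 1 group has the lower mean income (approximately 59,886), a gap of roughly 26,000. Under the coding used in the classification section (0 = default, 1 = repaid), this would mean that defaulters earn more on average, which is counterintuitive, so the meaning of loan_status should be confirmed against the dataset documentation before interpreting the direction of the effect. Either way, the size of the gap suggests that income is a strong differentiating factor between the two groups.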
However, when examining credit scores, the difference between the two groups is minimal. Both defaulters and non-defaulters have nearly identical average credit scores (around 632). This indicates that credit score alone may not be a strong standalone predictor of default risk in this dataset.
Overall, income appears to play a more important role than credit score in distinguishing between high-risk and low-risk borrowers.
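A group-mean comparison of this kind can be sketched in base R as follows. The data frame `toy` is invented for illustration and does not contain rows from the loan dataset. The helper `std_diff` is also hypothetical, and `aggregate()` is only one plausible way to produce tables in the format shown above.

```r
# Toy data for illustration only; these are not rows from the loan dataset.
toy <- data.frame(
  loan_status   = factor(c(0, 0, 0, 1, 1, 1)),
  person_income = c(90000, 80000, 85000, 60000, 55000, 65000),
  credit_score  = c(630, 640, 635, 628, 636, 632)
)

# Mean of each indicator within each loan_status group.
income_by_status <- aggregate(person_income ~ loan_status, data = toy, FUN = mean)
score_by_status  <- aggregate(credit_score ~ loan_status, data = toy, FUN = mean)
print(income_by_status)
print(score_by_status)

# Standardized mean difference: the group gap divided by the overall standard deviation.
# This puts income and credit score on a comparable scale.
std_diff <- function(x, g) {
  m <- tapply(x, g, mean)
  unname((m[1] - m[2]) / sd(x))
}
income_gap <- std_diff(toy$person_income, toy$loan_status)
score_gap  <- std_diff(toy$credit_score, toy$loan_status)
print(c(income = income_gap, credit_score = score_gap))
```

Dividing by the standard deviation matters because income and credit score are measured in very different units, so raw differences in means are not directly comparable.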
In this section, we examine the correlation structure between the numerical variables in the dataset: age, income, employment experience, loan amount, interest rate, loan-to-income ratio, credit history length, and credit score. The goal is to understand how these attributes relate to one another. Correlation analysis helps identify potential multicollinearity and shows which variables move together, which is important for both regression modeling and feature interpretation.
## person_age person_income person_emp_exp loan_amnt
## person_age 1.00000000 0.193697781 0.95441216 0.050749541
## person_income 0.19369778 1.000000000 0.18598715 0.242290131
## person_emp_exp 0.95441216 0.185987147 1.00000000 0.044589394
## loan_amnt 0.05074954 0.242290131 0.04458939 1.000000000
## loan_int_rate 0.01340164 0.001509828 0.01663134 0.146093082
## loan_percent_income -0.04329864 -0.234176548 -0.03986153 0.593011449
## cb_person_cred_hist_length 0.86198456 0.124315644 0.82427154 0.042969328
## credit_score 0.17843247 0.035919225 0.18619613 0.009074282
## loan_int_rate loan_percent_income
## person_age 0.013401640 -0.04329864
## person_income 0.001509828 -0.23417655
## person_emp_exp 0.016631344 -0.03986153
## loan_amnt 0.146093082 0.59301145
## loan_int_rate 1.000000000 0.12520949
## loan_percent_income 0.125209488 1.00000000
## cb_person_cred_hist_length 0.018007997 -0.03186773
## credit_score 0.011497752 -0.01148310
## cb_person_cred_hist_length credit_score
## person_age 0.86198456 0.178432470
## person_income 0.12431564 0.035919225
## person_emp_exp 0.82427154 0.186196134
## loan_amnt 0.04296933 0.009074282
## loan_int_rate 0.01800800 0.011497752
## loan_percent_income -0.03186773 -0.011483096
## cb_person_cred_hist_length 1.00000000 0.155204130
## credit_score 0.15520413 1.000000000
The correlation results show a very strong relationship between age, employment experience, and credit history length (pairwise correlations between 0.82 and 0.95), indicating that these variables carry overlapping information. Loan amount is moderately related to loan percent income (about 0.59), suggesting that larger loans increase financial burden relative to income. Credit score, however, shows very weak correlations with every other variable (at most about 0.19), so it is nearly uncorrelated with the rest of the numerical features. Overall, the data contains some multicollinearity among age-related features, which should be considered during modeling.
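Highly correlated pairs can be pulled out of a correlation matrix programmatically. The sketch below copies the relevant values (rounded to four decimals) from the matrix above for four of the eight variables. The 0.8 cutoff is a common rule of thumb chosen for illustration, not a threshold taken from the original analysis.

```r
# Correlations copied from the matrix above for four of the eight variables.
vars <- c("person_age", "person_emp_exp", "cb_person_cred_hist_length", "credit_score")
r <- matrix(c(
  1.0000, 0.9544, 0.8620, 0.1784,
  0.9544, 1.0000, 0.8243, 0.1862,
  0.8620, 0.8243, 1.0000, 0.1552,
  0.1784, 0.1862, 0.1552, 1.0000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# List each pair once (upper triangle) whose absolute correlation exceeds the cutoff.
cutoff <- 0.8  # assumed rule-of-thumb threshold
idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx],
  stringsAsFactors = FALSE
)
print(high_pairs)
```

On the full dataset, the same filter could be applied to the output of `cor()` on the numeric columns. A typical response would be to keep only one variable from each flagged pair in a linear model.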
In this section, we examine the distribution of key numerical variables to better understand their shape, spread, and potential skewness. Linear regression assumes linearity and normally distributed residuals rather than normally distributed variables, but heavily skewed variables often lead to violations of both, so this step helps flag candidates for transformation. We focus on major financial and credit-related variables: income, loan amount, credit score, and loan-to-income ratio. Income and loan amount in particular are expected to be right-skewed because of the high-income earners and large loans in the dataset. Identifying skewness helps determine whether transformations (such as logarithmic scaling) may be necessary in later modeling stages.
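The effect of logarithmic scaling on skewness can be sketched with simulated data. The incomes below are drawn from a log-normal distribution purely for illustration and are not the loan dataset. The `skewness` helper is a hypothetical moment-based implementation written for this example.

```r
set.seed(1)  # simulated incomes, not the loan dataset

# Moment-based sample skewness: about 0 for a symmetric distribution,
# positive when the right tail is longer.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Right-skewed by construction; parameters are arbitrary.
income <- rlnorm(5000, meanlog = 11, sdlog = 0.6)

cat("skewness of income:     ", round(skewness(income), 2), "\n")
cat("skewness of log(income):", round(skewness(log(income)), 2), "\n")
```

A skewness well above zero before the transformation and close to zero after it is the pattern that would support using the log of income as the regression target.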
In this section, we construct new variables to capture additional structure in the data. The aim is to improve model performance and interpretability by transforming existing features into more informative representations, particularly around employment history, home ownership, and loan burden.
## 'data.frame': 45000 obs. of 31 variables:
## $ person_age : num 22 21 25 23 24 21 26 24 24 21 ...
## $ person_income : num 71948 12282 12438 79753 66135 ...
## $ person_emp_exp : num 0 0 3 0 1 0 1 5 3 0 ...
## $ loan_amnt : num 35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
## $ loan_int_rate : num 16 11.1 12.9 15.2 14.3 ...
## $ loan_percent_income : num 0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...
## $ cb_person_cred_hist_length : num 3 2 3 2 4 2 3 4 2 3 ...
## $ credit_score : int 561 504 635 675 586 532 701 585 544 640 ...
## $ loan_status : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
## $ person_genderfemale : num 1 1 1 1 0 1 1 1 1 1 ...
## $ person_gendermale : num 0 0 0 0 1 0 0 0 0 0 ...
## $ person_educationBachelor : num 0 0 0 1 0 0 1 0 0 0 ...
## $ person_educationDoctorate : num 0 0 0 0 0 0 0 0 0 0 ...
## $ person_educationHigh School : num 0 1 1 0 0 1 0 1 0 1 ...
## $ person_educationMaster : num 1 0 0 0 1 0 0 0 0 0 ...
## $ person_home_ownershipOTHER : num 0 0 0 0 0 0 0 0 0 0 ...
## $ person_home_ownershipOWN : num 0 1 0 0 0 1 0 0 0 1 ...
## $ person_home_ownershipRENT : num 1 0 0 1 1 0 1 1 1 0 ...
## $ loan_intentEDUCATION : num 0 1 0 0 0 0 1 0 0 0 ...
## $ loan_intentHOMEIMPROVEMENT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ loan_intentMEDICAL : num 0 0 1 1 1 0 0 1 0 0 ...
## $ loan_intentPERSONAL : num 1 0 0 0 0 0 0 0 1 0 ...
## $ loan_intentVENTURE : num 0 0 0 0 0 1 0 0 0 1 ...
## $ previous_loan_defaults_on_fileYes: num 0 1 0 0 0 0 0 0 0 0 ...
## $ emp_intensity : num 0 0 0.12 0 0.0417 ...
## $ exp_gap : num 22 21 22 23 23 21 25 19 21 21 ...
## $ log_income : num 11.18 9.42 9.43 11.29 11.1 ...
## $ log_loan_amnt : num 10.46 6.91 8.61 10.46 10.46 ...
## $ loan_burden : num 35255 983 5473 35091 35052 ...
## $ debt_pressure : num 0.4865 0.0814 0.4422 0.4389 0.5292 ...
## $ home_risk_score : num 1 1 0 1 1 1 1 1 1 1 ...
The engineered features are intended to expose relationships that are not directly visible in the raw data, such as employment intensity, financial pressure, and housing stability. The employment-based variables (emp_intensity, exp_gap) relate accumulated work experience to age, while the loan-based variables (loan_burden, debt_pressure) quantify borrowing intensity and repayment burden relative to income. Log transformations (log_income, log_loan_amnt) reduce skewness in the income and loan amount distributions, improving their suitability for linear modeling. The composite home_risk_score condenses home ownership status into a single interpretable risk indicator. Together, these features are meant to improve both the performance and the interpretability of the credit risk and income prediction models.
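The feature definitions can be sketched as follows. The original feature-engineering code is not shown, so the formulas below are assumptions inferred from the values printed in the structure output above. They reproduce the printed values for the first five rows of the dataset preview. The rule behind home_risk_score could not be recovered from the output and is left out.

```r
# First five rows of the dataset preview, restricted to the columns needed here.
loans <- data.frame(
  person_age          = c(22, 21, 25, 23, 24),
  person_income       = c(71948, 12282, 12438, 79753, 66135),
  person_emp_exp      = c(0, 0, 3, 0, 1),
  loan_amnt           = c(35000, 1000, 5500, 35000, 35000),
  loan_percent_income = c(0.49, 0.08, 0.44, 0.44, 0.53)
)

# Assumed formulas, inferred from the printed values rather than from the original code.
loans <- transform(
  loans,
  emp_intensity = person_emp_exp / person_age,          # experience relative to age
  exp_gap       = person_age - person_emp_exp,          # age minus years of experience
  log_income    = log(person_income),                   # natural log
  log_loan_amnt = log(loan_amnt),
  loan_burden   = loan_percent_income * person_income,  # ratio times income
  debt_pressure = loan_amnt / person_income             # loan amount relative to income
)

# Show only the six engineered columns.
print(round(loans[, 6:11], 4))
```

Note that loan_burden and debt_pressure, as defined here, are close to existing columns (loan_amnt and loan_percent_income respectively), so their added information may be limited if these assumed formulas match the original ones.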