How do individual socioeconomic characteristics—specifically annual income and home ownership status—determine a borrower’s financial constraint and affect their probability of defaulting on a bank loan?
From an economic perspective, the credit market is a fundamental pillar for economic growth, consumption, and capital accumulation. Financial institutions face a continuous trade-off between maximizing lending activities and minimizing exposure to non-performing loans (NPLs). When a bank grants a loan, evaluating the borrower’s default risk is crucial to maintaining financial stability and avoiding systemic banking crises, such as the one witnessed globally in 2008.
  person_age person_income person_home_ownership person_emp_length loan_intent
1         22         59000                  RENT               123    PERSONAL
2         21          9600                   OWN                 5   EDUCATION
3         25          9600              MORTGAGE                 1     MEDICAL
4         23         65500                  RENT                 4     MEDICAL
5         24         54400                  RENT                 8     MEDICAL
6         21          9900                   OWN                 2     VENTURE
  loan_grade loan_amnt loan_int_rate loan_status loan_percent_income
1          D     35000         16.02           1                0.59
2          B      1000         11.14           0                0.10
3          C      5500         12.87           1                0.57
4          C     35000         15.23           1                0.53
5          C     35000         14.27           1                0.55
6          A      2500          7.14           1                0.25
  cb_person_default_on_file cb_person_cred_hist_length
1                         Y                          3
2                         N                          2
3                         N                          3
4                         N                          2
5                         Y                          4
6                         N                          2
To address this economic question, we utilize a customer credit dataset. The target variable is binary, representing the credit outcome for each individual:
default_status: A binary indicator where 1 represents a client who defaulted on their loan, and 0 represents a client who successfully repaid.
To predict this outcome, the model incorporates five key socio-economic and financial features:
loan_amount: The total monetary value of the requested loan.
income: The annual income of the applicant.
interest_rate: The interest rate assigned to the loan.
home_ownership: The residential status of the applicant (e.g., rent, mortgage, own).
age: The age of the applicant.
Before establishing our predictive models, we examine the baseline distribution of our target variable to understand the unconditional probability of default within our sample.
table(credit_tidy$default_status)
0 1
23001 6464
What we see: Only 6,464 of the 29,465 customers defaulted, which is about 21.9% of the total sample.
While the majority of the observations belong to non-defaulting clients, this proportion provides a sufficient baseline of default events. Unlike extreme cases of highly imbalanced data (where defaults are below 5%), our dataset contains enough representative variance for the logistic regression algorithm to effectively learn the distinct financial characteristics of high-risk profiles.
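The default share quoted above can be reproduced from the two counts in the frequency table. This is a minimal base R sketch that enters the counts by hand; the names `default_counts`, `n_total`, and `default_rate` are illustrative and not part of the original script.

```r
# Counts copied from the table() output above: 0 = repaid, 1 = defaulted
default_counts <- c(`0` = 23001, `1` = 6464)

# Total sample size and unconditional default rate
n_total <- sum(default_counts)
default_rate <- default_counts[["1"]] / n_total

n_total                  # 29465
round(default_rate, 3)   # 0.219
```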
Now that the data are clean, we can proceed with the probability distribution analysis.
While the explicit target variable for our classification task is default_status (a discrete binary indicator), evaluating the distribution, skewness, and theoretical approximation of the primary continuous financial features is a prerequisite for robust econometric modeling.
Using the same mandatory seed, we divide the customers into a Training Set (80%) and a Test Set (20%), so that the model's classification accuracy can later be evaluated on data it has not seen.
3.1 Data Preparation: Factor Conversion
Before splitting the data and training our logistic regression models, we prepare our categorical variables by converting them into factors:
# Convert 0/1 and text variables to factors
credit_tidy <- credit_tidy |>
  mutate(
    default_status = factor(default_status, levels = c(0, 1), labels = c("No", "Yes")),
    home_ownership = factor(home_ownership)
  )
Target Variable (default_status): The original binary values (0 for non-default and 1 for default) are converted into a factor with logical labels ("No" and "Yes").
Predictor Variable (home_ownership): This text-based categorical variable is converted into a factor so the glm() function can automatically create dummy variables (e.g., comparing renting versus owning) during model estimation.
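To make the dummy-variable step concrete, the sketch below builds the design matrix for a toy factor with the four housing categories that appear later in the model output. The data frame `toy` is hypothetical; only `factor()`, `levels()`, and `model.matrix()` from base R are used.

```r
# Toy factor with the four housing categories seen in the model output
toy <- data.frame(
  home_ownership = factor(c("RENT", "OWN", "MORTGAGE", "OTHER"))
)

# Levels are sorted alphabetically, so MORTGAGE becomes the reference category
levels(toy$home_ownership)

# Design matrix of the kind glm() builds: one 0/1 column per non-reference level
dummies <- model.matrix(~ home_ownership, data = toy)
colnames(dummies)
```

The column names match the coefficient names reported by `summary()`, which is why the reference category never gets a row of its own.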
# Set seed for reproducibility
set.seed(465)

# Convert categorical variables to factors
credit_factor <- credit_tidy |>
  mutate(across(where(is.character), as.factor))

# Split the data: 80% training, 20% testing
credit_split <- initial_split(credit_factor, prop = 0.8)
credit_train <- training(credit_split)
credit_test <- testing(credit_split)

cat("Training set size:", nrow(credit_train), "\n")
Training set size: 23572
cat("Test set size:", nrow(credit_test), "\n")
Test set size: 5893
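The two set sizes follow directly from the 80/20 proportion applied to 29,465 rows. A base R sketch of the same idea is shown below, assuming simple random sampling without stratification. It works on row indices only and will not reproduce the exact rows chosen by `initial_split()`; the names `train_idx` and `test_idx` are illustrative.

```r
set.seed(465)   # same seed value as the report

n <- 29465                                      # sample size reported above
train_idx <- sample(n, size = floor(0.8 * n))   # 80% of the row indices
test_idx  <- setdiff(seq_len(n), train_idx)     # the remaining 20%

length(train_idx)   # 23572
length(test_idx)    # 5893
```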
4. Modeling and Cross Validation
4.1 Model 1
# Logistic regression using glm() - Model 1 (Baseline)
model_logistic_1 <- glm(
  default_status ~ loan_amount + income,
  data = credit_train,
  family = binomial
)

# View coefficients and significance
summary(model_logistic_1)
Call:
glm(formula = default_status ~ loan_amount + income, family = binomial,
data = credit_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.672e-01 3.998e-02 -11.69 <2e-16 ***
loan_amount 1.187e-04 3.145e-06 37.73 <2e-16 ***
income -3.532e-05 8.458e-07 -41.76 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 24757 on 23571 degrees of freedom
Residual deviance: 21799 on 23569 degrees of freedom
AIC: 21805
Number of Fisher Scoring iterations: 6
Loan Amount (loan_amount): The coefficient is positive (1.187e-04). Holding income constant, larger loans are associated with a higher probability of default.
Income (income): The coefficient is negative (-3.532e-05). Holding the loan amount constant, the probability of defaulting decreases as a customer's income increases. In simple terms: higher-income clients are less likely to default.
Model Evaluation (AIC): The model reports an AIC of 21,805. We will use this number as a baseline to compare with our second model (a lower AIC means a better-fitting model).
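Because the coefficients are on the log-odds scale and refer to a one-unit change, they are easier to read as odds ratios for a larger step. The sketch below uses the two estimates printed above and a step of 1,000 currency units; the step size and the variable names are our own choices for illustration.

```r
# Coefficients copied from summary(model_logistic_1)
b_loan   <- 1.187e-04
b_income <- -3.532e-05

# Odds ratio for a 1,000-unit increase in each variable, holding the other fixed
or_loan_1k   <- exp(b_loan * 1000)
or_income_1k <- exp(b_income * 1000)

round(c(loan_amount = or_loan_1k, income = or_income_1k), 3)
# loan_amount: 1.126, income: 0.965
```

Read this way, each extra 1,000 borrowed raises the odds of default by roughly 13%, while each extra 1,000 of annual income lowers them by roughly 3.5%.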
4.2 Model 2
# Logistic regression using glm() - Model 2 (Advanced)
model_logistic_2 <- glm(
  default_status ~ loan_amount + income + interest_rate + home_ownership + age,
  data = credit_train,
  family = binomial
)

# View coefficients and significance for Model 2
summary(model_logistic_2)
Call:
glm(formula = default_status ~ loan_amount + income + interest_rate +
home_ownership + age, family = binomial, data = credit_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.033e+00 1.214e-01 -33.235 <2e-16 ***
loan_amount 1.143e-04 3.462e-06 33.007 <2e-16 ***
income -3.467e-05 9.344e-07 -37.106 <2e-16 ***
interest_rate 2.728e-01 6.336e-03 43.053 <2e-16 ***
home_ownershipOTHER 4.784e-01 2.968e-01 1.612 0.107
home_ownershipOWN -1.246e+00 1.022e-01 -12.194 <2e-16 ***
home_ownershipRENT 7.680e-01 4.266e-02 18.002 <2e-16 ***
age -1.405e-03 2.999e-03 -0.469 0.639
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 24757 on 23571 degrees of freedom
Residual deviance: 18585 on 23564 degrees of freedom
AIC: 18601
Number of Fisher Scoring iterations: 6
The advanced classification model significantly improves model fit as indicated by the drop in AIC from 21,805 to 18,601:
Financial Risk Drivers (loan_amount and interest_rate): Both variables display highly significant positive coefficients (\(p < 0.001\)). Higher loan amounts (1.143e-04) and higher interest rates (0.2728) heavily increase the probability of customer default.
Income Hedge (income): Higher annual earnings (-3.467e-05) significantly reduce the likelihood of default (\(p < 0.001\)).
Housing Stability Status (home_ownership): R uses the first factor level as the baseline category, which here is MORTGAGE (the only category without its own coefficient in the output). Compared to this baseline:
RENT has a positive coefficient (0.7680, \(p < 0.001\)), meaning tenants are significantly more likely to default.
OWN has a negative coefficient (-1.2460, \(p < 0.001\)), indicating that customers who fully own their homes are the lowest-risk profile.
OTHER has a positive coefficient (0.4784), but with \(p = 0.107\) it is not statistically distinguishable from the baseline.
Non-Significant Predictor (age): The variable age reports a p-value of 0.639, which is much higher than 0.05. This means that, after controlling for the other predictors in the model, age is not statistically significant for predicting default in this dataset.
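To see what the housing coefficients mean in probability terms, the sketch below plugs the Model 2 estimates into the inverse logit for one hypothetical applicant. The applicant's values (loan of 10,000, income of 50,000, 11% interest rate, age 30) are invented for illustration and are not a row from the dataset.

```r
# Coefficients copied from summary(model_logistic_2); MORTGAGE is the baseline
b <- c(intercept = -4.033, loan_amount = 1.143e-04, income = -3.467e-05,
       interest_rate = 0.2728, OWN = -1.246, RENT = 0.768, age = -1.405e-03)

# Hypothetical applicant (illustrative values only)
loan <- 10000; inc <- 50000; rate <- 11; age <- 30

# Linear predictor (log-odds) for the baseline housing category
eta_base <- b[["intercept"]] + b[["loan_amount"]] * loan + b[["income"]] * inc +
  b[["interest_rate"]] * rate + b[["age"]] * age

# plogis() is the inverse logit: it maps log-odds to a probability
p <- c(MORTGAGE = plogis(eta_base),
       OWN      = plogis(eta_base + b[["OWN"]]),
       RENT     = plogis(eta_base + b[["RENT"]]))
round(p, 3)
```

For this profile the predicted default probability is roughly 0.16 with a mortgage, 0.05 for an outright owner, and 0.29 for a renter, so the same loan is several times riskier for the renter than for the owner.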
4.3 Interpretation of Sample Classifications
# Predicting on the test set
# 1. Predict probabilities on test data using your exact model and dataset
logistic_probs <- predict(model_logistic_2, credit_test, type = "response")

# 2. Convert probabilities to class labels ("0" or "1") using threshold 0.5
logistic_pred <- ifelse(logistic_probs > 0.5, "1", "0")

# 3. Convert predictions to a factor with levels "0" (no default) and "1" (default)
logistic_pred <- factor(logistic_pred, levels = c("0", "1"))

# 4. View first few predictions (Actual is labelled "No"/"Yes", Predicted "0"/"1")
head(data.frame(
  Actual = credit_test$default_status,
  Probability = round(logistic_probs, 3),
  Predicted = logistic_pred
))
Actual Probability Predicted
1 Yes 0.032 0
2 No 0.244 0
3 No 0.214 0
4 No 0.119 0
5 Yes 0.867 1
6 No 0.000 0
Looking at the initial sample predictions table, we can easily evaluate how the model applies the 0.5 classification threshold in practice:
Correct Classifications: For the majority of clients (such as rows 2, 3, 4, and 6), the calculated probability remains well below 50%, resulting in a correct prediction of 0 (Non-default). In row 5, the model strongly detects high risk with an 86.7% probability, correctly classifying the client as 1 (Default).
Model Mistakes and Risk Assessment: In row 1, we observe a critical classification error. The customer actually defaulted (Actual = Yes), but the model calculated a very low default probability of only 3.2% and predicted 0. In a commercial banking context, this type of error (a false negative) means granting a loan to a borrower who will not pay it back, which is a direct financial loss.
4.4 Comparison Table: Credit Default Models
# 1. Create comparative table with official variables and English text
# check.names = FALSE keeps the spaces in the column names
credit_models_summary <- data.frame(
  Model = c("Model 1: Baseline Logistic", "Model 2: Advanced Logistic"),
  `Variables Included` = c(
    "loan_amnt, person_income",
    "loan_amnt, person_income, loan_int_rate, person_home_ownership, person_age"
  ),
  `Test Performance` = c(
    "Lower Predictive Power",
    "Higher Accuracy (83.1%) & Robust Generalization"
  ),
  `Practical Focus` = c(
    "Misses critical risk factors",
    "Good stability, but requires lower threshold to optimize Recall"
  ),
  check.names = FALSE
)

# 2. Print professional table
knitr::kable(
  credit_models_summary,
  caption = "Table 2: Out-of-Sample Performance Comparison for Credit Default Models",
  align = "l"
)
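Table 2: Out-of-Sample Performance Comparison for Credit Default Models
Model | Variables Included | Test Performance | Practical Focus
Model 1: Baseline Logistic | loan_amnt, person_income | Lower Predictive Power | Misses critical risk factors
Model 2: Advanced Logistic | loan_amnt, person_income, loan_int_rate, person_home_ownership, person_age | Higher Accuracy (83.1%) & Robust Generalization | Good stability, but requires lower threshold to optimize Recall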
4.4.1 Model Selection and Business Meaning
Which model is better? Model 2 (model_logistic_2) is clearly better: its AIC is 18,601, against 21,805 for Model 1. To judge whether a client will repay, a bank needs the whole picture (income, loan amount, interest rate, and housing status). Model 1 is too simple and leaves out the interest rate and home ownership, both of which are highly significant. Age, by contrast, adds little, since its coefficient is not significant.
The Practical Trade-off:
The Problem: The model is fairly accurate, but it is not perfect. It assigned only a 3.2% risk to a client who actually defaulted (Row 1). For a bank, lending money to someone who does not repay is usually the most expensive type of mistake.
The Solution: The standard 0.5 threshold is too lenient. In practice, the bank should lower the threshold (for example, to 0.3) to be stricter. This flags more true defaulters, at the cost of rejecting some good clients by mistake. A case like Row 1, with a predicted probability of 3.2%, would still be missed at 0.3, so threshold tuning reduces this type of error but cannot remove it.
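The effect of moving the cut-off can be checked on the six sample rows shown earlier. This is a small sketch using only the probabilities and outcomes from that table; the helper `flag_at()` is a name introduced here for illustration.

```r
# Values copied from the six-row sample predictions table
actual <- c("Yes", "No", "No", "No", "Yes", "No")
prob   <- c(0.032, 0.244, 0.214, 0.119, 0.867, 0.000)

# Row numbers flagged as default at a given cut-off
flag_at <- function(threshold) which(prob > threshold)

flag_at(0.5)   # 5
flag_at(0.3)   # 5
flag_at(0.2)   # 2 3 5
```

On these six rows, moving from 0.5 to 0.3 changes nothing, and going down to 0.2 flags rows 2 and 3, both of which repaid. Row 1 would only be flagged by a cut-off below 0.032, so a stricter threshold trades extra false alarms for extra catches and still cannot recover every default.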
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.826 5 0.00687 Preprocessor1_Model1
Test Set Classification Performance
Once the model’s stability was verified using 5-fold cross-validation, the logistic regression model was evaluated on the single test partition using the standard 0.5 threshold to analyze the detailed classification metrics.
# 1. Generate predictions on the test set using the standard 0.5 threshold
credit_preds <- credit_test %>%
  mutate(
    prob = predict(model_logistic_2, newdata = credit_test, type = "response"),
    pred_status = as.factor(ifelse(prob > 0.5, "1", "0")),
    default_status = as.factor(default_status)
  )

# 2. Create the confusion matrix table
confusion_matrix <- table(
  Predicted = credit_preds$pred_status,
  Actual = credit_preds$default_status
)

# 3. Print the confusion matrix to the document so the professor can see it
print(confusion_matrix)
From the confusion matrix above, we obtain the following performance percentages for our best model (model_logistic_2):
Accuracy (0.831 / 83.1%): The model classifies 83.1% of the test-set clients correctly. This looks like a great number, but it is misleading. About 78% of clients in the full sample (23,001 of 29,465) did not default, so a model that always predicts "0" (No default) would already be right roughly 78% of the time.
Precision (0.716 / 71.6%): When the model raises the alarm and flags a customer as a likely defaulter (1), it is right 71.6% of the time.
Recall (0.396 / 39.6%): This is the most critical metric and the biggest weakness. The model only catches 39.6% of the actual defaulters. This means that 60.4% of the clients who go on to default slip through undetected.
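The three metrics above are simple ratios of confusion-matrix cells. The sketch below shows the arithmetic on a made-up 2x2 table with the same row and column labels as the report; the counts are purely illustrative and are NOT the report's results.

```r
# Illustrative confusion matrix with made-up counts (NOT the report's results).
# matrix() fills column by column: Actual "No" first, then Actual "Yes".
cm <- matrix(c(290, 10, 60, 40), nrow = 2,
             dimnames = list(Predicted = c("0", "1"), Actual = c("No", "Yes")))

tp <- cm["1", "Yes"]; fp <- cm["1", "No"]   # flagged: true / false positives
fn <- cm["0", "Yes"]; tn <- cm["0", "No"]   # not flagged: false / true negatives

accuracy  <- (tp + tn) / sum(cm)   # share of all predictions that are correct
precision <- tp / (tp + fp)        # share of flagged clients who truly default
recall    <- tp / (tp + fn)        # share of true defaulters who are flagged

c(accuracy = accuracy, precision = precision, recall = recall)
```

With these toy counts, accuracy is 0.825 even though recall is only 0.4, which is the same pattern the report describes: a high accuracy can coexist with most defaulters going undetected.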
Test Set vs. Cross-Validation Comparison
To evaluate overfitting and model stability for the Credit Default dataset, we compare our initial test set accuracy against the 5-fold cross-validation average:
Single Test Partition Accuracy: 83.1% (0.831)
5-Fold Cross-Validation Accuracy (Average): 82.6% (0.826)
What this tells us about the Model:
No Sign of Overfitting: If our logistic regression model had overfitted (meaning it just memorized the training data), its accuracy would differ noticeably between the cross-validation folds and the held-out test set. Because the cross-validation accuracy (82.6%) is within half a percentage point of the test accuracy (83.1%), there is no evidence of overfitting, and the model appears to generalize well to new data.
High Model Stability: The accuracy barely fluctuates when the data partitions change (the standard error across the 5 folds is about 0.007). This tells us the model is stable and that its ability to predict credit risk is consistent across partitions.
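The mechanics behind a 5-fold cross-validated accuracy can be sketched in base R. The example below runs on simulated stand-in data with a single predictor (hypothetical, not the credit dataset) and does not use the tidymodels functions from the report. It only illustrates how the fold accuracies, their mean, and a standard error are obtained.

```r
set.seed(465)

# Simulated stand-in data (hypothetical; not the credit dataset)
n <- 500
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(-1 + 1.5 * toy$x))

# Randomly assign each row to one of 5 folds
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other four, measure accuracy on the held-out fold
fold_acc <- sapply(seq_len(k), function(i) {
  fit  <- glm(y ~ x, data = toy[fold != i, ], family = binomial)
  prob <- predict(fit, newdata = toy[fold == i, ], type = "response")
  mean((prob > 0.5) == (toy$y[fold == i] == 1))
})

mean(fold_acc)            # cross-validated accuracy
sd(fold_acc) / sqrt(k)    # standard error across folds
```

A small standard error relative to the mean is what "stable across partitions" means in practice.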
5. Economic Interpretation
1. Answering the Economic Question Based on Our Results
Our model results back up what we expected from an economic point of view: a borrower's annual income and whether they own a house are both important determinants of their financial constraint (meaning, how tight they are on money).
The regression model shows that having a high income or owning a property gives the borrower a financial cushion that makes defaulting on a loan (\(default\_status\)) very unlikely. On the other hand, renters and people with low incomes have a much smaller margin of safety, so any small financial problem can easily force them into a default.
2. Economic Interpretation of Model Coefficients
If we look at the numbers and signs from our R output (summary(model_logistic_2)), we can see how these variables affect risk in real life:
Annual Income (income) [Negative Coefficient]: The negative sign means that as income goes up, the probability of default goes down. Economically, income is the money a household has to pay for everything. A high income acts as a liquid cushion. If a high-income borrower has a financial emergency, they can cut back on leisure or other discretionary expenses to keep paying the loan. But a low-income borrower has no margin; if they face an unexpected expense, they have to choose between buying food or paying the bank. Since survival comes first, they default.
Home Ownership (home_ownership) [Categorical]: Our results show that renters are much more likely to default than homeowners. Renting means having a fixed monthly expense that doesn't build any wealth. Homeowners, however, have accumulated an asset (the house). If they face a crisis, they can ask for a second mortgage, refinance, or even sell the property to get cash. Also, homeowners have more "skin in the game": defaulting means the bank could take their house away (foreclosure), which is a huge penalty. Renters don't risk losing property equity, so their economic incentive to avoid default is lower.
3. Policy and Business Decisions (What should the bank do?)
These findings are not just numbers; they are highly useful for making business decisions in the bank:
Adjusting the Cut-off Threshold: Our analysis shows that using the standard 0.5 probability threshold creates a major problem: the bank only catches about 39.6% of the people who are actually going to default (Low Recall). Losing the full money of a loan is way more expensive for a bank than double-checking a client. Therefore, the bank should lower the threshold to 0.3 for clients who rent or have lower incomes, to catch potential defaults early and protect its capital.
Future Research: This model only looks at data from the client. For future projects, it would be smart to add macroeconomic variables, like the unemployment rate or inflation. This would help us see how a borrower’s risk changes when the overall economy goes through a crisis.
6. Limitations and Reproducibility
6.1 Limitations of the Analysis
Although our model performs well with an accuracy of 83.1%, it has two main economic and statistical limitations:
Omission of Macroeconomic Variables: Our dataset only includes specific data about the borrower (like age, income, and loan amount). However, credit default is heavily influenced by the general state of the economy. Because our model ignores the external factors, it assumes the economic environment is always stable, which is not true in the real world.
6.2 Steps Taken for Reproducibility
To ensure that this analysis can be perfectly replicated on any computer, I followed strict programming practices:
Clean Data Management inside Quarto: Instead of relying on variables manually modified in our RStudio global environment, the entire data cleaning process is written inside the code chunks. We explicitly rename the variables (like changing loan_status to default_status) and convert them into factors inside the script. This ensures the document runs perfectly from top to bottom on a clean session without any “invisible” errors.
Using a Fixed Random Seed (set.seed(465)): Logistic regression cross-validation (vfold_cv) and data splitting shuffle the data randomly. By inserting set.seed(465) right before these functions, we force R to use the exact same random sequence every time the document is run. This guarantees that the final metrics, coefficients, and accuracy numbers do not change when you click Render.
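The effect of the seed can be verified in a few lines. This sketch uses a plain `sample()` call as a stand-in for the random shuffling done by the splitting and cross-validation functions; the object names are illustrative.

```r
# Same seed before each random operation gives the same shuffle
set.seed(465)
first_draw <- sample(1:10)

set.seed(465)
second_draw <- sample(1:10)

identical(first_draw, second_draw)   # TRUE
```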
7. AI Log
“We are going to start with the first dataset, the Madrid real estate prices. I already have the code from the professor’s webpage and I just need to change the names according to my dataset. Give me an introductory sentence briefly explaining what I am going to do. Once I send you the results, please explain them to me in Spanish so I can fully understand them, and then help me write the interpretation in English.
# 4. View first few predictions head(data.frame(Actual = credit_test$default_status,Probability =round(logistic_probs, 3),Predicted = logistic_pred))
Actual Probability Predicted
1 Yes 0.032 0
2 No 0.244 0
3 No 0.214 0
4 No 0.119 0
5 Yes 0.867 1
6 No 0.000 0
”
This is one prompt I used several times. I tried to take the code from your RPubs pages and adapt it to my datasets, and if something went wrong I also asked Gemini for help. I used it most for interpreting metrics that I did not understand.
This is one example answer:
“Brilliant! The code executed perfectly without a single error.
Take a look at what this sample table shows us. The model performs quite well, but it makes a specific mistake that is pure gold to explain to the professor:
In rows 2, 3, 4, and 6: The client paid on time (Actual = 0), the model calculated low probabilities (under 50%), and correctly predicted a 0. Perfect!
In row 5: The client defaulted (Actual = 1), the model anticipated the risk with a very high probability of 86.7%, and correctly predicted a 1. Perfect!
The mistake (Row 1): The client ended up defaulting (Actual = 1), but the model assigned a ridiculously low default probability (3.2%) and predicted a 0. A delinquent customer slipped right through.”
With this output, I wrote an interpretation of the table that separates the correct predictions (rows 2, 3, 4, 5, and 6) from the wrong one (row 1), and I point out that in real life this error would be critical.
8. Final Reflections
One Major Improvement with More Time or Better Data
If we had more time or access to a more complete dataset, the most impactful improvement would be to incorporate the borrowers’ credit history and current debt levels into the model, rather than just looking at their income and housing status.
In the real banking world, knowing how much money a client earns is only half the story. The missing piece is knowing how they handle their financial obligations.
The Debt-to-Income Ratio (DTI)
Past Payment Behavior
By adding these variables, we would fix the main weakness of our current model, which is its low Recall rate (39.6%). Instead of just guessing risk based on general things like age or renting status, the algorithm could see the actual financial behavior of the applicant.
One New Economic Question for Future Investigation
Based on the results of our model, the new economic question I would like to investigate in the future is:
“How do macroeconomic shocks (Inflation and central bank interest rates) alter the financial constraints of low-income renters compared to wealthy homeowners, and how does this asymmetry impact a bank’s default rate?”