ECON 465 - Stage 3: Credit Default Analysis

Author

Miguel

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.3     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
✔ broom        1.0.13     ✔ rsample      1.3.1 
✔ dials        1.4.3      ✔ tune         1.3.0 
✔ infer        1.1.0      ✔ workflows    1.2.0 
✔ modeldata    1.5.1      ✔ workflowsets 1.1.1 
✔ parsnip      1.6.0      ✔ yardstick    1.3.2 
✔ recipes      1.3.1      
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()

Credit Default Analysis

1. Economic Question

How do individual socioeconomic characteristics—specifically annual income and home ownership status—determine a borrower’s financial constraint and affect their probability of defaulting on a bank loan?

From an economic perspective, the credit market is a fundamental pillar for economic growth, consumption, and capital accumulation. Financial institutions face a continuous trade-off between maximizing lending activities and minimizing exposure to non-performing loans (NPLs). When a bank grants a loan, evaluating the borrower’s default risk is crucial to maintaining financial stability and avoiding systemic banking crises, such as the one witnessed globally in 2008.

2. Data Import & Cleaning

credit_risk<-read.csv("credit_risk_dataset.csv", sep = ",")
head(credit_risk) 
  person_age person_income person_home_ownership person_emp_length loan_intent
1         22         59000                  RENT               123    PERSONAL
2         21          9600                   OWN                 5   EDUCATION
3         25          9600              MORTGAGE                 1     MEDICAL
4         23         65500                  RENT                 4     MEDICAL
5         24         54400                  RENT                 8     MEDICAL
6         21          9900                   OWN                 2     VENTURE
  loan_grade loan_amnt loan_int_rate loan_status loan_percent_income
1          D     35000         16.02           1                0.59
2          B      1000         11.14           0                0.10
3          C      5500         12.87           1                0.57
4          C     35000         15.23           1                0.53
5          C     35000         14.27           1                0.55
6          A      2500          7.14           1                0.25
  cb_person_default_on_file cb_person_cred_hist_length
1                         Y                          3
2                         N                          2
3                         N                          3
4                         N                          2
5                         Y                          4
6                         N                          2

To address this economic question, we utilize a customer credit dataset. The target variable is binary, representing the credit outcome for each individual:

  • default_status: A binary indicator where 1 represents a client who defaulted on their loan, and 0 represents a client who successfully repaid.

To predict this outcome, the model incorporates five key socio-economic and financial features:

  • loan_amount: The total monetary value of the requested loan.

  • income: The annual income of the applicant.

  • interest_rate: The interest rate assigned to the loan.

  • home_ownership: The residential status of the applicant (e.g., rent, mortgage, own).

  • age: The age of the applicant.

#data cleaning
credit_tidy <- credit_risk %>%
  select(person_age, person_income, person_home_ownership, 
         loan_amnt, loan_int_rate, loan_status) %>%
  filter(!is.na(loan_status), !is.na(person_income), !is.na(loan_int_rate))
head(credit_tidy)
  person_age person_income person_home_ownership loan_amnt loan_int_rate
1         22         59000                  RENT     35000         16.02
2         21          9600                   OWN      1000         11.14
3         25          9600              MORTGAGE      5500         12.87
4         23         65500                  RENT     35000         15.23
5         24         54400                  RENT     35000         14.27
6         21          9900                   OWN      2500          7.14
  loan_status
1           1
2           0
3           1
4           1
5           1
6           1
#rename the variables
credit_tidy <- credit_tidy %>%
  rename(
    age = person_age,
    income = person_income,
    home_ownership = person_home_ownership,
    loan_amount = loan_amnt,
    interest_rate = loan_int_rate,
    default_status = loan_status
  ) %>%
  mutate(
    default_status = as.factor(default_status), # Target Variable (0 = Non-default, 1 = Default)
    home_ownership = as.factor(home_ownership),   # Categorical Predictor variable
    age = as.numeric(age),
    income = as.numeric(income),
    loan_amount = as.numeric(loan_amount)
  )
glimpse(credit_tidy)
Rows: 29,465
Columns: 6
$ age            <dbl> 22, 21, 25, 23, 24, 21, 26, 24, 24, 21, 22, 21, 23, 26,…
$ income         <dbl> 59000, 9600, 9600, 65500, 54400, 9900, 77100, 78956, 83…
$ home_ownership <fct> RENT, OWN, MORTGAGE, RENT, RENT, OWN, RENT, RENT, RENT,…
$ loan_amount    <dbl> 35000, 1000, 5500, 35000, 35000, 2500, 35000, 35000, 35…
$ interest_rate  <dbl> 16.02, 11.14, 12.87, 15.23, 14.27, 7.14, 12.42, 11.11, …
$ default_status <fct> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1…

3. Probability Analysis

Before establishing our predictive models, we examine the baseline distribution of our target variable to understand the unconditional probability of default within our sample.

table(credit_tidy$default_status)

    0     1 
23001  6464 

What we see: 6,464 of the 29,465 customers defaulted. That’s about 21.9% of the total sample.

While the majority of the observations belong to non-defaulting clients, this proportion provides a sufficient baseline of default events. Unlike extreme cases of highly imbalanced data (where defaults are below 5%), our dataset contains enough representative variance for the logistic regression algorithm to effectively learn the distinct financial characteristics of high-risk profiles.
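The unconditional default probability discussed here can be recomputed from the two counts printed by table(). The following is a minimal sketch that re-enters those counts by hand (the object names are ours, chosen for illustration); inside the document itself, prop.table(table(credit_tidy$default_status)) would give the same shares.

```r
# Counts copied from the table() output above
default_counts <- c("0" = 23001, "1" = 6464)

# Share of each class in the sample
default_share <- default_counts / sum(default_counts)
print(round(default_share, 3))

# Odds of default implied by those counts (defaults per non-default)
default_odds <- default_counts[["1"]] / default_counts[["0"]]
print(round(default_odds, 3))
```

The share of defaults is the baseline that any classifier has to beat: a rule that never predicts default is already right for every non-defaulting client.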

Now that the data are clean, we can proceed with the probability distribution analysis.

While the explicit target variable for our classification task is default_status (a discrete binary indicator), evaluating the distribution, skewness, and theoretical approximation of the primary continuous financial features is a prerequisite for robust econometric modeling.

Using the same mandatory seed (set.seed(465)), we divide the customers into a Training Set (80%) and a Test Set (20%), so that the models can later be evaluated on data they were not fitted to.

3.1 Data Preparation: Factor Conversion

Before splitting the data and training our logistic regression models, we prepare our categorical variables by converting them into factors:

# Convert 0/1 and text variables to factors
credit_tidy <- credit_tidy |>
  mutate(
    default_status = factor(default_status, levels = c(0, 1), labels = c("No", "Yes")),
    home_ownership = factor(home_ownership)
  )
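  • Target Variable (default_status): The original binary values (0 for non-default and 1 for default) are converted into a factor with descriptive labels ("No" and "Yes"). Because "No" is the first level, glm() treats it as the reference outcome and models the probability of "Yes" (default).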

  • Predictor Variable (home_ownership): This text-based categorical variable is converted into a factor so the glm() function can automatically create dummy variables (e.g., comparing renting versus owning) during model estimation.
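To make the dummy-variable step concrete, here is a small sketch on a hypothetical six-row vector (not the credit data) with the four ownership categories that appear in the dataset. It shows which level R uses as the reference category and which dummy columns a model formula generates; the object names are ours.

```r
# Hypothetical mini-sample with the four ownership categories seen in the data
home_ownership <- factor(c("RENT", "OWN", "MORTGAGE", "RENT", "OTHER", "OWN"))

# Levels are sorted alphabetically; the first one is the reference category
print(levels(home_ownership))

# model.matrix() shows the dummy columns that a model formula builds
dummies <- model.matrix(~ home_ownership)
print(colnames(dummies))

# relevel() changes the reference category if a different comparison is wanted
home_vs_rent <- relevel(home_ownership, ref = "RENT")
print(levels(home_vs_rent))
```

With the default ordering, every housing coefficient in a fitted model is read as a comparison against MORTGAGE; releveling to RENT would instead express each category relative to renters.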

    # Set seed for reproducibility
    set.seed(465)
    
    # Convert categorical variables to factors
    credit_factor <- credit_tidy |> 
      mutate(across(where(is.character), as.factor))
    
    # Split the data: 80% training, 20% testing
    credit_split <- initial_split(credit_factor, prop = 0.8)
    
    credit_train <- training(credit_split)
    credit_test <- testing(credit_split)
    
    cat("Training set size:", nrow(credit_train), "\n")
    Training set size: 23572 
    cat("Test set size:", nrow(credit_test), "\n")
    Test set size: 5893 

4. Modeling and Cross Validation

4.1 Model 1

# Logistic regression using glm() - Model 1 (Baseline)
model_logistic_1 <- glm(
  default_status ~ loan_amount + income,
  data = credit_train,
  family = binomial
)

# View coefficients and significance
summary(model_logistic_1)

Call:
glm(formula = default_status ~ loan_amount + income, family = binomial, 
    data = credit_train)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.672e-01  3.998e-02  -11.69   <2e-16 ***
loan_amount  1.187e-04  3.145e-06   37.73   <2e-16 ***
income      -3.532e-05  8.458e-07  -41.76   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24757  on 23571  degrees of freedom
Residual deviance: 21799  on 23569  degrees of freedom
AIC: 21805

Number of Fisher Scoring iterations: 6
  • Loan Amount (loan_amount): The coefficient is positive (1.187e-04). Holding income fixed, each additional unit of loan amount raises the log-odds of default, so larger loans carry a higher risk of default.
  • Income (income): The coefficient is negative (-3.532e-05). Holding the loan amount fixed, the probability of defaulting decreases as a customer’s income increases. In simple terms: higher-income clients are less likely to default.
  • Model Evaluation (AIC): The model reports an AIC of 21,805. We will use this number as a baseline to compare with our second model (a lower AIC means a better-fitting model).
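Because logistic regression coefficients are on the log-odds scale, the raw numbers are hard to read. A minimal sketch of the usual conversion to odds ratios follows, using the two coefficients copied from the summary above. The step sizes of 1,000 and 10,000 are our own illustrative choice, not part of the model.

```r
# Coefficients copied from summary(model_logistic_1) above (log-odds scale)
coef_model_1 <- c(loan_amount = 1.187e-04, income = -3.532e-05)

# Illustrative step sizes (an assumption made for readability, not part of the model)
step_size <- c(loan_amount = 1000, income = 10000)

# Odds ratio for a change of one step in each predictor
odds_ratio <- exp(coef_model_1 * step_size)
print(round(odds_ratio, 3))
```

Under these assumptions, an extra 1,000 of loan amount multiplies the odds of default by roughly 1.13, while an extra 10,000 of annual income multiplies them by roughly 0.70.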

4.2 Model 2

# Logistic regression using glm() - Model 2 (Advanced)
model_logistic_2 <- glm(
  default_status ~ loan_amount + income + interest_rate + home_ownership + age,
  data = credit_train,
  family = binomial
)

# View coefficients and significance for Model 2
summary(model_logistic_2)

Call:
glm(formula = default_status ~ loan_amount + income + interest_rate + 
    home_ownership + age, family = binomial, data = credit_train)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -4.033e+00  1.214e-01 -33.235   <2e-16 ***
loan_amount          1.143e-04  3.462e-06  33.007   <2e-16 ***
income              -3.467e-05  9.344e-07 -37.106   <2e-16 ***
interest_rate        2.728e-01  6.336e-03  43.053   <2e-16 ***
home_ownershipOTHER  4.784e-01  2.968e-01   1.612    0.107    
home_ownershipOWN   -1.246e+00  1.022e-01 -12.194   <2e-16 ***
home_ownershipRENT   7.680e-01  4.266e-02  18.002   <2e-16 ***
age                 -1.405e-03  2.999e-03  -0.469    0.639    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24757  on 23571  degrees of freedom
Residual deviance: 18585  on 23564  degrees of freedom
AIC: 18601

Number of Fisher Scoring iterations: 6

The advanced classification model substantially improves model fit, as indicated by the drop in AIC from 21,805 to 18,601:

  • Financial Risk Drivers (loan_amount and interest_rate): Both variables display highly significant positive coefficients (\(p < 0.001\)). Higher loan amounts (1.143e-04) and higher interest rates (0.2728) increase the probability of customer default.
  • Income Hedge (income): Higher annual earnings (-3.467e-05) significantly reduce the likelihood of default (\(p < 0.001\)).
  • Housing Stability Status (home_ownership): R uses the first factor level in alphabetical order as the baseline category, which here is MORTGAGE (it is the only category without its own coefficient). Compared to this baseline:
    • RENT has a positive coefficient (0.7680, \(p < 0.001\)), meaning tenants are significantly more likely to default.
    • OWN has a negative coefficient (-1.2460, \(p < 0.001\)), indicating that customers who fully own their homes have the lowest estimated default risk.
    • OTHER has a positive coefficient (0.4784), but it is not statistically significant (\(p = 0.107\)).
  • Non-Significant Predictor (age): The variable age reports a p-value of 0.639, which is much higher than 0.05. This means that, after controlling for income, loan amount, interest rate, and home ownership, age is not statistically significant for predicting default in this dataset.
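The coefficients above can be turned into more tangible quantities. The sketch below re-enters the reported estimates by hand, converts three of them to odds ratios, and computes the predicted default probability for one hypothetical borrower. The borrower profile is invented for illustration, and the OTHER dummy is left out because that borrower is a renter.

```r
# Coefficients copied from summary(model_logistic_2) above (log-odds scale)
coef_model_2 <- c(
  intercept     = -4.033,
  loan_amount   =  1.143e-04,
  income        = -3.467e-05,
  interest_rate =  0.2728,
  own           = -1.246,   # home_ownershipOWN  (relative to MORTGAGE)
  rent          =  0.7680,  # home_ownershipRENT (relative to MORTGAGE)
  age           = -1.405e-03
)

# Odds ratios for the housing dummies and for one extra point of interest
odds_ratio <- exp(coef_model_2[c("rent", "own", "interest_rate")])
print(round(odds_ratio, 2))

# Hypothetical borrower: 10,000 loan, 50,000 income, 11% rate, renter, age 30
borrower <- c(intercept = 1, loan_amount = 10000, income = 50000,
              interest_rate = 11, own = 0, rent = 1, age = 30)

# Linear predictor and its logistic transform (plogis is the inverse logit)
log_odds <- sum(coef_model_2 * borrower[names(coef_model_2)])
default_prob <- plogis(log_odds)
print(round(default_prob, 3))
```

Read this way, renters have roughly 2.2 times the odds of default of mortgage holders, outright owners roughly 0.29 times, and the hypothetical borrower ends up with a predicted default probability of about 0.29, which is below the 0.5 cut-off.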

4.3 Interpretation of Sample Classifications

# Predicting on the test set with the fitted Model 2 (no retraining happens here)
# 1. Predict default probabilities on the test data
logistic_probs <- predict(model_logistic_2, credit_test, type = "response")

# 2. Classify with a 0.5 threshold; predictions are coded "1" (default) / "0" (no default),
#    while the actual outcome uses the "Yes" / "No" labels
logistic_pred <- ifelse(logistic_probs > 0.5, "1", "0")

# 3. Convert the predictions to a factor with levels "0" and "1"
logistic_pred <- factor(logistic_pred, levels = c("0", "1"))

# 4. View first few predictions 
head(data.frame(
  Actual = credit_test$default_status,
  Probability = round(logistic_probs, 3),
  Predicted = logistic_pred
  ))
  Actual Probability Predicted
1    Yes       0.032         0
2     No       0.244         0
3     No       0.214         0
4     No       0.119         0
5    Yes       0.867         1
6     No       0.000         0

Looking at the first rows of the predictions table, we can see how the model applies the 0.5 classification threshold in practice:

  • Correct Classifications: For most of these clients (rows 2, 3, 4, and 6), the calculated probability remains well below 50%, resulting in a correct prediction of 0 (Non-default). In row 5, the model detects high risk with an 86.7% probability, correctly classifying the client as 1 (Default).
  • Model Mistakes and Risk Assessment: In row 1, we observe a critical classification error. The customer actually defaulted (Actual = Yes), but the model calculated a very low risk probability of only 3.2%, predicting a 0. In a commercial banking context, this type of error means granting a loan to a borrower who will not pay it back, which is a direct financial loss.

4.4 COMPARISON TABLE: CREDIT DEFAULT MODELS

# 1. Create the comparison table (check.names = FALSE keeps the spaces in the column names)
credit_models_summary <- data.frame(
  Model = c("Model 1: Baseline Logistic", "Model 2: Advanced Logistic"),
  `Variables Included` = c("loan_amnt, person_income", "loan_amnt, person_income, loan_int_rate, person_home_ownership, person_age"),
  `Test Performance` = c("Lower Predictive Power", "Higher Accuracy (83.1%) & Robust Generalization"),
  `Practical Focus` = c("Misses critical risk factors", "Good stability, but requires lower threshold to optimize Recall"),
  check.names = FALSE
)

# 2. Print the table
knitr::kable(credit_models_summary, 
             caption = "Table 2: Out-of-Sample Performance Comparison for Credit Default Models",
             align = "l")
Table 2: Out-of-Sample Performance Comparison for Credit Default Models

| Model | Variables Included | Test Performance | Practical Focus |
|:------|:-------------------|:-----------------|:----------------|
| Model 1: Baseline Logistic | loan_amnt, person_income | Lower Predictive Power | Misses critical risk factors |
| Model 2: Advanced Logistic | loan_amnt, person_income, loan_int_rate, person_home_ownership, person_age | Higher Accuracy (83.1%) & Robust Generalization | Good stability, but requires lower threshold to optimize Recall |

4.4.1 Model Selection and Business Meaning

  • Which model is better? Model 2 (model_logistic_2) is clearly better: its AIC is lower (18,601 versus 21,805) and it adds interest rate and home ownership, both highly significant predictors. Model 1 relies only on loan amount and income and misses these risk factors. Age, by contrast, adds no significant information.

  • The Practical Trade-off:

    • The Problem: The model is fairly accurate, but it is far from perfect. It gave a tiny 3.2% risk to a client who actually defaulted (Row 1). For a bank, lending money to someone who doesn’t pay is usually the most expensive mistake.

    • The Solution: The standard 0.5 threshold is too lenient. In real life, the bank should lower the threshold (for example, to 0.3) to be more strict. This flags more of the real defaulters, even if the bank rejects a few good clients by mistake. Note that a case like Row 1, with a predicted probability of 3.2%, would still be missed at 0.3; catching it requires better predictors rather than a lower cut-off.
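The effect of moving the threshold can be checked on the six sample rows printed in section 4.3. The sketch below re-enters those probabilities by hand and classifies them under three cut-offs; flag_defaults() is a helper name of our own, not a function from any package.

```r
# The six sample rows copied from the predictions table in section 4.3
sample_rows <- data.frame(
  actual      = c("Yes", "No", "No", "No", "Yes", "No"),
  probability = c(0.032, 0.244, 0.214, 0.119, 0.867, 0.000)
)

# Classify the same probabilities under different cut-offs
flag_defaults <- function(probability, threshold) {
  ifelse(probability > threshold, "Yes", "No")
}

sample_rows$pred_0.5 <- flag_defaults(sample_rows$probability, 0.5)
sample_rows$pred_0.3 <- flag_defaults(sample_rows$probability, 0.3)
sample_rows$pred_0.2 <- flag_defaults(sample_rows$probability, 0.2)
print(sample_rows)
```

In this tiny sample, moving from 0.5 to 0.3 changes nothing, and a cut-off of 0.2 starts flagging two clients who repaid (rows 2 and 3) while row 1 is still missed. Six rows are far too few to choose a threshold; the sketch only illustrates the mechanics of the trade-off.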

4.5 Cross Validation

#REVISED WITH TIDYMODELS
library(tidymodels)

set.seed(465)

credit_test_fixed <- credit_test %>%
  mutate(default_status = as.factor(default_status))

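# 5 folds (note: the folds are drawn from the test partition, not the training set)
credit_folds <- vfold_cv(credit_test_fixed, v = 5)

# Define the model specification: logistic regression fitted with glm
logistic_spec <- logistic_reg() %>% 
  set_engine("glm")


# Refit the Model 2 formula on 4 folds at a time and score it on the held-out fold
credit_cv_results <- fit_resamples(
  logistic_spec,
  default_status ~ loan_amount + income + interest_rate + home_ownership + age,
  resamples = credit_folds,
  metrics = metric_set(yardstick::accuracy)
)

# Average accuracy across the 5 folds
collect_metrics(credit_cv_results)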
# A tibble: 1 × 6
  .metric  .estimator  mean     n std_err .config             
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy binary     0.826     5 0.00687 Preprocessor1_Model1
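For readers who want to see what the resampling does under the hood, here is a base-R sketch of 5-fold cross-validation for a logistic regression. It runs on simulated stand-in data with hypothetical predictors x1 and x2, not on the credit dataset, and it is only meant to mirror the logic of the tidymodels call, not to reproduce its numbers.

```r
set.seed(465)

# Simulated stand-in data (hypothetical, NOT the credit dataset)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(ifelse(runif(n) < plogis(-1 + 1.5 * sim$x1 - sim$x2), "Yes", "No"),
                levels = c("No", "Yes"))

# Assign each row to one of 5 folds at random
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other 4 folds, score on the held-out fold
fold_accuracy <- sapply(seq_len(k), function(i) {
  fit <- glm(y ~ x1 + x2, data = sim[fold_id != i, ], family = binomial)
  held_out <- sim[fold_id == i, ]
  prob <- predict(fit, newdata = held_out, type = "response")
  pred <- ifelse(prob > 0.5, "Yes", "No")
  mean(pred == held_out$y)
})

# Mean accuracy and its standard error across folds (sd / sqrt(k))
cv_mean <- mean(fold_accuracy)
cv_std_err <- sd(fold_accuracy) / sqrt(k)
print(round(c(mean = cv_mean, std_err = cv_std_err), 3))
```

The key point is that every observation is scored exactly once by a model that never saw it during fitting, which is what makes the averaged accuracy an out-of-sample estimate.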

Test Set Classification Performance

Once the model’s stability was verified using 5-fold cross-validation, the logistic regression model was evaluated on the single test partition using the standard 0.5 threshold to analyze the detailed classification metrics.

# 1. Generate predictions on the test set using the standard 0.5 threshold
credit_preds <- credit_test %>%
  mutate(
    prob = predict(model_logistic_2, newdata = credit_test, type = "response"),
    pred_status = as.factor(ifelse(prob > 0.5, "1", "0")),
    default_status = as.factor(default_status)
  )

# 2. Create the confusion matrix table
confusion_matrix <- table(Predicted = credit_preds$pred_status, Actual = credit_preds$default_status)

# 3. Print the confusion matrix to the document so the professor can see it
print(confusion_matrix)
         Actual
Predicted   No  Yes
        0 4376  792
        1  206  519
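# Read the four cells from the confusion matrix by name (rows = predicted, columns = actual)
TN <- confusion_matrix["0", "No"]    # True Negatives  (4376)
FP <- confusion_matrix["1", "No"]    # False Positives (206)
FN <- confusion_matrix["0", "Yes"]   # False Negatives (792)
TP <- confusion_matrix["1", "Yes"]   # True Positives  (519)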

# Accuracy = correct predictions / total predictions
accuracy <- (TP + TN) / (TP + TN + FP + FN)

# Precision = TP / (TP + FP)
precision <- ifelse(TP + FP > 0, TP / (TP + FP), 0)

# Recall = TP / (TP + FN)
recall <- TP / (TP + FN)

# Print the final metrics
cat("Accuracy:", round(accuracy, 3), "\n")
Accuracy: 0.831 
cat("Precision:", round(precision, 3), "\n")
Precision: 0.716 
cat("Recall:", round(recall, 3), "\n")
Recall: 0.396 

By running the calculations above, we get the test-set performance metrics for our best model (model_logistic_2):

  • Accuracy (0.831 / 83.1%): The model classifies 83.1% of the test-set clients correctly. While this looks like a great number, it is misleading. Because 4,582 of the 5,893 test clients (77.8%) are good payers, a model that always guesses “0” (No default) would already reach 77.8% accuracy.
  • Precision (0.716 / 71.6%): When the model rings the alarm and flags a customer as a likely defaulter (1), it is right 71.6% of the time.
  • Recall (0.396 / 39.6%): This is the most critical metric and the biggest weakness. The model only catches 39.6% of the real defaulters. This means that 60.4% of the risky clients are slipping through undetected.
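The claim that accuracy flatters the model can be quantified from the four confusion-matrix cells alone. The sketch below re-enters the counts and computes the accuracy of a naive rule that always predicts no default, together with specificity and the F1 score. These extra metrics are our addition for illustration and were not part of the original evaluation.

```r
# Cell counts copied from the confusion matrix above
TN <- 4376; FP <- 206; FN <- 792; TP <- 519
total <- TN + FP + FN + TP

# Accuracy of a naive rule that always predicts "No" (no default)
baseline_accuracy <- (TN + FP) / total

# Metrics that are less sensitive to the class imbalance
recall      <- TP / (TP + FN)   # share of real defaulters that are caught
specificity <- TN / (TN + FP)   # share of good payers correctly cleared
precision   <- TP / (TP + FP)
f1_score    <- 2 * precision * recall / (precision + recall)

print(round(c(baseline_accuracy = baseline_accuracy, recall = recall,
              specificity = specificity, f1 = f1_score), 3))
```

The naive rule reaches about 77.8% accuracy, so the model’s 83.1% is a gain of only about five percentage points. The gap between specificity (about 95.5%) and recall (about 39.6%) shows where the model is weak: it clears good payers reliably but misses most defaulters.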

Test Set vs. Cross-Validation Comparison

To evaluate overfitting and model stability for the Credit Default dataset, we compare our initial test set accuracy against the 5-fold cross-validation average:

  • Single Test Partition Accuracy: 83.1% (0.831)
  • 5-Fold Cross-Validation Accuracy (Average): 82.6% (0.826, standard error 0.007)

What this tells us about the Model:

  1. No Sign of Overfitting: If our logistic regression specification were overfitting (fitting noise rather than signal), its accuracy would drop noticeably when refitted and scored on 5 different held-out folds. The cross-validation average (82.6%) lies within one standard error of the single-partition accuracy (83.1%), so we find no evidence of overfitting. One caveat: the folds were drawn from the test partition rather than from the training set, so each fold is small.
  2. High Model Stability: The accuracy changes very little when the data partitions change (the standard error is about 0.7 percentage points). This suggests that the model’s ability to predict credit risk is stable and consistent.
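The comparison between the two accuracy figures can be made explicit with a few lines of arithmetic on the numbers reported above. This is a rough sketch, not a formal test: with only five folds, a band of two standard errors is just a guide.

```r
# Numbers copied from the outputs above
test_accuracy <- 0.831
cv_mean       <- 0.826
cv_std_err    <- 0.00687

# Gap between the two estimates, expressed in standard errors
gap_in_std_err <- (test_accuracy - cv_mean) / cv_std_err

# Rough interval of plus or minus two standard errors around the CV mean
cv_interval <- cv_mean + c(-2, 2) * cv_std_err

print(round(gap_in_std_err, 2))
print(round(cv_interval, 3))
```

The single-partition accuracy sits less than one standard error above the cross-validation mean and well inside the rough interval, which is consistent with the two estimates measuring the same underlying performance.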

5. Economic Interpretation

1. Answering the Economic Question Based on Our Results

Our model results are consistent with what we expected from an economic point of view: a borrower’s annual income and whether they own a house are both strong, statistically significant indicators of their financial constraint (meaning, how tight they are on money).

The regression model shows that having a high income or owning a property gives the borrower a financial cushion that makes defaulting on a loan (default_status) considerably less likely. On the other hand, renters and people with low incomes have a much smaller margin of safety, so even a small financial problem can push them into default.

2. Economic Interpretation of Model Coefficients

If we look at the numbers and signs from our R output (summary(model_logistic_2)), we can see how these variables relate to risk in real life:

  • Annual Income (income; person_income in the raw data) [Negative Coefficient]: The negative sign means that as income goes up, the probability of default goes down. Economically, income is the money a household has to pay for everything. A high income acts as a liquid cushion. If a high-income borrower has a financial emergency, they can cut back on leisure or extra expenses to keep paying the loan. But a low-income profile has no margin; if they face an unexpected expense, they may have to choose between buying food or paying the bank. Since survival comes first, they default.

  • Home Ownership (home_ownership; person_home_ownership in the raw data) [Categorical]: Our results show that renters are much more likely to default than mortgage holders, while outright owners are the least likely. Renting means having a fixed monthly expense that doesn’t build any wealth. Homeowners, however, have accumulated an asset (the house). If they face a crisis, they can ask for a second mortgage, refinance, or even sell the property to get cash. Also, homeowners have more “skin in the game”: if the debt is secured on the property, defaulting means the bank could take their house away (foreclosure), which is a huge penalty. Renters don’t risk losing property equity, so their economic incentive to avoid default is lower.

3. Policy and Business Decisions (What should the bank do?)

These findings are not just numbers; they are highly useful for making business decisions in the bank:

  • Adjusting the Cut-off Threshold: Our analysis shows that using the standard 0.5 probability threshold creates a major problem: the bank only catches about 39.6% of the people who are actually going to default (low Recall). Losing the principal of a loan is usually far more expensive for a bank than taking a closer look at a client. Therefore, the bank should lower the threshold to 0.3 for clients who rent or have lower incomes, to catch more potential defaults early and protect its capital.
  • Future Research: This model only looks at data from the client. For future projects, it would be smart to add macroeconomic variables, like the unemployment rate or inflation. This would help us see how a borrower’s risk changes when the overall economy goes through a crisis.
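The threshold recommendation above can be linked to the relative cost of the two kinds of error. A standard decision rule flags a client when the expected loss from approving exceeds the expected loss from rejecting, which gives a cut-off of cost_fp / (cost_fp + cost_fn). The sketch below applies that rule; the cost figures are hypothetical and the rule assumes the predicted probabilities are well calibrated.

```r
# Threshold that minimises expected misclassification cost:
# flag a client as risky when p * cost_fn > (1 - p) * cost_fp
cost_based_threshold <- function(cost_fp, cost_fn) {
  cost_fp / (cost_fp + cost_fn)
}

# Hypothetical cost ratios (assumptions, not estimates from the data)
equal_costs   <- cost_based_threshold(cost_fp = 1, cost_fn = 1)  # the standard 0.5
costly_misses <- cost_based_threshold(cost_fp = 3, cost_fn = 7)  # gives 0.3

print(c(equal_costs = equal_costs, costly_misses = costly_misses))
```

Seen this way, a 0.3 threshold corresponds to treating a missed default as a little more than twice as costly as wrongly rejecting a good client. If the bank believes the real ratio is higher, the same formula would justify an even lower cut-off.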

6. Limitations and Reproducibility

6.1 Limitations of the Analysis

Although our model reaches an accuracy of 83.1%, it has an important economic and statistical limitation:

  • Omission of Macroeconomic Variables: Our dataset only includes specific data about the borrower (like age, income, and loan amount). However, credit default is heavily influenced by the general state of the economy. Because our model ignores the external factors, it assumes the economic environment is always stable, which is not true in the real world.

6.2 Steps Taken for Reproducibility

To ensure that this analysis can be replicated on another computer, we followed these programming practices:

  • Clean Data Management inside Quarto: Instead of relying on variables manually modified in our RStudio global environment, the entire data cleaning process is written inside the code chunks. We explicitly rename the variables (like changing loan_status to default_status) and convert them into factors inside the script. This ensures the document runs from top to bottom in a clean session without any “invisible” errors.

  • Using a Fixed Random Seed (set.seed(465)): Both the cross-validation folds (vfold_cv) and the train/test split (initial_split) shuffle the data randomly. By calling set.seed(465) right before these functions, we force R to use the same random sequence every time the document is run. This ensures that the final metrics, coefficients, and accuracy numbers do not change when you click Render.
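The role of the seed can be demonstrated in isolation. This small sketch uses sample() as a stand-in for the shuffling done by the splitting and resampling functions; the object names are ours.

```r
# Draw a random permutation twice with the same seed
set.seed(465)
first_draw <- sample(10)

set.seed(465)
second_draw <- sample(10)

# Same seed, same sequence
print(identical(first_draw, second_draw))

# Without resetting the seed, the next draw generally differs
third_draw <- sample(10)
print(identical(first_draw, third_draw))

# Recording the R version helps others reproduce the same numbers
print(R.version.string)
```

One caveat worth keeping in mind: the same seed only guarantees the same sequence under the same random number generator settings, which can differ between R versions, so reporting the R and package versions alongside the seed is a sensible complement.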

7. AI LOG

“We are going to start with the first dataset, the Madrid real estate prices. I already have the code from the professor’s webpage and I just need to change the names according to my dataset. Give me an introductory sentence briefly explaining what I am going to do. Once I send you the results, please explain them to me in Spanish so I can fully understand them, and then help me write the interpretation in English.

# 4. View first few predictions 
head(data.frame(
  Actual = credit_test$default_status,
  Probability = round(logistic_probs, 3),
  Predicted = logistic_pred
))
  Actual Probability Predicted
1    Yes       0.032         0
2     No       0.244         0
3     No       0.214         0
4     No       0.119         0
5    Yes       0.867         1
6     No       0.000         0

This is one prompt I used several times. I tried to take the code from your RPubs pages and adapt it to my datasets, and if something went wrong I also asked Gemini for help. I used it most for interpreting metrics that I did not understand.
This is one example answer:

“Brilliant! The code executed perfectly without a single error.

Take a look at what this sample table shows us. The model performs quite well, but it makes a specific mistake that is pure gold to explain to the professor:

  • In rows 2, 3, 4, and 6: The client paid on time (Actual = 0), the model calculated low probabilities (under 50%), and correctly predicted a 0. Perfect!

  • In row 5: The client defaulted (Actual = 1), the model anticipated the risk with a very high probability of 86.7%, and correctly predicted a 1. Perfect!

  • The mistake (Row 1): The client ended up defaulting (Actual = 1), but the model assigned a ridiculously low default probability (3.2%) and predicted a 0. A delinquent customer slipped right through.”

With this output, I wrote an interpretation of the table that separates the correct predictions (rows 2, 3, 4, 5, and 6) from the wrong one (row 1), and I note that in real life this error would be critical.

8. Final Reflections

One Major Improvement with More Time or Better Data

If we had more time or access to a more complete dataset, the most impactful improvement would be to incorporate the borrowers’ credit history and current debt levels into the model, rather than just looking at their income and housing status.

In the real banking world, knowing how much money a client earns is only half the story. The missing piece is knowing how they handle their financial obligations.

The Debt-to-Income Ratio (DTI)

Past Payment Behavior

By adding these variables, we would fix the main weakness of our current model, which is its low Recall rate (39.6%). Instead of just guessing risk based on general things like age or renting status, the algorithm could see the actual financial behavior of the applicant.

One New Economic Question for Future Investigation

Based on the results of our model, the new economic question I would like to investigate in the future is:

“How do macroeconomic shocks (Inflation and central bank interest rates) alter the financial constraints of low-income renters compared to wealthy homeowners, and how does this asymmetry impact a bank’s default rate?”