Credit Risk Analysis Presentation

Miguel


1. Introduction & Objectives

  • Main Goal: Build a predictive model that classifies whether a client will repay their loan or default (default_status).
  • Why it matters: Banks need a clear, automatic system to check risk instead of just guessing.
  • The Balance: We want to reject risky clients (to save money) but accept good clients.

---

2. Economic Question

“How do annual income, housing status, and the loan amount together determine a borrower’s financial constraint and change their risk of default?”

  • The Problem: Looking at variables separately is not enough. We need to see the whole picture.

  • Resources vs Obligations: Income and housing are resources; the loan amount is the obligation.

  • Hypothesis: When a client has low income, rents their home, and asks for a large loan, the three factors compound and the risk of default rises sharply.

---

3. Data Description & Structure

  • Data Source: Financial dataset containing 32,581 historical loan applications.
  • Outcome Variable (Target): default_status (Factor: 1 for Default, 0 for Non-Default).
  • Main Socioeconomic Predictors:
    • income: Borrower’s annual income (numerical).
    • home_ownership: Housing status (categorical: Rent, Mortgage, Own).
    • loan_amount: Total size of the requested loan (numerical).
  • Control Variables: interest_rate (numerical) and age (numerical).

---

4. Key Findings: Distributions

  • Target Distribution: Around 78% of the clients in our dataset are safe (0), while 22% defaulted (1).
  • The Imbalance Challenge: Because defaults are the minority class, a model can reach high accuracy while still missing most risky profiles, so accuracy alone is not enough to judge it.
  • Loan Amount Insights: The average loan size in our data is around $9,600, but it ranges from small amounts up to $35,000.
  • Economic Reality: Most borrowers ask for standard amounts, but the high-risk “tail” (large loans) is where the danger lies.
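
To make the imbalance challenge concrete, here is a minimal sketch of a naive rule that labels every client as safe. The 78% / 22% split comes from the slide; the 100-client sample and the function name `always_safe` are illustrative assumptions, not part of the project's code.

```python
# Illustrative sample: 100 clients with the 78% / 22% split from the slide.
labels = [0] * 78 + [1] * 22          # 0 = non-default, 1 = default

def always_safe(n):
    """Hypothetical baseline: predict 'non-default' for every client."""
    return [0] * n

preds = always_safe(len(labels))

# Accuracy = correct predictions / all predictions.
correct = sum(p == y for p, y in zip(preds, labels))
accuracy = correct / len(labels)

# Recall = defaults caught / all actual defaults.
caught = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
recall = caught / sum(labels)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
# prints: accuracy=0.78, recall=0.00
```

A rule that never flags anyone already scores about 78% accuracy on data like this while catching no defaults, which is why recall and precision are reported alongside accuracy.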

---

5. Modeling Strategy

  • The Data Split: We divided our dataset into two parts using a fixed seed (465).
    • Training Set (80%): 26,065 observations to build and train our models.
    • Testing Set (20%): 6,516 observations to test how the models perform with new data.
  • Model 1 (Baseline): A simple model using only loan_amount and income.
  • Model 2 (Advanced): Our main model. It combines the 3 key pillars (loan_amount, income, home_ownership) plus control variables (interest_rate and age).
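
The split described above can be sketched as follows. This is a minimal illustration in Python using only the standard library; the dataset size, the 80/20 share, and the seed 465 come from the slide, while the function name `split_indices` is hypothetical. A seed is specific to the software that uses it, so this sketch reproduces the split sizes but not the exact rows the original analysis placed in each set.

```python
import random

N_TOTAL = 32_581      # dataset size from the slides
SEED = 465            # seed from the slides
TRAIN_SHARE = 0.80    # 80% training, 20% testing

def split_indices(n, train_share, seed):
    """Shuffle row indices reproducibly and cut them into train and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)   # same seed, same shuffle
    n_train = round(n * train_share)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(N_TOTAL, TRAIN_SHARE, SEED)
print(len(train_idx), len(test_idx))
# prints: 26065 6516
```

Fixing the seed means the same rows land in the training and testing sets on every run, so the two models are compared on identical data.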

---

6. Model Validation (Cross-Validation)

  • The Technique: We used a 5-Fold Cross-Validation on our training data.
  • How it works: The data is split into 5 parts. The model trains on 4 parts and tests on the remaining 1. This repeats 5 times.
  • Why we do it: To detect overfitting and check that the model's performance holds across different random samples of clients.
  • The Result: Model 2 achieved a very stable average Accuracy of 83.1% across all folds.
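
The fold mechanics described above can be sketched as follows. This is a minimal standard-library illustration, not the project's code: the function name `k_fold_indices` is hypothetical, and reusing seed 465 here is an assumption made only to keep the example reproducible.

```python
import random

def k_fold_indices(n, k, seed):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k near-equal parts
    for i in range(k):
        validation = folds[i]                   # 1 part held out
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation                 # remaining k-1 parts

N_TRAIN = 26_065   # training-set size from the slides
splits = list(k_fold_indices(N_TRAIN, k=5, seed=465))
for fold_number, (train, validation) in enumerate(splits, start=1):
    print(fold_number, len(train), len(validation))
# each of the 5 lines shows 20852 training rows and 5213 validation rows
```

Each row is used for validation exactly once. The cross-validated accuracy is the average of the five per-fold accuracies.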

---

7. Model Performance Comparison

| Metric | Model 1 (Baseline) | Model 2 (Advanced) | Why it matters |
|---|---|---|---|
| Accuracy (Total Correct) | 78.2% | 83.1% | Model 2 makes fewer total mistakes. |
| Recall (Detecting Defaults) | 12.4% | 39.6% | Model 2 catches about 3x as many risky clients. |
| Precision (True Alarms) | 52.1% | 71.6% | When Model 2 flags a client, it is more reliable. |
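
For reference, the three metrics compared above follow directly from confusion-matrix counts. The sketch below uses hypothetical counts for 1,000 test clients, chosen only for easy arithmetic. They are not the study's confusion matrix, and the function name `classification_metrics` is an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts.

    tp: defaults correctly flagged      fp: good clients wrongly flagged
    tn: good clients correctly passed   fn: defaults that were missed
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts (NOT the study's results).
acc, prec, rec = classification_metrics(tp=80, fp=40, tn=760, fn=120)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
# prints: accuracy=0.840 precision=0.667 recall=0.400
```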

---

8. Main Results: Error Analysis

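  • The Alternative Rejected (Model 1): Left the bank blind to risk, with too many False Negatives (Recall of only 12.4%).
  • The Selected Model (Model 2): Reduces the most dangerous errors (False Negatives) compared with Model 1.
  • Economic Interpretation of Errors:
    • False Positives (Precision = 71.6%): Rejecting a good client. About 28% of the clients Model 2 flags would have repaid.
    • False Negatives (Recall = 39.6%): Accepting a bad client. About 60% of actual defaulters still go undetected.
  • The Decision: Model 2 is chosen because it offers the better trade-off of the two models, with higher recall and higher precision, which better protects the bank’s capital from default losses.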

---

9. Recommendations & Limitations

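  • Strategic Recommendation: Lower the classification threshold to 0.3, so that a client is flagged as risky whenever the predicted probability of default is 0.3 or higher.
    • Losing the money from a defaulted loan is much more expensive than a false alarm, so trading some precision for higher recall is worthwhile.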
  • Study Limitations:
    • Data Constraints: We only have a snapshot of historical data, not real-time financial tracking.
    • Missing Variables: The dataset lacks macroeconomic indicators (like inflation or unemployment rates) that also affect a client’s risk of default.
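
The threshold recommendation on this slide can be illustrated with a small sketch. The probabilities and outcomes below are hypothetical toy values, not model output from the study, and the 0.5 comparison threshold is an assumed starting point since the slides do not state the current one.

```python
# Hypothetical predicted default probabilities and true outcomes (1 = default).
probs  = [0.05, 0.10, 0.20, 0.32, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90]
actual = [0,    0,    0,    1,    0,    1,    0,    1,    1,    1]

def evaluate(threshold):
    """Flag clients whose probability reaches the threshold; return (precision, recall)."""
    flagged = [p >= threshold for p in probs]
    tp = sum(f and y == 1 for f, y in zip(flagged, actual))
    fp = sum(f and y == 0 for f, y in zip(flagged, actual))
    fn = sum((not f) and y == 1 for f, y in zip(flagged, actual))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for t in (0.5, 0.3):
    precision, recall = evaluate(t)
    print(f"threshold={t}: precision={precision:.2f}, recall={recall:.2f}")
# prints:
# threshold=0.5: precision=0.75, recall=0.60
# threshold=0.3: precision=0.71, recall=1.00
```

In this toy data, lowering the threshold catches more defaults at the price of more false alarms. The size of that trade-off for the real model would need to be measured on the testing set.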

---

10. Final Reflections & Future Research

  • Proposed Improvement: Incorporate the borrower’s Debt-to-Income Ratio (DTI) and Credit Score or Past Delinquency History to capture behavioral data and address the low Recall.
  • Future Economic Question: “How do macroeconomic shocks (inflation and rate hikes) alter the financial constraints of low-income renters compared to wealthy homeowners, and how does this asymmetry impact a bank’s default rate?”