Credit Risk Modeling

Author

Daniel Lee

Published

March 22, 2024

About the Data

A new variable, default, was created to flag loans whose status is “arrears”, “missed repayment”, “collection agency”, “write off”, or “past maturity”. These are five of the 12 categories in loan status.
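
As a rough illustration of that labeling step, the sketch below maps a loan status to a 0/1 default flag. The five default statuses come from the text above; the function name, the case-insensitive matching, and the sample statuses "active" and "repaid" are assumptions made for the example, not details from the original analysis.

```python
# Statuses the article counts as default (5 of the 12 loan status categories).
DEFAULT_STATUSES = {
    "arrears",
    "missed repayment",
    "collection agency",
    "write off",
    "past maturity",
}


def is_default(status: str) -> int:
    """Return 1 if the loan status counts as default, else 0."""
    # Assumption: labels may differ in case or surrounding whitespace.
    return int(status.strip().lower() in DEFAULT_STATUSES)


# Hypothetical status values; "active" and "repaid" are illustrative only.
statuses = ["active", "Arrears", "repaid", "write off"]
default_flags = [is_default(s) for s in statuses]
print(default_flags)
```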

Loan characteristics

  • interest_percent: feature-engineered into “interest_duration” and “interest_rate_monthly”
  • purpose
  • product
  • duration
  • status
  • loan_type: feature-engineered from loan_id
  • principal

People characteristics

  • age
  • gender
  • hr_verification
  • site_location
  • marital_status
  • nationality
  • government_id
  • identity_document
  • phone_number_verified
  • days_with_the_company: feature-engineered from date_of_hire
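
One plausible way to derive days_with_the_company from date_of_hire is to count the days between the hire date and a reference date. The article does not say which reference date was used (loan start date, data extraction date, or something else), so the as_of parameter below is an assumption, as is the function signature.

```python
from datetime import date


def days_with_the_company(date_of_hire: date, as_of: date) -> int:
    """Number of days between the hire date and a reference date."""
    # Assumption: the reference date is supplied by the caller; the
    # original analysis does not state which date it measured against.
    return (as_of - date_of_hire).days


# Hypothetical example: hired on 1 Jan 2023, measured on 1 Jan 2024.
tenure = days_with_the_company(date(2023, 1, 1), date(2024, 1, 1))
print(tenure)
```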

Do I Have Sufficient Data? Deciphering Data Quality

In this plot, variables are arranged in descending order of their correlation with default. The right half typically contains binarized variables correlated with defaulted loans, while the left half displays those associated with employees who remained in good standing.

At the apex of this funnel are the top predictors, which exhibit the strongest correlations with default. In the current model, these key variables, in descending order, are purpose, interest_rate, product, and principal.

  1. Purpose: Loans lacking information on purpose are less likely to default. Additionally, specific categories such as house renovation, tuition, medical expenses, bill payments, and business ventures exhibit a positive but weak correlation with default.

  2. Interest Rate (Monthly): The monthly interest rate emerges as a critical factor influencing default likelihood. Loans carrying a 5% monthly interest rate demonstrate a higher propensity for default. Most loans in the dataset feature interest rates of either 5% or 10%, with a few outliers possessing weekly rates, which are standardized to a monthly basis for analysis.

  3. Product Type: Analysis of product type reveals notable discrepancies in default rates. Notably, the ‘interglobe’ product type is associated with a higher default probability compared to other product categories. Understanding such variations can inform risk management strategies, potentially leading to adjustments in product offerings or tailored risk assessment methodologies.

  4. Principal: Interestingly, there seems to be an inverse relationship between default and loan amount. Larger loans are less likely to default, while smaller loans are more likely to default.
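
The ranking behind this kind of plot can be sketched in a few lines: binarize each categorical variable into 0/1 indicator columns, correlate every indicator with default, and sort. The example below is a minimal illustration in plain Python. The toy data is invented purely to make the code runnable and does not come from the dataset, and the helper names pearson and binarize are hypothetical; the original analysis may have used a dedicated library.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def binarize(values):
    """Build one 0/1 indicator column per distinct category."""
    return {cat: [int(v == cat) for v in values] for cat in sorted(set(values))}


# Invented toy data, for illustration only (not from the article's dataset).
purpose = ["tuition", "missing", "tuition", "missing", "medical", "missing"]
default = [1, 0, 1, 0, 0, 0]

correlations = {
    f"purpose={cat}": pearson(column, default)
    for cat, column in binarize(purpose).items()
}

# Descending order: indicators tied to default first, good standing last.
ranked = sorted(correlations.items(), key=lambda item: item[1], reverse=True)
for name, corr in ranked:
    print(f"{name:20s} {corr:+.2f}")
```

Indicators at the top of the sorted list correspond to the right half of the plot, and those at the bottom to the left half.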

Why Should I Trust the Model?

When assessing the reliability of the model, your data scientist will provide you with various performance metrics, offering a comprehensive view of its predictive capabilities. Among these metrics, common ones include AUC (Area Under the Curve) and accuracy.

Accuracy: This is perhaps the most intuitive metric, representing the percentage of correct predictions made by the model.
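
As a small worked example, accuracy is the number of correct predictions divided by the total number of predictions. The labels and predictions below are made up for illustration, and the accuracy helper is a hypothetical name.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 1 = default, 0 = good standing.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)  # 3 of the 5 predictions are correct
print(acc)
```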

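AUC: Without delving into technical details, AUC measures how well the model ranks loans that default above loans that do not. It ranges from 0 to 1: a score of 1 means the ranking is perfect, 0.5 means the model does no better than random guessing, and 0 means the ranking is exactly reversed.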
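
For readers who want to see the mechanics, AUC can be read as the probability that a randomly chosen defaulted loan receives a higher predicted score than a randomly chosen loan in good standing, with ties counting as half. The sketch below computes it by comparing every such pair. The scores are made up, and the auc helper is a hypothetical name rather than the function used in the original analysis.

```python
def auc(y_true, scores):
    """AUC as the share of (default, non-default) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A pair counts 1 if the defaulted loan scores higher, 0.5 on a tie.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = default) and predicted default scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

model_auc = auc(y_true, scores)              # 3 of 4 pairs ranked correctly
perfect = auc(y_true, [0.1, 0.2, 0.8, 0.9])  # every pair ranked correctly
no_signal = auc(y_true, [0.5, 0.5, 0.5, 0.5])  # all ties
print(model_auc, perfect, no_signal)
```

The pairwise loop is quadratic in the number of loans, which is fine for an illustration; production libraries typically use a sort-based computation instead.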

Our credit risk model shows a “fair” AUC of 0.72 and an accuracy of 0.73, which suggests that more data is needed.