Introduction

Welcome!

Storyboard at a glance (What’s changed since the last time)

  1. Re-examined purpose & goal

  2. Feature Engineering

  3. New libraries (DT, rpart, rpart.plot)

What is “credit risk?” Why does it matter?

  • Credit risk is the risk that someone will default on their debt (failing to repay loans or meet contractual obligations)

  • Customers who always pay on time, or who consistently miss payments, are easy for banks to classify. The hard cases are the people in between: sometimes they pay their bills early, sometimes on time, sometimes late, and occasionally they default. Credit risk management is about creating and implementing policies & practices that help banks mitigate financial risk

  • Stakeholders: Banks have a financial responsibility to their stakeholders (stockholders & credit union members) as well as to their customers. They meet it by balancing risk with different interest rates, benefits, and penalties.

Goals

  1. To create a model that can predict credit risk given data such as age, housing status, employment, and marital status

  2. To pinpoint existing and new factors that might contribute to low credit risk, and ultimately to provide decision makers with a more complete decision aid.

Examining Correlation Strength

Correlation Matrix 1

Correlation Matrix 2

Correlation Matrix Results

These two correlation matrices display the strength of the correlation between continuous variables. Based on these plots, the following variables are particularly noteworthy:

  • Negative Correlation with Low Credit Risk:

    • Loans for small cars & large appliances

    • Checking account value

    • Renting a house/apartment

    • The number of months as a customer

  • Positive Correlation with Low Credit Risk:

    • Loans for small appliances & used cars

    • Owning a house/apartment

    • The number of months employed

    • Being male

    • Being single
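
The sign of each relationship listed above can be read from a Pearson correlation between a 0/1 indicator (for example, owning vs. renting) and a 0/1 low-risk flag. The sketch below is a minimal illustration in plain Python, not the dashboard's own code; the rows and the names `low_risk`, `owns_home`, and `rents_home` are made up for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative rows only (hypothetical, not the dashboard's data set).
low_risk = [1, 1, 1, 0, 0, 1, 0, 0]    # 1 = low credit risk
owns_home = [1, 1, 0, 0, 0, 1, 1, 0]   # 1 = owns a house/apartment
rents_home = [1 - x for x in owns_home]

r_own = pearson(owns_home, low_risk)    # positive: owning goes with low risk
r_rent = pearson(rents_home, low_risk)  # negative: renting goes with high risk
print(r_own, r_rent)
```

With only two housing categories, the two indicators are mirror images, so their correlations with the target have equal size and opposite sign.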

Principal Component Analysis

Principal Components 1 through 5 explain 88.8% of the variance.
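
A cumulative figure like the one above comes from dividing each component's variance (its eigenvalue) by the total and summing in descending order. The following is a small Python sketch of that arithmetic; the eigenvalues are hypothetical and are not the values behind the 88.8% reported here.

```python
from itertools import accumulate

def cumulative_explained_variance(eigenvalues):
    """Cumulative share of total variance, largest component first."""
    total = sum(eigenvalues)
    ordered = sorted(eigenvalues, reverse=True)
    return [running / total for running in accumulate(ordered)]

def components_needed(eigenvalues, target):
    """Smallest number of components whose cumulative share reaches `target`."""
    shares = cumulative_explained_variance(eigenvalues)
    for k, share in enumerate(shares, start=1):
        if share >= target:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues, chosen only to keep the arithmetic easy to follow.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
shares = cumulative_explained_variance(eigenvalues)
print(shares)
print(components_needed(eigenvalues, 0.85))
```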

Feature engineering & Independence testing for new variables (χ-squared)

Data Post-Feature Engineering

Feature Engineering notes

While age isn’t used to calculate credit score (age of credit matters more), some trends suggest that age group affects credit risk. Therefore, two new categorical variables were created. The first (‘Generation’) is based on generation; the second (‘Age.Group’) is based on sociological & developmental psychology research on psychographic data (physical development, behavior, beliefs & values, etc.).

  • Generation

    • Gen Z: 18-23

    • Millennial: 24-39

    • Generation X: 40-55

    • Baby Boomers: 56-74

    • Silent Generation: 75+

  • “Age Group” (Based on Jeffrey Arnett’s Theory of Emerging Adulthood & Donald Super’s Life Span Theory)

    • Emerging Adult: 18-25

    • Young Adult: 26-40

    • Middle Adult: 41-55

    • Young-Old Adult: 56-65

    • Old-Old Adult: 65+
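
The two binnings above can be sketched as simple threshold functions. This is an illustrative Python version, not the dashboard's R code; the function names are hypothetical. Because the listed ranges for Young-Old (56-65) and Old-Old (65+) overlap at 65, the sketch assumes age 65 belongs to Young-Old.

```python
def generation(age):
    """Map an age in years to the 'Generation' label using the cut-offs listed above."""
    if age < 18:
        raise ValueError("ages below 18 are outside the listed ranges")
    if age <= 23:
        return "Gen Z"
    if age <= 39:
        return "Millennial"
    if age <= 55:
        return "Generation X"
    if age <= 74:
        return "Baby Boomers"
    return "Silent Generation"

def age_group(age):
    """Map an age in years to the 'Age.Group' label using the cut-offs listed above."""
    if age < 18:
        raise ValueError("ages below 18 are outside the listed ranges")
    if age <= 25:
        return "Emerging Adult"
    if age <= 40:
        return "Young Adult"
    if age <= 55:
        return "Middle Adult"
    if age <= 65:  # assumption: 65 is treated as Young-Old, since the listed ranges overlap
        return "Young-Old Adult"
    return "Old-Old Adult"

for age in (19, 24, 40, 65, 80):
    print(age, generation(age), age_group(age))
```

Note that the two schemes disagree for some ages (24 is a Millennial but still an Emerging Adult), which is why they can carry different information into a model.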

χ-squared for independence
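
For reference, a χ-squared test of independence compares the observed counts in a contingency table (for example, an engineered variable vs. Credit Risk) with the counts expected if the two were unrelated. The Python sketch below computes only the statistic and degrees of freedom; the 2 x 2 counts are invented for illustration and are not the dashboard's results.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / grand total
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: rows = two age categories, columns = (low risk, high risk).
counts = [[30, 20],
          [20, 30]]
stat, dof = chi_squared(counts)
print(stat, dof)
```

With 1 degree of freedom, the 5% critical value is about 3.84, so a statistic above that would lead to rejecting independence for a table of this shape.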

Support Vector Machine

Receiver Operating Characteristic Curve: SVM Model 7

SVM Background

Support vector machines (SVMs) are a form of supervised learning used to predict categories. An SVM uses hyperplanes to separate the data into classes (such as ‘low’ or ‘high’ credit risk). SVM was chosen based on the first pass-through.

Parameters:

  • method = ‘svmLinear’ (Linear kernel)

  • 70/30 train/test data split

  • Variables were chosen using a combination of backwards selection, feature selection, and data understanding
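
The 70/30 split in the parameters above can be illustrated with a seeded shuffle. This is a plain random split written in Python as a sketch; it is an assumption that 70% is the training share, and it does not reproduce whatever partitioning routine the dashboard used (which may, for instance, have stratified by Credit Risk).

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of `rows` and cut it into (train, test) parts."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))         # stand-in for 100 customer records
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed matters when several model variants are compared, since each one should be scored on the same held-out rows.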

SVM Model Performance compared

Random Forest

Receiver Operating Characteristic Curve: RF Model 12

Error Across Trees (RF Model 12)

RF Background

Random Forests are also a classification algorithm. Each decision tree applies a sequence of splits on the variables: using one variable at a time, the data is split until the tree can predict the target variable. A Random Forest is an ensemble of many such trees. RF was chosen based on the first pass-through.
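
The ensemble idea can be sketched in a few lines: train each tree on a bootstrap sample, let each tree see a randomly chosen variable, and take a majority vote. The Python below is a deliberately simplified illustration (one-split "stumps" instead of full trees, invented rows, hypothetical function names) and is not how RF Model 12 was built.

```python
import random
from collections import Counter

def majority_vote(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def bootstrap_sample(rows, rng):
    """Sample len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def train_stump(rows, feature, target):
    """One-split 'tree': the majority target label for each value of one feature."""
    by_value = {}
    for row in rows:
        by_value.setdefault(row[feature], []).append(row[target])
    default = majority_vote([row[target] for row in rows])
    table = {value: majority_vote(labels) for value, labels in by_value.items()}
    return lambda row: table.get(row[feature], default)

def train_forest(rows, features, target, n_trees=25, seed=1):
    """Each stump gets its own bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(rows, rng), rng.choice(features), target)
            for _ in range(n_trees)]

def predict(forest, row):
    """Majority vote across all stumps."""
    return majority_vote([tree(row) for tree in forest])

# Invented rows in which both features separate the classes perfectly.
rows = ([{"housing": "own", "job": "skilled", "risk": "low"}] * 4
        + [{"housing": "rent", "job": "unskilled", "risk": "high"}] * 4)
forest = train_forest(rows, ["housing", "job"], "risk")
print(predict(forest, {"housing": "rent", "job": "unskilled"}))
```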

Parameters

  • method = ‘rf’

  • 70/30 train/test data split

  • Variables were chosen using a combination of backwards selection, feature selection, and data understanding

RF Model Performance compared

Results

Basic CART Plot (based on RF12)

Visualizing individual trees requires CART (Classification & Regression Trees) models. These models show how Credit Risk was predicted from the various independent variables (e.g., Loan Purpose, Housing, Age Group).
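
To make the splitting step concrete: a classification tree typically picks, at each node, the variable whose split leaves the purest groups, often measured by Gini impurity. The Python sketch below scores candidate variables that way on invented rows. It splits on every value of a categorical variable, which matches a binary CART split only because each example variable has two levels; that simplification, the data, and the function names are all assumptions of this sketch.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 = perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, feature, target):
    """Weighted Gini impurity after grouping rows by the values of `feature`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    n = len(rows)
    return sum(len(labels) / n * gini(labels) for labels in groups.values())

def best_feature(rows, features, target):
    """Feature whose split gives the lowest weighted impurity."""
    return min(features, key=lambda f: split_impurity(rows, f, target))

# Invented rows (not the dashboard's data).
rows = [
    {"housing": "own",  "job": "skilled",   "risk": "low"},
    {"housing": "own",  "job": "unskilled", "risk": "low"},
    {"housing": "own",  "job": "skilled",   "risk": "low"},
    {"housing": "rent", "job": "unskilled", "risk": "high"},
    {"housing": "rent", "job": "skilled",   "risk": "high"},
    {"housing": "rent", "job": "unskilled", "risk": "low"},
]
print(best_feature(rows, ["housing", "job"], "risk"))
```

In this toy data, splitting on housing leaves one pure group, while splitting on job leaves two mixed groups, so housing is chosen first.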

CART model using the rpart library

Classification Tree 2

Results

Important variables for future consideration

  • Loan Purpose
  • Housing
  • Months Customer
  • Months Employed
  • Years
  • Generation & Age Group
  • Total

Variables that may be removed (or may require further investigation)

  • Job
  • Marital Status & Gender
  • Checking

Accuracy vs AUC?

  • Accuracy is easy to interpret: it is the share of customers classified correctly
  • AUC is also helpful for this application since we’re considering the probability of classification (the probability that someone will be high or low credit risk), but it is a bit more challenging to explain to decision makers.
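
The difference between the two metrics is easier to see with numbers. Accuracy depends on a cut-off applied to the predicted probability, while AUC asks how often a randomly chosen low-risk customer receives a higher score than a randomly chosen high-risk one. The Python sketch below uses invented labels and scores; it is not output from SVM7 or RF12.

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of cases classified correctly when scores >= threshold count as class 1."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Invented example: 1 = low credit risk, score = predicted probability of low risk.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.55, 0.3, 0.2]
print(round(accuracy(labels, scores), 3))  # 0.667
print(round(auc(labels, scores), 3))       # 0.889
```

Here the ranking is better than the accuracy suggests: moving the cut-off changes accuracy, but AUC stays the same because it only depends on the ordering of the scores.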

Recommendations

  1. Model accuracy is not serviceable for deployment (see below). The probability of predicted credit risk is not as definitive as it should be if the model were to be used as a decision aid. Improperly assigning high credit risk to (potential or current) customers could result in undeserved, high interest rates and other unfair penalties. Assigning low credit risk to risky customers could result in financial loss. The former would negatively impact customer satisfaction while the latter would jeopardize the bank’s financial responsibility to its shareholders.

  2. While ‘Generation’ had a statistically insignificant relationship with Credit Risk, both Generation and Age Group strengthened RF12 & SVM7. The improvement due to these variables suggests that psychographic data could be an important input for evaluating credit risk, and could ultimately support credit risk management decisions.

  3. Employment is clearly important, as “Months.Employed” contributes positively to the performance of these models. However, ‘Job’ does not, suggesting that this variable needs to be improved or replaced. Annual income (or annual income level) should do a better job of supporting credit-risk calculations (Probability of Default, Loss Given Default). Improvements to job categories could involve more specific descriptions of trades & professions, rather than ‘skilled’ vs ‘unskilled’.

  4. Finally, credit risk needs more granularity between ‘High’ and ‘Low’. Addressing and serving low and high risk customers is relatively straightforward since they either (1) consistently pay their bills by their due date, or (2) consistently miss payments or default. However, the biggest challenge is identifying the individuals who behave erratically.

Predicting new data with RF12

Post-Mortem

Debugging :)

What went well

  • Tried a variety of new functions in R, RMD, & Flex Dashboard (layouts, DT, CART plots). Also increased familiarity with RMD dashboards

  • I met (most of) my previous goals

  • Successfully solved some code-related issues from Week 1/2

  • Tested model with new data

  • Background research on credit risk

  • Storyboarding

What didn’t

  • Visualizing & interpreting random forests

  • Dropped my EDA graphs because they weren’t helpful & looked ugly

  • One-hot encoding (vs dummy coding)

  • So. Many. Bugs.

What I’d do differently next time

  • Experiment more with plotly & interactive plots

  • Try using Bootstrap-based (bslib) themes