Re-examined purpose & goal
Feature Engineering
New libraries (DT, rpart, rpart.plot)
Credit risk is the risk that someone will default on their debt (failing to repay loans or meet contractual obligations).
Customers who consistently pay on time, or who consistently miss payments, are easy for banks to deal with. The hard cases are the people in between: sometimes they pay their bills early, sometimes on time, sometimes late, and other times they default. Credit risk management is about creating and implementing policies & practices that help banks mitigate financial risk.
Stakeholders: Banks have a financial responsibility to their stakeholders (stockholders & credit union members) as well as to their customers, which they meet by balancing risk with different interest rates, benefits, and penalties.
To create a model that can predict credit risk given data such as age, housing status, employment, and marital status.
To pinpoint existing and new factors that might contribute to low credit risk, and ultimately provide decision makers with a more complete decision aid.
These two correlation matrices display the strength of the correlations between variables. Based on these plots, the following variables are particularly noteworthy:
Negative Correlation with Low Credit Risk:
Loans for small cars & large appliances
Checking account value
Renting a house/apartment
The number of months as a customer
Positive Correlation with Low Credit Risk:
Loans for small appliances & used cars
Owning a house/apartment
The number of months employed
Being male
Being single
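As a rough illustration of how correlations like these can be computed, the sketch below one-hot encodes a categorical predictor and correlates each resulting column with a binary low-risk indicator. The data are simulated and the column names (Months.Employed, Housing, Low.Risk) are hypothetical stand-ins, not the project's actual data.

```r
# Simulated stand-in data; column names are hypothetical.
set.seed(1)
n <- 200
credit <- data.frame(
  Months.Employed = rpois(n, 30),
  Housing = factor(sample(c("Own", "Rent"), n, replace = TRUE)),
  Low.Risk = rbinom(n, 1, 0.6)  # 1 = low credit risk, 0 = high credit risk
)

# One-hot encode the predictors; dropping the intercept (-1) keeps a
# column for every level of Housing.
X <- model.matrix(~ Months.Employed + Housing - 1, data = credit)

# Pearson correlation of each column with the binary target.
cors <- cor(X, credit$Low.Risk)[, 1]

# Most negative to most positive.
sort(cors)
```

Because the target here is random noise, the correlations will be close to zero; with real data, the sign of each value indicates whether a variable moves with or against low credit risk.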
Principal Components 1 through 5 explain 88.8% of the variance.
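A figure like this typically comes from the cumulative proportion of variance across components. The sketch below shows one way to compute it with base R's prcomp(); the matrix is simulated, so the result will not match the 88.8% reported here.

```r
# Simulated numeric matrix standing in for the (encoded) credit data.
set.seed(42)
X <- matrix(rnorm(200 * 8), ncol = 8)
X[, 2] <- X[, 1] + rnorm(200, sd = 0.3)  # add some correlated structure

# PCA on centered, scaled columns.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, then the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Share of variance captured by the first five components.
round(cum_var[5], 3)
```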
While age isn’t used to calculate a credit score (age of credit matters more), some trends suggest that age group affects credit risk. Therefore, two new categorical variables were created. The first (‘Generation’) is based on generation; the second (‘Age.Group’) is based on sociological & developmental psychology research on psychographic data (physical development, behavior, beliefs & values, etc.).
Generation
Gen Z: 18-23
Millennial: 24-39
Generation X: 40-55
Baby Boomers: 56-74
Silent Generation: 75+
“Age Group” (Based on Jeffrey Arnett’s Theory of Emerging Adulthood & Donald Super’s Life Span Theory)
Emerging Adult: 18-25
Young Adult: 26-40
Middle Adult: 41-55
Young-Old Adult: 56-65
Old-Old Adult: 66+
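The two binnings above can be sketched with base R's cut(). The age vector is made up for illustration, and age 65 is placed in the Young-Old bin here, which is an assumption about that boundary.

```r
# Hypothetical ages, chosen to hit the bin edges.
age <- c(18, 23, 24, 39, 40, 56, 65, 70, 80)

# cut() uses right-closed intervals by default, e.g. (17, 23] = ages 18-23.
generation <- cut(
  age,
  breaks = c(17, 23, 39, 55, 74, Inf),
  labels = c("Gen Z", "Millennial", "Generation X",
             "Baby Boomers", "Silent Generation")
)

age_group <- cut(
  age,
  breaks = c(17, 25, 40, 55, 65, Inf),  # 65 falls in Young-Old (assumption)
  labels = c("Emerging Adult", "Young Adult", "Middle Adult",
             "Young-Old Adult", "Old-Old Adult")
)

data.frame(Age = age, Generation = generation, Age.Group = age_group)
```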
Support vector machines (SVM) are a form of supervised learning used to predict categories. They use hyperplanes to separate the data into different categories (such as ‘low’ or ‘high’ credit risk). SVM was chosen based on the first pass-through.
Parameters:
method = ‘svmLinear’ (Linear kernel)
70/30 train/test data split
Variables were chosen using a combination of backwards selection, feature selection, and data understanding
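A minimal sketch of this setup with caret, assuming a 70% training / 30% test split. The data are simulated, the column names are hypothetical, and the cross-validation and preprocessing settings are illustrative choices rather than the ones used in this project.

```r
library(caret)  # method = "svmLinear" also requires the kernlab package

# Simulated stand-in data; column names are hypothetical.
set.seed(123)
n <- 300
credit <- data.frame(
  Months.Employed = rpois(n, 30),
  Age = sample(18:75, n, replace = TRUE)
)
credit$Credit.Risk <- factor(
  ifelse(credit$Months.Employed + rnorm(n, sd = 8) > 30, "Low", "High")
)

# Stratified 70/30 split on the target.
idx <- createDataPartition(credit$Credit.Risk, p = 0.7, list = FALSE)
train_df <- credit[idx, ]
test_df <- credit[-idx, ]

# Linear-kernel SVM; 5-fold CV and centering/scaling are assumptions.
svm_fit <- train(
  Credit.Risk ~ ., data = train_df,
  method = "svmLinear",
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 5)
)

# Evaluate on the held-out 30%.
pred <- predict(svm_fit, newdata = test_df)
confusionMatrix(pred, test_df$Credit.Risk)
```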
Random Forests are also a classification algorithm. Each decision tree splits the data on one variable at a time until it can predict the target variable; a Random Forest is an ensemble of many such trees whose predictions are combined. RF was chosen based on the first pass-through.
Parameters:
method = ‘rf’
70/30 train/test data split
Variables were chosen using a combination of backwards selection, feature selection, and data understanding
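A comparable sketch for the random forest, again on simulated data with hypothetical column names. The mtry and ntree values are illustrative assumptions; varImp() gives a quick view of which predictors the forest relied on, which can inform variable selection.

```r
library(caret)  # method = "rf" also requires the randomForest package

# Simulated stand-in data; Noise is a deliberately uninformative predictor.
set.seed(123)
n <- 300
credit <- data.frame(
  Months.Employed = rpois(n, 30),
  Age = sample(18:75, n, replace = TRUE),
  Noise = rnorm(n)
)
credit$Credit.Risk <- factor(
  ifelse(credit$Months.Employed + rnorm(n, sd = 8) > 30, "Low", "High")
)

# Stratified 70/30 split on the target.
idx <- createDataPartition(credit$Credit.Risk, p = 0.7, list = FALSE)
train_df <- credit[idx, ]
test_df <- credit[-idx, ]

# Random forest via caret; mtry and ntree are illustrative, not tuned.
rf_fit <- train(
  Credit.Risk ~ ., data = train_df,
  method = "rf",
  tuneGrid = data.frame(mtry = 2),
  ntree = 200,
  trControl = trainControl(method = "cv", number = 5)
)

# Held-out accuracy and variable importance.
rf_pred <- predict(rf_fit, newdata = test_df)
mean(rf_pred == test_df$Credit.Risk)
imp <- varImp(rf_fit)
imp
```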
Visualizing individual trees requires CART models (Classification & Regression Trees). These models show how Credit Risk was predicted from the various independent variables (e.g., Loan Purpose, Housing, Age Group).
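A single CART tree can be fit and drawn with rpart and rpart.plot, roughly as sketched below. The data and the rule generating Credit.Risk are simulated for illustration, and the plot options (type, extra) are one reasonable choice among several.

```r
library(rpart)
library(rpart.plot)

# Simulated stand-in data; column names and the risk rule are hypothetical.
set.seed(7)
n <- 300
credit <- data.frame(
  Housing = factor(sample(c("Own", "Rent", "Free"), n, replace = TRUE)),
  Age.Group = factor(sample(c("Emerging Adult", "Young Adult", "Middle Adult"),
                            n, replace = TRUE)),
  Months.Employed = rpois(n, 30)
)
credit$Credit.Risk <- factor(
  ifelse(credit$Housing == "Own" | credit$Months.Employed > 35, "Low", "High")
)

# Classification tree on all predictors.
cart_fit <- rpart(Credit.Risk ~ ., data = credit, method = "class")

# Each node shows the predicted class, class probabilities,
# and the percentage of observations reaching it.
rpart.plot(cart_fit, type = 2, extra = 104)
```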
Important variables for future consideration
Variables that may be removed (or may require further investigation)
Accuracy vs AUC?
Model accuracy is not serviceable for deployment (see below). The probability of predicted credit risk is not as definitive as it should be if the model were to be used as a decision aid. Improperly assigning high credit risk to (potential or current) customers could result in undeserved, high interest rates and other unfair penalties. Assigning low credit risk to risky customers could result in financial loss. The former would negatively impact customer satisfaction while the latter would jeopardize the bank’s financial responsibility to its shareholders.
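To make the accuracy vs AUC distinction concrete, the sketch below computes both from a small set of made-up predicted probabilities. Accuracy depends on a single cutoff (0.5 here, an assumption), while AUC measures how well the probabilities rank high-risk cases above low-risk ones across all cutoffs.

```r
# Made-up example: 1 = high credit risk, 0 = low credit risk.
actual <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
prob <- c(0.90, 0.60, 0.45, 0.40, 0.55, 0.35, 0.30, 0.20, 0.15, 0.10)

# Accuracy at a 0.5 cutoff.
pred <- as.integer(prob >= 0.5)
accuracy <- mean(pred == actual)

# AUC via the rank-sum (Mann-Whitney) formula: the probability that a
# randomly chosen high-risk case scores above a randomly chosen low-risk one.
r <- rank(prob)
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(accuracy = accuracy, auc = auc)  # accuracy = 0.7, auc = 22/24 (about 0.917)
```

In this toy case the ranking is good (high AUC) even though the 0.5 cutoff misclassifies three customers, so the two metrics can tell different stories about the same model.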
While ‘Generation’ had a statistically insignificant relationship with Credit Risk, both Generation and Age Group strengthened RF12 & SVM7. The improvement due to these variables suggests that psychographic data could be an important input for evaluating credit risk, and could ultimately support credit risk management decisions.
Employment is clearly important, as “Months.Employed” contributes positively to the performance of these models. However, ‘Job’ does not, suggesting that this variable needs to be improved or replaced. Annual income (or annual income level) should do a better job of predicting credit-risk-related measures (Probability of Default, Loss Given Default). Improvements to job categories could involve more specific descriptions of trades & professions, rather than ‘skilled’ vs ‘unskilled’.
Finally, credit risk needs more granularity between ‘High’ and ‘Low’. Addressing and serving low and high risk customers is relatively straightforward since they either (1) consistently pay their bills by their due date, or (2) consistently miss payments or default. However, the biggest challenge is identifying the individuals who behave erratically.
Tried a variety of new functions in R, RMD, & Flex Dashboard (layouts, DT, CART plots). Also increased familiarity with RMD dashboards
I met (most of) my previous goals
Successfully solved some code-related issues from Week 1/2
Tested model with new data
Background research on credit risk
Storyboarding
Visualizing & interpreting random forests
Dropped my EDA graphs because they weren’t helpful & looked ugly
One-hot encoding (vs dummy coding)
So. Many. Bugs.
Experiment more with plotly & interactive plots
Try using Bootstrap-based (bslib) themes