If any issues, questions or suggestions feel free to reach me out via e-mail wieczynskipawel@gmail.com or Linkedin. You can also visit my Github.
This is R replication of the code and exercises from the Udemy course “Credit Risk Modeling in Python 2022”.
if(!require('pacman')) install.packages('pacman')
pacman::p_load(dplyr, tidyr, ggplot2, pROC)
options(scipen = 20)
We load data which was prepared in that blog post. Our dataset containt 43236 rows and 47 attributes. We will fit logistic regression model to decide whether recovery rate is greater than 0 and if so, then we will fit linear regression model.
data <- read.csv('ead_lgd_preprocessed_data.csv')
data %>% dim()
[1] 43236 47
We split our data into training set and test set in proportions 80:20.
set.seed(2137)
n_obs <- nrow(data)
train_index <- sample(1:n_obs, 0.8 * n_obs)
For logistic regression model we create dummy variable whether recovery rate is 0 or greater than 0.
data <- data %>%
mutate(recovery_rate_0_1 = if_else(recovery_rate > 0, 1, 0) %>% as.factor())
Fit logistic regression model.
log_model_1 <- glm(
family = binomial(link = 'logit')
,recovery_rate_0_1 ~.
,data = data %>% select(-good_bad, -recovery_rate, -CCF)
,subset = train_index
)
Calculate adjusted multivariate p-values using
p.adjust() function. Based on these we exclude from the
logistic regression model the following variables:
- categorial: home_onwership, purpose, term
- continuous: emp_length, open_acc, pub_rec, acc_now_delinq,
total_rev_hi_lim, mths_since_last_delinq.
log_p_values_multi <- log_model_1 %>%
summary() %>%
coef() %>%
.[,4] %>%
p.adjust()
log_p_values_multi %>% round(5) %>% as.data.frame()
The final logistic regression model for estimating wether recovery rate is 0 or is higher than 0 is the following.
log_model_2 <- glm(
family = binomial(link = 'logit')
,recovery_rate_0_1 ~.
,data = data %>% select(
-good_bad, -recovery_rate, -CCF
,-starts_with('home_ownership'), -starts_with('purpose'), -starts_with('term')
,-emp_length, -open_acc, -pub_rec, -acc_now_delinq, -total_rev_hi_lim, -mths_since_last_delinq
)
,subset = train_index
)
log_model_2 %>% summary()
Call:
glm(formula = recovery_rate_0_1 ~ ., family = binomial(link = "logit"),
data = data %>% select(-good_bad, -recovery_rate, -CCF, -starts_with("home_ownership"),
-starts_with("purpose"), -starts_with("term"), -emp_length,
-open_acc, -pub_rec, -acc_now_delinq, -total_rev_hi_lim,
-mths_since_last_delinq), subset = train_index)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6941 -1.1045 0.7117 0.9707 2.1048
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.573617112 0.268709025 -17.021 < 0.0000000000000002 ***
grade_A 2.262087421 0.179065760 12.633 < 0.0000000000000002 ***
grade_B 1.700315449 0.145013479 11.725 < 0.0000000000000002 ***
grade_C 1.295375871 0.124141299 10.435 < 0.0000000000000002 ***
grade_D 0.988976672 0.108445843 9.120 < 0.0000000000000002 ***
grade_E 0.618683621 0.098508593 6.281 0.00000000033747697 ***
grade_F 0.343559263 0.097897093 3.509 0.000449 ***
addr_state_NM_VA -0.102256548 0.068522819 -1.492 0.135621
addr_state_NY -0.517773648 0.051274204 -10.098 < 0.0000000000000002 ***
addr_state_OK_TN_MO_LA_MD_NC -0.059891111 0.049483274 -1.210 0.226152
addr_state_CA -0.128848163 0.044723972 -2.881 0.003965 **
addr_state_UT_KY_AZ_NJ -0.194507422 0.052819361 -3.683 0.000231 ***
addr_state_AR_MI_PA_OH_MN -0.166935459 0.047855319 -3.488 0.000486 ***
addr_state_RI_MA_DE_SD_IN 0.016264962 0.064147357 0.254 0.799838
addr_state_GA_WA_OR -0.444341598 0.056671689 -7.841 0.00000000000000448 ***
addr_state_WI_MT -0.368259387 0.100786316 -3.654 0.000258 ***
addr_state_IL_CT -0.388081534 0.062489660 -6.210 0.00000000052872774 ***
addr_state_KS_SC_CO_VT_AK_MS -0.119854148 0.067069066 -1.787 0.073933 .
addr_state_TX 0.183522562 0.057231202 3.207 0.001343 **
addr_state_WV_NH_WY_DC_ME_ID -0.152398035 0.116142977 -1.312 0.189467
verification_status_Source.Verified -0.070280631 0.028261601 -2.487 0.012890 *
verification_status_Not.Verified -0.126733212 0.029830975 -4.248 0.00002153253608254 ***
initial_list_status_w -0.756697306 0.028123598 -26.906 < 0.0000000000000002 ***
months_since_issue_d 0.032940758 0.000960904 34.281 < 0.0000000000000002 ***
int_rate 0.164344608 0.008992668 18.275 < 0.0000000000000002 ***
months_since_earliest_cr_line -0.000525558 0.000150598 -3.490 0.000483 ***
delinq_2yrs 0.049035676 0.015220726 3.222 0.001275 **
inq_last_6mths -0.032303551 0.009556738 -3.380 0.000724 ***
total_acc -0.007326276 0.001171753 -6.252 0.00000000040417920 ***
annual_inc 0.000002310 0.000000344 6.716 0.00000000001867798 ***
dti -0.011541411 0.001650186 -6.994 0.00000000000267144 ***
mths_since_last_record -0.003098384 0.000415384 -7.459 0.00000000000008713 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 47357 on 34587 degrees of freedom
Residual deviance: 43139 on 34556 degrees of freedom
AIC: 43203
Number of Fisher Scoring iterations: 4
In that blog post we calculated ROC and AUC manually for educational purposes. Now let’s use function roc from the library pROC. Here we take TPR as sensitivities and we calculate FPR as 1 - specificities. AUC of this model is \(69.93\%\).
log_predict_prob <- predict.glm(log_model_2
,data[-train_index,]
,type = 'response')
log_model_roc <- roc(response = data$recovery_rate_0_1[-train_index]
,predictor = log_predict_prob)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
data.frame(
FPR = 1 - log_model_roc$specificities
,TPR = log_model_roc$sensitivities
) %>%
ggplot(aes(FPR, TPR)) +
geom_line(size = 1, col = 'blue') +
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1), size = 1, linetype = 'dashed') +
labs(title = paste0('Recovery Rate Logistic Regression Model\nROC of Validation Set\nAUC = '
,round(log_model_roc$auc, 4))) +
theme_bw()
Now let’s take rows with recovery_rate > 0. There are 24371 observations in the filtered dataset.
data_stage_2 <- data %>%
filter(recovery_rate > 0)
data_stage_2 %>% dim()
[1] 24371 48
For those we fit linear regression model.
lin_model_1 <- lm(
recovery_rate ~.
,data = data_stage_2 %>% select(-good_bad, -recovery_rate_0_1, -CCF)
,subset = train_index
)
Calculate adjusted multivariate p-values using
p.adjust() function. Based on these we exclude from the linear
regression model the following variables:
- categorial: home_onwership, addr_state, purpose
- continuous: emp_length, months_since_earliest_cr_line, delinq_2yrs,
inq_last_6mths, pub_rec, acc_now_delinq, annual_inc,
mths_since_last_delinq, dti, mths_since_last_record.
lin_p_values_multi <- lin_model_1 %>%
summary() %>%
coef() %>%
.[,4] %>%
p.adjust()
lin_p_values_multi %>% round(5) %>% as.data.frame()
For those we fit linear regression model. We obtain pretty simple model, but with very poor \(R^2\).
lin_model_2 <- lm(
recovery_rate ~.
,data = data %>% select(
-good_bad, -recovery_rate_0_1, -CCF
,-starts_with('home_ownership'), -starts_with('purpose'), -starts_with('addr_state')
,-emp_length, -months_since_earliest_cr_line, -delinq_2yrs, -inq_last_6mths, -pub_rec
,-acc_now_delinq, -annual_inc, -mths_since_last_delinq, -dti, -mths_since_last_record
)
,subset = train_index
)
lin_model_2 %>% summary()
Call:
lm(formula = recovery_rate ~ ., data = data %>% select(-good_bad,
-recovery_rate_0_1, -CCF, -starts_with("home_ownership"),
-starts_with("purpose"), -starts_with("addr_state"), -emp_length,
-months_since_earliest_cr_line, -delinq_2yrs, -inq_last_6mths,
-pub_rec, -acc_now_delinq, -annual_inc, -mths_since_last_delinq,
-dti, -mths_since_last_record), subset = train_index)
Residuals:
Min 1Q Median 3Q Max
-0.38690 -0.05695 -0.02815 0.04626 0.96283
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.00572931021 0.01079144771 -0.531 0.595483
grade_A 0.02931672314 0.00720118758 4.071 0.0000468965584607 ***
grade_B 0.02020574669 0.00584901888 3.455 0.000552 ***
grade_C 0.01422911340 0.00503091505 2.828 0.004682 **
grade_D 0.00979628555 0.00441305373 2.220 0.026436 *
grade_E 0.00618242156 0.00402198002 1.537 0.124264
grade_F 0.00435179162 0.00401625777 1.084 0.278575
verification_status_Source.Verified -0.00322360218 0.00117223139 -2.750 0.005963 **
verification_status_Not.Verified -0.00165376640 0.00125547956 -1.317 0.187767
initial_list_status_w -0.01591725813 0.00119200421 -13.353 < 0.0000000000000002 ***
term_36 -0.00393913235 0.00119090460 -3.308 0.000942 ***
months_since_issue_d 0.00005864143 0.00003787667 1.548 0.121578
int_rate 0.00370628227 0.00036459848 10.165 < 0.0000000000000002 ***
open_acc -0.00083015190 0.00013736378 -6.043 0.0000000015238842 ***
total_acc 0.00019470063 0.00005766265 3.377 0.000735 ***
total_rev_hi_lim 0.00000016961 0.00000002201 7.705 0.0000000000000134 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08898 on 34572 degrees of freedom
Multiple R-squared: 0.02238, Adjusted R-squared: 0.02196
F-statistic: 52.77 on 15 and 34572 DF, p-value: < 0.00000000000000022
As we see below residuals of the linear regression model ain’t behave well. These are not normally distributed and heteroscedastic. In another blog post we will estimate beta regression instead of two staged logistic and lienar regression.
par(mfrow = c(2,2))
lin_model_2 %>% plot(which = 1:3)
acf(lin_model_2$residuals, main = '')
Finally, in order to predict recovery rate a given loan first we have to decide whether recovery rate is 0 or above 0 using logistic regression model and if it’s above 0 we predict it with linear regression model. To do that we simply multiply predicted values from both models. Let’s take a threshold of 0.5 for predictions from logistic regression model.
lin_predict <- predict(lin_model_2
,data[-train_index,])
log_predict_class <- if_else(log_predict_prob > 0.5, 0, 1)
recovery_rate_summary = data.frame(
Actual = data$recovery_rate[-train_index]
,Predicted = log_predict_class * lin_predict
) %>%
pivot_longer(cols = 1:2, names_to = 'Recovery Rate', values_to = 'Value')
ggplot(recovery_rate_summary, aes(Value, fill = `Recovery Rate`)) +
geom_density(alpha= 0.5) +
theme_bw() +
labs(x = '', y ='', title = 'Densities of Actual and Predicted Recovery Rates')