If you have any issues, questions, or suggestions, feel free to reach out to me via e-mail at wieczynskipawel@gmail.com or on LinkedIn. You can also visit my GitHub.
This is an R replication of the code and exercises from the Udemy course “Credit Risk Modeling in Python 2022”.
if(!require('pacman')) install.packages('pacman')
pacman::p_load(dplyr, ggplot2, zoo)
options(scipen = 20)
In the previous blog post we selected variables for our PD model and derived dummy variables from them through fine classing and coarse classing. Let’s now load this dataset. It contains 466285 records and 323 columns.
data <- read.csv('pd_preprocessed_loan_data_2007_2014.csv')
data %>% dim()
[1] 466285 323
From the 22 original variables we derived 126 dummy variables. Now we select the columns containing the target variable (indicating good/bad loans) and the dummy variables. To avoid the dummy variable trap, for each original variable we had selected one dummy as a reference level, and we now exclude these as well. The resulting dataset for PD modeling has 105 attributes: 126 dummies - 22 reference dummies + 1 target variable.
list_of_dummy_variables <- read.csv('pd_dummies.csv') %>% unlist()
list_of_reference_categories <- read.csv('pd_dummies_reference.csv') %>% unlist()
dummies <- setdiff(list_of_dummy_variables
,list_of_reference_categories)
data_pd <- data %>%
select(good_bad, all_of(dummies))
data_pd %>% dim()
[1] 466285 105
We split our data into a training set and a test set in 80:20 proportions, using the same seed as in the data preprocessing step.
set.seed(2137)
n_obs <- nrow(data)
train_index <- sample(1:n_obs, 0.8 * n_obs)
Below we fit a logistic regression model on the training data.
log_model <- glm(good_bad ~ .
,family = binomial(link = 'logit')
,data = data_pd
,subset = train_index)
Let’s adjust the p-values from the multivariate model for multiple comparisons using the p.adjust() function (which applies the Holm correction by default). For any set of dummies derived from a single original variable, we will keep all of these dummies in the model if at least one of them is statistically significant at the 95% level.
p_values_multi <- log_model %>%
summary() %>%
coef() %>%
.[,4] %>%
p.adjust()
p_values_multi %>% round(5) %>% as.data.frame()
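In the post the assignment of dummies back to their original variables is done by inspection. Purely as an illustration (this helper is my own addition, not part of the course), one could also group the adjusted p-values programmatically, assuming each dummy name starts with the name of the variable it was derived from:
# smallest adjusted p-value among all dummies belonging to one original variable
min_p_by_variable <- function(variable, p_values) {
  group <- p_values[startsWith(names(p_values), variable)]
  data.frame(Variable = variable
             ,N_Dummies = length(group)
             ,Min_Adjusted_P = min(group))
}
# only a handful of the 22 original variable names are listed here
original_variables <- c('grade', 'delinq_2yrs', 'open_acc', 'pub_rec'
                        ,'total_acc', 'acc_now_delinq')
do.call(rbind, lapply(original_variables, min_p_by_variable, p_values = p_values_multi))
A variable whose smallest adjusted p-value exceeds 0.05 has no significant dummy left, so all of its dummies are dropped.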
As can be seen above, we will exclude from the model 14 dummies derived from the following attributes: delinq_2yrs, open_acc, pub_rec, total_acc, acc_now_delinq. So our final dataset for PD modeling has 91 attributes.
dummies_to_exclude <- c(
'delinq_2yrs_1_3', 'delinq_2yrs_3_more', 'open_acc_1_3', 'open_acc_4_12'
,'open_acc_13_17', 'open_acc_18_22', 'open_acc_23_25', 'open_acc_26_30'
,'open_acc_30_more', 'pub_rec_3_4', 'pub_rec_4_more', 'total_acc_28_51'
,'total_acc_51_more', 'acc_now_delinq_0_more'
)
data_pd <- data_pd %>%
select(-all_of(dummies_to_exclude))
data_pd %>% dim()
[1] 466285 91
Let’s fit the logistic regression model again. Example interpretations of the coefficients below are as follows:
- a customer with grade F has \(e^{0.20088} = 1.222478\) times greater odds of being a good borrower than someone with grade G (remember, grade G was the reference level, so it was excluded from the model to avoid the dummy variable trap);
- a customer with grade A has \(e^{1.05585-0.67172} = 1.468336\) times greater odds of being a good borrower than someone with grade C.
log_model_2 <- glm(good_bad ~ .
,family = binomial(link = 'logit')
,data = data_pd
,subset = train_index)
log_model_2 %>% summary()
Call:
glm(formula = good_bad ~ ., family = binomial(link = "logit"),
data = data_pd, subset = train_index)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.1563 0.2808 0.4044 0.5293 1.4638
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.87712 0.11992 15.653 < 2e-16 ***
grade_A 1.05585 0.09225 11.446 < 2e-16 ***
grade_B 0.86375 0.06089 14.185 < 2e-16 ***
grade_C 0.67172 0.05643 11.904 < 2e-16 ***
grade_D 0.49795 0.05358 9.293 < 2e-16 ***
grade_E 0.34032 0.04785 7.112 1.15e-12 ***
grade_F 0.20088 0.05002 4.016 5.92e-05 ***
home_ownership_OWN 0.08459 0.02008 4.213 2.52e-05 ***
home_ownership_MORTGAGE 0.12805 0.01266 10.118 < 2e-16 ***
addr_state_NM_VA 0.08335 0.03230 2.581 0.009860 **
addr_state_NY 0.07023 0.02417 2.906 0.003661 **
addr_state_OK_TN_MO_LA_MD_NC 0.07884 0.02317 3.403 0.000666 ***
addr_state_CA 0.10907 0.02123 5.139 2.77e-07 ***
addr_state_UT_KY_AZ_NJ 0.11012 0.02496 4.413 1.02e-05 ***
addr_state_AR_MI_PA_OH_MN 0.15059 0.02247 6.701 2.07e-11 ***
addr_state_RI_MA_DE_SD_IN 0.10353 0.02980 3.474 0.000512 ***
addr_state_GA_WA_OR 0.20845 0.02665 7.822 5.18e-15 ***
addr_state_WI_MT 0.26188 0.04766 5.494 3.92e-08 ***
addr_state_IL_CT 0.29024 0.02924 9.928 < 2e-16 ***
addr_state_KS_SC_CO_VT_AK_MS 0.34647 0.03089 11.216 < 2e-16 ***
addr_state_TX 0.25893 0.02603 9.946 < 2e-16 ***
addr_state_WV_NH_WY_DC_ME_ID 0.55527 0.05333 10.413 < 2e-16 ***
verification_status_Source.Verified -0.00687 0.01346 -0.510 0.609903
verification_status_Not.Verified 0.07954 0.01468 5.420 5.96e-08 ***
purpose_renewable_energy_moving_other_house_medical 0.43811 0.03785 11.574 < 2e-16 ***
purpose_wedding_vacation_debt_consolidation 0.44797 0.03438 13.029 < 2e-16 ***
purpose_major_purchase_home_improvement 0.49087 0.03914 12.540 < 2e-16 ***
purpose_car_credit_card 0.57124 0.03608 15.831 < 2e-16 ***
initial_list_status_w 0.04638 0.01309 3.542 0.000397 ***
term_36 0.06621 0.01414 4.682 2.84e-06 ***
emp_length_1_4 0.15055 0.01800 8.364 < 2e-16 ***
emp_length_5_6 0.11307 0.02129 5.311 1.09e-07 ***
emp_length_7_9 0.08581 0.02065 4.154 3.26e-05 ***
emp_length_10 0.15457 0.01827 8.460 < 2e-16 ***
months_since_issue_d_38_39 -0.21919 0.03289 -6.665 2.65e-11 ***
months_since_issue_d_40_41 -0.30859 0.03249 -9.497 < 2e-16 ***
months_since_issue_d_42_48 -0.52706 0.02903 -18.158 < 2e-16 ***
months_since_issue_d_49_52 -0.67152 0.03168 -21.196 < 2e-16 ***
months_since_issue_d_53_64 -0.92568 0.03017 -30.686 < 2e-16 ***
months_since_issue_d_65_84 -1.04836 0.05018 -20.891 < 2e-16 ***
months_since_issue_d_84_more -0.95443 0.05419 -17.614 < 2e-16 ***
int_rate_9.548_12.025 -0.38490 0.06658 -5.781 7.42e-09 ***
int_rate_12.025_15.74 -0.65186 0.06728 -9.688 < 2e-16 ***
int_rate_15.74_20.281 -0.86785 0.07116 -12.196 < 2e-16 ***
int_rate_20.281_more -1.01123 0.07821 -12.930 < 2e-16 ***
months_since_earliest_cr_line_141_164 0.08265 0.02517 3.283 0.001026 **
months_since_earliest_cr_line_165_247 0.04376 0.01996 2.192 0.028366 *
months_since_earliest_cr_line_248_270 0.08534 0.02625 3.251 0.001148 **
months_since_earliest_cr_line_271_352 0.11581 0.02340 4.948 7.49e-07 ***
months_since_earliest_cr_line_352_more 0.14985 0.02672 5.609 2.03e-08 ***
inq_last_6mths_1_2 -0.15521 0.01173 -13.234 < 2e-16 ***
inq_last_6mths_3_6 -0.35005 0.01898 -18.439 < 2e-16 ***
inq_last_6mths_6_more -0.67651 0.11604 -5.830 5.55e-09 ***
total_rev_hi_lim_less_5 0.09282 0.05040 1.842 0.065514 .
total_rev_hi_lim_5_10 0.15531 0.04192 3.705 0.000211 ***
total_rev_hi_lim_10_20 0.12709 0.03990 3.185 0.001445 **
total_rev_hi_lim_20_30 0.15241 0.04049 3.764 0.000167 ***
total_rev_hi_lim_30_40 0.14416 0.04201 3.431 0.000601 ***
total_rev_hi_lim_40_55 0.16852 0.04344 3.879 0.000105 ***
total_rev_hi_lim_55_95 0.18416 0.04614 3.991 6.58e-05 ***
total_rev_hi_lim_95_more 0.36069 0.06400 5.636 1.74e-08 ***
annual_inc_20_30 -0.25837 0.02361 -10.945 < 2e-16 ***
annual_inc_30_40 -0.19415 0.01843 -10.534 < 2e-16 ***
annual_inc_40_50 -0.14251 0.01739 -8.196 2.49e-16 ***
annual_inc_50_60 -0.08334 0.01728 -4.822 1.42e-06 ***
annual_inc_60_70 -0.05621 0.01832 -3.069 0.002148 **
annual_inc_70_80 0.01179 0.02023 0.583 0.559978
annual_inc_80_90 0.06953 0.02258 3.079 0.002077 **
annual_inc_90_100 0.08364 0.02495 3.352 0.000803 ***
annual_inc_100_120 0.11748 0.02495 4.708 2.50e-06 ***
annual_inc_120_140 0.17016 0.03176 5.358 8.41e-08 ***
annual_inc_140_more 0.20127 0.03228 6.236 4.50e-10 ***
mths_since_last_delinq_0_3 -0.08085 0.04577 -1.766 0.077325 .
mths_since_last_delinq_4_30 0.03855 0.01445 2.668 0.007622 **
mths_since_last_delinq_31_56 0.08398 0.01649 5.093 3.53e-07 ***
mths_since_last_delinq_56_more 0.04091 0.02012 2.033 0.042015 *
dti_1.4_3.5 0.09854 0.06663 1.479 0.139166
dti_3.5_7.7 0.11483 0.05678 2.022 0.043160 *
dti_7.7_10.5 0.03963 0.05640 0.703 0.482343
dti_10.5_16.1 -0.03567 0.05463 -0.653 0.513758
dti_16.1_20.3 -0.13296 0.05491 -2.421 0.015465 *
dti_20.3_21.7 -0.16208 0.05736 -2.826 0.004714 **
dti_21.7_22.4 -0.16767 0.06093 -2.752 0.005924 **
dti_22.4_35 -0.25394 0.05494 -4.622 3.80e-06 ***
dti_35_more -0.35436 0.08173 -4.336 1.45e-05 ***
mths_since_last_record_0_2 -0.38013 0.08445 -4.501 6.75e-06 ***
mths_since_last_record_3_20 0.08858 0.09008 0.983 0.325484
mths_since_last_record_21_31 0.13371 0.08035 1.664 0.096097 .
mths_since_last_record_32_80 0.20298 0.02618 7.752 9.02e-15 ***
mths_since_last_record_81_86 -0.14128 0.06258 -2.258 0.023973 *
mths_since_last_record_86_more -0.08358 0.02284 -3.658 0.000254 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 257347 on 373027 degrees of freedom
Residual deviance: 238469 on 372937 degrees of freedom
AIC: 238651
Number of Fisher Scoring iterations: 6
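As a quick sanity check of the interpretations above (my own addition), the odds ratios can be read directly off the fitted model:
exp(coef(log_model_2)['grade_F'])                                 # grade F vs reference grade G
exp(coef(log_model_2)['grade_A'] - coef(log_model_2)['grade_C'])  # grade A vs grade C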
Let’s predict the probabilities of being a good borrower on the validation dataset.
pred_prob <- predict.glm(log_model_2
,newdata = data_pd[-train_index,]
,type = 'response')
validation_table <- data.frame(
Actual_Class = data_pd$good_bad[-train_index]
,Predicted_Probability = pred_prob
)
validation_table %>% head()
Although there are libraries for calculating ROC curves and AUC, e.g. the pROC package, for educational purposes let’s calculate them numerically.
We take thresholds from a grid between 0 and 1 and, for each threshold, classify each loan as good or bad. Then we build a misclassification table and calculate the true positive rate (TPR) and false positive rate (FPR).
We calculate the AUC as a sum of trapezoids, as if we were numerically approximating an integral (the rollmean() function from the zoo package is useful for this). For our PD model the AUC is about 0.69, which is neither particularly bad nor particularly good.
grid_len <- 100
roc <- data.frame(
Threshold = seq(0, 1, length.out = grid_len)
,TPR = rep(NaN, grid_len)
,FPR = rep(NaN, grid_len)
)
for (i in 1:nrow(roc)) {
pred_class <- ifelse(pred_prob > roc$Threshold[i], 1, 0)
misclass_table <- table(
Actual = data_pd$good_bad[-train_index] %>% factor(levels = 1:0)
,Predicted = pred_class %>% factor(levels = 1:0)
)
roc$TPR[i] <- misclass_table[1,1] / (misclass_table[1,1] + misclass_table[1,2])
roc$FPR[i] <- misclass_table[2,1] / (misclass_table[2,1] + misclass_table[2,2])
}
id <- order(roc$FPR)
auc <- sum(diff(roc$FPR[id]) * rollmean(roc$TPR[id], 2))
ggplot(roc, aes(FPR, TPR)) +
geom_line(size = 1, col = 'blue') +
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1), size = 1, linetype = 'dashed') +
labs(title = paste0('PD Model - ROC of Validation Set\nAUC = ', round(auc, 4))) +
theme_bw()
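As a cross-check (again my own addition, using the pROC package mentioned earlier), pROC::auc() should return a value very close to the manual estimate; any small difference comes from the 100-point threshold grid used above:
pacman::p_load(pROC)
roc_obj <- pROC::roc(response = data_pd$good_bad[-train_index]
                     ,predictor = pred_prob)
pROC::auc(roc_obj)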
The Gini index is calculated as \(\text{Gini} = 2 \cdot \text{AUC} - 1\).
validation_table <- validation_table %>%
arrange(Predicted_Probability) %>%
mutate(Cumulative_Population = 1:nrow(validation_table)
,Cumulative_Good = cumsum(Actual_Class)
,Cumulative_Bad = Cumulative_Population - Cumulative_Good
,Cumulative_Perc_Population = Cumulative_Population / nrow(validation_table)
,Cumulative_Perc_Good = Cumulative_Good / sum(Actual_Class)
,Cumulative_Perc_Bad = Cumulative_Bad / (nrow(validation_table) - sum(Actual_Class))
)
gini <- 2 * auc - 1
ggplot(validation_table, aes(Cumulative_Perc_Population, Cumulative_Perc_Bad)) +
geom_line(size = 1, col = 'blue') +
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1), size = 1, linetype = 'dashed') +
labs(title = paste0('PD Model - Gini Curve of Validation Set\nGini Index = ', round(gini, 4))) +
theme_bw()
And finally, the Kolmogorov-Smirnov statistic is calculated as the maximal distance between the CDF of defaulted and the CDF of non-defaulted borrowers.
ks <- max(validation_table$Cumulative_Perc_Bad - validation_table$Cumulative_Perc_Good)
ggplot(validation_table) +
geom_line(aes(Predicted_Probability, Cumulative_Perc_Good), size = 1, col = 'blue') +
geom_line(aes(Predicted_Probability, Cumulative_Perc_Bad), size = 1, col = 'red') +
labs(title = paste0('PD Model - CDFs of Validation Set\nKolmogorov-Smirnov = ', round(ks, 4))) +
theme_bw() +
labs(y = 'CDF')
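For completeness (my own addition, not part of the course), the two-sample ks.test() from base R applied to the predicted probabilities of good and bad loans gives a comparable statistic; with many tied predictions it may warn that exact p-values cannot be computed, but the D statistic itself is still informative:
ks.test(validation_table$Predicted_Probability[validation_table$Actual_Class == 1]
        ,validation_table$Predicted_Probability[validation_table$Actual_Class == 0])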