The objective of this project is to construct a model capable of
precisely detecting loans with a high risk factor. This will enable the
implementation of suitable measures, such as raising the loan’s interest
rate or rejecting the loan application. These actions will assist the
bank in minimizing its overall risk and enhancing its
profitability.
EDA
First, let’s see the data.
## id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate
## 1 60516983 64537751 20000 20000 20000 36 months 12.29
## 2 60187139 64163931 11000 11000 11000 36 months 12.69
## 3 60356453 64333218 7000 7000 7000 36 months 9.99
## 4 59955769 63900496 10000 10000 10000 36 months 10.99
## 5 58703693 62544456 9550 9550 9550 36 months 19.99
## 6 57783762 61536512 24000 24000 24000 60 months 14.65
## 7 58010547 61814274 15000 15000 14975 60 months 10.99
## 8 58613025 62453761 24650 24650 24600 60 months 17.57
## 9 57662310 61415026 12000 12000 12000 36 months 12.69
## 10 58470862 62289594 15000 15000 14850 60 months 18.25
## installment grade sub_grade emp_title emp_length
## 1 667.06 C C1 Accounting Clerk 1 year
## 2 369 C C2 Accounts Payable Lead 7 years
## 3 225.84 B B3 Nurse 6 years
## 4 327.34 B B4 Service Manager 10+ years
## 5 354.87 E E4 <NA> <NA>
## 6 566.56 C C5 Owner 10+ years
## 7 326.07 B B4 ISO 10+ years
## 8 620.2 D D4 IP and Research Secreatry 8 years
## 9 402.54 C C2 staff nurse 9 years
## 10 382.95 E E1 PA 8 years
## home_ownership annual_inc verification_status issue_d loan_status pymnt_plan
## 1 OWN 65000 Source Verified Sep-15 Charged Off n
## 2 MORTGAGE 40000 Source Verified Sep-15 Charged Off n
## 3 MORTGAGE 32000 Source Verified Sep-15 Charged Off n
## 4 MORTGAGE 48000 Source Verified Sep-15 Charged Off n
## 5 RENT 32376 Verified Sep-15 Charged Off n
## 6 MORTGAGE 70000 Not Verified Aug-15 Charged Off n
## 7 RENT 74800 Not Verified Aug-15 Charged Off n
## 8 MORTGAGE 56048 Verified Aug-15 Charged Off n
## 9 RENT 120000 Not Verified Aug-15 Charged Off n
## 10 MORTGAGE 150000 Source Verified Aug-15 Charged Off n
## url desc
## 1 https://www.lendingclub.com/browse/loanDetail.action?loan_id=60516983 <NA>
## 2 https://www.lendingclub.com/browse/loanDetail.action?loan_id=60187139 <NA>
## 3 https://www.lendingclub.com/browse/loanDetail.action?loan_id=60356453 <NA>
## 4 https://www.lendingclub.com/browse/loanDetail.action?loan_id=59955769 <NA>
## 5 https://www.lendingclub.com/browse/loanDetail.action?loan_id=58703693 <NA>
## 6 https://www.lendingclub.com/browse/loanDetail.action?loan_id=57783762 <NA>
## 7 https://www.lendingclub.com/browse/loanDetail.action?loan_id=58010547 <NA>
## 8 https://www.lendingclub.com/browse/loanDetail.action?loan_id=58613025 <NA>
## 9 https://www.lendingclub.com/browse/loanDetail.action?loan_id=57662310 <NA>
## 10 https://www.lendingclub.com/browse/loanDetail.action?loan_id=58470862 <NA>
## purpose title zip_code addr_state dti
## 1 debt_consolidation Debt consolidation 542xx WI 20.72
## 2 debt_consolidation Debt consolidation 235xx VA 24.57
## 3 debt_consolidation Debt consolidation 350xx AL 32.41
## 4 credit_card Credit card refinancing 483xx MI 30.98
## 5 debt_consolidation Debt consolidation 546xx WI 32.54
## 6 debt_consolidation Debt consolidation 703xx LA 6.96
## 7 credit_card Credit card refinancing 700xx LA 15.63
## 8 debt_consolidation Debt consolidation 238xx VA 27.26
## 9 major_purchase Major purchase 601xx IL 22.74
## 10 debt_consolidation Debt consolidation 913xx CA 28.26
## delinq_2yrs earliest_cr_line inq_last_6mths mths_since_last_delinq
## 1 0 Sep-00 1 NA
## 2 0 Sep-02 0 36
## 3 0 Feb-06 1 NA
## 4 0 Oct-99 2 NA
## 5 0 Nov-99 3 69
## 6 0 May-98 0 65
## 7 0 Feb-84 2 NA
## 8 0 Mar-09 0 NA
## 9 0 Jun-04 2 33
## 10 0 Jul-00 0 24
## mths_since_last_record open_acc pub_rec revol_bal revol_util total_acc
## 1 NA 25 0 31578 77 42
## 2 80 13 1 5084 38.8 41
## 3 NA 18 0 12070 74 36
## 4 NA 18 0 22950 66 41
## 5 NA 9 0 4172 29.6 26
## 6 NA 8 0 8256 49.4 19
## 7 NA 12 0 4409 6.3 28
## 8 NA 16 0 9638 69.8 23
## 9 NA 20 0 22108 54.4 56
## 10 NA 18 0 35052 91.5 36
## initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv
## 1 w 0 0 0 0
## 2 w 0 0 10043.49 10043.49
## 3 f 0 0 221.96 221.96
## 4 f 0 0 315.13 315.13
## 5 w 0 0 333.66 333.66
## 6 w 0 0 547.03 547.03
## 7 f 0 0 307.75 307.24
## 8 w 0 0 1192.28 1189.86
## 9 w 0 0 796.62 796.62
## 10 f 0 0 0 0
## total_rec_prncp total_rec_int total_rec_late_fee recoveries
## 1 0 0 0 0
## 2 9942.67 100.81 0 0
## 3 167.56 54.4 0 0
## 4 235.76 79.37 0 0
## 5 195.78 137.88 0 0
## 6 273.56 273.47 0 0
## 7 188.69 119.06 0 0
## 8 522.36 669.92 0 0
## 9 554.19 242.43 0 0
## 10 0 0 0 0
## collection_recovery_fee last_pymnt_d last_pymnt_amnt next_pymnt_d
## 1 0 <NA> 0 <NA>
## 2 0 Oct-15 10059 <NA>
## 3 0 Oct-15 225.84 <NA>
## 4 0 Oct-15 327.34 <NA>
## 5 0 Oct-15 354.87 <NA>
## 6 0 Oct-15 566.56 <NA>
## 7 0 Oct-15 326.07 <NA>
## 8 0 Oct-15 620.2 <NA>
## 9 0 Oct-15 402.54 <NA>
## 10 0 <NA> 0 <NA>
## last_credit_pull_d collections_12_mths_ex_med mths_since_last_major_derog
## 1 Jan-16 0 NA
## 2 Jan-16 0 79
## 3 Jan-16 0 NA
## 4 Jan-16 0 NA
## 5 Jan-16 0 69
## 6 Jan-16 0 65
## 7 Jan-16 0 NA
## 8 Jan-16 0 NA
## 9 Jan-16 0 33
## 10 Jan-16 0 24
## policy_code application_type annual_inc_joint dti_joint
## 1 1 INDIVIDUAL <NA> <NA>
## 2 1 INDIVIDUAL <NA> <NA>
## 3 1 INDIVIDUAL <NA> <NA>
## 4 1 INDIVIDUAL <NA> <NA>
## 5 1 INDIVIDUAL <NA> <NA>
## 6 1 INDIVIDUAL <NA> <NA>
## 7 1 INDIVIDUAL <NA> <NA>
## 8 1 INDIVIDUAL <NA> <NA>
## 9 1 INDIVIDUAL <NA> <NA>
## 10 1 INDIVIDUAL <NA> <NA>
## verification_status_joint acc_now_delinq tot_coll_amt tot_cur_bal
## 1 <NA> 0 0 52303
## 2 <NA> 0 332 175731
## 3 <NA> 0 0 202012
## 4 <NA> 0 0 108235
## 5 <NA> 0 0 45492
## 6 <NA> 0 0 126165
## 7 <NA> 0 0 264173
## 8 <NA> 0 0 191935
## 9 <NA> 0 0 71745
## 10 <NA> 0 0 497387
## open_acc_6m open_il_6m open_il_12m open_il_24m mths_since_rcnt_il
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## 7 NA NA NA NA NA
## 8 NA NA NA NA NA
## 9 NA NA NA NA NA
## 10 NA NA NA NA NA
## total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util
## 1 NA <NA> NA NA NA <NA>
## 2 NA <NA> NA NA NA <NA>
## 3 NA <NA> NA NA NA <NA>
## 4 NA <NA> NA NA NA <NA>
## 5 NA <NA> NA NA NA <NA>
## 6 NA <NA> NA NA NA <NA>
## 7 NA <NA> NA NA NA <NA>
## 8 NA <NA> NA NA NA <NA>
## 9 NA <NA> NA NA NA <NA>
## 10 NA <NA> NA NA NA <NA>
## total_rev_hi_lim inq_fi total_cu_tl inq_last_12m
## 1 41000 NA NA NA
## 2 13100 NA NA NA
## 3 16300 NA NA NA
## 4 34750 NA NA NA
## 5 14100 NA NA NA
## 6 16700 NA NA NA
## 7 69500 NA NA NA
## 8 13800 NA NA NA
## 9 40613 NA NA NA
## 10 38300 NA NA NA
str(raw_data)
## 'data.frame': 421094 obs. of 74 variables:
## $ id : int 60516983 60187139 60356453 59955769 58703693 57783762 58010547 58613025 57662310 58470862 ...
## $ member_id : int 64537751 64163931 64333218 63900496 62544456 61536512 61814274 62453761 61415026 62289594 ...
## $ loan_amnt : int 20000 11000 7000 10000 9550 24000 15000 24650 12000 15000 ...
## $ funded_amnt : int 20000 11000 7000 10000 9550 24000 15000 24650 12000 15000 ...
## $ funded_amnt_inv : int 20000 11000 7000 10000 9550 24000 14975 24600 12000 14850 ...
## $ term : chr " 36 months" " 36 months" " 36 months" " 36 months" ...
## $ int_rate : chr "12.29" "12.69" "9.99" "10.99" ...
## $ installment : chr "667.06" "369" "225.84" "327.34" ...
## $ grade : chr "C" "C" "B" "B" ...
## $ sub_grade : chr "C1" "C2" "B3" "B4" ...
## $ emp_title : chr "Accounting Clerk" "Accounts Payable Lead" "Nurse" "Service Manager" ...
## $ emp_length : chr "1 year" "7 years" "6 years" "10+ years" ...
## $ home_ownership : chr "OWN" "MORTGAGE" "MORTGAGE" "MORTGAGE" ...
## $ annual_inc : chr "65000" "40000" "32000" "48000" ...
## $ verification_status : chr "Source Verified" "Source Verified" "Source Verified" "Source Verified" ...
## $ issue_d : chr "Sep-15" "Sep-15" "Sep-15" "Sep-15" ...
## $ loan_status : chr "Charged Off" "Charged Off" "Charged Off" "Charged Off" ...
## $ pymnt_plan : chr "n" "n" "n" "n" ...
## $ url : chr "https://www.lendingclub.com/browse/loanDetail.action?loan_id=60516983" "https://www.lendingclub.com/browse/loanDetail.action?loan_id=60187139" "https://www.lendingclub.com/browse/loanDetail.action?loan_id=60356453" "https://www.lendingclub.com/browse/loanDetail.action?loan_id=59955769" ...
## $ desc : chr NA NA NA NA ...
## $ purpose : chr "debt_consolidation" "debt_consolidation" "debt_consolidation" "credit_card" ...
## $ title : chr "Debt consolidation" "Debt consolidation" "Debt consolidation" "Credit card refinancing" ...
## $ zip_code : chr "542xx" "235xx" "350xx" "483xx" ...
## $ addr_state : chr "WI" "VA" "AL" "MI" ...
## $ dti : chr "20.72" "24.57" "32.41" "30.98" ...
## $ delinq_2yrs : int 0 0 0 0 0 0 0 0 0 0 ...
## $ earliest_cr_line : chr "Sep-00" "Sep-02" "Feb-06" "Oct-99" ...
## $ inq_last_6mths : int 1 0 1 2 3 0 2 0 2 0 ...
## $ mths_since_last_delinq : int NA 36 NA NA 69 65 NA NA 33 24 ...
## $ mths_since_last_record : int NA 80 NA NA NA NA NA NA NA NA ...
## $ open_acc : int 25 13 18 18 9 8 12 16 20 18 ...
## $ pub_rec : int 0 1 0 0 0 0 0 0 0 0 ...
## $ revol_bal : int 31578 5084 12070 22950 4172 8256 4409 9638 22108 35052 ...
## $ revol_util : chr "77" "38.8" "74" "66" ...
## $ total_acc : int 42 41 36 41 26 19 28 23 56 36 ...
## $ initial_list_status : chr "w" "w" "f" "f" ...
## $ out_prncp : chr "0" "0" "0" "0" ...
## $ out_prncp_inv : chr "0" "0" "0" "0" ...
## $ total_pymnt : chr "0" "10043.49" "221.96" "315.13" ...
## $ total_pymnt_inv : chr "0" "10043.49" "221.96" "315.13" ...
## $ total_rec_prncp : chr "0" "9942.67" "167.56" "235.76" ...
## $ total_rec_int : chr "0" "100.81" "54.4" "79.37" ...
## $ total_rec_late_fee : chr "0" "0" "0" "0" ...
## $ recoveries : chr "0" "0" "0" "0" ...
## $ collection_recovery_fee : chr "0" "0" "0" "0" ...
## $ last_pymnt_d : chr NA "Oct-15" "Oct-15" "Oct-15" ...
## $ last_pymnt_amnt : chr "0" "10059" "225.84" "327.34" ...
## $ next_pymnt_d : chr NA NA NA NA ...
## $ last_credit_pull_d : chr "Jan-16" "Jan-16" "Jan-16" "Jan-16" ...
## $ collections_12_mths_ex_med : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mths_since_last_major_derog: int NA 79 NA NA 69 65 NA NA 33 24 ...
## $ policy_code : int 1 1 1 1 1 1 1 1 1 1 ...
## $ application_type : chr "INDIVIDUAL" "INDIVIDUAL" "INDIVIDUAL" "INDIVIDUAL" ...
## $ annual_inc_joint : chr NA NA NA NA ...
## $ dti_joint : chr NA NA NA NA ...
## $ verification_status_joint : chr NA NA NA NA ...
## $ acc_now_delinq : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tot_coll_amt : int 0 332 0 0 0 0 0 0 0 0 ...
## $ tot_cur_bal : int 52303 175731 202012 108235 45492 126165 264173 191935 71745 497387 ...
## $ open_acc_6m : int NA NA NA NA NA NA NA NA NA NA ...
## $ open_il_6m : int NA NA NA NA NA NA NA NA NA NA ...
## $ open_il_12m : int NA NA NA NA NA NA NA NA NA NA ...
## $ open_il_24m : int NA NA NA NA NA NA NA NA NA NA ...
## $ mths_since_rcnt_il : int NA NA NA NA NA NA NA NA NA NA ...
## $ total_bal_il : int NA NA NA NA NA NA NA NA NA NA ...
## $ il_util : chr NA NA NA NA ...
## $ open_rv_12m : int NA NA NA NA NA NA NA NA NA NA ...
## $ open_rv_24m : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_bal_bc : int NA NA NA NA NA NA NA NA NA NA ...
## $ all_util : chr NA NA NA NA ...
## $ total_rev_hi_lim : int 41000 13100 16300 34750 14100 16700 69500 13800 40613 38300 ...
## $ inq_fi : int NA NA NA NA NA NA NA NA NA NA ...
## $ total_cu_tl : int NA NA NA NA NA NA NA NA NA NA ...
## $ inq_last_12m : int NA NA NA NA NA NA NA NA NA NA ...
Then, check for NAs
# Check for NAs
# % of NAs
na <- lapply(raw_data,function(x) { length(which(is.na(x)))/length(x)})
# more than 80% of NAs
na_70 <- as.data.frame(na[na>0.7])
na_70
## desc mths_since_last_record mths_since_last_major_derog annual_inc_joint
## 1 0.9998931 0.8232817 0.7085473 0.9987865
## dti_joint verification_status_joint open_acc_6m open_il_6m open_il_12m
## 1 0.9987912 0.9987865 0.9492465 0.9492465 0.9492465
## open_il_24m mths_since_rcnt_il total_bal_il il_util open_rv_12m open_rv_24m
## 1 0.9492465 0.9505811 0.9492465 0.955789 0.9492465 0.9492465
## max_bal_bc all_util inq_fi total_cu_tl inq_last_12m
## 1 0.9492465 0.9492465 0.9492465 0.9492465 0.9492465
These are columns with more than 70% of NAs. I decided to drop
them.
# Drop columns that have > 70% of NAs
df_1 <- raw_data[,!colnames(raw_data) %in% colnames((na_70))]
# Re-explore the data
lapply(df_1,function(x) { length(which(is.na(x)))/length(x)})
## $id
## [1] 0
##
## $member_id
## [1] 0
##
## $loan_amnt
## [1] 0
##
## $funded_amnt
## [1] 0
##
## $funded_amnt_inv
## [1] 0
##
## $term
## [1] 0
##
## $int_rate
## [1] 0
##
## $installment
## [1] 0
##
## $grade
## [1] 0
##
## $sub_grade
## [1] 0
##
## $emp_title
## [1] 0.05669518
##
## $emp_length
## [1] 0.05655982
##
## $home_ownership
## [1] 0
##
## $annual_inc
## [1] 0
##
## $verification_status
## [1] 0
##
## $issue_d
## [1] 0
##
## $loan_status
## [1] 0
##
## $pymnt_plan
## [1] 0
##
## $url
## [1] 0
##
## $purpose
## [1] 0
##
## $title
## [1] 0.0003134692
##
## $zip_code
## [1] 0
##
## $addr_state
## [1] 0
##
## $dti
## [1] 0
##
## $delinq_2yrs
## [1] 0
##
## $earliest_cr_line
## [1] 0
##
## $inq_last_6mths
## [1] 0
##
## $mths_since_last_delinq
## [1] 0.4843598
##
## $open_acc
## [1] 0
##
## $pub_rec
## [1] 0
##
## $revol_bal
## [1] 0
##
## $revol_util
## [1] 0.0003847122
##
## $total_acc
## [1] 0
##
## $initial_list_status
## [1] 0
##
## $out_prncp
## [1] 0
##
## $out_prncp_inv
## [1] 0
##
## $total_pymnt
## [1] 0
##
## $total_pymnt_inv
## [1] 0
##
## $total_rec_prncp
## [1] 0
##
## $total_rec_int
## [1] 0
##
## $total_rec_late_fee
## [1] 0
##
## $recoveries
## [1] 0
##
## $collection_recovery_fee
## [1] 0
##
## $last_pymnt_d
## [1] 0.04104309
##
## $last_pymnt_amnt
## [1] 0
##
## $next_pymnt_d
## [1] 0.06116687
##
## $last_credit_pull_d
## [1] 2.612243e-05
##
## $collections_12_mths_ex_med
## [1] 0
##
## $policy_code
## [1] 0
##
## $application_type
## [1] 0
##
## $acc_now_delinq
## [1] 0
##
## $tot_coll_amt
## [1] 0
##
## $tot_cur_bal
## [1] 0
##
## $total_rev_hi_lim
## [1] 0
Feature engineering
I decided to include ‘Charged Off’, ‘Late 31-120 days’ and ‘Default’
as customers with loan status = 1, meaning risky- bad customers. Then, I
decided to set ‘Fully Paid’ as customers with loan status = 0, meaning
not risky- good customers. Finally, I dropped ‘Current’, ‘In Grace
Period’ , ‘Issued’, ‘Late 16- 30 days’ as they are neither good nor bad
customers.
# unique(df_2$loan_status)
# Replace the loan_status values by default (1), not in default (0)
df_3 <- df_2
df_3$loan_status[df_3$loan_status == 'Charged Off'] <- 1
df_3$loan_status[df_3$loan_status == 'Late (31-120 days)'] <- 1
df_3$loan_status[df_3$loan_status == 'Default'] <- 1
# df_3$loan_status[df_3$loan_status == 'Current'] <- 0
df_3$loan_status[df_3$loan_status == 'Fully Paid'] <- 0
# df_3$loan_status[df_3$loan_status == 'In Grace Period'] <- 0
# df_3$loan_status[df_3$loan_status == 'Issued'] <- 0
# df_3$loan_status[df_3$loan_status == 'Late (16-30 days)'] <- 0
df_3 <- subset(df_3, loan_status != "Current")
df_3 <- subset(df_3, loan_status != "In Grace Period")
df_3 <- subset(df_3, loan_status != "Issued")
df_3 <- subset(df_3, loan_status != "Late (16-30 days)")
df_3$loan_status <- as.factor(as.character(df_3$loan_status))
# unique(df_3$loan_status)
Then again, I looked at columns with NAs. Revol_util and emp_length
had 21 and 9 missing values respectively. I decided to drop these rows,
as there were not many of them.
# % of NAs
na <- lapply(df_4,function(x) { length(which(is.na(x)))/length(x)})
# Columns with missing values
which(colSums(is.na(df_4))>0) # revol_util has 21 missing values, emp_length 9 missing values
## emp_length revol_util
## 9 20
# Drop rows with NAs
df_5 <- na.omit(df_4)
# Check
# which(colSums(is.na(df_5))>0) # 0 NAs
Correlation analysis
Here, I filtered pairs of variables that had correlation above
70%.
cor_matrix <- cor(select_if(df_5, is.numeric))
# Set the correlation threshold
threshold <- 0.7
# Print correlations above the threshold
for (i in 1:(ncol(cor_matrix) - 1)) {
for (j in (i + 1):ncol(cor_matrix)) {
correlation <- cor_matrix[i, j]
if (!is.na(correlation) && abs(correlation) > threshold) {
var1 <- colnames(cor_matrix)[i]
var2 <- colnames(cor_matrix)[j]
cat(var1, "and", var2, "have correlation", correlation, "\n")
}
}
}
## loan_amnt and funded_amnt have correlation 1
## loan_amnt and funded_amnt_inv have correlation 0.9999964
## loan_amnt and installment have correlation 0.950474
## loan_amnt and total_pymnt have correlation 0.7022142
## loan_amnt and total_pymnt_inv have correlation 0.702199
## funded_amnt and funded_amnt_inv have correlation 0.9999964
## funded_amnt and installment have correlation 0.950474
## funded_amnt and total_pymnt have correlation 0.7022142
## funded_amnt and total_pymnt_inv have correlation 0.702199
## funded_amnt_inv and installment have correlation 0.950474
## funded_amnt_inv and total_pymnt have correlation 0.7022695
## funded_amnt_inv and total_pymnt_inv have correlation 0.7022589
## revol_bal and total_rev_hi_lim have correlation 0.8276475
## out_prncp and out_prncp_inv have correlation 1
## total_pymnt and total_pymnt_inv have correlation 0.999998
## total_pymnt and total_rec_prncp have correlation 0.9957575
## total_pymnt and last_pymnt_amnt have correlation 0.9015749
## total_pymnt_inv and total_rec_prncp have correlation 0.9957477
## total_pymnt_inv and last_pymnt_amnt have correlation 0.9015471
## total_rec_prncp and last_pymnt_amnt have correlation 0.9083865
Distribution for dependent variable
The loan status distribution is a little unbalanced, but I decided
not to perform any form of undersampling.
# Distribution for loan status (dependent variable)
barplot(prop.table(table(df_5$loan_status)), ylim = c(0,1), xlab = "Loan Status", main = "Loan status distribution", col = c("lightblue", "lightpink"), xaxt = "n")
axis(side = 1, at = c(0.7, 1.9), labels = c("0- Not in default", "1- Default"))

# Imbalanced data
# Perform random undersampling
# df_8 <- downSample(x = subset(df_7, select = -c(loan_status)), y = as.factor(df_7$loan_status), yname = "loan_status")
# df_8$loan_status <- as.factor(as.character(df_8$loan_status))
# # Distribution for loan status (dependent variable) after random undersampling
# barplot(prop.table(table(df_8$loan_status)), ylim = c(0,1), xlab = "Loan Status", main = "Loan status distribution", col = c("lightblue", "lightpink"), xaxt = "n")
# axis(side = 1, at = c(0.7, 1.9), labels = c("0- Not in default", "1- Default"))
#
# df <- df_8
df_8 <-df_7
# Feature selection for numerical variables
# Separate the features and the target variable
#non_char <- df_8[,sapply(df_8, function(x) !is.factor(x))]
# features <- non_char
# target <- df_8$loan_status
#
# # Perform correlation-based feature selection
# correlation_filter <- nearZeroVar(select_if(features, is.numeric), saveMetrics = TRUE) # Identify near-zero variance features
# correlation_matrix <- cor(select_if(features, is.numeric)) # Calculate the correlation matrix
# highly_correlated <- findCorrelation(correlation_matrix, cutoff = 0.7) # Identify highly correlated features
#
# # Combine the indices of features to remove
# features_to_remove <- c(correlation_filter$nzv, highly_correlated)
#
# # Remove the selected features
# filtered_features <- features[, -features_to_remove]
#
# # Print the selected features
# print(filtered_features)
Categorical feature selection
Dummy variables
Then, I created dummy variables for chosen categorical
features.
# create dummy variables for factor features
library(fastDummies)
df_9 <- dummy_cols(df_9,select_columns = "term",
remove_most_frequent_dummy = TRUE,
remove_selected_columns = TRUE)
df_9 <- dummy_cols(df_9,select_columns = "emp_length",
remove_most_frequent_dummy = TRUE,
remove_selected_columns = TRUE)
df_9 <- dummy_cols(df_9,select_columns = "home_ownership",
remove_most_frequent_dummy = TRUE,
remove_selected_columns = TRUE)
df_9 <- dummy_cols(df_9,select_columns = "grade",
remove_most_frequent_dummy = TRUE,
remove_selected_columns = TRUE)
df <- df_9
df$loan_status <- as.numeric(as.character(df$loan_status))
str(df_9)
## 'data.frame': 29252 obs. of 29 variables:
## $ annual_inc : num 65000 40000 32000 48000 70000 ...
## $ loan_status : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ dti : num 20.72 24.57 32.41 30.98 6.96 ...
## $ pub_rec : int 0 1 0 0 0 0 0 0 0 0 ...
## $ revol_bal : num 31578 5084 12070 22950 8256 ...
## $ revol_util : num 77 38.8 74 66 49.4 6.3 69.8 54.4 91.5 31.2 ...
## $ total_acc : int 42 41 36 41 19 28 23 56 36 25 ...
## $ total_pymnt : num 0 10043 222 315 547 ...
## $ total_rec_int : num 0 100.8 54.4 79.4 273.5 ...
## $ tot_cur_bal : num 52303 175731 202012 108235 126165 ...
## $ term_ 60 : int 0 0 0 0 1 1 1 0 1 1 ...
## $ emp_length_< 1 year: int 0 0 0 0 0 0 0 0 0 0 ...
## $ emp_length_1 year : int 1 0 0 0 0 0 0 0 0 0 ...
## $ emp_length_2 years : int 0 0 0 0 0 0 0 0 0 0 ...
## $ emp_length_3 years : int 0 0 0 0 0 0 0 0 0 0 ...
## $ emp_length_4 years : int 0 0 0 0 0 0 0 0 0 0 ...
## $ emp_length_5 years : int 0 0 0 0 0 0 0 0 0 0 ...
## $ emp_length_6 years : int 0 0 1 0 0 0 0 0 0 0 ...
## $ emp_length_7 years : int 0 1 0 0 0 0 0 0 0 0 ...
## $ emp_length_8 years : int 0 0 0 0 0 0 1 0 1 1 ...
## $ emp_length_9 years : int 0 0 0 0 0 0 0 1 0 0 ...
## $ home_ownership_OWN : int 1 0 0 0 0 0 0 0 0 0 ...
## $ home_ownership_RENT: int 0 0 0 0 0 1 0 1 0 0 ...
## $ grade_1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ grade_2 : int 0 0 1 1 0 1 0 0 0 0 ...
## $ grade_4 : int 0 0 0 0 0 0 1 0 0 1 ...
## $ grade_5 : int 0 0 0 0 0 0 0 0 1 0 ...
## $ grade_6 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ grade_7 : int 0 0 0 0 0 0 0 0 0 0 ...
df <- df_9
Fine classing
# FINE CLASSING
percentile <- apply(X=select_if(df,is.numeric), MARGIN=2, FUN=function(x) round(quantile(x, seq(0.1,1,0.1), na.rm=TRUE),2))
# Unique values per column
unique <- apply(select_if(df,is.numeric), MARGIN=2, function(x) length(unique(x)))
# Selecting only columns with more than 10 unique levels
df.n <- df[which(unique>10)]
# <10 levels
df.f <- df[which(unique<10 & unique>1 )]
# woebin function in package scorecard
fine_class <- woebin(df,
y = "loan_status",
x = colnames(df.n),
positive = 1, # loan status is in default
method = "freq", # frequency method
bin_num_limit = 20) # the max number of fine bins
## ✔ Binning on 29252 rows and 10 columns in 00:00:10
fine_class
## $annual_inc
## variable bin count count_distr neg pos posprob
## 1: annual_inc [-Inf,40000) 4026 0.13763161 2741 1285 0.3191754
## 2: annual_inc [40000,44000) 1805 0.06170518 1287 518 0.2869806
## 3: annual_inc [44000,50000) 2184 0.07466156 1608 576 0.2637363
## 4: annual_inc [50000,58000) 3536 0.12088062 2648 888 0.2511312
## 5: annual_inc [58000,65000) 2459 0.08406263 1816 643 0.2614884
## 6: annual_inc [65000,70000) 1709 0.05842336 1310 399 0.2334699
## 7: annual_inc [70000,75000) 1657 0.05664570 1275 382 0.2305371
## 8: annual_inc [75000,85000) 2777 0.09493368 2081 696 0.2506302
## 9: annual_inc [85000,91860) 1785 0.06102147 1392 393 0.2201681
## 10: annual_inc [91860,110000) 2678 0.09154930 2100 578 0.2158327
## 11: annual_inc [110000,125000) 1487 0.05083413 1191 296 0.1990585
## 12: annual_inc [125000,156000) 1670 0.05709011 1330 340 0.2035928
## 13: annual_inc [156000, Inf) 1479 0.05056065 1171 308 0.2082488
## woe bin_iv total_iv breaks is_special_values
## 1: 0.343054757 1.753163e-02 0.03670724 40000 FALSE
## 2: 0.190524891 2.344640e-03 0.03670724 44000 FALSE
## 3: 0.073980067 4.161449e-04 0.03670724 50000 FALSE
## 4: 0.008030682 7.811480e-06 0.03670724 58000 FALSE
## 5: 0.062372021 3.321049e-04 0.03670724 65000 FALSE
## 6: -0.088202143 4.444091e-04 0.03670724 70000 FALSE
## 7: -0.104661993 6.041222e-04 0.03670724 75000 FALSE
## 8: 0.005364690 2.735850e-06 0.03670724 85000 FALSE
## 9: -0.164068373 1.574398e-03 0.03670724 91860 FALSE
## 10: -0.189499899 3.129702e-03 0.03670724 110000 FALSE
## 11: -0.291570259 4.001173e-03 0.03670724 125000 FALSE
## 12: -0.263369747 3.694991e-03 0.03670724 156000 FALSE
## 13: -0.234894724 2.623374e-03 0.03670724 Inf FALSE
##
## $dti
## variable bin count count_distr neg pos posprob woe
## 1: dti [-Inf,7.41) 2916 0.09968549 2424 492 0.1687243 -0.49407677
## 2: dti [7.41,10.62) 2927 0.10006153 2373 554 0.1892723 -0.35412671
## 3: dti [10.62,11.96) 1470 0.05025297 1192 278 0.1891156 -0.35514788
## 4: dti [11.96,14.45) 2907 0.09937782 2329 578 0.1988304 -0.29300154
## 5: dti [14.45,15.63) 1473 0.05035553 1164 309 0.2097760 -0.22565750
## 6: dti [15.63,17.94) 2920 0.09982223 2265 655 0.2243151 -0.14007595
## 7: dti [17.94,19.15) 1471 0.05028716 1156 315 0.2141400 -0.19952955
## 8: dti [19.15,20.38) 1466 0.05011623 1130 336 0.2291951 -0.11224290
## 9: dti [20.38,23.08) 2916 0.09968549 2153 763 0.2616598 0.06325939
## 10: dti [23.08,26.37) 2934 0.10030083 2093 841 0.2866394 0.18885679
## 11: dti [26.37,30.76) 2919 0.09978805 1944 975 0.3340185 0.41055334
## 12: dti [30.76,33.83) 1469 0.05021879 905 564 0.3839346 0.62773816
## 13: dti [33.83, Inf) 1464 0.05004786 822 642 0.4385246 0.85346676
## bin_iv total_iv breaks is_special_values
## 1: 0.0212719822 0.1414588 7.41 FALSE
## 2: 0.0114169442 0.1414588 10.62 FALSE
## 3: 0.0057652853 0.1414588 11.96 FALSE
## 4: 0.0078959193 0.1414588 14.45 FALSE
## 5: 0.0024173422 0.1414588 15.63 FALSE
## 6: 0.0018892903 0.1414588 17.94 FALSE
## 7: 0.0019007748 0.1414588 19.15 FALSE
## 8: 0.0006135025 0.1414588 20.38 FALSE
## 9: 0.0004051991 0.1414588 23.08 FALSE
## 10: 0.0037433021 0.1414588 26.37 FALSE
## 11: 0.0184585363 0.1414588 30.76 FALSE
## 12: 0.0226042513 0.1414588 33.83 FALSE
## 13: 0.0430764906 0.1414588 Inf FALSE
##
## $pub_rec
## variable bin count count_distr neg pos posprob woe
## 1: pub_rec [-Inf,1) 23782 0.8130042 17809 5973 0.2511563 0.008164223
## 2: pub_rec [1, Inf) 5470 0.1869958 4141 1329 0.2429616 -0.035891669
## bin_iv total_iv breaks is_special_values
## 1: 5.430111e-05 0.0002930204 1 FALSE
## 2: 2.387193e-04 0.0002930204 Inf FALSE
##
## $revol_bal
## variable bin count count_distr neg pos posprob woe
## 1: revol_bal [-Inf,2699) 2925 0.09999316 2228 697 0.2382906 -0.061455334
## 2: revol_bal [2699,4610) 2921 0.09985642 2240 681 0.2331393 -0.090049982
## 3: revol_bal [4610,5443) 1466 0.05011623 1113 353 0.2407913 -0.047727438
## 4: revol_bal [5443,7288) 2925 0.09999316 2200 725 0.2478632 -0.009422128
## 5: revol_bal [7288,8266) 1463 0.05001367 1129 334 0.2282980 -0.117327715
## 6: revol_bal [8266,9321) 1463 0.05001367 1079 384 0.2624744 0.067471444
## 7: revol_bal [9321,10510) 1463 0.05001367 1089 374 0.2556391 0.031859531
## 8: revol_bal [10510,13265) 2924 0.09995898 2170 754 0.2578659 0.043528778
## 9: revol_bal [13265,14930) 1463 0.05001367 1066 397 0.2713602 0.112886532
## 10: revol_bal [14930,19445) 2924 0.09995898 2177 747 0.2554720 0.030980980
## 11: revol_bal [19445,22768) 1464 0.05004786 1067 397 0.2711749 0.111948886
## 12: revol_bal [22768,33024) 2925 0.09999316 2190 735 0.2512821 0.008832533
## 13: revol_bal [33024,44411) 1463 0.05001367 1070 393 0.2686261 0.099014541
## 14: revol_bal [44411, Inf) 1463 0.05001367 1132 331 0.2262474 -0.129004027
## bin_iv total_iv breaks is_special_values
## 1: 3.718119e-04 0.0051389 2699 FALSE
## 2: 7.913587e-04 0.0051389 4610 FALSE
## 3: 1.127909e-04 0.0051389 5443 FALSE
## 4: 8.856085e-06 0.0051389 7288 FALSE
## 5: 6.680859e-04 0.0051389 8266 FALSE
## 6: 2.315051e-04 0.0051389 9321 FALSE
## 7: 5.116921e-05 0.0051389 10510 FALSE
## 8: 1.914541e-04 0.0051389 13265 FALSE
## 9: 6.551647e-04 0.0051389 14930 FALSE
## 10: 9.668498e-05 0.0051389 19445 FALSE
## 11: 6.446227e-04 0.0051389 22768 FALSE
## 12: 7.818068e-06 0.0051389 33024 FALSE
## 13: 5.023719e-04 0.0051389 44411 FALSE
## 14: 8.052054e-04 0.0051389 Inf FALSE
##
## $revol_util
## variable bin count count_distr neg pos posprob woe
## 1: revol_util [-Inf,22.5) 4360 0.14904964 3640 720 0.1651376 -0.51986889
## 2: revol_util [22.5,31.7) 2940 0.10050595 2363 577 0.1962585 -0.30922615
## 3: revol_util [31.7,39.6) 2936 0.10036921 2300 636 0.2166213 -0.18484698
## 4: revol_util [39.6,43.5) 1464 0.05004786 1129 335 0.2288251 -0.11433818
## 5: revol_util [43.5,51) 2920 0.09982223 2243 677 0.2318493 -0.09727941
## 6: revol_util [51,54.6) 1468 0.05018460 1095 373 0.2540872 0.02368763
## 7: revol_util [54.6,62.2) 2894 0.09893341 2130 764 0.2639945 0.07530939
## 8: revol_util [62.2,70.4) 2944 0.10064269 2109 835 0.2836277 0.17408140
## 9: revol_util [70.4,74.9) 1475 0.05042390 1037 438 0.2969492 0.23875056
## 10: revol_util [74.9,85.7) 2904 0.09927526 2017 887 0.3054408 0.27909730
## 11: revol_util [85.7,92.5) 1482 0.05066320 982 500 0.3373819 0.42563565
## 12: revol_util [92.5, Inf) 1465 0.05008205 905 560 0.3822526 0.62062070
## bin_iv total_iv breaks is_special_values
## 1: 3.494991e-02 0.09581975 22.5 FALSE
## 2: 8.854478e-03 0.09581975 31.7 FALSE
## 3: 3.268866e-03 0.09581975 39.6 FALSE
## 4: 6.354045e-04 0.09581975 43.5 FALSE
## 5: 9.214756e-04 0.09581975 51 FALSE
## 6: 2.832545e-05 0.09581975 54.6 FALSE
## 7: 5.716091e-04 0.09581975 62.2 FALSE
## 8: 3.180507e-03 0.09581975 70.4 FALSE
## 9: 3.041642e-03 0.09581975 74.9 FALSE
## 10: 8.256510e-03 0.09581975 85.7 FALSE
## 11: 1.010304e-02 0.09581975 92.5 FALSE
## 12: 2.200799e-02 0.09581975 Inf FALSE
##
## $total_acc
## variable bin count count_distr neg pos posprob woe
## 1: total_acc [-Inf,14) 3865 0.13212772 2648 1217 0.3148771 0.32320303
## 2: total_acc [14,16) 1626 0.05558594 1168 458 0.2816728 0.16443988
## 3: total_acc [16,19) 2732 0.09339532 1982 750 0.2745242 0.12883035
## 4: total_acc [19,21) 1966 0.06720908 1466 500 0.2543235 0.02493407
## 5: total_acc [21,23) 1977 0.06758512 1478 499 0.2524026 0.01477985
## 6: total_acc [23,25) 2014 0.06884999 1499 515 0.2557100 0.03223226
## 7: total_acc [25,27) 1899 0.06491864 1402 497 0.2617167 0.06355381
## 8: total_acc [27,30) 2650 0.09059210 2019 631 0.2381132 -0.06243290
## 9: total_acc [30,32) 1543 0.05274853 1216 327 0.2119248 -0.21274304
## 10: total_acc [32,36) 2594 0.08867770 1990 604 0.2328450 -0.09169686
## 11: total_acc [36,39) 1578 0.05394503 1233 345 0.2186312 -0.17304223
## 12: total_acc [39,44) 1852 0.06331191 1490 362 0.1954644 -0.31426833
## 13: total_acc [44, Inf) 2956 0.10105292 2359 597 0.2019621 -0.27345711
## bin_iv total_iv breaks is_special_values
## 1: 1.487667e-02 0.03609894 14 FALSE
## 2: 1.563938e-03 0.03609894 16 FALSE
## 3: 1.599488e-03 0.03609894 19 FALSE
## 4: 4.204472e-05 0.03609894 21 FALSE
## 5: 1.481813e-05 0.03609894 23 FALSE
## 6: 7.210519e-05 0.03609894 25 FALSE
## 7: 2.663608e-04 0.03609894 27 FALSE
## 8: 3.475699e-04 0.03609894 30 FALSE
## 9: 2.258561e-03 0.03609894 32 FALSE
## 10: 7.283966e-04 0.03609894 36 FALSE
## 11: 1.544539e-03 0.03609894 39 FALSE
## 12: 5.753024e-03 0.03609894 44 FALSE
## 13: 7.031431e-03 0.03609894 Inf FALSE
##
## $total_pymnt
## variable bin count count_distr neg pos posprob
## 1: total_pymnt [-Inf,1401.27) 2925 0.09999316 213 2712 0.9271794872
## 2: total_pymnt [1401.27,2774.92) 2925 0.09999316 665 2260 0.7726495726
## 3: total_pymnt [2774.92,4831.76) 2925 0.09999316 1366 1559 0.5329914530
## 4: total_pymnt [4831.76,7025.24) 2925 0.09999316 2407 518 0.1770940171
## 5: total_pymnt [7025.24,8320.1) 1463 0.05001367 1336 127 0.0868079289
## 6: total_pymnt [8320.1,10791.81) 2925 0.09999316 2831 94 0.0321367521
## 7: total_pymnt [10791.81,14143.6) 2925 0.09999316 2905 20 0.0068376068
## 8: total_pymnt [14143.6,17579.08) 2925 0.09999316 2923 2 0.0006837607
## 9: total_pymnt [17579.08,20460.89) 1463 0.05001367 1462 1 0.0006835270
## 10: total_pymnt [20460.89,26637.24) 2925 0.09999316 2918 7 0.0023931624
## 11: total_pymnt [26637.24, Inf) 2926 0.10002735 2924 2 0.0006835270
## woe bin_iv total_iv breaks is_special_values
## 1: 3.6447683 1.31831716 5.701947 1401.27 FALSE
## 2: 2.3239519 0.64886624 5.701947 2774.92 FALSE
## 3: 1.2327767 0.18648312 5.701947 4831.76 FALSE
## 4: -0.4355423 0.01686370 5.701947 7025.24 FALSE
## 5: -1.2526294 0.05445569 5.701947 8320.1 FALSE
## 6: -2.3044716 0.26755321 5.701947 10791.81 FALSE
## 7: -3.8778375 0.50259592 5.701947 14143.6 FALSE
## 8: -6.1865997 0.82215202 5.701947 17579.08 FALSE
## 9: -6.1869418 0.41123967 5.701947 20460.89 FALSE
## 10: -4.9321247 0.65094111 5.701947 26637.24 FALSE
## 11: -6.1869418 0.82247934 5.701947 Inf FALSE
##
## $total_rec_int
## variable bin count count_distr neg pos posprob
## 1: total_rec_int [-Inf,69.72) 2925 0.09999316 2502 423 0.1446154
## 2: total_rec_int [69.72,161.12) 2925 0.09999316 2499 426 0.1456410
## 3: total_rec_int [161.12,262.44) 2925 0.09999316 2342 583 0.1993162
## 4: total_rec_int [262.44,317.01) 1463 0.05001367 1157 306 0.2091593
## 5: total_rec_int [317.01,442.73) 2924 0.09995898 2267 657 0.2246922
## 6: total_rec_int [442.73,511.74) 1464 0.05004786 1136 328 0.2240437
## 7: total_rec_int [511.74,683.88) 2925 0.09999316 2204 721 0.2464957
## 8: total_rec_int [683.88,917.3) 2925 0.09999316 2102 823 0.2813675
## 9: total_rec_int [917.3,1256.46) 2925 0.09999316 2025 900 0.3076923
## 10: total_rec_int [1256.46,1859.47) 2925 0.09999316 1914 1011 0.3456410
## 11: total_rec_int [1859.47,2486.56) 1463 0.05001367 870 593 0.4053315
## 12: total_rec_int [2486.56, Inf) 1463 0.05001367 932 531 0.3629528
## woe bin_iv total_iv breaks is_special_values
## 1: -0.67685466 3.794244e-02 0.1696958 69.72 FALSE
## 2: -0.66858773 3.711296e-02 0.1696958 161.12 FALSE
## 3: -0.28995450 7.786989e-03 0.1696958 262.44 FALSE
## 4: -0.22938177 2.478328e-03 0.1696958 317.01 FALSE
## 5: -0.13790978 1.834867e-03 0.1696958 442.73 FALSE
## 6: -0.14163613 9.680527e-04 0.1696958 511.74 FALSE
## 7: -0.01677118 2.800705e-05 0.1696958 683.88 FALSE
## 8: 0.16293051 2.760979e-03 0.1696958 917.3 FALSE
## 9: 0.28968864 8.979994e-03 0.1696958 1256.46 FALSE
## 10: 0.46236350 2.369938e-02 0.1696958 1859.47 FALSE
## 11: 0.71732004 2.982265e-02 0.1696958 2486.56 FALSE
## 12: 0.53804806 1.628115e-02 0.1696958 Inf FALSE
IV
# IV of the variables
iv <- map_df(fine_class, ~pluck(.x, 10, 1)) %>%
pivot_longer(everything(), names_to = "var", values_to = "iv")
iv
## # A tibble: 8 × 2
## var iv
## <chr> <dbl>
## 1 annual_inc 0.0367
## 2 dti 0.141
## 3 pub_rec 0.000293
## 4 revol_bal 0.00514
## 5 revol_util 0.0958
## 6 total_acc 0.0361
## 7 total_pymnt 5.70
## 8 total_rec_int 0.170
Remove variables with low predictive power (IV < 0.02)
# Remove variables with low predictive power (IV < 0.02)
df <- subset(df, select = -c(pub_rec, revol_bal))
df.n <- subset(df.n, select = -c(pub_rec, revol_bal))
fine_class_final <- woebin(df,
y = "loan_status",
x = colnames(df.n),
positive = 1, # loan status is in default
method = "freq", # frequency method
bin_num_limit = 20) # the max number of fine bins
## ✔ Binning on 29252 rows and 8 columns in 00:00:10
fine_class_final
## $annual_inc
## variable bin count count_distr neg pos posprob
## 1: annual_inc [-Inf,40000) 4026 0.13763161 2741 1285 0.3191754
## 2: annual_inc [40000,44000) 1805 0.06170518 1287 518 0.2869806
## 3: annual_inc [44000,50000) 2184 0.07466156 1608 576 0.2637363
## 4: annual_inc [50000,58000) 3536 0.12088062 2648 888 0.2511312
## 5: annual_inc [58000,65000) 2459 0.08406263 1816 643 0.2614884
## 6: annual_inc [65000,70000) 1709 0.05842336 1310 399 0.2334699
## 7: annual_inc [70000,75000) 1657 0.05664570 1275 382 0.2305371
## 8: annual_inc [75000,85000) 2777 0.09493368 2081 696 0.2506302
## 9: annual_inc [85000,91860) 1785 0.06102147 1392 393 0.2201681
## 10: annual_inc [91860,110000) 2678 0.09154930 2100 578 0.2158327
## 11: annual_inc [110000,125000) 1487 0.05083413 1191 296 0.1990585
## 12: annual_inc [125000,156000) 1670 0.05709011 1330 340 0.2035928
## 13: annual_inc [156000, Inf) 1479 0.05056065 1171 308 0.2082488
## woe bin_iv total_iv breaks is_special_values
## 1: 0.343054757 1.753163e-02 0.03670724 40000 FALSE
## 2: 0.190524891 2.344640e-03 0.03670724 44000 FALSE
## 3: 0.073980067 4.161449e-04 0.03670724 50000 FALSE
## 4: 0.008030682 7.811480e-06 0.03670724 58000 FALSE
## 5: 0.062372021 3.321049e-04 0.03670724 65000 FALSE
## 6: -0.088202143 4.444091e-04 0.03670724 70000 FALSE
## 7: -0.104661993 6.041222e-04 0.03670724 75000 FALSE
## 8: 0.005364690 2.735850e-06 0.03670724 85000 FALSE
## 9: -0.164068373 1.574398e-03 0.03670724 91860 FALSE
## 10: -0.189499899 3.129702e-03 0.03670724 110000 FALSE
## 11: -0.291570259 4.001173e-03 0.03670724 125000 FALSE
## 12: -0.263369747 3.694991e-03 0.03670724 156000 FALSE
## 13: -0.234894724 2.623374e-03 0.03670724 Inf FALSE
##
## $dti
## variable bin count count_distr neg pos posprob woe
## 1: dti [-Inf,7.41) 2916 0.09968549 2424 492 0.1687243 -0.49407677
## 2: dti [7.41,10.62) 2927 0.10006153 2373 554 0.1892723 -0.35412671
## 3: dti [10.62,11.96) 1470 0.05025297 1192 278 0.1891156 -0.35514788
## 4: dti [11.96,14.45) 2907 0.09937782 2329 578 0.1988304 -0.29300154
## 5: dti [14.45,15.63) 1473 0.05035553 1164 309 0.2097760 -0.22565750
## 6: dti [15.63,17.94) 2920 0.09982223 2265 655 0.2243151 -0.14007595
## 7: dti [17.94,19.15) 1471 0.05028716 1156 315 0.2141400 -0.19952955
## 8: dti [19.15,20.38) 1466 0.05011623 1130 336 0.2291951 -0.11224290
## 9: dti [20.38,23.08) 2916 0.09968549 2153 763 0.2616598 0.06325939
## 10: dti [23.08,26.37) 2934 0.10030083 2093 841 0.2866394 0.18885679
## 11: dti [26.37,30.76) 2919 0.09978805 1944 975 0.3340185 0.41055334
## 12: dti [30.76,33.83) 1469 0.05021879 905 564 0.3839346 0.62773816
## 13: dti [33.83, Inf) 1464 0.05004786 822 642 0.4385246 0.85346676
## bin_iv total_iv breaks is_special_values
## 1: 0.0212719822 0.1414588 7.41 FALSE
## 2: 0.0114169442 0.1414588 10.62 FALSE
## 3: 0.0057652853 0.1414588 11.96 FALSE
## 4: 0.0078959193 0.1414588 14.45 FALSE
## 5: 0.0024173422 0.1414588 15.63 FALSE
## 6: 0.0018892903 0.1414588 17.94 FALSE
## 7: 0.0019007748 0.1414588 19.15 FALSE
## 8: 0.0006135025 0.1414588 20.38 FALSE
## 9: 0.0004051991 0.1414588 23.08 FALSE
## 10: 0.0037433021 0.1414588 26.37 FALSE
## 11: 0.0184585363 0.1414588 30.76 FALSE
## 12: 0.0226042513 0.1414588 33.83 FALSE
## 13: 0.0430764906 0.1414588 Inf FALSE
##
## $revol_util
## variable bin count count_distr neg pos posprob woe
## 1: revol_util [-Inf,22.5) 4360 0.14904964 3640 720 0.1651376 -0.51986889
## 2: revol_util [22.5,31.7) 2940 0.10050595 2363 577 0.1962585 -0.30922615
## 3: revol_util [31.7,39.6) 2936 0.10036921 2300 636 0.2166213 -0.18484698
## 4: revol_util [39.6,43.5) 1464 0.05004786 1129 335 0.2288251 -0.11433818
## 5: revol_util [43.5,51) 2920 0.09982223 2243 677 0.2318493 -0.09727941
## 6: revol_util [51,54.6) 1468 0.05018460 1095 373 0.2540872 0.02368763
## 7: revol_util [54.6,62.2) 2894 0.09893341 2130 764 0.2639945 0.07530939
## 8: revol_util [62.2,70.4) 2944 0.10064269 2109 835 0.2836277 0.17408140
## 9: revol_util [70.4,74.9) 1475 0.05042390 1037 438 0.2969492 0.23875056
## 10: revol_util [74.9,85.7) 2904 0.09927526 2017 887 0.3054408 0.27909730
## 11: revol_util [85.7,92.5) 1482 0.05066320 982 500 0.3373819 0.42563565
## 12: revol_util [92.5, Inf) 1465 0.05008205 905 560 0.3822526 0.62062070
## bin_iv total_iv breaks is_special_values
## 1: 3.494991e-02 0.09581975 22.5 FALSE
## 2: 8.854478e-03 0.09581975 31.7 FALSE
## 3: 3.268866e-03 0.09581975 39.6 FALSE
## 4: 6.354045e-04 0.09581975 43.5 FALSE
## 5: 9.214756e-04 0.09581975 51 FALSE
## 6: 2.832545e-05 0.09581975 54.6 FALSE
## 7: 5.716091e-04 0.09581975 62.2 FALSE
## 8: 3.180507e-03 0.09581975 70.4 FALSE
## 9: 3.041642e-03 0.09581975 74.9 FALSE
## 10: 8.256510e-03 0.09581975 85.7 FALSE
## 11: 1.010304e-02 0.09581975 92.5 FALSE
## 12: 2.200799e-02 0.09581975 Inf FALSE
##
## $total_acc
## variable bin count count_distr neg pos posprob woe
## 1: total_acc [-Inf,14) 3865 0.13212772 2648 1217 0.3148771 0.32320303
## 2: total_acc [14,16) 1626 0.05558594 1168 458 0.2816728 0.16443988
## 3: total_acc [16,19) 2732 0.09339532 1982 750 0.2745242 0.12883035
## 4: total_acc [19,21) 1966 0.06720908 1466 500 0.2543235 0.02493407
## 5: total_acc [21,23) 1977 0.06758512 1478 499 0.2524026 0.01477985
## 6: total_acc [23,25) 2014 0.06884999 1499 515 0.2557100 0.03223226
## 7: total_acc [25,27) 1899 0.06491864 1402 497 0.2617167 0.06355381
## 8: total_acc [27,30) 2650 0.09059210 2019 631 0.2381132 -0.06243290
## 9: total_acc [30,32) 1543 0.05274853 1216 327 0.2119248 -0.21274304
## 10: total_acc [32,36) 2594 0.08867770 1990 604 0.2328450 -0.09169686
## 11: total_acc [36,39) 1578 0.05394503 1233 345 0.2186312 -0.17304223
## 12: total_acc [39,44) 1852 0.06331191 1490 362 0.1954644 -0.31426833
## 13: total_acc [44, Inf) 2956 0.10105292 2359 597 0.2019621 -0.27345711
## bin_iv total_iv breaks is_special_values
## 1: 1.487667e-02 0.03609894 14 FALSE
## 2: 1.563938e-03 0.03609894 16 FALSE
## 3: 1.599488e-03 0.03609894 19 FALSE
## 4: 4.204472e-05 0.03609894 21 FALSE
## 5: 1.481813e-05 0.03609894 23 FALSE
## 6: 7.210519e-05 0.03609894 25 FALSE
## 7: 2.663608e-04 0.03609894 27 FALSE
## 8: 3.475699e-04 0.03609894 30 FALSE
## 9: 2.258561e-03 0.03609894 32 FALSE
## 10: 7.283966e-04 0.03609894 36 FALSE
## 11: 1.544539e-03 0.03609894 39 FALSE
## 12: 5.753024e-03 0.03609894 44 FALSE
## 13: 7.031431e-03 0.03609894 Inf FALSE
##
## $total_pymnt
## variable bin count count_distr neg pos posprob
## 1: total_pymnt [-Inf,1401.27) 2925 0.09999316 213 2712 0.9271794872
## 2: total_pymnt [1401.27,2774.92) 2925 0.09999316 665 2260 0.7726495726
## 3: total_pymnt [2774.92,4831.76) 2925 0.09999316 1366 1559 0.5329914530
## 4: total_pymnt [4831.76,7025.24) 2925 0.09999316 2407 518 0.1770940171
## 5: total_pymnt [7025.24,8320.1) 1463 0.05001367 1336 127 0.0868079289
## 6: total_pymnt [8320.1,10791.81) 2925 0.09999316 2831 94 0.0321367521
## 7: total_pymnt [10791.81,14143.6) 2925 0.09999316 2905 20 0.0068376068
## 8: total_pymnt [14143.6,17579.08) 2925 0.09999316 2923 2 0.0006837607
## 9: total_pymnt [17579.08,20460.89) 1463 0.05001367 1462 1 0.0006835270
## 10: total_pymnt [20460.89,26637.24) 2925 0.09999316 2918 7 0.0023931624
## 11: total_pymnt [26637.24, Inf) 2926 0.10002735 2924 2 0.0006835270
## woe bin_iv total_iv breaks is_special_values
## 1: 3.6447683 1.31831716 5.701947 1401.27 FALSE
## 2: 2.3239519 0.64886624 5.701947 2774.92 FALSE
## 3: 1.2327767 0.18648312 5.701947 4831.76 FALSE
## 4: -0.4355423 0.01686370 5.701947 7025.24 FALSE
## 5: -1.2526294 0.05445569 5.701947 8320.1 FALSE
## 6: -2.3044716 0.26755321 5.701947 10791.81 FALSE
## 7: -3.8778375 0.50259592 5.701947 14143.6 FALSE
## 8: -6.1865997 0.82215202 5.701947 17579.08 FALSE
## 9: -6.1869418 0.41123967 5.701947 20460.89 FALSE
## 10: -4.9321247 0.65094111 5.701947 26637.24 FALSE
## 11: -6.1869418 0.82247934 5.701947 Inf FALSE
##
## $total_rec_int
## variable bin count count_distr neg pos posprob
## 1: total_rec_int [-Inf,69.72) 2925 0.09999316 2502 423 0.1446154
## 2: total_rec_int [69.72,161.12) 2925 0.09999316 2499 426 0.1456410
## 3: total_rec_int [161.12,262.44) 2925 0.09999316 2342 583 0.1993162
## 4: total_rec_int [262.44,317.01) 1463 0.05001367 1157 306 0.2091593
## 5: total_rec_int [317.01,442.73) 2924 0.09995898 2267 657 0.2246922
## 6: total_rec_int [442.73,511.74) 1464 0.05004786 1136 328 0.2240437
## 7: total_rec_int [511.74,683.88) 2925 0.09999316 2204 721 0.2464957
## 8: total_rec_int [683.88,917.3) 2925 0.09999316 2102 823 0.2813675
## 9: total_rec_int [917.3,1256.46) 2925 0.09999316 2025 900 0.3076923
## 10: total_rec_int [1256.46,1859.47) 2925 0.09999316 1914 1011 0.3456410
## 11: total_rec_int [1859.47,2486.56) 1463 0.05001367 870 593 0.4053315
## 12: total_rec_int [2486.56, Inf) 1463 0.05001367 932 531 0.3629528
## woe bin_iv total_iv breaks is_special_values
## 1: -0.67685466 3.794244e-02 0.1696958 69.72 FALSE
## 2: -0.66858773 3.711296e-02 0.1696958 161.12 FALSE
## 3: -0.28995450 7.786989e-03 0.1696958 262.44 FALSE
## 4: -0.22938177 2.478328e-03 0.1696958 317.01 FALSE
## 5: -0.13790978 1.834867e-03 0.1696958 442.73 FALSE
## 6: -0.14163613 9.680527e-04 0.1696958 511.74 FALSE
## 7: -0.01677118 2.800705e-05 0.1696958 683.88 FALSE
## 8: 0.16293051 2.760979e-03 0.1696958 917.3 FALSE
## 9: 0.28968864 8.979994e-03 0.1696958 1256.46 FALSE
## 10: 0.46236350 2.369938e-02 0.1696958 1859.47 FALSE
## 11: 0.71732004 2.982265e-02 0.1696958 2486.56 FALSE
## 12: 0.53804806 1.628115e-02 0.1696958 Inf FALSE
Coarse classing
# COARSE CLASSING
# Plots
plot <- woebin_plot(fine_class_final)
The positive probability of default is slightly decreasing as annual
income increases, which agrees with my intuition.
For annual_inc the coarse classing could be (1) < 44000 (2) 44000
- 85000 (3) > 85000
# annual_inc
plot[[1]]

The positive probability of default is slightly increasing as DTI
increases. A higher DTI (Debt to Income) indicates a higher level of
debt relative to income, which can be seen as a potential risk
factor.
For DTI the coarse classing could be (1) < 14 (2) 14-23 (3) >
23
# DTI
plot[[2]]

The positive probability of default is slightly increasing as
revol_util increases. A higher Revolving Utilization ratio indicates
that a borrower is using a larger portion of their available credit,
which can be seen as a potential risk factor.
For revol_bal the coarse classing could be (1) < 31 (2) 31-43 (3)
43-70 (4) > 70
# revol_util
plot[[3]]

The positive probability of default is slightly decreasing as
total_acc increases. The intuition behind why the probability of default
(PD) may decrease as the variable “total_acc” increases can be explained
by considering the relationship between creditworthiness and the total
number of accounts a customer has. In many cases, having a higher total
number of accounts can be an indicator of a customer’s creditworthiness
and financial stability. Having multiple accounts suggests that the
customer has a diverse credit portfolio, which can indicate responsible
credit management.
For total_acc the coarse classing could be (1) < 14 (2) 14-27 (3)
>27
# total acc
plot[[4]]

The positive probability of default drops rapidly and then reminds
around zero as total payment increases. It starts at 92.7%%, then drops
to 8.7% for total payments 1400 and 7025 respectively.
As customers make larger total payments, it indicates improved
financial stability and a stronger ability to meet their payment
obligations. This reduces the risk of default as customers demonstrate
their commitment to fulfilling their financial responsibilities.
For total_pymnt the coarse classing could be (1) < 1401 (2) 1401-
2775 (3) 2775- 4831 (4) 4831- 8320 (5) > 8320
# total_pymnt
plot[[5]]

The positive probability of default increases as total_rec_int
increases.
For total_rec_int the coarse classing could be (1) < 263 (2) 263-
683 (3) > 683
# total_rec_int
plot[[6]]

breaks_list <- list(annual_inc = c("44000", "85000"),
dti = c("14", "23"),
revol_util = c("31","43","70"),
total_acc = c("14", "27"),
total_pymnt = c("1401", "2775",
"4831","8320"),
total_rec_int = c("263", "683")
)
coarse_class <- woebin(df,
y = "loan_status",
x = colnames(df.n),
positive = 1,
method = "freq",
breaks_list = breaks_list) # from coarse classing results
## ✔ Binning on 29252 rows and 8 columns in 00:00:10
coarse_class
## $annual_inc
## variable bin count count_distr neg pos posprob woe
## 1: annual_inc [-Inf,44000) 5831 0.1993368 4028 1803 0.3092094 0.296800826
## 2: annual_inc [44000,85000) 14322 0.4896075 10738 3584 0.2502444 0.003309499
## 3: annual_inc [85000, Inf) 9099 0.3110557 7184 1915 0.2104627 -0.221519852
## bin_iv total_iv breaks is_special_values
## 1: 1.882034e-02 0.03323167 44000 FALSE
## 2: 5.367009e-06 0.03323167 85000 FALSE
## 3: 1.440596e-02 0.03323167 Inf FALSE
##
## $dti
## variable bin count count_distr neg pos posprob woe
## 1: dti [-Inf,14) 9630 0.3292083 7843 1787 0.1855659 -0.3784643
## 2: dti [14,23) 10759 0.3678039 8290 2469 0.2294823 -0.1106179
## 3: dti [23, Inf) 8863 0.3029878 5817 3046 0.3436760 0.4536634
## bin_iv total_iv breaks is_special_values
## 1: 0.042609255 0.1160021 14 FALSE
## 2: 0.004374938 0.1160021 23 FALSE
## 3: 0.069017906 0.1160021 Inf FALSE
##
## $revol_util
## variable bin count count_distr neg pos posprob woe
## 1: revol_util [-Inf,31) 7072 0.2417612 5821 1251 0.1768948 -0.43690998
## 2: revol_util [31,43) 4438 0.1517161 3459 979 0.2205949 -0.16158431
## 3: revol_util [43,70) 10253 0.3505059 7611 2642 0.2576807 0.04256049
## 4: revol_util [70, Inf) 7489 0.2560167 5059 2430 0.3244759 0.36734128
## bin_iv total_iv breaks is_special_values
## 1: 0.0410130442 0.0830356 31 FALSE
## 2: 0.0037992615 0.0830356 43 FALSE
## 3: 0.0006416455 0.0830356 70 FALSE
## 4: 0.0375816497 0.0830356 Inf FALSE
##
## $total_acc
## variable bin count count_distr neg pos posprob woe
## 1: total_acc [-Inf,14) 3865 0.1321277 2648 1217 0.3148771 0.32320303
## 2: total_acc [14,27) 12214 0.4175441 8995 3219 0.2635500 0.07302074
## 3: total_acc [27, Inf) 13173 0.4503282 10307 2866 0.2175662 -0.17928709
## bin_iv total_iv breaks is_special_values
## 1: 0.014876665 0.03096147 14 FALSE
## 2: 0.002266793 0.03096147 27 FALSE
## 3: 0.013818013 0.03096147 Inf FALSE
##
## $total_pymnt
## variable bin count count_distr neg pos posprob woe
## 1: total_pymnt [-Inf,1401) 2925 0.09999316 213 2712 0.927179487 3.6447683
## 2: total_pymnt [1401,2775) 2926 0.10002735 665 2261 0.772727273 2.3243943
## 3: total_pymnt [2775,4831) 2924 0.09995898 1366 1558 0.532831737 1.2321350
## 4: total_pymnt [4831,8320) 4388 0.15000684 3743 645 0.146991796 -0.6577735
## 5: total_pymnt [8320, Inf) 16089 0.55001367 15963 126 0.007831438 -3.7411281
## bin_iv total_iv breaks is_special_values
## 1: 1.31831716 4.864063 1401 FALSE
## 2: 0.64930808 4.864063 2775 FALSE
## 3: 0.18621732 4.864063 4831 FALSE
## 4: 0.05406369 4.864063 8320 FALSE
## 5: 2.65615674 4.864063 Inf FALSE
##
## $total_rec_int
## variable bin count count_distr neg pos posprob woe
## 1: total_rec_int [-Inf,263) 8796 0.3006974 7359 1437 0.1633697 -0.5327476
## 2: total_rec_int [263,683) 8745 0.2989539 6742 2003 0.2290452 -0.1130917
## 3: total_rec_int [683, Inf) 11711 0.4003487 7849 3862 0.3297754 0.3914179
## bin_iv total_iv breaks is_special_values
## 1: 0.073767726 0.1445362 263 FALSE
## 2: 0.003714408 0.1445362 683 FALSE
## 3: 0.067054103 0.1445362 Inf FALSE
And then I split the data into training (80%) and test (20%)
sets.
set.seed(123)
# Split the data into training 80% and test 20% sets
train_indices <- sample(nrow(df_woe), 0.8 * nrow(df)) # 80% for training
train <-df_woe[train_indices, ]
test <- df_woe[-train_indices, ]
I fitted logit model and then performed stepwise variable
selection.
And this is my final logit model:
final_model <- step_model_log
summary(final_model)
##
## Call:
## glm(formula = loan_status ~ `term_ 60 ` + `emp_length_< 1 year` +
## `emp_length_1 year` + `emp_length_2 years` + `emp_length_5 years` +
## `emp_length_7 years` + home_ownership_RENT + grade_1 + grade_4 +
## grade_5 + grade_6 + grade_7 + annual_inc_woe + dti_woe +
## revol_util_woe + total_acc_woe + total_pymnt_woe + total_rec_int_woe,
## family = binomial("logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9986 -0.1475 -0.0241 0.0090 5.0035
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58562 0.07969 -19.898 < 2e-16 ***
## `term_ 60 ` 0.76288 0.13197 5.781 7.44e-09 ***
## `emp_length_< 1 year` 0.23605 0.12894 1.831 0.06715 .
## `emp_length_1 year` 0.44347 0.14705 3.016 0.00256 **
## `emp_length_2 years` 0.20907 0.13231 1.580 0.11407
## `emp_length_5 years` 0.36849 0.16049 2.296 0.02168 *
## `emp_length_7 years` 0.40660 0.17548 2.317 0.02050 *
## home_ownership_RENT 0.17234 0.08049 2.141 0.03226 *
## grade_1 0.24372 0.15504 1.572 0.11597
## grade_4 0.26576 0.09720 2.734 0.00626 **
## grade_5 0.62466 0.13576 4.601 4.20e-06 ***
## grade_6 1.12863 0.21546 5.238 1.62e-07 ***
## grade_7 1.42825 0.34321 4.162 3.16e-05 ***
## annual_inc_woe -2.30175 0.23820 -9.663 < 2e-16 ***
## dti_woe 0.59709 0.12105 4.933 8.12e-07 ***
## revol_util_woe 0.32849 0.13613 2.413 0.01582 *
## total_acc_woe 0.50291 0.23148 2.173 0.02981 *
## total_pymnt_woe 1.82173 0.03110 58.579 < 2e-16 ***
## total_rec_int_woe 7.44755 0.17878 41.658 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26387.9 on 23400 degrees of freedom
## Residual deviance: 4892.1 on 23382 degrees of freedom
## AIC: 4930.1
##
## Number of Fisher Scoring iterations: 8
# predict on in sample
train$prob <- predict(final_model,train,type='response')
train$predict <- ifelse(train$prob > 0.5,1,0)
In sample accuracy:
# Accuracy calculation
accuracy <- mean(train$loan_status == train$predict)
accuracy
## [1] 0.972608
I predicted the probabilities on out of sample data.
# predict on out of sample
test$prob <- predict(final_model,test,type='response')
test$predict <- ifelse(test$prob > 0.5,1,0)
Validation
Out of sample accuracy:
# Accuracy calculation
accuracy <- mean(test$loan_status == test$predict)
accuracy
## [1] 0.9752179
Confusion matrix:
table(test$loan_status, test$predict)
##
## 0 1
## 0 4343 88
## 1 57 1363
AUC and ROC curve:
# Calculate AUC and ROC curve
roc_obj <- roc(test$predict, as.numeric(as.character(test$loan_status)))
auc <- auc(roc_obj)
roc_curve <- roc_obj$roc
# View AUC
print(paste("AUC:", auc))
## [1] "AUC: 0.963198812731032"
# Roc curve
g <- roc(loan_status ~ predict, data = test)
plot(g, col = 'purple', main = 'ROC curve', legacy.axes = TRUE, ylim = c(0, 1))

Gini
I calculated gini coefficient and the results confirm good
discrimination power of my model, meaning that it is more effective at
distinguishing between the positive and negative classes.
# Calculate Gini coefficient
gini <- 2 * auc - 1
# View Gini coefficient
print(paste("Gini Coefficient:", gini))
## [1] "Gini Coefficient: 0.926397625462064"
HL test on goodness of fit
H0: Model is correct. P-value >>> 0.05 so there is not
enough evidence to reject the null. There is a significant evidence that
the model is correct.
# H0: Model is correct
library(ResourceSelection)
hl_gof <- hoslem.test(as.numeric(as.character(test$loan_status)), test$predict, g = 10)
hl_gof
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: as.numeric(as.character(test$loan_status)), test$predict
## X-squared = 0.88071, df = 8, p-value = 0.9989
HL is not robust so I tried a few other possibilities for number of
bins. But again, the p-value is >>> 0.05 confirming my
statement that the model specifications are correct.
# H0: Model is correct
library(ResourceSelection)
hl_gof <- hoslem.test(as.numeric(as.character(test$loan_status)), test$predict, g = 3)
hl_gof
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: as.numeric(as.character(test$loan_status)), test$predict
## X-squared = 0.88071, df = 1, p-value = 0.348
VIF
I generated VIF to check multicollinearity. All variables have VIF
< 5, so there is no multicollinearity issue.
# generate VIF
car::vif(final_model)
## `term_ 60 ` `emp_length_< 1 year` `emp_length_1 year`
## 1.679994 1.082030 1.054890
## `emp_length_2 years` `emp_length_5 years` `emp_length_7 years`
## 1.064928 1.043988 1.036283
## home_ownership_RENT grade_1 grade_4
## 1.101620 1.117350 1.126389
## grade_5 grade_6 grade_7
## 1.237922 1.231219 1.103347
## annual_inc_woe dti_woe revol_util_woe
## 1.450497 1.159428 1.096810
## total_acc_woe total_pymnt_woe total_rec_int_woe
## 1.227247 3.847057 3.428363
# No multicollinearity issue as all variables have VIF < 5.
LR test on the significance of variables
H0: Variables are statistically irrelevant. As p-vale ~0, we have a
strong evidence to reject the null that states all variables are
statistically irrelevant.
# LR test on the significance of variables
# H0 variables are statistically irrelevant
ist<-pchisq(final_model$null.deviance-final_model$deviance,
final_model$df.null-final_model$df.residual,lower.tail = F)
ist
## [1] 0
# pvalue ~0 thus, we have a string evidence to reject the null and confirm that variables are statistically relevant.
Scorecard
# scorecard creation
var_select <- c(
'annual_inc', 'dti', 'revol_util',
'total_acc', 'total_pymnt', 'total_rec_int')
breaks_list <- list(annual_inc = c("44000", "85000"),
dti = c("14", "23"),
revol_util = c("31","43","70"),
total_acc = c("14", "27"),
total_pymnt = c("1401", "2775",
"4831","8320"),
total_rec_int = c("263", "683")
)
bins <- woebin(df,
y = 'loan_status',
x = var_select,
positive = 1,
method = 'freq',
breaks_list = breaks_list
)
## ✔ Binning on 29252 rows and 7 columns in 00:00:10
score_card <- scorecard(bins,
final_model,
points0 = 300,
pdo = 20,
odds0 = 100,
basepoints_eq0 = TRUE)
# display results
score_card
## $basepoints
## variable bin woe points
## 1: basepoints NA NA 0
##
## $``term_ 60 ``
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
##
## $``emp_length_< 1 year``
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
##
## $``emp_length_1 year``
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
##
## $``emp_length_2 years``
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
##
## $``emp_length_5 years``
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
##
## $``emp_length_7 years``
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
##
## $home_ownership_RENT
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
##
## $grade_1
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
##
## $grade_4
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
##
## $grade_5
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
##
## $grade_6
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
##
## $grade_7
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
##
## $annual_inc
## variable bin count count_distr neg pos posprob woe
## 1: annual_inc [-Inf,44000) 5831 0.1993368 4028 1803 0.3092094 0.296800826
## 2: annual_inc [44000,85000) 14322 0.4896075 10738 3584 0.2502444 0.003309499
## 3: annual_inc [85000, Inf) 9099 0.3110557 7184 1915 0.2104627 -0.221519852
## bin_iv total_iv breaks is_special_values points
## 1: 1.882034e-02 0.03323167 44000 FALSE 46
## 2: 5.367009e-06 0.03323167 85000 FALSE 27
## 3: 1.440596e-02 0.03323167 Inf FALSE 12
##
## $dti
## variable bin count count_distr neg pos posprob woe
## 1: dti [-Inf,14) 9630 0.3292083 7843 1787 0.1855659 -0.3784643
## 2: dti [14,23) 10759 0.3678039 8290 2469 0.2294823 -0.1106179
## 3: dti [23, Inf) 8863 0.3029878 5817 3046 0.3436760 0.4536634
## bin_iv total_iv breaks is_special_values points
## 1: 0.042609255 0.1160021 14 FALSE 33
## 2: 0.004374938 0.1160021 23 FALSE 28
## 3: 0.069017906 0.1160021 Inf FALSE 19
##
## $revol_util
## variable bin count count_distr neg pos posprob woe
## 1: revol_util [-Inf,31) 7072 0.2417612 5821 1251 0.1768948 -0.43690998
## 2: revol_util [31,43) 4438 0.1517161 3459 979 0.2205949 -0.16158431
## 3: revol_util [43,70) 10253 0.3505059 7611 2642 0.2576807 0.04256049
## 4: revol_util [70, Inf) 7489 0.2560167 5059 2430 0.3244759 0.36734128
## bin_iv total_iv breaks is_special_values points
## 1: 0.0410130442 0.0830356 31 FALSE 31
## 2: 0.0037992615 0.0830356 43 FALSE 28
## 3: 0.0006416455 0.0830356 70 FALSE 26
## 4: 0.0375816497 0.0830356 Inf FALSE 23
##
## $total_acc
## variable bin count count_distr neg pos posprob woe
## 1: total_acc [-Inf,14) 3865 0.1321277 2648 1217 0.3148771 0.32320303
## 2: total_acc [14,27) 12214 0.4175441 8995 3219 0.2635500 0.07302074
## 3: total_acc [27, Inf) 13173 0.4503282 10307 2866 0.2175662 -0.17928709
## bin_iv total_iv breaks is_special_values points
## 1: 0.014876665 0.03096147 14 FALSE 22
## 2: 0.002266793 0.03096147 27 FALSE 26
## 3: 0.013818013 0.03096147 Inf FALSE 29
##
## $total_pymnt
## variable bin count count_distr neg pos posprob woe
## 1: total_pymnt [-Inf,1401) 2925 0.09999316 213 2712 0.927179487 3.6447683
## 2: total_pymnt [1401,2775) 2926 0.10002735 665 2261 0.772727273 2.3243943
## 3: total_pymnt [2775,4831) 2924 0.09995898 1366 1558 0.532831737 1.2321350
## 4: total_pymnt [4831,8320) 4388 0.15000684 3743 645 0.146991796 -0.6577735
## 5: total_pymnt [8320, Inf) 16089 0.55001367 15963 126 0.007831438 -3.7411281
## bin_iv total_iv breaks is_special_values points
## 1: 1.31831716 4.864063 1401 FALSE -165
## 2: 0.64930808 4.864063 2775 FALSE -96
## 3: 0.18621732 4.864063 4831 FALSE -38
## 4: 0.05406369 4.864063 8320 FALSE 61
## 5: 2.65615674 4.864063 Inf FALSE 223
##
## $total_rec_int
## variable bin count count_distr neg pos posprob woe
## 1: total_rec_int [-Inf,263) 8796 0.3006974 7359 1437 0.1633697 -0.5327476
## 2: total_rec_int [263,683) 8745 0.2989539 6742 2003 0.2290452 -0.1130917
## 3: total_rec_int [683, Inf) 11711 0.4003487 7849 3862 0.3297754 0.3914179
## bin_iv total_iv breaks is_special_values points
## 1: 0.073767726 0.1445362 263 FALSE 141
## 2: 0.003714408 0.1445362 683 FALSE 51
## 3: 0.067054103 0.1445362 Inf FALSE -58
Conclusion
In this analysis, a logistic regression model was fitted to predict
loan default using a dataset that included various predictor variables.
The model was trained using the Weight of Evidence (WOE) transformation
for numeric variables, which helps capture the relationship between the
predictors and the likelihood of loan default.
The coefficients of the model provide valuable information about the
impact of each predictor on the probability of loan default. For
instance, the coefficient for term_60 is approximately 0.78497. This
means that, all other factors being equal, having a loan term of 60
months increases the log-odds of default.
On the other hand, some variables have negative coefficients. For
example, the coefficient for annual_inc_woe is approximately -2.61595.
This suggests that as the weighted odds of annual income increase, the
likelihood of default decreases. A higher annual income can be seen as a
protective factor against default.
The model’s performance on the out-of-sample data was impressive,
achieving an accuracy rate of approximately 97.61%. This means that the
model correctly classified about 97.61% of the observations in the test
set.
Furthermore, the Area Under the Curve (AUC) was calculated to be
0.9655, indicating a high level of discrimination between the positive
and negative loan statuses. The Gini coefficient derived from the AUC is
0.9311, reflecting the model’s strong predictive ability.
In conclusion, the logistic regression model, with the selected
predictors, provides valuable insights into the factors influencing loan
default. Variables such as loan term, employment length, loan grade, and
annual income have significant impacts on the likelihood of default. The
model exhibits high accuracy and strong discriminatory power, making it
a valuable tool for predicting loan default risk.
Thank you for taking your time to read it.