Credit risk

The objective of this project is to build a model that accurately identifies high-risk loans. The bank can then take suitable measures, such as raising the interest rate on the loan or rejecting the application, which should reduce its overall risk and improve its profitability.

EDA

First, let’s see the data.

##          id member_id loan_amnt funded_amnt funded_amnt_inv       term int_rate
## 1  60516983  64537751     20000       20000           20000  36 months    12.29
## 2  60187139  64163931     11000       11000           11000  36 months    12.69
## 3  60356453  64333218      7000        7000            7000  36 months     9.99
## 4  59955769  63900496     10000       10000           10000  36 months    10.99
## 5  58703693  62544456      9550        9550            9550  36 months    19.99
## 6  57783762  61536512     24000       24000           24000  60 months    14.65
## 7  58010547  61814274     15000       15000           14975  60 months    10.99
## 8  58613025  62453761     24650       24650           24600  60 months    17.57
## 9  57662310  61415026     12000       12000           12000  36 months    12.69
## 10 58470862  62289594     15000       15000           14850  60 months    18.25
##    installment grade sub_grade                 emp_title emp_length
## 1       667.06     C        C1          Accounting Clerk     1 year
## 2          369     C        C2     Accounts Payable Lead    7 years
## 3       225.84     B        B3                     Nurse    6 years
## 4       327.34     B        B4           Service Manager  10+ years
## 5       354.87     E        E4                      <NA>       <NA>
## 6       566.56     C        C5                     Owner  10+ years
## 7       326.07     B        B4                       ISO  10+ years
## 8        620.2     D        D4 IP and Research Secreatry    8 years
## 9       402.54     C        C2               staff nurse    9 years
## 10      382.95     E        E1                        PA    8 years
##    home_ownership annual_inc verification_status issue_d loan_status pymnt_plan
## 1             OWN      65000     Source Verified  Sep-15 Charged Off          n
## 2        MORTGAGE      40000     Source Verified  Sep-15 Charged Off          n
## 3        MORTGAGE      32000     Source Verified  Sep-15 Charged Off          n
## 4        MORTGAGE      48000     Source Verified  Sep-15 Charged Off          n
## 5            RENT      32376            Verified  Sep-15 Charged Off          n
## 6        MORTGAGE      70000        Not Verified  Aug-15 Charged Off          n
## 7            RENT      74800        Not Verified  Aug-15 Charged Off          n
## 8        MORTGAGE      56048            Verified  Aug-15 Charged Off          n
## 9            RENT     120000        Not Verified  Aug-15 Charged Off          n
## 10       MORTGAGE     150000     Source Verified  Aug-15 Charged Off          n
##                                                                      url desc
## 1  https://www.lendingclub.com/browse/loanDetail.action?loan_id=60516983 <NA>
## 2  https://www.lendingclub.com/browse/loanDetail.action?loan_id=60187139 <NA>
## 3  https://www.lendingclub.com/browse/loanDetail.action?loan_id=60356453 <NA>
## 4  https://www.lendingclub.com/browse/loanDetail.action?loan_id=59955769 <NA>
## 5  https://www.lendingclub.com/browse/loanDetail.action?loan_id=58703693 <NA>
## 6  https://www.lendingclub.com/browse/loanDetail.action?loan_id=57783762 <NA>
## 7  https://www.lendingclub.com/browse/loanDetail.action?loan_id=58010547 <NA>
## 8  https://www.lendingclub.com/browse/loanDetail.action?loan_id=58613025 <NA>
## 9  https://www.lendingclub.com/browse/loanDetail.action?loan_id=57662310 <NA>
## 10 https://www.lendingclub.com/browse/loanDetail.action?loan_id=58470862 <NA>
##               purpose                   title zip_code addr_state   dti
## 1  debt_consolidation      Debt consolidation    542xx         WI 20.72
## 2  debt_consolidation      Debt consolidation    235xx         VA 24.57
## 3  debt_consolidation      Debt consolidation    350xx         AL 32.41
## 4         credit_card Credit card refinancing    483xx         MI 30.98
## 5  debt_consolidation      Debt consolidation    546xx         WI 32.54
## 6  debt_consolidation      Debt consolidation    703xx         LA  6.96
## 7         credit_card Credit card refinancing    700xx         LA 15.63
## 8  debt_consolidation      Debt consolidation    238xx         VA 27.26
## 9      major_purchase          Major purchase    601xx         IL 22.74
## 10 debt_consolidation      Debt consolidation    913xx         CA 28.26
##    delinq_2yrs earliest_cr_line inq_last_6mths mths_since_last_delinq
## 1            0           Sep-00              1                     NA
## 2            0           Sep-02              0                     36
## 3            0           Feb-06              1                     NA
## 4            0           Oct-99              2                     NA
## 5            0           Nov-99              3                     69
## 6            0           May-98              0                     65
## 7            0           Feb-84              2                     NA
## 8            0           Mar-09              0                     NA
## 9            0           Jun-04              2                     33
## 10           0           Jul-00              0                     24
##    mths_since_last_record open_acc pub_rec revol_bal revol_util total_acc
## 1                      NA       25       0     31578         77        42
## 2                      80       13       1      5084       38.8        41
## 3                      NA       18       0     12070         74        36
## 4                      NA       18       0     22950         66        41
## 5                      NA        9       0      4172       29.6        26
## 6                      NA        8       0      8256       49.4        19
## 7                      NA       12       0      4409        6.3        28
## 8                      NA       16       0      9638       69.8        23
## 9                      NA       20       0     22108       54.4        56
## 10                     NA       18       0     35052       91.5        36
##    initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv
## 1                    w         0             0           0               0
## 2                    w         0             0    10043.49        10043.49
## 3                    f         0             0      221.96          221.96
## 4                    f         0             0      315.13          315.13
## 5                    w         0             0      333.66          333.66
## 6                    w         0             0      547.03          547.03
## 7                    f         0             0      307.75          307.24
## 8                    w         0             0     1192.28         1189.86
## 9                    w         0             0      796.62          796.62
## 10                   f         0             0           0               0
##    total_rec_prncp total_rec_int total_rec_late_fee recoveries
## 1                0             0                  0          0
## 2          9942.67        100.81                  0          0
## 3           167.56          54.4                  0          0
## 4           235.76         79.37                  0          0
## 5           195.78        137.88                  0          0
## 6           273.56        273.47                  0          0
## 7           188.69        119.06                  0          0
## 8           522.36        669.92                  0          0
## 9           554.19        242.43                  0          0
## 10               0             0                  0          0
##    collection_recovery_fee last_pymnt_d last_pymnt_amnt next_pymnt_d
## 1                        0         <NA>               0         <NA>
## 2                        0       Oct-15           10059         <NA>
## 3                        0       Oct-15          225.84         <NA>
## 4                        0       Oct-15          327.34         <NA>
## 5                        0       Oct-15          354.87         <NA>
## 6                        0       Oct-15          566.56         <NA>
## 7                        0       Oct-15          326.07         <NA>
## 8                        0       Oct-15           620.2         <NA>
## 9                        0       Oct-15          402.54         <NA>
## 10                       0         <NA>               0         <NA>
##    last_credit_pull_d collections_12_mths_ex_med mths_since_last_major_derog
## 1              Jan-16                          0                          NA
## 2              Jan-16                          0                          79
## 3              Jan-16                          0                          NA
## 4              Jan-16                          0                          NA
## 5              Jan-16                          0                          69
## 6              Jan-16                          0                          65
## 7              Jan-16                          0                          NA
## 8              Jan-16                          0                          NA
## 9              Jan-16                          0                          33
## 10             Jan-16                          0                          24
##    policy_code application_type annual_inc_joint dti_joint
## 1            1       INDIVIDUAL             <NA>      <NA>
## 2            1       INDIVIDUAL             <NA>      <NA>
## 3            1       INDIVIDUAL             <NA>      <NA>
## 4            1       INDIVIDUAL             <NA>      <NA>
## 5            1       INDIVIDUAL             <NA>      <NA>
## 6            1       INDIVIDUAL             <NA>      <NA>
## 7            1       INDIVIDUAL             <NA>      <NA>
## 8            1       INDIVIDUAL             <NA>      <NA>
## 9            1       INDIVIDUAL             <NA>      <NA>
## 10           1       INDIVIDUAL             <NA>      <NA>
##    verification_status_joint acc_now_delinq tot_coll_amt tot_cur_bal
## 1                       <NA>              0            0       52303
## 2                       <NA>              0          332      175731
## 3                       <NA>              0            0      202012
## 4                       <NA>              0            0      108235
## 5                       <NA>              0            0       45492
## 6                       <NA>              0            0      126165
## 7                       <NA>              0            0      264173
## 8                       <NA>              0            0      191935
## 9                       <NA>              0            0       71745
## 10                      <NA>              0            0      497387
##    open_acc_6m open_il_6m open_il_12m open_il_24m mths_since_rcnt_il
## 1           NA         NA          NA          NA                 NA
## 2           NA         NA          NA          NA                 NA
## 3           NA         NA          NA          NA                 NA
## 4           NA         NA          NA          NA                 NA
## 5           NA         NA          NA          NA                 NA
## 6           NA         NA          NA          NA                 NA
## 7           NA         NA          NA          NA                 NA
## 8           NA         NA          NA          NA                 NA
## 9           NA         NA          NA          NA                 NA
## 10          NA         NA          NA          NA                 NA
##    total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util
## 1            NA    <NA>          NA          NA         NA     <NA>
## 2            NA    <NA>          NA          NA         NA     <NA>
## 3            NA    <NA>          NA          NA         NA     <NA>
## 4            NA    <NA>          NA          NA         NA     <NA>
## 5            NA    <NA>          NA          NA         NA     <NA>
## 6            NA    <NA>          NA          NA         NA     <NA>
## 7            NA    <NA>          NA          NA         NA     <NA>
## 8            NA    <NA>          NA          NA         NA     <NA>
## 9            NA    <NA>          NA          NA         NA     <NA>
## 10           NA    <NA>          NA          NA         NA     <NA>
##    total_rev_hi_lim inq_fi total_cu_tl inq_last_12m
## 1             41000     NA          NA           NA
## 2             13100     NA          NA           NA
## 3             16300     NA          NA           NA
## 4             34750     NA          NA           NA
## 5             14100     NA          NA           NA
## 6             16700     NA          NA           NA
## 7             69500     NA          NA           NA
## 8             13800     NA          NA           NA
## 9             40613     NA          NA           NA
## 10            38300     NA          NA           NA

str(raw_data)

## 'data.frame':    421094 obs. of  74 variables:
##  $ id                         : int  60516983 60187139 60356453 59955769 58703693 57783762 58010547 58613025 57662310 58470862 ...
##  $ member_id                  : int  64537751 64163931 64333218 63900496 62544456 61536512 61814274 62453761 61415026 62289594 ...
##  $ loan_amnt                  : int  20000 11000 7000 10000 9550 24000 15000 24650 12000 15000 ...
##  $ funded_amnt                : int  20000 11000 7000 10000 9550 24000 15000 24650 12000 15000 ...
##  $ funded_amnt_inv            : int  20000 11000 7000 10000 9550 24000 14975 24600 12000 14850 ...
##  $ term                       : chr  " 36 months" " 36 months" " 36 months" " 36 months" ...
##  $ int_rate                   : chr  "12.29" "12.69" "9.99" "10.99" ...
##  $ installment                : chr  "667.06" "369" "225.84" "327.34" ...
##  $ grade                      : chr  "C" "C" "B" "B" ...
##  $ sub_grade                  : chr  "C1" "C2" "B3" "B4" ...
##  $ emp_title                  : chr  "Accounting Clerk" "Accounts Payable Lead" "Nurse" "Service Manager" ...
##  $ emp_length                 : chr  "1 year" "7 years" "6 years" "10+ years" ...
##  $ home_ownership             : chr  "OWN" "MORTGAGE" "MORTGAGE" "MORTGAGE" ...
##  $ annual_inc                 : chr  "65000" "40000" "32000" "48000" ...
##  $ verification_status        : chr  "Source Verified" "Source Verified" "Source Verified" "Source Verified" ...
##  $ issue_d                    : chr  "Sep-15" "Sep-15" "Sep-15" "Sep-15" ...
##  $ loan_status                : chr  "Charged Off" "Charged Off" "Charged Off" "Charged Off" ...
##  $ pymnt_plan                 : chr  "n" "n" "n" "n" ...
##  $ url                        : chr  "https://www.lendingclub.com/browse/loanDetail.action?loan_id=60516983" "https://www.lendingclub.com/browse/loanDetail.action?loan_id=60187139" "https://www.lendingclub.com/browse/loanDetail.action?loan_id=60356453" "https://www.lendingclub.com/browse/loanDetail.action?loan_id=59955769" ...
##  $ desc                       : chr  NA NA NA NA ...
##  $ purpose                    : chr  "debt_consolidation" "debt_consolidation" "debt_consolidation" "credit_card" ...
##  $ title                      : chr  "Debt consolidation" "Debt consolidation" "Debt consolidation" "Credit card refinancing" ...
##  $ zip_code                   : chr  "542xx" "235xx" "350xx" "483xx" ...
##  $ addr_state                 : chr  "WI" "VA" "AL" "MI" ...
##  $ dti                        : chr  "20.72" "24.57" "32.41" "30.98" ...
##  $ delinq_2yrs                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ earliest_cr_line           : chr  "Sep-00" "Sep-02" "Feb-06" "Oct-99" ...
##  $ inq_last_6mths             : int  1 0 1 2 3 0 2 0 2 0 ...
##  $ mths_since_last_delinq     : int  NA 36 NA NA 69 65 NA NA 33 24 ...
##  $ mths_since_last_record     : int  NA 80 NA NA NA NA NA NA NA NA ...
##  $ open_acc                   : int  25 13 18 18 9 8 12 16 20 18 ...
##  $ pub_rec                    : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ revol_bal                  : int  31578 5084 12070 22950 4172 8256 4409 9638 22108 35052 ...
##  $ revol_util                 : chr  "77" "38.8" "74" "66" ...
##  $ total_acc                  : int  42 41 36 41 26 19 28 23 56 36 ...
##  $ initial_list_status        : chr  "w" "w" "f" "f" ...
##  $ out_prncp                  : chr  "0" "0" "0" "0" ...
##  $ out_prncp_inv              : chr  "0" "0" "0" "0" ...
##  $ total_pymnt                : chr  "0" "10043.49" "221.96" "315.13" ...
##  $ total_pymnt_inv            : chr  "0" "10043.49" "221.96" "315.13" ...
##  $ total_rec_prncp            : chr  "0" "9942.67" "167.56" "235.76" ...
##  $ total_rec_int              : chr  "0" "100.81" "54.4" "79.37" ...
##  $ total_rec_late_fee         : chr  "0" "0" "0" "0" ...
##  $ recoveries                 : chr  "0" "0" "0" "0" ...
##  $ collection_recovery_fee    : chr  "0" "0" "0" "0" ...
##  $ last_pymnt_d               : chr  NA "Oct-15" "Oct-15" "Oct-15" ...
##  $ last_pymnt_amnt            : chr  "0" "10059" "225.84" "327.34" ...
##  $ next_pymnt_d               : chr  NA NA NA NA ...
##  $ last_credit_pull_d         : chr  "Jan-16" "Jan-16" "Jan-16" "Jan-16" ...
##  $ collections_12_mths_ex_med : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mths_since_last_major_derog: int  NA 79 NA NA 69 65 NA NA 33 24 ...
##  $ policy_code                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ application_type           : chr  "INDIVIDUAL" "INDIVIDUAL" "INDIVIDUAL" "INDIVIDUAL" ...
##  $ annual_inc_joint           : chr  NA NA NA NA ...
##  $ dti_joint                  : chr  NA NA NA NA ...
##  $ verification_status_joint  : chr  NA NA NA NA ...
##  $ acc_now_delinq             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tot_coll_amt               : int  0 332 0 0 0 0 0 0 0 0 ...
##  $ tot_cur_bal                : int  52303 175731 202012 108235 45492 126165 264173 191935 71745 497387 ...
##  $ open_acc_6m                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ open_il_6m                 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ open_il_12m                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ open_il_24m                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ mths_since_rcnt_il         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_bal_il               : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ il_util                    : chr  NA NA NA NA ...
##  $ open_rv_12m                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ open_rv_24m                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_bal_bc                 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ all_util                   : chr  NA NA NA NA ...
##  $ total_rev_hi_lim           : int  41000 13100 16300 34750 14100 16700 69500 13800 40613 38300 ...
##  $ inq_fi                     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_cu_tl                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ inq_last_12m               : int  NA NA NA NA NA NA NA NA NA NA ...

Next, I checked for NAs.

# Check for NAs
# % of NAs
na <- lapply(raw_data,function(x) { length(which(is.na(x)))/length(x)})

# Columns with more than 70% of NAs
na_70 <- as.data.frame(na[na>0.7]) 
na_70

##        desc mths_since_last_record mths_since_last_major_derog annual_inc_joint
## 1 0.9998931              0.8232817                   0.7085473        0.9987865
##   dti_joint verification_status_joint open_acc_6m open_il_6m open_il_12m
## 1 0.9987912                 0.9987865   0.9492465  0.9492465   0.9492465
##   open_il_24m mths_since_rcnt_il total_bal_il  il_util open_rv_12m open_rv_24m
## 1   0.9492465          0.9505811    0.9492465 0.955789   0.9492465   0.9492465
##   max_bal_bc  all_util    inq_fi total_cu_tl inq_last_12m
## 1  0.9492465 0.9492465 0.9492465   0.9492465    0.9492465

These are columns with more than 70% of NAs. I decided to drop them.

# Drop columns that have > 70% of NAs
df_1 <- raw_data[,!colnames(raw_data) %in% colnames((na_70))]

# Re-explore the data
lapply(df_1,function(x) { length(which(is.na(x)))/length(x)})

## $id
## [1] 0
## 
## $member_id
## [1] 0
## 
## $loan_amnt
## [1] 0
## 
## $funded_amnt
## [1] 0
## 
## $funded_amnt_inv
## [1] 0
## 
## $term
## [1] 0
## 
## $int_rate
## [1] 0
## 
## $installment
## [1] 0
## 
## $grade
## [1] 0
## 
## $sub_grade
## [1] 0
## 
## $emp_title
## [1] 0.05669518
## 
## $emp_length
## [1] 0.05655982
## 
## $home_ownership
## [1] 0
## 
## $annual_inc
## [1] 0
## 
## $verification_status
## [1] 0
## 
## $issue_d
## [1] 0
## 
## $loan_status
## [1] 0
## 
## $pymnt_plan
## [1] 0
## 
## $url
## [1] 0
## 
## $purpose
## [1] 0
## 
## $title
## [1] 0.0003134692
## 
## $zip_code
## [1] 0
## 
## $addr_state
## [1] 0
## 
## $dti
## [1] 0
## 
## $delinq_2yrs
## [1] 0
## 
## $earliest_cr_line
## [1] 0
## 
## $inq_last_6mths
## [1] 0
## 
## $mths_since_last_delinq
## [1] 0.4843598
## 
## $open_acc
## [1] 0
## 
## $pub_rec
## [1] 0
## 
## $revol_bal
## [1] 0
## 
## $revol_util
## [1] 0.0003847122
## 
## $total_acc
## [1] 0
## 
## $initial_list_status
## [1] 0
## 
## $out_prncp
## [1] 0
## 
## $out_prncp_inv
## [1] 0
## 
## $total_pymnt
## [1] 0
## 
## $total_pymnt_inv
## [1] 0
## 
## $total_rec_prncp
## [1] 0
## 
## $total_rec_int
## [1] 0
## 
## $total_rec_late_fee
## [1] 0
## 
## $recoveries
## [1] 0
## 
## $collection_recovery_fee
## [1] 0
## 
## $last_pymnt_d
## [1] 0.04104309
## 
## $last_pymnt_amnt
## [1] 0
## 
## $next_pymnt_d
## [1] 0.06116687
## 
## $last_credit_pull_d
## [1] 2.612243e-05
## 
## $collections_12_mths_ex_med
## [1] 0
## 
## $policy_code
## [1] 0
## 
## $application_type
## [1] 0
## 
## $acc_now_delinq
## [1] 0
## 
## $tot_coll_amt
## [1] 0
## 
## $tot_cur_bal
## [1] 0
## 
## $total_rev_hi_lim
## [1] 0
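
As a side note, the per-column NA share can be computed more compactly. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not part of the loan data); it assumes the same 70% threshold used above.

```r
# Toy data frame standing in for raw_data (hypothetical values)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, "x"),
  c = c("p", "q", "r", "s"),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; the column mean is the share of NAs
na_share <- colMeans(is.na(toy))

# Names of the columns whose NA share exceeds the 70% threshold
high_na <- names(na_share)[na_share > 0.7]

# Keep every column that is not in the high-NA set
toy_clean <- toy[, !names(toy) %in% high_na, drop = FALSE]

print(na_share)
print(names(toy_clean))
```

Because `na_share` is a named numeric vector rather than a list, it can be sorted or filtered directly.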

One more variable, mths_since_last_delinq, has about 48% missing values. I assumed that these NAs represent genuinely unrecorded data, so I decided to drop the variable: imputing almost half of its values could have introduced bias into the analysis.

df_2 <- df_1[, !colnames(df_1) %in% c('mths_since_last_delinq')]

Feature engineering

I labeled ‘Charged Off’, ‘Late (31-120 days)’ and ‘Default’ as loan status = 1, meaning risky (bad) customers, and ‘Fully Paid’ as loan status = 0, meaning not risky (good) customers. Finally, I dropped ‘Current’, ‘In Grace Period’, ‘Issued’ and ‘Late (16-30 days)’, as these loans are neither clearly good nor clearly bad.

# unique(df_2$loan_status)

# Replace the loan_status values by default (1), not in default (0)
df_3 <- df_2

df_3$loan_status[df_3$loan_status == 'Charged Off'] <- 1
df_3$loan_status[df_3$loan_status == 'Late (31-120 days)'] <- 1
df_3$loan_status[df_3$loan_status == 'Default'] <- 1
# df_3$loan_status[df_3$loan_status == 'Current'] <- 0
df_3$loan_status[df_3$loan_status == 'Fully Paid'] <- 0
# df_3$loan_status[df_3$loan_status == 'In Grace Period'] <- 0
# df_3$loan_status[df_3$loan_status == 'Issued'] <- 0
# df_3$loan_status[df_3$loan_status == 'Late (16-30 days)'] <- 0

df_3 <- subset(df_3, loan_status != "Current")
df_3 <- subset(df_3, loan_status != "In Grace Period")
df_3 <- subset(df_3, loan_status != "Issued")
df_3 <- subset(df_3, loan_status != "Late (16-30 days)")

df_3$loan_status <- as.factor(as.character(df_3$loan_status))

# unique(df_3$loan_status)
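
The same labeling can be expressed with a single lookup vector. This is only a sketch on hypothetical status values (`status` and `label_map` are illustrative names, not objects from the analysis above).

```r
# Hypothetical sample of loan_status values
status <- c("Charged Off", "Fully Paid", "Current", "Late (31-120 days)",
            "Default", "Issued", "Fully Paid")

# Named lookup: 1 = risky, 0 = not risky.
# Statuses that are absent from the lookup map to NA.
label_map <- c("Charged Off" = 1, "Late (31-120 days)" = 1, "Default" = 1,
               "Fully Paid" = 0)

label <- unname(label_map[status])

# Drop the undecided statuses and convert the rest to a factor
keep <- !is.na(label)
target <- factor(label[keep], levels = c(0, 1))

print(table(target))
```

With this layout, changing which statuses count as risky means editing one vector instead of several assignment and subset lines.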

Then, I converted numerical features to numeric or integer type and categorical features to factor type. I also dropped columns that were not informative or contained too many zeros.

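# Term
library(stringr)  # provides str_replace()
# Strip the "months" suffix and surrounding whitespace, e.g. " 36 months" -> "36"
df_3$term <- trimws(str_replace(df_3$term, 'months', ''))
df_3$term <- as.factor(as.character(df_3$term))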

# Int rate
df_3$int_rate <- as.numeric(as.character(df_3$int_rate))
df_3$installment <- as.numeric(as.character(df_3$installment))

# Grade and sub grade
df_3$grade[df_3$grade == 'A'] <- 1
df_3$grade[df_3$grade == 'B'] <- 2
df_3$grade[df_3$grade == 'C'] <- 3
df_3$grade[df_3$grade == 'D'] <- 4
df_3$grade[df_3$grade == 'E'] <- 5
df_3$grade[df_3$grade == 'F'] <- 6
df_3$grade[df_3$grade == 'G'] <- 7
df_3$grade <- as.factor(as.character(df_3$grade))

df_3$sub_grade <- substring(df_3$sub_grade,2)
df_3$sub_grade <- as.factor(as.character(df_3$sub_grade))

# Emp title
# length(unique(df_3$emp_title))
# There are 120812 unique values for emp_title, so this variable is not informative and can be dropped.
# id and member_id are not informative either, so all 3 columns are dropped.
df_3 <-df_3[,!colnames(df_3) %in% c('id', 'member_id', 'emp_title')]

# Home ownership
# unique(df_3$home_ownership)
df_3$home_ownership <- as.factor(df_3$home_ownership)

# Employment length
# unique(df_3$emp_length)
df_3$emp_length <- as.factor(as.character(df_3$emp_length))

# Annual inc
df_3$annual_inc <- as.numeric(df_3$annual_inc)

# Verification status
df_3$verification_status <- as.factor(df_3$verification_status)

# Issued
df_3 <- df_3[, -which(names(df_3) == "issue_d")]

# Pymnt plan
# unique(df_3$pymnt_plan)
df_3 <- df_3[, -which(names(df_3) == "pymnt_plan")]

# Url
df_3 <- df_3[, -which(names(df_3) == "url")]

# Purpose
# unique(df_3$purpose)
df_3$purpose <- as.factor(df_3$purpose)

# Title
# unique(df_3$title)
df_3$title <- as.factor(df_3$title)

# Zip code / address
df_3 <- df_3[, -which(names(df_3) =="zip_code")]
df_3 <- df_3[, -which(names(df_3) == "addr_state")]

# DTI
df_3$dti <- as.numeric(df_3$dti)

# delinq 2 years
# unique(df_3$delinq_2yrs)
df_3 <- df_3[, -which(names(df_3) == "delinq_2yrs")]

# earliest cr line
df_3 <- df_3[, -which(names(df_3) == "earliest_cr_line")]

# inq last 6mths
# unique(df_3$inq_last_6mths)
df_3 <- df_3[, -which(names(df_3) == "inq_last_6mths")]

# open acc
df_3$open_acc <- as.integer(df_3$open_acc)

# pub rec
# unique(df_3$pub_rec)
df_3$pub_rec <- as.integer((df_3$pub_rec))

# revol bal, util
df_3$revol_bal <- as.numeric(df_3$revol_bal)
df_3$revol_util <- as.numeric(df_3$revol_util)

# total acc
df_3$total_acc <- as.integer(df_3$total_acc)

# initial list status
df_3$initial_list_status <- as.factor(df_3$initial_list_status)

# out_prncp
# unique(df_3$out_prncp)
df_3$out_prncp <- as.numeric(df_3$out_prncp)

# unique(df_3$out_prncp_inv)
df_3$out_prncp_inv <- as.numeric(df_3$out_prncp_inv)

# total pymnt
# unique(df_3$total_pymnt)
df_3$total_pymnt <- as.numeric(df_3$total_pymnt)
df_3$total_pymnt_inv <- as.numeric(df_3$total_pymnt_inv)

# total rec 
df_3$total_rec_prncp <- as.numeric(df_3$total_rec_prncp)
df_3$total_rec_int <- as.numeric(df_3$total_rec_int)
df_3 <- df_3[, -which(names(df_3) == "total_rec_late_fee")]

# recovery
df_3 <- df_3[, -which(names(df_3) == "recoveries")]
df_3 <- df_3[, -which(names(df_3) == "collection_recovery_fee")]

# last pymnt
df_3 <- df_3[, -which(names(df_3) == "last_pymnt_d")]
df_3$last_pymnt_amnt <- as.numeric(df_3$last_pymnt_amnt)
df_3 <- df_3[, -which(names(df_3) == "next_pymnt_d")]
df_3 <- df_3[, -which(names(df_3) == "last_credit_pull_d")]

# collections
# unique(df_3$collections_12_mths_ex_med)
df_3 <- df_3[, -which(names(df_3) == "collections_12_mths_ex_med")]

# policy code
# unique(df_3$policy_code)
df_3 <- df_3[, -which(names(df_3) == "policy_code")]

# appl type
# unique(df_3$application_type)
df_3 <- df_3[, -which(names(df_3) == "application_type")]

# acc delinq
# unique(df_3$acc_now_delinq)
df_3 <- df_3[, -which(names(df_3) == "acc_now_delinq")]

# coll amt
df_3 <-  df_3[, -which(names(df_3) == "tot_coll_amt")]

# curr bal
df_3$tot_cur_bal <- as.numeric(df_3$tot_cur_bal)

# rev hi lim
df_3$total_rev_hi_lim <- as.numeric(df_3$total_rev_hi_lim)

# Check

# Filter and print only character columns
# character_cols <- sapply(df_3, is.character)
# character_df <- df_3[, character_cols]
# print(character_df) # all good

df_4 <- df_3
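
The column-by-column conversions above can also be grouped by target type. The sketch below uses a small made-up data frame (`toy`, `numeric_cols` and `factor_cols` are hypothetical names), so it only illustrates the pattern rather than reproducing the exact steps above.

```r
# Toy data frame with character columns, as read from the raw file (hypothetical values)
toy <- data.frame(
  int_rate = c("12.29", "9.99", "19.99"),
  annual_inc = c("65000", "40000", "32000"),
  grade = c("C", "B", "E"),
  home_ownership = c("OWN", "MORTGAGE", "RENT"),
  stringsAsFactors = FALSE
)

# Group the columns by the type they should end up with
numeric_cols <- c("int_rate", "annual_inc")
factor_cols <- c("home_ownership")

# Convert each group in one step
toy[numeric_cols] <- lapply(toy[numeric_cols], as.numeric)
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Map grade letters A..G to 1..7 via their position in the alphabet
toy$grade <- factor(match(toy$grade, LETTERS[1:7]), levels = 1:7)

str(toy)
```

One difference from the code above: fixing `levels = 1:7` keeps all seven grades as factor levels even if some of them do not occur in the data.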

Then again, I looked at columns with NAs. Only emp_length and revol_util still had missing values, and I decided to drop the rows containing them.

# % of NAs
na <- lapply(df_4,function(x) { length(which(is.na(x)))/length(x)})

# Columns with missing values
which(colSums(is.na(df_4))>0) # which() returns column positions (emp_length is column 9, revol_util column 20), not NA counts

## emp_length revol_util 
##          9         20

# Drop rows with NAs
df_5 <- na.omit(df_4)

# Check
# which(colSums(is.na(df_5))>0) # 0 NAs

Correlation analysis

Here, I filtered pairs of numeric variables whose absolute correlation was above 0.7.

library(dplyr)  # provides select_if()

cor_matrix <- cor(select_if(df_5, is.numeric))

# Set the correlation threshold
threshold <- 0.7

# Print correlations above the threshold
for (i in 1:(ncol(cor_matrix) - 1)) {
  for (j in (i + 1):ncol(cor_matrix)) {
    correlation <- cor_matrix[i, j]
    if (!is.na(correlation) && abs(correlation) > threshold) {
      var1 <- colnames(cor_matrix)[i]
      var2 <- colnames(cor_matrix)[j]
      cat(var1, "and", var2, "have correlation", correlation, "\n")
    }
  }
}

## loan_amnt and funded_amnt have correlation 1 
## loan_amnt and funded_amnt_inv have correlation 0.9999964 
## loan_amnt and installment have correlation 0.950474 
## loan_amnt and total_pymnt have correlation 0.7022142 
## loan_amnt and total_pymnt_inv have correlation 0.702199 
## funded_amnt and funded_amnt_inv have correlation 0.9999964 
## funded_amnt and installment have correlation 0.950474 
## funded_amnt and total_pymnt have correlation 0.7022142 
## funded_amnt and total_pymnt_inv have correlation 0.702199 
## funded_amnt_inv and installment have correlation 0.950474 
## funded_amnt_inv and total_pymnt have correlation 0.7022695 
## funded_amnt_inv and total_pymnt_inv have correlation 0.7022589 
## revol_bal and total_rev_hi_lim have correlation 0.8276475 
## out_prncp and out_prncp_inv have correlation 1 
## total_pymnt and total_pymnt_inv have correlation 0.999998 
## total_pymnt and total_rec_prncp have correlation 0.9957575 
## total_pymnt and last_pymnt_amnt have correlation 0.9015749 
## total_pymnt_inv and total_rec_prncp have correlation 0.9957477 
## total_pymnt_inv and last_pymnt_amnt have correlation 0.9015471 
## total_rec_prncp and last_pymnt_amnt have correlation 0.9083865
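
The nested loops above can also be written without explicit iteration. Here is a minimal sketch on simulated data (the variables `x`, `y` and `z` are made up for illustration), assuming the same 0.7 threshold.

```r
set.seed(1)
x <- rnorm(200)
toy <- data.frame(
  x = x,
  y = x + rnorm(200, sd = 0.1),  # nearly a copy of x
  z = rnorm(200)                 # generated independently of x
)

cor_matrix <- cor(toy)
threshold <- 0.7

# upper.tri() keeps each unordered pair once and excludes the diagonal
hits <- which(abs(cor_matrix) > threshold & upper.tri(cor_matrix), arr.ind = TRUE)

# Collect the flagged pairs and their correlations in a data frame
high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[hits[, "row"]],
  var2 = colnames(cor_matrix)[hits[, "col"]],
  correlation = cor_matrix[hits]
)

print(high_pairs)
```

Returning the pairs as a data frame makes it easier to sort them by correlation or to count how often each variable appears before deciding what to drop.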

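I then removed variables so that no remaining pair of numeric features exceeds the threshold. The list below also drops int_rate and open_acc, even though they do not appear in any of the pairs above.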

# Remove highly correlated variables
df_6 <- subset(df_5, select = -c(funded_amnt, funded_amnt_inv, installment, out_prncp, out_prncp_inv, int_rate, open_acc, total_rev_hi_lim, total_pymnt_inv, total_rec_prncp, loan_amnt, last_pymnt_amnt))

# Check
cor_matrix <- cor(select_if(df_6, is.numeric))

# Print correlations above the threshold
for (i in 1:(ncol(cor_matrix) - 1)) {
  for (j in (i + 1):ncol(cor_matrix)) {
    correlation <- cor_matrix[i, j]
    if (!is.na(correlation) && abs(correlation) > threshold) {
      var1 <- colnames(cor_matrix)[i]
      var2 <- colnames(cor_matrix)[j]
      cat(var1, "and", var2, "have correlation", correlation, "\n")
    }
  }
}

df_7 <- df_6

Distribution for dependent variable

The loan status distribution is somewhat imbalanced, but I decided not to perform any form of undersampling.

# Distribution for loan status (dependent variable)
barplot(prop.table(table(df_5$loan_status)), ylim = c(0,1), xlab = "Loan Status", main = "Loan status distribution", col = c("lightblue", "lightpink"), xaxt = "n")
axis(side = 1, at = c(0.7, 1.9), labels = c("0- Not in default", "1- Default"))
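
# The imbalance shown in the plot can also be put into numbers. A minimal sketch
# on a hypothetical 0/1 target (the values below are made up, not taken from df_5):

```r
# Hypothetical target with 7 good (0) and 3 bad (1) loans
loan_status <- factor(c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0))

counts <- table(loan_status)
shares <- prop.table(counts)

# Ratio of the majority class to the minority class
imbalance_ratio <- max(counts) / min(counts)

print(shares)
print(imbalance_ratio)
```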

# Imbalanced data

# Perform random undersampling
# df_8 <- downSample(x = subset(df_7, select = -c(loan_status)), y = as.factor(df_7$loan_status), yname = "loan_status")
# df_8$loan_status <- as.factor(as.character(df_8$loan_status))

# # Distribution for loan status (dependent variable) after random undersampling
# barplot(prop.table(table(df_8$loan_status)), ylim = c(0,1), xlab = "Loan Status", main = "Loan status distribution", col = c("lightblue", "lightpink"), xaxt = "n")
# axis(side = 1, at = c(0.7, 1.9), labels = c("0- Not in default", "1- Default"))
# 
# df <- df_8

df_8 <- df_7
# Feature selection for numerical variables

# Separate the features and the target variable

#non_char <- df_8[,sapply(df_8, function(x) !is.factor(x))]
# features <- non_char
# target <- df_8$loan_status
# 
# # Perform correlation-based feature selection
# correlation_filter <- nearZeroVar(select_if(features, is.numeric), saveMetrics = TRUE)  # Identify near-zero variance features
# correlation_matrix <- cor(select_if(features, is.numeric))  # Calculate the correlation matrix
# highly_correlated <- findCorrelation(correlation_matrix, cutoff = 0.7)  # Identify highly correlated features
# 
# # Combine the indices of features to remove
# features_to_remove <- c(correlation_filter$nzv, highly_correlated)
# 
# # Remove the selected features
# filtered_features <- features[, -features_to_remove]
# 
# # Print the selected features
# print(filtered_features)

Categorical feature selection

Here I ran a chi-squared test of independence between each categorical variable and loan_status to select the most relevant categorical variables. I decided to keep the 4 features with the smallest p-values.

# Select only factor features
factor_features <- df_8[, sapply(df_8, is.factor)]

# Separate the factor features and the target variable
target <- as.data.frame(factor_features$loan_status)
# Look up loan_status by position within factor_features (not within df_8)
features <- factor_features[, -which(names(factor_features) == "loan_status")]

# Define an empty list to store chi-squared test results
chi2_check <- list()

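# Chi-squared test of independence between loan_status and every factor column.
# loan_status is itself a factor, so it is also tested against itself and
# appears in the results.
for (column in names(df_8)) {
  if (is.factor(df_8[[column]])) {
    chi_test <- chisq.test(df_8$loan_status, df_8[[column]], correct = FALSE)
    chi2_check$Feature <- c(chi2_check$Feature, column)
    chi2_check$p_value <- c(chi2_check$p_value, round(chi_test$p.value, 10))
  }
}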


# Convert the list to a data frame
chi2_result <- as.data.frame(chi2_check)

# Sort the data frame by p-value in ascending order
chi2_result <- chi2_result %>% arrange(p_value)

# Print the chi-square test results
print(chi2_result)

##                Feature   p_value
## 1                 term 0.0000000
## 2                grade 0.0000000
## 3           emp_length 0.0000000
## 4       home_ownership 0.0000000
## 5  verification_status 0.0000000
## 6          loan_status 0.0000000
## 7              purpose 0.0000000
## 8                title 0.0000000
## 9  initial_list_status 0.0000000
## 10           sub_grade 0.7893962
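
With roughly 29,000 rows, most p-values round to zero at 10 decimal places, so p-values alone cannot rank the features. One possible tie-breaker is an effect-size measure such as Cramér's V, which rescales the chi-squared statistic to the range 0 to 1. The sketch below is an illustration on a made-up 2x2 table, not part of the original analysis; the counts and variable names are hypothetical.

```r
# Hypothetical contingency table: loan_status (rows) by term (columns)
tab <- matrix(
  c(30, 10, 20, 40),
  nrow = 2,
  dimnames = list(loan_status = c("0", "1"),
                  term = c("36 months", "60 months"))
)

# Same test as in the loop above (no continuity correction)
test <- chisq.test(tab, correct = FALSE)
chi2_stat <- unname(test$statistic)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
cramers_v <- sqrt(chi2_stat / (sum(tab) * (min(dim(tab)) - 1)))

print(chi2_stat)
print(cramers_v)
```

Sorting features by such a statistic would separate variables whose p-values are all reported as 0.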

# I chose to keep the top 4 categorical features. Let's combine the chosen factor features with the numerical features.

df_9 <- df_8[, !(colnames(df_8) %in%
               c('verification_status','purpose', 'title',
                 'initial_list_status', 'sub_grade'))]

Dummy variables

Then, I created dummy variables for the chosen categorical features.

# create dummy variables for factor features
library(fastDummies)
df_9 <- dummy_cols(df_9,select_columns = "term",
                   remove_most_frequent_dummy = TRUE,
                   remove_selected_columns = TRUE)

df_9 <- dummy_cols(df_9,select_columns = "emp_length",
                   remove_most_frequent_dummy = TRUE,
                   remove_selected_columns = TRUE)

df_9 <- dummy_cols(df_9,select_columns = "home_ownership",
                   remove_most_frequent_dummy = TRUE,
                   remove_selected_columns = TRUE)

df_9 <- dummy_cols(df_9,select_columns = "grade",
                   remove_most_frequent_dummy = TRUE,
                   remove_selected_columns = TRUE)

df <- df_9
df$loan_status <- as.numeric(as.character(df$loan_status))

str(df_9)

## 'data.frame':    29252 obs. of  29 variables:
##  $ annual_inc         : num  65000 40000 32000 48000 70000 ...
##  $ loan_status        : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dti                : num  20.72 24.57 32.41 30.98 6.96 ...
##  $ pub_rec            : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ revol_bal          : num  31578 5084 12070 22950 8256 ...
##  $ revol_util         : num  77 38.8 74 66 49.4 6.3 69.8 54.4 91.5 31.2 ...
##  $ total_acc          : int  42 41 36 41 19 28 23 56 36 25 ...
##  $ total_pymnt        : num  0 10043 222 315 547 ...
##  $ total_rec_int      : num  0 100.8 54.4 79.4 273.5 ...
##  $ tot_cur_bal        : num  52303 175731 202012 108235 126165 ...
##  $ term_ 60           : int  0 0 0 0 1 1 1 0 1 1 ...
##  $ emp_length_< 1 year: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ emp_length_1 year  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ emp_length_2 years : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ emp_length_3 years : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ emp_length_4 years : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ emp_length_5 years : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ emp_length_6 years : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ emp_length_7 years : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ emp_length_8 years : int  0 0 0 0 0 0 1 0 1 1 ...
##  $ emp_length_9 years : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ home_ownership_OWN : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ home_ownership_RENT: int  0 0 0 0 0 1 0 1 0 0 ...
##  $ grade_1            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ grade_2            : int  0 0 1 1 0 1 0 0 0 0 ...
##  $ grade_4            : int  0 0 0 0 0 0 1 0 0 1 ...
##  $ grade_5            : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ grade_6            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ grade_7            : int  0 0 0 0 0 0 0 0 0 0 ...
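
For readers unfamiliar with the encoding, the following base R sketch shows the same idea on a toy factor: one 0/1 column per level, with the most frequent level dropped as the reference category. It uses model.matrix rather than fastDummies, and the data values are hypothetical.

```r
# Toy factor with two levels; "36" is the most frequent one
toy <- data.frame(term = factor(c("36", "36", "60", "36", "60")))

# Make the most frequent level the reference level
most_frequent <- names(which.max(table(toy$term)))
toy$term <- relevel(toy$term, ref = most_frequent)

# model.matrix creates an intercept plus one dummy per non-reference level;
# dropping the intercept leaves only the dummy columns
dummies <- model.matrix(~ term, data = toy)[, -1, drop = FALSE]
print(dummies)
```

Dropping one level per factor avoids perfectly collinear dummy columns in a regression model.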

# Reset df to df_9; loan_status is a factor again from here on
df <- df_9

Fine classing

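# FINE CLASSING
# Deciles of every numeric column
percentile <- apply(X=select_if(df,is.numeric), MARGIN=2, FUN=function(x) round(quantile(x, seq(0.1,1,0.1), na.rm=TRUE),2))

# Unique values per numeric column (named n_unique so it does not mask base::unique)
n_unique <- apply(select_if(df,is.numeric), MARGIN=2, function(x) length(unique(x)))

# Selecting only columns with more than 10 unique levels
# Caution: which() returns positions within the numeric-only subset, but they are
# used here as positions in df; the two line up only if no non-numeric column
# (such as the factor loan_status) comes before them
df.n <- df[which(n_unique>10)]

# Fewer than 10 levels (but more than 1)
df.f <- df[which(n_unique<10 & n_unique>1 )]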

# woebin function in package scorecard
fine_class <- woebin(df, 
                    y = "loan_status", 
                    x = colnames(df.n), 
                    positive = 1, # loan status is in default
                    method = "freq", # frequency method
                    bin_num_limit = 20)  # the max number of fine bins

## ✔ Binning on 29252 rows and 10 columns in 00:00:10

fine_class

## $annual_inc
##       variable             bin count count_distr  neg  pos   posprob
##  1: annual_inc    [-Inf,40000)  4026  0.13763161 2741 1285 0.3191754
##  2: annual_inc   [40000,44000)  1805  0.06170518 1287  518 0.2869806
##  3: annual_inc   [44000,50000)  2184  0.07466156 1608  576 0.2637363
##  4: annual_inc   [50000,58000)  3536  0.12088062 2648  888 0.2511312
##  5: annual_inc   [58000,65000)  2459  0.08406263 1816  643 0.2614884
##  6: annual_inc   [65000,70000)  1709  0.05842336 1310  399 0.2334699
##  7: annual_inc   [70000,75000)  1657  0.05664570 1275  382 0.2305371
##  8: annual_inc   [75000,85000)  2777  0.09493368 2081  696 0.2506302
##  9: annual_inc   [85000,91860)  1785  0.06102147 1392  393 0.2201681
## 10: annual_inc  [91860,110000)  2678  0.09154930 2100  578 0.2158327
## 11: annual_inc [110000,125000)  1487  0.05083413 1191  296 0.1990585
## 12: annual_inc [125000,156000)  1670  0.05709011 1330  340 0.2035928
## 13: annual_inc   [156000, Inf)  1479  0.05056065 1171  308 0.2082488
##              woe       bin_iv   total_iv breaks is_special_values
##  1:  0.343054757 1.753163e-02 0.03670724  40000             FALSE
##  2:  0.190524891 2.344640e-03 0.03670724  44000             FALSE
##  3:  0.073980067 4.161449e-04 0.03670724  50000             FALSE
##  4:  0.008030682 7.811480e-06 0.03670724  58000             FALSE
##  5:  0.062372021 3.321049e-04 0.03670724  65000             FALSE
##  6: -0.088202143 4.444091e-04 0.03670724  70000             FALSE
##  7: -0.104661993 6.041222e-04 0.03670724  75000             FALSE
##  8:  0.005364690 2.735850e-06 0.03670724  85000             FALSE
##  9: -0.164068373 1.574398e-03 0.03670724  91860             FALSE
## 10: -0.189499899 3.129702e-03 0.03670724 110000             FALSE
## 11: -0.291570259 4.001173e-03 0.03670724 125000             FALSE
## 12: -0.263369747 3.694991e-03 0.03670724 156000             FALSE
## 13: -0.234894724 2.623374e-03 0.03670724    Inf             FALSE
## 
## $dti
##     variable           bin count count_distr  neg pos   posprob         woe
##  1:      dti   [-Inf,7.41)  2916  0.09968549 2424 492 0.1687243 -0.49407677
##  2:      dti  [7.41,10.62)  2927  0.10006153 2373 554 0.1892723 -0.35412671
##  3:      dti [10.62,11.96)  1470  0.05025297 1192 278 0.1891156 -0.35514788
##  4:      dti [11.96,14.45)  2907  0.09937782 2329 578 0.1988304 -0.29300154
##  5:      dti [14.45,15.63)  1473  0.05035553 1164 309 0.2097760 -0.22565750
##  6:      dti [15.63,17.94)  2920  0.09982223 2265 655 0.2243151 -0.14007595
##  7:      dti [17.94,19.15)  1471  0.05028716 1156 315 0.2141400 -0.19952955
##  8:      dti [19.15,20.38)  1466  0.05011623 1130 336 0.2291951 -0.11224290
##  9:      dti [20.38,23.08)  2916  0.09968549 2153 763 0.2616598  0.06325939
## 10:      dti [23.08,26.37)  2934  0.10030083 2093 841 0.2866394  0.18885679
## 11:      dti [26.37,30.76)  2919  0.09978805 1944 975 0.3340185  0.41055334
## 12:      dti [30.76,33.83)  1469  0.05021879  905 564 0.3839346  0.62773816
## 13:      dti  [33.83, Inf)  1464  0.05004786  822 642 0.4385246  0.85346676
##           bin_iv  total_iv breaks is_special_values
##  1: 0.0212719822 0.1414588   7.41             FALSE
##  2: 0.0114169442 0.1414588  10.62             FALSE
##  3: 0.0057652853 0.1414588  11.96             FALSE
##  4: 0.0078959193 0.1414588  14.45             FALSE
##  5: 0.0024173422 0.1414588  15.63             FALSE
##  6: 0.0018892903 0.1414588  17.94             FALSE
##  7: 0.0019007748 0.1414588  19.15             FALSE
##  8: 0.0006135025 0.1414588  20.38             FALSE
##  9: 0.0004051991 0.1414588  23.08             FALSE
## 10: 0.0037433021 0.1414588  26.37             FALSE
## 11: 0.0184585363 0.1414588  30.76             FALSE
## 12: 0.0226042513 0.1414588  33.83             FALSE
## 13: 0.0430764906 0.1414588    Inf             FALSE
## 
## $pub_rec
##    variable      bin count count_distr   neg  pos   posprob          woe
## 1:  pub_rec [-Inf,1) 23782   0.8130042 17809 5973 0.2511563  0.008164223
## 2:  pub_rec [1, Inf)  5470   0.1869958  4141 1329 0.2429616 -0.035891669
##          bin_iv     total_iv breaks is_special_values
## 1: 5.430111e-05 0.0002930204      1             FALSE
## 2: 2.387193e-04 0.0002930204    Inf             FALSE
## 
## $revol_bal
##      variable           bin count count_distr  neg pos   posprob          woe
##  1: revol_bal   [-Inf,2699)  2925  0.09999316 2228 697 0.2382906 -0.061455334
##  2: revol_bal   [2699,4610)  2921  0.09985642 2240 681 0.2331393 -0.090049982
##  3: revol_bal   [4610,5443)  1466  0.05011623 1113 353 0.2407913 -0.047727438
##  4: revol_bal   [5443,7288)  2925  0.09999316 2200 725 0.2478632 -0.009422128
##  5: revol_bal   [7288,8266)  1463  0.05001367 1129 334 0.2282980 -0.117327715
##  6: revol_bal   [8266,9321)  1463  0.05001367 1079 384 0.2624744  0.067471444
##  7: revol_bal  [9321,10510)  1463  0.05001367 1089 374 0.2556391  0.031859531
##  8: revol_bal [10510,13265)  2924  0.09995898 2170 754 0.2578659  0.043528778
##  9: revol_bal [13265,14930)  1463  0.05001367 1066 397 0.2713602  0.112886532
## 10: revol_bal [14930,19445)  2924  0.09995898 2177 747 0.2554720  0.030980980
## 11: revol_bal [19445,22768)  1464  0.05004786 1067 397 0.2711749  0.111948886
## 12: revol_bal [22768,33024)  2925  0.09999316 2190 735 0.2512821  0.008832533
## 13: revol_bal [33024,44411)  1463  0.05001367 1070 393 0.2686261  0.099014541
## 14: revol_bal  [44411, Inf)  1463  0.05001367 1132 331 0.2262474 -0.129004027
##           bin_iv  total_iv breaks is_special_values
##  1: 3.718119e-04 0.0051389   2699             FALSE
##  2: 7.913587e-04 0.0051389   4610             FALSE
##  3: 1.127909e-04 0.0051389   5443             FALSE
##  4: 8.856085e-06 0.0051389   7288             FALSE
##  5: 6.680859e-04 0.0051389   8266             FALSE
##  6: 2.315051e-04 0.0051389   9321             FALSE
##  7: 5.116921e-05 0.0051389  10510             FALSE
##  8: 1.914541e-04 0.0051389  13265             FALSE
##  9: 6.551647e-04 0.0051389  14930             FALSE
## 10: 9.668498e-05 0.0051389  19445             FALSE
## 11: 6.446227e-04 0.0051389  22768             FALSE
## 12: 7.818068e-06 0.0051389  33024             FALSE
## 13: 5.023719e-04 0.0051389  44411             FALSE
## 14: 8.052054e-04 0.0051389    Inf             FALSE
## 
## $revol_util
##       variable         bin count count_distr  neg pos   posprob         woe
##  1: revol_util [-Inf,22.5)  4360  0.14904964 3640 720 0.1651376 -0.51986889
##  2: revol_util [22.5,31.7)  2940  0.10050595 2363 577 0.1962585 -0.30922615
##  3: revol_util [31.7,39.6)  2936  0.10036921 2300 636 0.2166213 -0.18484698
##  4: revol_util [39.6,43.5)  1464  0.05004786 1129 335 0.2288251 -0.11433818
##  5: revol_util   [43.5,51)  2920  0.09982223 2243 677 0.2318493 -0.09727941
##  6: revol_util   [51,54.6)  1468  0.05018460 1095 373 0.2540872  0.02368763
##  7: revol_util [54.6,62.2)  2894  0.09893341 2130 764 0.2639945  0.07530939
##  8: revol_util [62.2,70.4)  2944  0.10064269 2109 835 0.2836277  0.17408140
##  9: revol_util [70.4,74.9)  1475  0.05042390 1037 438 0.2969492  0.23875056
## 10: revol_util [74.9,85.7)  2904  0.09927526 2017 887 0.3054408  0.27909730
## 11: revol_util [85.7,92.5)  1482  0.05066320  982 500 0.3373819  0.42563565
## 12: revol_util [92.5, Inf)  1465  0.05008205  905 560 0.3822526  0.62062070
##           bin_iv   total_iv breaks is_special_values
##  1: 3.494991e-02 0.09581975   22.5             FALSE
##  2: 8.854478e-03 0.09581975   31.7             FALSE
##  3: 3.268866e-03 0.09581975   39.6             FALSE
##  4: 6.354045e-04 0.09581975   43.5             FALSE
##  5: 9.214756e-04 0.09581975     51             FALSE
##  6: 2.832545e-05 0.09581975   54.6             FALSE
##  7: 5.716091e-04 0.09581975   62.2             FALSE
##  8: 3.180507e-03 0.09581975   70.4             FALSE
##  9: 3.041642e-03 0.09581975   74.9             FALSE
## 10: 8.256510e-03 0.09581975   85.7             FALSE
## 11: 1.010304e-02 0.09581975   92.5             FALSE
## 12: 2.200799e-02 0.09581975    Inf             FALSE
## 
## $total_acc
##      variable       bin count count_distr  neg  pos   posprob         woe
##  1: total_acc [-Inf,14)  3865  0.13212772 2648 1217 0.3148771  0.32320303
##  2: total_acc   [14,16)  1626  0.05558594 1168  458 0.2816728  0.16443988
##  3: total_acc   [16,19)  2732  0.09339532 1982  750 0.2745242  0.12883035
##  4: total_acc   [19,21)  1966  0.06720908 1466  500 0.2543235  0.02493407
##  5: total_acc   [21,23)  1977  0.06758512 1478  499 0.2524026  0.01477985
##  6: total_acc   [23,25)  2014  0.06884999 1499  515 0.2557100  0.03223226
##  7: total_acc   [25,27)  1899  0.06491864 1402  497 0.2617167  0.06355381
##  8: total_acc   [27,30)  2650  0.09059210 2019  631 0.2381132 -0.06243290
##  9: total_acc   [30,32)  1543  0.05274853 1216  327 0.2119248 -0.21274304
## 10: total_acc   [32,36)  2594  0.08867770 1990  604 0.2328450 -0.09169686
## 11: total_acc   [36,39)  1578  0.05394503 1233  345 0.2186312 -0.17304223
## 12: total_acc   [39,44)  1852  0.06331191 1490  362 0.1954644 -0.31426833
## 13: total_acc [44, Inf)  2956  0.10105292 2359  597 0.2019621 -0.27345711
##           bin_iv   total_iv breaks is_special_values
##  1: 1.487667e-02 0.03609894     14             FALSE
##  2: 1.563938e-03 0.03609894     16             FALSE
##  3: 1.599488e-03 0.03609894     19             FALSE
##  4: 4.204472e-05 0.03609894     21             FALSE
##  5: 1.481813e-05 0.03609894     23             FALSE
##  6: 7.210519e-05 0.03609894     25             FALSE
##  7: 2.663608e-04 0.03609894     27             FALSE
##  8: 3.475699e-04 0.03609894     30             FALSE
##  9: 2.258561e-03 0.03609894     32             FALSE
## 10: 7.283966e-04 0.03609894     36             FALSE
## 11: 1.544539e-03 0.03609894     39             FALSE
## 12: 5.753024e-03 0.03609894     44             FALSE
## 13: 7.031431e-03 0.03609894    Inf             FALSE
## 
## $total_pymnt
##        variable                 bin count count_distr  neg  pos      posprob
##  1: total_pymnt      [-Inf,1401.27)  2925  0.09999316  213 2712 0.9271794872
##  2: total_pymnt   [1401.27,2774.92)  2925  0.09999316  665 2260 0.7726495726
##  3: total_pymnt   [2774.92,4831.76)  2925  0.09999316 1366 1559 0.5329914530
##  4: total_pymnt   [4831.76,7025.24)  2925  0.09999316 2407  518 0.1770940171
##  5: total_pymnt    [7025.24,8320.1)  1463  0.05001367 1336  127 0.0868079289
##  6: total_pymnt   [8320.1,10791.81)  2925  0.09999316 2831   94 0.0321367521
##  7: total_pymnt  [10791.81,14143.6)  2925  0.09999316 2905   20 0.0068376068
##  8: total_pymnt  [14143.6,17579.08)  2925  0.09999316 2923    2 0.0006837607
##  9: total_pymnt [17579.08,20460.89)  1463  0.05001367 1462    1 0.0006835270
## 10: total_pymnt [20460.89,26637.24)  2925  0.09999316 2918    7 0.0023931624
## 11: total_pymnt     [26637.24, Inf)  2926  0.10002735 2924    2 0.0006835270
##            woe     bin_iv total_iv   breaks is_special_values
##  1:  3.6447683 1.31831716 5.701947  1401.27             FALSE
##  2:  2.3239519 0.64886624 5.701947  2774.92             FALSE
##  3:  1.2327767 0.18648312 5.701947  4831.76             FALSE
##  4: -0.4355423 0.01686370 5.701947  7025.24             FALSE
##  5: -1.2526294 0.05445569 5.701947   8320.1             FALSE
##  6: -2.3044716 0.26755321 5.701947 10791.81             FALSE
##  7: -3.8778375 0.50259592 5.701947  14143.6             FALSE
##  8: -6.1865997 0.82215202 5.701947 17579.08             FALSE
##  9: -6.1869418 0.41123967 5.701947 20460.89             FALSE
## 10: -4.9321247 0.65094111 5.701947 26637.24             FALSE
## 11: -6.1869418 0.82247934 5.701947      Inf             FALSE
## 
## $total_rec_int
##          variable               bin count count_distr  neg  pos   posprob
##  1: total_rec_int      [-Inf,69.72)  2925  0.09999316 2502  423 0.1446154
##  2: total_rec_int    [69.72,161.12)  2925  0.09999316 2499  426 0.1456410
##  3: total_rec_int   [161.12,262.44)  2925  0.09999316 2342  583 0.1993162
##  4: total_rec_int   [262.44,317.01)  1463  0.05001367 1157  306 0.2091593
##  5: total_rec_int   [317.01,442.73)  2924  0.09995898 2267  657 0.2246922
##  6: total_rec_int   [442.73,511.74)  1464  0.05004786 1136  328 0.2240437
##  7: total_rec_int   [511.74,683.88)  2925  0.09999316 2204  721 0.2464957
##  8: total_rec_int    [683.88,917.3)  2925  0.09999316 2102  823 0.2813675
##  9: total_rec_int   [917.3,1256.46)  2925  0.09999316 2025  900 0.3076923
## 10: total_rec_int [1256.46,1859.47)  2925  0.09999316 1914 1011 0.3456410
## 11: total_rec_int [1859.47,2486.56)  1463  0.05001367  870  593 0.4053315
## 12: total_rec_int    [2486.56, Inf)  1463  0.05001367  932  531 0.3629528
##             woe       bin_iv  total_iv  breaks is_special_values
##  1: -0.67685466 3.794244e-02 0.1696958   69.72             FALSE
##  2: -0.66858773 3.711296e-02 0.1696958  161.12             FALSE
##  3: -0.28995450 7.786989e-03 0.1696958  262.44             FALSE
##  4: -0.22938177 2.478328e-03 0.1696958  317.01             FALSE
##  5: -0.13790978 1.834867e-03 0.1696958  442.73             FALSE
##  6: -0.14163613 9.680527e-04 0.1696958  511.74             FALSE
##  7: -0.01677118 2.800705e-05 0.1696958  683.88             FALSE
##  8:  0.16293051 2.760979e-03 0.1696958   917.3             FALSE
##  9:  0.28968864 8.979994e-03 0.1696958 1256.46             FALSE
## 10:  0.46236350 2.369938e-02 0.1696958 1859.47             FALSE
## 11:  0.71732004 2.982265e-02 0.1696958 2486.56             FALSE
## 12:  0.53804806 1.628115e-02 0.1696958     Inf             FALSE
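
The woe, bin_iv and total_iv columns can be reproduced by hand from the neg and pos counts. With "positive" meaning default, the weight of evidence of a bin is the log of (share of all defaults in the bin) divided by (share of all non-defaults in the bin), and the bin's IV is the difference between those two shares multiplied by the WoE. The sketch below checks this against the pub_rec table above; it is an illustration of the calculation, with variable names chosen here, not a look inside the scorecard package.

```r
# neg / pos counts copied from the pub_rec table above
bins <- data.frame(
  bin = c("[-Inf,1)", "[1, Inf)"),
  neg = c(17809, 4141),
  pos = c(5973, 1329)
)

# Share of all defaults and of all non-defaults that fall in each bin
pos_share <- bins$pos / sum(bins$pos)
neg_share <- bins$neg / sum(bins$neg)

# Weight of evidence and information value per bin
bins$woe <- log(pos_share / neg_share)
bins$bin_iv <- (pos_share - neg_share) * bins$woe

# Total IV of the variable is the sum over its bins
total_iv <- sum(bins$bin_iv)

print(bins)
print(total_iv)
```

A positive WoE marks a bin with more than its share of defaults, a negative WoE a bin with fewer.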

IV

# IV of the variables
# total_iv is constant within each variable, so take its first value
iv <- map_df(fine_class, ~pluck(.x, "total_iv", 1)) %>%
  pivot_longer(everything(), names_to = "var", values_to = "iv")

iv

## # A tibble: 8 × 2
##   var                 iv
##   <chr>            <dbl>
## 1 annual_inc    0.0367  
## 2 dti           0.141   
## 3 pub_rec       0.000293
## 4 revol_bal     0.00514 
## 5 revol_util    0.0958  
## 6 total_acc     0.0361  
## 7 total_pymnt   5.70    
## 8 total_rec_int 0.170
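
The IV screening applied in the next step can also be written as a rule instead of hard-coding variable names. The sketch below uses the rounded IV values from the table above. The 0.02 cutoff is the one used in this report; the 0.5 upper bound is an added assumption, based on the commonly cited rule of thumb that an IV this high deserves a second look because the variable may carry information about the outcome itself.

```r
# Rounded IV values copied from the table above
iv_values <- c(
  annual_inc = 0.0367, dti = 0.141, pub_rec = 0.000293, revol_bal = 0.00514,
  revol_util = 0.0958, total_acc = 0.0361, total_pymnt = 5.70, total_rec_int = 0.170
)

# Variables below the cutoff used in this report
low_iv <- names(iv_values)[iv_values < 0.02]

# Variables above an assumed "too good to be true" bound
suspicious_iv <- names(iv_values)[iv_values > 0.5]

print(low_iv)
print(suspicious_iv)
```

The low_iv vector could then be used to drop columns by name, for example with df[, !(names(df) %in% low_iv)].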

Remove variables with low predictive power (IV < 0.02)

# Remove variables with low predictive power (IV < 0.02) 
df <- subset(df, select = -c(pub_rec, revol_bal))

df.n <- subset(df.n, select = -c(pub_rec, revol_bal))

fine_class_final <- woebin(df, 
                    y = "loan_status", 
                    x = colnames(df.n), 
                    positive = 1, # loan status is in default
                    method = "freq", # frequency method
                    bin_num_limit = 20)  # the max number of fine bins

## ✔ Binning on 29252 rows and 8 columns in 00:00:10

fine_class_final

## $annual_inc
##       variable             bin count count_distr  neg  pos   posprob
##  1: annual_inc    [-Inf,40000)  4026  0.13763161 2741 1285 0.3191754
##  2: annual_inc   [40000,44000)  1805  0.06170518 1287  518 0.2869806
##  3: annual_inc   [44000,50000)  2184  0.07466156 1608  576 0.2637363
##  4: annual_inc   [50000,58000)  3536  0.12088062 2648  888 0.2511312
##  5: annual_inc   [58000,65000)  2459  0.08406263 1816  643 0.2614884
##  6: annual_inc   [65000,70000)  1709  0.05842336 1310  399 0.2334699
##  7: annual_inc   [70000,75000)  1657  0.05664570 1275  382 0.2305371
##  8: annual_inc   [75000,85000)  2777  0.09493368 2081  696 0.2506302
##  9: annual_inc   [85000,91860)  1785  0.06102147 1392  393 0.2201681
## 10: annual_inc  [91860,110000)  2678  0.09154930 2100  578 0.2158327
## 11: annual_inc [110000,125000)  1487  0.05083413 1191  296 0.1990585
## 12: annual_inc [125000,156000)  1670  0.05709011 1330  340 0.2035928
## 13: annual_inc   [156000, Inf)  1479  0.05056065 1171  308 0.2082488
##              woe       bin_iv   total_iv breaks is_special_values
##  1:  0.343054757 1.753163e-02 0.03670724  40000             FALSE
##  2:  0.190524891 2.344640e-03 0.03670724  44000             FALSE
##  3:  0.073980067 4.161449e-04 0.03670724  50000             FALSE
##  4:  0.008030682 7.811480e-06 0.03670724  58000             FALSE
##  5:  0.062372021 3.321049e-04 0.03670724  65000             FALSE
##  6: -0.088202143 4.444091e-04 0.03670724  70000             FALSE
##  7: -0.104661993 6.041222e-04 0.03670724  75000             FALSE
##  8:  0.005364690 2.735850e-06 0.03670724  85000             FALSE
##  9: -0.164068373 1.574398e-03 0.03670724  91860             FALSE
## 10: -0.189499899 3.129702e-03 0.03670724 110000             FALSE
## 11: -0.291570259 4.001173e-03 0.03670724 125000             FALSE
## 12: -0.263369747 3.694991e-03 0.03670724 156000             FALSE
## 13: -0.234894724 2.623374e-03 0.03670724    Inf             FALSE
## 
## $dti
##     variable           bin count count_distr  neg pos   posprob         woe
##  1:      dti   [-Inf,7.41)  2916  0.09968549 2424 492 0.1687243 -0.49407677
##  2:      dti  [7.41,10.62)  2927  0.10006153 2373 554 0.1892723 -0.35412671
##  3:      dti [10.62,11.96)  1470  0.05025297 1192 278 0.1891156 -0.35514788
##  4:      dti [11.96,14.45)  2907  0.09937782 2329 578 0.1988304 -0.29300154
##  5:      dti [14.45,15.63)  1473  0.05035553 1164 309 0.2097760 -0.22565750
##  6:      dti [15.63,17.94)  2920  0.09982223 2265 655 0.2243151 -0.14007595
##  7:      dti [17.94,19.15)  1471  0.05028716 1156 315 0.2141400 -0.19952955
##  8:      dti [19.15,20.38)  1466  0.05011623 1130 336 0.2291951 -0.11224290
##  9:      dti [20.38,23.08)  2916  0.09968549 2153 763 0.2616598  0.06325939
## 10:      dti [23.08,26.37)  2934  0.10030083 2093 841 0.2866394  0.18885679
## 11:      dti [26.37,30.76)  2919  0.09978805 1944 975 0.3340185  0.41055334
## 12:      dti [30.76,33.83)  1469  0.05021879  905 564 0.3839346  0.62773816
## 13:      dti  [33.83, Inf)  1464  0.05004786  822 642 0.4385246  0.85346676
##           bin_iv  total_iv breaks is_special_values
##  1: 0.0212719822 0.1414588   7.41             FALSE
##  2: 0.0114169442 0.1414588  10.62             FALSE
##  3: 0.0057652853 0.1414588  11.96             FALSE
##  4: 0.0078959193 0.1414588  14.45             FALSE
##  5: 0.0024173422 0.1414588  15.63             FALSE
##  6: 0.0018892903 0.1414588  17.94             FALSE
##  7: 0.0019007748 0.1414588  19.15             FALSE
##  8: 0.0006135025 0.1414588  20.38             FALSE
##  9: 0.0004051991 0.1414588  23.08             FALSE
## 10: 0.0037433021 0.1414588  26.37             FALSE
## 11: 0.0184585363 0.1414588  30.76             FALSE
## 12: 0.0226042513 0.1414588  33.83             FALSE
## 13: 0.0430764906 0.1414588    Inf             FALSE
## 
## $revol_util
##       variable         bin count count_distr  neg pos   posprob         woe
##  1: revol_util [-Inf,22.5)  4360  0.14904964 3640 720 0.1651376 -0.51986889
##  2: revol_util [22.5,31.7)  2940  0.10050595 2363 577 0.1962585 -0.30922615
##  3: revol_util [31.7,39.6)  2936  0.10036921 2300 636 0.2166213 -0.18484698
##  4: revol_util [39.6,43.5)  1464  0.05004786 1129 335 0.2288251 -0.11433818
##  5: revol_util   [43.5,51)  2920  0.09982223 2243 677 0.2318493 -0.09727941
##  6: revol_util   [51,54.6)  1468  0.05018460 1095 373 0.2540872  0.02368763
##  7: revol_util [54.6,62.2)  2894  0.09893341 2130 764 0.2639945  0.07530939
##  8: revol_util [62.2,70.4)  2944  0.10064269 2109 835 0.2836277  0.17408140
##  9: revol_util [70.4,74.9)  1475  0.05042390 1037 438 0.2969492  0.23875056
## 10: revol_util [74.9,85.7)  2904  0.09927526 2017 887 0.3054408  0.27909730
## 11: revol_util [85.7,92.5)  1482  0.05066320  982 500 0.3373819  0.42563565
## 12: revol_util [92.5, Inf)  1465  0.05008205  905 560 0.3822526  0.62062070
##           bin_iv   total_iv breaks is_special_values
##  1: 3.494991e-02 0.09581975   22.5             FALSE
##  2: 8.854478e-03 0.09581975   31.7             FALSE
##  3: 3.268866e-03 0.09581975   39.6             FALSE
##  4: 6.354045e-04 0.09581975   43.5             FALSE
##  5: 9.214756e-04 0.09581975     51             FALSE
##  6: 2.832545e-05 0.09581975   54.6             FALSE
##  7: 5.716091e-04 0.09581975   62.2             FALSE
##  8: 3.180507e-03 0.09581975   70.4             FALSE
##  9: 3.041642e-03 0.09581975   74.9             FALSE
## 10: 8.256510e-03 0.09581975   85.7             FALSE
## 11: 1.010304e-02 0.09581975   92.5             FALSE
## 12: 2.200799e-02 0.09581975    Inf             FALSE
## 
## $total_acc
##      variable       bin count count_distr  neg  pos   posprob         woe
##  1: total_acc [-Inf,14)  3865  0.13212772 2648 1217 0.3148771  0.32320303
##  2: total_acc   [14,16)  1626  0.05558594 1168  458 0.2816728  0.16443988
##  3: total_acc   [16,19)  2732  0.09339532 1982  750 0.2745242  0.12883035
##  4: total_acc   [19,21)  1966  0.06720908 1466  500 0.2543235  0.02493407
##  5: total_acc   [21,23)  1977  0.06758512 1478  499 0.2524026  0.01477985
##  6: total_acc   [23,25)  2014  0.06884999 1499  515 0.2557100  0.03223226
##  7: total_acc   [25,27)  1899  0.06491864 1402  497 0.2617167  0.06355381
##  8: total_acc   [27,30)  2650  0.09059210 2019  631 0.2381132 -0.06243290
##  9: total_acc   [30,32)  1543  0.05274853 1216  327 0.2119248 -0.21274304
## 10: total_acc   [32,36)  2594  0.08867770 1990  604 0.2328450 -0.09169686
## 11: total_acc   [36,39)  1578  0.05394503 1233  345 0.2186312 -0.17304223
## 12: total_acc   [39,44)  1852  0.06331191 1490  362 0.1954644 -0.31426833
## 13: total_acc [44, Inf)  2956  0.10105292 2359  597 0.2019621 -0.27345711
##           bin_iv   total_iv breaks is_special_values
##  1: 1.487667e-02 0.03609894     14             FALSE
##  2: 1.563938e-03 0.03609894     16             FALSE
##  3: 1.599488e-03 0.03609894     19             FALSE
##  4: 4.204472e-05 0.03609894     21             FALSE
##  5: 1.481813e-05 0.03609894     23             FALSE
##  6: 7.210519e-05 0.03609894     25             FALSE
##  7: 2.663608e-04 0.03609894     27             FALSE
##  8: 3.475699e-04 0.03609894     30             FALSE
##  9: 2.258561e-03 0.03609894     32             FALSE
## 10: 7.283966e-04 0.03609894     36             FALSE
## 11: 1.544539e-03 0.03609894     39             FALSE
## 12: 5.753024e-03 0.03609894     44             FALSE
## 13: 7.031431e-03 0.03609894    Inf             FALSE
## 
## $total_pymnt
##        variable                 bin count count_distr  neg  pos      posprob
##  1: total_pymnt      [-Inf,1401.27)  2925  0.09999316  213 2712 0.9271794872
##  2: total_pymnt   [1401.27,2774.92)  2925  0.09999316  665 2260 0.7726495726
##  3: total_pymnt   [2774.92,4831.76)  2925  0.09999316 1366 1559 0.5329914530
##  4: total_pymnt   [4831.76,7025.24)  2925  0.09999316 2407  518 0.1770940171
##  5: total_pymnt    [7025.24,8320.1)  1463  0.05001367 1336  127 0.0868079289
##  6: total_pymnt   [8320.1,10791.81)  2925  0.09999316 2831   94 0.0321367521
##  7: total_pymnt  [10791.81,14143.6)  2925  0.09999316 2905   20 0.0068376068
##  8: total_pymnt  [14143.6,17579.08)  2925  0.09999316 2923    2 0.0006837607
##  9: total_pymnt [17579.08,20460.89)  1463  0.05001367 1462    1 0.0006835270
## 10: total_pymnt [20460.89,26637.24)  2925  0.09999316 2918    7 0.0023931624
## 11: total_pymnt     [26637.24, Inf)  2926  0.10002735 2924    2 0.0006835270
##            woe     bin_iv total_iv   breaks is_special_values
##  1:  3.6447683 1.31831716 5.701947  1401.27             FALSE
##  2:  2.3239519 0.64886624 5.701947  2774.92             FALSE
##  3:  1.2327767 0.18648312 5.701947  4831.76             FALSE
##  4: -0.4355423 0.01686370 5.701947  7025.24             FALSE
##  5: -1.2526294 0.05445569 5.701947   8320.1             FALSE
##  6: -2.3044716 0.26755321 5.701947 10791.81             FALSE
##  7: -3.8778375 0.50259592 5.701947  14143.6             FALSE
##  8: -6.1865997 0.82215202 5.701947 17579.08             FALSE
##  9: -6.1869418 0.41123967 5.701947 20460.89             FALSE
## 10: -4.9321247 0.65094111 5.701947 26637.24             FALSE
## 11: -6.1869418 0.82247934 5.701947      Inf             FALSE
## 
## $total_rec_int
##          variable               bin count count_distr  neg  pos   posprob
##  1: total_rec_int      [-Inf,69.72)  2925  0.09999316 2502  423 0.1446154
##  2: total_rec_int    [69.72,161.12)  2925  0.09999316 2499  426 0.1456410
##  3: total_rec_int   [161.12,262.44)  2925  0.09999316 2342  583 0.1993162
##  4: total_rec_int   [262.44,317.01)  1463  0.05001367 1157  306 0.2091593
##  5: total_rec_int   [317.01,442.73)  2924  0.09995898 2267  657 0.2246922
##  6: total_rec_int   [442.73,511.74)  1464  0.05004786 1136  328 0.2240437
##  7: total_rec_int   [511.74,683.88)  2925  0.09999316 2204  721 0.2464957
##  8: total_rec_int    [683.88,917.3)  2925  0.09999316 2102  823 0.2813675
##  9: total_rec_int   [917.3,1256.46)  2925  0.09999316 2025  900 0.3076923
## 10: total_rec_int [1256.46,1859.47)  2925  0.09999316 1914 1011 0.3456410
## 11: total_rec_int [1859.47,2486.56)  1463  0.05001367  870  593 0.4053315
## 12: total_rec_int    [2486.56, Inf)  1463  0.05001367  932  531 0.3629528
##             woe       bin_iv  total_iv  breaks is_special_values
##  1: -0.67685466 3.794244e-02 0.1696958   69.72             FALSE
##  2: -0.66858773 3.711296e-02 0.1696958  161.12             FALSE
##  3: -0.28995450 7.786989e-03 0.1696958  262.44             FALSE
##  4: -0.22938177 2.478328e-03 0.1696958  317.01             FALSE
##  5: -0.13790978 1.834867e-03 0.1696958  442.73             FALSE
##  6: -0.14163613 9.680527e-04 0.1696958  511.74             FALSE
##  7: -0.01677118 2.800705e-05 0.1696958  683.88             FALSE
##  8:  0.16293051 2.760979e-03 0.1696958   917.3             FALSE
##  9:  0.28968864 8.979994e-03 0.1696958 1256.46             FALSE
## 10:  0.46236350 2.369938e-02 0.1696958 1859.47             FALSE
## 11:  0.71732004 2.982265e-02 0.1696958 2486.56             FALSE
## 12:  0.53804806 1.628115e-02 0.1696958     Inf             FALSE

Coarse classing

# COARSE CLASSING
# Plots
plot <- woebin_plot(fine_class_final)

The probability of default decreases slightly as annual income increases, which agrees with my intuition.

For annual_inc the coarse classing could be (1) < 44000 (2) 44000 - 85000 (3) > 85000

# annual_inc
plot[[1]]

The probability of default increases slightly as DTI increases. A higher DTI (debt-to-income ratio) indicates a higher level of debt relative to income, which can be seen as a potential risk factor.

For DTI the coarse classing could be (1) < 14 (2) 14-23 (3) > 23

# DTI
plot[[2]]

The probability of default increases slightly as revol_util increases. A higher revolving utilization ratio indicates that a borrower is using a larger portion of their available credit, which can be seen as a potential risk factor.

For revol_util the coarse classing could be (1) < 31 (2) 31-43 (3) 43-70 (4) > 70

# revol_util
plot[[3]]

The probability of default decreases slightly as total_acc increases. A plausible intuition is that a higher total number of accounts is often an indicator of a customer's creditworthiness and financial stability: multiple accounts suggest a diverse credit portfolio, which can indicate responsible credit management.

For total_acc the coarse classing could be (1) < 14 (2) 14-27 (3) >27

# total acc
plot[[4]]

The probability of default drops rapidly and then remains close to zero as total payment increases. It starts at 92.7% at total payments of about 1400 and drops to 8.7% at about 7025.

As customers make larger total payments, it indicates improved financial stability and a stronger ability to meet their payment obligations. This reduces the risk of default as customers demonstrate their commitment to fulfilling their financial responsibilities.

For total_pymnt the coarse classing could be (1) < 1401 (2) 1401- 2775 (3) 2775- 4831 (4) 4831- 8320 (5) > 8320

# total_pymnt
plot[[5]]

The positive probability of default increases as total_rec_int increases.

For total_rec_int the coarse classing could be (1) < 263 (2) 263- 683 (3) > 683

# total_rec_int
plot[[6]]

breaks_list <- list(annual_inc = c("44000", "85000"), 
                    dti = c("14", "23"), 
                    revol_util = c("31","43","70"), 
                    total_acc = c("14", "27"),
                    total_pymnt = c("1401", "2775", 
                                    "4831","8320"),
                    total_rec_int = c("263", "683")
                    )

coarse_class <- woebin(df, 
                       y = "loan_status", 
                       x = colnames(df.n), 
                       positive = 1, 
                       method = "freq", 
                       breaks_list = breaks_list) # from coarse classing results

## ✔ Binning on 29252 rows and 8 columns in 00:00:10

coarse_class

## $annual_inc
##      variable           bin count count_distr   neg  pos   posprob          woe
## 1: annual_inc  [-Inf,44000)  5831   0.1993368  4028 1803 0.3092094  0.296800826
## 2: annual_inc [44000,85000) 14322   0.4896075 10738 3584 0.2502444  0.003309499
## 3: annual_inc  [85000, Inf)  9099   0.3110557  7184 1915 0.2104627 -0.221519852
##          bin_iv   total_iv breaks is_special_values
## 1: 1.882034e-02 0.03323167  44000             FALSE
## 2: 5.367009e-06 0.03323167  85000             FALSE
## 3: 1.440596e-02 0.03323167    Inf             FALSE
## 
## $dti
##    variable       bin count count_distr  neg  pos   posprob        woe
## 1:      dti [-Inf,14)  9630   0.3292083 7843 1787 0.1855659 -0.3784643
## 2:      dti   [14,23) 10759   0.3678039 8290 2469 0.2294823 -0.1106179
## 3:      dti [23, Inf)  8863   0.3029878 5817 3046 0.3436760  0.4536634
##         bin_iv  total_iv breaks is_special_values
## 1: 0.042609255 0.1160021     14             FALSE
## 2: 0.004374938 0.1160021     23             FALSE
## 3: 0.069017906 0.1160021    Inf             FALSE
## 
## $revol_util
##      variable       bin count count_distr  neg  pos   posprob         woe
## 1: revol_util [-Inf,31)  7072   0.2417612 5821 1251 0.1768948 -0.43690998
## 2: revol_util   [31,43)  4438   0.1517161 3459  979 0.2205949 -0.16158431
## 3: revol_util   [43,70) 10253   0.3505059 7611 2642 0.2576807  0.04256049
## 4: revol_util [70, Inf)  7489   0.2560167 5059 2430 0.3244759  0.36734128
##          bin_iv  total_iv breaks is_special_values
## 1: 0.0410130442 0.0830356     31             FALSE
## 2: 0.0037992615 0.0830356     43             FALSE
## 3: 0.0006416455 0.0830356     70             FALSE
## 4: 0.0375816497 0.0830356    Inf             FALSE
## 
## $total_acc
##     variable       bin count count_distr   neg  pos   posprob         woe
## 1: total_acc [-Inf,14)  3865   0.1321277  2648 1217 0.3148771  0.32320303
## 2: total_acc   [14,27) 12214   0.4175441  8995 3219 0.2635500  0.07302074
## 3: total_acc [27, Inf) 13173   0.4503282 10307 2866 0.2175662 -0.17928709
##         bin_iv   total_iv breaks is_special_values
## 1: 0.014876665 0.03096147     14             FALSE
## 2: 0.002266793 0.03096147     27             FALSE
## 3: 0.013818013 0.03096147    Inf             FALSE
## 
## $total_pymnt
##       variable         bin count count_distr   neg  pos     posprob        woe
## 1: total_pymnt [-Inf,1401)  2925  0.09999316   213 2712 0.927179487  3.6447683
## 2: total_pymnt [1401,2775)  2926  0.10002735   665 2261 0.772727273  2.3243943
## 3: total_pymnt [2775,4831)  2924  0.09995898  1366 1558 0.532831737  1.2321350
## 4: total_pymnt [4831,8320)  4388  0.15000684  3743  645 0.146991796 -0.6577735
## 5: total_pymnt [8320, Inf) 16089  0.55001367 15963  126 0.007831438 -3.7411281
##        bin_iv total_iv breaks is_special_values
## 1: 1.31831716 4.864063   1401             FALSE
## 2: 0.64930808 4.864063   2775             FALSE
## 3: 0.18621732 4.864063   4831             FALSE
## 4: 0.05406369 4.864063   8320             FALSE
## 5: 2.65615674 4.864063    Inf             FALSE
## 
## $total_rec_int
##         variable        bin count count_distr  neg  pos   posprob        woe
## 1: total_rec_int [-Inf,263)  8796   0.3006974 7359 1437 0.1633697 -0.5327476
## 2: total_rec_int  [263,683)  8745   0.2989539 6742 2003 0.2290452 -0.1130917
## 3: total_rec_int [683, Inf) 11711   0.4003487 7849 3862 0.3297754  0.3914179
##         bin_iv  total_iv breaks is_special_values
## 1: 0.073767726 0.1445362    263             FALSE
## 2: 0.003714408 0.1445362    683             FALSE
## 3: 0.067054103 0.1445362    Inf             FALSE
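
The woe and bin_iv columns above can be checked by hand. This is a minimal sketch, assuming WOE is the log of a bin's share of positives over its share of negatives, which matches the printed values. The counts are copied from the annual_inc table, and the variable names are illustrative only.

```r
# Counts copied from the $annual_inc table above
neg <- c(4028, 10738, 7184)   # non-defaults per bin
pos <- c(1803, 3584, 1915)    # defaults per bin

# Share of all positives / negatives that fall into each bin
pos_distr <- pos / sum(pos)
neg_distr <- neg / sum(neg)

# Weight of evidence and information value per bin
woe      <- log(pos_distr / neg_distr)
bin_iv   <- (pos_distr - neg_distr) * woe
total_iv <- sum(bin_iv)

round(woe, 4)
round(total_iv, 4)
```

With positive = 1, a positive WOE marks a bin with more than its share of defaults, so the lowest-income bin has the highest WOE here.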

I transformed variables into WOE values.

# Transform variables into WOE values, which will be used in the model building
df_woe <- woebin_ply(df,coarse_class)

## ✔ Woe transformating on 29252 rows and 6 columns in 00:00:10

df_woe

And then I split the data into training (80%) and test (20%) sets.

set.seed(123)

# Split the data into training 80% and test 20% sets
train_indices <- sample(nrow(df_woe), floor(0.8 * nrow(df_woe)))  # 80% for training
train <- df_woe[train_indices, ]
test <- df_woe[-train_indices, ]

I fitted a logit model and then performed stepwise variable selection.
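
The fitting code itself is not shown. The following is a minimal sketch of what such a step could look like, run on simulated data with hypothetical columns (x1, x2, noise). Only the object name step_model_log comes from this report; the use of glm() followed by step() with AIC in both directions is an assumption.

```r
set.seed(123)

# Simulated stand-in for the WOE-transformed training data (hypothetical columns)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$loan_status <- rbinom(n, 1, plogis(-1 + 1.5 * sim$x1 - 1 * sim$x2))

# Full logit model with all candidate predictors
full_model <- glm(loan_status ~ ., family = binomial("logit"), data = sim)

# Stepwise selection by AIC in both directions, without printing each step
step_model_log <- step(full_model, direction = "both", trace = 0)

summary(step_model_log)
```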

And this is my final logit model:

final_model <- step_model_log
summary(final_model)

## 
## Call:
## glm(formula = loan_status ~ `term_ 60 ` + `emp_length_< 1 year` + 
##     `emp_length_1 year` + `emp_length_2 years` + `emp_length_5 years` + 
##     `emp_length_7 years` + home_ownership_RENT + grade_1 + grade_4 + 
##     grade_5 + grade_6 + grade_7 + annual_inc_woe + dti_woe + 
##     revol_util_woe + total_acc_woe + total_pymnt_woe + total_rec_int_woe, 
##     family = binomial("logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9986  -0.1475  -0.0241   0.0090   5.0035  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -1.58562    0.07969 -19.898  < 2e-16 ***
## `term_ 60 `            0.76288    0.13197   5.781 7.44e-09 ***
## `emp_length_< 1 year`  0.23605    0.12894   1.831  0.06715 .  
## `emp_length_1 year`    0.44347    0.14705   3.016  0.00256 ** 
## `emp_length_2 years`   0.20907    0.13231   1.580  0.11407    
## `emp_length_5 years`   0.36849    0.16049   2.296  0.02168 *  
## `emp_length_7 years`   0.40660    0.17548   2.317  0.02050 *  
## home_ownership_RENT    0.17234    0.08049   2.141  0.03226 *  
## grade_1                0.24372    0.15504   1.572  0.11597    
## grade_4                0.26576    0.09720   2.734  0.00626 ** 
## grade_5                0.62466    0.13576   4.601 4.20e-06 ***
## grade_6                1.12863    0.21546   5.238 1.62e-07 ***
## grade_7                1.42825    0.34321   4.162 3.16e-05 ***
## annual_inc_woe        -2.30175    0.23820  -9.663  < 2e-16 ***
## dti_woe                0.59709    0.12105   4.933 8.12e-07 ***
## revol_util_woe         0.32849    0.13613   2.413  0.01582 *  
## total_acc_woe          0.50291    0.23148   2.173  0.02981 *  
## total_pymnt_woe        1.82173    0.03110  58.579  < 2e-16 ***
## total_rec_int_woe      7.44755    0.17878  41.658  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26387.9  on 23400  degrees of freedom
## Residual deviance:  4892.1  on 23382  degrees of freedom
## AIC: 4930.1
## 
## Number of Fisher Scoring iterations: 8

# predict on in sample
train$prob <- predict(final_model,train,type='response')

train$predict <- ifelse(train$prob > 0.5,1,0)

In sample accuracy:

# Accuracy calculation
accuracy <- mean(train$loan_status == train$predict)
accuracy

## [1] 0.972608

I predicted the probabilities on out of sample data.

# predict on out of sample
test$prob <- predict(final_model,test,type='response')

test$predict <- ifelse(test$prob > 0.5,1,0)

Validation

Out of sample accuracy:

# Accuracy calculation
accuracy <- mean(test$loan_status == test$predict)
accuracy

## [1] 0.9752179

Confusion matrix:

table(test$loan_status, test$predict)

##    
##        0    1
##   0 4343   88
##   1   57 1363
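
Accuracy alone hides how the errors split between the two classes. As a small sketch, the counts from the table above (rows are actual status, columns are predicted class) give the usual per-class rates; the object names are illustrative.

```r
# Confusion matrix copied from the output above (filled column by column)
cm <- matrix(c(4343, 57, 88, 1363), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # share of correct predictions
sensitivity <- cm["1", "1"] / sum(cm["1", ])    # defaults correctly flagged
specificity <- cm["0", "0"] / sum(cm["0", ])    # non-defaults correctly passed
precision   <- cm["1", "1"] / sum(cm[, "1"])    # flagged loans that did default

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```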

AUC and ROC curve:

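# Calculate AUC and ROC curve
# Note: roc() (assumed to be pROC's) takes (response, predictor). Here the 0/1
# class predictions are passed as the response and the observed status as the
# predictor, so the AUC below reflects the thresholded classes only. The usual
# call would be roc(test$loan_status, test$prob).
roc_obj <- roc(test$predict, as.numeric(as.character(test$loan_status)))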
auc <- auc(roc_obj)
roc_curve <- roc_obj$roc

# View AUC 
print(paste("AUC:", auc))

## [1] "AUC: 0.963198812731032"
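
Because roc() above receives 0/1 classes rather than probabilities, the reported AUC can be reproduced from the confusion-matrix counts alone. This sketch uses those counts; the variable names are illustrative.

```r
# Counts from the confusion matrix above
tn <- 4343; fp <- 88; fn <- 57; tp <- 1363

# Predicted class as response, actual status as predictor:
# the AUC reduces to the mean of the two predictive values
ppv <- tp / (tp + fp)
npv <- tn / (tn + fn)
auc_as_reported <- (ppv + npv) / 2

# Actual status as response, predicted class as predictor:
# a single threshold gives the mean of sensitivity and specificity
sens <- tp / (tp + fn)
spec <- tn / (tn + fp)
auc_single_threshold <- (sens + spec) / 2

round(c(auc_as_reported = auc_as_reported,
        auc_single_threshold = auc_single_threshold), 4)
```

Neither figure uses the ranking information in test$prob. Passing the fitted probabilities as the predictor would give the conventional AUC, which may differ from both.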

# Roc curve
g <- roc(loan_status ~ predict, data = test)
plot(g, col = 'purple', main = 'ROC curve', legacy.axes = TRUE, ylim = c(0, 1))

Gini

I calculated the Gini coefficient, and the result confirms the good discriminatory power of my model, meaning that it separates the positive and negative classes well.

# Calculate Gini coefficient
gini <- 2 * auc - 1

# View Gini coefficient
print(paste("Gini Coefficient:", gini))

## [1] "Gini Coefficient: 0.926397625462064"

HL test on goodness of fit

H0: Model is correct. The p-value is far above 0.05, so there is not enough evidence to reject the null. Note that failing to reject H0 means the test found no evidence of misfit; it is not positive evidence that the model is correct.

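# H0: Model is correct
# Note: hoslem.test(x, y, g) takes the observed 0/1 outcomes as x and the
# fitted probabilities as y. test$predict holds thresholded 0/1 classes, so
# test$prob would be the more usual second argument.
library(ResourceSelection)
hl_gof <- hoslem.test(as.numeric(as.character(test$loan_status)), test$predict, g = 10)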
hl_gof

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  as.numeric(as.character(test$loan_status)), test$predict
## X-squared = 0.88071, df = 8, p-value = 0.9989

The HL test is sensitive to the number of bins, so I tried a few other values. With g = 3 the p-value (0.348) is again above 0.05, so the null of a correctly specified model is still not rejected.

# H0: Model is correct 
library(ResourceSelection)
hl_gof <- hoslem.test(as.numeric(as.character(test$loan_status)), test$predict, g = 3)
hl_gof

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  as.numeric(as.character(test$loan_status)), test$predict
## X-squared = 0.88071, df = 1, p-value = 0.348
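
The statistic is identical for g = 10 and g = 3, which is consistent with all observations falling into a single bin when 0/1 predictions are used. The sketch below first reproduces the reported X-squared from the confusion-matrix margins, then writes the HL statistic out in base R for fitted probabilities. The function hl_stat() and the simulated data are illustrative, not part of the original analysis.

```r
# 1) One-bin comparison of observed vs predicted counts (test set margins)
obs_1 <- 57 + 1363    # actual defaults
obs_0 <- 4343 + 88    # actual non-defaults
exp_1 <- 88 + 1363    # predicted defaults (sum of 0/1 predictions)
exp_0 <- 4343 + 57    # predicted non-defaults
x2_single_bin <- (obs_1 - exp_1)^2 / exp_1 + (obs_0 - exp_0)^2 / exp_0

# 2) HL statistic on fitted probabilities, grouped by quantiles
hl_stat <- function(obs, prob, g = 10) {
  cuts <- unique(quantile(prob, probs = seq(0, 1, length.out = g + 1)))
  grp  <- cut(prob, breaks = cuts, include.lowest = TRUE)
  o1 <- tapply(obs, grp, sum);      e1 <- tapply(prob, grp, sum)
  o0 <- tapply(1 - obs, grp, sum);  e0 <- tapply(1 - prob, grp, sum)
  stat <- sum((o1 - e1)^2 / e1 + (o0 - e0)^2 / e0)
  df   <- nlevels(grp) - 2
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated example (hypothetical data, not the loan data)
set.seed(1)
x <- rnorm(2000)
y <- rbinom(2000, 1, plogis(-1 + x))
fit <- glm(y ~ x, family = binomial("logit"))
res <- hl_stat(y, fitted(fit), g = 10)
res
```

In this sketch the degrees of freedom follow the number of bins actually formed, whereas the output above reports df = 8 and df = 1 for the same statistic.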

VIF

I generated VIF to check multicollinearity. All variables have VIF < 5, so there is no multicollinearity issue.

# generate VIF 
car::vif(final_model)

##           `term_ 60 ` `emp_length_< 1 year`   `emp_length_1 year` 
##              1.679994              1.082030              1.054890 
##  `emp_length_2 years`  `emp_length_5 years`  `emp_length_7 years` 
##              1.064928              1.043988              1.036283 
##   home_ownership_RENT               grade_1               grade_4 
##              1.101620              1.117350              1.126389 
##               grade_5               grade_6               grade_7 
##              1.237922              1.231219              1.103347 
##        annual_inc_woe               dti_woe        revol_util_woe 
##              1.450497              1.159428              1.096810 
##         total_acc_woe       total_pymnt_woe     total_rec_int_woe 
##              1.227247              3.847057              3.428363

# No multicollinearity issue as all variables have VIF < 5.
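
For intuition, the textbook VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the others. This is a small sketch on simulated data with hypothetical variables; car::vif() on a glm is computed from the fitted model, so its values need not match this linear-model version exactly.

```r
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # deliberately correlated with x1
x3 <- rnorm(n)                        # independent of the others

# VIF of `target` given a data frame of the remaining predictors
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_x1 <- vif_manual(x1, data.frame(x2, x3))
vif_x3 <- vif_manual(x3, data.frame(x1, x2))

round(c(x1 = vif_x1, x3 = vif_x3), 2)
```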

LR test on the significance of variables

H0: Variables are statistically irrelevant. As the p-value is ~0, we have strong evidence to reject the null hypothesis that all variables are statistically irrelevant.

# LR test on the significance of variables
# H0 variables are statistically irrelevant
ist <- pchisq(final_model$null.deviance - final_model$deviance,
              final_model$df.null - final_model$df.residual,
              lower.tail = FALSE)

ist

## [1] 0

# p-value ~0, thus we have strong evidence to reject the null and conclude that the variables are jointly statistically relevant.
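
The same test can be followed step by step with the deviances printed in summary(final_model). This sketch copies those numbers; the variable names are illustrative.

```r
# Deviances and degrees of freedom copied from summary(final_model)
null_dev <- 26387.9; null_df <- 23400
res_dev  <- 4892.1;  res_df  <- 23382

lr_stat <- null_dev - res_dev   # likelihood-ratio statistic
lr_df   <- null_df - res_df     # number of slope coefficients tested
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value)
```

The statistic is about 21495.8 on 18 degrees of freedom, which is why the p-value prints as 0.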

Scorecard

# scorecard creation

var_select <- c( 
         'annual_inc', 'dti', 'revol_util',  
         'total_acc', 'total_pymnt', 'total_rec_int')

breaks_list <- list(annual_inc = c("44000", "85000"), 
                    dti = c("14", "23"), 
                    revol_util = c("31","43","70"), 
                    total_acc = c("14", "27"),
                    total_pymnt = c("1401", "2775", 
                                    "4831","8320"),
                    total_rec_int = c("263", "683")
                    )

bins <- woebin(df,
               y = 'loan_status',
               x = var_select,
               positive = 1, 
               method = 'freq',
               breaks_list = breaks_list
               )

## ✔ Binning on 29252 rows and 7 columns in 00:00:10

score_card <- scorecard(bins, 
                       final_model, 
                       points0 = 300, 
                       pdo = 20,   
                       odds0 = 100, 
                       basepoints_eq0 = TRUE)
# display results
score_card

## $basepoints
##      variable bin woe points
## 1: basepoints  NA  NA      0
## 
## $``term_ 60 ``
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
## 
## $``emp_length_< 1 year``
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
## 
## $``emp_length_1 year``
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
## 
## $``emp_length_2 years``
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
## 
## $``emp_length_5 years``
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
## 
## $``emp_length_7 years``
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
## 
## $home_ownership_RENT
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
## 
## $grade_1
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
## 
## $grade_4
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
## 
## $grade_5
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
## 
## $grade_6
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
## 
## $grade_7
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,neg,pos...
## 
## $annual_inc
##      variable           bin count count_distr   neg  pos   posprob          woe
## 1: annual_inc  [-Inf,44000)  5831   0.1993368  4028 1803 0.3092094  0.296800826
## 2: annual_inc [44000,85000) 14322   0.4896075 10738 3584 0.2502444  0.003309499
## 3: annual_inc  [85000, Inf)  9099   0.3110557  7184 1915 0.2104627 -0.221519852
##          bin_iv   total_iv breaks is_special_values points
## 1: 1.882034e-02 0.03323167  44000             FALSE     46
## 2: 5.367009e-06 0.03323167  85000             FALSE     27
## 3: 1.440596e-02 0.03323167    Inf             FALSE     12
## 
## $dti
##    variable       bin count count_distr  neg  pos   posprob        woe
## 1:      dti [-Inf,14)  9630   0.3292083 7843 1787 0.1855659 -0.3784643
## 2:      dti   [14,23) 10759   0.3678039 8290 2469 0.2294823 -0.1106179
## 3:      dti [23, Inf)  8863   0.3029878 5817 3046 0.3436760  0.4536634
##         bin_iv  total_iv breaks is_special_values points
## 1: 0.042609255 0.1160021     14             FALSE     33
## 2: 0.004374938 0.1160021     23             FALSE     28
## 3: 0.069017906 0.1160021    Inf             FALSE     19
## 
## $revol_util
##      variable       bin count count_distr  neg  pos   posprob         woe
## 1: revol_util [-Inf,31)  7072   0.2417612 5821 1251 0.1768948 -0.43690998
## 2: revol_util   [31,43)  4438   0.1517161 3459  979 0.2205949 -0.16158431
## 3: revol_util   [43,70) 10253   0.3505059 7611 2642 0.2576807  0.04256049
## 4: revol_util [70, Inf)  7489   0.2560167 5059 2430 0.3244759  0.36734128
##          bin_iv  total_iv breaks is_special_values points
## 1: 0.0410130442 0.0830356     31             FALSE     31
## 2: 0.0037992615 0.0830356     43             FALSE     28
## 3: 0.0006416455 0.0830356     70             FALSE     26
## 4: 0.0375816497 0.0830356    Inf             FALSE     23
## 
## $total_acc
##     variable       bin count count_distr   neg  pos   posprob         woe
## 1: total_acc [-Inf,14)  3865   0.1321277  2648 1217 0.3148771  0.32320303
## 2: total_acc   [14,27) 12214   0.4175441  8995 3219 0.2635500  0.07302074
## 3: total_acc [27, Inf) 13173   0.4503282 10307 2866 0.2175662 -0.17928709
##         bin_iv   total_iv breaks is_special_values points
## 1: 0.014876665 0.03096147     14             FALSE     22
## 2: 0.002266793 0.03096147     27             FALSE     26
## 3: 0.013818013 0.03096147    Inf             FALSE     29
## 
## $total_pymnt
##       variable         bin count count_distr   neg  pos     posprob        woe
## 1: total_pymnt [-Inf,1401)  2925  0.09999316   213 2712 0.927179487  3.6447683
## 2: total_pymnt [1401,2775)  2926  0.10002735   665 2261 0.772727273  2.3243943
## 3: total_pymnt [2775,4831)  2924  0.09995898  1366 1558 0.532831737  1.2321350
## 4: total_pymnt [4831,8320)  4388  0.15000684  3743  645 0.146991796 -0.6577735
## 5: total_pymnt [8320, Inf) 16089  0.55001367 15963  126 0.007831438 -3.7411281
##        bin_iv total_iv breaks is_special_values points
## 1: 1.31831716 4.864063   1401             FALSE   -165
## 2: 0.64930808 4.864063   2775             FALSE    -96
## 3: 0.18621732 4.864063   4831             FALSE    -38
## 4: 0.05406369 4.864063   8320             FALSE     61
## 5: 2.65615674 4.864063    Inf             FALSE    223
## 
## $total_rec_int
##         variable        bin count count_distr  neg  pos   posprob        woe
## 1: total_rec_int [-Inf,263)  8796   0.3006974 7359 1437 0.1633697 -0.5327476
## 2: total_rec_int  [263,683)  8745   0.2989539 6742 2003 0.2290452 -0.1130917
## 3: total_rec_int [683, Inf) 11711   0.4003487 7849 3862 0.3297754  0.3914179
##         bin_iv  total_iv breaks is_special_values points
## 1: 0.073767726 0.1445362    263             FALSE    141
## 2: 0.003714408 0.1445362    683             FALSE     51
## 3: 0.067054103 0.1445362    Inf             FALSE    -58
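
The points column can be traced back to the model coefficients. The sketch below uses the common scorecard scaling (a factor from pdo and an offset from points0 and odds0), with the base points spread evenly over the 18 model terms because basepoints_eq0 = TRUE. This arithmetic reproduces the annual_inc points printed above; it is offered as an illustration of the scaling, not as a statement of the package's internal code.

```r
points0 <- 300; odds0 <- 100; pdo <- 20

# Scaling constants
b <- pdo / log(2)
a <- points0 + b * log(odds0)

# Values copied from summary(final_model) and the annual_inc bins
intercept       <- -1.58562
n_terms         <- 18    # model terms other than the intercept
coef_annual_inc <- -2.30175
woe_annual_inc  <- c(0.296800826, 0.003309499, -0.221519852)

# Share of the base points given to each term
base_share <- (a - b * intercept) / n_terms

points_annual_inc <- round(-b * coef_annual_inc * woe_annual_inc + base_share)
points_annual_inc
```

The result is 46, 27 and 12, as in the table. The lowest-income bin receives the most points because the fitted coefficient on annual_inc_woe is negative.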

Conclusion

In this analysis, a logistic regression model was fitted to predict loan default using a dataset that included various predictor variables. The model was trained using the Weight of Evidence (WOE) transformation for numeric variables, which helps capture the relationship between the predictors and the likelihood of loan default.

The coefficients of the model provide valuable information about the impact of each predictor on the probability of loan default. For instance, the coefficient for term_60 is approximately 0.76288. This means that, all other factors being equal, a loan term of 60 months increases the log-odds of default by about 0.76 relative to a 36-month term.

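On the other hand, some variables have negative coefficients. For example, the coefficient for annual_inc_woe is approximately -2.30175. Because the WOE of annual income is highest for the lowest-income bin, a negative coefficient means that, with the other predictors held fixed, the fitted income effect runs opposite to the univariate pattern seen in coarse classing. The coefficient on its own should therefore not be read as higher income being protective against default.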

The model’s performance on the out-of-sample data was impressive, achieving an accuracy rate of approximately 97.52%. This means that the model correctly classified about 97.52% of the observations in the test set.

Furthermore, the Area Under the Curve (AUC) was calculated to be 0.9632, indicating a high level of discrimination between the positive and negative loan statuses. The Gini coefficient derived from the AUC is 0.9264, reflecting the model’s strong predictive ability.

In conclusion, the logistic regression model, with the selected predictors, provides valuable insights into the factors influencing loan default. Variables such as loan term, employment length, loan grade, and annual income have significant impacts on the likelihood of default. The model exhibits high accuracy and strong discriminatory power, making it a valuable tool for predicting loan default risk.

Credit risk

Natalia Miela

2023-06-24

Thank you for taking the time to read it.