1. Overview

A case study of machine learning / modeling in R with credit default data. The notebook walks through an end-to-end data science workflow and builds a risk score model on Lending Club P2P lending data to estimate the predicted probability of default.

1.1 Data Description

2. Environment Setup

Clean the environment and install the necessary packages.

rm(list = ls()) #clear all the objects in the current env
libraries <- c("lazyeval", "readr","plyr" ,"dplyr", "readxl", "ggplot2", 
    "funModeling", "scales", "tidyverse", "corrplot", "GGally", "caret",
    "rpart", "randomForest", "pROC", "gbm", "choroplethr", "choroplethrMaps",
    "microbenchmark", "doParallel", "e1071") #check the performance of SVM in e1071 versus SVM in Python since there is a translation mistakes between the architectures (arm vs x84)
libraries_missing <- setdiff(libraries, installed.packages()[, "Package"])
if (length(libraries_missing)!=0) install.packages(libraries_missing)
#load the libraries
sapply(libraries, function(x) suppressWarnings(suppressMessages(require(x, character.only = TRUE))))
##        lazyeval           readr            plyr           dplyr          readxl 
##            TRUE            TRUE            TRUE            TRUE            TRUE 
##         ggplot2     funModeling          scales       tidyverse        corrplot 
##            TRUE            TRUE            TRUE            TRUE            TRUE 
##          GGally           caret           rpart    randomForest            pROC 
##            TRUE            TRUE            TRUE            TRUE            TRUE 
##             gbm     choroplethr choroplethrMaps  microbenchmark      doParallel 
##            TRUE            TRUE            TRUE            TRUE            TRUE 
##           e1071 
##            TRUE

3. Library namespace conflicts

Packages in R can either be attached to the search path with library()/require() or referenced per call with the :: operator. I prefer attaching them so the %>% operator and the packages' functions are directly available. However, conflicts between same-named functions in different packages are unavoidable (called masking).

Conflicting pairs here: (caret, plyr) and (tidyverse, dplyr).
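When masking bites, the lightweight fix is to reference the intended function with an explicit namespace at the call site, as this notebook does with dplyr::select(). A minimal sketch (conflicts() is base R; the small data frame is made up for illustration):

conflicts() # list every masked name on the current search path
toy <- data.frame(g = c('a', 'a', 'b'), v = 1:3)
toy %>% group_by(g) %>% dplyr::summarise(v = mean(v)) # dplyr's summarise, not plyr's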

4. Data Import

#Reading the rejected and accepted applications
path <- '~/Desktop/essential-ds-concepts/projects/lending-club/'
loans <- data.table::fread(paste0(path, 'datasets/accepted_2007_to_2018q4.csv'))
dim(loans)
## [1] 2260701     151

5. Metadata

# Reading the metadata
library(readxl)
meta_stats <-        read_excel(paste0(path, 'datasets/LCDataDictionary.xlsx'), sheet = 'LoanStats')
meta_browse_notes <- read_excel(paste0(path, 'datasets/LCDataDictionary.xlsx'), sheet = 'browseNotes')
meta_reject_stats <- read_excel(paste0(path, 'datasets/LCDataDictionary.xlsx'), sheet = 'RejectStats')
meta_loans <- funModeling::df_status(loans, print_results = F)
knitr::kable(meta_loans[0:10,])
| variable        | q_zeros | p_zeros | q_na    | p_na | q_inf | p_inf | type      | unique  |
|-----------------|--------:|--------:|--------:|-----:|------:|------:|-----------|--------:|
| id              | 0       | 0.00    | 0       | 0    | 0     | 0     | character | 2260701 |
| member_id       | 0       | 0.00    | 2260701 | 100  | 0     | 0     | logical   | 0       |
| loan_amnt       | 0       | 0.00    | 33      | 0    | 0     | 0     | numeric   | 1572    |
| funded_amnt     | 0       | 0.00    | 33      | 0    | 0     | 0     | numeric   | 1572    |
| funded_amnt_inv | 233     | 0.01    | 33      | 0    | 0     | 0     | numeric   | 10057   |
| term            | 0       | 0.00    | 0       | 0    | 0     | 0     | character | 3       |
| int_rate        | 0       | 0.00    | 33      | 0    | 0     | 0     | numeric   | 673     |
| installment     | 0       | 0.00    | 33      | 0    | 0     | 0     | numeric   | 93301   |
| grade           | 0       | 0.00    | 0       | 0    | 0     | 0     | character | 8       |
| sub_grade       | 0       | 0.00    | 0       | 0    | 0     | 0     | character | 36      |

The accepted-loans data has only one fully unique column, and it is the first column (id). This can be checked with nrow(loans) == length(unique(loans$id)).

meta_loans <- meta_loans %>% 
  mutate(uniq_ratio = unique/nrow(loans)*100) 
meta_loans %>% 
  dplyr::select(variable, unique, uniq_ratio, type) %>% 
  filter(uniq_ratio > 0.2, type=='character')
##    variable  unique uniq_ratio      type
## 1        id 2260701 100.000000 character
## 2 emp_title  483751  21.398274 character
## 3       url 2260669  99.998585 character
## 4      desc  124485   5.506478 character
## 5     title   61463   2.718758 character

There are a couple of free-text fields; they can be flagged with a condition such as a uniqueness ratio above a certain threshold combined with a character type. For example, emp_title is a free-text field that lets people enter their own job titles instead of choosing from a selection list / drop-down menu. Such fields can be analyzed by extracting keywords / a minimal vocabulary.
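As a quick illustration of the keyword idea, a sketch assuming simple whitespace tokenization is good enough for a first pass:

loans %>%
  filter(emp_title != '') %>%
  transmute(word = tolower(emp_title)) %>%
  tidyr::separate_rows(word, sep = '\\s+') %>% # split each title into single words
  count(word, sort = TRUE) %>% # keyword frequencies
  head(20)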

5.1 Data Transformation

# Convert some character columns to numeric columns
chr_to_num_vars <- c("annual_inc_joint", "mths_since_last_major_derog", "open_acc_6m",
     "open_il_12m", "open_il_24m", "mths_since_rcnt_il",
    "total_bal_il", "il_util", "open_rv_12m", "open_rv_24m",
    "max_bal_bc", "all_util", "total_rev_hi_lim", "total_cu_tl",
    "inq_last_12m", "dti_joint", "inq_fi", "tot_cur_bal", "tot_coll_amt")
loans <- loans %>% 
          mutate_at(.funs=list(as.numeric), .vars=chr_to_num_vars)

loans %>% 
  dplyr::select(all_of(chr_to_num_vars)) %>% 
  str()
## Classes 'data.table' and 'data.frame':   2260701 obs. of  19 variables:
##  $ annual_inc_joint           : num  NA NA 71000 NA NA NA NA NA NA NA ...
##  $ mths_since_last_major_derog: num  30 NA NA NA NA NA NA 3 NA 75 ...
##  $ open_acc_6m                : num  2 1 0 1 1 0 0 0 2 0 ...
##  $ open_il_12m                : num  0 0 0 0 0 0 0 0 0 2 ...
##  $ open_il_24m                : num  1 1 4 1 3 0 2 4 0 3 ...
##  $ mths_since_rcnt_il         : num  21 19 19 23 14 338 18 13 35 10 ...
##  $ total_bal_il               : num  4981 18005 10827 12609 73839 ...
##  $ il_util                    : num  36 73 73 70 84 99 63 75 57 72 ...
##  $ open_rv_12m                : num  3 2 0 1 4 0 2 0 2 0 ...
##  $ open_rv_24m                : num  3 3 2 1 7 0 3 0 7 2 ...
##  $ max_bal_bc                 : num  722 6472 2081 6987 9702 ...
##  $ all_util                   : num  34 29 65 45 78 76 74 55 46 49 ...
##  $ total_rev_hi_lim           : num  9300 111800 14000 67300 34000 ...
##  $ total_cu_tl                : num  1 0 5 1 1 0 0 0 0 0 ...
##  $ inq_last_12m               : num  4 6 1 0 3 0 1 2 1 1 ...
##  $ dti_joint                  : num  NA NA 13.8 NA NA ...
##  $ inq_fi                     : num  3 0 2 0 2 0 1 1 2 0 ...
##  $ tot_cur_bal                : num  144904 204396 189699 301500 331730 ...
##  $ tot_coll_amt               : num  722 0 0 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Convert some character columns to date columns
dates <- grep('date', names(loans)) # columns whose names contain 'date'
contain_date <- names(loans)[dates]
chr_to_date_vars <- 
  c("issue_d", "last_pymnt_d", "last_credit_pull_d",
    "next_pymnt_d", "earliest_cr_line", 'sec_app_earliest_cr_line',
    contain_date)

loans <- loans %>%
  mutate_at(.funs=list(function(x) as.Date(paste0('01-', x), '%d-%b-%Y')), .vars=chr_to_date_vars)

loans %>% dplyr::select(all_of(chr_to_date_vars)) %>% 
  str()
## Classes 'data.table' and 'data.frame':   2260701 obs. of  11 variables:
##  $ issue_d                  : Date, format: "2015-12-01" "2015-12-01" ...
##  $ last_pymnt_d             : Date, format: "2019-01-01" "2016-06-01" ...
##  $ last_credit_pull_d       : Date, format: "2019-03-01" "2019-03-01" ...
##  $ next_pymnt_d             : Date, format: NA NA ...
##  $ earliest_cr_line         : Date, format: "2003-08-01" "1999-12-01" ...
##  $ sec_app_earliest_cr_line : Date, format: NA NA ...
##  $ hardship_start_date      : Date, format: NA NA ...
##  $ hardship_end_date        : Date, format: NA NA ...
##  $ payment_plan_start_date  : Date, format: NA NA ...
##  $ debt_settlement_flag_date: Date, format: NA NA ...
##  $ settlement_date          : Date, format: NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>
numeric_vars <- loans %>% sapply(is.numeric) %>% which() %>% names()
setdiff(meta_loans$variable, c(numeric_vars, chr_to_date_vars))
##  [1] "id"                        "member_id"                
##  [3] "term"                      "grade"                    
##  [5] "sub_grade"                 "emp_title"                
##  [7] "emp_length"                "home_ownership"           
##  [9] "verification_status"       "loan_status"              
## [11] "pymnt_plan"                "url"                      
## [13] "desc"                      "purpose"                  
## [15] "title"                     "zip_code"                 
## [17] "addr_state"                "initial_list_status"      
## [19] "application_type"          "verification_status_joint"
## [21] "hardship_flag"             "hardship_type"            
## [23] "hardship_reason"           "hardship_status"          
## [25] "hardship_loan_status"      "disbursement_method"      
## [27] "debt_settlement_flag"      "settlement_status"
na_to_zero_vars <-
  c("mths_since_last_delinq", "mths_since_last_record",
    "mths_since_last_major_derog")

loans <- 
  loans %>%
  mutate_at(.vars = na_to_zero_vars, .funs = list(~ replace(., is.na(.), 0)))
meta_loans <- funModeling::df_status(loans, print_results = FALSE)
meta_loans <-
  meta_loans %>%
  mutate(uniq_rat = unique / nrow(loans))

5.2 Variables in Loans but not in Meta Description

dplyr::setdiff(colnames(loans), meta_stats$LoanStatNew)
## [1] "verification_status_joint" "total_rev_hi_lim"

6. Default

Default can be defined as a point in time (a status on a given date) versus a period in time (a status over an observation window).

Several variables of interest:

# Check the uniqueness of the variables of interest
purrr::map(.x=loans[, c("loan_status", "delinq_2yrs", "mths_since_last_delinq")], .f=base::unique)
## $loan_status
##  [1] "Fully Paid"                                         
##  [2] "Current"                                            
##  [3] "Charged Off"                                        
##  [4] "In Grace Period"                                    
##  [5] "Late (31-120 days)"                                 
##  [6] "Late (16-30 days)"                                  
##  [7] "Default"                                            
##  [8] ""                                                   
##  [9] "Does not meet the credit policy. Status:Fully Paid" 
## [10] "Does not meet the credit policy. Status:Charged Off"
## 
## $delinq_2yrs
##  [1]  0  1  2  3  4  6  5 15  7  9 10  8 11 13 14 12 30 18 16 17 26 20 19 22 27
## [26] 39 NA 28 21 25 32 58 24 35 42 23 29 36
## 
## $mths_since_last_delinq
##   [1]  30   6   0  12  49   3  75  54  42  29  33  15  18  47  35  23  38   8
##  [19]  70  59  45  52  66  50  46  39  63  61  78  74  19  34   9  27  26  68
##  [37]  73  36  31  32  13  16  37  17  24  55   1  69   4  57   7  21  40  41
##  [55]  11  81  82  14  64  48   5  22   2  44  10  60  51  53  25  20  43  62
##  [73]  65  72  56  28  71  79  58  67  76  80  95  85  94  77  83 116  89 103
##  [91]  99 113 109  92  84 101  88 119 120  98  96  90  86 106  87 115 108 152
## [109] 100  91 105 134 146  97  93 123 131 157 122 104 112 154 121 127 170 171
## [127] 110 142 125 118 102 129 114 117 149 133 176 111 137 107 130 160 138 126
## [145] 226 156 158 128 132 178 124 135 162 145 161 179 159 147 136 151 192 188
## [163] 141 180 140 153 150 148 139 143 202 168 195
# loan_status has an unlabeled level "" which corresponds to "Issued"
loans$loan_status <- ifelse(loans$loan_status=="", 'Issued', loans$loan_status)
purrr::map(.x=loans[, c("loan_status", "delinq_2yrs", "mths_since_last_delinq")], .f=base::unique)
## $loan_status
##  [1] "Fully Paid"                                         
##  [2] "Current"                                            
##  [3] "Charged Off"                                        
##  [4] "In Grace Period"                                    
##  [5] "Late (31-120 days)"                                 
##  [6] "Late (16-30 days)"                                  
##  [7] "Default"                                            
##  [8] "Issued"                                             
##  [9] "Does not meet the credit policy. Status:Fully Paid" 
## [10] "Does not meet the credit policy. Status:Charged Off"
## 
## $delinq_2yrs
##  [1]  0  1  2  3  4  6  5 15  7  9 10  8 11 13 14 12 30 18 16 17 26 20 19 22 27
## [26] 39 NA 28 21 25 32 58 24 35 42 23 29 36
## 
## $mths_since_last_delinq
##   [1]  30   6   0  12  49   3  75  54  42  29  33  15  18  47  35  23  38   8
##  [19]  70  59  45  52  66  50  46  39  63  61  78  74  19  34   9  27  26  68
##  [37]  73  36  31  32  13  16  37  17  24  55   1  69   4  57   7  21  40  41
##  [55]  11  81  82  14  64  48   5  22   2  44  10  60  51  53  25  20  43  62
##  [73]  65  72  56  28  71  79  58  67  76  80  95  85  94  77  83 116  89 103
##  [91]  99 113 109  92  84 101  88 119 120  98  96  90  86 106  87 115 108 152
## [109] 100  91 105 134 146  97  93 123 131 157 122 104 112 154 121 127 170 171
## [127] 110 142 125 118 102 129 114 117 149 133 176 111 137 107 130 160 138 126
## [145] 226 156 158 128 132 178 124 135 162 145 161 179 159 147 136 151 192 188
## [163] 141 180 140 153 150 148 139 143 202 168 195
#Define defaulted by reading the data dictionary's definition of these variables  
defaulted <- c(
    "Default", 
    "Does not meet the credit policy. Status:Charged Off", 
    "In Grace Period", 
    "Late (16-30 days)", 
    "Late (31-120 days)")
# Calculate the proportions of default/non-default
loans <- loans %>% mutate(default = ifelse(!(loan_status %in% defaulted), FALSE, TRUE))
loans %>% dplyr::summarise(default_freq = mean(default))
##   default_freq
## 1   0.01550537
table(loans$default)/nrow(loans)
## 
##      FALSE       TRUE 
## 0.98449463 0.01550537

7. Remove variables with high NA ratios and high uniqueness ratios (a.k.a. free fields)

# remove variables with high NA ratios > mean(NA ratios)
high_na <- meta_loans[meta_loans$p_na > mean(meta_loans$p_na),]$variable

# remove categorical or Date variables with high uniqueness ratios > mean(uniqueness ratios)
high_uniqueness <- meta_loans[meta_loans$uniq_rat > mean(meta_loans$uniq_rat) & (meta_loans$type =='character' | meta_loans$type == 'Date'),]$variable
both <- c(high_na, high_uniqueness)
loans <- loans %>% dplyr::select(-all_of(both)) #remove them
dim(loans)
## [1] 2260701     100

8. Data Splitting

As Tom Fawcett puts it in "Learning from Imbalanced Classes":

Conventional algorithms are often biased towards the majority class because the loss function attempts to optimize error rate without taking the data distribution into consideration. In the worst case, minority examples are treated as outliers of the majority class and ignored. The learning algorithm then simply generates a trivial classifier that classifies every example as the majority class [accuracy = frequency of the majority class].

Since the data is highly imbalanced, a stratified shuffle-split with a predefined random seed is used to preserve the class proportions in the split data. In scikit-learn, the analogue is "from sklearn.model_selection import StratifiedKFold".

set.seed(8402)
train_index <- caret::createDataPartition(loans$default, times=1, p=0.8, list=F)
train <- loans[ train_index, ]
test  <- loans[-train_index, ] # the complement of train_index
table(train$default)/nrow(train)
## 
##      FALSE       TRUE 
## 0.98449431 0.01550569
table(test$default)/nrow(test)
## 
##      FALSE       TRUE 
## 0.98449592 0.01550408

9. Exploratory Data Analysis

9.1. Continuous Variables

  • loan_amnt
  • funded_amnt
  • funded_amnt_inv
  • annual_inc
income_vars <- c("annual_inc")
loan_amount_vars <- c("loan_amnt", "funded_amnt", "funded_amnt_inv")

train %>%
  select(all_of(income_vars)) %>%
  gather("variable", "value", all_of(income_vars)) %>%
  ggplot(aes(x = value)) +
  facet_wrap(~ variable, ncol = 3) +
  scale_x_continuous(limits=c(0,1000000)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 490 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

train %>%
  select(all_of(loan_amount_vars)) %>%
  gather("variable", "value", all_of(loan_amount_vars)) %>%
  ggplot(aes(x = value)) +
  facet_wrap(~ variable, scales = "free_x", ncol = 3) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 78 rows containing non-finite values (stat_bin).

The annual income distribution shows that most applications have income less than 125k. The loan applied for, the loan committed and the amount invested show similar distributions.

9.2. Categorical Variables

  • term
  • int_rate (numeric, plotted against the categorical variables below)
  • grade
  • emp_title
  • home_ownership
  • verification_status
  • application_type
  • loan_status
  • purpose
  • zip_code (geo)
  • addr_state (geo)
  • issue_d (timestamp)
cat_vars <-  c("term", "grade", "sub_grade", "emp_title", "home_ownership",
    "verification_status", "loan_status", "purpose", "zip_code",
    "addr_state", "application_type", "policy_code")
meta_loans %>% dplyr::select(variable, p_zeros, p_na, unique) %>% 
  filter(variable %in% cat_vars) %>% 
  knitr::kable()
| variable            | p_zeros | p_na | unique |
|---------------------|--------:|-----:|-------:|
| term                | 0       | 0    | 3      |
| grade               | 0       | 0    | 8      |
| sub_grade           | 0       | 0    | 36     |
| emp_title           | 0       | 0    | 483751 |
| home_ownership      | 0       | 0    | 7      |
| verification_status | 0       | 0    | 4      |
| loan_status         | 0       | 0    | 10     |
| purpose             | 0       | 0    | 15     |
| zip_code            | 0       | 0    | 957    |
| addr_state          | 0       | 0    | 52     |
| policy_code         | 0       | 0    | 1      |
| application_type    | 0       | 0    | 3      |
#emp_title has too many unique values to one-hot encode into 483751 binary variables

9.3. Grade

Alternative credit scoring matters especially for self-employed applicants and small businesses: fintechs use social-signal data and ML algorithms to assess the creditworthiness of an applicant.

LendingClub uses Grade as an ordinal factor to assess the applicant's creditworthiness.

train %>% ggplot(aes(grade, loan_amnt)) +
  geom_boxplot(fill='white', colour='darkblue', outlier.colour = 'red', outlier.shape = 1) +
  stat_summary(fun.data = function(x) return(c(y=median(x)*1.10, label=length(x))), geom='text') +
  stat_summary(fun=mean, geom='point', color='green') +
  facet_wrap(~default) +
  scale_y_continuous(labels = comma) +
  labs(title='Loan Amount by Grade Separated by Default', x='Grade', y='Loan Amount')
## Warning: Removed 26 rows containing non-finite values (stat_boxplot).
## Warning: Removed 26 rows containing non-finite values (stat_summary).
## Removed 26 rows containing non-finite values (stat_summary).

  • Lower-quality loans tend to have higher loan amounts.
  • There are a couple of outliers among the non-defaulters of grades A, B and C.
  • IQRs are slightly higher for bad loans and defaulters.
train <- train[!train$term=='',] #remove the 26 rows with an empty term
train %>% ggplot(aes(grade, int_rate)) +
  geom_boxplot(fill='white',color='darkblue', outlier.colour = 'red', outlier.shape=1) +
  stat_summary(fun.data = function(x) return(c(y=median(x)*1.10, label=length(x))), geom='text') +
  stat_summary(fun=mean, geom='point', color='green') +
  scale_y_continuous(labels=comma) +
  labs(title='Interest Rate by Grade', x='Grade', y='Interest Rate') +
  facet_wrap(~term)

  • The lower the grade, the higher the interest rate and the higher the variation in interest rates.
  • There are some outliers that have really low interest rates independent of grade.
  • The 3-year term has a much higher number of high-rated borrowers while the 5-year term has a much higher number of low-rated borrowers.

9.4 Home Ownership

train %>% ggplot(aes(home_ownership, int_rate)) +
  geom_boxplot(fill = 'white', colour = 'darkblue', outlier.color = 'red', outlier.shape = 1) +
  stat_summary(fun.data = function(x) return(c(y=median(x)*1.10, label=length(x))), geom='text') +
  stat_summary(fun=mean, geom='point', colour='darkblue') + 
  scale_y_continuous(labels=comma) +
  facet_wrap(~default)

  • The median/mean interest rate is higher for people with a 'None' answer, among both defaulters and non-defaulters.
  • The mortgage/own ratio for non-defaulters is 4.4 while for defaulters it is 3.6.
  • The rent/own ratios for defaulters and non-defaulters are similar, at about 3.5.
  • Within the same home-ownership group, the interest rate is generally higher for default loans than for non-default loans.

9.5. Loan amount, funded loan amount and funded loan invested

funded_inv <- train %>% 
  select(loan_amnt, grade, funded_amount = funded_amnt, funded_amount_inv = funded_amnt_inv) %>% 
  tidyr::pivot_longer(cols = c(funded_amount, funded_amount_inv),
                      names_to = 'variable', values_to = 'value')
funded_inv %>% ggplot(aes(x=loan_amnt, y=value, color=grade)) +
  geom_point(alpha=0.4) + 
  facet_wrap(~ variable) 

The diagonal line indicates where the loan amount, funded amount and funded amount by investors are equal.

  • The funded amount never exceeds the loan amount.
  • The funded amount by investors is smaller than the funded amount, i.e. not the full loan is invested in.
  • The variation grows as the loan amount increases.

9.6. Loan amount and Income/Debt-to-Income ratio

train %>% 
  select(loan_amnt, grade, annual_inc, default) %>% 
  ggplot(aes(annual_inc, loan_amnt, color=grade)) +
  geom_point(alpha=0.3) +
  facet_wrap(~default)
## Warning: Removed 3 rows containing missing values (geom_point).

  • There is more variation among the non-default loans than the default ones, and there is one outlier (why would someone with an income greater than 6 million ask for a 10k loan?)
train %>% 
  select(loan_amnt, grade, dti, default) %>% 
  ggplot(aes(dti, loan_amnt, color=grade)) +
  geom_point(alpha=0.3) +
  facet_wrap(~default)
## Warning: Removed 1386 rows containing missing values (geom_point).

9.7. Time series of average and median interest rate by Grade/Loan Amount

train %>% 
  select(int_rate, grade) %>% 
  group_by(grade) %>% 
  summarise(int_rate_mean=mean(int_rate, na.rm=T),
            int_rate_median=median(int_rate, na.rm=T),
            n=n()) %>% 
  knitr::kable()
| grade | int_rate_mean | int_rate_median | n      |
|-------|--------------:|----------------:|-------:|
| A     | 7.085164      | 7.24            | 346242 |
| B     | 10.674941     | 10.75           | 531354 |
| C     | 14.143275     | 13.99           | 519430 |
| D     | 18.141853     | 17.97           | 259803 |
| E     | 21.830008     | 21.48           | 108708 |
| F     | 25.444979     | 24.50           | 33217  |
| G     | 28.074617     | 28.18           | 9782   |
train %>% 
  select(int_rate, issue_d) %>% 
  group_by(issue_d) %>% 
  summarise(mean_int = mean(int_rate)) %>% 
  ggplot(aes(x=issue_d, y=mean_int)) +
  geom_line(color='darkblue') 

The mean interest rate increases over 2007-2018; 2007 and 2011 have the lowest mean rates. As P2P lending started to gain traction from 2012, the average interest rate peaked in 2013. However, the overall mean does not draw a clear picture compared to segmenting by grade or loan amount. This phenomenon is known as Simpson's paradox: the overall effect may differ from, or even invert, the effect within the groups that contribute to it (omnibus correlation significance versus pairwise correlation significance).
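A self-contained toy illustration (simulated data, not Lending Club): within every group the relationship between x and y is negative, yet pooling the groups flips the sign.

set.seed(1)
toy <- data.frame(g = rep(1:3, each = 100))
toy$x <- rnorm(300) + 3 * toy$g # group means shift x upward with g
toy$y <- 6 * toy$g - toy$x + rnorm(300) # within each group the slope is exactly -1
cor(toy$x, toy$y) # pooled correlation: positive
by(toy, toy$g, function(d) cor(d$x, d$y)) # within-group correlations: negative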

train %>% 
  select(int_rate, grade, issue_d) %>% 
  group_by(grade, issue_d) %>% 
  summarise(int_rate_mean = mean(int_rate, na.rm = TRUE)) %>%
  ggplot(aes(issue_d, int_rate_mean)) +
  geom_line(color= "darkblue", size = 1) +
  facet_wrap(~grade)
## `summarise()` has grouped output by 'grade'. You can override using the
## `.groups` argument.

train %>% 
  select(loan_amnt, grade, issue_d) %>% 
  group_by(grade, issue_d) %>% 
  summarise(loan_amnt_mean = mean(loan_amnt, na.rm = TRUE)) %>%
  ggplot(aes(issue_d, loan_amnt_mean)) +
  geom_line(color= "darkblue", size = 1) +
  facet_wrap(~ grade)
## `summarise()` has grouped output by 'grade'. You can override using the
## `.groups` argument.

  • The mean interest rate and the variation in loan amount increase significantly for low-rated borrowers over time, from Jan-2007 to Dec-2018.

9.8. Geo-location Plot (a single aggregated value per state)

#states
states <- read.csv('https://www2.census.gov/geo/docs/reference/state.txt', sep='|')
# average default rate by state
default_rate_state <- train %>% 
  select(default, addr_state) %>% 
  group_by(addr_state) %>% 
  summarise(default_rate=sum(default, na.rm=TRUE)/n()) %>% 
  left_join(states, by = c('addr_state' = 'STUSAB')) %>% 
  mutate(region=tolower(STATE_NAME), value=default_rate) %>% 
  select(region, value)

library(choroplethrMaps)
utils::data(state.map)
choroplethr::state_choropleth(df = default_rate_state, 
                              title = "Default rate by State")

# average annual income by state
annual_inc_state <- train %>% 
  select(annual_inc, addr_state) %>% 
  group_by(addr_state) %>% 
  summarise(avg_annual_inc=mean(annual_inc, na.rm=TRUE)) %>% 
  left_join(states, by = c('addr_state' = 'STUSAB')) %>% 
  mutate(region=tolower(STATE_NAME), value=avg_annual_inc) %>% 
  select(region, value)

choroplethr::state_choropleth(df = annual_inc_state, 
                              title = "Average annual income by State")

  • The average default rate and average annual income vary across states.
  • Higher average income is not consistently associated with a lower average default rate across states (CA, for example, has a really high average annual income and a relatively low average default rate).
  • The default binning can be replaced with a custom binning.

10. Correlations and Removal of Correlated Variables

library(corrplot)
train$policy_code <- factor(train$policy_code)
num_vars <- train %>% sapply(is.numeric) %>% which() %>% names()
meta_train <- funModeling::df_status(train[, ..num_vars], print_results = FALSE)
num_vars <- meta_train[meta_train$p_na <= 5,]$variable
sds <- sapply(train[,..num_vars], function(x) sd(x, na.rm=T)) # standard deviations, to spot zero-variance columns before computing correlations
corrplot::corrplot(cor(train[, ..num_vars], use = "complete.obs"), 
                   method = "ellipse", type = "upper", tl.cex=0.4)

  • Overall correlation between variables seems low, but there are a few highly correlated ones.
  • Some highly correlated variables may measure similar things.
  • Some variables are perfectly correlated and should be removed. caret::findCorrelation removes the variable with the largest mean absolute correlation, which seems too simple compared to best subset selection. I am going to compare the performance of caret::findCorrelation, a step-wise AIC/BIC approach, the Lasso, the Lasso with an enlarged feature space / kernelized method, and recursive feature elimination (a Lasso sketch follows the removal below).

#remove variables with high correlations, although we cannot control which variable of each correlated pair is removed
#another alternative could be recursive feature elimination rfe in caret 
vars_remove <- caret::findCorrelation(cor(train[, ..num_vars], use='complete.obs'), names=T, cutoff = 0.5)
train <- train %>% select(-c(all_of(vars_remove)))
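A hedged sketch of the Lasso alternative mentioned above. Assumptions: glmnet is installed (it is not in the library list), and a 50k subsample of complete cases is enough for a first comparison.

library(glmnet)
set.seed(8402)
num_keep <- setdiff(num_vars, vars_remove) # numeric columns that survived the removal
cc  <- which(complete.cases(train[, ..num_keep])) # glmnet cannot handle NAs
idx <- sample(cc, 50000) # subsample for speed
X <- as.matrix(train[idx, ..num_keep])
y <- factor(train$default[idx])
cv_fit <- cv.glmnet(X, y, family = 'binomial', alpha = 1) # alpha = 1 is the Lasso
betas <- coef(cv_fit, s = 'lambda.1se')[-1, , drop = FALSE] # drop the intercept row
rownames(betas)[as.vector(betas) != 0] # the variables the Lasso keeps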

11. Summary Plots

12. Modeling

Supervised binary classification candidates:

  • logistic regression and regularized logistic regression
  • LDA/QDA
  • kNN
  • trees / random forests / GBM
  • SVM (soft/hard margin)
  • noise reduction through PCA versus unreduced inputs (remember PCA is sensitive to the scale of its inputs)

12.1 Imbalanced Classes

1. Strategies

1.1. Do nothing

Train on the stratified distribution as-is; this often works acceptably without any modification.

1.2. Balance the training set

  • Oversample the minority class to 80/20
  • Undersample the majority class to 80/20 (preferable since it decreases the dataset size, which is more computationally efficient)

Oversampling duplicates data and makes variables appear to have lower variance than they do; undersampling throws away potentially useful majority examples.

In practice, undersampling to 80/20 is preferable, although it would be more beneficial to assess all cases. A common recipe, sketched in code after the steps below:

Step 1: Take bootstrap samples from the population

Step 2: Balance each sample by down-sampling

Step 3: Learn a decision tree/random forest from each

Step 4: Majority votes on the whole set of bootstrap samples
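A minimal sketch of steps 1 to 4. Assumptions: rpart trees, 25 bags, 100k-row bootstraps for speed, and two illustrative predictors (grade, term); this is not the tuned model used later.

n_bags <- 25
bags <- lapply(seq_len(n_bags), function(b) {
  boot <- train[sample(nrow(train), 100000, replace = TRUE), ] # step 1: bootstrap sample
  bal  <- caret::downSample(x = boot %>% select(grade, term),  # step 2: balance by down-sampling
                            y = factor(boot$default), yname = 'default')
  rpart(default ~ ., data = bal, method = 'class')             # step 3: learn a tree
})
test_ok <- test[test$term != '', ] # drop the few rows with levels never seen in train
votes <- sapply(bags, function(m) as.character(predict(m, newdata = test_ok, type = 'class')))
majority_vote <- apply(votes, 1, function(v) names(which.max(table(v)))) # step 4: majority vote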

train_down <- caret::downSample(x=train %>% select(-default),
                                y=factor(train$default), yname='default')
prop.table(table(train_down$default))
## 
## FALSE  TRUE 
##   0.5   0.5
table(train_down$default)
## 
## FALSE  TRUE 
## 28043 28043
downSample_custom <- function(x, y, yname = "Class", frac = 1){
  lev <- levels(y)
  minClass <- min(table(y))
  lev_min <- levels(y)[which.min(table(y))]
  inds_down <- sample(which(y == lev[lev != lev_min]), size = minClass * frac) #sample the indexes of the more abundant class according to minClass * frac
  inds_minClass <- which(y == lev[lev == lev_min]) #take all the indexes of the lesser abundant class
  out <- data.frame(x, y)
  out <- out[sort(c(inds_down, inds_minClass)),]
  colnames(out)[ncol(out)] <- yname
  return(out)
} 

# 80/20 downsampling the minority class 
train_down_custom <- downSample_custom(x=train %>% select(-default),
                                y=factor(train$default), yname='default', frac=1/0.8)
table(train_down_custom$default)
## 
## FALSE  TRUE 
## 35053 28043
  • Synthesize new minority examples (SMOTE): create new minority examples by interpolating between existing ones, often combined with undersampling the majority class (a minimal interpolation sketch below).
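A minimal interpolation sketch on numeric features only (full implementations live in packages such as smotefamily or themis; the mtcars call is just a smoke test):

smote_sketch <- function(X_min, n_new, k = 5) {
  d <- as.matrix(dist(X_min)) # pairwise distances between minority points
  diag(d) <- Inf # a point is not its own neighbour
  t(replicate(n_new, {
    i <- sample(nrow(X_min), 1) # pick a random minority point
    j <- order(d[i, ])[sample(k, 1)] # one of its k nearest minority neighbours
    X_min[i, ] + runif(1) * (X_min[j, ] - X_min[i, ]) # a synthetic point on the segment between them
  }))
}
smote_sketch(as.matrix(mtcars[1:10, 1:3]), n_new = 5)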

1.3. Anomaly Detection

Throw away the minority examples and switch to an anomaly detection framework.

  • Isolation Forest: identify anomalies by growing random decision trees and measuring the average number of decision splits required to isolate each data point; this average serves as the point's anomaly score, i.e. the likelihood that the example belongs to the minority class.

  • Box Drawing: axis-parallel hyper-rectangles drawn around clusters of minority-class examples.

  • Nearest Neighbor Ensembles

1.4. Adjust the Objective Function

  • Adjust the class weights (misclassification costs); see the weighted-glm sketch after this list
  • Adjust the decision threshold
  • Modify an existing algorithm to be sensitive to rare classes
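A minimal sketch of the class-weight idea with plain glm, fitting on the original imbalanced train rather than train_down. (R warns about non-integer #successes for weighted binomial fits; the resulting scores are still usable.)

w <- ifelse(train$default,
            1 / mean(train$default),   # upweight the rare defaults ...
            1 / mean(!train$default))  # ... relative to the common non-defaults
mod_w <- glm(default ~ grade, family = binomial(link = 'logit'),
             data = train, weights = w)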

1.5. Neighbor-based approach

Tomek links are majority-minority pairs of mutual nearest neighbors; deleting the majority-class member of each pair helps draw a clearer decision boundary between the majority and minority classes. A small sketch follows.
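A minimal sketch on toy 2-D data, assuming n is small enough for a full distance matrix (use a kNN package such as FNN for real data):

set.seed(1)
X <- rbind(matrix(rnorm(200), ncol = 2), matrix(rnorm(20, mean = 1.5), ncol = 2))
y <- c(rep('maj', 100), rep('min', 10))
d <- as.matrix(dist(X)); diag(d) <- Inf
nn <- apply(d, 1, which.min) # each point's nearest neighbour
tomek <- nn[nn] == seq_along(nn) & y != y[nn] # mutual nearest neighbours with different labels
drop_idx <- which(tomek & y == 'maj') # the majority member of each Tomek link
if (length(drop_idx) > 0) { X <- X[-drop_idx, ]; y <- y[-drop_idx] } # guard: x[-integer(0)] would drop everything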

1.6. Write a new algorithm to perform well on imbalanced data.

2. Why not accuracy?

  • Accuracy applies a naive 0.5 threshold to decide between classes.
  • Accuracy = 1 - error rate, but it does not reveal whether false positives or false negatives dominate.

=> Prefer a ROC curve, precision-recall curve, lift curve or profit/gain curve based on the predicted probability.

2.1. AUC

The probability that a random positive example will be ranked above a random negative example; sometimes referred to as the ability to separate the classes.
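A toy sketch with pROC (already loaded); the same call applies to the model scores produced in section 12.3.

set.seed(42)
actual <- rbinom(1000, 1, 0.1) # imbalanced toy labels
scores <- runif(1000) + 0.5 * actual # noisy scores that tend to rank positives higher
roc_obj <- pROC::roc(response = actual, predictor = scores)
pROC::auc(roc_obj) # P(random positive ranked above random negative)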

2.2. F1

Harmonic mean of precision and recall

F1 = \(\frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\)
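A small helper computing F1 from predicted and actual label vectors, assuming 'TRUE' names the positive (minority) class as in the factors used later:

f1_score <- function(pred, actual, positive = 'TRUE') {
  tp <- sum(pred == positive & actual == positive) # true positives
  precision <- tp / sum(pred == positive)
  recall    <- tp / sum(actual == positive)
  2 * precision * recall / (precision + recall)
}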

12.2 Model Pipelines

  • data splitting:
      • make.names() for the column names so they do not contain spaces
      • remove some variables such as zip_code and title
  • pre-processing:
      • remove correlated variables via a simple correlation matrix; I need to compare the performance of removing variables through caret::findCorrelation with recursive feature elimination
  • feature selection
  • model tuning
  • variable importance estimation

# particular naming scheme for categorical variables
character_vars <- train_down %>% sapply(is.character) %>% which() %>% names()
train_down <- train_down %>% mutate_at(.funs=make.names, .vars=character_vars)
test <- test %>% mutate_at(.funs=make.names, .vars=character_vars)

#remove zip_code due to the huge binary matrix it would create and the risk of discriminating against certain communities
unique(train_down$policy_code) == 1
## [1] TRUE
train_down <- train_down %>% select(-c(zip_code, title, policy_code))
dummies_train <- dummyVars('~.', data=train_down %>% select(-c(default)), fullRank = F) #fullRank needs to be specified for lm-style models but not for trees
#fullRank = TRUE is the equivalent of drop_first=True in one-hot encoding

train_down_dummy <- train_down %>% 
                    select(-which(sapply(., is.character))) %>% 
                    cbind(predict(dummies_train, newdata=train_down))

dummies_test <- dummyVars('~.', data=test %>% select(dummies_train$vars), fullRank = F)
test_dummy <- test %>% 
  select((colnames(train_down))) %>% 
  select(-which(sapply(., is.character))) %>% 
  cbind(predict(dummies_test, newdata = test)) 

# levels seen only in train, not in test => add these columns to the test set, filled with zeros
train_not_test <- setdiff(names(train_down_dummy), names(test_dummy))
if (length(train_not_test) != 0) {test_dummy[, train_not_test] <- list(0)}
length(setdiff(names(test_dummy), names(train_down_dummy))) == 0 # still FALSE: test has extra columns
## [1] FALSE
# levels seen only in test, not in train => drop the columns corresponding to these new levels
test_not_train <- setdiff(names(test_dummy), names(train_down_dummy))
if (length(test_not_train) != 0) {test_dummy[, test_not_train] <- list(NULL)} # setting a column to NULL deletes it
length(setdiff(names(test_dummy), names(train_down_dummy))) == 0
## [1] TRUE

12.3 Logistic Regression

  • Definition: Logistic regression is similar to linear regression except that the linear predictor is mapped into the range [0,1], interpreted as the predicted probability of the outcome, through the logistic (inverse-logit) function. Equivalently, the log odds are linear in X, and the model coefficients are estimated through MLE.

\(log(\frac{p}{1-p}) = X\hat{\beta}\)

We will start out with a simple model and intuitively use the grade variable as the predictor.

mod1 <- glm(formula = default ~ grade -1, #remove the intercept
            family = binomial(link = 'logit'),
            data = train_down, na.action = na.exclude)
# accessing the attributes
attributes(mod1)
## $names
##  [1] "coefficients"      "residuals"         "fitted.values"    
##  [4] "effects"           "R"                 "rank"             
##  [7] "qr"                "family"            "linear.predictors"
## [10] "deviance"          "aic"               "null.deviance"    
## [13] "iter"              "weights"           "prior.weights"    
## [16] "df.residual"       "df.null"           "y"                
## [19] "converged"         "boundary"          "model"            
## [22] "call"              "formula"           "terms"            
## [25] "data"              "offset"            "control"          
## [28] "method"            "contrasts"         "xlevels"          
## 
## $class
## [1] "glm" "lm"
#summary
summary(mod1)
## 
## Call:
## glm(formula = default ~ grade - 1, family = binomial(link = "logit"), 
##     data = train_down, na.action = na.exclude)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.55876  -1.25861   0.04397   1.09832   1.67521  
## 
## Coefficients:
##        Estimate Std. Error z value Pr(>|z|)    
## gradeA -1.12104    0.02708 -41.396   <2e-16 ***
## gradeB -0.28756    0.01688 -17.038   <2e-16 ***
## gradeC  0.18890    0.01506  12.541   <2e-16 ***
## gradeD  0.45687    0.02022  22.593   <2e-16 ***
## gradeE  0.58624    0.03141  18.667   <2e-16 ***
## gradeF  0.49488    0.05506   8.988   <2e-16 ***
## gradeG  0.86283    0.09869   8.743   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 77752  on 56086  degrees of freedom
## Residual deviance: 74254  on 56079  degrees of freedom
## AIC: 74268
## 
## Number of Fisher Scoring iterations: 4

All the coefficients of gradeA to gradeG are significant. GradeA and gradeB are negative while gradeC to gradeG are positive, meaning that grades A and B decrease the likelihood of default while grades C to G increase it. The coefficients help confirm the hypothesis of bucketing bad versus good grades. Note, however, that gradeE has a larger coefficient than gradeF (0.58 > 0.49), breaking the expected monotonic ordering.
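Since the model has no intercept and a single factor, the inverse logit of each coefficient is the fitted default probability for that grade on the balanced training set:

plogis(coef(mod1)) # e.g. plogis(-1.12) is about 0.25 for grade A; plogis(0.86) is about 0.70 for grade G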

test <- test[!test$grade == 'X'] # make.names() coded the empty grade "" as "X"; drop those rows
mod1 <- glm(formula = default ~ grade-1,
            family = binomial(link = 'logit'),
            data = train_down, na.action = na.exclude)
mod1_pred <- predict.glm(object=mod1, newdata = test, type='response')
mod1_pred_5_threshold <- ifelse(mod1_pred>0.5, TRUE, FALSE)
test$default <- factor(test$default, levels=c(FALSE,TRUE))

test$pred_glm <- factor(mod1_pred_5_threshold,
                         levels=levels(test$default))

confusionMatrix(test$pred_glm, test$default, positive = 'TRUE')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  FALSE   TRUE
##      FALSE 216988   2000
##      TRUE  228134   5010
##                                           
##                Accuracy : 0.491           
##                  95% CI : (0.4895, 0.4925)
##     No Information Rate : 0.9845          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.012           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.71469         
##             Specificity : 0.48748         
##          Pos Pred Value : 0.02149         
##          Neg Pred Value : 0.99087         
##              Prevalence : 0.01550         
##          Detection Rate : 0.01108         
##    Detection Prevalence : 0.51565         
##       Balanced Accuracy : 0.60109         
##                                           
##        'Positive' Class : TRUE            
## 

The simple classifier gives a fair sensitivity (the true positive rate on defaults) but a really low specificity (the true negative rate on non-defaults). The Kappa is really low.

plot_pred_type_distribution <- function(df, threshold) {
  v <- rep(NA, nrow(df))
  v <- ifelse(df$pred >= threshold & df$actual == TRUE, "TP", v)
  v <- ifelse(df$pred >= threshold & df$actual == FALSE, "FP", v)
  v <- ifelse(df$pred < threshold & df$actual == TRUE, "FN", v)
  v <- ifelse(df$pred < threshold & df$actual == FALSE, "TN", v)
  
  df$pred_type <- v
  
  ggplot(data = df, aes(x = actual, y = pred)) + 
    geom_violin(fill = rgb(1, 1 ,1, alpha = 0.02), color = NA) + 
    geom_jitter(aes(color = pred_type), alpha = 0.2, width=0.5, height=0.05) +
    geom_hline(yintercept = threshold, color = "red", alpha = 1) +
    scale_color_discrete(name = "type") +
    labs(title = sprintf("Threshold at %.2f", threshold))
  }

plot_pred_type_distribution(
  df = tibble::tibble(pred = mod1_pred, actual = test$default), threshold = 0.5)

  • We can clearly see the imbalance of the test data, with a lot of actual negative cases (non-defaulters).
  • As the 0.5 threshold moves down, TP/sensitivity increases at the expense of TN.
  • As the 0.5 threshold moves up, TN/specificity increases at the expense of TP.
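One principled way to move the threshold is to pick the point maximizing Youden's J (sensitivity + specificity - 1) on the ROC curve; a hedged sketch with pROC on the predictions above:

roc_mod1 <- pROC::roc(response = test$default, predictor = mod1_pred)
pROC::coords(roc_mod1, x = 'best', best.method = 'youden',
             ret = c('threshold', 'sensitivity', 'specificity'))
# note: in practice tune the threshold on a validation set, not the test set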

knitr::purl('script.R')
## 
## 
## processing file: script.R
## 
## output file: script.R
## [1] "script.R"