A case study of machine learning / modeling in R with credit default data. The notebook focuses on an end-to-end data science workflow and build a risk score model on the Lending Club P2P lending platform to determine the predicted probability of default.
Clean the environment and install necessary packages
rm(list = ls()) #clear all the objects in the current env
libraries <- c("lazyeval", "readr","plyr" ,"dplyr", "readxl", "ggplot2",
"funModeling", "scales", "tidyverse", "corrplot", "GGally", "caret",
"rpart", "randomForest", "pROC", "gbm", "choroplethr", "choroplethrMaps",
"microbenchmark", "doParallel", "e1071") #check the performance of SVM in e1071 versus SVM in Python since there is a translation mistakes between the architectures (arm vs x84)
libraries_missing <- setdiff(libraries, installed.packages())
if (length(libraries_missing)!=0) install.packages(libraries_missing)
#load the libraries
sapply(libraries, function(x) suppressWarnings(suppressMessages(require(x, character.only = TRUE))))
## lazyeval readr plyr dplyr readxl
## TRUE TRUE TRUE TRUE TRUE
## ggplot2 funModeling scales tidyverse corrplot
## TRUE TRUE TRUE TRUE TRUE
## GGally caret rpart randomForest pROC
## TRUE TRUE TRUE TRUE TRUE
## gbm choroplethr choroplethrMaps microbenchmark doParallel
## TRUE TRUE TRUE TRUE TRUE
## e1071
## TRUE
Although there are two ways to load packages/libraries in R, I prefer to use the second method to utilize the %>% operator and available functions from the libraries. However, conflicts of functions with the same names between different packages are unavoidable (called masking).
(caret, plyr) (tidyverse, dplyr)
#Reading the rejected and accepted applications
path <- ('~/Desktop/essential-ds-concepts/projects/lending-club/')
loans <- data.table::fread(paste0(path,'./datasets/accepted_2007_to_2018q4.csv'))
dim(loans)
## [1] 2260701 151
# Reading the metadata
library(readxl)
meta_stats <- read_excel(paste0(path, './datasets/LCDataDictionary.xlsx'), sheet = 'LoanStats')
meta_browse_notes <- read_excel(paste0(path, './datasets/LCDataDictionary.xlsx'), sheet = 'browseNotes')
meta_reject_stats <- read_excel(paste0(path, './datasets/LCDataDictionary.xlsx'), sheet = 'RejectStats')
meta_loans <- funModeling::df_status(loans, print_results = F)
knitr::kable(meta_loans[0:10,])
| variable | q_zeros | p_zeros | q_na | p_na | q_inf | p_inf | type | unique |
|---|---|---|---|---|---|---|---|---|
| id | 0 | 0.00 | 0 | 0 | 0 | 0 | character | 2260701 |
| member_id | 0 | 0.00 | 2260701 | 100 | 0 | 0 | logical | 0 |
| loan_amnt | 0 | 0.00 | 33 | 0 | 0 | 0 | numeric | 1572 |
| funded_amnt | 0 | 0.00 | 33 | 0 | 0 | 0 | numeric | 1572 |
| funded_amnt_inv | 233 | 0.01 | 33 | 0 | 0 | 0 | numeric | 10057 |
| term | 0 | 0.00 | 0 | 0 | 0 | 0 | character | 3 |
| int_rate | 0 | 0.00 | 33 | 0 | 0 | 0 | numeric | 673 |
| installment | 0 | 0.00 | 33 | 0 | 0 | 0 | numeric | 93301 |
| grade | 0 | 0.00 | 0 | 0 | 0 | 0 | character | 8 |
| sub_grade | 0 | 0.00 | 0 | 0 | 0 | 0 | character | 36 |
The accepted data has only one unique column and it is the first column (id). It can be checked with nrow(loans) == length(unique(loans$id)).
meta_loans <- meta_loans %>%
mutate(uniq_ratio = unique/nrow(loans)*100)
meta_loans %>%
dplyr::select(variable, unique, uniq_ratio, type) %>%
filter(uniq_ratio > 0.2, type=='character')
## variable unique uniq_ratio type
## 1 id 2260701 100.000000 character
## 2 emp_title 483751 21.398274 character
## 3 url 2260669 99.998585 character
## 4 desc 124485 5.506478 character
## 5 title 61463 2.718758 character
There are a couple of free text fields and can be checked with the condition such as the total number of unique items is greater than a certain threshold and the type is character. For example, emp_title is a free text field and let people enter their own job titles instead of having a selection list/ drop-down menu. Such free fields can be analyzed by extracting some keywords / minimal vocabulary.
# Convert some character columns to numeric columns
chr_to_num_vars <- c("annual_inc_joint", "mths_since_last_major_derog", "open_acc_6m",
"open_il_12m", "open_il_24m", "mths_since_rcnt_il",
"total_bal_il", "il_util", "open_rv_12m", "open_rv_24m",
"max_bal_bc", "all_util", "total_rev_hi_lim", "total_cu_tl",
"inq_last_12m", "dti_joint", "inq_fi", "tot_cur_bal", "tot_coll_amt")
loans <- loans %>%
mutate_at(.funs=list(as.numeric), .vars=chr_to_num_vars)
loans %>%
dplyr::select(all_of(chr_to_num_vars)) %>%
str()
## Classes 'data.table' and 'data.frame': 2260701 obs. of 19 variables:
## $ annual_inc_joint : num NA NA 71000 NA NA NA NA NA NA NA ...
## $ mths_since_last_major_derog: num 30 NA NA NA NA NA NA 3 NA 75 ...
## $ open_acc_6m : num 2 1 0 1 1 0 0 0 2 0 ...
## $ open_il_12m : num 0 0 0 0 0 0 0 0 0 2 ...
## $ open_il_24m : num 1 1 4 1 3 0 2 4 0 3 ...
## $ mths_since_rcnt_il : num 21 19 19 23 14 338 18 13 35 10 ...
## $ total_bal_il : num 4981 18005 10827 12609 73839 ...
## $ il_util : num 36 73 73 70 84 99 63 75 57 72 ...
## $ open_rv_12m : num 3 2 0 1 4 0 2 0 2 0 ...
## $ open_rv_24m : num 3 3 2 1 7 0 3 0 7 2 ...
## $ max_bal_bc : num 722 6472 2081 6987 9702 ...
## $ all_util : num 34 29 65 45 78 76 74 55 46 49 ...
## $ total_rev_hi_lim : num 9300 111800 14000 67300 34000 ...
## $ total_cu_tl : num 1 0 5 1 1 0 0 0 0 0 ...
## $ inq_last_12m : num 4 6 1 0 3 0 1 2 1 1 ...
## $ dti_joint : num NA NA 13.8 NA NA ...
## $ inq_fi : num 3 0 2 0 2 0 1 1 2 0 ...
## $ tot_cur_bal : num 144904 204396 189699 301500 331730 ...
## $ tot_coll_amt : num 722 0 0 0 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
# Convert some character columns to date columns
dates <- grep('date', names(loans))
start_with_date <- names(loans)[dates]
chr_to_date_vars <-
c("issue_d", "last_pymnt_d", "last_credit_pull_d",
"next_pymnt_d", "earliest_cr_line", "next_pymnt_d", 'sec_app_earliest_cr_line',
start_with_date)
loans <- loans %>%
mutate_at(.funs=list(function(x) as.Date(paste0('01-', x), '%d-%b-%Y')), .vars=chr_to_date_vars)
loans %>% dplyr::select(all_of(chr_to_date_vars)) %>%
str()
## Classes 'data.table' and 'data.frame': 2260701 obs. of 11 variables:
## $ issue_d : Date, format: "2015-12-01" "2015-12-01" ...
## $ last_pymnt_d : Date, format: "2019-01-01" "2016-06-01" ...
## $ last_credit_pull_d : Date, format: "2019-03-01" "2019-03-01" ...
## $ next_pymnt_d : Date, format: NA NA ...
## $ earliest_cr_line : Date, format: "2003-08-01" "1999-12-01" ...
## $ sec_app_earliest_cr_line : Date, format: NA NA ...
## $ hardship_start_date : Date, format: NA NA ...
## $ hardship_end_date : Date, format: NA NA ...
## $ payment_plan_start_date : Date, format: NA NA ...
## $ debt_settlement_flag_date: Date, format: NA NA ...
## $ settlement_date : Date, format: NA NA ...
## - attr(*, ".internal.selfref")=<externalptr>
numeric_vars <- loans %>% sapply(is.numeric) %>% which() %>% names()
setdiff(meta_loans$variable, c(numeric_vars, chr_to_date_vars))
## [1] "id" "member_id"
## [3] "term" "grade"
## [5] "sub_grade" "emp_title"
## [7] "emp_length" "home_ownership"
## [9] "verification_status" "loan_status"
## [11] "pymnt_plan" "url"
## [13] "desc" "purpose"
## [15] "title" "zip_code"
## [17] "addr_state" "initial_list_status"
## [19] "application_type" "verification_status_joint"
## [21] "hardship_flag" "hardship_type"
## [23] "hardship_reason" "hardship_status"
## [25] "hardship_loan_status" "disbursement_method"
## [27] "debt_settlement_flag" "settlement_status"
na_to_zero_vars <-
c("mths_since_last_delinq", "mths_since_last_record",
"mths_since_last_major_derog")
loans <-
loans %>%
mutate_at(.vars = na_to_zero_vars, .funs = funs(replace(., is.na(.), 0)))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
meta_loans <- funModeling::df_status(loans, print_results = FALSE)
meta_loans <-
meta_loans %>%
mutate(uniq_rat = unique / nrow(loans))
dplyr::setdiff(colnames(loans), meta_stats$LoanStatNew)
## [1] "verification_status_joint" "total_rev_hi_lim"
Default: can be seen as a point in time versus a period in time
Several variables of interest:
# Check the uniqueness of the variables of interest
purrr::map(.x=loans[, c("loan_status", "delinq_2yrs", "mths_since_last_delinq")], .f=base::unique)
## $loan_status
## [1] "Fully Paid"
## [2] "Current"
## [3] "Charged Off"
## [4] "In Grace Period"
## [5] "Late (31-120 days)"
## [6] "Late (16-30 days)"
## [7] "Default"
## [8] ""
## [9] "Does not meet the credit policy. Status:Fully Paid"
## [10] "Does not meet the credit policy. Status:Charged Off"
##
## $delinq_2yrs
## [1] 0 1 2 3 4 6 5 15 7 9 10 8 11 13 14 12 30 18 16 17 26 20 19 22 27
## [26] 39 NA 28 21 25 32 58 24 35 42 23 29 36
##
## $mths_since_last_delinq
## [1] 30 6 0 12 49 3 75 54 42 29 33 15 18 47 35 23 38 8
## [19] 70 59 45 52 66 50 46 39 63 61 78 74 19 34 9 27 26 68
## [37] 73 36 31 32 13 16 37 17 24 55 1 69 4 57 7 21 40 41
## [55] 11 81 82 14 64 48 5 22 2 44 10 60 51 53 25 20 43 62
## [73] 65 72 56 28 71 79 58 67 76 80 95 85 94 77 83 116 89 103
## [91] 99 113 109 92 84 101 88 119 120 98 96 90 86 106 87 115 108 152
## [109] 100 91 105 134 146 97 93 123 131 157 122 104 112 154 121 127 170 171
## [127] 110 142 125 118 102 129 114 117 149 133 176 111 137 107 130 160 138 126
## [145] 226 156 158 128 132 178 124 135 162 145 161 179 159 147 136 151 192 188
## [163] 141 180 140 153 150 148 139 143 202 168 195
#loan_status missing one level "" as "Issued"
loans$loan_status <- ifelse(loans$loan_status=="", 'Issued', loans$loan_status)
purrr::map(.x=loans[, c("loan_status", "delinq_2yrs", "mths_since_last_delinq")], .f=base::unique)
## $loan_status
## [1] "Fully Paid"
## [2] "Current"
## [3] "Charged Off"
## [4] "In Grace Period"
## [5] "Late (31-120 days)"
## [6] "Late (16-30 days)"
## [7] "Default"
## [8] "Issued"
## [9] "Does not meet the credit policy. Status:Fully Paid"
## [10] "Does not meet the credit policy. Status:Charged Off"
##
## $delinq_2yrs
## [1] 0 1 2 3 4 6 5 15 7 9 10 8 11 13 14 12 30 18 16 17 26 20 19 22 27
## [26] 39 NA 28 21 25 32 58 24 35 42 23 29 36
##
## $mths_since_last_delinq
## [1] 30 6 0 12 49 3 75 54 42 29 33 15 18 47 35 23 38 8
## [19] 70 59 45 52 66 50 46 39 63 61 78 74 19 34 9 27 26 68
## [37] 73 36 31 32 13 16 37 17 24 55 1 69 4 57 7 21 40 41
## [55] 11 81 82 14 64 48 5 22 2 44 10 60 51 53 25 20 43 62
## [73] 65 72 56 28 71 79 58 67 76 80 95 85 94 77 83 116 89 103
## [91] 99 113 109 92 84 101 88 119 120 98 96 90 86 106 87 115 108 152
## [109] 100 91 105 134 146 97 93 123 131 157 122 104 112 154 121 127 170 171
## [127] 110 142 125 118 102 129 114 117 149 133 176 111 137 107 130 160 138 126
## [145] 226 156 158 128 132 178 124 135 162 145 161 179 159 147 136 151 192 188
## [163] 141 180 140 153 150 148 139 143 202 168 195
#Define defaulted by reading the data dictionary's definition of these variables
defaulted <- c(
"Default",
"Does not meet the credit policy. Status:Charged Off",
"In Grace Period",
"Late (16-30 days)",
"Late (31-120 days)")
# Calculate the proportions of default/non-default
loans <- loans %>% mutate(default = ifelse(!(loan_status %in% defaulted), FALSE, TRUE))
loans %>% dplyr::summarise(default_freq = sum(default / n()))
## default_freq
## 1 0.01550537
table(loans$default)/nrow(loans)
##
## FALSE TRUE
## 0.98449463 0.01550537
# remove variables with high NA ratios > mean(NA ratios)
high_na <- meta_loans[meta_loans$p_na > mean(meta_loans$p_na),]$variable
# remove categorical or Date variables with high uniqueness ratios > mean(uniqueness ratios)
high_uniqueness <- meta_loans[meta_loans$uniq_rat > mean(meta_loans$uniq_rat) & (meta_loans$type =='character' | meta_loans$type == 'Date'),]$variable
both <- c(high_na, high_uniqueness)
loans <- loans %>% dplyr::select(-c(both)) #remove them
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(both)` instead of `both` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
dim(loans)
## [1] 2260701 100
Tom Fawcett
Conventional algorithms are often biased towards the majority class because the LOSS function attempts to optimize error rate without taking the data distribution into consideration. In the worst case, minority examples are treated as outliers of the majority class and ignores. The learning algorithm simply generates a trivial classifier that classifies every examples as the majority class [accuracy=freq of majority class]
Since the data is highly imbalanced, a stratified shuffle sampling method with a predefined random seed is used to preserve the class imbalance in the split data. In sklearn, this is “from sklearn.model_selection import StratifiedKFold”
set.seed(8402)
train_index <- caret::createDataPartition(loans$default, times=1, p=0.8, list=F)
train <- loans[ train_index, ]
test <- loans[-train_index, ] #loans.loc[~train_index, :] in python
table(train$default)/nrow(train)
##
## FALSE TRUE
## 0.98449431 0.01550569
table(test$default)/nrow(test)
##
## FALSE TRUE
## 0.98449592 0.01550408
income_vars <- c("annual_inc")
loan_amount_vars <- c("loan_amnt", "funded_amnt", "funded_amnt_inv")
train %>%
select_(.dots = income_vars) %>%
gather_("variable", "value", gather_cols = income_vars) %>%
ggplot(aes(x = value)) +
facet_wrap(~ variable, ncol = 3) +
scale_x_continuous(limits=c(0,1000000)) +
geom_histogram()
## Warning: `select_()` was deprecated in dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 490 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
train %>%
select_(.dots = loan_amount_vars) %>%
gather_("variable", "value", gather_cols = loan_amount_vars) %>%
ggplot(aes(x = value)) +
facet_wrap(~ variable, scales = "free_x", ncol = 3) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 78 rows containing non-finite values (stat_bin).
The annual income distribution shows that most applications have income less than 125k. The loan applied for, the loan committed and the amount invested show similar distributions.
cat_vars <- c("term", "grade", "sub_grade", "emp_title", "home_ownership",
"verification_status", "loan_status", "purpose", "zip_code",
"addr_state", "application_type", "policy_code")
meta_loans %>% dplyr::select(variable, p_zeros, p_na, unique) %>%
filter(variable %in% cat_vars) %>%
knitr::kable()
| variable | p_zeros | p_na | unique |
|---|---|---|---|
| term | 0 | 0 | 3 |
| grade | 0 | 0 | 8 |
| sub_grade | 0 | 0 | 36 |
| emp_title | 0 | 0 | 483751 |
| home_ownership | 0 | 0 | 7 |
| verification_status | 0 | 0 | 4 |
| loan_status | 0 | 0 | 10 |
| purpose | 0 | 0 | 15 |
| zip_code | 0 | 0 | 957 |
| addr_state | 0 | 0 | 52 |
| policy_code | 0 | 0 | 1 |
| application_type | 0 | 0 | 3 |
#emp_title has too many unique values to spread out as K or 483751 binary variables
Alternative credit scoring especially for self-employed businesses. Fintech uses social signal data and ml algorithms to assess the creditworthiness of an applicant.
LendingClub uses Grade as an ordinal factor to access the applicant’s creditworthiness.
train %>% ggplot(aes(grade, loan_amnt)) +
geom_boxplot(fill='white', colour='darkblue', outlier.colour = 'red', outlier.shape = 1) +
stat_summary(fun.data = function(x) return(c(y=median(x)*1.10, label=length(x))), geom='text') +
stat_summary(fun.y=mean, geom='point', color='green') +
facet_wrap(~default) +
scale_y_continuous(labels = comma) +
labs(title='Loan Amount by Grade Separated by Default', x='Grade', y='Loan Amount')
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Warning: Removed 26 rows containing non-finite values (stat_boxplot).
## Warning: Removed 26 rows containing non-finite values (stat_summary).
## Removed 26 rows containing non-finite values (stat_summary).
train <- train[!train$term=='',] #remove 26 NA's in term
train %>% ggplot(aes(grade, int_rate)) +
geom_boxplot(fill='white',color='darkblue', outlier.colour = 'red', outlier.shape=1) +
stat_summary(fun.data = function(x) return(c(y=median(x)*1.10, label=length(x))), geom='text') +
stat_summary(fun=mean, geom='point', color='green') +
scale_y_continuous(labels=comma) +
labs(title='Interest Rate by Grade', x='Grade', y='Interest Rate') +
facet_wrap(~term)
train %>% ggplot(aes(home_ownership, int_rate)) +
geom_boxplot(fill = 'white', colour = 'darkblue', outlier.color = 'red', outlier.shape = 1) +
stat_summary(fun.data = function(x) return(c(y=median(x)*1.10, label=length(x))), geom='text') +
stat_summary(fun=mean, geom='point', colour='darkblue') +
scale_y_continuous(labels=comma) +
facet_wrap(~default)
* median/mean interest rate is higher for people with ‘None’ answer in both defaulter and non-defaulter. * ratio of mortgage/own for non-defaulter is 4.4 while ratio of mortgage/own for defaulter is 3.6. * ratio of rent/own for both defaulters and non-defaulters are similar and equals to 3.5 * interest rate is generally higher with respect to the same group of home-ownership for default loans versus non-default loans.
funded_inv <- train %>% transmute(loan_amnt=loan_amnt, value=funded_amnt, variable='funded_amount') %>%
bind_rows(transmute(train, loan_amnt=loan_amnt, value=funded_amnt_inv, variable='funded_amount_inv')) %>%
bind_cols(grade=c(transmute(train, loan_amnt=loan_amnt, value=grade, variable='grade')$value,
transmute(train, loan_amnt=loan_amnt, value=grade, variable='grade')$value))
funded_inv %>% ggplot(aes(x=loan_amnt, y=value, color=grade)) +
geom_point(alpha=0.4) +
facet_wrap(~ variable)
The diagonal line indicates that the loan amount, funded amount and funded amount by investor are the same. * The funded amount is always less than the loan amount. * The funded amount by investor is smaller than the funded amount aka not the full loan is invested in. * The variations are getting larger as the loan amount increases.
train %>%
select(loan_amnt, grade, annual_inc, default) %>%
ggplot(aes(annual_inc, loan_amnt, color=grade)) +
geom_point(alpha=0.3) +
facet_wrap(~default)
## Warning: Removed 3 rows containing missing values (geom_point).
train %>%
select(loan_amnt, grade, dti, default) %>%
ggplot(aes(dti, loan_amnt, color=grade)) +
geom_point(alpha=0.3) +
facet_wrap(~default)
## Warning: Removed 1386 rows containing missing values (geom_point).
train %>%
select(int_rate, grade) %>%
group_by(grade) %>%
summarise(int_rate_mean=mean(int_rate, na.rm=T),
int_rate_median=median(int_rate, na.rm=T),
n=n()) %>%
knitr::kable()
| grade | int_rate_mean | int_rate_median | n |
|---|---|---|---|
| A | 7.085164 | 7.24 | 346242 |
| B | 10.674941 | 10.75 | 531354 |
| C | 14.143275 | 13.99 | 519430 |
| D | 18.141853 | 17.97 | 259803 |
| E | 21.830008 | 21.48 | 108708 |
| F | 25.444979 | 24.50 | 33217 |
| G | 28.074617 | 28.18 | 9782 |
train %>%
select(int_rate, issue_d) %>%
group_by(issue_d) %>%
summarise(mean_int = mean(int_rate)) %>%
ggplot(aes(x=issue_d, y=mean_int)) +
geom_line(color='darkblue')
Mean interest rate over time is increasing from 2007-2018. 2007 and 2011 have the lowest mean interest rate. As P2P starts to gain traction from 2012, its peak of average interest rate is in 2013. However, the mean interest overall does not draw a clear picture when segmenting by grade or loan amount. The phenomenal is also called ‘the Simpson’s paradox’. The overall effect may be different or inverse from the effect of the groups contributed to it. (omnibus correlation significance versus pairwise correlation significance).
train %>%
select(int_rate, grade, issue_d) %>%
group_by(grade, issue_d) %>%
summarise(int_rate_mean = mean(int_rate, na.rm = TRUE)) %>%
ggplot(aes(issue_d, int_rate_mean)) +
geom_line(color= "darkblue", size = 1) +
facet_wrap(~grade)
## `summarise()` has grouped output by 'grade'. You can override using the
## `.groups` argument.
train %>%
select(loan_amnt, grade, issue_d) %>%
group_by(grade, issue_d) %>%
summarise(loan_amnt_mean = mean(loan_amnt, na.rm = TRUE)) %>%
ggplot(aes(issue_d, loan_amnt_mean)) +
geom_line(color= "darkblue", size = 1) +
facet_wrap(~ grade)
## `summarise()` has grouped output by 'grade'. You can override using the
## `.groups` argument.
* The mean interest rate and variation in loan amount are increasing significantly for low-rated borrowers over time from Jan-2007 to Dec-2018.
#states
states <- read.csv('https://www2.census.gov/geo/docs/reference/state.txt', sep='|')
# average default rate by state
default_rate_state <- train %>%
select(default, addr_state) %>%
group_by(addr_state) %>%
summarise(default_rate=sum(default, na.rm=TRUE)/n()) %>%
left_join(states, by = c('addr_state' = 'STUSAB')) %>%
mutate(region=tolower(STATE_NAME), value=default_rate) %>%
select(region, value)
library(choroplethrMaps)
utils::data(state.map)
choroplethr::state_choropleth(df = default_rate_state,
title = "Default rate by State")
# average annual income by state
annual_inc_state <- train %>%
select(annual_inc, addr_state) %>%
group_by(addr_state) %>%
summarise(avg_annual_inc=mean(annual_inc, na.rm=TRUE)) %>%
left_join(states, by = c('addr_state' = 'STUSAB')) %>%
mutate(region=tolower(STATE_NAME), value=avg_annual_inc) %>%
select(region, value)
library(choroplethrMaps)
utils::data(state.map)
choroplethr::state_choropleth(df = annual_inc_state,
title = "Default rate by State")
Supervised binary classification: * logistic regression and regularized logistic regression * LDA/QDA * kNN * Trees/ Rf/ GBM * SVM soft/hard * Reduce noises through PCA and non-reduced noise (remember PCA is sensitive to scale of inputs)
Train on stratified distribution and it works ok without need for modification
Both methods result in duplicate data and make variables appear to have lower variance than they do.
In reality, undersampling 80/20 is preferable although it would be more beneficial to assess all cases.
Step 1: Take bootstrap samples from the population
Step 2: Balance each sample by down-sampling
Step 3: Learn a decision tree/random forest from each
Step 4: Majority votes on the whole set of bootstrap samples
train_down <- caret::downSample(x=train %>% select(-default),
y=factor(train$default), yname='default')
prop.table(table(train_down$default))
##
## FALSE TRUE
## 0.5 0.5
table(train_down$default)
##
## FALSE TRUE
## 28043 28043
downSample_custom <- function(x, y, yname = "Class", frac = 1){
lev <- levels(y)
minClass <- min(table(y))
lev_min <- levels(y)[which.min(table(y))]
inds_down <- sample(which(y == lev[lev != lev_min]), size = minClass * frac) #sample the indexes of the more abundant class according to minClass * frac
inds_minClass <- which(y == lev[lev == lev_min]) #take all the indexes of the lesser abundant class
out <- data.frame(x, y)
out <- out[sort(c(inds_down, inds_minClass)),]
colnames(out)[ncol(out)] <- yname
return(out)
}
# 80/20 downsampling the minority class
train_down_custom <- downSample_custom(x=train %>% select(-default),
y=factor(train$default), yname='default', frac=1/0.8)
table(train_down_custom$default)
##
## FALSE TRUE
## 35053 28043
Throw away minority examples and switch to an anamoly detection framework
Isolation Forest: identify anamolies in data by DF and the measuring the average number of decision splits required to isolate each data points. #avg is used to calculate each data point’s anamoly score or the likelihood that the example belongs to the minority class.
Box Drawing: axis-parralel hyper-rectangles around clusters of minority class examples
Nearest Neighbor Ensembles
Tomek looks for pairs of majority-minority and delete the majority class samples. It helps to draw a clearer decision boundary between the majority and minority class.
=> ROC, precision-recall curve, a lift curve, profit gain curve based on predicted probability
Probability that a random positive example will be ranked above a random negative example. Sometimes refers as the ability to separate the classes.
Harmonic mean of precision and recall
F1 = \(\frac{2(recall*precision)}{recall+precision}\)
data splitting:
make.names() for each column names so they do not have any space
remove some variables such as zip_code, title
pre-processing
remove correlated variables by simple correlation matrix. I need to compare the performance of removing variables through caret::findCorrelation with recursive feature elimination.
feature selection
model tuning
variable importance estimation
# particular naming scheme for categorical variables
character_vars <- train_down %>% sapply(is.character) %>% which() %>% names()
train_down <- train_down %>% mutate_at(.funs=make.names, .vars=character_vars)
test <- test %>% mutate_at(.funs=make.names, .vars=character_vars)
#remove zip code due to huge binary matrix and discriminative against some certain communities
unique(train_down$policy) == 1
## [1] TRUE
train_down <- train_down %>% select(-c(zip_code, title, policy_code))
dummies_train <- dummyVars('~.', data=train_down %>% select(-c(default)), fullRank = F) #need to specify fullRank as False for lm but do not need to specify it for tree
#fulllRank e.q. drop_first=True in OneHotEncoding
train_down_dummy <- train_down %>%
select(-which(sapply(., is.character))) %>%
cbind(predict(dummies_train, newdata=train_down))
dummies_test <- dummyVars('~.', data=test %>% select(dummies_train$vars), fullRank = F)
test_dummy <- test %>%
select((colnames(train_down))) %>%
select(-which(sapply(., is.character))) %>%
cbind(predict(dummies_test, newdata = test))
# levels only see in train not in test => bind these columns to the test set and set to zero
train_not_test <- setdiff(names(train_down_dummy), names(test_dummy))
if (length(test_dummy[, c(train_not_test)])!=0) {test_dummy[, c(train_not_test)] <- list(0)}
setdiff(names(test_dummy), names(train_down_dummy)) == 0
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE
# levels only see in test not in train => drop these columns corresponding to new levels
test_not_train <- setdiff(names(test_dummy), names(train_down_dummy))
if (length(test_dummy[, c(test_not_train)])!=0) {test_dummy[, c(test_not_train)] <- list(NULL)}#set to NULL means delete those columns
setdiff(names(test_dummy), names(train_down_dummy)) == 0
## logical(0)
\(log(\frac{p}{1-p}) = X\hat{\beta}\)
We will start out with a simple model and intuitively use the grade variable as the predictor.
mod1 <- glm(formula = default ~ grade -1, #remove the intercept
family = binomial(link = 'logit'),
data = train_down, na.action = na.exclude)
# accessing the attributes
attributes(mod1)
## $names
## [1] "coefficients" "residuals" "fitted.values"
## [4] "effects" "R" "rank"
## [7] "qr" "family" "linear.predictors"
## [10] "deviance" "aic" "null.deviance"
## [13] "iter" "weights" "prior.weights"
## [16] "df.residual" "df.null" "y"
## [19] "converged" "boundary" "model"
## [22] "call" "formula" "terms"
## [25] "data" "offset" "control"
## [28] "method" "contrasts" "xlevels"
##
## $class
## [1] "glm" "lm"
#summary
summary(mod1)
##
## Call:
## glm(formula = default ~ grade - 1, family = binomial(link = "logit"),
## data = train_down, na.action = na.exclude)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.55876 -1.25861 0.04397 1.09832 1.67521
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## gradeA -1.12104 0.02708 -41.396 <2e-16 ***
## gradeB -0.28756 0.01688 -17.038 <2e-16 ***
## gradeC 0.18890 0.01506 12.541 <2e-16 ***
## gradeD 0.45687 0.02022 22.593 <2e-16 ***
## gradeE 0.58624 0.03141 18.667 <2e-16 ***
## gradeF 0.49488 0.05506 8.988 <2e-16 ***
## gradeG 0.86283 0.09869 8.743 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 77752 on 56086 degrees of freedom
## Residual deviance: 74254 on 56079 degrees of freedom
## AIC: 74268
##
## Number of Fisher Scoring iterations: 4
All the coefficients of gradeA to gradeG are significant. GradeA and gradeB are negative while gradeC to gradeG are positive. It means that gradeA and gradeB decrease the likelihood of default while gradeC to gradeG increase the likelihood of default. The coefs help to confirm the hypothesis of bucketing bad versus good grades. GradeE has larger coefficient size than gradeF (0.58>0.49).
test <- test[!test$grade == 'X']
mod1 <- glm(formula = default ~ grade-1,
family = binomial(link = 'logit'),
data = train_down, na.action = na.exclude)
mod1_pred <- predict.glm(object=mod1, newdata = test, type='response')
mod1_pred_5_threshold <- ifelse(mod1_pred>0.5, TRUE, FALSE)
test$default <- factor(test$default, levels=c(FALSE,TRUE))
test$pred_glm <- factor(mod1_pred_5_threshold,
levels=levels(test$default))
confusionMatrix(test$pred_glm, test$default, positive = 'TRUE')
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 216988 2000
## TRUE 228134 5010
##
## Accuracy : 0.491
## 95% CI : (0.4895, 0.4925)
## No Information Rate : 0.9845
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.012
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.71469
## Specificity : 0.48748
## Pos Pred Value : 0.02149
## Neg Pred Value : 0.99087
## Prevalence : 0.01550
## Detection Rate : 0.01108
## Detection Prevalence : 0.51565
## Balanced Accuracy : 0.60109
##
## 'Positive' Class : TRUE
##
The simple classifier gives a fair amount of TP (true default rate/ sensitivity) but really low amount of TN (non-defaults rate / specificity). The Kappa is really low.
plot_pred_type_distribution <- function(df, threshold) {
v <- rep(NA, nrow(df))
v <- ifelse(df$pred >= threshold & df$actual == TRUE, "TP", v)
v <- ifelse(df$pred >= threshold & df$actual == FALSE, "FP", v)
v <- ifelse(df$pred < threshold & df$actual == TRUE, "FN", v)
v <- ifelse(df$pred < threshold & df$actual == FALSE, "TN", v)
df$pred_type <- v
ggplot(data = df, aes(x = actual, y = pred)) +
geom_violin(fill = rgb(1, 1 ,1, alpha = 0.02), color = NA) +
geom_jitter(aes(color = pred_type), alpha = 0.2, width=0.5, height=0.05) +
geom_hline(yintercept = threshold, color = "red", alpha = 1) +
scale_color_discrete(name = "type") +
labs(title = sprintf("Threshold at %.2f", threshold))
}
plot_pred_type_distribution(
df = tibble::tibble(pred = mod1_pred, actual = test$default), threshold = 0.5)
* We can clearly see the imbalance of the test data with a lot of actual negative classes (non-defaulters). * As the 0.5 threshold moving down /decreasing, TP/sensitivity increases at the expense of TN. * As the 0.5 the threshold moving up, TN/specificity increases at the expense of TP.
knitr::purl('script.R')
##
##
## processing file: script.R
##
|
| | 0%
|
|......................................................................| 100%
## output file: script.R
## [1] "script.R"