In this report, we attempt to predict the risk of the loan being default based on the past loan data. We obtained data from LendingClub’s website(https://www.lendingclub.com/info/download-data.action). We use loan data from year 2012-2014 as training and cross-validation set and loan data from year 2015 as a testing set. We also compare our investment performance against the baseline algorithm. We found that, among multiple machine learning algorithms that we tried, Logistic Regression provided a reasonable trade-off performance, and a higher return than the naive loan picking strategy can be achieved.
The loan dataset can be downloaded from https://www.lendingclub.com/info/download-data.action. We renamed the file of 2012-2013 loan data to loan_2012_2013.csv
, 2014 loan data to loan_2014.csv
and 2015 loan data to loan_2015.csv
. We stored them on the same directory of the script. The data dictionary can be found at the same URL.
There are 7 loan statuses: Charged Off
,Current
,Default
,Fully Paid
,In Grace Period
,Late (16-30 days)
,Late (31-120 days)
. We consider Late (31-120 days)
, Default
, Charged Off
as a default loan and Fully Paid
as a desirable loan and ignore everything else. We can create a shell script to filter such loan instead of loading the entire files.
We create extract.sh to perform the task:
#!/bin/sh
input_file=$1
output_file=$2
sed '2!d' $input_file > $output_file
awk -F '","' 'BEGIN {OFS=","} { if (toupper($17) == "FULLY PAID" || toupper($17) == "LATE (31-120 DAYS)" || toupper($17) == "DEFAULT" || toupper($17) == "CHARGED OFF") print }' $input_file >> $output_file
Execute the following command in the command line to filter only target statuses:
sh extract.sh loan_2012_2013.csv loan_2012_2013.extract.csv
sh extract.sh loan_2014.csv loan_2014.extract.csv
sh extract.sh loan_2015.csv loan_2015.extract.csv
We can now use *.extract.csv as our dataset
#load all required package
library(dplyr)
library(ggplot2)
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(corrplot)
library(e1071)
library(xgboost)
library(stringr)
library(lubridate)
library(tm)
library(rms)
library(glmnet)
library(pROC)
library(doMC)
library(kernlab)
#make the analysis reproducible
set.seed(100)
setwd(".")
We firstly investigate overall data
data_a = read.csv("loan_2012_2013.extract.csv",stringsAsFactors=FALSE)
data_b = read.csv("loan_2014.extract.csv",stringsAsFactors=FALSE)
data = read.csv("loan_2015.extract.csv",stringsAsFactors=FALSE)
data = rbind(data,data_a,data_b)
rm(data_a,data_b)
We take a look at the number of each loan_status so far
data %>% group_by(loan_status) %>% summarise(count = n())
## Source: local data frame [4 x 2]
##
## loan_status count
## <chr> <int>
## 1 Charged Off 57896
## 2 Default 203
## 3 Fully Paid 242869
## 4 Late (31-120 days) 12665
We categorize Charged Off
, Default
, and Late (31-120 days)
to a single category: Default
. The data is moderately imbalance.
data$loan_status = ifelse(str_detect(data$loan_status,"Paid"),data$loan_status,"Default")
tmp = data %>% group_by(loan_status) %>% summarise(ncount = n())
tmp$ncount = 100 * tmp$ncount/nrow(data)
tmp$ncount_p = str_c(round(tmp$ncount,2),"%")
ggplot(tmp,aes(x=loan_status,y=ncount,fill=loan_status)) + geom_bar(stat="identity") +
geom_text(aes(label=ncount_p),vjust = 2)
The goal of this analysis is to produce a loan picking strategy that is superior to the naive one. Let’s look at the return and default rate for each loan grade from 2012-2015.
data$int_rate = (as.numeric(gsub(pattern = "%",replacement = "",x = data$int_rate)))
#extract issued year
data$issue_y = as.numeric(sapply( data$issue_d ,function(x){str_split(x,"-")[[1]][2]}))
displayInterestByGrade <- function(dt){
g1 = dt %>% filter(loan_status == "Default") %>% group_by(grade) %>% summarise(default_count = n())
g2 = dt %>% group_by(grade) %>% summarise(count = n(),int_rate=mean(int_rate))
g2 %>% left_join(g1) %>% mutate(default_rate = 100*default_count/count) %>% select(grade,count,default_count,int_rate,default_rate)
}
#year 2012
tmp0 = displayInterestByGrade(data %>% filter(issue_y==2012))
## Joining, by = "grade"
tmp0$year = 2012
#year 2013
tmp1 = displayInterestByGrade(data %>% filter(issue_y==2013))
## Joining, by = "grade"
tmp1$year = 2013
#year 2014
tmp2 = displayInterestByGrade(data %>% filter(issue_y==2014))
## Joining, by = "grade"
tmp2$year = 2014
#year 2015
tmp3 = displayInterestByGrade(data %>% filter(issue_y==2015))
## Joining, by = "grade"
tmp3$year = 2015
tmp = rbind(tmp0,tmp1,tmp2,tmp3)
ggplot(tmp, aes(x=grade, y=default_rate,fill=as.factor(year))) + geom_bar(stat="identity",position="dodge") + ggtitle("Default Rate(%)")
ggplot(tmp, aes(x=grade, y=int_rate,fill=as.factor(year))) + geom_bar(stat="identity",position="dodge") + ggtitle("Interest Rate(%)")
rm(tmp,tmp0,tmp1,tmp2,tmp3)
As expected, riskier loans come with higher interest rates.
We can use total_pymnt
(payment obtained from borrower) and funded_amnt
(money lent) to calculate Return on Investment(ROI).
For year 2015, the ROI of our baseline investment strategy-Picking every loans-, is:
all_roi = sum((data %>% filter(issue_y==2015))$total_pymnt)/sum((data %>% filter(issue_y==2015))$funded_amnt) - 1
all_roi
## [1] -0.1756042
or roughly -17.56%. Noted that we excluded Current
loans from our analysis(which a large amount of them typically ended up being paid-off), so negative ROI is as expected. Again, the higher the ROI is, the better.
We can look at ROI pertaining to each loan grade:
data$prediction = "Fully Paid"
createPerformanceTable <- function(dt){
dt_pick = dt %>% filter(prediction == "Fully Paid")
all_roi = sum(dt_pick$total_pymnt)/sum(dt_pick$funded_amnt) - 1
temp_table = data.frame(grade=character(0),roi=numeric(0),percent_pick=numeric(0))
for(g in c("A","B","C","D","E","F","G")){
data_pick_grade = dt_pick %>% filter(grade==g)
if(nrow(data_pick_grade)==0){
temp_table = rbind(temp_table,data.frame(grade=g,roi=0,percent_pick=0))
}
else
{
data_grade = dt %>% filter(grade==g)
roi = sum(data_pick_grade$total_pymnt)/sum(data_pick_grade$funded_amnt) - 1
temp_table = rbind(temp_table,data.frame(grade=g,roi=roi,percent_pick=100 * nrow(data_pick_grade)/nrow(data_grade)))
}
}
temp_table = rbind(temp_table,data.frame(grade="ALL",roi=all_roi,percent_pick=100 * nrow(dt_pick)/nrow(dt) ))
return(temp_table)
}
baseline_table = createPerformanceTable(data %>% filter(issue_y==2015))
baseline_table
## grade roi percent_pick
## 1 A -0.02860544 100
## 2 B -0.08447862 100
## 3 C -0.16181754 100
## 4 D -0.23565662 100
## 5 E -0.28144831 100
## 6 F -0.34315720 100
## 7 G -0.35786380 100
## 8 ALL -0.17560423 100
The baseline_table
served as a baseline performance when we evaluate our model
2012-2014 data will be served as training & cross validation set and 2015 as a test set. We will put more emphasis on avoiding default loans since they can wreak havoc on our overall investment. We also take a note of percentage of loan picked as picking a small number of loans doesn’t make a sizable investment.
We explore some features in the dataset. We will create transformation model for the 2012-2014 data. We will later apply it to 2015 data.
data1 = data %>% filter(issue_y != 2015)
#look at number of row and column
dim(data1)
## [1] 246073 113
rm(data)
Looking at the data dictionary, we can identify and drop irrelevant features(features that do not appear at the issued time, or id field). emp_title
may be useful but it needed to be effectively categorized.
discard_column = c("collection_recovery_fee","emp_title",
"funded_amnt_inv","id",
"installment","last_credit_pull_d",
"last_fico_range_high","last_fico_range_low",
"last_pymnt_amnt","last_pymnt_d",
"loan_amnt","member_id",
"next_pymnt_d","num_tl_120dpd_2m",
"num_tl_30dpd","out_prncp",
"out_prncp_inv","recoveries",
"total_pymnt","total_pymnt_inv",
"total_rec_int","total_rec_late_fee",
"total_rec_prncp","url",
"zip_code"
)
data1 = (data1[,!(names(data1) %in% discard_column)])
We also drop grade
featrue as such data also carried in sub_grade
data1$grade = NULL
We drop features that contain too many NA values. We can list the raito of NA
value for each feature:
tmp = sort(sapply(data1, function(x) sum(length(which(is.na(x)))))/nrow(data1),decreasing = TRUE)
Drop features that have more than 50% missing and take a look at what features that are still missing:
discard_column = names(tmp[tmp>0.5])
data1 = (data1[,!(names(data1) %in% discard_column)])
tmp = sort(sapply(data1, function(x) sum(length(which(is.na(x)))))/nrow(data1),decreasing = TRUE)
tmp[tmp>0]
## mo_sin_old_il_acct mths_since_recent_inq
## 0.13758925 0.11994815
## pct_tl_nvr_dlq avg_cur_bal
## 0.10749656 0.10719177
## mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op
## 0.10715113 0.10715113
## tot_coll_amt tot_cur_bal
## 0.10714707 0.10714707
## total_rev_hi_lim mo_sin_rcnt_tl
## 0.10714707 0.10714707
## num_accts_ever_120_pd num_actv_bc_tl
## 0.10714707 0.10714707
## num_actv_rev_tl num_bc_tl
## 0.10714707 0.10714707
## num_il_tl num_op_rev_tl
## 0.10714707 0.10714707
## num_rev_accts num_rev_tl_bal_gt_0
## 0.10714707 0.10714707
## num_tl_90g_dpd_24m num_tl_op_past_12m
## 0.10714707 0.10714707
## tot_hi_cred_lim total_il_high_credit_limit
## 0.10714707 0.10714707
## num_bc_sats num_sats
## 0.06212384 0.06212384
## bc_util percent_bc_gt_75
## 0.03882994 0.03843168
## bc_open_to_buy mths_since_recent_bc
## 0.03821630 0.03720441
## acc_open_past_24mths mort_acc
## 0.02899952 0.02899952
## total_bal_ex_mort total_bc_limit
## 0.02899952 0.02899952
tmp = tmp[tmp>0]
par(mfrow=c(2,3))
for(feature_name in names(tmp)){
hist(data1[[feature_name]],main = str_c(feature_name,"(missing=",100* round(as.numeric(tmp[feature_name]),2) ,"%)") )
}
par(mfrow=c(1,1))
From the look of distributions, it is reasonable to impute missing values with the median value. We store median_impute_model
for further use.
median_impute_model = preProcess(data1[names(tmp)],method="medianImpute")
data1 = predict(median_impute_model,data1)
Let’s re-check the columns left, there should be no feature with NA value
sort(sapply(data1, function(x) sum(length(which(is.na(x)))))/nrow(data1),decreasing = TRUE)
## funded_amnt term
## 0 0
## int_rate sub_grade
## 0 0
## emp_length home_ownership
## 0 0
## annual_inc verification_status
## 0 0
## issue_d loan_status
## 0 0
## pymnt_plan desc
## 0 0
## purpose title
## 0 0
## addr_state dti
## 0 0
## delinq_2yrs earliest_cr_line
## 0 0
## inq_last_6mths open_acc
## 0 0
## pub_rec revol_bal
## 0 0
## revol_util total_acc
## 0 0
## initial_list_status collections_12_mths_ex_med
## 0 0
## policy_code application_type
## 0 0
## acc_now_delinq tot_coll_amt
## 0 0
## tot_cur_bal total_rev_hi_lim
## 0 0
## acc_open_past_24mths avg_cur_bal
## 0 0
## bc_open_to_buy bc_util
## 0 0
## chargeoff_within_12_mths delinq_amnt
## 0 0
## mo_sin_old_il_acct mo_sin_old_rev_tl_op
## 0 0
## mo_sin_rcnt_rev_tl_op mo_sin_rcnt_tl
## 0 0
## mort_acc mths_since_recent_bc
## 0 0
## mths_since_recent_inq num_accts_ever_120_pd
## 0 0
## num_actv_bc_tl num_actv_rev_tl
## 0 0
## num_bc_sats num_bc_tl
## 0 0
## num_il_tl num_op_rev_tl
## 0 0
## num_rev_accts num_rev_tl_bal_gt_0
## 0 0
## num_sats num_tl_90g_dpd_24m
## 0 0
## num_tl_op_past_12m pct_tl_nvr_dlq
## 0 0
## percent_bc_gt_75 pub_rec_bankruptcies
## 0 0
## tax_liens tot_hi_cred_lim
## 0 0
## total_bal_ex_mort total_bc_limit
## 0 0
## total_il_high_credit_limit issue_y
## 0 0
## prediction
## 0
Let’s take a look at all features left
str(data1)
## 'data.frame': 246073 obs. of 67 variables:
## $ funded_amnt : int 20800 3000 12000 12000 28000 11100 8000 24000 15000 12000 ...
## $ term : chr " 36 months" " 36 months" " 36 months" " 36 months" ...
## $ int_rate : num 13.53 12.85 13.53 7.62 7.62 ...
## $ sub_grade : chr "B5" "B4" "B5" "A3" ...
## $ emp_length : chr "10+ years" "10+ years" "10+ years" "3 years" ...
## $ home_ownership : chr "RENT" "RENT" "RENT" "MORTGAGE" ...
## $ annual_inc : num 81500 25000 40000 96500 325000 90000 33000 100000 98000 60000 ...
## $ verification_status : chr "Verified" "Verified" "Source Verified" "Not Verified" ...
## $ issue_d : chr "Dec-2013" "Dec-2013" "Dec-2013" "Dec-2013" ...
## $ loan_status : chr "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
## $ pymnt_plan : chr "n" "n" "n" "n" ...
## $ desc : chr " Borrower added on 12/31/13 > My goal is to purchase a home. I am consolidating my debt to lower interest rate to pay off debt"| __truncated__ "" "" " Borrower added on 12/31/13 > Bought a new house, furniture, water softener, a second car, etc. Got our lives started and now "| __truncated__ ...
## $ purpose : chr "debt_consolidation" "debt_consolidation" "debt_consolidation" "debt_consolidation" ...
## $ title : chr "Reducing Debt to Purchase Home" "debt" "Debt consolidation" "Debt Consolidation and Credit Transfer" ...
## $ addr_state : chr "NY" "FL" "NM" "TX" ...
## $ dti : num 16.7 24.7 16.9 12.6 18.6 ...
## $ delinq_2yrs : int 0 0 0 0 0 1 0 0 0 0 ...
## $ earliest_cr_line : chr "Jun-1998" "May-1991" "Oct-1998" "Sep-2003" ...
## $ inq_last_6mths : int 2 0 0 0 1 0 1 0 2 1 ...
## $ open_acc : int 29 5 7 17 15 9 9 14 16 15 ...
## $ pub_rec : int 0 2 2 0 0 0 1 0 0 0 ...
## $ revol_bal : int 23473 2875 5572 13248 29581 6619 7203 21617 5749 7137 ...
## $ revol_util : chr "54.5%" "54.2%" "68.8%" "55.7%" ...
## $ total_acc : int 41 26 32 30 31 12 16 39 16 18 ...
## $ initial_list_status : chr "f" "f" "w" "f" ...
## $ collections_12_mths_ex_med: int 0 0 0 0 0 0 0 0 0 0 ...
## $ policy_code : int 1 1 1 1 1 1 1 1 1 1 ...
## $ application_type : chr "INDIVIDUAL" "INDIVIDUAL" "INDIVIDUAL" "INDIVIDUAL" ...
## $ acc_now_delinq : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tot_coll_amt : num 0 154 15386 0 0 ...
## $ tot_cur_bal : num 23473 19530 13605 200314 799592 ...
## $ total_rev_hi_lim : num 43100 5300 8100 23800 54200 10000 20800 28200 25800 29700 ...
## $ acc_open_past_24mths : num 9 3 4 4 6 2 2 7 6 8 ...
## $ avg_cur_bal : num 869 3906 2268 11783 53306 ...
## $ bc_open_to_buy : num 6811 2050 1428 2441 13901 ...
## $ bc_util : num 54.6 52.3 79.6 83.5 67.1 74.6 72.5 77.6 27.6 15.9 ...
## $ chargeoff_within_12_mths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ delinq_amnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mo_sin_old_il_acct : num 115 164 124 123 125 128 129 179 2 128 ...
## $ mo_sin_old_rev_tl_op : num 186 271 182 118 229 150 269 299 257 48 ...
## $ mo_sin_rcnt_rev_tl_op : num 0 7 1 10 5 11 14 18 7 1 ...
## $ mo_sin_rcnt_tl : num 0 7 1 9 2 11 14 7 2 1 ...
## $ mort_acc : num 0 6 0 1 5 1 0 3 0 0 ...
## $ mths_since_recent_bc : num 0 14 11 10 5 11 18 18 7 1 ...
## $ mths_since_recent_inq : num 0 8 17 10 3 11 6 7 2 3 ...
## $ num_accts_ever_120_pd : num 1 1 6 0 0 1 0 0 0 0 ...
## $ num_actv_bc_tl : num 8 2 2 4 4 4 3 3 8 4 ...
## $ num_actv_rev_tl : num 24 3 2 5 5 8 5 5 8 7 ...
## $ num_bc_sats : num 11 3 3 4 6 4 4 5 13 8 ...
## $ num_bc_tl : num 17 6 14 10 8 4 7 10 13 10 ...
## $ num_il_tl : num 1 11 8 15 11 0 2 17 1 0 ...
## $ num_op_rev_tl : num 29 4 6 8 9 8 8 8 15 15 ...
## $ num_rev_accts : num 40 9 24 14 15 11 14 19 15 18 ...
## $ num_rev_tl_bal_gt_0 : num 24 3 2 5 5 8 5 5 8 7 ...
## $ num_sats : num 29 5 7 17 15 9 9 14 16 15 ...
## $ num_tl_90g_dpd_24m : num 0 0 0 0 0 1 0 0 0 0 ...
## $ num_tl_op_past_12m : num 3 1 2 3 5 1 0 2 2 4 ...
## $ pct_tl_nvr_dlq : num 90.2 91.3 81.2 100 100 75 100 100 100 100 ...
## $ percent_bc_gt_75 : num 50 66.7 33.3 100 16.7 50 75 75 7.7 0 ...
## $ pub_rec_bankruptcies : int 0 2 0 0 0 0 1 0 0 0 ...
## $ tax_liens : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tot_hi_cred_lim : num 43100 32082 18130 233004 850886 ...
## $ total_bal_ex_mort : num 23473 19530 13605 46738 199739 ...
## $ total_bc_limit : num 15000 4300 7000 14800 42200 4000 8200 21500 20800 18100 ...
## $ total_il_high_credit_limit: num 0 26782 10030 53404 196686 ...
## $ issue_y : num 2013 2013 2013 2013 2013 ...
## $ prediction : chr "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
We parse revol_util to numeric
data1$revol_util = (as.numeric(gsub(pattern = "%",replacement = "",x = data1$int_rate)))
earliest_cr_line
is transformed to the number of days before the loan is issued
data1$earliest_cr_line = parse_date_time(str_c("01",data1$issue_d),"dmy" ) - parse_date_time(str_c("01",data1$earliest_cr_line),"dmy" )
data1$earliest_cr_line = as.numeric(data1$earliest_cr_line,units = "days")
We can see that the default rate doesn’t vary much by the month it issued. We will drop issue_d
#extract issued month
data1$issue_m = sapply( data1$issue_d ,function(x){str_split(x,"-")[[1]][1]})
tmp = data1 %>% filter(loan_status=="Default") %>% group_by(issue_m) %>% summarise(default_count = n())
tmp2 = data1 %>% group_by(issue_m) %>% summarise(count = n())
tmp2 %>% left_join(tmp) %>% mutate(default_rate = default_count/count)
## Joining, by = "issue_m"
## Source: local data frame [12 x 4]
##
## issue_m count default_count default_rate
## <chr> <int> <int> <dbl>
## 1 Apr 20828 4134 0.1984828
## 2 Aug 20604 4592 0.2228693
## 3 Dec 17151 3452 0.2012711
## 4 Feb 17129 3391 0.1979684
## 5 Jan 17315 3356 0.1938204
## 6 Jul 24245 5663 0.2335739
## 7 Jun 20575 4230 0.2055893
## 8 Mar 18570 3599 0.1938072
## 9 May 21288 4357 0.2046693
## 10 Nov 22413 5007 0.2233971
## 11 Oct 28244 6553 0.2320139
## 12 Sep 17711 3714 0.2097002
data1$issue_m = NULL
data1$issue_d = NULL
rm(tmp,tmp2)
Let’s see if there are any features that have zero variance:
#define some functions to be used later on
getNumericColumns<-function(t){
tn = sapply(t,function(x){is.numeric(x)})
return(names(tn)[which(tn)])
}
getCharColumns<-function(t){
tn = sapply(t,function(x){is.character(x)})
return(names(tn)[which(tn)])
}
getFactorColumns<-function(t){
tn = sapply(t,function(x){is.factor(x)})
return(names(tn)[which(tn)])
}
getIndexsOfColumns <- function(t,column_names){
return(match(column_names,colnames(t)))
}
tmp = apply(data1[getCharColumns(data1)],2,function(x){length(unique(x))})
tmp = tmp[tmp==1]
tmp2 = apply(data1[getNumericColumns(data1)],2,function(x){(sd(x))})
tmp2 = tmp2[tmp2==0]
discard_column = c(names(tmp),names(tmp2))
discard_column
## [1] "application_type" "prediction" "policy_code"
We then proceed to drop the zero variance features
data1 = (data1[,!(names(data1) %in% discard_column)])
Next, we investigate whether desc
(description) can be useful in determining Default and Fully Paid loan. However, noted that our test dataset(2015 loans) has only 9 loans with non-empty description’s length so desc
is quite useless for current 2015 data. Also noted that information in title
is replicated in purpose
so we will drop them.
data1$desc = NULL
data1$title = NULL
We take a look at default rate for each state from year 2012-2014. We filter out states that have too small number of loans(less than 1000):
tmp = data1 %>% filter(loan_status=="Default") %>% group_by(addr_state,issue_y) %>% summarise(default_count = n())
tmp2 = data1 %>% group_by(addr_state,issue_y) %>% summarise(count = n())
tmp3 = tmp2 %>% left_join(tmp) %>% mutate(default_rate = default_count/count)
## Joining, by = c("addr_state", "issue_y")
#order by highest default rate
a0 = (tmp3 %>% filter(issue_y == 2012 & count > 1000) %>% arrange(desc(default_rate)))[1:10,"addr_state"]$addr_state
a1 = (tmp3 %>% filter(issue_y == 2013 & count > 1000) %>% arrange(desc(default_rate)))[1:10,"addr_state"]$addr_state
a2 = (tmp3 %>% filter(issue_y == 2014 & count > 1000) %>% arrange(desc(default_rate)))[1:10,"addr_state"]$addr_state
high_default = intersect(intersect(a0,a1),a2)
#order by lowest default rate
a0 = (tmp3 %>% filter(issue_y == 2012 & count > 1000) %>% arrange((default_rate)))[1:10,"addr_state"]$addr_state
a1 = (tmp3 %>% filter(issue_y == 2013 & count > 1000) %>% arrange((default_rate)))[1:10,"addr_state"]$addr_state
a2 = (tmp3 %>% filter(issue_y == 2014 & count > 1000) %>% arrange((default_rate)))[1:10,"addr_state"]$addr_state
low_default = intersect(intersect(a0,a1),a2)
We noticed Florida and New York consistently constitute in top 10 - highest default rate from year 2012 - 2014
high_default
## [1] "FL" "NY"
While Illinois, Texas, California, Georgia have lowest default rate
low_default
## [1] "IL" "TX" "CA" "GA"
We then create binary variable for 6 states and discard the rest
data1$is_fl = ifelse(data1$addr_state=="FL",1,0)
data1$is_ny = ifelse(data1$addr_state=="NY",1,0)
data1$is_il = ifelse(data1$addr_state=="IL",1,0)
data1$is_tx = ifelse(data1$addr_state=="TX",1,0)
data1$is_ca = ifelse(data1$addr_state=="CA",1,0)
data1$is_ga = ifelse(data1$addr_state=="GA",1,0)
data1$addr_state = NULL
rm(tmp,tmp2,tmp3,a0,a1,a2,high_default,low_default)
We will investigate if there are any correlation among features
corrplot(cor(data1[getNumericColumns(data1)],use="na.or.complete"))
We found some features are quite correlated, we can remove correlated features with findCorrelation
function. The function will find all correlated pairs that have correlation exceed a specified threshold and try to remove one of them in such a way that overall correlation is reduced.
high_corr <- findCorrelation(cor(data1[getNumericColumns(data1)]), cutoff = .75)
high_corr = getNumericColumns(data1)[high_corr]
high_corr
## [1] "num_sats" "open_acc" "num_op_rev_tl"
## [4] "num_rev_accts" "num_bc_sats" "total_bc_limit"
## [7] "num_rev_tl_bal_gt_0" "num_actv_rev_tl" "tot_hi_cred_lim"
## [10] "total_rev_hi_lim" "tot_cur_bal" "total_bal_ex_mort"
## [13] "earliest_cr_line" "int_rate" "bc_util"
data1 = (data1[,!(names(data1) %in% high_corr)])
Let’s look at all numeric features we have left.
str(data1[getNumericColumns(data1)])
## 'data.frame': 246073 obs. of 42 variables:
## $ funded_amnt : int 20800 3000 12000 12000 28000 11100 8000 24000 15000 12000 ...
## $ annual_inc : num 81500 25000 40000 96500 325000 90000 33000 100000 98000 60000 ...
## $ dti : num 16.7 24.7 16.9 12.6 18.6 ...
## $ delinq_2yrs : int 0 0 0 0 0 1 0 0 0 0 ...
## $ inq_last_6mths : int 2 0 0 0 1 0 1 0 2 1 ...
## $ pub_rec : int 0 2 2 0 0 0 1 0 0 0 ...
## $ revol_bal : int 23473 2875 5572 13248 29581 6619 7203 21617 5749 7137 ...
## $ revol_util : num 13.53 12.85 13.53 7.62 7.62 ...
## $ total_acc : int 41 26 32 30 31 12 16 39 16 18 ...
## $ collections_12_mths_ex_med: int 0 0 0 0 0 0 0 0 0 0 ...
## $ acc_now_delinq : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tot_coll_amt : num 0 154 15386 0 0 ...
## $ acc_open_past_24mths : num 9 3 4 4 6 2 2 7 6 8 ...
## $ avg_cur_bal : num 869 3906 2268 11783 53306 ...
## $ bc_open_to_buy : num 6811 2050 1428 2441 13901 ...
## $ chargeoff_within_12_mths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ delinq_amnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mo_sin_old_il_acct : num 115 164 124 123 125 128 129 179 2 128 ...
## $ mo_sin_old_rev_tl_op : num 186 271 182 118 229 150 269 299 257 48 ...
## $ mo_sin_rcnt_rev_tl_op : num 0 7 1 10 5 11 14 18 7 1 ...
## $ mo_sin_rcnt_tl : num 0 7 1 9 2 11 14 7 2 1 ...
## $ mort_acc : num 0 6 0 1 5 1 0 3 0 0 ...
## $ mths_since_recent_bc : num 0 14 11 10 5 11 18 18 7 1 ...
## $ mths_since_recent_inq : num 0 8 17 10 3 11 6 7 2 3 ...
## $ num_accts_ever_120_pd : num 1 1 6 0 0 1 0 0 0 0 ...
## $ num_actv_bc_tl : num 8 2 2 4 4 4 3 3 8 4 ...
## $ num_bc_tl : num 17 6 14 10 8 4 7 10 13 10 ...
## $ num_il_tl : num 1 11 8 15 11 0 2 17 1 0 ...
## $ num_tl_90g_dpd_24m : num 0 0 0 0 0 1 0 0 0 0 ...
## $ num_tl_op_past_12m : num 3 1 2 3 5 1 0 2 2 4 ...
## $ pct_tl_nvr_dlq : num 90.2 91.3 81.2 100 100 75 100 100 100 100 ...
## $ percent_bc_gt_75 : num 50 66.7 33.3 100 16.7 50 75 75 7.7 0 ...
## $ pub_rec_bankruptcies : int 0 2 0 0 0 0 1 0 0 0 ...
## $ tax_liens : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_il_high_credit_limit: num 0 26782 10030 53404 196686 ...
## $ issue_y : num 2013 2013 2013 2013 2013 ...
## $ is_fl : num 0 1 0 0 0 0 0 0 0 0 ...
## $ is_ny : num 1 0 0 0 0 1 0 0 1 0 ...
## $ is_il : num 0 0 0 0 0 0 0 0 0 0 ...
## $ is_tx : num 0 0 0 1 0 0 0 0 0 0 ...
## $ is_ca : num 0 0 0 0 1 0 0 0 0 0 ...
## $ is_ga : num 0 0 0 0 0 0 0 0 0 0 ...
We transform annual_inc
, revol_bal
, avg_cur_bal
, bc_open_to_buy
, total_il_high_credit_limit
by deviding them by funded_amnt
(amount of loan)
data1$annual_inc = data1$annual_inc/data1$funded_amnt
data1$revol_bal = data1$revol_bal/data1$funded_amnt
data1$avg_cur_bal = data1$avg_cur_bal/data1$funded_amnt
data1$bc_open_to_buy = data1$bc_open_to_buy/data1$funded_amnt
data1$total_il_high_credit_limit = data1$total_il_high_credit_limit/data1$funded_amnt
data1$funded_amnt = NULL
Let’s look at all character features we have left.
str(data1[getCharColumns(data1)])
## 'data.frame': 246073 obs. of 9 variables:
## $ term : chr " 36 months" " 36 months" " 36 months" " 36 months" ...
## $ sub_grade : chr "B5" "B4" "B5" "A3" ...
## $ emp_length : chr "10+ years" "10+ years" "10+ years" "3 years" ...
## $ home_ownership : chr "RENT" "RENT" "RENT" "MORTGAGE" ...
## $ verification_status: chr "Verified" "Verified" "Source Verified" "Not Verified" ...
## $ loan_status : chr "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
## $ pymnt_plan : chr "n" "n" "n" "n" ...
## $ purpose : chr "debt_consolidation" "debt_consolidation" "debt_consolidation" "debt_consolidation" ...
## $ initial_list_status: chr "f" "f" "w" "f" ...
We look at home_ownership
and filter only observations that have value “MORTGAGE”,“OWN”, or “RENT” as these are only values that appear in 2015 data
table(data1$home_ownership)
##
## ANY MORTGAGE NONE OTHER OWN RENT
## 1 123973 40 43 21228 100788
data1 = data1 %>% filter(home_ownership == "MORTGAGE" | home_ownership == "OWN" | home_ownership == "RENT")
We see there are 3 loans with pymnt_plan=“y”, all ended in being default
data1 %>% filter(pymnt_plan=="y") %>% select(loan_status)
## loan_status
## 1 Default
## 2 Default
## 3 Default
But 3 loans is too small to make the conclusive finding. We will just remove pymnt_plan
feature.
data1$pymnt_plan = NULL
remove issue_y as it has no further use
## Warning in rm(all_roi, c, discard_column, feature_name, high_corr,
## numeric_column): object 'c' not found
## Warning in rm(all_roi, c, discard_column, feature_name, high_corr,
## numeric_column): object 'numeric_column' not found
We will transform all character predictor features to binary dummy variables
loan_status = data1$loan_status
dummy_model = dummyVars(loan_status ~ .,data1,fullRank = TRUE)
data1 = as.data.frame(predict(dummy_model,data1))
data1$loan_status = loan_status
rm(loan_status)
#set loan with status 'Fully Paid' as a positive sample
data1$loan_status = ifelse(data1$loan_status == "Fully Paid","Fully.Paid",data1$loan_status)
data1$loan_status = factor(data1$loan_status,levels = c("Default","Fully.Paid"))
The data is centered and scaled. We can try to remove the number of dimensions further by fitting Logistic regression and investigate p-value of the coefficients. The null hypothesis is that each feature makes no contribution to the predictive model(its coefficient is zero). We then discard each feature that fails to reject the hypothesis.
trans_model = preProcess(data1,method=c("center","scale"))
data1 = predict(trans_model, data1)
model = lrm(loan_status ~ .,data1)
model
##
## Logistic Regression Model
##
## lrm(formula = loan_status ~ ., data = data1)
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 245989 LR chi2 24375.03 R2 0.147 C 0.714
## Default 52032 d.f. 103 g 0.968 Dxy 0.428
## Fully.Paid 193957 Pr(> chi2) <0.0001 gr 2.632 gamma 0.430
## max |deriv| 7e-11 gp 0.142 tau-a 0.143
## Brier 0.150
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 1.5087 0.0059 255.30 <0.0001
## term 60 months -0.2062 0.0061 -34.04 <0.0001
## sub_gradeA2 -0.0644 0.0142 -4.53 <0.0001
## sub_gradeA3 -0.1097 0.0145 -7.57 <0.0001
## sub_gradeA4 -0.1864 0.0163 -11.45 <0.0001
## sub_gradeA5 -0.2593 0.0173 -15.00 <0.0001
## sub_gradeB1 -0.3310 0.0185 -17.89 <0.0001
## sub_gradeB2 -0.4134 0.0210 -19.70 <0.0001
## sub_gradeB3 -0.5064 0.0235 -21.55 <0.0001
## sub_gradeB4 -0.5522 0.0237 -23.32 <0.0001
## sub_gradeB5 -0.5411 0.0218 -24.83 <0.0001
## sub_gradeC1 -0.5974 0.0233 -25.67 <0.0001
## sub_gradeC2 -0.6219 0.0235 -26.51 <0.0001
## sub_gradeC3 -0.6453 0.0233 -27.69 <0.0001
## sub_gradeC4 -0.6569 0.0234 -28.06 <0.0001
## sub_gradeC5 -0.6666 0.0233 -28.59 <0.0001
## sub_gradeD1 -0.6597 0.0230 -28.67 <0.0001
## sub_gradeD2 -0.6445 0.0223 -28.91 <0.0001
## sub_gradeD3 -0.6108 0.0214 -28.58 <0.0001
## sub_gradeD4 -0.6266 0.0213 -29.47 <0.0001
## sub_gradeD5 -0.5803 0.0200 -29.07 <0.0001
## sub_gradeE1 -0.5178 0.0177 -29.29 <0.0001
## sub_gradeE2 -0.5465 0.0183 -29.83 <0.0001
## sub_gradeE3 -0.4929 0.0170 -29.07 <0.0001
## sub_gradeE4 -0.4856 0.0166 -29.30 <0.0001
## sub_gradeE5 -0.4608 0.0157 -29.36 <0.0001
## sub_gradeF1 -0.4137 0.0146 -28.42 <0.0001
## sub_gradeF2 -0.3745 0.0133 -28.10 <0.0001
## sub_gradeF3 -0.3769 0.0132 -28.64 <0.0001
## sub_gradeF4 -0.3256 0.0116 -27.99 <0.0001
## sub_gradeF5 -0.2906 0.0106 -27.35 <0.0001
## sub_gradeG1 -0.2538 0.0094 -27.15 <0.0001
## sub_gradeG2 -0.2050 0.0082 -25.04 <0.0001
## sub_gradeG3 -0.1826 0.0073 -24.98 <0.0001
## sub_gradeG4 -0.1331 0.0063 -21.06 <0.0001
## sub_gradeG5 -0.1389 0.0062 -22.29 <0.0001
## emp_length< 1 year -0.0273 0.0057 -4.83 <0.0001
## emp_length1 year 0.0015 0.0057 0.26 0.7911
## emp_length2 years 0.0002 0.0058 0.03 0.9749
## emp_length3 years -0.0016 0.0057 -0.28 0.7826
## emp_length4 years 0.0013 0.0056 0.22 0.8229
## emp_length5 years 0.0023 0.0057 0.40 0.6917
## emp_length6 years -0.0118 0.0056 -2.11 0.0345
## emp_length7 years -0.0004 0.0056 -0.07 0.9425
## emp_length8 years -0.0121 0.0055 -2.22 0.0265
## emp_length9 years -0.0187 0.0053 -3.51 0.0004
## emp_lengthn/a -0.1086 0.0051 -21.22 <0.0001
## home_ownershipOWN -0.0256 0.0056 -4.61 <0.0001
## home_ownershipRENT -0.0949 0.0067 -14.11 <0.0001
## annual_inc 0.1356 0.0118 11.54 <0.0001
## verification_statusSource Verified -0.0386 0.0065 -5.91 <0.0001
## verification_statusVerified 0.0074 0.0070 1.07 0.2850
## purposecredit_card 0.0161 0.0253 0.64 0.5244
## purposedebt_consolidation -0.0110 0.0295 -0.37 0.7090
## purposehome_improvement -0.0316 0.0147 -2.15 0.0316
## purposehouse -0.0001 0.0068 -0.01 0.9889
## purposemajor_purchase -0.0178 0.0098 -1.82 0.0695
## purposemedical -0.0246 0.0077 -3.20 0.0014
## purposemoving -0.0198 0.0067 -2.93 0.0033
## purposeother -0.0365 0.0135 -2.69 0.0071
## purposerenewable_energy -0.0006 0.0055 -0.11 0.9111
## purposesmall_business -0.0679 0.0084 -8.12 <0.0001
## purposevacation -0.0167 0.0067 -2.49 0.0129
## purposewedding 0.0199 0.0072 2.77 0.0055
## dti -0.2184 0.0064 -34.26 <0.0001
## delinq_2yrs -0.0920 0.0071 -12.93 <0.0001
## inq_last_6mths -0.0438 0.0066 -6.68 <0.0001
## pub_rec -0.0337 0.0148 -2.28 0.0226
## revol_bal -0.0166 0.0081 -2.06 0.0399
## revol_util 0.6408 0.0336 19.06 <0.0001
## total_acc 0.1322 0.0131 10.11 <0.0001
## initial_list_statusw -0.0022 0.0054 -0.41 0.6791
## collections_12_mths_ex_med -0.0110 0.0048 -2.29 0.0217
## acc_now_delinq -0.0035 0.0049 -0.72 0.4739
## tot_coll_amt 0.0030 0.0109 0.27 0.7854
## acc_open_past_24mths -0.1285 0.0077 -16.72 <0.0001
## avg_cur_bal 0.0353 0.0100 3.52 0.0004
## bc_open_to_buy 0.0351 0.0096 3.64 0.0003
## chargeoff_within_12_mths 0.0121 0.0056 2.19 0.0288
## delinq_amnt 0.0022 0.0058 0.37 0.7084
## mo_sin_old_il_acct -0.0186 0.0057 -3.25 0.0011
## mo_sin_old_rev_tl_op 0.0175 0.0061 2.84 0.0044
## mo_sin_rcnt_rev_tl_op -0.0161 0.0081 -1.97 0.0483
## mo_sin_rcnt_tl 0.0221 0.0079 2.81 0.0050
## mort_acc 0.0618 0.0075 8.26 <0.0001
## mths_since_recent_bc 0.0555 0.0070 7.89 <0.0001
## mths_since_recent_inq -0.0064 0.0067 -0.96 0.3394
## num_accts_ever_120_pd -0.0285 0.0067 -4.26 <0.0001
## num_actv_bc_tl -0.0526 0.0066 -8.03 <0.0001
## num_bc_tl 0.0279 0.0097 2.88 0.0040
## num_il_tl 0.0168 0.0099 1.70 0.0894
## num_tl_90g_dpd_24m 0.0175 0.0067 2.62 0.0088
## num_tl_op_past_12m -0.0088 0.0076 -1.16 0.2462
## pct_tl_nvr_dlq -0.0228 0.0070 -3.24 0.0012
## percent_bc_gt_75 -0.0766 0.0065 -11.73 <0.0001
## pub_rec_bankruptcies 0.0391 0.0118 3.32 0.0009
## tax_liens 0.0090 0.0097 0.92 0.3550
## total_il_high_credit_limit 0.0721 0.0098 7.36 <0.0001
## is_fl -0.0178 0.0052 -3.42 0.0006
## is_ny -0.0284 0.0053 -5.34 <0.0001
## is_il 0.0130 0.0054 2.39 0.0166
## is_tx 0.0375 0.0055 6.77 <0.0001
## is_ca 0.0402 0.0057 7.09 <0.0001
## is_ga 0.0181 0.0054 3.34 0.0008
We set our two-tailed p-value cutoff at 0.01, we discard features with p-value exceed this threshold.
tmp = as.data.frame(anova(model))
tmp$feature = rownames(tmp)
tmp = tmp %>% filter(P > 0.01) %>% select(feature,P)
tmp
## feature P
## 1 emp_length1 year 0.79109580
## 2 emp_length2 years 0.97486100
## 3 emp_length3 years 0.78255500
## 4 emp_length4 years 0.82290650
## 5 emp_length5 years 0.69167559
## 6 emp_length6 years 0.03454150
## 7 emp_length7 years 0.94254322
## 8 emp_length8 years 0.02648818
## 9 verification_statusVerified 0.28497122
## 10 purposecredit_card 0.52436187
## 11 purposedebt_consolidation 0.70902630
## 12 purposehome_improvement 0.03161482
## 13 purposehouse 0.98894832
## 14 purposemajor_purchase 0.06951881
## 15 purposerenewable_energy 0.91112091
## 16 purposevacation 0.01287300
## 17 pub_rec 0.02260737
## 18 revol_bal 0.03987652
## 19 initial_list_statusw 0.67910967
## 20 collections_12_mths_ex_med 0.02174068
## 21 acc_now_delinq 0.47394870
## 22 tot_coll_amt 0.78543120
## 23 chargeoff_within_12_mths 0.02876742
## 24 delinq_amnt 0.70839057
## 25 mo_sin_rcnt_rev_tl_op 0.04832643
## 26 mths_since_recent_inq 0.33939555
## 27 num_il_tl 0.08944578
## 28 num_tl_op_past_12m 0.24616264
## 29 tax_liens 0.35499506
## 30 is_il 0.01664836
data1 = (data1[,!(names(data1) %in% tmp$feature)])
rm(model,tmp)
Some feature names are invalid, replace invalid characters with “_"
colnames(data1) = str_replace_all(colnames(data1)," ","_")
colnames(data1) = str_replace_all(colnames(data1),"<","_")
colnames(data1) = str_replace_all(colnames(data1),"/","_")
We then create function to apply all of the above feature engineering:
#derived from str_c("'",paste(colnames(data1),collapse="','"),"'")
#kept_column = colnames(data1)
kept_column = c('term_60_months','sub_gradeA2','sub_gradeA3','sub_gradeA4','sub_gradeA5','sub_gradeB1','sub_gradeB2','sub_gradeB3','sub_gradeB4','sub_gradeB5','sub_gradeC1','sub_gradeC2','sub_gradeC3','sub_gradeC4','sub_gradeC5','sub_gradeD1','sub_gradeD2','sub_gradeD3','sub_gradeD4','sub_gradeD5','sub_gradeE1','sub_gradeE2','sub_gradeE3','sub_gradeE4','sub_gradeE5','sub_gradeF1','sub_gradeF2','sub_gradeF3','sub_gradeF4','sub_gradeF5','sub_gradeG1','sub_gradeG2','sub_gradeG3','sub_gradeG4','sub_gradeG5','emp_length__1_year','emp_length9_years','emp_lengthn_a','home_ownershipOWN','home_ownershipRENT','annual_inc','verification_statusSource_Verified','purposemedical','purposemoving','purposeother','purposesmall_business','purposewedding','dti','delinq_2yrs','inq_last_6mths','revol_util','total_acc','acc_open_past_24mths','avg_cur_bal','bc_open_to_buy','mo_sin_old_il_acct','mo_sin_old_rev_tl_op','mo_sin_rcnt_tl','mort_acc','mths_since_recent_bc','num_accts_ever_120_pd','num_actv_bc_tl','num_bc_tl','num_tl_90g_dpd_24m','pct_tl_nvr_dlq','percent_bc_gt_75','pub_rec_bankruptcies','total_il_high_credit_limit','is_fl','is_ny','is_tx','is_ca','is_ga','loan_status')
applyFeatureEngineering <- function(dt,use_kept_column = kept_column,use_median_impute_model=median_impute_model, use_dummy_model=dummy_model,use_trans_model=trans_model){
#consolidate loan status
dt$loan_status = ifelse(str_detect(dt$loan_status,"Paid"),dt$loan_status,"Default")
#parse int_rate
dt$int_rate = (as.numeric(gsub(pattern = "%",replacement = "",x = dt$int_rate)))
#impute median
dt = predict(median_impute_model,dt)
#parse revol_util
dt$revol_util = (as.numeric(gsub(pattern = "%",replacement = "",x = dt$int_rate)))
#binary variables for addr_state
dt$is_fl = ifelse(dt$addr_state=="FL",1,0)
dt$is_ny = ifelse(dt$addr_state=="NY",1,0)
dt$is_il = ifelse(dt$addr_state=="IL",1,0)
dt$is_tx = ifelse(dt$addr_state=="TX",1,0)
dt$is_ca = ifelse(dt$addr_state=="CA",1,0)
dt$is_ga = ifelse(dt$addr_state=="GA",1,0)
#transform transactions
dt$annual_inc = dt$annual_inc/dt$funded_amnt
dt$revol_bal = dt$revol_bal/dt$funded_amnt
dt$avg_cur_bal = dt$avg_cur_bal/dt$funded_amnt
dt$bc_open_to_buy = dt$bc_open_to_buy/dt$funded_amnt
dt$total_il_high_credit_limit = dt$total_il_high_credit_limit/dt$funded_amnt
#if purpose falling outside of recognized values
all_purpose = c('debt_consolidation','small_business','other','credit_card','major_purchase','moving','home_improvement','house','car','medical','renewable_energy','vacation','wedding')
dt$purpose = ifelse(dt$purpose %in% all_purpose,dt$purpose,"other")
#create dummy variables
loan_status = dt$loan_status
dt = as.data.frame(predict(use_dummy_model,dt))
dt$loan_status = loan_status
#center,scale data
dt = predict(use_trans_model, dt)
#remove all unused features
colnames(dt) = str_replace_all(colnames(dt)," ","_")
colnames(dt) = str_replace_all(colnames(dt),"<","_")
colnames(dt) = str_replace_all(colnames(dt),"/","_")
dt = dt[use_kept_column]
#set loan with status 'Fully Paid' as a positive sample
dt$loan_status = ifelse(dt$loan_status == "Fully Paid","Fully.Paid",dt$loan_status)
dt$loan_status = factor(dt$loan_status,levels = c("Default","Fully.Paid"))
return(dt)
}
We need to determine which performance metric we want to focus on. We load the 2015 test dataset and apply feature engineering procedure.
test_data = read.csv("loan_2015.extract.csv",stringsAsFactors=FALSE)
#later used for evaluate investment
test_data_grade = test_data$grade
test_data_funded_amnt = test_data$funded_amnt
test_data_total_pymnt = test_data$total_pymnt
test_data = applyFeatureEngineering(test_data)
The loan data typically have a higher proportion of good loans. We can achieve high accuracy just by labeling all loans as Fully Paid
100*nrow(test_data %>% filter(loan_status=="Fully.Paid"))/nrow(test_data)
## [1] 72.29722
For our test data, we gain 72.3% accuracy by just following the above strategy. Recall that we yet to include the outcome of Current
loans. In a real situation, the ratio of Fully Paid
loans is usually much higher so accuracy metric is not our main concern here. We instead focus on a trade-off in identifying a default loan as an expense of mislabelling some good loans. We will look at ROC curve and pay particular focus on AUC when we train our models.
Because there is a disproportion of the target variable, we downsampling the Fully Paid loans to be equal to Default loans. This method tends to work well and run faster than upsampling or cost-sensitive training.
Noted that at the end, we aim to stack the results of various learning models(Logistic Regression,SVM,RandomForest, and XGB). Since the downside of downsampling is that information of majority class is discarded, we will continue to make a new downsampling data when we feed it to each model along the way. We anticipate that better result can be obtained by stacking all 4 models since it gets more information from the majority class.
We tune Logistic Regression to our training dataset. We use Elastic Net regularization, which comprised of Ridge and Lasso regularization, with cross-validation to prevent overfitting. Our goal is maximizing AUC.
Due to limited computation resource, we run model tunning on small data and fixed lambda parameter. We use small fold: 3-fold cross validation. We then refit the best model with the whole data.
(Noted:we put the final tuning result here instead of running through the whole process. We disable the execution of tuning code although readers can enable it back by setting eval
= TRUE )
set.seed(100)
samp = downSample(data1[-getIndexsOfColumns(data1, c( "loan_status") )],data1$loan_status,yname="loan_status")
#choose small data for tuning
train_index = createDataPartition(samp$loan_status,p = 0.05,list=FALSE,times=1)
#run tuning in parallel using available computing cores(you may need to change this)
registerDoMC(cores = 4)
ctrl <- trainControl(method = "cv",
summaryFunction = twoClassSummary,
classProbs = TRUE,
number = 3
)
glmnGrid = expand.grid(.alpha = seq(0, 1, length = 10), .lambda = 0.01)
glmnTuned = train(samp[train_index,-getIndexsOfColumns(samp,"loan_status")],y = samp[train_index,"loan_status"],method = "glmnet",tuneGrid = glmnGrid,metric = "ROC",trControl = ctrl)
plot(glmnTuned)
glmnTuned
The best penalty parameter is alpha = 0.7777778(more weight on Ridge) with fixed shrinking lambda = 0.01. We use this parameter to retrain the whole sample.
model = glmnet(
x = as.matrix(samp[-getIndexsOfColumns(samp,"loan_status")]),
y=samp$loan_status,
alpha = 0.7777778,
lambda = 0.01,
family = "binomial",
standardize = FALSE)
The finalized Logistic Regression model is applied to the 2015 loan data. We look at ROC graph and AUC. We also set probability prediction cutoff at 50%(noted that the higher this value is, the more likely the loan is Fully Paid
) and collect some performance metrics for a later comparison.
#data frame for collect model's performance
table_perf = data.frame(model=character(0),
auc=numeric(0),
accuracy=numeric(0),
sensitivity=numeric(0),
specificity=numeric(0),
kappa=numeric(0),
stringsAsFactors = FALSE
)
predict_loan_status_logit = predict(model,newx = as.matrix(test_data[-getIndexsOfColumns(test_data,"loan_status")]),type="response")
rocCurve_logit = roc(response = test_data$loan_status,
predictor = predict_loan_status_logit)
## Warning in roc.default(response = test_data$loan_status, predictor
## = predict_loan_status_logit): Deprecated use a matrix as predictor.
## Unexpected results may be produced, please pass a numeric vector.
auc_curve = auc(rocCurve_logit)
plot(rocCurve_logit,legacy.axes = TRUE,print.auc = TRUE,col="red",main="ROC(Logistic Regression)")
##
## Call:
## roc.default(response = test_data$loan_status, predictor = predict_loan_status_logit)
##
## Data: predict_loan_status_logit in 18716 controls (test_data$loan_status Default) < 48844 cases (test_data$loan_status Fully.Paid).
## Area under the curve: 0.7042
#make a prediction on 50% cutoff
predict_loan_status_label = ifelse(predict_loan_status_logit<0.5,"Default","Fully.Paid")
c = confusionMatrix(predict_loan_status_label,test_data$loan_status,positive="Fully.Paid")
table_perf[1,] = c("logistic regression",
round(auc_curve,3),
as.numeric(round(c$overall["Accuracy"],3)),
as.numeric(round(c$byClass["Sensitivity"],3)),
as.numeric(round(c$byClass["Specificity"],3)),
as.numeric(round(c$overall["Kappa"],3))
)
rm(samp,train_index)
The model’s performance is as follow
tail(table_perf,1)
## model auc accuracy sensitivity specificity kappa
## 1 logistic regression 0.704 0.643 0.636 0.662 0.251
For SVM, we use Radial Basis as a kernel function. Due to limited computation reason, we use 5% of downsampling data for tunning parameter and 10% of downsampling data for training.
set.seed(200)
#down sampling again so than we get more info when stacking
samp = downSample(data1[-getIndexsOfColumns(data1, c( "loan_status") )],data1$loan_status,yname="loan_status")
#choose small data for tuning
train_index_tuning = createDataPartition(samp$loan_status,p = 0.05,list=FALSE,times=1)
#choose small data for re-train
train_index_training = createDataPartition(samp$loan_status,p = 0.1,list=FALSE,times=1)
svmGrid = expand.grid(
.sigma = as.numeric(sigest(loan_status ~.,data = samp[train_index_tuning,],scaled=FALSE)),
.C = c(0.1,1,10)
)
svmTuned = train(
samp[train_index_tuning,-getIndexsOfColumns(samp,"loan_status")],
y = samp[train_index_tuning,"loan_status"],
method = "svmRadial",
tuneGrid = svmGrid,
metric = "ROC",
trControl = ctrl,
preProcess = NULL,
scaled = FALSE,
fit = FALSE)
plot(svmTuned)
svmTuned
The best parameter for the model is sigma
= 0.003909534, and C
= 0.1. We use this values to fit the 10% of downsampling data and collect its performance based on test set.
svm_model = ksvm(loan_status ~ .,
data = samp[train_index_training,],
kernel = "rbfdot",
kpar = list(sigma=0.003909534),
C = 0.1,
prob.model = TRUE,
scaled = FALSE)
predict_loan_status_svm = predict(svm_model,test_data,type="probabilities")
predict_loan_status_svm = as.data.frame(predict_loan_status_svm)$Fully.Paid
rocCurve_svm = roc(response = test_data$loan_status,
predictor = predict_loan_status_svm)
auc_curve = auc(rocCurve_svm)
plot(rocCurve_svm,legacy.axes = TRUE,print.auc = TRUE,col="red",main="ROC(SVM)")
##
## Call:
## roc.default(response = test_data$loan_status, predictor = predict_loan_status_svm)
##
## Data: predict_loan_status_svm in 18716 controls (test_data$loan_status Default) < 48844 cases (test_data$loan_status Fully.Paid).
## Area under the curve: 0.6976
predict_loan_status_label = ifelse(predict_loan_status_svm<0.5,"Default","Fully.Paid")
c = confusionMatrix(predict_loan_status_label,test_data$loan_status,positive="Fully.Paid")
This is the summary of model’s performance
table_perf[2,] = c("SVM",
round(auc_curve,3),
as.numeric(round(c$overall["Accuracy"],3)),
as.numeric(round(c$byClass["Sensitivity"],3)),
as.numeric(round(c$byClass["Specificity"],3)),
as.numeric(round(c$overall["Kappa"],3))
)
tail(table_perf,1)
## model auc accuracy sensitivity specificity kappa
## 2 SVM 0.698 0.592 0.536 0.739 0.213
Now, we tune RandomForest model. Like SVM, we tune parameter based on 5% downsampling data.
set.seed(300)
#down sampling again so than we get more info when stacking
samp = downSample(data1[-getIndexsOfColumns(data1, c( "loan_status") )],data1$loan_status,yname="loan_status")
#choose small data for tuning
train_index_tuning = createDataPartition(samp$loan_status,p = 0.05,list=FALSE,times=1)
#choose small data for re-train
train_index_training = createDataPartition(samp$loan_status,p = 0.1,list=FALSE,times=1)
rfGrid = expand.grid(
.mtry = as.integer(seq(2,ncol(samp), (ncol(samp) - 2)/4))
)
rfTuned = train(
samp[train_index_tuning,-getIndexsOfColumns(samp,"loan_status")],
y = samp[train_index_tuning,"loan_status"],
method = "rf",
tuneGrid = rfGrid,
metric = "ROC",
trControl = ctrl,
preProcess = NULL,
ntree = 100
)
plot(rfTuned)
rfTuned
The best parameter is mtry
(number of predictors) = 2. Like SVM, we fit 10% of downsampling data with this value.
rf_model = randomForest(loan_status ~ . ,data = samp[train_index_training,],mtry = 2,ntree=400)
predict_loan_status_rf = predict(rf_model,test_data,"prob")
predict_loan_status_rf = as.data.frame(predict_loan_status_rf)$Fully.Paid
rocCurve_rf = roc(response = test_data$loan_status,
predictor = predict_loan_status_rf)
auc_curve = auc(rocCurve_rf)
plot(rocCurve_rf,legacy.axes = TRUE,print.auc = TRUE,col="red",main="ROC(RandomForest)")
##
## Call:
## roc.default(response = test_data$loan_status, predictor = predict_loan_status_rf)
##
## Data: predict_loan_status_rf in 18716 controls (test_data$loan_status Default) < 48844 cases (test_data$loan_status Fully.Paid).
## Area under the curve: 0.7046
predict_loan_status_label = ifelse(predict_loan_status_rf<0.5,"Default","Fully.Paid")
c = confusionMatrix(predict_loan_status_label,test_data$loan_status,positive="Fully.Paid")
table_perf[3,] = c("RandomForest",
round(auc_curve,3),
as.numeric(round(c$overall["Accuracy"],3)),
as.numeric(round(c$byClass["Sensitivity"],3)),
as.numeric(round(c$byClass["Specificity"],3)),
as.numeric(round(c$overall["Kappa"],3))
)
The model’s performance is as follow
tail(table_perf,1)
## model auc accuracy sensitivity specificity kappa
## 3 RandomForest 0.705 0.638 0.622 0.679 0.25
Extreme Gradient Boosting has a very efficient implementation. Unlike SVM and RandomForest, we can tune parameter using the whole downsampling set. We focus on varying Ridge & Lasso regularization and learning rate. We use 10% of data for validating tuning parameter.
set.seed(400)
#down sampling again so than we get more info when stacking
samp = downSample(data1[-getIndexsOfColumns(data1, c( "loan_status") )],data1$loan_status,yname="loan_status")
#choose small data for validating
train_index_tuning= createDataPartition(samp$loan_status,p = 0.1,list=FALSE,times=1)
etas = c(0.1,0.3)
alphas = c(0,0.5,1)
lambdas = c(0,0.5,1)
test_watchlist = list(
test = xgb.DMatrix(
data = as.matrix(samp[train_index_tuning,][getNumericColumns(samp)]),
label = as.numeric(samp[train_index_tuning,"loan_status"])-1
)
)
gbm_perf = data.frame(eta=numeric(0),alpha=numeric(0),lambda=numeric(0),auc=numeric(0))
for(eta in etas){
for(alpha in alphas){
for(lambda in lambdas){
model = xgb.train(
data= xgb.DMatrix(
data = as.matrix(samp[-train_index_tuning,][getNumericColumns(samp)]),
label = as.numeric(samp[-train_index_tuning,"loan_status"])-1
),
objective = "binary:logistic",
nrounds = 350,
watchlist = test_watchlist,
eval_metric = "auc",
early.stop.round = 10,
alpha = alpha,
lambda = lambda,
eta = eta)
gbm_perf[nrow(gbm_perf)+1,] = c(eta,alpha,lambda,model$bestScore)
}
}
}
gbm_perf %>% arrange(desc(auc))
The best tuning parameter is eta
= 0.1, alpha
= 0.5, and lambda
= 1.0. We retrain it again here in case readers didn’t run the tuning code. We collect its performance.
set.seed(400)
test_watchlist = list(
test = xgb.DMatrix(
data = as.matrix(samp[train_index_tuning,][getNumericColumns(samp)]),
label = as.numeric(samp[train_index_tuning,"loan_status"])-1
)
)
xgb_model = xgb.train(
data= xgb.DMatrix(
data = as.matrix(samp[-train_index_tuning,][getNumericColumns(samp)]),
label = as.numeric(samp[-train_index_tuning,"loan_status"])-1
),
objective = "binary:logistic",
nrounds = 350,
watchlist = test_watchlist,
eval_metric = "auc",
early.stop.round = 10,
alpha = 0.5,
lambda = 1.0,
eta = 0.1)
## [0] test-auc:0.692064
## [1] test-auc:0.693999
## [2] test-auc:0.697318
## [3] test-auc:0.699014
## [4] test-auc:0.699192
## [5] test-auc:0.700705
## [6] test-auc:0.701450
## [7] test-auc:0.702239
## [8] test-auc:0.702641
## [9] test-auc:0.703252
## [10] test-auc:0.703859
## [11] test-auc:0.704341
## [12] test-auc:0.704865
## [13] test-auc:0.705212
## [14] test-auc:0.705281
## [15] test-auc:0.705811
## [16] test-auc:0.706393
## [17] test-auc:0.706762
## [18] test-auc:0.707394
## [19] test-auc:0.707741
## [20] test-auc:0.708143
## [21] test-auc:0.708453
## [22] test-auc:0.708823
## [23] test-auc:0.709150
## [24] test-auc:0.709570
## [25] test-auc:0.709920
## [26] test-auc:0.710222
## [27] test-auc:0.710539
## [28] test-auc:0.710843
## [29] test-auc:0.711221
## [30] test-auc:0.711444
## [31] test-auc:0.711497
## [32] test-auc:0.711852
## [33] test-auc:0.712029
## [34] test-auc:0.712396
## [35] test-auc:0.712686
## [36] test-auc:0.712853
## [37] test-auc:0.712997
## [38] test-auc:0.713061
## [39] test-auc:0.713231
## [40] test-auc:0.713370
## [41] test-auc:0.713491
## [42] test-auc:0.713599
## [43] test-auc:0.713893
## [44] test-auc:0.713867
## [45] test-auc:0.714141
## [46] test-auc:0.714303
## [47] test-auc:0.714376
## [48] test-auc:0.714483
## [49] test-auc:0.714808
## [50] test-auc:0.715020
## [51] test-auc:0.715061
## [52] test-auc:0.715091
## [53] test-auc:0.715098
## [54] test-auc:0.715198
## [55] test-auc:0.715272
## [56] test-auc:0.715473
## [57] test-auc:0.715698
## [58] test-auc:0.715782
## [59] test-auc:0.716132
## [60] test-auc:0.716180
## [61] test-auc:0.716102
## [62] test-auc:0.716098
## [63] test-auc:0.716191
## [64] test-auc:0.716454
## [65] test-auc:0.716429
## [66] test-auc:0.716613
## [67] test-auc:0.716584
## [68] test-auc:0.716636
## [69] test-auc:0.716661
## [70] test-auc:0.716861
## [71] test-auc:0.716958
## [72] test-auc:0.717042
## [73] test-auc:0.717406
## [74] test-auc:0.717539
## [75] test-auc:0.717514
## [76] test-auc:0.717457
## [77] test-auc:0.717645
## [78] test-auc:0.717704
## [79] test-auc:0.717659
## [80] test-auc:0.717606
## [81] test-auc:0.717711
## [82] test-auc:0.717821
## [83] test-auc:0.717667
## [84] test-auc:0.717834
## [85] test-auc:0.717749
## [86] test-auc:0.717893
## [87] test-auc:0.717920
## [88] test-auc:0.717964
## [89] test-auc:0.718027
## [90] test-auc:0.718159
## [91] test-auc:0.718124
## [92] test-auc:0.718170
## [93] test-auc:0.718174
## [94] test-auc:0.718334
## [95] test-auc:0.718440
## [96] test-auc:0.718514
## [97] test-auc:0.718610
## [98] test-auc:0.718661
## [99] test-auc:0.718525
## [100] test-auc:0.718644
## [101] test-auc:0.718732
## [102] test-auc:0.718867
## [103] test-auc:0.718806
## [104] test-auc:0.718848
## [105] test-auc:0.718819
## [106] test-auc:0.718930
## [107] test-auc:0.718979
## [108] test-auc:0.719016
## [109] test-auc:0.719053
## [110] test-auc:0.719173
## [111] test-auc:0.719311
## [112] test-auc:0.719291
## [113] test-auc:0.719311
## [114] test-auc:0.719341
## [115] test-auc:0.719551
## [116] test-auc:0.719563
## [117] test-auc:0.719583
## [118] test-auc:0.719598
## [119] test-auc:0.719521
## [120] test-auc:0.719502
## [121] test-auc:0.719677
## [122] test-auc:0.719646
## [123] test-auc:0.719615
## [124] test-auc:0.719596
## [125] test-auc:0.719609
## [126] test-auc:0.719571
## [127] test-auc:0.719693
## [128] test-auc:0.719664
## [129] test-auc:0.719737
## [130] test-auc:0.719680
## [131] test-auc:0.719655
## [132] test-auc:0.719715
## [133] test-auc:0.719762
## [134] test-auc:0.719814
## [135] test-auc:0.719833
## [136] test-auc:0.719678
## [137] test-auc:0.719681
## [138] test-auc:0.719712
## [139] test-auc:0.719671
## [140] test-auc:0.719693
## [141] test-auc:0.719686
## [142] test-auc:0.719705
## [143] test-auc:0.719756
## [144] test-auc:0.719792
## [145] test-auc:0.719727
## Stopping. Best iteration: 136
predict_loan_status_xgb = predict(xgb_model,as.matrix(test_data[getNumericColumns(test_data)]))
rocCurve_xgb = roc(response = test_data$loan_status,
predictor = predict_loan_status_xgb)
auc_curve = auc(rocCurve_xgb)
plot(rocCurve_xgb,legacy.axes = TRUE,print.auc = TRUE,col="red",main="ROC(XGB)")
##
## Call:
## roc.default(response = test_data$loan_status, predictor = predict_loan_status_xgb)
##
## Data: predict_loan_status_xgb in 18716 controls (test_data$loan_status Default) < 48844 cases (test_data$loan_status Fully.Paid).
## Area under the curve: 0.7128
predict_loan_status_label = ifelse(predict_loan_status_xgb<0.5,"Default","Fully.Paid")
c = confusionMatrix(predict_loan_status_label,test_data$loan_status,positive="Fully.Paid")
table_perf[4,] = c("XGB",
round(auc_curve,3),
as.numeric(round(c$overall["Accuracy"],3)),
as.numeric(round(c$byClass["Sensitivity"],3)),
as.numeric(round(c$byClass["Specificity"],3)),
as.numeric(round(c$overall["Kappa"],3))
)
The model’s performance is as follow
tail(table_perf,1)
## model auc accuracy sensitivity specificity kappa
## 4 XGB 0.713 0.607 0.549 0.759 0.239
Our final model is to combine the result of previous machine learning models and provide a single prediction by averaging probabilities from all previous models.
predict_loan_status_ensemble = predict_loan_status_logit +
predict_loan_status_svm +
predict_loan_status_rf +
predict_loan_status_xgb
predict_loan_status_ensemble = predict_loan_status_ensemble / 4
rocCurve_ensemble = roc(response = test_data$loan_status,
predictor = predict_loan_status_ensemble)
## Warning in roc.default(response = test_data$loan_status, predictor =
## predict_loan_status_ensemble): Deprecated use a matrix as predictor.
## Unexpected results may be produced, please pass a numeric vector.
auc_curve = auc(rocCurve_ensemble)
plot(rocCurve_ensemble,legacy.axes = TRUE,print.auc = TRUE,col="red",main="ROC(Ensemble Avg.)")
##
## Call:
## roc.default(response = test_data$loan_status, predictor = predict_loan_status_ensemble)
##
## Data: predict_loan_status_ensemble in 18716 controls (test_data$loan_status Default) < 48844 cases (test_data$loan_status Fully.Paid).
## Area under the curve: 0.7113
predict_loan_status_label = ifelse(predict_loan_status_ensemble<0.5,"Default","Fully.Paid")
c = confusionMatrix(predict_loan_status_label,test_data$loan_status,positive="Fully.Paid")
table_perf[5,] = c("Ensemble",
round(auc_curve,3),
as.numeric(round(c$overall["Accuracy"],3)),
as.numeric(round(c$byClass["Sensitivity"],3)),
as.numeric(round(c$byClass["Specificity"],3)),
as.numeric(round(c$overall["Kappa"],3))
)
We get the following performance
tail(table_perf,1)
## model auc accuracy sensitivity specificity kappa
## 5 Ensemble 0.711 0.623 0.586 0.72 0.246
AUC for each model and their performance when we set probability cutoff at 50% is summarised below
table_perf
## model auc accuracy sensitivity specificity kappa
## 1 logistic regression 0.704 0.643 0.636 0.662 0.251
## 2 SVM 0.698 0.592 0.536 0.739 0.213
## 3 RandomForest 0.705 0.638 0.622 0.679 0.25
## 4 XGB 0.713 0.607 0.549 0.759 0.239
## 5 Ensemble 0.711 0.623 0.586 0.72 0.246
plot(rocCurve_logit,legacy.axes = TRUE,col="red",main="ROC compare")
##
## Call:
## roc.default(response = test_data$loan_status, predictor = predict_loan_status_logit)
##
## Data: predict_loan_status_logit in 18716 controls (test_data$loan_status Default) < 48844 cases (test_data$loan_status Fully.Paid).
## Area under the curve: 0.7042
plot(rocCurve_svm,legacy.axes = TRUE,col="blue",add=TRUE)
##
## Call:
## roc.default(response = test_data$loan_status, predictor = predict_loan_status_svm)
##
## Data: predict_loan_status_svm in 18716 controls (test_data$loan_status Default) < 48844 cases (test_data$loan_status Fully.Paid).
## Area under the curve: 0.6976
plot(rocCurve_rf,legacy.axes = TRUE,col="green",add=TRUE)
##
## Call:
## roc.default(response = test_data$loan_status, predictor = predict_loan_status_rf)
##
## Data: predict_loan_status_rf in 18716 controls (test_data$loan_status Default) < 48844 cases (test_data$loan_status Fully.Paid).
## Area under the curve: 0.7046
plot(rocCurve_xgb,legacy.axes = TRUE,col="orange",add=TRUE)
##
## Call:
## roc.default(response = test_data$loan_status, predictor = predict_loan_status_xgb)
##
## Data: predict_loan_status_xgb in 18716 controls (test_data$loan_status Default) < 48844 cases (test_data$loan_status Fully.Paid).
## Area under the curve: 0.7128
plot(rocCurve_ensemble,legacy.axes = TRUE,col="black",add=TRUE)
##
## Call:
## roc.default(response = test_data$loan_status, predictor = predict_loan_status_ensemble)
##
## Data: predict_loan_status_ensemble in 18716 controls (test_data$loan_status Default) < 48844 cases (test_data$loan_status Fully.Paid).
## Area under the curve: 0.7113
legend("bottomright",legend=c("logit","svm","rf","xbg","ensemble"),fill=c("red","blue","green","orange","black"))
Kappa statistics from all models exceed 20% by just small amount, which indicated that they perform moderately better than chance. XGB takes advantage of receiving all downsampling data and provides highest AUC. Comparing performance across models may not be valid, though, because we use different downsampling data for each model. Ensemble model doesn’t improve AUC as we expected.
We are surprised to find that Logistic regression does provide a very competitive performance. At 50% cutoff, it yields reasonable compromise between the percentage of correctly identified good loans(Sensitivity) and bad loans(Specificity) while not sacrificing Accuracy too much(recall that the naive strategy yields 72.3% accuracy). SVM with RBF kernel has lowest AUC. We can train it with only some portion of data as time complexity of the model rapidly jump up. RandomForest yields a comparable result to Logistic Regression. XGB sacrifices Sensitivity rate for Specificity(ability to recall bad loans). It may be suitable if we really want to avoid default loans. Ensemble model does tune up XGB a little bit. Given the simplicity of Logistic Regression model, and ROC graphs are, overall, not significantly difference, we recommend it as a model of choice for predicting LendingClub dataset.
We can calculate investment performance assumed that we follow each model predictions based on 50% cutoff:
#assign the features back for evaluation
test_data$grade = test_data_grade
test_data$funded_amnt = test_data_funded_amnt
test_data$total_pymnt = test_data_total_pymnt
#logistic regression
test_data$prediction = ifelse(predict_loan_status_logit<0.5,"Default","Fully Paid")
logit_table = createPerformanceTable(test_data)
#SVM
test_data$prediction = ifelse(predict_loan_status_svm<0.5,"Default","Fully Paid")
svm_table = createPerformanceTable(test_data)
#rf
test_data$prediction = ifelse(predict_loan_status_rf<0.5,"Default","Fully Paid")
rf_table = createPerformanceTable(test_data)
#XGB
test_data$prediction = ifelse(predict_loan_status_xgb<0.5,"Default","Fully Paid")
xgb_table = createPerformanceTable(test_data)
#ensemble
test_data$prediction = ifelse(predict_loan_status_ensemble<0.5,"Default","Fully Paid")
ensemble_table = createPerformanceTable(test_data)
For Logistic Regression:
logit_table
## grade roi percent_pick
## 1 A -0.02868136 99.9407513
## 2 B -0.07712652 94.1537468
## 3 C -0.11730945 60.4015616
## 4 D -0.14398647 18.4491764
## 5 E -0.12421653 2.1814007
## 6 F -0.16091886 0.1059322
## 7 G 0.00000000 0.0000000
## 8 ALL -0.08203923 55.3463588
For SVM:
svm_table
## grade roi percent_pick
## 1 A -0.02842189 99.43121223
## 2 B -0.08390121 99.15374677
## 3 C -0.11470667 36.73883283
## 4 D 0.01494777 0.77139413
## 5 E 0.09743927 0.05102692
## 6 F 0.00000000 0.00000000
## 7 G 0.00000000 0.00000000
## 8 ALL -0.07477126 46.01243339
For RandomForest:
rf_table
## grade roi percent_pick
## 1 A -0.02860544 100.0000000
## 2 B -0.08057145 97.6227390
## 3 C -0.11395171 59.3570958
## 4 D -0.09056090 8.0112495
## 5 E -0.13019485 1.6456181
## 6 F -0.28616990 0.6002825
## 7 G 0.06104822 0.3740648
## 8 ALL -0.07832291 53.8839550
For XGB:
xgb_table
## grade roi percent_pick
## 1 A -0.02755103 99.0046214
## 2 B -0.06683685 84.5994832
## 3 C -0.08522953 38.6807281
## 4 D -0.10742489 14.2868622
## 5 E -0.11417002 5.2940426
## 6 F -0.10204509 1.5889831
## 7 G -0.09492406 0.6234414
## 8 ALL -0.06275065 46.3632327
For Ensemble model:
ensemble_table
## grade roi percent_pick
## 1 A -0.02857455 99.97630051
## 2 B -0.07905501 96.93798450
## 3 C -0.09891564 47.92881407
## 4 D -0.07386266 7.11128967
## 5 E -0.06045371 0.98226815
## 6 F -0.42870486 0.07062147
## 7 G 0.00000000 0.00000000
## 8 ALL -0.07069706 50.11841326
Our baseline strategy:
baseline_table
## grade roi percent_pick
## 1 A -0.02860544 100
## 2 B -0.08447862 100
## 3 C -0.16181754 100
## 4 D -0.23565662 100
## 5 E -0.28144831 100
## 6 F -0.34315720 100
## 7 G -0.35786380 100
## 8 ALL -0.17560423 100
We can visualize returns for each model:
logit_table$model_name = "logit"
svm_table$model_name = "svm"
rf_table$model_name = "rf"
xgb_table$model_name = "xgb"
ensemble_table$model_name = "ensemble"
baseline_table$model_name = "baseline"
full_table = rbind(logit_table,svm_table,rf_table,xgb_table,ensemble_table,baseline_table)
ggplot(full_table,aes(x=grade,y=roi*100,group=model_name)) +
geom_line(aes(colour = model_name)) +
geom_point() +
xlab("Grade") +
ylab("ROI(%)") +
labs(title = "Return on identified 2015 loans")
and percentage of loans for each grade they picked:
ggplot(full_table,aes(x=grade,y=percent_pick,group=model_name)) +
geom_line(aes(colour = model_name)) +
geom_point() +
xlab("Grade") +
ylab("Percent Picked(%)") +
labs(title = "Loan picked")
All 5 prediction models beat the benchmark strategy in nearly very aspects. Though they did so for low-grade loans by selecting virtually none. Percentage of loan picked by SVM taper off very fast for D, E, F, G grades. It has high Specificity(ability to recall default) so it might be best to use against riskier loans. Many models do shine on predicting middle rate- B, C, D grade - loans, as their ROI is significantly higher than the baseline’s, and still have sizable investment.
Determining the loan outcome, like many financial predictions, is clearly not an easy task. Our models barely go above 20% Kappa Statistics which indicates that chance still played a large role here. They also penalized low-grade loans so much that they rather not select anythings, and thus prevent us from reaping benefits of higher interest rate. Our finding nevertheless shows a promising venue in loan prediction, as higher-than-average ROI can still be gained on certain loan quality. A further analysis may incorporate some discarded features, such as emp_title
, or external information, like relevant economic status, into the model.