With credit card balances reaching an average of $16,000 per household and student loans becoming the largest source of household debt, it’s clear that Americans are financing many aspects of their lives.
For credit card lenders, it is risky to issue credit cards to unqualified applicants. Since clients who overused credit card for consumption, irrespective of their payment ability, may accumulate heavy credit debts, which may lead to debt crisis for credit card lenders. Thus, in order to mitigate loses from risky clients, the desire to build models that identify high-risk clients has never been stronger.
The Default of Credit Card Clients Dataset: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
This dataset has 30,000 observations and 24 variables, which contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. The true label is ‘default_next_month’, which is a binary variable indicating whether the client will default in the following month.
Identify credit card clients likely to default on their next payment by classifying them into ‘defaulters-1’ or ‘non-defaulters-0’. [positive=1]
# Load required libraries
library(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
## ── Attaching packages ──────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(caret) #regression
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ggvis)
##
## Attaching package: 'ggvis'
## The following object is masked from 'package:ggplot2':
##
## resolution
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
library(precrec)
library(rpart) #decision tree
library(rpart.plot)
library(scales)
##
## Attaching package: 'scales'
## The following objects are masked from 'package:ggvis':
##
## fullseq, zero_range
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
set.seed(1234)
#import data
data = read_tsv('credit_card_default.tsv')
## Parsed with column specification:
## cols(
## .default = col_double()
## )
## See spec(...) for full column specifications.
creditcard = data
cat('The dimention is: ', dim(creditcard))
## The dimention is: 30000 24
cat('Is there any NA data: ', any(is.na(creditcard)))
## Is there any NA data: FALSE
str(creditcard)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 30000 obs. of 24 variables:
## $ default_next_month: num 1 1 0 0 0 0 0 0 0 0 ...
## $ limit_bal : num 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
## $ sex : num 2 2 2 2 1 1 1 2 2 1 ...
## $ education : num 2 2 2 2 2 1 1 2 3 3 ...
## $ marriage : num 1 2 2 1 1 2 2 2 1 2 ...
## $ age : num 24 26 34 37 57 37 29 23 28 35 ...
## $ pay_sept : num 2 -1 0 0 -1 0 0 0 0 -2 ...
## $ pay_aug : num 2 2 0 0 0 0 0 -1 0 -2 ...
## $ pay_july : num -1 0 0 0 -1 0 0 -1 2 -2 ...
## $ pay_june : num -1 0 0 0 0 0 0 0 0 -2 ...
## $ pay_may : num -2 0 0 0 0 0 0 0 0 -1 ...
## $ pay_april : num -2 2 0 0 0 0 0 -1 0 -1 ...
## $ bill_sept : num 3913 2682 29239 46990 8617 ...
## $ bill_aug : num 3102 1725 14027 48233 5670 ...
## $ bill_july : num 689 2682 13559 49291 35835 ...
## $ bill_june : num 0 3272 14331 28314 20940 ...
## $ bill_may : num 0 3455 14948 28959 19146 ...
## $ bill_april : num 0 3261 15549 29547 19131 ...
## $ pay_amt_sept : num 0 0 1518 2000 2000 ...
## $ pay_amt_aug : num 689 1000 1500 2019 36681 ...
## $ pay_amt_july : num 0 1000 1000 1200 10000 657 38000 0 432 0 ...
## $ pay_amt_june : num 0 1000 1000 1100 9000 ...
## $ pay_amt_may : num 0 0 1000 1069 689 ...
## $ pay_amt_april : num 0 2000 5000 1000 679 ...
## - attr(*, "spec")=
## .. cols(
## .. default_next_month = col_double(),
## .. limit_bal = col_double(),
## .. sex = col_double(),
## .. education = col_double(),
## .. marriage = col_double(),
## .. age = col_double(),
## .. pay_sept = col_double(),
## .. pay_aug = col_double(),
## .. pay_july = col_double(),
## .. pay_june = col_double(),
## .. pay_may = col_double(),
## .. pay_april = col_double(),
## .. bill_sept = col_double(),
## .. bill_aug = col_double(),
## .. bill_july = col_double(),
## .. bill_june = col_double(),
## .. bill_may = col_double(),
## .. bill_april = col_double(),
## .. pay_amt_sept = col_double(),
## .. pay_amt_aug = col_double(),
## .. pay_amt_july = col_double(),
## .. pay_amt_june = col_double(),
## .. pay_amt_may = col_double(),
## .. pay_amt_april = col_double()
## .. )
#head(creditcard)
#View(creditcard)
#Check the length and see which varibles can be converted to factors
apply(creditcard,2, function(x) length(unique(x)))
## default_next_month limit_bal sex
## 2 81 2
## education marriage age
## 4 4 56
## pay_sept pay_aug pay_july
## 11 11 11
## pay_june pay_may pay_april
## 11 10 10
## bill_sept bill_aug bill_july
## 22723 22346 22026
## bill_june bill_may bill_april
## 21548 21010 20604
## pay_amt_sept pay_amt_aug pay_amt_july
## 7943 7899 7518
## pay_amt_june pay_amt_may pay_amt_april
## 6937 6897 6939
Questions:
How does the probability of default payment vary by categories of different demographic variables?
For the last six months, does the previous payment behavior affect next payment?(should)
Which variables are the strongest predictors of default payment?
#convert variables into factors for analysis
factor_vars <- c('sex','education','marriage','default_next_month')
creditcard_f = creditcard
creditcard_f[factor_vars] <- lapply(creditcard_f[factor_vars], function(x) as.factor(x))
#demographic features
ggplot(creditcard_f,aes(x=sex,fill=default_next_month))+
geom_bar(position = 'fill')+
ggtitle("Sex v/s Default Rate")+
xlab("Sex(1:male 2:female)") +
ylab("Default Rate") +
scale_y_continuous(labels =percent_format())
ggplot(creditcard_f,aes(x=education,fill=default_next_month))+
geom_bar(position = 'fill')+
ggtitle("Education v/s Default Rate")+
xlab("Education(1:Gra 2:Uni 3:High 4:Other)") +
ylab("Default Rate")
ggplot(creditcard_f,aes(x=marriage,fill=default_next_month))+
geom_bar(position = 'fill')+
ggtitle("Marriage v/s Default Rate")+
xlab("Marriage(0:Other 1:Mar 2:Sin 3:Dev)") +
ylab("Default Rate")
ggplot(creditcard_f,aes(y=age,x=default_next_month))+
geom_boxplot( )+
ggtitle("Age v/s Default")
From above plots, I don’t see huge differences of default rate b/w different values of each categorical variable. The range of age in default has large overlap. Hence, it looks like the probability does not vary too much by categories of these demographic variables.
ggplot(creditcard_f,aes(y=limit_bal,x=default_next_month))+
geom_boxplot( )+
ggtitle("Limit balance v/s Default")
ggplot(creditcard_f,aes(y=bill_sept,x=default_next_month))+
geom_boxplot( )+
ggtitle("Bill in sept v/s Default")
ggplot(creditcard_f,aes(y=pay_amt_sept,x=default_next_month))+
geom_boxplot( )+
ggtitle("Payment amount in sept v/s Default")
ggplot(creditcard_f,aes(x=as.factor(pay_sept),fill=default_next_month))+
geom_bar(position = 'fill')+
ggtitle("Payment status in sept v/s Default rate")+
xlab("Payment status") +
ylab("Default Rate")
From above plots, the rate of default next month goes higher with the length of payment delay goes longer. No doubt customers who do not repay on time are more likely to default.
Hypotheses:
Client whose children or him/herself going to college has higher possibility of default, since student loans are becoming the largest part of household debts. [I will divide the data to two age groups: b/w 26~46 - group of more financial pressure; others - group of less financial pressure]
Client who has higher frequency of delays are more likely to default. [is_ever_delay, fre_delay]
Client who’s bill_sept is lower than 0 is less likely to default. [bill_sept_0]
Based on previous hypotheses, I create some new features.
#age group
creditcard =creditcard%>%
mutate(age_group = case_when(age<=25~1,
age<=45~0,
age>45~1))
#frequency of delays
creditcard = creditcard%>%
mutate(fre_delay = if_else(pay_april>0,1,0)+
if_else(pay_may>0,1,0)+
if_else(pay_june>0,1,0)+
if_else(pay_july>0,1,0)+
if_else(pay_aug>0,1,0)+
if_else(pay_sept>0,1,0))
#is_ever_delay
creditcard =creditcard%>%
mutate(is_ever_delay = if_else(fre_delay>0,1,0) )
#bill_sept_0
creditcard =creditcard%>%
mutate(bill_sept_0 = if_else(bill_sept<=0,0,1))
Visualizaiton for new features.
#convert new variables into factors for analysis
new_vars <- c('age_group','fre_delay','is_ever_delay','bill_sept_0')
creditcard_f =cbind(creditcard_f,creditcard[new_vars])
creditcard_f[new_vars] <- lapply(creditcard_f[new_vars], function(x) as.factor(x))
ggplot(creditcard_f,aes(x=age_group,fill=default_next_month))+
geom_bar(position = 'fill' )+
xlab("Age group") +
ylab("Default Rate")
ggplot(creditcard_f,aes(x=bill_sept_0,fill=default_next_month))+
geom_bar(position = 'fill' )+
xlab("Bill in sept") +
ylab("Default Rate")
ggplot(creditcard_f,aes(x=is_ever_delay,fill=default_next_month))+
geom_bar(position = 'fill' )+
xlab("Is ever delay") +
ylab("Default Rate")
ggplot(creditcard_f,aes(x=fre_delay,fill=default_next_month))+
geom_bar(position = 'fill' )+
xlab("Frequency of delay") +
ylab("Default Rate")
It is obvious that client who has higher frequency of delays are more likely to default.
While the difference between age group is small, and the amount of bill in sept doesn’t affect the default rate. So, I might have to discard ‘age_group’ and ‘bill_sept_0’, keep ‘fre_delay’ and ‘is_ever_delay’
Now, let’s calculate the correlation with default_next_month for each variable before and after adding new variables.
#cor with default_next_month sort by abs
corr1 = cor(data)[,1]%>%abs()%>%sort(decreasing = TRUE)
corr2 = cor(creditcard)[,1]%>%abs()%>%sort(decreasing = TRUE)
#print the top 10 variables which have the highest cor value with default next month
#initial variables
print(corr1[2:11])
## pay_sept pay_aug pay_july pay_june pay_may
## 0.32479373 0.26355120 0.23525251 0.21661364 0.20414891
## pay_april limit_bal pay_amt_sept pay_amt_aug pay_amt_june
## 0.18686636 0.15351988 0.07292949 0.05857871 0.05682740
#after adding new variables
print(corr2[2:11])
## fre_delay is_ever_delay pay_sept pay_aug pay_july
## 0.39839436 0.35285790 0.32479373 0.26355120 0.23525251
## pay_june pay_may pay_april limit_bal pay_amt_sept
## 0.21661364 0.20414891 0.18686636 0.15351988 0.07292949
The rank of variables has changed.Let’s select these top 10 variables as final input features.
vars_name = names(corr2[2:11])
creditcard = creditcard %>% select(default_next_month,!!vars_name)
n= nrow(creditcard)
trainIndex = sample(1:n, size = round(0.8*n), replace=FALSE)
train = creditcard[trainIndex ,]
test = creditcard[-trainIndex ,]
Define functions
#prediction performance metrics
predict_perform = function(data,pD){
cat("Accuracy is ",Accuracy(y_pred = pD, y_true = data$default_next_month),'\n')
cat("Recall is ", Recall(y_pred = pD, y_true = data$default_next_month,positive = '1'),'\n' )
cat("Precision is",Precision(y_pred = pD, y_true = data$default_next_month,positive = '1'),'\n')
}
#prediction performance plots
predict_perform_p = function(data,pD){
precrec_obj <- evalmod(labels = data$default_next_month, scores = as.numeric(pD))
autoplot(precrec_obj)
}
#plot variable importance
var_imp =function(model){
v=varImp(model)
print(v)
v$var = rownames(v)
v %>% ggvis(~var, ~Overall) %>% layer_bars()
}
#--------------Logistic Regression: a full model
lrModel0 = glm(default_next_month ~., data = train, family = binomial)
print(lrModel0)
##
## Call: glm(formula = default_next_month ~ ., family = binomial, data = train)
##
## Coefficients:
## (Intercept) fre_delay is_ever_delay pay_sept pay_aug
## -1.712e+00 3.884e-01 4.987e-01 2.379e-01 -2.512e-02
## pay_july pay_june pay_may pay_april limit_bal
## -4.624e-02 -6.136e-02 -2.362e-02 -5.534e-02 -1.395e-06
## pay_amt_sept
## -9.680e-06
##
## Degrees of Freedom: 23999 Total (i.e. Null); 23989 Residual
## Null Deviance: 25430
## Residual Deviance: 21530 AIC: 21560
predictDefault0 = ifelse(predict(lrModel0, newdata = train,
type = "response")>=0.5,1,0)
#view the variable importance
var_imp(lrModel0)
## Overall
## fre_delay 18.9313557
## is_ever_delay 8.9451973
## pay_sept 11.0890461
## pay_aug 1.1576487
## pay_july 1.9392888
## pay_june 2.3429516
## pay_may 0.8331226
## pay_april 2.3967230
## limit_bal 8.8306533
## pay_amt_sept 4.5119211
#--------------Logistic Regression: a simple model
lrModel1 = glm(default_next_month ~ pay_sept, data = train, family = binomial)
predictDefault1 = ifelse(predict(lrModel1, newdata = train,
type = "response")>=0.5,1,0)
#-------------Logistic Regression: a muti-variable model based on varImp
lrModel2 = glm(default_next_month ~ pay_sept+fre_delay, data = train, family = binomial)
predictDefault2 = ifelse(predict(lrModel2, newdata = train,
type = "response")>=0.5,1,0)
#-------------compare log regression models when threshold is 0.5
predict_perform(train,predictDefault0)
## Accuracy is 0.80875
## Recall is 0.2934783
## Precision is 0.6563286
predict_perform(train,predictDefault1)
## Accuracy is 0.819125
## Recall is 0.3324588
## Precision is 0.6948688
predict_perform(train,predictDefault2)
## Accuracy is 0.810125
## Recall is 0.2872939
## Precision is 0.6703105
It is surprised that the best regression model only use pay_sept.
#compare prediction performance with different threshold use lrModel1
predictDefault1_1 = ifelse(predict(lrModel1, newdata = train,
type = "response")>=0.3,1,0)
predictDefault1_2 = ifelse(predict(lrModel1, newdata = train,
type = "response")>=0.5,1,0)
predictDefault1_3 = ifelse(predict(lrModel1, newdata = train,
type = "response")>=0.6,1,0)
predict_perform(train,predictDefault1_1)
## Accuracy is 0.7794167
## Recall is 0.518928
## Precision is 0.503821
predict_perform(train,predictDefault1_2)
## Accuracy is 0.819125
## Recall is 0.3324588
## Precision is 0.6948688
predict_perform(train,predictDefault1_3)
## Accuracy is 0.7842083
## Recall is 0.04966267
## Precision is 0.7104558
predict_perform_p(train,predictDefault1_1)
predict_perform_p(train,predictDefault1_2)
predict_perform_p(train,predictDefault1_3)
As I adjust the threshold from 0.3 to 0.7, the recall goes higher while precision goes lower. The total accuracy goes up then goes down. I might choose [lrModel1 0.3 threshold] with highest recall.
#-------------- Decision Tree: a full model
dtModel0 = rpart(as.factor(default_next_month)~., data = train, method = 'class')
#dtModel0
t_predcitDefault0 = predict(dtModel0,train,type = 'class') # factor 0 or 1
rpart.plot(dtModel0)
predict_perform(train,t_predcitDefault0)
## Accuracy is 0.819125
## Recall is 0.3324588
## Precision is 0.6948688
#view the variable importance
var_imp(dtModel0)
## Overall
## fre_delay 1068.9128
## is_ever_delay 1031.5318
## pay_aug 965.2485
## pay_july 722.5290
## pay_sept 1275.8343
## pay_june 0.0000
## pay_may 0.0000
## pay_april 0.0000
## limit_bal 0.0000
## pay_amt_sept 0.0000
#---------- Decision Tree: a muti-variable model
dtModel1 = rpart(as.factor(default_next_month)~limit_bal+fre_delay+is_ever_delay, data = train, method = 'class')
#dtModel1
t_predcitDefault1 = predict(dtModel1,train,type = 'class')
rpart.plot(dtModel1)
predict_perform(train,t_predcitDefault1)
## Accuracy is 0.804125
## Recall is 0.3429535
## Precision is 0.6049587
The first tree model which has only one node has the exactly same performance with the simple logistic regression model. So I will choose ‘lrModel1’ as the final model, and set threshold as 0.3 in order to get relatively high recall.
#choose dtModel0
Default_pre = ifelse(predict(lrModel1,test,type = 'response')>=0.3,1,0)
ConfusionMatrix(y_pred=Default_pre,y_true = test$default_next_month)
## y_pred
## y_true 0 1
## 0 4038 662
## 1 640 660
cat('\n')
#print accuracy,recall,precision
predict_perform(test,Default_pre)
## Accuracy is 0.783
## Recall is 0.5076923
## Precision is 0.4992436
So 78% of the time I’m correctly binning customers into their correct categories as either defaulters or non-defaulters.
Although the result is not satisfying, with this accuracy and recall, credit card lenders could use the model for identifying potential defaulters. But for strategies based on the model they shoud be made mildly, since there is a relatively high false positive error, which means it’s flagging too many people as defaulters, as I adjust the threshold for higher recall.
In order to improve recall while not harm the total accuracy too much, I spent a lot of time doing exploratory and feature engineering. I actually created much more features beyond those features I finally chosen for model training, but most of them are not very useful as I expected.
Maybe the performance can be improved with more data dimensions, better algorithms and deeper understanding of domain knowledges.