Background

Problem

With credit card balances averaging around $16,000 per household and student loans now the largest source of non-mortgage household debt, it’s clear that Americans are financing many aspects of their lives.

For credit card lenders, issuing cards to unqualified applicants is risky: clients who overuse their cards for consumption, regardless of their ability to pay, can accumulate heavy debts, and those losses ultimately fall on the lender. To mitigate losses from risky clients, the incentive to build models that identify high-risk applicants has never been stronger.

Dataset

The Default of Credit Card Clients Dataset: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

This dataset has 30,000 observations and 24 variables, covering default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. The target label is ‘default_next_month’, a binary variable indicating whether the client defaults in the following month.

Objective

Identify credit card clients likely to default on their next payment by classifying them as ‘defaulters’ (1) or ‘non-defaulters’ (0). [positive = 1]

Initial setup

# Load required libraries 
library(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## ── Attaching packages ──────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.1     ✔ dplyr   0.8.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(caret) #model training and evaluation utilities
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(ggvis)
## 
## Attaching package: 'ggvis'
## The following object is masked from 'package:ggplot2':
## 
##     resolution
library(MLmetrics) 
## 
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## The following object is masked from 'package:base':
## 
##     Recall
library(precrec) 
library(rpart) #decision tree 
library(rpart.plot)
library(scales)
## 
## Attaching package: 'scales'
## The following objects are masked from 'package:ggvis':
## 
##     fullseq, zero_range
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
set.seed(1234)

Dataset Overview

#import data
data = read_tsv('credit_card_default.tsv')
## Parsed with column specification:
## cols(
##   .default = col_double()
## )
## See spec(...) for full column specifications.
creditcard = data 

cat('The dimension is: ', dim(creditcard))
## The dimension is:  30000 24
cat('Are there any NA values: ', any(is.na(creditcard))) 
## Are there any NA values:  FALSE
str(creditcard) 
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 30000 obs. of  24 variables:
##  $ default_next_month: num  1 1 0 0 0 0 0 0 0 0 ...
##  $ limit_bal         : num  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
##  $ sex               : num  2 2 2 2 1 1 1 2 2 1 ...
##  $ education         : num  2 2 2 2 2 1 1 2 3 3 ...
##  $ marriage          : num  1 2 2 1 1 2 2 2 1 2 ...
##  $ age               : num  24 26 34 37 57 37 29 23 28 35 ...
##  $ pay_sept          : num  2 -1 0 0 -1 0 0 0 0 -2 ...
##  $ pay_aug           : num  2 2 0 0 0 0 0 -1 0 -2 ...
##  $ pay_july          : num  -1 0 0 0 -1 0 0 -1 2 -2 ...
##  $ pay_june          : num  -1 0 0 0 0 0 0 0 0 -2 ...
##  $ pay_may           : num  -2 0 0 0 0 0 0 0 0 -1 ...
##  $ pay_april         : num  -2 2 0 0 0 0 0 -1 0 -1 ...
##  $ bill_sept         : num  3913 2682 29239 46990 8617 ...
##  $ bill_aug          : num  3102 1725 14027 48233 5670 ...
##  $ bill_july         : num  689 2682 13559 49291 35835 ...
##  $ bill_june         : num  0 3272 14331 28314 20940 ...
##  $ bill_may          : num  0 3455 14948 28959 19146 ...
##  $ bill_april        : num  0 3261 15549 29547 19131 ...
##  $ pay_amt_sept      : num  0 0 1518 2000 2000 ...
##  $ pay_amt_aug       : num  689 1000 1500 2019 36681 ...
##  $ pay_amt_july      : num  0 1000 1000 1200 10000 657 38000 0 432 0 ...
##  $ pay_amt_june      : num  0 1000 1000 1100 9000 ...
##  $ pay_amt_may       : num  0 0 1000 1069 689 ...
##  $ pay_amt_april     : num  0 2000 5000 1000 679 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   default_next_month = col_double(),
##   ..   limit_bal = col_double(),
##   ..   sex = col_double(),
##   ..   education = col_double(),
##   ..   marriage = col_double(),
##   ..   age = col_double(),
##   ..   pay_sept = col_double(),
##   ..   pay_aug = col_double(),
##   ..   pay_july = col_double(),
##   ..   pay_june = col_double(),
##   ..   pay_may = col_double(),
##   ..   pay_april = col_double(),
##   ..   bill_sept = col_double(),
##   ..   bill_aug = col_double(),
##   ..   bill_july = col_double(),
##   ..   bill_june = col_double(),
##   ..   bill_may = col_double(),
##   ..   bill_april = col_double(),
##   ..   pay_amt_sept = col_double(),
##   ..   pay_amt_aug = col_double(),
##   ..   pay_amt_july = col_double(),
##   ..   pay_amt_june = col_double(),
##   ..   pay_amt_may = col_double(),
##   ..   pay_amt_april = col_double()
##   .. )
#head(creditcard)
#View(creditcard)
#Check the number of unique values to see which variables can be converted to factors
apply(creditcard,2, function(x) length(unique(x)))
## default_next_month          limit_bal                sex 
##                  2                 81                  2 
##          education           marriage                age 
##                  4                  4                 56 
##           pay_sept            pay_aug           pay_july 
##                 11                 11                 11 
##           pay_june            pay_may          pay_april 
##                 11                 10                 10 
##          bill_sept           bill_aug          bill_july 
##              22723              22346              22026 
##          bill_june           bill_may         bill_april 
##              21548              21010              20604 
##       pay_amt_sept        pay_amt_aug       pay_amt_july 
##               7943               7899               7518 
##       pay_amt_june        pay_amt_may      pay_amt_april 
##               6937               6897               6939

Exploratory Data Analysis and Feature Engineering

Questions:

Analysis on demographic variables

#convert variables into factors for analysis  
factor_vars <- c('sex','education','marriage','default_next_month')
creditcard_f = creditcard
creditcard_f[factor_vars] <- lapply(creditcard_f[factor_vars], function(x) as.factor(x))

#demographic features 
ggplot(creditcard_f,aes(x=sex,fill=default_next_month))+
  geom_bar(position = 'fill')+
  ggtitle("Sex v/s Default Rate")+
  xlab("Sex(1:male 2:female)") +
  ylab("Default Rate") +
  scale_y_continuous(labels =percent_format())

ggplot(creditcard_f,aes(x=education,fill=default_next_month))+
   geom_bar(position = 'fill')+
  ggtitle("Education v/s Default Rate")+
  xlab("Education(1:Gra 2:Uni 3:High 4:Other)") +
  ylab("Default Rate") 

ggplot(creditcard_f,aes(x=marriage,fill=default_next_month))+
  geom_bar(position = 'fill')+
  ggtitle("Marriage v/s Default Rate")+
  xlab("Marriage(0:Other 1:Mar 2:Sin 3:Dev)") +
  ylab("Default Rate") 

ggplot(creditcard_f,aes(y=age,x=default_next_month))+
  geom_boxplot( )+
  ggtitle("Age v/s Default")

From the plots above, I don’t see large differences in default rate across the levels of these categorical variables, and the age distributions of defaulters and non-defaulters overlap heavily. It looks like the default probability does not vary much with these demographic variables.
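To back this impression with numbers, here is a quick sketch (using the creditcard_f data frame built above) that tabulates the observed default rate by education level and compares age quartiles across the two outcome groups; the same check can be repeated for sex and marriage.

# Sketch: observed default rate by education level (creditcard_f from above)
creditcard_f %>%
  group_by(education) %>%
  summarise(n = n(), default_rate = mean(default_next_month == "1"))

# Sketch: age quartiles for defaulters vs non-defaulters
creditcard_f %>%
  group_by(default_next_month) %>%
  summarise(q25 = quantile(age, 0.25), median_age = median(age), q75 = quantile(age, 0.75))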

Analysis on credit records

ggplot(creditcard_f,aes(y=limit_bal,x=default_next_month))+
  geom_boxplot( )+
  ggtitle("Limit balance v/s Default")

ggplot(creditcard_f,aes(y=bill_sept,x=default_next_month))+
   geom_boxplot( )+
  ggtitle("Bill in sept v/s Default")

ggplot(creditcard_f,aes(y=pay_amt_sept,x=default_next_month))+
   geom_boxplot( )+
  ggtitle("Payment amount in sept v/s Default")

ggplot(creditcard_f,aes(x=as.factor(pay_sept),fill=default_next_month))+
  geom_bar(position = 'fill')+
  ggtitle("Payment status in sept v/s Default rate")+
  xlab("Payment status") +
  ylab("Default Rate") 

From the plot above, the default rate for next month climbs as the payment delay grows longer. Unsurprisingly, customers who do not repay on time are more likely to default.
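This trend can also be quantified directly; a short sketch that computes the observed default rate for each September payment status (pay_sept is still numeric in creditcard_f):

# Sketch: default rate by September payment status (values >= 1 indicate months of delay)
creditcard_f %>%
  group_by(pay_sept) %>%
  summarise(n = n(), default_rate = mean(default_next_month == "1")) %>%
  arrange(pay_sept)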

Hypotheses:

  1. Clients who are in college themselves, or whose children are, have a higher chance of default, since student loans are becoming the largest part of non-mortgage household debt. [I will split the data into two age groups: ages 26-45 as the group under more financial pressure; everyone else as the group under less pressure.]

  2. Clients with a higher frequency of payment delays are more likely to default. [is_ever_delay, fre_delay]

  3. Clients whose bill_sept is at or below 0 are less likely to default. [bill_sept_0]

Feature engineering

Based on previous hypotheses, I create some new features.

#age group: 0 = ages 26-45 (assumed higher financial pressure), 1 = otherwise
creditcard = creditcard %>%
  mutate(age_group = case_when(age <= 25 ~ 1,
                               age <= 45 ~ 0,
                               age >  45 ~ 1))

#frequency of delays
creditcard = creditcard%>%
  mutate(fre_delay = if_else(pay_april>0,1,0)+
           if_else(pay_may>0,1,0)+
           if_else(pay_june>0,1,0)+
           if_else(pay_july>0,1,0)+
           if_else(pay_aug>0,1,0)+
           if_else(pay_sept>0,1,0))

#is_ever_delay
creditcard =creditcard%>%
  mutate(is_ever_delay = if_else(fre_delay>0,1,0) )

#bill_sept_0
creditcard =creditcard%>%
  mutate(bill_sept_0 = if_else(bill_sept<=0,0,1))

Visualization of the new features.

#convert new variables into factors for analysis 
new_vars <- c('age_group','fre_delay','is_ever_delay','bill_sept_0')

creditcard_f =cbind(creditcard_f,creditcard[new_vars])
creditcard_f[new_vars] <- lapply(creditcard_f[new_vars], function(x) as.factor(x))

ggplot(creditcard_f,aes(x=age_group,fill=default_next_month))+
  geom_bar(position = 'fill' )+
  xlab("Age group") +
  ylab("Default Rate")

ggplot(creditcard_f,aes(x=bill_sept_0,fill=default_next_month))+
  geom_bar(position = 'fill' )+
  xlab("Bill in sept") +
  ylab("Default Rate")

ggplot(creditcard_f,aes(x=is_ever_delay,fill=default_next_month))+
  geom_bar(position = 'fill' )+
  xlab("Is ever delay") +
  ylab("Default Rate")

ggplot(creditcard_f,aes(x=fre_delay,fill=default_next_month))+
  geom_bar(position = 'fill' )+
  xlab("Frequency of delay") +
  ylab("Default Rate")

It is clear that clients with a higher frequency of payment delays are more likely to default.

The difference between the age groups, however, is small, and whether the September bill is positive barely changes the default rate. So I will discard ‘age_group’ and ‘bill_sept_0’ and keep ‘fre_delay’ and ‘is_ever_delay’.
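To put a rough number on “small”, here is a quick sketch that checks the default rate within each level of the two weak features, using the numeric creditcard data frame:

# Sketch: default rate per level of the two candidate features considered for removal
with(creditcard, tapply(default_next_month, age_group, mean))
with(creditcard, tapply(default_next_month, bill_sept_0, mean))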

Now, let’s calculate each variable’s correlation with default_next_month, before and after adding the new variables.

#correlation with default_next_month, sorted by absolute value
corr1 = cor(data)[,1]%>%abs()%>%sort(decreasing = TRUE) 
corr2 = cor(creditcard)[,1]%>%abs()%>%sort(decreasing = TRUE) 

#print the top 10 variables which have the highest cor value with default next month 
#initial variables
print(corr1[2:11]) 
##     pay_sept      pay_aug     pay_july     pay_june      pay_may 
##   0.32479373   0.26355120   0.23525251   0.21661364   0.20414891 
##    pay_april    limit_bal pay_amt_sept  pay_amt_aug pay_amt_june 
##   0.18686636   0.15351988   0.07292949   0.05857871   0.05682740
#after adding new variables
print(corr2[2:11])
##     fre_delay is_ever_delay      pay_sept       pay_aug      pay_july 
##    0.39839436    0.35285790    0.32479373    0.26355120    0.23525251 
##      pay_june       pay_may     pay_april     limit_bal  pay_amt_sept 
##    0.21661364    0.20414891    0.18686636    0.15351988    0.07292949

The ranking of variables has changed. Let’s select these top 10 variables as the final input features.

vars_name = names(corr2[2:11]) 
creditcard = creditcard %>% select(default_next_month,!!vars_name)

Preprocessing and Training

Training and testing dataset

n= nrow(creditcard)
trainIndex = sample(1:n, size = round(0.8*n), replace=FALSE)
train = creditcard[trainIndex ,]  
test = creditcard[-trainIndex ,]
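Note that this is a simple random split. Since only about a fifth of clients default, a stratified split that preserves the class ratio in both sets is a reasonable alternative; a sketch with caret::createDataPartition (not used for the results below) would look like this:

# Alternative (not used below): stratified 80/20 split preserving the default ratio
strat_index = createDataPartition(as.factor(creditcard$default_next_month),
                                  p = 0.8, list = FALSE)
train_strat = creditcard[strat_index, ]
test_strat  = creditcard[-strat_index, ]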

Define functions

#prediction performance metrics  
predict_perform = function(data,pD){
  cat("Accuracy is ",Accuracy(y_pred = pD, y_true = data$default_next_month),'\n') 
  cat("Recall is ", Recall(y_pred = pD, y_true = data$default_next_month,positive = '1'),'\n' )
  cat("Precision is",Precision(y_pred = pD, y_true = data$default_next_month,positive = '1'),'\n')
}

#prediction performance plots
predict_perform_p = function(data,pD){
  precrec_obj <- evalmod(labels = data$default_next_month, scores = as.numeric(pD))
  autoplot(precrec_obj)
}

#plot variable importance 
var_imp =function(model){
  v=varImp(model)
  print(v)
  v$var = rownames(v)
  v %>% ggvis(~var, ~Overall) %>% layer_bars()
}

Logistic Regression

#--------------Logistic Regression: a full model
lrModel0 = glm(default_next_month ~., data = train, family = binomial)
print(lrModel0)
## 
## Call:  glm(formula = default_next_month ~ ., family = binomial, data = train)
## 
## Coefficients:
##   (Intercept)      fre_delay  is_ever_delay       pay_sept        pay_aug  
##    -1.712e+00      3.884e-01      4.987e-01      2.379e-01     -2.512e-02  
##      pay_july       pay_june        pay_may      pay_april      limit_bal  
##    -4.624e-02     -6.136e-02     -2.362e-02     -5.534e-02     -1.395e-06  
##  pay_amt_sept  
##    -9.680e-06  
## 
## Degrees of Freedom: 23999 Total (i.e. Null);  23989 Residual
## Null Deviance:       25430 
## Residual Deviance: 21530     AIC: 21560
predictDefault0 = ifelse(predict(lrModel0, newdata = train, 
                                 type = "response")>=0.5,1,0)

#view the variable importance
var_imp(lrModel0)
##                  Overall
## fre_delay     18.9313557
## is_ever_delay  8.9451973
## pay_sept      11.0890461
## pay_aug        1.1576487
## pay_july       1.9392888
## pay_june       2.3429516
## pay_may        0.8331226
## pay_april      2.3967230
## limit_bal      8.8306533
## pay_amt_sept   4.5119211
#--------------Logistic Regression: a simple model
lrModel1 = glm(default_next_month ~ pay_sept, data = train, family = binomial)
predictDefault1 = ifelse(predict(lrModel1, newdata = train, 
                                 type = "response")>=0.5,1,0)

#-------------Logistic Regression: a multi-variable model based on varImp 
lrModel2 = glm(default_next_month ~ pay_sept+fre_delay, data = train, family = binomial)
predictDefault2 = ifelse(predict(lrModel2, newdata = train, 
                                 type = "response")>=0.5,1,0)

#-------------compare logistic regression models at a 0.5 threshold
predict_perform(train,predictDefault0)
## Accuracy is  0.80875 
## Recall is  0.2934783 
## Precision is 0.6563286
predict_perform(train,predictDefault1)
## Accuracy is  0.819125 
## Recall is  0.3324588 
## Precision is 0.6948688
predict_perform(train,predictDefault2)
## Accuracy is  0.810125 
## Recall is  0.2872939 
## Precision is 0.6703105

It is surprising that the best logistic regression model uses only pay_sept.
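To double-check this with a single metric that balances recall and precision, the F1 score at the 0.5 threshold can be compared across the three models (a quick sketch; F1_Score comes with the MLmetrics package already loaded):

# Sketch: F1 at threshold 0.5 for the three logistic regression models
F1_Score(y_true = train$default_next_month, y_pred = predictDefault0, positive = '1')
F1_Score(y_true = train$default_next_month, y_pred = predictDefault1, positive = '1')
F1_Score(y_true = train$default_next_month, y_pred = predictDefault2, positive = '1')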

#compare prediction performance with different thresholds using lrModel1
predictDefault1_1 = ifelse(predict(lrModel1, newdata = train, 
                                 type = "response")>=0.3,1,0)
predictDefault1_2 = ifelse(predict(lrModel1, newdata = train, 
                                 type = "response")>=0.5,1,0)
predictDefault1_3 = ifelse(predict(lrModel1, newdata = train, 
                                 type = "response")>=0.6,1,0)

predict_perform(train,predictDefault1_1)
## Accuracy is  0.7794167 
## Recall is  0.518928 
## Precision is 0.503821
predict_perform(train,predictDefault1_2)
## Accuracy is  0.819125 
## Recall is  0.3324588 
## Precision is 0.6948688
predict_perform(train,predictDefault1_3)
## Accuracy is  0.7842083 
## Recall is  0.04966267 
## Precision is 0.7104558
predict_perform_p(train,predictDefault1_1)

predict_perform_p(train,predictDefault1_2)

predict_perform_p(train,predictDefault1_3)

As the threshold increases from 0.3 to 0.6, recall drops sharply while precision rises, and overall accuracy first goes up and then comes down. I choose lrModel1 with a 0.3 threshold, which gives the highest recall.
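Instead of checking a handful of thresholds by hand, a small sweep over candidate cut-offs makes the trade-off explicit. A sketch using lrModel1 and the MLmetrics helpers already loaded:

# Sketch: sweep candidate thresholds for lrModel1 and tabulate the resulting metrics
probs = predict(lrModel1, newdata = train, type = "response")
cutoffs = seq(0.2, 0.6, by = 0.1)
sweep_results = sapply(cutoffs, function(t) {
  pred = ifelse(probs >= t, 1, 0)
  c(threshold = t,
    accuracy  = Accuracy(y_pred = pred, y_true = train$default_next_month),
    recall    = Recall(y_pred = pred, y_true = train$default_next_month, positive = '1'),
    precision = Precision(y_pred = pred, y_true = train$default_next_month, positive = '1'))
})
t(sweep_results)  # one row per threshold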

Decision Tree

#-------------- Decision Tree: a full model
dtModel0 = rpart(as.factor(default_next_month)~., data = train, method = 'class')
#dtModel0
t_predictDefault0 = predict(dtModel0,train,type = 'class') # factor 0 or 1 

rpart.plot(dtModel0)

predict_perform(train,t_predictDefault0)
## Accuracy is  0.819125 
## Recall is  0.3324588 
## Precision is 0.6948688
#view the variable importance
var_imp(dtModel0)
##                 Overall
## fre_delay     1068.9128
## is_ever_delay 1031.5318
## pay_aug        965.2485
## pay_july       722.5290
## pay_sept      1275.8343
## pay_june         0.0000
## pay_may          0.0000
## pay_april        0.0000
## limit_bal        0.0000
## pay_amt_sept     0.0000
#---------- Decision Tree: a multi-variable model
dtModel1 = rpart(as.factor(default_next_month)~limit_bal+fre_delay+is_ever_delay, data = train, method = 'class')
#dtModel1
t_predictDefault1 = predict(dtModel1,train,type = 'class')

rpart.plot(dtModel1) 

predict_perform(train,t_predictDefault1)
## Accuracy is  0.804125 
## Recall is  0.3429535 
## Precision is 0.6049587

The first tree model, which effectively uses a single split, has exactly the same performance as the simple logistic regression model. So I will choose ‘lrModel1’ as the final model and set the threshold to 0.3 in order to get relatively high recall.

Validation

#apply the chosen model, lrModel1, to the test set with threshold 0.3
Default_pre = ifelse(predict(lrModel1,test,type = 'response')>=0.3,1,0)
ConfusionMatrix(y_pred=Default_pre,y_true = test$default_next_month)
##       y_pred
## y_true    0    1
##      0 4038  662
##      1  640  660
cat('\n')
#print accuracy,recall,precision
predict_perform(test,Default_pre)
## Accuracy is  0.783 
## Recall is  0.5076923 
## Precision is 0.4992436

So about 78% of the time, the model places customers into the correct category, defaulter or non-defaulter.
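For a fuller picture on the test set, the F1 score and the areas under the ROC and precision-recall curves can be computed from the same model, this time scoring with predicted probabilities rather than hard 0/1 labels (a sketch using MLmetrics and precrec):

# Sketch: test-set F1, plus ROC/PR curves computed from predicted probabilities
F1_Score(y_true = test$default_next_month, y_pred = Default_pre, positive = '1')

test_scores = predict(lrModel1, newdata = test, type = 'response')
test_eval = evalmod(labels = test$default_next_month, scores = test_scores)
auc(test_eval)       # areas under the ROC and precision-recall curves
autoplot(test_eval)  # the curves themselves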

Conclusion

Although the result is not fully satisfying, with this accuracy and recall credit card lenders could use the model to identify potential defaulters. Strategies built on the model should be applied cautiously, though: because I adjusted the threshold for higher recall, the false positive rate is relatively high, meaning the model flags too many good clients as defaulters.
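If a lender can attach even rough costs to the two error types, the threshold can be tuned to minimise expected cost rather than to maximise recall. A minimal sketch with entirely hypothetical costs (cost_fn and cost_fp are placeholders, not estimates from this analysis); in practice this tuning should be done on a separate validation set rather than the final test set:

# Hypothetical cost-based threshold selection (illustrative only)
cost_fn = 5   # assumed cost of missing a true defaulter (false negative)
cost_fp = 1   # assumed cost of flagging a good client (false positive)

test_probs = predict(lrModel1, newdata = test, type = 'response')
cost_cutoffs = seq(0.1, 0.9, by = 0.05)
expected_cost = sapply(cost_cutoffs, function(t) {
  pred = ifelse(test_probs >= t, 1, 0)
  fn = sum(pred == 0 & test$default_next_month == 1)
  fp = sum(pred == 1 & test$default_next_month == 0)
  cost_fn * fn + cost_fp * fp
})
cost_cutoffs[which.min(expected_cost)]   # cut-off with the lowest expected cost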

To improve recall without hurting overall accuracy too much, I spent a lot of time on exploratory analysis and feature engineering. I actually created many more features than the ones finally chosen for model training, but most of them were not as useful as I had expected.

Performance could likely be improved with more data dimensions, better algorithms, and deeper domain knowledge.