Research Objectives

Future analysis from this data may be beneficial for lending institutions (bank & even non-banks, like “Go-Pay Later” or similar business), especially because this dataset, while quite outdated, comes from an actual Taiwan bank in 2005. It has basic data demography such as marriage status and education level, along with detailed bill & payments received from the customers each month.

This research is aimed to discuss about the following (but not limited to): 1. How does the probability of default payment vary by categories of different demographic variables?
2. Which variables are the strongest predictors of default payment?

# Library
library(tidyverse)
library(ggplot2)

Read the Data

Using a public dataset found in https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset, this dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

# Read Data
credit <- read.csv("UCI_Credit_Card.csv")
str(credit)

'data.frame':   30000 obs. of  25 variables:
 $ ID                        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ LIMIT_BAL                 : num  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
 $ SEX                       : int  2 2 2 2 1 1 1 2 2 1 ...
 $ EDUCATION                 : int  2 2 2 2 2 1 1 2 3 3 ...
 $ MARRIAGE                  : int  1 2 2 1 1 2 2 2 1 2 ...
 $ AGE                       : int  24 26 34 37 57 37 29 23 28 35 ...
 $ PAY_0                     : int  2 -1 0 0 -1 0 0 0 0 -2 ...
 $ PAY_2                     : int  2 2 0 0 0 0 0 -1 0 -2 ...
 $ PAY_3                     : int  -1 0 0 0 -1 0 0 -1 2 -2 ...
 $ PAY_4                     : int  -1 0 0 0 0 0 0 0 0 -2 ...
 $ PAY_5                     : int  -2 0 0 0 0 0 0 0 0 -1 ...
 $ PAY_6                     : int  -2 2 0 0 0 0 0 -1 0 -1 ...
 $ BILL_AMT1                 : num  3913 2682 29239 46990 8617 ...
 $ BILL_AMT2                 : num  3102 1725 14027 48233 5670 ...
 $ BILL_AMT3                 : num  689 2682 13559 49291 35835 ...
 $ BILL_AMT4                 : num  0 3272 14331 28314 20940 ...
 $ BILL_AMT5                 : num  0 3455 14948 28959 19146 ...
 $ BILL_AMT6                 : num  0 3261 15549 29547 19131 ...
 $ PAY_AMT1                  : num  0 0 1518 2000 2000 ...
 $ PAY_AMT2                  : num  689 1000 1500 2019 36681 ...
 $ PAY_AMT3                  : num  0 1000 1000 1200 10000 657 38000 0 432 0 ...
 $ PAY_AMT4                  : num  0 1000 1000 1100 9000 ...
 $ PAY_AMT5                  : num  0 0 1000 1069 689 ...
 $ PAY_AMT6                  : num  0 2000 5000 1000 679 ...
 $ default.payment.next.month: int  1 1 0 0 0 0 0 0 0 0 ...

Here are the definition of each variables: - ID : ID of each client
- LIMIT_BAL : Amount of given credit in NT dollars (includes individual and family/supplementary credit
- SEX : Gender (1=male, 2=female)
- EDUCATION : (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
- MARRIAGE : Marital status (1=married, 2=single, 3=others)
- AGE : Age in years
- PAY_0 : Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- PAY_2 : Repayment status in August, 2005 (scale same as above)
- PAY_3 : Repayment status in July, 2005 (scale same as above)
- PAY_4 : Repayment status in June, 2005 (scale same as above)
- PAY_5 : Repayment status in May, 2005 (scale same as above)
- PAY_6 : Repayment status in April, 2005 (scale same as above)
- BILL_AMT1 : Amount of bill statement in September, 2005 (NT dollar)
- BILL_AMT2 : Amount of bill statement in August, 2005 (NT dollar)
- BILL_AMT3 : Amount of bill statement in July, 2005 (NT dollar)
- BILL_AMT4 : Amount of bill statement in June, 2005 (NT dollar)
- BILL_AMT5 : Amount of bill statement in May, 2005 (NT dollar)
- BILL_AMT6 : Amount of bill statement in April, 2005 (NT dollar)
- PAY_AMT1 : Amount of previous payment in September, 2005 (NT dollar)
- PAY_AMT2 : Amount of previous payment in August, 2005 (NT dollar)
- PAY_AMT3 : Amount of previous payment in July, 2005 (NT dollar)
- PAY_AMT4 : Amount of previous payment in June, 2005 (NT dollar)
- PAY_AMT5 : Amount of previous payment in May, 2005 (NT dollar)
- PAY_AMT6 : Amount of previous payment in April, 2005 (NT dollar)
- default.payment.next.month : Default payment (1=yes, 0=no)

Data Preparation & EDA

Data Preparation

Data preparation step consists of re-naming variables, changing their class, and assign the appropriate levels.

# Renaming PAY_0 into PAY_1 so it won't suddenly jump into PAY_2
colnames(credit)[colnames(credit) == "PAY_0"] = "PAY_1"

# check Null/NA value
sum(is.na(credit)) # no missing value!

[1] 0

# change class SEX into factor & assign levels
credit$SEX <- as.factor(credit$SEX)
levels(credit$SEX) <- c("Male", "Female")

# change class EDUCATION into factor & assign levels
credit$EDUCATION <- as.factor(credit$EDUCATION)
levels(credit$EDUCATION) <- c("Unknown", "Graduate School", "University", "High school", "Others", "Unknown", "Unknown") #including 0 because there are actually 7 levels, not 6

# change MARRIAGE into factor & assign levels
credit$MARRIAGE <- as.factor(credit$MARRIAGE)
levels(credit$MARRIAGE) <- c("Unknown" , "Married" , "Single" ,"Others") #including 0 because there are 4 levels, not 3

# change default.payment.next.month into factor & assign levels
credit$default.payment.next.month <- as.factor(credit$default.payment.next.month)
levels(credit$default.payment.next.month) <- c("Not Default", "Default")

# Change class into factor
credit$PAY_1 <- as.factor(credit$PAY_1)
credit$PAY_2 <- as.factor(credit$PAY_2)
credit$PAY_3 <- as.factor(credit$PAY_3)
credit$PAY_4 <- as.factor(credit$PAY_4)
credit$PAY_5 <- as.factor(credit$PAY_5)
credit$PAY_6 <- as.factor(credit$PAY_6)

Data Summary

Before we look into comparison from each variables in EDA step, here are some highlights from the data:

# data summary
summary(credit)

       ID          LIMIT_BAL           SEX                  EDUCATION    
 Min.   :    1   Min.   :  10000   Male  :11888   Unknown        :  345  
 1st Qu.: 7501   1st Qu.:  50000   Female:18112   Graduate School:10585  
 Median :15000   Median : 140000                  University     :14030  
    MARRIAGE          AGE            PAY_1           PAY_2      
 Unknown:   54   Min.   :21.00   0      :14737   0      :15730  
 Married:13659   1st Qu.:28.00   -1     : 5686   -1     : 6050  
 Single :15964   Median :34.00   1      : 3688   2      : 3927  
     PAY_3           PAY_4           PAY_5           PAY_6      
 0      :15764   0      :16455   0      :16947   0      :16286  
 -1     : 5938   -1     : 5687   -1     : 5539   -1     : 5740  
 -2     : 4085   -2     : 4348   -2     : 4546   -2     : 4895  
   BILL_AMT1         BILL_AMT2        BILL_AMT3         BILL_AMT4      
 Min.   :-165580   Min.   :-69777   Min.   :-157264   Min.   :-170000  
 1st Qu.:   3559   1st Qu.:  2985   1st Qu.:   2666   1st Qu.:   2327  
 Median :  22382   Median : 21200   Median :  20088   Median :  19052  
   BILL_AMT5        BILL_AMT6          PAY_AMT1         PAY_AMT2      
 Min.   :-81334   Min.   :-339603   Min.   :     0   Min.   :      0  
 1st Qu.:  1763   1st Qu.:   1256   1st Qu.:  1000   1st Qu.:    833  
 Median : 18104   Median :  17071   Median :  2100   Median :   2009  
    PAY_AMT3         PAY_AMT4         PAY_AMT5           PAY_AMT6       
 Min.   :     0   Min.   :     0   Min.   :     0.0   Min.   :     0.0  
 1st Qu.:   390   1st Qu.:   296   1st Qu.:   252.5   1st Qu.:   117.8  
 Median :  1800   Median :  1500   Median :  1500.0   Median :  1500.0  
 default.payment.next.month
 Not Default:23364         
 Default    : 6636         
                           
 [ reached getOption("max.print") -- omitted 4 rows ]

The proportion among married vs single customer is quite balanced, which hopefully makes our analysis to be objective/non-biased towards this status

Female customer outweighs male customer in quantity. Does this mean female population are more prospective to become a lending customer?

The proportion of ‘default’ with ‘not default’ is roughly 1 : 4

EDA

Demography

Exploration of customer profile to look at how their Education, Age, and Marriage Status reflect on their limit balance and default status:

ggplot(data = credit, mapping = aes(x=AGE, y=LIMIT_BAL))+
  geom_point()+
  geom_smooth(aes(color = EDUCATION))+
  facet_wrap(~EDUCATION)

> If we roughly compare the population towards Education and Limit Balance, we can see that customer from “Graduate School (GS)” and “University graduates (Uni)” have similar limit balance pattern (with the former has slightly higher population). Our main customer from GS & Uni background has similar age bracket too, around mid-20s to 50s. And since our data doesn’t provide what “Others” Education is, we can ignore it and just focus on the 3 (GS, Uni, HS).

But how does education affect the default status of the customer?

ggplot(data = credit, mapping = aes(x=AGE, y=LIMIT_BAL))+
  geom_boxplot(aes(fill = EDUCATION))+
  facet_wrap(~default.payment.next.month)

From the graphic above, we see that the default customer profile comes from: 1. People with the Graduate School background around age 60s (they have the highest limit balance too–were they given this because they were deemed more ‘successful’ in their age?)
2. A lot of outliers come from customers with University & High School background. Were they given leniency/higher limit because they seem to be in a more ‘productive age’ to pay a loan?

On the other hand, if we analyze how the marriage status reflect on the limit balance and default status, it would be:

ggplot(data=credit, aes(x=MARRIAGE, y=LIMIT_BAL, color = default.payment.next.month))+
  geom_boxplot()

If we look at the default profile from marriage status, most default customers are married (and they were given higher limit balance too). I may assume that this is because they may have higher bill/monthly expenses than those who are singles, making them difficult in paying the loan. On the other hand, they also has the highest population of not default, making them a customer segment that has high risk high return characteristic.

Bill & Payment Each Month

Our data also specify the credit card bill and the actual payment received from customers for 6 months, which we can explore. However, we have to calculate the mean of both variables to make comparison.

# Calculate the mean of bill sent to customers each month
monthly_bill <- credit %>% 
  summarise(bill_1 = mean(BILL_AMT1),
            bill_2 = mean(BILL_AMT2),
            bill_3 = mean(BILL_AMT3),
            bill_4 = mean(BILL_AMT4),
            bill_5 = mean(BILL_AMT5),
            bill_6 = mean(BILL_AMT6))

monthly_bill <- as.data.frame((t(monthly_bill)))

# rename column
names(monthly_bill)[1] <- "avg.bill"

# rename row
row.names(monthly_bill) <- c("month_1","month_2","month_3","month_4","month_5","month_6")

monthly_bill

        avg.bill
month_1 51223.33
month_2 49179.08
month_3 47013.15
month_4 43262.95
month_5 40311.40
month_6 38871.76

By similar principles, calculate the average payment received from customers:

# Calculate the mean of payment received from customers each month
monthly_pay <- credit %>% 
  summarise(month_1 = mean(PAY_AMT1),
            month_2 = mean(PAY_AMT2),
            month_3 = mean(PAY_AMT3),
            month_4 = mean(PAY_AMT4),
            month_5 = mean(PAY_AMT5),
            month_6 = mean(PAY_AMT6))

monthly_pay <- as.data.frame((t(monthly_pay)))

# rename column
names(monthly_pay)[1] <- "avg.payment"

monthly_pay

        avg.payment
month_1    5663.581
month_2    5921.163
month_3    5225.681
month_4    4826.077
month_5    4799.388
month_6    5215.503

# combine both bill & payment data
compare <- cbind(monthly_bill, monthly_pay)

# convert rownames into colnames
library(data.table)
setDT(compare, keep.rownames = "month")
compare

This is a drastic comparison between billing amount and payment. To look at it visually:

# use `melt` to make change table formats
compare2 <- melt(compare, id.vars = "month")
compare2

      month    variable     value
 1: month_1    avg.bill 51223.331
 2: month_2    avg.bill 49179.075
 3: month_3    avg.bill 47013.155
 4: month_4    avg.bill 43262.949
 5: month_5    avg.bill 40311.401
 6: month_6    avg.bill 38871.760
 7: month_1 avg.payment  5663.581
 8: month_2 avg.payment  5921.163
 9: month_3 avg.payment  5225.681
10: month_4 avg.payment  4826.077
11: month_5 avg.payment  4799.388
12: month_6 avg.payment  5215.503

# create comparison graph
ggplot(compare2, aes(x=month, y=value, fill=variable))+
  geom_bar(stat = "identity", position = "dodge")

My assumption:
The bill is steadily decreased on following months, which makes sense because it was being paid within each month by the customer. The payment proportion is very small compared to the bill, which may be caused by the customer that chose longer installment term so the amount that they have to pay each month is small.

References

Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

Credit Card (Default) Clients Profile

Credit Card (Default) Clients Profile

Research Objectives

Read the Data

Data Preparation & EDA

Data Preparation

Data Preparation

Data Summary

EDA

Demography

Bill & Payment Each Month

References