Credit Card (Default) Clients Profile
Research Objectives
Future analysis from this data may be beneficial for lending institutions (bank & even non-banks, like “Go-Pay Later” or similar business), especially because this dataset, while quite outdated, comes from an actual Taiwan bank in 2005. It has basic data demography such as marriage status and education level, along with detailed bill & payments received from the customers each month.
This research is aimed to discuss about the following (but not limited to): 1. How does the probability of default payment vary by categories of different demographic variables?
2. Which variables are the strongest predictors of default payment?
Read the Data
Using a public dataset found in https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset, this dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.
'data.frame': 30000 obs. of 25 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ LIMIT_BAL : num 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
$ SEX : int 2 2 2 2 1 1 1 2 2 1 ...
$ EDUCATION : int 2 2 2 2 2 1 1 2 3 3 ...
$ MARRIAGE : int 1 2 2 1 1 2 2 2 1 2 ...
$ AGE : int 24 26 34 37 57 37 29 23 28 35 ...
$ PAY_0 : int 2 -1 0 0 -1 0 0 0 0 -2 ...
$ PAY_2 : int 2 2 0 0 0 0 0 -1 0 -2 ...
$ PAY_3 : int -1 0 0 0 -1 0 0 -1 2 -2 ...
$ PAY_4 : int -1 0 0 0 0 0 0 0 0 -2 ...
$ PAY_5 : int -2 0 0 0 0 0 0 0 0 -1 ...
$ PAY_6 : int -2 2 0 0 0 0 0 -1 0 -1 ...
$ BILL_AMT1 : num 3913 2682 29239 46990 8617 ...
$ BILL_AMT2 : num 3102 1725 14027 48233 5670 ...
$ BILL_AMT3 : num 689 2682 13559 49291 35835 ...
$ BILL_AMT4 : num 0 3272 14331 28314 20940 ...
$ BILL_AMT5 : num 0 3455 14948 28959 19146 ...
$ BILL_AMT6 : num 0 3261 15549 29547 19131 ...
$ PAY_AMT1 : num 0 0 1518 2000 2000 ...
$ PAY_AMT2 : num 689 1000 1500 2019 36681 ...
$ PAY_AMT3 : num 0 1000 1000 1200 10000 657 38000 0 432 0 ...
$ PAY_AMT4 : num 0 1000 1000 1100 9000 ...
$ PAY_AMT5 : num 0 0 1000 1069 689 ...
$ PAY_AMT6 : num 0 2000 5000 1000 679 ...
$ default.payment.next.month: int 1 1 0 0 0 0 0 0 0 0 ...
Here are the definition of each variables: - ID : ID of each client
- LIMIT_BAL : Amount of given credit in NT dollars (includes individual and family/supplementary credit
- SEX : Gender (1=male, 2=female)
- EDUCATION : (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
- MARRIAGE : Marital status (1=married, 2=single, 3=others)
- AGE : Age in years
- PAY_0 : Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- PAY_2 : Repayment status in August, 2005 (scale same as above)
- PAY_3 : Repayment status in July, 2005 (scale same as above)
- PAY_4 : Repayment status in June, 2005 (scale same as above)
- PAY_5 : Repayment status in May, 2005 (scale same as above)
- PAY_6 : Repayment status in April, 2005 (scale same as above)
- BILL_AMT1 : Amount of bill statement in September, 2005 (NT dollar)
- BILL_AMT2 : Amount of bill statement in August, 2005 (NT dollar)
- BILL_AMT3 : Amount of bill statement in July, 2005 (NT dollar)
- BILL_AMT4 : Amount of bill statement in June, 2005 (NT dollar)
- BILL_AMT5 : Amount of bill statement in May, 2005 (NT dollar)
- BILL_AMT6 : Amount of bill statement in April, 2005 (NT dollar)
- PAY_AMT1 : Amount of previous payment in September, 2005 (NT dollar)
- PAY_AMT2 : Amount of previous payment in August, 2005 (NT dollar)
- PAY_AMT3 : Amount of previous payment in July, 2005 (NT dollar)
- PAY_AMT4 : Amount of previous payment in June, 2005 (NT dollar)
- PAY_AMT5 : Amount of previous payment in May, 2005 (NT dollar)
- PAY_AMT6 : Amount of previous payment in April, 2005 (NT dollar)
- default.payment.next.month : Default payment (1=yes, 0=no)
Data Preparation & EDA
Data Preparation
Data Preparation
Data preparation step consists of re-naming variables, changing their class, and assign the appropriate levels.
# Renaming PAY_0 into PAY_1 so it won't suddenly jump into PAY_2
colnames(credit)[colnames(credit) == "PAY_0"] = "PAY_1"
# check Null/NA value
sum(is.na(credit)) # no missing value![1] 0
# change class SEX into factor & assign levels
credit$SEX <- as.factor(credit$SEX)
levels(credit$SEX) <- c("Male", "Female")
# change class EDUCATION into factor & assign levels
credit$EDUCATION <- as.factor(credit$EDUCATION)
levels(credit$EDUCATION) <- c("Unknown", "Graduate School", "University", "High school", "Others", "Unknown", "Unknown") #including 0 because there are actually 7 levels, not 6
# change MARRIAGE into factor & assign levels
credit$MARRIAGE <- as.factor(credit$MARRIAGE)
levels(credit$MARRIAGE) <- c("Unknown" , "Married" , "Single" ,"Others") #including 0 because there are 4 levels, not 3
# change default.payment.next.month into factor & assign levels
credit$default.payment.next.month <- as.factor(credit$default.payment.next.month)
levels(credit$default.payment.next.month) <- c("Not Default", "Default")Data Summary
Before we look into comparison from each variables in EDA step, here are some highlights from the data:
ID LIMIT_BAL SEX EDUCATION
Min. : 1 Min. : 10000 Male :11888 Unknown : 345
1st Qu.: 7501 1st Qu.: 50000 Female:18112 Graduate School:10585
Median :15000 Median : 140000 University :14030
MARRIAGE AGE PAY_1 PAY_2
Unknown: 54 Min. :21.00 0 :14737 0 :15730
Married:13659 1st Qu.:28.00 -1 : 5686 -1 : 6050
Single :15964 Median :34.00 1 : 3688 2 : 3927
PAY_3 PAY_4 PAY_5 PAY_6
0 :15764 0 :16455 0 :16947 0 :16286
-1 : 5938 -1 : 5687 -1 : 5539 -1 : 5740
-2 : 4085 -2 : 4348 -2 : 4546 -2 : 4895
BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4
Min. :-165580 Min. :-69777 Min. :-157264 Min. :-170000
1st Qu.: 3559 1st Qu.: 2985 1st Qu.: 2666 1st Qu.: 2327
Median : 22382 Median : 21200 Median : 20088 Median : 19052
BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
Min. :-81334 Min. :-339603 Min. : 0 Min. : 0
1st Qu.: 1763 1st Qu.: 1256 1st Qu.: 1000 1st Qu.: 833
Median : 18104 Median : 17071 Median : 2100 Median : 2009
PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.0
1st Qu.: 390 1st Qu.: 296 1st Qu.: 252.5 1st Qu.: 117.8
Median : 1800 Median : 1500 Median : 1500.0 Median : 1500.0
default.payment.next.month
Not Default:23364
Default : 6636
[ reached getOption("max.print") -- omitted 4 rows ]
- The proportion among married vs single customer is quite balanced, which hopefully makes our analysis to be objective/non-biased towards this status
- Female customer outweighs male customer in quantity. Does this mean female population are more prospective to become a lending customer?
- The proportion of ‘default’ with ‘not default’ is roughly
1 : 4
EDA
Demography
Exploration of customer profile to look at how their Education, Age, and Marriage Status reflect on their limit balance and default status:
ggplot(data = credit, mapping = aes(x=AGE, y=LIMIT_BAL))+
geom_point()+
geom_smooth(aes(color = EDUCATION))+
facet_wrap(~EDUCATION) > If we roughly compare the population towards
Education and Limit Balance, we can see that customer from “Graduate School (GS)” and “University graduates (Uni)” have similar limit balance pattern (with the former has slightly higher population). Our main customer from GS & Uni background has similar age bracket too, around mid-20s to 50s. And since our data doesn’t provide what “Others” Education is, we can ignore it and just focus on the 3 (GS, Uni, HS).
But how does education affect the default status of the customer?
ggplot(data = credit, mapping = aes(x=AGE, y=LIMIT_BAL))+
geom_boxplot(aes(fill = EDUCATION))+
facet_wrap(~default.payment.next.month)From the graphic above, we see that the
defaultcustomer profile comes from: 1. People with theGraduate Schoolbackground aroundage60s (they have the highest limit balance too–were they given this because they were deemed more ‘successful’ in their age?)
2. A lot of outliers come from customers withUniversity&High Schoolbackground. Were they given leniency/higher limit because they seem to be in a more ‘productive age’ to pay a loan?
On the other hand, if we analyze how the marriage status reflect on the limit balance and default status, it would be:
ggplot(data=credit, aes(x=MARRIAGE, y=LIMIT_BAL, color = default.payment.next.month))+
geom_boxplot()If we look at the default profile from marriage status, most default customers are
married(and they were given higher limit balance too). I may assume that this is because they may have higher bill/monthly expenses than those who are singles, making them difficult in paying the loan. On the other hand, they also has the highest population ofnot default, making them a customer segment that hashigh risk high returncharacteristic.
Bill & Payment Each Month
Our data also specify the credit card bill and the actual payment received from customers for 6 months, which we can explore. However, we have to calculate the mean of both variables to make comparison.
# Calculate the mean of bill sent to customers each month
monthly_bill <- credit %>%
summarise(bill_1 = mean(BILL_AMT1),
bill_2 = mean(BILL_AMT2),
bill_3 = mean(BILL_AMT3),
bill_4 = mean(BILL_AMT4),
bill_5 = mean(BILL_AMT5),
bill_6 = mean(BILL_AMT6))
monthly_bill <- as.data.frame((t(monthly_bill)))
# rename column
names(monthly_bill)[1] <- "avg.bill"
# rename row
row.names(monthly_bill) <- c("month_1","month_2","month_3","month_4","month_5","month_6")
monthly_bill avg.bill
month_1 51223.33
month_2 49179.08
month_3 47013.15
month_4 43262.95
month_5 40311.40
month_6 38871.76
By similar principles, calculate the average payment received from customers:
# Calculate the mean of payment received from customers each month
monthly_pay <- credit %>%
summarise(month_1 = mean(PAY_AMT1),
month_2 = mean(PAY_AMT2),
month_3 = mean(PAY_AMT3),
month_4 = mean(PAY_AMT4),
month_5 = mean(PAY_AMT5),
month_6 = mean(PAY_AMT6))
monthly_pay <- as.data.frame((t(monthly_pay)))
# rename column
names(monthly_pay)[1] <- "avg.payment"
monthly_pay avg.payment
month_1 5663.581
month_2 5921.163
month_3 5225.681
month_4 4826.077
month_5 4799.388
month_6 5215.503
# combine both bill & payment data
compare <- cbind(monthly_bill, monthly_pay)
# convert rownames into colnames
library(data.table)
setDT(compare, keep.rownames = "month")
compareThis is a drastic comparison between billing amount and payment. To look at it visually:
month variable value
1: month_1 avg.bill 51223.331
2: month_2 avg.bill 49179.075
3: month_3 avg.bill 47013.155
4: month_4 avg.bill 43262.949
5: month_5 avg.bill 40311.401
6: month_6 avg.bill 38871.760
7: month_1 avg.payment 5663.581
8: month_2 avg.payment 5921.163
9: month_3 avg.payment 5225.681
10: month_4 avg.payment 4826.077
11: month_5 avg.payment 4799.388
12: month_6 avg.payment 5215.503
# create comparison graph
ggplot(compare2, aes(x=month, y=value, fill=variable))+
geom_bar(stat = "identity", position = "dodge")My assumption:
The bill is steadily decreased on following months, which makes sense because it was being paid within each month by the customer. The payment proportion is very small compared to the bill, which may be caused by the customer that chose longer installment term so the amount that they have to pay each month is small.