The purpose of this task is to build a binary classification model that predicts the probability that a customer will default on next month’s credit card repayment. We also wanted the resulting model to identify the key drivers of this probability, so that it can flag the customers who are most likely to default.
To find the best predictive accuracy, we explored several machine learning algorithms: Random Forest, Logistic Regression, Support Vector Machines (SVM) and Gradient Boosting.
This report describes the approach and the modelling considerations used to estimate the probability of default with Gradient Boosting as the machine learning algorithm.
The data is based on the publicly available UCI Machine Learning Repository credit card data set. It contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. The data set covers 30,000 customers, with 16 predictor variables containing customer information and an outcome variable, default.
Based on the assignment brief, we were aware that the relationship between the default target and the predictors had been distorted in the new dataset. The variables describing the history of payment statuses had been reduced by applying Principal Component Analysis.
cc_default_train <- read.csv("AT2_credit_train_STUDENT.csv", na.strings = c("", " ", NA, "NA"))
str(cc_default_train)
## 'data.frame': 23101 obs. of 17 variables:
## $ ID : int 1 2 3 4 6 7 8 9 10 11 ...
## $ LIMIT_BAL: num 20000 120000 90000 50000 50000 500000 100000 140000 20000 200000 ...
## $ SEX : Factor w/ 5 levels "1","2","cat",..: 2 2 2 2 1 1 2 2 1 2 ...
## $ EDUCATION: int 2 2 2 2 1 1 2 3 3 3 ...
## $ MARRIAGE : int 1 2 2 1 2 2 2 1 2 2 ...
## $ AGE : int 24 26 34 37 37 29 23 28 35 34 ...
## $ PAY_PC1 : num 0.477 -1.462 -0.393 -0.393 -0.393 ...
## $ PAY_PC2 : num -3.225 0.854 0.176 0.176 0.176 ...
## $ PAY_PC3 : num 0.14504 -0.36086 0.00489 0.00489 0.00489 ...
## $ AMT_PC1 : num -1.752 -1.663 -1.135 -0.397 -0.393 ...
## $ AMT_PC2 : num -0.224 -0.144 -0.177 -0.451 -0.5 ...
## $ AMT_PC3 : num -0.0778 -0.0546 0.016 -0.0998 -0.1033 ...
## $ AMT_PC4 : num 0.00696 -0.00285 -0.12907 -0.03534 -0.1179 ...
## $ AMT_PC5 : num -0.0414 0.0439 0.0982 -0.0553 -0.0546 ...
## $ AMT_PC6 : num 0.000887 -0.02619 -0.022383 0.050465 0.112137 ...
## $ AMT_PC7 : num -0.0563 -0.1 -0.069 -0.0282 0.0186 ...
## $ default : Factor w/ 2 levels "N","Y": 2 2 1 1 1 1 1 2 1 1 ...
summary(cc_default_train)
## ID LIMIT_BAL SEX EDUCATION
## Min. : 1 Min. : -99 1 : 9244 Min. :0.000
## 1st Qu.: 7489 1st Qu.: 50000 2 :13854 1st Qu.:1.000
## Median :14987 Median : 140000 cat : 1 Median :2.000
## Mean :14981 Mean : 167524 dog : 1 Mean :1.853
## 3rd Qu.:22452 3rd Qu.: 240000 dolphin: 1 3rd Qu.:2.000
## Max. :30000 Max. :1000000 Max. :6.000
## MARRIAGE AGE PAY_PC1 PAY_PC2
## Min. :0.000 Min. : 21.0 Min. :-11.859675 Min. :-4.42243
## 1st Qu.:1.000 1st Qu.: 28.0 1st Qu.: -0.393308 1st Qu.:-0.23617
## Median :2.000 Median : 34.0 Median : -0.393308 Median : 0.17555
## Mean :1.553 Mean : 35.7 Mean : -0.001656 Mean :-0.00177
## 3rd Qu.:2.000 3rd Qu.: 41.0 3rd Qu.: 1.360047 3rd Qu.: 0.36112
## Max. :3.000 Max. :141.0 Max. : 3.813348 Max. : 5.44103
## PAY_PC3 AMT_PC1 AMT_PC2
## Min. :-3.864638 Min. :-3.41080 Min. :-4.71769
## 1st Qu.:-0.283941 1st Qu.:-1.50827 1st Qu.:-0.42961
## Median : 0.004886 Median :-0.86433 Median :-0.20780
## Mean : 0.000652 Mean : 0.00461 Mean : 0.00137
## 3rd Qu.: 0.093942 3rd Qu.: 0.49766 3rd Qu.: 0.09062
## Max. : 3.364030 Max. :37.49240 Max. :83.52137
## AMT_PC3 AMT_PC4 AMT_PC5
## Min. :-38.46500 Min. :-21.593416 Min. :-42.37665
## 1st Qu.: -0.13710 1st Qu.: -0.068199 1st Qu.: -0.08239
## Median : -0.07044 Median : 0.018389 Median : -0.03200
## Mean : 0.00383 Mean : 0.004618 Mean : 0.00148
## 3rd Qu.: 0.00325 3rd Qu.: 0.083236 3rd Qu.: 0.02644
## Max. : 21.98483 Max. : 21.823749 Max. : 17.43097
## AMT_PC6 AMT_PC7 default
## Min. :-38.88504 Min. :-41.71546 N:17518
## 1st Qu.: -0.04241 1st Qu.: -0.09273 Y: 5583
## Median : -0.00216 Median : -0.04099
## Mean : -0.00202 Mean : -0.00409
## 3rd Qu.: 0.06754 3rd Qu.: 0.03157
## Max. : 20.22670 Max. : 22.92727
head(cc_default_train)
## ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_PC1 PAY_PC2
## 1 1 20000 2 2 1 24 0.4774630 -3.2245900
## 2 2 120000 2 2 2 26 -1.4616124 0.8538666
## 3 3 90000 2 2 2 34 -0.3933076 0.1755550
## 4 4 50000 2 2 1 37 -0.3933076 0.1755550
## 5 6 50000 1 1 2 37 -0.3933076 0.1755550
## 6 7 500000 1 1 2 29 -0.3933076 0.1755550
## PAY_PC3 AMT_PC1 AMT_PC2 AMT_PC3 AMT_PC4 AMT_PC5
## 1 0.145040802 -1.7522077 -0.2243476 -0.07784106 0.006957244 -0.04135696
## 2 -0.360863449 -1.6633432 -0.1438856 -0.05460040 -0.002851947 0.04388912
## 3 0.004885522 -1.1348380 -0.1766087 0.01595485 -0.129071306 0.09824528
## 4 0.004885522 -0.3971748 -0.4510978 -0.09978950 -0.035338969 -0.05530658
## 5 0.004885522 -0.3927966 -0.5002107 -0.10334556 -0.117902012 -0.05457605
## 6 0.004885522 15.7185478 -0.6627291 -1.62093162 0.486746050 -0.47437084
## AMT_PC6 AMT_PC7 default
## 1 0.0008865935 -0.05626505 Y
## 2 -0.0261897987 -0.09997756 Y
## 3 -0.0223825102 -0.06898686 N
## 4 0.0504654915 -0.02820475 N
## 5 0.1121369503 0.01863707 N
## 6 -0.2846417994 0.70158577 N
cc_default_validation <- read.csv("AT2_credit_test_STUDENT.csv", na.strings = c("", " ", NA, "NA"))
On loading the data into R, we noticed the following characteristics at first sight:
- There were 17 variables with 23,101 observations.
- All variables were numeric (integer or double) except for SEX and default, which were read as factors.
- SEX was expected to have 1 for Male and 2 for Female but had classifications such as cat, dog and dolphin.
- Although the data dictionary had classified MARRIAGE as 1 for Married, 2 for Single and 3 for Others, there was an additional classification as 0.
- EDUCATION took seven distinct integer values, from 0 to 6. These were later grouped into named categories, with the value 0 classified as Unknown.
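The checks behind these observations can be sketched in base R. This is an illustration only: `toy` is a hypothetical stand-in for `cc_default_train`, and the sets of valid codes (SEX in 1 to 2, MARRIAGE in 1 to 3) are assumptions taken from the data dictionary values quoted above.

```r
# Hypothetical stand-in for cc_default_train (values chosen for illustration only).
toy <- data.frame(
  SEX       = factor(c("1", "2", "cat", "2", "dog")),
  MARRIAGE  = c(1L, 2L, 0L, 3L, 2L),
  EDUCATION = c(2L, 0L, 6L, 1L, 3L)
)

# Frequency tables expose levels that the data dictionary does not describe.
table(toy$SEX, useNA = "ifany")
table(toy$EDUCATION, useNA = "ifany")

# Flag rows whose codes fall outside the documented sets (assumed sets, see above).
bad_sex      <- !(toy$SEX %in% c("1", "2"))
bad_marriage <- !(toy$MARRIAGE %in% 1:3)

# Rows that need attention before modelling.
suspect_rows <- toy[bad_sex | bad_marriage, ]
suspect_rows
```

Tabulating first and filtering second keeps the inspection step separate from any decision about how to recode the unexpected values.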
The plots below represent the distribution of each of the predictor variables.
## # A tibble: 3 x 3
## SEX limit_bal count
## <fct> <dbl> <int>
## 1 cat 500000 1
## 2 dog 20000 1
## 3 dolphin 150000 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.0 28.0 34.0 35.7 41.0 141.0
Unexpected values were also noticed in SEX and MARRIAGE. To fix these inconsistencies, data cleansing was done on the variables SEX, EDUCATION and MARRIAGE by recoding them as factors.
- SEX was categorised as Male, Female and Others.
- MARRIAGE was categorised as Married, Single and Others.
- EDUCATION was categorised as High School, University, Graduate School, Others and Unknown.
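A minimal sketch of this kind of recoding, using named lookup vectors in base R, is given here. The data frame `raw` is hypothetical. The mappings for SEX codes 1 and 2 and EDUCATION codes 1 to 3 agree with the factor levels visible in the report's output; treating EDUCATION code 4 as Others and every other code as Unknown is an assumption, not something the report states.

```r
# Hypothetical raw codes; in the report the columns are SEX and EDUCATION.
raw <- data.frame(
  SEX       = c("1", "2", "cat", "2"),
  EDUCATION = c(1L, 2L, 3L, 0L),
  stringsAsFactors = FALSE
)

# Look up documented codes; codes without an entry come back as NA.
sex_labels <- c("1" = "Male", "2" = "Female")
sex_h <- unname(sex_labels[raw$SEX])
sex_h[is.na(sex_h)] <- "Others"          # e.g. "cat", "dog", "dolphin"
raw$SEX_H <- factor(sex_h, levels = c("Female", "Male", "Others"))

# Assumed mapping for EDUCATION: 4 = Others, anything undocumented = Unknown.
edu_labels <- c("1" = "Graduate School", "2" = "University",
                "3" = "High School",     "4" = "Others")
edu_h <- unname(edu_labels[as.character(raw$EDUCATION)])
edu_h[is.na(edu_h)] <- "Unknown"
raw$EDUCATION_H <- factor(
  edu_h,
  levels = c("Graduate School", "High School", "Others", "University", "Unknown")
)

raw
```

Writing the result to new `_H` columns, as the report's later output suggests was done, leaves the original codes available for auditing.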
Please see below the plots for each of the variables after transformation.
Histograms for each of the variables after transformation.
Looking for any Correlations between variables.
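As seen from the above graph, no notable correlations were found. This is expected, since Principal Component Analysis had already been applied to the historical payment statuses and principal components are uncorrelated with one another by construction.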
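The reason principal components show no correlation can be illustrated on simulated data. This is a sketch under stated assumptions: the matrix `x` is made up, and `prcomp` is used here only for demonstration, since the report does not say how the supplied components were computed. Note that the guarantee holds only among components from the same PCA on the same rows; components from different variable groups, or a component and a raw variable such as LIMIT_BAL, can still be correlated.

```r
set.seed(1)

# Simulated inputs where the first two columns are clearly correlated.
x <- matrix(rnorm(200 * 3), ncol = 3)
x[, 2] <- x[, 1] + 0.5 * x[, 2]
round(cor(x), 3)

# Principal component scores from a centred, scaled PCA.
pcs <- prcomp(x, center = TRUE, scale. = TRUE)$x

# Off-diagonal correlations between the scores are zero up to floating-point error.
pc_cor <- cor(pcs)
round(pc_cor, 3)
```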
The dataset was partitioned 80/20: 80% of the data was allocated for training the model and 20% was held out as testing data, so that the model could be tested against data it had not yet seen. This gives an indication of how well the finalized model will perform on out-of-sample data, and helps check that its predictions remain consistent when it encounters a new dataset.
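One way to produce such a partition is a stratified split, sketched here in base R on a hypothetical data frame `toy`. The report does not state which function was used, so this is an assumption; it is, however, consistent with the class counts reported after the split, where 14015 and 4467 equal 80% of 17518 and 5583 rounded up.

```r
set.seed(42)

# Hypothetical data with roughly the same class imbalance as the training file.
toy <- data.frame(
  x       = rnorm(1000),
  default = factor(sample(c("N", "Y"), 1000, replace = TRUE, prob = c(0.76, 0.24)))
)

# Draw 80% of the row indices within each class so both parts keep the class ratio.
idx_by_class <- split(seq_len(nrow(toy)), toy$default)
train_idx <- unlist(
  lapply(idx_by_class, function(i) i[sample.int(length(i), ceiling(0.8 * length(i)))]),
  use.names = FALSE
)

train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

dim(train_set)
dim(test_set)
table(train_set$default)
```

Fixing the seed makes the split reproducible, which matters when several candidate models are compared on the same held-out rows.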
## [1] 18482 21
## [1] 4619 21
## 'data.frame': 18482 obs. of 21 variables:
## $ ID : int 1 3 4 6 8 9 10 12 13 14 ...
## $ LIMIT_BAL : num 20000 90000 50000 50000 100000 140000 20000 260000 630000 70000 ...
## $ SEX : Factor w/ 5 levels "1","2","cat",..: 2 2 2 1 2 2 1 2 2 1 ...
## $ EDUCATION : int 2 2 2 1 2 3 3 1 2 2 ...
## $ MARRIAGE : int 1 2 1 2 2 1 2 2 2 2 ...
## $ AGE : int 24 34 37 37 23 28 35 51 41 30 ...
## $ PAY_PC1 : num 0.477 -0.393 -0.393 -0.393 0.651 ...
## $ PAY_PC2 : num -3.225 0.176 0.176 0.176 0.215 ...
## $ PAY_PC3 : num 0.14504 0.00489 0.00489 0.00489 0.40055 ...
## $ AMT_PC1 : num -1.752 -1.135 -0.397 -0.393 -1.695 ...
## $ AMT_PC2 : num -0.224 -0.177 -0.451 -0.5 -0.154 ...
## $ AMT_PC3 : num -0.0778 0.016 -0.0998 -0.1033 0.0387 ...
## $ AMT_PC4 : num 0.00696 -0.12907 -0.03534 -0.1179 -0.02375 ...
## $ AMT_PC5 : num -0.0414 0.0982 -0.0553 -0.0546 -0.0262 ...
## $ AMT_PC6 : num 0.000887 -0.022383 0.050465 0.112137 0.003636 ...
## $ AMT_PC7 : num -0.0563 -0.069 -0.0282 0.0186 -0.0398 ...
## $ default : Factor w/ 2 levels "N","Y": 2 1 1 1 1 2 1 1 1 2 ...
## $ SEX_H : Factor w/ 3 levels "Female","Male",..: 1 1 1 2 1 1 2 1 1 2 ...
## $ MARRIAGE_H : Factor w/ 4 levels "Married","Others",..: 1 3 1 3 3 1 3 3 3 3 ...
## $ EDUCATION_H: Factor w/ 5 levels "Graduate School",..: 4 4 4 1 4 2 2 1 4 4 ...
## $ AGE_H : Factor w/ 8 levels "(10,20]","(20,30]",..: 2 3 3 3 2 2 3 5 4 2 ...
## # A tibble: 2 x 2
## default cnt
## <fct> <int>
## 1 N 14015
## 2 Y 4467
As mentioned in the introduction, various machine learning algorithms such as Random Forest, Gradient Boosting, Support Vector Machines and Logistic Regression were tried for predicting the likelihood of a credit card default.
Following multiple iterations of feature engineering and trials of the available modelling techniques, we as a team decided to use Gradient Boosting as the final model for predicting credit card default. The model was trained with all variables except ID and SEX.
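For orientation, a gradient boosting fit of this kind might look as follows with the `gbm` package. This is a sketch, not the team's actual code: the choice of package, the simulated data frame `toy`, the subset of predictors and all hyperparameter values are assumptions made for illustration.

```r
# Assumes the gbm package is installed; the report does not name the package used.
library(gbm)

set.seed(123)
n <- 500

# Simulated stand-in for the training partition (ID and SEX deliberately absent).
toy <- data.frame(
  LIMIT_BAL  = runif(n, 10000, 500000),
  PAY_PC1    = rnorm(n),
  MARRIAGE_H = factor(sample(c("Married", "Single", "Others"), n, replace = TRUE))
)

# gbm's bernoulli loss expects a 0/1 numeric response rather than a factor.
toy$default01 <- rbinom(n, 1, plogis(-1 + 1.2 * toy$PAY_PC1))

fit <- gbm(
  default01 ~ LIMIT_BAL + PAY_PC1 + MARRIAGE_H,
  data = toy,
  distribution = "bernoulli",
  n.trees = 200,            # placeholder values, not tuned
  interaction.depth = 2,
  shrinkage = 0.05,
  verbose = FALSE
)

# Predicted probability of default for each row.
p_default <- predict(fit, newdata = toy, n.trees = 200, type = "response")

# Relative influence of each predictor, one way to look at the key drivers.
influence <- summary(fit, plotit = FALSE)
influence
```

The relative influence table is one way to address the goal of identifying which variables drive the predicted probability.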
The key determining factors (Accuracy, Sensitivity and Specificity) were evaluated for each model and, as called out in the introduction, the best-performing model was Gradient Boosting, which forms the basis for this report. The sections below explain the approach and the modelling done using the Gradient Boosting algorithm.
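The three metrics can be computed from a confusion matrix as sketched here. The labels and probabilities are made up, and the 0.5 cut-off is an assumption; "Y" (default) is treated as the positive class.

```r
# Made-up actual labels and predicted default probabilities.
actual    <- factor(c("Y", "N", "N", "Y", "N", "Y", "N", "N"), levels = c("N", "Y"))
p_default <- c(0.81, 0.20, 0.55, 0.35, 0.10, 0.72, 0.30, 0.05)

# Turn probabilities into class labels using an assumed 0.5 threshold.
predicted <- factor(ifelse(p_default >= 0.5, "Y", "N"), levels = c("N", "Y"))

# Confusion matrix: rows are predictions, columns are actual outcomes.
cm <- table(Predicted = predicted, Actual = actual)

tp <- cm["Y", "Y"]; tn <- cm["N", "N"]
fp <- cm["Y", "N"]; fn <- cm["N", "Y"]

accuracy    <- as.numeric((tp + tn) / sum(cm))   # share of all predictions that are correct
sensitivity <- as.numeric(tp / (tp + fn))        # share of actual defaulters that are caught
specificity <- as.numeric(tn / (tn + fp))        # share of non-defaulters correctly cleared

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

With roughly three non-defaulters for every defaulter in the training data, accuracy alone can look high for a model that rarely predicts default, which is why sensitivity and specificity are reported alongside it.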