INTRODUCTION

The purpose of this task is to build a binary classification model that predicts the probability that a customer will default on next month’s credit card repayment. We also wanted the resulting model to identify the key drivers of this probability, so that it can flag the customers who are most likely to default.

To achieve the best predictive accuracy, several machine learning algorithms were explored: Random Forest, Logistic Regression, Support Vector Machines (SVM) and Gradient Boosting.

This report describes the approach and modelling considerations used to arrive at the estimated probability of default, with Gradient Boosting as the machine learning algorithm.

EXPLORATORY DATA ANALYSIS

The data is based on the publicly available UCI Machine Learning Repository credit card data set. It contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. The data set contains information on 30,000 customers, with 16 predictor variables containing customer information and an outcome variable, default.

Based on the assignment brief, we were aware that the relationship between the default target and the predictors had been distorted in the new dataset. The variables pertaining to the history of payment statuses had been reduced by applying Principal Component Analysis (PCA).

DATA PREPARATION

Import the Data into R as a dataframe.

cc_default_train <- read.csv("AT2_credit_train_STUDENT.csv", na.strings = c("", " ", NA, "NA"))

str(cc_default_train)
## 'data.frame':    23101 obs. of  17 variables:
##  $ ID       : int  1 2 3 4 6 7 8 9 10 11 ...
##  $ LIMIT_BAL: num  20000 120000 90000 50000 50000 500000 100000 140000 20000 200000 ...
##  $ SEX      : Factor w/ 5 levels "1","2","cat",..: 2 2 2 2 1 1 2 2 1 2 ...
##  $ EDUCATION: int  2 2 2 2 1 1 2 3 3 3 ...
##  $ MARRIAGE : int  1 2 2 1 2 2 2 1 2 2 ...
##  $ AGE      : int  24 26 34 37 37 29 23 28 35 34 ...
##  $ PAY_PC1  : num  0.477 -1.462 -0.393 -0.393 -0.393 ...
##  $ PAY_PC2  : num  -3.225 0.854 0.176 0.176 0.176 ...
##  $ PAY_PC3  : num  0.14504 -0.36086 0.00489 0.00489 0.00489 ...
##  $ AMT_PC1  : num  -1.752 -1.663 -1.135 -0.397 -0.393 ...
##  $ AMT_PC2  : num  -0.224 -0.144 -0.177 -0.451 -0.5 ...
##  $ AMT_PC3  : num  -0.0778 -0.0546 0.016 -0.0998 -0.1033 ...
##  $ AMT_PC4  : num  0.00696 -0.00285 -0.12907 -0.03534 -0.1179 ...
##  $ AMT_PC5  : num  -0.0414 0.0439 0.0982 -0.0553 -0.0546 ...
##  $ AMT_PC6  : num  0.000887 -0.02619 -0.022383 0.050465 0.112137 ...
##  $ AMT_PC7  : num  -0.0563 -0.1 -0.069 -0.0282 0.0186 ...
##  $ default  : Factor w/ 2 levels "N","Y": 2 2 1 1 1 1 1 2 1 1 ...
summary(cc_default_train)
##        ID          LIMIT_BAL            SEX          EDUCATION    
##  Min.   :    1   Min.   :    -99   1      : 9244   Min.   :0.000  
##  1st Qu.: 7489   1st Qu.:  50000   2      :13854   1st Qu.:1.000  
##  Median :14987   Median : 140000   cat    :    1   Median :2.000  
##  Mean   :14981   Mean   : 167524   dog    :    1   Mean   :1.853  
##  3rd Qu.:22452   3rd Qu.: 240000   dolphin:    1   3rd Qu.:2.000  
##  Max.   :30000   Max.   :1000000                   Max.   :6.000  
##     MARRIAGE          AGE           PAY_PC1              PAY_PC2        
##  Min.   :0.000   Min.   : 21.0   Min.   :-11.859675   Min.   :-4.42243  
##  1st Qu.:1.000   1st Qu.: 28.0   1st Qu.: -0.393308   1st Qu.:-0.23617  
##  Median :2.000   Median : 34.0   Median : -0.393308   Median : 0.17555  
##  Mean   :1.553   Mean   : 35.7   Mean   : -0.001656   Mean   :-0.00177  
##  3rd Qu.:2.000   3rd Qu.: 41.0   3rd Qu.:  1.360047   3rd Qu.: 0.36112  
##  Max.   :3.000   Max.   :141.0   Max.   :  3.813348   Max.   : 5.44103  
##     PAY_PC3             AMT_PC1            AMT_PC2        
##  Min.   :-3.864638   Min.   :-3.41080   Min.   :-4.71769  
##  1st Qu.:-0.283941   1st Qu.:-1.50827   1st Qu.:-0.42961  
##  Median : 0.004886   Median :-0.86433   Median :-0.20780  
##  Mean   : 0.000652   Mean   : 0.00461   Mean   : 0.00137  
##  3rd Qu.: 0.093942   3rd Qu.: 0.49766   3rd Qu.: 0.09062  
##  Max.   : 3.364030   Max.   :37.49240   Max.   :83.52137  
##     AMT_PC3             AMT_PC4              AMT_PC5         
##  Min.   :-38.46500   Min.   :-21.593416   Min.   :-42.37665  
##  1st Qu.: -0.13710   1st Qu.: -0.068199   1st Qu.: -0.08239  
##  Median : -0.07044   Median :  0.018389   Median : -0.03200  
##  Mean   :  0.00383   Mean   :  0.004618   Mean   :  0.00148  
##  3rd Qu.:  0.00325   3rd Qu.:  0.083236   3rd Qu.:  0.02644  
##  Max.   : 21.98483   Max.   : 21.823749   Max.   : 17.43097  
##     AMT_PC6             AMT_PC7          default  
##  Min.   :-38.88504   Min.   :-41.71546   N:17518  
##  1st Qu.: -0.04241   1st Qu.: -0.09273   Y: 5583  
##  Median : -0.00216   Median : -0.04099            
##  Mean   : -0.00202   Mean   : -0.00409            
##  3rd Qu.:  0.06754   3rd Qu.:  0.03157            
##  Max.   : 20.22670   Max.   : 22.92727
head(cc_default_train)
##   ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE    PAY_PC1    PAY_PC2
## 1  1     20000   2         2        1  24  0.4774630 -3.2245900
## 2  2    120000   2         2        2  26 -1.4616124  0.8538666
## 3  3     90000   2         2        2  34 -0.3933076  0.1755550
## 4  4     50000   2         2        1  37 -0.3933076  0.1755550
## 5  6     50000   1         1        2  37 -0.3933076  0.1755550
## 6  7    500000   1         1        2  29 -0.3933076  0.1755550
##        PAY_PC3    AMT_PC1    AMT_PC2     AMT_PC3      AMT_PC4     AMT_PC5
## 1  0.145040802 -1.7522077 -0.2243476 -0.07784106  0.006957244 -0.04135696
## 2 -0.360863449 -1.6633432 -0.1438856 -0.05460040 -0.002851947  0.04388912
## 3  0.004885522 -1.1348380 -0.1766087  0.01595485 -0.129071306  0.09824528
## 4  0.004885522 -0.3971748 -0.4510978 -0.09978950 -0.035338969 -0.05530658
## 5  0.004885522 -0.3927966 -0.5002107 -0.10334556 -0.117902012 -0.05457605
## 6  0.004885522 15.7185478 -0.6627291 -1.62093162  0.486746050 -0.47437084
##         AMT_PC6     AMT_PC7 default
## 1  0.0008865935 -0.05626505       Y
## 2 -0.0261897987 -0.09997756       Y
## 3 -0.0223825102 -0.06898686       N
## 4  0.0504654915 -0.02820475       N
## 5  0.1121369503  0.01863707       N
## 6 -0.2846417994  0.70158577       N
cc_default_validation <- read.csv("AT2_credit_test_STUDENT.csv", na.strings = c("", " ", NA, "NA"))

On loading it into R, we noticed the following characteristics about the Data at first sight:

  • There were 17 variables with 23,101 observations.
  • Most variables were numeric (integer or double); only SEX and default were read in as categorical factors.
  • SEX was expected to have 1 for Male and 2 for Female but had classifications such as cat, dog and dolphin.
  • Although the data dictionary had classified MARRIAGE as 1 for Married, 2 for Single and 3 for Others, there was an additional classification as 0.
  • EDUCATION took integer values from 0 to 6. These were later regrouped into graduate school, university, high school, others and unknown, with 0 classified as unknown.
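
The checks listed above can be sketched in a few lines of base R. This is only an illustration on a made-up data frame (`toy`), not the code used for the report; the thresholds for implausible values (negative credit limit, age above 100) are assumptions chosen to catch values like the -99 and 141 visible in the summary output.

```r
# Toy data frame mimicking the anomalies seen in the summary output
# (values are illustrative, not taken from the real file).
toy <- data.frame(
  LIMIT_BAL = c(20000, -99, 500000, 150000, 90000),
  SEX       = factor(c("1", "2", "cat", "dolphin", "2")),
  MARRIAGE  = c(1L, 2L, 0L, 3L, 2L),
  AGE       = c(24L, 141L, 29L, 35L, 34L)
)

# Values outside the codes documented in the data dictionary
bad_sex      <- !(toy$SEX %in% c("1", "2"))
bad_marriage <- !(toy$MARRIAGE %in% 1:3)

# Implausible numeric values (thresholds are assumptions)
bad_limit <- toy$LIMIT_BAL < 0
bad_age   <- toy$AGE > 100

# One row per check, with the number of offending observations
anomalies <- data.frame(
  check = c("SEX", "MARRIAGE", "LIMIT_BAL", "AGE"),
  rows  = c(sum(bad_sex), sum(bad_marriage), sum(bad_limit), sum(bad_age))
)
print(anomalies)
```

Counting offending rows per variable, rather than only reading `summary()`, makes it easier to decide whether to recode, impute or drop them.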

The plots below represent the distribution of each of the predictor variables.

## # A tibble: 3 x 3
##   SEX     limit_bal count
##   <fct>       <dbl> <int>
## 1 cat        500000     1
## 2 dog         20000     1
## 3 dolphin    150000     1

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    21.0    28.0    34.0    35.7    41.0   141.0

Outliers were also noticed in SEX and MARRIAGE. To fix these inconsistencies, the variables SEX, EDUCATION and MARRIAGE were cleansed and converted to factors.

  • SEX was categorised as Male, Female and Others.
  • MARRIAGE was categorised as Married, Single and Others.
  • EDUCATION was categorised as High School, University, Graduate School, Others and Unknown.
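
A minimal sketch of this recoding, on a made-up data frame, is shown here. The column names `SEX_H`, `MARRIAGE_H`, `EDUCATION_H` and `AGE_H` appear in the partitioned data further down, but the code itself is an assumption: the EDUCATION code-to-label mapping follows the UCI data dictionary, the "Unknown" label for MARRIAGE code 0 is a guess, and the upper age break of 90 is hypothetical.

```r
# Made-up rows covering the regular codes and the odd ones
toy <- data.frame(
  SEX       = factor(c("1", "2", "cat", "2")),
  MARRIAGE  = c(1L, 2L, 3L, 0L),
  EDUCATION = c(1L, 2L, 3L, 0L),
  AGE       = c(24L, 37L, 51L, 29L)
)

# SEX: 1 = Male, 2 = Female; anything else is grouped as Others
sex_chr   <- as.character(toy$SEX)
toy$SEX_H <- factor(ifelse(sex_chr == "1", "Male",
                    ifelse(sex_chr == "2", "Female", "Others")))

# MARRIAGE: lookup table; the label for code 0 is an assumption
mar_map <- c("0" = "Unknown", "1" = "Married", "2" = "Single", "3" = "Others")
toy$MARRIAGE_H <- factor(unname(mar_map[as.character(toy$MARRIAGE)]))

# EDUCATION: mapping assumed from the UCI data dictionary
edu_map <- c("0" = "Unknown", "1" = "Graduate School", "2" = "University",
             "3" = "High School", "4" = "Others",
             "5" = "Unknown", "6" = "Unknown")
toy$EDUCATION_H <- factor(unname(edu_map[as.character(toy$EDUCATION)]))

# AGE: ten-year bins, right-closed; ages above the last break become NA
toy$AGE_H <- cut(toy$AGE, breaks = seq(10, 90, by = 10))

str(toy)
```

Keeping the original columns and writing the cleaned versions to new `_H` columns means the raw values stay available for auditing.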

Please see below the plots for each of the variables after transformation.

Histograms for each of the variables after transformation.

Looking for any Correlations between variables.

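As the graph above shows, no notable correlations were found. This is expected, since Principal Component Analysis had already been applied to the historical payment statuses, and principal components are uncorrelated with one another by construction.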
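
The effect of PCA on correlation can be illustrated with simulated data. The sketch below is not the transformation applied to the assignment data (which was supplied already reduced); the variables `pay_1` to `pay_3` are hypothetical stand-ins for correlated payment-status columns.

```r
set.seed(1)
n    <- 500
base <- rnorm(n)

# Three hypothetical, strongly correlated payment-status variables
raw <- data.frame(
  pay_1 = base + rnorm(n, sd = 0.3),
  pay_2 = base + rnorm(n, sd = 0.3),
  pay_3 = base + rnorm(n, sd = 0.3)
)
cor_raw <- cor(raw)
print(round(cor_raw, 2))      # large off-diagonal correlations

# PCA on centred, scaled inputs; pca$x holds the component scores
pca    <- prcomp(raw, center = TRUE, scale. = TRUE)
scores <- pca$x
cor_pc <- cor(scores)
print(round(cor_pc, 2))       # off-diagonal entries are numerically zero
```

Note that this only guarantees zero correlation among components from the same PCA; components from separate PCA runs (for example payment versus amount variables) can still be correlated with each other.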

DATA PARTITION

The dataset was partitioned 80/20: 80% of the data was allocated to training the model, and 20% was held out as testing data that the model has not seen. This gives an indication of how well the finalized model will perform on out-of-sample data and whether its predictions remain consistent when it encounters a new dataset.

## [1] 18482    21
## [1] 4619   21
## 'data.frame':    18482 obs. of  21 variables:
##  $ ID         : int  1 3 4 6 8 9 10 12 13 14 ...
##  $ LIMIT_BAL  : num  20000 90000 50000 50000 100000 140000 20000 260000 630000 70000 ...
##  $ SEX        : Factor w/ 5 levels "1","2","cat",..: 2 2 2 1 2 2 1 2 2 1 ...
##  $ EDUCATION  : int  2 2 2 1 2 3 3 1 2 2 ...
##  $ MARRIAGE   : int  1 2 1 2 2 1 2 2 2 2 ...
##  $ AGE        : int  24 34 37 37 23 28 35 51 41 30 ...
##  $ PAY_PC1    : num  0.477 -0.393 -0.393 -0.393 0.651 ...
##  $ PAY_PC2    : num  -3.225 0.176 0.176 0.176 0.215 ...
##  $ PAY_PC3    : num  0.14504 0.00489 0.00489 0.00489 0.40055 ...
##  $ AMT_PC1    : num  -1.752 -1.135 -0.397 -0.393 -1.695 ...
##  $ AMT_PC2    : num  -0.224 -0.177 -0.451 -0.5 -0.154 ...
##  $ AMT_PC3    : num  -0.0778 0.016 -0.0998 -0.1033 0.0387 ...
##  $ AMT_PC4    : num  0.00696 -0.12907 -0.03534 -0.1179 -0.02375 ...
##  $ AMT_PC5    : num  -0.0414 0.0982 -0.0553 -0.0546 -0.0262 ...
##  $ AMT_PC6    : num  0.000887 -0.022383 0.050465 0.112137 0.003636 ...
##  $ AMT_PC7    : num  -0.0563 -0.069 -0.0282 0.0186 -0.0398 ...
##  $ default    : Factor w/ 2 levels "N","Y": 2 1 1 1 1 2 1 1 1 2 ...
##  $ SEX_H      : Factor w/ 3 levels "Female","Male",..: 1 1 1 2 1 1 2 1 1 2 ...
##  $ MARRIAGE_H : Factor w/ 4 levels "Married","Others",..: 1 3 1 3 3 1 3 3 3 3 ...
##  $ EDUCATION_H: Factor w/ 5 levels "Graduate School",..: 4 4 4 1 4 2 2 1 4 4 ...
##  $ AGE_H      : Factor w/ 8 levels "(10,20]","(20,30]",..: 2 3 3 3 2 2 3 5 4 2 ...
## # A tibble: 2 x 2
##   default   cnt
##   <fct>   <int>
## 1 N       14015
## 2 Y        4467
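
A stratified 80/20 split can be sketched in base R as follows. The report does not state which function produced the partition, so this is an assumption: sampling 80% within each class and rounding up is one approach (similar in spirit to `createDataPartition` in the caret package) that yields the class counts shown above. The seed and the stand-in data frame are arbitrary.

```r
set.seed(123)  # arbitrary seed, assumed for reproducibility

# Stand-in for the training file: only the class balance matters here
default <- factor(rep(c("N", "Y"), times = c(17518, 5583)))
toy     <- data.frame(ID = seq_along(default), default = default)

# Sample 80% of the row indices within each class, rounding up
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$default),
  function(idx) sample(idx, size = ceiling(0.8 * length(idx)))
))

train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

nrow(train_set)            # 18482
nrow(test_set)             # 4619
table(train_set$default)   # N: 14015, Y: 4467
```

Stratifying on `default` keeps the proportion of defaulters the same in both partitions, which matters because the classes are imbalanced (roughly three non-defaulters per defaulter).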

MODEL EXPLORATION

As mentioned in the introduction, various machine learning algorithms such as Random Forest, Gradient Boosting, Support Vector Machines and Logistic Regression were tried to predict the possibility of a credit card default.

Training the MODEL.

Following multiple iterations of feature engineering and trials of the available modelling techniques, we decided as a team to use Gradient Boosting as the final model for predicting credit card default. The model was trained with all variables except ID and SEX.
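
For orientation, a gradient boosting fit for a binary target might look like the sketch below. It assumes the gbm package, which the report does not name, and uses simulated data with only three of the predictors; the hyperparameter values are placeholders, not the settings used for the final model.

```r
# Runs only if the gbm package is installed (assumed, not confirmed by the report)
if (requireNamespace("gbm", quietly = TRUE)) {
  set.seed(7)
  n <- 600

  # Simulated predictors that borrow names from the data set
  toy <- data.frame(
    LIMIT_BAL = runif(n, 10000, 500000),
    PAY_PC1   = rnorm(n),
    AMT_PC1   = rnorm(n)
  )

  # Synthetic 0/1 target: higher PAY_PC1 raises the default probability
  toy$default_num <- rbinom(n, 1, plogis(-1 + 1.5 * toy$PAY_PC1))

  # The bernoulli distribution expects a numeric 0/1 response
  fit <- gbm::gbm(
    default_num ~ LIMIT_BAL + PAY_PC1 + AMT_PC1,
    data              = toy,
    distribution      = "bernoulli",
    n.trees           = 200,
    interaction.depth = 2,
    shrinkage         = 0.05,
    n.minobsinnode    = 10
  )

  # Predicted probability of default for each row
  prob <- predict(fit, newdata = toy, n.trees = 200, type = "response")

  # Relative influence of each predictor (the "key drivers")
  print(summary(fit, plotit = FALSE))
  print(range(prob))
}
```

The relative influence table is one way to address the goal stated in the introduction of identifying the key drivers of default.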

EVALUATING THE MODEL

The key metrics (Accuracy, Sensitivity and Specificity) were evaluated for each model. As called out in the introduction, the best performing model was Gradient Boosting, which forms the basis of this report. The sections below explain the approach and the modelling done using the Gradient Boosting algorithm.
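
The three metrics can be computed from a confusion matrix as sketched below. The labels and probabilities are made up, "Y" (default) is taken as the positive class, and the 0.5 cut-off is an assumption rather than the threshold used in the report.

```r
# Made-up actual labels and predicted probabilities of default
actual <- factor(c("Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", "N"),
                 levels = c("N", "Y"))
prob_y <- c(0.9, 0.7, 0.4, 0.2, 0.1, 0.3, 0.6, 0.2, 0.1, 0.05)

# Classify as default when the probability reaches the assumed cut-off
predicted <- factor(ifelse(prob_y >= 0.5, "Y", "N"), levels = c("N", "Y"))

# Rows are predictions, columns are actual outcomes
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

tp <- as.numeric(cm["Y", "Y"])  # defaulters correctly flagged
tn <- as.numeric(cm["N", "N"])  # non-defaulters correctly cleared
fp <- as.numeric(cm["Y", "N"])  # non-defaulters wrongly flagged
fn <- as.numeric(cm["N", "Y"])  # defaulters missed

accuracy    <- (tp + tn) / sum(cm)   # 0.7
sensitivity <- tp / (tp + fn)        # 0.5
specificity <- tn / (tn + fp)        # 0.8333...
```

With imbalanced classes, accuracy alone can look good for a model that rarely predicts default, which is why sensitivity and specificity are reported alongside it.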