Introduction

This project aims to compare the predictive accuracy of the probability of default payments in Taiwan among four classification methods: logistic regression, linear discriminant analysis, quadratic discriminant analysis, and K-nearest neighbours (KNN). The data used in this classification study is obtained from UCI’s machine learning repository (https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).

Logistic regression: This method can be considered a special case of linear regression models. It assumes that the probability of the event (credit card default) is a linear function of the observed values (default: yes or no) of available explanatory variables (gender, education etc.). It produces a simple probability formula of classification, but cannot deal with non-linear and interactive effects in explanatory variables.

Discriminant analysis: This method is an alternative to logistic regression. It assumes that for each given class of response variable (yes or no), the explanatory variables are distributed as a multivariate normal distribution, with a common variance-covariance matrix. The model maximizes the distance between the two classes, and minimizes the distance within each class.

K-Nearest Neighbour: When given a sample point, the KNN classifer searches for the KNN points that are closest to the sample point, and returns the most common class sample among its KNN. The advantage is that no predictive model is needed for classification. The disadvantage is that its predictive accuracy is affected by the measure of distance “k” that defines the neighbourhood.

library(readxl) #read Excel file
library(Hmisc)
library(MASS) #LDA, QDA
library(class) #KNN
library(ggplot2) #visualization
library(dplyr) #data wrangling
library(stringr)
library(caret) #splitting and training data

Import data

credit <- read_excel("/Users/jenniferlu/Desktop/R/data/default of credit card clients.xls", skip = 1)
dim(credit)

## [1] 30000    25

colnames(credit)

##  [1] "ID"                         "LIMIT_BAL"                 
##  [3] "SEX"                        "EDUCATION"                 
##  [5] "MARRIAGE"                   "AGE"                       
##  [7] "PAY_0"                      "PAY_2"                     
##  [9] "PAY_3"                      "PAY_4"                     
## [11] "PAY_5"                      "PAY_6"                     
## [13] "BILL_AMT1"                  "BILL_AMT2"                 
## [15] "BILL_AMT3"                  "BILL_AMT4"                 
## [17] "BILL_AMT5"                  "BILL_AMT6"                 
## [19] "PAY_AMT1"                   "PAY_AMT2"                  
## [21] "PAY_AMT3"                   "PAY_AMT4"                  
## [23] "PAY_AMT5"                   "PAY_AMT6"                  
## [25] "default payment next month"

There are a total of 30,000 observations and 25 variables. The variables included in the data are:

1. limit_bal: Amount of the given credit (NT dollar). It includes both the individual consumer credit and his/her family (supplementary) credit.

2. Gender: (1 = male; 2 = female).

3. Education: (1 = graduate school; 2 = university; 3 = high school; 4 = others).

4. Marital status: (1 = married; 2 = single; 3 = others).

5. Age: (year).

6. Repayment Status:

PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)

PAY_2: Repayment status in August, 2005 (scale same as above)

PAY_3: Repayment status in July, 2005 (scale same as above)

PAY_4: Repayment status in June, 2005 (scale same as above)

PAY_5: Repayment status in May, 2005 (scale same as above)

PAY_6: Repayment status in April, 2005 (scale same as above)

7. Amount Billed:

BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)

BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)

BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)

BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)

BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)

BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)

8. Amount of previous payment in:

PAY_AMT1: September, 2005 (NT dollar)

PAY_AMT2: August, 2005 (NT dollar)

PAY_AMT3: July, 2005 (NT dollar)

PAY_AMT4: une, 2005 (NT dollar)

PAY_AMT5: May, 2005 (NT dollar)

PAY_AMT6: April, 2005 (NT dollar)

9. default.payment.next.month: Default payment in June, 2005 (1=yes, 0=no)

Change all column names to lower case

colnames(credit) %<>%
  str_replace_all(pattern = "\\s", replacement = "_") %>%
  tolower()

See all datatypes for each variable.

sapply(credit, class)

##                         id                  limit_bal 
##                  "numeric"                  "numeric" 
##                        sex                  education 
##                  "numeric"                  "numeric" 
##                   marriage                        age 
##                  "numeric"                  "numeric" 
##                      pay_0                      pay_2 
##                  "numeric"                  "numeric" 
##                      pay_3                      pay_4 
##                  "numeric"                  "numeric" 
##                      pay_5                      pay_6 
##                  "numeric"                  "numeric" 
##                  bill_amt1                  bill_amt2 
##                  "numeric"                  "numeric" 
##                  bill_amt3                  bill_amt4 
##                  "numeric"                  "numeric" 
##                  bill_amt5                  bill_amt6 
##                  "numeric"                  "numeric" 
##                   pay_amt1                   pay_amt2 
##                  "numeric"                  "numeric" 
##                   pay_amt3                   pay_amt4 
##                  "numeric"                  "numeric" 
##                   pay_amt5                   pay_amt6 
##                  "numeric"                  "numeric" 
## default_payment_next_month 
##                  "numeric"

Change marriage to factor variable.

credit$marriage <- as.factor(credit$marriage)

I will make the marriage variable binary since the observation “others” is vague, where marriage = 1 if individual is married, and 0 if individual is not married.

marriage01 <- rep(0, length(credit$marriage))
marriage01[credit$marriage == "1"] <- 1
credit <- data.frame(credit, marriage01)

Data Visualization

I will do some basic visualizations illustrating how each of the variables relate to whether client’s credit default status.

default <- as.factor(credit$default_payment_next_month)
credit %>%
  ggplot(aes(x = as.factor(sex))) +
  geom_bar(aes(fill = default)) +
  labs(x = "Sex", y = "Observations Count", title = "Credit Card Default by Sex") +
  theme_minimal()

From this graph, we can see that there are more female clients in this data, and that male clients were more likely to default than female clients.

credit %>%
  ggplot(aes(education)) +
  geom_bar(aes(fill = default)) +
  ggtitle("Credit Card Default by Education Level") +
  theme_minimal()

A majority of defaulted clients are university graduates.

credit %>%
  ggplot(aes(age)) +
  geom_bar(aes(fill = default)) +
  theme_minimal() +
  ggtitle("Credit Card Default by Age")

This graph shows that a majority of clients who have defaulted are around 20-30.

Testing and Splitting data

#number of observations to train
train <- 1:(0.8*dim(credit)[1])
#training data
credit.train <- credit[train,]
#testing data
credit.test <- credit[-train,]

Logistic Regression Model

credit.glm <- glm(default_payment_next_month ~ . -id -marriage -default_payment_next_month,
                  credit.train,
                  family = binomial)
summary(credit.glm)

## 
## Call:
## glm(formula = default_payment_next_month ~ . - id - marriage - 
##     default_payment_next_month, family = binomial, data = credit.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1086  -0.7034  -0.5507  -0.2922   4.0261  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.107e-01  9.881e-02  -9.216  < 2e-16 ***
## limit_bal   -6.358e-07  1.750e-07  -3.633 0.000280 ***
## sex         -1.054e-01  3.457e-02  -3.048 0.002300 ** 
## education   -1.028e-01  2.368e-02  -4.341 1.42e-05 ***
## age          4.340e-03  2.015e-03   2.154 0.031262 *  
## pay_0        5.690e-01  1.956e-02  29.085  < 2e-16 ***
## pay_2        8.064e-02  2.236e-02   3.607 0.000310 ***
## pay_3        7.453e-02  2.506e-02   2.975 0.002933 ** 
## pay_4       -1.277e-02  2.824e-02  -0.452 0.651165    
## pay_5        7.320e-02  2.970e-02   2.465 0.013714 *  
## pay_6        4.196e-03  2.426e-02   0.173 0.862701    
## bill_amt1   -6.949e-06  1.331e-06  -5.223 1.76e-07 ***
## bill_amt2    3.456e-06  1.708e-06   2.024 0.043004 *  
## bill_amt3    9.346e-07  1.514e-06   0.617 0.537051    
## bill_amt4    7.134e-07  1.570e-06   0.454 0.649580    
## bill_amt5    1.179e-06  1.718e-06   0.686 0.492524    
## bill_amt6   -3.614e-07  1.305e-06  -0.277 0.781864    
## pay_amt1    -1.515e-05  2.637e-06  -5.745 9.20e-09 ***
## pay_amt2    -7.866e-06  2.174e-06  -3.618 0.000297 ***
## pay_amt3    -4.887e-06  2.087e-06  -2.341 0.019216 *  
## pay_amt4    -3.277e-06  1.999e-06  -1.640 0.101032    
## pay_amt5    -1.715e-06  1.890e-06  -0.908 0.364009    
## pay_amt6    -3.297e-06  1.522e-06  -2.166 0.030275 *  
## marriage01   2.073e-01  3.781e-02   5.484 4.16e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25517  on 23999  degrees of freedom
## Residual deviance: 22502  on 23976  degrees of freedom
## AIC: 22550
## 
## Number of Fisher Scoring iterations: 5

#matrix of probabilities after fitting model to training data
glm.probs <- predict(credit.glm, credit.test, type = "response")

#matrix of 0 for all test observations
glm.pred <- rep(0, dim(credit.test)[1])

#change those to 1 where probabilities > 0.5
glm.pred[glm.probs > 0.5] <- 1

#table to compare predictions to actual observations
table(glm.pred, credit.test$default_payment_next_month)

##         
## glm.pred    0    1
##        0 4632  983
##        1  102  283

#test error rate
mean(glm.pred != credit.test$default_payment_next_month)

## [1] 0.1808333

The test error rate for the logistic regression model is 18.08%.

From the results, it can be seen that some variables, such as specific repayment status months, bill amounts, and amounts previously paid are not significantly related to the risk of default.

Linear Discriminant Analysis Model

credit.lda <- lda(default_payment_next_month ~ . -id -marriage -default_payment_next_month,
                  credit.train)
lda.pred <- predict(credit.lda, credit.test)
lda.category <- lda.pred$class
table(lda.category, credit.test$default_payment_next_month)

##             
## lda.category    0    1
##            0 4614  956
##            1  120  310

mean(lda.category != credit.test$default_payment_next_month)

## [1] 0.1793333

The test error rate for the linear discriminant analysis model is 17.93%.

Quadratic Discriminant Analysis Model

credit.qda <- qda(default_payment_next_month ~ . -id -marriage -default_payment_next_month,
                  data = credit.train)
credit.qda

## Call:
## qda(default_payment_next_month ~ . - id - marriage - default_payment_next_month, 
##     data = credit.train)
## 
## Prior probabilities of groups:
##       0       1 
## 0.77625 0.22375 
## 
## Group means:
##   limit_bal      sex education      age      pay_0      pay_2      pay_3
## 0  175806.5 1.638486  1.835910 35.33999 -0.1971551 -0.2906602 -0.3046162
## 1  129725.8 1.592737  1.887337 35.52086  0.6700186  0.4564246  0.3651769
##        pay_4      pay_5      pay_6 bill_amt1 bill_amt2 bill_amt3 bill_amt4
## 0 -0.3444981 -0.3778315 -0.3932367  51474.20  49258.08  46918.80  42696.05
## 1  0.2491620  0.1804469  0.1217877  47553.25  46522.80  44452.92  41230.76
##   bill_amt5 bill_amt6 pay_amt1 pay_amt2 pay_amt3 pay_amt4 pay_amt5 pay_amt6
## 0  40211.11  38741.84 6172.787 6503.240 5478.544 5207.398 5216.920 5726.611
## 1  39270.66  37945.72 3357.706 3428.811 3202.441 3134.023 3279.785 3325.663
##   marriage01
## 0  0.4420827
## 1  0.4798883

According to the test data, the probability of default is approximately 22.375%.

qda.pred <- predict(credit.qda, credit.test)
qda.category <- qda.pred$class
table(qda.category, credit.test$default_payment_next_month)

##             
## qda.category    0    1
##            0 2233  253
##            1 2501 1013

mean(qda.category != credit.test$default_payment_next_month)

## [1] 0.459

The test error rate for the quadratic discriminant analysis model is 45.9%.

K-Nearest Neighbours

I will make a new dataframe excluding id, and marriage from analysis for KNN.

credit01 <- credit %>%
  select(-id, -marriage)

#predictors from training data
train.default <- credit$default_payment_next_month[train]

knn.pred <- knn(credit.train, credit.test, train.default, k=1)
table(knn.pred, credit.test$default_payment_next_month)

##         
## knn.pred    0    1
##        0 3868  942
##        1  866  324

mean(knn.pred != credit.test$default_payment_next_month)

## [1] 0.3013333

The KNN test error rate for k=1 is 30.13%.

Testing k=3

knn.pred <- knn(credit.train, credit.test, train.default, k=3)
table(knn.pred, credit.test$default_payment_next_month)

##         
## knn.pred    0    1
##        0 4212 1017
##        1  522  249

mean(knn.pred != credit.test$default_payment_next_month)

## [1] 0.2565

The test error rate for k=2 is 25.65%.

Testing k=5

knn.pred <- knn(credit.train, credit.test, train.default, k=5)
table(knn.pred, credit.test$default_payment_next_month)

##         
## knn.pred    0    1
##        0 4358 1056
##        1  376  210

mean(knn.pred != credit.test$default_payment_next_month)

## [1] 0.2386667

The testing error is 23.88%.

Testing k=10

knn.pred <- knn(credit.train, credit.test, train.default, k=10)
table(knn.pred, credit.test$default_payment_next_month)

##         
## knn.pred    0    1
##        0 4461 1090
##        1  273  176

mean(knn.pred != credit.test$default_payment_next_month)

## [1] 0.2271667

The KNN test error rate for k=10 is 22.95%.

Testing k=50

knn.pred <- knn(credit.train, credit.test, train.default, k=50)
table(knn.pred, credit.test$default_payment_next_month)

##         
## knn.pred    0    1
##        0 4658 1193
##        1   76   73

mean(knn.pred != credit.test$default_payment_next_month)

## [1] 0.2115

The KNN test error rate for k=50 is 21.1%.

Testing k=75

knn.pred <- knn(credit.train, credit.test, train.default, k=75)
table(knn.pred, credit.test$default_payment_next_month)

##         
## knn.pred    0    1
##        0 4671 1209
##        1   63   57

mean(knn.pred != credit.test$default_payment_next_month)

## [1] 0.212

The KNN teest error rate for k=75 is 21.2%.

It appears that the linear analysis model is the most accurate at predicting credit default risk in clients, with a testing error rate of 17.93%, the lowest compared to logistic regression, quadratic analysis model, and K-Nearest Neighbours.

Predictive Accuracy of Classification Methods on the Probability of Credit Card Default