This project aims to compare the predictive accuracy of the probability of default payments in Taiwan among four classification methods: logistic regression, linear discriminant analysis, quadratic discriminant analysis, and K-nearest neighbours (KNN). The data used in this classification study is obtained from UCI’s machine learning repository (https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).
Logistic regression: This method can be considered a special case of linear regression models. It assumes that the probability of the event (credit card default) is a linear function of the observed values (default: yes or no) of available explanatory variables (gender, education etc.). It produces a simple probability formula of classification, but cannot deal with non-linear and interactive effects in explanatory variables.
Discriminant analysis: This method is an alternative to logistic regression. It assumes that for each given class of response variable (yes or no), the explanatory variables are distributed as a multivariate normal distribution, with a common variance-covariance matrix. The model maximizes the distance between the two classes, and minimizes the distance within each class.
K-Nearest Neighbour: When given a sample point, the KNN classifer searches for the KNN points that are closest to the sample point, and returns the most common class sample among its KNN. The advantage is that no predictive model is needed for classification. The disadvantage is that its predictive accuracy is affected by the measure of distance “k” that defines the neighbourhood.
library(readxl) #read Excel file
library(Hmisc)
library(MASS) #LDA, QDA
library(class) #KNN
library(ggplot2) #visualization
library(dplyr) #data wrangling
library(stringr)
library(caret) #splitting and training data
Import data
credit <- read_excel("/Users/jenniferlu/Desktop/R/data/default of credit card clients.xls", skip = 1)
dim(credit)
## [1] 30000 25
colnames(credit)
## [1] "ID" "LIMIT_BAL"
## [3] "SEX" "EDUCATION"
## [5] "MARRIAGE" "AGE"
## [7] "PAY_0" "PAY_2"
## [9] "PAY_3" "PAY_4"
## [11] "PAY_5" "PAY_6"
## [13] "BILL_AMT1" "BILL_AMT2"
## [15] "BILL_AMT3" "BILL_AMT4"
## [17] "BILL_AMT5" "BILL_AMT6"
## [19] "PAY_AMT1" "PAY_AMT2"
## [21] "PAY_AMT3" "PAY_AMT4"
## [23] "PAY_AMT5" "PAY_AMT6"
## [25] "default payment next month"
There are a total of 30,000 observations and 25 variables. The variables included in the data are:
1. limit_bal: Amount of the given credit (NT dollar). It includes both the individual consumer credit and his/her family (supplementary) credit.
2. Gender: (1 = male; 2 = female).
3. Education: (1 = graduate school; 2 = university; 3 = high school; 4 = others).
4. Marital status: (1 = married; 2 = single; 3 = others).
5. Age: (year).
6. Repayment Status:
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
PAY_2: Repayment status in August, 2005 (scale same as above)
PAY_3: Repayment status in July, 2005 (scale same as above)
PAY_4: Repayment status in June, 2005 (scale same as above)
PAY_5: Repayment status in May, 2005 (scale same as above)
PAY_6: Repayment status in April, 2005 (scale same as above)
7. Amount Billed:
BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
8. Amount of previous payment in:
PAY_AMT1: September, 2005 (NT dollar)
PAY_AMT2: August, 2005 (NT dollar)
PAY_AMT3: July, 2005 (NT dollar)
PAY_AMT4: une, 2005 (NT dollar)
PAY_AMT5: May, 2005 (NT dollar)
PAY_AMT6: April, 2005 (NT dollar)
9. default.payment.next.month: Default payment in June, 2005 (1=yes, 0=no)
Change all column names to lower case
colnames(credit) %<>%
str_replace_all(pattern = "\\s", replacement = "_") %>%
tolower()
See all datatypes for each variable.
sapply(credit, class)
## id limit_bal
## "numeric" "numeric"
## sex education
## "numeric" "numeric"
## marriage age
## "numeric" "numeric"
## pay_0 pay_2
## "numeric" "numeric"
## pay_3 pay_4
## "numeric" "numeric"
## pay_5 pay_6
## "numeric" "numeric"
## bill_amt1 bill_amt2
## "numeric" "numeric"
## bill_amt3 bill_amt4
## "numeric" "numeric"
## bill_amt5 bill_amt6
## "numeric" "numeric"
## pay_amt1 pay_amt2
## "numeric" "numeric"
## pay_amt3 pay_amt4
## "numeric" "numeric"
## pay_amt5 pay_amt6
## "numeric" "numeric"
## default_payment_next_month
## "numeric"
Change marriage to factor variable.
credit$marriage <- as.factor(credit$marriage)
I will make the marriage variable binary since the observation “others” is vague, where marriage = 1 if individual is married, and 0 if individual is not married.
marriage01 <- rep(0, length(credit$marriage))
marriage01[credit$marriage == "1"] <- 1
credit <- data.frame(credit, marriage01)
I will do some basic visualizations illustrating how each of the variables relate to whether client’s credit default status.
default <- as.factor(credit$default_payment_next_month)
credit %>%
ggplot(aes(x = as.factor(sex))) +
geom_bar(aes(fill = default)) +
labs(x = "Sex", y = "Observations Count", title = "Credit Card Default by Sex") +
theme_minimal()
From this graph, we can see that there are more female clients in this data, and that male clients were more likely to default than female clients.
credit %>%
ggplot(aes(education)) +
geom_bar(aes(fill = default)) +
ggtitle("Credit Card Default by Education Level") +
theme_minimal()
A majority of defaulted clients are university graduates.
credit %>%
ggplot(aes(age)) +
geom_bar(aes(fill = default)) +
theme_minimal() +
ggtitle("Credit Card Default by Age")
This graph shows that a majority of clients who have defaulted are around 20-30.
#number of observations to train
train <- 1:(0.8*dim(credit)[1])
#training data
credit.train <- credit[train,]
#testing data
credit.test <- credit[-train,]
credit.glm <- glm(default_payment_next_month ~ . -id -marriage -default_payment_next_month,
credit.train,
family = binomial)
summary(credit.glm)
##
## Call:
## glm(formula = default_payment_next_month ~ . - id - marriage -
## default_payment_next_month, family = binomial, data = credit.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1086 -0.7034 -0.5507 -0.2922 4.0261
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.107e-01 9.881e-02 -9.216 < 2e-16 ***
## limit_bal -6.358e-07 1.750e-07 -3.633 0.000280 ***
## sex -1.054e-01 3.457e-02 -3.048 0.002300 **
## education -1.028e-01 2.368e-02 -4.341 1.42e-05 ***
## age 4.340e-03 2.015e-03 2.154 0.031262 *
## pay_0 5.690e-01 1.956e-02 29.085 < 2e-16 ***
## pay_2 8.064e-02 2.236e-02 3.607 0.000310 ***
## pay_3 7.453e-02 2.506e-02 2.975 0.002933 **
## pay_4 -1.277e-02 2.824e-02 -0.452 0.651165
## pay_5 7.320e-02 2.970e-02 2.465 0.013714 *
## pay_6 4.196e-03 2.426e-02 0.173 0.862701
## bill_amt1 -6.949e-06 1.331e-06 -5.223 1.76e-07 ***
## bill_amt2 3.456e-06 1.708e-06 2.024 0.043004 *
## bill_amt3 9.346e-07 1.514e-06 0.617 0.537051
## bill_amt4 7.134e-07 1.570e-06 0.454 0.649580
## bill_amt5 1.179e-06 1.718e-06 0.686 0.492524
## bill_amt6 -3.614e-07 1.305e-06 -0.277 0.781864
## pay_amt1 -1.515e-05 2.637e-06 -5.745 9.20e-09 ***
## pay_amt2 -7.866e-06 2.174e-06 -3.618 0.000297 ***
## pay_amt3 -4.887e-06 2.087e-06 -2.341 0.019216 *
## pay_amt4 -3.277e-06 1.999e-06 -1.640 0.101032
## pay_amt5 -1.715e-06 1.890e-06 -0.908 0.364009
## pay_amt6 -3.297e-06 1.522e-06 -2.166 0.030275 *
## marriage01 2.073e-01 3.781e-02 5.484 4.16e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25517 on 23999 degrees of freedom
## Residual deviance: 22502 on 23976 degrees of freedom
## AIC: 22550
##
## Number of Fisher Scoring iterations: 5
#matrix of probabilities after fitting model to training data
glm.probs <- predict(credit.glm, credit.test, type = "response")
#matrix of 0 for all test observations
glm.pred <- rep(0, dim(credit.test)[1])
#change those to 1 where probabilities > 0.5
glm.pred[glm.probs > 0.5] <- 1
#table to compare predictions to actual observations
table(glm.pred, credit.test$default_payment_next_month)
##
## glm.pred 0 1
## 0 4632 983
## 1 102 283
#test error rate
mean(glm.pred != credit.test$default_payment_next_month)
## [1] 0.1808333
The test error rate for the logistic regression model is 18.08%.
From the results, it can be seen that some variables, such as specific repayment status months, bill amounts, and amounts previously paid are not significantly related to the risk of default.
credit.lda <- lda(default_payment_next_month ~ . -id -marriage -default_payment_next_month,
credit.train)
lda.pred <- predict(credit.lda, credit.test)
lda.category <- lda.pred$class
table(lda.category, credit.test$default_payment_next_month)
##
## lda.category 0 1
## 0 4614 956
## 1 120 310
mean(lda.category != credit.test$default_payment_next_month)
## [1] 0.1793333
The test error rate for the linear discriminant analysis model is 17.93%.
credit.qda <- qda(default_payment_next_month ~ . -id -marriage -default_payment_next_month,
data = credit.train)
credit.qda
## Call:
## qda(default_payment_next_month ~ . - id - marriage - default_payment_next_month,
## data = credit.train)
##
## Prior probabilities of groups:
## 0 1
## 0.77625 0.22375
##
## Group means:
## limit_bal sex education age pay_0 pay_2 pay_3
## 0 175806.5 1.638486 1.835910 35.33999 -0.1971551 -0.2906602 -0.3046162
## 1 129725.8 1.592737 1.887337 35.52086 0.6700186 0.4564246 0.3651769
## pay_4 pay_5 pay_6 bill_amt1 bill_amt2 bill_amt3 bill_amt4
## 0 -0.3444981 -0.3778315 -0.3932367 51474.20 49258.08 46918.80 42696.05
## 1 0.2491620 0.1804469 0.1217877 47553.25 46522.80 44452.92 41230.76
## bill_amt5 bill_amt6 pay_amt1 pay_amt2 pay_amt3 pay_amt4 pay_amt5 pay_amt6
## 0 40211.11 38741.84 6172.787 6503.240 5478.544 5207.398 5216.920 5726.611
## 1 39270.66 37945.72 3357.706 3428.811 3202.441 3134.023 3279.785 3325.663
## marriage01
## 0 0.4420827
## 1 0.4798883
According to the test data, the probability of default is approximately 22.375%.
qda.pred <- predict(credit.qda, credit.test)
qda.category <- qda.pred$class
table(qda.category, credit.test$default_payment_next_month)
##
## qda.category 0 1
## 0 2233 253
## 1 2501 1013
mean(qda.category != credit.test$default_payment_next_month)
## [1] 0.459
The test error rate for the quadratic discriminant analysis model is 45.9%.
I will make a new dataframe excluding id, and marriage from analysis for KNN.
credit01 <- credit %>%
select(-id, -marriage)
#predictors from training data
train.default <- credit$default_payment_next_month[train]
knn.pred <- knn(credit.train, credit.test, train.default, k=1)
table(knn.pred, credit.test$default_payment_next_month)
##
## knn.pred 0 1
## 0 3868 942
## 1 866 324
mean(knn.pred != credit.test$default_payment_next_month)
## [1] 0.3013333
The KNN test error rate for k=1 is 30.13%.
knn.pred <- knn(credit.train, credit.test, train.default, k=3)
table(knn.pred, credit.test$default_payment_next_month)
##
## knn.pred 0 1
## 0 4212 1017
## 1 522 249
mean(knn.pred != credit.test$default_payment_next_month)
## [1] 0.2565
The test error rate for k=2 is 25.65%.
knn.pred <- knn(credit.train, credit.test, train.default, k=5)
table(knn.pred, credit.test$default_payment_next_month)
##
## knn.pred 0 1
## 0 4358 1056
## 1 376 210
mean(knn.pred != credit.test$default_payment_next_month)
## [1] 0.2386667
The testing error is 23.88%.
knn.pred <- knn(credit.train, credit.test, train.default, k=10)
table(knn.pred, credit.test$default_payment_next_month)
##
## knn.pred 0 1
## 0 4461 1090
## 1 273 176
mean(knn.pred != credit.test$default_payment_next_month)
## [1] 0.2271667
The KNN test error rate for k=10 is 22.95%.
knn.pred <- knn(credit.train, credit.test, train.default, k=50)
table(knn.pred, credit.test$default_payment_next_month)
##
## knn.pred 0 1
## 0 4658 1193
## 1 76 73
mean(knn.pred != credit.test$default_payment_next_month)
## [1] 0.2115
The KNN test error rate for k=50 is 21.1%.
knn.pred <- knn(credit.train, credit.test, train.default, k=75)
table(knn.pred, credit.test$default_payment_next_month)
##
## knn.pred 0 1
## 0 4671 1209
## 1 63 57
mean(knn.pred != credit.test$default_payment_next_month)
## [1] 0.212
The KNN teest error rate for k=75 is 21.2%.
It appears that the linear analysis model is the most accurate at predicting credit default risk in clients, with a testing error rate of 17.93%, the lowest compared to logistic regression, quadratic analysis model, and K-Nearest Neighbours.