Load data and transform the default.payment.next.month variable into a nominal (factor) variable:
library(ggplot2)
cc <- read.csv("a2/UCI_Credit_Card.csv")
cc$default.payment.next.month <- factor(cc$default.payment.next.month,levels=c(0,1), labels=c("No","Yes"))
head(cc, 5)
## ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5
## 1 1 20000 2 2 1 24 2 2 -1 -1 -2
## 2 2 120000 2 2 2 26 -1 2 0 0 0
## 3 3 90000 2 2 2 34 0 0 0 0 0
## 4 4 50000 2 2 1 37 0 0 0 0 0
## 5 5 50000 1 2 1 57 -1 0 -1 0 0
## PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
## 1 -2 3913 3102 689 0 0 0
## 2 2 2682 1725 2682 3272 3455 3261
## 3 0 29239 14027 13559 14331 14948 15549
## 4 0 46990 48233 49291 28314 28959 29547
## 5 0 8617 5670 35835 20940 19146 19131
## PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
## 1 0 689 0 0 0 0
## 2 0 1000 1000 1000 0 2000
## 3 1518 1500 1000 1000 1000 5000
## 4 2000 2019 1200 1100 1069 1000
## 5 2000 36681 10000 9000 689 679
## default.payment.next.month
## 1 Yes
## 2 Yes
## 3 No
## 4 No
## 5 No
SEX & MARRIAGE
Produce a distribution for SEX variable and color code the distribution with default.payment.next.month
ggplot(cc, aes(x=SEX, fill=default.payment.next.month, color=default.payment.next.month)) +
geom_histogram(binwidth=1, position="stack") +
scale_color_manual(values=c("black","black")) +
scale_fill_manual(values=c("darkolivegreen4", "red"))
For the sex, 1 stands for male and 2 for female
Independent, mutually exlusive events
Default rates are very close to each other for males and females
However, there is significantly more females than males
Therefore, males have a higher probability of default compared to female clients (24.17% vs. 20.77%)
Produce a distribution for MARRIAGE variable and color code the distribution with default.payment.next.month
ggplot(cc, aes(x=MARRIAGE, fill=default.payment.next.month, color=default.payment.next.month)) +
geom_histogram(binwidth=1, position="stack") +
scale_color_manual(values=c("black","black")) +
scale_fill_manual(values=c("darkolivegreen4", "red"))
Marital status (1=married, 2=single, 3=others)
One anomalY: has a label 0 that is undocumented
Independent, mutually exlusive events
The majority of clients are married or single.
Among married and single, the default rate is almost the same
However, there is significantly more single clients than married ones
Therefore, married clients have a higher probability of default
PAY_0, PAY_3, PAY_5
ggplot(cc, aes(x=PAY_0, fill=default.payment.next.month, color=default.payment.next.month)) +
geom_histogram(binwidth=1, position="stack") +
scale_color_manual(values=c("black","black")) +
scale_fill_manual(values=c("darkolivegreen4", "red"))
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
An undocumented labels present: -2, 0. If -2 = pay duly, -1=payment delay for one month, 0=payment delay for two months, etc, the probability of default becomes higher and higher with each month of payment delay after three months of payment delay
ggplot(cc, aes(x=PAY_3, fill=default.payment.next.month, color=default.payment.next.month)) +
geom_histogram(binwidth=1, position="stack") +
scale_color_manual(values=c("black","black")) +
scale_fill_manual(values=c("darkolivegreen4", "red"))
PAY_3: Repayment status in July, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
An undocumented labels again present: -2, 0.
No events associated with the label 1, which is very unlikely
If -2 = pay duly, -1=payment delay for one month, 0=payment delay for two months, etc, the probability of default becomes higher and higher with each month of payment delay after four months of payment delay (no data for three month payment delay)
ggplot(cc, aes(x=PAY_5, fill=default.payment.next.month, color=default.payment.next.month)) +
geom_histogram(binwidth=1, position="stack") +
scale_color_manual(values=c("black","black")) +
scale_fill_manual(values=c("darkolivegreen4", "red"))
PAY_5: Repayment status in May, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
The same problems as above:
An undocumented labels again present: -2, 0.
No instances for the label 1, which is very unlikely
If -2 = pay duly, -1=payment delay for one month, 0=payment delay for two months, etc, the probability of default becomes higher and higher with each month of payment delay after four months of payment delay (no data for three month payment delay)
If a client have payment delay for three/four or more months, his/her probability of default become higher with every consecutive month of payment delay
Demographic variables, such as SEX, EDUCATION, and MARRIAGE, and payment delay variables should be nominal(categorical)
Transforming demographic variables into nominal values
cc$MARRIAGE <- factor(cc$MARRIAGE,levels=c(1,2,3), labels=c("Married", "Single", "Other"))
cc$SEX <- factor(cc$SEX,levels=c(1,2), labels=c("Male", "Female"))
cc$EDUCATION <- factor(cc$EDUCATION,levels=c(1,2,3,4,5,6), labels=c("Graduate school", "University", "High school", "Others", "Unknown", "Unknown"))
Transforming payment status variables into nominal values:
cc$PAY_0 <- factor(cc$PAY_0 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months", "Delay 5 months", "Delay 6 months", "Delay 7 months", "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))
cc$PAY_2 <- factor(cc$PAY_2 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months", "Delay 5 months", "Delay 6 months", "Delay 7 months", "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))
cc$PAY_3 <- factor(cc$PAY_3 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months", "Delay 5 months", "Delay 6 months", "Delay 7 months", "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))
cc$PAY_4 <- factor(cc$PAY_4 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months", "Delay 5 months", "Delay 6 months", "Delay 7 months", "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))
cc$PAY_5 <- factor(cc$PAY_5 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months", "Delay 5 months", "Delay 6 months", "Delay 7 months", "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))
cc$PAY_6 <- factor(cc$PAY_6 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months", "Delay 5 months", "Delay 6 months", "Delay 7 months", "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))
Selecting 5000 random rows as training data
train <- cc[sample(nrow(cc), 5000), ]
nrow(train)
## [1] 5000
View(train)
test <- cc[c(537,26756),]
test
## ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0
## 537 537 440000 Female Graduate school Single 35 Delay 2 months
## 26756 26756 380000 Male Graduate school Single 28 Pay duly
## PAY_2 PAY_3 PAY_4 PAY_5
## 537 Delay 2 months Delay 2 months Delay 2 months Delay 2 months
## 26756 Pay duly Pay duly Pay duly Pay duly
## PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5
## 537 Delay 2 months 330759 337794 274075 252584 216547
## 26756 Pay duly 1294 1058 359 359 359
## BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
## 537 179221 13100 11000 10000 9000 7100 5500
## 26756 509 1058 359 359 359 509 359
## default.payment.next.month
## 537 No
## 26756 No
Building a Naive Bayes model using the nominal demographic variables as predictors:
library(e1071)
nbDem <- naiveBayes(default.payment.next.month ~ SEX + EDUCATION + MARRIAGE, train)
nbDem
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## No Yes
## 0.7736 0.2264
##
## Conditional probabilities:
## SEX
## Y Male Female
## No 0.3870217 0.6129783
## Yes 0.4266784 0.5733216
##
## EDUCATION
## Y Graduate school University High school Others Unknown
## No 0.366528712 0.460682876 0.154164511 0.005173306 0.013450595
## Yes 0.300353357 0.504416961 0.190812721 0.001766784 0.002650177
##
## MARRIAGE
## Y Married Single Other
## No 0.44061203 0.54745851 0.01192946
## Yes 0.48851590 0.49381625 0.01766784
Conditional probability of default is higher for female clients than it is for males. At first glance, it does not make sense. However, these are not true probabilities. What is important is the ratio between conditional probabilities of YES default and NO default
The probability of YES default is higher than the probability of No default for male clients. The opposite is true for females
The probability of YES default is higher than the probability of No default for university graduates and for high school graduates
The probability of YES default is higher than the probability of No default for married clients
Testing prediction on first row in the test data:
predict(nbDem, test[1,])
## [1] No
## Levels: No Yes
Testing prediction on second row in the test data:
predict(nbDem, test[2,])
## [1] No
## Levels: No Yes
The first client is a female who is single and has a graduate school educational level. She is classified as unlikely to have default.
For all attributes (SEX, EDUCATION, MARRIAGE), the client has the higher probabilities of NO default in comparison to Yes default
The accuracy of prediction is confirmed with actual data. This client did not have default
The second client is a male who is single and has a graduate school educational level. He is classified as unlikely to have default, too.
Only for one attribute (SEX), the second client has the higher probability of YES default in comparison to No default
The accuracy of prediction is confirmed with actual data. This client did not have default
Building a Naive Bayes model using THREE payment status variables as predictors:
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train)
nbPay
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## No Yes
## 0.7736 0.2264
##
## Conditional probabilities:
## PAY_0
## Y Pay duly Delay 1 month Delay 2 months Delay 3 months
## No 0.1013443640 0.2029472596 0.5377456050 0.1168562565
## Yes 0.0477031802 0.1590106007 0.2941696113 0.1775618375
## PAY_0
## Y Delay 4 months Delay 5 months Delay 6 months Delay 7 months
## No 0.0346432265 0.0046535677 0.0007755946 0.0002585315
## Yes 0.2712014134 0.0388692580 0.0070671378 0.0017667845
## PAY_0
## Y Delay 8 months Delay 9 months Delay 10 months Delay 11 months
## No 0.0005170631 0.0000000000 0.0002585315 0.0000000000
## Yes 0.0008833922 0.0008833922 0.0008833922 0.0000000000
##
## PAY_2
## Y Pay duly Delay 1 month Delay 2 months Delay 3 months
## No 0.1372802482 0.2213029990 0.5486039297 0.0018097208
## Yes 0.0901060071 0.1669611307 0.4010600707 0.0000000000
## PAY_2
## Y Delay 4 months Delay 5 months Delay 6 months Delay 7 months
## No 0.0840227508 0.0046535677 0.0015511892 0.0005170631
## Yes 0.2985865724 0.0291519435 0.0106007067 0.0017667845
## PAY_2
## Y Delay 8 months Delay 9 months Delay 10 months Delay 11 months
## No 0.0000000000 0.0002585315 0.0000000000 0.0000000000
## Yes 0.0008833922 0.0008833922 0.0000000000 0.0000000000
##
## PAY_3
## Y Pay duly Delay 1 month Delay 2 months Delay 3 months
## No 0.1494312306 0.2107032058 0.5491209928 0.0005170631
## Yes 0.0971731449 0.1501766784 0.4231448763 0.0000000000
## PAY_3
## Y Delay 4 months Delay 5 months Delay 6 months Delay 7 months
## No 0.0842812823 0.0038779731 0.0015511892 0.0002585315
## Yes 0.2950530035 0.0229681979 0.0070671378 0.0008833922
## PAY_3
## Y Delay 8 months Delay 9 months Delay 10 months Delay 11 months
## No 0.0002585315 0.0000000000 0.0000000000 0.0000000000
## Yes 0.0017667845 0.0017667845 0.0000000000 0.0000000000
The probability of YES default is higher than the probability of No default for payments delays of three or more months for each payment status variables
These findings confirmed the observations that were made in the distribution analysis
If a client have payment delay for three or more months, his/her probability of YES default is higher than the probability of No default
Testing prediction on first row in the test data:
predict(nbPay, test[1,])
## [1] No
## Levels: No Yes
Testing prediction on second row in the test data:
predict(nbPay, test[2,])
## [1] No
## Levels: No Yes
The first client has 2 month payment delay. She is classified as unlikely to have default.
For all attributes (PAY_0,PAY_2,PAY_3), the client has the higher probabilities of NO default in comparison to Yes default
The accuracy of prediction is confirmed with actual data. This client did not have default
The second client has no payment delays at all. He is classified as unlikely to have default, too.
The accuracy of prediction is confirmed with actual data. This client did not have default
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train, laplace=1.5)
predict(nbPay, test[1,])
## [1] No
## Levels: No Yes
Laplace smoothing did not produce different prediction for the client 1
She is still classified as unlikely to have default.
Testing prediction on second row in the test data
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train, laplace=1.5)
predict(nbPay, test[2,])
## [1] No
## Levels: No Yes
Laplace smoothing did not produce different prediction for the second client
He is still classified as unlikely to have default.
Building a decision tree using up to three payment status variables as predictors:
library("rpart")
library("rpart.plot")
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3,
method="class",
data=train, parms=list(split='information'),
minsplit=20, cp=0.02)
Visualization of the decision tree:
rpart.plot(dtPay, type=4, extra=1)
The decision stump: a decision tree with the root immediately connected to the leafs
Intermediate steps are missing
A decision stump makes predictions based on the value of a single variable
No proper visualization of the chosen attributes and their values
The outcome “default” is either YES (meaning the client will have default) or NO (meaning the client won’t have default)
Testing prediction on the first row in the test data:
predict(nbPay, test[1,])
## [1] No
## Levels: No Yes
Testing prediction on the second row in the test data:
predict(nbPay, test[2,])
## [1] No
## Levels: No Yes
Both clients are classified as unlikely to have default.
The accuracy of predictions are confirmed with actual data. These clients did not have default
Rebuilding the decision tree with a smaller cp:
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3,
method="class",
data=train, parms=list(split='information'),
minsplit=20, cp=0.001)
Visualization of the decision tree:
rpart.plot(dtPay, type=4, extra=1)
The PAY_0 is chosen for the 1st split of the decision tree
The root of the tree shows that the overall fraction of the clients who have NO default is 3,901 out of the total population of 5,000
For the attribite PAY_O, 3,748 clients, who pay duly or who have payment delays of 1, 2, 3, and 7 months, have NO default
The right node represents the rest of the client population, for which the outcome of is YES default
For the attribite PAY_O, 381 clients, who have payment delays of 4, 5, 6, 8, 9 and 10 months, have YES default
This node further splits into two nodes based on the PAY_3 values. If the PAY_3 value is either 1 month delay or 9 month delay, then 19 out of 33 of the clients have NO default.
This node further splits into two nodes based on the PAY_2 values. If the PAY_2 value is either 2 month delay or 4 month delay, then 8 out of 14 of the clients have YES default.
Testing prediction on the first row in the test data:
predict(nbPay, test[1,])
## [1] No
## Levels: No Yes
Testing prediction on the second row in the test data:
predict(nbPay, test[2,])
## [1] No
## Levels: No Yes
Both clients are classified as unlikely to have default.
No changes
All models showed the same results. All predictions on two rows of test data were correct (NO default)
For more comprehensive comparison, more test data are needed to draw
However, the Naive Bayes model seems to work better on payment status variables than Decision Tree that recognized different payment attributes as redundant. In the case of Decision Tree, the predictive power could be enhanced by choosing a smaller complexity parameter
The variables “Age” can be combined with other demographic variables to predict default payment for the next month. For this, the range of years should be divided into intervals and then converted to categorical valuesOther data that could potentially enhance the accuracy of classification models include the amount of the bill statement, the amount of previous payments, and limits of the credit line. All this data are continuous variables. Therefore, they should be divided into intervals before their conversion to nominal values
It would be interesting to test the classification model that incorporates all available data: the limit of credit line, demographic variables, the amount of the bill, payment status, the amount of the previous payment