Data Distribution, Clustering and Classification

A. Data Preparation

A.1. Data Loading and Initial Transformation

Load data and transform the default.payment.next.month variable into a nominal (factor) variable:

library(ggplot2)
cc <- read.csv("a2/UCI_Credit_Card.csv")
cc$default.payment.next.month <- factor(cc$default.payment.next.month,levels=c(0,1), labels=c("No","Yes"))
head(cc, 5)

##   ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5
## 1  1     20000   2         2        1  24     2     2    -1    -1    -2
## 2  2    120000   2         2        2  26    -1     2     0     0     0
## 3  3     90000   2         2        2  34     0     0     0     0     0
## 4  4     50000   2         2        1  37     0     0     0     0     0
## 5  5     50000   1         2        1  57    -1     0    -1     0     0
##   PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
## 1    -2      3913      3102       689         0         0         0
## 2     2      2682      1725      2682      3272      3455      3261
## 3     0     29239     14027     13559     14331     14948     15549
## 4     0     46990     48233     49291     28314     28959     29547
## 5     0      8617      5670     35835     20940     19146     19131
##   PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
## 1        0      689        0        0        0        0
## 2        0     1000     1000     1000        0     2000
## 3     1518     1500     1000     1000     1000     5000
## 4     2000     2019     1200     1100     1069     1000
## 5     2000    36681    10000     9000      689      679
##   default.payment.next.month
## 1                        Yes
## 2                        Yes
## 3                         No
## 4                         No
## 5                         No

A.2. Demographic Variables

SEX & MARRIAGE

Produce a distribution for SEX variable and color code the distribution with default.payment.next.month

ggplot(cc, aes(x=SEX, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))

For the sex, 1 stands for male and 2 for female
Independent, mutually exlusive events
Default rates are very close to each other for males and females
However, there is significantly more females than males
Therefore, males have a higher probability of default compared to female clients (24.17% vs. 20.77%)

Produce a distribution for MARRIAGE variable and color code the distribution with default.payment.next.month

ggplot(cc, aes(x=MARRIAGE, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))

Marital status (1=married, 2=single, 3=others)
One anomalY: has a label 0 that is undocumented
Independent, mutually exlusive events
The majority of clients are married or single.
Among married and single, the default rate is almost the same
However, there is significantly more single clients than married ones
Therefore, married clients have a higher probability of default

A.3. Payment Status Variables

PAY_0, PAY_3, PAY_5

ggplot(cc, aes(x=PAY_0, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))

PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
An undocumented labels present: -2, 0. If -2 = pay duly, -1=payment delay for one month, 0=payment delay for two months, etc, the probability of default becomes higher and higher with each month of payment delay after three months of payment delay

ggplot(cc, aes(x=PAY_3, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))

PAY_3: Repayment status in July, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
An undocumented labels again present: -2, 0.
No events associated with the label 1, which is very unlikely
If -2 = pay duly, -1=payment delay for one month, 0=payment delay for two months, etc, the probability of default becomes higher and higher with each month of payment delay after four months of payment delay (no data for three month payment delay)

ggplot(cc, aes(x=PAY_5, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))

PAY_5: Repayment status in May, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
The same problems as above:
An undocumented labels again present: -2, 0.
No instances for the label 1, which is very unlikely
If -2 = pay duly, -1=payment delay for one month, 0=payment delay for two months, etc, the probability of default becomes higher and higher with each month of payment delay after four months of payment delay (no data for three month payment delay)
If a client have payment delay for three/four or more months, his/her probability of default become higher with every consecutive month of payment delay

A.4. Transforming Nominal Variables

Demographic variables, such as SEX, EDUCATION, and MARRIAGE, and payment delay variables should be nominal(categorical)

Transforming demographic variables into nominal values

cc$MARRIAGE <- factor(cc$MARRIAGE,levels=c(1,2,3), labels=c("Married", "Single", "Other"))

cc$SEX <- factor(cc$SEX,levels=c(1,2), labels=c("Male", "Female"))

cc$EDUCATION <- factor(cc$EDUCATION,levels=c(1,2,3,4,5,6), labels=c("Graduate school", "University", "High school", "Others", "Unknown", "Unknown"))

Transforming payment status variables into nominal values:

cc$PAY_0  <- factor(cc$PAY_0 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months",  "Delay 5 months", "Delay 6 months", "Delay 7 months",  "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))

cc$PAY_2  <- factor(cc$PAY_2 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months",  "Delay 5 months", "Delay 6 months", "Delay 7 months",  "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))

cc$PAY_3  <- factor(cc$PAY_3 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months",  "Delay 5 months", "Delay 6 months", "Delay 7 months",  "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))

cc$PAY_4  <- factor(cc$PAY_4 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months",  "Delay 5 months", "Delay 6 months", "Delay 7 months",  "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))

cc$PAY_5  <- factor(cc$PAY_5 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months",  "Delay 5 months", "Delay 6 months", "Delay 7 months",  "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))

cc$PAY_6  <- factor(cc$PAY_6 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months",  "Delay 5 months", "Delay 6 months", "Delay 7 months",  "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))

A.5. Selection of Training Data

Selecting 5000 random rows as training data

train <- cc[sample(nrow(cc), 5000), ]

nrow(train)

## [1] 5000

View(train)

A.6. Selection of Testing Data

test <- cc[c(537,26756),]
test

##          ID LIMIT_BAL    SEX       EDUCATION MARRIAGE AGE          PAY_0
## 537     537    440000 Female Graduate school   Single  35 Delay 2 months
## 26756 26756    380000   Male Graduate school   Single  28       Pay duly
##                PAY_2          PAY_3          PAY_4          PAY_5
## 537   Delay 2 months Delay 2 months Delay 2 months Delay 2 months
## 26756       Pay duly       Pay duly       Pay duly       Pay duly
##                PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5
## 537   Delay 2 months    330759    337794    274075    252584    216547
## 26756       Pay duly      1294      1058       359       359       359
##       BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
## 537      179221    13100    11000    10000     9000     7100     5500
## 26756       509     1058      359      359      359      509      359
##       default.payment.next.month
## 537                           No
## 26756                         No

B. Data Classification

B.1. Naive Bayes using Demographic Variables

Building a Naive Bayes model using the nominal demographic variables as predictors:

library(e1071)
nbDem <- naiveBayes(default.payment.next.month ~ SEX + EDUCATION + MARRIAGE, train)
nbDem

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     No    Yes 
## 0.7736 0.2264 
## 
## Conditional probabilities:
##      SEX
## Y          Male    Female
##   No  0.3870217 0.6129783
##   Yes 0.4266784 0.5733216
## 
##      EDUCATION
## Y     Graduate school  University High school      Others     Unknown
##   No      0.366528712 0.460682876 0.154164511 0.005173306 0.013450595
##   Yes     0.300353357 0.504416961 0.190812721 0.001766784 0.002650177
## 
##      MARRIAGE
## Y        Married     Single      Other
##   No  0.44061203 0.54745851 0.01192946
##   Yes 0.48851590 0.49381625 0.01766784

Conditional probability of default is higher for female clients than it is for males. At first glance, it does not make sense. However, these are not true probabilities. What is important is the ratio between conditional probabilities of YES default and NO default
The probability of YES default is higher than the probability of No default for male clients. The opposite is true for females
The probability of YES default is higher than the probability of No default for university graduates and for high school graduates
The probability of YES default is higher than the probability of No default for married clients

Testing prediction on first row in the test data:

predict(nbDem, test[1,])

## [1] No
## Levels: No Yes

Testing prediction on second row in the test data:

predict(nbDem, test[2,])

## [1] No
## Levels: No Yes

The first client is a female who is single and has a graduate school educational level. She is classified as unlikely to have default.
For all attributes (SEX, EDUCATION, MARRIAGE), the client has the higher probabilities of NO default in comparison to Yes default
The accuracy of prediction is confirmed with actual data. This client did not have default
The second client is a male who is single and has a graduate school educational level. He is classified as unlikely to have default, too.
Only for one attribute (SEX), the second client has the higher probability of YES default in comparison to No default
The accuracy of prediction is confirmed with actual data. This client did not have default

B.2. Naive Bayes using Payment Status

Building a Naive Bayes model using THREE payment status variables as predictors:

nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train)
nbPay

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     No    Yes 
## 0.7736 0.2264 
## 
## Conditional probabilities:
##      PAY_0
## Y         Pay duly Delay 1 month Delay 2 months Delay 3 months
##   No  0.1013443640  0.2029472596   0.5377456050   0.1168562565
##   Yes 0.0477031802  0.1590106007   0.2941696113   0.1775618375
##      PAY_0
## Y     Delay 4 months Delay 5 months Delay 6 months Delay 7 months
##   No    0.0346432265   0.0046535677   0.0007755946   0.0002585315
##   Yes   0.2712014134   0.0388692580   0.0070671378   0.0017667845
##      PAY_0
## Y     Delay 8 months Delay 9 months Delay 10 months Delay 11 months
##   No    0.0005170631   0.0000000000    0.0002585315    0.0000000000
##   Yes   0.0008833922   0.0008833922    0.0008833922    0.0000000000
## 
##      PAY_2
## Y         Pay duly Delay 1 month Delay 2 months Delay 3 months
##   No  0.1372802482  0.2213029990   0.5486039297   0.0018097208
##   Yes 0.0901060071  0.1669611307   0.4010600707   0.0000000000
##      PAY_2
## Y     Delay 4 months Delay 5 months Delay 6 months Delay 7 months
##   No    0.0840227508   0.0046535677   0.0015511892   0.0005170631
##   Yes   0.2985865724   0.0291519435   0.0106007067   0.0017667845
##      PAY_2
## Y     Delay 8 months Delay 9 months Delay 10 months Delay 11 months
##   No    0.0000000000   0.0002585315    0.0000000000    0.0000000000
##   Yes   0.0008833922   0.0008833922    0.0000000000    0.0000000000
## 
##      PAY_3
## Y         Pay duly Delay 1 month Delay 2 months Delay 3 months
##   No  0.1494312306  0.2107032058   0.5491209928   0.0005170631
##   Yes 0.0971731449  0.1501766784   0.4231448763   0.0000000000
##      PAY_3
## Y     Delay 4 months Delay 5 months Delay 6 months Delay 7 months
##   No    0.0842812823   0.0038779731   0.0015511892   0.0002585315
##   Yes   0.2950530035   0.0229681979   0.0070671378   0.0008833922
##      PAY_3
## Y     Delay 8 months Delay 9 months Delay 10 months Delay 11 months
##   No    0.0002585315   0.0000000000    0.0000000000    0.0000000000
##   Yes   0.0017667845   0.0017667845    0.0000000000    0.0000000000

The probability of YES default is higher than the probability of No default for payments delays of three or more months for each payment status variables
These findings confirmed the observations that were made in the distribution analysis
If a client have payment delay for three or more months, his/her probability of YES default is higher than the probability of No default

Testing prediction on first row in the test data:

predict(nbPay, test[1,])

## [1] No
## Levels: No Yes

Testing prediction on second row in the test data:

predict(nbPay, test[2,])

## [1] No
## Levels: No Yes

The first client has 2 month payment delay. She is classified as unlikely to have default.
For all attributes (PAY_0,PAY_2,PAY_3), the client has the higher probabilities of NO default in comparison to Yes default
The accuracy of prediction is confirmed with actual data. This client did not have default
The second client has no payment delays at all. He is classified as unlikely to have default, too.
The accuracy of prediction is confirmed with actual data. This client did not have default

B.3. Smoothed Naive Bayes using Payment Status

nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train, laplace=1.5)
predict(nbPay, test[1,])

## [1] No
## Levels: No Yes

Laplace smoothing did not produce different prediction for the client 1
She is still classified as unlikely to have default.
Testing prediction on second row in the test data

nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train, laplace=1.5)
predict(nbPay, test[2,])

## [1] No
## Levels: No Yes

Laplace smoothing did not produce different prediction for the second client
He is still classified as unlikely to have default.

C. Classification with Decision Tree

C.1. Basic Decision Tree

Building a decision tree using up to three payment status variables as predictors:

library("rpart")
library("rpart.plot")
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3,
            method="class",
            data=train, parms=list(split='information'), 
            minsplit=20, cp=0.02)

Visualization of the decision tree:

rpart.plot(dtPay, type=4, extra=1)

The decision stump: a decision tree with the root immediately connected to the leafs
Intermediate steps are missing
A decision stump makes predictions based on the value of a single variable
No proper visualization of the chosen attributes and their values
The outcome “default” is either YES (meaning the client will have default) or NO (meaning the client won’t have default)

Testing prediction on the first row in the test data:

predict(nbPay, test[1,])

## [1] No
## Levels: No Yes

Testing prediction on the second row in the test data:

predict(nbPay, test[2,])

## [1] No
## Levels: No Yes

Both clients are classified as unlikely to have default.
The accuracy of predictions are confirmed with actual data. These clients did not have default

C.2. Decision Tree with a Different Complexity Parameter

Rebuilding the decision tree with a smaller cp:

dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3,
            method="class",
            data=train, parms=list(split='information'), 
            minsplit=20, cp=0.001)