INFO 659 Assignment 2

Velena McHugh

Data Distribution, Clustering and Classification

A. Data Preparation

A.1. Data Loading and Initial Transformation

Load data and transform the default.payment.next.month variable into a nominal (factor) variable:

library(ggplot2)
cc <- read.csv("a2/UCI_Credit_Card.csv")
cc$default.payment.next.month <- factor(cc$default.payment.next.month,levels=c(0,1), labels=c("No","Yes"))
head(cc, 5)
##   ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5
## 1  1     20000   2         2        1  24     2     2    -1    -1    -2
## 2  2    120000   2         2        2  26    -1     2     0     0     0
## 3  3     90000   2         2        2  34     0     0     0     0     0
## 4  4     50000   2         2        1  37     0     0     0     0     0
## 5  5     50000   1         2        1  57    -1     0    -1     0     0
##   PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
## 1    -2      3913      3102       689         0         0         0
## 2     2      2682      1725      2682      3272      3455      3261
## 3     0     29239     14027     13559     14331     14948     15549
## 4     0     46990     48233     49291     28314     28959     29547
## 5     0      8617      5670     35835     20940     19146     19131
##   PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
## 1        0      689        0        0        0        0
## 2        0     1000     1000     1000        0     2000
## 3     1518     1500     1000     1000     1000     5000
## 4     2000     2019     1200     1100     1069     1000
## 5     2000    36681    10000     9000      689      679
##   default.payment.next.month
## 1                        Yes
## 2                        Yes
## 3                         No
## 4                         No
## 5                         No

A.2. Demographic Variables

SEX & MARRIAGE

Produce a distribution for SEX variable and color code the distribution with default.payment.next.month

ggplot(cc, aes(x=SEX, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))

  • For the sex, 1 stands for male and 2 for female

  • Independent, mutually exlusive events

  • Default rates are very close to each other for males and females

  • However, there is significantly more females than males

  • Therefore, males have a higher probability of default compared to female clients (24.17% vs. 20.77%)

Produce a distribution for MARRIAGE variable and color code the distribution with default.payment.next.month

ggplot(cc, aes(x=MARRIAGE, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))

  • Marital status (1=married, 2=single, 3=others)

  • One anomalY: has a label 0 that is undocumented

  • Independent, mutually exlusive events

  • The majority of clients are married or single.

  • Among married and single, the default rate is almost the same

  • However, there is significantly more single clients than married ones

  • Therefore, married clients have a higher probability of default

A.3. Payment Status Variables

PAY_0, PAY_3, PAY_5

ggplot(cc, aes(x=PAY_0, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))

  • PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)

  • An undocumented labels present: -2, 0. If -2 = pay duly, -1=payment delay for one month, 0=payment delay for two months, etc, the probability of default becomes higher and higher with each month of payment delay after three months of payment delay

ggplot(cc, aes(x=PAY_3, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))

  • PAY_3: Repayment status in July, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)

  • An undocumented labels again present: -2, 0.

  • No events associated with the label 1, which is very unlikely

  • If -2 = pay duly, -1=payment delay for one month, 0=payment delay for two months, etc, the probability of default becomes higher and higher with each month of payment delay after four months of payment delay (no data for three month payment delay)

ggplot(cc, aes(x=PAY_5, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))

  • PAY_5: Repayment status in May, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)

  • The same problems as above:

  • An undocumented labels again present: -2, 0.

  • No instances for the label 1, which is very unlikely

  • If -2 = pay duly, -1=payment delay for one month, 0=payment delay for two months, etc, the probability of default becomes higher and higher with each month of payment delay after four months of payment delay (no data for three month payment delay)

  • If a client have payment delay for three/four or more months, his/her probability of default become higher with every consecutive month of payment delay

A.4. Transforming Nominal Variables

Demographic variables, such as SEX, EDUCATION, and MARRIAGE, and payment delay variables should be nominal(categorical)

Transforming demographic variables into nominal values

cc$MARRIAGE <- factor(cc$MARRIAGE,levels=c(1,2,3), labels=c("Married", "Single", "Other"))

cc$SEX <- factor(cc$SEX,levels=c(1,2), labels=c("Male", "Female"))

cc$EDUCATION <- factor(cc$EDUCATION,levels=c(1,2,3,4,5,6), labels=c("Graduate school", "University", "High school", "Others", "Unknown", "Unknown"))

Transforming payment status variables into nominal values:

cc$PAY_0  <- factor(cc$PAY_0 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months",  "Delay 5 months", "Delay 6 months", "Delay 7 months",  "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))

cc$PAY_2  <- factor(cc$PAY_2 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months",  "Delay 5 months", "Delay 6 months", "Delay 7 months",  "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))

cc$PAY_3  <- factor(cc$PAY_3 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months",  "Delay 5 months", "Delay 6 months", "Delay 7 months",  "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))

cc$PAY_4  <- factor(cc$PAY_4 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months",  "Delay 5 months", "Delay 6 months", "Delay 7 months",  "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))

cc$PAY_5  <- factor(cc$PAY_5 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months",  "Delay 5 months", "Delay 6 months", "Delay 7 months",  "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))

cc$PAY_6  <- factor(cc$PAY_6 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "Delay 1 month", "Delay 2 months", "Delay 3 months", "Delay 4 months",  "Delay 5 months", "Delay 6 months", "Delay 7 months",  "Delay 8 months", "Delay 9 months", "Delay 10 months", "Delay 11 months"))

A.5. Selection of Training Data

Selecting 5000 random rows as training data

train <- cc[sample(nrow(cc), 5000), ]

nrow(train)
## [1] 5000
View(train)

A.6. Selection of Testing Data

test <- cc[c(537,26756),]
test
##          ID LIMIT_BAL    SEX       EDUCATION MARRIAGE AGE          PAY_0
## 537     537    440000 Female Graduate school   Single  35 Delay 2 months
## 26756 26756    380000   Male Graduate school   Single  28       Pay duly
##                PAY_2          PAY_3          PAY_4          PAY_5
## 537   Delay 2 months Delay 2 months Delay 2 months Delay 2 months
## 26756       Pay duly       Pay duly       Pay duly       Pay duly
##                PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5
## 537   Delay 2 months    330759    337794    274075    252584    216547
## 26756       Pay duly      1294      1058       359       359       359
##       BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
## 537      179221    13100    11000    10000     9000     7100     5500
## 26756       509     1058      359      359      359      509      359
##       default.payment.next.month
## 537                           No
## 26756                         No

B. Data Classification

B.1. Naive Bayes using Demographic Variables

Building a Naive Bayes model using the nominal demographic variables as predictors:

library(e1071)
nbDem <- naiveBayes(default.payment.next.month ~ SEX + EDUCATION + MARRIAGE, train)
nbDem
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     No    Yes 
## 0.7736 0.2264 
## 
## Conditional probabilities:
##      SEX
## Y          Male    Female
##   No  0.3870217 0.6129783
##   Yes 0.4266784 0.5733216
## 
##      EDUCATION
## Y     Graduate school  University High school      Others     Unknown
##   No      0.366528712 0.460682876 0.154164511 0.005173306 0.013450595
##   Yes     0.300353357 0.504416961 0.190812721 0.001766784 0.002650177
## 
##      MARRIAGE
## Y        Married     Single      Other
##   No  0.44061203 0.54745851 0.01192946
##   Yes 0.48851590 0.49381625 0.01766784
  • Conditional probability of default is higher for female clients than it is for males. At first glance, it does not make sense. However, these are not true probabilities. What is important is the ratio between conditional probabilities of YES default and NO default

  • The probability of YES default is higher than the probability of No default for male clients. The opposite is true for females

  • The probability of YES default is higher than the probability of No default for university graduates and for high school graduates

  • The probability of YES default is higher than the probability of No default for married clients

Testing prediction on first row in the test data:

predict(nbDem, test[1,])
## [1] No
## Levels: No Yes

Testing prediction on second row in the test data:

predict(nbDem, test[2,])
## [1] No
## Levels: No Yes
  • The first client is a female who is single and has a graduate school educational level. She is classified as unlikely to have default.

  • For all attributes (SEX, EDUCATION, MARRIAGE), the client has the higher probabilities of NO default in comparison to Yes default

  • The accuracy of prediction is confirmed with actual data. This client did not have default

  • The second client is a male who is single and has a graduate school educational level. He is classified as unlikely to have default, too.

  • Only for one attribute (SEX), the second client has the higher probability of YES default in comparison to No default

  • The accuracy of prediction is confirmed with actual data. This client did not have default

B.2. Naive Bayes using Payment Status

Building a Naive Bayes model using THREE payment status variables as predictors:

nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train)
nbPay
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     No    Yes 
## 0.7736 0.2264 
## 
## Conditional probabilities:
##      PAY_0
## Y         Pay duly Delay 1 month Delay 2 months Delay 3 months
##   No  0.1013443640  0.2029472596   0.5377456050   0.1168562565
##   Yes 0.0477031802  0.1590106007   0.2941696113   0.1775618375
##      PAY_0
## Y     Delay 4 months Delay 5 months Delay 6 months Delay 7 months
##   No    0.0346432265   0.0046535677   0.0007755946   0.0002585315
##   Yes   0.2712014134   0.0388692580   0.0070671378   0.0017667845
##      PAY_0
## Y     Delay 8 months Delay 9 months Delay 10 months Delay 11 months
##   No    0.0005170631   0.0000000000    0.0002585315    0.0000000000
##   Yes   0.0008833922   0.0008833922    0.0008833922    0.0000000000
## 
##      PAY_2
## Y         Pay duly Delay 1 month Delay 2 months Delay 3 months
##   No  0.1372802482  0.2213029990   0.5486039297   0.0018097208
##   Yes 0.0901060071  0.1669611307   0.4010600707   0.0000000000
##      PAY_2
## Y     Delay 4 months Delay 5 months Delay 6 months Delay 7 months
##   No    0.0840227508   0.0046535677   0.0015511892   0.0005170631
##   Yes   0.2985865724   0.0291519435   0.0106007067   0.0017667845
##      PAY_2
## Y     Delay 8 months Delay 9 months Delay 10 months Delay 11 months
##   No    0.0000000000   0.0002585315    0.0000000000    0.0000000000
##   Yes   0.0008833922   0.0008833922    0.0000000000    0.0000000000
## 
##      PAY_3
## Y         Pay duly Delay 1 month Delay 2 months Delay 3 months
##   No  0.1494312306  0.2107032058   0.5491209928   0.0005170631
##   Yes 0.0971731449  0.1501766784   0.4231448763   0.0000000000
##      PAY_3
## Y     Delay 4 months Delay 5 months Delay 6 months Delay 7 months
##   No    0.0842812823   0.0038779731   0.0015511892   0.0002585315
##   Yes   0.2950530035   0.0229681979   0.0070671378   0.0008833922
##      PAY_3
## Y     Delay 8 months Delay 9 months Delay 10 months Delay 11 months
##   No    0.0002585315   0.0000000000    0.0000000000    0.0000000000
##   Yes   0.0017667845   0.0017667845    0.0000000000    0.0000000000
  • The probability of YES default is higher than the probability of No default for payments delays of three or more months for each payment status variables

  • These findings confirmed the observations that were made in the distribution analysis

  • If a client have payment delay for three or more months, his/her probability of YES default is higher than the probability of No default

Testing prediction on first row in the test data:

predict(nbPay, test[1,])
## [1] No
## Levels: No Yes

Testing prediction on second row in the test data:

predict(nbPay, test[2,])
## [1] No
## Levels: No Yes
  • The first client has 2 month payment delay. She is classified as unlikely to have default.

  • For all attributes (PAY_0,PAY_2,PAY_3), the client has the higher probabilities of NO default in comparison to Yes default

  • The accuracy of prediction is confirmed with actual data. This client did not have default

  • The second client has no payment delays at all. He is classified as unlikely to have default, too.

  • The accuracy of prediction is confirmed with actual data. This client did not have default

B.3. Smoothed Naive Bayes using Payment Status

nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train, laplace=1.5)
predict(nbPay, test[1,])
## [1] No
## Levels: No Yes
  • Laplace smoothing did not produce different prediction for the client 1

  • She is still classified as unlikely to have default.

  • Testing prediction on second row in the test data

nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train, laplace=1.5)
predict(nbPay, test[2,])
## [1] No
## Levels: No Yes
  • Laplace smoothing did not produce different prediction for the second client

  • He is still classified as unlikely to have default.

C. Classification with Decision Tree

C.1. Basic Decision Tree

Building a decision tree using up to three payment status variables as predictors:

library("rpart")
library("rpart.plot")
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3,
            method="class",
            data=train, parms=list(split='information'), 
            minsplit=20, cp=0.02)

Visualization of the decision tree:

rpart.plot(dtPay, type=4, extra=1)

  • The decision stump: a decision tree with the root immediately connected to the leafs

  • Intermediate steps are missing

  • A decision stump makes predictions based on the value of a single variable

  • No proper visualization of the chosen attributes and their values

  • The outcome “default” is either YES (meaning the client will have default) or NO (meaning the client won’t have default)

Testing prediction on the first row in the test data:

predict(nbPay, test[1,])
## [1] No
## Levels: No Yes

Testing prediction on the second row in the test data:

predict(nbPay, test[2,])
## [1] No
## Levels: No Yes
  • Both clients are classified as unlikely to have default.

  • The accuracy of predictions are confirmed with actual data. These clients did not have default

C.2. Decision Tree with a Different Complexity Parameter

Rebuilding the decision tree with a smaller cp:

dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3,
            method="class",
            data=train, parms=list(split='information'), 
            minsplit=20, cp=0.001)

Visualization of the decision tree:

rpart.plot(dtPay, type=4, extra=1)

  • The PAY_0 is chosen for the 1st split of the decision tree

  • The root of the tree shows that the overall fraction of the clients who have NO default is 3,901 out of the total population of 5,000

  • For the attribite PAY_O, 3,748 clients, who pay duly or who have payment delays of 1, 2, 3, and 7 months, have NO default

  • The right node represents the rest of the client population, for which the outcome of is YES default

  • For the attribite PAY_O, 381 clients, who have payment delays of 4, 5, 6, 8, 9 and 10 months, have YES default

  • This node further splits into two nodes based on the PAY_3 values. If the PAY_3 value is either 1 month delay or 9 month delay, then 19 out of 33 of the clients have NO default.

  • This node further splits into two nodes based on the PAY_2 values. If the PAY_2 value is either 2 month delay or 4 month delay, then 8 out of 14 of the clients have YES default.

Testing prediction on the first row in the test data:

predict(nbPay, test[1,])
## [1] No
## Levels: No Yes

Testing prediction on the second row in the test data:

predict(nbPay, test[2,])
## [1] No
## Levels: No Yes
  • Both clients are classified as unlikely to have default.

  • No changes

D. Conclusion

  • All models showed the same results. All predictions on two rows of test data were correct (NO default)

  • For more comprehensive comparison, more test data are needed to draw

  • However, the Naive Bayes model seems to work better on payment status variables than Decision Tree that recognized different payment attributes as redundant. In the case of Decision Tree, the predictive power could be enhanced by choosing a smaller complexity parameter

  • The variables “Age” can be combined with other demographic variables to predict  default payment for the next month.  For this, the range of years should be divided into intervals and then converted to categorical values
  • Other data that could potentially enhance the accuracy of classification models include the amount of the bill statement, the amount of previous payments, and limits of the credit line. All this data are continuous variables. Therefore, they should be divided into intervals before their conversion to nominal values

  • It would be interesting to test the classification model that incorporates all available data: the limit of credit line, demographic variables, the amount of the bill, payment status, the amount of the previous payment