Victor Enchautegui

10/25/2020


Data Preparation

A.1. Data Loading and Initial Transformation

Load data and transform the default.payment.next.month variable into a nominal (factor) variable:

library(ggplot2)
cc <- read.csv("data/UCI_Credit_Card.csv")
cc$default.payment.next.month <- factor(cc$default.payment.next.month,levels=c(0,1), labels=c("No","Yes"))

A.2. Demographic Variables

Education

ggplot(cc, aes(x=EDUCATION, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))

Analysis: EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown). Majority of the population has an university education or greater. This variable will help with our analysis as it seems there is a relationship with university level education and defaulting.


Age

ggplot(cc, aes(x=AGE, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))

Analysis: AGE: Age in years. There seems to be a relationship between the younger age group and defaulting, however this can be due to the fact the population sample is heavy on the left.


A.3. Payment Status Variables

PAY_0

ggplot(cc, aes(x=PAY_0, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))

Analysis: PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above). However, there are two unnamed labels that are represented within the data, which are -2 and 0. If -1 = ‘paid on-time’, what 0 represent? I read two thread on Kaggle for this issue, and it continues to be unclear and the data didn’t match the responses. For the sake of this analysis, I will retrofit having -2 start as “Pay duly”.

I picked this pay statement as it is the initial statement and there is a focused median.

PAY_3

ggplot(cc, aes(x=PAY_3, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))

Analysis: There is a spike increase in label 0 of defaulted claims. This may mean that the probability of having consecutive ‘no default’ decreases over time.

PAY_5

ggplot(cc, aes(x=PAY_5, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("pink", "red"))

Analysis: Similar analysis as PAY_3, however there is an increase in -2, -1, 0 and a decrease in 2.


A.4. Transforming Nominal Variables

Transform related demographic variables into nominal values with proper labels:

cc$AGE <- cut(cc$AGE,breaks=c(0,20,30,40,50,60,70,80,100), labels=c("<20s","20s", "30s", "40s", "50s", "60s", "70s", "80s+")) 
cc$AGE <- factor(cc$AGE) 
cc$EDUCATION <- factor(cc$EDUCATION,levels=c(1,2,3,4,5,6), labels=c("Grad School", "University", "High School", "Others", "Unknown", "Unknown"))

Transformed the additional variables to help with section B.2 of the assignment:

cc$PAY_0  <- factor(cc$PAY_0 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "1mo", "2mos", "3mos", "4mos",  "5mos", "6mos", "7mos",  "8mos", "9mos", "10mos", "11mos"))
cc$PAY_2  <- factor(cc$PAY_2 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "1mo", "2mos", "3mos", "4mos",  "5mos", "6mos", "7mos",  "8mos", "9mos", "10mos", "11mos"))
cc$PAY_3  <- factor(cc$PAY_3 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "1mo", "2mos", "3mos", "4mos",  "5mos", "6mos", "7mos",  "8mos", "9mos", "10mos", "11mos"))

View Education:

View(cc$EDUCATION) 

View Age:

View(cc$AGE)


A.5. Selection of Training Data

train <- cc[sample(nrow(cc), 5000), ] 

Check train data:

nrow(train)
[1] 5000

A.6. Selection of Testing Data

test <- cc[c(18,3600),]
test

Data Classification

B.1. Naive Bayes using Demographic Variables

library(e1071)
nbDem <- naiveBayes(default.payment.next.month ~ SEX + EDUCATION + AGE, train)
nbDem

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
   No   Yes 
0.764 0.236 

Conditional probabilities:
     SEX
Y         [,1]      [,2]
  No  1.614398 0.4868009
  Yes 1.571186 0.4951164

     EDUCATION
Y      Grad School   University  High School       Others      Unknown
  No  0.3592563498 0.4642576591 0.1550144017 0.0060225190 0.0154490704
  Yes 0.3033898305 0.5076271186 0.1838983051 0.0008474576 0.0042372881

     AGE
Y             20s         30s         40s         50s         60s         70s
  No  0.365706806 0.374345550 0.191623037 0.060471204 0.007853403 0.000000000
  Yes 0.372033898 0.315254237 0.209322034 0.091525424 0.011864407 0.000000000

SEX: The probability of defaulting is higher for women than it is for men. However it does not make sense as the probability of not defaulting is higher for women than it is for men.

EDUCATION: The probability of defaulting is higher for University than it is for others. However it does not make sense as the probability of not defaulting is higher for University than it is for others

AGE: The probability of defaulting is higher for younger individuals than it is for others. However it does not make sense as the probability of not defaulting is higher for younger individuals than it is for others

At first, the data does not make sense when comparing the ‘Yes’ and ‘No’, however the model created the conditional probability for each feature separately; not comparing it to the total distribution:

P(Female | no default) P(Male | no default) P(Female | default) P(Male | default)

Run model on first row of test data:

predict(nbDem, test[1,])
[1] No
Levels: No Yes

Run model on second row of test data:

predict(nbDem, test[2,])
[1] No
Levels: No Yes
The predictions are correct for both as probability of defaulting is low for both individuals for each conditional (except for the sex conditional which slightly higher to default for women). Looking at the overall probability of not defaulting is skewed to not defaulting by 78%, so the predictions are correct.

B.2. Naive Bayes using Demographic Variables

nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train)
nbPay

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
   No   Yes 
0.764 0.236 

Conditional probabilities:
     PAY_0
Y         Pay duly          1mo         2mos         3mos         4mos         5mos         6mos         7mos         8mos         9mos        10mos        11mos
  No  0.1005235602 0.1965968586 0.5599476440 0.1036649215 0.0345549738 0.0036649215 0.0002617801 0.0002617801 0.0000000000 0.0002617801 0.0002617801 0.0000000000
  Yes 0.0491525424 0.1457627119 0.3101694915 0.1872881356 0.2593220339 0.0381355932 0.0067796610 0.0016949153 0.0000000000 0.0000000000 0.0016949153 0.0000000000

     PAY_2
Y         Pay duly          1mo         2mos         3mos         4mos         5mos         6mos         7mos         8mos         9mos        10mos        11mos
  No  0.1311518325 0.2172774869 0.5657068063 0.0002617801 0.0798429319 0.0036649215 0.0013089005 0.0000000000 0.0005235602 0.0002617801 0.0000000000 0.0000000000
  Yes 0.0949152542 0.1415254237 0.3991525424 0.0016949153 0.3313559322 0.0220338983 0.0059322034 0.0008474576 0.0008474576 0.0016949153 0.0000000000 0.0000000000

     PAY_3
Y         Pay duly          1mo         2mos         3mos         4mos         5mos         6mos         7mos         8mos         9mos        10mos        11mos
  No  0.1350785340 0.2185863874 0.5541884817 0.0000000000 0.0861256545 0.0031413613 0.0013089005 0.0005235602 0.0002617801 0.0005235602 0.0002617801 0.0000000000
  Yes 0.1161016949 0.1313559322 0.4228813559 0.0008474576 0.3000000000 0.0203389831 0.0025423729 0.0016949153 0.0025423729 0.0016949153 0.0000000000 0.0000000000
For ‘Pay duly’ and payments delay for 1 or 2 month, the probability of defaulting is higher than not defaulting. However, this switches after 3 or more months for each pay statement (variables). If the client has more than 3 months of late payments, there’s a probability of defaulting again.

Run model on first row of test data:

predict(nbPay, test[1,])
[1] No
Levels: No Yes

Run model on second row of test data:

predict(nbPay, test[2,])
[1] No
Levels: No Yes

Analysis: The first individual had 2 months of delayed payments, however he made payment to avoid being delayed 3 months. He is was marked as ‘Pay duly’, which corresponds with the prediction.

The second individuals was always on-time with her payments and she continued to be on-time, which corresponds with the prediction.

After reviewing both individuals, the modal has proven to predicted correctly.

B.3. Smoothed Naive Bayes using Payment Status

nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train, laplace=1.5)
predict(nbPay, test[1,])
[1] No
Levels: No Yes
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train, laplace=1.5)
predict(nbPay, test[2,])
[1] No
Levels: No Yes
Laplace smoothing did not produce any differences within the predictions for both individuals.

Classification with Decision Tree

C.1. Basic Decision Tree

library("rpart")
library("rpart.plot")
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3,
            method="class",
            data=train, parms=list(split='information'), 
            minsplit=20, cp=0.02)
rpart.plot(dtPay, type=4, extra=1)

Analysis: The decision tree visualization provides a decision stump (one-level decision tree). A decision stump makes a prediction based on the value of just a single input feature. There is additional variables to consider that will improve this decision tree.

Run model on first row of test data:

predict(nbPay, test[1,])
[1] No
Levels: No Yes

Run model on second row of test data:

predict(nbPay, test[2,])
[1] No
Levels: No Yes
Analysis: Both individuals are unlikely to default. After reviewing the actual results, the predictions are correct that both individuals did not default. The probability of not defaulting is 78%.

C.2. Decision Tree with a Different Complexity Parameter

dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3,
            method="class",
            data=train, parms=list(split='information'), 
            minsplit=20, cp=0.001)
rpart.plot(dtPay, type=4, extra=1)

Analysis: This decision tree shows all the possible branches and likelihood of path based upon the sample size of 5,000 individuals.

The left side represent individuals that did not default; right side represent those that defaulted.

Following to the root of the tree, only 3,476 (70%) of the 5,000 population did not defaulted consecutively.

Run model on first row of test data:

predict(nbPay, test[1,])
[1] No
Levels: No Yes

Run model on second row of test data:

predict(nbPay, test[2,])
[1] No
Levels: No Yes

Analysis: Both individuals are unlikely to have default.

Conclusion

Both models predictions were correct. However, I feel there was a large population of “No defaults” which made it feel too obvious too predict.

As for the two models, I prefer to utilize the Naive Bayes model as the data was easily to understand and read from this workbook. With the Decision Tree, I had to perform several methods to zoom into the decision tree to understand how to use it.

Issues with the data. So, there is no clear understanding of -2 and 0. I have read the threads on Kaggle, however the suggested commentary did not aligned with the data points. Also, how can 110 individuals (for example: 46, 275, 2298, 2538, 3151) with ZERO for all billing statements and marked as ‘default’ for next month? It does not make sense.

---
title: "INFO 659 Assignment #2 (10 points)"
output: html_notebook
---
<h3>Victor Enchautegui</h3>
<h4>10/25/2020</h4>
<br>
<h2><b>Data Preparation</b></h2>
<h3><b>A.1. Data Loading and Initial Transformation</b></h3>
Load data and transform the default.payment.next.month variable into a nominal (factor) variable:
```{r}
library(ggplot2)
cc <- read.csv("data/UCI_Credit_Card.csv")
cc$default.payment.next.month <- factor(cc$default.payment.next.month,levels=c(0,1), labels=c("No","Yes"))

```
<h3><b>A.2. Demographic Variables</b></h3>
<h4><u>Education</u></h4>
```{r}
ggplot(cc, aes(x=EDUCATION, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))
```
<b>Analysis:</b>
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown). Majority of the population has an university education or greater. This variable will help with our analysis as it seems there is a relationship with university level education and defaulting. 

<br>
<h4><u>Age</u></h4>
```{r}
ggplot(cc, aes(x=AGE, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))
```
<b>Analysis:</b>
AGE: Age in years. There seems to be a relationship between the younger age group and defaulting, however this can be due to the fact the population sample is heavy on the left.

<br>
<h3><b>A.3. Payment Status Variables</b></h3>
<h4><u>PAY_0</u></h4>
```{r}
ggplot(cc, aes(x=PAY_0, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))
```
<b>Analysis:</b>
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above). However, there are two unnamed labels that are represented within the data, which are -2 and 0. If -1 = 'paid on-time', what 0 represent? I read two thread on Kaggle for this issue, and it continues to be unclear and the data didn't match the responses.  For the sake of this analysis, I will retrofit having -2 start as "Pay duly".

I picked this pay statement as it is the initial statement and there is a focused median. 
<br>
<h4><u>PAY_3</u></h4>
```{r}
ggplot(cc, aes(x=PAY_3, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))
```
<b>Analysis:</b>
There is a spike increase in label 0 of defaulted claims. This may mean that the probability of having consecutive 'no default' decreases over time.
<br>
<h4><u>PAY_5</u></h4>
```{r}
ggplot(cc, aes(x=PAY_5, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("pink", "red"))
```
<b>Analysis:</b>
Similar analysis as PAY_3, however there is an increase in -2, -1, 0 and a decrease in 2.

<br>

<h3><b>A.4. Transforming Nominal Variables</b></h3>
<h4>Transform related demographic variables into nominal values with proper labels:</h4>
```{r}
cc$AGE <- cut(cc$AGE,breaks=c(0,20,30,40,50,60,70,80,100), labels=c("<20s","20s", "30s", "40s", "50s", "60s", "70s", "80s+")) 
```
```{r}
cc$AGE <- factor(cc$AGE) 
```
```{r}
cc$EDUCATION <- factor(cc$EDUCATION,levels=c(1,2,3,4,5,6), labels=c("Grad School", "University", "High School", "Others", "Unknown", "Unknown"))
```

Transformed the additional variables to help with section B.2 of the assignment:
```{r}
cc$PAY_0  <- factor(cc$PAY_0 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "1mo", "2mos", "3mos", "4mos",  "5mos", "6mos", "7mos",  "8mos", "9mos", "10mos", "11mos"))
```
```{r}
cc$PAY_2  <- factor(cc$PAY_2 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "1mo", "2mos", "3mos", "4mos",  "5mos", "6mos", "7mos",  "8mos", "9mos", "10mos", "11mos"))
```
```{r}
cc$PAY_3  <- factor(cc$PAY_3 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "1mo", "2mos", "3mos", "4mos",  "5mos", "6mos", "7mos",  "8mos", "9mos", "10mos", "11mos"))
```
<br>
<h4>View Education:</h4>
```{r}
View(cc$EDUCATION) 
```
<br>
<h4>View Age:</h4>
```{r}
View(cc$AGE)
```
<br>

<h3><b>A.5. Selection of Training Data</b></h3>
```{r}
train <- cc[sample(nrow(cc), 5000), ] 
```
<h4>Check train data:</h4>
```{r}
nrow(train)
```

<h3><b>A.6. Selection of Testing Data</b></h3>
```{r}
test <- cc[c(18,3600),]
test
```


<h2><b>Data Classification</b></h2>
<h3><b>B.1. Naive Bayes using Demographic Variables</b></h3>
```{r}
library(e1071)
nbDem <- naiveBayes(default.payment.next.month ~ SEX + EDUCATION + AGE, train)
nbDem
```
<b><u>SEX:</u></b> The probability of defaulting is higher for women than it is for men. However it does not make sense as the probability of not defaulting is higher for women than it is for men. 

<b><u>EDUCATION:</u></b> The probability of defaulting is higher for University than it is for others. However it does not make sense as the probability of not defaulting is higher for University than it is for others 

<b><u>AGE:</u></b> The probability of defaulting is higher for younger individuals than it is for others. However it does not make sense as the probability of not defaulting is higher for younger individuals than it is for others 

At first, the data does not make sense when comparing the 'Yes' and 'No', however the model created the conditional probability for each feature separately; not comparing it to the total distribution:

P(Female | no default)     P(Male | no default) 
P(Female | default)        P(Male | default)
<br>
<h4><b>Run model on first row of test data:</b></h4>
```{r}
predict(nbDem, test[1,])
```
<br>
<h4><b>Run model on second row of test data:</b></h4>
```{r}
predict(nbDem, test[2,])
```
The predictions are correct for both as probability of defaulting is low for both individuals for each conditional (except for the sex conditional which slightly higher to default for women). Looking at the overall probability of not defaulting is skewed to not defaulting by 78%, so the predictions are correct.
<br>
<h3><b>B.2. Naive Bayes using Demographic Variables</b></h3>
```{r}
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train)
nbPay
```
For 'Pay duly' and payments delay for 1 or 2 month, the probability of defaulting is higher than not defaulting. However, this switches after 3 or more months for each pay statement (variables). If the client has more than 3 months of late payments, there's a probability of defaulting again.
<br>
<h4><b>Run model on first row of test data:</b></h4>
```{r}
predict(nbPay, test[1,])
```
<h4><b>Run model on second row of test data:</b></h4>
```{r}
predict(nbPay, test[2,])
```
<b>Analysis:</b>
The first individual had 2 months of delayed payments, however he made payment to avoid being delayed 3 months. He is was marked as 'Pay duly', which corresponds with the prediction.

The second individuals was always on-time with her payments and she continued to be on-time, which corresponds with the prediction.

After reviewing both individuals, the modal has proven to predicted correctly. 
<br>
<h3><b>B.3. Smoothed Naive Bayes using Payment Status</b></h3>

```{r}
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train, laplace=1.5)
predict(nbPay, test[1,])
```
```{r}
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train, laplace=1.5)
predict(nbPay, test[2,])
```
Laplace smoothing did not produce any differences within the predictions for both individuals.
<h2><b>Classification with Decision Tree</b></h2>
<h3><b>C.1. Basic Decision Tree</b></h3>
```{r}
library("rpart")
library("rpart.plot")
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3,
            method="class",
            data=train, parms=list(split='information'), 
            minsplit=20, cp=0.02)
```
```{r}
rpart.plot(dtPay, type=4, extra=1)
```
<b>Analysis:</b>
The decision tree visualization provides a decision stump (one-level decision tree). A decision stump makes a prediction based on the value of just a single input feature. There is additional variables to consider that will improve this decision tree.
<br>
<h4><b>Run model on first row of test data:</b></h4>
```{r}
predict(nbPay, test[1,])
```
<h4><b>Run model on second row of test data:</b></h4>
```{r}
predict(nbPay, test[2,])
```
<b>Analysis:</b>
Both individuals are unlikely to default. After reviewing the actual results, the predictions are correct that both individuals did not default. The probability of not defaulting is 78%. 
<br>
<h3><b>C.2. Decision Tree with a Different Complexity Parameter</b></h3>
```{r}
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3,
            method="class",
            data=train, parms=list(split='information'), 
            minsplit=20, cp=0.001)
```

```{r}
rpart.plot(dtPay, type=4, extra=1)
```
<b>Analysis:</b>
This decision tree shows all the possible branches and likelihood of path based upon the sample size of 5,000 individuals.

The left side represent individuals that did not default; right side represent those that defaulted.

Following to the root of the tree, only 3,476 (70%) of the 5,000 population did not defaulted consecutively.
<br>
<h4><b>Run model on first row of test data:</b></h4>
```{r}
predict(nbPay, test[1,])
```
<h4><b>Run model on second row of test data:</b></h4>
```{r}
predict(nbPay, test[2,])
```
<b>Analysis:</b>
Both individuals are unlikely to have default.

<h2><b>Conclusion</b></h2>
Both models predictions were correct. However, I feel there was a large population of "No defaults" which made it feel too obvious too predict.

As for the two models, I prefer to utilize the Naive Bayes model as the data was easily to understand and read from this workbook. With the Decision Tree, I had to perform several methods to zoom into the decision tree to understand how to use it.

Issues with the data. So, there is no clear understanding of -2 and 0. I have read the threads on Kaggle, however the suggested commentary did not aligned with the data points. Also, how can 110 individuals (for example: 46, 275, 2298, 2538, 3151) with ZERO for all billing statements and marked as 'default' for next month? It does not make sense. 
