Victor Enchautegui
10/25/2020
Data Preparation
A.1. Data Loading and Initial Transformation
Load data and transform the default.payment.next.month variable into a nominal (factor) variable:
library(ggplot2)
cc <- read.csv("data/UCI_Credit_Card.csv")
cc$default.payment.next.month <- factor(cc$default.payment.next.month,levels=c(0,1), labels=c("No","Yes"))
A.2. Demographic Variables
Education
ggplot(cc, aes(x=EDUCATION, fill=default.payment.next.month, color=default.payment.next.month)) +
geom_histogram(binwidth=1, position="stack") +
scale_color_manual(values=c("black","black")) +
scale_fill_manual(values=c("darkolivegreen4", "red"))

Analysis: EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown). Majority of the population has an university education or greater. This variable will help with our analysis as it seems there is a relationship with university level education and defaulting.
Age
ggplot(cc, aes(x=AGE, fill=default.payment.next.month, color=default.payment.next.month)) +
geom_histogram(binwidth=1, position="stack") +
scale_color_manual(values=c("black","black")) +
scale_fill_manual(values=c("darkolivegreen4", "red"))

Analysis: AGE: Age in years. There seems to be a relationship between the younger age group and defaulting, however this can be due to the fact the population sample is heavy on the left.
A.3. Payment Status Variables
PAY_0
ggplot(cc, aes(x=PAY_0, fill=default.payment.next.month, color=default.payment.next.month)) +
geom_histogram(binwidth=1, position="stack") +
scale_color_manual(values=c("black","black")) +
scale_fill_manual(values=c("darkolivegreen4", "red"))

Analysis: PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above). However, there are two unnamed labels that are represented within the data, which are -2 and 0. If -1 = ‘paid on-time’, what 0 represent? I read two thread on Kaggle for this issue, and it continues to be unclear and the data didn’t match the responses. For the sake of this analysis, I will retrofit having -2 start as “Pay duly”.
I picked this pay statement as it is the initial statement and there is a focused median.
PAY_3
ggplot(cc, aes(x=PAY_3, fill=default.payment.next.month, color=default.payment.next.month)) +
geom_histogram(binwidth=1, position="stack") +
scale_color_manual(values=c("black","black")) +
scale_fill_manual(values=c("darkolivegreen4", "red"))

Analysis: There is a spike increase in label 0 of defaulted claims. This may mean that the probability of having consecutive ‘no default’ decreases over time.
PAY_5
ggplot(cc, aes(x=PAY_5, fill=default.payment.next.month, color=default.payment.next.month)) +
geom_histogram(binwidth=1, position="stack") +
scale_color_manual(values=c("black","black")) +
scale_fill_manual(values=c("pink", "red"))

Analysis: Similar analysis as PAY_3, however there is an increase in -2, -1, 0 and a decrease in 2.
A.4. Transforming Nominal Variables
Transform related demographic variables into nominal values with proper labels:
cc$AGE <- cut(cc$AGE,breaks=c(0,20,30,40,50,60,70,80,100), labels=c("<20s","20s", "30s", "40s", "50s", "60s", "70s", "80s+"))
cc$AGE <- factor(cc$AGE)
cc$EDUCATION <- factor(cc$EDUCATION,levels=c(1,2,3,4,5,6), labels=c("Grad School", "University", "High School", "Others", "Unknown", "Unknown"))
Transformed the additional variables to help with section B.2 of the assignment:
cc$PAY_0 <- factor(cc$PAY_0 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "1mo", "2mos", "3mos", "4mos", "5mos", "6mos", "7mos", "8mos", "9mos", "10mos", "11mos"))
cc$PAY_2 <- factor(cc$PAY_2 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "1mo", "2mos", "3mos", "4mos", "5mos", "6mos", "7mos", "8mos", "9mos", "10mos", "11mos"))
cc$PAY_3 <- factor(cc$PAY_3 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "1mo", "2mos", "3mos", "4mos", "5mos", "6mos", "7mos", "8mos", "9mos", "10mos", "11mos"))
View Education:
View(cc$EDUCATION)
View Age:
View(cc$AGE)
A.5. Selection of Training Data
train <- cc[sample(nrow(cc), 5000), ]
Check train data:
nrow(train)
[1] 5000
A.6. Selection of Testing Data
test <- cc[c(18,3600),]
test
Data Classification
B.1. Naive Bayes using Demographic Variables
library(e1071)
nbDem <- naiveBayes(default.payment.next.month ~ SEX + EDUCATION + AGE, train)
nbDem
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
No Yes
0.764 0.236
Conditional probabilities:
SEX
Y [,1] [,2]
No 1.614398 0.4868009
Yes 1.571186 0.4951164
EDUCATION
Y Grad School University High School Others Unknown
No 0.3592563498 0.4642576591 0.1550144017 0.0060225190 0.0154490704
Yes 0.3033898305 0.5076271186 0.1838983051 0.0008474576 0.0042372881
AGE
Y 20s 30s 40s 50s 60s 70s
No 0.365706806 0.374345550 0.191623037 0.060471204 0.007853403 0.000000000
Yes 0.372033898 0.315254237 0.209322034 0.091525424 0.011864407 0.000000000
SEX: The probability of defaulting is higher for women than it is for men. However it does not make sense as the probability of not defaulting is higher for women than it is for men.
EDUCATION: The probability of defaulting is higher for University than it is for others. However it does not make sense as the probability of not defaulting is higher for University than it is for others
AGE: The probability of defaulting is higher for younger individuals than it is for others. However it does not make sense as the probability of not defaulting is higher for younger individuals than it is for others
At first, the data does not make sense when comparing the ‘Yes’ and ‘No’, however the model created the conditional probability for each feature separately; not comparing it to the total distribution:
P(Female | no default) P(Male | no default) P(Female | default) P(Male | default)
Run model on first row of test data:
predict(nbDem, test[1,])
[1] No
Levels: No Yes
Run model on second row of test data:
predict(nbDem, test[2,])
[1] No
Levels: No Yes
The predictions are correct for both as probability of defaulting is low for both individuals for each conditional (except for the sex conditional which slightly higher to default for women). Looking at the overall probability of not defaulting is skewed to not defaulting by 78%, so the predictions are correct.
B.2. Naive Bayes using Demographic Variables
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train)
nbPay
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
No Yes
0.764 0.236
Conditional probabilities:
PAY_0
Y Pay duly 1mo 2mos 3mos 4mos 5mos 6mos 7mos 8mos 9mos 10mos 11mos
No 0.1005235602 0.1965968586 0.5599476440 0.1036649215 0.0345549738 0.0036649215 0.0002617801 0.0002617801 0.0000000000 0.0002617801 0.0002617801 0.0000000000
Yes 0.0491525424 0.1457627119 0.3101694915 0.1872881356 0.2593220339 0.0381355932 0.0067796610 0.0016949153 0.0000000000 0.0000000000 0.0016949153 0.0000000000
PAY_2
Y Pay duly 1mo 2mos 3mos 4mos 5mos 6mos 7mos 8mos 9mos 10mos 11mos
No 0.1311518325 0.2172774869 0.5657068063 0.0002617801 0.0798429319 0.0036649215 0.0013089005 0.0000000000 0.0005235602 0.0002617801 0.0000000000 0.0000000000
Yes 0.0949152542 0.1415254237 0.3991525424 0.0016949153 0.3313559322 0.0220338983 0.0059322034 0.0008474576 0.0008474576 0.0016949153 0.0000000000 0.0000000000
PAY_3
Y Pay duly 1mo 2mos 3mos 4mos 5mos 6mos 7mos 8mos 9mos 10mos 11mos
No 0.1350785340 0.2185863874 0.5541884817 0.0000000000 0.0861256545 0.0031413613 0.0013089005 0.0005235602 0.0002617801 0.0005235602 0.0002617801 0.0000000000
Yes 0.1161016949 0.1313559322 0.4228813559 0.0008474576 0.3000000000 0.0203389831 0.0025423729 0.0016949153 0.0025423729 0.0016949153 0.0000000000 0.0000000000
For ‘Pay duly’ and payments delay for 1 or 2 month, the probability of defaulting is higher than not defaulting. However, this switches after 3 or more months for each pay statement (variables). If the client has more than 3 months of late payments, there’s a probability of defaulting again.
Run model on first row of test data:
predict(nbPay, test[1,])
[1] No
Levels: No Yes
Run model on second row of test data:
predict(nbPay, test[2,])
[1] No
Levels: No Yes
Analysis: The first individual had 2 months of delayed payments, however he made payment to avoid being delayed 3 months. He is was marked as ‘Pay duly’, which corresponds with the prediction.
The second individuals was always on-time with her payments and she continued to be on-time, which corresponds with the prediction.
After reviewing both individuals, the modal has proven to predicted correctly.
B.3. Smoothed Naive Bayes using Payment Status
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train, laplace=1.5)
predict(nbPay, test[1,])
[1] No
Levels: No Yes
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train, laplace=1.5)
predict(nbPay, test[2,])
[1] No
Levels: No Yes
Laplace smoothing did not produce any differences within the predictions for both individuals.
Classification with Decision Tree
C.1. Basic Decision Tree
library("rpart")
library("rpart.plot")
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3,
method="class",
data=train, parms=list(split='information'),
minsplit=20, cp=0.02)
rpart.plot(dtPay, type=4, extra=1)

Analysis: The decision tree visualization provides a decision stump (one-level decision tree). A decision stump makes a prediction based on the value of just a single input feature. There is additional variables to consider that will improve this decision tree.
Run model on first row of test data:
predict(nbPay, test[1,])
[1] No
Levels: No Yes
Run model on second row of test data:
predict(nbPay, test[2,])
[1] No
Levels: No Yes
Analysis: Both individuals are unlikely to default. After reviewing the actual results, the predictions are correct that both individuals did not default. The probability of not defaulting is 78%.
C.2. Decision Tree with a Different Complexity Parameter
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3,
method="class",
data=train, parms=list(split='information'),
minsplit=20, cp=0.001)
rpart.plot(dtPay, type=4, extra=1)

Analysis: This decision tree shows all the possible branches and likelihood of path based upon the sample size of 5,000 individuals.
The left side represent individuals that did not default; right side represent those that defaulted.
Following to the root of the tree, only 3,476 (70%) of the 5,000 population did not defaulted consecutively.
Run model on first row of test data:
predict(nbPay, test[1,])
[1] No
Levels: No Yes
Run model on second row of test data:
predict(nbPay, test[2,])
[1] No
Levels: No Yes
Analysis: Both individuals are unlikely to have default.
Conclusion
Both models predictions were correct. However, I feel there was a large population of “No defaults” which made it feel too obvious too predict.
As for the two models, I prefer to utilize the Naive Bayes model as the data was easily to understand and read from this workbook. With the Decision Tree, I had to perform several methods to zoom into the decision tree to understand how to use it.
Issues with the data. So, there is no clear understanding of -2 and 0. I have read the threads on Kaggle, however the suggested commentary did not aligned with the data points. Also, how can 110 individuals (for example: 46, 275, 2298, 2538, 3151) with ZERO for all billing statements and marked as ‘default’ for next month? It does not make sense.
---
title: "INFO 659 Assignment #2 (10 points)"
output: html_notebook
---
<h3>Victor Enchautegui</h3>
<h4>10/25/2020</h4>
<br>
<h2><b>Data Preparation</b></h2>
<h3><b>A.1. Data Loading and Initial Transformation</b></h3>
Load data and transform the default.payment.next.month variable into a nominal (factor) variable:
```{r}
library(ggplot2)
cc <- read.csv("data/UCI_Credit_Card.csv")
cc$default.payment.next.month <- factor(cc$default.payment.next.month,levels=c(0,1), labels=c("No","Yes"))

```
<h3><b>A.2. Demographic Variables</b></h3>
<h4><u>Education</u></h4>
```{r}
ggplot(cc, aes(x=EDUCATION, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))
```
<b>Analysis:</b>
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown). Majority of the population has an university education or greater. This variable will help with our analysis as it seems there is a relationship with university level education and defaulting. 

<br>
<h4><u>Age</u></h4>
```{r}
ggplot(cc, aes(x=AGE, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))
```
<b>Analysis:</b>
AGE: Age in years. There seems to be a relationship between the younger age group and defaulting, however this can be due to the fact the population sample is heavy on the left.

<br>
<h3><b>A.3. Payment Status Variables</b></h3>
<h4><u>PAY_0</u></h4>
```{r}
ggplot(cc, aes(x=PAY_0, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))
```
<b>Analysis:</b>
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above). However, there are two unnamed labels that are represented within the data, which are -2 and 0. If -1 = 'paid on-time', what 0 represent? I read two thread on Kaggle for this issue, and it continues to be unclear and the data didn't match the responses.  For the sake of this analysis, I will retrofit having -2 start as "Pay duly".

I picked this pay statement as it is the initial statement and there is a focused median. 
<br>
<h4><u>PAY_3</u></h4>
```{r}
ggplot(cc, aes(x=PAY_3, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("darkolivegreen4", "red"))
```
<b>Analysis:</b>
There is a spike increase in label 0 of defaulted claims. This may mean that the probability of having consecutive 'no default' decreases over time.
<br>
<h4><u>PAY_5</u></h4>
```{r}
ggplot(cc, aes(x=PAY_5, fill=default.payment.next.month, color=default.payment.next.month)) + 
  geom_histogram(binwidth=1, position="stack") +
  scale_color_manual(values=c("black","black")) +
  scale_fill_manual(values=c("pink", "red"))
```
<b>Analysis:</b>
Similar analysis as PAY_3, however there is an increase in -2, -1, 0 and a decrease in 2.

<br>

<h3><b>A.4. Transforming Nominal Variables</b></h3>
<h4>Transform related demographic variables into nominal values with proper labels:</h4>
```{r}
cc$AGE <- cut(cc$AGE,breaks=c(0,20,30,40,50,60,70,80,100), labels=c("<20s","20s", "30s", "40s", "50s", "60s", "70s", "80s+")) 
```
```{r}
cc$AGE <- factor(cc$AGE) 
```
```{r}
cc$EDUCATION <- factor(cc$EDUCATION,levels=c(1,2,3,4,5,6), labels=c("Grad School", "University", "High School", "Others", "Unknown", "Unknown"))
```

Transformed the additional variables to help with section B.2 of the assignment:
```{r}
cc$PAY_0  <- factor(cc$PAY_0 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "1mo", "2mos", "3mos", "4mos",  "5mos", "6mos", "7mos",  "8mos", "9mos", "10mos", "11mos"))
```
```{r}
cc$PAY_2  <- factor(cc$PAY_2 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "1mo", "2mos", "3mos", "4mos",  "5mos", "6mos", "7mos",  "8mos", "9mos", "10mos", "11mos"))
```
```{r}
cc$PAY_3  <- factor(cc$PAY_3 , levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c("Pay duly", "1mo", "2mos", "3mos", "4mos",  "5mos", "6mos", "7mos",  "8mos", "9mos", "10mos", "11mos"))
```
<br>
<h4>View Education:</h4>
```{r}
View(cc$EDUCATION) 
```
<br>
<h4>View Age:</h4>
```{r}
View(cc$AGE)
```
<br>

<h3><b>A.5. Selection of Training Data</b></h3>
```{r}
train <- cc[sample(nrow(cc), 5000), ] 
```
<h4>Check train data:</h4>
```{r}
nrow(train)
```

<h3><b>A.6. Selection of Testing Data</b></h3>
```{r}
test <- cc[c(18,3600),]
test
```


<h2><b>Data Classification</b></h2>
<h3><b>B.1. Naive Bayes using Demographic Variables</b></h3>
```{r}
library(e1071)
nbDem <- naiveBayes(default.payment.next.month ~ SEX + EDUCATION + AGE, train)
nbDem
```
<b><u>SEX:</u></b> The probability of defaulting is higher for women than it is for men. However it does not make sense as the probability of not defaulting is higher for women than it is for men. 

<b><u>EDUCATION:</u></b> The probability of defaulting is higher for University than it is for others. However it does not make sense as the probability of not defaulting is higher for University than it is for others 

<b><u>AGE:</u></b> The probability of defaulting is higher for younger individuals than it is for others. However it does not make sense as the probability of not defaulting is higher for younger individuals than it is for others 

At first, the data does not make sense when comparing the 'Yes' and 'No', however the model created the conditional probability for each feature separately; not comparing it to the total distribution:

P(Female | no default)     P(Male | no default) 
P(Female | default)        P(Male | default)
<br>
<h4><b>Run model on first row of test data:</b></h4>
```{r}
predict(nbDem, test[1,])
```
<br>
<h4><b>Run model on second row of test data:</b></h4>
```{r}
predict(nbDem, test[2,])
```
The predictions are correct for both as probability of defaulting is low for both individuals for each conditional (except for the sex conditional which slightly higher to default for women). Looking at the overall probability of not defaulting is skewed to not defaulting by 78%, so the predictions are correct.
<br>
<h3><b>B.2. Naive Bayes using Demographic Variables</b></h3>
```{r}
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train)
nbPay
```
For 'Pay duly' and payments delay for 1 or 2 month, the probability of defaulting is higher than not defaulting. However, this switches after 3 or more months for each pay statement (variables). If the client has more than 3 months of late payments, there's a probability of defaulting again.
<br>
<h4><b>Run model on first row of test data:</b></h4>
```{r}
predict(nbPay, test[1,])
```
<h4><b>Run model on second row of test data:</b></h4>
```{r}
predict(nbPay, test[2,])
```
<b>Analysis:</b>
The first individual had 2 months of delayed payments, however he made payment to avoid being delayed 3 months. He is was marked as 'Pay duly', which corresponds with the prediction.

The second individuals was always on-time with her payments and she continued to be on-time, which corresponds with the prediction.

After reviewing both individuals, the modal has proven to predicted correctly. 
<br>
<h3><b>B.3. Smoothed Naive Bayes using Payment Status</b></h3>

```{r}
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train, laplace=1.5)
predict(nbPay, test[1,])
```
```{r}
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train, laplace=1.5)
predict(nbPay, test[2,])
```
Laplace smoothing did not produce any differences within the predictions for both individuals.
<h2><b>Classification with Decision Tree</b></h2>
<h3><b>C.1. Basic Decision Tree</b></h3>
```{r}
library("rpart")
library("rpart.plot")
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3,
            method="class",
            data=train, parms=list(split='information'), 
            minsplit=20, cp=0.02)
```
```{r}
rpart.plot(dtPay, type=4, extra=1)
```
<b>Analysis:</b>
The decision tree visualization provides a decision stump (one-level decision tree). A decision stump makes a prediction based on the value of just a single input feature. There is additional variables to consider that will improve this decision tree.
<br>
<h4><b>Run model on first row of test data:</b></h4>
```{r}
predict(nbPay, test[1,])
```
<h4><b>Run model on second row of test data:</b></h4>
```{r}
predict(nbPay, test[2,])
```
<b>Analysis:</b>
Both individuals are unlikely to default. After reviewing the actual results, the predictions are correct that both individuals did not default. The probability of not defaulting is 78%. 
<br>
<h3><b>C.2. Decision Tree with a Different Complexity Parameter</b></h3>
```{r}
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3,
            method="class",
            data=train, parms=list(split='information'), 
            minsplit=20, cp=0.001)
```

```{r}
rpart.plot(dtPay, type=4, extra=1)
```
<b>Analysis:</b>
This decision tree shows all the possible branches and likelihood of path based upon the sample size of 5,000 individuals.

The left side represent individuals that did not default; right side represent those that defaulted.

Following to the root of the tree, only 3,476 (70%) of the 5,000 population did not defaulted consecutively.
<br>
<h4><b>Run model on first row of test data:</b></h4>
```{r}
predict(nbPay, test[1,])
```
<h4><b>Run model on second row of test data:</b></h4>
```{r}
predict(nbPay, test[2,])
```
<b>Analysis:</b>
Both individuals are unlikely to have default.

<h2><b>Conclusion</b></h2>
Both models predictions were correct. However, I feel there was a large population of "No defaults" which made it feel too obvious too predict.

As for the two models, I prefer to utilize the Naive Bayes model as the data was easily to understand and read from this workbook. With the Decision Tree, I had to perform several methods to zoom into the decision tree to understand how to use it.

Issues with the data. So, there is no clear understanding of -2 and 0. I have read the threads on Kaggle, however the suggested commentary did not aligned with the data points. Also, how can 110 individuals (for example: 46, 275, 2298, 2538, 3151) with ZERO for all billing statements and marked as 'default' for next month? It does not make sense. 
