Victor Enchautegui
11/8/2020
library(ggplot2)
library(e1071)
library(caret)
package 'caret' was built under R version 4.0.3
Loading required package: lattice
Registered S3 method overwritten by 'data.table':
method from
print.data.table
library("rpart")
library("rpart.plot")
A. Understanding variables and relations in data (2 points)
A.1 Discuss how credit and payment history data such as PAY_AMT1 have an impact on payment default.
| Variable/Attribute | Data Type | Potential impact on “Default” and reason |
|---|---|---|
| Limit_Bal | Integer | The more credit a customer has, the less likely they are to default, due to more ‘spending power’. |
| Pay_0, 2, 3, 4, 5, 6 | Integer | For customers with a status code of -1 (payment was made on time), the probability of defaulting is lower, especially when the on-time payments are consecutive. For codes of 1 and greater (payments delayed by the respective number of months), the probability of defaulting increases. |
| Bill_Amt1, 2, 3, 4, 5, 6 | Integer | These columns list the amount billed to each customer. The assumption would be that the higher the bill, the higher the risk; however, I would not use this attribute on its own. By itself, I do not believe it has a high potential impact on default, because some individuals may have the finances to support such high bills. If we combine this attribute with Limit_Bal, it can have a higher potential impact on default. For example, if the debt-to-limit ratio increases over time, the risk of defaulting may increase as the customer is testing their limits. This is especially true if the customer is delayed in payments. |
| Pay_Amt1, 2, 3, 4, 5, 6 | Integer | Each of these amounts was paid to settle the preceding month’s bill, either in full or partially. If the payments are low, or less than a specific percentage of the amount owed, the risk of defaulting may increase, especially if the bill amount is increasing each month. |
A.2. Discuss in what ways some of the above attributes contribute to default.payment.next.month together. Please identify at least two pairs of attributes that can be treated together and how.
| Var1 | Var2 | Var3 | Discuss their relation, how to combine them (ratio, difference, or others) and your reason/theory |
|---|---|---|---|
| Pay_Amt | Bill_Amt | | I believe the Pay_Amt to Bill_Amt ratio would help in determining the possibility of default, as it relates the customer’s willingness to pay to the size of their bill. If the Pay_Amt to Bill_Amt ratio is less than .5, this may mean the customer is at risk. |
| Limit_Bal | Bill_Amt | | If the debt-to-limit ratio starts low but gets worse over consecutive months, this is a red flag that the customer is spending more than he/she is willing to repay, which can lead to a default. These two variables should be the most significant in determining the possibility of default. However, an individual may be paying their mortgage on their credit card and paying it off in full each month just to get “reward points”. This situation does happen, and it can produce ‘false negatives’ of defaulting. |
| Pay_# | Bill_Amt | Limit_Bal | This is the best combination for determining whether the customer is at risk of defaulting. If the customer is not making payments, the bill amount is increasing each month, and he/she is reaching their credit limit, this is a significant red flag for defaulting. |
B. Data preparation and cleansing (1 points)
B.1 Load data and initial data conversion/transformation:
- Load “UCI_Credit_Card.csv” into data frame variable in R using read.csv().
cc <- read.csv("data/UCI_Credit_Card.csv")
- Convert the following attributes into nominal (categorical, factor) attributes: Sex, Education, Marriage, and default.payment.next.month.
cc$SEX <- factor(cc$SEX,levels=c(1,2), labels=c("Male","Female"))
cc$MARRIAGE <- factor(cc$MARRIAGE,levels=c(1,2,3), labels=c("Married","Single","Others"))
cc$EDUCATION <- factor(cc$EDUCATION,levels=c(1,2,3,4,5,6), labels=c("Grad School","University","High School", "Others", "Unknown", "Unknown"))
cc$default.payment.next.month <- factor(cc$default.payment.next.month,levels=c(0,1), labels=c("No","Yes"))
- Use the class() function to check Sex, Education, Marriage, and default.payment.next.month; they should ALL be “factor” variables.
class(cc$SEX)
[1] "factor"
class(cc$MARRIAGE)
[1] "factor"
class(cc$EDUCATION)
[1] "factor"
class(cc$default.payment.next.month)
[1] "factor"
B.2 Create a filtered dataset with only non-negative amounts.
- Use the subset() function to select only non-negative values on the 6 BILL_AMT attributes and 6 PAY_AMT attributes. Like (fill the … with actual):
ccpo <- subset(cc, BILL_AMT1>=0 & BILL_AMT2>=0 & BILL_AMT3>=0 & BILL_AMT4>=0 & BILL_AMT5>=0 & BILL_AMT6>=0 & PAY_AMT1>=0 & PAY_AMT2>=0 & PAY_AMT3>=0 & PAY_AMT4>=0 & PAY_AMT5>=0 & PAY_AMT6>=0)
nrow(ccpo)
[1] 28070
- Check the number of rows in the filtered subset and you can use View(ccpo) to double check on the data.
View(ccpo)
C. Data Transformation and Classification/Modeling (4 points)
C.1. Pick one classification method, model with default.payment.next.month ~ variables in A.1., and evaluate:
• Select 90% of data for training and 10% for testing;
df <- sort(sample(nrow(ccpo), nrow(ccpo)*.9))
train <- ccpo[df, ]
test <- ccpo[-df, ]
• Build a model with training data (90% data) to predict default.payment.next.month, using at least three variables from A.1.
nbDem <- naiveBayes(default.payment.next.month ~ LIMIT_BAL + BILL_AMT1 + PAY_0, train)
nbDem
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
No Yes
0.774334 0.225666
Conditional probabilities:
LIMIT_BAL
Y [,1] [,2]
No 174956.7 131137.1
Yes 126707.2 113259.6
BILL_AMT1
Y [,1] [,2]
No 54688.26 74788.16
Yes 49737.65 73480.56
PAY_0
Y [,1] [,2]
No -0.1867907 0.9325569
Yes 0.7065427 1.3775166
• Run prediction with the model on test data (10% data) and record the following scores:
o Present the confusion table with TP, TN, FP, and FN
o Report Accuracy, Precision, Recall, F, and Kappa in Table D.
nb_prediction <- predict(nbDem, test, type = "class")
confusionMatrix(data = nb_prediction,
reference =test$default.payment.next.month,
dnn = c("Predicted", "Actual"),
mode = "prec_recall")
Confusion Matrix and Statistics
Actual
Predicted No Yes
No 2105 409
Yes 85 208
Accuracy : 0.824
95% CI : (0.8094, 0.8379)
No Information Rate : 0.7802
P-Value [Acc > NIR] : 5.138e-09
Kappa : 0.3676
Mcnemar's Test P-Value : < 2.2e-16
Precision : 0.8373
Recall : 0.9612
F1 : 0.8950
Prevalence : 0.7802
Detection Rate : 0.7499
Detection Prevalence : 0.8956
Balanced Accuracy : 0.6492
'Positive' Class : No
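Because caret reports `'Positive' Class : No`, the precision and recall above describe how well the model finds non-defaulters. A minimal sketch of the same counts viewed from the defaulter side follows; the counts are copied from the C.1 confusion table and the variable names are illustrative. In caret itself, passing `positive = "Yes"` to `confusionMatrix()` should give the same view.

```r
# Counts from the C.1 confusion table (rows = predicted, columns = actual)
cm <- matrix(c(2105, 85, 409, 208), nrow = 2,
             dimnames = list(Predicted = c("No", "Yes"),
                             Actual = c("No", "Yes")))

tp <- cm["Yes", "Yes"]  # predicted default, actually defaulted
fp <- cm["Yes", "No"]   # predicted default, did not default
fn <- cm["No", "Yes"]   # predicted no default, actually defaulted

precision_yes <- tp / (tp + fp)
recall_yes <- tp / (tp + fn)
f1_yes <- 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

round(c(precision = precision_yes, recall = recall_yes, f1 = f1_yes), 4)
```

Under these counts, recall for the defaulter class is roughly a third, which is consistent with the low Kappa reported for the same model.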
C.2. Perform data transformation (with new relational attributes) and redo classification:
- Follow the treatments (at least two relations) of the variable pairs you have identified in A.2
- Create new attributes that compute the relations you have identified in A.2:
# Payment-to-bill ratio: 1 or greater means the previous bill was paid in full; anything below 1 is a default risk
ccpo$PAID_BILL <- ccpo$PAY_AMT1 / ccpo$BILL_AMT2
# Debt-to-limit ratio: the goal is close to 0; 1 or greater means the customer is at or over their credit limit (high default risk)
ccpo$DEBT_TO_LIMIT <- ccpo$BILL_AMT1 / ccpo$LIMIT_BAL
After reviewing the data, I will replace ‘Inf’ and ‘NaN’ with 1 for the PAID_BILL attribute, as these values mean that the customer either paid toward a bill of zero (overpaying) or did not pay because there was no bill to pay. These values occur in only one of the two new attributes.
# Replace only the non-finite PAID_BILL values (Inf from x/0, NaN from 0/0) with 1
ccpo$PAID_BILL[!is.finite(ccpo$PAID_BILL)] <- 1
Build a model with training data (90% data) to predict default.payment.next.month, using the new relational attributes (plus any other variables you would like to include) here.
Run prediction with the model on test data (10% data) and record the following scores:
- Present the confusion table with TP, TN, FP, and FN
- Report Accuracy, Precision, Recall, F, and Kappa in Table D.
df1 <- sort(sample(nrow(ccpo), nrow(ccpo)*.9))
train1 <- ccpo[df1, ]
test1 <- ccpo[-df1, ]
nbDem1 <- naiveBayes(default.payment.next.month ~ PAID_BILL + DEBT_TO_LIMIT + PAY_0, train1)
nbDem1
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
No Yes
0.7753632 0.2246368
Conditional probabilities:
PAID_BILL
Y [,1] [,2]
No 0.6430843 31.7591719
Yes 0.2996887 0.4322557
DEBT_TO_LIMIT
Y [,1] [,2]
No 0.4257157 0.4097742
Yes 0.5088484 0.4109439
PAY_0
Y [,1] [,2]
No -0.1933837 0.9264983
Yes 0.7081938 1.3805876
nb_prediction1 <- predict(nbDem1, test1, type = "class")
confusionMatrix(data = nb_prediction1,
reference =test1$default.payment.next.month,
dnn = c("Predicted", "Actual"),
mode = "prec_recall")
C.3. Examine attribute value distribution (histogram), and perform log transformation on attributes you see fit:
- Create a new attribute that is the logarithm of each attribute with an extremely wide, “skew” distribution.
hist(ccpo$LIMIT_BAL, xlab="Credit Limit")
hist(ccpo$BILL_AMT1, xlab="Last Bill Statement Amount")
hist(ccpo$PAY_0, xlab="Most Recent Repayment Status")
hist(ccpo$PAID_BILL, xlab="Paid Off Previous Statement Ratio")

hist(ccpo$DEBT_TO_LIMIT, xlab="Debt to Credit Ratio")

I have identified the following attributes as having an extremely wide, “skewed” distribution, so I will log-transform them:
ccpo$LIMIT_BAL_LOG <- log10(ccpo$LIMIT_BAL)
ccpo$BILL_AMT1_LOG <-log10(ccpo$BILL_AMT1)
ccpo$PAID_BILL_LOG <-log10(ccpo$PAID_BILL)
ccpo$DEBT_TO_LIMIT_LOG <-log10(ccpo$DEBT_TO_LIMIT)
hist(ccpo$LIMIT_BAL_LOG, xlab="Credit Limit")

hist(ccpo$BILL_AMT1_LOG, xlab="Last Bill Statement Amount")

hist(ccpo$PAID_BILL_LOG, xlab="Paid Off Previous Statement Ratio")

hist(ccpo$DEBT_TO_LIMIT_LOG, xlab="Debt to Credit Ratio")

- Build a model with training data (90% data) to predict default.payment.next.month, using the new relational (and log-transformed) attributes plus any other variables you would like to include here.
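# Replace the non-finite values produced by log10() (e.g. log10(0) = -Inf) with 0, in the log columns only
log_cols <- c("LIMIT_BAL_LOG", "BILL_AMT1_LOG", "PAID_BILL_LOG", "DEBT_TO_LIMIT_LOG")
ccpo[log_cols] <- lapply(ccpo[log_cols], function(x) replace(x, !is.finite(x), 0))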
df_log <- sort(sample(nrow(ccpo), nrow(ccpo)*.9))
train_log <- ccpo[df_log, ]
test_log <- ccpo[-df_log, ]
nbDem_log <- naiveBayes(default.payment.next.month ~ LIMIT_BAL_LOG + BILL_AMT1_LOG + PAY_0, train_log)
nbDem_log
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
No Yes
0.7740569 0.2259431
Conditional probabilities:
LIMIT_BAL_LOG
Y [,1] [,2]
No 5.096040 0.3971029
Yes 4.921738 0.4193922
BILL_AMT1_LOG
Y [,1] [,2]
No 4.049380 1.301126
Yes 3.953266 1.372574
PAY_0
Y [,1] [,2]
No -0.1893633 0.9308117
Yes 0.7051507 1.3772265
- Run prediction with the model on test data (10% data) and record the following scores:
- Present the confusion table with TP, TN, FP, and FN
- Report Accuracy, Precision, Recall, F, and Kappa in Table D.
nb_prediction_log <- predict(nbDem_log, test_log, type = "class")
confusionMatrix(data = nb_prediction_log,
reference =test_log$default.payment.next.month,
dnn = c("Predicted", "Actual"),
mode = "prec_recall")
Confusion Matrix and Statistics
Actual
Predicted No Yes
No 2088 387
Yes 109 223
Accuracy : 0.8233
95% CI : (0.8087, 0.8372)
No Information Rate : 0.7827
P-Value [Acc > NIR] : 5.243e-08
Kappa : 0.3782
Mcnemar's Test P-Value : < 2.2e-16
Precision : 0.8436
Recall : 0.9504
F1 : 0.8938
Prevalence : 0.7827
Detection Rate : 0.7439
Detection Prevalence : 0.8817
Balanced Accuracy : 0.6580
'Positive' Class : No
C.4. Pick another classification model or the same model with different parameter values, and repeat the modeling and evaluation as in C.3. Report the confusion table and results to Table D.
df_log1 <- sort(sample(nrow(ccpo), nrow(ccpo)*.9))
train_log1 <- ccpo[df_log1, ]
test_log1 <- ccpo[-df_log1, ]
nbDem_log1 <- naiveBayes(default.payment.next.month ~ PAID_BILL_LOG + DEBT_TO_LIMIT_LOG + PAY_0, train_log1)
nbDem_log1
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
No Yes
0.7755611 0.2244389
Conditional probabilities:
PAID_BILL_LOG
Y [,1] [,2]
No -0.7105151 0.6440783
Yes -0.6839220 0.6692453
DEBT_TO_LIMIT_LOG
Y [,1] [,2]
No -0.7097247 0.8213071
Yes -0.5492607 0.7828808
PAY_0
Y [,1] [,2]
No -0.1909866 0.9306641
Yes 0.7029982 1.3653377
nb_prediction_log1 <- predict(nbDem_log1, test_log1, type = "class")
confusionMatrix(data = nb_prediction_log1,
reference =test_log1$default.payment.next.month,
dnn = c("Predicted", "Actual"),
mode = "prec_recall")
Confusion Matrix and Statistics
Actual
Predicted No Yes
No 2055 429
Yes 104 219
Accuracy : 0.8101
95% CI : (0.7951, 0.8245)
No Information Rate : 0.7691
P-Value [Acc > NIR] : 8e-08
Kappa : 0.3515
Mcnemar's Test P-Value : <2e-16
Precision : 0.8273
Recall : 0.9518
F1 : 0.8852
Prevalence : 0.7691
Detection Rate : 0.7321
Detection Prevalence : 0.8849
Balanced Accuracy : 0.6449
'Positive' Class : No
D. Evaluation and Results (2 points)
| | Method | Correct % | Precision | Recall | F | Kappa |
|---|---|---|---|---|---|---|
| C.1 | Model 1: nb_prediction (LIMIT_BAL, BILL_AMT1, PAY_0) | 82.4% | .8373 | .9612 | .8950 | .3676 |
| C.2 | Model 2: nb_prediction1 (PAID_BILL, DEBT_TO_LIMIT, PAY_0) | 81.8% | .8331 | .9589 | .8988 | .3404 |
| C.3 | Model 3: nb_prediction_log (LIMIT_BAL_LOG, BILL_AMT1_LOG, PAY_0) | 82.33% | .8436 | .9504 | .8938 | .3782 |
| C.4 | Model 4: nb_prediction_log1 (PAID_BILL_LOG, DEBT_TO_LIMIT_LOG, PAY_0) | 81.01% | .8273 | .9518 | .8852 | .3515 |
E. Report with Interpretation and Conclusion (3 points)
E.1. In terms of the reasons and theories presented in tasks A1 through A2, which ones have been confirmed by your analysis? Please discuss even if there is no obvious answer.
A1 ‘Reasons and Theories’
Limit_Bal: My assumption was correct for this attribute, based on the Naive Bayes model in C1. Customers who did not default had a higher mean credit limit ($174,956.70) than those who defaulted ($126,707.20).
PAY_#: My assumption was incorrect. Although the mean for those who did not default was negative, as expected, the mean for those who defaulted was only .70. I was expecting the mean for defaulters to be significantly higher.
BILL_AMT#: My assumption was correct. In fact, those who did not default had a higher mean bill amount than those who defaulted. One could assume that customers with less spending power are more likely to default; however, the gap between defaulters and non-defaulters is not wide in the Naive Bayes model from C1.
A2 ‘Reasons and Theories’
DEBT_TO_LIMIT: My assumption was correct, but not strongly so. Customers with a lower debt-to-limit ratio are less likely to default, based on C2’s Naive Bayes model. Non-defaulters have a mean of .43, while defaulters have a mean of .51. On a scale from 0 to 1, this is enough of a gap to say that the assumption is correct.
PAID_BILL: My assumption was clearly correct for those who paid their bill. This ratio divides the current month’s payment by the previous month’s bill amount; 1 or greater means that the customer paid their bill in full or more. For those who did not default, the mean was .64. Those who defaulted had a mean payment-to-bill ratio of .30.
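The class means quoted in E.1 are the first column of each conditional table printed by naiveBayes(); for numeric predictors, e1071 reports the per-class mean in `[,1]` and the standard deviation in `[,2]`, and scores new values with a normal density. As a rough sketch of how one predictor shifts the prediction, the snippet below applies that rule to PAY_0 alone, using the priors and PAY_0 statistics from the C.1 output. The helper name `posterior_pay0` is hypothetical, and the fitted model also multiplies in the LIMIT_BAL and BILL_AMT1 densities, so its predictions will differ.

```r
# Priors and PAY_0 statistics copied from the C.1 naiveBayes() printout
prior <- c(No = 0.774334, Yes = 0.225666)
pay0_mean <- c(No = -0.1867907, Yes = 0.7065427)
pay0_sd <- c(No = 0.9325569, Yes = 1.3775166)

# Posterior class probabilities using PAY_0 as the only predictor
posterior_pay0 <- function(pay0) {
  joint <- prior * dnorm(pay0, mean = pay0_mean, sd = pay0_sd)
  joint / sum(joint)
}

posterior_pay0(-1)  # a customer who paid on time
posterior_pay0(2)   # a customer two months behind
```

This illustrates why PAY_0 carries weight in every model even though the defaulter mean of .70 looks small: the two class distributions separate quickly once the status code moves above 1.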
E.2. Does data transformation (with new relational variables in C.2) help? Which one helps most and why? Or which does not?
Surprisingly, the data transformation with the new relational variables did not help. Accuracy, Precision, Recall, and Kappa all scored lower than in C1, and the reported F was only marginally higher (by less than .01). Kappa was low across all models, between .34 and .38, which indicates only weak agreement between the predicted classes and the true values. There was no improvement after the transformation.
E.3. Which classification method(s) and/or parameters appear to perform well? Which ones do not?
I used Naive Bayes as my classification method throughout the exercises, as I was hoping and expecting to get better results after the transformations. Sadly, the basic attributes (PAY_0, BILL_AMT1, and LIMIT_BAL) without transformation did as well as or better than the other models. All models scored high on every metric except Kappa.
E.4. Reviewing results in Task D, which evaluation metrics (of Correct%, Kappa, F, Precision, and Recall) best capture how good/poor the result is? Which metric is not as helpful?
Although Accuracy would usually be the first metric to look at, Kappa is more robust because it measures the agreement between the predicted and actual classes beyond what chance alone would produce. Kappa is more informative than Accuracy when working with unbalanced data.
The least helpful would be Accuracy. A good model should have a high accuracy score, but a high accuracy score alone does not guarantee that the model is sound; here the No Information Rate is already about .78, so always predicting “No” would score almost as well.
As for the other metrics:
Precision measures how accurately the model predicted the positive class: the number of true positives divided by the sum of true positives and false positives.
Recall measures the share of actual positives that the model identified: the number of true positives divided by the sum of true positives and false negatives.
F1 is the harmonic mean of precision and recall. A value of 1 is the best performance and 0 is the worst.
All are good metrics to use depending on the dataset that is being used.
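To make the definitions above concrete, the following sketch recomputes each metric by hand from the C.1 confusion table, treating “No” as the positive class as caret did. The variable names are illustrative, and `kappa_hat` is Cohen's kappa written out from its usual definition rather than taken from caret's internals.

```r
# Counts from the C.1 confusion table, with "No" as the positive class
tp <- 2105  # predicted No,  actual No
fp <- 409   # predicted No,  actual Yes
fn <- 85    # predicted Yes, actual No
tn <- 208   # predicted Yes, actual Yes
n <- tp + fp + fn + tn

accuracy <- (tp + tn) / n
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)

# Cohen's kappa: observed agreement corrected for agreement expected by chance
p_chance <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
kappa_hat <- (accuracy - p_chance) / (1 - p_chance)

round(c(accuracy = accuracy, precision = precision, recall = recall,
        f1 = f1, kappa = kappa_hat), 4)
```

The chance-agreement term is large here because both the predictions and the actual labels are dominated by “No”, which is why Kappa stays low while the other scores look high.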
E.5. Pick the most helpful evaluation metric, which method (with what data transformation if applicable) is the overall winner of the results? Reason about why the method performs well.
The Accuracy, Precision, Recall, and F1 scores for all models were high, ranging from .81 to .96; Kappa was the exception, with scores between .34 and .38. Overall, there was not much of a difference among the four models; however, I would pick model C3, as its Kappa and Precision scores are the highest among the models.
---
title: "INFO 659 Assignment #3 (10 points)"
output: html_notebook
---
<h3>Victor Enchautegui</h3>
<h4>11/8/2020</h4>

```{r}
library(ggplot2)
library(e1071)
library(caret)
library("rpart")
library("rpart.plot")

```

<style type="text/css">
.tg  {border-collapse:collapse;border-color:#ccc;border-spacing:0; width:100%;}
.tg td{background-color:#fff;border-color:#ccc;border-style:solid;border-width:1px;color:#333;
  font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{background-color:#f0f0f0;border-color:#ccc;border-style:solid;border-width:1px;color:#333;
  font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>

<h3>A. Understanding variables and relations in data (2 points)</h3>
<h4>A.1 Discuss how credit and payment history data such as PAY_AMT1 have an impact on payment default.</h4>
<table class="tg">
<thead>
  <tr>
    <th class="tg-0lax">Variable/Attribute</th>
    <th class="tg-0lax">Data Type</th>
    <th class="tg-0lax">Potential impact on "Default" and reason</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0lax">Limit_Bal</td>
    <td class="tg-0lax">Integer</td>
    <td class="tg-0lax">
    The more credit you have, the less likely to default due to more ‘spending power’. 
    </td>
  </tr>
  <tr>
    <td class="tg-0lax">Pay_0, 2, 3, 4, 5, 6 </td>
    <td class="tg-0lax">Integer</td>
    <td class="tg-0lax">
    For customers with a status code of -1 (payment was made on time), the probability of defaulting is lower, especially when the on-time payments are consecutive. 

For codes of 1 and greater (payments delayed by the respective number of months), the probability of defaulting increases. 
    </td>
  </tr>
  <tr>
    <td class="tg-0lax">Bill_Amt1, 2, 3, 4, 5, 6 </td>
    <td class="tg-0lax">Integer</td>
    <td class="tg-0lax">
    These columns list the amount billed to each customer. The assumption would be that the higher the bill, the higher the risk; however, I would not use this attribute on its own. By itself, I do not believe it has a high potential impact on default, because some individuals may have the finances to support such high bills. 
    
    <br>If we combine this attribute with Limit_Bal, it can have a higher potential impact on default. For example, if the debt-to-limit ratio increases over time, the risk of defaulting may increase as the customer is testing their limits. This is especially true if the customer is delayed in payments.
    </td>
  </tr>
  <tr>
    <td class="tg-0lax">Pay_Amt1, 2, 3, 4, 5, 6 </td>
    <td class="tg-0lax">Integer</td>
    <td class="tg-0lax">
    Each of these amounts was paid to settle the preceding month's bill, either in full or partially. If the payments are low, or less than a specific percentage of the amount owed, the risk of defaulting may increase, especially if the bill amount is increasing each month.
    </td>
  </tr>
</tbody>
</table>

<h4>A.2. Discuss in what ways some of the above attributes contribute to default.payment.next.month together. Please identify at least two pairs of attributes that can be treated together and how.</h4>
<table class="tg">
<thead>
  <tr>
    <th class="tg-0lax">Var1</th>
    <th class="tg-0lax">Var2</th>
    <th class="tg-0lax">Var3</th>
    <th class="tg-0lax">Discuss their relation, how to combine them (ratio, difference, or others) and your reason/theory</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0lax">Pay_Amt</td>
    <td class="tg-0lax">Bill_Amt</td>
    <td class="tg-0lax"></td>
    <td class="tg-0lax">
I believe the Pay_Amt to Bill_Amt ratio would help in determining the possibility of default, as it relates the customer's willingness to pay to the size of their bill. If the Pay_Amt to Bill_Amt ratio is less than .5, this may mean the customer is at risk.
    </td>
  </tr>
  <tr>
    <td class="tg-0lax">Limit_Bal</td>
    <td class="tg-0lax">Bill_Amt</td>
    <td class="tg-0lax"></td>
    <td class="tg-0lax">
If the debt-to-limit ratio starts low but gets worse over consecutive months, this is a red flag that the customer is spending more than he/she is willing to repay, which can lead to a default. These two variables should be the most significant in determining the possibility of default. However, an individual may be paying their mortgage on their credit card and paying it off in full each month just to get “reward points”. This situation does happen, and it can produce ‘false negatives’ of defaulting.
    </td>
  </tr>
  <tr>
    <td class="tg-0lax">Pay_#</td>
    <td class="tg-0lax">Bill_Amt</td>
    <td class="tg-0lax">Limit_Bal</td>
    <td class="tg-0lax">
This is the best combination for determining whether the customer is at risk of defaulting. If the customer is not making payments, the bill amount is increasing each month, and he/she is reaching their credit limit, this is a significant red flag for defaulting.
    </td>
  </tr>
</tbody>
</table>

<h3>B. Data preparation and cleansing (1 points)</h3>
<h4>B.1 Load data and initial data conversion/transformation: </h4>
1)	Load “UCI_Credit_Card.csv” into data frame variable in R using read.csv(). 
```{r}
cc <- read.csv("data/UCI_Credit_Card.csv")
```
2)	Convert the following attributes into nominal (categorical, factor) attributes: Sex, Education, Marriage, and default.payment.next.month. 
```{r}
cc$SEX <- factor(cc$SEX,levels=c(1,2), labels=c("Male","Female"))
cc$MARRIAGE <- factor(cc$MARRIAGE,levels=c(1,2,3), labels=c("Married","Single","Others"))
cc$EDUCATION <- factor(cc$EDUCATION,levels=c(1,2,3,4,5,6), labels=c("Grad School","University","High School", "Others", "Unknown", "Unknown"))
cc$default.payment.next.month <- factor(cc$default.payment.next.month,levels=c(0,1), labels=c("No","Yes"))
```
3)	Use the class() function to check Sex, Education, Marriage, and default.payment.next.month; they should ALL be “factor” variables. 
```{r}
class(cc$SEX)
```
```{r}
class(cc$MARRIAGE)
```
```{r}
class(cc$EDUCATION)
```
```{r}
class(cc$default.payment.next.month)
```
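One caveat with the conversions in B.1: `factor()` turns any value that is not listed in `levels` into `NA`. The toy vector below is made up for illustration (it is an assumption, not something shown above, that the real MARRIAGE column contains codes outside 1 to 3), and sketches how to count such values after converting.

```r
# Made-up codes; 0 is deliberately not among the declared levels
marriage_raw <- c(1, 2, 3, 0, 2)
marriage <- factor(marriage_raw, levels = c(1, 2, 3),
                   labels = c("Married", "Single", "Others"))

sum(is.na(marriage))               # number of codes that were not mapped
table(marriage, useNA = "ifany")   # level counts, including any NA
```

Running the same two checks on `cc$MARRIAGE` and `cc$EDUCATION` would show whether any rows carry an `NA` factor into the later modeling steps.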
<h4>B.2 Create a filtered dataset with only non-negative amounts.   </h4>
1)	Use the subset() function to select only non-negative values on the 6 BILL_AMT attributes and 6 PAY_AMT attributes. Like (fill the … with actual): 
```{r}
ccpo <- subset(cc, BILL_AMT1>=0 & BILL_AMT2>=0 & BILL_AMT3>=0 & BILL_AMT4>=0 & BILL_AMT5>=0 & BILL_AMT6>=0 & PAY_AMT1>=0 & PAY_AMT2>=0 & PAY_AMT3>=0 & PAY_AMT4>=0 & PAY_AMT5>=0 & PAY_AMT6>=0)
nrow(ccpo)
```
2)	Check the number of rows in the filtered subset and you can use View(ccpo) to double check on the data. 
```{r}
View(ccpo)
```

<h3>C.	Data Transformation and Classification/Modeling (4 points)</h3>
<h4>C.1. Pick one classification method, model with default.payment.next.month ~ variables in A.1., and evaluate:</h4>
•	Select 90% of data for training and 10% for testing; 
```{r}
df <- sort(sample(nrow(ccpo), nrow(ccpo)*.9))
train <- ccpo[df, ]
test <- ccpo[-df, ]
```

•	Build a model with training data (90% data) to predict default.payment.next.month, using at least three variables from A.1. 
```{r}
nbDem <- naiveBayes(default.payment.next.month ~ LIMIT_BAL + BILL_AMT1 + PAY_0, train)
nbDem
```

•	Run prediction with the model on test data (10% data) and record the following scores: 
o	Present the confusion table with TP, TN, FP, and FN
o	Report Accuracy, Precision, Recall, F, and Kappa in Table D.
```{r}
nb_prediction <- predict(nbDem, test, type = "class")

confusionMatrix(data = nb_prediction,
                reference =test$default.payment.next.month, 
                dnn = c("Predicted", "Actual"),
                mode = "prec_recall")
```
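The split in C.1 is drawn with `sample()` and no seed, so the confusion table changes from run to run, and C.2 through C.4 each draw a fresh split, which makes the four models harder to compare. A minimal sketch of a reproducible split follows, assuming an arbitrary seed of 659 and a small stand-in data frame named `toy` (both hypothetical); the same `train_idx` could then be reused for every model.

```r
set.seed(659)                                   # arbitrary seed, for illustration only
toy <- data.frame(id = 1:100, x = rnorm(100))   # stand-in for ccpo

# 90% of the row indices for training, the rest for testing
train_idx <- sort(sample(nrow(toy), floor(nrow(toy) * 0.9)))
train_toy <- toy[train_idx, ]
test_toy <- toy[-train_idx, ]

c(train = nrow(train_toy), test = nrow(test_toy))
```

With a shared index, differences between the models in Table D would reflect the attributes used rather than which rows happened to land in the test set.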
<h4>C.2. Perform data transformation (with new relational attributes) and redo classification: </h4>
1)	Follow the treatments (at least two relations) of the variable pairs you have identified in A.2
2)	Create new attributes that compute the relations you have identified in A.2:
```{r}
# Payment-to-bill ratio: 1 or greater means the previous bill was paid in full; anything below 1 is a default risk
ccpo$PAID_BILL <- ccpo$PAY_AMT1 / ccpo$BILL_AMT2
# Debt-to-limit ratio: the goal is close to 0; 1 or greater means the customer is at or over their credit limit (high default risk)
ccpo$DEBT_TO_LIMIT <- ccpo$BILL_AMT1 / ccpo$LIMIT_BAL
```

After reviewing the data, I will replace 'Inf' and 'NaN' with 1 for the PAID_BILL attribute, as these values mean that the customer either paid toward a bill of zero (overpaying) or did not pay because there was no bill to pay. These values occur in only one of the two new attributes.
```{r}
# Replace only the non-finite PAID_BILL values (Inf from x/0, NaN from 0/0) with 1
ccpo$PAID_BILL[!is.finite(ccpo$PAID_BILL)] <- 1
```

3)	Build a model with training data (90% data) to predict default.payment.next.month, using the new relational attributes (plus any other variables you would like to include) here. 

4)	Run prediction with the model on test data (10% data) and record the following scores: 
a.	Present the confusion table with TP, TN, FP, and FN
b.	Report Accuracy, Precision, Recall, F, and Kappa in Table D.
```{r}
df1 <- sort(sample(nrow(ccpo), nrow(ccpo)*.9))
train1 <- ccpo[df1, ]
test1 <- ccpo[-df1, ]

nbDem1 <- naiveBayes(default.payment.next.month ~ PAID_BILL + DEBT_TO_LIMIT + PAY_0, train1)
nbDem1
```

```{r}
nb_prediction1 <- predict(nbDem1, test1, type = "class")

confusionMatrix(data = nb_prediction1,
                reference =test1$default.payment.next.month, 
                dnn = c("Predicted", "Actual"),
                mode = "prec_recall")

```
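The `Inf` and `NaN` values replaced in C.2 come from dividing by a zero bill: a positive payment over a zero bill gives `Inf`, and zero over zero gives `NaN`. The made-up vectors below sketch this and show that `is.finite()` catches both cases in a single test.

```r
pay <- c(500, 0, 200)    # made-up payments
bill <- c(1000, 0, 0)    # made-up previous bills

ratio <- pay / bill      # 0.5, NaN, Inf
ratio[!is.finite(ratio)] <- 1
ratio
```

Restricting the replacement to one numeric column in this way leaves the factor columns untouched.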

<h4>C.3. Examine attribute value distribution (histogram), and perform log transformation on attributes you see fit:   </h4>
1)	Create a new attribute that is the logarithm of each attribute with an extremely wide, “skew” distribution. 

```{r}
hist(ccpo$LIMIT_BAL, xlab="Credit Limit")
```

```{r}
hist(ccpo$BILL_AMT1, xlab="Last Bill Statement Amount")
```

```{r}
hist(ccpo$PAY_0, xlab="Most Recent Repayment Status")
```

```{r}
hist(ccpo$PAID_BILL, xlab="Paid Off Previous Statement Ratio")
```

```{r}
hist(ccpo$DEBT_TO_LIMIT, xlab="Debt to Credit Ratio")
```

I have identified the following attributes as having an extremely wide, “skewed” distribution, so I will log-transform them:
```{r}
ccpo$LIMIT_BAL_LOG <- log10(ccpo$LIMIT_BAL)
ccpo$BILL_AMT1_LOG <-log10(ccpo$BILL_AMT1) 
ccpo$PAID_BILL_LOG <-log10(ccpo$PAID_BILL)
ccpo$DEBT_TO_LIMIT_LOG <-log10(ccpo$DEBT_TO_LIMIT)
```

```{r}
hist(ccpo$LIMIT_BAL_LOG, xlab="Credit Limit")
```


```{r}
hist(ccpo$BILL_AMT1_LOG, xlab="Last Bill Statement Amount")
```

```{r}
hist(ccpo$PAID_BILL_LOG, xlab="Paid Off Previous Statement Ratio")
```

```{r}
hist(ccpo$DEBT_TO_LIMIT_LOG, xlab="Debt to Credit Ratio")
```




4)	Build a model with training data (90% data) to predict default.payment.next.month, using the new relational (and log-transformed) attributes plus any other variables you would like to include here. 

```{r}
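# Replace the non-finite values produced by log10() (e.g. log10(0) = -Inf) with 0, in the log columns only
log_cols <- c("LIMIT_BAL_LOG", "BILL_AMT1_LOG", "PAID_BILL_LOG", "DEBT_TO_LIMIT_LOG")
ccpo[log_cols] <- lapply(ccpo[log_cols], function(x) replace(x, !is.finite(x), 0))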
```

```{r}
df_log <- sort(sample(nrow(ccpo), nrow(ccpo)*.9))
train_log <- ccpo[df_log, ]
test_log <- ccpo[-df_log, ]
nbDem_log <- naiveBayes(default.payment.next.month ~ LIMIT_BAL_LOG + BILL_AMT1_LOG + PAY_0, train_log)
nbDem_log
```


5)	Run prediction with the model on test data (10% data) and record the following scores: 
a.	Present the confusion table with TP, TN, FP, and FN
b.	Report Accuracy, Precision, Recall, F, and Kappa in Table D.
```{r}
nb_prediction_log <- predict(nbDem_log, test_log, type = "class")

confusionMatrix(data = nb_prediction_log,
                reference =test_log$default.payment.next.month, 
                dnn = c("Predicted", "Actual"),
                mode = "prec_recall")
```
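In C.3, `log10()` of a zero bill or zero ratio is `-Inf`, which is then replaced with 0; that places zeros at the same point as a value of 1. An alternative that is sometimes used, and is not what this notebook does, is to add 1 before taking the log so that zeros map to 0 without a separate replacement step. A sketch with made-up amounts:

```r
bill <- c(0, 9, 99, 99999)     # made-up bill amounts

log10(bill)                    # first element is -Inf
bill_log <- log10(bill + 1)    # 0, 1, 2, 5
bill_log
```

For ratio columns such as PAID_BILL, where most values are below 1, adding 1 compresses the scale considerably, so whether this variant is preferable is a judgment call.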
<h4>C.4. Pick another classification model or the same model with different parameter values, and repeat the modeling and evaluation as in C.3. Report the confusion table and results to Table D.</h4>

```{r}
df_log1 <- sort(sample(nrow(ccpo), nrow(ccpo)*.9))
train_log1 <- ccpo[df_log1, ]
test_log1 <- ccpo[-df_log1, ]
nbDem_log1 <- naiveBayes(default.payment.next.month ~ PAID_BILL_LOG + DEBT_TO_LIMIT_LOG + PAY_0, train_log1)
nbDem_log1
```
```{r}
nb_prediction_log1 <- predict(nbDem_log1, test_log1, type = "class")

confusionMatrix(data = nb_prediction_log1,
                reference =test_log1$default.payment.next.month, 
                dnn = c("Predicted", "Actual"),
                mode = "prec_recall")
```

<h3>D.	Evaluation and Results (2 points)</h3>
<table class="tg">
<thead>
  <tr>
    <th class="tg-0lax" rowspan="2"></th>
    <th class="tg-0lax" rowspan="2">Method</th>
    <th class="tg-0lax" colspan="5">C.1. Classification without Transformation</th>
  </tr>
  <tr>
  	<th class="tg-0lax">Correct %</th>
  	<th class="tg-0lax">Precision</th>
  	<th class="tg-0lax">Recall</th>
  	<th class="tg-0lax">F</th>
  	<th class="tg-0lax">Kappa</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0lax"><b>C.1</b></td>
    <td class="tg-0lax">
	<b>Model 1:</b> nb_prediction
	<br><i>LIMIT_BAL, BILL_AMT1, PAY_0</i>
	</td>
    <td class="tg-0lax"> 82.4%</td>
    <td class="tg-0lax">.8373</td>
    <td class="tg-0lax">.9612</td>
    <td class="tg-0lax">.8950</td>
    <td class="tg-0lax">.3676</td>
  </tr>
  <tr>
    <td class="tg-0lax"><b>C.2</b></td>
    <td class="tg-0lax">
	<b>Model 2:</b> nb_prediction1
	<br><i>PAID_BILL, DEBT_TO_LIMIT, PAY_0</i>	
	</td>
    <td class="tg-0lax"> 81.8%</td>
    <td class="tg-0lax"> .8331</td>
    <td class="tg-0lax"> .9589</td>
    <td class="tg-0lax"> .8988</td>
    <td class="tg-0lax"> .3404</td>
  </tr>
  <tr>
    <td class="tg-0lax"><b>C.3</b></td>
    <td class="tg-0lax">
	<b>Model 3:</b> nb_prediction_log
	<br><i>LIMIT_BAL_LOG, BILL_AMT1_LOG, PAY_0</i>	
	</td>
    <td class="tg-0lax"> 82.33%</td>
    <td class="tg-0lax"> .8436</td>
    <td class="tg-0lax"> .9504</td>
    <td class="tg-0lax"> .8938</td>
    <td class="tg-0lax"> .3782</td>
  </tr>
  <tr>
    <td class="tg-0lax"><b>C.4</b></td>
    <td class="tg-0lax">

	<b>Model 4:</b> nb_prediction_log1
	<br><i>PAID_BILL_LOG, DEBT_TO_LIMIT_LOG, PAY_0</i>	
	</td>
    <td class="tg-0lax"> 81.01%</td>
    <td class="tg-0lax"> .8273</td>
    <td class="tg-0lax"> .9518</td>
    <td class="tg-0lax"> .8852</td>
    <td class="tg-0lax"> .3515</td>
  </tr>
</tbody>
</table>
<br>


<h3>E.	Report with Interpretation and Conclusion (3 points)</h3>

<b>E.1. In terms of the reasons and theories presented in tasks A1 through A2, which ones have been confirmed by your analysis? Please discuss even if there is no obvious answer.</b>

<u>A1 'Reasons and Theories'</u><br>
<b>Limit_Bal:</b> My assumption was confirmed by the Naive Bayes model in C.1. Customers who did not default had a higher mean credit limit ($174,956.70) than those who defaulted ($126,707.20).
<br><br>
<b>PAY_#:</b> My assumption was incorrect. Although the mean for customers who did not default was negative, as expected, the mean for those who defaulted was only .70. I had expected the mean for defaulters to be significantly higher.
<br><br>
<b>BILL_AMT#:</b> My assumption was correct. In fact, customers who did not default had a higher mean bill amount than those who defaulted. One could assume that those with less spending power are more likely to default; however, the gap between defaulters and non-defaulters is not wide in the Naive Bayes model from C.1.
<br><br>
<u>A2 'Reasons and Theories'</u><br>
<b>DEBT_TO_LIMIT:</b> My assumption was correct, but not by a wide margin. Based on the Naive Bayes model in C.2, customers with a lower debt-to-credit-limit ratio are less likely to default. Non-defaulters have a mean of .43, while defaulters have a mean of .51. On a scale from 0 to 1, this gap is enough to say that the assumption holds.
<br><br>
<b>PAID_BILL:</b> My assumption was strongly confirmed for customers who paid their bills. This ratio divides the current month's payment by the previous month's bill amount; a value of 1 or greater means that the customer paid the full bill or more. Customers who did not default had a mean of .64, while those who defaulted had a mean payment-to-bill ratio of .3.
<br><br><br>
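
The two relational variables discussed above can be sketched as follows. This is a minimal illustration on an invented three-row data frame, not the code used for the analysis: pairing `PAY_AMT1` with `BILL_AMT2` and `BILL_AMT1` with `LIMIT_BAL`, the `_DEMO` column names, and the handling of a zero bill are all assumptions.

```{r}
# Invented rows for illustration only; column names follow the dataset
toy_accounts <- data.frame(
  LIMIT_BAL = c(10000, 50000, 20000),
  BILL_AMT1 = c(9000, 5000, 10000),   # most recent bill
  BILL_AMT2 = c(8000, 4000, 0),       # previous month's bill
  PAY_AMT1  = c(800, 4000, 0)         # payment made toward the previous bill
)

# Debt-to-limit ratio: share of the credit limit that is currently billed
toy_accounts$DEBT_TO_LIMIT_DEMO <- toy_accounts$BILL_AMT1 / toy_accounts$LIMIT_BAL

# Payment-to-bill ratio: 1 or more means the previous bill was paid in full.
# Assumption: a zero bill is marked NA instead of dividing by zero.
toy_accounts$PAID_BILL_DEMO <- ifelse(toy_accounts$BILL_AMT2 > 0,
                                      toy_accounts$PAY_AMT1 / toy_accounts$BILL_AMT2,
                                      NA_real_)
toy_accounts
```

In this toy data the first account uses 90% of its limit and paid 10% of its previous bill, which is the high-risk pattern described in the A.2 reasoning; the second account paid its previous bill in full.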


<b>E.2. Does data transformation (with new relational variables in C.2) help? Which one helps most and why? Or which does not?</b>

Surprisingly, the data transformation with the new relational variables did not help. Accuracy, Precision, Recall, and Kappa were all lower than in C.1. F1 was slightly higher than in C.1, by .0032. Kappa was low for both models, between .34 and .37, which indicates only weak agreement between the predicted and actual classes beyond what chance would produce. Overall, there was no improvement after the transformation.
<br><br><br>

<b>E.3. Which classification method(s) and/or parameters appear to perform well? Which ones do not?</b>

I used Naive Bayes as my classification method throughout the exercises, as I was hoping and expecting to see better results after the transformation. Sadly, the model with the basic untransformed attributes (PAY_0, BILL_AMT1, and LIMIT_BAL) did as well as or better than the other models. All models scored high on every metric except Kappa.


<b>E.4. Reviewing results in Task D, which evaluation metrics (of Correct%, Kappa, F, Precision, and Recall) best capture how good/poor the result is? Which metric is not as helpful?</b>

Although Accuracy is usually the first metric to look at, Kappa is the more robust measure here: it reports how much the predicted and actual classes agree beyond what would be expected by chance. Kappa is more informative than Accuracy when working with unbalanced data.

The least helpful would be Accuracy. A good model should have a high accuracy score, but a high accuracy score alone does not guarantee that the model is good: on unbalanced data, a model that always predicts the majority class can still score high.
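
To make the contrast between Accuracy and Kappa concrete, here is a small sketch with invented labels (80 non-defaulters, 20 defaulters) and a "model" that always predicts no default. The helper `cohen_kappa` is a hypothetical function written for this illustration using the standard Cohen's kappa formula; it is not part of the analysis above.

```{r}
# Cohen's kappa from a square confusion table:
# (observed agreement - chance agreement) / (1 - chance agreement)
cohen_kappa <- function(tab) {
  n <- sum(tab)
  observed <- sum(diag(tab)) / n
  expected <- sum(rowSums(tab) * colSums(tab)) / n^2
  (observed - expected) / (1 - expected)
}

# Invented, unbalanced labels: 80 non-defaulters ("0") and 20 defaulters ("1")
actual_unbalanced <- factor(c(rep("0", 80), rep("1", 20)), levels = c("0", "1"))
# A trivial classifier that always predicts the majority class
always_majority <- factor(rep("0", 100), levels = c("0", "1"))

tab_majority <- table(Predicted = always_majority, Actual = actual_unbalanced)
accuracy_majority <- sum(diag(tab_majority)) / sum(tab_majority)
kappa_majority <- cohen_kappa(tab_majority)

c(accuracy = accuracy_majority, kappa = kappa_majority)
```

The trivial classifier reaches 80% accuracy without identifying a single defaulter, while its Kappa is 0, meaning no agreement beyond chance.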

As for the other metrics:

<b><u>Precision</u></b> measures how many of the cases predicted as positive are actually positive. It is the number of true positives divided by the sum of true positives and false positives.

<b><u>Recall</u></b> measures how many of the actual positive cases the model identified. It is the number of true positives divided by the sum of true positives and false negatives.

<b><u>F1</u></b> is the harmonic mean of precision and recall. A value of 1 is the best performance and 0 is the worst.

All are good metrics to use depending on the dataset that is being used. 
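
The three definitions above reduce to simple arithmetic on the confusion-matrix counts. The sketch below uses invented counts, and `prf_from_counts` is a hypothetical helper named for this example only.

```{r}
# Precision, recall, and F1 from confusion-matrix counts for the positive class
prf_from_counts <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)   # of everything predicted positive, how much was right
  recall <- tp / (tp + fn)      # of everything actually positive, how much was found
  f1 <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

# Invented counts: 90 true positives, 30 false positives, 10 false negatives
toy_scores <- prf_from_counts(tp = 90, fp = 30, fn = 10)
round(toy_scores, 4)
```

With these counts precision is .75, recall is .9, and F1 is about .8182, slightly below the arithmetic mean of .825 because F1 is pulled toward the lower of the two values.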


<b>E.5. Pick the most helpful evaluation metric, which method (with what data transformation if applicable) is the overall winner of the results? Reason about why the method performs well.</b>

The Accuracy, Precision, Recall, and F1 scores for all models were high, ranging from .81 to .96. Kappa was the exception, with scores between .34 and .38. Overall, there was not much difference among the four models; however, I would pick Model 3 (C.3), as it has the highest Kappa and Precision scores of the four models.









