r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)
install.packages("rpart.plot")
Installing package into ‘C:/Users/Toby/Documents/R/win-library/3.6’
(as ‘lib’ is unspecified)
trying URL 'http://cran.us.r-project.org/bin/windows/contrib/3.6/rpart.plot_3.0.8.zip'
Content type 'application/zip' length 1078195 bytes (1.0 MB)
downloaded 1.0 MB
package ‘rpart.plot’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Toby\AppData\Local\Temp\Rtmp21Xoj7\downloaded_packages
install.packages("ggplot2")
Installing package into ‘C:/Users/Toby/Documents/R/win-library/3.6’
(as ‘lib’ is unspecified)
trying URL 'http://cran.us.r-project.org/bin/windows/contrib/3.6/ggplot2_3.2.1.zip'
Content type 'application/zip' length 3975792 bytes (3.8 MB)
downloaded 3.8 MB
package ‘ggplot2’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Toby\AppData\Local\Temp\Rtmp21Xoj7\downloaded_packages
install.packages("e1071")
Installing package into ‘C:/Users/Toby/Documents/R/win-library/3.6’
(as ‘lib’ is unspecified)
trying URL 'http://cran.us.r-project.org/bin/windows/contrib/3.6/e1071_1.7-2.zip'
Content type 'application/zip' length 1021928 bytes (997 KB)
downloaded 997 KB
package ‘e1071’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Toby\AppData\Local\Temp\Rtmp21Xoj7\downloaded_packages
library(ggplot2)
package ‘ggplot2’ was built under R version 3.6.1
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
getwd()
[1] "C:/Users/Toby/Downloads"
setwd("C:/Users/Toby/Downloads")
cc <- read.csv("UCI_Credit_Card.csv")
cc$default.payment.next.month <- factor(cc$default.payment.next.month, levels = c(0,1), labels = c("No","Yes"))
Education may affect payment default through financial literacy and income. People with less education may be less familiar with managing debt or using credit cards properly, and they may not have jobs that pay enough to support their lifestyle.
Marriage may affect payment default through financial burden. A two-person household does not necessarily have two incomes; one spouse may have to support both.
Sex may affect payment default through differences in spending habits. Females may spend more overall, or males may be more impulsive and buy higher-priced items. Both of these statements may simply be stereotypes.
Age may also affect payment default through experience with finance. Younger people may not know the risks of debt and credit card spending.
ggplot(cc, aes(x = SEX, fill = default.payment.next.month, color = default.payment.next.month)) + geom_histogram(binwidth = 1, position = "stack") + scale_color_manual(values = c("black", "black")) + scale_fill_manual(values = c("blue", "purple"))

I chose the Sex and default payment next month variables to create a histogram. In this distribution, males (1 on the x-axis) have more “No” than “Yes”, but the overall count of males is lower than that of females. Females have a higher count and also have more “No” than “Yes”. I believe that this plot alone cannot show whether Sex helps to predict default payment next month, because the numbers of males and females in this dataset are not equal. If there were more males in this data, perhaps we could use Sex as a predictor, but the raw counts for the two sexes are too uneven to tell; the proportion of “Yes” within each sex would be a fairer comparison.
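Because the groups differ in size, one way to compare them is by the share of defaults within each group instead of the raw counts. The following is a minimal base R sketch of that idea; the `toy` data frame and its values are made up for illustration and are not the credit card data.

```r
# Toy data standing in for cc; the values are invented for illustration only.
toy <- data.frame(
  SEX     = c("Male", "Male", "Male", "Female", "Female", "Female", "Female", "Female"),
  default = c("Yes",  "No",   "No",   "Yes",    "No",     "No",     "No",     "No")
)

# Cross-tabulate the counts, then convert each row to proportions (margin = 1)
# so that groups of different sizes can be compared directly.
counts <- table(toy$SEX, toy$default)
shares <- prop.table(counts, margin = 1)
print(round(shares, 3))
```

With the real data, the same idea would presumably be `prop.table(table(cc$SEX, cc$default.payment.next.month), margin = 1)`.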
ggplot(cc, aes(x = MARRIAGE, fill = default.payment.next.month, color = default.payment.next.month)) + geom_histogram(binwidth = 1, position = "stack") + scale_color_manual(values = c("black", "black")) + scale_fill_manual(values = c("green", "grey"))

I chose Marriage to examine against default payment next month. In this distribution, those who are married (1 on the x-axis) have more “No” than “Yes”. Similarly, those who are single (2) have more “No” than “Yes”, and so do the others (3). Overall, all three marriage levels show “No” more often than “Yes”. I believe that Marriage cannot be judged as a predictor of default payment next month from this plot, because the counts of the three levels are not similar. If the counts were more equal, perhaps a decision could be made. Currently, Marriage does not look like a good predictor of default payment next month.
Payment status may affect payment default because of the time of year. Perhaps at the end of the year there are more bonuses, and during the summer more money is spent on vacations.
ggplot(cc, aes(x = PAY_0, fill = default.payment.next.month, color = default.payment.next.month)) + geom_histogram(binwidth = 1, position = "stack") + scale_color_manual(values = c("black", "black")) + scale_fill_manual(values = c("darkred", "darkgreen"))

I chose Pay_0 to examine against the default payment next month variable. In this distribution, most people paid duly and the fewest repaid one or two months late. There were more “No” than “Yes” for default payment next month. There were 15000 people who paid duly rather than months late. Unlike the other variables, there seems to be a possible relationship here: when people pay duly, they are less likely to default payment next month.
ggplot(cc, aes(x = PAY_2, fill = default.payment.next.month, color = default.payment.next.month)) + geom_histogram(binwidth = 1, position = "stack") + scale_color_manual(values = c("black", "black")) + scale_fill_manual(values = c("pink", "blue"))

I chose the Pay_2 variable to examine against default payment next month. Pay_2 describes the repayment status in August 2005. In this distribution, there are more “No” than “Yes” for default payment next month. Most people paid duly; among those who paid late, being two months late was more common than being one month late. This variable could perhaps help to predict default payment next month, because “Yes” is more common among people who missed two months of repayment. Although not as good as Pay_0, Pay_2 could still be a good predictor.
ggplot(cc, aes(x = PAY_6, fill = default.payment.next.month, color = default.payment.next.month)) + geom_histogram(binwidth = 1, position = "stack") + scale_color_manual(values = c("black", "black")) + scale_fill_manual(values = c("yellow", "violet"))

I chose Pay_6 to examine against default payment next month. In this distribution, there were more “No” than “Yes” for default payment next month. Most people who paid duly did not default payment next month, but those who were 2 to 3 months late were mostly “Yes”. It seems that Pay_6 may be a good predictor of default payment next month: those who missed repayment for 2 to 3 months will most likely default.
Variables that should be nominal:
Sex, Education, Marriage
cc$SEX <- factor(cc$SEX, levels=c(1,2), labels=c("Male", "Female"))
cc$EDUCATION <- factor(cc$EDUCATION, levels=c(1,2,3,4,5,6), labels=c("Graduate School", "University", "High School", "Others", "Unknown_1", "Unknown_2"))
cc$MARRIAGE <- factor(cc$MARRIAGE, levels=c(1,2,3), labels=c("Married", "Single", "Others"))
View(cc)
Choose 5000 random rows:
train <- cc[sample(nrow(cc), 5000),]
View(train)
Choose two rows to use as test cases:
test <- cc[c(9,25432),]
test
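Because `sample()` draws different rows on every run, the training set above cannot be reproduced exactly, and rows 9 and 25432 may or may not be among the 5000 training rows. A minimal sketch of a repeatable split that keeps the test rows out of the training set is shown below; the `toy` data frame, the seed value and the 70/30 split are assumptions made for illustration.

```r
set.seed(42)                         # arbitrary seed; any fixed value makes the draw repeatable

# Toy stand-in for cc: 100 rows with an id column (hypothetical data).
toy <- data.frame(id = 1:100, x = rnorm(100))

train_idx <- sample(nrow(toy), 70)   # row numbers used for training
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]       # every row that was not drawn is held out

# No row should appear in both sets.
overlap <- intersect(train_toy$id, test_toy$id)
print(length(overlap))
```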
library(e1071)
package ‘e1071’ was built under R version 3.6.1
nbDem <- naiveBayes(default.payment.next.month ~ SEX + EDUCATION + MARRIAGE, train)
nbDem
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
No Yes
0.7824 0.2176
Conditional probabilities:
SEX
Y        Male    Female
No  0.3826687 0.6173313
Yes 0.4319853 0.5680147
EDUCATION
Y     Graduate School   University  High School       Others    Unknown_1    Unknown_2
No       0.3832779340 0.4502684735 0.1457427768 0.0063922271 0.0127844541 0.0015341345
Yes      0.3134191176 0.4972426471 0.1819852941 0.0009191176 0.0055147059 0.0009191176
MARRIAGE
Y       Married      Single      Others
No  0.435851472 0.554417414 0.009731114
Yes 0.452117864 0.536832413 0.011049724
Examining the conditional probabilities of Sex, the results are very interesting. Each row of the table sums to 1, so it gives the share of males and females within each class, not the chance of default for each sex. Among “Yes”, Females have a higher share than Males, and among “No”, Females also have a higher share than Males. At first this did not quite make sense to me, but it reflects that there are more females than males in the data. What matters for prediction is that the female share is lower among “Yes” (0.568) than among “No” (0.617). I think that I would need more analysis of Sex compared to default payment next month.
When examining Education compared to default payment next month, some of the probabilities make sense and others do not give enough information to make a statement. The highest probability for “Yes” is University, but the University probabilities for “Yes” and “No” are quite similar. For all levels of education, the probabilities for “No” and “Yes” seem quite similar, which makes them a little confusing to interpret.
predict(nbDem, test[1,])
[1] No
Levels: No Yes
I believe that the prediction is plausible. Test 1 is female, with a high school education, and married. Based on the conditional probabilities, females have a higher probability under “No” than under “Yes”. Those with a high school education have a fairly low probability under both “No” and “Yes”, although it is slightly higher under “Yes”. As for marital status, Test 1 is married, and married people have a slightly higher probability under “Yes”. I cannot say for certain that the prediction is true, but there is evidence supporting it.
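To see how these pieces combine, the Naive Bayes score for Test 1 can be worked out by hand: multiply the prior of each class by the conditional probability of each observed value, then normalise. The sketch below uses the numbers printed in the nbDem output and the description of Test 1 (female, high school, married); the variable names are my own.

```r
# Values copied from the nbDem output above.
prior     <- c(No = 0.7824,       Yes = 0.2176)
p_female  <- c(No = 0.6173313,    Yes = 0.5680147)     # P(SEX = Female | class)
p_high    <- c(No = 0.1457427768, Yes = 0.1819852941)  # P(EDUCATION = High School | class)
p_married <- c(No = 0.435851472,  Yes = 0.452117864)   # P(MARRIAGE = Married | class)

# Naive Bayes: prior times each conditional probability, then normalise to sum to 1.
score     <- prior * p_female * p_high * p_married
posterior <- score / sum(score)
print(round(posterior, 3))
print(names(which.max(posterior)))
```

The posterior for “No” comes out at roughly three quarters, which is close to the prior of 0.7824. This suggests that the three demographic variables move the prediction very little.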
predict(nbDem, test[2,])
[1] No
Levels: No Yes
I believe that the prediction is plausible. Test 2 is female, with a university education, and single. Based on the conditional probabilities, females have a higher probability under “No” than under “Yes”. University education has a higher probability under “Yes”, while Single has a slightly higher probability under “No”. The education value may work against the prediction, but there is evidence that the prediction may be correct due to Sex.
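Whether a prediction is correct can only be checked against the actual label, ideally over many held-out rows, not just two. A minimal sketch of such a check with a confusion matrix follows; the `actual` and `predicted` vectors are hypothetical and do not come from the credit card data.

```r
# Hypothetical actual and predicted labels for ten held-out rows.
lv        <- c("No", "Yes")
actual    <- factor(c("No", "No", "Yes", "No", "Yes", "No", "No",  "Yes", "No", "No"), levels = lv)
predicted <- factor(c("No", "No", "No",  "No", "Yes", "No", "Yes", "No",  "No", "No"), levels = lv)

# Rows are predictions, columns are the true labels; the diagonal holds the correct cases.
confusion <- table(Predicted = predicted, Actual = actual)
accuracy  <- sum(diag(confusion)) / sum(confusion)
print(confusion)
print(accuracy)
```

With the real models, `predicted` would be the result of `predict()` on held-out rows and `actual` would be their `default.payment.next.month` column.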
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_6, train)
nbPay
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
No Yes
0.7824 0.2176
Conditional probabilities:
PAY_0
Y [,1] [,2]
No -0.2287832 0.9674105
Yes 0.6452206 1.3662562
PAY_2
Y [,1] [,2]
No -0.3284765 1.045978
Yes 0.4761029 1.455055
PAY_6
Y [,1] [,2]
No -0.4123211 1.010040
Yes 0.1654412 1.445171
These tables cover the three payment status variables, each of which has multiple levels. At first I was not sure how the levels of repayment enter the prediction, and I assumed that [,1] meant paid duly and [,2] meant months paid late. Read as probabilities, the values do not make sense, because some are negative and some are higher than 1. That is because they are not probabilities: PAY_0, PAY_2 and PAY_6 are numeric, so naiveBayes reports the mean ([,1]) and the standard deviation ([,2]) of each variable within each class.
predict(nbPay, test[1,])
[1] No
Levels: No Yes
When examining Test 1’s values for Pay_0, Pay_2 and Pay_6, all of the repayment statuses are 0, which means that Test 1 paid duly on all three of these repayment dates. The prediction of “No” for default payment next month is consistent with these results. I believe that these repayment statuses can be used as predictors of default payment next month.
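For numeric predictors, the two columns in the nbPay output are the per-class mean and standard deviation, and Naive Bayes scores a value by its normal density under each class. The sketch below reproduces that idea by hand for Test 1, whose three statuses are all 0. The numbers are copied from the output above; the variable names are my own, and the calculation is an illustration of the method, not the exact internals of e1071.

```r
prior <- c(No = 0.7824, Yes = 0.2176)

# Per-class mean and standard deviation, copied from the nbPay output above.
mu  <- rbind(PAY_0 = c(No = -0.2287832, Yes = 0.6452206),
             PAY_2 = c(No = -0.3284765, Yes = 0.4761029),
             PAY_6 = c(No = -0.4123211, Yes = 0.1654412))
sdv <- rbind(PAY_0 = c(No = 0.9674105, Yes = 1.3662562),
             PAY_2 = c(No = 1.045978,  Yes = 1.455055),
             PAY_6 = c(No = 1.010040,  Yes = 1.445171))

x <- c(PAY_0 = 0, PAY_2 = 0, PAY_6 = 0)   # Test 1's values as described in the text

# For each class, multiply the Gaussian densities of the three observed values.
likelihood <- sapply(c("No", "Yes"), function(cl) {
  prod(dnorm(x, mean = mu[, cl], sd = sdv[, cl]))
})

score     <- prior * likelihood
posterior <- score / sum(score)
print(round(posterior, 3))
```

The posterior for “No” is well above 0.5, which agrees with the prediction returned by `predict(nbPay, test[1,])`.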
predict(nbPay, test[2,])
[1] No
Levels: No Yes
When examining Test 2’s values for Pay_0, Pay_2 and Pay_6, all of the repayment statuses are 0 or -1. Both -1 and 0 state that the payments were made duly. As with Test 1, the prediction of “No” for default payment next month is consistent with these results. I believe that these repayment statuses can be used as predictors of default payment next month.
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_6, train, laplace = 1.5)
nbPay
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
No Yes
0.7824 0.2176
Conditional probabilities:
PAY_0
Y [,1] [,2]
No -0.2287832 0.9674105
Yes 0.6452206 1.3662562
PAY_2
Y [,1] [,2]
No -0.3284765 1.045978
Yes 0.4761029 1.455055
PAY_6
Y [,1] [,2]
No -0.4123211 1.010040
Yes 0.1654412 1.445171
When examining the smoothed Naive Bayes model, the output did not change. Laplace smoothing is used to keep probabilities from being exactly zero: even if an event is only slightly possible, its probability should not be zero. However, Laplace smoothing may assign too much probability to an unseen event. Here nothing changed from the former model because the laplace argument only affects categorical predictors, and PAY_0, PAY_2 and PAY_6 are numeric.
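The effect of Laplace smoothing is easier to see on a categorical predictor. The sketch below applies the usual additive smoothing formula to a small table of counts; the counts, the level names A, B and C, and the helper `smooth()` are hypothetical and only meant to show the mechanism.

```r
# Hypothetical counts of a categorical predictor (columns) within each class (rows).
counts <- rbind(No  = c(A = 30, B = 10, C = 0),
                Yes = c(A = 5,  B = 4,  C = 1))

smooth <- function(tab, laplace = 0) {
  # Add `laplace` to every cell, then renormalise each row so it sums to 1.
  (tab + laplace) / (rowSums(tab) + laplace * ncol(tab))
}

raw      <- smooth(counts)        # class "No", level "C" gets probability exactly 0
smoothed <- smooth(counts, 1.5)   # the same counts with laplace = 1.5
print(round(raw, 3))
print(round(smoothed, 3))
```

Without smoothing, a single zero would force the whole product for that class to zero; with smoothing, the unseen level keeps a small positive probability.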
predict(nbPay, test[1,])
[1] No
Levels: No Yes
Test 1 was again predicted to be “No” for default payment next month. The Laplace smoothing did not change the outcome, and the prediction of “No” seems to be correct.
predict(nbPay, test[2,])
[1] No
Levels: No Yes
For Test 2, the prediction for default payment next month is again “No”. The Laplace smoothing did not change the prediction, which makes me believe that the prediction is correct.
DECISION TREE
library("rpart")
library("rpart.plot")
package ‘rpart.plot’ was built under R version 3.6.1
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_6, method = "class", data = train, parms = list(split = 'information'), minsplit = 20, cp = 0.02)
rpart.plot(dtPay, type = 4, extra = 1)

?rpart.control
This decision tree first asks whether Pay_0 (September 2005) was less than one month late. If the answer is “Yes”, people are assigned to the left side; if the answer is “No”, they are assigned to the right side. The next split asks whether Pay_0 was less than two months late: those who were go to the left and those who were not go to the right. Each node contains two numbers, which are the counts of “No” cases (left number) and “Yes” cases (right number) in that node; the label at the top of the node is the predicted class. The numbers seem reasonable, except that I do wonder about the bottom-left node: among those whose Pay_0 is more than two months late, the majority count is that they will not default payment next month.
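The tree above is grown with `split = 'information'`, which chooses splits using an entropy-based measure. As a rough illustration of what such a measure rewards, the sketch below computes the information gain of one split from class counts. The counts are hypothetical, and this is a textbook calculation, not rpart's exact internal code.

```r
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                 # treat 0 * log(0) as 0
  -sum(p * log2(p))
}

# Hypothetical class counts (No, Yes) in a parent node and its two children.
parent <- c(No = 80, Yes = 20)
left   <- c(No = 75, Yes = 5)
right  <- c(No = 5,  Yes = 15)

# Information gain = parent entropy minus the size-weighted entropy of the children.
n <- sum(parent)
weighted_child <- sum(left) / n * entropy(left) + sum(right) / n * entropy(right)
info_gain <- entropy(parent) - weighted_child
print(info_gain)
```

A split that separates “No” from “Yes” more cleanly leaves purer children and therefore a larger gain.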
predict(nbPay, test[1,])
[1] No
Levels: No Yes
predict(dtPay, test[1,])
No Yes
9 0.8380549 0.1619451
For Test 1, the prediction is most likely “No” for default payment next month. Both models predict that Test 1 will not default, and I believe that these predictions will be correct.
predict(nbPay, test[2,])
[1] No
Levels: No Yes
predict(dtPay, test[2,])
No Yes
25432 0.8380549 0.1619451
For Test 2, as for Test 1, both models predict no default payment next month. The decision tree gives “No” a high probability (about 0.84). I believe that the prediction is correct and Test 2 will not default payment next month.
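The Naive Bayes model returns a class label while the decision tree returns class probabilities, so the two outputs are not directly comparable as printed. A small sketch of turning the tree's probabilities into labels is shown below, using the values printed above; the variable names are my own.

```r
# Probabilities copied from the predict(dtPay, ...) output above.
probs <- rbind("9"     = c(No = 0.8380549, Yes = 0.1619451),
               "25432" = c(No = 0.8380549, Yes = 0.1619451))

# For each row, pick the column with the largest probability.
labels <- colnames(probs)[max.col(probs, ties.method = "first")]
names(labels) <- rownames(probs)
print(labels)
```

With rpart, calling `predict(dtPay, test, type = "class")` returns the labels directly.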
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_6, method = "class", data = train, parms = list(split = 'information'), minsplit = 20, cp = 0.001)
rpart.plot(dtPay, type = 1, extra =1)

In this decision tree, people with Pay_0 less than one month late are assigned to the left and those more than one month late to the right. Next, those with Pay_0 less than 2 are assigned to the left and those with more to the right. Both of the resulting nodes are then split on whether Pay_6 (April 2005) is less than one: those for whom it is go to the left and the rest go to the right. People whose Pay_2 (August 2005) is more than 4 months go to the left and the rest go to the right. I noticed that a few people whose Pay_2 is more than 4 months are still not expected to default payment next month. Other than that, I believe that the tree is reasonable.
predict(dtPay, test[1,])
No Yes
9 0.8618838 0.1381162
predict(nbPay, test[1,])
[1] No
Levels: No Yes
In both predictions, Test 1 is classified as “No” for defaulting payment next month. Given the high probability of “No” for Test 1, I believe that the predictions are correct.
predict(dtPay, test[2,])
No Yes
25432 0.8618838 0.1381162
predict(nbPay, test[2,])
[1] No
Levels: No Yes
In both predictions, Test 2 is also classified as “No” for defaulting payment next month. Similarly to Test 1, Test 2 has a high probability of “No”. From this evidence, I believe that the model is correct and the prediction will be true.
Conclusion
I believe that the Naive Bayes model is good, but its output is not visually easy to examine. I also believe it is not as in-depth as the decision tree. The decision tree separates the data by the three payment variables and assigns the counts to nodes, which makes it more visually appealing to examine the three variables. Both models predicted the same results: Test 1 and Test 2 would be “No” for default payment next month. I did question why the decision tree plots show counts instead of probabilities; this is because extra = 1 in rpart.plot displays the number of observations per class in each node.
Other variables that should perhaps be examined in the classification models are the bill amounts and the payment amounts. They would help to identify how large each person's bills are and how much each person paid, so that the difference can be examined. Perhaps another decision tree or Naive Bayes model can be used to examine these variables against default payment next month.
