install.packages(“rpart.plot”)
install.packages(“ggplot2”)
install.packages(“e1071”)
library(ggplot2)
cc <- read.csv(“/Users/lechtenbel/Desktop/Lara/INFO659_Data_Analytics/INFO659_A2/UCI_Credit_Card.csv”)
Looking at the distribution of Marriage & Default Payment Next Month, and Age & Default Payment Next Month.
MARRIAGE
install.packages(“tidyverse”) library(tidyverse)
ggplot(cc, aes(x=MARRIAGE, fill=default.payment.next.month, color=default.payment.next.month)) + geom_histogram(binwidth=1, position=“stack”) + scale_color_manual(values=c(“black”,“black”)) + scale_fill_manual(values=c(“darkolivegreen4”, “red”))
rplot_marriage_default
According to the source of the data (kaggle.com), codes for variable MARRIAGE are 1=married, 2=single, 3=others. Customers who are married and single don’t appear to have hugely different rates of default payment. The number of customers who report “other” as their marriage status is so small in comparison to single & married groups that it’s impossible to eyeball. Another method would be better suited to determining which factors are predictive of defaulting on a payment. A decision tree might be a better option - specifically a classification tree, which is meant for categorical output variables like default.next.payment in which the value is a binary YES or NO. The decision tree will use the probability of each categorical choice for MARRIAGE (single, married, others) defaulting on the next payment to create a place in the decision tree. Visually, this will create a much clearer picture of the relationship of each option within MARRIAGE on the probability of default than the stacked distribution.
AGE
ggplot(cc, aes(x=AGE, fill=default.payment.next.month, color=default.payment.next.month)) + geom_histogram(binwidth=1, position=“stack”) + scale_color_manual(values=c(“black”,“black”)) + scale_fill_manual(values=c(“darkolivegreen4”, “red”))
rplot_age_default
This visually indicates a relationship with age and default payments- the rate of default payments does seem to rise in the early 20s through approx age 30 then start to taper off through age 60. However, to a lesser extent like the previous image, this doesn’t perfectly indicate the relationship between these groups of people and the probability of default. While this image gives a general sense of the association with age and default, it’s not clear exactly which age groups have a concerning probability of default and what that probability might be. A decision tree might be able to help with this, as it would ideally collapse the continous AGE variable into some categorical choices and clearly show which are likely to lead to a “YES” for default.next.payment. Clustering might be an option to help with identification of these various groups. If you performed a cluster analysis to identify 3-5 age groups within this customer sample, you could then use those age groups in a decision tree to determine their probability of defaulting and organize them in a clear visual format in order of risk.
Discuss how payment status data may have an impact on payment default.
Pay_2: ggplot(cc, aes(x=PAY_2, fill=default.payment.next.month, color=default.payment.next.month)) + geom_histogram(binwidth=1, position=“stack”) + scale_color_manual(values=c(“black”,“black”)) + scale_fill_manual(values=c(“darkolivegreen4”, “red”))
Pay_4: ggplot(cc, aes(x=PAY_2, fill=default.payment.next.month, color=default.payment.next.month)) + geom_histogram(binwidth=1, position=“stack”) + scale_color_manual(values=c(“black”,“black”)) + scale_fill_manual(values=c(“darkolivegreen4”, “red”)) Pay_6: ggplot(cc, aes(x=PAY_6, fill=default.payment.next.month, color=default.payment.next.month)) + geom_histogram(binwidth=1, position=“stack”) + scale_color_manual(values=c(“black”,“black”)) + scale_fill_manual(values=c(“darkolivegreen4”, “red”))
These distrubitions and the source data itself are a bit tricky to understand due to the need for data cleaning (or a mistake in the description of the labels). The source data describes the values for Pay_X as (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above). However, there are lots of data points with values of -2 and 0, values which are not defined in the dataset information (https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). Reading commentary in the original dataset, it seems that some people interpreted the -2 and 0 values to mean essentially the same as -1, in that there were no payment delays (https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset/discussion/34608, https://www.kaggle.com/lucabasa/credit-card-default-a-very-pedagogical-notebook/). So if we understand that values under 1 mean no delay in payment, while values 1 or greater do, then we can see from the visualization that when Pay_2, Pay_4, or Pay_6 are not characterized by a delayed payment, then they are not likely to default on their payment next month (default.payment.next.month=No). However, this is a rough visual estimation and we can’t tell from the distribution the degree to which any of these values is predictive of defaulting on next month’s payment.
Read the documentation and decide which of the above variables should be nominal (categorical factors);
SEX (1= male, 2= female)
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
MARRIAGE: Marital status (1=married, 2=single, 3=others)
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
PAY_2: Repayment status in August, 2005 (scale same as above)
PAY_3: Repayment status in July, 2005 (scale same as above)
PAY_4: Repayment status in June, 2005 (scale same as above)
PAY_5: Repayment status in May, 2005 (scale same as above)
PAY_6: Repayment status in April, 2005 (scale same as above)
default.payment.next.month: Default payment (1=yes, 0=no)
Transform related demographic variables into nominal values with proper labels using the factor() function.
cc\(SEX <- factor(cc\)SEX,levels=c(1,2), labels=c(“Male”, “Female”)) cc\(EDUCATION <- factor(cc\)EDUCATION, levels = c(1,2,3,4,5,6), labels = c(“graduate school”, “university”, “high school”, “others”, “unknown”, “unknown”))
cc\(MARRIAGE <- factor(cc\)MARRIAGE,levels=c(1,2,3), labels=c(“married”, “single”, “others”))
cc\(PAY_0 <- factor(cc\)PAY_0,levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c(“no delay”, “no delay”, “no balance”, “delay 1 month”, “delay 2 months”,“delay 3 months”,“delay 4 months”,“delay 5 months”,“delay 6 months”,“delay 7 months”,“delay 8 months”,“delay 9+ months”))
cc\(PAY_2 <- factor(cc\)PAY_2,levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c(“no delay”, “no delay”, “no balance”, “delay 1 month”, “delay 2 months”,“delay 3 months”,“delay 4 months”,“delay 5 months”,“delay 6 months”,“delay 7 months”,“delay 8 months”,“delay 9+ months”))
cc\(PAY_3 <- factor(cc\)PAY_3,levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c(“no delay”, “no delay”, “no balance”, “delay 1 month”, “delay 2 months”,“delay 3 months”,“delay 4 months”,“delay 5 months”,“delay 6 months”,“delay 7 months”,“delay 8 months”,“delay 9+ months”))
cc\(PAY_4 <- factor(cc\)PAY_4,levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c(“no delay”, “no delay”, “no balance”, “delay 1 month”, “delay 2 months”,“delay 3 months”,“delay 4 months”,“delay 5 months”,“delay 6 months”,“delay 7 months”,“delay 8 months”,“delay 9+ months”))
cc\(PAY_5 <- factor(cc\)PAY_5,levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c(“no delay”, “no delay”, “no balance”, “delay 1 month”, “delay 2 months”,“delay 3 months”,“delay 4 months”,“delay 5 months”,“delay 6 months”,“delay 7 months”,“delay 8 months”,“delay 9+ months”))
cc\(PAY_6 <- factor(cc\)PAY_6,levels=c(-2,-1,0,1,2,3,4,5,6,7,8,9), labels=c(“no delay”, “no delay”, “no balance”, “delay 1 month”, “delay 2 months”,“delay 3 months”,“delay 4 months”,“delay 5 months”,“delay 6 months”,“delay 7 months”,“delay 8 months”,“delay 9+ months”))
Already done:
cc\(default.payment.next.month <- factor(cc\)default.payment.next.month, levels=c(0,1), labels=c(“NO”, “YES”))
Use View() function to invoke data viewer and check whether your data have been properly transformed. view(cc\(SEX) view(cc\)EDUCATION)
train <- cc[sample(nrow(cc), 5000), ]
Please double check on your training data with nrow(train) and/or View(train).
Please think of any two random numbers between 1 and 30,000 (the total number of rows in the original data).
test <- cc[c(140,13782),]
Build a Naive Bayes model using the nominal demographic variables as predictors:
library(e1071) nbDem <- naiveBayes(default.payment.next.month ~ SEX + EDUCATION + MARRIAGE, train) nbDem
naive_bayes_train
Examine the probabilities in the model: pick and compare some of the probability values. Do they make sense? Explain.
Looking at the Education attribute, the probabilities for defaulting on next payment do seem to make sense. Among those who will NOT default on their next payment, a university education was associated with a higher proportional probability of not defaulting at .4632. Grad school-educated people have the second largest probability of not defaulting on next payment at .3687. High school-educated people and others/unknown have the lowest proportional probability of not defaulting, at .1528 and <=.01 respectively. In terms of which level of education has the largest proportional probability of defaulting on next payment, University/college grads have the largest proportional probability at .5089, followed by Grad School with .3005 and High School with .1833 and others/unknown with negligible probaabilities. This is interesting to me that of those who default, the probabilities are higher for both university and graduate-level education, but perhaps this is explained by greater numbers of people in those classes having taken on greater debt with their credit cards due to “lifestyle creep” and student loans, or event partially living off credit cards while being a student. Perhaps those with lower education levels were less likely to qualify for this credit card to begin with, and granted lower credit limits when they did that were less likely to cause default.
Now run the model on your test data and predict the outcome:
repeat it on the second row in the test data;
predict(nbDem, test[2,])
My second test record (13782) was:
1. female: lower proportional probability of defaulting (0.56) as compared to not defaulting (.60)
2. university educated: higher proportional probability of defaulting (.51) as compared to not defaulting (.46)
3. single: lower proportional probability of defaulting (.499) as compared to not defaulting (.531)
It makes sense that on the whole, this person was classed as someone less likely to default.
Build a Naive Bayes model using THREE payment status variables as predictors:
library(e1071) nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train) nbPay
bayes_train
Examine the probabilities in the model: pick and compare some of the probability values. Do they make sense? Explain.
These probabilities are more confusing, and I think that’s due to the lack of clear definitions and not clean data in the original dataset. As there were coded data in the PAY_x fields with codes that were not defined, I made a best guess (informed by online discussions in Kaggle) that -2 (a code not defined in original dataset) meant the same as -1 and that 0 (also not defined) meant a zero balance. By “payment delay for x months”, I am assuming that means “payment late by x months”, but even this definition isn’t crystal clear. This makes it confusing that those with a zero balance could be likely to default on next payment, when no payment is due (see “No Balance” group at PAY_0 with probability of 0.288 among those who defaulted on their next payment).
Still, even with confusion caused by unclear or dirty data, the probabilities do indicate that of those who did default at next payment were more likely to have an existing delay on payment, until you get to delays of 8 or 9 months where the numbers are so small as to barely- or not even- register as proportions of the whole.
At PAY_0, credit card owners were comparitively more likely to default on next payment starting at delays of 1 month through & including 7 months.
At PAY_2, credit card owners were comparitively more likely to default on next payment starting at delays of 1 month through & including 7 months.
At PAY_3, credit card owners were comparitively more likely to default on next payment starting at delays of 1 month through & including 7 months.
Now run the model on your test data and predict the outcome:
+ run the following code to test prediction on first row in the test data;
predict(nbPay, test[1,])
+ repeat it on the second row in the test data;
Build a Naive Bayes model using THREE payment status variables as predictors:
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train) nbPay
bayes_payment_status
Again run the model on your test data and predict the outcome:
Yes, the model correctly predicted that these test records would not default next payment. Considering both test records carried no balance at each time point, this makes sense from the probabilities listed in the nbPay model.
nbPay <- naiveBayes(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, train, laplace=1.5)
nbPay
laplace_train
The LaPlace smoothing here was helpful for those rare attribute values of delays in payment of 7+ months, especially delays at 8 and 9+ months which saw many 0.000000 probabilities due to their rare occurences in the dataset. I used the suggested LaPlace value of 1.5 from the example R code, which adds 1.5 to each occurance of each attribute value, “pretend[ing] to see that outcome [1.5 times] more than it actually appears” (p. 217 Data Science and Big Data Analytics : Discovering, Analyzing, Visualizing and Presenting Data.). This ensures those rare occurences have a non-zero probability, which worked for all rare values of PAY_0, PAY_2, and PAY_3.
predict(nbPay, test[1,])
predict(nbPay, test[2,])
predict_laplace
The model with LaPlace smoothing also correctly predicted the outcome of the two test records. It’s not surprising that the change in the model by adding LaPlace smoothing did not affect the prediction of these two records, as neither had any rare values in PAY_0, PAY_2, or PAY_3.
Build a decision tree using up to three payment status variables as predictors. For example:
library(“rpart”)
library(“rpart.plot”)
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, method=“class”, data=train, parms=list(split=‘information’), minsplit=20, cp=0.02)
rpart.plot(dtPay, type=4, extra=1)
Examine the decision tree visualization, explain what it means, and whether it is reasonable.
This decision tree (actually a “decision stump”, the simplest type of decision tree) starts with a root indication of the 5000 total observations, 3882 (77.64%) were “no default on next payment” while 1118 (22.36%) were. From that root, the tree splits into two branches based on PAY_0. One branch is a PAY_0 value of No Delay, No Balance, Delay 1 month, Delay 5 months, Delay 6 months, or Delay 8 months. The other branch is a PAY_0 value of Delay 2 months, Delay 3 months, Delay 4 months, or Delay 7 months. These are connected to the leaf nodes of No and Yes. These leaf nodes are the Class Labels. If an occurance follows the left branch, it’s likely to be a No (no default payment next month). If an occurance follows the right branch, it’s likely to be a Yes (yes default payment next month).
Again run the model on your test data and predict the outcome:
predict_decision_tree
Makes sense, as both of these test rows have values of “no balance” for PAY_0, meaning they are both more likely to be a “NO” to defaulting on next payment.
Rebuild the decision tree with a smaller cp.
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3,
method=“class”,
data=train, parms=list(split=‘information’),
minsplit=20, cp=0.001)
rpart.plot(dtPay, type=4, extra=1)
With the smaller cp of 0.001, the model was overplotted and I received a warning message in R about this decision tree model, which seems to need pruning:
I tried again with a complexity parameter of 0.002 and this time successfully ran a more detailed decision tree than the decision stump, without overplotting:
dtPay <- rpart(default.payment.next.month ~ PAY_0 + PAY_2 + PAY_3, method=“class”, data=train, parms=list(split=‘information’), minsplit=20, cp=0.002)
decision_tree_not2small_cp
This tree indicates that at PAY_0 if the value is no delay,no balance,delay 1 month,delay 5 months,delay 6 months, or delay 8 months the customer is classified as a likely NO to default on next payment. However if their status at PAY_0 is delay 2 months,delay 3 months,delay 4 months, or delay 7 months they are likely to be a yes, but they are not yet classified as such as there is one more branch: if at PAY_3 there was No Delay, then they are classified as a No to defaulting on next payment. If however at PAY_3 they are no balance,delay 2 months,delay 3 months,delay 4 months,delay 5 months,delay 6 months, or delay 7 months they are classified as a Yes. However, at this point the numbers at the final branch/leaf nodes are quite small (a PAY_0 status of “delay 2 months”“,”delay 3 months“,”delay 4 months“, or”delay 7 months" with a “no delay” at PAY_3 equalling only 27 occurences out of 5,000) that this model may be overfit.
decision_tree_not2small_cp
As before, the model correctly predicts that both of the test occurances were “No” to defaulting on their next payment, which is to be expected as again they both have a “no balance” status at PAY_0, which this tree predicts as a No.
Compare the results based on the different models (with different parameters and using different variables). Which model appears to be better (best) given the stated problem and identified data? Which one performs better on your two testing data rows? If it predicts correctly, what (in the model and/or in the data) are helpful? If it does not predict the correct answer, what are missing or lacking (in the model and/or in the data)?
Both models correctly predicted the test data. The decision tree seems more user-friendly for an average human, including a business user who may not have a strong background in data science or statistics, to understand. The Naive Bayes model was more complex to comprehend as there are no visuals to assist with the logical flow of determining which variables and which values are most likely to lead to a certain outcome. However, both models consistently gave the correct predications in this case.
Examining other variables not modeled in this assignment, what variables should be considered in the classification model (to predict default.payment.next.month)? In what ways they may be combined in the model? Please discuss your thinking and how you will proceed with further analysis.
I would include sex, education, and marital status in the decision tree along with payment status to understand how those values affect the classification of a customer. I would also combine both in a Naive Bayes model to similarly get the full picture of a customer; however if trying to walk a non-technical person through the process of classifying customers I would be more likely to use the decision tree. As they are all categorical variables that we already transformed into factors in R, I don’t believe it would be difficult to include them in either model.