In this project, I examine the Default of Credit Card Clients data set from the UCI Machine Learning Repository and use logistic regression and random forest models to predict whether a person will default on their credit card.
The data set can be found on the UCI Machine Learning Repository site at the following link: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients.
The data set contains 24 variables and a total of 30,000 individual instances. The variables include the credit limit, sex, education, marriage status, age, and the default indicator, along with a separate variable for each month from April to September recording the repayment status, bill statement amount, and previous payment amount.
First, the required packages are loaded and the data set is read into a data frame, with the relevant columns kept in a data frame named “credit”. The data frame is then inspected to view its columns and class types.
#Load the packages used throughout the analysis
library(dplyr)         # data manipulation (%>%, distinct, mutate)
library(ggplot2)       # histograms and density plots
library(caret)         # confusionMatrix()
library(randomForest)  # random forest model
#Read the data and keep the relevant columns
data <- read.csv("~/R datasets/Credit_Card.csv")
credit <- subset(data, select = c("Limit_Bal", "Sex", "Education", "Marriage", "Age", "Pay_Sep",
                                  "Pay_Aug", "Pay_July", "Pay_June", "Pay_May", "Pay_April",
                                  "Bill_Amt_Sep", "Bill_Amt_Aug", "Bill_Amt_July",
                                  "Bill_Amt_June", "Bill_Amt_May", "Bill_Amt_April",
                                  "Pay_Amt_Sep", "Pay_Amt_Aug", "Pay_Amt_July", "Pay_Amt_June",
                                  "Pay_Amt_May", "Pay_Amt_April", "Default"))
#Inspect the data frame
head(credit)
## Limit_Bal Sex Education Marriage Age Pay_Sep Pay_Aug Pay_July Pay_June
## 1 20000 2 2 1 24 2 2 -1 -1
## 2 120000 2 2 2 26 -1 2 0 0
## 3 90000 2 2 2 34 0 0 0 0
## 4 50000 2 2 1 37 0 0 0 0
## 5 50000 1 2 1 57 -1 0 -1 0
## 6 50000 1 1 2 37 0 0 0 0
## Pay_May Pay_April Bill_Amt_Sep Bill_Amt_Aug Bill_Amt_July Bill_Amt_June
## 1 -2 -2 3913 3102 689 0
## 2 0 2 2682 1725 2682 3272
## 3 0 0 29239 14027 13559 14331
## 4 0 0 46990 48233 49291 28314
## 5 0 0 8617 5670 35835 20940
## 6 0 0 64400 57069 57608 19394
## Bill_Amt_May Bill_Amt_April Pay_Amt_Sep Pay_Amt_Aug Pay_Amt_July Pay_Amt_June
## 1 0 0 0 689 0 0
## 2 3455 3261 0 1000 1000 1000
## 3 14948 15549 1518 1500 1000 1000
## 4 28959 29547 2000 2019 1200 1100
## 5 19146 19131 2000 36681 10000 9000
## 6 19619 20024 2500 1815 657 1000
## Pay_Amt_May Pay_Amt_April Default
## 1 0 0 1
## 2 0 2000 1
## 3 1000 5000 0
## 4 1069 1000 0
## 5 689 679 0
## 6 1000 800 0
#Inspect the classes of the data frame
sapply(credit, class)
## Limit_Bal Sex Education Marriage Age
## "integer" "integer" "integer" "integer" "integer"
## Pay_Sep Pay_Aug Pay_July Pay_June Pay_May
## "integer" "integer" "integer" "integer" "integer"
## Pay_April Bill_Amt_Sep Bill_Amt_Aug Bill_Amt_July Bill_Amt_June
## "integer" "integer" "integer" "integer" "integer"
## Bill_Amt_May Bill_Amt_April Pay_Amt_Sep Pay_Amt_Aug Pay_Amt_July
## "integer" "integer" "integer" "integer" "integer"
## Pay_Amt_June Pay_Amt_May Pay_Amt_April Default
## "integer" "integer" "integer" "integer"
Next, I want to see the number of missing values and duplicate rows in the data frame.
sum(is.na(credit))
## [1] 0
duplicates <- credit %>% duplicated()
duplicates_amount <- duplicates %>% table()
duplicates_amount
## .
## FALSE TRUE
## 29965 35
Since there are 35 duplicates in the data, the data frame is filtered to remove the duplicates.
credit <- credit %>% distinct()
#Displays how many duplicates remain in the updated data frame
duplicates_counts_unique <- credit %>% duplicated() %>% table()
duplicates_counts_unique
## .
## FALSE
## 29965
Next, a copy of the data frame is saved as “credit1” so the numeric codes are preserved for scaling and modeling later. The categorical variables in “credit” are then converted from their numeric codes to descriptive labels.
#Keep a numeric copy of the data frame for modeling
credit1 <- data.frame(credit)
head(credit1)
## Limit_Bal Sex Education Marriage Age Pay_Sep Pay_Aug Pay_July Pay_June
## 1 20000 2 2 1 24 2 2 -1 -1
## 2 120000 2 2 2 26 -1 2 0 0
## 3 90000 2 2 2 34 0 0 0 0
## 4 50000 2 2 1 37 0 0 0 0
## 5 50000 1 2 1 57 -1 0 -1 0
## 6 50000 1 1 2 37 0 0 0 0
## Pay_May Pay_April Bill_Amt_Sep Bill_Amt_Aug Bill_Amt_July Bill_Amt_June
## 1 -2 -2 3913 3102 689 0
## 2 0 2 2682 1725 2682 3272
## 3 0 0 29239 14027 13559 14331
## 4 0 0 46990 48233 49291 28314
## 5 0 0 8617 5670 35835 20940
## 6 0 0 64400 57069 57608 19394
## Bill_Amt_May Bill_Amt_April Pay_Amt_Sep Pay_Amt_Aug Pay_Amt_July Pay_Amt_June
## 1 0 0 0 689 0 0
## 2 3455 3261 0 1000 1000 1000
## 3 14948 15549 1518 1500 1000 1000
## 4 28959 29547 2000 2019 1200 1100
## 5 19146 19131 2000 36681 10000 9000
## 6 19619 20024 2500 1815 657 1000
## Pay_Amt_May Pay_Amt_April Default
## 1 0 0 1
## 2 0 2000 1
## 3 1000 5000 0
## 4 1069 1000 0
## 5 689 679 0
## 6 1000 800 0
#Recode the categorical variables from numeric codes to descriptive labels
credit$Sex[credit$Sex %in% "1"] = "Male"
credit$Sex[credit$Sex %in% "2"] = "Female"
credit$Education[credit$Education %in% "1"] = "Grad School"
credit$Education[credit$Education %in% "2"] = "College"
credit$Education[credit$Education %in% "3"] = "High School"
credit$Education[credit$Education %in% "4"] = "Other"
credit$Education[credit$Education %in% "5"] = "Unknown"
credit$Marriage[credit$Marriage %in% "0"] = "Unknown"
credit$Marriage[credit$Marriage %in% "1"] = "Married"
credit$Marriage[credit$Marriage %in% "2"] = "Single"
credit$Marriage[credit$Marriage %in% "3"] = "Other"
credit$Default[credit$Default %in% "0"] = "No"
credit$Default[credit$Default %in% "1"] = "Yes"
#See the change in the variable names
head(credit)
## Limit_Bal Sex Education Marriage Age Pay_Sep Pay_Aug Pay_July Pay_June
## 1 20000 Female College Married 24 2 2 -1 -1
## 2 120000 Female College Single 26 -1 2 0 0
## 3 90000 Female College Single 34 0 0 0 0
## 4 50000 Female College Married 37 0 0 0 0
## 5 50000 Male College Married 57 -1 0 -1 0
## 6 50000 Male Grad School Single 37 0 0 0 0
## Pay_May Pay_April Bill_Amt_Sep Bill_Amt_Aug Bill_Amt_July Bill_Amt_June
## 1 -2 -2 3913 3102 689 0
## 2 0 2 2682 1725 2682 3272
## 3 0 0 29239 14027 13559 14331
## 4 0 0 46990 48233 49291 28314
## 5 0 0 8617 5670 35835 20940
## 6 0 0 64400 57069 57608 19394
## Bill_Amt_May Bill_Amt_April Pay_Amt_Sep Pay_Amt_Aug Pay_Amt_July Pay_Amt_June
## 1 0 0 0 689 0 0
## 2 3455 3261 0 1000 1000 1000
## 3 14948 15549 1518 1500 1000 1000
## 4 28959 29547 2000 2019 1200 1100
## 5 19146 19131 2000 36681 10000 9000
## 6 19619 20024 2500 1815 657 1000
## Pay_Amt_May Pay_Amt_April Default
## 1 0 0 Yes
## 2 0 2000 Yes
## 3 1000 5000 No
## 4 1069 1000 No
## 5 689 679 No
## 6 1000 800 No
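Note that after this relabeling, Sex, Education, Marriage, and Default are stored as character vectors rather than true factors. That is fine for the tables and plots that follow, but if proper factor columns are preferred, a small optional sketch (assuming dplyr is loaded, as above):
#Optional: convert the relabeled character columns to factors
credit <- credit %>% mutate(across(c(Sex, Education, Marriage, Default), as.factor))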
Next, bar plots and frequency tables are created to view the distribution of each categorical variable and the proportion of observations in each category. A histogram is also used to examine the Age variable, since if Age is not roughly normally distributed it is useful to know how it is skewed.
#View the bar plots for the amount for each categorical variable
counts_Sex <- table(credit$Sex)
barplot(counts_Sex, col = c("royalblue", "darkorange1"))
#Basic table view of the amount of males and females
table(credit$Sex)
##
## Female Male
## 18091 11874
#Proportion of each gender in table
prop.table(counts_Sex)
##
## Female Male
## 0.6037377 0.3962623
counts_Education <- table(credit$Education)
barplot(counts_Education, col = c("brown4", "green3", "mediumpurple2", "slategray3",
"darkgoldenrod2"))
table(credit$Education)
##
## College Grad School High School Other Unknown
## 14019 10563 4915 123 345
#Proportion of each education level in table
prop.table(counts_Education)
##
## College Grad School High School Other Unknown
## 0.467845820 0.352511263 0.164024695 0.004104789 0.011513432
counts_Marriage <- table(credit$Marriage)
barplot(counts_Marriage, col = c("magenta2", "Cyan3", "goldenrod"))
table(credit$Marriage)
##
## Married Other Single Unknown
## 13643 323 15945 54
#Proportion of each marriage status in table
prop.table(counts_Marriage)
##
## Married Other Single Unknown
## 0.455297847 0.010779242 0.532120808 0.001802102
counts_Default <- table(credit$Default)
barplot(counts_Default, col = c("turquoise2", "sienna1"))
table(credit$Default)
##
## No Yes
## 23335 6630
prop.table(counts_Default)
##
## No Yes
## 0.7787419 0.2212581
table.default_gender <- table(credit$Default, credit$Sex)
prop.table(table.default_gender, 2)
##
## Female Male
## No 0.7921066 0.7583797
## Yes 0.2078934 0.2416203
prop.table(table.default_gender, 1)
##
## Female Male
## No 0.614099 0.385901
## Yes 0.567270 0.432730
barplot(table.default_gender, col = c("sienna1", "royalblue"), beside = T,
names.arg = c("Female", "Male"))
legend("topright", legend = c("No", "Yes"), fill = c("sienna1", "royalblue"))
ggplot(data = credit, aes(x = Age)) + geom_histogram(fill = "Blue", col = "Grey", bins = 30)
ggplot(data = credit, aes(x = Age)) +
  geom_histogram(aes(y = after_stat(density)), fill = "Blue", col = "Grey", binwidth = 5) +
  geom_density(alpha = 0.2, color = "black", fill = "blue")
mean(credit$Age)
## [1] 35.48797
Looking at the charts and tables above, the data contains more females than males. In addition, the age distribution is skewed to the right, with most clients on the younger end and a long tail of older clients. Because these groups are better represented, the models may produce more reliable predictions for female and younger clients than for male and older clients.
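As a quick numerical check of this skewness claim, the sample skewness of Age can be computed directly; a positive value indicates a right skew, and a mean above the median points the same way. A small sketch (the e1071 and moments packages also provide skewness functions):
#Sample skewness of Age: a positive value means right-skewed
age_centered <- credit$Age - mean(credit$Age)
mean(age_centered^3) / (mean(age_centered^2)^1.5)
#A mean above the median is another sign of right skew
median(credit$Age)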
Before setting up the prediction models, all variables in the numeric copy “credit1” except for Default (the variable being predicted) are standardized by centering and scaling, so that every predictor is on the same scale.
#Standardize the 23 predictor columns (every column except Default)
credit1 <- credit1 %>% mutate(across(1:23, ~ as.numeric(scale(.))))
head(credit1)
## Limit_Bal Sex Education Marriage Age Pay_Sep Pay_Aug
## 1 -1.1362658 0.8101398 0.1857827 -1.0572427 -1.2460567 1.79507537 1.782007
## 2 -0.3656131 0.8101398 0.1857827 0.8584783 -1.0291243 -0.87517053 1.782007
## 3 -0.5968089 0.8101398 0.1857827 0.8584783 -0.1613944 0.01491143 0.110216
## 4 -0.9050700 0.8101398 0.1857827 -1.0572427 0.1640043 0.01491143 0.110216
## 5 -0.9050700 -1.2343136 0.1857827 -1.0572427 2.3333289 -0.87517053 0.110216
## 6 -0.9050700 -1.2343136 -1.0890004 0.8584783 0.1640043 0.01491143 0.110216
## Pay_July Pay_June Pay_May Pay_April Bill_Amt_Sep Bill_Amt_Aug
## 1 -0.6987406 -0.6686308 -1.5328219 -1.4886233 -0.6431063 -0.6479949
## 2 0.1374654 0.1874052 0.2336195 1.9923918 -0.6598187 -0.6673360
## 3 0.1374654 0.1874052 0.2336195 0.2518842 -0.2992746 -0.4945444
## 4 0.1374654 0.1874052 0.2336195 0.2518842 -0.0582829 -0.0140931
## 5 -0.6987406 0.1874052 0.2336195 0.2518842 -0.5792437 -0.6119253
## 6 0.1374654 0.1874052 0.2336195 0.2518842 0.1780793 0.1100157
## Bill_Amt_July Bill_Amt_June Bill_Amt_May Bill_Amt_April Pay_Amt_Sep
## 1 -0.66856007 -0.6730531 -0.6636014 -0.6532534 -0.3421525
## 2 -0.63983063 -0.6222089 -0.6067918 -0.5985149 -0.3421525
## 3 -0.48303680 -0.4503613 -0.4178154 -0.3922509 -0.2505514
## 4 0.03204614 -0.2330771 -0.1874362 -0.1572832 -0.2214659
## 5 -0.16192442 -0.3476629 -0.3487888 -0.3321241 -0.2214659
## 6 0.15193713 -0.3716865 -0.3410114 -0.3171344 -0.1912942
## Pay_Amt_Aug Pay_Amt_July Pay_Amt_June Pay_Amt_May Pay_Amt_April Default
## 1 -0.2272537 -0.2969790 -0.3082477 -0.3143255 -0.29355736 1
## 2 -0.2137633 -0.2402136 -0.2444497 -0.3143255 -0.18111555 1
## 3 -0.1920746 -0.2402136 -0.2444497 -0.2489078 -0.01245282 0
## 4 -0.1695617 -0.2288605 -0.2380699 -0.2443940 -0.23733645 0
## 5 1.3339872 0.2706751 0.2659346 -0.2692527 -0.25538337 0
## 6 -0.1784107 -0.2596841 -0.2444497 -0.2489078 -0.24858064 0
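As a quick sanity check of the scaling, each of the 23 standardized predictors should now have a mean of approximately 0 and a standard deviation of approximately 1:
#Verify the standardization: means near 0 and standard deviations near 1
round(colMeans(credit1[, 1:23]), 3)
round(sapply(credit1[, 1:23], sd), 3)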
Before creating prediction models, training and testing data sets are created. The training set is the subset of observations used to fit each model, while the testing set is a held-out subset used to evaluate how well the fitted model performs on data it has not seen.
#Set the random seed so the split is reproducible
set.seed(123)
#Randomly assign each row to the training set (roughly 75%) or the testing set (roughly 25%)
sample <- sample(c(TRUE, FALSE), nrow(credit1), replace = TRUE, prob = c(0.75, 0.25))
train_set <- credit1[sample, ]
test_set <- credit1[!sample, ]
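Because each row is assigned to a set at random, the realized split will be close to, but not exactly, 75/25. A quick check of the resulting sizes:
#Sizes of the training and testing sets and the realized training proportion
nrow(train_set)
nrow(test_set)
nrow(train_set) / nrow(credit1)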
Now that the training and testing data sets are created, prediction analysis methods such as logistic regression and random forest can be completed.
First, logistic regression is used to estimate the probability that an individual will default. Logistic regression models the probability that a response variable (Y) belongs to a particular category; the model is fit by maximum likelihood, and its predicted probabilities always fall between 0 and 1.
Logistic regression is a classification method well suited to a yes/no response: a predicted probability close to 1 represents “Yes”, while one close to 0 represents “No”.
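Concretely, logistic regression combines the predictors into a linear score and passes that score through the logistic (inverse-logit) function, which maps any real number into a probability between 0 and 1. A small illustration using base R's plogis():
#The logistic function turns a linear score into a probability in (0, 1)
plogis(-4)  # a strongly negative score gives a probability close to 0 ("No")
plogis(0)   # a score of 0 gives a probability of exactly 0.5
plogis(4)   # a strongly positive score gives a probability close to 1 ("Yes")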
A logistic regression model is created below and then used to predict the probability of credit card default for each client in the test set:
# Creates the logistic regression model with the training Set
fit_glm <- glm(Default ~ ., data = train_set, family = binomial())
summary(fit_glm)
##
## Call:
## glm(formula = Default ~ ., family = binomial(), data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1350 -0.7001 -0.5549 -0.2947 3.6144
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.45715 0.01907 -76.392 < 2e-16 ***
## Limit_Bal -0.09134 0.02350 -3.887 0.000102 ***
## Sex -0.03674 0.01731 -2.123 0.033761 *
## Education -0.08618 0.01897 -4.543 5.56e-06 ***
## Marriage -0.10127 0.01907 -5.311 1.09e-07 ***
## Age 0.06822 0.01886 3.616 0.000299 ***
## Pay_Sep 0.64147 0.02297 27.924 < 2e-16 ***
## Pay_Aug 0.10416 0.02782 3.744 0.000181 ***
## Pay_July 0.07033 0.03088 2.278 0.022749 *
## Pay_June 0.01996 0.03372 0.592 0.553983
## Pay_May 0.04338 0.03500 1.239 0.215237
## Pay_April 0.01727 0.02912 0.593 0.553186
## Bill_Amt_Sep -0.33056 0.09255 -3.572 0.000355 ***
## Bill_Amt_Aug 0.03016 0.12150 0.248 0.803986
## Bill_Amt_July 0.18701 0.10564 1.770 0.076689 .
## Bill_Amt_June 0.01789 0.09725 0.184 0.854062
## Bill_Amt_May 0.05036 0.09916 0.508 0.611526
## Bill_Amt_April -0.03170 0.07636 -0.415 0.678087
## Pay_Amt_Sep -0.20410 0.04304 -4.742 2.12e-06 ***
## Pay_Amt_Aug -0.28879 0.05754 -5.019 5.19e-07 ***
## Pay_Amt_July -0.04802 0.03532 -1.360 0.173965
## Pay_Amt_June -0.06295 0.03160 -1.992 0.046381 *
## Pay_Amt_May -0.02452 0.02880 -0.852 0.394482
## Pay_Amt_April -0.01453 0.02535 -0.573 0.566606
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23800 on 22476 degrees of freedom
## Residual deviance: 21021 on 22453 degrees of freedom
## AIC: 21069
##
## Number of Fisher Scoring iterations: 6
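To interpret the coefficients on a more familiar scale, they can be exponentiated into odds ratios. Because the predictors were standardized, each odds ratio is the multiplicative change in the odds of default for a one standard deviation increase in that (scaled) predictor; a small sketch using the fitted model above:
#Odds ratios for the logistic regression coefficients
round(exp(coef(fit_glm)), 3)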
# Estimates probabilities of a client defaulting on their credit card
pred_probs <- predict.glm(fit_glm, newdata = test_set, type = "response")
head(pred_probs)
## 2 4 5 8 11 16
## 0.1518637 0.2483820 0.1312774 0.1722475 0.1976444 0.2661480
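Before converting these probabilities into class labels, it is worth glancing at their overall spread: if most predicted probabilities sit well below the 0.5 cutoff, relatively few clients will end up labeled “Yes”. A small sketch:
#Distribution of the predicted default probabilities on the test set
summary(pred_probs)
hist(pred_probs, breaks = 30, main = "Predicted probability of default", xlab = "Probability")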
#Classify each client as "No" (0) if the predicted probability is below 0.5, and "Yes" (1) otherwise
pred <- ifelse(pred_probs < 0.5, 0, 1)
#Creates confusion table displaying where each client was placed and if they were placed in the right group
confusion_table <- table(test_set$Default, pred)
#Displays confusion matrix and the statistics associated with the confusion matrix
confusionMatrix(confusion_table)
## Confusion Matrix and Statistics
##
## pred
## 0 1
## 0 5711 137
## 1 1233 407
##
## Accuracy : 0.817
## 95% CI : (0.8081, 0.8257)
## No Information Rate : 0.9274
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2959
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8224
## Specificity : 0.7482
## Pos Pred Value : 0.9766
## Neg Pred Value : 0.2482
## Prevalence : 0.9274
## Detection Rate : 0.7627
## Detection Prevalence : 0.7810
## Balanced Accuracy : 0.7853
##
## 'Positive' Class : 0
##
Overall, the model had an accuracy of 81.7%. The logistic regression model did well on clients who did not default, correctly labeling 97.66% of them “No”. However, it did much worse on clients who did default, correctly labeling only 24.82% of them “Yes”.
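These per-class rates can be read directly off the confusion table by converting each row (the actual outcome) into proportions:
#Row-wise proportions: each row is an actual outcome (0 = No, 1 = Yes),
#so the diagonal shows the share of each class that was classified correctly
prop.table(confusion_table, 1)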
Another prediction model used is a random forest. A random forest is a classifier built from many decision trees: each tree is grown on a bootstrap sample of the training data, and the forest classifies an observation by majority vote across its trees. Because each tree sees only a bootstrap sample, roughly a third of the training observations are left out of any given tree; these out-of-bag (OOB) observations provide an internal estimate of the error rate without requiring a separate validation set.
A random forest model is created to determine the probability of credit card default:
set.seed(123)
#Random forest on all predictors. mtry = 5, roughly the square root of the 23 predictor variables.
fit_rf <- randomForest(factor(Default) ~ ., mtry = 5, data = train_set)
fit_rf
##
## Call:
## randomForest(formula = factor(Default) ~ ., data = train_set, mtry = 5)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 18.37%
## Confusion matrix:
## 0 1 class.error
## 0 16533 954 0.05455481
## 1 3176 1814 0.63647295
varImpPlot(fit_rf)
From the printed output and the variable importance plot, the pay amount and bill amount in September, as well as age, appear to be important variables in determining credit card default. It can also be argued that the bill amounts in August and July are important.
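The importance scores behind the plot can also be inspected numerically. By default, randomForest() reports the mean decrease in Gini impurity for each predictor; a small sketch listing the top ten:
#Variable importance (mean decrease in Gini), sorted from most to least important
imp <- importance(fit_rf)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)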
Looking at the OOB confusion matrix, the random forest predicted “No” on credit card default well, with a class error of 5.46%. However, it struggled to predict “Yes”, with a class error of 63.65%.
Compared to the logistic regression model, the random forest did slightly worse at predicting “No” on credit card default (roughly 94.54% correct versus 97.66%) but better at predicting “Yes” (36.35% versus 24.82%). Overall, the random forest had an estimated accuracy of 81.63%, almost identical to the logistic regression model’s 81.7%. One caveat is that the random forest figures are out-of-bag estimates from the training data, while the logistic regression figures come from the held-out test set, so the comparison is only approximate.
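To make that comparison more direct, the random forest can also be scored on the same held-out test set used for the logistic regression model; this is only a sketch, and its results will differ somewhat from the OOB estimates reported above:
#Evaluate the random forest on the same test set used for the logistic regression model
rf_pred <- predict(fit_rf, newdata = test_set)
confusionMatrix(table(test_set$Default, rf_pred))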
In this project, logistic regression and random forest models were created to predict if an individual would default on their credit card.
First, the data was cleaned, with duplicate rows removed and no missing values found, and summarized to view distributions and trends. From the tables and plots created, the data contained more females than males and was skewed in age, with clients below 40 much more prevalent than clients over 40.
Next, prediction models were created to estimate an individual’s chance of defaulting on their credit card. The first was a logistic regression model, which predicts a yes/no outcome from a client’s information. The logistic regression model did very well predicting “No” on whether a client would default, but much less well predicting “Yes”.
A random forest model was also created, both to identify the most important variables and to assess predictive accuracy. From the results, the random forest accurately predicted when someone would not default, but had a much harder time identifying the clients who would.
Overall, both models did a good job predicting “No” on credit card default but had a harder time predicting “Yes”. This is at least partly a consequence of the class imbalance, since only about 22% of the clients in the data defaulted. I believe adding variables such as credit score, credit age, and credit card utilization could help the models predict “Yes” more accurately.
Thank you for viewing my project.