I am using the CreditCard_dataset_Original.csv dataset. It contains information about individuals' credit histories and can be used to predict whether an individual will default on their loans. The dataset is available on the data.world website and has over 80,000 observations and 49 variables.
In this section of my analysis, I imported the credit card dataset obtained from the data.world website and read the file into a data frame named Og_CreditCard_dataset_.
library(readr)
library(dplyr)
Og_CreditCard_dataset_ <- read_csv("CreditCard_dataset_Original.csv")
In this section of my analysis, I selected five variables and gave four of them descriptive names. Since most of the variables in this dataset were read in as strings, I converted default_ind, Annual_Income, Number_CreditLines_in_delinquency, and Current_CreditCard_Balance_Own into integers. Lastly, I removed any rows with NA values from my dataset, since they would interfere with my analysis.
CreditCardData <- Og_CreditCard_dataset_ %>%
  select(mvar12, mvar14, mvar39, default_ind, mvar2) %>%
  rename("Current_CreditCard_Balance_Own" = mvar12,
         "Annual_Income" = mvar14,
         "Number_CreditLines_in_delinquency" = mvar39,
         "Credit_Worthiness" = mvar2) %>%
  mutate(default_ind = as.integer(default_ind),
         Annual_Income = as.integer(Annual_Income),
         Number_CreditLines_in_delinquency = as.integer(Number_CreditLines_in_delinquency),
         Current_CreditCard_Balance_Own = as.integer(Current_CreditCard_Balance_Own)) %>%
  filter(!is.na(default_ind),
         !is.na(Annual_Income),
         !is.na(Number_CreditLines_in_delinquency),
         !is.na(Current_CreditCard_Balance_Own),
         !is.na(Credit_Worthiness))
In this section, I recoded Credit_Worthiness into a new variable called Credit_Candidate. Many factors go into evaluating someone's creditworthiness, such as age, financial obligations, current debt, and credit score. If an individual has a creditworthiness rating of 1750 or higher, they are classified as a Good_Candidate. If the score is lower than 1750, they are placed in the Bad_Candidate class.
CreditCardData$Credit_Candidate <- ifelse(CreditCardData$Credit_Worthiness >= 1750,"Good_Candidate","Bad_Candidate")
head(CreditCardData)
## # A tibble: 6 x 6
## Current_CreditC… Annual_Income Number_CreditLi… default_ind
## <int> <int> <int> <int>
## 1 6423 123875 1 0
## 2 765 42613 0 1
## 3 0 84235 0 0
## 4 2257 123875 0 0
## 5 448 123875 0 1
## 6 16298 109010 0 0
## # … with 2 more variables: Credit_Worthiness <chr>, Credit_Candidate <chr>
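The head() output above lists Credit_Worthiness as <chr>, so the >= comparison used to build Credit_Candidate is evaluated on strings rather than on numbers. The sketch below uses made-up scores (not values from the dataset) and assumes the real column parses cleanly as integers. It shows how the two comparisons can disagree and one way the recode could be done numerically. The model output in this report was produced with the original recode.

```r
# Hypothetical scores, stored as text the way the column was imported
scores_chr <- c("900", "1750", "1800", "1699")

# Character vs. number: R coerces 1750 to "1750" and compares as strings
string_compare <- scores_chr >= 1750

# Converting to integer first gives a numeric comparison
scores_int <- as.integer(scores_chr)
numeric_compare <- scores_int >= 1750

# Recode based on the numeric comparison
candidate <- ifelse(numeric_compare, "Good_Candidate", "Bad_Candidate")
print(data.frame(scores_chr, string_compare, numeric_compare, candidate))
```

In the pipeline above, the equivalent change would be to add Credit_Worthiness = as.integer(Credit_Worthiness) to the mutate() step before the recode.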
Model1 <- glm(default_ind ~ Annual_Income, family = "binomial", data = CreditCardData)
summary(Model1)
##
## Call:
## glm(formula = default_ind ~ Annual_Income, family = "binomial",
## data = CreditCardData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6738 -0.6737 -0.6737 -0.6735 2.1615
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.367e+00 9.736e-03 -140.423 <2e-16 ***
## Annual_Income -4.375e-09 4.997e-09 -0.876 0.381
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 66074 on 65483 degrees of freedom
## Residual deviance: 66073 on 65482 degrees of freedom
## AIC: 66077
##
## Number of Fisher Scoring iterations: 5
Model2 <- glm(default_ind ~ Annual_Income + Number_CreditLines_in_delinquency + Current_CreditCard_Balance_Own + Credit_Candidate, family = "binomial", data = CreditCardData)
summary(Model2)
##
## Call:
## glm(formula = default_ind ~ Annual_Income + Number_CreditLines_in_delinquency +
## Current_CreditCard_Balance_Own + Credit_Candidate, family = "binomial",
## data = CreditCardData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7902 -0.6143 -0.6034 -0.5459 2.7553
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.570e+00 1.378e-02 -113.89 <2e-16
## Annual_Income -3.330e-09 4.323e-09 -0.77 0.441
## Number_CreditLines_in_delinquency 9.495e-01 2.587e-02 36.71 <2e-16
## Current_CreditCard_Balance_Own -1.108e-05 1.018e-06 -10.88 <2e-16
## Credit_CandidateGood_Candidate 7.030e-01 2.213e-02 31.77 <2e-16
##
## (Intercept) ***
## Annual_Income
## Number_CreditLines_in_delinquency ***
## Current_CreditCard_Balance_Own ***
## Credit_CandidateGood_Candidate ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 66074 on 65483 degrees of freedom
## Residual deviance: 63370 on 65479 degrees of freedom
## AIC: 63380
##
## Number of Fisher Scoring iterations: 4
Model3 <- glm(default_ind ~ Annual_Income*Credit_Candidate + Number_CreditLines_in_delinquency + Current_CreditCard_Balance_Own, family = "binomial", data = CreditCardData)
summary(Model3)
##
## Call:
## glm(formula = default_ind ~ Annual_Income * Credit_Candidate +
## Number_CreditLines_in_delinquency + Current_CreditCard_Balance_Own,
## family = "binomial", data = CreditCardData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7974 -0.6135 -0.6029 -0.5473 4.1966
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -1.573e+00 1.380e-02
## Annual_Income -1.825e-09 3.406e-09
## Credit_CandidateGood_Candidate 8.309e-01 3.067e-02
## Number_CreditLines_in_delinquency 9.469e-01 2.587e-02
## Current_CreditCard_Balance_Own -1.059e-05 1.019e-06
## Annual_Income:Credit_CandidateGood_Candidate -1.346e-06 2.322e-07
## z value Pr(>|z|)
## (Intercept) -113.981 < 2e-16 ***
## Annual_Income -0.536 0.592
## Credit_CandidateGood_Candidate 27.093 < 2e-16 ***
## Number_CreditLines_in_delinquency 36.607 < 2e-16 ***
## Current_CreditCard_Balance_Own -10.386 < 2e-16 ***
## Annual_Income:Credit_CandidateGood_Candidate -5.797 6.75e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 66074 on 65483 degrees of freedom
## Residual deviance: 63323 on 65478 degrees of freedom
## AIC: 63335
##
## Number of Fisher Scoring iterations: 7
Since Model3 contains all of the variables that I selected, I will provide a brief summary of it. One significant variable is Number_CreditLines_in_delinquency: an increase of one credit line in delinquency raises the log odds of defaulting on a credit card payment by about 0.947, holding the other variables constant. The Credit_CandidateGood_Candidate coefficient (about 0.831) is also significant and positive. In addition, the interaction term shows that, for good credit candidates, each one-unit increase in annual income lowers the log odds of defaulting by a further 0.000001346 compared with bad candidates. In other words, a good predictor of whether someone will default on a credit card payment is whether they currently have delinquencies on other lines or methods of credit. Please direct your attention below for more details.
coef((Model3))
## (Intercept)
## -1.572833e+00
## Annual_Income
## -1.824800e-09
## Credit_CandidateGood_Candidate
## 8.309077e-01
## Number_CreditLines_in_delinquency
## 9.469242e-01
## Current_CreditCard_Balance_Own
## -1.058805e-05
## Annual_Income:Credit_CandidateGood_Candidate
## -1.345889e-06
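Log-odds coefficients are easier to read as odds ratios. The sketch below is a minimal illustration that copies three coefficients from the coef(Model3) output above and exponentiates them. The vector name model3_coef is introduced here only for illustration.

```r
# Coefficients copied from the coef(Model3) output above (log-odds scale)
model3_coef <- c(
  Number_CreditLines_in_delinquency = 9.469242e-01,
  Credit_CandidateGood_Candidate    = 8.309077e-01,
  Current_CreditCard_Balance_Own    = -1.058805e-05
)

# exp() turns a log-odds coefficient into an odds ratio
odds_ratios <- exp(model3_coef)
print(round(odds_ratios, 3))

# Percent change in the odds of default per one-unit increase
print(round(100 * (odds_ratios - 1), 2))
```

With the fitted model in memory, exp(coef(Model3)) gives the same quantities for every term. An odds ratio above 1 means higher odds of default per unit increase, and below 1 means lower odds.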
If one directs their attention to the AIC and BIC rows of the table below, both values decrease as variables are added, which indicates that the model improves. The best-performing model is Model 3, which uses all of the selected variables in the dataset plus an interaction between Annual_Income and Credit_Candidate. For more details, please direct your attention below.
library(texreg)
htmlreg(list(Model1, Model2, Model3), doctype = FALSE)
| | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| (Intercept) | -1.37*** | -1.57*** | -1.57*** |
| | (0.01) | (0.01) | (0.01) |
| Annual_Income | -0.00 | -0.00 | -0.00 |
| | (0.00) | (0.00) | (0.00) |
| Number_CreditLines_in_delinquency | | 0.95*** | 0.95*** |
| | | (0.03) | (0.03) |
| Current_CreditCard_Balance_Own | | -0.00*** | -0.00*** |
| | | (0.00) | (0.00) |
| Credit_CandidateGood_Candidate | | 0.70*** | 0.83*** |
| | | (0.02) | (0.03) |
| Annual_Income:Credit_CandidateGood_Candidate | | | -0.00*** |
| | | | (0.00) |
| AIC | 66076.78 | 63380.50 | 63335.24 |
| BIC | 66094.96 | 63425.94 | 63389.77 |
| Log Likelihood | -33036.39 | -31685.25 | -31661.62 |
| Deviance | 66072.78 | 63370.50 | 63323.24 |
| Num. obs. | 65484 | 65484 | 65484 |
| \*\*\* p < 0.001, \*\* p < 0.01, \* p < 0.05 | | | |
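The AIC and BIC rows can be reproduced from the log likelihoods in the same table. The sketch below is a rough check using the standard definitions, AIC = 2k - 2 logLik and BIC = k log(n) - 2 logLik, where k is the number of estimated coefficients and n the number of observations. The parameter counts are read off the coefficient tables above, and small differences from the table are expected because the log likelihoods are rounded.

```r
# Values copied from the regression table above
log_lik  <- c(Model1 = -33036.39, Model2 = -31685.25, Model3 = -31661.62)
n_params <- c(Model1 = 2, Model2 = 5, Model3 = 6)  # coefficients, incl. intercept
n_obs    <- 65484

# Standard definitions of the two information criteria
aic <- 2 * n_params - 2 * log_lik
bic <- log(n_obs) * n_params - 2 * log_lik

print(round(rbind(AIC = aic, BIC = bic), 2))

# Lower is better for both criteria
print(names(which.min(aic)))
print(names(which.min(bic)))
```

BIC penalizes each extra coefficient by log(n) rather than 2, so it is the stricter criterion at this sample size.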
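Based on the analysis of deviance (ANOVA) test below, it would seem that Model3 is the best model: each step from Model1 to Model2 to Model3 produces a statistically significant drop in residual deviance.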
anova(Model1, Model2, Model3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: default_ind ~ Annual_Income
## Model 2: default_ind ~ Annual_Income + Number_CreditLines_in_delinquency +
## Current_CreditCard_Balance_Own + Credit_Candidate
## Model 3: default_ind ~ Annual_Income * Credit_Candidate + Number_CreditLines_in_delinquency +
## Current_CreditCard_Balance_Own
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 65482 66073
## 2 65479 63370 3 2702.28 < 2.2e-16 ***
## 3 65478 63323 1 47.26 6.221e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
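The p-values in the analysis of deviance table come from comparing each drop in deviance to a chi-squared distribution, with degrees of freedom equal to the number of added coefficients. The sketch below redoes that calculation by hand from the deviances reported in the regression table. It is an illustration of the test, not a replacement for anova().

```r
# Residual deviances and degrees of freedom reported above
resid_dev <- c(Model1 = 66072.78, Model2 = 63370.50, Model3 = 63323.24)
resid_df  <- c(Model1 = 65482,    Model2 = 65479,    Model3 = 65478)

# Drop in deviance and number of added coefficients at each step
dev_drop <- -diff(resid_dev)
df_added <- -diff(resid_df)

# Upper-tail chi-squared probability is the likelihood-ratio p-value
p_value <- pchisq(dev_drop, df = df_added, lower.tail = FALSE)
print(data.frame(dev_drop, df_added, p_value))
```

The rows are labeled by the larger model in each comparison, so the Model2 row compares Model2 against Model1 and the Model3 row compares Model3 against Model2.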
library(visreg)
visreg(Model1,"Annual_Income", scale = "response")
visreg(Model2,"Number_CreditLines_in_delinquency", scale = "response")
visreg(Model3,"Current_CreditCard_Balance_Own", scale = "response")
## Warning: Note that you are attempting to plot a 'main effect' in a model that contains an
## interaction. This is potentially misleading; you may wish to consider using the 'by'
## argument.
## Conditions used in construction of plot
## Annual_Income: 74325
## Credit_Candidate: Bad_Candidate
## Number_CreditLines_in_delinquency: 0
visreg(Model3, "Credit_Candidate", by = "Number_CreditLines_in_delinquency", scale = "response")
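Because the visreg calls use scale = "response", the plots show predicted probabilities rather than log odds. The sketch below shows that conversion by hand with plogis(), using the Model3 coefficients and the conditions visreg reported (Annual_Income of 74325, Bad_Candidate, zero delinquent lines). The function predict_default and the example balances are hypothetical and serve only as illustration.

```r
# Coefficients copied from the coef(Model3) output
b <- c(intercept = -1.572833, income = -1.824800e-09,
       delinquency = 9.469242e-01, balance = -1.058805e-05)

# Bad_Candidate profile, so the Good_Candidate and interaction terms are zero
predict_default <- function(balance, income = 74325, delinquency = 0) {
  log_odds <- b[["intercept"]] + b[["income"]] * income +
    b[["delinquency"]] * delinquency + b[["balance"]] * balance
  plogis(log_odds)  # inverse logit: log odds -> probability
}

balances <- c(0, 5000, 20000)  # hypothetical balances
probs <- predict_default(balances)
print(round(probs, 4))
```

With the fitted model available, predict(Model3, newdata = ..., type = "response") performs the same conversion for any combination of predictors.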
Based on the charts generated by the visreg package and the AIC/BIC comparison, it would seem that the weakest variable for predicting whether a person will default on a credit card payment is Annual_Income. The weakest model is also Model1. The reason is that Annual_Income on its own is not statistically significant (p = 0.381 in Model1), and Model1 has the highest AIC and BIC of the three models.
The best variables for predicting whether an individual will miss their credit card payment for the month are Number_CreditLines_in_delinquency and Current_CreditCard_Balance_Own. Each additional delinquent credit line raises the log odds, and therefore the probability, of missing a credit card payment. The coefficient on Current_CreditCard_Balance_Own is negative (about -0.0000106 per unit), so a larger balance is associated with slightly lower log odds of default once the other variables are held constant. This is an interesting finding because it suggests that carrying a larger balance does not, by itself, mean a person is less likely to pay their bills on time. It must not go unmentioned that both good and bad candidates are going to have delinquencies. However, the data suggest that good candidates are going to have fewer delinquencies.