About the Data

I am using the CreditCard_dataset_Original.csv dataset. It contains information about individuals’ credit histories and can be used to predict whether an individual will default on their loans. The dataset can be found on the data.world website. It has over 80,000 observations and 49 variables.

Preparing the data

In this section of my analysis, I imported the credit card dataset from the data.world website and read the file into a data frame named Og_CreditCard_dataset_.

library(readr)
library(dplyr)
Og_CreditCard_dataset_ <- read_csv("CreditCard_dataset_Original.csv")

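In this section of my analysis, I selected five variables and gave four of them descriptive names (default_ind keeps its original name). Since most of the variables in this data were read in as strings, I converted four of my variables of interest to integers; Credit_Worthiness is left as a character column. Lastly, I removed rows with NA values in any of the selected variables, since they would interfere with my analysis.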

CreditCardData <- Og_CreditCard_dataset_ %>%
  # Keep the five variables of interest
  select(mvar12, mvar14, mvar39, default_ind, mvar2) %>%
  # Give the mvar columns descriptive names
  rename(Current_CreditCard_Balance_Own = mvar12,
         Annual_Income = mvar14,
         Number_CreditLines_in_delinquency = mvar39,
         Credit_Worthiness = mvar2) %>%
  # Convert the string columns used as outcome and numeric predictors to integers
  mutate(default_ind = as.integer(default_ind),
         Annual_Income = as.integer(Annual_Income),
         Number_CreditLines_in_delinquency = as.integer(Number_CreditLines_in_delinquency),
         Current_CreditCard_Balance_Own = as.integer(Current_CreditCard_Balance_Own)) %>%
  # Drop rows with a missing value in any selected variable
  filter(!is.na(default_ind),
         !is.na(Annual_Income),
         !is.na(Number_CreditLines_in_delinquency),
         !is.na(Current_CreditCard_Balance_Own),
         !is.na(Credit_Worthiness))

In this section, I created a new variable called Credit_Candidate by recoding Credit_Worthiness. Many factors go into evaluating someone’s creditworthiness, such as age, financial obligations, current debt, and credit score. Individuals with a creditworthiness rating of 1750 or higher were classified as Good_Candidate; those with a rating below 1750 were classified as Bad_Candidate.

CreditCardData$Credit_Candidate <- ifelse(CreditCardData$Credit_Worthiness >= 1750,"Good_Candidate","Bad_Candidate")

head(CreditCardData)

## # A tibble: 6 x 6
##   Current_CreditC… Annual_Income Number_CreditLi… default_ind
##              <int>         <int>            <int>       <int>
## 1             6423        123875                1           0
## 2              765         42613                0           1
## 3                0         84235                0           0
## 4             2257        123875                0           0
## 5              448        123875                0           1
## 6            16298        109010                0           0
## # … with 2 more variables: Credit_Worthiness <chr>, Credit_Candidate <chr>
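
Because the output above lists Credit_Worthiness as <chr>, the `>= 1750` comparison is carried out on text rather than on numbers: R coerces 1750 to the string "1750" and compares the strings. The sketch below shows one way the rating could be converted to a number before it is classified, assuming the ratings are meant to be compared numerically. The helper name `classify_candidate` and the example scores are hypothetical and are not taken from the dataset.

```r
# Hypothetical helper: convert a text rating to a number, then classify it.
classify_candidate <- function(score, cutoff = 1750) {
  # Strings that are not numbers (for example "na") become NA;
  # the coercion warning is suppressed on purpose.
  score_num <- suppressWarnings(as.numeric(score))
  ifelse(score_num >= cutoff, "Good_Candidate", "Bad_Candidate")
}

# Made-up ratings stored as text, as they would be in a <chr> column.
example_scores <- c("1800", "1697", "900", "na")
print(classify_candidate(example_scores))
```

With a text comparison, a rating such as "900" would typically sort after "1750" because "9" comes after "1"; converting first avoids that, and unparseable ratings come back as NA instead of being placed in a class.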

Simple Models.

Model1 <- glm(default_ind ~ Annual_Income, family = "binomial", data = CreditCardData)
summary(Model1)

## 
## Call:
## glm(formula = default_ind ~ Annual_Income, family = "binomial", 
##     data = CreditCardData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6738  -0.6737  -0.6737  -0.6735   2.1615  
## 
## Coefficients:
##                 Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)   -1.367e+00  9.736e-03 -140.423   <2e-16 ***
## Annual_Income -4.375e-09  4.997e-09   -0.876    0.381    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 66074  on 65483  degrees of freedom
## Residual deviance: 66073  on 65482  degrees of freedom
## AIC: 66077
## 
## Number of Fisher Scoring iterations: 5

Model2 <- glm(default_ind ~ Annual_Income + Number_CreditLines_in_delinquency + Current_CreditCard_Balance_Own + Credit_Candidate, family = "binomial", data = CreditCardData)

summary(Model2)

## 
## Call:
## glm(formula = default_ind ~ Annual_Income + Number_CreditLines_in_delinquency + 
##     Current_CreditCard_Balance_Own + Credit_Candidate, family = "binomial", 
##     data = CreditCardData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.7902  -0.6143  -0.6034  -0.5459   2.7553  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)                       -1.570e+00  1.378e-02 -113.89   <2e-16
## Annual_Income                     -3.330e-09  4.323e-09   -0.77    0.441
## Number_CreditLines_in_delinquency  9.495e-01  2.587e-02   36.71   <2e-16
## Current_CreditCard_Balance_Own    -1.108e-05  1.018e-06  -10.88   <2e-16
## Credit_CandidateGood_Candidate     7.030e-01  2.213e-02   31.77   <2e-16
##                                      
## (Intercept)                       ***
## Annual_Income                        
## Number_CreditLines_in_delinquency ***
## Current_CreditCard_Balance_Own    ***
## Credit_CandidateGood_Candidate    ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 66074  on 65483  degrees of freedom
## Residual deviance: 63370  on 65479  degrees of freedom
## AIC: 63380
## 
## Number of Fisher Scoring iterations: 4

Interaction terms in our model.

Model3 <- glm(default_ind ~ Annual_Income*Credit_Candidate + Number_CreditLines_in_delinquency + Current_CreditCard_Balance_Own, family = "binomial", data = CreditCardData)
summary(Model3)

## 
## Call:
## glm(formula = default_ind ~ Annual_Income * Credit_Candidate + 
##     Number_CreditLines_in_delinquency + Current_CreditCard_Balance_Own, 
##     family = "binomial", data = CreditCardData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.7974  -0.6135  -0.6029  -0.5473   4.1966  
## 
## Coefficients:
##                                                Estimate Std. Error
## (Intercept)                                  -1.573e+00  1.380e-02
## Annual_Income                                -1.825e-09  3.406e-09
## Credit_CandidateGood_Candidate                8.309e-01  3.067e-02
## Number_CreditLines_in_delinquency             9.469e-01  2.587e-02
## Current_CreditCard_Balance_Own               -1.059e-05  1.019e-06
## Annual_Income:Credit_CandidateGood_Candidate -1.346e-06  2.322e-07
##                                               z value Pr(>|z|)    
## (Intercept)                                  -113.981  < 2e-16 ***
## Annual_Income                                  -0.536    0.592    
## Credit_CandidateGood_Candidate                 27.093  < 2e-16 ***
## Number_CreditLines_in_delinquency              36.607  < 2e-16 ***
## Current_CreditCard_Balance_Own                -10.386  < 2e-16 ***
## Annual_Income:Credit_CandidateGood_Candidate   -5.797 6.75e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 66074  on 65483  degrees of freedom
## Residual deviance: 63323  on 65478  degrees of freedom
## AIC: 63335
## 
## Number of Fisher Scoring iterations: 7

Since Model 3 has all of the variables that I selected, I will provide a brief summary. One significant variable is Number_CreditLines_in_delinquency: each additional credit line in delinquency increases the log odds of defaulting on a credit card payment by 0.947. The interaction term is also significant: for good credit candidates, each one-unit increase in Annual_Income lowers the log odds of default by a further 0.000001346 compared with bad candidates, for whom the income effect is not statistically significant. In other words, a good predictor of whether someone will default on a credit card payment is whether they have current delinquencies on other lines or methods of credit. The full set of coefficients is listed below.

coef(Model3)

##                                  (Intercept) 
##                                -1.572833e+00 
##                                Annual_Income 
##                                -1.824800e-09 
##               Credit_CandidateGood_Candidate 
##                                 8.309077e-01 
##            Number_CreditLines_in_delinquency 
##                                 9.469242e-01 
##               Current_CreditCard_Balance_Own 
##                                -1.058805e-05 
## Annual_Income:Credit_CandidateGood_Candidate 
##                                -1.345889e-06
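
Log odds are easier to read after they are exponentiated into odds ratios, or after a linear predictor is turned into a probability with the inverse logit. The sketch below does both with the Model3 coefficients typed in from the output above. The two borrower profiles (a bad candidate with an income of 74325, a zero balance, and either 0 or 1 delinquent credit lines) are illustrative choices, not rows from the data.

```r
# Model3 coefficients, copied from the coef() output above.
b <- c(
  "(Intercept)"                                  = -1.572833e+00,
  "Annual_Income"                                = -1.824800e-09,
  "Credit_CandidateGood_Candidate"               =  8.309077e-01,
  "Number_CreditLines_in_delinquency"            =  9.469242e-01,
  "Current_CreditCard_Balance_Own"               = -1.058805e-05,
  "Annual_Income:Credit_CandidateGood_Candidate" = -1.345889e-06
)

# Odds ratios: multiplicative change in the odds of default
# for a one-unit increase in each predictor (intercept dropped).
odds_ratios <- exp(b[-1])
print(round(odds_ratios, 4))

# Linear predictor (log odds) for the two illustrative profiles.
# Bad candidate and zero balance, so those terms contribute nothing.
eta_0 <- b[["(Intercept)"]] + b[["Annual_Income"]] * 74325
eta_1 <- eta_0 + b[["Number_CreditLines_in_delinquency"]]

# plogis() is the inverse logit: 1 / (1 + exp(-eta)).
p_0 <- plogis(eta_0)
p_1 <- plogis(eta_1)
print(round(c(no_delinquency = p_0, one_delinquency = p_1), 3))
```

Under these assumptions, one delinquent credit line multiplies the odds of default by about 2.58 and roughly doubles the predicted probability, from about 0.17 to about 0.35.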

Looking at both AIC and BIC

Looking at the AIC and BIC rows of the table below, both criteria fall as variables are added, and lower values indicate a better balance between fit and complexity. The best-performing model is Model 3, which uses all of the selected variables plus the interaction between Annual_Income and Credit_Candidate. For more details, please see the table below.

library(texreg)
htmlreg(list(Model1, Model2, Model3), doctype = FALSE)

Statistical models
                                              Model 1     Model 2     Model 3
(Intercept)                                   -1.37***    -1.57***    -1.57***
                                              (0.01)      (0.01)      (0.01)
Annual_Income                                 -0.00       -0.00       -0.00
                                              (0.00)      (0.00)      (0.00)
Number_CreditLines_in_delinquency                         0.95***     0.95***
                                                          (0.03)      (0.03)
Current_CreditCard_Balance_Own                            -0.00***    -0.00***
                                                          (0.00)      (0.00)
Credit_CandidateGood_Candidate                            0.70***     0.83***
                                                          (0.02)      (0.03)
Annual_Income:Credit_CandidateGood_Candidate                          -0.00***
                                                                      (0.00)
AIC                                           66076.78    63380.50    63335.24
BIC                                           66094.96    63425.94    63389.77
Log Likelihood                                -33036.39   -31685.25   -31661.62
Deviance                                      66072.78    63370.50    63323.24
Num. obs.                                     65484       65484       65484
***p < 0.001, **p < 0.01, *p < 0.05
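
The AIC and BIC rows can be reproduced from the log likelihoods in the table, using AIC = 2k - 2 logLik and BIC = k log(n) - 2 logLik, where k is the number of estimated coefficients and n is the number of observations. A minimal sketch with the values typed in from the table, so the results agree only up to rounding:

```r
# Log likelihoods and observation count, copied from the table above.
log_lik <- c(Model1 = -33036.39, Model2 = -31685.25, Model3 = -31661.62)
n <- 65484

# Number of estimated coefficients in each model (intercept included).
k <- c(Model1 = 2, Model2 = 5, Model3 = 6)

# Information criteria: lower is better.
aic <- 2 * k - 2 * log_lik
bic <- k * log(n) - 2 * log_lik

print(round(rbind(AIC = aic, BIC = bic), 2))

# Which model each criterion prefers.
best_by_aic <- names(which.min(aic))
best_by_bic <- names(which.min(bic))
print(c(AIC = best_by_aic, BIC = best_by_bic))
```

BIC charges log(n), about 11.09 here, for each extra coefficient instead of AIC's 2, so it is the stricter of the two; Model 3 still has the lowest value on both.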

Conduct a likelihood ratio test to compare model fit

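Based on the likelihood ratio test run with anova() below, Model3 appears to be the best model. Adding the three predictors in Model2 lowers the residual deviance by 2702.28 on 3 degrees of freedom, and adding the interaction in Model3 lowers it by a further 47.26 on 1 degree of freedom. Both improvements are significant at p < 0.001.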

anova(Model1, Model2, Model3, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: default_ind ~ Annual_Income
## Model 2: default_ind ~ Annual_Income + Number_CreditLines_in_delinquency + 
##     Current_CreditCard_Balance_Own + Credit_Candidate
## Model 3: default_ind ~ Annual_Income * Credit_Candidate + Number_CreditLines_in_delinquency + 
##     Current_CreditCard_Balance_Own
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     65482      66073                          
## 2     65479      63370  3  2702.28 < 2.2e-16 ***
## 3     65478      63323  1    47.26 6.221e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
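
Each row of the table above can be checked by hand: the drop in residual deviance between two nested models is compared against a chi-squared distribution whose degrees of freedom equal the number of added coefficients. A sketch using the deviances rounded to two decimals, as reported in the model comparison table earlier, so the p-values match only approximately:

```r
# Residual deviances and residual degrees of freedom for the three models.
dev    <- c(Model1 = 66072.78, Model2 = 63370.50, Model3 = 63323.24)
res_df <- c(Model1 = 65482,    Model2 = 65479,    Model3 = 65478)

# Likelihood ratio statistic: drop in deviance between successive models.
lr_stat <- -diff(dev)
# Degrees of freedom: number of coefficients added at each step.
lr_df <- -diff(res_df)

# Upper-tail chi-squared probability for each comparison.
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(data.frame(lr_stat, lr_df, p_value))
```

The Model3 row gives a statistic of 47.26 on 1 degree of freedom and a p-value on the order of 6e-12, in line with the anova() output.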

Plotting a few interesting figures using the visreg package.

library(visreg)
visreg(Model1,"Annual_Income", scale = "response")

visreg(Model2,"Number_CreditLines_in_delinquency", scale = "response")

visreg(Model3,"Current_CreditCard_Balance_Own", scale = "response")

## Warning:   Note that you are attempting to plot a 'main effect' in a model that contains an
##   interaction.  This is potentially misleading; you may wish to consider using the 'by'
##   argument.

## Conditions used in construction of plot
## Annual_Income: 74325
## Credit_Candidate: Bad_Candidate
## Number_CreditLines_in_delinquency: 0

visreg(Model3, "Credit_Candidate", by = "Number_CreditLines_in_delinquency", scale = "response")

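Based on the charts generated by the visreg package and the AIC/BIC comparison, the weakest variable for predicting whether a person will default on a credit card payment appears to be Annual_Income: its coefficient is not statistically significant in any of the three models (p = 0.381, 0.441, and 0.592). Model1, which uses only this variable, is also the worst model. The reason is that this variable, and therefore the model built on it, shows a great deal of variation in its distribution.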

The best variables for predicting whether an individual will miss their credit card payment for the month are Number_CreditLines_in_delinquency and Current_CreditCard_Balance_Own. Each additional delinquent credit line raises the log odds, and therefore the probability, of missing a payment. The balance coefficient is negative (about -0.0000106 in Model3), so a larger current balance is associated with slightly lower log odds of missing a payment. This is an interesting finding because it suggests that carrying a larger balance does not, on its own, mean a person is more likely to miss a payment. It should also be mentioned that both good and bad candidates have delinquencies. However, the data show that good candidates tend to have fewer delinquencies.

HW4Soc712

Ariel Rosario Jr