Gaussian Model for predicting Credit Card Payment History.

I am interested in investigating the annual amount a person paid towards all credit card bills during the previous year. The independent variables are the estimated market value of a property owned or used by the borrower, the number of active credit lines, and whether the person defaulted on a credit card payment. I want to see whether these factors are associated with how much a person paid in credit card bills during the previous year.

library(readr)
CreditCard_data <- read_csv("Desktop/CreditCard_dataset_Original.csv")

Data Wrangling

library(dplyr)
New_Data <- CreditCard_data %>%
  select(mvar13,mvar15,mvar36,default_ind) %>%
  
  rename("Annual_Paid" = mvar13,
         "Properety_Vaule" = mvar15,
         "Number_Credit_Lines" = mvar36)%>%
  
  mutate(Annual_Paid = as.integer(Annual_Paid),
         Properety_Vaule= as.integer(Properety_Vaule),
         Number_Credit_Lines = as.integer(Number_Credit_Lines))%>%

  filter(!is.na(Annual_Paid),
         !is.na(Properety_Vaule),
         !is.na(Number_Credit_Lines))%>%
  
  mutate(Default_Status = sjmisc::rec(default_ind, rec = "0=Not_Default; 1=Defaulted"))

Dependent variable: mvar13 = Annual amount paid towards all credit cards during the previous year (in $).

Independent variables:
mvar15 = Estimated market value of a property owned/used by the borrower (in $).
mvar36 = Number of active credit lines.
default_ind = 0 if the person did not default on a credit card payment, 1 if the person did default on a credit card payment.

head(New_Data)

## # A tibble: 6 x 5
##   Annual_Paid Properety_Vaule Number_Credit_Lin… default_ind Default_Status
##         <int>           <int>              <int>       <dbl> <chr>         
## 1       27815          524848                  2           0 Not_Default   
## 2       14282          146668                 10           0 Not_Default   
## 3       21127          991000                  4           0 Not_Default   
## 4         548          336315                  9           0 Not_Default   
## 5         520          190420                  8           0 Not_Default   
## 6       83417          617468                  9           0 Not_Default

M1 <- glm(Annual_Paid ~ Number_Credit_Lines, family = "gaussian", data = New_Data)
summary(M1)

## 
## Call:
## glm(formula = Annual_Paid ~ Number_Credit_Lines, family = "gaussian", 
##     data = New_Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
##  -17807   -15778   -11617     -306  6043788  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         18009.45     419.73  42.907  < 2e-16 ***
## Number_Credit_Lines  -202.86      50.75  -3.997 6.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2898348480)
## 
##     Null deviance: 1.2441e+14  on 42911  degrees of freedom
## Residual deviance: 1.2437e+14  on 42910  degrees of freedom
## AIC: 1056724
## 
## Number of Fisher Scoring iterations: 2

Based on the results of model one (M1), the slope for Number_Credit_Lines is negative. In other words, the more active lines of credit an individual has, the less they are predicted to have paid towards their credit card bills for the year. Because this is a Gaussian model with the default identity link, the coefficients are in dollars, not log-odds. The intercept means that an individual with zero lines of credit is predicted to have paid $18,009.45 in the previous year. The slope means that each additional line of credit lowers the predicted payment by $202.86. The number of credit lines is therefore a statistically significant predictor (p = 6.41e-05) of the amount a person paid towards credit cards in the previous year.
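As a minimal sketch of how to read these coefficients, the fitted line can be evaluated by hand. The values are copied from the summary above; the helper name `predict_m1` is hypothetical and used only for illustration.

```r
# Coefficients copied from summary(M1); units are dollars per year.
b0 <- 18009.45   # intercept: predicted payment with zero credit lines
b1 <- -202.86    # slope: change in predicted payment per extra credit line

# Hypothetical helper: predicted annual payment for a given number of lines.
predict_m1 <- function(lines) b0 + b1 * lines

# Predicted payments for 0, 5, and 10 active credit lines.
predict_m1(c(0, 5, 10))
```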

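Please note that I used a Gaussian family because I am assuming that the errors are normally distributed. A Gaussian GLM with the identity link is the same model as an ordinary linear regression, so fitting the same variables with lm() reproduces the estimates from model one exactly. The matching output confirms that equivalence, but it is not by itself a check of normality. The residual summary (a median of -11617 against a maximum of 6043788) points to a strongly right-skewed distribution, so the normality assumption should be treated with caution. The linear model output also reports a multiple R-squared of 0.0003723, which means the number of credit lines explains very little of the variation in payments. The linear model can be seen below.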

Model_lm <- lm(Annual_Paid  ~ Number_Credit_Lines, data = New_Data)
summary(Model_lm)

## 
## Call:
## lm(formula = Annual_Paid ~ Number_Credit_Lines, data = New_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -17807  -15778  -11617    -306 6043788 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         18009.45     419.73  42.907  < 2e-16 ***
## Number_Credit_Lines  -202.86      50.75  -3.997 6.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53840 on 42910 degrees of freedom
## Multiple R-squared:  0.0003723,  Adjusted R-squared:  0.000349 
## F-statistic: 15.98 on 1 and 42910 DF,  p-value: 6.413e-05
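The identical estimates from glm() and lm() are expected by construction. The sketch below illustrates this on simulated data (not the credit card data), where the errors are deliberately skewed; the names `sim`, `fit_glm`, and `fit_lm` are assumptions made for the example. It also shows one common way to inspect residual normality.

```r
set.seed(1)

# Simulated, deliberately right-skewed response (not the credit card data).
sim <- data.frame(x = rpois(500, 5))
sim$y <- 1000 + 50 * sim$x + rexp(500, rate = 1 / 2000)

fit_glm <- glm(y ~ x, family = gaussian, data = sim)
fit_lm  <- lm(y ~ x, data = sim)

# The coefficients agree even though the errors are far from normal.
all.equal(coef(fit_glm), coef(fit_lm))

# A normal Q-Q plot of the residuals is one way to inspect normality.
qqnorm(resid(fit_lm))
qqline(resid(fit_lm))
```

Agreement between the two fits says nothing about the error distribution; the Q-Q plot is the part that addresses it.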

M2 <- glm(Annual_Paid ~ Number_Credit_Lines + Properety_Vaule + Default_Status, family = "gaussian", data = New_Data)
summary(M2)

## 
## Call:
## glm(formula = Annual_Paid ~ Number_Credit_Lines + Properety_Vaule + 
##     Default_Status, family = "gaussian", data = New_Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -277970   -14117    -7795      984  6033838  
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -1.794e+03  7.354e+02  -2.440 0.014699 *  
## Number_Credit_Lines        1.689e+02  5.094e+01   3.315 0.000916 ***
## Properety_Vaule            2.907e-02  8.785e-04  33.090  < 2e-16 ***
## Default_StatusNot_Default  1.109e+04  6.372e+02  17.407  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2798795084)
## 
##     Null deviance: 1.2441e+14  on 42911  degrees of freedom
## Residual deviance: 1.2009e+14  on 42908  degrees of freedom
## AIC: 1055226
## 
## Number of Fisher Scoring iterations: 2

As in the previous model, model two (M2) indicates that the number of credit lines is a significant predictor of the amount a person paid in credit card bills the previous year. Note, however, that its coefficient changes sign: once property value and default status are held constant, each additional credit line is associated with a $168.90 increase in predicted payments. The Properety_Vaule and Default_Status variables are significant predictors too. Holding the other variables constant, a person who did not default is predicted to have paid about $11,090 (1.109e+04) more in the previous year than a person who defaulted. Each additional dollar of property value is associated with an increase of about $0.029 (2.907e-02) in predicted payments, or roughly $2,907 per $100,000 of property value.
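To make the M2 coefficients concrete, here is a small sketch that evaluates the fitted equation for one illustrative profile. The coefficients are copied from the summary above; the helper `predict_m2` and the profile values (5 credit lines, $300,000 property) are hypothetical.

```r
# Coefficients copied from summary(M2).
m2 <- c(intercept = -1.794e+03, lines = 1.689e+02,
        property = 2.907e-02, not_default = 1.109e+04)

# Hypothetical helper: not_default is 1 for Not_Default, 0 for Defaulted.
predict_m2 <- function(lines, property, not_default) {
  unname(m2["intercept"] + m2["lines"] * lines +
           m2["property"] * property + m2["not_default"] * not_default)
}

# Same profile with and without a default.
p_no  <- predict_m2(lines = 5, property = 300000, not_default = 1)
p_yes <- predict_m2(lines = 5, property = 300000, not_default = 0)

# The gap between the two predictions is the Default_Status coefficient.
p_no - p_yes
```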

M3 <- glm(Annual_Paid ~ Number_Credit_Lines + Properety_Vaule*Default_Status, family = "gaussian", data = New_Data)
summary(M3)

## 
## Call:
## glm(formula = Annual_Paid ~ Number_Credit_Lines + Properety_Vaule * 
##     Default_Status, family = "gaussian", data = New_Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -305622   -13695    -7139      537  6033169  
## 
## Coefficients:
##                                            Estimate Std. Error t value
## (Intercept)                               2.141e+03  8.846e+02   2.420
## Number_Credit_Lines                       1.639e+02  5.091e+01   3.219
## Properety_Vaule                           1.209e-02  2.299e-03   5.259
## Default_StatusNot_Default                 6.287e+03  8.759e+02   7.177
## Properety_Vaule:Default_StatusNot_Default 1.983e-02  2.482e-03   7.988
##                                           Pr(>|t|)    
## (Intercept)                                0.01554 *  
## Number_Credit_Lines                        0.00129 ** 
## Properety_Vaule                           1.46e-07 ***
## Default_StatusNot_Default                 7.23e-13 ***
## Properety_Vaule:Default_StatusNot_Default 1.40e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2794703902)
## 
##     Null deviance: 1.2441e+14  on 42911  degrees of freedom
## Residual deviance: 1.1991e+14  on 42907  degrees of freedom
## AIC: 1055165
## 
## Number of Fisher Scoring iterations: 2

Model three (M3) generated interesting findings. The interaction between the Properety_Vaule variable and the Default_Status variable is a significant predictor of the amount spent on credit card payments the previous year. For people who defaulted (the reference group), each additional dollar of property value is associated with a $0.01209 (1.209e-02) increase in predicted payments. For people who did not default, the slope is larger by 1.983e-02, for a total of about $0.03192 per dollar of property value.
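The interaction term is easiest to read as a difference in slopes. The sketch below derives the property-value slope for each group from the M3 coefficients shown above; the variable names are illustrative.

```r
# Coefficients copied from summary(M3).
b_property    <- 1.209e-02  # property slope for the reference group (Defaulted)
b_interaction <- 1.983e-02  # extra property slope for Not_Default

slope_defaulted   <- b_property
slope_not_default <- b_property + b_interaction

# Predicted change in annual payment per $100,000 of property value.
c(Defaulted = slope_defaulted, Not_Default = slope_not_default) * 1e5
```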

Selecting a model.

Likelihood Ratio Test.

anova(M1, M2, M3, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: Annual_Paid ~ Number_Credit_Lines
## Model 2: Annual_Paid ~ Number_Credit_Lines + Properety_Vaule + Default_Status
## Model 3: Annual_Paid ~ Number_Credit_Lines + Properety_Vaule * Default_Status
##   Resid. Df Resid. Dev Df   Deviance  Pr(>Chi)    
## 1     42910 1.2437e+14                            
## 2     42908 1.2009e+14  2 4.2774e+12 < 2.2e-16 ***
## 3     42907 1.1991e+14  1 1.7834e+11 1.368e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
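For a Gaussian GLM, the "Chisq" test scales the change in deviance by an estimate of the dispersion. The sketch below reproduces the M2 versus M3 comparison by hand, assuming the dispersion used is the one reported for the largest model (M3); all numbers are copied from the output above.

```r
# Values copied from the analysis of deviance table and summary(M3).
dev_drop   <- 1.7834e+11   # deviance reduction from M2 to M3
df_drop    <- 1            # one extra parameter (the interaction)
dispersion <- 2794703902   # dispersion estimate reported for M3

# Scaled deviance change, compared against a chi-square distribution.
chisq_stat <- dev_drop / dispersion
p_value    <- pchisq(chisq_stat, df = df_drop, lower.tail = FALSE)

chisq_stat  # about 63.8, the square of the interaction's t value (7.988)
p_value
```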

Information Criteria.

library(texreg)
htmlreg(list(M1, M2, M3), doctype = FALSE)

Statistical models
                                           Model 1             Model 2             Model 3
(Intercept)                                18009.45***         -1794.18*           2140.54*
                                           (419.73)            (735.37)            (884.65)
Number_Credit_Lines                        -202.86***          168.89***           163.85**
                                           (50.75)             (50.94)             (50.91)
Properety_Vaule                                                0.03***             0.01***
                                                               (0.00)              (0.00)
Default_StatusNot_Default                                      11091.77***         6286.76***
                                                               (637.21)            (875.93)
Properety_Vaule:Default_StatusNot_Default                                          0.02***
                                                                                   (0.00)
AIC                                        1056724.19          1055226.32          1055164.55
BIC                                        1056750.19          1055269.66          1055216.55
Log Likelihood                             -528359.09          -527608.16          -527576.27
Deviance                                   124368133271542.31  120090699452081.41  119912360337588.09
Num. obs.                                  42912               42912               42912
*** p < 0.001, ** p < 0.01, * p < 0.05

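Based on the Likelihood Ratio Test and the Information Criteria section of this analysis, I conclude that the third model is the best of the three. Each added set of terms significantly reduces the deviance (p < 2.2e-16 for M2 against M1, p = 1.368e-15 for M3 against M2), and M3 has the lowest AIC (1055164.55) and BIC (1055216.55).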
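The information criteria in the table above can be recomputed from the reported log-likelihoods, which makes the comparison explicit. This is a sketch that assumes each model's parameter count is its number of regression coefficients plus one residual variance.

```r
# Log-likelihoods and sample size copied from the table above.
loglik <- c(M1 = -528359.09, M2 = -527608.16, M3 = -527576.27)
n      <- 42912

# Parameters counted as regression coefficients plus one residual variance.
k <- c(M1 = 2, M2 = 4, M3 = 5) + 1

aic <- -2 * loglik + 2 * k
bic <- -2 * loglik + log(n) * k

round(cbind(AIC = aic, BIC = bic), 2)

# Differences relative to the best (lowest) AIC.
aic - min(aic)
```

Lower values are better for both criteria, and BIC penalizes the extra interaction parameter more heavily than AIC does.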

Visualizations

library(visreg)
visreg(M1,"Number_Credit_Lines", scale = "response")

After using the visreg package to create a graph for my first model, I am confident in my initial findings and interpretations for model 1. Based on the graph, an individual with zero active credit lines is predicted to pay around 18,000 dollars a year towards credit card debt. Please note that these individuals may be paying off older debt on credit cards that are no longer active. Also, as stated in my M1 interpretation, the more lines of credit one has, the lower the predicted payments. This pattern can be seen in the graph above.

library(visreg)
visreg(M3,"Properety_Vaule", scale = "response")

## Conditions used in construction of plot
## Number_Credit_Lines: 5
## Default_Status: Not_Default

The graph generated by the visreg package shows results similar to those of my second model. The higher the property value, the more an individual is predicted to pay towards their debt. This is interesting to me because the previous graph indicates that the more credit lines a person has, the less they are predicted to pay. Based on the results of my first two graphs, I believe the age of an individual may play a critical role in predicting how much they pay towards their bills. Usually, homes are bought by older people. Generally, with age come smarter and more responsible choices. Sadly, the dataset does not contain the ages of the individuals, so this is pure speculation.

library(visreg)
visreg(M3,"Default_Status", by = "Properety_Vaule", scale = "response")

The graph below displays the same information as the graph above, with the two variables switched around to give a different perspective on the data.

library(visreg)
visreg(M3,"Properety_Vaule", by = "Default_Status", scale = "response")

The interaction effect can be seen in the graph above. It becomes evident that predicted payments rise with property value in both groups, but the increase is steeper for people who did not default, consistent with the positive Properety_Vaule:Default_StatusNot_Default coefficient. This graph supports my interpretations for model 3.

Homework5