2025-03-07

Introduction

For my topic, I chose the “Credit Approval” dataset found in the UC Irvine Machine learning repository. The goal of this analysis will be to analyze what characteristics indicate the likelihood of late payment and default.

Our dataset shows a sample of credit customers. It displays some characteristics about the customers, their payment amounts and statuses from April to September bill amounts, and whether they defaulted in the subsequent month (October).

Variables

Our variables are outlined below:

## [1] "ID"        "LIMIT_BAL" "SEX"       "EDUCATION"
## [1] "MARRIAGE"     "AGE"          "PayStatusSep" "PayStatusAug"
## [1] "PayStatusJul" "PayStatusJun" "PayStatusMay" "PayStatusApr"
## [1] "BillSep" "BillAug" "BillJul" "BillJun"
## [1] "BillMay"   "BillApr"   "PayAmtSep" "PayAmtAug"
## [1] "PayAmtJul" "PayAmtJun" "PayAmtMay" "PayAmtApr"
## [1] "Default"

Probability of Default

First, we can establish the probability of default of our customer sample. We can then analyze what factors may affect this probability. Out of 30000 observations, we had 6636 defaults, or 22.1%.

## $title
## [1] "Probability of Default"
## 
## attr(,"class")
## [1] "labels"

Serial Correlation Between Payment Amounts

In statistics, serial correlation is the relationship between a variable and its past, lagged values. For our payment amount data, we can assess the serial correlation to see how well customers continue their payment trends. For example, a customer with a high payment amount in May can be assumed to pay on or before the due date and have good credit. We would expect their next and previous payment to be high as well. Conversely, the correlation should decrease as the time between periods increases.

On the next slide, we can see our correlation matrix. Ultimately, the strongest relationship with the current month payment is the next month, meaning we can use payment amount as a factor in our regression. However, due to the low overall correlation, there are clearly many other factors at play.

Serial Correlation Matrix

payments <- select(creditdata,PayAmtSep:PayAmtApr)
correlations <- cor(payments)
correlations
##           PayAmtSep PayAmtAug PayAmtJul PayAmtJun PayAmtMay PayAmtApr
## PayAmtSep 1.0000000 0.2855755 0.2521911 0.1995579 0.1484593 0.1857353
## PayAmtAug 0.2855755 1.0000000 0.2447705 0.1801067 0.1809078 0.1576339
## PayAmtJul 0.2521911 0.2447705 1.0000000 0.2163251 0.1592137 0.1627400
## PayAmtJun 0.1995579 0.1801067 0.2163251 1.0000000 0.1518304 0.1578339
## PayAmtMay 0.1484593 0.1809078 0.1592137 0.1518304 1.0000000 0.1548955
## PayAmtApr 0.1857353 0.1576339 0.1627400 0.1578339 0.1548955 1.0000000

August Correlations Graph Code

as.vector(correlations[,2])
## [1] 0.2855755 1.0000000 0.2447705 0.1801067 0.1809078 0.1576339
p = as.character(colnames(correlations))
x= as.vector(correlations[,2])
y= as.character(colnames(correlations))
Correlation_By_Month <- factor(y, levels = y)
plot <- ggplot(correlations,
               aes(x=Correlation_By_Month
                   ,y=x,fill=Correlation_By_Month)
               )+geom_bar(stat= "identity")+geom_text(
                                aes(label=(round(x,3))
                                    ),vjust=-.5)+labs(
                                      y="Correlation Coefficient",
  x="")+ggtitle("Correlation of Payment Amounts with August")

Graph

As shown by our correlation matrix, we see little correlation between payments and their prior/future values. However, we do see the expected relationship most months. For example, the September payments have the highest correlation with the August payments, and the correlations of the payments prior to August decrease as time increases.

Bill Amounts

Now that we understand how payment amount may affect our probability of default, lets analyze some other factors that could have an impact. Some that could provide more explanation is the bill amount. If we compare the bill and payment amounts, we can find to what degree our clients are overspending.

Bills <- select(creditdata,BillSep:BillApr)
diff <- Bills-payments
median(as.matrix(diff), na.rm=TRUE)
## [1] 16883.5

Calculating the average difference between our bill and payment amounts, we see that there is an average gap of over $16,000. We may be extending too much credit to our clients, or not freezing their account soon enough.

Categorical Variables

We have analyzed our bill amount and payment data, and how they could possibly affect our probability of default. Now, we can analyze our categorical variables such as age and marriage to see how these values affect the customer’s credit limit. To do so, we will see how age affects our credit limit, while displaying other factors on our graph. We will take a sample of 100 for data readability.

Our education statuses are as follows: 1 <= Graduate School 2 <= University 3 <= High School 4 <= Other 5 & 6 <= Unknown

For our sex statuses: 1 <= Male and 2<= Female

For our marriage statuses: 1 <= Married 2<= Single 3<= Other/Unknown

Age and Education

According to our graph, we have a highly educated population, with most of our statuses being either Graduate School or college. Our sample does not show a clear effect on our allotted limit balance from education.

Age and Sex

Males may be deemed more risky, as we see a much higher concentration of males in the lower range of our limit balance, and less in the higher range compared to females.

Age and Marriage

In this graph, marriage is represented by the circle, and being single is represented by the triangle. Marriage has a clear positive relationship with age, but there is no clear relationship between status and limit balance.

Logistic Regression

Now that we have seen how some of our variables might affect our probability of default, we can now perform a regression to find the best predictor of default. Because our outcome is categorical (Default or No Default), we will use a logistical regression instead of a simple linear regression. The equation of the logistical regression is displayed below, setting X as our probability of default:

\(\log[P(X)/P(1-X)]= \beta_0 + \beta_1\cdot x +\beta_2\cdot x + ...\)

Our left hand side of the equation represents log odds. We use this instead of the normal probability equation P(X) because it can span from -infinity to infinity, helping us generate a continuous function. Our beta coefficients represent the change in log odds for a single unit increase in the predicting variable. Our probability can then be solved as

\(\dfrac{\exp(\beta_0 + \beta_1\cdot x +\beta_2\cdot x +...)}{(1+\exp(\beta_0 + \beta_1\cdot x +\beta_2\cdot x +...)}\)

Regression Code

logreg <- glm(Default ~ LIMIT_BAL
              + SEX + EDUCATION + 
                MARRIAGE + AGE + PayStatusSep + BillSep+
                PayAmtSep, data = creditdata, family = binomial)
options(scipen =999)
summary(logreg)

Regression Summary

## 
## Call:
## glm(formula = Default ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + 
##     AGE + PayStatusSep + BillSep + PayAmtSep, family = binomial, 
##     data = creditdata)
## 
## Coefficients:
##                   Estimate    Std. Error z value             Pr(>|z|)    
## (Intercept)  -0.6624651271  0.1176959112  -5.629       0.000000018166 ***
## LIMIT_BAL    -0.0000015684  0.0000001464 -10.712 < 0.0000000000000002 ***
## SEX          -0.1185502313  0.0304679042  -3.891       0.000099837117 ***
## EDUCATION    -0.0971388591  0.0206846150  -4.696       0.000002650604 ***
## MARRIAGE     -0.1624637956  0.0315059577  -5.157       0.000000251467 ***
## AGE           0.0070649698  0.0017675277   3.997       0.000064125534 ***
## PayStatusSep  0.6994235229  0.0148692680  47.038 < 0.0000000000000002 ***
## BillSep      -0.0000012267  0.0000002472  -4.962       0.000000696554 ***
## PayAmtSep    -0.0000131385  0.0000020340  -6.459       0.000000000105 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 31705  on 29999  degrees of freedom
## Residual deviance: 28128  on 29991  degrees of freedom
## AIC: 28146
## 
## Number of Fisher Scoring iterations: 5

Interpreting the Results

Overall, all our variables are significant, showing that each variable has some effect on our data. The left side of our equation represents log odds, our the logarithmic function of our probability of default divided by our probability of not defaulting.

To be expected, payment status is the best predictor of default, representing 70% of our relationship. The intercept is positive, as a higher status number means higher months behind payment.

Our bill and payment amounts have a small but significant negative coefficient. This means that for every $1 increase in bill and payment amount, the probability of default decreases slightly. Although the result for bill amount seems counter-intuitive, customers in financial straits may attempt to stop using credit, therefore decreasing their bill.

As for our categorical variables, being female (2) increases the probability of default compared to being male (1), contradicting with our earlier sample. Higher education and being married also decrease this probability.

Age has a very small positive relationship, meaning being older slightly increases the probability of default.

Cotd.

In our regression, limit balance only has a slight negative relationship with default. Ideally, we would like to see that higher limits are a vote of confidence in our customer, and would represent a strong negative relationship. To make our limit balances more accurate and therefore lower default rates, we can analyze our categorical variables to find the ideal client.

Converting Log-Odds to Probability

We can now calculate the probabilities. Using our payment status coefficient, we calculate:

## [1] 0.6681878

\(\P(Default)=exp(.7)/1+exp(.7)=.668\)

This means that for a for a 1 unit increase in the payment status when all other factors remain unchanged, the probability of default goes up by a massive 66.6%. As such, we may want to stop extending credit sooner after seeing a customer delay payment.

Probabilities

Now we can see the probabilities for our other variables:

coeffs <- as.vector(coef(logreg))
probs <- numeric(length(coeffs))

for (i in 1:length(coeffs)) {
  probs[i] <- exp(coeffs[i])/(1+exp(coeffs[i]))
}

Probabilities Table

coeffcolnames <- names(coef(logreg))
table <- data.frame(coeffcolnames,probs)
print(table)
##   coeffcolnames     probs
## 1   (Intercept) 0.3401861
## 2     LIMIT_BAL 0.4999996
## 3           SEX 0.4703971
## 4     EDUCATION 0.4757344
## 5      MARRIAGE 0.4594732
## 6           AGE 0.5017662
## 7  PayStatusSep 0.6680599
## 8       BillSep 0.4999997
## 9     PayAmtSep 0.4999967

For our categorical variables, we see probabilities that hover around .5. To understand this, we can look at our marriage factor. We established that being married (1) decreases the probability of default compared to being single (2). If this status had no effect, our probability would be at exactly .5. However, because being married slightly lessens the probability of default, our coefficient is about .46.

Conclusions

From our analysis, we can see that payment status has the biggest effect on our probability of default, meaning that as a customer stops paying, they may continue not paying and ultimately default. To improve our client base to an overall better payment status, we can give more credit to married males, and be more stringent when it comes to education status. We also may want to freeze accounts once their bill increases by too much.