Predicting Home Loan Defaults

Introduction

This project aims to predict home loan defaults at micro and macro level using time series models and estimate probabilities of default associated with other economic indicators such as income and interest rate.

load("~/Downloads/defaults.rda")

Each observation in the sample data set is an individual with a home loan. There are 500 observations in each quarter from 1991 to 2022. The relevant variables are

date – year and quarter of observation
default – 1 if the individual defaulted on their loan, 0 otherwise
income – weekly income at the time of observation
price – purchase price of home
score – credit score at time of observation
interest – nominal interest rate on loan

# Install required packages
library(fpp3)
library(margins)
library(slider)

Before dive deep into the analysis, we should look at data set first

head(defaults)

##      date default income price score interest
## 1 1991 Q1       0      0 38403   509    12.25
## 2 1991 Q1       0      0 32929   531    10.71
## 3 1991 Q1       0      0 30927   445    11.19
## 4 1991 Q1       0      0 29177   442     9.86
## 5 1991 Q1       0      0 40862   440    10.62
## 6 1991 Q1       0      0 33200   443    11.55

1. Linear Probability Model

reg <- lm(data = defaults, default ~ . -date)
summary(reg)

## 
## Call:
## lm(formula = default ~ . - date, data = defaults)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12211 -0.06886 -0.03735  0.00096  1.03705 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.959e-01  7.563e-03  39.129  < 2e-16 ***
## income       1.707e-06  4.717e-07   3.619 0.000296 ***
## price       -8.172e-08  1.784e-08  -4.581 4.63e-06 ***
## score       -3.424e-04  1.001e-05 -34.192  < 2e-16 ***
## interest    -5.885e-03  4.553e-04 -12.926  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1907 on 63995 degrees of freedom
## Multiple R-squared:  0.03835,    Adjusted R-squared:  0.03829 
## F-statistic: 637.9 on 4 and 63995 DF,  p-value: < 2.2e-16

Interpret the results:

Intercept = 0.296: estimated probability of defaults when other variables is zero is 0.296
income coefficients are positive but very small, indicating that as income is higher, the probability of default is higher. This result is surprising because in common sense, we would think that high-income people will be more able to pay back the loans.
price, score and interest coefficients are negative but also very small, suggesting that as price of home, credit score and interest are higher, it’s less likely for that person to default on the loan. The negative interest coefficient is surprising because in common sense, if the interest rate is higher, it should be more likely to default.
R-squared and adjusted R-squared are small, 0.0383, indicating that only 3.83% of variance in defaults is explained by the model.

2. Logit Model

# Logit regression
logit_reg <- glm(default ~., data = select(defaults, -date), family = binomial(link = "logit"))
summary(logit_reg)

## 
## Call:
## glm(formula = default ~ ., family = binomial(link = "logit"), 
##     data = select(defaults, -date))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8320  -0.3385  -0.1971  -0.0814   4.8380  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.015e+00  2.699e-01  11.169   <2e-16 ***
## income      -2.732e-04  2.985e-05  -9.149   <2e-16 ***
## price       -1.837e-06  9.830e-07  -1.869   0.0616 .  
## score       -6.982e-03  4.556e-04 -15.325   <2e-16 ***
## interest    -2.206e-01  1.456e-02 -15.151   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21229  on 63999  degrees of freedom
## Residual deviance: 18214  on 63995  degrees of freedom
## AIC: 18224
## 
## Number of Fisher Scoring iterations: 8

Discuss statistical significance: All coefficients, except for income, have p-value less than 0.05, indicating that they are statistically significant in predicting the probability of default.

# Marginal Effect 

marginal_effect <- margins(logit_reg)
summary(marginal_effect)

##    factor     AME     SE        z      p   lower   upper
##    income -0.0000 0.0000  -9.0558 0.0000 -0.0000 -0.0000
##  interest -0.0079 0.0005 -14.8827 0.0000 -0.0090 -0.0069
##     price -0.0000 0.0000  -1.8565 0.0634 -0.0000  0.0000
##     score -0.0003 0.0000 -14.9679 0.0000 -0.0003 -0.0002

Marginal effect in logit model are negative, different from the LPM (income is positive). The logit model provide more realistic marginal effects in when predicting binary outcome, such as the default in this case. However, its ability to adjust the marginal effect based on probability is more complex, compared to the LPM.

3. Feature Engineering

# Avg interest rate in the time period
defaults <- defaults %>% 
  group_by(date) %>% 
  mutate(avg_int = mean(interest))

# Add 2 extra variables
defaults <- defaults %>% 
  ungroup(date) %>% 
  mutate(
                   income_price = income/price,
                   int_diff = interest - avg_int)

# Perform linear regression
reg2 <- lm(data = defaults, default ~ income_price+ score+ int_diff )
summary(reg2)

## 
## Call:
## lm(formula = default ~ income_price + score + int_diff, data = defaults)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.11649 -0.06924 -0.03701 -0.00250  1.05678 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.383e-01  7.032e-03  19.662  < 2e-16 ***
## income_price -1.063e+00  6.909e-02 -15.385  < 2e-16 ***
## score        -1.013e-04  1.236e-05  -8.195 2.55e-16 ***
## int_diff      5.019e-03  1.042e-03   4.818 1.45e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1905 on 63996 degrees of freedom
## Multiple R-squared:  0.04023,    Adjusted R-squared:  0.04019 
## F-statistic: 894.3 on 3 and 63996 DF,  p-value: < 2.2e-16

The model has a better R-square than of the previous model in Q1 (0.0402 > 0.0383), indicating a better performance. All the coefficients have p-value <0.05, indicating they are statistically significant.

income_price = -1.06, indicating that the estimated probability of defaults decreases by 1.06 unit if the ratio of income to house price increase by 1 unit, holding other variables constant
score = -0.000101, indicating that estimated probability of defaults decreases by 0.000101 unit if the credit score increase by 1 unit, holding other variables constant
int_diff = 0.00502, indicating that estimated probability of defaults increases by 0.00502 unit if the difference between interest rate and national average interest rate in the same period increase by 1 unit, holding other variables constant
Intercept = 0.138, suggesting the estimated probability of defaults is 0.138, holding other variables constant

4. Default Rate Across Time

Calculate the trend-cycle component of the default rate using annual centered moving average (CMA) approach.

# Default rate in each time period 
defaults2 <- defaults %>% 
  group_by(date) %>% 
  summarize(default_rate = mean(default))
head(defaults2)

## # A tibble: 6 × 2
##      date default_rate
##     <qtr>        <dbl>
## 1 1991 Q1       0.0320
## 2 1991 Q2       0.0300
## 3 1991 Q3       0.0320
## 4 1991 Q4       0.0320
## 5 1992 Q1       0.0320
## 6 1992 Q2       0.0300

# Annual Centered Moving Average
defaults2 <- defaults2 %>% 
  mutate(cma4 = slide_dbl(default_rate, mean,
                          .before = 1,
                          .after = 2,
                          .complete = TRUE),
         cma4 = slide_dbl(cma4, mean,
                          .before = 1,
                          .complete = TRUE))

ggplot(defaults2, aes(date, default_rate)) +
  geom_line() +
  geom_line(aes(date, cma4), col = "blue") +
  ggtitle("Default Rate Across Time & Trend-Cycle (CMA)")

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

Discuss: The time series plot displays there was a significant upward trend in the default rate between 2009 and 2010, followed by a sharp decline. This can be related to the economic condition, such as the 2008 financial crisis. The use of Annual Centered Moving Average method smooths the trend-cycle out of short-term fluctuation.

5. Seasonal Effect

This section calculates the seasonal and remainder components of the default rate and the seasonally-adjusted default rate.

## Seasonal Effect
defaults2 %>% mutate(
  T = cma4,
  untrended = default_rate - T,
  quarter = quarter(date)
) -> defaults2 
  
# Seasonal Component
defaults2 %>% 
  as.data.frame() %>% 
  group_by(quarter) %>% 
  summarize(S = mean(untrended, na.rm = TRUE)) -> defaults_season

defaults2 <- left_join(defaults2, defaults_season, by = "quarter")

# Remainder Component
defaults2 <- mutate(defaults2, R = default_rate - T - S)

# Plot all components
defaults2 %>% 
  select(date, default_rate, T, S, R) %>% 
  pivot_longer(cols = -date) %>% 
  ggplot(aes(date, value)) + 
    geom_line() +
    facet_wrap(~name, ncol = 1, scales = "free_y") +
    ggtitle("Default Rate Across Time: Trend, Seasonal & Remainder")

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

Discuss:

Quarter 4 are associated with higher default rates. There are several possible explanation for this observation:

Holiday Increase Consumer Expense
Tax payment

However, with a small data set of 500 observations, it is difficult to conclude the seasonal trend of default rate.

# Seasonal adjustment 
mutate(defaults2, SA = default_rate - S) -> defaults2

ggplot(defaults2, aes(date, default_rate)) +
  geom_line() +
  geom_line(aes(date, SA), col ="red") +
  ggtitle("Default Rate Across Time: Seasonal Adjustment")

6. Forecasts

Compare mean, naive and simple linear trend forecast for the default rate for the next 5 years.

# Preparations
horizon <- 5*4 # 5-year * 4 quarters

defaults_fcst <- defaults2 %>% 
  select(qtr = date, default_rate = SA) %>% 
  mutate(Series = "Actual") 

qtrs <- max(defaults_fcst$qtr) + (1:horizon)
forecasts <- data.frame(qtr = qtrs)
forecasts

##        qtr
## 1  2023 Q1
## 2  2023 Q2
## 3  2023 Q3
## 4  2023 Q4
## 5  2024 Q1
## 6  2024 Q2
## 7  2024 Q3
## 8  2024 Q4
## 9  2025 Q1
## 10 2025 Q2
## 11 2025 Q3
## 12 2025 Q4
## 13 2026 Q1
## 14 2026 Q2
## 15 2026 Q3
## 16 2026 Q4
## 17 2027 Q1
## 18 2027 Q2
## 19 2027 Q3
## 20 2027 Q4

# MEAN Forecast
forecasts$Mean <- mean(defaults_fcst$default_rate)
forecasts

##        qtr       Mean
## 1  2023 Q1 0.03932762
## 2  2023 Q2 0.03932762
## 3  2023 Q3 0.03932762
## 4  2023 Q4 0.03932762
## 5  2024 Q1 0.03932762
## 6  2024 Q2 0.03932762
## 7  2024 Q3 0.03932762
## 8  2024 Q4 0.03932762
## 9  2025 Q1 0.03932762
## 10 2025 Q2 0.03932762
## 11 2025 Q3 0.03932762
## 12 2025 Q4 0.03932762
## 13 2026 Q1 0.03932762
## 14 2026 Q2 0.03932762
## 15 2026 Q3 0.03932762
## 16 2026 Q4 0.03932762
## 17 2027 Q1 0.03932762
## 18 2027 Q2 0.03932762
## 19 2027 Q3 0.03932762
## 20 2027 Q4 0.03932762

ggplot(defaults_fcst, aes(qtr,default_rate )) +
  geom_line() +
  geom_line(data = forecasts, aes(qtr, Mean), col ="red") +
  ggtitle("Default Rate Across Time & 5-yr Forecast (MEAN)")

# NAIVE Forecast
forecasts$Naive <- defaults_fcst$default_rate[nrow(defaults_fcst)]

ggplot(defaults_fcst, aes(qtr,default_rate )) +
  geom_line() +
  geom_line(data = forecasts, aes(qtr, Naive), col ="red") +
  ggtitle("Default Rate Across Time & 5-yr Forecast (NAIVE)")

# Simple Linear Forecast
trend <- lm(default_rate ~ qtr, defaults_fcst)
forecasts$Linear <- predict(trend, forecasts)

ggplot(defaults_fcst, aes(qtr,default_rate )) +
  geom_line() +
  geom_line(data = forecasts, aes(qtr, Linear), col ="red") +
  ggtitle("Default Rate Across Time & 5-yr Forecast (LINEAR)")

In this case, since there is a trend in the historical data, Simple Linear Trend forecast would be the best because it captures the overall trend of default rate across time in the reported period. In fact, Mean and Naive Forecast most likely leave out the trend/fluctuation in the data.

Conclusion

The default rate peaked between 2009 and 2010 (after Global Financial Crisis 2008) and then declined sharply
Quarter 4 seems to be associated with higher defaults rate, but it’s difficult to conclude with a small data set
Simple Linear Trend would be the best model to forecast the trend of default rate in the next 5 years based on historical data