Introduction

This project aims to predict home loan defaults at micro and macro level using time series models and estimate probabilities of default associated with other economic indicators such as income and interest rate.

load("~/Downloads/defaults.rda")

Each observation in the sample data set is an individual with a home loan. There are 500 observations in each quarter from 1991 to 2022. The relevant variables are

# Install required packages
library(fpp3)
library(margins)
library(slider)

Before dive deep into the analysis, we should look at data set first

head(defaults)
##      date default income price score interest
## 1 1991 Q1       0      0 38403   509    12.25
## 2 1991 Q1       0      0 32929   531    10.71
## 3 1991 Q1       0      0 30927   445    11.19
## 4 1991 Q1       0      0 29177   442     9.86
## 5 1991 Q1       0      0 40862   440    10.62
## 6 1991 Q1       0      0 33200   443    11.55

1. Linear Probability Model


reg <- lm(data = defaults, default ~ . -date)
summary(reg)
## 
## Call:
## lm(formula = default ~ . - date, data = defaults)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12211 -0.06886 -0.03735  0.00096  1.03705 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.959e-01  7.563e-03  39.129  < 2e-16 ***
## income       1.707e-06  4.717e-07   3.619 0.000296 ***
## price       -8.172e-08  1.784e-08  -4.581 4.63e-06 ***
## score       -3.424e-04  1.001e-05 -34.192  < 2e-16 ***
## interest    -5.885e-03  4.553e-04 -12.926  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1907 on 63995 degrees of freedom
## Multiple R-squared:  0.03835,    Adjusted R-squared:  0.03829 
## F-statistic: 637.9 on 4 and 63995 DF,  p-value: < 2.2e-16

Interpret the results:

2. Logit Model


# Logit regression
logit_reg <- glm(default ~., data = select(defaults, -date), family = binomial(link = "logit"))
summary(logit_reg)
## 
## Call:
## glm(formula = default ~ ., family = binomial(link = "logit"), 
##     data = select(defaults, -date))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8320  -0.3385  -0.1971  -0.0814   4.8380  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.015e+00  2.699e-01  11.169   <2e-16 ***
## income      -2.732e-04  2.985e-05  -9.149   <2e-16 ***
## price       -1.837e-06  9.830e-07  -1.869   0.0616 .  
## score       -6.982e-03  4.556e-04 -15.325   <2e-16 ***
## interest    -2.206e-01  1.456e-02 -15.151   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21229  on 63999  degrees of freedom
## Residual deviance: 18214  on 63995  degrees of freedom
## AIC: 18224
## 
## Number of Fisher Scoring iterations: 8

Discuss statistical significance: All coefficients, except for income, have p-value less than 0.05, indicating that they are statistically significant in predicting the probability of default.

# Marginal Effect 

marginal_effect <- margins(logit_reg)
summary(marginal_effect)
##    factor     AME     SE        z      p   lower   upper
##    income -0.0000 0.0000  -9.0558 0.0000 -0.0000 -0.0000
##  interest -0.0079 0.0005 -14.8827 0.0000 -0.0090 -0.0069
##     price -0.0000 0.0000  -1.8565 0.0634 -0.0000  0.0000
##     score -0.0003 0.0000 -14.9679 0.0000 -0.0003 -0.0002

Marginal effect in logit model are negative, different from the LPM (income is positive). The logit model provide more realistic marginal effects in when predicting binary outcome, such as the default in this case. However, its ability to adjust the marginal effect based on probability is more complex, compared to the LPM.

3. Feature Engineering


# Avg interest rate in the time period
defaults <- defaults %>% 
  group_by(date) %>% 
  mutate(avg_int = mean(interest))

# Add 2 extra variables
defaults <- defaults %>% 
  ungroup(date) %>% 
  mutate(
                   income_price = income/price,
                   int_diff = interest - avg_int)

# Perform linear regression
reg2 <- lm(data = defaults, default ~ income_price+ score+ int_diff )
summary(reg2)
## 
## Call:
## lm(formula = default ~ income_price + score + int_diff, data = defaults)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.11649 -0.06924 -0.03701 -0.00250  1.05678 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.383e-01  7.032e-03  19.662  < 2e-16 ***
## income_price -1.063e+00  6.909e-02 -15.385  < 2e-16 ***
## score        -1.013e-04  1.236e-05  -8.195 2.55e-16 ***
## int_diff      5.019e-03  1.042e-03   4.818 1.45e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1905 on 63996 degrees of freedom
## Multiple R-squared:  0.04023,    Adjusted R-squared:  0.04019 
## F-statistic: 894.3 on 3 and 63996 DF,  p-value: < 2.2e-16

The model has a better R-square than of the previous model in Q1 (0.0402 > 0.0383), indicating a better performance. All the coefficients have p-value <0.05, indicating they are statistically significant.

4. Default Rate Across Time

Calculate the trend-cycle component of the default rate using annual centered moving average (CMA) approach.


# Default rate in each time period 
defaults2 <- defaults %>% 
  group_by(date) %>% 
  summarize(default_rate = mean(default))
head(defaults2)
## # A tibble: 6 Ă— 2
##      date default_rate
##     <qtr>        <dbl>
## 1 1991 Q1       0.0320
## 2 1991 Q2       0.0300
## 3 1991 Q3       0.0320
## 4 1991 Q4       0.0320
## 5 1992 Q1       0.0320
## 6 1992 Q2       0.0300
# Annual Centered Moving Average
defaults2 <- defaults2 %>% 
  mutate(cma4 = slide_dbl(default_rate, mean,
                          .before = 1,
                          .after = 2,
                          .complete = TRUE),
         cma4 = slide_dbl(cma4, mean,
                          .before = 1,
                          .complete = TRUE))

ggplot(defaults2, aes(date, default_rate)) +
  geom_line() +
  geom_line(aes(date, cma4), col = "blue") +
  ggtitle("Default Rate Across Time & Trend-Cycle (CMA)")
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

Discuss: The time series plot displays there was a significant upward trend in the default rate between 2009 and 2010, followed by a sharp decline. This can be related to the economic condition, such as the 2008 financial crisis. The use of Annual Centered Moving Average method smooths the trend-cycle out of short-term fluctuation.

5. Seasonal Effect

This section calculates the seasonal and remainder components of the default rate and the seasonally-adjusted default rate.


## Seasonal Effect
defaults2 %>% mutate(
  T = cma4,
  untrended = default_rate - T,
  quarter = quarter(date)
) -> defaults2 
  
# Seasonal Component
defaults2 %>% 
  as.data.frame() %>% 
  group_by(quarter) %>% 
  summarize(S = mean(untrended, na.rm = TRUE)) -> defaults_season

defaults2 <- left_join(defaults2, defaults_season, by = "quarter")

# Remainder Component
defaults2 <- mutate(defaults2, R = default_rate - T - S)

# Plot all components
defaults2 %>% 
  select(date, default_rate, T, S, R) %>% 
  pivot_longer(cols = -date) %>% 
  ggplot(aes(date, value)) + 
    geom_line() +
    facet_wrap(~name, ncol = 1, scales = "free_y") +
    ggtitle("Default Rate Across Time: Trend, Seasonal & Remainder")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

Discuss:

Quarter 4 are associated with higher default rates. There are several possible explanation for this observation:

However, with a small data set of 500 observations, it is difficult to conclude the seasonal trend of default rate.

# Seasonal adjustment 
mutate(defaults2, SA = default_rate - S) -> defaults2

ggplot(defaults2, aes(date, default_rate)) +
  geom_line() +
  geom_line(aes(date, SA), col ="red") +
  ggtitle("Default Rate Across Time: Seasonal Adjustment")

6. Forecasts

Compare mean, naive and simple linear trend forecast for the default rate for the next 5 years.


# Preparations
horizon <- 5*4 # 5-year * 4 quarters

defaults_fcst <- defaults2 %>% 
  select(qtr = date, default_rate = SA) %>% 
  mutate(Series = "Actual") 

qtrs <- max(defaults_fcst$qtr) + (1:horizon)
forecasts <- data.frame(qtr = qtrs)
forecasts
##        qtr
## 1  2023 Q1
## 2  2023 Q2
## 3  2023 Q3
## 4  2023 Q4
## 5  2024 Q1
## 6  2024 Q2
## 7  2024 Q3
## 8  2024 Q4
## 9  2025 Q1
## 10 2025 Q2
## 11 2025 Q3
## 12 2025 Q4
## 13 2026 Q1
## 14 2026 Q2
## 15 2026 Q3
## 16 2026 Q4
## 17 2027 Q1
## 18 2027 Q2
## 19 2027 Q3
## 20 2027 Q4
# MEAN Forecast
forecasts$Mean <- mean(defaults_fcst$default_rate)
forecasts
##        qtr       Mean
## 1  2023 Q1 0.03932762
## 2  2023 Q2 0.03932762
## 3  2023 Q3 0.03932762
## 4  2023 Q4 0.03932762
## 5  2024 Q1 0.03932762
## 6  2024 Q2 0.03932762
## 7  2024 Q3 0.03932762
## 8  2024 Q4 0.03932762
## 9  2025 Q1 0.03932762
## 10 2025 Q2 0.03932762
## 11 2025 Q3 0.03932762
## 12 2025 Q4 0.03932762
## 13 2026 Q1 0.03932762
## 14 2026 Q2 0.03932762
## 15 2026 Q3 0.03932762
## 16 2026 Q4 0.03932762
## 17 2027 Q1 0.03932762
## 18 2027 Q2 0.03932762
## 19 2027 Q3 0.03932762
## 20 2027 Q4 0.03932762
ggplot(defaults_fcst, aes(qtr,default_rate )) +
  geom_line() +
  geom_line(data = forecasts, aes(qtr, Mean), col ="red") +
  ggtitle("Default Rate Across Time & 5-yr Forecast (MEAN)")

# NAIVE Forecast
forecasts$Naive <- defaults_fcst$default_rate[nrow(defaults_fcst)]

ggplot(defaults_fcst, aes(qtr,default_rate )) +
  geom_line() +
  geom_line(data = forecasts, aes(qtr, Naive), col ="red") +
  ggtitle("Default Rate Across Time & 5-yr Forecast (NAIVE)")

# Simple Linear Forecast
trend <- lm(default_rate ~ qtr, defaults_fcst)
forecasts$Linear <- predict(trend, forecasts)

ggplot(defaults_fcst, aes(qtr,default_rate )) +
  geom_line() +
  geom_line(data = forecasts, aes(qtr, Linear), col ="red") +
  ggtitle("Default Rate Across Time & 5-yr Forecast (LINEAR)")

In this case, since there is a trend in the historical data, Simple Linear Trend forecast would be the best because it captures the overall trend of default rate across time in the reported period. In fact, Mean and Naive Forecast most likely leave out the trend/fluctuation in the data.

Conclusion