This project aims to predict home loan defaults at micro and macro level using time series models and estimate probabilities of default associated with other economic indicators such as income and interest rate.
load("~/Downloads/defaults.rda")
Each observation in the sample data set is an individual with a home loan. There are 500 observations in each quarter from 1991 to 2022. The relevant variables are
date – year and quarter of observationdefault – 1 if the individual defaulted on their loan,
0 otherwiseincome – weekly income at the time of observationprice – purchase price of homescore – credit score at time of observationinterest – nominal interest rate on loan# Install required packages
library(fpp3)
library(margins)
library(slider)
Before dive deep into the analysis, we should look at data set first
head(defaults)
## date default income price score interest
## 1 1991 Q1 0 0 38403 509 12.25
## 2 1991 Q1 0 0 32929 531 10.71
## 3 1991 Q1 0 0 30927 445 11.19
## 4 1991 Q1 0 0 29177 442 9.86
## 5 1991 Q1 0 0 40862 440 10.62
## 6 1991 Q1 0 0 33200 443 11.55
reg <- lm(data = defaults, default ~ . -date)
summary(reg)
##
## Call:
## lm(formula = default ~ . - date, data = defaults)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12211 -0.06886 -0.03735 0.00096 1.03705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.959e-01 7.563e-03 39.129 < 2e-16 ***
## income 1.707e-06 4.717e-07 3.619 0.000296 ***
## price -8.172e-08 1.784e-08 -4.581 4.63e-06 ***
## score -3.424e-04 1.001e-05 -34.192 < 2e-16 ***
## interest -5.885e-03 4.553e-04 -12.926 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1907 on 63995 degrees of freedom
## Multiple R-squared: 0.03835, Adjusted R-squared: 0.03829
## F-statistic: 637.9 on 4 and 63995 DF, p-value: < 2.2e-16
Interpret the results:
income coefficients are positive but very small,
indicating that as income is higher, the probability of default is
higher. This result is surprising because in common sense, we would
think that high-income people will be more able to pay back the
loans.price, score and interest
coefficients are negative but also very small, suggesting that as price
of home, credit score and interest are higher, it’s less likely for that
person to default on the loan. The negative interest
coefficient is surprising because in common sense, if the interest rate
is higher, it should be more likely to default.# Logit regression
logit_reg <- glm(default ~., data = select(defaults, -date), family = binomial(link = "logit"))
summary(logit_reg)
##
## Call:
## glm(formula = default ~ ., family = binomial(link = "logit"),
## data = select(defaults, -date))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8320 -0.3385 -0.1971 -0.0814 4.8380
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.015e+00 2.699e-01 11.169 <2e-16 ***
## income -2.732e-04 2.985e-05 -9.149 <2e-16 ***
## price -1.837e-06 9.830e-07 -1.869 0.0616 .
## score -6.982e-03 4.556e-04 -15.325 <2e-16 ***
## interest -2.206e-01 1.456e-02 -15.151 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21229 on 63999 degrees of freedom
## Residual deviance: 18214 on 63995 degrees of freedom
## AIC: 18224
##
## Number of Fisher Scoring iterations: 8
Discuss statistical significance: All coefficients, except for
income, have p-value less than 0.05, indicating that they
are statistically significant in predicting the probability of
default.
# Marginal Effect
marginal_effect <- margins(logit_reg)
summary(marginal_effect)
## factor AME SE z p lower upper
## income -0.0000 0.0000 -9.0558 0.0000 -0.0000 -0.0000
## interest -0.0079 0.0005 -14.8827 0.0000 -0.0090 -0.0069
## price -0.0000 0.0000 -1.8565 0.0634 -0.0000 0.0000
## score -0.0003 0.0000 -14.9679 0.0000 -0.0003 -0.0002
Marginal effect in logit model are negative, different from the LPM
(income is positive). The logit model provide more
realistic marginal effects in when predicting binary outcome, such as
the default in this case. However, its ability to adjust the marginal
effect based on probability is more complex, compared to the LPM.
# Avg interest rate in the time period
defaults <- defaults %>%
group_by(date) %>%
mutate(avg_int = mean(interest))
# Add 2 extra variables
defaults <- defaults %>%
ungroup(date) %>%
mutate(
income_price = income/price,
int_diff = interest - avg_int)
# Perform linear regression
reg2 <- lm(data = defaults, default ~ income_price+ score+ int_diff )
summary(reg2)
##
## Call:
## lm(formula = default ~ income_price + score + int_diff, data = defaults)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.11649 -0.06924 -0.03701 -0.00250 1.05678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.383e-01 7.032e-03 19.662 < 2e-16 ***
## income_price -1.063e+00 6.909e-02 -15.385 < 2e-16 ***
## score -1.013e-04 1.236e-05 -8.195 2.55e-16 ***
## int_diff 5.019e-03 1.042e-03 4.818 1.45e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1905 on 63996 degrees of freedom
## Multiple R-squared: 0.04023, Adjusted R-squared: 0.04019
## F-statistic: 894.3 on 3 and 63996 DF, p-value: < 2.2e-16
The model has a better R-square than of the previous model in Q1 (0.0402 > 0.0383), indicating a better performance. All the coefficients have p-value <0.05, indicating they are statistically significant.
income_price = -1.06, indicating that the estimated
probability of defaults decreases by 1.06 unit if the ratio of income to
house price increase by 1 unit, holding other variables constantscore = -0.000101, indicating that estimated
probability of defaults decreases by 0.000101 unit if the credit score
increase by 1 unit, holding other variables constantint_diff = 0.00502, indicating that estimated
probability of defaults increases by 0.00502 unit if the difference
between interest rate and national average interest rate in the same
period increase by 1 unit, holding other variables constantCalculate the trend-cycle component of the default rate using annual centered moving average (CMA) approach.
# Default rate in each time period
defaults2 <- defaults %>%
group_by(date) %>%
summarize(default_rate = mean(default))
head(defaults2)
## # A tibble: 6 Ă— 2
## date default_rate
## <qtr> <dbl>
## 1 1991 Q1 0.0320
## 2 1991 Q2 0.0300
## 3 1991 Q3 0.0320
## 4 1991 Q4 0.0320
## 5 1992 Q1 0.0320
## 6 1992 Q2 0.0300
# Annual Centered Moving Average
defaults2 <- defaults2 %>%
mutate(cma4 = slide_dbl(default_rate, mean,
.before = 1,
.after = 2,
.complete = TRUE),
cma4 = slide_dbl(cma4, mean,
.before = 1,
.complete = TRUE))
ggplot(defaults2, aes(date, default_rate)) +
geom_line() +
geom_line(aes(date, cma4), col = "blue") +
ggtitle("Default Rate Across Time & Trend-Cycle (CMA)")
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
Discuss: The time series plot displays there was a significant upward trend in the default rate between 2009 and 2010, followed by a sharp decline. This can be related to the economic condition, such as the 2008 financial crisis. The use of Annual Centered Moving Average method smooths the trend-cycle out of short-term fluctuation.
This section calculates the seasonal and remainder components of the default rate and the seasonally-adjusted default rate.
## Seasonal Effect
defaults2 %>% mutate(
T = cma4,
untrended = default_rate - T,
quarter = quarter(date)
) -> defaults2
# Seasonal Component
defaults2 %>%
as.data.frame() %>%
group_by(quarter) %>%
summarize(S = mean(untrended, na.rm = TRUE)) -> defaults_season
defaults2 <- left_join(defaults2, defaults_season, by = "quarter")
# Remainder Component
defaults2 <- mutate(defaults2, R = default_rate - T - S)
# Plot all components
defaults2 %>%
select(date, default_rate, T, S, R) %>%
pivot_longer(cols = -date) %>%
ggplot(aes(date, value)) +
geom_line() +
facet_wrap(~name, ncol = 1, scales = "free_y") +
ggtitle("Default Rate Across Time: Trend, Seasonal & Remainder")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
Discuss:
Quarter 4 are associated with higher default rates. There are several possible explanation for this observation:
However, with a small data set of 500 observations, it is difficult to conclude the seasonal trend of default rate.
# Seasonal adjustment
mutate(defaults2, SA = default_rate - S) -> defaults2
ggplot(defaults2, aes(date, default_rate)) +
geom_line() +
geom_line(aes(date, SA), col ="red") +
ggtitle("Default Rate Across Time: Seasonal Adjustment")
Compare mean, naive and simple linear trend forecast for the default rate for the next 5 years.
# Preparations
horizon <- 5*4 # 5-year * 4 quarters
defaults_fcst <- defaults2 %>%
select(qtr = date, default_rate = SA) %>%
mutate(Series = "Actual")
qtrs <- max(defaults_fcst$qtr) + (1:horizon)
forecasts <- data.frame(qtr = qtrs)
forecasts
## qtr
## 1 2023 Q1
## 2 2023 Q2
## 3 2023 Q3
## 4 2023 Q4
## 5 2024 Q1
## 6 2024 Q2
## 7 2024 Q3
## 8 2024 Q4
## 9 2025 Q1
## 10 2025 Q2
## 11 2025 Q3
## 12 2025 Q4
## 13 2026 Q1
## 14 2026 Q2
## 15 2026 Q3
## 16 2026 Q4
## 17 2027 Q1
## 18 2027 Q2
## 19 2027 Q3
## 20 2027 Q4
# MEAN Forecast
forecasts$Mean <- mean(defaults_fcst$default_rate)
forecasts
## qtr Mean
## 1 2023 Q1 0.03932762
## 2 2023 Q2 0.03932762
## 3 2023 Q3 0.03932762
## 4 2023 Q4 0.03932762
## 5 2024 Q1 0.03932762
## 6 2024 Q2 0.03932762
## 7 2024 Q3 0.03932762
## 8 2024 Q4 0.03932762
## 9 2025 Q1 0.03932762
## 10 2025 Q2 0.03932762
## 11 2025 Q3 0.03932762
## 12 2025 Q4 0.03932762
## 13 2026 Q1 0.03932762
## 14 2026 Q2 0.03932762
## 15 2026 Q3 0.03932762
## 16 2026 Q4 0.03932762
## 17 2027 Q1 0.03932762
## 18 2027 Q2 0.03932762
## 19 2027 Q3 0.03932762
## 20 2027 Q4 0.03932762
ggplot(defaults_fcst, aes(qtr,default_rate )) +
geom_line() +
geom_line(data = forecasts, aes(qtr, Mean), col ="red") +
ggtitle("Default Rate Across Time & 5-yr Forecast (MEAN)")
# NAIVE Forecast
forecasts$Naive <- defaults_fcst$default_rate[nrow(defaults_fcst)]
ggplot(defaults_fcst, aes(qtr,default_rate )) +
geom_line() +
geom_line(data = forecasts, aes(qtr, Naive), col ="red") +
ggtitle("Default Rate Across Time & 5-yr Forecast (NAIVE)")
# Simple Linear Forecast
trend <- lm(default_rate ~ qtr, defaults_fcst)
forecasts$Linear <- predict(trend, forecasts)
ggplot(defaults_fcst, aes(qtr,default_rate )) +
geom_line() +
geom_line(data = forecasts, aes(qtr, Linear), col ="red") +
ggtitle("Default Rate Across Time & 5-yr Forecast (LINEAR)")
In this case, since there is a trend in the historical data, Simple Linear Trend forecast would be the best because it captures the overall trend of default rate across time in the reported period. In fact, Mean and Naive Forecast most likely leave out the trend/fluctuation in the data.