#Introduction In banking, a prepayment model is utilized to appraise the level of prepayments on a credit portfolio that will happen in a chosen period of time including feasible rate changes. Through a careful examination of borrowers decision processes which were derived from multiple variable correlation analysis, this report strives to investigate further outcomes and prepayment completion efficiency.
1) Examination of raw data and basic correlations 2) Feature analysis and selection
3) Visualization of trends and variable interdependencies 4) Initial conjectures leading to model selection 5) Predictive model inspection 6) Climate change significance 6) Final conclusions
library(tidyverse)
The foundation of our data analysis commences with segregating observations by date and discovering the general trend of the prepayments over the years. The plot shows the sum of prepayments for each date.
prepayments <- data.frame(date=Task_data$date, prep=Task_data$prepaid_amount, date_str = Task_data$date_str)
by_date <- prepayments %>% group_by(date, date_str) %>% summarize(value = sum(prep))
ggplot(by_date, aes(x=date, y = value)) + geom_line() + geom_smooth(method="lm")
When observing the given plot, it is discernibly apparent that the general trend of the prepayments is increasing. However, it may appear deceitful due to the fact that zero values have influence on the sum. To better examine the trend it is essential to plot mean value of monthly prepayments.
by_date_mean <- prepayments %>% group_by(date, date_str) %>% summarize(value = mean(prep))
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
ggplot(by_date_mean, aes(x=date, y = value)) + geom_line() + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# It can be observed that despite increasing trend, no signifficant seasonality is detected.
data_ts <- ts(by_date_mean$value, frequency = 12, start = c(2016, 1))
library(forecast)
ggseasonplot(data_ts, year.labels=TRUE, year.labels.left=TRUE, col=rainbow(n=6, start = 0, end = 24/25, alpha = 1)) +
ylab("Mean of prepayments [PLN]") +
ggtitle("Seasonal plot: Prepayments by months")
#Conclusions from the plot The seasonal pattern of prepayment is also profuse in essential information. Since in Poland the possibility to collect tax return (PIT) begins in the middle of February and the tax office has approximately 30-90 days to issue that money, it can fairly concluded that increased prepayment factor in March comes from that reason.
data_ts_sum <- ts(by_date$value, frequency = 12, start = c(2016, 1))
boxplot(data_ts_sum~cycle(data_ts_sum))
# Income vs. Prepayments paid Our assumption in this case was that income factor may have a great influence of people’s propensity to pay higher prepayments. However, together with scrupulous study and examining scatterplot of log(income) vs. log(prepayments) no significant correlation between prepayments and household income is observed.
data_id <- na.omit(data.frame(id=Task_data$id, prep=Task_data$prepaid_amount, income = Task_data$INCOME_houshold))
data_id <- data_id %>% group_by(id, income) %>% summarize(prep = sum(prep))
ggplot(data=data_id, aes(x=log(income), y=log(prep))) + geom_point() + ggtitle("Income influence on prepayments")
# Inflation influence on prepayments
data_infl <- data.frame(date_str = macro1$date_str, inflation = macro1$inflation)
data_infl <- inner_join(by_date, data_infl, by="date_str")
ggplot(data_infl, aes(x=inflation, y=value)) + geom_point() + geom_smooth(method="lm") + ggtitle("Inflation influence on prepayments") + ylab("prepayment mean")
## Simple linear regression When going through the process of model selection our team was striving to accomplish essential assumptions - high accuracy as well as simplicity. For that reason the decision that was made was to manufactured a model based on linear relationship between inflation and prepayments. It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Instead we can use the
lm function in R to fit the linear model (a.k.a. regression line).
We begin by splitting the dataset into two parts, training set and testing set. In this example we will randomly take 75% row in this dataset and put it into the training set, and other 25% row in the testing set:
data <- inner_join(data, data_infl, by="date_str")
smp_size<-floor(0.75*nrow(data_infl))
set.seed(123)
train_ind<-sample(seq_len(nrow(data_infl)), size=smp_size)
train<-data_infl[train_ind, ]
test<-data_infl[-train_ind, ]
model1<-lm(value~inflation, data=train)
summary(model1)
##
## Call:
## lm(formula = value ~ inflation, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2959533 -1466426 -696059 1739939 3681051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -81468095 24190799 -3.368 0.001855 **
## inflation 860159 238324 3.609 0.000951 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1997000 on 35 degrees of freedom
## Multiple R-squared: 0.2712, Adjusted R-squared: 0.2504
## F-statistic: 13.03 on 1 and 35 DF, p-value: 0.0009507
prepayments = 593.74*inflation - 58270.84
Looks like rmse of our model is 13660 on the training set, but that is not what we care about, what we care about is the rmse of our model on the test set:
## Warning: package 'Metrics' was built under R version 4.0.5
##
## Attaching package: 'Metrics'
## The following object is masked from 'package:forecast':
##
## accuracy
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.5
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
dat <- data.frame(inflation = (99.1:104.7), value = predict(model1, data.frame(inflation = (99.1:104.7))))
plot_ly() %>%
add_trace(x=~inflation, y=~value, type="scatter", mode="lines", data = dat, name = "Predicted Value") %>%
add_trace(x=~inflation, y=~value, type="scatter", mode="markers", data = test, name = "Actual Value")
library(forecast)
model_arima <- auto.arima(data_ts_sum, trace = FALSE)
summary(model_arima)
## Series: data_ts_sum
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## -0.3795
## s.e. 0.1132
##
## sigma^2 estimated as 1.074e+12: log likelihood=-747.81
## AIC=1499.61 AICc=1499.87 BIC=1503.39
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 149428.8 1015306 739308.4 1.50287 12.49055 0.2978502 -0.07538156
forecast_arima <- forecast(model_arima, level=c(90), h=1*12)
plot(forecast_arima)
linear <- lm(value~date, data=data_infl)
summary(linear)
##
## Call:
## lm(formula = value ~ date, data = data_infl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2821628 -1231977 -825843 1547187 3915854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.937e+07 1.001e+07 -5.932 3.18e-07 ***
## date 4.301e-02 6.589e-03 6.528 3.89e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1767000 on 48 degrees of freedom
## Multiple R-squared: 0.4703, Adjusted R-squared: 0.4593
## F-statistic: 42.62 on 1 and 48 DF, p-value: 3.89e-08
fitted<-linear$fitted.values