ING Lion’s Den

#Introduction In banking, a prepayment model is utilized to appraise the level of prepayments on a credit portfolio that will happen in a chosen period of time including feasible rate changes. Through a careful examination of borrowers decision processes which were derived from multiple variable correlation analysis, this report strives to investigate further outcomes and prepayment completion efficiency.

Analysis process

1) Examination of raw data and basic correlations 2) Feature analysis and selection
3) Visualization of trends and variable interdependencies 4) Initial conjectures leading to model selection 5) Predictive model inspection 6) Climate change significance 6) Final conclusions

library(tidyverse)

Data Analysis

The foundation of our data analysis commences with segregating observations by date and discovering the general trend of the prepayments over the years. The plot shows the sum of prepayments for each date.

prepayments <- data.frame(date=Task_data$date, prep=Task_data$prepaid_amount, date_str = Task_data$date_str)
by_date <- prepayments %>% group_by(date, date_str) %>% summarize(value = sum(prep))
ggplot(by_date, aes(x=date, y = value)) + geom_line() + geom_smooth(method="lm")

When observing the given plot, it is discernibly apparent that the general trend of the prepayments is increasing. However, it may appear deceitful due to the fact that zero values have influence on the sum. To better examine the trend it is essential to plot mean value of monthly prepayments.

by_date_mean <- prepayments %>% group_by(date, date_str) %>% summarize(value = mean(prep))

## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.

ggplot(by_date_mean, aes(x=date, y = value)) + geom_line() + geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

# It can be observed that despite increasing trend, no signifficant seasonality is detected.

data_ts <- ts(by_date_mean$value, frequency = 12, start = c(2016, 1))
library(forecast)
ggseasonplot(data_ts, year.labels=TRUE, year.labels.left=TRUE, col=rainbow(n=6, start = 0, end = 24/25, alpha = 1)) +
  ylab("Mean of prepayments [PLN]") +
  ggtitle("Seasonal plot: Prepayments by months")

#Conclusions from the plot The seasonal pattern of prepayment is also profuse in essential information. Since in Poland the possibility to collect tax return (PIT) begins in the middle of February and the tax office has approximately 30-90 days to issue that money, it can fairly concluded that increased prepayment factor in March comes from that reason.

data_ts_sum <- ts(by_date$value, frequency = 12, start = c(2016, 1))
boxplot(data_ts_sum~cycle(data_ts_sum))

# Income vs. Prepayments paid Our assumption in this case was that income factor may have a great influence of people’s propensity to pay higher prepayments. However, together with scrupulous study and examining scatterplot of log(income) vs. log(prepayments) no significant correlation between prepayments and household income is observed.

data_id <- na.omit(data.frame(id=Task_data$id, prep=Task_data$prepaid_amount, income = Task_data$INCOME_houshold))

data_id <- data_id %>% group_by(id, income) %>% summarize(prep = sum(prep))
ggplot(data=data_id, aes(x=log(income), y=log(prep))) + geom_point() + ggtitle("Income influence on prepayments")

# Inflation influence on prepayments

data_infl <- data.frame(date_str = macro1$date_str, inflation = macro1$inflation)
data_infl <- inner_join(by_date, data_infl, by="date_str")
ggplot(data_infl, aes(x=inflation, y=value)) + geom_point() + geom_smooth(method="lm") + ggtitle("Inflation influence on prepayments") + ylab("prepayment mean")

## Simple linear regression When going through the process of model selection our team was striving to accomplish essential assumptions - high accuracy as well as simplicity. For that reason the decision that was made was to manufactured a model based on linear relationship between inflation and prepayments. It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Instead we can use the lm function in R to fit the linear model (a.k.a. regression line).

We begin by splitting the dataset into two parts, training set and testing set. In this example we will randomly take 75% row in this dataset and put it into the training set, and other 25% row in the testing set:

data <- inner_join(data, data_infl, by="date_str")

smp_size<-floor(0.75*nrow(data_infl))
set.seed(123)
train_ind<-sample(seq_len(nrow(data_infl)), size=smp_size)
train<-data_infl[train_ind, ]
test<-data_infl[-train_ind, ]

model1<-lm(value~inflation, data=train)

summary(model1)

## 
## Call:
## lm(formula = value ~ inflation, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2959533 -1466426  -696059  1739939  3681051 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -81468095   24190799  -3.368 0.001855 ** 
## inflation      860159     238324   3.609 0.000951 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1997000 on 35 degrees of freedom
## Multiple R-squared:  0.2712, Adjusted R-squared:  0.2504 
## F-statistic: 13.03 on 1 and 35 DF,  p-value: 0.0009507

Established least quiares regression line:

prepayments = 593.74*inflation - 58270.84

Model validation

Looks like rmse of our model is 13660 on the training set, but that is not what we care about, what we care about is the rmse of our model on the test set:

## Warning: package 'Metrics' was built under R version 4.0.5

## 
## Attaching package: 'Metrics'

## The following object is masked from 'package:forecast':
## 
##     accuracy

library(plotly)

## Warning: package 'plotly' was built under R version 4.0.5

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

dat <- data.frame(inflation = (99.1:104.7), value = predict(model1, data.frame(inflation = (99.1:104.7))))
plot_ly() %>% 
      add_trace(x=~inflation, y=~value, type="scatter", mode="lines", data = dat, name = "Predicted Value") %>%
      add_trace(x=~inflation, y=~value, type="scatter", mode="markers", data = test, name = "Actual Value")

library(forecast)
model_arima <- auto.arima(data_ts_sum, trace = FALSE)
summary(model_arima)

## Series: data_ts_sum 
## ARIMA(0,1,1) 
## 
## Coefficients:
##           ma1
##       -0.3795
## s.e.   0.1132
## 
## sigma^2 estimated as 1.074e+12:  log likelihood=-747.81
## AIC=1499.61   AICc=1499.87   BIC=1503.39
## 
## Training set error measures:
##                    ME    RMSE      MAE     MPE     MAPE      MASE        ACF1
## Training set 149428.8 1015306 739308.4 1.50287 12.49055 0.2978502 -0.07538156

forecast_arima <- forecast(model_arima, level=c(90), h=1*12)
plot(forecast_arima)

linear <- lm(value~date, data=data_infl)
summary(linear)

## 
## Call:
## lm(formula = value ~ date, data = data_infl)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2821628 -1231977  -825843  1547187  3915854 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.937e+07  1.001e+07  -5.932 3.18e-07 ***
## date         4.301e-02  6.589e-03   6.528 3.89e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1767000 on 48 degrees of freedom
## Multiple R-squared:  0.4703, Adjusted R-squared:  0.4593 
## F-statistic: 42.62 on 1 and 48 DF,  p-value: 3.89e-08

fitted<-linear$fitted.values

ING Lion’s Den

BifOfData - Jakub Sochacki, Michał Rejmak, Tomasz Kuczyński

2021-05-20

Analysis process

Data Analysis

Established least quiares regression line:

Model validation