Bank Data

Data

The data are the monthly volumes of commercial bank real-estate loans, in billions of dollars, from January 1973 to October 1978, a total of 70 observations. They are derived from reports to the Federal Reserve System from large commercial banks.

Libraries

library(forecast)
library(TSA)
library(tidyverse)
library(tseries)
library(lmtest)

Importing the Data

data <- read.csv("D:/UC BANA/Spring 2021/Forecasting Methods/HW4/Loans.csv")
data <- as.ts(data)

Data Visualization

Time Series Plots

data %>% ggtsdisplay()

From the above plots, we observe that the time series exhibits non-constant variance, so we consider transforming the data.

Data Transformation

We will use the Box-Cox transformation method to find the transformation parameter.

#Box-Cox transformation on the data
bc <- BoxCox.ar(data)

bc$lambda[which.max(bc$loglike)]
## [1] -0.1

The estimated lambda is -0.1, which is close to 0, so we consider a log transformation of the data and, for comparison, a square-root transformation.

#Candidate transformations suggested by the Box-Cox estimate
ld <- log(data)     #log transformation (lambda = 0)
sqd <- data^0.5     #square-root transformation (lambda = 0.5)

ld %>% ggtsdisplay()

sqd %>% ggtsdisplay()

Since the time series plots of the transformed data show essentially the same pattern as the original series, we do not transform the data.
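Before moving on, the transformation parameter can also be cross-checked with the forecast package. This is a minimal sketch that was not part of the original analysis; its default Guerrero method may give a slightly different value than TSA::BoxCox.ar.

#Optional cross-check of the transformation parameter (not in the original analysis)
BoxCox.lambda(data)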

Differencing

Since the sample ACF decays very slowly and the sample PACF has a large spike at lag 1, we difference the data. Stationarity was not attained with a single difference (see the check below), so we difference the data twice.
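The claim about the single difference can be verified with an ADF test on the once-differenced series; this is a sketch whose output was not included in the original report.

#ADF test on the first difference (illustrative; output not shown in the original)
data %>% diff() %>% adf.test()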

diff_data <- data %>% diff() %>% diff()
diff_data %>% adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -3.9847, Lag order = 4, p-value = 0.01579
## alternative hypothesis: stationary

Time Series Plots of Differenced Data

diff_data %>% ggtsdisplay()

The time series behaves much more like a stationary series after differencing twice, and this is confirmed by the ADF test. The test gives a p-value of 0.0158, which is less than 0.05, so we reject the null hypothesis of a unit root and conclude that the twice-differenced series is stationary.

Model Selection

From the ACF and PACF of the twice-differenced data, we notice that the ACF has a large spike at lag 1 and the PACF cuts off after lag 1, so an AR(1) component is a candidate; the single ACF spike also suggests an MA(1). To get better clarity, we examine the EACF of the differenced data.

eacf(diff_data)
## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o o o o o o o o o o  o  o  o 
## 1 x o o o o o o o o o o  o  o  o 
## 2 x o o o o o o o o o o  o  o  o 
## 3 x o o o o o o o o o o  o  o  o 
## 4 x o o o o o o o o o o  o  o  o 
## 5 o o x x o o o o o o o  o  o  o 
## 6 o x x o x o o o o o o  o  o  o 
## 7 x o o o x o o o o o o  o  o  o

The EACF table has its vertex of 'o' entries at AR = 0, MA = 1, which points to an MA(1) for the twice-differenced series. Based on this, we fit this model and its neighbours and compare their AIC and BIC values.

#Considering various models
(fit <-arima(data, order = c(0, 2, 1)))
## 
## Call:
## arima(x = data, order = c(0, 2, 1))
## 
## Coefficients:
##           ma1
##       -0.3787
## s.e.   0.1084
## 
## sigma^2 estimated as 0.08192:  log likelihood = -11.33,  aic = 24.66
(fit2 <-arima(data, order = c(1, 2, 0)))
## 
## Call:
## arima(x = data, order = c(1, 2, 0))
## 
## Coefficients:
##           ar1
##       -0.3593
## s.e.   0.1140
## 
## sigma^2 estimated as 0.08281:  log likelihood = -11.68,  aic = 25.37
(fit3 <-arima(data, order = c(1, 2, 1)))
## 
## Call:
## arima(x = data, order = c(1, 2, 1))
## 
## Coefficients:
##           ar1      ma1
##       -0.1444  -0.2577
## s.e.   0.2744   0.2616
## 
## sigma^2 estimated as 0.08161:  log likelihood = -11.21,  aic = 26.41
BIC(fit)
## [1] 31.06567
BIC(fit2)
## [1] 31.77634
BIC(fit3)
## [1] 35.02589

After comparing AIC and BIC values, we conclude that ARIMA(0,2,1) has the lowest of both and is therefore the better model.
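Written in backshift notation, the fitted model is (1 - B)^2 Y_t = (1 - 0.3787 B) e_t; that is, the second difference of the loan series follows an MA(1) process.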

Parameter Estimation

#parameter estimation for the considered model
coeftest(fit)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ma1  -0.3787     0.1084 -3.4935 0.0004767 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that the coefficient is significant.
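For example, an approximate 95% confidence interval for the MA(1) coefficient is -0.3787 ± 1.96 × 0.1084 ≈ (-0.59, -0.17), which clearly excludes zero.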

Overfitting

Let us check for overfitting by increasing p and q by one, one at a time, and comparing the AIC and BIC values of the resulting models.

#Overfitting with 2 other models
(fit4 <-arima(data, order = c(0, 2, 2)))
## 
## Call:
## arima(x = data, order = c(0, 2, 2))
## 
## Coefficients:
##           ma1     ma2
##       -0.4148  0.1014
## s.e.   0.1253  0.1529
## 
## sigma^2 estimated as 0.08138:  log likelihood = -11.12,  aic = 26.23
(fit5 <-arima(data, order = c(1, 2, 1)))
## 
## Call:
## arima(x = data, order = c(1, 2, 1))
## 
## Coefficients:
##           ar1      ma1
##       -0.1444  -0.2577
## s.e.   0.2744   0.2616
## 
## sigma^2 estimated as 0.08161:  log likelihood = -11.21,  aic = 26.41
BIC(fit4)
## [1] 34.84466
BIC(fit5)
## [1] 35.02589

The AIC and BIC values of these models are larger than those of our chosen model, so ARIMA(0,2,1) remains the preferred specification. We can also check whether the estimated coefficients of the overfitted models are significant.

#parameter estimation for the models considered for overfitting
coeftest(fit4)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ma1 -0.41480    0.12533 -3.3097 0.0009339 ***
## ma2  0.10141    0.15287  0.6633 0.5071086    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fit5)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ar1 -0.14436    0.27442 -0.5260   0.5989
## ma1 -0.25773    0.26158 -0.9853   0.3245

The MA(2) coefficient in model fit4 and the AR(1) coefficient in model fit5 are not significant. Also, the MA(1) coefficient does not change much across the models. So we conclude that our choice of ARIMA(0,2,1) is appropriate.
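As an optional cross-check that was not part of the original analysis, the automatic selection in the forecast package can be run on the raw series and compared against the manually chosen specification; it may or may not pick the same model.

#Optional automatic model selection for comparison (not in the original analysis)
auto.arima(data)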

Model Diagnostics

To be sure that the predictive model cannot be easily improved upon, we check whether the residuals behave like white noise: approximately normally distributed with mean zero, constant variance, and no autocorrelation.

#Checking residuals for the final model
checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,2,1)
## Q* = 9.2622, df = 9, p-value = 0.4134
## 
## Model df: 1.   Total lags used: 10

The residuals behave like white noise: there are no significant spikes in their ACF, and the histogram suggests an approximately normal distribution. The Ljung-Box test confirms this: the test statistic is 9.26 with a p-value of 0.4134. The null hypothesis is that the residuals are independent (no autocorrelation). Since the p-value is greater than the 0.05 significance level, we cannot reject the null hypothesis, so the in-sample residuals can be treated as independent.
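If a formal check of residual normality is desired, a minimal sketch (not part of the original output) is a Shapiro-Wilk test together with a normal Q-Q plot of the residuals:

#Optional formal normality checks on the residuals (not in the original output)
shapiro.test(residuals(fit))
qqnorm(residuals(fit)); qqline(residuals(fit))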

Forecasting

We will forecast the data values for the next 2 years.

#Forecasting
fit1_f <-Arima(data, order = c(0, 2, 1))
fit1_f %>% forecast(h = 24)
##    Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## 70       89.26624  88.89653  89.63595  88.70082  89.83166
## 71       90.93248  90.22822  91.63674  89.85541  92.00955
## 72       92.59872  91.51088  93.68657  90.93501  94.26244
## 73       94.26497  92.74691  95.78302  91.94330  96.58663
## 74       95.93121  93.94003  97.92238  92.88597  98.97644
## 75       97.59745  95.09350 100.10139  93.76799 101.42690
## 76       99.26369  96.21001 102.31737  94.59349 103.93389
## 77      100.92993  97.29181 104.56805  95.36590 106.49396
## 78      102.59617  98.34080 106.85155  96.08814 109.10420
## 79      104.26241  99.35861 109.16622  96.76269 111.76213
## 80      105.92865 100.34666 111.51065  97.39173 114.46558
## 81      107.59490 101.30618 113.88361  97.97713 117.21266
## 82      109.26114 102.23827 116.28400  98.52059 120.00168
## 83      110.92738 103.14392 118.71084  99.02361 122.83115
## 84      112.59362 104.02401 121.16323  99.48753 125.69971
## 85      114.25986 104.87934 123.64038  99.91359 128.60613
## 86      115.92610 105.71064 126.14157 100.30290 131.54930
## 87      117.59234 106.51858 128.66610 100.65649 134.52820
## 88      119.25858 107.30379 131.21338 100.97530 137.54187
## 89      120.92483 108.06682 133.78283 101.26020 140.58945
## 90      122.59107 108.80822 136.37392 101.51202 143.67012
## 91      124.25731 109.52847 138.98615 101.73149 146.78313
## 92      125.92355 110.22803 141.61907 101.91933 149.92777
## 93      127.58979 110.90735 144.27223 102.07620 153.10338
fit1_f %>% forecast(h = 24) %>% autoplot

Japan Disposable Income

Data

The data set contains the quarterly disposable income in Japan for the period 1961 through 1987.

Importing the Data

data2 <- ts(scan("D:/UC BANA/Spring 2021/Forecasting Methods/HW4/Disposable_Income.csv"),frequency=4)

Data Visualization

Time Series Plots

data2 %>% ggtsdisplay()

From the above plots, we observe that the variance of the series increases with time, so we transform the data.

Data Transformation

We will use the Box-Cox transformation method to find the transformation parameter.

#Box-Cox transformation on the data
(lambda <- BoxCox.lambda(data2))
## [1] 0.2027691

The estimated lambda is approximately 0.2, which is close to 0, so we consider a log transformation of the data.

data2 %>% log() %>% ggtsdisplay()

The variance is now stabilized, but an ADF test (sketched below; its output was not included in the original report) suggests the logged series is still non-stationary, so we take a first-order difference.
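For completeness, this is the check being referred to; the corresponding output is not shown here.

#ADF test on the log-transformed series (illustrative; output not shown in the original)
data2 %>% log() %>% adf.test()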

Differencing

data2 %>% log() %>% diff() %>% ggtsdisplay()

data2 %>% log() %>% diff() %>% adf.test()
## Warning in adf.test(.): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -4.6074, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

The ADF test suggests that the series is stationary now.

Since the ACF at lags 4, 8, 12, and 16 still decays slowly, we also take a first-order seasonal difference at lag 4.

data2 %>% log() %>% diff() %>% diff(lag = 4) %>% ggtsdisplay()

data2 %>% log() %>% diff() %>% diff(lag = 4) %>% adf.test()
## Warning in adf.test(.): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -6.6668, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

The seasonally differenced series now appears stationary, which the ADF test confirms.

Model Selection

From the ACF and PACF of the transformed and differenced data, we can observe characteristics of AR(2), MA(3), or ARMA(2,3) behaviour in the non-seasonal part. The spike at lag 4 suggests a seasonal MA(1). Hence we consider several candidate models and choose the best among them.

#MA(3) Model (note: fit to the untransformed series, unlike fit22 and fit23 below)
(fit21 <- data2 %>% 
    Arima(order=c(0,0,3), seasonal=c(0,0,1),include.constant = FALSE ))
## Series: . 
## ARIMA(0,0,3)(0,0,1)[4] with zero mean 
## 
## Coefficients:
##          ma1     ma2     ma3    sma1
##       1.2391  1.2875  0.4146  1.0000
## s.e.  0.0658  0.0633  0.0611  0.0657
## 
## sigma^2 estimated as 3959:  log likelihood=-607.21
## AIC=1224.42   AICc=1225.01   BIC=1237.83
#AR(2) Model
(fit22 <- data2 %>% log() %>% diff() %>% diff(lag=4) %>% 
    Arima(order=c(2,0,0), seasonal=c(0,0,1),include.constant = FALSE))
## Series: . 
## ARIMA(2,0,0)(0,0,1)[4] with zero mean 
## 
## Coefficients:
##           ar1      ar2     sma1
##       -0.5609  -0.3839  -0.5044
## s.e.   0.0909   0.0903   0.0786
## 
## sigma^2 estimated as 0.0006325:  log likelihood=233.82
## AIC=-459.64   AICc=-459.24   BIC=-449.11
#ARMA(2,3) Model
(fit23 <- data2 %>% log() %>% diff() %>% diff(lag=4) %>% 
  Arima(order=c(2,0,3), seasonal=c(0,0,1),include.constant = FALSE))
## Series: . 
## ARIMA(2,0,3)(0,0,1)[4] with zero mean 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2      ma3     sma1
##       -0.7217  -0.9157  0.1397  0.4451  -0.4413  -0.4796
## s.e.   0.1044   0.1005  0.1547  0.1424   0.1334   0.0858
## 
## sigma^2 estimated as 0.0006341:  log likelihood=235.16
## AIC=-456.31   AICc=-455.13   BIC=-437.87

After comparing AIC and BIC values, we conclude that fit22 has the lowest of both and is thus the better model. (fit21 is fit to the untransformed series, so its information criteria are not directly comparable, but fit22 also outperforms fit23, which is fit to the same transformed series.) The chosen model is equivalent to a SARIMA(2,1,0)(0,1,1)[4] applied to the log of the original series.
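Written out in backshift notation, the fitted model is (1 + 0.5609 B + 0.3839 B^2)(1 - B)(1 - B^4) log(Y_t) = (1 - 0.5044 B^4) e_t.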

Parameter Estimation

#parameter estimation for the considered model
coeftest(fit22)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ar1  -0.560918   0.090949 -6.1674 6.943e-10 ***
## ar2  -0.383939   0.090317 -4.2510 2.128e-05 ***
## sma1 -0.504421   0.078610 -6.4168 1.392e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that the coefficients are significant.

Overfitting

Let us check for overfitting by increasing p and q by one, one at a time, and comparing the AIC and BIC values of the resulting models.

#Overfitting with 2 other models
(fit24 <- Arima(data2, order=c(3,1,0), seasonal=c(0,1,1),include.constant = FALSE,lambda=0))
## Series: data2 
## ARIMA(3,1,0)(0,1,1)[4] 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##           ar1     ar2      ar3     sma1
##       -0.5877  -0.428  -0.0764  -0.5306
## s.e.   0.0986   0.110   0.1103   0.0840
## 
## sigma^2 estimated as 0.0006371:  log likelihood=234.06
## AIC=-458.13   AICc=-457.51   BIC=-444.95
(fit25 <- Arima(data2, order=c(2,1,1), seasonal=c(0,1,1),include.constant = FALSE,lambda=0))
## Series: data2 
## ARIMA(2,1,1)(0,1,1)[4] 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##           ar1      ar2      ma1     sma1
##       -0.4285  -0.3328  -0.1546  -0.5214
## s.e.   0.2335   0.1319   0.2458   0.0791
## 
## sigma^2 estimated as 0.0006377:  log likelihood=234.01
## AIC=-458.03   AICc=-457.41   BIC=-444.85

The AIC and BIC values of these models are larger than those of our chosen model, so SARIMA(2,1,0)(0,1,1)[4] with a log transformation remains the preferred specification. We can also check whether the estimated coefficients of the overfitted models are significant.

#parameter estimation for the models considered for overfitting
coeftest(fit24)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ar1  -0.587722   0.098568 -5.9626 2.482e-09 ***
## ar2  -0.428020   0.109999 -3.8911 9.978e-05 ***
## ar3  -0.076391   0.110281 -0.6927    0.4885    
## sma1 -0.530636   0.083971 -6.3193 2.627e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fit25)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ar1  -0.428471   0.233460 -1.8353   0.06646 .  
## ar2  -0.332756   0.131904 -2.5227   0.01165 *  
## ma1  -0.154606   0.245766 -0.6291   0.52930    
## sma1 -0.521445   0.079106 -6.5917 4.347e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The AR(3) coefficient in model fit24 and the MA(1) coefficient in model fit25 are not significant. Also, the AR(1) and AR(2) coefficients do not change much across the models. So we conclude that our chosen model is appropriate.

Model Diagnostics

To be sure that the predictive model cannot be easily improved upon, we again check whether the residuals behave like white noise: approximately normally distributed with mean zero, constant variance, and no autocorrelation.

The final model is equivalent to SARIMA(2,1,0)(0,1,1)[4] with a log transformation.

#Final Model
(fit_final <- Arima(data2, order=c(2,1,0), seasonal=c(0,1,1),include.constant = FALSE,lambda=0))
## Series: data2 
## ARIMA(2,1,0)(0,1,1)[4] 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##           ar1      ar2     sma1
##       -0.5609  -0.3839  -0.5044
## s.e.   0.0909   0.0903   0.0786
## 
## sigma^2 estimated as 0.0006337:  log likelihood=233.82
## AIC=-459.65   AICc=-459.24   BIC=-449.11
#Checking residuals for the final model
checkresiduals(fit_final)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,0)(0,1,1)[4]
## Q* = 3.1974, df = 5, p-value = 0.6696
## 
## Model df: 3.   Total lags used: 8

The residuals behave like white noise: there are no significant spikes in their ACF, and the histogram suggests an approximately normal distribution. The Ljung-Box test confirms this.

Here the Ljung-Box test statistic is 3.20 with a p-value of 0.6696. The null hypothesis is that the residuals are independent (no autocorrelation). Since the p-value is greater than the 0.05 significance level, we cannot reject the null hypothesis, so the in-sample residuals can be treated as independent.
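The same test can be reproduced directly on the residuals with Box.test, using the same lag and fitted-degrees-of-freedom settings that checkresiduals reports above.

#Ljung-Box test run directly on the residuals (equivalent to the output above)
Box.test(residuals(fit_final), lag = 8, type = "Ljung-Box", fitdf = 3)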

Forecasting

We will forecast the data values for the year 1988.

#Forecasting
fit_final %>% forecast(h = 4)
##       Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 28 Q1       420.6565 407.3027 434.4482 400.4060 441.9312
## 28 Q2       525.5113 507.3184 544.3566 497.9440 554.6047
## 28 Q3       503.9273 485.5268 523.0251 476.0597 533.4261
## 28 Q4       632.7997 606.5933 660.1382 593.1628 675.0852
fit_final %>% forecast(h = 4) %>% autoplot