Data
The data are the monthly volume of commercial bank real-estate loans, in billions of dollars, from January 1973 to October 1978, a total of 70 observations. They are derived from reports to the Federal Reserve System by large commercial banks.
Libraries
library(forecast)
library(TSA)
library(tidyverse)
library(tseries)
library(lmtest)
Importing the Data
data <- read.csv("D:/UC BANA/Spring 2021/Forecasting Methods/HW4/Loans.csv")
data <- as.ts(data)
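Note that as.ts() leaves the series with frequency 1, so the forecast output later in this report is indexed by observation number rather than by calendar month. A sketch of how one could attach the calendar index instead (data_monthly is a hypothetical name, not used below; this assumes Loans.csv holds a single column of values):
#hypothetical alternative (not used below): attach the monthly calendar index
data_monthly <- ts(as.numeric(data), start = c(1973, 1), frequency = 12)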
Data Visualization
Time Series Plots
data %>% ggtsdisplay()
From the above plots, we observe that the time series exhibits non-constant variance, so we consider transforming the data.
Data Transformation
We will use the Box-Cox transformation method to find the transformation parameter.
#Box-Cox transformation on the data
bc <- BoxCox.ar(data)
bc$lambda[which.max(bc$loglike)]
## [1] -0.1
The estimated lambda is -0.1, which is close to 0, suggesting a log transformation; for comparison, we also try a square-root transformation (lambda = 0.5).
ld <- log(data)
sqd <- data^0.5
ld %>% ggtsdisplay()
sqd %>% ggtsdisplay()
Since the transformed series show no visible improvement over the original in the plots above, we do not transform the data.
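To support this decision more formally, one could inspect the interval estimate of lambda that BoxCox.ar() also returns (a sketch; these values are not shown in the original report, and the mle and ci components are assumed from the TSA documentation):
#hypothetical check on the Box-Cox fit
bc$mle #maximum likelihood estimate of lambda
bc$ci #confidence interval for lambda; if it contains 1, no transformation is needed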
Differencing
Since the sample ACF decays very slowly and the sample PACF has a large spike at lag 1, we difference the data. Stationarity was not attained with a single difference, so we difference twice.
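The single-difference check mentioned above is not shown in the report; a sketch of how one would run it (output omitted):
#hypothetical check: ADF test on the once-differenced series
data %>% diff() %>% adf.test()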
diff_data <- data %>% diff() %>% diff()
diff_data %>% adf.test()
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -3.9847, Lag order = 4, p-value = 0.01579
## alternative hypothesis: stationary
Time Series Plots of Differenced Data
diff_data %>% ggtsdisplay()
The series looks much more stationary after differencing twice, and the ADF test confirms this: the p-value of 0.0158 is below 0.05, so we reject the null hypothesis of non-stationarity and treat the differenced series as stationary.
Model Selection
From the ACF and PACF of the differenced data, we notice a large spike at lag 1 in the ACF and a cutoff after lag 1 in the PACF. The PACF pattern suggests an AR(1), while the ACF pattern is equally consistent with an MA(1). To get better clarity, we examine the EACF of the differenced series.
eacf(diff_data)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o o o o o o o o o o o o o
## 1 x o o o o o o o o o o o o o
## 2 x o o o o o o o o o o o o o
## 3 x o o o o o o o o o o o o o
## 4 x o o o o o o o o o o o o o
## 5 o o x x o o o o o o o o o o
## 6 o x x o x o o o o o o o o o
## 7 x o o o x o o o o o o o o o
In the EACF table, the upper-left vertex of the triangle of o's sits at AR = 0, MA = 1, pointing to an MA(1) for the doubly differenced series, i.e., an ARIMA(0,2,1) for the original data. We fit this and the neighbouring candidate models and compare their AIC and BIC values.
#Considering various models
(fit <- arima(data, order = c(0, 2, 1)))
##
## Call:
## arima(x = data, order = c(0, 2, 1))
##
## Coefficients:
## ma1
## -0.3787
## s.e. 0.1084
##
## sigma^2 estimated as 0.08192: log likelihood = -11.33, aic = 24.66
(fit2 <- arima(data, order = c(1, 2, 0)))
##
## Call:
## arima(x = data, order = c(1, 2, 0))
##
## Coefficients:
## ar1
## -0.3593
## s.e. 0.1140
##
## sigma^2 estimated as 0.08281: log likelihood = -11.68, aic = 25.37
(fit3 <- arima(data, order = c(1, 2, 1)))
##
## Call:
## arima(x = data, order = c(1, 2, 1))
##
## Coefficients:
## ar1 ma1
## -0.1444 -0.2577
## s.e. 0.2744 0.2616
##
## sigma^2 estimated as 0.08161: log likelihood = -11.21, aic = 26.41
BIC(fit)
## [1] 31.06567
BIC(fit2)
## [1] 31.77634
BIC(fit3)
## [1] 35.02589
Comparing the AIC and BIC values, ARIMA(0,2,1) has the lowest of both, so we select it as our model.
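As an optional cross-check (a sketch, not part of the original analysis), auto.arima() from the already-loaded forecast package can search the orders automatically with the differencing fixed at d = 2, and its selection can be compared against the manual choice:
#hypothetical cross-check: automatic order selection with d fixed at 2
auto.arima(data, d = 2, ic = "aic")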
Parameter Estimation
#parameter estimation for the considered model
coeftest(fit)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.3787 0.1084 -3.4935 0.0004767 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The MA(1) coefficient is statistically significant (z = -0.3787/0.1084 ≈ -3.49, p < 0.001).
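In backshift notation (using R's sign convention for MA terms), the fitted model is (1 - B)^2 Y_t = (1 - 0.3787 B) e_t, where B is the backshift operator and e_t is white noise.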
Overfitting
Let us check the model by overfitting: we increase p and q by 1 in turn and compare the AIC and BIC values of the resulting models.
#Overfitting with 2 other models
(fit4 <- arima(data, order = c(0, 2, 2)))
##
## Call:
## arima(x = data, order = c(0, 2, 2))
##
## Coefficients:
## ma1 ma2
## -0.4148 0.1014
## s.e. 0.1253 0.1529
##
## sigma^2 estimated as 0.08138: log likelihood = -11.12, aic = 26.23
(fit5 <- arima(data, order = c(1, 2, 1)))
##
## Call:
## arima(x = data, order = c(1, 2, 1))
##
## Coefficients:
## ar1 ma1
## -0.1444 -0.2577
## s.e. 0.2744 0.2616
##
## sigma^2 estimated as 0.08161: log likelihood = -11.21, aic = 26.41
BIC(fit4)
## [1] 34.84466
BIC(fit5)
## [1] 35.02589
The AIC and BIC values of these models are larger than those of our chosen model, so ARIMA(0,2,1) remains the best choice. We can also check whether the estimated coefficients of the overfitted models are significant.
#parameter estimation for the models considered for overfitting
coeftest(fit4)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.41480 0.12533 -3.3097 0.0009339 ***
## ma2 0.10141 0.15287 0.6633 0.5071086
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fit5)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.14436 0.27442 -0.5260 0.5989
## ma1 -0.25773 0.26158 -0.9853 0.3245
The MA(2) coefficient in fit4 and the AR(1) coefficient in fit5 are not significant, and the MA(1) estimates barely change across the models. We therefore conclude that ARIMA(0,2,1) is the correct specification.
Model Diagnostics
To be sure that the predictive model cannot be improved upon, we check whether the residuals behave like white noise: uncorrelated, with mean zero, constant variance, and an approximately normal distribution.
#Checking residuals for the final model
checkresiduals(fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,2,1)
## Q* = 9.2622, df = 9, p-value = 0.4134
##
## Model df: 1. Total lags used: 10
The residuals behave like white noise: no significant spikes appear in their ACF, and the histogram is consistent with a normal distribution. The Ljung-Box test confirms this. Its null hypothesis is that the residuals are uncorrelated; here the test statistic is 9.26 with a p-value of 0.4134. Since the p-value exceeds the 0.05 significance level, we cannot reject the null, so the in-sample residuals can be treated as independent.
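The normality claim above rests on the histogram; a formal check one could add (a sketch, not part of the original output):
#hypothetical supplementary checks on the residuals
qqnorm(residuals(fit)); qqline(residuals(fit)) #QQ plot against the normal distribution
shapiro.test(residuals(fit)) #Shapiro-Wilk test of normality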
Forecasting
We forecast the series for the next 24 months (2 years).
#Forecasting
fit1_f <- Arima(data, order = c(0, 2, 1))
fit1_f %>% forecast(h = 24)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 70 89.26624 88.89653 89.63595 88.70082 89.83166
## 71 90.93248 90.22822 91.63674 89.85541 92.00955
## 72 92.59872 91.51088 93.68657 90.93501 94.26244
## 73 94.26497 92.74691 95.78302 91.94330 96.58663
## 74 95.93121 93.94003 97.92238 92.88597 98.97644
## 75 97.59745 95.09350 100.10139 93.76799 101.42690
## 76 99.26369 96.21001 102.31737 94.59349 103.93389
## 77 100.92993 97.29181 104.56805 95.36590 106.49396
## 78 102.59617 98.34080 106.85155 96.08814 109.10420
## 79 104.26241 99.35861 109.16622 96.76269 111.76213
## 80 105.92865 100.34666 111.51065 97.39173 114.46558
## 81 107.59490 101.30618 113.88361 97.97713 117.21266
## 82 109.26114 102.23827 116.28400 98.52059 120.00168
## 83 110.92738 103.14392 118.71084 99.02361 122.83115
## 84 112.59362 104.02401 121.16323 99.48753 125.69971
## 85 114.25986 104.87934 123.64038 99.91359 128.60613
## 86 115.92610 105.71064 126.14157 100.30290 131.54930
## 87 117.59234 106.51858 128.66610 100.65649 134.52820
## 88 119.25858 107.30379 131.21338 100.97530 137.54187
## 89 120.92483 108.06682 133.78283 101.26020 140.58945
## 90 122.59107 108.80822 136.37392 101.51202 143.67012
## 91 124.25731 109.52847 138.98615 101.73149 146.78313
## 92 125.92355 110.22803 141.61907 101.91933 149.92777
## 93 127.58979 110.90735 144.27223 102.07620 153.10338
fit1_f %>% forecast(h = 24) %>% autoplot
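One could also report in-sample accuracy measures for the fitted model (a sketch, not part of the original report):
#hypothetical: in-sample accuracy of the ARIMA(0,2,1) fit
accuracy(fit1_f)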
Data
The data set contains the quarterly disposable income in Japan from 1961 through 1987.
Importing the Data
data2 <- ts(scan("D:/UC BANA/Spring 2021/Forecasting Methods/HW4/Disposable_Income.csv"), frequency = 4)
Data Visualization
Time Series Plots
data2 %>% ggtsdisplay()
From the above plots, we observe that the variance of this series increases over time, so we transform the data.
Data Transformation
We will use the Box-Cox transformation method to find the transformation parameter.
#Box-Cox transformation on the data
(lambda <- BoxCox.lambda(data2))
## [1] 0.2027691
The estimated lambda is approximately 0.2, which is close to 0, so we consider a log transformation.
data2 %>% log() %>% ggtsdisplay()
The variance is now stabilized, but the ADF test suggests that the log series is still non-stationary, so we take a first-order difference.
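That ADF test is not reproduced in the report; a sketch of the check (output omitted):
#hypothetical check: ADF test on the log series before differencing
data2 %>% log() %>% adf.test()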
Differencing
data2 %>% log() %>% diff() %>% ggtsdisplay()
data2 %>% log() %>% diff() %>% adf.test()
## Warning in adf.test(.): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -4.6074, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
The ADF test now suggests that the series is stationary.
However, since the ACF at the seasonal lags 4, 8, 12, and 16 still decays slowly, we also take a seasonal first-order difference.
data2 %>% log() %>% diff() %>% diff(lag = 4) %>% ggtsdisplay()
data2 %>% log() %>% diff() %>% diff(lag = 4) %>% adf.test()
## Warning in adf.test(.): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -6.6668, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
The differenced series is now stationary, as the ADF test confirms.
Model Selection
From the ACF and PACF of the transformed and differenced data, we can observe characteristics of AR(2), MA(3), or ARMA(2,3) behaviour in the non-seasonal part. The spike at lag 4 suggests a seasonal MA(1). We therefore fit these candidate models and choose the best among them.
#MA(3) Model
#(note: unlike fit22 and fit23 below, this model is fit to the raw series,
#so its AIC/BIC are on a different scale and not directly comparable)
(fit21 <- data2 %>%
Arima(order = c(0, 0, 3), seasonal = c(0, 0, 1), include.constant = FALSE))
## Series: .
## ARIMA(0,0,3)(0,0,1)[4] with zero mean
##
## Coefficients:
## ma1 ma2 ma3 sma1
## 1.2391 1.2875 0.4146 1.0000
## s.e. 0.0658 0.0633 0.0611 0.0657
##
## sigma^2 estimated as 3959: log likelihood=-607.21
## AIC=1224.42 AICc=1225.01 BIC=1237.83
#AR(2) Model
(fit22 <- data2 %>% log() %>% diff() %>% diff(lag = 4) %>%
Arima(order = c(2, 0, 0), seasonal = c(0, 0, 1), include.constant = FALSE))
## Series: .
## ARIMA(2,0,0)(0,0,1)[4] with zero mean
##
## Coefficients:
## ar1 ar2 sma1
## -0.5609 -0.3839 -0.5044
## s.e. 0.0909 0.0903 0.0786
##
## sigma^2 estimated as 0.0006325: log likelihood=233.82
## AIC=-459.64 AICc=-459.24 BIC=-449.11
#ARMA(2,3) Model
(fit23 <- data2 %>% log() %>% diff() %>% diff(lag = 4) %>%
Arima(order = c(2, 0, 3), seasonal = c(0, 0, 1), include.constant = FALSE))
## Series: .
## ARIMA(2,0,3)(0,0,1)[4] with zero mean
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 sma1
## -0.7217 -0.9157 0.1397 0.4451 -0.4413 -0.4796
## s.e. 0.1044 0.1005 0.1547 0.1424 0.1334 0.0858
##
## sigma^2 estimated as 0.0006341: log likelihood=235.16
## AIC=-456.31 AICc=-455.13 BIC=-437.87
Among the models fit to the transformed series, fit22 has the lowest AIC and BIC values, so we select it (fit21 was fit to the raw series, so its information criteria are not directly comparable). The model is a SARIMA(2,1,0)(0,1,1)[4] applied to the log of the original series.
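As an optional cross-check (a sketch, not part of the original analysis), auto.arima() can repeat the search automatically with the log transformation and both differencing orders fixed:
#hypothetical cross-check: automatic seasonal order selection
auto.arima(data2, d = 1, D = 1, lambda = 0)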
Parameter Estimation
#parameter estimation for the considered model
coeftest(fit22)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.560918 0.090949 -6.1674 6.943e-10 ***
## ar2 -0.383939 0.090317 -4.2510 2.128e-05 ***
## sma1 -0.504421 0.078610 -6.4168 1.392e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All three coefficients are highly significant.
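Combining the differencing with the fitted coefficients, the complete model in backshift notation (using R's sign conventions) is (1 + 0.5609 B + 0.3839 B^2)(1 - B)(1 - B^4) log(Y_t) = (1 - 0.5044 B^4) e_t, where B is the backshift operator and e_t is white noise.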
Overfitting
Let us check the model by overfitting: we increase p and q by 1 in turn and compare the AIC and BIC values of the resulting models.
#Overfitting with 2 other models
(fit24 <- Arima(data2, order = c(3, 1, 0), seasonal = c(0, 1, 1), include.constant = FALSE, lambda = 0))
## Series: data2
## ARIMA(3,1,0)(0,1,1)[4]
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 ar3 sma1
## -0.5877 -0.428 -0.0764 -0.5306
## s.e. 0.0986 0.110 0.1103 0.0840
##
## sigma^2 estimated as 0.0006371: log likelihood=234.06
## AIC=-458.13 AICc=-457.51 BIC=-444.95
(fit25 <- Arima(data2, order = c(2, 1, 1), seasonal = c(0, 1, 1), include.constant = FALSE, lambda = 0))
## Series: data2
## ARIMA(2,1,1)(0,1,1)[4]
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 ma1 sma1
## -0.4285 -0.3328 -0.1546 -0.5214
## s.e. 0.2335 0.1319 0.2458 0.0791
##
## sigma^2 estimated as 0.0006377: log likelihood=234.01
## AIC=-458.03 AICc=-457.41 BIC=-444.85
The AIC and BIC values of these models are larger than those of our chosen model, so SARIMA(2,1,0)(0,1,1)[4] with a log transformation remains the best choice. We can also check whether the estimated coefficients of the overfitted models are significant.
#parameter estimation for the models considered for overfitting
coeftest(fit24)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.587722 0.098568 -5.9626 2.482e-09 ***
## ar2 -0.428020 0.109999 -3.8911 9.978e-05 ***
## ar3 -0.076391 0.110281 -0.6927 0.4885
## sma1 -0.530636 0.083971 -6.3193 2.627e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fit25)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.428471 0.233460 -1.8353 0.06646 .
## ar2 -0.332756 0.131904 -2.5227 0.01165 *
## ma1 -0.154606 0.245766 -0.6291 0.52930
## sma1 -0.521445 0.079106 -6.5917 4.347e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The AR(3) coefficient in fit24 and the MA(1) coefficient in fit25 are not significant, and the AR(1) and AR(2) estimates barely change across the models. We therefore conclude that our chosen model is correct.
Model Diagnostics
To be sure that the predictive model cannot be improved upon, we check whether the residuals behave like white noise: uncorrelated, with mean zero, constant variance, and an approximately normal distribution.
We refit the final model directly on the original series with lambda = 0, which is equivalent to fit22: a SARIMA(2,1,0)(0,1,1)[4] with a log transformation. Note that the estimated coefficients below match those of fit22.
#Final Model
(fit_final <- Arima(data2, order = c(2, 1, 0), seasonal = c(0, 1, 1), include.constant = FALSE, lambda = 0))
## Series: data2
## ARIMA(2,1,0)(0,1,1)[4]
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 sma1
## -0.5609 -0.3839 -0.5044
## s.e. 0.0909 0.0903 0.0786
##
## sigma^2 estimated as 0.0006337: log likelihood=233.82
## AIC=-459.65 AICc=-459.24 BIC=-449.11
#Checking residuals for the final model
checkresiduals(fit_final)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,0)(0,1,1)[4]
## Q* = 3.1974, df = 5, p-value = 0.6696
##
## Model df: 3. Total lags used: 8
The residuals behave like white noise: no significant spikes appear in their ACF, and the histogram is consistent with a normal distribution. The Ljung-Box test confirms this.
Its null hypothesis is that the residuals are uncorrelated; here the test statistic is 3.20 with a p-value of 0.6696. Since the p-value exceeds the 0.05 significance level, we cannot reject the null, so the in-sample residuals can be treated as independent.
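As with the first series, the normality claim could be backed by a formal test (a sketch, not part of the original output):
#hypothetical supplementary normality check for the final model
shapiro.test(residuals(fit_final))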
Forecasting
We forecast the four quarters of 1988. Because the model was fit with lambda = 0, forecast() automatically back-transforms the point forecasts and prediction intervals to the original scale.
#Forecasting
fit_final %>% forecast(h = 4)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 28 Q1 420.6565 407.3027 434.4482 400.4060 441.9312
## 28 Q2 525.5113 507.3184 544.3566 497.9440 554.6047
## 28 Q3 503.9273 485.5268 523.0251 476.0597 533.4261
## 28 Q4 632.7997 606.5933 660.1382 593.1628 675.0852
fit_final %>% forecast(h = 4) %>% autoplot