—————————————————————————————

The series in the Loan.csv is the monthly volume of commercial bank real-estate loans, in billions of dollars, from January 1973 to October 1978, a total of 70 observations. The data are derived from reports to the Federal Reserve System from large commercial banks. For such a data set, check for stationarity, build an ARIMA model, perform data transformation/difference (if needed), model identification, model selection, diagnostic checking, parameter estimation, and forecast the next two years.

Data Import and Plot

data <- as.ts(read.csv('Loans.csv'))
data %>% ggtsdisplay(lag.max = 40)

  • Loans data is imported and following observations were made:
    • Time series plot shows that the data has a trend
    • ACF is gradually reducing which is a good sign of the presence of non-constant variance.
    • PACF has high spike at Lag 1.
    • Non-stationarity in the mean cannot be inferred at this step.

Checking for Seasonality before BoxCox and ADF tests

data %>% diff() %>% acf()

  • The data has no trace of seasonality as there are no significant lags in any equal intervals of time. Thus, Boxcox and adf test can be carried out.

BoxCox Transformation for Non-Constant variance

bcx <- BoxCox.ar(data)

  • Here, the maximum likelihood estimate of the power transformation value is near to zero. In order to know how close, below equation is used.
bcx$lambda[which.max(bcx$loglike)]
## [1] -0.1
  • Thus the mle here is -0.1.
  • Further, the transformation for -0.5 is \(1/\sqrt(X_t)\) and 0 is \(log(X_t)\). Since -0.1 is in between both, both the transformations are made and checked for a better fit.
par(mfrow=c(1,2))
plot(1/(data^0.5))
plot(log(data))

  • Here, the first transformation is chosen - ie) \(1/\sqrt(X_t)\). As we can see the variance in first transformation is less than the second one.
data_Xn <- (1/(data^0.5))
data_Xn %>% ggtsdisplay(lag.max = 40)

  • Here, the variance seems to be reduced. However, there may be presence of non-stationarity in the mean. ADF test is used to check the same.

ADF test for non-stationarity in the mean

data_Xn %>% adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -1.1827, Lag order = 4, p-value = 0.9028
## alternative hypothesis: stationary
  • ADF test:
    • Null hypothesis : The series is non stationary (p-value > 0.05)
    • Alternative hypothesis : The series is stationary (p-value < 0.05)
  • Here, the p-value is 0.9028 (> 0.05). Thus, the series is still non-stationary and needs differencing for further analysis.

Data Differencing for Stationarity

data_Xn %>% diff() %>% adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -2.405, Lag order = 4, p-value = 0.4106
## alternative hypothesis: stationary
data_Xn %>% diff() %>% diff() %>% adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -3.9737, Lag order = 4, p-value = 0.01628
## alternative hypothesis: stationary
  • The first diff() gave non-stationary data again with p-value 0.4106.
  • The second diff() gave stationary series with p-value 0.01628

Checking ACF and PACF plots for theoretical model assumption

data_Xn %>% diff() %>% diff() %>% ggtsdisplay(lag.max = 40)

  • The data has high spike at lag 1 in both ACF and PACF. Thus, ARIMA(1,2,1) can be considered. To further strengthen the assumption, eacf() plot is used.

EACF plot for model assumptions

data_Xn %>% diff() %>% diff() %>% eacf()
## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o o o o o o o o o o  o  o  o 
## 1 x o o o o o o o o o o  o  o  o 
## 2 x o o o o o o o o o o  o  o  o 
## 3 x o o o o o o o o o o  o  o  o 
## 4 o o o o o o o o o o o  o  o  o 
## 5 o o x x o o o o o o o  o  o  o 
## 6 x x o o x o o o o o o  o  o  o 
## 7 x x o o x o o o o o o  o  o  o

EACF Analysis:

  • Null hypothesis: No Autocorrelation in the residual after we fit the model
  • Alternative hypothesis: Autocorrelation at some lag
  • o –> Depicts that we fail to reject null (preferred)
  • x –> Depicts we reject null (which means there is some autocorrelation which is not preferred)

Further, ARIMA(0,2,1), ARIMA(0,2,2) and ARIMA(1,2,1) are considered and checked for better coefficient values.

AIC checks for different potential models

(fit1 <- arima(data_Xn, order = c(0,2,1)))
## 
## Call:
## arima(x = data_Xn, order = c(0, 2, 1))
## 
## Coefficients:
##           ma1
##       -0.3931
## s.e.   0.1104
## 
## sigma^2 estimated as 8.189e-08:  log likelihood = 451.5,  aic = -901
(fit2 <- arima(data_Xn, order = c(0,2,2)))
## 
## Call:
## arima(x = data_Xn, order = c(0, 2, 2))
## 
## Coefficients:
##           ma1     ma2
##       -0.4241  0.1049
## s.e.   0.1234  0.1508
## 
## sigma^2 estimated as 8.131e-08:  log likelihood = 451.73,  aic = -899.46
(fit3 <- arima(data_Xn, order = c(1,2,1)))
## 
## Call:
## arima(x = data_Xn, order = c(1, 2, 1))
## 
## Coefficients:
##           ar1      ma1
##       -0.1701  -0.2476
## s.e.   0.2781   0.2700
## 
## sigma^2 estimated as 8.149e-08:  log likelihood = 451.66,  aic = -899.31
  • Here, ARIMA(0,2,1) model has the least aic value of -901. Thus, this model is checked for overfitting.

Overfit check for ARIMA(0,2,2) and ARIMA(1,2,1)

coeftest(fit2)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ma1 -0.42408    0.12341 -3.4364 0.0005895 ***
## ma2  0.10494    0.15082  0.6958 0.4865634    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • For ARIMA(0,2,2):
    • It is seen that p value for MA2 is greater than 0.05 and null hypothesis cannot be rejected here which implies theta2 is not significantly different than zero.
    • Additionally, MA1 coefficient has not differed much from its original -0.3931 value. Thus, this model further strengthens that there is no overfit.
coeftest(fit3)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ar1 -0.17015    0.27807 -0.6119   0.5406
## ma1 -0.24755    0.27001 -0.9168   0.3592
  • For ARIMA(1,2,1)
    • It is seen that p value for AR1 is greater than 0.05 and null hypothesis cannot be rejected here which implies phi1 is not significantly different than zero.
    • Additionally, MA1 coefficient has differed much from its original -0.3931 value. Overfit might exist here.

Residual Diagnostics

checkresiduals(fit1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,2,1)
## Q* = 7.5676, df = 9, p-value = 0.5782
## 
## Model df: 1.   Total lags used: 10
  • It can be seen that residual plot almost has constant mean and variance.
  • The acf and pacf plots resemble that of white noise.
  • Ljung-Box test:
    • Null hypothesis: There is no auto-correlation between residuals (p > 0.05)
    • Alternative hypothesis: There is auto-correlation between residuals (p < 0.05)
    • Additionally, the p-value from Ljung-Box is greater than 0.05 which implies null hypothesis cannot be rejected here. Also, this is in sync with the acf and pacf plots which shows that the residuals behave like white noise and have no correlation.

Final Model

ARIMA(0,2,1) is chosen further for forecast

Forecast of Transformed Data

data_Xn %>%
  Arima(order=c(0,2,1)) %>%
  forecast(h = 24) %>%
  autoplot() +
    ylab("Monthly volume of commercial bank real-estate loans") + xlab("Year")

Above is the forecast for transformed data of Monthly volume of commercial bank real-estate loans for Nov 1978 to Oct 1980

data %>%
  Arima(order=c(0,2,1), lambda = -0.1) %>%
  forecast(h = 24) %>%
  autoplot() +
    ylab("Monthly volume of commercial bank real-estate loans") + xlab("Year")

Conclusion

Above is the forecast for original data of Monthly volume of commercial bank real-estate loans for Nov 1978 to Oct 1980. The diverging of confidence interval is due to differencing of order 2.

—————————————————————————————

The data set Disposable_Income.csv contains (read across) the quarterly disposable income in Japan during the period of 1961 through 1987. Fit an appropriate (seasonal) ARIMA model to the disposable income series and forecast the disposable income in 1988.

Data Import and Plot

inc <- as.ts(read.csv('Disposable_Income.csv'))
inc %>% ggtsdisplay(lag.max = 40)

  • Income data is imported and following observations are made:
    • Time Series plot shows both Seasonality and Trend.
    • ACF is gradually reducing which is a good sign of the presence of non-constant variance.
    • No model assumptions can be made from ACF and PACF plots.
    • Non-stationarity in the mean cannot be inferred at this step.

Checking for Seasonality Lags

inc %>% diff() %>% acf()

  • Plotting ACF on the original data confirms seasonality at lag 4. Thus, data provided may be considered as a quarterly data.

Further, variance is taken care before seasonality to avoid any inclusion of negative values.

Confirmation of Non-constant variance

inc %>% diff(lag = 4) %>%  ggtsdisplay(lag.max = 40)

  • The plot shows significant evidence of non-constant variance. Thus, variance needs to be taken care before seasonality normalization.

Log Transformation for Non-Constant variance

inc %>% log() %>% ggtsdisplay(lag.max = 40)

  • Considering the mle as 0, log transformation is taken. Reduction in the variance is pretty evident here.

Seasonality Check

inc %>% log() %>% diff(lag = 4) %>%  ggtsdisplay(lag.max = 40)

  • The above plot shows data where both variance and seasonality is taken care of. Before assuming any theoretical models, the series needs to be stationary in the mean.

However, the PACF plot shows potential AR(3) related model. Hence ur.df test is used instead of adf test for stationarity check.

Test for non-stationarity in the mean

(lagdf = (nrow(data)-1)^(1/3))
## [1] 4.081655
inc %>% log() %>% diff(lag = 4) %>% 
  ur.df(type ="none", lags = floor(lagdf)) %>% summary()
## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression none 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.117279 -0.010617  0.001973  0.013726  0.067756 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## z.lag.1     -0.05398    0.04298  -1.256   0.2123    
## z.diff.lag1 -0.46735    0.10415  -4.487 2.06e-05 ***
## z.diff.lag2 -0.28731    0.11468  -2.505   0.0140 *  
## z.diff.lag3  0.07082    0.11315   0.626   0.5329    
## z.diff.lag4 -0.18987    0.09983  -1.902   0.0603 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02766 on 93 degrees of freedom
## Multiple R-squared:  0.356,  Adjusted R-squared:  0.3214 
## F-statistic: 10.28 on 5 and 93 DF,  p-value: 7.22e-08
## 
## 
## Value of test-statistic is: -1.256 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau1 -2.58 -1.95 -1.62
  • ADF test:
    • Null hypothesis : The series is non stationary (test-statistic > 5 pct)
    • Alternative hypothesis : The series is stationary (test-statistic < 5 pct)

Here, the test statisctic value -1.256 is greater than -1.95. Thus, the null hypothesis cannot be rejected here and the series is found be non-stationary

Differencing and Checking ACF and PACF plots for theoretical model assumption

inc %>% log() %>% diff(lag = 4) %>%  diff() %>%  ggtsdisplay(lag.max = 40)

Different Model Proposals

  • Based on the above PACF:
    • For seasonality, lag 4 is hardly significant. Hence various models were tested with seasonality ARIMA(0,1,0). Few were checked with (1,1,0).
    • Non-seasonality part cuts off at lag 8.
  • Few of the models tried that resulted in perfect residual diagnostics were as follows:
fit Models AIC values
1 SARIMA(7,1,5)(0,1,0)[4] -439.82
2 SARIMA(6,1,5)(0,1,0)[4] -442.53
3 SARIMA(8,1,5)(0,1,0)[4] -443.27

SARIMA(8,1,5)(0,1,0)[4]

(fit_inc <- Arima(inc, order=c(8,1,5), seasonal=c(0,1,0),lambda=0 , include.constant = FALSE))
## Series: inc 
## ARIMA(8,1,5) 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##          ar1      ar2     ar3     ar4      ar5     ar6      ar7     ar8
##       0.7636  -0.1935  0.5094  0.6696  -0.7659  0.1910  -0.5026  0.3231
## s.e.  0.1598   0.1695  0.2255  0.2123   0.1627  0.1735   0.2287  0.2129
##           ma1     ma2      ma3     ma4     ma5
##       -1.5023  0.7487  -0.5622  0.1170  0.2772
## s.e.   0.1748  0.2929   0.3452  0.4501  0.2354
## 
## sigma^2 estimated as 0.0006411:  log likelihood=235.64
## AIC=-443.27   AICc=-438.66   BIC=-405.99
checkresiduals(fit_inc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(8,1,5)
## Q* = 5.8696, df = 3, p-value = 0.1181
## 
## Model df: 13.   Total lags used: 16
  • It can be seen that residual plot almost has constant mean and variance.
  • The acf and pacf plots resemble that of white noise.
  • Ljung-Box test:
    • Null hypothesis: There is no auto-correlation between residuals (p > 0.05)
    • Alternative hypothesis: There is auto-correlation between residuals (p < 0.05)
    • Additionally, the p-value from Ljung-Box is greater than 0.05 which implies null hypothesis cannot be rejected here. Also, this is in sync with the acf and pacf plots which shows that the residuals behave like white noise and have no correlation.

Final Model

SARIMA(8,1,5)(0,1,0)[4] is chosen further for forecast

Forecasting

fit_inc %>% forecast(h=4) %>% autoplot() +
ylab("Quarterly disposable income in Japan 1961-1988 ") + xlab("Year")

coeftest(fit_inc)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ar1  0.76357    0.15984  4.7772 1.778e-06 ***
## ar2 -0.19351    0.16951 -1.1416  0.253604    
## ar3  0.50936    0.22549  2.2589  0.023891 *  
## ar4  0.66956    0.21233  3.1534  0.001614 ** 
## ar5 -0.76594    0.16268 -4.7084 2.497e-06 ***
## ar6  0.19096    0.17352  1.1005  0.271094    
## ar7 -0.50264    0.22868 -2.1980  0.027950 *  
## ar8  0.32311    0.21286  1.5179  0.129030    
## ma1 -1.50235    0.17480 -8.5947 < 2.2e-16 ***
## ma2  0.74875    0.29291  2.5562  0.010582 *  
## ma3 -0.56221    0.34524 -1.6284  0.103431    
## ma4  0.11702    0.45012  0.2600  0.794887    
## ma5  0.27719    0.23540  1.1775  0.238979    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the coefficients of the model,

  • ar - 1,3,4,5,7 were significant.
  • ma - 1,2 lags were significant.
  • Seasonality difference - 1
  • Non-Seasonality difference - 1

——————————————————————————————