Download the delinquency rate and real GDP data from Quercus and put them into your working directory.
Calculate the changes of real GDP using the following R code.
library(forecast)  # for auto.arima(), forecast(), accuracy()
# Quarterly delinquency rate series (DRCLACBS), ending 2020 Q2
CL = ts(read.csv("DRCLACBS.csv")[,2], frequency = 4, end = c(2020,2))
dat = read.csv("GDPC1.csv")
# Quarterly changes of real GDP (first differences of GDPC1, scaled by 100)
RGDP = ts(diff(dat[,2])/100, frequency = 4, end = c(2020,3))
Your analyses should include:
Conduct a prewhitening analysis to identify the lead-lag relationship between the changes of real GDP and the delinquency rates;
source("preWhitening.R")
#' @prewhiten x and y
#'
rgdp = RGDP; cl = CL   # lower-case aliases used in the prewhitening code below
mod.arma <- auto.arima(rgdp, max.p = 52, max.q = 52, stationary = TRUE)
p = mod.arma$arma[1]; q = mod.arma$arma[2]; npq = p + q
plot(mod.arma)
## Series: rgdp
## ARIMA(2,0,0) with non-zero mean
##
## Coefficients:
## ar1 ar2 mean
## 0.2857 0.1926 0.8235
## s.e. 0.0865 0.0864 0.1200
##
## sigma^2 estimated as 0.5204: log likelihood=-137.32
## AIC=282.63 AICc=282.96 BIC=294.01
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0008729149 0.7128208 0.5250615 -3.771976 97.28674 0.6819326
## ACF1
## Training set 0.0003294011
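The analysis below refers to an LBTest plot for this AR(2) model, but the corresponding call is not shown above. A minimal sketch, assuming the LBTest helper sourced from preWhitening.R takes the residual series, the number of fitted ARMA coefficients (nPQ), and an ifPlot flag (as in its use further below), would be:
#' @check prewhitening model residuals (sketch; not in the original)
#'
# Ljung-Box p-values of the AR(2) residuals by lag; npq = p + q was computed above
LBTest(mod.arma$residuals, nPQ = npq, ifPlot = TRUE)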
#Analysis: We use the auto.arima function from the forecast package to fit the changes of real GDP (rgdp), and then use the fitted \(x_t\) model to prewhiten both x and y as shown above. An AR(2) model with non-zero mean is selected by AIC. The inverse AR roots all lie inside the unit circle, so the fitted model is causal (and, having no MA part, trivially invertible). In the LBTest plot, all Ljung-Box p-values lie above the red 5% line, so the residuals show no significant autocorrelation and the prewhitening model is adequate.
#' @ccf plot prewhiten x and y
#'
mod = mod.arma; nAR = mod$arma[1]; nMA = mod$arma[2]
if(nMA != 0){
  xf = PreWhiten.arma(rgdp, ar = mod$coef[1:nAR],
                      ma = mod$coef[(1:nMA) + nAR])[-(1:nAR)]
  yf = PreWhiten.arma(cl,   ar = mod$coef[1:nAR],
                      ma = mod$coef[(1:nMA) + nAR])[-(1:nAR)]
} else {
  xf = PreWhiten.arma(rgdp, ar = mod$coef[1:nAR], ma = 0)[-(1:nAR)]
  yf = PreWhiten.arma(cl,   ar = mod$coef[1:nAR], ma = 0)[-(1:nAR)]
}
par(cex=0.75)
ccf(c(xf), c(yf), lwd=4, ylab="Cross-correlation functions",
main="CCF of prewhitened change of real GDP and delinquency rates")
abline(v=0, col="gold", lwd=1, lty="dashed")
text(-0.5, -0.2, "0", col=2)
text(-1.5, -0.2, "-1", col=2)
text(-2.5, -0.2, "-2", col=2)
text(-3.5, -0.2, "-3", col=2)
text(-4.5, -0.2, "-4", col=2)
text(-5.5, -0.2, "-5", col=2)
Conclusion: As indicated in the cross-correlation plot above, the cross-correlations are significant at lags 0 through \(-5\), so we include \(rgdp_t\), \(rgdp_{t-1}\), \(rgdp_{t-2}\), \(rgdp_{t-3}\), \(rgdp_{t-4}\), and \(rgdp_{t-5}\) in our transfer function noise model.
Fit a regression model using the lags of \(x_t\) selected in the prewhitening step, i.e. \[y_t = \sum_i v_i x_{t-i} +\xi_t,~~~(1)\] where \(y_t\) and \(x_t\) denote the output and input process, respectively, and \(\xi_t\) is the noise process. (Hint: Use prewhitening to select the lagged \(\{x_{t-i}\}\) in the regression.)
#' @fit Equation (1)
#'
y <- CL
x <- RGDP
dat <- ts.intersect(y, x, stats::lag(x, -1), stats::lag(x, -2), stats::lag(x, -3), stats::lag(x, -4), stats::lag(x, -5))
colnames(dat) <- c("CL", "RGDP", "RGDP1", "RGDP2", "RGDP3", "RGDP4", "RGDP5")
dat.obs = window(dat, start = c(1987,1), end = c(2018,3))
mod.lm = lm(CL ~ RGDP + RGDP1 + RGDP2 + RGDP3 + RGDP4 + RGDP5, data = dat.obs)
summary(mod.lm)
##
## Call:
## lm(formula = CL ~ RGDP + RGDP1 + RGDP2 + RGDP3 + RGDP4 + RGDP5,
## data = dat.obs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18072 -0.25376 0.06047 0.33032 0.97158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.86949 0.09895 39.106 <2e-16 ***
## RGDP -0.12977 0.06896 -1.882 0.0623 .
## RGDP1 -0.13089 0.07163 -1.827 0.0701 .
## RGDP2 -0.11823 0.07290 -1.622 0.1074
## RGDP3 -0.10950 0.07299 -1.500 0.1362
## RGDP4 -0.13460 0.07185 -1.873 0.0635 .
## RGDP5 -0.16141 0.06939 -2.326 0.0217 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5531 on 120 degrees of freedom
## Multiple R-squared: 0.3156, Adjusted R-squared: 0.2814
## F-statistic: 9.224 on 6 and 120 DF, p-value: 2.651e-08
#' @plot residual ACF and PACF of the above regression
#'
par(mfrow=c(1,2))
acf(mod.lm$residuals)
pacf(mod.lm$residuals)
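Equation (2) is not reproduced in this excerpt; given the model fitted below, it is presumably the regression (1) with the white-noise error replaced by an ARMA noise process, i.e. something of the form \[y_t = \sum_i v_i x_{t-i} + n_t, \qquad \phi(B)\,n_t = \theta(B)\,\xi_t,~~~(2)\] where \(\xi_t\) is white noise.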
#' @fit Equation (2) and show the fitted model
#'
mod.tfn = auto.arima(dat.obs[,1], xreg = dat.obs[,-1], stationary = TRUE)
summary(mod.tfn)
## Series: dat.obs[, 1]
## Regression with ARIMA(2,0,2)(2,0,0)[4] errors
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sar2 intercept
## 1.8881 -0.9023 -0.5593 0.0438 -0.1497 -0.3804 3.4561
## s.e. 0.0550 0.0548 0.1086 0.0937 0.0895 0.0885 0.1725
## RGDP RGDP1 RGDP2 RGDP3 RGDP4 RGDP5
## -0.0517 -0.0570 -0.0496 -0.0315 -0.0472 -0.0435
## s.e. 0.0101 0.0121 0.0129 0.0130 0.0124 0.0101
##
## sigma^2 estimated as 0.007977: log likelihood=130.67
## AIC=-233.33 AICc=-229.58 BIC=-193.51
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set 8.68984e-06 0.0846206 0.06370182 -0.1146379 2.019421
## MASE ACF1
## Training set 0.2058136 0.00876814
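The rounded coefficient table below is presumably produced by a separate call that is not shown in the original, e.g.:
round(coef(mod.tfn), 2)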
## ar1 ar2 ma1 ma2 sar1 sar2 intercept
## 1.89 -0.90 -0.56 0.04 -0.15 -0.38 3.46
## RGDP RGDP1 RGDP2 RGDP3 RGDP4 RGDP5
## -0.05 -0.06 -0.05 -0.03 -0.05 -0.04
# My answer:
Mathematically, the fitted model is a linear regression with ARMA errors:
\[y_t = 3.46 - 0.05\,rgdp_t - 0.06\,rgdp_{t-1} - 0.05\,rgdp_{t-2} - 0.03\,rgdp_{t-3} - 0.05\,rgdp_{t-4} - 0.04\,rgdp_{t-5} + n_t,\]
where the noise \(n_t\) follows the fitted ARIMA(2,0,2)(2,0,0)[4] model
\[(1-1.89B+0.90B^2)(1+0.15B^4+0.38B^8)\,n_t = (1-0.56B+0.04B^2)\,\xi_t,\]
with \(\xi_t\) white noise. Using compact notation, the fitted model becomes
\[y_t = 3.46 + (-0.05 - 0.06B - 0.05B^2 - 0.03B^3 - 0.05B^4 - 0.04B^5)\,rgdp_t + \frac{1-0.56B+0.04B^2}{(1-1.89B+0.90B^2)(1+0.15B^4+0.38B^8)}\,\xi_t.\]
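The residual portmanteau test and cross-correlation check that the analysis below refers to are not shown above. A minimal sketch, reusing the LBTest helper from preWhitening.R and mirroring the linear-regression checks further below, might look like this:
#' @check model adequacy of the TFN model (sketch; not in the original)
#'
npq.tfn = sum(mod.tfn$arma[1:4])   # p + q + P + Q of the fitted ARMA error model
par(mfrow = c(1,2))
# Ljung-Box p-values of the TFN residuals by lag
LBTest(mod.tfn$residuals, nPQ = npq.tfn, ifPlot = TRUE)
# align the prewhitened input with the TFN residuals on their common sample, then check the CCF
both = ts.intersect(mod.arma$residuals, mod.tfn$residuals)
ccf(c(both[,1]), c(both[,2]), lag.max = 40,
    main = "CCF: prewhitened input vs TFN residuals")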
#Analysis: In the residual portmanteau test, all p-values are above the 5% significance level, so the noise part of the TFN model is adequate. The cross-correlation check, however, is not passed: some cross-correlations between the prewhitened input and the TFN residuals are significant, which suggests remaining correlation between the input series and the noise term.
#Linear Regression Model adequacy check
#' @check model adequacy of linear regression serial correlation
#'
m = 40
lags = 1:m
df <- (0+6+1):m
n = length(mod.lm$res)
rccf = ccf(as.numeric(mod$residuals), mod.lm$residuals, plot = FALSE, lag.max = m)$acf[-(1:m)]
Qm = n * (n + 2) * cumsum((rccf^2)/(n - (0:m)))[df]
pv <- 1 - pchisq(Qm, df)
a = cbind(df, Qm, pv)
#' @check model adequacy of regression cross-correlation
#'
par(mfrow = c(1,2))
LBTest(mod.lm$res, nPQ = 0, ifPlot = TRUE)
plot(x = a[,1],y = a[,3],
ylim = c(0,1), pch = 15, col =4,
ylab = "p-value", xlab = "m",
main = "Cross-correlation check")
abline(h = 0.05, col = 2)
grid()
#Analysis: In the Ljung-Box portmanteau test, all p-values fall below the 5% level, so the residuals of the linear regression model are highly serially correlated; the plain regression in Equation (1) is therefore not adequate on its own.
\[RMSE = \sqrt {\frac{\sum_{i=1}^L (y_{t+i}-\hat y_t(i))^2}{L}}\] \[MAE = \frac{\sum_{i=1}^L \left|y_{t+i}-\hat y_t(i)\right|}{L}\] \[MAPE = \frac{1}{L}\sum_{i=1}^L \left|1-\frac{\hat y_t(i)}{y_{t+i}}\right|,\]
where \(\hat y_t(i)\) denotes the forecast at origin \(t\) with lead time \(i\). (The accuracy() function reports MAPE in percent.)
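As a quick numeric illustration of these definitions (the numbers below are made up and are not taken from the data):
y.obs = c(4.2, 4.0, 3.9)      # hypothetical observed values y_{t+i}, i = 1, 2, 3
y.hat = c(4.0, 3.8, 4.1)      # hypothetical forecasts \hat y_t(i)
e = y.obs - y.hat
c(RMSE = sqrt(mean(e^2)),         # 0.20
  MAE  = mean(abs(e)),            # 0.20
  MAPE = 100*mean(abs(e/y.obs)))  # about 4.96 (in percent, as accuracy() reports)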
#' @forecast using tfn
#'
dat.test = window(dat, start = c(2018, 4))
pred.tfn <- forecast(mod.tfn, xreg = dat.test[, -1])
pred.lm <- forecast(mod.lm, newdata = as.data.frame(dat.test[, -1]))
#' @calculate MSE, MAE, MAPE
#'
result = rbind(accuracy(pred.tfn, dat.test[, 1])[2, c(2, 3, 5)], accuracy(pred.lm, dat.test[, 1])[2, c(2, 3, 5)])
rownames(result) = c("Transfer function noise model", "Linear Model")
knitr::kable(result, digit = 2, caption = "Out-of-sample forecast accuracy")
| | RMSE | MAE | MAPE |
|---|---|---|---|
| Transfer function noise model | 0.78 | 0.45 | 21.25 |
| Linear Model | 1.60 | 1.15 | 53.18 |
#Analysis: Comparing the RMSE, MAE, and MAPE values above, the transfer function noise model outperforms the linear model, since all three error measures are smaller.
Fit an ARIMA model to the delinquency rates using auto.arima, but ensure that the fitted model passes the Ljung-Box test.
## ar1 ar2
## 0.49 0.16
This is an AR(2) model. It passes the Ljung-Box portmanteau test, since all p-values in the LBTest plot lie above the red 5% line.
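The chunk that fits this benchmark ARIMA model is not shown above. A minimal sketch, assuming the model is fitted to the observed delinquency-rate series dat.obs[,1] and checked with the LBTest helper, might be:
#' @fit benchmark ARIMA model for the delinquency rates (sketch; not in the original)
#'
mod.arima = auto.arima(dat.obs[,1])
round(coef(mod.arima), 2)      # rounded coefficients reported above: ar1 = 0.49, ar2 = 0.16
LBTest(mod.arima$residuals, nPQ = sum(mod.arima$arma[1:4]), ifPlot = TRUE)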
#' @forecast
pred.arima = forecast(mod.arima, h = length(dat.test[,1]))
#' @calculate MSE, MAE, MAPE
#'
result = rbind(accuracy(pred.tfn, dat.test[,1])[2, c(2,3,5)],
               accuracy(pred.arima, dat.test[,1])[2, c(2,3,5)])
rownames(result) = c("TFN", "ARIMA")
knitr::kable(result, digit = 2, caption = "Out-of-sample forecast accuracy")
| | RMSE | MAE | MAPE |
|---|---|---|---|
| TFN | 0.78 | 0.45 | 21.25 |
| ARIMA | 0.15 | 0.08 | 3.85 |
#Conclusion: Based on the results above, the ARIMA model is the better forecaster, since its RMSE, MAE, and MAPE are all smaller than those of the TFN model.
The combined forecaster \(\hat f_t(i)\) may be given by
\[\hat f_t(i) = w_a ~ \hat y_t^{(a)}(i)+w_b~ \hat y_t^{(b)}(i),\]
where the superscripts \((a)\) and \((b)\) stand for the transfer function noise model and the ARIMA model, respectively. For the equal-weight scheme, \(w_a = w_b = 0.5\); for the MSE weighting scheme, the weight \(w_a\) is the solution of
\[\min_{w_a} \sqrt {\sum_{t=1}^n \{y_t -w_a \hat y_t^{(a)}-(1-w_a)\hat y_t^{(b)}\}^2},\]
where \(w_a, w_b \in[0,1]\), \(w_a+w_b=1\), \(\hat y_t^{(a)}\) and \(\hat y_t^{(b)}\) denote the fitted values at time \(t\) in the training sample, and \(n\) is the series length.
#' @calculate MSE, MAE, MAPE for the equal weight forecast
#'
pred.equal.combine = 0.5 * pred.tfn[["mean"]]+0.5*pred.arima[["mean"]]
accuracy(pred.equal.combine, dat.test[,1])[1,c(2,3,5)]
## RMSE MAE MAPE
## 0.4519857 0.2441970 11.6818376
#' @calculate MSE scheme weight
#'
wa = seq(0,1,0.001)
sumsquare = sapply(wa, function(w){sum((w*mod.tfn$fitted+(1-w)*mod.arima$fitted-dat.obs[,1])^2)})
minw = wa[which.min(sumsquare)]
pred.mse.combine = minw*pred.tfn[["mean"]]+(1-minw)*pred.arima[["mean"]]
accuracy(pred.mse.combine, dat.test[,1])[1,c(2,3,5)]
## RMSE MAE MAPE
## 0.7610087 0.4395943 20.7906699
#' @calculate MSE, MAE, MAPE for the above combination forecast
#'
output = rbind(accuracy(pred.tfn, dat.test[,1])[2, c(2,3,5)],
               accuracy(pred.arima, dat.test[,1])[2, c(2,3,5)],
               accuracy(pred.equal.combine, dat.test[,1])[1, c(2,3,5)],
               accuracy(pred.mse.combine, dat.test[,1])[1, c(2,3,5)])
rownames(output) = c("TFN", "ARIMA", "Equal-Scheme", "MSE-Scheme")
knitr::kable(output, digit = 2)
| | RMSE | MAE | MAPE |
|---|---|---|---|
| TFN | 0.78 | 0.45 | 21.25 |
| ARIMA | 0.15 | 0.08 | 3.85 |
| Equal-Scheme | 0.45 | 0.24 | 11.68 |
| MSE-Scheme | 0.76 | 0.44 | 20.79 |
#Analysis: Comparing the RMSE, MAE, and MAPE values of the four forecasters listed above, the ARIMA model performs best, since it has the smallest values for all three measures.