Homework 2

Question 1

Suppose you are hired by the central bank of NERDILAND. On your first day you see that they have only four time series with which to forecast economic activity. All of them are monthly. We denote the log levels of these series as $Y_{1t}, Y_{2t}, Y_{3t}, Y_{4t}$ and we can assume that each of them shares the business cycle and also has an idiosyncratic part.

However, the first one is released in annual growth rates: $y_{1t} = Y_{1,t} - Y_{1,t-12}$

The second one is released in quarterly growth rates, but it is an average of the flow of the last three months (note that it is a quarterly series but published every month): $y_{2,t}= \frac{Y_{2,t}+Y_{2,t-1}+Y_{2,t-2}}{Y_{2,t-3}+Y_{2,t-4}+Y_{2,t-5}}$

The third one is released in levels, therefore you can compute monthly growth rates: $y_{3t}=Y_{3,t}-Y_{3,t-1}$

The fourth one is also released in levels, but you know that it is a lagging variable, not a leading one. According to theory, this variable lags current conditions by 6 months.

When you are hired by the Bank, the only exercise they do is univariate time series forecasting of the four variables, using an ARMA process for each of them. The governor is quite happy, so far, with these forecasts, although he would like to see some indicator of aggregate economic activity that summarizes the disaggregated information he has to deal with, all the more so because the President of NERDILAND asks him all the time how the economy is going, and she does not have the time to deal with every detail of the four variables of the economy.

1) (4 points) You get assigned to the project of estimating a common component. You decide to use a Kalman Filter. Please explain carefully your strategy, write the state space representation of how you would obtain a monthly coincident indicator of activity. (We assume that all the series share a common indicator which has some dynamics and an idiosyncratic part that has its own dynamics)

Solution

The first variable is released in annual growth rates. This implies that it is related to the level of activity in period $t$ minus the level of activity in period $t-12$. Writing $F_t$ for the level of the common factor and $f_t = (1-L)F_t$ for its monthly growth rate, we can define the annual activity using the lag operator as:

\[Annual \quad Activity = (1-L^{12})F_t =(1-L) (1 +L + L^2 + \dots+ L^{11})F_t= f_t + f_{t-1}+ \dots+ f_{t-11} \]

Therefore the annual growth rate can be expressed in terms of the monthly factor as:

\[y_{1t}= \gamma_1(f_t + f_{t-1}+ \dots+ f_{t-11})+e_{1t}\]

The second variable is expressed in quarterly growth rates, which we can write in terms of monthly growth rates. Calling $y^*_{\tau}$ the quarter-over-quarter growth rate in quarter $\tau$, and $y_t$ the respective month-over-month growth rate that refers to the last month of the quarter, the relation is:

\[ y^*_{\tau} = \frac{1}{3} y_{t}+ \frac{2}{3}y_{t-1}+ y_{t-2}+ \frac{2}{3}y_{t-3} + \frac{1}{3}y_{t-4} \]

Therefore, the quarterly growth rate loads on the current and four lagged monthly growth rates with these weights. The third variable is already in monthly growth rates, so it needs no transformation.
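The two aggregation rules above can be checked numerically. The sketch below is only an illustration on a simulated series (the random walk `Y` is hypothetical), and it assumes the quarterly figure is the difference between consecutive three-month averages of the log level, which makes the weights $\frac{1}{3}, \frac{2}{3}, 1, \frac{2}{3}, \frac{1}{3}$ exact; with averages of levels instead of logs they hold only approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical monthly log-level series (a random walk), for illustration only.
Y = np.cumsum(rng.normal(size=60))
dY = np.diff(Y)          # dY[k] = Y[k+1] - Y[k], i.e. the monthly growth of month k+1
t = len(Y) - 1           # index of the last month

# Annual difference versus the sum of the last 12 monthly growth rates.
annual = Y[t] - Y[t - 12]
monthly_sum = dY[t - 12:t].sum()

# Difference between consecutive 3-month averages of the log level ...
quarterly = Y[t - 2:t + 1].mean() - Y[t - 5:t - 2].mean()
# ... versus the weighted sum of the last five monthly growth rates.
weights = np.array([1/3, 2/3, 1.0, 2/3, 1/3])   # on y_t, y_{t-1}, ..., y_{t-4}
last_five = dY[t - 5:t][::-1]                   # y_t, y_{t-1}, ..., y_{t-4}
weighted = weights @ last_five

print(np.isclose(annual, monthly_sum), np.isclose(quarterly, weighted))
```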

Observation Equation

\[ \begin{bmatrix} y_{1t} \\ y_{2t}\\ y_{3t} \\ y_{4t}\end{bmatrix} = \begin{bmatrix} \gamma_{1} & \gamma_{1} & \gamma_{1}& \gamma_{1}& \gamma_{1}& \gamma_{1}& \gamma_{1} & \dots & \gamma_{1} & 1 & 0 & 0& 0& 0& 0& 0 &0 &0 &0 &0 \\ \frac{1}{3}\gamma_{2} & \frac{2}{3}\gamma_{2} & \gamma_2 & \frac{2}{3}\gamma_{2} & \frac{1}{3}\gamma_{2} & 0 &0 &\dots & 0& 0& 0& \frac{1}{3} &\frac{2}{3} &1 &\frac{2}{3} & \frac{1}{3} &0 &0&0&0\\ \gamma_{3} & 0 & 0 & 0 & 0& 0& 0& \dots& 0 &0 &0 &0&0&0&0&0&1 &0&0&0 \\ 0 & 0 & 0 & 0 & 0& 0& \gamma_{4}& \dots& 0 &0 &0 &0&0&0&0&0&0&0&1&0 \\ \end{bmatrix} \begin{bmatrix} f_{t}\\ f_{t-1}\\ f_{t-2}\\ f_{t-3}\\ f_{t-4}\\ f_{t-5}\\ f_{t-6}\\ \vdots \\ f_{t-11}\\ e_{1t} \\ e_{1t-1} \\ e_{2t} \\ e_{2t-1} \\ e_{2t-2} \\ e_{2t-3} \\ e_{2t-4} \\ e_{3t} \\ e_{3t-1} \\ e_{4t} \\ e_{4t-1} \\ \end{bmatrix} \]

State Equation

\[ h_t = F h_{t-1} + v_t \]

Here $h_t = (f_t, f_{t-1}, \dots, f_{t-11}, e_{1t}, e_{1t-1}, e_{2t}, \dots, e_{2t-4}, e_{3t}, e_{3t-1}, e_{4t}, e_{4t-1})'$ is the 23-element state vector. The factor and each idiosyncratic component follow AR(2) dynamics, so the transition matrix is block diagonal:

\[ F = \begin{bmatrix} \Phi & 0 & 0 & 0 & 0 \\ 0 & \Psi_1 & 0 & 0 & 0 \\ 0 & 0 & \Psi_2 & 0 & 0 \\ 0 & 0 & 0 & \Psi_3 & 0 \\ 0 & 0 & 0 & 0 & \Psi_4 \end{bmatrix} \]

with the $12 \times 12$ factor block and the idiosyncratic blocks given by

\[ \Phi = \begin{bmatrix} \phi_1 & \phi_2 & 0 & \dots & 0 & 0 \\ 1 & 0 & 0 & \dots & 0 & 0 \\ 0 & 1 & 0 & \dots & 0 & 0 \\ \vdots & & \ddots & & & \vdots \\ 0 & 0 & 0 & \dots & 1 & 0 \end{bmatrix}, \quad \Psi_i = \begin{bmatrix} \psi_{i1} & \psi_{i2} \\ 1 & 0 \end{bmatrix} \; (i = 1, 3, 4), \quad \Psi_2 = \begin{bmatrix} \psi_{21} & \psi_{22} & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix} \]

And the error vector

\[ v_t = \begin{bmatrix} w_t\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\\varepsilon_{1t} \\0 \\\varepsilon_{2t}\\0\\0\\0\\0 \\ \varepsilon_{3t} \\0 \\ \varepsilon_{4t} \\0 \end{bmatrix} \]
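As a rough illustration of how these system matrices could be assembled in practice, the sketch below builds a 23-state transition matrix and a 4 x 23 loading matrix with NumPy, matching the length of the error vector above. All parameter values (`phi`, `psi`, `gamma`) are placeholders rather than estimates, and the helper names `companion` and `block_diag` are hypothetical.

```python
import numpy as np

def companion(coeffs, size):
    """Companion block: AR coefficients in the first row, lag shift below."""
    M = np.zeros((size, size))
    M[0, :len(coeffs)] = coeffs
    M[1:, :-1] = np.eye(size - 1)
    return M

def block_diag(*blocks):
    """Stack square blocks along the diagonal of a zero matrix."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        out[i:i + k, i:i + k] = b
        i += k
    return out

# Placeholder parameters (assumed values, not estimates).
phi = [0.5, 0.2]                                   # factor AR(2)
psi = {i: [0.3, 0.1] for i in (1, 2, 3, 4)}        # idiosyncratic AR(2)
gamma = {1: 0.8, 2: 0.6, 3: 0.7, 4: 0.5}           # factor loadings

# State order: f_t..f_{t-11} | e1 (2) | e2 (5) | e3 (2) | e4 (2)
F = block_diag(companion(phi, 12), companion(psi[1], 2),
               companion(psi[2], 5), companion(psi[3], 2),
               companion(psi[4], 2))

w = np.array([1/3, 2/3, 1.0, 2/3, 1/3])            # quarterly aggregation weights
H = np.zeros((4, 23))
H[0, :12] = gamma[1]; H[0, 12] = 1.0               # annual growth: 12 factor lags
H[1, :5] = gamma[2] * w; H[1, 14:19] = w           # quarterly growth
H[2, 0] = gamma[3]; H[2, 19] = 1.0                 # monthly growth
H[3, 6] = gamma[4]; H[3, 21] = 1.0                 # lagging variable: f_{t-6}

print(F.shape, H.shape)
```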

2) After you start working on 1, your boss comes to you and asks whether it would not be better to forget about the fourth variable. At the end of the day… how can a lagging variable help you to estimate a coincident indicator?

Solution

There are two possible arguments for including a lagging variable.

On the one hand, an empirical argument. You can estimate two models, one with the lagging variable and one without, and compare the RMSE of both. If the model that includes the lagging variable has the smaller RMSE, that is a good sign that it improves your forecasts. However, this is an out-of-sample exercise, and selecting the model on it risks overfitting to the out-of-sample data, which is not always a good idea.

On the other hand, a theoretical argument. If the lagging variable is strongly related to the common factor, it should improve the signal captured by the factor. Once released, the lagging variable updates the estimate of the unobservable factor, and with it improves the accuracy of the signal that is used to predict economic activity.

6) Can you forecast the future state value of the coincident indicator? Specify how.

Solution

Yes, the model yields $f_t, f_{t-1}, f_{t-2}$ from the state equation:

\[ \begin{align*} h_t &= Fh_{t-1} + v_t \quad \quad \quad \quad State\quad Eq. \end{align*} \]

Then, given that we have $h_{t|t}$ we can iterate the forecasts as:

\[ \begin{align*} h_{t+1|t} &= Fh_{t|t} \\ h_{t+2|t} &= Fh_{t+1|t} = F^2 h_{t|t} \\ h_{t+n|t} &= F^n h_{t|t} \end{align*} \]

And then we can say that: $\hat{h}_{t+n|t} = \hat{F}^n \hat{h}_{t|t}$
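The iteration above can be sketched in a few lines. This is only an illustration: the transition matrix `F` below is a hypothetical AR(2) factor in companion form and `h_tt` is a made-up filtered state, not output from the model in this homework.

```python
import numpy as np

def forecast_states(F, h_filtered, n):
    """Return h_{t+1|t}, ..., h_{t+n|t} by iterating the state equation."""
    path = []
    h = np.asarray(h_filtered, dtype=float)
    for _ in range(n):
        h = F @ h            # h_{t+k|t} = F h_{t+k-1|t}
        path.append(h)
    return np.array(path)

# Hypothetical AR(2) factor in companion form: state is (f_t, f_{t-1}).
F = np.array([[0.5, 0.2],
              [1.0, 0.0]])
h_tt = np.array([1.0, 0.5])      # assumed filtered state h_{t|t}

path = forecast_states(F, h_tt, 3)
print(path[:, 0])                # forecasts of the factor for t+1, t+2, t+3
```

The last row of `path` coincides with $F^3 h_{t|t}$, which is the closed form given above.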

7) You are not supposed to be in charge of forecasting the four indicators, but do you think you can respectfully help your colleagues to do their job? Why? Under which assumptions?

Solution

Yes, because the Kalman filter algorithm yields forecasts of the observable variables too. In each iteration of the filter, the vector of observables $y_t$ is also forecast conditional on the information set and the state of the unobservable factor. These forecasts of the observable variables are, in general, reasonably accurate, provided the model is well specified, in particular that the four series do share the common factor. From this point of view they could be a valuable input for the colleagues in charge of forecasting the four indicators.

8) Now we go to the realistic case in which you have an annual series like the one defined in 5, but that you can only observe every 12 months, a quarterly series like the one defined in the definition of $y_{2t}$ that you observe every 3 months, and two monthly series. Your boss tells you that, given that you do not have complete series, you cannot use them together… what do you think?

Solution

With all due respect, I think I can actually use them. One of the advantages of the Kalman filter is that it can deal with unbalanced datasets.

To do so, we can create an indicator variable, which we call index, that takes the value 1 when the row of data is complete and 0 when there are missing values. Within the Kalman filter we then multiply this indicator by our matrix H. With it, the standard state-space representation used by the Kalman filter is:

\[ \begin{align*} y_t &= A'x_t + index'H'h_t +w_t \quad Observation \quad Eq. \\ h_t &= Fh_{t-1} + v_t \quad \quad \quad \quad State\quad Eq. \end{align*} \]

In our particular example it would be (assuming no exogenous variables):

\[ \begin{align*} y_t &= index'H'h_t +w_t \quad Observation \quad Eq. \\ h_t &= Fh_{t-1} + v_t \quad \quad \quad \quad State\quad Eq. \end{align*} \]

In each iteration of the algorithm, this ignores only those rows that contain missing data.
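A minimal sketch of one filter iteration with missing observations follows. It uses a common variant of the indicator idea: rather than multiplying H by the index, it drops the rows of H (and of the measurement covariance) for which the observation is missing, marked here as NaN. The function name `kalman_step` and all numerical values are assumptions for illustration, and H is taken to map the state to the observables as $y_t = H h_t$.

```python
import numpy as np

def kalman_step(h, P, y, F, Q, H, R):
    """One predict/update step; NaN entries of y are treated as missing."""
    # Prediction from the state equation.
    h_pred = F @ h
    P_pred = F @ P @ F.T + Q

    observed = ~np.isnan(y)
    if not observed.any():
        # Nothing released this period: keep the prediction.
        return h_pred, P_pred

    # Keep only the rows that correspond to released data.
    Ho = H[observed]
    Ro = R[np.ix_(observed, observed)]
    innovation = y[observed] - Ho @ h_pred
    S = Ho @ P_pred @ Ho.T + Ro
    K = P_pred @ Ho.T @ np.linalg.inv(S)

    h_upd = h_pred + K @ innovation
    P_upd = P_pred - K @ Ho @ P_pred
    return h_upd, P_upd

# Toy system (assumed values): one state, two observables, the second missing.
F = np.array([[0.9]]); Q = np.array([[1.0]])
H = np.array([[1.0], [0.5]]); R = 0.1 * np.eye(2)
h0 = np.array([0.0]); P0 = np.array([[1.0]])

h1, P1 = kalman_step(h0, P0, np.array([1.0, np.nan]), F, Q, H, R)
h_none, P_none = kalman_step(h0, P0, np.array([np.nan, np.nan]), F, Q, H, R)
print(h1, P1)
```

With every entry missing, the step reduces to the pure prediction, which is how the filter carries the state through months in which the annual or quarterly series is not released.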

9) (double point) If the model were built just to forecast the variable $Y_{2t}$ and you had the possibility of including two more variables in the estimation, $Y_{5t}$ and $Y_{6t}$, do you think that the forecast for $Y_{2t}$ would always be better? Why? Can you propose a criterion to check whether these variables improve the forecast?

Solution

First, more data is not always better and can increase forecast errors even when using dimensionality reduction techniques (Boivin & Ng, 2006). To check whether more data helps there are two popular options. One is to compute the RMSE of the model with and without the extra variables and see which one performs better. However, this is an out-of-sample exercise, and selecting the model on it risks overfitting to the out-of-sample data, which is not always a good idea.

A more refined solution is hard thresholding (Bai & Ng, 2007), which consists of regressing the forecast variable on its lags and each individual indicator, and selecting all indicators with an absolute t-statistic above a certain threshold. In this case, the threshold is obtained by comparing the out-of-sample performance of forecasts across a range of thresholds and choosing the threshold that delivers the lowest forecast errors.
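The selection step can be sketched as follows, on simulated data. This is an illustrative simplification rather than the procedure of the cited paper: it uses one lag of the target, plain OLS standard errors, and a fixed threshold of 1.65 chosen arbitrarily here (the text above selects the threshold by out-of-sample comparison). The function names and the simulated series are hypothetical.

```python
import numpy as np

def t_stat_of_indicator(y, x):
    """t-statistic of x_t in an OLS regression of y_t on a constant, y_{t-1} and x_t."""
    Y = y[1:]
    X = np.column_stack([np.ones(len(Y)), y[:-1], x[1:]])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    sigma2 = resid @ resid / (len(Y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[2] / np.sqrt(cov[2, 2])

def hard_threshold(y, indicators, threshold=1.65):
    """Keep the indicators whose absolute t-statistic exceeds the threshold."""
    return [name for name, x in indicators.items()
            if abs(t_stat_of_indicator(y, x)) > threshold]

# Simulated example: the target depends on x5 but not on x6.
rng = np.random.default_rng(1)
T = 200
x5 = rng.normal(size=T)
x6 = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.3 * y[t - 1] + 1.0 * x5[t] + rng.normal()

selected = hard_threshold(y, {"Y5": x5, "Y6": x6})
print(selected)
```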

Question 2:

Suppose you are tired of working for the central bank of Nerdiland and you decide to join the private bank Weird Bank, in the forecasting department. Weird Bank has increased your salary substantially but, more importantly, the Bank gives you the possibility of doubling your salary if you predict GDP growth rates better than the rest of the forecasters in Nerdiland.

In principle you have different options to forecast GDP. The easiest one is to estimate a univariate autoregressive model. You estimate this model and realize that the optimal number of lags is just one:

\[ y_t = \mu +\phi y_{t-1} +\varepsilon_t \]

GDP series are quarterly, therefore you estimate the model at quarterly frequency. You have a sample that goes from 1960.Q1 to 2010.Q4 and you want to forecast 2011.Q1 and 2011.Q2. However, you think that some relevant variables might contain information about the evolution of GDP. These are monthly variables: in particular, Industrial Production, Car Registration, Stock Market returns and Employment.

You think that all the series are related to the evolution of the economy, although Stock Market returns seem to lead activity by two months, Industrial Production and Car Registration are contemporaneous, and Employment seems to be correlated with current and 1 and 2 lags of economic activity.

You have all the series until February 2011, starting from January 1970. Only Employment starts later, in January 1975. You have several options here to take advantage of this information:

A) You can estimate the factor using the 4 variables and use the factor in forecasting GDP.

Q1) (double point) Please write the state space representation of the 4-variable system assuming that the factor and the idiosyncratic part follow an AR(1) process.

\[ {y_t} =\begin{bmatrix} y_{1t} \\ y_{2t}\\ y_{3t} \\ y_{4t}\end{bmatrix} = \begin{bmatrix} Industrial\quad Production_t \\Car \quad Registration_t\\Stock\quad Market_t \\ Employment_t\end{bmatrix} \]

Industrial Production in “t” is related to activity in “t”.
Car Registration in “t” is related to activity in “t”.
Stock Market in “t” is related to activity in “t+2”.
Employment in “t” is related to activity in “t”, “t-1” and “t-2”.

In reduced form we can express the model as:

\[ \begin{align*} y_{i,t} &= \gamma_i f_t + e_{i,t} \\ f_t &= \phi_1 f_{t-1} + w_t \\ e_{i,t} &= \psi_{i,1} e_{i, t-1} + \varepsilon_{i,t} \end{align*} \]

Observation Eq.

\[ \begin{bmatrix} y_{1t} \\ y_{2t}\\ y_{3t} \\ y_{4t}\end{bmatrix} = \begin{bmatrix} 0 & 0 & \gamma_{1} & 0 & 0 & 1& 0& 0& 0 \\ 0 & 0 & \gamma_{2} & 0 & 0 & 0& 1& 0& 0 \\ \gamma_{3} & 0 & 0 & 0 & 0& 0& 0& 1& 0 \\ 0 & 0 & \gamma_{4} & \gamma_{4} & \gamma_{4}& 0& 0& 0& 1 \\ \end{bmatrix} \begin{bmatrix} f_{t+2}\\ f_{t+1}\\ f_t\\ f_{t-1}\\ f_{t-2}\\ e_{1t} \\e_{2t} \\ e_{3t} \\ e_{4t}\\ \end{bmatrix} \]

State Eq.

\[ \begin{bmatrix} f_{t+2}\\ f_{t+1}\\ f_t\\ f_{t-1}\\ f_{t-2}\\ e_{1t} \\e_{2t} \\ e_{3t} \\ e_{4t}\\ \end{bmatrix} = \begin{bmatrix} \phi_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & \psi_{11}& 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & \psi_{21}& 0 & 0 \\ 0 & 0 & 0 & 0 & 0& 0 & 0 &\psi_{31}& 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \psi_{41} \\ \end{bmatrix} \begin{bmatrix} f_{t+1} \\ f_t\\ f_{t-1}\\ f_{t-2} \\f_{t-3}\\ e_{1t-1} \\e_{2t-1} \\ e_{3t-1} \\ e_{4t-1}\\ \end{bmatrix} + \begin{bmatrix} w_{t+2} \\ 0\\ 0\\ 0 \\0\\ \varepsilon_{1t} \\\varepsilon_{2t} \\ \varepsilon_{3t} \\ \varepsilon_{4t}\\ \end{bmatrix} \]

Q3) Tell me how you would use a monthly factor to estimate a quarterly series

For a given quarter and month:

\[ Y^*_q = \frac{1}{3}y^*_t +\frac{2}{3}y^*_{t-1}+y^*_{t-2}+\frac{2}{3}y^*_{t-3}+\frac{1}{3}y^*_{t-4} \]

Replacing the monthly growth rates with the monthly factor gives the common component of the quarterly growth rate:

\[ Y^*_q = \frac{1}{3}f_t +\frac{2}{3}f_{t-1}+f_{t-2}+\frac{2}{3}f_{t-3}+\frac{1}{3}f_{t-4} \]

The estimate then adds the idiosyncratic error, aggregated with the same weights:

\[ \hat{Y}^*_q = \frac{1}{3}f_t +\frac{2}{3}f_{t-1}+f_{t-2}+\frac{2}{3}f_{t-3}+\frac{1}{3}f_{t-4} + \frac{1}{3}e_t +\frac{2}{3}e_{t-1}+e_{t-2}+\frac{2}{3}e_{t-3}+\frac{1}{3}e_{t-4} \]

Q4) If you had the Stock Market for March 2011, could you use that information in your forecast or would you have to wait until you had all the variables? Explain your answer.

Yes, the Kalman filter can deal with unbalanced datasets, in the same way that we have been working with Employment, whose series starts five years later than the others. To do so, again, we can create an indicator variable, which we call index, that takes the value 1 when the row of data is complete and 0 when there are missing values. Within the Kalman filter we then multiply this indicator by our matrix H. With it, the standard state-space representation used by the Kalman filter is:

\[ \begin{align*} y_t &= index'H'h_t +w_t \quad Observation \quad Eq. \\ h_t &= Fh_{t-1} + v_t \quad \quad \quad \quad State\quad Eq. \end{align*} \]

B) You can estimate a factor model using the 4 monthly variables and the GDP in the same model.

Q5) (double point) Please write the state space representation of the 5-variable system assuming that the factor and the idiosyncratic part follow an AR(1) process.

Observation equation (done by hand for lack of time)

State equation

Q7) Tell me what is the main difference of this specification versus the specification in 1.

The difference is that the second model adds GDP in autoregressive form. It allows us to forecast GDP automatically with the expression $\hat{y_t} = Hh_t$, while in the model without GDP we obtain a factor and then use this factor as an exogenous variable in a regression for GDP.

In that sense, the second model, which includes the lag of GDP, is a kind of ARIMAX model estimated with the Kalman filter.

Q8) Tell me under which circumstances this would be a better model than the model in 1.

GDP is released on a quarterly basis. The other data have a higher frequency. In a situation with an external shock like COVID, adding the lag of GDP does not add much information for forecasting second-quarter GDP. Moreover, it will harm the prediction. A more parsimonious model, without the lag of GDP, will incorporate the negative shock captured by the higher-frequency variables faster and more strongly.

Q9) Tell me if one more variable, let's say income, will necessarily improve the forecast of the GDP series.

No, it will not necessarily improve the model. Another variable adds both noise and explanatory power for the variance. Therefore there is a trade-off, and more variables do not mean more predictive power, as explained in previous answers.

3) Comparing the models.

Q10) How would you compare the univariate model with the model in 1 and 2?

One option is to compare the RMSE. Besides this, the ARIMAX might be somewhat more flexible because it can add interventions, which can be useful in special times such as the Covid crisis. In addition, the multivariate model (Kalman filter) makes more efficient use of the available information.
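The RMSE comparison amounts to very little code. The numbers below are made up purely for illustration (they are not forecasts from any of the models discussed), and the model labels are hypothetical.

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean squared error between realized values and forecasts."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean((a - f) ** 2)))

# Made-up quarterly GDP growth outcomes and out-of-sample forecasts.
actual = [0.5, 0.3, -0.2, 0.4]
forecasts = {
    "AR(1)": [0.4, 0.4, 0.1, 0.2],
    "factor": [0.5, 0.2, -0.1, 0.3],
}

scores = {name: rmse(actual, f) for name, f in forecasts.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

To limit the overfitting risk mentioned earlier, the evaluation sample should be kept separate from any data used to choose the specification.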

Andrés

18/5/2021