Question 1

1.1 Regress gdp on cpi. Can the relation be spurious? Explain why (hint: spurious relations need that variables have a stochastic trend)

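#load packages and data
library(AER)        #USMacroG data set
library(forecast)   #autoplot, tslm, auto.arima, checkresiduals
library(ggplot2)    #ylab
library(urca)       #ur.kpss, ur.df
library(gridExtra)  #grid.arrange
data(USMacroG)
data <- USMacroG
#extract gdp and cpi as quarterly time series
gdp <- ts(data[,"gdp"], frequency=4, start=c(1950,1)) 
cpi <- ts(data[,"cpi"], frequency=4, start=c(1950,1)) 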
#plot the data
plot_gpd <- autoplot(gdp)+ylab("GDP")
plot_cpi <- autoplot(cpi)+ylab("CPI")
#arrange plots
grid.arrange(plot_gpd, plot_cpi, ncol=2, nrow =1)

#kpss test for gdp
summary(ur.kpss(gdp, type = "tau"))

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: tau with 4 lags. 
## 
## Value of test-statistic is: 0.8305 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.119 0.146  0.176 0.216

#adf test for gdp
summary(ur.df(gdp, type = "trend", lags = 4))

## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression trend 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -133.844  -22.206    0.667   23.149  149.066 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.9940402  9.0813585   0.550 0.583012    
## z.lag.1      0.0007854  0.0082632   0.095 0.924377    
## tt           0.1478414  0.2857543   0.517 0.605492    
## z.diff.lag1  0.2498878  0.0728246   3.431 0.000735 ***
## z.diff.lag2  0.1364139  0.0754394   1.808 0.072131 .  
## z.diff.lag3 -0.0089814  0.0760176  -0.118 0.906073    
## z.diff.lag4 -0.0099983  0.0744557  -0.134 0.893317    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.23 on 192 degrees of freedom
## Multiple R-squared:  0.2268, Adjusted R-squared:  0.2026 
## F-statistic: 9.384 on 6 and 192 DF,  p-value: 4.975e-09
## 
## 
## Value of test-statistic is: 0.095 8.595 4.9742 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau3 -3.99 -3.43 -3.13
## phi2  6.22  4.75  4.07
## phi3  8.43  6.49  5.47

#kpss test for cpi
summary(ur.kpss(cpi, type = "tau"))

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: tau with 4 lags. 
## 
## Value of test-statistic is: 0.9913 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.119 0.146  0.176 0.216

#adf test for cpi
summary(ur.df(cpi, type = "trend", lags = 4))

## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression trend 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4950 -0.6479 -0.1520  0.5684  5.2331 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.168352   0.210851  -0.798  0.42560    
## z.lag.1     -0.005670   0.002268  -2.500  0.01326 *  
## tt           0.019335   0.006282   3.078  0.00239 ** 
## z.diff.lag1  0.098416   0.068166   1.444  0.15043    
## z.diff.lag2  0.150420   0.067059   2.243  0.02603 *  
## z.diff.lag3  0.141715   0.066809   2.121  0.03519 *  
## z.diff.lag4  0.370641   0.069403   5.340  2.6e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.374 on 192 degrees of freedom
## Multiple R-squared:  0.5996, Adjusted R-squared:  0.5871 
## F-statistic: 47.92 on 6 and 192 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic is: -2.5 4.357 5.2196 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau3 -3.99 -3.43 -3.13
## phi2  6.22  4.75  4.07
## phi3  8.43  6.49  5.47

#perform linear model
lm_model <- tslm(gdp~cpi)
#summarize results of model
summary(lm_model)

## 
## Call:
## tslm(formula = gdp ~ cpi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -798.80 -399.06  -77.22  464.30  866.66 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1429.8892    56.6422   25.24   <2e-16 ***
## cpi           13.8727     0.2095   66.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 444.8 on 202 degrees of freedom
## Multiple R-squared:  0.956,  Adjusted R-squared:  0.9557 
## F-statistic:  4384 on 1 and 202 DF,  p-value: < 2.2e-16

#check the residuals of the model
checkresiduals(lm_model)

## 
##  Breusch-Godfrey test for serial correlation of order up to 8
## 
## data:  Residuals from Linear regression model
## LM test = 198.54, df = 8, p-value < 2.2e-16

A regression is spurious when the relation between the variables appears significant only because both increase or decrease over the same period, not because one actually explains the other. Spurious regression models might provide reasonable short-term forecasts, but they will generally not continue to work in the future.

The plots of the raw data show that both gdp and cpi increased strongly over the analyzed period. The KPSS tests reject trend stationarity for both series at the 1% level (0.8305 and 0.9913 against a critical value of 0.216), and the ADF tests cannot reject a unit root even at the 10% level (tau3 of 0.095 for gdp and -2.5 for cpi against -3.13). The phi2 statistics additionally suggest a drift (8.595 for gdp; 4.357 for cpi, significant only at the 10% level). Both series therefore have a stochastic trend, which is the precondition for a spurious relation. Regressing gdp on cpi gives a highly significant coefficient and an extremely large R-squared (0.956). The residuals, however, do not fluctuate rapidly around zero; they move slowly and appear to behave cyclically. Their ACF is very close to one at lag 1 and decays only slowly, remaining highly significant. The Breusch-Godfrey test rejects the null hypothesis of no autocorrelation (LM = 198.54, p < 2.2e-16). A very high R-squared combined with strongly autocorrelated residuals in a regression of two trending series is the typical signature of a spurious regression, so we conclude that the relation can be, and most likely is, spurious.
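The mechanism can be illustrated with simulated data. The following sketch is not based on the assignment data: it assumes two independent random walks (the hypothetical series `x` and `y`, with an arbitrarily chosen seed) and compares a regression in levels with one in first differences, using base R only.

```r
# Illustration only: two independent random walks (no true relation).
set.seed(1)
n <- 200
x <- cumsum(rnorm(n))  # stochastic trend, hypothetical regressor
y <- cumsum(rnorm(n))  # independent stochastic trend, hypothetical response

fit_levels <- lm(y ~ x)              # regression in levels
fit_diff   <- lm(diff(y) ~ diff(x))  # regression in first differences

# Lag-1 autocorrelation of the residuals of a fitted model
resid_acf1 <- function(fit) acf(residuals(fit), plot = FALSE)$acf[2]

comparison <- data.frame(
  model      = c("levels", "differences"),
  r_squared  = c(summary(fit_levels)$r.squared, summary(fit_diff)$r.squared),
  resid_acf1 = c(resid_acf1(fit_levels), resid_acf1(fit_diff))
)
print(comparison)
```

In simulations of this kind the levels regression typically leaves strongly autocorrelated residuals, while the regression in differences does not, which mirrors the contrast between the residual diagnostics discussed here.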

1.2 Take the log-differences and repeat the regression: is now the relation spurious? How do you think the model can be improved?

#take the log-differences of the variables
gdp_log_diff <- diff(log(gdp))
cpi_log_diff <- diff(log(cpi))
#quick check if variables are stationary
#ndiffs(gdp_log_diff)
#ndiffs(cpi_log_diff)
#kpss test for gdp (level stationarity around a constant mean)
summary(ur.kpss(gdp_log_diff, type = "mu"))

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.1358 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

#adf test for gdp
summary(ur.df(gdp_log_diff, type = "drift", lags = 4))

## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression drift 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.029853 -0.005000  0.000033  0.005231  0.032975 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.006856   0.001178   5.818 2.45e-08 ***
## z.lag.1     -0.823957   0.115927  -7.108 2.26e-11 ***
## z.diff.lag1  0.123205   0.105313   1.170   0.2435    
## z.diff.lag2  0.199999   0.094406   2.119   0.0354 *  
## z.diff.lag3  0.155218   0.083670   1.855   0.0651 .  
## z.diff.lag4  0.106866   0.070146   1.523   0.1293    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.009172 on 192 degrees of freedom
## Multiple R-squared:  0.3567, Adjusted R-squared:  0.3399 
## F-statistic: 21.29 on 5 and 192 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic is: -7.1075 25.2756 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau2 -3.46 -2.88 -2.57
## phi1  6.52  4.63  3.81

#kpss test for cpi (level stationarity around a constant mean)
summary(ur.kpss(cpi_log_diff, type = "mu"))

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.606 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

#adf test for cpi
summary(ur.df(cpi_log_diff, type = "drift", lags = 4))

## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression drift 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0176760 -0.0033251 -0.0000663  0.0030975  0.0143253 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0013540  0.0006608   2.049 0.041821 *  
## z.lag.1     -0.1504767  0.0545868  -2.757 0.006403 ** 
## z.diff.lag1 -0.4976838  0.0794600  -6.263 2.42e-09 ***
## z.diff.lag2 -0.3376658  0.0858430  -3.934 0.000117 ***
## z.diff.lag3 -0.0480587  0.0827153  -0.581 0.561913    
## z.diff.lag4  0.1412564  0.0684554   2.063 0.040412 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.005467 on 192 degrees of freedom
## Multiple R-squared:  0.3588, Adjusted R-squared:  0.3421 
## F-statistic: 21.48 on 5 and 192 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic is: -2.7567 3.8462 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau2 -3.46 -2.88 -2.57
## phi1  6.52  4.63  3.81

#plot the data
plot_log_gpd <- autoplot(gdp_log_diff)+ylab("Differences of log(GDP)")
plot_log_cpi <- autoplot(cpi_log_diff)+ylab("Differences of log(CPI)")
#arrange plots
grid.arrange(plot_log_gpd, plot_log_cpi, ncol=2, nrow =1)

#perform linear model
lm_model <- tslm(gdp_log_diff~cpi_log_diff)
#summarize results of model
summary(lm_model)

## 
## Call:
## tslm(formula = gdp_log_diff ~ cpi_log_diff)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.035167 -0.005035 -0.000412  0.006108  0.032658 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.010478   0.001062   9.869   <2e-16 ***
## cpi_log_diff -0.186635   0.081715  -2.284   0.0234 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.009869 on 201 degrees of freedom
## Multiple R-squared:  0.0253, Adjusted R-squared:  0.02045 
## F-statistic: 5.217 on 1 and 201 DF,  p-value: 0.02342

#calculate accuracy measures 
accuracy(lm_model)

##                        ME        RMSE        MAE      MPE     MAPE      MASE
## Training set 4.493864e-19 0.009819774 0.00729695 39.24943 143.9623 0.6675619
##                   ACF1
## Training set 0.3378579

#plot acf and pacf
acf_lm <- ggAcf(lm_model$residuals)
pacf_lm <- ggPacf(lm_model$residuals)
grid.arrange(acf_lm, pacf_lm, ncol=2, nrow =1)

#check the residuals of the model
checkresiduals(lm_model)

## 
##  Breusch-Godfrey test for serial correlation of order up to 8
## 
## data:  Residuals from Linear regression model
## LM test = 28.361, df = 8, p-value = 0.0004103

After taking log-differences, the two series no longer show a similar development in their levels. The KPSS and ADF tests indicate that the log-differences of GDP are stationary (KPSS 0.1358, below the 10% critical value of 0.347; ADF tau2 of -7.11, below the 1% critical value of -3.46). The log-differences of CPI are far from clearly stationary: the KPSS test rejects stationarity at the 2.5% level (0.606 > 0.574), and the ADF test rejects a unit root only at the 10% level (-2.76, between -2.88 and -2.57). In the linear model the CPI coefficient is still significant at the 5% level (p = 0.0234), whereas the R-squared has dropped dramatically to 0.025, which seems reasonable. The residuals now fluctuate more closely around zero, but they do not look like white noise, so there are dynamics left over that the model does not capture. The lag-1 autocorrelation of the residuals is about 0.34 (ACF1 in the accuracy output), much lower than in 1.1 but still clearly significant; the other ACF values are no longer clearly significant. The Breusch-Godfrey test still rejects the null hypothesis of no autocorrelation (p = 0.0004). The distribution of the residuals looks approximately normal. On the one hand, residual autocorrelation remains; on the other hand, the R-squared has dropped dramatically and the variance no longer appears to grow as t increases. All in all, the model no longer looks as spurious as the one in 1.1. To improve the model further, we suggest differencing both variables a second time to reduce the autocorrelation and to make the CPI variable stationary.
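The decision rules behind these statements can be written out explicitly. The sketch below only re-uses the test statistics and critical values printed above; the helper names `kpss_rejects` and `adf_rejects` are hypothetical and not part of `urca`.

```r
# Critical values copied from the ur.kpss (type "mu") and ur.df (type "drift") output
kpss_crit <- c("10pct" = 0.347, "5pct" = 0.463, "2.5pct" = 0.574, "1pct" = 0.739)
adf_crit  <- c("1pct" = -3.46, "5pct" = -2.88, "10pct" = -2.57)

# KPSS: H0 is stationarity, rejected when the statistic exceeds the critical value
kpss_rejects <- function(stat) stat > kpss_crit
# ADF (tau2): H0 is a unit root, rejected when the statistic is below the critical value
adf_rejects <- function(stat) stat < adf_crit

# Log-differences of gdp
print(kpss_rejects(0.1358))
print(adf_rejects(-7.1075))
# Log-differences of cpi
print(kpss_rejects(0.606))
print(adf_rejects(-2.7567))
```

The two tests have opposite null hypotheses, so a series is clearly stationary only when KPSS does not reject and ADF does reject, as is the case for GDP but not for CPI.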

Question 2

2.1 Select a dynamic model with auto.arima for the log-differences of gdp including the contemporaneous value of the log-differences of cpi as exogenous variable. Discuss briefly the model and compare its performance with the one of the model obtained in Question 1.2 with log-differentiated data.

#perform auto model
auto_model <- auto.arima(gdp_log_diff, xreg=cpi_log_diff)
#summarize results of model
summary(auto_model)

## Series: gdp_log_diff 
## Regression with ARIMA(2,0,2)(2,0,1)[4] errors 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2    sar1     sar2     sma1  intercept
##       1.4160  -0.8917  -1.1802  0.7096  0.2257  -0.2287  -0.1479     0.0101
## s.e.  0.0716   0.0787   0.1226  0.1190  0.2491   0.0748   0.2529     0.0009
##          xreg
##       -0.1655
## s.e.   0.0748
## 
## sigma^2 estimated as 8.343e-05:  log likelihood=669.35
## AIC=-1318.7   AICc=-1317.56   BIC=-1285.57
## 
## Training set error measures:
##                        ME        RMSE         MAE      MPE   MAPE      MASE
## Training set 0.0001169908 0.008929211 0.006684377 49.57283 134.54 0.6115206
##                    ACF1
## Training set 0.05070826

Comparing the performance measures, we conclude that the auto.arima model with seasonal terms fits better than the regression from Question 1.2: RMSE (Root Mean Square Error) falls from 0.00982 to 0.00893 and MAE (Mean Absolute Error) from 0.00730 to 0.00668. auto.arima selects a regression with ARIMA(2,0,2)(2,0,1)[4] errors. Its residuals behave well (ACF1 drops from 0.34 to 0.05), but the model is rather complex, and the sar1 and sma1 coefficients are smaller than their standard errors. To make the model more intuitive, a further step could be to reduce its complexity, e.g. by using fewer AR and MA terms (lower p and q) or by dropping part of the seasonal component.
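Which terms contribute little can be read off the ratio of each coefficient to its standard error. The sketch below copies the estimates from the summary output above; the cut-off of 1.96 is an assumption (an approximate 5% two-sided normal threshold), not something auto.arima itself applies.

```r
# Coefficients and standard errors copied from the auto.arima summary above
coefs <- c(ar1 = 1.4160, ar2 = -0.8917, ma1 = -1.1802, ma2 = 0.7096,
           sar1 = 0.2257, sar2 = -0.2287, sma1 = -0.1479,
           intercept = 0.0101, xreg = -0.1655)
ses <- c(0.0716, 0.0787, 0.1226, 0.1190, 0.2491, 0.0748, 0.2529, 0.0009, 0.0748)

# Approximate z-ratios; terms with |z| < 1.96 are candidates for removal
z <- coefs / ses
weak_terms <- names(z)[abs(z) < 1.96]

print(round(z, 2))
print(weak_terms)
```

A reduced model would still have to be re-estimated and compared on AICc and residual diagnostics before it replaces the selected one.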

2.2 Divide your data into a test and a train set. Forecast the log-differences of gdp with the time series regression and the dynamic model. Do that by assuming that the log-differences of cpi are equal to the last observation in the train set throughout the forecasting horizon. Compare the forecasts accuracy and plot the two point estimates in a unique graph.

data <- USMacroG
#split into train(~90%) and test (~10%)
#the log-differenced series start in 1950 Q2, so only the end of the train window is set
train_data_gdp <- window(gdp_log_diff, end=c(1995,4))
train_data_cpi <- window(cpi_log_diff, end=c(1995,4))
test_data_gdp <- window(gdp_log_diff, start=c(1996,1))
test_data_cpi <- window(cpi_log_diff, start=c(1996,1))
#log-differences of cpi are equal to the last observation in train set
#through the forecasting horizon
reg_data <- ts(as.numeric(tail(train_data_cpi, 1)), frequency=4, start=c(1996,1), end=c(2000,4))
#auto model
#perform auto model and forecast it with the constant cpi path
auto_model <- auto.arima(train_data_gdp, xreg=train_data_cpi)
autoarima_forecast <- forecast(auto_model, xreg = reg_data)
#lm
lm_model <- tslm(train_data_gdp~train_data_cpi)
#newdata must name the regressor exactly as in the model formula
lm_forecast <- ts(predict(lm_model, newdata = data.frame(train_data_cpi = as.numeric(reg_data))),frequency=4, start=c(1996,1), end=c(2000,4))
#accuracy lm (training set)
accuracy(lm_model)

##                        ME       RMSE         MAE      MPE    MAPE      MASE
## Training set 2.408698e-19 0.01023393 0.007670986 45.96532 153.339 0.6648568
##                   ACF1
## Training set 0.3490385

#accuracy arima (training set)
accuracy(auto_model)

##                        ME        RMSE         MAE      MPE     MAPE      MASE
## Training set 0.0001258665 0.009268879 0.006961689 57.42632 143.1573 0.6033809
##                    ACF1
## Training set 0.05258801

#plot the models
autoplot(train_data_gdp) + #train data
  autolayer(test_data_gdp, series = "Test") + #test data
  autolayer(lm_forecast, series = "Linear Model")+ #lm model
  autolayer(autoarima_forecast, PI=FALSE, series = "Arima")+ #arima model
  ylab("Differences of log(GDP)")

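Based on the accuracy output, we conclude that the ARIMA model performs slightly better than the linear model (lower RMSE, MAE, MAPE and MASE). Note that accuracy() is called on the fitted models only, so these are training-set measures: they describe the in-sample fit rather than the accuracy of the forecasts on the test set. In the plot the difference between the two forecasts is not very obvious, but the ranking seems plausible.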
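Test-set accuracy compares the point forecasts with the held-out observations. A minimal sketch in base R is given below; the function `forecast_errors` and the toy numbers are hypothetical and only illustrate the arithmetic, they are not results from the models above.

```r
# Out-of-sample error measures from actual values and point forecasts
forecast_errors <- function(actual, predicted) {
  e <- as.numeric(actual) - as.numeric(predicted)
  c(ME = mean(e), RMSE = sqrt(mean(e^2)), MAE = mean(abs(e)))
}

# Hypothetical toy data: four held-out quarters and a flat forecast
actual    <- c(0.010, 0.008, 0.012, 0.006)
predicted <- rep(0.009, 4)

toy_accuracy <- forecast_errors(actual, predicted)
print(toy_accuracy)
```

Applied to this assignment, `actual` would be `test_data_gdp` and `predicted` would be `lm_forecast` or the mean of the ARIMA forecast. The `accuracy()` function of the forecast package also accepts the test data as a second argument and then reports a test-set row next to the training-set row.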

Question 3

3.1 Verify the presence of structural breaks in the time series regression between the log-differences of GDP and the log-differences of CPI with a QLR test and the SIS. Note: in the QLR test, set the p-value to reject the null of no structural break at 5%. In the gets model, start with 6 lags of the dependent and the exogenous variable and omit the residual normality test.

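#load package for structural change tests
library(strucchange)
#tslm
lm_model <- tslm(gdp_log_diff~cpi_log_diff)
#perform the qlr test (F statistics over all candidate break dates)
data <- cbind(gdp_log_diff, cpi_log_diff)
qlr <- Fstats(gdp_log_diff~cpi_log_diff, data=data)
plot(qlr,alpha=0.05)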

#perform sc test
#generic function for performing structural change tests
sctest(qlr)

## 
##  supF test
## 
## data:  qlr
## sup.F = 6.0631, p-value = 0.4063

The black line in the plot shows the F-statistics for each candidate break date; their maximum is the QLR statistic. The red line is the 5% critical value based on Andrews (1993) and Hansen (1997). The F-statistics stay below the critical value, and the supF test gives sup.F = 6.0631 with a p-value of 0.4063. We therefore cannot reject the null hypothesis of no structural break at the 5% level, i.e. the QLR test finds no evidence of a structural break in the regression.
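The construction of the QLR statistic can be sketched by hand. The example below uses simulated data (hypothetical series `x`, `y_stable` and `y_break`), 15% trimming and the classical Chow F scaling; these are assumptions for illustration. strucchange may scale its F-statistics differently, and the code does not compute the Andrews/Hansen p-value.

```r
# Illustration only: sup-F (QLR) statistic computed by hand on simulated data.
set.seed(42)
n <- 200
x <- rnorm(n)
y_stable <- 0.5 + 0.3 * x + rnorm(n)           # no break
y_break  <- y_stable + 3 * (seq_len(n) > 120)  # intercept shift after obs. 120

# Chow F statistic for a single break after observation k
chow_f <- function(y, x, k) {
  n <- length(y)
  p <- 2                                       # intercept and slope
  first <- seq_len(k)
  second <- (k + 1):n
  rss_pooled <- sum(residuals(lm(y ~ x))^2)
  rss_split <- sum(residuals(lm(y[first] ~ x[first]))^2) +
    sum(residuals(lm(y[second] ~ x[second]))^2)
  ((rss_pooled - rss_split) / p) / (rss_split / (n - 2 * p))
}

# QLR: maximum Chow F over the central part of the sample
qlr_stat <- function(y, x, trim = 0.15) {
  n <- length(y)
  candidates <- floor(trim * n):floor((1 - trim) * n)
  f_stats <- sapply(candidates, function(k) chow_f(y, x, k))
  list(sup_f = max(f_stats), break_after = candidates[which.max(f_stats)])
}

res_stable <- qlr_stat(y_stable, x)
res_break  <- qlr_stat(y_break, x)
print(unlist(res_stable))
print(unlist(res_break))
```

Because the statistic is a maximum over many candidate dates, it cannot be compared with ordinary F critical values, which is why the plot and sctest rely on the Andrews and Hansen results.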

3.2 Discuss this result (presence/non presence of breaks): is the SIS indicating structural breaks or outliers? Motivate your answer based only on previous point evidence (5 lines).

#load packages for gets modelling and zoo objects
library(gets)
library(zoo)
#If we only specify an IIS or SIS, we have no regressors. 
#The function isat allows including both regressors and lags of the dependent variable. 
#However, isat only performs search over indicators. 
#We therefore first apply arx on gdp, allowing for 6 lags of the dependent variable (gdp) 
#and 6 lags of the exogenous variable (cpi), and then getsm to obtain the final model. 
#as we need the data in zoo format for later, we transform it accordingly
data <- as.zoo(cbind(cpi_log_diff, gdp_log_diff))
#we use the lag function of stats package to create a matrix with lags up to 6
Lcpi <- stats::lag(data$cpi_log_diff, -(1:6))
#we cbind the lags to the data
data <- cbind(data, Lcpi)
#we estimate an AR-X model with log-ARCH-X errors (ols methodology)
gum <- arx(data$gdp_log_diff, mc=TRUE, ar=1:6, mxreg = Lcpi, normality.JarqueB = FALSE)
#general-to-specific (gets) modelling of an AR-X model (mean specification)
#with log-ARCH-X errors (log-variance specification)
getsm(gum, t.pval = 0.05, normality.JarqueB = FALSE)

## 10 path(s) to search

## Searching: 1 2 3 4 5 6 7 8 9 10

## 
## Date: Tue Apr 20 22:55:32 2021 
## Dependent var.: y 
## Method: Ordinary Least Squares (OLS) 
## Variance-Covariance: Ordinary 
## No. of observations (mean eq.): 197 
## Sample: 1951(4) to 2000(4) 
## 
## GUM mean equation:
## 
##        reg.no. keep      coef std.error   t-stat   p-value    
## mconst       1    0  0.010672 0.0018851  5.66115 5.692e-08 ***
## ar1          2    0  0.247797 0.0726341  3.41158  0.000794 ***
## ar2          3    0  0.034656 0.0755642  0.45862  0.647046    
## ar3          4    0 -0.039558 0.0752817 -0.52546  0.599893    
## ar4          5    0 -0.099445 0.0762650 -1.30394  0.193882    
## ar5          6    0 -0.110415 0.0739687 -1.49273  0.137222    
## ar6          7    0  0.031824 0.0699878  0.45471  0.649852    
## lag-1        8    0  0.011902 0.1225668  0.09711  0.922745    
## lag-2        9    0 -0.304123 0.1245722 -2.44134  0.015579 *  
## lag-3       10    0  0.033549 0.1201134  0.27931  0.780321    
## lag-4       11    0 -0.159792 0.1234516 -1.29437  0.197159    
## lag-5       12    0  0.031951 0.1223938  0.26105  0.794345    
## lag-6       13    0  0.078651 0.1177453  0.66797  0.504987    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Diagnostics:
## 
##                    Chi-sq df p-value
## Ljung-Box AR(7)   0.29882  7  0.9999
## Ljung-Box ARCH(1) 0.26763  1  0.6049
## 
## Paths searched: 
## 
## path 1 : 3 8 10 12 7 4 13 11 5 
## path 2 : 4 8 10 3 12 7 13 11 5 
## path 3 : 5 10 8 12 3 7 4 13 11 
## path 4 : 6 7 8 10 12 3 4 13 11 
## path 5 : 7 8 12 10 3 4 13 11 5 
## path 6 : 8 12 10 7 3 4 13 11 5 
## path 7 : 10 8 12 7 3 4 13 11 5 
## path 8 : 11 12 10 8 4 13 7 3 5 
## path 9 : 12 8 10 7 3 4 13 11 5 
## path 10 : 13 8 7 3 4 10 12 11 5 
## 
## Terminal models: 
## 
## spec 1 : 1 2 9 
## spec 2 : 1 2 6 9 
## spec 3 : 1 2 5 9 
## 
##                 info(sc)     logl   n   k
## spec 1 (1-cut):  -6.5320 651.3298 197   3
## spec 2:          -6.5336 654.1276 197   4
## spec 3:          -6.5286 653.6347 197   4
## 
## SPECIFIC mean equation:
## 
##              coef  std.error  t-stat   p-value    
## mconst  0.0102768  0.0013625  7.5426 1.763e-12 ***
## ar1     0.2471392  0.0670835  3.6841 0.0002981 ***
## ar5    -0.1503349  0.0636492 -2.3619 0.0191759 *  
## lag-2  -0.2927109  0.0780403 -3.7508 0.0002330 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Diagnostics and fit:
## 
##                     Chi-sq df   p-value    
## Ljung-Box AR(7)    1.61039  7 0.9782400    
## Ljung-Box ARCH(1)  0.54498  1 0.4603776    
## Jarque-Bera       15.75451  2 0.0003793 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                           
## SE of regression   0.00883
## R-squared          0.18464
## Log-lik.(n=197)  654.12759

#we can use SIS to check for structural breaks
isat(data$gdp_log_diff, t.pval = 0.05, plot=TRUE)

## 
## SIS block 1 of 7:
## 29 path(s) to search
## Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
## 
## SIS block 2 of 7:
## 26 path(s) to search
## Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
## 
## SIS block 3 of 7:
## 28 path(s) to search
## Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
## 
## SIS block 4 of 7:
## 27 path(s) to search
## Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 
## 
## SIS block 5 of 7:
## 29 path(s) to search
## Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
## 
## SIS block 6 of 7:
## 29 path(s) to search
## Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
## 
## SIS block 7 of 7:
## 28 path(s) to search
## Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
## 
## GETS of union of retained SIS variables... 
## 
## GETS of union of ALL retained variables...
## - All non-keep regressors significant in GUM

## 
## Date: Tue Apr 20 22:55:34 2021 
## Dependent var.: y 
## Method: Ordinary Least Squares (OLS)
## Variance-Covariance: Ordinary 
## No. of observations (mean eq.): 203 
## Sample: 1950(2) to 2000(4) 
## 
## SPECIFIC mean equation:
## 
##                  coef  std.error  t-stat   p-value    
## mconst      0.0337611  0.0049332  6.8437 1.140e-10 ***
## sis1950 Q4 -0.0205636  0.0053630 -3.8344 0.0001733 ***
## sis1953 Q3 -0.0223747  0.0045441 -4.9239 1.895e-06 ***
## sis1954 Q2  0.0181949  0.0044386  4.0993 6.237e-05 ***
## sis1957 Q4 -0.0279020  0.0052738 -5.2907 3.475e-07 ***
## sis1958 Q2  0.0340745  0.0055155  6.1780 4.142e-09 ***
## sis1960 Q2 -0.0205447  0.0047232 -4.3498 2.266e-05 ***
## sis1961 Q1  0.0179774  0.0043893  4.0957 6.327e-05 ***
## sis1965 Q1  0.0085526  0.0035744  2.3927 0.0177427 *  
## sis1966 Q2 -0.0123624  0.0033783 -3.6593 0.0003311 ***
## sis1973 Q3 -0.0131396  0.0029380 -4.4723 1.359e-05 ***
## sis1975 Q2  0.0155638  0.0033180  4.6907 5.331e-06 ***
## sis1978 Q2  0.0265667  0.0072615  3.6586 0.0003320 ***
## sis1978 Q3 -0.0321270  0.0074583 -4.3076 2.696e-05 ***
## sis1980 Q2 -0.0167488  0.0055937 -2.9942 0.0031344 ** 
## sis1980 Q4  0.0295099  0.0069766  4.2298 3.700e-05 ***
## sis1981 Q2 -0.0217957  0.0055937 -3.8965 0.0001370 ***
## sis1983 Q1  0.0218920  0.0038814  5.6402 6.398e-08 ***
## sis1984 Q3 -0.0097989  0.0031982 -3.0639 0.0025168 ** 
## sis1990 Q2 -0.0119159  0.0037795 -3.1528 0.0018918 ** 
## sis1991 Q2  0.0118629  0.0036628  3.2387 0.0014272 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Diagnostics and fit:
## 
##                      Chi-sq df p-value  
## Ljung-Box AR(1)   0.0089088  1 0.92480  
## Ljung-Box ARCH(1) 5.6516152  1 0.01744 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                           
## SE of regression   0.00698
## R-squared          0.55890
## Log-lik.(n=203)  730.38982

#we run the same search with White (1980) heteroskedasticity-robust standard errors as well
getsm(gum, t.pval = 0.05, vcov.type = "white", normality.JarqueB = FALSE)

## 10 path(s) to search
## Searching: 1 2 3 4 5 6 7 8 9 10

## 
## Date: Tue Apr 20 22:55:35 2021 
## Dependent var.: y 
## Method: Ordinary Least Squares (OLS) 
## Variance-Covariance: White (1980) 
## No. of observations (mean eq.): 197 
## Sample: 1951(4) to 2000(4) 
## 
## GUM mean equation:
## 
##        reg.no. keep      coef std.error   t-stat   p-value    
## mconst       1    0  0.010672 0.0018851  5.66115 5.692e-08 ***
## ar1          2    0  0.247797 0.0726341  3.41158  0.000794 ***
## ar2          3    0  0.034656 0.0755642  0.45862  0.647046    
## ar3          4    0 -0.039558 0.0752817 -0.52546  0.599893    
## ar4          5    0 -0.099445 0.0762650 -1.30394  0.193882    
## ar5          6    0 -0.110415 0.0739687 -1.49273  0.137222    
## ar6          7    0  0.031824 0.0699878  0.45471  0.649852    
## lag-1        8    0  0.011902 0.1225668  0.09711  0.922745    
## lag-2        9    0 -0.304123 0.1245722 -2.44134  0.015579 *  
## lag-3       10    0  0.033549 0.1201134  0.27931  0.780321    
## lag-4       11    0 -0.159792 0.1234516 -1.29437  0.197159    
## lag-5       12    0  0.031951 0.1223938  0.26105  0.794345    
## lag-6       13    0  0.078651 0.1177453  0.66797  0.504987    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Diagnostics:
## 
##                    Chi-sq df p-value
## Ljung-Box AR(7)   0.29882  7  0.9999
## Ljung-Box ARCH(1) 0.26763  1  0.6049
## 
## Paths searched: 
## 
## path 1 : 3 8 10 12 7 4 13 11 5 
## path 2 : 4 8 10 3 7 12 13 11 5 
## path 3 : 5 10 8 12 3 7 4 11 13 
## path 4 : 6 7 8 10 3 4 12 13 11 
## path 5 : 7 8 10 12 3 4 13 11 5 
## path 6 : 8 10 12 3 7 4 13 11 5 
## path 7 : 10 8 12 3 7 4 13 11 5 
## path 8 : 11 12 10 8 4 3 7 13 5 
## path 9 : 12 8 10 3 7 4 13 11 5 
## path 10 : 13 8 7 3 4 10 12 11 5 
## 
## Terminal models: 
## 
## spec 1 : 1 2 9 
## spec 2 : 1 2 6 9 
## spec 3 : 1 2 5 9 
## 
##                 info(sc)     logl   n   k
## spec 1 (1-cut):  -6.5320 651.3298 197   3
## spec 2:          -6.5336 654.1276 197   4
## spec 3:          -6.5286 653.6347 197   4
## 
## SPECIFIC mean equation:
## 
##              coef  std.error  t-stat   p-value    
## mconst  0.0102768  0.0014329  7.1721 1.537e-11 ***
## ar1     0.2471392  0.0685556  3.6049 0.0003974 ***
## ar5    -0.1503349  0.0618941 -2.4289 0.0160608 *  
## lag-2  -0.2927109  0.0852254 -3.4345 0.0007265 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Diagnostics and fit:
## 
##                     Chi-sq df   p-value    
## Ljung-Box AR(7)    1.61039  7 0.9782400    
## Ljung-Box ARCH(1)  0.54498  1 0.4603776    
## Jarque-Bera       15.75451  2 0.0003793 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                           
## SE of regression   0.00883
## R-squared          0.18464
## Log-lik.(n=197)  654.12759

#unsurprisingly, we reject the null hypothesis of normally distributed residuals (Jarque-Bera)
#however, we cannot reject the null hypothesis of no ARCH effects (no heteroskedasticity)
#we also cannot reject the null hypothesis of no residual autocorrelation
#ar1, ar5 (gdp lags) and lag-2 (cpi lag) are significant
#we create lagged cpi data (lags 1 and 5, matching the retained ar terms; gets itself retained cpi lag 2)
Lcpi_1 <- stats::lag(data$cpi_log_diff, -c(1,5))
#run isat (indicator saturation) to detect mean shifts with step indicators (SIS, the isat default)
isat(data$gdp_log_diff, ar = c(1,5), mc=TRUE, mxreg = Lcpi_1, normality.JarqueB = FALSE, plot = TRUE)

## 
## SIS block 1 of 7:
## 29 path(s) to search
## Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
## 
## SIS block 2 of 7:
## 29 path(s) to search
## Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
## 
## SIS block 3 of 7:
## 28 path(s) to search
## Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
## 
## SIS block 4 of 7:
## 29 path(s) to search
## Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
## 
## SIS block 5 of 7:
## 29 path(s) to search
## Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
## 
## SIS block 6 of 7:
## 29 path(s) to search
## Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
## 
## SIS block 7 of 7:
## 23 path(s) to search
## Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
## 
## GETS of union of retained SIS variables... 
## 
## GETS of union of ALL retained variables...
## - All non-keep regressors significant in GUM

## 
## Date: Tue Apr 20 22:55:38 2021 
## Dependent var.: y 
## Method: Ordinary Least Squares (OLS)
## Variance-Covariance: Ordinary 
## No. of observations (mean eq.): 198 
## Sample: 1951(3) to 2000(4) 
## 
## SPECIFIC mean equation:
## 
##                  coef  std.error  t-stat   p-value    
## mconst      0.0088766  0.0019843  4.4734 1.327e-05 ***
## ar1         0.2362204  0.0634768  3.7214 0.0002613 ***
## ar5        -0.1491751  0.0606418 -2.4599 0.0147947 *  
## lag-1      -0.2422254  0.0941238 -2.5735 0.0108346 *  
## lag-5      -0.0636198  0.0910756 -0.6985 0.4856991    
## sis1957 Q4 -0.0247461  0.0061149 -4.0468 7.563e-05 ***
## sis1958 Q2  0.0276007  0.0059791  4.6162 7.203e-06 ***
## sis1978 Q2  0.0333963  0.0083942  3.9785 9.874e-05 ***
## sis1978 Q3 -0.0349313  0.0083730 -4.1719 4.602e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Diagnostics and fit:
## 
##                    Chi-sq df p-value
## Ljung-Box AR(5)   5.77584  5  0.3286
## Ljung-Box ARCH(1) 0.58161  1  0.4457
## Jarque-Bera       2.91918  2  0.2323
##                           
## SE of regression   0.00828
## R-squared          0.30302
## Log-lik.(n=198)  672.68120

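After refining the model step by step according to the methodology, we keep the AR terms retained by gets (lags 1 and 5 of the log-differences of GDP) and add lags 1 and 5 of the log-differences of CPI as regressors. Note that gets retained lag 2 of CPI, which is not part of this final specification. The model is better specified than the pure SIS model: only 4 step indicators remain instead of 20, and the Ljung-Box, ARCH and Jarque-Bera diagnostics do not reject. The remaining indicators come in pairs of opposite sign and similar size (1957 Q4 and 1958 Q2; 1978 Q2 and 1978 Q3), so each pair largely cancels out after one or two quarters. This points to short-lived outliers rather than permanent structural breaks, which is consistent with the QLR result, and the intercept looks very stable over time.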
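Why a pair of opposite-signed step indicators behaves like an outlier can be checked with the two 1978 coefficients from the final isat output. The sketch assumes that a step indicator is 0 before its date and 1 from that date onward; the helper `step_indicator` and the 8-quarter window are hypothetical.

```r
# Step indicator: 0 before `start`, 1 from `start` onward (assumed SIS definition)
step_indicator <- function(n, start) as.numeric(seq_len(n) >= start)

n <- 8                      # hypothetical window of 8 quarters; 1978 Q2 is position 3
coef_1978q2 <-  0.0333963   # sis1978 Q2 from the final isat output
coef_1978q3 <- -0.0349313   # sis1978 Q3 from the final isat output

# Combined effect of the two indicators on the mean of the series
net_shift <- coef_1978q2 * step_indicator(n, 3) +
  coef_1978q3 * step_indicator(n, 4)
print(round(net_shift, 4))
```

Under this assumption the mean jumps by about 0.033 for a single quarter and then settles about 0.0015 below its previous level, which is a one-period spike rather than a lasting shift.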

Predictive Analytics - Assignment 03

Simon Christensen, Egle Sepp, Fabio Staub & Laura Valente
