library(readxl)

1. Use the data in HSEINV for this question.

# Gets the data  
HSEINV = read_excel("hseinv.xls")

(a) Find the correlation between \(log(invpc_t)\) and \(log(invpc_{t−1})\)

# Gets the correlation
a1 = cor(x = HSEINV$linvpc, 
        y = HSEINV$linvpc_1)
a1
## [1] 0.6391246

The correlation between \(log(invpc_t)\) and \(log(invpc_{t−1})\) is 0.6391246.

(b) Find the correlation between \(ε_t\) and \(ε_{t−1}\) in the following model. Do you find a different result than in part (a)? Why?

\[log(invpc_t) = \beta_0 + \beta_1t + ε_t\]

# Creates the model
b1 = lm(HSEINV$linvpc ~ HSEINV$t)
# Gets the residuals
e1 = b1$residuals
# Creates the lagged residual; the first element is NA since there is no prior observation
lag1 <- c(NA, head(e1, -1))
# Gets the correlation (NA here because lag1 contains an NA)
b1.2 = cor(e1, lag1)
b1.2
## [1] NA

Yes, the result differs from part (a). Part (a) measures the correlation between the series and its own lag, while part (b) measures the correlation between the error terms of the detrended model, so the two need not agree. Note that cor() returns NA above because the first element of lag1 is NA; the sketch below shows how to drop that pair.
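A minimal sketch of the corrected call, assuming the objects e1 and lag1 created above:

# Sketch: drop the NA pair introduced by lagging before computing the correlation
cor(e1, lag1, use = "complete.obs")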

(c) Find the correlation between \(log(invpc_t)\) and \(log(price_t)\). Also find the correlation between \(ε_t\) in part (b) and \(u_t\) in the following model:

\[log(price_t) = \beta_0 + \beta_1t + u_t\]

# correlation between log(invpc_t) and log(price_t)
c1 = cor(x = HSEINV$linvpc, 
        y = HSEINV$lprice)
c1
## [1] 0.4139445
# creates the model log(price) ~ t
c1b= lm(HSEINV$lprice ~ HSEINV$t)
# Gets the residuals 
c1br = c1b$residuals
# Gets the correlation between the two residual series
c1bc = cor(x = e1, 
       y = c1br)
c1bc
## [1] -0.1030907

The correlation between \(log(invpc_t)\) and \(log(price_t)\) is 0.4139445. The correlation between \(ε_t\) in part (b) and \(u_t\) is -0.1030907.

(d) Estimate the following model and report the results in standard form. Interpret the coefficient \(\hat{β}_1\) and determine whether it is statistically significant.

\(log(invpc_t)\) is the log of per capita investment. \(log(price_t)\) is the log of the housing price index.

\[log(invpc_t) = \beta_0 + \beta_1 log(price_t) + ε_t\]

lincpc = lm(HSEINV$linvpc ~ HSEINV$lprice)
summary(lincpc)
## 
## Call:
## lm(formula = HSEINV$linvpc ~ HSEINV$lprice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45320 -0.09201 -0.01156  0.09230  0.33922 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.55694    0.04327  -12.87 1.28e-15 ***
## HSEINV$lprice  1.12203    0.39511    2.84  0.00714 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1549 on 39 degrees of freedom
## Multiple R-squared:  0.1714, Adjusted R-squared:  0.1501 
## F-statistic: 8.065 on 1 and 39 DF,  p-value: 0.007136

The slope coefficient is an elasticity: because both variables are in logs, \(\hat{\beta}_1\) is the percentage change in per capita investment associated with a one percent change in the housing price index. The estimate of 1.12203 implies that a 1% increase in the housing price index is associated with roughly a 1.12% increase in per capita investment.

In multiplier terms, \(\hat{\beta}_1\) scales changes in \(log(price)\): a change of 3 in \(log(price)\) corresponds to a change of 3 \(\times \hat{\beta}_1\) in \(log(invpc)\).

According to the results above, the \(\hat{\beta}_1\) estimate of 1.12203 is statistically significant at the 1% level (t = 2.84, p = 0.0071).

(e) Estimate the following equation and report the results in standard form. Interpret the coefficient \(\hat{β}_1\) and determine whether it is statistically significant.

\[log(invpc_t) = \beta_0 + \beta_1 log(price_t) + \beta_2t + e_t\]

e1 = lm(HSEINV$linvpc ~ HSEINV$lprice + HSEINV$t)
summary(e1)
## 
## Call:
## lm(formula = HSEINV$linvpc ~ HSEINV$lprice + HSEINV$t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44559 -0.09466 -0.02289  0.10311  0.25641 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.908578   0.135627  -6.699  6.3e-08 ***
## HSEINV$lprice -0.434841   0.680611  -0.639  0.52672    
## HSEINV$t       0.009556   0.003521   2.714  0.00993 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1436 on 38 degrees of freedom
## Multiple R-squared:  0.3059, Adjusted R-squared:  0.2694 
## F-statistic: 8.374 on 2 and 38 DF,  p-value: 0.0009704

\(\hat{β}_1\) is the elasticity of per capita investment with respect to the housing price index, holding the time trend fixed. With the trend included, the estimate falls to -0.4348 and is not statistically significant (p = 0.527).

(f) Can you use the results in part (c) to explain the different results in parts (d) and (e)?

Yes. Part (c) shows that \(log(invpc_t)\) and \(log(price_t)\) are positively correlated (0.4139), largely because both series trend upward, while their detrended errors are only weakly correlated (-0.1031). The simple regression in part (d) therefore mostly picks up the common trend, which is why \(\hat{β}_1\) is large and significant in part (d) but small and insignificant once the trend is controlled for in part (e). A sketch of this check follows.
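By the Frisch-Waugh result, regressing the detrended \(log(invpc)\) on the detrended \(log(price)\) reproduces the price coefficient from part (e). A minimal sketch, assuming the fitted models b1 and c1b from above:

# Sketch: the slope here should match the HSEINV$lprice coefficient in part (e)
summary(lm(b1$residuals ~ c1b$residuals))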

2. The data Durable represents the Personal Consumption Expenditure of Durable Goods.

Durable = read_excel("Durable.xls")
library(forecast)
library(tseries)

(a) Plot the ACF and PACF of the data. Do you think this data can be stationary?

# Plot Personal Consumption Expenditure of Durable Goods
plot(Durable)

# ACF
acf(Durable$PCEDG) # The ACF is statistically significant and decays very slowly.

# PACF of durable goods; it cuts off after the first lag.
pacf(Durable$PCEDG)

The ACF decays very slowly and remains statistically significant at long lags, while the PACF cuts off after the first lag, which is the pattern of a pure AR process. Because the ACF decays so slowly, I do not think this data is stationary.
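As an additional informal check (a sketch using the tseries package already loaded above), an Augmented Dickey-Fuller test could be run on the level series; a large p-value would be consistent with non-stationarity.

# Sketch: ADF unit-root test on the level of durable goods expenditure
adf.test(Durable$PCEDG)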

(b) Construct the first difference of the data. Do you want to use level difference or log difference? Plot the ACF and PACF of the first difference of the data, what kind of model do you think will fit the data best (AR, MA, or ARMA)?

# Creates the log first difference (approximate percentage growth)
pc_fd = (log(tail(Durable$PCEDG,-1)) - log(head(Durable$PCEDG,-1)))*100
#Ploting first difference
plot.ts(pc_fd)

# Plotting ACF
acf(pc_fd) # The ACF of the differenced series is small at all lags, which suggests the data may be over-differenced.

pacf(pc_fd) # The PACF of the differenced series is noisy.

A log difference is a better choice than a level difference because it has a percentage-growth interpretation. However, first-differencing appears to over-difference the data: the ACF of the differenced series is small everywhere and the PACF is noisy. The ACF and PACF of the original series therefore remain the basis for the earlier conclusion that a pure AR model fits best. A sketch of an equivalent construction of the series follows.
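For reference, a minimal sketch showing that the same growth-rate series can be built with diff():

# Sketch: equivalent construction of the log growth rate (in percent) using diff()
pc_fd_alt = 100 * diff(log(Durable$PCEDG))
all.equal(as.numeric(pc_fd_alt), pc_fd)  # should return TRUE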

(c) Use the first difference data you constructed in part (b) to estimate the following model. Which model gives us the the smallest BIC?

\[y_t = α_0 + α_1y_{t−1} + ε_t\]

\[y_t = α_0 + ε_t + β_1ε_{t−1} + β_2ε_{t−2}\]

\[y_t = α_0 + α_1y_{t−1} + ε_t + β_1ε_{t−1} + β_2ε_{t−2}\]

# Estimate the model $$y_t = α_0 + α_1y_{t−1} + ε_t$$ 
ind_ts1 <- arima(pc_fd, order = c(1,0,0)) # order = c(p, d, q): AR order, degree of differencing, MA order.
# BIC 
AIC(ind_ts1, k = log(length(pc_fd)))
## [1] 3481.769
# estimates the model $$y_t = α_0 + ε_t + β_1ε_{t−1} + β_2ε_{t−2}$$
ind_ts2 <- arima(pc_fd, order = c(0,0,2))
AIC(ind_ts2, k = log(length(pc_fd)))
## [1] 3464.294
# Estimates the model $$y_t = α_0 + α_1y_{t−1} + ε_t + β_1ε_{t−1} + β_2ε_{t−2}$$
ind_ts3 <- arima(pc_fd, order = c(1,0,2))
AIC(ind_ts3, k = log(length(pc_fd)))
## [1] 3469.677

The model \(y_t = α_0 + α_1y_{t−1} + ε_t\) BIC is 3481.769.
The model \(y_t = α_0 + ε_t + β_1ε_{t−1} + β_2ε_{t−2}\) BIC is 3464.294.
The model \(y_t = α_0 + α_1y_{t−1} + ε_t + β_1ε_{t−1} + β_2ε_{t−2}\) BIC is 3469.677.

Thus, the model with the smallest BIC is \(y_t = α_0 + ε_t + β_1ε_{t−1} + β_2ε_{t−2}\) with a BIC of 3464.294.
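The same values should also be reproducible directly with the built-in BIC() function, since arima fits carry a log-likelihood (a sketch):

# Sketch: BIC() should match the values computed above via AIC(..., k = log(n))
BIC(ind_ts1)
BIC(ind_ts2)
BIC(ind_ts3)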

(d) Use Ljung-Box test to test serial correlations for the model that gives us the smallest BIC in part (c). Set \(K=12\). What is your conclusion?

# First get the residuals. Test whether there is serial correlation in residuals over time. 
res_ind_ts2 = ind_ts2$residuals
# Creates the test
Box.test(res_ind_ts2, lag = 12, type = c("Ljung-Box"), fitdf = 2) # lag = k
## 
##  Box-Ljung test
## 
## data:  res_ind_ts2
## X-squared = 22.015, df = 10, p-value = 0.01503

The p-value (0.01503) is below the 5% significance level, so we reject the null hypothesis of no serial correlation: the residuals of the MA(2) model are still serially correlated.
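As a quicker diagnostic (a sketch using checkresiduals() from the forecast package loaded above), the residual plots and a Ljung-Box test can be produced in one call:

# Sketch: residual plots plus a Ljung-Box test for the selected MA(2) model
checkresiduals(ind_ts2, lag = 12)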

(e) Under a rolling window framework with window size = 100, test whether the following two models have different predictive power (use the Diebold-Mariano test).

# Rolling Window Forecast
N3 = length(Durable$PCEDG)
F_starte = 100
spread_fct1 = matrix(0,N3,1)
spread_fct2 = matrix(0,N3,1)

for(i in F_starte:(N3-1))

{
  
  is3 <- arima(Durable$PCEDG[(i-F_starte+1):i], order = c(1,0,0))
  a = forecast(is3,h=10)
  b = a$mean
  spread_fct1[i+1]=b[1]
  
  is4 <- arima(Durable$PCEDG[(i-F_starte+1):i], order = c(0,0,2))
  a = forecast(is4,h=10)
  b = a$mean
  spread_fct2[i+1]=b[1]
  
}

# Forecast errors; the first forecast is for observation F_starte + 1
fct1_err = Durable$PCEDG[(F_starte+1):N3] - spread_fct1[(F_starte+1):N3]
fct2_err = Durable$PCEDG[(F_starte+1):N3] - spread_fct2[(F_starte+1):N3]

# Diebold-Mariano Test: the null hypothesis is that the two forecasts have equal accuracy

dm.test(fct1_err,fct2_err)

(f) Use CUSUM test to check whether there is structural change in the data. Use MA(2) model and set F START = 100. What is your conclusion? Informally, when is the structural change point?

# CUSUM test
# Expanding Window Forecast

# Creates the forecast
N = length(Durable$PCEDG)
F_start = 100
spread_fct1 = matrix(0,N,1)
for(i in F_start:(N-1)){
  is3 <- arima(Durable$PCEDG[1:i], order = c(0,0,2))
  a = forecast(is3,h=10)
  b = a$mean
  spread_fct1[i+1]=b[1]
}

# Calculates the forecast errors for the out-of-sample period
fct1_err = Durable$PCEDG[(F_start+1):N] - spread_fct1[(F_start+1):N]


cusum1 = matrix(0,N-F_start,1)
upbd = matrix(0,N-F_start,1)
lwbd = matrix(0,N-F_start,1)
for(i in 1:(N-F_start)){
  cusum1[i] = sum(fct1_err[1:i])/sd(fct1_err)
  upbd[i] = 0.948*((N-F_start)^0.5+2*(i+1)*(N-F_start)^(-0.5))
  lwbd[i] = -0.948*((N-F_start)^0.5+2*(i+1)*(N-F_start)^(-0.5))
}

plot.ts(cusum1,ylim = c(-100,100))
par(new=TRUE)
plot.ts(upbd,ylim = c(-100,100))
par(new=TRUE)
plot.ts(lwbd,ylim = c(-100,100))
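For readability, the bounds could be overlaid on a single set of axes instead of repeated par(new = TRUE) calls (a sketch):

# Sketch: CUSUM statistic with the upper and lower bounds drawn on one set of axes
plot.ts(cusum1, ylim = c(-100,100), ylab = "CUSUM")
lines(as.numeric(upbd), lty = 2)
lines(as.numeric(lwbd), lty = 2)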

The test statistic crosses the upper bound at roughly observations 180-200. At that point the cumulative forecast errors become too large to be consistent with a stable model, so the MA(2) model stops being the right model for the later part of the sample. We reject the null hypothesis of no structural change: the data has a structural break, informally around observations 180-200.

3. The data BC represents the Bank Credit of All Commercial Banks.

# Reads xls
BC = read_excel("BC.xls")

(a) Plot the ACF and PACF of the data. Do you think this data can be stationary? If yes, what kind of model do you think will fit the data best (AR, MA, or ARMA)?

# Plots Bank Credit of All Commercial Banks
plot.ts(BC$BCR)

# ACF of Bank Credit of All Commercial Banks 
acf(BC$BCR)

# PACF of Bank Credit of All Commercial Banks
pacf(BC$BCR)

Based on the ACF and PACF plots above, this data can be stationary, and an ARMA model would fit it best.

(b) Estimate the following models using the original data. Which model gives us the smallest BIC?

\[y_t = α_0 + α_1y_{t−1} + ε_t\]

\[y_t = α_0 + ε_t + β_1ε_{t−1}\]

\[y_t = α_0 + α_1y_{t−1} + ε_t + β_1ε_{t−1}\]

# Estimates the AR(1) model
b.3 <- arima(BC$BCR, order = c(1,0,0)) # order = c(p, d, q): AR order, degree of differencing, MA order.
# BIC 
AIC(b.3, k = log(length(BC$BCR)))
## [1] 1645.195
# Estimates the MA(1) model
b.3.b <- arima(BC$BCR, order = c(0,0,1))
# BIC 
AIC(b.3.b, k = log(length(BC$BCR)))
## [1] 1678.615
# Estimates the ARMA(1,1) model
b.3.c <- arima(BC$BCR, order = c(1,0,1))
# BIC 
AIC(b.3.c, k = log(length(BC$BCR)))
## [1] 1650.808

\(y_t = α_0 + α_1y_{t−1} + ε_t\) BIC is 1645.195.
\(y_t = α_0 + ε_t + β_1ε_{t−1}\) BIC is 1678.615.
\(y_t = α_0 + α_1y_{t−1} + ε_t + β_1ε_{t−1}\) BIC is 1650.808.

The model with the smallest BIC is the first one \(y_t = α_0 + α_1y_{t−1} + ε_t\) with a BIC of 1645.195.
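As an informal cross-check (a sketch using auto.arima() from the forecast package loaded above), the order selected automatically by BIC could be compared with this choice:

# Sketch: let auto.arima search ARMA orders for the level series using BIC
auto.arima(BC$BCR, d = 0, ic = "bic")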

(c) Use CUSUM test to check whether there is structural change in the data. Use AR(1) model and set F START = 200. What is your conclusion? Informally, when is the structural change point?

# CUSUM test
# Expanding Window Forecast

N2 = length(BC$BCR)
F_start = 200
spread_fct1 = matrix(0,N2,1)
for(i in F_start:(N2-1)){
  is3 <- arima(BC$BCR[1:i], order = c(1,0,0))
  a = forecast(is3,h=10)
  b = a$mean
  spread_fct1[i+1]=b[1]
}

# Calculates the forecast errors for the out-of-sample period
fct1_err = BC$BCR[(F_start+1):N2] - spread_fct1[(F_start+1):N2]

cusum1 = matrix(0,N2-F_start,1)
upbd = matrix(0,N2-F_start,1)
lwbd = matrix(0,N2-F_start,1)
for(i in 1:(N2-F_start)){
  cusum1[i] = sum(fct1_err[1:i])/sd(fct1_err)
  upbd[i] = 0.948*((N2-F_start)^0.5+2*(i+1)*(N2-F_start)^(-0.5))
  lwbd[i] = -0.948*((N2-F_start)^0.5+2*(i+1)*(N2-F_start)^(-0.5))
}

plot.ts(cusum1,ylim = c(-100,100))
par(new=TRUE)
plot.ts(upbd,ylim = c(-100,100))
par(new=TRUE)
plot.ts(lwbd,ylim = c(-100,100))

The cumulative sum of the forecast errors stays between the lower and upper bounds, so we fail to reject the null hypothesis of no structural change. There is no structural change point in this data.