Setup

For this lab, we will be using the tidyverse and the lmtest packages. We need the lmtest package to do a Wald Test. We will be using the data for Problem Set 3.

library(pacman)
p_load(tidyverse, lmtest)

df <- read_csv("/Users/jenniputz/Downloads/003-data.csv")

## Parsed with column specification:
## cols(
##   t = col_double(),
##   date = col_date(format = ""),
##   year = col_double(),
##   month = col_double(),
##   price_electricity = col_double(),
##   price_coal = col_double(),
##   price_gas = col_double()
## )

names(df)

## [1] "t"                 "date"              "year"             
## [4] "month"             "price_electricity" "price_coal"       
## [7] "price_gas"

Static Models

A static model assumes that the impact of any given variable is only effecting the outcome today, whereas a dynamic implies that the variable impacts the outcome tomorrow. Let’s estimate a static model where price_gas is the outcome variable and price_coal and price_electricity are the explanatory variables.

\[Price_{gas, t} = \beta_0 + \beta_1 Price_{coal, t} + \beta_2 Price_{electricity, t} + u_t\] To estimate this model, we use our lm() function and run OLS.

static_mod = lm(price_gas ~ price_coal + price_electricity, data = df)
summary(static_mod)

## 
## Call:
## lm(formula = price_gas ~ price_coal + price_electricity, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8971 -0.6769 -0.1044  0.4524  6.5208 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       15.30756    2.01523   7.596 4.56e-12 ***
## price_coal         0.27599    0.06135   4.499 1.46e-05 ***
## price_electricity -1.08663    0.15365  -7.072 7.47e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.174 on 135 degrees of freedom
## Multiple R-squared:  0.3424, Adjusted R-squared:  0.3327 
## F-statistic: 35.15 on 2 and 135 DF,  p-value: 5.152e-13

Here, we must believe that the price of coal and the price of electricity immediately affects the price of gas and that the current price of gas does not depend on previous prices of coal or electricity. Static models also do not allow for a persistent effect. This doesn’t seem like a very realistic way of depicting this relationship so we should instead model this using a dynamic model.

Dynamic Models

What if we think price_gas depends on past values of coal and electricity prices, or even past values of gas prices? Then we need to estimate a dynamic model. Let’s estimate a dynamic model with one lag of price_coal and one lag of price_electricity. This represents the price of coal and price of electricity one month ago. The model we are trying to estimate looks like:

\[ Price_{gas, t} = \beta_0 + \beta_1 Price_{coal, t} + \beta_2 Price_{electricity, t} + Price_{coal, t-1} + Price_{electricity, t-1}+ u_t\] To make the lagged terms, we can use the lag() function right inside our regression. We write: lag(variable, n) where n is the order of the lag, i.e. n = 1 corresponds to one period ago, n = 2 corresponds to two periods ago, and so on.

dynamic_mod1 <- lm(price_gas ~ price_coal + price_electricity + 
                     lag(price_coal, 1) + lag(price_electricity, 1), data = df)
summary(dynamic_mod1)

## 
## Call:
## lm(formula = price_gas ~ price_coal + price_electricity + lag(price_coal, 
##     1) + lag(price_electricity, 1), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6897 -0.7204 -0.0919  0.4661  6.4855 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                15.4916     1.8810   8.236 1.53e-13 ***
## price_coal                  0.1394     0.1351   1.031    0.304    
## price_electricity          -0.5495     0.3754  -1.464    0.146    
## lag(price_coal, 1)          0.1092     0.1342   0.814    0.417    
## lag(price_electricity, 1)  -0.5302     0.3769  -1.407    0.162    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.07 on 132 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3648, Adjusted R-squared:  0.3456 
## F-statistic: 18.96 on 4 and 132 DF,  p-value: 2.452e-12

Notice that we get quite different results in the dynamic model. Here, the price of coal and the price of electricity affects the price of gas immediately and in future periods. We can estimate the total effect of the price of coal by summing \(\beta_1 + \beta_3\) and the total effect of the price of electricity by summing \(\beta_2 + \beta_4\). Dynamic models with lagged explanatory variables (but not lagged outcome variables) are unbiased but inefficient.

Next, let’s try estimating a model with two-period lags. To do this, we use the lag() function and set n = 2. We estimate this dynamic model: \[ Price_{gas, t} = \beta_0 + \beta_1 Price_{coal, t} + \beta_2 Price_{electricity, t} + Price_{coal, t-1} + Price_{electricity, t-1}+ Price_{coal, t-2} + Price_{electricity, t-2} + u_t\]

dynamic_mod2 <- lm(price_gas ~ price_coal + price_electricity + 
                     lag(price_coal, 1) + lag(price_electricity, 1) +
                     lag(price_coal, 2) + lag(price_electricity, 2), data = df)
summary(dynamic_mod2)

## 
## Call:
## lm(formula = price_gas ~ price_coal + price_electricity + lag(price_coal, 
##     1) + lag(price_electricity, 1) + lag(price_coal, 2) + lag(price_electricity, 
##     2), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5491 -0.6232 -0.0692  0.4811  4.6597 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               16.28206    1.62337  10.030  < 2e-16 ***
## price_coal                 0.16367    0.11612   1.410  0.16108    
## price_electricity         -0.98405    0.33119  -2.971  0.00354 ** 
## lag(price_coal, 1)         0.02286    0.16501   0.139  0.89001    
## lag(price_electricity, 1)  0.67222    0.51484   1.306  0.19399    
## lag(price_coal, 2)         0.02361    0.11348   0.208  0.83554    
## lag(price_electricity, 2) -0.80052    0.33545  -2.386  0.01847 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8942 on 129 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.4369, Adjusted R-squared:  0.4107 
## F-statistic: 16.68 on 6 and 129 DF,  p-value: 3.54e-14

How do we interpret the coefficeint on \(Price_{electricity,t-2}\)? When the price of electricity two months ago increases by 1 dollar, this month’s price of gas decreases by .08, on average, holding all else constant. This is significant at the 5% level.

We can also add lags of our outcome variable. we do this the same way as we did with the lags of explanatory variables. Now, we are estimating the model: \[Price_{gas, t} = \beta_0 + \beta_1 Price_{coal, t} + \beta_2 Price_{electricity, t} + Price_{coal, t-1} + Price_{electricity, t-1} + Price_{gas, t-1} + u_t\]

dynamic_mod3 <- lm(price_gas ~ price_coal + price_electricity + 
                     lag(price_coal, 1) + lag(price_electricity, 1) +
                     lag(price_gas, 1), data = df)
summary(dynamic_mod3)

## 
## Call:
## lm(formula = price_gas ~ price_coal + price_electricity + lag(price_coal, 
##     1) + lag(price_electricity, 1) + lag(price_gas, 1), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95701 -0.20489  0.00535  0.25621  1.59325 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.25097    0.78618   2.863  0.00489 ** 
## price_coal                 0.09292    0.04730   1.965  0.05158 .  
## price_electricity         -0.04450    0.13235  -0.336  0.73721    
## lag(price_coal, 1)        -0.08891    0.04738  -1.876  0.06282 .  
## lag(price_electricity, 1) -0.08616    0.13265  -0.650  0.51711    
## lag(price_gas, 1)          0.85393    0.02774  30.781  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3743 on 131 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9229, Adjusted R-squared:  0.9199 
## F-statistic: 313.4 on 5 and 131 DF,  p-value: < 2.2e-16

One thing to note: dynamic models with lagged outcome variables are biased in OLS.

Comparing Models

How do we decide which models are better? We can use a wald test to compare two models (this is an F-test). Let’s compare dynamic_mod1 to our static model. To do this, we want to check to see if the coefficients on the lagged variables are significant.

The waldtest function in R is sometimes fussy, so I am going to make new variables for the lagged terms and re-run the regression using those. Recall, we can use mutate() from the tidyverse package to create new variables in our data frame.

df2 <- df %>% mutate(
  lag_coal = c(lag(price_coal, 1)),
  lag_elec = c(lag(price_electricity, 1))
)

Now, we just run the same regression as dynamic_mod1 but use the new variables instead of lag(). Then we do the Wald Test.

reg <- lm(price_gas ~ price_coal + price_electricity + lag_coal + lag_elec, data = df2)
waldtest(reg, c("lag_coal", "lag_elec"), test = "F")

## Wald test
## 
## Model 1: price_gas ~ price_coal + price_electricity + lag_coal + lag_elec
## Model 2: price_gas ~ price_coal + price_electricity
##   Res.Df Df      F Pr(>F)
## 1    132                 
## 2    134 -2 1.2429 0.2919

The Wald Test is checking to see if both of the lagged coefficients are equal to zero, i.e. \(H_0: \beta_3 = \beta_4 = 0\). The alternative hypothesis is that at least one of these coefficients is not equal to zero. The p-value from the Wald Test is equal to .29 so we fail to reject the null hypothesis at 5%. Therefore, we should use the dynamic model over the static model.

Autocorrelation

Autocorrelation, or serial correlation, occurs when our distribances are correlated over time. That means that the shock from disturbance t is related to the shocks in t-1 and t+1. For static models or dynamic models with lagged explanatory variables, autocorrelation means that we will have unbiased estimates but OLS will be inefficient because we get biased standard errors (the same issues as heteroskedasticity!).

Let’s check and see if we have autocorrelation in our first dynamic model that we ran earlier. We need to get the residuals and the lag of the residuals. I think it is easiest to make these as new variables in the dataframe so that we can use them in the plots later.

df <- df %>% mutate(e_1 = c(NA,resid(dynamic_mod1))) %>% mutate(e_lag = c(NA, lag(resid(dynamic_mod1))))
#note we need to put an NA there because we have 1 less residuals than data points

First let’s plot residuals over time.

ggplot(data = df, aes(x = t, y = e_1)) + geom_point(color = "purple") + geom_line(color = "black")+ theme_minimal() +
  labs(x = "Month", y = "Errors")

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_path).

Visual inspection indicates that it looks like the errors are correlated with time. Let’s plot the errors and the lagged errors to be sure.

ggplot(data = df, aes(x = e_1, y = e_lag)) + geom_point(color = "purple") + geom_line(color = "black")+ theme_minimal() +
  labs(x = "Errors in time t", y = "Errors in time t-1")

## Warning: Removed 2 rows containing missing values (geom_point).

## Warning: Removed 2 rows containing missing values (geom_path).

Looks like we have a positive relationship between the erros in time t and the errors in time t-1, indicating that we have positive autocorrelation.

For models with lagged outcome variables, autocorrelation creates bias and inconsistency because contemporaneous exogeneity is violated. Let’s check the dynamic model we did that included a lag of \(Price_{gas}\).

# create new variables for the residuals and lagged residuals
df <- df %>% mutate(e_gas = c(NA,resid(dynamic_mod3))) %>% mutate(e_gas_lag = c(NA, lag(resid(dynamic_mod3))))

ggplot(data = df, aes(x = t, y = e_gas)) + geom_point(color = "purple") + geom_line(color = "black")+ theme_minimal() +
  labs(x = "Month", y = "Errors")

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_path).

ggplot(data = df, aes(x = e_gas, y = e_gas_lag)) + geom_point(color = "purple") + geom_line(color = "black")+ theme_minimal() +
  labs(x = "Errors in time t", y = "Errors in time t-1")

## Warning: Removed 2 rows containing missing values (geom_point).

## Warning: Removed 2 rows containing missing values (geom_path).

Here, it is significantly more difficult to tell if we have autocorrelation but it looks like we have negative autocorrelation (large positives followed by large negatives and vice versa).

Lab 6: Dynamic Time Series Models

Jenni Putz

2/24/2020

Setup

Static Models

Dynamic Models

Comparing Models

Autocorrelation