For this lab, we will be using the tidyverse
and the lmtest
packages. We need the lmtest
package to do a Wald Test. We will be using the data for Problem Set 3.
library(pacman)
p_load(tidyverse, lmtest)
df <- read_csv("/Users/jenniputz/Downloads/003-data.csv")
## Parsed with column specification:
## cols(
## t = col_double(),
## date = col_date(format = ""),
## year = col_double(),
## month = col_double(),
## price_electricity = col_double(),
## price_coal = col_double(),
## price_gas = col_double()
## )
names(df)
## [1] "t" "date" "year"
## [4] "month" "price_electricity" "price_coal"
## [7] "price_gas"
A static model assumes that the impact of any given variable is only effecting the outcome today, whereas a dynamic implies that the variable impacts the outcome tomorrow. Let’s estimate a static model where price_gas
is the outcome variable and price_coal
and price_electricity
are the explanatory variables.
\[Price_{gas, t} = \beta_0 + \beta_1 Price_{coal, t} + \beta_2 Price_{electricity, t} + u_t\] To estimate this model, we use our lm()
function and run OLS.
static_mod = lm(price_gas ~ price_coal + price_electricity, data = df)
summary(static_mod)
##
## Call:
## lm(formula = price_gas ~ price_coal + price_electricity, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8971 -0.6769 -0.1044 0.4524 6.5208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.30756 2.01523 7.596 4.56e-12 ***
## price_coal 0.27599 0.06135 4.499 1.46e-05 ***
## price_electricity -1.08663 0.15365 -7.072 7.47e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.174 on 135 degrees of freedom
## Multiple R-squared: 0.3424, Adjusted R-squared: 0.3327
## F-statistic: 35.15 on 2 and 135 DF, p-value: 5.152e-13
Here, we must believe that the price of coal and the price of electricity immediately affects the price of gas and that the current price of gas does not depend on previous prices of coal or electricity. Static models also do not allow for a persistent effect. This doesn’t seem like a very realistic way of depicting this relationship so we should instead model this using a dynamic model.
What if we think price_gas depends on past values of coal and electricity prices, or even past values of gas prices? Then we need to estimate a dynamic model. Let’s estimate a dynamic model with one lag of price_coal and one lag of price_electricity. This represents the price of coal and price of electricity one month ago. The model we are trying to estimate looks like:
\[ Price_{gas, t} = \beta_0 + \beta_1 Price_{coal, t} + \beta_2 Price_{electricity, t} + Price_{coal, t-1} + Price_{electricity, t-1}+ u_t\] To make the lagged terms, we can use the lag()
function right inside our regression. We write: lag(variable, n) where n is the order of the lag, i.e. n = 1 corresponds to one period ago, n = 2 corresponds to two periods ago, and so on.
dynamic_mod1 <- lm(price_gas ~ price_coal + price_electricity +
lag(price_coal, 1) + lag(price_electricity, 1), data = df)
summary(dynamic_mod1)
##
## Call:
## lm(formula = price_gas ~ price_coal + price_electricity + lag(price_coal,
## 1) + lag(price_electricity, 1), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6897 -0.7204 -0.0919 0.4661 6.4855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.4916 1.8810 8.236 1.53e-13 ***
## price_coal 0.1394 0.1351 1.031 0.304
## price_electricity -0.5495 0.3754 -1.464 0.146
## lag(price_coal, 1) 0.1092 0.1342 0.814 0.417
## lag(price_electricity, 1) -0.5302 0.3769 -1.407 0.162
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.07 on 132 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3648, Adjusted R-squared: 0.3456
## F-statistic: 18.96 on 4 and 132 DF, p-value: 2.452e-12
Notice that we get quite different results in the dynamic model. Here, the price of coal and the price of electricity affects the price of gas immediately and in future periods. We can estimate the total effect of the price of coal by summing \(\beta_1 + \beta_3\) and the total effect of the price of electricity by summing \(\beta_2 + \beta_4\). Dynamic models with lagged explanatory variables (but not lagged outcome variables) are unbiased but inefficient.
Next, let’s try estimating a model with two-period lags. To do this, we use the lag()
function and set n = 2. We estimate this dynamic model: \[ Price_{gas, t} = \beta_0 + \beta_1 Price_{coal, t} + \beta_2 Price_{electricity, t} + Price_{coal, t-1} + Price_{electricity, t-1}+ Price_{coal, t-2} + Price_{electricity, t-2} + u_t\]
dynamic_mod2 <- lm(price_gas ~ price_coal + price_electricity +
lag(price_coal, 1) + lag(price_electricity, 1) +
lag(price_coal, 2) + lag(price_electricity, 2), data = df)
summary(dynamic_mod2)
##
## Call:
## lm(formula = price_gas ~ price_coal + price_electricity + lag(price_coal,
## 1) + lag(price_electricity, 1) + lag(price_coal, 2) + lag(price_electricity,
## 2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5491 -0.6232 -0.0692 0.4811 4.6597
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.28206 1.62337 10.030 < 2e-16 ***
## price_coal 0.16367 0.11612 1.410 0.16108
## price_electricity -0.98405 0.33119 -2.971 0.00354 **
## lag(price_coal, 1) 0.02286 0.16501 0.139 0.89001
## lag(price_electricity, 1) 0.67222 0.51484 1.306 0.19399
## lag(price_coal, 2) 0.02361 0.11348 0.208 0.83554
## lag(price_electricity, 2) -0.80052 0.33545 -2.386 0.01847 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8942 on 129 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.4369, Adjusted R-squared: 0.4107
## F-statistic: 16.68 on 6 and 129 DF, p-value: 3.54e-14
How do we interpret the coefficeint on \(Price_{electricity,t-2}\)? When the price of electricity two months ago increases by 1 dollar, this month’s price of gas decreases by .08, on average, holding all else constant. This is significant at the 5% level.
We can also add lags of our outcome variable. we do this the same way as we did with the lags of explanatory variables. Now, we are estimating the model: \[Price_{gas, t} = \beta_0 + \beta_1 Price_{coal, t} + \beta_2 Price_{electricity, t} + Price_{coal, t-1} + Price_{electricity, t-1} + Price_{gas, t-1} + u_t\]
dynamic_mod3 <- lm(price_gas ~ price_coal + price_electricity +
lag(price_coal, 1) + lag(price_electricity, 1) +
lag(price_gas, 1), data = df)
summary(dynamic_mod3)
##
## Call:
## lm(formula = price_gas ~ price_coal + price_electricity + lag(price_coal,
## 1) + lag(price_electricity, 1) + lag(price_gas, 1), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95701 -0.20489 0.00535 0.25621 1.59325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.25097 0.78618 2.863 0.00489 **
## price_coal 0.09292 0.04730 1.965 0.05158 .
## price_electricity -0.04450 0.13235 -0.336 0.73721
## lag(price_coal, 1) -0.08891 0.04738 -1.876 0.06282 .
## lag(price_electricity, 1) -0.08616 0.13265 -0.650 0.51711
## lag(price_gas, 1) 0.85393 0.02774 30.781 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3743 on 131 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9229, Adjusted R-squared: 0.9199
## F-statistic: 313.4 on 5 and 131 DF, p-value: < 2.2e-16
One thing to note: dynamic models with lagged outcome variables are biased in OLS.
How do we decide which models are better? We can use a wald test to compare two models (this is an F-test). Let’s compare dynamic_mod1
to our static model. To do this, we want to check to see if the coefficients on the lagged variables are significant.
The waldtest
function in R is sometimes fussy, so I am going to make new variables for the lagged terms and re-run the regression using those. Recall, we can use mutate()
from the tidyverse
package to create new variables in our data frame.
df2 <- df %>% mutate(
lag_coal = c(lag(price_coal, 1)),
lag_elec = c(lag(price_electricity, 1))
)
Now, we just run the same regression as dynamic_mod1 but use the new variables instead of lag()
. Then we do the Wald Test.
reg <- lm(price_gas ~ price_coal + price_electricity + lag_coal + lag_elec, data = df2)
waldtest(reg, c("lag_coal", "lag_elec"), test = "F")
## Wald test
##
## Model 1: price_gas ~ price_coal + price_electricity + lag_coal + lag_elec
## Model 2: price_gas ~ price_coal + price_electricity
## Res.Df Df F Pr(>F)
## 1 132
## 2 134 -2 1.2429 0.2919
The Wald Test is checking to see if both of the lagged coefficients are equal to zero, i.e. \(H_0: \beta_3 = \beta_4 = 0\). The alternative hypothesis is that at least one of these coefficients is not equal to zero. The p-value from the Wald Test is equal to .29 so we fail to reject the null hypothesis at 5%. Therefore, we should use the dynamic model over the static model.
Autocorrelation, or serial correlation, occurs when our distribances are correlated over time. That means that the shock from disturbance t is related to the shocks in t-1 and t+1. For static models or dynamic models with lagged explanatory variables, autocorrelation means that we will have unbiased estimates but OLS will be inefficient because we get biased standard errors (the same issues as heteroskedasticity!).
Let’s check and see if we have autocorrelation in our first dynamic model that we ran earlier. We need to get the residuals and the lag of the residuals. I think it is easiest to make these as new variables in the dataframe so that we can use them in the plots later.
df <- df %>% mutate(e_1 = c(NA,resid(dynamic_mod1))) %>% mutate(e_lag = c(NA, lag(resid(dynamic_mod1))))
#note we need to put an NA there because we have 1 less residuals than data points
First let’s plot residuals over time.
ggplot(data = df, aes(x = t, y = e_1)) + geom_point(color = "purple") + geom_line(color = "black")+ theme_minimal() +
labs(x = "Month", y = "Errors")
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
Visual inspection indicates that it looks like the errors are correlated with time. Let’s plot the errors and the lagged errors to be sure.
ggplot(data = df, aes(x = e_1, y = e_lag)) + geom_point(color = "purple") + geom_line(color = "black")+ theme_minimal() +
labs(x = "Errors in time t", y = "Errors in time t-1")
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_path).
Looks like we have a positive relationship between the erros in time t and the errors in time t-1, indicating that we have positive autocorrelation.
For models with lagged outcome variables, autocorrelation creates bias and inconsistency because contemporaneous exogeneity is violated. Let’s check the dynamic model we did that included a lag of \(Price_{gas}\).
# create new variables for the residuals and lagged residuals
df <- df %>% mutate(e_gas = c(NA,resid(dynamic_mod3))) %>% mutate(e_gas_lag = c(NA, lag(resid(dynamic_mod3))))
ggplot(data = df, aes(x = t, y = e_gas)) + geom_point(color = "purple") + geom_line(color = "black")+ theme_minimal() +
labs(x = "Month", y = "Errors")
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
ggplot(data = df, aes(x = e_gas, y = e_gas_lag)) + geom_point(color = "purple") + geom_line(color = "black")+ theme_minimal() +
labs(x = "Errors in time t", y = "Errors in time t-1")
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_path).
Here, it is significantly more difficult to tell if we have autocorrelation but it looks like we have negative autocorrelation (large positives followed by large negatives and vice versa).