You guys know the drill. In one block I’m going to read in the data and use our packages
library(pacman)
#adding lmtest to our list.
p_load(tidyverse, magrittr, ggplot2, lmtest, broom)
gas_oil <- read_csv(filepath)
## Parsed with column specification:
## cols(
## month_year = col_character(),
## month = col_double(),
## year = col_double(),
## price_gasoline = col_double(),
## price_oil = col_double(),
## t_month = col_double(),
## t = col_double()
## )
names(gas_oil)
## [1] "month_year" "month" "year" "price_gasoline"
## [5] "price_oil" "t_month" "t"
Ok. Now we know what we have to work with
What is the difference between a ‘dynamic’ or a ‘static’ model?
A static model assumes that the impact of any given variable is only effecting the outcome today,whereas dynamic implies that the variable impacts the outcome tomorrow. Let’s take a look at this:
Let’s try to estimate the following static model:
To do this in R is straightfoward, we can simply use lm()
static_mod = lm(price_oil ~ price_gasoline, data = gas_oil)
summary(static_mod)
##
## Call:
## lm(formula = price_oil ~ price_gasoline, data = gas_oil)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.872 -4.224 1.187 4.688 16.360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.145 1.578 -21.00 <2e-16 ***
## price_gasoline 39.033 0.603 64.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.97 on 226 degrees of freedom
## Multiple R-squared: 0.9488, Adjusted R-squared: 0.9486
## F-statistic: 4190 on 1 and 226 DF, p-value: < 2.2e-16
What does this say? Well first off, it certainly seems like a higher price of gasoline today would indicate a higher price of oil today. That seems reasonable.And it is quite significant. Anything else?
Now suppose you believe the price of gas might depend on the price of oil today and yesterday. In other words, we are are interested in estimating the following model:
P_t.oil = B0+B1*P_t.gas+P_t-1.gas+B2*P_t-2.gas+u_t
To do this, we need to use a ‘lagged’ version of P_t.gas, a variable which we do not currently have in our data-frame. A lag, really, is a fancy way of saying a version of a variable from before t. You have a few options to do this:
Lets run with option one, and then two:
When you are lagging a single period, the lag() fcn introduces a single NA to the beginning of your vector.
gas_oil$price_gas_t1 = c(NaN, na.omit(lag(gas_oil$price_gasoline,1)))
#When you are lagging two periods, the lag() fcn introduces TWO NAs to the beginning of your #vector.
gas_oil$price_gas_t2 = c(NaN,NaN,na.omit(lag(gas_oil$price_gasoline,2)))
Then you can run your regression. Alternatively, you can pass these lagged values directly to a regression object:
option 2
price_2lag = lm(price_oil ~ price_gasoline + lag(price_gasoline, 1) + lag(price_gasoline, 2),data= gas_oil)
summary(price_2lag)
##
## Call:
## lm(formula = price_oil ~ price_gasoline + lag(price_gasoline,
## 1) + lag(price_gasoline, 2), data = gas_oil)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.2018 -3.7498 0.8291 4.3443 14.5325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.263 1.573 -21.776 < 2e-16 ***
## price_gasoline 47.191 3.114 15.156 < 2e-16 ***
## lag(price_gasoline, 1) -19.824 5.248 -3.777 0.000204 ***
## lag(price_gasoline, 2) 12.103 3.120 3.879 0.000138 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.766 on 222 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.952, Adjusted R-squared: 0.9513
## F-statistic: 1467 on 3 and 222 DF, p-value: < 2.2e-16
Lets run an F-test to compare our two models. To do this in R:
price_2lag = lm(price_oil ~ price_gasoline + lag(price_gasoline, 1) + lag(price_gasoline, 2),data= gas_oil)
#note that there needs to be a space between the comma and the number in the lag function
lmtest::waldtest(price_2lag,c("lag(price_gasoline, 1)","lag(price_gasoline, 2)"), test="F")
## Wald test
##
## Model 1: price_oil ~ price_gasoline + lag(price_gasoline, 1) + lag(price_gasoline,
## 2)
## Model 2: price_oil ~ price_gasoline
## Res.Df Df F Pr(>F)
## 1 222
## 2 224 -2 7.9294 0.0004718 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Try estimating this model: 0 P_t.Oil = B0+B1*P_t.gas+P_t-1.gas+B2*P_t-1.Oil+u_t
price1_gas1_mod <- lm(price_oil ~ lag(price_oil, 1) + price_gasoline + lag(price_gasoline, 1)
, data = gas_oil)
summary(price1_gas1_mod)
##
## Call:
## lm(formula = price_oil ~ lag(price_oil, 1) + price_gasoline +
## lag(price_gasoline, 1), data = gas_oil)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.8895 -2.1754 0.4101 2.6578 9.1543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.21141 1.63453 -1.353 0.177
## lag(price_oil, 1) 0.90206 0.04001 22.545 <2e-16 ***
## price_gasoline 26.56499 1.70723 15.560 <2e-16 ***
## lag(price_gasoline, 1) -23.14670 1.81007 -12.788 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.858 on 223 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9844, Adjusted R-squared: 0.9842
## F-statistic: 4695 on 3 and 223 DF, p-value: < 2.2e-16
Let’s construct the plots with the residuals from the model in part 1i.
First, we can use the ‘resid’ command to extract residuals from an lm object and then store them in our dataframe
gas_oil$e_1i<- c(NA, resid(price1_gas1_mod))
#We also want to graph these against *lagged* residuals, so, we can use the lag command to
#add lagged residuals as well
gas_oil$e_1i_lag <- c(NA, lag(resid(price1_gas1_mod)))
Now lets plot residuals against past residuals using ggplot2 to check visually for any error terms
plot_1 = ggplot(gas_oil)+
geom_point(aes(x=t_month,y=e_1i), col = "red1") +
labs(
title = "Residual Plot: Model 1i",
x= "Month",
y= "Residuals")+
theme_classic()
plot_2 = ggplot(gas_oil)+
geom_point(aes(x=e_1i_lag,y=e_1i), col = "red1") +
labs(
title = "Residual Plot: Model 1i",
x= "Lagged Residuals",
y= "Residuals")+
theme_classic()
Ok, to view these, we need to simply call their names in our R-script
plot_1
## Warning: Removed 1 rows containing missing values (geom_point).
plot_2
## Warning: Removed 2 rows containing missing values (geom_point).
Okay, now we need to repeat the plots from above for the model from part 1a and 1d. We can first create the residuals from our static equation and then examine them closely.
1b
gas_oil$e_1b = residuals(static_mod)
gas_oil$e_1b_lag = lag(gas_oil$e_1b, 1)
Here, we lose two observations, simply due to the lags in the model. We can’t have coefficients for periods 1 and 2 because we don’t have enough data to provide estimates for those periods. So, in order to do this, we need to code the following:
gas_oil$e_1d <- c(NA, NA, resid(price_2lag))
#have to use next bit for the plots
gas_oil$e_1d_lag <- c(NA, NA, lag(resid(price_2lag)))
Now, we want to plot the residuals for this part of the question. Lets use ggplot again
plot_3 = ggplot(gas_oil)+
geom_point(aes(x=t_month,y=e_1b), col = "red1") +
labs(
title = "Residual Plot: Model 1b",
x= "Month",
y= "Residuals")+
theme_classic()
plot_4 = ggplot(gas_oil)+
geom_point(aes(x=e_1b_lag,y=e_1b), col = "red1") +
labs(
title = "Residual Plot: Model 1b",
x= "Lagged Residuals",
y= "Residuals")+
theme_classic()
1d
plot_5 = ggplot(gas_oil)+
geom_point(aes(x=t_month,y=e_1d), col = "red1") +
labs(
title = "Residual Plot: Model 1d",
x= "Month",
y= "Residuals")+
theme_classic()
plot_6 = ggplot(gas_oil)+
geom_point(aes(x=e_1d_lag ,y=e_1d), col = "red1") +
labs(
title = "Residual Plot: Model 1d",
x= "Lagged Residuals",
y= "Residuals")+
theme_classic()
Now let’s look at all 4 plots
plot_3
plot_4
## Warning: Removed 1 rows containing missing values (geom_point).
plot_5
## Warning: Removed 2 rows containing missing values (geom_point).
plot_6
## Warning: Removed 3 rows containing missing values (geom_point).
Model (1i) shows substantially less evidence of autocorrelation.
Why do you think this is? Here is a hint: suppose P_t-1.oil belongs in the true model but you estimate model 1a or 1d instead (so you omit a variable that belongs in the model). What is the error term in the model you estimate a function of (another hint: P_t-1.oil is in the error term if you omit it. Why is that a problem?)
##Final Lesson: Testing for Autocorrelation Lets test our model for second order autocorrelation, which means we get to use the Breusch-Godfrey test you learned about in class.
To do this we will need to follow a few steps.
What terms would we want to test to determine second-order autocorrelation?
The null hypothesis is: - H0: r_1= r_2 =0 (the coefficients on the two lagged errors are = 0)
What is the alternative hypothesis?
Once we have this regression we will take the residual model Rsq and compute the LM stat as we did before: LM = n*Rsq. We can then use this to get a p-value from a chisquared distribution. Lucky for you, Luciana will allow us to use a command from the lmtest package: waldtest. Ok, lets get on with the test:
#Let's create the lagged residuals
gas_oil$e_price_mod_lag1 <- c(NA, NA, lag(resid(price_2lag), 1))
gas_oil$e_price_mod_lag2 <- c(NA, NA, lag(resid(price_2lag), 2))
Now, we can run the regression outlined above
bg_mod <- lm(data = gas_oil, e_1i ~ price_gasoline + lag(price_gasoline, 1) +
lag(price_oil, 1) + lag(e_1i, 1) + lag(e_1i, 2))
Now, we can run the test. This will be an F-test where we’re dropping our autocorrelated error terms, so let’s do that here:
waldtest(bg_mod, c("lag(e_1i, 1)", "lag(e_1i, 2)"))
## Wald test
##
## Model 1: e_1i ~ price_gasoline + lag(price_gasoline, 1) + lag(price_oil,
## 1) + lag(e_1i, 1) + lag(e_1i, 2)
## Model 2: e_1i ~ price_gasoline + lag(price_gasoline, 1) + lag(price_oil,
## 1)
## Res.Df Df F Pr(>F)
## 1 219
## 2 221 -2 1.699 0.1853
And that’s it for today! A lot of this is on your third problemset homework, so come back to these notes if you get stuck.