Load Packages!

You guys know the drill. In one block I’m going to read in the data and use our packages

library(pacman)
#adding lmtest to our list.
p_load(tidyverse, magrittr, ggplot2, lmtest, broom)


gas_oil <- read_csv(filepath)
## Parsed with column specification:
## cols(
##   month_year = col_character(),
##   month = col_double(),
##   year = col_double(),
##   price_gasoline = col_double(),
##   price_oil = col_double(),
##   t_month = col_double(),
##   t = col_double()
## )
names(gas_oil)
## [1] "month_year"     "month"          "year"           "price_gasoline"
## [5] "price_oil"      "t_month"        "t"

Ok. Now we know what we have to work with

Lesson 1: Running OLS: static model

What is the difference between a ‘dynamic’ or a ‘static’ model?

A static model assumes that the impact of any given variable is only effecting the outcome today,whereas dynamic implies that the variable impacts the outcome tomorrow. Let’s take a look at this:

Let’s try to estimate the following static model:

To do this in R is straightfoward, we can simply use lm()

static_mod = lm(price_oil ~ price_gasoline, data = gas_oil)
summary(static_mod)
## 
## Call:
## lm(formula = price_oil ~ price_gasoline, data = gas_oil)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.872  -4.224   1.187   4.688  16.360 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -33.145      1.578  -21.00   <2e-16 ***
## price_gasoline   39.033      0.603   64.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.97 on 226 degrees of freedom
## Multiple R-squared:  0.9488, Adjusted R-squared:  0.9486 
## F-statistic:  4190 on 1 and 226 DF,  p-value: < 2.2e-16

What does this say? Well first off, it certainly seems like a higher price of gasoline today would indicate a higher price of oil today. That seems reasonable.And it is quite significant. Anything else?

Lesson 2: Dynamic version, estimating a dynamic model

Now suppose you believe the price of gas might depend on the price of oil today and yesterday. In other words, we are are interested in estimating the following model:

P_t.oil = B0+B1*P_t.gas+P_t-1.gas+B2*P_t-2.gas+u_t

To do this, we need to use a ‘lagged’ version of P_t.gas, a variable which we do not currently have in our data-frame. A lag, really, is a fancy way of saying a version of a variable from before t. You have a few options to do this:

Lets run with option one, and then two:

When you are lagging a single period, the lag() fcn introduces a single NA to the beginning of your vector.

gas_oil$price_gas_t1 = c(NaN, na.omit(lag(gas_oil$price_gasoline,1)))

#When you are lagging two periods, the lag() fcn introduces TWO NAs to the beginning of your #vector.
gas_oil$price_gas_t2 = c(NaN,NaN,na.omit(lag(gas_oil$price_gasoline,2)))

Then you can run your regression. Alternatively, you can pass these lagged values directly to a regression object:

option 2

price_2lag = lm(price_oil ~ price_gasoline + lag(price_gasoline, 1) + lag(price_gasoline, 2),data= gas_oil)
summary(price_2lag)
## 
## Call:
## lm(formula = price_oil ~ price_gasoline + lag(price_gasoline, 
##     1) + lag(price_gasoline, 2), data = gas_oil)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.2018  -3.7498   0.8291   4.3443  14.5325 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -34.263      1.573 -21.776  < 2e-16 ***
## price_gasoline           47.191      3.114  15.156  < 2e-16 ***
## lag(price_gasoline, 1)  -19.824      5.248  -3.777 0.000204 ***
## lag(price_gasoline, 2)   12.103      3.120   3.879 0.000138 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.766 on 222 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.952,  Adjusted R-squared:  0.9513 
## F-statistic:  1467 on 3 and 222 DF,  p-value: < 2.2e-16

Lesson 3: F-test

Lets run an F-test to compare our two models. To do this in R:

price_2lag = lm(price_oil ~ price_gasoline + lag(price_gasoline, 1) + lag(price_gasoline, 2),data= gas_oil)
#note that there needs to be a space between the comma and the number in the lag function
lmtest::waldtest(price_2lag,c("lag(price_gasoline, 1)","lag(price_gasoline, 2)"), test="F")
## Wald test
## 
## Model 1: price_oil ~ price_gasoline + lag(price_gasoline, 1) + lag(price_gasoline, 
##     2)
## Model 2: price_oil ~ price_gasoline
##   Res.Df Df      F    Pr(>F)    
## 1    222                        
## 2    224 -2 7.9294 0.0004718 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Try estimating this model: 0 P_t.Oil = B0+B1*P_t.gas+P_t-1.gas+B2*P_t-1.Oil+u_t

price1_gas1_mod <- lm(price_oil ~ lag(price_oil, 1) + price_gasoline + lag(price_gasoline, 1)
                      , data = gas_oil)
summary(price1_gas1_mod)
## 
## Call:
## lm(formula = price_oil ~ lag(price_oil, 1) + price_gasoline + 
##     lag(price_gasoline, 1), data = gas_oil)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.8895  -2.1754   0.4101   2.6578   9.1543 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -2.21141    1.63453  -1.353    0.177    
## lag(price_oil, 1)        0.90206    0.04001  22.545   <2e-16 ***
## price_gasoline          26.56499    1.70723  15.560   <2e-16 ***
## lag(price_gasoline, 1) -23.14670    1.81007 -12.788   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.858 on 223 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9844, Adjusted R-squared:  0.9842 
## F-statistic:  4695 on 3 and 223 DF,  p-value: < 2.2e-16

Lesson 4: Autocorrelation

Let’s construct the plots with the residuals from the model in part 1i.

First, we can use the ‘resid’ command to extract residuals from an lm object and then store them in our dataframe

gas_oil$e_1i<- c(NA, resid(price1_gas1_mod))
#We also want to graph these against *lagged* residuals, so, we can use the lag command to
#add lagged residuals as well
gas_oil$e_1i_lag <- c(NA, lag(resid(price1_gas1_mod)))

Now lets plot residuals against past residuals using ggplot2 to check visually for any error terms

plot_1 = ggplot(gas_oil)+
  geom_point(aes(x=t_month,y=e_1i), col = "red1") +
  labs(
    title =  "Residual Plot: Model 1i",
    x= "Month",
    y= "Residuals")+
  theme_classic()

plot_2 = ggplot(gas_oil)+
  geom_point(aes(x=e_1i_lag,y=e_1i), col = "red1") +
  labs(
    title =  "Residual Plot: Model 1i",
    x= "Lagged Residuals",
    y= "Residuals")+
  theme_classic()

Ok, to view these, we need to simply call their names in our R-script

plot_1
## Warning: Removed 1 rows containing missing values (geom_point).

plot_2
## Warning: Removed 2 rows containing missing values (geom_point).

Okay, now we need to repeat the plots from above for the model from part 1a and 1d. We can first create the residuals from our static equation and then examine them closely.

1b

gas_oil$e_1b = residuals(static_mod)
gas_oil$e_1b_lag = lag(gas_oil$e_1b, 1)

Here, we lose two observations, simply due to the lags in the model. We can’t have coefficients for periods 1 and 2 because we don’t have enough data to provide estimates for those periods. So, in order to do this, we need to code the following:

gas_oil$e_1d <- c(NA, NA, resid(price_2lag))
#have to use next bit for the plots
gas_oil$e_1d_lag <- c(NA, NA, lag(resid(price_2lag)))

Now, we want to plot the residuals for this part of the question. Lets use ggplot again

plot_3 = ggplot(gas_oil)+
  geom_point(aes(x=t_month,y=e_1b), col = "red1") +
  labs(
    title =  "Residual Plot: Model 1b",
    x= "Month",
    y= "Residuals")+
  theme_classic()

plot_4 = ggplot(gas_oil)+
  geom_point(aes(x=e_1b_lag,y=e_1b), col = "red1") +
  labs(
    title =  "Residual Plot: Model 1b",
    x= "Lagged Residuals",
    y= "Residuals")+
  theme_classic()

1d

plot_5 = ggplot(gas_oil)+
  geom_point(aes(x=t_month,y=e_1d), col = "red1") +
  labs(
    title =  "Residual Plot: Model 1d",
    x= "Month",
    y= "Residuals")+
  theme_classic()

plot_6 = ggplot(gas_oil)+
  geom_point(aes(x=e_1d_lag ,y=e_1d), col = "red1") +
  labs(
    title =  "Residual Plot: Model 1d",
    x= "Lagged Residuals",
    y= "Residuals")+
  theme_classic()

Now let’s look at all 4 plots

plot_3

plot_4
## Warning: Removed 1 rows containing missing values (geom_point).

plot_5
## Warning: Removed 2 rows containing missing values (geom_point).

plot_6
## Warning: Removed 3 rows containing missing values (geom_point).

Model (1i) shows substantially less evidence of autocorrelation.

Why do you think this is? Here is a hint: suppose P_t-1.oil belongs in the true model but you estimate model 1a or 1d instead (so you omit a variable that belongs in the model). What is the error term in the model you estimate a function of (another hint: P_t-1.oil is in the error term if you omit it. Why is that a problem?)

##Final Lesson: Testing for Autocorrelation Lets test our model for second order autocorrelation, which means we get to use the Breusch-Godfrey test you learned about in class.

To do this we will need to follow a few steps.

What terms would we want to test to determine second-order autocorrelation?

The null hypothesis is: - H0: r_1= r_2 =0 (the coefficients on the two lagged errors are = 0)

What is the alternative hypothesis?

Once we have this regression we will take the residual model Rsq and compute the LM stat as we did before: LM = n*Rsq. We can then use this to get a p-value from a chisquared distribution. Lucky for you, Luciana will allow us to use a command from the lmtest package: waldtest. Ok, lets get on with the test:

#Let's create the lagged residuals
gas_oil$e_price_mod_lag1 <- c(NA, NA, lag(resid(price_2lag), 1))
gas_oil$e_price_mod_lag2 <- c(NA, NA, lag(resid(price_2lag), 2))

Now, we can run the regression outlined above

bg_mod <- lm(data = gas_oil, e_1i ~ price_gasoline + lag(price_gasoline, 1) +
               lag(price_oil, 1) + lag(e_1i, 1) + lag(e_1i, 2))

Now, we can run the test. This will be an F-test where we’re dropping our autocorrelated error terms, so let’s do that here:

waldtest(bg_mod, c("lag(e_1i, 1)", "lag(e_1i, 2)"))
## Wald test
## 
## Model 1: e_1i ~ price_gasoline + lag(price_gasoline, 1) + lag(price_oil, 
##     1) + lag(e_1i, 1) + lag(e_1i, 2)
## Model 2: e_1i ~ price_gasoline + lag(price_gasoline, 1) + lag(price_oil, 
##     1)
##   Res.Df Df     F Pr(>F)
## 1    219                
## 2    221 -2 1.699 0.1853

And that’s it for today! A lot of this is on your third problemset homework, so come back to these notes if you get stuck.