Lab 5

Week 6.

Ok. Let’s load some packages to begin.

library(pacman)
#BETS is a package that lets you pull data directly from Brasil's central bank
p_load(BETS, lubridate, tidyverse, ggthemes, ggplot2, broom, lfe)

Hi guys! We’re going to finish out our simulation from last time (if you’re lost - go look at the lab notes from last week) and then we’re going to show you how to examine time series data and run dynamic regressions using Brasil’s national bank data.

Let’s remind ourselves of the problem.

We are trying to determine the effect of infrastructure spending on our commute times in the city. Let’s try and simulate this entire process.

We picked an ADL(1,1) model to determine this (written out in full just after this list), so we need to provide our simulation model with:

  • an a1 coef for our lagged times
  • a b1 coef for our spending
  • a b2 coef for our lagged spending
  • an initial point, y0
  • a b0 for an intercept,
  • and a standard deviation for our error term.
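
Putting those pieces together, the model we’re simulating is

y_t = b_0 + a1*y_(t-1) + b1*x_t + b2*x_(t-1) + u_t

where y_t is this month’s average commute time, x_t is this month’s road spending, and u_t is a mean-zero shock with standard deviation sigma_u.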

We learned last time that this process is likely going to lead to a decrease in commute times, but how much depends on how our pattern of investments plays out.

Let’s write our function.

adl11 <- function(y_0,b_0,a1,b1,b2, sigma_u,mean_x, sigma_x, Ti){
  #draw values of shock, using our function parameter sigma_u
  u = rnorm(Ti,mean =0, sd=sigma_u)
  #set up y as an empty vector so we can start placing objects inside of it.
  y =c()
  #initialize y_0 with our user-provided y_0 parameter above
  y[1]= y_0
  #build X
  x = rnorm(Ti,mean = mean_x, sd = sigma_x)
  #We can use a for loop to build our set of ys
  #recall, i is the index, and will iterate (count) over our set which here is 2 through T (which we provided)
  for (i in 2:Ti){
    #for every i in the series from 2,3,...,T-1,T, generate a y that depends on last month's y, this month's road spending,
    #and last month's road spending. Written out in R, where the 'i' in x[i] calls the i'th element of vector x...
    y[i] = b_0+a1*y[i-1]+b1*x[i] + b2*x[i-1] + u[i]
  }
  #now that we have y, we need to return our dataframe
  adl_data = data.frame(
    #create a time vector <1,2,3,...,T-1,T>
    time =c(1:Ti), 
    average_commute =  y,
    road_spend = x)
  
  #in order to get this object BACK, let's return the created dataframe.
  return(adl_data)
}

Now, let’s simulate some data and plot it using ggplot! Remember, we build ggplots out of layers of aesthetic mappings, geoms and other objects, separating them with the + operator.

Generally, when running a simulation you want to ‘set a seed.’ This forces the random draws you see to be reproducible by another researcher who is trying to replicate your work.

set.seed(42)

Let’s build a dataframe we can reference, using our adl11 function we just made:

dadl <- adl11(y_0 = 60, b_0= 10,a1 = 5/6,b1= -1/2, b2 = 1/6, sigma_u =1,
              mean_x = 10, sigma_x = 2.5, Ti = 250)

#now, let's compare what happens if we invested nothing at all by looking at what happens when b1,b2 are =0
badl <- adl11(y_0 = 60, b_0= 10,a1 = 5/6,b1= 0, b2 = 0, sigma_u =1,
              mean_x = 10, sigma_x = 2.5, Ti = 250)

Now, we just need to combine these into a single data frame.

data1 = data.frame(
  dadl, #this just grabs the columns and names from the dadl data frame we built above
  #now create a new column by passing badl's average_commute column to a variable named 'counterfactual'
  counterfactual = badl$average_commute
)

Now, we can plot this dataframe using ggplot and layers.

ggplot(aes(x = time, y = average_commute), data = data1) + 
  geom_line(col = 'red') + #draw our red line
  geom_line(col = 'blue', aes(y = counterfactual)) + #counterfactual line
  xlab('Months') + ylab('Average Commute Time (Minutes)') +
  theme_minimal() #add a theme to jazz it up

Now, to see where our project ended, we can access our final commute time!

dadl[250,2]
## [1] 41.52125

Nice! We cut nearly 20 minutes off our commute time with our new program! Pretty neat! Now let’s work with some real live data.

R with Brasilian Data!

Brasil publishes its central bank data through an R package called ‘BETS.’ Let’s use BETS to see how time series data work. Just to save us some time, I picked two series codes - one for São Paulo unemployment, and the other for São Paulo tax collected on goods and services.

BETS uses the following format to retrieve data:

BrasilData = BETSget(c(4345,10782), from = "2007-06-06", to = "2014-06-06", data.frame = TRUE)

Let’s peek at our data.

head(BrasilData)
## $ts_4345
##          date    value
## 1  2007-06-01  5088500
## 2  2007-07-01  5165247
## 3  2007-08-01  5303067
## 4  2007-09-01  5960794
## 5  2007-10-01  5735314
## 6  2007-11-01  5652123
## 7  2007-12-01  5761731
## 8  2008-01-01  6003361
## 9  2008-02-01  5588422
## 10 2008-03-01  5405355
## 11 2008-04-01  6017915
## 12 2008-05-01  6266769
## 13 2008-06-01  6427037
## 14 2008-07-01  6511314
## 15 2008-08-01  6582338
## 16 2008-09-01  6922355
## 17 2008-10-01  7414263
## 18 2008-11-01  6458868
## 19 2008-12-01  6723585
## 20 2009-01-01  5278894
## 21 2009-02-01  6417158
## 22 2009-03-01  6034538
## 23 2009-04-01  5957060
## 24 2009-05-01  6097193
## 25 2009-06-01  6295261
## 26 2009-07-01  6474666
## 27 2009-08-01  6557064
## 28 2009-09-01  6875511
## 29 2009-10-01  7239073
## 30 2009-11-01  7216059
## 31 2009-12-01  8129701
## 32 2010-01-01  6732896
## 33 2010-02-01  7094619
## 34 2010-03-01  7353547
## 35 2010-04-01  7744996
## 36 2010-05-01  7363320
## 37 2010-06-01  7721636
## 38 2010-07-01  7615055
## 39 2010-08-01  7753966
## 40 2010-09-01  7918731
## 41 2010-10-01  8030469
## 42 2010-11-01  8128681
## 43 2010-12-01  8858840
## 44 2011-01-01  7821823
## 45 2011-02-01 13478594
## 46 2011-03-01  8097174
## 47 2011-04-01  8052328
## 48 2011-05-01  8489187
## 49 2011-06-01  8383973
## 50 2011-07-01  8328160
## 51 2011-08-01  8623454
## 52 2011-09-01  8941298
## 53 2011-10-01  8807699
## 54 2011-11-01  8677637
## 55 2011-12-01  9725566
## 56 2012-01-01  8617949
## 57 2012-02-01  7994589
## 58 2012-03-01  8833473
## 59 2012-04-01  9188256
## 60 2012-05-01  8820140
## 61 2012-06-01  8888215
## 62 2012-07-01  8861617
## 63 2012-08-01  9261849
## 64 2012-09-01  9239647
## 65 2012-10-01  9704163
## 66 2012-11-01  9296075
## 67 2012-12-01 10397566
## 68 2013-01-01  8915737
## 69 2013-02-01  8792927
## 70 2013-03-01  9146622
## 71 2013-04-01  9950983
## 72 2013-05-01 10434982
## 73 2013-06-01 12565989
## 74 2013-07-01  9914854
## 75 2013-08-01  9902585
## 76 2013-09-01 10963864
## 77 2013-10-01 10420406
## 78 2013-11-01  9915738
## 79 2013-12-01 10987528
## 80 2014-01-01  9998183
## 81 2014-02-01  9906200
## 82 2014-03-01  9678713
## 83 2014-04-01  9868912
## 84 2014-05-01 10038876
## 85 2014-06-01  9988777
## 
## $ts_10782
##          date value
## 1  2007-06-01 10.20
## 2  2007-07-01 10.30
## 3  2007-08-01 10.10
## 4  2007-09-01  9.40
## 5  2007-10-01  9.50
## 6  2007-11-01  8.80
## 7  2007-12-01  8.00
## 8  2008-01-01  8.60
## 9  2008-02-01  9.30
## 10 2008-03-01  9.40
## 11 2008-04-01  9.40
## 12 2008-05-01  8.60
## 13 2008-06-01  8.20
## 14 2008-07-01  8.30
## 15 2008-08-01  8.00
## 16 2008-09-01  8.00
## 17 2008-10-01  7.70
## 18 2008-11-01  8.20
## 19 2008-12-01  7.10
## 20 2009-01-01  9.40
## 21 2009-02-01 10.00
## 22 2009-03-01 10.50
## 23 2009-04-01 10.20
## 24 2009-05-01 10.20
## 25 2009-06-01  9.00
## 26 2009-07-01  8.90
## 27 2009-08-01  9.10
## 28 2009-09-01  8.70
## 29 2009-10-01  8.60
## 30 2009-11-01  8.10
## 31 2009-12-01  7.48
## 32 2010-01-01  8.04
## 33 2010-02-01  8.09
## 34 2010-03-01  8.15
## 35 2010-04-01  7.69
## 36 2010-05-01  7.76
## 37 2010-06-01  7.39
## 38 2010-07-01  7.16
## 39 2010-08-01  6.82
## 40 2010-09-01  6.31
## 41 2010-10-01  5.90
## 42 2010-11-01  5.54
## 43 2010-12-01  5.32
## 44 2011-01-01  6.03
## 45 2011-02-01  6.62
## 46 2011-03-01  6.94
## 47 2011-04-01  7.11
## 48 2011-05-01  6.74
## 49 2011-06-01  6.61
## 50 2011-07-01  6.45
## 51 2011-08-01  6.31
## 52 2011-09-01  6.07
## 53 2011-10-01  5.65
## 54 2011-11-01  4.95
## 55 2011-12-01  4.70
## 56 2012-01-01  5.50
## 57 2012-02-01  6.12
## 58 2012-03-01  6.53
## 59 2012-04-01  6.51
## 60 2012-05-01  6.17
## 61 2012-06-01  6.49
## 62 2012-07-01  5.73
## 63 2012-08-01  5.82
## 64 2012-09-01  6.51
## 65 2012-10-01  5.85
## 66 2012-11-01  5.46
## 67 2012-12-01  5.21
## 68 2013-01-01  6.42
## 69 2013-02-01  6.52
## 70 2013-03-01  6.33
## 71 2013-04-01  6.74
## 72 2013-05-01  6.35
## 73 2013-06-01  6.58
## 74 2013-07-01  5.76
## 75 2013-08-01  5.39
## 76 2013-09-01  5.79
## 77 2013-10-01  5.57
## 78 2013-11-01  4.68
## 79 2013-12-01  4.37
## 80 2014-01-01  5.03
## 81 2014-02-01  5.53
## 82 2014-03-01  5.75
## 83 2014-04-01  5.23
## 84 2014-05-01  5.10
## 85 2014-06-01  5.06

That’s not normal… what is going on? Well, BETSget has given us a list of two data frames, so we need to do a bit of data work to get it into a format we can use. (I went ahead and picked a pair of series that are both observed monthly, so the dates line up.)

BrasilDataRight <- data.frame(
  date = BrasilData$ts_10782$date,
  goodsandserv = BrasilData$ts_4345$value,
  unemployment = BrasilData$ts_10782$value
)
#just in case the BETS download fails, this reads in a backup copy of the same data saved to disk
read_csv('/Users/connor/Dropbox/EC421_Lab/Week6/BrasilData_backup.csv')
## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   goodsandserv = col_double(),
##   unemployment = col_double()
## )
## # A tibble: 85 x 3
##    date       goodsandserv unemployment
##    <date>            <dbl>        <dbl>
##  1 2007-06-01      5088500         10.2
##  2 2007-07-01      5165247         10.3
##  3 2007-08-01      5303067         10.1
##  4 2007-09-01      5960794          9.4
##  5 2007-10-01      5735314          9.5
##  6 2007-11-01      5652123          8.8
##  7 2007-12-01      5761731          8  
##  8 2008-01-01      6003361          8.6
##  9 2008-02-01      5588422          9.3
## 10 2008-03-01      5405355          9.4
## # … with 75 more rows
#We are going to change the row index so rows are indexed by date.
rownames(BrasilDataRight) <- as.Date(BrasilDataRight$date)

Let’s start ourselves off with a nice, static regression. What function do we use to get a regression? lm().

sreg <-lm(data = BrasilDataRight, goodsandserv ~ unemployment)

Now, we use summary() to pull information out of the regression object.

#summary to get a table
summary(sreg)
## 
## Call:
## lm(formula = goodsandserv ~ unemployment, data = BrasilDataRight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1529846  -518862   -75421   267581  4929845 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14579533     475340   30.67   <2e-16 ***
## unemployment  -910994      64660  -14.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 958900 on 83 degrees of freedom
## Multiple R-squared:  0.7052, Adjusted R-squared:  0.7016 
## F-statistic: 198.5 on 1 and 83 DF,  p-value: < 2.2e-16

We can use r.squared to determine how much of our variation is explained by the static model

#Does anyone remember how to find r-squared? 
summary(sreg)$r.squared
## [1] 0.7051537

Ok, let’s compare our regression to one with a few lagged terms. Luckily, in R, including lags is pretty simple if you know how to run a regression. You can simply use the lag() function, which takes two arguments: a vector and a lag count. (This is the dplyr version of lag(), loaded with the tidyverse - base R’s stats::lag() behaves differently on plain vectors.)

print(1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
print(lag(1:10,1))
##  [1] NA  1  2  3  4  5  6  7  8  9

This lets us quickly run dynamic regressions. Let’s do that with our Brasil data.

reg_2lag <- lm(data=BrasilDataRight, goodsandserv ~  unemployment + lag(unemployment, 1) + lag(unemployment, 2))
summary(reg_2lag)
## 
## Call:
## lm(formula = goodsandserv ~ unemployment + lag(unemployment, 
##     1) + lag(unemployment, 2), data = BrasilDataRight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1650060  -472476   -78361   266854  4356664 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          14877846     492466  30.211   <2e-16 ***
## unemployment          -624875     194474  -3.213   0.0019 ** 
## lag(unemployment, 1)   124882     272751   0.458   0.6483    
## lag(unemployment, 2)  -445918     191798  -2.325   0.0226 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 934600 on 79 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7142, Adjusted R-squared:  0.7033 
## F-statistic: 65.79 on 3 and 79 DF,  p-value: < 2.2e-16

Let’s test these two lags jointly, to see if they are important. What kind of test is this? An F test on the two lag coefficients together - that is, a Wald test.

waldtest(reg_2lag, c("lag(unemployment, 1)", "lag(unemployment, 2)")) %>% tidy()
## Warning: 'tidy.numeric' is deprecated.
## See help("Deprecated")
## # A tibble: 6 x 2
##   names       x
##   <chr>   <dbl>
## 1 p      0.0156
## 2 chi2   8.33  
## 3 df1    2     
## 4 p.F    0.0191
## 5 F      4.16  
## 6 df2   79

Let’s also run our regression with a lagged dependent variable and see if we gain any explanatory power. Where would you look? … R.squared!

reg_ar1 <- lm(goodsandserv ~ lag(goodsandserv,1) + unemployment, data = BrasilDataRight)
summary(reg_ar1)
## 
## Call:
## lm(formula = goodsandserv ~ lag(goodsandserv, 1) + unemployment, 
##     data = BrasilDataRight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2144722  -407232   -17735   212607  5179755 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           9.123e+06  1.492e+06   6.116 3.22e-08 ***
## lag(goodsandserv, 1)  3.756e-01  9.823e-02   3.824 0.000257 ***
## unemployment         -5.683e+05  1.081e+05  -5.258 1.16e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 893100 on 81 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7415, Adjusted R-squared:  0.7352 
## F-statistic: 116.2 on 2 and 81 DF,  p-value: < 2.2e-16
sumobjar1 <- summary(reg_ar1)
sumobjar1$r.squared
## [1] 0.741549

Now we need to plot values across time.

Well, we get to use our new package… lubridate! It lets us cast dates that come to us in string form into proper date objects we can work with in time series. Or it would… but BETS is nice to us and already hands us a Date class, meaning we can get lots of information from our date column. That includes year, month, day (and hour, in some cases).
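
Just as a quick sketch of what lubridate buys you when you do start from strings (these particular dates are made up purely for illustration):

#a quick lubridate sketch: parse year-month-day strings into Dates, then pull pieces back out
#(the example dates here are made up for illustration)
example_dates <- ymd(c("2007-06-01", "2014-06-06"))
year(example_dates)  #returns 2007 2014
month(example_dates) #returns 6 6

Since our date column is already a Date, we can put it straight on the x-axis.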

ggplot(data = BrasilDataRight, aes(x = date, y = goodsandserv)) +
  geom_line(size = .2) +
  geom_point(size = .4, col = 'red') +
  #geom_hline(yintercept = 0) +
  xlab("Time")+
  ylab("Tax Revenue") +
  theme_pander()

Uh oh… that looks a bit unstable… maybe we can do something about this…

What could we do if, say, we had a unit root (which is my first guess)? Take the first difference.

BrasilDataRight$goodsandserv_fd = BrasilDataRight$goodsandserv - lag(BrasilDataRight$goodsandserv,1)

Now, we can run the graph again

ggplot(data = BrasilDataRight, aes(x = date, y = goodsandserv_fd)) +
  geom_line(size = .2) +
  geom_point(size = .4, col = 'red') +
  geom_hline(yintercept = 0) +
  xlab("Time")+
  ylab("Tax Revenue Difference") +
  theme_pander()
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).

Oh yeah, that’s what I’m talking about! What might concern us here though?

We might have an issue with variance stationarity: right in the middle of our series, the size of the swings explodes. We’ll get to that.
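
One common fix, if the swings really do grow with the level of the series, is to work in logs: the first difference of the log is roughly a month-over-month growth rate. Here’s just a sketch - the goodsandserv_ld column is only for illustration, and we won’t use it below.

#sketch: log first-difference of tax revenue, roughly a monthly growth rate (illustration only)
BrasilDataRight$goodsandserv_ld <- log(BrasilDataRight$goodsandserv) -
  lag(log(BrasilDataRight$goodsandserv), 1)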

What about our other variable?

ggplot(data = BrasilDataRight, aes(x = date, y = unemployment)) +
  geom_line(size = .2) +
  geom_point(size = .4, col = 'red') +
  geom_hline(yintercept = 0) +
  xlab("Time")+
  ylab("Unemployment, percentage points") +
  theme_pander()

Crap. This is the same problem, just trending in the opposite direction. I mean, good for Brasil, but it means we need to change some stuff around. We can do what we did before:

BrasilDataRight$unemployment_fd <- BrasilDataRight$unemployment - lag(BrasilDataRight$unemployment,1)

Now we plot again,

ggplot(data = BrasilDataRight, aes(x = date, y = unemployment_fd)) +
  geom_line(size = .2) +
  geom_point(size = .4, col = 'red') +
  #geom_hline(yintercept = 0) +
  xlab("Time")+
  ylab("Unemployment, percentage points difference") +
  geom_hline(yintercept = 0) +
  theme_pander()
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).

Now let’s try and check for autocorrelation here.

Autocorrelation

First off, let’s remind ourselves why we care about autocorrelation at all?

Well, for one, if autocorrelation exists, our standard errors will be BIASED, and OLS will no longer be efficient.
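
For example, first-order autocorrelation means the errors follow something like u_t = rho*u_(t-1) + e_t, so this period’s disturbance carries information about next period’s - exactly the kind of dependence the usual OLS standard-error formula assumes away.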

But, how do we examine this problem? Much like heteroskedasticity, diagnosing it requires that we retrieve and examine our residuals. Let’s run our first regression again and look at the residuals over time.

reg_ols <- lm(goodsandserv ~ unemployment, data = BrasilDataRight)

BrasilDataRight$e_frst <-  residuals(reg_ols)

Now, there are a couple of ways we can dig into these guys. The first is to graph our residuals against time, and the second is to graph the residuals against their own lags.

ggplot(data = BrasilDataRight, aes(x = date, y = e_frst)) +
  geom_path(size = .1) +
  geom_point(size = .2, col = 'red') +
  xlab("Time") +
  ylab("residual, static ols") +
  theme_pander()

Ok. Some concerning patterns here. Let’s look at the errors compared to their lags. What would we look for here? Errors that closely track the errors that preceded them.

ggplot(data = BrasilDataRight, aes(x = lag(e_frst), y = e_frst)) +
  geom_point(size = .5, col = 'red') +
  geom_smooth(method = lm, se = FALSE, alpha = .2, col = 'black') +
  xlab("lagged residual, static ols") +
  ylab("residual, static ols") +
  theme_pander()
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

Ohh boy… that looks like the definition of positive correlation. I should put that on wikipedia.

Let’s look at our more dynamic models with lagged terms and see what happens. In particular, I’d like to look at the 2-lags unemployment model and see what we get here.

First we need our residuals. There’s a complicating factor here…

reg_2lag <- lm(goodsandserv ~ unemployment + lag(unemployment, 1) + lag(unemployment,2), data = BrasilDataRight)

BrasilDataRight$e_scnd <-  residuals(reg_2lag)

What? Why is R mad? Well… remember when we simulated data, we needed to ‘rev up’ our process - the same is true with real data. If you include lags, by definition you lose some of your earliest data points: the further back your lags reach, the more observations you drop. So how do we fix this? We need to pad the beginning of the residual vector with ‘null’ values, one for each dropped observation. In R, a null is represented by NA, and since our model uses two lags we need two of them.

BrasilDataRight$e_scnd <-  c(NA,NA,residuals(reg_2lag))

Nice. Now, we can do our autocorrelation graphs again. Let’s just do the second one to speed things up.

ggplot(data = BrasilDataRight, aes(x = lag(e_scnd), y = e_scnd)) +
  geom_point(size = .5, col = 'red') +
  geom_smooth(method = lm, se = FALSE, alpha = .2, col = 'black') +
  xlab("lagged residual, 2-lags") +
  ylab("residual, 2 lags") +
  theme_pander()
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).

Seems like the same problem… We can test this formally, though, with the Breusch-Godfrey test! Let’s walk through how to do this, using our 2-lagged model as an example.

Breusch-Godfrey Test

We need to follow a bunch of steps. Again.

1.) Decide on an ORDER to test. This means you are picking how far back you are checking for AC: first order includes one lag of the residuals, second or third order includes two or three lags. Let’s test first and second order AC.

2.) Run your regression. We did this already.

3.) Recover your residuals. We also did this! Just above.

4.) Run a new regression of your residuals on your explanatory variables plus a lag (or lags) of your residuals.

bg_1 <- lm(e_scnd ~ unemployment + lag(unemployment, 1) + lag(unemployment, 2) + lag(e_scnd, 1), data = BrasilDataRight) 

If we wanted to test for second-order AC, we’d also add lag(e_scnd, 2). Let’s do that for fun:

bg_2 <- lm(e_scnd ~ unemployment + lag(unemployment, 1) + lag(unemployment, 2) + lag(e_scnd, 1) + 
             lag(e_scnd, 2), data = BrasilDataRight) 

5.) Run a Wald test (an F test) to determine the joint significance of the lagged error terms.

#lets look at first order
waldtest(bg_1, c("lag(e_scnd, 1)")) %>% tidy()
## Warning: 'tidy.numeric' is deprecated.
## See help("Deprecated")
## # A tibble: 6 x 2
##   names        x
##   <chr>    <dbl>
## 1 p      0.00754
## 2 chi2   7.14   
## 3 df1    1      
## 4 p.F    0.00920
## 5 F      7.14   
## 6 df2   77

And now second order - this time we jointly test both lags of the residuals.

#second order
waldtest(bg_2, c("lag(e_scnd, 1)", "lag(e_scnd, 2)")) %>% tidy()
## Warning: 'tidy.numeric' is deprecated.
## See help("Deprecated")
## # A tibble: 6 x 2
##   names        x
##   <chr>    <dbl>
## 1 p      0.00521
## 2 chi2  10.5    
## 3 df1    2      
## 4 p.F    0.00730
## 5 F      5.26   
## 6 df2   75

6.) Now what? We need to declare our H0 and Ha:

We have H0: no autocorrelation, and Ha: autocorrelation. Can we reject H0? Yes, on both fronts: at both first and second order, we reject the null of no autocorrelation.
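
If you want to double-check the by-hand steps, there is a canned Breusch-Godfrey test in the lmtest package. We didn’t load lmtest above, so treat this as an optional sketch; the numbers can also differ slightly from ours depending on how the dropped initial observations are handled.

#optional cross-check: the canned Breusch-Godfrey test from the lmtest package
p_load(lmtest)
bgtest(reg_2lag, order = 2) #tests up to second-order autocorrelation in our 2-lag model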

Well, that’s it for today! I hope you have a great week, and we’ll see you next week for some problem set help!