Ok. Let’s load some packages to begin.
library(pacman)
#BETS is a package that lets you pull data directly from Brasil's central bank
p_load(BETS, lubridate, tidyverse, ggthemes, ggplot2, broom, lfe)
Hi guys! We're going to finish our simulation from last time (if you're lost, go look at last week's lab notes), and then we'll show you how to examine time series data and run dynamic regressions using data from Brasil's central bank.
Let’s remind ourselves of the problem.
We are trying to determine the effect of infrastructure spending on our commute times in the city. Let’s try and simulate this entire process.
We picked an ADL(1,1) model to determine this: y_t = b_0 + a1*y_(t-1) + b1*x_t + b2*x_(t-1) + u_t. So we need to provide our simulation with: a starting commute time y_0, the intercept b_0, the coefficients a1, b1 and b2, the standard deviation of the shock, the mean and standard deviation of road spending, and the number of periods T.
We learned last time that our program is likely going to decrease commute times, but by how much depends on how our pattern of investments plays out.
Let’s write our function
adl11 <- function(y_0, b_0, a1, b1, b2, sigma_u, mean_x, sigma_x, Ti){
  #draw values of the shock, using our function parameter sigma_u
  u = rnorm(Ti, mean = 0, sd = sigma_u)
  #set up y as an empty vector so we can start placing objects inside of it
  y = c()
  #initialize y[1] with our user-provided y_0 parameter above
  y[1] = y_0
  #build x
  x = rnorm(Ti, mean = mean_x, sd = sigma_x)
  #We can use a for loop to build our set of ys.
  #Recall, i is the index, and will iterate (count) over our set, which here is 2 through Ti (which we provided)
  for (i in 2:Ti){
    #for every i in the series 2, 3, ..., Ti, generate a y that depends on last month's y and on road expenditures
    #in the current and previous period. Written out in R, where the 'i' in x[i] calls the i'th element of vector x...
    y[i] = b_0 + a1*y[i-1] + b1*x[i] + b2*x[i-1] + u[i]
  }
  #now that we have y, we need to build our dataframe
  adl_data = data.frame(
    #create a time vector <1, 2, 3, ..., Ti>
    time = c(1:Ti),
    average_commute = y,
    road_spend = x)
  #in order to get this object BACK, return the created dataframe
  return(adl_data)
}
Now, let's plot this using ggplot! Remember, we build ggplots using layers of aesthetic mappings, geoms and other objects, separating them with the + operator.
Generally, when running a simulation you want to ‘set a seed.’ This forces the random draws you see to be reproducible by another researcher who is trying to replicate your work.
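To see what the seed buys us, here's a tiny demonstration (an aside, not part of the lab itself): re-setting the same seed before a draw reproduces the exact same random numbers.
set.seed(123) #any number works - 123 is just for illustration
first_try = rnorm(3)
set.seed(123) #reset to the same seed
second_try = rnorm(3)
identical(first_try, second_try) #TRUE - the two draws match exactly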
set.seed(42)
Let’s build a dataframe we can reference, using our adl11 function we just made:
dadl <- adl11(y_0 = 60, b_0= 10,a1 = 5/6,b1= -1/2, b2 = 1/6, sigma_u =1,
mean_x = 10, sigma_x = 2.5, Ti = 250)
#now, let's see what would have happened if we had invested nothing at all, by setting b1 and b2 to 0
badl <- adl11(y_0 = 60, b_0= 10,a1 = 5/6,b1= 0, b2 = 0, sigma_u =1,
mean_x = 10, sigma_x = 2.5, Ti = 250)
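As a quick sanity check on what to expect (a back-of-the-envelope calculation, not part of the lab code): when road spending averages mean_x, an ADL(1,1) settles around the long-run mean (b_0 + (b1 + b2)*mean_x)/(1 - a1). Plugging in our two parameterizations:
#long-run commute time with the investment program
(10 + (-1/2 + 1/6)*10)/(1 - 5/6) #= 40 minutes
#long-run commute time in the no-investment counterfactual (b1 = b2 = 0)
(10 + 0)/(1 - 5/6) #= 60 minutes
So the simulated series should drift from its starting value of 60 minutes down toward roughly 40, while the counterfactual hovers around 60.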
Now, we just need to combine these into a single data frame.
data1 = data.frame(
  dadl, #this just grabs the columns and names from the dadl data from before
  #now we create a new column by passing badl's average_commute column to a new variable, 'counterfactual'
  counterfactual = badl$average_commute
)
Now, we can plot this dataframe using ggplot and layers.
ggplot(aes(x = time, y = average_commute), data = data1) +
geom_line(col = 'red') + #draw our red line
geom_line(col = 'blue', aes(y = counterfactual)) + #counterfactual line
xlab('Months') + ylab('Average Commute Time (Minutes)') +
theme_minimal() #add a theme to jazz it up
Now, to see where our project ended, we can access our final commute time!
dadl[250,2]
## [1] 41.52125
Nice! We cut almost 20 minutes off our commute time with our new program! Pretty neat! Now let's work with some real, live data.
Brasil's central bank publishes its data through an R package called BETS. Let's use BETS to examine how time series data works. Just to save us some time, I picked two series codes - one for São Paulo unemployment, and the other for São Paulo's collected tax on goods and services.
BETS uses the following format to retrieve data:
BrasilData = BETSget(c(4345,10782), from = "2007-06-06", to = "2014-06-06", data.frame = TRUE)
Let's peek at our data.
head(BrasilData)
## $ts_4345
## date value
## 1 2007-06-01 5088500
## 2 2007-07-01 5165247
## 3 2007-08-01 5303067
## 4 2007-09-01 5960794
## 5 2007-10-01 5735314
## 6 2007-11-01 5652123
## 7 2007-12-01 5761731
## 8 2008-01-01 6003361
## 9 2008-02-01 5588422
## 10 2008-03-01 5405355
## 11 2008-04-01 6017915
## 12 2008-05-01 6266769
## 13 2008-06-01 6427037
## 14 2008-07-01 6511314
## 15 2008-08-01 6582338
## 16 2008-09-01 6922355
## 17 2008-10-01 7414263
## 18 2008-11-01 6458868
## 19 2008-12-01 6723585
## 20 2009-01-01 5278894
## 21 2009-02-01 6417158
## 22 2009-03-01 6034538
## 23 2009-04-01 5957060
## 24 2009-05-01 6097193
## 25 2009-06-01 6295261
## 26 2009-07-01 6474666
## 27 2009-08-01 6557064
## 28 2009-09-01 6875511
## 29 2009-10-01 7239073
## 30 2009-11-01 7216059
## 31 2009-12-01 8129701
## 32 2010-01-01 6732896
## 33 2010-02-01 7094619
## 34 2010-03-01 7353547
## 35 2010-04-01 7744996
## 36 2010-05-01 7363320
## 37 2010-06-01 7721636
## 38 2010-07-01 7615055
## 39 2010-08-01 7753966
## 40 2010-09-01 7918731
## 41 2010-10-01 8030469
## 42 2010-11-01 8128681
## 43 2010-12-01 8858840
## 44 2011-01-01 7821823
## 45 2011-02-01 13478594
## 46 2011-03-01 8097174
## 47 2011-04-01 8052328
## 48 2011-05-01 8489187
## 49 2011-06-01 8383973
## 50 2011-07-01 8328160
## 51 2011-08-01 8623454
## 52 2011-09-01 8941298
## 53 2011-10-01 8807699
## 54 2011-11-01 8677637
## 55 2011-12-01 9725566
## 56 2012-01-01 8617949
## 57 2012-02-01 7994589
## 58 2012-03-01 8833473
## 59 2012-04-01 9188256
## 60 2012-05-01 8820140
## 61 2012-06-01 8888215
## 62 2012-07-01 8861617
## 63 2012-08-01 9261849
## 64 2012-09-01 9239647
## 65 2012-10-01 9704163
## 66 2012-11-01 9296075
## 67 2012-12-01 10397566
## 68 2013-01-01 8915737
## 69 2013-02-01 8792927
## 70 2013-03-01 9146622
## 71 2013-04-01 9950983
## 72 2013-05-01 10434982
## 73 2013-06-01 12565989
## 74 2013-07-01 9914854
## 75 2013-08-01 9902585
## 76 2013-09-01 10963864
## 77 2013-10-01 10420406
## 78 2013-11-01 9915738
## 79 2013-12-01 10987528
## 80 2014-01-01 9998183
## 81 2014-02-01 9906200
## 82 2014-03-01 9678713
## 83 2014-04-01 9868912
## 84 2014-05-01 10038876
## 85 2014-06-01 9988777
##
## $ts_10782
## date value
## 1 2007-06-01 10.20
## 2 2007-07-01 10.30
## 3 2007-08-01 10.10
## 4 2007-09-01 9.40
## 5 2007-10-01 9.50
## 6 2007-11-01 8.80
## 7 2007-12-01 8.00
## 8 2008-01-01 8.60
## 9 2008-02-01 9.30
## 10 2008-03-01 9.40
## 11 2008-04-01 9.40
## 12 2008-05-01 8.60
## 13 2008-06-01 8.20
## 14 2008-07-01 8.30
## 15 2008-08-01 8.00
## 16 2008-09-01 8.00
## 17 2008-10-01 7.70
## 18 2008-11-01 8.20
## 19 2008-12-01 7.10
## 20 2009-01-01 9.40
## 21 2009-02-01 10.00
## 22 2009-03-01 10.50
## 23 2009-04-01 10.20
## 24 2009-05-01 10.20
## 25 2009-06-01 9.00
## 26 2009-07-01 8.90
## 27 2009-08-01 9.10
## 28 2009-09-01 8.70
## 29 2009-10-01 8.60
## 30 2009-11-01 8.10
## 31 2009-12-01 7.48
## 32 2010-01-01 8.04
## 33 2010-02-01 8.09
## 34 2010-03-01 8.15
## 35 2010-04-01 7.69
## 36 2010-05-01 7.76
## 37 2010-06-01 7.39
## 38 2010-07-01 7.16
## 39 2010-08-01 6.82
## 40 2010-09-01 6.31
## 41 2010-10-01 5.90
## 42 2010-11-01 5.54
## 43 2010-12-01 5.32
## 44 2011-01-01 6.03
## 45 2011-02-01 6.62
## 46 2011-03-01 6.94
## 47 2011-04-01 7.11
## 48 2011-05-01 6.74
## 49 2011-06-01 6.61
## 50 2011-07-01 6.45
## 51 2011-08-01 6.31
## 52 2011-09-01 6.07
## 53 2011-10-01 5.65
## 54 2011-11-01 4.95
## 55 2011-12-01 4.70
## 56 2012-01-01 5.50
## 57 2012-02-01 6.12
## 58 2012-03-01 6.53
## 59 2012-04-01 6.51
## 60 2012-05-01 6.17
## 61 2012-06-01 6.49
## 62 2012-07-01 5.73
## 63 2012-08-01 5.82
## 64 2012-09-01 6.51
## 65 2012-10-01 5.85
## 66 2012-11-01 5.46
## 67 2012-12-01 5.21
## 68 2013-01-01 6.42
## 69 2013-02-01 6.52
## 70 2013-03-01 6.33
## 71 2013-04-01 6.74
## 72 2013-05-01 6.35
## 73 2013-06-01 6.58
## 74 2013-07-01 5.76
## 75 2013-08-01 5.39
## 76 2013-09-01 5.79
## 77 2013-10-01 5.57
## 78 2013-11-01 4.68
## 79 2013-12-01 4.37
## 80 2014-01-01 5.03
## 81 2014-02-01 5.53
## 82 2014-03-01 5.75
## 83 2014-04-01 5.23
## 84 2014-05-01 5.10
## 85 2014-06-01 5.06
That's not normal… what is going on? Well, BETS has given us a list of two data frames, so we need to do a bit of data work to get it into a format we can use. I went ahead and picked a pair of time series that both have monthly observations.
BrasilDataRight <- data.frame(
date = BrasilData$ts_10782$date,
goodsandserv = BrasilData$ts_4345$value,
unemployment = BrasilData$ts_10782$value
)
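One caution: stacking the columns like this assumes both series cover exactly the same months in the same order, which happens to be true here. A slightly safer habit (just an alternative sketch using dplyr; the object name BrasilDataJoined is only for illustration) is to join the two series by date:
BrasilDataJoined <- inner_join(
  BrasilData$ts_4345 %>% rename(goodsandserv = value),
  BrasilData$ts_10782 %>% rename(unemployment = value),
  by = "date"
)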
#if the BETS download fails, you can read in the backup csv instead (this is my local path - swap in your own)
read_csv('/Users/connor/Dropbox/EC421_Lab/Week6/BrasilData_backup.csv')
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## goodsandserv = col_double(),
## unemployment = col_double()
## )
## # A tibble: 85 x 3
## date goodsandserv unemployment
## <date> <dbl> <dbl>
## 1 2007-06-01 5088500 10.2
## 2 2007-07-01 5165247 10.3
## 3 2007-08-01 5303067 10.1
## 4 2007-09-01 5960794 9.4
## 5 2007-10-01 5735314 9.5
## 6 2007-11-01 5652123 8.8
## 7 2007-12-01 5761731 8
## 8 2008-01-01 6003361 8.6
## 9 2008-02-01 5588422 9.3
## 10 2008-03-01 5405355 9.4
## # … with 75 more rows
#We are going to change our index to be one by date.
rownames(BrasilDataRight) <- as.Date(BrasilDataRight$date)
Let’s start ourselves off with a nice, static regression. What function do we use to get a regression? lm().
sreg <-lm(data = BrasilDataRight, goodsandserv ~ unemployment)
Now, we use summary() to get our regression table.
#summary to get a table
summary(sreg)
##
## Call:
## lm(formula = goodsandserv ~ unemployment, data = BrasilDataRight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1529846 -518862 -75421 267581 4929845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14579533 475340 30.67 <2e-16 ***
## unemployment -910994 64660 -14.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 958900 on 83 degrees of freedom
## Multiple R-squared: 0.7052, Adjusted R-squared: 0.7016
## F-statistic: 198.5 on 1 and 83 DF, p-value: < 2.2e-16
We can pull out r.squared to see how much of the variation is explained by the static model.
#Does anyone remember how to find r-squared?
summary(sreg)$r.squared
## [1] 0.7051537
Ok, let's compare our regression to one with a few lagged terms. Luckily, in R, including lags is pretty simple if you know how to run a regression. You can simply use the lag() function (the dplyr version, which the tidyverse loads for us), which takes two arguments: a vector and a lag count.
print(1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
print(lag(1:10,1))
## [1] NA 1 2 3 4 5 6 7 8 9
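And a lag count of 2 just shifts everything back one more place (continuing the same dplyr::lag() demo):
print(lag(1:10, 2))
## [1] NA NA 1 2 3 4 5 6 7 8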
This lets us quickly run dynamic regressions. Let's do that with our Brasil data.
reg_2lag <- lm(data=BrasilDataRight, goodsandserv ~ unemployment + lag(unemployment, 1) + lag(unemployment, 2))
summary(reg_2lag)
##
## Call:
## lm(formula = goodsandserv ~ unemployment + lag(unemployment,
## 1) + lag(unemployment, 2), data = BrasilDataRight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1650060 -472476 -78361 266854 4356664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14877846 492466 30.211 <2e-16 ***
## unemployment -624875 194474 -3.213 0.0019 **
## lag(unemployment, 1) 124882 272751 0.458 0.6483
## lag(unemployment, 2) -445918 191798 -2.325 0.0226 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 934600 on 79 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7142, Adjusted R-squared: 0.7033
## F-statistic: 65.79 on 3 and 79 DF, p-value: < 2.2e-16
Let's test these two lags jointly, to see if they are important. What kind of test is this? A Wald (F) test of joint significance.
waldtest(reg_2lag, c("lag(unemployment, 1)", "lag(unemployment, 2)")) %>% tidy()
## Warning: 'tidy.numeric' is deprecated.
## See help("Deprecated")
## # A tibble: 6 x 2
## names x
## <chr> <dbl>
## 1 p 0.0156
## 2 chi2 8.33
## 3 df1 2
## 4 p.F 0.0191
## 5 F 4.16
## 6 df2 79
Let’s also run our regression with a lagged dependent variable and see if we gain any explanatory power. Where would you look? … R.squared!
reg_ar1 <- lm(goodsandserv ~ lag(goodsandserv,1) + unemployment, data = BrasilDataRight)
summary(reg_ar1)
##
## Call:
## lm(formula = goodsandserv ~ lag(goodsandserv, 1) + unemployment,
## data = BrasilDataRight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2144722 -407232 -17735 212607 5179755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.123e+06 1.492e+06 6.116 3.22e-08 ***
## lag(goodsandserv, 1) 3.756e-01 9.823e-02 3.824 0.000257 ***
## unemployment -5.683e+05 1.081e+05 -5.258 1.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 893100 on 81 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7415, Adjusted R-squared: 0.7352
## F-statistic: 116.2 on 2 and 81 DF, p-value: < 2.2e-16
sumobjar1 <- summary(reg_ar1)
sumobjar1$r.squared
## [1] 0.741549
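To make the comparison easier, we can line up all three R-squared values in one small table (just a convenience, reusing the model objects we already created):
data.frame(
  model = c("static", "2 lags of unemployment", "lagged dependent variable"),
  r_squared = c(summary(sreg)$r.squared,
                summary(reg_2lag)$r.squared,
                summary(reg_ar1)$r.squared)
)
The lagged-dependent-variable model explains the most variation of the three: roughly 0.74, versus 0.71 for the static model.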
Now we need to plot values across time.
Well, normally we'd reach for our new package… lubridate! It lets us cast dates that arrive as strings into a proper Date class so we can work with them in time series. Or we would… but BETS is nice to us and already hands us a Date class, meaning we can pull lots of information from our date column. That includes year, month, day (and hour, in some cases).
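For instance, lubridate's year() and month() helpers pull those components straight out of the date column (a quick illustration; we don't actually need them for the plot below):
head(year(BrasilDataRight$date)) #2007 2007 2007 2007 2007 2007
head(month(BrasilDataRight$date)) #6 7 8 9 10 11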
ggplot(data = BrasilDataRight, aes(x = date, y = goodsandserv)) +
geom_line(size = .2) +
geom_point(size = .4, col = 'red') +
#geom_hline(yintercept = 0) +
xlab("Time")+
ylab("Tax Revenue") +
theme_pander()
Uh oh… that looks a bit unstable… maybe we can do something about this…
What could we do if, say, we had a unit root (which is my first guess)? We can take the first difference:
BrasilDataRight$goodsandserv_fd = BrasilDataRight$goodsandserv - lag(BrasilDataRight$goodsandserv,1)
Now, we can draw the graph again.
ggplot(data = BrasilDataRight, aes(x = date, y = goodsandserv_fd)) +
geom_line(size = .2) +
geom_point(size = .4, col = 'red') +
geom_hline(yintercept = 0) +
xlab("Time")+
ylab("Tax Revenue Difference") +
theme_pander()
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
Oh yeah, that’s what I’m talking about! What might concern us here though?
We might have an issue with variance stationarity: right in the middle of our series, the size of the swings explodes. We'll get to that.
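One standard trick for that kind of variance problem (just a sketch of an option, not something we'll pursue here) is to difference the log of the series instead, which puts the changes in approximate percentage-growth terms and tends to tame swings that scale with the level of the series:
#log first-difference: roughly the monthly growth rate of tax revenue
BrasilDataRight$goodsandserv_lfd <- log(BrasilDataRight$goodsandserv) -
  lag(log(BrasilDataRight$goodsandserv), 1)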
What about our other variable?
ggplot(data = BrasilDataRight, aes(x = date, y = unemployment)) +
geom_line(size = .2) +
geom_point(size = .4, col = 'red') +
geom_hline(yintercept = 0) +
xlab("Time")+
ylab("Unemployment, percentage points") +
theme_pander()
Crap. This is a problem in the opposite direction. I mean, good for Brasil, but it means we need to change some things around. We can do what we did before:
BrasilDataRight$unemployment_fd <- BrasilDataRight$unemployment - lag(BrasilDataRight$unemployment,1)
Now we plot again,
ggplot(data = BrasilDataRight, aes(x = date, y = unemployment_fd)) +
geom_line(size = .2) +
geom_point(size = .4, col = 'red') +
#geom_hline(yintercept = 0) +
xlab("Time")+
ylab("Unemployment, percentage points difference") +
geom_hline(yintercept = 0) +
theme_pander()
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
Now let's check for autocorrelation here.
First off, let's remind ourselves why we care about autocorrelation at all.
Well, for one, if autocorrelation exists, our standard errors will be BIASED and OLS will be inefficient.
But, how do we examine this problem? Much like heteroskedasticity, it requires we retrieve and examine our errors. Let’s run our first regression again and examine our errors compared to time.
reg_ols <- lm(goodsandserv ~ unemployment, data = BrasilDataRight)
BrasilDataRight$e_frst <- residuals(reg_ols)
Now, there are a couple of ways we can dig into these. The first is to graph our errors against time, and the second is to graph the errors against their own lags.
ggplot(data = BrasilDataRight, aes(x = date, y = e_frst)) +
geom_path(size = .1) +
geom_point(size = .2, col = 'red') +
xlab("Time") +
ylab("residual, static ols") +
theme_pander()
Ok. Some concerning patterns here. Let's look at the errors compared to their lags. What would we look for here? Errors that closely track the errors that preceded them; that would point to autocorrelation.
ggplot(data = BrasilDataRight, aes(x = lag(e_frst), y = e_frst)) +
geom_point(size = .5, col = 'red') +
geom_smooth(method = lm, se = FALSE, alpha = .2, col = 'black') +
xlab("lagged residual, static ols") +
ylab("residual, static ols") +
theme_pander()
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
Ohh boy… that looks like the definition of positive correlation. I should put that on Wikipedia.
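If you want a number to go with the picture, the sample correlation between the residual and its own lag does the job:
cor(BrasilDataRight$e_frst, lag(BrasilDataRight$e_frst, 1), use = "complete.obs")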
Let’s look at our more dynamic models with lagged terms and see what happens. In particular, I’d like to look at the 2-lags unemployment model and see what we get here.
First we need our residuals. There’s a complicating factor here…
reg_2lag <- lm(goodsandserv ~ unemployment + lag(unemployment, 1) + lag(unemployment,2), data = BrasilDataRight)
#this next line throws an error: residuals(reg_2lag) has 83 values, but our dataframe has 85 rows
BrasilDataRight$e_scnd <- residuals(reg_2lag)
What? Why is R mad? Well… remember when we simulated data, we needed to 'rev up' our process; the same is true with real data. If you include lags, by definition you lose some of your earliest data points as your lags get longer, so the regression only produces 83 residuals for our 85 rows. So how do we fix this? We need to tack missing values onto the beginning, which R represents with NA.
BrasilDataRight$e_scnd <- c(NA,NA,residuals(reg_2lag))
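For what it's worth, base R can do this padding for us: refit the model with na.action = na.exclude and residuals() comes back already padded with NAs for the dropped rows (an alternative to the manual fix above; reg_2lag_ex and e_scnd_alt are just illustrative names):
reg_2lag_ex <- lm(goodsandserv ~ unemployment + lag(unemployment, 1) + lag(unemployment, 2),
                  data = BrasilDataRight, na.action = na.exclude)
e_scnd_alt <- residuals(reg_2lag_ex) #length 85, with NAs in the first two slots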
Nice. Now, we can do our autocorrelation graphs again. Let’s just do the second one to speed things up.
ggplot(data = BrasilDataRight, aes(x = lag(e_scnd), y = e_scnd)) +
geom_point(size = .5, col = 'red') +
geom_smooth(method = lm, se = FALSE, alpha = .2, col = 'black') +
xlab("lagged residual, 2-lags") +
ylab("residual, 2 lags") +
theme_pander()
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
Seems like the same problem… We can test this though with the Breusch-Godfrey test! Let's walk through how to do this, using our 2-lag model as an example.
We need to follow a bunch of steps. Again.
1.) Decide on an ORDER to test. This means picking the level at which you are checking for AC: first order includes one lag of the residuals, second or third order includes two or three lags. Let's test first- and second-order AC.
2.) Run your regression. We did this already.
3.) Recover your residuals. We also did this! Just above.
4.) Regress your residuals on your explanatory variables plus a lag of the residuals.
bg_1 <- lm(e_scnd ~ unemployment + lag(unemployment, 1) + lag(unemployment, 2) + lag(e_scnd, 1), data = BrasilDataRight)
If we wanted second order, we'd add lag(e_scnd, 2). Let's do that too.
bg_2 <- lm(e_scnd ~ unemployment + lag(unemployment, 1) + lag(unemployment, 2) + lag(e_scnd, 1) +
lag(e_scnd, 2), data = BrasilDataRight)
5.) Run a Wald test (F-test) for the joint significance of the lagged residual terms.
#lets look at first order
waldtest(bg_1, c("lag(e_scnd, 1)")) %>% tidy()
## Warning: 'tidy.numeric' is deprecated.
## See help("Deprecated")
## # A tibble: 6 x 2
## names x
## <chr> <dbl>
## 1 p 0.00754
## 2 chi2 7.14
## 3 df1 1
## 4 p.F 0.00920
## 5 F 7.14
## 6 df2 77
And second order now - this time we test both lagged residual terms jointly.
#second order
waldtest(bg_2, c("lag(e_scnd, 1)", "lag(e_scnd, 2)")) %>% tidy()
## Warning: 'tidy.numeric' is deprecated.
## See help("Deprecated")
## # A tibble: 6 x 2
## names x
## <chr> <dbl>
## 1 p 0.00521
## 2 chi2 10.5
## 3 df1 2
## 4 p.F 0.00730
## 5 F 5.26
## 6 df2 75
6.) Now what? We need to state our H0 and Ha.
We have H0: no autocorrelation, and Ha: autocorrelation. Can we reject H0? Yes, on both fronts: the p-values are well below 0.05 at both orders, so we reject the null of no autocorrelation in both cases.
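For reference, there is also a canned version of this whole procedure: bgtest() from the lmtest package runs the Breusch-Godfrey test in one line (you would need to p_load(lmtest) first; shown only as a cross-check, since doing it by hand is the point of the exercise):
p_load(lmtest)
bgtest(reg_2lag, order = 2) #second-order Breusch-Godfrey test on our 2-lag model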
Well that’s it for today! I hope you have a great week and we’ll see you next week for some problemset help!