#Functions and Time Series Standard workflow:

library(pacman)
p_load(tidyverse, ggthemes)

Lesson 1: Function Basics

What are functions in general ( think about math!). Something like this: f(x)= x^2 This function takes some input (a number), and gives us an output (in this case, the squared number, which is indeed also a number). A function create a mapping of links from one object to another. This doesn’t have to be two numbers, it could be a word and a dataframe, or two words. For instance, I could come up with a function called color labeler and then pass it some object, which it will map to a color. e.g. color_labeler(banana) = yellow

So how do we make a function in R? Recall: * Everything in R has a name, and everything is an object * Functions have inputs/arguments and outputs * His name is Robert Paulson

Let’s write a function that will take some x, and spit out x^2 + 10. That is, let’s write the code for f(x)=x^2+10. In general, this function looks like function([some_set_of_arguments]){your function code}. First notice that function() is a special operator that takes any arguments you want in the parentheses, and then lets you manipulate them in any way you see fit.

squarePlusten = function(x){
  #tell squarePlusten what to do. x is an input here, we can tell our function to transform our variable into
  #something else.
  x_squaredten = x^2+10
  #Now, in order to make use of this value, we need our function to spit something out. 
  #We do this with another special function, 'return()'. 
  return(x_squaredten)
}

If our function was “crated” then you will see it in your Global Environment (top right pane for most of you). Notice however: x_squaredten isn’t equal to anything now that our function has been run. That’s because x_squaredten is only defined in the context of your written function. To see for yourself try x_squaredten into your console:

So lets plug in a value for our argument, x, and see if our function works. We will check for 10 that indeed is equal to 110.

squarePlusten(10)
## [1] 110

A couple things to note: - Brackets {} around functions lets R know where the function starts and ends. Though, technically we do not need them for one line functions. - Return: this tells the function what it should return for us. We don’t hold onto any temporary objects created while your function runs, though the function can modify objects outside of itself.

Let’s look at another example. We will write a function that 1) takes a vector, 2) squares all the elements, and then 3) returns a plot.

So, what should our code do? 1) Take a vector, say c(1,2,3) 2) Square each element of the vector–note that we will need a way to store these results. 3) Plot

We can use vectors with the c() command. Lets store them in a vector called x: x = c(1,2,3)

Now, I want a function to return a plot with all of the (x,x^2 = y) pairs affiliated.

We can use a special function called sapply() which lets us perform a function a whole bunch of times. sapply(some_vector,function) will take every element in the vector and apply your function to it.

So we build a vector, as we discussed above

x = c(1,2,3)

We’ll use sapply. Wirhin sapply we define the function we want sapply to apply to our vector. Becasue function(x) x^2 is a one line function we don’t need the {} as discussed above.

plot_sq = function(x){
  #now we can use the sapply function as discussed above.
  y = sapply(x, function(x) x^2)
  #return the plot
  return(plot(x=x,y=y))
}
#check that it works
plot_sq(1:4)

Kind of an ugly plot (by our standards), butw e just built our own plotting function! Pretty exciting.

Lesson 2: For Loops

A for-loop has two parts: 1) a header specifying the iteration, and 2) a body which is executed once per iteration.

First we will build an empty vector x, which we will use a for loop to fill in. For-loop’s need empty containers to put the output of the iterations. Creating an empty container is an essential step in running a for-loop:

x=c()

For this loop, we’re saying that i is an object in 1-10. So start the first iteration of the loop where i=1. Then the next time throught the loop, the next iteration, i=2, and so forth all the way up to i=10. In this example, the body of the for-loop divide 10 by i.

for (i in 1:10){
  #dividing 10 by the numbers 1 through 10.
  x[i]= 10/i
}

Let’s look at what this makes

x
##  [1] 10.000000  5.000000  3.333333  2.500000  2.000000  1.666667  1.428571
##  [8]  1.250000  1.111111  1.000000

i is, for each loop, storing the value in your sequence (here, it’s a number in 1:10) then performing the operation you defined in the loop. Like a function, a for loop defines its “body” by setting the start point with a { and an end point with a }.

Really though, we can loop over any object we want. Maybe we want to loop over some weird sequence, say c(2,300,-4,6)

z = 0 #create an inital value for z
for (i in c(2,300,-4,6)) {
  print(paste0(i, ' plus ', z, ' equals')) #paste0 is a helpful way to repeatedly create text
  z = z + i
  print(z) #`print` is a slightly different version of `return()` link for more info (https://stackoverflow.com/questions/35773027/difference-of-print-and-return-in-r)
}
## [1] "2 plus 0 equals"
## [1] 2
## [1] "300 plus 2 equals"
## [1] 302
## [1] "-4 plus 302 equals"
## [1] 298
## [1] "6 plus 298 equals"
## [1] 304

Lets rewrite our function plot_sq with a for loop inside it

plot_sqf = function(x){
  #initialize y:
  y=c()
  #start our for loop. We're looping over every object in the vector x, so we want to iterate a number of times
  #equal to the length of x so we don't miss any.
  for (i in 1:length(x)) {
    #for each i in the vector i:length, update the ith value in y with:
    y[i] = x[i]^2
  }
  #return the plot
  return(plot(x=x,y=y))
}

Plot squared values between 1-10

plot_sqf(1:10)

So what exactly is happening in the loop? In for (i in 1:length(x)): i does what is called “indexing.” length(x) gives the # of rows x has. y[i] = x[i]^2 remember how we index arrays in R: row by column. Thus for each i=1,2.., the iterator fills row i in y with the squared element of the ith row of x.

Simulating time series data

Simulations allow us to mess around with model parameters when we know we don’t need to worry about data, because we generated it! So we will write a function that simulates the model:

y_t = alpha_o+alpha_1 * y_{t-1} + u_t, u_t ~ N(0, sigmasq)

1.) Draw T values of u, where T is the number of periods we simulate 2.) Set some starting point for y 3.) Generate y data with a for loop, which lets us keep its time-dependent status.

We set up a function that takes an initial value for y, some alphas, a std deviation parameter and a number for our total number of observations.

So the LHS y_t is always one period ahead of y_{t-1} on the RHS. If our LHS y_t starts at 1 then it makes sense that y_{t-1} starts at y_0: y[1] = a_0+y[0]+u[1]

HOWEVER: What is y[0]? R has no idea. R starts indexing at 1. This is unusual for coding languages and can trip you up if you don’t keep it in mind. So, y[0] is not defined.

Below is a model that generates data according to the model we talked about above:

ts_model = function(y_init,a_0,a_1, sigma, T){
  #our error
  u = rnorm(T,mean =0, sd=sigma )
  y = c()
  #set first observation to our provided level
  y[1]= y_init
  #loop through with for
  for (i in 2:T){
    y[i] = a_0+a_1*y[i-1]+u[i]
  }
  
  #let's create a dataframe, declare 3 variables. We haven't done this before, but we can use the data.frame() function to create a dataframe, where each column is separated by columns.
  ts_data = data.frame(
    time = c(1:T), 
    model =  y)
  
  return(ts_data)
}

Lets define some parameter values and run the data generating process (DGP). Notice that if you keep the order of the arguments the same you do not have to specify what paramter is what value. However we are telling R that y_init=3, a_0=.39 and so on:

ts_model(3,.39, .87, .04, 100)
##     time    model
## 1      1 3.000000
## 2      2 2.994076
## 3      3 2.979039
## 4      4 2.936417
## 5      5 2.854389
## 6      6 2.902023
## 7      7 2.852367
## 8      8 2.926708
## 9      9 2.995868
## 10    10 3.043410
## 11    11 2.955877
## 12    12 2.985267
## 13    13 3.020942
## 14    14 3.058620
## 15    15 3.062098
## 16    16 3.069534
## 17    17 2.981437
## 18    18 3.008510
## 19    19 2.958471
## 20    20 2.944528
## 21    21 2.908073
## 22    22 2.986782
## 23    23 2.999835
## 24    24 3.042120
## 25    25 3.040451
## 26    26 3.039758
## 27    27 2.982901
## 28    28 3.044498
## 29    29 3.032468
## 30    30 2.986228
## 31    31 3.036883
## 32    32 3.040988
## 33    33 3.066551
## 34    34 3.055451
## 35    35 2.983916
## 36    36 2.995108
## 37    37 2.990678
## 38    38 3.096156
## 39    39 3.109378
## 40    40 3.043441
## 41    41 3.084980
## 42    42 3.082107
## 43    43 3.029515
## 44    44 3.048117
## 45    45 3.009012
## 46    46 2.997732
## 47    47 3.017906
## 48    48 3.047161
## 49    49 3.039623
## 50    50 3.045097
## 51    51 3.053270
## 52    52 3.055241
## 53    53 3.047749
## 54    54 3.037017
## 55    55 3.035404
## 56    56 2.980629
## 57    57 2.965303
## 58    58 2.969624
## 59    59 2.940868
## 60    60 2.939316
## 61    61 2.880721
## 62    62 2.878654
## 63    63 2.899875
## 64    64 2.925768
## 65    65 2.954459
## 66    66 2.931753
## 67    67 2.911055
## 68    68 2.916816
## 69    69 2.969335
## 70    70 3.001056
## 71    71 2.939281
## 72    72 2.966355
## 73    73 2.914203
## 74    74 2.899922
## 75    75 2.947182
## 76    76 2.972769
## 77    77 2.931330
## 78    78 2.952770
## 79    79 3.005071
## 80    80 3.000932
## 81    81 2.978481
## 82    82 3.016739
## 83    83 2.963606
## 84    84 2.983688
## 85    85 3.060414
## 86    86 3.016845
## 87    87 3.027310
## 88    88 3.023361
## 89    89 2.989610
## 90    90 3.024375
## 91    91 3.018432
## 92    92 3.017387
## 93    93 3.038141
## 94    94 3.057222
## 95    95 3.049903
## 96    96 3.009661
## 97    97 2.964523
## 98    98 2.935390
## 99    99 2.979341
## 100  100 3.011598

We can save this output as an object–we have a homemade dataset!!

data<- ts_model(3,.39, .87, .04, 100)

Now, using ggplot, which we talked about, we can pass the ts_model2() function we made, and generate a graph of the process!

ggplot(aes(x = time, y = model), data = ts_model(3,.39, .87, .04, 100)) +  
  geom_line(col = 'purple') +
  theme_minimal()

Notice that we can have the DGP run inside the ggplot.

A quick reminder about AR models: Before you jump into plotting this, you need to be careful. What happens when \(|a_1|>=1\)? Lets chack it out by making a_1 equal to 1 in our ts_model():

ggplot(aes(x = time, y = model), data = ts_model(4,.1, 1, 10, 100)) +  
  geom_line( col = 'purple') +
  theme_minimal()

This is also a problem when a = -1. It looks a LOT different though:

#make a_1 equal to -1 in ts_model()
ggplot(aes(x = time, y = model), data = ts_model(4,.1, -1, 10, 100)) +  
  geom_line( col = 'purple') +
  theme_minimal()

Remember: These processes are non-stationary. And, his name is Robert Paulson!

ADL Simulation

You guys learned about this in class, but what does ADL stand for? Autocorrelated Distributed Lag. Let’s walk through these terms. Autocorrelated: So in total, this means an ADL model is in some way self-related. Distributed Lag: This is describing some variable x where the impact of x on our outcome y happens over several periods

Let’s simulate an adl(1,1) model. That means one ‘lag’ of y and one ‘lag’ of x. Let’s write this out:

y_t = b0 + a1y_(t-1) + b1x_t + b2x_(t-1) + u_t

Now, we need to automate our process above, so we need to give our function ALL of the variables above as parameters. That means we need an initial value of y for y_0, all of our coefficients, a mean and standard error for x, a standard error for our error term, and some timeframe of interest, T. In total, that means we need 9 different values.

adl11 <- function(y_0,b_0,a1,b1,b2, sigma_u,mean_x, sigma_x, T){
  #draw values of shock, using our function parameter sigma_u
  u = rnorm(T,mean =0, sd=sigma_u)
  #set up y as an empty vector so we can start placing objects inside of it.
  y =c()
  #initialize y_0 with our user-provided y_0 parameter above
  y[1]= y_0
  #build X
  x = rnorm(T,mean = mean_x, sd = sigma_x)
  #We can use a for loop to build our set of ys
  #recall, i is the index, and will iterate (count) over our set which here is 2 through T (which we provided)
  for (i in 2:T){
    #for every i in the series from 2,3,...,T-1,T, generate a y dependent on last month's y and road expenditures
    #in the last two periods. Written out in r, where the 'i' in x[i] is calling the i'th object in vector x...
    y[i] = b_0+a1*y[i-1]+b1*x[i] + b2*x[i-1] + u[i]
  }
  #now that we have y, we need to return our dataframe
  adl_data = data.frame(
    #create a time vector <1,2,3,...,T-1,T>
    time =c(1:T), 
    y =  y,
    x = x)
  
  #in order to get this object BACK, let's return the created dataframe.
  return(adl_data)
}

Generally, when running a simulation you want to ‘set a seed.’ This forces the random draws you see to be reproducible by another researcher who is trying to replicate your work.

set.seed(42)

Let’s build a dataframe we can reference, using our adl11 function we just made:

adl <- adl11(y_0 = 60, b_0= 10,a1 = 5/6,b1= -1/2, b2 = 1/6, sigma_u =1,
             mean_x = 10, sigma_x = 2.5, T = 250)

Now, we can plot this dataframe using ggplot and layers.

ggplot(aes(x = x, y = y), data = adl) + 
  geom_line(col = 'purple') + 
  labs(xlab = 'Time', ylab = 'Value of Y_t') +
  theme_pander()